# 10 minutes to pandas
A study of pandas, a Python library for data processing.
 
[Reference: 10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

---
**Contents**
1. Object Creation
2. Viewing Data
3. Selection
4. Missing Data
5. Operation
6. Merge
7. Grouping
8. Reshaping
9. Time Series
10. Categoricals
11. Plotting
12. Getting Data In / Out
13. Gotchas

In [71]:
# Customarily, we import as follows
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 1. Object Creation

Creating a **Series** by passing a list of values, letting pandas create a default integer index.

In [72]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a **DataFrame** by passing a NumPy array, with a datetime index and labeled columns.

In [73]:
dates = pd.date_range("20210101", periods=6)
dates

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [74]:
df = pd.DataFrame(
    data=np.random.randn(6, 4),
    index=dates,
    columns=list("ABCD")
)
df

Unnamed: 0,A,B,C,D
2021-01-01,-0.986969,-0.457018,-0.096804,0.583139
2021-01-02,-0.227225,-0.766473,0.418936,1.895144
2021-01-03,-0.747426,0.680011,0.432155,-1.150434
2021-01-04,-0.70113,1.453221,-1.830686,2.957552
2021-01-05,0.515004,0.906651,-0.310801,0.291661
2021-01-06,1.657326,0.497257,0.051016,0.023969


Creating a **DataFrame** by passing a dict of objects that can be converted into a series-like structure.

In [75]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20210101"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2021-01-01,1.0,3,test,foo
1,1.0,2021-01-01,1.0,3,train,foo
2,1.0,2021-01-01,1.0,3,test,foo
3,1.0,2021-01-01,1.0,3,train,foo


The columns of the resulting **DataFrame** have different **dtypes**.

In [76]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## 2. Viewing Data

Here is how to view the top and bottom rows of the frame.

In [77]:
df.head()

Unnamed: 0,A,B,C,D
2021-01-01,-0.986969,-0.457018,-0.096804,0.583139
2021-01-02,-0.227225,-0.766473,0.418936,1.895144
2021-01-03,-0.747426,0.680011,0.432155,-1.150434
2021-01-04,-0.70113,1.453221,-1.830686,2.957552
2021-01-05,0.515004,0.906651,-0.310801,0.291661


In [78]:
df.tail(3)

Unnamed: 0,A,B,C,D
2021-01-04,-0.70113,1.453221,-1.830686,2.957552
2021-01-05,0.515004,0.906651,-0.310801,0.291661
2021-01-06,1.657326,0.497257,0.051016,0.023969


Display the index and columns.

In [79]:
df.index

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [80]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

`DataFrame.to_numpy()` gives a NumPy representation of the underlying data.

Note that this can be an expensive operation when your `DataFrame` has columns with different data types, which comes down to a fundamental difference between pandas and NumPy.

**NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.**

When you call `DataFrame.to_numpy()`, pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame.

This may end up being `object`, which requires casting every value to a Python object.

For `df`, our `DataFrame` of all floating-point values, `DataFrame.to_numpy()` is fast and doesn't require copying data.

In [81]:
df.to_numpy()

array([[-0.98696934, -0.45701765, -0.09680358,  0.58313889],
       [-0.22722541, -0.76647318,  0.41893611,  1.8951443 ],
       [-0.74742632,  0.68001133,  0.43215521, -1.15043415],
       [-0.70112961,  1.45322109, -1.83068576,  2.95755163],
       [ 0.51500444,  0.90665082, -0.31080093,  0.29166132],
       [ 1.6573259 ,  0.49725666,  0.05101612,  0.02396921]])

For `df2`, the `DataFrame` with multiple dtypes, `DataFrame.to_numpy()` is relatively expensive.

In [82]:
df2.to_numpy()

array([[1.0, Timestamp('2021-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2021-01-01 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2021-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2021-01-01 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

> **Note:** `DataFrame.to_numpy()` does not include the index or column labels in the output.
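
The dtype rule described above can be checked directly. The following is a minimal sketch using small, hypothetical frames (`floats`, `mixed_numeric`, and `mixed` are illustrative names, not part of the tutorial data): numeric columns are upcast to a common numeric dtype, and `object` is used only when no numeric dtype fits.

```python
import numpy as np
import pandas as pd

# All-float frame: the array keeps float64.
floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
print(floats.to_numpy().dtype)  # float64

# Integer + float columns: the common dtype is float64.
mixed_numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
print(mixed_numeric.to_numpy().dtype)  # float64

# Numeric + string columns: only object can hold both.
mixed = pd.DataFrame({"i": [1, 2], "s": ["a", "b"]})
print(mixed.to_numpy().dtype)  # object

# The index and column labels are not part of the array.
print(floats.to_numpy().shape)  # (2, 2)
```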

`describe()` shows a quick statistical summary of your data.

In [83]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.081737,0.385608,-0.222697,0.766839
std,1.004773,0.842346,0.839605,1.453246
min,-0.986969,-0.766473,-1.830686,-1.150434
25%,-0.735852,-0.218449,-0.257302,0.090892
50%,-0.464178,0.588634,-0.022894,0.4374
75%,0.329447,0.849991,0.326956,1.567143
max,1.657326,1.453221,0.432155,2.957552
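
Because `df` holds random values, the numbers above change on every run. As a small sketch with fixed data (the frame `small` is a hypothetical example), the rows returned by `describe()` can be read back by label:

```python
import pandas as pd

# Hypothetical one-column frame with known values.
small = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0]})
summary = small.describe()

# The summary is itself a DataFrame indexed by statistic name.
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

print(summary.loc["mean", "v"])  # 2.5
print(summary.loc["50%", "v"])   # 2.5 (the median)
```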


Transposing your data.

In [84]:
df.T

Unnamed: 0,2021-01-01,2021-01-02,2021-01-03,2021-01-04,2021-01-05,2021-01-06
A,-0.986969,-0.227225,-0.747426,-0.70113,0.515004,1.657326
B,-0.457018,-0.766473,0.680011,1.453221,0.906651,0.497257
C,-0.096804,0.418936,0.432155,-1.830686,-0.310801,0.051016
D,0.583139,1.895144,-1.150434,2.957552,0.291661,0.023969


Sorting by an axis.

In [85]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2021-01-01,0.583139,-0.096804,-0.457018,-0.986969
2021-01-02,1.895144,0.418936,-0.766473,-0.227225
2021-01-03,-1.150434,0.432155,0.680011,-0.747426
2021-01-04,2.957552,-1.830686,1.453221,-0.70113
2021-01-05,0.291661,-0.310801,0.906651,0.515004
2021-01-06,0.023969,0.051016,0.497257,1.657326


Sorting by values.

In [86]:
df.sort_values(by="B")

Unnamed: 0,A,B,C,D
2021-01-02,-0.227225,-0.766473,0.418936,1.895144
2021-01-01,-0.986969,-0.457018,-0.096804,0.583139
2021-01-06,1.657326,0.497257,0.051016,0.023969
2021-01-03,-0.747426,0.680011,0.432155,-1.150434
2021-01-05,0.515004,0.906651,-0.310801,0.291661
2021-01-04,-0.70113,1.453221,-1.830686,2.957552
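
The sorting calls above can be sketched on a small deterministic frame. The frame `scores` and its values are hypothetical; the example also shows `ascending=False` and sorting by more than one column, which the cells above do not cover.

```python
import pandas as pd

# Hypothetical frame with string labels so the row order is easy to follow.
scores = pd.DataFrame(
    {"B": [2, 1, 2], "A": [0.3, 0.2, 0.1]},
    index=["x", "y", "z"],
)

# Sort by B first, then break ties in B using A.
ordered = scores.sort_values(by=["B", "A"])
print(ordered.index.tolist())  # ['y', 'z', 'x']

# Sort by A from largest to smallest.
descending = scores.sort_values(by="A", ascending=False)
print(descending.index.tolist())  # ['x', 'y', 'z']

# axis=1 sorts the column labels rather than the rows.
print(scores.sort_index(axis=1).columns.tolist())  # ['A', 'B']
```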


## 3. Selection
> **Note:** While standard Python/NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: `.at`, `.iat`, `.loc` and `.iloc`.

### Getting
Selecting a single column, which yields a Series, equivalent to `df.A`.

In [87]:
df["A"]  # == df.A

2021-01-01   -0.986969
2021-01-02   -0.227225
2021-01-03   -0.747426
2021-01-04   -0.701130
2021-01-05    0.515004
2021-01-06    1.657326
Freq: D, Name: A, dtype: float64

Selecting via `[]`, which slices the rows.

In [88]:
df[0:3]

Unnamed: 0,A,B,C,D
2021-01-01,-0.986969,-0.457018,-0.096804,0.583139
2021-01-02,-0.227225,-0.766473,0.418936,1.895144
2021-01-03,-0.747426,0.680011,0.432155,-1.150434


In [89]:
df["20210101":"20210103"]

Unnamed: 0,A,B,C,D
2021-01-01,-0.986969,-0.457018,-0.096804,0.583139
2021-01-02,-0.227225,-0.766473,0.418936,1.895144
2021-01-03,-0.747426,0.680011,0.432155,-1.150434


### Selection by label
For getting a cross section using a label.

In [90]:
df.loc[dates[0]]

A   -0.986969
B   -0.457018
C   -0.096804
D    0.583139
Name: 2021-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label.

In [91]:
df.loc[:, ["A", "B"]]

Unnamed: 0,A,B
2021-01-01,-0.986969,-0.457018
2021-01-02,-0.227225,-0.766473
2021-01-03,-0.747426,0.680011
2021-01-04,-0.70113,1.453221
2021-01-05,0.515004,0.906651
2021-01-06,1.657326,0.497257


With label slicing, both endpoints are included.

In [92]:
df.loc["20210101":"20210103", ["A", "B"]]

Unnamed: 0,A,B
2021-01-01,-0.986969,-0.457018
2021-01-02,-0.227225,-0.766473
2021-01-03,-0.747426,0.680011
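
The inclusive endpoint is the main difference from positional slicing. A minimal sketch, assuming a hypothetical frame `frame` with the same kind of daily index as `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a daily DatetimeIndex, mirroring `df` above.
idx = pd.date_range("20210101", periods=6)
frame = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=["A", "B"])

# Label slice: the stop label 2021-01-03 is included.
by_label = frame.loc["20210101":"20210103"]
print(len(by_label))  # 3

# Position slice: the stop position 2 is excluded, as in plain Python.
by_position = frame.iloc[0:2]
print(len(by_position))  # 2
```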


Selecting a single row label reduces the dimensions of the returned object.

In [93]:
df.loc["20210101", ["A", "B"]]

A   -0.986969
B   -0.457018
Name: 2021-01-01 00:00:00, dtype: float64

For getting a scalar value.

In [94]:
df.loc[dates[0], "A"]

-0.9869693392403752

For getting fast access to a scalar (equivalent to the prior method).

In [95]:
df.at[dates[0], "A"]

-0.9869693392403752

### Selection by position

Selecting via the position of the passed integers.

In [96]:
df.iloc[3]

A   -0.701130
B    1.453221
C   -1.830686
D    2.957552
Name: 2021-01-04 00:00:00, dtype: float64

By lists of integer position locations, similar to the NumPy/Python style.

In [97]:
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2021-01-02,-0.227225,0.418936
2021-01-03,-0.747426,0.432155
2021-01-05,0.515004,-0.310801


For slicing rows explicitly.

In [98]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2021-01-02,-0.227225,-0.766473,0.418936,1.895144
2021-01-03,-0.747426,0.680011,0.432155,-1.150434


For slicing columns explicitly.

In [99]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2021-01-01,-0.457018,-0.096804
2021-01-02,-0.766473,0.418936
2021-01-03,0.680011,0.432155
2021-01-04,1.453221,-1.830686
2021-01-05,0.906651,-0.310801
2021-01-06,0.497257,0.051016


For getting a value explicitly.

In [100]:
df.iloc[1, 1]

-0.7664731754353645

For getting fast access to a scalar (equivalent to the prior method).

In [101]:
df.iat[1, 1]

-0.7664731754353645

### Boolean indexing

Using a single column's values to select data.

In [102]:
df[df["A"] > 0]

Unnamed: 0,A,B,C,D
2021-01-05,0.515004,0.906651,-0.310801,0.291661
2021-01-06,1.657326,0.497257,0.051016,0.023969


Selecting values from a DataFrame where a boolean condition is met.

In [103]:
df[df > 0]

Unnamed: 0,A,B,C,D
2021-01-01,,,,0.583139
2021-01-02,,,0.418936,1.895144
2021-01-03,,0.680011,0.432155,
2021-01-04,,1.453221,,2.957552
2021-01-05,0.515004,0.906651,,0.291661
2021-01-06,1.657326,0.497257,0.051016,0.023969
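
Indexing a frame with a boolean frame keeps the original shape and fills the positions where the condition is false with `NaN`. The sketch below, on a hypothetical frame `values`, suggests that this gives the same result as calling `DataFrame.where()` with the same condition.

```python
import pandas as pd

# Hypothetical frame with two positive and two negative values.
values = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

masked = values[values > 0]       # boolean-frame indexing
same = values.where(values > 0)   # explicit where()

# equals() treats NaN in matching positions as equal.
print(masked.equals(same))        # True
print(masked.isna().sum().sum())  # 2 values were replaced by NaN
```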


Using the `isin()` method for filtering.

In [104]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2

Unnamed: 0,A,B,C,D,E
2021-01-01,-0.986969,-0.457018,-0.096804,0.583139,one
2021-01-02,-0.227225,-0.766473,0.418936,1.895144,one
2021-01-03,-0.747426,0.680011,0.432155,-1.150434,two
2021-01-04,-0.70113,1.453221,-1.830686,2.957552,three
2021-01-05,0.515004,0.906651,-0.310801,0.291661,four
2021-01-06,1.657326,0.497257,0.051016,0.023969,three


In [105]:
df2[df2["E"].isin(["two", "four"])]

Unnamed: 0,A,B,C,D,E
2021-01-03,-0.747426,0.680011,0.432155,-1.150434,two
2021-01-05,0.515004,0.906651,-0.310801,0.291661,four
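
`isin()` returns a boolean Series, so it can also be inverted with `~` to exclude values. A short sketch on a hypothetical frame `labels`:

```python
import pandas as pd

# Hypothetical frame with a single label column.
labels = pd.DataFrame({"E": ["one", "two", "three", "four"]})

# Boolean mask: True where E is one of the listed values.
keep = labels["E"].isin(["two", "four"])

print(labels[keep]["E"].tolist())   # ['two', 'four']
print(labels[~keep]["E"].tolist())  # ['one', 'three']
```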


### Setting

Setting a new column automatically aligns the data by the indexes.

In [106]:
s1 = pd.Series(
    data=[1, 2, 3, 4, 5, 6], 
    index=pd.date_range("20210102", periods=6)
)
s1

2021-01-02    1
2021-01-03    2
2021-01-04    3
2021-01-05    4
2021-01-06    5
2021-01-07    6
Freq: D, dtype: int64

In [107]:
df["F"] = s1
df

Unnamed: 0,A,B,C,D,F
2021-01-01,-0.986969,-0.457018,-0.096804,0.583139,
2021-01-02,-0.227225,-0.766473,0.418936,1.895144,1.0
2021-01-03,-0.747426,0.680011,0.432155,-1.150434,2.0
2021-01-04,-0.70113,1.453221,-1.830686,2.957552,3.0
2021-01-05,0.515004,0.906651,-0.310801,0.291661,4.0
2021-01-06,1.657326,0.497257,0.051016,0.023969,5.0
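
The alignment rule can be seen more plainly with string labels. In this sketch (the names `frame` and `extra` are hypothetical), labels missing from the Series become `NaN`, and Series labels that are not in the frame's index are dropped.

```python
import pandas as pd

# Hypothetical frame indexed by 'a', 'b', 'c'.
frame = pd.DataFrame({"A": [10, 20, 30]}, index=["a", "b", "c"])

# The Series covers 'b', 'c', 'd': it lacks 'a' and has an extra 'd'.
extra = pd.Series([1, 2, 3], index=["b", "c", "d"])

frame["F"] = extra

print(frame.index.tolist())       # ['a', 'b', 'c'] ('d' is not added)
print(frame["F"].isna().tolist()) # [True, False, False]
print(frame.loc["b", "F"])        # 1.0 (upcast to float because of the NaN)
```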


Setting values by **label**.

In [108]:
df.at[dates[0], "A"] = 0

Setting values by **position**.

In [109]:
df.iat[0, 1] = 0

Setting by **assigning with a NumPy array**.

In [110]:
df.loc[:, "D"] = np.array([5] * len(df))

The result of the prior setting operations.

In [111]:
df

Unnamed: 0,A,B,C,D,F
2021-01-01,0.0,0.0,-0.096804,5,
2021-01-02,-0.227225,-0.766473,0.418936,5,1.0
2021-01-03,-0.747426,0.680011,0.432155,5,2.0
2021-01-04,-0.70113,1.453221,-1.830686,5,3.0
2021-01-05,0.515004,0.906651,-0.310801,5,4.0
2021-01-06,1.657326,0.497257,0.051016,5,5.0


A `where` operation with setting.

In [112]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D,F
2021-01-01,0.0,0.0,-0.096804,-5,
2021-01-02,-0.227225,-0.766473,-0.418936,-5,-1.0
2021-01-03,-0.747426,-0.680011,-0.432155,-5,-2.0
2021-01-04,-0.70113,-1.453221,-1.830686,-5,-3.0
2021-01-05,-0.515004,-0.906651,-0.310801,-5,-4.0
2021-01-06,-1.657326,-0.497257,-0.051016,-5,-5.0
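
Assigning through a boolean frame replaces only the positions where the condition holds. The following sketch, on a hypothetical frame `frame`, compares that assignment with the non-mutating `DataFrame.mask()`, which should produce the same values here.

```python
import pandas as pd

# Hypothetical frame with mixed signs.
frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# In-place: negate every positive value.
flipped = frame.copy()
flipped[flipped > 0] = -flipped

# Same idea without mutation: replace where the condition is True.
alt = frame.mask(frame > 0, -frame)

print(flipped["A"].tolist())  # [-1.0, -2.0]
print(flipped["B"].tolist())  # [-3.0, -4.0]
print(flipped.equals(alt))    # True
```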


## 4. Missing Data

## 5. Operation

## 6. Merge

## 7. Grouping

## 8. Reshaping

## 9. Time Series

## 10. Categoricals

## 11. Plotting

## 12. Getting Data In / Out

## 13. Gotchas