# Introduction to pandas

pandas is a data analysis library for Python. It can process large tabular datasets quickly, and it is handy for loading and preprocessing data before visualization.

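Its main features include, among others:

* Reading and writing CSV files
* Computing summary statistics
* Sorting
* Selecting data
* Selecting data by condition
* Dropping or interpolating missing values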
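As a rough orientation, the features listed above can be touched in a few lines. This is only a sketch: the CSV text, the column names `city` and `temp`, and the threshold are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file; the empty field becomes NaN.
csv_text = "city,temp\ntokyo,16.0\nbangkok,\ncanberra,13.0\n"
df = pd.read_csv(io.StringIO(csv_text))      # reading CSV

mean_temp = df["temp"].mean()                # statistics skip NaN by default
by_temp = df.sort_values(by="temp")          # sorting; NaN rows are placed last
warm = df[df["temp"] > 14]                   # selection by condition
dropped = df.dropna()                        # drop rows with missing values
filled = df["temp"].interpolate()            # linear interpolation of the gap

buffer = io.StringIO()
df.to_csv(buffer, index=False)               # writing CSV (here to a buffer)
```

Each call returns a new object, so `df` itself is left unchanged throughout.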

First, import the packages and check the version.

In [3]:
import numpy as np
import pandas as pd

In [4]:
pd.__version__

'1.4.2'

## Creating objects

pandas has two main data structures, the `DataFrame` and `Series` classes.

* `DataFrame`: a data table with rows and named columns.
* `Series`: a single column. A `DataFrame` consists of one or more `Series`, each paired with a name.
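The relationship between the two classes can be illustrated with a minimal sketch. The names `a`, `b`, and `frame` and the values are arbitrary choices for this example.

```python
import pandas as pd

# Two Series sharing the same row labels (arbitrary illustrative values).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([0.5, 1.5, 2.5], index=["x", "y", "z"])

# The dict keys become the column names of the DataFrame.
frame = pd.DataFrame({"a": a, "b": b})

# Selecting a single column gives back a Series named after that column.
col = frame["a"]
```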

### Creating a Series

#### From an array

In [5]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [6]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Row labels (the index) are added automatically, but they can also be specified explicitly.

In [10]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=[1 for _ in range(6)])

In [11]:
s

1    1.0
1    3.0
1    5.0
1    NaN
1    6.0
1    8.0
dtype: float64
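The cell above gives every row the same label, which shows that index labels do not have to be unique. A small sketch of the difference this makes for lookups follows; the string labels `"a"` to `"f"` are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

values = [1, 3, 5, np.nan, 6, 8]

# Distinct labels: looking up one label returns one scalar.
labeled = pd.Series(values, index=["a", "b", "c", "d", "e", "f"])
single = labeled.loc["c"]

# Repeated labels are allowed; a lookup then returns every matching row.
repeated = pd.Series(values, index=[1] * 6)
matches = repeated.loc[1]
```

Here `single` is a plain float, while `matches` is a six-element `Series`.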

#### From a dictionary

In [7]:
s = pd.Series({'japan': 'tokyo', 'thailand': 'bangkok', 'australia': 'canberra'})

In [8]:
s

japan           tokyo
thailand      bangkok
australia    canberra
dtype: object
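When a `Series` is built from a dictionary, the keys become the index labels, so values can be looked up by key. A brief sketch using the same data (the variable names are chosen for this example):

```python
import pandas as pd

capitals = pd.Series({"japan": "tokyo", "thailand": "bangkok", "australia": "canberra"})

keys = list(capitals.index)            # the dict keys, in insertion order
capital_of_japan = capitals["japan"]   # label-based lookup
has_france = "france" in capitals      # membership tests the index, not the values
has_tokyo = "tokyo" in capitals        # False: "tokyo" is a value, not a label
```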

### Creating a DataFrame

#### From arrays

In [5]:
dates = pd.date_range("20130101", periods=6)

In [6]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

In [8]:
df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


In [9]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
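In the constructor call above, scalar entries such as `1.0` and `"foo"` are repeated to match the length set by the array-like entries, and each column keeps its own dtype. A reduced sketch of that behavior, using a subset of the same columns (the name `frame` is chosen for this example):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "A": 1.0,                               # scalar, repeated for every row
        "D": np.array([3] * 4, dtype="int32"),  # array-like, fixes the length at 4
        "F": "foo",                             # scalar, repeated for every row
    }
)

n_rows = len(frame)
a_values = frame["A"].tolist()
d_dtype = frame["D"].dtype                      # the int32 dtype is preserved
```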

In [16]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")

In [19]:
california_housing_dataframe.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
| std | 2.005166 | 2.13734 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.79 | 33.93 | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.566375 | 119400.0 |
| 50% | -118.49 | 34.25 | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.5446 | 180400.0 |
| 75% | -118.0 | 37.72 | 37.0 | 3151.25 | 648.25 | 1721.0 | 605.25 | 4.767 | 265000.0 |
| max | -114.31 | 41.95 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


## Viewing data

In [11]:
index = pd.date_range("1/1/2000", periods=8)

In [13]:
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

In [20]:
df.head()

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 3.237676 | 0.101675 | -1.278389 |
| 2000-01-02 | 0.007413 | 0.829631 | -0.516682 |
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 |
| 2000-01-04 | -0.374651 | 0.794817 | -0.988649 |
| 2000-01-05 | 1.221676 | 2.585122 | 0.716088 |


In [21]:
df.tail(3)

| | A | B | C |
|---|---|---|---|
| 2000-01-06 | -0.769356 | 1.310351 | -2.035116 |
| 2000-01-07 | -1.315484 | -1.188941 | 1.491256 |
| 2000-01-08 | 0.714891 | 0.038984 | 0.228209 |


In [22]:
df.index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

In [23]:
df.columns

Index(['A', 'B', 'C'], dtype='object')

In [24]:
df.to_numpy()

array([[ 3.23767637,  0.10167539, -1.27838862],
       [ 0.00741286,  0.82963085, -0.51668194],
       [ 0.04662866,  0.11635116,  0.8956604 ],
       [-0.37465139,  0.79481666, -0.98864929],
       [ 1.22167554,  2.58512182,  0.71608784],
       [-0.76935644,  1.3103514 , -2.03511631],
       [-1.31548413, -1.18894122,  1.49125608],
       [ 0.71489096,  0.03898432,  0.22820928]])

In [25]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
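The two `to_numpy()` results above differ in dtype: a NumPy array holds a single dtype, so a frame whose columns are all floats converts to a float array, while a frame with mixed column types falls back to `object`. A small sketch of this, with made-up frames named `floats` and `mixed`:

```python
import numpy as np
import pandas as pd

floats = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})

float_array = floats.to_numpy()   # all columns are float64, so the array is too
mixed_array = mixed.to_numpy()    # no common numeric dtype, so the array is object
```

Object arrays lose most of NumPy's speed advantages, which is one reason to select numeric columns before converting.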

In [26]:
df.describe()

| | A | B | C |
|---|---|---|---|
| count | 8.0 | 8.0 | 8.0 |
| mean | 0.346099 | 0.573499 | -0.185953 |
| std | 1.413852 | 1.103738 | 1.215063 |
| min | -1.315484 | -1.188941 | -2.035116 |
| 25% | -0.473328 | 0.086003 | -1.061084 |
| 50% | 0.027021 | 0.455584 | -0.144236 |
| 75% | 0.841587 | 0.949811 | 0.760981 |
| max | 3.237676 | 2.585122 | 1.491256 |


In [28]:
df.T

| | 2000-01-01 | 2000-01-02 | 2000-01-03 | 2000-01-04 | 2000-01-05 | 2000-01-06 | 2000-01-07 | 2000-01-08 |
|---|---|---|---|---|---|---|---|---|
| A | 3.237676 | 0.007413 | 0.046629 | -0.374651 | 1.221676 | -0.769356 | -1.315484 | 0.714891 |
| B | 0.101675 | 0.829631 | 0.116351 | 0.794817 | 2.585122 | 1.310351 | -1.188941 | 0.038984 |
| C | -1.278389 | -0.516682 | 0.89566 | -0.988649 | 0.716088 | -2.035116 | 1.491256 | 0.228209 |


In [31]:
df.sort_index(axis=1, ascending=False)

| | C | B | A |
|---|---|---|---|
| 2000-01-01 | -1.278389 | 0.101675 | 3.237676 |
| 2000-01-02 | -0.516682 | 0.829631 | 0.007413 |
| 2000-01-03 | 0.89566 | 0.116351 | 0.046629 |
| 2000-01-04 | -0.988649 | 0.794817 | -0.374651 |
| 2000-01-05 | 0.716088 | 2.585122 | 1.221676 |
| 2000-01-06 | -2.035116 | 1.310351 | -0.769356 |
| 2000-01-07 | 1.491256 | -1.188941 | -1.315484 |
| 2000-01-08 | 0.228209 | 0.038984 | 0.714891 |


In [32]:
df.sort_values(by="C")

| | A | B | C |
|---|---|---|---|
| 2000-01-06 | -0.769356 | 1.310351 | -2.035116 |
| 2000-01-01 | 3.237676 | 0.101675 | -1.278389 |
| 2000-01-04 | -0.374651 | 0.794817 | -0.988649 |
| 2000-01-02 | 0.007413 | 0.829631 | -0.516682 |
| 2000-01-08 | 0.714891 | 0.038984 | 0.228209 |
| 2000-01-05 | 1.221676 | 2.585122 | 0.716088 |
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 |
| 2000-01-07 | -1.315484 | -1.188941 | 1.491256 |
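The two sorting calls above do different things: `sort_index(axis=1, ...)` reorders the column labels, while `sort_values(by=...)` reorders the rows by the values in one column. A compact sketch on a small made-up frame (labels `r1` to `r3` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 3], "C": [0.5, -1.0, 0.0]}, index=["r1", "r2", "r3"])

# axis=1 sorts the column labels; ascending=False gives C before A.
cols_reversed = df.sort_index(axis=1, ascending=False)

# Rows ordered by the values in column C, largest first.
by_c_desc = df.sort_values(by="C", ascending=False)
```

Both methods return a new `DataFrame` by default and leave `df` in its original order.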


## Selecting data

### Getting

In [33]:
df["A"]

2000-01-01    3.237676
2000-01-02    0.007413
2000-01-03    0.046629
2000-01-04   -0.374651
2000-01-05    1.221676
2000-01-06   -0.769356
2000-01-07   -1.315484
2000-01-08    0.714891
Freq: D, Name: A, dtype: float64

In [34]:
df[0:3]

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 3.237676 | 0.101675 | -1.278389 |
| 2000-01-02 | 0.007413 | 0.829631 | -0.516682 |
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 |


In [36]:
df["20000102":"20000104"]

| | A | B | C |
|---|---|---|---|
| 2000-01-02 | 0.007413 | 0.829631 | -0.516682 |
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 |
| 2000-01-04 | -0.374651 | 0.794817 | -0.988649 |
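Note the difference between the two slices above: an integer slice follows Python's convention and excludes the stop position, while a label slice includes both endpoints. The sketch below rebuilds a frame with the same index but deterministic values, and uses `.loc` for the label slice, which follows the same inclusive rule.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=8)
df = pd.DataFrame(np.arange(24).reshape(8, 3), index=index, columns=["A", "B", "C"])

by_position = df[0:3]                        # rows 0, 1, 2; the stop is excluded
by_label = df.loc["20000102":"20000104"]     # Jan 2 through Jan 4; both ends included
```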


### Selection by label

In [38]:
df.loc[index[0]]

A    3.237676
B    0.101675
C   -1.278389
Name: 2000-01-01 00:00:00, dtype: float64

In [39]:
df.loc[:, ["A", "C"]]

| | A | C |
|---|---|---|
| 2000-01-01 | 3.237676 | -1.278389 |
| 2000-01-02 | 0.007413 | -0.516682 |
| 2000-01-03 | 0.046629 | 0.89566 |
| 2000-01-04 | -0.374651 | -0.988649 |
| 2000-01-05 | 1.221676 | 0.716088 |
| 2000-01-06 | -0.769356 | -2.035116 |
| 2000-01-07 | -1.315484 | 1.491256 |
| 2000-01-08 | 0.714891 | 0.228209 |


In [40]:
df.loc["20000103":"20000107", ["A", "B"]]

| | A | B |
|---|---|---|
| 2000-01-03 | 0.046629 | 0.116351 |
| 2000-01-04 | -0.374651 | 0.794817 |
| 2000-01-05 | 1.221676 | 2.585122 |
| 2000-01-06 | -0.769356 | 1.310351 |
| 2000-01-07 | -1.315484 | -1.188941 |


In [42]:
df.loc["20000106", ["A", "B"]]

A   -0.769356
B    1.310351
Name: 2000-01-06 00:00:00, dtype: float64

In [43]:
df.loc[index[0], "A"]

3.23767637008419

In [44]:
df.at[index[0], "A"]

3.23767637008419

### Selection by position

In [45]:
df.iloc[3]

A   -0.374651
B    0.794817
C   -0.988649
Name: 2000-01-04 00:00:00, dtype: float64

In [46]:
df.iloc[3:5, 0:2]

| | A | B |
|---|---|---|
| 2000-01-04 | -0.374651 | 0.794817 |
| 2000-01-05 | 1.221676 | 2.585122 |


In [47]:
df.iloc[[1, 2, 4], [0, 2]]

| | A | C |
|---|---|---|
| 2000-01-02 | 0.007413 | -0.516682 |
| 2000-01-03 | 0.046629 | 0.89566 |
| 2000-01-05 | 1.221676 | 0.716088 |


In [48]:
df.iloc[1:3, :]

| | A | B | C |
|---|---|---|---|
| 2000-01-02 | 0.007413 | 0.829631 | -0.516682 |
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 |


In [49]:
df.iloc[:, 1:3]

| | B | C |
|---|---|---|
| 2000-01-01 | 0.101675 | -1.278389 |
| 2000-01-02 | 0.829631 | -0.516682 |
| 2000-01-03 | 0.116351 | 0.89566 |
| 2000-01-04 | 0.794817 | -0.988649 |
| 2000-01-05 | 2.585122 | 0.716088 |
| 2000-01-06 | 1.310351 | -2.035116 |
| 2000-01-07 | -1.188941 | 1.491256 |
| 2000-01-08 | 0.038984 | 0.228209 |


In [50]:
df.iloc[1, 1]

0.8296308517147354

In [51]:
df.iat[1, 1]

0.8296308517147354
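The four accessors used so far pair up: `loc` and `at` address cells by label, `iloc` and `iat` by integer position, and `at`/`iat` are the variants restricted to a single scalar. A minimal sketch on a small made-up frame shows all four reaching the same cell:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=3)
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], index=index, columns=["A", "B"]
)

via_loc = df.loc[index[1], "B"]    # by label, general-purpose
via_at = df.at[index[1], "B"]      # by label, scalar only
via_iloc = df.iloc[1, 1]           # by position, general-purpose
via_iat = df.iat[1, 1]             # by position, scalar only
```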

### Boolean indexing

In [52]:
df[df["A"] > 0]

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 3.237676 | 0.101675 | -1.278389 |
| 2000-01-02 | 0.007413 | 0.829631 | -0.516682 |
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 |
| 2000-01-05 | 1.221676 | 2.585122 | 0.716088 |
| 2000-01-08 | 0.714891 | 0.038984 | 0.228209 |


In [53]:
df[df > 0]

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 3.237676 | 0.101675 | NaN |
| 2000-01-02 | 0.007413 | 0.829631 | NaN |
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 |
| 2000-01-04 | NaN | 0.794817 | NaN |
| 2000-01-05 | 1.221676 | 2.585122 | 0.716088 |
| 2000-01-06 | NaN | 1.310351 | NaN |
| 2000-01-07 | NaN | NaN | 1.491256 |
| 2000-01-08 | 0.714891 | 0.038984 | 0.228209 |
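The two boolean selections above behave differently. A boolean `Series` as the key keeps or drops whole rows, whereas a boolean `DataFrame` as the key keeps the shape and replaces non-matching cells with `NaN`. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, 6.0]})

rows = df[df["A"] > 0]          # Series mask: rows 0 and 2 are kept whole
masked = df[df > 0]             # DataFrame mask: same shape, misses become NaN
equivalent = df.where(df > 0)   # the same masking written explicitly
```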


In [54]:
df3 = df.copy()

In [56]:
df3["E"] = ["one", "one", "two", "three", "four", "three", "four", "five"]

In [57]:
df3

| | A | B | C | E |
|---|---|---|---|---|
| 2000-01-01 | 3.237676 | 0.101675 | -1.278389 | one |
| 2000-01-02 | 0.007413 | 0.829631 | -0.516682 | one |
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 | two |
| 2000-01-04 | -0.374651 | 0.794817 | -0.988649 | three |
| 2000-01-05 | 1.221676 | 2.585122 | 0.716088 | four |
| 2000-01-06 | -0.769356 | 1.310351 | -2.035116 | three |
| 2000-01-07 | -1.315484 | -1.188941 | 1.491256 | four |
| 2000-01-08 | 0.714891 | 0.038984 | 0.228209 | five |


In [58]:
df3[df3["E"].isin(["two", "five"])]

| | A | B | C | E |
|---|---|---|---|---|
| 2000-01-03 | 0.046629 | 0.116351 | 0.89566 | two |
| 2000-01-08 | 0.714891 | 0.038984 | 0.228209 | five |
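Because `isin()` returns an ordinary boolean `Series`, it can be negated with `~` or combined with other conditions using `&` and `|`. A brief sketch on a made-up frame; the parentheses around each comparison are needed because `&` binds more tightly than `>`:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, -1.0, 2.0, -2.0], "E": ["one", "two", "three", "two"]}
)

in_set = df[df["E"].isin(["two", "three"])]                      # membership filter
not_in_set = df[~df["E"].isin(["two", "three"])]                 # negated filter
combined = df[(df["A"] > 0) & df["E"].isin(["two", "three"])]    # both conditions
```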
