## ★Chapter 4  Basics of Handling Data in Python

### ●01 Libraries Used for Data Processing

Listing 4.1  Importing NumPy

In [1]:
import numpy as np

Listing 4.2  Importing pandas

In [2]:
import pandas as pd

Listing 4.3  Importing janome

In [3]:
import janome

### ●02 Libraries Used for Visualization

Listing 4.4  Importing matplotlib

In [4]:
import matplotlib.pyplot as plt

Listing 4.5  Importing seaborn

In [5]:
import seaborn as sns

### ●03 Data Structures Used in Python

Listing 4.6  Example of a list

In [6]:
sample_list = [1, 2, 3, 4]
sample_list

[1, 2, 3, 4]

In [7]:
type(sample_list)

list

Listing 4.7  Example of a Series

In [8]:
sample_series = pd.Series([1,2,3])
sample_series

0    1
1    2
2    3
dtype: int64

In [9]:
type(sample_series)

pandas.core.series.Series

Listing 4.8  Example of a DataFrame

In [10]:
sample_df = pd.DataFrame({
    "名前": ["Alice", "Bob", "Charlie"],
    "点数": [78, 65, 90]
})
sample_df

,名前,点数
0,Alice,78
1,Bob,65
2,Charlie,90


In [11]:
type(sample_df)

pandas.core.frame.DataFrame

### ●04 Basic Operations

Listing 4.9  Reading a CSV file

In [12]:
import numpy as np
import pandas as pd

new_data = pd.read_csv("read_sample.csv")

Listing 4.10  Displaying the loaded data

In [13]:
new_data

,名前,点数
0,Alice,78
1,Bob,65
2,Charlie,90
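
As a self-contained variation on Listings 4.9 and 4.10, the sketch below reads the same three rows from an in-memory string instead of a file, so it runs without `read_sample.csv` on disk. The CSV text is an assumption made for illustration. The real file's contents are known only from the output shown above.

```python
import io

import pandas as pd

# Assumed CSV contents, standing in for read_sample.csv (illustrative only).
csv_text = "名前,点数\nAlice,78\nBob,65\nCharlie,90\n"

# pd.read_csv accepts a file path or any file-like object.
new_data = pd.read_csv(io.StringIO(csv_text))

print(new_data.shape)   # (number of rows, number of columns)
print(new_data.dtypes)  # the type pandas inferred for each column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed as expected.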


### ●05 Basic Arithmetic

Listing 4.11  Code to run beforehand

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Listing 4.12  Basic arithmetic

In [16]:
# Addition
1 + 1

2

In [18]:
# Subtraction
2 - 1

1

In [19]:
# Multiplication
3 * 1

3

In [20]:
# Division
4 / 1

4.0
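
The division result above is `4.0` rather than `4` because `/` always returns a float in Python 3. A small sketch of the related operators follows, with values chosen only for illustration:

```python
true_quotient = 7 / 2    # true division: always a float
floor_quotient = 7 // 2  # floor division: rounds down to an integer
remainder = 7 % 2        # modulo: the remainder of the division
power = 2 ** 3           # exponentiation

print(true_quotient, floor_quotient, remainder, power)
```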

Listing 4.13  Example of counting the characters in a string

In [21]:
len("python")

6

### ●06 Working with DataFrames

Listing 4.14  Loading sample data as a pandas DataFrame

In [22]:
titanic = sns.load_dataset("titanic")
titanic

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


Listing 4.15  Example of getting the first row

In [23]:
titanic.iloc[0]

survived                 0
pclass                   3
sex                   male
age                     22
sibsp                    1
parch                    0
fare                  7.25
embarked                 S
class                Third
who                    man
adult_male            True
deck                   NaN
embark_town    Southampton
alive                   no
alone                False
Name: 0, dtype: object
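
`iloc` selects by position, while `loc` selects by index label. In the Titanic data the two coincide because the index is 0, 1, 2, and so on. The sketch below uses a small hypothetical DataFrame with a non-default index to make the difference visible:

```python
import pandas as pd

# Hypothetical data with index labels that differ from the row positions.
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "score": [78, 65, 90]},
    index=[10, 20, 30],
)

first_by_position = df.iloc[0]  # first row, whatever its label is
first_by_label = df.loc[10]     # the row whose index label is 10

print(first_by_position)
print(first_by_label)
```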

Listing 4.16  Example of getting a single column

In [24]:
titanic_class = titanic["class"]
titanic_class

0       Third
1       First
2       Third
3       First
4       Third
        ...  
886    Second
887     First
888     Third
889     First
890     Third
Name: class, Length: 891, dtype: category
Categories (3, object): [First, Second, Third]

Listing 4.17  Example of counting the number of rows

In [26]:
len(titanic)

891

Listing 4.18  Example of checking a summary of the data with describe()

In [27]:
titanic.describe()

,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


Listing 4.19  Example of counting the occurrences of each value in a column

In [28]:
titanic_class = titanic["class"].value_counts()
titanic_class

Third     491
First     216
Second    184
Name: class, dtype: int64

Listing 4.20  Example of counting the number of distinct values in a column

In [29]:
titanic_unique = titanic["class"].nunique()
titanic_unique

3
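
`value_counts()` and `nunique()` answer different questions: the first counts how often each value occurs, the second counts how many distinct values there are. A minimal sketch on a small hypothetical Series (not the Titanic data):

```python
import pandas as pd

# Hypothetical values, for illustration only.
classes = pd.Series(["Third", "First", "Third", "Second", "Third"])

counts = classes.value_counts()  # occurrences per value, most frequent first
n_unique = classes.nunique()     # number of distinct values
labels = classes.unique()        # the distinct values, in order of appearance

print(counts)
print(n_unique)
print(labels)
```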

Listing 4.21  Grouping the data and outputting the counts

In [30]:
titanic_sex_class = titanic.groupby("sex")["class"].value_counts()
titanic_sex_class

sex     class 
female  Third     144
        First      94
        Second     76
male    Third     347
        First     122
        Second    108
Name: class, dtype: int64

Listing 4.22  Example of computing the mean for each value of a variable

In [31]:
titanic_group_mean = titanic.groupby("sex").mean(numeric_only=True)
titanic_group_mean

sex,survived,pclass,age,sibsp,parch,fare,adult_male,alone
female,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818,0.0,0.401274
male,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893,0.930676,0.712305


Listing 4.23  Example of computing the mean by sex

In [33]:
titanic_group_mean = titanic.groupby("sex", as_index=False).mean(numeric_only=True)
titanic_group_mean

,sex,survived,pclass,age,sibsp,parch,fare,adult_male,alone
0,female,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818,0.0,0.401274
1,male,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893,0.930676,0.712305


Listing 4.24  Example of computing means grouped by two variables

In [34]:
titanic_group_mean2 = titanic.groupby(["sex", "class"], as_index=False).mean(numeric_only=True)
titanic_group_mean2

,sex,class,survived,pclass,age,sibsp,parch,fare,adult_male,alone
0,female,First,0.968085,1.0,34.611765,0.553191,0.457447,106.125798,0.0,0.361702
1,female,Second,0.921053,2.0,28.722973,0.486842,0.605263,21.970121,0.0,0.421053
2,female,Third,0.5,3.0,21.75,0.895833,0.798611,16.11881,0.0,0.416667
3,male,First,0.368852,1.0,41.281386,0.311475,0.278689,67.226127,0.97541,0.614754
4,male,Second,0.157407,2.0,30.740707,0.342593,0.222222,19.741782,0.916667,0.666667
5,male,Third,0.135447,3.0,26.507589,0.498559,0.224784,12.661633,0.919308,0.760807
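
When only one or two columns are of interest, selecting them before aggregating keeps the result small and avoids averaging columns for which a mean is not meaningful. The sketch below uses a small hypothetical DataFrame. The output column names `mean_fare` and `passengers` are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical passenger data, for illustration only.
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "fare": [80.0, 20.0, 10.0, 10.0],
    "embarked": ["C", "S", "S", "Q"],
})

# Mean of a single column per group.
mean_fare = df.groupby("sex")["fare"].mean()

# Several named aggregations at once, with the group key as a regular column.
summary = df.groupby("sex", as_index=False).agg(
    mean_fare=("fare", "mean"),
    passengers=("fare", "size"),
)

print(mean_fare)
print(summary)
```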


Listing 4.25  Example of tabulating counts

In [35]:
cross_class = pd.crosstab(titanic["who"], titanic["class"])
cross_class

class,First,Second,Third
who,,,
child,6,19,58
man,119,99,319
woman,91,66,114


Listing 4.26  Example of normalizing the counts into row proportions

In [36]:
cross_nmrl = pd.crosstab(titanic["who"], titanic["class"], normalize="index")
cross_nmrl

class,First,Second,Third
who,,,
child,0.072289,0.228916,0.698795
man,0.221601,0.184358,0.594041
woman,0.335793,0.243542,0.420664
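
With `normalize="index"`, each row of the cross-tabulation is divided by its own total, so every row sums to 1. A minimal sketch on a small hypothetical DataFrame that checks this property:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "who": ["man", "man", "woman", "child", "woman", "man"],
    "class": ["First", "Third", "First", "Third", "Second", "Third"],
})

counts = pd.crosstab(df["who"], df["class"])                     # raw counts
shares = pd.crosstab(df["who"], df["class"], normalize="index")  # row proportions

row_totals = shares.sum(axis=1)  # each entry should be 1.0 (up to rounding)

print(counts)
print(shares)
print(row_totals)
```

Passing `normalize="columns"` instead divides by the column totals, and `normalize="all"` divides by the grand total.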


Listing 4.27  Example of extracting the data that matches a condition

In [37]:
titanic_female = titanic[titanic["sex"] == "female"]
titanic_female

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
880,1,2,female,25.0,0,1,26.0000,S,Second,woman,False,,Southampton,yes,False
882,0,3,female,22.0,0,0,10.5167,S,Third,woman,False,,Southampton,no,True
885,0,3,female,39.0,0,5,29.1250,Q,Third,woman,False,,Queenstown,no,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


Listing 4.28  Example of getting the rows that match a condition

In [38]:
titanic_female = titanic.query("sex == 'female'")
titanic_female

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
880,1,2,female,25.0,0,1,26.0000,S,Second,woman,False,,Southampton,yes,False
882,0,3,female,22.0,0,0,10.5167,S,Third,woman,False,,Southampton,no,True
885,0,3,female,39.0,0,5,29.1250,Q,Third,woman,False,,Queenstown,no,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
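
Boolean indexing (Listing 4.27) and `query` (Listing 4.28) select the same rows. The sketch below checks that on a small hypothetical DataFrame and shows how two conditions can be combined in each style:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

by_mask = df[df["sex"] == "female"]     # boolean indexing
by_query = df.query("sex == 'female'")  # the same filter as a query string

# Two conditions: use & with parentheses for masks, `and` inside query strings.
cheap_female_mask = df[(df["sex"] == "female") & (df["fare"] < 10)]
cheap_female_query = df.query("sex == 'female' and fare < 10")

print(by_mask.equals(by_query))
print(cheap_female_query)
```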


Listing 4.29  Example of sorting the data

In [39]:
titanic_female_sort = titanic_female.sort_values("fare")
titanic_female_sort

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
654,0,3,female,18.0,0,0,6.7500,Q,Third,woman,False,,Queenstown,no,True
875,1,3,female,15.0,0,0,7.2250,C,Third,child,False,,Cherbourg,yes,True
19,1,3,female,,0,0,7.2250,C,Third,woman,False,,Cherbourg,yes,True
780,1,3,female,13.0,0,0,7.2292,C,Third,child,False,,Cherbourg,yes,True
367,1,3,female,,0,0,7.2292,C,Third,woman,False,,Cherbourg,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
742,1,1,female,21.0,2,2,262.3750,C,First,woman,False,B,Cherbourg,yes,False
311,1,1,female,18.0,2,2,262.3750,C,First,woman,False,B,Cherbourg,yes,False
88,1,1,female,23.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
341,1,1,female,24.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False


Listing 4.30  Example of sorting in descending order

In [40]:
titanic_female_sort = titanic_female.sort_values("fare", ascending=False)
titanic_female_sort

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
258,1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True
341,1,1,female,24.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
88,1,1,female,23.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
742,1,1,female,21.0,2,2,262.3750,C,First,woman,False,B,Cherbourg,yes,False
311,1,1,female,18.0,2,2,262.3750,C,First,woman,False,B,Cherbourg,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
367,1,3,female,,0,0,7.2292,C,Third,woman,False,,Cherbourg,yes,True
780,1,3,female,13.0,0,0,7.2292,C,Third,child,False,,Cherbourg,yes,True
19,1,3,female,,0,0,7.2250,C,Third,woman,False,,Cherbourg,yes,True
875,1,3,female,15.0,0,0,7.2250,C,Third,child,False,,Cherbourg,yes,True


Listing 4.31  Example of renaming a column

In [42]:
# Rename the column "age" to "年齢" (Japanese for "age")
titanic_rename = titanic_female_sort.rename(columns={"age": "年齢"})
titanic_rename

,survived,pclass,sex,年齢,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
258,1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True
341,1,1,female,24.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
88,1,1,female,23.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
742,1,1,female,21.0,2,2,262.3750,C,First,woman,False,B,Cherbourg,yes,False
311,1,1,female,18.0,2,2,262.3750,C,First,woman,False,B,Cherbourg,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
367,1,3,female,,0,0,7.2292,C,Third,woman,False,,Cherbourg,yes,True
780,1,3,female,13.0,0,0,7.2292,C,Third,child,False,,Cherbourg,yes,True
19,1,3,female,,0,0,7.2250,C,Third,woman,False,,Cherbourg,yes,True
875,1,3,female,15.0,0,0,7.2250,C,Third,child,False,,Cherbourg,yes,True


Listing 4.32  Example of repeating a process a fixed number of times

In [43]:
for i in range(5):
    print(i * 2)

0
2
4
6
8


Listing 4.33  Example of iterating over a list

In [44]:
# Create a list
sample_list = [10, 20, 30, 40, 50]

for i in sample_list:
    print(i)

10
20
30
40
50
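
When both the position and the value are needed inside the loop, `enumerate` provides them together. This is a minimal sketch. The `doubled` list is a name introduced here for illustration.

```python
sample_list = [10, 20, 30, 40, 50]

doubled = []
for index, value in enumerate(sample_list):
    # index counts up from 0; value is the list element at that position
    print(index, value)
    doubled.append(value * 2)

print(doubled)
```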
