データを読み込み

In [37]:
import pandas as pd

# 学習用データ
train = pd.read_csv("train.csv",index_col=0)
# 評価用データ
test = pd.read_csv("test.csv",index_col=0)
## 応募用サンプルファイル
sample_submit = pd.read_csv("sample_submit.csv",index_col=0,header=None)

データの概要を確認

学習用データ

In [38]:
train.head()

Unnamed: 0_level_0,survived,pclass,sex,age,sibsp,parch,fare,embarked
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S
7,0,3,male,2.0,3,1,21.075,S
9,1,2,female,14.0,1,0,30.0708,C
11,1,1,female,58.0,0,0,26.55,S


評価用データ<br>
学習用データにはあった目的変数であるカラムsurvivedが無い<br>
答えのある学習用データの乗客の情報を元にして機械学習により生存の法則をモデル化し、そのモデルを評価用データに当てはめて、評価用データに記録されている各乗客が生存できるのか否かを予測

In [39]:
test.head()

Unnamed: 0_level_0,pclass,sex,age,sibsp,parch,fare,embarked
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.925,S
5,3,male,,0,0,8.4583,Q
6,1,male,54.0,0,0,51.8625,S


In [40]:
sample_submit.head(10)

Unnamed: 0_level_0,1
0,Unnamed: 1_level_1
0,0
1,1
2,0
5,1
6,1
8,1
10,1
12,1
14,1
15,0


データのサイズ

In [41]:
print(train.shape) #学習用データは445人の乗客情報
print(test.shape) #評価用データは446人

(445, 8)
(446, 7)


欠損値やデータ型を確認

In [42]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 445 entries, 3 to 888
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  445 non-null    int64  
 1   pclass    445 non-null    int64  
 2   sex       445 non-null    object 
 3   age       360 non-null    float64
 4   sibsp     445 non-null    int64  
 5   parch     445 non-null    int64  
 6   fare      445 non-null    float64
 7   embarked  443 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 31.3+ KB


 データを分析

In [43]:
train["survived"].value_counts()

Unnamed: 0_level_0,count
survived,Unnamed: 1_level_1
0,266
1,179


In [44]:
train.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,445.0,445.0,360.0,445.0,445.0,445.0
mean,0.402247,2.296629,29.211583,0.546067,0.431461,33.959971
std,0.490903,0.834024,14.1543,1.195247,0.850489,52.079492
min,0.0,1.0,0.67,0.0,0.0,0.0
25%,0.0,2.0,20.0,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,15.0
75%,1.0,3.0,37.25,1.0,1.0,31.3875
max,1.0,3.0,80.0,8.0,5.0,512.3292


**pd.get_dummies(train)**<br>
train データフレームに含まれるカテゴリ変数（文字列やカテゴリ型の列）を ダミー変数（0/1の数値列） に変換する。

**.corrwith(train["survived"])**<br>
train["survived"]（生存フラグ: 0=死亡, 1=生存）との 相関係数（ピアソン相関） を、それぞれの列について計算する。
<br><br>
相関係数が 1 に近いと正の相関、-1 に近いと負の相関が強く、0に近いと相関が弱いことを表します。結果から、性別や客室クラスとの相関が強く、年齢や乗船した港やとの相関は比較的弱いようです。

In [45]:
pd.get_dummies(train).corrwith(train["survived"])

Unnamed: 0,0
survived,1.0
pclass,-0.358097
age,-0.081394
sibsp,-0.045087
parch,0.079669
fare,0.258605
sex_female,0.559465
sex_male,-0.559465
embarked_C,0.182568
embarked_Q,0.005062


**pclassが1に近い高級クラスほど、survivedが1の割合が多い**

**train[["pclass","survived"]]**<br>
データフレーム train から
「客室クラス (pclass)」と「生存フラグ (survived)」だけを取り出す。

**.groupby(["pclass"])**<br>
pclass ごとにグループ化する。

In [46]:
train[["pclass","survived"]].groupby(["pclass"]).mean()

Unnamed: 0_level_0,survived
pclass,Unnamed: 1_level_1
1,0.685185
2,0.443299
3,0.258333


女性の生存率が圧倒的に高い

In [47]:
train[["sex","survived"]].groupby(["sex"]).mean()

Unnamed: 0_level_0,survived
sex,Unnamed: 1_level_1
female,0.775641
male,0.200692


**データの前処理**<br>
学習用データと評価用データを一括で処理するため、まずは両データを結合

In [48]:
data = pd.concat([train,test])
data

Unnamed: 0_level_0,survived,pclass,sex,age,sibsp,parch,fare,embarked
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3,1.0,1,female,35.0,1,0,53.1000,S
4,0.0,3,male,35.0,0,0,8.0500,S
7,0.0,3,male,2.0,3,1,21.0750,S
9,1.0,2,female,14.0,1,0,30.0708,C
11,1.0,1,female,58.0,0,0,26.5500,S
...,...,...,...,...,...,...,...,...
885,,3,female,39.0,0,5,29.1250,Q
886,,2,male,27.0,0,0,13.0000,S
887,,1,female,19.0,0,0,30.0000,S
889,,1,male,26.0,0,0,30.0000,C


欠損値の個数

In [49]:
data.isnull().sum()

Unnamed: 0,0
survived,446
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2


In [50]:
data["age"].isnull().sum()

np.int64(177)

ageには全体の平均値、embarkedには最頻値で補完

In [51]:
#欠損値（NaN）を、"age" 列の 平均値 で埋める。
data["age"] = data["age"].fillna(data["age"].mean())
data["age"]

Unnamed: 0_level_0,age
id,Unnamed: 1_level_1
3,35.0
4,35.0
7,2.0
9,14.0
11,58.0
...,...
885,39.0
886,27.0
887,19.0
889,26.0


In [52]:
data["age"].isnull().sum()

np.int64(0)

In [53]:
data["embarked"]=data["embarked"].fillna(data["embarked"].mode())
data["embarked"]

Unnamed: 0_level_0,embarked
id,Unnamed: 1_level_1
3,S
4,S
7,S
9,C
11,S
...,...
885,Q
886,S
887,S
889,C


dataをダミー化

In [54]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 3 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  445 non-null    float64
 1   pclass    891 non-null    int64  
 2   sex       891 non-null    object 
 3   age       891 non-null    float64
 4   sibsp     891 non-null    int64  
 5   parch     891 non-null    int64  
 6   fare      891 non-null    float64
 7   embarked  889 non-null    object 
dtypes: float64(3), int64(3), object(2)
memory usage: 62.6+ KB


In [55]:
#ワンホットベクトルに変換する関数
data=pd.get_dummies(data, dtype=int)

In [56]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 3 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   survived    445 non-null    float64
 1   pclass      891 non-null    int64  
 2   age         891 non-null    float64
 3   sibsp       891 non-null    int64  
 4   parch       891 non-null    int64  
 5   fare        891 non-null    float64
 6   sex_female  891 non-null    int64  
 7   sex_male    891 non-null    int64  
 8   embarked_C  891 non-null    int64  
 9   embarked_Q  891 non-null    int64  
 10  embarked_S  891 non-null    int64  
dtypes: float64(3), int64(8)
memory usage: 83.5 KB


In [57]:
data["sex_female"]

Unnamed: 0_level_0,sex_female
id,Unnamed: 1_level_1
3,1
4,0
7,0
9,1
11,1
...,...
885,1
886,0
887,1
889,0


In [58]:
train = data.loc[train.index]
test = data.loc[test.index]

In [59]:
train.head()

Unnamed: 0_level_0,survived,pclass,age,sibsp,parch,fare,sex_female,sex_male,embarked_C,embarked_Q,embarked_S
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
3,1.0,1,35.0,1,0,53.1,1,0,0,0,1
4,0.0,3,35.0,0,0,8.05,0,1,0,0,1
7,0.0,3,2.0,3,1,21.075,0,1,0,0,1
9,1.0,2,14.0,1,0,30.0708,1,0,1,0,0
11,1.0,1,58.0,0,0,26.55,1,0,0,0,1


In [60]:
test.head()

Unnamed: 0_level_0,survived,pclass,age,sibsp,parch,fare,sex_female,sex_male,embarked_C,embarked_Q,embarked_S
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,,3,22.0,1,0,7.25,0,1,0,0,1
1,,1,38.0,1,0,71.2833,1,0,1,0,0
2,,3,26.0,0,0,7.925,1,0,0,0,1
5,,3,29.699118,0,0,8.4583,0,1,0,1,0
6,,1,54.0,0,0,51.8625,0,1,0,0,1


In [61]:
test = test.drop(["survived"],axis = 1)

In [62]:
print(test.shape,train.shape)

(446, 10) (445, 11)


**モデリング**<br>
目的変数とそれ以外に学習用データを分割

In [63]:
y = train["survived"]
x = train.drop(["survived"],axis = 1)

In [64]:
y.head()

Unnamed: 0_level_0,survived
id,Unnamed: 1_level_1
3,1.0
4,0.0
7,0.0
9,1.0
11,1.0


In [65]:
x.head()

Unnamed: 0_level_0,pclass,age,sibsp,parch,fare,sex_female,sex_male,embarked_C,embarked_Q,embarked_S
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3,1,35.0,1,0,53.1,1,0,0,0,1
4,3,35.0,0,0,8.05,0,1,0,0,1
7,3,2.0,3,1,21.075,0,1,0,0,1
9,2,14.0,1,0,30.0708,1,0,1,0,0
11,1,58.0,0,0,26.55,1,0,0,0,1


In [66]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(x,y)

In [67]:
#predict_proba といった分類確信度のスコアを取得できるメソッド
#[:,1]…: は「全行」、
#1 は「2列目（インデックスは0始まりなのでクラス1の確率）」
#クラス1に属する確率だけを取り出す。
pred  = model.predict_proba(test)[:,1]

各要素は そのサンプルがクラス1である確率。
* 0.105781 → 10.6% の確率でクラス1
* 0.92529693 → 92.5% の確率でクラス1
* 0.64274681 → 64.3% の確率でクラス1
* 0.16363808 → 16.4% の確率でクラス1
* 0.27498188 → 27.5% の確率でクラス1

In [68]:
pred[:5]

array([0.105781  , 0.92529693, 0.64274681, 0.16363808, 0.27498188])

In [69]:
print(len(pred))

446


In [70]:
# 課題が0/1のラベル提出を要求する場合はこちら
sample_submit[1] = (pred >= 0.5).astype(int)
sample_submit.to_csv("submit.csv", index=True, header=False, encoding="utf-8")

In [71]:
out = pd.DataFrame({0: test.index, 1: pred})
out.to_csv("submit.csv",
           index=False, header=False,
           encoding="utf-8",
           lineterminator="\n",
           float_format="%.10f")

In [72]:
import pandas as pd
df = pd.read_csv("submit.csv", header=None)
assert df.shape == (446, 2)
assert df[1].between(0, 1).all()
print("OK")

OK
