#####Flow of training a predictive model
1. Split the dataset into training and test data
2. Standardize the explanatory variables
3. Specify a predictive model
4. Specify a loss function
5. Train the model using the training data and the loss function
6. Evaluate the model using the test data
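The six steps can be sketched end to end as follows. This is a minimal sketch on synthetic data from `make_classification`, not the Titanic data; the names `X_demo`, `y_demo`, `scaler`, and `clf` are illustrative assumptions. Steps 3 and 4 collapse into one line here because scikit-learn's `LogisticRegression` minimizes a regularized log loss internally.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# 1. Split the dataset into training and test data
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# 2. Standardize, using statistics estimated from the training data only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# 3-5. Specify the model (the log loss is built in) and train it
clf = LogisticRegression().fit(X_tr_s, y_tr)

# 6. Evaluate the model using the test data
acc = accuracy_score(y_te, clf.predict(X_te_s))
print(round(acc, 2))
```

Fitting the scaler on the training split alone keeps test-set statistics out of training, which matches the order of steps 1 and 2 in the list.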

Logistic regression
* **Predict whether each passenger survived** from information about the Titanic's passengers.

Dataset
* train.csv: Contains passenger information and whether each passenger survived.

Data items
* PassengerId: Sequential passenger ID
* Survived: Survival (0 = died, 1 = survived)
* Pclass: Ticket class (1 = upper class, 2 = middle class, 3 = lower class)
* Name: Passenger's name
* Sex: Sex (male, female)
* Age: Age
* SibSp: Number of siblings and spouses aboard the Titanic
* Parch: Number of parents and children aboard the Titanic
* Ticket: Ticket number
* Fare: Fare
* Cabin: Cabin number
* Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import numpy as np
import pandas as pd

# Load the dataset (train.csv)
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/7data/train.csv")

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Data processing (numerical conversion)

In [None]:
# Map sex to 0/1 (any other value, including a missing one, returns None)
def gender_convert(x):
    if x == 'male':
        return 0
    elif x == 'female':
        return 1

# Map the port of embarkation to an integer code (unmatched values return None)
def embarked_convert(x):
    if x == 'C':
        return 0
    elif x == 'Q':
        return 1
    elif x == 'S':
        return 2


# Convert sex to a numeric value
df["Sex"] = df["Sex"].apply(gender_convert)

# Convert the port of embarkation to a numeric value
df["Embarked"] = df["Embarked"].apply(embarked_convert)
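The same conversion can also be written with `Series.map` and a dictionary. The sketch below uses a small hypothetical frame named `toy` rather than the notebook's `df`; values that are not in the dictionary (including missing ones) become NaN.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Titanic data
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", None]})

# Dictionary-based mapping; unmatched or missing values become NaN
sex_codes = toy["Sex"].map({"male": 0, "female": 1})
embarked_codes = toy["Embarked"].map({"C": 0, "Q": 1, "S": 2})

print(sex_codes.tolist())              # [0, 1, 1]
print(embarked_codes.isna().tolist())  # [False, False, True]
```

Because the ports C, Q, and S have no natural order, one-hot encoding (for example with `pd.get_dummies`) is a common alternative to integer codes.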

Data processing (feature scaling: standardization)

In [None]:
# 2. Standardize the explanatory variables
from sklearn.preprocessing import StandardScaler

# Scaler instance (rescales each column to mean 0 and standard deviation 1)
standard_sc = StandardScaler()

# Standardize age and fare
X = df.loc[:, ['Age','Fare']]
X = standard_sc.fit_transform(X)

# Write the standardized values back into the DataFrame
df.loc[:, ['Age','Fare']] = X
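To make the transformation concrete, the sketch below compares `StandardScaler` with the formula (x - mean) / std computed by hand. The frame `toy` and its values are made up for illustration. `StandardScaler` uses the population standard deviation (`ddof=0`) and skips NaN when estimating the statistics, leaving NaN entries in place.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values, including one missing age
toy = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0],
                    "Fare": [7.25, 71.28, 7.93, 53.10]})

scaled = StandardScaler().fit_transform(toy)

# Manual standardization: pandas also skips NaN in mean() and std()
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy(), equal_nan=True))  # True
```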

Data processing (removal of missing values)

In [None]:
# Count the missing values in each column
df.isnull().sum()

# Sample output
# PassengerId      0
# Survived         0
# Pclass           0
# Name             0
# Sex              0
# Age            177
# SibSp            0
# Parch            0
# Ticket           0
# Fare             0
# Cabin          687
# Embarked         2
# dtype: int64

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0


In [None]:
# Remove every row that has a missing value in any column
df = df.dropna(axis=0).reset_index(drop=True)

# Confirm that the rows with missing values have been removed
df.isnull().sum()

# Sample output
# PassengerId    0
# Survived       0
# Pclass         0
# Name           0
# Sex            0
# Age            0
# SibSp          0
# Parch          0
# Ticket         0
# Fare           0
# Cabin          0
# Embarked       0
# dtype: int64

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,0
SibSp,0
Parch,0
Ticket,0
Fare,0
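Note that `dropna(axis=0)` drops a row if any column is missing, and the counts above show that most missing entries (687) are in Cabin, which the model does not use. One possible alternative, sketched here on a hypothetical frame `toy`, is to restrict the check to the columns that matter with the `subset` argument. This is an illustration only; the results reported in this notebook come from the full `dropna(axis=0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Cabin is mostly missing, Age is missing once
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

all_cols = toy.dropna(axis=0)                       # any missing column drops the row
used_cols = toy.dropna(subset=["Age", "Embarked"])  # only check selected columns

print(len(all_cols), len(used_cols))  # 1 3
```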


Variable selection
* Explanatory variables: Pclass, Sex, Age, SibSp, Embarked, Fare
* Target variable: Survived

In [None]:
# Explanatory variables
X = df[["Pclass","Sex","Age","SibSp","Embarked","Fare"]]

# Target variable
y = df["Survived"]

* Create a logistic regression instance and train the model.

In [None]:
from sklearn.linear_model    import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Split the dataset into training and test data
X_train,X_test,Y_train,Y_test = train_test_split(X,y, test_size=0.3, shuffle=True, random_state=3)

# 3. Specify the predictive model 4. Specify the loss function 5. Train the model using the training data and the loss function
# Logistic regression instance
model = LogisticRegression(penalty='l2',          # Regularization term (L1 or L2 can be selected, depending on the solver)
                           dual=False,            # Dual or primal formulation (dual is only implemented for the L2 penalty with liblinear)
                           tol=0.0001,            # Tolerance for the stopping criterion
                           C=1.0,                 # Inverse of the regularization strength (smaller values regularize more strongly)
                           fit_intercept=True,    # Whether to fit a bias term (intercept)
                           intercept_scaling=1,   # Intercept scaling, used only when solver='liblinear'
                           class_weight=None,     # Weights assigned to the classes (None gives every class weight 1)
                           random_state=None,     # Random seed
                           solver='lbfgs',        # Optimization algorithm used to fit the coefficients
                           max_iter=100,          # Maximum number of iterations
                           multi_class='auto',    # Multiclass handling ('auto' is fine for binary problems; deprecated since scikit-learn 1.5)
                           verbose=0,             # Set to a positive number for verbose output (liblinear and lbfgs solvers)
                           warm_start=False,      # If True, reuse the solution of the previous call as initialization
                           n_jobs=None,           # Number of CPU cores used for parallel computation
                           l1_ratio=None          # Elastic-Net mixing ratio (used only when penalty='elasticnet')
                          )

# Train the model
model.fit(X_train, Y_train)



Check the partial regression coefficients

In [None]:
df_model = pd.DataFrame(index=["Pclass","Sex","Age","SibSp","Embarked","Fare"])
df_model["partial regression coefficient"] = model.coef_[0]
print(df_model)

# Sample output
#               partial regression coefficient
# b1  Pclass   -0.418780
# b2  Sex       2.363631
# b3  Age      -0.607834
# b4  SibSp     0.195629
# b5  Embarked -0.428897
# b6  Fare      0.015289

* The sign of a partial regression coefficient gives the direction of the effect on the log-odds of survival, and a larger absolute value means a stronger effect per unit change in that explanatory variable.
* Magnitudes are directly comparable only between variables on the same scale; here only Age and Fare are standardized.
* Explanatory variables that contribute little to the prediction tend to have coefficients close to zero.
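One common way to read logistic regression coefficients is to exponentiate them into odds ratios: exp(b) is the factor by which the odds of Survived=1 change when that variable increases by one unit, with the others held fixed. The sketch below applies this to the coefficients from the sample output above; the dictionary `coef` is simply a hand-copied stand-in for `model.coef_[0]`.

```python
import numpy as np

# Coefficients copied from the sample output (stand-in for model.coef_[0])
coef = {"Pclass": -0.418780, "Sex": 2.363631, "Age": -0.607834,
        "SibSp": 0.195629, "Embarked": -0.428897, "Fare": 0.015289}

# Odds ratio per one-unit increase of each explanatory variable
odds_ratios = {name: float(np.exp(b)) for name, b in coef.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

With Sex coded as male=0 and female=1, the odds ratio for Sex is about 10.6, meaning the fitted model gives female passengers roughly ten times the survival odds of otherwise identical male passengers.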

Check the bias term (intercept)

In [None]:
print("intercept: ", model.intercept_)

# Sample output
# intercept:  [1.0597535]

intercept:  [1.05972307]
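The intercept and coefficients together define the predicted probability: the model computes z = intercept + coefficients · x and passes z through the sigmoid function. The following sketch checks this against `predict_proba` on synthetic data; `X_demo`, `y_demo`, and `clf` are illustrative names, not objects from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=3,
                                     n_informative=2, n_redundant=1,
                                     random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear predictor followed by the sigmoid function
z = X_demo @ clf.coef_[0] + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
p_sklearn = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(p_manual, p_sklearn))  # True
```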


Model inference
* Pass the test data to the trained logistic regression model to obtain predictions.

In [None]:
# Use the predict_proba() method to obtain class probabilities
Y_pred_proba = model.predict_proba(X_test)

# Collect the probabilities in a DataFrame and print them
df_proba     = pd.DataFrame()
df_proba["non-survival probability (Survived=0)"] = Y_pred_proba[:,0]
df_proba["survival probability (Survived=1)"] = Y_pred_proba[:,1]
print(df_proba)

# Sample output
#     non-survival probability (Survived=0)  survival probability (Survived=1)
# 0          0.715506          0.284494
# 1          0.070686          0.929314
# 2          0.327845          0.672155
# 3          0.108550          0.891450
# 4          0.065910          0.934090
# 5          0.294308          0.705692
# 6          0.536102          0.463898
# 7          0.055910          0.944090
# 8          0.123385          0.876615
# 9          0.176352          0.823648
# 10         0.024932          0.975068

    non-survival probability (Survived=0)  survival probability (Survived=1)
0                             0.715481                       0.284519
1                             0.070645                       0.929355
2                             0.327779                       0.672221
3                             0.108588                       0.891412
4                             0.065869                       0.934131
5                             0.294154                       0.705846
6                             0.536054                       0.463946
7                             0.055937                       0.944063
8                             0.123314                       0.876686
9                             0.176290                       0.823710
10                            0.024933                       0.975067
11                            0.612967                       0.387033
12                            0.579596                       0.420404
13                  

Output results as binary values (0 or 1)

In [None]:
# Inference
Y_pred = model.predict(X_test)
print(Y_pred)

# Sample output
# [0 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1
# 1 1 0 1 0 0 0 1 1 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0]

[0 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1
 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0]
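For a binary problem, `predict()` returns the class with the larger predicted probability, which is equivalent to thresholding the Survived=1 probability at 0.5. The sketch below reproduces this on the first seven probabilities from the sample output of `predict_proba` and also shows a stricter, purely illustrative threshold of 0.7.

```python
import numpy as np

# Survived=1 probabilities for rows 0-6, copied from the sample output
p_survived = np.array([0.284494, 0.929314, 0.672155, 0.891450,
                       0.934090, 0.705692, 0.463898])

# Default decision rule: predict 1 when the probability exceeds 0.5
default_labels = (p_survived > 0.5).astype(int)

# Hypothetical stricter rule: predict 1 only above 0.7
strict_labels = (p_survived > 0.7).astype(int)

print(default_labels)  # [0 1 1 1 1 1 0]
print(strict_labels)   # [0 1 0 1 1 1 0]
```

The default labels match the first seven entries of the `predict()` output. Raising the threshold typically trades recall for precision.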


Performance evaluation of the trained model

* Evaluate the trained model using evaluation metrics (confusion matrix, accuracy, precision, recall, F1).

Confusion matrix

In [None]:
# 6. Evaluate the model using the test data
from sklearn.metrics import confusion_matrix

# Confusion matrix (rows: actual class, columns: predicted class)
print(confusion_matrix(y_true=Y_test, # actual values
                       y_pred=Y_pred  # predicted values
                      ))

# Sample output
# [[ 9  9]
#  [12 25]]

[[ 9  9]
 [12 25]]


Accuracy, precision, recall, and F1 score

In [None]:
# 6. Evaluate the model using the test data
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Accuracy
print('accuracy: ',  round(accuracy_score(y_true=Y_test, y_pred=Y_pred),2))
# Precision
print('precision: ', round(precision_score(y_true=Y_test, y_pred=Y_pred),2))
# Recall
print('recall: ',    round(recall_score(y_true=Y_test, y_pred=Y_pred),2))
# F1 score
print('f1 score: ',  round(f1_score(y_true=Y_test, y_pred=Y_pred),2))

# Sample output
# accuracy:   0.62
# precision:  0.74
# recall:     0.68
# f1 score:   0.7

accuracy:  0.62
precision:  0.74
recall:  0.68
f1 score:  0.7
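These four numbers follow directly from the confusion matrix printed earlier. As a worked check, the sketch below recomputes them by hand from the matrix [[9, 9], [12, 25]], using scikit-learn's layout in which rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[9, 9],
               [12, 25]])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

accuracy = (tp + tn) / (tn + fp + fn + tp)           # correct predictions / all predictions
precision = tp / (tp + fp)                           # correct among predicted survivors
recall = tp / (tp + fn)                              # found among actual survivors
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.62 0.74 0.68 0.7
```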


Report output of the above metrics

In [None]:
from sklearn.metrics import classification_report

# Evaluation report for the classification problem
print(classification_report(Y_test, Y_pred))

# Sample output
#               precision    recall  f1-score   support

#            0       0.43      0.50      0.46        18
#            1       0.74      0.68      0.70        37

#     accuracy                           0.62        55
#    macro avg       0.58      0.59      0.58        55
# weighted avg       0.63      0.62      0.62        55

              precision    recall  f1-score   support

           0       0.43      0.50      0.46        18
           1       0.74      0.68      0.70        37

    accuracy                           0.62        55
   macro avg       0.58      0.59      0.58        55
weighted avg       0.63      0.62      0.62        55
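The last two rows of the report differ only in how they average the per-class scores: "macro avg" is the plain mean over classes, while "weighted avg" weights each class by its support. The sketch below reproduces the precision column by hand from the confusion matrix [[9, 9], [12, 25]]; the variable names are illustrative.

```python
import numpy as np

# Per-class precision: correct predictions / all predictions of that class
precision_per_class = np.array([9 / (9 + 12),    # class 0
                                25 / (25 + 9)])  # class 1
support = np.array([18, 37])                     # actual samples per class

macro = float(precision_per_class.mean())
weighted = float(np.average(precision_per_class, weights=support))

print(round(macro, 2), round(weighted, 2))  # 0.58 0.63
```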

