### Handling training and test data

#### 1. Train/test distributions

(1) Splitting data into training and test sets:

a. Function for splitting the data: train_test_split (default: 75% training set, 25% test set)

    from sklearn.model_selection import train_test_split

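b. A single dataset (one array or DataFrame)

    # test_fraction: proportion of samples assigned to the test set, e.g. 0.25
    train, test = train_test_split(data, test_size=test_fraction)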

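c. Two arrays (features X and labels y)

    # test_fraction: proportion of samples assigned to the test set, e.g. 0.25
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_fraction)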
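The two call patterns above can be tried on a small synthetic dataset. This is a minimal sketch: the random data, the 0.2 test fraction, and the `random_state` values are assumptions chosen for illustration, not part of the original notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): 100 samples, 3 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# One array in -> two arrays out
train, test = train_test_split(X, test_size=0.2, random_state=42)

# Two arrays in -> four arrays out; X and y are split along the same rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(train.shape, test.shape)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Passing `random_state` makes the shuffle reproducible, which helps when comparing models trained on the same split.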

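(2) If the distribution of the training data differs markedly from that of the test data, the ML model tends to perform poorly on the test set

(3) How can a mismatch between the training and test distributions be fixed?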

Example: 100 samples, 50 class 1 and 50 class 2

a. Training set: 80 samples

- class 1: 45


- class 2: 35


b. Test set: 20 samples

- class 1: 5


- class 2: 15


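Stratified sampling: keeps the class proportions in the training and test sets equal to those in the full dataset

    # Stratified sampling (test_size=0.2 gives the 80/20 split shown below)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)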

a. Training set: 80 samples

- class 1: 40


- class 2: 40


b. Test set: 20 samples

- class 1: 10


- class 2: 10
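The stratified counts above can be reproduced with a short check. This is a sketch on made-up data: the feature values and `random_state=0` are assumptions, and only the 50/50 class balance and the 80/20 split come from the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): 100 samples, 50 of class 1 and 50 of class 2
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 50 + [2] * 50)

# stratify=y keeps the 50/50 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Count samples per class; index 0 is unused because the labels are 1 and 2
train_counts = np.bincount(y_train, minlength=3)[1:].tolist()
test_counts = np.bincount(y_test, minlength=3)[1:].tolist()

print("train:", train_counts)
print("test: ", test_counts)
```

Without `stratify=y` the split is purely random, so the per-class counts can drift away from the overall ratio, especially in small datasets.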


(4) Handling imbalanced classes

- A categorical target variable should have approximately the same number of observations in each class


- Large difference between the number of observations in each class $\rightarrow$ misleading results


Example: 100 samples, 80 class 1 and 20 class 2

a. Training set: 75 samples

- class 1: 60


- class 2: 15


b. Test set: 25 samples

- class 1: 20


- class 2: 5


Method 1: Stratified sampling

    # Feature matrix: all columns except the label column
    X = df.drop("categories", axis=1)

    # Label Series
    y = df["categories"]

    # Stratified sampling for imbalanced classes
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

    # Check the label counts in the full, training, and test sets
    y.value_counts()
    y_train.value_counts()
    y_test.value_counts()


Method 2: Resampling $\rightarrow$ try to balance the number of observations across classes

- majority class (class A)


- minority class (class B)


- target class


Function: sklearn.utils.resample(m, n_samples=len(n)), which draws len(n) samples from m

a. Oversample minority class (class B)

    import pandas as pd
    from sklearn.utils import resample

    # Upsample minority (sampling with replacement) and combine with majority
    data_upsampled = resample(classB, replace=True, n_samples=len(classA), random_state=123)
    upsampled = pd.concat([classA, data_upsampled])

    # Upsampled feature matrix and target array
    X_train_up = upsampled.drop('targetclass', axis=1)
    y_train_up = upsampled['targetclass']
    
b. Undersample majority class (class A)

    # Downsample majority and combine with minority
    data_downsampled = resample(classA, replace=False, n_samples=len(classB), random_state=123)
    downsampled = pd.concat([data_downsampled, classB])

    # Downsampled feature matrix and target array
    X_train_down = downsampled.drop('targetclass', axis=1)
    y_train_down = downsampled['targetclass']

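**Note: resample only after splitting into training and test sets, and resample the training set only, so that the test set keeps the original class distribution**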
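The split-then-resample order can be sketched end to end as follows. The DataFrame, its `feature` column, and the 80/20 class balance are assumptions made up for illustration; `targetclass`, `classA`, and `classB` follow the naming used in the snippets above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumed): 80 rows of class 0, 20 rows of class 1
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=100),
    "targetclass": [0] * 80 + [1] * 20,
})

# 1. Split first, so the test set keeps the original class proportions
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=df["targetclass"], random_state=123
)

# 2. Resample the training set only
classA = train_df[train_df["targetclass"] == 0]  # majority class
classB = train_df[train_df["targetclass"] == 1]  # minority class
classB_up = resample(classB, replace=True, n_samples=len(classA), random_state=123)
upsampled = pd.concat([classA, classB_up])

# Upsampled feature matrix and target array
X_train_up = upsampled.drop("targetclass", axis=1)
y_train_up = upsampled["targetclass"]

print(y_train_up.value_counts().to_dict())
print(test_df["targetclass"].value_counts().to_dict())
```

Resampling before the split would let copies of the same minority row land in both the training and test sets, which inflates the test scores.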

#### 2. Visualizing the training and test distributions

(1) Matrix of distributions and scatterplots:

    sns.pairplot()

Example: visualizing the training and test distributions

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Create subset: data_subset
    data_subset = data[['A','B','class']]

    # Create train and test sets
    trainingSet, testSet = train_test_split(data_subset, test_size=0.2, random_state=123)

    # Examine pairplots (pairplot creates its own figure, so plt.figure() is not needed)
    sns.pairplot(trainingSet, hue='class', palette='RdBu')
    plt.show()

    sns.pairplot(testSet, hue='class', palette='RdBu')
    plt.show()

#### 3. Model evaluation for imbalanced classes

(1) Confusion matrix

- It shows the number of correctly and incorrectly classified observations in each class


(2) Performance metrics

- accuracy 


- precision 


- recall/sensitivity 


- specificity 


- F1 score
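For reference, the metrics listed above have the following standard definitions for a binary problem. The symbols $TP$, $TN$, $FP$, and $FN$ (true positives, true negatives, false positives, false negatives) are notation introduced here, not used elsewhere in these notes.

$$
\begin{aligned}
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall (sensitivity)} &= \frac{TP}{TP + FN} \\
\text{specificity} &= \frac{TN}{TN + FP} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{aligned}
$$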


(3) Functions

- confusion matrix: sklearn.metrics.confusion_matrix(y_test, y_pred)


- precision: sklearn.metrics.precision_score(y_test, y_pred)


- recall: sklearn.metrics.recall_score(y_test, y_pred) 


- f1 score: sklearn.metrics.f1_score(y_test, y_pred)
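A small hand-made example shows why accuracy alone can mislead on imbalanced data. The labels below are invented for illustration. scikit-learn has no dedicated specificity function, so this sketch computes it as the recall of the negative class via `pos_label=0`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels (assumed): 90 negatives (0) and 10 positives (1)
y_true = [0] * 90 + [1] * 10
# The classifier flags 1 negative by mistake and finds only 2 of the 10 positives
y_pred = [1] + [0] * 89 + [1] * 2 + [0] * 8

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# Specificity is the recall of the negative class
specificity = recall_score(y_true, y_pred, pos_label=0)
f1 = f1_score(y_true, y_pred)

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f}")
```

Accuracy comes out at 0.91 even though only 2 of the 10 positives are found (recall 0.2), which is why precision, recall, and F1 are reported alongside it.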


**Imbalanced class metrics**
    
    # Import
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

    # Instantiate, fit, predict
    lr = LogisticRegression(solver='liblinear')
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)

    # Print evaluation metrics
    print("Confusion matrix:\n {}".format(confusion_matrix(y_test, y_pred)))
    print("Accuracy: {}".format(accuracy_score(y_test, y_pred)))
    print("Precision: {}".format(precision_score(y_test, y_pred)))
    print("Recall: {}".format(recall_score(y_test, y_pred)))
    print("F1: {}".format(f1_score(y_test, y_pred)))

**Using resampling techniques**
   
a. Oversample minority class

    # Instantiate logistic regression, fit, predict
    loan_lr_up = LogisticRegression(solver='liblinear')
    loan_lr_up.fit(X_train_up, y_train_up)
    upsampled_y_pred = loan_lr_up.predict(X_test)

    # Print evaluation metrics
    print("Confusion matrix:\n {}".format(confusion_matrix(y_test, upsampled_y_pred)))
    print("Accuracy: {}".format(accuracy_score(y_test, upsampled_y_pred)))
    print("Precision: {}".format(precision_score(y_test, upsampled_y_pred)))
    print("Recall: {}".format(recall_score(y_test, upsampled_y_pred)))
    print("F1: {}".format(f1_score(y_test, upsampled_y_pred)))

b. Undersample majority class

    # Instantiate, fit, predict
    loan_lr_down = LogisticRegression(solver='liblinear')
    loan_lr_down.fit(X_train_down, y_train_down)
    downsampled_y_pred = loan_lr_down.predict(X_test)

    # Print evaluation metrics
    print("Confusion matrix:\n {}".format(confusion_matrix(y_test, downsampled_y_pred)))
    print("Accuracy: {}".format(accuracy_score(y_test, downsampled_y_pred)))
    print("Precision: {}".format(precision_score(y_test, downsampled_y_pred)))
    print("Recall: {}".format(recall_score(y_test, downsampled_y_pred)))
    print("F1: {}".format(f1_score(y_test, downsampled_y_pred)))