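## Ensemble Learning
#### An ensemble runs a base learning algorithm several times, lets the resulting hypotheses vote, and combines the votes into a single consensus hypothesis. There are two main families: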
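- <font color=blue size=3>Boosting (sequential):</font>

 <font color=black size=2>Core idea: build one very strong classifier by training classifiers one after another</font>  
 <font color=black size=2>Drawback: each classifier has to finish training before the samples it misclassified can be given higher weights and passed on as input to the next classifier -> slow, and heavily dependent on the previous result</font>  
<font color=black size=2>-------------------------------------------------------------------------------------------</font> 
- <font color=blue size=3>Bagging (parallel):</font> 

 <font color=black size=2>Combine the predictions of independent hypotheses (classifiers) into one overall consensus hypothesis</font>  
 <font color=black size=2>Key point: the classifiers must differ slightly from one another -> feed each classifier different data</font>  
 <font color=black size=2>Classic algorithm: random forest</font>  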
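
The voting step shared by both families can be sketched in a few lines. This is a minimal illustration, assuming three already-trained binary classifiers whose predictions are simply written out by hand; `majority_vote` is a hypothetical helper, not a library function.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier prediction lists into one list by majority vote."""
    # zip(*predictions) groups the votes that all classifiers cast for one sample
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hand-written predictions of three classifiers for four samples (invented data)
predictions = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]

combined = majority_vote(predictions)
print(combined)  # [1, 0, 1, 1]
```

With an odd number of binary classifiers there are no ties, which is one reason odd ensemble sizes are convenient for hand-rolled voting.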

## Random Forest
- Random: each tree is trained with part of the data randomly left out
- Made up of many decision trees (classifiers)
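
How each tree ends up seeing different data can be sketched with bootstrap sampling. This is an illustration only: the row count, the seed, and the helper name `bootstrap_indices` are assumptions, and the per-tree feature subset goes beyond what the notes state (random forests usually restrict features at each split, drawn once per tree here to keep the sketch short).

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
n_rows = 10

samples = []
for tree_id in range(3):
    rows = bootstrap_indices(n_rows, rng)
    # Rows never drawn are the part of the data this tree leaves out
    left_out = sorted(set(range(n_rows)) - set(rows))
    # Assumption beyond the notes: a random feature subset, drawn per tree for brevity
    cols = rng.sample(features, 2)
    samples.append((rows, left_out, cols))
    print(tree_id, rows, left_out, cols)
```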

#### <font color=red size=3>Step 1: Load the Titanic dataset</font>

In [1]:
import pandas as pd
train_df = pd.read_csv("train.csv", encoding="utf-8")
test_df = pd.read_csv("test.csv", encoding="utf-8")
train_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


#### <font color=red size=3>Step 2: Preprocessing - check which cells in the table are empty</font>

In [2]:
train_df.isna()
# train_df.isna().sum() also works: it counts the missing values per column (False counts as 0, True as 1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


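#### <font color=blue size=3>Goal: fill in the missing values in the table</font>
#### <font color=blue size=2>1. Fill values column by column, first splitting the columns into two groups:</font>
- <font color=blue size=2>Numeric (e.g. Age): Age, SibSp, Parch, Fare</font>
- <font color=blue size=2>Categorical (e.g. Embarked): Pclass, Name (title part only, e.g. Mr), Sex, Embarked</font>
- <font color=blue size=2>***Dropped columns: Ticket (probably of little use for the analysis), Cabin (too many missing values)</font>
- <font color=blue size=2>Survived is the label column, so it is not imputed</font>

#### <font color=blue size=2>2. Fill the missing values (most scikit-learn estimators reject tables with empty cells)</font>
- <font color=blue size=2>What to fill in? The most likely value => 1) numeric: the median (the middle level)   2) categorical: the most frequent value</font>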
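
The fill rule above can be tried on a toy table first. This is a small sketch with invented data; the frame and variable names are assumptions. It shows the median and the most frequent value being learned from the training frame and reused for the test frame.

```python
import pandas as pd

# Toy frames invented for illustration; None marks a missing cell
toy_train = pd.DataFrame({
    "Age": [20.0, None, 40.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})
toy_test = pd.DataFrame({
    "Age": [None, 50.0],
    "Embarked": [None, "Q"],
})

# Statistics are learned from the training frame only
age_median = toy_train["Age"].median()                    # median of 20, 40, 30
top_port = toy_train["Embarked"].value_counts().idxmax()  # most frequent value

# The same training statistics fill both frames
for frame in (toy_train, toy_test):
    frame["Age"] = frame["Age"].fillna(age_median)
    frame["Embarked"] = frame["Embarked"].fillna(top_port)

print(toy_train)
print(toy_test)
```

Reusing the training statistics keeps the test rows from influencing how the model's inputs are prepared.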

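#### <font color=red size=3>Step 3: Preprocessing - compute the median of each numeric column and fill it into the empty cells</font>

In [3]:
# Median of each numeric column; numeric_only=True skips text columns such as Name
# Missing values in the test data are filled with the training medians, not recomputed
med = train_df.median(numeric_only=True)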
train_df = train_df.fillna(med)
test_df = test_df.fillna(med)

#### <font color=red size=3>Step 3 (continued): Preprocessing - find the most frequent value of each categorical column and fill it into the empty cells</font>

In [4]:
most = train_df["Embarked"].value_counts().idxmax()
train_df["Embarked"] = train_df["Embarked"].fillna(most)
test_df["Embarked"] = test_df["Embarked"].fillna(most)

train_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


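#### <font color=red size=3>Step 4: Preprocessing - extract the title (Mr, Miss, ...) from Name, since it may be useful for the analysis, and replace every Name value with its title if the title is kept, otherwise with None</font>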

In [5]:
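# Extract the title (e.g. "Mr") from a full name in the Name column
def nameflow(n):
    n = n.split(",")[-1].split(".")[0]
    return n.strip()
# .strip() removes the given characters from both ends of a string (whitespace by default)

# Count how often each title occurs; if there are too many distinct titles, drop the rare ones
mid = train_df["Name"].apply(nameflow).value_counts()
# Store the titles to keep as features in reserved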
reserved = mid[mid>30].index
reserved

Index(['Mr', 'Miss', 'Mrs', 'Master'], dtype='object')
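
To see what the two `split` calls do, the extraction can be run on a few names from the table. This is a standalone sketch; `extract_title`, `keep_common_title`, and `kept_titles` are hypothetical names, and the kept set simply mirrors the Index printed above.

```python
def extract_title(full_name):
    """Return the text between the comma and the first period, e.g. 'Mr'."""
    return full_name.split(",")[-1].split(".")[0].strip()

# Mirrors the four titles that occur more than 30 times in the training data
kept_titles = {"Mr", "Miss", "Mrs", "Master"}

def keep_common_title(full_name):
    """Return the title if it is one of the kept titles, otherwise None."""
    title = extract_title(full_name)
    return title if title in kept_titles else None

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
]
titles = [keep_common_title(n) for n in names]
print(titles)  # ['Mr', 'Miss', None]
```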

In [6]:
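# Like nameflow, but only returns titles that are in reserved
def nameflow2(n):
    n = n.split(",")[-1].split(".")[0]
    n = n.strip()
    if n in reserved:
        return n
    else:
        return None
# Why is None acceptable? One-hot encoding a None gives 0 in every Name_* column

# Replace each full name in the Name column with its title (None if the title is not in reserved)
train_df["Name"] = train_df["Name"].apply(nameflow2)
test_df["Name"] = test_df["Name"].apply(nameflow2)

# apply() here is the pandas.Series method, since train_df["Name"] is a Series

> <font color=blue size=3>One-Hot encoding: turn a categorical feature into several yes/no questions (e.g. split the Embarked column into three columns S, C, Q with values 0 (no) and 1 (yes))</font>
> #### Q1: What if a categorical feature has too many distinct values? 
1. Group similar values together   
2. Drop values that occur very rarely
> #### Q2: When should One-Hot encoding be used? 
For converting categorical columns (machine-learning algorithms treat feature values as numbers, so categories coded as plain integers would be given an order and magnitude they do not actually have)
> #### Q3: When can One-Hot encoding be skipped?
1. The column is categorical but ordered (e.g. Pclass) => a decision tree handles it without problems     
 2. The column is categorical but has only two values (e.g. Sex) => it is already a yes/no question, though consider whether a third value might be added later 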
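
The encoding described above, including what happens to a missing value, can be checked on a toy column. This is a small sketch with invented data; the frame name `toy` is an assumption.

```python
import pandas as pd

# One categorical column with a missing value (invented data)
toy = pd.DataFrame({"Embarked": ["S", "C", None, "Q"]})

# Default: the missing row gets 0 in every Embarked_* column
encoded = pd.get_dummies(toy, columns=["Embarked"])

# dummy_na=True adds an explicit Embarked_nan indicator column
encoded_with_na = pd.get_dummies(toy, columns=["Embarked"], dummy_na=True)

print(encoded.astype(int))
print(list(encoded_with_na.columns))
```

Depending on the pandas version, the dummy columns come back as booleans or small integers; `astype(int)` only makes the printout uniform.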

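#### <font color=red size=3>Step 5: Preprocessing - One-Hot encode the categorical columns with pandas.get_dummies()</font>

In [7]:
# pd.get_dummies(train_df, columns=["Name", "Sex", "Embarked"], dummy_na=True) adds one extra column per feature for missing values -> e.g. Name_nan, Sex_nan, Embarked_nan
# Add dummy_na or not? It probably makes little difference to training; decide by comparing the final results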
x_train_nodrop = pd.get_dummies(train_df,
                                columns=["Name", "Sex", "Embarked"])
x_test_nodrop = pd.get_dummies(test_df,
                                columns=["Name", "Sex", "Embarked"])

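# Drop the columns we do not need
# pandas.DataFrame.drop(axis=): axis=1 drops columns / axis=0 drops rows
x_train = x_train_nodrop.drop(["PassengerId", "Survived", "Cabin", "Ticket"],
                              axis=1)
y_train = x_train_nodrop["Survived"]
x_test = x_test_nodrop.drop(["PassengerId", "Cabin", "Ticket"],
                            axis=1)
testid = x_test_nodrop["PassengerId"]

> <font color=blue size=3>Cross-validation: a way to split the data and choose hyperparameters (here: how many trees (classifiers) the random forest should have and the maximum depth of each tree)</font>
- Split the data into equal parts (usually 10); each part takes one turn as the test data while the rest serve as training data
- Record the validation score of each round (10 parts means 10 rounds and 10 scores)
- The averaged score gives a more reliable picture of model quality than a single train/test split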
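
The splitting scheme in the list above can be sketched without any library. This is a simplified illustration: `k_fold_indices` is a hypothetical helper, and it splits rows in order, whereas scikit-learn's splitters for classifiers additionally balance the class labels across folds.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering every row once as test data."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, k=5)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```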

#### <font color=red size=3>Step 6: Use Grid Search to cross-validate every parameter combination, find the best one by trial and error, then narrow the parameter ranges and adjust</font>

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
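params = {
    "n_estimators": range(25, 40),  # number of trees in the forest
    "max_depth": range(6, 11)  # maximum depth of each tree
}
# How to search: start with n_estimators from 10 to 110 in steps of 10 and max_depth from 3 to 30, then gradually narrow the ranges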
clf = RandomForestClassifier()
cv = GridSearchCV(clf, params, cv=10, n_jobs=4)
cv.fit(x_train, y_train)
print(cv.best_params_)
print(cv.best_score_)

{'max_depth': 9, 'n_estimators': 26}
0.840661672908864
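
What the grid search enumerates can be sketched in plain Python. The grid below is the one from the cell above; `toy_cv_score` is a hypothetical stand-in for a real mean cross-validation score, so the "best" pair it yields is an artefact of the toy formula, not a result for the Titanic data.

```python
from itertools import product

param_grid = {
    "n_estimators": range(25, 40),
    "max_depth": range(6, 11),
}

def toy_cv_score(n_estimators, max_depth):
    """Hypothetical stand-in for the mean score of one cross-validation run."""
    return 1.0 - 0.001 * abs(n_estimators - 30) - 0.01 * abs(max_depth - 8)

# Every (n_estimators, max_depth) pair is tried once
combinations = list(product(param_grid["n_estimators"], param_grid["max_depth"]))
best = max(combinations, key=lambda combo: toy_cv_score(*combo))

# cv=10 means ten model fits per combination (not counting the final refit)
n_fits = len(combinations) * 10

print(len(combinations), n_fits, best)  # 75 750 (30, 8)
```

The fit count is why narrowing the ranges step by step, as the cell's comment suggests, keeps the search affordable.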


#### <font color=red size=3>Step 7: Build the random forest model with the best parameters found in the previous step, and evaluate it with cross-validation</font>

In [9]:
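import numpy as np  # numerical array operations
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# The parameters of RandomForestClassifier() are usually tuned for the best accuracy -> found with cross-validation
# n_estimators: number of trees (classifiers)
# Without a fixed random_state the grid search result changes between runs, so these values need not match the output printed above
clf = RandomForestClassifier(n_estimators=36, max_depth=10)

# No need to call fit here; cross_val_score fits and predicts for each fold
score = cross_val_score(clf, x_train, y_train, cv=10, n_jobs=4)
print(score)

# np.average(): like .mean(), but it can also take weights
print(np.average(score))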

[0.75555556 0.84269663 0.74157303 0.82022472 0.86516854 0.86516854
 0.82022472 0.78651685 0.87640449 0.84269663]
0.8216229712858926


#### <font color=red size=3>Step 8: Final prediction</font>

In [10]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=36, max_depth=10)
clf.fit(x_train, y_train)

pre = clf.predict(x_test)
result_df = pd.DataFrame({
    "PassengerId":testid,
    "Survived":pre
})
# result_df.to_csv("rf.csv", index=False, encoding="utf-8") # index=False: do not save the row index 0, 1, 2
result_df

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [11]:
pd.DataFrame({
    "column":x_train.columns,
    "importance":clf.feature_importances_
})

Unnamed: 0,column,importance
0,Pclass,0.089067
1,Age,0.155803
2,SibSp,0.060719
3,Parch,0.035446
4,Fare,0.209999
5,Name_Master,0.018239
6,Name_Miss,0.027127
7,Name_Mr,0.08983
8,Name_Mrs,0.013557
9,Sex_female,0.145187


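## KNN (k-nearest neighbors)
<font color=black size=2>Pick the k points closest to the target point; the target is assigned the class that is most common among those k points</font>
- Pros: intuitive, conceptually simple
- Cons: hard to interpret; looks only at a local part of the data rather than the whole; prone to errors when the class sizes are very unbalanced

> <font color=blue size=3>Scaling</font>
 - Why scale? Distances are dominated by features measured on larger scales
 - When to use: whenever distances are computed, e.g. KNN, KMeans
 - Why don't decision trees need scaling? Each split picks the feature threshold that separates the classes most cleanly, so the unit of measurement does not matter
 - Types of scaler: MinMaxScaler, RobustScaler, StandardScaler, etc.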
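
The neighbour vote and the effect of scaling on it can be sketched together in plain Python. The (age, fare) rows, the labels, and the helper names are invented for illustration; the min-max step mirrors the idea behind `MinMaxScaler` rather than reproducing its implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def fit_min_max(points):
    """Learn the per-feature minimum and maximum from the training points."""
    columns = list(zip(*points))
    return [min(c) for c in columns], [max(c) for c in columns]

def min_max_transform(point, lows, highs):
    """Rescale one point using bounds learned from the training points."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

# Toy (age, fare) rows: fare spans 0..500 while age spans only 0..80
train_points = [(28, 120), (32, 130), (30, 140), (70, 55),
                (75, 45), (5, 50), (0, 0), (80, 500)]
train_labels = ["A", "A", "A", "B", "B", "B", "B", "A"]
query = (30, 50)

# Unscaled: fare differences dominate the distance
raw_prediction = knn_predict(train_points, train_labels, query, k=3)

# Scaled: both features contribute on a comparable 0..1 range
lows, highs = fit_min_max(train_points)
scaled_train = [min_max_transform(p, lows, highs) for p in train_points]
scaled_query = min_max_transform(query, lows, highs)
scaled_prediction = knn_predict(scaled_train, train_labels, scaled_query, k=3)

print(raw_prediction, scaled_prediction)  # B A
```

In this toy data the three nearest neighbours change once age and fare share a scale, which flips the predicted class.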
 #### <font color=red size=3>Step 6: Scale the data</font>

In [12]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_train_scale = scaler.fit_transform(x_train)
x_test_scale = scaler.transform(x_test)

 #### <font color=red size=3>Step 7: Use Grid Search to find the best KNN parameter</font>

In [13]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
params = {
    "n_neighbors":range(5, 100)
}
cv = GridSearchCV(clf, params, cv=10, n_jobs=4)
cv.fit(x_train_scale, y_train)
print(cv.best_params_)
print(cv.best_score_)

{'n_neighbors': 22}
0.8193508114856428


 #### <font color=red size=3>Step 8: Final prediction</font>

In [14]:
# .best_estimator_: the KNN classifier already refit with the best parameters
pre = cv.best_estimator_.predict(x_test_scale)
result_df = pd.DataFrame({
    "PassengerId":testid,
    "Survived":pre
})

result_df

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


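#### <font color=blue size=3>Comparing the predictive performance of Random Forest and KNN:</font>
- <font color=black size=2>Influencing factors:</font>     
 <font color=black size=2>1. Amount of data: with more data -> random forest may perform better</font>                 
 <font color=black size=2>2. How noisy or mixed the data is</font>                    
 <font color=black size=2>3. Feature importance (feature_importances_): random forest -> effectively selects the important features for you / KNN -> treats every feature as equally important</font>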
 
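#### <font color=blue size=3>How can prediction accuracy be improved?</font>
<font color=black size=2>Start with data preprocessing rather than swapping algorithms -> on this kind of data the differences between algorithms tend to be very small</font>  
<font color=black size=2>e.g. switch scaling from MinMaxScaler to RobustScaler -> less affected by outliers</font>  
<font color=red size=2>***Note: do not over-process the data, e.g. by deleting certain feature values or grouping feature values -> it is unnecessary and may even lower accuracy</font>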
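
Why a median-based scaler is less sensitive to outliers can be seen on a toy column. This is a plain-Python sketch with invented fares; `min_max_scale` and `robust_scale` are hypothetical helpers that follow the idea behind `MinMaxScaler` and `RobustScaler` (median and interquartile range), not their actual implementations.

```python
import statistics

def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    med = statistics.median(values)
    return [(v - med) / (q3 - q1) for v in values]

fares = [1, 2, 3, 4, 100]  # one extreme fare among ordinary ones

min_max = min_max_scale(fares)
robust = robust_scale(fares)

print(min_max)  # the four ordinary fares are squeezed close to 0
print(robust)   # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With min-max scaling the single outlier compresses the ordinary fares into a sliver of the range, while the robust version keeps them spread out and leaves the outlier visibly far away.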