### 3rd_ML100Marathon Midterm

- 安隆公司曾是一間能源公司，2001 年破產前是世界上最大的電力、天然氣及電信公司之一。擁有上千億資產的公司於 2002 年竟然在短短幾周內宣告破產，才揭露其財報在多年以來均是造假的醜聞。在本資料集中你將會扮演偵探的角色，透過高層經理人內部的 mail 來往的情報以及薪資、股票等財務特徵，訓練出一個機器學習模型來幫忙你找到可疑的詐欺犯罪者是誰! 我們已經先幫你找到幾位犯罪者 (Person-of-Interest, poi) 與清白的員工，請利用這些訓練資料來訓練屬於自己的詐欺犯機器學習模型吧!

### 特徵說明
- 有關財務的特徵: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (單位皆為美元)。更詳細的特徵說明請參考 enron61702insiderpay.pdf 的最後一頁(請至Data頁面參考該PDF檔)
- 有關 email 的特徵: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (除了 email_address，其餘皆為次數)

- 嫌疑人的標記，也就是我們常用的 **y**。POI label: [‘poi’] (boolean, represented as integer)

我們也建議你對既有特徵進行一些特徵工程如 rescale, transform ，也試著發揮想像力與創意，建立一些可以幫助找到嫌疑犯的特徵，增進模型的預測能力


In [130]:
# Import 需要的套件
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from IPython.display import display
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor,RandomForestClassifier,RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV

%matplotlib inline

### 讀取資料

In [158]:
df_train = pd.read_csv("train_data.csv")
df_test = pd.read_csv("test_features.csv")
submit = pd.read_csv("sample_submission.csv")

In [11]:
print(df_train.info())
print("--------------------------------------------")
print(df_test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 22 columns):
name                         113 non-null object
bonus                        61 non-null float64
deferral_payments            28 non-null float64
deferred_income              34 non-null float64
director_fees                13 non-null float64
email_address                83 non-null object
exercised_stock_options      81 non-null float64
expenses                     73 non-null float64
from_messages                65 non-null float64
from_poi_to_this_person      65 non-null float64
from_this_person_to_poi      65 non-null float64
loan_advances                2 non-null float64
long_term_incentive          49 non-null float64
other                        69 non-null float64
poi                          113 non-null bool
restricted_stock             82 non-null float64
restricted_stock_deferred    10 non-null float64
salary                       73 non-null float64
shared_receipt_wi

### 前處理


In [160]:
# df = df_train.append(df_test)
# df.reset_index(inplace=True, drop = True)
df

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,name,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
0,1750000.0,,-3504386.0,,ken.rice@enron.com,19794175.0,46950.0,18.0,42.0,4.0,...,RICE KENNETH D,174839.0,True,2748364.0,,420636.0,864.0,905.0,505050.0,22542539.0
1,5600000.0,,,,jeff.skilling@enron.com,19250000.0,29336.0,108.0,88.0,30.0,...,SKILLING JEFFREY K,22122.0,True,6843672.0,,1111258.0,2042.0,3627.0,8682716.0,26093672.0
2,200000.0,,-4167.0,,rex.shelby@enron.com,1624396.0,22884.0,39.0,13.0,14.0,...,SHELBY REX,1573324.0,True,869220.0,,211844.0,91.0,225.0,2003885.0,2493616.0
3,800000.0,,,,michael.kopper@enron.com,,118134.0,,,,...,KOPPER MICHAEL J,907502.0,True,985032.0,,224305.0,,,2652612.0,985032.0
4,1250000.0,,-262500.0,,christopher.calger@enron.com,,35818.0,144.0,199.0,25.0,...,CALGER CHRISTOPHER F,486.0,True,126027.0,,240189.0,2188.0,2598.0,1639297.0,126027.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,1000000.0,,,,philippe.bibi@enron.com,1465734.0,38559.0,40.0,23.0,8.0,...,BIBI PHILIPPE A,425688.0,,378082.0,,213625.0,1336.0,1607.0,2047593.0,1843816.0
142,1500000.0,,,,john.sherriff@enron.com,1835558.0,,92.0,28.0,23.0,...,SHERRIFF JOHN R,1852186.0,,1293424.0,,428780.0,2103.0,3187.0,4335388.0,3128982.0
143,,504610.0,,,dana.gibbs@enron.com,2218275.0,,12.0,0.0,0.0,...,GIBBS DANA R,,,,,,23.0,169.0,966522.0,2218275.0
144,200000.0,204075.0,,,tod.lindholm@enron.com,2549361.0,57727.0,,,,...,LINDHOLM TOD A,2630.0,,514847.0,,236457.0,,,875889.0,3064208.0


In [161]:
df['poi'].unique()

array([True, False, nan], dtype=object)

In [162]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 22 columns):
bonus                        82 non-null float64
deferral_payments            39 non-null float64
deferred_income              49 non-null float64
director_fees                17 non-null float64
email_address                111 non-null object
exercised_stock_options      102 non-null float64
expenses                     95 non-null float64
from_messages                86 non-null float64
from_poi_to_this_person      86 non-null float64
from_this_person_to_poi      86 non-null float64
loan_advances                4 non-null float64
long_term_incentive          66 non-null float64
name                         146 non-null object
other                        93 non-null float64
poi                          113 non-null object
restricted_stock             110 non-null float64
restricted_stock_deferred    18 non-null float64
salary                       95 non-null float64
shared_recei

In [163]:
df.describe()

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,long_term_incentive,other,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,82.0,39.0,49.0,17.0,102.0,95.0,86.0,86.0,86.0,4.0,66.0,93.0,110.0,18.0,95.0,86.0,86.0,125.0,126.0
mean,2374235.0,1642674.0,-1140475.0,166804.9,5987054.0,108728.9,608.790698,64.895349,41.232558,41962500.0,1470361.0,919065.0,2321741.0,166410.6,562194.3,1176.465116,2073.860465,5081526.0,6773957.0
std,10713330.0,5161930.0,4025406.0,319891.4,31062010.0,533534.8,1841.033949,86.979244,100.073111,47083210.0,5942759.0,4589253.0,12518280.0,4201494.0,2716369.0,1178.317641,2582.700981,29061720.0,38957770.0
min,70000.0,-102500.0,-27992890.0,3285.0,3285.0,148.0,12.0,0.0,0.0,400000.0,69223.0,2.0,-2604490.0,-7576788.0,477.0,2.0,57.0,148.0,-44093.0
25%,431250.0,81573.0,-694862.0,98784.0,527886.2,22614.0,22.75,10.0,1.0,1600000.0,281250.0,1215.0,254018.0,-389621.8,211816.0,249.75,541.25,394475.0,494510.2
50%,769375.0,227449.0,-159792.0,108579.0,1310814.0,46950.0,41.0,35.0,8.0,41762500.0,442035.0,52382.0,451740.0,-146975.0,259996.0,740.5,1211.0,1101393.0,1102872.0
75%,1200000.0,1002672.0,-38346.0,113784.0,2547724.0,79952.5,145.5,72.25,24.75,82125000.0,938672.0,362096.0,1002370.0,-75009.75,312117.0,1888.25,2634.75,2093263.0,2949847.0
max,97343620.0,32083400.0,-833.0,1398517.0,311764000.0,5235198.0,14368.0,528.0,609.0,83925000.0,48521930.0,42667590.0,130322300.0,15456290.0,26704230.0,5521.0,15149.0,309886600.0,434509500.0


### 資料分析

In [164]:
#犯人
#df_poi = df.iloc[:12,]

df_poi.info

<bound method DataFrame.info of         bonus  deferral_payments  deferred_income  director_fees  \
0   1750000.0                NaN       -3504386.0            NaN   
1   5600000.0                NaN              NaN            NaN   
2    200000.0                NaN          -4167.0            NaN   
3    800000.0                NaN              NaN            NaN   
4   1250000.0                NaN        -262500.0            NaN   
5         NaN            10259.0              NaN            NaN   
6         NaN                NaN              NaN            NaN   
7   1200000.0            27610.0        -144062.0            NaN   
8   7000000.0           202911.0        -300000.0            NaN   
9    600000.0                NaN              NaN            NaN   
10   700000.0                NaN              NaN            NaN   
11   700000.0           214678.0        -100000.0            NaN   

                   email_address  exercised_stock_options  expenses  \
0           

In [165]:
df_poi

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,name,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
0,1750000.0,,-3504386.0,,ken.rice@enron.com,19794175.0,46950.0,18.0,42.0,4.0,...,RICE KENNETH D,174839.0,True,2748364.0,,420636.0,864.0,905.0,505050.0,22542539.0
1,5600000.0,,,,jeff.skilling@enron.com,19250000.0,29336.0,108.0,88.0,30.0,...,SKILLING JEFFREY K,22122.0,True,6843672.0,,1111258.0,2042.0,3627.0,8682716.0,26093672.0
2,200000.0,,-4167.0,,rex.shelby@enron.com,1624396.0,22884.0,39.0,13.0,14.0,...,SHELBY REX,1573324.0,True,869220.0,,211844.0,91.0,225.0,2003885.0,2493616.0
3,800000.0,,,,michael.kopper@enron.com,,118134.0,,,,...,KOPPER MICHAEL J,907502.0,True,985032.0,,224305.0,,,2652612.0,985032.0
4,1250000.0,,-262500.0,,christopher.calger@enron.com,,35818.0,144.0,199.0,25.0,...,CALGER CHRISTOPHER F,486.0,True,126027.0,,240189.0,2188.0,2598.0,1639297.0,126027.0
5,,10259.0,,,joe.hirko@enron.com,30766064.0,77978.0,,,,...,HIRKO JOSEPH,2856.0,True,,,,,,91093.0,30766064.0
6,,,,,scott.yeager@enron.com,8308552.0,53947.0,,,,...,YEAGER F SCOTT,147950.0,True,3576206.0,,158403.0,,,360300.0,11884758.0
7,1200000.0,27610.0,-144062.0,,wes.colwell@enron.com,,16514.0,40.0,240.0,11.0,...,COLWELL WESLEY,101740.0,True,698242.0,,288542.0,1132.0,1758.0,1490344.0,698242.0
8,7000000.0,202911.0,-300000.0,,kenneth.lay@enron.com,34348384.0,99832.0,36.0,123.0,16.0,...,LAY KENNETH L,10359729.0,True,14761694.0,,1072321.0,2411.0,4273.0,103559793.0,49110078.0
9,600000.0,,,,ben.glisan@enron.com,384728.0,125978.0,16.0,52.0,6.0,...,GLISAN JR BEN F,200308.0,True,393818.0,,274975.0,874.0,873.0,1272284.0,778546.0


In [170]:
#藉由df_poi的資料發現，'restricted_stock_deferred'、'loan_advances'、'director_fees'、'email_address'對於poi並沒有直接影響

df2 = df.drop(['restricted_stock_deferred','loan_advances','director_fees','email_address'], axis = 1)

df2

Unnamed: 0,bonus,deferral_payments,deferred_income,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,long_term_incentive,name,other,poi,restricted_stock,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
0,1750000.0,,-3504386.0,19794175.0,46950.0,18.0,42.0,4.0,1617011.0,RICE KENNETH D,174839.0,True,2748364.0,420636.0,864.0,905.0,505050.0,22542539.0
1,5600000.0,,,19250000.0,29336.0,108.0,88.0,30.0,1920000.0,SKILLING JEFFREY K,22122.0,True,6843672.0,1111258.0,2042.0,3627.0,8682716.0,26093672.0
2,200000.0,,-4167.0,1624396.0,22884.0,39.0,13.0,14.0,,SHELBY REX,1573324.0,True,869220.0,211844.0,91.0,225.0,2003885.0,2493616.0
3,800000.0,,,,118134.0,,,,602671.0,KOPPER MICHAEL J,907502.0,True,985032.0,224305.0,,,2652612.0,985032.0
4,1250000.0,,-262500.0,,35818.0,144.0,199.0,25.0,375304.0,CALGER CHRISTOPHER F,486.0,True,126027.0,240189.0,2188.0,2598.0,1639297.0,126027.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,1000000.0,,,1465734.0,38559.0,40.0,23.0,8.0,369721.0,BIBI PHILIPPE A,425688.0,,378082.0,213625.0,1336.0,1607.0,2047593.0,1843816.0
142,1500000.0,,,1835558.0,,92.0,28.0,23.0,554422.0,SHERRIFF JOHN R,1852186.0,,1293424.0,428780.0,2103.0,3187.0,4335388.0,3128982.0
143,,504610.0,,2218275.0,,12.0,0.0,0.0,461912.0,GIBBS DANA R,,,,,23.0,169.0,966522.0,2218275.0
144,200000.0,204075.0,,2549361.0,57727.0,,,,175000.0,LINDHOLM TOD A,2630.0,,514847.0,236457.0,,,875889.0,3064208.0


# 補上缺漏值

In [173]:
#檢查 DataFrame 空缺值的狀態
def na_check(df2):
    df2_na = (df2.isnull().sum() / len(df2)) * 100
    df2_na = df2_na.drop(df2_na[df2_na == 0].index).sort_values(ascending=False)
    missing_data = pd.DataFrame({'Missing Ratio' :df2_na})
    display(missing_data.head(10))
na_check(df2)

Unnamed: 0,Missing Ratio
poi,22.60274


In [172]:
df2['deferral_payments'] = df2['deferral_payments'].fillna("0")
df2['deferred_income'] = df2['deferred_income'].fillna("0")
df2['long_term_incentive'] = df2['long_term_incentive'].fillna("0")
df2['bonus'] = df2['bonus'].fillna("0") 
df2['to_messages'] = df2['to_messages'].fillna("0") 
df2['shared_receipt_with_poi'] = df2['shared_receipt_with_poi'].fillna("0") 
df2['from_this_person_to_poi'] = df2['from_this_person_to_poi'].fillna("0") 
df2['from_poi_to_this_person'] = df2['from_poi_to_this_person'].fillna("0") 
df2['from_messages'] = df2['from_messages'].fillna("0") 
df2['other'] = df2['other'].fillna("0")
df2['expenses'] = df2['expenses'].fillna("0")
df2['exercised_stock_options'] = df2['exercised_stock_options'].fillna("0") 
df2['restricted_stock'] = df2['restricted_stock'].fillna("0") 
df2['total_payments'] = df2['total_payments'].fillna("0") 
df2['total_stock_value'] = df2['total_stock_value'].fillna("0")
df2['salary'] = df2['salary'].fillna(df2['salary'].mean())

In [183]:
df2.head(20)

Unnamed: 0,bonus,deferral_payments,deferred_income,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,long_term_incentive,name,other,poi,restricted_stock,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
0,1750000.0,0.0,-3504390.0,19794200.0,46950,18,42,4,1617010.0,RICE KENNETH D,174839.0,True,2748360.0,420636.0,864,905,505050.0,22542500.0
1,5600000.0,0.0,0.0,19250000.0,29336,108,88,30,1920000.0,SKILLING JEFFREY K,22122.0,True,6843670.0,1111258.0,2042,3627,8682720.0,26093700.0
2,200000.0,0.0,-4167.0,1624400.0,22884,39,13,14,0.0,SHELBY REX,1573320.0,True,869220.0,211844.0,91,225,2003880.0,2493620.0
3,800000.0,0.0,0.0,0.0,118134,0,0,0,602671.0,KOPPER MICHAEL J,907502.0,True,985032.0,224305.0,0,0,2652610.0,985032.0
4,1250000.0,0.0,-262500.0,0.0,35818,144,199,25,375304.0,CALGER CHRISTOPHER F,486.0,True,126027.0,240189.0,2188,2598,1639300.0,126027.0
5,0.0,10259.0,0.0,30766100.0,77978,0,0,0,0.0,HIRKO JOSEPH,2856.0,True,0.0,562194.3,0,0,91093.0,30766100.0
6,0.0,0.0,0.0,8308550.0,53947,0,0,0,0.0,YEAGER F SCOTT,147950.0,True,3576210.0,158403.0,0,0,360300.0,11884800.0
7,1200000.0,27610.0,-144062.0,0.0,16514,40,240,11,0.0,COLWELL WESLEY,101740.0,True,698242.0,288542.0,1132,1758,1490340.0,698242.0
8,7000000.0,202911.0,-300000.0,34348400.0,99832,36,123,16,3600000.0,LAY KENNETH L,10359700.0,True,14761700.0,1072321.0,2411,4273,103560000.0,49110100.0
9,600000.0,0.0,0.0,384728.0,125978,16,52,6,71023.0,GLISAN JR BEN F,200308.0,True,393818.0,274975.0,874,873,1272280.0,778546.0


# 套入模型

In [175]:
Train = df2[pd.notnull(df2['poi'])].sort_values(by=["name"])
Test = df2[~pd.notnull(df2['poi'])].sort_values(by=["name"])

In [127]:
print(Train.columns.values.tolist())  #獲取列名
print('--------------------------')
print(Test.columns.values.tolist())  #獲取列名


['bonus', 'deferral_payments', 'deferred_income', 'exercised_stock_options', 'expenses', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi', 'long_term_incentive', 'name', 'other', 'poi', 'restricted_stock', 'salary', 'shared_receipt_with_poi', 'to_messages', 'total_payments', 'total_stock_value']
--------------------------
['bonus', 'deferral_payments', 'deferred_income', 'exercised_stock_options', 'expenses', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi', 'long_term_incentive', 'name', 'other', 'poi', 'restricted_stock', 'salary', 'shared_receipt_with_poi', 'to_messages', 'total_payments', 'total_stock_value']


In [187]:
#更改順序
Train = Train[['poi','bonus', 'deferral_payments', 'deferred_income', 'exercised_stock_options', 'expenses', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi', 'long_term_incentive', 'other',  'restricted_stock', 'salary', 'shared_receipt_with_poi', 'to_messages', 'total_payments', 'total_stock_value']]
Test = Test[['bonus', 'deferral_payments', 'deferred_income', 'exercised_stock_options', 'expenses', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi', 'long_term_incentive', 'other',  'restricted_stock', 'salary', 'shared_receipt_with_poi', 'to_messages', 'total_payments', 'total_stock_value']]

In [207]:
Test['bonus']

131         400000
113       5.25e+06
139              0
141          1e+06
114       1.35e+06
138         300000
117          1e+06
120              0
116          3e+06
127         800000
129         425000
121         800000
134        2.5e+06
119          2e+06
143              0
115        1.5e+06
118        1.7e+06
123              0
130              0
124              0
144         200000
145        2.6e+06
135         600000
126              0
137         325000
128              0
132              0
122         100000
125              0
142        1.5e+06
133    9.73436e+07
140              0
136              0
Name: bonus, dtype: object

In [222]:
#隨機森林樹

rf = RandomForestClassifier(criterion='entropy',
                            n_estimators=1000,
                            min_samples_split=12,
                            min_samples_leaf=1,
                            oob_score=True, 
                            random_state=1,
                            n_jobs=-1) 

rf.fit(Train.iloc[:, 1:], Train.iloc[:, 0])
print("%.4f" % rf.oob_score_)

0.8938


In [198]:
# 設定要訓練的超參數組合
param_grid = {"criterion" : ["entropy", "gini"],
              "max_depth" : [4,6,8,10],
              "min_samples_split" : [2,5,10],
              "min_samples_leaf" : [2,5,10],
              "n_estimators" : [20,50, 100], # 使用 n 顆樹
              "max_features" : ['auto', 'sqrt'], # 如何選取 features
              "oob_score" : [True]
             }

## 建立搜尋物件，放入模型及參數組合字典 (n_jobs=-1 會使用全部 cpu 平行運算)
grid_search = GridSearchCV(estimator = RandomForestClassifier(random_state = 50),
                  param_grid = param_grid,
                  cv = 5,
                  scoring = "accuracy")

# 開始搜尋最佳參數
grid_result = grid_search.fit(Train.iloc[:, 1:], Train.iloc[:, 0])






In [199]:
# 印出最佳結果與最佳參數 
print("best_score  : %s" % grid_result.best_score_) 
print("best_params : %s \n" % grid_result.best_params_)

best_score  : 0.8938053097345132
best_params : {'criterion': 'entropy', 'max_depth': 4, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 20, 'oob_score': True} 



In [219]:
rf = grid_result.best_score_

In [221]:
# 訓練模型 
rf.fit(Train.iloc[:, 1:], Train.iloc[:, 0]) 

AttributeError: 'numpy.float64' object has no attribute 'fit'

In [195]:
# 隨機森林預測檔 
rf_res = rf.predict(Test) 
submit['poi'] = rf_res 
submit['poi'] = submit['poi'].astype(int) 
submit.to_csv('mid_submit.csv', index = False)