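### Assignment
#### Goal: Build a deep learning model to predict whether a customer will churn

- Question 1: Split the data into training and test sets
- Question 2: Standardize the data
- Question 3: Build a deep learning model with Keras to predict customer churn
- Question 4: Evaluate the model's accuracy
- Question 5: Plot the ROC curve and compute the AUC
- Question 6: Compare the AUC and ROC curves of the ANN, SVM, Gradient Boosting, Random Forest, Logistic Regression, and Decision Tree models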

In [10]:
import pandas
df = pandas.read_csv('https://raw.githubusercontent.com/ywchiu/tibamedl/master/Data/Churn_Modelling.csv', header = 0 )
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [12]:
# df.iloc[rows, columns]
df = df.iloc[:, 3:]  # keep all rows; drop the first three columns (RowNumber, CustomerId, Surname)
df.head(3)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1


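## What format must the data be in before building a machine learning model?

- The data must be a structured table (every row has the same number of columns, every column has its own type, and the data is rectangular)

- Every column must be numeric (FLOAT, INT)

- Every value must be present, with no missing values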
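
The three requirements above can also be checked programmatically. The following is a minimal sketch: the helper name `check_model_ready` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_model_ready(frame):
    """Report which columns would still block model training."""
    # Columns whose dtype is not numeric (for example, text columns)
    non_numeric = [c for c in frame.columns if not is_numeric_dtype(frame[c])]
    # Columns that contain at least one missing value
    missing = frame.isna().sum()
    with_missing = missing[missing > 0].index.tolist()
    return {
        "is_dataframe": isinstance(frame, pd.DataFrame),
        "non_numeric_columns": non_numeric,
        "columns_with_missing": with_missing,
    }


# Hypothetical toy data that mimics a few columns of the churn table
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Balance": [0.0, None, 159660.8],
})
report = check_model_ready(toy)
print(report)
```

A frame is ready for modelling when both lists in the report are empty.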

#### The data must be a structured table (every row has the same number of columns, every column has its own type, and the data is rectangular)

In [13]:
type(df)

pandas.core.frame.DataFrame

#### Every column must be numeric (FLOAT, INT)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB


In [15]:
df.select_dtypes('object').head()

Unnamed: 0,Geography,Gender
0,France,Female
1,Spain,Female
2,France,Female
3,France,Female
4,Spain,Female


In [17]:
df['Geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [19]:
geo = pandas.get_dummies(df['Geography'])
del geo['Spain']
geo.head(3)

Unnamed: 0,France,Germany
0,1,0
1,0,0
2,1,0


In [20]:
df['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [22]:
gender = pandas.get_dummies(df['Gender'])
del gender['Female']
gender.head(3)

Unnamed: 0,Male
0,0
1,0
2,0
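
A categorical column with k categories needs only k-1 indicator columns, which is why the notebook deletes one dummy per variable. As a sketch of an alternative, `pandas.get_dummies` can encode several columns and drop one level each in a single call. Note that `drop_first=True` drops the first category in sorted order (`France` and `Female`), so the baseline differs from the notebook, which drops `Spain`. The toy frame is a hypothetical stand-in for the churn data.

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 39, 43],
})

# Encode both columns at once; drop_first removes one level per variable,
# and dtype=int gives 0/1 integers regardless of the pandas version
encoded = pd.get_dummies(
    toy, columns=["Geography", "Gender"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
print(encoded)
```

The original text columns are removed automatically, so the separate `concat` and `del` steps are not needed with this approach.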


In [24]:
df = pandas.concat([gender, geo, df],axis  = 1)

In [25]:
df.head(3)

Unnamed: 0,Male,France,Germany,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,1,0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,0,0,0,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,0,1,0,502,France,Female,42,8,159660.8,3,1,0,113931.57,1


In [26]:
del df['Geography']

In [27]:
del df['Gender']

In [28]:
df.head()

Unnamed: 0,Male,France,Germany,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,1,0,619,42,2,0.0,1,1,1,101348.88,1
1,0,0,0,608,41,1,83807.86,1,0,1,112542.58,0
2,0,1,0,502,42,8,159660.8,3,1,0,113931.57,1
3,0,1,0,699,39,1,0.0,2,0,0,93826.63,0
4,0,0,0,850,43,2,125510.82,1,1,1,79084.1,0


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
Male               10000 non-null uint8
France             10000 non-null uint8
Germany            10000 non-null uint8
CreditScore        10000 non-null int64
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(7), uint8(3)
memory usage: 732.5 KB


#### Every value must be present, with no missing values

In [31]:
df.isna().sum()

Male               0
France             0
Germany            0
CreditScore        0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [35]:
X =  df.iloc[:,:-1]
#X.head()
y =  df.iloc[:,-1]
#y.head()

### Question 1: Split the data into training and test sets

In [36]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X,y, test_size = 0.2, random_state = 42 )

In [37]:
train_X.shape

(8000, 11)

In [38]:
test_X.shape

(2000, 11)

In [39]:
train_y.shape

(8000,)

In [40]:
test_y.shape

(2000,)
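
When one class is much rarer than the other, a purely random split can leave the two sets with slightly different class ratios. A possible refinement is the `stratify` argument of `train_test_split`, which preserves the class proportions. The sketch below uses synthetic features and an assumed 20% positive rate rather than the real churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 11))      # 11 features, like the notebook's X
y_demo = np.array([1] * 200 + [0] * 800)  # assumed 20% positive class

# stratify=y_demo keeps the 20/80 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to adding `stratify=y` to the existing call.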

### Question 2: Standardize the data

In [41]:
from sklearn.preprocessing import StandardScaler
# (data - mean) / std
sc = StandardScaler()
scaled_X = sc.fit_transform(train_X)

In [44]:
scaled_X[0:3,:]

array([[ 0.91324755,  1.00150113, -0.57946723,  0.35649971, -0.6557859 ,
         0.34567966, -1.21847056,  0.80843615,  0.64920267,  0.97481699,
         1.36766974],
       [ 0.91324755, -0.99850112,  1.72572313, -0.20389777,  0.29493847,
        -0.3483691 ,  0.69683765,  0.80843615,  0.64920267,  0.97481699,
         1.6612541 ],
       [ 0.91324755, -0.99850112, -0.57946723, -0.96147213, -1.41636539,
        -0.69539349,  0.61862909, -0.91668767,  0.64920267, -1.02583358,
        -0.25280688]])

In [42]:
test_X = sc.transform(test_X)

In [45]:
test_X[0:3,:]

array([[ 0.91324755, -0.99850112,  1.72572313, -0.57749609, -0.6557859 ,
        -0.69539349,  0.32993735,  0.80843615, -1.54035103, -1.02583358,
        -1.01960511],
       [ 0.91324755,  1.00150113, -0.57946723, -0.29729735,  0.3900109 ,
        -1.38944225, -1.21847056,  0.80843615,  0.64920267,  0.97481699,
         0.79888291],
       [-1.09499335, -0.99850112, -0.57946723, -0.52560743,  0.48508334,
        -0.3483691 , -1.21847056,  0.80843615,  0.64920267, -1.02583358,
        -0.72797953]])
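
The `(data - mean) / std` comment can be verified by hand. The scaler learns the mean and standard deviation from the training data only and reuses them for the test data, which is why the notebook calls `fit_transform` on the training set and `transform` on the test set. The small arrays below are made-up values for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test data with two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train)  # learns mean/std from train
test_scaled = sc_demo.transform(test)        # reuses the training mean/std

# Manual computation; StandardScaler uses the population std (ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```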

### Question 3: Build a deep learning model with Keras to predict customer churn

In [46]:
import keras

Using TensorFlow backend.


In [47]:
from keras.layers import Dense, Dropout

In [49]:
?Dense

In [51]:
model = keras.Sequential()
model.add(Dense(units = 5, activation='relu', input_shape = (11,) ))
model.add(Dense(units = 5, activation='relu' ))
model.add(Dense(units = 1, activation='sigmoid' ))
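
Each `Dense` layer holds `inputs × units` weights plus `units` biases. The arithmetic below counts the trainable parameters of the three layers defined above, which is the figure `model.summary()` would be expected to report. The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, units):
    """Weights (n_inputs * units) plus one bias per unit."""
    return n_inputs * units + units


layer_units = [5, 5, 1]  # units of the three Dense layers in the model
n_inputs = 11            # number of input features (input_shape=(11,))

total = 0
for units in layer_units:
    n = dense_param_count(n_inputs, units)
    print(f"Dense({units}): {n} parameters")
    total += n
    n_inputs = units     # the next layer receives this layer's outputs

print("total:", total)
```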

In [52]:
?model.compile

In [53]:
model.compile('adam', loss = 'binary_crossentropy', metrics = ['acc'])
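
The `binary_crossentropy` loss averages `-(y·log(p) + (1-y)·log(1-p))` over the samples, where `p` is the sigmoid output. The NumPy sketch below reproduces that formula; the `eps` clipping value is an assumption chosen to avoid `log(0)`, not a documented Keras setting.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithm stays finite
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


# Confident correct predictions give a small loss,
# confident wrong predictions give a large one
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```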


In [58]:
history = model.fit(scaled_X, train_y,
                    batch_size=32,
                    epochs=10,
                    verbose=1,
                    validation_data=(test_X, test_y))

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Question 4: Evaluate the model's accuracy
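
One possible way to answer this question is to threshold the predicted probabilities at 0.5 and compare them with the true labels. The sketch below uses made-up labels and probabilities; the helper name `evaluate_binary` is hypothetical. Because accuracy alone can be misleading when the classes are imbalanced, the confusion matrix is returned as well.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix


def evaluate_binary(y_true, y_prob, threshold=0.5):
    """Turn probabilities into 0/1 predictions and score them."""
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    return accuracy_score(y_true, y_pred), confusion_matrix(y_true, y_pred)


# Made-up labels and predicted probabilities for illustration
y_true = [0, 0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.7, 0.8, 0.4]
acc, cm = evaluate_binary(y_true, y_prob)
print("accuracy:", acc)
print(cm)  # rows are true classes, columns are predicted classes

# With the notebook's objects this would be called as (not executed here):
# acc, cm = evaluate_binary(test_y, model.predict(test_X))
# model.evaluate(test_X, test_y) should likewise return [loss, acc]
```
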
### Question 5: Plot the ROC curve and compute the AUC
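
A sketch of one approach with scikit-learn: `roc_curve` returns the false and true positive rates at every threshold, and `auc` integrates the curve. The helper names `roc_points` and `plot_roc` and the toy scores are assumptions for illustration; the plotting function is defined but not called here.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc


def roc_points(y_true, y_score):
    """Return FPR, TPR and the area under the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, np.asarray(y_score).ravel())
    return fpr, tpr, auc(fpr, tpr)


def plot_roc(fpr, tpr, roc_auc, label="ANN"):
    """Draw one ROC curve together with the diagonal chance line."""
    import matplotlib.pyplot as plt  # imported here so the metrics run without matplotlib
    plt.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend(loc="lower right")
    plt.show()


# Toy labels and scores for illustration
fpr, tpr, roc_auc = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("AUC:", roc_auc)

# With the notebook's objects (not executed here):
# fpr, tpr, roc_auc = roc_points(test_y, model.predict(test_X))
# plot_roc(fpr, tpr, roc_auc)
```
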
### Question 6: Compare the AUC and ROC curves of the ANN, SVM, Gradient Boosting, Random Forest, Logistic Regression, and Decision Tree models
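
The comparison can be organised as a loop over classifiers that share the same training and test data. The sketch below rests on two assumptions: it uses synthetic data from `make_classification` instead of the churn table, and it substitutes scikit-learn's `MLPClassifier` for the Keras ANN so that every model exposes `predict_proba`. The hyperparameters are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data: 11 features, about 20% positives
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)
scaler = StandardScaler().fit(X_tr)  # fit on training data only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "ANN (MLPClassifier stand-in)": MLPClassifier(
        hidden_layer_sizes=(5, 5), max_iter=1000, random_state=42),
    "SVM": SVC(probability=True, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

auc_by_model = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class
    auc_by_model[name] = roc_auc_score(y_te, scores)

# Print the models from highest to lowest AUC
for name, value in sorted(auc_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: AUC = {value:.3f}")
```

To compare the ROC curves themselves, `roc_curve` can be called on each model's `scores` inside the loop and every curve drawn on the same axes.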