![](../src/logo.svg)

**© Jesús López**

Ask him any questions on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

# #03 | Train Test Split for Model Selection

## Load the Data

- The goal of this dataset is to predict whether a **bank's customers** (rows) are approved for a credit card (`target`)
- The prediction is based on their **socio-demographic characteristics** (columns)

First, import the data using pandas.

In [1]:
import pandas as pd #!

df_credit = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data',
                 na_values='?', header=None)

df_credit.rename(columns={15: 'target'}, inplace=True)
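# Assign the result back instead of a chained in-place replace, which newer pandas versions may not apply
df_credit['target'] = df_credit['target'].map({'+': 1, '-': 0})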
df_credit.columns = [str(i) for i in df_credit.columns]
df_credit

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,target
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,280.0,824,1
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260.0,0,0
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,200.0,394,0
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,200.0,1,0
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280.0,750,0


## Preprocess the Data

Clean the data by dropping the rows that contain missing values.

In [2]:
df_credit=df_credit.dropna()

Encode the categorical columns as dummy variables so that the classifiers can be trained on them.

In [3]:
df_credit = pd.get_dummies(df_credit,drop_first=True)
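
To see what `pd.get_dummies(..., drop_first=True)` does, here is a minimal sketch on a tiny, made-up dataframe. The column names `age` and `housing` are hypothetical and not part of the credit dataset. `dtype=int` is passed only so the dummies display as 0/1; depending on the pandas version, the default may be booleans instead.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df_toy = pd.DataFrame({
    'age': [30, 45, 22],
    'housing': ['own', 'rent', 'free'],
})

# Numeric columns are kept as they are; each categorical column becomes
# one 0/1 column per category, minus the first category (drop_first=True)
df_encoded = pd.get_dummies(df_toy, drop_first=True, dtype=int)
print(df_encoded)
```

The category `free` comes first alphabetically, so it is the one dropped: a row with 0 in both `housing_own` and `housing_rent` stands for `free`.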

In [4]:
df_credit

Unnamed: 0,1,2,7,10,13,14,target,0_b,3_u,3_y,...,6_j,6_n,6_o,6_v,6_z,8_t,9_t,11_t,12_p,12_s
0,30.83,0.000,1.25,1,202.0,0,1,1,1,0,...,0,0,0,1,0,1,1,0,0,0
1,58.67,4.460,3.04,6,43.0,560,1,0,1,0,...,0,0,0,0,0,1,1,0,0,0
2,24.50,0.500,1.50,0,280.0,824,1,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,27.83,1.540,3.75,5,100.0,3,1,1,1,0,...,0,0,0,1,0,1,1,1,0,0
4,20.17,5.625,1.71,0,120.0,0,1,1,1,0,...,0,0,0,1,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,21.08,10.085,1.25,0,260.0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
686,22.67,0.750,2.00,2,200.0,394,0,0,1,0,...,0,0,0,1,0,0,1,1,0,0
687,25.25,13.500,2.00,1,200.0,1,0,0,0,1,...,0,0,0,0,0,0,1,1,0,0
688,17.92,0.205,0.04,0,280.0,750,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0


## Feature Selection

Separate the dataframe into features (independent variables) and target (dependent variable).

In [5]:
features = df_credit.drop(columns = 'target')
target = df_credit['target']

In [6]:
features

Unnamed: 0,1,2,7,10,13,14,0_b,3_u,3_y,4_gg,...,6_j,6_n,6_o,6_v,6_z,8_t,9_t,11_t,12_p,12_s
0,30.83,0.000,1.25,1,202.0,0,1,1,0,0,...,0,0,0,1,0,1,1,0,0,0
1,58.67,4.460,3.04,6,43.0,560,0,1,0,0,...,0,0,0,0,0,1,1,0,0,0
2,24.50,0.500,1.50,0,280.0,824,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,27.83,1.540,3.75,5,100.0,3,1,1,0,0,...,0,0,0,1,0,1,1,1,0,0
4,20.17,5.625,1.71,0,120.0,0,1,1,0,0,...,0,0,0,1,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,21.08,10.085,1.25,0,260.0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
686,22.67,0.750,2.00,2,200.0,394,0,1,0,0,...,0,0,0,1,0,0,1,1,0,0
687,25.25,13.500,2.00,1,200.0,1,0,0,1,0,...,0,0,0,0,0,0,1,1,0,0
688,17.92,0.205,0.04,0,280.0,750,1,1,0,0,...,0,0,0,1,0,0,0,0,0,0


In [7]:
target

0      1
1      1
2      1
3      1
4      1
      ..
685    0
686    0
687    0
688    0
689    0
Name: target, Length: 653, dtype: int64

## Build & Compare Models

To avoid repeating code, we define a function that fits a model and returns its score.

In [8]:
def best_score(model):
    model.fit(X=features,y=target)
    result = model.score(X=features,y=target)
    return result
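
For scikit-learn classifiers, `model.score()` returns the mean accuracy, that is, the share of rows predicted correctly. The following sketch checks this on a small synthetic dataset; the data comes from `make_classification` and the `_demo` names are assumptions for illustration, not the credit data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model_demo = KNeighborsClassifier()
model_demo.fit(X=X_demo, y=y_demo)

# score() on a classifier should match accuracy_score() on its predictions
score_demo = model_demo.score(X=X_demo, y=y_demo)
accuracy_demo = accuracy_score(y_true=y_demo, y_pred=model_demo.predict(X_demo))
print(score_demo, accuracy_demo)
```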

Import the required library and instantiate each model before running the test on the data.

### `DecisionTreeClassifier()`

In [9]:
from sklearn.tree import DecisionTreeClassifier
model_dt = DecisionTreeClassifier()

In [10]:
best_score(model_dt)

1.0

### `RandomForestClassifier()`

In [11]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()

In [12]:
best_score(model_rf)

1.0

### `KNeighborsClassifier()`

In [13]:
from sklearn.neighbors import KNeighborsClassifier

model_kn=KNeighborsClassifier()

In [14]:
best_score(model_kn)

0.7840735068912711

## Which Model is the Best?

In [30]:
modelss = pd.DataFrame()

In [31]:
modelss['name'] = (model_kn,model_rf,model_dt)

In [33]:
modelss['outcome'] = modelss.name.apply(best_score)

In [34]:
modelss

Unnamed: 0,name,outcome
0,KNeighborsClassifier(),0.784074
1,"(DecisionTreeClassifier(max_features='sqrt', r...",1.0
2,DecisionTreeClassifier(),1.0


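The decision tree and random forest classifiers both reach a score of 1.0, so they look like the best models. Note, however, that this score was computed on the same data the models were trained on.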
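
A score of 1.0 on the training data does not necessarily mean the model has learned anything useful. As a rough sketch, the example below trains a decision tree on pure noise (random features and random labels, generated here only for illustration): the tree can still memorize the rows it has seen, while its score on unseen rows should stay around chance level.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random features and random 0/1 labels: there is nothing real to learn
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(400, 5))
y_noise = rng.integers(0, 2, size=400)

# First 300 rows are used for fitting, last 100 are held out
X_seen, y_seen = X_noise[:300], y_noise[:300]
X_unseen, y_unseen = X_noise[300:], y_noise[300:]

model_noise = DecisionTreeClassifier(random_state=0)
model_noise.fit(X=X_seen, y=y_seen)

score_seen = model_noise.score(X=X_seen, y=y_seen)        # memorized rows
score_unseen = model_noise.score(X=X_unseen, y=y_unseen)  # rows never seen
print(score_seen, score_unseen)
```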

## `train_test_split()` & Compare Again

To get a more reliable estimate of how each model performs on new data, we will use the `train_test_split()` function.

Import `train_test_split` from sklearn, separate the dataset into train and test sets, and create a function that fits each classifier on the train set and scores it on the test set.

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=42)
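
To make the effect of `test_size` and `random_state` more concrete, here is a small sketch on placeholder arrays with the same number of rows as the cleaned dataset (653). The arrays and the `_demo`, `_tr`, `_te` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 653 rows, like the cleaned credit dataset
X_demo = np.arange(653 * 2).reshape(653, 2)
y_demo = np.arange(653) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# The test size is rounded up: ceil(0.33 * 653) = 216, leaving 437 for training
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` keeps the comparison fair, because every model is then scored on the same held-out rows.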

In [17]:
def best_score_split(model):
    model.fit(X=X_train,y=y_train)
    result = model.score(X=X_test,y=y_test)
    return result

### `DecisionTreeClassifier()`

In [18]:
best_score_split(model_dt)

0.8194444444444444

### `RandomForestClassifier()`

In [19]:
best_score_split(model_rf)

0.8611111111111112

### `KNeighborsClassifier()`

In [20]:
best_score_split(model_kn)

0.6666666666666666

In [21]:
All = pd.DataFrame()

In [22]:
All['models']=[model_dt,model_rf,model_kn]

In [23]:
All

Unnamed: 0,models
0,DecisionTreeClassifier()
1,"(DecisionTreeClassifier(max_features='sqrt', r..."
2,KNeighborsClassifier()


In [24]:
All['same_data'] = All.models.apply(best_score)

In [25]:
All['test_data']=All.models.apply(best_score_split)

In [26]:
All

Unnamed: 0,models,same_data,test_data
0,DecisionTreeClassifier(),1.0,0.819444
1,"(DecisionTreeClassifier(max_features='sqrt', r...",1.0,0.865741
2,KNeighborsClassifier(),0.784074,0.666667
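
The random forest score in this table (0.865741) differs slightly from the one printed earlier (0.861111). A likely reason is that `RandomForestClassifier()` uses randomness when building its trees and the model is refitted on every call. As a sketch on synthetic data (the `forest_score` helper and the `_demo` names are hypothetical), fixing `random_state` on the model makes the score repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

def forest_score(seed):
    """Fit a fresh random forest with a fixed seed and score it on the test set."""
    model = RandomForestClassifier(random_state=seed)
    model.fit(X=X_tr, y=y_tr)
    return model.score(X=X_te, y=y_te)

# Same seed and same data give the same score on every run
score_a = forest_score(0)
score_b = forest_score(0)
print(score_a == score_b)
```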


In [27]:
All.style.background_gradient()

Unnamed: 0,models,same_data,test_data
0,DecisionTreeClassifier(),1.0,0.819444
1,RandomForestClassifier(),1.0,0.865741
2,KNeighborsClassifier(),0.784074,0.666667


## Which is the Best Model with `train_test_split()`?

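`RandomForestClassifier()`: it achieves the highest score on the test data (about 0.87, compared with 0.82 for the decision tree and 0.67 for k-nearest neighbors).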

# Achieved Goals

_Double click on **this cell** and place an `X` inside the square brackets (i.e., [X]) if you think you understand the goal:_

- [x] Understand the necessity to **create functions** to avoid the repetition of the code.
- [x] **Bootstrapping** as a way to create an artificial dataset that helps to reduce the bias.
- [x] **Classification threshold** to predict categories out of probabilities.
- [x] Different ways to **compare classification models**.
- [x] Understand the importance of checking how good a model is with **data not seen during training**.

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.