# XGBoost Classifier

## Part 1 - Data Preprocessing

### Importing the dataset

In [None]:
import pandas as pd
dataset = pd.read_csv('churn_modelling.csv')

In [None]:
dataset.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


### Checking missing data

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10000 non-null  int64  
 1   Surname          10000 non-null  object 
 2   CreditScore      10000 non-null  int64  
 3   Geography        10000 non-null  object 
 4   Gender           10000 non-null  object 
 5   Age              10000 non-null  int64  
 6   Tenure           10000 non-null  int64  
 7   Balance          10000 non-null  float64
 8   NumOfProducts    10000 non-null  int64  
 9   HasCrCard        10000 non-null  int64  
 10  IsActiveMember   10000 non-null  int64  
 11  EstimatedSalary  10000 non-null  float64
 12  Exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(3)
memory usage: 1015.8+ KB


### Handling categorical variables

CustomerId and Surname columns

In [None]:
dataset.drop(['CustomerId', 'Surname'], axis = 1, inplace = True)

In [None]:
dataset.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


Geography column

In [None]:
dataset['Geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [None]:
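# dtype = int keeps the dummies as 0/1 integers (pandas >= 2.0 returns booleans by default)
geography_dummies = pd.get_dummies(dataset['Geography'], drop_first = True, dtype = int)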

In [None]:
geography_dummies

Unnamed: 0,Germany,Spain
0,0,0
1,0,1
2,0,0
3,0,0
4,0,1
...,...,...
9995,0,0
9996,0,0
9997,0,0
9998,1,0


In [None]:
dataset = pd.concat([geography_dummies, dataset], axis = 1)

In [None]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,0,1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,0,0,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,0,0,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,0,1,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [None]:
dataset.drop(['Geography'], axis = 1, inplace = True)

In [None]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,0,619,Female,42,2,0.0,1,1,1,101348.88,1
1,0,1,608,Female,41,1,83807.86,1,0,1,112542.58,0
2,0,0,502,Female,42,8,159660.8,3,1,0,113931.57,1
3,0,0,699,Female,39,1,0.0,2,0,0,93826.63,0
4,0,1,850,Female,43,2,125510.82,1,1,1,79084.1,0


Gender column

In [None]:
dataset['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [None]:
dataset['Gender'] = dataset['Gender'].apply(lambda x: 0 if x == 'Female' else 1)

In [None]:
dataset.head(10)

Unnamed: 0,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,0,619,0,42,2,0.0,1,1,1,101348.88,1
1,0,1,608,0,41,1,83807.86,1,0,1,112542.58,0
2,0,0,502,0,42,8,159660.8,3,1,0,113931.57,1
3,0,0,699,0,39,1,0.0,2,0,0,93826.63,0
4,0,1,850,0,43,2,125510.82,1,1,1,79084.1,0
5,0,1,645,1,44,8,113755.78,2,1,0,149756.71,1
6,0,0,822,1,50,7,0.0,2,1,1,10062.8,0
7,1,0,376,0,29,4,115046.74,4,1,0,119346.88,1
8,0,0,501,1,44,4,142051.07,2,0,1,74940.5,0
9,0,0,684,1,27,2,134603.88,1,1,1,71725.73,0


### Creating the Training Set and the Test Set

Getting the inputs and output

In [None]:
X = dataset.iloc[:, :-1].values

In [None]:
y = dataset.iloc[:, -1].values

In [None]:
X

array([[0.0000000e+00, 0.0000000e+00, 6.1900000e+02, ..., 1.0000000e+00,
        1.0000000e+00, 1.0134888e+05],
       [0.0000000e+00, 1.0000000e+00, 6.0800000e+02, ..., 0.0000000e+00,
        1.0000000e+00, 1.1254258e+05],
       [0.0000000e+00, 0.0000000e+00, 5.0200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 1.1393157e+05],
       ...,
       [0.0000000e+00, 0.0000000e+00, 7.0900000e+02, ..., 0.0000000e+00,
        1.0000000e+00, 4.2085580e+04],
       [1.0000000e+00, 0.0000000e+00, 7.7200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 9.2888520e+04],
       [0.0000000e+00, 0.0000000e+00, 7.9200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 3.8190780e+04]])

In [None]:
y

array([1, 0, 1, ..., 1, 1, 0])

Getting the Training Set and the Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Part 2 - Building and training the model

### Building the model

In [None]:
import xgboost
model = xgboost.XGBClassifier(max_depth = 4, learning_rate = 0.1, n_estimators = 100)

### Training the model

In [None]:
model.fit(X_train, y_train)

XGBClassifier(max_depth=4)

### Inference

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
y_test

array([0, 1, 0, ..., 0, 0, 0])

### Predicting the result of a single observation

**Homework**

Use our model to predict whether the customer with the following information will leave the bank:

Geography: France

Credit Score: 600

Gender: Male

Age: 40 years old

Tenure: 3 years

Balance: \$ 60000

Number of Products: 2

Does this customer have a credit card? Yes

Is this customer an Active Member: Yes

Estimated Salary: \$ 50000

So, should we say goodbye to that customer?

In [None]:
model.predict([[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])

array([0])

**Solution**

Therefore, our model predicts class 0: this customer stays with the bank!

**Important note 1:** Notice that the feature values were all passed inside a double pair of square brackets. That's because the "predict" method expects a 2D array as its input, with one row per observation and one column per feature. The double pair of square brackets makes the input exactly such a 2D array, here with a single row.

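**Important note 2:** Notice also that "France" was not passed as a string but as "0, 0" in the first two columns. That's because the "predict" method expects the dummy values of the Geography variable: with drop_first = True, France is the dropped baseline category, so both the Germany and Spain dummies are 0. Likewise, "Male" is passed as 1 in the Gender column.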
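
To make the two notes above concrete, here is a minimal sketch of a helper that builds the 2D input row from raw customer attributes. The function name `encode_customer` and its parameters are hypothetical (they are not part of the original notebook). It assumes the column order shown by `dataset.head()` after preprocessing (Germany, Spain, CreditScore, Gender, and so on) and the encodings used above (France as the dropped baseline, Female = 0, Male = 1).

```python
def encode_customer(geography, credit_score, gender, age, tenure, balance,
                    num_of_products, has_cr_card, is_active_member,
                    estimated_salary):
    """Hypothetical helper: return a 2D list (one row) in the training column order."""
    # Dummy variables: France is the dropped baseline, so it maps to (0, 0).
    germany = 1 if geography == 'Germany' else 0
    spain = 1 if geography == 'Spain' else 0
    # Same binary encoding as the Gender column: Female -> 0, anything else -> 1.
    gender_code = 0 if gender == 'Female' else 1
    row = [germany, spain, credit_score, gender_code, age, tenure, balance,
           num_of_products, int(has_cr_card), int(is_active_member),
           estimated_salary]
    return [row]  # outer list makes it 2D: one observation, eleven features

homework_customer = encode_customer('France', 600, 'Male', 40, 3, 60000, 2,
                                    True, True, 50000)
print(homework_customer)
```

The printed row matches the list passed to the model in the homework cell, so it could be handed to `model.predict` unchanged.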

## Part 3 - Evaluating the model

### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[1526,   69],
       [ 199,  206]])

### Accuracy

In [None]:
(1526+206)/(1526+206+69+199)

0.866

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.866
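
Accuracy alone can hide how the model does on the minority class (customers who left). As a rough sketch, the confusion matrix printed above can be unpacked by hand into accuracy, precision, and recall for the Exited = 1 class. The variable names are illustrative, and the layout assumes scikit-learn's convention of true labels in rows and predicted labels in columns.

```python
# Confusion matrix from above: rows = true class, columns = predicted class.
cm = [[1526, 69],
      [199, 206]]
tn, fp = cm[0]  # true class 0: correctly kept, wrongly flagged as leaving
fn, tp = cm[1]  # true class 1: missed leavers, correctly flagged leavers

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # of the customers predicted to leave, how many did
recall = tp / (tp + fn)     # of the customers who left, how many were caught

print("Accuracy:  {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall:    {:.3f}".format(recall))
```

Under these numbers the model catches only about half of the customers who actually left (206 of 405), even though the overall accuracy is 0.866.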

### k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model,
                             X = X,
                             y = y,
                             scoring = 'accuracy',
                             cv = 10)
print("Average Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Average Accuracy: 86.51 %
Standard Deviation: 0.69 %
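
For intuition about what `cv = 10` does, the following sketch splits sample indices into ten folds and holds each one out in turn. It is a simplification: the function `k_fold_indices` is hypothetical, and it uses plain contiguous folds, whereas scikit-learn uses stratified folds by default when an integer `cv` is given with a classifier.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for simple contiguous k-fold splits."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10000, 10))
for fold_number, (train, test) in enumerate(folds[:3], start=1):
    print(fold_number, len(train), len(test))
```

Each observation lands in a test fold exactly once, so with 10,000 rows the reported average is taken over ten models, each trained on 9,000 rows and scored on the remaining 1,000.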
