<a href="https://colab.research.google.com/github/zhangwiki86-soton/Python-Files-for-Practice/blob/main/classification_bank_churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification example (Bank Churn dataset)

This notebook introduces classification using the scikit-learn package. We will build a logistic regression model (a classification model, despite its name). We start by loading the data with pandas, then select the columns needed to train the model, fit the model and make predictions. We will also look at how to handle categorical variables. The examples use the [bank churn dataset](course_datasets.md#bank-churn). We will visualise the data using the matplotlib package.

## Setup

The import statements below use numpy, pandas, matplotlib and several modules from scikit-learn.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import OneHotEncoder


Load the bank churn dataset using pandas and display the first few rows.

In [2]:
df = pd.read_csv('https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/Churn.csv')
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
df.head()

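|   | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |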


To keep things simple, we will use only three columns as the features of the model: Gender, Geography and Age. The target is the Exited column. Note that:

* Gender: (Male, Female)
* Geography: (France, Germany, Spain)
* Age:  an integer

In [3]:
X = df[['Gender', 'Geography', 'Age']]
y = df['Exited']

Split the data into training and test datasets, holding out 20% of the rows for testing.

In [4]:
# test_size=0.2 puts 20% of the rows in the test set and the remaining 80% in the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'X_train\n{X_train[:5]}')
print(f'y_train\n{y_train[:5]}')

X_train
      Gender Geography  Age
9254    Male    France   32
1561    Male   Germany   42
1670    Male     Spain   24
6087  Female    France   27
6669    Male    France   56
y_train
9254    0
1561    0
1670    1
6087    1
6669    1
Name: Exited, dtype: int64
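
As a quick check of what the split does, the sketch below runs `train_test_split` on a small made-up dataset and prints the size of each part. The arrays `X_demo` and `y_demo` are hypothetical and not part of the bank churn data. The `stratify` argument is not used in this notebook; it is shown only as an option that keeps the share of each class the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 50 rows, 2 features, 20% positive labels
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 40 + [1] * 10)

# Same settings as the notebook, plus stratify (an optional extra, assumed here)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # number of rows and columns in each part
print(y_tr.mean(), y_te.mean())  # share of positive labels in each part
```

With `test_size=0.2`, 40 of the 50 rows go to the training part and 10 to the test part.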


## One-hot encoding of categorical features

Gender and Geography are categorical variables. We can use one-hot encoding to convert them into a numeric format suitable for machine learning models. There are two common ways of doing this: the pandas get_dummies() function or the scikit-learn OneHotEncoder class. This notebook uses OneHotEncoder.

Note that without drop='first' the encoding would create 3 columns for Geography. For the values France, Germany and Spain, in that order, the columns are:

* Geography_France: [1, 0, 0]
* Geography_Germany: [0, 1, 0]
* Geography_Spain: [0, 0, 1]

With drop='first' the encoding generates only 2 columns:

* Geography_Germany: [0, 1, 0]
* Geography_Spain: [0, 0, 1]

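When all three columns are present, they are perfectly collinear: the three values in each row always sum to 1, so any one column can be derived from the other two (if Germany=0 and Spain=0, then France must equal 1). The extra column is redundant, so we drop it. The dropped category, France, becomes the baseline that the other two are compared against.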
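
The pandas alternative mentioned above can be sketched as follows. This is a minimal illustration on a made-up four-row frame (`toy` is hypothetical, not the churn data), showing the effect of `drop_first` in `pd.get_dummies`.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the bank churn dataset)
toy = pd.DataFrame({'Geography': ['France', 'Germany', 'Spain', 'France']})

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy, columns=['Geography'], dtype=int)

# With drop_first=True: the first category (France) is dropped
reduced = pd.get_dummies(toy, columns=['Geography'], drop_first=True, dtype=int)

print(full)
print(reduced)

# In the full encoding, every row sums to 1, which is the redundancy described above
print(full.sum(axis=1).tolist())
```

A practical difference is that OneHotEncoder learns its categories from the training data and reuses them on the test data, while get_dummies encodes whatever categories are present in the frame it is given.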

In [5]:
categorical_features = ['Gender', 'Geography']
numerical_features = ['Age']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_features),
        ('num', 'passthrough', numerical_features)
    ])

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)



Inspect the transformed data and compare it with the original data. Each processed row holds the encoded Gender and Geography columns followed by Age.

In [6]:
print(f'Feature names: {preprocessor.get_feature_names_out()}')
print(f'X_train\n{X_train[:5]}')
print(f'X_train_processed\n{X_train_processed[:5]}')

Feature names: ['cat__Gender_Male' 'cat__Geography_Germany' 'cat__Geography_Spain'
 'num__Age']
X_train
      Gender Geography  Age
9254    Male    France   32
1561    Male   Germany   42
1670    Male     Spain   24
6087  Female    France   27
6669    Male    France   56
X_train_processed
[[ 1.  0.  0. 32.]
 [ 1.  1.  0. 42.]
 [ 1.  0.  1. 24.]
 [ 0.  0.  0. 27.]
 [ 1.  0.  0. 56.]]


Create the model and fit it to the training data.

In [None]:
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_processed, y_train)
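
Once a logistic regression model is fitted, its learned weights are available in `coef_` and `intercept_`. The sketch below fits a model on made-up data with a single hypothetical `age` feature (the labelling rule is invented for illustration) and converts the coefficient to an odds ratio with `np.exp`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): older customers are made more likely to have label 1
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=200)
label = (age + rng.normal(0, 10, size=200) > 50).astype(int)

X_demo = age.reshape(-1, 1)  # scikit-learn expects a 2D feature array
demo_model = LogisticRegression(random_state=42, max_iter=1000).fit(X_demo, label)

coef = demo_model.coef_[0]   # one coefficient per feature
odds_ratio = np.exp(coef)    # multiplicative change in the odds per unit of the feature

print('Coefficient:', coef)
print('Intercept:', demo_model.intercept_)
print('Odds ratio per year of age:', odds_ratio)
```

A positive coefficient (odds ratio above 1) means the feature pushes predictions towards class 1. In the notebook, the same attributes can be read from `model` and paired with the names from `preprocessor.get_feature_names_out()`.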

Make predictions on the test set.

In [None]:
y_pred = model.predict(X_test_processed)

print(f'y_pred\n{y_pred[:5]}')
print(f'y_test\n{y_test[:5]}')
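
`predict` returns hard class labels. For a binary logistic regression these labels come from class probabilities, which `predict_proba` exposes directly. The sketch below uses made-up data (`X_demo`, `y_demo` and `clf` are hypothetical names) to show that thresholding the class 1 probability at 0.5 reproduces the output of `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): the label depends mainly on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(random_state=42, max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:5])     # columns: P(class 0), P(class 1)
labels = clf.predict(X_demo[:5])          # hard labels
manual = (proba[:, 1] > 0.5).astype(int)  # threshold the class 1 probability

print(proba)
print(labels)
print(manual)
```

Working with probabilities makes it possible to choose a threshold other than 0.5, for example to flag more customers as likely to leave at the cost of more false alarms.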


Evaluate the model's performance on the test set.

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
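
To see how the numbers in these reports relate to each other, the sketch below derives accuracy, precision and recall by hand from a confusion matrix. The label lists `y_true_demo` and `y_pred_demo` are made up for illustration and are not results from the churn model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for 10 customers
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]: rows are true classes, columns are predictions
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)                     # share of true 1s that were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)

# The hand-computed values agree with the scikit-learn helpers
print(precision_score(y_true_demo, y_pred_demo), recall_score(y_true_demo, y_pred_demo))
```

For a churn problem, recall on class 1 is often the figure to watch, since it measures how many of the customers who actually left were identified.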

END OF TUTORIAL