# ANN Classification - Bank Customer Retention
## Part 1 - DATA PREPROCESSING
This notebook loads the raw dataset file and applies initial cleaning and preprocessing to prepare the data for model training.

> **INPUT:** the raw dataset file as downloaded from its original source.<br>
> **OUTPUT:** a cleaned version of the dataset saved to an intermediate CSV file.

### 1. INITIALIZATION

In [111]:
# Import necessary libraries and modules
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [112]:
# Check tensorflow version
tf.__version__

'2.16.1'

### 2. LOADING DATASET FILE

In [113]:
# Prepare file location and load dataset
data_file_location = "..\\data\\raw\\"
data_file_name = "churn_modelling"
data_file_ext = "csv"
data = pd.read_csv(data_file_location + data_file_name + "." + data_file_ext)
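The path above is built with Windows-style backslashes. As a minimal sketch, assuming the same `../data/raw/churn_modelling.csv` layout, `pathlib` can assemble a path that also works on other operating systems. The names `data_dir` and `data_path` are illustrative and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumed layout, mirroring the cell above: ../data/raw/churn_modelling.csv
data_dir = Path("..") / "data" / "raw"
data_path = data_dir / "churn_modelling.csv"

# pandas accepts a Path object directly; the guard keeps the sketch runnable
# even when the file is not present.
if data_path.exists():
    data = pd.read_csv(data_path)

print(data_path.name)
```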

In [114]:
# Check dataset head
data.head()

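|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |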


In [115]:
# Check dataset columns
data.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [116]:
# Check column types
data.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object
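The dtype listing shows which columns hold text and therefore need to be dropped or encoded before training. As a rough sketch on a toy frame (values copied from the `head()` output, with only a few of the columns), the non-numeric columns can also be listed programmatically. The names `sample` and `text_columns` are illustrative.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (values from the head() output)
sample = pd.DataFrame({
    "Surname": ["Hargrave", "Hill"],
    "CreditScore": [619, 608],
    "Geography": ["France", "Spain"],
    "Gender": ["Female", "Female"],
    "Balance": [0.0, 83807.86],
})

# Exclude numeric dtypes; what remains are the text columns
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```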

### 3. DATA CLEANING AND PREPROCESSING

#### Drop irrelevant columns

In [117]:
# Drop irrelevant columns such as identifiers and names
data.drop(["RowNumber", "CustomerId", "Surname"], axis=1, inplace=True)

# Check dataset head
data.head()

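|   | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |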


#### Encoding categorical features

In [118]:
# Encoding "Gender" feature with label encoding
le = LabelEncoder()
data["Gender"] = le.fit_transform(data["Gender"])

# Check encoded feature
data["Gender"].unique()

array([0, 1])
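The output confirms two integer codes but not which label each one stands for. A small sketch, assuming the two categories are `Female` and `Male`, shows how the fitted encoder's `classes_` attribute recovers the mapping. The variable names here are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Female", "Male", "Female"])

# classes_ is sorted; each label's position in it is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

Because the classes are sorted alphabetically, the mapping depends only on the set of labels, not on their order in the data.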

In [119]:
# Encoding "Geography" feature with one-hot encoding
ohe = OneHotEncoder()
ct = ColumnTransformer(transformers=[
    ("one_hot_encoder", ohe, ["Geography"])
], remainder="passthrough")

In [120]:
# Apply transformations
data_encoded = ct.fit_transform(data)

In [122]:
data_encoded

array([[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.0000000e+00,
        1.0134888e+05, 1.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, ..., 1.0000000e+00,
        1.1254258e+05, 0.0000000e+00],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 0.0000000e+00,
        1.1393157e+05, 1.0000000e+00],
       ...,
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.0000000e+00,
        4.2085580e+04, 1.0000000e+00],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, ..., 0.0000000e+00,
        9.2888520e+04, 1.0000000e+00],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 0.0000000e+00,
        3.8190780e+04, 0.0000000e+00]])

In [121]:
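# Build a DataFrame from the encoded array. get_feature_names_out() takes no argument here:
# the transformer was fitted on the whole DataFrame, so passing ['Geography'] alone raises
# "ValueError: input_features is not equal to feature_names_in_".
data_encoded_df = pd.DataFrame(data_encoded, columns=ct.get_feature_names_out())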