## Data Loading and Formatting

Before doing anything else, we need to load the credit card customer data. The `load_credit_card_customer_data` function checks whether the data file `BankChurners.csv` has already been downloaded and, if not, downloads it from `zenodo.org`. Initially this function returned an `X, y` pair, but it now returns a `pandas` DataFrame, because it is easier to select columns by name from a DataFrame. The selection itself happens in `select_features`. We then select the personal features that we want to use to predict a financial feature.

In [19]:
import os
import requests
import pandas as pd
import numpy as np

def load_credit_card_customer_data():
    filename = "BankChurners.csv"
    if not os.path.exists(filename):
        url = "https://zenodo.org/records/4322342/files/BankChurners.csv?download=1"
        response = requests.get(url)
        # fail loudly instead of saving an HTTP error page as the CSV
        response.raise_for_status()
        with open(filename, "wb") as f:
            f.write(response.content)

    data = pd.read_csv(filename)
    return data

def select_features(data, X_features, y_label):
    X = data[X_features].values
    y = data[y_label].values
    
    return X, y

data = load_credit_card_customer_data()

personal_features = [
    'Attrition_Flag',
    'Customer_Age',
    'Gender',
    'Dependent_count',
    'Education_Level',
    'Marital_Status',
    'Income_Category',
    'Total_Relationship_Count'
]

labels_column = 'Credit_Limit'

X, y = select_features(data, personal_features, labels_column)
X.shape, y.shape

((10127, 8), (10127,))

Here we can see that we have `8` personal features and `1` label column, and that the dataset has `10127` rows, which makes it fairly large. However, some of the columns contain categorical data. For example, `'Attrition_Flag'` contains either `'Existing Customer'` or `'Attrited Customer'`, `'Gender'` contains only `M` or `F`, and the same holds for `'Education_Level'`, `'Marital_Status'` and `'Income_Category'`.

In [20]:
pd.DataFrame(X[:5, :])

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | 5 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | 6 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | 4 |
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | 3 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | 5 |
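
One way to find out which columns need encoding, rather than reading them off the preview, is to ask `pandas` for the non-numeric columns. The sketch below is only an illustration: it uses a small hand-built DataFrame named `toy` (a hypothetical name, with a few values copied from the preview above) instead of the full dataset, and it works on a DataFrame rather than on the `X` array, because calling `.values` on mixed columns produces an object array that no longer carries per-column dtypes.

```python
import pandas as pd

# Toy rows mirroring three of the columns above (values copied from the preview).
toy = pd.DataFrame({
    "Customer_Age": [45, 49, 51],
    "Gender": ["M", "F", "M"],
    "Income_Category": ["$60K - $80K", "Less than $40K", "$80K - $120K"],
})

# Columns that do not hold numbers must be encoded before a classifier can use them.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Number of distinct values in each categorical column.
n_categories = toy[categorical_cols].nunique().to_dict()

print(categorical_cols)
print(numeric_cols)
print(n_categories)
```

Applied to the real DataFrame, the same two `select_dtypes` calls would split the selected personal features into the columns that need encoding and those that are already numerical.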


Classifiers like `SVM` can only work with numerical data, not strings like `"Existing Customer"`. So we need to convert these categorical features into binary indicator columns using `OneHotEncoder`. For a categorical column with $n$ distinct categories, it creates $n$ new columns and marks each entry with $1$ in the column of its category and $0$ elsewhere. Here is an example with `'Income_Category'`.

In [28]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

income_category = X[:, 6]
X_2darray = np.reshape(income_category, (-1, 1)) 

encoder.fit_transform(X_2darray).toarray()

array([[0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.]], shape=(10127, 6))

Here we can see that there are `6` different income categories, as the `OneHotEncoder` has created `6` new columns, each holding a `1` if the entry belongs to that category and a `0` otherwise. But we want to numericalize all of our categorical data, not just `'Income_Category'`.
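
When encoding several columns, it helps to know which output column stands for which category. The fitted encoder records this in its `categories_` attribute, and the encoded columns follow that (sorted) order. The following is a minimal sketch on a made-up toy column; the values `"low"`, `"mid"` and `"high"` are illustrative assumptions and do not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column with three made-up categories (not real dataset values).
toy_col = np.array(["low", "high", "mid", "low"], dtype=object).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy_col).toarray()

# categories_ holds one array per input column; its order is the output column order.
categories = encoder.categories_[0].tolist()

# Recover each row's category from the position of its single 1.
recovered = [categories[i] for i in encoded.argmax(axis=1)]

print(categories)
print(encoded)
print(recovered)
```

Keeping the `categories_` arrays around makes it possible to label the encoded columns later on.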

In [32]:
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(col):
    encoder = OneHotEncoder()
    reshaped_col = np.reshape(col, (-1, 1))
    numerical_col = encoder.fit_transform(reshaped_col).toarray()
    return encoder.categories_, numerical_col

# this makes indexing the columns much cleaner
index_dict = {} 

attribution_flag_categories, attribution_flag_numerical = one_hot_encode(X[:, 0])
# age is already numerical
gender_categories, gender_numerical = one_hot_encode(X[:, 2])
# dependent count is already numerical
education_categories, education_numerical = one_hot_encode(X[:, 4])
maritial_categories, maritial_numerical = one_hot_encode(X[:, 5])
income_categories, income_numerical = one_hot_encode(X[:, 6])

[array(['Attrited Customer', 'Existing Customer'], dtype=object)]