# Customer Behaviour Prediction

In this notebook we prepare and clean the data, then apply a predictive model to predict whether a new client will pay their credit card bill statement or not.

#### Content
1. <a href="#1">Exploring the Data</a><br>
2. <a href="#2">Data Cleaning</a><br>
3. <a href="#3">Data Preprocessing</a><br>
4. <a href="#4">K-Fold</a><br>
5. <a href="#5">Model Training</a><br>
6. <a href="#6">Conclusion</a>

In [None]:
import numpy as np 
import pandas as pd
import os

# <a id="1">Exploring the Data</a>

In [None]:
data = pd.read_csv('/kaggle/input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv')
data.head()

In [None]:
data.info()

In [None]:
data.describe()

# <a name="2">Data Cleaning</a>

The main aim of **Data Cleaning** is to identify and remove errors and duplicate data in order to create a reliable dataset. This improves the quality of the training data and enables more accurate decision-making.

In our dataset, we can modify the dataframe to remove NaNs or outliers and so improve the quality of the dataset.

Because the columns have different data types, we will handle each type separately.

In [None]:
to_be_dropped = ['ID']
categorical_cols = ['SEX', 'EDUCATION', 'MARRIAGE']
numerical_cols = ['LIMIT_BAL', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2',
                  'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'AGE']
target_col = ['will_pay']

The `PAY_n` columns all contain an undocumented label, -2. If 1, 2, 3, etc. are the months of delay, then 0 should be labeled 'pay duly' and every negative value should be treated as 0.

In [None]:
for col in ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']:
    data[col] = np.where(data[col]<=0, 0, data[col]) 
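
To make the effect of this recoding concrete, here is a small sketch on made-up values (the toy series is hypothetical, not taken from the dataset). It also suggests that `Series.clip(lower=0)` should give the same result as the `np.where` call.

```python
import numpy as np
import pandas as pd

# Hypothetical repayment-status codes, including the undocumented -2
pay = pd.Series([-2, -1, 0, 1, 2, 3])

# Same rule as in the cell above: every value <= 0 becomes 0
recoded = pd.Series(np.where(pay <= 0, 0, pay))

# Equivalent, slightly more idiomatic pandas
clipped = pay.clip(lower=0)

print(recoded.tolist())
print(clipped.tolist())
```

Positive values (months of delay) are left untouched; only the non-positive codes are collapsed.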

### Remove NaNs

In [None]:
data.isnull().sum()

In [None]:
# drop nans
data = data.dropna()

# drop unneeded cols
data = data.drop(to_be_dropped, axis=1)

In [None]:
data.info()

### update some values

In [None]:
# rename some columns to be more understandable
# note: the target is 1 if the client defaults (does not pay) next month, 0 otherwise
data = data.rename(columns={'PAY_0': 'PAY_1', 'default.payment.next.month': 'will_pay'})

# note: EDUCATION and MARRIAGE still hold integer codes at this point,
# so the string replacements below match nothing when run here

# replace unknown values with NaNs
data['EDUCATION'] = data['EDUCATION'].replace('unknown', np.nan)
data['MARRIAGE'] = data['MARRIAGE'].replace('unknown', np.nan)

# replace 'others' values with a different name
data['EDUCATION'] = data['EDUCATION'].replace('others', 'other education')
data['MARRIAGE'] = data['MARRIAGE'].replace('others', 'other status')

In [None]:
print(data['SEX'].value_counts(), '\n')
print(data['EDUCATION'].value_counts(), '\n')
print(data['MARRIAGE'].value_counts(), '\n')
print(data['will_pay'].value_counts(), '\n')

The target values show that the data is **unbalanced**. We will handle this later!
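
The imbalance can be quantified with relative frequencies rather than raw counts. The sketch below uses a hypothetical target column; on the real dataframe the same call would be applied to `data['will_pay']`.

```python
import pandas as pd

# Hypothetical target column: 1 = default next month, 0 = no default
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

# Share of each class instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```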

Now we need to apply **one-hot encoding** to the categorical columns. We will do this in two steps:
1. convert the integer values to their original categorical values
2. convert the categorical values to a one-hot encoding
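
The two steps above can be sketched on a toy column. This is only an illustration: the frame is hypothetical, and `pd.get_dummies` is swapped in for brevity, whereas the notebook itself uses scikit-learn's `OneHotEncoder` for the second step.

```python
import pandas as pd

# Hypothetical toy column using the dataset's integer coding for SEX
toy = pd.DataFrame({'SEX': [1, 2, 2, 1]})

# Step 1: integer codes -> readable category names
toy['SEX'] = toy['SEX'].map({1: 'male', 2: 'female'})

# Step 2: category names -> one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=['SEX'], dtype=int)

print(encoded)
```

Each row ends up with exactly one indicator set to 1, so no artificial ordering between categories is implied.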

### convert int values to categorical variables in the dataframe

In [None]:
# convert int to categorical
def int2cat(df, col, dic):
    """
    Parameters:
        df : dataframe object
        col: column name in the dataframe
        dic: int to categorical dictionary related to this column
    Return:
        df : return the dataframe with this column updated 
    """
    df[col] = df[col].apply(lambda x: dic[x])
    return df



sex_dic = {1:'male', 2:'female'}
education_dic = {1:'graduate school', 2:'university', 3:'high school', 4:'others', 5:'unknown', 6:'unknown', 0:'unknown'}
marriage_dic = {1:'married', 2:'single', 3:'others', 0: 'unknown'}

categorical_dics = [sex_dic, education_dic, marriage_dic]

for col, dic in list(zip(categorical_cols, categorical_dics)):
    data = int2cat(data, col, dic)

In [None]:
data.head()

The first step towards **one-hot encoding** is now done. The second step follows in the next section!

# <a name="3">Data Preprocessing</a>

### get the features and labels from the dataframe

In [None]:
from sklearn.model_selection import train_test_split

x = data.drop('will_pay', axis=1)
y = data['will_pay']

### create a pipeline for preprocessing and model training

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


# Preprocess the categorical features
categorical_processor = Pipeline([
    ('cat_encoder', OneHotEncoder(handle_unknown='ignore'))
                                ])

numerical_processor = Pipeline([
    ('standard_scaler', StandardScaler())
                               ])

data_preprocessor = ColumnTransformer([
    ('categorical_pre', categorical_processor, (categorical_cols)),
    ('numerical_pre', numerical_processor, (numerical_cols))
                                    ]) 

pipeline = Pipeline([
    ('data_preprocessing', data_preprocessor),
    #('dt', RandomForestClassifier(max_depth=2, random_state=0))
                    ])

In [None]:
x_transformed = pipeline.fit_transform(x)
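
To see what the `ColumnTransformer` produces, here is a minimal sketch on a hypothetical four-row frame (the column values are made up). It one-hot encodes one categorical column and standardises one numerical column, mirroring the structure of the pipeline above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature frame with one categorical and one numerical column
toy = pd.DataFrame({
    'MARRIAGE': ['married', 'single', 'single', 'others'],
    'AGE': [30, 40, 50, 60],
})

toy_preprocessor = ColumnTransformer([
    ('categorical_pre', OneHotEncoder(handle_unknown='ignore'), ['MARRIAGE']),
    ('numerical_pre', StandardScaler(), ['AGE']),
])

toy_transformed = toy_preprocessor.fit_transform(toy)

# 3 one-hot columns (married, others, single) + 1 scaled column
print(toy_transformed.shape)
```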

### Apply SMOTE technique for oversampling 

Class imbalance can bias the model towards the majority class. One technique to overcome this problem is **SMOTE** (Synthetic Minority Over-sampling Technique), which creates synthetic samples of the minority class.

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=101)
smote_x, smote_y = smote.fit_resample(x_transformed, y)

In [None]:
print(smote_x.shape, smote_y.shape)
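
The core idea behind SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that idea, not imblearn's implementation; the name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Simplified SMOTE-style oversampling (illustrative, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    # k nearest neighbours of each point; column 0 is the point itself
    # (assuming all points are distinct), so it is skipped
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))     # random minority point
        other = rng.choice(neighbours[base])   # one of its neighbours
        gap = rng.random()                     # position on the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[other] - minority[base])
    return synthetic

# Hypothetical minority-class points on the corners of a unit square
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points.shape)
```

Every synthetic point lies on a line segment between two existing minority points, which is why SMOTE adds variety compared with simply duplicating rows.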

# <a name="4">K-Fold</a>

Applying **K-Fold** cross-validation ensures that every observation from the original dataset appears in both the training set and the test set across the folds. It is one of the best approaches when the input data is limited.
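
As a quick check of that property, the sketch below runs `KFold` on ten hypothetical observations and counts how often each one lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)   # 10 hypothetical observations
test_counts = np.zeros(len(toy_X), dtype=int)

for train_idx, test_idx in KFold(n_splits=5).split(toy_X):
    # Train and test indices never overlap within a fold
    assert set(train_idx).isdisjoint(test_idx)
    test_counts[test_idx] += 1

# Each observation lands in a test fold exactly once
print(test_counts.tolist())
```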

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

results = {
    'model': [],
    'score': []
}

models = [
    ('random_forest', RandomForestClassifier(n_estimators=10)),
    ('svm', SVC(gamma='auto')),
    ('decision_tree', DecisionTreeClassifier(max_depth = 3, class_weight = "balanced")),
    ('xgboost', XGBClassifier())
]

cv = KFold(n_splits=5)

for (model_name, model) in models:
    score = np.mean(cross_val_score(model, smote_x, smote_y, scoring='accuracy', cv=cv, n_jobs=1))
    results['model'].append(model_name)
    results['score'].append(score)


In [None]:
pd.DataFrame(results)

We see that the **xgboost** model gives the highest score among all the models, so we will investigate this model in the next section.

# <a name="5"> Model Training</a>

In [None]:
# train_test_split, XGBClassifier and classification_report were imported in earlier cells


x_train, x_test, y_train, y_test = train_test_split(smote_x, smote_y, test_size=0.2)

xgb = XGBClassifier(n_estimators=100, max_depth=5, booster='gbtree')

xgb.fit(x_train, y_train)

y_pred = xgb.predict(x_train)
print('\n----------------------- TRAIN RESULTS ------------------------')
print(classification_report(y_train, y_pred))

y_pred = xgb.predict(x_test)
print('\n----------------------- TEST RESULTS -------------------------')
print(classification_report(y_test, y_pred))


# <a name="6">Conclusion</a>

By applying the **XGBoost** model, we reached **87% accuracy** on the test set (split from the SMOTE-balanced data), along with a high **F1-score** for both classes. I think this is a good result, especially given how imbalanced the original data is. <br>

Finally, in future work we can improve this result by applying the **GridSearch** technique to choose the parameters that give the model the highest score!
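
A grid search of this kind could look roughly like the sketch below. It is only an outline under stated assumptions: the data is hypothetical, the parameter grid is deliberately tiny, and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` so that the example needs only scikit-learn. Since `XGBClassifier` follows the scikit-learn estimator interface, it should be usable in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical data standing in for the preprocessed credit-card features
toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical, deliberately tiny grid; real searches usually cover more values
param_grid = {
    'n_estimators': [20, 50],
    'max_depth': [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring='accuracy',
    cv=3,
)
search.fit(toy_X, toy_y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_params_` holds the combination with the highest mean cross-validated score, and `best_estimator_` is that model refitted on all the data passed to `fit`.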

Thank you for spending time reading this notebook. If you find it useful, please give it an upvote.