**Import the libraries**

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
card_data = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')
card_data.head()

**Attribute Information**
------------------------
* **CLIENTNUM**                : Client number. Unique identifier for the customer holding the account
* **Attrition_Flag**           : Internal event (customer activity) variable - if the account is closed then 1 else 0
* **Customer_Age**             : Customer's Age in Years
* **Gender**                   : M=Male, F=Female
* **Dependent_count**          : Number of dependents
* **Education_Level**          : Educational Qualification of the account holder (example: high school, college graduate, etc.)
* **Marital_Status**           : Married, Single, Divorced, Unknown
* **Income_Category**          : Annual Income Category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K-$120K, >
* **Card_Category**            : Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
* **Months_on_book**           : Period of relationship with bank
* **Total_Relationship_Count** : Total no. of products held by the customer
* **Months_Inactive_12_mon**   : No. of months inactive in the last 12 months
* **Contacts_Count_12_mon**    : No. of Contacts in the last 12 months
* **Credit_Limit**             : Credit Limit on the Credit Card                                                                                                           
* **Total_Revolving_Bal**      : Total Revolving Balance on the Credit Card
* **Avg_Open_To_Buy**          : Open to Buy Credit Line (Average of last 12 months)
* **Total_Amt_Chng_Q4_Q1**     : Change in Transaction Amount (Q4 over Q1)
* **Total_Trans_Amt**          : Total Transaction Amount (Last 12 months)
* **Total_Trans_Ct**           : Total Transaction Count (Last 12 months)
* **Total_Ct_Chng_Q4_Q1**      : Change in Transaction Count (Q4 over Q1)
* **Avg_Utilization_Ratio**    : Average Card Utilization Ratio
* **Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1**  
* **Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2**

# Business Understanding

From this data set we can provide a couple of solutions:
1. Predict the total amount that a customer will spend in a year.
   Using this model, the company can make sure it profits from new customers and can decide whether to issue a credit card to a new customer, based on the amount they are expected to spend given their profile data.
2. Identify the customers who are going to stop using their credit cards.
   Using this model, the company can make offers to those customers to retain them.


# Data Preparation

In this step we prepare the data for analysis. The data might have null values, be of different data types, or contain outliers, so let's see whether it needs to be treated.

**Check for the Null Values**

In [None]:
#Check for percentage of Null Values
card_data.isnull().sum()/card_data.shape[0]*100

From the above results we can see that there are no null values, so let's proceed to check the data types and outliers.

In [None]:
card_data.dtypes

In [None]:
card_data['Dependent_count'] = pd.to_numeric(card_data['Dependent_count'])

Most of the features look to be in the correct data types. Let's proceed for now and change them if needed in later steps.

**Outlier Treatment**
*     This is needed because extreme outliers may affect the final model. So we must identify the extreme outliers and then treat them for better analysis and better model performance.

In [None]:
plt.figure(figsize=(15,15))

card_data.boxplot(column=['Customer_Age','Dependent_count','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Credit_Limit',
                          'Total_Revolving_Bal','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1','Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1'])
plt.xticks(rotation=45)
plt.show()

There are no extreme outliers, so we can proceed with EDA.
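
The boxplot reading can be backed up with a numeric check. The sketch below counts values lying more than three interquartile ranges beyond the quartiles, a common convention for "extreme" outliers. The 3.0 multiplier, the helper name `count_extreme_outliers`, and the toy data are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

def count_extreme_outliers(df, k=3.0):
    """Count, per numeric column, the values beyond k * IQR from the quartiles."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # The comparison aligns the bounds with the column labels
    mask = (numeric < lower) | (numeric > upper)
    return mask.sum()

# Toy data (made up): one obviously extreme value in 'Credit_Limit'
toy = pd.DataFrame({
    "Customer_Age": [35, 42, 47, 51, 39, 44, 48, 53],
    "Credit_Limit": [3000, 3500, 4000, 4200, 3800, 3600, 4100, 90000],
})
outliers = count_extreme_outliers(toy)
print(outliers)
```

On the real data, the same helper could be called as `count_extreme_outliers(card_data)` to see which columns, if any, contain extreme values.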

In [None]:
card_data.head()

# Exploratory Data Analysis (EDA)

Only a brief EDA is done at this point.

In [None]:
card_data['Customer_Age'].plot(kind='hist')


We can see that most credit card customers are between 35 and 55 years old

In [None]:
card_data.groupby('Marital_Status').agg({'Total_Trans_Amt':'sum'}).plot(kind='bar')

Single and Married people contribute most of the total credit card transaction amount

In [None]:
card_data.groupby('Education_Level').agg({'Total_Trans_Amt':'sum'}).plot(kind='bar')

Education isn't really playing a role in the amount spent by customers on their credit cards
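
A bar chart of summed amounts mostly reflects how many customers fall into each group. Looking at the per-customer mean next to the count is one way to separate the two effects. The toy frame below is made up purely for illustration.

```python
import pandas as pd

# Toy data (made up): a large group with modest spenders, a small group with big spenders
toy = pd.DataFrame({
    "Education_Level": ["Graduate"] * 4 + ["Doctorate"] * 2,
    "Total_Trans_Amt": [1000, 1200, 800, 1000, 1500, 1500],
})

# count shows the group size, sum the group total, mean the typical spend per customer
summary = toy.groupby("Education_Level")["Total_Trans_Amt"].agg(["count", "sum", "mean"])
print(summary)
```

Here the larger group wins on the sum while the smaller group wins on the mean, so the two charts can tell different stories.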

In [None]:
plt.scatter(x=card_data['Total_Trans_Ct'],y=card_data['Total_Trans_Amt'])

# Machine Learning Model
As a credit card company, we are interested in knowing how much a card holder is likely to spend, which in turn benefits the company. If the predefined parameters tell us that a person does not spend much and is inactive most of the time, the company will not benefit much from that customer. 
  So let's build an ML model to predict how much a credit card holder will spend in a year, so that the company can decide whether to issue the card, provide some exciting offers, or reject the application. 

# Linear Regression Model.
This is to address the first solution that we discussed above. This model is used to predict the amount spent by any given customer in 12 months.

Linear Regression requires data preparation and data scaling. So let's have another look at the data, scale it, and convert all the categorical variables to numeric ones.
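
As a minimal sketch of what scaling could look like, the snippet below standardizes each column to zero mean and unit variance (a z-score). The toy values are made up, and whether to scale at all is a design choice: ordinary least squares fits equally well without it, but scaled features make coefficient magnitudes easier to compare.

```python
import pandas as pd

# Toy data (made up): two features on very different scales
toy = pd.DataFrame({
    "Total_Trans_Ct": [20.0, 40.0, 60.0, 80.0],
    "Credit_Limit": [2000.0, 4000.0, 6000.0, 8000.0],
})

# z-score: subtract the column mean, divide by the column standard deviation
scaled = (toy - toy.mean()) / toy.std(ddof=0)
print(scaled)
```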

In [None]:
card_data.head()
#card_data.Income_Category.unique()

**Let's drop the columns which are not important to our model, like CLIENTNUM and the last two columns**

In [None]:
card_data.drop(columns=['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
                'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2',
                'CLIENTNUM'],axis=1,inplace=True)
card_data.head()

**Attrition_Flag**

In [None]:
card_data.Attrition_Flag.unique()

This flag marks whether a customer is existing or attrited. Since we are interested in how much a customer will spend on the credit card regardless of that status, we can exclude this feature from our analysis.

In [None]:
card_data.drop(columns=['Attrition_Flag'],axis=1,inplace=True)

**Gender**
*     Let's convert Gender to a numeric variable. Since this is nominal data, we could use a one-hot encoding technique. Because there are only two values, I will use simple logic to convert it.

In [None]:
card_data['Gender'] = card_data['Gender'].apply(lambda x: 1 if x=='M' else 0)
card_data.head()

We cannot build regression models with categorical variables in a dataset, so let's convert all the categorical variables to numeric. Since I am using dummy encoding, we need to drop one dummy column per variable, otherwise it will lead to multicollinearity.
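
To see why one dummy column has to go, here is a small sketch on made-up data. With an intercept in the model, the full set of dummy columns always sums to 1 and duplicates the constant, so the design matrix loses rank. This example uses `drop_first=True` for brevity, whereas the notebook drops the `Unknown` column instead; either choice removes the redundancy.

```python
import numpy as np
import pandas as pd

# Toy data (made up)
status = pd.Series(["Married", "Single", "Divorced", "Single", "Married"],
                   name="Marital_Status")

full = pd.get_dummies(status, dtype=int)                      # one column per category
reduced = pd.get_dummies(status, drop_first=True, dtype=int)  # first category dropped

# Add an intercept column and compare the matrix rank with the column count
const = np.ones((len(status), 1))
rank_full = np.linalg.matrix_rank(np.hstack([const, full.to_numpy()]))
rank_reduced = np.linalg.matrix_rank(np.hstack([const, reduced.to_numpy()]))

print("full:   ", full.shape[1] + 1, "columns, rank", rank_full)
print("reduced:", reduced.shape[1] + 1, "columns, rank", rank_reduced)
```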

In [None]:
#function to convert categorical variables
def convert_cat_variables(col,prefix,df):
    #Get dummies for a column using pandas (dtype=int keeps the columns numeric for the VIF and OLS steps)
    dummies = pd.get_dummies(df[col],prefix=prefix,dtype=int)
    #Append the dummy columns to the original dataset
    df = df.join(dummies)
    #Drop the original categorical column
    df.drop(col,axis=1,inplace=True)
    return df

In [None]:
card_data = convert_cat_variables('Education_Level','Education_Level',card_data)
card_data.head()

In [None]:
card_data.drop('Education_Level_Unknown',axis=1,inplace=True)
card_data.head()

In [None]:
card_data = convert_cat_variables('Marital_Status','Marital_Status',card_data)
card_data.head()

In [None]:
card_data.drop('Marital_Status_Unknown',axis=1,inplace=True)
card_data.head()

**Card Category and Income Category** are ordinal data, hence a different method is used to convert these two features

In [None]:
card_data.Card_Category.unique()

In [None]:
def transformCategory(x):
    #print(x)
    if x=='Blue': return 0
    elif x=='Gold': return 2 
    elif x=='Silver': return 1 
    elif x=='Platinum': return 3

In [None]:
card_data['Card_Category'].value_counts()

In [None]:
card_data['Card_Category'] = card_data['Card_Category'].apply(transformCategory)
card_data.head()

In [None]:
card_data.Income_Category.unique()

In [None]:
card_data.Income_Category.value_counts()

In [None]:
def transformIncomeCat(x):
    #print(x)
    if x=='Less than $40K': return 0
    elif x=='$40K - $60K': return 1 
    elif x=='$60K - $80K': return 2 
    elif x=='$80K - $120K': return 3
    elif x=='$120K +': return 4
    elif x=='Unknown': return 2  #'Unknown' is placed in the middle income band

In [None]:
card_data.head()

In [None]:
card_data['Income_Category'] = card_data['Income_Category'].apply(transformIncomeCat)
card_data.head()
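
The same ordinal encoding could also be written as a dictionary lookup with `Series.map`, which keeps the ordering in one place. This is only an alternative sketch: the names `income_order` and `income_codes` are hypothetical, and the toy series is made up.

```python
import pandas as pd

# Income bands ordered from lowest to highest, using the labels from the data set
income_order = ["Less than $40K", "$40K - $60K", "$60K - $80K",
                "$80K - $120K", "$120K +"]
income_codes = {label: rank for rank, label in enumerate(income_order)}
income_codes["Unknown"] = 2  # same choice as above: 'Unknown' goes to the middle band

# Toy data (made up), including one label that is not in the mapping
toy = pd.Series(["$120K +", "Unknown", "Less than $40K", "typo value"])
encoded = toy.map(income_codes)
print(encoded)
```

Labels missing from the mapping come back as `NaN`, so a follow-up `encoded.isna().sum()` makes unexpected categories easy to spot.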

**Now that the data is cleaned, we need to check whether there is any collinearity between features and eliminate it. Collinearity affects model building, and we will not get reliable coefficient values if it is left in the model. So let's eliminate multicollinearity using the Variance Inflation Factor (VIF) method**
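
For intuition, the textbook definition is VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing feature j on all the other features. The sketch below implements that definition with NumPy on made-up data. The function name `vif_with_intercept` is hypothetical, and because this version always adds an intercept, its numbers may differ from a library routine that uses the matrix exactly as passed.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Toy data (made up): x2 is nearly a copy of x1, x3 is unrelated to both
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two nearly identical columns get very large VIF values, while the unrelated column stays close to 1.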

In [None]:
from sklearn.model_selection import train_test_split

# Calculate VIF to eliminate Multicollinearity

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

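def cal_vif(X):
    #Cast to float so boolean or integer columns do not produce an object array
    values = X.astype(float).values
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(values, i) for i in range(values.shape[1])]
    return vif.sort_values('VIF',ascending=False)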

In [None]:
X = card_data.drop('Total_Trans_Amt',axis=1)
cal_vif(X)

**Let's eliminate all the columns with VIF > 10**

In [None]:
card_data.drop(columns=['Avg_Open_To_Buy','Total_Revolving_Bal','Credit_Limit','Customer_Age','Months_on_book','Total_Amt_Chng_Q4_Q1','Total_Ct_Chng_Q4_Q1'],axis=1,inplace=True)
card_data.head()

In [None]:
X = card_data.drop('Total_Trans_Amt',axis=1)
cal_vif(X)

# Build Linear Regression Model

In [None]:
import statsmodels.api as sm
#function to perform Linear Regression
def doLinearRegression(data,predictColName):
    result = dict()
    #split the data 
    x_train,x_test,y_train,y_test = train_test_split(data.drop(predictColName,axis=1),data[predictColName],test_size=0.3,random_state=112)
    x_train = sm.add_constant(x_train)
    x_test = sm.add_constant(x_test)
    model = sm.OLS(y_train,x_train)
    model_result = model.fit()
    result['model_result'] = model_result
    result['x_train'] = x_train
    result['x_test'] = x_test
    result['y_train'] = y_train
    result['y_test'] = y_test
    return result

In [None]:
result1 = doLinearRegression(card_data,'Total_Trans_Amt')
result1['model_result'].summary()

**Now let's look at the p-value of each variable by defining a null hypothesis H(0) and an alternate hypothesis H(a).**
**H(0) = Feature has no effect on the total transaction amount (its coefficient is zero)**
**H(a) = Feature has an effect on the total transaction amount**
**Keeping a threshold of 5%, let's eliminate all the features with P>|t| greater than 5%, since for those we cannot reject H(0)**
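
For reference, the `P>|t|` column in the OLS summary comes from a t-test on each coefficient. In the usual textbook notation (the symbols are standard ones and are not taken from the notebook):

$$
H_0:\ \beta_j = 0 \qquad H_a:\ \beta_j \neq 0 \qquad t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}
$$

A small p-value (below 0.05 here) means the observed $t_j$ would be unlikely if $H_0$ were true, so the feature is kept. A large p-value means there is no evidence that the coefficient differs from zero, so the feature is a candidate for removal.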

In [None]:
card_data.drop('Contacts_Count_12_mon',axis=1,inplace=True)

In [None]:
result1 = doLinearRegression(card_data,'Total_Trans_Amt')
result1['model_result'].summary()

In [None]:
card_data.drop('Months_Inactive_12_mon',axis=1,inplace=True)
result1 = doLinearRegression(card_data,'Total_Trans_Amt')
result1['model_result'].summary()

In [None]:
card_data.drop(columns=['Education_Level_College','Education_Level_Doctorate','Education_Level_Graduate','Education_Level_High School',
                       'Education_Level_Post-Graduate','Education_Level_Uneducated'],axis=1,inplace=True)
result1 = doLinearRegression(card_data,'Total_Trans_Amt')
result1['model_result'].summary()

In [None]:
card_data.drop(columns=['Marital_Status_Divorced','Marital_Status_Married'],axis=1,inplace=True)
result1 = doLinearRegression(card_data,'Total_Trans_Amt')
result1['model_result'].summary()

**Notice that R2 and Adjusted R2 haven't changed even after eliminating all the insignificant features.**
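
The relation between the two measures can be sketched with the standard formula, Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The numbers below are hypothetical and only illustrate the behaviour.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Hypothetical numbers: the same R2 with 15 predictors and then with 5
before = adjusted_r2(0.70, n_obs=7000, n_features=15)
after = adjusted_r2(0.70, n_obs=7000, n_features=5)
print(before, after)
```

With thousands of rows, the penalty for a handful of extra predictors is tiny, which is consistent with the two values barely moving when uninformative features are removed.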

In [None]:
coef = result1['model_result'].params

In [None]:
coef[1:].plot(kind='bar')

**From the above graph we can see how strongly each feature is related to the total transaction amount, and whether the relationship is positive or negative**

# Accuracy of the Model

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error,mean_squared_error
#function to print metric values on predicted results
def predict_print_results(result,x_test,y_test,strTypeOfData):
    predicted_y = result.predict(x_test)
    print('--------------------------')
    print(' Measures on '+strTypeOfData+' Data    ')
    print('--------------------------')
    print('R2  is '+str(r2_score(y_test,predicted_y)))
    print('MSE is '+str(mean_squared_error(y_test,predicted_y)))
    print('MAE is '+str(mean_absolute_error(y_test,predicted_y)))
    return predicted_y

In [None]:
predict_print_results(result1['model_result'],result1['x_train'],result1['y_train'],'Train') 
y_predicted1 = predict_print_results(result1['model_result'],result1['x_test'],result1['y_test'],'Test') 

**Notice that the R2 of the model is about 70%**
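
For reference, the three measures printed above can be computed by hand. The sketch below uses NumPy and made-up numbers. The helper name `regression_metrics` is hypothetical, and RMSE is added only because it is in the same units as the target.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MSE, RMSE and MAE for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    # R2 compares the squared error with that of always predicting the mean
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"r2": r2, "mse": mse, "rmse": np.sqrt(mse), "mae": mae}

# Toy values (made up)
m = regression_metrics([1000, 2000, 3000, 4000], [1100, 1900, 3200, 3800])
print(m)
```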

In [None]:
#function to plot a graph of predicted and actual y values
def plot_scatter_predicted_actual(y_test,y_predicted):
    # Plotting Scatter graph to show the prediction  
    plt.scatter(y_test, y_predicted, c = 'green') 
    plt.xlabel("Actual Total_Trans_Amt") 
    plt.ylabel("Predicted Total_Trans_Amt") 
    plt.title("True value vs predicted value : Linear Regression") 
    plt.show() 

In [None]:
plot_scatter_predicted_actual(result1['y_test'],y_predicted1)

# Random Forest Classifier
This is to address the second part of our business problem: checking whether the company is going to lose a customer.

Reload the dataset

In [None]:
card_data = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')
card_data.head()

Drop unwanted columns

In [None]:
card_data.drop(columns=['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
                'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2',
                'CLIENTNUM'],axis=1,inplace=True)
card_data.head()

**Convert all categorical variables into numeric. In this case I have used Label Encoding as my conversion method. Since this model is not affected by the magnitude of the feature data, no scaling is done.**

Prediction variable: Attrition flag.

In [None]:
card_data['Attrition_Flag'] = card_data['Attrition_Flag'].apply(lambda x: 1 if x=='Existing Customer' else 0)

In [None]:
card_data.head()

Convert Gender

In [None]:
card_data['Gender'] = card_data['Gender'].apply(lambda x: 1 if x=='M' else 0)
card_data.head()

**Using Label Encoder, I am converting all the remaining categorical variables**

In [None]:
# Import label encoder 
from sklearn import preprocessing 

# LabelEncoder maps each distinct label to an integer code. 
label_encoder = preprocessing.LabelEncoder() 

# Encode labels in column 'Education_Level'. 
card_data['Education_Level']= label_encoder.fit_transform(card_data['Education_Level']) 

# Encode labels in column 'Marital_Status'. 
card_data['Marital_Status']= label_encoder.fit_transform(card_data['Marital_Status']) 

# Encode labels in column 'Income_Category'. 
card_data['Income_Category']= label_encoder.fit_transform(card_data['Income_Category']) 

# Encode labels in column 'Card_Category'. 
card_data['Card_Category']= label_encoder.fit_transform(card_data['Card_Category']) 

card_data.head()
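
One thing to keep in mind: as far as I understand it, `LabelEncoder` assigns integer codes following the sorted order of the labels, not their meaning. The plain-Python sketch below shows what that order looks like for the income bands. It does not call scikit-learn, and the variable name `alphabetical` is only illustrative.

```python
# Income labels as they appear in the data set
labels = ["Less than $40K", "$40K - $60K", "$60K - $80K",
          "$80K - $120K", "$120K +", "Unknown"]

# Codes assigned in sorted label order
alphabetical = {label: code for code, label in enumerate(sorted(labels))}
for label, code in alphabetical.items():
    print(code, label)
```

The highest band sorts first and the lowest band sorts near the end, so the codes do not follow income. Tree-based models can often cope with this, but an explicit mapping preserves the ordinal meaning.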

# Model Building

In [None]:
X = card_data.drop('Attrition_Flag',axis=1)
y = card_data['Attrition_Flag']

**Split the data for the model**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=42)

**Create Random Forest Classifier model and fit the data**

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest classifier with 100 trees
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training set
clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)

# Accuracy of Model

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix

def find_model_accuracy(threshold,y_test,predictions):
    
    #Build predicted class using predictions with threshold value
    predicted_classes = np.where(predictions>threshold, 1, 0)
    acc_score = accuracy_score(y_test, predicted_classes)
    print('***********************************************')
    print('   Accuracy Score of Model is '+str(acc_score))
    print('***********************************************')
    
    #Build the confusion matrix on test and predicted results
    confusion_mat = confusion_matrix(y_test,predicted_classes)
    #Build data frame of confusion matrix
    confusion_df  = pd.DataFrame(confusion_mat,index=['Actual Neg', 'Actual Pos'],columns=['Predicted Neg','Predicted Pos'])
    print('             Model Results                     ')
    print('             *************                     ')
    #Read the four cells of the confusion matrix
    TN = confusion_mat[0][0]
    TP = confusion_mat[1][1]
    FN = confusion_mat[1][0]
    FP = confusion_mat[0][1]
    total = TN+TP+FN+FP
    acc = (TN + TP)/total
    missClassification = (FN+FP)/total
    #Error rate of always predicting the positive class (1 = existing customer)
    nullErrorRate = (TN+FP)/total
    print('Accuracy Of the Model '+ str(acc))
    print('Misclassification Rate '+str(missClassification))
    print('Null Error Rate '+ str(nullErrorRate))
    print('***********************************************')
    print('             Confusion Matrix                  ')
    print('             ----------------                  ')
    #Plot a heat map using sns (fmt='d' shows the counts as whole numbers)
    sns.heatmap(data=confusion_df,cmap='coolwarm',annot=True,fmt='d')
    plt.show() 

In [None]:
find_model_accuracy(0.75,y_test,y_pred)
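
Note that `clf.predict` returns class labels (0 or 1), so a threshold of 0.75 applied to them changes nothing. A threshold only has an effect on probabilities, for example those from `clf.predict_proba(X_test)[:, 1]`. The sketch below illustrates the difference with NumPy on made-up probabilities. The helper `recall_for_class` is hypothetical, and treating `predict` as a 0.5 cut-off is a simplification.

```python
import numpy as np

# Toy labels (made up): 1 = existing customer, 0 = attrited customer
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
# Hypothetical class-1 probabilities, standing in for predict_proba output
proba = np.array([0.95, 0.80, 0.70, 0.60, 0.65, 0.20, 0.90, 0.40])

labels = (proba > 0.5).astype(int)           # roughly what predict() returns
from_labels = np.where(labels > 0.75, 1, 0)  # thresholding 0/1 labels changes nothing
from_proba = np.where(proba > 0.75, 1, 0)    # thresholding probabilities does

def recall_for_class(y, pred, cls):
    """Share of the actual members of class `cls` that were predicted as `cls`."""
    actual = (y == cls)
    return ((pred == cls) & actual).sum() / actual.sum()

recall_labels = recall_for_class(y_true, from_labels, 0)
recall_proba = recall_for_class(y_true, from_proba, 0)
print(recall_labels, recall_proba)
```

Because attrited customers are the class the business cares about, recall for class 0 can be a more informative measure than overall accuracy on its own.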

# Note: This is my first Kaggle work and I am still pursuing Data Science. Please feel free to provide feedback for my improvement. Let's learn together :) 