# Logistic Regression
Logistic regression is a statistical technique used to analyze the relationship between a dependent variable and one or more independent variables. The dependent variable is a binary outcome variable, meaning it can only take two values (e.g., yes or no, 0 or 1, success or failure), while the independent variables can be either continuous or categorical.

The aim of logistic regression is to find the relationship between the independent variables and the probability of the dependent variable taking one of its two possible values. This is achieved by fitting a logistic function (also called a sigmoid function) to the data.

The logistic function is an S-shaped curve that maps any real-valued input to a value between 0 and 1, representing the probability of the dependent variable taking the value 1. The function is defined as:

p(x) = 1 / (1 + exp(-z))

where p(x) is the predicted probability of the dependent variable taking the value 1 given the independent variable(s), and z is a linear combination of the independent variables and their associated coefficients.

The logistic regression model estimates the values of the coefficients that best fit the data by maximizing the likelihood of the observed data given the model. This is typically done using maximum likelihood estimation or gradient descent algorithms.

Once the model is trained, it can be used to predict the probability of the dependent variable taking the value 1 for new observations with known independent variable values. A threshold probability can then be chosen to classify the observations into the two categories based on their predicted probabilities.

Logistic regression is widely used in fields such as finance, marketing, healthcare, and social sciences for tasks such as predicting customer churn, identifying fraud, and diagnosing diseases.





# Random Forest Classifier

Random Forest Regressor is a machine learning algorithm used for regression problems. It is an ensemble method that combines multiple decision trees to make a prediction. In a Random Forest Regressor, a large number of decision trees are created on randomly selected subsets of the training data. Each tree is trained on a random subset of features from the dataset.

When a new observation is presented to the model, each decision tree in the random forest makes a prediction for the target variable, and the final prediction is the average of all the predictions made by individual trees. This averaging helps to reduce the variance and overfitting that can occur when using a single decision tree.

The algorithm works by constructing decision trees recursively, splitting the data into smaller subsets based on the feature that provides the best split. At each step, the algorithm chooses the best feature to split the data based on a criterion such as the reduction in variance or mean squared error. The splitting process continues until a stopping criterion is met, such as reaching a maximum tree depth or a minimum number of samples per leaf node.

One of the advantages of using Random Forest Regressor is its ability to handle missing data and outliers. It is also less prone to overfitting than other regression algorithms due to the averaging effect of multiple decision trees.

Random Forest Regressor is widely used in various applications such as predicting stock prices, housing prices, and customer churn rates. It can be easily implemented using popular machine learning libraries such as scikit-learn in Python.





# Credit Card Customer Attrition

In [48]:
#importing library
import numpy as np
import pandas as pd 
#StandardScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
#LogisticRegression
from sklearn.linear_model import LogisticRegression
#Random Forest Regressor
from sklearn.ensemble import RandomForestClassifier
#Decision Tree Regressor
from sklearn.tree import DecisionTreeClassifier
#Neural Network
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC




# Loading the Dataset

In [49]:
df=pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')
#showing the dataset
df

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,0.000093,0.999910
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,0.000057,0.999940
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.000,0.000021,0.999980
3,769911858,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.760,0.000134,0.999870
4,709106358,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.500,0.000,0.000022,0.999980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,772366833,Existing Customer,50,M,2,Graduate,Single,$40K - $60K,Blue,40,3,2,3,4003.0,1851,2152.0,0.703,15476,117,0.857,0.462,0.000191,0.999810
10123,710638233,Attrited Customer,41,M,2,Unknown,Divorced,$40K - $60K,Blue,25,4,2,3,4277.0,2186,2091.0,0.804,8764,69,0.683,0.511,0.995270,0.004729
10124,716506083,Attrited Customer,44,F,1,High School,Married,Less than $40K,Blue,36,5,3,4,5409.0,0,5409.0,0.819,10291,60,0.818,0.000,0.997880,0.002118
10125,717406983,Attrited Customer,30,M,2,Graduate,Unknown,$40K - $60K,Blue,36,4,3,3,5281.0,0,5281.0,0.535,8395,62,0.722,0.000,0.996710,0.003294


# Getting the Preliminary Information about the dataset

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 23 columns):
 #   Column                                                                                                                              Non-Null Count  Dtype  
---  ------                                                                                                                              --------------  -----  
 0   CLIENTNUM                                                                                                                           10127 non-null  int64  
 1   Attrition_Flag                                                                                                                      10127 non-null  object 
 2   Customer_Age                                                                                                                        10127 non-null  int64  
 3   Gender                                                                           

# Checking for the Missing Values in the Dataset

In [51]:
df.isna().sum()

CLIENTNUM                                                                                                                             0
Attrition_Flag                                                                                                                        0
Customer_Age                                                                                                                          0
Gender                                                                                                                                0
Dependent_count                                                                                                                       0
Education_Level                                                                                                                       0
Marital_Status                                                                                                                        0
Income_Category                                 

# Getting the Information about descriptive Statistics

In [52]:
df.describe()

Unnamed: 0,CLIENTNUM,Customer_Age,Dependent_count,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
count,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0,10127.0
mean,739177600.0,46.32596,2.346203,35.928409,3.81258,2.341167,2.455317,8631.953698,1162.814061,7469.139637,0.759941,4404.086304,64.858695,0.712222,0.274894,0.159997,0.840003
std,36903780.0,8.016814,1.298908,7.986416,1.554408,1.010622,1.106225,9088.77665,814.987335,9090.685324,0.219207,3397.129254,23.47257,0.238086,0.275691,0.365301,0.365301
min,708082100.0,26.0,0.0,13.0,1.0,0.0,0.0,1438.3,0.0,3.0,0.0,510.0,10.0,0.0,0.0,8e-06,0.00042
25%,713036800.0,41.0,1.0,31.0,3.0,2.0,2.0,2555.0,359.0,1324.5,0.631,2155.5,45.0,0.582,0.023,9.9e-05,0.99966
50%,717926400.0,46.0,2.0,36.0,4.0,2.0,2.0,4549.0,1276.0,3474.0,0.736,3899.0,67.0,0.702,0.176,0.000181,0.99982
75%,773143500.0,52.0,3.0,40.0,5.0,3.0,3.0,11067.5,1784.0,9859.0,0.859,4741.0,81.0,0.818,0.503,0.000337,0.9999
max,828343100.0,73.0,5.0,56.0,6.0,6.0,6.0,34516.0,2517.0,34516.0,3.397,18484.0,139.0,3.714,0.999,0.99958,0.99999


# Preprocessing

In [68]:
def onehot_encode(df,column):
    df=df.copy()
    dummies=pd.get_dummies(df[column],prefix=column)
    df=pd.concat([df,dummies],axis=1)
    df=df.drop(column,axis=1)
    return df

In [82]:
def preprocess_inputs(df):
    df=df.copy()
    #dropping the last two column
    df=df.drop(df.columns[-2:],axis=1)
    #dropping Clientnum
    df=df.drop('CLIENTNUM',axis=1)
    #filling unknown to mode the the education columns
    df['Education_Level']=df['Education_Level'].replace({'Unknown':df['Education_Level'].mode()[0]})
    df['Income_Category']=df['Income_Category'].replace({'Unknown':df['Income_Category'].mode()[0]})
    
    education_encoding={'High School':1,
      'Graduate':3,
      'Uneducated':0,
      'College':2,
      'Post-Graduate':4,
      'Doctorate':5}
    df['Education_Level']=df['Education_Level'].replace(education_encoding)
    
    df['Gender']=df['Gender'].apply(lambda x:0 if x=='M' else 1)
    
    df=onehot_encode(df,'Marital_Status')
    
    df=onehot_encode(df,'Card_Category')
    df['Attrition_Flag']=df['Attrition_Flag'].replace({'Existing Customer':0, 'Attrited Customer':1})
    income_encoding={'$60K - $80K':2,
  'Less than $40K':0,
  '$80K - $120K':3,
  '$40K - $60K':1,
  '$120K +':4}
    df['Income_Category']=df['Income_Category'].replace(income_encoding)
    
    
    
    
    y=df['Attrition_Flag']
    x=df.drop('Attrition_Flag',axis=1)
    
    #train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.7)
    scaler=StandardScaler()
    scaler.fit(x_train)
    x_train=pd.DataFrame(scaler.transform(x_train),columns=x_train.columns)
    x_test=pd.DataFrame(scaler.transform(x_test),columns=x_test.columns)
    return x_train,x_test,y_train,y_test

In [58]:
df['Income_Category'].mode()[0]

'Less than $40K'

In [83]:
x_train,x_test,y_train,y_test=preprocess_inputs(df)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(7088, 25)
(3039, 25)
(7088,)
(3039,)


# Checking Unique values in each columns

In [56]:
{column:len(df[column].unique()) for column in df.columns if df[column].dtypes=='object'}

{'Attrition_Flag': 2,
 'Gender': 2,
 'Education_Level': 7,
 'Marital_Status': 4,
 'Income_Category': 6,
 'Card_Category': 4}

# Getting the List of Unique Values in Each Categorical Columns

In [60]:
{column:list(x[column].unique()) for column in x.columns if x[column].dtypes=='object'}

{'Attrition_Flag': ['Existing Customer', 'Attrited Customer'],
 'Gender': ['M', 'F'],
 'Education_Level': ['High School',
  'Graduate',
  'Uneducated',
  'College',
  'Post-Graduate',
  'Doctorate'],
 'Marital_Status': ['Married', 'Single', 'Unknown', 'Divorced'],
 'Income_Category': ['$60K - $80K',
  'Less than $40K',
  '$80K - $120K',
  '$40K - $60K',
  '$120K +'],
 'Card_Category': ['Blue', 'Gold', 'Silver', 'Platinum']}

In [85]:
models={'Logistic Regression':LogisticRegression(),
#Random Forest Regressor
'Random Forest':RandomForestClassifier(),
#Decision Tree Regressor
'Decision Tree':DecisionTreeClassifier(),
#Neural Network
'MLP Classifier':MLPClassifier(),
'Support Vector':SVC()
}


In [88]:
for name,model in models.items():
    model.fit(x_train,y_train)
    print(name)
    print(model.score(x_test,y_test))

Logistic Regression
0.9081934846989141
Random Forest
0.9667653833497861
Decision Tree
0.9381375452451465




MLP Classifier
0.9387956564659428
Support Vector
0.9299111549851925


# Root Mean Square Error

In [89]:
for name,model in models.items():
    model.fit(x_train,y_train)
    y_pred=model.predict(x_test)
    print(name)
    
    print(np.sqrt(np.mean(y_pred-y_test)**2))

Logistic Regression
0.039157617637380716
Random Forest
0.027311615663047056
Decision Tree
0.006252056597564989




MLP Classifier
0.009542612701546561
Support Vector
0.04244817374136229


Loading the Dataset