<a href="https://colab.research.google.com/github/rupalidawkoregithub/Credit-Card-Default-Prediction/blob/main/Individual_Credit_Card_Default_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Credit Card Default Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Rupali Dawkore


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The problem statement for credit card default prediction is to develop a model that accurately predicts the likelihood of a credit card holder defaulting on their debt payment in the near future. The model should take into account various factors such as the cardholder's credit history and other relevant financial information. The goal is to minimize false negatives (predicting a non-default when the cardholder actually defaults) and false positives (predicting a default when the cardholder is able to repay their debt) while maximizing the overall accuracy of the prediction.
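The false negatives and false positives mentioned above can be counted directly from actual and predicted labels. The snippet below is a minimal sketch on made-up labels; `confusion_counts`, `y_true` and `y_pred` are illustrative names and are not part of this notebook's pipeline.

```python
# Illustrative labels only (1 = default, 0 = no default); not taken from the dataset.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted default, customer repaid
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted non-default, customer defaulted
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)       # share of actual defaulters that were caught
precision = tp / (tp + fp)    # share of predicted defaulters that really defaulted
print(tp, fp, fn, tn, accuracy)
```

Recall tracks false negatives and precision tracks false positives, so reporting both next to accuracy shows which kind of error a model is making.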

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing basic libraries for data processing
import numpy as np
import pandas as pd
import math
from datetime import datetime
# Importing libraries for data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Adding this to ignore future warnings
import warnings
warnings.filterwarnings("ignore")
# Importing the missingno library, which helps us visualize missing values
import missingno as msno

In [None]:
#pip install --upgrade xlrd         ##run the cell at every restart of runtime

In [None]:
# Mount Google Drive to import the dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_excel('/content/drive/MyDrive/Credit Card Default Prediction - Classification/default of credit card clients.xls',header=1)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

Checked first five rows of dataset

In [None]:
df.tail()

Checked last five rows of dataset

### Dataset Rows & Columns count

In [None]:
# Get the number of rows and columns
rows, columns = df.shape

In [None]:
# Print the number of rows and columns
print("Number of rows: ", rows)
print("Number of columns: ", columns)

There are 30000 rows and 25 columns in the dataset.

### Dataset Information

In [None]:
# Dataset Info
df.info()

All columns have an integer dtype.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"Number of duplicated rows in default of credit card clients dataset: {df.duplicated().sum()}")

We do not have any duplicated rows in the dataset and that is very good for us.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values 
print(f"Null values count in default of credit card clients dataset:\n{df.isna().sum()}\n")
print("-"*50)
print(f"Infinite values count in default of credit card clients dataset:\n{df.isin([np.inf, -np.inf]).sum()}\n")

There are no null or infinite values in the default of credit card clients dataset.

In [None]:
# Visualizing the missing values in default of credit card clients dataset
msno.bar(df,figsize=(10,5), color="tab:green")

The dataset does not contain any NA values, null values or duplicates.

### What did you know about your dataset?

The data provided is a sample of a credit card default dataset. The first row provides the header information, with each column indicating a feature of the credit card holder. The first column (ID) is the unique identifier for each record. The second column (LIMIT_BAL) indicates the credit limit of the credit card. The third column (SEX) indicates the gender of the cardholder. The fourth column (EDUCATION) indicates the level of education of the cardholder. The fifth column (MARRIAGE) indicates the marital status of the cardholder. The sixth column (AGE) indicates the age of the cardholder. The seventh to twelfth columns (PAY_0 and PAY_2 to PAY_6) indicate the repayment status for the last six months. The remaining columns (BILL_AMT1 to BILL_AMT6, PAY_AMT1 to PAY_AMT6) indicate the amount of the bill statement and the amount paid in the last six months. The final column (default payment next month) is the target variable and indicates whether the cardholder defaulted on their payment in the next month (1) or not (0).

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

# Descriptive Statistics

Next, we look at descriptive statistics such as count, mean, standard deviation, minimum, maximum and quantiles.

In [None]:
# Dataset Describe
df.describe().T

**Inference:**

- There are 30,000 distinct credit card clients.
- The average credit limit (LIMIT_BAL) is NT$ 167,484.
- The credit limit has a high standard deviation: the median is NT$ 140,000 while the maximum reaches NT$ 1,000,000.
- The average age is about 35 and the median is 34, with a standard deviation of 9.2. The mean sits above the median because of a small number of much older clients; the maximum age is 79.
- Bill Amount and Pay Amount show that some clients have extremely high bill amounts, which may be due to a higher credit limit or to pending dues adding up.
- For the bill amount, the monthly mean is around NT$ 40,000 to 50,000, with an extreme value of NT$ 1,664,089 in bill amount 3.
- For the pay amount, the monthly mean is around NT$ 4,800 to 5,800, with some extreme values exceeding NT$ 1,000,000.
- As the value 0 for default payment means 'not default' and the value 1 means 'default', the mean of 0.221 means that 22.1% of credit card contracts default next month (this is verified in the next sections of this analysis).
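The last inference relies on the fact that the mean of a 0/1 column equals the share of 1s. A small sketch on a hypothetical toy column (`toy_target` is an illustrative name, not the real data) shows the equivalence:

```python
import pandas as pd

# Toy target column (hypothetical values): 1 = default, 0 = no default.
toy_target = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], name="default payment next month")

default_rate = toy_target.mean()                           # mean of the 0/1 values
share_of_ones = (toy_target == 1).sum() / len(toy_target)  # explicit count of defaults / rows
print(f"mean = {default_rate:.2f}, share of 1s = {share_of_ones:.2f}")
```

On the real data, the same reasoning is what turns the mean reported by `df.describe()` into a default rate.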

In [None]:
df['LIMIT_BAL'].value_counts()

### Variables Description 

**ID:** ID of each client

**LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)

**SEX:** Gender (1=male, 2=female)

**EDUCATION:** (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

**MARRIAGE:** Marital status (1=married, 2=single, 3=others)

**AGE:** Age in years

**PAY_0:** Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ..., 8=payment delay for eight months, 9=payment delay for nine months and above)

**PAY_2:** Repayment status in August, 2005 (scale same as above)

**PAY_3:** Repayment status in July, 2005 (scale same as above)

**PAY_4:** Repayment status in June, 2005 (scale same as above)

**PAY_5:** Repayment status in May, 2005 (scale same as above)

**PAY_6:** Repayment status in April, 2005 (scale same as above)

**BILL_AMT1:** Amount of bill statement in September, 2005 (NT dollar)

**BILL_AMT2:** Amount of bill statement in August, 2005 (NT dollar)

**BILL_AMT3:** Amount of bill statement in July, 2005 (NT dollar)

**BILL_AMT4:** Amount of bill statement in June, 2005 (NT dollar)

**BILL_AMT5:** Amount of bill statement in May, 2005 (NT dollar)

**BILL_AMT6:** Amount of bill statement in April, 2005 (NT dollar)

**PAY_AMT1:** Amount of previous payment in September, 2005 (NT dollar)

**PAY_AMT2:** Amount of previous payment in August, 2005 (NT dollar)

**PAY_AMT3:** Amount of previous payment in July, 2005 (NT dollar)

**PAY_AMT4:** Amount of previous payment in June, 2005 (NT dollar)

**PAY_AMT5:** Amount of previous payment in May, 2005 (NT dollar)

**PAY_AMT6:** Amount of previous payment in April, 2005 (NT dollar)

**default payment next month:** Default payment (1=yes, 0=no)
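A quick way to compare the data against this description is to look for values that fall outside the documented codes. The sketch below uses a hypothetical toy frame; `documented_codes` and `undocumented_values` are illustrative names introduced here.

```python
import pandas as pd

# Codes documented above; the toy frame below is hypothetical sample data.
documented_codes = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4, 5, 6},
    "MARRIAGE": {1, 2, 3},
}
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "EDUCATION": [1, 0, 6, 2],
    "MARRIAGE": [0, 1, 2, 3],
})

def undocumented_values(frame, codes):
    """Map each column to the sorted list of values missing from its documented code set."""
    found = {}
    for col, allowed in codes.items():
        extra = sorted(set(frame[col].unique()) - allowed)
        if extra:
            found[col] = [int(v) for v in extra]
    return found

unexpected = undocumented_values(toy, documented_codes)
print(unexpected)
```

Any column that shows up in the result needs a decision during data wrangling, for example merging the unexpected codes into an existing category.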


### Check Unique Values for each variable.

In [None]:
df.columns

In [None]:
# Check Unique Values for each variable.
def get_all_unique_values(df):
    for col in df.columns:
        print(f"Unique values in column '{col}':")
        print(df[col].unique())
        print("-"*50)

In [None]:
# Get and print all unique values
get_all_unique_values(df)

In [None]:
# The ID column is not needed, so drop it
df.drop(['ID'],axis=1,inplace=True)

In [None]:
pd.options.display.max_columns = 25
df.head(10)


## 3. ***Data Wrangling***

### Data Wrangling Code

Copied dataset for data wrangling.

In [None]:
df1 = df.copy()

Some of the columns didn’t make sense to me, so I decided to rename them into more understandable terms.

In [None]:
# Rename target column (applied to the df1 copy so that df keeps its original names)
df1 = df1.rename(columns={'default payment next month':'default'})

In [None]:
df1.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
df1.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
df1.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)

In [None]:
# Replacing coded values with their labels
df1.replace({'SEX': {1 : 'Male', 2 : 'Female'}}, inplace=True)
df1.replace({'EDUCATION' : {1 : 'Graduate School', 2 : 'University', 3 : 'High School', 4 : 'Others'}}, inplace=True)
df1.replace({'MARRIAGE' : {1 : 'Married', 2 : 'Single', 3 : 'Others'}}, inplace = True)
df1.replace({'default' : {1 : 'Yes', 0 : 'No'}}, inplace = True)

In [None]:
# After renamed look of dataset
df1.head(10)

In [None]:
#category wise values
df1['EDUCATION'].value_counts()

In the EDUCATION column, the values 5, 6 and 0 are unknown categories. Let's combine them into "Others".

In [None]:
# Replace the values 5, 6 and 0 with Others
df1.EDUCATION = df1.EDUCATION.replace({5: "Others", 6: "Others",0: "Others"})

In [None]:
# Rechecking value count of education column after combine others values
df1['EDUCATION'].value_counts()

In [None]:
#category wise values
df1['MARRIAGE'].value_counts()

In the MARRIAGE column, the value 0 is an unknown category. Combine it into the "Others" category.

In [None]:
#replace 0 with Others
df1.MARRIAGE = df1.MARRIAGE.replace({0: "Others"})

In [None]:
# Rechecking
df1['MARRIAGE'].value_counts()

**Checking Distribution of default or non-default case**



In [None]:
mi0 = df[df['default payment next month']==0]
mi1 = df[df['default payment next month']==1]

In [None]:
len(mi0)  # Number of non-default rows => 0 or NO

In [None]:
len(mi1)  # Number of default rows => 1 or YES

In [None]:
df.columns

In [None]:
con_col=['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
# Checking Distribution of Non - Default 
plt.figure(figsize=(20,20))
for num,i in enumerate(con_col):
    plt.subplot(6,5,num+1)
    sns.histplot(mi0[i],kde=True,color='g')  # histplot replaces the deprecated distplot
    plt.tight_layout()
plt.show()

In [None]:
# Checking Distribution of Default Case 
plt.figure(figsize=(20,20))
for num,i in enumerate(con_col):
    plt.subplot(6,5,num+1)
    sns.histplot(mi1[i],kde=True,color='r')  # histplot replaces the deprecated distplot
    plt.tight_layout()
plt.show()

### What all manipulations have you done and insights you found?

1. Renamed the columns to more meaningful names.

2. Grouped the unknown EDUCATION categories (0, 5, 6) into "Others".

3. Grouped the unknown MARRIAGE category (0) into "Others".

4. Checked the distribution of each variable for default and non-default cases.
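The labelling and grouping of categories can also be done in one step. The following is a sketch on hypothetical raw codes, assuming that every code missing from the mapping should fall into "Others"; `education_labels` and `raw_education` are illustrative names.

```python
import pandas as pd

education_labels = {1: "Graduate School", 2: "University", 3: "High School", 4: "Others"}

# Hypothetical raw codes, including the undocumented values 0, 5 and 6.
raw_education = pd.Series([1, 2, 3, 4, 5, 6, 0])

# Codes missing from the mapping become NaN, which fillna then turns into "Others".
education_clean = raw_education.map(education_labels).fillna("Others")
print(education_clean.tolist())
```

Compared with chained `replace` calls, `map` plus `fillna` leaves no mixed column of strings and leftover integers.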

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Univariate**

In [None]:
df1.columns

## Chart - 1

In [None]:
# Plotting payment status using countplot
pay_col = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']
plt.figure(figsize=(20,20))
for num,i in enumerate(pay_col):
    plt.subplot(4,3,num+1)
    sns.countplot(x=df1[i])  # pass the column as x so that it is read as the plotted variable
    plt.tight_layout()
plt.show()

Each column represents the status of payment for a given month:

**1.'PAY_SEPT':** Represents the payment status in September.

**2.'PAY_AUG':** Represents the payment status in August.

**3.'PAY_JUL':** Represents the payment status in July.

**4.'PAY_JUN':** Represents the payment status in June.

**5.'PAY_MAY':** Represents the payment status in May.

**6.'PAY_APR':** Represents the payment status in April.

The countplot is used to visualize the distribution of the values in each of these columns. It shows the count of each unique value in the column, creating a histogram-like representation. The goal of this visualization is to see the frequency of each payment status, which can help understand the overall payment behavior of credit card customers in the data.






## Chart - 2

**Distribution of balance limit of credit card of customer**

In [None]:
import plotly.express as px
fig1 = px.histogram(df1, x = 'LIMIT_BAL', marginal = 'box',
                    title = 'Distribution of balance limit of card', 
                    labels = {'x': 'Dollar($)', 'y': 'Number of card'},
                   color_discrete_sequence=px.colors.qualitative.Antique)
fig1.update_layout(width=900, height=700)
fig1.show()

The credit limit is right-skewed: the middle 50% of values lie between 50K and 240K, and a few limits go beyond 530K NT dollars.

In [None]:
# values count plot of Default
plt.figure(figsize=(5,5))
sns.countplot(x = 'default', data = df1)

This plot shows the class distribution of the target variable. The classes are not proportionate: there are **roughly 23,400 non-defaulters and 6,600 defaulters**, so this is a case of **imbalanced data**.
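One practical consequence of the imbalance is that accuracy alone is misleading. The sketch below computes the accuracy of a naive model that always predicts the majority class, using the 22.1% default rate noted in the descriptive statistics; the function names are illustrative.

```python
def majority_baseline_accuracy(n_non_default, n_default):
    """Accuracy of a model that always predicts the larger class."""
    total = n_non_default + n_default
    return max(n_non_default, n_default) / total

def minority_recall_of_baseline(n_non_default, n_default):
    """Recall on defaulters when every customer is predicted as the majority class."""
    return 1.0 if n_default > n_non_default else 0.0

# Per 1,000 customers, using the 22.1% default rate noted earlier (rounded).
baseline_acc = majority_baseline_accuracy(779, 221)
baseline_recall = minority_recall_of_baseline(779, 221)
print(f"accuracy = {baseline_acc:.3f}, recall on defaulters = {baseline_recall:.1f}")
```

A model therefore has to beat roughly 78% accuracy, and do so while actually catching defaulters, before it adds any value.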

## Chart - 3

In [None]:
# values count plot of Education
plt.figure(figsize=(5,5))
sns.countplot(x = 'EDUCATION', data = df1)

In [None]:
# values count plot of Marriage
plt.figure(figsize=(5,5))
sns.countplot(x = 'MARRIAGE', data = df1)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

# **Bivariate**

## Chart - 4

**Education qualification of the card holders**

In [None]:
df_vis = df1['EDUCATION'].value_counts().reset_index()
df_vis.columns = ['Education', 'No of people']
fig1 = px.pie(df_vis, values = 'No of people', names = 'Education',color_discrete_sequence =  px.colors.sequential.Plasma,
             title = 'Education qualification of credit card holder')
fig1.update_layout(width=500, height=400)
fig1.show()

- 47% have a university qualification
- 35% have a graduate school qualification
- 16% have a high school qualification
- 2% have an unknown qualification

## Chart - 5

**Distribution of Age of credit card holder**

In [None]:
fig2 = px.histogram(df1, x = 'AGE', marginal = 'box',
                    title = 'Distribution of Age of card holder', 
                    labels = {'AGE': 'Age (years)'},
                   color_discrete_sequence=px.colors.qualitative.D3,
                   nbins = 75)
fig2.update_layout(width=500, height=400, yaxis_title='Number of cards')
fig2.update_traces(marker_line_width=1,marker_line_color="white")
fig2.show()

## Chart - 6

 **Distribution of credit limit for default and non-default cases**

In [None]:
sns.boxplot(x="default", y="LIMIT_BAL", data=df1)
plt.show()

**Sex with Respect to Default** 

**or** 

**How many male and female card holders are defaulters?**

In [None]:
# count plot for Sex and with respect to Default
fig, axes = plt.subplots(ncols=2,figsize=(10,5))
sns.countplot(x = 'SEX', ax = axes[0], data = df1)
sns.countplot(x = 'SEX', hue = 'default',ax = axes[1], data = df1)

**Education with Respect to Default**

In [None]:
# count plot for EDUCATION and with respect to Default
fig, axes = plt.subplots(ncols=2,figsize=(18,5))
sns.countplot(x = 'EDUCATION', ax = axes[0], data = df1)
sns.countplot(x = 'EDUCATION', hue = 'default',ax = axes[1], data = df1)

**Marriage with Respect to Default**

In [None]:
# count plot for MARRIAGE and with respect to Default
fig, axes = plt.subplots(ncols=2,figsize=(10,5))
sns.countplot(x = 'MARRIAGE', ax = axes[0], data = df1)
sns.countplot(x = 'MARRIAGE', hue = 'default',ax = axes[1], data = df1)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

# **Multivariate**

## Chart - 7

In [None]:
# Chart - 7 visualization code
# Collect the variables needed for the plot
var = df1[['SEX', 'LIMIT_BAL','AGE']].copy()
var['default'] = df1['default']

# Upper-case the SEX labels (df1 already stores 'Male'/'Female' instead of the codes 1/2)
var['SEX'] = var['SEX'].str.upper()


In [None]:
#taking catplot for the given variable
sns.catplot(x = "SEX",
            y = "LIMIT_BAL",
            kind = "box",
            hue = "default",
            color = '#0c4f4e',
            data = var, saturation = 2,
            margin_titles = True).set(title = "limit balance by sex and default payments");

##### 1. Why did you pick the specific chart?

Seaborn's catplot creates categorical plots, which show the relationship between a categorical variable (here SEX) and a continuous variable (here LIMIT_BAL). With kind = "box", the plot visualizes the distribution and spread of the credit limit within each group.

##### 2. What is/are the insight(s) found from the chart?

There are more female defaulters than male defaulters, and the female group shows more outliers in the LIMIT_BAL variable.

## Chart - 8 

### Correlation Heatmap

In [None]:
df1.columns

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize = [25, 15])
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## Chart - 9 
### Pair Plot 

In [None]:
# Pair Plot visualization code
# sns.pairplot(df)
# plt.show()

##### 1. Why did you pick the specific chart?

The pairplot was chosen in this case to visualize the relationships and distributions between multiple variables in the dataframe. Pairplots are a convenient way to quickly visualize relationships between variables.

##### 2. What is/are the insight(s) found from the chart?

Pairplots can help identify relationships between variables, outliers, skewness, and other patterns in the data that can inform further analysis and modeling.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. Null Hypothesis - There is no relation between Categorical Variables and Default

   Alternate Hypothesis - There is a relationship between Categorical Variables and Default

2. Null Hypothesis - There is no relation between Numeric Variable and Default

   Alternate Hypothesis - There is a relation between Numeric Variable and Default

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1- Null Hypothesis - There is no relation between Categorical Variables and Default

Alternate Hypothesis - There is a relationship between Categorical Variables and Default

#### 2. Perform an appropriate statistical test.

In [None]:
df.columns

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

def hypothesis_test_chi2(df, categorical_variable, alpha=0.05):
    # Build the contingency table of default status against the categorical variable
    cont = pd.crosstab(df['default payment next month'], df[categorical_variable])

    # Conduct a chi-square test for the independence of the categorical variable and default
    chi2, p_value, dof, expected = chi2_contingency(cont)

    # Make a decision based on the p-value and alpha
    if p_value < alpha:
        return f"Reject the null hypothesis. There is a significant association between {categorical_variable} and default."
    else:
        return f"Fail to reject the null hypothesis. There is no significant association between {categorical_variable} and default."

In [None]:
# Define a list of categorical variables
categorical = ['SEX', 'EDUCATION', 'MARRIAGE','PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
# Loop through the list of categorical variables
for categorical_variable in categorical:
    result = hypothesis_test_chi2(df, categorical_variable)
    print(result)

##### Which statistical test have you done to obtain P-Value?

The function hypothesis_test_chi2 performs a chi-square test for independence to obtain the p-value. The chi2_contingency function from the scipy.stats library is used to calculate the chi-square statistic and the p-value.

##### Why did you choose the specific statistical test?

The chi-square test for independence is a common test for determining if there is a significant association between two categorical variables. In this case, the categorical variable of interest and the binary outcome variable "default" are being tested for independence. The choice of the chi-square test for independence is appropriate for this type of analysis.
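To make the test less of a black box, the Pearson chi-square statistic can be computed by hand from a contingency table. This is a sketch on a hypothetical 2x2 table, not on the notebook's data. It omits the continuity correction that `chi2_contingency` applies by default to 2x2 tables, so the two results will differ slightly in that case.

```python
import numpy as np

# Hypothetical 2x2 contingency table: rows = default (no, yes), columns = two categories.
observed = np.array([[90.0, 60.0],
                     [10.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts if the two variables were independent.
expected = row_totals @ col_totals / grand_total

# Pearson chi-square statistic (no continuity correction) and degrees of freedom.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(chi2_stat, dof)
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.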

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

2- Null Hypothesis - There is no relation between Numeric Variable and Default

Alternate Hypothesis - There is a relation between Numeric Variable and Default

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

def hypothesis_test_t(df, numerical_variable, alpha=0.05):
    # Split the data into default and non-default groups
    default = df[df['default payment next month'] == 1][numerical_variable]
    non_default = df[df['default payment next month'] == 0][numerical_variable]

    # Conduct a two-sample t-test for the means of the numerical variable for default and non-default groups
    t, p_value = ttest_ind(default, non_default)

    # Make a decision based on the p-value and alpha
    if p_value < alpha:
        return f"Reject the null hypothesis. There is a significant difference in the means of {numerical_variable} between default and non-default groups."
    else:
        return f"Fail to reject the null hypothesis. There is no significant difference in the means of {numerical_variable} between default and non-default groups."

In [None]:
numerical_columns = ['LIMIT_BAL','AGE','BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
for col in numerical_columns:
    result = hypothesis_test_t(df, col, alpha=0.05)
    print(result)

##### Which statistical test have you done to obtain P-Value?

The function hypothesis_test_t performs a two-sample t-test to obtain the p-value. The ttest_ind function from the scipy.stats library is used to calculate the t-statistic and the p-value.

##### Why did you choose the specific statistical test?

The two-sample t-test is a common test for determining whether the means of two groups differ significantly. Here the two groups are the default and non-default customers, compared on each numerical variable. The test suits continuous variables, and with about 30,000 records the sample is large, so the sampling distribution of each group mean is close to normal even though the variables themselves are skewed.
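As an illustration of what the t statistic measures, the sketch below computes it by hand for two small hypothetical samples. It uses Welch's variant, which does not assume equal variances and corresponds to `ttest_ind(..., equal_var=False)`, rather than the pooled-variance default used in the cell above; `welch_t` is an illustrative helper.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic: difference in means over the unpooled standard error."""
    var_a, var_b = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    std_err = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / std_err

# Hypothetical credit limits (in thousands) for the two groups.
defaulters = [20, 50, 30, 80, 50]
non_defaulters = [100, 150, 200, 120, 180]
t_stat = welch_t(defaulters, non_defaulters)
print(round(t_stat, 3))
```

A large absolute value of the statistic means the group means are far apart relative to their sampling noise.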

## ***6. Feature Engineering & Data Pre-processing***

#### To identify the categorical, numerical columns, and input and target columns

In [None]:
df1.columns

In [None]:
# independent variables
Input_columns=[ 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_SEPT',
       'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR', 'BILL_AMT_SEPT',
       'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY',
       'BILL_AMT_APR', 'PAY_AMT_SEPT', 'PAY_AMT_AUG', 'PAY_AMT_JUL',
       'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR']
# dependent variable
Target_column=["default"]

In [None]:
categorical_columns = ['SEX','EDUCATION','MARRIAGE','PAY_SEPT', 'PAY_AUG',
       'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']

In [None]:
numerical_columns = ['LIMIT_BAL','AGE','BILL_AMT_SEPT',
       'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY',
       'BILL_AMT_APR', 'PAY_AMT_SEPT', 'PAY_AMT_AUG', 'PAY_AMT_JUL',
       'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR']

### 1. Handling Missing Values

In [None]:
# Looking for null value by using .info
df1.info()

### 2. Handling Outliers

In [None]:
# Checking skewness for numerical columns
import scipy.stats as stats
for col in numerical_columns:
    skewness = stats.skew(df1[col])
    print("Skewness of column {}: {:.2f}".format(col, skewness))

In [None]:
plt.figure(figsize=(15,15))
for num,cols in enumerate(numerical_columns):
    plt.subplot(5,3,num+1)
    sns.boxplot(x=df1[cols])
    plt.title(f'{cols.title()}',weight='bold')
    plt.tight_layout()
    #print(' Box Plot of',cols)
plt.show()

From the boxplots, it can be observed that all fourteen numerical columns (LIMIT_BAL, AGE, the six BILL_AMT_* columns and the six PAY_AMT_* columns) contain outliers.

In [None]:
print("Number of Outlier Records:")

for col in numerical_columns:
    upper = df1[col].quantile(0.75) + 1.5 * (df1[col].quantile(0.75) - df1[col].quantile(0.25))
    outliers = df1[df1[col] > upper][col].count()
    print("{}: {}".format(col, outliers))


The code above calculates the upper outlier bound for each column as Q3 + 1.5 × IQR, where IQR is the interquartile range (Q3 - Q1), and then counts the values in each column that are greater than this bound.

# Capping

In [None]:
for col in numerical_columns:
    upper = df1[col].quantile(0.75) + 1.5 * (df1[col].quantile(0.75) - df1[col].quantile(0.25))
    capped = (df1[col] > upper).sum()  # number of values that will be capped in this column
    df1[col] = np.where(df1[col] > upper, upper, df1[col])
    print("{}: {}".format(col, capped))


The upper bound of outliers is calculated using the interquartile range (IQR) as in the previous code. Then, using np.where, the values in each column that are greater than the upper bound are replaced with the upper bound, which caps the outliers.
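The same capping can be written with `Series.clip`. The following is a minimal sketch on hypothetical values; `cap_upper_iqr` is an illustrative helper and not part of the notebook.

```python
import pandas as pd

def cap_upper_iqr(series, k=1.5):
    """Cap values above Q3 + k * IQR at that bound; lower values are left unchanged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return series.clip(upper=upper), upper

# Hypothetical values with one obvious high outlier.
toy_values = pd.Series([10, 20, 30, 40, 500])
capped_values, upper_bound = cap_upper_iqr(toy_values)
print(upper_bound, capped_values.tolist())
```

Only the upper tail is capped here, which mirrors the cell above; negative bill amounts are handled separately in the feature manipulation step.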

In [None]:
# Rechecking outliers for numerical columns
plt.figure(figsize=(15,15))
for num,cols in enumerate(numerical_columns):
    plt.subplot(5,3,num+1)
    sns.boxplot(x=df1[cols])
    plt.title(f'{cols.title()}',weight='bold')
    plt.tight_layout()
    #print(' Box Plot of',cols)
plt.show()

In [None]:
df1.head()

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
def replace_values(df1, col, values):
    # This code is replacing all values of -2, -1, and 0 in the specified columns with 0.
    fil = (df1[col] == -2) | (df1[col] == -1) | (df1[col] == 0)
    df1.loc[fil, col] = values

columns = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']				
values = 0

for col in columns:
    replace_values(df1, col, values)

This code is replacing all values of -2, -1, and 0 in the specified columns with 0.

In [None]:
def replace_values(df1, col, values):
    #  This code replaces all values less than 0 in the specified columns with 0.
    fil = (df1[col] < 0)
    df1.loc[fil, col] = values

columns = ['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR']						
values = 0

for col in columns:
    replace_values(df1, col, values)

A negative bill amount indicates that the customer has already paid more than the amount due, so we set those values to 0 as above.

#### 2. Feature Selection

In [None]:
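# Relabel customers with no payment delay in any month as non-defaulters.
# The target holds 'Yes'/'No' at this point (it is label-encoded later), so compare against the labels.
fil = (df1.PAY_SEPT == 0) & (df1.PAY_AUG == 0) & (df1.PAY_JUL == 0) & (df1.PAY_JUN == 0) & (df1.PAY_MAY == 0) & (df1.PAY_APR == 0) & (df1['default'] == 'Yes')
df1.loc[fil,'default'] = 'No'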


Customers who had no payment delay in any of the six months are treated as non-defaulters, so their target label has been reset to non-default as above.

In [None]:
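# Relabel customers with a payment delay in every month as defaulters.
# The target holds 'Yes'/'No' at this point (it is label-encoded later), so compare against the labels.
fil = (df1.PAY_SEPT > 0) & (df1.PAY_AUG > 0) & (df1.PAY_JUL > 0) & (df1.PAY_JUN > 0) & (df1.PAY_MAY > 0) & (df1.PAY_APR > 0) & (df1['default'] == 'No')
df1.loc[fil,'default'] = 'Yes'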

Customers who had a payment delay in every one of the six months are treated as defaulters, so their target label has been set to default as above.

**Binning the 'AGE' column**

In [None]:
print(df1['AGE'].min())
print(df1['AGE'].max())

After the capping step, the minimum age in the dataset is 21.0 and the maximum age is 60.5.

In [None]:
# Function that assigns each age to a cohort
def age(x):
    # Comparisons also handle non-integer ages such as the capped value 60.5,
    # which range() membership would miss and turn into a missing value
    if x <= 40:
        return 1
    elif x <= 60:
        return 2
    else:
        return 3

df1['AGE']=df1['AGE'].apply(age)

The ages are grouped into three categories. Applying the age function to the AGE column replaces each original age with its category code.
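An alternative way to build the same three cohorts is `pd.cut`, which works on intervals and therefore also handles fractional ages. This is a sketch on hypothetical ages; the bin edges are an assumption chosen to match the 21 to 40, 41 to 60 and 61 to 79 groups.

```python
import pandas as pd

# Hypothetical ages, including a fractional value such as a capped outlier.
toy_ages = pd.Series([21, 40, 41, 60, 60.5, 79])

# Right-closed bins: (20, 40] -> 1, (40, 60] -> 2, (60, 80] -> 3
age_group = pd.cut(toy_ages, bins=[20, 40, 60, 80], labels=[1, 2, 3]).astype(int)
print(age_group.tolist())
```

Because every age between 20 and 80 falls into exactly one interval, no value is left without a cohort.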

In [None]:
# Visualizing the age groups, split by default status
plt.figure(figsize=(10,8),dpi=60)
sns.countplot(x='AGE',data=df1,hue='default')

In [None]:
df1.head()

**Observation:**
In our dataset, most credit card holders are between 21 and 40 years old, so the company's customers are mostly young adults.

**Binning the 'PAY' column**

In [None]:
def bins(x):
    # Map a repayment-status code to a delay bucket
    if x in (-2, -1, 0):
        return 'Paid Duly'
    if x in range(1,4):
        return '1 to 3'
    if x in range(4,7):
        return '4 to 6'
    if x in range(7,10):  # upper limit 10 so that the value 9 is included
        return '7 to 9'

for i in ['PAY_SEPT','PAY_AUG','PAY_JUL','PAY_JUN','PAY_MAY','PAY_APR']:
    df1[i]=df1[i].apply(bins)

The 'bins' function is applied to every value of the columns 'PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', and 'PAY_APR' in the DataFrame 'df1', mapping each value to one of four categorical bins:

*   Paid Duly (for values of -2, -1, or 0)
*   1 to 3 (for values in the range 1 to 3)
*   4 to 6 (for values in the range 4 to 6)
*   7 to 9 (for values in the range 7 to 9)

In [None]:
df1.head()

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 4. Categorical Encoding

**Label Encoding**

In [None]:
df1.head()

In [None]:
#label encoding
encoders_nums = {"SEX":{"Male":0,"Female":1}, "default":{"Yes":1, "No":0}}
df1 = df1.replace(encoders_nums)

In [None]:
# check for changed labels
df1.head(5)

**One Hot Encoding**

In [None]:
# Encode your categorical columns
# Importing
from sklearn.preprocessing import OneHotEncoder

In [None]:
#categorical features
categorical_cols_to_encode = ['EDUCATION', 'MARRIAGE','PAY_SEPT','PAY_AUG','PAY_JUL','PAY_JUN','PAY_MAY','PAY_APR']

In [None]:
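# `sparse_output` and `get_feature_names_out` replace the older `sparse` and
# `get_feature_names`, which recent scikit-learn releases no longer accept
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(df1[categorical_cols_to_encode])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols_to_encode))
df1[encoded_cols] = encoder.transform(df1[categorical_cols_to_encode])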

In [None]:
encoded_cols

In [None]:
df1.drop(['EDUCATION', 'MARRIAGE','PAY_SEPT','PAY_AUG','PAY_JUL','PAY_JUN','PAY_MAY','PAY_APR'],axis=1,inplace=True)

In [None]:
df1.shape

In [None]:
pd.options.display.max_columns = 47
df1.head()

In [None]:
df1.columns

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 7. Dimensionality Reduction

### 9. Handling Imbalanced Dataset

##### Checking if Data is Imbalanced

In [None]:
# Percentage share of each class
print(df1['default'].value_counts(normalize=True)*100)
sns.countplot(x=df1['default'])
plt.show()

In [None]:
df1.default.value_counts() 

##### Do you think the dataset is imbalanced? Explain Why.

Yes, the data is imbalanced. About 77.88% of the observations belong to class 0 and only 22.12% to class 1, so the class distribution is heavily skewed towards class 0. An imbalanced dataset is one in which the classes (in this case, 0s and 1s) are not equally represented. This imbalance can make it difficult to train machine learning models, because the model may become biased towards predicting the majority class and, as a result, perform poorly on the minority class.
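
To see why this skew matters, consider a hypothetical classifier that always predicts class 0. The short calculation below uses only the two percentages quoted above; the variable names are illustrative.

```python
# Class shares reported above (in percent).
share_non_default = 77.88
share_default = 22.12

# A hypothetical classifier that always predicts class 0 (non-default):
baseline_accuracy = share_non_default / 100  # every class-0 row is correct
baseline_recall_default = 0.0                # no defaulter is ever flagged

# How many non-defaulters there are for each defaulter.
imbalance_ratio = share_non_default / share_default

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"recall on defaulters: {baseline_recall_default:.1f}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

Such a model looks accurate while missing every defaulter, which is why accuracy alone is not a sufficient metric for this dataset.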

In [None]:
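# Handling Imbalanced Dataset (If needed)
# X_train and y_train are only created in the Data Splitting step below, so report the shapes from df1 here
print('Before OverSampling, the shape of X: {}'.format(df1.drop('default', axis=1).shape))
print('Before OverSampling, the shape of y: {} \n'.format(df1['default'].shape))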

In [None]:
# Replace missing values with the mean of the column
# df1.fillna(df1.mean(), inplace=True)

# #importing SMote to make our dataset balanced
# from imblearn.over_sampling import SMOTE

# # Apply SMOTE to balance the dataset
# smote = SMOTE()
# X_smote, y_smote = smote.fit_resample(df1.drop('default', axis=1), df1['default'])

# print('Original dataset shape', len(df1))
# print('Resampled dataset shape', len(y_smote))

In [None]:
#importing SMOTE to handle class imbalance
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(df1.drop('default', axis=1), df1['default'])

print('Original unbalanced dataset shape', len(df1))
print('Resampled balanced dataset shape', len(y_smote))

In [None]:
#creating new dataframe from balanced dataset after SMOTE
balanced_df = pd.DataFrame(x_smote, columns=[c for c in df1.columns if c != 'default'])

In [None]:
#adding target variable to new created dataframe
balanced_df['default'] = y_smote


In [None]:
#check for class imbalance
plt.figure(figsize=(5,5))
sns.countplot(x='default', data = balanced_df)


In [None]:
#shape of balanced dataframe
balanced_df.shape

In [None]:
balanced_df.head()

In [None]:
# #seperating dependant and independant variabales
# X = balanced_df[(list(i for i in list(balanced_df.describe(include='all').columns) if i != 'dfault'))]
# y = balanced_df['default']

In [None]:
#X.shape

In [None]:
#y.shape

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

The technique used to handle the imbalanced dataset is the Synthetic Minority Over-sampling Technique (**SMOTE**). SMOTE is a commonly used oversampling technique for imbalanced datasets. It creates synthetic samples of the minority class by interpolating between existing minority samples and their nearest neighbours, instead of simply duplicating existing samples. This balances the class distribution and reduces the bias towards the majority class that can arise when a model is trained on a highly imbalanced dataset. SMOTE was chosen here because it generates new samples of the minority class while still preserving the characteristics of the original data.
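
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the `imblearn` implementation: the function `smote_like_sample`, the toy points, and the choice of `k=2` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point by interpolating towards a random near neighbour."""
    base = points[rng.integers(len(points))]
    # Euclidean distance from the base point to every minority point.
    dist = np.linalg.norm(points - base, axis=1)
    # Indices of the k nearest neighbours, skipping the base point itself.
    neighbours = np.argsort(dist)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]
    # Random position on the segment between base and neighbour.
    gap = rng.random()
    return base + gap * (neighbour - base)

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on a line segment between two real minority points, so it stays inside the region the minority class already occupies rather than being an exact copy.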

### 6. Data Splitting

In [None]:
# independent variables (features)
X = balanced_df.drop("default", axis = 1)

# dependent variable (label)
y = balanced_df["default"]

In [None]:
# Split your data to train and test
#importing libraries for splitting data into training and testing dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, shuffle = True, random_state = 11)

In [None]:
X_train.shape

In [None]:
X_test.shape

##### What data splitting ratio have you used and why? 

Answer Here.

### 6. Data Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler 

In [None]:
# Scaling your data
scaler = StandardScaler()
X_train_scale = scaler.fit_transform(X_train)
X_test_scale = scaler.transform(X_test)

X_test = pd.DataFrame(X_test_scale, columns = X_test.columns)
X_train = pd.DataFrame(X_train_scale, columns = X_train.columns)

##### Which method have you used to scale your data and why?

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

## ***7. ML Model Implementation***

In [None]:
# plot_precision_recall_curve is not used below and is no longer available in recent scikit-learn releases
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import math
import time

#### **Creating Function**

In [None]:
def model_score(model, X_train, y_train, X_test, y_test):
    # Fit the model, then predict on both the train and the test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    # Probability of the positive class, needed for a meaningful ROC AUC
    y_score = model.predict_proba(X_test)[:, 1]

    # scikit-learn metrics expect the arguments in the order (y_true, y_pred)
    train_accuracy = round(accuracy_score(y_train, y_train_pred), 3)
    accuracy = round(accuracy_score(y_test, y_pred), 3)
    precision = round(precision_score(y_test, y_pred), 3)
    recall = round(recall_score(y_test, y_pred), 3)
    f1 = round(f1_score(y_test, y_pred), 3)
    roc_score = round(roc_auc_score(y_test, y_score), 3)

    print("The accuracy on train data is ", train_accuracy)
    print("The accuracy on test data is ", accuracy)
    print("The precision on test data is ", precision)
    print("The recall on test data is ", recall)
    print("The f1 on test data is ", f1)
    print("The roc_score on test data is ", roc_score)


In [None]:
#creating function to get features importance of all the tree based model
def get_features_importance(optimal_model,X_train):
  imp_feat=pd.DataFrame(index=X_train.columns,data=optimal_model.feature_importances_,columns=['importance'])
  imp_feat=imp_feat[imp_feat['importance']>0]
  imp_feat=imp_feat.sort_values('importance')
  plt.figure(figsize=(15,5))
  print(f'==========================Features Importance============================\n\n {optimal_model}\
  \n=========================================================================\n') 
  sns.barplot(data=imp_feat,x=imp_feat.index,y='importance')
  plt.xticks(rotation=90);


### ML Model - 1

### **Logistic Regression**

In [None]:
#creating Instance of Logistic Regression
LR= LogisticRegression()
model_score(LR, X_train, y_train, X_test, y_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

**Implementing GridSearch for Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

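# Define the hyperparameters to search over; penalties and solvers are grouped
# so that only combinations supported by LogisticRegression are tried
common = {'C': [0.1, 1, 10, 100],
          'fit_intercept': [True, False],
          'max_iter': [100, 500, 1000]}
param_grid = [
    {**common, 'penalty': ['l2'], 'solver': ['newton-cg', 'lbfgs', 'sag']},
    {**common, 'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga']},
    # elasticnet needs the saga solver and an l1_ratio; 0.5 is an assumed value
    {**common, 'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.5]},
]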

# Create a GridSearchCV object and fit it to the data
grid = GridSearchCV(LR, param_grid, cv=5)
grid.fit(X_train, y_train)

# Print the best hyperparameters and the best score
print("Best hyperparameters: ", grid.best_params_)
print("Best accuracy: ", grid.best_score_)


In [None]:
# Create a new instance of LogisticRegression with the best hyperparameters
LR_best = LogisticRegression(C=1, fit_intercept=True, max_iter=100, penalty='l2', solver='lbfgs')

# Fit the model to the training data
LR_best.fit(X_train, y_train)

# Use the fitted coefficients as a signed measure of feature importance
importance = LR_best.coef_[0]

# Create a DataFrame to store the feature names and importance
imp_feat = pd.DataFrame({'Features': X_train.columns, 'Importance': importance})
imp_feat = imp_feat.sort_values('Importance', ascending=False)

# Plot the feature importance
plt.figure(figsize=(15,5))
sns.barplot(x='Features', y='Importance', data=imp_feat)
plt.xticks(rotation=90)
plt.show()


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

In [None]:
Rf = RandomForestClassifier(n_estimators=50)
model_score(Rf, X_train, y_train, X_test, y_test)

**Implementing GridSearch for Hyperparameter Tuning in Random Forest**

In [None]:
# finding the best parameters for rfc_model by gridsearchcv
grid_values = {'n_estimators': [100,125,150],'max_depth': [7,10,15],'criterion': ['entropy']}
grid_rfc_model = GridSearchCV(estimator=Rf,param_grid = grid_values, scoring='balanced_accuracy',cv=3,verbose=5,n_jobs=-1)

In [None]:
# training and evaluating the Random Forest with hyperparameter tuning
model_score(grid_rfc_model,X_train, y_train, X_test, y_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
#fitting data into XG Boosting Classifier
xgb = XGBClassifier()
model_score(xgb, X_train, y_train, X_test, y_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
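
One possible way to fill in the save and load steps is sketched below with `pickle` from the standard library. It is a minimal sketch under stated assumptions: the small `LogisticRegression` fitted on toy data stands in for whichever model is chosen as the best one, and the file name `credit_default_model.pkl` is an arbitrary choice.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the best model: a small classifier fitted on toy data.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

# Save the fitted model to disk (the file name is an arbitrary choice).
model_path = os.path.join(tempfile.gettempdir(), "credit_default_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and predict on rows the model has not seen.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

X_unseen = np.array([[0.5], [2.5]])
print(loaded_model.predict(X_unseen))

# Sanity check: the reloaded model must agree with the original.
assert (loaded_model.predict(X_unseen) == model.predict(X_unseen)).all()
```

In the notebook itself, any scaler or encoder fitted during preprocessing would need to be saved alongside the model so that unseen data can be transformed the same way before prediction.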

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***