### Project: Credit Card Approval
###### This dataset is taken from the link below:
https://www.kaggle.com/datasets/rikdifos/credit-card-approval-prediction
###### We will focus on various parts of the data science life cycle. Some of them are given below:
1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Selection
5. Model Deployment

Now, let's understand what the data says about itself.

###### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno as msno
import datetime
from datetime import timedelta
pd.set_option('display.max_columns',None)
sns.set_style('whitegrid')

In [None]:
## Reading Data
independent_feature = pd.read_csv('../input/credit-card-approval-prediction/application_record.csv')

In [None]:
## Let's view the data
independent_feature.head()

In [None]:
## Check its size
independent_feature.shape

In [None]:
## Now, let's read the second file and see what it contains.
dependent_feature = pd.read_csv('../input/credit-card-approval-prediction/credit_record.csv')

In [None]:
## Let's view the data
dependent_feature.head()

In [None]:
## Let's check its shape
dependent_feature.shape

Now, we will merge the two dataframes so that we only have to deal with one. We take the intersection of both dataframes (an inner join) based on their 'ID'.
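
To make the effect of an inner join concrete, here is a minimal sketch on two toy frames. The frame names `applications` and `records` and all values in them are made up for illustration; only the column names come from the dataset. It shows that IDs missing from either side are dropped, and that an ID with several monthly records produces several rows.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are invented for illustration)
applications = pd.DataFrame({'ID': [1, 2, 3],
                             'AMT_INCOME_TOTAL': [100, 200, 300]})
records = pd.DataFrame({'ID': [1, 1, 2, 4],
                        'MONTHS_BALANCE': [0, -1, 0, 0],
                        'STATUS': ['0', 'C', 'X', '1']})

# An inner merge keeps only IDs present in both frames;
# ID 1 has two monthly records, so it appears twice in the result.
merged = applications.merge(records, how='inner', on='ID')
print(merged.shape)
print(merged['ID'].tolist())
```

Because of this one-to-many relationship, the merged frame has one row per customer per month rather than one row per customer.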

In [None]:
data = independent_feature.merge(dependent_feature,how='inner',on=['ID'])

In [None]:
## Let's check the data
data.head()

In [None]:
## Let's check the shape of dataset
data.shape

# Data Cleaning

Let's check the following:
1. Check for missing values
2. Deal with missing values
3. Check for duplicated values
4. Drop unnecessary columns
5. Check the dtypes of all columns

In [None]:
## Let's find out info about our data
data.info()

In [None]:
data.describe()

In [None]:
## Let's check whether there are any null values
data.isnull().sum().sort_values(ascending=False)

In [None]:
## Let's visualize null values
cols = data.columns
sns.heatmap(data[cols].isnull(),cmap='Blues',yticklabels=False,cbar=False)

Hence, we see from this heatmap that only OCCUPATION_TYPE has null values.

In [None]:
## Now, let's check whether this data has any duplicated rows
data.duplicated().sum()

Hence, the data does not have any duplicated rows. Now, let's deal with those null values.
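
As a small illustration of this step, the sketch below measures the share of missing values in a column and fills them with a placeholder category. The toy frame and its values are assumptions made for the example; only the column name `OCCUPATION_TYPE` and the placeholder `'others'` come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (values are invented)
toy = pd.DataFrame({'OCCUPATION_TYPE': ['Managers', np.nan, 'Drivers', np.nan]})

# Share of missing values in the column
missing_share = toy['OCCUPATION_TYPE'].isnull().mean()
print(missing_share)

# Assign the filled column back instead of relying on inplace=True
toy['OCCUPATION_TYPE'] = toy['OCCUPATION_TYPE'].fillna('others')
print(toy['OCCUPATION_TYPE'].tolist())
```

Assigning the result back to the column avoids the chained-assignment warnings that recent pandas versions emit for `inplace=True` on a selected column.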

In [None]:
## Let's dig into OCCUPATION_TYPE column
data.OCCUPATION_TYPE.value_counts()

In [None]:
## Let's fill those null values with 'others'.
data['OCCUPATION_TYPE'] = data['OCCUPATION_TYPE'].fillna('others')

# Feature Engineering

In [None]:
## Let's view our data
data.head(3)

In [None]:
## Let's dig into CODE_GENDER column
data.CODE_GENDER.value_counts()

In [None]:
## Now, let's convert F and M into a dummy variable (named 'Male' so the column is self-explanatory)
Male = pd.get_dummies(data['CODE_GENDER'],drop_first=True,dtype=int).rename(columns={'M':'Male'})
Male.value_counts()

Here, 1 represents Male and 0 represents Female.

In [None]:
## Now, Let's dig into FLAG_OWN_CAR
data.FLAG_OWN_CAR.value_counts()

In [None]:
## Now, let's convert N and Y into a dummy variable (renamed so it does not clash with other 'Y' columns)
Car = pd.get_dummies(data['FLAG_OWN_CAR'],drop_first=True,dtype=int).rename(columns={'Y':'Car'})
Car.value_counts()

Here, 1 means the person owns a car and 0 means the person doesn't own a car.

In [None]:
## Now, Let's dig into property column
data.FLAG_OWN_REALTY.value_counts()

In [None]:
## Now, let's convert it into a dummy variable (renamed so it does not clash with other 'Y' columns)
Property = pd.get_dummies(data['FLAG_OWN_REALTY'],drop_first=True,dtype=int).rename(columns={'Y':'Property'})
Property.value_counts()

Here, 1 denotes that the person owns some property and 0 means that the person doesn't own any property.
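
For two-valued flags like these, an explicit mapping is one alternative to `pd.get_dummies` that makes the meaning of 0 and 1 visible in the code. The sketch below is only an illustration on a toy frame; the output column names `Male`, `Car`, and `Property` and the sample values are assumptions.

```python
import pandas as pd

# Toy rows using the three binary columns of the dataset (values are invented)
toy = pd.DataFrame({'CODE_GENDER': ['F', 'M', 'F'],
                    'FLAG_OWN_CAR': ['Y', 'N', 'Y'],
                    'FLAG_OWN_REALTY': ['N', 'Y', 'Y']})

# Explicit dictionaries document which category becomes 1
encoded = pd.DataFrame({
    'Male': toy['CODE_GENDER'].map({'F': 0, 'M': 1}),
    'Car': toy['FLAG_OWN_CAR'].map({'N': 0, 'Y': 1}),
    'Property': toy['FLAG_OWN_REALTY'].map({'N': 0, 'Y': 1}),
})
print(encoded)
```

Distinct column names also matter later on: if two encoded columns are both called `Y`, selecting `data['Y']` returns two columns instead of one.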

In [None]:
## Let's merge those dummy variables
data = pd.concat([data,Property,Car,Male],axis=1)

In [None]:
data.head()

In [None]:
## Now, let's dig into the other columns
data.NAME_INCOME_TYPE.value_counts()

In [None]:
data.NAME_EDUCATION_TYPE.value_counts()

Let's replace 'Secondary / secondary special' with 'Secondary'.

In [None]:
## Defining a function
def education(x):
    if x=='Secondary / secondary special':
        x=x.split(' /')[0]
    return x

data['NAME_EDUCATION_TYPE'] = data['NAME_EDUCATION_TYPE'].apply(education)
       

In [None]:
data.NAME_EDUCATION_TYPE.value_counts() 

In [None]:
data.NAME_FAMILY_STATUS.value_counts()

Let's replace 'Single / not married' with 'Single'.

In [None]:
## Defining a function
def family(x):
    if x=='Single / not married':
        x=x.split(' /')[0]
    return x

data['NAME_FAMILY_STATUS'] = data['NAME_FAMILY_STATUS'].apply(family)
       

In [None]:
data.NAME_FAMILY_STATUS.value_counts()

In [None]:
data.NAME_HOUSING_TYPE.value_counts()

Let's replace 'House / apartment' with 'House'.

In [None]:
## Defining a function
def housing(x):
    if x=='House / apartment':
        x=x.split(' /')[0]
    return x

data['NAME_HOUSING_TYPE'] = data['NAME_HOUSING_TYPE'].apply(housing)
       

In [None]:
data.NAME_HOUSING_TYPE.value_counts()

##### Let's define some helper functions that make a few columns easier to understand

In [None]:
## This function takes a number of days (counted backwards from today) and converts it into a date string
def Date_of_Birth(total_days):
    today = datetime.date.today()
    birthday = (today + timedelta(days=total_days)).strftime('%Y-%m-%d')
    return birthday

## This function takes a value of the DAYS_EMPLOYED column and converts it into a date string
def Employed_day(total_days):
    today = datetime.date.today()
    employed_date = (today + timedelta(days=total_days)).strftime('%Y-%m-%d')
    return employed_date

## This function calculates the age in years (difference of calendar years, so it can be off by one)
def age(days_birth):
    days_birth = datetime.datetime.strptime(days_birth, '%Y-%m-%d')
    today = datetime.date.today()
    Age = today.year - days_birth.year
    return Age

Now, we will apply those functions to their respective columns.
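
If only the age is needed, the round trip through a date string can be skipped. The following is a minimal sketch, not the approach used in this notebook: it assumes `DAYS_BIRTH` counts days backwards from the reference day (so values are negative) and approximates a year as 365.25 days. The sample values are invented.

```python
import pandas as pd

# Toy DAYS_BIRTH values: negative day counts relative to the reference day
days_birth = pd.Series([-12005, -21474, -365])

# Negate to get elapsed days, convert to years, and truncate to whole years
age_years = (-days_birth / 365.25).astype(int)
print(age_years.tolist())
```

This version works on the whole column at once and does not depend on the date on which the notebook is run.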

In [None]:
## Applying Functions
## To get date of birth of each person
data['DAYS_BIRTH'] = data['DAYS_BIRTH'].apply(Date_of_Birth)

## To get the employment start date
data['DAYS_EMPLOYED'] = data['DAYS_EMPLOYED'].apply(Employed_day)

## To find the age of every person
data['Age'] = data['DAYS_BIRTH'].apply(age)

In [None]:
## Now, Let's view the modified data
data.head()

##### Now, let's drop the columns we no longer need.

In [None]:
## Let's drop useless columns
data = data.drop(['ID','DAYS_BIRTH','MONTHS_BALANCE','FLAG_WORK_PHONE','DAYS_EMPLOYED'],axis=1)

In [None]:
# Replacing the STATUS values C and X with numeric values
data.loc[data['STATUS']=='C','STATUS'] = 6
data.loc[data['STATUS']=='X','STATUS'] = 7

In [None]:
data.head()

In [None]:
data.drop(columns=['CODE_GENDER','FLAG_OWN_CAR','FLAG_OWN_REALTY','OCCUPATION_TYPE'],inplace=True)

In [None]:
data.head()

# Data Labeling
We have to label the dataset ourselves because it doesn't contain any target values. We will mark each record as 1 (risky) or 0 (not risky) based on 'STATUS': if a payment is 60 or more days overdue (STATUS 2 to 5), it is risky to give a credit card to that person.
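
The labeling rule can be sketched on a toy frame as follows. The row-level part mirrors the rule described above; the per-customer aggregation at the end is an assumption added for illustration and is not something this notebook does. The frame name `records` and its values are invented.

```python
import pandas as pd

# Toy monthly records using the raw STATUS codes (values are invented)
records = pd.DataFrame({'ID': [1, 1, 2, 2, 3],
                        'STATUS': ['0', '2', 'C', 'X', '5']})

# Row-level label: 1 if that month is 60 or more days overdue (codes 2 to 5)
risky_row = records['STATUS'].isin(['2', '3', '4', '5'])
records['target'] = risky_row.astype(int)
print(records['target'].tolist())

# Hypothetical customer-level label: risky if any month was risky
per_customer = risky_row.groupby(records['ID']).max().astype(int)
print(per_customer.to_dict())
```

Labeling per customer rather than per month gives one target value per applicant, which may be closer to an approval decision; which level to use is a modeling choice.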

In [None]:
data.STATUS.value_counts()

In [None]:
## Let's convert the dtype of the STATUS column.
data['STATUS'] = data['STATUS'].astype(float)

In [None]:
## Define a function for labeling the data
def get_label_for_data(x):
    if x in (2,3,4,5):
        target = 1  # risky
    else:
        target = 0  # not risky
    return target

In [None]:
data['target'] = data['STATUS'].apply(get_label_for_data)

In [None]:
data.target.value_counts()

# Data Visualization

In [None]:
## Let's create some plots for target values
sns.countplot(y=data['target'])

###### We conclude that we have an imbalanced dataset. So, we have to use oversampling later.
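
To show what oversampling does in its simplest form, here is a sketch of plain random oversampling with pandas. Note that this is a simpler technique than the SMOTETomek resampler used later in this notebook: it duplicates existing minority rows instead of synthesizing new ones. The toy frame is an assumption.

```python
import pandas as pd

# Toy imbalanced data: 8 rows of class 0, 2 rows of class 1 (invented)
toy = pd.DataFrame({'feature': range(10),
                    'target': [0] * 8 + [1] * 2})

counts = toy['target'].value_counts()
minority_label = counts.idxmin()
minority = toy[toy['target'] == minority_label]

# Draw minority rows with replacement until both classes are the same size
extra = minority.sample(n=counts.max() - counts.min(),
                        replace=True, random_state=42)
balanced = pd.concat([toy, extra], ignore_index=True)
print(balanced['target'].value_counts().to_dict())
```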

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(data.corr(numeric_only=True),annot=True)

In [None]:
## Let's find out the correlation between numeric features
data.corr(numeric_only=True)

In [None]:
## Let's visualize FLAG_MOBIL
sns.countplot(y=data['FLAG_MOBIL'])

Hence, all of them have mobile phones.
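
A column in which every row has the same value carries no information for a model. The sketch below shows one way to detect such columns programmatically; the toy frame and its values are assumptions made for the example.

```python
import pandas as pd

# Toy frame: FLAG_MOBIL is constant, FLAG_EMAIL is not (values are invented)
toy = pd.DataFrame({'FLAG_MOBIL': [1, 1, 1, 1],
                    'FLAG_EMAIL': [0, 1, 0, 0]})

# Columns with a single distinct value are candidates for dropping
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```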

In [None]:
## Let's visualize FLAG_EMAIL
sns.countplot(y=data['FLAG_EMAIL'])

In [None]:
data.FLAG_EMAIL.value_counts()

In [None]:
## Let's find out how many family members each applicant has
plt.figure(figsize=(15,10))
sns.countplot(x=data['CNT_FAM_MEMBERS'])

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x=data['CNT_FAM_MEMBERS'],y=data['AMT_INCOME_TOTAL'])

In [None]:
# plt.figure(figsize=(10,12))
# e=(0.05,0.02,0,0,0)
# m=data['NAME_FAMILY_STATUS']=='Married'
# m=m.sum()
# s=data['NAME_FAMILY_STATUS']=='Single'
# s=s.sum()
# Cv=data['NAME_FAMILY_STATUS']=='Civil marriage'
# Cv=Cv.sum()
# sep=data['NAME_FAMILY_STATUS']=='Separated'
# sep=sep.sum()
# w=data['NAME_FAMILY_STATUS']=='Widow'
# w=w.sum()
# y=np.array([m,s,Cv,sep,w])
# label=['Married','Single','Civil marriage','Separated','Widow']
# plt.pie(y,explode=e,labels=label)
# plt.legend(title="Title")

# Feature Scaling

In [None]:
## Let's see head of data
data.head()

In [None]:
## Let's form the mapping dictionaries
lst = {'Working':1,'Commercial associate':2,'Pensioner':3,'State servant':4,'Student':5}
lst1 = {'Secondary':1,'Higher education':2,'Incomplete higher':3,'Lower secondary':4,'Academic degree':5}
lst2 = {'Married':1,'Single':2,'Civil marriage':3,'Separated':4,'Widow':5}
lst3 = {'House':1,'With parents':2,'Municipal apartment':3,'Rented apartment':4,'Office apartment':5,'Co-op apartment':6}

In [None]:
data.replace({'NAME_INCOME_TYPE':lst},inplace=True)
data.replace({'NAME_EDUCATION_TYPE':lst1},inplace=True)
data.replace({'NAME_FAMILY_STATUS':lst2},inplace=True)
data.replace({'NAME_HOUSING_TYPE':lst3},inplace=True)
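
One thing to keep in mind with dictionary-based encoding is that categories missing from the dictionary are left unchanged by `replace`. The sketch below uses `Series.map` instead, which turns unmapped categories into NaN so they are easy to spot. The category `'Freelancer'` is a hypothetical value that does not come from the dataset.

```python
import pandas as pd

income_codes = {'Working': 1, 'Commercial associate': 2, 'Pensioner': 3,
                'State servant': 4, 'Student': 5}

# 'Freelancer' is a made-up category that is absent from the dictionary
toy = pd.Series(['Working', 'Student', 'Pensioner', 'Freelancer'])

# map() returns NaN for any value without a dictionary entry
coded = toy.map(income_codes)
unmapped = toy[coded.isna()].unique().tolist()
print(unmapped)
```

Note also that integer codes imply an ordering between categories; for nominal features such as income type, one-hot encoding is a common alternative.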

In [None]:
data.head()

In [None]:
data.target.value_counts()

In [None]:
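## Drop STATUS as well: target is derived from it, so keeping it as a feature would leak the label
data = data.drop('STATUS',axis=1)
X = data.drop('target',axis=1)
Y = data['target']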

In [None]:
## Let's check the skewness of data.
data.skew()

###### Hence, the target is far from normally distributed. So, we are going to use MinMaxScaler.
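
Min-max scaling rescales each column to the range [0, 1] using (x - min) / (max - min). Here is a minimal NumPy sketch of that formula on an invented array; it is meant to show the arithmetic behind the scaler, not to replace it.

```python
import numpy as np

# Toy feature matrix with two columns on different scales (invented)
X_toy = np.array([[10.0, 1.0],
                  [20.0, 3.0],
                  [40.0, 5.0]])

# Column-wise minimum and maximum
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Each column is mapped so that its minimum becomes 0 and its maximum 1
X_scaled = (X_toy - col_min) / (col_max - col_min)
print(X_scaled)
```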

In [None]:
feature_scale = [feature for feature in data.columns if feature!='target']
## Importing library 
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(X)

In [None]:
scaler.transform(X)

In [None]:
data = pd.concat([data['target'].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(X), columns=feature_scale)],
                    axis=1)

In [None]:
data.head()

# Dealing with Unbalanced Dataset

In [None]:
## Performing resampling (SMOTETomek combines SMOTE over-sampling with Tomek-link cleaning)

## Importing libraries
from imblearn.combine import SMOTETomek 

In [None]:
smk = SMOTETomek(random_state=42)

In [None]:
data.shape

In [None]:
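## Resample the scaled features stored in data (X itself was never scaled)
X_res,Y_res = smk.fit_resample(data.drop('target',axis=1),data['target'])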

In [None]:
X_res.shape,Y_res.shape

In [None]:
from collections import Counter

In [None]:
print('original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(Y_res)))

##### Train-Test split
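
The sketch below shows the `stratify` option of `train_test_split`, which keeps the class ratio the same in the train and test parts. It uses invented toy arrays. A common precaution, which this notebook does not take, is to split first and resample only the training part so that the test rows remain untouched original data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples of class 0 and 10 samples of class 1 (invented)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify keeps the 9:1 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=100)
print(len(y_te), int(y_te.sum()), int(y_tr.sum()))
```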

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train,x_test,y_train,y_test = train_test_split(X_res,Y_res,test_size=0.2,random_state=100)

In [None]:
x_train.shape,x_test.shape

# Model Selection

Let's train several models and then choose the one that performs best based on the confusion matrix and accuracy.
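
To read a confusion matrix numerically, it helps to unpack its four cells and derive a few metrics from them. The labels and predictions below are invented for illustration; the functions are from `sklearn.metrics`.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Invented true labels and predictions (1 = risky, 0 = not risky)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)      # share of risky cases that were caught
prec = precision_score(y_true, y_hat)  # share of risky predictions that were right
print(cm)
print(acc, rec, prec)
```

For a risk model, recall on the risky class is often as important as overall accuracy, since a missed risky applicant is usually the costlier error.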

In [None]:
## import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import ConfusionMatrixDisplay,accuracy_score

In [None]:
model_list = [LogisticRegression,RandomForestClassifier,DecisionTreeClassifier,GaussianNB,KNeighborsClassifier]

In [None]:
accuracy =[]
for model in model_list :
    model = model()
    model.fit(x_train,y_train)
    y_pred = model.predict(x_test)
    ConfusionMatrixDisplay.from_predictions(y_test,y_pred)
    accuracy.append(accuracy_score(y_test,y_pred))

In [None]:
accuracy

# Conclusion
##### From these results we conclude that KNeighborsClassifier and DecisionTreeClassifier are the best models for predicting whether we should approve a credit card for a person based on their data.