# **Section 1 - Questions to Answer**

**1. Why is your proposal important in today’s world? How predicting a good client is worthy for a bank?**

Implementing data-driven models, such as machine learning models, in the banking industry is important in today's world for several reasons: financial inclusion, regulatory compliance, competitive advantage, cost reduction, fraud detection, enhanced customer experience, risk mitigation, and improved decision making. Predicting which applicants are likely to be good clients supports several of these at once, since it helps a bank extend credit with less risk and less manual review.

**2. How is it going to impact the banking sector?**

Machine learning and other data-driven models affect the banking sector in several ways, with benefits ranging from improved risk assessment and operational efficiency to enhanced customer experiences and the development of innovative financial services. As technology continues to evolve, data-driven approaches will play a central role in shaping the future of banking.

**3. If any, what is the gap in the knowledge or how your proposed method can be helpful if required in future for any bank in India.**

Using data-driven methods and advanced technology can help banks in India become better at making decisions and serving their customers. However, there are some things to keep in mind:



1. Good Data: Banks need to make sure that the information they use for these methods is accurate and complete.

2. Privacy and Security: Banks must protect their customers' information and follow the rules about data privacy.

3. Costs and Benefits: Using data can cost money, so banks need to make sure it's worth it in the long run.

4. Staying Safe: Banks must protect their computer systems and customer data from hackers and cyberattacks.

5. Technology and Skills: Some banks may need to invest in better technology and hire people who know how to use it.

Even though there are challenges, using data in smart ways can help banks do a better job of serving their customers and staying competitive in the modern world.

# **Section 2 - Initial Hypothesis**

**1. Here you have to make some assumptions based on the questions you want to address based on the DA track or ML track.**

1. If DA track please aim to identify patterns in the data and important features that may impact a ML model.



Data analysis (DA) plays a crucial role in preparing data for machine learning (ML) models and identifying patterns and important features that can impact model performance. It helps you understand your data, clean it, identify patterns, select relevant features, and prepare the data for model training. Effective data analysis is often the foundation for building accurate and robust machine learning models.

2. If ML track please perform part ‘i’ as well as multiple machine learning models, perform all required steps to check if there is any assumption and justify your model. Why is your model better than any other possible model? Please justify it by relevant cost functions and if possible by any graph.



There is no one-size-fits-all answer to which model is the "best." The choice of model should be based on a careful evaluation of the nature of the data, the specific problem, and the requirements of the project. It is good practice to experiment with multiple models, perform rigorous testing and validation, and choose the model that best meets the objectives and constraints.

**2. From step 1, you may see some relationship that you want to explore and will develop a belief about data.**

# **Section 3 - Data analysis approach**

1. What approach are you going to take in order to prove or disprove your hypothesis?

To test the hypothesis, I train several classification models on the same training set and compare their predictions on a held-out test set using confusion matrices and accuracy scores. The specific approach depends on the research question and the type of data available, and it is important to follow accepted scientific and statistical practices to ensure the validity and reliability of the findings.



2. What feature engineering techniques will be relevant to your project?

The relevant feature engineering techniques depend on the nature of the data, the problem being solved, and the machine learning algorithms used. The feature engineering techniques I used in this project are:

* Handling Missing Data   
* Encoding Categorical Variables
* Feature Scaling

3. Please justify your data analysis approach.

First, I imported the libraries and then the datasets. After that, I took care of the missing data in the dataset. Then, I split the dataset into X and y and encoded the categorical data.
After encoding, I split the dataset into the Training Set and Test Set and then applied Feature Scaling. After feature scaling, I trained various models on the training set: Logistic Regression Classifier, Kernel SVM Model, SVM Model, K-NN Model, Naive Bayes Model, Random Forest Classification Model and Decision Tree Classification Model. Then, I predicted the Test Set results for each model and analysed the confusion matrices. From these, I identified the best-fitting model for this project.
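The sequence of steps described above (imputation, encoding, scaling, training) can also be expressed as one scikit-learn `Pipeline`. This is only a sketch of an alternative arrangement, not the code used in this notebook: the toy rows, the column subset, and the names `toy`, `preprocess`, and `model` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (invented for illustration; not the project dataset).
toy = pd.DataFrame({
    "GENDER": ["M", "F", np.nan, "F", "M", "F", "M", "F"],
    "Annual_income": [180000.0, 315000.0, np.nan, 225000.0,
                      270000.0, 90000.0, 450000.0, 135000.0],
    "label": [1, 0, 0, 0, 1, 0, 1, 0],
})
X_toy = toy.drop(columns=["label"])
y_toy = toy["label"]

# Categorical columns: fill gaps with the mode, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric columns: fill gaps with the mean, then standardise.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("cat", categorical, ["GENDER"]),
    ("num", numeric, ["Annual_income"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

model.fit(X_toy, y_toy)
toy_predictions = model.predict(X_toy)
print(toy_predictions)
```

Selecting columns by name inside the pipeline keeps each preprocessing step tied to the column it is meant for, and fitting the whole pipeline on the training data only keeps test-set statistics out of the imputers and the scaler.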

4. Identify important patterns in your data using the EDA approach to justify your findings.

# **Section 4 - Machine learning approach**

1. What method will you use for machine learning based predictions for credit card approval?



Several machine learning methods can be applied to credit card approval prediction; the choice depends on factors like the nature of the data, the specific problem, and the desired outcome. The models I used for ML-based credit card approval prediction are:
* Random Forest Classification Model
* Naive Bayes Model
* K-NN Model
* Kernel SVM Model
* SVM Model (linear kernel)
* Decision Tree Classification
* Logistic Regression

2. Please justify the most appropriate model.

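The Random Forest Classification Model has the highest test-set accuracy, about 92.26%. Its confusion matrix also shows that it correctly identifies 10 of the 31 class-1 test cases, whereas the Logistic Regression and Kernel SVM models predict class 0 for every test sample.
Thus, this model is the most appropriate among those compared.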
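Accuracy alone can hide how a model treats the minority class, so it may help to derive precision and recall from the confusion matrices recorded later in this notebook. The sketch below uses the two matrices printed for Random Forest and Logistic Regression; the helper name `summarize` is hypothetical.

```python
def summarize(cm):
    """Return (accuracy, precision, recall) for class 1 from a 2x2 confusion matrix.

    The layout follows scikit-learn: rows are true classes, columns are predictions.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when class 1 is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Confusion matrices as printed in the model sections of this notebook.
random_forest_cm = [[276, 3], [21, 10]]
logistic_cm = [[279, 0], [31, 0]]

rf_accuracy, rf_precision, rf_recall = summarize(random_forest_cm)
lr_accuracy, lr_precision, lr_recall = summarize(logistic_cm)

print(f"Random Forest:       acc={rf_accuracy:.4f} prec={rf_precision:.4f} rec={rf_recall:.4f}")
print(f"Logistic Regression: acc={lr_accuracy:.4f} prec={lr_precision:.4f} rec={lr_recall:.4f}")
```

Read this way, the gap between the two models is larger than the accuracy figures suggest: Logistic Regression recovers none of the class-1 cases, while Random Forest recovers roughly a third of them.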

3. Please perform necessary steps required to improve the accuracy of your model.

Improving the accuracy of a machine learning model involves several steps and strategies, and the right ones depend on the type of model, the nature of the data, and the current state of the model's performance.

The steps I performed to improve the accuracy of the model are data cleaning, handling the missing data, feature scaling, and encoding the categorical data.
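One further step that is not part of this notebook is a hyperparameter search with cross-validation. The following is a minimal sketch on synthetic, imbalanced data from `make_classification`; the parameter grid, the F1 scoring choice, and the names `X_demo`, `y_demo`, and `search` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data with roughly a 90/10 class split (not the project dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# A deliberately small grid; a real search would cover more values.
param_grid = {
    "n_estimators": [10, 50],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # F1 on class 1 rewards finding the minority class
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy is a design choice for imbalanced labels: a model that always predicts the majority class scores well on accuracy but poorly on F1.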

4. Please compare all models (at least 4 models).

1. Logistic Regression Classification Model -
* The Accuracy Score of this model = 90%

2. Decision Tree Classification Model -
* The Accuracy Score of this model = 88.71%

3. Kernel SVM Model -
* The Accuracy Score of this model = 90%

4. SVM Model -
* The Accuracy Score of this model = 89.68%

5. K-NN Model -
* The Accuracy Score of this model = 85.48%

6. Naive Bayes Model -
* The Accuracy Score of this model = 89.03%

7. Random Forest Classification Model -
* The Accuracy Score of this model = 92.26%
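To put these scores in context, they can be compared with a majority-class baseline, that is, the accuracy obtained by predicting class 0 for every test sample. The sketch below uses the test-set class counts and accuracy values recorded in this notebook; the variable names are hypothetical.

```python
# Test-set class counts, read from the confusion matrices in this notebook.
class_counts = {0: 279, 1: 31}
baseline_accuracy = max(class_counts.values()) / sum(class_counts.values())

# Accuracy scores as reported for each model.
reported_accuracy = {
    "Logistic Regression": 0.9,
    "Decision Tree": 0.8870967741935484,
    "Kernel SVM": 0.9,
    "SVM (linear)": 0.896774193548387,
    "K-NN": 0.8548387096774194,
    "Naive Bayes": 0.8903225806451613,
    "Random Forest": 0.9225806451612903,
}

# Print models from best to worst with their margin over the baseline.
for name, accuracy in sorted(reported_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {accuracy:.4f} ({accuracy - baseline_accuracy:+.4f} vs baseline)")

# Small tolerance so that a score equal to the baseline is not counted as above it.
above_baseline = [name for name, accuracy in reported_accuracy.items()
                  if accuracy > baseline_accuracy + 1e-12]
print("Above baseline:", above_baseline)
```

On these numbers only Random Forest exceeds the baseline, which supports choosing it over the other models.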


# **Importing the Libraries :**

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# **Importing the Datasets :**

In [2]:
ccard = pd.read_csv("Credit_card.csv")
ccard_label = pd.read_csv("Credit_card_label.csv")
cc_merged = pd.merge(ccard, ccard_label, how='outer', on='Ind_ID')
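An outer merge keeps rows that appear in only one of the two files, which would show up as missing features or a missing label. A sketch of how this could be checked with the `indicator` and `validate` parameters of `pd.merge`, using two invented toy frames rather than the project CSV files:

```python
import pandas as pd

# Toy frames (invented): ID 3 has no label, ID 4 has no features.
features = pd.DataFrame({"Ind_ID": [1, 2, 3], "GENDER": ["M", "F", "F"]})
labels = pd.DataFrame({"Ind_ID": [1, 2, 4], "label": [1, 0, 0]})

merged = pd.merge(
    features,
    labels,
    how="outer",
    on="Ind_ID",
    indicator=True,         # adds a _merge column: left_only / right_only / both
    validate="one_to_one",  # raises if Ind_ID is duplicated on either side
)

# Rows that did not find a partner in the other frame.
unmatched = merged[merged["_merge"] != "both"]
print(merged["_merge"].value_counts())
print(unmatched)
```

If `unmatched` is empty on the real data, the outer merge behaves like an inner merge and no rows were introduced without a label.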

In [3]:
cc_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1548 entries, 0 to 1547
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Ind_ID           1548 non-null   int64  
 1   GENDER           1541 non-null   object 
 2   Car_Owner        1548 non-null   object 
 3   Propert_Owner    1548 non-null   object 
 4   CHILDREN         1548 non-null   int64  
 5   Annual_income    1525 non-null   float64
 6   Type_Income      1548 non-null   object 
 7   EDUCATION        1548 non-null   object 
 8   Marital_status   1548 non-null   object 
 9   Housing_type     1548 non-null   object 
 10  Birthday_count   1526 non-null   float64
 11  Employed_days    1548 non-null   int64  
 12  Mobile_phone     1548 non-null   int64  
 13  Work_Phone       1548 non-null   int64  
 14  Phone            1548 non-null   int64  
 15  EMAIL_ID         1548 non-null   int64  
 16  Type_Occupation  1060 non-null   object 
 17  Family_Members

# **Taking Care of Missing Data**

In [4]:
# Check for missing values in each column
missing_data = cc_merged.isnull().sum()
print(missing_data)

Ind_ID               0
GENDER               7
Car_Owner            0
Propert_Owner        0
CHILDREN             0
Annual_income       23
Type_Income          0
EDUCATION            0
Marital_status       0
Housing_type         0
Birthday_count      22
Employed_days        0
Mobile_phone         0
Work_Phone           0
Phone                0
EMAIL_ID             0
Type_Occupation    488
Family_Members       0
label                0
dtype: int64


In [5]:
cc_merged['GENDER'].fillna(cc_merged['GENDER'].mode()[0], inplace=True)

In [6]:
mean_income = cc_merged['Annual_income'].mean()
cc_merged['Annual_income'].fillna(mean_income, inplace=True)

In [7]:
cc_merged['Type_Occupation'].fillna('Unknown', inplace=True)

In [8]:
mean_birthday_count = cc_merged['Birthday_count'].mean()
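# Fill missing Birthday_count with its own mean (mean_birthday_count), not mean_income.
# The recorded head() output below predates this correction: row 2 shows mean_income in Birthday_count.
cc_merged['Birthday_count'].fillna(mean_birthday_count, inplace=True)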
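Mean imputation is applied one column at a time in the cells above. A minimal sketch, on invented toy values, of building the fill values in a dictionary so that each column is filled with a statistic computed from that same column:

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) with one gap per column.
toy = pd.DataFrame({
    "Annual_income": [100000.0, np.nan, 300000.0],
    "Birthday_count": [-10000.0, -20000.0, np.nan],
})

# Map each column name to that column's own mean.
fill_values = {column: toy[column].mean() for column in ["Annual_income", "Birthday_count"]}

# DataFrame.fillna accepts a dict of column -> value.
filled = toy.fillna(fill_values)
print(filled)
```

Because the fill value is looked up by column name, one column's mean cannot end up in another column.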

In [9]:
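# Strip every non-word character (spaces, slashes, etc.) from string values.
# A raw string avoids Python's invalid-escape warning for '\W'.
cc_merged.replace(r'\W', '', regex=True, inplace=True)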
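The pattern `\W` matches any character that is not a letter, digit, or underscore, so this step removes spaces and punctuation from the category names. The sketch below shows the effect with Python's `re` module. The input strings are assumed raw values, chosen because they produce the cleaned names visible in the recorded output.

```python
import re

# Assumed raw category values (not confirmed by the notebook).
examples = ["Higher education", "House / apartment", "Secondary / secondary special"]

# Remove every non-word character, as the regex replace above does for strings.
cleaned = [re.sub(r"\W", "", value) for value in examples]

for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The recorded output suggests that numeric columns are left untouched, since negative values such as `-18772.0` keep their sign and decimal point.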

In [10]:
# Check for missing values in each column
missing_data = cc_merged.isnull().sum()
print(missing_data)

Ind_ID             0
GENDER             0
Car_Owner          0
Propert_Owner      0
CHILDREN           0
Annual_income      0
Type_Income        0
EDUCATION          0
Marital_status     0
Housing_type       0
Birthday_count     0
Employed_days      0
Mobile_phone       0
Work_Phone         0
Phone              0
EMAIL_ID           0
Type_Occupation    0
Family_Members     0
label              0
dtype: int64


In [11]:
cc_merged.head(50)

Unnamed: 0,Ind_ID,GENDER,Car_Owner,Propert_Owner,CHILDREN,Annual_income,Type_Income,EDUCATION,Marital_status,Housing_type,Birthday_count,Employed_days,Mobile_phone,Work_Phone,Phone,EMAIL_ID,Type_Occupation,Family_Members,label
0,5008827,M,Y,Y,0,180000.0,Pensioner,Highereducation,Married,Houseapartment,-18772.0,365243,1,0,0,0,Unknown,2,1
1,5009744,F,Y,N,0,315000.0,Commercialassociate,Highereducation,Married,Houseapartment,-13557.0,-586,1,1,1,0,Unknown,2,1
2,5009746,F,Y,N,0,315000.0,Commercialassociate,Highereducation,Married,Houseapartment,191399.32623,-586,1,1,1,0,Unknown,2,1
3,5009749,F,Y,N,0,191399.32623,Commercialassociate,Highereducation,Married,Houseapartment,-13557.0,-586,1,1,1,0,Unknown,2,1
4,5009752,F,Y,N,0,315000.0,Commercialassociate,Highereducation,Married,Houseapartment,-13557.0,-586,1,1,1,0,Unknown,2,1
5,5009753,F,Y,N,0,315000.0,Pensioner,Highereducation,Married,Houseapartment,-13557.0,-586,1,1,1,0,Unknown,2,1
6,5009754,F,Y,N,0,315000.0,Commercialassociate,Highereducation,Married,Houseapartment,-13557.0,-586,1,1,1,0,Unknown,2,1
7,5009894,F,N,N,0,180000.0,Pensioner,Secondarysecondaryspecial,Married,Houseapartment,-22134.0,365243,1,0,0,0,Unknown,2,1
8,5010864,M,Y,Y,1,450000.0,Commercialassociate,Secondarysecondaryspecial,Married,Houseapartment,-18173.0,-678,1,0,1,1,Corestaff,3,1
9,5010868,M,Y,Y,1,450000.0,Pensioner,Secondarysecondaryspecial,Married,Houseapartment,-18173.0,-678,1,0,1,1,Corestaff,3,1


# **Splitting the Dataset into X and y :**

In [12]:
# The merged dataset is stored in the DataFrame 'cc_merged'
# X will contain the features, and y will contain the target variable

X = cc_merged.drop(columns=['label'])  # Exclude the 'label' column to get the features
y = cc_merged['label']  # Select only the 'label' column as the target variable


In [13]:
print(X)

       Ind_ID GENDER Car_Owner Propert_Owner  CHILDREN  Annual_income  \
0     5008827      M         Y             Y         0   180000.00000   
1     5009744      F         Y             N         0   315000.00000   
2     5009746      F         Y             N         0   315000.00000   
3     5009749      F         Y             N         0   191399.32623   
4     5009752      F         Y             N         0   315000.00000   
...       ...    ...       ...           ...       ...            ...   
1543  5028645      F         N             Y         0   191399.32623   
1544  5023655      F         N             N         0   225000.00000   
1545  5115992      M         Y             Y         2   180000.00000   
1546  5118219      M         Y             N         0   270000.00000   
1547  5053790      F         Y             Y         0   225000.00000   

              Type_Income                  EDUCATION    Marital_status  \
0               Pensioner            Highereducat

# **Encoding categorical data**

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical_cols = [1, 2, 3, 6, 7, 8, 9, 16]
# Initialize OneHotEncoder with handle_unknown='ignore'
encoder = ColumnTransformer(transformers=[('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_cols)], remainder='passthrough')

# Fit and transform the data
X = encoder.fit_transform(X)

In [15]:
print(X)

[[0. 1. 0. ... 0. 0. 2.]
 [1. 0. 0. ... 1. 0. 2.]
 [1. 0. 0. ... 1. 0. 2.]
 ...
 [0. 1. 0. ... 0. 0. 4.]
 [0. 1. 0. ... 1. 0. 2.]
 [1. 0. 0. ... 0. 0. 2.]]


# **Splitting the dataset into the Training set and Test set**

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
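The split above does not pass `stratify`, so the share of each class in the test set is left to chance. As a sketch of the alternative, the example below uses synthetic labels with a 90/10 split (an assumption for illustration) and shows that `stratify` keeps that ratio in the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 90 samples of class 0 and 10 of class 1.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# With stratification, the 20-sample test set keeps the 10% share of class 1.
test_positives = int(y_te.sum())
print(len(y_te), test_positives)
```

With few positive cases, a stratified split makes test-set metrics less sensitive to the choice of `random_state`.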

In [17]:
print(X_train)

[[1. 0. 1. ... 1. 0. 2.]
 [1. 0. 1. ... 0. 0. 1.]
 [1. 0. 1. ... 0. 1. 2.]
 ...
 [1. 0. 0. ... 0. 0. 2.]
 [1. 0. 1. ... 0. 0. 2.]
 [0. 1. 1. ... 0. 0. 1.]]


In [18]:
print(X_test)

[[0. 1. 0. ... 0. 0. 2.]
 [1. 0. 1. ... 1. 0. 2.]
 [1. 0. 0. ... 0. 0. 2.]
 ...
 [1. 0. 0. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 2.]
 [1. 0. 1. ... 0. 0. 2.]]


In [19]:
print(y_train)

1431    0
614     0
1505    0
978     0
1302    0
       ..
715     0
905     0
1096    0
235     0
1061    0
Name: label, Length: 1238, dtype: int64


In [20]:
print(y_test)

531     0
286     0
386     0
718     0
248     0
       ..
487     0
372     0
1085    0
683     0
422     0
Name: label, Length: 310, dtype: int64


# **Feature Scaling**

In [21]:
from sklearn.preprocessing import StandardScaler

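# ColumnTransformer puts the one-hot columns first and the ten passthrough columns last,
# so positions 5 and 11 of the encoded array are one-hot columns. Assuming [5, 11] referred
# to Annual_income and Employed_days in the original column order, they now sit at -8 and -6.
# The outputs recorded in this notebook were produced with the original [5, 11].
columns_to_scale = [-8, -6]

# Create an instance of StandardScaler
sc = StandardScaler()

# Fit on the training set only, then reuse the same statistics for the test set
X_train[:, columns_to_scale] = sc.fit_transform(X_train[:, columns_to_scale])
X_test[:, columns_to_scale] = sc.transform(X_test[:, columns_to_scale])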
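The scaler is fitted on the training set and only applied to the test set, so the test data is standardised with the training mean and standard deviation. A small sketch with invented numbers makes the arithmetic visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values: training mean is 2.0, population std is sqrt(2/3).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

Calling `fit_transform` on the test set instead would let test-set statistics leak into preprocessing and would put the two sets on different scales.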



In [22]:
print(X_train)

[[1. 0. 1. ... 1. 0. 2.]
 [1. 0. 1. ... 0. 0. 1.]
 [1. 0. 1. ... 0. 1. 2.]
 ...
 [1. 0. 0. ... 0. 0. 2.]
 [1. 0. 1. ... 0. 0. 2.]
 [0. 1. 1. ... 0. 0. 1.]]


In [23]:
print(X_test)

[[0. 1. 0. ... 0. 0. 2.]
 [1. 0. 1. ... 1. 0. 2.]
 [1. 0. 0. ... 0. 0. 2.]
 ...
 [1. 0. 0. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 2.]
 [1. 0. 1. ... 0. 0. 2.]]


# **Training Various Models :**

## Logistic Regression Classification Model -

In [24]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

### Predicting the Test set results

In [25]:
y_pred = classifier.predict(X_test)

### Making the Confusion Matrix

In [26]:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[279   0]
 [ 31   0]]


0.9



---



## Decision Tree Classification model -

In [27]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

### Predicting the Test set results

In [28]:
y_pred = classifier.predict(X_test)

### Making the Confusion Matrix

In [29]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[260  19]
 [ 16  15]]


0.8870967741935484



---



## Kernel SVM Model -

In [30]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

### Predicting the Test set results

In [31]:
y_pred = classifier.predict(X_test)

### Making the Confusion Matrix

In [32]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[279   0]
 [ 31   0]]


0.9



---



## SVM model -

In [33]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

### Predicting the Test set results

In [34]:
y_pred = classifier.predict(X_test)

### Making the Confusion Matrix

In [35]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[278   1]
 [ 31   0]]


0.896774193548387



---



## K-NN model

In [36]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

### Predicting the Test set results

In [37]:
y_pred = classifier.predict(X_test)

### Making the Confusion Matrix

In [38]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[261  18]
 [ 27   4]]


0.8548387096774194



---



## Naive Bayes model

In [39]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

### Predicting the Test set results

In [40]:
y_pred = classifier.predict(X_test)

### Making the Confusion Matrix

In [41]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[276   3]
 [ 31   0]]


0.8903225806451613



---



## Random Forest Classification

In [42]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

### Predicting the Test set results

In [43]:
y_pred = classifier.predict(X_test)


### Making the Confusion Matrix

In [44]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[276   3]
 [ 21  10]]


0.9225806451612903



---






