# Credit Approval

This file concerns credit card applications.  All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.

The dataset contains the following variables:

approval:	Target,	Categorical<br>
A15:	Feature,	Continuous<br>
A14:	Feature,	Continuous<br>
A13:	Feature,	Categorical<br>
A12:	Feature,	Categorical<br>
A11:	Feature,	Continuous<br>
A10:	Feature,	Categorical<br>
A9:	Feature,	Categorical<br>
A8:	Feature,	Continuous<br>
A7:	Feature,	Categorical<br>
A6:	Feature,	Categorical<br>
A5:	Feature,	Categorical<br>
A4:	Feature,	Categorical<br>
A3:	Feature,	Continuous<br>
A2:	Feature,	Continuous<br>
A1:	Feature,	Categorical<br>

Source: https://archive.ics.uci.edu/dataset/27/credit+approval

1) Download the dataset

In [None]:
!gdown '1cd6qvO3dSzxmWKXOPMVzL87qIanS8Gqu'

2) Load the required modules: pandas, numpy and matplotlib.pyplot

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

3) Read the dataset into a pandas DataFrame:

In [None]:
df = pd.read_csv('credit_approval.csv')

4) Examine the number of missing observations per variable, using the DataFrame command .isnull().sum()

In [None]:
df.isnull().sum()
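
A caveat worth checking: the original UCI file marks missing values with `?`, which pandas reads as an ordinary string, so `isnull()` would not count them. Whether `credit_approval.csv` still contains such markers is an assumption to verify on your own copy. The sketch below uses a small made-up CSV (not the real data) to show the difference the `na_values` argument makes.

```python
from io import StringIO

import pandas as pd

# Made-up CSV fragment; the '?' placeholders are an assumption about the file
csv_text = "A1,A2,approval\nb,30.83,+\n?,58.67,+\na,?,-\n"

# Default parsing keeps '?' as a string, so nothing is flagged as missing
df_raw = pd.read_csv(StringIO(csv_text))

# Declaring '?' as a missing-value marker turns those cells into NaN
df_na = pd.read_csv(StringIO(csv_text), na_values='?')

print(df_raw.isnull().sum())
print(df_na.isnull().sum())
```

If the second count is larger on the real file, reading it with `na_values='?'` before step 5 lets `.dropna()` remove those rows.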

5) Drop observations with missing values using the DataFrame command .dropna()

In [None]:
df=df.dropna()

6) Display the first 10 rows of your DataFrame

In [None]:
df.head(10)

7) Display the last 10 rows of your DataFrame

In [None]:
df.tail(10)

8) Create a function named 'credit_approved'. The function should accept a parameter with a value of either + or -. The function should return 1 if the value was a +, 0 otherwise.

In [None]:
def credit_approved(x):
  if x=='+':
    return 1
  else:
    return 0

9) Create a new column in your DataFrame with the name 'approved' by applying the 'credit_approved' function to the column 'approval'

In [None]:
df['approved']=df['approval'].apply(credit_approved)
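
As an aside, the same 0/1 column can be built without a Python-level function by comparing the whole column at once. The sketch below uses a toy Series in place of the real 'approval' column and assumes its values are exactly '+' or '-'.

```python
import pandas as pd

# Toy stand-in for the 'approval' column
approval = pd.Series(['+', '-', '+', '-', '-'])

# Element-wise version, equivalent to the exercise's function
def credit_approved(x):
    return 1 if x == '+' else 0

via_apply = approval.apply(credit_approved)

# Vectorized version: compare the column, then cast True/False to 1/0
via_compare = (approval == '+').astype(int)

print(via_apply.tolist())
print(via_compare.tolist())
```

Both approaches give the same values; the vectorized form is usually faster on large columns.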

10) Drop the column 'approval' from the DataFrame by using the drop(columns='approval') command.

In [None]:
df=df.drop(columns='approval')

11) Use the command .value_counts() on the column 'approved' to count how many credit applications were approved versus denied.

In [None]:
df['approved'].value_counts()

12) Create dummy variables from all the categorical variables. You can use the following code:

df=pd.get_dummies(df, drop_first=True)




In [None]:
df=pd.get_dummies(df, drop_first=True)
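
To see what `get_dummies` does with `drop_first=True`, here is a minimal sketch on a made-up DataFrame. The column names mimic the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Invented values; only the column names resemble the real data
toy = pd.DataFrame({
    'A1': ['a', 'b', 'a', 'b'],
    'A4': ['u', 'y', 'l', 'u'],
    'A2': [30.8, 58.7, 24.5, 27.8],
})

# Each categorical column with k levels becomes k-1 indicator columns;
# numeric columns such as A2 are passed through unchanged
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping the first level avoids a redundant column: the omitted category is implied when all of that variable's indicators are 0.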

13) Obtain information about the variables in the dataset using the describe().T command

In [None]:
df.describe().T

14) Split your DataFrame into two parts:

- *X* which contains all the features (explanatory variables)<br>
- *y* which is your outcome variable (credit approval).

In [None]:
X=df.drop(columns=['approved'])
y=df['approved']

15) Split your X and y datasets into training and test datasets:

  Use 85% of your sample for training and 15% of your sample for testing.

  You can use the train_test_split function from sklearn.model_selection to do this.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.85)
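
The split above is random, so scores will change from run to run. One option, shown here as a sketch on toy data rather than as part of the required solution, is to fix `random_state` for reproducibility and pass `stratify` so both parts keep the same share of approvals. The value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 40% of them in the positive class
X_demo = pd.DataFrame({'feature': np.arange(100)})
y_demo = pd.Series([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps class shares equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.85, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(f"Positive share: train {y_tr.mean():.2f}, test {y_te.mean():.2f}")
```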

16) Normalize the columns in your X vectors. Normalization here means rescaling each variable so that its values range from 0 to 1.

  You can normalize all values using the following lines of code:

  from sklearn import preprocessing<br>
  min_max_scaler = preprocessing.MinMaxScaler()<br>
  X_train = min_max_scaler.fit_transform(X_train)<br>
  X_test = min_max_scaler.transform(X_test)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
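
The scaler is fitted on the training data only and then reused on the test data. The toy sketch below (made-up numbers) shows the consequence: a test value outside the training range is mapped outside the 0 to 1 interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.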

17) Use the following machine learning algorithms to predict credit approval.

  1. k-Nearest neighbors
  2. Logistic regression
  3. Decision tree
  4. Neural network

For each algorithm save the accuracy, precision and recall scores.

17a) k-Nearest neighbors:
- Import the k-Nearest neighbors classifier algorithm from scikit-learn.
- Import *ConfusionMatrixDisplay*, *precision_score* and *recall_score* from sklearn.metrics
- Create an instance of the model and fit it to the training data
- Compute: 1) the accuracy of the model on the test data, 2) the precision score for the test data, and 3) the recall score for the test data.
- Display a confusion matrix
- Decide on the appropriate 'k' for the algorithm.

Utilize print statements with f-strings to neatly display the scores.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

from sklearn.metrics import ConfusionMatrixDisplay, recall_score, precision_score

y_pred=knn.predict(X_test)

print(f"Accuracy score:  {knn.score(X_test,y_test):.4f}")
print(f"Precision score: {precision_score(y_test,y_pred):.4f}")
print(f"Recall score:    {recall_score(y_test,y_pred):.4f}")

ConfusionMatrixDisplay.from_predictions(y_test,y_pred)
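
One possible way to decide on 'k' is to compare cross-validated accuracy on the training data for several candidates. This is a sketch on synthetic data from `make_classification`, not on the credit data; the candidate range and the 5 folds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

cv_scores = {}
for k in range(1, 22, 2):  # odd values avoid tied votes with two classes
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"Best k: {best_k} (mean CV accuracy {cv_scores[best_k]:.4f})")
```

On the real data, `X_train` and `y_train` would take the place of the synthetic arrays, which leaves the test set untouched until the final evaluation.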

17b) Logistic regression:
- Import the logistic regression classifier algorithm from scikit-learn.
- Create an instance of the model and fit it to the training data
- Compute: 1) the accuracy of the model on the training data, 2) the accuracy of the model on the test data, 3) the precision score for the test data, and 4) the recall score for the test data.
- Display a confusion matrix

Utilize print statements with f-strings to neatly display the scores.

In [None]:
from sklearn.linear_model import LogisticRegression
logistic=LogisticRegression()
logistic.fit(X_train,y_train)

y_pred=logistic.predict(X_test)

print(f"Acc. training:   {logistic.score(X_train,y_train):.4f}")
print(f"Accuracy score:  {logistic.score(X_test,y_test):.4f}")
print(f"Precision score: {precision_score(y_test,y_pred):.4f}")
print(f"Recall score:    {recall_score(y_test,y_pred):.4f}")

ConfusionMatrixDisplay.from_predictions(y_test,y_pred)

17c) Decision tree:
- Import the decision tree classifier algorithm from scikit-learn.
- Create an instance of the model and fit it to the training data
- Compute: 1) the accuracy of the model on the training data, 2) the accuracy of the model on the test data, 3) the precision score for the test data, and 4) the recall score for the test data.
- Display a confusion matrix
- Decide on the appropriate depth of the tree.

Utilize print statements with f-strings to neatly display the scores.

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=4)
dtc.fit(X_train, y_train)

y_pred=dtc.predict(X_test)

print(f"Acc. training:   {dtc.score(X_train,y_train):.4f}")
print(f"Accuracy score:  {dtc.score(X_test,y_test):.4f}")
print(f"Precision score: {precision_score(y_test,y_pred):.4f}")
print(f"Recall score:    {recall_score(y_test,y_pred):.4f}")

ConfusionMatrixDisplay.from_predictions(y_test,y_pred)
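
To decide on the depth, it can help to watch how training and validation accuracy diverge as the tree grows. The sketch below uses synthetic data and a validation split carved out of the training portion; the depth range of 1 to 10 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, train_size=0.85, random_state=0
)

depth_scores = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    val_acc = tree.score(X_val, y_val)
    depth_scores.append((depth, train_acc, val_acc))
    print(f"depth={depth:2d}  train={train_acc:.4f}  validation={val_acc:.4f}")
```

A training score that keeps rising while the validation score stalls or falls suggests the tree is overfitting at that depth.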

17d) Neural network:
- Import the MLP neural network classifier algorithm from scikit-learn.
- Create an instance of the model and fit it to the training data
- Compute: 1) the accuracy of the model on the training data, 2) the accuracy of the model on the test data, 3) the precision score for the test data, and 4) the recall score for the test data.
- Display a confusion matrix

Utilize print statements with f-strings to neatly display the scores.

Note: you can increase the maximum number of iterations by setting max_iter as follows:<br>
mlp = MLPClassifier(max_iter=2000)



In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(max_iter=2000)
mlp.fit(X_train, y_train)

y_pred=mlp.predict(X_test)

print(f"Acc. training:   {mlp.score(X_train,y_train):.4f}")
print(f"Accuracy score:  {mlp.score(X_test,y_test):.4f}")
print(f"Precision score: {precision_score(y_test,y_pred):.4f}")
print(f"Recall score:    {recall_score(y_test,y_pred):.4f}")

ConfusionMatrixDisplay.from_predictions(y_test,y_pred)
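
To compare the models side by side, the saved scores can be collected in one table. The following is a minimal sketch on synthetic data with three of the four classifiers; the dictionary keys and hyperparameters are illustrative assumptions, and an `MLPClassifier` could be added to the dictionary in the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.85, random_state=0
)

# Illustrative settings, not tuned values
models = {
    'knn': KNeighborsClassifier(n_neighbors=5),
    'logistic': LogisticRegression(),
    'tree': DecisionTreeClassifier(max_depth=4, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # One row of test-set scores per model
    rows.append({
        'model': name,
        'accuracy': accuracy_score(y_te, pred),
        'precision': precision_score(y_te, pred),
        'recall': recall_score(y_te, pred),
    })

results = pd.DataFrame(rows).set_index('model')
print(results.round(4))
```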

18) Discuss the algorithms you used and how you determined the optimal specifications. Which algorithm performs best?