# Loan Status Prediction

Credit risk models are one of the most successful applications of data science. The task is to predict how financially reliable a person is, given a set of data about them.
Deploying such predictive models has raised many concerns about privacy, racial profiling, and related issues.

We will be using an anonymized dataset that consists of a set of attributes for a bank's users.
You may find the dataset [here](https://www.kaggle.com/zaurbegiev/my-dataset).

Let us first install the libraries we need (scikit-learn, NumPy, Matplotlib, and pandas):

In [None]:
!pip install scikit-learn numpy matplotlib pandas

Now let's import all the required libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.impute import SimpleImputer as Imputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, accuracy_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from warnings import filterwarnings
filterwarnings('ignore')

## Load Raw Data

In [None]:
df = pd.read_csv("credit_train.csv")

In [None]:
df = df[:10000]

In [None]:
df.head()

In [None]:
df.info()

## Data Preprocessing

#### Feature Selection

In [None]:
cols_to_remove = ['Loan ID','Customer ID']
data = df.drop(cols_to_remove, axis=1)


#### Data Cleaning

In [None]:
# Imputation strategy: replace missing values with the mean of the respective column
cols_to_clean = ['Current Loan Amount', 'Credit Score', 'Annual Income', 'Years of Credit History',
                 'Months since last delinquent', 'Number of Open Accounts', 'Number of Credit Problems',
                 'Current Credit Balance', 'Maximum Open Credit', 'Bankruptcies', 'Tax Liens']

imputer = Imputer(strategy='mean')  # 'mean' is also the default strategy
data[cols_to_clean] = imputer.fit_transform(data[cols_to_clean])
# Cast back to integers; the fractional part of imputed means is truncated
data[cols_to_clean] = data[cols_to_clean].astype(int)

#Remove rows that still contain one or more NaN values
data=data.dropna()
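
To make the imputation step concrete, here is a minimal sketch on a toy matrix. The values and the names `toy_values`, `toy_imputer`, and `toy_filled` are made up for illustration and are not part of the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (hypothetical values): two numeric columns with one gap each.
toy_values = np.array([[700.0, 50000.0],
                       [np.nan, 70000.0],
                       [720.0, np.nan]])

toy_imputer = SimpleImputer(strategy="mean")
toy_filled = toy_imputer.fit_transform(toy_values)

# Each NaN is replaced by the mean of the observed values in its column.
print(toy_filled)
print(toy_imputer.statistics_)  # per-column means learned during fit
```

Only the listed numeric columns are imputed in the cell above, which is why rows with gaps in the remaining columns are dropped afterwards.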


#### Feature Engineering

In [None]:
# Convert our target attribute to numerical values: 1 = 'Fully Paid', 0 = anything else
y = [1 if status == 'Fully Paid' else 0 for status in data['Loan Status']]

data = data.drop('Loan Status', axis=1)

In [None]:
# Convert categorical attributes to numerical values
print(data.info())
data = pd.get_dummies(data)
print(data.info())
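
As a small illustration of what `pd.get_dummies` does, the sketch below one-hot encodes a toy frame. The column names and values are hypothetical and only chosen to look like loan attributes.

```python
import pandas as pd

# Toy frame (hypothetical columns): one categorical and one numeric attribute.
toy_frame = pd.DataFrame({
    "Term": ["Short Term", "Long Term", "Short Term"],
    "Monthly Debt": [500.0, 1200.0, 300.0],
})

# Numeric columns pass through unchanged; each category becomes its own indicator column.
toy_encoded = pd.get_dummies(toy_frame)
print(toy_encoded.columns.tolist())
```

Every distinct category adds a column, so attributes with many distinct values can widen the feature matrix considerably.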

In [None]:
# Data Normalization -- Version one (standardization / z-score)
# Each feature is rescaled to have zero mean and unit standard deviation
xMean = np.mean(data, axis=0)
xDev = np.std(data, axis=0)
xNorm = (data - xMean) / xDev
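
The standardization above can be checked on a toy array. This is a sketch with made-up numbers; the guard for constant columns is an addition that the cell above does not include, and it matters only if some feature has zero standard deviation.

```python
import numpy as np

# Toy features (hypothetical values); the third column is constant on purpose.
toy_features = np.array([[1.0, 100.0, 5.0],
                         [2.0, 200.0, 5.0],
                         [3.0, 300.0, 5.0]])

col_mean = toy_features.mean(axis=0)
col_std = toy_features.std(axis=0)  # population standard deviation (ddof=0)

# Assumption: replace a zero standard deviation with 1 to avoid dividing by zero.
safe_std = np.where(col_std == 0, 1.0, col_std)
toy_standardized = (toy_features - col_mean) / safe_std

print(toy_standardized.mean(axis=0))
print(toy_standardized.std(axis=0))
```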

In [None]:
# Data Normalization -- Version two
# All features are scaled within a given range (0.0-1.0 by default)

x = data.values #returns numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
xMinMax = pd.DataFrame(x_scaled)
xNoNorm = data
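
For reference, min-max scaling with the default range maps each feature through `(x - min) / (max - min)`. The sketch below, on made-up numbers, compares that formula with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy features (hypothetical values) on very different scales.
toy_matrix = np.array([[1.0, 100.0],
                       [2.0, 400.0],
                       [5.0, 250.0]])

col_min = toy_matrix.min(axis=0)
col_max = toy_matrix.max(axis=0)
manual_scaled = (toy_matrix - col_min) / (col_max - col_min)

# The library scaler should agree with the manual formula.
library_scaled = MinMaxScaler().fit_transform(toy_matrix)
print(np.allclose(manual_scaled, library_scaled))
```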

In [None]:
xNorm

Next, we will compare predictive performance on the raw features against the two normalized versions.

In [None]:
data_versions = []
# Define three cases we will study
for x in [xNorm, xMinMax, xNoNorm]:
    version = {}
    version['x_train'], version['x_test'], version['y_train'], version['y_test'] = train_test_split(x, y, test_size= 0.25, random_state=13)
    data_versions.append(version)
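
The cells above compute the scaling statistics on the full dataset before splitting. A common alternative, sketched here on synthetic data with hypothetical variable names, is to fit the scaler on the training split only and then apply it to the test split, so that no test-set information enters preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the loan dataset).
rng = np.random.default_rng(13)
demo_X = rng.normal(size=(100, 3))
demo_y = rng.integers(0, 2, size=100)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=13)

# Learn min and max from the training rows only, then reuse them for the test rows.
demo_scaler = MinMaxScaler().fit(demo_X_train)
demo_X_train_scaled = demo_scaler.transform(demo_X_train)
demo_X_test_scaled = demo_scaler.transform(demo_X_test)

print(demo_X_train_scaled.min(axis=0), demo_X_train_scaled.max(axis=0))
```

With this approach, test values can fall slightly outside the 0.0-1.0 range, which is expected.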

## Model Training and Evaluation

In [None]:
# Initialize the classifier that we will be testing
clf = KNeighborsClassifier(n_neighbors=2)

for i, data_version in enumerate(data_versions):
    print("Evaluating the model on data version #{}".format(i+1))
    #train model with train data
    clf.fit(X=data_version['x_train'], y=data_version['y_train'])
        
    #predict test data
    predictions = clf.predict(X=data_version['x_test'])
        
    #calculate the accuracy
    accuracy = accuracy_score(data_version['y_test'], predictions)

    print("\t Classifier achieved {:.2f}% accuracy on test data.".format(100*accuracy))
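
One likely reason the three data versions score differently with k-nearest neighbours is that the classifier relies on distances, and unscaled features with large ranges dominate them. The sketch below uses three made-up applicants and assumed feature ranges (credit score 300 to 850, annual income 0 to 200000); none of these numbers come from the dataset.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Hypothetical applicants: [credit score, annual income]
a = np.array([700.0, 50000.0])
b = np.array([400.0, 50500.0])  # very different score, similar income
c = np.array([705.0, 52000.0])  # similar score, somewhat different income

# On raw features, the income difference dominates the distance.
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Assumed feature ranges, used only for this illustration.
lo = np.array([300.0, 0.0])
hi = np.array([850.0, 200000.0])

def scale(v):
    return (v - lo) / (hi - lo)

scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))

print(raw_ab < raw_ac)        # raw: b looks closer to a than c does
print(scaled_ab > scaled_ac)  # scaled: c is closer to a than b is
```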

### TODO:
Follow an approach similar to the one in the previous cell.
<br>Evaluate the performance of the Logistic Regression and Random Forest classifiers.
<br>Try to find the set of parameters that works best for each classifier.

In [None]:
# Using: Logistic Regression Classifier

In [None]:
# Using: Random Forest Classifier
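
One possible way to search for parameters is a small grid search. The sketch below runs on synthetic data rather than the loan dataset, and the grid values are arbitrary starting points, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the loan dataset).
grid_X, grid_y = make_classification(n_samples=200, n_features=8, random_state=13)

# Hypothetical parameter grid; adjust the candidates for the real data.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=13),
                      param_grid, cv=3, scoring="accuracy")
search.fit(grid_X, grid_y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern applies to `LogisticRegression`, for example with a grid over its `C` parameter.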

#### Evaluate the performance of the best combination, using the k-fold cross-validation approach

In [None]:
# Implement Cross Validation using Logistic Regression classifier
# Using xMinMax dataframe

clf = <BEST CLASSIFIER>
X_data = <BEST DATA FORMAT>

# 5 fold cross validation
scores = cross_val_score(clf, X_data, y, verbose=1, cv=5)
print(scores)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
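
As a self-contained illustration of 5-fold cross-validation, the sketch below uses synthetic data and a pipeline, so the scaler is refit on the training folds of each split. The choice of `LogisticRegression` with `MinMaxScaler` is an assumption made for the example, not a statement about which combination works best on the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the loan dataset).
cv_X, cv_y = make_classification(n_samples=300, n_features=6, random_state=13)

# Scaling happens inside the pipeline, separately for every fold.
cv_model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

cv_scores = cross_val_score(cv_model, cv_X, cv_y, cv=5)
print(cv_scores)
print("Accuracy: %0.3f (+/- %0.3f)" % (cv_scores.mean(), cv_scores.std() * 2))
```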

In [None]:
### TODO: Evaluate the performance of 