**Loan Status Classification**
In this notebook we look at the loan data of a bank or financial organization. We explore borrower features such as credit score, mortgage, annual income, and years of employment, and train classifiers to predict whether a loan will be paid off.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
from sklearn import preprocessing
pd.set_option('display.float_format', lambda x: '%.2f' % x)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
from sklearn.linear_model import SGDClassifier

# Any results you write to the current directory are saved as output.

In [None]:
dataframe = pd.read_csv('../input/credit_train.csv')

**Before starting with any analysis, we take a small peek at our data and some of the values.**

In [None]:
print("Number of rows:", dataframe.shape[0])
print("Number of columns:", dataframe.shape[1])

In [None]:
dataframe.head()

In [None]:
dataframe.describe()

Here we see something strange: the average credit score is above 1076, even though credit scores fall within the range 300-850. Let's take a look at the credit score data and check whether any scores are greater than 800.

In [None]:
df = dataframe[dataframe['Credit Score']>800]
df.head()

As we can see, some of the credit scores appear to be scaled up by a factor of 10. For ease of calculation, we assume that scaling them back down gives the correct score.

In [None]:
dataframe['Credit Score'] = dataframe['Credit Score'].apply(lambda val: (val /10) if val>850 else val)
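
The same correction can also be written without a row-wise `apply`. The following is a minimal sketch on made-up scores (the `scores` series is hypothetical, not taken from the dataset) that uses `Series.mask` to divide only the values above 850 by 10:

```python
import pandas as pd

# Hypothetical sample: two scores look scaled up by 10, one is missing.
scores = pd.Series([720.0, 7200.0, 6500.0, 850.0, None], name="Credit Score")

# mask() replaces values where the condition is True and leaves the rest,
# including missing values, untouched.
fixed = scores.mask(scores > 850, scores / 10)
print(fixed.tolist())
```

Missing scores stay missing here, so they can still be imputed in a later step.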

In [None]:
dataframe.describe()

Now we can see our average credit score is within a normal credit score range so we can go further with our preprocessing.

In [None]:
dataframe.head()

In [None]:
dataframe.dropna(subset=['Loan Status'], inplace = True)

In [None]:
le = preprocessing.LabelEncoder()
dataframe['Loan Status'] = le.fit_transform(dataframe['Loan Status'])

In [None]:
dataframe.head()

# Loan Status is the categorical target variable here, denoting whether a loan was fully paid or charged off. After label encoding, 0 stands for Charged Off and 1 for Fully Paid. In this notebook, we aim to predict it as our final output.
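
To see which integer each status receives, the fitted encoder can be inspected. This is a small sketch on a hypothetical list of statuses; `LabelEncoder` sorts its classes, so the code assigned to each label is its position in `classes_`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the two loan statuses.
statuses = ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(statuses)

# classes_ is sorted, so the index of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```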

In [None]:
coffvalue = dataframe[dataframe['Loan Status'] == 0]['Loan Status'].count()
fpaidvalue = dataframe[dataframe['Loan Status'] == 1]['Loan Status'].count()
data = {"Counts":[coffvalue, fpaidvalue] }
statusDF = pd.DataFrame(data, index=["Charged Off", "Fully Paid"])
# statusDF.head()
statusDF.plot(kind='bar', title="Status of the Loan")

In [None]:
print("Value counts for each term: \n",dataframe['Term'].value_counts())
print("Missing data in loan term:",dataframe['Term'].isna().sum())

In [None]:
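# Assign the result back instead of using inplace=True on a column selection,
# which may not update the dataframe in recent pandas versions.
dataframe['Term'] = dataframe['Term'].map({"Short Term": 0, "Long Term": 1})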
dataframe.head()

In [None]:
scount = dataframe[dataframe['Term'] == 0]['Term'].count()
lcount = dataframe[dataframe['Term'] ==1]['Term'].count()

data = {"Counts":[scount, lcount]}
termDF = pd.DataFrame(data, index=["Short Term", "Long Term"])
termDF.head()

In [None]:
termDF.plot(kind="barh", title="Term of Loans")

Since credit score is one of the most important parts of our analysis, we explore and handle its missing data before proceeding with anything else.

In [None]:
print("There are ", dataframe['Credit Score'].isna().sum(), "null values for Credit score.")

***There are multiple ways to handle missing data, one of which is to fill in the average of the column in place of the missing values. Here we follow the same concept with a small tweak. We assume that the credit scores of people with short term loans differ from the credit scores of people with long term loans. Hence we compute a separate average credit score for each loan term and fill each missing credit score with the average for that loan's term.***

In [None]:
cscoredf = dataframe[dataframe['Term']==0]
stermAVG = cscoredf['Credit Score'].mean()
print(stermAVG)

In [None]:
lscoredf = dataframe[dataframe['Term']==1]
ltermAVG = lscoredf['Credit Score'].mean()
print(ltermAVG)

In [None]:
dataframe.head()

In [None]:
do_nothing = lambda: None

In [None]:
dataframe.loc[(dataframe.Term ==0) & (dataframe['Credit Score'].isnull()),'Credit Score'] = stermAVG

In [None]:
dataframe.loc[(dataframe.Term ==1) & (dataframe['Credit Score'].isnull()),'Credit Score'] = ltermAVG
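
The per-term imputation above can also be expressed in one step with `groupby(...).transform`. The sketch below uses a made-up frame (`toy` and its values are hypothetical) to show the idea:

```python
import pandas as pd

# Hypothetical data: one missing score in each loan term (0 = short, 1 = long).
toy = pd.DataFrame({
    "Term": [0, 0, 0, 1, 1, 1],
    "Credit Score": [700.0, 740.0, None, 650.0, None, 690.0],
})

# transform("mean") returns, for every row, the mean score of that row's term.
term_mean = toy.groupby("Term")["Credit Score"].transform("mean")
toy["Credit Score"] = toy["Credit Score"].fillna(term_mean)
print(toy)
```

This generalizes to any number of groups without writing one `.loc` statement per group.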

Classifiers can work with continuous features, but in this notebook we choose to turn every feature into a categorical one. After filling in the missing values, we group our credit scores into ranges based on **Experian's Credit Score Range**.

In [None]:
dataframe['Credit Score'] = dataframe['Credit Score'].apply(lambda val: "Poor" if np.isreal(val) and val < 580 else val)
dataframe['Credit Score'] = dataframe['Credit Score'].apply(lambda val: "Average" if np.isreal(val) and (val >= 580 and val < 670) else val)
dataframe['Credit Score'] = dataframe['Credit Score'].apply(lambda val: "Good" if np.isreal(val) and (val >= 670 and val < 740) else val)
dataframe['Credit Score'] = dataframe['Credit Score'].apply(lambda val: "Very Good" if np.isreal(val) and (val >= 740 and val < 800) else val)
dataframe['Credit Score'] = dataframe['Credit Score'].apply(lambda val: "Exceptional" if np.isreal(val) and (val >= 800 and val <= 850) else val)
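
As an alternative to the chained `apply` calls, `pd.cut` can assign all bands in a single pass. This is a sketch on hypothetical scores; the band labels mirror the ones used above, and leaving the last bin open-ended (instead of stopping at 850) is an assumption made here for simplicity:

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including values that sit exactly on band edges.
scores = pd.Series([450.0, 580.0, 669.9, 700.0, 799.0, 800.0, 850.0])

# right=False makes each bin left-closed, e.g. [580, 670).
bins = [-np.inf, 580, 670, 740, 800, np.inf]
labels = ["Poor", "Average", "Good", "Very Good", "Exceptional"]
bands = pd.cut(scores, bins=bins, labels=labels, right=False)
print(bands.astype(str).tolist())
```

Note that `pd.cut` returns a categorical column, so calling `.astype(str)` or `.astype(object)` afterwards keeps later steps working with plain strings.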

In [None]:
dataframe['Credit Score'].value_counts().sort_values(ascending = True).plot(kind='bar', title ='Number of loans in terms of Credit Score category')

Next up we look at our annual income column and fill up the missing values with the average of the column.

In [None]:
print("There are",dataframe['Annual Income'].isna().sum(), "Missing Annual Income Values.")

In [None]:
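# Assign the result back; inplace=True on a column selection may not update the dataframe in recent pandas.
dataframe['Annual Income'] = dataframe['Annual Income'].fillna(dataframe['Annual Income'].mean())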

In [None]:
dataframe.head()

Following up on our step with Credit Score, we now convert its categories into numeric indicator columns. Since the column has more than two classes, we use one-hot encoding. To avoid multicollinearity we drop one of the encoded columns, since its value is fully determined by the other four. We follow the same approach wherever one-hot encoding is used in this notebook.

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Credit Score'], drop_first = True))
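
The effect of `drop_first=True` is easiest to see on a tiny example. In this sketch the `toy` series is hypothetical, and the `prefix` argument is an addition that the notebook itself does not use; it is shown only as one way to get descriptive column names directly:

```python
import pandas as pd

# Hypothetical credit score categories.
toy = pd.Series(["Good", "Poor", "Average", "Very Good", "Good"], name="Credit Score")

# Categories are sorted, so "Average" comes first and is the one dropped.
dummies = pd.get_dummies(toy, prefix="Credit", drop_first=True)
print(list(dummies.columns))
print(dummies)
```

The row that held "Average" has no indicator set, which is how the dropped category is still represented.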

Since our values were only adjectives, we give the new columns clearer names.

In [None]:
# rename() returns a new dataframe, so the result has to be assigned back to take effect.
dataframe = dataframe.rename(columns={'Good':'Credit Good', 'Very Good':'Credit Very Good'})

In [None]:
dataframe = dataframe.drop(['Credit Score'], axis=1)

In [None]:
dataframe['Purpose'].value_counts().sort_values(ascending=True).plot(kind='barh', title="Purpose for Loans", figsize=(15,10))

In [None]:
purposeloanstatus = dataframe[['Purpose','Loan Status']]
purposeloanstatus.head()

In [None]:
pd.crosstab(purposeloanstatus['Purpose'], purposeloanstatus['Loan Status']).plot(kind='bar', stacked=True, figsize=(20,10), title="Purpose of Loan Vs Loan Payment Status", )

Next up, we take a look at the home ownership status of the people who have taken loans and visualize it.

In [None]:
dataframe['Home Ownership'].value_counts().sort_values(ascending = True).plot(kind='bar', title="Number of Loan based on Home ownership")

As we can see, most of the loans were taken either by people who have a mortgage on their home or by people who rent.

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Home Ownership'],drop_first = True))

In [None]:
dataframe = dataframe.drop(['Home Ownership'], axis=1)

Moving forward, length of employment is one of the major factors in judging a person's financial stability and the security of their income. Here our data is a string with non-uniform values. First we extract the number from each value, and then we bin it into uniform ranges to convert it into a categorical variable.

In [None]:
dataframe['Years in current job']=dataframe['Years in current job'].str.extract(r"(\d+)")
dataframe['Years in current job'] = dataframe['Years in current job'].astype(float)
# dataframe['Years in current job'].fillna(dataframe['Years in current job'].mean(), inplace = True)
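
To illustrate what the regular expression does, here is a sketch on a few made-up strings (the sample values are assumptions about what such a column might contain). `(\d+)` captures the first run of digits, so qualifiers like `+` or `<` are lost:

```python
import pandas as pd

# Hypothetical raw values, including a missing entry.
raw = pd.Series(["10+ years", "< 1 year", "8 years", None])

# expand=False returns a Series rather than a one-column DataFrame.
years = raw.str.extract(r"(\d+)", expand=False).astype(float)
print(years.tolist())
```

Under this scheme a value such as "< 1 year" is treated as 1, and missing entries remain missing until they are filled.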


In [None]:
expmean = dataframe['Years in current job'].mean()

In [None]:
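# Assign the result back; inplace=True on a column selection may not update the dataframe in recent pandas.
dataframe['Years in current job'] = dataframe['Years in current job'].fillna(expmean)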

Now that we have a numerical value for employment length, we use uniform ranges to convert it into categories.

In [None]:
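# Under 4 years is junior, 4 to under 8 is mid, and 8 or more is senior (exactly 4 years previously fell through to senior).
dataframe['Employment History'] = dataframe['Years in current job'].apply(lambda x: "Emp Level Jr." if x < 4 else ("Emp Level Mid" if x < 8 else "Emp Senior"))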

In [None]:
dataframe.head()

In [None]:
dataframe = dataframe.drop(['Years in current job'], axis=1)

Now that we have the categories for our employment history, we use one hot encoding on the column.

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Employment History'],drop_first = True))

We then drop the Employment History column.

In [None]:
dataframe = dataframe.drop(['Employment History'], axis=1)

If we take a look at our data, there are columns like Loan ID and Customer ID which aren't important for our analysis. One could argue that in some cases the purpose of the loan is a deciding factor, but here we consider it unimportant and drop it as well.

In [None]:
dataframe = dataframe.drop(['Loan ID','Customer ID','Purpose'], axis=1)

In [None]:
dataframe.head()

Next up is the number of credit problems reported for each individual loanee. We split it into three categories: 0 is no credit problem, 1-5 is some credit problems, and more than 5 is major credit problems.

In [None]:
# 0 is none, 1-5 is some, more than 5 is major (exactly 5 previously fell into the major group).
dataframe['Credit Problems'] = dataframe['Number of Credit Problems'].apply(lambda x: "No Credit Problem" if x == 0 else ("Some Credit Problem" if 0 < x <= 5 else "Major Credit Problems"))

In [None]:
dataframe['Credit Problems'].value_counts()

In [None]:
dataframe['Credit Problems'].value_counts().sort_values(ascending=True).plot(kind='barh', title="Loans vs Credit problems of Loanee")

Looking at the graph above, we find support for the common assumption that loans are generally not given to people with credit problems. Next up, we convert Credit Problems into indicator columns.

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Credit Problems'],drop_first = True))

In [None]:
dataframe = dataframe.drop(['Credit Problems','Number of Credit Problems'], axis=1)

In [None]:
dataframe.head()

Another important feature for financial stability identification is the years of credit history. We look at the given credit age of individuals and categorize them using one hot encoding.

In [None]:
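# Under 5 years is short, 5 to under 17 is good, 17 or more is exceptional (exactly 5 years previously fell through to exceptional).
dataframe['Credit Age'] = dataframe['Years of Credit History'].apply(lambda x: "Short Credit Age" if x < 5 else ("Good Credit Age" if x < 17 else "Exceptional Credit Age"))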

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Credit Age'],drop_first = True))

In [None]:
dataframe = dataframe.drop(['Credit Age','Years of Credit History'], axis =1)
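
The notebook repeats the same three steps for each categorical column: one-hot encode, join, drop the original. A possible way to factor this out is sketched below; `one_hot_replace` is a hypothetical helper name, not something the notebook defines, and the `toy` frame is made up:

```python
import pandas as pd

def one_hot_replace(df, column):
    """Replace `column` with its one-hot columns, dropping the first category."""
    dummies = pd.get_dummies(df[column], drop_first=True)
    return df.drop(columns=[column]).join(dummies)

# Hypothetical frame with one categorical column.
toy = pd.DataFrame({
    "Term": [0, 1, 0],
    "Credit Age": ["Short Credit Age", "Good Credit Age", "Exceptional Credit Age"],
})
encoded = one_hot_replace(toy, "Credit Age")
print(list(encoded.columns))
```

Because the helper returns a new frame, it would be used as `dataframe = one_hot_replace(dataframe, 'Credit Age')`.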
dataframe.head()

We move forward with the assumption that some of the columns are correlated with others, and hence we try to reduce the number of features. For example, we already have credit score and credit problems, which are calculated from features like maximum open credit, current credit balance etc. So we drop some of the columns that we assume are already covered by features we have in our dataframe.

In [None]:
dataframe = dataframe.drop(['Months since last delinquent','Number of Open Accounts','Maximum Open Credit','Current Credit Balance','Monthly Debt'],axis=1)
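
The columns above are dropped on an assumption rather than a measurement. One way such an assumption could be checked beforehand is pairwise Pearson correlation with `DataFrame.corr`. The sketch below uses made-up numbers under the real column names, so its output says nothing about the actual dataset:

```python
import pandas as pd

# Hypothetical values: the first two columns move together, the third does not.
toy = pd.DataFrame({
    "Maximum Open Credit": [10_000, 20_000, 30_000, 40_000],
    "Current Credit Balance": [5_000, 10_000, 15_000, 20_000],
    "Monthly Debt": [900, 300, 700, 500],
})

# corr() computes Pearson correlation between every pair of numeric columns.
corr = toy.corr()
print(corr.round(2))
```

Pairs with a coefficient close to 1 or -1 carry largely redundant information, which makes them candidates for dropping.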

In [None]:
dataframe.head()

Further exploring the financial stability of each loanee, we look at the number of tax liens on their property, which gives us information about their previous commitments.

In [None]:
dataframe['Tax Liens'] = dataframe['Tax Liens'].apply(lambda x: "No Tax Lien" if x==0 else ("Some Tax Liens" if x>0 and x<3 else "Many Tax Liens"))

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Tax Liens'],drop_first = True))

In [None]:
dataframe = dataframe.drop(['Tax Liens'],axis=1)
dataframe.head()

Furthermore, we take a look at the number of bankruptcies filed by each person and categorize them.

In [None]:
dataframe['Bankruptcies'] = dataframe['Bankruptcies'].apply(lambda x: "No bankruptcies" if x==0 else ("Some Bankruptcies" if x>0 and x<3 else "Many Bankruptcies"))

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Bankruptcies'],drop_first = True))

In [None]:
dataframe = dataframe.drop(['Bankruptcies'],axis=1)
dataframe.head()

In [None]:
dataframe.describe()

**We have chosen to work with categorical features, so next we convert the remaining continuous variables, annual income and current loan amount, into discrete categories.**

**Before deciding the range of each category we do some calculation. Some values are outliers that lie far away from the other amounts, so we calculate the mean and standard deviation without them.**

**We assume: Mean - 1 standard deviation = low income line, and Mean + 1 standard deviation = high income line, and similarly for the loan amount.**

In [None]:
meanxoutlier = dataframe[dataframe['Annual Income'] < 99999999.00 ]['Annual Income'].mean()
stddevxoutlier = dataframe[dataframe['Annual Income'] < 99999999.00 ]['Annual Income'].std()
poorline = meanxoutlier -  stddevxoutlier
richline = meanxoutlier + stddevxoutlier
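
A worked example may make the thresholds more concrete. The sketch below uses made-up incomes (all values are hypothetical) and applies the same rule: exclude the placeholder value, then take mean minus and plus one standard deviation as the two lines:

```python
import pandas as pd

# Rows at or above this value are left out of the mean/std, as in the cell above.
SENTINEL = 99_999_999.0

# Hypothetical incomes, with one placeholder entry.
income = pd.Series([20_000.0, 55_000.0, 60_000.0, 65_000.0, 100_000.0, SENTINEL])
clean = income[income < SENTINEL]
low_line = clean.mean() - clean.std()
high_line = clean.mean() + clean.std()

def income_band(x):
    if x <= low_line:
        return "Low Income"
    return "Average Income" if x < high_line else "High Income"

bands = income.apply(income_band).tolist()
print(round(low_line), round(high_line), bands)
```

With this rule the placeholder entry still receives a label, and it always lands in the highest band.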

In [None]:
dataframe['Annual Income'] = dataframe['Annual Income'].apply(lambda x: "Low Income" if x<=poorline else ("Average Income" if x>poorline and x<richline else "High Income"))

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Annual Income'],drop_first = True))

In [None]:
dataframe = dataframe.drop(['Annual Income'], axis=1)
dataframe.head()

In [None]:
lmeanxoutlier = dataframe[dataframe['Current Loan Amount'] < 99999999.00 ]['Current Loan Amount'].mean()
lstddevxoutlier = dataframe[dataframe['Current Loan Amount'] < 99999999.00 ]['Current Loan Amount'].std()
lowrange = lmeanxoutlier - lstddevxoutlier
highrange = lmeanxoutlier + lstddevxoutlier
print(lowrange, highrange)

In [None]:
dataframe['Current Loan Amount'] = dataframe['Current Loan Amount'].apply(lambda x: "Small Loan" if x<=lowrange else ("Medium Loan" if x>lowrange and x<highrange else "Big Loan"))

In [None]:
dataframe = dataframe.join(pd.get_dummies(dataframe['Current Loan Amount'],drop_first = True))

In [None]:
dataframe = dataframe.drop(['Current Loan Amount'], axis=1)

In [None]:
dataframe.head()

Now that all the features in our dataframe are categorical, we can divide it into training and test sets and plug it into some classification algorithms.

In [None]:
y = dataframe['Loan Status']
X = dataframe.drop(['Loan Status'],axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
knnclassifier = KNeighborsClassifier(n_neighbors = int(X.shape[1]/2))
knnclassifier.fit(X_train, y_train)
prediction = knnclassifier.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, prediction))
# y_true = y_test
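
An accuracy score is easier to interpret next to a baseline. One simple reference point, sketched here on hypothetical labels, is the accuracy obtained by always predicting the most frequent class:

```python
import pandas as pd

# Hypothetical labels: 1 = Fully Paid, 0 = Charged Off.
y_toy = pd.Series([1, 1, 1, 0, 1, 0, 1, 1])

# Share of the most frequent class = accuracy of always predicting it.
baseline_accuracy = y_toy.value_counts(normalize=True).max()
print(baseline_accuracy)
```

In the notebook this would be computed on `y_test`; a classifier is only adding value if it scores above this figure.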


In [None]:
tneg, fpos, fneg, tpos = confusion_matrix(y_test, prediction).ravel()
print(tneg,fpos,fneg,tpos)
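
The four counts printed above can be turned into rates that are easier to compare across models. This is a minimal sketch with made-up counts; `summarize` is a hypothetical helper, and "positive" refers to class 1 (Fully Paid under the label encoding used earlier):

```python
def summarize(tneg, fpos, fneg, tpos):
    """Derive common rates from binary confusion matrix counts."""
    total = tneg + fpos + fneg + tpos
    return {
        "accuracy": (tneg + tpos) / total,
        "precision": tpos / (tpos + fpos),    # predicted positives that are correct
        "recall": tpos / (tpos + fneg),       # actual positives that were found
        "specificity": tneg / (tneg + fpos),  # actual negatives that were found
    }

# Hypothetical counts, not results from this notebook.
metrics = summarize(tneg=50, fpos=10, fneg=5, tpos=35)
print(metrics)
```

Specificity is worth watching here because it measures how many charged-off loans the model actually catches.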

In [None]:
lregclassifier = LogisticRegression()
lregclassifier.fit(X_train,y_train)
lregprediction = lregclassifier.predict(X_test)
print("Score: ",lregclassifier.score(X_test, y_test))

In [None]:
tneg, fpos, fneg, tpos = confusion_matrix(y_test, lregprediction).ravel()
print(tneg,fpos,fneg,tpos)

In [None]:
from sklearn.svm import SVC
clf = SVC(gamma='auto', kernel ='linear')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, pred))

In [None]:
tneg, fpos, fneg, tpos = confusion_matrix(y_test, pred).ravel()
print(tneg,fpos,fneg,tpos)

In [None]:
XGBclf = XGBClassifier()
XGBclf.fit(X_train,y_train)

In [None]:
XGBpred = XGBclf.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, XGBpred))

In [None]:
tneg, fpos, fneg, tpos = confusion_matrix(y_test, XGBpred).ravel()
print(tneg,fpos,fneg,tpos)

In [None]:
SGDclf = SGDClassifier(loss='modified_huber',shuffle=True)
SGDclf.fit(X_train,y_train)

In [None]:
SGDpred = SGDclf.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, SGDpred))

In [None]:
tneg, fpos, fneg, tpos = confusion_matrix(y_test, SGDpred).ravel()
print(tneg,fpos,fneg,tpos)