# Lesson 11: Hands-on practices on Data Processing and Machine Learning models

## Practices on Data Exploration and Pre-processing with Python

## Problem 
#### Company A wants to automate its loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the process, the company has asked us to identify the customer segments that are eligible for a loan, so that it can target those customers specifically.

## Our mission: 
### 1. Explore/analyse and pre-process the data. Make it ready for use as input to train some Machine learning model (data modeling step).
### 2. Build Machine Learning models to predict Loan eligibility for customers


### Dataset: loan-prediction (train.csv, test.csv)

## Create a Pandas DataFrame with data imported from the dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("../data/train.csv")

## Data exploration

## Print out first N rows of data for checking

In [1]:
print(df.head(10))

3. Descriptive statistics of data

In [None]:
df.describe()

4. Count the number of values for specific columns

In [None]:
df['Property_Area'].value_counts()

In [None]:
df['Credit_History'].value_counts()

In [None]:
df['Education'].value_counts()
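
By default `value_counts()` leaves out missing values, so the counts above may not add up to the number of rows. A minimal sketch on a small made-up Series (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy column standing in for something like df['Credit_History'] (made-up values)
s = pd.Series([1.0, 1.0, 0.0, np.nan, 1.0, np.nan])

counts = s.value_counts()                        # NaN is excluded by default
counts_with_nan = s.value_counts(dropna=False)   # NaN gets its own row
shares = s.value_counts(normalize=True)          # proportions of the non-missing values

print(counts)
print(counts_with_nan)
print(shares)
```

Comparing `counts.sum()` with `len(s)` is a quick way to spot columns that contain missing values.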

## Distribution Analysis

5. Show Histogram and ScatterPlot of values of specific columns, observe the 'Outliers'

In [None]:
df['ApplicantIncome'].hist(bins=50)
plt.show()

In [None]:
plt.boxplot(df['ApplicantIncome'])
plt.show()

In [None]:
plt.hist(df['ApplicantIncome'], bins=50)
plt.show()

In [None]:
df.plot(kind='scatter', x='ApplicantIncome', y='LoanAmount')
plt.show()

6. Draw Boxplot and Histogram of values of specific columns, observe the 'Outliers'

In [None]:
df.boxplot(column='ApplicantIncome')
plt.show()

In [None]:
df.boxplot(column='ApplicantIncome', by='Education')
plt.show()

In [None]:
df.boxplot(column='LoanAmount', by='Education')
plt.show()

In [None]:
df.boxplot(column='ApplicantIncome', by='Gender')
plt.show()

In [None]:
df['LoanAmount'].hist(bins=50)
plt.show()

In [None]:
df.boxplot(column='LoanAmount')
plt.show()
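
The points drawn beyond the whiskers of a boxplot are, with matplotlib's default settings, the values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The same rule can be applied numerically to list the outliers. The sketch below uses made-up income values, not the real `ApplicantIncome` column:

```python
import pandas as pd

# Made-up income values with two extreme entries
income = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 39000, 81000])

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = income[(income < lower) | (income > upper)]
print("lower fence:", lower, "upper fence:", upper)
print(outliers)
```

On the real data, replacing `income` with `df['ApplicantIncome']` should list the same points that the boxplot shows as individual markers.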

## Analysis on categorical variables

7. Frequency Table, Pivot Table, Cross Table

In [None]:
temp1 = df['Credit_History'].value_counts(ascending=True)
print("Frequency Table for Credit History:")
print(temp1)


In [None]:
temp1_ = df['Property_Area'].value_counts()
print("Frequency table for Property Area:")
print(temp1_)

In [None]:
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'], 
                       aggfunc=lambda x:x.map({'Y':1,'N':0}).mean())
print("\nProbability of getting Loan for each Credit History class:")
print(temp2)

In [None]:
my_add = lambda x, y : (x + y) 

In [None]:
def my_add2(x, y):
    return x + y

In [None]:
my_add(10,6)

In [None]:
my_add2(10, 6)

In [None]:
temp2___ = df.pivot_table(values=['Loan_Status','LoanAmount'], index=['Property_Area'],
                          aggfunc={'Loan_Status': lambda x: x.map({'Y':1,'N':0}).mean(),
                                   'LoanAmount': 'mean'})  #the Y/N mapping only makes sense for Loan_Status
print(temp2___)

In [None]:
temp2_ = df.pivot_table(values='Loan_Status', index=['Education'],
                        aggfunc=lambda x:x.map({'Y':1, 'N':0}).mean())
print(temp2_)

In [None]:
temp2__ = df.pivot_table(values='ApplicantIncome', index=['Education'])
print(temp2__)
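
The pivot tables above compute a mean per group. An equivalent way to get the approval rate, sketched here on a made-up table, is to encode `Loan_Status` as 0/1 once and then use either `pivot_table` or `groupby`. The column name `Loan_Status_num` is an assumption introduced for this example only:

```python
import pandas as pd

# Made-up rows, only for illustration
toy = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    'Loan_Status':    ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# Encode the target once instead of inside every aggfunc
toy['Loan_Status_num'] = toy['Loan_Status'].map({'Y': 1, 'N': 0})

via_pivot = toy.pivot_table(values='Loan_Status_num', index='Credit_History', aggfunc='mean')
via_groupby = toy.groupby('Credit_History')['Loan_Status_num'].mean()

print(via_pivot)
print(via_groupby)
```

Both results hold the share of approved loans per `Credit_History` class; `pivot_table` returns a DataFrame and `groupby(...).mean()` a Series.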

## Visualize the frequency tables, pivot tables

#### Visualize Frequency Table 

In [None]:
fig = plt.figure(figsize=(8,4)) #figsize=(x,y) --> set size of the figure: width x inches, height y inches
ax1 = fig.add_subplot(121) #1 row, 2 columns, first subplot
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar', ax=ax1) #draw on ax1 explicitly
plt.show()

#### Visualize Pivot Table

In [None]:
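fig, ax2 = plt.subplots(figsize=(4,4)) #use a fresh figure; the one from the previous cell has already been shown
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title('Probability of getting Loan by Credit_History')
temp2.plot(kind='bar', ax=ax2) #without ax=..., a DataFrame plot opens its own figure and the labels are lost
plt.show()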

In [None]:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
plt.show()

In [None]:
#Add 'Gender' into the above cross table
temp3 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
plt.show()
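
The stacked bars above show raw counts, which makes groups of different size hard to compare. `pd.crosstab` can also return row-wise proportions through its `normalize` argument. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, only for illustration
toy = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    'Loan_Status':    ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

counts = pd.crosstab(toy['Credit_History'], toy['Loan_Status'])
# normalize='index' divides every row by its own total
rates = pd.crosstab(toy['Credit_History'], toy['Loan_Status'], normalize='index')

print(counts)
print(rates)
```

Plotting `rates` with `kind='bar', stacked=True` would give bars of equal height, so the approved share per group can be read directly.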

## Data Pre-processing with Python (using Pandas)

8. Check missing values in Dataset 

In [None]:
#A quick way: use descriptive statistics
df.describe()

In [None]:
df.apply(lambda x: sum(x.isnull()), axis=0)  #axis=0 => along the rows of each column;

In [None]:
df['LoanAmount'].head(10)
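
`describe()` only summarises numeric columns by default, so missing values in text columns such as `Gender` do not show up there. A short sketch on a made-up frame that counts missing values per column and also expresses them as a share of the rows:

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one text column
toy = pd.DataFrame({
    'LoanAmount': [128.0, np.nan, 66.0, np.nan],
    'Gender':     ['Male', None, 'Female', 'Male'],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of rows that are missing per column

print(missing_count)
print(missing_share)
```

`toy.isnull().sum()` gives the same result as the `apply(lambda x: sum(x.isnull()), axis=0)` form used above, in a shorter spelling.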

### 9. How to fill missing values 

#### Example: 'LoanAmount'

In [None]:
#Note: teach about the fillna function on a small example frame
example = pd.DataFrame([[np.nan, 2, np.nan, 0],
                        [3, 4, np.nan, 1],
                        [np.nan, np.nan, np.nan, 5],
                        [np.nan, 3, np.nan, 4]],
                       columns=list('ABCD'))
print(example)
#Replace all NaN elements with 0s
print(example.fillna(0))
#We can also propagate non-null values forward (ffill) or backward (bfill)
print(example.ffill())
#Replace all NaN elements in columns 'A', 'B', 'C' and 'D' with 0, 1, 2 and 3 respectively
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
print(example.fillna(value=values))
#Only replace the first NaN element in each column
print(example.fillna(value=values, limit=1))

In [None]:
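#Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())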

In [None]:
df.apply(lambda x: sum(x.isnull()), axis=0)

In [None]:
df.boxplot(column='ApplicantIncome', by=['Education','Self_Employed'])
plt.show()

In [None]:
df['Self_Employed'].value_counts()

In [None]:
#'No' is the most frequent value, so use it for the missing entries
df['Self_Employed'] = df['Self_Employed'].fillna('No')

In [None]:
df.apply(lambda x: sum(x.isnull()), axis=0)
#We can see below in the column 'Self_Employed' that all missing values/NaN were filled

In [None]:
df['Self_Employed'].value_counts()
#--> with value as 'No'

In [None]:
table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education',
                       aggfunc='median')  #the string name avoids the FutureWarning raised for np.median
print(table)

In [None]:
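def fage(x):
    #Look up the median LoanAmount for this row's (Self_Employed, Education) group
    return table.loc[x['Self_Employed'], x['Education']]

# Replace missing values (left commented out: LoanAmount was already filled with the mean above)
#df['LoanAmount'] = df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1))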
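
The idea behind the lookup table is to fill each missing `LoanAmount` with the median of its own (`Self_Employed`, `Education`) group. As an alternative to the row-wise lookup, the sketch below uses `groupby(...).transform('median')`, which is a different technique that should give the same fill values. The rows are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up rows; two of them lack a LoanAmount
toy = pd.DataFrame({
    'Self_Employed': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'Education':     ['Graduate'] * 6,
    'LoanAmount':    [100.0, 120.0, np.nan, 200.0, np.nan, 260.0],
})

# For every row, the median LoanAmount of that row's group (NaN values are skipped)
group_median = toy.groupby(['Self_Employed', 'Education'])['LoanAmount'].transform('median')

# Fill only the missing entries with their group's median
toy['LoanAmount'] = toy['LoanAmount'].fillna(group_median)
print(toy)
```

Because `transform` returns a Series aligned with the original index, it can be passed straight to `fillna` without a helper function.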

### 10. How to deal with 'Outliers' (extreme values) in distribution 

#### Example: 'LoanAmount', 'ApplicantIncome'

In [None]:
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)
plt.show()

In [None]:
df['LoanAmount'].hist(bins=20)
plt.show()

In [None]:
df.boxplot(column='LoanAmount')
plt.show()

In [None]:
df.boxplot(column='LoanAmount_log')
plt.show()
# => It's better now: there are far fewer outliers!

In [None]:
print(df.head(10))

In [None]:
df.boxplot(column='ApplicantIncome')
plt.show()

In [None]:
df.boxplot(column='CoapplicantIncome')
plt.show()

=> We can see a lot of outliers in the columns 'ApplicantIncome' and 'CoapplicantIncome'.

Now we apply a log transformation to their sum ('TotalIncome') and look at the resulting distribution.

In [None]:
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)
plt.show()

In [None]:
df.boxplot(column='TotalIncome')
plt.show()

In [None]:
df.boxplot(column='TotalIncome_log')
plt.show()

In [None]:
print(df.head(10))
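
A log transformation compresses large values much more than small ones, which is why the transformed columns show fewer extreme points. The sketch below measures this on made-up income values by comparing skewness before and after. It also shows `np.log1p` as one possible option for columns that contain zeros, where `np.log` would return `-inf`; the zero-containing series is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Made-up, strongly right-skewed income values
income = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 39000, 81000], dtype=float)
income_log = np.log(income)

skew_before = income.skew()
skew_after = income_log.skew()
print("skewness before:", skew_before)
print("skewness after: ", skew_after)

# A column with zeros: log1p(x) = log(1 + x) stays finite at x = 0
coapplicant = pd.Series([0.0, 1500.0, 0.0, 2000.0])
coapplicant_log = np.log1p(coapplicant)
print(coapplicant_log)
```

In the cells above the zero problem does not arise as long as every `TotalIncome` value is positive, since the log is taken of the sum of both incomes.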

## Note: 
### Apply the same data pre-processing techniques to the other columns/fields of the dataset.
### The dataset is then ready for use as input to train a machine learning model.
### Next lesson: Building a predictive model with Python (dataset: loan-prediction)

In [None]:
df.isnull().sum()

In [None]:
train = df

In [None]:
#Filling missing values in Train Dataset: mode for categorical columns, median for LoanAmount
#(assign the result back; inplace=True on a selected column is deprecated in recent pandas)
train['Gender'] = train['Gender'].fillna(train['Gender'].mode()[0])
train['Married'] = train['Married'].fillna(train['Married'].mode()[0])
train['Dependents'] = train['Dependents'].fillna(train['Dependents'].mode()[0])
train['Self_Employed'] = train['Self_Employed'].fillna(train['Self_Employed'].mode()[0])
train['Credit_History'] = train['Credit_History'].fillna(train['Credit_History'].mode()[0])
train['Loan_Amount_Term'] = train['Loan_Amount_Term'].fillna(train['Loan_Amount_Term'].mode()[0])
train['LoanAmount'] = train['LoanAmount'].fillna(train['LoanAmount'].median())

In [None]:
train.isnull().sum()

In [None]:
test = pd.read_csv("../data/test.csv")              

In [None]:
test.isnull().sum()

In [None]:
#Filling missing values in Test Dataset, using the fill values computed on the Train Dataset
test['Gender'] = test['Gender'].fillna(train['Gender'].mode()[0])
test['Dependents'] = test['Dependents'].fillna(train['Dependents'].mode()[0])
test['Self_Employed'] = test['Self_Employed'].fillna(train['Self_Employed'].mode()[0])
test['Credit_History'] = test['Credit_History'].fillna(train['Credit_History'].mode()[0])
test['Loan_Amount_Term'] = test['Loan_Amount_Term'].fillna(train['Loan_Amount_Term'].mode()[0])
test['LoanAmount'] = test['LoanAmount'].fillna(train['LoanAmount'].median())

In [None]:
test.isnull().sum()
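
The test set is filled with values computed on the training set, so that no information from the test data flows into the pre-processing. One way to keep the two steps consistent is to collect the fill values once and apply the same dictionary to both frames. The helper `learn_fill_values` below is hypothetical (it is not part of pandas or of this lesson's code), and the rows are made up:

```python
import numpy as np
import pandas as pd

def learn_fill_values(frame, mode_cols, median_cols):
    """Hypothetical helper: one fill value per column, computed from `frame`."""
    fills = {col: frame[col].mode()[0] for col in mode_cols}
    fills.update({col: frame[col].median() for col in median_cols})
    return fills

# Made-up train and test frames
toy_train = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', None],
                          'LoanAmount': [100.0, 120.0, np.nan, 200.0]})
toy_test = pd.DataFrame({'Gender': [None, 'Female'],
                         'LoanAmount': [np.nan, 90.0]})

# Learn the values on the training data only, then apply them to both frames
fills = learn_fill_values(toy_train, mode_cols=['Gender'], median_cols=['LoanAmount'])
toy_train = toy_train.fillna(value=fills)
toy_test = toy_test.fillna(value=fills)

print(fills)
print(toy_test)
```

Passing a dictionary to `fillna` fills each listed column with its own value and leaves other columns untouched.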

In [None]:
#Checking the outlier parts
train['LoanAmount_log'] = np.log(train['LoanAmount'])
train['LoanAmount_log'].hist(bins=20)
test['LoanAmount_log'] = np.log(test['LoanAmount'])

In [None]:
#Model Building


In [None]:
train=train.drop('Loan_ID',axis=1)
test=test.drop('Loan_ID',axis=1)

In [None]:
X = train.drop('Loan_Status', axis=1)  #axis must be passed by keyword in pandas 2.x
y = train.Loan_Status

In [None]:
X=pd.get_dummies(X)
train=pd.get_dummies(train)
test=pd.get_dummies(test)
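
`pd.get_dummies` is applied to the train and test frames separately, so a category that occurs in only one of them produces different columns in each. A possible safeguard, sketched on made-up frames, is to reindex the encoded test frame to the training columns:

```python
import pandas as pd

# Made-up frames: the test frame lacks the 'Rural' and 'Semiurban' categories
toy_train = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban'],
                          'ApplicantIncome': [5000, 3000, 4000]})
toy_test = pd.DataFrame({'Property_Area': ['Urban', 'Urban'],
                         'ApplicantIncome': [2500, 6000]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# Give the test frame exactly the training columns; missing dummies become 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

With the real data the target column would need to be excluded from the column list first, since the test file has no `Loan_Status`.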

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#Hold out 30% of the rows for validation
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3)
model = LogisticRegression()
model.fit(x_train, y_train)

In [None]:
pred_cv = model.predict(x_cv)
accuracy_score(y_cv,pred_cv)
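
The split above is random and has no fixed seed, so the accuracy changes from run to run. A sketch of two common ways to get a more stable estimate is a seeded, stratified split and k-fold cross-validation. It runs on synthetic data from `make_classification`, not on the loan dataset, and the parameter choices (`max_iter=1000`, `cv=5`) are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for X and y
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# stratify keeps the class proportions equal in both parts; random_state makes the split repeatable
x_tr, x_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(x_tr, y_tr)
holdout_acc = clf.score(x_va, y_va)   # accuracy on the held-out 30%

# 5-fold cross-validation: five fits, each validated on a different fifth of the data
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5)

print("hold-out accuracy:", holdout_acc)
print("cross-validation accuracies:", cv_scores, "mean:", cv_scores.mean())
```

The spread of `cv_scores` gives a rough idea of how much a single hold-out score can vary.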

In [None]:
train['Total_Income']=train['ApplicantIncome']+train['CoapplicantIncome']
test['Total_Income']=test['ApplicantIncome']+test['CoapplicantIncome']
train['Total_Income_log'] = np.log(train['Total_Income'])
test['Total_Income_log'] = np.log(test['Total_Income'])
train['EMI']=train['LoanAmount']/train['Loan_Amount_Term']
test['EMI']=test['LoanAmount']/test['Loan_Amount_Term']
train['Balance Income']=train['Total_Income']-(train['EMI']*1000)
test['Balance Income']=test['Total_Income']-(test['EMI']*1000)

In [None]:
#Dropping the old Variables
train=train.drop(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term'], axis=1)
test=test.drop(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term'], axis=1)

In [None]:
#X = train.drop('Loan_Status',1)
#y = train.Loan_Status 

In [None]:
#Logistic Regression
#Note: the model is scored on the same rows it was fitted on, so this is the training accuracy
model = LogisticRegression(random_state=1)
model.fit(X, y)
pred_test = model.predict(X)
score = accuracy_score(y,pred_test)
print('accuracy_score',score)

In [None]:
#Decision Tree
#Note: again scored on the training rows; an unpruned tree can fit them almost perfectly
from sklearn import tree
model = tree.DecisionTreeClassifier(random_state=1)
model.fit(X,y)
pred_test = model.predict(X)
score = accuracy_score(y,pred_test)
print('accuracy_score',score)
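
Both models above are evaluated on the same `X` they were fitted on, which tends to overstate how well they generalise, especially for a decision tree. The sketch below illustrates the gap on synthetic, deliberately noisy data (`flip_y=0.2` is an assumption chosen to make the effect visible), not on the loan dataset:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly assigned labels
X_toy, y_toy = make_classification(n_samples=300, n_features=8, flip_y=0.2, random_state=0)
x_tr, x_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

tree_model = DecisionTreeClassifier(random_state=1)
tree_model.fit(x_tr, y_tr)

train_acc = accuracy_score(y_tr, tree_model.predict(x_tr))     # rows seen during fitting
holdout_acc = accuracy_score(y_va, tree_model.predict(x_va))   # rows never seen

print("training accuracy:", train_acc)
print("hold-out accuracy:", holdout_acc)
```

A large difference between the two numbers is a sign of overfitting; limiting the tree with parameters such as `max_depth` is one common way to reduce it.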