# **Mortgage Bank Loan E2E Analysis**

Banks can now analyze mortgage loans using data science techniques, including ARIMA and machine learning. Such a system can provide detailed information on mortgage loans and mortgage loan markets. It is a powerful tool for mortgage brokers seeking counterparties and generating trading interest, and it is useful for CFOs conducting what-if scenarios on their balance sheets.

The loan file template requires the following details:
- Loan ID: to identify the specific loan
- Loan Type: to indicate whether the loan is fixed rate, balloon, ARM, or AMP (alternative mortgage product)
- Balance: 
- Loan program type: to indicate conforming loan, FHA/VA loan, Jumbo loan or sub-prime loan 
- Current coupon rate: 
- Amortization type: the original amortization term 
- Maturity: the remaining term of the loan
- FICO Score: the updated FICO score
- LTV: the current loan-to-value ratio
- Loan Size: the loan amount
- Loan origination location (City & Zip) 
- Unit Types (Types of property)
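
Before any analysis, it can help to check that a loan file actually carries the fields listed above. The following is a minimal sketch; the column names in `REQUIRED_COLUMNS` are hypothetical labels for the template fields and may not match the headers of the real file.

```python
import pandas as pd

# Hypothetical column names for the template fields; the real file may use different headers.
REQUIRED_COLUMNS = [
    "Loan ID", "Loan Type", "Balance", "Loan Program Type", "Current Coupon Rate",
    "Amortization Type", "Maturity", "FICO Score", "LTV", "Loan Size",
    "City", "Zip", "Unit Type",
]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from df, in template order."""
    return [col for col in required if col not in df.columns]

# A tiny made-up loan file with only three of the required fields
sample = pd.DataFrame({"Loan ID": [1], "Loan Type": ["ARM"], "Balance": [250000.0]})
print(missing_columns(sample))
```

An empty list means the file has every field the template asks for.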

In [None]:
import os                             # Filesystem utilities
import datetime

import numpy as np                    # General-purpose math library
import pandas as pd                   # General-purpose data handling library
import matplotlib.pyplot as plt       # Standard plotting library
import seaborn as sns                 # Statistical plotting built on matplotlib
import scipy.stats as stats
from scipy.stats import kurtosis, skew
from IPython.display import display   # Notebook helper to display richer objects (like tables)
In [None]:
data = pd.read_csv('../input/mortgage-bank-loan/mwb2014.csv', header=0, encoding='cp1252')
data.info()

In [None]:
data.shape

In [None]:
# Print the 5th and 95th percentiles of the numeric columns
data.quantile([0.05, 0.95], numeric_only=True)


In [None]:
data['Loan Amount'].quantile([0.05, 0.95])

In [None]:
data['Loan Amount'].quantile([0.1, 0.90])

In [None]:
St=['CT', 'FL', 'NJ', 'NY', 'PA']
print(St)

In [None]:
#DataFrame subsets by State
ct_data = data[data['State'].isin(['CT'])]
fl_data = data[data['State'].isin(['FL'])]
ny_data = data[data['State'].isin(['NY'])]
nj_data = data[data['State'].isin(['NJ'])]
pa_data = data[data['State'].isin(['PA'])]

In [None]:
# Select a subset of columns per state (reindex fills any column missing from the frame with NaN)
filtered_columns = ['Created Date', 'First Name', 'Last Name', 'Loan Amount', 'City', 'unit_type_code', 'Loan Type']
ct_lo_amount = ct_data.reindex(columns=filtered_columns)
fl_lo_amount = fl_data.reindex(columns=filtered_columns)
nj_lo_amount = nj_data.reindex(columns=filtered_columns)
ny_lo_amount = ny_data.reindex(columns=filtered_columns)
pa_lo_amount = pa_data.reindex(columns=filtered_columns)

In [None]:
# Violin plot showing the distribution, quartiles and outliers of Loan Amount
plt.figure(figsize=(10,5))
sns.violinplot(y=data['Loan Amount'], inner="quartile")
plt.title("Violin plot showing the distribution of Loan Amount")
plt.ylim(0.1, 1000000)
plt.ylabel('Loan Amount ($)')
plt.show()

We can see that the middle 50% of loan amounts fall between USD 340,000 and USD 570,000.
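
The middle 50% read off the plot is the interquartile range, which can also be computed directly. The sketch below uses a handful of made-up loan amounts rather than the dataset, so the numbers it prints are illustrative only.

```python
import numpy as np

# Made-up loan amounts for illustration only; not taken from the dataset.
loan_amounts = np.array([250_000, 340_000, 410_000, 480_000, 570_000, 900_000])

# First and third quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(loan_amounts, [25, 75])
iqr = q3 - q1  # width of the middle 50% of the data

print(f"Q1 = {q1:,.0f}, Q3 = {q3:,.0f}, IQR = {iqr:,.0f}")
```

On the real data, the same call would be applied to `data['Loan Amount']`.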

# Mortgage interest rates

Mortgage interest rates have a significant impact on the number of mortgage applications. When interest rates are low, mortgages are relatively cheap for borrowers because they pay less interest, which leads to more mortgage applications. A high mortgage interest rate means the borrower pays more interest to the lender, which makes the mortgage less attractive. Interest rate changes have a significant impact on mortgage applications, as was seen in November of last year, when a sudden increase in interest rates led to a large peak in mortgage applications. The main difference between the mortgages offered by different lenders lies in their mortgage interest rates. Because a mortgage is such a large sum, even a small difference in the interest rate can save or cost the borrower a substantial amount of money.
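
To make the last point concrete, the sketch below applies the standard annuity formula for a fully amortizing fixed-rate loan, payment = P * r / (1 - (1 + r)^(-n)), where P is the principal, r the monthly rate, and n the number of monthly payments. The loan size, term, and rates are illustrative assumptions, not values from the dataset.

```python
def monthly_payment(principal, annual_rate, years):
    """Fixed monthly payment for a fully amortizing loan (standard annuity formula)."""
    n = years * 12          # number of monthly payments
    r = annual_rate / 12    # monthly interest rate
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)

# Illustrative numbers only: a 30-year loan of USD 400,000 at two nearby rates
principal, years = 400_000, 30
low = monthly_payment(principal, 0.0400, years)
high = monthly_payment(principal, 0.0425, years)

# Extra amount paid over the life of the loan for a 0.25 percentage point higher rate
extra_total = (high - low) * years * 12

print(f"Payment at 4.00%: {low:,.2f}")
print(f"Payment at 4.25%: {high:,.2f}")
print(f"Extra paid over {years} years: {extra_total:,.2f}")
```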

In general, there are two types of mortgage interest rates: variable rates (ARM) and fixed rates. Variable interest rates are generally lower than fixed interest rates, but can change every month. Fixed interest rates are slightly higher, but are fixed for a certain period of time. A fixed interest rate is generally preferred when mortgage interest rates are expected to rise, or when the borrower wants to know their monthly expenses upfront. A variable interest rate (ARM) is preferred when interest rates are expected to decrease. If a financial institution has a significantly higher interest rate than its competitors, it will generally receive fewer mortgage applications, as independent mortgage advisors will refer their customers to a different mortgage lender.

Financial institutions sometimes increase their interest rates during the summer months and at the end of the year, when fewer staff are available to handle requests because of vacations and holidays. With fewer staff they can handle fewer mortgage requests, so to keep the processing time the same they reduce the inflow by increasing interest rates. Financial institutions may also deliberately keep interest rates low for mortgages with a certain fixed interest period. Interest rate changes are therefore not always driven directly by changes in the cost of lending, but can have numerous causes.


## Load US 10-Years Treasury Data

In [None]:
US10Y = pd.read_csv('../input/mortgage-bank-loan/US10Y.csv', header=0, index_col='DATE', encoding='cp1252')
US10Y.head()

In [None]:
US10Y.RATE.value_counts()

In [None]:
US10Y.replace(".", value=np.nan, inplace=True)
US10Y= US10Y.replace(to_replace=-1, value=np.nan)
US10Y.RATE.value_counts()

In [None]:
US10Y = US10Y.ffill()                        # Forward-fill missing rates with the last known value
US10Y = US10Y[['RATE']].astype('float64')
US10Y.info()

In [None]:
US10Y.describe()

In [None]:
US10Y.info()

In [None]:
type(US10Y)

In [None]:
total_missing_rate=US10Y.isnull().sum() #Checking for missing values
total_missing_rate

In [None]:
US10Y.index = pd.to_datetime(US10Y.index)
monthly_rate_data = US10Y.resample('M').mean()   # Monthly average rate
ax = monthly_rate_data.plot(title="Monthly Average 10-Year US Treasury Rate", figsize=(20,5))
plt.ylabel('RATE')
ax.get_legend().remove()
plt.grid()

# Display Monthly US Treasury 10 Years Interest Rate

Generally, when the US 10-Year Treasury rate fluctuates, lenders adjust their internal rates accordingly. The interest rate offered to a consumer also varies with several risk factors, such as DTI (debt-to-income ratio), FICO score, recent derogatory events in their credit history, stable job history, W-2 or 1099 income, stated income, profit and loss statements, student loans, auto payments, credit utilization, property type, number of households, rental history, etc.

Interest rates are currently at a historic low. In the short run rates may go up and down, but in the long run they will likely go up. As housing prices rise, interest rates tend to rise as well to keep housing prices in check.

In [None]:
loan_patterns = data[['Loan Amount', 'Created Date']]
loan_patterns.head()

In [None]:
# Histogram of the raw loan amounts with 40 bins
loan_patterns['Loan Amount'].hist(alpha=0.6, bins=40, grid=True, figsize=(20,5))
plt.xlabel("Loan Amount ($)")
plt.ylabel("Loan frequency")
plt.title("Histogram of Loan Amount for Mortgage Bank Loan")
plt.show()

# Square-root transform the loan amounts (a common way to reduce right skew)
loan_patterns_plot = loan_patterns['Loan Amount'].apply(np.sqrt)

# Fit a normal distribution to the transformed data: param = (mean, standard deviation)
param = stats.norm.fit(loan_patterns_plot)
x = np.linspace(loan_patterns_plot.min(), loan_patterns_plot.max(), 500)   # Grid over the data range
pdf_fitted = stats.norm.pdf(x, *param)    # Evaluate the fitted density on the grid

# Plot the density-normalised histogram, then the fitted distribution over the top
ax = loan_patterns_plot.plot.hist(alpha=0.6, bins=40, grid=True, density=True, figsize=(20,5))
ax.plot(x, pdf_fitted, color='r')

# Show the parameters of the fitted distribution (mean and standard deviation)
ax.text(0.02, 0.85, r"$\mu=%0.2f$" % param[0] + "\n" + r"$\sigma=%0.2f$" % param[1],
        color='b', transform=ax.transAxes)

plt.xticks(rotation=75)
plt.xlabel("sqrt(Loan Amount)")
plt.ylabel("Density")
plt.title("Histogram with fitted normal distribution for Mortgage Bank Loan")
plt.show()


In [None]:
print('excess kurtosis of sqrt(Loan Amount) (0 for a normal distribution): {}'.format(round(kurtosis(loan_patterns_plot), 2)))
print('skewness of sqrt(Loan Amount) (0 for a normal distribution):        {}'.format(round(skew(loan_patterns_plot), 2)))

In [None]:
print("mean : ", round(np.mean(loan_patterns_plot),2))
print("var  : ", round(np.var(loan_patterns_plot),2))
print("skew : ",round(skew(loan_patterns_plot),2))
print("kurt : ",round(kurtosis(loan_patterns_plot),2))

Next, we compare monthly revenue, the monthly number of closed loans, and the number of active mortgage loan originators (MLOs). We count the MLOs who actively closed loans in each month.

In [None]:
mlo_num = data[['Loan Officer Name']].copy()        # Copy so that adding a column does not touch data
mlo_num['date'] = pd.DatetimeIndex(data['Created Date'])
mlo_num = mlo_num.set_index('date')
monthly_mlo_num = mlo_num.resample('M').nunique()   # Distinct active MLOs per month

monthly_loan_num = data[['LoanInMonth']].copy()
monthly_loan_num.head()

In [None]:
monthly_loan_num['date'] = pd.DatetimeIndex(data['Created Date'])
monthly_loan_num = monthly_loan_num.set_index('date')
monthly_loan_num_data = monthly_loan_num.resample('M').last()   # Last LoanInMonth value in each month
monthly_loan_num_data.plot(title="Mortgage Bank - Closed Loans by Month",
                           legend=None, grid=True, figsize=(20,5))
plt.ylabel("Loans per Month")
plt.show()

Summer appears to be the high sales season for the mortgage bank. The number of loans closed per month varies between 20 and 80.

In [None]:
loan_rev_data = data[['Loan Amount']].copy()        # Copy so that adding a column does not touch data
loan_rev_data['date'] = pd.DatetimeIndex(data['Created Date'])
loan_rev_data = loan_rev_data.set_index('date')
monthly_loan_rev_data = loan_rev_data.resample('M').sum()   # Total loan amount per month
monthly_loan_rev_data.plot(title="Mortgage Bank - Total Sales by Month",
                           legend=None, grid=True, figsize=(20,5))
plt.ylabel("Loan Amount")
plt.show()


Mortgage bank monthly sales since October 2014. Sales vary between 12M and 33M per month.

In [None]:
monthly_loan_num_data = monthly_loan_num.resample('M').last()
plt.figure(figsize=(20,5))
plt.xlabel('Loan origination month')
plt.xticks(rotation=60)
plt.ylabel('Scaled monthly values')

# Revenue and MLO counts are rescaled so that all three series fit on one axis
plt.plot(monthly_loan_num_data, color='red', label='Monthly closed loans')
plt.plot(loan_rev_data.resample('M').sum() / 400000, color='green', label='Monthly Loan Revenue / 400000')
plt.plot(monthly_mlo_num * 4, color='blue', label='Monthly Active MLO * 4')
plt.legend()
plt.title('Comparing Monthly: Revenue, Closed_Loan_Num, Active_MLO_Num')
plt.grid()

The number of producers is an essential component of any business, and the MLO (mortgage loan originator) is the core component of the mortgage business. Many MLOs work independently, interact directly with clients, and handle their own marketing to grow their business. A mortgage bank may have many MLOs, but the active MLOs generate more revenue for the bank. As the number of active MLOs goes up, the number of loans closed per month rises with it, and mortgage revenue eventually goes up. Conversely, when the number of active MLOs goes down, mortgage revenue and the number of loans per month go down as well. The graph shows that the monthly number of closed loans, monthly revenue, and the monthly number of active MLOs all move in the same direction.

Let's find out the effect of interest rates on monthly closed loans and monthly revenue.

In [None]:
US10Y.index = pd.to_datetime(US10Y.index)
monthly_rate_data = US10Y.resample('M').mean()   # Monthly average rate
monthly_rate_data.plot(title="Monthly Average 10-Year US Treasury Rate", figsize=(20,5),
                       legend=None, grid=True)
plt.ylabel('RATE')
plt.show()

In [None]:
monthly_rate_data=US10Y.resample('M').mean()
from matplotlib.lines import Line2D
plt.figure(figsize=(20,5))
colors = ['red', 'green', 'blue']
lines = [Line2D([0], [0], color=c, linewidth=3, linestyle='--') for c in colors]
labels = ['Monthly closed loans', 'Monthly Loan Revenue / 400000', '10 Years Interest Rate * 20']

plt.plot(monthly_loan_num_data, color='red')
plt.plot(loan_rev_data.resample('M').sum()/400000, color='green')
plt.plot(monthly_rate_data*20, color='blue')
plt.legend(lines, labels)
plt.title('Comparing Monthly: Revenue, Closed_Loan_Num, VS US 10 Year Treasury Rates')
plt.grid()

We can see a strong positive correlation between monthly closed loans and monthly revenue. The graph also suggests that as interest rates go down, the bank's monthly revenue and number of loans increase, and when rates go up, both monthly closed loans and monthly revenue decline. The Pearson correlation coefficient between the monthly interest rate and monthly closed loans is -0.334, which indicates a weak to moderate negative correlation. This is consistent with the pattern in the graph, but it does not by itself prove a causal link.

In [None]:
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute the 2x2 correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)

    # Return entry [0,1], the correlation between x and y
    return corr_mat[0, 1]

In [None]:
%matplotlib inline

import scipy
import scipy.stats

fico = data['Qualification FICO']
loan_amount = data['Loan Amount']
cltv_data = data['CLTV']

# pearsonr returns the correlation coefficient and its two-sided p-value
r_loan_amount_fico = scipy.stats.pearsonr(loan_amount, fico)
# Print the result
print('Pearson correlation coefficient between FICO and Loan_Amount: ', r_loan_amount_fico)


In [None]:
r_cltv_data_fico = scipy.stats.pearsonr(cltv_data, fico)
print('Pearson correlation coefficient between FICO and CLTV: ', r_cltv_data_fico)

In [None]:
r_loan_amount_cltv_data = scipy.stats.pearsonr(loan_amount, cltv_data)
print('Pearson correlation coefficient between Loan Amount and CLTV: ', r_loan_amount_cltv_data)

In [None]:
# np.float was removed from NumPy; the built-in float gives the same float64 dtype
monthly_loan_num = np.array(monthly_loan_num_data, dtype=float)
monthly_loan_num = monthly_loan_num.flatten()
monthly_loan_rev = np.array(loan_rev_data.resample('M').sum(), dtype=float)
monthly_loan_rev = monthly_loan_rev.flatten()

In [None]:
r_monthly_loan_num_data_monthly_loan_rev = scipy.stats.pearsonr(monthly_loan_num,monthly_loan_rev)
print('Pearson correlation coefficient between Loan number and Loan Revenue: ', r_monthly_loan_num_data_monthly_loan_rev)

To quantify the uncertainty in these estimates, we can draw 1,000 pairs-bootstrap replicates of the Pearson correlation coefficient, using a helper such as draw_bs_pairs() that resamples the (x, y) pairs with replacement and recomputes the coefficient. We do this for CLTV vs. Qualification FICO and for monthly loan count vs. monthly loan revenue, and compute the 95% confidence interval for both from the bootstrap replicates. The percentiles to compute are the 2.5th and 97.5th, which can be passed as a NumPy array, for example np.array([2.5, 97.5]).
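
The notebook does not define draw_bs_pairs(), so the following is a minimal sketch of what such a helper could look like, run on synthetic data rather than on the CLTV and FICO columns. The function signature, the seed arguments, and the synthetic arrays are assumptions made for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two arrays."""
    return np.corrcoef(x, y)[0, 1]

def draw_bs_pairs(x, y, func, size=1, seed=None):
    """Draw `size` pairs-bootstrap replicates of func(x, y)."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    inds = np.arange(len(x))
    replicates = np.empty(size)
    for i in range(size):
        # Resample indices with replacement so that each (x, y) pair stays together
        bs_inds = rng.choice(inds, size=len(inds))
        replicates[i] = func(x[bs_inds], y[bs_inds])
    return replicates

# Synthetic, positively correlated data standing in for the real columns
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

reps = draw_bs_pairs(x, y, pearson_r, size=1000, seed=1)
ci_low, ci_high = np.percentile(reps, np.array([2.5, 97.5]))
print(f"Observed r = {pearson_r(x, y):.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With the real data, `x` and `y` would be replaced by, for example, `cltv_data` and `fico`.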

In [None]:
pearson_r0=scipy.stats.pearsonr(cltv_data, fico)
pearson_r1=scipy.stats.pearsonr(loan_amount, fico)
# Print results
print('CLTV data VS Qualification FICO Data       :', pearson_r0)

In [None]:
print('Loan Amount VS Qualification FICO Data     :', pearson_r1)

# Random Walk

Are Interest Rates or Monthly Loan Revenues a Random Walk?

Many financial price series follow a random walk (perhaps with a drift). We will look at the time series of monthly sales revenue and run the Augmented Dickey-Fuller (ADF) test from the statsmodels library to check whether it follows a random walk. With the ADF test, the null hypothesis (the hypothesis that we either reject or fail to reject) is that the series follows a random walk. Therefore, a low p-value (say, less than 5%) means we can reject the null hypothesis that the series is a random walk. In the output, adfuller_loan_rev_data[0] is the test statistic and adfuller_loan_rev_data[1] is the p-value. We print the entire output, which includes the test statistic, the p-value, and the critical values at the 1%, 5%, and 10% levels, and then just the p-value.
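
To build intuition for what the test measures, the sketch below implements the simplest Dickey-Fuller regression by hand with NumPy: it regresses the change in the series on a constant and the lagged level, and reports the t-statistic of the lag coefficient. This is a simplified illustration, not what statsmodels does internally; adfuller also adds lagged differences and computes proper p-values. The threshold of about -2.86 is the usual large-sample 5% critical value for the version with a constant, and the two series are simulated.

```python
import numpy as np

def dickey_fuller_t(y):
    """t-statistic of rho in the regression diff(y)_t = alpha + rho * y_(t-1) + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # Constant and lagged level
    beta, _, _, _ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    dof = len(dy) - X.shape[1]
    sigma2 = resid @ resid / dof                      # Residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance of the coefficients
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(42)
noise = rng.normal(size=500)                   # Stationary white noise
random_walk = np.cumsum(rng.normal(size=500))  # Simulated random walk

t_noise = dickey_fuller_t(noise)
t_walk = dickey_fuller_t(random_walk)
print(f"t-statistic, white noise: {t_noise:.2f}")
print(f"t-statistic, random walk: {t_walk:.2f}")
```

A t-statistic far below the critical value, as for the white noise series, leads to rejecting the random walk hypothesis; a value near zero does not.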

In [None]:
# Import the adfuller function from statsmodels
from statsmodels.tsa.stattools import adfuller
loan_rev_data = data[['Loan Amount']].copy()   # Copy so that adding a column does not touch data
loan_rev_data['date'] = pd.DatetimeIndex(data['Created Date'])
loan_rev_data = loan_rev_data.set_index('date')
monthly_loan_rev_data = loan_rev_data.resample('M').sum()
monthly_loan_rev_data[:5]

In [None]:
# Run the ADF test on the monthly_loan_rev_data series and print out the results
adfuller_loan_rev_data = adfuller(monthly_loan_rev_data['Loan Amount'], autolag='AIC')
print(adfuller_loan_rev_data)

In [None]:
# Just print out the p-value
print('The p-value of the test on loan_rev is: ' + str(adfuller_loan_rev_data[1]))


In [None]:
print('Print in different format')
print('Results of Dickey-Fuller Test:')
dfoutput = pd.Series(adfuller_loan_rev_data[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in adfuller_loan_rev_data[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)

According to this test, the p-value is very low (below 0.05), so we reject the hypothesis that monthly_loan_rev_data follows a random walk.

In [None]:
monthly_rate=US10Y.resample('M').mean()
monthly_rate_data=monthly_rate['RATE']
monthly_rate_data[:5]

In [None]:
# Run the ADF test on the monthly_rate_data series and print out the results
adfuller_monthly_rate_data = adfuller(monthly_rate_data)
print(adfuller_monthly_rate_data)

In [None]:
# Just print out the p-value
print('The p-value of the test on monthly_rate_data is: ' + str(adfuller_monthly_rate_data[1]))

In [None]:
print('Results of Dickey-Fuller Test:')
dfoutput = pd.Series(adfuller_monthly_rate_data[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in adfuller_monthly_rate_data[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)

# Are Interest Rates Autocorrelated?

When we look at daily changes in interest rates, the autocorrelation is close to zero. However, if we resample the data and look at annual changes, the autocorrelation is negative. This implies that while short term changes in interest rates may be uncorrelated, long term changes in interest rates are negatively autocorrelated. A daily move up or down in interest rates is unlikely to tell us anything about interest rates tomorrow, but a move in interest rates over a year can tell us something about where interest rates are going over the next year. And this makes some economic sense: over long horizons, when interest rates go up, the economy tends to slow down, which consequently causes interest rates to fall, and vice versa.

One of the really useful things pandas allows us to do is resample the data. If we want to look at the data monthly or annually, we can easily resample it. Here 'M' is the resampling period for month-end boundaries and 'A' is the period for annual data, and we keep the last observation in each period. Finally, we find the autocorrelation of annual interest rate changes.
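
The idea can be sketched on a synthetic daily series before applying it to the Treasury data. To stay independent of the resampling aliases, which differ between pandas versions, this sketch groups by calendar period with to_period(); for a series with no empty months this picks the same month-end values as resampling and taking the last observation. The series and the helper name autocorr_of_changes are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily rate series (a random walk around 2%); not the US10Y data.
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", "2018-12-31", freq="D")
rate = pd.Series(2.0 + np.cumsum(rng.normal(scale=0.02, size=len(dates))), index=dates)

def autocorr_of_changes(series, period):
    """Lag-1 autocorrelation of period-over-period changes in the last value per period."""
    last_per_period = series.groupby(series.index.to_period(period)).last()
    return last_per_period.diff().dropna().autocorr()

daily_ac = rate.diff().dropna().autocorr()   # Autocorrelation of daily changes
monthly_ac = autocorr_of_changes(rate, "M")  # Autocorrelation of monthly changes
print(f"Daily: {daily_ac:.2f}, monthly: {monthly_ac:.2f}")
```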

In [None]:
US10Y['change_rates'] = US10Y['RATE'].diff()   # Daily change in the rate; the first value is NaN
US10Y.describe()

In [None]:
# Compute and print the autocorrelation of daily changes
autocorrelation_daily = US10Y['change_rates'].autocorr()
print("The autocorrelation of daily interest rate changes is %4.2f" %(autocorrelation_daily))

In [None]:
US10Y.index = pd.to_datetime(US10Y.index)
# Month-end rate, kept as a DataFrame so that a change column can be added
monthly_rate_data = US10Y['RATE'].resample(rule='M').last().to_frame()
monthly_rate_data['diff_rates'] = monthly_rate_data['RATE'].diff()
monthly_rate_data['diff_rates'][:5]

In [None]:
autocorrelation_monthly = monthly_rate_data['diff_rates'].autocorr()
print("The autocorrelation of monthly interest rate changes is %4.2f" %(autocorrelation_monthly))

In [None]:
US10Y.index = pd.to_datetime(US10Y.index)
# Repeat the above for annual data: year-end rate, kept as a DataFrame
annual_rate_data = US10Y['RATE'].resample(rule='A').last().to_frame()
annual_rate_data['diff_rates'] = annual_rate_data['RATE'].diff()
annual_rate_data['diff_rates']

In [None]:
autocorrelation_annual = annual_rate_data['diff_rates'].autocorr()
print("The autocorrelation of annual interest rate changes is %4.2f" %(autocorrelation_annual))

Daily and monthly autocorrelations are small, but the annual autocorrelation is large and negative.

Visual exploration is an effective way to extract information about the relationships between variables.

We can use the seaborn package to draw a bar plot of the frequency distribution of a categorical feature, starting with the Loan Type column of the mortgage dataset.

In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,5))
loan_type_count = data['Loan Type'].value_counts()
sns.set(style="darkgrid")
ax=sns.barplot(x=loan_type_count.index,y= loan_type_count.values, alpha=0.9)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.title('Frequency Distribution of Loan Types')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Loan Types', fontsize=12)
plt.xticks(rotation=45)
plt.show()

Conventional loans are the Bank's top market; FHA loans are second.

In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
unit_type_count = data['Unit Type'].value_counts()
sns.set(style="darkgrid")
plt.figure(figsize=(20,5))
ax=sns.barplot(x=unit_type_count.index, y=unit_type_count.values, alpha=0.9)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.title('Frequency Distribution of Unit Types')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Unit Types', fontsize=12)
plt.xticks(rotation=45)
plt.show()

One Family, Two Family and Condos are the top unit types for the Bank.

# Data Collection

Since we are only interested in the event log data, we use only one of the tables. This table contains data about every mortgage application. Every action performed by the system or by a user on a mortgage application is logged, together with the status before and after that action. For our analysis we are mainly interested in the date and time at which each mortgage application entered the system. In addition to the mortgage application dataset, we have joined two separate datasets (the 10-Year US Treasury rate and the Home Supply Index) to it to enhance the predictive power of our model.

# Encoding Categorical Data

There are different techniques to encode the categorical features to numeric quantities.

The techniques are as follows:
- Replacing values
- Encoding labels
- One-Hot encoding
- Binary encoding
- Backward difference encoding
- Miscellaneous features

# Replace Values

Let's start with the most basic method, which is simply replacing the categories with the desired numbers. This can be achieved with the replace() function in pandas. The idea is that we are free to choose whatever numbers we want to assign to the categories according to the business use case.
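
As a small illustration of the replace() approach, the sketch below maps loan type names to numbers on a made-up frame. The category names and the codes in loan_type_map are hypothetical; in practice they would be chosen to suit the business use case.

```python
import pandas as pd

# Made-up data; the text column is stored as plain Python strings (object dtype).
loans = pd.DataFrame({
    "Loan Type": pd.Series(["Conventional", "FHA", "Conventional", "VA"], dtype=object)
})

# Hypothetical mapping from category to the number we choose to assign
loan_type_map = {"Conventional": 1, "FHA": 2, "VA": 3}

# Replace each category with its assigned number in a new column
loans["loan_type_code"] = loans["Loan Type"].replace(loan_type_map)
print(loans)
```

Series.map(loan_type_map) is a common alternative; unlike replace(), it turns categories missing from the mapping into NaN instead of leaving them unchanged.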

It's good practice to typecast categorical features to the category dtype, because operations on such columns are much faster than on the object dtype. You can do the typecasting with the .astype() method on your columns, as shown below:

In [None]:
data = pd.read_csv('../input/mortgage-bank-loan/mwb2014.csv', index_col='Created Date', header=0, encoding='cp1252')
data_lc = data.copy()
data_lc['City'] = data_lc['City'].astype('category')
data_lc['Zip'] = data_lc['Zip'].astype('category')
data_lc['Loan Type'] = data_lc['Loan Type'].astype('category')
data_lc['Unit Type'] = data_lc['Unit Type'].astype('category')
data_lc['loan_purpose_code'] = data_lc['Loan Purpose'].astype('category')
data_lc.dtypes

In [None]:
data_lc['lo_code'] = data_lc['Loan Officer Name'].astype('category')
lo_code =data_lc['lo_code']
data_lc.info()

# Label Encoding

We can achieve the label encoding using scikit-learn's LabelEncoder:

In [None]:
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()
data_lc['loan_purpose_code'] = lb_make.fit_transform(data_lc['Loan Purpose'])
data_lc['loan_type_code'] = lb_make.fit_transform(data_lc['Loan Type'])
data_lc['unit_type_code'] = lb_make.fit_transform(data_lc['Unit Type'])
data_lc.head() #Results in appending a new column to df

In [None]:
data_lc.info()

Another approach is to encode categorical values with a technique called "label encoding", which converts each value in a column to a number. Numerical labels are always between 0 and n_categories-1.

We can also do label encoding via the .cat.codes attribute of a DataFrame column with category dtype.

In [None]:
data_lc['lo_code'] = data_lc['Loan Officer Name'].astype('category')
lo_code =data_lc['lo_code']
data_lc['city_code'] = data_lc['City'].cat.codes
data_lc['zip_code'] = data_lc['Zip'].cat.codes
lo_code = data_lc['lo_code'].cat.codes
lo_code[:5]

# One-Hot encoding

The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly.

In [None]:
# One-hot encode 'Fix' into the indicator columns Fix_True and Fix_False
data_lc = pd.get_dummies(data_lc, columns=['Fix'], prefix=['Fix'])
data_lc['Fix_True'].head()

In [None]:
data_lc['Fix_True'].value_counts()

In [None]:
data_lc[['Unit Type', 'unit_type_code']].head()

In [None]:
data_lc['Unit Type'].value_counts()

In [None]:
data_lc['unit_type_code'].value_counts()

In [None]:
data_lc[['Loan Type', 'loan_type_code']].head()

In [None]:
data_lc['Loan Type'].value_counts()

In [None]:
data_lc['loan_type_code'].value_counts()

In [None]:
data_lc[['Loan Purpose', 'loan_purpose_code']].head()

In [None]:
data_lc['loan_purpose_code'].value_counts()

Forward Fill Missing Data

In [None]:
# Forward-fill missing values in the loan data and in the Treasury rates
data_lc = data_lc.ffill()
total_missing_data_lc = data_lc.isnull().sum()
US10Y = US10Y.ffill()
data_lc.info()

For machine learning modeling, we need to create a new DataFrame.

In [None]:
model_data1 = data_lc[['Loan Amount', 'city_code', 'zip_code', 
                       'loan_purpose_code', 'Qualification FICO', 'unit_type_code', 
                       'loan_type_code', 'Fix_True', 'CLTV', 'LoanInMonth']]
model_data1.columns

In [None]:
model_data1.tail()

In [None]:
total_missing_mode_data1=model_data1.isnull().sum()
total_missing_mode_data1

In [None]:
model_data1.info()

In [None]:
print('model_data1 Keys: \n',model_data1.keys())

In [None]:
print('model_data1 shape: ',model_data1.shape)

In [None]:
plt.figure(figsize=(20,5))
loan_amt=model_data1[['Loan Amount']]
pd.plotting.autocorrelation_plot(loan_amt[::30])
plt.grid()

# Joining DataFrame

Join the two DataFrames model_data1 and US10Y and save the result in model_data2.

In [None]:
US10Y = pd.read_csv('../input/mortgage-bank-loan/US10Y.csv', header=0, index_col='DATE', encoding='cp1252')
# Missing observations are recorded as '.'; convert them to NaN
US10Y = US10Y.replace('.', np.nan)
US10Y.tail()

In [None]:
US10Y = US10Y.ffill()                        # Forward-fill missing rates with the last known value
US10Y = US10Y[['RATE']].astype('float64')
total_missing_rate = US10Y.isnull().sum()
total_missing_rate

In [None]:
model_data2 = model_data1.join(US10Y)        # Join on the date index
total_missing = model_data2.isnull().sum()
model_data2 = model_data2.ffill()            # Fill dates without a rate with the last known rate
total_missing_model_data2 = model_data2.isnull().sum()
total_missing_model_data2

# Home Supply

US home supply directly impacts the housing market and the mortgage market. We collected monthly housing supply index data from FRED and merge it with the current dataset.
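
Because the supply index is monthly while loans are dated by day, each loan needs the most recent monthly value on or before its creation date. One way to express that is pandas.merge_asof, sketched below on made-up data; this is an alternative illustration, not the join used in this notebook, and the column name home_supply and all values are assumptions.

```python
import pandas as pd

# Made-up loans dated by day
loans = pd.DataFrame({
    "Created Date": pd.to_datetime(["2015-01-15", "2015-02-03", "2015-02-20"]),
    "Loan Amount": [350_000, 420_000, 510_000],
})

# Made-up monthly index; 'home_supply' is a hypothetical column name
home_supply = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-01-01", "2015-02-01"]),
    "home_supply": [5.3, 4.6],
})

# For each loan, take the latest monthly value dated on or before the loan date.
# Both frames must be sorted by their date keys.
merged = pd.merge_asof(
    loans.sort_values("Created Date"),
    home_supply.sort_values("DATE"),
    left_on="Created Date",
    right_on="DATE",
    direction="backward",
)
print(merged[["Created Date", "Loan Amount", "home_supply"]])
```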

In [None]:
home_supply = pd.read_csv('../input/mortgage-bank-loan/MonHouseSupply.csv', header=0, index_col='DATE', encoding='cp1252')
home_supply.info()

In [None]:
total_missing_home_supply = home_supply.isnull().sum()
model_data = model_data2.join(home_supply)   # Join on the date index
total_missing = model_data.isnull().sum()
model_data = model_data.ffill()              # Carry the last known monthly value forward
total_missing_model_data = model_data.isnull().sum()
total_missing_model_data

In [None]:
model_data.head()

In [None]:
model_data3=model_data.copy()
model_data3.head()

# DATA EXPLORATION

Our dataset can be grouped per day to create meaningful visualizations. The dataset contains data from October 2014 until December 2018. To get a feel for the number of mortgage applications per day and their distribution, different visualizations can be made using Python. Two graphs have been created, which can be found in Figure 1 and Figure 2. Both graphs contain only the number of mortgage applications on weekdays; weekends have been excluded because almost no applications come in then. The graphs suggest a seasonal pattern at the monthly level, although it is not very clear. There also appear to be some outliers, which will have to be investigated to decide whether to include them in our model, as outliers can have multiple underlying causes. Finally, there seems to be an increase in mortgage applications during the last few months of each year: the number of applications per day during these months is higher than in the other months. This can have multiple explanations, so it will have to be accounted for in the model.
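
The per-day grouping and weekday filter described above can be sketched as follows. The timestamps are made up for illustration; on the real data, the same steps would be applied to the parsed 'Created Date' column.

```python
import pandas as pd

# Made-up application timestamps; 2018-01-06 is a Saturday.
applications = pd.DataFrame({
    "Created Date": pd.to_datetime([
        "2018-01-04 09:15", "2018-01-04 14:30", "2018-01-05 11:00",
        "2018-01-06 10:00", "2018-01-08 16:45",
    ])
})

# Count applications per calendar day (normalize() drops the time of day)
daily_counts = applications.groupby(applications["Created Date"].dt.normalize()).size()

# Keep Monday (0) through Friday (4) only
weekday_counts = daily_counts[daily_counts.index.dayofweek < 5]
print(weekday_counts)
```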

In [None]:
# Keep only loans originated in the states listed in St
loan_amount_data = data[data['State'].isin(St)]
loan_amount_data.shape

In [None]:
ct_loan_amount=sum(ct_data['Loan Amount'])
fl_loan_amount=sum(fl_data['Loan Amount'])
ny_loan_amount=sum(ny_data['Loan Amount'])
nj_loan_amount=sum(nj_data['Loan Amount'])
pa_loan_amount=sum(pa_data['Loan Amount'])

loan_amount_per_state = [ct_loan_amount, fl_loan_amount, nj_loan_amount, ny_loan_amount, pa_loan_amount]

print('====================================================')
print('=========    Total Sales by State     ==============')
print(' ')
print('Total Sales in Connecticut   : $', ct_loan_amount)
print('Total Sales in Florida       : $', fl_loan_amount)
print('Total Sales in New York      : $', ny_loan_amount)
print('Total Sales in New Jersey    : $', nj_loan_amount)
print('Total Sales in Pennsylvania  : $', pa_loan_amount)
print(' ')
print('====================================================')
print(' ')

In [None]:
loan_types=data['Loan Type'].unique()
group_loan_types=data.groupby(data['Loan Type']).size()
print('Unique Loan Types        : ', loan_types)

In [None]:
print(' ')
print('====================================================')

print('Number of loan per Types : ', group_loan_types)

print(' ')
print('====================================================')
print(' ')

In [None]:
from matplotlib.ticker import FuncFormatter
x = np.arange(5)
money = [1.5e5, 2.5e6, 5.5e6, 1.0e7, 2.0e7, 3.0e7, 4.0e7, 5.0e7, 6.0e7]
def millions(x, pos):
    'The two args are the value and tick position'
    return '$%1.1fM' % (x * 1e-6)
formatter = FuncFormatter(millions)
fig, ax = plt.subplots(figsize=(20, 5))
ax.yaxis.set_major_formatter(formatter)
a=sns.barplot(x=St, y=loan_amount_per_state)
for p in a.patches:
    a.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(x, ('Connecticut', 'Florida', 'New Jersey', 'New York', 'Pennsylvania'))
plt.ylabel('Loan Amount')
plt.xlabel('Loan Origination per State')
plt.xticks(rotation=60)
plt.title('Mortgage Bank Loans per State')
plt.show()


Even though the bank is licensed for business in NY, NJ, CT, PA and FL, its strongest mortgage markets are NY and NJ.

In [None]:
data = pd.read_csv('../input/mortgage-bank-loan/mwb2014.csv', header=0, encoding='cp1252')
# Restrict to numeric columns; recent pandas raises on text columns otherwise
data.quantile([0.05, 0.95], numeric_only=True)

In [None]:
data['Loan Amount'].quantile([0.05, 0.95])

In [None]:
data['Loan Amount'].quantile([0.1, 0.90])

In [None]:
St=['CT', 'FL', 'NJ', 'NY', 'PA']
print(St)

In [None]:
#DataFrame subsets by State
ct_data = data[data['State'].isin(['CT'])]
fl_data = data[data['State'].isin(['FL'])]
ny_data = data[data['State'].isin(['NY'])]
nj_data = data[data['State'].isin(['NJ'])]
pa_data = data[data['State'].isin(['PA'])]

In [None]:
# Select a subset of columns per state (reindex needs columns=, otherwise it reindexes the rows)
loan_columns = ['Created Date', 'First Name', 'Last Name', 'Loan Amount', 'City', 'Unit Type', 'Loan Type']
ct_lo_amount = ct_data.reindex(columns=loan_columns)
fl_lo_amount = fl_data.reindex(columns=loan_columns)
nj_lo_amount = nj_data.reindex(columns=loan_columns)
ny_lo_amount = ny_data.reindex(columns=loan_columns)
pa_lo_amount = pa_data.reindex(columns=loan_columns)

In [None]:
plt.figure(figsize=(20,5))
plt.xlabel('Loan Origination Date')
plt.xticks(rotation=60)
plt.ylabel('Loan Amount')
plt.title('MWB Loan Data for CT')
plt.plot(ct_data['Created Date'], ct_data['Loan Amount'])
plt.show()

The plot above shows the loan amounts for Connecticut by creation date.

In [None]:
print('==============================')
print(' ')
#Unit Type Loan Data
total_unit_type = data.groupby(data['Unit Type']).size()
print('Loan originated in all States per unit types : \n', total_unit_type)

print('==============================')
print(' ')

In [None]:
#Unit Type Loan Data for CT
print('==============================')
print(' ')
ct_data_unit_type=ct_data['Loan Amount'].groupby(data['Unit Type']).size()
print('Loan originated in Connecticut per unit types : \n', ct_data_unit_type)

print('==============================')
print(' ')

In [None]:
#Unit Type Loan Data for FL
print('==============================')
print(' ')
fl_data_unit_type=fl_data['Loan Amount'].groupby(data['Unit Type']).size()
print('Loan originated in Florida per unit types : \n', fl_data_unit_type)
print('==============================')
print(' ')

In [None]:
#Unit Type Loan Data for NJ
print('==============================')
print(' ')
nj_data_unit_type=nj_data['Loan Amount'].groupby(data['Unit Type']).size()
print('Loan originated in New Jersey per unit types : \n', nj_data_unit_type)

print('==============================')
print(' ')

In [None]:
#Unit Type Loan Data for NY
print('==============================')
print(' ')
ny_data_unit_type=ny_data['Loan Amount'].groupby(data['Unit Type']).size()
print('Loan originated in New York per unit types : \n', ny_data_unit_type)

print('==============================')
print(' ')

In [None]:
#Unit Type Loan Data for PA
print('==============================')
print(' ')
pa_data_unit_type=pa_data['Loan Amount'].groupby(data['Unit Type']).size()
print('Loan originated in Pennsylvania per unit types : \n', pa_data_unit_type)

print('==============================')
print(' ')

In [None]:
#Unit Type Loan Data for CT
ct_data['Loan Amount'].groupby(data['Unit Type']).size()

# Loan data based on Unit Types (One Family, Two Family etc) per State

In [None]:
lo_data = pd.read_csv('../input/mortgage-bank-loan/mwb2014.csv', header=0, index_col = 'Loan Officer Name', encoding='cp1252')

# Create a separate dataframe with the loan columns of interest: lo_df
lo_df = lo_data[['Created Date', 'Loan Amount', 'Unit Type', 'Loan Type', 'City', 'Zip']]

# Show the first rows of lo_df
lo_df.head()

In [None]:
lo_df.count()

In [None]:
lo_df.describe()

In [None]:
lo_df.describe().transpose()

In [None]:
print('=======================================================================')
print('**********************  Loan Statistics for Mortgage Bank ***********************')
print(' ')


from scipy.stats import scoreatpercentile
import numpy as np

q0 = scoreatpercentile(lo_df['Loan Amount'],10)
q1 = scoreatpercentile(lo_df['Loan Amount'],25)
q2 = scoreatpercentile(lo_df['Loan Amount'],50)  # median, reported below as the 50% line
q3 = scoreatpercentile(lo_df['Loan Amount'],75)
q4 = scoreatpercentile(lo_df['Loan Amount'],90)


print('Average Loan Amount is               :  $', '%.2f' %lo_df['Loan Amount'].mean())
print('Median Loan Amount is                :  $', '%.2f' %lo_df['Loan Amount'].median())
print(' ')
print('Standard deviation of Loan Amount is :  $', '%.2f' %lo_df['Loan Amount'].std())
print(' ')
print('Minimum Loan Amount is               :  $', '%.2f' %lo_df['Loan Amount'].min())
print('Maximum Loan Amount is               :  $', '%.2f' %lo_df['Loan Amount'].max())
print(' ')
print('Total of Loan Amount is              :  $', '%.2f' %lo_df['Loan Amount'].sum())
print(' ')
print('10% of Loan Amount is below          :  $', '%.2f' %q0)
print('25% of Loan Amount is below          :  $', '%.2f' %q1)
print('50% of Loan Amount is below          :  $', '%.2f' %q2)
print('75% of Loan Amount is below          :  $', '%.2f' %q3)
print('90% of Loan Amount is below          :  $', '%.2f' %q4)
print(' ')
print('==========================================================================')


# Loan Statistics for the Bank

In [None]:
lo_df.describe()

In [None]:
fico = lo_data[['Qualification FICO']]
fico.describe()

In [None]:
print('=======================================================================')
print('**************** Qualification FICO  Statistics for Mortgage Bank ***************')
print(' ')
from scipy.stats import scoreatpercentile
import numpy as np

fico_q7 = scoreatpercentile(lo_data['Qualification FICO'],5)
fico_q0 = scoreatpercentile(lo_data['Qualification FICO'],10)
fico_q1 = scoreatpercentile(lo_data['Qualification FICO'],25)
fico_q5 = scoreatpercentile(lo_data['Qualification FICO'],40)
fico_q2 = scoreatpercentile(lo_data['Qualification FICO'],50)
fico_q6 = scoreatpercentile(lo_data['Qualification FICO'],65)
fico_q3 = scoreatpercentile(lo_data['Qualification FICO'],75)
fico_q4 = scoreatpercentile(lo_data['Qualification FICO'],90)
fico_q8 = scoreatpercentile(lo_data['Qualification FICO'],95)


# FICO scores are unitless, so no dollar sign is printed
print('Average FICO is                  :  ', '%.2f' %lo_data['Qualification FICO'].mean())
print('Median FICO is                   :  ', '%.2f' %lo_data['Qualification FICO'].median())
print('Standard deviation is            :  ', '%.2f' %lo_data['Qualification FICO'].std())
print('Minimum FICO is                  :  ', '%.2f' %lo_data['Qualification FICO'].min())
print('Maximum FICO is                  :  ', '%.2f' %lo_data['Qualification FICO'].max())
print(' ')
print('5% of FICO is below              :  ', '%.2f' %fico_q7)
print('10% of FICO is below             :  ', '%.2f' %fico_q0)
print('25% of FICO is below             :  ', '%.2f' %fico_q1)
print('40% of FICO is below             :  ', '%.2f' %fico_q5)
print('50% of FICO is below             :  ', '%.2f' %fico_q2)
print('65% of FICO is below             :  ', '%.2f' %fico_q6)
print('75% of FICO is below             :  ', '%.2f' %fico_q3)
print('90% of FICO is below             :  ', '%.2f' %fico_q4)
print('95% of FICO is below             :  ', '%.2f' %fico_q8)

print(' ')
print('==========================================================================')


In [None]:
fico_score = [fico_q7, fico_q0, fico_q1, fico_q5, fico_q2, fico_q6, fico_q3, fico_q4, fico_q8]
fico_pct=['5% Loan Below', '10% Loan Below', '25% Loan Below', '40% Loan Below', '50% Loan Below', '65% Loan Below', '75% Loan Below','90% Loan Below', '95% Loan Below']
plt.figure(figsize=(20,5))
plt.xticks(rotation=75)
plt.ylim((500,850))
a=sns.barplot(x=fico_pct, y=fico_score)
for p in a.patches:
    a.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.ylabel('Qualification FICO Scores')
plt.xlabel('Qualification FICO (%)')
plt.title('Mortgage Loans Qualification FICO  Statistics')
plt.show()

# Mortgage Bank FICO Score Statistics

In [None]:
print('==============================================================================')
print("**************** Mortgage Loan Officers' Sales Volume per Loan Type ***************")
print(' ')


###########################
lo_loan = data[['Loan Officer Name', 'Loan Type', 'Created Date', 'Loan Amount']]

# Group by loan officer and loan type; sum only the numeric Loan Amount column
lo_loan_group = lo_loan.groupby(['Loan Officer Name', 'Loan Type'])[['Loan Amount']].sum()

lo_loan_group_count = lo_loan.groupby(['Loan Officer Name', 'Loan Type']).count()

lo_loan_group


In [None]:
lo_loan_group_count

In [None]:
lo_loan['Loan Type'].describe()

In [None]:
'''The category representation looks good but we need to break it apart
to graph it as a stacked bar graph. unstack can do this for us.'''
lo_loan_group.unstack().head()

In [None]:
lo_loan_group_plot = lo_loan_group.unstack().plot(kind='bar',stacked=True,
                                                  title="Total Sales by Loan Officers by Loan Type",figsize=(20,5))
lo_loan_group_plot.set_xlabel("Loan Officers")
lo_loan_group_plot.set_ylabel("Sales per Loan Type")
lo_loan_group_plot.legend(["Commercial","Conventional","FHA","Other","VA"], loc=2,ncol=1)
plt.ylim(10000000, 140000000)
plt.show()

In [None]:
print('==============================================================================')
print('**************** MWB Loan Officers Sales Volume per Unit Type ***************')
print(' ')
lo_unit = data[['Loan Officer Name', 'Unit Type', 'Created Date', 'Loan Amount']]
# Sum only the numeric Loan Amount column per loan officer and unit type
lo_unit_group = lo_unit.groupby(['Loan Officer Name', 'Unit Type'])[['Loan Amount']].sum()
lo_unit_group_count = lo_unit.groupby(['Loan Officer Name', 'Unit Type']).count()
lo_unit_group

In [None]:
lo_unit_group_count

In [None]:
lo_unit['Unit Type'].describe()

In [None]:
lo_unit_group.unstack().head()

In [None]:
lo_unit_group_plot = lo_unit_group.unstack().plot(kind='bar',stacked=True,title="Total Sales by Loan Officers by Unit Type",figsize=(20, 5))
lo_unit_group_plot.set_xlabel("Loan Officers")
lo_unit_group_plot.set_ylabel("Sales per Unit Type")
plt.show()


# MWB Loan Officers Sales Volume per Unit Type

The grouped tables above were unstacked so that each loan officer's sales volume could be drawn as a stacked bar graph, first per loan type and then per unit type.

In [None]:
# Import the data without a date index
data = pd.read_csv('../input/mortgage-bank-loan/mwb2014.csv', header=0, encoding='cp1252')

One of the really useful things pandas allows us to do is resample the data. To look at the data by month, we can resample it and sum each period. We use 'M' as the resampling period, which means the data is resampled on month boundaries.
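
A minimal sketch of monthly resampling on a small synthetic frame. The dates and amounts are made up for illustration and are not taken from mwb2014.csv. Recent pandas releases prefer the alias `'ME'` for month-end, so the sketch tries that first and falls back to `'M'` on older versions; that fallback is an assumption about the installed pandas, not part of the original notebook.

```python
import pandas as pd

# Synthetic loans (made-up values)
toy = pd.DataFrame(
    {
        "Created Date": ["2017-01-05", "2017-01-20", "2017-02-03",
                         "2017-03-15", "2017-03-28"],
        "Loan Amount": [200000, 350000, 150000, 400000, 100000],
    }
)

# Resampling needs a DatetimeIndex, so parse the dates and index by them
toy["date"] = pd.to_datetime(toy["Created Date"])
toy = toy.set_index("date")[["Loan Amount"]]

# Sum the loan amounts within each calendar month
try:
    monthly_total = toy.resample("ME").sum()   # newer pandas
except ValueError:
    monthly_total = toy.resample("M").sum()    # older pandas

print(monthly_total)
```

Each row of `monthly_total` is labelled with the last day of its month and holds the sum of all loans created in that month.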

In [None]:
# Copy the slice so that adding a column later does not trigger SettingWithCopyWarning
loan_patterns = data[['Loan Amount', 'Created Date']].copy()
loan_patterns.head()

In [None]:
'''
If we want to analyze the data by date,
we need to set the date column as the index using set_index .
'''
#Convert Date Index
loan_patterns['date'] = pd.DatetimeIndex(data['Created Date'])

loan_patterns = loan_patterns.set_index('date')
loan_patterns.index

In [None]:
loan_patterns.head()

In [None]:
monthly_loan_rev=loan_patterns.resample('M').sum()
loan_patterns.info()

In [None]:
loan_patterns_month_plot = loan_patterns.resample('M').sum().plot(title="Mortgage Bank - Total Sales by Month",
                                                                  legend=False,figsize=(20,5))
loan_patterns_month_plot.set_xlabel("Months")
loan_patterns_month_plot.set_ylabel("Monthly Sales")
plt.xticks(rotation=45)
plt.ylim((12000000, 35000000))
fig = loan_patterns_month_plot.get_figure()

Monthly mortgage loan sales volume varies between $15M and $32M. Loan sales peak during the summer; winter sales are normally slow, but 2018 was an exception.

In [None]:
loan_patterns.resample('Q').sum()
monthly_loan_rev=loan_patterns.resample('Q').sum()
loan_patterns.info()

In [None]:
loan_patterns_month_plot = loan_patterns.resample('Q').sum().plot(title="Mortgage Bank - Total Sales by Quarter",
                                                                 legend=False,figsize=(20,5))
loan_patterns_month_plot.set_xlabel("Quarters")
loan_patterns_month_plot.set_ylabel("Quarterly Sales")
plt.xticks(rotation=45)
plt.ylim((50000000, 80000000))
fig = loan_patterns_month_plot.get_figure()

In [None]:
fig.savefig("./loan_patterns_month_plot.png")

Mortgage Bank's quarterly sales revenue varies between $50M and $77M.

In [None]:
#Monthly Sales Sorted
'''
Grouping on a function of the index
Groupby operations can also be performed on transformations
of the index values. In the case of a DateTimeIndex,
we can extract portions of the datetime over which to group.
'''

data = pd.read_csv('../input/mortgage-bank-loan/mwb2014.csv', index_col='Created Date', parse_dates=True, encoding='cp1252')
data.head()

In [None]:
# Create groupby objects keyed by month name, year, and weekday
by_month = data.groupby(data.index.strftime('%B'))
by_year = data.groupby(data.index.strftime('%Y'))
by_day = data.groupby(data.index.strftime('%a'))

'''
%a - abbreviated weekday name
%m - month (01 to 12)
%b - abbreviated month name
%B - full month name
%y - year without a century (range 00 to 99)
%Y - year including the century
'''

In [None]:
# Sum the loan amount per month name: monthly_loan_amount_sum
monthly_loan_amount_sum = by_month['Loan Amount'].sum()
monthly_loan_amount_sum

In [None]:
daily_loan_amount_sum = by_day['Loan Amount'].sum()
daily_loan_amount_sum

In [None]:
yearly_loan_amount_sum = by_year['Loan Amount'].sum()
yearly_loan_amount_sum

In [None]:
formatter = FuncFormatter(millions)
fig, ax = plt.subplots(figsize=(20,5))
ax.yaxis.set_major_formatter(formatter)
plt.xlabel('Loan Origination Month')
plt.ylabel('Loan Amount')
plt.title('Mortgage Bank : 48 Months Total - Monthly Loan Sales')
plt.xticks(rotation=90)
plt.plot(monthly_loan_amount_sum)
plt.show()

In [None]:
monthly_loan_amount_sum_sorted= monthly_loan_amount_sum.sort_values()
# Show the sorted monthly totals
monthly_loan_amount_sum_sorted

In [None]:
yearly_loan_amount_sum

In [None]:
formatter = FuncFormatter(millions)
fig, ax = plt.subplots(figsize=(20,5))
ax.yaxis.set_major_formatter(formatter)
plt.xlabel('Loan Origination Month')
plt.ylabel('Loan Amount')
plt.title('MWB : 24 Months Total - Monthly Loan Sales (Sorted)')
plt.xticks(rotation=90)
plt.plot(monthly_loan_amount_sum_sorted)
plt.show()

In [None]:
lo_zip=data.groupby(['Loan Officer Name', 'Zip'])
lo_zip_count = lo_zip['Zip'].count()
lo_zip_filt = lo_zip.filter(lambda c:c['Zip'].count() > 3)
lo_zip_filt_top10= (lo_zip_filt.groupby(['Zip']).size()).sort_values(ascending=False)
print('Top 10 marketing location by Zip Codes :', lo_zip_filt_top10.head(10))

These are the top ten loan origination zip codes. The bank can focus more of its marketing on these areas to generate more revenue.

In [None]:
import datetime
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import calendar
from time import strptime
data = pd.read_csv('../input/mortgage-bank-loan/mwb2014.csv', header=0, encoding='cp1252')
M_loan_city_num=data[['Created Date', 'LoanInMonth']]
M_loan_city_num['date'] = pd.DatetimeIndex(data['Created Date'])
M_loan_city_num = M_loan_city_num.set_index('date')
M_loan_city_num_2017 = M_loan_city_num.loc['2017-1-1':'2017-9-16']
print('2017 : ')
M_loan_city_num_2017.head()

In [None]:
M_loan_city_num_2018 = M_loan_city_num.loc['2018-1-1':'2018-9-16']
print('2018 :')
M_loan_city_num_2018.head()

In [None]:
M17= M_loan_city_num_2017.resample('M').count()
M17.index=M17.index.month
M_loan_city_num_2017.resample('M').count().head()

In [None]:
M18=M_loan_city_num_2018.resample('M').count()
M18.index=M18.index.month
M_loan_city_num_2018.resample('M').count().head()

In [None]:
M_loan_city_num_2017_plot = M_loan_city_num_2017.resample('M').count().plot(title="MWB Total Monthly Sales for 2017",
                                                                            legend=False,figsize=(20,5))
plt.xlabel('Loan origination Months')
plt.xticks(rotation=60)
plt.ylabel('Loans per Month')
plt.show()

In [None]:
M_loan_city_num_2018_plot = M_loan_city_num_2018.resample('M').count().plot(title="MWB Total Monthly Sales for 2018",
                                                                            legend=False,figsize=(20,5))
plt.xlabel('Loan origination Months')
plt.xticks(rotation=60)
plt.ylabel('Loans per Months')
plt.show()

These plots show the number of loans closed per month, so 2017 and 2018 can be compared.

In [None]:
import matplotlib.patches as mpatches
data = pd.read_csv('../input/mortgage-bank-loan/mwb2014.csv', encoding='cp1252')
M_loan_amount_num=data[['Created Date', 'Loan Amount']]
M_loan_amount_num['date'] = pd.DatetimeIndex(data['Created Date'])
M_loan_amount_num = M_loan_amount_num.set_index('date')
M_loan_amount_num_2017 = M_loan_amount_num.loc['2017-1-1':'2017-9-30']
M_loan_amount_num_2018 = M_loan_amount_num.loc['2018-1-1':'2018-9-30']
M_amount_17= M_loan_amount_num_2017.resample('M').sum()
M_amount_17.index=M_amount_17.index.month
M_amount_18= M_loan_amount_num_2018.resample('M').sum()
M_amount_18.index=M_amount_18.index.month
M_amount_17.sum()

In [None]:
M_amount_18.sum()

In [None]:
formatter = FuncFormatter(millions)
fig, ax = plt.subplots(figsize=(20,5))
ax.yaxis.set_major_formatter(formatter)
plt.plot(M_amount_17, color='red')
plt.plot(M_amount_18, color='blue')
# Build each legend label as one string; a tuple would be shown as a raw tuple
red_patch = mpatches.Patch(color='red', label=f"2017 Loans (Jan - Sep) Total: ${M_amount_17['Loan Amount'].sum():,.0f}")
blue_patch = mpatches.Patch(color='blue', label=f"2018 Loans (Jan - Sep) Total: ${M_amount_18['Loan Amount'].sum():,.0f}")
plt.legend(handles=[red_patch, blue_patch])
# The month-number index (1-9) is the x position of each point, so use it for the ticks
plt.xticks(M17.index, [calendar.month_name[month] for month in M17.index], rotation=60)
plt.xlabel('Loan origination Months')
plt.ylabel('Monthly Loan Amount')
plt.title('Mortgage Monthly Loan Amounts 2017 & 2018')
plt.show()

# Comparing Mortgage Monthly Loan numbers 2017 & 2018

In [None]:
M_amount_17= M_loan_amount_num_2017.resample('M').sum()
M_amount_17.index=M_amount_17.index.month
M_amount_18= M_loan_amount_num_2018.resample('M').sum()
M_amount_18.index=M_amount_18.index.month
M_amount_17.sum()

In [None]:
M_amount_18.sum()

In [None]:
from matplotlib.ticker import FuncFormatter
x = np.arange(5)
money = [1.5e5, 2.5e6, 5.5e6, 1.0e7, 2.0e7, 3.0e7, 4.0e7, 5.0e7, 6.0e7]
def millions(x, pos):
    'The two args are the value and tick position'
    return '$%1.1fM' % (x * 1e-6)

formatter = FuncFormatter(millions)
fig, ax = plt.subplots(figsize=(20,5))
ax.yaxis.set_major_formatter(formatter)
plt.plot(M_amount_17, color='red')
plt.plot(M_amount_18, color='blue')
# Build each legend label as one string; a tuple would be shown as a raw tuple
red_patch = mpatches.Patch(color='red', label=f"2017 Loans (Jan - Sep) Total: ${M_amount_17['Loan Amount'].sum():,.0f}")
blue_patch = mpatches.Patch(color='blue', label=f"2018 Loans (Jan - Sep) Total: ${M_amount_18['Loan Amount'].sum():,.0f}")
plt.legend(handles=[red_patch, blue_patch])
# The month-number index (1-9) is the x position of each point, so use it for the ticks
plt.xticks(M17.index, [calendar.month_name[month] for month in M17.index], rotation=60)
plt.show()

For January through September 2017, Mortgage Bank sales revenue (`M_amount_17.sum()`) was $180M.

For January through September 2018, Mortgage Bank sales revenue (`M_amount_18.sum()`) was $197M.

The number of loans closed went down, but sales revenue went up. The main reason is that housing prices rose, which drove the average loan amount up. As a result, total sales were higher in 2018 than in 2017.

In [None]:
print('Pearson correlation coefficient between Loan number and Loan Revenue: ', r_monthly_loan_num_data_monthly_loan_rev)

As we have seen earlier, there is a strong positive correlation (0.844) between loan number and loan revenue.

Monthly loan numbers and monthly sales volumes are strongly positively correlated: as the number of loans closed in a month goes up, total monthly sales volume goes up. In months where more loans are closed, however, we find that the average loan amount is lower. This may suggest that, because of loan processing requirements and guidelines, loans with higher amounts take longer to close than loans with lower amounts.
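
The relationship described above can be checked with a few lines of NumPy. This is only a sketch: the monthly counts and revenues below are made-up numbers, not the bank's figures, and `pearson_r` is a small helper built on `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]

# Made-up monthly figures for illustration only
monthly_loan_count = np.array([40, 55, 62, 48, 70, 66])
monthly_loan_revenue = np.array([15.2, 20.1, 24.8, 18.0, 27.5, 25.9]) * 1e6

# Average loan amount per month follows from the two series
avg_loan_amount = monthly_loan_revenue / monthly_loan_count

# Correlation between how many loans close and how much revenue they bring
r_count_revenue = pearson_r(monthly_loan_count, monthly_loan_revenue)

# Correlation between loan count and average loan size
r_count_avg = pearson_r(monthly_loan_count, avg_loan_amount)

print("count vs revenue    :", round(r_count_revenue, 3))
print("count vs avg amount :", round(r_count_avg, 3))
```

Running the same two calls on the real monthly series would show whether months with more closings really do have smaller average loans.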

Visual exploration is often the most effective way to find relationships between variables.

In [None]:
import pylab  #Plotting
import scipy.stats as stats # scientific calculation
plt.figure(figsize=(20,5))
stats.probplot(model_data['Loan Amount'], dist="norm", plot=pylab)
pylab.show()

In [None]:
type(model_data)

In [None]:
model_data.info()

In [None]:
X = np.array(model_data.drop(['Loan Amount'], axis=1))  # positional axis was removed in pandas 2.0
y = np.array(model_data['Loan Amount'])

In [None]:
model_data.columns

In [None]:
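# Plot the first five feature columns against Loan Amount, labelled by column name
feature_names = model_data.drop(columns=['Loan Amount']).columns
for index, column in enumerate(feature_names[:5]):
    plt.figure(figsize=(20, 5))
    plt.scatter(X[:, index], y, color='g')
    plt.ylabel('Loan Amount', size=10)
    plt.xlabel(column, size=10)
    plt.tight_layout()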

Scatter plot for each (Column 1-5) feature with respect to Loan Amount

In [None]:
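# Plot feature columns 6-10 against Loan Amount; start=5 keeps index aligned with X
feature_names = model_data.drop(columns=['Loan Amount']).columns
for index, column in enumerate(feature_names[5:10], start=5):
    plt.figure(figsize=(20, 5))
    plt.scatter(X[:, index], y, color='b')
    plt.ylabel('Loan Amount', size=15)
    plt.xlabel(column, size=15)
    plt.tight_layout()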

Scatter plot for each (Column 6-10) feature with respect to Loan Amount

In [None]:
plt.figure(figsize=(20, 5))
plt.hist(model_data['RATE'])
plt.title("RATE")
plt.xlabel("US10Y Rate Distribution")
plt.ylabel("Frequency")
plt.show()

This distribution is roughly normal.

In [None]:
plt.figure(figsize=(20, 5))
plt.hist(model_data['Home'])
plt.title("New Home Supply")
plt.xlabel("Housing Supply")
plt.ylabel("Frequency")
plt.show()

The distribution is somewhat right-skewed: most of the mass sits on the left, with a longer tail to the right.
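
Skew direction can also be checked numerically rather than read off the histogram. The sketch below uses `scipy.stats.skew` on two synthetic samples (not the `Home` or `RATE` columns): a positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric shape.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic sample with most mass on the left and a long right tail
right_tailed = rng.lognormal(mean=0.0, sigma=0.8, size=5000)

# Synthetic, roughly symmetric sample for comparison
symmetric = rng.normal(loc=0.0, scale=1.0, size=5000)

skew_right = skew(right_tailed)
skew_sym = skew(symmetric)

print("right-tailed sample skewness:", round(skew_right, 2))
print("symmetric sample skewness   :", round(skew_sym, 2))
```

Applying `skew(model_data['Home'])` and `skew(model_data['RATE'])` in the notebook would give the same kind of check for the two histograms.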

In [None]:
plt.figure(figsize=(15,8))
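# distplot is deprecated in recent seaborn; histplot with kde=True is the closest replacement
ax=sns.histplot(model_data['RATE'], kde=True, stat="density", color="blue", label="US 10 Years Treasury Rate")
ax=sns.histplot(model_data['Home'], kde=True, stat="density", color="green", label="New Home Supply")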
plt.legend()
plt.title("US 10 Years Treasury Rate vs. New Home Supply")
plt.show()

**Home Supply goes up, RATE goes up**

The distribution plot comparing the US 10-year Treasury rate and new home supply shows that the 10-year rate is approximately normally distributed, while new home supply is skewed to the right. The rate is a key tool for the government to keep rising housing prices in check: when the rate goes up, borrowers' buying power goes down, which restrains housing prices.

In [None]:
fit_data = model_data.drop(['loan_purpose_code', 'Qualification FICO' ], axis=1)
sns.pairplot(fit_data, hue = 'loan_type_code',corner=True,palette='Set2',diag_kind="hist")
plt.show()

In [None]:
# Pair plot colored by unit type, restricted to loans with CLTV of 80% or more
sns.pairplot(fit_data[fit_data['CLTV'] >= 8],
             vars = ['Loan Amount', 'loan_type_code', 'LoanInMonth'],
             hue = 'unit_type_code',corner=True,palette='Set2',diag_kind="hist")
# Title
plt.suptitle('Pair Plot of Mortgage Data for 2014-2018 for CLTV over 80%',
             size = 12);
plt.show()

In [None]:
# Function to calculate correlation coefficient between two arrays
def corr(x, y, **kwargs):
    # Calculate the value
    coef = np.corrcoef(x, y)[0][1]
    # Make the label
    label = r'$\rho$ = ' + str(round(coef, 2))

    # Add the label to the plot
    ax = plt.gca()
    ax.annotate(label, xy = (0.2, 0.95), size = 20, xycoords = ax.transAxes)
# Create a pair grid instance
grid = sns.PairGrid(data= fit_data[fit_data['CLTV'] > 8],
                    vars = ['Loan Amount', 'unit_type_code',
       'loan_type_code', 'LoanInMonth'], height = 5)
# Map the plots to the locations
grid = grid.map_upper(plt.scatter, color = 'darkred')
grid = grid.map_upper(corr)
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
grid = grid.map_diag(plt.hist, bins = 10, edgecolor =  'k', color = 'darkred')

The grid above annotates each upper-panel scatter plot with the correlation coefficient between the two variables shown, for `vars = ['Loan Amount', 'unit_type_code', 'loan_type_code', 'LoanInMonth']`.

**FEATURE SELECTION**

Feature selection is one of the core concepts in machine learning, because the features used to train a model have a large influence on the performance it can achieve. Performing feature selection before modeling has three main benefits:

- Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
- Improves accuracy: less misleading data means modeling accuracy improves.
- Reduces training time: fewer features reduce algorithm complexity, so algorithms train faster.

We will use three feature selection techniques that are easy to apply and tend to give good results:
- Univariate Selection 
- Feature Importance 
- Correlation Matrix with Heatmap

# Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

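The example below uses the chi-squared (chi²) statistical test, which requires non-negative features, to score the features of the mortgage loan dataset against the `Fix_True` target. With `k='all'` every feature is scored, so the best features can be read from the sorted scores.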
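
Before running this on the loan data, here is a small self-contained sketch of the same technique on synthetic data. The column names `informative`, `noise_a` and `noise_b` are hypothetical, and the binary target only stands in for something like `Fix_True`. The point it illustrates is that the score labels should come from the same feature frame that was passed to `fit`, so scores and names stay aligned.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
n = 500

# Synthetic binary target (a stand-in for a flag such as fixed-rate vs ARM)
target = rng.integers(0, 2, size=n)

# Non-negative features: one depends on the target, two are pure noise
features = pd.DataFrame({
    "informative": target * 3 + rng.integers(0, 3, size=n),
    "noise_a": rng.integers(0, 10, size=n),
    "noise_b": rng.integers(0, 10, size=n),
})

# k='all' keeps every feature and simply scores them
selector = SelectKBest(score_func=chi2, k="all").fit(features, target)

# Label the scores with the columns of the frame that was fitted
scores = pd.Series(selector.scores_, index=features.columns)
scores = scores.sort_values(ascending=False)
print(scores)
```

The feature constructed from the target receives by far the largest chi-squared score, while the two noise columns score close to zero.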

In [None]:
model_data=model_data3.copy()
model_data['Qualification FICO']=model_data['Qualification FICO'].astype('int64')
model_data.head()

This makes sure the target variable (Qualification FICO) is an integer. To reduce the weight of large-valued columns, we scale them next.

In [None]:
#Scaling data
model_data['Loan Amount']=model_data['Loan Amount']/100000
model_data['Loan Amount']=model_data['Loan Amount'].astype('float64')
model_data['city_code']=model_data['city_code']/100
model_data['city_code']=model_data['city_code'].astype('float64')
model_data['zip_code']=model_data['zip_code']/100
model_data['zip_code']=model_data['zip_code'].astype('float64')
model_data['Qualification FICO']=model_data['Qualification FICO']/100
model_data['CLTV']=model_data['CLTV']/10
model_data['CLTV']=model_data['CLTV'].astype('float64')
model_data['LoanInMonth']=model_data['LoanInMonth']/10
model_data['LoanInMonth']=model_data['LoanInMonth'].astype('float64')
model_data['Qualification FICO']=model_data['Qualification FICO'].astype('float64')
model_data.head()

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
model_data.info()

In [None]:
X = np.array(model_data.drop(['Fix_True'], axis=1))
y = np.array(model_data['Fix_True'])

In [None]:
# Apply SelectKBest with k='all' to score every feature
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X,y)

In [None]:
dfscores = pd.DataFrame(fit.scores_)
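# Use the feature columns only (Fix_True is the target), so names line up with the scores
dfcolumns = pd.DataFrame(model_data.drop(columns=['Fix_True']).columns)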

In [None]:
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
featureScores.nlargest(11,'Score')

In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")
plt.figure(figsize=(20,5))
ax=sns.barplot(x=featureScores['Specs'], y=round((featureScores['Score']),2), alpha=0.9)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.title('Chi-squared Score per Feature')
plt.ylabel('Chi-squared score', fontsize=12)
plt.xlabel('Specs', fontsize=12)
plt.xticks(rotation=45)
plt.show()

# Feature Importance

You can get the importance of each feature in the dataset from the model's `feature_importances_` attribute. It gives a score for each feature: the higher the score, the more relevant the feature is to the output variable. Tree-based classifiers expose this attribute after fitting, and we will use the Extra Trees classifier to extract the top 10 features of the dataset.
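
A minimal sketch of reading feature importances from an Extra Trees classifier, on synthetic data rather than the loan data. The column names and the `random_state` and `n_estimators` settings are assumptions made for reproducibility; the notebook itself uses the default `ExtraTreesClassifier()`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(7)
n = 600

# Three synthetic features; only "signal" determines the target
features = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
target = (features["signal"] > 0).astype(int)

# Fit the tree ensemble; random_state fixes the randomised splits
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(features, target)

# Pair each importance with the column it was computed for
importances = pd.Series(model.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False))
```

The importances sum to 1, and the index must come from the same columns, in the same order, as the matrix passed to `fit`; otherwise the bars in the plot are attributed to the wrong features.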

In [None]:
model_data['CLTV']=model_data['CLTV'].astype('int64')
import pandas as pd
import numpy as np

X = np.array(model_data.drop(['CLTV'], axis=1))
y = np.array(model_data['CLTV'])    #target column
Z = model_data.drop(['CLTV'], axis=1) # feature columns, in the same order as X

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
model.feature_importances_

In [None]:
#plot graph of feature importances for better visualization
plt.figure(figsize=(20,5))
feat_importances = round((pd.Series(model.feature_importances_, index=Z.columns)),3)
ax=feat_importances.nlargest(11).plot(kind='bar')
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.show()

# Correlation Matrix with Heatmap

Correlation states how the features are related to each other or to the target variable.

Correlation can be positive (an increase in the value of a feature increases the value of the target variable) or negative (an increase in the value of a feature decreases the value of the target variable).

A heatmap makes it easy to identify which features are most related to the target variable, so we will plot a heatmap of the correlated features using the seaborn library.
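
To make the sign of a correlation concrete, the sketch below builds a small synthetic frame and prints its correlation matrix. The column names are hypothetical and the data are random; only `RATE` borrows its name from the notebook. Passing the resulting matrix to `sns.heatmap(corrmat, annot=True, cmap="RdYlGn")` would draw it as a heatmap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Synthetic driver variable
rate = rng.normal(3.0, 0.5, size=n)

toy = pd.DataFrame({
    "RATE": rate,
    "up_with_rate": 2.0 * rate + rng.normal(0, 0.3, size=n),     # moves with RATE
    "down_with_rate": -1.5 * rate + rng.normal(0, 0.3, size=n),  # moves against RATE
    "unrelated": rng.normal(size=n),                             # independent of RATE
})

# Pairwise Pearson correlations between all columns
corrmat = toy.corr()
print(corrmat.round(2))
```

In the `RATE` row, the first column shows a value close to +1, the second a value close to -1, and the unrelated column a value near 0.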

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
X = np.array(model_data.drop(['CLTV'], axis=1))
y = np.array(model_data['CLTV'])    #target column
# Get the correlations between the numeric model features (model_data, not the raw data frame)
corrmat = model_data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,10))
#plot heat map
g=sns.heatmap(model_data[top_corr_features].corr(),annot=True,cmap="RdYlGn")
plt.show()

In [None]:
# Drop features that are not important
fit_data = model_data.drop(['loan_purpose_code', 'Qualification FICO' ], axis=1)
fit_data.info()


In [None]:
X = np.array(fit_data.drop(['Fix_True'], axis=1))
y = np.array(fit_data['Fix_True'])    #target column
X[:5,] #print 1st 5 row of input

In [None]:
print('Target variable is based on the five input rows above. \nFix Mortgage = 1 & ARM (Adjustable Rate Mortgage) = 0 : \n',y[:5,])

# PCA (Principal Component Analysis)

We use PCA to de-correlate these measurements, then plot the de-correlated points and measure their Pearson correlation. The function below computes the Pearson correlation coefficient between two arrays.

In [None]:
def pearson_r(x, y):
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)
    # Return entry [0,1]
    return corr_mat[0, 1]

In [None]:
# Import PCA
from sklearn.decomposition import PCA
# Create PCA instance: model
model = PCA()
# Apply the fit_transform method of model to model_data: pca_features
pca_features = model.fit_transform(model_data)
# Assign 0th column of pca_features: xs
xs = pca_features[:,0]
# Assign 1st column of pca_features: ys
ys = pca_features[:,1]
plt.figure(figsize=(20,5))
plt.scatter(xs, ys) # Scatter plot xs vs ys
plt.axis('equal')
plt.show()
# Pearson correlation of the two de-correlated components (expected to be close to 0)
print(pearson_r(xs, ys))

# Variance of the PCA features

The dataset is 10-dimensional, but what is its intrinsic dimension? We plot the variances of the PCA features to find out. As before, the input is a 2D array in which each row represents a loan.

We need to standardize the features first. Principal component analysis is used for dimensionality reduction, for visualization of high-dimensional data, for noise filtering, and for feature selection within high-dimensional data. Because of its versatility and interpretability, PCA has been shown to be effective in a wide variety of contexts and disciplines.

Given any high-dimensional dataset, I tend to start with PCA in order to visualize the relationship between points, to understand the main variance in the data, and to understand the intrinsic dimensionality (by plotting the explained variance ratio).
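
To see why standardizing matters, here is a minimal sketch with two made-up, independent features on very different scales (the names `loan_amount` and `rate` and their distributions are assumptions for illustration only). It computes the explained variance ratio by hand with NumPy's SVD rather than with scikit-learn:

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of the total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    # Singular values of the centred data give the component variances
    s = np.linalg.svd(Xc, compute_uv=False)
    variances = s ** 2 / (len(X) - 1)
    return variances / variances.sum()

rng = np.random.default_rng(42)
n = 1000
loan_amount = rng.normal(400_000, 100_000, size=n)  # hypothetical, in dollars
rate = rng.normal(3.0, 0.5, size=n)                 # hypothetical, in percent
X = np.column_stack([loan_amount, rate])

raw_ratio = explained_variance_ratio(X)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_ratio = explained_variance_ratio(X_std)

print("without scaling:", raw_ratio)
print("with scaling:   ", std_ratio)
```

Without scaling, the first component is essentially just the large-scale feature. After standardizing, the two independent features share the variance roughly equally.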

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
import numpy as np
X = np.array(fit_data.drop(columns=['Fix_True']))
y = np.array(fit_data['Fix_True'])
X.shape

In [None]:
# Create scaler: scaler
scaler = StandardScaler()

In [None]:
# Create a PCA instance: pca
pca = PCA()

In [None]:
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)

In [None]:
# Fit the pipeline to fit_data
pipeline.fit(fit_data)
pca.n_components_

In [None]:
# Column names of the data, used to label the plots
columns = ['Loan Amount', 'zip_code', 'loan_purpose_code', 'Qualification FICO', 'unit_type_code',
       'loan_type_code', 'Fix_True', 'CLTV', 'RATE', 'Home']
# Rotate the tick labels of the most recent plot so long names stay readable
for x in ax.get_xticklabels():
    x.set_rotation(45)
    print(x)

In [None]:
fit_data.columns

Next we want to know how many principal components to keep for our new feature subspace.

A useful measure is the so-called "explained variance ratio".

The explained variance ratio tells us how much information (variance) can be attributed to each of the principal components. We plot a bar chart with the principal components on the x-axis and their variance on the y-axis.

In [None]:
features = range(pca.n_components_)
feature_names = features
plt.figure(figsize=(20,5))
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(feature_names)
plt.show()

In [None]:
pca.fit_transform(X)
print(pca.mean_)

In [None]:
print(pca.components_)

In [None]:
print(pca.explained_variance_)

In [None]:
print(pca.explained_variance_ratio_)

In [None]:
print(pca.singular_values_)

In [None]:
print(pca.n_components_)

In [None]:
print(pca.noise_variance_)

Features with high variance ratio:
- Loan Amount: 4.45
- unit_type_code: 7.1
- Fix_True: 4.2

Finding Correlation between Features and Target Variable in mortgage Dataset using Heatmap


In [None]:
correlation = model_data.corr()
plt.figure(figsize=(15,15))
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='viridis')
plt.title('Correlation between Features and Target Variable in mortgage Dataset')
plt.show()

In [None]:
# Fit PCA on fit_data and plot the cumulative explained variance ratio
pca = PCA().fit(fit_data)
plt.figure(figsize=(20,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

We can reduce the data to six components and still retain about 98% of the variance.
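
Reading the number of components off the curve can also be done programmatically. The sketch below is one possible way, assuming a hypothetical helper `components_for_threshold` and made-up ratios; in the notebook the input would be `pca.explained_variance_ratio_`:

```python
import numpy as np

def components_for_threshold(explained_variance_ratio, threshold=0.98):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding when the threshold is only reached by the full set
    return min(k, len(cumulative))

# Made-up ratios for illustration only
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.04, 0.005, 0.005])
k = components_for_threshold(ratios)
print("components needed:", k)
print("variance retained: %.3f" % np.cumsum(ratios)[k - 1])
```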

# PCA as dimensionality reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.

Here is an example of using PCA as a dimensionality reduction transform:

In [None]:
pca = PCA(n_components=6)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)

In [None]:
print("transformed shape:", X_pca.shape)

The transformed data has been reduced to six dimensions. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:



In [None]:
X_new = pca.inverse_transform(X_pca)
plt.figure(figsize=(20,5))
plt.scatter(X[:, 0], X[:, 7], alpha=0.7, c='red')           # original data
plt.scatter(X_new[:, 0], X_new[:, 7], alpha=0.5, c='blue')  # same two columns after the round trip
plt.axis('equal')
plt.show()

The red points are the original data, while the blue points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance. The fraction of variance that is cut out is roughly a measure of how much "information" is discarded in this reduction of dimensionality.

This reduced-dimension dataset is in some senses "good enough" to encode the most important relationships between the points: despite reducing the data to six components, the overall relationships between the data points are mostly preserved.
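
The link between discarded variance and reconstruction error can be checked numerically. This is a minimal sketch on synthetic data, with PCA written out by hand using NumPy's SVD (the projection and reconstruction lines play the role of `pca.transform` and `pca.inverse_transform`; the data and the choice `k = 2` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 4 features driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
X_reduced = Xc @ Vt[:k].T                 # project onto the first k components
X_restored = X_reduced @ Vt[:k] + mean    # map back to the original feature space

# Share of variance in the dropped components
discarded = (s[k:] ** 2).sum() / (s ** 2).sum()
# Squared reconstruction error relative to the total variation
rel_error = ((X - X_restored) ** 2).sum() / (Xc ** 2).sum()
print("variance discarded:   %.4f" % discarded)
print("reconstruction error: %.4f" % rel_error)
```

The two printed numbers agree: the relative squared reconstruction error is exactly the share of variance carried by the dropped components.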

# MODELING

In the Modeling stage we discuss the activities related to the model building part of our project. A selection of five modeling techniques is made that are applicable to our capstone project. From each of these five modeling techniques, a model is built with the feature set provided earlier, and the models are validated using repeated cross-validation.

## SELECTION OF MODELING TECHNIQUES

For our modeling we use a combination of predictive techniques. Multiple techniques are selected and applied to the data. For the non-linear regression techniques, we use Support Vector Regression (SVR) and Neural Networks (NN). SVR has been shown to perform very well in regression and time series applications. Neural Networks are a widely used method for time series data that generally gives mixed results.

Another technique we use is Classification and Regression Trees (CART), a simple technique that is easy to visualize. Two ensemble techniques are also included in order to improve on the performance of the Classification and Regression Trees: Gradient Boosting Machines (GBM) and Random Forests (RF). These techniques build a multitude of regression trees and combine them in order to maximize the performance.

## MODEL BUILDING

Using these techniques and the others applied in this notebook (ARIMA, Linear Regression, Logistic Regression, SVM, SVR, Decision Tree, RF, and KNN), we create one model per technique. We use the features selected earlier as input for our models. A total of 12 features are included; the remaining features were excluded after performing feature selection. For each model, hyperparameters were tuned using grid search. Hyperparameters are the model-specific settings that are not learned from the data. They generally have to be tuned in order to optimize the model's performance and to balance the variance and bias of the model. By training the model with different hyperparameter values (for example, the size and the decay) and evaluating its performance, we can select the values that result in the best performing model in terms of predictive power.
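
The idea behind grid search with cross-validation can be sketched in a few lines. The example below is hand-rolled with NumPy only, not scikit-learn's `GridSearchCV`, and uses a closed-form ridge regression on synthetic data; the helpers `ridge_fit` and `cv_rmse`, the grid and the data are all assumptions for illustration:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_rmse(X, y, alpha, n_folds=5):
    """Mean validation RMSE of ridge regression over n_folds folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        coef = ridge_fit(X[train_mask], y[train_mask], alpha)
        residuals = y[val_idx] - X[val_idx] @ coef
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Try every candidate value and keep the one with the lowest validation error
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
results = {alpha: cv_rmse(X, y, alpha) for alpha in grid}
best_alpha = min(results, key=results.get)
print("validation RMSE per alpha:", results)
print("best alpha:", best_alpha)
```

Each hyperparameter value is scored only on data the model did not see during fitting, which is what protects the selection against overfitting.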

# ARMA Model

Estimating an AR Model

We will estimate the AR(1) parameter, ϕ, of one of the series (Rate, Revenue, Loan_num) generated earlier. When the parameters of a series are known, this is a good way to understand the estimation routines before applying them to real data. For monthly_rate_data, with an assumed true ϕ of 0.9, we print out the estimate of ϕ. In addition, we print out the entire output that is produced when fitting a time series model, so we can get an idea of what other tests and summary statistics are available in statsmodels.
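
To illustrate estimating a known ϕ, here is a minimal sketch that simulates an AR(1) series with ϕ = 0.9 and recovers it. It uses ordinary least squares on the lagged series with NumPy, which is a simpler stand-in for the maximum likelihood fit that statsmodels performs; the helpers `simulate_ar1` and `estimate_ar1` are hypothetical names:

```python
import numpy as np

def simulate_ar1(phi, n, rng, sigma=1.0):
    """Simulate y[t] = phi * y[t-1] + e[t] with Gaussian noise."""
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + rng.normal(scale=sigma)
    return y

def estimate_ar1(y):
    """Least-squares estimate of (constant, phi) in y[t] = c + phi * y[t-1] + e[t]."""
    lagged = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (const, phi), *_ = np.linalg.lstsq(lagged, y[1:], rcond=None)
    return const, phi

rng = np.random.default_rng(0)
series = simulate_ar1(phi=0.9, n=5000, rng=rng)
const_hat, phi_hat = estimate_ar1(series)
print("true phi = 0.9, estimated phi = %.3f" % phi_hat)
```

With a few thousand observations the estimate lands close to 0.9. With only a few dozen monthly observations, as in the mortgage data, the estimate is much less precise.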

In [None]:
import warnings
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARMA',
                        FutureWarning)

In [None]:
US10Y.index = pd.to_datetime(US10Y.index)
monthly_rate_data = US10Y['RATE'].resample(rule='M').last()
# Import the ARMA module from statsmodels
from statsmodels.tsa.arima_model import ARMA
# Fit an AR(1) model to the monthly rate series
mod_rate = ARMA(np.asarray(monthly_rate_data), order=(1,0))
res_rate = mod_rate.fit()
# Print out summary information on the fit
print(res_rate.summary())

In [None]:
# Print out the estimate for the constant and for phi
print("When the true phi=0.9, the estimate of phi (and the constant) are:")
print(res_rate.params)

Forecasting with an AR Model

In addition to estimating the parameters of a model, we can also do forecasting using statsmodels. The in-sample forecast is a forecast of the next data point using the data up to that point, and the out-of-sample forecast predicts any number of data points in the future. These forecasts can be made using either the predict() method, if we want the forecasts in the form of a series of data, or the plot_predict() method, if we want a plot of the forecasted data. We supply the starting point for forecasting and the ending point, which can be any number of data points after the data set ends. For the Monthly Interest Rate series with ϕ=0.9, we plot in-sample and out-of-sample forecasts.

Being able to forecast interest rates is of enormous importance, not only for bond investors but also for individuals like new homeowners who must decide between fixed and floating rate mortgages.

There is some mean reversion in interest rates over long horizons. In other words, when interest rates are high, they tend to drop and when they are low, they tend to rise over time. Currently they are below long-term rates, so they are expected to rise, but an AR model attempts to quantify how much they are expected to rise.
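
Mean reversion is built into the AR(1) forecast formula: the h-step-ahead forecast is mean + ϕ^h × (last value − mean), so the gap to the long-run mean shrinks by a factor ϕ each step. A minimal sketch, with a hypothetical helper `ar1_forecast` and made-up numbers (a current rate of 2.0%, a long-run mean of 4.0% and ϕ = 0.9):

```python
def ar1_forecast(last_value, mean, phi, steps):
    """h-step-ahead AR(1) forecasts: mean + phi**h * (last_value - mean)."""
    return [mean + phi ** h * (last_value - mean) for h in range(1, steps + 1)]

# Hypothetical inputs, not fitted to the rate data
path = ar1_forecast(last_value=2.0, mean=4.0, phi=0.9, steps=5)
for h, value in enumerate(path, start=1):
    print("step %d: %.3f" % (h, value))
```

The forecasts rise step by step towards the long-run mean without overshooting it. The closer ϕ is to 1, the slower the reversion.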

In [None]:
from statsmodels.tsa.arima_model import ARMA
# Forecast interest rates using an AR(1) model
mod_monthly_rate_data = ARMA(monthly_rate_data, order=(1,0))
res = mod_monthly_rate_data.fit()
# Plot the original series and the forecasted series
fig, ax = plt.subplots(figsize=(20,5))
ax=res.plot_predict(start=0, end='2021',ax=ax)
plt.legend(fontsize=8)
plt.ylabel('Rate')
plt.xlabel('Year')
plt.show()

Since we have used only four years of monthly interest rate data, we can see the short-term downward momentum in the interest rate.

A daily move up or down in interest rates is unlikely to tell us anything about interest rates tomorrow, but a move in interest rates over a year can tell us something about where interest rates are going over the next year. The DataFrame daily_data contains daily data of 10-year interest rates from 1962 to 2017

In [None]:
# '.' marks missing observations in the file, so read it as NaN
daily_data = pd.read_csv('../input/mortgage-bank-loan/DGS10.csv', index_col='DATE', na_values='.')
#Data Cleaning
daily_data = daily_data.dropna()
daily_data['DGS10']= daily_data['DGS10'].astype('float64')
# Day-over-day change of the 10-year rate
daily_data['change_rates'] = daily_data['DGS10'].diff()
daily_data = daily_data.dropna()
#Convert index to datetime
daily_data.index = pd.to_datetime(daily_data.index)
annual_data = daily_data['DGS10'].resample(rule='A').last()
annual_data[:5]

In [None]:
# Keep the annual rates in a DataFrame so the changes can be stored alongside them
annual_data = annual_data.dropna().to_frame()
# Repeat above for annual data
annual_data['diff_rates'] = annual_data['DGS10'].diff()
annual_data['diff_rates'][:5]

In [None]:
# Compute and print the autocorrelation of daily changes
autocorrelation_daily = daily_data['change_rates'].autocorr()
print("The autocorrelation of daily interest rate changes is %4.2f" %(autocorrelation_daily))

In [None]:
# Compute and print the autocorrelation of annual changes
autocorrelation_annual = annual_data['diff_rates'].autocorr()
print("The autocorrelation of annual interest rate changes is %4.2f" %(autocorrelation_annual))

Notice how the daily autocorrelation is small (0.07) but the annual autocorrelation is large and negative (-0.22)
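
For reference, the lag-1 autocorrelation is the Pearson correlation between a series and the same series shifted by one step, which is also how pandas documents `Series.autocorr()`. A minimal NumPy sketch on two made-up series (the helper `lag1_autocorr` is a hypothetical name):

```python
import numpy as np

def lag1_autocorr(values):
    """Pearson correlation between a series and itself shifted by one step."""
    values = np.asarray(values, dtype=float)
    return np.corrcoef(values[:-1], values[1:])[0, 1]

# A series that flips sign every step is perfectly negatively autocorrelated
alternating = np.array([1.0, -1.0] * 5)
# A slowly varying series is strongly positively autocorrelated
smooth = np.sin(np.linspace(0, 4 * np.pi, 200))

r_alt = lag1_autocorr(alternating)
r_smooth = lag1_autocorr(smooth)
print("alternating series: %.2f" % r_alt)
print("smooth series:      %.2f" % r_smooth)
```

A negative value, as found for the annual changes, means that an increase tends to be followed by a decrease, which is the signature of mean reversion.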

In [None]:
annual_rate=daily_data.resample('A').mean()
annual_rate_data=annual_rate['DGS10']
# Import the ARMA module from statsmodels
from statsmodels.tsa.arima_model import ARMA
# Forecast interest rates using an AR(1) model
mod_annual_rate = ARMA(annual_rate_data, order=(1,0))
res_annual_rate = mod_annual_rate.fit()
# Plot the original series and the forecasted series
fig, ax = plt.subplots(figsize=(20,5))
res_annual_rate.plot_predict(start=0, end='2024',ax=ax)
plt.ylabel('Rate')
plt.xlabel('Year')
plt.legend(fontsize=8)
plt.show()

Over long horizons, when interest rates go up, the economy tends to slow down, which consequently causes interest rates to fall, and vice versa.

According to the AR(1) model, 10-year interest rates are forecast to rise from 2.16% at the end of 2017 to 3.35% in five years.

In [None]:
# Average the daily rates within each month (summing rates has no meaningful interpretation)
monthly_rate=daily_data.resample('M').mean()
monthly_rate_data=monthly_rate['DGS10']
# Import the ARMA module from statsmodels
from statsmodels.tsa.arima_model import ARMA
# Forecast interest rates using an AR(1) model
mod_monthly_rate = ARMA(monthly_rate_data, order=(1,0))
res_monthly_rate = mod_monthly_rate.fit()
# Plot the original series and the forecasted series of the monthly model
fig, ax = plt.subplots(figsize=(20,5))
res_monthly_rate.plot_predict(start=0, end='2024',ax=ax)
plt.legend(fontsize=8)
plt.ylabel('Rate')
plt.xlabel('Year')
plt.show()

# Forecast expected monthly closed loans

Our Mortgage DataSet contains data from Oct 2014 to December 2018. Let’s forecast Monthly Closed Loans and Monthly Revenue. We will forecast monthly_loan_num_data using an AR(1) model

In [None]:
## Import the ARMA module from statsmodels
from statsmodels.tsa.arima_model import ARMA
# Take the last observation of each month and keep the loan count column
monthly_loan_num_data = M_loan_city_num.resample('M').last()
monthly_loan_num_data = monthly_loan_num_data['LoanInMonth']
# Forecast monthly_loan_num_data using an AR(1) model
mod_monthly_loan_num_data = ARMA(monthly_loan_num_data, order=(1,0))
res_monthly_loan_num_data = mod_monthly_loan_num_data.fit()

# Plot the original series and the forecasted series
fig, ax = plt.subplots(figsize=(20,5))
ax=res_monthly_loan_num_data.plot_predict(start=0, end='2022',ax=ax)
plt.legend(fontsize=12)
plt.title('Monthly Loan Application Forecast')
plt.ylabel('Loan in Month')
plt.xlabel('Year')
plt.show()

Above we have plotted the original series and the forecasted series. The forecast expects around 50 loans per month, with a 95% confidence interval from 28 to 70.
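
The width of such a band follows from the AR(1) forecast error variance, σ² × (1 + ϕ² + ... + ϕ^(2(h−1))) for a forecast h steps ahead. The sketch below assumes Gaussian errors and ignores parameter uncertainty; the helper `ar1_interval` and all numbers are hypothetical and not fitted to the loan data:

```python
import math

def ar1_interval(last_value, mean, phi, sigma, h, z=1.96):
    """Point forecast and approximate 95% interval for an AR(1), h steps ahead."""
    point = mean + phi ** h * (last_value - mean)
    # Forecast error variance accumulates over the h steps
    variance = sigma ** 2 * sum(phi ** (2 * j) for j in range(h))
    half_width = z * math.sqrt(variance)
    return point - half_width, point, point + half_width

# Hypothetical parameters: long-run mean 50, last value 40, phi 0.5, residual sd 10
low_1, point_1, high_1 = ar1_interval(40.0, 50.0, 0.5, 10.0, h=1)
low_12, point_12, high_12 = ar1_interval(40.0, 50.0, 0.5, 10.0, h=12)
print("1 step ahead:   %.1f (%.1f to %.1f)" % (point_1, low_1, high_1))
print("12 steps ahead: %.1f (%.1f to %.1f)" % (point_12, low_12, high_12))
```

The band widens with the horizon and then levels off, which matches the shape of the shaded region in the forecast plots.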

In [None]:
## Import the ARMA module from statsmodels
from statsmodels.tsa.arima_model import ARMA
# Take the last observation of each quarter and keep the loan count column
Q_loan_num_data = M_loan_city_num.resample('Q').last()
Q_loan_num_data = Q_loan_num_data['LoanInMonth']
# Forecast Q_loan_num_data using an AR(1) model
mod_Q_loan_num_data = ARMA(Q_loan_num_data, order=(1,0))
res_mod_Q_loan_num_data = mod_Q_loan_num_data.fit()
# Plot the original series and the forecasted series
fig, ax = plt.subplots(figsize=(20,5))
res_mod_Q_loan_num_data.plot_predict(start=0, end='2020',ax=ax)
plt.legend(fontsize=14)
plt.title('Quarterly Loan Application Forecast')
plt.ylabel('Loan in Month')
plt.xlabel('Year')
plt.show()

Above we have plotted the original series and the forecasted series. The forecast expects around 47 loans per month, with a 95% confidence interval from 20 to 75.

# Forecast expected monthly revenue

Similarly, we may forecast expected monthly revenue and plot the original series and the forecasted series.

In [None]:
mod_monthly_loan_rev_data = ARMA(monthly_loan_rev_data, order=(1,0))

res_monthly_loan_rev_data = mod_monthly_loan_rev_data.fit()
print("The AIC for an AR(1) is: ", res_monthly_loan_rev_data.aic)

In [None]:
# Plot the original series and the forecasted series
fig, ax = plt.subplots(figsize=(20,5))
res_monthly_loan_rev_data.plot_predict(start=0, end='2022',ax=ax)
plt.legend(fontsize=16)
plt.title('Monthly Sales Revenue Forecast')
plt.show()

Expected revenue per month is around 22M, with a 95% confidence interval from 13M to 32M.

# Forecasting Quarterly Sales Revenue

In [None]:
loan_patterns['date']= pd.DatetimeIndex(data['Created Date'])
loan_patterns = loan_patterns.set_index('date')
loan_patterns.index
quarterly_revenue_data = loan_patterns['Loan Amount'].resample(rule='Q').sum()
mod_quarterly_loan_rev_data = ARMA(quarterly_revenue_data, order=(1,0))
res_quarterly_loan_rev_data = mod_quarterly_loan_rev_data.fit()
# Plot the original series and the forecasted series
fig, ax = plt.subplots(figsize=(20,5))
res_quarterly_loan_rev_data.plot_predict(start=0, end='2022',ax=ax)
plt.legend(fontsize=10)
plt.title('Quarterly Sales Revenue Forecast')
plt.show()

Expected revenue per quarter is around 67M, with a 95% confidence interval from 52M to 80M.

# Forecasting Annual Sales Revenue

In [None]:
annual_revenue_data = loan_patterns['Loan Amount'].resample(rule='A').sum()
mod_annual_loan_rev_data = ARMA(annual_revenue_data, order=(1,0))
res_annual_loan_rev_data = mod_annual_loan_rev_data.fit()
# Plot the original series and the forecasted series
fig, ax = plt.subplots(figsize=(20,5))
res_annual_loan_rev_data.plot_predict(start=0, end='2022',ax=ax)
plt.legend(fontsize=14)
plt.title('Annual Sales Revenue Forecast')
plt.show()

In the above graph, we have plotted the original series and the forecasted series. We expect a small drop in annual mortgage revenue. Our current annual revenue is $270M. By the end of 2021, expected revenue per year is around 220M, with a 95% confidence interval from 70M to 360M. With more annual data, we should be able to make better annual revenue predictions.

# Linear regression

Purpose of linear regression: Given a dataset containing predictor variables X and an outcome/response variable Y, linear regression can be used to:

- Build a predictive model to predict future values, using new data X where Y is unknown.
- Model the strength of the relationship between each independent variable X_i and Y. Many times, only a subset of the independent variables X_i will have a linear relationship with Y, so we need to figure out which X_i contributes the most information to predict Y.

It is, in many cases, the first-pass prediction algorithm for continuous outcomes.

Linear Regression is a method to model the relationship between a set of independent variables X (also known as explanatory variables, features, predictors) and a dependent variable Y. This method assumes that each predictor X is linearly related to the dependent variable Y.

Independence means that the residuals are not correlated -- the residual from one prediction has no effect on the residual from another prediction. Correlated errors are common in time series analysis and spatial analyses.
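
As a small illustration of what the fit does, the sketch below generates synthetic data from a known linear relationship and recovers the coefficients by least squares with NumPy. The predictors `rate` and `home`, their ranges and the true coefficients are made up for the example and do not come from the mortgage data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
rate = rng.uniform(1.5, 3.5, size=n)   # hypothetical predictor
home = rng.uniform(4.5, 8.0, size=n)   # hypothetical predictor
# Known relationship plus independent Gaussian noise
y = 100.0 + 20.0 * rate - 5.0 * home + rng.normal(scale=2.0, size=n)

# Design matrix with a column of ones for the intercept
design = np.column_stack([np.ones(n), rate, home])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

print("intercept, rate, home coefficients:", np.round(coef, 2))
print("mean residual: %.2e" % residuals.mean())
```

The estimated coefficients land close to the values used to generate the data, and with an intercept in the model the residuals average to zero.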

In [None]:
from sklearn.model_selection import train_test_split # for train and test set split
from sklearn.model_selection import cross_val_score # sklearn.model_selection replaces the deprecated sklearn.cross_validation

In [None]:
X = np.array(model_data.drop(columns=['Loan Amount']))
y = np.array(model_data['Loan Amount'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print("size of the training feature set is",X_train.shape)
print("size of the test feature set is",X_test.shape)
print("size of the training Target set is",y_train.shape)
print("size of the test Target set is",y_test.shape)

In [None]:
#Linear regression
from sklearn.linear_model import LinearRegression #import from sklearn
linear_reg= LinearRegression() # instantiated linreg
linear_reg.fit(X_train,y_train) #fit the model

In [None]:
#predict using X_test
predicted_train= linear_reg.predict(X_train)
predicted_test= linear_reg.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error # import mse from sklearn
#calculate root mean squarred error
rmse_train=np.sqrt(mean_squared_error(y_train, predicted_train))
rmse_test=np.sqrt(mean_squared_error(y_test, predicted_test))

In [None]:
print('The train root mean squarred error is :', rmse_train)
print('The test root mean squarred error is  :', rmse_test)

In [None]:
print('The Linear Regression coefficient parameters are :', linear_reg.coef_ )

In [None]:
print('The Linear Regression intercept value is :', linear_reg.intercept_)

For a well-trained model, the RMSE on the test data is close to the RMSE on the training data. A test RMSE that is much higher than the training RMSE indicates an overfitted model.

In [None]:
from sklearn import metrics # import metrics from sklearn
Rsquared=linear_reg.score(X_train,y_train) # to determine r square Goodness of fit
# how good the model fits the training data can be determined by R squared metric which is here 0.12
Rsquared
print('The R squared metric is :', Rsquared)

R^2 = 0.12

The R^2 in scikit-learn is the coefficient of determination: R^2 = 1 - RSS/TSS, that is, 1 minus the residual sum of squares divided by the total sum of squares.

Here RSS is the sum of squared differences between the actual values (yi) and the predicted values (yi^), and TSS is the sum of squared differences between the actual values (yi) and their mean. TSS is therefore the error of the baseline model that always predicts the mean. A perfect model has RSS = 0 and R^2 = 1. A model that is no better than the mean has RSS = TSS and R^2 = 0. Any useful model falls in between, with RSS/TSS < 1.

Since R^2 = 1 - RSS/TSS, R^2 becomes negative only when RSS/TSS > 1, that is, when our model is even worse than the mean model: its predictions are, on average, further from the actual observations than the mean is.
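
The three cases can be checked on a tiny worked example. This is a minimal sketch with made-up numbers and a hypothetical helper `r_squared` that follows the formula directly:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # error of the mean-only model
    return 1.0 - rss / tss

y_true = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r_squared(y_true, y_true)                # RSS = 0
r2_mean = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])     # RSS = TSS = 5
r2_bad = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # RSS = 20, TSS = 5

print(r2_perfect, r2_mean, r2_bad)
```

The reversed predictions give RSS/TSS = 4, so R^2 = -3: a negative value simply says the model does worse than predicting the mean.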

In [None]:
# K fold cross validation
# cross validation score
cv_score= cross_val_score(LinearRegression(),X,y,scoring='neg_mean_squared_error', cv=10) # k =10
print('cv_score is :\n', cv_score)

In [None]:
# mean of the negated mean squared errors across the folds
print('mean cv_score is :', cv_score.mean())

In [None]:
# Root mean squared error
rmse_cv= np.sqrt(cv_score.mean() * -1)
print('The cross validation root mean squarred error is :', rmse_cv)

With the linear regressor we obtain the following scores:
- train RMSE: 2.349923482270037
- test RMSE: 3.9655849638809144
- R squared: 0.12
- cross-validation RMSE: 2.75

# Fitting Linear Regression using statsmodels

Statsmodels is a great Python library for a lot of basic and inferential statistics. It also provides basic regression functions using an R-like syntax, so it's commonly used by statisticians. The version of least-squares we will use in statsmodels is called ordinary least-squares (OLS). There are many other versions of least-squares such as partial least squares (PLS) and weighted least squares (WLS).

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m_rate = ols('y ~ RATE',model_data).fit()
print(m_rate.summary())

In [None]:
m_home = ols('y ~ Home',model_data).fit()
print(m_home.summary())

In [None]:
m_cltv = ols('y ~ CLTV',model_data).fit()
print(m_cltv.summary())

In [None]:
m_zip = ols('y ~ zip_code',model_data).fit()
print(m_zip.summary())

In [None]:
m_Fix_True = ols('y ~ Fix_True',model_data).fit()
print(m_Fix_True.summary())

In [None]:
m_loaninmonths = ols('y ~ LoanInMonth',model_data).fit()
print(m_loaninmonths.summary())

In [None]:
m_rcpi = ols('y ~ LoanInMonth + RATE + Home + CLTV ',model_data).fit()
print(m_rcpi.summary())

In [None]:
plt.figure(figsize=(20,5))
fdval = m_rcpi.fittedvalues
plt.scatter(fdval, y)
plt.ylabel('Actual loan amount')
plt.xlabel('Fitted loan amount')
plt.show()

In [None]:
plt.figure(figsize=(20,5))
sns.regplot(x=fdval, y="Loan Amount", data=model_data, fit_reg = True, color='g')
plt.xlabel('Fitted loan amount')
plt.show()

- RATE R^2 = 0.12 
- Home R^2 = 0.12 
- CLTV R^2 = 0.00 

In [None]:
model_data.head()

# Fitting Linear Regression using sklearn

Look inside the lm object using dir(lm):

- lm.predict()
- lm.fit()
- lm.score()
- lm.coef_
- lm.intercept_

Fit a linear model: the lm.fit() function estimates the coefficients of the linear regression using least squares.

In [None]:
from sklearn.linear_model import LinearRegression
X = np.array(fit_data.drop(columns=['Fix_True']))
y = np.array(fit_data['Fix_True'])

In [None]:
# This creates a LinearRegression object
lm = LinearRegression()

In [None]:
# Use all remaining columns as predictors to fit the linear regression model
lm.fit(X, y)
lm.coef_

In [None]:
lm.intercept_

In [None]:
# The mean squared error
print("Mean squared error (Fix_True Rate): %.2f" % np.mean((lm.predict(X) - y) ** 2))

In [None]:
X = np.array(fit_data.drop(columns=['loan_type_code']))
y = np.array(fit_data['loan_type_code'])
lm.fit(X, y)
lm.coef_

In [None]:
lm.intercept_

In [None]:
# The mean squared error
print("Mean squared error (loan_type_code): %.2f" % np.mean((lm.predict(X) - y) ** 2))

In [None]:
X = np.array(fit_data.drop(columns=['Loan Amount']))
y = np.array(fit_data['Loan Amount'])
lm.fit(X, y)
lm.coef_

In [None]:
lm.intercept_

In [None]:
# The mean squared error
print("Mean squared error (Loan Amount): %.2f" % np.mean((lm.predict(X) - y) ** 2))

In [None]:
X = np.array(fit_data.drop(columns=['unit_type_code']))
y = np.array(fit_data['unit_type_code'])
lm.fit(X, y)
lm.coef_

In [None]:
lm.intercept_

In [None]:
# The mean squared error
print("Mean squared error (unit_type_code): %.2f" % np.mean((lm.predict(X) - y) ** 2))

# Let's try Regression: with Scale

Here are the different types of regression we have used with StandardScaler and GridSearchCV:
- LinearRegression
- Lasso
- Ridge
- ElasticNet
- SGDRegressor

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame,Series
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [None]:
def pretty_print_linear(coefs, names = None, sort = False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key = lambda x:-np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)

In [None]:
def scale_data(X):
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return X

In [None]:
def split_data(X,Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
    return X_train, X_test, Y_train, Y_test

In [None]:
def root_mean_square_error(y_pred,y_test):
    # Square root of the mean squared difference between predictions and targets
    rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
    return rmse

In [None]:
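def plot_real_vs_predicted(y_test, y_pred):
    # Callers pass the real values first and the predictions second
    plt.figure(figsize=(20,5))
    plt.plot(y_pred,y_test,'ro')
    # Diagonal reference line over the range of the data (perfect predictions lie on it)
    low = min(y_test.min(), y_pred.min())
    high = max(y_test.max(), y_pred.max())
    plt.plot([low,high],[low,high], 'g-')
    plt.xlabel('predicted')
    plt.ylabel('real')
    plt.title('real_vs_predicted')
    plt.show()
    return plt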

In [None]:
np.set_printoptions(precision=2, linewidth=100, suppress=True, edgeitems=2)
X = np.array(fit_data.drop(columns=['Fix_True']))
y = np.array(fit_data['Fix_True'])
X = scale_data(X)  # Scale the input data
X_train, X_test, y_train, y_test = split_data(X,y)

# Let's try Regular Linear Regression:

In [None]:
fit_data.head()

In [None]:
# Create linear regression object
linreg = LinearRegression()
# Train the model using the training sets
linreg.fit(X_train,y_train)
print("Linear model: ", pretty_print_linear(linreg.coef_, sort = True))

In [None]:
# Predict the values using the model
y_lin_predict = linreg.predict(X_test)
# Print the root mean square error
print("Linear Regression - Root Mean Square Error: ", root_mean_square_error(y_lin_predict,y_test))

In [None]:
plot_real_vs_predicted(y_test,y_lin_predict)
plt.show()

# Let's try Lasso Regression:

In [None]:
# Create lasso regression object
lasso = Lasso(alpha=.3)
# Train the model using the training sets
lasso.fit(X_train, y_train)
print("Lasso model: ", pretty_print_linear(lasso.coef_, sort = True))

In [None]:
# Predict the values using the model
y_lasso_predict = lasso.predict(X_test)
# Print the root mean square error
print("Lasso model - Root Mean Square Error: ", root_mean_square_error(y_lasso_predict,y_test))

In [None]:
plot_real_vs_predicted(y_test,y_lasso_predict)
plt.show()

# Let's try Ridge Regression:

In [None]:
ridge = Ridge(fit_intercept=True, alpha=.3)
# Train the model using the training sets
ridge.fit(X_train, y_train)
print("Ridge model: ", pretty_print_linear(ridge.coef_, sort = True))

In [None]:
# Predict the values using the model
y_ridge_predict = ridge.predict(X_test)
# Print the root mean square error
print("Ridge Regression - Root Mean Square Error: ", root_mean_square_error(y_ridge_predict,y_test))

In [None]:
plot_real_vs_predicted(y_test,y_ridge_predict)
plt.show()

# Now let's try to do regression via Elastic Net.

In [None]:
elnet = ElasticNet(fit_intercept=True, alpha=.3)
# Train the model using the training sets
elnet.fit(X_train, y_train)
print("Elastic Net model: ", pretty_print_linear(elnet.coef_, sort = True))

In [None]:
# Predict the values using the model
y_elnet_predict = elnet.predict(X_test)
# Print the root mean square error
print("Elastic Net - Root Mean Square Error: ", root_mean_square_error(y_elnet_predict,y_test))

In [None]:
plot_real_vs_predicted(y_test,y_elnet_predict)
plt.show()

# Now let's try to do regression via Stochastic Gradient Descent.

In [None]:
sgdreg = SGDRegressor(penalty='l2', alpha=0.15, max_iter=200)
# Train the model using the training sets
sgdreg.fit(X_train, y_train)
print("Stochastic Gradient Descent model: ", pretty_print_linear(sgdreg.coef_, sort = True))

In [None]:
# Predict the values using the model
y_sgdreg_predict = sgdreg.predict(X_test)
# Print the root mean square error
print("Stochastic Gradient Descent - Root Mean Square Error: ", root_mean_square_error(y_sgdreg_predict,y_test))

In [None]:
plot_real_vs_predicted(y_test,y_sgdreg_predict)
plt.show()

# Regression Summary Report

- Linear Regression - Root Mean Square Error : 0.298
- Lasso model - Root Mean Square Error : 0.308
- Ridge Regression - Root Mean Square Error : 0.298
- Elastic Net - Root Mean Square Error : 0.308
- Stochastic Gradient Descent - Root Mean Square Error : 0.298

Linear Regression, Ridge Regression, and Stochastic Gradient Descent perform best, each with a Root Mean Square Error of 0.298; Lasso and Elastic Net trail slightly at 0.308.
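
The `root_mean_square_error` helper used above is defined earlier in the notebook. As a reminder of what such a metric measures, here is a minimal sketch; the name `rmse` and the toy values are assumptions for illustration, not the notebook's own helper or data.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error: square root of the mean squared residual."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy values, not taken from the mortgage data
example_rmse = rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
print(example_rmse)
```

Because the residuals are squared before averaging, a few large errors raise the RMSE more than many small ones.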

We now have a pandas DataFrame called model_data containing all the data we want to use to predict mortgage loan amounts. The 'Loan Amount' column holds the values we want to predict.

This column is the target data.

In [None]:
model_data.head()

In [None]:
model_data=model_data3.copy()
model_data.head()

In [None]:
model_data.columns

In [None]:
import seaborn as sns
%matplotlib inline
g1=sns.pairplot(model_data, x_vars=['RATE', 'Home'], y_vars='Loan Amount', height=8, aspect=0.9, kind='reg')
g1.axes[0,1].set_ylim(300000,600000)
g1.axes[0,0].set_xlim(1.5,3.5)
g1.axes[0,1].set_xlim(4.5,8)
plt.show()

# As the US 10-Year Interest Rate goes up, loan amount increases

As Home Supply increases, loan amount also increases slightly, but eventually the housing market will slow down.

Interest rate change has a bigger impact on Loan Amount than Home Supply does.

In [None]:
g=sns.pairplot(model_data, x_vars=['Qualification FICO', 'CLTV'], y_vars='Loan Amount', height=8, aspect=0.9, kind='reg')
g.axes[0,1].set_ylim(400000,500000)
g.axes[0,0].set_xlim(600, 820)
g.axes[0,1].set_xlim(30,100)
plt.show()

- The majority of FICO scores are between 600 and 820
- The majority of CLTV values are between 30% and 100%
- As FICO goes up, Loan Amount goes up
- As CLTV goes up, Loan Amount goes down

In [None]:
g2=sns.pairplot(model_data, x_vars=['unit_type_code',
       'loan_type_code', 'RATE'], y_vars='Loan Amount', height=8, aspect=0.9, kind='reg')
g2.axes[0,1].set_ylim(300000,600000)
g2.axes[0,0].set_xlim(1.0,12.1)
g2.axes[0,1].set_xlim(1,5)
g2.axes[0,2].set_xlim(1.5,3.5)
plt.show()

In [None]:
data_lc['Unit Type'].value_counts()

In [None]:
data_lc['unit_type_code'].value_counts()

In [None]:
data_lc['Loan Type'].value_counts()

In [None]:
data_lc['loan_type_code'].value_counts()

- As the number of units decreases, the average Loan Amount also decreases.
- loan_type_code: Conventional loans are high volume but have a slightly lower average loan amount.
- ARM (Adjustable Rate Mortgage) loans have a higher loan amount than fixed-rate (Fix_True) mortgages.
- Two Family (Code = 10) has a higher Loan Amount than Three Family (Code = 9).
- Conventional mortgages have a higher Loan Amount than FHA.

In [None]:
g3=sns.pairplot(model_data, x_vars=[ 'CLTV', 'Home', 'LoanInMonth'], y_vars='Loan Amount', height=8, aspect=1.2, kind='reg')
g3.axes[0,1].set_ylim(300000,600000)
g3.axes[0,0].set_xlim(30,100)
g3.axes[0,1].set_xlim(4.5,7.5)
g3.axes[0,2].set_xlim(0,80)
plt.show()

NYC NJ (zip_code between 7 & 11) seems to be closing more loans and generating more revenue for the Mortgage Bank. As the number of loans per month increases, loan amount decreases.

In [None]:
g4=sns.pairplot(model_data, x_vars=[ 'zip_code', 'unit_type_code'], y_vars='Loan Amount', height=8, aspect=1.2, kind='reg')
g4.axes[0,1].set_ylim(300000,600000)
g4.axes[0,0].set_xlim(6,500)
g4.axes[0,1].set_xlim(1,10)
plt.show()

NYC NJ (zip_code starting with 7 & 11) seems to be closing more loans and generating more revenue for the Mortgage Bank. City shows similar results. Population density plays a bigger role in Loan Amount. As the number of loans per month increases, loan amount decreases.

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import train_test_split
from sklearn import preprocessing, neighbors

In [None]:
X = np.array(fit_data.drop(['loan_type_code'], axis=1))
y = np.array(fit_data['loan_type_code'])
X_train, X_test, y_train, y_test_knn = train_test_split(X, y, test_size = 0.2)

In [None]:
# Create the DataFrame: numeric_data_only
numeric_data_only = model_data[0:10].fillna(-1)
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())
# Fit the classifier to the training data
clf.fit(X_train, y_train)
# Print the accuracy
print("Logistic Regression Accuracy: {}".format(clf.score(X_test, y_test)))

Logistic Regression Model Accuracy for Loan Types: 67.46%

**Now center and scale the data, refit Logistic Regression, and look at the model accuracy**

In [None]:
# Import necessary packages
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn import datasets
from sklearn import linear_model
import numpy as np

In [None]:
# Load data
X = np.array(fit_data.drop(['loan_type_code'], axis=1))
y = np.array(fit_data['loan_type_code'])
X.shape

In [None]:
y.shape

In [None]:
# Split the data into test and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train logistic regression model and print performance on the test set
lr = linear_model.LogisticRegression()

In [None]:
# fit the model
lr = lr.fit(X_train, y_train)
print('Logistic Regression score for training set: %f' % lr.score(X_train, y_train))

In [None]:
from sklearn.metrics import classification_report
y_true, y_pred = y_test, lr.predict(X_test)
print(classification_report(y_true, y_pred))

In [None]:
from sklearn.preprocessing import scale
Xs = scale(X)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2)
lr_2 = lr.fit(Xs_train, y_train)
print('Scaled Logistic Regression score for test set: %f' % lr_2.score(Xs_test, y_test))

In [None]:
y_true, y_pred = y_test, lr_2.predict(Xs_test)
print(classification_report(y_true, y_pred))

This is very interesting! The performance of logistic regression did not improve with data scaling. Why not, particularly when k-Nearest Neighbours performance improves substantially with scaling (as shown later in this notebook)? The reason is that, if there are predictor variables with large ranges that do not affect the target variable, a regression algorithm will make the corresponding coefficients small so that they do not affect predictions much. K-nearest neighbours has no such inbuilt strategy, so there we very much need to scale the data.

# Scaling Synthesized Data

Scaling numerical data (that is, multiplying all instances of a variable by a constant in order to change that variable’s range) has two related purposes: i) if your measurements and mine are in different units (different currencies, for example) and we both scale our data, they end up being the same; and ii) if two variables have vastly different ranges, the one with the larger range may dominate the predictive model, even though it may be less important to the target variable than the variable with the smaller range. The problem identified in ii) occurs with k-NN, which explicitly looks at how close data points are to one another, but not in logistic regression, which, when being trained, will shrink the relevant coefficient to account for the lack of scaling. We can see this in how the models perform before and after scaling.
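
As a rough illustration of what standardization does, the sketch below centers each column and divides by its standard deviation, using statistics computed from the training rows only. The function name `standardize` and the toy feature values are assumptions for illustration; scikit-learn's `scale` applies the same centering and unit-variance formula to whatever array it is given.

```python
import numpy as np

def standardize(train, test):
    """Center and scale columns using statistics from the training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Hypothetical features: a FICO-like score and a loan-amount-like value
train = np.array([[620.0, 150000.0], [700.0, 300000.0], [780.0, 450000.0]])
test = np.array([[700.0, 600000.0]])
train_s, test_s = standardize(train, test)
print(train_s.mean(axis=0), train_s.std(axis=0))
print(test_s)
```

Reusing the training mean and standard deviation on the test rows keeps information about the test set out of the preprocessing step.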

Let’s now split into testing & training sets & plot both sets:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.title('training set')
plt.scatter(X_train[:,0] , X_train[:,1],  c = y_train, alpha = 0.7);
plt.subplot(1, 2, 2);
plt.scatter(X_test[:,0] , X_test[:,1],  c = y_test, alpha = 0.7);
plt.title('test set')
plt.show()

Looking good! Now let’s instantiate a k-Nearest Neighbors voting classifier & train it on our training set

In [None]:
from sklearn import neighbors, linear_model
knn = neighbors.KNeighborsClassifier()
knn_model = knn.fit(X_train, y_train)
print('k-NN score for test set: %f' % knn_model.score(X_test, y_test))

In [None]:
print('k-NN score for training set: %f' % knn_model.score(X_train, y_train))

In [None]:
from sklearn.metrics import classification_report
y_true, y_pred = y_test, knn_model.predict(X_test)
print(classification_report(y_true, y_pred))

We can notice the improvement for KNN compared to Logistic Regression. Now for k-NN with scaling: we’ll scale the predictor variables and then use k-NN again:

In [None]:
from sklearn.preprocessing import scale
Xs = scale(X)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2)
plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.scatter(Xs_train[:,0] , Xs_train[:,1],  c = y_train, alpha = 0.7);
plt.title('scaled training set')
plt.subplot(1, 2, 2);
plt.scatter(Xs_test[:,0] , Xs_test[:,1],  c = y_test, alpha = 0.7);
plt.title('scaled test set')
plt.show()

In [None]:
knn_model_s = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_s.score(Xs_test, y_test))

k-NN score for test set: 0.71 

It doesn’t perform any better with scaling! This is most likely because both features were already around the same range. It really makes sense to scale when variables have widely varying ranges. To see this in action, we’re going to add another feature. Moreover, this feature will bear no relevance to the target variable: it will be mere noise.

# Adding Gaussian noise to the signal (KNN):

We add a third variable of Gaussian noise with mean 0 and variable standard deviation σ. We’ll call σ the strength of the noise and we’ll see that the stronger the noise, the worse the performance of k-Nearest Neighbours

In [None]:
X = np.array(fit_data.drop(['loan_type_code'], axis=1))
y = np.array(fit_data['loan_type_code'])
# Generate some clustered data (blobs!)
import numpy as np
from sklearn.datasets import make_blobs
n_samples=2000
X, y = make_blobs(n_samples, centers=4, n_features=2,
                  random_state=42)
# Add noise column to predictor variables
ns = 10**(3) # Strength of noise term
newcol = np.transpose([ns*np.random.randn(n_samples)])
Xn = np.concatenate((X, newcol), axis = 1)
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(15,15));
ax = fig.add_subplot(111, projection='3d' , alpha = 0.5);
ax.scatter(Xn[:,0], Xn[:,1], Xn[:,2], c = y);

In [None]:
Xn_train, Xn_test, y_train, y_test = train_test_split(Xn, y, test_size=0.2, random_state=42)
knn = neighbors.KNeighborsClassifier(n_neighbors=6)
knn_model = knn.fit(Xn_train, y_train)
print('k-NN score for test set: %f' % knn_model.score(Xn_test, y_test))

k-NN score for test set: 0.58

This is a horrible model! How about we scale and check out performance?

In [None]:
Xns = scale(Xn)
s = int(.2*n_samples)
Xns_train = Xns[s:]
y_train = y[s:]
Xns_test = Xns[:s]
y_test = y[:s]
knn_models = knn.fit(Xns_train, y_train)
knn_accuracy = knn_models.score(Xns_test, y_test)
print('knn_accuracy for test set: ' , knn_accuracy)

In [None]:
knn_prediction = knn.predict(Xns[1515:1535])
print('KNN : - Output of Real Data : Conv=5, FHA=4, Res=3, Comm=2: ', (y[1515:1535]))

In [None]:
print('KNN : - Output of prediction: Conv=5, FHA=4, Res=3, Comm=2: ',knn_prediction)

After scaling the synthesized data we see a huge improvement in model accuracy, from 36% to 100%. Let’s do the same for Logistic Regression and check the performance.

In [None]:
# Set sc = True if you want to scale your features
sc = False
#Import packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import neighbors, linear_model
from sklearn.preprocessing import scale
from sklearn.datasets import make_blobs
# Generate some data
n_samples=2000
X, y = make_blobs(n_samples, centers=4, n_features=2, random_state=0)
# Add noise column to predictor variables
newcol = np.transpose([ns*np.random.randn(n_samples)])
Xn = np.concatenate((X, newcol), axis = 1)
# Scale if desired
if sc:
    Xn = scale(Xn)
#Train model and test after splitting
Xn_train, Xn_test, y_train, y_test = train_test_split(Xn, y, test_size=0.2, random_state=42)
lr = linear_model.LogisticRegression()
lr_model = lr.fit(Xn_train, y_train)
print('logistic regression score for test set: %f' % lr_model.score(Xn_test, y_test))

The logistic regression score for the test set has improved from 70% to 87% (0.87).

We can see a big improvement. To conclude, we have seen the essential place occupied in the data science pipeline by preprocessing, in its scaling and centering incarnation, and we have done so to promote a holistic approach to the challenges of machine learning.
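
One way to keep scaling and modelling together is a scikit-learn pipeline, which fits the scaler on the training split only and then applies it to the test split. The sketch below repeats the noisy-blobs experiment under that setup; the variable names, the random seeds and the choice of `StandardScaler` (instead of the `scale` function used above) are assumptions for illustration, not the notebook's original method.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_blob, y_blob = make_blobs(n_samples=2000, centers=4, n_features=2, random_state=42)
# Irrelevant column with a very large range, as in the noise experiment above
noise = 1000 * rng.standard_normal((X_blob.shape[0], 1))
Xb = np.hstack([X_blob, noise])

Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    Xb, y_blob, test_size=0.2, random_state=42)

# Plain k-NN on the raw features
raw_knn = KNeighborsClassifier(n_neighbors=6).fit(Xb_train, yb_train)
# Same k-NN, but the scaler is fitted on the training split inside the pipeline
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=6))
scaled_knn.fit(Xb_train, yb_train)

raw_score = raw_knn.score(Xb_test, yb_test)
scaled_score = scaled_knn.score(Xb_test, yb_test)
print('unscaled k-NN:', raw_score, 'scaled k-NN:', scaled_score)
```

Fitting the scaler inside the pipeline means the test rows never influence the mean and standard deviation used for scaling.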

# Random Forests (RF)

Random forests is a supervised learning algorithm that can be used for both classification and regression. It is flexible and easy to use. A forest is comprised of trees, and it is often said that the more trees it has, the more robust a forest is. Random forests creates decision trees on randomly selected data samples, gets a prediction from each tree and selects the final answer by means of voting. It also provides a pretty good indicator of feature importance. Random forests has a variety of applications, such as recommendation engines, image classification and feature selection. It can be used to classify loyal loan applicants, identify fraudulent activity and predict diseases.

- Random forests is considered a highly accurate and robust method because of the number of decision trees participating in the process.

- It is less prone to overfitting than a single decision tree. The main reason is that it averages the predictions of many trees, which reduces variance.

- The algorithm can be used in both classification and regression problems.

- Some random forest implementations can also handle missing values.
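
The voting step described above can be sketched in a few lines. This is a simplified illustration with hypothetical per-tree predictions; note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this is not its exact procedure.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Return the most common label per sample (ties go to the smallest label)."""
    tree_predictions = np.asarray(tree_predictions)
    votes = []
    for sample_preds in tree_predictions.T:  # one column per sample
        labels, counts = np.unique(sample_preds, return_counts=True)
        votes.append(labels[np.argmax(counts)])
    return np.array(votes)

# Hypothetical output of three trees for four loans (1 = fixed rate, 0 = ARM)
tree_predictions = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```

Each tree can be wrong on some loans, but as long as the trees make different mistakes the combined vote tends to be more stable than any single tree.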

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, neighbors, svm 
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
X = np.array(fit_data.drop(['loan_type_code'], axis=1))
y = np.array(fit_data['loan_type_code'])
X_train, X_test, y_train, y_test_rfc = train_test_split(X, y, test_size = 0.2)
# 'sqrt' is the classifier behaviour that the removed 'auto' option selected
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=200, oob_score=True)
rfc.fit(X_train, y_train)
rfc_accuracy = rfc.score(X_test, y_test_rfc)
print('Unscaled: Random Forest Classifier Accuracy : ', rfc_accuracy)

In [None]:
X = np.array(fit_data.drop(['loan_type_code'], axis=1))
y = np.array(fit_data['loan_type_code'])
from sklearn.preprocessing import scale
X_scaled = scale(X)
# Split the scaled features so this run really uses scaled data
X_train, X_test, y_train, y_test_rfc = train_test_split(X_scaled, y, test_size = 0.2)
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=200, oob_score=True)
rfc.fit(X_train, y_train)
rfc_accuracy = rfc.score(X_test, y_test_rfc)
print('Scaled: Random Forest Classifier Accuracy   : ', rfc_accuracy)

In [None]:
# Look at the confusion matrix
from sklearn.metrics import confusion_matrix
y_pred_rfc = rfc.predict(X_test)
# The model was trained on scaled features, so predict on the scaled rows
y_pred_rfc_out = rfc.predict(X_scaled[975:995,])
print('RFC Confusion Matrix: ')
confusion_matrix(y_test_rfc, y_pred_rfc, labels=None, sample_weight=None)

In [None]:
print('RandomForestClassifier Accuracy : ', rfc_accuracy)

In [None]:
print('RandomForestClassifier :Input Real Data : Conv=5, FHA=4, Res=3, Comm=2:\n', X[975:995,] )

In [None]:
print('RandomForestClassifier :Output of X[975:995]:Conv=5, FHA=4, Res=3, Comm=2:' , y[975:995])

In [None]:
print('RandomForestClassifier :Output of prediction: Conv=5, FHA=4, Res=3, Comm=2:', y_pred_rfc_out)

In [None]:
print('RandomForestClassifier : - Output of prediction: Conv=5, FHA=4, Res=3, Comm=2:\n', y_pred_rfc)

Unscaled: Random Forest Classifier Accuracy : 76%

Scaled: Random Forest Classifier Accuracy : 79%

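The unscaled and scaled Random Forest model accuracies are close. This is expected, because tree-based models are largely insensitive to feature scaling; the remaining gap mostly reflects the different random train/test splits.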

# Let’s analyze Interest Rate types using Random Forest Classifier.

We achieve greater accuracy here than for Loan Types. The target is the rate type, where Fix_True = 1 for a fixed-rate mortgage and Fix_True = 0 for an ARM (Adjustable Rate Mortgage).

In [None]:
import numpy as np
from sklearn import preprocessing, neighbors, svm
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
X = np.array(model_data.drop(['Fix_True'], axis=1))
y = np.array(model_data['Fix_True'])
from sklearn.preprocessing import scale
# Scale the features: X_scaled
X_scaled = scale(X)
X_train, X_test, y_train, y_test_rfc = train_test_split(X, y, test_size = 0.2)
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=200, oob_score=True)
rfc.fit(X_train, y_train)
rfc_accuracy = rfc.score(X_test, y_test_rfc)
print('Random Forest Classifier Accuracy : ', rfc_accuracy)

In [None]:
# Look at the confusion matrix
from sklearn.metrics import confusion_matrix
#confusion_matrix? # Hit enter
y_pred_rfc = rfc.predict(X_test)
y_pred_rfc_out = rfc.predict(X[975:995,])
print('RFC Confusion Matrix: ')
print(confusion_matrix(y_test_rfc, y_pred_rfc, labels=None, sample_weight=None))

In [None]:
print('RandomForestClassifier Accuracy : ', rfc_accuracy)

In [None]:
print('RandomForestClassifier : - Input of Real Data : Fix_True =1 & ARM =0:\n ', X[975:995,])

In [None]:
print('RandomForestClassifier : - Output of X[975:995]: Fix_True =1 & ARM =0:' , y[975:995])

In [None]:
print('RandomForestClassifier : - Output of prediction: Fix_True =1 & ARM =0:', y_pred_rfc_out)

In [None]:
print('\n############ Prediction  ######################')
print('RandomForestClassifier : - Output of prediction: Fix_True =1 & ARM =0:\n', y_pred_rfc)

Random Forest Classifier Accuracy for Fix or ARM is : 0.87 or 87%

# Finding Important Features with Random Forest in Scikit-learn

Here, we are finding important features or selecting features in the Mortgage Loan dataset. In scikit-learn, we can perform this task in the following steps:

- First, we need to create a random forests model.

- Second, use the feature importance variable to see feature importance scores.

- Third, visualize these scores using the seaborn library.
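
The first two steps can be sketched on synthetic data as follows. The data set, the column names `f0` to `f4` and the `random_state` values are assumptions for illustration; the point is that each importance score is paired with the column the model was actually trained on.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the mortgage features; column names are hypothetical
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)
demo_cols = ['f0', 'f1', 'f2', 'f3', 'f4']

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Pair each score with its training column, then sort from most to least important
demo_imp = pd.Series(forest.feature_importances_, index=demo_cols)
demo_imp = demo_imp.sort_values(ascending=False)
print(demo_imp)
```

The scores are normalized to sum to 1, so each value can be read as a share of the total importance.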

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing, neighbors, svm
X = np.array(model_data.drop(['Fix_True'], axis=1))
y = np.array(model_data['Fix_True'])
from sklearn.preprocessing import scale
# Scale the features: X_scaled
X_scaled = scale(X)
X_train, X_test, y_train, y_test_rfc = train_test_split(X, y, test_size = 0.2)
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=200, oob_score=True)
rfc.fit(X_train, y_train)
rfc_accuracy = rfc.score(X_test, y_test_rfc)
print('Random Forest Classifier Accuracy : ', rfc_accuracy)

In [None]:
# Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100)

# Train the model using the training sets
clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [None]:
import pandas as pd
# Use the same feature columns the classifier was trained on (everything except 'Fix_True')
X = model_data.drop(['Fix_True'], axis=1)
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
feature_imp

We can also visualize the feature importance. Visualizations are easy to understand and interpretable. For visualization, we have used a combination of matplotlib and seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.figure(figsize=(20,5))
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

The higher the value, the greater the feature importance.

# Support Vector Regression (SVR)

**SUPPORT VECTOR REGRESSION.** Those who work in Machine Learning or Data Science are quite familiar with the term SVM, or Support Vector Machine. SVR is a bit different from SVM. As the name suggests, SVR is a regression algorithm, so we can use it for continuous values, whereas SVM is used for classification.

In [None]:
X = np.array(model_data.drop(['Loan Amount'], axis=1))
y = np.array(model_data['Loan Amount'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# We want to use svm now
from sklearn import preprocessing,svm
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
clf = svm.SVR()
clf.fit(X_train, y_train)
# For regressors, score() returns the R^2 coefficient of determination, not an accuracy
accuracy = clf.score(X_test, y_test)
print('SVR R^2 score: ', accuracy)

SVM-SVR score (R^2): -0.026, which is unacceptable for our Mortgage Loan data set. A negative R^2 means the model predicts worse than simply predicting the mean Loan Amount.
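
For a regressor, scikit-learn's `score` method reports the R^2 coefficient of determination rather than a classification accuracy, which is why the value can be negative. The sketch below computes R^2 by hand on toy numbers; the function name `r2_manual` and the values are assumptions for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return float(1.0 - ss_res / ss_tot)

toy_targets = [100.0, 200.0, 300.0]                        # not mortgage data
r2_mean = r2_manual(toy_targets, [200.0, 200.0, 200.0])    # always predict the mean
r2_bad = r2_manual(toy_targets, [300.0, 200.0, 100.0])     # worse than the mean
print(r2_mean, r2_bad)
```

A model that always predicts the mean scores 0, a perfect model scores 1, and anything below 0 is doing worse than the mean baseline.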

# Support Vector Machine (SVM)

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on one side.
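
The idea of a separating line can be sketched directly: a point is classified by the sign of `w.x + b`. The values of `w` and `b` below are hypothetical, standing in for what a trained linear SVM would learn; the RBF-kernel classifier used in this notebook draws a non-linear boundary in the original feature space, so this is only the linear special case.

```python
import numpy as np

# Hypothetical hyperplane parameters; a trained linear SVM would learn these
w = np.array([1.0, -1.0])   # normal vector of the separating line
b = 0.5                     # offset

def linear_decision(points):
    """Signed score w.x + b; its sign says which side of the line a point is on."""
    return np.asarray(points, dtype=float) @ w + b

points = [[2.0, 0.0], [0.0, 2.0]]
scores = linear_decision(points)
sides = np.where(scores >= 0, 1, 0)   # class 1 on the non-negative side
print(scores, sides)
```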

In [None]:
from sklearn import preprocessing, neighbors, svm
X = np.array(model_data.drop(['Fix_True'], axis=1))
y = np.array(model_data['Fix_True'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#Define SVM support vector classifier
svmc = svm.SVC(kernel='rbf', C=10, gamma=1)
svmc.fit(X_train, y_train)
svm_accuracy = svmc.score(X_test, y_test)
print('Support Vector Classifier Accuracy : ', svm_accuracy)

In [None]:
# Look at the confusion matrix
from sklearn.metrics import confusion_matrix
#confusion_matrix? # Hit enter
y_pred_svm = svmc.predict(X_test)
print('SVC Confusion Matrix: ')
# Compare the test labels against the SVC predictions
print(confusion_matrix(y_test, y_pred_svm, labels=None, sample_weight=None))

In [None]:
from sklearn.model_selection import cross_val_score
svm_scores = cross_val_score(svmc, X, y, cv=7, scoring='accuracy')
print('SVM: cross_val_score accuracy : ', svm_scores)

Support Vector Classifier Accuracy : 0.896 or 90%

# k-Nearest Neighbors (KNN)

k-Nearest Neighbors: Fit

Having explored the mortgage dataset, we now build our classifier. Here, we fit a k-Nearest Neighbors classifier to the mortgage dataset. The features need to be in an array where each column is a feature and each row a different observation or data point. The target needs to be a single column with the same number of observations as the feature data. Notice we named the feature array X and the response variable y: this is in accordance with common scikit-learn practice. We create an instance of a k-NN classifier with 5 neighbors (by specifying the n_neighbors parameter) and then fit it to the training data.

# k-Nearest Neighbors: Predict

Having fit a k-NN classifier, we can use it to predict the label of a data point. No separate unlabeled data is available, so we score the classifier on the held-out test set and also predict labels for a slice of rows from X, some of which the model may already have seen during training.
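
To make the fit and predict steps concrete, here is a minimal from-scratch sketch of how a k-NN prediction is formed: measure the distance to every training point, keep the k closest, and take a majority vote. The function `knn_predict` and the toy points are assumptions for illustration, and scikit-learn's `KNeighborsClassifier` uses more efficient neighbor searches than this brute-force loop.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]   # indices of the k closest rows
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two well-separated groups
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]
pred_low = knn_predict(X_toy, y_toy, [0.5, 0.5])
pred_high = knn_predict(X_toy, y_toy, [5.5, 5.5])
print(pred_low, pred_high)
```

Because the prediction depends entirely on distances, any feature with a much larger numeric range than the others will dominate the choice of neighbors.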

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing,neighbors
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split # for train and test set split
from sklearn.model_selection import cross_val_score
X = np.array(model_data.drop(['Fix_True'], axis=1))
y = np.array(model_data['Fix_True'])
X_train, X_test, y_train, y_test_knn = train_test_split(X, y, test_size = 0.2)
print("size of the training feature set is",X_train.shape)

In [None]:
print("size of the test feature set is",X_test.shape)

In [None]:
print("size of the training Target set is",y_train.shape)

In [None]:
print("size of the test Target set is",y_test_knn.shape)

In [None]:
# Import scale
from sklearn.preprocessing import scale
# Scale the features: X_scaled
X_scaled = scale(X)
# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 

In [None]:
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

In [None]:
# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 

In [None]:
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

In [None]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test_knn)
print('KNeighborsClassifier Accuracy : ', knn_accuracy) 

In [None]:
knn_prediction = knn.predict(X[1200:1220,])
print('KNN : - Input of Real Data :: \n', X[1200:1220,])

In [None]:
print('KNN : - Output of Real Data :: ', y[1200:1220,])

In [None]:
print('KNN : - Output of prediction:: ', knn_prediction)

KNeighborsClassifier Accuracy : 0.882 or 88%

# How do we improve the KNN model

## Preprocessing: scaling

The following steps aim to improve the model:

(i) scale the data,

(ii) use k-Nearest Neighbors

(iii) check the model performance.

We'll use scikit-learn's scale function, which standardizes all features (columns) in the array passed to it.

In [None]:
from sklearn.preprocessing import scale
Xs = scale(X)
from sklearn.model_selection import train_test_split
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2)
knn_model_2 = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_2.score(Xs_test, y_test))

In [None]:
print('k-NN score for training set: %f' % knn_model_2.score(Xs_train, y_train))

In [None]:
y_true, y_pred = y_test, knn_model_2.predict(Xs_test)
print(classification_report(y_true, y_pred))

All these measures improved by about 3.25%, which is significant! As hinted at above, before scaling there were a number of predictor variables with ranges of different orders of magnitude, meaning that one or two of them could dominate in the context of an algorithm such as k-NN. The two main reasons for scaling our data are:

- Our predictor variables may have significantly different ranges and, in certain situations, such as when implementing k-NN, this needs to be mitigated so that certain features do not dominate the algorithm.
- We want our features to be unit-independent, that is, not reliant on the scale of the measurement involved. If we both scale our respective data, this feature will be the same for each of us.
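
A small numeric sketch shows how a large-range feature can dominate Euclidean distance. The three loans and the per-column spreads below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical rows: [FICO score, loan amount in dollars]
loan_a = np.array([620.0, 300000.0])
loan_b = np.array([800.0, 300500.0])   # very different FICO, similar amount
loan_c = np.array([625.0, 310000.0])   # similar FICO, different amount

# Raw distances: the dollar column swamps the FICO column
raw_ab = np.linalg.norm(loan_a - loan_b)
raw_ac = np.linalg.norm(loan_a - loan_c)

# Divide each column by an assumed typical spread so both are comparable
spread = np.array([60.0, 100000.0])
scaled_ab = np.linalg.norm((loan_a - loan_b) / spread)
scaled_ac = np.linalg.norm((loan_a - loan_c) / spread)

print(raw_ab < raw_ac, scaled_ab < scaled_ac)
```

On the raw values, loan B looks like the closer neighbor of loan A despite a 180-point FICO gap; after rescaling, loan C becomes the closer one.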

# Decision Tree Classifier

In scikit-learn, a decision tree classifier is commonly tuned by pre-pruning, where the maximum depth of the tree is used as a control variable. In the following example, we plot a decision tree on the same data with max_depth=3. Besides the pre-pruning parameters, we also try another attribute selection measure, entropy. The pruned model is less complex and easier to explain and understand than the unpruned decision tree.

**Pros**

- Decision trees are easy to interpret and visualize.
- They can easily capture non-linear patterns.
- They require less data preprocessing from the user; for example, there is no need to normalize columns.
- They can be used for feature engineering, such as predicting missing values, and are suitable for variable selection.
- A decision tree makes no assumptions about the data distribution because of the non-parametric nature of the algorithm.

**Cons**

- Sensitive to noisy data. It can overfit noisy data.
- A small variation (or variance) in the data can result in a different decision tree.
- Decision trees are biased by imbalanced datasets, so it helps to balance the dataset before creating the decision tree.
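
One common way to compensate for class imbalance without resampling is to weight classes inversely to their frequency. The sketch below computes such weights by hand; the function name and toy labels are assumptions for illustration. scikit-learn's `DecisionTreeClassifier` accepts `class_weight='balanced'`, which applies the same `n_samples / (n_classes * count)` heuristic.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights n_samples / (n_classes * count): rarer classes get larger weights."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

toy_labels = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])   # 8 fixed-rate, 2 ARM
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```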

# Decision Tree Algorithm

A decision tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting it recursively in a process called recursive partitioning. This flowchart-like structure helps in decision making. Its visualization, like a flowchart diagram, closely mimics human-level thinking. That is why decision trees are easy to understand and interpret.
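
At each node the tree picks the split that makes the resulting groups as pure as possible. Two common purity measures are Gini impurity (scikit-learn's default criterion) and entropy (used later in this notebook). The helper functions and toy labels below are a minimal sketch for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: -sum p * log2(p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

mixed = [0, 0, 1, 1]   # evenly mixed node
pure = [1, 1, 1, 1]    # node with a single class
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are largest for an evenly mixed node and fall to zero for a pure one, so a good split is one that lowers the weighted impurity of the child nodes.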

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
# Split dataset into training set and test set
X = np.array(fit_data.drop(['Fix_True'], axis=1))
y = np.array(fit_data['Fix_True'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # 70% training and 30% test
# Build a Decision Tree model using scikit-learn
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
# Import scale
from sklearn.preprocessing import scale
# Scale the features: X_scaled
X_scaled = scale(X)
clf.predict(X_test)
y_pred_dt = clf.predict(X_test)
y_pred_dt_out = clf.predict(X[975:995,])
print('Decision Tree Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred_dt, labels=None, sample_weight=None))

In [None]:
dt_accuracy = clf.score(X_test, y_test)
print('Decision Tree Classifier Accuracy : ', dt_accuracy) 

In [None]:
print('Decision Tree Classifier : - Input of Real Data : Fix_True =1 & ARM =0:\n ', X[975:995,])

In [None]:
print('Decision Tree Classifier : - Output of X[975:995]: Fix_True =1 & ARM =0:' , y[975:995])

In [None]:
print('Decision Tree Classifier : - Output of prediction: Fix_True =1 & ARM =0:', y_pred_dt_out)

Decision Tree Classifier Accuracy : 0.840 or 84%

# Visualizing Decision Trees

We use scikit-learn's export_graphviz function to display the tree within a Jupyter notebook. To plot the tree, you also need to install graphviz and pydotplus. The export_graphviz function converts the decision tree classifier into a dot file, and pydotplus converts this dot file to PNG or another form displayable in a Jupyter notebook.

In [None]:
!pip install pydotplus

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus
# Feature names must match the columns the tree was trained on (all except 'Fix_True')
X = fit_data.drop(['Fix_True'], axis=1)
y = fit_data['Fix_True']
feature_cols = list(X.columns)
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('./mortgage_dt.png')
Image(graph.create_png())

In scikit-learn, a decision tree classifier is commonly tuned by pre-pruning. The maximum depth of the tree can be used as a control variable for pre-pruning.

We can plot a decision tree on the same data with max_depth=3. Besides the pre-pruning parameters, we can also try another attribute selection measure, such as entropy.

In [None]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
# Train Decision Tree classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
#Visualizing Decision Trees
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('./mortgage_dt.png')
Image(graph.create_png())

The decision tree plotted on the same data with max_depth=3 is much easier to visualize.

Decision tree model accuracy has also improved.