# Building a Regression Model (KMeans Clustering) using Machine Learning 

We apply Linear Regression, a K-NN Regressor, a Random Forest Regressor and a Multi-Layer Perceptron Regressor to each of our clusters. For each model we calculate the MAE, RMSE and MAPE on the train and test data to identify the better-performing model.

Import necessary packages and define mean_absolute_percentage_error

In [1]:
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.utils import check_array


def mean_absolute_percentage_error(y_test,x_predict):
    # MAPE in percent. Division warnings are silenced because a zero target makes the ratio undefined.
    np.seterr(divide='ignore',invalid='ignore')
    y_test,x_predict=np.array(y_test),np.array(x_predict)
    return np.mean(np.abs((y_test - x_predict)/y_test))*100
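
As a quick sanity check of the formula, here is a minimal sketch that applies the same calculation to three made-up values. The numbers are hypothetical and are not taken from the loan data. Because the actual value sits in the denominator, a target of zero makes the ratio undefined, which is presumably why rows with an `int_rate` of 0 are filtered out before modelling.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """Mean of |(y_true - y_pred) / y_true|, expressed as a percentage."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Hypothetical interest rates and predictions, for illustration only.
actual = [10.0, 20.0, 5.0]
predicted = [11.0, 18.0, 5.0]

# Relative errors are 10%, 10% and 0%, so the mean is about 6.67%.
mape = mean_absolute_percentage_error(actual, predicted)
print(round(mape, 2))
```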




Load Cluster

In [2]:
loandata=pd.read_csv('cluster3.csv',encoding="ISO-8859-1")



In [3]:
loandata=loandata[loandata.int_rate!=0]

Split the data into train and test sets in a 70-30 ratio, and separate the int_rate column into y_train and y_test for building our model

In [4]:
loandataforprediction=loandata[['loan_amnt','int_rate','term','emp_length','home_ownership','annual_inc','verification_status','purpose','addr_state','dti','delinq_2yrs','Risk_Score','inq_last_6mths','open_acc','revol_bal','revol_util','total_acc','mths_since_last_major_derog','funded_amnt_inv','installment','application_type','pub_rec']]

loandataforprediction=pd.get_dummies(loandataforprediction, columns=["purpose"])
loandataforprediction=pd.get_dummies(loandataforprediction,columns=["application_type"])
train,test = train_test_split(loandataforprediction, train_size = 0.7)
y_train = train['int_rate']
y_test = test['int_rate']
x_train = train.loc[:, train.columns != 'int_rate']
x_test = test.loc[:, test.columns != 'int_rate']
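
The split above draws a different random sample on every run, so the reported metrics will vary slightly between runs. A minimal sketch of a reproducible variant follows, assuming a fixed `random_state` is acceptable. The miniature `frame` is a hypothetical stand-in for `loandataforprediction`, and the seed value 50 is an arbitrary choice that mirrors the seed used for the models later on.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature frame standing in for loandataforprediction.
frame = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000],
    "dti": [5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5],
    "int_rate": [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})

# Separate features and target first, then split both together.
features = frame.drop(columns="int_rate")
target = frame["int_rate"]

# random_state fixes the shuffle so the same rows land in train/test each run.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, train_size=0.7, random_state=50
)
print(x_tr.shape, x_te.shape)
```

Passing features and target to `train_test_split` together keeps their rows aligned and removes the need to slice the target column out afterwards.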

### Machine Learning algorithms used:

We applied the machine learning algorithms below and evaluated each model by calculating the MAE, RMSE and MAPE on the test and train data
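
Every model in this notebook reports the same three metrics, so the calculation can be collected in one place. The sketch below shows one way to do that. The helper name `summarize_errors` is hypothetical and is not part of the original notebook, and the sample values are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def summarize_errors(y_true, y_pred):
    """Return MAE, RMSE and MAPE (in percent) for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}

# Hypothetical values for illustration only.
scores = summarize_errors([10.0, 20.0], [12.0, 16.0])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE, while MAPE expresses the error relative to the size of the actual value.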

## Linear Regression

In [5]:
#Multiple Linear Regression Case 1:
reg=linear_model.LinearRegression()
reg.fit(x_train,y_train)
predicted_train=reg.predict(x_train)
predicted_test=reg.predict(x_test)
print("Testing mean absolute error : %f" %mean_absolute_error(y_test,predicted_test))
print("Training mean absolute error: %f" %mean_absolute_error(y_train,predicted_train))
rmse=np.sqrt(mean_squared_error(y_test,predicted_test))
print("RMSE Value for Testing")
print(rmse)
rmse=np.sqrt(mean_squared_error(y_train,predicted_train))
print("RMSE Value for Training")
print(rmse)
print("MAPE Value for Testing")
print(mean_absolute_percentage_error(y_test, predicted_test))
print("MAPE Value for Training")
print(mean_absolute_percentage_error(y_train, predicted_train))


Testing mean absolute error : 1.162224
Training mean absolute error: 1.161734
RMSE Value for Testing
1.61704634321
RMSE Value for Training
1.62071260134
MAPE Value for Testing
9.61941925112
MAPE Value for Training
9.59630107929


## KNN Regressor

In [6]:
#KNN Regressor
# Create the knn model.
# Look at the five closest neighbors.
knn = KNeighborsRegressor(n_neighbors=5)
# Fit the model on the training data.
knn.fit(x_train, y_train)
# Make point predictions on the train and test sets using the fitted model.
predicted_train = knn.predict(x_train)
predicted_test = knn.predict(x_test)
print("Testing mean absolute error : %f" %mean_absolute_error(y_test,predicted_test))
print("Training mean absolute error: %f" %mean_absolute_error(y_train,predicted_train))
rmse=np.sqrt(mean_squared_error(y_test,predicted_test))
print("RMSE Value for Testing")
print(rmse)
rmse=np.sqrt(mean_squared_error(y_train,predicted_train))
print("RMSE Value for Training")
print(rmse)
print("MAPE Value for Testing")
print(mean_absolute_percentage_error(y_test, predicted_test))
print("MAPE Value for Training")
print(mean_absolute_percentage_error(y_train, predicted_train))


Testing mean absolute error : 3.581730
Training mean absolute error: 2.894301
RMSE Value for Testing
4.65497172055
RMSE Value for Training
3.77356901555
MAPE Value for Testing
30.8065811187
MAPE Value for Training
24.794504981
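
K-NN relies on distances between rows, so features measured on large scales (such as `annual_inc` or `revol_bal`) can dominate the neighbor search. Whether that explains the weaker K-NN result on this cluster is an assumption that has not been tested here. The sketch below only illustrates the effect on synthetic data, using a `StandardScaler` in front of the regressor. The data and all variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: a small-scale informative feature
# next to a large-scale uninformative one.
rng = np.random.default_rng(50)
informative = rng.uniform(0, 1, size=500)
large_noise = rng.uniform(0, 100000, size=500)
X = np.column_stack([informative, large_noise])
y = 10 * informative

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=50)

# Same model twice: once on raw features, once behind a scaler.
raw_knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=5)
).fit(X_tr, y_tr)

raw_mae = mean_absolute_error(y_te, raw_knn.predict(X_te))
scaled_mae = mean_absolute_error(y_te, scaled_knn.predict(X_te))
print("MAE without scaling:", raw_mae)
print("MAE with scaling   :", scaled_mae)
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training rows only and then reused for the test rows.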


## RandomForest Regressor

In [7]:

#RandomForest Regressor
from sklearn.ensemble import RandomForestRegressor

max_features=['auto']  # 'auto' (all features for regressors) was removed in scikit-learn 1.3; use 1.0 there
n_estimators=[100]
min_samples_leaf=[1,2]
for x in n_estimators:
    for y in min_samples_leaf:
        for z in max_features:
            model=RandomForestRegressor(n_estimators=x,max_features=z,oob_score=True,n_jobs=-1,random_state=50,min_samples_leaf=y)
            model.fit(x_train,y_train)
            predicted_train=model.predict(x_train)
            predicted_test=model.predict(x_test)
            print("Tuning Combination") 
            print(x," ",y ," ", z,"")
            print("Testing mean absolute error : %f" % mean_absolute_error(y_test,predicted_test))
            print("Training mean absolute error: %f" %mean_absolute_error(y_train,predicted_train))
            rmse=np.sqrt(mean_squared_error(y_test,predicted_test))
            print("RMSE Value for testing")
            print(rmse)
            rmse=np.sqrt(mean_squared_error(y_train,predicted_train))
            print("RMSE Value for training")
            print(rmse)
            print("MAPE Value for Testing")
            print(mean_absolute_percentage_error(y_test, predicted_test))
            print("MAPE Value for Training")
            print(mean_absolute_percentage_error(y_train, predicted_train))


Tuning Combination
(100, ' ', 1, ' ', 'auto', '')
Testing mean absolute error : 0.023746
Training mean absolute error: 0.009399
RMSE Value for testing
0.125986630824
RMSE Value for training
0.0491870343412
MAPE Value for Testing
0.189031056481
MAPE Value for Training
0.0750257037069
Tuning Combination
(100, ' ', 2, ' ', 'auto', '')
Testing mean absolute error : 0.025150
Training mean absolute error: 0.012425
RMSE Value for testing
0.128317892437
RMSE Value for training
0.0680403787089
MAPE Value for Testing
0.203359488297
MAPE Value for Training
0.102110952762
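
The model above is built with `oob_score=True`, but the out-of-bag estimate is never printed. A minimal sketch of how it could be read after fitting follows, using synthetic data as a stand-in for the loan features. For a regressor, scikit-learn reports `oob_score_` as an R^2 value computed from samples each tree did not see during training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(50)
X = rng.uniform(size=(200, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(
    n_estimators=100, oob_score=True, random_state=50, min_samples_leaf=1
)
forest.fit(X, y)

# R^2 estimated from out-of-bag samples, available without a separate test set.
print("OOB R^2:", forest.oob_score_)
# Per-row out-of-bag predictions, one for each training sample.
print(forest.oob_prediction_.shape)
```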


## MLP Regressor 

In [8]:

#Neural Network Case 1:
from sklearn.neural_network import MLPRegressor
#List of alpha (L2 regularization) values to try
alpha=[10]


for a in alpha:  
    nn = MLPRegressor(hidden_layer_sizes=(100,),  activation='relu', solver='adam', alpha=a,batch_size='auto',learning_rate='constant',
    learning_rate_init=0.001, max_iter=1000, shuffle=True,random_state=50, tol=0.0001, verbose=False, warm_start=False,
    early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
    n = nn.fit(x_train,y_train)
    predicted_trainnn=n.predict(x_train)
    predicted_testnn=n.predict(x_test)
    print("Testing mean absolute error : %f" % mean_absolute_error(y_test,predicted_testnn))
    print("training mean absolute error: %f"% mean_absolute_error(y_train,predicted_trainnn))
    rmse=np.sqrt(mean_squared_error(y_test,predicted_testnn))
    print("RMSE Value For Testing")
    print(rmse)
    rmse=np.sqrt(mean_squared_error(y_train,predicted_trainnn))
    print("RMSE Value For Training")
    print(rmse)
    print("MAPE Value for Testing")
    print(mean_absolute_percentage_error(y_test, predicted_testnn))
    print("MAPE Value for Training")
    print(mean_absolute_percentage_error(y_train, predicted_trainnn))

Testing mean absolute error : 20.703014
training mean absolute error: 20.863420
RMSE Value For Testing
51.1978936018
RMSE Value For Training
86.6206419733
MAPE Value for Testing
180.313941349
MAPE Value for Training
181.593306124


# Conclusion :

#### The Random Forest Regressor outperforms the other models on this FICO Score cluster, with a test MAE of about 0.025 and a train MAE of about 0.012 for min_samples_leaf=2 (0.024 and 0.009 for min_samples_leaf=1)
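
To make the comparison easier to scan, the test-set metrics printed in the cells above can be gathered into one table. The sketch below does this with pandas. The numbers are copied from the notebook outputs, and the Random Forest row uses the `min_samples_leaf=1` run.

```python
import pandas as pd

# Test-set metrics copied from the outputs printed in this notebook.
results = pd.DataFrame(
    {
        "MAE": [1.162224, 3.581730, 0.023746, 20.703014],
        "RMSE": [1.61704634321, 4.65497172055, 0.125986630824, 51.1978936018],
        "MAPE": [9.61941925112, 30.8065811187, 0.189031056481, 180.313941349],
    },
    index=[
        "Linear Regression",
        "KNN Regressor",
        "Random Forest (min_samples_leaf=1)",
        "MLP Regressor",
    ],
)

# Rank models from lowest to highest test MAE.
ranked = results.sort_values("MAE")
print(ranked)
best_model = ranked.index[0]
```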

The text in the document by Tushar Goel & Trupti Gore is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/us/

### MIT License

The code in the document by Tushar Goel & Trupti Gore is licensed under the MIT License https://opensource.org/licenses/MIT 

Copyright (c) 2018 Trupti Gore

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
