# Interest Rate Prediction

## Case Study

This data belongs to a loan aggregator agency that connects loan applications to different financial institutions in an attempt to get the best interest rate. The agency now wants to use past data to predict the interest rate offered by any financial institution just by looking at the characteristics of a loan application.

To achieve that, they have decided to do a POC with data from one particular financial institution. The data is given in the file "loans data.csv". Let's begin:

## Step 0: Basic Imports

In [None]:
import pandas as pd 
import numpy as np

## Step 1: Load dataset

In [None]:
train_file='data/loan_data_train.csv'
test_file='data/loan_data_test.csv'

df_train=pd.read_csv(train_file)
df_test=pd.read_csv(test_file)               

In [None]:
print(df_train.shape)
df_train.head()

The training data has 14 features and 1 target variable.

In [None]:
# The test data does not have the Interest.Rate column
print(df_test.shape)
df_test.head()

## Step 2: Data Visualisation and Feature Selection

### Check dtypes

In [None]:
df_train.dtypes.unique()

In [None]:
obj_cols = df_train.select_dtypes("object").columns
obj_cols, len(obj_cols)

The dtypes of 12 features need to be fixed.

In [None]:
df_train[obj_cols].sample(5)

In [None]:
df_train["Loan.Length"].value_counts()

In [None]:
df_train["Loan.Purpose"].value_counts()

### Some observations

1. 'Amount.Requested': convert it to numeric
2. 'Amount.Funded.By.Investors': drop
3. 'Interest.Rate': remove % and then to numeric
4. 'Loan.Length': dummies for categories
5. 'Loan.Purpose': dummies for categories
6. 'Debt.To.Income.Ratio': remove % and then to numeric
7. 'State': dummies for categories
8. 'Home.Ownership': dummies for categories
9. 'FICO.Range': replace it by a numeric column which is average of the range
10. 'Open.CREDIT.Lines': convert it to numeric 
11. 'Revolving.CREDIT.Balance': convert it to numeric 
12. 'Employment.Length': convert it to number


Let's group these by operation:

1. drop: 
    - ID 
    - Amount.Funded.By.Investors
<br>
2. convert to numeric:
    - Amount.Requested
    - Open.CREDIT.Lines
    - Revolving.CREDIT.Balance
<br>
3. remove % and then convert to numeric:
    - Interest.Rate
    - Debt.To.Income.Ratio
<br>
4. replace with a numeric column holding the average of the range:
    - FICO.Range
<br>
5. convert to number:
    - Employment.Length
<br>
6. dummies for categories with a good occurrence rate:
    - Loan.Length
    - Loan.Purpose
    - State
    - Home.Ownership

### Fix Dtypes

#### Operation 1

In [None]:
df_train.drop(['ID','Amount.Funded.By.Investors'],axis=1,inplace=True)

#### Operation 2

Many columns that should be numeric have been imported as character columns, probably because some values in those columns contain non-numeric characters. We'll convert all such columns to numbers; with `errors='coerce'`, values that cannot be parsed become `NaN`.

In [None]:
for col in ['Amount.Requested', 'Open.CREDIT.Lines','Revolving.CREDIT.Balance']:
    df_train[col]=pd.to_numeric(df_train[col],errors='coerce') 

#### Operation 3

The variables `Interest.Rate` and `Debt.To.Income.Ratio` contain a "%" sign in their values, which is why they were read in as character columns. Let's remove the percent signs first and then convert the columns to numeric.

In [None]:
for col in ['Interest.Rate','Debt.To.Income.Ratio']:
    print(col)
    df_train[col]=df_train[col].str.replace("%","")

In [None]:
for col in ['Interest.Rate','Debt.To.Income.Ratio']:
    df_train[col]=pd.to_numeric(df_train[col],errors='coerce') 
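
As a quick check of the two cells above, here is a minimal sketch on a toy Series (the values are made up for illustration, not taken from the loan data). It does the stripping and the conversion in one chained expression, and passes `regex=False` so that "%" is treated as a literal character.

```python
import pandas as pd

# Toy values standing in for a percentage column; not taken from the loan data.
s = pd.Series(["8.90%", "12.12%", "21.98%", None])

# regex=False treats "%" as a literal character rather than a pattern.
# errors="coerce" turns anything unparseable (including missing values) into NaN.
cleaned = pd.to_numeric(s.str.replace("%", "", regex=False), errors="coerce")
print(cleaned.tolist())
```

The missing entry stays missing after the conversion, so it is picked up later by the imputation step.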

#### Operation 4

If we look at the first few values of `FICO.Range`, we can see that it can be converted to a numeric column by taking the average of the given range. To do that, we first split the column on "-" so that both ends of the range land in separate columns, and then average them.

In [None]:
k=df_train['FICO.Range'].str.split("-",expand=True).astype(float)

In [None]:
df_train['fico']=0.5*(k[0]+k[1])
del df_train['FICO.Range']
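
The split-and-average step can be tried in isolation on a few hand-written ranges. This is only a sketch; the three range strings below are illustrative and the names `fico_range` and `bounds` are not from the notebook.

```python
import pandas as pd

# Illustrative range strings in the "low-high" format described above.
fico_range = pd.Series(["735-739", "715-719", "690-694"])

# expand=True returns a DataFrame with one column per piece of the split.
bounds = fico_range.str.split("-", expand=True).astype(float)

# Midpoint of each range.
fico = 0.5 * (bounds[0] + bounds[1])
print(fico.tolist())  # [737.0, 717.0, 692.0]
```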

#### Operation 5

In [None]:
df_train['Employment.Length'].value_counts()

In [None]:
df_train['Employment.Length']=df_train['Employment.Length'].str.replace('years',"")
df_train['Employment.Length']=df_train['Employment.Length'].str.replace('year',"")

In [None]:
#np.where(condition, value_if_True, value_if_False)
df_train['Employment.Length']=np.where(df_train['Employment.Length'].str[:2]=="10",10,df_train['Employment.Length'])
df_train['Employment.Length']=np.where(df_train['Employment.Length'].str[0]=="<",0,df_train['Employment.Length'])

In [None]:
df_train['Employment.Length']=pd.to_numeric(df_train['Employment.Length'],errors='coerce')
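
The same mapping can also be written with a regular expression instead of chained `replace` and `np.where` calls. The sketch below is an alternative, not the notebook's method, and it assumes the labels look like "< 1 year", "2 years" and "10+ years", which is what the cells above appear to handle.

```python
import pandas as pd

# Toy labels; the "< 1 year" and "10+ years" spellings are assumed, not read from the data.
emp = pd.Series(["< 1 year", "2 years", "10+ years", "1 year", None])

# Pull out the first run of digits, then send the "< 1 year" bucket to 0.
years = pd.to_numeric(emp.str.extract(r"(\d+)", expand=False), errors="coerce")
years = years.mask(emp.str.startswith("<", na=False), 0)
print(years.tolist())
```

Missing labels remain `NaN` here, just as they do with `errors='coerce'` in the cell above.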

#### Operation 6

In [None]:
# Note: to apply string functions to a pandas DataFrame column, you need to use its .str accessor
cat_cols=df_train.select_dtypes(['object']).columns
cat_cols

In [None]:
print("*"*50)
for col in cat_cols:
    print(col)
    print("-"*50)
    print(df_train[col].nunique())
    print(df_train[col].value_counts())
    print("*"*50)

In [None]:
# The loop below creates dummies only for categories that occur more than 20 times
# and leaves out the last of those, which acts as the baseline category.
# In the next section, for logistic regression, we will use pandas' get_dummies function;
# you can work with either approach.
# Ignoring low-frequency categories results in fewer columns without
# affecting model performance too much.

for col in cat_cols:
    freqs=df_train[col].value_counts()
    k=freqs.index[freqs>20][:-1]
    for cat in k:
        name=col+'.'+cat
        df_train[name]=(df_train[col]==cat).astype(int)
    del df_train[col]
    print(col)  
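
To see what the loop above does to a single column, here is a small sketch on a made-up frame. The column name `purpose`, its values, and the lowered threshold `min_count` are hypothetical and chosen only so the effect is visible on nine rows.

```python
import pandas as pd

# Made-up categorical column: "car" x5, "house" x3, "other" x1.
df = pd.DataFrame({"purpose": ["car"] * 5 + ["house"] * 3 + ["other"] * 1})
min_count = 2  # hypothetical threshold; the notebook uses 20

# value_counts() sorts by descending frequency.
freqs = df["purpose"].value_counts()

# Keep the frequent categories, then drop the last one as the baseline.
kept = freqs[freqs > min_count].index[:-1]
for cat in kept:
    df["purpose." + cat] = (df["purpose"] == cat).astype(int)
df = df.drop(columns="purpose")

print(list(df.columns))  # ['purpose.car']
```

Both "house" (the dropped baseline) and "other" (too rare) end up encoded as all zeros in the remaining dummy column.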

In [None]:
df_train.columns

### Missing values

In [None]:
df_train.isnull().sum()

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
df_train = pd.DataFrame(imputer.fit_transform(df_train), columns=df_train.columns)

In [None]:
df_train.isnull().sum()

## Step 3: Defining Training and Test Set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, y_train = df_train.loc[:,df_train.columns!="Interest.Rate"].values, df_train["Interest.Rate"].values
X_train.shape, y_train.shape

## Step 4: Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
std = StandardScaler().fit(X_train)
X_train = std.transform(X_train)

## Step 5: Test set pipeline
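
This step is left empty in the notebook. One common approach, sketched here under stated assumptions, is to run the test frame through the same cleaning operations, align its columns with the training features, and reuse transformers fitted on the training features only. The frames `train_X` and `test_X` below are toy stand-ins with hypothetical columns, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for cleaned feature frames (hypothetical values and columns).
train_X = pd.DataFrame({"fico": [700.0, 720.0, 680.0], "State.CA": [1, 0, 1]})
test_X = pd.DataFrame({"fico": [710.0, np.nan]})  # the State.CA dummy never appeared

# Same columns, same order as training; dummies missing from the test frame become 0.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)

# Fit the transformers on training features only, then reuse them on the test frame.
imputer = SimpleImputer(strategy="mean").fit(train_X)
scaler = StandardScaler().fit(imputer.transform(train_X))
test_ready = scaler.transform(imputer.transform(test_X))
print(test_ready)
```

Fitting on the training features alone matters here because the test data has no `Interest.Rate` column, so an imputer fitted on a frame that includes the target cannot be applied to it as-is.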

## Step 6: Modelling

### Linear Regression

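In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Assumption: ld_train is the cleaned df_train; hold out 20% (ld_train2) for validation.
ld_train=df_train
ld_train1,ld_train2=train_test_split(ld_train,test_size=0.2,random_state=2)
x_train1,y_train1=ld_train1.drop('Interest.Rate',axis=1),ld_train1['Interest.Rate']

In [None]:
lm=LinearRegression()

In [None]:
lm.fit(x_train1,y_train1)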

In [None]:
x_train1.shape

In [None]:
lm.intercept_

In [None]:
list(zip(x_train1.columns,lm.coef_))

In [None]:
x_train2=ld_train2.drop('Interest.Rate',axis=1)

In [None]:
predicted_ir=lm.predict(x_train2)

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
mean_absolute_error(ld_train2['Interest.Rate'],predicted_ir)
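
For reference, mean absolute error is simply the average of the absolute differences between actual and predicted values. The sketch below checks that by hand against scikit-learn on three made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up interest rates, in percentage points.
actual = np.array([10.0, 12.5, 15.0])
predicted = np.array([11.0, 12.0, 13.0])

# Absolute errors are 1.0, 0.5 and 2.0, so the mean is 3.5 / 3.
mae_manual = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_manual, mae_sklearn)
```

Because the target is an interest rate in percent, the error reads directly as "percentage points off, on average".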

We know the tentative performance now. Let's build the model on the entire training data to make predictions on the test/production data.

In [None]:
x_train=ld_train.drop('Interest.Rate',axis=1)
y_train=ld_train['Interest.Rate']

In [None]:
lm.fit(x_train,y_train)

In [None]:
# ld_test must be the test frame after the same cleaning steps, with the same columns as x_train
test_pred=lm.predict(ld_test)

We can write these predictions to a CSV file for submission like this:

In [None]:
pd.DataFrame(test_pred).to_csv("mysubmission.csv",index=False)

### Ridge Regression

In [None]:
from sklearn.linear_model import Ridge,Lasso
from sklearn.model_selection import GridSearchCV

In [None]:
lambdas=np.linspace(1,100,100)

In [None]:
params={'alpha':lambdas}

In [None]:
model=Ridge(fit_intercept=True)

In [None]:
grid_search=GridSearchCV(model,param_grid=params,cv=10,scoring='neg_mean_absolute_error')

In [None]:
grid_search.fit(x_train,y_train)

In [None]:
grid_search.best_estimator_

In [None]:
grid_search.cv_results_

If you want, you can now fit a ridge regression model with the obtained value of alpha, although there is no need: grid search automatically refits the best estimator on the entire data, so you can use it directly to make predictions on the test data. If you want to look at the coefficients, however, it is more convenient to work with the estimator object directly.
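
The refit behaviour can be seen on synthetic data. This is a minimal sketch: the data, the three-value alpha grid and the 5-fold setting are illustrative choices, not the notebook's.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem with known coefficients plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

# refit=True is the default, so best_estimator_ is already trained on all of X, y.
best = search.best_estimator_
print(search.best_params_, -search.best_score_)  # sign flipped back to a plain MAE
print(best.coef_)
```

Calling `search.predict` delegates to this refitted estimator, and its `coef_` attribute is available without another call to `fit`.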

Using the report function given below, you can also see the CV performance of the top few models; that is the tentative performance.

In [None]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [None]:
report(grid_search.cv_results_,100)

In [None]:
test_pred=grid_search.predict(ld_test)

In [None]:
pd.DataFrame(test_pred).to_csv("mysubmission.csv",index=False)

#### For looking at coefficients

In [None]:
ridge_model=grid_search.best_estimator_

In [None]:
ridge_model.fit(x_train,y_train)

In [None]:
list(zip(x_train.columns,ridge_model.coef_))

### Lasso Regression

In [None]:
lambdas=np.linspace(1,10,100)
model=Lasso(fit_intercept=True)
params={'alpha':lambdas}

In [None]:
grid_search=GridSearchCV(model,param_grid=params,cv=10,scoring='neg_mean_absolute_error')

In [None]:
grid_search.fit(x_train,y_train)

In [None]:
grid_search.best_estimator_

You can see that the best value of alpha lies at the edge of the range we tried. We should extend the range on that side and run the search again.
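
Whether the selected alpha sits on the boundary of the grid can also be checked programmatically. The helper `at_grid_edge` below is hypothetical, written only to illustrate the check; it is not part of scikit-learn.

```python
import numpy as np

def at_grid_edge(best_alpha, grid):
    """Return True if best_alpha equals the smallest or largest value in grid."""
    grid = np.sort(np.asarray(grid, dtype=float))
    return bool(np.isclose(best_alpha, grid[0]) or np.isclose(best_alpha, grid[-1]))

grid = np.linspace(1, 10, 100)
print(at_grid_edge(1.0, grid))  # True: the lower edge, so try smaller alphas
print(at_grid_edge(5.0, grid))  # False: an interior value
```

With a fitted search object, the value to test would be `grid_search.best_params_['alpha']`.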

In [None]:
lambdas=np.linspace(.001,2,100)
params={'alpha':lambdas}

In [None]:
grid_search=GridSearchCV(model,param_grid=params,cv=10,scoring='neg_mean_absolute_error')
grid_search.fit(x_train,y_train)

In [None]:
grid_search.best_estimator_

In [None]:
report(grid_search.cv_results_,5)

In [None]:
lasso_model=grid_search.best_estimator_

In [None]:
lasso_model.fit(x_train,y_train)

In [None]:
list(zip(x_train.columns,lasso_model.coef_))