<h1 style="color:blue;">Outline of Scenario 10 notebook:</h1>

- C1.S10.Py01 - Combine “other” and “none” and run regression
- C1.S10.Py02 - Create interaction and re-run regression
- C1.S10.Py03 - Remove outliers and re-run regression



In [None]:
#Code Block 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



#style options

%matplotlib inline
#lets graphs display automatically without calling plt.show()

pd.set_option('display.max_columns',500) #allows for up to 500 columns to be displayed when viewing a dataframe

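#a style that can be used for plots; Matplotlib 3.6+ names it 'seaborn-v0_8', so use whichever this install provides
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')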



In [None]:
#Code Block 2
df = pd.read_csv('data/Scenario10.csv', index_col = 0, header=0)
    #index_col = 0 sets the first column as the index
    #header = 0 uses the top row as the column headers

In [None]:
#Code Block 3
df.head(3)

<h2 style="color:blue;">Combine “OTHER” and “NONE” and run regression</h2>



<h3 style="color:blue;">Rename Revolving Accounts </h3>

- Revolving Accounts is the outstanding balance on revolving debt (e.g. credit cards)
- Total Revolving Credit Line is the credit limit on the revolving credit line (e.g. the maximum amount for a credit card)

#### Let's change Revolving Accounts to Revolving Balance

In [None]:
#Code Block 4
df.columns

In [None]:
#Code Block 5
df = df.rename(columns = {'Revolving Accounts': "Revolving Balance"})

#### NOTE: in Python, single quotes ' ' and double quotes " " are interchangeable for strings

In [None]:
#Code Block 6
df.info()

In [None]:
#Code Block 7
df['Home Ownership'].value_counts(dropna=False)

<h3 style="color:blue;">Combine "NONE" and "OTHER" by using .replace </h3>

In [None]:
#Code Block 8
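#restrict the replacement to Home Ownership so "NONE" values in other columns are left alone
df['Home Ownership'] = df['Home Ownership'].replace({"NONE":"OTHER"})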
df['Home Ownership'].value_counts(dropna=False)
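
A DataFrame-wide `.replace` rewrites the matching value in every column, whereas a column-scoped `.replace` only touches the column it is called on. The sketch below contrasts the two on a small made-up table; the `toy` frame and its `Notes` column are hypothetical and are not part of Scenario10.csv.

```python
import pandas as pd

# Toy data (hypothetical values, not from Scenario10.csv)
toy = pd.DataFrame({
    "Home Ownership": ["RENT", "NONE", "OTHER", "OWN", "NONE"],
    "Notes": ["NONE", "late", "NONE", "ok", "ok"],
})

# Column-scoped replace: only 'Home Ownership' is changed
scoped = toy.copy()
scoped["Home Ownership"] = scoped["Home Ownership"].replace({"NONE": "OTHER"})

# DataFrame-wide replace: every column containing "NONE" is changed
wide = toy.replace({"NONE": "OTHER"})

print(scoped["Home Ownership"].value_counts(dropna=False))
print("NONE left in Notes (scoped):", (scoped["Notes"] == "NONE").sum())
print("NONE left in Notes (wide):  ", (wide["Notes"] == "NONE").sum())
```

Scoping the replacement is usually the safer choice when other text columns might legitimately contain the same value.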

<h3 style="color:blue;">Create dummy variables for Home Ownership </h3>

- Create the dummy variables *(e.g. MORTGAGE, RENT, OWN, OTHER)*
- Concatenate them with the original DataFrame *(e.g. dummies_HomeOwnership with df)*
- Optionally drop the original variable that was used to create the dummy variables *(e.g. Home Ownership)*; this step is commented out in Code Block 9, so the column stays in df
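
The three steps above can be sketched on a small made-up table. The `toy` frame and its values are hypothetical; `dtype=int` is an assumption added here so the dummies come out as 0/1 integers rather than True/False.

```python
import pandas as pd

# Toy data (hypothetical values, not from Scenario10.csv)
toy = pd.DataFrame({
    "Home Ownership": ["MORTGAGE", "RENT", "OWN", "OTHER", "RENT"],
    "Interest Rate": [7.5, 12.1, 9.3, 14.0, 11.2],
})

# Step 1: one 0/1 column per category
dummies = pd.get_dummies(toy["Home Ownership"], dtype=int)

# Step 2: attach the dummy columns to the original frame
toy = pd.concat([toy, dummies], axis=1)

# Step 3: drop the text column once the dummies exist
toy_model = toy.drop(columns=["Home Ownership"])

print(toy_model)
```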

In [None]:
#Code Block 9
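#dtype = int keeps the dummies as 0/1 integers (pandas 2.x defaults to True/False, which sm.OLS may reject)
dummies_HomeOwnership = pd.get_dummies(df['Home Ownership'], drop_first = False, dtype = int)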
df = pd.concat([df, dummies_HomeOwnership], axis = 1)
#df = df.drop(['Home Ownership'], axis = 1)
df.head()

In [None]:
#Code Block 10
print('---------------------------------------------------')
print(df['MORTGAGE'].value_counts(dropna=False))
print('---------------------------------------------------')
print(df['OWN'].value_counts(dropna=False))
print('---------------------------------------------------')
print(df['RENT'].value_counts(dropna=False))
print('---------------------------------------------------')
print(df['OTHER'].value_counts(dropna=False))
print('---------------------------------------------------')

In [None]:
#Code Block 11
df_reg = df.copy()

<h3 style="color:blue;">Create a regression model to include Home Ownership</h3>

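- MORTGAGE is left out as the baseline category to avoid perfect multicollinearity (the dummy variable trap): the four dummies always sum to 1, which duplicates the constant term
- If OWN, RENT, and OTHER are all 0, then it can be inferred that MORTGAGE is 1.  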
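
To see why one dummy has to be left out, the sketch below builds a design matrix with a constant plus all four dummies and compares its rank with the version that drops MORTGAGE. The data are made up for illustration, and `np.linalg.matrix_rank` is used here only as a quick diagnostic; it is not part of the notebook's workflow.

```python
import numpy as np
import pandas as pd

# Toy category column (hypothetical values)
toy = pd.Series(["MORTGAGE", "RENT", "OWN", "OTHER", "RENT", "MORTGAGE"])
dummies = pd.get_dummies(toy, dtype=float)

const = np.ones((len(dummies), 1))

# Constant + all four dummies: the dummies sum to the constant column
X_all = np.hstack([const, dummies.to_numpy()])

# Constant + three dummies (MORTGAGE is the baseline)
X_drop = np.hstack([const, dummies.drop(columns=["MORTGAGE"]).to_numpy()])

print("all dummies :", X_all.shape[1], "columns, rank", np.linalg.matrix_rank(X_all))
print("drop one    :", X_drop.shape[1], "columns, rank", np.linalg.matrix_rank(X_drop))
```

When the rank is lower than the number of columns, the coefficients are not uniquely determined, which is the situation that dropping the baseline category avoids.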


In [None]:
#Code Block 12
df_reg.columns

In [None]:
#Code Block 13
df_reg = df_reg[['Amount Funded', 'Total Debt', 'Annual Income', 'Revolving Balance',
                'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income',
                'Income Verification', 'IncVer_Income_act', 'OWN', 'RENT', 'OTHER','Interest Rate']]
X = df_reg.drop(["Interest Rate"], axis = 1)
y = df_reg[['Interest Rate']]

In [None]:
#Code Block 14
import statsmodels
import statsmodels.api as sm

In [None]:
#Code Block 15
X = sm.add_constant(X) # adding a constant

reg1 = sm.OLS(y, X).fit()

predictions1 = reg1.predict(X)
resid1 = reg1.resid
reg1.summary()

In [None]:
#Code Block 16

#Create Predictions dataframe
df_predictions1 = pd.DataFrame(predictions1)
df_predictions1=df_predictions1.rename(columns = {0:'Int_Pred1'})

#Create Residuals dataframe
df_resid1 = pd.DataFrame(resid1)
df_resid1=df_resid1.rename(columns = {0:'Resid1'})


#Concat results into one dataframe
df_reg_results = pd.concat([df_reg, df_predictions1, df_resid1], axis=1)

df_reg_results.head()

In [None]:
#Code Block 17
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
#Code Block 18
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns #adds a column with the labels
round(vif, 1).sort_values(by = 'VIF Factor', ascending = False)

In [None]:
#Code Block 19

corrMatrix = df_reg.corr()
df_corrMatrix = pd.DataFrame(corrMatrix)
round(df_corrMatrix,3)

In [None]:
#Code Block 20
colormap = plt.cm.viridis
plt.figure(figsize=(14,10))
plt.title('Correlation Heat Map', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

In [None]:
#Code Block 21
plt.figure(figsize=(20,10)) #changes area of scatterplot
sns.regplot(x='Int_Pred1', y='Resid1',
              data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
              line_kws={'color': 'black'})

<h2 style="color:blue;">Create interaction and re-run regression</h2>

- Interaction variable for **OWN** and **Total Revolving Credit Line**

In [None]:
#Code Block 22
df.head()

In [None]:
#Code Block 23
df['Own_RevLine_act'] = df['Total Revolving Credit Line'] * df['OWN']

In [None]:
#Code Block 24
df.tail()

In [None]:
#Code Block 25
df.columns

In [None]:
#Code Block 26
df_reg = df[['Amount Funded', 'Total Debt', 'Annual Income', 'Revolving Balance',
                'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income',
                'Income Verification', 'IncVer_Income_act', 'OWN', 'RENT', 'OTHER','Own_RevLine_act','Interest Rate']]
X = df_reg.drop(["Interest Rate"], axis = 1)
y = df_reg[['Interest Rate']]

In [None]:
#Code Block 27
X = sm.add_constant(X) # adding a constant

reg2 = sm.OLS(y, X).fit()

predictions2 = reg2.predict(X)
resid2 = reg2.resid
reg2.summary()

In [None]:
#Code Block 28

#Create Predictions dataframe
df_predictions2 = pd.DataFrame(predictions2)
df_predictions2=df_predictions2.rename(columns = {0:'Int_Pred2'})

#Create Residuals dataframe
df_resid2 = pd.DataFrame(resid2)
df_resid2=df_resid2.rename(columns = {0:'Resid2'})


#Concat results into one dataframe
df_reg_results = pd.concat([df_reg_results, df_predictions2, df_resid2], axis=1)

df_reg_results.head()

In [None]:
#Code Block 29
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns #adds a column with the labels
round(vif, 1).sort_values(by = 'VIF Factor', ascending = False)

In [None]:
#Code Block 30

corrMatrix = df_reg.corr()
df_corrMatrix = pd.DataFrame(corrMatrix)
round(df_corrMatrix,3)

In [None]:
#Code Block 31
colormap = plt.cm.viridis
plt.figure(figsize=(14,10))
plt.title('Correlation Heat Map', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

In [None]:
#Code Block 32
plt.figure(figsize=(20,10)) #changes area of scatterplot
sns.regplot(x='Int_Pred2', y='Resid2',
              data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
              line_kws={'color': 'black'})

### Graphically looking at residuals using lowess

- LOWESS (Locally Weighted Scatterplot Smoothing) is a regression tool that fits a smooth curve through a scatter plot, making it easier to see the relationship between variables and spot trends. In a residual plot, a curve that stays flat near zero suggests the model has not missed a nonlinear pattern.
- https://www.statisticshowto.com/lowess-smoothing/

In [None]:
#Code Block 33
sns.set(style='dark')
plt.figure(figsize=(20,14))
#top left Amount Funded
ax1 = plt.subplot2grid((2, 2), (0, 0))
plt.title('Amount Funded', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax1 = sns.regplot(x='Amount Funded', y='Resid2', lowess=True,
                  data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})
#top right Total Debt
ax2 = plt.subplot2grid((2, 2), (0, 1))
plt.title('Total Debt', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax2 = sns.regplot(x='Total Debt', y='Resid2', lowess=True,
                  data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})
#bottom left Annual Income
ax3 = plt.subplot2grid((2, 2), (1, 0))
plt.title('Annual Income', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax3 = sns.regplot(x='Annual Income', y='Resid2', lowess=True,
                  data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})
#bottom right Total Revolving Credit Line
ax4 = plt.subplot2grid((2, 2), (1, 1))
plt.title('Total Revolving Credit Line', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax4 = sns.regplot(x= 'Total Revolving Credit Line', y='Resid2', lowess=True,
                  data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})

<h2 style="color:blue;">Remove outliers and re-run regression</h2>


In [None]:
#Code Block 34
sns.set(style='dark')
plt.figure(figsize=(20,20))


ax1 = plt.subplot2grid((3, 2), (0, 0))
ax1.grid(True, which='major')  #positional flag: the 'b' keyword was renamed 'visible' in newer Matplotlib
ax1.grid(True, which='minor')
plt.title('Amount Funded', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax1 = sns.regplot(x='Amount Funded', y='Interest Rate',
                  data = df_reg_results, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})


ax2 = plt.subplot2grid((3, 2), (0, 1))
ax2.grid(True, which='major')
ax2.grid(True, which='minor')
plt.title('Total Debt', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax2 = sns.regplot(x='Total Debt', y='Interest Rate',
                  data = df_reg_results, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})


ax3 = plt.subplot2grid((3, 2), (1, 0))
ax3.grid(True, which='major')
ax3.grid(True, which='minor')
plt.title('Annual Income', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax3 = sns.regplot(x='Annual Income', y='Interest Rate',
                  data = df_reg_results, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})

ax4 = plt.subplot2grid((3, 2), (1, 1))
ax4.grid(True, which='major')
ax4.grid(True, which='minor')
plt.title('Loan_Income', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax4 = sns.regplot(x='Loan_Income', y='Interest Rate',
                  data = df_reg_results, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})

ax5 = plt.subplot2grid((3, 2), (2, 0))
ax5.grid(True, which='major')
ax5.grid(True, which='minor')
plt.title('Revolving Balance', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax5 = sns.regplot(x='Revolving Balance', y='Interest Rate',
                  data = df_reg_results, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})


ax6 = plt.subplot2grid((3, 2), (2, 1))
ax6.grid(True, which='major')
ax6.grid(True, which='minor')
plt.title('Total Revolving Credit Line', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax6 = sns.regplot(x='Total Revolving Credit Line', y='Interest Rate',
                  data = df_reg_results, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})

In [None]:
#Code Block 35
df.info()

In [None]:
#Code Block 36
df_new = df.copy()
df_new = df_new[df_new['Annual Income']<500000]
df_new = df_new[df_new['Total Debt']<1000000]
df_new = df_new[df_new['Revolving Balance']<150000]
df_new = df_new[df_new['Total Revolving Credit Line']<250000]
df_new.info()
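
The same kind of filtering can be written as a single boolean mask built from a dictionary of cutoffs, which also makes it easy to report how many rows were removed. This is only a sketch on made-up data: the `toy` frame, the `limits` dictionary, and the `removed` counter are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical values, not from Scenario10.csv)
toy = pd.DataFrame({
    "Annual Income": [60_000, 750_000, 85_000, 120_000],
    "Total Debt": [20_000, 50_000, 1_500_000, 90_000],
})

# Upper cutoffs per column; rows at or above any cutoff are treated as outliers
limits = {"Annual Income": 500_000, "Total Debt": 1_000_000}

# Start with all rows kept, then require every column to be under its cutoff
mask = pd.Series(True, index=toy.index)
for col, cap in limits.items():
    mask &= toy[col] < cap

toy_new = toy[mask]
removed = len(toy) - len(toy_new)
print(f"kept {len(toy_new)} rows, removed {removed}")
```

Keeping the cutoffs in one place makes them easier to review and adjust than a chain of separate filter lines.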

In [None]:
#Code Block 37
sns.set(style='dark')
plt.figure(figsize=(20,20))


ax1 = plt.subplot2grid((3, 2), (0, 0))
ax1.grid(True, which='major')  #positional flag: the 'b' keyword was renamed 'visible' in newer Matplotlib
ax1.grid(True, which='minor')
plt.title('Amount Funded', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax1 = sns.regplot(x='Amount Funded', y='Interest Rate',
                  data = df_new, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})


ax2 = plt.subplot2grid((3, 2), (0, 1))
ax2.grid(True, which='major')
ax2.grid(True, which='minor')
plt.title('Total Debt', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax2 = sns.regplot(x='Total Debt', y='Interest Rate',
                  data = df_new, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})


ax3 = plt.subplot2grid((3, 2), (1, 0))
ax3.grid(True, which='major')
ax3.grid(True, which='minor')
plt.title('Annual Income', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax3 = sns.regplot(x='Annual Income', y='Interest Rate',
                  data = df_new, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})

ax4 = plt.subplot2grid((3, 2), (1, 1))
ax4.grid(True, which='major')
ax4.grid(True, which='minor')
plt.title('Loan_Income', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax4 = sns.regplot(x='Loan_Income', y='Interest Rate',
                  data = df_new, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})

ax5 = plt.subplot2grid((3, 2), (2, 0))
ax5.grid(True, which='major')
ax5.grid(True, which='minor')
plt.title('Revolving Balance', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax5 = sns.regplot(x='Revolving Balance', y='Interest Rate',
                  data = df_new, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})


ax6 = plt.subplot2grid((3, 2), (2, 1))
ax6.grid(True, which='major')
ax6.grid(True, which='minor')
plt.title('Total Revolving Credit Line', fontweight='bold', color = 'green', fontsize='17', horizontalalignment='center')
ax6 = sns.regplot(x='Total Revolving Credit Line', y='Interest Rate',
                  data = df_new, scatter_kws={"alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
                  line_kws={'color': 'black'})

## Run new regression model with filtered dataset - df_new

### Create new regression datasets without outliers

In [None]:
#Code Block 38
df_new.columns

In [None]:
#Code Block 39
df_reg = df_new[['Amount Funded', 'Total Debt', 'Annual Income', 'Revolving Balance',
                'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income',
                'Income Verification', 'IncVer_Income_act', 'OWN', 'OTHER', 'RENT','Own_RevLine_act','Interest Rate']]
X = df_reg.drop(["Interest Rate"], axis = 1)
y = df_reg[['Interest Rate']]

In [None]:
#Code Block 40
X = sm.add_constant(X) # adding a constant

reg3 = sm.OLS(y, X).fit()

predictions3 = reg3.predict(X)
resid3 = reg3.resid
reg3.summary()

In [None]:
#Code Block 41

#Create Predictions dataframe
df_predictions3 = pd.DataFrame(predictions3)
df_predictions3 =df_predictions3.rename(columns = {0:'Int_Pred3'})

#Create Residuals dataframe
df_resid3 = pd.DataFrame(resid3)
df_resid3=df_resid3.rename(columns = {0:'Resid3'})


#Concat results into one dataframe
df_reg_results = pd.concat([df_reg, df_predictions3, df_resid3], axis=1)

df_reg_results.head()

In [None]:
#Code Block 42
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns #adds a column with the labels
round(vif, 1).sort_values(by = 'VIF Factor', ascending = False)

In [None]:
#Code Block 43

corrMatrix = df_reg.corr()
df_corrMatrix = pd.DataFrame(corrMatrix)
round(df_corrMatrix,3)

In [None]:
#Code Block 44
colormap = plt.cm.viridis
plt.figure(figsize=(14,10))
plt.title('Correlation Heat Map', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

In [None]:
#Code Block 45
plt.figure(figsize=(20,10)) #changes area of scatterplot
sns.regplot(x='Int_Pred3', y='Resid3',
              data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
              line_kws={'color': 'black'})