 <h1 style="color:blue;"> Scenario 8 notebook</h1>

- C1S8.Py01	Create a ratio of balance of revolving accounts to total credit line
- C1S8.Py02	Create a revolving balance to income ratio
- C1S8.Py03	Multiple Regression with all features
- C1S8.Py04	Calculating VIF for features in a model
- C1S8.Py05	Re-run the multiple regression because of multicollinearity

In [None]:
#Code Block 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



#style options

%matplotlib inline
#if you want graphs to display automatically without plt.show()

pd.set_option('display.max_columns',500) #allows for up to 500 columns to be displayed when viewing a dataframe

plt.style.use('seaborn') #a style that can be used for plots; in matplotlib 3.6+ this style is named 'seaborn-v0_8'



In [None]:
#Code Block 2
df = pd.read_csv('data/Scenario8.csv', index_col = 0, header=0)
    #index_col=0 sets the first column as the index
    #header=0 uses the top row as the column headers

In [None]:
#Code Block 3
df.head(3)

In [None]:
#Code Block 4
df.info()

<h2 style="color:blue;">C1.S8.Py01 - Create a ratio of balance of revolving accounts to total credit line</h2>

- Revolving Accounts - Total credit revolving balance
- Total Revolving Credit Line - Total revolving high credit/credit limit

### What do you do with the null values?


In [None]:
#Code Block 5
df[df['Total Revolving Credit Line'].isnull()]

### Are there any credit lines that are == 0?

In [None]:
#Code Block 6
df[df['Total Revolving Credit Line']==0]

In [None]:
#Code Block 7
df['RevBal_Line'] = df['Revolving Accounts'] / df['Total Revolving Credit Line']

In [None]:
#Code Block 8
df[['Revolving Accounts', 'Total Revolving Credit Line', 'RevBal_Line']].describe()

In [None]:
#Code Block 9
round(df[['Revolving Accounts', 'Total Revolving Credit Line', 'RevBal_Line']].describe(), 2)

In [None]:
#Code Block 10
df[['Revolving Accounts', 'Total Revolving Credit Line',
    'RevBal_Line']].sort_values(by='RevBal_Line', ascending = False).head(10)

In [None]:
#Code Block 11
df[['Revolving Accounts', 'Total Revolving Credit Line',
    'RevBal_Line']].sort_values(by='RevBal_Line', ascending = True).head(5)

### How to use .iloc (integer location)
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
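
A minimal sketch of how `.iloc` differs from `.loc`, using a small made-up frame (the column names and values here are illustrative and are not taken from Scenario8.csv). `.iloc` selects by integer position and excludes the end of a slice, while `.loc` selects by label and includes it.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["r1", "r2", "r3"],
)

# .iloc is position-based: all rows, columns at positions 0 and 1 (end excluded)
by_position = toy.iloc[:, 0:2]

# .loc is label-based: all rows, columns 'a' through 'b' (end included)
by_label = toy.loc[:, "a":"b"]

# A list of positions picks non-adjacent columns
picked = toy.iloc[:, [0, 2]]
```

Both `by_position` and `by_label` hold columns `a` and `b`; `picked` holds `a` and `c`.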

In [None]:
#Code Block 12
df.iloc[:, 0:9].info()

In [None]:
#Code Block 13
df[['Revolving Accounts', 'Total Revolving Credit Line']] = df[['Revolving Accounts', 'Total Revolving Credit Line']].fillna(0)
df.iloc[:, 0:9].info()

In [None]:
#Code Block 14
df[df['Total Revolving Credit Line']==0].head()

In [None]:
#Code Block 15
df['RevBal_Line'] = df['Revolving Accounts'] / df['Total Revolving Credit Line']
round(df[['Revolving Accounts', 'Total Revolving Credit Line', 'RevBal_Line']].describe(), 2)

### For all rows where ['Total Revolving Credit Line'] == 0, set RevBal_Line to 0.

In [None]:
#Code Block 16
def revline(c):
    # c is one row of df; return 0 when there is no credit line to divide by
    if c['Total Revolving Credit Line'] == 0:
        return 0
    else:
        return c['Revolving Accounts'] / c['Total Revolving Credit Line']

df['RevBal_Line'] = df.apply(revline, axis=1)
round(df[['Revolving Accounts', 'Total Revolving Credit Line', 'RevBal_Line']].describe(), 2)
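
The row-wise `apply` in Code Block 16 can be slow on large frames. As a sketch of a vectorized alternative (the toy values are made up; only the column names match this notebook), zero denominators are swapped for NaN before dividing, and the resulting NaN ratios are then set to 0. Note the assumption that nulls have already been filled, as in Code Block 13, because `fillna(0)` would also zero out any ratio that is null for another reason.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Revolving Accounts": [500.0, 0.0, 1200.0],
    "Total Revolving Credit Line": [1000.0, 0.0, 4800.0],
})

num = toy["Revolving Accounts"]
den = toy["Total Revolving Credit Line"]

# Replace zero denominators with NaN so no row divides by zero,
# then set the ratio to 0 for those rows
toy["RevBal_Line"] = (num / den.replace(0, np.nan)).fillna(0)
```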

In [None]:
#Code Block 17
df[df['Total Revolving Credit Line']==0].head()

<h2 style="color:blue;">C1.S8.Py02 - Create a revolving balance to income ratio</h2>

### Assumptions to check when creating a ratio:
- Are there any null values?  
    - Fill in null values
    - Leave as is and new ratio will be null
- Are there any **zero (0)** values for the denominator?
    - Check the corresponding numerator. If the numerator is also **zero (0)** in every such row, use a function that sets the ratio to 0 whenever the denominator and numerator are both 0.
    - If the numerator is not zero (0), then based on the situation, you will have to make a judgment.
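
The checklist above can be sketched as a small helper that counts the cases worth inspecting before a ratio is computed. `ratio_checks` is a hypothetical name, not a pandas function, and the toy frame is made up for illustration.

```python
import pandas as pd

def ratio_checks(frame, numerator, denominator):
    """Count the cases to inspect before computing numerator / denominator."""
    zero_den = frame[denominator] == 0
    return {
        "null_numerator": int(frame[numerator].isnull().sum()),
        "null_denominator": int(frame[denominator].isnull().sum()),
        "zero_denominator": int(zero_den.sum()),
        # A zero denominator with a non-zero numerator needs a judgment call
        "zero_den_nonzero_num": int((zero_den & (frame[numerator] != 0)).sum()),
    }

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Revolving Accounts": [500.0, 0.0, 300.0, None],
    "Annual Income": [40000.0, 0.0, 0.0, 52000.0],
})
checks = ratio_checks(toy, "Revolving Accounts", "Annual Income")
```

In this toy frame, one row has a zero income with a non-zero balance, which is the case that calls for a judgment.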

In [None]:
#Code Block 18
df.iloc[:, 0:9].info()

In [None]:
#Code Block 19
df[df['Annual Income']==0]

In [None]:
#Code Block 20
df['RevBal_Income'] = df['Revolving Accounts'] / df['Annual Income']
round(df[['Revolving Accounts', 'Annual Income', 'RevBal_Income']].describe(), 2)

<h2 style="color:blue;">C1.S8.Py03 - Multiple Regression with all features</h2>

- Create X (2 options)
    - Create X using .iloc
    - Create X using label names
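
As a rough sketch of the two options, the toy frame below (made-up values and a made-up column order) shows that selecting by position and selecting by label can return the same X. Positions depend on the column order, so they need rechecking whenever columns are added or moved.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Interest Rate": [11.5, 7.9, 15.2],
    "Amount Funded": [10000, 5000, 20000],
    "Annual Income": [60000, 45000, 80000],
})

# Option 1: by integer position (depends on the current column order)
X_pos = toy.iloc[:, [1, 2]]

# Option 2: by label (longer to type, but unaffected by column reordering)
X_lab = toy[["Amount Funded", "Annual Income"]]

# Positions can be looked up from labels instead of being counted by hand
positions = [toy.columns.get_loc(c) for c in ["Amount Funded", "Annual Income"]]
```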

<h3 style="color:blue;">The last regression model included:</h3>

- X = ['Amount Funded', 'Annual Income', 'Total Debt', 'Loan_Income', 'Debt_Income']
- y = ['Interest Rate']

<h3 style="color:blue;">This model will include:</h3>

- X = ['Amount Funded', 'Annual Income', 'Total Debt', 'Revolving Accounts', 'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income']
- y = ['Interest Rate']

<h3 style="color:blue;">Create X with .iloc</h3>

In [None]:
#Code Block 21
df.info()

In [None]:
#Code Block 22

X = df.iloc[:, [4, 5, 6, 7, 8, 26, 27, 28, 29]]
display(X.head())

y = df[['Interest Rate']]
y.head()

In [None]:
#Code Block 23
df_reg = df[['Interest Rate', 'Amount Funded', 'Annual Income', 'Total Debt', 'Revolving Accounts', 'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income']]
df_reg.info()

In [None]:
#Code Block 24
import statsmodels
import statsmodels.api as sm

In [None]:
#Code Block 25
X = sm.add_constant(X) # adding a constant

reg1 = sm.OLS(y, X).fit()

predictions1 = reg1.predict(X)
resid1 = reg1.resid
reg1.summary()

### Create the predictions dataframe

In [None]:
#Code Block 26
df_predictions1 = pd.DataFrame(predictions1)
df_predictions1=df_predictions1.rename(columns = {0:'Int_Pred1'})
df_predictions1.head()

### Create the residuals dataframe

In [None]:
#Code Block 27
df_resid1 = pd.DataFrame(resid1)
df_resid1=df_resid1.rename(columns = {0:'Resid1'})
df_resid1.head()

### Create DataFrame for Actual and its Predictions

In [None]:
#Code Block 27
df_reg_results = pd.concat([X, y, df_predictions1, df_resid1], axis=1)
df_reg_results.head()

<h3 style="color:blue;">Looking graphically at the predicted y and its residuals</h3>

https://seaborn.pydata.org/generated/seaborn.residplot.html

In [None]:
#Code Block 28
plt.figure(figsize=(20,10)) #changes area of scatterplot
sns.regplot(x='Int_Pred1', y='Resid1',
              data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
              line_kws={'color': 'black'})


In [None]:
#Code Block 29
plt.figure(figsize=(20,10)) #changes area of scatterplot
sns.regplot(x='Int_Pred1', y='Resid1',
              data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
              line_kws={'color': 'black'})
plt.xlim(5, 23)
plt.ylim(-10, 15)

In [None]:
#Code Block 30
display(df_reg_results.sort_values(by = 'Resid1').head())
df_reg_results.sort_values(by = 'Resid1', ascending=False).head()

In [None]:
#Code Block 31
display(df_reg_results.sort_values(by = 'Annual Income', ascending=False).head())

In [None]:
#Code Block 32
df_reg_results = pd.concat([df_reg_results, df['Home Ownership']], axis=1)
df_reg_results.head()

In [None]:
#Code Block 33
#lmplot creates its own figure, so a separate plt.figure() call would only add an empty one; use aspect to size the panels
sns.lmplot(x='Int_Pred1', y='Resid1', col="Home Ownership", col_wrap=2, data = df_reg_results, palette="Set1",
           aspect = 2, scatter_kws={"alpha":0.15,"s":150,"linewidth":2,"edgecolor":"white"}, line_kws={'color': 'red'})
plt.xlim(5, 23)
plt.ylim(-10, 15)

<h2 style="color:blue;">C1.S8.Py04 - Calculating VIF for features to test multicollinearity</h2>

### What is VIF?



The Variance Inflation Factor (VIF) is a measure of collinearity among the predictor variables in a multiple regression. For a given predictor, it is the ratio of the variance of that predictor's beta in the full model to the variance of the same beta if the predictor were fit alone.

### How to calculate VIF
- Run a multiple regression for each feature in the X dataset. For example, if X has 6 features [x1, x2, x3, x4, x5, and x6], then run six models where each feature in turn is the target variable, such as x1 ~ x2 + x3 + x4 + x5 + x6.

**Steps for Implementing VIF**
- Run a multiple regression.
- Calculate the VIF factors.
    - Run a multiple regression for each feature in the X dataset. For example, if X has 6 features [x1, x2, x3, x4, x5, and x6], then run six models where each feature in turn is the target variable, such as x1 ~ x2 + x3 + x4 + x5 + x6.
    - Calculate VIF for each feature: 1 / (1 - R^2), where R^2 comes from that feature's regression.
- Inspect the factor for each predictor variable. If a VIF is in the 5-10 range or higher, multicollinearity is likely present and you should consider dropping the variable.
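
The steps above can be sketched by hand with NumPy. This is an illustration of the 1 / (1 - R^2) calculation on simulated data, not the statsmodels implementation; `vif_by_hand` is a hypothetical helper name, and the three simulated features are assumptions made for the example.

```python
import numpy as np

def vif_by_hand(X):
    """VIF for each column of X (n_samples x n_features, no constant column)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        # R^2 = 1 - SS_res / SS_tot (residuals have mean zero because of the intercept)
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Simulated features: x3 is nearly a copy of x1, x2 is unrelated to both
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
```

In this setup, x1 and x3 should show large VIFs because each is almost fully explained by the other, while the VIF for x2 should stay close to 1.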

https://etav.github.io/python/vif_factor_python.html

In [None]:
#Code Block 34
df.head()

In [None]:
#Code Block 35
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
#Code Block 36
X.info()

In [None]:
#Code Block 37
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns #adds a column with the labels
round(vif, 1).sort_values(by = 'VIF Factor', ascending = False)

In [None]:
#Code Block 38
sns.pairplot(df_reg)

In [None]:
#Code Block 39
corrMatrix = df_reg.corr()
df_corrMatrix = pd.DataFrame(corrMatrix)
df_corrMatrix

In [None]:
#Code Block 40
plt.figure(figsize=(10,10))
sns.heatmap(df_corrMatrix)

In [None]:
#Code Block 41
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
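
Besides reading the heatmap, the correlation matrix can be scanned for highly correlated pairs in code. The sketch below uses simulated data; `high_corr_pairs` is a hypothetical helper name and the 0.8 threshold is an arbitrary assumption, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Simulated data: b is almost a multiple of a, c is unrelated to both
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + 0.05 * rng.normal(size=300),
    "c": rng.normal(size=300),
})
pairs = high_corr_pairs(toy)
```

Calling `high_corr_pairs(df_reg)` would apply the same scan to the regression features used above.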

<h2 style="color:blue;">C1.S8.Py05 - Re-run a multiple regression because of multicollinearity</h2>

- Remove either ['Revolving Accounts'] or  ['Total Revolving Credit Line'] and see if it makes a difference.

In [None]:
#Code Block 42
X = df[['Amount Funded', 'Annual Income', 'Total Debt', 'Revolving Accounts', 'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income']]
y = df[['Interest Rate']]

In [None]:
#Code Block 43
X = sm.add_constant(X) # adding a constant

reg1 = sm.OLS(y, X).fit()

predictions1 = reg1.predict(X)
resid1 = reg1.resid
reg1.summary()

In [None]:
#Code Block 44
X = df[['Amount Funded', 'Annual Income', 'Total Debt', 'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income']]

In [None]:
#Code Block 45
X = sm.add_constant(X) # adding a constant

reg1 = sm.OLS(y, X).fit()

predictions1 = reg1.predict(X)
resid1 = reg1.resid
reg1.summary()

In [None]:
#Code Block 46
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns #adds a column with the labels
round(vif, 1).sort_values(by = 'VIF Factor', ascending = False)

In [None]:
#Code Block 47

corrMatrix = X[['Amount Funded', 'Annual Income', 'Total Debt', 'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income']].corr()
df_corrMatrix = pd.DataFrame(corrMatrix)
round(df_corrMatrix,3)

In [None]:
#Code Block 48
colormap = plt.cm.viridis
plt.figure(figsize=(14,10))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)