<h1 style="color:blue;">Outline of Scenario 9 notebook:</h1>

- C1.S9.Py01 -  One-hot encoding of income verification category
- C1.S9.Py02 -  Manually encoding the income verification category
- C1.S9.Py03 -  Combining multiple categories of verified income and using label encoding
- C1.S9.Py04 -  Creating a new regression model with income verification
- C1.S9.Py05 -  Adding an interaction feature variable and running a new regression model


In [None]:
#Code Block 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



#style options

%matplotlib inline
#if you want graphs to display automatically without plt.show()

pd.set_option('display.max_columns',500) #allows for up to 500 columns to be displayed when viewing a dataframe

#a style that can be used for plots; newer matplotlib versions name it 'seaborn-v0_8'
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')



In [None]:
#Code Block 2
df = pd.read_csv('data/Scenario9.csv', index_col = 0, header=0)
    # index_col = 0 sets the first column as the index
    # header=0 uses the top row as the column headers

In [None]:
#Code Block 3
df.head(3)

In [None]:
#Code Block 4
df.info()

<h2 style="color:blue;">C1.S9.Py01 - One-hot encoding of income verification category</h2>



In [None]:
#Code Block 5
df['Income Verification'].value_counts(dropna=False)

### Fill in NaN values with "Unverified"

In [None]:
#Code Block 6
df['Income Verification'] = df['Income Verification'].fillna("Unverified")
df['Income Verification'].value_counts(dropna=False)

In [None]:
#Code Block 7
df_IncomeVerification = pd.DataFrame(df['Income Verification'])
df_IncomeVerification.head()

### Create 3 columns based on Income Verification
- For every unique value in Income Verification, a column will be created.  
- Since there are 3 unique values in Income Verification, there will be three columns:
    - Verified
    - Unverified     
    - Source Verified
- If a row is **Verified**, then:
    - Verified = 1
    - Unverified = 0
    - Source Verified = 0
- If a row is **Unverified**, then:
    - Verified = 0
    - Unverified = 1
    - Source Verified = 0
- If a row is **Source Verified**, then:
    - Verified = 0
    - Unverified = 0
    - Source Verified = 1
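
The rule above can be checked on a toy column before running it on the full dataset. This is a minimal sketch: the four values are made up for illustration, and `toy` and `toy_dummies` are hypothetical names that do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy column containing the three categories listed above (made-up rows).
toy = pd.Series(
    ["Verified", "Unverified", "Source Verified", "Verified"],
    name="Income Verification",
)

# dtype=int requests 0/1 integers; newer pandas versions return True/False by default.
toy_dummies = pd.get_dummies(toy, dtype=int)
print(toy_dummies)

# Every row should have exactly one indicator column set to 1.
print(toy_dummies.sum(axis=1).tolist())
```

The columns come out in sorted order of the category names, so the column order may differ from the order used in the list above.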

In [None]:
#Code Block 8
#dtype = int gives 0/1 columns; newer pandas versions default to True/False
dummies_IncomeVerification = pd.get_dummies(df_IncomeVerification['Income Verification'], drop_first = False, dtype = int)
dummies_IncomeVerification.head()

In [None]:
#Code Block 9
df_IncomeVerification = pd.concat([df_IncomeVerification, dummies_IncomeVerification], axis = 1)
df_IncomeVerification.head()

<h3 style="color:blue;">Let's create dummy variables in a more efficient way </h3>

- Create the dummy variables *(ex. Source Verified, Unverified, and Verified)*
- Concatenate with the original DataFrame *(ex. df_IncomeVerification with df)*
- Drop the original variable that was used to create the dummy variables *(ex. Income Verification)*
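
As a possible shortcut, `pd.get_dummies` also accepts a whole DataFrame plus a `columns` argument, which performs the three steps above in one call. The sketch below uses a made-up two-column frame (`toy_df` is a hypothetical name); note that with this form the new columns are prefixed with the original column name by default.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "Annual Income": [50000, 72000, 64000],
    "Income Verification": ["Verified", "Unverified", "Source Verified"],
})

# Encodes the listed column, appends the dummies, and drops the original column.
toy_encoded = pd.get_dummies(toy_df, columns=["Income Verification"], dtype=int)
print(toy_encoded.columns.tolist())
```

The step-by-step version in the next cells keeps the plain category names as column names, which may be easier to read in a regression summary.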

In [None]:
#Code Block 10
df_dummy = df.copy()

In [None]:
#Code Block 11
#dtype = int gives 0/1 columns; newer pandas versions default to True/False
dummies_IncomeVerification = pd.get_dummies(df_dummy['Income Verification'], drop_first = False, dtype = int)
df_dummy = pd.concat([df_dummy, dummies_IncomeVerification], axis = 1)
df_dummy = df_dummy.drop(["Income Verification"], axis = 1)
df_dummy.head()

<h2 style="color:blue;">C1.S9.Py02 - Manually encoding the income verification category</h2>

### 4 ways to convert a categorical variable (object) into a numerical variable (int)
- One-hot encoding (dummy variables) ***Previous Video***
- pandas.DataFrame.replace
- A custom function
- Label encoding from scikit-learn (sequential numbering based on alphabetical order) ***Next Video***


### Use pandas.DataFrame.replace
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
- Examples:
    - https://www.geeksforgeeks.org/python-pandas-dataframe-replace/
    - http://queirozf.com/entries/pandas-dataframe-replace-examples
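
One detail worth knowing: a flat dictionary passed to `DataFrame.replace` is applied to every column, while a nested dictionary keyed by column name limits the replacement to that column. The sketch below illustrates the difference on a made-up frame; `toy_df` and the `Notes` column are hypothetical and are not part of the Scenario 9 data.

```python
import pandas as pd

# Made-up frame; dtype=object keeps the text columns as generic Python objects.
toy_df = pd.DataFrame(
    {
        "Income Verification": ["Unverified", "Verified", "Source Verified"],
        "Notes": ["Verified", "n/a", "Unverified"],  # hypothetical second text column
    },
    dtype=object,
)
mapping = {"Unverified": 0, "Verified": 1, "Source Verified": 2}

# Flat dict: matching values are replaced in every column, including Notes.
toy_all = toy_df.replace(mapping)

# Nested dict: only the named column is touched.
toy_one = toy_df.replace({"Income Verification": mapping})

print(toy_all)
print(toy_one)
```

The flat form is fine when the category strings appear in only one column; the nested form is a safer choice when other text columns might contain the same words.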

In [None]:
#Code Block 12
df_label_ex1 = df.copy()

In [None]:
#Code Block 13
df_label_ex1 = df_label_ex1.replace({"Unverified":0})
df_label_ex1.head()

In [None]:
#Code Block 14
df_label_ex2 = df.copy()

In [None]:
#Code Block 15
df_label_ex2 = df_label_ex2.replace({'Unverified': 0, 'Verified':1, "Source Verified":2})
df_label_ex2.head()

In [None]:
#Code Block 16
print(df_label_ex2['Income Verification'].value_counts())
df_label_ex2.info()

### Custom Function for converting categorical variable to numeric

In [None]:
#Code Block 17
df_label_ex3 = df.copy()

### Create a new variable

In [None]:
#Code Block 18
def incomever(v):
    if v['Income Verification'] == 'Source Verified':
        return 2
    elif v['Income Verification'] == 'Verified':
        return 1
    else:
        return 0
df_label_ex3['IncomeVer_num'] = df_label_ex3.apply(incomever, axis = 1)
display(df_label_ex3[['Income Verification', 'IncomeVer_num']].head())
df_label_ex3[['Income Verification', 'IncomeVer_num']].dtypes

### Replace the categorical variable with numeric values using a function

In [None]:
#Code Block 19
df_label_ex4 = df.copy()

In [None]:
#Code Block 20
def incomever(v):
    if v['Income Verification'] == 'Source Verified':
        return 2
    elif v['Income Verification'] == 'Verified':
        return 1
    else:
        return 0
df_label_ex4['Income Verification'] = df_label_ex4.apply(incomever, axis = 1)
display(df_label_ex4.head())
df_label_ex4['Income Verification'].dtypes

<h2 style="color:blue;">C1.S9.Py03 - Combining multiple categories of verified income and using label encoding</h2>



In [None]:
#Code Block 21
df_reg = df.copy()

In [None]:
#Code Block 22
df_reg['Income Verification'].value_counts()

In [None]:
#Code Block 23
df_reg = df_reg.replace({'Source Verified': 'Verified'})
df_reg['Income Verification'].value_counts()

### What is scikit-learn?
- Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.
- It is one of the most widely used machine learning libraries for Python.

https://scikit-learn.org/stable/index.html

In [None]:
#Code Block 24
from sklearn.preprocessing import LabelEncoder
lc = LabelEncoder()

In [None]:
#Code Block 25
df_reg['Income Verification'].head()

### Label encoder converts categorical values to integer codes in alphabetical (sorted) order
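
To see the ordering directly, the fitted encoder's `classes_` attribute lists the labels in sorted order, and each label's position in that list is its code. This is a small sketch on made-up values with all three original categories; `toy` and `toy_encoder` are hypothetical names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values containing all three original categories.
toy = ["Verified", "Unverified", "Source Verified", "Verified"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy)

# classes_ holds the labels in sorted order; the index of a label is its code.
print(list(toy_encoder.classes_))
print(toy_codes.tolist())
```

Because Code Block 23 merged 'Source Verified' into 'Verified', only two labels should remain in `df_reg`, so the encoder in the next cell is expected to assign 0 to 'Unverified' and 1 to 'Verified'.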

In [None]:
#Code Block 26
df_reg['IncomeVer_num'] = lc.fit_transform(df_reg['Income Verification'])
df_reg[['Income Verification', 'IncomeVer_num']].head()

### Drop ['Income Verification'] and rename ['IncomeVer_num'] to ['Income Verification']

In [None]:
#Code Block 27
df_reg = df_reg.drop(["Income Verification"], axis = 1)
df_reg = df_reg.rename(columns = {'IncomeVer_num':'Income Verification'})
df_reg.info()

<h2 style="color:blue;">C1.S9.Py04 - Creating a new regression model with income verification</h2>



In [None]:
#Code Block 28
df_reg.head()

In [None]:
#Code Block 29
df_reg.columns

In [None]:
#Code Block 30
df_reg = df_reg[['Interest Rate', 'Amount Funded', 'Total Debt', 'Annual Income', 'Revolving Accounts',
                'Total Revolving Credit Line', 'Loan_Income', 'Debt_Income', 'RevBal_Line', 'RevBal_Income',
                'Income Verification']].copy() #copy so new columns can be added later without a SettingWithCopyWarning
X = df_reg.drop(["Interest Rate"], axis = 1)
y = df_reg[['Interest Rate']]

In [None]:
#Code Block 31
X.info()

In [None]:
#Code Block 32
import statsmodels
import statsmodels.api as sm

In [None]:
#Code Block 33
X = sm.add_constant(X) # adding a constant

reg1 = sm.OLS(y, X).fit()

predictions1 = reg1.predict(X)
resid1 = reg1.resid
reg1.summary()

In [None]:
#Code Block 34

#Create Predictions dataframe
df_predictions1 = pd.DataFrame(predictions1)
df_predictions1=df_predictions1.rename(columns = {0:'Int_Pred1'})

#Create Residuals dataframe
df_resid1 = pd.DataFrame(resid1)
df_resid1=df_resid1.rename(columns = {0:'Resid1'})


#Concat results into one dataframe
df_reg_results = pd.concat([df_reg, df_predictions1, df_resid1], axis=1)
df_reg_results.head()

In [None]:
#Code Block 35
plt.figure(figsize=(20,10)) #changes area of scatterplot
sns.regplot(x='Int_Pred1', y='Resid1',
              data = df_reg_results, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
              line_kws={'color': 'black'})

In [None]:
#Code Block 36
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
#Code Block 37
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns #adds a column with the labels
round(vif, 1).sort_values(by = 'VIF Factor', ascending = False)

In [None]:
#Code Block 38

corrMatrix = df_reg.corr()
df_corrMatrix = pd.DataFrame(corrMatrix)
round(df_corrMatrix,3)

In [None]:
#Code Block 39
colormap = plt.cm.viridis
plt.figure(figsize=(14,10))
plt.title('Correlation Heat Map', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

<h2 style="color:blue;">C1.S9.Py05 - Adding an interaction feature variable and running a new regression model</h2>

- What is an interaction variable?
    - https://statisticsbyjim.com/regression/interaction-effects/

### Create a variable that shows interactions between ['Income Verification'] and ['Annual Income']
- Multiply ['Income Verification'] by ['Annual Income'] and store the result in a new variable named ['IncVer_Income_act']
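
A quick way to build intuition for this product is to compute it on a few made-up rows. The sketch assumes ['Income Verification'] holds a 0/1 code, as produced by the label encoding earlier in the notebook; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Made-up rows; 0 = unverified, 1 = verified (assumed coding).
toy_df = pd.DataFrame({
    "Annual Income": [40000, 85000, 60000],
    "Income Verification": [0, 1, 1],
})

# The interaction term is the element-wise product of the two columns.
toy_df["IncVer_Income_act"] = toy_df["Annual Income"] * toy_df["Income Verification"]
print(toy_df)
```

With a 0/1 code, the interaction column is zero for unverified rows and equal to annual income for verified rows, so its regression coefficient can be read as how much the income slope differs for verified borrowers.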

In [None]:
#Code Block 40
df_reg.head()

In [None]:
#Code Block 41
df_reg['IncVer_Income_act'] = df_reg['Annual Income'] * df_reg['Income Verification']
df_reg.head()

In [None]:
#Code Block 42
X = df_reg.drop(["Interest Rate"], axis = 1) #this will include IncVer_Income_act
y = df_reg[['Interest Rate']]

In [None]:
#Code Block 43
X = sm.add_constant(X) # adding a constant

reg2 = sm.OLS(y, X).fit()

predictions2 = reg2.predict(X)
resid2 = reg2.resid
reg2.summary()

In [None]:
#Code Block 44

#Create Predictions dataframe
df_predictions2 = pd.DataFrame(predictions2)
df_predictions2=df_predictions2.rename(columns = {0:'Int_Pred2'})

#Create Residuals dataframe
df_resid2 = pd.DataFrame(resid2)
df_resid2=df_resid2.rename(columns = {0:'Resid2'})


#Concat results into one dataframe
df_reg_results2 = pd.concat([df_reg, df_predictions1, df_resid1, df_predictions2, df_resid2], axis=1)
df_reg_results2.head()

In [None]:
#Code Block 45
plt.figure(figsize=(20,10)) #changes area of scatterplot
sns.regplot(x='Int_Pred2', y='Resid2',
              data = df_reg_results2, scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
              line_kws={'color': 'black'})

In [None]:
#Code Block 46
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns #adds a column with the labels
round(vif, 1).sort_values(by = 'VIF Factor', ascending = False)

In [None]:
#Code Block 47

corrMatrix = df_reg.corr()
df_corrMatrix = pd.DataFrame(corrMatrix)
round(df_corrMatrix,3)

In [None]:
#Code Block 48
colormap = plt.cm.viridis
plt.figure(figsize=(14,10))
plt.title('Correlation Heat Map', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)