<h1 style="color:blue;">Outline of Scenario 11 notebook:</h1>

- C1.S11.Py01 – One-hot encoding Loan Purpose and its properties
- C1.S11.Py02 – Fill in Null values with median in length of employment
- C1.S11.Py03 – Creating and applying a function to code delinquencies
- C1.S11.Py04 – Run regression with newest features
- C1.S11.Py05 – Calculating VIF and correlation



In [None]:
#Code Block 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



#style options

%matplotlib inline
#if you want graphs to display automatically without calling plt.show()

pd.set_option('display.max_columns',500) #allows for up to 500 columns to be displayed when viewing a dataframe

plt.style.use('seaborn') #a style that can be used for plots; on matplotlib 3.6+ this style is named 'seaborn-v0_8'



In [None]:
#Code Block 2
df = pd.read_csv('data/Scenario11.csv', index_col = 0, header=0)
    #DOES set the first column to the index
    # and the top row as the headers

In [None]:
#Code Block 3
df.info()

In [None]:
#Code Block 4
df.head()

<h2 style="color:blue;">One-hot encoding Loan Purpose and its properties</h2>

In [None]:
#Code Block 5
df['Loan Purpose'].value_counts()

### If you like seeing results in a DataFrame format, it is easy to do with pd.DataFrame( )

In [None]:
#Code Block 6
pd.DataFrame(df['Loan Purpose'].value_counts())

In [None]:
#Code Block 7
df_reg = df.copy()

<h3 style="color:blue;">Create dummy variables for Loan Purpose </h3>

- Create the dummy variables *(ex. car, creditcard, debt_consolidation, etc)*
- Concatenate with the original DataFrame *(ex. dummies_LoanPurpose with df_reg)*
- Drop the original variable that was used to create the dummy variables *(ex. Loan Purpose)*

In [None]:
#Code Block 8

#Create Dummy Variables for Loan Purpose
dummies_LoanPurpose = pd.get_dummies(df_reg['Loan Purpose'], drop_first = False)
#df_reg = pd.concat([df_reg, dummies_LoanPurpose], axis = 1)

dummies_LoanPurpose.head()

### Add a prefix for each category

In [None]:
#Code Block 9
dummies_LoanPurpose1 = pd.get_dummies(df_reg['Loan Purpose'], prefix='lp', drop_first = False)
dummies_LoanPurpose1.head()

### Drop the first category based on alphabetical order

In [None]:
#Code Block 10
dummies_LoanPurpose2 = pd.get_dummies(df_reg['Loan Purpose'], prefix='lp', drop_first = True)
dummies_LoanPurpose2.head()

### Change the dtype from uint8 (bool in pandas 2.0+) to int64

In [None]:
#Code Block 11
dummies_LoanPurpose3 = pd.get_dummies(df_reg['Loan Purpose'], prefix='lp', drop_first = True, dtype='int')
dummies_LoanPurpose3.head()

In [None]:
#Code Block 12
dummies_LoanPurpose.info()
dummies_LoanPurpose3.info()

#### drop_first only removes the alphabetically first category. To use a different baseline, such as the category with the most values, drop that column by name instead.
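One way to drop a chosen baseline by hand is sketched below on made-up data. The `purpose` Series and its values are hypothetical stand-ins for `df_reg['Loan Purpose']`, and the sketch assumes the most frequent category is the baseline you want to remove.

```python
import pandas as pd

# Hypothetical toy data standing in for df_reg['Loan Purpose']
purpose = pd.Series(
    ["debt_consolidation", "car", "debt_consolidation",
     "credit_card", "debt_consolidation"],
    name="Loan Purpose",
)

# Keep every category, then drop the chosen baseline by name
dummies = pd.get_dummies(purpose, dtype=int)
baseline = purpose.value_counts().idxmax()  # most frequent category
dummies_no_baseline = dummies.drop(columns=[baseline])

print(baseline)
print(dummies_no_baseline.columns.tolist())
```

Dropping by name keeps the choice of baseline explicit, which makes the regression coefficients easier to interpret relative to that category.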

In [None]:
#Code Block 13
print('---------------------------------------------------')
print("car")
print(dummies_LoanPurpose['car'].value_counts())
print('---------------------------------------------------')
print("credit_card")
print(dummies_LoanPurpose['credit_card'].value_counts())
print('---------------------------------------------------')
print("debt_consolidation")
print(dummies_LoanPurpose['debt_consolidation'].value_counts())
print('---------------------------------------------------')
print("home_improvement")
print(dummies_LoanPurpose['home_improvement'].value_counts())
print('---------------------------------------------------')

In [None]:
#Code Block 14
dummies_LoanPurpose.head()

In [None]:
#Code Block 15
#dummies_LoanPurpose = dummies_LoanPurpose.drop(['debt_consolidation'], axis = 1)
df_reg = pd.concat([df_reg, dummies_LoanPurpose], axis = 1)

#Drop Loan Purpose
df_reg = df_reg.drop(['Loan Purpose'], axis = 1)
df_reg.info()

<h2 style="color:blue;">Fill in Null values with median in length of employment</h2>

### What can you do with missing data?  (NaN)
- **Leave as-is**
    - cannot leave it as is if you plan on using it for a predictive model (cannot have NaNs)
- **Drop them**
    - df_dropped = df_nan.dropna()
- **Fill missing value**
    - Fill with a value
        - df_nan['gender'] = df_nan['gender'].fillna('missing')
        - df_nan[['total_bill', 'size']] = df_nan[['total_bill','size']].fillna(0)
    - Fill with a summary statistic
        - df_nan['tip'] = df_nan['tip'].fillna(df_nan['tip'].mean())
    - Fill forward or backward
        - df.ffill() (equivalent to df.fillna(method='ffill'), which newer pandas versions deprecate)
        - df.bfill() (equivalent to df.fillna(method='bfill'))
        


http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.fillna.html
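The options listed above can be tried side by side on a small frame. This is only a sketch: `df_nan` and its columns are made-up illustration data, not the Scenario 11 dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values (not the notebook's data)
df_nan = pd.DataFrame({
    "gender": ["F", None, "M", "F"],
    "tip": [1.0, np.nan, 3.0, 5.0],
})

dropped = df_nan.dropna()                                     # drop rows containing any NaN
filled_label = df_nan["gender"].fillna("missing")             # fill with a fixed value
filled_median = df_nan["tip"].fillna(df_nan["tip"].median())  # fill with a summary statistic
filled_forward = df_nan["tip"].ffill()                        # carry the previous value forward
filled_backward = df_nan["tip"].bfill()                       # pull the next value backward

print(filled_median.tolist())
```

None of these calls modify `df_nan` in place; each returns a new object, so the result has to be assigned back if you want to keep it.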

In [None]:
#Code Block 16
df_reg.head()

In [None]:
#Code Block 17
pd.DataFrame(df_reg['Length of Employment'].value_counts(dropna=False))

In [None]:
#Code Block 18
pd.DataFrame(round(df_reg['Length of Employment'].describe(), 2))

#### NOTE: 50th percentile is also the median
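A quick check of this note on made-up numbers (the Series `s` is hypothetical): the `50%` row from `describe()` and the result of `median()` agree, and both skip NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([1.0, 2.0, np.nan, 4.0, 10.0, 20.0])

p50 = s.describe()["50%"]  # 50th percentile reported by describe()
med = s.median()           # median, NaN skipped by default

print(p50, med)
```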

In [None]:
#Code Block 19
df_reg['Length of Employment'].median()

In [None]:
#Code Block 20
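# Plot a histogram with a density curve and color it magenta
# (histplot replaces distplot, which is deprecated in recent seaborn versions)
sns.histplot(df_reg['Length of Employment'], kde=True, color="m")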

In [None]:
#Code Block 21
df_reg['Length of Employment'] = df_reg['Length of Employment'].fillna(df_reg['Length of Employment'].median())
pd.DataFrame(df_reg['Length of Employment'].value_counts(dropna=False))

In [None]:
#Code Block 22
pd.DataFrame(round(df_reg['Length of Employment'].describe(), 2))

In [None]:
#Code Block 23
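# Plot a histogram with a density curve and color it blue
# (histplot replaces distplot, which is deprecated in recent seaborn versions)
sns.histplot(df_reg['Length of Employment'], kde=True, color="b")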

<h2 style="color:blue;">Creating and applying a function to code delinquencies</h2>


In [None]:
#Code Block 24
pd.DataFrame(df_reg['Delinquencies Past 24 Months'].value_counts(dropna=False))

### Instead of filling in the NaNs with a number or calculation, we will turn this quantitative variable into a binary decision:
- If **delinquencies within the past 24 Months** > 0 then code it as a 1.
- Otherwise code it as a 0, which will be all of the NaN values.
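The same rule can also be written as a vectorized comparison instead of a row-by-row function. A minimal sketch on made-up values; `delinq_counts` is a hypothetical stand-in for the `Delinquencies Past 24 Months` column:

```python
import numpy as np
import pandas as pd

# Hypothetical delinquency counts; NaN means nothing was recorded
delinq_counts = pd.Series([np.nan, 2.0, np.nan, 1.0, 5.0])

# NaN > 0 evaluates to False, so missing values fall into the 0 group
delinq_flag = (delinq_counts > 0).astype(int)

print(delinq_flag.tolist())
```

On large DataFrames the comparison is typically much faster than `apply` with `axis=1`, because it avoids calling a Python function once per row.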

### Change Delinquencies Past 24 Months to a binary feature

In [None]:
#Code Block 25
def delinq(c):
    #1 if at least one delinquency is recorded; 0 otherwise (NaN > 0 is False)
    if c['Delinquencies Past 24 Months'] > 0:
        return 1
    else:
        return 0

df_reg['Delinquencies Past 24 Months'] = df_reg.apply(delinq, axis=1)
display(df_reg['Delinquencies Past 24 Months'].value_counts())

In [None]:
#Code Block 26
df_reg.info()

<h2 style="color:blue;">Run regression with newest features</h2>

<h3 style="color:blue;">Features to include:</h3>

- All previous features
- Term
- Length of employment
- Open accounts
- Credit inquiries
- Loan purpose categories (not including debt consolidation)
- Delinquencies Past 24 Months


In [None]:
#Code Block 27
df_reg.columns

#### Do not include the following variables:
- 'Member ID'
- 'Loan ID'
- 'Origination Date'
- 'Interest Rate' (y or target variable)
- 'Grade'
- 'Employee Title'
- 'Zip Code of Residence'
- 'State of Residence'
- 'TermString'
- 'Day'
- 'Month'
- 'Year'
- 'MORTGAGE' (baseline for Home Ownership)
- 'debt_consolidation' (baseline for Loan Purpose)

In [None]:
#Code Block 28

#X includes all expected features including Home Ownership, Length of Employment, and Loan Purpose
#No MORTGAGE or debt_consolidation

X = df_reg[['Amount Funded', 'Total Debt', 'Annual Income', 'Revolving Balance',
            'Total Revolving Credit Line', 'Term','Length of Employment', 'Delinquencies Past 24 Months',
            'Credit Inquires Last 6 Months','Open Accounts', 'Loan_Income', 'Debt_Income',
            'RevBal_Line', 'RevBal_Income', 'Income Verification', 'IncVer_Income_act', 'OTHER',
            'OWN', 'RENT', 'Own_RevLine_act', 'car', 'credit_card', 'home_improvement', 'house', 'major_purchase',
            'medical', 'moving', 'other', 'renewable_energy','small_business', 'vacation', 'wedding']]
y = df_reg[['Interest Rate']]

In [None]:
#Code Block 29
import statsmodels
import statsmodels.api as sm

In [None]:
#Code Block 30
X = sm.add_constant(X) # adding a constant

reg1 = sm.OLS(y, X).fit()

predictions1 = reg1.predict(X)
resid1 = reg1.resid
reg1.summary()

In [None]:
#Code Block 31

#Create Predictions dataframe
df_predictions1 = pd.DataFrame(predictions1)
df_predictions1=df_predictions1.rename(columns = {0:'Int_Pred1'})

#Create Residuals dataframe
df_resid1 = pd.DataFrame(resid1)
df_resid1=df_resid1.rename(columns = {0:'Resid1'})


#Concat results into one dataframe
df_reg_results = pd.concat([df_reg, df_predictions1, df_resid1], axis=1)

df_reg_results[['Amount Funded', 'Total Debt', 'Annual Income','Interest Rate', 'Int_Pred1', 'Resid1']].head()

<h2 style="color:blue;">Calculating VIF on feature variables and correlation</h2>



In [None]:
#Code Block 32
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
#Code Block 33
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns #adds a column with the labels
round(vif, 1).sort_values(by = 'VIF Factor', ascending = False)
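For reference, the variance inflation factor of feature j is usually defined as VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below applies that definition with plain NumPy on synthetic data. The helper `vif_manual` and the variables `x1`, `x2`, `x3` are hypothetical and not part of the notebook, and an intercept is added inside the helper as an assumption.

```python
import numpy as np

def vif_manual(features):
    """Return 1 / (1 - R^2) for each column regressed on the others plus an intercept."""
    n_rows, n_cols = features.shape
    vifs = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to the others

vifs = vif_manual(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two nearly collinear columns should show large values while the independent column stays close to 1, which is the pattern to look for in the table above.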

In [None]:
#Code Block 34

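#Restrict to numeric and boolean columns; recent pandas versions raise an error on non-numeric columns
corrMatrix = df_reg.select_dtypes(include=['number', 'bool']).corr()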
df_corrMatrix = pd.DataFrame(corrMatrix)
round(df_corrMatrix,3)

In [None]:
#Code Block 35
colormap = plt.cm.viridis
plt.figure(figsize=(14,10))
plt.title('Correlation Heat Map', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

### Take off annotation and change colormap
- Link to the different styles of colormaps
    - https://matplotlib.org/3.2.1/tutorials/colors/colormaps.html

In [None]:
#Code Block 36
colormap = plt.cm.coolwarm
plt.figure(figsize=(14,10))
plt.title('Correlation Heat Map', y=1.05, size=15)
sns.heatmap(df_corrMatrix,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=False)

### Graphically looking at residuals using lowess

- LOWESS (Locally Weighted Scatterplot Smoothing) is a regression tool that draws a smooth line through a scatter plot to help you see the relationship between variables and spot trends.
- https://www.statisticshowto.com/lowess-smoothing/

In [None]:
#Code Block 37
plt.figure(figsize=(20,10)) #changes area of scatterplot
sns.regplot(x='Int_Pred1', y='Resid1', data=df_reg_results,
            lowess=True, #fit a LOWESS curve instead of a straight regression line
            scatter_kws={"color":"blue","alpha":0.15, "s":100,"linewidth":2,"edgecolor":"white"},
            line_kws={'color': 'black'})

### Look at the largest deviations from actuals (lowest residuals first, then highest residuals)

In [None]:
#Code Block 38
df_reg_results.sort_values(by='Resid1').head(10)

In [None]:
#Code Block 39
df_reg_results.sort_values(by='Resid1', ascending=False).head(10)