# DS-SF-23 | Lab 07 | Introduction to Regression and Model Fit, Part 2 | Answer Key

In [140]:
import os
import numpy as np
import pandas as pd
from sklearn import feature_selection, linear_model

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

In [141]:
df = pd.read_csv(os.path.join('..', 'datasets', 'credit.csv'))

In [142]:
df.head()

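| | Income | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.891 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 1 | 106.025 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 2 | 104.593 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 3 | 148.924 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 4 | 55.882 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |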


A description of the dataset is as follows:

- Income (in thousands of dollars)
- Rating: Credit score rating
- Cards: Number of Credit cards owned
- Age
- Education: Years of Education
- Gender: Male/Female
- Student: Yes/No
- Married: Yes/No
- Ethnicity: African American/Asian/Caucasian
- Balance: Average credit card debt

> ## Question 1.  Let's explore the quantitative variables that affect `Balance`.  From your preliminary analysis, which 2 variables seem to affect `Balance` the most?  Our goal is interpretation; can we use these 2 variables simultaneously?  Why or why not?

In [143]:
df.corr()

| | Income | Rating | Cards | Age | Education | Balance |
|---|---|---|---|---|---|---|
| Income | 1.0 | 0.791378 | -0.018273 | 0.175338 | -0.027692 | 0.463656 |
| Rating | 0.791378 | 1.0 | 0.053239 | 0.103165 | -0.030136 | 0.863625 |
| Cards | -0.018273 | 0.053239 | 1.0 | 0.042948 | -0.051084 | 0.086456 |
| Age | 0.175338 | 0.103165 | 0.042948 | 1.0 | 0.003619 | 0.001835 |
| Education | -0.027692 | -0.030136 | -0.051084 | 0.003619 | 1.0 | -0.008062 |
| Balance | 0.463656 | 0.863625 | 0.086456 | 0.001835 | -0.008062 | 1.0 |


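Answer: Rating (r ≈ 0.86) and Income (r ≈ 0.46) are the two variables most strongly correlated with Balance. We should not use them simultaneously when the goal is interpretation, because they are highly correlated with each other (r ≈ 0.79). This multicollinearity makes the individual coefficient estimates unstable and hard to interpret.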
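
One way to put a number on the overlap between the two predictors is the variance inflation factor (VIF). The sketch below is illustrative only: it plugs in the Income/Rating correlation from the `df.corr()` output and uses the two-predictor shortcut VIF = 1 / (1 - r^2). The variable names are hypothetical and not part of the lab code.

```python
# Correlation between Income and Rating, copied from the df.corr() output.
r_income_rating = 0.791378

# With exactly two predictors, regressing one on the other gives R^2 = r^2,
# so the variance inflation factor reduces to 1 / (1 - r^2).
r_squared = r_income_rating ** 2
vif = 1.0 / (1.0 - r_squared)

print(f"shared variance between predictors: {r_squared:.3f}")
print(f"VIF for either predictor: {vif:.2f}")
```

A VIF near 1 means little overlap; the larger it is, the more each coefficient's variance is inflated by the other predictor. Cutoffs for "too large" are conventions and vary by source.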

> ## Question 2.  `Race`, `Gender`, `Married`, and `Student` are categorical variables.  Go ahead and create dummy variables for all of them.

In [144]:
race_df = pd.get_dummies(df.Ethnicity, prefix = 'Race')
gender_df = pd.get_dummies(df.Gender, prefix = 'Gender')
married_df = pd.get_dummies(df.Married, prefix = 'Married')
student_df = pd.get_dummies(df.Student, prefix = 'Student')

df = df.join([race_df, gender_df, married_df, student_df])
df.columns

Index(['Income', 'Rating', 'Cards', 'Age', 'Education', 'Gender', 'Student',
       'Married', 'Ethnicity', 'Balance', 'Race_African American',
       'Race_Asian', 'Race_Caucasian', 'Gender_Female', 'Gender_Male',
       'Married_No', 'Married_Yes', 'Student_No', 'Student_Yes'],
      dtype='object')

> ## Question 3.  Using sklearn and a linear regression, predict `Balance` using `Income`, `Cards`, `Age`, `Education`, `Gender`, and `Race`

First, find the coefficients of your regression line.

In [145]:
X = df[ ['Income', 'Cards', 'Age', 'Education', 'Gender_Male', 'Race_African American'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print(model.intercept_)
print(model.coef_)

257.369493707
[  6.27936781  33.59766321  -2.31938588   1.59642886 -26.9394313
  -0.10049563]


Then, find the p-values of your estimates.  You have several variables, so try to show your p-values alongside the names of the variables.

In [146]:
list(zip(X.columns.values, feature_selection.f_regression(X, y)[1]))

[('Income', 1.0308858025893513e-22),
 ('Cards', 0.084176555599370956),
 ('Age', 0.97081387233013317),
 ('Education', 0.87230640156710226),
 ('Gender_Male', 0.66851610550260099),
 ('Race_African American', 0.78443031908756577)]
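
A caveat on these numbers: `feature_selection.f_regression` tests each feature against `y` on its own, so the values above are univariate p-values rather than p-values for the coefficients of the joint model. That is why the Income p-value is identical wherever it appears in this notebook. The sketch below uses synthetic data (all names and numbers are hypothetical) to show how the two views can disagree: `x2` is strongly correlated with `y` by itself, yet contributes almost nothing once `x1` is in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (hypothetical): x1 drives y, x2 is only a noisy copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
y = 3.0 * x1 + rng.normal(size=n)

# Univariate view (what f_regression tests): correlation of each feature with y.
corr_x1 = np.corrcoef(x1, y)[0, 1]
corr_x2 = np.corrcoef(x2, y)[0, 1]

# Joint view: OLS with an intercept, then t = coefficient / standard error.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_stats = beta / np.sqrt(np.diag(cov_beta))

print("univariate correlations with y:", corr_x1, corr_x2)
print("joint-model t statistics (intercept, x1, x2):", t_stats)
```

For a sample this large, a joint-model coefficient is significant at roughly the 5% level when its |t| exceeds about 2.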

> ## Question 4.  Which of your coefficients are significant at the 5% significance level?

Answer: Only Income (p ≈ 1e-22) is significant at the 5% level. Cards comes closest (p ≈ 0.084) but does not clear the threshold.

> ## Question 5.  What is your model's $R^2$?

In [148]:
model.score(X,y)

0.23223261170777187

> ## Question 6.  How do we interpret this value?

Answer: Our model explains approximately 23% of the variation in Balance (i.e., avg. credit card debt).
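
To make "variation explained" concrete, here is a minimal sketch of the quantity that `model.score` returns for a linear regression, R^2 = 1 - SS_res / SS_tot, computed by hand. The numbers are a hypothetical toy example, not values from the credit data.

```python
# Toy observed values and model predictions (hypothetical numbers).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: variation the model fails to explain.
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
# Total sum of squares: variation of the observations around their mean.
ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)

r2 = 1.0 - ss_res / ss_tot
print(r2)
```

An R^2 of 0.23 therefore means the residual sum of squares is still about 77% of the total sum of squares.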

> ## Question 7.  Now let's focus on the two most significant variables from your previous model and re-run your regression model.

In [149]:
X2 = df[ ['Income', 'Cards'] ]
y2 = df.Balance

model_v2 = linear_model.LinearRegression()
model_v2.fit(X2, y2)

print(model_v2.intercept_)
print(model_v2.coef_)


> ## Question 8.  In comparison to the previous model, did the $R^2$ increase or decrease?  Why?

In [150]:
model_v2.score(X2, y2)

0.22399175162249518

Answer: R-squared decreased slightly, from about 0.232 to about 0.224. On the training data, R-squared can never go down when predictors are added, so removing Age, Education, Gender, and Race could only lower it. The four dropped variables accounted for less than one percentage point of the variation in Balance, and none of their p-values were significant, so that small gain was likely fitting noise. The higher R-squared of the earlier model therefore does not indicate that it was superior in any meaningful way.
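
Adjusted R-squared is a common way to compare models with different numbers of predictors, because it penalizes each added variable. The sketch below applies the standard formula to the two scores reported above. Note the assumption: the sample size is not shown in this notebook, so `n_obs = 400` is a placeholder, and `adjusted_r2` is a hypothetical helper rather than a scikit-learn function.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# ASSUMPTION: the sample size is not shown in this notebook; 400 is a placeholder.
n_obs = 400

adj_full = adjusted_r2(0.23223261170777187, n_obs, 6)   # six-predictor model
adj_small = adjusted_r2(0.22399175162249518, n_obs, 2)  # Income + Cards only
print(adj_full, adj_small)
```

Under that placeholder sample size, the two adjusted values land much closer together than the raw R-squared values, which is consistent with the extra four variables adding little.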

> ## Question 9.  Now let's regress `Balance` on `Gender` alone.  After running your linear regressions, do you have enough evidence to claim that females have more balance than males?  (Hint: Look at the p-value of the Gender coefficient.  If it is significant then you will have evidence to support that claim, otherwise you cannot support the statement.)

In [151]:
X3 = df[ ['Gender_Female'] ]
y3 = df.Balance

model_v3 = linear_model.LinearRegression()
model_v3.fit(X3, y3)

print(model_v3.intercept_)
print(model_v3.coef_)
print(X3.columns.values[0], feature_selection.f_regression(X3, y3)[1])

509.803108808
[ 19.73312308]
Gender_Female [ 0.66851611]


Answer: No. The p-value is about 0.67, well above the 5% significance threshold, so we fail to reject the null hypothesis that average balance is the same for females and males.

> ## Question 10.  Now let's regress `Balance` on `Ethnicity`.  After running your linear regressions, do you have enough evidence to claim that some ethnic groups carry more balance than others?

In [152]:
X4 = df[ ['Race_African American', 'Race_Asian', 'Race_Caucasian'] ]
y4 = df.Balance

model_v4 = linear_model.LinearRegression()
model_v4.fit(X4, y4)

print(model_v4.intercept_)
print(model_v4.coef_)
print(X4.columns.values, feature_selection.f_regression(X4, y4)[1])


-6.05772146767e+16
[  6.05772147e+16   6.05772147e+16   6.05772147e+16]
['Race_African American' 'Race_Asian' 'Race_Caucasian'] [ 0.78443032  0.84489564  0.94772751]


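Answer: The p-values for African American, Asian, and Caucasian are all far above 5%, so we fail to reject the null hypothesis that there is no association between race and balance. Note that the enormous coefficients (on the order of 6e16, offset by an equally large negative intercept) are not meaningful: the three dummy columns always sum to 1, so together with the intercept they are perfectly collinear and the individual coefficients are not identified.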
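
The coefficients of roughly ±6e16 in the output above are a symptom of including a dummy column for every category alongside an intercept. A minimal sketch of the usual remedy, dropping one category to serve as the reference group, is shown below on hypothetical toy data built with NumPy. In pandas the same effect is available through `pd.get_dummies(..., drop_first=True)`.

```python
import numpy as np

# Toy data (hypothetical): three groups, two observations each.
groups = np.array(["A", "B", "C", "A", "B", "C"])
y = np.array([10.0, 20.0, 30.0, 12.0, 22.0, 32.0])

# One 0/1 column per group, in the order A, B, C.
dummies = np.column_stack([(groups == g).astype(float) for g in ["A", "B", "C"]])
intercept = np.ones((len(y), 1))

# Intercept + all three dummies: 4 columns, but the dummies sum to the intercept.
X_full = np.hstack([intercept, dummies])
rank_full = np.linalg.matrix_rank(X_full)

# Drop the "A" column so that A becomes the reference group.
X_ref = np.hstack([intercept, dummies[:, 1:]])
rank_ref = np.linalg.matrix_rank(X_ref)
coef, *_ = np.linalg.lstsq(X_ref, y, rcond=None)

print(rank_full, rank_ref)
print(coef)  # intercept = mean of A; others = group mean minus mean of A
```

With a reference group, the intercept is that group's mean and each remaining coefficient is a difference from it, which is the comparison Question 10 is after.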

> ## Question 11.  Finally let's regress `Balance` on `Student`.  After running your linear regressions, do you have enough evidence to claim that students carry more balance than non-students?

In [158]:
X5 = df[ ['Student_Yes'] ]
y5 = df.Balance

model_v5 = linear_model.LinearRegression()
model_v5.fit(X5, y5)

print(model_v5.intercept_)
print(model_v5.coef_)
list(zip(X5.columns.values, feature_selection.f_regression(X5, y5)[1]))

480.369444444
[ 396.45555556]


[('Student_Yes', 1.4877341077327523e-07)]

Answer: Yes, we have evidence at the 5% significance level that students carry about $396 more debt than the non-students in the sample.
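
The reading of the coefficient as a gap between groups follows from a general fact: when the only predictor is a 0/1 indicator, the fitted intercept is the mean of the 0 group and the slope is the difference between the two group means. The sketch below checks this on hypothetical toy numbers using the textbook simple-regression formulas, not the credit data.

```python
# Toy data (hypothetical): a 0/1 indicator and an outcome.
is_student = [0, 0, 0, 1, 1]
balance = [10.0, 20.0, 30.0, 40.0, 50.0]

n = len(balance)
mean_x = sum(is_student) / n
mean_y = sum(balance) / n

# Simple regression: slope = cov(x, y) / var(x), intercept = mean_y - slope * mean_x.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(is_student, balance))
var_x = sum((x - mean_x) ** 2 for x in is_student)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

# Group means for comparison.
mean_0 = sum(y for x, y in zip(is_student, balance) if x == 0) / is_student.count(0)
mean_1 = sum(y for x, y in zip(is_student, balance) if x == 1) / is_student.count(1)

print(intercept, mean_0)
print(slope, mean_1 - mean_0)
```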

> ## Question 12.  Now let's consider the effect of `Student` and `Income` on `Balance` simultaneously.  Are all the coefficients significant?

In [165]:
X6 = df[ ['Student_Yes', 'Income'] ]
y6 = df.Balance

model_v6 = linear_model.LinearRegression()
model_v6.fit(X6, y6)

print(model_v6.intercept_)
print(X6.columns.values, model_v6.coef_)
print(X6.columns.values, feature_selection.f_regression(X6, y6)[1])

211.142964398
['Student_Yes' 'Income'] [ 382.67053884    5.98433557]
['Student_Yes' 'Income'] [  1.48773411e-07   1.03088580e-22]


Answer: Yes, at the 5% significance level both Student and Income are positively associated with Balance. Being a student is associated with about 382 dollars more debt than non-students at the same income, and each additional one thousand dollars of income is associated with about 6 dollars more debt.

> ## Question 13.  Now let's consider the interaction effect of `Student` and `Income` on `Balance` simultaneously.  Are all the coefficients significant?  If they are, write down your regression model below.

(First generate a new variable for the interaction term)

In [168]:
df['Student * Income'] = df['Student_Yes'] * df['Income']

In [173]:
X7 = df[ ['Student_Yes', 'Income', 'Student * Income'] ]
y7 = df.Balance

model_v7 = linear_model.LinearRegression()
model_v7.fit(X7, y7)

print(model_v7.intercept_)
print(X7.columns.values, model_v7.coef_)
print(X7.columns.values, feature_selection.f_regression(X7, y7)[1])

200.62315295
['Student_Yes' 'Income' 'Student * Income'] [ 476.67584321    6.21816874   -1.99915087]
['Student_Yes' 'Income' 'Student * Income'] [  1.48773411e-07   1.03088580e-22   4.61768368e-08]


Answer: Yes, all three p-values are below the 5% level. For non-students, each additional 1000 dollars of income is associated with about 6.22 dollars more debt. Being a student shifts the intercept up by about 477 dollars, and the interaction term lowers the students' income slope by about 2 dollars, to about 4.22 dollars per 1000 dollars of income.

$Balance = 200.62 + 476.68 \times Student_{Yes} + 6.22 \times Income - 2.00 \times Student_{Yes} \times Income$
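
An interaction model amounts to fitting one line per group. As a rough sketch, the code below takes the `model_v7` coefficients printed above and derives the intercept and income slope implied for each group. `predicted_balance` and the constant names are hypothetical helpers introduced for illustration.

```python
# Coefficients copied from the model_v7 output above.
INTERCEPT = 200.62315295
B_STUDENT = 476.67584321
B_INCOME = 6.21816874
B_INTERACTION = -1.99915087

def predicted_balance(income: float, student: int) -> float:
    """Fitted value; income is in thousands of dollars, student is 0 or 1."""
    return (INTERCEPT + B_STUDENT * student + B_INCOME * income
            + B_INTERACTION * student * income)

# Each group gets its own (intercept, slope) pair.
non_student_line = (INTERCEPT, B_INCOME)
student_line = (INTERCEPT + B_STUDENT, B_INCOME + B_INTERACTION)
print(non_student_line, student_line)
```

Read this way, students start from a higher intercept but their predicted balance rises more slowly with income.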

> ## Question 14.  Is there any income level at which students and non-students on average carry same level of balance?

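Answer: Yes, in principle. Setting the student and non-student predictions from the Question 13 model equal gives 476.68 - 2.00 × Income = 0, so Income ≈ 238.4 (in thousands of dollars), or about 238,400 dollars. Below that income the model predicts a higher balance for students. Check that this value lies within the observed range of Income before relying on it.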
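
One way to approach Question 14, taking the Question 13 coefficients at face value: the gap between students and non-students at a given income is the Student coefficient plus the interaction coefficient times income, so the break-even point is where that gap equals zero. The sketch below solves for it. `gap` is a hypothetical helper, and the result is an extrapolation unless it falls inside the observed range of Income.

```python
# Coefficients copied from the model_v7 output for Question 13.
b_student = 476.67584321
b_interaction = -1.99915087

# Gap at a given income: b_student + b_interaction * income.
# Setting the gap to zero and solving for income gives the break-even point.
break_even_income = -b_student / b_interaction

def gap(income: float) -> float:
    """Predicted student balance minus predicted non-student balance."""
    return b_student + b_interaction * income

print(f"break-even income: {break_even_income:.1f} (thousands of dollars)")
```

Comparing `break_even_income` with `df.Income.max()` would show whether any observation in the sample actually reaches that level.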