# DS-SF-23 | Lab 07 | Introduction to Regression and Model Fit, Part 2 | Answer Key

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn import feature_selection, linear_model

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

In [2]:
df = pd.read_csv(os.path.join('..', 'datasets', 'credit.csv'))

In [3]:
df.head()

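| | Income | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.891 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 1 | 106.025 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 2 | 104.593 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 3 | 148.924 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 4 | 55.882 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |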


A description of the dataset is as follows:

- Income (in thousands of dollars)
- Rating: Credit score rating
- Cards: Number of Credit cards owned
- Age
- Education: Years of Education
- Gender: Male/Female
- Student: Yes/No
- Married: Yes/No
- Ethnicity: African American/Asian/Caucasian
- Balance: Average credit card debt

> ## Question 1.  Let's explore the quantitative variables that affect `Balance`.  From your preliminary analysis, which 2 variables seem to affect `Balance` the most?  Our goal is interpretation; can we use these 2 variables simultaneously?  Why or why not?

In [4]:
df.corr()

| | Income | Rating | Cards | Age | Education | Balance |
|---|---|---|---|---|---|---|
| Income | 1.0 | 0.791378 | -0.018273 | 0.175338 | -0.027692 | 0.463656 |
| Rating | 0.791378 | 1.0 | 0.053239 | 0.103165 | -0.030136 | 0.863625 |
| Cards | -0.018273 | 0.053239 | 1.0 | 0.042948 | -0.051084 | 0.086456 |
| Age | 0.175338 | 0.103165 | 0.042948 | 1.0 | 0.003619 | 0.001835 |
| Education | -0.027692 | -0.030136 | -0.051084 | 0.003619 | 1.0 | -0.008062 |
| Balance | 0.463656 | 0.863625 | 0.086456 | 0.001835 | -0.008062 | 1.0 |


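Answer: `Rating` (0.86) and `Income` (0.46) are the two variables most strongly correlated with `Balance`.  Because our goal is interpretation, we should not use them simultaneously: they are highly correlated with each other (0.79), so their individual coefficients would be hard to interpret.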
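As a small check on the answer above, the ranking can be reproduced from the numbers in the printed correlation matrix. This is only a sketch: the values are copied by hand from the output, and the variable names are illustrative rather than part of the lab.

```python
import pandas as pd

# Correlations with Balance, copied from the matrix printed above.
corr_with_balance = pd.Series({
    'Income': 0.463656,
    'Rating': 0.863625,
    'Cards': 0.086456,
    'Age': 0.001835,
    'Education': -0.008062,
})

# Rank by absolute value so that a strong negative correlation would count too.
top_two = corr_with_balance.abs().sort_values(ascending=False).head(2)

# Correlation between the two top candidates, from the same matrix.
income_rating_corr = 0.791378
```

On the notebook's own `df`, the equivalent starting point would be the `Balance` column of `df.corr()`.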

> ## Question 2.  `Ethnicity`, `Gender`, `Married`, and `Student` are categorical variables.  Go ahead and create dummy variables for all of them (the `Ethnicity` dummies are given a `Race` prefix below).

In [5]:
race_df = pd.get_dummies(df.Ethnicity, prefix = 'Race')
gender_df = pd.get_dummies(df.Gender, prefix = 'Gender')
married_df = pd.get_dummies(df.Married, prefix = 'Married')
student_df = pd.get_dummies(df.Student, prefix = 'Student')

df = df.join([race_df, gender_df, married_df, student_df])

df.columns

Index([u'Income', u'Rating', u'Cards', u'Age', u'Education', u'Gender',
       u'Student', u'Married', u'Ethnicity', u'Balance',
       u'Race_African American', u'Race_Asian', u'Race_Caucasian',
       u'Gender_Female', u'Gender_Male', u'Married_No', u'Married_Yes',
       u'Student_No', u'Student_Yes'],
      dtype='object')
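The cell above creates one dummy per category, and the later regressions leave one of them out by hand to act as the baseline. As an alternative sketch, `pd.get_dummies` can drop the first category itself via `drop_first=True`. The three-row frame below is hypothetical and exists only to show the resulting column names.

```python
import pandas as pd

# Hypothetical three-row frame, only to show the column layout.
toy = pd.DataFrame({'Ethnicity': ['Asian', 'Caucasian', 'African American']})

# drop_first=True omits the first category in sorted order
# ('African American' here), which then acts as the baseline.
race_dummies = pd.get_dummies(toy.Ethnicity, prefix='Race', drop_first=True)
dummy_columns = list(race_dummies.columns)
```

With this option the baseline is always the first category in sorted order, so choosing a different baseline still requires dropping a column manually.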

> ## Question 3.  Using sklearn and a linear regression, predict `Balance` using `Income`, `Cards`, `Age`, `Education`, `Gender`, and `Race`

First, find the coefficients of your regression line.

In [6]:
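# One dummy per categorical variable is left out as the baseline
# (Gender_Female and Race_African American).
X = df[ ['Income', 'Cards', 'Age', 'Education', 'Gender_Male', 'Race_Asian', 'Race_Caucasian'] ]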
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_

257.167785624
[  6.27995894  33.62953508  -2.32970547   1.64553607 -27.12543123
  -6.54603078   3.47497641]


Then, find the p-values of your estimates.  Since you have several variables, try to show each p-value alongside the name of its variable.

In [7]:
# f_regression returns (F-scores, p-values); [1] selects the p-values
zip(X.columns.values, feature_selection.f_regression(X, y)[1])

[('Income', 1.0308858025893587e-22),
 ('Cards', 0.084176555599370956),
 ('Age', 0.97081387233013317),
 ('Education', 0.87230640156710226),
 ('Gender_Male', 0.66851610550260099),
 ('Race_Asian', 0.84489564436221742),
 ('Race_Caucasian', 0.94772751139663791)]
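One caveat about these numbers: `feature_selection.f_regression` tests each column against `y` on its own (a univariate test), which is why a variable's p-value does not change when other predictors are added or removed. If p-values for the coefficients of the joint model are wanted, they can be computed from the usual OLS t-statistics. The sketch below assumes SciPy is available and uses synthetic data; `ols_p_values` is a hypothetical helper, not part of sklearn.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Return (coefficients, two-sided p-values) for OLS with an intercept.

    Index 0 of each returned array refers to the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta = np.linalg.lstsq(A, y, rcond=None)[0]    # least-squares coefficients
    resid = y - A.dot(beta)
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = resid.dot(resid) / dof                # residual variance estimate
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T.dot(A))))
    t_stats = beta / std_err
    return beta, 2 * stats.t.sf(np.abs(t_stats), dof)

# Synthetic data (an assumption for illustration): x2 is correlated with x1
# but has no effect of its own on y.
rng = np.random.RandomState(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
y_demo = 3.0 * x1 + rng.normal(size=n)

coefs, p_values = ols_p_values(np.column_stack([x1, x2]), y_demo)
```

Applied to the notebook's data, the call would look like `ols_p_values(X.values, y.values)`; the resulting p-values may differ from the univariate ones shown above.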

> ## Question 4.  Which of your coefficients are significant at the 5% significance level?

Answer: Only `Income` is; its p-value (about $10^{-22}$) is the only one below 0.05.

> ## Question 5.  What is your model's $R^2$?

In [8]:
model.score(X, y)

0.23231260833540457

> ## Question 6.  How do we interpret this value?

Answer: 23% of the variability of `Balance` is captured by the linear model.
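To make the interpretation concrete, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares, which is also what `model.score` reports for a linear regression. A minimal sketch with made-up numbers (the function name `r_squared` is illustrative):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance of y_true explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only.
toy_r2 = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])

# Predicting the mean everywhere explains none of the variation.
baseline_r2 = r_squared([1, 2, 3], [2, 2, 2])
```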

> ## Question 7.  Now let's focus on the two most significant variables from your previous model and re-run your regression model.

In [9]:
X = df[ ['Income', 'Cards'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_

151.329946349
[  6.07099859  31.83812895]


> ## Question 8.  In comparison to the previous model, did the $R^2$ increase or decrease?  Why?

In [10]:
model.score(X, y)

0.22399175162249518

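Answer: It decreased, because this model uses a subset of the variables in the earlier one; on the training data, $R^2$ can never increase when predictors are removed.  This does not mean that the precision of our model has decreased.  $R^2$ on its own is not a good measure for comparing two models with different numbers of predictors.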
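One common way to compare models with different numbers of predictors is the adjusted $R^2$, which penalizes each added variable. The sketch below applies the standard formula to the two $R^2$ values printed above. The number of observations is not printed in this notebook, so `N_OBS = 400` is an assumption and should be replaced by `len(df)`.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors features."""
    return 1.0 - (1.0 - r2) * (n_obs - 1.0) / (n_obs - n_predictors - 1.0)

N_OBS = 400  # ASSUMPTION: the row count is not shown in the notebook output

# R^2 values copied from the outputs of the two models above.
adj_full = adjusted_r_squared(0.23231260833540457, N_OBS, 7)   # seven predictors
adj_small = adjusted_r_squared(0.22399175162249518, N_OBS, 2)  # Income and Cards
```

Under that assumed sample size, the two-variable model comes out marginally ahead on adjusted $R^2$ even though its plain $R^2$ is lower.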

> ## Question 9.  Now let's regress `Balance` on `Gender` alone.  After running your linear regressions, do you have enough evidence to claim that females have more balance than males?  (Hint: Look at the p-value of the Gender coefficient.  If it is significant then you will have evidence to support that claim, otherwise you cannot support the statement.)

In [11]:
X = df[ ['Gender_Female'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_
print X.columns.values, feature_selection.f_regression(X, y)[1]

509.803108808
[ 19.73312308]
['Gender_Female'] [ 0.66851611]


Answer: (If your answer is yes, interpret the results).  The p-value is high (0.67): although the coefficient of the dummy variable for females is positive, we cannot claim that females carry more balance on average than males.
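When the only predictor is a single dummy, the fitted intercept is the mean of the baseline group and the coefficient is the difference between the two group means. In the output above, that makes 509.80 the average balance for males and 19.73 the female-minus-male difference. The sketch below checks this property on hypothetical balances, not on the lab's data.

```python
import numpy as np

# Hypothetical balances, for illustration only.
is_female = np.array([0, 0, 0, 1, 1, 1])
balance = np.array([500.0, 520.0, 510.0, 530.0, 540.0, 520.0])

# Least-squares fit of balance = intercept + slope * is_female.
# np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(is_female, balance, 1)

mean_male = balance[is_female == 0].mean()
mean_female = balance[is_female == 1].mean()
```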

> ## Question 10.  Now let's regress `Balance` on `Ethnicity`.  After running your linear regressions, do you have enough evidence to claim that some ethnic groups carry more balance than others?

In [12]:
X = df[ ['Race_Asian', 'Race_Caucasian'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_
print feature_selection.f_regression(X, y)[1]

531.0
[-18.68627451 -12.50251256]
[ 0.84489564  0.94772751]


Answer: (If your answer is yes, interpret the results).  Again the p-values are extremely high.  We don't have enough evidence to say that an ethnic group carries more balance than others.
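The p-values above test each dummy separately. Whether any group differs from the others is a joint question, and one standard way to ask it is a one-way ANOVA F-test across the groups. The sketch below uses `scipy.stats.f_oneway` on hypothetical balances; it is not the method used in this lab, and SciPy is an assumed dependency.

```python
from scipy import stats

# Hypothetical balances for three groups with identical distributions.
same_a = [500.0, 520.0, 540.0]
same_b = [500.0, 520.0, 540.0]
same_c = [500.0, 520.0, 540.0]
_, p_same = stats.f_oneway(same_a, same_b, same_c)

# Hypothetical balances for three groups with clearly different means.
diff_a = [500.0, 520.0, 540.0]
diff_b = [700.0, 720.0, 740.0]
diff_c = [900.0, 920.0, 940.0]
_, p_different = stats.f_oneway(diff_a, diff_b, diff_c)
```

On the notebook's data, the inputs would be the `Balance` values of each `Ethnicity` group, for example `df[df.Ethnicity == 'Asian'].Balance`.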

> ## Question 11.  Finally let's regress `Balance` on `Student`.  After running your linear regressions, do you have enough evidence to claim that students carry more balance than non-students?

In [13]:
X = df[ ['Student_Yes'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_
print feature_selection.f_regression(X, y)[1]

480.369444444
[ 396.45555556]
[  1.48773411e-07]


Answer: (If your answer is yes, interpret the results).  Yes, we have enough evidence to support that students are on average carrying \$396 more balance than non-students.

> ## Question 12.  Now let's consider the effect of `Student` and `Income` on `Balance` simultaneously.  Are all the coefficients significant?

In [14]:
X = df[ ['Student_Yes', 'Income'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_
print feature_selection.f_regression(X, y)[1]

211.142964398
[ 382.67053884    5.98433557]
[  1.48773411e-07   1.03088580e-22]


Answer: (If your answer is yes, interpret the results).  Yes, both coefficients are significant.  Holding income fixed, students on average carry about \$383 more balance than non-students.  Also, higher income earners on average carry more balance on their credit cards: for every additional \$1,000 of income, the average balance is about \$6 higher.
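The interpretation above can be checked by plugging values into the fitted equation. In this sketch the constants are copied from the printed intercept and coefficients, and `predict_balance` is a hypothetical helper written for illustration. Because the model has no interaction term, the student gap is the same at every income level.

```python
# Constants copied from the output of the Student + Income model above.
INTERCEPT = 211.142964398
COEF_STUDENT = 382.67053884
COEF_INCOME = 5.98433557

def predict_balance(is_student, income_thousands):
    """Predicted balance; is_student is 0 or 1, income is in thousands of dollars."""
    return INTERCEPT + COEF_STUDENT * is_student + COEF_INCOME * income_thousands

# Student vs. non-student gap at two different income levels.
gap_at_50 = predict_balance(1, 50) - predict_balance(0, 50)
gap_at_150 = predict_balance(1, 150) - predict_balance(0, 150)

# Effect of one extra thousand dollars of income for a non-student.
income_step = predict_balance(0, 101) - predict_balance(0, 100)
```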

> ## Question 13.  Now let's consider the interaction effect of `Student` and `Income` on `Balance` simultaneously.  Are all the coefficients significant?  If they are, write down your regression model below.

(First generate a new variable for the interaction term)

In [15]:
df['Student * Income'] = df['Student_Yes'] * df['Income']

In [16]:
X = df[ ['Student_Yes', 'Income', 'Student * Income'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_
print feature_selection.f_regression(X, y)[1]

200.62315295
[ 476.67584321    6.21816874   -1.99915087]
[  1.48773411e-07   1.03088580e-22   4.61768368e-08]


Answer: Yes, they are all significant.  $Balance = 200 + 477 \times Student_{Yes} + 6.22 \times Income - 2 \times Income \times Student_{Yes}$

> ## Question 14.  Is there any income level at which students and non-students on average carry same level of balance?

Answer:

- Non-students: $Balance = 200 + 6.22 \times Income$
- Students: $Balance = (200 + 477) + (6.22 - 2) \times Income = 677 + 4.22 \times Income$

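Setting the two equations equal gives $477 = 2 \times Income$, so the two groups carry the same balance at an `Income` of about 238, that is, roughly \$238,000 (recall that `Income` is in thousands of dollars).  Since this income level is higher than any observed value for students, it is safe to say that, within the range of our observations, students on average carry more balance.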

We interpret the results this way: at zero income, students on average carry \$477 more balance than non-students.  For every additional \$1,000 of income, this gap between students and non-students shrinks by \$2.
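The shrinking gap and the crossover income follow directly from the two `Student` coefficients of the interaction model. The sketch below copies them from the printed output; `student_gap` is a hypothetical helper written for illustration.

```python
# Coefficients copied from the output of the interaction model above.
COEF_STUDENT = 476.67584321       # student gap at zero income
COEF_INTERACTION = -1.99915087    # change in the gap per $1,000 of income

def student_gap(income_thousands):
    """Predicted student-minus-non-student balance at a given income (in thousands)."""
    return COEF_STUDENT + COEF_INTERACTION * income_thousands

# Income (in thousands of dollars) at which the predicted gap reaches zero.
crossover_income = -COEF_STUDENT / COEF_INTERACTION
```

With the unrounded coefficients the crossover lands at an `Income` of roughly 238.4, in line with the rounded calculation in the answer.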