# MATH 300 - Tools for Data Science 
# Exercises for: Linear Regression 2


# 1. Conceptual Questions

## Task 1.1

Suppose we have a data set with five predictors, $X_1$ = GPA, $X_2$ = IQ, $X_3$ = Level (1 for College and 0 for High School), $X_4$ = Interaction between GPA and IQ, and $X_5$ = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get
$\beta_0 = 50$, $\beta_1 = 20$, $\beta_2 = 0.07$, $\beta_3 = 35$, $\beta_4 = 0.01$, $\beta_5 = -10$.

(a) Which answer is correct, and why?

i. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.

ii. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.

iii. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.

iv. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.

**Your answer goes here:**

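The answer is iii. For fixed IQ and GPA, the predicted college-minus-high-school salary difference is $\beta_3 + \beta_5 \cdot \text{GPA} = 35 - 10 \cdot \text{GPA}$, which is negative once GPA exceeds 3.5, so high school graduates earn more provided that the GPA is high enough. My work is included in my response zip file.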
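
As a minimal sketch of that reasoning, using only the coefficients given in the task: every term except the Level and GPA x Level terms cancels when IQ and GPA are held fixed. The helper name `college_minus_high_school` is made up for illustration.

```python
# Coefficients from the task statement
b_3 = 35    # Level (1 = College, 0 = High School)
b_5 = -10   # GPA x Level interaction

def college_minus_high_school(gpa):
    """Predicted salary gap (in $1000s) at fixed GPA and IQ; all other terms cancel."""
    return b_3 + b_5 * gpa

# GPA at which the gap changes sign
break_even_gpa = -b_3 / b_5

for gpa in (2.0, 3.0, 3.5, 4.0):
    print(f"GPA {gpa}: college - high school = {college_minus_high_school(gpa):+.1f}k")
print(f"Break-even GPA: {break_even_gpa}")
```

The gap is positive below the break-even GPA and negative above it, which is what option iii describes.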

(b) Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.

**Your answer goes here:**

In [36]:
b_0 = 50
b_1 = 20
b_2 = 0.07
b_3 = 35
b_4 = 0.01
b_5 = -10

prediction = b_0 + (b_1 * 4) + (b_2 * 110) + b_3 + (b_4 * (4.0 * 110)) + (b_5 * 4)
print(f"We predict that a college graduate with an IQ of 110 will make {prediction}k a year.")

We predict that a college graduate with an IQ of 110 will make 137.1k a year.


(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer. 

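**Your answer goes here:** False. The size of a coefficient says nothing about statistical significance, because it depends on the units of the predictor. The GPA × IQ product takes values in the hundreds, so a coefficient of 0.01 still shifts the prediction noticeably (0.01 × 4.0 × 110 = 4.4, i.e. $4,400, in part (b)). Evidence for an interaction effect comes from the coefficient's standard error and the resulting t-statistic and p-value, which are not given here.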
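
To illustrate why coefficient size alone is not evidence, here is a small sketch on synthetic data (not the salary data, which is not available here). It fits the same simple regression twice, once with the predictor in its original units and once multiplied by 1000. The helper `slope_and_t` and all simulated values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=n)

def slope_and_t(x, y):
    """OLS slope and its t-statistic for y = b0 + b1*x + noise."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)          # covariance of the estimates
    return beta[1], beta[1] / np.sqrt(cov[1, 1])

slope, t = slope_and_t(x, y)
slope_scaled, t_scaled = slope_and_t(1000 * x, y)  # same predictor, different units

print(f"original units: slope = {slope:.4f}, t = {t:.2f}")
print(f"x * 1000:       slope = {slope_scaled:.6f}, t = {t_scaled:.2f}")
```

The slope shrinks by a factor of 1000 under the rescaling while the t-statistic is unchanged, so a small coefficient can still be highly significant.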

## Task 1.2

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y=\beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 +\epsilon$.

(a) Suppose that the true relationship between X and Y is linear, i.e. $Y=\beta_0 + \beta_1 X +\epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

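**Your answer goes here:** We expect the training RSS of the cubic regression to be lower than or equal to that of the linear regression. The linear model is a special case of the cubic model (set $\beta_2 = \beta_3 = 0$), so least squares on the cubic model can always fit the training data at least as well. If the true relationship is linear, the estimates of $\beta_2$ and $\beta_3$ will likely be close to zero, but the extra terms still fit some of the noise, so the cubic training RSS will typically be slightly lower. It will not be driven close to 0, though, since four coefficients cannot interpolate 100 noisy observations.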

(b) Answer (a) using test rather than training RSS.

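**Your answer goes here:** While the cubic regression fits the training data at least as well, it will generalize worse to fresh data because its extra terms fit noise rather than signal. Since the true relationship is linear, we expect the test RSS of the cubic regression to be higher than that of the linear regression, though the gap may be small when the estimates of $\beta_2$ and $\beta_3$ are close to zero.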
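
Parts (a) and (b) can be checked with a small simulation. The sketch below assumes a made-up linear truth ($Y = 2 + 3X + \epsilon$) and n = 100, fits both models with `np.polyfit`, and reports training and test RSS. The helper names `simulate` and `rss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n=100):
    """Draw n points from a truly linear relationship Y = 2 + 3X + noise."""
    x = rng.uniform(-2, 2, size=n)
    y = 2 + 3 * x + rng.normal(size=n)
    return x, y

def rss(coefs, x, y):
    """Residual sum of squares of a fitted polynomial."""
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

x_train, y_train = simulate()
x_test, y_test = simulate()

linear = np.polyfit(x_train, y_train, deg=1)
cubic = np.polyfit(x_train, y_train, deg=3)

train_rss_linear = rss(linear, x_train, y_train)
train_rss_cubic = rss(cubic, x_train, y_train)
print(f"training RSS: linear = {train_rss_linear:.2f}, cubic = {train_rss_cubic:.2f}")
print(f"test RSS:     linear = {rss(linear, x_test, y_test):.2f}, "
      f"cubic = {rss(cubic, x_test, y_test):.2f}")
```

The training comparison is guaranteed by nesting. The test comparison holds only in expectation, so a single draw can go either way.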

(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

**Your answer goes here:** Since the cubic regression model includes all terms of the linear model and adds additional flexibility, its training RSS will always be equal to or lower than that of the linear regression. Even if the true relationship is not cubic, the cubic model can better capture nonlinear patterns in the training data, reducing the RSS. Thus, the training RSS for the cubic regression will be lower than or equal to the training RSS for the linear regression.

(d) Answer (c) using test rather than training RSS.

**Your answer goes here:** If the true relationship between X and Y isn't linear, the test RSS for the cubic regression may be lower or higher than that of the linear regression, depending on how well it captures the true pattern. If the cubic model provides a better approximation of the true relationship, it will have a lower test RSS. However, if it overfits the training data by capturing noise, it may have a higher test RSS than the linear model. I would say there is not enough information to determine which test RSS will be lower.

## Task 1.3

Explain what cross validation is, how it is implemented, and why we would want to use it.

**Your answer goes here:** Cross-validation is a method used to assess a model’s performance by splitting the data into multiple training and validation sets. It is implemented by dividing the dataset into k folds, training the model on k-1 folds, and evaluating it on the remaining fold, repeating this process k times so that each fold serves as the validation set exactly once. The final performance metric is the average over the k iterations. The goal of cross-validation is to obtain a more reliable estimate of a model’s generalization ability, detect overfitting, and select the best model.
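
The procedure described above can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic data, not the course's reference code. The function name `k_fold_mse`, the polynomial models, and the simulated data are all assumptions.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial fit by k-fold CV."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))          # shuffle once
    folds = np.array_split(indices, k)         # k roughly equal folds
    fold_mse = []
    for i in range(k):
        val_idx = folds[i]                     # held-out fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        errors = y[val_idx] - np.polyval(coefs, x[val_idx])
        fold_mse.append(np.mean(errors ** 2))
    return float(np.mean(fold_mse))            # average over the k folds

# Synthetic data for illustration only
rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, size=100)
y = 2 + 3 * x + rng.normal(size=100)

cv_scores = {degree: k_fold_mse(x, y, degree) for degree in (1, 3, 6)}
for degree, mse in cv_scores.items():
    print(f"degree {degree}: 5-fold CV MSE = {mse:.3f}")
```

Comparing the averaged validation errors across degrees is one way to use cross-validation for model selection.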

# 2. Applied Questions - Analysis of the Credit dataset

Recall the 'Credit' dataset introduced in class. 
This dataset consists of some credit card information for 400 people. 

First import the data and convert income from thousands of dollars to dollars.


In [89]:
# imports and setup

import scipy as sc
import numpy as np

import pandas as pd
import statsmodels.formula.api as sm     #Last lecture: used statsmodels.formula.api.ols() for OLS
from sklearn import linear_model         #Last lecture: used sklearn.linear_model.LinearRegression() for OLS

import matplotlib.pyplot as plt
%matplotlib inline  
plt.rcParams['figure.figsize'] = (10, 6)

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

# Import data from the credit_copy.csv file
credit = pd.read_csv('credit_copy.csv', index_col=0)  # load data, using the first column as the index
credit["Income"] = credit["Income"] * 1000  # Income is stored in thousands of dollars; convert to dollars
credit

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
1,14891.0,3606,283,2,34,11,Male,No,Yes,Caucasian,333
2,106025.0,6645,483,3,82,15,Female,Yes,Yes,Asian,903
3,104593.0,7075,514,4,71,11,Male,No,No,Asian,580
4,148924.0,9504,681,3,36,11,Female,No,No,Asian,964
5,55882.0,4897,357,2,68,16,Male,No,Yes,Caucasian,331
...,...,...,...,...,...,...,...,...,...,...,...
396,12096.0,4100,307,3,32,13,Male,No,Yes,Caucasian,560
397,13364.0,3838,296,5,65,17,Male,No,No,African American,480
398,57872.0,4171,321,5,67,12,Female,No,Yes,Caucasian,138
399,37728.0,2525,192,1,44,13,Male,No,Yes,Caucasian,0


## 2.1: A First Regression Model

**Exercise:** First regress Limit on Rating: 
$$
\text{Limit} = \beta_0 + \beta_1 \text{Rating}. 
$$
Since credit ratings are primarily used by banks to determine credit limits, we expect that Rating is very predictive for Limit, so this regression should be very good. 

Use the 'ols' function from the statsmodels python library. What is the $R^2$ value? What are $H_0$ and $H_A$ for the associated hypothesis test and what is the $p$-value? 


In [107]:
# Fit the linear regression model using formula notation
model = sm.ols('Limit ~ Rating', data=credit).fit()

# Extract R^2 value
r_squared = model.rsquared

# Extract p-value for Rating coefficient
p_value = model.pvalues["Rating"]

# Print results
print(f"R^2 Value: {r_squared:.4f}")
print(f"P-value for Rating coefficient: {p_value:.4e}")

# Display the full regression summary
print(model.summary())


R^2 Value: 0.9938
P-value for Rating coefficient: 0.0000e+00
                            OLS Regression Results                            
Dep. Variable:                  Limit   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                 6.348e+04
Date:                Mon, 10 Mar 2025   Prob (F-statistic):               0.00
Time:                        14:20:11   Log-Likelihood:                -2649.1
No. Observations:                 400   AIC:                             5302.
Df Residuals:                     398   BIC:                             5310.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------

**Your answer goes here:**

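$H_0$: $\beta_1 = 0$, i.e. there is no linear relationship between Rating and Limit.

$H_A$: $\beta_1 \neq 0$, i.e. Limit is linearly related to Rating.

The $R^2$ value is 0.9938, and the $p$-value for the Rating coefficient is reported as 0.0000e+00 (numerically zero), so we reject $H_0$.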
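
For reference, $R^2$ can be computed by hand as $1 - \text{RSS}/\text{TSS}$, and with a single predictor it equals the squared correlation between predictor and response. The sketch below checks this on synthetic data that only loosely mimics a (Rating, Limit) pair. The simulated numbers are assumptions and are not taken from the Credit data.

```python
import numpy as np

# Synthetic stand-in for a (Rating, Limit)-style pair; not the Credit data
rng = np.random.default_rng(3)
x = rng.uniform(100, 900, size=400)
y = -500 + 15 * x + rng.normal(scale=200, size=400)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

rss = np.sum((y - fitted) ** 2)        # unexplained variation
tss = np.sum((y - y.mean()) ** 2)      # total variation
r_squared = 1 - rss / tss

corr = np.corrcoef(x, y)[0, 1]
print(f"R^2 = {r_squared:.4f}, squared correlation = {corr ** 2:.4f}")
```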

## 2.2: Predicting Rating without Limit

Since Rating and Limit are almost the same variable, next we'll forget about Limit and just try to predict Rating from the real-valued variables (non-categorical variables): Income, Cards, Age, Education, Balance.

**Exercise:** Develop a multiple linear regression model to predict Rating. Interpret the results.

For now, just focus on the real-valued variables (Income, Cards, Age, Education, Balance)
and ignore the categorical variables (Gender, Student, Married, Ethnicity). 



In [111]:
# Define predictor variables and response variable
predictors = ["Income", "Cards", "Age", "Education", "Balance"]
response = "Rating"

# Fit the multiple linear regression model using formula notation
multi_model = sm.ols(f"{response} ~ {' + '.join(predictors)}", data=credit).fit()

# Extract key results
r_squared = multi_model.rsquared  # R^2 value
p_values = multi_model.pvalues     # p-values for predictors
summary = multi_model.summary()    # Full regression summary

# Print results
print(f"R^2 Value: {r_squared:.4f}\n")
print("P-values for predictors:\n", p_values, "\n")
print(summary)

R^2 Value: 0.9409

P-values for predictors:
 Intercept     6.570730e-31
Income       6.371109e-123
Cards         6.105378e-01
Age           2.307941e-01
Education     3.678292e-01
Balance      1.635432e-158
dtype: float64 

                            OLS Regression Results                            
Dep. Variable:                 Rating   R-squared:                       0.941
Model:                            OLS   Adj. R-squared:                  0.940
Method:                 Least Squares   F-statistic:                     1255.
Date:                Mon, 10 Mar 2025   Prob (F-statistic):          1.68e-239
Time:                        14:32:02   Log-Likelihood:                -2017.9
No. Observations:                 400   AIC:                             4048.
Df Residuals:                     394   BIC:                             4072.
Df Model:                           5                                         
Covariance Type:            nonrobust                            

Which independent variables are good/bad predictors? What is the best overall model?

**Your observations:** Cards ($p \approx 0.61$), Age ($p \approx 0.23$), and Education ($p \approx 0.37$) are not statistically significant predictors at the 5% level. Income and Balance are highly significant ($p < 10^{-100}$ for both). The best model would therefore predict Rating using Income and Balance only.


## 2.3: Incorporating Categorical Variables Into Regression Models

Now consider the binary categorical variables which we mapped to integer 0, 1 values in class.

In [121]:
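# Strip stray whitespace first: the raw Gender labels may carry a leading space (' Male')
credit["Gender_num"] = credit["Gender"].str.strip().map({'Male':0, 'Female':1})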
credit["Student_num"] = credit["Student"].map({'Yes':1, 'No':0})
credit["Married_num"] = credit["Married"].map({'Yes':1, 'No':0})

Can you improve the model you developed in Activity 2 by incorporating one or more of these variables?


In [123]:
# Define the best model with significant predictors
final_predictors = ["Income", "Balance", "Student_num"]
response = "Rating"

# Fit the multiple linear regression model
final_model = sm.ols(f"{response} ~ {' + '.join(final_predictors)}", data=credit).fit()

# Print model summary
print(final_model.summary())

                            OLS Regression Results                            
Dep. Variable:                 Rating   R-squared:                       0.974
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     4964.
Date:                Mon, 10 Mar 2025   Prob (F-statistic):          1.09e-313
Time:                        14:40:01   Log-Likelihood:                -1853.1
No. Observations:                 400   AIC:                             3714.
Df Residuals:                     396   BIC:                             3730.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept     149.3993      2.178     68.584      

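**Your answer goes here:** Yes. Replacing Cards, Age, and Education with Student_num raises $R^2$ from 0.941 to 0.974 (adjusted $R^2$ from 0.940 to 0.974) and lowers the AIC from 4048 to 3714, so this model fits better while using fewer predictors.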
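
Because the two models use different numbers of predictors, adjusted $R^2$ is the fairer comparison. As a rough check, the sketch below applies the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ to the rounded $R^2$ values printed in the summaries above. The function name `adjusted_r_squared` is made up for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the printed summaries (R^2 rounded as shown there)
adj_five_predictors = adjusted_r_squared(0.941, n_obs=400, n_predictors=5)
adj_three_predictors = adjusted_r_squared(0.974, n_obs=400, n_predictors=3)

print(f"5 predictors: adjusted R^2 = {adj_five_predictors:.3f}")
print(f"3 predictors: adjusted R^2 = {adj_three_predictors:.3f}")
```

With 400 observations the penalty for extra predictors is small, so the adjusted values stay close to the raw $R^2$ values.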