# AAE 722 — Lab 3: Linear Regression on the College Dataset


## Q1. Load data, show first 10 rows, list all columns, and briefly describe variables


In [21]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

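try:
    from ISLP import load_data
except ImportError:
    # ISLP is not installed: install it into the current interpreter, then retry the import.
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "ISLP"])
    from ISLP import load_data

# `display` is predefined in Jupyter; importing it explicitly also covers other IPython front ends.
from IPython.display import display

College = load_data("College")
print("Q1: Load College; show first 10 rows and list all columns")
display(College.head(10))            # first 10 rows
display(pd.Series(College.columns))  # all column names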



Q1: Load College; show first 10 rows and list all columns


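| | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
| 5 | Yes | 587 | 479 | 158 | 38 | 62 | 678 | 41 | 13500 | 3335 | 500 | 675 | 67 | 73 | 9.4 | 11 | 9727 | 55 |
| 6 | Yes | 353 | 340 | 103 | 17 | 45 | 416 | 230 | 13290 | 5720 | 500 | 1500 | 90 | 93 | 11.5 | 26 | 8861 | 63 |
| 7 | Yes | 1899 | 1720 | 489 | 37 | 68 | 1594 | 32 | 13868 | 4826 | 450 | 850 | 89 | 100 | 13.7 | 37 | 11487 | 73 |
| 8 | Yes | 1038 | 839 | 227 | 30 | 63 | 973 | 306 | 15595 | 4400 | 300 | 500 | 79 | 84 | 11.3 | 23 | 11644 | 80 |
| 9 | Yes | 582 | 498 | 172 | 21 | 44 | 799 | 78 | 10468 | 3380 | 660 | 1800 | 40 | 41 | 11.5 | 15 | 8991 | 52 |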


0         Private
1            Apps
2          Accept
3          Enroll
4       Top10perc
5       Top25perc
6     F.Undergrad
7     P.Undergrad
8        Outstate
9      Room.Board
10          Books
11       Personal
12            PhD
13       Terminal
14      S.F.Ratio
15    perc.alumni
16         Expend
17      Grad.Rate
dtype: object

**Dataset description (brief):**

- The `College` dataset contains institution-level variables for 777 U.S. colleges and universities.
- **Outstate**: Out-of-state tuition in dollars (the response variable in this lab).
- **Private**: Whether the institution is private (`Yes`/`No`).
- **Room.Board**: Estimated annual room and board cost, in dollars.
- **PhD**: Percentage of faculty with a PhD.
- **Top10perc**: Percentage of new students from the top 10% of their high school class.

---

## Q2. Simple linear regression: Outstate ~ Top10perc
- Fit the model and report estimated coefficients.
- Interpret the relationship between `Top10perc` and `Outstate`.


In [17]:
print("Q2: Simple OLS — Outstate ~ Top10perc")
model_q2 = smf.ols("Outstate ~ Top10perc", data=College).fit()
print(model_q2.params)
print(model_q2.summary())




Q2: Simple OLS — Outstate ~ Top10perc
Intercept    6906.458580
Top10perc     128.243669
dtype: float64
                            OLS Regression Results                            
Dep. Variable:               Outstate   R-squared:                       0.316
Model:                            OLS   Adj. R-squared:                  0.315
Method:                 Least Squares   F-statistic:                     358.4
Date:                Fri, 19 Sep 2025   Prob (F-statistic):           5.46e-66
Time:                        11:37:46   Log-Likelihood:                -7403.3
No. Observations:                 777   AIC:                         1.481e+04
Df Residuals:                     775   BIC:                         1.482e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------

**Q2 Interpretation:**
- The estimated intercept is 6906.46 and the slope for Top10perc is 128.24 (p < 0.001). The intercept is the fitted out-of-state tuition for a college with Top10perc = 0.
- Each additional percentage point of new students from the top 10% of their high school class is associated with about $128 higher out-of-state tuition, on average. The relationship is statistically significant, and the positive coefficient suggests that more selective institutions tend to charge higher out-of-state tuition. Top10perc alone explains about 32% of the variation in Outstate (R² = 0.316), and the estimate describes an association rather than a causal effect.
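
To make the slope interpretation concrete, the fitted line can be evaluated by hand. The sketch below only reuses the two coefficients printed in the Q2 output; the helper name `predict_outstate` and the example Top10perc values are hypothetical and chosen for illustration.

```python
# Coefficients copied from the Q2 output above.
INTERCEPT = 6906.458580
SLOPE = 128.243669


def predict_outstate(top10perc: float) -> float:
    """Fitted out-of-state tuition (dollars) from the Q2 simple regression."""
    return INTERCEPT + SLOPE * top10perc


# Hypothetical Top10perc values, for illustration only.
for x in (10, 25, 50):
    print(f"Top10perc = {x:>2}: fitted Outstate = {predict_outstate(x):,.2f}")

# Two colleges that differ by 10 points in Top10perc differ by 10 * slope in fitted tuition.
gap = predict_outstate(35) - predict_outstate(25)
print(f"Fitted difference for a 10-point gap: {gap:,.2f}")
```

With the fitted model object still in memory, `model_q2.predict(pd.DataFrame({"Top10perc": [10, 25, 50]}))` should give the same fitted values without copying coefficients by hand.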

---

## Q3. Multiple linear regression: Outstate ~ Top10perc + Room.Board + PhD
- Fit the model and report coefficients.
- Interpret the marginal effect of each predictor, holding others constant.


In [18]:
print("Q3: Multiple OLS — Outstate ~ Top10perc + Room.Board + PhD")
model_q3 = smf.ols('Outstate ~ Top10perc + Q("Room.Board") + PhD', data=College).fit()
print(model_q3.params)
print(model_q3.summary())



Q3: Multiple OLS — Outstate ~ Top10perc + Room.Board + PhD
Intercept         -430.635022
Top10perc           82.000309
Q("Room.Board")      1.882480
PhD                  5.622552
dtype: float64
                            OLS Regression Results                            
Dep. Variable:               Outstate   R-squared:                       0.547
Model:                            OLS   Adj. R-squared:                  0.545
Method:                 Least Squares   F-statistic:                     310.7
Date:                Fri, 19 Sep 2025   Prob (F-statistic):          2.61e-132
Time:                        11:37:56   Log-Likelihood:                -7243.6
No. Observations:                 777   AIC:                         1.450e+04
Df Residuals:                     773   BIC:                         1.451e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                

**Q3 Interpretation:**

- Top10perc: The coefficient is 82.00 (p < 0.001), meaning that even after controlling for room and board and faculty PhD percentage, each one-point increase in Top10perc is associated with about $82 higher out-of-state tuition.

- Room.Board: The coefficient is 1.88, so each additional dollar of room and board is associated with about $1.88 more out-of-state tuition, holding Top10perc and PhD fixed. The p-value reported here is 0.397, which on its own would make the coefficient not statistically significant.

- PhD: The coefficient is 5.62 (reported p = 0.432), also not significant on that basis. Faculty qualifications (share with PhD) do not show a clear marginal association with tuition in this model.

- In short, on these p-values only Top10perc is individually significant. However, R² rises from 0.316 in Q2 to 0.547 here, so Room.Board and PhD together explain a large additional share of the variation in Outstate. That is hard to reconcile with both predictors being individually insignificant, so the two p-values above should be rechecked against the coefficient table in the summary output.
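
As a consistency check on the significance statements above, the joint contribution of Room.Board and PhD can be approximated from the reported R² values alone, since the Q2 model is nested in the Q3 model and both use the same 777 observations. This is a minimal sketch: the helper name `partial_f` is hypothetical, and the inputs are the rounded R² values from the summaries, so the result is approximate.

```python
# Values copied from the Q2 and Q3 summary outputs (R-squared is rounded to 3 decimals there).
r2_restricted = 0.316  # Q2: Outstate ~ Top10perc
r2_full = 0.547        # Q3: adds Room.Board and PhD
df_resid_full = 773    # residual degrees of freedom of the Q3 model
n_added = 2            # number of predictors added going from Q2 to Q3


def partial_f(r2_small: float, r2_big: float, q: int, df_resid: int) -> float:
    """Partial F-statistic for q added predictors, computed from the two R-squared values."""
    return ((r2_big - r2_small) / q) / ((1.0 - r2_big) / df_resid)


f_joint = partial_f(r2_restricted, r2_full, n_added, df_resid_full)
print(f"Approximate joint F for Room.Board and PhD: {f_joint:.1f}")
```

The statistic comes out at roughly 197 on (2, 773) degrees of freedom, far above conventional critical values, so the two added predictors are jointly highly significant. With the fitted models in memory, `anova_lm(model_q2, model_q3)` would give the exact version of this test.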

---

## Q4. Add a quadratic term for Top10perc
- Model: Outstate ~ Top10perc + Top10perc^2 + Room.Board + PhD
- Report new coefficients and interpret the quadratic term.


In [19]:
print("Q4: Quadratic OLS — Outstate ~ Top10perc + Top10perc^2 + Room.Board + PhD")
model_q4 = smf.ols('Outstate ~ Top10perc + I(Top10perc**2) + Q("Room.Board") + PhD', data=College).fit()
print(model_q4.params)
print(model_q4.summary())



Q4: Quadratic OLS — Outstate ~ Top10perc + Top10perc^2 + Room.Board + PhD
Intercept           -1087.275765
Top10perc             138.067460
I(Top10perc ** 2)      -0.685852
Q("Room.Board")         1.908079
PhD                     1.961703
dtype: float64
                            OLS Regression Results                            
Dep. Variable:               Outstate   R-squared:                       0.553
Model:                            OLS   Adj. R-squared:                  0.550
Method:                 Least Squares   F-statistic:                     238.5
Date:                Fri, 19 Sep 2025   Prob (F-statistic):          2.95e-133
Time:                        11:38:09   Log-Likelihood:                -7238.4
No. Observations:                 777   AIC:                         1.449e+04
Df Residuals:                     772   BIC:                         1.451e+04
Df Model:                           4                                         
Covariance Type:            nonrobust

**Q4 Interpretation:**

- After adding a quadratic term for Top10perc, the estimated coefficient on the squared term is -0.69 (p = 0.001), while the linear term rises to 138.07.

- This negative and significant coefficient indicates a concave relationship: the marginal association between Top10perc and tuition is positive and largest at low values of Top10perc, and it shrinks as Top10perc increases. In other words, fitted tuition still rises with student selectivity, but at a decreasing rate as the share of top students becomes very high.
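
The diminishing marginal effect can be read directly off the two Top10perc coefficients: for a fit of the form b1·x + b2·x², the slope at x is b1 + 2·b2·x, and the curve peaks at x = -b1 / (2·b2). The sketch below uses only the coefficients printed in the Q4 output; the function name `marginal_effect` and the evaluation points are hypothetical.

```python
# Coefficients copied from the Q4 output above.
B1 = 138.067460  # Top10perc
B2 = -0.685852   # I(Top10perc ** 2)


def marginal_effect(top10perc: float) -> float:
    """Slope of fitted Outstate with respect to Top10perc, other predictors held fixed."""
    return B1 + 2.0 * B2 * top10perc


# Hypothetical evaluation points, for illustration only.
for x in (10, 50, 90):
    print(f"Top10perc = {x}: marginal effect = {marginal_effect(x):.2f} dollars per point")

# Value of Top10perc at which the fitted parabola reaches its maximum.
turning_point = -B1 / (2.0 * B2)
print(f"Turning point: Top10perc = {turning_point:.1f}")
```

The turning point falls just above 100, which is the upper limit of a percentage, so within the feasible range of Top10perc the fitted marginal effect stays positive while declining steadily.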

---

## Q5. Compare linear (Q3) vs quadratic (Q4) using ANOVA
- Use `anova_lm(model_q3, model_q4)`.
- Report F-statistic and p-value; state whether the quadratic term significantly improves fit.


In [20]:
print("Q5: anova_lm comparison — linear (Q3) vs quadratic (Q4)")
anova_res = anova_lm(model_q3, model_q4)
print(anova_res)


Q5: anova_lm comparison — linear (Q3) vs quadratic (Q4)
   df_resid           ssr  df_diff       ss_diff          F    Pr(>F)
0     773.0  5.693394e+09      0.0           NaN        NaN       NaN
1     772.0  5.618020e+09      1.0  7.537388e+07  10.357499  0.001343


**Q5 Conclusion:**

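- The ANOVA comparison shows an F-statistic of 10.36 on (1, 772) degrees of freedom, with a p-value of 0.0013.

- Since the p-value is less than 0.05, the quadratic model (with Top10perc²) provides a significantly better fit than the purely linear model. Because the two models differ by a single term, this F-test is equivalent to the t-test on the squared term in Q4, which is why the p-values agree. The improvement is statistically significant but modest in size (R² rises from 0.547 to 0.553), so the result supports including the quadratic term and suggests a mildly nonlinear relationship between selectivity and tuition.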
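
The F-statistic in the table can be reproduced by hand from the two residual sums of squares, which shows what `anova_lm` is comparing. This is a minimal sketch using the values as printed above (rounded to seven significant digits), so the result matches the table only to a few decimals; the variable names are hypothetical.

```python
# Values copied from the anova_lm output above.
ssr_linear = 5.693394e9     # residual sum of squares, Q3 model
ssr_quadratic = 5.618020e9  # residual sum of squares, Q4 model
df_resid_quadratic = 772    # residual degrees of freedom, Q4 model
df_diff = 1                 # one extra parameter (the squared term)

# F = (drop in SSR per added parameter) / (residual variance of the larger model)
f_stat = ((ssr_linear - ssr_quadratic) / df_diff) / (ssr_quadratic / df_resid_quadratic)
print(f"F-statistic: {f_stat:.3f}")

# Share of the linear model's residual variation removed by the squared term.
relative_reduction = (ssr_linear - ssr_quadratic) / ssr_linear
print(f"Relative SSR reduction: {relative_reduction:.2%}")
```

The squared term removes only a little over 1% of the residual sum of squares, which is consistent with a significant but small improvement in fit.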