# CS-E-106: Data Modeling
## Fall 2019: Lecture 02-03

## Lecture 02

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
from scipy import stats

In [3]:
toluca_data = pd.read_csv("data/toluca_data.csv")
toluca_data.shape

(25, 2)

**Note:** We fit the model with the formula interface (`ols` from `statsmodels.formula.api`) instead of `sm.OLS()` because `sm.stats.anova_lm()` only works on a model fitted from a formula. The ANOVA table is therefore only available for the formula-based fit.

**statsmodels Homepage:** http://www.statsmodels.org/stable/index.html

**statsmodels.formula.api:** https://www.statsmodels.org/dev/example_formulas.html

In [4]:
# The raw column name carries a trailing space; rename it before using it in a formula
toluca_data = toluca_data.rename(columns={"lotsize ": "lotsize"})
lm_formula = ols("workhrs ~ lotsize", data=toluca_data).fit()
lm_formula.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | workhrs | R-squared: | 0.822 |
| Model: | OLS | Adj. R-squared: | 0.814 |
| Method: | Least Squares | F-statistic: | 105.9 |
| Date: | Tue, 08 Oct 2019 | Prob (F-statistic): | 4.45e-10 |
| Time: | 15:52:54 | Log-Likelihood: | -131.64 |
| No. Observations: | 25 | AIC: | 267.3 |
| Df Residuals: | 23 | BIC: | 269.7 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 62.3659 | 26.177 | 2.382 | 0.026 | 8.214 | 116.518 |
| lotsize | 3.5702 | 0.347 | 10.290 | 0.000 | 2.852 | 4.288 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.608 | Durbin-Watson: | 1.432 |
| Prob(Omnibus): | 0.738 | Jarque-Bera (JB): | 0.684 |
| Skew: | 0.298 | Prob(JB): | 0.71 |
| Kurtosis: | 2.45 | Cond. No. | 202.0 |


In [5]:
X = sm.add_constant(toluca_data["lotsize"])
Y = toluca_data["workhrs"]


**T-statistic:**
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html

In [10]:
# One-sided (upper-tail) probability for t = 10.29 with 23 residual degrees of freedom
pr_t = stats.t.sf(np.abs(10.29), 23)
pr_t

2.222734665439868e-10
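
The `P>|t|` column in the regression summary is a two-sided p-value, whereas `stats.t.sf` returns a single upper-tail probability. A minimal sketch of the relationship, reusing the rounded t statistic (10.29) and the residual degrees of freedom (23) from the summary; the variable names are illustrative only:

```python
from scipy import stats

t_stat = 10.29      # t statistic for lotsize, rounded as in the summary
df_resid = 23       # residual degrees of freedom (n - 2)

# Upper-tail probability P(T > |t|)
p_one_sided = stats.t.sf(abs(t_stat), df_resid)

# Two-sided p-value P(|T| > |t|), comparable to the summary's P>|t| column
p_two_sided = 2 * p_one_sided

print(p_one_sided, p_two_sided)
```

Doubling the one-sided value gives about 4.45e-10, in line with the `Prob (F-statistic)` entry, as expected in simple linear regression where the F statistic is the square of the slope's t statistic.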

**OLS Regression Results:** https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html

**conf_int:** https://www.statsmodels.org/dev/generated/generated/statsmodels.regression.linear_model.RegressionResults.conf_int.html#statsmodels.regression.linear_model.RegressionResults.conf_int

In [6]:
lm_formula.conf_int(alpha=0.05)

| | 0 | 1 |
|---|---|---|
| Intercept | 8.213711 | 116.518006 |
| lotsize | 2.852435 | 4.287969 |


In [7]:
lm_formula.conf_int(alpha=0.01)

| | 0 | 1 |
|---|---|---|
| Intercept | -11.122986 | 135.854703 |
| lotsize | 2.596135 | 4.544269 |


In [8]:
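# Manual lower 95% limit for the slope, using the rounded summary values.
# You are not expected to do it this way; conf_int() uses the unrounded estimates.
3.57 - stats.t.ppf(1 - 0.05/2, 23) * 0.347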

2.852175809184593

In [9]:
lm_formula.fittedvalues

0     347.982020
1     169.471919
2     240.875960
3     383.684040
4     312.280000
5     276.577980
6     490.790101
7     347.982020
8     419.386061
9     240.875960
10    205.173939
11    312.280000
12    383.684040
13    133.769899
14    455.088081
15    419.386061
16    169.471919
17    240.875960
18    383.684040
19    455.088081
20    169.471919
21    383.684040
22    205.173939
23    347.982020
24    312.280000
dtype: float64

**Confidence & Prediction Interval:** The `get_prediction()` method returns both confidence intervals (`mean_ci_lower` and `mean_ci_upper`) and prediction intervals (`obs_ci_lower` and `obs_ci_upper`) for each observation in the data frame passed to the `exog` argument. `mean` is the fitted value for that observation and `mean_se` is the standard error of that fitted value.

**OLS Regression Results:** https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html

**get_prediction():** https://www.statsmodels.org/dev/generated/generated/statsmodels.regression.linear_model.RegressionResults.get_prediction.html#statsmodels.regression.linear_model.RegressionResults.get_prediction

In [10]:
predictions = lm_formula.get_prediction(X)
predictions.summary_frame(alpha=0.05)

| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 347.98202 | 10.362798 | 326.544938 | 369.419102 | 244.733348 | 451.230693 |
| 1 | 169.471919 | 16.969741 | 134.367335 | 204.576503 | 62.546377 | 276.397462 |
| 2 | 240.87596 | 11.979336 | 216.094815 | 265.657105 | 136.881513 | 344.870407 |
| 3 | 383.68404 | 11.979336 | 358.902895 | 408.465185 | 279.689593 | 487.678487 |
| 4 | 312.28 | 9.764662 | 292.080258 | 332.479742 | 209.281119 | 415.278881 |
| 5 | 276.57798 | 10.362798 | 255.140898 | 298.015062 | 173.329307 | 379.826652 |
| 6 | 490.790101 | 19.907858 | 449.607559 | 531.972643 | 381.717915 | 599.862287 |
| 7 | 347.98202 | 10.362798 | 326.544938 | 369.419102 | 244.733348 | 451.230693 |
| 8 | 419.386061 | 14.272328 | 389.861502 | 448.91062 | 314.160401 | 524.61172 |
| 9 | 240.87596 | 11.979336 | 216.094815 | 265.657105 | 136.881513 | 344.870407 |


## Lecture 03

In [11]:
Xh = pd.DataFrame([100], columns=["lotsize"])
Xh

| | lotsize |
|---|---|
| 0 | 100 |


In [12]:
predictions = lm_formula.get_prediction(Xh)
predictions.summary_frame(alpha=0.1)

| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 419.386061 | 14.272328 | 394.925125 | 443.846996 | 332.207177 | 506.564945 |
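
Both interval types in the row above can be reproduced by hand, which makes the difference between them explicit: the confidence interval uses only the standard error of the fitted mean, while the prediction interval for a single new observation also adds the residual variance (MSE). The sketch below plugs in `mean` and `mean_se` from the row above and the residual mean square reported in this notebook's ANOVA output; the variable names are illustrative.

```python
import math
from scipy import stats

y_hat = 419.386061      # fitted value at lotsize = 100
se_mean = 14.272328     # mean_se: standard error of the fitted mean
mse = 2383.715617       # residual mean square from the ANOVA table
df_resid = 23           # n - 2
alpha = 0.10

t_crit = stats.t.ppf(1 - alpha / 2, df_resid)

# Confidence interval for the mean response at this lot size
mean_ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)

# Prediction interval for one new observation: variance = MSE + se_mean^2
se_pred = math.sqrt(mse + se_mean ** 2)
obs_ci = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

print(mean_ci)   # compare with mean_ci_lower / mean_ci_upper
print(obs_ci)    # compare with obs_ci_lower / obs_ci_upper
```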


In [13]:
np.sqrt(predictions.var_resid)  # residual standard error, sqrt(MSE); the same as residual.scale in R

48.82331018110064

In [14]:
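# Prediction interval for the mean of m new observations at lotsize = 100

MSE = predictions.var_resid      # residual mean square
VarYhat = (14.272328)**2         # mean_se squared, from the summary frame above
m = 3
var_predmean = MSE/m + VarYhat

print(var_predmean)

tc = stats.t.ppf(1 - 0.10/2, 23)  # 90% interval, 23 residual degrees of freedom

(419.3861 - tc*np.sqrt(var_predmean), 419.3861 + tc*np.sqrt(var_predmean))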


998.2712188862389


(365.23559151935166, 473.53660848064834)
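
For reference, the calculation in the cell above corresponds to the following formulas, written here in a textbook-style notation that is an assumption on my part (the notebook itself does not state it). $\hat{Y}_h$ is the fitted value at the chosen lot size and $s\{\hat{Y}_h\}$ is `mean_se`:

$$
s^2\{\text{predmean}\} = \frac{MSE}{m} + s^2\{\hat{Y}_h\},
\qquad
\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\; s\{\text{predmean}\}
$$

With $MSE \approx 2383.72$, $m = 3$ and $s\{\hat{Y}_h\} = 14.272328$:

$$
s^2\{\text{predmean}\} \approx 794.57 + 203.70 = 998.27,
\qquad
419.3861 \pm 1.7139 \times 31.595 \approx (365.24,\ 473.54)
$$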

**F-statistic:** https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html

In [15]:
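## Confidence Band for the Regression Line
# Working-Hotelling multiplier: W^2 = 2 * F(1 - alpha; 2, n - 2), here with alpha = 0.1.
# The F quantile comes from ppf (inverse CDF), not cdf.
W = np.sqrt(2 * stats.f.ppf(1 - 0.1, 2, 23))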

In [16]:
predictions = lm_formula.get_prediction(X)
CI = predictions.summary_frame(alpha=0.1)

In [17]:
np.vstack((CI["mean"]-W*CI["mean_se"],CI["mean"]+W*CI["mean_se"]))

array([[324.58278674, 131.1541876 , 213.8265786 , 356.63465941,
        290.23136063, 253.17874634, 445.83809092, 324.58278674,
        387.15909744, 213.8265786 , 172.94697623, 290.23136063,
        356.63465941,  88.81788889, 416.77034921, 387.15909744,
        131.1541876 , 213.8265786 , 356.63465941, 416.77034921,
        131.1541876 , 356.63465941, 172.94697623, 324.58278674,
        290.23136063],
       [371.38125366, 207.78965079, 267.92534059, 410.7334214 ,
        334.32863937, 299.97721326, 535.74211111, 371.38125366,
        451.61302377, 267.92534059, 237.40090256, 334.32863937,
        410.7334214 , 178.72190908, 493.4058124 , 451.61302377,
        207.78965079, 267.92534059, 410.7334214 , 493.4058124 ,
        207.78965079, 410.7334214 , 237.40090256, 371.38125366,
        334.32863937]])
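
As a check on the band limits printed above, the Working-Hotelling multiplier can be computed on its own and compared with the ordinary t multiplier. This is a small sketch under the same settings as the notebook (alpha = 0.1, 23 residual degrees of freedom); it reuses `mean` and `mean_se` for the first observation from the earlier summary frame, and the variable names are illustrative.

```python
import math
from scipy import stats

alpha = 0.10
df_resid = 23           # n - 2 for the 25 observations

# Working-Hotelling multiplier: W^2 = 2 * F(1 - alpha; 2, n - 2)
W = math.sqrt(2 * stats.f.ppf(1 - alpha, 2, df_resid))

# Pointwise t multiplier for a single confidence interval, for comparison
t_crit = stats.t.ppf(1 - alpha / 2, df_resid)

# First observation from the summary frame: mean and mean_se
y_hat, se_mean = 347.98202, 10.362798
band = (y_hat - W * se_mean, y_hat + W * se_mean)

print(W, t_crit)
print(band)   # compare with the first column of the array above
```

Because W is larger than the t multiplier, the simultaneous band is wider than the pointwise confidence interval at every lot size.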

**ANOVA Table:** https://www.statsmodels.org/0.9.0/generated/statsmodels.stats.anova.anova_lm.html

In [18]:
sm.stats.anova_lm(lm_formula)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| lotsize | 1.0 | 252377.580808 | 252377.580808 | 105.875709 | 4.448828e-10 |
| Residual | 23.0 | 54825.459192 | 2383.715617 | | |
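
The ANOVA table ties together several numbers reported elsewhere in this notebook. The sketch below recomputes the F statistic, R-squared, and the residual standard error directly from the sums of squares and degrees of freedom in the table; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table
ssr, df_model = 252377.580808, 1     # regression (lotsize) sum of squares
sse, df_resid = 54825.459192, 23     # residual sum of squares

msr = ssr / df_model                 # regression mean square
mse = sse / df_resid                 # residual mean square

f_stat = msr / mse                   # F statistic
r_squared = ssr / (ssr + sse)        # coefficient of determination
resid_se = math.sqrt(mse)            # residual standard error

print(f_stat, r_squared, resid_se)
```

The square root of the F statistic is about 10.29, which matches the t statistic for `lotsize` in the regression summary.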


**Correlation Coefficients:**

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

In [3]:
expenditures = pd.read_csv("data/expenditures.csv")

In [8]:
stats.pearsonr(expenditures["Y1"], expenditures["Y2"])

(0.6737664071298821, 0.01629169661963086)

In [7]:
stats.spearmanr(expenditures["Y1"], expenditures["Y2"])

SpearmanrResult(correlation=0.8951048951048951, pvalue=8.36658642909172e-05)
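
The two coefficients above differ noticeably (about 0.67 versus 0.90). One possible reason for such a gap is a relationship that is monotone but not linear, since Spearman's coefficient is Pearson's coefficient applied to ranks. The sketch below illustrates this on hypothetical data (not `expenditures.csv`); all names and values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y increases monotonically but nonlinearly with x
x = np.arange(1, 9, dtype=float)
y = x ** 3

r_pearson, _ = stats.pearsonr(x, y)
rho_spearman, _ = stats.spearmanr(x, y)

# Spearman's rho equals Pearson's r computed on the ranks
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(r_pearson, rho_spearman, r_on_ranks)
```

Here the rank-based coefficient is exactly 1 because the ordering of `y` follows the ordering of `x`, while Pearson's coefficient stays below 1 because the relationship is not a straight line.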