# Section 4: Statistical Observations (via Stats Linear Regression)

**Corbett includes his work using stats linear regression here.** Fit a stats linear regression to dfX and describe the statistical findings (R-squared, p-value, F-value, etc.).

In [43]:
import statsmodels.formula.api as sm

ols1 = sm.ols(formula='CAISO_HourlyLoad ~ isWeekend + DayofYear + CA_CustomerCount + apparentTemperatureMax + apparentTemperatureMin + dewPoint + uvIndex + GDP + CAPOP + pressure + humidity', 
              data=dfX).fit()

ols1.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | CAISO_HourlyLoad | R-squared: | 0.744 |
| Model: | OLS | Adj. R-squared: | 0.742 |
| Method: | Least Squares | F-statistic: | 331.5 |
| Date: | Fri, 19 Apr 2019 | Prob (F-statistic): | 0.0 |
| Time: | 08:38:03 | Log-Likelihood: | -15328.0 |
| No. Observations: | 1267 | AIC: | 30680.0 |
| Df Residuals: | 1255 | BIC: | 30740.0 |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.697e+06 | 1.35e+06 | -1.260 | 0.208 | -4.34e+06 | 9.46e+05 |
| isWeekend | -5.221e+04 | 2718.465 | -19.207 | 0.000 | -5.75e+04 | -4.69e+04 |
| DayofYear | 61.6018 | 15.324 | 4.020 | 0.000 | 31.538 | 91.665 |
| CA_CustomerCount | 0.0115 | 0.007 | 1.673 | 0.095 | -0.002 | 0.025 |
| apparentTemperatureMax | 6073.3752 | 593.240 | 10.238 | 0.000 | 4909.524 | 7237.227 |
| apparentTemperatureMin | 1.023e+04 | 708.251 | 14.439 | 0.000 | 8837.217 | 1.16e+04 |
| dewPoint | -1.145e+04 | 1135.772 | -10.085 | 0.000 | -1.37e+04 | -9225.737 |
| uvIndex | -89.4488 | 1017.101 | -0.088 | 0.930 | -2084.854 | 1905.957 |
| GDP | -0.0143 | 0.007 | -1.924 | 0.055 | -0.029 | 0.000 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 19.156 | Durbin-Watson: | 0.365 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 21.655 |
| Skew: | 0.239 | Prob(JB): | 1.99e-05 |
| Kurtosis: | 3.426 | Cond. No. | 5.12e+10 |


**Interpretation:** We used the statsmodels package to fit an ordinary least squares (OLS) linear regression to our data. We started with a linear model containing all 11 independent variables. This initial model has an adjusted R-squared of 0.742, which is a good starting point. Some independent variables have high p-values (CA_CustomerCount, uvIndex, CAPOP), indicating that they are probably not statistically significant and that the model should lose little explanatory power without them.
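To make the link between the coefficient table and "high p-value" concrete, here is a minimal sketch that recomputes the t statistic (coefficient divided by standard error) and an approximate 95% confidence interval for a few rows of the first summary. The values are copied from that table. The helper name `t_and_interval` and the critical value 1.96 are assumptions for illustration; statsmodels uses the exact t distribution, so the intervals differ slightly from the reported ones. Rows whose standard error is printed with a single significant digit are left out because the ratio would be too coarse.

```python
# Coefficient and standard error pairs copied from the first summary table.
ols1_rows = {
    "isWeekend": (-5.221e4, 2718.465),
    "DayofYear": (61.6018, 15.324),
    "apparentTemperatureMax": (6073.3752, 593.240),
    "uvIndex": (-89.4488, 1017.101),
}

# Assumption: 1.96 is the large-sample approximation to the two-sided 95%
# critical value; statsmodels uses the exact t distribution instead.
T_CRIT_95 = 1.96


def t_and_interval(coef, std_err, t_crit=T_CRIT_95):
    """Return the t statistic and an approximate 95% confidence interval."""
    t_stat = coef / std_err
    half_width = t_crit * std_err
    return t_stat, (coef - half_width, coef + half_width)


results = {}
for name, (coef, std_err) in ols1_rows.items():
    t_stat, (low, high) = t_and_interval(coef, std_err)
    # An interval that contains zero goes with a p-value above roughly 0.05.
    contains_zero = low <= 0.0 <= high
    results[name] = (t_stat, low, high, contains_zero)
    print(f"{name:24s} t={t_stat:8.3f}  CI=({low:.1f}, {high:.1f})  "
          f"contains zero: {contains_zero}")
```

Of these four rows, only uvIndex has an interval that straddles zero, which matches its reported p-value of 0.930.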

In [85]:
ols2 = sm.ols(formula='CAISO_HourlyLoad ~ isWeekend + DayofYear + apparentTemperatureMax + apparentTemperatureMin + dewPoint + GDP + pressure + humidity', 
              data=dfX).fit()

ols2.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | CAISO_HourlyLoad | R-squared: | 0.743 |
| Model: | OLS | Adj. R-squared: | 0.742 |
| Method: | Least Squares | F-statistic: | 455.4 |
| Date: | Fri, 19 Apr 2019 | Prob (F-statistic): | 0.0 |
| Time: | 09:10:26 | Log-Likelihood: | -15330.0 |
| No. Observations: | 1267 | AIC: | 30680.0 |
| Df Residuals: | 1258 | BIC: | 30720.0 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.191e+06 | 4.03e+05 | -2.953 | 0.003 | -1.98e+06 | -4e+05 |
| isWeekend | -5.225e+04 | 2718.407 | -19.221 | 0.000 | -5.76e+04 | -4.69e+04 |
| DayofYear | 66.3125 | 12.258 | 5.410 | 0.000 | 42.263 | 90.362 |
| apparentTemperatureMax | 6063.8541 | 568.624 | 10.664 | 0.000 | 4948.299 | 7179.409 |
| apparentTemperatureMin | 1.008e+04 | 666.086 | 15.130 | 0.000 | 8771.190 | 1.14e+04 |
| dewPoint | -1.132e+04 | 978.657 | -11.572 | 0.000 | -1.32e+04 | -9404.829 |
| GDP | -0.0115 | 0.001 | -7.964 | 0.000 | -0.014 | -0.009 |
| pressure | 1227.6122 | 397.319 | 3.090 | 0.002 | 448.132 | 2007.093 |
| humidity | 5.43e+05 | 5.52e+04 | 9.830 | 0.000 | 4.35e+05 | 6.51e+05 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 19.038 | Durbin-Watson: | 0.362 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 21.358 |
| Skew: | 0.241 | Prob(JB): | 2.3e-05 |
| Kurtosis: | 3.415 | Cond. No. | 6.4e+09 |


**Interpretation:** After removing the variables CA_CustomerCount, uvIndex, and CAPOP, the adjusted R-squared of our model stayed essentially the same at 0.742. The F-statistic increased substantially, from 331.5 to 455.4, because nearly the same fit is now achieved with three fewer predictors. Looking at the scatter plot matrix, it appears that apparentTemperatureMax and apparentTemperatureMin could have more of an exponential than a linear relationship with our dependent variable.
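The relationship between R-squared, adjusted R-squared, and the F-statistic can be checked by hand. The sketch below uses the standard OLS formulas with the R-squared, observation count, and model degrees of freedom reported in the two summaries so far. The function names are illustrative, and because the inputs are the rounded R-squared values from the tables, the results only approximately match the reported statistics.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_statistic(r2, n_obs, n_predictors):
    """Overall F statistic of an OLS model with an intercept."""
    explained_per_predictor = r2 / n_predictors
    unexplained_per_residual_df = (1.0 - r2) / (n_obs - n_predictors - 1)
    return explained_per_predictor / unexplained_per_residual_df


N_OBS = 1267  # "No. Observations" in both summaries

# (R-squared, Df Model) as reported for the first and second models.
adj_full = adjusted_r2(0.744, N_OBS, 11)
f_full = f_statistic(0.744, N_OBS, 11)
adj_reduced = adjusted_r2(0.743, N_OBS, 8)
f_reduced = f_statistic(0.743, N_OBS, 8)

print(f"11 predictors: adj R2 = {adj_full:.3f}, F = {f_full:.1f}")
print(f" 8 predictors: adj R2 = {adj_reduced:.3f}, F = {f_reduced:.1f}")
```

Since R-squared barely changes while the number of predictors in the numerator's denominator falls from 11 to 8, the F-statistic rises even though the fit itself is no better.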

In [83]:
ols3 = sm.ols(formula='CAISO_HourlyLoad ~ isWeekend + DayofYear + I(apparentTemperatureMin ** 3) + I(apparentTemperatureMax ** 3) + dewPoint + GDP + humidity',
              data=dfX).fit()

ols3.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | CAISO_HourlyLoad | R-squared: | 0.887 |
| Model: | OLS | Adj. R-squared: | 0.886 |
| Method: | Least Squares | F-statistic: | 1411.0 |
| Date: | Fri, 19 Apr 2019 | Prob (F-statistic): | 0.0 |
| Time: | 09:08:30 | Log-Likelihood: | -14810.0 |
| No. Observations: | 1267 | AIC: | 29640.0 |
| Df Residuals: | 1259 | BIC: | 29680.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.231e+05 | 2.18e+04 | 28.553 | 0.000 | 5.8e+05 | 6.66e+05 |
| isWeekend | -5.376e+04 | 1802.382 | -29.825 | 0.000 | -5.73e+04 | -5.02e+04 |
| DayofYear | 56.5961 | 8.116 | 6.973 | 0.000 | 40.674 | 72.518 |
| `I(apparentTemperatureMin ** 3)` | 0.8718 | 0.035 | 24.813 | 0.000 | 0.803 | 0.941 |
| `I(apparentTemperatureMax ** 3)` | 0.3500 | 0.014 | 24.838 | 0.000 | 0.322 | 0.378 |
| dewPoint | -9705.8320 | 335.427 | -28.936 | 0.000 | -1.04e+04 | -9047.774 |
| GDP | -0.0084 | 0.001 | -8.809 | 0.000 | -0.010 | -0.007 |
| humidity | 5.169e+05 | 1.94e+04 | 26.613 | 0.000 | 4.79e+05 | 5.55e+05 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 94.473 | Durbin-Watson: | 0.701 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 267.805 |
| Skew: | -0.375 | Prob(JB): | 7.03e-59 |
| Kurtosis: | 5.124 | Cond. No. | 5.68e+08 |


**Interpretation:** By cubing both apparentTemperatureMax and apparentTemperatureMin (this model also drops pressure), we see an increase in the adjusted R-squared of our model from 0.742 to 0.886. This means that cubing the temperature variables improved the overall fit of our regression model. We also see a significant increase in the F-statistic, from 455.4 to 1411, which indicates that the overall statistical significance of the model improved. Now that we have a linear model that appears to fit our data well and is highly statistically significant, we need to test how well it is able to predict actual load.
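As a small illustration of why a cubed predictor can raise R-squared, the sketch below fits a one-predictor least-squares line twice, once on a raw temperature and once on its cube. The data are synthetic and made up for this example (they are not drawn from dfX), and the names `r_squared`, `temps`, and `load` are hypothetical. It is a toy version of the idea, not a reproduction of the multi-variable model above.

```python
import math


def r_squared(x, y):
    """R-squared of a one-predictor least-squares fit y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # For simple regression, R-squared equals the squared correlation.
    return (sxy ** 2) / (sxx * syy)


# Synthetic data: the response grows with the cube of temperature, plus a
# small deterministic wobble standing in for noise.
temps = [40 + i for i in range(61)]  # 40 through 100
load = [1000 + 0.8 * t ** 3 + 2000 * math.sin(i) for i, t in enumerate(temps)]

r2_linear = r_squared(temps, load)
r2_cubic = r_squared([t ** 3 for t in temps], load)

print(f"R2 with temperature:        {r2_linear:.4f}")
print(f"R2 with temperature cubed:  {r2_cubic:.4f}")
```

Because the synthetic response is built from the cubed temperature, the cubed predictor captures the curvature that a straight line in the raw temperature misses. In a patsy formula, the same transformation is written as `I(x ** 3)`, as in the cell above.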