Statsmodels is a Python module that provides classes and functions for estimating statistical models (such as ordinary, weighted, and generalized least squares), conducting statistical tests, and exploring data.  I will be using this module to analyze the historical data set of Szeged's weather from 2006 to 2016 (https://www.kaggle.com/budincsevity/szeged-weather), which consists of 96,453 observations and 12 potential weather variables.

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 200)

In [None]:
df=pd.read_csv('../input/szeged-weather/weatherHistory.csv')
df.head()

To prepare my data for analysis, I decided that the time, Summary, Loud Cover, and Daily Summary columns were irrelevant to how I wanted to analyze my data. First, the time was in a format that Python did not recognize, and I did not want to spend additional time cropping the date and converting it into an interpretable integer so that I could add it to the model.  Second, the Summary and Daily Summary columns consist of strings that do not convert to dummy variables well, as they have many categories (partly cloudy, mostly cloudy, overcast, foggy, clear, etc.).  Finally, the Loud Cover column is made up entirely of 0's and is therefore irrelevant to our data analysis.

Next, Precip Type had three values: rain, snow, or null. I converted this variable into two dummy variables (Precip Type_rain, Precip Type_nan) such that (1, 0) indicates rain, (0, 0) indicates snow, and (0, 1) indicates null (or clear weather). My final data set looked like:

In [None]:
#dropping Date, Summary, loud cover, and daily summary
df.drop(['Formatted Date','Summary','Loud Cover','Daily Summary'], axis=1,inplace=True)

#moving Apparent Temp to the first column (our "y") and adding dummy variables for precip type
first_column=df.pop('Apparent Temperature (C)')
df.insert(0,'Apparent Temperature (C)',first_column)
#dtype=float keeps the dummies numeric (newer pandas versions default to bool)
df=pd.get_dummies(df, columns=['Precip Type'],dummy_na=True,dtype=float)

#Precip Type_nan is 1 when it is neither rainy nor snowy; dropping the snow column makes snow the baseline
df.drop('Precip Type_snow', axis=1,inplace=True)
df.head(10)
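
As a small illustration of the dummy coding described above, here is a minimal sketch on a toy column. The four-row `toy` frame is made up for this example and is not part of the Szeged data; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Toy column with the three cases described above: rain, snow, and missing (null).
toy = pd.DataFrame({"Precip Type": ["rain", "snow", None, "rain"]})

# dummy_na=True adds an indicator column for missing values;
# dtype=float keeps the indicators numeric.
dummies = pd.get_dummies(toy, columns=["Precip Type"], dummy_na=True, dtype=float)

# Dropping the snow column makes snow the baseline (both indicators are 0).
dummies = dummies.drop(columns="Precip Type_snow")
print(dummies)
```

Each row is then described by the pair (Precip Type_rain, Precip Type_nan), so a snowy hour is the one where both indicators are zero.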

In [None]:
#creating X-Matrix
X=df.copy()
X=X.iloc[:,1:]
X.head()

In [None]:
#creating y-vector
y=df.copy()
y=y.iloc[:,0]
y

In [None]:
#inserting our intercept variable (note: SKLearn does this automatically, but not statsmodels)
#a scalar is broadcast to every row, so the number of observations is not hard-coded
X.insert(0,'intercept',1.0)
X.head()

Using Python's SKLearn module, I then split my data set into two halves of (nearly) equal size: a train set and a test set. I use the train set to estimate the parameters and the test set to determine how well the model performs on data it has not seen.

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.5)
X_train=X_train.reset_index(drop=True)
y_train=y_train.reset_index(drop=True)
X_train.head()

My first model includes all remaining variables: Temperature, Humidity, Wind Speed, Wind Bearing, Visibility, Pressure, and the dummy variables for rain and null (snow is the baseline category).

In [None]:
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy import stats
from matplotlib import pyplot as plt

In [None]:
model=sm.OLS(y_train,X_train)
results=model.fit()
params=results.params
params=pd.DataFrame(params)
params

In [None]:
results.summary()

In [None]:
residuals = results.resid
fig, ax = plt.subplots(figsize=(10,8))
fig=sm.qqplot(residuals,color='k', ax=ax)
ax.set_title('Normal Q-Q')
ax.set_ylabel('Standardized Residuals')
ax.set_xlabel('Theoretical Quantiles')
plt.show()

From our probability plot we see flattening at the extremes, which is a pattern typical of samples from a distribution with heavier tails than the normal. This suggests that the true errors (the differences between the observed values and the true, unobserved regression values) may not be normally distributed.

In [None]:
fitted = results.fittedvalues
fig, ax = plt.subplots(figsize=(10,8))
ax.scatter(fitted, residuals, edgecolors = 'k', facecolors = 'none')
ax.set_ylabel('Residuals')
ax.set_xlabel('Fitted Values')
ax.set_title('Residuals vs. Fitted')
ax.plot([min(fitted),max(fitted)],[0,0],color = 'k',linestyle = ':', alpha = .3)
plt.show()

A curved residual plot (such as one with a U-shape or an inverted U-shape) indicates nonlinearity.  This could mean that other regressors are needed in the model, or that transforming the regressors and/or the response could help. However, before transforming the model, I would like to note that the condition number in our summary table is large ($1.47\times10^{4}$), indicating that there may be strong multicollinearity or other numerical problems in our data. Multicollinearity means that one of the predictors is nearly a linear combination of some of the others (an exact linear combination is the extreme case).  So at this point we could either transform the model or reduce the number of parameters to remove the multicollinearity, using a form of backward elimination.
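
To make the idea of a near-linear combination concrete, here is a minimal sketch using variance inflation factors (VIFs). VIFs are a complementary diagnostic that this notebook does not otherwise use; the synthetic columns `x1`, `x2`, `x3` and the helper `vif` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# x3 is nearly (not exactly) a linear combination of x1 and x2.
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2, x3])

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

vifs_collinear = vif(X)          # all three regressors, including the near-duplicate
vifs_independent = vif(X[:, :2])  # only the two unrelated regressors
print(vifs_collinear, vifs_independent)
```

A VIF close to 1 means a regressor is essentially unrelated to the others, while very large values point to the kind of near-dependence described above.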

Backward elimination begins with a model that includes all candidate regressors.  Then the partial F statistic (or equivalently, a t statistic) is computed for each regressor as if it were the last variable to enter the model.  The smallest of these partial F (or |t|) statistics is compared with a preselected value, and if it is less than that value, the corresponding regressor is removed from the model.  The procedure is then repeated on the reduced model until no remaining statistic falls below the preselected value.
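
The procedure above can be sketched in a few lines. This is only an illustration on synthetic data, not the selection I ran on the weather model; the helper names `ols_t_stats` and `backward_eliminate`, the cutoff `t_out`, and the generated columns are all assumptions made for the example.

```python
import numpy as np

def ols_t_stats(X, y):
    """Return OLS coefficients and their t statistics for design matrix X."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    ms_res = resid @ resid / (n - p)          # residual mean square
    se = np.sqrt(ms_res * np.diag(XtX_inv))   # coefficient standard errors
    return beta, beta / se

def backward_eliminate(X, y, names, t_out=2.0, keep=("intercept",)):
    """Repeatedly drop the regressor with the smallest |t| while it is below t_out."""
    names = list(names)
    while True:
        _, t = ols_t_stats(X, y)
        # Protected columns (such as the intercept) are never removal candidates.
        candidates = [i for i, nm in enumerate(names) if nm not in keep]
        if not candidates:
            return names
        worst = min(candidates, key=lambda i: abs(t[i]))
        if abs(t[worst]) >= t_out:
            return names
        X = np.delete(X, worst, axis=1)
        del names[worst]

rng = np.random.default_rng(1)
n = 1000
x1, x2, noise = rng.normal(size=(3, n))
# y depends on x1 and x2 only; the "noise" column is unrelated to y.
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2, noise])
kept = backward_eliminate(X, y, ["intercept", "x1", "x2", "noise"], t_out=3.0)
print(kept)
```

Comparing |t| with a cutoff is equivalent to comparing the partial F statistic with the square of that cutoff, since the partial F for a single regressor equals t squared.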

In [None]:
plt.rc("figure", figsize=(20,10))
plt.rc("font", size=14)
fig = sm.graphics.plot_partregress_grid(results)

After obtaining the least-squares fit, three questions come to mind:
*     How well does this equation fit the data?
*     Is the model likely to be useful as a predictor?
*     Are any of the basic assumptions (such as constant variance and uncorrelated errors) violated, and if so, how serious is this?

From our summary data, our $R^2$ and $R^2_{Adj}$ indicate that our equation fits the data extremely well. To determine how likely the model will be useful as a predictor we need to calculate our $R^2_{prediction}$.  The PRESS statistic can be used to compute an $R^2$-like statistic for prediction.  This statistic gives some indication of the predictive capability of the regression model:
\begin{equation*}
    \begin{split}
        PRESS&=\sum_{i=1}^n\left(\frac{e_i}{1-h_{ii}}\right)^2\\
        R^2_{prediction}&=1-\frac{PRESS}{SS_T}
    \end{split}
\end{equation*}
For the weather data model, we find
\begin{equation*}
    \begin{split}
        R^2_{prediction}&=1-\frac{55158.4212}{5544530.877}\\
        &=0.9901
    \end{split}
\end{equation*}
Therefore, we could expect this model to "explain" about 99.01\% of the variability in predicting new observations.
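
The PRESS formula is a shortcut: it gives the same number as refitting the model $n$ times, each time leaving one observation out and predicting it. The sketch below checks this on a small synthetic data set (the data and variable names are made up for the illustration; they are not the weather data).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=n)

# Shortcut: ordinary residuals e_i and leverages h_ii from a single fit.
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
e = y - H @ y
press_shortcut = np.sum((e / (1 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one observation out.
press_loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    press_loo += (y[i] - X[i] @ beta_i) ** 2

ss_t = np.sum((y - y.mean()) ** 2)        # total sum of squares about the mean
r2_prediction = 1 - press_shortcut / ss_t
print(press_shortcut, press_loo, r2_prediction)
```

The two PRESS values agree up to floating-point error, which is why a single fit plus the hat-matrix diagonal is enough even for a training set of roughly 48,000 rows.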

In [None]:
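n=int(results.nobs)  #number of training observations (48226 for this split)
#SS_T: total sum of squares about the mean (mse_total*n would be off by a factor of n/(n-1))
results.centered_tss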

In [None]:
from statsmodels.stats.outliers_influence import OLSInfluence

#This number may vary from the value quoted in the markdown because the train/test split is random
#calculating PRESS: each residual is divided by (1 - leverage) and squared
infl=results.get_influence()
diag=infl.hat_matrix_diag
PRESS=(residuals.to_numpy()/(1-diag))**2
PRESS.sum()

In [None]:
#calculating R^2 prediction (centered_tss is SS_T)
n=int(results.nobs)
Rpred=1-(PRESS.sum()/results.centered_tss)
Rpred

Let's see if we can make our model better. I removed the dummy variables and the pressure because their t-statistics were smaller in magnitude than those of the other parameters.

In [None]:
X_train2=X_train.copy()
X_train2.drop(['Precip Type_nan','Precip Type_rain','Pressure (millibars)'], axis=1,inplace=True)
X_train2.head()

In [None]:
model2=sm.OLS(y_train,X_train2)
results2=model2.fit()
results2.summary()

Once again we see strong multicollinearity; however, I wanted to see how the PRESS statistic and $R^2_{prediction}$ changed.
\begin{equation*}
    \begin{split}
        PRESS&=55807.794\\
        R^2_{prediction}&=1-\frac{55807.794}{5544530.877}\\
        &=0.9899 \quad (\textrm{a decrease of } 0.0001171 \textrm{ from Model 1})
    \end{split}
\end{equation*}

In [None]:
residuals2 = results2.resid
infl2=results2.get_influence()
diag2=infl2.hat_matrix_diag
#squared leave-one-out residuals: residual divided by (1 - leverage), then squared
PRESS2=(residuals2.to_numpy()/(1-diag2))**2
PRESS2.sum()

In [None]:
n=int(results2.nobs)
Rpred2=1-(PRESS2.sum()/results2.centered_tss)
Rpred2

Let's remove a few more regressors.

In [None]:
X_train3=X_train2.copy()
X_train3.drop(['Visibility (km)','Wind Bearing (degrees)'], axis=1,inplace=True)
X_train3.head()

In [None]:
model3=sm.OLS(y_train,X_train3)
results3=model3.fit()
results3.summary()

In [None]:
residuals3 = results3.resid
fig, ax = plt.subplots(figsize=(10,8))
fig=sm.qqplot(residuals3,color='k', ax=ax)
ax.set_title('Normal Q-Q')
ax.set_ylabel('Standardized Residuals')
ax.set_xlabel('Theoretical Quantiles')
plt.show()

In [None]:
fitted3 = results3.fittedvalues
fig, ax = plt.subplots(figsize=(10,8))
ax.scatter(fitted3, residuals3, edgecolors = 'k', facecolors = 'none')
ax.set_ylabel('Residuals')
ax.set_xlabel('Fitted Values')
ax.set_title('Residuals vs. Fitted')
ax.plot([min(fitted3),max(fitted3)],[0,0],color = 'k',linestyle = ':', alpha = .3)
plt.show()

In [None]:
infl3=results3.get_influence()
diag3=infl3.hat_matrix_diag
#squared leave-one-out residuals: residual divided by (1 - leverage), then squared
PRESS3=(residuals3.to_numpy()/(1-diag3))**2
PRESS3.sum()

In [None]:
n=int(results3.nobs)
Rpred3=1-(PRESS3.sum()/results3.centered_tss)
Rpred3

### PRESS Comparison

In [None]:
print(PRESS.sum(), PRESS2.sum(),PRESS3.sum())

### $R^2_{prediction}$ comparison

In [None]:
print(Rpred,Rpred2,Rpred3)

### $MS_{Res}$ comparison

In [None]:
ssr=results.ssr
ssr2=results2.ssr
ssr3=results3.ssr
msres=ssr/(n-9)
msres2=ssr2/(n-6)
msres3=ssr3/(n-4)
print(msres,msres2,msres3)

When determining the best model, we would like to choose one that maximizes $R^2$ and $R^2_{prediction}$ while minimizing $MS_{Res}$ and PRESS. Taking into consideration that our first two models had high multicollinearity, choosing model 3 is acceptable: its $R^2$, $R^2_{prediction}$, $MS_{Res}$, and PRESS values are not much different from those of our first model (which has the best values). Therefore our final model is model 3, with the following parameters:

In [None]:
results3.params

# How well does it predict new values?

In [None]:
#each column of x holds one term of the fitted equation (coefficient * regressor) for the test set
#columns are selected by name so their order always matches results3.params
beta3=results3.params
x=X_test[beta3.index].to_numpy(dtype=float)*beta3.to_numpy()

In [None]:
#reshape((-1,1)) infers the number of test observations instead of hard-coding it
yhat=x.sum(axis=1).reshape((-1,1))
y=y_test.to_numpy().reshape((-1,1))
resid=np.subtract(y,yhat)

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
ax.scatter(yhat, resid, edgecolors = 'k', facecolors = 'none')
ax.set_ylabel('Residuals')
ax.set_xlabel('Predicted Values')
ax.set_title('Residuals vs. Predicted (test set)')
ax.plot([yhat.min(),yhat.max()],[0,0],color = 'k',linestyle = ':', alpha = .3)
plt.show()

In [None]:
#calculating SS_Res
np.matmul(resid.T,resid)

We see that our $SS_{Res}$ on the test set is not very different from the values we determined for the previous models, so I would conclude we have a pretty good model.
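
Because $SS_{Res}$ is a sum over observations, it can help to also look at scale-free summaries on the test set, such as an out-of-sample $R^2$ and the root mean squared error. The sketch below shows one way to compute them; it uses synthetic data and made-up names (`r2_test`, `rmse_test`), so it is an illustration rather than a result for the weather model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit on the first half, evaluate on the second half.
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

resid_te = y_te - X_te @ beta
ss_res_te = resid_te @ resid_te                    # SS_Res on the test set
ss_t_te = np.sum((y_te - y_te.mean()) ** 2)        # SS_T on the test set
r2_test = 1 - ss_res_te / ss_t_te                  # out-of-sample R^2
rmse_test = np.sqrt(ss_res_te / len(y_te))         # typical prediction error, in units of y
print(r2_test, rmse_test)
```

An out-of-sample $R^2$ close to the training $R^2_{prediction}$ would support the same conclusion as the $SS_{Res}$ comparison above.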

Let me know what you think

-Sonja