# 18-5-1-DRILL-Weather-Evaluating performance

## Evaluating performance

### Weather model

* For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the weatherinszeged table from Thinkful's database.

* As in the previous checkpoint, build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature.

* As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS.

* What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?

* Next, add the interaction of humidity and windspeed to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?

* Add visibility as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with the improvement contributed by visibility. Which one is more useful?

* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
import statsmodels.formula.api as smf
from sqlalchemy import create_engine

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

In [5]:
#postgres_user = 'dsbc_student'
#postgres_pw = '7*.8G9QH21'
#postgres_host = '142.93.121.174'
#postgres_port = '5432'
#postgres_db = 'weatherinszeged'

### Load the data 

In [12]:
#engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    #postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
#weather = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
#engine.dispose()

weather = pd.read_csv('weatherHistory.csv')

#weather.head(10)


In [1]:
# weather.info()

In [4]:
weather.rename(columns={
    'Apparent Temperature (C)': 'apparenttemperature',
    'Humidity': 'humidity',
    'Wind Speed (km/h)': 'windspeed',
    'Wind Bearing (degrees)': 'windbearing',
    'Pressure (millibars)': 'pressure',
    'Temperature (C)': 'temperature',
}, inplace=True)

In [13]:
# Check for missing data: percentage of missing values in each column.
# Rows containing missing data should be removed if any exist.

weather.isnull().sum()*100/weather.isnull().count()

### Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. 

In [6]:
# X is the feature set: humidity and windspeed are the explanatory variables.
X = weather[['humidity','windspeed']] 

# Y is the target variable: the difference between
# apparenttemperature and temperature.
Y = weather['apparenttemperature'].values - weather['temperature'].values

### Estimate your model using OLS.


In [7]:
import statsmodels.api as sm

# sm.OLS does not add an intercept automatically,
# so we add a constant column to X ourselves.
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Fri, 10 Jan 2020 | Prob (F-statistic): | 0.0 |
| Time: | 01:50:45 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.4381 | 0.021 | 115.948 | 0.000 | 2.397 | 2.479 |
| humidity | -3.0292 | 0.024 | -126.479 | 0.000 | -3.076 | -2.982 |
| windspeed | -0.1193 | 0.001 | -176.164 | 0.000 | -0.121 | -0.118 |

| | | | |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.264 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | -0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


### What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?

- R-squared is probably the most common measure of goodness of fit in a linear regression model. It is a proportion (between 0 and 1) that expresses how much variance in the outcome variable is explained by the explanatory variables in the model.

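- Both R-squared and adjusted R-squared are 0.288. The model explains 28.8% of the variance in the target (the difference between apparenttemperature and temperature) and leaves 71.2% (100 - 28.8) unexplained. The two values coincide because, with 96,453 observations and only two explanatory variables, the adjustment for the number of regressors is negligible. This is not a satisfactory fit.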
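
To see why the two values coincide, here is a minimal sketch of the standard adjusted R-squared formula applied to the numbers reported in the summary above. The function name `adjusted_r_squared` is a hypothetical helper for illustration, not part of statsmodels.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the summary: R-squared 0.288, 96453 observations,
# 2 explanatory variables (the constant is not counted as a predictor).
adj = adjusted_r_squared(0.288, 96453, 2)
print(round(adj, 3))
```

With this many observations the ratio (n - 1) / (n - p - 1) is almost exactly 1, so the penalty only becomes visible far beyond the third decimal place.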

### Next, add the interaction of humidity and windspeed to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?

In [8]:
# Add the interaction of humidity and windspeed as a new column,
# then estimate the model using OLS.
weather['humidity_windspeed_interaction'] = weather.humidity * weather.windspeed

# Y is the target variable
Y = weather['apparenttemperature'] - weather['temperature']
# X is the feature set
X = weather[['humidity','windspeed', 'humidity_windspeed_interaction']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Fri, 10 Jan 2020 | Prob (F-statistic): | 0.0 |
| Time: | 01:50:52 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0839 | 0.033 | 2.511 | 0.012 | 0.018 | 0.149 |
| humidity | 0.1775 | 0.043 | 4.133 | 0.000 | 0.093 | 0.262 |
| windspeed | 0.0905 | 0.002 | 36.797 | 0.000 | 0.086 | 0.095 |
| humidity_windspeed_interaction | -0.2971 | 0.003 | -88.470 | 0.000 | -0.304 | -0.291 |

| | | | |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.262 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | -0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


### Now, what is the R-squared of this model? Does this model improve upon the previous one?
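- R-squared is 0.341, so the model explains 34.1% of the variance in the target, up from 28.8% in the first model. The interaction term therefore improves the model, although about 65.9% of the variance is still unexplained.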
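
As an alternative to building the interaction column by hand, the statsmodels formula API can expand it from the formula string. The sketch below runs on synthetic stand-in data, not the Szeged dataset; the `demo` frame, its coefficients, and the `temp_diff` column name are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption, not the Szeged weather data).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    'humidity': rng.uniform(0, 1, n),
    'windspeed': rng.uniform(0, 30, n),
})
# Hypothetical target that contains a genuine interaction effect plus noise.
demo['temp_diff'] = (-0.3 * demo['humidity'] * demo['windspeed']
                     + rng.normal(0, 1, n))

# 'a + b' fits main effects only; 'a * b' expands to a + b + a:b,
# so the interaction term is created by the formula itself.
base = smf.ols('temp_diff ~ humidity + windspeed', data=demo).fit()
inter = smf.ols('temp_diff ~ humidity * windspeed', data=demo).fit()

print(inter.params.index.tolist())
print(base.rsquared, inter.rsquared)
```

The formula API also adds the intercept automatically, so no call to `sm.add_constant` is needed in this form.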

### Add visibility as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with the improvement contributed by visibility. Which one is more useful?

In [9]:
weather.rename(columns={'Visibility (km)':'visibility'}, inplace=True)

In [11]:

# Y is the target variable
Y = weather['apparenttemperature'] - weather['temperature']
# X is the feature set
X = weather[['humidity','windspeed', 'visibility']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.304 |
| Model: | OLS | Adj. R-squared: | 0.303 |
| Method: | Least Squares | F-statistic: | 14010.0 |
| Date: | Fri, 10 Jan 2020 | Prob (F-statistic): | 0.0 |
| Time: | 01:55:28 | Log-Likelihood: | -169380.0 |
| No. Observations: | 96453 | AIC: | 338800.0 |
| Df Residuals: | 96449 | BIC: | 338800.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.5756 | 0.028 | 56.605 | 0.000 | 1.521 | 1.630 |
| humidity | -2.6066 | 0.025 | -102.784 | 0.000 | -2.656 | -2.557 |
| windspeed | -0.1199 | 0.001 | -179.014 | 0.000 | -0.121 | -0.119 |
| visibility | 0.0540 | 0.001 | 46.614 | 0.000 | 0.052 | 0.056 |

| | | | |
|---|---|---|---|
| Omnibus: | 3833.895 | Durbin-Watson: | 0.279 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4584.022 |
| Skew: | -0.459 | Prob(JB): | 0.0 |
| Kurtosis: | 3.545 | Cond. No. | 131.0 |
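
The adjusted R-squared values reported in the three summaries (0.288 for the first model, 0.341 with the interaction term, 0.303 with visibility) can be compared directly. The short sketch below only does that arithmetic; the dictionary keys are labels chosen here for illustration.

```python
# Adjusted R-squared values copied from the three OLS summaries.
baseline_adj_r2 = 0.288
adj_r2 = {'interaction': 0.341, 'visibility': 0.303}

# Improvement of each extended model over the first model.
gains = {name: round(value - baseline_adj_r2, 3) for name, value in adj_r2.items()}
more_useful = max(gains, key=gains.get)

print(gains)
print(more_useful)
```

Both additions raise R-squared and adjusted R-squared, but the interaction term adds 0.053 to adjusted R-squared while visibility adds only 0.015, so the interaction term is the more useful of the two.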


### Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor

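The three models are compared on the following criteria:

- F-test: tests whether the explanatory variables are jointly significant. All three models pass (Prob (F-statistic) = 0.0), so the F statistic does not separate them.
- R-squared: higher R^2 is better. Adjusted R-squared is the fairer measure when the models have different numbers of explanatory variables.
- AIC and BIC: for both AIC and BIC, the lower the value the better.

Among these three models, we go with the one that has the lowest AIC and BIC.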

Model 1: humidity and windspeed
---------
- F-statistic: 1.949e+04 = 19490
- R-squared: 0.288
- AIC and BIC: 3.409e+05 = 340900


Model 2: interaction
---------
- F-statistic: 1.666e+04 = 16660
- R-squared: 0.341
- AIC and BIC: 3.334e+05 = 333400

R-squared rises from 0.288 to 0.341 when the interaction term is added, but only to 0.304 when visibility is added instead (model 3). The interaction term therefore gives us more information than the visibility feature.


Model 3: visibility
---------
- F-statistic: 1.401e+04 = 14010
- R-squared: 0.304
- AIC and BIC: 3.388e+05 = 338800

Model 2 is the best since it has the lowest AIC and BIC (333400, compared with 338800 for model 3 and 340900 for model 1).
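
As a cross-check, AIC and BIC can be recomputed from the log-likelihoods shown in the summaries using the usual definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k counts the estimated coefficients including the constant. This is a rough sketch: the log-likelihoods are the rounded values from the summary output, so the results only approximate the reported scores, and `aic_bic` is a hypothetical helper name.

```python
import math

def aic_bic(log_likelihood, n_obs, n_params):
    """Return (AIC, BIC) for a model with n_params estimated coefficients."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic

n_obs = 96453
# (rounded log-likelihood from the summary, coefficients including the constant)
models = {
    'model 1 (humidity + windspeed)': (-170460.0, 3),
    'model 2 (+ interaction)': (-166690.0, 4),
    'model 3 (+ visibility)': (-169380.0, 4),
}

scores = {name: aic_bic(ll, n_obs, k) for name, (ll, k) in models.items()}
for name, (aic, bic) in scores.items():
    print(f'{name}: AIC ~ {aic:.0f}, BIC ~ {bic:.0f}')

# Lower is better for both criteria.
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, best_by_bic)
```

Because all three models are fitted to the same 96453 observations, the ranking is driven almost entirely by the log-likelihood; the penalty terms differ by only a few units.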