# Machine Learning Project - Weather Prediction


### Problem Statement

Given a dataset containing temperature data for 4 years at different weather stations, build a model that predicts temperature using an appropriate algorithm, and evaluate it using suitable model evaluation techniques.

### Data Loading and Description

#### Dataset Description

The data describes weather patterns observed at 159 different weather stations in the US during the period 1940 to 1945. It contains date-wise information on minimum and maximum temperature, precipitation, snowfall, etc. The goal is to predict the mean temperature as accurately as possible.

_Source_ : https://raw.githubusercontent.com/insaid2018/Term-2/master/Projects/Summary%20of%20Weather.csv

#### Description of features

|Feature    |	Description                                               |
|-----------| ------------                                                | 	
|STA	| Weather Station |
|Date	|  Self-explanatory |
|Precip	| Precipitation in mm |
|WindGustSpd	| Peak wind gust speed in km/h |
|MaxTemp	| Maximum temperature in degrees Celsius |
|MinTemp	| Minimum temperature in degrees Celsius |
|MeanTemp	| Mean temperature in degrees Celsius |
|Snowfall	| Snowfall and Ice Pellets in mm |
|PoorWeather	| A repeat of the TSHDSBRSGF column |
|YR	| Year of Observation |
|MO	| Month of Observation |
|DA	| Day of Observation |
|PRCP	| Precipitation in Inches and Hundredths |
|DR	| Peak wind gust direction in tens of degrees |
|SPD	| Peak wind gust speed in knots |
|MAX	| Maximum temperature in degrees Fahrenheit |
|MIN	| Minimum temperature in degrees Fahrenheit |
|MEA	| Mean temperature in degrees Fahrenheit |
|SNF	| Snowfall in inches and tenths |
|SND	| Snow depth (includes ice pellets) recorded at 1200 GMT, except 0000 GMT in Far East Asian Area, in inches and tenths |
|FT	| Frozen Ground Top (depth in inches) |
|FB	| Frozen Ground Base (depth in inches) |
|FTI	| Frozen Ground Thickness (thickness in inches) |
|ITH	| Ice Thickness on Water (inches and tenths) |
|PGT	| Peak wind gust time (hours and tenths) |
|TSHDSBRSGF	| Day with: Thunder; Sleet; Hail; Dust or Sand; Smoke or Haze; Blowing Snow; Rain; Snow; Glaze; Fog; 0 = No, 1 = Yes |
|SD3	| Snow depth at 0030 GMT includes ice pellets in inches and tenths |
|RHX	| Hour maximum relative humidity, as a whole percent |
|RHN	| Hour minimum relative humidity, as a whole percent |
|RVG	| River gauge in feet and tenths |
|WTE	| Water equivalent of snow and ice on ground in inches and hundredths |

#### Importing Packages

In [10]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

from sklearn import metrics

import numpy as np

# allow plots to appear directly in the notebook
%matplotlib inline                     

#### Importing the Dataset

In [11]:
data = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-2/master/Projects/Summary%20of%20Weather.csv',low_memory=False)
data.head()

Unnamed: 0,STA,Date,Precip,WindGustSpd,MaxTemp,MinTemp,MeanTemp,Snowfall,PoorWeather,YR,...,FB,FTI,ITH,PGT,TSHDSBRSGF,SD3,RHX,RHN,RVG,WTE
0,10001,1942-7-1,1.016,,25.555556,22.222222,23.888889,0,,42,...,,,,,,,,,,
1,10001,1942-7-2,0.0,,28.888889,21.666667,25.555556,0,,42,...,,,,,,,,,,
2,10001,1942-7-3,2.54,,26.111111,22.222222,24.444444,0,,42,...,,,,,,,,,,
3,10001,1942-7-4,2.54,,26.666667,22.222222,24.444444,0,,42,...,,,,,,,,,,
4,10001,1942-7-5,0.0,,26.666667,21.666667,24.444444,0,,42,...,,,,,,,,,,


### Exploratory Data Analysis

In [7]:
data.shape

(119040, 31)

In [8]:
import pandas_profiling
profile = pandas_profiling.ProfileReport(data)
profile.to_file(outputfile="Preprofiling.html")   

In [223]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119040 entries, 0 to 119039
Data columns (total 31 columns):
STA            119040 non-null int64
Date           119040 non-null object
Precip         119040 non-null object
WindGustSpd    532 non-null float64
MaxTemp        119040 non-null float64
MinTemp        119040 non-null float64
MeanTemp       119040 non-null float64
Snowfall       117877 non-null object
PoorWeather    34237 non-null object
YR             119040 non-null int64
MO             119040 non-null int64
DA             119040 non-null int64
PRCP           117108 non-null object
DR             533 non-null float64
SPD            532 non-null float64
MAX            118566 non-null float64
MIN            118572 non-null float64
MEA            118542 non-null float64
SNF            117877 non-null object
SND            5563 non-null float64
FT             0 non-null float64
FB             0 non-null float64
FTI            0 non-null float64
ITH            0 non-null float64

In [224]:
data.describe(include = 'all')       

Unnamed: 0,STA,Date,Precip,WindGustSpd,MaxTemp,MinTemp,MeanTemp,Snowfall,PoorWeather,YR,...,FB,FTI,ITH,PGT,TSHDSBRSGF,SD3,RHX,RHN,RVG,WTE
count,119040.0,119040,119040.0,532.0,119040.0,119040.0,119040.0,117877.0,34237.0,119040.0,...,0.0,0.0,0.0,525.0,34237.0,0.0,0.0,0.0,0.0,0.0
unique,,2192,540.0,,,,,35.0,38.0,,...,,,,,38.0,,,,,
top,,1945-4-24,0.0,,,,,0.0,1.0,,...,,,,,1.0,,,,,
freq,,122,64267.0,,,,,115690.0,31980.0,,...,,,,,31980.0,,,,,
mean,29659.435795,,,37.774534,27.045111,17.789511,22.411631,,,43.805284,...,,,,12.085333,,,,,,
std,20953.209402,,,10.297808,8.717817,8.334572,8.297982,,,1.136718,...,,,,5.731328,,,,,,
min,10001.0,,,18.52,-33.333333,-38.333333,-35.555556,,,40.0,...,,,,0.0,,,,,,
25%,11801.0,,,29.632,25.555556,15.0,20.555556,,,43.0,...,,,,8.5,,,,,,
50%,22508.0,,,37.04,29.444444,21.111111,25.555556,,,44.0,...,,,,11.6,,,,,,
75%,33501.0,,,43.059,31.666667,23.333333,27.222222,,,45.0,...,,,,15.0,,,,,,


#### Remove columns with missing values

In [12]:
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(40)

Unnamed: 0,Total,Percent
WTE,119040,1.0
ITH,119040,1.0
RVG,119040,1.0
FB,119040,1.0
FTI,119040,1.0
FT,119040,1.0
SD3,119040,1.0
RHX,119040,1.0
RHN,119040,1.0
PGT,118515,0.99559


In [13]:
#Drop columns with >70% missing values
data.drop(['WTE','ITH','RVG','FB','FTI','FT','SD3','RHX','RHN','PGT','WindGustSpd','SPD','DR','SND','TSHDSBRSGF','PoorWeather'], axis = 1,inplace = True)
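
The column list above is typed out by hand. As a minimal sketch, the same selection could be derived from the missing-value fraction itself. The small `demo` frame and its values are made up for illustration; only the column names and the 70% cut-off come from this notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame: column names are borrowed from the dataset,
# the values are invented for illustration.
demo = pd.DataFrame({
    "MaxTemp": [25.6, 28.9, 26.1, 26.7, 26.7],               # 0% missing
    "Snowfall": [0.0, np.nan, 0.0, 0.0, 0.0],                # 20% missing
    "WindGustSpd": [np.nan, np.nan, np.nan, np.nan, 37.0],   # 80% missing
    "FT": [np.nan] * 5,                                      # 100% missing
})

threshold = 0.70  # same cut-off as in the comment above

# isnull().mean() gives the fraction of missing values per column
missing_fraction = demo.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

demo_clean = demo.drop(columns=cols_to_drop)
print(cols_to_drop)
print(demo_clean.columns.tolist())
```

Deriving the list this way keeps the drop step in sync with the threshold if the data or the cut-off changes.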

#### Remove duplicate columns

In [14]:
#MAX, MIN and MEA are temperatures in Fahrenheit. We already have temperatures in Celsius with complete values. Hence dropping these
#SNF is Snowfall in inches and PRCP is Precipitation in inches. We already have snowfall and Precipitation in millimeters in the dataset
data.drop(['MAX','MIN','MEA','SNF','PRCP'], axis = 1,inplace = True)
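
Before dropping the Fahrenheit columns, one could check that they really duplicate the Celsius ones. The sketch below applies the standard conversion C = (F - 32) * 5 / 9. The `MaxTemp` values are the first rows of the preview above; the `MAX` values are hypothetical, chosen so that they convert to those readings.

```python
import numpy as np
import pandas as pd

# MaxTemp: first rows from the preview above.
# MAX: hypothetical Fahrenheit readings used only for this illustration.
demo = pd.DataFrame({
    "MAX": [78.0, 84.0, 79.0],
    "MaxTemp": [25.555556, 28.888889, 26.111111],
})

# Convert Fahrenheit to Celsius and compare with the Celsius column
celsius_from_f = (demo["MAX"] - 32) * 5 / 9
is_duplicate = np.allclose(celsius_from_f, demo["MaxTemp"], atol=1e-5)
print(is_duplicate)
```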

#### Remove inconsistent values

In [15]:
#Precip and snowfall are of object datatype as these contain string values also. convert them to float
precip_garb=data[pd.to_numeric(data['Precip'], errors='coerce').isnull()]['Precip'].unique()
snwf_garb=data[pd.to_numeric(data['Snowfall'], errors='coerce').isnull()]['Snowfall'].unique()
print(precip_garb)
print(snwf_garb)

['T']
[nan '#VALUE!']


In [16]:
data['Precip'] = data['Precip'].replace('T', 0)    # assign back instead of relying on an in-place replace on a column

In [17]:
data['Snowfall'] = data['Snowfall'].replace('#VALUE!', 0)    # assign back instead of relying on an in-place replace on a column

In [18]:
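# Snowfall is still of object dtype here, so convert to numbers before taking the median
median_snowfall = pd.to_numeric(data['Snowfall'], errors='coerce').median()
data['Snowfall'] = data['Snowfall'].fillna(median_snowfall)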

In [19]:
#Convert to numeric field
data[["Precip", "Snowfall"]] = data[["Precip", "Snowfall"]].apply(pd.to_numeric)
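
The cleaning steps above can be sketched end to end on a small made-up sample. The `raw` frame below is an assumption that only mimics the raw columns (numbers stored as strings, plus the `'T'` and `'#VALUE!'` markers found above); it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample mimicking the raw columns
raw = pd.DataFrame({
    "Precip": ["1.016", "0", "T", "2.54"],
    "Snowfall": ["0", "#VALUE!", np.nan, "5.08"],
})

clean = raw.copy()

# Replace the non-numeric markers with "0", then convert to float
clean["Precip"] = pd.to_numeric(clean["Precip"].replace("T", "0"))
clean["Snowfall"] = pd.to_numeric(clean["Snowfall"].replace("#VALUE!", "0"))

# Fill the remaining missing Snowfall values with the column median
clean["Snowfall"] = clean["Snowfall"].fillna(clean["Snowfall"].median())

print(clean.dtypes)
print(clean)
```

Calling `pd.to_numeric` without `errors='coerce'` raises on any leftover non-numeric string, which makes it a useful check that all markers were handled.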

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119040 entries, 0 to 119039
Data columns (total 10 columns):
STA         119040 non-null int64
Date        119040 non-null object
Precip      119040 non-null float64
MaxTemp     119040 non-null float64
MinTemp     119040 non-null float64
MeanTemp    119040 non-null float64
Snowfall    119040 non-null float64
YR          119040 non-null int64
MO          119040 non-null int64
DA          119040 non-null int64
dtypes: float64(5), int64(4), object(1)
memory usage: 9.1+ MB


In [234]:
data.describe(include = 'all')

Unnamed: 0,STA,Date,Precip,MaxTemp,MinTemp,MeanTemp,Snowfall,YR,MO,DA
count,119040.0,119040,119040.0,119040.0,119040.0,119040.0,119040.0,119040.0,119040.0,119040.0
unique,,2192,,,,,,,,
top,,1945-4-24,,,,,,,,
freq,,122,,,,,,,,
mean,29659.435795,,3.225612,27.045111,17.789511,22.411631,0.243054,43.805284,6.726016,15.79753
std,20953.209402,,10.801044,8.717817,8.334572,8.297982,2.613366,1.136718,3.425561,8.794541
min,10001.0,,0.0,-33.333333,-38.333333,-35.555556,0.0,40.0,1.0,1.0
25%,11801.0,,0.0,25.555556,15.0,20.555556,0.0,43.0,4.0,8.0
50%,22508.0,,0.0,29.444444,21.111111,25.555556,0.0,44.0,7.0,16.0
75%,33501.0,,0.762,31.666667,23.333333,27.222222,0.0,45.0,10.0,23.0


In [21]:
#Some records have Max temp less than Min temp. These are erroneous records which are removed
data=data[data['MaxTemp']>=data['MinTemp']].copy()    # copy so that later column assignments do not act on a slice

In [22]:
#Correcting values for mean temp
data['MeanTemp']=data[['MaxTemp','MinTemp']].mean(axis=1)

In [23]:
data = data.reset_index(drop=True)

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119008 entries, 0 to 119007
Data columns (total 10 columns):
STA         119008 non-null int64
Date        119008 non-null object
Precip      119008 non-null float64
MaxTemp     119008 non-null float64
MinTemp     119008 non-null float64
MeanTemp    119008 non-null float64
Snowfall    119008 non-null float64
YR          119008 non-null int64
MO          119008 non-null int64
DA          119008 non-null int64
dtypes: float64(5), int64(4), object(1)
memory usage: 9.1+ MB


### Model Selection

We need to build a machine learning model that can predict the temperature at any given station on any particular date. We will try different algorithms with different sets of features to identify the most accurate one.

### Linear Regression

We will first use linear regression, a supervised learning algorithm that is trained on labelled data and predicts a continuous value.

In [25]:
data_modify = data.drop(columns=['Date'])    # Date is a string column and cannot be scaled
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data_modify)
data1 = scaler.transform(data_modify)

In [26]:
data1 = pd.DataFrame(data1)
data1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,-0.938145,-0.204624,-0.172796,0.53175,0.177509,-0.093002,-1.588544,0.080055,-1.68255
1,-0.938145,-0.298678,0.210838,0.465098,0.345745,-0.093002,-1.588544,0.080055,-1.568844
2,-0.938145,-0.063543,-0.108857,0.53175,0.211156,-0.093002,-1.588544,0.080055,-1.455137
3,-0.938145,-0.063543,-0.044918,0.53175,0.244803,-0.093002,-1.588544,0.080055,-1.34143
4,-0.938145,-0.298678,-0.044918,0.465098,0.211156,-0.093002,-1.588544,0.080055,-1.227723


In [27]:
data1.columns = ['STA','Precip','MaxTemp','MinTemp','MeanTemp','Snowfall','YR','MO','DA']
data1.head()

Unnamed: 0,STA,Precip,MaxTemp,MinTemp,MeanTemp,Snowfall,YR,MO,DA
0,-0.938145,-0.204624,-0.172796,0.53175,0.177509,-0.093002,-1.588544,0.080055,-1.68255
1,-0.938145,-0.298678,0.210838,0.465098,0.345745,-0.093002,-1.588544,0.080055,-1.568844
2,-0.938145,-0.063543,-0.108857,0.53175,0.211156,-0.093002,-1.588544,0.080055,-1.455137
3,-0.938145,-0.063543,-0.044918,0.53175,0.244803,-0.093002,-1.588544,0.080055,-1.34143
4,-0.938145,-0.298678,-0.044918,0.465098,0.211156,-0.093002,-1.588544,0.080055,-1.227723
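
The scaler above is fit on the full dataset, so rows that later land in the test split also contribute to the mean and standard deviation. A common alternative, shown here as a sketch on synthetic data rather than as this notebook's method, is to fit the scaler on the training split only and reuse it for the test split. The arrays below are random stand-ins for the YR, MO and DA columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the YR, MO and DA columns
X_demo = np.column_stack([
    rng.integers(40, 46, size=200),
    rng.integers(1, 13, size=200),
    rng.integers(1, 29, size=200),
]).astype(float)

X_tr, X_te = train_test_split(X_demo, test_size=0.20, random_state=1)

scaler_demo = StandardScaler().fit(X_tr)    # statistics come from the training rows only
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)   # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))    # zero by construction
print(X_te_scaled.mean(axis=0).round(6))    # close to zero, but not exactly
```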


In [28]:
feature_cols = ['YR','MO','DA']                       # create a Python list of feature names
X = data1[feature_cols]                                     # use the list to select a subset of the original DataFrame

In [29]:
y = data1.MeanTemp
y.head()

0    0.177509
1    0.345745
2    0.211156
3    0.244803
4    0.211156
Name: MeanTemp, dtype: float64

#### Splitting X and y into training and test datasets

In [31]:
from sklearn.model_selection import train_test_split    # sklearn.cross_validation was removed in scikit-learn 0.20

def split(X,y):
    return train_test_split(X, y, test_size=0.20, random_state=1)

In [32]:
X_train, X_test, y_train, y_test=split(X,y)
print('Train cases as below')
print('X_train shape: ',X_train.shape)
print('y_train shape: ',y_train.shape)
print('\nTest cases as below')
print('X_test shape: ',X_test.shape)
print('y_test shape: ',y_test.shape)

Train cases as below
X_train shape:  (95206, 3)
y_train shape:  (95206,)

Test cases as below
X_test shape:  (23802, 3)
y_test shape:  (23802,)


In [33]:
def linear_reg(X, y, gridsearch=False):
    # Fit a LinearRegression on the training split; optionally tune it with GridSearchCV
    X_train, X_test, y_train, y_test = split(X, y)

    from sklearn.linear_model import LinearRegression
    linreg = LinearRegression()

    if not gridsearch:
        linreg.fit(X_train, y_train)

    else:
        from sklearn.model_selection import GridSearchCV
        # 'normalize' was removed from LinearRegression in scikit-learn 1.2,
        # so 'fit_intercept' is searched here instead
        parameters = {'fit_intercept': [True, False], 'copy_X': [True, False]}
        linreg = GridSearchCV(linreg, parameters, cv=10, refit=True)
        linreg.fit(X_train, y_train)                                      # fit the model to the training data (learn the coefficients)
        print("Mean cross-validated score of the best_estimator : ", linreg.best_score_)

        y_pred_test = linreg.predict(X_test)                              # make predictions on the testing set

        RMSE_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))    # compute the RMSE of our predictions
        print('RMSE for the test set is {}'.format(RMSE_test))

    return linreg

In [34]:
linreg = linear_reg(X,y)

#### Interpreting Model Coefficients

In [35]:
print('Intercept:',linreg.intercept_)          # print the intercept 
print('Coefficients:',linreg.coef_)  

Intercept: -0.002378606323660595
Coefficients: [ 0.01603145  0.0519276  -0.0027779 ]


In [36]:
coef_names = ['Intercept'] + feature_cols              # build a new list so feature_cols itself stays unchanged
coef = [linreg.intercept_] + linreg.coef_.tolist()
eq1 = zip(coef_names, coef)

for c1,c2 in eq1:
    print(c1,c2)

Intercept -0.002378606323660595
YR 0.016031453042119885
MO 0.05192760006924528
DA -0.0027778995042508333


#### Using the Model for Prediction

In [37]:
y_pred_train = linreg.predict(X_train)  
y_pred_test = linreg.predict(X_test)                                                           

### Model evaluation 

#### Model Evaluation using metrics

__Mean Absolute Error__ (MAE) is the mean of the absolute value of the errors:
$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
Computing the MAE for our temperature predictions:

In [38]:
MAE_train = metrics.mean_absolute_error(y_train, y_pred_train)
MAE_test = metrics.mean_absolute_error(y_test, y_pred_test)

In [39]:
print('MAE for training set is {}'.format(MAE_train))
print('MAE for test set is {}'.format(MAE_test))

MAE for training set is 0.7054965403902925
MAE for test set is 0.6982206219143078
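
As a quick check of the MAE formula, the sketch below computes it by hand on four made-up values and compares the result with `metrics.mean_absolute_error`. The arrays are illustrative and unrelated to the model above.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted values
y_true = np.array([0.5, -1.0, 2.0, 0.0])
y_hat = np.array([0.0, -0.5, 1.0, 0.5])

# Absolute errors are 0.5, 0.5, 1.0 and 0.5, so the mean is 0.625
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = metrics.mean_absolute_error(y_true, y_hat)
print(mae_manual, mae_sklearn)
```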


__Mean Squared Error__ (MSE) is the mean of the squared errors:
$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Computing the MSE for our temperature predictions:

In [40]:
MSE_train = metrics.mean_squared_error(y_train, y_pred_train)
MSE_test = metrics.mean_squared_error(y_test, y_pred_test)

In [41]:
print('MSE for training set is {}'.format(MSE_train))
print('MSE for test set is {}'.format(MSE_test))

MSE for training set is 1.0014211013153638
MSE for test set is 0.9794479148822799
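
Since `y` was standardized, its variance is close to 1, and a model that always predicts the mean would also score an MSE of about 1. Two ways to make such numbers easier to read are the R² score and an RMSE converted back to degrees Celsius. The sketch below uses made-up values, and `target_std_celsius = 8.3` is an assumption based on the standard deviation of MeanTemp reported by `describe()` earlier.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted values on a standardized scale
y_true = np.array([0.5, -1.0, 2.0, 0.0])
y_hat = np.array([0.0, -0.5, 1.0, 0.5])

mse = np.mean((y_true - y_hat) ** 2)        # same value as metrics.mean_squared_error
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_true, y_hat)        # 1 - SS_res / SS_tot

# Assumed standard deviation of the unscaled target, in deg C
target_std_celsius = 8.3
rmse_celsius = rmse * target_std_celsius    # error expressed in deg C

print(mse, rmse, r2, rmse_celsius)
```

An R² near 0 means the model is roughly as good as predicting the mean, and an R² near 1 means it explains most of the variance.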


### Classification Algorithms

We will now use some classification algorithms to predict the range in which the mean temperature falls.

In [55]:
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Create a field MeanTemp_bracket that gives the 5 degree Celsius range containing the mean temperature.

In [56]:
# Upper edge of the 5-degree bracket, computed for the whole column at once
bracket_upper = (5*np.ceil(data['MeanTemp']/5)).astype(int)
data['MeanTemp_bracket'] = (bracket_upper-5).astype(str) + ' to ' + bracket_upper.astype(str)

In [350]:
data.head()

Unnamed: 0,STA,Date,Precip,MaxTemp,MinTemp,MeanTemp,Snowfall,YR,MO,DA,MeanTemp_bracket
0,10001,1942-7-1,1.016,25.555556,22.222222,23.888889,0.0,42,7,1,20 to 25
1,10001,1942-7-2,0.0,28.888889,21.666667,25.277778,0.0,42,7,2,25 to 30
2,10001,1942-7-3,2.54,26.111111,22.222222,24.166667,0.0,42,7,3,20 to 25
3,10001,1942-7-4,2.54,26.666667,22.222222,24.444444,0.0,42,7,4,20 to 25
4,10001,1942-7-5,0.0,26.666667,21.666667,24.166667,0.0,42,7,5,20 to 25
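
To see how the ceiling-based bracket rule treats boundary and negative values, here is a small sketch on a handful of temperatures. The first two values come from the preview above; the others are made up to probe the edges.

```python
import numpy as np
import pandas as pd

# 23.888889 and 25.277778 appear in the preview above; the rest are edge cases
temps = pd.Series([23.888889, 25.277778, 25.0, -3.2, 0.0])

# Round up to the next multiple of 5 to get the upper edge of the bracket
upper = (5 * np.ceil(temps / 5)).astype(int)
brackets = (upper - 5).astype(str) + " to " + upper.astype(str)

print(brackets.tolist())
# ['20 to 25', '25 to 30', '20 to 25', '-5 to 0', '-5 to 0']
```

A value that sits exactly on a multiple of 5, such as 25.0, is assigned to the bracket below it ("20 to 25"), because the ceiling of a whole number is the number itself.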


In [57]:
feature_cols = ['STA','YR','MO','DA']                # create a Python list of feature names
X = data[feature_cols]

y = data.MeanTemp_bracket
y.head()


0    20 to 25
1    25 to 30
2    20 to 25
3    20 to 25
4    20 to 25
Name: MeanTemp_bracket, dtype: object

In [58]:
validation_size = 0.20
seed = 7

from sklearn.model_selection import train_test_split    # sklearn.cross_validation was removed in scikit-learn 0.20

def split(X,y):
    return train_test_split(X, y, test_size=validation_size, random_state=seed)


X_train, X_test, y_train, y_test=split(X,y)
print('Train cases as below')
print('X_train shape: ',X_train.shape)
print('y_train shape: ',y_train.shape)
print('\nTest cases as below')
print('X_test shape: ',X_test.shape)
print('y_test shape: ',y_test.shape)


Train cases as below
X_train shape:  (95206, 4)
y_train shape:  (95206,)

Test cases as below
X_test shape:  (23802, 4)
y_test shape:  (23802,)


In [59]:
scoring = 'accuracy'

In [None]:
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#models.append(('SVM', SVC()))
# evaluate each model in turn with 10-fold cross-validation on the training split
results = []
names = []
for name, model in models:
    # recent scikit-learn rejects random_state unless shuffle=True; folds are unshuffled here, as before
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

LR: 0.481052 (0.004865)
LDA: 0.481052 (0.004865)
KNN: 0.761790 (0.002940)
CART: 0.790402 (0.003568)
NB: 0.481902 (0.004758)


In [60]:
# Repeat the spot check, this time cross-validating on the held-out test split
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    # recent scikit-learn rejects random_state unless shuffle=True; folds are unshuffled here, as before
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X_test, y_test, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

LR: 0.480675 (0.009660)
LDA: 0.480675 (0.009660)
KNN: 0.691328 (0.008317)
CART: 0.735779 (0.006832)
NB: 0.484288 (0.008849)
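
LR, LDA and NB all score around 0.48. One way to judge whether such a score beats simply guessing the most common bracket (a check worth running, not something the results above establish) is a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data; the features, labels and class proportions are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)

# Synthetic stand-in: 4 numeric features and an imbalanced 3-class label
X_demo = rng.normal(size=(300, 4))
y_demo = rng.choice(
    ["20 to 25", "25 to 30", "15 to 20"], size=300, p=[0.3, 0.5, 0.2]
)

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(
    baseline, X_demo, y_demo, cv=KFold(n_splits=10), scoring="accuracy"
)

print("Baseline: %f (%f)" % (scores.mean(), scores.std()))
```

A classifier whose accuracy is close to this baseline has learned little beyond the class frequencies, so the baseline is a useful reference line next to the spot-check results.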
