In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
import time
from sklearn.ensemble import RandomForestRegressor
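import sklearn
import sklearn.model_selection  # provides train_test_split
import sklearn.metrics          # provides r2_score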

Background: The data set contains data about the local weather of a town in Hungary between 2006 and 2016. The data comes as a CSV file from Kaggle and can be found by searching Kaggle.com for “Weather in Szeged 2006-2016”. The data set contains at least one data point per day.

Research question: Can we predict the temperature or apparent temperature based on the humidity?

To start, we'll import the Weather in Szeged 2006-2016 data set.

In [None]:
weatherData = pd.read_csv("../input/szeged-weather/weatherHistory.csv")

Now, to get to know the data set, let's look at the first few rows:

In [None]:
weatherData.head()

There are four non-numeric columns to account for. One of these, "Formatted Date", contains a timestamp for each data point.

The "Loud Cover" column contains only "0.0" in every row, so it will not be of much use in this analysis.

See a summary of all null and N/A values by running the code below:

In [None]:
weatherData.isna().sum()

The "Precip Type" column has a number of NaN-valued rows, to which we will assign a value of "none" for this analysis. The justification for using a generic "none" value is that we do not know what type of precipitation was recorded for these rows, and for 517 out of nearly 100,000 data points it is not worth trying to fill in the missing values. Thus, a value of "none" is used to indicate that no precipitation was recorded. I considered forecasting this data from what was recorded for the same days of the year in previous years, but doing so did not improve performance in the end.

In [None]:
# fillna returns a new Series, so assign the result back to the column
weatherData['Precip Type']=weatherData['Precip Type'].fillna('none')
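
As a small illustration of the fill step, the sketch below uses a made-up three-row frame (not the Szeged data) to show that `fillna` returns a new Series and leaves the original column untouched unless the result is assigned back. The names `toy`, `missing_before`, and `missing_after` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the 'Precip Type' column
toy = pd.DataFrame({"Precip Type": ["rain", np.nan, "snow"]})

# Calling fillna without assignment discards the result
toy["Precip Type"].fillna("none")
missing_before = int(toy["Precip Type"].isna().sum())

# Assigning the result back actually replaces the missing value
toy["Precip Type"] = toy["Precip Type"].fillna("none")
missing_after = int(toy["Precip Type"].isna().sum())

print(missing_before, missing_after)
```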

That wraps up the initial data cleaning. Next, we generate an initial impression of which columns are correlated with the temperature.

As a first step, let's convert some of the categorical data into category codes:

In [None]:
weatherData['Summary']=(weatherData['Summary'].astype('category')).cat.codes
weatherData['Daily Summary']=(weatherData['Daily Summary'].astype('category')).cat.codes
weatherData['Precip Type']=(weatherData['Precip Type'].astype('category')).cat.codes
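
To make the encoding concrete, here is a minimal sketch on a made-up column (the values are illustrative, not taken from the data set). It shows how `cat.codes` maps each category to an integer, and that any value still missing at this point would be encoded as -1.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'Precip Type'
precip = pd.Series(["rain", "snow", np.nan, "rain"], name="Precip Type")

as_category = precip.astype("category")
codes = as_category.cat.codes

# Categories are sorted, so 'rain' -> 0 and 'snow' -> 1; missing values become -1
print(dict(enumerate(as_category.cat.categories)))
print(codes.tolist())
```

The integer codes impose an arbitrary ordering on the categories, so correlations computed on them should be read with some caution.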

The 'Formatted Date' column will take more work to transform into a useful form. The plan is to create a "day of the year," "week of the year," "month of the year," and "instant" column. The instant column will simply provide a complete time series representation of the data, but will not be used for predictions.

First, we have to convert the 'Formatted Date' column to the datetime64 data type. I used the pandas to_datetime function to accomplish this:

In [None]:
weatherData['Formatted Date']=pd.to_datetime(weatherData['Formatted Date'],format='%Y-%m-%d %H:%M:%S.%f',utc=True)

Now we can easily create new columns containing the month, week, day, and hour numerical data we desire:

In [None]:
weatherData['mo']=weatherData['Formatted Date'].dt.month
weatherData['day']=weatherData['Formatted Date'].dt.dayofyear
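# .dt.weekofyear is deprecated in newer pandas; isocalendar().week gives the same ISO week number
weatherData['wk']=weatherData['Formatted Date'].dt.isocalendar().week.astype(int)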
weatherData['hour']=weatherData['Formatted Date'].dt.hour

Next, we will create a column called "inst" containing the Unix timestamp (seconds since the epoch):

In [None]:
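# Seconds since the Unix epoch, computed in UTC for the whole column at once
# (replaces the row-by-row loop, which relied on chained assignment and local time)
epoch=pd.Timestamp("1970-01-01",tz="UTC")
weatherData['inst']=(weatherData['Formatted Date']-epoch)//pd.Timedelta(seconds=1)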
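
A quick way to sanity-check an epoch column such as "inst" is to convert it back to timestamps and compare with the originals. The sketch below does this on two made-up timestamps; the names `stamps`, `epoch`, and `round_trip` are illustrative and not part of the notebook.

```python
import pandas as pd

# Two made-up UTC timestamps standing in for 'Formatted Date'
stamps = pd.Series(pd.to_datetime(["2006-04-01 00:00:00", "2016-09-09 23:00:00"], utc=True))

# Whole seconds elapsed since the Unix epoch
epoch = pd.Timestamp("1970-01-01", tz="UTC")
inst = (stamps - epoch) // pd.Timedelta(seconds=1)

# Converting the integers back should reproduce the original timestamps
round_trip = pd.to_datetime(inst, unit="s", utc=True)
print(inst.tolist())
print(bool((round_trip == stamps).all()))
```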

Now, we can generate the correlation matrix, which gives us an initial impression of what data should be included in the model.

In [None]:
# numeric_only=True skips non-numeric columns such as the datetime column
weatherData.corr(numeric_only=True)

Looking at the correlations with the "Temperature (C)" column, each of the other columns seems to contribute something, but the most important appear to be Humidity, Visibility, Daily Summary, Precip Type, and the various time columns.

Unsurprisingly, plotting each value as a time series reveals a time-dependent structure. Since we are focusing on predicting temperature from humidity, I will plot the humidity and temperature/apparent temperature over time.
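
One way to read a correlation matrix against a single target is to pull out the target's column and rank the other columns by absolute correlation. The sketch below does this on synthetic data; the frame `toy`, its coefficients, and the "Noise" column are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
humidity = rng.uniform(0.2, 1.0, n)

# Synthetic frame: temperature falls as humidity rises, plus an unrelated column
toy = pd.DataFrame({
    "Temperature (C)": 30 - 25 * humidity + rng.normal(0, 3, n),
    "Humidity": humidity,
    "Noise": rng.normal(0, 1, n),
})

# Rank the remaining columns by absolute correlation with the target
ranked = (
    toy.corr()["Temperature (C)"]
    .drop("Temperature (C)")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Pearson correlation only measures linear association, so a column with a low value here can still be useful to a tree-based model.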

In [None]:
fig,subp = plt.subplots(3)
fig.suptitle("Humidity, Temperature versus time (UTC timestamp)")
subp[0].plot(weatherData['inst'],weatherData["Humidity"],".")
subp[1].plot(weatherData['inst'],weatherData["Temperature (C)"],".")
subp[2].plot(weatherData['inst'],weatherData["Apparent Temperature (C)"],".")

Histograms of the temperature and humidity data reveal that the temperature/apparent temperature are relatively normally distributed:

In [None]:
sns.distplot(weatherData["Temperature (C)"])

However, the humidity data is not:

In [None]:
sns.distplot(weatherData["Humidity"])

As part of this analysis, I considered normalizing the humidity data using a Box-Cox power transformation; however, this was not productive and did not increase model performance. The skewed distribution of humidity in the area is a physical reality, and it is reasonable to find that the distribution is not normal.

The data is relatively clean, and we do not have any scaling or other transformations to do, so we can now try fitting and testing the random forest regressor.
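
For reference, a Box-Cox transformation of this kind might look like the sketch below. It runs on synthetic left-skewed values rather than the real humidity column, and the small positive shift (needed because Box-Cox requires strictly positive input) is an assumption, not something taken from the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic left-skewed values in (0, 1), standing in for humidity
humidity_like = rng.beta(5, 2, size=2000)

# Box-Cox needs strictly positive input; the shift is an assumed guard against zeros
shifted = humidity_like + 1e-3
transformed, lam = stats.boxcox(shifted)

skew_before = stats.skew(shifted)
skew_after = stats.skew(transformed)
print("lambda:", lam)
print("skew before:", skew_before)
print("skew after:", skew_after)
```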

First, I will create a new dataset to contain only the features we want to use in the model.  For now, that will be the humidity data:

In [None]:
# Build the feature set from the humidity column only, renamed to "H".
# Selecting the column directly also keeps the 'hour' column out of this first model,
# which the earlier drop list left in by accident.
weatherDataF=weatherData[['Humidity']].rename(columns={'Humidity':'H'})

# Temperature (C) will be the predicted data
temp=weatherData["Temperature (C)"]

# Create training and test data sets 80/20 split
xtrain,xtest,ytrain,ytest = sklearn.model_selection.train_test_split(weatherDataF,temp,train_size=0.8)

Now to create a random forest model and score the predictions:

In [None]:
# Create the random forest model
weatherModel=RandomForestRegressor()

# Fit the model
weatherModel.fit(xtrain,ytrain)

# Generate predictions
preds=weatherModel.predict(xtest)

# Score the predictions
score=sklearn.metrics.r2_score(ytest,preds)
print(score)

sns.distplot(preds)
sns.distplot(ytest)

Not very impressive performance. If we normalize the humidity data, we get an r2 score of about 0.46. Predicting the temperature from humidity alone appears to be difficult and is unlikely to give a very good prediction.
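
As a side note on normalization, tree-based models split on the ordering of a feature's values, so a linear rescaling of a single feature would not be expected to change the score much. The sketch below checks this on synthetic data; the helper `humidity_only_r2`, the coefficients, and the fixed `random_state` are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic humidity and a noisy temperature that depends on it
humidity = rng.uniform(0.2, 1.0, size=(600, 1))
temperature = 30 - 25 * humidity[:, 0] + rng.normal(0, 6, size=600)

def humidity_only_r2(features):
    """Fit a small forest on one feature and return the held-out r2 score."""
    xtr, xte, ytr, yte = train_test_split(
        features, temperature, train_size=0.8, random_state=0
    )
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(xtr, ytr)
    return r2_score(yte, model.predict(xte))

raw_score = humidity_only_r2(humidity)
scaled_score = humidity_only_r2((humidity - humidity.mean()) / humidity.std())
print(raw_score, scaled_score)
```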

Now what happens if we include some of the time data that was extracted earlier:

In [None]:
weatherDataF["M"]=weatherData["mo"]
weatherDataF["W"]=weatherData["wk"]
weatherDataF["D"]=weatherData["day"]
weatherDataF["Hour"]=weatherData["hour"]

Now to refit the model:

In [None]:
# Rebuild the training and test data
xtrain,xtest,ytrain,ytest = sklearn.model_selection.train_test_split(weatherDataF,temp,train_size=0.8)

# Create the random forest model
weatherModel=RandomForestRegressor()

# Fit the model
weatherModel.fit(xtrain,ytrain)

# Generate predictions
preds=weatherModel.predict(xtest)

# Score the predictions
score=sklearn.metrics.r2_score(ytest,preds)
print(score)

sns.distplot(preds)
sns.distplot(ytest)

Pretty big performance gains! However, the density of the predictions seems to be strongly bimodal with two large peaks, most likely reflecting a split between the warmer summer/spring temperatures and the colder winter/fall temperatures. In reality, the data has a more complicated structure that the model is not capturing.

We can do better by adding more features.  In fact, the model performance continues to improve as we add more features:

In [None]:
# Add the rest of the features in the data set
weatherDataF["WS"]=weatherData["Wind Speed (km/h)"]
weatherDataF["WB"]=weatherData["Wind Bearing (degrees)"]
weatherDataF["P"]=weatherData["Pressure (millibars)"]
weatherDataF["Vis"]=weatherData["Visibility (km)"]
weatherDataF["Sum"]=weatherData["Summary"]
weatherDataF["DataSum"]=weatherData["Daily Summary"]
weatherDataF["PT"]=weatherData["Precip Type"]

# Rebuild the training and test data
xtrain,xtest,ytrain,ytest = sklearn.model_selection.train_test_split(weatherDataF,temp,train_size=0.8)

# Create the random forest model
weatherModel=RandomForestRegressor()

# Fit the model
weatherModel.fit(xtrain,ytrain)

# Generate predictions
preds=weatherModel.predict(xtest)

# Score the predictions
score=sklearn.metrics.r2_score(ytest,preds)
print(score)

sns.distplot(preds)
sns.distplot(ytest)

An r2 score of 0.962 is not quite "cutting edge" performance, but we have not yet addressed how well an ideal model could be expected to perform, so it is unclear how this model compares to that ideal. A good next step would be to assess both the ability of the data set to predict the temperature and the shortcomings of the random forest model.
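
As one possible starting point for that next step, the sketch below uses cross-validation to estimate how stable the r2 score is, and reads the forest's feature importances to see which inputs it relies on. It runs on synthetic data with a seasonal signal; the column names, coefficients, and the "Noise" feature are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400

# Synthetic features: humidity, day of year, and an unrelated noise column
X = pd.DataFrame({
    "H": rng.uniform(0.2, 1.0, n),
    "D": rng.integers(1, 366, n),
    "Noise": rng.normal(0, 1, n),
})
# Synthetic temperature: mostly seasonal, with a smaller humidity effect
y = 10 - 12 * np.cos(2 * np.pi * X["D"] / 365) - 8 * (X["H"] - 0.6) + rng.normal(0, 2, n)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Five-fold cross-validated r2 gives a spread rather than a single number
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())

# Impurity-based importances show which features the fitted forest leans on
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```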