# **Introduction**
In this notebook, we use the WeatherHistory dataset to predict the weather. The dataset contains 96,453 rows and 12 columns:

* Formatted Date - date and time of the observation
* Summary - short summary of the weather
* Precip Type - type of precipitation
* Temperature (C) - temperature in Celsius
* Apparent Temperature (C) - apparent temperature in Celsius
* Humidity - relative humidity
* Wind Speed (km/h) - wind speed in kilometers per hour
* Wind Bearing (degrees) - wind bearing in degrees
* Visibility (km) - visibility in kilometers
* Loud Cover - cloud cover (the column name is misspelled in the source data)
* Pressure (millibars) - pressure in millibars
* Daily Summary - daily summary of the weather

The goal is to predict the Summary column, i.e. the weather, using the other columns as features and several machine learning algorithms.

# **Data Preparation**
First, we'll import the necessary libraries and load the dataset.

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px

# Read in the data
df = pd.read_csv("data/WeatherHistory.csv")
df.head()

,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.


## **Understanding the data**
The table above and the information displayed below show that several preparation steps are needed before the data is ready for visualisations and machine learning models.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Formatted Date            96453 non-null  object 
 1   Summary                   96453 non-null  object 
 2   Precip Type               95936 non-null  object 
 3   Temperature (C)           96453 non-null  float64
 4   Apparent Temperature (C)  96453 non-null  float64
 5   Humidity                  96453 non-null  float64
 6   Wind Speed (km/h)         96453 non-null  float64
 7   Wind Bearing (degrees)    96453 non-null  float64
 8   Visibility (km)           96453 non-null  float64
 9   Loud Cover                96453 non-null  float64
 10  Pressure (millibars)      96453 non-null  float64
 11  Daily Summary             96453 non-null  object 
dtypes: float64(8), object(4)
memory usage: 8.8+ MB


In [4]:
df.describe()

,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars)
count,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0
mean,11.932678,10.855029,0.734899,10.81064,187.509232,10.347325,0.0,1003.235956
std,9.551546,10.696847,0.195473,6.913571,107.383428,4.192123,0.0,116.969906
min,-21.822222,-27.716667,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.688889,2.311111,0.6,5.8282,116.0,8.3398,0.0,1011.9
50%,12.0,12.0,0.78,9.9659,180.0,10.0464,0.0,1016.45
75%,18.838889,18.838889,0.89,14.1358,290.0,14.812,0.0,1021.09
max,39.905556,39.344444,1.0,63.8526,359.0,16.1,0.0,1046.38


1. Handle missing values: the 'Precip Type' column has 517 missing values. We can either drop those rows or impute them, for example with the column's mode (a mean is not defined for a categorical column). Here we drop them.

In [5]:
print(df.isnull().sum())
# Remove the missing values
df = df.dropna()

Formatted Date                0
Summary                       0
Precip Type                 517
Temperature (C)               0
Apparent Temperature (C)      0
Humidity                      0
Wind Speed (km/h)             0
Wind Bearing (degrees)        0
Visibility (km)               0
Loud Cover                    0
Pressure (millibars)          0
Daily Summary                 0
dtype: int64
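
Dropping the rows is the approach taken here. As a rough sketch of the imputation alternative mentioned in step 1, the missing 'Precip Type' values could instead be filled with the column's most frequent value. The small DataFrame below is made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the weather data (values are made up)
demo = pd.DataFrame({
    "Precip Type": ["rain", "snow", np.nan, "rain", np.nan],
    "Temperature (C)": [9.5, -1.2, 3.3, 12.0, 7.8],
})

# The mode is the most frequent category; mode() returns a Series, so take its first entry
most_common = demo["Precip Type"].mode()[0]

# Fill the gaps instead of dropping the rows
demo["Precip Type"] = demo["Precip Type"].fillna(most_common)

print(demo["Precip Type"].tolist())
```

Imputing keeps all rows, at the cost of assigning a guessed precipitation type to the rows that had none.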


2. Convert the 'Formatted Date' column to datetime format and extract useful features such as the year, month, day and hour.


In [6]:
# Convert the "Formatted Date" column to datetime format; the strings carry timezone offsets, so utc=True converts them all to UTC
df["Formatted Date"] = pd.to_datetime(df["Formatted Date"], utc=True)

# The following values will be useful for visualisations and machine learning models
# Extracting the year from the datetime column
df["year"] = df["Formatted Date"].dt.year

# Extracting the month from the datetime column
df["month"] = df["Formatted Date"].dt.month

# Extracting the day from the datetime column
df["day"] = df["Formatted Date"].dt.day

# Extracting the hour from the datetime column
df["hour"] = df["Formatted Date"].dt.hour
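
Because `utc=True` converts every timestamp to UTC, the extracted parts can differ from the local values in the raw file. The sketch below uses the first two timestamps from the table above to show the shift; the variable names are illustrative.

```python
import pandas as pd

# The first two raw timestamps from the dataset, recorded with a +02:00 offset
raw = pd.Series(["2006-04-01 00:00:00.000 +0200", "2006-04-01 01:00:00.000 +0200"])

# utc=True converts each value to UTC, which moves these two hours back
stamps = pd.to_datetime(raw, utc=True)

parts = pd.DataFrame({
    "year": stamps.dt.year,
    "month": stamps.dt.month,
    "day": stamps.dt.day,
    "hour": stamps.dt.hour,
})
print(parts)
```

Midnight local time on 1 April becomes 22:00 on 31 March in UTC, which is why the first rows printed later in the notebook show month 3 and day 31.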


3. One-hot encode categorical variables such as 'Precip Type' if they are not ordinal in nature.

In [7]:
# One hot encode the categorical variables
df = pd.get_dummies(df, columns=["Precip Type"])
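
To make the effect of the encoding concrete, here is a small sketch on made-up rows. The column names match the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the real column holds the values "rain" and "snow"
demo = pd.DataFrame({"Precip Type": ["rain", "snow", "rain"], "Humidity": [0.89, 0.92, 0.83]})

# Each category becomes its own indicator column named "<column>_<category>"
encoded = pd.get_dummies(demo, columns=["Precip Type"])

print(encoded.columns.tolist())
```

With only two categories the two indicator columns carry the same information, so passing `drop_first=True` to `pd.get_dummies` would be one way to keep a single column.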


4. Remove 'Loud Cover', as it only contains a single value (0.0), and convert 'Summary' to numerical category codes.

In [8]:
# Remove Loud Cover column as it contains only one value
df = df.drop(columns=["Loud Cover"])
# Convert Summary column to a numerical value
df["Summary"] = df["Summary"].astype('category')
df["Summary"] = df["Summary"].cat.codes
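
The category codes are assigned in sorted label order and carry no meteorological meaning, so it can help to keep the code-to-label mapping for interpreting results later. A minimal sketch, using a hypothetical three-row sample with labels taken from the table above:

```python
import pandas as pd

# Hypothetical sample of summaries; the labels are taken from the table above
summary = pd.Series(["Partly Cloudy", "Mostly Cloudy", "Partly Cloudy"]).astype("category")

# Keep the code -> label mapping before the labels are overwritten with codes
code_to_label = dict(enumerate(summary.cat.categories))

codes = summary.cat.codes
print(code_to_label)
print(codes.tolist())
```

In the notebook, the same mapping could be built from `df["Summary"].cat.categories` between the two conversion lines.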

# **Exploratory Data Analysis**

Below is a correlation matrix of the features. It shows that 'Temperature (C)' and 'Apparent Temperature (C)' are highly correlated, so we can drop one of them. We also drop 'Daily Summary', a free-text column that cannot be used for machine learning without further processing, and 'Formatted Date', whose components were already extracted above.

In [9]:
# Create a correlation matrix (numeric_only=True skips the remaining text and datetime columns)
corr_matrix = df.corr(numeric_only=True)

# Visualise the correlation matrix
fig = px.imshow(corr_matrix)
fig.show()
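
Reading highly correlated pairs off a heatmap by eye can be error-prone. As a rough sketch, the pairs can also be listed programmatically. The data below is synthetic and the 0.9 threshold is an arbitrary assumption, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: "apparent" is built to track "temp" closely
rng = np.random.default_rng(42)
temp = rng.normal(12, 9, size=500)
demo = pd.DataFrame({
    "temp": temp,
    "apparent": temp + rng.normal(0, 1, size=500),
    "humidity": rng.uniform(0, 1, size=500),
})

corr = demo.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair is listed once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose absolute correlation exceeds a hypothetical threshold of 0.9
high_pairs = upper.stack().loc[lambda s: s > 0.9]
print(high_pairs)
```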

In [10]:
# Remove one of the two highly correlated temperature columns, plus the date and free-text columns
df = df.drop(columns=["Apparent Temperature (C)", "Formatted Date", "Daily Summary"])

The scatter plot below shows the relationship between two numerical features, temperature and humidity.

In [11]:
import plotly.graph_objs as go

data = [go.Scatter(x=df['Temperature (C)'], 
                   y=df['Humidity'],
                   mode='markers')]

fig = go.Figure(data=data)
fig.show()


In [12]:
import plotly.graph_objs as go

# Count each Summary code once and plot the class distribution
summary_counts = df['Summary'].value_counts()
data = [go.Bar(x=summary_counts.index, y=summary_counts.values)]

fig = go.Figure(data=data)
fig.show()


# **Feature Engineering**
Separate the features from the target: every remaining column except 'Summary' is used as a feature.

In [13]:
# Separate the features (X) from the target (y)
X = df.drop(columns=["Summary"])
y = df['Summary']
print(df.head())

   Summary  Temperature (C)  Humidity  Wind Speed (km/h)  \
0       19         9.472222      0.89            14.1197   
1       19         9.355556      0.86            14.2646   
2       17         9.377778      0.89             3.9284   
3       19         8.288889      0.83            14.1036   
4       17         8.755556      0.83            11.0446   

   Wind Bearing (degrees)  Visibility (km)  Pressure (millibars)  year  month  \
0                   251.0          15.8263               1015.13  2006      3   
1                   259.0          15.8263               1015.63  2006      3   
2                   204.0          14.9569               1015.94  2006      4   
3                   269.0          15.8263               1016.41  2006      4   
4                   259.0          15.8263               1016.51  2006      4   

   day  hour  Precip Type_rain  Precip Type_snow  
0   31    22                 1                 0  
1   31    23                 1                 0  

In [14]:
# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# **Modeling and Evaluation**
In this section, we train and evaluate several machine learning models and at least one deep learning model to predict the summary/weather from the collected data. We use scikit-learn to train and evaluate the models, with LightGBM and Keras supplying additional models.

We will use PCA to reduce the dimensionality of the data. The number of components is chosen as the smallest number whose cumulative explained variance exceeds 95%.

In [15]:
from sklearn.decomposition import PCA

# Create a PCA object
pca = PCA(random_state=42)

# Fit the PCA model to your data
pca.fit(X_train)

# Keep the smallest number of components whose cumulative explained variance exceeds 95%
n_components = np.argmax(np.cumsum(pca.explained_variance_ratio_) > 0.95) + 1
print(f"Number of components to keep: {n_components}")

# Create a new PCA object with the number of components to keep
pca = PCA(n_components=n_components, random_state=42)

# Fit the PCA model to your train data
pca.fit(X_train)

# Transform the train and test data to the first n principal components
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Create new dataframes with the headers being the principal components; e.g. PC1, PC2, PC3, etc.
pca_train_df = pd.DataFrame(X_train_pca, columns=[f"PC{i+1}" for i in range(n_components)], index=X_train.index)
pca_test_df = pd.DataFrame(X_test_pca, columns=[f"PC{i+1}" for i in range(n_components)], index=X_test.index)


# Add the target variable to the dataframe
pca_train_df['Summary'] = y_train
pca_test_df['Summary'] = y_test

# Print the explained variance ratio of each principal component
print(pca.explained_variance_ratio_)

Number of components to keep: 2
[0.54005074 0.44814274]


The explained variance ratio measures how much of the total variance in the data each principal component explains. Across all principal components the ratios sum to 1; the two components kept here account for about 98.8% (0.540 + 0.448). This helps in choosing how many components to keep while retaining most of the information.
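
PCA is sensitive to feature scale, and the notebook applies it to unscaled columns, so features with large numeric ranges (such as wind bearing and pressure) can dominate the components. A minimal sketch of one common alternative, which this notebook does not use, standardises the features first and lets scikit-learn pick the component count for a 95% variance target. The data is synthetic and the column scales are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (stand-ins, not the real data)
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(0, 1, 300),       # small-scale feature
    rng.normal(180, 100, 300),   # bearing-like scale
    rng.normal(1000, 100, 300),  # pressure-like scale
    rng.uniform(0, 1, 300),      # humidity-like scale
])

# A float n_components asks PCA to keep enough components to reach that variance share
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
X_reduced = pipeline.fit_transform(X_demo)

pca_step = pipeline.named_steps["pca"]
print(pca_step.n_components_, np.cumsum(pca_step.explained_variance_ratio_))
```

Wrapping both steps in a pipeline also means the scaler and PCA are fitted together on whatever data `fit` receives, which fits naturally with cross-validation.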

In [16]:
# We have reduced the number of features to n_components; note that df.shape[1] also counts the target column 'Summary'
print(f"Original number of features: {df.shape[1]}")
print(f"Reduced number of features: {n_components}")


Original number of features: 13
Reduced number of features: 2


In [17]:
X_train = pca_train_df.drop(columns=['Summary'])
y_train = pca_train_df['Summary']
X_test = pca_test_df.drop(columns=['Summary'])
y_test = pca_test_df['Summary']

This code chooses the number of PCA components automatically: it keeps the smallest number of components whose cumulative explained variance exceeds 95%. PCA is fitted on the training set only, and the same fitted transformation is then applied to both the training and test sets. This keeps information from the test set out of the fitted transformation (data leakage).

Next, we will train and evaluate several machine learning models such as:

- [x] Linear Regression

- [x] Decision Tree

- [x] Random Forest

- [x] Gradient Boosting

- [x] LightGBM

- [x] Support Vector Machine (SVM)

- [x] Deep Learning Model

This code imports regression models from scikit-learn (Linear Regression, Decision Tree, Random Forest, Gradient Boosting and Support Vector Machine (SVM)) together with LightGBM's regressor from the separate lightgbm package. The models are stored in a dictionary whose keys are the model names and whose values are model instances with default parameters.

The code then iterates over the dictionary, with the key being the name of the model and the value being the model itself. For each model, it performs k-fold cross-validation (k=5) on the training set (X_train and y_train) and calculates the mean absolute error (MAE) as the evaluation metric. The function cross_val_score() is used for this.

The code then prints the name of each model and its MAE in a tabular format, using an f-string to format the output. The <25 and <5 specifiers left-justify the strings and pad them to widths of 25 and 5 characters respectively. The :.3f specifier formats the floating point number with 3 decimal places.
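
The format specifiers can be checked in isolation. This short sketch reuses the Linear Regression figures from the output below; `repr` is only used to make the padding and tab visible.

```python
name = "Linear Regression"
mae = 3.198

# <25 pads the name to 25 characters, <5 pads "MAE:" to 5, and .3f keeps three decimals
line = f"{name:<25}\t{'MAE:':<5} {mae:.3f}"
print(repr(line))
```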


In [18]:
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

models = {
'Linear Regression': LinearRegression(),
'Decision Tree': DecisionTreeRegressor(),
'Random Forest': RandomForestRegressor(),
'Gradient Boosting': GradientBoostingRegressor(),
'Light Gradient Boosting':lgb.LGBMRegressor(),
'SVM': SVR()
}

best_model_name, best_model_mae = None, float('inf')

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')
    mae = -1 * scores.mean()
    print(f"{name:<25}\t{'MAE:':<5} {mae:.3f}")
    if mae < best_model_mae:
        best_model_name = name
        best_model_mae = mae
        
print(f"Best model is {best_model_name} with MAE of {best_model_mae:.3f}")

# Fit the best model on the full training set
best_model = models[best_model_name]
best_model.fit(X_train, y_train)

Linear Regression        	MAE:  3.198
Decision Tree            	MAE:  3.647


We use k-fold cross-validation with k = 5 to evaluate each model. The training data is split into k subsets, and the model is trained and evaluated k times, with each subset serving as the validation set once. This gives a more reliable estimate of the model's performance than a single split.
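
The splitting scheme can be illustrated with scikit-learn's `KFold`, which is the splitter `cross_val_score` uses for regressors when `cv` is an integer. The ten dummy samples below are placeholders for the training set.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples stand in for the training set
X_demo = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
validation_counts = np.zeros(len(X_demo), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X_demo), start=1):
    # Each fold trains on 8 samples and validates on the remaining 2
    validation_counts[val_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")

# Every sample is used for validation exactly once
print(validation_counts.tolist())
```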

We use the mean absolute error (MAE) to evaluate each model. MAE is a common metric for regression problems and indicates how close the predicted values are to the actual values. Note that 'Summary' is a categorical label encoded as integer codes, so MAE here measures the distance between codes rather than how often the predicted class is correct.
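
A small worked example may help when interpreting the MAE values on category codes. The codes below are hypothetical, and accuracy is shown only for comparison; it is not a metric this notebook computes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical true and predicted Summary codes
y_true = np.array([19, 17, 19, 5])
y_pred = np.array([18, 17, 15, 5])

# MAE averages the absolute differences: (1 + 0 + 4 + 0) / 4
mae = mean_absolute_error(y_true, y_pred)

# Accuracy only counts exact matches: 2 of 4
accuracy = accuracy_score(y_true, y_pred)

print(f"MAE: {mae:.3f}, accuracy: {accuracy:.2f}")
```

Predicting 15 instead of 19 costs four times as much as predicting 18 instead of 19 under MAE, even though both are simply the wrong class, because the codes follow alphabetical label order rather than any weather-related ordering.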

Finally, we will train and evaluate at least one deep learning model such as a Multi-Layer Perceptron (MLP) using the Keras library.

In [None]:
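# Note: keras.wrappers.scikit_learn is deprecated and missing from recent Keras/TensorFlow
# releases; the separate SciKeras package provides a replacement KerasRegressor
from keras.wrappers.scikit_learn import KerasRegressor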
from sklearn.model_selection import cross_val_score
# import sequential, dense
from keras.models import Sequential
from keras.layers import Dense

# Define the model
def create_model():
    model = Sequential()
    model.add(Dense(32, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_absolute_error', optimizer='adam')
    return model

# Create a scikit-learn compatible wrapper for the model
estimator = KerasRegressor(build_fn=create_model, epochs=50, batch_size=32, verbose=0)

# Evaluate the model using cross validation
scores = cross_val_score(estimator, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')

# Print the mean absolute error
print(f"MAE: {-1 * scores.mean():.3f}")
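
If the Keras wrapper is unavailable in the installed version, a similar network can be sketched with scikit-learn's `MLPRegressor`. This is a swapped-in substitute rather than the Keras model above: it uses the same two hidden layers (32 and 16 units) but trains on squared error instead of MAE, and the data below is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Synthetic two-feature data standing in for the two principal components
X_demo, y_demo = make_regression(n_samples=300, n_features=2, noise=5.0, random_state=42)

# Same hidden layer sizes as the Keras model; MLPRegressor optimises squared error
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu", solver="adam",
                   max_iter=500, random_state=42)

# Evaluate with the same 5-fold MAE scoring used for the other models
mlp_scores = cross_val_score(mlp, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")
print(f"MAE: {-1 * mlp_scores.mean():.3f}")
```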


Once all the models are trained and evaluated, we will compare their performance and select the best one. Finally, we will fine-tune the best model using techniques such as hyperparameter tuning and feature engineering.
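
As a hedged illustration of the hyperparameter tuning step, the sketch below runs a small grid search with scikit-learn's `GridSearchCV`. The choice of Random Forest, the parameter grid and the synthetic data are all assumptions made for the example, not results from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the PCA-transformed training set
X_demo, y_demo = make_regression(n_samples=200, n_features=2, noise=5.0, random_state=42)

# Hypothetical search space; the values are illustrative, not tuned for the weather data
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_, f"MAE: {-search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the model refitted on all the data passed to `fit` with the best parameter combination.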

Once the best model is selected, we will use it to make predictions on the test set. We will then visualize the results using a confusion matrix heatmap that compares the predicted values to the actual values.

In [None]:
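import numpy as np
from plotly import graph_objects as go
from sklearn.metrics import confusion_matrix, mean_absolute_error

# Compare the MAE of the best model with the MAE of the neural network and pick the best one
nn_mae = -1 * scores.mean()
if best_model_mae < nn_mae:
    print(f"Best model is {best_model_name} with MAE of {best_model_mae:.3f}")
else:
    print(f"Best model is Neural Network with MAE of {nn_mae:.3f}")
    best_model = estimator
    best_model_mae = nn_mae
    best_model_name = 'Neural Network'
    # cross_val_score fits clones, so the estimator itself still needs fitting
    best_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = np.ravel(best_model.predict(X_test))

# Report the test-set MAE (1 - MAE is not an accuracy, since MAE is not bounded by 1)
print(f"Test MAE: {mean_absolute_error(y_test, y_pred):.3f}")

# Round the regression outputs to the nearest class code to build a confusion matrix
labels = np.arange(int(y.max()) + 1)
y_pred_codes = np.clip(np.rint(y_pred), labels[0], labels[-1]).astype(int)
cm = confusion_matrix(y_test, y_pred_codes, labels=labels)

# Plot the confusion matrix
fig = go.Figure(data=go.Heatmap(z=cm, x=labels, y=labels))
fig.update_layout(title='Confusion Matrix', xaxis_title='Predicted', yaxis_title='Actual')
fig.show()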