This script groups the dependent variables of the urban air flows into five clusters. It is one of two ways we transform the continuous variables wind_speed_kmh, gust_speed_kmh, gust_angle and wind_angle into classes so that they can be used in ML models. One of the main issues with the NETATMO dataset is its size. Clustering replaces the four continuous variables with a single class label, which reduces the amount of data and therefore the processing time. The resulting dataframe is split into testing and training data.

Import the required libraries. If you have not installed these packages, uncomment the following lines.

In [12]:
# import sys
# !{sys.executable} -m pip install pandas
# !{sys.executable} -m pip install scikit-learn

In [1]:
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import train_test_split # Import train_test_split function

Read the normalized data and store it in a dataframe. Make sure that your working directory is set correctly and that you have run the normalization script first.

In [4]:
df = pd.read_csv("norm_data.csv")

Remove the rows with NA values in the dependent variables wind_speed_kmh, wind_angle, gust_angle and gust_speed_kmh. Select the dependent variables for later analysis.

In [5]:
df = df.dropna(how='any',subset=["wind_speed_kmh","wind_angle","gust_angle","gust_speed_kmh"])
X = df[["wind_speed_kmh","wind_angle","gust_angle","gust_speed_kmh"]]

Start the clustering by defining the model and fitting it on the dependent variables. Five clusters are chosen to capture enough variability while keeping the processing time manageable.

In [6]:
model = MiniBatchKMeans(n_clusters=5)
model.fit(X)

MiniBatchKMeans(n_clusters=5)
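
One way to make the choice of five clusters less arbitrary is to compare the inertia (within-cluster sum of squared distances) for several values of k and look for the point where it stops dropping sharply. The following is only a sketch on synthetic data: `X_demo`, the range of k and the `random_state` values are assumptions for illustration, not part of the original analysis.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the four wind variables (assumption: not the NETATMO data)
X_demo, _ = make_blobs(n_samples=2000, centers=4, n_features=4, random_state=0)

inertias = {}
for k in range(2, 9):
    km = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0)
    km.fit(X_demo)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances

# Print k next to its inertia; a visible "elbow" hints at a reasonable k
for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia always tends to decrease as k grows, so it can only suggest a range of plausible values and does not by itself pick the best k.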

Assign a cluster to each row.

In [8]:
yhat = model.predict(X)
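
Before using the cluster IDs as classes, it can help to check how many rows fall into each cluster and where the cluster centres lie. The sketch below uses random synthetic data with the same column names; `X_demo` and `demo_model` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

cols = ["wind_speed_kmh", "wind_angle", "gust_angle", "gust_speed_kmh"]
rng = np.random.default_rng(0)
# Synthetic stand-in for the normalized wind variables (assumption)
X_demo = pd.DataFrame(rng.random((1000, 4)), columns=cols)

demo_model = MiniBatchKMeans(n_clusters=5, n_init=3, random_state=0).fit(X_demo)
labels = demo_model.predict(X_demo)

# Number of rows per cluster: strongly unbalanced sizes mean unbalanced classes later
sizes = pd.Series(labels).value_counts().sort_index()
# One row per cluster centre, expressed in the (normalized) input variables
centers = pd.DataFrame(demo_model.cluster_centers_, columns=cols)

print(sizes)
print(centers.round(2))
```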

Add the cluster IDs to the dataframe for later use in ML models.

In [9]:
df["class_km"] = yhat
df = df.drop(["wind_speed_kmh","wind_angle","gust_angle","gust_speed_kmh"],axis=1)

Split the dataset into testing and training data for training and validating the ML model. 30% is used for testing and 70% for training.

In [10]:
feature_cols = ["temperature","humidity","pressure",'rain_mm']
X = df[feature_cols]
y = df.class_km
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) 
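
If the clusters turn out to be unbalanced, a stratified split keeps the class proportions the same in the training and testing data. This is a possible variation rather than what the notebook does: the `stratify` argument is a real `train_test_split` parameter, but the synthetic frame `demo` and its class probabilities are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
feature_cols = ["temperature", "humidity", "pressure", "rain_mm"]
# Synthetic stand-in with deliberately unbalanced cluster IDs (assumption)
demo = pd.DataFrame(rng.random((1000, 4)), columns=feature_cols)
demo["class_km"] = rng.choice(5, size=1000, p=[0.5, 0.2, 0.15, 0.1, 0.05])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[feature_cols],
    demo["class_km"],
    test_size=0.3,
    random_state=1,
    stratify=demo["class_km"],  # preserve the class proportions in both parts
)

train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(train_share.round(3))
print(test_share.round(3))
```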

Write the testing and training data with the cluster IDs to CSV files for later use.

In [11]:
# Copy the splits first so that adding a column does not trigger SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()
X_train["class"] = y_train
X_test["class"] = y_test

X_train.to_csv("train.csv")
X_test.to_csv("test.csv")


In conclusion, the clustering was challenging mainly because of the size of the dataset. Many of the clustering algorithms that were tried did not succeed because the dataset was too large and creating the clusters took too long. Of Affinity Propagation, Agglomerative Clustering, BIRCH, DBSCAN, Mean Shift, OPTICS, Spectral Clustering, Gaussian Mixture Model, K-means and mini-batch K-means, only the last two were able to run. Mini-batch K-means is faster on large datasets because it updates the clusters from small batches of rows instead of the full dataset, so this algorithm was chosen.
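
The batch-wise updating can be made explicit with `partial_fit`, which adjusts the centroids using one chunk of rows at a time so that the full dataset never has to be processed in a single step. The following is a minimal sketch on synthetic data; the array `data`, the chunk count and the name `stream_model` are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
data = rng.random((10_000, 4))  # synthetic stand-in for the four wind variables

stream_model = MiniBatchKMeans(n_clusters=5, n_init=3, random_state=0)
for chunk in np.array_split(data, 20):  # 20 chunks of 500 rows each
    stream_model.partial_fit(chunk)     # update the centroids with this chunk only

# The incrementally fitted model can assign clusters like a normally fitted one
chunk_labels = stream_model.predict(data[:10])
print(stream_model.cluster_centers_.shape)
```

With a real file, the chunks could come from `pd.read_csv(..., chunksize=...)` instead of `np.array_split`, which would keep memory use bounded as well.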

A well-known issue with clustering is the lack of scientific underpinning and the difficulty of assessing the quality of the resulting clusters. The choice of algorithm and number of clusters was rather arbitrary. Due to time constraints, I was not able to examine and explore the possible options extensively. Because the ML models subsequently performed poorly, the decision was made to drop this type of modelling. As such, the dataset with the clusters was not used in the analysis.
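
For reference, one common way to put a number on cluster quality is the silhouette score, which scikit-learn can estimate on a random subsample so that it stays feasible for large datasets. The sketch below was not part of the analysis; it runs on synthetic data, and the candidate values of k and the `sample_size` are arbitrary assumptions.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the wind variables (assumption: not the NETATMO data)
X_demo, _ = make_blobs(n_samples=5000, centers=5, n_features=4, random_state=0)

scores = {}
for k in (3, 5, 8):
    labels = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit_predict(X_demo)
    # sample_size evaluates the score on a random subset instead of all rows
    scores[k] = silhouette_score(X_demo, labels, sample_size=1000, random_state=0)

# Values closer to 1 indicate compact, well-separated clusters
print(scores)
```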