- Temperature (numeric): The temperature in degrees Celsius, ranging from extreme cold to extreme heat.
- Humidity (numeric): The humidity as a percentage; values above 100% are included deliberately as outliers.
- Wind Speed (numeric): The wind speed in kilometers per hour, including unrealistically high values.
- Precipitation (%) (numeric): The precipitation percentage, including outlier values.
- Cloud Cover (categorical): The cloud cover description.
- Atmospheric Pressure (numeric): The atmospheric pressure in hPa, covering a wide range.
- UV Index (numeric): The UV index, indicating the strength of ultraviolet radiation.
- Season (categorical): The season during which the data was recorded.
- Visibility (km) (numeric): The visibility in kilometers, including very low or very high values.
- Location (categorical): The type of location where the data was recorded.
- Weather Type (categorical): The target variable for classification, indicating the weather type.
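
Since the descriptions above say that some columns deliberately contain impossible values, it can help to count them before modeling. The following is a minimal sketch, not part of the original notebook: the tiny `sample` frame, the `plausible_bounds` ranges, and the helper `count_out_of_range` are all assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample standing in for the real CSV (values are illustrative only).
sample = pd.DataFrame({
    "Humidity": [73, 96, 109, 64],
    "Precipitation (%)": [82.0, 71.0, 16.0, 105.0],
})

# Assumed plausible ranges for percentage columns; these bounds are an
# assumption of this sketch, not part of the dataset documentation.
plausible_bounds = {
    "Humidity": (0, 100),
    "Precipitation (%)": (0, 100),
}

def count_out_of_range(frame, bounds):
    """Return {column: number of rows outside the inclusive [low, high] range}."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in bounds.items()
    }

outlier_counts = count_out_of_range(sample, plausible_bounds)
print(outlier_counts)
```

On the real data, the same helper could be called with `df` in place of `sample`; what to do with the flagged rows (clip, drop, or keep) is a separate modeling decision.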

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')  # suppress all warnings to keep the cell output clean

In [None]:
df = pd.read_csv('weather_classification_data.csv')

In [None]:
df.shape

(13200, 11)

In [None]:
df.head()

,Temperature,Humidity,Wind Speed,Precipitation (%),Cloud Cover,Atmospheric Pressure,UV Index,Season,Visibility (km),Location,Weather Type
0,14.0,73,9.5,82.0,partly cloudy,1010.82,2,Winter,3.5,inland,Rainy
1,39.0,96,8.5,71.0,partly cloudy,1011.43,7,Spring,10.0,inland,Cloudy
2,30.0,64,7.0,16.0,clear,1018.72,5,Spring,5.5,mountain,Sunny
3,38.0,83,1.5,82.0,clear,1026.25,7,Spring,1.0,coastal,Sunny
4,27.0,74,17.0,66.0,overcast,990.67,1,Winter,2.5,mountain,Rainy


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Temperature           13200 non-null  float64
 1   Humidity              13200 non-null  int64  
 2   Wind Speed            13200 non-null  float64
 3   Precipitation (%)     13200 non-null  float64
 4   Cloud Cover           13200 non-null  object 
 5   Atmospheric Pressure  13200 non-null  float64
 6   UV Index              13200 non-null  int64  
 7   Season                13200 non-null  object 
 8   Visibility (km)       13200 non-null  float64
 9   Location              13200 non-null  object 
 10  Weather Type          13200 non-null  object 
dtypes: float64(5), int64(2), object(4)
memory usage: 1.1+ MB


In [None]:
df.isnull().sum()

,0
Temperature,0
Humidity,0
Wind Speed,0
Precipitation (%),0
Cloud Cover,0
Atmospheric Pressure,0
UV Index,0
Season,0
Visibility (km),0
Location,0
Weather Type,0


In [None]:
df.duplicated().sum()

0

In [None]:
cat_col = [col for col in df.columns if df[col].dtype == 'O']
cat_col

['Cloud Cover', 'Season', 'Location', 'Weather Type']

In [None]:
for col in cat_col:
    print(col, df[col].unique())

Cloud Cover ['partly cloudy' 'clear' 'overcast' 'cloudy']
Season ['Winter' 'Spring' 'Summer' 'Autumn']
Location ['inland' 'mountain' 'coastal']
Weather Type ['Rainy' 'Cloudy' 'Sunny' 'Snowy']


In [None]:
from sklearn.preprocessing import LabelEncoder
encoders = {}  # keep each fitted encoder so codes can be mapped back to labels
for col in cat_col:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le

In [None]:
# LabelEncoder assigns codes in alphabetical order:
# 'Cloudy' -> 0, 'Rainy' -> 1, 'Snowy' -> 2, 'Sunny' -> 3
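
The codes in the comment above follow from how `LabelEncoder` works: it sorts the distinct labels and uses each label's position as its code. A small sketch, assuming only the four weather types listed earlier, shows how the mapping could be read off a fitted encoder instead of being written by hand (the variable names here are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# The four weather types as they appear in the dataset.
weather_types = ["Rainy", "Cloudy", "Sunny", "Snowy"]

encoder = LabelEncoder()
encoder.fit(weather_types)

# classes_ holds the labels in sorted order; a label's position is its code.
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# inverse_transform maps integer codes back to the original labels.
decoded = [str(label) for label in encoder.inverse_transform([1, 0, 3])]
print(decoded)
```

The same lookup applies to the other encoded columns, provided the fitted encoder for each column is kept around.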

In [None]:
df.head()

,Temperature,Humidity,Wind Speed,Precipitation (%),Cloud Cover,Atmospheric Pressure,UV Index,Season,Visibility (km),Location,Weather Type
0,14.0,73,9.5,82.0,3,1010.82,2,3,3.5,1,1
1,39.0,96,8.5,71.0,3,1011.43,7,1,10.0,1,0
2,30.0,64,7.0,16.0,0,1018.72,5,1,5.5,2,3
3,38.0,83,1.5,82.0,0,1026.25,7,1,1.0,0,3
4,27.0,74,17.0,66.0,2,990.67,1,3,2.5,2,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Temperature           13200 non-null  float64
 1   Humidity              13200 non-null  int64  
 2   Wind Speed            13200 non-null  float64
 3   Precipitation (%)     13200 non-null  float64
 4   Cloud Cover           13200 non-null  int64  
 5   Atmospheric Pressure  13200 non-null  float64
 6   UV Index              13200 non-null  int64  
 7   Season                13200 non-null  int64  
 8   Visibility (km)       13200 non-null  float64
 9   Location              13200 non-null  int64  
 10  Weather Type          13200 non-null  int64  
dtypes: float64(5), int64(6)
memory usage: 1.1 MB


In [None]:
X = df.drop('Weather Type', axis=1)
y = df['Weather Type']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
model = KNeighborsClassifier()  # defaults: n_neighbors=5, weights='uniform'

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.896969696969697
[[548  58  10  14]
 [ 34 583  13  14]
 [ 15  18 636  12]
 [ 45  28  11 601]]
              precision    recall  f1-score   support

           0       0.85      0.87      0.86       630
           1       0.85      0.91      0.88       644
           2       0.95      0.93      0.94       681
           3       0.94      0.88      0.91       685

    accuracy                           0.90      2640
   macro avg       0.90      0.90      0.90      2640
weighted avg       0.90      0.90      0.90      2640
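
The per-class figures in the report can be recomputed directly from the confusion matrix, which makes the link between the two outputs explicit. The sketch below copies the matrix from the output above and uses scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [
    [548, 58, 10, 14],
    [34, 583, 13, 14],
    [15, 18, 636, 12],
    [45, 28, 11, 601],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: correct predictions divided by the row sum (all true members of the class).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
# Precision: correct predictions divided by the column sum (all predictions of the class).
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)]

print(round(accuracy, 4))
print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
```

Reading the off-diagonal cells, the largest single confusion is 58 samples of class 0 (Cloudy) predicted as class 1 (Rainy).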



In [None]:
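param_grid = {
    'n_neighbors': [3, 5, 6, 7, 8, 9, 10, 12],
    'weights': ['uniform', 'distance'],
    # 'algorithm' and 'leaf_size' control how neighbors are searched (speed and
    # memory); they should not change which neighbors are found, apart from ties.
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [20, 30, 40, 50]
}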

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid = GridSearchCV(model, param_grid, cv=5)
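
Before running an exhaustive search it is worth estimating its cost. A rough sketch, using the grid and `cv=5` from the cells above (the names `n_candidates` and `n_fits` are illustrative):

```python
from math import prod

# Same grid as in the notebook.
param_grid = {
    'n_neighbors': [3, 5, 6, 7, 8, 9, 10, 12],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [20, 30, 40, 50],
}
cv_folds = 5

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` also performs one final fit of the best candidate on the full training set.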

In [None]:
grid.fit(X_train, y_train)

In [None]:
grid.best_params_

{'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 7, 'weights': 'uniform'}

In [None]:
grid.best_score_

0.8880681818181818

In [None]:
y_pred_grid = grid.predict(X_test)

In [None]:
print(accuracy_score(y_test, y_pred_grid))

0.8965909090909091
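
The tuned model's test accuracy is marginally lower than that of the default model. Converting both accuracies back to counts, as sketched below with the test-set size taken from the classification report, shows how small the gap is (variable names are illustrative):

```python
n_test = 2640  # test-set size from the classification report

acc_default = 0.896969696969697   # KNeighborsClassifier() with default settings
acc_tuned = 0.8965909090909091    # best estimator found by GridSearchCV

# Accuracy is correct / n_test, so multiplying back recovers the counts.
correct_default = round(acc_default * n_test)
correct_tuned = round(acc_tuned * n_test)

print(correct_default, correct_tuned, correct_default - correct_tuned)
```

The two models differ by a single test sample, so on this split the grid search neither helped nor hurt in any meaningful way.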
