# Predicting Humidity using Decision Tree Algorithm


#### In this notebook, we use scikit-learn to perform decision-tree-based classification of weather data.

In [1]:
#Importing necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix

In [2]:
# Read the CSV file into a pandas DataFrame

data=pd.read_csv("daily_weather.csv")

In [3]:
data.head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


In [4]:
data.shape

(1095, 11)

#### Daily Weather Data Description:

The file daily_weather.csv is a comma-separated file that contains weather data. This data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

In [5]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

In [6]:
for i in data.columns:
    print(i)

number
air_pressure_9am
air_temp_9am
avg_wind_direction_9am
avg_wind_speed_9am
max_wind_direction_9am
max_wind_speed_9am
rain_accumulation_9am
rain_duration_9am
relative_humidity_9am
relative_humidity_3pm


Each row in daily_weather.csv captures weather data for a separate day.

Sensor measurements from the weather station were captured at one-minute intervals. These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements in the morning, with one measurement, namely relative humidity, in the afternoon. The idea is to use the morning weather values to predict whether the day will be low-humidity or not, based on the afternoon measurement of relative humidity.

Each row, or sample, consists of the following variables:

- number: unique number for each row
- air_pressure_9am: air pressure averaged over a period from 8:55am to 9:04am (Unit: hectopascals)
- air_temp_9am: air temperature averaged over a period from 8:55am to 9:04am (Unit: degrees Fahrenheit)
- avg_wind_direction_9am: wind direction averaged over a period from 8:55am to 9:04am (Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise)
- avg_wind_speed_9am: wind speed averaged over a period from 8:55am to 9:04am (Unit: miles per hour)
- max_wind_direction_9am: wind gust direction averaged over a period from 8:55am to 9:10am (Unit: degrees, with 0 being North and increasing clockwise)
- max_wind_speed_9am: wind gust speed averaged over a period from 8:55am to 9:04am (Unit: miles per hour)
- rain_accumulation_9am: amount of rain accumulated in the 24 hours prior to 9am (Unit: millimeters)
- rain_duration_9am: amount of time rain was recorded in the 24 hours prior to 9am (Unit: seconds)
- relative_humidity_9am: relative humidity averaged over a period from 8:55am to 9:04am (Unit: percent)
- relative_humidity_3pm: relative humidity averaged over a period from 2:55pm to 3:04pm (Unit: percent)

## Data Cleaning and Preprocessing

In [7]:
# Checking for null values

data.isnull().sum()

number                    0
air_pressure_9am          3
air_temp_9am              5
avg_wind_direction_9am    4
avg_wind_speed_9am        3
max_wind_direction_9am    3
max_wind_speed_9am        4
rain_accumulation_9am     6
rain_duration_9am         3
relative_humidity_9am     0
relative_humidity_3pm     0
dtype: int64

In [8]:
# Drop every row that contains at least one null value
data.dropna(inplace=True)

# Reset the index and discard the old one
data.reset_index(drop=True,inplace=True)

In [9]:
data.shape

(1064, 11)
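
The row count fell from 1095 to 1064, so 31 rows were removed. `dropna()` removes a row as soon as any one of its columns is missing, so the number of rows dropped can be checked against the number of rows that contain a null. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from daily_weather.csv):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; two of the four rows contain a missing value
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "air_pressure_9am": [918.1, 917.3, np.nan, 920.5],
})

# Count rows that have at least one null in any column
rows_with_nulls = int(toy.isnull().any(axis=1).sum())

# Drop those rows and renumber the index from 0
cleaned = toy.dropna().reset_index(drop=True)
rows_dropped = len(toy) - len(cleaned)

print(rows_with_nulls, rows_dropped, list(cleaned.index))
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison; either style gives the same cleaned rows.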

In [10]:
data.isnull().sum()
# There are no null values left

number                    0
air_pressure_9am          0
air_temp_9am              0
avg_wind_direction_9am    0
avg_wind_speed_9am        0
max_wind_direction_9am    0
max_wind_speed_9am        0
rain_accumulation_9am     0
rain_duration_9am         0
relative_humidity_9am     0
relative_humidity_3pm     0
dtype: int64

In [11]:
data.tail()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
1059,1090,918.9,63.104,192.9,3.869906,207.3,5.21207,0.0,0.0,26.02,38.18
1060,1091,918.71,49.568,241.6,1.811921,227.4,2.371156,0.0,0.0,90.35,73.34
1061,1092,916.6,71.096,189.3,3.064608,200.8,3.892276,0.0,0.0,45.59,52.31
1062,1093,912.6,58.406,172.7,3.825167,189.1,4.764682,0.0,0.0,64.84,58.28
1063,1094,921.53,77.702,97.1,3.265932,125.9,4.451511,0.0,0.0,14.56,15.1


In [12]:
# The "number" column is not needed because it only duplicates the row index
data.drop(['number'],axis=1,inplace=True)

In [13]:
data.shape

(1064, 10)

## Convert to a Classification task

#### Binarize relative_humidity_3pm to 0 or 1

We add a new column, 'high_humidity_label', with values 0 or 1. This turns the problem into binary classification: days whose relative_humidity_3pm is above the threshold (24.99 in this case) are labelled high (1), and days at or below it are labelled low (0).

In [14]:
clean_data=data.copy() # work on a copy so the cleaned source DataFrame stays unchanged

clean_data["high_humidity_label"]=(clean_data['relative_humidity_3pm']> 24.99) *1

In [15]:
print(clean_data['high_humidity_label'])

0       1
1       0
2       0
3       0
4       1
       ..
1059    1
1060    1
1061    1
1062    1
1063    0
Name: high_humidity_label, Length: 1064, dtype: int32
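
The thresholding step can be tried on a handful of values to see how the boundary case behaves and how many samples fall into each class. This is only a sketch: the humidity values below are illustrative, and `humidity_3pm`, `threshold`, `label`, and `counts` are names introduced here, not part of the notebook.

```python
import pandas as pd

# Illustrative 3pm humidity readings, including one exactly at the threshold
humidity_3pm = pd.Series([36.16, 19.43, 24.99, 25.0, 76.74])
threshold = 24.99

# Strictly greater than the threshold -> 1 (high), otherwise 0 (low)
label = (humidity_3pm > threshold).astype(int)

# Number of samples per class, ordered by label value
counts = label.value_counts().sort_index()

print(label.tolist())
print(counts.to_dict())
```

Running `value_counts()` on the real label column is a quick way to see whether the two classes are balanced before training.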


#### Target is now stored as y. Here, target is the label - 'high_humidity_label'

In [16]:
y = clean_data[['high_humidity_label']]

#### Use 9am Sensor signals to predict Humidity at 3PM

In [17]:
# Select all columns whose name contains the string '9am'
feature=list(clean_data.columns[clean_data.columns.str.contains('9am')])

# Exclude relative humidity at 9am from the features
feature.remove('relative_humidity_9am')

feature

['air_pressure_9am',
 'air_temp_9am',
 'avg_wind_direction_9am',
 'avg_wind_speed_9am',
 'max_wind_direction_9am',
 'max_wind_speed_9am',
 'rain_accumulation_9am',
 'rain_duration_9am']

In [18]:
# Use the selected feature columns as X
X = clean_data[feature]
X

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am
0,918.060000,74.822000,271.100000,2.080354,295.400000,2.863283,0.0,0.0
1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0
2,923.040000,60.638000,51.000000,17.067852,63.700000,22.100967,0.0,20.0
3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0
4,921.160000,44.294000,277.800000,1.856660,136.500000,2.863283,8.9,14730.0
...,...,...,...,...,...,...,...,...
1059,918.900000,63.104000,192.900000,3.869906,207.300000,5.212070,0.0,0.0
1060,918.710000,49.568000,241.600000,1.811921,227.400000,2.371156,0.0,0.0
1061,916.600000,71.096000,189.300000,3.064608,200.800000,3.892276,0.0,0.0
1062,912.600000,58.406000,172.700000,3.825167,189.100000,4.764682,0.0,0.0


## Perform the Train and Test Split

In [19]:
# splitting data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=324)
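
With 1064 rows and `test_size=0.33`, the split should leave 712 rows for training and 352 for testing. A small sketch that checks this on placeholder arrays of the same length (`X_demo` and `y_demo` are stand-ins introduced here, not the weather data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the cleaned data
n_rows = 1064
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324
)

print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split reproducible, so the same rows end up in the test set on every run.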

In [20]:
# Fit the model on the training set 
humidity_classifier=DecisionTreeClassifier(max_leaf_nodes=10,random_state=0)
humidity_classifier.fit(X_train,y_train)

DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)

### Testing the model on the testing set & checking accuracy

In [21]:
# Predicting values for Test data set
y_predicted=humidity_classifier.predict(X_test)

In [22]:
# Checking test accuracy
accuracy_score(y_test,y_predicted)

0.8153409090909091

In [23]:
mean_squared_error(y_test,y_predicted)

0.1846590909090909

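We have predicted whether humidity at 3PM is high or low from the 9AM measurements with a test accuracy of about 81.5%. The mean squared error of about 0.185 adds no new information: for 0/1 labels it equals the misclassification rate, that is, 1 minus the accuracy. The confusion matrix below shows that the test set is almost evenly split between the two classes (175 low-humidity and 177 high-humidity days), so always guessing the majority class would score only about 50%. The tree does substantially better than that baseline.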
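
A quick check on made-up labels shows why the accuracy and the mean squared error reported above always sum to 1 when both the true and the predicted labels are 0 or 1: each squared difference is 1 for a wrong prediction and 0 for a correct one. The arrays below are invented for illustration.

```python
import numpy as np

# Made-up true and predicted binary labels (2 of 8 predictions are wrong)
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Fraction of correct predictions
accuracy = np.mean(y_true == y_pred)

# Mean squared error; for 0/1 labels this is the fraction of wrong predictions
mse = np.mean((y_true - y_pred) ** 2)

print(accuracy, mse, accuracy + mse)
```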

In [24]:
confusion_matrix(y_test,y_predicted)

array([[147,  28],
       [ 37, 140]], dtype=int64)
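
In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, in sorted label order, so the matrix above reads as 147 true negatives, 28 false positives, 37 false negatives, and 140 true positives. A short sketch that derives accuracy, precision, and recall for the high-humidity class from those four counts (the variable names are introduced here for illustration):

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = np.array([[147, 28],
               [37, 140]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test days classified correctly
precision = tp / (tp + fp)        # share of predicted high-humidity days that were high
recall = tp / (tp + fn)           # share of actual high-humidity days that were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The accuracy computed this way matches the `accuracy_score` value reported earlier, while precision and recall show that the tree misses high-humidity days somewhat more often than it raises false alarms.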