# Introduction 

The data were collected from Environment and Climate Change Canada. They come from the weather station at Pearson International Airport, which updates its weather measurements every 5 minutes. The individual readings are then aggregated by day to produce daily weather measurements.

Source: https://toronto.weatherstats.ca/download.html

Information about the dataset can be found on the About page, under the Links section.

Variables:
- date 
- max_temperature
- avg_hourly_temperature: the average of the hourly temperatures (that is, the sum of the average temperature for each hour, divided by 24)
- avg_temperature: the average temperature for the day, calculated using the 5-minute weather updates
- min_temperature
- max_humidex
- min_windchill
- max_relative_humidity
- avg_hourly_relative_humidity
- avg_relative_humidity
- min_relative_humidity
- max_dew_point
- avg_hourly_dew_point
- avg_dew_point
- min_dew_point
- max_wind_speed
- avg_hourly_wind_speed
- avg_wind_speed
- min_wind_speed
- max_wind_gust
- wind_gust_dir_10s
- max_pressure_sea
- avg_hourly_pressure_sea
- avg_pressure_sea
- min_pressure_sea
- max_pressure_station
- avg_hourly_pressure_station
- avg_pressure_station
- min_pressure_station
- max_visibility
- avg_hourly_visibility
- avg_visibility
- min_visibility
- max_health_index
- avg_hourly_health_index
- avg_health_index
- min_health_index
- heatdegdays
- cooldegdays
- growdegdays_5
- growdegdays_7
- growdegdays_10
- precipitation
- rain
- snow
- snow_on_ground
- sunrise
- sunset
- daylight
- sunrise_f
- sunset_f
- min_uv_forecast
- max_uv_forecast
- min_high_temperature_forecast
- max_high_temperature_forecast
- min_low_temperature_forecast
- max_low_temperature_forecast
- solar_radiation
- max_cloud_cover_4
- avg_hourly_cloud_cover_4
- avg_cloud_cover_4
- min_cloud_cover_4
- max_cloud_cover_8
- avg_hourly_cloud_cover_8
- avg_cloud_cover_8
- min_cloud_cover_8
- max_cloud_cover_10
- avg_hourly_cloud_cover_10
- avg_cloud_cover_10
- min_cloud_cover_10

# Load the data 

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
dataset = pd.read_csv("data/weatherstats_toronto_daily.csv")
dataset.head(10)

Unnamed: 0,date,max_temperature,avg_hourly_temperature,avg_temperature,min_temperature,max_humidex,min_windchill,max_relative_humidity,avg_hourly_relative_humidity,avg_relative_humidity,...,avg_cloud_cover_4,min_cloud_cover_4,max_cloud_cover_8,avg_hourly_cloud_cover_8,avg_cloud_cover_8,min_cloud_cover_8,max_cloud_cover_10,avg_hourly_cloud_cover_10,avg_cloud_cover_10,min_cloud_cover_10
0,2021-06-29,32.5,24.41,26.8,21.1,41.0,,96,82.3,77.0,...,,,8.0,6.0,5.5,3.0,,,,
1,2021-06-28,32.2,27.49,26.95,21.7,39.0,,86,63.6,66.0,...,,,6.0,3.5,3.5,1.0,,,,
2,2021-06-27,30.7,27.4,27.25,23.8,39.0,,73,63.4,63.5,...,,,8.0,4.9,4.5,1.0,,,,
3,2021-06-26,27.5,23.97,23.25,19.0,36.0,,100,82.5,83.0,...,,,8.0,7.8,7.5,7.0,,,,
4,2021-06-25,25.8,21.24,22.3,18.8,30.0,,100,74.0,75.0,...,,,8.0,7.7,6.5,5.0,,,,
5,2021-06-24,28.1,22.42,22.0,15.9,30.0,,70,50.5,51.0,...,,,8.0,5.8,5.0,2.0,,,,
6,2021-06-23,23.7,17.55,16.25,8.8,,,79,53.1,57.0,...,,,6.0,3.4,3.5,1.0,,,,
7,2021-06-22,19.5,14.46,14.75,10.0,,,94,54.8,63.0,...,,,8.0,5.8,5.0,2.0,,,,
8,2021-06-21,29.7,21.06,20.65,11.6,34.0,,100,74.9,72.5,...,,,8.0,7.0,5.5,3.0,,,,
9,2021-06-20,27.1,21.15,20.55,14.0,32.0,,91,68.2,71.0,...,,,8.0,3.8,4.0,0.0,,,,


The table above shows the first 10 rows of the dataset. The dataset is in reverse chronological order and has 70 columns and 10,000 rows. Every column except `date` is numerical, and the dataset has quite a few missing values.
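
One detail worth checking in these rows is how `avg_temperature` relates to the daily extremes. The sketch below retypes the three temperature columns from the ten rows printed above (the `sample` frame is a hand-copied stand-in, not the loaded `dataset`) and tests whether `avg_temperature` equals the midpoint of `max_temperature` and `min_temperature`.

```python
import numpy as np
import pandas as pd

# The ten sample rows printed above (max, avg and min temperature only),
# copied by hand for illustration.
sample = pd.DataFrame({
    "max_temperature": [32.5, 32.2, 30.7, 27.5, 25.8, 28.1, 23.7, 19.5, 29.7, 27.1],
    "avg_temperature": [26.8, 26.95, 27.25, 23.25, 22.3, 22.0, 16.25, 14.75, 20.65, 20.55],
    "min_temperature": [21.1, 21.7, 23.8, 19.0, 18.8, 15.9, 8.8, 10.0, 11.6, 14.0],
})

# Midpoint of the daily extremes.
midpoint = (sample["max_temperature"] + sample["min_temperature"]) / 2

# Compare with a tolerance to avoid floating-point noise.
matches_midpoint = np.isclose(midpoint, sample["avg_temperature"], atol=1e-6)
print(matches_midpoint.all())
```

For these ten rows the two quantities coincide, which suggests the column may be a max/min midpoint rather than a mean over all readings. Ten rows are not proof, so it is worth running the same comparison on the full dataset and checking the source's documentation.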

In [4]:
dataset.shape

(10000, 70)

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 70 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   date                           10000 non-null  object 
 1   max_temperature                10000 non-null  float64
 2   avg_hourly_temperature         10000 non-null  float64
 3   avg_temperature                10000 non-null  float64
 4   min_temperature                10000 non-null  float64
 5   max_humidex                    2783 non-null   float64
 6   min_windchill                  3354 non-null   float64
 7   max_relative_humidity          10000 non-null  int64  
 8   avg_hourly_relative_humidity   10000 non-null  float64
 9   avg_relative_humidity          10000 non-null  float64
 10  min_relative_humidity          10000 non-null  int64  
 11  max_dew_point                  10000 non-null  float64
 12  avg_hourly_dew_point           10000 non-null  
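
The `info()` output lists `date` with dtype `object`, meaning the dates are stored as plain strings. A minimal sketch of converting them to real datetimes and sorting oldest-first is shown below. The `demo` frame is a small stand-in for the loaded data, and the `%Y-%m-%d` format is assumed from the values shown in the table above.

```python
import pandas as pd

# Stand-in for the loaded data: three dates in the file's reverse chronological order.
demo = pd.DataFrame({
    "date": ["2021-06-29", "2021-06-28", "2021-06-27"],
    "max_temperature": [32.5, 32.2, 30.7],
})

# Convert the string column to datetime64 (format assumed from the sample rows).
demo["date"] = pd.to_datetime(demo["date"], format="%Y-%m-%d")

# Sort chronologically and rebuild a clean 0..n-1 index.
demo_sorted = demo.sort_values("date").reset_index(drop=True)

print(demo_sorted.dtypes["date"])
print(demo_sorted["date"].is_monotonic_increasing)
```

With a datetime column in chronological order, later steps such as resampling or plotting over time become more convenient.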

# Validate the Integrity of the data 

### Missing Values 


In [19]:
def find_missing_vals(dataset):
    '''Display columns with missing data and indicate how many values are missing for each column.'''
    # Number of missing values per column, in column order
    missing_vals = dataset.isnull().sum()
    cols_list = dataset.columns
    col_count = 0

    for col, val in zip(cols_list, missing_vals):
        if val > 0:
            print('{}: {} values missing'.format(col, val))
            col_count += 1
    print('\n{} columns have missing values'.format(col_count))
    # Return (column, missing count) pairs for further use
    return zip(cols_list, missing_vals)

In [20]:
find_missing_vals(dataset)

max_humidex: 7217 values missing
min_windchill: 6646 values missing
max_wind_gust: 3061 values missing
wind_gust_dir_10s: 3062 values missing
max_health_index: 6277 values missing
avg_hourly_health_index: 6277 values missing
avg_health_index: 6277 values missing
min_health_index: 6277 values missing
precipitation: 21 values missing
rain: 30 values missing
snow: 22 values missing
snow_on_ground: 2169 values missing
sunrise: 7263 values missing
sunset: 7263 values missing
daylight: 7263 values missing
sunrise_f: 7263 values missing
sunset_f: 7263 values missing
min_uv_forecast: 7895 values missing
max_uv_forecast: 7895 values missing
min_high_temperature_forecast: 7263 values missing
max_high_temperature_forecast: 7263 values missing
min_low_temperature_forecast: 7263 values missing
max_low_temperature_forecast: 7263 values missing
solar_radiation: 10000 values missing
max_cloud_cover_4: 10000 values missing
avg_hourly_cloud_cover_4: 10000 values missing
avg_cloud_cover_4: 10000 values m

<zip at 0x1fe023e8d40>

We see that 36 of the dataset's 70 columns contain missing values, while the other 34 are complete. Recall that there are 10,000 rows in the dataset, meaning that 5 columns have no entries at all and 23 columns contain more than 6,000 missing values.

The `precipitation`, `rain` and `snow` columns have the fewest missing values among the columns that contain any. Each of these columns is missing at most 30 values.
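
Statements like these are easier to verify when the counts are available as data rather than printed text. The sketch below is one possible alternative, not the helper used in this notebook: a hypothetical `missing_summary` function that returns a pandas Series of counts, sorted from most to least missing. Returning a Series also avoids the `<zip at ...>` line, which appears because the notebook echoes the zip object returned by the last expression in the cell. The `demo` frame is made-up data for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing-value counts for columns with at least one missing value."""
    counts = df.isnull().sum()
    # Keep only affected columns, largest count first.
    return counts[counts > 0].sort_values(ascending=False)

# Made-up data: two columns with gaps, one complete column.
demo = pd.DataFrame({
    "rain": [0.0, np.nan, 1.2, 0.4],
    "snow": [np.nan, np.nan, 0.0, 0.0],
    "max_temperature": [1.0, 2.0, 3.0, 4.0],
})

summary = missing_summary(demo)
print(summary)
print("{} columns have missing values".format(len(summary)))
```

On the real data, expressions such as `(summary > 6000).sum()` or `summary.tail(3)` would then give the counts quoted above directly.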

In [17]:
# Drop columns that have more than 60% of their values missing
dropped_cols = ['max_humidex', 'min_windchill', 'max_health_index', 'avg_hourly_health_index', 'avg_health_index',
                'min_health_index', 'sunrise', 'sunset', 'daylight', 'sunrise_f', 'sunset_f', 'min_uv_forecast',
                'max_uv_forecast', 'min_high_temperature_forecast', 'max_high_temperature_forecast',
                'min_low_temperature_forecast', 'max_low_temperature_forecast', 'solar_radiation', 'max_cloud_cover_4',
                'avg_hourly_cloud_cover_4', 'avg_cloud_cover_4', 'min_cloud_cover_4', 'max_cloud_cover_8',
                'avg_hourly_cloud_cover_8', 'avg_cloud_cover_8', 'min_cloud_cover_8', 'max_cloud_cover_10',
                'avg_hourly_cloud_cover_10', 'avg_cloud_cover_10', 'min_cloud_cover_10']
data_reduced = dataset.drop(columns=dropped_cols)
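
A hard-coded list like the one above is easy to get out of sync with the data. As a sketch of an alternative, the columns to drop could be derived from the missing fraction of each column. The 0.6 threshold is taken from the comment in the cell above; `MISSING_THRESHOLD` and the `demo` frame are illustrative names and made-up values, and it has not been checked here that this rule reproduces the hand-written list exactly.

```python
import numpy as np
import pandas as pd

MISSING_THRESHOLD = 0.6  # assumed cut-off, taken from the comment in the cell above

# Made-up data with different shares of missing values.
demo = pd.DataFrame({
    "max_temperature": [1.0, 2.0, 3.0, 4.0, 5.0],           # 0% missing, kept
    "snow": [0.0, np.nan, 0.0, 0.0, 0.0],                   # 20% missing, kept
    "solar_radiation": [np.nan] * 5,                        # 100% missing, dropped
    "max_humidex": [30.0, np.nan, np.nan, np.nan, np.nan],  # 80% missing, dropped
})

# isnull().mean() gives the fraction of missing values per column.
missing_fraction = demo.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()

demo_reduced = demo.drop(columns=cols_to_drop)
print(cols_to_drop)
print(demo_reduced.shape)
```

Comparing `set(cols_to_drop)` against `set(dropped_cols)` on the real dataset would show whether the two approaches agree.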

In [24]:
data_reduced.shape 

(10000, 40)

In [21]:
find_missing_vals(data_reduced)

max_wind_gust: 3061 values missing
wind_gust_dir_10s: 3062 values missing
precipitation: 21 values missing
rain: 30 values missing
snow: 22 values missing
snow_on_ground: 2169 values missing

6 columns have missing values


<zip at 0x1fe01327100>

Removing the columns that have more than 60% of their data missing leaves the dataset with 40 columns. Luckily, the remaining columns with missing values can be estimated from other data in the dataset. For example, `precipitation` is the sum of rain and snow, while `snow_on_ground` is a function of snow and temperature. This means we can impute the missing values to make the dataset clean.
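
As a minimal sketch of that idea, the snippet below fills a gap in any one of `precipitation`, `rain` and `snow` from the other two. It assumes the relation stated above (`precipitation` = `rain` + `snow`) holds exactly and that the two other values are present in the same row; the `demo` frame is made-up data. Rows where more than one of the three values is missing cannot be recovered this way and stay empty.

```python
import numpy as np
import pandas as pd

# Made-up rows: each of the first three lacks a different value, the last lacks all.
demo = pd.DataFrame({
    "precipitation": [np.nan, 5.0, 2.0, np.nan],
    "rain":          [1.0, np.nan, 2.0, np.nan],
    "snow":          [0.5, 1.0, np.nan, np.nan],
})

filled = demo.copy()

# Each fill uses only the original values, so nothing is derived from an imputed number.
filled["precipitation"] = filled["precipitation"].fillna(demo["rain"] + demo["snow"])
filled["rain"] = filled["rain"].fillna(demo["precipitation"] - demo["snow"])
filled["snow"] = filled["snow"].fillna(demo["precipitation"] - demo["rain"])

print(filled)
print(filled.isnull().sum())
```

Before applying this to the real data, it would be sensible to confirm the relation on rows where all three values are present, since any systematic difference would carry over into the imputed values.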

### Range of Values 

In [26]:
# suppress scientific notation
pd.options.display.float_format = '{:.2f}'.format
data_reduced.describe()

Unnamed: 0,max_temperature,avg_hourly_temperature,avg_temperature,min_temperature,max_relative_humidity,avg_hourly_relative_humidity,avg_relative_humidity,min_relative_humidity,max_dew_point,avg_hourly_dew_point,...,min_visibility,heatdegdays,cooldegdays,growdegdays_5,growdegdays_7,growdegdays_10,precipitation,rain,snow,snow_on_ground
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,9979.0,9970.0,9978.0,7831.0
mean,13.51,8.99,8.85,4.2,85.54,69.61,69.27,53.01,6.43,3.26,...,14526.08,10.16,1.01,6.67,5.5,3.96,2.17,1.87,0.31,1.82
std,11.36,10.52,10.59,10.08,9.86,12.1,11.01,14.53,9.71,10.0,...,8908.72,9.32,2.17,7.16,6.42,5.25,5.46,5.28,1.46,5.06
min,-19.1,-21.95,-22.3,-26.3,38.0,27.7,29.5,13.0,-24.8,-28.1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.9,0.75,0.6,-2.7,79.0,61.8,62.0,42.0,-1.0,-4.3,...,4800.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,14.0,9.04,8.94,4.1,87.0,70.0,69.5,53.0,6.7,3.3,...,16100.0,9.1,0.0,3.9,1.9,0.0,0.0,0.0,0.0,0.0
75%,23.5,18.38,18.2,12.8,93.0,78.0,77.0,63.0,14.8,11.8,...,24100.0,17.4,0.2,13.2,11.2,8.2,1.4,0.6,0.0,0.0
max,37.9,31.7,31.95,26.3,100.0,99.9,99.5,99.0,26.6,24.4,...,25000.0,40.3,13.9,26.9,24.9,21.9,126.0,126.0,30.4,67.0


Note that the `describe()` output above is truncated (see the `...` column), so we cannot see the statistics for every column. We must therefore split the data into chunks to view the distribution of each column.
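
One way to do this, sketched under the assumption that a fixed number of columns per table is acceptable, is to walk over the numeric columns in slices and call `describe()` on each slice. The helper name `describe_in_chunks`, the chunk size of 10, and the random `demo` frame are all illustrative choices.

```python
import numpy as np
import pandas as pd

def describe_in_chunks(df, chunk_size=10):
    """Yield describe() tables covering at most chunk_size numeric columns each."""
    numeric_cols = df.select_dtypes(include="number").columns
    for start in range(0, len(numeric_cols), chunk_size):
        yield df[numeric_cols[start:start + chunk_size]].describe()

# Made-up data: 25 numeric columns plus a string date column.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 25)),
                    columns=["col_{}".format(i) for i in range(25)])
demo["date"] = "2021-06-29"

chunks = list(describe_in_chunks(demo, chunk_size=10))
for table in chunks:
    print(table)
```

In the notebook, the same loop over `data_reduced` would print several narrower tables, each small enough to display without truncation.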