<img src="./images/dsi_13_sg_shaun_project_4_banner.jpg" width=1000>

# Project 4: West Nile Virus Prediction (Data Cleaning - Weather)
**<font color = blue> Shaun Chua 
<br> (DSI-13) </font>**

---

# Table of Contents: <a id="top"></a>
[**1. Importing Libraries**](#1)
<br> [**2. Importing weather.csv**](#2)
<br> [**3. Cleaning weather_df**](#3)
<br> &emsp; [3.01 Cleaning: `Station`](#3.01)
<br> &emsp; [3.02 Cleaning: `Date`](#3.02)
<br> &emsp; [3.03 Cleaning: `Tmax`](#3.03)
<br> &emsp; [3.04 Cleaning: `Tmin`](#3.04)
<br> &emsp; [3.05 Cleaning: `Tavg`](#3.05)
<br> &emsp; [3.06 Cleaning: `Depart`](#3.06)
<br> &emsp; [3.07 Cleaning: `DewPoint`](#3.07)
<br> &emsp; [3.08 Cleaning: `WetBulb`](#3.08)
<br> &emsp; [3.09 Cleaning: `Heat`](#3.09)
<br> &emsp; [3.10 Cleaning: `Cool`](#3.10)
<br> &emsp; [3.11 Cleaning: `Sunrise`](#3.11)
<br> &emsp; [3.12 Cleaning: `Sunset`](#3.12)
<br> &emsp; [3.13 Cleaning: `CodeSum`](#3.13)
<br> &emsp; [3.14 Cleaning: `Depth`](#3.14)
<br> &emsp; [3.15 Cleaning: `Water1`](#3.15)
<br> &emsp; [3.16 Cleaning: `SnowFall`](#3.16)
<br> &emsp; [3.17 Cleaning: `PrecipTotal`](#3.17)
<br> &emsp; [3.18 Cleaning: `StnPressure`](#3.18)
<br> &emsp; [3.19 Cleaning: `SeaLevel`](#3.19)
<br> &emsp; [3.20 Cleaning: `ResultSpeed`](#3.20)
<br> &emsp; [3.21 Cleaning: `ResultDir`](#3.21)
<br> &emsp; [3.22 Cleaning: `AvgSpeed`](#3.22)
<br> &emsp; [3.23 Post-Cleaning: weather_df](#3.23)
<br> [**4. Feature Engineering**](#4)
<br> &emsp; [4.1 Relative Humidity](#4.1)
<br> &emsp; [4.2 Station Location (Longitude and Latitude)](#4.2)
<br> [**5. Exporting Cleaned weather_df**](#5)

# 1. Importing Libraries <a id="1"></a>

In [1]:
# Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import time
%matplotlib inline

In [2]:
# Starting timer for notebook 

t0 = time.time()

# 2. Importing weather.csv <a id="2"></a>

In [3]:
# Reading weather.csv as weather_df

weather_df = pd.read_csv("./datasets/weather.csv")

In [4]:
# Creating function to view dataframes

def preview(dataframe):
    dataframe_name = [x for x in globals() if globals()[x] is dataframe][0]
    print(f"{dataframe_name} has shape: {dataframe.shape}.")
    
    print("")
    print(f"{dataframe_name} has the following columns: {dataframe.columns}")
    
    print("")
    print(f"These are the top 5 rows of {dataframe_name}:")
    display(dataframe.head())

    print("")
    print(f"These are the bottom 5 rows of {dataframe_name}:")
    display(dataframe.tail())
    
    print("")
    print(f"An overview of {dataframe_name}'s feature types, counts, and nulls:")
    print("")
    display(dataframe.info())
    
    print("")
    nulls = dataframe.isnull().sum()
    total_nulls = dataframe.isnull().sum().sum()
    if total_nulls > 0:
        print(f"{dataframe_name} has a total of {total_nulls} nulls.")
        print("")
        print(f"The columns in {dataframe_name} with nulls are: {list(nulls[nulls>0].index)}") 
      
        print("")
        print(f"The variables with nulls in {dataframe_name} are:")
        display(nulls)

        print("")
        print(f"The top 5 variables in {dataframe_name} with the highest percentage of missing values are:")
        display(dataframe.isnull().mean().sort_values(ascending=False)[:5])

    else:
        print(f"{dataframe_name} does not contain nulls.")

##### Using `preview` function to view weather_df

In [5]:
preview(weather_df)

weather_df has shape: (2944, 22).

weather_df has the following columns: Index(['Station', 'Date', 'Tmax', 'Tmin', 'Tavg', 'Depart', 'DewPoint',
       'WetBulb', 'Heat', 'Cool', 'Sunrise', 'Sunset', 'CodeSum', 'Depth',
       'Water1', 'SnowFall', 'PrecipTotal', 'StnPressure', 'SeaLevel',
       'ResultSpeed', 'ResultDir', 'AvgSpeed'],
      dtype='object')

These are the top 5 rows of weather_df:


Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,...,,0,M,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,...,,M,M,M,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,...,BR,0,M,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,...,BR HZ,M,M,M,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,...,,0,M,0.0,0.0,29.39,30.12,11.7,7,11.9



These are the bottom 5 rows of weather_df:


Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
2939,2,2014-10-29,49,40,45,M,34,42,20,0,...,,M,M,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,51,32,42,-4,34,40,23,0,...,,0,M,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,53,37,45,M,35,42,20,0,...,RA,M,M,M,T,29.41,30.1,5.9,23,6.5
2942,1,2014-10-31,47,33,40,-6,25,33,25,0,...,RA SN,0,M,0.1,0.03,29.49,30.2,22.6,34,22.9
2943,2,2014-10-31,49,34,42,M,29,36,23,0,...,RA SN BR,M,M,M,0.04,29.54,30.2,21.7,34,22.6



An overview of weather_df's feature types, counts, and nulls:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Station      2944 non-null   int64  
 1   Date         2944 non-null   object 
 2   Tmax         2944 non-null   int64  
 3   Tmin         2944 non-null   int64  
 4   Tavg         2944 non-null   object 
 5   Depart       2944 non-null   object 
 6   DewPoint     2944 non-null   int64  
 7   WetBulb      2944 non-null   object 
 8   Heat         2944 non-null   object 
 9   Cool         2944 non-null   object 
 10  Sunrise      2944 non-null   object 
 11  Sunset       2944 non-null   object 
 12  CodeSum      2944 non-null   object 
 13  Depth        2944 non-null   object 
 14  Water1       2944 non-null   object 
 15  SnowFall     2944 non-null   object 
 16  PrecipTotal  2944 non-null   object 
 17  StnPressure  2944 non-null

None


weather_df does not contain nulls.


##### <font color = blue> Shaun: </font>

No Nulls, but according to the PDF file `noaa_weather_qclcd_documentation.pdf`:
<br> 1) `M` represents NaNs
<br> 2) `T` represents "trace". At this juncture, I assume it means a small amount, but not small enough to be neglected. 

##### Finding the number of `M` values, which represent `NaN`, in each column

In [6]:
for i, column in enumerate(weather_df):
    print(weather_df.columns[i], weather_df[weather_df[column]=='M'][column].count())

Station 0
Date 0
Tmax 0
Tmin 0
Tavg 11
Depart 1472
DewPoint 0
WetBulb 4
Heat 11
Cool 11
Sunrise 0
Sunset 0
CodeSum 0
Depth 1472
Water1 2944
SnowFall 1472
PrecipTotal 2
StnPressure 4
SeaLevel 9
ResultSpeed 0
ResultDir 0
AvgSpeed 3




##### Finding the number of `T` values, which represent "trace", in each column

In [7]:
for i, column in enumerate(weather_df):
    print(weather_df.columns[i], weather_df[weather_df[column]=='T'][column].count())

Station 0
Date 0
Tmax 0
Tmin 0
Tavg 0
Depart 0
DewPoint 0
WetBulb 0
Heat 0
Cool 0
Sunrise 0
Sunset 0
CodeSum 0
Depth 0
Water1 0
SnowFall 0
PrecipTotal 0
StnPressure 0
SeaLevel 0
ResultSpeed 0
ResultDir 0
AvgSpeed 0


##### <font color = blue> Shaun: </font>

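The exact-match check finds no `T`. However, the bottom-5 preview above shows a `T` in `PrecipTotal` (2014-10-30, Station 2), so trace values are probably stored with surrounding whitespace and slip past `== 'T'`. This needs checking when cleaning `PrecipTotal`. I'll also have to deal with the `M` in each column.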

I will now examine each column.
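The exact-match loops above only count cells that equal `'M'` or `'T'` character for character. The sketch below shows one way to count sentinels after stripping whitespace. The toy frame and the helper name `count_sentinel` are hypothetical, and the padded `"  T"` value is an assumption about how trace values might be stored, not something confirmed from `weather.csv`.

```python
import pandas as pd

# Toy frame standing in for weather_df (illustrative values only)
toy = pd.DataFrame({
    "Tavg": ["67", "M", "51"],
    "PrecipTotal": ["0.00", "  T", "M"],  # assumed: trace value padded with spaces
    "Tmax": [83, 84, 59],
})

def count_sentinel(frame, sentinel):
    """Count cells equal to `sentinel` after casting to str and stripping whitespace."""
    stripped = frame.astype(str).apply(lambda col: col.str.strip())
    return stripped.eq(sentinel).sum()

m_counts = count_sentinel(toy, "M")
t_counts = count_sentinel(toy, "T")

# An exact match misses the padded trace value
exact_t = (toy["PrecipTotal"] == "T").sum()

print(m_counts)
print(t_counts)
print("exact-match T count:", exact_t)
```

Casting every column to `str` first avoids comparing integer columns against a string, which is the comparison that tends to raise warnings in the loop version.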

# 3. Cleaning weather_df <a id="3"></a>

## 3.01 Cleaning: `Station` <a id="3.01"></a>

In [8]:
# Station does not have null values, checking the UNIQUE values it contains

weather_df["Station"].unique()

array([1, 2], dtype=int64)

In [9]:
# Checking how many of each UNIQUE value "Station" contains

weather_df["Station"].value_counts()

1    1472
2    1472
Name: Station, dtype: int64

In [10]:
# Checking dtype

weather_df["Station"].dtype

dtype('int64')

##### <font color = blue> Shaun: </font> <a id="here"></a>

I stared at this longer than I'd like to admit, before finally realising that `1` and `2` obviously meant the station number, and 1472 + 1472 = 2944 #quickmath, which is the total number of rows. 

From the <a href="https://www.kaggle.com/c/predict-west-nile-virus/data">data description</a>:
<br> Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT Lat: 41.995 Lon: -87.933 Elev: 662 ft. above sea level
<br> Station 2: CHICAGO MIDWAY INTL ARPT Lat: 41.786 Lon: -87.752 Elev: 612 ft. above sea level

OTL. 

`dtype` seems to be correct.

**Conclusion:** 
<br> There are 2 stations used to capture weather data, everyone except me knew that already. Next.

## 3.02 Cleaning: `Date` <a id="3.02"></a>

In [11]:
# Date does not have null values, checking the values it contains

weather_df["Date"].describe()

count           2944
unique          1472
top       2011-09-11
freq               2
Name: Date, dtype: object

##### <font color = blue> Shaun: </font>

Okay I guess this makes sense. 

Since there are 2 weather stations, it is understandable that the dates come in pairs since the same day is measured twice, once by each station. 

##### Using `pd.to_datetime` to automatically change `weather_df["Date"]` into `datetime` format

In [12]:
weather_df["Date"] = pd.to_datetime(weather_df["Date"])
weather_df["Date"]

0      2007-05-01
1      2007-05-01
2      2007-05-02
3      2007-05-02
4      2007-05-03
          ...    
2939   2014-10-29
2940   2014-10-30
2941   2014-10-30
2942   2014-10-31
2943   2014-10-31
Name: Date, Length: 2944, dtype: datetime64[ns]

In [13]:
# Checking dtype

weather_df["Date"].dtype

dtype('<M8[ns]')
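To back up the "dates come in pairs" observation, here is a minimal sketch that parses dates with an explicit format and checks that every date is reported by both stations. The toy frame is hypothetical and the `%Y-%m-%d` format is an assumption based on the values shown in the preview.

```python
import pandas as pd

# Toy frame mimicking the Station/Date layout of weather_df
toy = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Date": ["2007-05-01", "2007-05-01", "2007-05-02", "2007-05-02"],
})

# An explicit format makes parsing fail loudly on unexpected strings
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Number of distinct stations reporting on each date
stations_per_date = toy.groupby("Date")["Station"].nunique()
all_paired = bool((stations_per_date == 2).all())

print(stations_per_date)
print("every date has both stations:", all_paired)
```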

##### Organising `weather_df["Date"]` by `year`

In [14]:
weather_df["Date"].dt.year

0       2007
1       2007
2       2007
3       2007
4       2007
        ... 
2939    2014
2940    2014
2941    2014
2942    2014
2943    2014
Name: Date, Length: 2944, dtype: int64

In [15]:
weather_df

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,...,,0,M,0.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,...,,M,M,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,...,BR,0,M,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,...,BR HZ,M,M,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,...,,0,M,0.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,49,40,45,M,34,42,20,0,...,,M,M,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,51,32,42,-4,34,40,23,0,...,,0,M,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,53,37,45,M,35,42,20,0,...,RA,M,M,M,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,47,33,40,-6,25,33,25,0,...,RA SN,0,M,0.1,0.03,29.49,30.20,22.6,34,22.9


##### <font color = blue> Shaun: </font>

I might organise `weather_df["Date"]` by year, seems like that would be most helpful, but I'm not assigning it just yet. 
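If the year (or month) is assigned later, one possible sketch uses the `.dt` accessor together with `assign`, which returns a new frame and leaves the original untouched. The column names `Year` and `Month` are hypothetical choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with an already-parsed datetime column
toy = pd.DataFrame({"Date": pd.to_datetime(["2007-05-01", "2014-10-31"])})

# assign() returns a new DataFrame; nothing is modified in place
with_parts = toy.assign(
    Year=toy["Date"].dt.year,
    Month=toy["Date"].dt.month,
)

print(with_parts)
```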

## 3.03 Cleaning: `Tmax` <a id="3.03"></a>
Maximum temperature in **Fahrenheit** (&deg;F)

In [16]:
# Checking for nulls

weather_df["Tmax"].isnull().sum()

0

In [17]:
# Checking for "M", which indicates nulls

tmax_nulls = weather_df["Tmax"].map(lambda x: x == "M")
tmax_nulls.value_counts()

False    2944
Name: Tmax, dtype: int64

In [18]:
# Getting unique values; sets are unordered, so the sorted display below is not guaranteed (sorted(...) makes the order explicit)

set(weather_df["Tmax"].unique())

{41,
 42,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104}

In [19]:
# Checking the most frequently occurring values

weather_df["Tmax"].value_counts().head().sort_values(ascending=False)

84    128
79    121
82    118
81    117
83    109
Name: Tmax, dtype: int64

In [20]:
# Checking dtype

weather_df["Tmax"].dtype

dtype('int64')

In [21]:
# Getting summary statistics

weather_df["Tmax"].describe()

count    2944.000000
mean       76.166101
std        11.461970
min        41.000000
25%        69.000000
50%        78.000000
75%        85.000000
max       104.000000
Name: Tmax, dtype: float64

##### <font color = blue> Shaun: </font>

**$T_{max}$ Range**: 41&deg;F - 104&deg;F  **or**  5&deg;C - 40&deg;C

Per the summary statistics, the middle 50% of $T_{max}$ values fall between 69&deg;F and 85&deg;F, and the mean $T_{max}$ is 76.2&deg;F **or** 24.5&deg;C

The takeaway here is that there does not appear to be any erroneous data, such as inordinately high or low temperatures.
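The Celsius figures come from the standard conversion $C = (F - 32) \times 5/9$. A small sketch to reproduce them follows; the function name `fahrenheit_to_celsius` is a hypothetical helper, not part of the notebook.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

tmax_low_c = fahrenheit_to_celsius(41)    # 5.0
tmax_high_c = fahrenheit_to_celsius(104)  # 40.0
tmax_mean_c = fahrenheit_to_celsius(76.166101)

print(tmax_low_c, tmax_high_c, round(tmax_mean_c, 1))
```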

`dtype` seems to be correct

## 3.04 Cleaning: `Tmin` <a id="3.04"></a>
Minimum temperature in **Fahrenheit** (&deg;F)

In [22]:
# Checking for nulls

weather_df["Tmin"].isnull().sum()

0

In [23]:
# Checking for "M", which indicates nulls

tmin_nulls = weather_df["Tmin"].map(lambda x: x == "M")
tmin_nulls.value_counts()

False    2944
Name: Tmin, dtype: int64

In [24]:
# Getting unique values sorted

set(weather_df["Tmin"].unique())

{29,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83}

In [25]:
# Checking the most frequently occurring values

weather_df["Tmin"].value_counts().head().sort_values(ascending=False)

63    121
65    111
60    109
61    106
62    105
Name: Tmin, dtype: int64

In [26]:
# Checking dtype

weather_df["Tmin"].dtype

dtype('int64')

In [27]:
# Getting summary statistics

weather_df["Tmin"].describe()

count    2944.000000
mean       57.810462
std        10.381939
min        29.000000
25%        50.000000
50%        59.000000
75%        66.000000
max        83.000000
Name: Tmin, dtype: float64

##### <font color = blue> Shaun: </font>

**$T_{min}$ Range**: 29&deg;F - 83&deg;F  **or**  -1.67&deg;C - 28.3&deg;C

Per the summary statistics, the middle 50% of $T_{min}$ values fall between 50&deg;F and 66&deg;F, and the mean $T_{min}$ is 57.8&deg;F **or** 14.3&deg;C

Again, the salient point here is that there does not appear to be any erroneous data, such as inordinately high or low temperatures. 

`dtype` seems to be correct

## 3.05 Cleaning: `Tavg` <a id="3.05"></a>
Average temperature in **Fahrenheit** (&deg;F)

In [28]:
# Checking for nulls

weather_df["Tavg"].isnull().sum()

0

In [29]:
# Since "M" represents NaNs, I will drop rows with "M"
# Mapping True and False, to see which are rows containing M

tavg_nulls = weather_df["Tavg"].map(lambda x: x == "M")
#tavg_nulls.unique()
tavg_nulls.value_counts()

False    2933
True       11
Name: Tavg, dtype: int64

In [30]:
# Looking at unique values 

weather_df["Tavg"].unique()

array(['67', '68', '51', '52', '56', '58', 'M', '60', '59', '65', '70',
       '69', '71', '61', '55', '57', '73', '72', '53', '62', '63', '74',
       '75', '78', '76', '77', '66', '80', '64', '81', '82', '79', '85',
       '84', '83', '50', '49', '46', '48', '45', '54', '47', '44', '40',
       '41', '38', '39', '42', '37', '43', '86', '87', '89', '92', '88',
       '91', '93', '94', '90', '36'], dtype=object)

##### <font color = blue> Shaun: </font>

The total number of rows in the `Tavg` column containing `M` is 11, which tallies with the `M` count in [Section 2](#2).

I will drop these rows.

##### Dropping rows with `M` for `Tavg` column

In [31]:
# "~" negates the boolean mask, so this keeps only the rows where Tavg is not "M" (faster than a for loop)

weather_df = weather_df[~tavg_nulls]
weather_df[weather_df["Tavg"] == "M"]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed


In [32]:
# Checking the most frequently occurring values

weather_df["Tavg"].value_counts().head().sort_values(ascending=False)

73    138
77    117
70    117
75    110
71    109
Name: Tavg, dtype: int64

In [33]:
# Checking dtype

weather_df["Tavg"].dtype

dtype('O')

##### <font color = blue> Shaun: </font>

`dtype` should be a float in my opinion, so I'm going to change it.

##### Changing dtype of `Tavg` to `float`

In [34]:
# Changing dtype to float

weather_df["Tavg"] = weather_df["Tavg"].astype(float)
weather_df["Tavg"].dtype

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


dtype('float64')
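The `SettingWithCopyWarning` above most likely appears because `weather_df` was produced by boolean filtering and then had a column assigned to it. A minimal sketch of two ways to sidestep it, on a hypothetical toy frame: take an explicit `.copy()` after filtering, or convert with `pd.to_numeric(..., errors="coerce")` so that `M` becomes `NaN` without any row filtering.

```python
import pandas as pd

toy = pd.DataFrame({"Station": [1, 2, 1], "Tavg": ["67", "M", "51"]})

# Option 1: filter, then copy, so later assignments target an independent frame
is_missing = toy["Tavg"].eq("M")
cleaned = toy.loc[~is_missing].copy()
cleaned["Tavg"] = cleaned["Tavg"].astype(float)

# Option 2: coerce non-numeric strings such as "M" to NaN instead of dropping rows
coerced = pd.to_numeric(toy["Tavg"], errors="coerce")

print(cleaned.dtypes)
print(coerced)
```

Option 2 keeps every row, which leaves the decision to drop or impute for later.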

##### <font color = blue> Shaun: </font>

I think I will drop Tmin and Tmax at this juncture, because I think they are captured by Tavg, which is likely a more valuable feature.

##### Dropping `Tmin` and `Tmax`

In [35]:
weather_df.drop("Tmin", axis=1, inplace=True)
weather_df.drop("Tmax", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [36]:
weather_df

Unnamed: 0,Station,Date,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,14,51,56,0,2,0448,1849,,0,M,0.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,M,51,57,0,3,-,-,,M,M,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,-3,42,47,14,0,0447,1850,BR,0,M,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,M,42,47,13,0,-,-,BR HZ,M,M,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,2,40,48,9,0,0446,1851,,0,M,0.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,M,34,42,20,0,-,-,,M,M,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,-4,34,40,23,0,0622,1649,,0,M,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,M,35,42,20,0,-,-,RA,M,M,M,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,-6,25,33,25,0,0623,1647,RA SN,0,M,0.1,0.03,29.49,30.20,22.6,34,22.9


## 3.06 Cleaning: `Depart` <a id="3.06"></a>
Departure from normal temperature in **Fahrenheit** (&deg;F)

In [37]:
# Checking for nulls

weather_df["Depart"].isnull().sum()

0

In [38]:
# Checking for "M", which indicates nulls

depart_nulls = weather_df["Depart"].map(lambda x: x == "M")
depart_nulls.value_counts()

False    1472
True     1461
Name: Depart, dtype: int64

In [39]:
# Checking the most frequently occurring values

weather_df["Depart"].value_counts().head().sort_values(ascending=False)

M     1461
 2      93
-1      84
-2      80
 5      77
Name: Depart, dtype: int64

##### <font color = blue> Shaun: </font>

Wow, if I have to drop this many rows, I'm dropping half the dataset. I think I will drop this feature instead, because I don't see how temperature departure will help me in analysis. 

Further, it doesn't make sense to drop rows with `M` for `Depart` if I end up not using it anyway.
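In the preview rows, every `M` in `Depart` sits on a Station 2 row, which suggests (but does not prove) that the missingness is tied to the station rather than scattered at random. A quick sketch of how that could be checked, using a hypothetical toy frame in place of `weather_df`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Depart": ["14", "M", "-3", "M"],
})

# Share of "M" values per station (mean of a boolean mask = proportion of True)
share_missing = toy["Depart"].eq("M").groupby(toy["Station"]).mean()

print(share_missing)
```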

##### Dropping `Depart`

In [40]:
weather_df.drop("Depart", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [41]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56,0,2,0448,1849,,0,M,0.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57,0,3,-,-,,M,M,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47,14,0,0447,1850,BR,0,M,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47,13,0,-,-,BR HZ,M,M,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48,9,0,0446,1851,,0,M,0.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42,20,0,-,-,,M,M,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40,23,0,0622,1649,,0,M,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42,20,0,-,-,RA,M,M,M,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33,25,0,0623,1647,RA SN,0,M,0.1,0.03,29.49,30.20,22.6,34,22.9


## 3.07 Cleaning: `DewPoint` <a id="3.07"></a>
Average <a href="https://www.weather.gov/arx/why_dewpoint_vs_humidity">dew point</a>

In [42]:
# Checking for nulls 

weather_df["DewPoint"].isnull().sum()

0

In [43]:
# Checking for "M", which indicates nulls

dewpoint_nulls = weather_df["DewPoint"].map(lambda x: x == "M")
dewpoint_nulls.value_counts()

False    2933
Name: DewPoint, dtype: int64

In [44]:
# Checking the most frequently occurring values

weather_df["DewPoint"].value_counts().head().sort_values(ascending=False)

59    128
54    125
55    114
60    113
61    110
Name: DewPoint, dtype: int64

In [45]:
# Checking dtype

weather_df["DewPoint"].dtype

dtype('int64')

##### <font color = blue> Shaun: </font>

DewPoint looking good, no nulls, no `M` either.

`dtype` seems correct

## 3.08 Cleaning: `WetBulb` <a id="3.08"></a>
Average <a href="https://www.sciencedirect.com/topics/engineering/wet-bulb-temperature">Wet Bulb Temperature</a>

In [46]:
# Checking for nulls 

weather_df["WetBulb"].isnull().sum()

0

In [47]:
# Checking for "M", which indicates nulls

wetbulb_nulls = weather_df["WetBulb"].map(lambda x: x == "M")
wetbulb_nulls.value_counts()

False    2929
True        4
Name: WetBulb, dtype: int64

##### <font color = blue> Shaun: </font>

The 4 rows with value `M` in the `WetBulb` column will be dropped.

##### Dropping rows with `M` for `WetBulb` column

In [48]:
weather_df = weather_df[~wetbulb_nulls]
(weather_df["WetBulb"] == "M").sum()  # confirm that no "M" rows remain

0

In [49]:
# Checking the most frequently occurring values

weather_df["WetBulb"].value_counts().head().sort_values(ascending=False)

63    135
65    131
59    129
61    122
64    121
Name: WetBulb, dtype: int64

In [50]:
# Checking dtype

weather_df["WetBulb"].dtype

dtype('O')

##### <font color = blue> Shaun: </font>

`dtype` should be a float in my opinion, so I'm going to change it.

##### Changing dtype of `WetBulb` to `float`

In [51]:
weather_df["WetBulb"] = weather_df["WetBulb"].astype(float)
weather_df["WetBulb"].dtype

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


dtype('float64')

## 3.09 Cleaning: `Heat` <a id="3.09"></a>
Heating (Season begins in July) 
##### <font color = blue> Shaun: </font> 
I assume this means the beginning of the hot season 

In [52]:
# Checking for nulls 

weather_df["Heat"].isnull().sum()

0

In [53]:
# Checking for "M", which indicates nulls

heat_nulls = weather_df["Heat"].map(lambda x: x == "M")
heat_nulls.value_counts()

False    2929
Name: Heat, dtype: int64

In [54]:
# Checking the most frequently occurring values

weather_df["Heat"].value_counts().head().sort_values(ascending=False)

0    1866
4      88
1      86
2      81
8      67
Name: Heat, dtype: int64

In [55]:
weather_df["Heat"].dtype

dtype('O')

##### Changing dtype of `Heat` to `float`

In [56]:
weather_df["Heat"] = weather_df["Heat"].astype(float)
weather_df["Heat"].dtype

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


dtype('float64')

##### <font color = blue> Shaun: </font> 
No `NaN` or `M` for `Heat`, changed dtype to `float`  

## 3.10 Cleaning: `Cool` <a id="3.10"></a>
Cooling (Season begins in January)
##### <font color = blue> Shaun: </font> 
I assume this means the beginning of the cool season 

In [57]:
# Checking for nulls 

weather_df["Cool"].isnull().sum()

0

In [58]:
# Checking for "M", which indicates nulls

cool_nulls = weather_df["Cool"].map(lambda x: x == "M")
cool_nulls.value_counts()

False    2929
Name: Cool, dtype: int64

In [59]:
# Checking the most frequently occurring values

weather_df["Cool"].value_counts().head().sort_values(ascending=False)

 0    1147
 8     137
 5     117
12     116
10     110
Name: Cool, dtype: int64

In [60]:
# Checking dtype

weather_df["Cool"].dtype

dtype('O')

##### Changing dtype of `Cool` to `float`

In [61]:
weather_df["Cool"] = weather_df["Cool"].astype(float)
weather_df["Cool"].dtype

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


dtype('float64')

##### <font color = blue> Shaun: </font> 

I think I'll just use one feature, `Tavg`, to represent temperature. I don't see the point of using `Cool` and `Heat`, so I will be dropping them. 
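A side note on what `Heat` and `Cool` appear to encode: in the top 5 preview rows, the values match degree days relative to 65&deg;F, i.e. `Heat = max(65 - Tavg, 0)` and `Cool = max(Tavg - 65, 0)`. The 65&deg;F base is an assumption inferred from those rows, not something stated in this notebook. If it holds, both columns are fully determined by `Tavg`, which supports dropping them. A sketch of the check on the five preview rows:

```python
import pandas as pd

BASE_F = 65  # assumption: degree-day base temperature inferred from the preview rows

# Tavg, Heat and Cool copied from the top 5 rows of the preview
rows = pd.DataFrame({
    "Tavg": [67, 68, 51, 52, 56],
    "Heat": [0, 0, 14, 13, 9],
    "Cool": [2, 3, 0, 0, 0],
})

# Recompute degree days and compare with the recorded columns
rows["heat_check"] = (BASE_F - rows["Tavg"]).clip(lower=0)
rows["cool_check"] = (rows["Tavg"] - BASE_F).clip(lower=0)

matches = bool(
    (rows["heat_check"] == rows["Heat"]).all()
    and (rows["cool_check"] == rows["Cool"]).all()
)
print("preview rows consistent with base 65:", matches)
```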

##### Dropping `Cool` and `Heat`

In [62]:
weather_df.drop("Cool", axis=1, inplace=True)
weather_df.drop("Heat", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [63]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,Sunrise,Sunset,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56.0,0448,1849,,0,M,0.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57.0,-,-,,M,M,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47.0,0447,1850,BR,0,M,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47.0,-,-,BR HZ,M,M,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48.0,0446,1851,,0,M,0.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42.0,-,-,,M,M,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40.0,0622,1649,,0,M,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42.0,-,-,RA,M,M,M,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33.0,0623,1647,RA SN,0,M,0.1,0.03,29.49,30.20,22.6,34,22.9


## 3.11 Cleaning: `Sunrise` <a id="3.11"></a>
Sunrise (Calculated, not observed)

##### <font color = blue> Shaun: </font> 
Not sure what this means ... 

In [64]:
# Checking for nulls 

weather_df["Sunrise"].isnull().sum()

0

In [65]:
# Checking for "M", which indicates nulls

sunrise_nulls = weather_df["Sunrise"].map(lambda x: x == "M")
sunrise_nulls.value_counts()

False    2929
Name: Sunrise, dtype: int64

In [66]:
# Checking the most frequently occurring values

weather_df["Sunrise"].value_counts().head()

-       1460
0416     104
0417      64
0419      40
0420      32
Name: Sunrise, dtype: int64

In [67]:
# Checking dtype

weather_df["Sunrise"].dtype

dtype('O')

##### <font color = blue> Shaun: </font> 
It seems that `-` means Nulls in this case ... probably will end up dropping `Sunrise` since about 50% is missing data (1460/2929 = 49.8%).
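Sentinels such as `M` and `-` can also be turned into real `NaN` values at load time through the `na_values` parameter of `pd.read_csv`, so that `isnull()` reports them directly. The sketch below uses a small in-memory CSV as a stand-in for `weather.csv`; it illustrates the parameter and is not how this notebook loads the data.

```python
import io
import pandas as pd

# Tiny stand-in for weather.csv
csv_text = (
    "Station,Date,Sunrise,Sunset,Depart\n"
    "1,2007-05-01,0448,1849,14\n"
    "2,2007-05-01,-,-,M\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["M", "-"],                   # treat these exact strings as missing
    dtype={"Sunrise": str, "Sunset": str},  # keep leading zeros such as "0448"
)

# Fraction of missing values per column
missing_share = toy.isna().mean()
print(missing_share)
```

Matching is on the whole cell, so a negative number such as `-3` is not affected by listing `-` as a missing marker.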

## 3.12 Cleaning: `Sunset` <a id="3.12"></a>
Sunset (Calculated, not observed)

##### <font color = blue> Shaun: </font> 
Same thing, not sure what this means ... I'll be darned if 50% of the data is missing yet again.

In [68]:
# Checking for nulls 

weather_df["Sunset"].isnull().sum()

0

In [69]:
# Checking for "M", which indicates nulls

sunset_nulls = weather_df["Sunset"].map(lambda x: x == "M")
sunset_nulls.value_counts()

False    2929
Name: Sunset, dtype: int64

In [70]:
# Checking the most frequently occurring values

weather_df["Sunset"].value_counts().head().sort_values(ascending=False)

-       1460
1931      95
1930      56
1929      48
1925      32
Name: Sunset, dtype: int64

In [71]:
# Checking dtype

weather_df["Sunset"].dtype

dtype('O')

##### <font color = blue> Shaun: </font> 

Yup, confirmed that `-` means Nulls in this case, and I will probably end up dropping `Sunset` also, since about 50% is missing data (1460/2929 = 49.8%). 

##### Dropping `Sunrise` and `Sunset`

In [72]:
weather_df.drop("Sunrise", axis=1, inplace=True)
weather_df.drop("Sunset", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [73]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56.0,,0,M,0.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57.0,,M,M,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47.0,BR,0,M,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47.0,BR HZ,M,M,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48.0,,0,M,0.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42.0,,M,M,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40.0,,0,M,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42.0,RA,M,M,M,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33.0,RA SN,0,M,0.1,0.03,29.49,30.20,22.6,34,22.9


## 3.13 Cleaning: `CodeSum` <a id="3.13"></a>
Significant weather types

In [74]:
# Checking for nulls 

weather_df["CodeSum"].isnull().sum()

0

In [75]:
# Checking for "M", which indicates nulls

codesum_nulls = weather_df["CodeSum"].map(lambda x: x == "M")
codesum_nulls.value_counts()

False    2929
Name: CodeSum, dtype: int64

In [76]:
# Checking the most frequently occurring values

weather_df["CodeSum"].value_counts().head().sort_values(ascending=False)

              1601
RA             293
RA BR          237
BR             110
TSRA RA BR      92
Name: CodeSum, dtype: int64

In [77]:
# Checking dtype

weather_df["CodeSum"].dtype

dtype('O')

##### <font color = blue> Shaun: </font> 

No wonder the isnull() check did not return anything, the most occurring values are in fact what I think are <a href="https://www.w3schools.com/html/html_entities.asp"> non breaking spaces (`&nbsp;`)</a> 

I am going to drop this feature altogether, since more than 50% of the data is in fact missing (1601/2929 = 54.7%), and I know of no way to tell which type of weather to impute. 
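As an alternative to dropping the column (not what this notebook does), space-separated codes could be expanded into one indicator column per code with `Series.str.get_dummies`. This sketch treats a blank entry as "no code recorded", which is an assumption about the data. The toy values are taken from the most frequent `CodeSum` entries above.

```python
import pandas as pd

# Toy values; " " stands in for the blank entries seen in CodeSum
codes = pd.Series([" ", "RA", "RA BR", "TSRA RA BR"])

# Strip whitespace, then create one 0/1 column per space-separated code
indicators = codes.str.strip().str.get_dummies(sep=" ")

print(indicators)
```

Blank rows end up with zeros in every indicator column, so no imputation is needed under this reading.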

##### Dropping `CodeSum`

In [78]:
weather_df.drop("CodeSum", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [79]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56.0,0,M,0.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57.0,M,M,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47.0,0,M,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47.0,M,M,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48.0,0,M,0.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42.0,M,M,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40.0,0,M,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42.0,M,M,M,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33.0,0,M,0.1,0.03,29.49,30.20,22.6,34,22.9


## 3.14 Cleaning: `Depth` <a id="3.14"></a>
Snow/Ice (On Ground) (UTC 1200) 

In [80]:
# Checking for Nulls

weather_df["Depth"].isnull().sum()

0

In [81]:
# Checking for "M", which indicates nulls

depth_nulls = weather_df["Depth"].map(lambda x:x == "M")
depth_nulls.value_counts()

False    1469
True     1460
Name: Depth, dtype: int64

In [82]:
# Checking the unique values

weather_df["Depth"].unique()

array(['0', 'M'], dtype=object)

##### <font color = blue> Shaun: </font>

I'm not sure how the depth of snow or ice matters here, especially with regard to mosquitoes. If it were puddles or water bodies, I could understand, because standing water breeds mosquitoes.

Further, about 50% of the data is missing (1,460 of 2,929 rows) and the only recorded value is `0`, so I will definitely be dropping this feature.
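
The `M` check is repeated column by column in this section. A rough sketch of running it over several columns at once, using a toy frame whose values are made up for illustration:

```python
import pandas as pd

# Toy frame mimicking the "M" placeholder (illustrative values only)
toy = pd.DataFrame({
    "Depth": ["0", "M", "0", "M"],
    "Water1": ["M", "M", "M", "M"],
    "SnowFall": ["0.0", "M", "  T", "M"],
})

# Fraction of rows equal to "M" in each column, highest first
m_share = toy.eq("M").mean().sort_values(ascending=False)

print(m_share)
```

Columns at or near 1.0 are obvious candidates for dropping, while small fractions suggest dropping rows instead.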

##### Dropping `Depth`

In [83]:
weather_df.drop("Depth", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [84]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56.0,M,0.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57.0,M,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47.0,M,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47.0,M,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48.0,M,0.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42.0,M,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40.0,M,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42.0,M,M,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33.0,M,0.1,0.03,29.49,30.20,22.6,34,22.9


## 3.15 Cleaning: `Water1` <a id="3.15"></a>

Water equivalent (1800 UTC)

<font color = blue> Shaun: </font>
<br> I suppose this means the water content of snow/ice combined, measured at 1800 UTC $\approx$ 0200 GMT+8

In [85]:
# Checking for Nulls

weather_df["Water1"].isnull().sum()

0

In [86]:
# Checking for "M", which indicates nulls

water1_nulls = weather_df["Water1"].map(lambda x:x == "M")
water1_nulls.value_counts()

True    2929
Name: Water1, dtype: int64

<font color = blue> Shaun: </font>

So the entire column is missing for `Water1`: 100% of its values are `M`.

I will definitely be dropping this feature.

##### Dropping `Water1`

In [87]:
weather_df.drop("Water1", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [88]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56.0,0.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57.0,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47.0,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47.0,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48.0,0.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42.0,M,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40.0,0.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42.0,M,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33.0,0.1,0.03,29.49,30.20,22.6,34,22.9


## 3.16 Cleaning: `SnowFall` <a id="3.16"></a>
Snowfall (Inches and Tenths) (2400 <a href="http://cci.esa.int/lst">LST</a>), not all stations report snowfall

In [89]:
# Checking for Nulls

weather_df["SnowFall"].isnull().sum()

0

In [90]:
# Checking for "M", which indicates nulls

snowfall_nulls = weather_df["SnowFall"].map(lambda x:x == "M")
snowfall_nulls.value_counts()

False    1469
True     1460
Name: SnowFall, dtype: int64

In [91]:
# Checking the most frequently occurring values

weather_df["SnowFall"].value_counts()

M      1460
0.0    1456
  T      12
0.1       1
Name: SnowFall, dtype: int64

<font color = blue> Shaun: </font>

It looks like only **one** observation is actually useful here. Of the 13 potentially insightful observations, 12 are `T` (trace amounts).

Guess which feature I'm about to drop.

##### Dropping `SnowFall`

In [92]:
weather_df.drop("SnowFall", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [93]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56.0,0.00,29.10,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57.0,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47.0,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48.0,0.00,29.39,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42.0,0.00,29.42,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40.0,0.00,29.34,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42.0,T,29.41,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33.0,0.03,29.49,30.20,22.6,34,22.9


## 3.17 Cleaning: `PrecipTotal` <a id="3.17"></a>
Water equivalent from rainfall and melted snow (Inches and Hundredths) (2400 <a href="http://cci.esa.int/lst">LST</a>)


In [94]:
# Checking for Nulls

weather_df["PrecipTotal"].isnull().sum()

0

In [95]:
# Checking for "M", which indicates nulls

preciptotal_nulls = weather_df["PrecipTotal"].map(lambda x: x == "M")
preciptotal_nulls.value_counts()

False    2927
True        2
Name: PrecipTotal, dtype: int64

##### Dropping rows with `M` for `PrecipTotal` column

In [96]:
weather_df = weather_df[~preciptotal_nulls]
weather_df[weather_df["PrecipTotal"] == "M"]

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
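
The filtering step above keeps rows with a boolean mask. pandas may warn when columns are later assigned on a frame derived this way, which is the "copy of a slice" message seen in this notebook. One possible way to sidestep that, sketched on toy data, is to take an explicit `.copy()` right after filtering:

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({"PrecipTotal": ["0.00", "M", "  T", "0.03"]})

# Boolean mask of rows carrying the "M" placeholder
is_m = toy["PrecipTotal"].eq("M")

# .copy() makes the filtered frame independent of `toy`
kept = toy[~is_m].copy()

# Later column assignments now act on `kept` only
kept["PrecipTotal"] = kept["PrecipTotal"].str.strip()

print(kept)
```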


##### <font color = blue> Shaun: </font> 

It appears that `T` (trace) lies between 0.00 and 0.01. I want to use this column, so I'll assign `T` a value of 0.001 instead of dropping about 10% of my observations.

In [97]:
weather_df["PrecipTotal"].value_counts().to_dict()

{'0.00': 1570,
 '  T': 317,
 '0.01': 126,
 '0.02': 63,
 '0.03': 46,
 '0.04': 35,
 '0.05': 32,
 '0.08': 28,
 '0.12': 28,
 '0.06': 26,
 '0.07': 23,
 '0.16': 21,
 '0.09': 21,
 '0.11': 20,
 '0.14': 19,
 '0.17': 17,
 '0.19': 14,
 '0.28': 14,
 '0.13': 14,
 '0.18': 14,
 '0.20': 13,
 '0.15': 13,
 '0.26': 11,
 '0.23': 11,
 '0.25': 11,
 '0.24': 10,
 '0.10': 10,
 '0.39': 9,
 '0.31': 9,
 '0.29': 9,
 '0.21': 9,
 '0.43': 9,
 '0.40': 9,
 '0.32': 8,
 '0.34': 8,
 '0.33': 7,
 '0.30': 7,
 '0.45': 7,
 '0.50': 7,
 '0.41': 7,
 '0.37': 7,
 '0.48': 7,
 '0.22': 7,
 '0.59': 7,
 '0.63': 6,
 '0.65': 6,
 '0.80': 6,
 '0.84': 6,
 '0.27': 6,
 '0.36': 5,
 '0.85': 5,
 '0.93': 5,
 '0.68': 5,
 '0.54': 5,
 '0.92': 5,
 '0.44': 5,
 '0.58': 4,
 '0.72': 4,
 '0.97': 4,
 '0.52': 4,
 '0.55': 4,
 '0.75': 4,
 '1.23': 4,
 '0.89': 4,
 '0.64': 4,
 '0.51': 4,
 '0.70': 4,
 '0.60': 3,
 '0.56': 3,
 '0.87': 3,
 '0.66': 3,
 '1.05': 3,
 '0.88': 3,
 '1.01': 3,
 '0.71': 3,
 '0.47': 3,
 '0.74': 3,
 '0.35': 3,
 '0.82': 3,
 '1.31': 3,
 '1.03': 3

##### Replacing `T` with `0.001`, and then converting the column to `float`

In [98]:
weather_df["PrecipTotal"] = weather_df["PrecipTotal"].map(lambda x: x.replace("T", "0.001")).astype(float)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [99]:
weather_df["PrecipTotal"].value_counts()

0.000    1570
0.001     317
0.010     126
0.020      63
0.030      46
         ... 
0.760       1
4.730       1
2.760       1
1.090       1
2.240       1
Name: PrecipTotal, Length: 167, dtype: int64
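
As a variation on the replacement above, the sketch below strips the padding around `T` first and then swaps whole values, so only entries that are exactly `T` get replaced. The series is toy data, and `TRACE_VALUE` is a hypothetical name for the same 0.001 stand-in.

```python
import pandas as pd

# Toy stand-in for the PrecipTotal column (illustrative values only)
raw = pd.Series(["0.00", "  T", "0.01", "1.23"], name="PrecipTotal")

TRACE_VALUE = "0.001"  # same stand-in value as chosen above

# Strip padding, replace whole-value "T", then convert to float
cleaned = raw.str.strip().replace("T", TRACE_VALUE).astype(float)

print(cleaned.tolist())
```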

## 3.18 Cleaning: `StnPressure` <a id="3.18"></a>
Average station pressure (inches of mercury, inHg)

In [100]:
# Checking for Nulls

weather_df["StnPressure"].isnull().sum()

0

In [101]:
# Checking for "M", which indicates nulls

stnpressure_nulls = weather_df["StnPressure"].map(lambda x: x == "M")
stnpressure_nulls.value_counts()

False    2925
True        2
Name: StnPressure, dtype: int64

<font color = blue> Shaun: </font>

Sea level pressure seems to be another measure of pressure, so I'll hold off for now and see what the EDA reveals about it.

## 3.19 Cleaning: `SeaLevel` <a id="3.19"></a>
Average sea level pressure (inches of mercury, inHg)

In [102]:
# Checking for Nulls

weather_df["SeaLevel"].isnull().sum()

0

In [103]:
# Checking for "M", which indicates nulls

sealevel_nulls = weather_df["SeaLevel"].map(lambda x: x == "M")
sealevel_nulls.value_counts()

False    2919
True        8
Name: SeaLevel, dtype: int64

<font color = blue> Shaun: </font>

Apparently, sea level pressure is the <a href="https://community.weatherflow.com/t/pressure-local-station-vs-sea-level/1416/3">more standard</a> measure of pressure, so I will drop `StnPressure` even though it has 6 fewer missing observations.

In the grand scheme of things, I don't think 6 makes that big a difference.

##### Dropping `StnPressure`

In [104]:
weather_df.drop("StnPressure", axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [105]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,PrecipTotal,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56.0,0.000,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57.0,0.000,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47.0,0.000,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47.0,0.000,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48.0,0.000,30.12,11.7,7,11.9
...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42.0,0.000,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40.0,0.000,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42.0,0.001,30.10,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33.0,0.030,30.20,22.6,34,22.9


##### Dropping rows with `M` for `SeaLevel` column

In [106]:
weather_df = weather_df[~sealevel_nulls]

In [107]:
# Checking datatype

weather_df["SeaLevel"].dtype

dtype('O')

##### Changing dtype of `SeaLevel` to `float`

In [108]:
weather_df["SeaLevel"] = weather_df["SeaLevel"].astype(float)
weather_df["SeaLevel"].dtype

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


dtype('float64')
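
An alternative to masking out `M` rows and then calling `astype(float)` is to let `pd.to_numeric` coerce anything non-numeric to `NaN` and drop those rows afterwards. A minimal sketch on toy data:

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({"SeaLevel": ["29.82", "M", "30.09", "30.12"]})

# errors="coerce" turns anything non-numeric (here "M") into NaN
toy["SeaLevel"] = pd.to_numeric(toy["SeaLevel"], errors="coerce")

# Drop the rows that failed to parse
toy = toy.dropna(subset=["SeaLevel"])

print(toy["SeaLevel"].dtype, len(toy))
```

The trade-off is that coercion silently hides any other unexpected strings, so it is worth checking how many `NaN` values appear before dropping them.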

## 3.20 Cleaning: `ResultSpeed` <a id="3.20"></a>
Resultant wind speed in mph

In [109]:
# Checking for Nulls

weather_df["ResultSpeed"].isnull().sum()

0

In [110]:
# Checking for "M", which indicates nulls

resultspeed_nulls = weather_df["ResultSpeed"].map(lambda x: x == "M")
resultspeed_nulls.value_counts()

False    2919
Name: ResultSpeed, dtype: int64

In [111]:
# Checking dtype

weather_df["ResultSpeed"].dtype

dtype('float64')

#### <font color = blue> Shaun: </font>

Finally, a feature that already seems to be clean.

## 3.21 Cleaning: `ResultDir` <a id="3.21"></a>
Resultant wind direction, in **whole degrees**, to tens of degrees

In [112]:
# Checking for Nulls

weather_df["ResultDir"].isnull().sum()

0

In [113]:
# Checking for "M", which indicates nulls

resultdir_nulls = weather_df["ResultDir"].map(lambda x: x == "M")
resultdir_nulls.value_counts()

False    2919
Name: ResultDir, dtype: int64

In [114]:
# Checking dtype

weather_df["ResultDir"].dtype

dtype('int64')

In [115]:
weather_df["ResultDir"].head(30)

0     27
1     25
2      4
3      2
4      7
5      6
6      8
8      7
9      7
10    11
11    10
12    18
13    17
14    11
15     8
16     9
17     7
18    17
19     9
20     3
21    36
22     3
23     1
24    14
25    11
26    21
27    21
28    27
29    25
30    36
Name: ResultDir, dtype: int64

##### <font color = blue> Shaun: </font>
Initially I was going to convert this to a float, but then I read the data description and realised that observations here are in **whole degrees**.

Visual inspection in the cell above confirms this, so I guess for now I won't change it. 

Might come back to it later.

## 3.22 Cleaning: `AvgSpeed` <a id="3.22"></a>
Average wind speed

In [116]:
# Checking for Nulls

weather_df["AvgSpeed"].isnull().sum()

0

In [117]:
# Checking for "M", which indicates nulls

avgspeed_nulls = weather_df["AvgSpeed"].map(lambda x: x == "M")
avgspeed_nulls.value_counts()

False    2919
Name: AvgSpeed, dtype: int64

In [118]:
# Checking dtype

weather_df["AvgSpeed"].dtype

dtype('O')

In [119]:
weather_df["AvgSpeed"]

0        9.2
1        9.6
2       13.4
3       13.4
4       11.9
        ... 
2939     9.0
2940     5.5
2941     6.5
2942    22.9
2943    22.6
Name: AvgSpeed, Length: 2919, dtype: object

##### <font color = blue> Shaun: </font>
Average speed should definitely be a float, so I'm converting it.

##### Changing dtype of `AvgSpeed` to `float` 

In [120]:
weather_df["AvgSpeed"] = weather_df["AvgSpeed"].astype(float)
weather_df["AvgSpeed"].dtype

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


dtype('float64')

## 3.23 Post-Cleaning: weather_df  <a id="3.23"></a>

In [121]:
preview(weather_df)

weather_df has shape: (2919, 10).

weather_df has the following columns: Index(['Station', 'Date', 'Tavg', 'DewPoint', 'WetBulb', 'PrecipTotal',
       'SeaLevel', 'ResultSpeed', 'ResultDir', 'AvgSpeed'],
      dtype='object')

These are the top 5 rows of weather_df:


Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,PrecipTotal,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,67.0,51,56.0,0.0,29.82,1.7,27,9.2
1,2,2007-05-01,68.0,51,57.0,0.0,29.82,2.7,25,9.6
2,1,2007-05-02,51.0,42,47.0,0.0,30.09,13.0,4,13.4
3,2,2007-05-02,52.0,42,47.0,0.0,30.08,13.3,2,13.4
4,1,2007-05-03,56.0,40,48.0,0.0,30.12,11.7,7,11.9



These are the bottom 5 rows of weather_df:


Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,PrecipTotal,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
2939,2,2014-10-29,45.0,34,42.0,0.0,30.07,8.5,29,9.0
2940,1,2014-10-30,42.0,34,40.0,0.0,30.09,5.1,24,5.5
2941,2,2014-10-30,45.0,35,42.0,0.001,30.1,5.9,23,6.5
2942,1,2014-10-31,40.0,25,33.0,0.03,30.2,22.6,34,22.9
2943,2,2014-10-31,42.0,29,36.0,0.04,30.2,21.7,34,22.6



An overview of weather_df's feature types, counts, and nulls:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 2943
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Station      2919 non-null   int64         
 1   Date         2919 non-null   datetime64[ns]
 2   Tavg         2919 non-null   float64       
 3   DewPoint     2919 non-null   int64         
 4   WetBulb      2919 non-null   float64       
 5   PrecipTotal  2919 non-null   float64       
 6   SeaLevel     2919 non-null   float64       
 7   ResultSpeed  2919 non-null   float64       
 8   ResultDir    2919 non-null   int64         
 9   AvgSpeed     2919 non-null   float64       
dtypes: datetime64[ns](1), float64(6), int64(3)
memory usage: 250.9 KB


None


weather_df does not contain nulls.


# 4. Feature Engineering <a id="4"></a>

## 4.1 Relative Humidity  <a id="4.1"></a>

##### <font color=blue> Shaun: </font>

I recall from H2 Geography that studies investigating mosquito activity often use relative humidity as a key indicator, so I'm going to try to engineer this feature with what I have.

The formula can be found <a href="https://iridl.ldeo.columbia.edu/dochelp/QA/Basic/dewpoint.html">here</a>. 
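
For reference, the approximation as used in the cells that follow, written with generic symbols. $E_0$, $T_0$ and $L/R_v$ are the reference constants (check the linked page for their exact values), $T_d$ is the dew point and $T$ the air temperature, both in Kelvin:

$$E = E_0 \exp\left[\frac{L}{R_v}\left(\frac{1}{T_0} - \frac{1}{T_d}\right)\right], \qquad E_s = E_0 \exp\left[\frac{L}{R_v}\left(\frac{1}{T_0} - \frac{1}{T}\right)\right]$$

$$RH = 100 \times \frac{E}{E_s} = 100 \exp\left[\frac{L}{R_v}\left(\frac{1}{T} - \frac{1}{T_d}\right)\right]$$

Since $E_0$ and $T_0$ cancel in the ratio, relative humidity under this approximation depends only on $T$, $T_d$ and $L/R_v$.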

In [122]:
# Converting to SI unit Kelvin
# Taken from Google: (32°F − 32) × 5/9 + 273.15 = 273.15K

((weather_df["DewPoint"]-32)*5/9)+273.15

0       283.705556
1       283.705556
2       278.705556
3       278.705556
4       277.594444
           ...    
2939    274.261111
2940    274.261111
2941    274.816667
2942    269.261111
2943    271.483333
Name: DewPoint, Length: 2919, dtype: float64

In [123]:
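# Calculating Actual Vapour Pressure
# NOTE: worth double-checking the constant 5243 against the linked formula, which I believe lists L/Rv = 5423 K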

weather_df["vap_pressure"] = 0.611* np.exp(5243 * ((1/273.15) - (1/(((weather_df["DewPoint"]-32)*5/9)+273.15))))
weather_df["vap_pressure"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


0       1.247942
1       1.247942
2       0.895794
3       0.895794
4       0.830820
          ...   
2939    0.660409
2940    0.660409
2941    0.686431
2942    0.463068
2943    0.543081
Name: vap_pressure, Length: 2919, dtype: float64

In [124]:
# Calculating Saturated Vapour Pressure using Tavg instead of T (more accurate to use T for each instance, but no time)

weather_df["sat_vap_pressure"] = 0.611* np.exp(5243 * ((1/273.15) - (1/(((weather_df['Tavg']-32)*5/9)+273.15))))
weather_df["sat_vap_pressure"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


0       2.187858
1       2.263431
2       1.247942
3       1.293839
4       1.492846
          ...   
2939    1.001784
2940    0.895794
2941    1.001784
2942    0.830820
2943    0.895794
Name: sat_vap_pressure, Length: 2919, dtype: float64

In [125]:
# Calculating Relative Humidity (100 means the air is saturated)

weather_df["rel_hum"] = 100*weather_df["vap_pressure"]/weather_df["sat_vap_pressure"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [126]:
weather_df

Unnamed: 0,Station,Date,Tavg,DewPoint,WetBulb,PrecipTotal,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,vap_pressure,sat_vap_pressure,rel_hum
0,1,2007-05-01,67.0,51,56.0,0.000,29.82,1.7,27,9.2,1.247942,2.187858,57.039444
1,2,2007-05-01,68.0,51,57.0,0.000,29.82,2.7,25,9.6,1.247942,2.263431,55.134977
2,1,2007-05-02,51.0,42,47.0,0.000,30.09,13.0,4,13.4,0.895794,1.247942,71.781719
3,2,2007-05-02,52.0,42,47.0,0.000,30.08,13.3,2,13.4,0.895794,1.293839,69.235378
4,1,2007-05-03,56.0,40,48.0,0.000,30.12,11.7,7,11.9,0.830820,1.492846,55.653432
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,45.0,34,42.0,0.000,30.07,8.5,29,9.0,0.660409,1.001784,65.923300
2940,1,2014-10-30,42.0,34,40.0,0.000,30.09,5.1,24,5.5,0.660409,0.895794,73.723329
2941,2,2014-10-30,45.0,35,42.0,0.001,30.10,5.9,23,6.5,0.686431,1.001784,68.520813
2942,1,2014-10-31,40.0,25,33.0,0.030,30.20,22.6,34,22.9,0.463068,0.830820,55.736221
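
The three steps above (Kelvin conversion, the two vapour pressures, and their ratio) can be folded into one helper. This is only a sketch: `fahrenheit_to_kelvin` and `relative_humidity` are hypothetical names, the frame is toy data, and `l_over_rv` defaults to the 5243 used in the cells above so the numbers line up with the notebook. Treat that constant as something to verify against the linked page.

```python
import numpy as np
import pandas as pd

def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to Kelvin."""
    return (temp_f - 32) * 5 / 9 + 273.15

def relative_humidity(dewpoint_f, temp_f, l_over_rv=5243.0):
    """Relative humidity (%) as the ratio of actual to saturated vapour pressure.

    The 0.611 prefactor and the 1/273.15 term cancel in the ratio,
    so only the two temperatures and the constant remain.
    """
    t_d = fahrenheit_to_kelvin(dewpoint_f)
    t = fahrenheit_to_kelvin(temp_f)
    return 100 * np.exp(l_over_rv * (1 / t - 1 / t_d))

# Toy rows: the first two mirror the first rows shown above,
# the third has dew point equal to air temperature
toy = pd.DataFrame({"DewPoint": [51, 42, 60], "Tavg": [67.0, 51.0, 60.0]})
toy["rel_hum"] = relative_humidity(toy["DewPoint"], toy["Tavg"])

print(toy)
```

A quick sanity check is the third row: when the dew point equals the air temperature, the helper returns 100.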


##### <font color = blue> Shaun: </font>
OK, now that the feature is engineered, I'll remove what I don't need anymore.

##### Dropping `DewPoint`, `WetBulb`, `vap_pressure`, and `sat_vap_pressure`

In [127]:
weather_df.drop(columns=["DewPoint", "WetBulb", "vap_pressure", "sat_vap_pressure"], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [128]:
weather_df = weather_df.reset_index()


In [129]:
weather_df.head(10)

Unnamed: 0,index,Station,Date,Tavg,PrecipTotal,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,rel_hum
0,0,1,2007-05-01,67.0,0.0,29.82,1.7,27,9.2,57.039444
1,1,2,2007-05-01,68.0,0.0,29.82,2.7,25,9.6,55.134977
2,2,1,2007-05-02,51.0,0.0,30.09,13.0,4,13.4,71.781719
3,3,2,2007-05-02,52.0,0.0,30.08,13.3,2,13.4,69.235378
4,4,1,2007-05-03,56.0,0.0,30.12,11.7,7,11.9,55.653432
5,5,2,2007-05-03,58.0,0.0,30.12,12.9,6,13.2,51.854284
6,6,1,2007-05-04,58.0,0.001,30.05,10.4,8,10.8,53.847799
7,8,1,2007-05-05,60.0,0.001,30.1,11.7,7,12.0,44.807379
8,9,2,2007-05-05,60.0,0.001,30.09,11.2,7,11.5,46.544103
9,10,1,2007-05-06,59.0,0.0,30.29,14.4,11,15.0,34.04131


## 4.2 Station Location (Longitude and Latitude) <a id="4.2"></a>
##### <font color = blue> Shaun: </font>
As a reminder, you can find the longitudes and latitudes [here](#here).

In [130]:
# Adding each station's coordinates (latitude and longitude)

latitude=[]
longitude=[]

for i in range(len(weather_df['Station'])):
    if weather_df['Station'][i]==1:
        latitude.append(41.995)
        longitude.append(-87.933)
    else:
        latitude.append(41.786)
        longitude.append(-87.752)

weather_df['latitude']=latitude
weather_df['longitude']=longitude
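
The loop above can also be written without explicit indexing. A minimal sketch on toy data, where `STATION_COORDS` is a hypothetical lookup holding the same coordinates as the loop:

```python
import pandas as pd

# Coordinates copied from the loop above: station -> (latitude, longitude)
STATION_COORDS = {
    1: (41.995, -87.933),
    2: (41.786, -87.752),
}

# Toy frame (illustrative values only)
toy = pd.DataFrame({"Station": [1, 2, 1, 2]})

# Look up each station's coordinates; an unknown station raises KeyError
toy["latitude"] = toy["Station"].map(lambda s: STATION_COORDS[s][0])
toy["longitude"] = toy["Station"].map(lambda s: STATION_COORDS[s][1])

print(toy)
```

Unlike the loop's `else` branch, which assigns Station 2's coordinates to anything that is not Station 1, this version fails loudly on an unexpected station number. It also does not depend on the index running from 0 without gaps.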

In [131]:
weather_df

Unnamed: 0,index,Station,Date,Tavg,PrecipTotal,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,rel_hum,latitude,longitude
0,0,1,2007-05-01,67.0,0.000,29.82,1.7,27,9.2,57.039444,41.995,-87.933
1,1,2,2007-05-01,68.0,0.000,29.82,2.7,25,9.6,55.134977,41.786,-87.752
2,2,1,2007-05-02,51.0,0.000,30.09,13.0,4,13.4,71.781719,41.995,-87.933
3,3,2,2007-05-02,52.0,0.000,30.08,13.3,2,13.4,69.235378,41.786,-87.752
4,4,1,2007-05-03,56.0,0.000,30.12,11.7,7,11.9,55.653432,41.995,-87.933
...,...,...,...,...,...,...,...,...,...,...,...,...
2914,2939,2,2014-10-29,45.0,0.000,30.07,8.5,29,9.0,65.923300,41.786,-87.752
2915,2940,1,2014-10-30,42.0,0.000,30.09,5.1,24,5.5,73.723329,41.995,-87.933
2916,2941,2,2014-10-30,45.0,0.001,30.10,5.9,23,6.5,68.520813,41.786,-87.752
2917,2942,1,2014-10-31,40.0,0.030,30.20,22.6,34,22.9,55.736221,41.995,-87.933


# 5. Exporting Cleaned weather_df <a id="5"></a>

In [132]:
#pd.DataFrame(weather_df).to_csv('./datasets/weather_cleaned.csv', index = False)

In [133]:
print(f"Run complete, total time taken \u2248 {time.time()-t0:.2f}s")

Run complete, total time taken ≈ 1.31s
