# Project 4: West Nile Virus Prediction

Members: Eng Seng, Jin, Qi Wen, Song Yuan

## Table of Contents

- [1. Import Weather Data](#1.-Import-Weather-Data)<br>
- [2. Data Cleaning For Weather Data](#2.-Data-Cleaning-For-Weather-Data)<br>
    - [2.1 Depth, Water1, SnowFall Column Drop](#2.1-Depth,-Water1,-SnowFall-Column-Drop)<br>
    - [2.2 Data To Be Duplicate From Station 1 to Station 2](#2.2-Data-To-Be-Duplicate-From-Station-1-to-Station-2)<br>
    - [2.3 Imputation of Wetbulb, StnPressure, SeaLevel with Median](#2.3-Imputation-of-Wetbulb,-StnPressure,-SeaLevel-with-Median)<br>
    - [2.4 PrecipTotal](#2.4-PrecipTotal)<br>
- [3. Change Weather Data Types](#3.-Change-Weather-Data-Types)<br>
    - [3.1 Tavg](#3.1-Tavg)<br>
- [4. Export Data](#4.-Export-Data)<br>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

%config InlineBackend.figure_format = 'retina'
plt.rc('xtick', labelsize=12) 
plt.rc('ytick', labelsize=12)

pd.set_option('display.max_columns', 500, 'display.max_rows', 500, 'display.width', 1000)

## 1. Import Weather Data

In [2]:
weather = pd.read_csv('../datasets/weather.csv')

In [3]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,0448,1849,,0,M,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,-,-,,M,M,M,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,0447,1850,BR,0,M,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,-,-,BR HZ,M,M,M,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,0446,1851,,0,M,0.0,0.0,29.39,30.12,11.7,7,11.9


In [4]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Station      2944 non-null   int64  
 1   Date         2944 non-null   object 
 2   Tmax         2944 non-null   int64  
 3   Tmin         2944 non-null   int64  
 4   Tavg         2944 non-null   object 
 5   Depart       2944 non-null   object 
 6   DewPoint     2944 non-null   int64  
 7   WetBulb      2944 non-null   object 
 8   Heat         2944 non-null   object 
 9   Cool         2944 non-null   object 
 10  Sunrise      2944 non-null   object 
 11  Sunset       2944 non-null   object 
 12  CodeSum      2944 non-null   object 
 13  Depth        2944 non-null   object 
 14  Water1       2944 non-null   object 
 15  SnowFall     2944 non-null   object 
 16  PrecipTotal  2944 non-null   object 
 17  StnPressure  2944 non-null   object 
 18  SeaLevel     2944 non-null   object 
 19  Result

## 2. Data Cleaning For Weather Data

**Strategy for cleaning the weather data:**

- Date -> Convert the Date column to a datetime object
- Depart -> Copy the Station 1 value into the Station 2 rows, where it is missing
- Sunrise -> Copy the Station 1 value into the Station 2 rows, where it is missing
- Sunset -> Copy the Station 1 value into the Station 2 rows, where it is missing
- CodeSum (rename to weather_type) -> Change blank entries to 'Moderate' (as in NOAA weather QCLCD documentation)
- Depth -> Drop, as 50% of the values are missing
- Water1 -> Drop, as every value is "M", which marks a missing value
- SnowFall -> Drop, as there is too little evidence of snowfall in the dataset
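
Because the raw file encodes missing values as text markers, `weather.info()` reports every column as fully non-null. A minimal sketch of how the markers could be counted per column is shown below. The `sample` frame is hypothetical toy data that only mimics the raw file, and treating `'M'`, `'-'` and `'T'` as the markers of interest is an assumption based on the values seen in this notebook.

```python
import pandas as pd

# Hypothetical toy frame that mimics the raw file: missing values are text markers
sample = pd.DataFrame({
    'Station': [1, 2, 1, 2],
    'Depart': ['14', 'M', '-3', 'M'],
    'Sunrise': ['0448', '-', '0447', '-'],
    'PrecipTotal': ['0.0', '0.0', '  T', '0.12'],
})

# Work on the text columns only and strip whitespace so '  T' is counted as 'T'
text_cols = sample.drop(columns='Station').astype(str)
stripped = text_cols.apply(lambda col: col.str.strip())

# One row per column, one count per marker
marker_counts = pd.DataFrame({
    marker: (stripped == marker).sum() for marker in ['M', '-', 'T']
})
print(marker_counts)
```

On the real data the same idea would be applied to `weather` before any replacement is done.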

### 2.1 Depth, Water1, SnowFall Column Drop

**Depth**

In [5]:
weather['Depth'].value_counts()

0    1472
M    1472
Name: Depth, dtype: int64

The missing values ("M") all come from Station 2 and the zeros all come from Station 1, so the Depth column carries no information and will be dropped from the data frame.

**Water1**

In [6]:
weather['Water1'].value_counts()

M    2944
Name: Water1, dtype: int64

Every value in the Water1 column is missing, so this column will be dropped too.

**SnowFall**

In [7]:
weather['SnowFall'].value_counts()

M      1472
0.0    1459
  T      12
0.1       1
Name: SnowFall, dtype: int64

The data was recorded from [May to October](https://www.isws.illinois.edu/statecli/general/chicago-climate-narrative.htm) each year, so there is essentially no snowfall in the recorded period (12 trace readings and a single 0.1 reading). The SnowFall column will therefore be dropped.

In [8]:
# Drop columns Depth, Water1 & SnowFall
weather.drop(columns=['Depth', 'Water1', 'SnowFall'], inplace=True)

In [9]:
# Convert Date col to datetime object
weather['Date'] = pd.to_datetime(weather['Date'])

In [10]:
# Split the date into year and month for later data visualization
weather['Year'] = weather['Date'].dt.year
weather['Month'] = weather['Date'].dt.month

In [11]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month
0,1,2007-05-01,83,50,67,14,51,56,0,2,0448,1849,,0.0,29.1,29.82,1.7,27,9.2,2007,5
1,2,2007-05-01,84,52,68,M,51,57,0,3,-,-,,0.0,29.18,29.82,2.7,25,9.6,2007,5
2,1,2007-05-02,59,42,51,-3,42,47,14,0,0447,1850,BR,0.0,29.38,30.09,13.0,4,13.4,2007,5
3,2,2007-05-02,60,43,52,M,42,47,13,0,-,-,BR HZ,0.0,29.44,30.08,13.3,2,13.4,2007,5
4,1,2007-05-03,66,46,56,2,40,48,9,0,0446,1851,,0.0,29.39,30.12,11.7,7,11.9,2007,5


### 2.2 Data To Be Duplicate From Station 1 to Station 2

1. Depart
2. Sunrise
3. Sunset
4. Heat
5. Cool
6. PrecipTotal
7. AvgSpeed

In [12]:
# Replace the missing-value markers 'M' and '-' with NaN
replace = {'M' : np.nan, '-' : np.nan}
weather = weather.replace(replace)

In [13]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month
0,1,2007-05-01,83,50,67,14.0,51,56,0,2,448.0,1849.0,,0.0,29.1,29.82,1.7,27,9.2,2007,5
1,2,2007-05-01,84,52,68,,51,57,0,3,,,,0.0,29.18,29.82,2.7,25,9.6,2007,5
2,1,2007-05-02,59,42,51,-3.0,42,47,14,0,447.0,1850.0,BR,0.0,29.38,30.09,13.0,4,13.4,2007,5
3,2,2007-05-02,60,43,52,,42,47,13,0,,,BR HZ,0.0,29.44,30.08,13.3,2,13.4,2007,5
4,1,2007-05-03,66,46,56,2.0,40,48,9,0,446.0,1851.0,,0.0,29.39,30.12,11.7,7,11.9,2007,5


In [14]:
# The missing values alternate row by row because only Station 1 recorded these columns,
# so we fill each missing Station 2 value with the Station 1 value from the row above

fill_stn_2 = ['Depart','Sunrise','Sunset','Heat','Cool','PrecipTotal','AvgSpeed']

# Forward-fill the NaN values introduced above with the Station 1 data

for col in fill_stn_2:
    weather[col] = weather[col].ffill()
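
Forward-filling relies on each Station 2 row sitting directly below the Station 1 row for the same date. A hedged alternative that matches rows by date instead of by position is sketched below. The `sample` frame is hypothetical toy data, and the sketch assumes each date appears at most once for Station 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Station 2 is missing Depart and Sunrise
sample = pd.DataFrame({
    'Station': [1, 2, 1, 2],
    'Date': pd.to_datetime(['2007-05-01', '2007-05-01', '2007-05-02', '2007-05-02']),
    'Depart': ['14', np.nan, '-3', np.nan],
    'Sunrise': ['0448', np.nan, '0447', np.nan],
})

# Look-up table of Station 1 readings, indexed by date
stn_1 = sample[sample['Station'] == 1].set_index('Date')

for col in ['Depart', 'Sunrise']:
    # For every row, find the Station 1 value recorded on the same date
    from_stn_1 = sample['Date'].map(stn_1[col])
    # Fill only the cells that are actually missing
    sample[col] = sample[col].fillna(from_stn_1)

print(sample)
```

This gives the same result as the forward fill when the rows are ordered as in this dataset, but it does not depend on that ordering.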

In [15]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month
0,1,2007-05-01,83,50,67,14,51,56,0,2,448,1849,,0.0,29.1,29.82,1.7,27,9.2,2007,5
1,2,2007-05-01,84,52,68,14,51,57,0,3,448,1849,,0.0,29.18,29.82,2.7,25,9.6,2007,5
2,1,2007-05-02,59,42,51,-3,42,47,14,0,447,1850,BR,0.0,29.38,30.09,13.0,4,13.4,2007,5
3,2,2007-05-02,60,43,52,-3,42,47,13,0,447,1850,BR HZ,0.0,29.44,30.08,13.3,2,13.4,2007,5
4,1,2007-05-03,66,46,56,2,40,48,9,0,446,1851,,0.0,29.39,30.12,11.7,7,11.9,2007,5


**Fill blank CodeSum cells with 'Moderate'**

In [16]:
# Replace blank CodeSum entries (stored as a single space) with 'Moderate'
weather['CodeSum'] = weather['CodeSum'].replace(' ','Moderate')

In [17]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month
0,1,2007-05-01,83,50,67,14,51,56,0,2,448,1849,Moderate,0.0,29.1,29.82,1.7,27,9.2,2007,5
1,2,2007-05-01,84,52,68,14,51,57,0,3,448,1849,Moderate,0.0,29.18,29.82,2.7,25,9.6,2007,5
2,1,2007-05-02,59,42,51,-3,42,47,14,0,447,1850,BR,0.0,29.38,30.09,13.0,4,13.4,2007,5
3,2,2007-05-02,60,43,52,-3,42,47,13,0,447,1850,BR HZ,0.0,29.44,30.08,13.3,2,13.4,2007,5
4,1,2007-05-03,66,46,56,2,40,48,9,0,446,1851,Moderate,0.0,29.39,30.12,11.7,7,11.9,2007,5


### 2.3 Imputation of Wetbulb, StnPressure, SeaLevel with Median

For the WetBulb, StnPressure and SeaLevel columns, we will fill each missing value with the median of that column for the same month and year.
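
The cells in this section look up each missing row by its index. As a more general sketch of the same idea, the month-and-year median can be computed for every row with a grouped transform and used to fill the gaps in one step. The `sample` frame is hypothetical toy data with an already numeric column; on the real data the column would first need converting from text, for example with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing WetBulb value in each month
sample = pd.DataFrame({
    'Year': [2009, 2009, 2009, 2013, 2013, 2013],
    'Month': [6, 6, 6, 8, 8, 8],
    'WetBulb': [58.0, np.nan, 60.0, 64.0, 66.0, np.nan],
})

# Median of the column within each (Year, Month) group, aligned to every row
monthly_median = sample.groupby(['Year', 'Month'])['WetBulb'].transform('median')

# Fill only the missing cells with the median of their own month and year
sample['WetBulb'] = sample['WetBulb'].fillna(monthly_median)

print(sample)
```

The median skips missing values by default, so the gaps do not affect the value that is imputed.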

**Wetbulb**

In [18]:
weather[weather['WetBulb'].isnull()]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month
848,1,2009-06-26,86,69,78,7,60,,0,13,418,1931,Moderate,0.0,,29.85,6.4,4,8.2,2009,6
2410,1,2013-08-10,81,64,73,0,57,,0,8,454,1900,Moderate,0.0,,30.08,5.3,5,6.5,2013,8
2412,1,2013-08-11,81,60,71,-2,61,,0,6,455,1859,RA,0.01,29.35,30.07,2.0,27,3.0,2013,8
2415,2,2013-08-12,85,69,77,2,63,,0,12,456,1858,RA,0.66,29.27,29.92,4.5,26,7.7,2013,8


In [19]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Station      2944 non-null   int64         
 1   Date         2944 non-null   datetime64[ns]
 2   Tmax         2944 non-null   int64         
 3   Tmin         2944 non-null   int64         
 4   Tavg         2933 non-null   object        
 5   Depart       2944 non-null   object        
 6   DewPoint     2944 non-null   int64         
 7   WetBulb      2940 non-null   object        
 8   Heat         2944 non-null   object        
 9   Cool         2944 non-null   object        
 10  Sunrise      2944 non-null   object        
 11  Sunset       2944 non-null   object        
 12  CodeSum      2944 non-null   object        
 13  PrecipTotal  2944 non-null   object        
 14  StnPressure  2940 non-null   object        
 15  SeaLevel     2935 non-null   object        
 16  Result

In [20]:
# Impute the median WetBulb for June '09 and August '13 into the cells with null values.
# WetBulb still holds text at this point, so convert it to numbers before taking the median.

wet_bulb = pd.to_numeric(weather['WetBulb'])
jun_09 = wet_bulb[(weather['Month'] == 6) & (weather['Year'] == 2009)].median()
aug_13 = wet_bulb[(weather['Month'] == 8) & (weather['Year'] == 2013)].median()

# Assign the median to each row according to its month and year

weather.at[848, 'WetBulb'] = jun_09 
weather.at[2410, 'WetBulb'] = aug_13 
weather.at[2412, 'WetBulb'] = aug_13 
weather.at[2415, 'WetBulb'] = aug_13 

In [21]:
weather['WetBulb'].isnull().sum()

0

**StnPressure**

In [22]:
weather[weather['StnPressure'].isnull()]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month
87,2,2007-06-13,86,68,77,6,53,62,0,12,416,1927,Moderate,0.0,,,7.0,5,8.6,2007,6
848,1,2009-06-26,86,69,78,7,60,59,0,13,418,1931,Moderate,0.0,,29.85,6.4,4,8.2,2009,6
2410,1,2013-08-10,81,64,73,0,57,65,0,8,454,1900,Moderate,0.0,,30.08,5.3,5,6.5,2013,8
2411,2,2013-08-10,81,68,75,0,55,63,0,10,454,1900,Moderate,0.0,,30.07,6.0,6,7.4,2013,8


In [23]:
# Impute the median StnPressure for June '07, June '09 and August '13 into the cells with null values.
# StnPressure still holds text at this point, so convert it to numbers before taking the median.

stn_pressure = pd.to_numeric(weather['StnPressure'])
sp_jun_07 = round(stn_pressure[(weather['Month'] == 6) & (weather['Year'] == 2007)].median(),2)
sp_jun_09 = round(stn_pressure[(weather['Month'] == 6) & (weather['Year'] == 2009)].median(),2)
sp_aug_13 = round(stn_pressure[(weather['Month'] == 8) & (weather['Year'] == 2013)].median(),2)

# Assign the median to each row according to its month and year

weather.at[87, 'StnPressure'] = sp_jun_07
weather.at[848, 'StnPressure'] = sp_jun_09 
weather.at[2410, 'StnPressure'] = sp_aug_13
weather.at[2411, 'StnPressure'] = sp_aug_13

In [24]:
weather['StnPressure'].isnull().sum()

0

**SeaLevel**

In [25]:
weather[weather['SeaLevel'].isnull()]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month
87,2,2007-06-13,86,68,77.0,6,53,62,0,12,416,1927,Moderate,0.00,29.33,,7.0,5,8.6,2007,6
832,1,2009-06-18,80,61,71.0,1,63,67,0,6,416,1929,RA BR,0.12,29.08,,6.7,16,7.9,2009,6
994,1,2009-09-07,77,59,68.0,1,59,62,0,3,523,1817,BR,0.00,29.39,,5.8,3,4.0,2009,9
1732,1,2011-09-08,75,57,66.0,0,53,59,0,1,524,1815,RA,T,29.34,,13.0,2,13.4,2011,9
1745,2,2011-09-14,60,48,54.0,-11,45,51,11,0,530,1805,RA BR HZ FU,T,29.47,,6.0,32,7.3,2011,9
1756,1,2011-09-20,74,49,62.0,0,54,58,3,0,537,1753,MIFG BCFG BR,0.00,29.26,,7.3,18,7.3,2011,9
2067,2,2012-08-22,84,72,,-1,51,61,0,5,506,1843,Moderate,0.00,29.39,,4.7,19,5.8,2012,8
2090,1,2012-09-03,88,71,80.0,12,70,73,0,15,519,1824,BR,0.00,29.17,,4.6,6,4.4,2012,9
2743,2,2014-07-23,76,64,70.0,-7,56,61,0,5,436,1920,Moderate,0.00,29.47,,16.4,2,16.7,2014,7


In [26]:
# Impute the median SeaLevel for June '07 and '09, July '14, August '12 and September '09, '11 and '12
# into the cells with null values. SeaLevel still holds text here, so convert it to numbers first.
sea_level = pd.to_numeric(weather['SeaLevel'])
sl_jun_07 = round(sea_level[(weather['Month'] == 6) & (weather['Year'] == 2007)].median(),2)
sl_jun_09 = round(sea_level[(weather['Month'] == 6) & (weather['Year'] == 2009)].median(),2)
sl_jul_14 = round(sea_level[(weather['Month'] == 7) & (weather['Year'] == 2014)].median(),2)
sl_aug_12 = round(sea_level[(weather['Month'] == 8) & (weather['Year'] == 2012)].median(),2)
sl_sep_09 = round(sea_level[(weather['Month'] == 9) & (weather['Year'] == 2009)].median(),2)
sl_sep_11 = round(sea_level[(weather['Month'] == 9) & (weather['Year'] == 2011)].median(),2)
sl_sep_12 = round(sea_level[(weather['Month'] == 9) & (weather['Year'] == 2012)].median(),2)

# Assign the median to each row according to its month and year
weather.at[87, 'SeaLevel'] = sl_jun_07
weather.at[832, 'SeaLevel'] = sl_jun_09
weather.at[994, 'SeaLevel'] = sl_sep_09
weather.at[1732, 'SeaLevel'] = sl_sep_11
weather.at[1745, 'SeaLevel'] = sl_sep_11
weather.at[1756, 'SeaLevel'] = sl_sep_11
weather.at[2067, 'SeaLevel'] = sl_aug_12
weather.at[2090, 'SeaLevel'] = sl_sep_12
weather.at[2743, 'SeaLevel'] = sl_jul_14

In [27]:
weather['SeaLevel'].isnull().sum()

0

### 2.4 PrecipTotal

Since the data was recorded from May to October each year, a "T" in PrecipTotal marks a trace of rain, an amount too small to measure. We replace T with 0.005, following [WGNTV.com](https://wgntv.com/wp-content/uploads/sites/5/2018/02/ask_tom_2018.png?w=640).
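
The trace marker is stored with leading spaces in the raw file, so an exact match on `'T'` would miss it. A minimal sketch of a whitespace-tolerant version of the replacement is shown below. The `raw` series is hypothetical toy data and `TRACE_VALUE` is a name introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy values: the trace marker appears with and without padding
raw = pd.Series(['0.0', '  T', '0.12', 'T', '1.23'], name='PrecipTotal')

TRACE_VALUE = 0.005  # value used in this notebook for a trace amount

# Strip the padding, swap the marker for the trace value, then convert to numbers
cleaned = raw.str.strip().replace('T', str(TRACE_VALUE))
precip = pd.to_numeric(cleaned)

print(precip)
```

Converting in the same step also leaves the column as floats, so no separate type change is needed for it later.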

In [28]:
# Replace 'T' (stored with leading spaces as '  T') with 0.005

weather['PrecipTotal'] = weather['PrecipTotal'].replace('  T', 0.005)

In [29]:
weather['PrecipTotal'].value_counts()

0.00     1579
0.005     318
0.01      127
0.02       63
0.03       46
0.04       36
0.05       32
0.12       28
0.08       28
0.06       27
0.07       23
0.09       21
0.16       21
0.11       20
0.14       20
0.17       17
0.28       15
0.18       14
0.13       14
0.19       14
0.20       13
0.15       13
0.23       11
0.25       11
0.26       11
0.24       10
0.10       10
0.21        9
0.29        9
0.39        9
0.31        9
0.43        9
0.40        9
0.22        8
0.34        8
0.32        8
0.33        7
0.50        7
0.59        7
0.30        7
0.41        7
0.45        7
0.48        7
0.37        7
0.65        6
0.63        6
0.27        6
0.84        6
0.80        6
0.93        5
0.85        5
0.92        5
0.44        5
0.68        5
0.54        5
0.36        5
0.72        4
0.70        4
0.89        4
0.55        4
0.51        4
0.58        4
0.66        4
0.97        4
0.64        4
1.23        4
0.75        4
0.52        4
0.87        3
1.01        3
0.77        3
0.42  

## 3. Change Weather Data Types

Check the data type of each column in the weather data frame and change it where necessary.

In [30]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Station      2944 non-null   int64         
 1   Date         2944 non-null   datetime64[ns]
 2   Tmax         2944 non-null   int64         
 3   Tmin         2944 non-null   int64         
 4   Tavg         2933 non-null   object        
 5   Depart       2944 non-null   object        
 6   DewPoint     2944 non-null   int64         
 7   WetBulb      2944 non-null   object        
 8   Heat         2944 non-null   object        
 9   Cool         2944 non-null   object        
 10  Sunrise      2944 non-null   object        
 11  Sunset       2944 non-null   object        
 12  CodeSum      2944 non-null   object        
 13  PrecipTotal  2944 non-null   object        
 14  StnPressure  2944 non-null   object        
 15  SeaLevel     2944 non-null   object        
 16  Result

In [31]:
# Convert the following columns' dtype to float

float_cols = ['Tmax', 'Tmin', 'Tavg', 'WetBulb', 'Heat', 'Cool',
              'PrecipTotal', 'StnPressure', 'SeaLevel', 'AvgSpeed']
weather = weather.astype({col: float for col in float_cols})

weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Station      2944 non-null   int64         
 1   Date         2944 non-null   datetime64[ns]
 2   Tmax         2944 non-null   float64       
 3   Tmin         2944 non-null   float64       
 4   Tavg         2933 non-null   float64       
 5   Depart       2944 non-null   object        
 6   DewPoint     2944 non-null   int64         
 7   WetBulb      2944 non-null   float64       
 8   Heat         2944 non-null   float64       
 9   Cool         2944 non-null   float64       
 10  Sunrise      2944 non-null   object        
 11  Sunset       2944 non-null   object        
 12  CodeSum      2944 non-null   object        
 13  PrecipTotal  2944 non-null   float64       
 14  StnPressure  2944 non-null   float64       
 15  SeaLevel     2944 non-null   float64       
 16  Result

There are 11 null values in the 'Tavg' column, so we clean that column next.

### 3.1 Tavg

In [32]:
# Check the rows where Tavg is null
weather[weather['Tavg'].isnull()]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month
7,2,2007-05-04,78.0,51.0,,4,42,50.0,7.0,0.0,444,1852,Moderate,0.0,29.36,30.04,10.1,7,10.4,2007,5
505,2,2008-07-08,86.0,46.0,,5,68,71.0,0.0,13.0,424,1929,TS RA,0.28,29.16,29.8,7.4,24,8.3,2008,7
675,2,2008-10-01,62.0,46.0,,-4,41,47.0,11.0,0.0,548,1734,Moderate,0.0,29.3,29.96,10.9,33,11.0,2008,10
1637,2,2011-07-22,100.0,71.0,,5,70,74.0,0.0,14.0,435,1921,TS TSRA BR,0.14,29.23,29.86,3.8,10,8.2,2011,7
2067,2,2012-08-22,84.0,72.0,,-1,51,61.0,0.0,5.0,506,1843,Moderate,0.0,29.39,29.97,4.7,19,5.8,2012,8
2211,2,2013-05-02,71.0,42.0,,-5,39,45.0,16.0,0.0,447,1850,Moderate,0.0,29.51,30.17,15.8,2,16.1,2013,5
2501,2,2013-09-24,91.0,52.0,,-1,48,54.0,5.0,0.0,541,1746,Moderate,0.0,29.33,30.0,5.8,9,7.7,2013,9
2511,2,2013-09-29,84.0,53.0,,1,48,54.0,5.0,0.0,546,1737,RA BR,0.22,29.36,30.01,6.3,36,7.8,2013,9
2525,2,2013-10-06,76.0,48.0,,-1,44,50.0,10.0,0.0,554,1725,RA DZ BR,0.06,29.1,29.76,10.1,25,10.6,2013,10
2579,2,2014-05-02,80.0,47.0,,-4,43,47.0,15.0,0.0,447,1850,RA,0.04,29.1,29.79,10.7,23,11.9,2014,5


As the table above shows, all the missing values are from Station 2. Using the data we have, we can impute the average temperature with the formula: Average Temperature = (Max Temperature + Min Temperature) / 2.
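
A small worked sketch of this formula follows, filling only the missing cells. The `sample` frame is hypothetical toy data. Rounding halves up is an assumption made here because the recorded rows shown earlier behave that way (for example Tmax 83 and Tmin 50 are recorded as Tavg 67), whereas `Series.round` rounds halves to the nearest even number.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the first row has a recorded Tavg, the others do not
sample = pd.DataFrame({
    'Tmax': [83.0, 78.0, 86.0],
    'Tmin': [50.0, 51.0, 46.0],
    'Tavg': [67.0, np.nan, np.nan],
})

# Midpoint of the daily maximum and minimum: 66.5, 64.5 and 66.0
midpoint = (sample['Tmax'] + sample['Tmin']) / 2

# floor(x + 0.5) rounds halves up, so 64.5 becomes 65 rather than 64
estimate = np.floor(midpoint + 0.5)

# Keep the recorded values and fill only the missing ones
sample['Tavg'] = sample['Tavg'].fillna(estimate)

print(sample)
```

Using `fillna` keeps every recorded average untouched, so only the gaps are replaced by the estimate.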

In [33]:
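# Fill only the missing Tavg values, rounding halves up as in the recorded rows (83 and 50 give 67)
weather['Tavg'] = weather['Tavg'].fillna(np.floor((weather['Tmax'] + weather['Tmin']) / 2 + 0.5))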

In [34]:
#check for any null value in 'Tavg' column
weather['Tavg'].isnull().sum()

0

## 4. Export Data

In [35]:
# Export file to csv

weather.to_csv('../datasets/weather_clean.csv', index=False)