# 1. Introduction

This notebook performs initial cleaning of the Met Eireann weather data and exports the cleaned data to a CSV file.

# 2. Setup & Data Load

Import required modules and packages:

In [1]:
# import pandas for data analysis
import pandas as pd

Set the max number of columns & rows to display:

In [2]:
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 500)

Load the weather data from a CSV file, treating the literal string `\N` as a missing value:

In [3]:
df_weather = pd.read_csv('/data_analytics/data/weather.csv', sep=",", na_values=['\\N'])

In [4]:
# shape is a (rows, columns) tuple
rows, cols = df_weather.shape
print()
print(f"Before any data cleaning, the dataframe contains {rows} rows and {cols} columns.")
print()


Before any data cleaning, the dataframe contains 8761 rows and 11 columns.



# 3. Check for Duplicate Rows & Columns

In [5]:
print()
# duplicated() flags every row that repeats an earlier row
print('Duplicate rows:', df_weather.duplicated().sum())
# compares column names only; identical content under a different name is not detected
print('Duplicate columns:', df_weather.columns.duplicated().sum())


Duplicate rows: 0
Duplicate columns: 0


There are no duplicate rows or columns so nothing needs to be dropped here.
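
One caveat on the column check: `pd.read_csv` renames repeated header names by default (for example `temp` and `temp.1`), so a name-based comparison may not catch a column that was exported twice. The following is a minimal sketch of a content-based check. It uses a small made-up frame, and `df_demo` and `temp_copy` are hypothetical names, not part of the weather data.

```python
import pandas as pd

# Hypothetical toy frame: 'temp_copy' repeats 'temp' under a different name
df_demo = pd.DataFrame({
    'temp': [4.6, 4.7, 4.8],
    'temp_copy': [4.6, 4.7, 4.8],
    'rhum': [82, 81, 81],
})

# Transpose so each column becomes a row, then flag rows repeating an earlier row
duplicate_mask = df_demo.T.duplicated().to_numpy()
duplicate_columns = df_demo.columns[duplicate_mask].tolist()
print('Columns duplicating an earlier column:', duplicate_columns)
```

Transposing a frame with 8761 rows is cheap at this size, but it may be slow on much larger data.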

# 4. Assign Features as Continuous or Categorical

Print the first 5 rows of the dataframe:

In [6]:
df_weather.head(5)

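| | record_date | irain | rain | itemp | temp | iwb | wetb | dewpt | vappr | rhum | msl |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 00:00:00 | 0 | 0.0 | 0 | 4.6 | 0 | 3.5 | 1.8 | 6.9 | 82 | 991.0 |
| 1 | 2018-01-01 01:00:00 | 0 | 0.1 | 0 | 4.7 | 0 | 3.6 | 1.8 | 7.0 | 81 | 991.1 |
| 2 | 2018-01-01 02:00:00 | 0 | 0.0 | 0 | 4.8 | 0 | 3.7 | 1.9 | 7.0 | 81 | 991.1 |
| 3 | 2018-01-01 03:00:00 | 0 | 0.0 | 0 | 4.9 | 0 | 3.8 | 2.2 | 7.2 | 82 | 990.7 |
| 4 | 2018-01-01 04:00:00 | 0 | 0.0 | 0 | 5.3 | 0 | 4.1 | 2.3 | 7.2 | 81 | 990.3 |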


Assign categorical and continuous features:

In [7]:
# Select columns containing categorical data
categorical_columns = df_weather[['record_date', 'irain', 'itemp', 'iwb']].columns

# Convert data type to 'Category' for these columns
for column in categorical_columns:
    df_weather[column] = df_weather[column].astype('category')

In [8]:
# Select columns containing continuous data 
# This is done by selecting columns with a numeric type - float64 or int64
continuous_columns = df_weather.select_dtypes(['float64', 'int64']).columns

In [9]:
# check that all are correctly assigned
df_weather.dtypes

record_date    category
irain          category
rain            float64
itemp          category
temp            float64
iwb            category
wetb            float64
dewpt           float64
vappr           float64
rhum              int64
msl             float64
dtype: object

# 5. Check for Constant Features

In [10]:
# Print details for the categorical columns
df_weather[categorical_columns].describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| record_date | 8761 | 8761 | 2019-01-01 00:00:00 | 1 |
| irain | 8761 | 2 | 0 | 8759 |
| itemp | 8761 | 1 | 0 | 8761 |
| iwb | 8761 | 1 | 0 | 8761 |


In [11]:
# Print details for the continuous columns
df_weather[continuous_columns].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rain | 8759.0 | 0.078217 | 0.342693 | 0.0 | 0.0 | 0.0 | 0.0 | 8.6 |
| temp | 8761.0 | 10.449834 | 5.652881 | -4.5 | 6.3 | 10.0 | 14.5 | 27.5 |
| wetb | 8761.0 | 8.50323 | 4.679665 | -4.6 | 5.1 | 8.4 | 12.0 | 20.4 |
| dewpt | 8761.0 | 6.376955 | 4.650585 | -9.8 | 3.1 | 6.5 | 9.7 | 18.3 |
| vappr | 8761.0 | 10.031229 | 3.147053 | 2.9 | 7.6 | 9.7 | 12.1 | 21.0 |
| rhum | 8761.0 | 77.064262 | 14.043063 | 24.0 | 68.0 | 80.0 | 88.0 | 99.0 |
| msl | 8761.0 | 1013.116745 | 11.76403 | 979.5 | 1005.7 | 1014.9 | 1021.6 | 1041.7 |


**itemp** and **iwb** each contain a single unique value (0 in all 8761 rows), so both are constant columns and can be dropped:

In [12]:
df_weather = df_weather.drop(columns=['itemp', 'iwb'])
categorical_columns = df_weather[['record_date', 'irain']].columns
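
Constant columns were identified above by reading the `unique` column of the `describe()` output. As a rough sketch, the same check can be done programmatically with `nunique`. The frame below is made-up toy data, and `df_demo` is a hypothetical name.

```python
import pandas as pd

# Hypothetical toy frame mirroring the indicator columns above
df_demo = pd.DataFrame({
    'irain': [0, 0, -1, 0],
    'itemp': [0, 0, 0, 0],
    'iwb': [0, 0, 0, 0],
    'temp': [4.6, 4.7, 4.8, 4.9],
})

# A column is constant when it holds at most one distinct value.
# dropna=False counts NaN as a value, so a partly missing column is not flagged.
unique_counts = df_demo.nunique(dropna=False)
constant_columns = unique_counts[unique_counts <= 1].index.tolist()
print('Constant columns:', constant_columns)
```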

# 6. Check for Missing Data

In [13]:
# Print details for the categorical columns
df_weather[categorical_columns].describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| record_date | 8761 | 8761 | 2019-01-01 00:00:00 | 1 |
| irain | 8761 | 2 | 0 | 8759 |


In [14]:
# Print details for the continuous columns
df_weather[continuous_columns].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rain | 8759.0 | 0.078217 | 0.342693 | 0.0 | 0.0 | 0.0 | 0.0 | 8.6 |
| temp | 8761.0 | 10.449834 | 5.652881 | -4.5 | 6.3 | 10.0 | 14.5 | 27.5 |
| wetb | 8761.0 | 8.50323 | 4.679665 | -4.6 | 5.1 | 8.4 | 12.0 | 20.4 |
| dewpt | 8761.0 | 6.376955 | 4.650585 | -9.8 | 3.1 | 6.5 | 9.7 | 18.3 |
| vappr | 8761.0 | 10.031229 | 3.147053 | 2.9 | 7.6 | 9.7 | 12.1 | 21.0 |
| rhum | 8761.0 | 77.064262 | 14.043063 | 24.0 | 68.0 | 80.0 | 88.0 | 99.0 |
| msl | 8761.0 | 1013.116745 | 11.76403 | 979.5 | 1005.7 | 1014.9 | 1021.6 | 1041.7 |
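
The `count` for **rain** is 8759 while every other feature has 8761, which is how the two missing values show up in the summary. A more direct way to count missing values per column is `isna().sum()`. The sketch below runs on a made-up frame, and `df_demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing rain readings
df_demo = pd.DataFrame({
    'rain': [0.0, np.nan, np.nan, 0.1],
    'temp': [20.2, 20.6, 19.8, 19.4],
})

# isna() marks each missing cell; summing the booleans gives a count per column
missing_counts = df_demo.isna().sum()

# Keep only the columns that actually have missing values
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```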


Investigate rows with missing data for rain:

In [15]:
# select all rows where irain is not 0
df_weather.loc[df_weather['irain'] != 0]

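| | record_date | irain | rain | temp | wetb | dewpt | vappr | rhum | msl |
|---|---|---|---|---|---|---|---|---|---|
| 6230 | 2018-09-17 14:00:00 | -1 | NaN | 20.2 | 16.3 | 13.4 | 15.4 | 65 | 1003.0 |
| 6231 | 2018-09-17 15:00:00 | -1 | NaN | 20.6 | 16.5 | 13.6 | 15.5 | 63 | 1002.2 |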


Only two rows have a non-zero value for irain, and these are the same two rows where rain is missing.

In [16]:
# select other rows around the missing values
df_weather[6220:6240]

| | record_date | irain | rain | temp | wetb | dewpt | vappr | rhum | msl |
|---|---|---|---|---|---|---|---|---|---|
| 6220 | 2018-09-17 04:00:00 | 0 | 0.0 | 11.2 | 10.3 | 9.3 | 11.7 | 88 | 1011.2 |
| 6221 | 2018-09-17 05:00:00 | 0 | 0.0 | 10.9 | 10.1 | 9.2 | 11.6 | 89 | 1010.4 |
| 6222 | 2018-09-17 06:00:00 | 0 | 0.0 | 11.1 | 10.2 | 9.4 | 11.8 | 89 | 1010.1 |
| 6223 | 2018-09-17 07:00:00 | 0 | 0.0 | 11.7 | 10.6 | 9.5 | 11.9 | 86 | 1009.4 |
| 6224 | 2018-09-17 08:00:00 | 0 | 0.0 | 14.6 | 11.9 | 9.3 | 11.7 | 70 | 1008.2 |
| 6225 | 2018-09-17 09:00:00 | 0 | 0.0 | 15.8 | 12.3 | 9.1 | 11.6 | 64 | 1007.2 |
| 6226 | 2018-09-17 10:00:00 | 0 | 0.0 | 16.8 | 13.9 | 11.6 | 13.7 | 71 | 1006.6 |
| 6227 | 2018-09-17 11:00:00 | 0 | 0.0 | 17.6 | 15.0 | 13.0 | 15.0 | 74 | 1005.6 |
| 6228 | 2018-09-17 12:00:00 | 0 | 0.0 | 19.4 | 16.4 | 14.2 | 16.1 | 71 | 1004.6 |
| 6229 | 2018-09-17 13:00:00 | 0 | 0.0 | 19.8 | 16.4 | 13.8 | 15.8 | 68 | 1003.8 |


Given that there is no rain for the rest of the day, and given the high (for Ireland) temperature on the day, I think it's safe to replace the missing rain values with 0.

I will then drop the feature **irain** as it provides no useful information.
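
The reasoning above rests on the rest of the day being dry. One way to check this is to sum the recorded rain for that calendar day. The sketch below uses made-up readings: the date and values are illustrative and not taken from the dataset, and `df_demo` is a hypothetical name. In the notebook, the same filter would be applied to `df_weather` for 2018-09-17.

```python
import pandas as pd

# Hypothetical toy data: hourly readings spanning two days, one value missing
df_demo = pd.DataFrame({
    'record_date': ['2020-06-01 13:00:00', '2020-06-01 14:00:00',
                    '2020-06-01 15:00:00', '2020-06-02 00:00:00'],
    'rain': [0.0, None, 0.0, 0.4],
})

# Parse the timestamps, then strip the time part to compare calendar days
timestamps = pd.to_datetime(df_demo['record_date'])
same_day = df_demo[timestamps.dt.normalize() == pd.Timestamp('2020-06-01')]

# sum() skips NaN by default, so this totals the recorded values only
total_rain = same_day['rain'].sum()
print('Recorded rain on the day:', total_rain)
```

A total of 0 for the recorded hours would support replacing the missing values with 0.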

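In [17]:
# replace missing rain values with 0 where irain is not 0
df_weather.loc[df_weather['irain'] != 0, 'rain'] = 0.0

In [18]:
# check that values are updated
df_weather.loc[df_weather['irain'] != 0]

| | record_date | irain | rain | temp | wetb | dewpt | vappr | rhum | msl |
|---|---|---|---|---|---|---|---|---|---|
| 6230 | 2018-09-17 14:00:00 | -1 | 0.0 | 20.2 | 16.3 | 13.4 | 15.4 | 65 | 1003.0 |
| 6231 | 2018-09-17 15:00:00 | -1 | 0.0 | 20.6 | 16.5 | 13.6 | 15.5 | 63 | 1002.2 |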


In [19]:
# drop the irain feature
df_weather = df_weather.drop(columns=['irain'])

# 7. Drop Additional Features

The following features will be dropped as they are not available from our weather forecast source (OpenWeather API):

- wetb
- dewpt
- vappr

In [20]:
df_weather = df_weather.drop(columns=['wetb', 'dewpt', 'vappr'])

# 8. Export the Cleaned Data

In [21]:
df_weather.to_csv('/data_analytics/data/weather_cleaned.csv', index=False)
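
After exporting, it can be worth reloading the file to confirm that the shape and column names survived the round trip. This is a minimal sketch using a temporary directory and a small demo frame. The file name `weather_cleaned_demo.csv` and the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small demo frame standing in for the cleaned weather data
df_demo = pd.DataFrame({
    'record_date': ['2018-01-01 00:00:00', '2018-01-01 01:00:00'],
    'rain': [0.0, 0.1],
    'rhum': [82, 81],
})

# Write to a temporary location and read the file straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'weather_cleaned_demo.csv')
    df_demo.to_csv(path, index=False)
    df_check = pd.read_csv(path)

same_shape = df_check.shape == df_demo.shape
same_columns = list(df_check.columns) == list(df_demo.columns)
print('Round trip preserved shape and columns:', same_shape and same_columns)
```

CSV stores plain text, so `category` dtypes are not preserved on reload and would need to be reassigned after loading.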

# 9. Data Quality Plan

| Feature | Data Quality Issue | Handling Strategy |
|-------------------------|----------------------|------------------------------|
| itemp | Constant feature | Drop feature |
| iwb | Constant feature | Drop feature |
| rain | Missing data - 2 rows | Imputation - replace with 0 after looking at data for other timestamps on the same day |
| irain | Seems to be a missing data indicator | Drop feature as only two rows have missing data, and imputation is performed for these rows |
| wetb | Not available from OpenWeather | Drop feature |
| dewpt | Not available from OpenWeather | Drop feature |
| vappr | Not available from OpenWeather | Drop feature |
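
The handling strategies in the table could be gathered into a single function so that the same cleaning can be reapplied to a fresh download. This is only a sketch under that assumption: `clean_weather` and `df_demo` are hypothetical names, and the demo rows are made up.

```python
import numpy as np
import pandas as pd


def clean_weather(df):
    """Apply the handling strategies from the table to a copy of df."""
    df = df.copy()
    # rain: impute 0 where the indicator flags a missing reading
    df.loc[df['irain'] != 0, 'rain'] = 0.0
    # drop the constant, indicator, and unavailable features
    return df.drop(columns=['itemp', 'iwb', 'irain', 'wetb', 'dewpt', 'vappr'])


# Hypothetical two-row frame with the same columns as the raw data
df_demo = pd.DataFrame({
    'record_date': ['2020-01-01 00:00:00', '2020-01-01 01:00:00'],
    'irain': [0, -1], 'rain': [0.2, np.nan],
    'itemp': [0, 0], 'temp': [5.0, 5.5],
    'iwb': [0, 0], 'wetb': [4.0, 4.2],
    'dewpt': [2.0, 2.1], 'vappr': [7.0, 7.1],
    'rhum': [80, 81], 'msl': [1000.0, 1000.5],
})

df_clean = clean_weather(df_demo)
print(df_clean.columns.tolist())
```

Imputing 0 whenever irain is non-zero is only justified here because the two affected rows were inspected by hand. On new data, flagged rows should be reviewed before the function is applied.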