# 1. Introduction

The purpose of this notebook is to perform initial cleaning of the Met Éireann weather data and to export the cleaned data to a CSV file.

# 2. Setup & Data Load

Import required modules and packages:

In [None]:
# import pandas for data analysis
import pandas as pd

# import convert_timestamp for various timestamp conversion functions
import convert_timestamp

Set the max number of columns & rows to display:

In [None]:
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 500)

Load the weather data from a CSV file, treating `\N` as a missing value:

In [None]:
df_weather = pd.read_csv('/data_analytics/data/weather.csv', sep=",", na_values=['\\N'])
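
The effect of `na_values=['\\N']` can be illustrated on a small in-memory CSV. This is a minimal sketch with made-up values, not the real weather file; it assumes the source file marks missing readings with the literal two characters `\N`, which is what the load call above implies.

```python
import io
import pandas as pd

# Toy CSV text; the last row uses the literal two-character marker \N for rain
csv_text = "record_date,temp,rain\n1,10.5,0.0\n2,11.0,\\N\n"

# Without na_values, the marker is kept as text, so rain is not a numeric column
df_raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, the marker becomes NaN and rain is read as a float column
df_na = pd.read_csv(io.StringIO(csv_text), na_values=['\\N'])

print(df_raw['rain'].dtype)
print(df_na['rain'].dtype)
print(df_na['rain'].isna().sum())
```

Reading the marker as NaN at load time means the missing rain values show up directly in `describe()` counts later on.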

In [None]:
rows, cols = df_weather.shape
print(f"\nBefore any data cleaning, the dataframe contains {rows} rows and {cols} columns.\n")

# 3. Check for Duplicate Rows & Columns

In [None]:
print()
print('Duplicate rows:', df_weather.duplicated().sum())
print('Duplicate columns:', df_weather.columns.size - df_weather.columns.unique().size)

There are no duplicate rows or columns so nothing needs to be dropped here.
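
Had duplicates been present, they could have been counted and removed along the following lines. This is a sketch on a toy dataframe with made-up values, not on the weather data itself.

```python
import pandas as pd

# Toy data: the second row repeats the first
df_toy = pd.DataFrame({'temp': [10.0, 10.0, 12.5], 'rain': [0.0, 0.0, 0.2]})

# duplicated() flags every repeat of an earlier row, so the sum counts them
n_duplicate_rows = int(df_toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default
df_deduped = df_toy.drop_duplicates()

print('Duplicate rows:', n_duplicate_rows)
print('Rows after dropping duplicates:', len(df_deduped))
```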

# 4. Assign Features as Continuous or Categorical

Print 5 rows from the dataframe:

In [None]:
df_weather.head(5)

Assign categorical and continuous features:

In [None]:
# Select columns containing categorical data
categorical_columns = df_weather[['record_date', 'irain', 'itemp', 'iwb']].columns

# Convert data type to 'Category' for these columns
for column in categorical_columns:
    df_weather[column] = df_weather[column].astype('category')

In [None]:
# Select columns containing continuous data 
# This is done by selecting columns with a numeric type - float64 or int64
continuous_columns = df_weather.select_dtypes(['float64', 'int64']).columns

In [None]:
# check that all are correctly assigned
df_weather.dtypes
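
The order of the two steps above matters: because the indicator columns are converted to `category` first, `select_dtypes` no longer picks them up as numeric. A minimal sketch on a toy dataframe (made-up values, a subset of the column names used in this notebook):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'record_date': ['a', 'b', 'c'],   # placeholder values, not real timestamps
    'irain': [0, 0, -1],              # integer-coded indicator
    'temp': [10.0, 11.5, 12.0],
    'rain': [0.0, 0.2, 0.0],
})

# Convert the categorical features first
for column in ['record_date', 'irain']:
    df_toy[column] = df_toy[column].astype('category')

# irain is now excluded even though its values are integers
continuous = df_toy.select_dtypes(['float64', 'int64']).columns.tolist()
print(continuous)
```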

# 5. Check for Constant Features

In [None]:
# Print details for the categorical columns
df_weather[categorical_columns].describe().T

In [None]:
# Print details for the continuous columns
df_weather[continuous_columns].describe().T

**itemp** and **iwb** are constant columns, so they can be dropped:

In [None]:
df_weather = df_weather.drop(columns=['itemp', 'iwb'])
categorical_columns = df_weather[['record_date', 'irain']].columns
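
Constant features were identified here by reading the `describe()` output. As an alternative, they could be detected programmatically with `nunique()`. The sketch below uses a toy dataframe with made-up values; it is not the method used in this notebook.

```python
import pandas as pd

df_toy = pd.DataFrame({
    'itemp': [0, 0, 0],
    'iwb': [0, 0, 0],
    'temp': [10.0, 11.5, 12.0],
})

# nunique() counts distinct non-null values per column;
# a column with at most one distinct value carries no information
unique_counts = df_toy.nunique()
constant_columns = unique_counts[unique_counts <= 1].index.tolist()

df_toy = df_toy.drop(columns=constant_columns)
print('Dropped constant columns:', constant_columns)
```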

# 6. Check for Missing Data

In [None]:
# Print details for the categorical columns
df_weather[categorical_columns].describe().T

In [None]:
# Print details for the continuous columns
df_weather[continuous_columns].describe().T

Investigate rows with missing data for rain:

In [None]:
# select all rows where irain is not 0
df_weather.loc[df_weather['irain'] != 0]

There are only two rows where irain is not zero; these rows correspond to the missing values for rain.

In [None]:
# select other rows around the missing values
df_weather[6220:6240]

Given that there is no rain for the rest of the day, and given the high (for Ireland) temperature on the day, I think it's safe to replace the missing rain values with 0.

I will then drop the feature **irain** as it provides no useful information.

In [None]:
# replace rain with 0 where irain is not 0 (i.e. where rain is missing)
df_weather.loc[df_weather['irain'] != 0, 'rain'] = 0

In [None]:
# check that values are updated
df_weather.loc[df_weather['irain'] != 0]

In [None]:
# drop the irain feature
df_weather = df_weather.drop(columns=['irain'])
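
The imputation described in this section keeps every row and only overwrites the missing rain values. A minimal sketch on a toy dataframe follows; the values are made up, and the indicator value -1 is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    'rain': [0.0, np.nan, 0.0, np.nan],
    'irain': [0, -1, 0, -1],   # -1 is an assumed "missing" indicator value
})
rows_before = len(df_toy)

# Boolean mask of rows flagged by the indicator
missing_mask = df_toy['irain'] != 0

# Assign through .loc so only the rain cells of flagged rows change
df_toy.loc[missing_mask, 'rain'] = 0

# The indicator is no longer needed once the values are imputed
df_toy = df_toy.drop(columns=['irain'])

print(len(df_toy) == rows_before, df_toy['rain'].isna().sum())
```

Filtering with the mask instead (`df_toy.loc[~missing_mask]`) would remove the rows rather than impute them, which is not what the data quality plan calls for.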

# 7. Drop Additional Features

The following features will be dropped as they are not available from our weather forecast source (OpenWeather API):

- wetb
- dewpt
- vappr

In [None]:
df_weather = df_weather.drop(columns=['wetb', 'dewpt', 'vappr'])

# 8. Export the Cleaned Data

In [None]:
df_weather.to_csv('/data_analytics/data/weather_cleaned.csv', index=False)

Import the data when required:

In [None]:
df_weather = pd.read_csv('/data_analytics/data/weather_cleaned.csv')
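
One thing to keep in mind when re-importing: a CSV file does not store pandas dtypes, so a column converted to `category` earlier comes back as plain text unless the dtype is requested again. A small sketch using an in-memory buffer and made-up values:

```python
import io
import pandas as pd

df_toy = pd.DataFrame({'record_date': ['a', 'b'], 'temp': [10.0, 11.5]})
df_toy['record_date'] = df_toy['record_date'].astype('category')

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_toy.to_csv(buffer, index=False)

# A plain read does not restore the category dtype
buffer.seek(0)
df_plain = pd.read_csv(buffer)

# Passing dtype restores it explicitly
buffer.seek(0)
df_typed = pd.read_csv(buffer, dtype={'record_date': 'category'})

print(df_plain['record_date'].dtype, df_typed['record_date'].dtype)
```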

# 9. Data Quality Plan

| Feature | Data Quality Issue | Handling Strategy |
|-------------------------|----------------------|------------------------------|
| itemp | Constant feature | Drop feature |
| iwb | Constant feature | Drop feature |
| rain | Missing data - 2 rows | Imputation - replace with 0 after looking at data for other timestamps on the same day |
| irain | Seems to be a missing data indicator | Drop feature as only two rows have missing data, and imputation is performed for these rows |
| wetb | Not available from OpenWeather | Drop feature |
| dewpt | Not available from OpenWeather | Drop feature |
| vappr | Not available from OpenWeather | Drop feature |

# 10. Create JSON File with Monthly Mean Temperatures

Create a month feature:

In [None]:
df_weather['month'] = df_weather['record_date'].map(convert_timestamp.timestamp_to_month_weather)
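
The `convert_timestamp` module is project-specific and not shown in this notebook. For readers without it, a pandas-only equivalent might look like the sketch below. It assumes, purely for illustration, that `record_date` holds date strings in a format `pd.to_datetime` can parse; the actual format and the helper's behaviour may differ.

```python
import pandas as pd

# Hypothetical record_date values; the real column format is an assumption here
df_toy = pd.DataFrame({'record_date': ['2018-01-15 13:00:00', '2018-07-01 00:00:00']})

# Parse the strings and extract the calendar month (1 to 12)
df_toy['month'] = pd.to_datetime(df_toy['record_date']).dt.month

print(df_toy['month'].tolist())
```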

Create a mean temp feature:

In [None]:
means = df_weather.groupby('month')['temp'].mean().round()

In [None]:
df_weather['temp_mean'] = df_weather['month'].map(means)
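
The two cells above first reduce the data to one mean per month and then broadcast that mean back onto every row. A toy example with made-up temperatures shows the shape of each step:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'month': [1, 1, 2, 2],
    'temp': [4.0, 6.5, 14.0, 15.5],
})

# One rounded mean per month: a Series indexed by month
means = df_toy.groupby('month')['temp'].mean().round()

# map() looks up each row's month in that Series
df_toy['temp_mean'] = df_toy['month'].map(means)

print(means.to_dict())
print(df_toy['temp_mean'].tolist())
```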

Create the JSON file:

In [None]:
import json

# Collect the rounded mean temperature for each month present in the data
monthly_means = {}
for i in range(1, 13):
    df_temp = df_weather.loc[df_weather.month == i]
    if df_temp.shape[0] != 0:
        monthly_means[str(i)] = float(df_temp['temp_mean'].iloc[0])

# json.dump produces valid JSON even if some months (including month 12) are missing;
# indent=0 writes one month per line
with open("/data_analytics/JSON/weather.json", "w", encoding="utf8") as file:
    json.dump(monthly_means, file, indent=0)
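
Once the file is written, it may be worth loading it back to confirm that it parses as valid JSON. The sketch below uses a temporary file and made-up values rather than the real output path. Note that JSON object keys are always strings, so the month keys need converting back to integers if they are to be compared with a numeric month.

```python
import json
import os
import tempfile

# Made-up monthly means standing in for the real output
monthly_means = {"1": 5.0, "2": 6.0}

# Write to a temporary location instead of the project's JSON folder
path = os.path.join(tempfile.mkdtemp(), "weather.json")
with open(path, "w", encoding="utf8") as f:
    json.dump(monthly_means, f)

# json.load raises json.JSONDecodeError if the file is not valid JSON
with open(path, encoding="utf8") as f:
    loaded = json.load(f)

# Convert the string keys back to integer months
loaded_by_month = {int(month): mean for month, mean in loaded.items()}
print(loaded_by_month)
```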