# Introduction

|Feature|Description|Unit|
|---|---|---|
|Issue Time|Time the warning was issued|UTC|
|Valid From|Time the warning was valid from|UTC|
|Valid To|Time the warning was valid to|UTC|
|Warning Colour|Colour status of the weather warning (severity)|String|
|Warning Element|Weather concern (rain, fog etc.)|String|
|WhereToText|Location of weather warning|String|
|Warning Text|Description of weather warning|String|


Following these, there is one column for each county and province. Each holds a boolean value, TRUE or FALSE, indicating whether the weather warning applies to that location.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# read the .ods file (engine='odf' requires the odfpy package)
# approach from https://stackoverflow.com/questions/17834995/how-to-convert-opendocument-spreadsheets-to-a-pandas-dataframe

df = pd.read_excel('/Users/rebeccadillon/git/dublin-bus-team-5/machinelearning/data/raw_data/Archived_Wx_Warnings_25April2012_17February2021.ods', engine='odf')

In [None]:
df.info()

The first step is to drop all county and province columns apart from the Dublin column, as they are not applicable to the model.

In [None]:
df = df.drop(columns=['Munster','Clare','Cork','Kerry','Limerick','Tipperary','Tipperary SR','Waterford','Leinster','Carlow','Kildare','Kilkenny','Laois','Longford','Louth','Meath','Offaly','Westmeath','Wexford','Wicklow','Ulster','Cavan','Donegal','Monaghan','Connacht','Galway','Leitrim','Mayo','Roscommon','Sligo'])
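
The drop above lists every county and province by name. As a minimal sketch of an alternative, the columns to drop could be derived from a list of columns to keep. The toy frame and the `keep` list below are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy frame standing in for the warnings data (values are made up).
toy = pd.DataFrame({
    'Issue Time': pd.to_datetime(['2018-01-01 10:00']),
    'Warning Colour': ['Yellow'],
    'Dublin': [True],
    'Cork': [False],
    'Galway': [True],
})

# Hypothetical list of columns to keep; everything else is treated as a location column.
keep = ['Issue Time', 'Warning Colour', 'Dublin']
location_cols = [c for c in toy.columns if c not in keep]

trimmed = toy.drop(columns=location_cols)
print(trimmed.columns.tolist())
```

This avoids a long hard-coded list, at the cost of silently dropping any unexpected new column, so the explicit list may still be preferable when the schema is fixed.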

From the above we can see there are now only 8 columns and 1654 rows in the dataframe.

In [None]:
df.dtypes 

In [None]:
# change object and boolean columns to categorical

# select object and boolean columns
object_cols = df.select_dtypes(['object', 'bool']).columns

# change to categorical
for col in object_cols:
    df[col] = df[col].astype('category')
df.dtypes

In [None]:
# check cardinality
df.nunique()

We can see there are several categorical columns with large values for cardinality. The usefulness of these columns will have to be further examined. As the 'Dublin' feature contains 2 unique values, True and False, the rows containing False will be dropped from the dataframe.

In [None]:
df = df[df['Dublin'] != False]
df['Dublin']

Drop rows for weather warnings outside the required timeframe (2018). Keep only warnings whose validity period overlaps 2018, that is, rows where 'Valid From' falls before the end of 2018 and 'Valid To' falls on or after 01/01/2018.

In [None]:
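# keep warnings whose validity period overlaps 2018
# ('2018-12-31' would be read as midnight and exclude warnings starting later that day)
df_2018 = df[(df['Valid From'] < '2019-01-01') & (df['Valid To'] >= '2018-01-01')]
df = df_2018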
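
The filter above is an interval-overlap test: a warning is kept when it starts before the window ends and finishes on or after the window starts. A small sketch on made-up rows shows which cases pass; the toy data and the `start`/`end` names are assumptions for illustration.

```python
import pandas as pd

# Made-up validity periods: before 2018, spanning New Year, inside 2018, after 2018.
toy = pd.DataFrame({
    'Valid From': pd.to_datetime(['2017-12-20 09:00', '2017-12-31 18:00',
                                  '2018-06-01 06:00', '2019-01-02 00:00']),
    'Valid To': pd.to_datetime(['2017-12-21 09:00', '2018-01-01 12:00',
                                '2018-06-02 06:00', '2019-01-03 00:00']),
})

start = pd.Timestamp('2018-01-01')  # inclusive lower bound
end = pd.Timestamp('2019-01-01')    # exclusive upper bound

# A row overlaps the window if it starts before `end` and finishes on or after `start`.
overlaps = (toy['Valid From'] < end) & (toy['Valid To'] >= start)
toy_2018 = toy[overlaps]
print(toy_2018)
```

Only the warning spanning New Year and the one inside 2018 are kept.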

In [None]:
# reset_index returns a new dataframe, so assign the result back
df = df.reset_index(drop=True)

In [None]:
# check for duplicate rows
df.duplicated().value_counts()
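
If the check above reports any duplicates, they could be removed with `drop_duplicates`. The following is a sketch on made-up rows, not output from the real data.

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are made up).
toy = pd.DataFrame({
    'Warning Colour': ['Yellow', 'Yellow', 'Orange'],
    'Warning Element': ['Wind', 'Wind', 'Rain'],
})

# duplicated() flags every repeat of an earlier row; summing counts them.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```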

## Descriptive statistics
### Continuous features

In [None]:
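# first select the datetime columns
continuous_cols = df.select_dtypes(['datetime64[ns]']).columns

# descriptive statistics; datetime_is_numeric applies to pandas 1.1 to 1.x
# and was removed in pandas 2.0, where datetimes are always summarised numerically
con_descriptive_df = df[continuous_cols].describe(datetime_is_numeric=True).T

con_descriptive_df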

In [None]:
for col in continuous_cols:
    df[col].hist(figsize=(15,5))
    plt.title(col)
    plt.show()

### Categorical features

In [None]:
# select categorical columns
categorical_cols = df.select_dtypes(['category']).columns

# cardinality (number of unique values) of each categorical column
cardinality = df[categorical_cols].nunique()
cardinality

In [None]:
null_count = df[categorical_cols].isnull().sum()
null_count

In [None]:
df[categorical_cols].describe()

In [None]:
# plot the 20 most frequent values of each high-cardinality column
high_card_cols = ['WhereToText', 'Warning Text']
for col in high_card_cols:
    df[col].value_counts(dropna=True)[:20].plot(kind='bar', title=col, figsize=(15,5))
    plt.show()

In [None]:
# plot all value counts for each low-cardinality column
low_card_cols = ['Warning Colour', 'Warning Element', 'Dublin']
for col in low_card_cols:
    df[col].value_counts(dropna=True).plot(kind='bar', title=col, figsize=(15,5))
    plt.show()
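
The high- and low-cardinality column lists used for the plots above are written out by hand. As a rough sketch, the split could instead be derived from `nunique()` with a cut-off. The `threshold` value and the toy data below are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy categorical frame (values are made up).
toy = pd.DataFrame({
    'Warning Colour': ['Yellow', 'Orange', 'Yellow', 'Red'],
    'Warning Element': ['Wind', 'Rain', 'Wind', 'Snow/Ice'],
    'Warning Text': ['text a', 'text b', 'text c', 'text d'],
}).astype('category')

threshold = 3  # hypothetical cut-off between low and high cardinality

cardinality = toy.nunique()
high_card = cardinality[cardinality > threshold].index.tolist()
low_card = cardinality[cardinality <= threshold].index.tolist()
print(high_card, low_card)
```

A suitable threshold depends on the dataset, so the resulting lists should still be checked by eye.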

In [None]:
df = df.reset_index(drop=True)

In [None]:
df.to_csv('data/weather-wearnings-cleaned.csv', index=False)

# Data Quality Plan

|Feature|Data Quality Issue|Action|
|---|---|---|
|Issue Time|Rows outside of required timeframe|Dropped rows in initial cleaning|
| |Column does not add information|Drop column|
|Valid From|Rows outside of required timeframe|Dropped rows in initial cleaning|
|Valid To|Rows outside of required timeframe|Dropped rows in initial cleaning|
|Warning Colour|No issues|Keep column|
|Warning Element|No issues with the data, but relevance is debatable|Keep column and assess relevance later|
|WhereToText|Unnecessary due to Dublin column|Drop column|
|Warning Text|Unnecessary due to high cardinality and nature of model|Drop column|
|Dublin|One value, does not add information|Drop column|

Based on the issues identified in the data quality plan above, the following columns can be dropped from the dataframe:
* Issue Time
* WhereToText
* Warning Text
* Dublin

In [28]:
df.drop(columns=['Issue Time', 'WhereToText','Warning Text','Dublin'], inplace=True)

In [29]:
df.head()

Unnamed: 0,Valid From,Valid To,Warning Colour,Warning Element
0,2018-01-02 17:00:00,2018-01-03 21:00:00,Yellow,Wind
1,2018-01-02 17:00:00,2018-01-02 22:00:00,Orange,Wind
2,2018-01-02 16:00:00,2018-01-02 21:00:00,Orange,Wind
3,2018-01-02 16:00:00,2018-01-02 22:00:00,Orange,Wind
4,2018-01-02 22:00:00,2018-01-03 14:00:00,Yellow,Wind


In [30]:
# Save cleaned df to file
df.to_csv('/Users/rebeccadillon/git/dublin-bus-team-5/machinelearning/data/cleaned/weather-wearnings-cleaned-dqp.csv', index=False)
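
Saving to CSV discards the datetime and categorical dtypes, so they need to be restored when the file is read back. The sketch below round-trips a made-up row through an in-memory buffer rather than the real file; the column names follow the cleaned dataframe above.

```python
import io
import pandas as pd

# One made-up row with the same columns as the cleaned dataframe.
toy = pd.DataFrame({
    'Valid From': pd.to_datetime(['2018-01-02 17:00:00']),
    'Valid To': pd.to_datetime(['2018-01-03 21:00:00']),
    'Warning Colour': pd.Categorical(['Yellow']),
    'Warning Element': pd.Categorical(['Wind']),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the dtypes explicitly when reading the CSV back.
restored = pd.read_csv(
    buffer,
    parse_dates=['Valid From', 'Valid To'],
    dtype={'Warning Colour': 'category', 'Warning Element': 'category'},
)
print(restored.dtypes)
```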