# Introduction
Before beginning the data quality report (DQR), I import the data set and remove all rows that contain weather data outside the required time period (i.e., the year 2018). Although this could have been done later, doing it first reduces the dataset dramatically in size, which makes it quicker and easier to work with throughout the DQR.

|Feature|Description|Unit|
|---|---|---|
|dt|Time of data calculation|UTC|
|dt_iso|Date and time in UTC format|UTC|
|timezone|Shift in seconds from UTC|seconds|
|city_name|City name|   |
|lat|Geographical coordinates of the location (latitude)||
|lon|Geographical coordinates of the location (longitude)||
|temp|Temperature|degrees Celsius|
|visibility|Average visibility. The maximum value of the visibility is 10 km.|metres|
|dew_point|Atmospheric temperature (varying according to pressure and humidity) below which water droplets begin to condense and dew can form|degrees Celsius|
|feels_like|This temperature parameter accounts for the human perception of weather|degrees Celsius|
|temp_min|Minimum temperature at the moment. This is the deviation from the current temperature that is possible for large, geographically expanded cities and megalopolises|degrees Celsius|
|temp_max|Maximum temperature at the moment. This is the deviation from the current temperature that is possible for large, geographically expanded cities and megalopolises|degrees Celsius|
|pressure| Atmospheric pressure (on the sea level)|hPa|
|sea_level| |   |
|grnd_level|    |   |
|humidity|Humidity|%|
|wind_speed|Wind speed|meter/sec|
|wind_deg|Wind direction|degrees (meteorological)|
|wind_gust|Wind gust|meter/sec|
|rain_1h|Rain volume for the last hour| mm|
|rain_3h|Rain volume for the last 3 hours| mm|
|snow_1h|Snow volume for the last hour, (in liquid state)| mm|
|snow_3h|Snow volume for the last 3 hours, (in liquid state)| mm|
|clouds_all|Cloudiness| %|
|weather_id|Weather condition ID|   |
|weather_main|Group of weather parameters (Rain, Snow, Extreme etc.)|   |
|weather_description|Weather condition within the group|    |
|weather_icon|Weather icon ID|  |

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns

In [None]:
df = pd.read_csv("/Users/rebeccadillon/git/dublin-bus-team-5/machinelearning/data/raw_data/dublin_weather_2018.csv")

In [None]:
df.head()

In [None]:
df.shape

There are 9060 rows of data across 28 columns.

In [None]:
df.dtypes

In [None]:
# print some descriptive statistics of the df
df.describe(datetime_is_numeric=True).T 

We can see that the columns 'sea_level', 'grnd_level', 'rain_3h' and 'snow_3h' contain no values (a count of zero) and will be dropped from the dataframe.

Columns to be dropped so far:
* sea_level
* grnd_level
* rain_3h
* snow_3h
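
The empty columns listed above were read off the `describe()` output by eye. As a minimal sketch of how the same list could be found programmatically, the cell below uses a small made-up frame (`toy` and its values are illustrative, not taken from the weather file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values are made up).
toy = pd.DataFrame({
    "temp": [4.2, 5.1, 3.8],
    "rain_1h": [np.nan, 0.3, np.nan],
    "sea_level": [np.nan, np.nan, np.nan],
    "rain_3h": [np.nan, np.nan, np.nan],
})

# A column is entirely empty when every one of its values is null.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)
```

Note that 'rain_1h' is only partly missing in the toy frame, so it is not picked up by this check.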

## Changing dtypes

Change dt_iso column to datetime. Code from https://www.datasciencesociety.net/weather-proof-mobility/

In [None]:
df['dt_iso'] = df['dt_iso'].apply(lambda x: pd.to_datetime(x[:-10], infer_datetime_format=True))
df['dt_iso']
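
The conversion above slices the last 10 characters off each string before parsing. A vectorised sketch of the same idea follows, assuming the 'dt_iso' strings look like `2018-01-01 00:00:00 +0000 UTC` (the sample strings and the explicit `format` are my assumptions, not taken from the file):

```python
import pandas as pd

# Hypothetical sample strings in the layout assumed above.
raw = pd.Series([
    "2018-01-01 00:00:00 +0000 UTC",
    "2018-06-15 13:00:00 +0000 UTC",
])

# Drop the trailing 10 characters (" +0000 UTC") and parse the rest in one call.
parsed = pd.to_datetime(raw.str[:-10], format="%Y-%m-%d %H:%M:%S")
print(parsed.dt.year.tolist())
```

Working on the whole column at once avoids calling `pd.to_datetime` once per row.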

In [None]:
# select all columns with object datatype
categorical_cols = df.select_dtypes(['object']).columns
categorical_cols

In [None]:
# select columns with categorical data and add to list
categorical_cols = categorical_cols.append(df[['timezone', 'weather_id']].columns)
categorical_cols

In [None]:
# convert columns in the list to categorical columns
for col in categorical_cols:
    df[col] = df[col].astype('category')
df.dtypes

In [None]:
continuous_cols = df.select_dtypes(['float64','datetime64[ns]','int64']).columns
continuous_cols

## Check for duplicate rows and null values

In [None]:
# check for duplicate rows
df.duplicated().value_counts()

In [None]:
df.isnull().sum()

We can see that there are additional columns with missing values ('visibility', 'wind_gust', 'rain_1h' and 'snow_1h'), which will be examined further.

In [None]:
df.nunique()

The results above show that the columns 'city_name', 'lat' and 'lon' contain just one unique value, so the information gain from these columns is likely limited. The use of these columns will be further examined and they may be dropped from the dataframe later.
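
As a small illustration of picking out single-valued columns from the `nunique()` result, here is a sketch on a made-up frame (the `toy` values, including the coordinates, are placeholders):

```python
import pandas as pd

# Toy frame with two constant columns and one varying column.
toy = pd.DataFrame({
    "city_name": ["Dublin", "Dublin", "Dublin"],
    "lat": [53.35, 53.35, 53.35],
    "temp": [4.2, 5.1, 3.8],
})

# Columns with a single distinct value cannot help separate one row from another.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```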

## Check the logical integrity of the data
#### Continuous features
I will first check that there are no negative values in columns which should not logically hold negative values.

In [None]:
df.describe(datetime_is_numeric=True).T 

Observing the continuous data there are no obvious signs of illogical values.
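
The check above was done by reading the 'min' column of the table. A sketch of an explicit check is shown below on made-up data. The choice of which columns must be non-negative (`non_negative_cols`) is my assumption, since 'temp' can legitimately fall below zero:

```python
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "temp": [-1.5, 4.0, 12.3],       # may legitimately be negative
    "humidity": [87, 93, 76],
    "wind_speed": [3.1, 0.0, 7.2],
    "rain_1h": [0.0, 0.4, 1.2],
})

# Columns that should never hold a negative value (an assumed list).
non_negative_cols = ["humidity", "wind_speed", "rain_1h"]

# Count the negative entries per column; all zeros means the check passes.
negative_counts = (toy[non_negative_cols] < 0).sum()
print(negative_counts)
```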

In [None]:
# check the rows are within the required dates 
test_timeframe = df['dt_iso'].dt.year == 2018
test_timeframe.value_counts()

In [None]:
# drop any rows that are not from 2018
# https://sparkbyexamples.com/pandas/pandas-delete-rows-based-on-column-value/
df.drop(df[df['dt_iso'].dt.year != 2018].index, inplace=True)
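
An alternative to dropping by index is to keep only the rows that match a boolean mask. The sketch below shows this on a made-up frame (`toy` and its timestamps are illustrative):

```python
import pandas as pd

# Toy frame with one row either side of 2018 (values are made up).
toy = pd.DataFrame({
    "dt_iso": pd.to_datetime([
        "2017-12-31 23:00:00",
        "2018-01-01 00:00:00",
        "2019-01-01 00:00:00",
    ]),
    "temp": [3.0, 2.5, 6.1],
})

# Keep only the rows whose timestamp falls in 2018.
in_2018 = toy["dt_iso"].dt.year == 2018
toy_2018 = toy.loc[in_2018].reset_index(drop=True)
print(toy_2018)
```

This returns a new frame rather than modifying the original in place, which can make a cell safer to re-run.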

#### Categorical columns
Check all data is for Dublin

In [None]:
df['city_name'].unique()

# Descriptive statistics
## Continuous features

In [None]:
# print descriptive stats for the continuous columns
con_descriptive_df = df[continuous_cols].describe(datetime_is_numeric=True).T 
con_descriptive_df

In [None]:
# save a histogram of every continuous column to a single PDF
continuous_figs_pdf = PdfPages('/Users/rebeccadillon/git/dublin-bus-team-5/data_prep/documents/figs/dqr_openweather_continuous_barcharts.pdf')
for col in continuous_cols:
    fig = df[col].hist(figsize=(15,5))
    plt.title(col)
    continuous_figs_pdf.savefig(fig.get_figure(), bbox_inches='tight')
    plt.show()

continuous_figs_pdf.close()

## Categorical features

In [None]:
# print descriptives for categorical columns
cardinality = df[categorical_cols].nunique()
cardinality

In [None]:
null_count = df[categorical_cols].isnull().sum()
null_count

In [None]:
df[categorical_cols].describe()

In [None]:
categorical_figs_pdf = PdfPages('/Users/rebeccadillon/git/dublin-bus-team-5/data_prep/documents/figs/dqr_openweather_categorical_barcharts.pdf')

for col in categorical_cols:
    fig = df[col].value_counts(dropna=True).plot(kind='bar', title=col, figsize=(15,5), color='rebeccapurple')
    plt.title(col)
    categorical_figs_pdf.savefig(fig.get_figure(),bbox_inches='tight')
    plt.show()
    
categorical_figs_pdf.close()

# Data Quality Plan
Initial list of issues identified in the Data Quality Report

|Feature|Data Quality Issue|Action|
|---|---|---|
|dt| Similar to dt_iso|Drop column   |
|dt_iso|  No issue|keep column   |
|timezone|Low information gain | Drop column  |
|city_name|Low information gain | Drop column  |
|lat|Low information gain | Drop column  |
|lon|Low information gain | Drop column  |
|temp| |   |
|visibility| missing values| investigate further, drop if necessary  |
|dew_point| |   |
|feels_like| |   |
|temp_min| |   |
|temp_max| |   |
|pressure| |   |
|sea_level|Null column | Drop column  |
|grnd_level| Null column | Drop column  |
|humidity| |   |
|wind_speed| |   |
|wind_deg| |   |
|wind_gust| missing values|investigate further, drop if necessary  |
|rain_1h| missing values| replace with 0  |
|rain_3h|Null column | Drop column  |
|snow_1h|missing values| replace with 0  |
|snow_3h|Null column | Drop column  |
|clouds_all| |   |
|weather_id| |   |
|weather_main| |   |
|weather_description| |   |
|weather_icon| Low information gain|Drop column   |

As per the issues identified in the DQR above, the following columns can be dropped from the dataframe:
* dt
* timezone
* city_name
* lat
* lon
* sea_level
* grnd_level
* rain_3h
* snow_3h
* weather_icon

In [None]:
df.drop(columns=['dt','timezone','city_name','lat','lon','sea_level','grnd_level','rain_3h', 'snow_3h','weather_icon'], inplace=True)

In [None]:
df.head()

In [None]:
df['rain_1h'].describe()

### 1. Change NaN values in 'rain_1h' and 'snow_1h' to zero

In [None]:
df['rain_1h'] = df['rain_1h'].fillna(0)
df['snow_1h'] = df['snow_1h'].fillna(0)

### 2. Create a column 'snow_ice' which flags rows where 'temp' is at or below 0, OR 'snow_1h' is above 0, OR snow is indicated in 'weather_main'

In [None]:
df['snow_ice'] = 0

In [None]:
df.loc[df['temp'] <= 0, 'snow_ice'] =  1
df.loc[df['snow_1h'] > 0, 'snow_ice'] = 1
df.loc[df['weather_main'] == 'Snow', 'snow_ice'] = 1

In [None]:
df.loc[df['snow_ice']==1]
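
The three conditions can also be combined into a single boolean expression. This is a minimal sketch on made-up rows (`toy` is illustrative), using the same comparisons as the cell above:

```python
import pandas as pd

# Toy rows: freezing but dry, rain with some snow volume, labelled Snow, and clear.
toy = pd.DataFrame({
    "temp": [-0.5, 2.0, 1.0, 6.0],
    "snow_1h": [0.0, 0.3, 0.0, 0.0],
    "weather_main": ["Clouds", "Rain", "Snow", "Clear"],
})

# A row is flagged when any one of the three conditions holds.
is_snow_ice = (
    (toy["temp"] <= 0)
    | (toy["snow_1h"] > 0)
    | (toy["weather_main"] == "Snow")
)
toy["snow_ice"] = is_snow_ice.astype(int)
print(toy)
```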

### 3. Create a boolean column 'heavy_precip' which indicates heavy rain or snowfall

In [None]:
df['heavy_precip'] = 0

In [None]:
#df.loc[df['rain_1h'] <= 0, 'heavy_precip'] =  1
#df.loc[df['snow_1h'] <= 0, 'heavy_precip'] =  1

In [None]:
df['rain_1h'].describe()

In [None]:
df['snow_1h'].describe()

The OpenWeather website has condition codes which map to the weather_id, weather_main and weather_description columns in this dataframe. Among these codes are the following:

|ID|Main|Description|
|---|---|---|
|500|Rain|light rain|
|501|Rain|moderate rain|
|502|Rain|heavy intensity rain|
|503|Rain|very heavy rain|
|504|Rain|extreme rain|
|511|Rain|freezing rain|
|520|Rain|light intensity shower rain|
|521|Rain|shower rain|
|522|Rain|heavy intensity shower rain|
|531|Rain|ragged shower rain|

From this list, I will take the following IDs to identify rows with 'heavy' rain:
501, 502, 503, 504, 511, 521, 522, 531

In [None]:
df_raining = df.loc[df['weather_id'].isin([501,502,503,504,511,521,522,531])]
df_raining['weather_id'].unique()

We can see that this dataframe only contains values for 'moderate rain','heavy intensity rain' and 'shower rain'. I will now plot a histogram to show the distribution of rain values for these weather descriptions.

In [None]:
df_raining['rain_1h'].hist(figsize=(15,5))
plt.show()

I will also print some descriptive statistics of this data.

In [None]:
df_raining.describe().T

The statistics above give the minimum rain_1h value as zero, so the minimum is not a useful cut-off. I will instead use the Q1 value of 1.06 as a guide and treat more than 1mm of rain per hour as 'heavy' rain.

In [None]:
df.loc[df['rain_1h'] > 1, 'heavy_precip'] = 1
df.loc[df['heavy_precip']==1]
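
The Q1 figure above was read from the `describe()` table. As a sketch of how a first quartile can be computed directly for the selected weather IDs, the cell below uses made-up rainfall values, so the result differs from the 1.06 reported for the real data:

```python
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "weather_id": [500, 501, 501, 502, 521, 800],
    "rain_1h": [0.2, 1.0, 2.0, 5.0, 3.0, 0.0],
})

# The same IDs used above to define 'heavy' rain.
heavy_ids = [501, 502, 503, 504, 511, 521, 522, 531]

# First quartile of hourly rain within the selected rows only.
subset = toy.loc[toy["weather_id"].isin(heavy_ids), "rain_1h"]
q1 = subset.quantile(0.25)
print(q1)
```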

Moving on to snow, the OpenWeather website lists the following condition codes for snow, which map to the weather_id column in this dataframe:


|ID|Main|Description|
|---|---|---|
|600|Snow|light snow|
|601|Snow|Snow|
|602|Snow|Heavy snow|
|611|Snow|Sleet|
|612|Snow|Light shower sleet|
|613|Snow|Shower sleet|
|615|Snow|Light rain and snow|
|616|Snow|Rain and snow|
|620|Snow|Light shower snow|
|621|Snow|Shower snow|
|622|Snow|Heavy shower snow|


From the above codes, I will identify which are present in my dataframe.

In [None]:
df_snowing = df.loc[df['weather_id'].isin([600,601,602,611,612,613,615,616,620,621,622])]
df_snowing['weather_id'].unique()

The above result shows that 'light snow', 'snow', 'light shower sleet', 'light shower snow' and 'shower snow' are in the dataframe. I will omit those described as 'light' and include only the others when defining 'heavy precipitation'.

In [None]:
df_snowing['snow_1h'].hist(figsize=(15,5))
plt.show()

In [None]:
df_snowing.describe().T

We can see that the minimum value for snow over the past hour is 0. I will omit the categories labelled 'light' and print some descriptives again.

In [None]:
df_snowing = df.loc[df['weather_id'].isin([601,621])]
df_snowing.describe().T

In [None]:
df_snowing['snow_1h'].hist(figsize=(15,5))
plt.show()

The above figure shows that the majority of rows have snow_1h values of 0.5 and above. For this reason I will place my snow threshold at 0.5.

In [None]:
df.loc[df['snow_1h'] > 0.5, 'heavy_precip'] = 1
df.loc[df['heavy_precip']==1]
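
The rain and snow thresholds chosen above (more than 1mm of rain or more than 0.5 of snow in the last hour) can be applied in one step. The helper below is a hypothetical sketch; `flag_heavy_precip` and the toy values are my own and not part of the notebook:

```python
import pandas as pd

def flag_heavy_precip(frame, rain_mm=1.0, snow_mm=0.5):
    """Return 1 where hourly rain or snow exceeds its threshold, else 0."""
    heavy = (frame["rain_1h"] > rain_mm) | (frame["snow_1h"] > snow_mm)
    return heavy.astype(int)

# Toy rows: dry, rain exactly at the threshold, rain above it, snow above it.
toy = pd.DataFrame({
    "rain_1h": [0.0, 1.0, 1.2, 0.0],
    "snow_1h": [0.0, 0.0, 0.0, 0.8],
})
toy["heavy_precip"] = flag_heavy_precip(toy)
print(toy)
```

Keeping the thresholds as parameters makes it easy to try other cut-offs without editing several cells.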

### 4. Visibility missing values

In [None]:
df['visibility'].isnull().sum()

In [None]:
df.loc[df['visibility'].isnull()]

I will see what relationship these values have with other features.

In [None]:
def bar_plot(col1, col2):
    df.groupby(col1)[col2].mean().plot.bar(cmap='Pastel2')
    plt.title(col1 + " vs " + col2)
    plt.xticks(rotation=45)
    plt.tight_layout()

In [None]:
categorical_cols = df.select_dtypes(['category']).columns
categorical_cols = categorical_cols.append(df[['snow_ice', 'heavy_precip']].columns)
categorical_cols

In [None]:
continuous_cols = df.select_dtypes(['int64','float64','datetime64[ns]']).columns

In [None]:
df['visibility_null'] = 0  

In [None]:
df.loc[df['visibility'].isnull(), 'visibility_null'] = 1

In [None]:
for col in continuous_cols:
    bar_plot('visibility_null', col)
    plt.show()

In [None]:
for col in categorical_cols:
    sns.histplot(binwidth=0.5, x='visibility_null', hue=col, data=df, stat="count", multiple="stack")
    plt.show()

It is clear from the above figures that the null values are related to cloud cover. As visibility is influenced by a variety of weather factors (rain, snow, cloud cover, fog, etc.), I will drop this column, as it does not appear to add much information to the dataframe.
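
To put a number on a relationship like this rather than reading it from the plots, the means can be compared between rows with and without a visibility value. This is a sketch on made-up data (`toy` is illustrative and does not reproduce the real result):

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "visibility": [10000.0, np.nan, 8000.0, np.nan],
    "clouds_all": [20, 90, 40, 75],
})

# Flag the rows with a missing visibility value, as done above.
toy["visibility_null"] = toy["visibility"].isna().astype(int)

# Mean cloud cover for rows with (0) and without (1) a visibility reading.
mean_clouds = toy.groupby("visibility_null")["clouds_all"].mean()
print(mean_clouds)
```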

In [None]:
df.drop(columns=['visibility','visibility_null'], inplace=True)

### 5. Wind gust


The wind gust column contained missing values where the visibility column also held missing values. I will begin with the same step as I did for the visibility column and inspect the affected rows.

In [None]:
df.loc[df['wind_gust'].isnull()]

As this column contains a lot of missing values and as we have a lot of other columns with useful weather information, I will drop this column.
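
A quick way to quantify "a lot of missing values" is the share of nulls per column. The sketch below uses made-up data (`toy` is illustrative, so the proportions are not those of the real file):

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "wind_gust": [np.nan, np.nan, 9.3, np.nan],
    "wind_speed": [3.1, 4.0, 5.2, 2.8],
})

# The mean of a boolean mask is the fraction of True values, i.e. the share of nulls.
missing_share = toy.isna().mean()
print(missing_share)
```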

In [None]:
df.drop(columns=['wind_gust'],inplace=True)

In [None]:
df.info()

In [None]:
# save cleaned dataframe to new file
df.to_csv('/Users/rebeccadillon/git/dublin-bus-team-5/machinelearning/data/cleaned/dublin-weather-2018-cleaned-dqp.csv', index=False)