## Introduction
Exploring the Irish Weather Dataset

## Exploratory Analysis
Import required libraries

In [None]:
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # plotting

There is 1 csv file in the current version of the dataset:


In [None]:
print(os.listdir('../input'))

### Read in the data

In [None]:
df = pd.read_csv('../input/hourly_irish_weather.csv', parse_dates=['date'])
df.drop('Unnamed: 0', axis=1, inplace=True)
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')

Let's take a quick look at what the data looks like:

In [None]:
df.head(5)

In [None]:
df.info()

There are 2 object columns; let's explore them.

In [None]:
df_objects = df.select_dtypes(include='object')
object_memory_usage = df_objects.memory_usage(deep=True).sum() / 1024 ** 2
print(f"Total memory usage for objects is {object_memory_usage:.2f} MB")

total_memory_usage = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"Total memory usage is {total_memory_usage:.2f} MB")

The 2 object columns are using nearly half the memory. Can converting them to categorical help?

In [None]:
df_category = df_objects.astype('category')

category_memory_usage = df_category.memory_usage(deep=True).sum() / 1024 ** 2
print(f"Total memory usage for categories is {category_memory_usage:.2f} MB")

Converting the object columns to categories will save nearly 500 MB of memory.

In [None]:
object_columns = df.select_dtypes(include='object').columns

df[object_columns] = df[object_columns].astype('category')
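
The saving comes from storing each distinct string once and keeping only a small integer code per row. The following is a minimal sketch of that effect on synthetic data, not a measurement of this dataset; the series `stations` and its size are made up for illustration, and only the station names come from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a station column: few distinct names, many rows.
rng = np.random.default_rng(0)
names = np.array(['Belmullet', 'Malin Head', 'Valentia', 'Knock Airport'])
stations = pd.Series(rng.choice(names, size=100_000), dtype='object')

# deep=True counts the Python string objects, not just the pointers to them.
object_mb = stations.memory_usage(deep=True) / 1024 ** 2
category_mb = stations.astype('category').memory_usage(deep=True) / 1024 ** 2

print(f'object: {object_mb:.2f} MB, category: {category_mb:.2f} MB')
```

The gain shrinks as the number of distinct values approaches the number of rows, so the conversion is mainly worthwhile for low-cardinality columns like these.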

## Checking for Missing values

In [None]:
df.isna().mean()

Over 50% of ww, w, sun, vis, clht and clamt are missing. More exploration is required.

Is this because not all stations have the equipment to gather this data?

In [None]:
def heatmap(df):
    sns.set(rc={'figure.figsize':(20,20)})
    sns.set_context('talk')
    sns.heatmap(df.round(2), cbar=False, annot=True, linewidths=.5, cmap='YlGnBu')

In [None]:
station_na_dict = {}
for station in df.station.cat.categories:
    station_na_dict[station] = df[df.station == station].isna().mean()

station_na_df = pd.DataFrame(station_na_dict)

heatmap(station_na_df)
plt.title('Proportion of missing values by Weather Station')
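
The per-station loop above can also be expressed as a single groupby. This is a minimal sketch on a made-up `toy` frame (its values and the station labels `A` and `B` are hypothetical), offered as an alternative rather than as the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the weather frame.
toy = pd.DataFrame({
    'station': ['A', 'A', 'B', 'B'],
    'sun': [1.0, np.nan, np.nan, np.nan],
    'temp': [10.0, 11.0, 12.0, np.nan],
})

# isna() gives booleans; their mean per station is the missing proportion.
# Transposing puts variables in rows and stations in columns, like the loop.
na_by_station = (
    toy.drop(columns='station')
       .isna()
       .groupby(toy['station'])
       .mean()
       .T
)

print(na_by_station)
```

Grouping by `toy['date'].dt.year` instead of the station column would give the same kind of table by year.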

The heatmap shows that over two thirds of the stations do not collect data for 6 variables. Belmullet, Malin Head and Valentia are all missing consistent proportions of these variables. Have they gained access to equipment at different times?

In [None]:
yearly_na_dict = {}

for year in df.date.dt.year.unique():
    yearly_na_dict[year] = df[df.date.dt.year == year].isna().mean()

yearly_na_df = pd.DataFrame(yearly_na_dict)

heatmap(yearly_na_df)
plt.title('Proportion of missing values by Year')

Over the years the quantity of missing values for the 6 variables has increased! Are Belmullet, Malin Head and Valentia no longer reporting these variables? As the missing proportions are consistent for each variable I will look at the sun variable by year and station.

In [None]:
sun_year_station = df.groupby([df.date.dt.year, 'station']).count()['sun'].reset_index()
sun_pivoted = sun_year_station.pivot_table(values='sun', index='station', columns='date', fill_value=0)

heatmap(sun_pivoted/1000)
plt.yticks(rotation=0)
plt.title('Sun data points by year and station (1000s)')

Belmullet, Malin Head and Valentia stopped reporting these data between 2009 and 2012! Knock Airport has been reporting these data since 1996.

This [link](https://www.met.ie/climate/what-we-measure#collapsemannedweatherstations) states that the 4 airport stations and the Casement station are manned, with the rest being automatic, which is why they have the extra variables. Belmullet, Malin Head and Valentia must have been manned previously.

### Aggregating the data


In [None]:
def aggerate_data(grouper):
    # Drop Lat, Long, w and ww columns
    agg_df = df.drop(['latitude', 'longitude', 'w', 'ww'], axis=1).groupby(grouper).agg([np.min, np.max, np.mean, np.std])
    agg_df.columns = ['_'.join(val) for val in agg_df.columns.values]
    return agg_df

month_year_agg_df = aggerate_data([df.date.dt.year, df.date.dt.month])

# Rename and reset index
month_year_agg_df.index.rename(['year', 'month'], inplace = True)
month_year_agg_df.reset_index(inplace=True)

# Combine year and Month
month_year_agg_df['day'] = 15
month_year_agg_df['date'] = pd.to_datetime(month_year_agg_df[['day','month', 'year']])
month_year_agg_df.drop(['day', 'month', 'year'], axis=1, inplace=True)

#create new date index
month_year_agg_df.index = month_year_agg_df.date
month_year_agg_df.drop('date', axis=1, inplace=True)

month_year_agg_df.head()
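
The same month-by-year aggregation can be sketched with `pd.Grouper`, which avoids rebuilding the date from separate year and month columns. This is an illustrative alternative on a made-up `toy` frame, with two differences to note: each month is labelled by its first day rather than the 15th, and the string aggregation names produce `_min` and `_max` suffixes, whereas the plotting function later in this notebook looks for `_amin` and `_amax`.

```python
import pandas as pd

# Hypothetical hourly-style records spanning two months.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01 00:00', '2020-01-15 12:00',
                            '2020-02-01 06:00', '2020-02-20 18:00']),
    'temp': [2.0, 4.0, 6.0, 10.0],
    'rain': [0.0, 1.0, 0.5, 0.5],
})

# Group rows into calendar months ('MS' = month start) and aggregate.
monthly = (
    toy.groupby(pd.Grouper(key='date', freq='MS'))[['temp', 'rain']]
       .agg(['min', 'max', 'mean', 'std'])
)

# Flatten the (variable, statistic) column pairs into names like 'temp_mean'.
monthly.columns = ['_'.join(col) for col in monthly.columns]

print(monthly[['temp_min', 'temp_max', 'temp_mean']])
```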

Function to plot the aggregated data as a time series. The function produces two plots: the first shows the mean and standard deviation for the variable, the second shows the mean with the min and max values.

NB. For variables that cannot be negative, such as rain, the lower band (mean minus one standard deviation) can dip below zero. The standard deviation itself is never negative; please ignore the negative part of the band.

In [None]:
def plot_aggerated_data_timeseries(df, variable, title='Monthly'):
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(30, 20))
    fig.suptitle(f'{title} Average - {variable.capitalize()}')
    
    ax1.plot(df.index, df[f'{variable}_mean'])
    upper_sd = df[f'{variable}_mean'] + df[f'{variable}_std']
    lower_sd = df[f'{variable}_mean'] - df[f'{variable}_std']
    ax1.fill_between(df.index, y1=upper_sd, y2=lower_sd, alpha=.2)
    ax1.set_title(f'{variable.capitalize()} - Average and Standard Deviation')
   
    ax2.plot(df.index, df[f'{variable}_mean'])
    ax2.fill_between(df.index, df[f'{variable}_amin'], df[f'{variable}_amax'], alpha=.2)
    ax2.set_title(f'{variable.capitalize()} - Average, Min, Max')
    plt.show()
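
The shaded band in the first plot is the mean plus and minus one standard deviation, so for a skewed, non-negative variable its lower edge can fall below zero. A minimal worked example with made-up rainfall values (not taken from the dataset):

```python
import pandas as pd

# Mostly dry hours with one heavy shower (hypothetical numbers).
rain = pd.Series([0.0, 0.0, 0.0, 0.0, 10.0])

mean = rain.mean()    # 2.0
std = rain.std()      # sample standard deviation, never negative
lower_sd = mean - std

# Optional: clip the lower edge at zero for a non-negative variable.
lower_sd_clipped = max(lower_sd, 0.0)

print(f'mean={mean:.2f}, std={std:.2f}, lower band={lower_sd:.2f}')
```

Inside the plotting function the equivalent would be `lower_sd.clip(lower=0)`, though that is only appropriate for variables that cannot be negative; temperature, for example, should not be clipped.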

### Plots

Loop through each variable and produce plots

In [None]:
cols = set(c[0] for c in month_year_agg_df.columns.str.split('_'))

for col in cols:
    plot_aggerated_data_timeseries(month_year_agg_df, col, title='Month, Year')

### Notes on Plots

* 'clamt' - Cloud Amount (oktas) - 2 values larger than the consistent max of 8. A value of 9 oktas (sky obstructed from view) is possible
* 'clht' - Cloud Ceiling Height - A value of 999 means no cloud. Does a value of 0 mean fog?
* 'dewpt' - Dew Point - Suspicious low point in 2007, needs more investigation. Seasonal
* 'msl' - Mean Sea Level Pressure (hPa) - Big outlier around 2010
* 'rain' - Rain (mm) - No outliers
* 'rhum' - Relative Humidity (%) - Zero values, similar to the outliers in vappr?
* 'sun' - Sun (hours) - Big outlier: 5 hours of sun in 1 hour?
* 'temp' - Temperature (°C) - No apparent outliers, seasonal
* 'vappr' - Vapour Pressure (hPa) - Outlier min values between 1996 and 1998? Seasonal
* 'vis' - Visibility (m) - No outliers, seasonal
* 'wddir' - Predominant Hourly Wind Direction (degrees) - No Outliers all between 0 and 
* 'wdsp' - Mean Hourly Wind Speed (kt) - Max value is lower than the highest recorded in Ireland
* 'wetb' - Wet Bulb Air Temperature (°C) - Some very low outliers, needs more investigation. Seasonal
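
Several of the notes above amount to range checks, which can be counted directly instead of read off a plot. The sketch below is a rough illustration only: the bounds in `plausible_ranges`, the helper `count_out_of_range` and the `toy` values are all assumptions made for this example and should be checked against the dataset's documentation before use.

```python
import pandas as pd

# Hypothetical bounds for illustration, loosely based on the notes above.
plausible_ranges = {
    'clamt': (0, 9),    # oktas; 9 means sky obstructed from view
    'sun': (0, 1),      # hours of sunshine within a one-hour record
    'rhum': (1, 100),   # percent; zero readings are treated as suspect
}

# Made-up records, one suspect value per column.
toy = pd.DataFrame({
    'clamt': [3, 8, 11],
    'sun': [0.2, 5.0, 0.0],
    'rhum': [85, 0, 97],
})

def count_out_of_range(frame, ranges):
    """Count non-missing values outside the inclusive (low, high) bounds."""
    counts = {}
    for col, (low, high) in ranges.items():
        outside = ~frame[col].between(low, high) & frame[col].notna()
        counts[col] = int(outside.sum())
    return pd.Series(counts)

suspect_counts = count_out_of_range(toy, plausible_ranges)
print(suspect_counts)
```

Missing values are excluded from the count so that the result reflects suspicious readings rather than absent ones.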

## Seasonality

Aggregate by month to assess seasonality

In [None]:
monthly_df = aggerate_data(df.date.dt.month)

for col in cols:
    plot_aggerated_data_timeseries(monthly_df, col, title='Monthly')
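
One way to put a number on the seasonality seen in these plots is the spread between the warmest and coldest monthly means. The following is a small sketch under stated assumptions: the `toy` frame and its temperatures are invented, and `amplitude` is just one possible summary, not something computed in this notebook.

```python
import pandas as pd

# Two identical hypothetical years of monthly temperatures (°C).
year_of_temps = [5.0, 6.0, 7.0, 9.0, 12.0, 14.0, 17.0, 16.0, 14.0, 11.0, 8.0, 6.0]
toy = pd.DataFrame({
    'date': pd.date_range('2019-01-01', periods=24, freq='MS'),
    'temp': year_of_temps * 2,
})

# Average across years for each calendar month (index runs from 1 to 12).
monthly_mean = toy.groupby(toy['date'].dt.month)['temp'].mean()

# Seasonal amplitude: warmest monthly mean minus coldest monthly mean.
amplitude = monthly_mean.max() - monthly_mean.min()

print(f'warmest month: {monthly_mean.idxmax()}, amplitude: {amplitude:.1f}')
```

A variable with little seasonality would show an amplitude that is small relative to its overall standard deviation.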

## More to follow . . .