# Irish Weather Data Analysis

Exploratory Data Analysis (EDA) is used to make initial discoveries about the key aspects of the dataset.

Tasks
* Preview the data
* Variable types
* Summary stats
* Missing values and outliers
* Visualisations

## 1. Pre-processing

In [None]:
# Import packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import sys
import seaborn as sns

In [None]:
# Review the files in the folder
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        # With only one file we can create the variable containing the file path
        input_data = str(os.path.join(dirname, filename))
        print(input_data)

In [None]:
df = pd.read_csv('../input/irish-weather-hourly-data/hrly_Irish_weather.csv', parse_dates=['date'])
df.head()

In [None]:
df.info()

In [None]:
# Review a random sample of records from the dataframe. The argument in the parentheses is the number of records to return
df.sample(5)

In [None]:
# Shape of the dataframe
print(df.shape)
# Find the number of rows within a dataframe
print(len(df))
# Extracting information from the shape tuple
print(f'Number of rows: {df.shape[0]} \nNumber of columns: {df.shape[1]}')

### 1b. Variable types

The aim is to understand whether any data type conversion is required so that variables are in the correct format for further data analysis.

In [None]:
# Gain high level view of the datatypes for each variable
df.dtypes

In [None]:
# Information about the dataframe. The memory_usage parameter provides a more in-depth review of the size of the dataframe
df.info(memory_usage='deep')

In [None]:
# Review memory usage by variable
df.memory_usage(deep=True)

Object variables consume the most memory, so converting them to a more appropriate data type reduces the overall memory footprint and can speed up processing. By default pandas stores variables that contain strings or mixed data types with the object dtype; such a column can hold string, date, integer or float values. Once we understand which elements make up a variable, we can convert it to the appropriate numeric or categorical type.
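
As a rough illustration of the saving, the sketch below compares the memory used by the same text column stored as object and as category. The toy data is hypothetical and not taken from the weather file; the exact byte counts will vary by pandas version.

```python
import pandas as pd

# Hypothetical toy data: a low-cardinality text column repeated many times
values = ["Dublin", "Cork", "Galway", "Mayo"] * 25_000
toy = pd.DataFrame({"county": pd.Series(values, dtype="object")})

# Memory used by the column as object and after conversion to category
object_bytes = toy["county"].memory_usage(deep=True)
category_bytes = toy["county"].astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The category version stores each distinct string once plus a small integer code per row, which is why it pays off when a column has few unique values.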

##### Categorical

From the dataframe summary we can see that the first two variables are categorical. Checking the cardinality (number of unique values) of each variable helps to decide whether converting it to the category data type makes sense.
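
One possible way to check cardinality is sketched below on a hypothetical toy frame (the values and the `note` column are made up for illustration). A low ratio of unique values to rows suggests a column is a reasonable candidate for the category type.

```python
import pandas as pd

# Hypothetical toy frame standing in for the weather data
toy = pd.DataFrame({
    "county": ["county_a", "county_a", "county_a", "county_b", "county_b", "county_b"],
    "station": ["station_1", "station_1", "station_2", "station_3", "station_3", "station_3"],
    "note": ["a", "b", "c", "d", "e", "f"],
})

# Unique values per column and their share of the row count
cardinality = toy.nunique().to_frame("n_unique")
cardinality["unique_ratio"] = cardinality["n_unique"] / len(toy)
print(cardinality)
```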

In [None]:
# Review the first few categorical variables
cat_list = ['county', 'station']
df.groupby('county')['county'].count()

In [None]:
# Count for the station
df.groupby('station')['station'].count()

In [None]:
# Convert both of these variables into category data types
# Create a conversion dictionary to allow for easier maintenance
cat_type = {'county':'category',
            'station':'category'
           }
df = df.astype(cat_type)
# Review the new memory consumption
df[cat_list].memory_usage(deep=True)

In [None]:
# Confirm that the data types have changed
df[cat_list].dtypes

The most efficient way to convert data types is to apply the changes during the data import step. For CSV files that are read into dataframes, the conversions can be passed to the dtype= parameter of pd.read_csv().
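
A minimal sketch of setting the types at import time is shown below. It reads a hypothetical three-row extract from memory rather than the real file, and the choice of `float32` for `temp` is an assumption made for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real file has many more columns
csv_text = """county,station,date,temp
county_a,station_1,01-Jan-2020 00:00,5.1
county_a,station_1,01-Jan-2020 01:00,4.8
county_b,station_2,01-Jan-2020 00:00,7.2
"""

# Declare the target types up front instead of converting after the load
toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"county": "category", "station": "category", "temp": "float32"},
)

# Parse the date afterwards with the same explicit format used in this notebook
toy["date"] = pd.to_datetime(toy["date"], format="%d-%b-%Y %H:%M")
print(toy.dtypes)
```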

##### Numeric

There are a few choices for numeric variables: floats for values with decimal places and integers for whole numbers. Finally, the date variable is converted from string to datetime.
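
When a column that should be numeric is stored as text, one option is to coerce it and then inspect the values that failed to parse. The sketch below uses a hypothetical series in which the string `"n/a"` stands in for a missing reading; it is not taken from the weather file.

```python
import pandas as pd

# Hypothetical column where a text placeholder marks a missing reading
raw = pd.Series(["5.1", "4.8", "n/a", "7.2"], name="temp")

# Unparseable values become NaN; downcast picks a smaller float type where possible
converted = pd.to_numeric(raw, errors="coerce", downcast="float")

# Values that were present in the raw data but could not be parsed
unparsed = raw[converted.isna() & raw.notna()]
print(converted.dtype)
print(unparsed.tolist())
```

Listing the unparsed values before overwriting a column makes it easier to decide whether they are genuine gaps or a formatting issue.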

In [None]:
df.sample(5)

In [None]:
# Convert selected numeric variables to a smaller float type
# float_vars lists every column except the categorical, date, sun, vis, clht and clamt variables
# Errors emerged when applying the conversion blindly to all of float_vars, so more investigation is required
float_vars = [c for c in df.columns if c not in ['county','station','date','sun','vis','clht','clamt']]

# For now only the coordinates are converted
float_cols = ['latitude','longitude']

# df[float_vars] = df[float_vars].apply(pd.to_numeric, downcast='float')
df[float_cols] = df[float_cols].apply(pd.to_numeric, downcast='float')

In [None]:
# Convert the date to datetime format (this has no effect if read_csv already parsed the column)
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%Y %H:%M')

In [None]:
df[float_cols].dtypes

In [None]:
df[float_cols].memory_usage(deep=True)

In [None]:
df.info(memory_usage='deep')

### 1c. Summary stats

In [None]:
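# Review the high level summary details for each variable
# Note: the datetime_is_numeric argument was removed in pandas 2.0, where datetimes are always summarised as numeric;
# on pandas 1.x add datetime_is_numeric=True to get the same behaviour
df.describe(include="all")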

### 1d. Missing values and outliers

In [None]:
# Check for the missing values by column
print(df.isnull().sum())

# Count and proportion of missing values by column
def isnull_prop(df):
    total_rows = df.shape[0]
    missing_val_dict = {}
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        missing_val_dict[col] = [n_missing, n_missing / total_rows]
    return missing_val_dict

# Apply the missing value function and print one column per line
null_dict = isnull_prop(df)
for col, (n_missing, prop) in null_dict.items():
    print(f'{col}: {n_missing} missing ({prop:.2%})')

In [None]:
# Display missing values using a heatmap to understand if any patterns are present
sns.heatmap(df.isnull())

It appears that certain stations are not collecting data for the final four variables.
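
One way to check this impression is to compute the share of missing readings per station for each variable. The sketch below uses a hypothetical toy frame in which `station_b` never reports `sun`; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: station_b never reports 'sun' or 'vis'
toy = pd.DataFrame({
    "station": ["station_a", "station_a", "station_b", "station_b"],
    "sun": [0.2, 0.0, np.nan, np.nan],
    "vis": [30000.0, np.nan, np.nan, np.nan],
})

# Share of missing readings per station for each variable
missing_share = toy[["sun", "vis"]].isnull().groupby(toy["station"]).mean()
print(missing_share)
```

A share of 1.0 for a station means the variable is never recorded there, while a value between 0 and 1 points to intermittent gaps.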

In [None]:
df_miss = df.loc[(df['sun'].isnull()), ['county','station','date','sun']]
df_miss.shape

In [None]:
# Missing values by station
df_miss.groupby(['station'])['date'].count()

In [None]:
# Set the index to date and check for missing values across time
df.index = df['date']
df.head()

In [None]:
# Mapping of stations by county
df_g = df.groupby(['county','station']).size().unstack(level=0)
df_g
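
The section title also mentions outliers. A common first check, sketched here under the assumption that the interquartile range (IQR) rule is acceptable for the variable in question, flags values more than 1.5 IQRs outside the quartiles. The readings below are hypothetical.

```python
import pandas as pd

# Hypothetical temperature readings with one implausible value
temp = pd.Series([4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 45.0], name="temp")

# Quartiles and the interquartile range
q1, q3 = temp.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 IQRs beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = temp[(temp < lower) | (temp > upper)]
print(outliers)
```

Flagged values are candidates for review rather than automatic removal, since extreme weather readings can be genuine.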

## 2. Visualisations

In [None]:
# Unique station and county combinations (first row kept for each)
df_unq_loc = df.drop_duplicates(subset=['station','county'])
df_unq_loc

In [None]:
# Map the station locations (the marker clustering option is left commented out below)
import folium
from folium.plugins import FastMarkerCluster

Lat = 53.390862
Long = -6.158100

locations = list(zip(df_unq_loc.latitude, df_unq_loc.longitude))

map1 = folium.Map(location=[Lat,Long], zoom_start=7)
# FastMarkerCluster(data=locations).add_to(map1)
# map1

# add marker one by one on the map
for i in range(0,len(df_unq_loc)):
    folium.Marker(
        location=[df_unq_loc.iloc[i]['latitude'], df_unq_loc.iloc[i]['longitude']],
        popup=df_unq_loc.iloc[i]['station']+',\n'+df_unq_loc.iloc[i]['county'],
    ).add_to(map1)
map1

In [None]:
# Summarise temperature by station so it can be added to the map; non-numeric readings are coerced to NaN
df['temp'] = pd.to_numeric(df['temp'], errors='coerce')
df_s = df.groupby(['station'])['temp'].agg(['min','mean','max'])
df_s

In [None]:
# Merge the unique locations and temperatures
df_s = df_s.reset_index()
df_unq_temp = pd.merge(df_unq_loc.loc[:,['station','county','latitude','longitude']],
                       df_s,
                       how='left',
                       on=['station']
                      )
df_unq_temp

In [None]:
# Add more details to the output
# Use the temperature to display difference in average temperatures
def colour_temp(temp):
    if temp < 9:
        return "purple"
    elif temp < 10:
        return "blue"
    elif temp < 11:
        return "green"
    else:
        return "red"

Lat = 53.390862
Long = -6.158100

locations = list(zip(df_unq_temp.latitude, df_unq_temp.longitude))

map2 = folium.Map(location=[Lat,Long], zoom_start=7)

# Add Details to the markers
for i in range(len(locations)):
    folium.Marker(locations[i]
                  ,popup=df_unq_temp.iloc[i]['station']+',\n'+df_unq_temp.iloc[i]['county']+' '+str(df_unq_temp.iloc[i]['mean'])
                  ,icon=folium.Icon(color=colour_temp(df_unq_temp.iloc[i]['mean']))
                 ).add_to(map2)
map2
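
The thresholds in colour_temp can also be expressed as bins, which may be convenient when colouring many stations at once. This is a sketch using pd.cut on hypothetical mean temperatures. Unlike colour_temp, which returns "red" for a missing mean, pd.cut leaves missing values as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical station mean temperatures
mean_temp = pd.Series([8.5, 9.0, 10.5, 11.0])

# Left-closed bins reproduce the "< 9", "< 10", "< 11", else rules
colours = pd.cut(
    mean_temp,
    bins=[-np.inf, 9, 10, 11, np.inf],
    labels=["purple", "blue", "green", "red"],
    right=False,
)
print(colours.tolist())
```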