# Introduction

This project has the following objectives:

1. Explore weather data
2. Forecast the weather
3. Uncover patterns in the weather data
4. Assess the impact of weather on the unemployment rate of two major US cities, Los Angeles and New York, between 2012 and 2017

Part 1 of the report focuses on objectives 1, 2, and 3.

Part 2 of the report focuses on objective 4.

## Data Description

[Weather Data](https://www.kaggle.com/selfishgene/historical-hourly-weather-data#wind_speed.csv) and [Unemployment Rate Data](https://www.kaggle.com/jayrav13/unemployment-by-county-us) are used in this project.

The following fields are extracted from each dataset:

1. Weather Data
    1. City
    2. Datetime
    3. Humidity
    4. Pressure
    5. Temperature
    6. Weather Description
    7. Wind Direction
    8. Wind Speed
2. Unemployment Data
    1. City (County)
    2. Year
    3. Month
    4. Unemployment Rate

## Tools

* [Python](https://www.python.org/)
* [Jupyter Notebook](http://jupyter.org/)

In [1]:
! pip install tqdm
! pip install graphviz
! pip install mlxtend



In [2]:
# Import required packages

import pandas as pd
from string import Template
from dateutil.parser import parse
from tqdm import tqdm_notebook
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import graphviz
import matplotlib.pyplot as plt

# Part 1

# Introduction

In Part 1, we explore the weather data of Los Angeles and New York to find patterns in the weather, and then use those results to forecast the weather.

## Overall process

1. __Data Preprocessing__
    1. __Data Cleaning__
        1. Check for missing data and handle it
        2. Handle noise, if any
    2. __Data Integration__
        1. Merge the data sets into one table, because the data is split across several CSV files
        2. Run redundancy analysis as well as correlation analysis
    3. __Data Transformation__
        1. Transform datetime into Day, Month, Year for easier handling
    4. __Data Reduction__
        1. Filter out the data of Los Angeles and New York
2. __Data Mining__
    1. __Decision Tree__
        1. Process
        2. Result

## Data Preprocessing

### Data Cleaning

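All missing numeric values are filled with the mean of their column.

`Weather Description`, which is our label, is categorical and has no mean, so its missing values are back-filled with the next available value instead.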
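
The two filling strategies can be sketched on a toy table. This is a minimal illustration, assuming the same wide layout as the hourly CSV files; the frames `toy` and `labels` and all values in them are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the wide layout of the hourly CSV files (values are made up).
toy = pd.DataFrame({
    "datetime": ["2012-10-01 12:00:00", "2012-10-01 13:00:00", "2012-10-01 14:00:00"],
    "New York": [58.0, np.nan, 60.0],
    "Los Angeles": [np.nan, 88.0, 90.0],
})

# Numeric columns: replace each gap with the mean of its own column.
mean_filled = toy.fillna(toy.mean(numeric_only=True))

# Label columns: copy the next observed value backwards into each gap.
labels = pd.DataFrame({
    "datetime": toy["datetime"],
    "New York": [None, "few clouds", "mist"],
})
back_filled = labels.bfill()

print(mean_filled)
print(back_filled)
```

Note that back-filling cannot repair a gap at the very end of a column, because there is no later value to copy, so such rows stay missing.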

In [22]:
# Import Data
data_list = []
citys = pd.read_csv('~/work/dataset/historical-hourly-weather-data/city_attributes.csv')
humidity = pd.read_csv('~/work/dataset/historical-hourly-weather-data/humidity.csv')
data_list.append(('Humidity',humidity))
pressure = pd.read_csv('~/work/dataset/historical-hourly-weather-data/pressure.csv')
data_list.append(('Pressure',pressure))
temperature = pd.read_csv('~/work/dataset/historical-hourly-weather-data/temperature.csv')
data_list.append(('Temperature',temperature))
weather_description = pd.read_csv('~/work/dataset/historical-hourly-weather-data/weather_description.csv')
data_list.append(('Weather Description',weather_description))
wind_direction = pd.read_csv('~/work/dataset/historical-hourly-weather-data/wind_direction.csv')
data_list.append(('Wind Direction',wind_direction))
wind_speed = pd.read_csv('~/work/dataset/historical-hourly-weather-data/wind_speed.csv')
data_list.append(('Wind Speed',wind_speed))

In [3]:
# Helper functions
def fill_missing_value_with_mean(df, var_name = None):
    """Report the number of missing values, then fill them with each column's mean."""
    num_of_missing_data = df.isna().sum().sum()
    print(Template('Number of missing data of ${var_name}: ${num_of_missing_data}').substitute(var_name=var_name,num_of_missing_data=num_of_missing_data))
    if num_of_missing_data > 0:
        # numeric_only=True skips the string 'datetime' column, which has no mean
        return_df = df.fillna(df.mean(numeric_only=True))
    else:
        return_df = df
    print(Template('${var_name} Clear!').substitute(var_name=var_name))
    return return_df

def fill_missing_value_with_next_value(df, var_name = None):
    """Report the number of missing values, then back-fill them with the next valid value."""
    num_of_missing_data = df.isna().sum().sum()
    print(Template('Number of missing data of ${var_name}: ${num_of_missing_data}').substitute(var_name=var_name,num_of_missing_data=num_of_missing_data))
    if num_of_missing_data > 0:
        # bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas
        return_df = df.bfill()
    else:
        return_df = df
    print(Template('${var_name} Clear!').substitute(var_name=var_name))
    return return_df

def transform_df(df,var_name):
        
    return pd.DataFrame([pd.Series({
        'Datetime': parse(record[1]['datetime']),
        'Year': parse(record[1]['datetime']).year,
        'Month': parse(record[1]['datetime']).month,
        'Day': parse(record[1]['datetime']).day,
        'Hour': parse(record[1]['datetime']).hour,
        'City': city_name,
        var_name: record[1][city_name]
    }) for city_name in ['New York','Los Angeles'] for record in df[['datetime','New York','Los Angeles']].iterrows()])

def map_city_to_df(city_df,row,field):
    return_val = city_df[city_df['City']==row['City']][field].values[0]
    return return_val

In [40]:
# Have a look at data structure
print('City \n',list(citys.head()))
print('Other Dataframe \n', list(humidity.head()))

City 
 ['City', 'Country', 'Latitude', 'Longitude']
Other Dataframe 
 ['datetime', 'Vancouver', 'Portland', 'San Francisco', 'Seattle', 'Los Angeles', 'San Diego', 'Las Vegas', 'Phoenix', 'Albuquerque', 'Denver', 'San Antonio', 'Dallas', 'Houston', 'Kansas City', 'Minneapolis', 'Saint Louis', 'Chicago', 'Nashville', 'Indianapolis', 'Atlanta', 'Detroit', 'Jacksonville', 'Charlotte', 'Miami', 'Pittsburgh', 'Toronto', 'Philadelphia', 'New York', 'Montreal', 'Boston', 'Beersheba', 'Tel Aviv District', 'Eilat', 'Haifa', 'Nahariyya', 'Jerusalem']


In [24]:
cleared_df_list = []
cleared_humidity = fill_missing_value_with_mean(humidity,'Humidity')
cleared_df_list.append(('Humidity',cleared_humidity))
cleared_pressure = fill_missing_value_with_mean(pressure,'Pressure')
cleared_df_list.append(('Pressure',cleared_pressure))
cleared_temperature = fill_missing_value_with_mean(temperature,'Temperature')
cleared_df_list.append(('Temperature',cleared_temperature))
cleared_weather_description = fill_missing_value_with_next_value(weather_description, 'Weather Description')
cleared_df_list.append(('Weather Description',cleared_weather_description))
cleared_wind_direction = fill_missing_value_with_mean(wind_direction,'Wind Direction')
cleared_df_list.append(('Wind Direction',cleared_wind_direction))
cleared_wind_speed = fill_missing_value_with_mean(wind_speed,'Wind Speed')
cleared_df_list.append(('Wind Speed',cleared_wind_speed))

Number of missing data of Humidity: 28651
Humidity Clear!
Number of missing data of Pressure: 16680
Pressure Clear!
Number of missing data of Temperature: 8030
Temperature Clear!
Number of missing data of Weather Description: 7955
Weather Description Clear!
Number of missing data of Wind Direction: 7975
Wind Direction Clear!
Number of missing data of Wind Speed: 7993
Wind Speed Clear!


### Data Integration

After cleaning the data, we can merge (join) the tables into a single dataframe with the following structure:

```js
{
    City: String,
    Country: String,
    Latitude: Float,
    Longitude: Float,
    Humidity: Float,
    Pressure: Float,
    Temperature: Float,
    'Weather Description': String,
    'Wind Direction': Number,
    'Wind Speed': Float
}
```

1. First, we join all the dataframes except the City one.

In [None]:
df_list = [transform_df(df[1],df[0]) for df in cleared_df_list]
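
The `transform_df` helper reshapes each wide table (one column per city) into a long table (one row per timestamp and city) by iterating over rows. A vectorized variant of the same reshaping can be sketched with `DataFrame.melt` and `pd.to_datetime`. This is an alternative sketch, not the code used in this notebook; the function name `to_long` and the sample values are hypothetical.

```python
import pandas as pd

def to_long(df, var_name, cities=("New York", "Los Angeles")):
    """Reshape one wide per-city table into one row per (datetime, city)."""
    long_df = df[["datetime", *cities]].melt(
        id_vars="datetime", var_name="City", value_name=var_name
    )
    # Parse the timestamps once for the whole column instead of once per field.
    stamps = pd.to_datetime(long_df["datetime"])
    return pd.DataFrame({
        "Datetime": stamps,
        "Year": stamps.dt.year,
        "Month": stamps.dt.month,
        "Day": stamps.dt.day,
        "Hour": stamps.dt.hour,
        "City": long_df["City"],
        var_name: long_df[var_name],
    })

# Made-up sample in the same wide layout as humidity.csv.
wide = pd.DataFrame({
    "datetime": ["2012-10-01 12:00:00", "2012-10-01 13:00:00"],
    "New York": [58.0, 57.0],
    "Los Angeles": [88.0, 87.0],
    "Boston": [70.0, 71.0],
})
long_humidity = to_long(wide, "Humidity")
print(long_humidity)
```

The rows come out in the same order as in `transform_df`: all New York rows first, then all Los Angeles rows.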

In [None]:
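# Concatenate side by side, then keep only the first copy of each repeated key column (Datetime, Year, ...)
processed_df = pd.concat(df_list,join='inner',axis=1)
processed_df = processed_df.loc[:, ~processed_df.columns.duplicated()]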

for new_field in ['Country', 'Latitude', 'Longitude']:
    processed_df[new_field] = processed_df.apply(lambda row: map_city_to_df(citys,row,new_field),axis=1)

processed_df.head()
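
The integration step can also be expressed as key-based joins, which avoids relying on row order when combining the per-variable tables. The following is a minimal sketch under the assumption that each long table carries `Datetime` and `City` columns; the small input frames are made up, and the Los Angeles coordinates are only approximate illustrative values.

```python
from functools import reduce

import pandas as pd

keys = ["Datetime", "City"]

# Made-up long tables, one measurement column each.
humidity_long = pd.DataFrame({
    "Datetime": ["2012-10-01 12:00:00", "2012-10-01 12:00:00"],
    "City": ["New York", "Los Angeles"],
    "Humidity": [58.0, 88.0],
})
pressure_long = pd.DataFrame({
    "Datetime": ["2012-10-01 12:00:00", "2012-10-01 12:00:00"],
    "City": ["New York", "Los Angeles"],
    "Pressure": [1012.0, 1013.0],
})
# City attributes (Los Angeles coordinates are approximate, for illustration only).
city_attrs = pd.DataFrame({
    "City": ["New York", "Los Angeles"],
    "Country": ["United States", "United States"],
    "Latitude": [40.714272, 34.05],
    "Longitude": [-74.005966, -118.24],
})

# Join the measurement tables on (Datetime, City), one after another.
merged = reduce(
    lambda left, right: left.merge(right, on=keys, how="inner"),
    [humidity_long, pressure_long],
)
# Attach Country, Latitude and Longitude with a single join on City.
merged = merged.merge(city_attrs, on="City", how="left")
print(merged)
```

A single join on `City` plays the role of the row-wise `apply` over `map_city_to_df`, which looks up one field per row.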

In [None]:
# Save it to csv for easier processing
processed_df.to_csv('~/work/dataset/historical-hourly-weather-data/weather_data.csv',index=False)

In [4]:
weather_data = pd.read_csv('~/work/dataset/historical-hourly-weather-data/weather_data.csv')

weather_data.head()

| | Datetime | Year | Month | Day | Hour | City | Humidity | Pressure | Temperature | Weather Description | Wind Direction | Wind Speed | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-10-01 12:00:00 | 2012 | 10 | 1 | 12 | New York | 66.642417 | 1017.018977 | 285.400406 | few clouds | 196.250247 | 3.210954 | United States | 40.714272 | -74.005966 |
| 1 | 2012-10-01 13:00:00 | 2012 | 10 | 1 | 13 | New York | 58.0 | 1012.0 | 288.22 | few clouds | 260.0 | 7.0 | United States | 40.714272 | -74.005966 |
| 2 | 2012-10-01 14:00:00 | 2012 | 10 | 1 | 14 | New York | 57.0 | 1012.0 | 288.247676 | few clouds | 260.0 | 7.0 | United States | 40.714272 | -74.005966 |
| 3 | 2012-10-01 15:00:00 | 2012 | 10 | 1 | 15 | New York | 57.0 | 1012.0 | 288.32694 | few clouds | 260.0 | 7.0 | United States | 40.714272 | -74.005966 |
| 4 | 2012-10-01 16:00:00 | 2012 | 10 | 1 | 16 | New York | 57.0 | 1012.0 | 288.406203 | few clouds | 260.0 | 7.0 | United States | 40.714272 | -74.005966 |


#### Correlation Analysis

In [5]:
weather_data[['Humidity','Pressure','Temperature','Wind Direction','Wind Speed','Latitude','Longitude']].corr()

| | Humidity | Pressure | Temperature | Wind Direction | Wind Speed | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| Humidity | 1.0 | -0.036419 | -0.201542 | -0.118591 | -0.138882 | 0.093132 | 0.093132 |
| Pressure | -0.036419 | 1.0 | -0.204188 | -0.083557 | -0.060743 | 0.051858 | 0.051858 |
| Temperature | -0.201542 | -0.204188 | 1.0 | -0.021911 | -0.166884 | -0.305188 | -0.305188 |
| Wind Direction | -0.118591 | -0.083557 | -0.021911 | 1.0 | 0.350468 | 0.258286 | 0.258286 |
| Wind Speed | -0.138882 | -0.060743 | -0.166884 | 0.350468 | 1.0 | 0.475998 | 0.475998 |
| Latitude | 0.093132 | 0.051858 | -0.305188 | 0.258286 | 0.475998 | 1.0 | 1.0 |
| Longitude | 0.093132 | 0.051858 | -0.305188 | 0.258286 | 0.475998 | 1.0 | 1.0 |


The matrix above shows that most of the variables are only weakly correlated. `Humidity - Temperature` and `Pressure - Temperature` are negatively correlated at around -0.2, which suggests that `Temperature` tends to be lower when `Humidity` or `Pressure` is higher, although correlation alone does not show that one affects the other. `Wind Speed` and `Wind Direction` are positively correlated at around 0.35, a moderate association rather than proof that `Wind Direction` drives `Wind Speed`. `Latitude` and `Longitude` are perfectly correlated (1.0) because the data covers only two cities, so each of these columns takes just two values.
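
`DataFrame.corr()` computes the Pearson coefficient by default. A small sketch on made-up numbers shows why two columns that each take only two values, one per city, end up with a correlation of exactly 1.0 and share the same correlation with every other variable. The frame `toy` and its values are illustrative only.

```python
import pandas as pd

# Made-up data: three readings per city; coordinates are constant within a city.
toy = pd.DataFrame({
    "City": ["New York"] * 3 + ["Los Angeles"] * 3,
    "Latitude": [40.7] * 3 + [34.1] * 3,
    "Longitude": [-74.0] * 3 + [-118.2] * 3,
    "Temperature": [285.4, 288.2, 288.3, 291.9, 293.1, 295.0],
})

# Pearson correlation between every pair of numeric columns.
corr = toy[["Latitude", "Longitude", "Temperature"]].corr()
print(corr)
```

Because `Latitude` and `Longitude` carry the same information here (which city the row belongs to), keeping both adds nothing beyond the `City` column itself.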

### Data Transformation

We then transform `datetime` into `Hour`, `Day`, `Month`, and `Year`. In addition, we map the `Weather Description` field to an integer code for easier manipulation. The description-to-code map is as follows:

In [6]:
weather_description_code_map = { description: code
                                for code,description in enumerate(weather_data['Weather Description'].unique())}

weather_data['Weather Code'] = [weather_description_code_map[description] for description in weather_data['Weather Description']]

weather_description_code_map

{'few clouds': 0,
 'sky is clear': 1,
 'scattered clouds': 2,
 'broken clouds': 3,
 'overcast clouds': 4,
 'mist': 5,
 'drizzle': 6,
 'moderate rain': 7,
 'light intensity drizzle': 8,
 'light rain': 9,
 'fog': 10,
 'haze': 11,
 'heavy snow': 12,
 'heavy intensity drizzle': 13,
 'heavy intensity rain': 14,
 'light rain and snow': 15,
 'snow': 16,
 'light snow': 17,
 'freezing rain': 18,
 'proximity thunderstorm': 19,
 'thunderstorm': 20,
 'thunderstorm with rain': 21,
 'smoke': 22,
 'very heavy rain': 23,
 'thunderstorm with heavy rain': 24,
 'thunderstorm with light rain': 25,
 'squalls': 26,
 'dust': 27,
 'proximity thunderstorm with rain': 28,
 'thunderstorm with light drizzle': 29,
 'sand': 30,
 'shower rain': 31,
 'proximity thunderstorm with drizzle': 32,
 'light intensity shower rain': 33,
 'sand/dust whirls': 34,
 'heavy thunderstorm': 35,
 nan: 36,
 'proximity shower rain': 37}
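
The same kind of label encoding can be sketched with `pd.factorize`, which assigns integer codes in order of first appearance. This is an alternative sketch on made-up labels, not the mapping used above. Unlike the dictionary above, `pd.factorize` does not give missing values a code of their own; it marks them with -1.

```python
import pandas as pd

# Made-up labels, including one missing entry.
descriptions = pd.Series(["few clouds", "sky is clear", "few clouds", None, "mist"])

# codes: one integer per row; uniques: the labels in order of first appearance.
codes, uniques = pd.factorize(descriptions)

# Rebuild a description -> code dictionary comparable to the map above.
code_map = {description: code for code, description in enumerate(uniques)}

print(list(codes))
print(code_map)
```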

Let's bin all five continuous values into discrete values.

In [7]:
weather_data['Humidity_BIN'] = pd.cut(weather_data['Humidity'],10).astype('str')
weather_data['Pressure_BIN'] = pd.cut(weather_data['Pressure'],10).astype('str')
weather_data['Temperature_BIN'] = pd.cut(weather_data['Temperature'],10).astype('str')
weather_data['Wind Direction_BIN'] = pd.cut(weather_data['Wind Direction'],8, labels=["N", "NE", "E", "SE", "S", "SW", "W", "NW"]).astype('str')
weather_data['Wind Speed_BIN'] = pd.cut(weather_data['Wind Speed'],10).astype('str')
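
Passing an integer to `pd.cut` splits the observed range into equal-width intervals. For wind direction this means that, when the data span roughly 0 to 360 degrees, the bin labelled `N` covers about 0 to 45 degrees rather than being centred on north. The sketch below contrasts that with sectors centred on each compass heading. The helper `compass_sector` and the sample bearings are hypothetical and only illustrate the difference; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def compass_sector(degrees):
    """Map bearings in degrees to 8 sectors centred on N, NE, E, and so on."""
    # Shift by half a sector (22.5 degrees) so that N spans 337.5 to 22.5.
    idx = (((np.asarray(degrees, dtype=float) + 22.5) % 360) // 45).astype(int)
    return [SECTORS[i] for i in idx]

bearings = pd.Series([0.0, 44.0, 196.25, 260.0, 350.0])

# Equal-width bins over 0..360, close to what pd.cut(..., 8) yields on full-circle data.
equal_width = pd.cut(
    bearings, bins=np.arange(0, 361, 45), labels=SECTORS, include_lowest=True
).astype(str)

centred = compass_sector(bearings)

print(list(equal_width))
print(centred)
```

With equal-width bins, 350 degrees lands in `NW` and 44 degrees in `N`; with centred sectors, they land in `N` and `NE` respectively.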



Take a look at the final result of the data preprocessing phase

In [8]:
weather_data.head()

| | Datetime | Year | Month | Day | Hour | City | Humidity | Pressure | Temperature | Weather Description | ... | Wind Speed | Country | Latitude | Longitude | Weather Code | Humidity_BIN | Pressure_BIN | Temperature_BIN | Wind Direction_BIN | Wind Speed_BIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-10-01 12:00:00 | 2012 | 10 | 1 | 12 | New York | 66.642417 | 1017.018977 | 285.400406 | few clouds | ... | 3.210954 | United States | 40.714272 | -74.005966 | 0 | (62.0, 71.5] | (1002.2, 1018.8] | (283.122, 289.592] | S | (2.5, 5.0] |
| 1 | 2012-10-01 13:00:00 | 2012 | 10 | 1 | 13 | New York | 58.0 | 1012.0 | 288.22 | few clouds | ... | 7.0 | United States | 40.714272 | -74.005966 | 0 | (52.5, 62.0] | (1002.2, 1018.8] | (283.122, 289.592] | SW | (5.0, 7.5] |
| 2 | 2012-10-01 14:00:00 | 2012 | 10 | 1 | 14 | New York | 57.0 | 1012.0 | 288.247676 | few clouds | ... | 7.0 | United States | 40.714272 | -74.005966 | 0 | (52.5, 62.0] | (1002.2, 1018.8] | (283.122, 289.592] | SW | (5.0, 7.5] |
| 3 | 2012-10-01 15:00:00 | 2012 | 10 | 1 | 15 | New York | 57.0 | 1012.0 | 288.32694 | few clouds | ... | 7.0 | United States | 40.714272 | -74.005966 | 0 | (52.5, 62.0] | (1002.2, 1018.8] | (283.122, 289.592] | SW | (5.0, 7.5] |
| 4 | 2012-10-01 16:00:00 | 2012 | 10 | 1 | 16 | New York | 57.0 | 1012.0 | 288.406203 | few clouds | ... | 7.0 | United States | 40.714272 | -74.005966 | 0 | (52.5, 62.0] | (1002.2, 1018.8] | (283.122, 289.592] | SW | (5.0, 7.5] |


## Data Mining

Supervised: perform K-fold cross-validation with Decision Tree, kNN, and Neural Network classifiers

Clustering: K-Means

In [9]:
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import cluster, metrics
from mlxtend.frequent_patterns import apriori

New_York_X = weather_data[weather_data['City'] == 'New York'][['Humidity','Pressure','Temperature','Wind Direction','Wind Speed','Latitude','Longitude']]
New_York_y = weather_data[weather_data['City'] == 'New York']['Weather Code']

Los_Angeles_X = weather_data[weather_data['City'] == 'Los Angeles'][['Humidity','Pressure','Temperature','Wind Direction','Wind Speed','Latitude','Longitude']]
Los_Angeles_y = weather_data[weather_data['City'] == 'Los Angeles']['Weather Code']



## New York

In [10]:
X = New_York_X
y = New_York_y

kf = KFold(n_splits=3)

print('Number of fold:', kf.get_n_splits(X))

# Decision Tree
tree_clf = DecisionTreeClassifier(criterion='entropy')

#kNN with k = 3
kNN_clf = KNeighborsClassifier(n_neighbors=3)

#Neural network
nn_clf = MLPClassifier()

classifier = [
    {
        'name': 'Decision Tree',
        'classifier': tree_clf
    },
    {
        'name': 'kNN',
        'classifier': kNN_clf
    },
    {
        'name': 'Neural network',
        'classifier': nn_clf
    }
]

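# Note: each fold is paired with a different classifier, so every model is
# trained and scored on exactly one split (fold 1: tree, fold 2: kNN, fold 3: NN)
current_clf_index = 0
for train_index, test_index in kf.split(X):
    clf = classifier[current_clf_index]['classifier']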
#     print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    clf.fit(X_train,y_train)
    print(classifier[current_clf_index]['name'])
    print(clf.score(X_test,y_test),'\n')
    current_clf_index = current_clf_index + 1
    

Number of fold: 3
Decision Tree
0.164998342725 

kNN
0.208896844338 

Neural network
0.209758684699 
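
Accuracy scores such as those above are easier to judge against a trivial baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; `X_demo` and `y_demo` are illustrative stand-ins, not the weather data, so the number it prints says nothing about the results above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: 200 samples, 3 classes with frequencies 120 / 50 / 30.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 120 + [1] * 50 + [2] * 30)

# The baseline ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print("Majority-class accuracy:", baseline_accuracy)
```

A classifier is only informative to the extent that it beats this figure on the same test split.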



## Los Angeles

In [15]:
X = Los_Angeles_X
y = Los_Angeles_y

kf = KFold(n_splits=3)

print('Number of fold:', kf.get_n_splits(X))

# Decision Tree
tree_clf = DecisionTreeClassifier(criterion='entropy')

#kNN with k = 3
kNN_clf = KNeighborsClassifier(n_neighbors=3)

#Neural network
nn_clf = MLPClassifier()

classifier = [
    {
        'name': 'Decision Tree',
        'classifier': tree_clf
    },
    {
        'name': 'kNN',
        'classifier': kNN_clf
    },
    {
        'name': 'Neural network',
        'classifier': nn_clf
    }
]

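# Note: each fold is paired with a different classifier, so every model is
# trained and scored on exactly one split (fold 1: tree, fold 2: kNN, fold 3: NN)
current_clf_index = 0
for train_index, test_index in kf.split(X):
    clf = classifier[current_clf_index]['classifier']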
#     print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    clf.fit(X_train,y_train)
    print(classifier[current_clf_index]['name'])
    print(clf.score(X_test,y_test),'\n')
    current_clf_index = current_clf_index + 1
    

Number of fold: 3
Decision Tree
0.405303281405 

kNN
0.557212940864 

Neural network
0.452267303103 
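
In the two loops above, the fold index also selects the classifier, so each model is trained and scored on a single, different fold, and the three scores are not directly comparable. A minimal sketch of scoring every model on every fold with `cross_val_score` follows. It runs on synthetic data from `make_classification`; `X_demo`, `y_demo`, and `models` are illustrative names, and the neural network is left out only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather features and weather codes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_classes=3, random_state=0
)

kf = KFold(n_splits=3)
models = {
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=3),
}

mean_scores = {}
for name, model in models.items():
    # One accuracy per fold; every model sees the same three splits.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=kf)
    mean_scores[name] = fold_scores.mean()
    print(name, fold_scores.round(3), "mean:", round(mean_scores[name], 3))
```

Averaging over the same folds gives one comparable number per model. For the weather data, `X_demo` and `y_demo` would be replaced by the per-city `X` and `y` defined earlier.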

