# Hong Kong Pollution & Weather: Data Join

**The Data Science Method**  


0.   Problem Identification 

1.   **Data Wrangling** 
  * Data Collection
      - Locating the data
      - Data loading
      - Data joining
  * Data Organization
      - File structure
      - Git & GitHub
  * Data Definition
      - Column names
      - Data types (numeric, categorical, timestamp, etc.)
      - Description of the columns
      - Count or percent per unique values or codes (including NA)
      - The range of values or codes  
  * Data Cleaning
      - NA or missing data
      - Duplicates
 
2.   Exploratory Data Analysis 

3.   Pre-processing and Training Data Development

4.   Modeling 

5.   Documentation

# Data Collection
Load required packages and modules into Python. Then load the data into a pandas dataframe for ease of use.

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

**Save current working directory and parent path.**

In [2]:
path = os.getcwd()
parent = os.path.dirname(path)
print(parent)

/Users/tiffanyflor/Dropbox/MyProjects/HongKongPollution/HongKongPollution


**Save data/interim/ directory path and print contents.**

In [3]:
data_path = parent + '/data/interim/'

In [4]:
os.listdir(data_path)

['monthly_pollution_2014_2020.csv',
 'daily_max_pollution.csv',
 'hourly_pollution.csv',
 '.gitkeep',
 'pollution_monthly_min_2014_2020.csv',
 'pollution_monthly_max_2014_2020.csv',
 'joined_weather_pollution_all_districts.csv',
 'cleaned_weather_2014_2020.csv']
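
The string concatenation used to build `data_path` only works because the directory string ends in a slash. A minimal sketch of the same step with `pathlib`, assuming the same layout (`data/interim/` under the parent of the working directory); the helper name `interim_dir` is hypothetical and not part of this notebook:

```python
from pathlib import Path


def interim_dir(start=None):
    """Return <parent of start>/data/interim as a Path (hypothetical helper)."""
    start = Path.cwd() if start is None else Path(start)
    return start.parent / "data" / "interim"


data_dir = interim_dir()
# Joining with "/" avoids relying on a trailing slash in the directory string.
pollution_csv = data_dir / "monthly_pollution_2014_2020.csv"
print(pollution_csv)
```

`pd.read_csv` accepts a `Path` object directly, so `pollution_csv` could be passed to it unchanged.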

## Load monthly_pollution data from csv file

In [5]:
pollution = pd.read_csv(data_path+'monthly_pollution_2014_2020.csv')
pollution.head()

,Date,Central/Western,Eastern,Kwun Tong,Sham Shui Po,Kwai Chung,Tsuen Wan,Yuen Long,Tuen Mun,Tung Chung,Tai Po,Sha Tin,Causeway Bay,Central,Mong Kok
0,2014-01-31,5.151882,4.657258,5.40457,5.276882,5.181452,5.119624,5.233871,5.362903,5.282258,4.696237,4.931452,6.310484,6.110215,5.63172
1,2014-02-28,3.494048,3.244048,3.587798,3.590774,3.511905,3.474702,3.403274,3.407738,3.558036,3.4375,3.3125,4.379464,4.346726,4.049107
2,2014-03-31,3.923387,3.600806,3.93414,3.994624,3.897849,3.901882,3.767473,3.719086,3.744624,3.721774,3.716398,5.024194,4.790323,4.494624
3,2014-04-30,4.101389,3.925,4.186111,4.354167,4.166667,4.075,3.943056,3.919444,3.751389,3.820833,3.9125,4.9625,4.919444,4.898611
4,2014-05-31,2.700269,2.834677,3.119624,3.038978,3.068548,2.803763,2.782258,2.646505,2.63172,2.775538,2.837366,3.75,3.473118,3.483871


## Load cleaned_weather data from csv file

In [6]:
weather = pd.read_csv(data_path+'cleaned_weather_2014_2020.csv', index_col=0)
weather.head()

,Date Period,Mean Pressure (hPa),Mean Daily Max Air Temp (C°),Mean Air Temp (C°),Mean Daily Min Air Temp (C°),Mean Dew Point (C°),Mean Relative Humidity (%),Mean Amount of Cloud Coverage (%),Total Rainfall (mm),Prevailing Wind Direction (degrees),Mean Wind Speed (km/h)
0,2014-01,1021.3,19.2,16.3,14.1,9.9,67,32,1.0,40,22.9
1,2014-02,1017.7,17.9,15.5,13.5,12.3,82,73,39.5,50,26.6
2,2014-03,1017.1,20.9,18.7,17.0,15.7,83,77,207.6,60,24.1
3,2014-04,1013.4,24.9,22.6,21.0,20.0,86,72,132.4,80,20.6
4,2014-05,1009.5,28.6,26.4,24.6,23.7,86,82,687.3,240,23.7


# Data Organization
Completed using cookiecutter. See README for structure.

# Data Definition
Review column names, data types, and null values.

## Column Names

In [7]:
pollution.columns

Index(['Date', 'Central/Western', 'Eastern', 'Kwun Tong', 'Sham Shui Po',
       'Kwai Chung', 'Tsuen Wan', 'Yuen Long', 'Tuen Mun', 'Tung Chung',
       'Tai Po', 'Sha Tin', 'Causeway Bay', 'Central', 'Mong Kok'],
      dtype='object')

In [8]:
weather.columns

Index(['Date Period', 'Mean Pressure (hPa)', 'Mean Daily Max Air Temp (C°)',
       'Mean Air Temp (C°)', 'Mean Daily Min Air Temp (C°)',
       'Mean Dew Point (C°)', 'Mean Relative Humidity (%)',
       'Mean Amount of Cloud Coverage (%)', 'Total Rainfall (mm)',
       'Prevailing Wind Direction (degrees)', 'Mean Wind Speed (km/h)'],
      dtype='object')

## Data Types
Review which columns are integer, float, categorical, or dates. Ensure the data types are loaded properly into the dataframe.

In [9]:
pollution.dtypes

Date                object
Central/Western    float64
Eastern            float64
Kwun Tong          float64
Sham Shui Po       float64
Kwai Chung         float64
Tsuen Wan          float64
Yuen Long          float64
Tuen Mun           float64
Tung Chung         float64
Tai Po             float64
Sha Tin            float64
Causeway Bay       float64
Central            float64
Mong Kok           float64
dtype: object

In [10]:
pollution.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             84 non-null     object 
 1   Central/Western  84 non-null     float64
 2   Eastern          84 non-null     float64
 3   Kwun Tong        84 non-null     float64
 4   Sham Shui Po     84 non-null     float64
 5   Kwai Chung       84 non-null     float64
 6   Tsuen Wan        84 non-null     float64
 7   Yuen Long        84 non-null     float64
 8   Tuen Mun         84 non-null     float64
 9   Tung Chung       84 non-null     float64
 10  Tai Po           84 non-null     float64
 11  Sha Tin          84 non-null     float64
 12  Causeway Bay     84 non-null     float64
 13  Central          84 non-null     float64
 14  Mong Kok         84 non-null     float64
dtypes: float64(14), object(1)
memory usage: 10.0+ KB


In [11]:
weather.dtypes

Date Period                             object
Mean Pressure (hPa)                    float64
Mean Daily Max Air Temp (C°)           float64
Mean Air Temp (C°)                     float64
Mean Daily Min Air Temp (C°)           float64
Mean Dew Point (C°)                    float64
Mean Relative Humidity (%)               int64
Mean Amount of Cloud Coverage (%)        int64
Total Rainfall (mm)                    float64
Prevailing Wind Direction (degrees)      int64
Mean Wind Speed (km/h)                 float64
dtype: object

In [12]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84 entries, 0 to 83
Data columns (total 11 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Date Period                          84 non-null     object 
 1   Mean Pressure (hPa)                  84 non-null     float64
 2   Mean Daily Max Air Temp (C°)         84 non-null     float64
 3   Mean Air Temp (C°)                   84 non-null     float64
 4   Mean Daily Min Air Temp (C°)         84 non-null     float64
 5   Mean Dew Point (C°)                  84 non-null     float64
 6   Mean Relative Humidity (%)           84 non-null     int64  
 7   Mean Amount of Cloud Coverage (%)    84 non-null     int64  
 8   Total Rainfall (mm)                  84 non-null     float64
 9   Prevailing Wind Direction (degrees)  84 non-null     int64  
 10  Mean Wind Speed (km/h)               84 non-null     float64
dtypes: float64(7), int64(3), object(1)
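
The outline at the top lists the range of values as part of data definition. A brief sketch of how that could be checked with `describe()`, assuming a stand-in frame (`weather_demo`, a hypothetical name) built from three columns and the first three rows of the weather `head()` output:

```python
import pandas as pd

# Stand-in for `weather`: three columns, first three rows of the head() output.
weather_demo = pd.DataFrame(
    {
        "Mean Air Temp (C°)": [16.3, 15.5, 18.7],
        "Mean Relative Humidity (%)": [67, 82, 83],
        "Total Rainfall (mm)": [1.0, 39.5, 207.6],
    }
)

# The min and max rows of describe() give the observed range per column.
ranges = weather_demo.describe().loc[["min", "max"]]
print(ranges)
```

The same call on the full `weather` and `pollution` frames would cover all 84 months.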

# Data Cleaning

## Update Index to PeriodIndex in both dataframes
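Convert the date column of each dataframe to a monthly `PeriodIndex` (`period[M]`) so that both share the same index and can be joined on it.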

In [13]:
pollution.index = pd.PeriodIndex(pollution['Date'], freq='M')

In [14]:
pollution = pollution.drop('Date',axis=1)

In [15]:
pollution.index.dtype

period[M]

In [16]:
weather.index = pd.PeriodIndex(weather['Date Period'], freq='M')

In [17]:
weather = weather.drop('Date Period', axis=1)

In [18]:
weather.index.dtype

period[M]
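
The two date columns use different formats: month-end dates such as `2014-01-31` in `pollution` and year-month strings such as `2014-01` in `weather`. A small sketch, using the first two values from each `head()` output, of how both collapse to the same monthly periods (the variable names are illustrative only):

```python
import pandas as pd

# First two values of each date column, copied from the head() outputs.
pollution_dates = pd.PeriodIndex(["2014-01-31", "2014-02-28"], freq="M")
weather_dates = pd.PeriodIndex(["2014-01", "2014-02"], freq="M")

# Both indexes now hold the monthly periods 2014-01 and 2014-02,
# so rows from the two frames line up label for label.
print(pollution_dates.equals(weather_dates))
```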

## Concat Weather and Pollution DataFrames

In [19]:
df = pd.concat([weather, pollution], axis=1)

In [20]:
df.head()

,Mean Pressure (hPa),Mean Daily Max Air Temp (C°),Mean Air Temp (C°),Mean Daily Min Air Temp (C°),Mean Dew Point (C°),Mean Relative Humidity (%),Mean Amount of Cloud Coverage (%),Total Rainfall (mm),Prevailing Wind Direction (degrees),Mean Wind Speed (km/h),...,Kwai Chung,Tsuen Wan,Yuen Long,Tuen Mun,Tung Chung,Tai Po,Sha Tin,Causeway Bay,Central,Mong Kok
2014-01,1021.3,19.2,16.3,14.1,9.9,67,32,1.0,40,22.9,...,5.181452,5.119624,5.233871,5.362903,5.282258,4.696237,4.931452,6.310484,6.110215,5.63172
2014-02,1017.7,17.9,15.5,13.5,12.3,82,73,39.5,50,26.6,...,3.511905,3.474702,3.403274,3.407738,3.558036,3.4375,3.3125,4.379464,4.346726,4.049107
2014-03,1017.1,20.9,18.7,17.0,15.7,83,77,207.6,60,24.1,...,3.897849,3.901882,3.767473,3.719086,3.744624,3.721774,3.716398,5.024194,4.790323,4.494624
2014-04,1013.4,24.9,22.6,21.0,20.0,86,72,132.4,80,20.6,...,4.166667,4.075,3.943056,3.919444,3.751389,3.820833,3.9125,4.9625,4.919444,4.898611
2014-05,1009.5,28.6,26.4,24.6,23.7,86,82,687.3,240,23.7,...,3.068548,2.803763,2.782258,2.646505,2.63172,2.775538,2.837366,3.75,3.473118,3.483871


**Examine the shape of each dataframe to confirm the concat: `df` should keep all 84 rows and have 10 + 14 = 24 columns.**

In [21]:
pollution.shape

(84, 14)

In [22]:
weather.shape

(84, 10)

In [23]:
df.shape

(84, 24)
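
The shape comparison can also be written as assertions so that a misaligned index fails loudly instead of silently adding rows. A minimal sketch on small stand-in frames; `check_concat` and the `_demo` names are hypothetical, and the values are rounded from the first two rows of the `head()` outputs:

```python
import pandas as pd


def check_concat(left, right, joined):
    """Raise AssertionError if an axis=1 concat added rows or lost columns."""
    # Identical indexes mean no rows were introduced by misaligned labels.
    assert left.index.equals(right.index)
    assert len(joined) == len(left)
    assert joined.shape[1] == left.shape[1] + right.shape[1]


idx = pd.period_range("2014-01", periods=2, freq="M")
weather_demo = pd.DataFrame({"Mean Air Temp (C°)": [16.3, 15.5]}, index=idx)
pollution_demo = pd.DataFrame({"Eastern": [4.66, 3.24]}, index=idx)

joined_demo = pd.concat([weather_demo, pollution_demo], axis=1)
check_concat(weather_demo, pollution_demo, joined_demo)
print(joined_demo.shape)
```

Because `pd.concat` with `axis=1` performs an outer join by default, labels that exist in only one frame would produce extra rows filled with NaN, which is what the row-count assertion guards against.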

## Handle Missing Data -- we have none!

In [24]:
print('There are {} null values in df.'.format(df.isnull().sum().sum()))

There are 0 null values in df.


# Export data to new csv file

In [25]:
df.to_csv(data_path + 'joined_weather_pollution_all_districts.csv')
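
A `PeriodIndex` is written to CSV as plain strings such as `2014-01`, so it has to be rebuilt when the file is read back. A minimal round-trip sketch on a stand-in frame, assuming an in-memory buffer in place of the real file (the `demo` and `restored` names are illustrative):

```python
import io

import pandas as pd

idx = pd.period_range("2014-01", periods=2, freq="M")
demo = pd.DataFrame({"Eastern": [4.66, 3.24]}, index=idx)

# Write to an in-memory buffer instead of data/interim/ for this sketch.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# The index comes back as strings; convert it to monthly periods again.
restored = pd.read_csv(buffer, index_col=0)
restored.index = pd.PeriodIndex(restored.index, freq="M")
print(restored.index.dtype)
```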