# Hong Kong Pollution: Monthly Max @ Causeway Bay

Causeway Bay (CWB) is the district whose AQHI we want to predict, because both [backpackers](link:https://www.thebrokebackpacker.com/where-to-stay-in-hong-kong/) and [others](link:https://misstourist.com/where-to-stay-in-hong-kong-best-hotels/) rate it as the best location for families to stay. Since children and the elderly are [most at risk for pollution-related issues](link:https://www.health.nsw.gov.au/environment/air/Pages/who-is-affected.aspx), its AQHI is the most relevant to this study.

**The Data Science Method**  


0.   Problem Identification 

1.   **Data Wrangling** 
  * Data Collection
      - Locating the data
      - Data loading
      - Data joining
  * Data Organization
      - File structure
      - Git & GitHub
  * Data Definition
      - Column names
      - Data types (numeric, categorical, timestamp, etc.)
      - Description of the columns
      - Count or percent per unique values or codes (including NA)
      - The range of values or codes  
  * Data Cleaning
      - NA or missing data
      - Duplicates
 
2.   Exploratory Data Analysis 

3.   Pre-processing and Training Data Development

4.   Modeling 

5.   Documentation

# Data Collection
Load required packages and modules into Python. Then load the data into a pandas dataframe for ease of use.

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

%matplotlib inline

**Save current working directory and parent path.**

In [2]:
path = os.getcwd()
parent = os.path.dirname(path)
print(parent)

/Users/tiffanyflor/Dropbox/MyProjects/HongKongPollution/HongKongPollution


**Save data/interim/ directory path and print contents.**

In [3]:
data_path = parent + '/data/interim/'

In [4]:
os.listdir(data_path)

['monthly_pollution_2014_2020.csv',
 'daily_max_pollution.csv',
 'hourly_pollution.csv',
 '.gitkeep',
 'pollution_monthly_min_2014_2020.csv',
 'pollution_monthly_max_2014_2020.csv',
 '1.4_cwb_pollution_daily_max.csv',
 'joined_weather_pollution_all_districts.csv',
 'cleaned_weather_2014_2020.csv']
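
The same paths can also be built with `pathlib`, which joins path segments without manual string concatenation. This is a minimal sketch, assuming the notebook runs from a folder one level below the project root; `project_root` and `data_dir` are illustrative names, not part of the original notebook.

```python
from pathlib import Path

# Assumption: the current working directory sits one level below the project root.
project_root = Path.cwd().parent

# The / operator joins segments using the platform's separator.
data_dir = project_root / 'data' / 'interim'
print(data_dir)
```

A `Path` can be passed directly to `pd.read_csv`, for example `pd.read_csv(data_dir / 'some_file.csv')`.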

## Load monthly max pollution data from csv file

In [5]:
df_all = pd.read_csv(data_path+'pollution_monthly_max_2014_2020.csv')
df_all.head()

| | Date | Central/Western | Eastern | Kwun Tong | Sham Shui Po | Kwai Chung | Tsuen Wan | Yuen Long | Tuen Mun | Tung Chung | Tai Po | Sha Tin | Causeway Bay | Central | Mong Kok |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-31 | 6.387097 | 5.741935 | 6.677419 | 6.548387 | 6.419355 | 6.516129 | 6.580645 | 6.677419 | 7.032258 | 5.806452 | 6.225806 | 8.064516 | 7.806452 | 7.258065 |
| 1 | 2014-02-28 | 4.178571 | 3.857143 | 4.142857 | 4.357143 | 4.25 | 4.321429 | 4.285714 | 4.321429 | 4.428571 | 4.071429 | 4.0 | 5.678571 | 5.75 | 5.035714 |
| 2 | 2014-03-31 | 4.870968 | 4.354839 | 4.741935 | 4.967742 | 4.967742 | 4.935484 | 4.580645 | 4.709677 | 4.645161 | 4.580645 | 4.483871 | 6.709677 | 6.064516 | 5.677419 |
| 3 | 2014-04-30 | 4.966667 | 4.8 | 5.0 | 5.2 | 5.033333 | 4.9 | 4.966667 | 4.8 | 4.666667 | 4.533333 | 4.833333 | 6.166667 | 6.133333 | 6.2 |
| 4 | 2014-05-31 | 3.322581 | 3.387097 | 3.806452 | 3.774194 | 3.741935 | 3.387097 | 3.451613 | 3.225806 | 3.064516 | 3.419355 | 3.419355 | 5.258065 | 4.516129 | 4.387097 |
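
Because the `Date` column is converted to datetime further down, one alternative is to parse it while loading. The sketch below assumes a CSV with a `Date` column like the one shown above; `csv_text` is a hypothetical inline stand-in built from the first rows of that output (trimmed to two districts), not the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file on disk, using values from the output above.
csv_text = """Date,Eastern,Causeway Bay
2014-01-31,5.741935,8.064516
2014-02-28,3.857143,5.678571
2014-03-31,4.354839,6.709677
"""

# parse_dates converts the named column to datetime while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
print(sample.dtypes)
```

With this option the later `pd.to_datetime` step becomes unnecessary, at the cost of hiding the conversion inside the load call.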


## Isolate Causeway Bay

In [6]:
df = df_all[['Date','Causeway Bay']]
df.head()

| | Date | Causeway Bay |
|---|---|---|
| 0 | 2014-01-31 | 8.064516 |
| 1 | 2014-02-28 | 5.678571 |
| 2 | 2014-03-31 | 6.709677 |
| 3 | 2014-04-30 | 6.166667 |
| 4 | 2014-05-31 | 5.258065 |


In [7]:
df.Date[0]

'2014-01-31'

# Data Organization
Completed using cookiecutter. See README for structure.

# Data Definition
Review column names, data types, and null values.

## Column Names

In [8]:
df.columns

Index(['Date', 'Causeway Bay'], dtype='object')

## Data Types
Review which columns are integer, float, categorical, or dates. Ensure the data types are loaded properly into the dataframe.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          84 non-null     object 
 1   Causeway Bay  84 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.4+ KB
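
The dtype review can also be expressed as explicit checks, which is handy when the notebook is rerun on new data. This is a minimal sketch using `pandas.api.types`; `demo` is a stand-in frame with the same two column names and the first values shown above, and the variable names are illustrative.

```python
import pandas as pd
from pandas.api import types as ptypes

# Stand-in frame: same column names as df, values taken from the output above.
demo = pd.DataFrame({
    'Date': ['2014-01-31', '2014-02-28', '2014-03-31'],
    'Causeway Bay': [8.064516, 5.678571, 6.709677],
})

# Date is still plain text at this point, so it is not yet a datetime column.
date_is_datetime = ptypes.is_datetime64_any_dtype(demo['Date'])

# The AQHI readings should already be floating point.
aqhi_is_float = ptypes.is_float_dtype(demo['Causeway Bay'])

print(date_is_datetime, aqhi_is_float)
```

A `False` for the date check is the signal that the column still needs converting before any time-series work.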


# Data Cleaning

## Change Date to datetime object

In [11]:
# df was sliced from df_all, so build a new frame with assign() instead of
# writing into the slice; this avoids pandas' SettingWithCopyWarning.
df = df.assign(Date=pd.to_datetime(df['Date']))
df.head()

| | Date | Causeway Bay |
|---|---|---|
| 0 | 2014-01-31 | 8.064516 |
| 1 | 2014-02-28 | 5.678571 |
| 2 | 2014-03-31 | 6.709677 |
| 3 | 2014-04-30 | 6.166667 |
| 4 | 2014-05-31 | 5.258065 |


In [12]:
df.dtypes

Date            datetime64[ns]
Causeway Bay           float64
dtype: object

In [13]:
# Set index to DateTime and drop Date
df = df.set_index('Date')
df.head()

| Date | Causeway Bay |
|---|---|
| 2014-01-31 | 8.064516 |
| 2014-02-28 | 5.678571 |
| 2014-03-31 | 6.709677 |
| 2014-04-30 | 6.166667 |
| 2014-05-31 | 5.258065 |


In [17]:
# Round AQHI values to one decimal place
df['Causeway Bay'] = df['Causeway Bay'].round(1)
df.head()

| Date | Causeway Bay |
|---|---|
| 2014-01-31 | 8.1 |
| 2014-02-28 | 5.7 |
| 2014-03-31 | 6.7 |
| 2014-04-30 | 6.2 |
| 2014-05-31 | 5.3 |


## Handle Missing Data -- we have none!

In [18]:
print('There are {} null values in df.'.format(df.isnull().sum().sum()))

There are 0 null values in df.
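
For a date-indexed series, a null count does not catch repeated timestamps or months that are absent altogether. The sketch below shows one way to check for both, assuming one row per calendar month; `demo` is a stand-in with the first three rounded values shown above, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in for df: a DatetimeIndex named Date and a single AQHI column.
demo = pd.DataFrame(
    {'Causeway Bay': [8.1, 5.7, 6.7]},
    index=pd.DatetimeIndex(['2014-01-31', '2014-02-28', '2014-03-31'], name='Date'),
)

# Null values per column.
null_counts = demo.isnull().sum()

# Timestamps that appear more than once in the index.
duplicate_dates = int(demo.index.duplicated().sum())

# Months between the first and last observation that have no row at all.
months = demo.index.to_period('M')
expected = pd.period_range(months.min(), months.max(), freq='M')
missing_months = expected.difference(months)

print(null_counts.to_dict(), duplicate_dates, len(missing_months))
```

This also covers the "Duplicates" item in the outline, which the null check alone does not address.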


# Export data to new csv file

In [19]:
df.to_csv(data_path + '1.4_cwb_pollution_monthly_max.csv')
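
Since `to_csv` writes the datetime index as text, a later notebook has to parse it again on load. The sketch below round-trips a stand-in frame through an in-memory buffer to show the matching `read_csv` options; `demo`, `buffer`, and `restored` are illustrative names, and the buffer takes the place of the real file.

```python
import io

import pandas as pd

# Stand-in for df: a DatetimeIndex named Date and a single AQHI column.
demo = pd.DataFrame(
    {'Causeway Bay': [8.1, 5.7, 6.7]},
    index=pd.DatetimeIndex(['2014-01-31', '2014-02-28', '2014-03-31'], name='Date'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col restores Date as the index; parse_dates converts it back to datetime.
restored = pd.read_csv(buffer, index_col='Date', parse_dates=['Date'])
print(restored.index.dtype)
print(restored.head())
```

Without `parse_dates`, the restored index would hold strings and date-based slicing would not work as expected.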