# parse_data.ipynb

This notebook parses the data files used for the FP-2 assignment. 

<br>
<br>

First let's read the attached data file:

In [1]:
import pandas as pd

df0 = pd.read_csv('Air_Quality_Data.csv')

df0.describe()

| | No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | Iws | Is | Ir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 41757.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 |
| mean | 21912.5 | 2012.0 | 6.523549 | 15.72782 | 11.5 | 98.613215 | 1.817246 | 12.448521 | 1016.447654 | 23.88914 | 0.052734 | 0.194916 |
| std | 12651.043435 | 1.413842 | 3.448572 | 8.799425 | 6.922266 | 92.050387 | 14.43344 | 12.198613 | 10.268698 | 50.010635 | 0.760375 | 1.415867 |
| min | 1.0 | 2010.0 | 1.0 | 1.0 | 0.0 | 0.0 | -40.0 | -19.0 | 991.0 | 0.45 | 0.0 | 0.0 |
| 25% | 10956.75 | 2011.0 | 4.0 | 8.0 | 5.75 | 29.0 | -10.0 | 2.0 | 1008.0 | 1.79 | 0.0 | 0.0 |
| 50% | 21912.5 | 2012.0 | 7.0 | 16.0 | 11.5 | 72.0 | 2.0 | 14.0 | 1016.0 | 5.37 | 0.0 | 0.0 |
| 75% | 32868.25 | 2013.0 | 10.0 | 23.0 | 17.25 | 137.0 | 15.0 | 23.0 | 1025.0 | 21.91 | 0.0 | 0.0 |
| max | 43824.0 | 2014.0 | 12.0 | 31.0 | 23.0 | 994.0 | 28.0 | 42.0 | 1046.0 | 585.6 | 27.0 | 36.0 |
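
The `count` row above is 41757 for `pm2.5` but 43824 for every other column, so 43824 − 41757 = 2067 values of `pm2.5` are missing. A minimal sketch of how such gaps can be counted directly is given below; the `toy` frame and its values are made up for illustration and are not taken from the data file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df0 (values are invented)
toy = pd.DataFrame({
    "pm2.5": [129.0, np.nan, 148.0, np.nan],
    "TEMP": [-4.0, -5.0, -5.0, -6.0],
})

# Number of missing entries in each column
missing = toy.isna().sum()
print(missing)

# describe() counts only non-missing values, so the number of rows
# minus the count gives the same figure for a single column
gap = len(toy) - toy["pm2.5"].count()
print(gap)
```

Applying the same two lines to `df0` should reproduce the gap seen in the `count` row.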


Next let's check for and remove missing values:

In [2]:
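# Optional check: print(df0.isnull().sum()) lists the number of missing values per column

# Drop only the rows where the dependent variable pm2.5 is missing
df_clean = df0.dropna(subset=['pm2.5'])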

df_clean.describe()

| | No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | Iws | Is | Ir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 |
| mean | 22279.380104 | 2012.042771 | 6.513758 | 15.685514 | 11.502311 | 98.613215 | 1.750174 | 12.401561 | 1016.442896 | 23.866747 | 0.055344 | 0.194866 |
| std | 12658.168415 | 1.415311 | 3.454199 | 8.785539 | 6.924848 | 92.050387 | 14.433658 | 12.175215 | 10.300733 | 49.617495 | 0.778875 | 1.418165 |
| min | 25.0 | 2010.0 | 1.0 | 1.0 | 0.0 | 0.0 | -40.0 | -19.0 | 991.0 | 0.45 | 0.0 | 0.0 |
| 25% | 11464.0 | 2011.0 | 4.0 | 8.0 | 5.0 | 29.0 | -10.0 | 2.0 | 1008.0 | 1.79 | 0.0 | 0.0 |
| 50% | 22435.0 | 2012.0 | 7.0 | 16.0 | 12.0 | 72.0 | 2.0 | 14.0 | 1016.0 | 5.37 | 0.0 | 0.0 |
| 75% | 33262.0 | 2013.0 | 10.0 | 23.0 | 18.0 | 137.0 | 15.0 | 23.0 | 1025.0 | 21.91 | 0.0 | 0.0 |
| max | 43824.0 | 2014.0 | 12.0 | 31.0 | 23.0 | 994.0 | 28.0 | 42.0 | 1046.0 | 565.49 | 27.0 | 36.0 |
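
To illustrate why `subset=['pm2.5']` is passed to `dropna`, here is a small sketch comparing it with a plain `dropna()`. The `toy` frame is hypothetical, and the missing `TEMP` value is added only to show the difference; in the real data every other column is already complete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are invented for illustration)
toy = pd.DataFrame({
    "pm2.5": [129.0, np.nan, 148.0],
    "TEMP": [-4.0, -5.0, np.nan],
})

# Drops only the row where pm2.5 is missing (row 1)
only_pm = toy.dropna(subset=["pm2.5"])

# Drops every row with a missing value in any column (rows 1 and 2)
any_col = toy.dropna()

print(len(only_pm), len(any_col))
```

Note that `dropna` keeps the original index labels of the surviving rows; `reset_index(drop=True)` can be chained if a contiguous index is wanted.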


<br>
<br>

The dependent and independent variables (DVs and IVs) that we are interested in are:

**DVs**:
- Hourly Concentration of PM 2.5 (pm2.5) 

**IVs**:
- Dew Point (DEWP, in °C)
- Temperature (TEMP, in °C)
- Pressure (PRES, in hPa)
- Cumulated Wind Speed (Iws, in m/s)
- Cumulated Hours of Rain (Ir)
- Temporal factors (month)

<br>
<br>

Let's extract the relevant columns:

In [3]:

df = df_clean[['month', 'pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Ir']]

df.describe()


| | month | pm2.5 | DEWP | TEMP | PRES | Iws | Ir |
|---|---|---|---|---|---|---|---|
| count | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 |
| mean | 6.513758 | 98.613215 | 1.750174 | 12.401561 | 1016.442896 | 23.866747 | 0.194866 |
| std | 3.454199 | 92.050387 | 14.433658 | 12.175215 | 10.300733 | 49.617495 | 1.418165 |
| min | 1.0 | 0.0 | -40.0 | -19.0 | 991.0 | 0.45 | 0.0 |
| 25% | 4.0 | 29.0 | -10.0 | 2.0 | 1008.0 | 1.79 | 0.0 |
| 50% | 7.0 | 72.0 | 2.0 | 14.0 | 1016.0 | 5.37 | 0.0 |
| 75% | 10.0 | 137.0 | 15.0 | 23.0 | 1025.0 | 21.91 | 0.0 |
| max | 12.0 | 994.0 | 28.0 | 42.0 | 1046.0 | 565.49 | 36.0 |
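
One optional refinement to the column selection: adding `.copy()` makes it explicit that the subset is independent of the frame it was taken from, which can also avoid a `SettingWithCopyWarning` on later assignments in some pandas versions. A minimal sketch, using hypothetical `toy_clean` and `subset` frames with invented values:

```python
import pandas as pd

# Hypothetical stand-in for df_clean (values are invented)
toy_clean = pd.DataFrame({
    "month": [1, 2],
    "pm2.5": [129.0, 148.0],
    "Is": [0, 0],
})

# Select the columns of interest and take an explicit copy
subset = toy_clean[["month", "pm2.5"]].copy()

# Modifying the subset leaves the source frame untouched
subset["pm2.5"] = subset["pm2.5"] * 2

print(toy_clean["pm2.5"].tolist())
print(subset["pm2.5"].tolist())
```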


<br>
<br>

Next let's use the `rename` method to give the columns clearer variable names:

In [4]:
df = df.rename( columns={'pm2.5':'concentration','DEWP':'dew point', 'TEMP':'temperature','PRES':'pressure', 'Iws':'wind speed', 'Ir':'rainfall duration'})
numerical = ['concentration', 'dew point', 'temperature', 'pressure', 'wind speed', 'rainfall duration']

df[numerical].describe()

| | concentration | dew point | temperature | pressure | wind speed | rainfall duration |
|---|---|---|---|---|---|---|
| count | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 | 41757.0 |
| mean | 98.613215 | 1.750174 | 12.401561 | 1016.442896 | 23.866747 | 0.194866 |
| std | 92.050387 | 14.433658 | 12.175215 | 10.300733 | 49.617495 | 1.418165 |
| min | 0.0 | -40.0 | -19.0 | 991.0 | 0.45 | 0.0 |
| 25% | 29.0 | -10.0 | 2.0 | 1008.0 | 1.79 | 0.0 |
| 50% | 72.0 | 2.0 | 14.0 | 1016.0 | 5.37 | 0.0 |
| 75% | 137.0 | 15.0 | 23.0 | 1025.0 | 21.91 | 0.0 |
| max | 994.0 | 28.0 | 42.0 | 1046.0 | 565.49 | 36.0 |
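
A caveat when renaming: by default `rename` silently ignores keys that do not match any column, so a mistyped name (for example a lowercase `l` in place of the capital `I` in `Iws`) leaves the column unchanged without any error. The sketch below shows one way to catch this, assuming a reasonably recent pandas version that supports the `errors` argument of `rename`; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two of the original column names
toy = pd.DataFrame({"Iws": [1.79], "Ir": [0.0]})

# The key 'lws' (lowercase L) matches no column and is silently ignored
unchanged = toy.rename(columns={"lws": "wind speed"})
print(list(unchanged.columns))

# With errors='raise', the same typo produces a KeyError instead
try:
    toy.rename(columns={"lws": "wind speed"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```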
