In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as ply
import seaborn as sns

In [2]:
pollution = pd.read_excel('PM2.5climate.xlsx')
pollution.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


The top 5 rows show that the target variable (pm2.5) contains NaN (missing) values. This tells us to check for missing values across the dataset, especially in the target variable.
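One way to confirm this beyond the first five rows is to count missing values per column. The sketch below uses a small stand-in frame named `sample` (a hypothetical name; its three rows are copied from the outputs shown in this notebook) so that it runs without the Excel file. In the notebook, the same calls would presumably be made on `pollution`.

```python
import numpy as np
import pandas as pd

# Stand-in for `pollution`: two rows from the head and one from the tail
# of the displayed output, restricted to a few columns.
sample = pd.DataFrame({
    "No": [1, 2, 43820],
    "pm2.5": [np.nan, np.nan, 8.0],
    "TEMP": [-11.0, -12.0, -2.0],
    "cbwd": ["NW", "NW", "NW"],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)

# Fraction of rows whose target value is missing
missing_share = sample["pm2.5"].isna().mean()
print(missing_share)
```

On the stand-in frame only pm2.5 has missing entries; on the full data the counts will of course differ.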

---

## Project Description

The data records hourly PM2.5 pollution levels alongside several time and weather features. There are 43,824 observations and 13 columns (including the target variable). We intend to fit a regression model on this dataset that predicts the pollution level (PM2.5 level) from the other features.

The features are as follows:
1) No: row number (Quantitative)

2) Year: year of data (Quantitative)

3) Month: month of data (Quantitative)

4) Day: day of the data (Quantitative)

5) Hour: hour of data (Quantitative)

6) PM2.5: PM2.5 concentration in micrograms per cubic meter of air (Quantitative)

7) DEWP: dew point, the temperature below which water vapor begins to condense; it varies with pressure and humidity (Quantitative)

8) TEMP: temperature (Quantitative)

9) PRES: air pressure in hectopascals (hPa); the observed values range from 991 to 1046 (Quantitative)

10) cbwd: combined wind direction, stored as text labels such as NW (Categorical)

11) Iws: cumulated wind speed (Quantitative)

12) Is: cumulated hours of snow (Quantitative)

13) Ir: cumulated hours of rain (Quantitative)

The dataset is already preprocessed to some degree (it is not raw data). According to the description, the target variable has some missing values, which will be handled during data cleaning.
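The notebook does not say at this point how the missing targets will be handled. One common option for a regression target is to drop the affected rows rather than impute them; the sketch below shows that option only as an assumption, on a hypothetical stand-in frame `sample` whose rows are copied from the outputs above.

```python
import numpy as np
import pandas as pd

# Stand-in for `pollution`: two rows with a missing target, two without.
sample = pd.DataFrame({
    "pm2.5": [np.nan, np.nan, 8.0, 10.0],
    "DEWP": [-21, -21, -23, -22],
    "TEMP": [-11.0, -12.0, -2.0, -3.0],
})

# Keep only rows where the target is present, then renumber the index
cleaned = sample.dropna(subset=["pm2.5"]).reset_index(drop=True)
print(len(sample), "->", len(cleaned))
```

Restricting `dropna` to the target column through `subset` leaves rows intact when only a feature is missing, so those cases can be handled separately.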

---

## Basic Exploration

In [3]:
pollution

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
43819,43820,2014,12,31,19,8.0,-23,-2.0,1034.0,NW,231.97,0,0
43820,43821,2014,12,31,20,10.0,-22,-3.0,1034.0,NW,237.78,0,0
43821,43822,2014,12,31,21,10.0,-22,-3.0,1034.0,NW,242.70,0,0
43822,43823,2014,12,31,22,8.0,-22,-4.0,1034.0,NW,246.72,0,0


In [4]:
pollution.shape

(43824, 13)

There are 43,824 rows and 13 columns, as described above. This is a reasonably large dataset for a regression task.

In [5]:
pollution.describe()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir
count,43824.0,43824.0,43824.0,43824.0,43824.0,41757.0,43824.0,43824.0,43824.0,43824.0,43824.0,43824.0
mean,21912.5,2012.0,6.523549,15.72782,11.5,98.613215,1.817246,12.448521,1016.447654,23.88914,0.052734,0.194916
std,12651.043435,1.413842,3.448572,8.799425,6.922266,92.050387,14.43344,12.198613,10.268698,50.010635,0.760375,1.415867
min,1.0,2010.0,1.0,1.0,0.0,0.0,-40.0,-19.0,991.0,0.45,0.0,0.0
25%,10956.75,2011.0,4.0,8.0,5.75,29.0,-10.0,2.0,1008.0,1.79,0.0,0.0
50%,21912.5,2012.0,7.0,16.0,11.5,72.0,2.0,14.0,1016.0,5.37,0.0,0.0
75%,32868.25,2013.0,10.0,23.0,17.25,137.0,15.0,23.0,1025.0,21.91,0.0,0.0
max,43824.0,2014.0,12.0,31.0,23.0,994.0,28.0,42.0,1046.0,585.6,27.0,36.0


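describe() summarizes only the numerical columns, so the categorical cbwd column is absent from this output. The pm2.5 count is 41,757 rather than 43,824, which means 2,067 target values are missing.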
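To check which columns a default `describe()` call covers, the dtypes can be inspected directly. The following is a minimal sketch on a hypothetical stand-in frame `sample` (rows copied from the tail of the output above); on a frame with mixed dtypes, `describe()` reports only the numeric columns unless `include="all"` is passed.

```python
import pandas as pd

# Stand-in for `pollution` with three numeric columns and one text column
sample = pd.DataFrame({
    "pm2.5": [8.0, 10.0, 10.0, 8.0],
    "TEMP": [-2.0, -3.0, -3.0, -4.0],
    "cbwd": ["NW", "NW", "NW", "NW"],
    "Iws": [231.97, 237.78, 242.70, 246.72],
})

# Split the columns by whether their dtype is numeric
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)

# The default summary covers only the numeric columns
default_summary = sample.describe()

# include="all" adds count/unique/top/freq rows for non-numeric columns
full_summary = sample.describe(include="all")
print(full_summary["cbwd"])
```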

The column 'No' does not have any real implications since it gives the row number (which can be provided by the index), hence it will probably be removed.

The year column is interesting because describe() reports a "mean" year, which is not meaningful in this context. We might decide to treat the variable as categorical or to create dummy variables for it.
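As an illustration of the dummy-variable option, the sketch below encodes year with `pd.get_dummies`. The frame `sample` and its values are hypothetical and chosen only to show the mechanics; `drop_first=True` is an assumption that suits linear regression with an intercept, where keeping every level would make the dummies perfectly collinear.

```python
import pandas as pd

# Hypothetical stand-in with three distinct years
sample = pd.DataFrame({
    "year": [2010, 2010, 2012, 2014],
    "TEMP": [-11.0, -12.0, 14.0, -2.0],
})

# One indicator column per year, dropping the first level (2010) as baseline
year_dummies = pd.get_dummies(sample["year"], prefix="year", drop_first=True)

# Replace the original year column with the indicator columns
encoded = pd.concat([sample.drop(columns="year"), year_dummies], axis=1)
print(encoded.columns.tolist())
```

Rows from the baseline year are then represented by zeros (or False) in every indicator column.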

Similarly, the month variable 