## Predicting Street-Level Air Pollution in French Cities

### Problem Statement
Air pollution is now a major issue, not only in countries such as China or India but also in developed countries. The World Health Organization estimates that poor air quality (indoor and outdoor combined) causes close to 8 million deaths worldwide, while the European Environment Agency counted almost 450,000 premature deaths in Europe from the same cause.

To monitor air quality, institutions and agencies focus on four main pollutants:
- ozone (O3), 
- nitrogen dioxide (NO2),
- particulate matter with diameter below 2.5 μm (PM2.5),
- particulate matter with diameter below 10 μm (PM10).

Although air pollution is regulated by national and European standards, evaluating one's real exposure at street level remains a challenge. Air quality monitoring relies on expensive reference instruments, ideally placed at strategic locations within cities. However, because pollutants differ in their sources, volatility and chemical reactivity, concentrations can vary widely between areas that are very close together, and one's exposure varies accordingly.


### Introduction to Plume Labs
Plume Labs (http://www.plumelabs.com) is a French start-up that aims to inform citizens about their exposure to air pollution through the use of open data. This information is already available for a growing number of large cities through mobile applications, but the final goal of Plume Labs is to provide pollution levels at any point on the globe. To achieve this, physical models and artificial intelligence tools are used together.

We propose to predict air quality at street level in several French cities. To reach this goal, we rely on an increasingly popular model in the atmospheric science community, the **land use regression model (LURM)**. By matching relevant land-use parameters (e.g., the surface occupied by residential buildings within a given perimeter) and meteorological parameters with monitoring station readings, it is possible to build a statistical model that predicts air quality levels. Such a model can then be applied at other locations within the cities to construct a street-level air pollution map. Examples can be found in the scientific literature:

http://www.sciencedirect.com/science/article/pii/S1877705815016331

http://www.sciencedirect.com/science/article/pii/S001393511530178X
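
As a rough illustration of the land use regression idea described above, the sketch below fits a plain linear model on synthetic data. It is not the approach of the cited papers or of Plume Labs: the three predictors, their ranges, and the coefficients used to generate the toy targets are all made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Synthetic stand-ins for the predictors (all values are invented for illustration).
n_samples = 200
road_length_m = rng.uniform(0, 5000, n_samples)    # e.g. road length within a buffer
residential_m2 = rng.uniform(0, 2e5, n_samples)    # e.g. residential surface within a buffer
temperature = rng.uniform(-5, 30, n_samples)       # a meteorological covariate

# Hypothetical linear relationship plus noise, used only to generate toy targets.
concentration = (
    10.0
    + 0.004 * road_length_m
    + 0.00005 * residential_m2
    - 0.2 * temperature
    + rng.normal(0, 1.0, n_samples)
)

# Fit the regression on the "monitored" samples.
X = np.column_stack([road_length_m, residential_m2, temperature])
model = LinearRegression().fit(X, concentration)

# Predict at a new, unmonitored location described by the same three predictors.
new_location = np.array([[2500.0, 1e5, 15.0]])
predicted = model.predict(new_location)
```

Repeating the prediction over a grid of locations is what would produce a street-level map; in practice a more flexible regressor may be preferable to a purely linear one.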

### Objectives
The goal of the challenge is to predict the concentrations of three pollutants (NO2, PM2_5, PM10) at the locations of monitoring stations in several cities and for given time periods, by focusing the training algorithms on:
- the regular structure of the time series (strong daily and seasonal cycles)
- the correlation between pollution and meteorology (temperature, precipitation, ...)
- the correlation with static factors (e.g., length of roads within a buffer of x meters, density of residential/industrial/... buildings; see below for a complete description of the dataset)
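
One possible way to expose the daily cycle mentioned above to a model is a sine/cosine encoding of the hourly time index. The sketch below assumes that `daytime` is an integer hourly counter, so that `daytime % 24` gives a position within the day up to an unknown offset; the helper name and the output column names `hour_sin` and `hour_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

def add_cyclic_hour_features(df, time_col="daytime", period=24):
    """Return a copy of df with sine/cosine encodings of the position within a cycle."""
    out = df.copy()
    # Map the position within the cycle to an angle in [0, 2*pi).
    phase = 2.0 * np.pi * (out[time_col] % period) / period
    out["hour_sin"] = np.sin(phase)
    out["hour_cos"] = np.cos(phase)
    return out

# Toy hourly index: rows 0 and 3 are exactly one period apart.
toy = pd.DataFrame({"daytime": [0, 6, 12, 24]})
encoded = add_cyclic_hour_features(toy)
```

Because a linear combination of the sine and cosine terms can represent any phase shift, the unknown offset between `daytime` and the actual clock hour does not prevent a model from fitting the daily pattern. The same helper with a longer `period` could be used for a seasonal cycle.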

### Requirements

* numpy>=1.10.0  
* matplotlib>=1.5.0  
* pandas>=0.17.0  
* scikit-learn>=0.17 (different syntaxes for v0.17 and v0.18)  
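
The text does not say which syntax differs between the two scikit-learn versions. One well-known change is that the splitting utilities moved from `sklearn.cross_validation` to `sklearn.model_selection` in v0.18, so an import along these lines is one way to stay compatible with both. The toy arrays are placeholders.

```python
try:
    # scikit-learn >= 0.18
    from sklearn.model_selection import train_test_split
except ImportError:
    # scikit-learn 0.17
    from sklearn.cross_validation import train_test_split

import numpy as np

# Placeholder data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```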

### Data Description
Data sets are provided in CSV files. 

#### Train datasets

In the Xtrain dataset, we provide :
- ID: row ID of the dataset (from 0 to 448168)
- daytime: arbitrary value that describes the order of the data in time, on an hourly basis
- zone_id: values range from 0 to 5; each value identifies a city
- station_id: values are in [ 16., 17., 20., 1., 18., 22., 26., 28., 6., 9., 25., 4., 10., 23., 5., 8., 11.]. Each station belongs to a single zone_id; several stations can belong to the same city.
- pollutant name
- is_calmday (boolean): type of day; essentially distinguishes weekdays from weekends and public holidays
- meteorological parameters:
    - temperature (double)
    - windspeed (double)
    - windbearing_cos (double)
    - windbearing_sin (double)
    - cloudcover (double)
    - precipitations_intensity (double)
    - precipitations_probability (double)
    - pressure (double)
- Static variables: a buffer is a circle of a given diameter drawn around a position. For example, HLRES_50 is the cumulative surface of high-density residential land within a 50 m diameter circle.
    - HLRES: high-density residential land (m2) - buffers of 50, 100, 300, 500, 1000 m
    - HLDRES: HLRES + low-density residential land (m2) - buffers of 50, 100, 300, 500, 1000 m
    - INDUSTRY: industrial land (m2) - buffer of 1000 m
    - PORT: port land (m2) - buffer of 5000 m
    - NATURAL: semi-natural and forested land (m2) - buffer of 5000 m
    - GREEN: urban parks and green areas + NATURAL (m2) - buffer of 5000 m
    - ROUTE: cumulative road length within the buffer (m) - buffers of 100, 300, 500, 1000 m
    - ROADINVDIST: inverse of the distance between the station and the nearest road (1/m)
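
To make the buffer definition concrete, the sketch below computes the area of a buffer from its diameter and expresses a land-use surface as a fraction of that area. It follows the definition given above (the buffer size is a diameter); the helper names and the surface value passed in are invented for illustration.

```python
import math

def buffer_area_m2(diameter_m):
    """Area of a circular buffer, given its diameter in meters."""
    return math.pi * (diameter_m / 2.0) ** 2

def covered_fraction(surface_m2, diameter_m):
    """Share of the buffer occupied by a land-use surface given in m2."""
    return surface_m2 / buffer_area_m2(diameter_m)

# A 50 m diameter buffer covers roughly 1963.5 m2.
area_50 = buffer_area_m2(50)

# Made-up HLRES_50 value, chosen to be about half of the buffer area.
fraction = covered_fraction(981.75, 50)
```

Normalizing surfaces this way puts buffers of different sizes on a comparable 0 to 1 scale, which can help some models.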

Important: when no data is provided, it means that:
- for land use data, no land use of that type is encompassed within the buffer
- for meteorological parameters, no data is available
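
The two cases above call for different treatment: a missing land-use value can be read as zero, while a missing meteorological value is genuinely unknown. A minimal sketch follows, assuming the land-use columns are named with the variable as a prefix, as in HLRES_50. Apart from HLRES_50, the column names and the helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Prefixes of the buffer-based land-use variables described above.
LAND_USE_PREFIXES = ("HLRES", "HLDRES", "INDUSTRY", "PORT", "NATURAL", "GREEN", "ROUTE")

def fill_missing_land_use(df, prefixes=LAND_USE_PREFIXES):
    """Replace missing land-use values with 0 and leave all other columns untouched."""
    out = df.copy()
    land_use_cols = [c for c in out.columns if c.split("_")[0] in prefixes]
    out[land_use_cols] = out[land_use_cols].fillna(0.0)
    return out

toy = pd.DataFrame({
    "HLRES_50": [120.0, np.nan],
    "ROUTE_100": [np.nan, 85.0],    # hypothetical column name
    "temperature": [np.nan, 12.5],  # meteorological: stays missing
})
cleaned = fill_missing_land_use(toy)
```

The remaining meteorological gaps still have to be handled separately, for example by imputation or by a model that accepts missing values.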

The Ytrain dataset is composed of:
- ID: row ID to match the Xtrain dataset
- TARGET: pollutant concentration in μg/m3
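
Since the features and the targets are stored separately and linked by ID, they can be joined on that column before training. The sketch below uses two tiny made-up frames in place of the real files, whose names are not assumed here.

```python
import pandas as pd

# Toy stand-ins for Xtrain and Ytrain; the values are invented.
x_train = pd.DataFrame({"ID": [0, 1, 2], "temperature": [10.0, 11.5, 9.0]})
y_train = pd.DataFrame({"ID": [2, 0, 1], "TARGET": [30.0, 12.0, 18.0]})

# Join on ID so that each feature row gets its matching target,
# regardless of the row order in the two files.
train = x_train.merge(y_train, on="ID", how="inner")

# Every feature row should have found exactly one target.
assert len(train) == len(x_train)
```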

#### Test datasets
In the Xtest dataset, we have:
- pollutant name
- meteorological parameters: same description and values as in Xtrain
- zone_id (same as in Xtrain)
- station_id: values are in [ 21., 27., 1., 29., 15., 12., 14., 0., 3., 2., 13., 19.]. These values differ from those in Xtrain, meaning that the stations are not the same as those used in Xtrain
- static variables: same description as in Xtrain but different values
- daytime: same description and values as in Xtrain
- is_calmday

Our goal is to provide a Ytest dataset with 2 columns:
- ID: row ID that matches the Xtest row IDs
- TARGET: model prediction for the given row
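
A two-column output of this shape could be assembled as sketched below. The helper name, the IDs and the predictions are invented, and the exact file format expected (separator, file name) is not specified in the text, so the CSV is written to an in-memory buffer here.

```python
import io
import pandas as pd

def build_submission(ids, predictions):
    """Assemble the ID/TARGET table, keeping the column order fixed."""
    return pd.DataFrame(
        {"ID": list(ids), "TARGET": list(predictions)},
        columns=["ID", "TARGET"],
    )

# Made-up row IDs and predicted concentrations.
submission = build_submission([0, 1, 2], [21.3, 18.7, 40.1])

# Write without the pandas index so that only ID and TARGET appear.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```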

In [None]:
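import pandas as pd

# Load the gzip-compressed training file, then separate the features from the 'isSkewed' column.
data = pd.read_csv('public_train.csv.gz', compression='gzip')
X_df = data.drop(['isSkewed'], axis=1)
Y_df = data[['isSkewed']]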