# The datasets

This capstone project is based on two datasets: *NYC Taxi Travel* and *NYC Weather*. You can find more details on their origin in the [Introduction Notebook](./00.%20Introduction.ipynb).

Here is a description of their structure and some quick cleanup actions.

## NYC Taxi Travel Dataset

This dataset has 11 columns: 10 independent variables and one dependent variable.

The 10 independent variables will be part of the final dataset features, and the dependent one will be used to create my result vector: *km_per_hour*.

Here is a description of the variables.

### Independent variables

* id - a unique identifier for each trip.
* vendor_id - a code indicating the provider associated with the trip record.
* pickup_datetime - date and time when the meter was engaged.
* dropoff_datetime - date and time when the meter was disengaged.
* passenger_count - the number of passengers in the vehicle (driver entered value).
* pickup_longitude - the longitude where the meter was engaged.
* pickup_latitude - the latitude where the meter was engaged.
* dropoff_longitude - the longitude where the meter was disengaged.
* dropoff_latitude - the latitude where the meter was disengaged.
* store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server (Y = store and forward; N = not a store and forward trip).

### Dependent variable

* trip_duration - duration of the trip in seconds.


## NYC Weather Dataset

This dataset is made of 11'544 lines with 66 columns. Each line contains the daily values recorded by a particular weather station: temperature, rain, snow, wind, ...

The first step with this dataset is to extract the list of weather stations along with their locations. This weather station list will then be merged with the NYC Taxi Travel dataset, using the station coordinates to determine which station is nearest to each pickup and dropoff point.
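The nearest-station lookup can be sketched as follows. This is a minimal illustration, assuming a haversine (great-circle) distance is an acceptable way to compare coordinates; the function names, station identifiers and coordinates are hypothetical and not taken from the project Notebooks.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (station_id, distance_km) for the station closest to (lat, lon)."""
    return min(
        ((sid, haversine_km(lat, lon, s_lat, s_lon))
         for sid, (s_lat, s_lon) in stations.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical stations: identifiers and coordinates are illustrative only.
stations = {
    "STATION_A": (40.78, -73.97),
    "STATION_B": (40.64, -73.76),
}
pickup = (40.75, -73.99)
station_id, distance_km = nearest_station(pickup[0], pickup[1], stations)
```

The same call can be repeated for the dropoff point, so that each trip ends up with two station identifiers.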

Here is a description of the variables.

### Weather station static information
- STATION - Identification of the weather station
- NAME - The name of the weather station
- LATITUDE - Latitude of the weather station
- LONGITUDE - Longitude of the weather station
- ELEVATION - Elevation

### Independent variables
- DATE - Date when the measurements were taken (YYYY-MM-DD)
- AWND - Average wind speed
- DAPR - Number of days included in the multiday precipitation total (MDPR)
- DASF - Number of days included in the multiday snow fall total (MDSF) 
- MDPR - Multiday precipitation total (use with DAPR and DWPR, if available)
- MDSF - Multiday snowfall total 
- PGTM - Peak gust time
- PRCP - Precipitation
- PSUN - Daily percent of possible sunshine for the period
- SNOW - Snowfall
- SNWD - Snow depth
- TAVG - Average Temperature.
- TMAX - Maximum temperature
- TMIN - Minimum temperature
- TOBS - Temperature at the time of observation
- TSUN - Total sunshine for the period
- WDF2 - Direction of fastest 2-minute wind
- WDF5 - Direction of fastest 5-second wind
- WESD - Water equivalent of snow on the ground
- WESF - Water equivalent of snowfall
- WSF2 - Fastest 2-minute wind speed
- WSF5 - Fastest 5-second wind speed
- WT01 - Fog, ice fog, or freezing fog (may include heavy fog)
- WT02 - Heavy fog or heavy freezing fog (not always distinguished from fog)
- WT03 - Thunder
- WT04 - Ice pellets, sleet, snow pellets, or small hail
- WT05 - Hail (may include small hail)
- WT06 - Glaze or rime 
- WT08 - Smoke or haze 
- WT09 - Blowing or drifting snow
- WT11 - High or damaging winds



> Note: There is no dependent variable in this dataset, as all the columns will be used to add features to the NYC Taxi Travel dataset.

> Note 2: All the units in this dataset are metric.


# Table of contents

Data preparation is split into seven Notebooks.

## [The 83 Weather Stations](11.The%2083%20Weather%20Stations.ipynb)

In the NYC Weather Dataset, I found what I call *static weather station data*:
- STATION - Identification of the weather station
- NAME - The name of the weather station
- LATITUDE - Latitude of the weather station
- LONGITUDE - Longitude of the weather station
- ELEVATION - Elevation

This static data creates a lot of duplication in the dataset, as there are only 83 different weather stations in the whole dataset.

The goal of this Notebook is to create a small dataset that contains the static features of the weather stations.
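As a rough sketch of that extraction, assuming the weather data is loaded in a pandas DataFrame, the static columns can be deduplicated into their own table. The toy rows below are illustrative values, not real station records, and the variable names are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the weather dataset; values are illustrative only.
weather = pd.DataFrame({
    "STATION": ["S1", "S1", "S2", "S2"],
    "NAME": ["Station one", "Station one", "Station two", "Station two"],
    "LATITUDE": [40.78, 40.78, 40.64, 40.64],
    "LONGITUDE": [-73.97, -73.97, -73.76, -73.76],
    "ELEVATION": [42.7, 42.7, 3.4, 3.4],
    "DATE": ["2016-01-01", "2016-01-02", "2016-01-01", "2016-01-02"],
    "TMAX": [5.6, 4.4, 6.1, 5.0],
})

static_columns = ["STATION", "NAME", "LATITUDE", "LONGITUDE", "ELEVATION"]

# One row per station: the static columns repeat on every daily record.
stations = weather[static_columns].drop_duplicates().reset_index(drop=True)

# The daily measures keep only STATION as a key back to the stations table.
measures = weather.drop(columns=["NAME", "LATITUDE", "LONGITUDE", "ELEVATION"])
```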

## [NYC Taxi Travel Data Preparation](12.NYC%20Taxi%20Travel%20Data%20Preparation.ipynb)

This Notebook prepares the data grabbed from Kaggle, containing the taxi travel information. Most of the work here is to remove useless columns and outliers, and to prepare the data for the rest of the project.

As this dataset was already clean when I found it, this work will be quite straightforward.

One question you may have when reading the [NYC Taxi Travel Data Preparation](12.NYC%20Taxi%20Travel%20Data%20Preparation.ipynb) Notebook is how I chose the latitude/longitude values used to remove some pickup and dropoff points from the project. Looking at the weather station locations available in the NYC Weather Dataset, I decided to match the travel locations with the weather station ones.

We'll see that this approach does not remove many lines from the NYC Taxi Travel Dataset (fewer than 6'000 out of ~1'500'000), and that it makes merging the two datasets more efficient. As I would like to match each pickup and dropoff location to the nearest weather station, removing travel locations far away from any station makes sense.
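One possible way to express this filter, shown here only as a sketch: I assume the limits are the bounding box of the station coordinates, which may differ from the exact values chosen in the Notebook. All coordinates below are made up for the example.

```python
import pandas as pd

# Illustrative station coordinates and trips; not real project data.
stations = pd.DataFrame({
    "LATITUDE": [40.60, 40.90],
    "LONGITUDE": [-74.10, -73.70],
})
trips = pd.DataFrame({
    "pickup_latitude": [40.75, 41.50, 40.70],
    "pickup_longitude": [-73.99, -73.90, -73.95],
    "dropoff_latitude": [40.76, 40.80, 40.20],
    "dropoff_longitude": [-73.98, -73.95, -73.95],
})

# Assumption: the allowed area is the bounding box covered by the stations.
lat_min, lat_max = stations["LATITUDE"].min(), stations["LATITUDE"].max()
lon_min, lon_max = stations["LONGITUDE"].min(), stations["LONGITUDE"].max()

# Keep a trip only if both its pickup and dropoff fall inside the box.
inside = (
    trips["pickup_latitude"].between(lat_min, lat_max)
    & trips["pickup_longitude"].between(lon_min, lon_max)
    & trips["dropoff_latitude"].between(lat_min, lat_max)
    & trips["dropoff_longitude"].between(lon_min, lon_max)
)
filtered = trips[inside]
```

In this toy example, the second trip is dropped because of its pickup latitude and the third because of its dropoff latitude.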

## [NYC Weather Data Preparation](13.NYC%20Weather%20Data%20Preparation.ipynb)

This Notebook contains the data preparation process of the independent features of the NYC Weather Dataset.

After cleaning up the dataset (dropping some useless columns), most of the work will focus on creating two datasets from this one:

- Weather Categorical dataset

  This dataset will be built using features that are defined as categories: wind direction, fog, peak gust, ...

  We'll see later in this project that this dataset will be grouped by day instead of by weather station location.

- Weather Numerical dataset

  This second dataset will contain all the numerical values of the measurements taken by the stations: temperature, precipitation, snow, ...

  This dataset will be grouped by weather station and date, and we will discover that not all the data is available for each station/day. We will see how to solve this.
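The split itself could look like the sketch below. The rule used to pick the categorical columns (weather-type flags `WT*`, wind directions `WDF*` and the peak gust time `PGTM`) is my assumption based on the examples given above; the Notebook may select them differently. The rows are illustrative values.

```python
import pandas as pd

# Toy rows with a few of the weather columns; values are illustrative only.
weather = pd.DataFrame({
    "STATION": ["S1", "S2"],
    "DATE": ["2016-01-01", "2016-01-01"],
    "TMAX": [5.6, 6.1],
    "PRCP": [0.0, 1.3],
    "WDF2": [270.0, 90.0],
    "WT01": [1.0, None],
    "PGTM": [None, 1530.0],
})

keys = ["STATION", "DATE"]

# Assumption: WT* flags, WDF* wind directions and PGTM are treated as categories.
categorical_columns = [
    c for c in weather.columns
    if c.startswith(("WT", "WDF")) or c == "PGTM"
]
# Everything else that is not a key is treated as a numerical measure.
numerical_columns = [
    c for c in weather.columns
    if c not in keys + categorical_columns
]

weather_categorical = weather[keys + categorical_columns]
weather_numerical = weather[keys + numerical_columns]
```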

## [NYC Taxi Travel Dataset Feature Engineering](14.NYC%20Taxi%20Travel%20Dataset%20Feature%20Engineering.ipynb)

With this third dataset, which is the biggest one, I will try to engineer some interesting features, using the static data from the 83 weather stations:
- Distance from nearest weather station
- Distance in kilometers between pickup and dropoff locations
- ...

It's in this Notebook that I will build my result vector: *km_per_hour*
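The result vector follows directly from the trip distance and the `trip_duration` column (in seconds). Here is a minimal sketch, assuming the distance is the one computed between the pickup and dropoff locations; the function name and the handling of non-positive durations are my own choices for the example.

```python
def km_per_hour(distance_km, trip_duration_s):
    """Average speed of a trip, from its distance (km) and duration (seconds)."""
    if trip_duration_s <= 0:
        # A zero or negative duration cannot give a meaningful speed.
        raise ValueError("trip_duration_s must be positive")
    return distance_km / (trip_duration_s / 3600.0)


# A 5 km trip lasting 900 seconds (15 minutes).
speed = km_per_hour(5.0, 900)
```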

## [NYC Weather Categorical Dataset Feature Engineering](15.NYC%20Weather%20Categorical%20Dataset%20Feature%20Engineering.ipynb)

The *Weather Categorical* dataset will be grouped by day in this Notebook. This will produce a dataset with 182 lines, which is the number of days between the 1st of January 2016 and the 30th of June 2016.

The reason for this approach will be explained in the Notebook.
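A grouping by day might be sketched like this. I assume here that a weather-type flag is set for a day as soon as at least one station reported it; this aggregation rule is a guess for illustration, and the Notebook explains the rule actually used. The rows are toy values.

```python
import pandas as pd

# Toy categorical rows: a flag is 1.0 when reported, missing otherwise.
weather_categorical = pd.DataFrame({
    "STATION": ["S1", "S2", "S1", "S2"],
    "DATE": ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"],
    "WT01": [1.0, None, None, None],  # fog
    "WT03": [None, None, None, 1.0],  # thunder
})

flag_columns = ["WT01", "WT03"]

# Assumption: a flag is true for the day if any station reported it.
daily = (
    weather_categorical[["DATE"] + flag_columns]
    .fillna(0)
    .groupby("DATE")[flag_columns]
    .max()
    .astype(int)
    .reset_index()
)
```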

## [NYC Weather Numerical Dataset Feature Engineering](16.NYC%20Weather%20Numerical%20Dataset%20Feature%20Engineering.ipynb)

In this Notebook, the *Weather Numerical* dataset, before being grouped by day and by weather station, will be extended and its missing values extrapolated.

- Extended: to add the missing (day, weather station) tuples.

- Extrapolated: to fill a weather station's NaN values using the average of the values from the *n* nearest weather stations with non-null values.

The whole extension and extrapolation process will be described in this Notebook.

This will produce a dataset with 15'106 lines, which is the number of days (182) multiplied by the number of weather stations (83).
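The two steps can be sketched on a toy example with 3 stations and 2 days. This is only an illustration under my own assumptions: the helper name `fill_from_nearest` is hypothetical, stations are ranked with a simple flat-earth approximation of the distance, and only original (non-filled) values are used as neighbours. The Notebook describes the process actually applied.

```python
import numpy as np
import pandas as pd

# Toy data: illustrative coordinates and temperatures only.
stations = pd.DataFrame({
    "STATION": ["S1", "S2", "S3"],
    "LATITUDE": [40.70, 40.71, 40.90],
    "LONGITUDE": [-74.00, -74.00, -73.80],
})
numerical = pd.DataFrame({
    "STATION": ["S1", "S2", "S3", "S1"],
    "DATE": ["2016-01-01", "2016-01-01", "2016-01-01", "2016-01-02"],
    "TAVG": [2.0, np.nan, 6.0, 3.0],
})

# Extended: one row per (DATE, STATION) pair, with NaN where nothing was recorded.
full_index = pd.MultiIndex.from_product(
    [sorted(numerical["DATE"].unique()), stations["STATION"]],
    names=["DATE", "STATION"],
)
extended = numerical.set_index(["DATE", "STATION"]).reindex(full_index)


def fill_from_nearest(extended, stations, column, n=2):
    """Fill NaN with the mean of the n nearest stations having a value that day."""
    coords = stations.set_index("STATION")[["LATITUDE", "LONGITUDE"]]
    wide = extended[column].unstack("STATION")  # rows: DATE, columns: STATION
    filled = wide.copy()
    for station in wide.columns:
        lat, lon = coords.loc[station]
        # Flat-earth approximation, only used to rank the other stations.
        d_lat = coords["LATITUDE"] - lat
        d_lon = (coords["LONGITUDE"] - lon) * np.cos(np.radians(lat))
        order = (d_lat ** 2 + d_lon ** 2).drop(station).sort_values().index
        for date in wide.index[wide[station].isna()]:
            neighbours = wide.loc[date, order].dropna().head(n)
            if not neighbours.empty:
                filled.loc[date, station] = neighbours.mean()
    return filled.stack().rename(column).reset_index()


filled = fill_from_nearest(extended, stations, "TAVG", n=2)
```

With the real data, the extended index has 182 x 83 = 15'106 rows, which matches the size given above.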


## [The global Dataset - Merging all the datasets into a big one](17.The%20global%20Dataset%20-%20Merging%20all%20the%20datasets%20into%20a%20big%20one.ipynb)

After all this cleaning and feature engineering on the two original datasets, it will be time to merge all the datasets produced in the previous Notebooks:

- Taxi Travel dataset
- Weather Stations Dataset
- Weather Categorical Dataset
- Weather Numerical Dataset

This will result in a *full* dataset ready for the ML training process.


# Overview of the created datasets

At the end of data cleaning and feature engineering, I will obtain four datasets, plus a full dataset that merges them:

1. *stations* dataset

A dataset with all the static information about the weather stations, such as elevation, latitude, and name.

2. *travel* dataset

A *feature engineered* version of the *NYC Taxi Travel dataset*, with features such as pickup_datetime and dropoff_location.

3. *weather categorical* dataset

Categorical features per day, *engineered* from the NYC Weather dataset, such as snowfall and fog.

4. *weather numerical* dataset

Numerical features per day and per weather station, *engineered* from the NYC Weather dataset, such as average temperature and quantity of snow on the road.

5. Full dataset

This dataset is a merge of the *numerical* and *categorical* features of the four previous datasets, ready to be used for ML training.

> Note: The construction and description of these datasets will be detailed in the following Notebooks.

# Techniques

To merge the data, I decided to use an *sqlite* database and *INNER JOIN* queries to combine the datasets.

This technique will be used in the [The global Dataset - Merging all the datasets into a big one](17.The%20global%20Dataset%20-%20Merging%20all%20the%20datasets%20into%20a%20big%20one.ipynb) Notebook to build the *Full* dataset.
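To give an idea of the approach, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, column names and values are hypothetical and simplified; the real schema is built in the merging Notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema for the example.
conn.executescript("""
CREATE TABLE travel (id TEXT, pickup_date TEXT, pickup_station TEXT, km_per_hour REAL);
CREATE TABLE weather_numerical (station TEXT, date TEXT, tavg REAL);
CREATE TABLE weather_categorical (date TEXT, fog INTEGER);
""")

conn.executemany("INSERT INTO travel VALUES (?, ?, ?, ?)", [
    ("trip1", "2016-01-01", "S1", 18.5),
    ("trip2", "2016-01-02", "S2", 22.0),
    ("trip3", "2016-01-03", "S1", 15.0),  # no weather rows for this day
])
conn.executemany("INSERT INTO weather_numerical VALUES (?, ?, ?)", [
    ("S1", "2016-01-01", 2.0),
    ("S2", "2016-01-02", 3.5),
])
conn.executemany("INSERT INTO weather_categorical VALUES (?, ?)", [
    ("2016-01-01", 1),
    ("2016-01-02", 0),
])

# Numerical weather joins on (station, day); categorical weather joins on day only.
rows = conn.execute("""
SELECT t.id, t.km_per_hour, n.tavg, c.fog
FROM travel AS t
INNER JOIN weather_numerical AS n
        ON n.station = t.pickup_station AND n.date = t.pickup_date
INNER JOIN weather_categorical AS c
        ON c.date = t.pickup_date
ORDER BY t.id
""").fetchall()

conn.close()
```

Because these are inner joins, a trip without matching weather rows (like `trip3` here) does not appear in the result, which is one reason the numerical dataset is extended to every (day, station) pair beforehand.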

Stay tuned ;-)

# Let's go

Time to go to the first data preparation Notebook: [The 83 Weather Stations](11.The%2083%20Weather%20Stations.ipynb) :-)