# Import packages
A package (or library) bundles functions that are useful in a particular context. Here we import some packages that are commonly used in data science projects:

- pandas: to manipulate datasets
- numpy: to apply mathematical functions
- matplotlib: to display graphs
- sklearn: contains a lot of machine learning functions and models

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# import model
from sklearn.linear_model import LinearRegression

from IPython.display import display
pd.options.display.max_columns = None

# Import data
In this part, we use the pandas library to load the data into our notebook. The data is then stored in a dataframe called *dataset_bike*.

In [None]:
# load data into a dataframe
dataset_bike = pd.read_csv("###CODE HERE###",
                           sep = ',',
                           header=0,
                           skip_blank_lines=True,
                           index_col=0)

In [None]:
# print the shape of this dataframe: (nb of rows, nb of columns)
print("The shape of the dataframe is :", "###CODE HERE###".shape)

In [None]:
# show the first rows of the dataframe
dataset_bike."###CODE HERE###"

The goal of this project is to accurately predict the number of bikes that are rented at a given hour. This is the variable **cnt**. To do that, we have access to 2 years of records, including several features such as:
- meteorological measures (temp, atemp, hum, weathersit, windspeed)
- datetime information (hr, holiday, weekday, yr, mth..)

**GOAL raised by your client**: Your model must be able to predict the hourly demand on random samples with a MAE **smaller than 30 bikes**...

MAE: Mean Absolute Error

$$MAE=\frac{1}{n}\sum_{i=1}^n|pred_i- target_i|$$

With:
- $n$: number of samples
- $pred_i$: predicted demand for sample i
- $target_i$: real demand for sample i
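
As a quick sanity check of the formula, here is a minimal sketch on made-up numbers. The arrays `pred` and `target` are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Hypothetical predicted and real hourly demands (not from the dataset)
pred = np.array([120.0, 80.0, 45.0, 200.0])
target = np.array([100.0, 90.0, 50.0, 230.0])

# MAE: average of the absolute differences between prediction and target
mae = np.mean(np.abs(pred - target))
print("MAE =", mae)
```

The absolute errors here are 20, 10, 5 and 30 bikes, so the MAE is their average. The same quantity is what `mean_absolute_error` from scikit-learn computes later in this notebook.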


In [None]:
# get a basic statistical description of your data
dataset_bike.describe()

In [None]:
# plot distributions for temp, atemp, hum and windspeed
for variable in ["temp", "atemp", "hum", "windspeed"]:
    dataset_bike[variable].plot.density(legend=True, figsize=(20,10))

Notice that *windspeed* is often equal to 0. Let's assume that these zeros are actually missing measurements. We will replace them later.

# Preprocessing
In this part, we will prepare our dataframe so it can be ingested by a machine learning model. The main idea is:
- to get rid of or replace missing values: this is called **missing values imputation**
- to keep only numerical variables. Categorical variables have to be encoded: this is called **dummification**
- if possible try to create new variables from the ones that already exist. These new variables can improve predictions if they bring *signal*: this is called **feature engineering**
- to split our dataframe in 2 parts. A first one will be used for **training** our model, the other one will be for **testing the performance** of this model

## Imputation of missing values
Here we will assume that if the *windspeed* = 0 this means that the measure is actually missing.
What do we do with these missing values? Several options are possible, you can either:
- delete samples that contain missing values
- replace missing values by the average or median value of the series
- use more sophisticated operations (like training another model in order to predict the missing windspeed)

I suggest simply replacing the zeros by the average *windspeed* measure.
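
The first two options can be sketched on a toy series. This is only an illustration: `toy_windspeed` and its values are made up, and the exercise below still has to be completed on the real dataframe.

```python
import pandas as pd

# Toy series standing in for the windspeed column (values are made up)
toy_windspeed = pd.Series([0.0, 0.2, 0.4, 0.0, 0.3])

# Option 1: drop the samples where the measure is considered missing
dropped = toy_windspeed[toy_windspeed != 0]

# Option 2: replace the zeros by the mean of the series
mean_value = toy_windspeed.mean()
imputed = toy_windspeed.copy()
imputed.loc[imputed == 0] = mean_value

print(dropped.tolist())
print(imputed.tolist())
```

Note that the mean above is computed over the whole series, zeros included, which pulls it downwards. Computing it on the non-zero values only (`toy_windspeed[toy_windspeed != 0].mean()`) is a possible variant.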

In [None]:
dataset_bike_missing_values = dataset_bike.copy()

# Compute average windspeed
mean_windspeed = dataset_bike["###CODE HERE###"].mean()

In [None]:
# Assign this value to windspeed when windspeed == 0
dataset_bike_missing_values.loc[dataset_bike_missing_values["windspeed"] == "###CODE HERE###", "windspeed"] = mean_windspeed

In [None]:
# Plot again the distribution to check that there is no windspeed set to 0
dataset_bike_missing_values["windspeed"].plot.density(legend=True, figsize=(20,10))

## Feature engineering
In this part, we will add new columns to our dataset, and hope that it will bring signal to our model!

In [None]:
dataset_bike_feature_engineering = dataset_bike_missing_values.copy()

### New variable 1: nb of days since 1st record
Let's add a new column to our dataframe that contains the number of days elapsed since the first recorded day.

In [None]:
# First we need to tell pandas that the dteday column contains dates
dataset_bike_feature_engineering["dteday"] = pd.to_datetime(dataset_bike_feature_engineering["dteday"])

In [None]:
# get the value of day 1
min_date = min(dataset_bike_feature_engineering["dteday"])

# create the new column
dataset_bike_feature_engineering["nb_days_since_1st_day"] = (dataset_bike_feature_engineering["dteday"] - "###CODE HERE###").dt.days.astype("int")

In [None]:
# display the last rows of your dataframe
dataset_bike_feature_engineering.tail()
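
To see how date subtraction behaves in pandas, here is a small sketch on a toy dataframe. The dataframe `toy`, its dates and the column name `nb_days` are made up for illustration.

```python
import pandas as pd

# Toy dataframe with dates stored as text (dates are made up)
toy = pd.DataFrame({"dteday": ["2020-01-01", "2020-01-02", "2020-02-01"]})

# Convert the text column to real datetimes
toy["dteday"] = pd.to_datetime(toy["dteday"])

# Subtracting a date from a datetime column gives timedeltas;
# .dt.days extracts the number of whole days
first_day = toy["dteday"].min()
toy["nb_days"] = (toy["dteday"] - first_day).dt.days

print(toy)
```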

### New variable 2: split the day in time slots
Now let's split the day into morning, noon, afternoon, evening and night. You will create a new column (feature) that gives you this information (based on *hr*).

In [None]:
# First, we need to describe the mapping between the hour and the part of the day
time_slot_mapping = {
    0: "night",
    1: "night",
    2: "night",
    3: "###CODE HERE###",
    4: "night",
    5: "night",
    6: "night",
    7: "morning",
    8: "morning",
    9: "morning",
    10: "morning",
    11: "morning",
    12: "noon",
    13: "noon",
    14: "afternoon",
    15: "afternoon",
    16: "afternoon",
    17: "afternoon",
    18: "afternoon",
    19: "evening",
    20: "evening",
    21: "evening",
    22: "night",
    23: "night"    
}

In [None]:
# create the new variable, and apply the mapping on the hour
dataset_bike_feature_engineering["time_slot"] = dataset_bike_feature_engineering["hr"].map(time_slot_mapping)

In [None]:
dataset_bike_feature_engineering.head()
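
As an alternative to an explicit dictionary, the same kind of binning can be sketched with `pd.cut`. This is a hedged variant, not the approach used in this notebook: the series `toy_hr` is made up, the bin edges are an assumption chosen to reproduce the slots described above, and `ordered=False` (needed because the label "night" appears twice) assumes pandas 1.1 or later.

```python
import pandas as pd

# All hours of a day, standing in for the hr column
toy_hr = pd.Series(range(24))

# Bin edges are right-inclusive: (-1, 6] is night, (6, 11] is morning, etc.
bins = [-1, 6, 11, 13, 18, 21, 23]
labels = ["night", "morning", "noon", "afternoon", "evening", "night"]

# ordered=False allows the duplicated "night" label
toy_slot = pd.cut(toy_hr, bins=bins, labels=labels, ordered=False).astype(str)

print(toy_slot.tolist())
```

The dictionary version is more verbose but makes every hour explicit, which is easier to audit; `pd.cut` is shorter when the slots are contiguous ranges.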

### New variable 3: meteo attractiveness
I imagine that it is more motivating to use a bike on a sunny, warm day than on a rainy, cold day. As a consequence, let's create a new feature that takes the value "perfect_conditions", "worst_conditions" or "average_conditions" depending on the meteorological features (*temp, hum, weathersit, windspeed*). This is totally subjective, but maybe it will improve your model. Let's see.

In [None]:
# define the function that returns the motivation depending on meteo conditions
def get_meteo_attractiveness(weather_situation, temperature, humidity, windspeed):
    if (weather_situation in [1,2]) and (temperature > 0.3) and (humidity < 0.7) and (windspeed < 0.6):
        return "perfect_conditions"
    elif (weather_situation in [3,4]) and ((temperature < 0.3) or (humidity > 0.7) or (windspeed > 0.7)):
        return "worst_conditions"
    else:
        return "average_conditions"

In [None]:
# create the new column and apply the function
dataset_bike_feature_engineering["meteo_conditions"] = dataset_bike_feature_engineering.apply(
    lambda row: get_meteo_attractiveness(row["weathersit"],
                                         row["temp"],
                                         row["hum"],
                                         row["windspeed"]),
    axis=1)

In [None]:
dataset_bike_feature_engineering.head()
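
Row-wise `apply` can be slow on large dataframes. A possible vectorised variant of the same rules, sketched with `np.select` on a made-up dataframe `toy`, could look like this. The thresholds are copied from the function above; the data values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy dataframe with the four meteorological columns (values are made up)
toy = pd.DataFrame({
    "weathersit": [1, 3, 2, 4],
    "temp": [0.5, 0.2, 0.1, 0.5],
    "hum": [0.4, 0.8, 0.5, 0.5],
    "windspeed": [0.1, 0.2, 0.1, 0.1],
})

# Boolean masks, one per rule, evaluated on whole columns at once
perfect = (toy["weathersit"].isin([1, 2]) & (toy["temp"] > 0.3)
           & (toy["hum"] < 0.7) & (toy["windspeed"] < 0.6))
worst = (toy["weathersit"].isin([3, 4])
         & ((toy["temp"] < 0.3) | (toy["hum"] > 0.7) | (toy["windspeed"] > 0.7)))

# The first matching condition wins; rows matching none get the default
toy["meteo_conditions"] = np.select(
    [perfect, worst],
    ["perfect_conditions", "worst_conditions"],
    default="average_conditions",
)

print(toy)
```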

## Dummification

Dummification is an operation that turns one categorical column into a set of binary columns, one per distinct value, such as:

| Column |
|------|
|   a  |
|   b  |
|   b  |
|   c  |
|   a  |

becomes 

| is_a | is_b | is_c |
|------|------|------|
|   1  |   0  |   0  |
|   0  |   1  |   0  |
|   0  |   1  |   0  |
|   0  |   0  |   1  |
|   1  |   0  |   0  |

All the textual variables have to be dummified in order to keep only numerical values in our dataset. We can also apply dummification to numerical variables when their values are category codes with no meaningful order. Here we will also apply it to *weathersit* and *season*.
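
The small table above can be reproduced with `pd.get_dummies`. In this sketch the dataframe `toy` is made up, and `dtype=int` is passed so that the output shows 0/1 rather than booleans (recent pandas versions return booleans by default).

```python
import pandas as pd

# Same toy column as in the table above
toy = pd.DataFrame({"Column": ["a", "b", "b", "c", "a"]})

# One binary column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(toy["Column"], prefix="is", dtype=int)

print(dummies)
```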

In [None]:
dataset_bike_dummified = dataset_bike_feature_engineering.copy()

# define a list that contains all the columns to dummify
columns_to_dummify = ["time_slot", "meteo_conditions", "weathersit", "season"]

# Now let's apply dummification
for col in columns_to_dummify:
    dataset_bike_dummified = pd.concat([dataset_bike_dummified, pd.get_dummies(dataset_bike_feature_engineering[col], prefix=col)], axis=1)
    dataset_bike_dummified = dataset_bike_dummified.drop(col, axis=1)

In [None]:
dataset_bike_dummified.head()

## Split Train and Test
Now, you have to choose which features will be taken as input in your model, and which one is the **target**.
Then, let's split your dataframe into train and test. Usually, we use the letter **X** to talk about the features, and **y** for the target
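
To get a feel for what the split returns, here is a minimal sketch on made-up arrays. The names `toy_X` and `toy_y` are illustrative; the real split on the dataframe comes in the cells below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up samples with 2 features each, and one target value per sample
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Keep 80% of the samples for training and 20% for testing;
# random_state makes the shuffle reproducible
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1234
)

print(toy_X_train.shape, toy_X_test.shape, toy_y_train.shape, toy_y_test.shape)
```

Rows of the features and of the target are shuffled together, so each sample keeps its own target after the split.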

In [None]:
columns_to_remove_in_training = ["instant", "dteday", "atemp", "cnt"]

# let's define a variable that contains the list of features to keep for the model
features_training = [col for col in dataset_bike_dummified.columns if col not in columns_to_remove_in_training]

target_feature = "###CODE HERE###"


In [None]:
features_training

In [None]:
# split your dataframe randomly in order to keep 80% of the samples for training, 20% for testing (also called evaluation).
X_train, X_test, y_train, y_test = train_test_split(dataset_bike_dummified[features_training],
                                                    dataset_bike_dummified[target_feature],
                                                    test_size=0.2, random_state= 1234)

# print the shapes
print("Shape X_train :", "###CODE HERE###".shape)
print("Shape X_test :", X_test.shape)
print("Shape y_train :", "###CODE HERE###".shape)
print("Shape y_test :", y_test.shape)

# Training
Now, your dataframe is ready! You have to choose a model and train it on your data.
Let's take only one type of model for the moment, a simple one: the linear regression.

Training your model is pretty simple thanks to the scikit-learn package! Let's see how it works!

### Model: Multivariate Linear Regression

In [None]:
# create an "empty" model
lreg = LinearRegression(fit_intercept = False)

# fit (=train) this model. => this is when the model looks for the hyperplane
# that minimises the squared distance between its predictions and the targets.
lreg.fit("###CODE HERE###" ,y_train)
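
To make the `fit` step more concrete, here is a sketch of the least-squares problem that a linear regression without intercept solves, written with `np.linalg.lstsq` on made-up data. It is a stand-in for illustration and not scikit-learn's internal code; `X_toy`, `y_toy` and the true weights (2 and 3) are assumptions of this example.

```python
import numpy as np

# Made-up data generated from y = 2*x1 + 3*x2 (no intercept, no noise)
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = X_toy @ np.array([2.0, 3.0])

# Find the coefficients w that minimise the sum of squared errors ||X w - y||^2
coef, residuals, rank, singular_values = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# A prediction is a weighted sum of the features
pred_toy = X_toy @ coef

print("coefficients:", coef)
```

Because the toy targets contain no noise, the recovered coefficients match the weights used to generate them. On real data the fit is only the best compromise, which is why the error has to be measured on held-out samples.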

# Evaluation

Great! The model is trained. Now it is time to run an evaluation to see if your model meets the expectations of your client (MAE < 30 bikes)

In [None]:
# Compute predictions on Train
lreg_pred_train = lreg.predict(X_train)
# Compute predictions on Test: this will be used for evaluation
lreg_pred_test = lreg.predict("###CODE HERE###")

# Evaluate
print("MAE Training = ", mean_absolute_error(lreg_pred_train, "###CODE HERE###"))
print("***** MAE Test = ", mean_absolute_error(lreg_pred_test, y_test), " ******")

Great news, you improved your previous model! **Unfortunately, it seems that "MAE Test" is still not small enough to satisfy your client :/.** In the next notebooks, we will try other models in order to improve your score!