## ML project

### The setup

A former colleague of yours was working on a promising data-focused project, but he was recently fired (for unknown reasons), so you are taking the project over. **Your goal** is to kickstart this project and turn it into a successful data-driven use case rather than the immature experimental notebook it is right now. You will have to **improve the approach** started by your colleague, **rethink** some of the more immature techniques, **substantially expand and reinforce** the project, and verify that it actually brings business value.

<img src="images/coworker.jpg" width=400>

### Project background

You are working for a bike rental company that hopes to optimize bicycle availability at its rental locations. You have access to its historical data, which contains hourly and daily counts of bike rentals for the years 2011 and 2012 together with the corresponding weather and seasonal information. Our target variable is **cnt**, the number of bikes rented out at a particular moment. Below is a description of all the variables:

- **datetime**: date and time when each log of bike rentals was made
- **weathersit**: weather situation at the moment of the log
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- **temp**: Normalized temperature in Celsius. The values are derived via (t - t_min) / (t_max - t_min), with t_min = -8 and t_max = +39
- **atemp**: Normalized "feels like" temperature in Celsius. The values are derived via (t - t_min) / (t_max - t_min), with t_min = -16 and t_max = +50
- **hum**: Normalized humidity. The values are divided by 100 (max)
- **windspeed**: Normalized wind speed. The values are divided by 67 (max)
- **registered**: count of registered users among those who rented bikes
- **cnt**: count of all rental bikes including both unregistered and registered users
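The normalization formulas above can be inverted to get back physical units, which is handy for sanity checks and plots. The snippet below is a minimal sketch: the helper name `denormalize` and the sample values are made up for illustration, and only the ranges come from the variable descriptions.

```python
import pandas as pd

# Ranges quoted in the variable descriptions above.
TEMP_MIN, TEMP_MAX = -8.0, 39.0
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0


def denormalize(values: pd.Series, lower: float, upper: float) -> pd.Series:
    """Invert (t - t_min) / (t_max - t_min) to recover degrees Celsius."""
    return values * (upper - lower) + lower


# Made-up normalized values, for illustration only (not taken from the dataset).
sample = pd.DataFrame({"temp": [0.0, 0.5, 1.0], "atemp": [0.0, 0.5, 1.0]})
sample["temp_celsius"] = denormalize(sample["temp"], TEMP_MIN, TEMP_MAX)
sample["atemp_celsius"] = denormalize(sample["atemp"], ATEMP_MIN, ATEMP_MAX)
```

The same idea applies to **hum** and **windspeed**, which would simply be multiplied by 100 and 67 respectively.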

**Important assumption**: additional research of the bike rental company showed that each rental location at each moment in time can be considered *independent* of every other bike rental log in the dataset.

The dataset can be seen below:

In [None]:
import pandas as pd

bikes_df = pd.read_csv('data/bike_rentals.csv')
bikes_df.head()

## The New Project

Here we will redevelop this project while addressing all of the concerns about the old one. We will work through each part of the project in turn and hopefully end up with a usable application!

*Note:* each block will be turned into several preprocessing functions, which we will later combine to turn raw data into data ready for ML pipelines.

*Note 2:* when creating these preprocessing functions, we should be particularly mindful of two things: whether a step could introduce data leakage across the train-test split, and whether we are unsure about a step and would rather include it as a tunable part of an ML pipeline.
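To make the two notes concrete, here is a small sketch of how such functions might be combined. All function names and the toy rows are hypothetical. The idea is that stateless steps (parsing, de-duplication) can run on the full data, while anything that learns a statistic (here, a median used for imputation) is fitted on the training part only and then applied to both parts. The positional split is only for illustration.

```python
import pandas as pd


def parse_datetime(df: pd.DataFrame) -> pd.DataFrame:
    """Stateless step: convert the datetime column to a proper timestamp."""
    out = df.copy()
    out["datetime"] = pd.to_datetime(out["datetime"])
    return out


def drop_duplicate_logs(df: pd.DataFrame) -> pd.DataFrame:
    """Stateless step: remove rows that are exact copies of another row."""
    return df.drop_duplicates().reset_index(drop=True)


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Chain the stateless steps; safe to run before the train-test split."""
    return raw.pipe(parse_datetime).pipe(drop_duplicate_logs)


def fit_fill_values(train: pd.DataFrame, columns: list) -> pd.Series:
    """Stateful step: learn imputation values from the training data only."""
    return train[columns].median()


def apply_fill_values(df: pd.DataFrame, fill_values: pd.Series) -> pd.DataFrame:
    """Fill missing values with statistics learned elsewhere."""
    return df.fillna(fill_values)


# Toy rows, made up for illustration (not taken from the dataset).
raw = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00",
                 "2011-01-01 01:00:00", "2011-01-01 02:00:00",
                 "2011-01-01 03:00:00"],
    "hum": [0.8, 0.6, 0.6, None, 0.4],
    "cnt": [16, 40, 40, 32, 13],
})

clean = preprocess(raw)
train, test = clean.iloc[:3], clean.iloc[3:]

fill_values = fit_fill_values(train, ["hum"])   # the test rows are never seen here
train_ready = apply_fill_values(train, fill_values)
test_ready = apply_fill_values(test, fill_values)
```

Steps we are unsure about (for example, which imputation strategy to use) are better placed inside an ML pipeline so that they can be tuned together with the model.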

## 1. Data Inspection & Quality Concerns

Here we will inspect the quality of the dataset, determine whether there are any serious issues, and suggest how we are going to solve them later on.
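As a starting point, a few basic checks could be collected in one place. The sketch below is an assumption about what is worth checking (missing values, duplicated rows, gaps in the hourly timeline, impossible counts); the function name `quality_report` and the toy rows are made up for illustration.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, time_col: str = "datetime") -> dict:
    """Summarize a few basic data-quality indicators for hourly logs."""
    timestamps = pd.to_datetime(df[time_col])
    # Every hour we would expect to see between the first and the last log.
    expected = pd.date_range(timestamps.min(), timestamps.max(),
                             freq=pd.Timedelta(hours=1))
    missing_hours = expected.difference(pd.DatetimeIndex(timestamps))
    return {
        "n_rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_hours": len(missing_hours),
        "negative_counts": int((df["cnt"] < 0).sum()),
    }


# Toy rows, made up for illustration: 02:00 is missing and the last row is repeated.
toy = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00",
                 "2011-01-01 03:00:00", "2011-01-01 03:00:00"],
    "hum": [0.8, None, 0.4, 0.4],
    "cnt": [16, 40, 13, 13],
})

report = quality_report(toy)
```

On the real data this would be called as `quality_report(bikes_df)`, assuming the column names match the variable descriptions above.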

## 2. Feature Engineering

Here we will use our domain knowledge to determine which useful features can be extracted manually, and which ones can be added using automatic feature generators.
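For the manual part, the **datetime** column is an obvious candidate. The sketch below shows one possible set of calendar features; the feature names, the weekend definition, and the cyclical encoding of the hour are our own assumptions rather than requirements of the project.

```python
import numpy as np
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive simple calendar features from the datetime column."""
    out = df.copy()
    timestamps = pd.to_datetime(out["datetime"])
    out["hour"] = timestamps.dt.hour
    out["weekday"] = timestamps.dt.dayofweek          # Monday = 0, Sunday = 6
    out["month"] = timestamps.dt.month
    out["is_weekend"] = (out["weekday"] >= 5).astype(int)
    # Cyclical encoding so that 23:00 and 00:00 end up close to each other.
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    return out


# Toy timestamps, made up for illustration: a Saturday morning and a Monday evening.
toy = pd.DataFrame({"datetime": ["2011-01-01 06:00:00", "2011-01-03 18:00:00"]})
features = add_time_features(toy)
```

These features are stateless (they depend only on the row itself), so computing them before the train-test split does not leak information.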

## 3. ML applicability

Here we will explore and justify whether this problem and the available data actually allow us to use an ML approach. ML would not be a good fit in two situations:

- The problem is too simple (there are features ~99% correlated with the target)
- The problem is too complex & there is not enough information (the features are rather unrelated to the problem)
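A quick first screen for both situations is to look at how strongly each numeric feature correlates with the target. The sketch below is only a rough heuristic: the 0.99 threshold mirrors the "~99%" above, the 0.05 threshold for "weak" is an arbitrary assumption, Pearson correlation only captures linear relationships, and the toy values are made up.

```python
import pandas as pd


def screen_features(df: pd.DataFrame, target: str = "cnt",
                    too_simple: float = 0.99,
                    too_weak: float = 0.05) -> pd.DataFrame:
    """Flag numeric features that nearly duplicate the target or barely relate to it."""
    numeric = df.select_dtypes("number")
    abs_corr = numeric.corr()[target].drop(target).abs()
    report = pd.DataFrame({
        "abs_corr": abs_corr,
        "near_duplicate_of_target": abs_corr >= too_simple,
        "weak_signal": abs_corr < too_weak,
    })
    return report.sort_values("abs_corr", ascending=False)


# Toy values, made up for illustration (not taken from the dataset).
toy = pd.DataFrame({
    "cnt":        [10, 20, 30, 40, 50, 60],
    "registered": [8, 17, 25, 34, 42, 51],
    "temp":       [0.2, 0.5, 0.3, 0.6, 0.4, 0.8],
    "windspeed":  [0.1, 0.2, 0.3, 0.3, 0.2, 0.1],
})

report = screen_features(toy)
```

Note that **registered** is by definition a component of **cnt**, so a very high correlation there would more likely point to target leakage than to a problem that is "too simple".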

## 4. Building a preprocessing ML pipeline

Here we will focus on assembling the preprocessing blocks of an ML pipeline that we are going to use in combination with an ML model later.
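One way such preprocessing blocks could be assembled is with scikit-learn's `ColumnTransformer`. This is a sketch under assumptions: the choice of library, the median imputation, and the one-hot encoding of **weathersit** are ours, and the toy rows are made up. Fitting on the training rows only keeps the test rows out of the learned statistics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["temp", "atemp", "hum", "windspeed"]
categorical_cols = ["weathersit"]

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns are already normalized, so we only impute gaps.
        ("numeric", SimpleImputer(strategy="median"), numeric_cols),
        # Categories unseen during fitting are encoded as all zeros.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Toy rows, made up for illustration (not taken from the dataset).
train = pd.DataFrame({
    "temp": [0.2, 0.4, None, 0.6],
    "atemp": [0.25, 0.4, 0.5, 0.6],
    "hum": [0.8, 0.6, 0.7, 0.4],
    "windspeed": [0.0, 0.1, 0.2, 0.3],
    "weathersit": [1, 2, 3, 1],
})
test = pd.DataFrame({
    "temp": [0.5], "atemp": [0.5], "hum": [0.9],
    "windspeed": [0.4], "weathersit": [4],   # category not present in train
})

train_matrix = preprocessor.fit_transform(train)  # statistics learned here only
test_matrix = preprocessor.transform(test)        # reused, never refitted
```

The resulting `preprocessor` can later be placed as the first step of a `Pipeline`, followed by a model.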

## 5. Choosing and setting up ML model(s)

Here we will decide which ML models may be a good choice for this application, and set them up.

## 6. Parameter Tuning

Here we will tune each model and select its best-performing parameters.
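A common way to do this is a cross-validated grid search over a whole pipeline, so that preprocessing is refitted inside each fold. The sketch below uses synthetic data and a `Ridge` regressor purely as placeholders; the actual models, parameter grids, and scoring metric for this project are still to be decided.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated for illustration only (not the bike rental dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge()),          # placeholder model
])

# Pipeline parameters are addressed as "<step name>__<parameter name>".
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
)
search.fit(X, y)

best_alpha = search.best_params_["model__alpha"]
best_mae = -search.best_score_
```

Because the scaler lives inside the pipeline, it is fitted on the training folds only, which is in line with the data leakage concern raised earlier.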

## 7. Model Optimization and Packaging

Here we will make the final preparations to turn our selected pipeline into a forecasting application.