## ML project

### The setup

A former colleague of yours was working on a promising data-focused project, but he was recently fired (for unknown reasons), so you are taking the project over. **Your goal** is to kickstart this project and turn it into a successful data-driven use case rather than the immature experimental notebook it is right now. You will have to **improve the approach** started by your colleague, **rethink** some of the more immature techniques, **substantially expand and reinforce** the project, and verify that it actually brings business value.

<img src="images/coworker.jpg" width=400>

### Project background

You are working for a bike rental company that hopes to optimize bicycle availability at its rental locations. You have access to its past data, which contains the hourly and daily counts of bike rentals in 2011 and 2012 together with the corresponding weather and seasonal information. Our target variable is **cnt**, the number of bikes rented out at a particular moment. Below is a description of the variables:

- **datetime**: date and time when each log of bike rentals was made
- **weathersit**: weather situation at the moment of the log
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- **temp**: Normalized temperature in Celsius. The values are derived via (t - t_min) / (t_max - t_min), with t_min = -8 and t_max = +39
- **atemp**: Normalized feeling temperature in Celsius. The values are derived via (t - t_min) / (t_max - t_min), with t_min = -16 and t_max = +50
- **hum**: Normalized humidity. The values are divided by 100 (max)
- **windspeed**: Normalized wind speed. The values are divided by 67 (max)
- **registered**: count of registered users among those who rented bikes
- **cnt**: count of all rental bikes including both unregistered and registered users
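
The normalization formulas above can be inverted to recover values on the original scales, which may help when sanity-checking inputs. The following is a minimal sketch, assuming the constants listed in the variable descriptions are exact; the function and constant names are hypothetical and not part of the dataset or the original project.

```python
# Constants copied from the variable descriptions above (assumed exact).
TEMP_MIN, TEMP_MAX = -8.0, 39.0      # bounds used to normalize `temp`
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0   # bounds used to normalize `atemp`
HUM_MAX = 100.0                      # divisor used for `hum`
WINDSPEED_MAX = 67.0                 # divisor used for `windspeed`


def denormalize_min_max(value, lower, upper):
    """Invert (t - t_min) / (t_max - t_min) to recover t."""
    return value * (upper - lower) + lower


def denormalize_by_max(value, maximum):
    """Invert a division by the maximum value."""
    return value * maximum


# A normalized temperature of 0.5 maps back to degrees Celsius.
temp_celsius = denormalize_min_max(0.5, TEMP_MIN, TEMP_MAX)
# A normalized wind speed of 0.5 maps back to the original wind speed scale.
windspeed_original = denormalize_by_max(0.5, WINDSPEED_MAX)

print(temp_celsius, windspeed_original)
```

Because the helpers use only arithmetic, they should also work element-wise on a pandas column, for example `denormalize_min_max(bikes_df['temp'], TEMP_MIN, TEMP_MAX)`.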

**Important assumption**: additional research by the bike rental company showed that each rental location at each moment in time can be considered *independent* of every other bike rental log in the dataset.

The dataset can be seen below:

In [None]:
import pandas as pd

bikes_df = pd.read_csv('data/bike_rentals.csv')
bikes_df.head()

## The Old Project

Below you can find, *untouched*, the analysis made by your former colleague. Other stakeholders have told you that, according to that colleague, the code below successfully develops a bike demand forecasting framework, so you should not need to change much and only need to polish the code. You, however, have doubts and first want to **assess the adequacy of the approach**, **outline potential improvements**, and **redo/add certain steps**.

### Assignment 1

Go through the code below and try to assess:

- what it might be missing
- where the approach might be questionable/wrong
- any code style / logic issues
- whether the built application is usable and addresses the goal

Make notes about any of the above; they will be discussed with the other training participants.

*Note*: your former colleague's remarks appear as Python comments (# text) in the code

*Note 2*: pause after Block #4 (where a test set is separated); the rest will be discussed later

In [None]:
### ---------- Colleague's code begins here ---------- ###

In [None]:
#1. I first drop features that would be too hard to use for building a demand forecasting application 
#+ drop any missing values (there are not many anyways)
BikePrepData=bikes_df.drop(['datetime','atemp'],axis=1).dropna()

print('data size is ' + str(BikePrepData.shape))
BikePrepData.head()

In [None]:
#2. Then I remove outliers (very low or very high values) in the target variable and corresponding other values
#I determine outlier based on whether they are much smaller or larger than the mean value of the target

print("target mean value:")
print(round(bikes_df['cnt'].mean(),2))

# therefore anything below 10 would be rather small and anything above 890 would be rather large, so those can be considered outliers

BikePrepData=BikePrepData[BikePrepData.cnt>=10]
BikePrepData=BikePrepData[BikePrepData.cnt<=890]


BikePrepData.head()

In [None]:
sum(BikePrepData.loc[0])

In [None]:
#3 here I add a few aditional features that may be useful for forecasting:
# squared temperature and a sum of all present features:

BikePrepData['sqTemp'] = BikePrepData.temp * 2
BikePrepData['RowSum'] = [sum(BikePrepData.loc[i]) for i in BikePrepData.index]
BikePrepData.head()

In [None]:
#4 Finally, I scale all features by subtracting their means and deviding by standard deviations
# and separate a random subset of the data for future testing

#from sklearn.preprocessing import StandardScaler

#DataScaled = StandardScaler().fit_transform(BikePrepData)

DataMeans = BikePrepData.mean()
DataStds = BikePrepData.std()

DataScaled = ((BikePrepData - DataMeans) / DataStds)

# the target variable is also separated from the features below

TrainData = DataScaled.sample(frac=0.65)
TrainTarget = BikePrepData[BikePrepData.index.isin(TrainData.index)]['cnt'].values #BikePrepData is used here to avoid using scaled 'cnt'
TrainData = TrainData.drop(['cnt'],axis=1).values

TestData = DataScaled.sample(frac=0.35)
TestTarget = BikePrepData[BikePrepData.index.isin(TestData.index)]['cnt'].values
TestData = TestData.drop(['cnt'],axis=1).values

#### Pause here. The rest will be discussed later

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [None]:
#5 Below I train, improve and evaluate two ML approaches to forecast bike rental demand
#  Model 1 is a decision tree (popular, interpretable and just generally great)
#  Model 2 is a Suport Vector Machine, also very popular and often powerful

model1 = DecisionTreeRegressor()
model1.fit(TrainData, TrainTarget)
y_pred1 = model1.predict(TestData)

model2 = SVR()
model2.fit(TrainData, TrainTarget)
y_pred2 = model2.predict(TestData)

# Root-Mean-Absolute-Error is used below as our main metric, it could be a good alternative too RMSE

print('RMAE with model 1:')
print(round(mean_absolute_error(TestTarget, y_pred1)**0.5,2))

print('RMAE with model 2:')
print(round(mean_absolute_error(y_pred2, TestTarget)**0.5,2))

In [None]:
#model 1 has a higher score and is therefore selected. 
#It already has a pretty good score (only mistaken by ~14 bikes on average, but we will try to further improve it)

In [None]:
#6 best parameters for model 1 are selected here
# based on which parameters result in a higher score on the test set

BestParams = []
BestScore = 0

for Depth in range(1,10, 2):
    for criterion in ["mse", "mae"]:
        for state in [1, 100, 999]:
            
            model = DecisionTreeRegressor(max_depth=Depth, criterion=criterion, random_state=state)
            model.fit(TrainData, TrainTarget)
            pred = model.predict(TestData)
            CurrentScore = round(mean_absolute_error(TestTarget, pred)**0.5,2)
            
            if CurrentScore > BestScore:
                BestParams = [Depth, criterion, state]
                BestScore = CurrentScore

print("the best model has these parameters" + str(BestParams))
print("and this score:" + str(BestScore))

In [None]:
# unfortunately the score is still lower than what we had for model 1, so model 1 will be used as the best model

In [None]:
BikePrepData.columns

In [None]:
#7 Building a user interface (productionizing our solution)

# Notice that is a MVP, so only temperature, wind speed and weather situation are accepted as inputs for predictions

import ipywidgets
import numpy as np

def GetPred(index=20000, 
             weathersit=BikePrepData['weathersit'].mean(),
             temp=BikePrepData['temp'].mean(),
             hum=BikePrepData['hum'].mean(),
             windspeed=BikePrepData['windspeed'].mean(),
             registered=BikePrepData['registered'].mean(),
             RowSum=BikePrepData['RowSum'].mean()):
    
    sqTemp = temp ** 2

    df = pd.DataFrame({'index': {0:index},
                       'weathersit': {0:weathersit},
                       'temp': {0:temp},
                       'hum': {0:hum},
                       'windspeed': {0:windspeed},
                       'registered': {0:registered},
                       'cnt' : {0:10000}, # need it to correctly scale
                       'sqTemp': {0:sqTemp},
                       'RowSum' : {0:RowSum},
                       
                       })    
    
    #scaling inputs
    df = ((df - DataMeans) / DataStds).drop(['cnt'],axis=1)

    display(df)
    prediction = round(model1.predict(df)[0],0)
    print("Predicted Bike Demand is: " + str(prediction))
        
style = {'description_width': 'initial'}

In [None]:
ipywidgets.interact(GetPred,
                    temp=ipywidgets.widgets.BoundedFloatText(
                        min=5, 
                        max=20, 
                        step=1, 
                        value=15,
                        description="How warm is it out there?",
                        style=style),
                    windspeed=ipywidgets.widgets.BoundedFloatText(
                        min=0, 
                        max=1, 
                        step=0.05, 
                        value=0.5,
                        description="What is the weather speed?",
                        style=style),
                    weathersit=ipywidgets.widgets.BoundedFloatText(
                        min=1, 
                        max=4,
                        step=1, 
                        value=1,
                        description="What is the weather situation out there?",
                        style=style)
                   );

In [None]:
# it works!