# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Regression and Modularization (Pipeline Building)

#### (Notebook-2)

## Problem Statement

Predict the bike rental count per hour based on the environmental and seasonal settings (such as weather, day, time, humidity, wind speed, season, etc.).

## Learning Objectives

At the end of the mini-project, you will be able to:

* create custom classes required for data processing
* implement pipeline and train the model
* save the model/pipeline
* make predictions using the saved model/pipeline

## Dataset Description

The dataset chosen for this mini-project is a modified version of the [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). It contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. It consists of 17379 instances, each with 14 features.

<br>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/BikeShareSystem.jpg" width=400px>
<br><br>

Bike sharing systems are a new generation of traditional bike rentals where the whole process of membership, rental and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.

### Dataset Characteristics

* **dteday:** hourly date
* **season:** 
    * spring
    * summer 
    * fall
    * winter
* **hr:** hour
* **holiday:** whether the day is considered a holiday
* **weekday:** day of the week
* **workingday:** whether the day is neither a weekend nor holiday
* **weathersit:**
    * Clear, Few clouds, Partly cloudy, Partly cloudy
    * Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp:** temperature in Celsius
* **atemp:** "feels like" temperature in Celsius
* **hum:** relative humidity
* **windspeed:** wind speed
* **casual:** count of casual/non-registered users
* **registered:** count of registered users
* **cnt:** count of total rental bikes including both casual and registered

In [None]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/bike-sharing-dataset.csv
!ls | grep ".csv"
print("Dataset downloaded successfully!")

bike-sharing-dataset.csv
bike-sharing-dataset.csv.1
bike-sharing-dataset.csv.10
bike-sharing-dataset.csv.11
bike-sharing-dataset.csv.12
bike-sharing-dataset.csv.13
bike-sharing-dataset.csv.14
bike-sharing-dataset.csv.15
bike-sharing-dataset.csv.16
bike-sharing-dataset.csv.17
bike-sharing-dataset.csv.18
bike-sharing-dataset.csv.19
bike-sharing-dataset.csv.2
bike-sharing-dataset.csv.20
bike-sharing-dataset.csv.3
bike-sharing-dataset.csv.4
bike-sharing-dataset.csv.5
bike-sharing-dataset.csv.6
bike-sharing-dataset.csv.7
bike-sharing-dataset.csv.8
bike-sharing-dataset.csv.9
Dataset downloaded successfully!


### Import Required Packages

In [None]:
# Loading the Required Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score

In [None]:
# ========== NEW IMPORTS FOR PIPELINE BUILDING ========

# to create pipeline
from sklearn.pipeline import Pipeline

# for including custom preprocessors within pipeline
from sklearn.base import BaseEstimator, TransformerMixin

## **1. Pre-Pipeline-Steps:**

### 1.1 Load, Explore, and Prepare the Data Set

* Load the dataset
* Understand different features in the training dataset
* Understand the data type of each column
* Note which columns have missing values
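
The exploration steps above can be sketched on a small stand-in frame. The `toy` DataFrame below is hypothetical and only mimics a few columns of the dataset; the same calls would apply to the loaded data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the bike-sharing data
toy = pd.DataFrame({
    "weekday": ["Mon", None, "Wed", "Thu"],
    "weathersit": ["Clear", "Mist", None, None],
    "temp": [6.1, 26.78, 3.28, 14.56],
})

# Count missing entries per column, then keep only columns that have any
missing_counts = toy.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0]
print(cols_with_missing)

# Data type of each column
print(toy.dtypes)
```

In this toy frame only `weekday` and `weathersit` report missing entries, which mirrors the pattern seen in the `info()` output of the real dataset.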

In [None]:
# YOUR CODE HERE
bikeshare = pd.read_csv('bike-sharing-dataset.csv')
bikeshare.shape

(17379, 14)

In [None]:
bikeshare.head(5)
bikeshare2 = bikeshare.copy()

In [None]:
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      17379 non-null  object 
 1   season      17379 non-null  object 
 2   hr          17379 non-null  object 
 3   holiday     17379 non-null  object 
 4   weekday     16504 non-null  object 
 5   workingday  17379 non-null  object 
 6   weathersit  16121 non-null  object 
 7   temp        17379 non-null  float64
 8   atemp       17379 non-null  float64
 9   hum         17379 non-null  float64
 10  windspeed   17379 non-null  float64
 11  casual      17379 non-null  int64  
 12  registered  17379 non-null  int64  
 13  cnt         17379 non-null  int64  
dtypes: float64(4), int64(3), object(7)
memory usage: 1.9+ MB


### 1.2 Working on `dteday` column to extract year and month

- Create a function to extract the year and month from the date column and store them in two new columns
  

In [None]:
# YOUR CODE HERE
def get_year_and_month(dataframe):
    df = dataframe.copy()
    # convert 'dteday' column to Datetime datatype
    df['dteday'] = pd.to_datetime(df['dteday'], format='%Y-%m-%d')
    # Add new features 'yr' and 'mnth'
    df['yr'] = df['dteday'].dt.year
    df['mnth'] = df['dteday'].dt.month_name()
    
    return df

In [None]:
bikeshare = get_year_and_month(bikeshare)
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   dteday      17379 non-null  datetime64[ns]
 1   season      17379 non-null  object        
 2   hr          17379 non-null  object        
 3   holiday     17379 non-null  object        
 4   weekday     16504 non-null  object        
 5   workingday  17379 non-null  object        
 6   weathersit  16121 non-null  object        
 7   temp        17379 non-null  float64       
 8   atemp       17379 non-null  float64       
 9   hum         17379 non-null  float64       
 10  windspeed   17379 non-null  float64       
 11  casual      17379 non-null  int64         
 12  registered  17379 non-null  int64         
 13  cnt         17379 non-null  int64         
 14  yr          17379 non-null  int64         
 15  mnth        17379 non-null  object        
dtypes: datetime64[ns](1), float64(4), int64(4), object(7)

In [None]:
bikeshare.head()

| | dteday | season | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | yr | mnth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-11-05 | winter | 6am | No | Mon | Yes | Mist | 6.1 | 3.0014 | 49.0 | 19.0012 | 4 | 135 | 139 | 2012 | November |
| 1 | 2011-07-13 | fall | 4am | No | Wed | Yes | Clear | 26.78 | 28.9988 | 58.0 | 16.9979 | 0 | 5 | 5 | 2011 | July |
| 2 | 2012-02-09 | spring | 11am | No | Thu | Yes | Clear | 3.28 | -0.9982 | 52.0 | 15.0013 | 4 | 95 | 99 | 2012 | February |
| 3 | 2012-03-22 | summer | 7am | No | Thu | Yes | Mist | 14.56 | 15.0002 | 100.0 | 6.0032 | 29 | 332 | 361 | 2012 | March |
| 4 | 2011-11-08 | winter | 12pm | No | Tue | Yes | Clear | 16.44 | 17.0 | 52.0 | 8.9981 | 28 | 175 | 203 | 2011 | November |


### 1.3 Find numerical and categorical variables

In [None]:
# YOUR CODE HERE
unused_colms = ['dteday', 'casual', 'registered']   # unused columns will be removed at later stage
target_col = ['cnt']

numerical_features = []
categorical_features = []

for col in bikeshare.columns:
    if col not in target_col + unused_colms:
        if bikeshare[col].dtypes == 'float64':
            numerical_features.append(col)
        else:
            categorical_features.append(col)


print('Number of numerical variables: {}'.format(len(numerical_features)),":" , numerical_features)

print('Number of categorical variables: {}'.format(len(categorical_features)),":" , categorical_features)

Number of numerical variables: 4 : ['temp', 'atemp', 'hum', 'windspeed']
Number of categorical variables: 8 : ['season', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'yr', 'mnth']


## **2. Pipeline-Steps:**

Build custom classes that are compatible with the Sklearn pipeline for imputation, feature mapping, and any column-specific operation.

### **A. Imputation**

#### Build a custom Imputation class compatible with Sklearn for handling missing values in `weekday` column.

- Find the number of NaN entries in the `weekday` column, and get their row indices
- Use the `dteday` column to extract day names
- Impute values for the missing row indices in `weekday` column with the day names extracted above

**Note that** the extracted day names will contain full names (eg. 'Monday'), and the `weekday` column contains only first three letters (eg. 'Mon').
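
The name mismatch mentioned in the note can be checked with a short sketch. The two dates below are taken from the first rows of the dataset shown earlier; the variable names are illustrative only.

```python
import pandas as pd

# Two dates from the dataset preview, converted to datetime
dates = pd.Series(pd.to_datetime(["2012-11-05", "2011-07-13"]))

# day_name() returns full names such as 'Monday'
full_names = dates.dt.day_name()

# Keep only the first three letters to match the 'weekday' column format
short_names = full_names.str[:3]

print(full_names.tolist())
print(short_names.tolist())
```

Slicing with `.str[:3]` is one simple way to align the two formats; it assumes the `weekday` column always uses the first three letters of the English day name.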

In [None]:
class WeekdayImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weekday' column by extracting dayname from 'dteday' column """

    def __init__(self, variables: str):

        if not isinstance(variables, str):
            raise ValueError("variables should be a str")

        self.variables = variables

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # nothing is learned here; fit exists only to accommodate the sklearn pipeline
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        # row labels where the target column is missing
        wkday_null_idx = X[X[self.variables].isnull()].index
        # take the day name from 'dteday' and keep its first three letters ('Monday' -> 'Mon')
        X.loc[wkday_null_idx, self.variables] = X.loc[wkday_null_idx, 'dteday'].dt.day_name().str[:3]
        return X

In [None]:
# Apply weekday imputer

# YOUR CODE HERE

#### Build another custom Imputation class compatible with Sklearn for handling missing values in `weathersit` column.

- Fill in the missing rows in this column with the most frequent category
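
A minimal sketch of most-frequent-category imputation, assuming two hypothetical Series that stand in for the train and test portions of `weathersit`. The fill value is computed from the training part only and then reused on the test part.

```python
import pandas as pd

# Hypothetical train/test slices of the 'weathersit' column
train = pd.Series(["Clear", "Mist", "Clear", None, "Light Rain"], name="weathersit")
test = pd.Series([None, "Mist"], name="weathersit")

# mode() ignores missing values by default; [0] picks the most frequent category
most_frequent = train.mode()[0]

# Reuse the value learned on the training data for both sets
train_filled = train.fillna(most_frequent)
test_filled = test.fillna(most_frequent)

print(most_frequent)
print(test_filled.tolist())
```

Learning the fill value in `fit` and applying it in `transform` keeps information from the test set out of the training step.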

In [None]:

class WeathersitImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weathersit' column by replacing them with the most frequent category value """

    def __init__(self, variables: str):

        if not isinstance(variables, str):
            raise ValueError("variables should be a str")

        self.variables = variables

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # learn the most frequent category from the data passed to fit
        self.fill_value = X[self.variables].mode()[0]
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        # replace missing entries with the category learned in fit
        X[self.variables] = X[self.variables].fillna(self.fill_value)
        print('weather imputer transformer called')
        return X

In [None]:
# Apply weathersit imputer

# YOUR CODE HERE

### **B. Mapping**

#### Build a Mapper class for mapping `yr`, `mnth`, `season`, `weathersit`, `holiday`, `workingday`, and `hr` columns.

In [None]:

class Mapper(BaseEstimator, TransformerMixin):
    """
    Ordinal categorical variable mapper:
    Treat column as Ordinal categorical variable, and assign values accordingly
    """

    def __init__(self, variables: str, mappings: dict):

        if not isinstance(variables, str):
            raise ValueError("variables should be a str")

        self.variables = variables
        self.mappings = mappings

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # nothing is learned here; fit exists only to accommodate the sklearn pipeline
        print('mapper fit here', self.variables)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        # replace each category with its integer code; astype(int) fails if a
        # category is absent from the mapping, since map() returns NaN for it
        X[self.variables] = X[self.variables].map(self.mappings).astype(int)
        print('mapper transform called')
        return X

In [None]:
# Instantiate mapper for all ordinal categorical features

# YOUR CODE HERE

In [None]:
# Map values for all ordinal categorical features

# YOUR CODE HERE
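
The mapping step can be sketched on a hypothetical toy frame. The two dictionaries reuse the season and year codes this notebook defines for the pipeline; the `toy` frame and the `'monsoon'` category are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with two of the columns to be mapped
toy = pd.DataFrame({"season": ["spring", "fall", "winter"], "yr": [2011, 2012, 2011]})

season_mapping = {'spring': 0, 'winter': 1, 'summer': 2, 'fall': 3}
yr_mapping = {2011: 0, 2012: 1}

# Series.map replaces each value with its dictionary entry
mapped = toy.copy()
mapped["season"] = mapped["season"].map(season_mapping)
mapped["yr"] = mapped["yr"].map(yr_mapping)
print(mapped)

# A category missing from the dictionary becomes NaN, which is worth checking
# before casting the column to int
unmapped = pd.Series(["spring", "monsoon"]).map(season_mapping)
print(unmapped.isnull().tolist())
```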

### **C. Class for Specific operation**

#### Build a Class for handling outliers in numerical columns

- Instead of removing the outliers, change their values
    - to upper-bound, if the value is higher than upper-bound, or
    - to lower-bound, if the value is lower than lower-bound respectively.
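
A small worked example of the capping rule, assuming the bounds are the usual IQR fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR). The `windspeed` values are made up for illustration.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
windspeed = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Quartiles and interquartile range
q1 = windspeed.quantile(0.25)
q3 = windspeed.quantile(0.75)
iqr = q3 - q1

# IQR fences used as lower and upper bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# clip() caps values outside the bounds instead of dropping the rows
capped = windspeed.clip(lower=lower_bound, upper=upper_bound)
print(lower_bound, upper_bound)
print(capped.tolist())
```

Here Q1 is 2.0 and Q3 is 4.0, so the bounds are -1.0 and 7.0 and the value 100.0 is replaced by 7.0 while the other rows stay unchanged.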

In [None]:

class OutlierHandler(BaseEstimator, TransformerMixin):
    """
    Change the outlier values:
        - to upper-bound, if the value is higher than upper-bound, or
        - to lower-bound, if the value is lower than lower-bound respectively.
    """

    def __init__(self, variables: str):

        if not isinstance(variables, str):
            raise ValueError("variables should be a str")
        # note: this message is printed when the object is created, not during fit
        print(' outlier fit called')
        self.variables = variables

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # learn the IQR-based bounds from the data passed to fit
        q1 = X[self.variables].quantile(0.25)
        q3 = X[self.variables].quantile(0.75)
        iqr = q3 - q1
        self.lower_bound = q1 - (1.5 * iqr)
        self.upper_bound = q3 + (1.5 * iqr)
        print(' outlier fit called', q1, q3)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        # cap values outside [lower_bound, upper_bound] in one vectorized step
        X[self.variables] = X[self.variables].clip(lower=self.lower_bound, upper=self.upper_bound)
        print('outlier transform called')
        return X

In [None]:
# Instantiate outlier handler for all numerical features

# YOUR CODE HERE

In [None]:
# Handle outliers for all numerical columns

# YOUR CODE HERE

#### Build a Class to One-hot Encode `weekday` column

In [None]:

class WeekdayOneHotEncoder(BaseEstimator, TransformerMixin):
    """ One-hot encode weekday column """
    def __init__(self, variables: str):

        if not isinstance(variables, str):
            raise ValueError("variables should be a str")

        self.variables = variables

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # learn the weekday categories from the data passed to fit
        self.encoder = OneHotEncoder(sparse_output=False)
        self.encoder.fit(X[[self.variables]])

        print('onehot fit called', X.shape)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        # encode the column and append one new column per category;
        # errors are allowed to surface instead of being silently swallowed
        encoded_weekday = self.encoder.transform(X[[self.variables]])
        enc_wkday_features = self.encoder.get_feature_names_out([self.variables])
        X[enc_wkday_features] = encoded_weekday
        print('onehot transform called', X.shape)
        # keep the transformed frame so it can be inspected after fit/predict
        self.X = X
        return X
        

In [None]:
class Dropfeatures(BaseEstimator, TransformerMixin):
    """ Drop the given columns from the dataframe """
    def __init__(self, variables: list):

        if not isinstance(variables, list):
            raise ValueError("variables should be a list")

        self.variables = variables

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # nothing is learned here; fit exists only to accommodate the sklearn pipeline
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        print('before drop', X.shape)
        # remove the unused columns and return the reduced frame
        X = X.drop(labels = self.variables, axis = 1)
        print('after drop', X.shape)
        # keep the transformed frame so it can be inspected after fit/predict
        self.X = X
        return X
        

In [None]:
# Treat 'weekday' column as a Categorical variable, perform one-hot encoding
unused_colms = ['dteday', 'casual', 'registered',  'weekday'] 
# YOUR CODE HERE

## **3. Build Pipeline**

Build a pipeline and implement all the above class transformers inside the pipeline along with the regressor.

In [None]:
holiday_mapping = {'Yes': 0, 'No': 1}
workingday_mapping = {'No': 0, 'Yes': 1}
hour_mapping = {'4am': 0, '3am': 1, '5am': 2, '2am': 3, '1am': 4, '12am': 5, '6am': 6, '11pm': 7, '10pm': 8, 
                '10am': 9, '9pm': 10, '11am': 11, '7am': 12, '9am': 13, '8pm': 14, '2pm': 15, '1pm': 16, 
                '12pm': 17, '3pm': 18, '4pm': 19, '7pm': 20, '8am': 21, '6pm': 22, '5pm': 23}
yr_mapping = {2011: 0, 2012: 1}
mnth_mapping = {'January': 0, 'February': 1, 'December': 2, 'March': 3, 'November': 4, 'April': 5, 
                'October': 6, 'May': 7, 'September': 8, 'June': 9, 'July': 10, 'August': 11}
season_mapping = {'spring': 0, 'winter': 1, 'summer': 2, 'fall': 3}
weather_mapping = {'Heavy Rain': 0, 'Light Rain': 1, 'Mist': 2, 'Clear': 3}


In [None]:
# YOUR CODE HERE
bikeshare_pipe = Pipeline([

    ##==========Imputation======##
    ('weekday_imputation', WeekdayImputer(variables='weekday')),
    ('Weathersit_imputation', WeathersitImputer(variables='weathersit')),

    ##==========Mapper======##
    ('holiday_mapping', Mapper('holiday', holiday_mapping)),
    ('workingday_mapping', Mapper('workingday', workingday_mapping)),
    ('weather_mapping', Mapper('weathersit', weather_mapping)),
    ('hour_mapping', Mapper('hr', hour_mapping)),
    ('yr_mapping', Mapper('yr', yr_mapping)),
    ('mnth_mapping', Mapper('mnth', mnth_mapping)),
    ('season_mapping', Mapper('season', season_mapping)),

    ##==========Outlier handling======##
    ('windspeed_outlier', OutlierHandler(variables='windspeed')),
    ('hum_outlier', OutlierHandler(variables='hum')),

    ##==========One-hot encoding and column removal======##
    ('weekday_onehot', WeekdayOneHotEncoder(variables='weekday')),
    ('drop', Dropfeatures(variables=unused_colms)),

    ##==========Scaling and model======##
    ('scaler', StandardScaler()),
    ('model_rf', RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42))
])

 outlier fit called
 outlier fit called


## **4. Fit Pipeline**

- Separate the target from the predictor features
- Split the data into train and test sets
- Fit the pipeline on the train set
- Get predictions on the test set
- Calculate the MSE and R2 score
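
The steps listed above can be sketched end to end on synthetic data. Everything in this block is an assumption made for illustration: the `toy` frame, its linear target, and the two-step pipeline, which stands in for the full `bikeshare_pipe`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two numeric features and a noisy linear target
rng = np.random.default_rng(42)
toy = pd.DataFrame({"temp": rng.uniform(0, 35, 200), "hum": rng.uniform(20, 100, 200)})
toy["cnt"] = 5 * toy["temp"] - 0.5 * toy["hum"] + rng.normal(0, 1, 200)

# Separate the target from the predictor features
X = toy.drop(columns=["cnt"])
y = toy["cnt"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Fit a small pipeline on the train set only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model_rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipe.fit(X_train, y_train)

# Predict on the test set and compute the metrics
y_pred = pipe.predict(X_test)
print("R2 score:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))
```

Because the whole pipeline is fitted on the train split, every learned quantity (scaler statistics, model parameters) comes from training rows only.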

In [None]:
# YOUR CODE HERE
# bikeshare = bikeshare2.copy()


X = bikeshare.drop(target_col, axis=1)
y = bikeshare[target_col]
# bikeshare.drop(labels = unused_colms, axis = 1, inplace = True)

### Check package versions (may be used for the requirements.txt file)

In [None]:
# !pip -qq install pydantic
# !pip -qq install strictyaml
# !pip -qq install ruamel.yaml

In [None]:
import numpy as np
import pandas as pd
import sklearn
import pydantic
import strictyaml
import ruamel.yaml
import joblib
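
One way to collect the installed versions for a requirements.txt file is sketched below using only the standard library. The distribution names in `packages` are assumptions based on the imports above (for example, `sklearn` is distributed as `scikit-learn`); packages that are not installed are skipped.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed to correspond to the imports used in this notebook
packages = ["numpy", "pandas", "scikit-learn", "pydantic", "strictyaml", "ruamel.yaml", "joblib"]

# Look up the installed version of each package, or None if it is missing
pinned = {}
for name in packages:
    try:
        pinned[name] = version(name)
    except PackageNotFoundError:
        pinned[name] = None

# Build 'name==version' lines in the format used by requirements.txt
requirement_lines = [f"{name}=={ver}" for name, ver in pinned.items() if ver is not None]
print("\n".join(requirement_lines))
```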

In [None]:
# YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train.shape, X_test.shape

((13903, 15), (3476, 15))

In [None]:
bikeshare_pipe.fit(X_train,y_train.values.ravel())

# print(x_1)

weather imputer transformer called
mapper fit here holiday
mapper transform called
mapper fit here workingday
mapper transform called
mapper fit here weathersit
mapper transform called
mapper fit here hr
mapper transform called
mapper fit here yr
mapper transform called
mapper fit here mnth
mapper transform called
mapper fit here season
mapper transform called
 outlier fit called 7.0015 16.997899999999998
outlier transform called
 outlier fit called 47.0 78.0
outlier transform called
onehot fit called (13903, 15)
onehot transform called (13903, 22)
before drop (13903, 22)
after drop (13903, 18)


In [None]:
y_pred=bikeshare_pipe.predict(X_test)

weather imputer transformer called
mapper transform called
mapper transform called
mapper transform called
mapper transform called
mapper transform called
mapper transform called
mapper transform called
outlier transform called
outlier transform called
onehot transform called (3476, 22)
before drop (3476, 22)
after drop (3476, 18)


In [None]:
bikeshare_pipe.named_steps["weekday_onehot"].X

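| | dteday | season | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | ... | registered | yr | mnth | weekday_Fri | weekday_Mon | weekday_Sat | weekday_Sun | weekday_Thu | weekday_Tue | weekday_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12830 | 2012-04-22 | 2 | 14 | 1 | Sun | 0 | 1 | 8.92 | 5.9978 | 93.0 | ... | 34 | 1 | 5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 8688 | 2011-06-05 | 2 | 1 | 1 | Sun | 0 | 2 | 21.14 | 24.0026 | 65.0 | ... | 25 | 0 | 9 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 7091 | 2011-12-18 | 1 | 8 | 1 | Sun | 0 | 3 | 2.34 | -0.9982 | 75.0 | ... | 47 | 0 | 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 12230 | 2012-06-16 | 2 | 0 | 1 | Sat | 0 | 3 | 17.38 | 18.0032 | 68.0 | ... | 5 | 1 | 9 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 431 | 2012-07-31 | 3 | 0 | 1 | Tue | 1 | 3 | 23.02 | 24.0026 | 83.0 | ... | 6 | 1 | 10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6759 | 2012-08-12 | 3 | 0 | 1 | Sun | 0 | 3 | 22.08 | 24.0026 | 69.0 | ... | 8 | 1 | 11 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 13989 | 2012-08-01 | 3 | 13 | 1 | Wed | 1 | 2 | 23.96 | 26.0024 | 74.0 | ... | 316 | 1 | 11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 173 | 2012-04-24 | 2 | 11 | 1 | Tue | 1 | 3 | 13.62 | 13.9970 | 31.0 | ... | 157 | 1 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 16192 | 2012-03-16 | 0 | 22 | 1 | Fri | 1 | 1 | 14.56 | 15.0002 | 82.0 | ... | 377 | 1 | 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |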


In [None]:
# Evaluate the regression on the test set: R2 score and mean squared error
print("R2 score:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))

R2 score: 0.9197946785073274
Mean squared error: 2716.920428209541


## **5. Modularize the application**

- Convert the above regression application to a production-ready format (.py files) inside VS Code.

- Create different modules specific to functionality:
    - requirements
    - configuration
    - data manager
    - feature engineering
    - pipeline building
    - pipeline training
    - predict
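
For the pipeline training and predict modules, a fitted pipeline would typically be saved to disk and loaded again later. The sketch below shows one way to do that with `joblib`; the tiny linear pipeline and the file name `bikeshare_pipe.pkl` are hypothetical placeholders, not the project's actual artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline fitted on made-up data (y = 2 * x)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name for the saved pipeline
    path = os.path.join(tmp_dir, "bikeshare_pipe.pkl")

    # Training side: persist the fitted pipeline
    joblib.dump(pipe, path)

    # Prediction side: load it back and predict on new data
    loaded = joblib.load(path)
    preds = loaded.predict(np.array([[5.0]]))

print(preds)
```

Saving the whole pipeline, rather than only the regressor, keeps the preprocessing steps and the model together, so the predict module can pass raw features straight to `predict`.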
