# Please implement all functions required in this notebook and write your observations and discussions as comments or markdown cells inside the notebook. Please submit the jupyter notebook along with all files that are required in the tasks.

You are provided with a file `data.csv` that contains the number of bicycles observed in several places in Ottawa during 2010 to 2019. The csv file has the following columns:

| column name | description |
|:-------------:|:-------------:|
| location_name | the location where the counter was installed |
| count      | number of bicycles passed by |
| max temp | maximum temperature (Celsius) |
| mean temp | average temperature (Celsius) |
| min temp | minimum temperature (Celsius) |
| snow on grnd (cm) | snow on ground |
| total precip (mm) | total precipitation |
| total rain (mm) | total rain |
| total snow (cm) | total snow |
| date | date of recording |


This part requires you to perform **data engineering**, **classical machine learning** methods and **neural networks** methods to solve a multi-class classification problem. All code should be written in Python (within the provided Jupyter Notebook environment) with or without (but not restricted to) the following packages:

In [1]:
import pandas as pd
import numpy as np
import sklearn
import scipy
#import torch

If you wish to use any other dependencies, please add them at the end of `requirements.txt` file. All code should be run on a CPU machine.

**Task objective:** given an input of `date, max temp, mean temp, min temp, snow on grnd (cm), total precip (mm), total rain (mm), total snow (cm)`, predict whether the tota
l number of bicycles observed in a day is ***less than 2000***, ***in between 2000 and 10k*** or ***over 10k***.

Please read through all the items below before you start the task, and write down as many comments as you can.

## 1. Pre-processing

Please implement the function below which does the following two steps:

1. First sum the counts at different locations in a day. In other words, you are creating a new `DataFrame` that contains the following columns:
    - `date`
    - `max temp`
    - `mean temp`
    - `min temp`
    - `snow on grnd (cm)`
    - `total precip (mm)`
    - `total rain (mm)`
    - `total snow (cm)`
    - `total count`
    

2. Convert the numerical column `total count` into three categories: *less than 2000*, *2000 to 10000* and *over 10000*. Save the `processed_df` to `processed_data.csv`.

In [6]:
#Convert the numerical column total count into three
#categories: less than 2000, 2000 to 10000 and over 10000. Save the processed_df to processed_data.csv.
def divide_classes(x):
    if x <  2000:
        return "less than 2000"
    elif x >= 2000 & x <= 10000:
        return "2000 to 10000"    
    elif x > 10000:
        return "Over 10000"

df = pd.read_csv("data.csv")

def preprocessing(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Pre-process a dataframe
    
    :param pd.DataFrame df: raw dataframe from data.csv
    
    :returns pd.DataFrame processed_df: processed dataframe
    ''' 
    
    print(df.shape)
    print(df.head())
    df1 = df.groupby(['date' , 'location_name' ], as_index=False)['count'].sum()

    df1.columns = ['date', 'location_name', 'total count']
    #del df['location_name']

    processed_df = pd.merge(df1, df, how = 'left', on = ['date', 'location_name'])
    print(processed_df.shape)
    print(processed_df.head(5))

    del processed_df['location_name']
    del processed_df['count']

    processed_df['class'] = processed_df['total count'].apply(lambda x : divide_classes(x))
    print(processed_df.head())

    return processed_df

## 2. Data engineering

Please implement the function below to do any data engineering as needed, split the dataset into train set (80%) and test set (20%). The final `DataFrame` should be ready for training/testing, in other words, one should be able to get `x, y` arrays using the following commands:

```python
x_train = train_df.drop(columns=['total count']).values
y_train = train_df['total count'].values
```

Save `train_df`, `test_df` to `train.csv` and `test.csv` respectively.

*hint: please consider what additional information can be extracted from the given columns*

In [3]:
from sklearn.model_selection import train_test_split
def data_engineering(processed_df: pd.DataFrame) -> (pd.DataFrame,
                                                     pd.DataFrame):
    
    '''
    Perform data engineering on processed dataframe

    :param pd.DataFrame processed_df: output of preprocess()

    :returns pd.DataFrame train_df: training set of the engineered dataframe
    :returns pd.DataFrame test_df: test set of the engineered dataframe
    '''
    
    # drop the null values
    print(processed_df.isnull().sum().tolist())
    processed_df = processed_df.dropna(axis=1, how='all')
    print(processed_df.isnull().sum().tolist())

    processed_df['day'] = pd.DatetimeIndex(df['date']).day
    processed_df['month'] = pd.DatetimeIndex(df['date']).month
    processed_df['year']= pd.DatetimeIndex(df['date']).year
    del processed_df['date']
    processed_df.head()

    # train test split
    train_df, test_df = train_test_split(processed_df, test_size=0.2)
    
    # save dataframe as csv
    train_df.to_csv('train.csv', sep='\t')
    test_df.to_csv('test.csv', sep='\t')
    
    return (train_df, test_df)

##  3. Classical machine learning methods

Please complete the function below which takes `train_df` and `test_df` as inputs, and outputs the trained classifier, accuracy on training set and test set, confusion matrix on test set. You can experiment with as many methods as you want, please only leave the one with best performance (you can leave the others in comments) so that the function only returns one trained classifier.

Please answer the following questions as comments or in another markdown cell:

1. What data engineering techniques did you apply?
2. What's the best accuracy on test set did you achieve? Which classifier did you use to get the best accuracy?
3. Which features are the most important for predicting counts of bicycles?

In [4]:
# ml models decistion tree
from sklearn.tree import DecisionTreeClassifier

# One of the Ensemle model -- Random Forset which is ensemple of decision trees
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, VotingClassifier

# import logistic regression, SVC  for ensembling
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

# Naive Bayes
from sklearn.naive_bayes import MultinomialNB

# Distance based
from sklearn.neighbors import KNeighborsClassifier

# packages for metric for evalaution of the models
from sklearn import metrics
from sklearn.metrics import classification_report , accuracy_score, confusion_matrix

# to save the models so that we don't need to train the models every time we need to do prediction
from sklearn.externals import joblib


from sklearn.model_selection import train_test_split, cross_val_score




In [8]:
x_train, x_test = data_engineering(preprocessing(df))

x_train.to_csv("train")
x_test.to_csv("test")

(23028, 10)
  location_name  count  max temp  mean temp  min temp  snow on grnd (cm)  \
0          ALEX      0     -14.0      -18.2     -22.3               27.0   
1    ADAWE BIKE    968      29.5       23.8      18.0                0.0   
2          COBY     10      -3.0       -9.3     -15.6                1.0   
3          SOMO    122      14.0        8.8       3.5                0.0   
4          OYNG    228      15.2        9.4       3.5                0.0   

   total precip (mm)  total rain (mm)  total snow (cm)        date  
0                0.3              0.0              0.3  2015-02-20  
1                2.8              2.8              0.0  2016-08-28  
2                7.0              0.0              9.5  2011-12-25  
3               34.4             34.4              0.0  2017-05-01  
4                0.0              0.0              0.0  2015-04-27  
(23028, 11)
         date location_name  total count  count  max temp  mean temp  \
0  2010-01-01          ALEX      

In [83]:
#x_train = 
#y_train =
#x_test = 
#y_test = 


# trying different models
models = [
    LogisticRegression(),
    LinearSVC(),
    DecisionTreeClassifier(),
    KNeighborsClassifier(n_neighbors=5),
    MultinomialNB()
]

# 5 cross validation
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []

for model in models:
  model_name = model.__class__.__name__
  accuracies = cross_val_score(model, x_train, y_train, scoring='accuracy', cv=CV)
    
  for fold_idx, accuracy in enumerate(accuracies):
    entries.append((model_name, fold_idx, accuracy))

cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
cv_df.groupby('model_name').accuracy.mean()

[0, 106, 106, 106, 120, 120, 120, 120, 0, 0, 0, 0]
[0, 106, 106, 106, 120, 120, 120, 120, 0, 0, 0, 0]


KeyError: 'date'

In [None]:
random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(x_train, y_train)


print("Accuracy = ")
print(random_forest.score(x_test, y_test))
print("\n")
y_pred_rf = random_forest.predict(X_test)
print(classification_report(y_test, y_pred_rf))

In [None]:
def classical_ml(train_df: pd.DataFrame, test_df: pd.DataFrame) -> (
    'classifier', 'accuracy', 'confusion matrix'):
    '''
    Use classical machine learning methods to predict total counts

    :param pd.DataFrame train_df: training set dataframe
    :param pd.DataFrame test_df: test set dataframe

    :returns 'classifier': trained classifier
    :returns 'accuracy': tuple of training accuracy and testing accuracy
    :returns 'confusion matrix': confusion matrix on test set
    '''
    x_train = train_df.drop(columns=['total count']).values
    y_train = train_df['total count'].values
    x_test = test_df.drop(columns=['total count']).values
    y_test = test_df['total count'].values

    #clf = model.fit(x_train, y_train, ...)
    
    
    
    ...
    return (clf, (train_acc, test_acc), confusion_matrix)

## 4. Neural networks

Please complete the function below which takes `train_df` and `test_df` as inputs, and outputs the trained model, accuracy on test set. You can experiment with as many structures as you want, please only leave the one with the best performance so that the function only returns one model. In addition to the model and test accuracy, please also produce two graphs in the notebook cell:

- accuracy on train and validation data, `x-axis`: epochs, `y-axis`: accuracy
- loss on train and validation data, `x-axis`: epochs, `y-axis`: loss

Note validation data is not test set, it should be split from training set.

Please answer the following questions as comments or in another markdown cell:

1. How many epochs did you train? How did you decide when to stop training?
2. Please briefly explain the model structure (layers, sizes) you choose.


In [None]:
def nn_ml(train_df: pd.DataFrame, test: pd.DataFrame) ->  ('model',
                                                           'test_accuracy'):
    '''
    Use neural networks to predict total counts

    :param pd.DataFrame train_df: training set dataframe
    :param pd.DataFrame test_df: test set dataframe

    :returns 'model': trained model
    :returns 'test_accuracy': accuracy on test set
    '''
    x_train = train_df.drop(columns=['total count']).values
    y_train = train_df['total count'].values
    x_test = test_df.drop(columns=['total count']).values
    y_test = test_df['total count'].values

    mdl.fit(x_train, y_train, ...)
    ...

    return (mdl, test_acc)

## 5. Discussions

Please write your observations, comments regarding this dataset and problem, you can also tell us about what challenges you faced or what you have learnt during your experiments.

## 6. Optional

If you are to design a forecasting model using this dataset to predict the counts of bicycles (still three categories, not actual numbers) in the future, what changes will you make? Please state all factors that you consider relevant, there is no need to write code for this question.

# Before you submit

Please use this checklist to make sure you submit all required files:

- [ ] `part1_answers.txt`
- [ ] `part2_answers.ipynb` (with code implementations and discussions)
- [ ] `processed_data.csv`
- [ ] `train.csv`
- [ ] `test.csv`
- [ ] (optional) `requirements.txt` if you used any additional package that was not listed before