<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/notebooks/Real_World_ML_Ads_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Real World ML Ads Classification

This is a long exercise with a complex dataset. It is intended to approximate a real world case where data is not clean and you need to compare several approaches and make decisions.

## Exercise 1: Get the data

Original Dataset from: https://www.kaggle.com/overflow012/playing-with-ads

Mirrored for convenience at https://archive.org/download/playing-with-ads/playing-with-ads.zip


`catid`: the category of the ad. Possible values:
- 2 = Jobs
- 3 = Real Estate

`subcatid`: the subcategory of the ad. Possible values:
- 2 = Apartment House for sale
- 11 = Lawyers
- 12 = Administrative - Secretary
- 14 = Call center
- 15 = Building
- 16 = Accounting finance
- 17 = Education - Teachers
- 19 = Customer Support
- 20 = Bar and Restaurant
- 21 = Biotechnology
- 22 = Retail
- 23 = Technical support
- 24 = Work from home
- 26 = Transport
- 27 = Medicine - Health
- 28 = Fashion
- 29 = Advertising - Marketing
- 30 = Human Resources
- 31 = Public relations
- 32 = Sellers
- 33 = Engineers - Architects
- 34 = Software
- 35 = Wholesales
- 51 = Apartment - House for rent
- 122 = Other offers
- 132 = Travels and tourism
- 134 = Administration - Executives


Use your knowledge of shell commands to download and unzip the dataset. (Hint: to pass a command to the shell use `!`)
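If you prefer to stay in Python rather than use shell commands, a minimal sketch with the standard library could look like the following. The helper name `download_and_unzip` and the `data` target directory are illustrative choices, not part of the original exercise.

```python
import urllib.request
import zipfile
from pathlib import Path

URL = "https://archive.org/download/playing-with-ads/playing-with-ads.zip"


def download_and_unzip(url, dest_dir="data"):
    """Download a zip archive into dest_dir, extract it, return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last component of the URL
    zip_path = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(zip_path))
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()


# In a notebook cell (requires network access):
# files = download_and_unzip(URL)
# print(files)
```

The returned list of file names tells you what to load in the next exercise, without assuming the name of the file inside the archive.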

In [None]:
#@title Update and load libraries
!pip install -U -q fasttext

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline

## Exercise 2: Load the dataset

- Load the dataset into a Pandas DataFrame
- Explore it using the `.head()` and `.info()` methods

## Exercise 3: Explore the labels

This dataset contains 2 possible labels:
- `catid`
- `subcatid`

Count the number of samples for each of the labels.
- What do you notice?
- Are the classes balanced?
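One possible way to count labels is `value_counts`. The sketch below uses a small made-up DataFrame as a stand-in for the real data, so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Toy stand-in for the ads DataFrame (the real data has many more rows)
df = pd.DataFrame({
    "catid": [2, 2, 2, 3, 2, 2],
    "subcatid": [27, 27, 34, 51, 27, 34],
})

cat_counts = df["catid"].value_counts()
subcat_counts = df["subcatid"].value_counts()
# Relative frequencies make class imbalance easier to see
subcat_share = df["subcatid"].value_counts(normalize=True)

print(cat_counts)
print(subcat_counts)
print(subcat_share.round(2))
```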


## Exercise 4: Ad length exploration

- Create a new variable that measures the length of an Ad in number of characters.
- Display the distribution of lengths with a histogram. Do you notice anything?
- Can we use this `ad_length` as a feature for classification?
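A minimal sketch of the length feature, assuming the ad text lives in the `value` column (the column name used later in this notebook). The toy ads and the bin edges are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy ads; the last one mimics a very long outlier
df = pd.DataFrame({"value": ["short ad", "a somewhat longer ad text", "x" * 100]})
df["ad_length"] = df["value"].str.len()

# Bin counts are what a histogram displays; these edges are hand-picked
counts, edges = np.histogram(df["ad_length"], bins=[0, 10, 50, 200])

print(df["ad_length"].describe())
print(counts)
```

In the notebook, `df["ad_length"].hist(bins=50)` would draw the histogram directly; a logarithmic axis may help if a few ads are much longer than the rest.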

## Exercise 5: Naive Machine Learning

- Train the model on a sample of 2000 ads
- Build a simple pipeline that takes the ad and predicts the `subcatid`:
  - Use a `TfidfVectorizer` for encoding the ads. You can use the `char` analyzer with `ngram_range = (1, 3)` and `max_features=2000`.
  - Use a `LogisticRegression` model to classify them
  - Assess the score using `model.score` and `confusion_matrix`
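The pipeline described above could be sketched as follows. The toy texts and labels are invented, and `max_iter=1000` is an added assumption to help the solver converge; the score shown is a training score, so it is optimistic.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline

# Toy ads and subcategory ids; in the exercise these come from a 2000-ad sample
texts = ["nurse needed at clinic", "doctor for health center",
         "python developer wanted", "java software engineer",
         "hospital seeks nurse", "backend developer, python"]
labels = [27, 27, 34, 34, 27, 34]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=2000),
    LogisticRegression(max_iter=1000),  # max_iter is an assumption, not from the exercise
)
model.fit(texts, labels)

acc = model.score(texts, labels)  # accuracy on the training data itself
cm = confusion_matrix(labels, model.predict(texts))  # rows: true, columns: predicted
print(acc)
print(cm)
```

For a fair assessment you would hold out part of the sample and compute the score and confusion matrix on the held-out ads.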

## Exercise 6: Ideas

Now that you've created your first model, make a list of ideas that you could try in order to improve it. These ideas could involve:
- data manipulation
- feature engineering
- model selection
- tooling and infrastructure

Generate at least 10 ideas.

## Exercise 7: Assessing ideas

Bucket your ideas into 3 groups:
- EASY. These should be straightforward to code if you know the API and their execution should not take more than a few minutes.
- MEDIUM. These could take a little longer to code and may take a bit more to execute. The whole experiment should be achievable within a few hours.
- HARD. These are good ideas that are time consuming, either because the implementation is not straightforward, because decisions are involved (e.g. how to impute missing data or how to better deal with outliers), or because their evaluation will take a long time.

- Make a plan of your next steps that involves doing all the easy ideas and possibly some of the medium ideas

## Exercise 8: First Idea

The first idea is to notice that there are some subcategories that have very few samples, and there may be duplicate ads. Let's get rid of both.

- Create a new dataset called `dfclean` with the following properties:
  - drop all rows of subcategories with fewer than 50 ads
  - drop all duplicate ads (same `catid`, `value`, `subcatid`)
  - double check the number of ads per subcategory id

  You should get the following counts:

  ```
subcatid count
27     15699
34      8488
33      8025
122     4796
32      4553
19      3192
29      3085
132     2316
2       1997
16      1936
15      1724
31      1505
17      1173
20      1156
30      1044
134     1023
12       939
11       875
26       748
28       601
51       572
21       292
23       142
14        98
35        97
```

You can go ahead and implement your first and easiest idea or follow along and implement this one.
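A possible sketch of the cleaning step is shown below. The helper name `clean_ads` is hypothetical, the steps follow the order of the bullets above, and the toy data uses a threshold of 2 instead of 50. Whether this exact order reproduces the counts listed above has not been verified here.

```python
import pandas as pd


def clean_ads(df, min_ads=50):
    """Drop subcategories with fewer than min_ads rows, then exact duplicate ads."""
    # Number of rows in each row's subcategory
    counts = df["subcatid"].map(df["subcatid"].value_counts())
    kept = df[counts >= min_ads]
    return kept.drop_duplicates(subset=["catid", "value", "subcatid"])


# Toy data with a small threshold so the effect is visible
df = pd.DataFrame({
    "catid": [2, 2, 2, 2, 3],
    "subcatid": [27, 27, 27, 34, 51],
    "value": ["nurse", "nurse", "doctor", "developer", "flat"],
})
dfclean = clean_ads(df, min_ads=2)
print(dfclean["subcatid"].value_counts())
```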

## Exercise 9: Second idea

Rebalancing data.

Let's create a training set with balanced subcategories.

- Split `dfclean` into train and test with a test size of 10000 and random state 0
- Create a new dataset `dfsampled` from `dftrain` that contains 50 samples from each subcategory id. You can use the `.sample` method with `replace=False` for this.
- Check the length of `dfsampled`, it should contain 1250 rows.
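The split and the per-class sampling could look roughly like this. The toy `dfclean` below has only 30 rows, so the sketch uses a test size of 6 and 2 samples per class in place of 10000 and 50.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for dfclean: 3 subcategories with 10 ads each
dfclean = pd.DataFrame({
    "subcatid": [27] * 10 + [34] * 10 + [51] * 10,
    "value": [f"ad {i}" for i in range(30)],
})

# The exercise uses test_size=10000; the toy data only allows a small test set
dftrain, dftest = train_test_split(dfclean, test_size=6, random_state=0)

n_per_class = 2  # 50 in the exercise
dfsampled = dftrain.groupby("subcatid").sample(
    n=n_per_class, replace=False, random_state=0)

print(len(dfsampled))
print(dfsampled["subcatid"].value_counts())
```

With `replace=False`, sampling fails if a subcategory has fewer rows in `dftrain` than requested, which is one reason to drop very small subcategories first.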

You can go ahead and implement your next idea or follow along and implement this one.

## Exercise 10: Next idea

Machine Learning on balanced data

- Train the pipeline model you defined earlier on the rebalanced data.
- Assess the performance on the test set using the accuracy score and the confusion matrix.
- Did rebalancing the data help?

You can go ahead and implement your next idea or follow along and implement this one.

It helped a little but performance is still pretty bad.

## Exercise 11: Next idea

Tools

Build some tools to make experimentation faster. Define 3 helper functions:
- 
```python
def calculate_scores(y_true, y_pred):
    """Returns the accuracy and F1 score"""
    ...
    return acc, f1
```
- 
```python
def calculate_scores_train_val(y_train, y_pred_train, y_val, y_pred_val):
    """Returns accuracy and F1 score for both training and validation sets"""
    ...
    return at, av, ft, fv
```
- 
```python
def train_val_model(model, model_name, X_train, y_train, X_val, y_val):
    """Trains and evaluates a model, return the results in a DataFrame"""
    ...
    return pd.DataFrame(results,
                        columns=[model_name],
                        index=['model',
                               'accuracy_score_train',
                               'accuracy_score_val',
                               'f1_score_train',
                               'f1_score_val',
                               'train_time', 'pred_time'])
  ```

You can go ahead and implement your next idea or follow along and implement this one. 
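One way to fill in the three helpers is sketched below. Several details are assumptions: the F1 score is macro-averaged, the `model` row stores `repr(model)` rather than the fitted object, and prediction time covers both the training and validation predictions.

```python
import time

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score


def calculate_scores(y_true, y_pred):
    """Returns the accuracy and the macro-averaged F1 score (averaging is an assumption)."""
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    return acc, f1


def calculate_scores_train_val(y_train, y_pred_train, y_val, y_pred_val):
    """Returns accuracy and F1 score for both training and validation sets"""
    at, ft = calculate_scores(y_train, y_pred_train)
    av, fv = calculate_scores(y_val, y_pred_val)
    return at, av, ft, fv


def train_val_model(model, model_name, X_train, y_train, X_val, y_val):
    """Trains and evaluates a model, returns the results in a one-column DataFrame"""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    y_pred_train = model.predict(X_train)
    y_pred_val = model.predict(X_val)
    pred_time = time.perf_counter() - t0

    at, av, ft, fv = calculate_scores_train_val(
        y_train, y_pred_train, y_val, y_pred_val)
    # Store a text description of the model to keep the table easy to display
    results = [repr(model), at, av, ft, fv, train_time, pred_time]
    return pd.DataFrame(results,
                        columns=[model_name],
                        index=['model',
                               'accuracy_score_train',
                               'accuracy_score_val',
                               'f1_score_train',
                               'f1_score_val',
                               'train_time', 'pred_time'])


# Quick check on toy numeric features
res = train_val_model(DummyClassifier(strategy="most_frequent"), "dummy",
                      [[0], [1], [2], [3]], [0, 0, 0, 1],
                      [[4], [5]], [0, 1])
print(res)
```

Because each call returns a single column, results from several experiments can be accumulated with `pd.concat([...], axis=1)`.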

## Exercise 12: Next idea

Dummy Classifier

Validate your tools by evaluating a pipeline with `TfidfVectorizer` and `DummyClassifier`.
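As a sanity check, a dummy pipeline might be sketched like this. The `most_frequent` strategy is one choice among several that `DummyClassifier` offers, and the texts are toy data.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

texts_train = ["nurse needed", "doctor wanted", "clinic hiring", "python developer"]
y_train = [27, 27, 27, 34]
texts_val = ["java developer", "hospital nurse"]
y_val = [34, 27]

dummy = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    DummyClassifier(strategy="most_frequent"),  # ignores the features entirely
)
dummy.fit(texts_train, y_train)
y_pred = dummy.predict(texts_val)
dummy_acc = accuracy_score(y_val, y_pred)
print(y_pred, dummy_acc)
```

Any real model should beat this baseline; if your helper functions report a higher score for the dummy than for a trained model, the tooling or the data split deserves a second look.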

You can go ahead and implement your next idea or follow along and implement this one.

## Exercise 13: Next idea

Regularization

Let's assess the influence of regularization strength on the model performance.

- Define a variable `Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]` with a few values for regularization.
- Iterate over the regularization values and for each value assess the performance of a pipeline with:
  - 
  ```python
  TfidfVectorizer(analyzer='char', 
                  ngram_range=(1, 3),
                  max_features=1000)
  ```
  and
  - 
  ```python
  LogisticRegression(C=c, solver='liblinear')
  ```
- Accumulate the results into a `results` DataFrame
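The loop over regularization values could be sketched as below. It is self-contained and therefore computes only accuracy inline instead of calling the helper functions from Exercise 11; the toy data is binary and made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Toy binary data standing in for the resampled training set and the test set
texts_train = ["nurse needed at clinic", "doctor for health center",
               "hospital seeks nurse", "medical assistant wanted",
               "python developer wanted", "java software engineer",
               "backend developer python", "frontend engineer react"]
y_train = [27] * 4 + [34] * 4
texts_val = ["clinic doctor", "software developer"]
y_val = [27, 34]

Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
columns = {}
for c in Cs:
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=1000),
        LogisticRegression(C=c, solver="liblinear"),
    )
    model.fit(texts_train, y_train)
    # One column per experiment, one row per metric
    columns[f"logreg_C={c}"] = {
        "accuracy_score_train": accuracy_score(y_train, model.predict(texts_train)),
        "accuracy_score_val": accuracy_score(y_val, model.predict(texts_val)),
    }

results = pd.DataFrame(columns)
print(results.T)
```

In scikit-learn, smaller `C` means stronger regularization, so the left end of `Cs` is the most heavily regularized model.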

You can go ahead and implement your next idea or follow along and implement this one.

## Exercise 14: Next idea

Assess the results

Display the results using a bar plot. Which regularization gives the best performance? Use that value from now on.

You can go ahead and implement your next idea or follow along and implement this one. 

## Exercise 15: Next idea

Let's assess the influence of the `ngram_range` parameter.

- Vary the `ngram_range` and keep everything else fixed.
- Set the regularization strength to `C=10`.
- Append the results to the `results` DataFrame
- Display the results for comparison
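Before comparing scores, it can help to see how `ngram_range` changes the number of features the vectorizer produces. The sketch below is a complementary check rather than the exercise solution; the candidate ranges are hypothetical and the texts are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["nurse needed at clinic", "python developer wanted"]
ngram_ranges = [(1, 1), (1, 2), (1, 3), (2, 3)]  # hypothetical candidates

vocab_sizes = {}
for ngram_range in ngram_ranges:
    # No max_features here, so the full vocabulary is counted
    vec = TfidfVectorizer(analyzer="char", ngram_range=ngram_range)
    vec.fit(texts)
    vocab_sizes[ngram_range] = len(vec.vocabulary_)

print(vocab_sizes)
```

When `max_features` is set, only the most frequent n-grams are kept, so a wider range competes for the same number of feature slots.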

You can go ahead and implement your next idea or follow along and implement this one.

## Exercise 16: Next idea

Let's assess the influence of the `max_features` parameter of the `TfidfVectorizer`.

- Set the `ngram_range=(1, 3)`
- Set the regularization strength to `C=10`.
- Append the results to the `results` DataFrame
- Display the results for comparison

You can go ahead and implement your next idea or follow along and implement this one.

## Exercise 17: Next idea

Let's change the number of minimum samples in the rebalanced dataset. In order to do this:
- Create new resampled datasets with `n_samples_range = [50, 100, 200, 500, 1000]`. Use `replace=True` because sometimes more samples per class are required than available.
- Train a model with `ngram_range=(1, 3)`, `max_features=10000` and `C=10`
- Append the results to the `results` DataFrame
- Display the results for comparison
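Building the resampled datasets might look like the sketch below, here with a tiny toy `dftrain` and a shortened `n_samples_range`. It also prints the number of distinct ads, since sampling with replacement repeats rows once a class runs out of unique examples.

```python
import pandas as pd

# Toy training set: one subcategory with 5 ads, one with only 2
dftrain = pd.DataFrame({
    "subcatid": [27] * 5 + [34] * 2,
    "value": [f"ad {i}" for i in range(7)],
})

n_samples_range = [3, 10]  # [50, 100, 200, 500, 1000] in the exercise
resampled = {
    n: dftrain.groupby("subcatid").sample(n=n, replace=True, random_state=0)
    for n in n_samples_range
}

for n, dfs in resampled.items():
    # rows per dataset vs. number of distinct ads actually present
    print(n, len(dfs), dfs["value"].nunique())
```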

You can go ahead and implement your next idea or follow along and implement this one.

## Exercise 18: Next idea

Noticing that the performance of my model keeps increasing as the number of samples goes up, let's extend the `n_samples_range` a bit further:
- Repeat the above steps with `n_samples_range = [1000, 2000, 3000]`
- Append the results to the `results` DataFrame
- Display the results for comparison

You can go ahead and implement your next idea or follow along and implement this one.

### Exercise 18: BONUS

You may have noticed that the last calculations took a long time. Let's display the training and inference times for all the experiments so far.

## Exercise 19: Next idea

Iterate on different models and see which one has the best performance. Feel free to implement this one or your next best idea.

To iterate on models:
- Load all the necessary model classes from `sklearn`, `xgboost`, `lightgbm`. For example, try the following:
  - Multinomial Naive Bayes
  - TruncatedSVD + XGBoost
  - TruncatedSVD + Light GBM
  - Logistic regression with C=100
  - Stochastic Gradient Descent with different losses and different values for alpha
- Make a list of model instances and iterate:
  - Use `n_samples=1000`
  - Use `ngram_range=(1, 3)`
  - Use `max_features=20000`
  - Train the model on the resampled data
  - Append the results to the `results` DataFrame
  - Display the results for comparison
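A reduced sketch of the model loop is shown below, restricted to scikit-learn estimators so that it runs without `xgboost` or `lightgbm`. The specific losses, `alpha` values, `max_iter`, and toy data are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts_train = ["nurse needed at clinic", "doctor for health center",
               "hospital seeks nurse", "python developer wanted",
               "java software engineer", "backend developer python",
               "civil engineer for bridge", "architect for housing project",
               "structural engineer needed"]
y_train = [27, 27, 27, 34, 34, 34, 33, 33, 33]
texts_val = ["clinic doctor", "software developer", "bridge architect"]
y_val = [27, 34, 33]

# Candidate models; XGBoost and LightGBM would be added to this dict the same way
models = {
    "multinomial_nb": MultinomialNB(),
    "logreg_C100": LogisticRegression(C=100, max_iter=1000),
    "sgd_hinge": SGDClassifier(loss="hinge", alpha=1e-4, random_state=0),
    "sgd_modified_huber": SGDClassifier(loss="modified_huber", alpha=1e-3,
                                        random_state=0),
}

scores = {}
for name, clf in models.items():
    pipe = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=20000),
        clf,
    )
    pipe.fit(texts_train, y_train)
    scores[name] = accuracy_score(y_val, pipe.predict(texts_val))

print(scores)
```

Tree-based models usually need the sparse TF-IDF matrix reduced first, which is why the list above pairs them with `TruncatedSVD`.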

You can go ahead and implement your next idea or follow along and implement this one.

## Exercise 20: Next idea

Let's use the FastText embeddings, following the method outlined [here](https://github.com/facebookresearch/fastText/tree/master/python#text-classification-model).

In order to do this, we will need to:
- save the data to a text file containing a training sentence per line along with the labels 
- labels need to be words that are prefixed by the string `__label__`
- train the model using `fasttext.train_supervised`

In order to achieve the above, we will define helper functions:
- 
```python
def clean_text(s):
    """Returns a clean string of text by:
      - lowercasing the text
      - replacing newlines with spaces
      - replacing any character that is not alphanumeric with a space
      - condensing multiple spaces to a single space
    """
    return clean_s
```
- 
```python
def join_labels(df):
    return ('__label__'
            + df['subcatid'].astype(str) 
            + " , "
            + clean_text(df['value']))
```
- 
```python
def save_to_txt(dfin, fname):
    series = join_labels(dfin)
    with open(fname, 'w') as fout:
      fout.write('\n'.join(series.values))
```

Notice that the API of FastText is different from Scikit Learn's API, so you will have to write some glue code to compare the results.

Also, it may be worth comparing the performance of FastText when trained on the unbalanced training set vs. the rebalanced training set.
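One possible implementation of `clean_text`, together with a little glue code, is sketched below. It works on a single string (so it would be applied to a column with `.apply`), it also strips leading and trailing spaces, and it separates label and text with a plain space. The helpers `to_fasttext_line` and `label_to_id` are hypothetical names, and the training call is left as a comment because it needs the `fasttext` package.

```python
import os
import re
import tempfile


def clean_text(s):
    """Lowercase, replace newlines and non-alphanumerics with single spaces."""
    s = s.lower().replace("\n", " ")
    # Runs of non-alphanumeric characters (including spaces) become one space
    s = re.sub(r"[\W_]+", " ", s)
    return s.strip()


def to_fasttext_line(label, text):
    """One training line: the prefixed label followed by the cleaned text."""
    return f"__label__{label} {clean_text(text)}"


def label_to_id(label):
    """Glue code: turn a FastText label back into an integer subcategory id."""
    return int(label.replace("__label__", ""))


lines = [
    to_fasttext_line(27, "Nurse needed!\nFull-time, Madrid."),
    to_fasttext_line(34, "Python_developer (remote)"),
]

# Write one ad per line to a temporary training file
fname = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(fname, "w") as fout:
    fout.write("\n".join(lines))

print(lines)

# With the fasttext package installed:
# import fasttext
# model = fasttext.train_supervised(input=fname)
# labels, probs = model.predict(clean_text("Nurse wanted"))
# predicted_id = label_to_id(labels[0])
```

Converting predicted labels back to integers lets you reuse the same accuracy and F1 helpers as for the scikit-learn models.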

You can go ahead and implement your next idea or follow along and implement this one.

## Exercise 21: Next idea

Let's combine the FastText embeddings with LightGBM into a single model using Gensim.

- download the FastText model using the `gensim.downloader`
- encode all sentences into vectors using the following approach:
    - remove newline characters and introduce spaces around punctuation in every sentence
    - encode every word in the sentences with FastText
    - average the FastText embeddings in each sentence so that you obtain a single average embedding for each sentence.
- Train and evaluate a `LGBMClassifier` on the embedded sentences and compare the results
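The averaging step can be sketched with a toy dictionary of word vectors standing in for the pretrained model. The function names `tokenize` and `embed_sentence` are hypothetical, and returning a zero vector for sentences without any known word is an assumption. A Gensim `KeyedVectors` object supports the same `word in vectors` and `vectors[word]` operations used here.

```python
import re

import numpy as np

# Toy 3-dimensional vectors standing in for pretrained FastText embeddings
word_vectors = {
    "nurse": np.array([1.0, 0.0, 0.0]),
    "needed": np.array([0.0, 1.0, 0.0]),
    "!": np.array([0.0, 0.0, 1.0]),
}
dim = 3


def tokenize(sentence):
    """Remove newlines, put spaces around punctuation, lowercase and split."""
    s = sentence.replace("\n", " ")
    s = re.sub(r"([^\w\s])", r" \1 ", s)
    return s.lower().split()


def embed_sentence(sentence, vectors, dim):
    """Average the vectors of all known tokens; zeros if none is known."""
    vecs = [vectors[w] for w in tokenize(sentence) if w in vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


sentences = ["Nurse needed!", "???"]
X = np.vstack([embed_sentence(s, word_vectors, dim) for s in sentences])
print(X)

# With gensim and lightgbm installed, the real data would follow the same shape:
# import gensim.downloader as api
# vectors = api.load("fasttext-wiki-news-subwords-300")  # assumed model name
# X_train = np.vstack([embed_sentence(s, vectors, 300) for s in dftrain["value"]])
# LGBMClassifier().fit(X_train, dftrain["subcatid"])
```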

## Conclusion and Next steps

- Is your model good enough?
- Can you deploy it?
- What other things should you consider?
- What are your next steps?