<center><img src="images/logo.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# ML through Application
## Module 1, Lab 4: Refining Models by Using AutoGluon

By the end of this lab, you should be able to create a model by using [AutoGluon](https://auto.gluon.ai/stable/index.html#).

You will learn how to do the following: 

- Identify the best model that AutoGluon outputs.
- Use your model to create predictions.

---

You will explore a dataset that contains information about books. The goal is to predict book prices by using features about the books.

__Business problem:__ Books from a large database with several features cannot be listed for sale because one critical piece of information is missing: the price. 

__ML problem description:__ Predict book prices by using book features, such as genre, release date, ratings, and number of reviews.

This is a regression task (the training dataset has a book price column to use for labels).

----

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| <img style="float: center;" src="images/activity.png" alt="Activity" width="125"/>| <img style="float: center;" src="images/challenge.png" alt="Challenge" width="125"/>|
| --- | --- |
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you can practice your coding skills.</p> |

## Index

- [Importing AutoGluon](#Importing-AutoGluon)
- [Getting the data](#Getting-the-data)
- [Model training with AutoGluon](#Model-training-with-AutoGluon)
- [AutoGluon training results](#AutoGluon-training-results)
- [Model prediction with AutoGluon](#Model-prediction-with-AutoGluon)

---
## Importing AutoGluon

Install and load the libraries that are needed to work with the tabular dataset.

In [None]:
%%capture
# Install libraries
!pip install -U -q -r requirements.txt

In [None]:
# Import libraries and utility functions
%load_ext autoreload
import pandas as pd
# Import the newly installed AutoGluon code library
from autogluon.tabular import TabularPredictor, TabularDataset

## Getting the data

Next, load the dataset into a Pandas DataFrame and preview the first rows of data.

__Note:__ You will use the [Amazon Product Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews) dataset. For more information about this dataset, see the following resources:

- Ruining He and Julian McAuley. "Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering." Proceedings of the 25th International Conference on World Wide Web, Geneva, Switzerland, April 2016. https://doi.org/10.1145/2872427.2883037.

- Julian McAuley, Christopher Targett, Qinfeng Shi, Anton van den Hengel. "Image-Based Recommendations on Styles and Substitutes." Proceedings of the 38th International Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) Conference on Research and Development in Information Retrieval, Santiago, Chile, August 2015. https://doi.org/10.1145/2766462.2767755.

In [None]:
df_train = TabularDataset(data="data/train.csv")
df_test = TabularDataset(data="data/test.csv")

In [None]:
df_train.head()

## Model training with AutoGluon

Next, create a subset of the training data and use it to train a model by using AutoGluon.

Remember that you only need to provide the dataset and tell AutoGluon which column from the dataset you are trying to predict.

In [None]:
# Sample 1,000 rows from the training data
subsample_size = 1000  # Sample a subset of data for faster demo
df_train_smaller = df_train.sample(n=subsample_size, random_state=0)

# Print the first rows
df_train_smaller.head()

### Training a model with the small sample

AutoGluon uses certain defaults. For example, AutoGluon uses `root_mean_squared_error` as an evaluation metric for regression problems. For more information, see [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) in the sklearn documentation.

__Note:__ Training on this smaller dataset might take approximately 3–4 minutes.

In [None]:
# Run this cell

smaller_predictor = TabularPredictor(label="Price").fit(train_data=df_train_smaller)

Now the data is loaded, and a model has been trained.

## AutoGluon training results

Now you will look at the information that AutoGluon provides through its `leaderboard` function. The `leaderboard` function returns a summary of all the models that AutoGluon trained.

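**Note:** AutoGluon reports every metric so that a higher value is better. For error metrics such as root mean squared error (RMSE), it flips the sign, so you will see negative values in the leaderboard. The sign is used only for ranking: the model with the score closest to zero has the lowest error.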
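
The following minimal sketch illustrates this sign convention. The model names and RMSE values are made up for illustration and are not output from this lab. The point is only that negating an error metric lets a "higher is better" ranking pick the model with the smallest error.

```python
# Hypothetical RMSE values (lower is better); not real results from this lab.
rmse_by_model = {"model_a": 12.5, "model_b": 9.8, "model_c": 15.1}

# Flip the sign so that a larger score always means a better model.
score_by_model = {name: -rmse for name, rmse in rmse_by_model.items()}

# Rank from best to worst by using the "higher is better" convention.
ranking = sorted(score_by_model, key=score_by_model.get, reverse=True)
best = ranking[0]

print(ranking)
print(best, score_by_model[best])
```

The model with the smallest RMSE ends up first in the ranking, even though its score is negative.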

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center; margin:auto;"><img src="images/activity.png" alt="Activity" width="100" /> </p>
    <p style="text-align: center; margin: auto;">To look more closely at the output of the AutoGluon <code>leaderboard</code> function, run the following cell.</p>
    <br>
</div>

In [None]:
# Run this cell to see the model leaderboard
smaller_predictor.leaderboard(silent=True)

### Interpreting the RMSE value

The root mean squared error (RMSE) is expressed in the same units as the label, which makes it straightforward to interpret. Because you are predicting prices, the values in the __score\_val__ column of the leaderboard output indicate the typical size of the prediction error. For example, if score\_val = -0.24, the RMSE is 0.24, so book price predictions are typically off by about 24 cents (assuming that prices are in dollars).
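
RMSE is not quite a plain average of the errors, because squaring gives large errors more weight. The following sketch compares RMSE with the mean absolute error (MAE) on a few made-up prices. The numbers are hypothetical and are not taken from the lab dataset.

```python
import math

# Made-up actual and predicted prices, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 30.0, 50.0]

# Per-book prediction errors: [2.0, -2.0, 0.0, 10.0]
errors = [p - a for p, a in zip(predicted, actual)]

# Mean absolute error: the plain average of the error sizes.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: squaring emphasizes the single large miss.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Here one prediction is off by 10 while the others are close, so the RMSE is noticeably larger than the MAE. A large gap between the two can hint that a few predictions are far off.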

<div style="border: 4px solid coral; text-align: center; margin: auto;"> 
    <h3><i>Try it yourself!</i></h3>
    <p style="text-align:center; margin:auto;"><img src="images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Which model is the best?<br>
    Choose the model that you think is the best, and justify your choice with data in the following cell.</p>
    <br>
</div>


**Challenge answer**

Enter your answer here for the challenge.

<!-- SOLUTION -->
<div style="border: 4px solid coral; text-align: center;"> 
    <h3><i>There is no unique answer!</i></h3>
    <p style=" text-align: left; margin-left: 25px; margin-right: 25px; margin-top: 25px; margin-bottom: 25px;">The business problem defines how to balance the acceptable model performance (expressed by its score here), and the time involved in training and predicting (in general, associated with cost or latency problems). A solution should balance three dimensions: performance, training time, and prediction time.
<br/><br/>
The business problem defines the acceptable minimum performance and the cost in time to pay for that. Looking at the <code>leaderboard</code> function output, you need to balance the validation performance, which is expressed in the <b>score_val</b> column, against the prediction time, which is expressed in the <b>pred_time_val</b> column (if latency is a limitation, as in real-time prediction), or the training time, which is shown in the <b>fit_time</b> column (if the training time is a limitation). (Training time limitations are usually related to cost. For example, you might need to train a model for several days by using expensive GPU machines.)
<br/><br/>
As an example, in the output, you can see that the WeightedEnsemble_L2 model has similar performance to CatBoost. However, CatBoost has a significantly lower prediction time (<code>pred_time_val</code>). Therefore, if latency is a problem, CatBoost should be your choice.</p>
    <br/>
</div>
<!-- END SOLUTION -->
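
One way to make this tradeoff explicit is to filter the leaderboard by a latency budget and then take the best remaining score. The following sketch assumes rows with the `model`, `score_val`, `pred_time_val`, and `fit_time` columns described above. The `pick_model` helper, the `SomeOtherModel` name, and every number are hypothetical and are not part of AutoGluon or this lab's output.

```python
# Hypothetical leaderboard rows; every number is invented for illustration.
rows = [
    {"model": "WeightedEnsemble_L2", "score_val": -10.1, "pred_time_val": 0.90, "fit_time": 120.0},
    {"model": "CatBoost", "score_val": -10.3, "pred_time_val": 0.02, "fit_time": 45.0},
    {"model": "SomeOtherModel", "score_val": -14.8, "pred_time_val": 0.10, "fit_time": 1.0},
]


def pick_model(rows, max_pred_time):
    """Return the best-scoring row whose prediction time fits the budget."""
    candidates = [r for r in rows if r["pred_time_val"] <= max_pred_time]
    if not candidates:
        return None  # No model is fast enough for this budget.
    # Scores follow the "higher is better" convention, so take the maximum.
    return max(candidates, key=lambda r: r["score_val"])


no_limit = pick_model(rows, max_pred_time=float("inf"))
low_latency = pick_model(rows, max_pred_time=0.05)

print(no_limit["model"])
print(low_latency["model"])
```

With a real leaderboard, rows of this shape could likely be obtained with `smaller_predictor.leaderboard(silent=True).to_dict("records")`, assuming that the returned DataFrame contains these columns.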

## Model prediction with AutoGluon

Now that your model is trained, you can use it to predict prices.

You should always run a final model performance assessment by using data that the model didn't see (the test data). Because test data is not used during training, it gives a less biased estimate of how the model will perform on new data. You will use the test data to make predictions in the next step.
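
In this lab, the training and test data already come in separate files. If you ever need to create the split yourself, the idea can be sketched as follows. The `holdout_split` helper is hypothetical and uses only the Python standard library; it returns row positions instead of working on a DataFrame.

```python
import random


def holdout_split(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row positions and reserve a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # Seeded for reproducibility.
    n_test = int(n_rows * test_fraction)
    # The first n_test shuffled positions are held out; the rest are for training.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(n_rows=10)

print(len(train_idx), len(test_idx))
print(set(train_idx).isdisjoint(test_idx))
```

The key property is that no row appears in both sets, so the test rows stay unseen during training.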

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">To show the first rows of the test dataset, which you will use to predict prices, run the following cell.
        </p>
    <br>
</div>

In [None]:
# Run this cell

df_test.head()

<div style="border: 4px solid coral; text-align: center; margin: auto;"> 
    <h3><i>Try it yourself!</i></h3>
    <p style="text-align:center; margin:auto;"><img src="images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Use this test dataset as input to the model that you just trained. Use the model to predict book prices. Use the following cell to run the appropriate code.<br><br>
    <b>Tip:</b> For information about the <code>predict</code> function, see <a href="https://auto.gluon.ai/0.6.2/api/autogluon.predictor.html">AutoGluon Predictors</a> in the AutoGluon documentation.</p>
    <br>
</div>


In [None]:
############### CODE HERE ###############

price_prediction = smaller_predictor.predict(df_test)
print(f"Price predictions (first 20 predictions): {price_prediction[0:20]}")

############## END OF CODE ##############

----
## Conclusion

You have now created a model by using AutoGluon, seen how to identify the best model version, and made predictions by using the model.

## Next lab
In the next lab, you will explore some of the advanced features of AutoGluon to refine your model.