<center><img src="images/logo.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# ML through Application
## Module 1, Lab 3: Getting Started with AutoGluon

This notebook covers how to create a model to solve an ML problem by using [AutoGluon](https://auto.gluon.ai/stable/index.html#).

You will learn how to do the following:

- Import the AutoGluon library.
- Import data to a Pandas DataFrame.
- Train a model by using AutoGluon.

---

You will explore a dataset that contains information about books. The goal is to predict book prices by using features about the books.

__Business problem:__ Books from a large database with several features cannot be listed for sale because one critical piece of information is missing: the price.

__ML problem description:__ Predict book prices by using book features, such as genre, release date, ratings, and number of reviews.

This is a regression task (the training dataset has a book price column to use for labels).

----

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| <img style="float: center;" src="images/activity.png" alt="Activity" width="125"/>| <img style="float: center;" src="images/challenge.png" alt="Challenge" width="125"/>|
| --- | --- |
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you can practice your coding skills.</p> |

## Index
- [Importing AutoGluon](#Importing-AutoGluon)
- [Getting the data](#Getting-the-data)
- [Model training with AutoGluon](#Model-training-with-AutoGluon)

---
## Importing AutoGluon

Install and load the libraries that are needed to work with the tabular dataset.

In [4]:
%%capture
# Use pip to install libraries
!pip install autogluon
!pip install tabulate

In [5]:
# Import the libraries that are needed for the notebook
%load_ext autoreload
import pandas as pd
# Import utility functions and challenge questions
#from MLUMLA_EN_M1_Lab3_quiz_questions import *

# Import the newly installed AutoGluon code library
from autogluon.tabular import TabularPredictor, TabularDataset

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


---
## Getting the data

Now get the data for the business problem.

__Note:__ You will use the [Amazon Product Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews) dataset. For more information about this dataset, see the following resources:

- Ruining He and Julian McAuley. "Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering." Proceedings of the 25th International Conference on World Wide Web, Geneva, Switzerland, April 2016. https://doi.org/10.1145/2872427.2883037.

- Julian McAuley, Christopher Targett, Qinfeng Shi, Anton van den Hengel. "Image-Based Recommendations on Styles and Substitutes." Proceedings of the 38th International Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) Conference on Research and Development in Information Retrieval, Santiago, Chile, August 2015. https://doi.org/10.1145/2766462.2767755.

To load the training and test data, and then show the first few rows of the training dataset, run the following cells.

In [6]:
df_train = TabularDataset(data="data/train.csv")
df_test = TabularDataset(data="data/test.csv")

In [7]:
df_train.head()

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,Price,asin,details,descriptionstring
0,[],"Books"" />",[],Joan M. Lexau,"1,683,587 in Books (",['0590457292'],Books,5.48,B001D4OHQA,"{'Publisher:': 'Scholastic (1974)', 'Language:...","Staining on cover, minimal wear and creasing. ..."
1,"['Books', 'Education & Teaching', 'Schools & T...",The Core Knowledge Sequence Content and Skill ...,"['0325008957', '1138188492', '1890517208', '14...",Core Knowledge Foundation,"974,014 in Books (","['0385316402', '1890517208', '1933486058', '19...",Books,21.4,B0071QRBFS,"{'Paperback:': '400 pages', 'Publisher:': 'Cor...",A double volume with two &quot;front covers.&q...
2,[],Stranger In The Woods,[],Leah Fried,"17,588,750 in Books (",[],Books,17.0,965906523X,"{'Hardcover:': '202 pages', 'Publisher:': 'Fel...",Stranger in the woods is a dramatic tale of co...
3,[],"Hansel and Gretel : A Fairy Opera, Vocal Score",[],"Adelheid ; Bache, Constance ; Humperdinck, E. ...","3,680,123 in Books (",['0793506603'],Books,10.95,B0011ZV86I,"{'Publisher:': 'G. Schirmer, Inc. (1957)', 'AS...","Complete vocal score, words and music."
4,"['Books', 'History', 'Asia']",Genghis Khan - Conqueror Of The World,[],Leo De Hartog,"5,083,249 in Books (",[],Books,3.5,B001LIQC7A,"{'Hardcover:': '230 pages', 'Publisher:': 'Bar...",a great biography of Ghengis Khan
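Before training, it often helps to check the shape, column types, and label values of a DataFrame. The following is a minimal sketch that uses a small, hand-built DataFrame (three rows copied from the output above) instead of the full `data/train.csv` file. The variable names are illustrative assumptions and are not part of the lab.

```python
import pandas as pd

# Three rows copied from the output above; the real data comes from data/train.csv
sample = pd.DataFrame(
    {
        "title": [
            "Stranger In The Woods",
            "Hansel and Gretel : A Fairy Opera, Vocal Score",
            "Genghis Khan - Conqueror Of The World",
        ],
        "main_cat": ["Books", "Books", "Books"],
        "Price": [17.0, 10.95, 3.5],
    }
)

# Number of rows and columns
n_rows, n_cols = sample.shape
print(f"rows={n_rows}, columns={n_cols}")

# Column data types (text columns are stored as generic objects or strings)
print(sample.dtypes)

# Count of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Summary statistics for the label column
price_summary = sample["Price"].describe()
print(price_summary)
```

The same calls should work on `df_train`, because a `TabularDataset` behaves like a pandas DataFrame.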


<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>It's time to check your knowledge!</i></h3>
    <br>
    <p style=" text-align: center; margin: auto;">To load the question, run the following cell.</p>
    <br>
</div>

In [8]:
# Run this cell for a knowledge check question


---
## Model training with AutoGluon

You can use AutoGluon to train a model by using a single line of code. You need to provide the dataset and tell AutoGluon which column from the dataset you are trying to predict.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">To prepare the datasets, run the following cell.<br/>
        This step is not required for AutoGluon to work, but it will reduce the time to train your first model.<br/>
The code randomly selects 1,000 rows from the training dataset. AutoGluon later splits these rows into training and validation datasets during training.</p>
    <br>
</div>

In [9]:
# Sample 1,000 rows from the training data
# Try setting the subsample_size to a much larger value to see what happens during training
subsample_size = 1000  # Sample a subset of the data for faster demo
df_train_smaller = df_train.sample(n=subsample_size, random_state=0)

# Print the first few rows
df_train_smaller.head()

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,Price,asin,details,descriptionstring
398,[],Every Last One (Audiobook CD),"['1491546336', '1600244041', '1524754668', '14...",Visit Amazon's -Anna Quindlen- Page,"6,392,575 in Books (","['0812985907', '0525509879', '0812976185', '08...",Books,23.84,B003SFS8F8,{'Publisher:': 'Unabridged edition; Unabridged...,The latest novel from Pulitzer Prize-winner An...
3833,[],"Books"" />","['0441810764', '0312863551', '0441094996', '04...",Robert A Heinlein,"4,893,400 in Books (","['0441810764', '0312863551', '0671577808', '04...",Books,6.74,B001R2GZA4,"{'Publisher:': 'SIGNET BOOKS (1900)', 'ASIN:':...",Classic science fiction novel.
4836,"['Books', 'Reference']",Review Notes and Study Guide to Conrad's Vict...,[],Ken Sobol,"2,286,014 in Books (",[],Books,8.07,B000QCDE5A,"{'Paperback:': '142 pages', 'Publisher:': 'Mon...",A CRITICAL GUIDE BY MONARCH NOTES.
4572,[],Simon's Cat va al veterinario,[],Simon Tofield,"7,769,270 in Books (",[],Books,15.18,8416261865,"{'Publisher:': 'Duomo Ediciones (October 1, 20...",Brand New. Ship worldwide
636,"['Books', 'Arts &amp; Photography', 'Decorativ...",Taisho Kimono: Speaking of Past and Present,['4756246354'],Visit Amazon's Jan Dees Page,"2,053,979 in Books (",[],Books,51.75,8857200116,"{'Hardcover:': '292 pages', 'Publisher:': 'Ski...","A unique collection of 130 kimonos for women, ..."
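The `random_state` argument makes the random sample repeatable. The following sketch illustrates this on a small, hypothetical DataFrame that stands in for `df_train`. The column names and sizes are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical stand-in for df_train: 100 rows with a made-up Price column
df = pd.DataFrame({"row_id": range(100), "Price": [float(i) for i in range(100)]})

# Two draws with the same seed select exactly the same rows
first_draw = df.sample(n=10, random_state=0)
second_draw = df.sample(n=10, random_state=0)
print(first_draw.index.equals(second_draw.index))

# Sampling is done without replacement by default, so no row repeats
print(first_draw.index.is_unique)

# Sampled rows keep their original index labels
print(first_draw.index.tolist())
```

Because sampled rows keep their original index labels, the output above shows labels such as 398 and 3833 instead of 0, 1, 2, and so on.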


<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>It's time to check your knowledge!</i></h3>
    <br>
    <p style=" text-align: center; margin: auto;">To load the question, run the following cell.</p>
    <br>
</div>

In [10]:
# Run this cell for a knowledge check question


### Training a model with a small sample

AutoGluon uses certain defaults. For example, AutoGluon uses `root_mean_squared_error` as the default evaluation metric for regression problems. The training cell in this section overrides that default by passing `eval_metric='mean_squared_error'`. For more information about these metrics, see [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) in the scikit-learn documentation.
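To make the difference between the two metrics concrete, the following sketch computes mean squared error (MSE) and root mean squared error (RMSE) by hand. The prices and predictions are made-up values for illustration only, and the helper functions are written here for the example. They are not AutoGluon or scikit-learn functions.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Return the average of the squared differences between two sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("Inputs must be non-empty and have the same length.")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Return the square root of the MSE, in the same units as the label."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical book prices (in dollars) and model predictions
actual_prices = [10.0, 20.0, 30.0, 40.0]
predicted_prices = [12.0, 18.0, 33.0, 36.0]

mse = mean_squared_error(actual_prices, predicted_prices)
rmse = root_mean_squared_error(actual_prices, predicted_prices)

print(f"MSE:  {mse:.2f}")   # squared dollars
print(f"RMSE: {rmse:.2f}")  # dollars
```

RMSE is expressed in the same units as the label, which makes it easier to interpret than MSE. AutoGluon reports error metrics such as these with the sign flipped so that a higher score is always better.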

__Note:__ Training on this smaller dataset will take approximately 3–4 minutes.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Use <code>TabularPredictor</code> to train the first version of the model on the smaller 1,000-row training dataset so that the model trains faster.<br>
</p>
    <br>
</div>

In [11]:
# Run this cell

smaller_predictor = TabularPredictor(label="Price", eval_metric="mean_squared_error").fit(train_data=df_train_smaller)

No path specified. Models will be saved in: "AutogluonModels/ag-20250920_170815"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.10.16
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 25.0.0: Mon Aug 25 21:16:39 PDT 2025; root:xnu-12377.1.9~3/RELEASE_ARM64_T6031
CPU Count:          16
Memory Avail:       25.69 GB / 48.00 GB (53.5%)
Disk Space Avail:   136.45 GB / 926.35 GB (14.7%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='extreme' : New in v1.4: Massively better than 'best' on datasets <30000 samples by using new models meta-learned on https://tabarena.ai: TabPFNv2, TabICL, Mitra, and TabM. Absolute best accuracy. Requires a GPU. Recommended 64 GB CPU memory and 32+ GB GPU mem
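The training call above lets AutoGluon infer the problem type. As a hedged sketch, the following function shows a more explicit variant that sets the problem type, the evaluation metric, and a training time limit. The function name and the 120-second default are assumptions made for this example. AutoGluon is imported inside the function so that the sketch can be defined even in an environment where the library is not installed.

```python
def train_price_model(train_data, time_limit=120):
    """Train a regression model for the Price column with explicit settings.

    train_data: a DataFrame (or TabularDataset) that contains a Price column.
    time_limit: approximate cap on total training time, in seconds (assumed value).
    """
    # Imported here so that defining this function does not require AutoGluon
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label="Price",
        problem_type="regression",              # skip problem-type inference
        eval_metric="root_mean_squared_error",  # AutoGluon's default for regression
    )
    # fit() trains the models and returns the fitted predictor
    return predictor.fit(train_data=train_data, time_limit=time_limit)
```

For example, `train_price_model(df_train_smaller)` would train on the 1,000-row sample. Calling `leaderboard()` on the returned predictor lists the trained models and their validation scores.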

### Interpreting the training output
AutoGluon outputs a lot of information about what happens during model training.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <p style="text-align:center; margin:auto;"><img src="images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">After the training finishes, examine the output and answer the following questions based on the output.</p>
    <br>
</div>


1. What is the shape of the training dataset?
2. What type of ML problem (such as classification or regression) does AutoGluon infer? (**Hint:** Remember, you didn't mention the problem type. You only provided the label column.)
3. What does AutoGluon suggest in case it inferred the wrong problem type?
4. What kind of data preprocessing and feature engineering did AutoGluon perform?
5. What are the basic statistics about the label in the print statements from AutoGluon?
6. How many extra features were generated in addition to the originals in the dataset? What was the runtime for that?
7. Which evaluation metric was used?
8. What does AutoGluon suggest in case it inferred the wrong metric?
9. What is the ratio between the training and validation dataset? (**Hint:** Look for `val` or `validation`.)
10. Where did AutoGluon save the predictor?
11. Which folder were the models saved in?
12. What file format are the models in? (**Note:** Look at the file name suffix. You don't need to open the file.)

Try to answer these questions before you check the solution.

### List your answers here: 

1. The training dataset has 1,000 rows and 10 columns.

2. AutoGluon inferred that this is a regression problem.

3. AutoGluon suggests the following: "If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])."

4. AutoGluon applied the AsType, FillNa, and Category generators with memory minimization. It then created text special and binned features, removed duplicates, and built text n-gram features with a CountVectorizer of size 336. The `asin` column was not used.

5. The basic statistics of the label are: max 2326.87, min 0.0, mean 39.77738, std 123.6481.

6. AutoGluon generated 432 extra features: the feature count grew from 9 to 441. The runtime was 2.12 s.

7. The `mean_squared_error` evaluation metric was used.

8. AutoGluon suggests specifying `eval_metric` in `Predictor()` to change the metric. It also notes that the reported score can be multiplied by -1 to get the raw MSE value.

9. The split was 800 training rows and 200 validation rows, a 4:1 ratio (holdout fraction of 0.2).

10. The predictor was saved in my subfolder: `/Module 4/A04 colab/AutogluonModels/ag-20250920_170815`

11. The models were saved in the subfolder: `AutogluonModels/ag-20250920_170815`

12. The models are saved in the `.pkl` file format.
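The 800/200 split noted in answer 9 corresponds to a holdout fraction of 0.2. The following sketch reproduces that arithmetic with pandas on a hypothetical 1,000-row DataFrame. It only illustrates the idea of a holdout split and is not AutoGluon's internal splitting code.

```python
import pandas as pd

# Hypothetical stand-in for the 1,000-row training sample
df = pd.DataFrame({"Price": [float(i) for i in range(1000)]})

holdout_frac = 0.2  # fraction of rows reserved for validation

# Randomly pick the validation rows, then keep the remaining rows for training
val_df = df.sample(frac=holdout_frac, random_state=0)
train_df = df.drop(index=val_df.index)

print(f"training rows:   {len(train_df)}")
print(f"validation rows: {len(val_df)}")
```

The validation rows are held out from training so that each candidate model can be scored on data it has not seen.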

<!-- SOLUTION -->
### Solution

In the following images, the arrows indicate where in the output you can find the answers to the questions. The numbers on the arrows correspond to the numbers of the questions in the previous cell.

<p style="padding: 10px; border: 1px solid black;">
<img src="./images/lab3_01.png" width="900" height="auto">
</p>
<p style="padding: 10px; border: 1px solid black;">
<img src="./images/lab3_02.png" width="900" height="auto">
</p>
<p style="padding: 10px; border: 1px solid black;">
<img src="./images/lab3_03.png" width="900" height="auto">
</p>
<p style="padding: 10px; border: 1px solid black;">
<img src="./images/lab3_04.png" width="900" height="auto">
</p>

<!-- END SOLUTION -->

----
## Conclusion

The purpose of this notebook was to explore a dataset of information about books and to use AutoGluon to build a basic model to predict book prices based on book features.

## Next lab
In the next lab, you will learn how to use AutoGluon features to refine your model.