<p style="padding: 10px; border: 1px solid black;">
<img src="./images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>

# MLU Day One Machine Learning - Hands On

This hands-on notebook will let you practice the concepts you have learned in this course so far.
In the notebook, you will explore a database of books (books of different genres, from thousands of authors).
The goal is to predict book prices using book features.

__Business Problem:__ Books from a large database (different genres, thousands of authors, etc.) cannot be listed for sale because they are missing one critical piece of information: the price. 

__ML Problem Description:__ Predict book prices using book features, such as genre, release date, ratings, and number of reviews.  
> This is a __regression__ task (we have a book price column in our train dataset that we can use as labels). <br>

----


To generate book price predictions, you will be presented with two kinds of exercises throughout the notebook: __TASKS__ and __CHALLENGES__. <br/>


| <img style="float: center;" src="./images/task_robot.png" alt="drawing" width="100"/>| <img style="float: center;" src="./images/challenge_robot.png" alt="drawing" width="130"/>|
|:---    |   ---  |
| No coding needed for these tasks. <br /> Try to understand what is happening and run the associated cells and code. | These are challenges where you can practice your coding skills. <br /> Once done, uncomment the challenge answer and check your solution. |

As we are not trying to measure your coding skills, you will find solutions throughout the notebook: 
All the challenges have answers that you can copy and paste into the challenge coding area: **No matter how experienced and skilled you are with coding, you will be able to submit a solution!**


----

The notebook consists of 2 parts; please work from top to bottom and don't skip sections, as this could lead to error messages due to missing code.

### <a href="#1">Part I - Leaderboard Submission</a>
In the first part of the notebook you are going to learn how [__AutoGluon__](https://auto.gluon.ai/stable/index.html#) can solve the book price prediction problem.<br/>

You will learn how to build a quick, simple baseline model and then iterate on it to improve it. To measure how well you are doing (and to see how the model improves), you have to submit your model's predictions to the [__Book Prices Prediction MLU Leaderboard__](https://mlu.corp.amazon.com/contests/redirect/7). The Leaderboard will assess your prediction performance against other participants. Your submission to the Leaderboard also __counts towards your course completion__. 

We ask you to make 2 submissions in Part I:<br/>
1. First a simple prediction trained with a smaller dataset (for a quick first submission).
2. Then another prediction trained with a full dataset, in order to submit an improved result.

Feel free to keep improving your model and make as many submissions as you like to Leaderboard. 

### <a href="#2">Part II - Advanced AutoGluon (OPTIONAL)</a>
In the second part of the notebook you will find some advanced features of AutoGluon. You're welcome to use the insights you can gain from Part II to make an optional 3rd submission. However, a quick word of warning - AutoGluon is very powerful in its base form so you might not see much additional model improvement on Leaderboard.

----
<br/>
<br/>

## <a name="1">Part I - Leaderboard Submission</a>
Let's solve the book price prediction problem using __AutoGluon__.

- Part I - 1. <a href="#p1-1">AutoGluon Installation</a>
- Part I - 2. <a href="#p1-2">Getting the Data</a>
- Part I - 3. <a href="#p1-3">Model Training with AutoGluon (small train dataset)</a>
- Part I - 4. <a href="#p1-4">AutoGluon Training Results</a>
- Part I - 5. <a href="#p1-5">Model Prediction with AutoGluon</a>
- Part I - 6. <a href="#p1-6">First MLU Leaderboard Submission (with small train data)</a>
- Part I - 7. <a href="#p1-7">Second MLU Leaderboard Submission (with full train data)</a>


### <font color='orange'>Please make sure to run the below cell! It will allow you to print solutions for the code challenges.</font> 

In [1]:
# Import utility functions that provide answers to challenges
%load_ext autoreload
%aimport dayone_utils
import pandas as pd

### <a name="p1-1">Part I - 1. AutoGluon Installation</a>

We need to begin by installing AutoGluon (documentation [here](https://auto.gluon.ai/stable/install.html)).  


__NOTE__: This may take a few minutes to install (you can see that it has finished once the `[*]` symbol next to the cell disappears and turns into a number).

In [2]:
#!python3 -m pip install -qU pip
#!python3 -m pip install -qU setuptools wheel
#!python3 -m pip install -qU "mxnet<2.0.0"
#!python3 -m pip install -qU autogluon

Now we load the libraries needed to work with our Tabular dataset.

In [3]:
# Importing the newly installed AutoGluon code library
from autogluon.tabular import TabularPredictor, TabularDataset


### <a name="p1-2">Part I - 2. Getting the Data</a>

Let's get the data for our business problem.

>  <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100" /> 
>  Run the cell below to load the train and test data. Then continue and take a look at the first samples of our train dataset. <br/> This is a very basic check when performing __Data Exploration__.

In [71]:
df_train = TabularDataset(data="./datasets/training.csv")
df_test = TabularDataset(data="./datasets/mlu-leaderboard-test.csv")

Loaded data from: ./datasets/training.csv | Columns = 10 / 10 | Rows = 5051 -> 5051
Loaded data from: ./datasets/mlu-leaderboard-test.csv | Columns = 9 / 9 | Rows = 562 -> 562


In [5]:
df_train.head()

Unnamed: 0,ID,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,542,Foe (Penguin Essentials),J. M. Coetzee,"Paperback,– 21 Sep 2010",5.0 out of 5 stars,2 customer reviews,Nobel Laureate and two-time Booker prize-winni...,Action & Adventure (Books),Action & Adventure,2.52763
1,2380,Of Blood and Bone (Chronicles of The One),Nora Roberts,"Paperback,– 25 Jan 2019",4.3 out of 5 stars,5 customer reviews,"Thirteen years ago, a catastrophic pandemic kn...",Action & Adventure (Books),Romance,2.555094
2,5529,Then She Was Gone,Lisa Jewell,"Paperback,– Import, 14 Dec 2017",4.0 out of 5 stars,9 customer reviews,"BESTSELLING PSYCHOLOGICAL SUSPENSE, AND A TOP ...",Action & Adventure (Books),"Crime, Thriller & Mystery",2.531479
3,4511,Mongodb: The Definitive Guide- Powerful and Sc...,Kristina Chodorow,"Paperback,– 2013",4.7 out of 5 stars,11 customer reviews,Manage the huMONGOus amount of data collected ...,Computer Databases (Books),"Computing, Internet & Digital Media",2.845718
4,1305,Jerusalem: The Biography,Simon Sebag Montefiore,"Paperback,– 1 Mar 2012",4.6 out of 5 stars,18 customer reviews,The epic story of Jerusalem told through the l...,History of Civilization & Culture,"Biographies, Diaries & True Accounts",2.733197


### <a name="p1-3">Part I - 3. Model Training with AutoGluon (small train dataset)</a>

We can train a model using AutoGluon with only a single line of code.  All we need to do is to tell it which column from the dataset we are trying to predict, and what the dataset is.


### Sampling data
For this first training run, we randomly sample 1000 rows from our train dataset to make training faster.



> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/>  Run the cell below to prepare the datasets (AutoGluon is doing all the magic for us). <br/>
Here we randomly select 1000 rows of our dataset. AutoGluon will later split them into train and validation datasets during training.
> 

<br/>

__NOTE__: The `random_state` parameter below makes the sampling repeatable when the code is run multiple times.
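
To see what repeatability means in practice, here is a minimal sketch on a small toy DataFrame (the values are made up and stand in for the real book data). It only assumes the standard pandas `DataFrame.sample` method, the same one used in the cell below.

```python
import pandas as pd

# Toy stand-in for the train dataset (hypothetical values, not the real book data)
toy = pd.DataFrame({"ID": range(20), "Price": [2.0 + 0.05 * i for i in range(20)]})

# Same random_state -> the same rows are drawn every time
first_draw = toy.sample(n=5, random_state=0)
second_draw = toy.sample(n=5, random_state=0)
same_seed_matches = first_draw.equals(second_draw)

# A different random_state generally selects a different set of rows
other_draw = toy.sample(n=5, random_state=1)

print("Same seed, same rows:", same_seed_matches)
print("IDs with random_state=0:", first_draw["ID"].tolist())
print("IDs with random_state=1:", other_draw["ID"].tolist())
```

Without a fixed `random_state`, each run would draw a different sample, and the training results would change from run to run.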

In [72]:
# Run this cell

# Sampling 1000
subsample_size = 1000  # subsample subset of data for faster demo, try setting this to much larger values
df_train_smaller = df_train.sample(n=subsample_size, random_state=0)

# Printing the first rows
df_train_smaller.head()

Unnamed: 0,ID,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
4097,3990,Beast,Krishna Udayasankar,"Paperback,– 25 Mar 2019",4.6 out of 5 stars,34 customer reviews,"An urban adventure thriller inspired by the legend of the Narasimha Avatar and explained through Genetics.\n\nIt was always the same dream, a dream that began with darkness and blood.\nWhen Assistant Commissioner of Police Aditi Kashyap is called upon to solve a gruesome triple homicide in a Mumbai suburb, she is dragged into the terrifying world of the Saimhas -- werelions -- who have lived alongside humans, hiding amongst them, since ancient times.\nFaced with the unbelievable, Aditi has no choice but to join hands with Prithvi, an Enforcer called in to hunt down this seemingly otherworl...","Crime, Thriller & Mystery (Books)","Crime, Thriller & Mystery",2.378398
1622,4327,Theory of Computation,Vivek Kulkarni,"Paperback,– 12 Apr 2013",1.0 out of 5 stars,1 customer review,"The book begins with basic concepts such as symbols, alphabets, sets, relations, graphs, strings, and languages. It then delves into the important topics including separate chapters on finite state machine, regular expressions, grammars, pushdown stack, Turing machine, parsing techniques, Post machine, undecidability, and complexity of problems. A chapter on production systems encompasses a computational model which is different from the Turing model, called Markov and labelled Markov algorithms. At the end, the chapter on implementations provides implementation of some key concepts especi...",Computer Science Books,"Computing, Internet & Digital Media",2.542825
1861,2352,Tom Gates #11: Dog Zombies Rule,Liz Pichon,"Hardcover,– 10 Feb 2017",4.7 out of 5 stars,15 customer reviews,Here's my excellent plan to make DogZombies the best band in the whole wide world! How hard can it be? (Very.) Right now I'm going to: 1. Write more songs. (Not about teachers.) 2. Make a spectacular music video. (Easy.) 3. Get some sleep. (Tricky when you're being kept awake by loud noises.) 4. Annoy Delia. (Nothing to do with dogzombies but always FUN.)\n\nWinner of\nThe Roald Dahl Funny Prize\nThe Red House Book Award Best Book for Young Readers\nThe Waterstone's Best Fiction for 5-12 year old's\nThe Blue Peter Award for Best Story.,Comics,Comics & Mangas,2.376577
39,2255,Only Time Will Tell (The Clifton Chronicles),Jeffrey Archer,"Paperback,– 15 Sep 2011",4.2 out of 5 stars,298 customer reviews,"The Clifton Chronicles is Jeffrey Archer’s most ambitious work in four decades as an international bestselling author. The epic tale of Harry Clifton’s life begins in 1919, in the backstreets of Bristol. His father was a war hero, but it will be twenty-one tumultuous years before Harry discovers the truth about how his father really died and if, in fact, he even was his father. The first in the series, Only Time Will Tell takes a cast of memorable characters from the ravages of the Great War to the outbreak of the Second World War, when Harry must decide whether to take his place at Oxford...",Action & Adventure (Books),Action & Adventure,2.071882
2839,2596,Arnold's Bodybuilding for Men,Schwarzenegger,"Paperback,– 12 Oct 1984",4.2 out of 5 stars,13 customer reviews,"The complete program for building and maintaining a well-conditioned, excellently proportioned body—for a lifetime of fitness and health.\n\nIn Arnold's Bodybuilding for Men, legendary athlete Arnold Schwarzenegger shows you how to achieve the best physical condition of your life. For every man, at every age, Arnold outlines a step-by-step program of excercise, skillfully combining weight training and aerobic conditioning. The result—total cardiovascular and muscular fitness.\n\nArnold's program of exercise features stretching, warm-up and warm-down routines, and three series of exercises,...",Healthy Living & Wellness (Books),Sports,2.764176


### Training a model with our small sample

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
For this first training run, we use the smaller dataset (1000 rows sampled from our original train dataset) to make training faster.

__NOTE__: AutoGluon's defaults are generally good, with one exception here: `eval_metric`. By default, AutoGluon uses `'root_mean_squared_error'` as the evaluation metric for regression problems. However, the MLU Leaderboard uses the `'mean_squared_error'` metric to measure submission quality, so we need to explicitly pass this metric to AutoGluon. For more information on these options, see sklearn [metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).
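
The two metrics are closely related: RMSE is the square root of MSE. The sketch below computes both by hand on a few made-up prices (the numbers are hypothetical and only serve as illustration).

```python
import math

# Hypothetical true and predicted prices, for illustration only
y_true = [2.5, 2.7, 2.4, 3.0]
y_pred = [2.6, 2.5, 2.4, 2.7]

# Mean squared error: average of the squared differences
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

# Root mean squared error: square root of the MSE
rmse = math.sqrt(mse)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Because the reported numbers differ, passing the same metric that the Leaderboard uses makes the validation scores directly comparable to the Leaderboard scores.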


---
Let's use `TabularPredictor` to train the first version of our model.

__NOTE__: Training on this smaller dataset might still take approx. 3-4 minutes!

In [7]:
# Run this cell

smaller_predictor = TabularPredictor(label="Price", eval_metric="mean_squared_error").fit(train_data=df_train_smaller)

No path specified. Models will be saved in: "AutogluonModels/ag-20220209_231254\"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220209_231254\"
AutoGluon Version:  0.3.1
Train Data Rows:    1000
Train Data Columns: 9
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (3.9542425094393248, 1.414973347970818, 2.60143, 0.33874)
	If 'regression' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    929.09 MB
	Train Data (Original)  Memory Usage: 2.05 MB (0.2% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_

### Interpreting the Training Output
AutoGluon outputs a lot of information about what is happening.

<img style="float: left;" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
<br/><br/>
<br/>
<br/>

> After the training above finishes, examine the output and try to find the information below in the printed messages from AutoGluon. <br/>
1. What is the shape of your training dataset?
2. What kind of ML problem type does AutoGluon infer (classification, regression, ...)? Remember, you've never mentioned what kind of problem type it is; you only provided the label column.
3. What does AutoGluon suggest in case it inferred the wrong problem type?
4. Identify the kind of data preprocessing and feature engineering performed by AutoGluon.
5. Find the basic statistics about your label in the print statements from AutoGluon.
6. How many extra features were generated besides the originals in our dataset? What was the runtime for that?
7. What is the evaluation metric used?
8. What does AutoGluon suggest doing if it inferred the wrong metric?
9. What is the ratio between train & validation dataset (try looking for `val` or `validation`)?
10. Identify the folder where the models are saved.
11. Identify where AutoGluon saved your prediction.
12. Enter a specific model folder and take a quick look to see the file format.

__Please, try hard to identify all information above before uncommenting the answer below.__ <br/>

################# LIST YOUR ANSWERS HERE #################
1. Train Data Rows: 1000, Train Data Columns: 9 <br/>
2. Regression <br/>
3. manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression']) <br/>
4. (1) AsType, (2) FillNa, (3) Identity, Category(CategoryMemoryMinimize), TextSpecial(Binned, DropDuplicates), TextNgram(CountVectorizer - reducing Vectorizer vocab size from 875 to 526), (4) DropUnique ==>  9 features in original data used to generate 598 features in processed data <br/>
5. Label info (max, min, mean, stddev): (3.9542425094393248, 1.414973347970818, 2.60143, 0.33874) <br/>
6. 9 features in original data used to generate 598 features in processed data, 5.11s <br/>
7. mean_squared_error <br/>
8. specify the eval_metric argument of fit() <br/>
9. 0.8 vs 0.2, Train Rows: 800, Val Rows: 200 <br/>
10. AutogluonModels/ag-20220209_231254 <br/>
11. AutogluonModels/ag-20220209_231254 <br/>
12. model.pkl <br/>

In [14]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FIT_INFO")

### <a name="p1-4">Part I - 4. AutoGluon Training Results</a>
Now let's take a look at all the information AutoGluon provides via its __leaderboard function__. <br/> 

__NOTE__: Don't confuse this with the MLU Leaderboard. The MLU Leaderboard is where you will make submissions with the predictions from your trained models; the AutoGluon leaderboard function is a summary of all models that AutoGluon trained.

<br/>

> <img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Run the cell below and take a closer look at AutoGluon's leaderboard output. <br/>
__Which one is the best model?__

<br/>

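__NOTE__: AutoGluon reports every metric in a higher-is-better form. For error metrics such as MSE it flips the sign, so `score_val` shows negative MSE values; the model with the score closest to zero has the lowest error.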
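
The sign convention can be illustrated with a few made-up numbers. In this sketch the model names and MSE values are hypothetical; it only shows that ranking by the highest negated MSE picks the same model as ranking by the lowest MSE.

```python
# Hypothetical validation MSEs for three models (lower error is better)
mse_by_model = {"model_a": 0.065, "model_b": 0.061, "model_c": 0.138}

# Flip the sign so that "higher is better" holds, as in the score_val column
score_by_model = {name: -mse for name, mse in mse_by_model.items()}

# Picking the maximum score is the same as picking the minimum MSE
best_by_score = max(score_by_model, key=score_by_model.get)
best_by_mse = min(mse_by_model, key=mse_by_model.get)

print("Best by score:", best_by_score)
print("Best by MSE:  ", best_by_mse)
print("MSE of best model:", -score_by_model[best_by_score])
```

To read an actual MSE off the leaderboard, drop the minus sign from `score_val`.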


In [10]:
# Run this cell

smaller_predictor.leaderboard(silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-0.060205,0.423705,27.805055,0.0,0.269403,2,True,11
1,CatBoost,-0.061911,0.042046,15.360326,0.042046,15.360326,1,True,6
2,LightGBM,-0.064838,0.036998,1.181352,0.036998,1.181352,1,True,4
3,LightGBMXT,-0.064974,0.04,1.67694,0.04,1.67694,1,True,3
4,RandomForestMSE,-0.066901,0.053004,3.798078,0.053004,3.798078,1,True,5
5,XGBoost,-0.067591,0.020001,1.935345,0.020001,1.935345,1,True,9
6,ExtraTreesMSE,-0.069272,0.071544,3.943585,0.071544,3.943585,1,True,7
7,LightGBMLarge,-0.070022,0.056001,8.418169,0.056001,8.418169,1,True,10
8,NeuralNetFastAI,-0.072054,0.28466,7.381691,0.28466,7.381691,1,True,8
9,KNeighborsUnif,-0.138151,0.021029,0.03525,0.021029,0.03525,1,True,1


In [16]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_BEST")

### <a name="p1-5">Part I - 5. Model Prediction with AutoGluon</a>
#### Now that your model is trained, let's use it to predict prices!

We should always run a final model performance assessment using data that was unseen by the model (the test data). Test data is not used during training and can therefore give an unbiased performance assessment. In our case, we will use the test data to make predictions and submit those to the MLU Leaderboard in the next step.

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
Run the cell below to show the test dataset that we will use for the MLU Leaderboard. 

In [17]:
# Run this cell

df_test.head()

Unnamed: 0,ID,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory
0,1589,"R in Action, 2ed (MANNING)",Robert L. Kabacoff,"Paperback,– 2015",4.0 out of 5 stars,7 customer reviews,"R in Action, Second Edition teaches you how to use the R language by presenting examples relevant to scientific, technical and business developers. Focusing on practical solutions, the book offers a crash course in statistics, including elegant methods for dealing with messy and incomplete data. You'll also master R's extensive graphical capabilities for exploring and presenting data visually. And this expanded second edition includes new chapters on forecasting, data mining and dynamic report writing.",Computer Science Books,"Computing, Internet & Digital Media"
1,2125,The Duchess Deal: Girl Meets Duke,Tessa Dare,"Mass Market Paperback,– 22 Aug 2017",4.8 out of 5 stars,7 customer reviews,"An iBooks Best Romance of August Pick!\nOne of Publishers Weekly's Buzz Books of Romance 2017!\nAn Amazon Best Romance of August Pick!\n2017 RT Reviewer's Choice Book of the Year Nominee and 2017 RT Reviewer's Choice Nominee for Best Historical Love & Laughter!\n When girl meets Duke, their marriage breaks all the rules…\nSince his return from war, the Duke of Ashbury’s to-do list has been short and anything but sweet: brooding, glowering, menacing London ne’er-do-wells by night. Now there’s a new item on the list. He needs an heir—which means he needs a wife. When Emma Gladstone, a vicar’...",Romance (Books),Romance
2,5516,Learning React: Functional Web Development with React and Redux,Alex Banks,"Paperback,– 2017",4.8 out of 5 stars,6 customer reviews,"""If you want to learn how to build efficient user interfaces with React, this is your book. Authors Alex Banks and Eve Porcello show you how to create UIs with this small JavaScript library that can deftly display data changes on large-scale, data-driven websites without page reloads. Along the way, youíll learn how to work with functional programming and the latest ECMAScript features.\nDeveloped by Facebook and used by companies including Netflix, Walmart and The New York Times for large parts of their web interfaces, React is quickly growing in use. By learning how to build React compon...",Internet & Web (Books),"Computing, Internet & Digital Media"
3,1307,Sikkim - Dawn of Democracy: The Truth Behind The Merger With India,GBS Sidhu,"Hardcover,– 29 Oct 2018",4.2 out of 5 stars,6 customer reviews,"It was in 1973 that G.B.S. Sidhu, a young official with the newly set-up Research and Analysis Wing (R&AW), took charge of the field office in Gangtok in 1973. With an insider's view of the events that led to the Chogyal's ouster, he presents a first-hand account of the fledgling democracy movement and the struggle for reforms led by Kazi Lhendup Dorji in a society that was struggling to come to terms with the modern world.\nIn his fast-paced, clear-sighted narrative, Sidhu tracks the reasons behind New Delhi's shift from a long-standing pro-Chogyal stand to a pro-democracy position and ma...",Government (Books),Politics
4,2449,Footprints on Zero Line: Writings on the Partition,Gulzar,"Hardcover,– 20 Aug 2017",4.5 out of 5 stars,10 customer reviews,"The Partition of 1947 has influenced the works of an entire generation of writers, and continues to do so. Gulzar witnessed the horrors of Partition first-hand and it is a theme that he has gone back to again and again in his writings. Footprints on Zero Line brings together a collection of his finest writings - fiction, non-fiction and poems - on the subject. What sets this collection apart from other writings on Partition is that Gulzar's unerring eye does not stop at the events of 1947 but looks at how it continues to affect our lives to this day. Wonderfully rendered in English by well...",Anthologies (Books),Politics
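
The load messages report 10 columns for the train file and 9 for the test file. A quick way to see which column is missing is to compare the column names. The sketch below copies the names from the two `head()` outputs shown in this notebook instead of reading the CSV files, so it runs on its own.

```python
# Column names copied from the df_train.head() and df_test.head() outputs
train_columns = ["ID", "Title", "Author", "Edition", "Reviews", "Ratings",
                 "Synopsis", "Genre", "BookCategory", "Price"]
test_columns = ["ID", "Title", "Author", "Edition", "Reviews", "Ratings",
                "Synopsis", "Genre", "BookCategory"]

# Columns present in the train data but absent from the test data
missing_in_test = [col for col in train_columns if col not in test_columns]

print("Columns only in train:", missing_in_test)
```

With the real DataFrames, the same comparison could be written as `set(df_train.columns) - set(df_test.columns)`. The test set has no label column, which is why its prices have to be predicted.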


> <img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
Use this new dataset as input to the model you have just trained to predict book prices. <br/>
__TIP:__ Look up the __predict__ function in the AutoGluon documentation [here](https://auto.gluon.ai/stable/api/autogluon.task.html#autogluon.tabular.TabularPredictor.predict) to see how to use it.

__Please try hard to write the code yourself before uncommenting the answer below. You know, it is about Learn and Be Curious, right?__

In [28]:
############## CODE HERE ####################
price_prediction = smaller_predictor.predict(df_test)
print("predicted prices for the first 10 books are : \n", price_prediction[0:10])
############## END OF CODE ####################

predicted prices for the first 10 books are : 
 0    2.631371
1    2.491900
2    2.814504
3    2.809279
4    2.633283
5    2.693715
6    2.697066
7    2.728527
8    2.424073
9    2.436965
Name: Price, dtype: float32


In [21]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_PRED")

### <a name="p1-6">Part I - 6. First MLU Leaderboard Submission (with small train data)</a>
#### Now you are ready for your first submission to our MLU Leaderboard!

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> Run the cell below to save your prediction file in the format expected by the MLU Leaderboard.


__NOTE__: If you have __not used the trained model to make predictions on the test dataset__ in the previous section, you will not have the `price_prediction` variable needed for the submission file, and running the cell below __will raise an error__. Go back and use the __.predict()__ function on the test dataset to create `price_prediction`, as suggested by the answer provided in the *dayone_utils* file!

In [29]:
# Run this cell

# Define empty dataset with column headers ID & Price
df_submission = pd.DataFrame(columns=["ID", "Price"])
# Creating ID column from ID list
df_submission["ID"] = df_test["ID"].tolist()
# Creating label column from price prediction list
df_submission["Price"] = price_prediction
# saving your csv file for Leaderboard submission
df_submission.to_csv(
    "./datasets/predictions/Prediction_to_Leaderboard.csv", index=False
)

#### Let's do a quick check to see if the file is ok!
> <img style="float: left; padding-right: 30px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> 1. Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard.
> 2. If the difference is zero you are good to go!

In [30]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./datasets/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_submission["ID"]).sum(),
)

Double-check submission file against the original test file
Differences between project result IDs and sample submission IDs: 0
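
Beyond comparing IDs, a few more properties of the submission can be checked before uploading. The sketch below is one possible way to do this; the helper name `check_submission` is hypothetical, and the rules it enforces (exactly the columns `ID` and `Price`, same IDs in the same order, no missing prices) are assumptions based on how the file is built in this notebook, not an official Leaderboard specification.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, test_ids: list) -> list:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    if list(submission.columns) != ["ID", "Price"]:
        problems.append("columns must be exactly ['ID', 'Price']")
        return problems
    if len(submission) != len(test_ids):
        problems.append("row count differs from the test set")
    elif submission["ID"].tolist() != list(test_ids):
        problems.append("IDs differ from the test set or are in a different order")
    if submission["Price"].isna().any():
        problems.append("some prices are missing")
    return problems

# Toy data (hypothetical IDs and prices)
test_ids = [1589, 2125, 5516]
good = pd.DataFrame({"ID": test_ids, "Price": [2.63, 2.49, 2.81]})
bad = pd.DataFrame({"ID": test_ids, "Price": [2.63, None, 2.81]})

print(check_submission(good, test_ids))
print(check_submission(bad, test_ids))
```

With the notebook's variables, a call could look like `check_submission(df_submission, df_test["ID"].tolist())`.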


#### Downloading the Prediction File and Submitting
> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> 1. Download the file you just saved to your local machine. <br/>
> 2. Follow the instructions on the Leaderboard submission page: https://mlu.corp.amazon.com/contests/redirect/7 to submit your file.

<br>
You can find your submission file in the folder <code>datasets > predictions</code>.

### <a name="p1-7">Part I - 7. Second MLU Leaderboard Submission (with full train data)</a>

> <img style="float: left;" src="./images/challenge_robot.png" alt="drawing" width="130" /> 
> Now that you made your first submission using the small sample from your dataset, repeat the process using the full dataset and submit again to see if your score gets better.<br>
If you don't know how to write the code for this, uncomment the challenge answer; copy and paste it in the section below.

__NOTE__: It should take around 12-15 minutes to run this training with our CPU. Just in case, use the `time_limit` parameter (in seconds) to cap the run time; the solution uses `time_limit=15*60`, i.e. 15 minutes.



In [33]:
############## CODE HERE ####################
predictor = TabularPredictor(label="Price", eval_metric="mean_squared_error").fit(train_data=df_train, time_limit=15*60)

############## END OF CODE ####################

No path specified. Models will be saved in: "AutogluonModels/ag-20220209_235036\"
Beginning AutoGluon training ... Time limit = 900s
AutoGluon will save models to "AutogluonModels/ag-20220209_235036\"
AutoGluon Version:  0.3.1
Train Data Rows:    5051
Train Data Columns: 9
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (4.149249912590282, 1.414973347970818, 2.60147, 0.33003)
	If 'regression' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1656.54 MB
	Train Data (Original)  Memory Usage: 11.52 MB (0.7% of available memory)
	Inferring data type of each feature based on column values. Se

In [32]:
# ### CHALLENGE ANSWER
#dayone_utils.answer_html("CH_FULL_PRED")

### Second MLU Leaderboard Submission with the Full Train Dataset

><img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
1. Run the AutoGluon leaderboard function for the smaller dataset in the first cell below.
2. Run the AutoGluon leaderboard function for the full dataset in the second cell below.
3. Compare the performances.

__How can you explain the differences in `score_val` and `fit_time` columns?__
 


In [37]:
############## FIRST CODE HERE ####################
smaller_predictor.leaderboard(silent=True)

############## END OF CODE ####################

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-0.060205,0.423705,27.805055,0.0,0.269403,2,True,11
1,CatBoost,-0.061911,0.042046,15.360326,0.042046,15.360326,1,True,6
2,LightGBM,-0.064838,0.036998,1.181352,0.036998,1.181352,1,True,4
3,LightGBMXT,-0.064974,0.04,1.67694,0.04,1.67694,1,True,3
4,RandomForestMSE,-0.066901,0.053004,3.798078,0.053004,3.798078,1,True,5
5,XGBoost,-0.067591,0.020001,1.935345,0.020001,1.935345,1,True,9
6,ExtraTreesMSE,-0.069272,0.071544,3.943585,0.071544,3.943585,1,True,7
7,LightGBMLarge,-0.070022,0.056001,8.418169,0.056001,8.418169,1,True,10
8,NeuralNetFastAI,-0.072054,0.28466,7.381691,0.28466,7.381691,1,True,8
9,KNeighborsUnif,-0.138151,0.021029,0.03525,0.021029,0.03525,1,True,1


In [36]:
############## SECOND CODE HERE ###############
predictor.leaderboard(silent=True)

############## END OF CODE ####################

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-0.038863,3.411076,263.730227,0.0,0.256883,2,True,11
1,NeuralNetFastAI,-0.042488,2.938345,55.956851,2.938345,55.956851,1,True,8
2,CatBoost,-0.042937,0.130956,153.9742,0.130956,153.9742,1,True,6
3,XGBoost,-0.045198,0.058603,18.657167,0.058603,18.657167,1,True,9
4,LightGBMXT,-0.0457,0.140093,6.578702,0.140093,6.578702,1,True,3
5,LightGBMLarge,-0.045802,0.143079,28.306423,0.143079,28.306423,1,True,10
6,LightGBM,-0.047194,0.113003,6.145207,0.113003,6.145207,1,True,4
7,RandomForestMSE,-0.054881,0.097542,201.027086,0.097542,201.027086,1,True,5
8,ExtraTreesMSE,-0.055049,0.085004,248.952299,0.085004,248.952299,1,True,7
9,KNeighborsUnif,-0.134102,0.012999,0.198595,0.012999,0.198595,1,True,1
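
One way to quantify the difference is to compare the `WeightedEnsemble_L2` rows of the two leaderboards above. The sketch below copies those numbers by hand. Treat the result as a rough indication only: each predictor is scored on its own validation split, so the two scores are not measured on identical data.

```python
# WeightedEnsemble_L2 rows copied from the two leaderboard outputs above
small = {"score_val": -0.060205, "fit_time": 27.805055}   # 1000-row sample
full = {"score_val": -0.038863, "fit_time": 263.730227}   # full train data

# score_val is the negated MSE, so flip the sign to get the error itself
small_mse = -small["score_val"]
full_mse = -full["score_val"]

# Relative drop in validation error and relative growth in training time
mse_reduction_pct = 100 * (small_mse - full_mse) / small_mse
fit_time_ratio = full["fit_time"] / small["fit_time"]

print(f"Validation MSE: {small_mse:.6f} -> {full_mse:.6f} ({mse_reduction_pct:.1f}% lower)")
print(f"Fit time: {small['fit_time']:.1f}s -> {full['fit_time']:.1f}s ({fit_time_ratio:.1f}x longer)")
```

More training rows give the models more examples to learn from, at the cost of a longer fit.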


In [40]:
# ## CHALLENGE ANSWER
#dayone_utils.answer_html("CH_FULL_LEAD")

### Get the second submission for MLU Leaderboard ready

><img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Write the code that creates the output file using the predictions from your second model.


In [45]:
############## CODE HERE ####################
price_prediction_2 = predictor.predict(df_test)
print("predicted prices for the first 10 books are : \n", price_prediction_2[0:10])

# Define empty dataset with column headers ID & Price
df_full_submission = pd.DataFrame(columns=["ID", "Price"])
# Creating ID column from ID list
df_full_submission["ID"] = df_test["ID"].tolist()
# Creating label column from price prediction list
df_full_submission["Price"] = price_prediction_2
# saving your csv file for Leaderboard submission
df_full_submission.to_csv(
    "./datasets/predictions/Prediction_to_Leaderboard_2.csv", index=False
)
############## END OF CODE ####################

predicted prices for the first 10 books are : 
 0    2.788654
1    2.599526
2    2.864072
3    2.611377
4    2.725514
5    2.976959
6    2.584566
7    2.837416
8    2.485230
9    2.322487
Name: Price, dtype: float32
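Assigning the predictions as a column relies on pandas index alignment. The sketch below illustrates the pitfall on toy data, assuming the predictions carry the same index as `df_test`; the frames, values, and the `preds` stand-in for `predictor.predict(df_test)` are hypothetical. If the test frame's index is not `0..n-1`, the assignment silently produces missing prices, whereas building the frame from plain arrays does not depend on the index.

```python
import pandas as pd

# Hypothetical test frame whose index is NOT 0..n-1 (for example after filtering rows).
df_test = pd.DataFrame({"ID": [101, 102, 103]}, index=[5, 6, 7])
# Stand-in for predictor.predict(df_test): assumed to share df_test's index.
preds = pd.Series([2.7, 2.5, 2.9], index=df_test.index, name="Price")

# Column assignment aligns on index labels: 0, 1, 2 never match 5, 6, 7.
misaligned = pd.DataFrame({"ID": df_test["ID"].tolist()})
misaligned["Price"] = preds

# Position-based construction ignores the index entirely.
aligned = pd.DataFrame({"ID": df_test["ID"].to_numpy(), "Price": preds.to_numpy()})

print(misaligned)
print(aligned)
```

With a default index on `df_test`, both variants give the same result, which is why the cell above works as written.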


In [43]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FULL_SUBM")

#### Let's quickly check that the submission file contains the expected IDs
> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
1. Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard.
2. If the difference is zero, you are good to go.

In [46]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./datasets/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_full_submission["ID"]).sum(),
)

Double-check submission file against the original test file
Differences between project result IDs and sample submission IDs: 0
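The element-wise comparison above assumes both files have the same number of rows in the same order. As a minimal sketch, a small helper can make that assumption explicit; `count_id_mismatches` is a hypothetical name, not part of the notebook or of `dayone_utils`.

```python
def count_id_mismatches(expected_ids, submitted_ids):
    """Count positions where the two ID sequences differ.

    Raises ValueError when the lengths differ, so a truncated or
    padded submission is reported clearly instead of being compared.
    """
    expected, submitted = list(expected_ids), list(submitted_ids)
    if len(expected) != len(submitted):
        raise ValueError(
            f"Row count mismatch: expected {len(expected)}, got {len(submitted)}"
        )
    return sum(e != s for e, s in zip(expected, submitted))


# Toy IDs for illustration only.
print(count_id_mismatches([1, 2, 3], [1, 2, 3]))
print(count_id_mismatches([1, 2, 3], [1, 3, 2]))
```

In the notebook this could be called with `sample_submission_df["ID"]` and `df_full_submission["ID"]`, since both convert to lists.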


> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> Submit again to the MLU Leaderboard to improve your score. For the submission, use the same link as before: https://mlu.corp.amazon.com/contests/redirect/7 <br>

___
#  <a name="2"> Part II - Advanced AutoGluon (OPTIONAL)</a>

Now that you have made your first Leaderboard submission, let's practice using some advanced features of AutoGluon. <br/>
- Part II - 1. <a href="#p2-1">Explainability: Feature Importance</a>
- Part II - 2. <a href="#p2-2">Data Preprocessing: Cleaning & Missing Values</a>
- Part II - 3. <a href="#p2-3">Final (optional) MLU Leaderboard Submission (with full engineered data)</a>
- Part II - 4. <a href="#p2-4">Before You Go (clean up model artifacts)</a>

### <a name="p2-1">Part II - 1. Explainability</a>

There are growing business needs and legislative regulations that require explanations of why a model made a certain decision.<br/>
To better understand our trained predictor, we can estimate the overall importance of each feature.

#### Feature Importance
A feature’s importance score represents the performance drop that results when the model makes predictions on a perturbed copy of the dataset where this feature’s values have been randomly shuffled across rows. A feature score of 0.01 would indicate that the predictive performance dropped by 0.01 when the feature was randomly shuffled. The higher the score a feature has, the more important it is to the model’s performance. If a feature has a negative score, this means that the feature is likely harmful to the final model, and a model trained without that feature  would be expected to achieve a better predictive performance.
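To make the shuffling idea concrete, here is a minimal sketch in plain Python. It is not AutoGluon's implementation: the toy model, the data, and the function names are assumptions for illustration, and the score is measured as an increase in mean squared error rather than a drop in a higher-is-better metric.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_shuffles=3, seed=0):
    """Mean increase in MSE per feature when that feature is shuffled across rows."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_shuffles):
            # Shuffle only column j; every other column stays in place.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(y, [predict(r) for r in shuffled]) - baseline)
        importances.append(sum(increases) / n_shuffles)
    return importances


# Toy "model" (an assumption): it uses feature 0 and ignores feature 1.
def toy_predict(row):
    return 2.0 * row[0]


rows = [(float(i), float(i % 3)) for i in range(20)]
y = [2.0 * r[0] for r in rows]

scores = permutation_importance(toy_predict, rows, y)
print(scores)
```

The ignored feature gets a score of exactly zero because shuffling it cannot change any prediction, while the feature the model depends on gets a clearly positive score.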



> <img style="float: left;padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100" align="left"/> 
> Run the code below to see the output of the AutoGluon feature importance function for the first model we trained (on only 1000 samples). <br/>

In [47]:
# Run the code below
smaller_predictor.feature_importance(df_train_smaller)

Computing feature importance via permutation shuffling for 9 features using 1000 rows with 3 shuffle sets...
	41.3s	= Expected runtime (13.77s per shuffle set)
	34.2s	= Actual runtime (Completed 3 of 3 shuffle sets)


feature,importance,stddev,p_value,n,p99_high,p99_low
Synopsis,0.056896,0.002689,0.000372,3,0.072304,0.041488
Edition,0.014196,0.000357,0.000105,3,0.016242,0.012149
Genre,0.010605,0.000629,0.000585,3,0.014209,0.007
BookCategory,0.008616,0.000303,0.000206,3,0.010351,0.006882
Ratings,0.008183,0.000537,0.000715,3,0.011259,0.005108
Title,0.007887,0.000251,0.000168,3,0.009324,0.006449
Reviews,0.002575,0.000249,0.001547,3,0.004,0.00115
Author,0.000811,7.8e-05,0.001537,3,0.001258,0.000364
ID,0.000573,3.3e-05,0.000557,3,0.000763,0.000383


### <a name="p2-2">Part II - 2. Data Preprocessing</a>

With AutoGluon you don't have to worry about which model to choose; instead, you can focus on the data itself. 
In the book price case, there are a few columns which are clearly very poorly encoded, most importantly the ```Edition``` column. <br/>

### Data Cleaning

For this experiment, let's use our small dataset __df_train_smaller__ to make everything run a bit faster.

> <img style="float: left;padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Use the functions below to clean things up a bit and expand that data out.<br/>
For this experiment, our feature engineering tasks will be:<br/><br/>
>1. Splitting the column ```Edition``` into three new ones: ```hard_paper```, ```year``` and ```month```
>2. Creating two numerical features based on the features ```Reviews``` and ```Ratings```, named ```Reviews-n``` and ```Ratings-n``` respectively.
>3. Dropping the old columns from the dataset: ```Edition```,  ```Reviews``` and ```Ratings```. 

__Please, try hard to solve the challenge before uncommenting for the answer below.__ <br/>


__Day One is about Learn and Be Curious, right?__

In [48]:
# Run this cell

import re
import pandas as pd


def first_num(in_val):
    num_string = in_val.split(" ")[0]
    digits = re.sub(r"[^0-9\.]", "", num_string)
    return float(digits)


def year_get(in_val):
    m = re.compile(r"\d{4}").findall(in_val)
    # print(in_val, m)
    if len(m) > 0:
        return int(m[0])
    else:
        return None


def month_get(in_val):
    m = re.compile(r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec").findall(in_val)
    # print(in_val, m)
    if len(m) > 0:
        return m[0]
    else:
        return "None"


def drop_features(in_feat):
    train_data_feateng.drop(in_feat, axis=1, inplace=True)
    val_data_feateng.drop(in_feat, axis=1, inplace=True)
    return
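One thing to watch with the month pattern above: it has no word boundaries, so it can match inside longer words. The sketch below shows a stricter variant; the sample `Edition` string is hypothetical, the names `month_strict` and `year_strict` are not part of the notebook, and restricting years to 19xx/20xx is an assumption about the data.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# \b requires the abbreviation to stand alone as a word.
MONTH_RE = re.compile(rf"\b(?:{MONTHS})\b")
# Assumption: publication years fall in 1900-2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def month_strict(text):
    m = MONTH_RE.search(text)
    return m.group(0) if m else "None"


def year_strict(text):
    m = YEAR_RE.search(text)
    return int(m.group(0)) if m else None


sample = "Hardcover, Deckle Edge, 2 Feb 2016"  # hypothetical Edition value

# The boundary-free pattern picks up "Dec" from "Deckle" first.
loose_month = re.findall(MONTHS, sample)[0]

print(loose_month, month_strict(sample), year_strict(sample))
```

Whether this matters depends on how often such words appear in the real `Edition` column, which is worth checking before swapping the helpers.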

In [73]:
############## CODE HERE ####################
train_data_feateng = df_train_smaller.copy()
val_data_feateng = df_test.copy()

train_data_feateng["hard_paper"] = [i.split(',')[0] for i in train_data_feateng["Edition"]]

train_data_feateng["year"] = train_data_feateng["Edition"].apply(year_get)
train_data_feateng["month"] = train_data_feateng["Edition"].apply(month_get)

train_data_feateng["Reviews-n"] = train_data_feateng["Reviews"].apply(first_num)
train_data_feateng["Ratings-n"] = train_data_feateng["Ratings"].apply(first_num)

val_data_feateng["hard_paper"] = [i.split(',')[0] for i in val_data_feateng["Edition"]]

val_data_feateng["year"] = val_data_feateng["Edition"].apply(year_get)
val_data_feateng["month"] = val_data_feateng["Edition"].apply(month_get)

val_data_feateng["Reviews-n"] = val_data_feateng["Reviews"].apply(first_num)
val_data_feateng["Ratings-n"] = val_data_feateng["Ratings"].apply(first_num)


drop_features(["Edition", "Reviews","Ratings"] )

############## END OF CODE ####################

In [66]:
# ## CHALLENGE ANSWER
#dayone_utils.answer_html("CH_FEAT_ENG")

><img style="float: left;padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
>Now print the dataset with the new features to see what they look like.

In [93]:
# Run this cell

#train_data_feateng.head(2)
(train_data_feateng["month"]=="None").sum()

60

### Identifying Missing values
By doing the feature engineering above we introduced a new challenge. 
We might now have some missing data.

> <img style="float: left;padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Try to identify the features that may have missing values and how many are missing. <br/>
__Are there any missing values?__

__Please, try hard to solve the challenge before uncommenting for the answer below.__ <br/>


__Day One is about Learn and Be Curious, right?__

In [103]:
############## CODE HERE ####################
import numpy as np
#train_data_feateng.isna().any()

print("null values: \n",pd.isna(train_data_feateng).sum())
print("\nmonth = None: ",(train_data_feateng["month"]=="None").sum())
print("year = None: ",(train_data_feateng["year"]=="None").sum())
print("hard_paper = None: ",(train_data_feateng["hard_paper"]=="None").sum())
############## END OF CODE ####################

null values: 
 ID              0
Title           0
Author          0
Synopsis        0
Genre           0
BookCategory    0
Price           0
hard_paper      0
year            3
month           0
Reviews-n       0
Ratings-n       0
dtype: int64

month = None:  60
year = None:  0
hard_paper = None:  0


In [104]:
# ## CHALLENGE ANSWER
#dayone_utils.answer_html("CH_MISSING")

> <img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Let's train the model again with these new manually created features.



In [105]:
############## CODE HERE ####################
smaller_feateng_predictor = TabularPredictor(label="Price", eval_metric="mean_squared_error").fit(train_data=train_data_feateng)

############## END OF CODE ####################

No path specified. Models will be saved in: "AutogluonModels/ag-20220210_011831\"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220210_011831\"
AutoGluon Version:  0.3.1
Train Data Rows:    1000
Train Data Columns: 11
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (3.9542425094393248, 1.414973347970818, 2.60143, 0.33874)
	If 'regression' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1527.24 MB
	Train Data (Original)  Memory Usage: 1.92 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadat

In [107]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_PRED_FEAT")

> <img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Compare the AutoGluon leaderboard for the new smaller_feateng_predictor to the one for smaller_predictor in the cells below. <br/>
__Are there any significant differences?__


In [108]:
############## FIRST CODE FROM THE ANSWER HERE ####################
smaller_predictor.leaderboard(silent=True)

############## END OF CODE ########################################

index,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-0.060205,0.423705,27.805055,0.0,0.269403,2,True,11
1,CatBoost,-0.061911,0.042046,15.360326,0.042046,15.360326,1,True,6
2,LightGBM,-0.064838,0.036998,1.181352,0.036998,1.181352,1,True,4
3,LightGBMXT,-0.064974,0.04,1.67694,0.04,1.67694,1,True,3
4,RandomForestMSE,-0.066901,0.053004,3.798078,0.053004,3.798078,1,True,5
5,XGBoost,-0.067591,0.020001,1.935345,0.020001,1.935345,1,True,9
6,ExtraTreesMSE,-0.069272,0.071544,3.943585,0.071544,3.943585,1,True,7
7,LightGBMLarge,-0.070022,0.056001,8.418169,0.056001,8.418169,1,True,10
8,NeuralNetFastAI,-0.072054,0.28466,7.381691,0.28466,7.381691,1,True,8
9,KNeighborsUnif,-0.138151,0.021029,0.03525,0.021029,0.03525,1,True,1


In [109]:
############## SECOND CODE FROM THE ANSWER HERE ####################
smaller_feateng_predictor.leaderboard(silent=True)

############## END OF CODE #########################################

index,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-0.062411,0.231073,29.202604,0.000999,0.29888,2,True,11
1,CatBoost,-0.064478,0.050565,18.926666,0.050565,18.926666,1,True,6
2,LightGBMXT,-0.065743,0.034,1.011851,0.034,1.011851,1,True,3
3,ExtraTreesMSE,-0.0679,0.090509,4.879728,0.090509,4.879728,1,True,7
4,XGBoost,-0.068657,0.017,1.748636,0.017,1.748636,1,True,9
5,RandomForestMSE,-0.069099,0.053004,3.877837,0.053004,3.877837,1,True,5
6,LightGBM,-0.069263,0.038,2.336842,0.038,2.336842,1,True,4
7,LightGBMLarge,-0.075532,0.050999,4.723439,0.050999,4.723439,1,True,10
8,NeuralNetFastAI,-0.079989,0.252549,4.932153,0.252549,4.932153,1,True,8
9,KNeighborsUnif,-0.134233,0.061999,0.051003,0.061999,0.051003,1,True,1
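Comparing two leaderboards by eye is error-prone because the row order differs. One option is to join them on the model name and look at the score difference. The sketch below uses three rows copied from the outputs above; in the notebook, the two frames would come from the `leaderboard(silent=True)` calls. Note that `score_val` is negative because AutoGluon reports error metrics with the sign flipped, so higher (closer to zero) is better.

```python
import pandas as pd

# Three rows copied from the two leaderboard outputs above.
lb_orig = pd.DataFrame(
    {
        "model": ["WeightedEnsemble_L2", "CatBoost", "LightGBMXT"],
        "score_val": [-0.060205, -0.061911, -0.064974],
    }
)
lb_feateng = pd.DataFrame(
    {
        "model": ["WeightedEnsemble_L2", "CatBoost", "LightGBMXT"],
        "score_val": [-0.062411, -0.064478, -0.065743],
    }
)

merged = lb_orig.merge(lb_feateng, on="model", suffixes=("_orig", "_feateng"))
# Negative delta: the feature-engineered run scored worse for that model.
merged["delta"] = merged["score_val_feateng"] - merged["score_val_orig"]

print(merged)
```

For these rows, every delta is slightly negative, so the engineered features did not help on the 1000-row sample, and the gaps are small compared with the spread between models.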


In [111]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_LEAD_COMP")

> <img style="float: left; padding-right: 30px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
1. Run the AutoGluon `feature_importance` function for the original smaller dataset in the first cell below.
2. Run the `feature_importance` function again for the feature-engineered dataset in the second cell below.
3. Compare the results.

__Are there any significant differences?__


In [112]:
############## CODE FOR THE ORIGINAL DATASET FEATURE IMPORTANCE HERE ####################
smaller_predictor.feature_importance(df_train_smaller)

############## END OF CODE ############################################################

Computing feature importance via permutation shuffling for 9 features using 1000 rows with 3 shuffle sets...
	51.08s	= Expected runtime (17.03s per shuffle set)
	37.07s	= Actual runtime (Completed 3 of 3 shuffle sets)


feature,importance,stddev,p_value,n,p99_high,p99_low
Synopsis,0.056896,0.002689,0.000372,3,0.072304,0.041488
Edition,0.014196,0.000357,0.000105,3,0.016242,0.012149
Genre,0.010605,0.000629,0.000585,3,0.014209,0.007
BookCategory,0.008616,0.000303,0.000206,3,0.010351,0.006882
Ratings,0.008183,0.000537,0.000715,3,0.011259,0.005108
Title,0.007887,0.000251,0.000168,3,0.009324,0.006449
Reviews,0.002575,0.000249,0.001547,3,0.004,0.00115
Author,0.000811,7.8e-05,0.001537,3,0.001258,0.000364
ID,0.000573,3.3e-05,0.000557,3,0.000763,0.000383


In [113]:
############## CODE FOR THE FEATURE ENGINEERED DATASET FEATURE IMPORTANCE HERE  ####################
smaller_feateng_predictor.feature_importance(train_data_feateng)

############## END OF CODE #########################################################################

Computing feature importance via permutation shuffling for 11 features using 1000 rows with 3 shuffle sets...
	43.32s	= Expected runtime (14.44s per shuffle set)
	30.45s	= Actual runtime (Completed 3 of 3 shuffle sets)


feature,importance,stddev,p_value,n,p99_high,p99_low
Synopsis,0.057548,0.002533,0.000323,3,0.072062,0.043034
Genre,0.012842,0.000705,0.000501,3,0.01688,0.008804
hard_paper,0.010553,0.000522,0.000407,3,0.013544,0.007562
Title,0.009263,0.000268,0.000139,3,0.010796,0.00773
Ratings-n,0.009136,7e-05,1e-05,3,0.009535,0.008736
BookCategory,0.005504,0.000172,0.000163,3,0.00649,0.004518
month,0.002197,9.6e-05,0.000319,3,0.002748,0.001645
Reviews-n,0.001502,0.000153,0.001719,3,0.002379,0.000626
year,0.001099,6.6e-05,0.000599,3,0.001476,0.000721
Author,0.000878,5.4e-05,0.000628,3,0.001186,0.000569


In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FEAT_COMP")

### <a name="p2-3">Part II - 3. Final (optional) MLU Leaderboard Submission (with full engineered data)</a>
Let's create the full engineered dataset to train a final AutoGluon model & let's also allocate more time to really get the best results.

__NOTE__: As there are few columns in this dataset, we don't necessarily expect additional performance improvement.

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> Now it is time to train the __enhanced version__ of your AutoGluon model.

For this experiment we will use a time limit of 30 min (`time_limit` in seconds below).

__NOTE__: 30 minutes may not be enough to get a better score than your previous submission. If you have time, try running for more than 30 minutes to improve your performance!

In [114]:
full_feateng = df_train.copy()

# CLEAN FEATURES
full_feateng['Reviews-n'] = full_feateng['Reviews'].apply(first_num)
full_feateng['Ratings-n'] = full_feateng['Ratings'].apply(first_num)
full_feateng['hard-paper'] = full_feateng['Edition'].apply(lambda x : x.split(",")[0])
full_feateng['year'] = full_feateng['Edition'].apply(year_get)
full_feateng['month'] = full_feateng['Edition'].apply(month_get)

# DROPPING ORIGINAL FEATURES
full_feateng.drop(['Edition', 'Ratings', 'Reviews'], axis=1, inplace=True)

In [115]:
enhanced_predictor = TabularPredictor(label="Price", eval_metric="mean_squared_error").fit(
    train_data=full_feateng, time_limit= 30 * 60
)

No path specified. Models will be saved in: "AutogluonModels/ag-20220210_012701\"
Beginning AutoGluon training ... Time limit = 1800s
AutoGluon will save models to "AutogluonModels/ag-20220210_012701\"
AutoGluon Version:  0.3.1
Train Data Rows:    5051
Train Data Columns: 11
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (4.149249912590282, 1.414973347970818, 2.60147, 0.33003)
	If 'regression' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1216.68 MB
	Train Data (Original)  Memory Usage: 10.81 MB (0.9% of available memory)
	Inferring data type of each feature based on column values. 

[1000]	train_set's l2: 0.00300236	valid_set's l2: 0.0457485


	-0.0456	 = Validation score   (mean_squared_error)
	14.34s	 = Training   runtime
	0.11s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 1768.23s of the 1768.19s of remaining time.
	-0.0463	 = Validation score   (mean_squared_error)
	5.66s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 1762.35s of the 1762.29s of remaining time.
	-0.0549	 = Validation score   (mean_squared_error)
	137.27s	 = Training   runtime
	0.11s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 1624.66s of the 1624.62s of remaining time.
	-0.0417	 = Validation score   (mean_squared_error)
	137.71s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 1486.8s of the 1486.78s of remaining time.
	-0.0549	 = Validation score   (mean_squared_error)
	167.13s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training 

### Time to make Your Final Submission to the MLU Leaderboard

> <img style="float: left;padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Now make a final prediction and submit this to MLU leaderboard.<br> Keep in mind that we used an engineered version of the dataset for training. We need to apply the same transformation to the test data before we can call `.predict()`:

In [116]:
test_data_feateng = df_test.copy()

# FOR TEST DATA 
test_data_feateng['Reviews-n'] = test_data_feateng['Reviews'].apply(first_num)
test_data_feateng['Ratings-n'] = test_data_feateng['Ratings'].apply(first_num)
test_data_feateng['hard-paper'] = test_data_feateng['Edition'].apply(lambda x : x.split(",")[0])
test_data_feateng['year'] = test_data_feateng['Edition'].apply(year_get)
test_data_feateng['month'] = test_data_feateng['Edition'].apply(month_get)

# DROPPING ORIGINAL FEATURES
test_data_feateng.drop(['Edition', 'Ratings', 'Reviews'], axis=1, inplace=True)
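Because the training and test frames must go through identical steps, one option is to wrap the transformation in a single function and call it on both, so column names and parsing rules cannot drift apart. The sketch below is an alternative built on pandas string methods, not the notebook's helpers; `engineer_features`, the sample rows, and the value formats are assumptions for illustration.

```python
import pandas as pd

MONTHS = r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b"


def _leading_number(series):
    # First whitespace-separated token, thousands separators removed, as float.
    token = series.str.split(" ").str[0]
    return token.str.replace(",", "", regex=False).astype(float)


def engineer_features(df):
    """Return a copy of df with the Edition, Reviews and Ratings columns expanded."""
    out = df.copy()
    out["hard_paper"] = out["Edition"].str.split(",").str[0]
    out["year"] = pd.to_numeric(out["Edition"].str.extract(r"(\d{4})")[0])
    out["month"] = out["Edition"].str.extract(MONTHS)[0].fillna("None")
    out["Reviews-n"] = _leading_number(out["Reviews"])
    out["Ratings-n"] = _leading_number(out["Ratings"])
    return out.drop(columns=["Edition", "Reviews", "Ratings"])


# Hypothetical rows in the assumed raw format.
test_raw = pd.DataFrame(
    {
        "Edition": ["Paperback, 10 Mar 2016", "Hardcover, 2019"],
        "Reviews": ["4.0 out of 5 stars", "3.5 out of 5 stars"],
        "Ratings": ["8 customer reviews", "1,024 customer reviews"],
    }
)
train_raw = test_raw.assign(Price=[2.3, 2.9])

train_fe = engineer_features(train_raw)
test_fe = engineer_features(test_raw)
print(test_fe)
```

Apart from the label column, both outputs have the same columns in the same order, which is what `.predict()` needs.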


Add the code below to create predictions and the output file.

In [118]:
############## CODE HERE ####################
price_prediction_3 = enhanced_predictor.predict(test_data_feateng)
print("predicted prices for the first 10 books are : \n", price_prediction_3[0:10])

# Define empty dataset with column headers ID & Price
df_enhanced_submission = pd.DataFrame(columns=["ID", "Price"])
# Creating ID column from ID list
df_enhanced_submission["ID"] = test_data_feateng["ID"].tolist()
# Creating label column from price prediction list
df_enhanced_submission["Price"] = price_prediction_3
# saving your csv file for Leaderboard submission
df_enhanced_submission.to_csv(
    "./datasets/predictions/Prediction_to_Leaderboard_3.csv", index=False
)

############## END OF CODE ####################

predicted prices for the first 10 books are : 
 0    2.764384
1    2.564892
2    2.948368
3    2.648545
4    2.776495
5    2.977696
6    2.612196
7    2.883893
8    2.389546
9    2.293916
Name: Price, dtype: float32


In [119]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FINAL_SUBM")

#### Let's quickly check that the submission file contains the expected IDs
><img style="float: left; padding-right: 30px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> 1. Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard.
2. If the difference is zero you are good to go!

In [120]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./datasets/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_enhanced_submission["ID"]).sum(),
)

Double-check submission file against the original test file
Differences between project result IDs and sample submission IDs: 0


<p style="padding: 10px; border: 1px solid black;">
<img src="./images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>
    
## Congrats on Finishing this Hands On!!
In the next module, __Code Walkthrough and Advanced AutoGluon__, we are going to walk through your solutions and also show a notebook that implements an __end-to-end__ solution, deploying your model for use in production.

### <a name="p2-4">Part II - 4. Before You Go</a>
> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
>After you are done with this Hands On, you can clean up all model artifacts by executing the cell below.<br/>

__It's always a good practice to clean up everything when you are done.__

In [None]:
!rm -r AutogluonModels