## Prepare dataset
After downloading the dataset into the extracted folder, go back to the main workspace and run prepare_dataset.py. The script takes the ml-20m dataset and splits it into training, validation, and test sets. The training set is used to fit the model, and the validation and test sets are used to evaluate it.

In [1]:
%%bash
python prepare_dataset.py


Preprocessing seed:  0


The accuracy metric for this model is **recall**. In information retrieval, recall is the fraction of the relevant documents that are successfully retrieved. For example, for a text search on a set of documents, recall is the number of correct results divided by the number of results that should have been returned.

In machine learning, a confusion matrix (also known as an error matrix) summarizes the performance of a model on a set of test data. The matrix has four entries: true positive, true negative, false positive, and false negative. Recall is **true positive / (true positive + false negative)**, that is, the number of true positives divided by the total number of actual positives in the test dataset. In the following sections (the train, test, and inference cells), recall is reported as the metric for this model.

For this model, performance is measured as the number of correctly predicted ratings divided by the number of all correct ratings. The model predicts a user's rating for a movie, and the correct ratings for the movies are known, so recall is the number of correct predictions (true positives) divided by the total number of correct ratings (true positives + false negatives).
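
As a minimal sketch of the formula above (not the code that main.py uses), recall can be computed directly from the true-positive and false-negative counts. The function name `recall` and the example counts are illustrative assumptions.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Return TP / (TP + FN), or 0.0 when there are no actual positives."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return 0.0
    return true_positives / actual_positives


# Hypothetical counts: 40 relevant items retrieved, 60 relevant items missed.
print(recall(40, 60))  # 0.4
```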


## Training the model
Training is started by running the main.py script with the --train argument. The resulting checkpoints, which contain the trained model weights, are stored in the directory specified by the --checkpoint_dir argument (by default, no checkpoints are saved).

Additionally, the --results_dir command-line argument (None by default) specifies where to save the following statistics in JSON format:

- > a complete list of command-line arguments saved as <results_dir>/args.json, and
- > a dictionary of validation metrics and performance metrics recorded during training
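
The saved arguments can be read back with the standard `json` module. The sketch below is only an illustration: the helper name `load_saved_args` and the `./results` directory in the usage comment are assumptions, and only the `args.json` file name comes from the description above.

```python
import json
from pathlib import Path


def load_saved_args(results_dir: str) -> dict:
    """Load the command-line arguments stored as <results_dir>/args.json."""
    args_path = Path(results_dir) / "args.json"
    with args_path.open() as f:
        return json.load(f)


# Hypothetical usage, assuming training was run with --results_dir ./results:
# saved_args = load_saved_args("./results")
# print(saved_args)
```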

When you run the training command in the next cell, the details of the training process are printed for each epoch. You can also change the training hyperparameters by changing the arguments passed to main.py. See the last cell of this notebook for more details about the main.py arguments.

Inference runs after every 50 epochs, and the recall of the model at that point is reported. Recall measures the performance of this model as the fraction of correct ratings that it predicts correctly: true positive / (true positive + false negative).

In [2]:
%%bash
mpirun --allow-run-as-root -np 1 -H localhost:8 python main.py --train --amp --checkpoint_dir ./checkpoints



DLL 2020-11-12 21:34:48.556530 - PARAMETER train : True  test : False  inference_benchmark : False  amp : True  epochs : 400  batch_size_train : 24576  batch_size_validation : 10000  validation_step : 50  warm_up_epochs : 5  total_anneal_steps : 15000  anneal_cap : 0.1  lam : 1.0  lr : 0.004  beta1 : 0.9  beta2 : 0.9  top_results : 100  xla : False  trace : False  activation : tanh  log_path : ./vae_cf.log  seed : 0  data_dir : /data  checkpoint_dir : ./checkpoints  world_size : 1  local_batch_size : 24576 
DLL 2020-11-12 21:34:55.632787 - (1,) train_epoch_time : 1.3805668354034424  train_throughput : 71205.53491441265 
DLL 2020-11-12 21:34:56.116540 - (2,) train_epoch_time : 0.48351335525512695  train_throughput : 203311.8608443187 
DLL 2020-11-12 21:34:56.594167 - (3,) train_epoch_time : 0.47730565071105957  train_throughput : 205956.0783610103 
DLL 2020-11-12 21:34:57.067740 - (4,) train_epoch_time : 0.47342419624328613  train_throughput : 207644.6467672365 
DLL 2020-11-12 21:34:57.



--------------------------------------------------------------------------
[[1268,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ef95f396c947

Another transport will be used instead, although this may result in
lower performance.

btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[VAE| INFO]: Already processed, skipping.
[VAE| INFO]: Cropping each epoch from: 116677 to 98304 samples
[VAE| INFO]: XLA disabled
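
The per-epoch lines in the output above can be parsed to track epoch time and throughput. This is a rough sketch: the regular expression is inferred only from the lines shown in this notebook, so the real log format may differ, and the name `parse_epoch_line` is hypothetical.

```python
import re

# Pattern inferred from the log lines shown above; the real format may vary.
EPOCH_LINE = re.compile(
    r"^DLL \S+ \S+ - \((\d+),\) "
    r"train_epoch_time : (\S+)\s+train_throughput : (\S+)"
)


def parse_epoch_line(line: str):
    """Return (epoch, epoch_time, throughput), or None if the line does not match."""
    match = EPOCH_LINE.match(line)
    if match is None:
        return None
    epoch, epoch_time, throughput = match.groups()
    return int(epoch), float(epoch_time), float(throughput)


# One of the lines from the training output above.
sample = (
    "DLL 2020-11-12 21:34:56.116540 - (2,) "
    "train_epoch_time : 0.48351335525512695  train_throughput : 203311.8608443187 "
)
print(parse_epoch_line(sample))
```

Lines that do not follow this pattern, such as the `[VAE| INFO]` messages, return `None` and can be skipped.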


## Test the model

The model is exported to the default model_dir and can be loaded and tested using the following command. The command loads the trained weights saved in the checkpoints folder and uses the test data to measure how the model performs on unseen data.

In the preprocessing step, the ml-20m dataset was split into training, validation, and test sets. In this step, the trained model is evaluated on the test set, and recall indicates the accuracy of its predictions: how often the model correctly predicts a user's rating for a movie. The output reports ndcg@100, recall@20, and recall@50.

In [3]:
%%bash
python main.py --test --amp --checkpoint_dir ./checkpoints

DLL 2020-11-12 21:38:43.866671 - PARAMETER train : False  test : True  inference_benchmark : False  amp : True  epochs : 400  batch_size_train : 24576  batch_size_validation : 10000  validation_step : 50  warm_up_epochs : 5  total_anneal_steps : 15000  anneal_cap : 0.1  lam : 1.0  lr : 0.004  beta1 : 0.9  beta2 : 0.9  top_results : 100  xla : False  trace : False  activation : tanh  log_path : ./vae_cf.log  seed : 0  data_dir : /data  checkpoint_dir : ./checkpoints  world_size : 1  local_batch_size : 24576 
DLL 2020-11-12 21:38:51.534564 - (0,) inference_throughput : 15378.013600892842 
ndcg@100:	0.4300323040406368
recall@20:	0.4015537189302785
recall@50:	0.541860369693171




[VAE| INFO]: Already processed, skipping.
[VAE| INFO]: Cropping each epoch from: 116677 to 98304 samples
[VAE| INFO]: XLA disabled
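
The recall@20 and recall@50 values reported above are ranking metrics: they look only at the K highest-scored items for each user. The sketch below shows one common way to compute recall@K for a single user. It is an illustration under stated assumptions, not the implementation in main.py: the name `recall_at_k` and the example scores are hypothetical, and the exact normalization used by the script may differ.

```python
def recall_at_k(scores, held_out_items, k):
    """Fraction of held-out items that appear among the k highest-scored items."""
    if not held_out_items:
        return 0.0
    # Rank item indices by descending score and keep the first k.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = len(set(top_k) & set(held_out_items))
    return hits / len(held_out_items)


# Hypothetical scores for 6 items; items 1 and 4 were held out for this user.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
print(recall_at_k(scores, held_out_items={1, 4}, k=2))  # 0.5
```

With `k=2`, the top-ranked items are 1 and 3, so only one of the two held-out items is recovered.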



## Main.py
This model was trained with the default hyperparameters on a small dataset for 400 epochs. You can try a larger dataset, more training epochs, and different hyperparameters to improve the model's performance. The main.py --help command shows the different options available for working with this model.
The **main.py** script provides an entry point to all the provided functionalities. This includes running training, testing and inference. The behavior of the script is controlled by command-line arguments listed below in the Parameters section. The prepare_dataset.py script can be used to preprocess the MovieLens 20m dataset.


### Parameters
The most important command-line parameters include:

- > --data_dir which specifies the directory inside the docker container where the data will be stored, overriding the default location /data
- > --checkpoint_dir which controls if and where the checkpoints will be stored
- > --amp for enabling mixed precision training
- > There are also multiple parameters controlling the various hyperparameters of the training process, such as the learning rate, batch size etc.

To see the full list of available options and their descriptions, use the -h or --help command-line option.
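
When experimenting with several hyperparameter settings, it can help to assemble the command line programmatically. The sketch below is one possible approach: the helper `build_main_command` is hypothetical, the option names (`--train`, `--amp`, `--epochs`, `--lr`, `--checkpoint_dir`) are the ones that appear in this notebook's parameter dump and help text, and the values are illustrative only.

```python
import shlex


def build_main_command(flags=(), options=None):
    """Build an argv list for main.py from boolean flags and key/value options."""
    command = ["python", "main.py"]
    command += [f"--{flag}" for flag in flags]
    for name, value in (options or {}).items():
        command += [f"--{name}", str(value)]
    return command


# Option names come from this notebook; the values are illustrative only.
command = build_main_command(
    flags=["train", "amp"],
    options={"epochs": 200, "lr": 0.002, "checkpoint_dir": "./checkpoints"},
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` or pasted into a `%%bash` cell.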

In [5]:
%%bash
python main.py --help


usage: main.py [-h] [--train] [--test] [--inference_benchmark] [--amp]
               [--epochs EPOCHS] [--batch_size_train BATCH_SIZE_TRAIN]
               [--batch_size_validation BATCH_SIZE_VALIDATION]
               [--validation_step VALIDATION_STEP]
               [--warm_up_epochs WARM_UP_EPOCHS]
               [--total_anneal_steps TOTAL_ANNEAL_STEPS]
               [--anneal_cap ANNEAL_CAP] [--lam LAM] [--lr LR] [--beta1 BETA1]
               [--beta2 BETA2] [--top_results TOP_RESULTS] [--xla] [--trace]
               [--activation ACTIVATION] [--log_path LOG_PATH] [--seed SEED]
               [--data_dir DATA_DIR] [--checkpoint_dir CHECKPOINT_DIR]

Train a Variational Autoencoder for Collaborative Filtering in TensorFlow

optional arguments:
  -h, --help            show this help message and exit
  --train               Run training of VAE
  --test                Run validation of VAE
  --inference_benchmark
                        Measure inference latency and throughput on 



