# This notebook was run on an EC2 g4dn.4xlarge instance using the papermill library. Open it in VSCode to show all output.

The whole notebook takes some time to run.

This notebook is located in the final_presentation/ folder.

To run this notebook from a terminal on a remote server, install all packages in requirements.txt and run "**python3 run_notebook.py**" from the final_presentation folder.
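For orientation, a papermill runner script can be very small. The sketch below is an assumption about what such a script might look like, not the contents of run_notebook.py (the next cell prints the real source). The helper names `output_path_for` and `run_notebook_sketch`, the `_output` suffix, and the example notebook name are hypothetical; `papermill.execute_notebook(input_path, output_path)` is the library's standard entry point.

```python
from pathlib import Path


def output_path_for(notebook_path: str, suffix: str = "_output") -> str:
    """Return a sibling path for the executed copy of a notebook."""
    path = Path(notebook_path)
    return path.with_name(f"{path.stem}{suffix}{path.suffix}").as_posix()


def run_notebook_sketch(notebook_path: str) -> str:
    """Execute a notebook top to bottom with papermill and return the output path."""
    # Imported lazily so the path helper works without papermill installed.
    import papermill as pm

    output_path = output_path_for(notebook_path)
    # Runs every cell in order and writes the executed notebook to output_path.
    pm.execute_notebook(notebook_path, output_path)
    return output_path


# Hypothetical notebook name, used only to show the naming scheme.
example_output = output_path_for("final_presentation/demo.ipynb")
```

Writing the executed copy to a separate file keeps the source notebook free of stale outputs.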

In [None]:
cd ..

In [None]:
import inspect
from final_presentation.run_notebook import run_notebook

lines = inspect.getsource(run_notebook)
print(lines)

# FileUtil Initialization

FileUtil is the class used to access storage: it saves and retrieves models and files.

In [None]:
from src.utils.file_util import FileUtil

file_util = FileUtil()

In [None]:
print("Methods in FileUtil:", [func for func in dir(FileUtil) if callable(getattr(FileUtil, func)) and not func.startswith("__")])
print()
print("Attributes in FileUtil:", list(FileUtil().__dict__.keys()))

# Raw Train Data

Raw train data is stored in data/raw/reviews.csv.

The file path is specified in the config.
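To illustrate the idea of a config-driven path, here is a minimal sketch that assumes the config has been loaded into a dictionary. `FileUtilSketch`, the `raw_train_path` key, and the `Sentiment` column are hypothetical and stand in for whatever FileUtil and the config actually define.

```python
import csv
import os
import tempfile


class FileUtilSketch:
    """Hypothetical stand-in for FileUtil: resolves file paths from a config mapping."""

    def __init__(self, config: dict, root: str = "."):
        self.root = root
        # The key name is an assumption; the real config may use another one.
        self.raw_train_path = config["raw_train_path"]

    def get_raw_train_data(self) -> list:
        """Read the raw training CSV into a list of row dictionaries."""
        path = os.path.join(self.root, self.raw_train_path)
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))


# Usage: write a tiny CSV under a temporary project root and read it back.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "data", "raw"))
    csv_path = os.path.join(root, "data", "raw", "reviews.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([["Text", "Sentiment"], ["Great tea", "positive"]])

    util = FileUtilSketch({"raw_train_path": "data/raw/reviews.csv"}, root=root)
    rows = util.get_raw_train_data()
```

Keeping the path in config means the same code can read a different file without edits to the class.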

In [None]:
file_util.get_raw_train_data()

# Preprocessing

Before model training, we need to preprocess the train data, since the training process retrieves the processed train data directly. The **preprocess_train** method preprocesses the train data and saves the result into storage.

Note: To run preprocessing via terminal, run "**python3 -m src.preprocessing.transformations**" from the h2o2.ai project folder.
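As a rough illustration of what a cleaning step can involve, the sketch below strips HTML tags, unescapes entities, lowercases, and handles null reviews. It is not the project's implementation: `clean_review`, `preprocess_rows`, and the `Text` input column are assumptions chosen for the example, and the actual cleaning rules are the ones printed from the source further down.

```python
import html
import re
from typing import Optional

TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_review(text: Optional[str]) -> str:
    """Strip HTML tags, unescape entities, collapse whitespace, and lowercase."""
    if text is None:
        return ""  # null reviews become empty strings instead of raising
    without_tags = TAG_PATTERN.sub(" ", text)
    unescaped = html.unescape(without_tags)
    return WHITESPACE_PATTERN.sub(" ", unescaped).strip().lower()


def preprocess_rows(rows: list) -> list:
    """Add a cleaned_text field to each row and drop rows that end up empty."""
    cleaned = [{**row, "cleaned_text": clean_review(row.get("Text"))} for row in rows]
    return [row for row in cleaned if row["cleaned_text"]]


sample = preprocess_rows([{"Text": "<b>Nice</b> tea"}, {"Text": None}])
```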

In [None]:
from src.preprocessing.transformations import preprocess_train

In [None]:
preprocess_train()

In [None]:
file_util.get_processed_train_data()

In [None]:
lines = inspect.getsource(preprocess_train)
print(lines)

In [None]:
from src.preprocessing.transformations import apply_cleaning_train

lines = inspect.getsource(apply_cleaning_train)
print(lines)

# Sentiment Analysis

## Training

We experimented with 3 models:
1. BERT
2. LSTM with Word2Vec embeddings
3. Logistic Regression with Word2Vec embeddings

All trained models are saved to storage and can be used for prediction.

Training takes 30 minutes on the EC2 instance.

Note: To run sentiment analysis training via terminal, run "**python3 -m src.models.sentiment_analysis.train.train**" from the h2o2.ai project folder.

In [None]:
from src.models.sentiment_analysis.train.train import sentiment_analysis_train

In [None]:
sentiment_analysis_train()

## Evaluation

In [None]:
file_util.get_sentiment_viz_png()

Because the data is imbalanced, we evaluated model performance with the following metrics, which are written to "metrics.json" in the "eval" folder:
1. Average Precision Score
2. PR-AUC Score

The **FileUtil.get_metrics** function retrieves these saved metrics.

BERT performs slightly better than LSTM here. However, the LSTM model is unstable: on some runs its PR AUC and Average Precision are far worse than BERT's (< 90%).
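For readers unfamiliar with Average Precision, the sketch below computes it from scratch: the mean of the precision values at each rank where a positive example appears. It assumes binary labels and no tied scores, and the labels and scores are made up for illustration. PR-AUC is commonly obtained by integrating the precision-recall curve instead, which can differ slightly from this step-wise sum.

```python
def average_precision(y_true: list, scores: list) -> float:
    """Mean of precision@k over the ranks k at which a positive label occurs.

    Assumes binary labels (1 = positive) and no tied scores.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive labels")

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Made-up example: positives are ranked 1st and 3rd, so AP = (1/1 + 2/3) / 2.
labels = [1, 0, 1, 0]
predicted = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, predicted)
```

Unlike accuracy, this metric ignores true negatives, which is why it is often preferred when one class dominates.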

In [None]:
metrics = file_util.get_metrics("sentiment_analysis")
metrics

In [None]:
# Sort (model name, PR AUC) pairs in ascending order; the best model is last.
models_prauc = sorted(((name, scores["PR AUC"]) for name, scores in metrics.items()), key=lambda pair: pair[1])
best_model, best_prauc = models_prauc[-1]
print("Best model is {} with PR-AUC {}".format(best_model, best_prauc))

# Topic Modelling

## Training

**Goal: Identify topics relevant to our use case.**

Models:
1. Non-negative Matrix Factorization (NMF) with Tf-Idf vectorization
2. Latent Dirichlet Allocation (LDA) with Bag of Words vectorization
3. BERTopic

Training takes 4.5 minutes locally.

All topic model results are saved into the eval folder. These graphs are then used to determine seed topics for the prediction pipeline.

Note: To run topic modelling training via terminal, run "**python3 -m src.models.topic_modelling.train.train**" from the h2o2.ai project folder.

In [None]:
from src.models.topic_modelling.train.train import topic_modelling_train

In [None]:
topic_modelling_train()

## Evaluation

A custom visualisation function previews the top words for each topic, to capture the most representative words in each topic.
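The core of a "top words by topic" preview is a per-topic ranking of word weights. The sketch below shows that ranking step only, assuming the weights are available as nested dictionaries. `top_words_by_topic` is a hypothetical helper, and the weights are invented for illustration rather than taken from the trained models.

```python
def top_words_by_topic(topic_word_weights: dict, n_words: int = 5) -> dict:
    """Return the n highest-weighted words for each topic, heaviest first."""
    top_words = {}
    for topic, weights in topic_word_weights.items():
        ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
        top_words[topic] = [word for word, _ in ranked[:n_words]]
    return top_words


# Made-up weights purely for illustration; not results from the trained models.
example_weights = {
    "topic_0": {"tea": 0.42, "coffee": 0.35, "cup": 0.10, "bag": 0.05},
    "topic_1": {"dog": 0.50, "treat": 0.30, "cat": 0.15, "food": 0.05},
}
preview = top_words_by_topic(example_weights, n_words=2)
```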

In [None]:
fig = file_util.get_topics_html("LDA")
fig.update_layout(width = 700, height = 800)

In [None]:
fig = file_util.get_topics_html("BERTopic")
fig.update_layout(width = 700, height = 1000)

In [None]:
fig = file_util.get_topics_html("NMF")
fig.update_layout(width = 700, height = 550)

Topics:

1. **Drinks**: Drinks, Tea, Coffee, Juice, Soda
2. **Snacks**: Snacks, Nuts, Chips, Crackers, Protein Bars, Cereal
3. **Ingredients**: Ingredients, Sugar, Salt, Oil, Coconut, Olive, Cocoa, Cacao, Sweetener, Gluten
4. **Flavour/Seasoning**: Flavour, Taste, Seasoning, Spices, Sauce, Chili
5. **Baked Goods**: Baked Goods, Pastries, Cookies, Bread
6. **Noodles & Pasta**: Noodles, Pasta, Ramen, Udon
7. **Pet Food**: Dog Food, Cat Food, Pet Food, Dog Treat

# Predict reviews_test.csv

1. The TEST_FILE_NAME and best_sentiment_analysis_model attributes in FileUtil are read from the config.yml file. To change the test file name or the sentiment analysis model used for prediction, edit the config file.
2. The **predict_sentiment_topic** function takes no parameters: it reads the file given by the test file name if one is set, and otherwise defaults to the train data (df=FileUtil().get_raw_train_data()).
3. The **predict_sentiment_topic** function calls the following three functions:


> *   **apply_cleaning_test** : preprocessing
> *   **predict_sentiment** : uses the best_sentiment_analysis_model specified in config to predict sentiment labels and their probabilities
> *   **predict_topic** : Lbl2TransformerVec using the predefined seed topics in config, as identified during training

4. The result df from the **predict_sentiment_topic** function is saved to the data/predicted/ folder, using the current datetime as the csv name.

Note: To run predict_sentiment_topic via terminal, run "**python3 -m src.models.predict**" from the h2o2.ai project folder.
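The flow described in the steps above can be sketched as a simple composition of three stages plus a timestamped output name. This is a toy outline under stated assumptions, not the project's code: `predict_pipeline`, `predicted_csv_name`, the timestamp format, the `Text` input column, and the three toy stage functions are all hypothetical. The real stages use the configured sentiment model and Lbl2TransformerVec.

```python
from datetime import datetime
from typing import Callable


def predicted_csv_name(now: datetime) -> str:
    """Build a timestamped file name; the exact format is an assumption."""
    return now.strftime("%Y%m%d_%H%M%S") + ".csv"


def predict_pipeline(rows: list, clean: Callable, sentiment: Callable, topic: Callable) -> list:
    """Run cleaning, sentiment prediction, and topic prediction in that order."""
    results = []
    for row in rows:
        cleaned_text = clean(row["Text"])
        label, probability = sentiment(cleaned_text)
        results.append({
            **row,
            "cleaned_text": cleaned_text,
            "sentiment": label,
            "sentiment_prob": probability,
            "topic": topic(cleaned_text),
        })
    return results


# Toy stand-ins for the three stages, used only to exercise the flow.
def toy_clean(text: str) -> str:
    return text.strip().lower()


def toy_sentiment(text: str) -> tuple:
    return ("positive", 0.9) if "good" in text else ("negative", 0.6)


def toy_topic(text: str) -> str:
    return "Drinks" if "tea" in text else "Snacks"


predictions = predict_pipeline([{"Text": " Good TEA "}], toy_clean, toy_sentiment, toy_topic)
file_name = predicted_csv_name(datetime(2023, 4, 1, 9, 30, 0))
```

Passing the stages in as functions keeps the orchestration testable without loading any model.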

In [None]:
cd ..

In [None]:
from src.models.predict import predict_sentiment_topic

In [None]:
file_util.TEST_FILE_NAME

In [None]:
file_util.best_sentiment_analysis_model

In [None]:
lines = inspect.getsource(predict_sentiment_topic)
print(lines)

#### Displaying predicted output

Predictions are made on the reviews_test.csv file, as specified by the TEST_FILE_NAME attribute in config.yml. Notice that four new columns are added: the sentiment labels and their probabilities, as well as the subtopics and topics.

Prediction on the 3k-row test data takes 7 mins 40 secs.

In [None]:
test_bert = predict_sentiment_topic()
test_bert.head()

In [None]:
len(test_bert)

#### Dropping and renaming columns to align with the required format

In [None]:
test_output = test_bert.drop(["cleaned_text", "subtopic", "topic"], axis = 1)
test_output = test_output.rename(columns = {"partially_cleaned_text": "Text", "date": "Time", "sentiment": "predicted_sentiment", 
                            "sentiment_prob": "predicted_sentiment_prob"})

In [None]:
test_output.head()

In [None]:
test_output.to_csv("final_presentation/reviews_test_predictions_h2o2.ai.csv")

## Visualizations

We import all the functions we wrote in src.visualisation.dashboard_viz, which plot various visualisations using the plotly library.

We have developed visualisations for **sentiments**, **topics** and **specific topics**.


In [None]:
import pandas as pd
from src.visualisation.dashboard_viz import *

vis_df = reformat_data(test_bert)

### Visualizations for sentiments

In [None]:
sentiment_pie_chart_fig = sentiment_pie_chart(vis_df)
sentiment_trend_fig = sentiment_line_chart_over_time(vis_df)
topics_sentiment_fig = topics_bar_chart(vis_df)

display(sentiment_pie_chart_fig.update_layout(width = 500, height = 300, title='Overall Sentiment Breakdown'))
display(sentiment_trend_fig.update_layout(title='Sentiment trend'))
display(topics_sentiment_fig.update_layout(title='Topics by Sentiment'))

### Visualizations for topics

In [None]:
topics_pie_chart_fig = topics_pie_chart(vis_df)
topics_bar_chart_fig = topics_bar_chart_over_time(vis_df, time_frame='Q')
top_key_words_fig = visualise_all_topics(vis_df)

display(topics_pie_chart_fig.update_layout(width = 500, height = 300, title='Frequency of topics'))
display(topics_bar_chart_fig.update_layout(title='Topics over Time'))
display(top_key_words_fig)

### Visualizations for a specific topic

We will be exploring the *Drinks* topic.

In [None]:
# Subtopics in each topic
select_topic = 'Drinks'

subtopic_fig = get_subtopics(vis_df, topic=select_topic)
subtopic_sentiment_fig = sentiment_pie_chart(vis_df[vis_df["topic"]==select_topic])

display(subtopic_sentiment_fig.update_layout(width = 500, height = 300,  title=f'Sentiment Breakdown for {select_topic}'))
display(subtopic_fig.update_layout(width = 500, height = 300))

# App Demo

- The app runs on Docker
- The Docker image takes 30 mins to build
- The Docker image is 14 GB in size

# Unit Testing

We unit tested all functions in all modules: preprocessing, models (predict, training, and the methods of each model), and utils.

We also tested the functions' behaviour on edge cases (e.g. cleaning null reviews).

Note: To run unit testing via terminal, run "**python3 -m src.unittest.unit_testing**" from the h2o2.ai project folder.
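To make the edge-case idea concrete, here is a small sketch of tests for a null-safe cleaning function, written with the standard-library `unittest` module. The project's own tests may be structured differently; `clean_text`, `CleaningEdgeCases`, and `run_edge_case_tests` are hypothetical names used only for this example.

```python
import unittest
from typing import Optional


def clean_text(text: Optional[str]) -> str:
    """Hypothetical function under test: null-safe trim and lowercase."""
    return "" if text is None else text.strip().lower()


class CleaningEdgeCases(unittest.TestCase):
    def test_null_review_becomes_empty_string(self):
        self.assertEqual(clean_text(None), "")

    def test_whitespace_only_review_becomes_empty_string(self):
        self.assertEqual(clean_text("   "), "")

    def test_regular_review_is_normalised(self):
        self.assertEqual(clean_text("  Great Tea "), "great tea")


def run_edge_case_tests() -> unittest.TestResult:
    """Run the suite programmatically, as one might from a notebook cell."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleaningEdgeCases)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_edge_case_tests()
```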

In [None]:
import src.unittest.unit_testing
from src.unittest.unit_testing import unit_test

In [None]:
print("Methods in unit testing:", [method for method in dir(src.unittest.unit_testing) if method.startswith("test")])

In [None]:
unit_test()

In [None]:
from src.unittest.unit_testing import test_apply_cleaning_train

lines = inspect.getsource(test_apply_cleaning_train)
print(lines)

In [None]:
from src.unittest.unit_testing import test_predict_when_all_stopwords

lines = inspect.getsource(test_predict_when_all_stopwords)
print(lines)

In [None]:
from src.unittest.unit_testing import test_predict_sentiment_topic

lines = inspect.getsource(test_predict_sentiment_topic)
print(lines)

# Modular Code

In [None]:
import os

def list_files(startpath):
    """Print the directory tree under startpath, skipping __pycache__ folders."""
    for root, dirs, files in os.walk(startpath):
        # Skip bytecode cache directories.
        if os.path.basename(root) == "__pycache__":
            continue
        # Depth relative to startpath determines the indentation.
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * level
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

In [None]:
list_files("src")

# OOP

In [None]:
from src.models.classifier import Classifier
from src.models.sentiment_analysis.train.bert import BERT
from src.models.sentiment_analysis.train.logreg import LOGREG
from src.models.sentiment_analysis.train.lstm import Lstm
from src.models.topic_modelling.train.bertopic import BERTopic_Module
from src.models.topic_modelling.train.lda import LDA
from src.models.topic_modelling.train.nmf import Tfidf_NMF_Module
from src.models.topic_modelling.test.lbl2vec import Lbl2Vec

print(isinstance(BERT(), Classifier))
print(isinstance(LOGREG(), Classifier))
print(isinstance(Lstm(), Classifier))
print(isinstance(BERTopic_Module(), Classifier))
print(isinstance(LDA(), Classifier))
print(isinstance(Tfidf_NMF_Module(), Classifier))
print(isinstance(Lbl2Vec(), Classifier))

In [None]:
print("Methods in Classifier:", [func for func in dir(Classifier) if callable(getattr(Classifier, func)) and not func.startswith("__")])

In [None]:
print("Methods in BERT:", [func for func in dir(BERT) if callable(getattr(BERT, func)) and not func.startswith("__")])
print()
print("Attributes in BERT:", list(BERT().__dict__.keys()))

In [None]:
print("Methods in BERTopic_Module:", [func for func in dir(BERTopic_Module) if callable(getattr(BERTopic_Module, func)) and not func.startswith("__")])
print()
print("Attributes in BERTopic_Module:", list(BERTopic_Module().__dict__.keys()))

# Docstrings Examples

All docstrings are collated into Sphinx documentation.
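As a point of reference, Sphinx can render docstrings written with reStructuredText field lists (`:param:`, `:return:`). The sketch below shows that style on a hypothetical function. `put_csv_sketch` is not the project's FileUtil.put_csv, and the project may use a different docstring convention (for example Google or NumPy style); the real docstrings are shown by the `help()` calls below.

```python
import csv
import inspect
import os
import tempfile


def put_csv_sketch(rows: list, path: str) -> int:
    """Write row dictionaries to a CSV file.

    :param rows: Non-empty list of dictionaries that share the same keys.
    :type rows: list
    :param path: Destination file path.
    :type path: str
    :return: Number of data rows written.
    :rtype: int
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# inspect.getdoc returns the cleaned docstring text that help() displays.
docstring = inspect.getdoc(put_csv_sketch)

with tempfile.TemporaryDirectory() as folder:
    written = put_csv_sketch([{"Text": "Great tea"}], os.path.join(folder, "out.csv"))
```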

In [None]:
help(FileUtil.put_csv)

In [None]:
from src.preprocessing.preprocessing_utils import strip_html_tags_df
help(strip_html_tags_df)

In [None]:
help(sentiment_analysis_train)