# Lab 5 - Topic Modeling and Named Entity Recognition
## Exercises

___
## Topic Analysis and Unsupervised vs. Supervised Learning

**1. What is the difference between supervised and unsupervised learning? Discuss some benefits and issues for each approach in the context of topic analysis.**

Your answer here!

___

**2. You are presented with a large dataset of news articles where only 50% of the articles have labeled topics (finance, sports, politics, etc.). You want to assign labels to the remaining articles. Explain which approach you would take (no programming!).**

Your answer here!

___

**3. Could your approach to the previous question be improved by incorporating ideas from semi-supervised learning? Explain.**

Your answer here!

___

**4. Metrics are essential when dealing with machine learning. However, for unsupervised clustering (e.g., of topics), we have no ground-truth labels, so we cannot use the typical precision, recall, and F-measure metrics. What are the alternatives for this task?**

Your answer here!

___

## Topic Modeling

Given the five sentences:

>"Macrosoft announces a new Something Pro laptop with a detachable keyboard."

>"Melon Tusk unveils plans for a new spacecraft that could take humans to Mars."

>"The top-grossing movie of the year Ramvel Retaliators."

>"Geeglo releases a new version of its Cyborg operating system."

>"Fletnix announces a new series from the creators of Thinger Strangs."

**1. How would *you* (without programming) assign the listed sentences to separate topics? Consider techniques we have discussed in the course so far (especially Lab 4).**

Your answer here!

___

Two well-established algorithms for topic discovery are Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA).

**2. What preprocessing steps should we consider before implementing these algorithms?**

Your answer here!

___

**3. Both LSI and LDA require the user to specify the number of topic clusters. How can we attempt to *automatically* detect a reasonable number of topics?**

Your answer here!

___

## Practical Exercise - Topic Analysis and Modeling of Product Reviews
We will now use an Amazon product review dataset to perform topic modeling (see the included `amazon_train.csv` and `amazon_test.csv` files). The dataset contains reviews of "appliances", a subset (~100k reviews, ~50 MB) of the full product review corpus (<https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset> - 55GB!).

This task is open-ended: whether you cluster all reviews together or cluster within subsets grouped by review score is up to you.

**1. Load the dataset with `pandas`, apply the preprocessing steps you find suitable, and use at least five different techniques, based on what you have learned so far (e.g. word frequency), to visualize the dataset.**
- Hint: look up exploratory data analysis (EDA)
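
As a starting point for the word-frequency technique mentioned above, here is a minimal sketch using only the standard library. The `sample_reviews` list and the `word_frequencies` helper are hypothetical stand-ins for illustration; in the lab you would pass the `review_body` column instead, and the simple regex tokenizer is an assumption rather than a recommended choice.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the `review_body` column; swap in the real reviews.
sample_reviews = [
    "Great fridge, very quiet and keeps food cold.",
    "The filter is great but the fridge door is noisy.",
    "Quiet dishwasher. Great value!",
]

def word_frequencies(documents):
    """Count lowercase alphabetic tokens across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(re.findall(r"[a-z']+", str(document).lower()))
    return counts

frequencies = word_frequencies(sample_reviews)

# Print a small text bar chart of the three most frequent words.
for word, count in frequencies.most_common(3):
    print(f"{word:<10} {'#' * count} ({count})")
```

The same counts could feed a bar plot or a word cloud; note that without stopword removal, words like "the" and "is" will likely dominate on the full dataset.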

In [None]:
# TODO visualize the reviews in at least five different ways
# must include some of the techniques used in this course (such as word frequency)
# you can use plots, graphs, trees, lists, ... anything, really!
import pandas as pd
df = pd.read_csv("amazon_train.csv")
df.head()

___

**2. Before implementing off-the-shelf topic models, it is useful to consider how to process data for topic analysis. Consider what you have learned so far to generate a processing function and discuss your findings. This should only operate on a word level!**

Below is a snippet to fetch some examples from the review corpus. You can use these to test your output.

In [None]:
reviews = df["review_body"].sample(frac=1).tolist()
for review in reviews[:5]:
    print(review)

In [None]:
# TODO: a preprocessing function to gather words/groups of words/chunks that you consider important for topic analysis/modeling
from typing import List

def preprocess_for_topic(document: str) -> List[str]:
    """
    Preprocesses a document
    Args:
        document (str): The input document to be preprocessed.
    Returns:
        List[str]: A list of words obtained by splitting the document.

    Example:
        input: "This is a test."
        output: ["This", "is", "a", "test."]
    """
    return document.split()
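
The baseline above only splits on whitespace, so casing and punctuation survive (note the `"test."` token in the docstring example). One possible direction, sketched under assumptions, is shown below. The function name `preprocess_sketch` and the tiny `STOPWORDS` set are hypothetical; a real solution would likely use a fuller stopword list from NLTK or spaCy and may add lemmatization.

```python
import re
from typing import List

# A tiny illustrative stopword list (an assumption); NLTK or spaCy ship fuller ones.
STOPWORDS = {"a", "an", "and", "is", "it", "the", "this", "to", "of"}

def preprocess_sketch(document: str) -> List[str]:
    """Lowercase, keep alphabetic tokens, and drop stopwords and one-letter tokens."""
    tokens = re.findall(r"[a-z]+", str(document).lower())
    return [token for token in tokens if token not in STOPWORDS and len(token) > 1]

print(preprocess_sketch("This is a test of the new dishwasher."))
# prints ['test', 'new', 'dishwasher']
```

Whether to drop digits, keep negations such as "not", or merge frequent word pairs is a design choice worth discussing in your findings.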

___
**3. Using the same data, implement a topic model with LDA using the Gensim library. Experiment with different topic counts (e.g., 3) and retrieve the top 5 words from each topic. Discuss your findings.**

In [None]:
# TODO: LDA topic model using Gensim (and NLTK/SpaCy/scikit-learn if preferred for processing)
# should use earlier preprocessing ideas from the course.

___
**4. With the LDA model you trained above, perform topic prediction on a sample from the test dataset, and do a simple empirical evaluation of the results.**

In [None]:
import random
test_df = pd.read_csv("amazon_test.csv")
test_reviews = test_df["review_body"].tolist()

# some reviews are looong. let's filter out some on length.
test_reviews = [review for review in test_reviews if len(str(review).split()) < 30]
test_reviews = random.sample(population=test_reviews, k=10)
test_reviews

In [None]:
# TODO: predict topics.
# input: 10 random samples from `amazon_test.csv`
# output: for each review, print the review, its predicted topic, the top words of that topic, and the confidence score (topic probability).

___

# Named Entity Recognition

Previously, you learned about noun phrases such as "The slow white fox". Noun phrases that refer to a specific person ("Name Nameson"), place ("Mount Doom"), or organization ("NTNU") are examples of what we consider *named entities*.


**1. Can you think of named entity categories that are *not* noun phrases?**

Your answer here!

___

Named entity disambiguation (also called entity linking) is a crucial task in applications of NER. It assigns an identifier to each entity mention, so that mentions of the same real-world entity are linked together and mentions of different entities are kept apart. The disambiguation process often incorporates external knowledge (knowledge bases).
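
To make the idea of assigning identifiers with the help of a knowledge base concrete, here is a deliberately simple sketch. Everything in it is hypothetical: the `KNOWLEDGE_BASE` contents, the `kb:` identifiers, and the `link_entity` helper are made up for illustration, and scoring candidates by word overlap with the sentence is just one naive heuristic, not how production entity linkers work.

```python
from typing import Dict, Set

# Hypothetical toy knowledge base: candidate identifiers per surface form, each
# with context words that hint at that sense. Real systems use large KBs.
KNOWLEDGE_BASE: Dict[str, Dict[str, Set[str]]] = {
    "jordan": {
        "kb:Jordan_(country)": {"amman", "river", "country", "border"},
        "kb:Michael_Jordan": {"basketball", "bulls", "nba", "player"},
    },
}

def link_entity(mention: str, sentence: str) -> str:
    """Pick the candidate whose context words overlap most with the sentence."""
    candidates = KNOWLEDGE_BASE.get(mention.lower(), {})
    if not candidates:
        return "kb:UNKNOWN"
    words = set(sentence.lower().replace(".", " ").split())
    # On a tie, max() keeps the first candidate listed in the knowledge base.
    return max(candidates, key=lambda entity_id: len(candidates[entity_id] & words))

print(link_entity("Jordan", "Jordan scored 40 points for the Bulls."))
print(link_entity("Jordan", "We crossed the border into Jordan near the river."))
```

The two calls link the same surface form to different identifiers because the surrounding words differ, which is the core intuition behind context-based disambiguation.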


Consider the sentences:

- "I ate an apple in New York"
- "New York Times wrote an article about Apple"
- "New York is also known as the Big Apple"

**2. How would you tackle the task of distinguishing the entities found here? Describe your approach either in text or by code.**

In [None]:
# your answer here

___

**3. Load the reviews dataset and extract named entities and their categories from 100 reviews. Visualize the named entity categories and their frequencies.**

- For visualization, you can use tables or plots (e.g. `matplotlib` or `seaborn`)
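
For the table option, a minimal sketch of the counting step is shown below. The `extracted_entities` list is hypothetical and stands in for the output of whichever NER tool you choose (for example, the `ent.text` and `ent.label_` attributes of the entities in a spaCy `doc.ents`); the extraction itself is left to the exercise.

```python
from collections import Counter

# Hypothetical (entity text, category) pairs standing in for real NER output,
# e.g. the `ent.text` and `ent.label_` of each entity in spaCy's `doc.ents`.
extracted_entities = [
    ("Amazon", "ORG"), ("two weeks", "DATE"), ("Whirlpool", "ORG"),
    ("$40", "MONEY"), ("GE", "ORG"), ("last year", "DATE"),
]

# Count how often each category occurs, ignoring the entity text.
category_counts = Counter(category for _, category in extracted_entities)

# Print a frequency table, most common category first.
print(f"{'category':<10}{'count':>5}")
for category, count in category_counts.most_common():
    print(f"{category:<10}{count:>5}")
```

The resulting `category_counts` mapping can be passed directly to a bar plot if you prefer a chart over a table.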

In [None]:
# TODO: group entity categories and visualize by frequency.