<div class="alert alert-danger" style="color:black"><b>Running ML-LV Jupyter Notebooks:</b><br>
    <ol>
        <li>Make sure you are running all notebooks using the <code>adv_ai</code> kernel.</li>
        <li><b>It is very important that you do not create any additional files within the weekly folders on CSCT cloud.</b> Any additional files, or editing the notebooks with a different environment may prevent submission/marking of your work.</li>
            <ul>
                <li>NBGrader will automatically fetch and create the correct folders and files for you.</li>
                <li>All files that are not the Jupyter notebooks should be stored in the 'ML-LV/data' directory.</li>
            </ul>
        <li>Please <b>do not pip install</b> any Python packages (or anything else). You should not need to install anything to complete these notebooks other than the packages provided in the Jupyter CSCT Cloud environment.</li>
    </ol>
    <b>If you would like to run this notebook locally you should:</b><br>
    <ol>
        <li>Create an environment using the requirements.txt file provided. <b>Any additional packages you install will not be accessible when uploaded to the server and may prevent marking.</b></li>
        <li>Download a copy of the notebook to your own machine. You can then edit the cells as you wish, and afterwards copy your code into the corresponding cells of the notebook on the CSCT cloud, editing them in place.</li>
        <li><b>It is very important that you do not re-upload any notebooks that you have edited locally.</b> This is because NBGrader uses cell metadata to track marked tasks. <b>If you change this format it may prevent marking.</b></li>
    </ol>
</div>

# Practical 1: Data Acquisition

Machine learning algorithms require **a lot** of data, typically the more the better. Of course, there are many pre-existing datasets available and often used for learning purposes, or as benchmarks for particular NLP tasks, such as SQuAD and GLUE. These datasets are often well studied and can simply be downloaded and used with minimal pre-processing.

However, applying NLP to a new problem or task will often require data to be gathered, processed and, if ground-truth labels are needed (e.g. for supervised learning), annotated. Indeed, data acquisition is often one of the most time-consuming and labour-intensive parts of any NLP project. Depending on the problem, the data could come from existing documents, be created by hand, or be drawn from the largest source of information: the internet. [Web scraping](https://en.wikipedia.org/wiki/Web_scraping) allows us to extract data from websites, so it is possible to obtain huge amounts of information. In fact, scraping was used to extract the ~500 billion token datasets used to train some of the largest state-of-the-art (SOTA) language models, like GPT-3 ([Brown, T.B., et al., 2020](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)).

In this practical we will use web scraping to gather some movie reviews written by IMDB users, specifically for the movies in [The Best Worst Movies](https://www.imdb.com/list/ls003589177/) list. Then, we will annotate these reviews with a sentiment, positive or negative. In later practicals we will learn how to process this data and then build a model to classify the sentiment.

The objectives of this practical are:
1. Understand the process of web scraping to obtain data

2. Use existing tools to annotate data and manage data versioning

3. Consider the legal and ethical implications of web scraping and data acquisition in general

4. Produce a set of IMDB user reviews, annotated with positive or negative sentiment

## 1.0 Import libraries

You should already be familiar with most of these Python libraries. For the web scraping we will use two in particular:

1. Requests - allows us to make HTTP requests for web pages i.e. ask a web server to send a web page and its data.

2. [Beautiful Soup 4 (bs4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - is a Python library for parsing and navigating HTML files. This makes the job of finding the data we want, from within a received page, much easier.
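As a quick offline illustration of what Beautiful Soup does, the sketch below parses a small hand-written HTML string (invented for this example, not taken from IMDB) and pulls out the text of two tags. No HTTP request is made, so Requests is not needed here.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for a downloaded page
html = "<html><body><h1>My page</h1><p class='intro'>Hello, <b>world</b>!</p></body></html>"

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag; .text gives its text without markup
heading = soup.find('h1').text
intro = soup.find('p', {'class': 'intro'}).text

print(heading)
print(intro)
```

In a real scrape, the `html` string would be replaced by the content of a response returned by Requests.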

In [2]:
import os
import time
import random
import requests
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import clear_output

# Get the status of NBgrader (for skipping cell execution while validating/grading)
grading = True if os.getenv('NBGRADER_EXECUTION') else False

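# Get the project directory (should be in ML-LV)
path = os.path.abspath('')
while os.path.basename(path) != 'ML-LV':
    parent = os.path.dirname(path)
    # Stop at the filesystem root instead of looping forever
    if parent == path:
        raise FileNotFoundError("Could not find an 'ML-LV' directory above the current working directory.")
    path = parent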

# Set the directory to the data folder (should be in ML-LV/data/imdb)
data_dir = os.path.join(path, 'data', 'imdb')

# Create the directory if it doesn't exist
os.makedirs(data_dir, exist_ok=True)

# Set the directory to the shared dataset folder (should be in shared/datasets/imdb)
dataset_dir = os.path.join(path, '..', 'shared', 'datasets', 'imdb')

## 1.1 Get a list of movie names and the URL to their IMDB page

As previously stated, we will be getting reviews for movies in IMDB's curated list of [The Best Worst Movies](https://www.imdb.com/list/ls003589177/). If you follow the link you can see the list of movies for yourself.

The process of web scraping simply involves requesting a web page from a server and then extracting the data we are interested in. However, in reality we may not know the exact URL, or we may wish to scrape many web pages at once. In this case we have a list of movies and we need to find the links to each of their IMDB pages. Unfortunately IMDB would prefer we use their API for retrieving data, so to simplify the process we will use a list of movies stored in CSV format. The following demonstrates how to scrape the first review of the first movie in the list.

1. First we will load the .csv file containing the movie titles and their URLs.

2. Next we send a request for the first movie's review page. The response is the same information used by your browser to render the page. If you uncomment `print(response.content)` you can see the full response (it's pretty horrible) and `print(response.status_code)` tells us if it returned correctly or if there was an error.

3. Then we use Beautiful Soup to parse the response into a more manageable object ('soup'). Again, if you uncomment `print(soup.prettify())` you can see what this looks like (better, but still horrible).

4. Now we can begin to parse the page's data to find the review titles and contents. If you open the page in your browser you can right click on a review title and select 'inspect'. This will open the developer console and you should see that each review title is actually in an `h3` tag (of class `ipc-title__text`) and the review body is held within a `div` tag (of class `ipc-html-content-inner-div`). So we can use bs4 to get a list of all the titles and review contents of these types.

5. Finally, we also replace `br` tags with a space to preserve separation in the original text.
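To see why step 5 matters, here is a minimal offline sketch. The markup is hand-written to mimic the tag and class names described in step 4; it is not real IMDB output, and the review text is invented.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above (not real IMDB output)
html = """
<h3 class="ipc-title__text">Such a shame!</h3>
<div class="ipc-html-content-inner-div">First line.<br/>Second line.</div>
"""

soup = BeautifulSoup(html, 'html.parser')
title = soup.find('h3', {'class': 'ipc-title__text'}).text
body = soup.find('div', {'class': 'ipc-html-content-inner-div'})

# Without any replacement the two sentences run together
text_before = body.text

# Swap each <br> for a space so the sentences stay separated
for br in body.find_all('br'):
    br.replace_with(' ')
text_after = body.text

print(repr(text_before))
print(repr(text_after))
```

Comparing the two printed strings shows that the line break is simply dropped unless it is replaced first.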

In [3]:
# Load the list of best and worst movies
movie_list = pd.read_csv(os.path.join(dataset_dir, 'imdb_best_worst_list.csv'))

# Drop the columns that are not needed
movie_list = movie_list[['Title', 'URL']]
print("IMDB Best and Worst Movies")
print(movie_list.head())
print(f"There are {movie_list.shape[0]} in the list.")

# Send http request to get the review page
# Appending "reviews/" to the movie url gets the review page
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'}
response = requests.get(movie_list['URL'][0] + 'reviews/', headers=header)
# print(response.status_code)
# print(response.content)

soup = BeautifulSoup(response.content, 'html.parser')
# print(soup.prettify())

# Get the titles of the first review
review_titles = soup.find_all('h3', {'class': 'ipc-title__text'})
titles = [t.text for t in review_titles]
print(f"First review title: {titles[0]}")

# Get the text of the first review
review_contents = soup.find_all('div', {'class': 'ipc-html-content-inner-div'})
# Replace the <br> tags with spaces
for c in review_contents:
    for br in c.find_all('br'):
        br.replace_with(' ')
reviews = [c.text for c in review_contents]
print(f"First review content: {reviews[0]}")

IMDB Best and Worst Movies
                              Title                                    URL
0  Superman IV: The Quest for Peace  https://www.imdb.com/title/tt0094074/
1   Monsters Crash the Pajama Party  https://www.imdb.com/title/tt0059466/
2                    Batman Forever  https://www.imdb.com/title/tt0112462/
3                    Batman & Robin  https://www.imdb.com/title/tt0118688/
4                               Ape  https://www.imdb.com/title/tt0074148/
There are 134 in the list.
First review title: Such a shame!
First review content: SUPERMAN IV: THE QUEST FOR PEACE  OK... so everyone knows that this is the worst Superman movie ever made... but if you have not seen it in a while, you should watch it.  It is still pretty rubbish, but it is not as bad as I remember.  The story is not that bad... Superman rids planet Earth of all the nuclear weapons, and in doing so unknowingly creates a super villain named Nuclear Man thanks to arch rival Lex Luthor.  The movie does s

## 1.2 Get the user reviews

Now that we have the links for each movie we can get their reviews. As you might expect, the movies in this list are quite controversial. For our purposes this means we should have more balance between the sentiment classes. To keep things manageable we'll get 2 random reviews from each of the first 50 movies, for a nice even total of 100.

The process is similar to the previous step:

1. Loop over each movie and request its review page (movie_url + "reviews/").

2. Get the titles of the reviews and also the main texts.

3. Store these in a list of dictionaries, along with a unique review id.

4. Create a DataFrame to hold the review id, movie name, URL, review title, review text and an empty sentiment column, then save as .csv.
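Steps 3 and 4 can be tried offline first. The sketch below uses invented titles and reviews for a single hypothetical movie (`tt0000000` is a placeholder id, not a real one) to show how the review id is built from the URL and how the rows become a DataFrame.

```python
import random
import pandas as pd

# Made-up stand-ins for one movie's scraped titles and review texts
movie_name = 'Example Movie'
movie_url = 'https://www.imdb.com/title/tt0000000/'  # placeholder id
titles = ['Title A', 'Title B', 'Title C']
reviews = ['Review A', 'Review B', 'Review C']

# The movie id is the second-to-last piece of the URL once split on '/'
movie_id = movie_url.split('/')[-2]

# Pair each title with its review, remember its position, then sample 2 pairs
pairs = list(enumerate(zip(titles, reviews)))
rows = []
for i, (title, review) in random.sample(pairs, k=2):
    rows.append({'id': f'{i}-{movie_id}', 'name': movie_name, 'url': movie_url,
                 'title': title, 'review': review})

sample_df = pd.DataFrame(rows)
sample_df['sentiment'] = None  # Empty column to be filled in during annotation
print(sample_df)
```

Because `random.sample` draws without replacement, the two positions differ, so the ids for one movie are unique.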

<div class="alert alert-success" style="color:black"><b>Note:</b> The review page only shows the first 25 reviews.<br>
We could use pagination to get the rest, but let's just stick with 25 for now.<br>

<b>This may take a few minutes to complete!</b>
</div>

<div class="alert alert-warning" style="color:black"><b>Legality of Web Scraping:</b> There are all kinds of <a href="https://www.blog.datahut.co/post/is-web-scraping-legal">legal and ethical considerations</a> surrounding web scraping, including copyright and the scraping of non-public data or data behind a login, such as on Facebook or LinkedIn.<br>

Notice that there is a time delay added after each movie request has been processed? This is to slow down the number of requests per second and prevent repeated requests overloading the server, or at least creating unnecessary traffic. Excessive 'crawl rates' could violate "trespass to chattels" law, though for this use case it is unlikely. Still, it is worth being polite while scraping.
</div>
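One common way of being polite is to check a site's `robots.txt` rules before requesting a page, alongside a randomised pause between requests. The sketch below is only an illustration: the `robots.txt` content and the `example.com` URLs are made up, and `polite_sleep` is a hypothetical helper rather than part of this practical's code. It uses Python's standard `urllib.robotparser` module.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, used here instead of downloading a real one
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() reports whether the rules allow a given user agent to request a URL
allowed = parser.can_fetch('*', 'https://example.com/public/page')
blocked = parser.can_fetch('*', 'https://example.com/private/page')
delay = parser.crawl_delay('*')  # None when no Crawl-delay is given


def polite_sleep(min_seconds=2, max_seconds=5):
    """Pause for a random whole number of seconds and return the time waited."""
    wait = random.randint(min_seconds, max_seconds)
    time.sleep(wait)
    return wait


print(allowed, blocked, delay)
```

For a real site, the parser can instead be pointed at the live file with `set_url(...)` followed by `read()`.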

In [4]:
if not grading:
    # Let's get 2 random reviews for a subset of 50 movies
    # Alternatively you could get all movies (time consuming)
    num_movies, num_reviews = 50, 2
    movie_reviews = []

    for movie_index in range(num_movies):
        # Get the movie name and url
        movie_name = movie_list['Title'][movie_index]
        movie_url = movie_list['URL'][movie_index]
        print(f"Getting reviews for {movie_name} at {movie_url}")
  
        # Send http request to get the review page
        # Appending "reviews/" to the movie url gets the review page
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'}
        response = requests.get(movie_url + 'reviews/', headers=header)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the review titles
        review_titles = soup.find_all('h3', {'class': 'ipc-title__text'})
        titles = [t.text for t in review_titles]

        # Get the text of each review
        review_contents = soup.find_all('div', {'class': 'ipc-html-content-inner-div'})
        # Replace the <br> tags with spaces
        for c in review_contents:
            for br in c.find_all('br'):
                br.replace_with(' ')
        reviews = [c.text for c in review_contents]

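        # Add a random sample of reviews to the list
        # (takes fewer than num_reviews if the page returned fewer, rather than raising an error)
        candidates = list(enumerate(zip(titles, reviews)))
        for i, (title, review) in random.sample(candidates, k=min(num_reviews, len(candidates))):
            review_id = str(i) + '-' + movie_url.split('/')[-2]  # Create unique review id from the movie id
            movie_reviews.append({'id': review_id, 'name': movie_name, 'url': movie_url, 'title': title, 'review': review})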

        # Add a time delay to prevent excessive requests
        time.sleep(random.randint(2, 5))

    # Create a dataframe from the list of reviews and add an empty sentiment column
    reviews_df = pd.DataFrame(movie_reviews)
    reviews_df['sentiment'] = None
    
    # Save to csv
    reviews_df.to_csv(os.path.join(data_dir, 'imdb_reviews_raw.csv'))

Getting reviews for Superman IV: The Quest for Peace at https://www.imdb.com/title/tt0094074/
Getting reviews for Monsters Crash the Pajama Party at https://www.imdb.com/title/tt0059466/
Getting reviews for Batman Forever at https://www.imdb.com/title/tt0112462/
Getting reviews for Batman & Robin at https://www.imdb.com/title/tt0118688/
Getting reviews for Ape at https://www.imdb.com/title/tt0074148/
Getting reviews for Birdemic: Shock and Terror at https://www.imdb.com/title/tt1316037/
Getting reviews for Glen or Glenda at https://www.imdb.com/title/tt0045826/
Getting reviews for Space Thunder Kids at https://www.imdb.com/title/tt1910621/
Getting reviews for Captain America at https://www.imdb.com/title/tt0103923/
Getting reviews for The Fantastic Four at https://www.imdb.com/title/tt0109770/
Getting reviews for The Man Who Saves the World at https://www.imdb.com/title/tt0182060/
Getting reviews for The Super Inframan at https://www.imdb.com/title/tt0073168/
Getting reviews for Laserb

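If that has all worked correctly, the following cell should load your IMDB review file and show that there are 100 reviews, with 6 columns. On the first run, load the 'raw' file (`imdb_reviews_raw.csv`). The cell as shown loads `imdb_reviews_annot.csv`, which is the file to use once annotation has started.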

In [5]:
reviews_df = pd.read_csv(os.path.join(data_dir, 'imdb_reviews_annot.csv'), index_col=0)
reviews_df

Unnamed: 0,id,name,url,title,review,sentiment
0,12-tt0094074,Superman IV: The Quest for Peace,https://www.imdb.com/title/tt0094074/,Even Superman couldn't save this one!,There is a new nuclear arms race underway. Sup...,negative
1,5-tt0094074,Superman IV: The Quest for Peace,https://www.imdb.com/title/tt0094074/,The Paradoxical Son...,Superman IV: The Quest for Peace is a good mov...,negative
2,4-tt0059466,Monsters Crash the Pajama Party,https://www.imdb.com/title/tt0059466/,"Horrible movie, but great DVD!",Very odd and very short color film that tries ...,positive
3,2-tt0059466,Monsters Crash the Pajama Party,https://www.imdb.com/title/tt0059466/,This film manages to be both fun and amazingly...,"The acting, costumes and dialog for ""Monsters ...",positive
4,10-tt0112462,Batman Forever,https://www.imdb.com/title/tt0112462/,Better than most people remember.,While the Batman franchise has been much malig...,positive
...,...,...,...,...,...,...
95,15-tt0048696,Tarantula,https://www.imdb.com/title/tt0048696/,Prof. Deemer would have to work a LOT harder n...,I've always wondered if director Jack Arnold c...,
96,6-tt0076009,Exorcist II: The Heretic,https://www.imdb.com/title/tt0076009/,Funny Movie,Following up one of the greatest horror films ...,
97,18-tt0076009,Exorcist II: The Heretic,https://www.imdb.com/title/tt0076009/,Funnier than Repossessed,Inside this terrible film is an excellent film...,
98,2-tt0085625,Grunt!,https://www.imdb.com/title/tt0085625/,The Best Motion Picture ever produced,"10/10 better than Citizen Kane, Casablanca and...",


<div class="alert alert-info" style="color:black"><h2>1.3 Exercise: Annotate sentiment labels for the reviews</h2>

We are going to be analysing the sentiment of these reviews, so we need to add some sentiment labels. Later we can use these as an extra test set to evaluate a classifier. If we had more items to label it would be a good idea to use [Labelbox](https://labelbox.com/) or [Label Studio](https://labelstud.io/). However, as we only have 100 to label we can do this manually.

You can either edit the CSV file manually or use the following code, which will iterate over the reviews and prompt you to input either `0` for 'negative' or `1` for 'positive'.

You can stop at any time because the DataFrame is saved after each annotation. To resume, change the file name in the cell above to `imdb_reviews_annot.csv` and re-run it to re-load the data.

<b>Don't overthink this!</b> It shouldn't take more than an hour (at most) to label all 100 reviews.
</div>
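The two ideas behind the annotation loop, mapping the keyboard input to a label and finding the rows that still need one, can be sketched on their own. The frame below is made up, and `parse_label` is a hypothetical helper that, unlike the cell that follows, also strips surrounding whitespace from the input.

```python
import pandas as pd

# Maps the keyboard input to the label stored in the 'sentiment' column
LABELS = {'0': 'negative', '1': 'positive'}


def parse_label(raw):
    """Return 'negative' or 'positive' for valid input, otherwise None."""
    return LABELS.get(raw.strip())


# Made-up frame with one review already labelled
df = pd.DataFrame({'title': ['A', 'B', 'C'],
                   'sentiment': ['positive', None, None]})

# Only rows without a label still need annotating, which is what makes resuming possible
todo = df.index[df['sentiment'].isnull()].tolist()
print(todo)
```

Keeping the input check in a small function like this makes it possible to test it without typing answers by hand.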

In [6]:
if not grading:
    # Get a list of unlabelled reviews
    unlabelled_reviews = [i for i, j in enumerate(list(reviews_df['sentiment'].isnull())) if j]

    for i in unlabelled_reviews:
        # Display the movie title and review
        print(f"Movie: {reviews_df['name'][i]} ({reviews_df['id'][i]})")
        print(f"Title: {reviews_df['title'][i]}")
        # Add some newlines to the review for better readability
        review = reviews_df['review'][i].replace('.', '.\n').strip()
        print(f"Review: {review}\n")

        # Ask for the sentiment label
        # Must be Negative (0) or Positive (1)
        time.sleep(1)
        while True:
            label = input("Is this review Negative (0) or Positive (1)?")
            if label == '0':
                reviews_df.loc[i, 'sentiment'] = 'negative'
                break
            elif label == '1':
                reviews_df.loc[i, 'sentiment'] = 'positive'
                break
            else:
                print("Invalid input. Please enter 0 or 1.")

        # Clear the console and save the dataframe
        clear_output()
        reviews_df.to_csv(os.path.join(data_dir, 'imdb_reviews_annot.csv'))


Check the number of labelled reviews/progress.

In [7]:
# Check the labelled reviews
print(f"Number of positive reviews: {(reviews_df['sentiment'] == 'positive').sum()}")
print(f"Number of negative reviews: {(reviews_df['sentiment'] == 'negative').sum()}")
print(f"Number of unlabelled reviews: {reviews_df['sentiment'].isnull().sum()}")

Number of positive reviews: 56
Number of negative reviews: 44
Number of unlabelled reviews: 0
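The same check can also be written with pandas' `value_counts`. This is a small sketch on a made-up `sentiment` column, not on the scraped data.

```python
import pandas as pd

# Made-up labels standing in for the 'sentiment' column
labels = pd.Series(['positive', 'negative', None, 'positive'], name='sentiment')

# dropna=False keeps unlabelled rows in the tally
counts = labels.value_counts(dropna=False)
print(counts)

# normalize=True gives proportions of the labelled rows instead of counts
proportions = labels.value_counts(normalize=True)
print(proportions)
```

The proportions give a quick view of how balanced the two sentiment classes are.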


<div class="alert alert-success" style="color:black"><h3>Before you submit this notebook to NBGrader for marking:</h3> 

1. Make sure you have completed all exercises marked by <span style="color:blue">**blue cells**</span>.
2. For automatically marked exercises ensure you have completed any cells with `# YOUR CODE HERE`. Then click the 'Validate' button above, or ensure all cells run without producing an error.
3. For manually marked exercises ensure you have completed any cells with `"YOUR ANSWER HERE"`.
4. Ensure all cells are run with their output visible.
5. Fill in your student ID (**only**) below.
6. You should now **save and download** your work.

</div>

**Student ID:** 15006280