# Exercise: Off-the-shelf models from HuggingFace
The purpose of today's module is to introduce you to techniques for advanced machine learning pipelines in Economic History. But first, we will skip ahead to the end and show you a very simple way of interacting with cutting-edge off-the-shelf models.

We don't have to reinvent the wheel. The AI community has a culture of open source and sharing. This also means that even some of the largest and most advanced ML models (in their trained form) are freely available. In this exercise, we will have a look at one of the popular sites, [HuggingFace](https://huggingface.co/), which hosts models like this.

The goals of the following exercise are to:
1. Introduce you to Python
2. Familiarize you with the `pipeline` interface in the `transformers` library
3. Introduce you to one of the most common text analysis tools: sentiment analysis

### Prewritten functions (run these before you run anything else)
Since the point of this exercise is not for you to get caught up in data-tinkering, we have written a few functions to handle things that would otherwise take some tinkering. You can see them below, and you need to run them before you can solve the exercises.

In [1]:
# Libraries
import requests
import re
import matplotlib.pyplot as plt

In [2]:
def get_notes_from_the_underground():
    """
    This function retrieves *Notes from the Underground* from the Gutenberg
    website. It also does some text cleaning and separates each sentence into
    list elements.
    """
    url = 'https://www.gutenberg.org/cache/epub/600/pg600.txt'
    response = requests.get(url)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        text = response.text

        # Split text into sentences
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|\r\n\r\n', text)

        # Clean text
        sentences = sentences[11:]
        sentences = sentences[:-117]

        return sentences

    else:
        raise Exception(f"Failed to retrieve data from {url}. Status code: {response.status_code}")
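
To see what the splitting pattern in `get_notes_from_the_underground` does, the sketch below applies the same regular expression to a short string. The sample text is made up for illustration and is not taken from the book.

```python
import re

# Same pattern as in get_notes_from_the_underground above.
SENTENCE_SPLIT = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|\r\n\r\n'

# Made-up sample text, for illustration only.
sample = "I am a sick man. Mr. Smith agrees? Yes."

# Split at whitespace that follows "." or "?", except after
# abbreviations such as "Mr." (capital letter, lowercase letter, period).
sentences = re.split(SENTENCE_SPLIT, sample)
for sentence in sentences:
    print(repr(sentence))
```

Note that the pattern only looks for `.` and `?`, so a sentence ending in `!` is not split from the one that follows it.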


In [3]:
def get_pride_and_prejudice():
    """
    This function retrieves "Pride and Prejudice" by Jane Austen from the
    Gutenberg website. It also does some text cleaning and separates each
    sentence into list elements.
    """
    url = 'https://www.gutenberg.org/cache/epub/1342/pg1342.txt'
    response = requests.get(url)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        text = response.text

        # Split text into sentences
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|\r\n\r\n', text)

        # Clean text
        sentences = sentences[300:]
        sentences = sentences[:-121]

        return sentences

    else:
        raise Exception(f"Failed to retrieve data from {url}. Status code: {response.status_code}")

In [4]:
def get_positive_score(x):
    """
    The pipeline returns a label ('POSITIVE' or 'NEGATIVE') together with the
    probability of that label, where the label is the more likely sentiment of
    the sentence. It is useful to have one continuous score from -1 to 1,
    where 1 is completely positive and -1 is completely negative. This
    function computes that score.
    """
    if x["label"] == "POSITIVE":
        res_x = x['score']
    elif x["label"] == "NEGATIVE":
        res_x = 1 - x['score']
    else:
        raise Exception(f"Unexpected label: {x['label']}. This should not be possible")

    res_x = res_x*2-1 # Expand to -1 to 1 scale

    return res_x
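
As a quick check of the rescaling, the sketch below repeats the mapping from `get_positive_score` so that it runs on its own and applies it to a few classifier outputs. The example dictionaries are hypothetical values chosen for illustration, not real model output.

```python
def get_positive_score(x):
    # Same mapping as the function above, repeated so this snippet runs alone.
    if x["label"] == "POSITIVE":
        res_x = x["score"]          # probability of positive
    elif x["label"] == "NEGATIVE":
        res_x = 1 - x["score"]      # convert to probability of positive
    else:
        raise ValueError(f"Unexpected label: {x['label']}")
    return res_x * 2 - 1            # rescale from [0, 1] to [-1, 1]


# Hypothetical classifier outputs, for illustration only.
examples = [
    {"label": "POSITIVE", "score": 0.9},  # confident positive: close to 0.8
    {"label": "NEGATIVE", "score": 0.9},  # confident negative: close to -0.8
    {"label": "POSITIVE", "score": 0.5},  # undecided: 0.0
]

scores = [get_positive_score(x) for x in examples]
print(scores)
```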

## 1. Sentiment analysis of *Notes from the Underground*
One of the most common tasks to run on texts is sentiment analysis. In sentiment analysis, we are interested in extracting a positive or negative score for a piece of text. This could be anything from [historical diaries](https://doi.org/10.1111/ehr.13344) to modern financial news.

In this exercise, you are asked to run sentiment analysis on all of ["Notes from the Underground" by Fyodor Dostoyevsky](https://www.gutenberg.org/cache/epub/600/pg600.txt).

## Exercise 1.1
You can access many of the highest-performing transformer models through a single call to `pipeline`, which downloads and initializes a model.

**a) Please execute the code below. Are the results as expected? What would the result be if we simply counted positive words?**

In [5]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model = "distilbert/distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("That would have been splendid. Absoloutly amazing. But it was quite the opposite.")
print(result)

[{'label': 'NEGATIVE', 'score': 0.9866785407066345}]
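
For comparison, a naive word-count approach might look like the sketch below. The word lists and the function name `word_count_sentiment` are made up for illustration; they are tiny and are not a real sentiment lexicon.

```python
import re

# Tiny hand-picked word lists; hypothetical and far from complete.
POSITIVE_WORDS = {"splendid", "amazing", "good", "great", "wonderful"}
NEGATIVE_WORDS = {"bad", "terrible", "awful", "horrible"}


def word_count_sentiment(text):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


text = "That would have been splendid. Absoloutly amazing. But it was quite the opposite."
naive_score = word_count_sentiment(text)
print(naive_score)
```

With these word lists the count comes out positive, because "splendid" and "amazing" are counted while the reversal in "quite the opposite" contains no listed word. The transformer model, by contrast, labelled the same text `NEGATIVE`.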


**b) Try running the classifier on the opening of *Notes from the Underground*:**
> "I am a sick man.... I am a spiteful man. I am an unattractive man. I believe my liver is diseased."

In [None]:
result = classifier("...")
print(result)

## Exercise 1.2
We will now load the entire book and run the analysis on all of it. We have written a small function, `get_notes_from_the_underground`, that makes loading the text easy.

**a) Use `get_notes_from_the_underground` to load the text into an object called `notes_from_the_underground`.**  
**b) Print `notes_from_the_underground[0:10]`. Does this correspond to the text in the book? https://www.gutenberg.org/cache/epub/600/pg600.txt**  
**c) Apply sentiment analysis to each element of the `notes_from_the_underground` object.**  
**d) Copy-paste this exercise (including context) and ask ChatGPT for Python code to illustrate how sentiment develops over time. (Hint: look at moving averages over 50 sentences).**  
**e) Does the development in the sentiment correspond to what you expected?**  
**f) Extra: Run the same analysis for *Pride and Prejudice* by Jane Austen. Is it as dark as *Notes from the Underground*?**

#### a)
**Hint:** You can use the function `get_notes_from_the_underground()`

In [8]:
notes_from_the_underground = ...

#### b)
**Hint:** Python is zero-indexed. I.e. you subset like this: `x[0:10]`
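
A small sketch of zero-indexed slicing, using a made-up list rather than the book text:

```python
# Made-up list, for illustration only.
x = ["a", "b", "c", "d", "e"]

first_three = x[0:3]  # elements at positions 0, 1 and 2; position 3 is excluded
print(first_three)
print(x[0])           # the first element has index 0
```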

In [None]:
print(...)

#### c)
**Hint 1:** The function `classifier` returns the most probable class (negative/positive) and its probability. To get the overall sentiment score, we have supplied you with a function, `get_positive_score`, which rescales the output to `-1` if the model is absolutely certain the text is negative and `1` if it is certain the text is positive (and somewhere in between otherwise).  
**Hint 2:** This runs on every single sentence in the entire book, so it takes a bit of time.

In [None]:
out = classifier(...)
result = [get_positive_score(x) for x in out]
print(result)

#### d)
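
One possible way to compute and plot a moving average is sketched below. The function names `moving_average` and `plot_moving_average` are hypothetical helpers, and the toy scores are made up to stand in for the `result` list from c).

```python
def moving_average(scores, window=50):
    """Average of each run of `window` consecutive scores (no padding at the ends)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def plot_moving_average(smoothed, window=50):
    """Plot the smoothed scores; matplotlib is imported here so the rest runs without it."""
    import matplotlib.pyplot as plt

    plt.figure(figsize=(14, 7))
    plt.plot(smoothed)
    plt.axhline(0, color="grey", linestyle="--")  # neutral sentiment
    plt.xlabel("Sentence number (start of window)")
    plt.ylabel(f"Mean sentiment over {window} sentences")
    plt.title("Moving average of sentence sentiment")
    plt.show()


# Made-up scores in [-1, 1], standing in for the `result` list from c).
toy_scores = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
smoothed = moving_average(toy_scores, window=2)
print(smoothed)
```

With the real scores, a call such as `plot_moving_average(moving_average(result, window=50))` would draw the curve. A list shorter than the window yields an empty result.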

In [None]:
# Calculate the moving average over 50 sentences
...

# Plot the moving averages
plt.figure(figsize=(14, 7))
...

#### e)
**Hint:** You can subset the text and join it in this way: `' '.join(notes_from_the_underground[...])`
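
A minimal sketch of subsetting and joining with the string method `join`, using made-up sentences in place of the book's list:

```python
# Made-up sentences standing in for the book's list of sentences.
sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]

# Take elements 1 and 2 and join them with a single space in between.
passage = " ".join(sentences[1:3])
print(passage)
```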

In [None]:
# Subset

# Print

#### f)

In [None]:
# Get text
pride_and_prejudice = get_pride_and_prejudice()

# Run sentiment analysis
...

# Calculate the moving average over 50 sentences
...

# Plot the moving averages
...