# NLP for Content Analysis

In this workshop, we will explore how to use Natural Language Processing (NLP) techniques to analyze and extract insights from text data. We will cover various NLP tasks such as tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis.

## Why NLP?

NLP is a powerful tool for understanding and processing human language. There are many situations where the quantity of text data available is too large for manual analysis, and NLP can help automate the extraction of useful information. For example, we can use NLP to analyze policy documents to identify key themes, extract named entities such as organizations and locations, and assess the sentiment of the text.

## Our Tools

In this workshop, we'll focus on two useful tools for NLP: spaCy and the Hugging Face Transformers library. Both libraries provide powerful and efficient implementations of various NLP tasks, making it easy to get started with text analysis.

## spaCy

[spaCy](https://spacy.io/) is an open-source library for advanced NLP in Python. It is designed specifically for production use and provides a fast and efficient way to process large volumes of text. spaCy includes pre-trained models for various languages, which can be used for tasks such as tokenization, part-of-speech tagging, named entity recognition, and dependency parsing.

If it's not already installed, you can install spaCy using pip:

In [None]:
!pip install -Uq spacy

spaCy works with pre-trained models for various languages. We'll work today with the small English model, which is suitable for learning. To install the small English model, run the following command:

In [None]:
!python -m spacy download en_core_web_sm

Once successfully installed, we can load spaCy and the English model in our Python environment:

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

We can now use the `nlp` object to process text. For example, let's analyze a simple sentence:

In [None]:
document = nlp("The quick brown fox jumps over the lazy dog.")

We can now make use of our processed `document` object to understand the text better. For example, we can access the `tokens` in the document, which are the individual words and punctuation marks broken down by the model:

In [None]:
print(f'{"Token": <20}{"Reduced Form (aka lemma)": <20}')
for token in document:
    print(f"{token.text: <20}{token.lemma_: <20}")

Notice how in this example, the word "jumps" is reduced to its lemma, "jump". This is a common NLP technique called lemmatization, which reduces words to their base or dictionary form. This can be useful for tasks such as text classification or sentiment analysis, where we want to treat different forms of a word as the same. 
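
To see why merging word forms matters for counting, here is a toy sketch that does not use spaCy at all. The `TOY_LEMMAS` table and the `toy_lemma` helper are hand-written stand-ins for a real lemmatizer, made up for this illustration; they are not spaCy's actual output.

```python
from collections import Counter

# Hypothetical, hand-written lemma table (a stand-in for a real lemmatizer).
TOY_LEMMAS = {"jumps": "jump", "jumped": "jump", "jumping": "jump", "foxes": "fox"}

def toy_lemma(word):
    """Return the base form from the toy table, or the lowercased word itself."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

words = ["fox", "foxes", "jumps", "jumped", "jumping"]

surface_counts = Counter(words)                      # every form counted separately
lemma_counts = Counter(toy_lemma(w) for w in words)  # forms merged by base form

print(surface_counts)
print(lemma_counts)
```

With surface forms, each of the five words is counted once; after the toy lemmatization, the counts collapse to two entries, which is the effect we rely on when counting words later in the workshop.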

### Exercise 1
Try coming up with a complex sentence and see how spaCy processes it. You can use the function below, changing the sentence. Try some of the following:
- Complex compound words
- Words in different tenses
- Words with different prefixes or suffixes
- Made up words!

In [None]:
def lemmatize_sentence(sentence):
    doc = nlp(sentence)
    print(f'{"Token": <20}{"Lemma": <20}')
    for token in doc:
        print(f"{token.text: <20}{token.lemma_: <20}")

lemmatize_sentence("Put your complex sentence here, with different tenses, prefixes, and suffixes.")

With spaCy, we can also perform more advanced tasks. For example, spaCy makes a prediction about the part of speech of each token in the document. This can be useful if we only care about certain parts of speech, such as nouns or verbs. We can access the part of speech tags using the `pos_` attribute of each token:

In [None]:
def pos_tags(sentence):
    doc = nlp(sentence)
    print(f'{"Token": <20}{"Part of Speech": <20}')
    for token in doc:
        print(f"{token.text: <20}{token.pos_: <20}")

pos_tags("The quick brown fox jumps over the lazy dog.")

For a complete list of part-of-speech tags and their meanings, you can refer to the [Universal POS Tags](https://universaldependencies.org/u/pos/).
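
If we only care about certain parts of speech, a small filter is enough. The sketch below assumes tokens that expose `text` and `pos_` attributes, as spaCy tokens do in the cell above. To keep it runnable without a model, it uses a hypothetical `FakeToken` stand-in with hand-assigned tags rather than real spaCy predictions; `keep_pos` is likewise a name invented for this example.

```python
from collections import namedtuple

def keep_pos(tokens, wanted=frozenset({"NOUN", "VERB"})):
    """Return the text of every token whose coarse POS tag is in `wanted`."""
    return [tok.text for tok in tokens if tok.pos_ in wanted]

# Hypothetical stand-in for spaCy tokens, with hand-assigned tags.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
tagged = [
    FakeToken("The", "DET"), FakeToken("fox", "NOUN"),
    FakeToken("jumps", "VERB"), FakeToken("over", "ADP"),
    FakeToken("the", "DET"), FakeToken("dog", "NOUN"),
]

print(keep_pos(tagged))  # ['fox', 'jumps', 'dog']
```

Because the function only relies on the `text` and `pos_` attributes, passing it a processed spaCy document, for example `keep_pos(nlp("The quick brown fox jumps over the lazy dog."))`, should work the same way.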

Let's go ahead and try a worked example with a real (if small) dataset. For this part, we'll use the [Hugging Face Datasets library](https://huggingface.co/docs/datasets/index) to load the `AI Jobs News Articles` dataset, which contains news articles related to AI jobs. We'll use spaCy to analyze the text in these articles.

In [None]:
!pip install -Uq datasets

The code below will download the dataset and show us the first few examples.

**NOTE:** If at any point in the lab you find that the code is running too slowly, you can adjust the `download_percent` variable below to a lower value. This will reduce the number of articles loaded, which can speed up processing time. However, for the full analysis, we recommend using the full dataset.

In [None]:
from datasets import load_dataset

download_percent = 100 # Adjust this to a lower value (e.g., 10) if you want to load a smaller subset of the dataset

# Load a small dataset of news articles related to AI jobs
ds = load_dataset("fdaudens/ai-jobs-news-articles", split=f'train[:{download_percent}%]')

# Display the first few articles
for i in range(3):
    article = ds[i]
    print(f"Title: {article['title']}")
    print(f"Content: {article['text'][:512]}...")  # Display first 512 characters
    print("-" * 80)
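
The `split` argument passed to `load_dataset` above uses the Datasets slicing syntax: `train[:10%]` selects the first 10% of the `train` split. As a small sketch, the hypothetical helper `percent_split` below (not part of the Datasets library) builds that string and rejects values outside 1 to 100.

```python
def percent_split(percent, split="train"):
    """Build a Datasets slice string such as 'train[:10%]'."""
    if not isinstance(percent, int) or not 1 <= percent <= 100:
        raise ValueError("percent must be an integer between 1 and 100")
    return f"{split}[:{percent}%]"

print(percent_split(10))   # train[:10%]
print(percent_split(100))  # train[:100%]
```

The result could be passed directly as the `split` argument, for example `load_dataset("fdaudens/ai-jobs-news-articles", split=percent_split(10))`.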

There are many different questions we might want to ask of this dataset, but let's start with a simple one: what are the most common words in the articles? We can use spaCy to tokenize the text and then count the frequency of each token. Because we count lemmas rather than surface forms, different forms of the same word are grouped together (e.g., "run" and "running" should both be counted as "run"), to the extent that the model lemmatizes them correctly.

In [None]:
from collections import Counter
from tqdm import tqdm

def count_tokens_pipe(articles, batch_size=100):
    token_counter = Counter()

    # Remove any articles that do not have text
    texts = [article['text'] for article in articles if article.get('text') is not None]

    # Process the texts in batches to avoid memory issues
    for doc in tqdm(nlp.pipe(texts, disable=["ner", "parser"], batch_size=batch_size), total=len(texts)):
        for token in doc:

            # Check if the token is not a stop word, punctuation, or empty lemma
            if not token.is_stop and not token.is_punct and len(token.lemma_.strip()) > 0:
                token_counter[token.lemma_] += 1

    return token_counter

# Count tokens in the dataset
token_counts = count_tokens_pipe(ds)

In [None]:
# Print the 10 most common tokens

print("\nMost common tokens:")
for token, count in token_counts.most_common(10):
    print(f"{token}: {count}")

These results are not particularly surprising, but they do give a sense of the most common words in the dataset. As we'd expect, words like "AI", "job" and "skill" are among the most common, as they are central to the topic of the articles. Let's move on to a more complex analysis.

## Named Entity Recognition (NER)

Named Entity Recognition (NER) is a key NLP task that involves identifying and classifying named entities in text into predefined categories such as persons, organizations, locations, dates, and more. NER can be particularly useful for extracting structured information from unstructured text data.

In this dataset, many articles include references to specific people in the world of AI. We can use spaCy's NER capabilities to identify people mentioned in the articles. Going beyond that, let's say that we want to evaluate whether the articles are positive, negative, or neutral in tone. spaCy's pre-trained pipelines do not include a sentiment component, but an extension library can add one.

Putting these concepts together, we can create a function that allows us to estimate the sentiment surrounding each person mentioned in the articles. This will give us a sense of how people in the AI field are perceived in the news articles.

We do need one additional library to make this work, called spacytextblob. This library provides a simple interface for sentiment analysis using the TextBlob library, which is built on top of NLTK and provides a simple API for common NLP tasks. Let's go ahead and install that now:

In [None]:
!pip install -Uq spacytextblob

In [None]:
from spacytextblob.spacytextblob import SpacyTextBlob

if "spacytextblob" not in nlp.pipe_names:
    nlp.add_pipe("spacytextblob", last=True)

Our first job is to extract the named entities from the articles. Named Entity Recognition identifies more than just people; it can also identify organizations, locations, dates and more. For our purposes, we'll first task spaCy with identifying named entities, and then we'll check if they're of type "PERSON". If they are, we'll add them to a list of people mentioned in the articles.

In [None]:
def extract_people(texts, batch_size: int = 64):
    """
    Returns a list (len == n_texts) of PERSON-name lists.
    """
    people_per_doc = []
    for doc in tqdm(
        nlp.pipe(
            texts,
            batch_size=batch_size,
            disable=["parser", "tagger", "attribute_ruler",
                     "lemmatizer", "spacytextblob"],   # keep only NER
        ),
        total=len(texts),
        desc="Finding PERSON entities",
    ):
        names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
        people_per_doc.append(names)
    return people_per_doc

extract_people(["Sam Altman is the CEO of OpenAI.", "At the recent AI conference, Professor Geoffrey Hinton spoke", "We caught up with Sundar Pichai, the CEO of Google."], batch_size=1)

Great, we can see that the function works as expected. It returns a list of people mentioned in the articles, which we can then use for further analysis.
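
One simple follow-up is to count how often each name appears across documents. The sketch below assumes the list-of-lists shape that `extract_people` returns; `count_mentions` is a hypothetical helper and the `example` data is hand-written, not model output.

```python
from collections import Counter
from itertools import chain

def count_mentions(people_per_doc):
    """Flatten a list of per-document name lists and count each name."""
    return Counter(chain.from_iterable(people_per_doc))

# Hand-written example in the same list-of-lists shape as `extract_people` output.
example = [["Sam Altman"], [], ["Sundar Pichai", "Sam Altman"]]

print(count_mentions(example).most_common(1))  # [('Sam Altman', 2)]
```

Documents with no detected people contribute an empty list, which the flattening step simply skips.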

Our next step is to analyze the sentiment of the articles.

## Sentiment Analysis

Sentiment analysis is the process of determining the emotional tone behind a series of words. It is often used to understand the sentiment expressed in text data, such as social media posts, product reviews, or news articles. Automated prediction of sentiment has come a long way in recent years, and modern NLP libraries like spaCy, together with their extensions, make it easy to perform sentiment analysis on text data.

To perform sentiment analysis with spaCy, we can use the `spacytextblob` library, which provides a simple interface for sentiment analysis using the TextBlob library.

In [None]:
def score_sentiment(texts, batch_size: int = 64):
    """
    Returns a list (len == n_texts) of polarity scores in [-1, 1].
    """
    polarities = []
    for doc in tqdm(
        nlp.pipe(
            texts,
            batch_size=batch_size,
            disable=["parser", "tagger", "attribute_ruler",
                     "lemmatizer", "ner"],             # keep only spacytextblob
        ),
        total=len(texts),
        desc="Scoring sentiment",
    ):
        polarities.append(doc._.blob.polarity)
    return polarities

score_sentiment(["I love AI!", "AI is terrible.", "I have no opinion on AI."])

There are more sophisticated methods for sentiment analysis, but for our purposes, this will work well. The `spacytextblob` library provides a simple way to get the sentiment polarity of a text, which ranges from -1 (very negative) to 1 (very positive). A polarity of 0 indicates a neutral sentiment.
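
If a categorical label is easier to read than a raw number, the polarity can be bucketed. This is a minimal sketch: the `polarity_label` helper and its `threshold` of 0.1 are assumptions chosen for illustration, not values defined by TextBlob or spacytextblob.

```python
def polarity_label(polarity, threshold=0.1):
    """Map a polarity in [-1, 1] to 'negative', 'neutral', or 'positive'."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"  # anything within +/- threshold of zero

for score in (0.5, -0.7, 0.0):
    print(score, polarity_label(score))
```

A wider threshold treats more texts as neutral; the right value depends on the data and is worth checking against a few hand-labelled examples.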

Now that we have our NER and sentiment analysis functions, we can start analyzing the articles in more depth. We can extract the people mentioned in each article and their associated sentiment scores, allowing us to see not only who is being discussed but also the sentiment surrounding those individuals.

In [None]:
import pandas as pd

def build_person_sentiment_df(articles, text_key: str = "text"):
    """
    Returns a DataFrame:  article_id | person | polarity
    """
    # --- collect texts & IDs (skip empty ones) ---
    texts = [a[text_key] for a in articles if a.get(text_key)]
    article_ids = [i for i, a in enumerate(articles) if a.get(text_key)]

    # --- run the two independent passes ---
    people_lists = extract_people(texts)
    polarity_list = score_sentiment(texts)

    # --- explode to long form ---
    records = []
    for art_id, names, pol in zip(article_ids, people_lists, polarity_list):
        for name in names:
            records.append({"article_id": art_id, "person": name, "polarity": pol})

    return pd.DataFrame(records)

In [None]:
people = build_person_sentiment_df(ds, text_key="text")

Now we have a dataset listing every mention of a person, along with the predicted polarity of the article in which the mention appears. Note that the polarity is computed for the whole article, not for the sentences around the name. We can now use this to compute the average polarity for each person, as well as how many times they were mentioned. This will give us a rough sense of how often people are mentioned in the articles, and how positive or negative the articles mentioning them are.
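
As a rough sketch of what that per-person aggregation computes, the same counts and averages can be worked out by hand on a few records. The `records` below are hand-written with made-up names, and `summarize` is a hypothetical helper that uses only the standard library rather than pandas.

```python
from collections import defaultdict

# Hand-written records in the same shape as the rows of `people`.
records = [
    {"article_id": 0, "person": "Ada", "polarity": 0.5},
    {"article_id": 0, "person": "Grace", "polarity": 0.5},  # same article, same score
    {"article_id": 1, "person": "Ada", "polarity": -0.25},
]

def summarize(records, value_key="polarity"):
    """Return {person: (mentions, average score)} from long-form records."""
    scores = defaultdict(list)
    for row in records:
        scores[row["person"]].append(row[value_key])
    return {name: (len(vals), sum(vals) / len(vals)) for name, vals in scores.items()}

print(summarize(records))  # {'Ada': (2, 0.125), 'Grace': (1, 0.5)}
```

Both people in article 0 receive the same score, so a person's average reflects the articles they appear in rather than what is said about them specifically.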

In [None]:
people.groupby('person').agg(
    mentions=('polarity', 'count'),
    avg_sentiment=('polarity', 'mean')
).reset_index().sort_values(by='mentions', ascending=False).head(10)

As we can see, many of the entities picked up by the model are not actually people. Keep in mind that we are using the smallest English model, which is generally less accurate than the larger models. To improve performance, let's switch to the Hugging Face Transformers library, which provides access to state-of-the-art NLP models.


## Hugging Face Transformers

The [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library provides a wide range of pre-trained models for various NLP tasks, including named entity recognition, sentiment analysis, and more. It is built on top of PyTorch and TensorFlow, making it easy to use with both frameworks. Hugging Face has become the go-to resource for transformer-based models, the technology behind Large Language Models (LLMs) like GPT.

In this case, we won't be working with a full LLM, since these are typically too resource-intensive to run on a personal computer. Instead, we'll use smaller pre-trained models that have already been fine-tuned for tasks like NER and sentiment analysis.

### Named Entity Recognition with Hugging Face

For our identification of people's names, we will use the `distilbert-NER` model. BERT is an early success in the field of transformers, and DistilBERT is a smaller, faster, cheaper version of BERT that, according to its authors, retains 97% of BERT's language understanding while being 60% faster and 40% smaller. As the name implies, this version of DistilBERT is fine-tuned for NER, and is a good fit for our needs.

Let's make sure we have the necessary libraries installed, and then we can load the model.

In [None]:
!pip install -Uq transformers

In [None]:
from transformers import pipeline
from tqdm.auto import tqdm

ner_pipeline = pipeline(
    "ner",
    model="dslim/distilbert-NER",
    aggregation_strategy="max"
)

def extract_people(
    texts,
    batch_size: int = 16,
    stride: int = 128,
    show_progress: bool = True
):
    if isinstance(texts, str):
        texts = [texts]

    persons_per_doc = []
    for start in tqdm(
        range(0, len(texts), batch_size),
        desc="Performing NER",
        unit="doc",
        disable=not show_progress
    ):
        batch = texts[start : start + batch_size]
        batch_entities = ner_pipeline(
            batch,
            batch_size=len(batch),
            stride=stride
        )
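        for doc in batch_entities:
            # Keep only PER (person) entities longer than 5 characters;
            # the length cutoff also drops short names such as "Musk".
            persons_per_doc.append(
                [e["word"].strip() for e in doc if e["entity_group"] == "PER" and len(e["word"].strip()) > 5]
            )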

    return persons_per_doc


# ─── quick demo ────────────────────────────────────────────────────────────────
docs = [
    "Sam Altman is the CEO of OpenAI.",
    "Professor Geoffrey Hinton spoke at the conference.",
    "We caught up with Sundar Pichai, the CEO of Google."
]
print(extract_people(docs, batch_size=2))

### Sentiment Analysis with Hugging Face

For sentiment analysis, we will use the `distilroberta-finetuned-financial-news-sentiment-analysis` model. This model is fine-tuned for sentiment analysis on financial news, which makes it a reasonable fit for our dataset of AI job news articles, although those articles are not strictly financial news. It is based on the RoBERTa architecture, which is a variant of BERT that has been shown to perform well on a variety of NLP tasks. It is distilled, like DistilBERT, to be smaller and faster while retaining much of the original model's performance.

Let's load the sentiment analysis pipeline and define a function to extract sentiment from text. For longer texts, we will use a sliding window approach to ensure that we can process the entire text without exceeding the model's maximum input length. This is particularly important for our dataset, which contains articles that can be quite long.
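
The sliding window idea can be sketched on a plain list before involving a tokenizer. In this sketch, `stride` means the number of items shared by consecutive windows, which is how the tokenizer's `stride` argument is understood here; the `sliding_windows` helper and the small default sizes are assumptions for illustration, and a real tokenizer also reserves room for special tokens.

```python
def sliding_windows(tokens, max_len=8, stride=2):
    """Split `tokens` into windows of at most `max_len` items,
    where consecutive windows share `stride` items."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride  # how far each new window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):  # this window reached the end
            break
    return windows

tokens = list(range(14))
for window in sliding_windows(tokens):
    print(window)
# [0, 1, 2, 3, 4, 5, 6, 7]
# [6, 7, 8, 9, 10, 11, 12, 13]
```

The overlap (items 6 and 7 here) gives each window some context from the previous one, at the cost of scoring those items twice.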

In [None]:
import numpy as np

sentiment_pipeline = pipeline("sentiment-analysis", model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis", truncation=True)

def _score_long_text(text, max_len=512, stride=128, batch_size=16):
    """
    Return a signed average sentiment score in [-1, 1] for a single *text* of arbitrary length.
    """
    tk = sentiment_pipeline.tokenizer

    # a) Split into overlapping windows
    enc = tk(
        text,
        truncation=True,
        max_length=max_len,
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=False
    )
    chunks = tk.batch_decode(enc["input_ids"], skip_special_tokens=True)

    # b) Run the model on each window
    chunk_preds = sentiment_pipeline(chunks, batch_size=batch_size)

    # c) Aggregate: signed average of probabilities
    signed = [
        p["score"] * (+1 if p["label"].lower() == "positive"
                      else -1 if p["label"].lower() == "negative"
                      else 0)
        for p in chunk_preds
    ]
    avg = np.mean(signed)
    return avg

# Public function that handles a list or a single string
def extract_sentiment(texts, batch_size: int = 16,
                      max_len: int = 512, stride: int = 128):
    if isinstance(texts, str):
        texts = [texts]
    return [
        _score_long_text(t, max_len=max_len, stride=stride,
                         batch_size=batch_size)
        for t in tqdm(texts, desc="Scoring sentiment", unit="doc")
    ]


extract_sentiment([
    "The job market for AI professionals is booming.",
    "Investors are worried about rising interest rates."
])
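
To make the aggregation step concrete, here is the signed average worked on a few per-window predictions without loading any model. The `window_preds` values are invented for illustration, in the `{"label", "score"}` shape used by the code above, and `signed_average` is a hypothetical standalone version of that step.

```python
# Hypothetical per-window predictions (made-up numbers, not model output).
window_preds = [
    {"label": "positive", "score": 0.9},
    {"label": "neutral", "score": 0.8},
    {"label": "negative", "score": 0.6},
]

SIGN = {"positive": 1, "negative": -1}  # any other label (e.g. neutral) counts as 0

def signed_average(preds):
    """Average of +score for positive, -score for negative, 0 otherwise."""
    if not preds:
        return 0.0  # assumption: treat "no windows" as neutral
    signed = [p["score"] * SIGN.get(p["label"].lower(), 0) for p in preds]
    return sum(signed) / len(signed)

print(signed_average(window_preds))  # roughly 0.1
```

Neutral windows contribute 0 but still count in the denominator, so a long, mostly neutral article with a few strongly worded passages ends up with a score close to 0.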

In [None]:
def build_person_sentiment_df(articles, text_key: str = "text"):
    # Collect non-empty texts and remember their original indices
    texts, art_ids = [], []
    for i, art in enumerate(articles):
        if art.get(text_key):  # skip empty / missing text
            texts.append(art[text_key])
            art_ids.append(i)

    people_per_article    = extract_people(texts)
    sentiment_per_article = extract_sentiment(texts)

    records = []
    for art_id, names, score in zip(
        art_ids, people_per_article, sentiment_per_article
    ):
        for name in names:
            records.append(
                {
                    "article_id": art_id,
                    "person":     name,
                    "score":      score
                }
            )

    return pd.DataFrame(records)

people = build_person_sentiment_df(ds, text_key="text")

In [None]:
people.groupby('person').agg(
    mentions=('score', 'count'),
    avg_sentiment=('score', 'mean')
).reset_index().sort_values(by='mentions', ascending=False).head(20)

Our results for this example are much better than with the small English model in spaCy. Although we still have some false positives, the model is much better at identifying people.

## Conclusion

In this workshop, we have explored how to use NLP techniques to analyze and extract insights from text data. We have covered various NLP tasks such as tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis using spaCy and Hugging Face Transformers. We showed how to apply these techniques to a real-world dataset of AI job news articles, extracting named entities and sentiment scores for individuals mentioned in the articles.

### Extension

spaCy has many more features that we haven't covered in this notebook. If you still have time, take a look at the [spaCy documentation](https://spacy.io/usage) to learn more about what you can do with spaCy. Some interesting topics to explore include:
- Dependency parsing: Understanding the grammatical structure of sentences.
- Text classification: Automatically categorizing text into predefined categories.
- Language models: Training your own models for specific tasks.