# Learning outcomes

- Understand how and why to preprocess text, and the basic principles of preprocessing

- Implement some forms of automated content analysis

- Understand relational analysis in textual data with network analysis

- Analyse subjective meaning in language, such as sentiment, not just literal content

# Preprocessing with NLTK

The Natural Language Toolkit, or [NLTK](https://www.nltk.org/), is a Python library for working with natural (human) language data in a computational way. It is mainly used for natural language processing (NLP), which entails analysing and modeling language with algorithms.

Developed in the early 2000s by computational linguists Steven Bird and Edward Loper at the University of Pennsylvania, it was created as a teaching and research tool for linguistics and computer science students.

NLTK has since become a widely used toolkit, providing tools for tokenization, tagging, parsing, and accessing standard linguistic datasets, making it especially useful for learning and research.


In [None]:
import nltk
nltk.download('all')

In [None]:
# import packages
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [None]:
text = "This is an example sentence to demonstrate basic NLP preprocessing steps."
text

In [None]:
# lowercasing
text_lower = text.lower()
text_lower
# why lowercasing? It treats 'The' and 'the' as the same word, for example.

In [None]:
# tokenisation
tokens = word_tokenize(text_lower)
tokens

In [None]:
tokens[2]

In [None]:
# removing punctuation (optional, but often done)
# we'll use list comprehension to keep only alphabetic tokens
tokens_no_punct = [word for word in tokens if word.isalpha()]
tokens_no_punct

In [None]:
# new list without non alpha tokens
# this is the same as the list comprehension but
# you might be more familiar with this for loop
tokens_no_punct = []
for word in tokens:
    if word.isalpha(): # conditional
        tokens_no_punct.append(word) # list method to append to list
tokens_no_punct

# list comprehension is more concise for our purposes

In [None]:
# removing stopwords
stop_words = stopwords.words('english') # variable from nltk. It's a list.
stop_words.extend(["would", "could", "said", "must",
                       "much", "miss", "one"])  # Add custom stop words
tokens_no_stopwords = [word for word in tokens_no_punct if word not in stop_words]
# another list comprehension to remove "stopwords", but in this case
# if NOT IN
tokens_no_stopwords


Why remove stopwords? Words like 'the', 'a' and 'is' are very common in English but carry little specific meaning, so removing them can reduce noise.
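
To make the effect concrete, here is a minimal sketch using only plain Python. The five-word stop list and the token list are made up for illustration; NLTK's English list is much longer. The stop words are stored in a set because membership tests against a set are faster than against a list, which starts to matter when filtering whole novels.

```python
# Hypothetical mini stop list (an assumption for this sketch, not NLTK's list)
toy_stop_words = {"the", "a", "is", "of", "and"}

# Made-up tokens, already lowercased and stripped of punctuation
toy_tokens = ["the", "wealth", "of", "the", "family", "is", "the", "talk",
              "of", "the", "village"]

# keep only the tokens that are NOT in the stop list
filtered = [word for word in toy_tokens if word not in toy_stop_words]

# share of tokens that the filter removed
removed_share = 1 - len(filtered) / len(toy_tokens)

print(filtered)
print(f"{removed_share:.0%} of tokens removed")
```

Most of the tokens disappear, and what remains is the vocabulary that says something about the content.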

In [None]:
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_no_stopwords]
lemmatized_tokens


Lemmatisation reduces words to their dictionary base form (the lemma), which is useful for tasks that depend on meaning. It lets the code treat variants of the same word, e.g. run and ran, or politician and politicians, as one item. Note that `WordNetLemmatizer` treats every word as a noun unless told otherwise, so verb forms such as ran are only reduced to run when a part of speech is supplied, e.g. `lemmatizer.lemmatize("ran", pos="v")`. Lemmatising is very useful when you are trying to understand what a text is about. If you are studying linguistics, these differences may be important to keep, and you would therefore not lemmatise. Preprocessing depends on your research domain and research goals.

In [None]:
nltk.pos_tag(tokens_no_stopwords, tagset='universal')

| Tag  | Meaning                   | English Examples                                      |
|------|---------------------------|--------------------------------------------------------|
| ADJ  | adjective                 | new, good, high, special, big, local                   |
| ADP  | adposition                | on, of, at, with, by, into, under                      |
| ADV  | adverb                    | really, already, still, early, now                     |
| CONJ | conjunction               | and, or, but, if, while, although                      |
| DET  | determiner, article       | the, a, some, most, every, no, which                  |
| NOUN | noun                      | year, home, costs, time, Africa                        |
| NUM  | numeral                   | twenty-four, fourth, 1991, 14:24                       |
| PRT  | particle                  | at, on, out, over, per, that, up, with                |
| PRON | pronoun                   | he, their, her, its, my, I, us                         |
| VERB | verb                      | is, say, told, given, playing, would                  |
| .    | punctuation marks         | . , ; !                                               |
| X    | other                     | ersatz, esprit, dunno, gr8, univeristy                |

Source: https://www.nltk.org/book/ch05.html, section 2.3 "A Universal Part-of-Speech Tagset", Table 2.1.

In [None]:
# it is very common to create a preprocessing function:
# code that can be reused. Here we put it all together

def process(text):
    text_lower = text.lower()
    tokens = word_tokenize(text_lower)
    tokens_no_punct = [word for word in tokens if word.isalpha()]
    tokens_no_stopwords = [word for word in tokens_no_punct if word not in stop_words]
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_no_stopwords]
    return lemmatized_tokens


# Content Analysis
Content analysis has a long tradition in the humanities and social sciences, with seminal works by Krippendorff and others. Traditional quantitative text analysis often required large teams to count words by hand and apply elaborate "coding frameworks" (not computer code, but rules for interpretation). Researchers today increasingly automate content analysis with NLP methods, which can tokenise and count words efficiently at scale. We will cover several approaches to counting words, ranging from simple but powerful frequency distributions to more rigorous statistical techniques such as topic modelling.
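
As a warm-up, a frequency distribution can be sketched with Python's built-in `Counter`, which is the class that NLTK's `FreqDist` builds on. The word list below is made up for illustration.

```python
from collections import Counter

# Made-up tokens standing in for a preprocessed text
toy_words = ["emma", "harriet", "emma", "knightley", "emma", "harriet"]

freq = Counter(toy_words)

# most_common(n) returns the n most frequent items as (word, count) pairs
top_words = freq.most_common(2)

# zip(*pairs) splits the pairs into one tuple of words and one of counts,
# which is the shape a bar chart needs
labels, counts = zip(*top_words)

print(top_words)
print(labels, counts)
```

The same `most_common` and `zip(*...)` pattern is used on whole novels later in this section.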

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import requests
from nltk import FreqDist

# Project Gutenberg
[Project Gutenberg](https://www.gutenberg.org/) is a free online library that offers over 60,000 eBooks, mostly classic literature. It makes these texts freely available to the public in digital formats like plain text, which is very useful for NLP.

In [None]:
# load books
urls = {
    "Emma": "https://www.gutenberg.org/files/158/158-0.txt",
    "Pride and Prejudice": "https://www.gutenberg.org/files/1342/1342-0.txt",
    "Sense and Sensibility": "https://www.gutenberg.org/files/161/161-0.txt",
    "Mansfield Park": "https://www.gutenberg.org/files/141/141-0.txt",
}

In [None]:
# Process each book as a dictionary
books = {}
for title, url in urls.items():
    # urls.items() is a method used on a Python dictionary called urls.
    # it allows you to loop over both keys and values at the same time
    response = requests.get(url)
    # requests is a Python library used to make HTTP requests,
    # like visiting a web page or downloading a file from the internet.

    raw_text = response.text
    # extracts the text content of the response

    # cleanup to remove Gutenberg header/footer
    lines = raw_text.splitlines()
    # print(lines[:10])  # uncomment to see what splitlines() returns

    start = 0
    end = len(lines)
    # initialise the start and end line index

    for i, line in enumerate(lines):
        # enumerate() yields each line together with its index i
        lower_line = line.lower()

        if '*** start of the project gutenberg ebook' in lower_line:
            start = i + 1

        elif '*** end of the project gutenberg ebook' in lower_line:
            end = i
            break


    clean_text = '\n'.join(lines[start:end]).strip()

    text = clean_text.lower()

    words = process(text)

    books[title] = {
        "text": text,
        "words": words
    }

    # frequency and plot
    freq = FreqDist(words)
    top_words = freq.most_common(10)
    words_, counts = zip(*top_words)
    # the trailing underscore (as in words_) is a convention often
    # used to avoid naming conflicts
    # the * in zip(*top_words) unpacks the list of (word, count) pairs,
    # so zip returns one tuple of words and one tuple of counts

    plt.figure(figsize=(6, 3))
    plt.bar(words_, counts, color='skyblue')
    plt.title(f"Top Words in '{title}'")
    # using an f-string (formatted string literal) to dynamically
    # insert a variable into a string i.e. title of each book
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
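
The header and footer removal in the loop above can also be pulled out into a small function, which makes it easy to check on a short string before running it on a whole book. This is a sketch of the same logic; the function name `strip_gutenberg_boilerplate` and the sample text are made up for illustration.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the Gutenberg start and end marker lines."""
    lines = raw_text.splitlines()
    start, end = 0, len(lines)  # fall back to the whole text if no markers

    for i, line in enumerate(lines):
        lower_line = line.lower()
        if "*** start of the project gutenberg ebook" in lower_line:
            start = i + 1  # the book begins on the line after the marker
        elif "*** end of the project gutenberg ebook" in lower_line:
            end = i  # the book ends on the line before the marker
            break

    return "\n".join(lines[start:end]).strip()


# Made-up miniature "book" that imitates the marker lines
sample = "\n".join([
    "The Project Gutenberg eBook of Emma",
    "*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***",
    "",
    "Emma Woodhouse, handsome, clever, and rich,",
    "*** END OF THE PROJECT GUTENBERG EBOOK EMMA ***",
    "Licence text follows here.",
])

body = strip_gutenberg_boilerplate(sample)
print(body)
```

If a file has no marker lines, the function returns the whole text unchanged apart from stripping surrounding whitespace.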


## Topic Modelling

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


Latent Dirichlet Allocation, or LDA, is a technique for uncovering the hidden topics that appear across a collection of texts. Instead of just counting how often words appear, LDA uses probability to look for patterns, such as which words tend to appear together, and groups those words into topics. Each document (text) is seen as a mix of these topics and may be labelled with a dominant topic.

This can be more powerful than word counting because it captures the themes or ideas running through the text. It helps reveal what the text is about, even if certain key words are used in different ways across documents.

In [None]:
# prepare documents as strings (joining tokens back)
# ' '.join(...) is a string method that takes a list of strings
# and joins them into a single string
docs = [
    ' '.join(book["words"][i:i+500])
    # list slicing from i to i + 500 tokens (the chunk size)
    for book in books.values()
    # .values() is a dictionary method that returns a
    # view of all the values in the dictionary
    for i in range(0, len(book["words"]), 500)
    if len(book["words"][i:i+500]) > 50
    # drop short leftover chunks of 50 tokens or fewer
]
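
The chunking step can be easier to follow as a small function run on toy data. This is a sketch of the same idea; the name `chunk_tokens` and its parameters are hypothetical, and the toy sizes are tiny so the result can be read at a glance.

```python
def chunk_tokens(tokens, size=500, min_len=50):
    """Split tokens into strings of `size` tokens, dropping short leftovers."""
    chunks = []
    for i in range(0, len(tokens), size):
        chunk = tokens[i:i + size]   # slice of at most `size` tokens
        if len(chunk) > min_len:     # same rule as above: keep only longer chunks
            chunks.append(" ".join(chunk))
    return chunks


# Twelve made-up tokens: w0, w1, ..., w11
toy_tokens = [f"w{n}" for n in range(12)]

# chunks of 5 tokens; the final chunk has only 2 tokens and is dropped
toy_chunks = chunk_tokens(toy_tokens, size=5, min_len=2)
print(toy_chunks)
```

Each string that comes back plays the role of one "document" for the topic model.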

In [None]:
len(docs)

In [None]:
# create CountVectorizer to convert text to term-frequency matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
# converts a list of text documents into a matrix of word counts
# also known as a document-term matrix (DTM).

Example of what a Term-Frequency Matrix (X) might look like:

| Word  | victim | walk | love | cry | darcy | bennet | thought | may |
| ----- | --- | --- | -- | --- | --- | --- | --- | ------ |
| Doc 1 | 1   | 1   | 1  | 2   | 1   | 0   | 0   | 0      |
| Doc 2 | 0   | 1   | 1  | 2   | 0   | 1   | 1   | 0      |
| Doc 3 | 1   | 0   | 0  | 2   | 0   | 1   | 0   | 1      |

Each document is now a vector of numbers representing word frequency.
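
To see where such a matrix comes from, here is a hand-rolled sketch in plain Python. It is not how `CountVectorizer` is implemented (that class has its own tokenisation rules and returns a sparse matrix); it only illustrates the idea, with made-up documents and with columns sorted alphabetically as an assumption.

```python
from collections import Counter

# Three made-up, already preprocessed documents
toy_docs = ["love walk love", "walk cry", "cry cry darcy"]

# vocabulary: every distinct word, in alphabetical order (one column per word)
vocabulary = sorted({word for doc in toy_docs for word in doc.split()})

# one row per document, one count per vocabulary word
matrix = []
for doc in toy_docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

A `Counter` returns 0 for words it has not seen, which is why absent words show up as zeros in each row.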


The code below trains an LDA model to uncover hidden topics in a set of documents. It is set to find 15 topics, but that number is arbitrary at this stage. The model looks at how words are distributed across the documents and groups them into topics based on the patterns it finds. `lda.fit(X)` fits the model to the term-frequency matrix (`X`) and "learns" what topics are present.

There are techniques to help determine the best number of topics in documents, including topic coherence scores, as well as topic models that infer the number of topics.

In [None]:
# fit LDA model, e.g. 15 topics
lda = LatentDirichletAllocation(n_components=15, random_state=42)
lda.fit(X)

In [None]:
# Get feature names (words)
words = vectorizer.get_feature_names_out()
# get the list of all the unique words (features) the vectorizer learned

You can picture `lda.components_` like this (the numbers are illustrative):

| Topic ↓ / Word Index → | 0    | 1    | 2    | 3    | 4    |
| ---------------------- | ---- | ---- | ---- | ---- | ---- |
| **Topic 0**            | 0.01 | 0.09 | 0.12 | 0.04 | 0.03 |
| **Topic 1**            | 0.03 | 0.02 | 0.01 | 0.12 | 0.08 |


In [None]:
# display top words per topic
for i, topic in enumerate(lda.components_):
    # lda.components_ is a 2D array (matrix) from the trained LDA model:
    # each row is a topic and each number in the row is the importance
    # (weight) of a word for that topic.

    top_indices = topic.argsort()[-5:][::-1]
    # argsort() returns word indices sorted by ascending weight;
    # [-5:] keeps the five largest and [::-1] puts the largest first
    top_words = [words[j] for j in top_indices]
    # look up each index in the vocabulary from get_feature_names_out()
    print(f"Topic {i+1}: {', '.join(top_words)}")

You can picture `words` like this:

| **Column Index** →  | **Word** |
| ---------------- | -------- |
| 0                | every    |
| 1                | walk     |
| 2                | mother   |
| 3                | sister   |
| 4                | fanny    |

This is why we need `words`: it maps each column index in `lda.components_`, which holds the weight of every word in every topic, back to the word itself.
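
The mapping can be tried out by hand with the two illustrative tables above. This sketch uses plain Python lists instead of NumPy arrays, and the helper name `top_words_for_topic` is made up for illustration.

```python
# The illustrative vocabulary and topic weights from the tables above
toy_words = ["every", "walk", "mother", "sister", "fanny"]
toy_components = [
    [0.01, 0.09, 0.12, 0.04, 0.03],  # Topic 0
    [0.03, 0.02, 0.01, 0.12, 0.08],  # Topic 1
]


def top_words_for_topic(weights, vocabulary, n=2):
    """Return the n words with the largest weights in one topic row."""
    # sort the column indices by their weight, largest first
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    # map each of the top n indices back to its word
    return [vocabulary[j] for j in ranked[:n]]


top = [top_words_for_topic(row, toy_words) for row in toy_components]
for i, topic_words in enumerate(top):
    print(f"Topic {i}: {', '.join(topic_words)}")
```

With these toy numbers, Topic 0 is led by the words at indices 2 and 1, and Topic 1 by the words at indices 3 and 4.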

## Network Analysis of Bigrams

In [None]:
from nltk.util import ngrams
from collections import Counter
import networkx as nx

Bigrams are pairs of words that appear next to each other in a sentence (longer sequences are possible too, e.g. trigrams). Bigrams often carry more meaning together than single words studied alone. For example, the words “New” and “York” separately aren't as meaningful as the bigram “New York,” which clearly refers to a place. Using bigrams helps capture these kinds of relationships between words, which can make text analysis or language models more accurate.
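
A bigram list can be built in plain Python by pairing each token with the one that follows it. This is a minimal sketch on made-up tokens; it illustrates the idea, not NLTK's implementation.

```python
from collections import Counter

# Made-up, lowercased tokens
toy_tokens = ["new", "york", "city", "new", "york", "state"]

# zip the list with itself shifted by one position to pair neighbours
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))

# count how often each pair occurs
toy_bigram_counts = Counter(toy_bigrams)

print(toy_bigrams)
print(toy_bigram_counts.most_common(1))
```

A list of n tokens always yields n - 1 bigrams, so the six tokens here give five pairs.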

In [None]:
# collect bigrams from each book’s tokenised words
bigram_counts = Counter()
# Counter is a class from Python's collections module for counting things.

for title, book_data in books.items():
    words = book_data["words"]
    bigrams = list(ngrams(words, 2))
    bigram_counts.update(bigrams)

In [None]:
# get top 50 most common bigrams
top_bigrams = bigram_counts.most_common(50)
top_bigrams[0:10]

In [None]:
# build network
G = nx.Graph()
# creates an empty, undirected graph
for (w1, w2), count in top_bigrams:
    # the loop iterates over each bigram (w1, w2) and its count
    G.add_edge(w1, w2, weight=count)
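
To see what an undirected, weighted graph stores, the same structure can be sketched with nested dictionaries. The bigrams and counts below are made up, and this is only a mental model, not how networkx is implemented. Note that in an undirected graph the pairs (w1, w2) and (w2, w1) are the same edge, so if both appear among the top bigrams the later weight replaces the earlier one, both here and with `G.add_edge`.

```python
from collections import defaultdict

# Made-up (bigram, count) pairs in the same shape as top_bigrams
toy_top_bigrams = [
    (("sir", "thomas"), 40),
    (("lady", "bertram"), 30),
    (("sir", "john"), 20),
]

# adjacency[node][neighbour] = edge weight
adjacency = defaultdict(dict)
for (w1, w2), count in toy_top_bigrams:
    adjacency[w1][w2] = count
    adjacency[w2][w1] = count  # undirected: store the edge in both directions

# weighted degree: the sum of the weights on all edges touching a node
weighted_degree = {node: sum(nbrs.values()) for node, nbrs in adjacency.items()}

print(dict(adjacency))
print(weighted_degree)
```

Words with a high weighted degree are hubs that take part in many frequent bigrams.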

In [None]:
# layout
pos = nx.spring_layout(G, k=0.8)

plt.figure(figsize=(9, 7))
nx.draw_networkx(
    G, pos,
    with_labels=True,
    node_color='skyblue',
    edge_color='gray',
    node_size=300,
    font_size=10,
    font_color='black'
)
plt.title('Bigram Network')
plt.show()


# Sentiment analysis

Sentiment analysis in NLP can be approached in a number of ways. Three important approaches are lexicon-based methods, rule-based systems, and machine learning models.

Lexicon-based methods use predefined sentiment dictionaries to associate each word with polarity or score. Sentiment is derived from aggregating these scores. This technique does not require training data.

Rule-based systems extend the lexicon approach by incorporating linguistic rules, such as the handling of negations or intensifiers.

Machine learning treats sentiment analysis as a supervised classification problem, meaning that models learn patterns from labeled data and then can (hopefully) generalise what they have learned to new examples. Models have included Naive Bayes, logistic regression, and support vector machines, but more recent techniques often draw on [deep learning](https://en.wikipedia.org/wiki/Deep_learning), particularly transformer-based models, which capture more word context.
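
The difference between the first two approaches can be shown with a toy example. Everything in this sketch is hypothetical: the four-word lexicon, the negation list and the single rule are made up for illustration and are far simpler than a real tool such as VADER.

```python
# Hypothetical mini lexicon: word -> polarity score
TOY_LEXICON = {"cheerful": 1.0, "pleasure": 1.0, "cruel": -1.0, "pain": -1.0}
# Hypothetical negation words
NEGATIONS = {"not", "no", "never"}


def lexicon_score(tokens):
    """Lexicon-based: add up the scores of the words found in the lexicon."""
    return sum(TOY_LEXICON.get(token, 0.0) for token in tokens)


def rule_based_score(tokens):
    """Rule-based: as above, but flip a word's score if a negation precedes it."""
    score = 0.0
    for i, token in enumerate(tokens):
        value = TOY_LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


toy_sentence = "she was not cruel".split()
print(lexicon_score(toy_sentence))     # sees only the word "cruel"
print(rule_based_score(toy_sentence))  # also sees the "not" before it
```

The lexicon-only score treats the sentence as negative, while the negation rule reverses it. Neither function needs training data.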

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
# download VADER lexicon
nltk.download('vader_lexicon')

So far we have been doing analysis at the word level, but we can do analysis at the sentence level as well. To do this we need to tokenise the data by sentence.

In [None]:
from nltk.tokenize import sent_tokenize

for title, book_data in books.items():
    text = book_data["text"]  # full book text
    sentences = sent_tokenize(text)  # split into sentences
    book_data["sentences"] = sentences  # add to the book's data


In [None]:
sentences = ['they seemed more like cheerful, easy friends, than lovers.',
 'how could she have been so brutal, so cruel to miss bates!',
 'mrs. collins welcomed her friend with the liveliest pleasure, and elizabeth was more and more satisfied with coming, when she found herself so affectionately received.',
 'kitty was the only one who shed tears; but she did weep from vexation and envy.',
 'to her it was but the natural consequence of a strong affection in a young and ardent mind.',
 'by a former marriage, mr. henry dashwood had one son: by his present lady, three daughters.',
 'but the feelings which made such composure a disgrace, left her in no danger of incurring it.',
 'she had been a beauty, and a prosperous beauty, all her life; and beauty and wealth were all that excited her respect.',
 'she could hardly have made a more untoward choice.',
 'but her uncle’s anger gave her the severest pain of all.']

In [None]:
vader_analyzer = SentimentIntensityAnalyzer()
# creates an instance of the SentimentIntensityAnalyzer class from the
# VADER sentiment analysis tool, giving an object ready to analyse text
for sentence in sentences:
    scores = vader_analyzer.polarity_scores(sentence)
    print(f"Sentence: {sentence}\n Score: {scores['compound']}")
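
The compound score runs from -1 (most negative) to +1 (most positive). To turn it into a label, a common convention, suggested in VADER's own documentation, is to treat scores of 0.05 and above as positive and -0.05 and below as negative. The helper below is a sketch under that assumption; the function name is made up and the threshold is a choice you may want to tune for your texts.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores for illustration
for value in (0.6, -0.3, 0.0):
    print(value, label_from_compound(value))
```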

[Hugging Face](https://huggingface.co/) is an open-source platform that provides a high-level interface to some very powerful NLP models. Through its `transformers` library it simplifies traditional NLP pipelines, including much of what we have been doing "by hand", e.g. tokenising.

In [None]:
# Hugging Face transformer Sentiment Analysis
from transformers import pipeline
# load default model (distilbert fine-tuned on sentiment)
classifier = pipeline("sentiment-analysis")
# pipeline is a function imported from the transformers library by Hugging Face.
for sentence in sentences:
    result = classifier(sentence)[0]
    print(f"Sentence: {sentence}\nLabel: {result['label']}, Score: {result['score']}\n")

Which sentiment scores do you agree with more? Are the scores accurate in your expert human opinion?
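
One way to make the comparison easier is to put both outputs on a similar footing. The default pipeline returns a label (`POSITIVE` or `NEGATIVE`) and a score, whereas VADER returns a single signed number. The sketch below converts a pipeline result into a signed value; the function name and the example results are made up. Keep in mind that the transformer score is the model's confidence in its label, not a measure of how strong the sentiment is, so the two scales are not strictly equivalent.

```python
def signed_score(result):
    """Turn {'label': ..., 'score': ...} into a signed number."""
    sign = 1 if result["label"] == "POSITIVE" else -1
    return sign * result["score"]


# Made-up results in the shape the sentiment pipeline returns
example_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.75},
]

signed = [signed_score(result) for result in example_results]
print(signed)
```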

**1. Understand how and why to preprocess text, and the basic principles of preprocessing**

- We have learned important principles in NLP such as tokenisation, stop words and lemmatisation.

- We have learned how to preprocess texts for different types of analysis, by word and by sentence.

- We have formatted the textual data as appropriate for the analysis tool: as strings, tokens, bigrams and chunks of text.

**2. Implement some forms of automated content analysis**

- We have created frequency distributions to count words across a corpus of Austen's works.

- We have used one technique for breaking the text into chunks of a meaningful size for content analysis (we could also have chosen paragraphs, chapters or books if appropriate).

**3. Understand relational analysis in textual data with network analysis**
- Learned how to create and work with bigrams
- Created a graph representation of words that can also be applied to other entities in text

**4. Analyse subjective meaning in language, such as sentiment, not just literal content**

- We prepared textual data for the analysis of subjective meaning and compared the results of different computational methods of sentiment analysis: a largely lexicon- and rule-based sentiment analyser (VADER) and a more advanced transformer-based encoder model (DistilBERT) that processes text as a sequence, capturing the meaning of each word based on its surrounding words (context).





# Further resources

The [NLTK book](https://www.nltk.org/book/) is available for free online and is an accessible pathway to the basics of NLP.

The [Real Python course](https://realpython.com/nltk-nlp-python/) on NLP will help cement the basics.

An advanced free course with spaCy: https://course.spacy.io/en/