## Fundamentals of Social Data Science
# Week 3 Day 2: Natural Language Processing 

In this class we will be introduced to some of the basics of natural language processing (NLP). This is an incredibly deep field and we can only scratch the surface. It is also a field with very close ties to some of the most impressive advances in artificial intelligence and machine learning. We will not focus in any depth on the AI/ML consequences of this work. Instead we will focus on some foundational topics that will be incredibly useful to appreciate in the run up to an understanding of machine learning. These foundational topics point to the _motivations_ for text analysis as well as the considerations with textual data. 

Learning goals: 

- Appreciate text encodings 
- Within English, appreciate tokenisation, stemming, lemmatisation, stop words
- Understand how to strip HTML from text
- Understand how to create a Term Frequency-Inverse Document Frequency matrix. 

This lecture draws on code from FSSTDS chapters 10 and 11, self-prompted code from Claude Sonnet 3.5 (original and new), ChatGPT, and general online sources as well as bespoke written code. 


In [1]:
import numpy as np
import pandas as pd
from collections import Counter 
# Counter is like a value counts for lists
# It returns a dictionary with the count of each element in the list
# where value_counts returns a pandas series with the count of each element in the series

import re
import string 
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

## Encodings: Considering Text and Language

Language is what people speak as well as what they understand when they read. However, people do not read languages, they read text. They do not hear languages, they hear speech. In both cases, language refers to a set of signs that have meaning, but (and this comes from semiotics) we never observe signs directly. Instead, they are a _relation_ between what we call a signifier and a signified. The signifier might be the word "horse", the signified might be an actual equine animal grazing in a field. The sign "horse" is the semantic association we make between the word "horse" (whether read on a sign saying "beware of wild horses" or heard in speech, where someone says "did you notice the horse in that field over there") and the animal itself. 

Within natural language processing, we are interested in these signs (i.e. what does a 'horse' mean?) and how we can understand them from the data we collect ("How has our discussion of horses changed after we started driving cars" or "do animal rights groups and jockeys use the same set of associations when talking about horses"). But while we are interested in these signs (i.e. the semantic elements of language), we cannot access them directly. They are implicit. Instead we access _data_ which in this case typically means text. NLP can also consider other metatextual elements such as speed, pitch, emphasis, differences in dialect or spelling. But typically we are focused on text and how that text helps us to understand the world better, or how we can use tools in NLP to understand a text better. 

When we translate language into text we need to encode it somehow. So instead of me saying <audio controls>
    <source src="audio_sample.m4a" type="audio/mpeg">
    Your browser does not support the audio element.
</audio>, we read "Hello everyone". 

Text is therefore an encoding of language. But it is an encoding that is useful for humans. Computers on the other hand need ways to encode text. Thus, before we even get to the analysis of text it is worth appreciating encodings if only superficially. 

In [2]:
print("a", ord("a"))
print("A", ord("A"))
print("😂",ord("😂"))

a 97
A 65
😂 128514


The computer does not see these characters as we do. Each character is assigned a number, called a 'code point', in a system called Unicode, and `ord()` returns that number. When text is saved or transmitted, these code points are in turn stored as bytes using an encoding such as UTF-8. Below we can see how to convert these numbers into Unicode escape sequences and how the computer decodes them back into emojis. 

In [3]:
char = "😂"
code_point = ord(char)

# Convert to Unicode escape sequence
if code_point > 0xFFFF:
    # For characters above U+FFFF, use \U format with 8 digits
    unicode_escape = f"\\U{code_point:08x}"
else:
    # For characters U+0000 to U+FFFF, use \u format with 4 digits
    unicode_escape = f"\\u{code_point:04x}"

print(f"Character: {char}")
print(f"Ord value: {code_point}")
print(f"Unicode escape: {unicode_escape}")

# Verify it works by decoding the escape sequence (avoids eval)
print(unicode_escape.encode("ascii").decode("unicode_escape") == char)  # True

Character: 😂
Ord value: 128514
Unicode escape: \U0001f602
True


Now we can work our way backwards and see how the Unicode escape codes can then be interpreted by the computer so we get our emoji back. 

In [4]:
# Unicode escape sequence
emoji_unicode = "\U0001F602"  # Note the capital U and 8 digits for characters above U+FFFF
print(emoji_unicode)  # 😂

# UTF-8 bytes as escape sequences
emoji_bytes = b"\xf0\x9f\x98\x82"
print(emoji_bytes.decode('utf-8'))  # 😂

# Verify they're the same
print(emoji_unicode == "😂")  # True
print(emoji_bytes.decode('utf-8') == "😂")  # True

😂
😂
True
True


There may be times in your work when you get encode and decode errors when reading text or strings. Remember that to encode means converting text into bytes for the computer, whereas to decode means converting those bytes back into the text and emoji strings that you, the interpreter of this data, would expect to read. 
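To make the direction of encoding and decoding concrete, here is a minimal sketch. The sample string is a made-up example, and the choice of ASCII and Latin-1 as the "wrong" codecs is an assumption for illustration; in practice the mismatch will depend on where your data came from.

```python
# Hypothetical sample: "café" followed by the emoji used above
text = "caf\u00e9 \U0001F602"

# Encoding turns text (str) into bytes; UTF-8 uses 1 to 4 bytes per character
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                  # b'caf\xc3\xa9 \xf0\x9f\x98\x82'
print(len(text), len(utf8_bytes))  # 6 characters, 10 bytes

# Decoding with the wrong codec can fail outright...
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as err:
    print("Decode error:", err)

# ...or succeed silently and give garbled text
garbled = utf8_bytes.decode("latin-1")
print(garbled[:5])  # the first five characters are no longer "café "

# errors="replace" swaps undecodable bytes for the U+FFFD replacement character
lossy = utf8_bytes.decode("ascii", errors="replace")
print(lossy)
```

When a file refuses to load, trying an explicit `encoding=` argument in `open()` or `pd.read_csv()` is usually the first thing to check.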

# Processing and "Pre-Processing" Text 

In order to work with text we need to somehow appreciate its _context_. At its simplest, that context is 'the other words around this text'. For example, in a humorous episode of The Simpsons, aliens abduct the family and hold them on board their ship. The plot follows the family's concern that they are going to be on the menu. Daughter Lisa discovers a book and warns the rest of the family. The book's title reads "How to Cook Humans". The family confront the aliens, who then dust the book, revealing that it is "How to Cook For Humans". Then one of the Simpsons blows off more dust to reveal "How to Cook Forty Humans", followed by "How to Cook for Forty Humans". See the clip at: https://www.youtube.com/watch?v=o0QcdgeI5Rs . In each exchange between Kang the Alien and Lisa, the semantics of the word "humans" changes depending on the words in front of it, either as alien meal or as guests at a party. 

How we understand the context for a text is a challenging and expansive subject of inquiry. There are many approaches to this and they range greatly in sophistication. Below is a crude scale of context: 

0. **No context**. We can think of words as separate semantic entities. (i.e "Bag of words")
1. **Adjacent words**. Words will have words before and after them. Combining these together we get N-grams. So "cook humans", "for humans", and "forty humans" would all be 2-grams. 
2. **Adjacent words as blocks of text**. One does not need to fix the exact size of the n-gram. A sentence could be short or long, but it is a single sentence and thus a single collection of words that typically mean things when read together. What is important is the sensemaking that occurs through grammar. In English, consider the sentences "The colour paints well" and "The paints colour well". Here the order gives us a signal that colour is a noun and a subject in the first case, but a verb in the second case. 
3. **Words-in-documents**. Some documents will use some rare words a lot while most documents use some common words. Simply knowing that two words appear in the same document is a useful context. It is also the basis of TF-IDF which we will see at the end. 
4. **Words and their performance**. Words can be uttered as speech. This means that not only will other words be the context, but we might also have metatextual features like pitch. Referencing another pop culture moment, in the sitcom Friends there was an extended discussion about leaving the house when two of the cast forgot their keys. Monica leaves the house asking Rachel behind her "got the keys?" https://youtu.be/JjpnslsuA2g?t=30 . When they get back, Rachel asks Monica whether she has the keys, insisting "You said 'got the keys'" as if it had been an assertion and not a question. Thus, the lack of attention to the context of pitch and tone left them with a misunderstanding that locked them out of the house. 
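The "adjacent words" level of the scale above can be sketched in a few lines. The helper name `ngrams` is a hypothetical one written for this illustration (NLTK ships its own version in `nltk.util`), and the title is split on whitespace only.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The fully dusted-off book title from the Simpsons example
title = "how to cook for forty humans".split()

bigrams = ngrams(title, 2)
trigrams = ngrams(title, 3)

print(bigrams)
# [('how', 'to'), ('to', 'cook'), ('cook', 'for'), ('for', 'forty'), ('forty', 'humans')]
print(trigrams[-1])
# ('for', 'forty', 'humans')
```

Notice that "humans" only appears alongside "forty" in the bigrams. We need at least a 4-gram before "cook" and "humans" share a window, which is one reason the choice of n matters.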

When we process text we must attend to the level of context we wish to consider for that word. As the examples given above show, the very same word "humans" in the first case, and "keys" in the second is understood in relation to context. 

Depending on the stability of the meaning of a word, more or less context might be required. Where less context is required we can say that a word's meaning is stable or unambiguous. Where more context is required we can say that a word's meaning is unstable or ambiguous. When we take in text, we want to be mindful that we are abstracting from context, but we want to ensure that we are not abstracting so far that our ability to understand meaning in the text is compromised.

This has considerable consequences for our understanding of data. If we believe that hate speech is an unpleasant force in the world, we may be inclined to want to 'ban' hate speech. However, when doing so, we may find that people start to use different terms for the same sentiment. For example, while the word "killed" is now banned on TikTok, this has led to people using the term "unalived" to refer to the same 'sign' but with a different signifier. Similarly, on the subreddit /r/Stupidpol where ostensibly leftist reactionaries critique identity politics from their perspective, a ban on discussion of trans issues has led people to use the term "train enthusiasts" or "model train operators" to refer to members of the trans community. We see this all over the world. In China one classic example now is the use of the Grass Mud Horse (草泥马, cǎo ní mǎ) as a substitute for a vulgar curse word. While I won't get into details of the curse word (which you might infer), Horse in Chinese is mǎ, whereas mother is mā. 

Thus, while we may have an interest in reducing _hate speech_, this is really an exercise in reducing _the performance of hatred_. We are not trying to process hate-text, but to use text to understand hate-language. This is important for social data science as it relates to our classic challenge of operationalisation and the related issue of construct validity. It can therefore also inform a number of decisions that we need to make in order to more fully appreciate the meaning behind words both _in context_ and _at scale_. It is also worth considering that within linguistics there is an entire subfield dedicated to "pragmatics", which is how the context of a word's use can influence its meaning. 

I mention this now because the first few techniques that we will show may superficially seem like they are very useful and powerful, but they end up being limited and preliminary. By being able to show these steps, we can then reflect on whether such steps are useful and needed in our own work. We also need to interrogate whether our work is sufficient to help inform others about any given research topic and about appropriate methods to address our _object of inquiry_. So let's proceed with some text processing, but please bear in mind that these tools are not the end of an analysis but the beginning. 

I also mention this because, and this is speaking in an editorial voice, I really dislike word clouds and wish to discourage their use in future academic work. While my rationale for this should be evident from the above discussion, I can spell it out more clearly: Word clouds divorce words from context, provide an inconsistent message to the viewer beyond "big words important", and make considerable assumptions that it is the words themselves and not how they are used which makes a difference. Imagine a word cloud for comments to an airline. In the middle we see "Luggage" and "Service". A superficial reading might say "look how much people care about their luggage" and "people notice our commitment to good service". When in fact the terms are there because most people are commenting to complain "you lost my luggage and the service was terrible". 

Where we want to end up is a place where we can make comments on our object of inquiry through text, but not simply comments on text. Thus, we will start with simple forms of text processing, but then see these as potential forms of pre-processing for more complex tasks. It is these more complex tasks that we ultimately want to consider when our skills get there.

# Words as entities and thus tokens 

Words in English and in most languages can be considered as discrete elements. They may not be written this way. For example in Arabic, words can be chained together in writing, making such discretisation challenging. In English, spaces are typically used to denote separate words. When there is no space we may consider it a compound word, like "something" as a compound of "some" and "thing", or a portmanteau, where two words are blended, such as "hacktivism". We typically still consider compound words and portmanteaus as single words. English also has contractions, where compounded words eliminate letters, such as _cannot_ being written as "can't" and _will not_ being written as "won't". From a semantic point of view "cannot" and "can't" may or may not be considered equivalent. Logically that may be the case, but in a peer reviewed paper we would expect people to use expanded terms like cannot and avoid contractions, which are seen as informal. 

In general the easiest way to take English text and make the words discrete is to 'split' the words and to split them by space. Observe how I might split "It was the best of times, it was the worst of times.". 

In [5]:
quote = "It was the best of times, it was the worst of times."
print(quote.split(" ")) 

['It', 'was', 'the', 'best', 'of', 'times,', 'it', 'was', 'the', 'worst', 'of', 'times.']


Now we can also see that the words have punctuation attached. We might consider "times," and "times." to be the same word, but they do not use the same characters. Within English, punctuation inside a word (e.g., in "can't" or "Y'know") is usually seen as being a part of the word, whereas punctuation outside the word is not. Therefore, we can also remove these punctuation symbols from the words. Using standard Python we can do the following: 

In [6]:
new_quote_list = [word.strip(string.punctuation) for word in quote.split(" ")]
print(new_quote_list)

['It', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']


Now we have words with the punctuation stripped. Notice that in "Introducing Python" we introduced the `strip()` command to show how it can strip the whitespace from either side of a word, so that " hello " and "hello" are equivalent. But we can also strip any set of characters by providing them as an argument. By providing `string.punctuation`, we can strip these characters from both sides of the word. We could use `lstrip()` and `rstrip()` for the left and right sides respectively. However, we still have some words that are seen as generally comparable except for their capitalisation, such as "It" and "it". Capitalisation is important to consider when dealing with proper names. "Mom" and "mom" refer to qualitatively different things. "Mom" is a title/name, as in "Mom said I could take the car", whereas "mom", as in "Before taking the car you should ask your mom", is a role. There are many moms in the world, but for anyone there is typically only one person referred to as "Mom". (Notably, within gay and lesbian parenting this issue comes up with words like "Mom" and "Mama" or "Pop" and "Dad" to distinguish specific parents). 

To that end, we can typically just iterate through words and make them lower or upper case in order to ensure that "It" and "it" are processed the same, but we also must be mindful that in doing so we might turn the one "Mom" into a more general "mom".  

In [7]:
lower_quote_list = [word.lower() for word in new_quote_list]
print(lower_quote_list)

['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
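With the tokens lower-cased, counting them is the simplest "no context" (bag of words) summary. Below is a minimal sketch using `Counter`, which was imported at the top of this notebook. The token list is retyped here only so that the snippet runs on its own.

```python
from collections import Counter

# Same tokens as the lower-cased list printed above
tokens = ['it', 'was', 'the', 'best', 'of', 'times',
          'it', 'was', 'the', 'worst', 'of', 'times']

counts = Counter(tokens)

# Ties are returned in the order the words were first seen
print(counts.most_common(3))  # [('it', 2), ('was', 2), ('the', 2)]

# Missing words return 0 rather than raising a KeyError
print(counts['best'], counts['horse'])  # 1 0
```

Without lower-casing, "It" and "it" would have been counted separately, each with a count of 1.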


English, like many languages but certainly not all, changes the way words are said or spelled depending on their grammatical context. Thus, the verb "to buy" is spelled "buying" as the present participle and "bought" in the past tense, as in "He is _buying_ some shoes. He already _bought_ socks and then later he will _buy_ insoles". In many cases we may want to consider these as equivalent. For this we cannot use simple mechanistic approaches (or rather such approaches are generally unsatisfying). For example, we could just `rstrip("ing")` and `rstrip("s")` to get 'buy' from 'buying' and 'buys', but what about 'bought'? Note also that `rstrip()` removes any trailing characters from the given set rather than a literal suffix, so it can strip more than intended. 
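A quick sketch shows how brittle the mechanistic approach is. The word list is made up for illustration, and `str.removesuffix()` assumes Python 3.9 or later.

```python
words = ["buying", "singing", "king", "bought"]

# rstrip("ing") strips any trailing run of the characters i, n and g
stripped = {w: w.rstrip("ing") for w in words}

# removesuffix("ing") removes the literal suffix "ing" once, if present
suffixed = {w: w.removesuffix("ing") for w in words}

for w in words:
    print(f"{w:<10} rstrip: {stripped[w]:<8} removesuffix: {suffixed[w]}")
```

Here `rstrip` reduces "singing" to "s" and "king" to "k", while `removesuffix` gives "sing" and "k". Neither approach does anything useful with "bought", and neither knows that "king" is not a verb form at all.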

This case is an example of how we will quickly start to run out of strategies with base Python and instead must consider existing models that can help guide us. Within specialist natural language packages we have two related notions: stemming and lemmatisation. Have a look at them below. In both cases we have to import a package to help out.

The first thing to notice is that we do not need to do this discretisation process ourselves. In NLP it is called "tokenisation" and thus we can use a tokeniser for this. Observe: 

In [8]:
nltk.download('punkt_tab', quiet=True) # quiet=True to suppress output if already downloaded. 
# Incidentally, Claude and Copilot insist it is `nltk.download('punkt')`, which is out of date.
# See: https://github.com/guardrails-ai/guardrails/issues/1013 

True

In [9]:
quote = "It was the best of times, it was the worst of times."

# Tokenize the quote
tokens = word_tokenize(quote)

print("Original tokens:")
print(tokens)

# Filter out tokens made up exclusively of punctuation or whitespace
filtered_tokens = [
    token for token in tokens
    if not all(ch in string.punctuation for ch in token) and not token.isspace()
]

print("Filtered tokens:")
print(filtered_tokens)

Original tokens:
['It', 'was', 'the', 'best', 'of', 'times', ',', 'it', 'was', 'the', 'worst', 'of', 'times', '.']
Filtered tokens:
['It', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
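If downloading NLTK data is not an option, a regular expression can give a rough approximation of word tokenisation. This is only a sketch: the pattern below is an assumption chosen for this example, it keeps apostrophes inside words but only recognises unaccented ASCII letters, and it is not how `word_tokenize` works internally.

```python
import re

# One or more letters, optionally followed by apostrophe-plus-letters groups
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

sample = "Y'know, it can't be the worst of times."
regex_tokens = WORD_PATTERN.findall(sample)

print(regex_tokens)
# ["Y'know", 'it', "can't", 'be', 'the', 'worst', 'of', 'times']
```

Unlike splitting on spaces and stripping, this drops the comma and full stop in one pass while keeping "can't" and "Y'know" whole. It would, however, mangle words with accents, digits or hyphens, which is exactly why a dedicated tokeniser is usually the better choice.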


## Stemming and lemmatisation

Stemming creates the "stem" for each word, such that "officer" and "office" both become "offic". Yet they are different words referring to different kinds of objects. Thus lemmatisation, where we find the whole root word, may be more useful. On the other hand, lemmatisation in this case preserves the capitalisation and so "The" and "the" are distinct words. 

In both cases we rely on an existing model that carries specific linguistic details and use it to transform our data. They are simple models but they can help us in our work. And they can be more complex than one might think at first blush. For example, with lemmatisation we can also consider "parts of speech". One might officiate an event (as in being the speaker at the front of the room). That might make them an "official". In this case the two words are kept distinct, but in other cases, such as "run (verb) for office" and "that was a good run (noun)", the same spelling should be treated differently depending on its part of speech. Observe how we can integrate parts of speech using a POS tagger, i.e. `pos_tags = nltk.pos_tag(tokens)`. 

In [10]:

# Download required NLTK data
nltk.download('wordnet', quiet=True)

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Our test sentence
text = "The retired officer was officiating the event since he was the official representative of the head office"

# Tokenize the sentence
tokens = word_tokenize(text)

# Apply stemming and lemmatization
stems = [stemmer.stem(word) for word in tokens]
lemmas = [lemmatizer.lemmatize(word) for word in tokens]

# Print results in a formatted way
print("Original:", text)
print("\nWord-by-word comparison:")
print(f"{'Original':<15} {'Stem':<15} {'Lemma':<15}")
print("-" * 45)
for orig, stem, lemma in zip(tokens, stems, lemmas):
    print(f"{orig:<15} {stem:<15} {lemma:<15}")

Original: The retired officer was officiating the event since he was the official representative of the head office

Word-by-word comparison:
Original        Stem            Lemma          
---------------------------------------------
The             the             The            
retired         retir           retired        
officer         offic           officer        
was             wa              wa             
officiating     offici          officiating    
the             the             the            
event           event           event          
since           sinc            since          
he              he              he             
was             wa              wa             
the             the             the            
official        offici          official       
representative  repres          representative 
of              of              of             
the             the             the            
head            head            head        

In the above example we compare stemming and lemmatisation. Review the table to notice the difference. In the table below we use a separate model that should know which part of speech each word belongs to. Run the code and look at how the inclusion of parts of speech changes the lemmatisation.

In [11]:
# Download required NLTK data
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

text = "The retired officer was officiating the event since he was the official representative of the head office"
# text = "The candidate had a good run but ran out of luck in the final run." # Uncomment to test with this sentence

# Tokenize and get POS tags
tokens = word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)

# Basic lemmatization (without POS)
basic_lemmas = [lemmatizer.lemmatize(word) for word in tokens]

# Lemmatization with POS tags
# Convert Penn Treebank tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'  # adjective
    elif treebank_tag.startswith('V'):
        return 'v'  # verb
    elif treebank_tag.startswith('N'):
        return 'n'  # noun
    elif treebank_tag.startswith('R'):
        return 'r'  # adverb
    else:
        return 'n'  # noun as default

pos_lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) 
              for word, pos in pos_tags]

# Print results
print("Original:", text)
print("\nWord-by-word comparison:")
print(f"{'Original':<15} {'POS':<8} {'Basic Lemma':<15} {'POS Lemma':<15}")
print("-" * 60)
for orig, (_, pos), basic_lem, pos_lem in zip(tokens, pos_tags, basic_lemmas, pos_lemmas):
    print(f"{orig:<15} {pos:<8} {basic_lem:<15} {pos_lem:<15}")

Original: The retired officer was officiating the event since he was the official representative of the head office

Word-by-word comparison:
Original        POS      Basic Lemma     POS Lemma      
------------------------------------------------------------
The             DT       The             The            
retired         JJ       retired         retired        
officer         NN       officer         officer        
was             VBD      wa              be             
officiating     VBG      officiating     officiate      
the             DT       the             the            
event           NN       event           event          
since           IN       since           since          
he              PRP      he              he             
was             VBD      wa              be             
the             DT       the             the            
official        JJ       official        official       
representative  NN       representative  representative 

By using a more complex model we are able to recover more of the context of any given word, such as its part of speech and the base verb "be" rather than "was". 

# Scoring text 

A common task with a suitably tokenised set of words is to assess sentiment. This means that we will have some model that will take in text and return a score reflecting the overall sentiment of some text. 

In my opinion, this sort of work often veers into an overly decontextualised understanding of text. Consequently, the scores that we get out of sentiment analysis are rarely good for more than crude approximations. But it is also a good entry point into thinking about more sophisticated models, because they can enable more understanding of context. 

First, let's look at a very simple sentiment analyser, "SimpleSentimentAnalyzer". This one is custom built and, as you can see, just has some positive words and negative words. It counts the positive and negative words present in the text and returns the difference. Then we can compare it to TextBlob, which is also a very simple lexical analyser. Note that when we say lexical here it means that we are operating on the words themselves. We are not converting words to some sort of abstraction and working on the abstraction. We will see that later when dealing with more complex models.

In [12]:
from textblob import TextBlob

class SimpleSentimentAnalyzer:
    """A basic lexical sentiment analyzer"""
    
    def __init__(self):
        # Very basic positive and negative word lists for demonstration
        self.positive_words = {
            'good', 'great', 'excellent', 'happy', 'wonderful', 'fantastic',
            'amazing', 'love', 'best', 'beautiful', 'nice', 'perfect'
        }
        self.negative_words = {
            'bad', 'terrible', 'awful', 'horrible', 'sad', 'wrong',
            'hate', 'worst', 'poor', 'disappointing', 'negative', 'ugly'
        }
    
    def analyze(self, text):
        """
        Returns a simple sentiment score:
        Positive words count - Negative words count
        """
        words = text.lower().split()
        positive_count = sum(1 for word in words if word in self.positive_words)
        negative_count = sum(1 for word in words if word in self.negative_words)
        return positive_count - negative_count

def compare_methods(texts):
    """Compare simple lexical analysis with TextBlob"""
    simple_analyzer = SimpleSentimentAnalyzer()
    
    print("Comparing sentiment analysis methods:\n")
    for text in texts:
        # Simple lexical analysis
        simple_score = simple_analyzer.analyze(text)
        
        # TextBlob analysis
        blob = TextBlob(text)
        textblob_score = blob.sentiment.polarity
        
        print(f"Text: {text}")
        print(f"Simple lexical score: {simple_score}")
        print(f"TextBlob score: {textblob_score:.2f}")
        print("-" * 50)

# Example usage
example_texts = [
    "This is a good and wonderful day!",
    "The movie was TERRIBLE and disappointing.",
    "The food was okay, nothing special.",
]

compare_methods(example_texts)

Comparing sentiment analysis methods:

Text: This is a good and wonderful day!
Simple lexical score: 2
TextBlob score: 0.85
--------------------------------------------------
Text: The movie was TERRIBLE and disappointing.
Simple lexical score: -1
TextBlob score: -0.80
--------------------------------------------------
Text: The food was okay, nothing special.
Simple lexical score: 0
TextBlob score: 0.43
--------------------------------------------------


In the code below we compare TextBlob to a more sophisticated sentiment analyser called VADER by Hutto and Gilbert. This takes into account not just the words, but also their intensity and some features such as whether they are in ALL CAPS or followed by exclamation marks. It was built from labelled data that was then turned into a static lexical model. 

In [13]:
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd

class SimpleSentimentAnalyzer:
    """A basic lexical sentiment analyzer"""
    
    def __init__(self):
        # Very basic positive and negative word lists for demonstration
        self.positive_words = {
            'good', 'great', 'excellent', 'happy', 'wonderful', 'fantastic',
            'amazing', 'love', 'best', 'beautiful', 'nice', 'perfect'
        }
        self.negative_words = {
            'bad', 'terrible', 'awful', 'horrible', 'sad', 'wrong',
            'hate', 'worst', 'poor', 'disappointing', 'negative', 'ugly'
        }
    
    def analyze(self, text):
        """
        Returns a simple sentiment score:
        Positive words count - Negative words count
        """
        words = text.lower().split()
        positive_count = sum(1 for word in words if word in self.positive_words)
        negative_count = sum(1 for word in words if word in self.negative_words)
        return positive_count - negative_count

def compare_sentiment_analyzers(texts):
    """Compare simple lexical analysis with TextBlob and VADER"""
    # Initialize analyzers
    simple_analyzer = SimpleSentimentAnalyzer()
    vader_analyzer = SentimentIntensityAnalyzer()
    
    results = []
    
    for text in texts:
        # Simple lexical analysis
        simple_score = simple_analyzer.analyze(text)
        
        # TextBlob analysis
        blob = TextBlob(text)
        textblob_score = blob.sentiment.polarity
        
        # VADER analysis
        vader_scores = vader_analyzer.polarity_scores(text)
        
        results.append({
            'Text': text,
            'Simple Score': simple_score,
            'TextBlob Score': round(textblob_score, 3),
            'VADER Compound': round(vader_scores['compound'], 3),
            'VADER Positive': round(vader_scores['pos'], 3),
            'VADER Negative': round(vader_scores['neg'], 3),
            'VADER Neutral': round(vader_scores['neu'], 3)
        })
    
    # Create DataFrame for nice display
    df = pd.DataFrame(results)
    return df

# Example texts demonstrating different aspects of sentiment analysis
example_texts = [
    "This is a good and wonderful day!",
    "The movie was terrible and disappointing.",
    "The food was okay, nothing special.",
    "This is REALLY GREAT!!!",  # Tests VADER's handling of caps and punctuation
    "The book was not bad at all.",  # Tests handling of negation
    "The service was good, but the food was terrible.",  # Tests mixed sentiment
]

# Run comparison
results_df = compare_sentiment_analyzers(example_texts)
print("\nResults:")
print(results_df.to_string())

# Detailed analysis of differences
print("\nKey observations:")
for idx, row in results_df.iterrows():
    text = row['Text']
    print(f"\nText: {text}")
    print(f"- Simple: {row['Simple Score']} (just counts positive/negative words)")
    print(f"- TextBlob: {row['TextBlob Score']} (pattern-based lexical analysis)")
    print(f"- VADER: {row['VADER Compound']} (rule-based with intensity)")
    
    # Highlight interesting differences
    if abs(row['TextBlob Score'] - row['VADER Compound']) > 0.3:
        print("  Note: Significant difference between TextBlob and VADER scores!")
        if "!!!" in text or text.isupper():
            print("  (VADER is likely responding to emphasis from caps/punctuation)")
    # Match "not" as a whole word so that "nothing" does not trigger the note
    if "not" in text.lower().split() and abs(row['Simple Score']) > 0:
        print("  Note: Simple analyzer doesn't handle negation properly!")


Results:
                                               Text  Simple Score  TextBlob Score  VADER Compound  VADER Positive  VADER Negative  VADER Neutral
0                 This is a good and wonderful day!             2           0.850           0.784           0.580           0.000          0.420
1         The movie was terrible and disappointing.            -1          -0.800          -0.743           0.000           0.612          0.388
2               The food was okay, nothing special.             0           0.429          -0.092           0.233           0.277          0.490
3                           This is REALLY GREAT!!!             0           1.000           0.829           0.692           0.000          0.308
4                      The book was not bad at all.            -1           0.350           0.431           0.322           0.000          0.678
5  The service was good, but the food was terrible.             0          -0.150          -0.494           0.149       

This is just the tip of the scoring iceberg. Here are a few places we could go next: 
- **OpenNLP**: A library from Apache which includes the ability to score text lexically using emotions. 
- **LIWC**: A large and versatile tool for scoring text on all kinds of concepts such as "anxiety" or "skepticism". It started as a simple sentiment analyser but has expanded considerably. 
- **Model-based scoring**: This is an active area of research, but we can generally get decent scores out of models like SentiBERT, which are trained on a vast corpus of texts with an architecture that reports a numeric sentiment score as the text passes through the model. Notably, this is _not_ lexical scoring. It does not just assign scores to words but works on word embeddings, which is something we will explore later. 

# TF-IDF: Understanding a corpus through the words used

The earlier approach relied on a lexicon that assigns a score to each word, with those scores then aggregated into an overall score for the text. Departing from that, we can instead look at the corpus itself and characterise each document by how frequently words occur in it. A simple and yet highly versatile approach is TF-IDF (Term Frequency-Inverse Document Frequency): 

$$
\begin{align*}
tf(t,d) &= \frac{\text{count of term }t\text{ in document }d}{\text{total number of terms in document }d} \\[10pt]
idf(t) &= \log\left(\frac{N}{1 + |\{d \in D: t \in d\}|}\right) + 1 \\[10pt]
tf\text{-}idf(t,d,D) &= tf(t,d) \times idf(t)
\end{align*}
$$

In this formula:
- $t$ is the term (our 'words')
- $d$ is the document
- $D$ is the corpus (collection of all documents)
- $N$ is the total number of documents in the corpus
- $|\{d \in D: t \in d\}|$ represents the number of documents containing the term $t$

Term frequency gives us a proportion, not a raw count: we divide the number of times a term appears in a document by the total number of terms in that document. We will have one term frequency per term per document. Together these form a 'term frequency matrix'. 
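
As a quick illustration of term frequency for a single document, here is a minimal sketch. The sentence and the whitespace tokenisation are assumptions chosen for illustration only.

```python
from collections import Counter

# Hypothetical one-sentence document, tokenised by splitting on whitespace
tokens = "the cat sat on the mat".split()

counts = Counter(tokens)

# Term frequency: count of each term divided by the total number of terms
tf = {term: count / len(tokens) for term, count in counts.items()}

for term, value in sorted(tf.items()):
    print(f"{term}: {value:.3f}")
```

Because each count is divided by the same document length, the term frequencies within one document sum to 1.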

Inverse document frequency is a bit more tricky. It gives a weight to the presence of terms so that those which show up in most documents have a lower weight than those that show up in fewer documents. Note that $N$ is in the numerator whereas the document count is in the denominator. This means that if $|\{d \in D: t \in d\}|$ is high, the term is in most of the docs, so when we divide $N$ by this count we get a _lower_ score. The more common a term is across the corpus, the less 'distinct' it is in any given document, and thus it is probably less meaningful. The $1 +$ in the denominator is a smoothing term that avoids division by zero for a term that appears in no document, and the $+ 1$ outside the logarithm stops very common terms from being weighted down to zero or below. 
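
To see how the weight falls as a term spreads across more documents, here is a small sketch of the IDF formula above. The corpus size of four documents and the helper name `idf` are assumptions for illustration.

```python
import math

N = 4  # assumed corpus size, for illustration only

def idf(doc_freq, n_docs=N):
    """Smoothed IDF following the formula above: log(N / (1 + df)) + 1."""
    return math.log(n_docs / (1 + doc_freq)) + 1

for doc_freq in (1, 2, 3, 4):
    print(f"term in {doc_freq} of {N} documents -> idf = {idf(doc_freq):.3f}")
```

With $N = 4$, a term found in only one document gets a weight of about 1.693, while a term found in three documents gets exactly 1.0.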

Note that our TF matrix will have shape (num_documents × num_terms) while our IDF vector will have shape (num_terms,). When we multiply them element-wise, NumPy broadcasts the IDF vector across the rows, so each term's IDF weight is automatically applied to that term's frequency in every document.


## Intermezzo: Matrix multiplication 

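TF-IDF is then the product of the term frequency matrix and the inverse document frequency vector. Strictly speaking, this is an element-wise multiplication in which each column of the TF matrix is scaled by that term's IDF weight. That is equivalent to a matrix multiplication by a diagonal matrix holding the IDF weights. It is not a matrix-vector product, which would collapse each document to a single number. A term with a high frequency in a few docs will therefore show up as more notable than a term with high frequency across all docs. In a way we are weighting the term frequencies per-document by how rare the terms are across all documents. 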

The code below will perform this from scratch, but then we will see how we can abstract all these details in a 'Vectorizer'. 

The code below will produce relatively long output, but the purpose is to see each of the transformation steps along the way. Let's run this code and then circle back to the functions that produce the various shapes and see if we can understand how we get to the end. 

Before we embark on this, here is a toy explanation of multiplying a matrix by a vector of weights. It gets a little less intuitive when we have to multiply two matrices by each other. I again recommend the superlative series Essence of Linear Algebra by 3Blue1Brown for a visual and intuitive understanding of these concepts. Here is the playlist: https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab 

Imagine a really simple example: 

~~~txt
Term Frequencies:        cat   dog   sat
Document 1:            [0.2   0.0   0.1]
Document 2:            [0.1   0.3   0.0]

IDF weights:           [2.0   1.5   3.0]

Results in:

TF-IDF Matrix:         cat   dog   sat
Document 1:            [0.4   0.0   0.3]    # (0.2 × 2.0)  (0.0 × 1.5)  (0.1 × 3.0)
Document 2:            [0.2   0.45  0.0]    # (0.1 × 2.0)  (0.3 × 1.5)  (0.0 × 3.0)
~~~
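
The toy example above can be sketched in NumPy as follows. The variable names are illustrative, and the numbers are copied from the example.

```python
import numpy as np

# Toy values copied from the example above (columns: cat, dog, sat)
tf = np.array([
    [0.2, 0.0, 0.1],   # Document 1
    [0.1, 0.3, 0.0],   # Document 2
])
idf = np.array([2.0, 1.5, 3.0])

# Element-wise product: the (3,) IDF vector is broadcast across both rows
tfidf = tf * idf

# The same result written as a matrix multiplication by a diagonal matrix
tfidf_via_diag = tf @ np.diag(idf)

print(tf.shape, idf.shape, tfidf.shape)
print(tfidf)
print(np.allclose(tfidf, tfidf_via_diag))
```

The broadcast form is the one usually written in practice, since it avoids building a mostly empty diagonal matrix.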



To reiterate, the code below is a bit formidable if you read it from start to finish. I strongly recommend running it and reading through the output, then going back to the cell and seeing if you can trace through the operations. These are effectively the sort of preprocessing we saw above followed by the matrix transformations specified in the formula above. 


In [14]:
# Sample corpus
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the mat was on the floor",
    "a dog and a cat played together"
]

def tokenize(text):
    """Simple tokenizer: lowercase the text, strip punctuation, and split on whitespace"""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.split()

def create_vocabulary(tokenized_corpus):
    """Create vocabulary and word-to-index mapping"""
    unique_words = sorted({word for doc in tokenized_corpus for word in doc})
    word_to_idx = {word: idx for idx, word in enumerate(unique_words)}
    return unique_words, word_to_idx

def compute_tf(tokenized_doc, word_to_idx, vocab_size):
    """Compute Term Frequency for a document"""
    word_counts = Counter(tokenized_doc)
    tf_array = np.zeros(vocab_size)
    for word, count in word_counts.items():
        if word in word_to_idx:
            idx = word_to_idx[word]
            tf_array[idx] = count / len(tokenized_doc)
    return tf_array

def compute_idf(tokenized_corpus, word_to_idx, vocab_size):
    """Compute Inverse Document Frequency for all terms"""
    doc_counts = np.zeros(vocab_size)
    
    for doc in tokenized_corpus:
        unique_words = set(doc)
        for word in unique_words:
            if word in word_to_idx:
                idx = word_to_idx[word]
                doc_counts[idx] += 1
    
    idf = np.log(len(tokenized_corpus) / (1 + doc_counts)) + 1
    return idf

def visualize_tfidf_transformation(corpus):
    """
    Compute TF-IDF matrix with visualization of intermediate steps
    """
    print("Starting with corpus:")
    for i, doc in enumerate(corpus):
        print(f"Document {i}: {doc}")
    print("\n" + "="*80 + "\n")

    # Step 1: Tokenize all documents
    tokenized_corpus = [tokenize(doc) for doc in corpus]
    print("Step 1: Tokenized corpus:")
    for i, doc in enumerate(tokenized_corpus):
        print(f"Document {i}: {doc}")
    print("\n" + "="*80 + "\n")

    # Step 2: Create vocabulary
    vocabulary, word_to_idx = create_vocabulary(tokenized_corpus)
    print("Step 2: Vocabulary created:")
    print("Vocabulary:", vocabulary)
    print("\nWord to index mapping:")
    for word, idx in word_to_idx.items():
        print(f"{word}: {idx}")
    print("\n" + "="*80 + "\n")

    # Step 3: Compute document frequencies and IDF
    vocab_size = len(vocabulary)
    idf = compute_idf(tokenized_corpus, word_to_idx, vocab_size)
    
    # Create DataFrame for IDF values
    idf_df = pd.DataFrame({
        'term': vocabulary,
        'document_frequency': [sum(1 for doc in tokenized_corpus if term in set(doc)) for term in vocabulary],
        'idf': idf
    })
    print("Step 3: IDF values:")
    print(idf_df.to_string())
    print("\n" + "="*80 + "\n")

    # Step 4: Compute TF for each document
    tf_matrix = np.zeros((len(corpus), vocab_size))
    for i, doc in enumerate(tokenized_corpus):
        tf_matrix[i] = compute_tf(doc, word_to_idx, vocab_size)
    
    # Create DataFrame for TF values
    tf_df = pd.DataFrame(
        tf_matrix,
        columns=vocabulary,
        index=[f"Document {i}" for i in range(len(corpus))]
    )
    print("Step 4: Term Frequency matrix:")
    print(tf_df.to_string())
    print("\n" + "="*80 + "\n")

    # Step 5: Compute TF-IDF matrix
    tfidf_matrix = tf_matrix * idf
    
    # Create DataFrame for TF-IDF values
    tfidf_df = pd.DataFrame(
        tfidf_matrix,
        columns=vocabulary,
        index=[f"Document {i}" for i in range(len(corpus))]
    )
    print("Step 5: TF-IDF matrix:")
    print(tfidf_df.to_string())
    
    return {
        'vocabulary': vocabulary,
        'word_to_idx': word_to_idx,
        'idf_df': idf_df,
        'tf_df': tf_df,
        'tfidf_df': tfidf_df
    }

def analyze_term_importance(tfidf_df):
    """
    Analyze which terms are most distinctive across the corpus
    """
    print("\nStep 6: Analyzing term distinctiveness:")
    print("="*80)
    
    # Calculate average TF-IDF score for each term
    term_importance = pd.DataFrame({
        'term': tfidf_df.columns,
        'avg_tfidf': tfidf_df.mean(),
        'max_tfidf': tfidf_df.max(),
        'documents_with_max': [
            ', '.join([f"Document {i}" for i, value in enumerate(tfidf_df[term]) 
                      if value == tfidf_df[term].max()]) 
            for term in tfidf_df.columns
        ]
    }).sort_values('avg_tfidf', ascending=False)
    
    print("\nTerms ranked by average TF-IDF score (most distinctive first):")
    print(term_importance.to_string())
    
    print("\nTop 3 most distinctive terms and where they appear strongest:")
    for _, row in term_importance.head(3).iterrows():
        print(f"\n{row['term']}:")
        print(f"  Average TF-IDF: {row['avg_tfidf']:.3f}")
        print(f"  Maximum TF-IDF: {row['max_tfidf']:.3f}")
        print(f"  Strongest in: {row['documents_with_max']}")
        
        # Show the term's TF-IDF scores across all documents
        print("\n  TF-IDF scores across documents:")
        for doc_idx, score in enumerate(tfidf_df[row['term']]):
            print(f"    Document {doc_idx}: {score:.3f}")


# Run the visualization
results = visualize_tfidf_transformation(corpus)
analyze_term_importance(results['tfidf_df'])


Starting with corpus:
Document 0: the cat sat on the mat
Document 1: the dog chased the cat
Document 2: the mat was on the floor
Document 3: a dog and a cat played together


Step 1: Tokenized corpus:
Document 0: ['the', 'cat', 'sat', 'on', 'the', 'mat']
Document 1: ['the', 'dog', 'chased', 'the', 'cat']
Document 2: ['the', 'mat', 'was', 'on', 'the', 'floor']
Document 3: ['a', 'dog', 'and', 'a', 'cat', 'played', 'together']


Step 2: Vocabulary created:
Vocabulary: ['a', 'and', 'cat', 'chased', 'dog', 'floor', 'mat', 'on', 'played', 'sat', 'the', 'together', 'was']

Word to index mapping:
a: 0
and: 1
cat: 2
chased: 3
dog: 4
floor: 5
mat: 6
on: 7
played: 8
sat: 9
the: 10
together: 11
was: 12


Step 3: IDF values:
        term  document_frequency       idf
0          a                   1  1.693147
1        and                   1  1.693147
2        cat                   3  1.000000
3     chased                   1  1.693147
4        dog                   2  1.287682
5      floor        

So the most distinctive terms in this corpus, ranked by average TF-IDF, are "the", "cat" and "a". This might not be the best example, since 'the' and 'a' are not usually meaningful. To round out the analysis, let's do the following: 
- consider "stop words"
- look at how we can do all these operations using scikit-learn's TF-IDF vectorizer. 

Stop words are common English words that rarely convey meaning. That said, in my own work on /r/mensrights and /r/menslib the stop words "she" and "her" turned out to be meaningful and helped to classify the subs. So this is an area where you would want to exercise your own judgment over the corpus. 
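
One way to exercise that judgment is to remove selected words from the stop-word list before filtering. The sketch below uses a small hand-written set as a stand-in for a real list; the set, the sentence, and the helper `remove_stop_words` are all hypothetical.

```python
# Small hand-written stand-in for a stop-word list (hypothetical; a real list
# such as NLTK's English stop words is much longer)
base_stop_words = {"the", "a", "and", "on", "was", "she", "her", "is", "to"}

# Pronouns we have decided are meaningful for this particular corpus
keep = {"she", "her"}
custom_stop_words = base_stop_words - keep

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "she said the dog was on her mat".split()
print(remove_stop_words(tokens, base_stop_words))    # ['said', 'dog', 'mat']
print(remove_stop_words(tokens, custom_stop_words))  # ['she', 'said', 'dog', 'her', 'mat']
```

A set built this way can be converted with `list(custom_stop_words)` and passed as the `stop_words` argument of `TfidfVectorizer`.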

In [15]:
# Download stopwords if you haven't already
nltk.download('stopwords', quiet=True)

# Get English stopwords from NLTK
stop_words = list(set(stopwords.words('english')))

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words=stop_words)

# Fit and transform the corpus
tfidf_matrix = vectorizer.fit_transform(corpus)

# Convert to DataFrame for better visualization
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"Document {i}" for i in range(len(corpus))]
)

# Show the most distinctive terms
term_importance = pd.DataFrame({
    'term': tfidf_df.columns,
    'avg_tfidf': tfidf_df.mean()
}).sort_values('avg_tfidf', ascending=False)

print("TF-IDF scores after removing stopwords:")
display(tfidf_df.style.format("{:.3f}"))
print("\nMost distinctive terms:")
display(term_importance.head().style.format({'avg_tfidf': '{:.3f}'}))

TF-IDF scores after removing stopwords:


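| | cat | chased | dog | floor | mat | played | sat | together |
|---|---|---|---|---|---|---|---|---|
| Document 0 | 0.448 | 0.000 | 0.000 | 0.000 | 0.553 | 0.000 | 0.702 | 0.000 |
| Document 1 | 0.448 | 0.702 | 0.553 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Document 2 | 0.000 | 0.000 | 0.000 | 0.785 | 0.619 | 0.000 | 0.000 | 0.000 |
| Document 3 | 0.367 | 0.000 | 0.453 | 0.000 | 0.000 | 0.575 | 0.000 | 0.575 |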



Most distinctive terms:


| | term | avg_tfidf |
|---|---|---|
| cat | cat | 0.316 |
| mat | mat | 0.293 |
| dog | dog | 0.252 |
| floor | floor | 0.196 |
| chased | chased | 0.176 |
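
These scores are on a different scale from the from-scratch version, and not only because of the stop words. With its default settings, scikit-learn's `TfidfVectorizer` uses raw counts for term frequency, a smoothed IDF of $\ln\left(\frac{1 + N}{1 + df}\right) + 1$, and then scales each document row to unit Euclidean (L2) length. The sketch below reproduces the Document 0 row by hand under those defaults. The token lists are an assumption about what remains after stop-word removal, and `tfidf_row` is a hypothetical helper.

```python
import math

# Documents after stop-word removal (assumed to match what the vectorizer kept)
docs = [
    ["cat", "sat", "mat"],
    ["dog", "chased", "cat"],
    ["mat", "floor"],
    ["dog", "cat", "played", "together"],
]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency and smoothed IDF: ln((1 + N) / (1 + df)) + 1
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw count times IDF, then scale the row to unit Euclidean length."""
    raw = {term: doc.count(term) * idf[term] for term in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row0 = tfidf_row(docs[0])
print({term: round(value, 3) for term, value in sorted(row0.items())})
```

The printed values for "cat", "mat" and "sat" should agree with the Document 0 row in the table above to three decimal places.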


# Summary 

This lecture moved from language to text and then from text to ranking. We ranked words in two ways: first with a lexical approach that relies on pre-defined scores for words, and then with a structural approach that inductively uses word frequencies to give us a sense of the corpus. 

In the lab we apply this to headlines on Reddit. This will be the start of a cumulative group project on distinctive subreddits. The lab will give some TF-IDF results out of the box but your job will be to motivate an analysis and tweak a little code. 