# Analyzing word and document frequency: tf-idf

A central question in text mining and natural language processing is how to quantify what a document is about. Can we do this by looking at the words that make up the document?

One measure of how important a word may be is its term frequency (tf).

This is how frequently a word occurs in a document - as we saw in Lab 2. However, there are words in a document that occur many times but may not be important. In English these words are most often things like “the”, “is”, “of”, and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a very sophisticated approach to adjusting term frequency for commonly used words.

Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.

The tf-idf statistic is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
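
To make the definition concrete before working with the novels, here is a minimal sketch in plain Python on a made-up corpus. The three toy documents and the helper name `tf_idf` are illustrative assumptions, not part of the lab data.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tiny "documents", already tokenized
docs = {
    "doc_a": "the cat sat on the mat".split(),
    "doc_b": "the dog sat on the log".split(),
    "doc_c": "the cat chased the dog".split(),
}

n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in docs.values() for word in set(tokens))

def tf_idf(word, doc_name):
    tokens = docs[doc_name]
    tf = tokens.count(word) / len(tokens)    # term frequency within one document
    idf = math.log(n_docs / doc_freq[word])  # inverse document frequency
    return tf * idf

print(tf_idf("the", "doc_a"))  # "the" occurs in every document, so idf = log(1) = 0
print(tf_idf("mat", "doc_a"))  # "mat" occurs only in doc_a, so it scores highest there
```

Even though "the" is the most frequent word in every toy document, its tf-idf is zero, while a word that is specific to one document gets a positive score.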

### Preparing data

In [1]:
import requests
import string
import pandas as pd

# Jane Eyre
book_url = 'https://www.gutenberg.org/files/1260/1260-0.txt'
response = requests.get(book_url)
bronte1 = response.text
allowed_chars = string.ascii_letters + string.digits + string.whitespace
bronte1 = ''.join(c for c in bronte1 if c in allowed_chars)

# Wuthering Heights
book_url = 'https://www.gutenberg.org/cache/epub/768/pg768.txt'
response = requests.get(book_url)
bronte2 = response.text
allowed_chars = string.ascii_letters + string.digits + string.whitespace
bronte2 = ''.join(c for c in bronte2 if c in allowed_chars)

# Villette (labelled 'Vilette' in the dataframes below)
book_url = 'https://www.gutenberg.org/files/9182/9182-0.txt'
response = requests.get(book_url)
bronte3 = response.text
allowed_chars = string.ascii_letters + string.digits + string.whitespace
bronte3 = ''.join(c for c in bronte3 if c in allowed_chars)

# Agnes Grey
book_url = 'https://www.gutenberg.org/files/767/767-0.txt'
response = requests.get(book_url)
bronte4 = response.text
allowed_chars = string.ascii_letters + string.digits + string.whitespace
bronte4 = ''.join(c for c in bronte4 if c in allowed_chars)

# Create our dataframes
bronte1_lines = bronte1.splitlines()

bronte1_df = pd.DataFrame({
    "line": bronte1_lines,
    "line_number": list(range(len(bronte1_lines)))
})

bronte2_lines = bronte2.splitlines()

bronte2_df = pd.DataFrame({
    "line": bronte2_lines,
    "line_number": list(range(len(bronte2_lines)))
})

bronte3_lines = bronte3.splitlines()

bronte3_df = pd.DataFrame({
    "line": bronte3_lines,
    "line_number": list(range(len(bronte3_lines)))
})

bronte4_lines = bronte4.splitlines()

bronte4_df = pd.DataFrame({
    "line": bronte4_lines,
    "line_number": list(range(len(bronte4_lines)))
})

# We’ll want to know which content comes from which book
bronte1_df = bronte1_df.assign(book = 'Jane Eyre')
bronte2_df = bronte2_df.assign(book = 'Wuthering Heights')
bronte3_df = bronte3_df.assign(book = 'Vilette')
bronte4_df = bronte4_df.assign(book = 'Agnes Grey')

# Finally, we concatenate the books into one dataframe
books = [bronte1_df, bronte2_df, bronte3_df, bronte4_df]
bronte_books_df = pd.concat(books)
bronte_books_df.head()

Unnamed: 0,line,line_number,book
0,START OF THE PROJECT GUTENBERG EBOOK 1260,0,Jane Eyre
1,,1,Jane Eyre
2,JANE EYRE,2,Jane Eyre
3,AN AUTOBIOGRAPHY,3,Jane Eyre
4,,4,Jane Eyre
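
One side effect of the cleaning step above is worth knowing about: dropping every disallowed character outright glues together words that were separated only by punctuation. This is the most likely source of tokens such as `glovessuch` in the counts further down. The sketch below shows one possible alternative, replacing each disallowed character with a space. The function name `clean_text` and the sample sentence are illustrative assumptions.

```python
import string

ALLOWED = set(string.ascii_letters + string.digits + string.whitespace)

def clean_text(raw):
    """Replace every character outside ALLOWED with a space instead of deleting it."""
    return "".join(c if c in ALLOWED else " " for c in raw)

sample = "no gloves--such was her haste. Don't stop!"

# Deleting characters (the approach used in the cell above) merges "gloves" and "such"
print("".join(c for c in sample if c in ALLOWED).split())

# Replacing them with spaces keeps the word boundary
print(clean_text(sample).split())
```

The trade-off is that contractions are split as well ("Don't" becomes "Don" and "t"), so which variant is preferable depends on the analysis.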


In [2]:
# Split the data into words
# First, split each entry in the 'line' column into a list of words
bronte_books_df['word'] = bronte_books_df['line'].str.split()

# Explode the 'word' column so that each word in the list gets its own row
bronte_books_df = bronte_books_df.explode('word')

# Reset the index of the dataframe (we want to index each word now)
bronte_books_df = bronte_books_df.reset_index(drop=True)
bronte_books_df.head()

Unnamed: 0,line,line_number,book,word
0,START OF THE PROJECT GUTENBERG EBOOK 1260,0,Jane Eyre,START
1,START OF THE PROJECT GUTENBERG EBOOK 1260,0,Jane Eyre,OF
2,START OF THE PROJECT GUTENBERG EBOOK 1260,0,Jane Eyre,THE
3,START OF THE PROJECT GUTENBERG EBOOK 1260,0,Jane Eyre,PROJECT
4,START OF THE PROJECT GUTENBERG EBOOK 1260,0,Jane Eyre,GUTENBERG


In [3]:
# For our investigations the line & line_number columns will not be necessary, so we will remove them
bronte_books_df = bronte_books_df[['book', 'word']]
bronte_books_df

Unnamed: 0,book,word
0,Jane Eyre,START
1,Jane Eyre,OF
2,Jane Eyre,THE
3,Jane Eyre,PROJECT
4,Jane Eyre,GUTENBERG
...,...,...
579034,Agnes Grey,THE
579035,Agnes Grey,PROJECT
579036,Agnes Grey,GUTENBERG
579037,Agnes Grey,EBOOK


### Word counting revisited

In [4]:
# Let's count the occurrences of each word - this is a prerequisite for finding term frequency
count_df = bronte_books_df.groupby('word')['word'].count() # Group by word column, then only keep the word column and perform the counting

# Let's sort by count, most frequent word first
count_df_sorted = count_df.sort_values(ascending=False)

count_df_sorted.head(10)

word
the    22085
and    19486
I      18440
to     15632
of     13147
a      12183
in      7987
was     7440
you     6319
her     5981
Name: word, dtype: int64

In [5]:
# The .size() method operates similarly, but differs slightly in output format
# .size() also counts null values, which .count() does not
bronte_books_df.groupby(['word']).size().sort_values(ascending=False).reset_index(name='count')

Unnamed: 0,word,count
0,the,22085
1,and,19486
2,I,18440
3,to,15632
4,of,13147
...,...,...
30104,glowa,1
30105,glovessuch,1
30106,gloveless,1
30107,glossily,1
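
The difference between `.size()` and `.count()` matters in this pipeline. Splitting a blank line gives an empty list, and `explode` turns an empty list into a row with a missing word. A small sketch on made-up data (the two-book frame below is an assumption for illustration):

```python
import pandas as pd

lines = pd.DataFrame({
    "book": ["A", "A", "B"],
    "line": ["the cat", "", "the dog barked"],  # the empty string mimics a blank line
})
lines["word"] = lines["line"].str.split()
words = lines.explode("word")  # the blank line becomes one row with a missing word

sizes = words.groupby("book").size()            # counts rows, including the missing word
counts = words.groupby("book")["word"].count()  # counts non-null words only

print(sizes)
print(counts)
```

When grouping by `word`, the two agree because rows with a missing key are dropped from the groups by default. When grouping by `book`, `.size()` also counts the rows that came from blank lines, so `.count()` on the `word` column is the safer choice if an exact word total is needed.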


In [6]:
# Groupby allows grouping based on multiple columns
bronte_books_df.groupby(['word', 'book']).size().sort_values(ascending=False).reset_index(name='count')

Unnamed: 0,word,book,count
0,the,Vilette,7894
1,the,Jane Eyre,7332
2,I,Jane Eyre,7009
3,and,Jane Eyre,6263
4,and,Vilette,6163
...,...,...,...
53281,paperwork,Wuthering Heights,1
53282,paperwork,Vilette,1
53283,papers,Wuthering Heights,1
53284,conflict,Wuthering Heights,1


### Aggregate
One useful and elegant way of counting and aggregating data in pandas is the .agg() method.


In [7]:
# We group our data by words, then we aggregate and can decide what information we want to display for each column

# setting 'first' for the book column means that the new dataframe will show, in the book column, the first book in which each word occurs
# setting 'count' for the word column means that the new dataframe will show, in the word column, how many times that word occurs
count_df = bronte_books_df.groupby('word').agg({'book': 'first', 'word': 'count'})
count_df

# In other words: for each group (a word and all its appearances), the 'book' column shows the first book in that group and the 'word' column shows the number of entries in that group
# .agg() accepts a dict that maps each column to its own aggregation, so different columns can be aggregated differently (here: first and count)

word,book,word
1,Wuthering Heights,5
10,Vilette,1
12,Vilette,1
1260,Jane Eyre,2
13th,Jane Eyre,1
...,...,...
zigzag,Jane Eyre,2
zigzags,Vilette,1
zle,Vilette,1
zone,Vilette,2


In [8]:
# Because we used groupby, the name 'word' now labels both the index and a column
# To avoid naming problems down the line, we rename the 'word' column to 'count'
count_df = count_df.rename(columns={'word': 'count'})

# Sorting values based on count column
count_df.sort_values('count', ascending=False)

word,book,count
the,Jane Eyre,22085
and,Jane Eyre,19486
I,Jane Eyre,18440
to,Jane Eyre,15632
of,Jane Eyre,13147
...,...,...
glowa,Jane Eyre,1
glovessuch,Vilette,1
gloveless,Agnes Grey,1
glossily,Jane Eyre,1


### Merging Dataframes

Next, we want a dataframe that tells us how many times each word appears in each book and how many words each book contains in total.

Merging two dataframes is a convenient way to get there.

In [9]:
count_df_1 = bronte_books_df.groupby(['word', 'book']).size().sort_values(ascending=False).reset_index(name='count') # How many appearances each word has in each book
count_df_1


Unnamed: 0,word,book,count
0,the,Vilette,7894
1,the,Jane Eyre,7332
2,I,Jane Eyre,7009
3,and,Jane Eyre,6263
4,and,Vilette,6163
...,...,...,...
53281,paperwork,Wuthering Heights,1
53282,paperwork,Vilette,1
53283,papers,Wuthering Heights,1
53284,conflict,Wuthering Heights,1


In [10]:
count_df_2 = bronte_books_df.groupby(['book']).size().sort_values(ascending=False).reset_index(name='count') # How many words each book has
count_df_2


Unnamed: 0,book,count
0,Vilette,199315
1,Jane Eyre,189694
2,Wuthering Heights,121114
3,Agnes Grey,68916


In [11]:
book_words = count_df_1.merge(count_df_2, on='book')
book_words


Unnamed: 0,word,book,count_x,count_y
0,the,Vilette,7894,199315
1,and,Vilette,6163,199315
2,I,Vilette,5762,199315
3,of,Vilette,4924,199315
4,to,Vilette,4732,199315
...,...,...,...,...
53281,congenial,Agnes Grey,1,68916
53282,fellowcreatures,Agnes Grey,1,68916
53283,confused,Agnes Grey,1,68916
53284,panting,Agnes Grey,1,68916


In [12]:
book_words = book_words.rename(columns={'count_x': 'word_appearances_in_book', 'count_y': 'book_total_word_count'}) # Give more meaningful names
book_words.head(10)

Unnamed: 0,word,book,word_appearances_in_book,book_total_word_count
0,the,Vilette,7894,199315
1,and,Vilette,6163,199315
2,I,Vilette,5762,199315
3,of,Vilette,4924,199315
4,to,Vilette,4732,199315
5,a,Vilette,4406,199315
6,in,Vilette,2980,199315
7,was,Vilette,2836,199315
8,her,Vilette,2071,199315
9,it,Vilette,1905,199315
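
The rename step above is needed because both inputs have a column called `count`, and `merge` disambiguates them with its default suffixes `_x` and `_y`. As a small sketch on made-up data (the two frames and the suffix names are illustrative assumptions), the `suffixes` parameter of `merge` lets you choose clearer names up front:

```python
import pandas as pd

# Hypothetical per-word and per-book counts
per_word = pd.DataFrame({
    "word": ["the", "cat", "the"],
    "book": ["A", "A", "B"],
    "count": [3, 1, 2],
})
per_book = pd.DataFrame({"book": ["A", "B"], "count": [4, 2]})

# Overlapping column names get the given suffixes instead of _x and _y
merged = per_word.merge(per_book, on="book", suffixes=("_word", "_book_total"))
print(merged.columns.tolist())
```

Either route gives the same data. Renaming afterwards, as in the cell above, simply makes the intent explicit in a separate step.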


### Exercise 1

1. Add a **tf** (term frequency) column to your book_words dataframe.
2. Add a new idf column to your dataframe
3. Add the final tf-idf column to your dataframe
4. Display your dataframe's words in descending order of their tf-idf.

Term frequency says how frequently a given word appears in a book. The formula for calculating it is
    
    term_frequency = word_appearances_in_book / book_total_word_count

Idf (inverse document frequency) is computed as

    idf = log(N / n)

where N is the total number of documents (books) in your dataset and n is the number of documents containing the word.



Once you have tf and idf, the tf-idf is obtained by simply multiplying the two.

Hint: For part 2 (the idf column), the pandas **transform** function could come in handy.

In [13]:
# 1.1
import numpy as np

book_words["tf"] = book_words["word_appearances_in_book"] / book_words["book_total_word_count"]

In [14]:
# 1.2

book_words["doc_count"] = book_words.groupby("word")["book"].transform("nunique")

n_books = book_words["book"].nunique()  # 4 books in this corpus

book_words["idf"] = np.log(n_books / book_words["doc_count"])

In [15]:
# 1.3

book_words["tf-idf"] = book_words["tf"] * book_words["idf"]

book_words[book_words["doc_count"] == 2]

Unnamed: 0,word,book,word_appearances_in_book,book_total_word_count,tf,doc_count,idf,tf-idf
66,Madame,Vilette,316,199315,0.001585,2,0.693147,0.001099
74,M,Vilette,291,199315,0.001460,2,0.693147,0.001012
122,de,Vilette,186,199315,0.000933,2,0.693147,0.000647
215,Monsieur,Vilette,103,199315,0.000517,2,0.693147,0.000358
262,Project,Vilette,84,199315,0.000421,2,0.693147,0.000292
...,...,...,...,...,...,...,...,...
53259,congratulating,Agnes Grey,1,68916,0.000015,2,0.693147,0.000010
53264,conqueror,Agnes Grey,1,68916,0.000015,2,0.693147,0.000010
53265,conquest,Agnes Grey,1,68916,0.000015,2,0.693147,0.000010
53271,conquer,Agnes Grey,1,68916,0.000015,2,0.693147,0.000010


In [16]:
# 1.4

book_words_sorted = book_words.sort_values(by="tf-idf", ascending=False)
book_words_sorted

Unnamed: 0,word,book,word_appearances_in_book,book_total_word_count,tf,doc_count,idf,tf-idf
34602,Heathcliff,Wuthering Heights,413,121114,0.003410,1,1.386294,0.004727
34610,Linton,Wuthering Heights,340,121114,0.002807,1,1.386294,0.003892
18925,Rochester,Jane Eyre,312,189694,0.001645,1,1.386294,0.002280
34611,Catherine,Wuthering Heights,333,121114,0.002749,2,0.693147,0.001906
34656,Hareton,Wuthering Heights,164,121114,0.001354,1,1.386294,0.001877
...,...,...,...,...,...,...,...,...
29844,suppressing,Jane Eyre,1,189694,0.000005,4,0.000000,0.000000
29802,swing,Jane Eyre,1,189694,0.000005,4,0.000000,0.000000
29791,suspect,Jane Eyre,1,189694,0.000005,4,0.000000,0.000000
29716,generously,Jane Eyre,1,189694,0.000005,4,0.000000,0.000000
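
As a sanity check, the top row of the table can be reproduced by hand. The numbers below are copied from the "Heathcliff" row above; the variable names are only for illustration.

```python
import math

n_books = 4
appearances, book_total, books_with_word = 413, 121114, 1  # "Heathcliff" row above

tf = appearances / book_total
idf = math.log(n_books / books_with_word)  # natural logarithm, as with np.log
print(round(tf, 6), round(idf, 6), round(tf * idf, 6))

# A word found in all four books gets idf = log(4 / 4) = 0, hence tf-idf = 0
print(math.log(n_books / 4))
```

The second print explains the bottom of the table: words that occur in every book end up with a tf-idf of exactly zero, no matter how often they appear.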


# Language Models

A language model is a statistical model that can be used to estimate the probability of a sequence of words in a language. It is trained on a corpus of text data, and learns to predict the likelihood of observing a given sequence of words based on the frequency and context of those words in the training data.

Language models can be used for a variety of natural language processing tasks, such as text generation, machine translation, speech recognition, and more.

In [17]:
import nltk
from nltk.corpus import brown
from nltk import FreqDist
nltk.download('brown')

# load the Brown corpus
corpus = brown.words()

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\sergiu.varga\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


In this example, we're using the Brown corpus, which is available through the nltk library. The corpus is a collection of text samples from a wide range of genres, including news, fiction, and academic writing.

In [18]:
print(corpus[1100:1110]) # Print a sample of 10 words from the corpus



In [19]:
# create a frequency distribution of the words in the corpus
freq_dist = FreqDist(corpus)

# calculate the total number of words in the corpus
total_words = len(corpus)

# calculate the probability of each word in the corpus
word_probs = {word: freq_dist[word] / total_words for word in freq_dist.keys()}
print(word_probs['high']) # Probability of the word 'high' to appear

0.0003970058353829513


### Naive sentence generation

We're going to create a naive function that generates sentences using our language model.

In [20]:
# generate a sentence using the language model
import random

def generate_sentence(word_length = 10):
    sentence = []
    while len(sentence) < word_length:
        word = random.choices(list(word_probs.keys()), list(word_probs.values()))[0]
        sentence.append(word)
    return " ".join(sentence)

In [21]:
print(generate_sentence())

the home China square '' afraid . . he prevent


The generated sentences are unlikely to sound good, since the model is extremely naive.

Each word is drawn independently at random, with a probability proportional to its frequency in the Brown corpus. No word depends on the words that come before it.
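
The same idea can be sketched without nltk on a toy token list. This is only an illustration: the token list and the function name `sample_unigram_sentence` are assumptions. It also draws all words in a single `random.choices` call via the `k` argument instead of rebuilding the word and weight lists on every iteration.

```python
import random
from collections import Counter

# Hypothetical toy corpus standing in for brown.words()
tokens = "the cat sat on the mat and the dog sat on the log".split()

counts = Counter(tokens)
words = list(counts)
weights = list(counts.values())  # raw counts work as weights; no need to normalize

def sample_unigram_sentence(word_length=10, seed=None):
    rng = random.Random(seed)  # a seed makes the output reproducible
    return " ".join(rng.choices(words, weights=weights, k=word_length))

print(sample_unigram_sentence(5, seed=0))
```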

# N-grams

So far we’ve treated words as individual units and looked only at how frequently each one occurs. However, many interesting text analyses are based on the relationships between words.
One such relationship is given by n-grams.

N-grams are groups of n consecutive words that appear in a given text corpus.

Bigrams are groups of 2 consecutive words (e.g. she went, he ate, car crashed).

Trigrams are groups of 3 consecutive words (e.g. she went home, he ate a, the car crashed).
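
The definition translates almost directly into code. The helper below is a minimal sketch in plain Python (the name `ngrams` is an assumption chosen for this example; the cells that follow use nltk's own helpers):

```python
def ngrams(tokens, n):
    """Return all groups of n consecutive tokens as tuples."""
    # Zip the token list with copies of itself shifted by 1, 2, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "she went home".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

A text with fewer than n tokens simply yields an empty list.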

In [22]:
# Example of what bigrams look like
bigrams = list(nltk.bigrams(corpus))
bigrams[:10]

[('The', 'Fulton'),
 ('Fulton', 'County'),
 ('County', 'Grand'),
 ('Grand', 'Jury'),
 ('Jury', 'said'),
 ('said', 'Friday'),
 ('Friday', 'an'),
 ('an', 'investigation'),
 ('investigation', 'of'),
 ('of', "Atlanta's")]

In [23]:
# Example of what trigrams look like
trigrams = list(nltk.trigrams(corpus))
trigrams[:10]

[('The', 'Fulton', 'County'),
 ('Fulton', 'County', 'Grand'),
 ('County', 'Grand', 'Jury'),
 ('Grand', 'Jury', 'said'),
 ('Jury', 'said', 'Friday'),
 ('said', 'Friday', 'an'),
 ('Friday', 'an', 'investigation'),
 ('an', 'investigation', 'of'),
 ('investigation', 'of', "Atlanta's"),
 ('of', "Atlanta's", 'recent')]

### Naive next word prediction

Knowing that word relations are pretty important in our language, let's create a function that predicts what the next word in a sentence would be using a simple **bigram** language model.

In [24]:
from nltk.corpus import brown
import random

# get the words from the Brown corpus
corpus = brown.words()

# create bigrams from the corpus
bigrams = list(nltk.bigrams(corpus))

# calculate the frequency distribution of the bigrams
bigram_freqdist = nltk.FreqDist(bigrams)

# calculate the total number of bigrams in the corpus
total_bigrams = len(bigrams)

# create a function to generate the next word based on the previous word
def generate_next_word(sentence):
    prev_word = sentence.split()[-1]
    possible_words = {}
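    for bigram in bigram_freqdist:
        if bigram[0] == prev_word:
            # Dividing by total_bigrams gives a joint probability; the denominator is the
            # same for every candidate, so the most likely next word is unaffected
            possible_words[bigram[1]] = bigram_freqdist[bigram] / total_bigrams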
    if possible_words:
        return max(possible_words, key=possible_words.get)
    else:
        return None

In [25]:
# predict the next word for a given context
context = "The director"
next_word = generate_next_word(context)
print(f"The predicted next word for '{context}' is '{next_word}'")

The predicted next word for 'The director' is 'of'
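
The function above scans every distinct bigram on each call. One possible alternative, sketched here on a made-up sentence, is to build a lookup table once that maps each word to a counter of its followers. The table also makes it easy to compute the conditional probability of the next word given the previous one. All names in this sketch are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for brown.words()
tokens = "the director of the film said the director was late".split()

# followers[w] counts the words that appear directly after w
followers = defaultdict(Counter)
for prev_word, next_word in zip(tokens, tokens[1:]):
    followers[prev_word][next_word] += 1

def predict_next_word(sentence):
    words = sentence.split()
    options = followers.get(words[-1]) if words else None
    if not options:
        return None
    return options.most_common(1)[0][0]

def conditional_prob(prev_word, next_word):
    """Estimate P(next_word | prev_word) from the bigram counts."""
    options = followers.get(prev_word, Counter())
    total = sum(options.values())
    return options[next_word] / total if total else 0.0

print(predict_next_word("I met the"))
print(conditional_prob("the", "director"))
```

Each prediction then needs a single dictionary lookup, at the cost of keeping the table in memory.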


### Exercise 2
1. Create a function that takes the number of words to generate as input and generates a sentence using the bigram language model above. You can start with a random first word from the Brown corpus and then use the generate_next_word(sentence) function to help you.

2. Create a function that predicts the next word of a sentence by looking at the previous two words. This means you will create a trigram language model - use the same Brown corpus as before.

In [48]:
# 2.1

def generate_bigram_based_sentence(no_words):
    
    words = brown.words()
    start_word = random.choice(words)
    sentence = [start_word]
    
    for _ in range(no_words - 1):
        next_word = generate_next_word(" ".join(sentence))
        if next_word is None:
            break
        sentence.append(next_word)
    
    return " ".join(sentence)

bigram_sentence = generate_bigram_based_sentence(10)

bigram_sentence

'that the same time , and the same time ,'
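
The repetition in the output above is a consequence of always picking the single most likely follower: once the sentence reaches a word it has produced before, it follows the same path again. One common remedy, sketched below on a toy token list, is to sample the next word in proportion to the bigram counts instead. This is a variation on the exercise, not part of it, and all names are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for brown.words()
tokens = "at the same time , and the same day , the same time passed".split()

followers = defaultdict(Counter)
for prev_word, next_word in zip(tokens, tokens[1:]):
    followers[prev_word][next_word] += 1

def sample_sentence(start_word, n_words, seed=None):
    rng = random.Random(seed)
    sentence = [start_word]
    while len(sentence) < n_words:
        options = followers.get(sentence[-1])
        if not options:
            break  # the last word was never followed by anything in the corpus
        # Draw the next word with probability proportional to its bigram count
        next_word = rng.choices(list(options), weights=list(options.values()))[0]
        sentence.append(next_word)
    return " ".join(sentence)

print(sample_sentence("the", 8, seed=3))
```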

In [49]:
# 2.2

corpus = brown.words()
trigrams = list(nltk.trigrams(corpus))
trigram_freqdist = nltk.FreqDist(trigrams)
total_trigrams = len(trigrams)

def generate_next_word_trigram(sentence):
    words = sentence.split()
    if len(words) < 2:
        return None
    prev_bigram = tuple(words[-2:])
    possible_words = {}
    for trigram in trigram_freqdist:
        if (trigram[0], trigram[1]) == prev_bigram:
            possible_words[trigram[2]] = trigram_freqdist[trigram] / total_trigrams
    if possible_words:
        return max(possible_words, key=possible_words.get)
    else:
        return None

generate_next_word_trigram(bigram_sentence)

'and'
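
A trigram model returns nothing when it has never seen the last two words together. A simple way to handle that, shown here as a sketch beyond what the exercise asks for, is to fall back to the bigram model in that case. The toy corpus and the function name `predict_with_backoff` are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for brown.words()
tokens = "she went home and she went out and he went home".split()

bigram_followers = defaultdict(Counter)
trigram_followers = defaultdict(Counter)
for a, b in zip(tokens, tokens[1:]):
    bigram_followers[a][b] += 1
for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
    trigram_followers[(a, b)][c] += 1

def predict_with_backoff(sentence):
    words = sentence.split()
    # First try the two-word context (trigram model)
    if len(words) >= 2:
        options = trigram_followers.get(tuple(words[-2:]))
        if options:
            return options.most_common(1)[0][0]
    # Otherwise fall back to the one-word context (bigram model)
    if words:
        options = bigram_followers.get(words[-1])
        if options:
            return options.most_common(1)[0][0]
    return None

print(predict_with_backoff("he went"))    # two-word context seen in the corpus
print(predict_with_backoff("they went"))  # unseen pair, falls back to "went" alone
```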

### N-grams in dataframes

Let's get back to our books.
We'll create a dataframe containing information about the bigrams in our books corpus.


In [28]:
# The simplest way to do this would be to create the dataframe directly from bigrams rather than unigrams (single words)
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

bronte1_bigrams = list(nltk.bigrams(nltk.word_tokenize(bronte1)))
bronte1_df = pd.DataFrame(bronte1_bigrams, columns=['Word 1', 'Word 2'])

bronte2_bigrams = list(nltk.bigrams(nltk.word_tokenize(bronte2)))
bronte2_df = pd.DataFrame(bronte2_bigrams, columns=['Word 1', 'Word 2'])

bronte3_bigrams = list(nltk.bigrams(nltk.word_tokenize(bronte3)))
bronte3_df = pd.DataFrame(bronte3_bigrams, columns=['Word 1', 'Word 2'])

bronte4_bigrams = list(nltk.bigrams(nltk.word_tokenize(bronte4)))
bronte4_df = pd.DataFrame(bronte4_bigrams, columns=['Word 1', 'Word 2'])


# We’ll want to know which content comes from which book
bronte1_df = bronte1_df.assign(book = 'Jane Eyre')
bronte2_df = bronte2_df.assign(book = 'Wuthering Heights')
bronte3_df = bronte3_df.assign(book = 'Vilette')
bronte4_df = bronte4_df.assign(book = 'Agnes Grey')

# Finally, we concatenate the books into one dataframe
books = [bronte1_df, bronte2_df, bronte3_df, bronte4_df]
bronte_books_df = pd.concat(books)
bronte_books_df

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sergiu.varga\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sergiu.varga\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unnamed: 0,Word 1,Word 2,book
0,START,OF,Jane Eyre
1,OF,THE,Jane Eyre
2,THE,PROJECT,Jane Eyre
3,PROJECT,GUTENBERG,Jane Eyre
4,GUTENBERG,EBOOK,Jane Eyre
...,...,...,...
67874,OF,THE,Agnes Grey
67875,THE,PROJECT,Agnes Grey
67876,PROJECT,GUTENBERG,Agnes Grey
67877,GUTENBERG,EBOOK,Agnes Grey


### Exercise 3

1. Add a **bigram** column that shows the entire bigrams ("The Project" and "Project Gutenberg" are examples of this column's values), not just the separate words.
2. Clean the dataframe by removing stop words.
3. Display the 10 most frequently occurring bigrams.

In [29]:
# 3.1

bronte_books_df["bigram"] = bronte_books_df["Word 1"] + " " + bronte_books_df["Word 2"]

bronte_books_df


Unnamed: 0,Word 1,Word 2,book,bigram
0,START,OF,Jane Eyre,START OF
1,OF,THE,Jane Eyre,OF THE
2,THE,PROJECT,Jane Eyre,THE PROJECT
3,PROJECT,GUTENBERG,Jane Eyre,PROJECT GUTENBERG
4,GUTENBERG,EBOOK,Jane Eyre,GUTENBERG EBOOK
...,...,...,...,...
67874,OF,THE,Agnes Grey,OF THE
67875,THE,PROJECT,Agnes Grey,THE PROJECT
67876,PROJECT,GUTENBERG,Agnes Grey,PROJECT GUTENBERG
67877,GUTENBERG,EBOOK,Agnes Grey,GUTENBERG EBOOK


In [30]:
# 3.2

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

bronte_books_df = bronte_books_df[~bronte_books_df["Word 1"].str.lower().isin(stop_words)]
bronte_books_df = bronte_books_df[~bronte_books_df["Word 2"].str.lower().isin(stop_words)]
bronte_books_df = bronte_books_df.reset_index(drop=True)

bronte_books_df

Unnamed: 0,Word 1,Word 2,book,bigram
0,PROJECT,GUTENBERG,Jane Eyre,PROJECT GUTENBERG
1,GUTENBERG,EBOOK,Jane Eyre,GUTENBERG EBOOK
2,EBOOK,1260,Jane Eyre,EBOOK 1260
3,1260,JANE,Jane Eyre,1260 JANE
4,JANE,EYRE,Jane Eyre,JANE EYRE
...,...,...,...,...
100773,London,Colchester,Agnes Grey,London Colchester
100774,Eton,END,Agnes Grey,Eton END
100775,PROJECT,GUTENBERG,Agnes Grey,PROJECT GUTENBERG
100776,GUTENBERG,EBOOK,Agnes Grey,GUTENBERG EBOOK


In [31]:
# 3.3

most_freq_bigrams = bronte_books_df["bigram"].value_counts()
most_freq_bigrams.head(10)

Mr Rochester         281
Project Gutenberg    166
Dr John              126
St John              119
Mr Heathcliff        118
Mrs Fairfax          107
Madame Beck          101
Mrs Bretton           92
young lady            74
Miss Grey             71
Name: bigram, dtype: int64
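
"Project Gutenberg" ranks second in this list most likely because every file carries Project Gutenberg's header and license text in addition to the novel itself. If that matters for the analysis, one option is to keep only the lines between the START and END marker lines before tokenizing. The sketch below assumes the markers contain the phrases "START OF", "END OF" and "PROJECT GUTENBERG EBOOK", as the first preview line of the dataframe suggests. The function name is hypothetical, and the exact marker wording should be checked against the downloaded files.

```python
def strip_gutenberg_boilerplate(text):
    """Keep only the lines between the START and END marker lines, if both exist."""
    lines = text.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        upper = line.upper()
        if "PROJECT GUTENBERG EBOOK" in upper:
            if "START OF" in upper and start is None:
                start = i
            elif "END OF" in upper:
                end = i
    if start is None or end is None or end <= start:
        return text  # markers not found: leave the text untouched
    return "\n".join(lines[start + 1:end])

sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK 1260 ***\n"
    "JANE EYRE\n"
    "Chapter I\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK 1260 ***\n"
    "Donate to Project Gutenberg"
)
print(strip_gutenberg_boilerplate(sample))
```

Because the check only looks for the phrases, it works both on the raw text and on text that has already had its punctuation removed.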

### Exercise 4

1. Create a dataframe containing the **bigram, word1, word2** and **book** columns for the following 4 books and remove stop words:
        https://www.gutenberg.org/cache/epub/1228/pg1228.txt - On the Origin of Species, by Charles Darwin

        https://www.gutenberg.org/cache/epub/4363/pg4363.txt - Beyond Good and Evil, by Friedrich Nietzsche

        https://www.gutenberg.org/cache/epub/3296/pg3296.txt - The Confessions of Saint Augustine, by Saint Augustine

        https://www.gutenberg.org/files/1661/1661-0.txt - The Adventures of Sherlock Holmes, by Arthur Conan Doyle

2. Display the most frequent 8 words of each book (use word1 column when counting)

3. Display the most relevant 8 words of each book based on tf-idf (use word1 column when counting)

4. Display the most relevant 5 bigrams of each book based on tf-idf

5. Display the most frequent 5 street names found in the entire 4 book corpus. The book they are coming from should also be visible.

6. Choose a fixed word1 of your choice and find the most common 5 bigrams in each book that have word1 equal to the word you chose.




In [32]:
# 4.1 

import requests
import string
import pandas as pd
from nltk.corpus import stopwords
import nltk  # the code below calls nltk.bigrams and nltk.word_tokenize

stop_words = stopwords.words('english')
allowed_chars = string.ascii_letters + string.digits + string.whitespace

# On the Origin of Species, by Charles Darwin
book_url = 'https://www.gutenberg.org/cache/epub/1228/pg1228.txt'
response = requests.get(book_url)
book1 = response.text
book1 = ''.join(c.lower() for c in book1 if c in allowed_chars)

# Beyond Good and Evil, by Friedrich Nietzsche
book_url = 'https://www.gutenberg.org/cache/epub/4363/pg4363.txt'
response = requests.get(book_url)
book2 = response.text
book2 = ''.join(c.lower() for c in book2 if c in allowed_chars)

# The Confessions of Saint Augustine, by Saint Augustine
book_url = 'https://www.gutenberg.org/cache/epub/3296/pg3296.txt'
response = requests.get(book_url)
book3 = response.text
book3 = ''.join(c.lower() for c in book3 if c in allowed_chars)

# The Adventures of Sherlock Holmes, by Arthur Conan Doyle
book_url = 'https://www.gutenberg.org/files/1661/1661-0.txt'
response = requests.get(book_url)
book4 = response.text
book4 = ''.join(c.lower() for c in book4 if c in allowed_chars)

book1_bigrams = list(nltk.bigrams(nltk.word_tokenize(book1)))
book1_df = pd.DataFrame(book1_bigrams, columns=['Word 1', 'Word 2'])

book2_bigrams = list(nltk.bigrams(nltk.word_tokenize(book2)))
book2_df = pd.DataFrame(book2_bigrams, columns=['Word 1', 'Word 2'])

book3_bigrams = list(nltk.bigrams(nltk.word_tokenize(book3)))
book3_df = pd.DataFrame(book3_bigrams, columns=['Word 1', 'Word 2'])

book4_bigrams = list(nltk.bigrams(nltk.word_tokenize(book4)))
book4_df = pd.DataFrame(book4_bigrams, columns=['Word 1', 'Word 2'])

book1_df = book1_df[~book1_df["Word 1"].isin(stop_words)]
book1_df = book1_df[~book1_df["Word 2"].isin(stop_words)]

book2_df = book2_df[~book2_df["Word 1"].isin(stop_words)]
book2_df = book2_df[~book2_df["Word 2"].isin(stop_words)]

book3_df = book3_df[~book3_df["Word 1"].isin(stop_words)]
book3_df = book3_df[~book3_df["Word 2"].isin(stop_words)]

book4_df = book4_df[~book4_df["Word 1"].isin(stop_words)]
book4_df = book4_df[~book4_df["Word 2"].isin(stop_words)]

book1_df["Bigram"] = book1_df["Word 1"] + " " + book1_df["Word 2"]
book2_df["Bigram"] = book2_df["Word 1"] + " " + book2_df["Word 2"]
book3_df["Bigram"] = book3_df["Word 1"] + " " + book3_df["Word 2"]
book4_df["Bigram"] = book4_df["Word 1"] + " " + book4_df["Word 2"]

# We’ll want to know which content comes from which book
book1_df = book1_df.assign(book = 'On the Origin of Species')
book2_df = book2_df.assign(book = 'Beyond Good and Evil')
book3_df = book3_df.assign(book = 'The Confessions of Saint Augustine')
book4_df = book4_df.assign(book = 'The Adventures of Sherlock Holmes')

# Finally, we concatenate the books into one dataframe
books = [book1_df, book2_df, book3_df, book4_df]
books_df = pd.concat(books)
books_df

Unnamed: 0,Word 1,Word 2,Bigram,book
1,project,gutenberg,project gutenberg,On the Origin of Species
2,gutenberg,ebook,gutenberg ebook,On the Origin of Species
13,natural,selection,natural selection,On the Origin of Species
22,anyone,anywhere,anyone anywhere,On the Origin of Species
26,united,states,united states,On the Origin of Species
...,...,...,...,...
107543,archive,foundation,archive foundation,The Adventures of Sherlock Holmes
107547,help,produce,help produce,The Adventures of Sherlock Holmes
107550,new,ebooks,new ebooks,The Adventures of Sherlock Holmes
107558,email,newsletter,email newsletter,The Adventures of Sherlock Holmes


In [33]:
# 4.2

frequent_words = books_df.groupby(['Word 1', 'book']).size().reset_index(name='count')
top_words_by_book_df = frequent_words.sort_values(['book', 'count'], ascending=[True, False])

top_words_by_book_df.groupby("book").head(8)

Unnamed: 0,Word 1,book,count
11569,one,Beyond Good and Evil,226
5812,every,Beyond Good and Evil,135
12941,project,Beyond Good and Evil,89
11184,new,Beyond Good and Evil,86
7407,good,Beyond Good and Evil,81
7671,gutenberg,Beyond Good and Evil,80
15742,still,Beyond Good and Evil,78
7252,german,Beyond Good and Evil,76
11570,one,On the Origin of Species,450
11061,natural,On the Origin of Species,380


In [34]:
# 4.3
top_words_by_corpus_df = books_df.groupby(['book']).size().sort_values(ascending=False).reset_index(name='count')
    
book_words = top_words_by_book_df.merge(top_words_by_corpus_df, on='book')
book_words = book_words.rename(columns={'count_x': 'word_appearances_in_book', 'count_y': 'book_total_word_count'})

book_words["tf"] = book_words["word_appearances_in_book"] / book_words["book_total_word_count"]
book_words["doc_count"] = book_words.groupby("Word 1")["book"].transform("nunique")

book_words["idf"] = np.log( 4 / book_words["doc_count"] ) 
book_words["tf-idf"] = book_words["tf"] * book_words["idf"]

book_words.sort_values(by=['book', 'tf-idf'], ascending=[True, False]).groupby('book').head(8)

Unnamed: 0,Word 1,book,word_appearances_in_book,book_total_word_count,tf,doc_count,idf,tf-idf
78,refined,Beyond Good and Evil,20,12992,0.001539,1,1.386294,0.002134
7,german,Beyond Good and Evil,76,12992,0.00585,3,0.287682,0.001683
119,morality,Beyond Good and Evil,14,12992,0.001078,1,1.386294,0.001494
126,democratic,Beyond Good and Evil,13,12992,0.001001,1,1.386294,0.001387
131,nowadays,Beyond Good and Evil,13,12992,0.001001,1,1.386294,0.001387
58,hitherto,Beyond Good and Evil,24,12992,0.001847,2,0.693147,0.00128
145,gregarious,Beyond Good and Evil,12,12992,0.000924,1,1.386294,0.00128
148,morals,Beyond Good and Evil,12,12992,0.000924,1,1.386294,0.00128
4881,species,On the Origin of Species,372,32277,0.011525,2,0.693147,0.007989
4913,geological,On the Origin of Species,99,32277,0.003067,1,1.386294,0.004252


In [35]:
# 4.4
frequent_bigrams = books_df.groupby(['Bigram', 'book']).size().reset_index(name='count')
top_bigrams_by_book_df = frequent_bigrams.sort_values(['book', 'count'], ascending=[True, False])

top_bigrams_by_corpus_df = books_df.groupby(['book']).size().sort_values(ascending=False).reset_index(name='count')
    
book_bigrams = top_bigrams_by_book_df.merge(top_bigrams_by_corpus_df, on='book')
book_bigrams = book_bigrams.rename(columns={'count_x': 'bigram_appearances_in_book', 'count_y': 'book_total_bigram_count'})

book_bigrams["tf"] = book_bigrams["bigram_appearances_in_book"] / book_bigrams["book_total_bigram_count"]
book_bigrams["doc_count"] = book_bigrams.groupby("Bigram")["book"].transform("nunique")

book_bigrams["idf"] = np.log( 4 / book_bigrams["doc_count"] ) 
book_bigrams["tf-idf"] = book_bigrams["tf"] * book_bigrams["idf"]

book_bigrams.sort_values(by=['book', 'tf-idf'], ascending=[True, False]).groupby('book').head(8)

Unnamed: 0,Bigram,book,bigram_appearances_in_book,book_total_bigram_count,tf,doc_count,idf,tf-idf
2,one must,Beyond Good and Evil,32,12992,0.002463,2,0.693147,0.001707
6,modern ideas,Beyond Good and Evil,16,12992,0.001232,1,1.386294,0.001707
12,beyond good,Beyond Good and Evil,13,12992,0.001001,1,1.386294,0.001387
20,free spirits,Beyond Good and Evil,9,12992,0.000693,1,1.386294,0.00096
4,would like,Beyond Good and Evil,17,12992,0.001308,2,0.693147,0.000907
23,good opinion,Beyond Good and Evil,8,12992,0.000616,1,1.386294,0.000854
24,historical sense,Beyond Good and Evil,8,12992,0.000616,1,1.386294,0.000854
10,one another,Beyond Good and Evil,14,12992,0.001078,2,0.693147,0.000747
11690,natural selection,On the Origin of Species,287,32277,0.008892,1,1.386294,0.012327
11692,organic beings,On the Origin of Species,85,32277,0.002633,1,1.386294,0.003651


In [37]:
# 4.5

df_streets = books_df

# Keep bigrams whose second word starts with "street" (case-insensitive)
df_streets = df_streets[df_streets["Bigram"].str.contains(" street", case=False, na=False)]

street_counts = df_streets.groupby(["Bigram", "book"]).size().reset_index(name="Count")

street_counts = street_counts.sort_values(by="Count", ascending=False)

street_counts.head(5)  

Unnamed: 0,Bigram,book,Count
1,baker street,The Adventures of Sherlock Holmes,29
15,leadenhall street,The Adventures of Sherlock Holmes,4
24,threadneedle street,The Adventures of Sherlock Holmes,2
21,regent street,The Adventures of Sherlock Holmes,2
12,goodge street,The Adventures of Sherlock Holmes,2


In [61]:
# 4.6

word1 = "one "

# Bigrams whose first word is "one", ranked by tf-idf within each book
one_bigrams = book_bigrams[book_bigrams["Bigram"].str.startswith(word1)]
one_bigrams.sort_values(by=['book', 'tf-idf'], ascending=[True, False]).groupby('book').head(8)

Unnamed: 0,Bigram,book,bigram_appearances_in_book,book_total_bigram_count,tf,doc_count,idf,tf-idf
2,one must,Beyond Good and Evil,32,12992,0.002463,2,0.693147,0.001707
10,one another,Beyond Good and Evil,14,12992,0.001078,2,0.693147,0.000747
30,one might,Beyond Good and Evil,7,12992,0.000539,1,1.386294,0.000747
39,one finds,Beyond Good and Evil,6,12992,0.000462,1,1.386294,0.00064
86,one wishes,Beyond Good and Evil,4,12992,0.000308,1,1.386294,0.000427
8,one may,Beyond Good and Evil,15,12992,0.001155,3,0.287682,0.000332
169,one wished,Beyond Good and Evil,3,12992,0.000231,1,1.386294,0.00032
472,one believed,Beyond Good and Evil,2,12992,0.000154,1,1.386294,0.000213
11694,one species,On the Origin of Species,70,32277,0.002169,2,0.693147,0.001503
11879,one case,On the Origin of Species,9,32277,0.000279,1,1.386294,0.000387


In [62]:
books_df

Unnamed: 0,Word 1,Word 2,Bigram,book
1,project,gutenberg,project gutenberg,On the Origin of Species
2,gutenberg,ebook,gutenberg ebook,On the Origin of Species
13,natural,selection,natural selection,On the Origin of Species
22,anyone,anywhere,anyone anywhere,On the Origin of Species
26,united,states,united states,On the Origin of Species
...,...,...,...,...
107543,archive,foundation,archive foundation,The Adventures of Sherlock Holmes
107547,help,produce,help produce,The Adventures of Sherlock Holmes
107550,new,ebooks,new ebooks,The Adventures of Sherlock Holmes
107558,email,newsletter,email newsletter,The Adventures of Sherlock Holmes


# Exercise 5
1. Create a dataframe containing all the columns you will need to do a trigram-based analysis for **a book of your choice**. Suggested columns: trigram, word1, word2, word3, book and potentially tf, idf, tf-idf.

2. Display the top 10 words by tf-idf.
3. Display the 5 most frequent trigrams for each of 2 target words of your choice.

 This means you will choose 2 separate words from the book (ideally relevant ones, e.g. the names of main characters), fix each word in one of the 3 trigram positions, and then display the most frequent trigrams that contain it in that position. For example, if you choose 'John' as one of your two words, you set it as the fixed word (it is up to you whether it is word1, word2 or word3) and find the most frequent trigrams that have 'John' in that position.

 Do this for two separate words. Some suggestions: use the protagonist and the antagonist of your story, or opposing principles/subjects from your work.




In [98]:
# 5.1
import requests
import string
import pandas as pd
from nltk.corpus import stopwords
import nltk  # needed for nltk.trigrams and nltk.word_tokenize below


stop_words = stopwords.words('english')
allowed_chars = string.ascii_letters + string.digits + string.whitespace

book_url = "https://www.gutenberg.org/cache/epub/5200/pg5200.txt"
response = requests.get(book_url)
meta = response.text
meta = ''.join(c.lower() for c in meta if c in allowed_chars)

meta_trigrams = list(nltk.trigrams(nltk.word_tokenize(meta)))
meta_df = pd.DataFrame(meta_trigrams, columns=['Word 1', 'Word 2','Word 3'])

meta_df = meta_df[~meta_df["Word 1"].isin(stop_words)]
meta_df = meta_df[~meta_df["Word 2"].isin(stop_words)]
meta_df = meta_df[~meta_df["Word 3"].isin(stop_words)]

meta_df["Trigram"] = meta_df["Word 1"] + " " + meta_df["Word 2"] + " " + meta_df["Word 3"]
meta_df["Book"] = "Metamorphosis"
meta_df = meta_df.reset_index(drop=True)

meta_df

Unnamed: 0,Word 1,Word 2,Word 3,Trigram,Book
0,project,gutenberg,ebook,project gutenberg ebook,Metamorphosis
1,project,gutenberg,license,project gutenberg license,Metamorphosis
2,gutenberg,license,included,gutenberg license included,Metamorphosis
3,copyrighted,project,gutenberg,copyrighted project gutenberg,Metamorphosis
4,project,gutenberg,ebook,project gutenberg ebook,Metamorphosis
...,...,...,...,...,...
1731,website,includes,information,website includes information,Metamorphosis
1732,project,gutenberg,including,project gutenberg including,Metamorphosis
1733,project,gutenberg,literary,project gutenberg literary,Metamorphosis
1734,gutenberg,literary,archive,gutenberg literary archive,Metamorphosis


In [104]:
# 5.2

frequent_words = meta_df.groupby(['Word 3', 'Book']).size().reset_index(name='count')
top_words_by_book_df = frequent_words.sort_values(['Book', 'count'], ascending=[True, False])

top_words_by_corpus_df = meta_df.groupby(['Book']).size().sort_values(ascending=False).reset_index(name='count')
    
book_words = top_words_by_book_df.merge(top_words_by_corpus_df, on='Book')
book_words = book_words.rename(columns={'count_x': 'word_appearances_in_book', 'count_y': 'book_total_words_count'})

# Only tf is used here: with a single book, idf = log(1/1) = 0, so tf-idf would be 0 for every word
book_words["tf"] = book_words["word_appearances_in_book"] / book_words["book_total_words_count"]

book_words.sort_values(by=['Book', 'tf'], ascending=[True, False]).head(10)

Unnamed: 0,Word 3,Book,word_appearances_in_book,book_total_words_count,tf
0,gregor,Metamorphosis,37,1736,0.021313
1,gutenberg,Metamorphosis,28,1736,0.016129
2,would,Metamorphosis,23,1736,0.013249
3,gregors,Metamorphosis,21,1736,0.012097
4,back,Metamorphosis,19,1736,0.010945
5,samsa,Metamorphosis,19,1736,0.010945
6,works,Metamorphosis,19,1736,0.010945
7,electronic,Metamorphosis,18,1736,0.010369
8,father,Metamorphosis,17,1736,0.009793
9,even,Metamorphosis,15,1736,0.008641


In [125]:
# 5.3

frequent_trigrams = meta_df.groupby(['Trigram', 'Book']).size().reset_index(name='count')
top_trigrams_by_book_df = frequent_trigrams.sort_values(['Book', 'count'], ascending=[True, False])

top_trigrams_by_corpus_df = meta_df.groupby(['Book']).size().sort_values(ascending=False).reset_index(name='count')
    
book_trigrams = top_trigrams_by_book_df.merge(top_trigrams_by_corpus_df, on='Book')
book_trigrams = book_trigrams.rename(columns={'count_x': 'trigram_appearances_in_book', 'count_y': 'book_total_trigram_count'})

book_trigrams["tf"] = book_trigrams["trigram_appearances_in_book"] / book_trigrams["book_total_trigram_count"]

# book_trigrams["doc_count"] = book_trigrams.groupby("Trigram")["Book"].transform("nunique")
# book_trigrams["idf"] = np.log( 1 / book_trigrams["doc_count"] ) 
# book_trigrams["tf-idf"] = book_trigrams["tf"] * book_trigrams["idf"]

book_trigrams[:10]

Unnamed: 0,Trigram,Book,trigram_appearances_in_book,book_total_trigram_count,tf
0,project gutenberg electronic,Metamorphosis,18,1736,0.010369
1,gutenberg literary archive,Metamorphosis,13,1736,0.007488
2,literary archive foundation,Metamorphosis,13,1736,0.007488
3,project gutenberg literary,Metamorphosis,13,1736,0.007488
4,gutenberg electronic works,Metamorphosis,12,1736,0.006912
5,project gutenberg license,Metamorphosis,11,1736,0.006336
6,full project gutenberg,Metamorphosis,7,1736,0.004032
7,gutenberg electronic work,Metamorphosis,6,1736,0.003456
8,project gutenberg trademark,Metamorphosis,5,1736,0.00288
9,project gutenberg work,Metamorphosis,5,1736,0.00288
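
Every trigram in this top 10 comes from the Project Gutenberg license text rather than from the story itself. One possible way to avoid this is to keep only the text between the start and end markers before tokenizing. The sketch below assumes the file contains marker lines of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***`; the helper name `strip_gutenberg_boilerplate` and the sample text are hypothetical, and the text is returned unchanged if the markers are not found.

```python
import re

# Assumed marker format; some files may use different wording
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if both are present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if not start or not end or end.start() <= start.end():
        return text  # markers missing or in an unexpected order: leave text as-is
    return text[start.end():end.start()].strip()


# Made-up sample standing in for a downloaded book
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "One morning the story begins.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Project Gutenberg License text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

In the 5.1 cell this would have to run on `response.text` before the `allowed_chars` filter, because that filter removes the asterisks the markers rely on.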


In [120]:
word1 = " mr "
book_trigrams[book_trigrams["Trigram"].str.contains(word1)].sort_values(by=['Book', 'tf'], ascending=[True, False]).head(5)

Unnamed: 0,Trigram,Book,trigram_appearances_in_book,book_total_trigram_count,tf
18,said mr samsa,Metamorphosis,3,1736,0.001728
108,answered mr samsa,Metamorphosis,1,1736,0.000576
133,asked mr samsa,Metamorphosis,1,1736,0.000576
165,bed mr samsa,Metamorphosis,1,1736,0.000576
254,collar mr samsa,Metamorphosis,1,1736,0.000576


In [124]:
word2 = " gregor "
book_trigrams[book_trigrams["Trigram"].str.contains(word2)].sort_values(by=['Book', 'tf'], ascending=[True, False]).head(5)

Unnamed: 0,Trigram,Book,trigram_appearances_in_book,book_total_trigram_count,tf
77,alarm gregor twice,Metamorphosis,1,1736,0.000576
101,although gregor wasnt,Metamorphosis,1,1736,0.000576
113,anything gregor answered,Metamorphosis,1,1736,0.000576
141,back gregor wanted,Metamorphosis,1,1736,0.000576
163,bed gregor said,Metamorphosis,1,1736,0.000576
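
Because of the surrounding spaces, the pattern `" gregor "` only matches trigrams where the target is the middle word. An alternative is to compare against the `Word 1`, `Word 2` or `Word 3` column directly, which makes the fixed position explicit and ranks by raw count. This is a minimal sketch with made-up rows; the helper name `top_trigrams_with` is hypothetical, and it assumes one row per trigram occurrence with the same column names as `meta_df`.

```python
import pandas as pd


def top_trigrams_with(df, word, position, n=5):
    """Most frequent trigrams that have `word` fixed at `position` (1, 2 or 3)."""
    col = f"Word {position}"
    matches = df[df[col] == word]
    counts = matches.groupby("Trigram").size().reset_index(name="count")
    # Highest count first; ties are broken alphabetically for a stable order
    return counts.sort_values(["count", "Trigram"], ascending=[False, True]).head(n)


# Made-up rows for illustration only
rows = [
    ("said", "mr", "samsa"),
    ("said", "mr", "samsa"),
    ("asked", "mr", "samsa"),
    ("gregor", "wanted", "help"),
]
toy_df = pd.DataFrame(rows, columns=["Word 1", "Word 2", "Word 3"])
toy_df["Trigram"] = toy_df["Word 1"] + " " + toy_df["Word 2"] + " " + toy_df["Word 3"]

top_mr = top_trigrams_with(toy_df, "mr", position=2)
top_gregor = top_trigrams_with(toy_df, "gregor", position=1)
print(top_mr)
print(top_gregor)
```

With the real data, the call would take `meta_df` in place of `toy_df`, for example with "gregor" in position 1 to see what most often follows the name.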
