# Lesson 1: Text Analysis - Word Laws

In many languages, words are the basic unit of text. The way words are used in a language tends to follow certain laws, and these laws hold across languages and across the subject matter of the document.

In this lab we will learn and explore these word laws for English and Armenian texts.


## Preprocessing

Words appear in different forms in text: they may start with capital letters, have punctuation attached, and take different forms depending on grammar.

To get accurate and consistent results we need to preprocess text.
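
As a rough illustration of case folding, tokenisation and stopping, here is a minimal sketch on in-memory strings. The stopword set `toy_stopwords` and the helper names are made up for this example; the exercises below read the real stopword lists from `datasets/` and also add stemming, which is left out here.

```python
import re
import string

# Hypothetical stopword set, only for this illustration
toy_stopwords = {"look", "its", "the", "մեջ"}

def fold_and_tokenise_ascii(line: str) -> list[str]:
    """Lower-cases the line, strips ASCII punctuation and splits on whitespace."""
    lowered = line.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

def fold_and_tokenise_unicode(line: str) -> list[str]:
    """Lower-cases the line and extracts runs of word characters."""
    # In Python 3 the \w class matches Unicode letters, so Armenian
    # punctuation marks are dropped while Armenian letters are kept.
    return re.findall(r"\w+", line.lower())

def remove_stopwords(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Keeps only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]

en_tokens = remove_stopwords(
    fold_and_tokenise_ascii("Look! It's the quick, brown, fox."), toy_stopwords
)
hy_tokens = remove_stopwords(
    fold_and_tokenise_unicode("Եթե մորուքների՜ մեջ «իմաստություն» լիներ,"), toy_stopwords
)
print(en_tokens)
print(hy_tokens)
```

Note that `string.punctuation` only covers ASCII symbols, which is why the sketch switches to a regular expression for the Armenian line.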

In [None]:
import re
import Stemmer
import string 
import unittest
%reload_ext ipython_unittest

## Exercise 1

Complete the following functions to preprocess text in Armenian and English.

In [None]:
# Preprocess English

def preprocess_line_en(line:str) -> list[str]:
    """Preprocesses a line of English text and returns a list of tokens"""
    
    # 1. Case folding
    # Convert line to lower case
    
    # 2. Tokenisation
    # Remove all punctuation and split the line into a list of words
    # Hint: string.punctuation contains all ASCII punctuation symbols
    
    # 3. Stopping
    # The file datasets/stopwords_en.txt contains a list of stopwords
    # one on each line. Remove any stopwords from the list of tokens from step 2
    
    # 4. Normalisation/Stemming
    # Use pystemmer to stem each word in the list from step 3. 
    # Stemmer has already been imported, see the documentation here for how to use it:
    # https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart.txt
    
    # 5. Return the list of tokens
    
    raise Exception("I need to be implemented!")
    

**Run the following cell to test your implementation**

If everything is working you should see "Success" output.

In [None]:
%%unittest
"case folding works"
result = preprocess_line_en("THe QUICK brown FoX JuMpS")
assert result == ["quick", "brown", "fox", "jump"]

"tokenisation works"
result = preprocess_line_en("Look! It's the quick, brown, fox.")
assert result == ["quick", "brown", "fox"]

"stopping works"
result = preprocess_line_en("look the fox it is jumping")
assert result == ["fox", "jump"]

"normalisation works"
result = preprocess_line_en("quickly the audacious fox jumped")
assert result == ["quick", "audaci", "fox", "jump"]

"everything together"
result = preprocess_line_en("The quick, BrOwN, fox jumps over the lazy dog!!")
assert result == ["quick", "brown", "fox", "jump", "lazi", "dog"]

In [None]:
# Preprocess Armenian

def preprocess_line_hy(line:str) -> list[str]:
    """Preprocesses a line of Armenian text and returns a list of tokens"""
    
    # 1. Case folding
    # Convert line to lower case
    
    # 2. Tokenisation
    # Remove all punctuation and split the line into a list of words
    # Hint: use a regular expression to extract all word characters
    
    # 3. Stopping
    # The file datasets/stopwords_hy.txt contains a list of stopwords
    # one on each line. Remove any stopwords from the list of tokens from step 2
    
    # 4. Normalisation/Stemming
    # Use pystemmer to stem each word in the list from step 3. 
    # Stemmer has already been imported, see the documentation here for how to use it:
    # https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart.txt
    
    # 5. Return the list of tokens
    
    raise Exception("I need to be implemented!")

**Run the following cell to test your implementation**

If everything is working you should see "Success" output.

In [None]:
%%unittest
"case folding works"
result = preprocess_line_hy("Եթե ՄՈՐՈՒՔՆԵՐԻ Մեջ իմաստություն լիներ, բոլոր այծերը մարգարեներ կլինեին")
assert result == ['եթե', 'մորու', 'իմաստությ', 'լիներ', 'բոլոր', 'այծերը', 'մարգարե', 'կլինե']

"tokenisation works"
result = preprocess_line_hy("Եթե մորուքների՜ մեջ «իմաստություն» լիներ, բոլոր այծերը մարգարեներ կլինեին։")
assert result == ['եթե', 'մորու', 'իմաստությ', 'լիներ', 'բոլոր', 'այծերը', 'մարգարե', 'կլինե']

"stopping works"
result = preprocess_line_hy("Եթե մորուքների մեջ")
assert result == ['եթե', 'մորու']

"normalisation works"
result = preprocess_line_hy("Եթե մորուքների")
assert result == ['եթե', 'մորու']

"everything together"
result = preprocess_line_hy("Եթե մորուքների՜ մեջ «իմաստություն» լիներ, ԲՈԼՈՐ այծերը մարգարեներ կլինեին։")
assert result == ['եթե', 'մորու', 'իմաստությ', 'լիներ', 'բոլոր', 'այծերը', 'մարգարե', 'կլինե']

## Zipf's Law - Frequency of Words

Some words are used very frequently, for example in English 'a', 'the', 'of', 'it'. Others are used much less frequently such as 'sedulously' and 'verisimilitude'.

Given a reasonably long document, roughly 50% of the distinct words in it will appear only once. In general, the frequency of a word falls off steeply with its rank, following a power-law curve:

<div>
<img src="attachment:Screenshot%202023-01-31%20at%2021.24.58.png" width="200px"/>
</div>

From the equation `rank * freq = constant` if we plot the log of the rank against the log of the frequency we should get a graph that's a straight line.

<div>
    <img src="attachment:Screenshot%202023-02-06%20at%2022.02.16.png" width="300px"/>
</div>
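
To see why the log-log plot should be a straight line, here is a small sketch using synthetic frequencies that satisfy `rank * freq = constant` exactly. The value of `constant` and the helper `least_squares_line` are assumptions made for this illustration; real text will only follow the line approximately.

```python
import math

# Synthetic data: frequencies that obey rank * freq = constant exactly
constant = 1000.0
ranks = list(range(1, 51))
freqs = [constant / rank for rank in ranks]

log_ranks = [math.log(rank) for rank in ranks]
log_freqs = [math.log(freq) for freq in freqs]

def least_squares_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Returns (slope, intercept) of the least-squares line through the points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(log_ranks, log_freqs)
# log(freq) = log(constant) - log(rank), so the slope should be close to -1
# and the intercept close to log(constant)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Taking logs of `rank * freq = constant` gives `log(freq) = log(constant) - log(rank)`, which is the equation of a straight line with slope -1.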


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from collections.abc import Callable

## Exercise 2

Complete the following function to calculate the frequency of words in a given text document.
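
The counting step can be sketched on a toy list of tokens with `collections.Counter`. The token list below is invented for illustration; in the exercise the tokens come from running `preprocess` over the dataset file.

```python
from collections import Counter

# Toy tokens standing in for the output of a preprocess function
toy_tokens = ["fox", "dog", "fox", "jump", "fox", "dog"]

# Count how often each token occurs
token_counts = Counter(toy_tokens)

# Keep only the counts, sorted from highest to lowest; position i then
# holds the frequency of the word with rank i + 1
sorted_freqs = sorted(token_counts.values(), reverse=True)
print(sorted_freqs)  # [3, 2, 1]
```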

In [None]:
def calculate_word_frequencies(preprocess: Callable[[str], list[str]], dataset_path: str) -> list[int]:
    """Reads the file dataset_path, preprocesses the words, 
    and calculates the frequencies of all the words"""
    
    # 1. Read the file "dataset_path" line by line or all at once.
    #    Preprocess the text using the preprocess function. This will return a list of words.
    #    Calculate the frequencies of words.

    # 2. Sort the frequencies from highest to lowest
    
    # 3. Return the sorted frequencies
    

**Run the following to plot a graph of the rank against frequency for the English Bible**

Is the graph as you would expect?

In [None]:
def plot_zipf_graph(freqs, title):
    # Plot rank (starting at 1) against frequency
    ranks = np.arange(1, len(freqs) + 1)
    plt.plot(ranks, freqs)
    plt.xlabel("rank")
    plt.ylabel("freq")
    plt.title(f"Graph of Zipf's Law {title}")
    plt.show()
    
en_bible_freqs = calculate_word_frequencies(preprocess_line_en, 'datasets/bible.en.txt')
plot_zipf_graph(en_bible_freqs, "English Bible")

**Run the following to plot a graph of the rank against frequency for the Armenian Bible**

Is the graph as you would expect?

In [None]:
hy_bible_freqs = calculate_word_frequencies(preprocess_line_hy, 'datasets/bible.hy.txt')
plot_zipf_graph(hy_bible_freqs, "Armenian Bible")

### Extension Exercises

1. Write another function `plot_zipf_log_graph` that plots the log of the rank against the log of the frequencies (Hint: use np.log). Does the graph look as you'd expect?
2. Run the examples against the following two datasets containing Wikipedia extracts: `abstracts.en.txt` and `abstracts.hy.txt`. Do the graphs look as you'd expect?

## Benford's Law - Frequency of the first digit in numbers

Benford's law states that the distribution of the first digit of numbers in a large dataset follows a Zipf-like law, with 1 appearing much more frequently than the other digits.
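
For reference, the usual statement of Benford's law gives the probability of first digit `d` as `log10(1 + 1/d)`; this formula is not given in the lab text above and is added here as background. The sketch below computes those expected probabilities and shows one possible way to extract a first digit. The helper `first_digit` and the sample numbers are made up for illustration.

```python
import math
from collections import Counter

# Expected first-digit probabilities under Benford's law
benford_expected = {digit: math.log10(1 + 1 / digit) for digit in range(1, 10)}

def first_digit(number: int) -> int:
    """Returns the leading decimal digit of a non-zero integer."""
    return int(str(abs(number))[0])

# Made-up sample numbers, only to show the counting step
sample_numbers = [120, 15, 1999, 23, 301, 9]
sample_counts = Counter(first_digit(number) for number in sample_numbers)

for digit, probability in benford_expected.items():
    print(f"{digit}: expected {probability:.3f}, seen {sample_counts.get(digit, 0)}")
```

Under this formula the digit 1 is expected about 30% of the time and the digit 9 under 5% of the time.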

## Exercise 3

Complete the following function that, given a list of numbers, calculates the frequencies of the first digit.

In [None]:
def calculate_first_digit_frequencies(numbers: list[int]) -> dict[int, int]:
    """Calculates the frequencies of the first digit in the list of numbers.
    Returns a dictionary of {digit: freq} pairs."""
    
    # 1. Loop through the list of "numbers" and calculate the frequencies of the first 
    # digit. 
    
    # 2. Return a dictionary of {digit: frequency} pairs for each digit from 1 to 9.
    
    # Example return value: {1: 5000, 2: 3000, ...}
    

**Run the cell below to plot Benford's law for the English and Armenian Bibles.**

Are the graphs as predicted?

In [None]:
# Plot a graph of Benford's law

def plot_benfords_law(digit_freqs, title):
    plt.bar(digit_freqs.keys(), digit_freqs.values())
    plt.xticks(range(1, 10))
    plt.title(f"Benford's Law {title}")
    plt.xlabel("First digit")
    plt.ylabel("Frequency")
    plt.show()

en_bible_first_digit_freqs = calculate_first_digit_frequencies(en_bible_freqs)
plot_benfords_law(en_bible_first_digit_freqs, 'English Bible Word Frequencies')

hy_bible_first_digit_freqs = calculate_first_digit_frequencies(hy_bible_freqs)
plot_benfords_law(hy_bible_first_digit_freqs, 'Armenian Bible Word Frequencies')

### Extension Exercises

1. Plot Benford's law for the two datasets containing Wikipedia extracts, `abstracts.en.txt` and `abstracts.hy.txt`. Do the graphs look as you'd expect?

## Heaps' Law

Heaps' law describes how the size of a document's vocabulary grows as more of the document is read. As the document gets longer, the rate at which new words appear decreases.
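
The idea of tracking vocabulary growth can be sketched on a toy token list. The tokens, the helper `toy_vocab_growth` and the `step` parameter are assumptions for this illustration; the exercise below records a pair every 100 words of a real dataset.

```python
def toy_vocab_growth(tokens: list[str], step: int) -> list[list[int]]:
    """Records [n, v] after every `step` tokens, where n is the number of
    tokens processed and v is the number of distinct tokens seen so far."""
    seen = set()
    growth = [[0, 0]]
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            growth.append([n, len(seen)])
    return growth

# Made-up tokens: repeats become more common as the list goes on
toy_tokens = ["a", "b", "a", "c", "b", "d", "a", "e"]
growth = toy_vocab_growth(toy_tokens, step=2)
print(growth)  # [[0, 0], [2, 2], [4, 3], [6, 4], [8, 5]]
```

Here the vocabulary grows by two words over the first two tokens, but only by one word for each later pair of tokens.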

## Exercise 4

Complete the following function to calculate the growth of the vocabulary. 

In [None]:
def calculate_vocab_growth(preprocess: Callable[[str], list[str]], dataset_path:str) -> list[list[int]]:
    """Calculates the growth of the vocabulary of the given dataset"""
    
    # 1. Read the dataset at dataset_path line by line or all at once and preprocess it.
    
    # 2. Process each word and keep track of two numbers:
    # n := the number of words processed
    # v := the number of unique words seen 
    # After every 100 words processed, record the values of [n, v] in a list.
    
    # 3. After processing the whole document, return the list of [n, v] pairs.
    
    # Example output: [[0, 0], [100, 10], [200, 50], ...]

**Run the cell below to plot Heaps' law for the English and Armenian Bibles**

The code also plots a best-fit curve of the form `v = k * n^b` and computes the values of k and b. Expect the exponent b to be less than 1, typically between 0.4 and 0.7.

Do the graphs appear as you'd expect?

In [None]:
def plot_heaps_law_and_fit_curve(vocab_growth, title):
    
    vocab_growth = np.array(vocab_growth)
    plt.plot(vocab_growth[:, 0], vocab_growth[:, 1])

    # Fit a curve of the form v = kn^b for some constants k and b
    # The first column of the data is n and the second column of the data is v
    # By taking logs of this equation we reduce it to fitting a linear curve

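    # Skip rows containing a zero (such as an initial [0, 0] entry), since log(0) is undefined
    positive_rows = vocab_growth[(vocab_growth > 0).all(axis=1)]
    vocab_growth_lg = np.log(positive_rows)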

    X = vocab_growth_lg[:, 0][:,None]
    y = vocab_growth_lg[:, 1]

    def phi_lin(Xin):
        return np.hstack([np.ones((Xin.shape[0], 1)), Xin])

    w_fit = np.linalg.lstsq(phi_lin(X), y, rcond=None)[0]
    # first weight is log(k)
    k = np.exp(w_fit[0])
    # the second weight is b
    b = w_fit[1]

    # Plot v = kn^b for each n
    n_grid = vocab_growth[:, 0]
    v_fit = k * (n_grid)**b

    plt.plot(n_grid, v_fit)

    plt.legend(["Raw data", f"Fitted curve {k:.2f}n^{b:.2f}"])
    plt.xlabel("Number of words read (n)")
    plt.ylabel("Number of unique words seen (v)")
    plt.title(f"Heaps' law {title}")
    plt.show()
    print(f"The best fit for {title} has k={k:.2f} and b={b:.2f}")

en_bible_vocab_growth = calculate_vocab_growth(preprocess_line_en, 'datasets/bible.en.txt')
plot_heaps_law_and_fit_curve(en_bible_vocab_growth, "English Bible")

hy_bible_vocab_growth = calculate_vocab_growth(preprocess_line_hy, 'datasets/bible.hy.txt')
plot_heaps_law_and_fit_curve(hy_bible_vocab_growth, "Armenian Bible")

### Extension Exercises

1. Plot graphs for Heaps' law for the Wikipedia datasets `abstracts.en.txt` and `abstracts.hy.txt`. Do the graphs look as you'd expect?

## Clumping and contagion of words

Rare words often appear close together; this is known as clumping. Once a word starts to be used in a document it tends to be used repeatedly; this is known as contagion.

## Exercise 5

Read the given document word by word and calculate the distances between words that appear exactly twice.
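
One possible way to organise this computation is shown below on a toy token list. The tokens and the helper `toy_distances_between_pairs` are invented for illustration; the exercise applies the same idea to the preprocessed dataset.

```python
from collections import defaultdict

def toy_distances_between_pairs(tokens: list[str]) -> list[int]:
    """Returns the gap between the two positions of every token that
    occurs exactly twice in the list."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return [
        indexes[1] - indexes[0]
        for indexes in positions.values()
        if len(indexes) == 2
    ]

# "ark" and "dove" occur twice, "king" once and "flood" three times
toy_tokens = ["ark", "dove", "ark", "king", "flood", "flood", "dove", "flood"]
distances = toy_distances_between_pairs(toy_tokens)
print(distances)  # [2, 5]
```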

In [None]:
def calculate_word_distances_len_2(preprocess: Callable[[str], list[str]], dataset_path: str) -> list[int]:
    """Returns the distances between words in the given document that appear exactly twice"""
    
    # 1. Read the dataset line by line or all at once and preprocess it.
    
    # 2. For each word create a list of indexes where that word appears in the document.
    
    # 3. For each of the words that appear exactly twice, calculate the distance between the two occurrences.
    
    # 4. Return a list of the distances calculated in step 3.
    

**Run the cell below to plot a graph of the density of the distances between words that appear exactly twice in the Armenian and English Bibles**

Are the graphs as expected? Discuss with your classmates.

In [None]:
def plot_clumping_graph(distances, title):
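    # density=True normalises the histogram so that the y-axis matches the "Density" label
    plt.hist(distances, bins=100, density=True)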
    plt.xlabel("Distance between words that appear exactly twice")
    plt.ylabel("Density")
    plt.title(f"Density of distances between words that appear exactly twice: {title}")
    plt.show()
    
en_bible_distances = calculate_word_distances_len_2(preprocess_line_en, 'datasets/bible.en.txt')
plot_clumping_graph(en_bible_distances, "English Bible")

hy_bible_distances = calculate_word_distances_len_2(preprocess_line_hy, 'datasets/bible.hy.txt')
plot_clumping_graph(hy_bible_distances, "Armenian Bible")

### Extension Exercises

1. Plot these graphs for the Wikipedia datasets `abstracts.en.txt` and `abstracts.hy.txt`. Are the results as expected?