<center>
<h1>Final Project: Positive Pointwise Mutual Information</h1>
<h2>Corpus Linguistics with Python - Summer Semester 2022</h2>
<h3>Wei-Ling Liao - Matriculation Nr. 1194717.</h3>
</center>

# Step 1: Corpus Pre-Processing
The following code cell pre-processes 'text_fic.txt'. First, I import NLTK and the regular expression module and assign the paths of the input and output files. I define tag_set because the NLTK WordNet lemmatizer handles **content words only**: nouns, verbs, adjectives, and adverbs. By default, the lemmatizer treats every word as a noun, so tag_set maps the first letter of each Penn Treebank tag to the matching WordNet part of speech, which lets the lemmatizer handle the other parts of speech as well.

Description of code: I inspected my_coca.preprocessed and found that the paragraph elements (`<p>`) had been removed, so I removed them as well. Next, I use sent_tokenize to identify sentence boundaries and, for each sentence, word_tokenize to split it into tokens. Every token is assigned a POS tag, and the WordNetLemmatizer is instantiated. The code then iterates over the tokens, lowercases each one, and looks up its POS tag. Tokens are lemmatized according to their POS and appended to a list, which is joined into a single space-separated string with the join function. Finally, the lemmatized text is split into sentences again with sent_tokenize, and each sentence is printed to the coca.preprocessed file.
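
The tag lookup described above can be illustrated in isolation. The following is a minimal sketch that does not need NLTK; `wordnet_pos` is a hypothetical helper name that does not appear in the project code, and the mapping is copied from tag_set.

```python
# Mapping from the first letter of a Penn Treebank tag to a WordNet POS,
# copied from tag_set in the project code.
TAG_SET = {"J": "a", "N": "n", "V": "v", "R": "r", "M": "v"}

def wordnet_pos(penn_tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if the
    token should be kept as it is (hypothetical helper)."""
    if not penn_tag or penn_tag == "NFP":  # NFP starts with "N" but marks punctuation
        return None
    return TAG_SET.get(penn_tag[0])

for tag in ["NNS", "VBD", "JJR", "RB", "MD", "DT", "NFP"]:
    print(tag, wordnet_pos(tag))
```

Tags such as DT have no entry in the mapping, so the corresponding tokens are only lowercased and not lemmatized.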

In [178]:
import nltk
import re
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

fn = "/compLing/students/courses/corpusLinguistics/finalProject/text_fic.txt"
fnout = "/users/wei-ling.liao/CLPython/FinalProject/coca.preprocessed"
tag_set = {"J" : "a", "N" : "n", "V" : "v", "R" : "r", "M": "v"} #Penn Treebank Tagsets
lemmaList = []

# compile the pattern and create the lemmatizer once, outside the loops
pattern = re.compile(r"<p>")
lemmatizer = WordNetLemmatizer()

with open(fn, 'r') as f, open(fnout, 'w') as fout:
    for line in f:
        # remove the paragraph markers
        matching = pattern.sub("", line)

        for sentence in sent_tokenize(matching):
            tokens = word_tokenize(sentence)

            # pos_tag returns (token, tag) pairs
            for token, pos in pos_tag(tokens):
                token = token.lower()
                firstPOSletter = pos[0]

                if (firstPOSletter in tag_set) and (pos != "NFP"): #NFP: pos of punctuations
                    lemma = lemmatizer.lemmatize(token, pos = tag_set[firstPOSletter])
                else:
                    lemma = token

                lemmaList.append(lemma)

    # join all lemmas and split the text into sentences again for the output
    lemmatized_text = " ".join(lemmaList)
    for sentence in sent_tokenize(lemmatized_text):
        print(sentence, file = fout)
        

# Step 2: Counting Unigrams and Bigrams

I use the provided my_coca.preprocessed as the input file for the following steps, because the file produced by my own pre-processing is not exactly the same as the one we were given for this project and could lead to deviating output.

Description of code: To compute the frequencies of unigrams and bigrams, two dictionaries are created. For each line, I tokenize the words with the word_tokenize function and assign them to the variable 'tokens'. Next, I extract the bigrams and store them in a list. The two dictionaries are then filled separately using the same iteration procedure. As requested, every frequency <= 5 is set to 0 in both unigramsDict and bigramsDict. Finally, the remaining values of unigramsDict are summed to obtain the total number of tokens.

In [179]:
from nltk import bigrams

fn = "/compLing/students/courses/corpusLinguistics/finalProject/outputFiles/my_coca.preprocessed"
fnout = "/users/wei-ling.liao/CLPython/FinalProject/coca.counts"

unigramsDict = {}
bigramsDict = {}
lemma = 0

with open(fn, "r") as f, open(fnout, "w") as fout:
    for line in f:
        tokens = word_tokenize(line)
        bigramsList = list(bigrams(tokens))
   
        # fill the unigramsDict (a missing key starts at 0)
        for word in tokens:
            unigramsDict[word] = unigramsDict.get(word, 0) + 1

        # fill the bigramsDict (a missing key starts at 0)
        for bigram in bigramsList:
            bigramsDict[bigram] = bigramsDict.get(bigram, 0) + 1
            
    # change freq if less or equal to 5 - unigramsDict
    for k,v in unigramsDict.items():
        if v <= 5:
            unigramsDict[k] = 0.0
        print(k, v, file = fout, sep = "\t")
            
    # change freq if less or equal to 5 - bigramsDict
    for k,v in bigramsDict.items():
        if v <= 5:
            bigramsDict[k] = 0.0
        print(k, v, file = fout, sep = "\t")

    # count the lemma
    for v in unigramsDict.values():
        lemma += v
    print("Total number of lemmas:", lemma)
        

Total number of lemmas: 1372584.0
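
The counting step can also be written more compactly with `collections.Counter`. This is only a sketch of an alternative, not the code used for the results in this project: `count_ngrams` and `min_freq` are hypothetical names, and the tokens are assumed to be separated by spaces, so `str.split` replaces word_tokenize.

```python
from collections import Counter

def count_ngrams(lines, min_freq=6):
    """Count unigrams and bigrams over space-separated lines.
    Counts below min_freq are set to 0, mirroring the threshold above."""
    unigrams, bigram_counts = Counter(), Counter()
    for line in lines:
        tokens = line.split()  # assumption: tokens are space-separated
        unigrams.update(tokens)
        # adjacent token pairs within one line
        bigram_counts.update(zip(tokens, tokens[1:]))
    for counts in (unigrams, bigram_counts):
        for key, freq in counts.items():
            if freq < min_freq:
                counts[key] = 0
    return unigrams, bigram_counts

# toy input with a lower threshold so that something survives
toy = ["the cat sit .", "the cat sleep ."]
uni, bi = count_ngrams(toy, min_freq=2)
print(uni["the"], uni["sit"], bi[("the", "cat")])
```

With `min_freq=6`, the threshold corresponds to the "frequency <= 5 is set to 0" rule used above.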


# Step 3: PPMI

Description of code: This function takes five arguments: word1, word2, unigramsDict, bigramsDict and totalLemma. First, it checks whether the two words and the bigram exist in the dictionaries; if one of them is missing, its frequency is set to 0. Next, to avoid a math error (division by zero or the logarithm of 0), PMI is set to 0 if the frequency of word1, word2 or the bigram is 0. Otherwise, PMI is calculated according to the formula and rounded to three decimal places. Finally, negative PMI values are replaced by 0.
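
For reference, the score computed by the function can be written as follows. The symbols are introduced here only for illustration: $c(\cdot)$ stands for a frequency taken from unigramsDict or bigramsDict, and $N$ for totalLemma.

$$
P(w) = \frac{c(w)}{N}, \qquad P(w_1, w_2) = \frac{c(w_1, w_2)}{N}
$$

$$
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}, \qquad
\mathrm{PPMI}(w_1, w_2) = \max\bigl(0,\ \mathrm{PMI}(w_1, w_2)\bigr)
$$

If $c(w_1)$, $c(w_2)$ or $c(w_1, w_2)$ is 0, the logarithm is not defined and the score is set to 0.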

In [180]:
import math

def PPMI(word1, word2, unigramsDict, bigramsDict, totalLemma):
    
    # look up the frequencies; a missing word/bigram counts as frequency 0
    bigram = (word1, word2)
    freq1 = unigramsDict.get(word1, 0)
    freq2 = unigramsDict.get(word2, 0)
    bigramFreq = bigramsDict.get(bigram, 0)
        
    # probability - if word1 or word2 or bigrams frequency is 0 --> PPMI = 0
    if freq1 == 0 or freq2 == 0:
        PMI = 0
    elif bigramFreq == 0:
        PMI = 0
    else:
        word1Probability = freq1 / totalLemma
        word2Probability = freq2 / totalLemma
        bigramsProbability = bigramFreq / totalLemma
    
        #calculate PMI
        PMI = math.log2(bigramsProbability / (word1Probability * word2Probability))
        PMI = round(PMI, 3)
    
        # exclude negative PMI
        if PMI <0:
            PMI = 0
    
    # return the PPMI score of the bigram
    return PMI



# Step 4: Computing PPMI Scores
Description of code: To compute the PPMI of each bigram, the PPMI function is called for every bigram stored in bigramsDict. In addition, ppmiDict is created, with the bigram as the key and its PPMI score as the value. Each bigram and its score are written to a tab-separated file.

In [181]:
fnout = "/users/wei-ling.liao/CLPython/FinalProject/coca.ppmi"
ppmiDict = dict()
totalLemma = lemma  # total token count computed in Step 2

# open the output file once instead of re-opening it for every bigram
with open(fnout, "w") as fout:
    for bigram in bigramsDict:
        word1, word2 = bigram
        ppmiScore = PPMI(word1, word2, unigramsDict, bigramsDict, totalLemma)
        ppmiDict[bigram] = ppmiScore  #complete the dictionary
        print(word1, word2, ppmiScore, sep = "\t", file = fout)

Description of code: The function topN takes two arguments: a dictionary and n, the number of bigrams to display. Here, ppmiDict from above is passed in because it stores the bigrams and their PPMI scores. The dictionary items are sorted with the **sorted** function; its key argument selects the value (the score) as the sort criterion, and reverse is set to True so that the values are in descending order. Lastly, the top 20 entries are printed.

In [182]:
def topN(ppmiScores, n = 20):
    ppmiScores = sorted(ppmiScores.items(), key = lambda x: x[1], reverse = True)
    
    for k, v in ppmiScores[:n]:
        print(k[0], k[1], v, sep = "\t")
        
topN(ppmiDict)

guiseppi	scapellini	17.845
sint	holo	17.845
vito	adamo	17.43
palo	alto	17.26
tel	aviv	17.26
edith	schermerhorn	17.108
oswald	truxa	17.015
chiang	mai	16.971
macy	levitt	16.971
cecily	scriber	16.916
amos	holt	16.845
irene	lashman	16.845
pheasant	theodora	16.845
del	norte	16.73
anarchist	no.l	16.623
nil	spaar	16.43
slo	mo	16.43
beverly	hills	16.371
moyshe	rabeynu	16.343
artful	dodger	16.182
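
Sorting the whole dictionary is not strictly necessary when only the first n entries are needed. As a sketch of an alternative, `heapq.nlargest` from the standard library returns the n largest items directly; `top_n_heap` is a hypothetical name and the scores below are made-up toy values, not results from the corpus.

```python
import heapq

def top_n_heap(scores, n=20):
    """Return the n (key, value) pairs with the largest values, in descending order."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])

# made-up toy scores for illustration
toy_scores = {("a", "b"): 3.0, ("c", "d"): 1.5, ("e", "f"): 2.25}
for (w1, w2), score in top_n_heap(toy_scores, n=2):
    print(w1, w2, score, sep="\t")
```

For a dictionary of this project's size, both approaches are fast enough, so the choice is mainly a matter of taste.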


# Step 5: Comparing PPMI and Frequency Counts
topN is called again to show the top 20 bigrams sorted by frequency count.

Difference between PPMI and frequency counts: The bigram frequency counts give an overview of how often two words appear next to each other. As the result below shows, the 20 most frequent bigrams consist mostly of function words and punctuation marks, which gives no clue about the content. PPMI, on the other hand, relates the probability of two words occurring together to the probabilities of each word on its own. This lowers the score of bigrams made up of very frequent tokens (e.g., punctuation marks) and gives a better picture of the content.

In [183]:
topN(bigramsDict)

@	@	62278
,	and	7211
,	``	6486
.	``	5934
it	be	4864
of	the	4618
do	not	4450
in	the	4406
``	``	3822
be	a	3416
``	i	3413
i	be	3361
,	but	3144
on	the	2786
he	be	2756
to	the	2746
be	not	2690
,	the	2591
,	i	2412
say	.	2219
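
The contrast between the two rankings can be reproduced with a small calculation. The sketch below uses invented counts (none of the numbers come from the corpus) and the hypothetical helper `pmi`, which applies the same count / total probabilities as the PPMI function above.

```python
import math

def pmi(pair_count, count1, count2, total):
    """PMI in bits from raw counts, using count / total as the probability."""
    return math.log2((pair_count / total) / ((count1 / total) * (count2 / total)))

total = 1_000_000  # hypothetical corpus size

# hypothetical counts: a very frequent function-word pair ...
frequent = pmi(pair_count=5000, count1=30000, count2=60000, total=total)
# ... and a rare pair whose words almost always occur together
rare = pmi(pair_count=10, count1=10, count2=12, total=total)

print(round(frequent, 3), round(rare, 3))
```

Although the first pair occurs 500 times more often, the second pair receives the much higher score, because its two words are rarely seen apart.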
