<center>
<h1>Final Project: Positive Pointwise Mutual Information</h1>
<h2>Corpus Linguistics with Python - Summer Semester 2020</h2>
<h3>Sara Sultan - 968430.</h3>
</center>

## Consistency of the code:
The code follows the PEP 8 guidelines. Compliance is checked with the pycodestyle module, which is loaded through the pycodestyle_magic extension and run on every cell.

In [4]:
%load_ext pycodestyle_magic

In [5]:
%pycodestyle_on

In [11]:
import math
import nltk
import pandas as pd
from tqdm import tqdm
from nltk import ngrams, word_tokenize, pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/sara/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/sara/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/sara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Step 1: Corpus Pre-Processing

Preprocessing consists mainly of normalizing and lemmatizing the corpus, removing unnecessary characters, and putting every sentence on its own line.

The NLTK WordNet lemmatizer is used for lemmatization. Because it lemmatizes words according to their POS category, the averaged perceptron tagger is used first to tag the words in the corpus.

First, a newline is added after every period, question mark, and exclamation mark, and unnecessary characters are removed.
Second, a function 'get_wordnet_tag' is defined to map the tagger's Treebank tags to the POS categories expected by the WordNet lemmatizer.
Last, the corpus is split into sentences, tagged, and lemmatized in a nested for loop, and the lemmas are saved in the list 'lemmatized'.


Normalization (lowercasing) is done before lemmatization because some fully capitalized words were not being lemmatized properly.
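
As a minimal sketch of the first step, the replacements can be traced on a short invented string. The sample text and the helper name `split_sentences` are assumptions made for illustration; the replacement pairs are the ones used in the preprocessing cell.

```python
def split_sentences(text):
    """Apply the corpus replacements in order, lowercase, split on newlines."""
    replacements = [
        (' . ', ' . \n'),
        (' ? ', ' ? \n'),
        (' ! ', ' ! \n'),
        ('. \n" <p>', '. "\n<p>'),
        ('? \n" <p>', '? "\n<p>'),
        ('! \n" <p>', '! "\n<p>'),
        ('<p> ', ''),
        ('``', '"'),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text.lower().split('\n')


# Invented sample in the same space-separated token format as the corpus
sample = '<p> `` Who is there ? " <p> He left . She stayed ! Fine'
sentences = split_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

The closing quotation mark stays attached to the sentence it ends because the quote-specific replacements run after the generic ones.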

In [10]:
with open('text_fic.txt', 'r') as text_file:
    corpus = text_file.read()

corpus = corpus.replace(' . ', ' . \n')
corpus = corpus.replace(' ? ', ' ? \n')
corpus = corpus.replace(' ! ', ' ! \n')
corpus = corpus.replace('. \n" <p>', '. "\n<p>')
corpus = corpus.replace('? \n" <p>', '? "\n<p>')
corpus = corpus.replace('! \n" <p>', '! "\n<p>')
corpus = corpus.replace('<p> ', '')
corpus = corpus.replace('``', '"')

# lowercase the corpus before lemmatization
corpus = corpus.lower()

In [13]:
# Map a Penn Treebank POS tag to the matching WordNet POS constant
def get_wordnet_tag(treebanktag):
    if treebanktag[0] == 'J':
        return wn.ADJ
    elif treebanktag[0] == 'V':
        return wn.VERB
    elif treebanktag[0] == 'R':
        return wn.ADV
    else:
        return wn.NOUN


# Lemmatization after pos_tagging
lemmatized = []
for sent in tqdm(corpus.split('\n')):
    for token, tag in pos_tag(sent.split()):
        lemma = lemmatizer.lemmatize(token, get_wordnet_tag(tag))
        lemmatized.append(lemma)
    lemmatized.append('\n')

100%|██████████| 89431/89431 [00:48<00:00, 1845.35it/s]


### Save the Preprocessed Data:
The preprocessed data is a list, which is joined into a single string. The white space after every newline is removed, and the resulting text is saved to the text file 'coca.preprocessed.text'.

In [16]:
coca_preprocessed = ' '.join(lemmatized)
coca_preprocessed = coca_preprocessed.replace('\n ', '\n')
with open("coca.preprocessed.text", "w") as fp:
    fp.write(coca_preprocessed)

# Step 2: Counting Unigrams and Bigrams

### 2.1: Extraction of Bigrams and Unigrams
This part involves the following steps:

   1. Unigrams and Bigrams are extracted from the pre-processed data using NLTK's ngrams and bigrams functions respectively.

   2. Frequencies of both Unigrams and Bigrams are calculated with NLTK's frequency distribution class (FreqDist).

   3. Unigrams and Bigrams with their respective frequency counts are stored in separate dataframes.
   
   4. The Unigrams stored in the dataframe in the previous step are 1-tuples, so the word itself is extracted from each tuple.
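
The counting itself can be sketched without NLTK. This is only an illustration on an invented token list, using `collections.Counter` in place of `nltk.FreqDist` and `zip` in place of `nltk.bigrams`; it is not the code used in this notebook.

```python
from collections import Counter

# Invented lemmatized tokens (not taken from the corpus)
tokens = ['the', 'cat', 'sit', 'on', 'the', 'mat', '.',
          'the', 'cat', 'sleep', '.']

# Unigram counts: one entry per distinct token
unigram_counts = Counter(tokens)

# Bigram counts: pair every token with its successor
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```

A list of n tokens yields n - 1 bigrams, so the two totals differ by one.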

In [35]:
# Remove \n from lemmatized tokens
lemmatized = [lemma for lemma in lemmatized if lemma != '\n']

# Extract Bigrams and their frequencies into a dataframe
bigrams = nltk.bigrams(lemmatized)
freq_dist_bigrams = nltk.FreqDist(bigrams)
df_bigrams = pd.DataFrame(freq_dist_bigrams.items(),
                          columns=['Bigrams', 'Frequency_Count'])

# Extract Unigrams and their frequencies into a dataframe
unigrams = ngrams(lemmatized, 1)
freq_dist_unigrams = nltk.FreqDist(unigrams)
df_unigrams = pd.DataFrame(freq_dist_unigrams.items(),
                           columns=['Unigrams', 'Frequency_Count'])
# ngrams(..., 1) yields 1-tuples; keep only the word itself
df_unigrams['Unigrams'] = df_unigrams['Unigrams'].apply(lambda x: x[0])


### 2.2: Changing Frequency to 0 

Frequencies of Unigrams and Bigrams with a value of 5 or less are set to 0.
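
The effect of the threshold can be sketched on a plain dictionary. The counts and the name `MIN_COUNT` below are hypothetical and serve only as an illustration.

```python
# Hypothetical bigram counts, for illustration only
counts = {
    ('of', 'the'): 4617,
    ('red', 'door'): 6,
    ('odd', 'pair'): 5,
    ('rare', 'word'): 1,
}

MIN_COUNT = 5  # counts at or below this value are set to 0

# Keep every n-gram as a key, but zero out the low counts
thresholded = {
    ngram: (count if count > MIN_COUNT else 0)
    for ngram, count in counts.items()
}
print(thresholded)
```

The low-frequency n-grams remain in the dictionary with a count of 0 rather than being deleted, so they can still be looked up later.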


In [18]:
# frequencies less than or equal to 5 are turned to 0
df_bigrams.loc[df_bigrams['Frequency_Count'] <= 5, 'Frequency_Count'] = 0
df_unigrams.loc[df_unigrams['Frequency_Count'] <= 5, 'Frequency_Count'] = 0

### 2.3: Saving Frequency Counts to dictionaries 
Unigrams and Bigrams with their frequency counts are saved in separate dictionaries, with ngrams as keys and respective frequency counts as values.

In [19]:
# converting to dictionary
bigrams_dict = dict(zip(df_bigrams['Bigrams'],
                        df_bigrams['Frequency_Count']))
unigrams_dict = dict(zip(df_unigrams['Unigrams'],
                         df_unigrams['Frequency_Count']))


### 2.4: coca.counts file
The Unigram and Bigram dataframes are saved, tab-separated and without header and indices, in one csv file named 'coca.counts'. The file is created with the Unigrams in write mode, and the Bigrams are then appended. This way all Unigrams come first in the file and the Bigrams follow.

In [20]:
df_unigrams.to_csv('coca.counts', sep='\t', index=False, header=False,
                   mode='w')
df_bigrams.to_csv('coca.counts', sep='\t', index=False, header=False,
                  mode='a')


### 2.5 Total number of Tokens
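The total number of tokens is calculated by summing the frequencies of all Unigrams. Because the sum is taken after Step 2.2, Unigrams whose frequency was set to 0 do not contribute to it.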

In [21]:
token_length = sum(unigrams_dict.values())
token_length

1347712

# Step 3: PPMI

A function is written to calculate PPMI scores. It takes five arguments and returns the PPMI score as a float rounded to 3 decimal places.

The function must return 0 if the frequency of either Unigram or of the Bigram is zero, or if the words do not exist in the respective dictionaries. These requirements are combined into a single condition: if any of them is met, the function returns 0 as the PPMI score.
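
For reference, the standard definition of PPMI can be written as follows, with probabilities estimated as relative frequencies. The symbols $c(\cdot)$ for a frequency count and $N$ for the total number of tokens are notation introduced here for illustration.

$$
P(w) = \frac{c(w)}{N}, \qquad P(w_1, w_2) = \frac{c(w_1, w_2)}{N}
$$

$$
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}, \qquad
\mathrm{PPMI}(w_1, w_2) = \max\bigl(0,\ \mathrm{PMI}(w_1, w_2)\bigr)
$$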

In [25]:
def PPMI(word1, word2, unigramDict, bigramDict, total_tokens):
    # if words are not in dictionaries or frequency of words is zero, return 0
    zero_condition = (word1 not in unigramDict) or \
                     (word2 not in unigramDict) or \
                     (unigramDict[word1] == 0) or \
                     (unigramDict[word2] == 0) or \
                     ((word1, word2) not in bigramDict) or \
                     (bigramDict[(word1, word2)] == 0)

    if zero_condition:
        return 0

    # estimate probabilities as relative frequencies
    prob_word1 = float(unigramDict[word1]) / total_tokens
    prob_word2 = float(unigramDict[word2]) / total_tokens
    prob_word1_word2 = float(bigramDict[(word1, word2)]) / total_tokens

    # PMI: log ratio of observed to expected co-occurrence probability
    PMI_score = math.log2(prob_word1_word2/(prob_word1*prob_word2))

    # PPMI keeps only positive values: negative PMI is clipped to 0
    PPMI_score = max(0, PMI_score)

    # Round the PPMI score to 3 decimal points
    PPMI_score_round = round(PPMI_score, 3)

    return PPMI_score_round
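
As a worked example of the calculation, the sketch below computes a single score by hand. The token total is the value reported in Step 2.5; the three counts are an assumption (a pair of words that each occur 6 times and always together), since the notebook does not show the counts of any individual pair.

```python
import math

# Token total reported in Step 2.5 of this notebook
total_tokens = 1347712

# Assumed counts (not shown in the notebook): each word occurs 6 times,
# and every occurrence is part of the same bigram
count_word1 = 6
count_word2 = 6
count_bigram = 6

prob_word1 = count_word1 / total_tokens
prob_word2 = count_word2 / total_tokens
prob_bigram = count_bigram / total_tokens

# PMI is the log ratio of observed to expected co-occurrence
pmi = math.log2(prob_bigram / (prob_word1 * prob_word2))
ppmi = round(max(0.0, pmi), 3)
print(ppmi)
```

Under these assumptions the expression reduces to log2(1347712 / 6), which rounds to 17.777, the same value as the highest score printed in Step 4.3.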

# Step 4: Computing PPMI Scores

### 4.1 PPMI Score for all Bigrams

The PPMI score of every Bigram is computed in a for loop over the Bigrams dictionary and saved in a list, which is then converted to a dataframe. The dataframe has three columns with the headers 'WORD1', 'WORD2', and 'Values'. The first and second columns contain the individual words of each Bigram, and the third column, 'Values', contains the corresponding PPMI score.

In [26]:
ppmi_score_list = []
for word1, word2 in bigrams_dict:
    # Compute the PPMI for all the bigrams in bigrams_dict
    ppmi_score_calculation = PPMI(word1, word2, unigrams_dict,
                                  bigrams_dict, token_length)
    ppmi_score_list.append((word1, word2, ppmi_score_calculation))

df_ppmi_score = pd.DataFrame(ppmi_score_list,
                             columns=('WORD1', 'WORD2', 'Values'))


### 4.2 Save PPMI Score of Bigrams in a file

The PPMI scores of the Bigrams are saved, tab-separated and without header and indices, in a csv file named 'coca.ppmi'.

In [28]:
df_ppmi_score.to_csv('coca.ppmi', sep='\t', index=False, header=False,
                     mode='w')


### 4.3 Function to print highest 20 PPMI Score 

A new function 'topN' is defined to sort the Bigrams in descending order of their PPMI score and print the top 20 Bigrams with their PPMI scores, excluding header and indices.

The function takes two arguments, i.e. the data and the number of items to be printed; the number of items defaults to 20.
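
The same sort-and-truncate logic can be sketched on a plain list of tuples. The function name `top_n` and the rows below are hypothetical; the notebook's own function works on a dataframe instead.

```python
def top_n(rows, number_of_items=20):
    """Return the rows with the highest score (third field), best first."""
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)
    return ranked[:number_of_items]


# Hypothetical (word1, word2, score) rows, for illustration only
rows = [
    ('alpha', 'beta', 2.1),
    ('gamma', 'delta', 9.5),
    ('epsilon', 'zeta', 7.3),
    ('eta', 'theta', 0.0),
]

best = top_n(rows, number_of_items=2)
for word1, word2, score in best:
    print(word1, word2, score)
```

Passing the number of items as a keyword argument with a default keeps the common call short while still allowing a different cut-off.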

In [29]:
def topN(df, number_of_items=20):
    score_sort = df.sort_values(by='Values', ascending=False)
    selected_items = score_sort.head(number_of_items)
    header_index_removed = selected_items.to_string(index=False, header=False)
    print(header_index_removed)

In [30]:
topN(df_ppmi_score)

  guiseppi    scapellini  17.777
      sint          holo  17.777
      vito         adamo  17.362
       tel          aviv  17.192
     edith  schermerhorn  17.040
      palo          alto  17.040
    oswald         truxa  16.947
      macy        levitt  16.903
    chiang           mai  16.903
    cecily       scriber  16.848
     irene       lashman  16.777
  pheasant      theodora  16.777
      amos          holt  16.777
       del         norte  16.662
       slo            mo  16.555
 anarchist          no.l  16.555
       nil         spaar  16.362
    moyshe       rabeynu  16.275
   clavius         gulch  16.192
      zand       dynasty  16.040


# Step 5: Comparing PPMI and Frequency Counts

### 5.1 Print the top-20 Bigrams sorted by Frequency Counts

The Bigrams dictionary (containing Bigrams with their frequency counts) is converted to a dataframe so that the top 20 Bigrams sorted by frequency can be printed. The number of Bigrams to print does not need to be given because it defaults to 20.

In [31]:
# Make a dataframe from the bigrams dictionary (bigrams_dict)
frequency_count_list = []
for word1, word2 in bigrams_dict:
    frequency_count_list.append((word1, word2, bigrams_dict[word1, word2]))

df_freq_count = pd.DataFrame(frequency_count_list,
                             columns=('WORD1', 'WORD2', 'Values'))


In [32]:
topN(df_freq_count)

  @    @  61996
  .    "  18447
  ,  and   7206
  ,    "   6486
  "    "   5136
  .    i   5051
 of  the   4617
 in  the   4403
  ?    "   4263
  "    i   4184
  .   he   4128
 do  n't   3972
  .  the   3938
  .    #   3301
  ,  but   3139
 it   be   2924
  .  she   2902
 on  the   2786
 to  the   2746
  ,  the   2591


### 5.2 Differences between Computing simple Frequency vs. using PPMI Scores

Simple frequency measures how often a word or a pair of words occurs in the corpus, while PMI quantifies how much more often two words co-occur than would be expected if they were independent. In the results obtained, the most frequent Bigrams are the least informative: they consist mostly of punctuation and function words and carry little meaning of their own. The PPMI scores, in contrast, highlight words that form a distinctive unit when they appear together. The top 20 Bigrams by PPMI are mostly proper names (e.g. 'tel aviv', 'palo alto'). Their words rarely occur apart from each other, so the ratio of observed to expected co-occurrence is far higher than for the most frequent pairs.
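
The contrast can be illustrated with two pairs and a small helper. The function `pmi`, the unigram counts for 'of' and 'the', and all counts for 'tel aviv' are assumptions made for this sketch; only the token total and the Bigram count of 'of the' come from the outputs above.

```python
import math

TOTAL_TOKENS = 1347712  # token total reported in Step 2.5


def pmi(count_w1, count_w2, count_pair, total=TOTAL_TOKENS):
    """PMI from raw counts, using relative frequencies as probabilities."""
    expected = (count_w1 / total) * (count_w2 / total)
    return math.log2((count_pair / total) / expected)


# (bigram, count of word 1, count of word 2, count of the bigram)
# The unigram counts and all 'tel aviv' counts are hypothetical
pairs = [
    (('of', 'the'), 30000, 70000, 4617),
    (('tel', 'aviv'), 9, 9, 9),
]

by_frequency = max(pairs, key=lambda p: p[3])[0]
by_pmi = max(pairs, key=lambda p: pmi(p[1], p[2], p[3]))[0]
print('most frequent:', by_frequency)
print('highest PMI:', by_pmi)
```

The frequent pair wins on raw count, but its words are so common individually that their co-occurrence is only slightly above chance, whereas the rare pair occurs together far more often than independence would predict.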