# AI4PH Short Course in NLP - Computer Assignment
*10 December 2023*
<br>
*Author: Yasaman Parhizkar*

**Description:**

Create a Jupyter notebook that performs the following NLP tasks in this order:

1. Load the Brown Corpus from NLTK using paras().
2. Remove punctuation and stopwords.
3. Apply the Lancaster stemmer.
4. Print to the screen the top 10 words in terms of TF. Show the TF values as well.
5. Print to the screen the top 10 words in terms of TF-IDF. Use the paragraphs as documents for calculating TF-IDF. Show the TF-IDF values as well.
6. Use pos_tag() to tag each token.
7. Print to the screen the 10 most common trigrams of word-tag pairs. Show their frequencies as well. Use nltk.trigrams().
8. Please clean up your code before submission. Your code should not contain any rough work. Also, insert appropriate comments to make it easy to understand and follow your code.  

This is a pass or fail assignment and you will receive a pass as long as your code performs the tasks above without any error. Submit your Jupyter notebook to Canvas by 9:59pm MT / 11:59pm ET on Sunday, December 10, 2023.

In [1]:
# install required packages
!pip install nltk tqdm



In [2]:
import nltk
from nltk.corpus import brown
import string
from nltk.corpus import stopwords
from tqdm import tqdm
import math
import operator

# download the Brown Corpus and other required corpora
nltk.download('brown')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
# load the brown corpus
brown_paras = brown.paras()
print('Test Sentence:', " ".join(brown_paras[0][0]), '\n')

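# build a string of punctuation characters plus the Brown Corpus dash and quote tokens;
# the membership test below is a substring check, so '--', '``' and "''" are matched as well
punctuations = string.punctuation + '--``\"\"\'\''

# lowercase all tokens and remove punctuation tokens
brown_paras_clean = [[[token.lower() for token in sent if token not in punctuations] for sent in paraph] for paraph in tqdm(brown_paras, desc='Punctuation Removal')]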
print('\nTest Sentence:', " ".join(brown_paras_clean[0][0]), '\n')

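# remove stopwords (build the stopword set once; calling stopwords.words() for every token is very slow)
stop_words = set(stopwords.words('english'))
brown_paras_clean = [[[token for token in sent if token not in stop_words] for sent in paraph] for paraph in tqdm(brown_paras_clean, desc='Stopword Removal')]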
print('\nTest Sentence:', " ".join(brown_paras_clean[0][0]), '\n')

# apply the lancaster stemmer
lancaster = nltk.LancasterStemmer()
lancaster_stems = [[[lancaster.stem(word) for word in sent] for sent in paraph] for paraph in tqdm(brown_paras_clean, desc='Lancaster Stemmer')]
print('\nTest Sentence:', " ".join(lancaster_stems[0][0]), '\n')

Test Sentence: The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . 



Punctuation Removal: 100%|██████████| 15667/15667 [00:05<00:00, 2735.78it/s]



Test Sentence: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced no evidence that any irregularities took place 



Stopword Removal: 100%|██████████| 15667/15667 [01:57<00:00, 132.87it/s]



Test Sentence: fulton county grand jury said friday investigation atlanta's recent primary election produced evidence irregularities took place 



Lancaster Stemmer: 100%|██████████| 15667/15667 [00:10<00:00, 1486.94it/s]


Test Sentence: fulton county grand jury said friday investig atlanta's rec prim elect produc evid irregul took plac 
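The cleaning steps above can be sketched on toy data without downloading any corpus. This is only an illustration under stated assumptions: `TOY_STOPWORDS` and `clean_paragraphs` are hypothetical names, the stopword list is deliberately tiny, and punctuation is matched by exact set membership rather than by the substring check used in the cell above.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook itself uses stopwords.words('english') from NLTK.
TOY_STOPWORDS = {"the", "of", "an", "that", "any", "no"}

# Punctuation tokens to drop: single characters plus the multi-character
# dash and quote tokens that appear in the Brown Corpus.
PUNCT_TOKENS = set(string.punctuation) | {"--", "``", "''"}


def clean_paragraphs(paras, stop_words=TOY_STOPWORDS):
    """Lowercase tokens, then drop punctuation tokens and stopwords.

    `paras` is a list of paragraphs, each a list of sentences,
    each a list of string tokens (the shape returned by brown.paras()).
    """
    return [
        [
            [tok.lower() for tok in sent
             if tok not in PUNCT_TOKENS and tok.lower() not in stop_words]
            for sent in para
        ]
        for para in paras
    ]


toy_paras = [[["The", "jury", "said", "``", "no", "evidence", "''", "."]]]
print(clean_paragraphs(toy_paras))  # [[['jury', 'said', 'evidence']]]
```

Set membership is a constant-time lookup on average, which matters when the same test runs once for every token in the corpus.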






In [4]:
# compute TF (raw counts) over the cleaned, unstemmed tokens
brown_tokens = [token for paraph in brown_paras_clean for sent in paraph for token in sent]  # flatten the nested paragraphs into one token list
tf = nltk.FreqDist(brown_tokens)
tf.most_common(10)

[('one', 3292),
 ('would', 2714),
 ('said', 1961),
 ('new', 1635),
 ('could', 1601),
 ('time', 1598),
 ('two', 1412),
 ('may', 1402),
 ('first', 1361),
 ('like', 1292)]
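The TF values above are raw counts. As a minimal sketch, the same counting can be done with `collections.Counter` (which `nltk.FreqDist` subclasses), and a relative frequency can be reported alongside the count. The helper name `term_frequencies` and the toy token list are assumptions made for illustration.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return (raw_counts, relative_frequencies) for a flat token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Relative frequency = count of the token / total number of tokens.
    rel = {tok: n / total for tok, n in counts.items()} if total else {}
    return counts, rel


toy_tokens = ["jury", "said", "jury", "election", "said", "jury"]
counts, rel = term_frequencies(toy_tokens)
for tok, n in counts.most_common(2):
    print(tok, n, round(rel[tok], 3))
```

Ranking by raw count and ranking by relative frequency give the same order; the relative value is only easier to compare across corpora of different sizes.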

In [5]:
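# compute TF-IDF as tf(t) * ln(ndocs / df(t)), where tf(t) is the corpus-wide count of token t
# and df(t) is the number of paragraphs (documents) that contain t
tf_idf = {}
ndocs = len(brown_paras_clean)

# only used by the optional early-stop lines at the end of the loop below
n_tokens_computed = 100
token_idx = 0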

for token in tqdm(tf, desc='TF-IDF Computation'):
    # document frequency: number of paragraphs that contain this token
    count = 0
    for paraph in brown_paras_clean:
        if any(token in sent for sent in paraph):
            count += 1

    tf_idf[token] = tf[token] * math.log(ndocs / count)

    # NOTE: Uncomment the next three lines to limit the number of considered tokens for faster runtime.
    # if token_idx >= n_tokens_computed-1:
    #   break
    # token_idx += 1


# print the words with the largest tf-idf values
sorted_tf_idf = sorted(tf_idf.items(),
                       key=operator.itemgetter(1),
                       reverse=True)  # sort (token, score) pairs by score, descending
sorted_tf_idf[:10]

TF-IDF Computation: 100%|██████████| 49637/49637 [49:50<00:00, 16.60it/s]


[('one', 5967.480472092773),
 ('would', 5714.329758469583),
 ('said', 4260.661866852177),
 ('new', 4097.735481700572),
 ('af', 4086.0738714671734),
 ('could', 3958.3297676189504),
 ('time', 3904.4523114557496),
 ('may', 3823.063295503012),
 ('two', 3595.191700677196),
 ('first', 3501.2731240146404)]
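The loop above rescans every paragraph once per vocabulary entry, which is why it takes so long. A possible alternative, sketched here on toy data, collects all document frequencies in a single pass over the documents and then applies the same formula, tf(t) * ln(N / df(t)). The function name `tf_idf_scores` and the toy documents are assumptions, not part of the original notebook.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Score each token as tf(t) * ln(N / df(t)).

    `docs` is a list of documents, each a flat list of tokens.
    tf(t) is the corpus-wide count, N the number of documents, and
    df(t) the number of documents containing t.
    """
    tf = Counter(tok for doc in docs for tok in doc)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each token at most once per document
    n_docs = len(docs)
    return {tok: tf[tok] * math.log(n_docs / df[tok]) for tok in tf}


toy_docs = [["jury", "said", "jury"], ["election", "said"], ["jury", "vote"]]
scores = tf_idf_scores(toy_docs)
for tok, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok}: {score:.3f}")
```

To use this with the notebook's data, each paragraph would first be flattened into one token list, for example `[[tok for sent in paraph for tok in sent] for paraph in brown_paras_clean]`. A token that occurs in every document scores 0 under this formula because ln(1) = 0.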

In [6]:
# part-of-speech tagging
tags = nltk.pos_tag(brown_tokens)
tags[:10]

[('fulton', 'NN'),
 ('county', 'NN'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('said', 'VBD'),
 ('friday', 'JJ'),
 ('investigation', 'NN'),
 ("atlanta's", 'NN'),
 ('recent', 'JJ'),
 ('primary', 'JJ')]

In [7]:
# compute the most common word-tag trigrams
tf_wt_trigram = nltk.FreqDist(nltk.trigrams(tags))
tf_wt_trigram.most_common(10)

[((('world', 'NN'), ('war', 'NN'), ('2', 'CD')), 35),
 ((('new', 'JJ'), ('york', 'NN'), ('city', 'NN')), 26),
 ((('new', 'JJ'), ('york', 'NN'), ('times', 'NNS')), 18),
 ((('government', 'NN'), ('united', 'VBD'), ('states', 'NNS')), 18),
 ((('basic', 'JJ'), ('wage', 'NN'), ('rate', 'NN')), 16),
 ((('a.', 'NN'), ('notte', 'NN'), ('jr.', 'NN')), 15),
 ((('notte', 'NN'), ('jr.', 'NN'), ('governor', 'NN')), 15),
 ((('per', 'IN'), ('capita', 'JJ'), ('income', 'NN')), 14),
 ((('world', 'NN'), ('war', 'NN'), ('1', 'CD')), 13),
 ((('index', 'NN'), ('words', 'NNS'), ('electronic', 'JJ')), 13)]
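Because `tags` is one flat list, the trigrams counted above can span sentence and paragraph boundaries. The sketch below shows what a trigram window does using only the standard library, and counts trigrams sentence by sentence as one possible alternative. The names `trigrams` and `sentence_trigram_counts` and the hand-written tags are illustrative assumptions, not tagger output.

```python
from collections import Counter


def trigrams(seq):
    """Return an iterator over consecutive 3-tuples of `seq`.

    This mirrors nltk.trigrams with default arguments (no padding).
    """
    items = list(seq)
    return zip(items, items[1:], items[2:])


def sentence_trigram_counts(tagged_sents):
    """Count (word, tag) trigrams without crossing sentence boundaries."""
    counts = Counter()
    for sent in tagged_sents:
        counts.update(trigrams(sent))
    return counts


# Hypothetical hand-tagged sentences; the tags are illustrative only.
toy_tagged = [
    [("world", "NN"), ("war", "NN"), ("2", "CD"), ("ended", "VBD")],
    [("world", "NN"), ("war", "NN"), ("2", "CD")],
]
counts = sentence_trigram_counts(toy_tagged)
print(counts.most_common(1))
```

Sequences shorter than three items produce no trigrams, so very short sentences contribute nothing to the per-sentence counts.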