<h1><center>Natural Language Processing</center></h1>

### Stemming

Task-1: Strip the affixes from the words in the text below:
   
DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony. The striped bats are hanging on their feet for best.

Use the following stemmers available in [NLTK](https://www.nltk.org/) to stem the text above, and compare their output:

- [Porter stemmer](https://www.nltk.org/api/nltk.stem.porter.html) 
- [Lancaster stemmer](https://www.nltk.org/api/nltk.stem.lancaster.html)
- [Snowball stemmer](https://www.nltk.org/api/nltk.stem.snowball.html) 

In [3]:
from nltk import stem
from nltk.tokenize import word_tokenize
string = """
DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony. The striped bats are hanging on their feet for best.
"""
stringT = word_tokenize(string.lower())

stemmer = stem.PorterStemmer()
stemmed = [stemmer.stem(word) for word in stringT]
print("Porter stemmer:")
print(stemmed)

stemmer2 = stem.LancasterStemmer()
stemmed2 = [stemmer2.stem(word) for word in stringT]
print("Lancaster stemmer:")
print(stemmed2)

stemmer3 = stem.SnowballStemmer("english")
stemmed3 = [stemmer3.stem(word) for word in stringT]
print("Snowball stemmer:")
print(" ".join(stemmed3))

Porter stemmer:
['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.', 'the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best', '.']
Lancaster stemmer:
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.', 'the', 'striped', 'bat', 'ar', 'hang', 'on', 'their', 'feet', 'for', 'best', '.']
Snowball stemmer:
denni : listen , strang women lie in pond distribut sword is no basi for a system of govern . suprem execut power deriv from a mandat from the mass , not from some farcic aquat ceremoni . the 
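To make the comparison concrete, the sketch below tabulates a handful of tokens copied from the three outputs above. The choice of tokens and the variable names are ours, and only tokens that appear before the point where the Snowball output is cut off are used; nothing is recomputed here.

```python
# Stems copied from the outputs printed above (not recomputed here).
outputs = {
    "women":    {"porter": "women",    "lancaster": "wom",      "snowball": "women"},
    "lying":    {"porter": "lie",      "lancaster": "lying",    "snowball": "lie"},
    "basis":    {"porter": "basi",     "lancaster": "bas",      "snowball": "basi"},
    "power":    {"porter": "power",    "lancaster": "pow",      "snowball": "power"},
    "ceremony": {"porter": "ceremoni", "lancaster": "ceremony", "snowball": "ceremoni"},
}

# Tokens on which at least two of the stemmers disagree.
disagreements = [word for word, stems in outputs.items()
                 if len(set(stems.values())) > 1]

print(f"{'token':<10}{'porter':<10}{'lancaster':<11}{'snowball':<10}")
for word, stems in outputs.items():
    print(f"{word:<10}{stems['porter']:<10}{stems['lancaster']:<11}{stems['snowball']:<10}")
```

On these tokens Porter and Snowball agree throughout, while Lancaster differs on every one: it truncates `women` and `power` more aggressively, yet leaves `lying` and `ceremony` untouched.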

### Lemmatization

Task-1: Using an [NLTK lemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html), such as the WordNet lemmatizer, lemmatize the following text and compare the lemmatized text with the output of the Porter stemmer:

DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony. The striped bats are hanging on their feet for best.

In [7]:
# import nltk
# nltk.download('wordnet')
wnl = stem.WordNetLemmatizer()
lematized = [wnl.lemmatize(word) for word in stringT]
print("Wordnet lemmatizer:")
print(lematized)

Wordnet lemmatizer:
['dennis', ':', 'listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.', 'the', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best', '.']
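The lemmatizer above is called without a part-of-speech argument, and `WordNetLemmatizer.lemmatize` then treats every token as a noun (`pos="n"` is the default). That is a likely reason why verb forms such as `lying` and `hanging` pass through unchanged. The sketch below compares a few tokens copied from the Porter and WordNet outputs printed in this notebook; the selection of tokens and the variable names are ours.

```python
# Tokens copied from the Porter and WordNet outputs printed above (a subset).
original   = ["women", "lying", "basis", "derives", "hanging", "feet"]
porter     = ["women", "lie",   "basi",  "deriv",   "hang",    "feet"]
lemmatized = ["woman", "lying", "basis", "derives", "hanging", "foot"]

# Which tokens did each method actually change?
changed_by_porter = [w for w, s in zip(original, porter) if w != s]
changed_by_lemmatizer = [w for w, l in zip(original, lemmatized) if w != l]

for word, stem_form, lemma in zip(original, porter, lemmatized):
    print(f"{word:<10} porter={stem_form:<8} wordnet={lemma}")
```

In this sample the stemmer alters the inflected verbs and produces non-words such as `basi`, whereas the lemmatizer only maps the irregular plurals `women` and `feet` to dictionary forms.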


### Information Retrieval System

Task-1: Develop an information retrieval system based on ranked retrieval. The intended system should be based on tf-idf scores and cosine similarities to retrieve ranked indices of documents most relevant to the need. 
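
For reference, one common formulation of the two quantities is sketched below. The notation is ours: $\mathrm{tf}_{t,d}$ is the count of term $t$ in document $d$, $\mathrm{df}_t$ is the number of documents containing $t$, and $N$ is the number of documents. Other tf-idf variants exist, and implementations often smooth the denominator.

$$
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_{10}\frac{N}{\mathrm{df}_t},
\qquad
\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
$$

Here $\vec{q}$ and $\vec{d}$ are the tf-idf weight vectors of a query and a document over the same vocabulary.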


A collection of documents ([WordsDataset.csv](https://canvas.bham.ac.uk/courses/65790/files/14306214?module_item_id=3017551)) and a set of [queries](https://canvas.bham.ac.uk/courses/65790/files/14306235?module_item_id=3017553) are available in the course folder to develop the desired system. This is a sample dataset where every document is a collection of a few words.

Given a query, the system should compare it against the words of every document using the scheme above and return the indices of the 10 most relevant documents, ranked from highest to lowest score.

Hint: Go to the [lecture slides](https://canvas.bham.ac.uk/courses/65790/files/14297069?module_item_id=3015094) and follow the steps to develop an end-to-end IR system.

Useful links: [NLTK](https://www.nltk.org/), [pandas](https://pandas.pydata.org/docs/user_guide/index.html), [NumPy](https://numpy.org/doc/stable/user/index.html#user)

In [8]:
# Step2: Preprocessing
# Convert csv and txt into dict using pandas
import pandas as pd
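# The `squeeze` argument of read_csv was removed in pandas 2.0; squeeze the single column explicitly
documents = pd.read_csv('WordsDataset.csv', header=0, index_col=0).squeeze("columns").to_dict()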
queries = pd.read_csv("Queries.txt", header = None).to_dict()[0]
# Tokenization
from nltk.tokenize import word_tokenize
print(documents)
print(queries)
for key, value in documents.items():
    documents[key] = word_tokenize(documents[key].lower())
for key, value in queries.items():
    queries[key] = word_tokenize(queries[key].lower())
print(documents)
print(queries)

{0: 'Hiker, demon, creepy, scary, tunnel, stalk', 1: 'Batman, batman beyond, who are you, narrows it down, animated, show, officer', 2: 'Up, carl, russell, honor, award, scout badge, old man, kids, movie, record', 3: 'Tom, jerry, sword, stab, dont care, cartoon, show', 4: 'Wholesome, comic, dialogue bubble, dog, sleeping with owner', 5: 'Doug dimmadome, chef hat, long, fast food, restaurant, employee', 6: 'Empty town, comparison, bustling city, contradictory', 7: 'Lord of the rings, lotr, gandalf, pipip, sending, movie, height', 8: 'Geralt, yennefer, pointing, blame, slapstick, video game, player', 9: 'Goofy, college, max ,shock, surprise, reveal, announce, reaction, cartoon', 10: 'Gordon Ramsay, pepto bismol, patrick, feeding, crossover, cartoon, show, chef', 11: 'Groot, gunpoint, force, surreal, movie, despicable me', 12: 'Having enough, jump, slapstick, fall', 13: 'Cat, possessed', 14: 'Hotdog, dog, many options', 15: 'Jedi, master, lightsaber, block, unexpected, movie, star wars', 

In [20]:
# Step3: Relevant Information for IR
# Remove stopwords and punctuation marks
from nltk.corpus import stopwords
from string import punctuation as punc

nltk_stopwords = stopwords.words('english')
for p in punc:
    nltk_stopwords.append(p)
print(nltk_stopwords)

for key, value in documents.items():
    documents[key] = [word for word in documents[key] if word not in nltk_stopwords]
for key, value in queries.items():
    queries[key] = [word for word in queries[key] if word not in nltk_stopwords]
    
print(documents)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [16]:
# Reduce dimensionality - stem
from nltk import stem
stemmer = stem.PorterStemmer()

for key, value in documents.items():
    stemmed = [stemmer.stem(word) for word in documents[key]]
    documents[key] = stemmed
for key, value in queries.items():
    stemmed = [stemmer.stem(word) for word in queries[key]]
    queries[key] = stemmed
    
# construct termSet and convert it to map
termSet = set()
for key, value in documents.items():
    termSet.update(documents[key])
for key, value in queries.items():
    termSet.update(queries[key])

termMap = {k: v for k, v in zip(termSet, range(len(termSet)))}
print(termMap)

{'movi': 0, 'third': 1, 'black': 2, 'crusad': 3, 'groot': 4, 'shake': 5, 'puncher': 6, 'peter': 7, 'fortress': 8, 'team': 9, 'enough': 10, 'gordon': 11, 'option': 12, 'incr': 13, 'comparison': 14, 'water': 15, 'cat': 16, 'step': 17, 'stalk': 18, 'wwe': 19, 'lord': 20, 'zach': 21, 'employ': 22, 'pipip': 23, 'support': 24, 'phinea': 25, 'kick': 26, 'videogam': 27, 'guy': 28, 'hold': 29, 'termin': 30, 'danc': 31, 'patrick': 32, 'kane': 33, 'stereotyp': 34, 'sofa': 35, 'feed': 36, 'swap': 37, 'batman': 38, 'award': 39, 'yennef': 40, 'block': 41, 'sport': 42, 'fallout': 43, 'mr.': 44, 'jesu': 45, 'putin': 46, 'stab': 47, 'fire': 48, 'mark': 49, 'surpri': 50, 'intellig': 51, 'hidden': 52, 'look': 53, 'slide': 54, 'battl': 55, 'bodi': 56, 'sign': 57, 'osborn': 58, 'ferb': 59, 'footbal': 60, 'carri': 61, 'gandalf': 62, 'time': 63, 'polic': 64, 'frozon': 65, 'stone': 66, 'umpir': 67, 'blame': 68, 'dont': 69, 'turtl': 70, 'tear': 71, 'send': 72, 'alon': 73, 'better': 74, 'cast': 75, 'jedi': 76, 
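Because a Python set has no guaranteed iteration order (string hashing is randomized per process by default), the indices assigned in `termMap` can differ from one run to the next. The sketch below shows one way to make the mapping reproducible by sorting the terms first; the toy token lists and the names `docs`, `qs`, and `term_map` are hypothetical stand-ins, not the course data.

```python
# Toy stemmed token lists standing in for `documents` and `queries` (hypothetical data).
docs = {0: ["hiker", "demon", "tunnel"], 1: ["batman", "show"]}
qs = {0: ["batman", "cartoon"]}

# Collect the vocabulary from both documents and queries.
terms = set()
for tokens in list(docs.values()) + list(qs.values()):
    terms.update(tokens)

# Sorting fixes the column order, so the same corpus always yields the same map.
term_map = {term: idx for idx, term in enumerate(sorted(terms))}
print(term_map)
```

A stable column order makes the tf-idf matrices comparable across runs, which helps when debugging or caching results.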

In [37]:
# initialise the vector space
import numpy as np
docTF = np.zeros((len(documents), len(termMap)))
queTF = np.zeros((len(queries), len(termMap)))

# Construct vector space
for key, value in documents.items():
    for term in value:
        docTF[key,termMap.get(term)] += 1
for key, value in queries.items():
    for term in value:
        queTF[key,termMap.get(term)] += 1

In [64]:
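# Inverse document frequency; the +1 avoids division by zero for terms that occur only in queries.
# Note: the numerator here is the vocabulary size, len(termMap); the textbook
# definition uses the number of documents, len(documents), instead.
docIDF = np.log10(len(termMap) / (np.count_nonzero(docTF, axis=0) + 1))
docW = docTF * docIDF
queW = queTF * docIDF

## Step4: calculate similarity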
from numpy.linalg import norm

res = np.zeros((queW.shape[0], docW.shape[0]))
for i in range(queW.shape[0]):
    for j in range(docW.shape[0]):
        res[i, j] = np.dot(queW[i], docW[j]) / (norm(queW[i])*norm(docW[j]))
print(res)

[[0.         0.36007936 0.1216192  0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.23899627
  0.23131359 0.         0.         0.         0.         0.
  0.         0.1367546  0.         0.         0.         0.
  0.         0.         0.13906793 0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.32955732 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.        
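
The double loop above can also be written as a single matrix product over row-normalized vectors. The following is a minimal sketch on toy weight matrices, not the notebook's data; the function name `cosine_matrix` and the `eps` parameter are our own. It additionally guards against all-zero rows (for instance, a query whose terms were all removed during preprocessing), which would otherwise produce NaN.

```python
import numpy as np

def cosine_matrix(queries, docs, eps=1e-12):
    """Return a (n_queries, n_docs) matrix of cosine similarities."""
    q_norm = np.linalg.norm(queries, axis=1, keepdims=True)
    d_norm = np.linalg.norm(docs, axis=1, keepdims=True)
    # Clamp the norms so that all-zero rows yield similarity 0 instead of NaN.
    q_unit = queries / np.maximum(q_norm, eps)
    d_unit = docs / np.maximum(d_norm, eps)
    return q_unit @ d_unit.T

# Toy tf-idf weights: two queries and two documents over three terms.
que_w = np.array([[1.0, 0.0, 1.0],
                  [0.0, 0.0, 0.0]])
doc_w = np.array([[2.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0]])
sims = cosine_matrix(que_w, doc_w)
print(sims)
```

The first query points in the same direction as the first document, so their similarity is 1; the empty second query scores 0 against everything.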

In [93]:
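# Indices of the 5 highest-scoring documents per query (unordered)
ind = np.argpartition(res, -5)[:,-5:]
resTopK = np.take_along_axis(res, ind, axis=1)
# Order those 5 by score (ascending), then flip so the best match comes first
resTopKSorted = np.take_along_axis(ind, np.argsort(resTopK), axis=1)
print(np.flip(resTopKSorted, 1))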

[[ 1 35 36 50 43]
 [28 19 17 15 18]
 [53  8 17 15 18]
 [34  2 19 53 18]
 [ 2 17 16 11 19]
 [53  7 50 18 23]]
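
A simpler, if slightly less efficient, way to obtain the same kind of ranking is to sort the negated scores and keep the first `k` columns. The sketch below uses toy scores and a hypothetical helper `top_k_indices`; `k` is a parameter, so the top 10 requested in the task (the cell above keeps 5) is just `k=10`.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k highest scores in each row, best first."""
    # Negating the scores turns an ascending sort into a descending ranking.
    order = np.argsort(-scores, axis=1, kind="stable")
    return order[:, :k]

# Toy similarity matrix: two queries scored against four documents.
scores = np.array([[0.0, 0.36, 0.12, 0.24],
                   [0.33, 0.0, 0.0, 0.10]])
ranked = top_k_indices(scores, k=2)
print(ranked)
```

For large collections `np.argpartition` avoids sorting every score, but for a dataset of this size a full sort is usually easier to read and verify.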
