# Lab - Bag of Words, TFIDF, Document Similarity

## Lab Summary:

In this lab we will be learning about document similarity and word embeddings.
## Lab Goal:
Upon completion of this lab, the student should be able to:
<ul>
    <li> understand cosine, Jaccard, and Euclidean similarity </li>
    <li> test if 2 documents are similar or not </li>
</ul>


## Time Required:
This lab should take 4 hours to complete for those unfamiliar with the tools.
Hint: The lab flow is already prepared for you; just read through the links and try implementing them. Don't try to have it all figured out before starting.

## Lab Preparation:
<ul>
    <li> Review Lecture for Week 5 and 6 </li>
</ul>

## Hardware Needed:
Any computer with access to the internet and web browser



## Import Packages and Classes (Initial)

In this lab we will be using the following libraries:
<ol>
    <li> NLTK </li>
    <li> Pandas </li>
    <li> Matplotlib </li>
    <li> Gensim </li>
</ol>


<b>Matplotlib</b> is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is completely free and open source.

![Image](https://miro.medium.com/max/1050/1*EsqDYFK-IzGEAm4FyZP0wQ.jpeg)

More here: https://matplotlib.org/stable/index.

<b>Pandas</b> is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

![Image](https://miro.medium.com/max/819/1*Dss7A8Z-M4x8LD9ccgw7pQ.png)

More here: https://pandas.pydata.org/

<b>Gensim</b> supports a variety of NLP tasks, such as:
- converting words to vectors (word2vec),
- converting documents to vectors (doc2vec),
- finding text similarity, and text summarization (the summarization module was removed in Gensim 4.0)

![Gensim](https://repository-images.githubusercontent.com/1349775/202c4680-8f7c-11e9-91c6-745fdcbeffe8)

More about Gensim here:
https://pypi.org/project/gensim/

<b>Bag of Words</b> is a Natural Language Processing technique of text modelling. A bag of words is a representation of text that describes the occurrence of words within a document. 

First, we will build the bag-of-words representation without using any library. Then we will use the scikit-learn (sklearn) library to achieve the same thing.

In [2]:
def vectorize(tokens):
    ''' This function returns the bag of words representation. You need to use this function ahead'''
    vector=[]
    for w in filtered_vocab:
        vector.append(tokens.count(w))
        #tokens.count(w) tells us the count of each word in the filtered_vocab in tokens
        #filtered_vocab contains list of unique words after filtering stopwords and punctuation
    return vector

    
def unique(sequence):
    '''this function returns the vocabulary: the set of unique words in sequence'''
    return set(sequence)

In [3]:
stopwords=["to","is","a","the","of"]
special_char=[",",":"," ",";",".","?"]


#your list of sentences
string1="Paris is the capital of France"
string2="Milan is the fashion capital of the world"

#lowercasing is one of the most important steps in text preprocessing. A particular word whether in lower or upper
#case means the same thing
#convert them to lower case
#your code here
string1=string1.lower()
string2=string2.lower()

#your code here
#split the sentences into tokens
tokens1=string1.split()
tokens2=string2.split()


print(tokens1)
print(tokens2)

['paris', 'is', 'the', 'capital', 'of', 'france']
['milan', 'is', 'the', 'fashion', 'capital', 'of', 'the', 'world']


Now, we will find the vocabulary which is the set of unique words in the corpus. We also need to filter it since we do not want stopwords or special characters to influence our vocabulary.

In [4]:
vocab=unique(tokens1+tokens2)
vocab

{'capital', 'fashion', 'france', 'is', 'milan', 'of', 'paris', 'the', 'world'}

In [5]:
filtered_vocab=[]
#filtered_vocab should contain the words from vocab that are not stopwords or special_char
#your code here
for w in vocab: 
    if w not in stopwords and w not in special_char: 
        filtered_vocab.append(w)
print(filtered_vocab)

['milan', 'fashion', 'capital', 'france', 'paris', 'world']


In [6]:
print(filtered_vocab)
#convert sentences into vectors
vector1=vectorize(tokens1)
print(vector1)
vector2=vectorize(tokens2)
print(vector2)

['milan', 'fashion', 'capital', 'france', 'paris', 'world']
[0, 0, 1, 1, 1, 0]
[1, 1, 1, 0, 0, 1]
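The manual steps above can also be sketched with `collections.Counter` from the standard library. This is only a minimal sketch, not the lab's required solution: the helper name `bag_of_words` and the sorted vocabulary are assumptions made here. Sorting is used because a Python `set` has no fixed order, so the column order of `filtered_vocab` can differ between runs.

```python
from collections import Counter

stopwords = ["to", "is", "a", "the", "of"]

def bag_of_words(tokens, vocab):
    """Count how often each vocabulary word occurs in tokens (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

tokens1 = "paris is the capital of france".split()
tokens2 = "milan is the fashion capital of the world".split()

# Sorting gives a stable column order; a plain set does not.
vocab = sorted(set(tokens1 + tokens2) - set(stopwords))

print(vocab)                          # ['capital', 'fashion', 'france', 'milan', 'paris', 'world']
print(bag_of_words(tokens1, vocab))   # [1, 0, 1, 0, 1, 0]
print(bag_of_words(tokens2, vocab))   # [1, 1, 0, 1, 0, 1]
```

The counts are the same as in the manual version; only the column order changes.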


Well, that was the bag of words, the hard way! Now we will do the same using the sklearn library.

In [7]:
text= "Sweden is part of the geographical area of Fennoscandia. The climate is in general mild for its northerly latitude due to significant maritime influence. In spite of the high latitude, Sweden often has warm continental summers, being located in between the North Atlantic, the Baltic Sea, and vast Russia. The general climate and environment vary significantly from the south and north due to the vast latitudal difference, and much of Sweden has reliably cold and snowy winters. Southern Sweden is predominantly agricultural, while the north is heavily forested and includes a portion of the Scandinavian Mountains."
#Split text into sentences. Store the sentences into a list
#Your code here:
text = text.split('.')
text

['Sweden is part of the geographical area of Fennoscandia',
 ' The climate is in general mild for its northerly latitude due to significant maritime influence',
 ' In spite of the high latitude, Sweden often has warm continental summers, being located in between the North Atlantic, the Baltic Sea, and vast Russia',
 ' The general climate and environment vary significantly from the south and north due to the vast latitudal difference, and much of Sweden has reliably cold and snowy winters',
 ' Southern Sweden is predominantly agricultural, while the north is heavily forested and includes a portion of the Scandinavian Mountains',
 '']

We will use the sklearn library to find the bag-of-words representation automatically.

The CountVectorizer provides a simple way to tokenize a collection of text documents and build a vocabulary of known words, and also to encode new documents using that vocabulary.


More about this here: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn

In [8]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

CountVec = CountVectorizer(
    ngram_range=(1,1),  # use (2,2) for bigrams
    stop_words='english'
)

Count_data = CountVec.fit_transform(sentence for sentence in text)

cv_dataframe = pd.DataFrame(
    Count_data.toarray(),
    columns=CountVec.get_feature_names_out()
)

print(cv_dataframe)


   agricultural  area  atlantic  baltic  climate  cold  continental  \
0             0     1         0       0        0     0            0   
1             0     0         0       0        1     0            0   
2             0     0         1       1        0     0            1   
3             0     0         0       0        1     1            0   
4             1     0         0       0        0     0            0   
5             0     0         0       0        0     0            0   

   difference  environment  fennoscandia  ...  snowy  south  southern  spite  \
0           0            0             1  ...      0      0         0      0   
1           0            0             0  ...      0      0         0      0   
2           0            0             0  ...      0      0         0      1   
3           1            1             0  ...      1      1         0      0   
4           0            0             0  ...      0      0         1      0   
5           0         

The next task is to find the bag of words representation again but this time, do it for <b>bigrams</b>!

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. 
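To make the definition concrete, here is a minimal sketch that builds word n-grams from a token list with `zip`. The helper name `ngrams` is an assumption for illustration; it is not part of sklearn.

```python
def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) over a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "milan is the fashion capital of the world".split()
bigrams = ngrams(tokens, 2)

print(bigrams[:3])   # [('milan', 'is'), ('is', 'the'), ('the', 'fashion')]
print(len(bigrams))  # 7
```

Note that `CountVectorizer` removes stop words before it forms n-grams, which is why the bigram output that follows contains columns such as `agricultural north` even though the two words are not adjacent in the original sentence.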

In [9]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

CountVec = CountVectorizer(
    ngram_range=(2,2),   # bigrams
    stop_words='english'
)

Count_data = CountVec.fit_transform(sentence for sentence in text)

cv_dataframe = pd.DataFrame(
    Count_data.toarray(),
    columns=CountVec.get_feature_names_out()
)

print(cv_dataframe)


   agricultural north  area fennoscandia  atlantic baltic  baltic sea  \
0                   0                  1                0           0   
1                   0                  0                0           0   
2                   0                  0                1           1   
3                   0                  0                0           0   
4                   1                  0                0           0   
5                   0                  0                0           0   

   climate environment  climate general  cold snowy  continental summers  \
0                    0                0           0                    0   
1                    0                1           0                    0   
2                    0                0           0                    1   
3                    1                0           1                    0   
4                    0                0           0                    0   
5                    0          

<b>TFIDF</b>

Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

![Image](https://miro.medium.com/max/3604/1*qQgnyPLDIkUmeZKN2_ZWbQ.png)

You can read more about TFIDF  here: http://www.tfidf.com/, https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf

Now, let's implement tfidf first without libraries and then with them!


In [10]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In this section we will work with three documents:

In [11]:
doc1 = 'When does summer begin , When'
doc2 = 'The rain soothes my soul'
doc3 = 'The winter is lovely but long'


Our corpus here is three documents: doc1, doc2, and doc3

In [12]:
#split each of the three documents into tokens. Store the tokens from doc1 in a variable bowDOC1
#Store the tokens from doc2 in a variable bowDOC2. Store the tokens from doc3 in a variable bowDOC3
#your code here
bowDOC1 = doc1.split(' ')
bowDOC2 = doc2.split(' ')
bowDOC3 = doc3.split(' ')
print(bowDOC1)
print(bowDOC2)
print(bowDOC3)

['When', 'does', 'summer', 'begin', ',', 'When']
['The', 'rain', 'soothes', 'my', 'soul']
['The', 'winter', 'is', 'lovely', 'but', 'long']


In [13]:
#Remember the vocabulary is the set of unique words from all the documents
#find the vocabulary by finding the unique words of bowDOC1, bowDOC2, and bowDOC3
#your code here
vocabulary = set(bowDOC1).union(set(bowDOC2)).union(set(bowDOC3))
vocabulary

{',',
 'The',
 'When',
 'begin',
 'but',
 'does',
 'is',
 'long',
 'lovely',
 'my',
 'rain',
 'soothes',
 'soul',
 'summer',
 'winter'}

Now we have the vocabulary: the set of unique words. We need to convert our documents into vectors so that we can pass these vectors on to our machines for any sort of computation.

In [14]:
#finding the bow vector for doc1
vectorA = dict.fromkeys(vocabulary, 0)
print(vectorA)
for word in bowDOC1:
    vectorA[word] += 1
print(vectorA)

{'soothes': 0, ',': 0, 'long': 0, 'The': 0, 'soul': 0, 'rain': 0, 'winter': 0, 'begin': 0, 'my': 0, 'When': 0, 'is': 0, 'lovely': 0, 'but': 0, 'does': 0, 'summer': 0}
{'soothes': 0, ',': 1, 'long': 0, 'The': 0, 'soul': 0, 'rain': 0, 'winter': 0, 'begin': 1, 'my': 0, 'When': 2, 'is': 0, 'lovely': 0, 'but': 0, 'does': 1, 'summer': 1}


In [15]:
#find the bow vector for doc2
#your code here
vectorB = dict.fromkeys(vocabulary, 0)
print(vectorB)
for word in bowDOC2:
    vectorB[word] += 1
print(vectorB)

{'soothes': 0, ',': 0, 'long': 0, 'The': 0, 'soul': 0, 'rain': 0, 'winter': 0, 'begin': 0, 'my': 0, 'When': 0, 'is': 0, 'lovely': 0, 'but': 0, 'does': 0, 'summer': 0}
{'soothes': 1, ',': 0, 'long': 0, 'The': 1, 'soul': 1, 'rain': 1, 'winter': 0, 'begin': 0, 'my': 1, 'When': 0, 'is': 0, 'lovely': 0, 'but': 0, 'does': 0, 'summer': 0}


In [16]:
#find the bow vector for doc3
#your code here
vectorC = dict.fromkeys(vocabulary, 0)
print(vectorC)
for word in bowDOC3:
    vectorC[word] += 1
print(vectorC)

{'soothes': 0, ',': 0, 'long': 0, 'The': 0, 'soul': 0, 'rain': 0, 'winter': 0, 'begin': 0, 'my': 0, 'When': 0, 'is': 0, 'lovely': 0, 'but': 0, 'does': 0, 'summer': 0}
{'soothes': 0, ',': 0, 'long': 1, 'The': 1, 'soul': 0, 'rain': 0, 'winter': 1, 'begin': 0, 'my': 0, 'When': 0, 'is': 1, 'lovely': 1, 'but': 1, 'does': 0, 'summer': 0}


TFIDF consists of 2 steps: finding the TF and the IDF. The final result is just the product of TF and IDF. 
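Written out, the definitions implemented in the cells that follow are shown here. The symbols are chosen for illustration: $f_{t,d}$ is the count of term $t$ in document $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of documents, and $\mathrm{df}_t$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

Other variants of TF and IDF exist; this is the plain, unsmoothed form used in the manual implementation of this lab.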

In [17]:
#this is a function for computing term frequency
def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)#finds the length of list bagOfWords
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount) #finding term frequency
    return tfDict

In [18]:
#term frequency for doc1
tfA = computeTF(vectorA, bowDOC1)
#find the term frequencies for doc2 and doc3, store them in variables tfB and tfC
tfB = computeTF(vectorB, bowDOC2)
tfC = computeTF(vectorC, bowDOC3)

In [19]:
#Just checking what the tf for document B looks like
tfB

{'soothes': 0.2,
 ',': 0.0,
 'long': 0.0,
 'The': 0.2,
 'soul': 0.2,
 'rain': 0.2,
 'winter': 0.0,
 'begin': 0.0,
 'my': 0.2,
 'When': 0.0,
 'is': 0.0,
 'lovely': 0.0,
 'but': 0.0,
 'does': 0.0,
 'summer': 0.0}

In [20]:
#function for computing inverse document frequency
def computeIDF(documents):
    import math
    N = len(documents)
    
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
    
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict

In [21]:
idfs = computeIDF([vectorA, vectorB, vectorC])
idfs

{'soothes': 1.0986122886681098,
 ',': 1.0986122886681098,
 'long': 1.0986122886681098,
 'The': 0.4054651081081644,
 'soul': 1.0986122886681098,
 'rain': 1.0986122886681098,
 'winter': 1.0986122886681098,
 'begin': 1.0986122886681098,
 'my': 1.0986122886681098,
 'When': 1.0986122886681098,
 'is': 1.0986122886681098,
 'lovely': 1.0986122886681098,
 'but': 1.0986122886681098,
 'does': 1.0986122886681098,
 'summer': 1.0986122886681098}
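As a sanity check, two of these values can be recomputed by hand. The sketch below only uses the numbers already present in this section (three documents, "The" in two of them, "rain" in one, and five tokens in doc2); the variable names are chosen for illustration.

```python
import math

N = 3          # documents in the corpus
df_the = 2     # "The" appears in doc2 and doc3
df_rain = 1    # "rain" appears only in doc2

idf_the = math.log(N / df_the)     # ln(3/2)
idf_rain = math.log(N / df_rain)   # ln(3)

# doc2 has 5 tokens, each occurring once, so tf = 1/5 for both words
tf = 1 / 5

print(round(idf_the, 6), round(idf_rain, 6))            # 0.405465 1.098612
print(round(tf * idf_the, 6), round(tf * idf_rain, 6))  # 0.081093 0.219722
```

A word that appears in more documents gets a lower IDF, so "The" ends up with a smaller TF-IDF weight in doc2 than "rain" even though both occur once.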

In [22]:
#tfidf is just tf*idf
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [23]:
#find the tfidf for 3 documents and represent them in a data frame
#your code here
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
tfidfC = computeTFIDF(tfC, idfs)

df = pd.DataFrame([tfidfA, tfidfB, tfidfC])

In [24]:
df

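| | soothes | , | long | The | soul | rain | winter | begin | my | When | is | lovely | but | does | summer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.183102 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.183102 | 0.0 | 0.366204 | 0.0 | 0.0 | 0.0 | 0.183102 | 0.183102 |
| 1 | 0.219722 | 0.0 | 0.0 | 0.081093 | 0.219722 | 0.219722 | 0.0 | 0.0 | 0.219722 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.183102 | 0.067578 | 0.0 | 0.0 | 0.183102 | 0.0 | 0.0 | 0.0 | 0.183102 | 0.183102 | 0.183102 | 0.0 | 0.0 |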


Now, let's find tfidf using scikit-learn.

In [25]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# define tf-idf vectorizer
vectorizer = TfidfVectorizer()

# fit documents into vectorizer
vectors = vectorizer.fit_transform([doc1, doc2, doc3])

# get vocabulary
feature_names = vectorizer.get_feature_names_out()


# convert to readable format
dense = vectors.todense()
denselist = dense.tolist()

# print the tf-idf vectors
df = pd.DataFrame(denselist, columns=feature_names)
df


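| | begin | but | does | is | long | lovely | my | rain | soothes | soul | summer | the | when | winter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.377964 | 0.0 | 0.377964 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.377964 | 0.0 | 0.755929 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.467351 | 0.467351 | 0.467351 | 0.467351 | 0.0 | 0.355432 | 0.0 | 0.0 |
| 2 | 0.0 | 0.423394 | 0.0 | 0.423394 | 0.423394 | 0.423394 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.322002 | 0.0 | 0.423394 |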


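The values differ from the manual results because scikit-learn's `TfidfVectorizer` makes different default choices: it lowercases the text and drops single-character tokens such as the lone comma, uses raw term counts for TF, computes a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and scales each document vector to unit Euclidean (L2) length.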
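One way to see where the sklearn numbers come from is to redo the first row by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, lowercasing, tokens of two or more word characters). The token counts for doc1 are written out manually rather than produced by sklearn.

```python
import math

# Tokens of doc1 as the default tokenizer would produce them (assumption:
# lowercased, and the lone "," dropped because tokens need 2+ word characters)
counts = {"when": 2, "does": 1, "summer": 1, "begin": 1}

n_docs = 3
df = 1   # each of these words occurs only in doc1

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = math.log((1 + n_docs) / (1 + df)) + 1

raw = {w: c * idf for w, c in counts.items()}

# L2-normalise the row so that its Euclidean length is 1
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

print({w: round(v, 6) for w, v in tfidf.items()})
# {'when': 0.755929, 'does': 0.377964, 'summer': 0.377964, 'begin': 0.377964}
```

These match row 0 of the sklearn table, which suggests the difference from the manual version comes from tokenization, smoothing, and normalization rather than from a different idea of TF-IDF.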

In [26]:
doc1 = 'Cricket is a bat-and-ball game played between two teams of eleven players'
doc2 = 'The batting side scores runs by striking the ball bowled at the wicket with the bat '
doc3 = 'Means of dismissal include being bowled, when the ball hits the stumps and dislodges the bails'
doc4 = 'Forms of cricket range from Twenty20, with each team batting for a single innings of 20 overs, to Test matches played over five days'
#find the tfidf representation for each of these documents
#your code here
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([doc1, doc2, doc3, doc4])
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
df

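| | 20 | and | at | bails | ball | bat | batting | being | between | bowled | ... | team | teams | test | the | to | twenty20 | two | when | wicket | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.245646 | 0.0 | 0.0 | 0.198871 | 0.245646 | 0.0 | 0.0 | 0.31157 | 0.0 | ... | 0.0 | 0.31157 | 0.0 | 0.0 | 0.0 | 0.0 | 0.31157 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.224511 | 0.0 | 0.143302 | 0.177007 | 0.177007 | 0.0 | 0.0 | 0.177007 | ... | 0.0 | 0.0 | 0.0 | 0.708028 | 0.0 | 0.0 | 0.0 | 0.0 | 0.224511 | 0.177007 |
| 2 | 0.0 | 0.193204 | 0.0 | 0.245054 | 0.156415 | 0.0 | 0.0 | 0.245054 | 0.0 | 0.193204 | ... | 0.0 | 0.0 | 0.0 | 0.579611 | 0.0 | 0.0 | 0.0 | 0.245054 | 0.0 | 0.0 |
| 3 | 0.217618 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.171572 | 0.0 | 0.0 | 0.0 | ... | 0.217618 | 0.0 | 0.217618 | 0.0 | 0.217618 | 0.217618 | 0.0 | 0.0 | 0.0 | 0.171572 |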


# Document Similarity

- Text similarity measures how close two text documents are to each other

- The similarity can be in terms of both context and meaning

- To compute similarity, text first needs to be converted into vectors

- Various text similarity metrics exist, such as cosine similarity and Jaccard similarity.


# Cosine Similarity

- Cosine similarity is a metric used to measure how similar two documents are
- Mathematically, it measures the cosine of the angle between two vectors in a multi-dimensional space
- The smaller the angle, the higher the cosine similarity

![Image](https://datascience-enthusiast.com/figures/cosine_sim.png)

In [27]:
doc_1 = "Brazil won the Football world cup five times" 
doc_2 = "Italy comes after Brazil in that regard" 

#find the bag of word vector representation
CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc_1, doc_2])
vectorA = Count_data.toarray()[0]
vectorB = Count_data.toarray()[1]

We will now calculate the cosine similarity between the documents using sklearn.

In [28]:
from sklearn.metrics.pairwise import cosine_similarity
#use the cosine_similarity function to calculate cosine similarity like this
cosine_similarity_matrix = cosine_similarity(Count_data)
pd.DataFrame(cosine_similarity_matrix,['doc_1','doc_2'])

| | 0 | 1 |
|---|---|---|
| doc_1 | 1.0 | 0.133631 |
| doc_2 | 0.133631 | 1.0 |
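The off-diagonal value can be reproduced without sklearn. Cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below is a plain-Python version; the function name `cosine` and the whitespace tokenization are assumptions for illustration (they give the same counts as `CountVectorizer` for these two sentences).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_1 = "Brazil won the Football world cup five times".lower().split()
doc_2 = "Italy comes after Brazil in that regard".lower().split()

# Shared vocabulary, then one count vector per document
vocab = sorted(set(doc_1) | set(doc_2))
vec_1 = [doc_1.count(w) for w in vocab]
vec_2 = [doc_2.count(w) for w in vocab]

print(round(cosine(vec_1, vec_2), 6))   # 0.133631
```

The only shared word is "brazil", so the dot product is 1, and the lengths are sqrt(8) and sqrt(7), giving 1 / sqrt(56).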


In [29]:
doc_1 = "Sweden is in Scandinavia" 
doc_2 = "Denmark is a neighbor of Sweden" 
doc_3 = "Norway and Denmark are close by"
#find the pairwise cosine similarity between the documents


In [30]:
#your code here
CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc_1, doc_2])
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity_matrix = cosine_similarity(Count_data)
pd.DataFrame(cosine_similarity_matrix,['doc_1','doc_2'])

| | 0 | 1 |
|---|---|---|
| doc_1 | 1.0 | 0.447214 |
| doc_2 | 0.447214 | 1.0 |


In [31]:
#your code here
CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc_2, doc_3])
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity_matrix = cosine_similarity(Count_data)
pd.DataFrame(cosine_similarity_matrix,['doc_2','doc_3'])

| | 0 | 1 |
|---|---|---|
| doc_2 | 1.0 | 0.182574 |
| doc_3 | 0.182574 | 1.0 |


In [32]:
#your code here
CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc_1, doc_3])
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity_matrix = cosine_similarity(Count_data)
pd.DataFrame(cosine_similarity_matrix,['doc_1','doc_3'])

| | 0 | 1 |
|---|---|---|
| doc_1 | 1.0 | 0.0 |
| doc_3 | 0.0 | 1.0 |
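The three pairwise comparisons above can also be obtained in a single call, since `cosine_similarity` returns the full matrix for all rows it is given. This is a sketch of that alternative, not a replacement for the exercise; the dictionary `docs` and the labelled DataFrame are choices made here for readability.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "doc_1": "Sweden is in Scandinavia",
    "doc_2": "Denmark is a neighbor of Sweden",
    "doc_3": "Norway and Denmark are close by",
}
names = list(docs)

# Fit one vocabulary over all three documents, then compare every pair at once
counts = CountVectorizer().fit_transform(docs.values())
sim = pd.DataFrame(cosine_similarity(counts), index=names, columns=names)

print(sim.round(6))
```

Fitting on all three documents adds columns that are zero for some documents, which does not change any pairwise cosine value, so the entries agree with the three separate tables.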


# Jaccard Similarity

- Jaccard similarity is also known as the Jaccard index or Intersection over Union
- It is used to determine the similarity between two text documents in terms of the words they share
- The similarity is the number of words common to both documents divided by the total number of unique words across both

![Image](https://miro.medium.com/max/744/1*XiLRKr_Bo-VdgqVI-SvSQg.png)

More here: https://en.wikipedia.org/wiki/Jaccard_index

In [33]:
# Let's have 2 documents

doc1 = 'A is the brother of B'
doc2 = 'B is the friend of C who is not a brother of A'

In [34]:
#convert documents to lower case - basic preprocessing!
doc1 = doc1.lower()
doc2 = doc2.lower()

In [35]:
#split both the documents into tokens. Make sure that you have no duplicates. You might want to use sets for this purpose!?
#your code here
doc1 = set(doc1.split())
doc2 = set(doc2.split())

In [36]:
#find common words from the 2 documents
#your code here
intersection = doc1.intersection(doc2)

In [37]:
#find the vocabulary, total unique words in both documents
#your code here
union = doc1.union(doc2)

In [38]:
#find jaccard similarity
#your code here
float(len(intersection)) / len(union)

0.6
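The cells above can be collected into one reusable function, J(A, B) = |A ∩ B| / |A ∪ B|. This is a minimal sketch; the name `jaccard_similarity`, the whitespace tokenization, and the choice to return 0.0 for two empty texts are assumptions made here.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard index of the lowercase word sets of two strings (hypothetical helper)."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    if not union:   # both texts empty: define the similarity as 0.0 here
        return 0.0
    return len(set_a & set_b) / len(union)

score = jaccard_similarity(
    "A is the brother of B",
    "B is the friend of C who is not a brother of A",
)
print(score)   # 0.6
```

Because sets discard word counts and word order, repeated words such as "is" and "of" in the second sentence have no effect on the score.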


# Submission Instructions
    1. Run all code cells in the notebook

    2. Answer all questions in markdown cells beneath each section

    3. Export the notebook as HTML:

        File → Download as → HTML (.html)

    4. Submit the HTML file in Canvas by the posted deadline

## Setup
### Run the following cell before starting the assignment:

In [39]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import KeyedVectors
import gensim.downloader as api

# For better plot display in HTML export
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "svg"

print("Libraries loaded successfully!")

Libraries loaded successfully!


## Part 1

In [40]:
documents = [
    "Hindi cinema, often known as Bollywood, is the Indian Hindi-language film industry based in Mumbai.",
    "Films adapted from comic books have had plenty of success, whether about superheroes like Batman and Superman or geared toward kids.",
    "Every now and then a movie comes along from a suspect studio that becomes a critical darling despite low expectations.",
    "The Year 2000 problem, also known as Y2K, refers to computer bugs related to date formatting in the new millennium."
]

In [41]:
vectorizer = TfidfVectorizer(lowercase=True)#, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

similarity_matrix = cosine_similarity(tfidf_matrix)

df_sim = pd.DataFrame(
    similarity_matrix.round(3),
    index=[f"Doc {i+1}" for i in range(4)],
    columns=[f"Doc {i+1}" for i in range(4)]
)

df_sim

| | Doc 1 | Doc 2 | Doc 3 | Doc 4 |
|---|---|---|---|---|
| Doc 1 | 1.0 | 0.0 | 0.0 | 0.166 |
| Doc 2 | 0.0 | 1.0 | 0.069 | 0.0 |
| Doc 3 | 0.0 | 0.069 | 1.0 | 0.0 |
| Doc 4 | 0.166 | 0.0 | 0.0 | 1.0 |


## Answer the following in markdown cells:
    1. Which pair of documents has the highest similarity score? Does this result make intuitive sense? Explain why or why not.
    
    2. Why is the similarity between Document 1 and Document 4 close to zero?
    
    3. Re-run the analysis without removing stop words (stop_words=None). How does the similarity matrix change, and why?

#### 1. 
In the table produced by the first run, with stop words removed, only Documents 1 and 4 show some degree of similarity; every other pair scores zero apart from the diagonal. At face value this doesn't make much intuitive sense, because the subject matter of the two documents is completely different. To understand why it happens, consider what cosine similarity is comparing here. It operates on the surface form of words rather than on their meaning (semantics), so documents that share word forms, such as "known" in Documents 1 and 4, get a nonzero similarity even when their meanings have nothing in common.

What was passed to cosine similarity here wasn't the raw count vectorization of the corpus but the output of the TfidfVectorizer. This means the vectors were already weighted so that words common across the corpus count for less and rarer words count for more. Cosine similarity is therefore used here to find overlap in the surface form of the rarer words within the corpus.

#### 2.
Although Documents 1 and 4 are the only two that are similar, their score is close to zero. I take this to mean that the rare words within the corpus overlap only slightly between the two documents.

#### 3. 
With stop words kept, the cosine similarity score goes up for Documents 1 and 4, and Documents 2 and 3 now get a score as well. This means a small number of stop words are shared within those pairs ("as", "the", and "in" for Documents 1 and 4; "from" and "and" for Documents 2 and 3), and these now count toward the overlap.

In [None]:
# Download a smaller, fast model (first run may take 1–2 minutes)
# Options include:
# 'glove-wiki-gigaword-50'
# 'word2vec-google-news-300'
# 'fasttext-wiki-news-subwords-300'

model = api.load("glove-wiki-gigaword-100")
print("Model loaded! Vocabulary size:", len(model))

words = ['paris', 'italy', 'man', 'woman', 'car', 'bike', 'apple', 'banana']

for word in words:
    if word in model:
        similar = model.most_similar(word, topn=3)
        print(f"\n{word.upper()}:")
        for w, score in similar:
            print(f"  {w} ({score:.3f})")

def document_vector(doc):
    words = [w.lower() for w in doc.split() if w.lower() in model]
    if len(words) == 0:
        return np.zeros(model.vector_size)
    return np.mean(model[words], axis=0)

doc_vectors = [document_vector(doc) for doc in documents]
embedding_similarity = cosine_similarity(doc_vectors)

df_emb_sim = pd.DataFrame(
    embedding_similarity.round(3),
    index=[f"Doc {i+1}" for i in range(4)],
    columns=[f"Doc {i+1}" for i in range(4)]
)

df_emb_sim
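One detail of `document_vector` worth noticing is that `doc.split()` keeps punctuation attached to words, so tokens such as `cinema,` are not found in the model and are silently dropped from the average. The sketch below illustrates the effect with a small stand-in vocabulary (`toy_vocab` is hypothetical and replaces the real model so the example runs without a download). The regex tokenizer is one possible alternative, not the lab's method.

```python
import re

# Hypothetical stand-in for the embedding model's vocabulary
toy_vocab = {"hindi", "cinema", "often", "known", "as", "bollywood"}

doc = "Hindi cinema, often known as Bollywood, is the Indian Hindi-language film industry"

# Whitespace splitting keeps punctuation attached: "cinema," and "bollywood," miss the vocabulary
split_tokens = [w.lower() for w in doc.split() if w.lower() in toy_vocab]

# A simple regex tokenizer (an assumption) separates words from punctuation and hyphens
regex_tokens = [w for w in re.findall(r"[a-z0-9]+", doc.lower()) if w in toy_vocab]

print(split_tokens)   # ['hindi', 'often', 'known', 'as']
print(regex_tokens)   # ['hindi', 'cinema', 'often', 'known', 'as', 'bollywood', 'hindi']
```

With the real model, the same change would let more words contribute to each document's mean vector, which can shift the similarity scores.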

## Part 2
### Load a Pre-Trained Embedding Model

### Word-Level Similarity

### Document-Level Similarity Using Embeddings

### Answer the following:
        1. What pattern do you notice between king → queen and man → woman?

        2. Try your own analogy (for example: "paris" - "france" + "italy"). Does the result return something close to "rome"?

        3. Compare this similarity matrix with the TF-IDF matrix from Part 1. Which method better captures semantic similarity versus keyword overlap?

        4. Why might Document 2 and Document 3 show higher similarity with embeddings than with TF-IDF?
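For question 2, the analogy is vector arithmetic: take vec("paris") - vec("france") + vec("italy") and look for the nearest word. With the loaded gensim model this is usually written as `model.most_similar(positive=["paris", "italy"], negative=["france"], topn=3)`. The sketch below shows the same arithmetic on hand-made toy vectors, which are entirely hypothetical and chosen so the example runs without downloading a model.

```python
import math

# Hypothetical 4-D vectors: [france-ness, italy-ness, capital-ness, country-ness]
toy = {
    "france": [1.0, 0.0, 0.0, 1.0],
    "paris":  [1.0, 0.0, 1.0, 0.0],
    "italy":  [0.0, 1.0, 0.0, 1.0],
    "rome":   [0.0, 1.0, 1.0, 0.0],
    "madrid": [0.0, 0.0, 1.0, 0.0],   # distractor: a capital of neither country
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("paris", "france", "italy", toy))   # rome
```

Real embeddings are noisier than these toy vectors, so the expected word may appear among the top few results rather than in first place.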

### 1.
In the similarity analysis of both "queen" and "king", the commonly expected counterpart (their significant other) is not the closest match; instead, the nearest terms are the ranks they held before, "prince" and "princess". The similarity analysis of "man" shows a difference from these terms: rather than returning a term for what a man once was, its nearest term is "women".

### 2. 
Changing the king and queen terms to "paris" and "italy" doesn't return "rome" as a similar term for either of them; rather, I get other terms, such as "france" for "paris" and "spain" for "italy".


### 3.
Comparing the TF-IDF matrix with the model used here, the embedding model's output is more human-readable because it presents the most similar terms directly to the user. In Part 1, by contrast, finding which terms are more similar to each other requires manual inspection. In terms of keyword overlap, these sorts of models provide a graded degree of similarity between terms rather than a binary similarity.


### 4.
I'm not sure which Documents 2 and 3 the question is referring to, but it can likely be read as a general question about why embeddings might find higher similarities than TF-IDF.


## Part 3: Reflection and Real-World Application (30 points)

Recommended reading (optional but helpful):

https://intellica-ai.medium.com/comparison-of-different-word-embeddings-on-text-similarity-a-use-case-in-nlp-e83e08469c1c

### Answer each question in 3–5 sentences:

    1. Provide two practical business examples where accurate text similarity improves workforce productivity (for example: customer support, legal review, content recommendation).

    2. Why do word embeddings generally outperform TF-IDF for semantic tasks, while TF-IDF may still be preferred for speed or interpretability?

    3. If you were building a plagiarism detection system for student essays, which approach would you favor and why?

    4. Based on this lab, which embedding model would you try next (Word2Vec, FastText, or BERT), and for what reason?

## 1.
Going with the example given for legal review, I think it's common to imagine NLP methods as a sort of replacement for the legal reviewer, but putting these tools in the hands of such a reviewer can instead set them apart through their particular ability. A reviewer might have a trend within their set of documents that they can home in on, possibly by using a basic suite that can perform TF-IDF, thereby finding where important terms are used.

## 2.
The reason TF-IDF performs slightly worse than word embedding models on semantic tasks comes down to at least two things. The first is that TF-IDF bases similarity on lexical form rather than on semantics: terms that share the same form are what TF-IDF relates, whereas terms that share a similar meaning within sentences are what embedding models relate. The second is that word embedding models map into a lower-dimensional space, which gives denser vectors for the model to learn similarities from; TF-IDF, conversely, maps into a very high-dimensional space, which can put greater distance between vectors.


## 3.
To me, plagiarism means someone using similar ideas in a similar way; the only reason I can see for this is that rewriting something in another way takes enough effort that one ends up learning the material anyway. This means I would use a word embedding model to find the semantic similarity of sentences and paragraphs, perhaps looking into heuristics for hyperparameter tuning.


## 4.
Trying out FastText would be interesting because of its approach of representing terms as bags of character n-grams rather than as atomic units. This may show whether that style of pre-processing and embedding is a useful heuristic for similarity matching.
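As an illustration of what "bags of character n-grams" means, the sketch below extracts the character n-grams of a word after wrapping it in the boundary markers `<` and `>`, the convention described for fastText. The helper name `char_ngrams` is an assumption, and the sketch covers only the n-gram extraction step, not the training of embeddings.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams

print(char_ngrams("where", 3, 3))   # ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share many n-grams, which is how subword models relate them
shared = set(char_ngrams("similarity")) & set(char_ngrams("similarities"))
print(len(shared) > 0)              # True
```

Because a word's vector is built from its n-grams, a model of this kind can also produce a vector for a word it never saw during training.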
