# Lab - Data Preprocessing

## Lab Summary:
In this lab, we will learn NLP data preprocessing techniques, including Bag of Words, TF-IDF, and Document Similarity.

## Learning Outcomes:
Upon completion of this lab, the student can:
<ul>
    <li>Compare cosine, Jaccard, and Euclidean similarity </li>
    <li>Use Python to apply preprocessing techniques, including Bag of Words, TF-IDF, and Document Similarity</li>
    <li>Use Python to test the similarity of two documents</li>
</ul>

## Import Packages and Classes (Initial)
In this lab we will use these libraries:
<ol>
    <li> NLTK </li>
    <li> Pandas </li>
    <li> Matplotlib </li>
    <li> Gensim </li>
    <li> scikit-learn </li>
</ol>

# Bag of Words (BoW)

<b>Bag of Words</b> is a text modelling technique. Bag of words creates a vector, using the count of each word within a document.

It is possible to implement BoW without a pre-built Python library. We will walk through this approach and apply it to two short sentences. Then, we'll do the same thing using the scikit-learn Python library.


In [1]:
! pip install nltk pandas matplotlib gensim scikit-learn


In [2]:
def unique(sequence):
    '''Return the vocabulary: the set of unique words in sequence.'''
    return set(sequence)

def vectorize(tokens):
    '''Return the bag-of-words vector for tokens.
    Relies on the global list filtered_vocab, which is built in a later cell.'''
    vector=[]
    for w in filtered_vocab:
        vector.append(tokens.count(w))
        #tokens.count(w) tells us the count of each word in the filtered_vocab in tokens
        #filtered_vocab contains list of unique words after filtering stopwords and punctuation
    return vector

In [3]:
# Define some stopwords and special characters
# We wish to ignore these in our BoW algorithm.
stopwords=["to","is","a","the","of"]
special_char=[",",":"," ",";",".","?"]

# Some sentences
string1="Paris is the capital of France"
string2="Milan is the fashion capital of the world"

# Lowercasing is one of the most important steps in text preprocessing. 
# A particular word whether in lower or upper case means the same thing.
# Convert sentences to lowercase.
string1=string1.lower()
string2=string2.lower()

# Split the sentences into tokens
tokens1=string1.split()
tokens2=string2.split()

# Print the tokens and visually inspect them.
print(tokens1)
print(tokens2)

['paris', 'is', 'the', 'capital', 'of', 'france']
['milan', 'is', 'the', 'fashion', 'capital', 'of', 'the', 'world']


Now, we will find the <i>vocabulary</i> - the set of unique words in the corpus. 


In [4]:
# Apply our function "unique" as created previously:
vocab=unique(tokens1+tokens2)
vocab

{'capital', 'fashion', 'france', 'is', 'milan', 'of', 'paris', 'the', 'world'}

So our vocabulary is the set of unique words in the corpus (across all documents).

We also need to remove stopwords and special characters, as they add little or no meaning.

In [5]:
filtered_vocab=[]
# filtered_vocab should contain the words from vocab that are not stopwords or special_char
# It should contain all the meaningful words in all of the text items.
for w in vocab: 
    if w not in stopwords and w not in special_char: 
        filtered_vocab.append(w)
print(filtered_vocab)

['paris', 'capital', 'milan', 'fashion', 'france', 'world']


#### Vectorization

We have identified unique words that have meaning.

Now, we will apply a technique called <i>vectorization</i>. This substitutes text with numbers.

Vectorization is a critical step in NLP, because computers process numeric values.

We will use our <code>vectorize</code> function that we created above.

In [6]:
# Remind ourselves what those unique, meaningful words were.
print(filtered_vocab)

# Convert sentences into vectors
vector1=vectorize(tokens1)
vector2=vectorize(tokens2)

# Print the vectorized tokens.
print(vector1)
print(vector2)

['paris', 'capital', 'milan', 'fashion', 'france', 'world']
[1, 1, 0, 0, 1, 0]
[0, 1, 1, 1, 0, 1]
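
The `vectorize` function above reads `filtered_vocab` from the global scope, so it only works once that list exists. A minimal sketch of the same idea with the vocabulary passed in explicitly is shown here. The name `bag_of_words` and the sorted vocabulary order are choices made for this illustration and are not part of the lab code.

```python
def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in tokens, in vocabulary order."""
    return [tokens.count(word) for word in vocabulary]

stopwords = ["to", "is", "a", "the", "of"]
tokens1 = "paris is the capital of france".split()
tokens2 = "milan is the fashion capital of the world".split()

# Sorting gives a stable column order; a plain set has no guaranteed order.
vocabulary = sorted(set(tokens1 + tokens2) - set(stopwords))

print(vocabulary)
print(bag_of_words(tokens1, vocabulary))
print(bag_of_words(tokens2, vocabulary))
```

Because the vocabulary is sorted here, the columns appear in alphabetical order rather than in the set order printed above, but each word receives the same count.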


# Bag of Words with scikit-learn

Fortunately, we do not have to create a new set of functions every time we wish to use the BoW model.

Scikit-Learn has built-in modules that do this work for us.  

Below, we'll put these libraries into practice.

In [7]:
# Store your text into a variable.
text = "Sweden is part of the geographical area of Fennoscandia. \
The climate is in general mild for its northerly latitude due to \
significant maritime influence. In spite of the high latitude, \
Sweden often has warm continental summers, being located in \
between the North Atlantic, the Baltic Sea, and vast Russia. \
The general climate and environment vary significantly from the \
south and north due to the vast latitudal difference, and much \
of Sweden has reliably cold and snowy winters. Southern Sweden \
is predominantly agricultural, while the north is heavily forested \
and includes a portion of the Scandinavian Mountains."

# Split the text into sentences and store them in a list, using the period as the sentence separator.
text = text.split('.')
text

['Sweden is part of the geographical area of Fennoscandia',
 ' The climate is in general mild for its northerly latitude due to significant maritime influence',
 ' In spite of the high latitude, Sweden often has warm continental summers, being located in between the North Atlantic, the Baltic Sea, and vast Russia',
 ' The general climate and environment vary significantly from the south and north due to the vast latitudal difference, and much of Sweden has reliably cold and snowy winters',
 ' Southern Sweden is predominantly agricultural, while the north is heavily forested and includes a portion of the Scandinavian Mountains',
 '']
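
Because the text ends with a period, `split('.')` leaves an empty string as the last element, and the other sentences keep their leading spaces. One way to tidy this, sketched here with a shortened sample string and a hypothetical variable name `sentences`, is to strip each piece and drop the empty ones:

```python
sample = "Sweden is part of Fennoscandia. The climate is mild. "

# Strip surrounding whitespace and keep only the non-empty pieces.
sentences = [piece.strip() for piece in sample.split('.') if piece.strip()]
print(sentences)
```

The cells that follow keep the original list, so the empty string shows up as an all-zero row in the resulting dataframe.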

#### sklearn CountVectorizer

<code>CountVectorizer</code> provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

<b>Steps to use CountVectorizer:</b>
- Create an instance of the CountVectorizer class.
- Call the fit() function in order to learn a vocabulary from one or more documents.
- Call the transform() function on one or more documents as needed to encode each as a vector. The fit_transform() function, used below, combines both steps in one call.

Reference: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn

In [8]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

CountVec = CountVectorizer(ngram_range=(1,1), # to use bigrams ngram_range=(2,2)
                           stop_words='english')
Count_data = CountVec.fit_transform(sentence for sentence in text)
 
cv_dataframe=pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names_out())
print(cv_dataframe)

   agricultural  area  atlantic  baltic  climate  cold  continental  \
0             0     1         0       0        0     0            0   
1             0     0         0       0        1     0            0   
2             0     0         1       1        0     0            1   
3             0     0         0       0        1     1            0   
4             1     0         0       0        0     0            0   
5             0     0         0       0        0     0            0   

   difference  environment  fennoscandia  ...  snowy  south  southern  spite  \
0           0            0             1  ...      0      0         0      0   
1           0            0             0  ...      0      0         0      0   
2           0            0             0  ...      0      0         0      1   
3           1            1             0  ...      1      1         0      0   
4           0            0             0  ...      0      0         1      0   
5           0         

# Practice

Following the steps in the previous example, create a dataframe of the vector of bi-grams in the following text:

"The Bag of Words (BoW) model is a foundational technique in natural language processing that converts text into numerical feature vectors by counting word occurrences. In Python, libraries like Scikit-learn provide tools such as CountVectorizer to transform a corpus of text into a sparse matrix of word counts. This representation is commonly used as input for machine learning models like Naive Bayes, logistic regression, or support vector machines to perform tasks such as text classification or sentiment analysis. While BoW ignores grammar and word order, it captures essential frequency information that can be highly effective in many NLP applications."

In [24]:
# Your Code Here:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

bow_text = "The Bag of Words (BoW) model is a foundational technique in natural language processing that converts text into numerical feature vectors by counting word occurrences. In Python, libraries like Scikit-learn provide tools such as CountVectorizer to transform a corpus of text into a sparse matrix of word counts. This representation is commonly used as input for machine learning models like Naive Bayes, logistic regression, or support vector machines to perform tasks such as text classification or sentiment analysis. While BoW ignores grammar and word order, it captures essential frequency information that can be highly effective in many NLP applications."
bow_text = bow_text.split('.')

CountVec = CountVectorizer(ngram_range=(2,2), # (2,2) extracts bigrams only
                           stop_words='english')
Count_data = CountVec.fit_transform(sentence for sentence in bow_text)
pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names_out())

Unnamed: 0,bag words,bayes logistic,bow ignores,bow model,captures essential,classification sentiment,commonly used,converts text,corpus text,counting word,...,text sparse,tools countvectorizer,transform corpus,used input,vector machines,vectors counting,word counts,word occurrences,word order,words bow
0,1,0,0,1,0,0,0,1,0,1,...,0,0,0,0,0,1,0,1,0,1
1,0,0,0,0,0,0,0,0,1,0,...,1,1,1,0,0,0,1,0,0,0
2,0,1,0,0,0,1,1,0,0,0,...,0,0,0,1,1,0,0,0,0,0
3,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Term Frequency - Inverse Document Frequency (TF-IDF)

Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

As we did with Bag of Words, we'll first derive TF-IDF the long way in order to understand how it works.

After that, we'll use built-in functionality of scikit-learn.

In [10]:
# A haiku:
doc1 = 'When does summer begin, When'
doc2 = 'The rain soothes my soul'
doc3 = 'The winter is lovely but long'

In [11]:
# Split each of the three documents into tokens.
bowDOC1 = doc1.split(' ')
bowDOC2 = doc2.split(' ')
bowDOC3 = doc3.split(' ')
print(bowDOC1)
print(bowDOC2)
print(bowDOC3)

['When', 'does', 'summer', 'begin,', 'When']
['The', 'rain', 'soothes', 'my', 'soul']
['The', 'winter', 'is', 'lovely', 'but', 'long']


In [12]:
# Remember the vocabulary is the set of unique words from all the documents
# Find the vocabulary by finding the unique words of bowDOC1, bowDOC2, and bowDOC3
vocabulary = set(bowDOC1).union(set(bowDOC2)).union(set(bowDOC3))
vocabulary

{'The',
 'When',
 'begin,',
 'but',
 'does',
 'is',
 'long',
 'lovely',
 'my',
 'rain',
 'soothes',
 'soul',
 'summer',
 'winter'}

Now, we have the vocabulary. 

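Next, we build the word-count vector for each document by hand. We also import <code>TfidfVectorizer</code> now; it is used later in the scikit-learn section.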

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
# Find the vector for doc1

# Create and print a dictionary from each word in the vocabulary.
vectorA = dict.fromkeys(vocabulary, 0)
print(vectorA)

# Using a for-loop, update the vector with the count of words in the first document.
for word in bowDOC1:
    vectorA[word] += 1
print(vectorA)

{'is': 0, 'lovely': 0, 'rain': 0, 'does': 0, 'my': 0, 'The': 0, 'begin,': 0, 'but': 0, 'long': 0, 'winter': 0, 'soul': 0, 'soothes': 0, 'summer': 0, 'When': 0}
{'is': 0, 'lovely': 0, 'rain': 0, 'does': 1, 'my': 0, 'The': 0, 'begin,': 1, 'but': 0, 'long': 0, 'winter': 0, 'soul': 0, 'soothes': 0, 'summer': 1, 'When': 2}


# Practice - Vectorize doc2 and doc3.

Using the above example, vectorize doc2 and doc3.

In [15]:
# Find the vector for doc2
# Create and print a dictionary from each word in the vocabulary.
vectorB = dict.fromkeys(vocabulary, 0)
print(vectorB)

# Using a for-loop, update the vector with the count of words in the second document and print it.
for word in bowDOC2:
    vectorB[word] += 1
print(vectorB)

{'is': 0, 'lovely': 0, 'rain': 0, 'does': 0, 'my': 0, 'The': 0, 'begin,': 0, 'but': 0, 'long': 0, 'winter': 0, 'soul': 0, 'soothes': 0, 'summer': 0, 'When': 0}
{'is': 0, 'lovely': 0, 'rain': 1, 'does': 0, 'my': 1, 'The': 1, 'begin,': 0, 'but': 0, 'long': 0, 'winter': 0, 'soul': 1, 'soothes': 1, 'summer': 0, 'When': 0}


In [16]:
# Find the vector for doc3
# Create and print a dictionary from each word in the vocabulary.
vectorC = dict.fromkeys(vocabulary, 0)
print(vectorC)

# Using a for-loop, update the vector with the count of words in the third document and print it.
for word in bowDOC3:
    vectorC[word] += 1
print(vectorC)

{'is': 0, 'lovely': 0, 'rain': 0, 'does': 0, 'my': 0, 'The': 0, 'begin,': 0, 'but': 0, 'long': 0, 'winter': 0, 'soul': 0, 'soothes': 0, 'summer': 0, 'When': 0}
{'is': 1, 'lovely': 1, 'rain': 0, 'does': 0, 'my': 0, 'The': 1, 'begin,': 0, 'but': 1, 'long': 1, 'winter': 1, 'soul': 0, 'soothes': 0, 'summer': 0, 'When': 0}


# Final step: Produce the Term Frequency and Inverse Document Frequency

TF-IDF is computed in two steps: finding the TF and the IDF. The final result is the product of TF and IDF.
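
Written as formulas, the definitions implemented by the functions in this section are given below. The symbols are notation introduced here: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of documents, and $n_t$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A word that appears in every document has $n_t = N$, so its IDF, and therefore its TF-IDF, is zero under this definition.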

In [17]:
# This function computes term frequency
def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords) # Finds the length of list Bag of Words
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount) # Find term frequency
    return tfDict

In [18]:
# Term frequency for doc1
tfA = computeTF(vectorA, bowDOC1)
tfA

{'is': 0.0,
 'lovely': 0.0,
 'rain': 0.0,
 'does': 0.2,
 'my': 0.0,
 'The': 0.0,
 'begin,': 0.2,
 'but': 0.0,
 'long': 0.0,
 'winter': 0.0,
 'soul': 0.0,
 'soothes': 0.0,
 'summer': 0.2,
 'When': 0.4}

# Practice - Compute the Term Frequency for doc2 and doc3.

Using the example above, compute the Term Frequencies for doc2 and doc3.

Store the results of each computation in variables called tfB and tfC, respectively.

In [19]:
# Your Code Here.
# tfB:
tfB = computeTF(vectorB, bowDOC2)

# tfC:
tfC = computeTF(vectorC, bowDOC3)


In [20]:
# Check your work:
tfB

{'is': 0.0,
 'lovely': 0.0,
 'rain': 0.2,
 'does': 0.0,
 'my': 0.2,
 'The': 0.2,
 'begin,': 0.0,
 'but': 0.0,
 'long': 0.0,
 'winter': 0.0,
 'soul': 0.2,
 'soothes': 0.2,
 'summer': 0.0,
 'When': 0.0}

In [21]:
tfC

{'is': 0.16666666666666666,
 'lovely': 0.16666666666666666,
 'rain': 0.0,
 'does': 0.0,
 'my': 0.0,
 'The': 0.16666666666666666,
 'begin,': 0.0,
 'but': 0.16666666666666666,
 'long': 0.16666666666666666,
 'winter': 0.16666666666666666,
 'soul': 0.0,
 'soothes': 0.0,
 'summer': 0.0,
 'When': 0.0}

# Create the Inverse Document Frequency function.

In [22]:
# Function to compute inverse document frequency
def computeIDF(documents):
    import math
    N = len(documents)
    
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
    
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict

In [23]:
idfs = computeIDF([vectorA, vectorB, vectorC])
idfs

{'is': 1.0986122886681098,
 'lovely': 1.0986122886681098,
 'rain': 1.0986122886681098,
 'does': 1.0986122886681098,
 'my': 1.0986122886681098,
 'The': 0.4054651081081644,
 'begin,': 1.0986122886681098,
 'but': 1.0986122886681098,
 'long': 1.0986122886681098,
 'winter': 1.0986122886681098,
 'soul': 1.0986122886681098,
 'soothes': 1.0986122886681098,
 'summer': 1.0986122886681098,
 'When': 1.0986122886681098}

# Finally, calculate TF-IDF.

In [25]:
# TF-IDF is calculated by multiplying tf * idf for each word.
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [26]:
# Find the TF-IDF for 3 documents and represent them in a data frame
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
tfidfC = computeTFIDF(tfC, idfs)

df = pd.DataFrame([tfidfA, tfidfB, tfidfC])

In [27]:
df

Unnamed: 0,is,lovely,rain,does,my,The,"begin,",but,long,winter,soul,soothes,summer,When
0,0.0,0.0,0.0,0.219722,0.0,0.0,0.219722,0.0,0.0,0.0,0.0,0.0,0.219722,0.439445
1,0.0,0.0,0.219722,0.0,0.219722,0.081093,0.0,0.0,0.0,0.0,0.219722,0.219722,0.0,0.0
2,0.183102,0.183102,0.0,0.0,0.0,0.067578,0.0,0.183102,0.183102,0.183102,0.0,0.0,0.0,0.0
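
As a spot check on the table above, two entries can be recomputed directly from the definitions used by `computeTF` and `computeIDF`. The variable names below are introduced only for this check.

```python
import math

n_docs = 3

# 'When' occurs 2 times among the 5 tokens of doc1 and appears in 1 document.
tfidf_when = (2 / 5) * math.log(n_docs / 1)

# 'The' occurs once among the 5 tokens of doc2 and appears in 2 documents.
tfidf_the = (1 / 5) * math.log(n_docs / 2)

print(round(tfidf_when, 6))  # 0.439445
print(round(tfidf_the, 6))   # 0.081093
```

'The' scores lower than the other words in doc2 because it also appears in doc3, which reduces its IDF.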


# Use Scikit-Learn to Calculate TF-IDF.

In [29]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform them into TF-IDF vectors
vectors = vectorizer.fit_transform([doc1, doc2, doc3])
feature_names = vectorizer.get_feature_names_out()

# print vectors in a readable format
print(feature_names) # prints the vocabulary

array(['begin', 'but', 'does', 'is', 'long', 'lovely', 'my', 'rain',
       'soothes', 'soul', 'summer', 'the', 'when', 'winter'], dtype=object)

In [30]:
# Our vectors are "sparse".  We eventually wish to transform our TF-IDF vectors
# into a data frame, which will require us to transform them into "dense."
dense = vectors.todense()
denselist = dense.tolist()

# Print the tfidf vectors
df = pd.DataFrame(denselist, columns=feature_names)
print(df)

      begin       but      does        is      long    lovely        my  \
0  0.377964  0.000000  0.377964  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.467351   
2  0.000000  0.423394  0.000000  0.423394  0.423394  0.423394  0.000000   

       rain   soothes      soul    summer       the      when    winter  
0  0.000000  0.000000  0.000000  0.377964  0.000000  0.755929  0.000000  
1  0.467351  0.467351  0.467351  0.000000  0.355432  0.000000  0.000000  
2  0.000000  0.000000  0.000000  0.000000  0.322002  0.000000  0.423394  


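Notice that these values differ from our own calculations.

This is because scikit-learn's TfidfVectorizer uses different defaults: it lowercases the text and strips punctuation when tokenizing, it uses raw counts for TF together with a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and it scales each document vector to unit Euclidean length.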
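
The scikit-learn numbers can be reproduced by hand. The sketch below uses only the standard library and assumes the vectorizer's documented defaults: lowercase tokens of two or more word characters, raw counts as term frequency, smoothed IDF, and L2 normalization. The helper names `smoothed_idf` and `tfidf_row` are hypothetical.

```python
import math
import re

docs = ['When does summer begin, When',
        'The rain soothes my soul',
        'The winter is lovely but long']

# Lowercase and keep tokens of two or more word characters,
# mirroring the vectorizer's default tokenization.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
vocabulary = sorted(set(word for tokens in tokenized for word in tokens))
n_docs = len(docs)

def smoothed_idf(word):
    # Number of documents that contain the word.
    df = sum(word in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then divide by the row's Euclidean length.
    raw = [tokens.count(word) * smoothed_idf(word) for word in vocabulary]
    length = math.sqrt(sum(value * value for value in raw))
    return [value / length for value in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
for word, value in zip(vocabulary, rows[0]):
    if value > 0:
        print(word, round(value, 6))
```

For the first document this prints 0.377964 for 'begin', 'does', and 'summer' and 0.755929 for 'when', matching the first row of the dataframe above.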

# Practice - scikit-Learn TFIDF

Using the examples above, create a dataframe of the dense vectors generated from the following paragraph, using scikit-learn's TfidfVectorizer class.

<b>Use this paragraph:</b>

The TF-IDF vectorization process transforms text data into numerical features by assigning weights based on how important a word is to a document relative to a collection of documents. Unlike the Bag of Words approach, which simply counts word occurrences, TF-IDF down-weights common words that appear in many documents and emphasizes more distinctive terms. Both methods convert text into sparse matrices, but TF-IDF captures more nuanced information about word significance. As a result, TF-IDF often leads to better performance in machine learning tasks where understanding word relevance is important.

In [None]:
# Your Code Here:
# Store the text into a variable and split it into sentences, using the "." character.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

text = 'The TF-IDF vectorization process transforms text data into numerical features by assigning weights based on how important a word is to a document relative to a collection of documents. Unlike the Bag of Words approach, which simply counts word occurrences, TF-IDF down-weights common words that appear in many documents and emphasizes more distinctive terms. Both methods convert text into sparse matrices, but TF-IDF captures more nuanced information about word significance. As a result, TF-IDF often leads to better performance in machine learning tasks where understanding word relevance is important.'
sentences = text.split('.')

In [34]:
# Store each sentence in documents.

sentence_dict = {}
for i, sentence in enumerate(sentences):
    if sentence.strip():  # Avoid adding empty sentences
        sentence_dict[f'doc{i+1}'] = sentence.strip()

sentence_dict

{'doc1': 'The TF-IDF vectorization process transforms text data into numerical features by assigning weights based on how important a word is to a document relative to a collection of documents',
 'doc2': 'Unlike the Bag of Words approach, which simply counts word occurrences, TF-IDF down-weights common words that appear in many documents and emphasizes more distinctive terms',
 'doc3': 'Both methods convert text into sparse matrices, but TF-IDF captures more nuanced information about word significance',
 'doc4': 'As a result, TF-IDF often leads to better performance in machine learning tasks where understanding word relevance is important'}

In [38]:
# Find the tfidf vectorization for each document.
# Define a vectorizer
vectorizer = TfidfVectorizer()

# Fit documents into the vectorizer
sentence_dict_vectors = vectorizer.fit_transform(sentence_dict.values())
sentence_dict_feature_names = vectorizer.get_feature_names_out()

In [39]:
# Store the vectorization in a dataframe format and display or print it.
dense = sentence_dict_vectors.todense()
denselist = dense.tolist()

# Print the tfidf vectors
df = pd.DataFrame(denselist, columns=sentence_dict_feature_names)
print(df)

      about       and    appear  approach        as  assigning       bag  \
0  0.000000  0.000000  0.000000  0.000000  0.000000   0.211875  0.000000   
1  0.000000  0.201839  0.201839  0.201839  0.000000   0.000000  0.201839   
2  0.270352  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000   
3  0.000000  0.000000  0.000000  0.000000  0.255627   0.000000  0.000000   

      based    better      both  ...        to  transforms  understanding  \
0  0.211875  0.000000  0.000000  ...  0.334090    0.211875       0.000000   
1  0.000000  0.000000  0.000000  ...  0.000000    0.000000       0.000000   
2  0.000000  0.000000  0.270352  ...  0.000000    0.000000       0.000000   
3  0.000000  0.255627  0.000000  ...  0.201539    0.000000       0.255627   

     unlike  vectorization   weights     where     which      word     words  
0  0.000000       0.211875  0.167045  0.000000  0.000000  0.110565  0.000000  
1  0.201839       0.000000  0.159132  0.000000  0.201839  0.105328  0.40367

# Document Similarity

Text similarity measures how "close" two or more text documents are to each other.

Similarity can be judged in terms of both context and meaning.

Various text similarity metrics exist, including:

1. Cosine similarity
2. Jaccard similarity
3. Euclidean similarity

## Cosine Similarity

Cosine similarity is a metric used to measure how similar two documents are.

Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space.

The smaller the angle, the higher the cosine similarity.
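
For two count vectors $A$ and $B$, the cosine of the angle is the dot product divided by the product of the vector lengths: $\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$. A minimal pure-Python sketch follows. The helper name `cosine`, the example sentences, and the whitespace tokenization are assumptions made for this illustration.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    length_a = math.sqrt(sum(c * c for c in counts_a.values()))
    length_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (length_a * length_b)

a = "the cat sat on the mat".split()
b = "the dog sat on the log".split()
print(round(cosine(a, b), 4))   # shared words: the, sat, on
print(round(cosine(a, a), 4))   # a document compared with itself
```

Because word counts are never negative, the result lies between 0 (no shared words) and 1 (same direction).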

In [40]:
doc_1 = "Brazil won the Football world cup five times" 
doc_2 = "Italy comes after Brazil in that regard" 

# Find the bag of word vector representation
CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc_1, doc_2])

Now, calculate the cosine similarity between documents using sklearn.

In [41]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity_matrix = cosine_similarity(Count_data)

# Display the dataframe that shows the similarity of the two documents.
pd.DataFrame(cosine_similarity_matrix,['doc_1','doc_2'])

              0         1
doc_1  1.000000  0.133631
doc_2  0.133631  1.000000


What if you have more than two documents?

How many comparisons would you need to make?

In [42]:
doc_1 = "Sweden is in Scandinavia" 
doc_2 = "Denmark is a neighbor of Sweden" 
doc_3 = "Norway and Denmark are close by"

In [43]:
# Between the first and second:
CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc_1, doc_2])
cosine_similarity_matrix = cosine_similarity(Count_data)
pd.DataFrame(cosine_similarity_matrix,['doc_1','doc_2'])

              0         1
doc_1  1.000000  0.447214
doc_2  0.447214  1.000000


In [45]:
# Between the second and third:
CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc_2, doc_3])
cosine_similarity_matrix = cosine_similarity(Count_data)
pd.DataFrame(cosine_similarity_matrix,['doc_2','doc_3'])

              0         1
doc_2  1.000000  0.182574
doc_3  0.182574  1.000000


In [46]:
# Between first and third:
CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc_1, doc_3])
cosine_similarity_matrix = cosine_similarity(Count_data)
pd.DataFrame(cosine_similarity_matrix,['doc_1','doc_3'])

         0    1
doc_1  1.0  0.0
doc_3  0.0  1.0
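
With $n$ documents there are $n(n-1)/2$ distinct pairs, so three documents need three comparisons. The three pairwise comparisons above can also be obtained in one call by fitting a single `CountVectorizer` on all documents and letting `cosine_similarity` return the full matrix. This sketch uses its own variable names (`docs`, `labels`, `similarity`) so that it does not overwrite the lab's variables.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Sweden is in Scandinavia",
        "Denmark is a neighbor of Sweden",
        "Norway and Denmark are close by"]
labels = ["doc_1", "doc_2", "doc_3"]

# One shared vocabulary for all documents.
counts = CountVectorizer().fit_transform(docs)

# Entry [i, j] is the cosine similarity between document i and document j.
similarity = pd.DataFrame(cosine_similarity(counts), index=labels, columns=labels)
print(similarity.round(6))
```

Fitting on all three documents adds columns that are zero for any given pair. Those columns change neither the dot product nor the vector lengths, so the values match the pairwise cells above.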


# Practice - Cosine Similarity

Calculate the cosine similarity between the sentences in this paragraph:

<b>Use the sentences from this paragraph:</b>

Cosine similarity measures the angle between two vectors. It captures directional similarity, not magnitude differences.

In [48]:
# Your Code Here:
# Store your text into a variable.
cosine_text = "Cosine similarity measures the angle between two vectors. It captures directional similarity, not magnitude differences."

# Split the text into sentences and store them in a list, using the period as the sentence separator.
cosine_sentences = cosine_text.split('.')


In [50]:
# Assign the first element of text to one variable and the second element of the text to a different variable.

# Compare the first and second documents:
doc1 = cosine_sentences[0].strip()
doc2 = cosine_sentences[1].strip()

CountVec = CountVectorizer(ngram_range=(1,1))
Count_data = CountVec.fit_transform(sentence for sentence in [doc1, doc2])
cosine_similarity_matrix = cosine_similarity(Count_data)
pd.DataFrame(cosine_similarity_matrix,['doc1','doc2'])


             0         1
doc1  1.000000  0.133631
doc2  0.133631  1.000000


# Jaccard Similarity
Jaccard similarity is also known as the Jaccard index or Intersection over Union.

It measures the similarity between two text documents in terms of the words they share.

It is the number of words common to both documents divided by the total number of unique words across both documents.

![Image](jaccard.png)

Reference: https://en.wikipedia.org/wiki/Jaccard_index

In [51]:
# Start with two docs.
doc1 = 'A is the brother of B'
doc2 = 'B is the friend of C who is not a brother of A'

In [52]:
# Convert them to lower case for preprocessing.
doc1 = doc1.lower()
doc2 = doc2.lower()

In [53]:
# Split into tokens. Make sure that you have no duplicates. You might want to use sets for this purpose.
doc1 = set(doc1.split())
doc2 = set(doc2.split())

In [54]:
# Find common words from the 2 documents.
intersection = doc1.intersection(doc2)

In [56]:
# Find the vocabulary - unique words in both documents
union = doc1.union(doc2)
union

{'a', 'b', 'brother', 'c', 'friend', 'is', 'not', 'of', 'the', 'who'}

In [57]:
# Calculate jaccard similarity
float(len(intersection)) / len(union)

0.6
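
The steps above implement $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ for the word sets $A$ and $B$. They can be collected into one reusable function, sketched here. The function name `jaccard_similarity` and the choice to return 0.0 for two empty texts are assumptions of this illustration.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the sets of lowercase words in two strings."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # convention chosen here for two empty texts
    return len(words_a & words_b) / len(union)

print(jaccard_similarity('A is the brother of B',
                         'B is the friend of C who is not a brother of A'))  # 0.6
```

Because sets discard duplicates, repeated words such as "is" and "of" in the second sentence count only once.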