# Semantic Text Similarity 

## Steps

#### Cleaning Text:
* Clean the text with a regular expression: lowercase it and replace every character that is not a letter with a space.
#### Tokenizing:
* Split the body of text into sentences and words. In the simplest case, words are the pieces separated by spaces, i.e. every word is followed by a space.
#### Stop Words:
* Extremely common words that add little value in helping select documents matching a user need are sometimes excluded from the vocabulary entirely. These words are called stop words, and they can be filtered out of the text before it is processed.
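
As a rough illustration of the three steps so far, the sketch below cleans, tokenizes, and filters a sentence using only the standard library. The tiny `STOP_WORDS` set and the `simple_preprocess` helper are made up for this example; the notebook itself relies on `nltk.word_tokenize` and NLTK's English stop word list.

```python
import re

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"he", "his", "is", "to", "the", "a"}

def simple_preprocess(sentence):
    """Lowercase, replace non-letters with spaces, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z]", " ", sentence.lower())
    tokens = cleaned.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(simple_preprocess("He loves to play football."))
# ['loves', 'play', 'football']
```

Note that "loves" is left as is here: reducing it to "love" is the job of the lemmatizing step.
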
#### Lemmatizing:
* The goal of lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. NLTK's `WordNetLemmatizer` treats each word as a noun by default; pass the `pos` argument (for example `pos='v'`) to lemmatize other parts of speech.
#### Synsets:
* WordNet is a lexical database for the English language and is available as one of the NLTK corpora. It groups words into sets of synonyms called synsets, one per word sense. Through NLTK, WordNet can be used to look up the meanings, synonyms, antonyms and more of a word.
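
The similarity step later in this notebook scores pairs of synsets with `wordnet.wup_similarity`, i.e. Wu-Palmer similarity. To make the idea concrete, here is a minimal sketch of the textbook formula, `2 * depth(lcs) / (depth(a) + depth(b))`, on a hand-made toy taxonomy. The `PARENT` table and the helper names are invented for this example; NLTK works on the real WordNet hypernym graph and handles cases (multiple parents, missing common ancestors) that this sketch ignores.

```python
# Toy taxonomy as child -> parent links (hypothetical; not WordNet data).
PARENT = {
    "activity": "entity",
    "sport": "activity",
    "football": "sport",
    "tennis": "sport",
    "animal": "entity",
    "dog": "animal",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """Wu-Palmer similarity for a tree-shaped taxonomy."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("football", "tennis"))         # 0.75
print(round(wu_palmer("football", "dog"), 3))  # 0.286
print(wu_palmer("football", "football"))       # 1.0
```

Concepts that share a deep common ancestor ("football" and "tennis" under "sport") score higher than concepts that only meet at the root.
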

In [59]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [60]:
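import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
import re

# First run only: fetch the NLTK data used below
# (newer NLTK releases may also ask for 'punkt_tab').
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')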

In [61]:
from itertools import product

### Cleaning Text

In [62]:
str1 = "He loves to play football."
str2 = "Football is his favourite sport."

In [63]:
lemma=WordNetLemmatizer()

In [64]:
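# Build the stop word set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))

def clean_text(a):
    # Lowercase, then replace every non-letter character with a space.
    text = re.sub('[^a-z]', ' ', a.lower())
    # Split the cleaned string into word tokens.
    tokens = nltk.word_tokenize(text)
    # Drop stop words and lemmatize the rest (as nouns, the lemmatizer's default).
    tokens = [lemma.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)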

In [65]:
sent1=clean_text(str1)
sent2=clean_text(str2)

In [66]:
print(sent1)
print(sent2)

love play football
football favourite sport


### Finding Similarity

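In [73]:
final = []
# sent1 and sent2 are strings, so split them into words first;
# looping over a string directly would compare single characters.
for word1 in sent1.split():
    syns1 = wordnet.synsets(word1)
    similarity = []
    for word2 in sent2.split():
        syns2 = wordnet.synsets(word2)
        sims = []
        # Score every pair of senses; wup_similarity returns None when no score exists.
        for sense1, sense2 in product(syns1, syns2):
            d = wordnet.wup_similarity(sense1, sense2)
            if d is not None:
                sims.append(d)
        if sims:
            # Best sense pair for this pair of words.
            similarity.append(max(sims))
    if similarity:
        # Best match in sent2 for this word of sent1.
        final.append(max(similarity))

In [74]:
# Mean of the per-word best scores.
similarity_index = np.mean(final)
similarity_index

The output recorded for this cell, 0.9930555555555556, came from a run that looped over the characters of `sent1` and `sent2` rather than their words. The word-level loop above gives a different score, so re-run the notebook to get the current value and label.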
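
The aggregation used above takes, for each word of the first sentence, its best score against any word of the second sentence, and then averages those best scores. The sketch below isolates that logic from WordNet so it can be tried on its own. `sentence_similarity`, `toy_sim` and the numbers in `TOY_SCORES` are made up for illustration and are not WordNet scores; returning `0.0` when nothing can be scored is also an assumption of this sketch.

```python
from statistics import mean

def sentence_similarity(words1, words2, word_sim):
    """Mean over words1 of the best word_sim score against any word in words2."""
    best_scores = []
    for w1 in words1:
        scores = [s for s in (word_sim(w1, w2) for w2 in words2) if s is not None]
        if scores:  # skip words that could not be scored at all
            best_scores.append(max(scores))
    return mean(best_scores) if best_scores else 0.0

# Hypothetical word-pair scores standing in for a WordNet-based measure.
TOY_SCORES = {
    ("love", "favourite"): 0.5,
    ("play", "sport"): 0.75,
    ("football", "football"): 1.0,
    ("football", "sport"): 0.75,
}

def toy_sim(a, b):
    """Symmetric lookup; None when the pair is unknown."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a)))

s1 = ["love", "play", "football"]
s2 = ["football", "favourite", "sport"]
print(sentence_similarity(s1, s2, toy_sim))            # 0.75
print(sentence_similarity(["football"], s2, toy_sim))  # 1.0
print(sentence_similarity(s2, ["football"], toy_sim))  # 0.875
```

The last two calls show that this measure is directional: only the words of the first argument are averaged, so swapping the sentences can change the score.
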

In [75]:
if similarity_index>0.80:
    print("Similar")
elif similarity_index>=0.60:
    print("Somewhat Similar")
else:
    print("Not Similar")

Similar
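
For reuse on other sentence pairs, the threshold check can be wrapped in a small function. This is only a sketch: the name `label_similarity` and its parameter names are hypothetical, and the default cut-offs simply mirror the 0.80 and 0.60 values used in the cell above.

```python
def label_similarity(score, similar_above=0.80, somewhat_from=0.60):
    """Map a similarity score to one of the three labels used in this notebook."""
    if score > similar_above:
        return "Similar"
    if score >= somewhat_from:
        return "Somewhat Similar"
    return "Not Similar"

print(label_similarity(0.95))  # Similar
print(label_similarity(0.80))  # Somewhat Similar
print(label_similarity(0.40))  # Not Similar
```

As in the original cell, a score of exactly 0.80 falls into "Somewhat Similar" because the upper comparison is strict.
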
