# Finding Similarity Between Texts

In this recipe, we discuss how to find the similarity between two documents or pieces of text. Many similarity metrics exist, such as Euclidean distance, cosine similarity, and Jaccard similarity. Text similarity is used in areas such as spelling correction and data deduplication.

## Here are a few of the similarity metrics:
**Cosine similarity**: Calculates the cosine of the angle between two vectors.  
**Jaccard similarity**: The score is the size of the intersection of the two word sets divided by the size of their union.  
**Jaccard index**: The same ratio expressed as a percentage: (the number of items in both sets) / (the number of items in either set) * 100  
**Levenshtein distance**: Minimum number of insertions, deletions, and replacements required to transform string 'a' into string 'b'.  
**Hamming distance**: Number of positions at which the two strings have different symbols. It is defined only for strings of equal length.
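
The set-based and edit-based metrics above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function names are made up for this example, and the Jaccard version assumes simple lowercase whitespace tokenization.

```python
def jaccard_similarity(text_a, text_b):
    """Size of the word-set intersection divided by the size of the union."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(words_a & words_b) / len(words_a | words_b)


def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and replacements to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement or match
        previous = current
    return previous[-1]


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


print(jaccard_similarity("I like NLP", "I like advanced NLP"))
print(levenshtein_distance("kitten", "sitting"))
print(hamming_distance("karolin", "kathrin"))
```

For these inputs the Jaccard similarity is 0.75 (three shared words out of four distinct words), and both the Levenshtein and the Hamming distance are 3.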

In [1]:
documents = (
'I like NLP',
'I am exploring NLP',
'I am a beginner in NLP',
'I want to learn NLP',
'I like advanced NLP')


In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tf_idf_vectorizer = TfidfVectorizer()
tfidf_matrix = tf_idf_vectorizer.fit_transform(documents)
print(tfidf_matrix)
print(tf_idf_vectorizer.vocabulary_)
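# Order the vocabulary terms by column index so they line up with idf_
voc_sorted = sorted(tf_idf_vectorizer.vocabulary_, key=tf_idf_vectorizer.vocabulary_.__getitem__)
# idf_ holds one IDF weight per column, so index it by the sorted terms
tfidf = pd.DataFrame(tf_idf_vectorizer.idf_, index=voc_sorted, columns=['idf_weights'])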
tfidf

cosine_similarity(tfidf_matrix[0:1],tfidf_matrix)

  (0, 6)	0.8610369959439764
  (0, 7)	0.5085423203783267
  (1, 7)	0.3477147117091919
  (1, 1)	0.5887321837696324
  (1, 3)	0.7297183669435993
  (2, 7)	0.2808823162882302
  (2, 1)	0.47557510189256375
  (2, 2)	0.5894630806320427
  (2, 4)	0.5894630806320427
  (3, 7)	0.26525552965220073
  (3, 9)	0.5566685141652766
  (3, 8)	0.5566685141652766
  (3, 5)	0.5566685141652766
  (4, 6)	0.5887321837696324
  (4, 7)	0.3477147117091919
  (4, 0)	0.7297183669435993
{'exploring': 3, 'in': 4, 'like': 6, 'beginner': 2, 'learn': 5, 'want': 9, 'to': 8, 'nlp': 7, 'am': 1, 'advanced': 0}


array([[1.        , 0.17682765, 0.14284054, 0.13489366, 0.68374784]])

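Let us take vec1 = tfidf_matrix[0] and vec2 = tfidf_matrix[1]. By default, `TfidfVectorizer` L2-normalizes each row, so both vectors have unit length and the cosine of the angle between them reduces to their dot product.  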

$$\vec{vec1} = 0.861037\,u_6 + 0.508542\,u_7$$
$$\vec{vec2} = 0.347715\,u_7 + 0.588732\,u_1 + 0.729718\,u_3$$
$$\cos(\vec{vec1},\vec{vec2}) = \vec{vec1}\cdot\vec{vec2}$$
$$ = 0.861037 \times 0 + 0.508542 \times 0.347715 + 0 \times 0.588732 + 0 \times 0.729718$$
$$ \approx 0.176828$$
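
The same arithmetic can be checked by hand in a few lines of Python. This sketch copies the non-zero weights of the first two rows from the printed matrix above into plain dictionaries; the helper names are made up for this example. It also divides by the norms explicitly, to show that they are effectively 1.

```python
import math

# Non-zero TF-IDF entries of rows 0 and 1, copied from the printed matrix,
# stored as {column_index: weight}.
vec1 = {6: 0.8610369959439764, 7: 0.5085423203783267}
vec2 = {7: 0.3477147117091919, 1: 0.5887321837696324, 3: 0.7297183669435993}


def dot(a, b):
    """Dot product of two sparse vectors: only shared columns contribute."""
    return sum(weight * b[index] for index, weight in a.items() if index in b)


def norm(a):
    """Euclidean (L2) length of a sparse vector."""
    return math.sqrt(sum(weight * weight for weight in a.values()))


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    return dot(a, b) / (norm(a) * norm(b))


print(norm(vec1), norm(vec2))  # both are very close to 1
print(cosine(vec1, vec2))      # compare with the second entry of the array above
```

Only column 7 (the term 'nlp') appears in both rows, so it is the only term that contributes to the dot product.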


### Phonetic Matching

The next version of similarity checking is phonetic matching, which roughly matches two words or sentences by encoding each one as an alphanumeric string that represents how it sounds. It is very useful for searching large text corpora, correcting spelling errors, and matching relevant names. **Soundex** and **Metaphone** are the two main phonetic algorithms used for this purpose. The simplest way to do this is with the Fuzzy library.  

In [4]:
!pip install Fuzzy
import fuzzy  # the Fuzzy package is imported under the lowercase name "fuzzy"



In [21]:
import fuzzy
soundex = fuzzy.Soundex(4)
soundex('natural')

''
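
To see what a four-character Soundex code looks like without any dependency, here is a simplified pure-Python sketch of the classic American Soundex rules. It is an illustration only: the function name `simple_soundex` is made up for this example, and the Fuzzy library's implementation may differ in details.

```python
def simple_soundex(word, length=4):
    """Encode a word as its first letter followed by consonant-class digits."""
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""

    encoded = letters[0].upper()
    previous = codes.get(letters[0], "")  # the first letter's code also counts
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        digit = codes.get(ch, "")  # vowels map to "" and reset the previous code
        if digit and digit != previous:
            encoded += digit
        previous = digit

    # Pad with zeros, then cut to the requested length.
    return (encoded + "0" * length)[:length]


print(simple_soundex("natural"))
print(simple_soundex("Robert"), simple_soundex("Rupert"))
```

Under these rules 'natural' encodes to N364, and the similar-sounding names 'Robert' and 'Rupert' both encode to R163, which is what makes the code useful for matching names that are spelled differently.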