# Similarity between two documents

We measure the similarity between two documents by converting each document into a vector and then applying a vector similarity measure, here cosine similarity, to the pair.
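As a rough illustration of the idea, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses only the standard library; the helper name `cosine` and the toy vectors are assumptions made for illustration and are not taken from the corpus used later.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: each position stands for one vocabulary word.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 1, 1]

score = cosine(doc_a, doc_b)
print(score)
```

Identical vectors score 1 and vectors with no shared non-zero positions score 0, so a higher value means the documents use more of the same words.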

In [26]:
# creating a corpus of two documents

corpus = ['''One of the the finer books I read this year was John Kaags Hiking with Nietzsche, in which Kaag, a professor of philosophy, rekindles his passion for the German thinker while tracing picturesque iking trails in the mountains of Switzerland. 
It's a near precise rendering of the travelogue as a self help book. A young Kaag was an avowed Nietzsche acolyte but given the ravages of responsibilities and adulthood, the writer put his affinity to test by undertaking physically enduring hikes through the Alps, revisiting haunts that the philosopher escaped to, in search of solitude and salve. The journey's demands, coupled with his own inner turmoil, are catnip for anybody feeling at cross purposes with their own life.
In the book, Kaag qyites Neitzsche writing to his mother after he had spent time in Splugen, "I was overcome by the desire to remain here... this high alpine valley... There are pure, strong gusts of air, hills and boulders of all shapes... But what pleases me the most are the splendid highroads over which I walk for hours." Travel as the answer to searching questions is harddly a radical idea but what's endearing about the book is that it subtly confirms a basic tenet of why we go on these journeys in the first place. Sometimes, being on the move matters more than anything else.''' ,
         
         '''Summer is a charming flirt. Easygoing and casual. summer doesn't huff and puff to win our affections. It has us at "Hello." Winter broods like the tortured protagonist of big fat Russian novel. It is daunting and dramatic, burning with a slow intensity.
The season's reputation precedes itslef, and often, not in a good way. It has a way of whittling down everything to its bare bones. Even relationships not attuned to its ebbs and flows can fray. At a dinner conversation I once attended, I listened in bemusement as a recent divorcee made the case that it was the Scandinavian frost that had cooled his ex-wife's ardour. How original.
''']


In [27]:
# Preprocessing

# 1. Stemming
import nltk
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

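# Split each document on single spaces, stem every token, and print the result.
# Note: split(" ") leaves punctuation attached to tokens (e.g. "nietzsche,").
for doc in corpus:
    stemmed = [stemmer.stem(token) for token in doc.split(" ")]
    print(" ".join(stemmed))
    print("\n")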

one of the the finer book I read thi year wa john kaag hike with nietzsche, in which kaag, a professor of philosophy, rekindl hi passion for the german thinker while trace picturesqu ike trail in the mountain of switzerland. 
it' a near precis render of the travelogu as a self help book. A young kaag wa an avow nietzsch acolyt but given the ravag of respons and adulthood, the writer put hi affin to test by undertak physic endur hike through the alps, revisit haunt that the philosoph escap to, in search of solitud and salve. the journey' demands, coupl with hi own inner turmoil, are catnip for anybodi feel at cross purpos with their own life.
in the book, kaag qyit neitzsch write to hi mother after he had spent time in splugen, "I wa overcom by the desir to remain here... thi high alpin valley... there are pure, strong gust of air, hill and boulder of all shapes... but what pleas me the most are the splendid highroad over which I walk for hours." travel as the answer to search question 
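
The stemmed output above still contains tokens such as `nietzsche,` and `alps,` because splitting on spaces leaves punctuation attached to the word. One possible alternative is to extract runs of word characters with a regular expression before stemming. This is a minimal sketch using only the standard library; the sample sentence and the `\w+` pattern are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical sample sentence, shortened from the first document.
sample = "Hiking with Nietzsche, in which Kaag, a professor of philosophy, rekindles his passion."

# Splitting on spaces keeps trailing punctuation on the token.
by_space = sample.split(" ")

# Keeping only runs of word characters drops the punctuation.
by_regex = re.findall(r"\w+", sample.lower())

print(by_space[2])
print(by_regex[2])
```

With the punctuation removed, both occurrences of a word reach the stemmer in the same form.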

In [28]:
# 2. Lemmatization

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

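# lemmatize() treats every token as a noun unless a POS tag is passed,
# which is why "was" comes out as "wa" and "as" as "a" in the output below.
for doc in corpus:
    lemmas = [lemmatizer.lemmatize(token) for token in doc.split(" ")]
    print(" ".join(lemmas))
    print("\n")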

One of the the finer book I read this year wa John Kaags Hiking with Nietzsche, in which Kaag, a professor of philosophy, rekindles his passion for the German thinker while tracing picturesque iking trail in the mountain of Switzerland. 
It's a near precise rendering of the travelogue a a self help book. A young Kaag wa an avowed Nietzsche acolyte but given the ravage of responsibility and adulthood, the writer put his affinity to test by undertaking physically enduring hike through the Alps, revisiting haunt that the philosopher escaped to, in search of solitude and salve. The journey's demands, coupled with his own inner turmoil, are catnip for anybody feeling at cross purpose with their own life.
In the book, Kaag qyites Neitzsche writing to his mother after he had spent time in Splugen, "I wa overcome by the desire to remain here... this high alpine valley... There are pure, strong gust of air, hill and boulder of all shapes... But what plea me the most are the splendid highroad ov

In [29]:
# Feature Engineering 

# 1. CountVectors

from sklearn.feature_extraction.text import CountVectorizer

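# The vectorizer is fit on the raw corpus; the stemmed and lemmatized text above was only printed.
# binary=True records only whether a word occurs (1) or not (0), ignoring how often.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())
print("\n")
print(X.toarray())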

['about', 'acolyte', 'adulthood', 'affections', 'affinity', 'after', 'air', 'all', 'alpine', 'alps', 'an', 'and', 'answer', 'anybody', 'anything', 'ardour', 'are', 'as', 'at', 'attended', 'attuned', 'avowed', 'bare', 'basic', 'being', 'bemusement', 'big', 'bones', 'book', 'books', 'boulders', 'broods', 'burning', 'but', 'by', 'can', 'case', 'casual', 'catnip', 'charming', 'confirms', 'conversation', 'cooled', 'coupled', 'cross', 'daunting', 'demands', 'desire', 'dinner', 'divorcee', 'doesn', 'down', 'dramatic', 'easygoing', 'ebbs', 'else', 'endearing', 'enduring', 'escaped', 'even', 'everything', 'ex', 'fat', 'feeling', 'finer', 'first', 'flirt', 'flows', 'for', 'fray', 'frost', 'german', 'given', 'go', 'good', 'gusts', 'had', 'harddly', 'has', 'haunts', 'he', 'hello', 'help', 'here', 'high', 'highroads', 'hikes', 'hiking', 'hills', 'his', 'hours', 'how', 'huff', 'idea', 'iking', 'in', 'inner', 'intensity', 'is', 'it', 'its', 'itslef', 'john', 'journey', 'journeys', 'kaag', 'kaags', 'l
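
The feature list above contains `doesn` and `ex` but no one-letter words, because `CountVectorizer` lowercases the text and, by default, keeps only tokens of two or more word characters. The sketch below imitates that behaviour with the standard library to build binary vectors by hand. It is not scikit-learn's implementation; the helper `tokenize` and the two toy sentences are assumptions for illustration.

```python
import re

# scikit-learn's documented default token_pattern: two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text and return tokens of at least two word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

# Hypothetical toy documents.
docs = ["Summer doesn't huff and puff.", "Winter broods, and it is a slow season."]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tok for d in docs for tok in tokenize(d)})

# One row per document: 1 if the vocabulary word occurs in it, else 0.
binary = [[1 if term in set(tokenize(d)) else 0 for term in vocab] for d in docs]

print(vocab)
for row in binary:
    print(row)
```

Note how `doesn't` contributes only `doesn`: the apostrophe splits the word and the lone `t` is too short to be kept.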

In [30]:
# 2. TF-IDF Vectors

from sklearn.feature_extraction.text import TfidfVectorizer

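vectorizer = TfidfVectorizer()
Y = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())
print("\n")
print(Y[0].toarray())  # TF-IDF weights for the first document
print("\n")
print(Y[1].toarray())  # TF-IDF weights for the second document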

['about', 'acolyte', 'adulthood', 'affections', 'affinity', 'after', 'air', 'all', 'alpine', 'alps', 'an', 'and', 'answer', 'anybody', 'anything', 'ardour', 'are', 'as', 'at', 'attended', 'attuned', 'avowed', 'bare', 'basic', 'being', 'bemusement', 'big', 'bones', 'book', 'books', 'boulders', 'broods', 'burning', 'but', 'by', 'can', 'case', 'casual', 'catnip', 'charming', 'confirms', 'conversation', 'cooled', 'coupled', 'cross', 'daunting', 'demands', 'desire', 'dinner', 'divorcee', 'doesn', 'down', 'dramatic', 'easygoing', 'ebbs', 'else', 'endearing', 'enduring', 'escaped', 'even', 'everything', 'ex', 'fat', 'feeling', 'finer', 'first', 'flirt', 'flows', 'for', 'fray', 'frost', 'german', 'given', 'go', 'good', 'gusts', 'had', 'harddly', 'has', 'haunts', 'he', 'hello', 'help', 'here', 'high', 'highroads', 'hikes', 'hiking', 'hills', 'his', 'hours', 'how', 'huff', 'idea', 'iking', 'in', 'inner', 'intensity', 'is', 'it', 'its', 'itslef', 'john', 'journey', 'journeys', 'kaag', 'kaags', 'l
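
To make the weighting less of a black box, here is a hand-rolled sketch of TF-IDF under `TfidfVectorizer`'s default settings as documented by scikit-learn: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The toy pre-tokenized documents and the helper `tfidf_row` are assumptions for illustration; this is not the library's own code.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized toy documents.
docs = [["alpine", "hike", "hike", "book"], ["winter", "book"]]

vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalised TF-IDF weights of one document over vocab."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(w, 3) for w in row])
```

A word that appears in both documents (`book`) gets the minimum idf of 1, while words unique to one document are weighted more heavily. Because every row has unit length, the cosine similarity of two rows is simply their dot product.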

In [31]:
# Calculate Cosine Similarity

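from sklearn.metrics.pairwise import cosine_similarity

# Each row of X and Y is one document; indexing with [0] or [1] yields a 1 x n_features sparse row.
similarity_1 = cosine_similarity(X[0], X[1])  # binary count vectors
similarity_2 = cosine_similarity(Y[0], Y[1])  # TF-IDF vectors
print(similarity_1)
print(similarity_2)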

[[0.12658932]]
[[0.33821912]]
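
The first score above comes from binary vectors, and for binary vectors cosine similarity reduces to a set-overlap ratio: the number of shared words divided by the square root of the product of the two vocabulary sizes. The sketch below shows that shortcut with the standard library; the helper `binary_cosine` and the toy word lists are assumptions for illustration and do not reproduce the scores printed above.

```python
import math

def binary_cosine(words_a, words_b):
    """Cosine similarity of two documents represented as binary word vectors."""
    a, b = set(words_a), set(words_b)
    # Dot product = number of shared words; each norm = sqrt(number of distinct words).
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical toy token lists.
doc_a = ["summer", "doesn", "huff", "and", "puff"]
doc_b = ["winter", "broods", "and", "it", "is", "slow", "season"]

overlap_score = binary_cosine(doc_a, doc_b)
print(overlap_score)
```

Here the documents share one word out of 5 and 7 distinct words, giving 1 / sqrt(35). The binary score ignores how often a word is repeated, which is one reason it can differ from the TF-IDF score for the same pair of documents.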
