# Similarity Vectorizer

A demonstration of how vector databases are created. The text values are converted into vectors, and metrics from scikit-learn are used to determine how similar two pieces of text are.

TF (term frequency) measures how often a word appears in a document: the number of times term t appears in document d, divided by the total number of terms in d. IDF (inverse document frequency) reduces the weight of common words while increasing the weight of rare words: the logarithm of the ratio of the total number of documents in corpus D to the number of documents containing term t.
Distance metrics are functions that measure the dissimilarity between two objects. These metrics satisfy certain conditions, such as non-negativity, symmetry, and the triangle inequality. This notebook uses cosine similarity, which measures closeness rather than distance: a higher value means the two texts are more alike.
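
As a rough illustration of the two definitions above, the sketch below computes TF and IDF by hand. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for this example and are not part of the notebook. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these textbook values.

```python
import math

# Toy corpus (hypothetical, not the FAQ data used in this notebook).
corpus = [
    "how will i access this course",
    "what is the refund policy",
    "is this course live",
]
docs = [text.split() for text in corpus]

def tf(term, doc):
    # Occurrences of the term divided by the number of terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

common = tf_idf("this", docs[0], docs)    # "this" appears in 2 of 3 documents
rare = tf_idf("access", docs[0], docs)    # "access" appears in 1 of 3 documents
print(f"this:   {common:.4f}")
print(f"access: {rare:.4f}")
```

Both words occur once in the first document, so their TF is identical; the rarer word ends up with the larger weight purely because of IDF.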

In [1]:
import numpy as np
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity

The module sklearn.feature_extraction.text helps with natural language processing tasks by turning text into numeric features. Its TfidfVectorizer quantifies the importance of words in a document. TF-IDF stands for Term Frequency-Inverse Document Frequency.
The module sklearn.metrics.pairwise provides utilities for calculating pairwise distances and affinities between sets of samples.
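
To make the pairwise idea concrete, here is a minimal sketch that compares `cosine_similarity` with the same quantity computed by hand (dot product divided by the product of the vector lengths). The vectors `a` and `b` are arbitrary illustrative values.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 0.0, 1.0]])
b = np.array([[2.0, 0.0, 2.0],    # same direction as a
              [0.0, 3.0, 0.0]])   # orthogonal to a

# Manual cosine similarity: dot products scaled by the vector norms.
norm_a = np.linalg.norm(a, axis=1, keepdims=True)   # shape (1, 1)
norm_b = np.linalg.norm(b, axis=1)                  # shape (2,)
manual = (a @ b.T) / (norm_a * norm_b)

library = cosine_similarity(a, b)                   # shape (1, 2)
print(manual)
print(library)
```

The result has one row per sample in the first argument and one column per sample in the second, which is why the notebook later takes `argmax` along `axis=1`.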

In [7]:
df = pandas.read_csv("faqs.csv")
df.dropna(inplace=True)

In [3]:
print(df)

                                            Question  \
0                     How will I access this course?   
1  Is there a limited timeframe during which I ha...   
2  When will Mammoth Interactive release this mas...   
3  Will the masterclass be live, or can it be rev...   
4  What if my company wants to enroll in this mas...   
5  What is the refund policy for the Licensing le...   
6                                    More questions?   
7                                                 Hi   
8                                       How are you?   

                                              Answer  
0  We provide our courses online via our training...  
1  Our courses have no expiration date or add-on ...  
2  We expect to release this masterclass by May. ...  
3  The masterclass is not live. It can be revisited.  
4  For multiple copies, pledge at the multiple co...  
5  We cannot offer refunds for the multiple copy ...  
6  Ask away in the Comments tab of this Kickstart...  

In [4]:
vectorizer = TfidfVectorizer()
vectorizer.fit(np.concatenate((df.Question,df.Answer)))


TfidfVectorizer()

Parameter       Value
input           'content'
encoding        'utf-8'
decode_error    'strict'
strip_accents   None
lowercase       True
preprocessor    None
tokenizer       None
analyzer        'word'
stop_words      None
token_pattern   '(?u)\\b\\w\\w+\\b'


In [5]:
vectorized_questions = vectorizer.transform(df.Question)
print(vectorized_questions)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 56 stored elements and shape (9, 134)>
  Coords	Values
  (0, 1)	0.5120391518477828
  (0, 22)	0.4602775723841269
  (0, 50)	0.5120391518477828
  (0, 113)	0.2954127439104473
  (0, 128)	0.4201281721635454
  (1, 19)	0.3584495210182788
  (1, 30)	0.3584495210182788
  (1, 45)	0.3137476662498792
  (1, 54)	0.22033448434299002
  (1, 60)	0.3584495210182788
  (1, 64)	0.19262748601054774
  (1, 111)	0.3137476662498792
  (1, 113)	0.18101166414307543
  (1, 115)	0.3584495210182788
  (1, 116)	0.20561287007641493
  (1, 127)	0.3584495210182788
  (2, 53)	0.4552179307093832
  (2, 63)	0.4552179307093832
  (2, 64)	0.2446299421195228
  (2, 97)	0.3984481915038884
  (2, 113)	0.229878268358101
  (2, 126)	0.4552179307093832
  (2, 128)	0.3269267785369717
  (3, 10)	0.54183151362783
  (3, 13)	0.33018369995715785
  :	:
  (3, 128)	0.270915756813915
  (4, 18)	0.371184966370913
  (4, 32)	0.371184966370913
  (4, 51)	0.371184966370913
  (4, 52)	0.2920515545324218

In [6]:
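# Read one question from the user and vectorize it with the fitted vocabulary.
user_input = input()
vectorized_user_input = vectorizer.transform([user_input])

# Cosine similarity between the query and every stored question: shape (1, n_questions).
similarities = cosine_similarity(vectorized_user_input, vectorized_questions)

# Index of the most similar question (a one-element array, because axis=1).
closest_question = np.argmax(similarities, axis=1)
print(similarities)
print(closest_question)

# Look up the answer stored alongside the best-matching question.
answer = df.Answer.iloc[closest_question].values[0]
print("Answer: ", answer)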

 can I get a refund?


[[0.         0.         0.         0.18161563 0.         0.24916032
  0.         0.         0.        ]]
[5]
Answer:  We cannot offer refunds for the multiple copy levels. Please be extra sure to purchase this option with care. This is because we share a portion of the pledge with Kickstarter, and the difference at these pledge levels is significant. Thank you for understanding, and we're excited to welcome your organization.
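
One caveat with the lookup above: when a query shares no words with any stored question, every similarity is 0 and `np.argmax` returns index 0, so the first answer is returned regardless. The sketch below wraps the lookup in a function with a minimum-similarity cutoff. The function name `answer_for`, the `min_similarity` value of 0.1, and the toy questions and answers are all assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical FAQ data, not the contents of faqs.csv.
questions = ["How will I access this course?", "What is the refund policy?"]
answers = ["The course is online.", "Refunds are not offered."]

# Fit on questions and answers together, as the notebook does.
vectorizer = TfidfVectorizer().fit(questions + answers)
question_vectors = vectorizer.transform(questions)

def answer_for(query, min_similarity=0.1):
    # Similarity of the query to every stored question, as a 1-D array.
    scores = cosine_similarity(vectorizer.transform([query]), question_vectors)[0]
    best = int(np.argmax(scores))
    # Below the (arbitrary) cutoff, report no match instead of guessing.
    if scores[best] < min_similarity:
        return None
    return answers[best]

print(answer_for("can I get a refund?"))
print(answer_for("zebra"))
```

A suitable cutoff depends on the data, so the value here should be treated as a starting point to tune rather than a recommendation.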
