## The Google of Quotes

The file “quotes.txt” contains pairs of lines, with the first line being a quote
and the following line being the person who said it. We want to build a
search engine that, given a word or words, finds the best matching quotes.



In [1]:
## Import the regular expressions and math libraries

import re
import math

In [2]:
## Open the file, read its lines, and close it automatically when done

with open('quotes.txt', 'r') as file:
    file_list = file.readlines()
file_list

['How we spend our days is, of course, how we spend our lives.\n',
 'Annie Dillard\n',
 'Two roads diverged in a wood, and I...I took the one less traveled by, and that has made all the difference.\n',
 'Robert Frost\n',
 'What is happiness? The feeling that power is growing, that resistance is overcome.\n',
 'Friedrich Nietzsche\n',
 'A great deal of intelligence can be invested in ignorance when the need for illusion is deep.\n',
 'Saul Bellow\n',
 'Those who are preoccupied with `making a statement` usually don`t have any statements worth making.\n',
 'Thomas Sowell\n',
 'Women need a reason to have sex -- men just need a place.\n',
 'Billy Crystal\n',
 'The heart has its reasons, of which the mind knows nothing.\n',
 'Blaise Pascal\n',
 'It is necessary for the welfare of society that genius should be privileged to utter sedition, to blaspheme, to outrage good taste, to corrupt the youthful mind, and generally to scandalize one`s uncles.\n',
 'George Bernard Shaw\n',
 'Freedom is n

#### (a) Building a list of full quotes 
First, read in the file, and create
a list of full quotes of the form “quote - speaker”. <br>

In [3]:
## Create a new list with the quote and author as part of the same list element

list_of_quotes = list()
for i in range(0, len(file_list), 2):
    quote = file_list[i].rstrip() + ' - ' + file_list[i+1].rstrip()
    list_of_quotes.append(quote)

list_of_quotes

['How we spend our days is, of course, how we spend our lives. - Annie Dillard',
 'Two roads diverged in a wood, and I...I took the one less traveled by, and that has made all the difference. - Robert Frost',
 'What is happiness? The feeling that power is growing, that resistance is overcome. - Friedrich Nietzsche',
 'A great deal of intelligence can be invested in ignorance when the need for illusion is deep. - Saul Bellow',
 'Those who are preoccupied with `making a statement` usually don`t have any statements worth making. - Thomas Sowell',
 'Women need a reason to have sex -- men just need a place. - Billy Crystal',
 'The heart has its reasons, of which the mind knows nothing. - Blaise Pascal',
 'It is necessary for the welfare of society that genius should be privileged to utter sedition, to blaspheme, to outrage good taste, to corrupt the youthful mind, and generally to scandalize one`s uncles. - George Bernard Shaw',
 'Freedom is not something that anybody can be given; freedom 

In [134]:
len(list_of_quotes) ## the number of quotes in the list

886
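
The index-based loop assumes the file has an even number of lines. As a minimal sketch of the same pairing step, the hypothetical helper `pair_quotes` below (not part of the original notebook) uses slicing and `zip`, which ignores a trailing unpaired line instead of raising an `IndexError`.

```python
def pair_quotes(lines):
    """Join each quote line with the speaker line that follows it."""
    stripped = [line.rstrip() for line in lines]
    # Even indices hold quotes, odd indices hold speakers.
    # zip stops at the shorter slice, so a dangling last line is skipped.
    return [f"{quote} - {speaker}"
            for quote, speaker in zip(stripped[0::2], stripped[1::2])]

sample_lines = [
    "How we spend our days is, of course, how we spend our lives.\n",
    "Annie Dillard\n",
    "The heart has its reasons, of which the mind knows nothing.\n",
    "Blaise Pascal\n",
]
paired = pair_quotes(sample_lines)
print(paired[0])
```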

#### (b) Words from full quotes 
The function words_in_quote takes a full quote as argument and outputs a list of the words in it. The
words are all in lower case and contain only letters, digits, or
underscores (plus the backtick, which this file uses in place of apostrophes and quotation marks). <br>

In [4]:
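def words_in_quote(quote):
    
    '''
    A function to return a list of words in the quote
    
    Parameters: 
    quote (str) : A full quote 
    
    Returns:
    A list of lower-case words from the quote containing only letters, digits,
    underscores, and backticks (used as apostrophes in quotes.txt)
    
    '''
    
    lower_quote = quote.lower()

    # Split on runs of any other character; drop the empty strings that
    # re.split yields when the quote starts or ends with a separator
    words = [word for word in re.split(r'[^a-z0-9_`]+', lower_quote) if word]
    return words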
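
To make the tokenization rule concrete, here is a small sketch that applies the same regular expression to two strings. The helper name `tokenize` and the second test string are illustrative assumptions; the first string is a quote that appears in this notebook's output. Note that `re.split` returns empty strings when the text begins or ends with a separator, so the sketch filters them out.

```python
import re

def tokenize(text):
    """Lower-case the text and split it on anything that is not a-z, 0-9, _ or `."""
    pieces = re.split(r'[^a-z0-9_`]+', text.lower())
    # Leading or trailing separators leave empty strings behind; discard them
    return [piece for piece in pieces if piece]

tokens = tokenize("It`s not how old you are, it`s how hard you work at it. - Jonah Barrington")
print(tokens[:4])   # ['it`s', 'not', 'how', 'old']

# A string that starts and ends with punctuation
edge_tokens = tokenize('"Hello," she said.')
print(edge_tokens)  # ['hello', 'she', 'said']
```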

#### (c) Build the postings-list dictionary
A postings-list is
a dictionary whose keys are full quotes, and whose values are themselves
dictionaries with key being a word, and value being the number of times
the word occurs in the full quote. <br>

In [5]:
## Create a postings_list dictionary with each quote as key and a dictionary as value
## The value dictionary has words as keys and the frequency of each word in the quote as values

postings_list = {}

for quote in list_of_quotes:
    words = words_in_quote(quote)
    
    words_dictionary = {}
    
    for word in words:
        if word in words_dictionary:
            words_dictionary[word] += 1
        else:
            words_dictionary[word] = 1
    
    postings_list[quote] = words_dictionary

postings_list

{'How we spend our days is, of course, how we spend our lives. - Annie Dillard': {'how': 2,
  'we': 2,
  'spend': 2,
  'our': 2,
  'days': 1,
  'is': 1,
  'of': 1,
  'course': 1,
  'lives': 1,
  'annie': 1,
  'dillard': 1},
 'Two roads diverged in a wood, and I...I took the one less traveled by, and that has made all the difference. - Robert Frost': {'two': 1,
  'roads': 1,
  'diverged': 1,
  'in': 1,
  'a': 1,
  'wood': 1,
  'and': 2,
  'i': 2,
  'took': 1,
  'the': 2,
  'one': 1,
  'less': 1,
  'traveled': 1,
  'by': 1,
  'that': 1,
  'has': 1,
  'made': 1,
  'all': 1,
  'difference': 1,
  'robert': 1,
  'frost': 1},
 'What is happiness? The feeling that power is growing, that resistance is overcome. - Friedrich Nietzsche': {'what': 1,
  'is': 3,
  'happiness': 1,
  'the': 1,
  'feeling': 1,
  'that': 2,
  'power': 1,
  'growing': 1,
  'resistance': 1,
  'overcome': 1,
  'friedrich': 1,
  'nietzsche': 1},
 'A great deal of intelligence can be invested in ignorance when the need for i
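
The counting loop can also be written with `collections.Counter` from the standard library. The sketch below is an alternative under that assumption, not the notebook's own code; `count_words` and `sample_quotes` are hypothetical names, and the two quotes are taken from the output shown earlier.

```python
import re
from collections import Counter

def count_words(quote):
    """Map each word in the quote to the number of times it occurs."""
    tokens = [w for w in re.split(r'[^a-z0-9_`]+', quote.lower()) if w]
    return dict(Counter(tokens))

sample_quotes = [
    "How we spend our days is, of course, how we spend our lives. - Annie Dillard",
    "The heart has its reasons, of which the mind knows nothing. - Blaise Pascal",
]

# Same shape as postings_list: quote -> {word: count}
sample_postings = {quote: count_words(quote) for quote in sample_quotes}
print(sample_postings[sample_quotes[0]]["spend"])  # 2
```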

#### (d) Build the reverse postings-list dictionary
A reverse
postings-list is a dictionary whose keys are the words, and the values are
themselves dictionaries with the key being a full quote, and the value being
the number of times the word appeared in the full quote. <br>

In [137]:
## Create a reverse_postings_list dictionary with each word as a key and initialize the value as None

reverse_postings_list = {}

for quote in list_of_quotes:
    for word in words_in_quote(quote):
        if word not in reverse_postings_list:
            reverse_postings_list[word] = None

reverse_postings_list

{'how': None,
 'we': None,
 'spend': None,
 'our': None,
 'days': None,
 'is': None,
 'of': None,
 'course': None,
 'lives': None,
 'annie': None,
 'dillard': None,
 'two': None,
 'roads': None,
 'diverged': None,
 'in': None,
 'a': None,
 'wood': None,
 'and': None,
 'i': None,
 'took': None,
 'the': None,
 'one': None,
 'less': None,
 'traveled': None,
 'by': None,
 'that': None,
 'has': None,
 'made': None,
 'all': None,
 'difference': None,
 'robert': None,
 'frost': None,
 'what': None,
 'happiness': None,
 'feeling': None,
 'power': None,
 'growing': None,
 'resistance': None,
 'overcome': None,
 'friedrich': None,
 'nietzsche': None,
 'great': None,
 'deal': None,
 'intelligence': None,
 'can': None,
 'be': None,
 'invested': None,
 'ignorance': None,
 'when': None,
 'need': None,
 'for': None,
 'illusion': None,
 'deep': None,
 'saul': None,
 'bellow': None,
 'those': None,
 'who': None,
 'are': None,
 'preoccupied': None,
 'with': None,
 '`making': None,
 'statement`': None,
 

In [138]:
## Replace each None with a dictionary whose keys are the quotes that contain the word
## and whose values are the frequency of the word in that quote

for word in reverse_postings_list:
    
    dictionary = {}
    
    for quote in list_of_quotes:
        # postings_list already holds the word counts, so no need to re-split the quote
        if word in postings_list[quote]:
            dictionary[quote] = postings_list[quote][word]
        
    reverse_postings_list[word] = dictionary

In [139]:
reverse_postings_list

{'how': {'How we spend our days is, of course, how we spend our lives. - Annie Dillard': 2,
  'Christmas can be celebrated in the school room with pine trees, tinsel and reindeers, but there must be no mention of the man whose birthday is being celebrated. One wonders how a teacher would answer if a student asked why it was called Christmas. - Ronald Reagan': 1,
  'I have often wondered how it is that every man loves himself more than all the rest of men, but yet sets less value on his own opinion of himself than on the opinion of others. - Marcus Aurelius': 1,
  'It`s not how old you are, it`s how hard you work at it. - Jonah Barrington': 2,
  'No matter how much cats fight, there always seem to be plenty of kittens. - Abraham Lincoln': 1,
  'Success, the real success, does not depend upon the position you hold but upon how you carry yourself in that position. - Theodore Roosevelt': 1,
  'Never discourage anyone who continually makes progress, no matter how slow. - Plato': 1,
  'True 
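
Because the postings-list already stores every (quote, word, count) triple, the reverse index can also be built in a single pass over it rather than scanning all quotes once per word. The sketch below illustrates that idea on a toy postings dictionary; `invert_postings` and the keys `"q1"` and `"q2"` are hypothetical stand-ins for real quotes.

```python
from collections import defaultdict

def invert_postings(postings):
    """Turn {quote: {word: count}} into {word: {quote: count}}."""
    reverse = defaultdict(dict)
    for quote, counts in postings.items():
        for word, count in counts.items():
            reverse[word][quote] = count
    return dict(reverse)

toy_postings = {
    "q1": {"how": 2, "we": 2},
    "q2": {"how": 1, "old": 1},
}
toy_reverse = invert_postings(toy_postings)
print(toy_reverse["how"])  # {'q1': 2, 'q2': 1}
```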

#### (e) Writing a TF-IDF function
To measure how much a full
quote is about a particular word, one typically uses the TF-IDF measure.
>• TF stands for “term frequency”; the term frequency of a word w in
a full quote q is the number of times w occurs in q, divided by the
maximum number of times any word occurs in q. <br>
• IDF stands for “inverse document frequency”: the IDF of a word w is
the logarithm of the ratio of the total number N of full quotes to the
number of full quotes that contain the word w. <br>
• TF-IDF of a word w for a full quote q is just the product of the TF
and IDF. <br>
The function tf_idf computes the TF-IDF of any word in any full quote,
using the postings and reverse-postings.<br>

In [140]:
number_of_quotes = len(list_of_quotes)

def tf_idf(word, quote):
    
    '''
    A function to calculate the TF-IDF value for a word-quote pair
    
    Parameters:
    word (str) : A word
    quote (str) : A full quote
    
    Returns:
    The TF-IDF for the word-quote pair, or 0 if the word does not occur in the quote
    '''
    
    if word not in postings_list.get(quote, {}):
        return 0
    
    # TF: count of the word divided by the count of the most frequent word in the quote
    tf = postings_list[quote][word] / max(postings_list[quote].values())
    
    # IDF: natural log of (total number of quotes / number of quotes containing the word)
    idf = math.log(number_of_quotes / len(reverse_postings_list[word]))
    
    return tf * idf

In [141]:
tf_idf('entertainer', 'An actor is at most a poet and at least an entertainer. - Marlon Brando')

3.3933584753025405
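
The printed score can be reproduced by hand from the definitions. In the Marlon Brando quote, "entertainer" occurs once and the most frequent words ("an" and "at") occur twice, so TF is 1/2. The sketch below assumes that "entertainer" appears in no other quote; the full file is not shown here, but the printed value is consistent with that assumption. `math.log` is the natural logarithm.

```python
import math

total_quotes = 886        # len(list_of_quotes) reported earlier
count_in_quote = 1        # "entertainer" occurs once in the quote
max_count_in_quote = 2    # "an" and "at" each occur twice
quotes_with_word = 1      # assumption: no other quote contains "entertainer"

tf = count_in_quote / max_count_in_quote
idf = math.log(total_quotes / quotes_with_word)
score = tf * idf
print(round(score, 4))    # 3.3934
```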


#### (f) Quote search using a single word
The function tf_idf_dictionary takes a word as argument, and returns a dictionary whose keys are full
quotes containing that word, and whose values are the TF-IDF score of that
word for that full quote. <br>


In [142]:
def tf_idf_dictionary(word):
    
    '''
    A function to create a dictionary with keys as full quotes containing input word, 
    and whose values are the TF-IDF score of that word for that full quote.
    
    Parameters:
    word (str) : A word for which the dictionary needs to be created
    
    Returns:
    A dictionary with keys as full quotes containing the word, values are TF-IDF score for the word-quote combination.
    The dictionary is empty if the word does not occur in any quote.
    '''
    
    dictionary = {}
    
    # .get avoids a KeyError for words that are not in the reverse postings-list
    for quote in reverse_postings_list.get(word, {}):
        
        dictionary[quote] = tf_idf(word, quote)
        
    return dictionary

In [143]:
tf_idf_dictionary('how')

{'How we spend our days is, of course, how we spend our lives. - Annie Dillard': 3.742194512881658,
 'Christmas can be celebrated in the school room with pine trees, tinsel and reindeers, but there must be no mention of the man whose birthday is being celebrated. One wonders how a teacher would answer if a student asked why it was called Christmas. - Ronald Reagan': 1.871097256440829,
 'I have often wondered how it is that every man loves himself more than all the rest of men, but yet sets less value on his own opinion of himself than on the opinion of others. - Marcus Aurelius': 1.2473981709605526,
 'It`s not how old you are, it`s how hard you work at it. - Jonah Barrington': 3.742194512881658,
 'No matter how much cats fight, there always seem to be plenty of kittens. - Abraham Lincoln': 3.742194512881658,
 'Success, the real success, does not depend upon the position you hold but upon how you carry yourself in that position. - Theodore Roosevelt': 1.871097256440829,
 'Never discoura
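
To turn such a score dictionary into "best matching quotes", the entries can be sorted by score. The sketch below is one way to do that; `top_matches` is a hypothetical helper, and the sample dictionary uses speaker names as shortened keys with scores rounded from the output above.

```python
def top_matches(scores, k=3):
    """Return the k (quote, score) pairs with the highest scores, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Shortened keys and rounded scores, for illustration only
sample_scores = {
    "Annie Dillard": 3.742,
    "Ronald Reagan": 1.871,
    "Marcus Aurelius": 1.247,
    "Jonah Barrington": 3.742,
}
best = top_matches(sample_scores, k=2)
print(best)
```

Python's sort is stable, so quotes with equal scores keep their original dictionary order.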

#### (g) Quote search using multiple words  
The function word_dictionary takes a list of words as argument, and returns a dictionary whose keys
are full quotes containing one or more of the words in that list, and whose
values are the sum of TF-IDF scores of the words in that list for that full
quote. <br>

In [144]:
def word_dictionary(list_of_words):
    
    '''
    A function that creates a dictionary whose keys are full quotes containing one or more of the words in the input list
    and whose values are the sum of TF-IDF scores of the words in that list for that full quote.
    
    Parameters:
    list_of_words (list): A list of words
    
    Returns:
    A dictionary with quotes as keys that contain one or more words from the list and 
    value is the sum of the TF-IDF score of the words from the input list that occur in the quote
    '''
    
    dictionary = {}
    
    # Each distinct query word contributes its TF-IDF once per quote;
    # repeated occurrences in the quote are already reflected in the term frequency
    query_words = set(list_of_words)
    
    for quote in list_of_quotes:
        tf_idf_value = sum(tf_idf(word, quote) for word in query_words)
        
        # Keep only quotes with a positive score
        if tf_idf_value > 0:
            dictionary[quote] = tf_idf_value
            
    return dictionary

In [107]:
val = word_dictionary(['heart', 'mind', 'disease'])
val

{'The heart has its reasons, of which the mind knows nothing. - Blaise Pascal': 4.281399303556952,
 'It is necessary for the welfare of society that genius should be privileged to utter sedition, to blaspheme, to outrage good taste, to corrupt the youthful mind, and generally to scandalize one`s uncles. - George Bernard Shaw': 0.8157333499005741,
 'In every man`s heart there is a secret nerve that answers to the vibrations of beauty. - Christopher Morley': 4.484131857611035,
 'Everybody can be great... because anybody can serve. You don`t have to have a college degree to serve. You don`t have to make your subject and verb agree to serve. You only need a heart full of grace. A soul generated by love. - Martin Luther King Jr.': 1.1210329644027588,
 'It is always thus, impelled by a state of mind which is destined not to last, that we make our irrevocable decisions. - Marcel Proust': 2.0393333747514353,
 'Love anything and your heart will be wrung and possibly broken. If you want to make 

In [145]:
quote = 'We all wish to be judged by our peers, by the men `after our own heart.` Only they really know our mind and only they judge it by standards we fully acknowledge. Theirs is the praise we really covet and the blame we really dread. The little pockets of early Chrstians survived because they cared exclusively for the love of `the bretheren` and stopped their ears to the opinion of the Pagan society around them. But a circle of criminals, cranks, or perverts survives in just the same way; by becoming deaf to the opinion of the outer world, by discounting it as the chatter of outsiders who `don`t understand,` of the `conventional,` the `bourgeois,` the `Establishment,` of prigs, prudes, and humbugs. - C.S. Lewis'
print(f'Value of the quote in the dictionary: {val[quote]}')

print(f'TF-IDF of disease in the quote: {tf_idf("disease",quote)}')
print(f'TF-IDF of heart in the quote: {tf_idf("heart",quote)}')
print(f'TF-IDF of mind in the quote: {tf_idf("mind",quote)}')


Value of the quote in the dictionary: 0.6116284719367076
TF-IDF of disease in the quote: 0
TF-IDF of heart in the quote: 0.3202951326865025
TF-IDF of mind in the quote: 0.29133333925020505

