# Introduction

In this exercise, we will:
- Evaluate the text similarity of Amazon book search results
    - Search for a book title in Amazon Books
    - In Python, compare the book titles to each other using a text similarity measure (cosine similarity)
    - Determine which titles are the most similar and which are the most dissimilar
    - Determine where they rank among the first 24 results
- Evaluate the text similarity of search engine results for one of the books
    - Enter one of the book titles into a search engine
    - Run the same text similarity calculation as in Question 1
    - Determine which result has the highest similarity measure
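
As a minimal sketch of the measure itself: cosine similarity is the dot product of two word-count vectors divided by the product of their lengths. The helper name `cosine` and the toy vocabulary below are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D count vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word counts over the vocabulary ["effective", "habits", "kids", "teens"].
title_a = [1, 1, 0, 1]  # "habits effective teens"
title_b = [0, 1, 1, 0]  # "habits kids"

# One shared word out of vectors of length sqrt(3) and sqrt(2): 1 / sqrt(6).
print(round(cosine(title_a, title_b), 3))
```

A value of 1 means the two count vectors point in the same direction, and 0 means the titles share no words.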

#### Preparation Steps
- Import the necessary packages
- Compile book titles from Amazon books search results page

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy

### Question 1

In [2]:
book_titles = ["The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change",
               "By Stephen R. Covey: The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change",
               "The 7 Habits of Highly Effective Teens",
               "The 7 Habits of Highly Effective People Personal Workbook",
               "Emotional Intelligence 2.0: Mastery of 7 Modern Psychological Steps to Develop Your EQ, Improve Social Skills, Achieve the Habits of Highly Effective People and Discover Why It Matters More than IQ",
               "The 7 Habits of Highly Effective Teens Personal Workbook",
               "The 7 Habits of Happy Kids",
               "The 7 Habits of Highly Effective Families",
               "Summary: The 7 Habits Of Highly Effective People by Stephen R.Covey - More knowledge in less time",
               "Seven Habits of Highly Effective People: Restoring the Character Ethic",
               "The Stephen R. Covey Interactive Reader - 4 Books in 1: The 7 Habits of Highly Effective People, First Things First, and the Best of the Most Renowned Leadership Teacher of our Time",
               "The 7 Habits of Highly Effective People: 30th Anniversary Guided Journal",
               "The 7 Habits of Highly Effective People (Chinese Edition)",
               "The 7 Habits of Highly Effective People - Signature Series: Insights from Stephen R. Covey",
               "The 7 Habits of Highly Effective People (Gujarati Edition)",
               "The 7 Habits Of Highly Effective People - Restoring The Character Ethic",
               "The 7 Habits of Highly Effective People (Unabridged Audio Program) 15th (fifteenth) Anniversary Edition by Covey, Stephen R. published by Franklin Covey (2011)",
               "Selected Works of Stephen Covey: The 7 Habits of Highly Effective People 25th Anniversary Edition, Execution Essentials, Management Essentials, Leadership Essentials",
               "Summary: The 7 Habits of Highly Effective People - Powerful Lessons in Personal Change by Stephen R. Covey",
               "A Self-Guided Workbook for Highly Effective Teens: A Companion to the Best Selling 7 Habits of Highly Effective Teens (Gift for Teens and Tweens)",
               "Los Siete Habitos de las Personas Altamente Eficaces [The Seven Habits of Highly Effective People]",
               "The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change (Japanese Edition)",
               "7 Habits of Highly Effective College Students",
               "The 7 Habits of Highly Effective People [Arabic Edition]: Powerful Lessons in Personal Change"]

In [3]:
len(book_titles)

24

In [4]:
# Default settings keep English stop words such as "the" and "of".
# Pass stop_words='english' to CountVectorizer to drop them instead.
count_vectorizer = CountVectorizer()
sparse_matrix = count_vectorizer.fit_transform(book_titles)

In [5]:
doc_term_matrix = sparse_matrix.toarray()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=[f'book_{i}' for i in range(1, len(book_titles) + 1)])
df

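| | 15th | 2011 | 25th | 30th | achieve | altamente | and | anniversary | arabic | audio | ... | the | things | time | to | tweens | unabridged | why | workbook | works | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| book_1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| book_2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| book_3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| book_4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| book_5 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| book_6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| book_7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| book_8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| book_9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| book_10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |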


In [6]:
# Row 0 of the pairwise matrix: similarity of book 1 to each of the 24 titles
cosim = cosine_similarity(df, df)[0]

In [7]:
cosim

array([1.        , 0.88640526, 0.61545745, 0.74620251, 0.37907125,
       0.63960215, 0.40451992, 0.61545745, 0.54494926, 0.57207755,
       0.52223297, 0.57207755, 0.63960215, 0.52223297, 0.63960215,
       0.6092718 , 0.36181361, 0.39886202, 0.85634884, 0.36196138,
       0.46709937, 0.91986621, 0.49236596, 0.91986621])

In [8]:
print("Book Title 22: ", book_titles[21])
print("Book Title 24: ", book_titles[23])
print("Book Title 1: ", book_titles[0])
print("Book Title 17: ", book_titles[16])

Book Title 22:  The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change (Japanese Edition)
Book Title 24:  The 7 Habits of Highly Effective People [Arabic Edition]: Powerful Lessons in Personal Change
Book Title 1:  The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change
Book Title 17:  The 7 Habits of Highly Effective People (Unabridged Audio Program) 15th (fifteenth) Anniversary Edition by Covey, Stephen R. published by Franklin Covey (2011)


#### Most similar and dissimilar books
- Books 22 and 24 are the most similar to book 1 (cosine similarity of about 0.92 each)
- Book 17 is the most dissimilar to book 1 (cosine similarity of about 0.36)
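
The `cosim` vector only holds similarities to the first title. To find the most and least similar pair across all titles, the full pairwise matrix is needed. The following is a sketch of one way to do that, using a four-title subset for brevity; the variable names are illustrative and this is not the calculation the notebook ran.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Small subset of the titles above, used only for illustration.
sample_titles = [
    "The 7 Habits of Highly Effective Teens",
    "The 7 Habits of Highly Effective Teens Personal Workbook",
    "The 7 Habits of Happy Kids",
    "The 7 Habits of Highly Effective Families",
]

counts = CountVectorizer().fit_transform(sample_titles)
sim = cosine_similarity(counts)  # full n x n matrix, not just row 0

# Use only the upper triangle (i < j) so each pair is considered once
# and the self-similarity on the diagonal is ignored.
rows, cols = np.triu_indices(len(sample_titles), k=1)
pair_scores = sim[rows, cols]

best = pair_scores.argmax()
worst = pair_scores.argmin()
most_similar = (int(rows[best]), int(cols[best]))
most_dissimilar = (int(rows[worst]), int(cols[worst]))

print("Most similar pair (0-based):", most_similar)
print("Most dissimilar pair (0-based):", most_dissimilar)
```

Applied to all 24 titles, the same approach would compare every pair rather than only pairs that include book 1.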

In [9]:
temp = cosim.argsort()
ranks = numpy.empty_like(temp)
ranks[temp] = numpy.flip(numpy.arange(len(cosim)))

In [10]:
ranks

array([ 0,  3,  9,  5, 21,  8, 19, 10, 14, 13, 15, 12,  7, 16,  6, 11, 23,
       20,  4, 22, 18,  1, 17,  2], dtype=int64)
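
The ranking cell inverts the sort order so that each title gets its position in the descending list of scores, starting at 0. An equivalent and perhaps easier-to-read sketch, using the first five scores from the output above:

```python
import numpy as np

# First five similarity scores from the output above.
scores = np.array([1.0, 0.88640526, 0.61545745, 0.74620251, 0.37907125])

# Indices of the scores from highest to lowest.
order = np.argsort(-scores)

# Invert the permutation: ranks[i] is the 0-based position of scores[i]
# in the descending ordering (0 = most similar).
ranks = np.empty_like(order)
ranks[order] = np.arange(len(scores))

print(ranks)  # [0 1 3 2 4]
```

Tied scores, such as the two values of 0.91986621 in the full output, still receive different ranks, and their relative order depends on the sort algorithm unless a stable sort is requested.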

#### How the books rank:

- Ranks are 0-based and ordered by similarity to book 1, so book 1 itself has rank 0 (1st)
- Books 22 and 24 rank 2nd and 3rd among the 24 books
- Book 17 ranks 24th (last) among the 24 books

### Question 2

#### Enter the book title into a search engine and get the capsule text from google.com. Compare the book title to each capsule text and compute the text similarity score

In [11]:
capsule_text = ["The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change",
                "The 7 Habits of Highly Effective People: Powerful Lessons in ...\n www.amazon.com › Habits-Highly-Effective-People-Powerful-ebook\n This 7 Habits book guides you through each habit step-by-step: Habit 1: Be Proactive. Habit 2: Begin With The End In Mind. Habit 3: Put First Things First. Habit 4: Think Win-Win. Habit 5: Seek First To Understand Then Be Understood. Habit 6: Synergize. Habit 7: Sharpen The Saw.", 
                "7 Habits of Highly Effective People - QuickMBA\n www.quickmba.com › Management\n Summary of The 7 Habits of Highly Effective People, Stephen F. Covey's bestseller ... In his #1 bestseller, Stephen R. Covey presented a framework for personal ... paradigms are right, simply changing outward behavior is not effective. ... Our character is a collection of our habits, and habits have a powerful role in our lives."]

In [12]:
# Refit the vectorizer so the vocabulary comes from the title and capsule texts
sparse_matrix_capsule = count_vectorizer.fit_transform(capsule_text)

In [13]:
capsule_term_matrix = sparse_matrix_capsule.toarray()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df_cap = pd.DataFrame(capsule_term_matrix,
                      columns=count_vectorizer.get_feature_names_out(),
                      index=['book_title', 'cap_text_1', 'cap_text_2'])
df_cap

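| | amazon | and | are | be | begin | behavior | bestseller | book | by | change | ... | think | this | through | to | understand | understood | win | with | www | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| book_title | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cap_text_1 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |
| cap_text_2 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |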


In [14]:
cosine_similarity(df_cap, df_cap)[0]

array([1.        , 0.44020439, 0.58296404])

#### Compare scores:

- The capsule text from the 20th organic result (`cap_text_2`, about 0.58) has a higher similarity to the original book title than `cap_text_1` (about 0.44).
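
The comparison of one title against several snippets can also be written so that only the query row is compared and the best match is picked programmatically. This is a sketch under assumptions: the snippet texts are shortened, hypothetical stand-ins for real capsule text, and the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "The 7 Habits of Highly Effective People"
# Hypothetical snippet texts, shortened for illustration.
snippets = [
    "This book guides you through each habit step by step",
    "Summary of The 7 Habits of Highly Effective People",
]

# Fit one vocabulary over the query and all snippets.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform([query] + snippets)

# Compare the query row against the snippet rows only.
scores = cosine_similarity(matrix[:1], matrix[1:]).ravel()
best_index = int(np.argmax(scores))

print(scores)
print("Best matching snippet:", snippets[best_index])
```

Note that "habit" and "habits" count as different words here, so the first snippet shares no tokens with the query. Stemming or lemmatization would change that, but neither is used in this notebook.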

### Conclusion

In this assignment, I started by searching for a book on Amazon. I then extracted the title and subtitle of each of the first 24 results. The next step was to calculate the cosine similarity between the first title and every title in the list, and to identify the most similar and most dissimilar books.

- Book Title 22:  The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change (Japanese Edition)
- Book Title 24:  The 7 Habits of Highly Effective People [Arabic Edition]: Powerful Lessons in Personal Change
- Book Title 1:  The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change
- Book Title 17:  The 7 Habits of Highly Effective People (Unabridged Audio Program) 15th (fifteenth) Anniversary Edition by Covey, Stephen R. published by Franklin Covey (2011)

#### Most similar and dissimilar books
- Books 22 and 24 are the most similar to book 1 (cosine similarity of about 0.92 each)
- Book 17 is the most dissimilar to book 1 (cosine similarity of about 0.36)

#### How the books rank:

- Book 1 compared with itself ranks 1st
- Books 22 and 24 rank 2nd and 3rd among the 24 books
- Book 17 ranks 24th (last) among the 24 books

The next step was to search for the book title on Google.com and collect the capsule text of the search results, then apply the same text similarity measure used in the first question. The book title was compared to each of the two capsule texts, giving cosine similarities of about 0.44 and 0.58. The 20th organic result had the higher cosine similarity.

### References

- https://www.machinelearningplus.com/nlp/cosine-similarity/