# Model Training Part 1

## Popularity-Based Filtering & Content-Based Filtering

by: [Yachi Darji](https://www.linkedin.com/in/yachi-darji/)

In my first notebook, I explored the data and answered several deep-dive questions. From that exploration, we observed that people tend to give lower ratings as they read more books. We thought this could be the result of a poor book recommendation system that leads people to read books they don't like.

Now it's time to develop recommendation systems. In this notebook, the recommendation systems are developed based on several methods:

1. Basic Recommender
2. Content-based Filtering
3. Collaborative Filtering
   
In the end, I will compare the results given by the recommendation systems and explore the strengths and weaknesses of each model.

### Import Libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# for content based filtering
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
rating=pd.read_csv("C:/Users/yachi/final/myproject/Data/ratings.csv")
book=pd.read_csv("C:/Users/yachi/final/myproject/Data/final_Data.csv")


### 1. Simple Recommendation

#### a. Recommendation based on weighted average of rating and popularity

One of the easiest ways to give recommendations is to rank books by `average_rating` or by `ratings_count` (popularity). However, as mentioned in the EDA, each ranking on its own has a problem:

1. Ranking by `average_rating` surfaces books with a relatively low `ratings_count` (less popular books).
2. Ranking by `ratings_count` surfaces books with a relatively low `average_rating`.

Therefore, we need a weighted rating that combines `average_rating` and `ratings_count`. In this case, I will use a rating formula like the one IMDb uses to determine its Top Rated 250 Movies.

The new rating score is determined by the following equation.


![New Rating Formula](img/formula.png "New Rating Score")

where:

* v = number of ratings (`ratings_count`)
* m = minimum `ratings_count` required to be recommended
* R = average rating of the book (`average_rating`)
* C = the mean rating across all books
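In case the image does not render, the formula can be written out from the symbol definitions above. This form is read off the `score` expression in the `simple_recommender` code cell, so treat it as a transcription of that code rather than of the image itself:

$$
\text{score} = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
$$

The two weights sum to 1, so the score always lies between the book's own average R and the global mean C.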

Now, let's determine an appropriate value for m, the number of ratings needed to be listed in the chart. For this simple recommender, our cutoff will be the 95th percentile of `ratings_count`. In other words, to reach the cutoff a book must receive more ratings than 95% of the other books on the list (around 2100 ratings).
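To see how the formula trades rating against popularity, here is a small worked sketch. The function name `weighted_rating` and all numbers are hypothetical, chosen only for illustration; they are not taken from the data set.

```python
def weighted_rating(v, R, m, C):
    """Blend a book's own average R with the global mean C, weighted by its rating count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical values (assumptions, not from ratings.csv or final_Data.csv)
m = 2000   # assumed cutoff on the number of ratings
C = 4.0    # assumed mean rating across all books

# A niche book: very high average, but only a handful of ratings
niche = weighted_rating(v=50, R=4.9, m=m, C=C)

# A popular book: slightly lower average, but many ratings
popular = weighted_rating(v=200_000, R=4.5, m=m, C=C)

print(f"niche:   {niche:.3f}")
print(f"popular: {popular:.3f}")
```

The niche book is pulled almost all the way down to C because v is small compared with m, while the popular book keeps a score close to its own average. This is why the weighted score ranks the popular book higher even though its raw average is lower.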

In [4]:
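def simple_recommender(book_data, n=5):
    # Work on a copy so the caller's DataFrame is not modified
    book_data = book_data.copy()
    v = book_data['ratings_count']                 # number of ratings per book
    m = book_data['ratings_count'].quantile(0.95)  # cutoff: 95th percentile of ratings_count
    R = book_data['average_rating']                # average rating per book
    C = book_data['average_rating'].mean()         # mean rating over all books
    book_data['score'] = (v / (v + m)) * R + (m / (v + m)) * C

    # Rank all books by the weighted score, highest first
    qualified = book_data.sort_values('score', ascending=False)
    return qualified[['book_id', 'title', 'authors', 'average_rating', 'ratings_count', 'score']].head(n)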

In [5]:
simple_recommender(book)

| | book_id | title | authors | average_rating | ratings_count | score |
|---|---|---|---|---|---|---|
| 21 | 25 | harry potter and the deathly hallows (harry po... | J.K. Rowling | 4.61 | 1746574 | 4.555956 |
| 23 | 27 | harry potter and the half-blood prince (harry ... | J.K. Rowling | 4.54 | 1678823 | 4.490428 |
| 15 | 18 | harry potter and the prisoner of azkaban (harr... | J.K. Rowling | 4.53 | 1832823 | 4.48509 |
| 20 | 24 | harry potter and the goblet of fire (harry pot... | J.K. Rowling | 4.53 | 1753043 | 4.483227 |
| 1 | 2 | harry potter and the sorcerer's stone (harry p... | J.K. Rowling | 4.44 | 4602479 | 4.424365 |


#### b. Evaluation

This system offers generalized recommendations to every user based on the popularity and average rating of each book. The recommender has some flaws. For example, it makes the same suggestions to everyone, regardless of their own preferences. The top of our chart is full of J.K. Rowling's Harry Potter novels.

To personalize our recommendations, we are going to create a recommendation system that compares books based on a set of metrics and suggests the books most similar to a particular book that a user liked.

### 2. Content Based Recommendation System

To personalize our recommendations, we will measure the cosine similarity between books.

1. Create a new column that combines the authors, title, genres and description of each book.
2. Use `TfidfVectorizer` to convert the text into vectors.
3. Calculate the cosine similarity score between all pairs of books.
4. The user inputs their favourite book, and we sort the other books by how similar they are to that input.
5. Recommend the most similar books to the user.

Since `TfidfVectorizer` L2-normalizes each vector by default, the dot product directly gives us the cosine similarity score. Therefore, we will use `sklearn`'s `linear_kernel` instead of `cosine_similarity`, since it skips the extra normalization step and is faster.
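The reasoning behind using a plain dot product can be checked with a short sketch. It relies on `TfidfVectorizer`'s default `norm='l2'` setting, under which every row has unit length. The helper names and the two vectors below are hypothetical, and the example uses only the standard library rather than scikit-learn.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Two hypothetical term-weight vectors (illustrative values only)
a = [1.0, 2.0, 0.0, 3.0]
b = [0.5, 0.0, 4.0, 1.0]

# After L2 normalization, the dot product alone gives the cosine similarity
plain_dot = dot(l2_normalize(a), l2_normalize(b))
cosine = cosine_similarity(a, b)

print(math.isclose(plain_dot, cosine))
```

If the vectorizer were created with `norm=None`, this shortcut would no longer hold and the dot product would depend on document length.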

In [6]:
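def content(books):
    # Combine the text fields into one string per book (missing values become empty strings)
    text_cols = ['authors', 'title', 'genres', 'description']
    books['content'] = books[text_cols].fillna('').astype(str).agg(' '.join, axis=1)

    # TF-IDF over unigrams and bigrams; rows are L2-normalized by default
    tf_content = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
    tfidf_matrix = tf_content.fit_transform(books['content'])
    # On normalized rows the dot product equals the cosine similarity
    cosine = linear_kernel(tfidf_matrix, tfidf_matrix)
    # Map each title to its row position in the similarity matrix
    index = pd.Series(np.arange(len(books)), index=books['title'])

    return cosine, index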

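def content_recommendation(books, title, n=5):
    title = title.lower()  # the lookup assumes titles are stored in lowercase
    cosine_sim, indices = content(books)
    idx = indices[title]  # raises KeyError if the title is not in the data
    # Pair every book position with its similarity to the chosen book
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the book itself) and keep the next n
    sim_scores = sim_scores[1:n + 1]
    book_indices = [i[0] for i in sim_scores]
    return books[['book_id', 'title', 'authors', 'average_rating', 'ratings_count']].iloc[book_indices]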

In [7]:
content_recommendation(book,'1984')

| | book_id | title | authors | average_rating | ratings_count |
|---|---|---|---|---|---|
| 795 | 846 | animal farm / 1984 | George Orwell | 4.26 | 116197 |
| 2048 | 2187 | we | Yevgeny Zamyatin | 3.95 | 40020 |
| 3670 | 4004 | homage to catalonia | George Orwell | 4.14 | 22227 |
| 6857 | 8056 | 1q84 #1-2 (1q84, #1-2) | Haruki Murakami | 4.07 | 8342 |
| 4915 | 5510 | the far side gallery | Gary Larson | 4.42 | 20022 |


In [8]:
content_recommendation(book,'it')

| | book_id | title | authors | average_rating | ratings_count |
|---|---|---|---|---|---|
| 767 | 818 | dreamcatcher | Stephen King | 3.59 | 115855 |
| 9857 | 9559 | riding the bullet | Stephen King | 3.6 | 9809 |
| 9316 | 7884 | stephen king's n. | Marc Guggenheim | 4.22 | 10844 |
| 856 | 911 | gerald's game | Stephen King | 3.47 | 100158 |
| 886 | 944 | desperation | Stephen King | 3.8 | 94821 |


In [9]:

content_recommendation(book,'Emma')

| | book_id | title | authors | average_rating | ratings_count |
|---|---|---|---|---|---|
| 4205 | 4653 | the complete novels | Jane Austen | 4.55 | 18828 |
| 9 | 10 | pride and prejudice | Jane Austen | 4.24 | 2035490 |
| 2644 | 2839 | war brides | Helen Bryan | 3.75 | 16565 |
| 415 | 451 | northanger abbey | Jane Austen | 3.8 | 205167 |
| 4192 | 4637 | the proposal (the proposition, #2) | Katie Ashley | 4.04 | 34169 |


Notice that for '1984' the system recommends a book whose `average_rating` (3.95) is lower than average and a book with a low `ratings_count` (8342). We will try to improve our recommendations by adding a popularity-rating filter.

#### b. Content Based + Popularity-Rating Filter

A mechanism to remove books with low ratings has been added on top of the content-based filtering. This system returns books that are similar to your input, popular, and highly rated. In this filter, however, the cutoff is the 75th percentile. For a book to appear in the recommendation, it must rank among the 25 most similar books and have a `ratings_count` at or above the 75th percentile of the other books on the list (around 800 ratings).
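The two-stage idea (take the most similar candidates, then filter and re-rank them by popularity and rating) can be sketched on toy data. Everything below is hypothetical: the candidate list, the `percentile` helper and the book names are assumptions for illustration. The helper uses linear interpolation, the same scheme as the default of pandas' `quantile`.

```python
import statistics


def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


# Hypothetical candidates, assumed to be the most similar books already:
# (title, average_rating, ratings_count)
candidates = [
    ("book a", 4.4, 900),
    ("book b", 3.6, 150_000),
    ("book c", 4.1, 60_000),
    ("book d", 4.0, 5_000),
    ("book e", 3.9, 300_000),
]

m = percentile([count for _, _, count in candidates], 0.75)   # popularity cutoff
C = statistics.median(rating for _, rating, _ in candidates)  # typical rating among candidates


def new_score(rating, count):
    """Weighted score: the same form as the simple recommender, computed on the candidates."""
    return (count / (count + m)) * rating + (m / (count + m)) * C


# Keep candidates at or above the cutoff, then rank by the weighted score
kept = [c for c in candidates if c[2] >= m]
ranked = sorted(kept, key=lambda c: new_score(c[1], c[2]), reverse=True)

for title, rating, count in ranked:
    print(title, round(new_score(rating, count), 3))
```

With these toy numbers the cutoff is 150,000 ratings, so only `book b` and `book e` survive, and `book e` ranks first. Note that the highly rated but rarely rated `book a` is dropped entirely, which is the intended effect of the filter.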

In [10]:
def improved_recommendation(books, title, n=5):
    title = title.lower()
    cosine_sim, indices = content(books)
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]  # the 25 most similar books, excluding the input itself
    book_indices = [i[0] for i in sim_scores]
    # Use the `books` argument (not the global `book`); copy so a column can be added safely
    books2 = books.iloc[book_indices][['book_id', 'title', 'authors', 'average_rating', 'ratings_count']].copy()

    v = books2['ratings_count']
    m = books2['ratings_count'].quantile(0.75)  # cutoff: 75th percentile among the 25 candidates
    R = books2['average_rating']
    C = books2['average_rating'].median()  # median (not mean) rating of the candidates
    books2['new_score'] = (v / (v + m)) * R + (m / (v + m)) * C

    # Keep only candidates at or above the cutoff, then rank by the weighted score
    high_rating = books2[books2['ratings_count'] >= m]
    high_rating = high_rating.sort_values('new_score', ascending=False)

    return high_rating[['book_id', 'title', 'authors', 'average_rating', 'ratings_count', 'new_score']].head(n)

In [11]:
improved_recommendation(book,'Emma')

| | book_id | title | authors | average_rating | ratings_count | new_score |
|---|---|---|---|---|---|---|
| 9 | 10 | pride and prejudice | Jane Austen | 4.24 | 2035490 | 4.225682 |
| 208 | 230 | persuasion | Jane Austen | 4.13 | 365425 | 4.090751 |
| 38 | 43 | jane eyre | Charlotte Brontë | 4.1 | 1198557 | 4.088262 |
| 590 | 635 | sophie's world | Jostein Gaarder | 3.88 | 109692 | 3.92 |
| 415 | 451 | northanger abbey | Jane Austen | 3.8 | 205167 | 3.855742 |


In [12]:
improved_recommendation(book,'IT')

| | book_id | title | authors | average_rating | ratings_count | new_score |
|---|---|---|---|---|---|---|
| 64 | 72 | the shining (the shining #1) | Stephen King | 4.17 | 791850 | 4.140806 |
| 214 | 237 | carrie | Stephen King | 3.93 | 356814 | 3.925616 |
| 276 | 305 | pet sematary | Stephen King | 3.91 | 256383 | 3.91 |
| 860 | 915 | insomnia | Stephen King | 3.79 | 100972 | 3.849757 |
| 513 | 556 | cujo | Stephen King | 3.65 | 158215 | 3.750789 |


In [13]:
improved_recommendation(book,'1984')

| | book_id | title | authors | average_rating | ratings_count | new_score |
|---|---|---|---|---|---|---|
| 795 | 846 | animal farm / 1984 | George Orwell | 4.26 | 116197 | 4.202753 |
| 759 | 809 | brave new world / brave new world revisited | Aldous Huxley | 4.16 | 108124 | 4.127083 |
| 1044 | 1120 | aesop's fables | Aesop | 4.05 | 88508 | 4.046841 |
| 8316 | 2375 | tinker, tailor, soldier, spy | John le Carré | 4.04 | 40871 | 4.04 |
| 604 | 649 | 1q84 | Haruki Murakami | 3.89 | 125195 | 3.926917 |


#### c. Evaluation

This method is suitable for people who are looking for similar books, but it cannot capture a user's tastes or provide recommendations across genres. Therefore, we will try to build a recommendation system using collaborative filtering.