# Content-Based Filtering Recommender System

## Table of Contents
1. Loading in Data
2. NLP Preprocessing 

In [2]:
# Imports
import pandas as pd
# import requests
# from bs4 import BeautifulSoup
# from nltk.probability import FreqDist
# from nltk.corpus import stopwords
# from nltk.tokenize import regexp_tokenize, RegexpTokenizer
# from nltk.stem import WordNetLemmatizer

# from scipy.sparse import csr_matrix

# from sklearn.model_selection import train_test_split
# from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.model_selection import cross_val_score, GridSearchCV
# from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

## 1. Loading in Data

I'll be using two datasets: the reviews dataset created in **section 4.c.** of my Data_Preparation Notebook and the metadata dataset created in **section 4.a.** of the same notebook. The former is for building the recommender system; the latter is for receiving user input and returning recommendations.

In [3]:
# Loading in data
reviews = pd.read_csv('data/gr_reviews_per_book.csv', index_col=1)
metadata = pd.read_csv('data/metadata.csv')

In [8]:
reviews.head()

Unnamed: 0.1,Unnamed: 0,book_id,string_tokens
0,0,1,one best book series think get better suspense...
1,1,2,first read book worst one harry potter series ...
2,2,3,remember trying time read always gave page ski...
3,3,5,one definitely good second one much happened r...
4,4,6,best harry potter book far followed closely bo...


In [9]:
reviews.drop(columns = ['Unnamed: 0'], inplace = True)

In [10]:
reviews.head()

Unnamed: 0,book_id,string_tokens
0,1,one best book series think get better suspense...
1,2,first read book worst one harry potter series ...
2,3,remember trying time read always gave page ski...
3,5,one definitely good second one much happened r...
4,6,best harry potter book far followed closely bo...


In [12]:
reviews.isna().sum()

book_id          0
string_tokens    0
dtype: int64

## 2. NLP Preprocessing

The review text needs NLP preprocessing before it can be used meaningfully. Natural Language Processing (NLP) involves inputs of a certain type: namely, "tokenized" text. Ideally, that means a sequence of lowercase, individual, normalized words that carry meaning.

Below we initialize a tokenizer, a stopwords list, and a lemmatizer that we use in a custom function. 

The tokenizer will use a regex pattern to turn every run of at least 3 word characters (letters, digits, or underscores) into a "token."

The stopwords list will be used to remove words like "is" and "the." These are filler words that carry little semantic meaning but still make up a large share of most text. They are not useful for prediction, and they dramatically increase the amount of input a model must process. Therefore, our function iterates through the tokenized text and removes them.

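Finally, our lemmatizer reduces each word to its meaningful base form, or "lemma." Ideally it takes "change," "changes," "changed," and "changing" and identifies them all as the token "change" instead of 4 separate words. This is essentially the "normalization" of text. Note that NLTK's `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so as called in the function below it reduces "changes" to "change" but leaves verb forms like "changed" and "changing" untouched.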
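
To make the first two steps concrete, here is a minimal sketch that uses only Python's built-in `re` module. It applies the same regex pattern as the tokenizer in this section to one made-up sentence. The `SAMPLE_STOPWORDS` set and the `tokenize_and_filter` helper are hypothetical stand-ins for illustration; the notebook itself relies on NLTK's full English stopword list. Lemmatization is left out because it needs NLTK's WordNet data.

```python
import re

# Same pattern as the RegexpTokenizer in this section:
# runs of at least 3 word characters (letters, digits, underscores).
TOKEN_PATTERN = re.compile(r"(?u)\w{3,}")

# Tiny hand-written stopword set, for illustration only (assumption:
# the real notebook uses NLTK's English stopword list instead).
SAMPLE_STOPWORDS = {"the", "and", "was", "but"}

def tokenize_and_filter(text):
    """Lowercase the text, extract tokens, and drop the sample stopwords."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in SAMPLE_STOPWORDS]

sample = "The ending was good, but it is the changes I liked."
tokens = tokenize_and_filter(sample)
print(tokens)  # ['ending', 'good', 'changes', 'liked']
```

Short words such as "it," "is," and "I" never become tokens because of the 3-character minimum, so the stopword list only has to catch the longer filler words.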

In [None]:
# Lowercase the text
reviews['reviews_and_desc'] = reviews['reviews_and_desc'].str.lower()
reviews['reviews_and_desc'][2]

In [None]:
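from nltk.corpus import stopwords as nltk_stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

tokenizer = RegexpTokenizer(r"(?u)\w{3,}")  # Matches runs of at least 3 word characters
stopwords = nltk_stopwords.words("english")  # Alias keeps the nltk module from being shadowed on re-run
lemmatizer = WordNetLemmatizer()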

def preprocessing(text, tokenizer, stopwords, lemmatizer):
    # Tokenize
    tokens = tokenizer.tokenize(text)
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stopwords]
    
    # Lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return tokens

In [None]:
# Apply the preprocessing function to the 'reviews_and_desc' column
reviews_nlp = reviews.copy()
reviews_nlp['list_tokens'] = reviews_nlp['reviews_and_desc'].apply(lambda x: preprocessing(x, tokenizer, stopwords, lemmatizer))
reviews_nlp.head()

In [None]:
type(reviews_nlp['list_tokens'][0])

In [None]:
reviews_nlp = reviews_nlp.drop(columns='reviews_and_desc')

## Vectorizing Reviews

In [29]:
df['list_tokens'] = df['string_tokens'].apply(lambda x: x.split())
df.head()

Unnamed: 0,book_id,string_tokens,average_rating,description,num_pages,ratings_count,publication_year,list_tokens
0,1,one best book series think get better suspense...,4.54,The war against Voldemort is not going well: e...,652.0,1713866.0,2006.0,"[one, best, book, series, think, get, better, ..."
1,2,first read book worst one harry potter series ...,4.47,Harry Potter is due to start his fifth year at...,870.0,1766895.0,2004.0,"[first, read, book, worst, one, harry, potter,..."
2,3,remember trying time read always gave page ski...,4.45,Harry Potter's life is miserable. His parents ...,320.0,4765497.0,1997.0,"[remember, trying, time, read, always, gave, p..."
3,5,one definitely good second one much happened r...,4.53,Harry Potter's third year at Hogwarts is full ...,435.0,1876252.0,2004.0,"[one, definitely, good, second, one, much, hap..."
4,6,best harry potter book far followed closely bo...,4.53,Harry Potter is midway through his training as...,734.0,1792561.0,2002.0,"[best, harry, potter, book, far, followed, clo..."


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25475 entries, 0 to 25474
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   book_id           25475 non-null  int64  
 1   string_tokens     25475 non-null  object 
 2   average_rating    25474 non-null  float64
 3   description       25170 non-null  object 
 4   num_pages         23473 non-null  float64
 5   ratings_count     25474 non-null  float64
 6   publication_year  22391 non-null  float64
 7   list_tokens       25475 non-null  object 
dtypes: float64(4), int64(1), object(3)
memory usage: 1.6+ MB


In [33]:
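from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 100 most frequent terms across the corpus
tfidf = TfidfVectorizer(max_features=100)
doc_term_matrix = tfidf.fit_transform(df['string_tokens'])
df_doc_term_matrix = pd.DataFrame.sparse.from_spmatrix(doc_term_matrix, columns=tfidf.get_feature_names_out())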

In [34]:
df_doc_term_matrix.head()

Unnamed: 0,actually,also,always,another,author,back,bad,best,better,bit,...,want,wanted,way,well,whole,work,world,would,writing,year
0,0.043907,0.076664,0.06414,0.032151,0.004186,0.035596,0.023336,0.039199,0.034591,0.045042,...,0.032335,0.014827,0.064098,0.072012,0.045205,0.013542,0.047176,0.071711,0.022042,0.082924
1,0.041008,0.0937,0.060527,0.018772,0.009517,0.041439,0.023422,0.035759,0.046144,0.056163,...,0.0388,0.019949,0.076559,0.07362,0.041634,0.01415,0.053947,0.073806,0.024361,0.067344
2,0.036545,0.053299,0.054523,0.016347,0.014983,0.044315,0.01968,0.031475,0.037297,0.029757,...,0.032211,0.01981,0.054758,0.065961,0.037399,0.013968,0.123113,0.076544,0.041103,0.126902
3,0.034729,0.091613,0.076617,0.026325,0.005597,0.047452,0.027907,0.071474,0.052153,0.030464,...,0.032427,0.011519,0.061239,0.059839,0.044875,0.009951,0.067974,0.067215,0.027612,0.071605
4,0.030269,0.081345,0.058291,0.025084,0.011102,0.043056,0.028742,0.044107,0.042506,0.04532,...,0.033135,0.011424,0.066903,0.057447,0.045566,0.010389,0.101918,0.080373,0.030484,0.073607
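
To give a feel for where numbers like the ones in this table come from, here is a minimal pure-Python sketch of TF-IDF weighting on three made-up documents. It assumes the weighting that, as far as I know, `TfidfVectorizer` applies by default: a smoothed idf of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The `tfidf_rows` helper and the toy documents are hypothetical, and splitting on whitespace is a simplification of the vectorizer's own tokenization.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df_counts = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency (assumed sklearn default).
    idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts; missing terms count as 0
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

docs = ["best book series", "worst book series", "best harry potter book"]
vocab, rows = tfidf_rows(docs)

for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

"book" appears in every toy document, so it gets the lowest idf and a smaller weight than rarer terms in the same row. That is the same reason very common words end up with small values in the matrix above.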


In [18]:
df = df.drop(columns = ['string_tokens'])

In [35]:
df_final = df_doc_term_matrix.merge(df, left_index=True, right_index=True)
df_final.head()

Unnamed: 0,actually,also,always,another,author,back,bad,best,better,bit,...,writing,year,book_id,string_tokens,average_rating,description,num_pages,ratings_count,publication_year,list_tokens
0,0.043907,0.076664,0.06414,0.032151,0.004186,0.035596,0.023336,0.039199,0.034591,0.045042,...,0.022042,0.082924,1,one best book series think get better suspense...,4.54,The war against Voldemort is not going well: e...,652.0,1713866.0,2006.0,"[one, best, book, series, think, get, better, ..."
1,0.041008,0.0937,0.060527,0.018772,0.009517,0.041439,0.023422,0.035759,0.046144,0.056163,...,0.024361,0.067344,2,first read book worst one harry potter series ...,4.47,Harry Potter is due to start his fifth year at...,870.0,1766895.0,2004.0,"[first, read, book, worst, one, harry, potter,..."
2,0.036545,0.053299,0.054523,0.016347,0.014983,0.044315,0.01968,0.031475,0.037297,0.029757,...,0.041103,0.126902,3,remember trying time read always gave page ski...,4.45,Harry Potter's life is miserable. His parents ...,320.0,4765497.0,1997.0,"[remember, trying, time, read, always, gave, p..."
3,0.034729,0.091613,0.076617,0.026325,0.005597,0.047452,0.027907,0.071474,0.052153,0.030464,...,0.027612,0.071605,5,one definitely good second one much happened r...,4.53,Harry Potter's third year at Hogwarts is full ...,435.0,1876252.0,2004.0,"[one, definitely, good, second, one, much, hap..."
4,0.030269,0.081345,0.058291,0.025084,0.011102,0.043056,0.028742,0.044107,0.042506,0.04532,...,0.030484,0.073607,6,best harry potter book far followed closely bo...,4.53,Harry Potter is midway through his training as...,734.0,1792561.0,2002.0,"[best, harry, potter, book, far, followed, clo..."


In [38]:
df['publication_year'].value_counts()

publication_year
2013.0    3209
2014.0    2833
2012.0    2612
2015.0    2606
2016.0    2054
          ... 
1966.0       1
1960.0       1
1965.0       1
1975.0       1
16.0         1
Name: count, Length: 70, dtype: int64

## Scaling Certain Columns

Candidate columns to scale: `average_rating`, `num_pages`, `ratings_count`, `publication_year`?
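
One possible approach, sketched here under assumptions rather than chosen for the final system, is min-max scaling with plain pandas. The `min_max_scale` helper is hypothetical, and the toy frame reuses a few values from the rows shown earlier plus one missing entry. Note that an outlier such as the `16.0` in `publication_year` would stretch the range and squeeze every other year toward 1, so it would need cleaning before this kind of scaling.

```python
import pandas as pd

# Toy frame using two of the notebook's column names; values are taken from
# the first rows shown earlier, with one missing value added for illustration.
toy = pd.DataFrame({
    "average_rating": [4.54, 4.47, 4.45, 4.53],
    "num_pages": [652.0, 870.0, 320.0, None],
})

def min_max_scale(frame, columns):
    """Return a copy of frame with the given columns rescaled to [0, 1]."""
    scaled = frame.copy()
    for col in columns:
        col_min = scaled[col].min()  # NaN values are skipped by default
        col_max = scaled[col].max()
        # A constant column would divide by zero here and produce NaN.
        scaled[col] = (scaled[col] - col_min) / (col_max - col_min)
    return scaled

toy_scaled = min_max_scale(toy, ["average_rating", "num_pages"])
print(toy_scaled)
```

Missing values stay missing after scaling, so columns like `num_pages` and `publication_year`, which contain nulls according to `df.info()`, would still need an imputation or filtering step.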