#readMoreCanlit | A Recommender System for Canadian Literature

Shawn Syms
https://shawnsyms.github.io/
shawn@shawnsyms.com
416-843-4169

Executive summary

Overview

#readMoreCanlit is a content-based recommender system that promotes the reading of Canadian literature.

According to a recent survey by the non-profit organization BookNet Canada, 92% of Canadians had read a paper copy of a book, and use of e-reading devices had increased by 25% over the previous year. Still, Canadian best-seller lists tend to be dominated by American and other international titles. #readMoreCanlit could serve as part of a consolidated marketing program to promote the reading of Canadian literature. Future phases could include a purchasing engine.

Problem statement

I will build a content-based recommender system that accepts input from readers about content they like, and makes appropriate recommendations of similar titles from a dataframe of Canadian books:

The user should be able to provide the name of a book, or select one from a list of 5 titles that the system already knows about
The user should also be able to input a freeform sentence about the type of book that they like
The baseline for comparison could be whether the system makes recommendations that are better than random guessing (if time permits, I could also build the random generator; in fact, it could be a feature, i.e., just tell me about any 5 Canadian books)
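
The random baseline mentioned above could look something like the following. This is only a minimal sketch: the function name `random_recommendations`, the `seed` parameter and the placeholder titles are assumptions made for illustration and are not taken from the project code.

```python
import random


def random_recommendations(titles, k=5, seed=None):
    """Return up to k distinct titles chosen uniformly at random.

    Hypothetical helper: a recommender is only useful if its picks
    are judged better than what this function returns.
    """
    rng = random.Random(seed)   # a seed makes the picks reproducible
    pool = list(titles)
    k = min(k, len(pool))       # never ask for more titles than exist
    return rng.sample(pool, k)


# Placeholder titles standing in for the Canadian catalogue
canadian_titles = [f"Canadian title {n}" for n in range(1, 21)]
picks = random_recommendations(canadian_titles, k=5, seed=42)
```

The same function could double as the "just tell me about any 5 Canadian books" feature.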

Data sources

The system requires the following data points: title, author and description
For Canadian books, the source was a specialist site called 49thshelf.com, from which I extracted information on 8,500 Canadian fiction titles
After removing duplicates, I had 6,775 Canadian titles
To get the non-Canadian titles, I needed to source International Standard Book Numbers (ISBNs). I searched online and found extensive lists of ISBNs in several places, including openlibrary.org and data.planet. From these sources, I collected 2.7M ISBNs
I used the API furnished by a directory called ISBNdb.com to cycle through the ISBNs 15,000 at a time (a daily limit), querying the database for title, author and description information (the lists themselves contained the ISBNs only)
I found that the ratio of database entries that actually included descriptions (which are core to the recommender system) was quite small -- five to ten percent at best -- so the process of gathering sufficient numbers of non-Canadian titles has been somewhat painstaking
Ultimately I was able to produce a list of over 10,000 international titles on which to train the model
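
The batching and filtering steps described above can be sketched roughly as follows. The real ISBNdb.com API call is deliberately not shown: `lookup` is a hypothetical callable standing in for it, and `daily_batches`, `collect_usable` and `fake_lookup` are names invented for this example.

```python
from itertools import islice

DAILY_LIMIT = 15_000  # daily request cap described above


def daily_batches(isbns, batch_size=DAILY_LIMIT):
    """Yield successive lists of at most batch_size ISBNs."""
    it = iter(isbns)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def collect_usable(batch, lookup):
    """Keep only records that have a title, an author and a description.

    `lookup` is a hypothetical function that takes an ISBN and returns
    a dict (or None); it stands in for the real API request.
    """
    usable = []
    for isbn in batch:
        record = lookup(isbn)
        if not record:
            continue
        fields = ("title", "author", "description")
        if all((record.get(f) or "").strip() for f in fields):
            usable.append(record)
    return usable


def fake_lookup(isbn):
    # Stand-in data: only every tenth ISBN comes back with a description
    n = int(isbn)
    description = "A story." if n % 10 == 0 else ""
    return {"title": f"Book {n}", "author": "Unknown", "description": description}


isbns = [str(n) for n in range(1, 101)]
batches = list(daily_batches(isbns, batch_size=40))
kept = [rec for batch in batches for rec in collect_usable(batch, fake_lookup)]
```

In this toy run, 100 ISBNs split into batches of 40, 40 and 20, and 10 records survive the filter, which mirrors the low description rate noted above.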

Metrics

Recommender systems are known to be challenging to evaluate with metrics. For this project, I am using domain knowledge to assess performance, and I will also survey my peers to get their input on the effectiveness of the model. I'm researching other potential metrics; I did see one instance where a scikit-learn accuracy score was used.

Findings

Using scikit-learn's TF-IDF vectorizer and cosine similarity produces a model that trains very quickly (x seconds) and whose results are satisfying
Between now and when the project is due, I'd like to try some additional approaches, such as gensim and neural networks
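
As a rough illustration of the TF-IDF and cosine similarity approach described above, here is a minimal sketch. It is not the project's actual code: the dataframe contents are placeholders, and the function names `recommend` and `recommend_for_title` and the `stop_words="english"` setting are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder catalogue; the real system uses scraped titles and descriptions
books = pd.DataFrame({
    "title": ["Placeholder A", "Placeholder B", "Placeholder C", "Placeholder D"],
    "description": [
        "a detective investigates a murder in a small prairie town",
        "a family saga set on a fishing boat off the atlantic coast",
        "a police detective hunts a killer through a northern town",
        "poems about winter rivers and migrating birds",
    ],
})

# Fit once on all descriptions; each row of `tfidf` is one book's vector
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(books["description"])


def recommend(text, k=2):
    """Recommend k titles for a freeform sentence."""
    query = vectorizer.transform([text])
    scores = cosine_similarity(query, tfidf).ravel()
    top = scores.argsort()[::-1][:k]
    return books.iloc[top]["title"].tolist()


def recommend_for_title(title, k=2):
    """Recommend k titles similar to a book already in the catalogue."""
    idx = books.index[books["title"] == title][0]
    scores = cosine_similarity(tfidf[idx], tfidf).ravel()
    scores[idx] = -1.0  # exclude the book itself
    top = scores.argsort()[::-1][:k]
    return books.iloc[top]["title"].tolist()
```

Calling `recommend("a detective story")` covers the freeform-sentence input, while `recommend_for_title("Placeholder A")` covers the case where the user names a book the system already knows about.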

Limitations

The project is de facto limited by my coding capabilities, though they are improving every day
The low number of books catalogued by the application (both Canadian and international) is a limitation; this affects both (1) quality and diversity of results and (2) the likelihood that a user will enter a book title of which the model is not yet aware

Tools and technologies

This project uses regex and NLTK's Porter Stemmer for text preprocessing
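
A preprocessing step of this kind might look roughly like the sketch below. The exact cleaning rules used in the project are not documented here, so the regular expression, the use of the standard-library `re` module and the function name `preprocess` are assumptions.

```python
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, strip non-letters and stem each remaining word."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    tokens = text.split()
    return " ".join(stemmer.stem(tok) for tok in tokens)


cleaned = preprocess("Running, runs!")  # both words reduce to the stem "run"
```

Stemming before vectorizing means that different forms of the same word share one column in the TF-IDF matrix.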

The application was coded using Python, Flask, pandas and scikit-learn

Because this project involves many pieces of text that need to be compared to one another, the text was transformed and catalogued by scikit-learn's TF-IDF (term frequency–inverse document frequency) vectorizer

Libraries

Here is a list of libraries used in the notebooks and in the development of the system (please see the import cells and requirements.txt for full details):

beautifulsoup

datetime

json

lxml

matplotlib

nltk

numpy

pandas

pprint

random

regex

requests

seaborn

sklearn

time

urllib
