## Project Goals


**Problem Statement**

Recommend the top review a user should read for a chosen book, determined by choosing the most similar reviewer to that user. 

**Potential Audience**

* Anyone looking to buy a book
* Anyone interested in combining both explicit and implicit feedback models along with NLP methods for further review text analysis

**Goals** 

* Build a recommender system for books that produces a user-item rating matrix using explicit-feedback latent factor models.

* Use NLP methods on the review text to augment the user-item ratings.

* Calculate user similarities

* For a chosen book and a user who has not yet reviewed it, compare two orderings of that book's reviews: the order by helpful votes (in reality this would be done by the number of helpful votes/reviewer rating) versus the order by user similarity. I want to measure the improvement of the model by taking the weighted average of the user similarities (weighted by display rank) under both orderings and checking whether my recommendation system maximises this score (i.e. pushes the reviews by the most similar people to the top).
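
The comparison in the last goal can be sketched as follows. This is a minimal illustration and not the final evaluation code. The write-up does not fix a weighting scheme, so the `1 / rank` weights are an assumption, and the function names and the example similarity values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (e.g. latent factors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_weighted_similarity(similarities_in_display_order):
    """Weighted average of similarities, where the review shown at
    position `rank` gets weight 1 / rank (an assumed weighting)."""
    sims = list(similarities_in_display_order)
    if not sims:
        return 0.0
    weights = [1.0 / rank for rank in range(1, len(sims) + 1)]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


# Hypothetical similarities between the target user and three reviewers,
# listed in the order the reviews appear when sorted by helpful votes.
by_helpful_votes = [0.1, 0.9, 0.5]
# The same reviews reordered so the most similar reviewer comes first.
by_similarity = sorted(by_helpful_votes, reverse=True)

baseline_score = rank_weighted_similarity(by_helpful_votes)
reordered_score = rank_weighted_similarity(by_similarity)
```

Under this weighting, sorting by similarity gives the highest score any ordering can achieve, so the useful quantity is the gap between `reordered_score` and `baseline_score`.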

## Data Gathering

There are a number of datasets that I found that I can potentially work with. These are the main sources: 

1. Amazon dataset provided by Amazon
https://s3.amazonaws.com/amazon-reviews-pds/readme.html

2. Amazon review dataset provided by the ML research team at UCSD (Julian McAuley)
http://jmcauley.ucsd.edu/data/amazon/

3. Goodreads dataset by the ML research team at UCSD (Julian McAuley)
https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home

I only included the notebook for my chosen dataset as the others are a little messy and hard to follow. 

**Amazon book review dataset provided by Amazon**

I started by exploring this dataset because it came in the most convenient format: a structured CSV file. Loading it and doing EDA was difficult given the size of the file (~8GB, largely due to the review text). 

After casting the columns to the most memory-efficient types, using libraries like Dask, and writing functions to process the rows in chunks, I was able to extract some meaningful insights about the data. Around 60% of the users had provided only 1 review, which would make it quite hard to build a meaningful profile for a considerable number of users. Additionally, certain books and users had a disproportionately high number of reviews, which would overshadow the books and users that have only 1 review. Given the number of rows available, I could have removed all the books and users with only 1 review, but I explored other options first. 
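
As an illustration of the row-by-row processing described above, here is a minimal sketch that streams a CSV and computes the share of users with exactly one review without loading the file into memory. It uses only the standard library instead of Dask, and the column name `customer_id` is an assumption about the file's header, not something confirmed here.

```python
import csv
from collections import Counter


def single_review_share(lines, user_column="customer_id", delimiter=","):
    """Return the fraction of users that appear exactly once.

    `lines` is any iterable of text lines (an open file works), so only
    the per-user counts are held in memory, never the review text.
    """
    counts = Counter()
    for row in csv.DictReader(lines, delimiter=delimiter):
        counts[row[user_column]] += 1
    if not counts:
        return 0.0
    single = sum(1 for n in counts.values() if n == 1)
    return single / len(counts)


# Usage with a real file (path is hypothetical):
# with open("amazon_books.csv", newline="", encoding="utf-8") as f:
#     print(single_review_share(f))
```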

**GoodReads by Julian McAuley & Team**

Julian McAuley & his team at UCSD have a comprehensive collection of GoodReads interaction and review data. These are large datasets (one of them being ~32GB). They are presented as 3 datasets: 1 with the meta-data of the books, 1 for the book-user interactions (including rating, have read or shelved) and 1 containing the book reviews. The three tables can be merged to obtain the full information for each review. 

This could potentially be a very interesting dataset to work with, as it has not only explicit feedback (rating) but also implicit feedback (have_read or shelved to read). The McAuley team have written a paper about a novel technique that combines both forms of feedback in a recommender system, which would have been interesting to follow. However, designing the database schema to establish these relationships and handling the amount of data storage needed would require a lot of time and effort that may be beyond the scope of this project for now. 

**Amazon Book Review dataset by Julian McAuley & Team**

The McAuley team have also made available large Amazon review datasets across many product categories, books being the largest one. I explored many of these datasets in a variety of formats, but in the end I settled on a particular version that had been reduced to the 5-core, so that each user (reviewer) and each book has at least 5 reviews. This was presented in a structured JSON file with columns such as:

- ReviewerID 
- BookID 
- Rating 
- Helpful votes ratio (number of helpful votes/ total votes) - useful for ordering the reviews at the end
- Timestamp 
- ReviewText

...along with other information, with ~8 million rows (~9GB). I would also utilise the meta-data book dataset, which contains information such as the genre and title for each BookID. This would be useful for visual EDA and potentially for augmenting the user similarity matrix. 
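
To make the 5-core condition concrete, here is a minimal sketch of a k-core filter over (reviewer, book) pairs. The published file already has this reduction applied, so this is not how the dataset was produced; the function name and the input shape are assumptions. The filter has to repeat until nothing changes, because dropping a sparse book can push a reviewer below the threshold, and vice versa.

```python
from collections import Counter


def k_core(reviews, k=5):
    """Keep only (reviewer_id, book_id) pairs where both the reviewer
    and the book have at least k reviews among the kept pairs."""
    reviews = list(reviews)
    while True:
        user_counts = Counter(user for user, _ in reviews)
        book_counts = Counter(book for _, book in reviews)
        kept = [
            (user, book)
            for user, book in reviews
            if user_counts[user] >= k and book_counts[book] >= k
        ]
        if len(kept) == len(reviews):
            return kept  # stable: every remaining user and book has >= k
        reviews = kept
```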

**Storing the data**

My end goal with data storage is to have the two tables, 1 containing the book review information and the other containing the meta-data, on a remote PostgreSQL database on Google Cloud Platform. This offers the best solution for querying efficiently and merging the two tables. 

I tried to achieve this in a number of ways, including writing the JSON file directly to the remote PostgreSQL server. However, after researching and trying (in vain), I concluded that writing millions of rows chunkwise directly to a remote database is not an efficient way of doing so. 

I decided instead to process the large JSON file chunkwise into a CSV file and then import the CSV file directly into the remote PostgreSQL database or, if that fails due to type errors, create an SQL dump file with the types clearly declared. 
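
A minimal sketch of the JSON-to-CSV step, reading one record at a time so that memory use stays flat. It assumes the file stores one JSON object per line; the field names in `FIELDS` are also assumptions and should be replaced with the keys actually present in the file.

```python
import csv
import json

# Assumed field names; adjust to match the keys in the JSON file.
FIELDS = ["reviewerID", "asin", "overall", "reviewText"]


def json_lines_to_csv(src, dst, fields=FIELDS):
    """Stream JSON records from `src` (one object per line) into `dst`
    as CSV rows. Returns the number of rows written."""
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    rows = 0
    for line in src:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Missing keys become empty strings; extra keys are dropped.
        writer.writerow({name: record.get(name, "") for name in fields})
        rows += 1
    return rows


# Usage with real files (paths are hypothetical):
# with open("reviews.json", encoding="utf-8") as src, \
#         open("reviews.csv", "w", newline="", encoding="utf-8") as dst:
#     json_lines_to_csv(src, dst)
```

Because `src` is iterated line by line, only one review is held in memory at any point.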

**Data troubles**

Converting the JSON file into a CSV proved to be very difficult. The JSON file has a memory footprint of 8.8GB. I tried many methods, such as writing my own chunkwise functions, using json's chunkwise method, and using ijson and other JSON streaming libraries. 

Even though they were all supposed to read the file chunkwise, they all raised an OSError at around the ~4.3GB mark with the error message "File too large". By this point I had ~4.4 million rows, so I decided to stop reading in the rest of the data and proceed with what I had. I won't actually be using the whole file when training my model, as the model doesn't need the review text, which makes up a considerable share of the dataset's size. 

Stopping halfway means that some of the reviewers and books will have only 1 review (the remaining reviews needed to satisfy the 5-core condition would be in the unread half of the data). But these instances only make up around 20%, which is still a lot better than the other datasets I considered. 

**Next Steps** 

I finally have the CSV files for the meta-data and review tables in the format I would like. I am in the process of converting them into SQL dump files with types declared for the best memory efficiency. Once these have been imported into Google Cloud Platform, I can establish a connection to query the data. 
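
One way to declare the types up front is to generate the `CREATE TABLE` statement from a column-to-type mapping and then load the CSV with psql's `\copy`. The sketch below only builds the SQL strings. The table name, column names and chosen PostgreSQL types are all hypothetical placeholders, not the final schema.

```python
# Hypothetical schema: column names and types are placeholders.
REVIEW_COLUMNS = {
    "reviewer_id": "TEXT NOT NULL",
    "book_id": "TEXT NOT NULL",
    "rating": "SMALLINT",
    "helpful_ratio": "REAL",
    "review_time": "BIGINT",
    "review_text": "TEXT",
}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement with explicitly declared types."""
    body = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE {table} (\n{body}\n);"


def copy_command(table, csv_path):
    """Build a psql \\copy meta-command that loads a CSV with a header row."""
    return f"\\copy {table} FROM '{csv_path}' WITH (FORMAT csv, HEADER true)"


ddl = create_table_sql("reviews", REVIEW_COLUMNS)
load = copy_command("reviews", "reviews.csv")
```

Compact types such as `SMALLINT` for ratings keep the table smaller than letting every column default to text.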
