# DATA 643: Recommender Systems
Final Project:  Book Crossing RecSys | Walt Wells, Summer 2017

# Notebook P1 - Overview, TOC, Data, Citations, References

## Project Overview

The DATA 643 Final Project is an opportunity to study recommender systems further and to implement one at medium-to-large scale. Because large datasets are hard to manage, we will use several techniques to reduce the size and complexity of the data, ultimately producing a very small, manageable set used to make recommendations to cold-start users.

Our recommender system will use the full Book-Crossing dataset (see #Data). The model relies on an iterative approach to matrix factorization and user classification (clustering), run on a cloud resource made available through a research allocation grant (see #Resources).

### Workflow: 

First we will clean and prepare our dataset, initially considering both implicit and explicit ratings. Next we will use Truncated SVD to factorize the utility matrix, which reduces both the dimensionality of the data and the computational cost of handling it. Then we will run an unsupervised clustering algorithm (CLARA) over the factorized user matrix, so that each cluster ends up with a representative user and we can quickly calculate similarity scores and predictions over a very small set of data. This final reduced dataset can then be used in a production environment to address the "cold start problem": we offer a new user some choices, then calculate their similarity to the cluster representatives. From there we can make some final predictions for the user, with an eye on serendipity.
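The workflow above can be sketched end to end on toy data. This is a minimal illustration, not the code used in the later notebooks: the ratings matrix, the two "taste groups", the new user's ratings, and every variable name are invented for the example. Base `svd()` stands in for whatever truncated solver is practical at full scale, and `clara()` comes from the `cluster` package.

```r
library(cluster)  # provides clara()

set.seed(643)

# Toy utility matrix (made-up data): 60 users x 8 books, two taste groups.
n_users <- 60
n_books <- 8
group <- rep(1:2, each = n_users / 2)
taste <- rbind(c(9, 8, 9, 8, 2, 1, 2, 1),
               c(1, 2, 1, 2, 9, 8, 9, 8))
R <- taste[group, ] + matrix(rnorm(n_users * n_books, sd = 0.5), n_users, n_books)

# Truncated SVD: keep only the k leading singular values and vectors.
k <- 2
s <- svd(R, nu = k, nv = k)
user_factors <- s$u %*% diag(s$d[1:k])    # users in the k-dimensional latent space
book_factors <- s$v                       # books in the same space
R_hat <- user_factors %*% t(book_factors) # rank-k reconstruction of the ratings

# CLARA over the user factors; medoids are actual users, one per cluster.
fit <- clara(user_factors, k = 2, samples = 5)
medoid_ids <- fit$i.med

# Cold start: a hypothetical new user rates books 1, 3 and 5 only.
new_ratings <- c(9, NA, 8, NA, 1, NA, NA, NA)
rated <- !is.na(new_ratings)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Compare the new user with each representative on the books they rated.
sims <- sapply(medoid_ids, function(m) cosine(new_ratings[rated], R_hat[m, rated]))
best <- medoid_ids[which.max(sims)]

# Recommend the unrated books that the closest representative scores highest.
predicted <- R_hat[best, ]
predicted[rated] <- NA
recommended <- order(predicted, decreasing = TRUE)[1:2]
print(recommended)
```

The point of the design is that the cold-start step only touches the handful of medoid rows, so the similarity calculation stays cheap no matter how many users were in the original data.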

Because I concentrated on data munging, the math behind factorization, and learning new clustering techniques, I ran out of time to build a simple recommendation dashboard. The final notebook instead shows some examples of how the system would work for a new user.

## Notebook Organization

To keep run times manageable, the work is split across several notebooks. The contents of each are listed below.

* P1 - Overview, TOC, Data, Citations, References
* P2 - Data Preparation
* P3 - Matrix Factorization
* P4 - User Classification | Clustering
* P5 - Recommendation | Summary
* helper.R - common functions in use across all notebooks

## Data 

For this project we will use the full [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). 

__From the site:__
"Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books. "

__Publication Citation:__
Improving Recommendation Lists Through Topic Diversification,
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. 

The CSV .zip archive has been downloaded and uncompressed, and the resulting CSV files are in a "BX-CSV-DUMP" folder. Since we will only use collaborative filtering models, we won't use the user demographic data; we'll refer to users only by their IDs. We will use:

* The Utility Matrix: BX-Book-Ratings.csv 
* For Images and Titles:  BX-Books.csv
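As a rough sketch of how the ratings file might be read, the snippet below parses a tiny inline sample rather than the real file. It assumes the layout of `BX-Book-Ratings.csv` is semicolon-separated with quoted fields and the columns `User-ID`, `ISBN`, and `Book-Rating`, and that implicit ratings are recorded as 0 while explicit ratings run from 1 to 10. The sample rows and the short column names are invented for illustration.

```r
# Invented sample rows standing in for BX-CSV-DUMP/BX-Book-Ratings.csv.
sample_csv <- '"User-ID";"ISBN";"Book-Rating"
"1001";"000000001X";"0"
"1002";"0000000028";"5"
"1003";"0000000036";"0"
"1002";"000000001X";"8"
"1004";"0000000044";"6"'

# Read every column as character so ISBNs keep leading zeros and the "X" check digit.
ratings <- read.csv(text = sample_csv, sep = ";", quote = "\"",
                    colClasses = "character",
                    col.names = c("user", "isbn", "rating"))
ratings$rating <- as.integer(ratings$rating)

# Assumed encoding: 0 marks an implicit rating, 1-10 an explicit one.
explicit <- ratings[ratings$rating > 0, ]
implicit <- ratings[ratings$rating == 0, ]

# Map user IDs and ISBNs to row and column indices of the utility matrix.
users <- sort(unique(explicit$user))
items <- sort(unique(explicit$isbn))
triplets <- data.frame(i = match(explicit$user, users),
                       j = match(explicit$isbn, items),
                       x = explicit$rating)
print(triplets)
```

With the real file, the `text =` argument would be swapped for the file path, and the `(i, j, x)` triplets could be handed to a sparse-matrix constructor such as `Matrix::sparseMatrix()`.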

## Resources

Data Cleaning, NA Imputation, Matrix Factorization, Modeling, and Tuning will be done on a VM as part of an allocation grant from the [Open Science Data Cloud](https://www.opensciencedatacloud.org/). The OSDC offers services similar to commercial cloud providers like AWS, Azure and Google Compute, but is designed to serve the 'long tail' of the data science community by providing allocation grants to researchers in need of resources. I help manage the operations of the OSDC as part of my work with the [Open Commons Consortium](http://www.occ-data.org/).

When stored as a sparse matrix, the Book-Crossing dataset is not too large and can be managed in the VM using ephemeral storage, so we don't need block or object storage. I will port-forward a Jupyter notebook running an R kernel through a proxy server and work in a browser on my local machine. GitHub is used to manage the code.
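A back-of-the-envelope calculation shows why the sparse representation matters. The sketch below uses the counts quoted in the Data section and assumes 8-byte doubles for values and 4-byte integers for indices, roughly the cost of a compressed sparse column layout. It also assumes every rating row is stored as one entry, so the sparse figure is an estimate rather than a measurement.

```r
# Counts quoted from the Book-Crossing site.
n_users   <- 278858
n_books   <- 271379
n_ratings <- 1149780

# Dense storage: one 8-byte double per user-book cell.
dense_bytes <- n_users * n_books * 8

# Sparse (compressed column) storage, assumed layout:
# one 8-byte value and one 4-byte row index per stored rating,
# plus one 4-byte column pointer per book (and one extra).
sparse_bytes <- n_ratings * (8 + 4) + (n_books + 1) * 4

dense_gb  <- dense_bytes / 1e9
sparse_mb <- sparse_bytes / 1e6
density   <- n_ratings / (n_users * n_books)

cat(sprintf("dense:   %.1f GB\nsparse:  %.1f MB\ndensity: %.4f%%\n",
            dense_gb, sparse_mb, 100 * density))
```

Under these assumptions the dense matrix would need on the order of 600 GB, while the sparse form fits in roughly 15 MB, which is consistent with working entirely from ephemeral storage.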

## References

* https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
* [Building a Recommendation System with R by Suresh K. Gorakala, Michele Usuelli](https://www.amazon.com/dp/B012O8S1YM/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1)
* https://github.com/wwells/CUNY_DATA_643/
* [Mining of Massive Datasets, Anand Rajaraman and Jeffrey Ullman, Chapter 11.3](http://infolab.stanford.edu/~ullman/mmds/book.pdf)
* https://stackoverflow.com/questions/36666241/recommenderlab-running-into-memory-issues
* https://www.r-bloggers.com/large-scale-eigenvalue-decomposition-and-svd-with-rarpack/
* http://langvillea.people.cofc.edu/DISSECTION-LAB/Emmie%27sLSI-SVDModule/p5module.html
* https://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
* http://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning
* https://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na
* https://stackoverflow.com/questions/28267398/summing-columns-on-every-nth-row-of-a-data-frame-in-r
* http://www.sthda.com/english/wiki/print.php?id=239