# Section 1: Problem

The dataset I use is "[Amazon Fine Food Reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews)" from Kaggle. The problem is to recommend new products to users based on their own past ratings and on the ratings of other users with similar tastes.

Recommender systems have been widely used in recent years. They have become an important part of marketing strategies that aim to tailor offerings to each customer's preferences and interests. Personalisation is now transforming and improving the customer experience for all kinds of businesses, for example by recommending products in online shopping or suggesting movies, music and news. The most famous example is the [Netflix Prize](https://en.wikipedia.org/wiki/Netflix_Prize), a competition for the best collaborative filtering algorithm.

# Section 2: Data

The dataset was downloaded from [Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews); it was originally published on [SNAP](http://snap.stanford.edu/data/web-FineFoods.html).

The dataset contains around 500,000 Amazon food reviews written between 1999 and 2012. It is a single CSV file of about 286 MB that includes the product ID, user ID, product rating, review summary and full review text.
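Before training anything, it can help to check how many distinct users and products the file holds and how the scores are distributed. The sketch below is one way to do that with only the standard library. The column names (`UserId`, `ProductId`, `Score`) come from the file itself, while the function name `summarize_reviews` and the three sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_reviews(csv_file):
    """Count rows, distinct users, distinct products, and ratings per score."""
    reader = csv.DictReader(csv_file)
    users, products, scores = set(), set(), Counter()
    n_rows = 0
    for row in reader:
        n_rows += 1
        users.add(row["UserId"])
        products.add(row["ProductId"])
        scores[int(row["Score"])] += 1
    return {
        "rows": n_rows,
        "users": len(users),
        "products": len(products),
        "score_counts": dict(scores),
    }


# Tiny in-memory stand-in for Reviews.csv (hypothetical rows, real column names).
sample = io.StringIO(
    "Id,ProductId,UserId,Score\n"
    "1,P1,U1,5\n"
    "2,P2,U1,1\n"
    "3,P1,U2,4\n"
)
summary = summarize_reviews(sample)
print(summary)
```

To run it on the real data, pass an open file handle for `./amazon-fine-foods/Reviews.csv` instead of `sample`, opened with `newline=''` as the `csv` module recommends.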

In [16]:
import graphlab

## Load data into SFrame

In [2]:
# food_reviews = pd.read_csv('./amazon-fine-foods/Reviews.csv')
food_reviews = graphlab.SFrame.read_csv('./amazon-fine-foods/Reviews.csv', escape_char="")


2016-05-23 09:39:11,686 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: C:\Users\Jason\AppData\Local\Temp\graphlab_server_1463960347.log.0


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[long,str,str,str,long,long,long,long,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
food_reviews.head(5)

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score
1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5
2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1
3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres"" ...",1,1,4
4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2
5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir"" ...",0,0,5

Time,Summary,Text
1303862400,Good Quality Dog Food,I have bought several of the Vitality canned dog ...
1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted ...
1219017600,"""Delight"" says it all",This is a confection that has been around a few ...
1307923200,Cough Medicine,If you are looking for the secret ingredient in ...
1350777600,Great taffy,Great taffy at a great price. There was a wide ...


## Split data into training set and test set

In [4]:
training_set, test_set = graphlab.recommender.util.random_split_by_user(food_reviews, user_id='UserId', item_id='ProductId',
                                                                        max_num_users=None, item_test_proportion=0.2, random_seed=5)
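The idea behind a per-user split is that each user in the test set also has ratings in the training set, so the model has something to base its recommendations on. The sketch below illustrates this in plain Python. It is a simplification and not GraphLab's actual algorithm: it holds out a fixed share of each user's rows, and the function name `split_by_user` and the toy rows are assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_user(rows, item_test_proportion=0.2, seed=5):
    """Hold out a share of each user's (user, item, score) rows for testing."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user in sorted(by_user):
        user_rows = list(by_user[user])
        rng.shuffle(user_rows)
        # Users with very few ratings contribute nothing to the test set.
        n_test = int(len(user_rows) * item_test_proportion)
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test


# Hypothetical toy data: U1 has 10 ratings, U2 has 5, U3 has 1.
rows = [("U1", "P%d" % i, 5) for i in range(10)]
rows += [("U2", "P%d" % i, 3) for i in range(5)]
rows += [("U3", "P0", 1)]
train, test = split_by_user(rows)
print(len(train), len(test))
```

With a proportion of 0.2, a user needs at least five ratings before any of them land in the test set under this scheme, which is one reason many users end up with nothing to evaluate against.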

# Section 3: Training recommender models
## Item-based collaborative filtering

In [10]:
model1 = graphlab.item_similarity_recommender.create(training_set, user_id='UserId', item_id='ProductId', target='Score',
                                                     similarity_type='cosine')
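For intuition about what `similarity_type='cosine'` measures, the sketch below computes cosine similarity between items in plain Python. Each item is treated as a vector of user ratings, and two items are similar when the same users rate them in a similar way. This is an illustration only, not GraphLab's implementation; the function name `item_cosine_similarity` and the toy ratings are assumptions.

```python
import math
from collections import defaultdict


def item_cosine_similarity(ratings):
    """Return {(item_a, item_b): cosine} for items sharing at least one user.

    ratings is an iterable of (user, item, score) triples.
    """
    by_item = defaultdict(dict)
    for user, item, score in ratings:
        by_item[item][user] = float(score)

    # Euclidean norm of each item's rating vector.
    norms = {
        item: math.sqrt(sum(s * s for s in users.values()))
        for item, users in by_item.items()
    }

    sims = {}
    items = sorted(by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = by_item[a].keys() & by_item[b].keys()
            if not shared or norms[a] == 0 or norms[b] == 0:
                continue
            dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims


# Hypothetical toy ratings: P1 and P2 are rated identically by U1 and U2.
ratings = [
    ("U1", "P1", 5), ("U1", "P2", 5),
    ("U2", "P1", 4), ("U2", "P2", 4),
    ("U3", "P3", 2), ("U1", "P3", 1),
]
sims = item_cosine_similarity(ratings)
print(sims)
```

The pairwise loop is quadratic in the number of items, so it is only practical for small examples like this one.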

### Evaluation on training set

In [12]:
model1.evaluate(training_set)


Precision and recall summary statistics by cutoff
+--------+----------------+-------------+
| cutoff | mean_precision | mean_recall |
+--------+----------------+-------------+
|   1    |      0.0       |     0.0     |
|   2    |      0.0       |     0.0     |
|   3    |      0.0       |     0.0     |
|   4    |      0.0       |     0.0     |
|   5    |      0.0       |     0.0     |
|   6    |      0.0       |     0.0     |
|   7    |      0.0       |     0.0     |
|   8    |      0.0       |     0.0     |
|   9    |      0.0       |     0.0     |
|   10   |      0.0       |     0.0     |
+--------+----------------+-------------+
[10 rows x 3 columns]



('\nOverall RMSE: ', 0.9436548346408716)

Per User RMSE (best)
+---------------+-------+------+
|     UserId    | count | rmse |
+---------------+-------+------+
| ABOYH7FKD9THJ |   1   | 0.0  |
+---------------+-------+------+
[1 rows x 3 columns]


Per User RMSE (worst)
+----------------+-------+------+
|     UserId     | count | rmse |
+----------------+-------+------+
| A3I3051BZDJWGU |   2   | 4.0  |
+----------------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (best)
+------------+-------+------+
| ProductId  | count | rmse |
+------------+-------+------+
| B001E0VNAQ |   1   | 0.0  |
+------------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+------------+-------+------+
| ProductId  | count | rmse |
+------------+-------+------+
| B003WO0H04 |   1   | 4.0  |
+------------+-------+------+
[1 rows x 3 columns]



{'precision_recall_by_user': Columns:
 	UserId	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 3952746
 
 Data:
 +--------------------+--------+-----------+--------+-------+
 |       UserId       | cutoff | precision | recall | count |
 +--------------------+--------+-----------+--------+-------+
 | #oc-R109MU5OBBZ59U |   1    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   2    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   3    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   4    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   5    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   6    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   7    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   8    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   9    |    0.0    |  0.0   |   1   |
 | #oc-R109MU5OBBZ59U |   10   |    0.0    |  0.0   |   1   |
 +--------------------+--------+-----------+--------+------
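The two kinds of numbers in this output, precision and recall at a cutoff and RMSE, can be worked out by hand for a single user. The sketch below shows one common convention (precision divides the hits by the cutoff `k`); the function names and the sample values are assumptions for illustration. A likely explanation for the all-zero precision and recall above (an assumption, not verified against this run) is that items already rated in the training data are excluded from each user's recommendation list, so no training item can ever count as a hit.

```python
import math


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommended items for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length score lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical user: two held-out items, one of which is in the top 2.
precision, recall = precision_recall_at_k(["P1", "P2", "P3", "P4"], ["P2", "P9"], k=2)
error = rmse([5.0, 3.0], [4.0, 5.0])
print(precision, recall, error)
```

Precision and recall judge the ranked list of recommendations, while RMSE judges how close the predicted scores are to the actual ones, so a model can do well on one and poorly on the other.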

### Evaluation on the test set

In [11]:
model1.evaluate(test_set)


Precision and recall summary statistics by cutoff
+--------+------------------+------------------+
| cutoff |  mean_precision  |   mean_recall    |
+--------+------------------+------------------+
|   1    | 0.00569519325689 | 0.00348662767696 |
|   2    | 0.00497380211102 | 0.00616218887772 |
|   3    | 0.00420600568676 | 0.00753156945039 |
|   4    | 0.00373035158326 | 0.00863756444124 |
|   5    | 0.00338421039309 | 0.00970402704521 |
|   6    | 0.00319141755469 | 0.0107343260155  |
|   7    | 0.00303382040796 | 0.0119669031289  |
|   8    | 0.00297890247298 | 0.0135314339876  |
|   9    | 0.00283494064343 | 0.0143087473299  |
|   10   | 0.00279317589288 | 0.0159426425287  |
+--------+------------------+------------------+
[10 rows x 3 columns]



('\nOverall RMSE: ', 1.491783456592445)

Per User RMSE (best)
+---------------+-------+------+
|     UserId    | count | rmse |
+---------------+-------+------+
| AHNSRNMRD8132 |   1   | 0.0  |
+---------------+-------+------+
[1 rows x 3 columns]


Per User RMSE (worst)
+----------------+-------+------+
|     UserId     | count | rmse |
+----------------+-------+------+
| A1IZLM5SR42ED8 |   1   | 5.0  |
+----------------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (best)
+------------+-------+------+
| ProductId  | count | rmse |
+------------+-------+------+
| B002WKL5ZA |   1   | 0.0  |
+------------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+------------+-------+------+
| ProductId  | count | rmse |
+------------+-------+------+
| B002P9QRCO |   1   | 5.0  |
+------------+-------+------+
[1 rows x 3 columns]



{'precision_recall_by_user': Columns:
 	UserId	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 1422252
 
 Data:
 +--------------------+--------+-----------+--------+-------+
 |       UserId       | cutoff | precision | recall | count |
 +--------------------+--------+-----------+--------+-------+
 | #oc-R103C0QSV1DF5E |   1    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   2    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   3    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   4    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   5    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   6    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   7    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   8    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   9    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   10   |    0.0    |  0.0   |   1   |
 +--------------------+--------+-----------+--------+------

In [36]:
model1.recommend(users=['A3HDKO7OW0QNK4','A3SGXH7AUHU8GW', 'A1D87F6ZCVE5NK', 'A395BORC6FGVXV'], k=5)

UserId,ProductId,score,rank
A3HDKO7OW0QNK4,B003F6UO7K,5.0,1
A3HDKO7OW0QNK4,B00144C10S,5.0,2
A3HDKO7OW0QNK4,B00171APVA,5.0,3
A3HDKO7OW0QNK4,B000E7L2R4,5.0,4
A3HDKO7OW0QNK4,B001E4KFG0,5.0,5
A3SGXH7AUHU8GW,B003F6UO7K,5.0,1
A3SGXH7AUHU8GW,B0001PB9FY,5.0,2
A3SGXH7AUHU8GW,B00144C10S,5.0,3
A3SGXH7AUHU8GW,B00171APVA,5.0,4
A3SGXH7AUHU8GW,B000E7L2R4,5.0,5


## Matrix factorization model 
### Model parameters
* num_factors=70 
* max_iterations=45 
* regularization=0.00001

In [13]:
model2 = graphlab.ranking_factorization_recommender.create(training_set, user_id='UserId', item_id='ProductId', target='Score',
                                                           num_factors=70, max_iterations=45, regularization=1e-5)
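To show what `num_factors`, `max_iterations` and `regularization` control, here is a minimal sketch of matrix factorization trained by stochastic gradient descent on squared error. It is not the model used above: the ranking factorization recommender has its own objective and solver, and this sketch only fits observed ratings. The parameter names mirror the ones in the call above, while `learning_rate`, `seed`, the function name `factorize` and the toy ratings are assumptions.

```python
import math
import random


def factorize(ratings, num_factors=2, max_iterations=500,
              regularization=1e-5, learning_rate=0.05, seed=0):
    """Fit user and item factor vectors to (user, item, score) triples by SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Each user and item gets a small random vector of length num_factors.
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)] for i in items}
    mean = sum(s for _, _, s in ratings) / len(ratings)

    for _ in range(max_iterations):
        for u, i, s in ratings:
            pred = mean + sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = s - pred
            for f in range(num_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step; regularization shrinks factors toward zero.
                P[u][f] += learning_rate * (err * qi - regularization * pu)
                Q[i][f] += learning_rate * (err * pu - regularization * qi)

    def predict(u, i):
        return mean + sum(pu * qi for pu, qi in zip(P[u], Q[i]))

    return predict, mean


# Hypothetical toy ratings.
ratings = [
    ("U1", "P1", 5), ("U1", "P2", 1),
    ("U2", "P1", 4), ("U2", "P2", 2),
    ("U3", "P1", 5),
]
predict, mean = factorize(ratings)

# Training RMSE of the fitted model versus always predicting the mean score.
fitted = math.sqrt(sum((s - predict(u, i)) ** 2 for u, i, s in ratings) / len(ratings))
baseline = math.sqrt(sum((s - mean) ** 2 for _, _, s in ratings) / len(ratings))
print(fitted, baseline)
```

More factors give the model more capacity to fit the training ratings, more iterations give it more time to do so, and a larger regularization value pulls the factors toward zero to limit overfitting.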

### Evaluation on test set

In [14]:
model2.evaluate(test_set)


Precision and recall summary statistics by cutoff
+--------+------------------+------------------+
| cutoff |  mean_precision  |   mean_recall    |
+--------+------------------+------------------+
|   1    | 0.00143012630673 | 0.00121760118058 |
|   2    | 0.00175918191713 | 0.00317906792487 |
|   3    | 0.00170855797707 | 0.00459371621761 |
|   4    | 0.00166426202951 | 0.0059485092802  |
|   5    | 0.00177436909915 | 0.00769294094812 |
|   6    | 0.00156723281106 | 0.00793720145892 |
|   7    | 0.00136503838379 | 0.00802274989096 |
|   8    | 0.00128300048093 | 0.00862900170649 |
|   9    | 0.00159043545026 | 0.0117049484646  |
|   10   | 0.00155668615688 | 0.0122425424477  |
+--------+------------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.2719582590249225)

Per User RMSE (best)
+---------------+-------+-------------------+
|     UserId    | count |        rmse       |
+---------------+-------+-------------------+
| AVEJUQGPHHWEJ |   1   | 6.75178817513

{'precision_recall_by_user': Columns:
 	UserId	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 1422252
 
 Data:
 +--------------------+--------+-----------+--------+-------+
 |       UserId       | cutoff | precision | recall | count |
 +--------------------+--------+-----------+--------+-------+
 | #oc-R103C0QSV1DF5E |   1    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   2    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   3    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   4    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   5    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   6    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   7    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   8    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   9    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   10   |    0.0    |  0.0   |   1   |
 +--------------------+--------+-----------+--------+------

### Model parameters
* num_factors=10 
* max_iterations=21 
* regularization=0.00001

In [7]:
model3 = graphlab.ranking_factorization_recommender.create(training_set, user_id='UserId', item_id='ProductId', target='Score', 
                                                           num_factors=10, max_iterations=21, regularization=0.00001)

### Evaluation on test set

In [9]:
model3.evaluate(test_set)


Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |  0.00132887842661 | 0.000623920428955 |
|   2    | 0.000765687093426 | 0.000676713966447 |
|   3    |  0.00099982281621 |  0.00191727701527 |
|   4    |  0.00114536664389 |  0.00332321632841 |
|   5    |  0.00127319209254 |  0.00485423165509 |
|   6    |  0.0012107558998  |  0.00548627130132 |
|   7    |  0.00116796661713 |  0.00604872436871 |
|   8    |  0.00112954666262 |  0.00656903601933 |
|   9    |  0.00110669557856 |  0.00702890027485 |
|   10   |  0.00110866428734 |  0.00764681715349 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.2392488632190517)

Per User RMSE (best)
+----------------+-------+-------------------+
|     UserId     | count |        rmse       |
+----------------+-------+-------------------+
| A38QZG

{'precision_recall_by_user': Columns:
 	UserId	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 1422252
 
 Data:
 +--------------------+--------+-----------+--------+-------+
 |       UserId       | cutoff | precision | recall | count |
 +--------------------+--------+-----------+--------+-------+
 | #oc-R103C0QSV1DF5E |   1    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   2    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   3    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   4    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   5    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   6    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   7    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   8    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   9    |    0.0    |  0.0   |   1   |
 | #oc-R103C0QSV1DF5E |   10   |    0.0    |  0.0   |   1   |
 +--------------------+--------+-----------+--------+------