# Recommender Systems Specialization Capstone Project

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

## Project Information

### Project Objective

Your project should assess and recommend a recommender solution for the following scenario:

You work for a large online retailer (we’ll call it Nile-River.com) as a recommender systems expert on a team focused on direct sales to consumers in the US.   

Your market research team has identified “back to school” as a critical time period for office product sales to consumers in the US.  They note that the six weeks including and surrounding the month of August are responsible for 31% of yearly office product sales. 

They also report that the surge in office product sales is not limited to traditional school products (such as notebooks, pencils, and erasers).  Rather, it appears that once people are buying school products, they also buy other office products (indeed, more document shredders are sold during these six weeks than at any other time of the year, even tax preparation season).   

Indeed, most large-dollar office-product purchases include a mix of inexpensive and more expensive products in the same transaction.  This data suggests that it may be important to have inexpensive products as an entry point, but more expensive ones that build the transaction size.  (For example, I buy paper and pens, and then realize I need several hundred dollars of laser printer toner.)  Or alternatively that once someone comes to buy something large, they also fill in smaller items (once I’m buying a new printer, I might as well also buy a calculator and a box of colored paper clips).

They suspect that the surge in office product sales is due to in-person sales and promotion at retail outlets, with two particularly important prompts: 

Visits to office products superstores (chain stores such as Staples and Office Depot) peak during this time of year.  Once parents are in the store to buy supplies for their children, they see other products of interest.

Special displays are set up at “Big-Box” stores such as Wal-mart and Target with both school supplies and other office products.

Unfortunately, your site (Nile-River.com) does not experience as large a surge in office product sales during the back-to-school period.  You do experience a surge (about double typical sales, or about 23% of annual consumer sales), but it is far below that of your offline competitors. 

These figures include the results of existing promotions such as back-to-school banners and a free next-day shipping promotion for products sold during the two weeks when schools most commonly start classes (late August and early September).

Your challenge, therefore, is to develop a recommender system to increase sales of office products during this important time period.  To maximize business value, you also have a set of key goals and constraints.

Given that your site already has a very effective product-association recommender system, you’ve been asked to focus on recommending products based on customers’ overall profiles, not their current browsing or basket.

Your product recommendations will be displayed in two places on the site: 

Five products displayed on the “office products” landing page where customers will land if they click on banner ads (back to school shopping!) or select the office products category (from various menus or navigation aids).

Five products displayed as part of “other suggestions” that will be displayed as part of the shopping cart display and near the bottom of product pages (primarily will be placed on product pages from the same category, but also related products such as textbooks, school bags, and backpacks).

Research shows that additional sales at this time of year are divided fairly broadly among categories of office products (school supplies, consumable supplies, durable office equipment).  Your recommender should respond to this research appropriately.   

Your recommender should also address the finding above about having both cheaper and more expensive products available to attract customers.   

Finally, Nile-River.com prides itself on having a much deeper product catalog than the typical big-box store.  One of the key drivers of repeat business is customer discovery of new products they likely couldn’t buy at a local store.  Your recommender should respond to this information appropriately.   

### Data Set Descriptions

#### Items

A data set derived from Amazon.com with product metadata and ratings data on office products.  The data set is provided thanks to Julian McAuley at UCSD, and involves actual data from the period May 1996-July 2014.  To make your computation more tractable, we’ve used a dense subset of the data (called the 5-core subset) that only includes items and users with at least five ratings.  [Note that the original datasets are available at http://jmcauley.ucsd.edu/data/amazon/, though these should not be used for this capstone.]

For each item, your meta-data includes: 

- An item number

- Amazon’s ITEM number (“asin”)

- The item’s brand name

- The item title

- The item category (both leaf category and full path)

- A price in dollars

- An availability score between 0 and 1 that reflects how widespread the product is in retail stores; higher scores reflect broad availability; lower scores indicate products not found in most big-box stores.  Note that the availability score is synthetic (we created it), but for purposes of this capstone, treat it as if it were real data.

In [2]:
items = pd.read_excel(open('capstone.xlsx', 'rb'),
              sheet_name='Items').set_index('Item')
items

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat
24,0.475237,B00004Z498,3.62,Scotch,"3M Scotch Mounting Tape, .5-Inch by 75-Inch (110)",Mounting Tape,"Office Products/Office & School Supplies/Tape,..."
30,0.543847,B00004Z5QO,8.16,Avery,Avery Easy Peel Return Address Labels for Inkj...,Printer Labels: Laser & Inkjet,Office Products/Office & School Supplies/Label...
35,0.336081,B00004Z5SN,8.22,Avery,Avery Easy Peel Address Labels for Laser Print...,Address Labels,Office Products/Office & School Supplies/Label...
41,0.564493,B00004Z69W,15.96,Avery,Avery Easy Peel Address Labels for Inkjet Prin...,Address Labels,Office Products/Office & School Supplies/Label...
45,0.908922,B0000538AC,4.39,Scotch,"Scotch(R) Gift Wrap Tape, 0.75 x 300 Inches, 3...",Transparent Tape,"Office Products/Office & School Supplies/Tape,..."
...,...,...,...,...,...,...,...
2321,0.623943,B00FZ8ZRKU,7.49,Wilson Jones,Wilson Jones Heavy Duty D-Ring View Binder wit...,D-Ring Binders,Office Products/Office & School Supplies/Binde...
2324,0.406177,B00FZ909DE,7.69,Wilson Jones,Wilson Jones Heavy Duty D-Ring View Binder wit...,D-Ring Binders,Office Products/Office & School Supplies/Binde...
2326,0.905137,B00G411O8G,9.99,,"Swingline Thermal Laminating Pouch, Letter Siz...",Laminating Supplies,Office Products/Office Electronics/Presentatio...
2356,0.504581,B00GO4GNPC,9.99,Swingline,"Swingline Customizable Fashion Stapler, 20 She...",Desktop Staplers,Office Products/Office & School Supplies/Stapl...


#### Ratings

A ratings matrix with a row for each item and columns representing each user (ratings are on a 1-5 star scale).  Your ratings matrix includes all the ratings data you will receive (we have not separated out test and training data -- that’s your responsibility).

In [3]:
ratings = pd.read_excel(open('capstone.xlsx', 'rb'),
              sheet_name='Ratings').set_index('item')
ratings

item,64,65,75,79,83,112,252,271,301,305,...,3411,3430,3524,3533,3625,3902,3991,4047,4342,4462
24,,,,,,,4.0,5.0,4.0,5.0,...,,,,,,,,,,
30,,,,,,,,,,,...,,,,,,,,,,
35,,,,4.0,,,,,,,...,,,,,,,,,,
41,,,,,,,,,,,...,,,,,,,,,,
45,,5.0,4.0,5.0,,,,,,5.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2321,,,,,,,,,,,...,,,,,,,,,,
2324,,,,,,,,,,5.0,...,,,,,,,,,,
2326,,,,,,,2.0,,,,...,,4.0,,,,,,,,
2356,,,,,5.0,,2.0,,,,...,,,,,,,,,,


#### CBF

Predicted ratings output from a TFIDF content-based recommender.

In [4]:
CBF = pd.read_excel(open('capstone.xlsx', 'rb'),
              sheet_name='CBF').set_index('Item')
CBF

Item,64,65,75,79,83,112,252,271,301,305,...,3411,3430,3524,3533,3625,3902,3991,4047,4342,4462
24,4.759110,4.970097,5.101361,4.982741,4.744962,4.610706,4.363733,4.901377,4.756129,5.137714,...,4.186854,4.662036,4.928785,4.274939,4.560101,4.644191,5.095067,5.314045,4.213635,5.446849
30,4.760034,4.953063,5.131722,4.986909,5.042781,4.765790,4.164508,4.895494,5.182499,5.151480,...,4.131023,4.540971,4.821043,4.421641,4.552488,4.702159,5.096360,5.278416,4.262306,5.198768
35,4.586862,4.787650,4.969127,4.825272,4.848246,4.607144,4.013671,4.737729,5.019004,4.989486,...,3.967397,4.377044,4.645329,4.237040,4.387914,4.540736,4.935084,5.118339,4.103346,5.036774
41,4.758954,4.960017,5.136817,4.994696,5.018703,4.775178,4.181156,4.907392,5.189884,5.159089,...,4.138219,4.545229,4.818427,4.428620,4.561844,4.714112,5.107646,5.289050,4.274392,5.209078
45,4.767237,4.966481,5.103202,4.960500,4.767904,4.675170,4.347195,4.873802,4.951988,5.138159,...,4.135849,4.641535,4.924091,4.309349,4.577074,4.637065,5.096841,5.301905,4.206003,5.451323
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2321,4.167239,4.255061,4.551568,4.402626,4.215355,4.156263,3.535358,4.417078,4.569657,4.669404,...,3.605002,4.046803,4.218134,3.698820,3.823662,4.064583,4.436702,4.826035,3.694039,4.627310
2324,4.045560,4.160742,4.467183,4.317713,4.123812,4.056125,3.435776,4.177312,4.453157,4.587238,...,3.503908,3.966109,4.128133,3.600607,3.697779,3.966914,4.341219,4.719335,3.571713,4.510517
2326,4.530425,4.763254,4.848857,4.816104,4.585276,4.460446,3.666267,4.858475,4.965700,4.927526,...,3.753651,4.384275,4.737709,4.034147,4.415432,4.453367,4.930767,5.073401,3.979239,5.011236
2356,3.789361,4.002356,4.085710,4.047038,3.984555,3.796182,2.987649,3.844924,4.193067,4.154335,...,3.232913,3.704836,3.852723,3.432231,3.561263,3.696044,4.119966,4.313876,3.216915,4.252117


#### Item-Item

Predicted ratings output from an item-item collaborative filtering recommender.

In [5]:
item_item = pd.read_excel(open('capstone.xlsx', 'rb'),
              sheet_name='Item-Item').set_index('Item')
item_item

Item,64,65,75,79,83,112,252,271,301,305,...,3411,3430,3524,3533,3625,3902,3991,4047,4342,4462
24,5.048144,4.876618,5.298740,4.714621,4.685493,4.902435,4.519138,5.437919,4.192167,4.651734,...,5.039983,5.035183,5.002485,4.672079,4.668742,4.753890,4.933271,4.645820,4.484029,4.789651
30,4.432181,4.686060,4.944408,4.827350,4.865588,4.708614,4.761924,5.088993,4.897821,4.631434,...,4.512749,5.009835,4.416021,4.920164,4.447667,4.781522,5.210271,5.018024,5.193638,4.882565
35,4.424702,4.639536,4.902414,4.654787,4.487670,5.043107,4.416495,5.251397,4.817922,4.886945,...,3.818230,4.077078,4.916671,4.460068,4.560418,4.656706,4.336244,4.729318,4.522493,4.822804
41,4.455630,4.846841,4.764456,4.636426,4.765563,5.048186,4.878265,4.484044,4.899459,4.730924,...,4.076657,4.755233,5.064970,4.733682,4.779031,4.601137,4.822821,4.919715,5.438721,5.609193
45,4.713699,4.824686,4.912413,5.012759,4.592849,4.597524,5.038015,4.591647,4.878341,4.834799,...,4.721283,4.473431,4.766598,4.304352,4.719857,4.565420,5.163960,4.778962,4.380615,5.057542
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2321,4.281895,3.862893,4.827148,4.150019,3.878916,4.151949,3.298728,4.304851,4.757871,4.537145,...,3.737476,3.582756,4.243824,4.015442,3.811678,4.244009,4.181049,4.540734,3.655595,4.826119
2324,4.293118,3.999757,4.409072,4.123518,3.945789,4.270189,3.179614,4.077376,4.175634,4.650600,...,4.492877,4.302054,3.892376,3.719091,3.576754,3.646797,3.930087,4.493832,4.009592,3.897694
2326,4.579365,4.572798,4.434981,4.550085,4.901238,4.713017,3.302515,4.143929,5.077498,4.826939,...,3.980225,3.983297,4.638855,4.447525,4.126651,4.626393,4.870472,4.950686,3.557243,4.767729
2356,3.210927,3.743892,4.078174,4.070962,4.045424,3.709863,2.663308,4.282468,4.400385,4.030946,...,3.452572,3.114758,3.664836,3.052114,3.777407,3.286255,3.851484,3.367702,3.171694,3.976323


#### MF

Predicted ratings output from a matrix factorization (gradient descent) recommender.

In [6]:
MF = pd.read_excel(open('capstone.xlsx', 'rb'),
              sheet_name='MF').set_index('Item')
MF

Item,64,65,75,79,83,112,252,271,301,305,...,3411,3430,3524,3533,3625,3902,3991,4047,4342,4462
24,4.739904,4.886359,4.980338,4.925705,4.796032,4.718714,4.324353,4.930692,5.045021,5.032640,...,4.290049,4.635368,4.798700,4.256978,4.692152,4.638347,5.014710,5.072654,4.393017,5.110367
30,4.695184,4.841951,4.942925,4.873662,4.751501,4.696670,4.274564,4.881475,5.005127,4.985742,...,4.236899,4.588769,4.754245,4.221880,4.655599,4.583816,4.975035,5.023935,4.359319,5.068546
35,4.620439,4.762737,4.852639,4.805924,4.671336,4.574349,4.221161,4.810916,4.919947,4.913164,...,4.184821,4.522413,4.675023,4.136575,4.569446,4.525586,4.892151,4.954829,4.268255,4.984210
41,4.766709,4.911528,5.009094,4.953096,4.817944,4.741805,4.346761,4.954938,5.076048,5.062182,...,4.314581,4.657781,4.823483,4.278950,4.713878,4.663673,5.042297,5.102804,4.418982,5.139825
45,4.764646,4.900232,5.046869,4.943493,4.773807,4.775370,4.278389,4.918982,5.127898,5.079316,...,4.265525,4.604948,4.808842,4.255544,4.686123,4.622342,5.059712,5.121551,4.440469,5.172261
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2321,4.192011,4.334042,4.441094,4.377275,4.232716,4.168622,3.760883,4.372850,4.511897,4.491786,...,3.733115,4.073182,4.245268,3.698858,4.132379,4.082282,4.470936,4.533290,3.847991,4.571141
2324,4.106161,4.250431,4.339646,4.294593,4.157955,4.063277,3.701492,4.297347,4.407116,4.399386,...,3.668406,4.005482,4.162764,3.620231,4.054114,4.011566,4.377630,4.441356,3.754010,4.471525
2326,4.548460,4.686544,4.831728,4.728423,4.563182,4.567547,4.058429,4.705670,4.910632,4.862243,...,4.045333,4.388830,4.595400,4.039991,4.472244,4.406033,4.843428,4.903557,4.224087,4.957209
2356,3.773298,3.921012,4.020423,3.951684,3.831369,3.776887,3.354407,3.960345,4.081770,4.062569,...,3.316251,3.668533,3.833475,3.302465,3.736164,3.662434,4.052961,4.100617,3.438470,4.145502


#### User-User

Predicted ratings output from a user-user collaborative filtering recommender.

In [7]:
user_user = pd.read_excel(open('capstone.xlsx', 'rb'),
              sheet_name='User-User').set_index('Item')
user_user 

Item,64,65,75,79,83,112,252,271,301,305,...,3411,3430,3524,3533,3625,3902,3991,4047,4342,4462
24,4.590799,4.877711,4.458179,4.805026,4.997193,4.098849,4.043067,5.239944,4.191001,4.762860,...,4.010254,4.084892,4.512691,3.754210,4.310961,4.588265,5.260862,4.932911,3.333438,4.918313
30,4.426518,4.859874,4.646025,4.937423,4.633184,4.356083,4.073458,4.786792,4.767103,4.885959,...,3.995149,4.087980,4.534441,3.964599,4.134575,4.655635,5.022880,4.894737,3.967310,4.761658
35,4.025716,4.866861,4.835609,4.993884,4.664077,4.136289,3.867810,4.656409,4.436962,4.905074,...,3.673124,3.910839,4.504693,3.927909,4.142269,4.573515,4.514161,4.683942,3.785175,4.177748
41,4.638216,5.001494,4.586768,4.785592,4.859969,4.371172,4.004206,4.482062,4.821622,5.248041,...,3.920856,4.121550,4.257406,3.848176,4.640973,4.564264,5.557184,4.846247,3.628684,5.171294
45,4.806725,5.367683,4.748397,5.171098,4.974515,4.384480,4.413938,4.744272,5.028890,5.214975,...,4.189349,4.540017,4.674176,4.048724,4.525230,4.680311,4.492206,5.239830,3.613434,4.730830
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2321,4.266290,4.449117,4.813637,4.622020,4.405027,4.052676,3.756925,4.953918,5.299797,4.201119,...,3.763000,3.926368,4.407431,3.989687,4.181327,4.108814,4.835349,4.978257,3.494123,4.778453
2324,4.437552,4.308682,4.990296,4.585984,4.146157,4.111071,3.351168,5.051755,4.972796,4.744187,...,3.713658,4.329625,4.411524,3.724572,4.063230,3.816742,4.620690,4.784461,4.104845,4.546214
2326,4.662011,4.841642,4.841856,4.841898,4.716187,4.290783,3.199112,4.798093,5.135523,4.793772,...,3.878725,3.757266,4.518772,3.998773,4.609285,4.359215,4.947376,4.707264,3.487148,4.928782
2356,3.965539,4.641901,4.215249,4.813070,4.658051,4.114999,2.518776,5.283099,5.002078,4.602801,...,3.442500,4.052028,4.765966,3.327745,3.768645,3.462555,4.898652,4.326108,3.169288,4.179017


#### PersBias

Predicted ratings output from a baseline recommender that uses product and customer ratings distributions to provide personally-scaled average predictions.

In [8]:
persbias = pd.read_excel(open('capstone.xlsx', 'rb'),
              sheet_name='PersBias').set_index('Item')
persbias

Item,64,65,75,79,83,112,252,271,301,305,...,3411,3430,3524,3533,3625,3902,3991,4047,4342,4462
24,4.760019,4.934751,5.072294,5.014281,4.839983,4.734613,4.403725,5.013488,5.185511,5.134814,...,4.167298,4.725065,4.847920,4.236699,4.778955,4.679319,5.097628,5.286626,4.255793,5.242198
30,4.761857,4.936589,5.074133,5.016120,4.841821,4.736452,4.405564,5.015326,5.187349,5.136652,...,4.169136,4.726903,4.849758,4.238537,4.780793,4.681157,5.099466,5.288464,4.257631,5.244037
35,4.598962,4.773694,4.911237,4.853224,4.678925,4.573556,4.242668,4.852430,5.024453,4.973756,...,4.006240,4.564007,4.686862,4.075641,4.617898,4.518261,4.936570,5.125568,4.094736,5.081141
41,4.769841,4.944573,5.082116,5.024103,4.849805,4.744435,4.413547,5.023309,5.195332,5.144636,...,4.177119,4.734886,4.857741,4.246521,4.788777,4.689140,5.107449,5.296447,4.265615,5.252020
45,4.757687,4.932419,5.069962,5.011949,4.837651,4.732281,4.401393,5.011156,5.183178,5.132482,...,4.164966,4.722733,4.845588,4.234367,4.776623,4.676987,5.095295,5.284294,4.253461,5.239866
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2321,4.146532,4.321264,4.458808,4.400795,4.226496,4.121127,3.790239,4.400001,4.572024,4.521327,...,3.553811,4.111578,4.234433,3.623212,4.165468,4.065832,4.484141,4.673139,3.642306,4.628712
2324,4.041269,4.216001,4.353544,4.295531,4.121233,4.015863,3.684975,4.294738,4.466761,4.416064,...,3.448548,4.006315,4.129170,3.517949,4.060205,3.960569,4.378878,4.567876,3.537043,4.523448
2326,4.545145,4.719877,4.857420,4.799407,4.625109,4.519739,4.188851,4.798614,4.970636,4.919940,...,3.952424,4.510191,4.633046,4.021825,4.564081,4.464445,4.882753,5.071752,4.040919,5.027324
2356,3.779731,3.954463,4.092006,4.033993,3.859695,3.754325,3.423437,4.033199,4.205222,4.154526,...,3.187009,3.744776,3.867631,3.256411,3.798667,3.699030,4.117339,4.306337,3.275505,4.261910


## Part 1. Plan

### Translation of Business Goals into Metrics

The following metrics will be used to evaluate the algorithms:
- DIFF5: the mean normalized rating that users gave to the Top-5 items selected by the algorithm, with missing ratings replaced by 0. It measures how much users actually like the recommended items. Ratings are normalized per user (z-scores) to account for the different rating scales used by different users. This metric is to be maximized.
- priceSTD: the average, over users, of the standard deviation of normalized price among the Top-5 items selected by the algorithm. It measures price diversity among the recommended items. This metric is to be maximized.
- UNAVAIL: the median of (1 - availability) over the Top-5 items. It measures how hard the recommended items are to find in retail stores and is to be maximized. The median is chosen instead of the mean because we want common items to be recommended occasionally as well, provided that they serve the other two objectives.
- FINAL: the sum of the logarithms of max(DIFF5, 1e-4), priceSTD and UNAVAIL, which equals three times the logarithm of their geometric mean. Maximizing it ensures that we do not lag severely behind in any of our three objectives (recommending useful items, recommending items with diverse prices and recommending items that are less available in other shops).
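
As a quick sanity check of how FINAL combines the three metrics, the sketch below adds the logarithms of three metric values, mirroring the `FINAL` function of this notebook. The input numbers are copied from the CBF row of the summary table in Part 2; `final_score` and its `floor` argument are hypothetical names used only for illustration.

```python
import numpy as np

def final_score(diff5, price_std, unavail, floor=1e-4):
    """Sum of logs of the three metrics; DIFF5 is floored so the log is defined."""
    return np.log(max(diff5, floor)) + np.log(price_std) + np.log(unavail)

# Values taken from the CBF row of the summary table (Part 2).
cbf_final = final_score(0.087183, 0.428872, 0.435507)
print(round(cbf_final, 3))        # close to the -4.117582 reported in the table

# Dividing by 3 and exponentiating recovers the geometric mean of the three metrics.
geometric_mean = np.exp(cbf_final / 3)
```

Because the score is a sum of logs, a value near zero in any single metric drags the whole score down, which is the intended behavior.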

In [9]:
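def Top5(predictions, users):
    # For each user (column position), return the row positions of the
    # 5 highest predicted ratings (in no particular order).
    return {user: np.argpartition(predictions.iloc[:, user].to_numpy(), len(predictions)-5)[-5:]
            for user in users}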
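
The selection step in `Top5` relies on `np.argpartition`. A minimal sketch on a toy score vector (the numbers are made up) shows what it returns:

```python
import numpy as np

scores = np.array([3.1, 4.8, 2.0, 4.9, 4.1, 3.7, 4.5, 1.2])  # hypothetical predictions
k = 5

# argpartition places the k largest values in the last k positions (unordered).
top_k = np.argpartition(scores, len(scores) - k)[-k:]
print(sorted(top_k.tolist()))  # [1, 3, 4, 5, 6]
```

The result contains row positions rather than item IDs, which is why the metric functions in this notebook index with `.iloc`.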

In [10]:
norm_rating = ((ratings - ratings.mean(axis=0))/ratings.std(axis=0)).fillna(0)
norm_rating

item,64,65,75,79,83,112,252,271,301,305,...,3411,3430,3524,3533,3625,3902,3991,4047,4342,4462
24,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.064550,0.599145,-1.449138,0.526235,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
35,0.0,0.000000,0.000000,-1.354006,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45,0.0,0.774597,-1.044466,0.677003,0.000000,0.0,0.000000,0.000000,0.000000,0.526235,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2321,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2324,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.526235,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2326,0.0,0.000000,0.000000,0.000000,0.000000,0.0,-1.871942,0.000000,0.000000,0.000000,...,0.0,-0.223374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2356,0.0,0.000000,0.000000,0.000000,0.658281,0.0,-1.871942,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
def DIFF5(top5):
    return np.mean(np.asarray(list(map(lambda x: norm_rating.iloc[top5[x], x], top5))))
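
To make the normalization and DIFF5 concrete, here is a small sketch on a toy ratings table. The data, the recommended positions and the names `toy`, `toy_norm`, `toy_top` and `toy_diff` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings: rows are items, columns are users; NaN means "not rated".
toy = pd.DataFrame({0: [5.0, 3.0, np.nan, 4.0],
                    1: [np.nan, 2.0, 4.0, np.nan]})

# Per-user z-score (column-wise); unrated cells become 0, i.e. the user's average.
toy_norm = ((toy - toy.mean(axis=0)) / toy.std(axis=0)).fillna(0)

# Suppose an algorithm recommended item positions [0, 2] to user 0 and [2, 3] to user 1.
toy_top = {0: np.array([0, 2]), 1: np.array([2, 3])}

# DIFF5-style score: mean normalized rating over all recommended (user, item) pairs.
toy_diff = np.mean([toy_norm.iloc[idx, user].to_numpy() for user, idx in toy_top.items()])
print(toy_diff)
```

Recommending an unrated item contributes 0, so the score rewards only items that the user rated above their own average.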

In [12]:
items['NormPrice'] = ((items['Price'] - items['Price'].mean())/items['Price'].std()).fillna(0)
items

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice
24,0.475237,B00004Z498,3.62,Scotch,"3M Scotch Mounting Tape, .5-Inch by 75-Inch (110)",Mounting Tape,"Office Products/Office & School Supplies/Tape,...",-0.393748
30,0.543847,B00004Z5QO,8.16,Avery,Avery Easy Peel Return Address Labels for Inkj...,Printer Labels: Laser & Inkjet,Office Products/Office & School Supplies/Label...,-0.273588
35,0.336081,B00004Z5SN,8.22,Avery,Avery Easy Peel Address Labels for Laser Print...,Address Labels,Office Products/Office & School Supplies/Label...,-0.272000
41,0.564493,B00004Z69W,15.96,Avery,Avery Easy Peel Address Labels for Inkjet Prin...,Address Labels,Office Products/Office & School Supplies/Label...,-0.067147
45,0.908922,B0000538AC,4.39,Scotch,"Scotch(R) Gift Wrap Tape, 0.75 x 300 Inches, 3...",Transparent Tape,"Office Products/Office & School Supplies/Tape,...",-0.373368
...,...,...,...,...,...,...,...,...
2321,0.623943,B00FZ8ZRKU,7.49,Wilson Jones,Wilson Jones Heavy Duty D-Ring View Binder wit...,D-Ring Binders,Office Products/Office & School Supplies/Binde...,-0.291321
2324,0.406177,B00FZ909DE,7.69,Wilson Jones,Wilson Jones Heavy Duty D-Ring View Binder wit...,D-Ring Binders,Office Products/Office & School Supplies/Binde...,-0.286028
2326,0.905137,B00G411O8G,9.99,,"Swingline Thermal Laminating Pouch, Letter Siz...",Laminating Supplies,Office Products/Office Electronics/Presentatio...,-0.225154
2356,0.504581,B00GO4GNPC,9.99,Swingline,"Swingline Customizable Fashion Stapler, 20 She...",Desktop Staplers,Office Products/Office & School Supplies/Stapl...,-0.225154


In [13]:
def priceSTD(top5):
    return np.mean(np.asarray(list(map(lambda x: items['NormPrice'].iloc[top5[x]].std(), top5))))
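
The following sketch illustrates what priceSTD rewards, using a hypothetical seven-item catalogue. Note that pandas' `Series.std` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

prices = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 80.0, 120.0])   # hypothetical catalogue prices
norm_price = (prices - prices.mean()) / prices.std()

cheap_only = norm_price.iloc[[0, 1, 2, 3, 4]].std()   # five similarly priced items
mixed = norm_price.iloc[[0, 1, 2, 5, 6]].std()        # three cheap and two expensive items
print(cheap_only < mixed)  # True
```

A recommendation list that mixes inexpensive and expensive products gets the higher score, in line with the business goal about price diversity.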

In [14]:
def UNAVAIL(top5):
    return np.median(np.asarray(list(map(lambda x: 1 - items['Availability'].iloc[top5[x]], top5))))
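
A small sketch of why the median is used for UNAVAIL, with made-up availability scores for the Top-5 lists of two hypothetical users:

```python
import numpy as np

# Availability of the 5 recommended items for two hypothetical users.
avail = np.array([[0.90, 0.8, 0.2, 0.1, 0.3],
                  [0.95, 0.2, 0.3, 0.1, 0.4]])
unavail = 1 - avail

median_unavail = np.median(unavail)   # pooled over all users and items, as in UNAVAIL
mean_unavail = unavail.mean()
print(median_unavail, mean_unavail)
```

The few widely available items pull the mean down but barely affect the median, so a list may include some common products without being penalized much.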

In [15]:
def FINAL(top5):
    return np.log(max(DIFF5(top5), 1e-4)) + np.log(UNAVAIL(top5)) + np.log(priceSTD(top5))

### Plan for evaluating base algorithms

The following base algorithms will be evaluated:
- a TFIDF content-based recommender
- an item-item collaborative filtering recommender
- a matrix factorization (gradient descent) recommender
- a user-user collaborative filtering recommender
- a baseline recommender that uses product and customer ratings distributions to provide personally-scaled average predictions

All of these algorithms except the last one are plausible choices for the current problem. The last one is expected to perform worse, because it always recommends the same set of Top-5 most popular items, but it is still useful as a baseline to test the others against. Moreover, precomputed predictions of all five algorithms are already provided as data.

Which of these algorithms actually solve our problem (if any) will be determined empirically.

The problem statement does not describe the data on which the algorithms were trained and tuned, but the presence of a TFIDF content-based recommender trained on review data suggests that they were most likely trained on data unavailable to us. Therefore, there is no meaningful way to split the data into train and test sets for the base algorithms. Such a split will be necessary, however, when we train hybrids on the data given to us.

### Plan for hybrids

The following data will be used for building hybrid models:
- Rating predictions of base algorithms
- Availability (is significant for our goals, despite being unrepresented in rating table)
- Price (is significant for our goals, despite being unrepresented in rating table)
- Squared deviation of price from the mean price among all items (it will allow the linear models to take price "extremity" into account)
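
One possible way to assemble these inputs into a per-item feature matrix for a single user is sketched below. The column order, the random stand-in data and the name `price_dev_sq` are assumptions made for illustration; the notebook does not prescribe a layout.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 6

# Hypothetical predicted ratings of the five base algorithms for one user (one column each).
base_preds = rng.uniform(3.0, 5.0, size=(n_items, 5))
availability = rng.uniform(0.0, 1.0, size=n_items)
price = np.array([3.6, 8.2, 16.0, 4.4, 7.5, 10.0])

# Squared deviation from the mean price flags both very cheap and very expensive items.
price_dev_sq = (price - price.mean()) ** 2

features = np.column_stack([base_preds, availability, price, price_dev_sq])
print(features.shape)  # (6, 8)
```

With the squared term present, a linear model can favor items at either end of the price range, which a plain price feature cannot express.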

The following three hybrid models will be used:
- Linear hybrid: final score is a linear combination of selected features.
- Selective hybrid: final score is a score given by one of the base models, selected by a linear classifier
- Mixed hybrid: final score is given by one of several linear hybrid models, selected by a linear classifier.

The third model is more expressive than the first two. However, it still needs to be tested against them because of the danger of overfitting.

These hybrids seem promising because they can take into account both all available predictions (the differences between them can also carry some information) and the item parameters that are relevant to our goals despite not being directly correlated with ratings.

The dataset will be split into a training set (the first 50 users), on which the weights of the hybrid models will be tuned, and a test set (the other 50 users), on which the final versions of all three models will be evaluated. Because the objective metric (FINAL) is non-differentiable, a zero-order optimization technique is required for tuning. I will use [the cross-entropy method](https://en.wikipedia.org/wiki/Cross-entropy_method) with a Gaussian weight distribution.
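
A minimal sketch of the user split, using a toy stand-in for a predictions table (`toy_preds` is hypothetical; the real tables have one column per user):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a predictions table: 4 items x 100 users.
toy_preds = pd.DataFrame(np.zeros((4, 100)))

train_users = range(0, 50)     # hybrid weights are tuned on these user positions
test_users = range(50, 100)    # final evaluation uses these user positions only

train = toy_preds.iloc[:, list(train_users)]
test = toy_preds.iloc[:, list(test_users)]
print(train.shape, test.shape)  # (4, 50) (4, 50)
```

The same ranges can be passed as the `users` argument of `Top5`, since that function addresses users by column position.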

In [16]:
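def CEOpt(f, start_mean, start_cov, sample_size=100, n_iterations=50, percentile=0.25):
    # Cross-entropy method: maximize f by repeatedly refitting a Gaussian to the best samples.
    mean, cov = start_mean, start_cov
    n_rest = int(sample_size*(1 - percentile))  # number of samples below the elite cut-off
    for i in tqdm(range(n_iterations)):
        sample = st.multivariate_normal(mean, cov, allow_singular=True).rvs(sample_size)
        results = np.asarray(list(map(f, sample)))
        # Indices of the top `percentile` fraction of samples by objective value.
        best = np.argpartition(results, n_rest)[n_rest:]
        mean, cov = np.mean(sample[best], axis=0), np.cov(sample[best].T)
    # Return the best sample of the last iteration and its objective value.
    return sample[np.argmax(results)], np.max(results)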
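
To see the cross-entropy method at work, the sketch below runs a simplified variant on a toy objective with a known maximum. It is not the notebook's `CEOpt`: it samples with NumPy's generator instead of SciPy, adds a small diagonal jitter to the covariance, and returns the final mean. The function name `cem_maximize` and all parameter values are assumptions.

```python
import numpy as np

def cem_maximize(f, mean, cov, sample_size=200, n_iterations=30, elite_frac=0.25, seed=0):
    """Toy cross-entropy method: refit a Gaussian to the best samples each round."""
    rng = np.random.default_rng(seed)
    n_elite = int(sample_size * elite_frac)
    for _ in range(n_iterations):
        sample = rng.multivariate_normal(mean, cov, size=sample_size)
        scores = np.array([f(x) for x in sample])
        elite = sample[np.argsort(scores)[-n_elite:]]   # highest-scoring samples
        mean = elite.mean(axis=0)
        # Small diagonal jitter keeps the covariance positive definite as it shrinks.
        cov = np.cov(elite.T) + 1e-6 * np.eye(len(mean))
    return mean

target = np.array([1.0, -2.0])
best = cem_maximize(lambda x: -np.sum((x - target) ** 2), np.zeros(2), 4.0 * np.eye(2))
print(best)
```

No gradient of the objective is needed, which is what makes this family of methods suitable for a ranking-based metric such as FINAL.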

In [17]:
def linear_hybrid(weights, features):
    return np.sum(weights*features, axis=1)
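
A tiny usage sketch of the linear hybrid with made-up features (two base predictions and an availability score per item) and made-up weights:

```python
import numpy as np

def linear_hybrid(weights, features):
    return np.sum(weights * features, axis=1)

# Rows are items; columns are two predicted ratings and an availability score.
features = np.array([[4.0, 5.0, 0.2],
                     [3.0, 4.5, 0.9]])
weights = np.array([0.5, 0.5, -1.0])   # average the predictions, penalize availability

scores = linear_hybrid(weights, features)
print(scores)
```

The result is the same as the matrix-vector product `features @ weights`.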

In [18]:
def selective_hybrid(weights, features, candidates):
    n_features, n_candidates = features.shape[1], candidates.shape[1]
    # Linear score of every candidate for every row; shape (n_rows, n_candidates).
    selector = features @ weights.reshape(n_features, n_candidates)
    choice = np.argmax(selector, axis=1)
    return candidates[np.arange(candidates.shape[0]), choice]
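
The sketch below shows the selective hybrid on toy inputs. The function is written with a matrix product, which should be equivalent to the broadcast-and-sum form; the data and weights are made up so that the result is easy to check by hand.

```python
import numpy as np

def selective_hybrid(weights, features, candidates):
    n_features, n_candidates = features.shape[1], candidates.shape[1]
    # One linear score per candidate; the highest score picks the candidate for that row.
    selector = features @ weights.reshape(n_features, n_candidates)
    choice = np.argmax(selector, axis=1)
    return candidates[np.arange(candidates.shape[0]), choice]

features = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
candidates = np.array([[10.0, 20.0],    # scores of two base models for item 0
                       [30.0, 40.0]])   # scores of two base models for item 1
weights = np.array([1.0, 0.0, 0.0, 1.0])   # reshaped to the 2x2 identity matrix

picked = selective_hybrid(weights, features, candidates)
print(picked)  # [10. 40.]
```

With identity weights, the classifier scores equal the features, so item 0 takes the first model's score and item 1 takes the second model's score.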

In [19]:
def mixed_hybrid(weights, features):
    feature_dim = features.shape[1]
    candidates = np.asarray([linear_hybrid(weights[feature_dim*i:feature_dim*(i+1)], features) 
                             for i in range(feature_dim)]).T
    return selective_hybrid(weights[-feature_dim**2:], features, candidates)
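
Reading the slicing in `mixed_hybrid`, the weight vector appears to hold `feature_dim**2` weights for the linear hybrids at the front and another `feature_dim**2` selector weights at the end. The sketch below assumes a total length of `2 * feature_dim**2` so that the two parts do not overlap; that length, and the toy data, are assumptions.

```python
import numpy as np

def linear_hybrid(weights, features):
    return np.sum(weights * features, axis=1)

def selective_hybrid(weights, features, candidates):
    selector = features @ weights.reshape(features.shape[1], candidates.shape[1])
    return candidates[np.arange(candidates.shape[0]), np.argmax(selector, axis=1)]

def mixed_hybrid(weights, features):
    feature_dim = features.shape[1]
    candidates = np.asarray([linear_hybrid(weights[feature_dim*i:feature_dim*(i+1)], features)
                             for i in range(feature_dim)]).T
    return selective_hybrid(weights[-feature_dim**2:], features, candidates)

features = np.array([[3.0, 1.0],
                     [2.0, 5.0]])
linear_part = np.array([1.0, 0.0,    # candidate 0 copies feature 0
                        0.0, 1.0])   # candidate 1 copies feature 1
selector_part = np.array([1.0, 0.0, 0.0, 1.0])   # candidate j is scored by feature j
weights = np.concatenate([linear_part, selector_part])   # assumed length 2 * feature_dim**2

result = mixed_hybrid(weights, features)
print(result)  # [3. 5.]
```

Here each row ends up with the score of the linear hybrid whose selector score is highest, which for these weights is simply the larger of the two features.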

## Part 2. Measurement

### Summary Table of Statistics

In [20]:
base_table = pd.DataFrame({'Algorithm': ['CBF', 'Item-Item', 'MF', 'User-User', 'PersBias'],
                        'DIFF5': [DIFF5(Top5(predictions, range(100))) 
                                  for predictions in [CBF, item_item, MF, user_user, persbias]],
                        'priceSTD': [priceSTD(Top5(predictions, range(100))) 
                                  for predictions in [CBF, item_item, MF, user_user, persbias]],
                        'UNAVAIL': [UNAVAIL(Top5(predictions, range(100))) 
                                  for predictions in [CBF, item_item, MF, user_user, persbias]],
                        'FINAL': [FINAL(Top5(predictions, range(100))) 
                                  for predictions in [CBF, item_item, MF, 
                                                           user_user, persbias]]}).set_index('Algorithm')
base_table

Algorithm,DIFF5,priceSTD,UNAVAIL,FINAL
CBF,0.087183,0.428872,0.435507,-4.117582
Item-Item,0.137943,0.555428,0.420511,-3.435214
MF,0.028979,0.424577,0.481862,-5.127931
User-User,0.347079,0.591379,0.302113,-2.780454
PersBias,0.024142,0.221039,0.420511,-6.099485


### Conclusions

On DIFF5 and FINAL, the algorithms rank as follows:
1. User-User
2. Item-Item
3. CBF
4. MF
5. PersBias

User-User turned out to be the best algorithm for this problem.

Item-Item fares worse than User-User in this case because the number of distinct items in the shop exceeds the number of users.

CBF comes third, meaning that the undisplayed reviews happen to be less useful here than ratings (which is expected, as an average user is unlikely to write a review of a cheap and frequently bought office item).

MF is even worse here because the items happen to be too diverse in meaning and function to be adequately described by a relatively small number of features.

PersBias recommends the same Top-5 items to everyone. It is no wonder that such a one-size-fits-all approach turned out to be the worst.

priceSTD gives the same ranking: User-User is the best (0.591) and PersBias the worst (0.221), with CBF and MF almost tied (0.429 and 0.425).

In UNAVAIL, User-User is actually the worst and MF the best; Item-Item ties with PersBias, and both are worse than CBF.

The positive correlation between DIFF5 and priceSTD reflects that the items a given user finds useful tend to be quite diverse in price (just as the aforementioned research suggests).

The negative correlation between DIFF5 and UNAVAIL reflects the fact that people naturally favor items they have already seen somewhere else.

### Selection for Hybridization

Although User-User is clearly the best algorithm on DIFF5, priceSTD and FINAL (it is the worst only on UNAVAIL), I am still inclined to take all five algorithms for hybridization. This is because individually mediocre algorithms sometimes perform excellently when combined. Whether the current case is one of those will be determined empirically.

## Part 3. Mixing

### Hybrids

The following data will be used for building hybrid models:
- Rating predictions of base algorithms
- Availability (significant for our goals, despite not being represented in the rating table)
- Price (significant for our goals, despite not being represented in the rating table)
- Squared deviation of price from the mean price among all items (it allows the linear models to take price "extremity" into account)
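
Item-level attributes such as availability and price have one value per item, whereas the rating predictions have one value per (item, user) pair. Below is a minimal sketch of aligning the two, assuming the prediction tables are laid out as items by users and flattened row by row. The array names and values are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 3 items, 2 users
availability = np.array([0.1, 0.2, 0.3])      # one value per item
predictions = np.array([[4.5, 3.0],           # rows: items, columns: users
                        [5.0, 4.0],
                        [2.5, 3.5]])

n_users = predictions.shape[1]

# Repeat every item-level value once per user so that it lines up with
# the row-major flattening of the (items x users) prediction table.
avail_flat = np.repeat(availability, n_users)
pred_flat = predictions.flatten()

# The same result via broadcasting against a table of ones.
avail_broadcast = (availability * np.ones(predictions.shape).T).T.flatten()

print(avail_flat)
print(pred_flat)
```

Both constructions give one row per (item, user) pair, so the item attributes can sit next to the flattened predictions in a single feature table.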

The following three hybrid models will be used:
- Linear hybrid: final score is a linear combination of selected features.
- Selective hybrid: final score is the score given by one of the base models, selected by a linear classifier.
- Mixed hybrid: final score is given by one of several linear hybrid models, selected by a linear classifier.

The third model is more expressive than the first two; however, it still needs to be tested against them because of the danger of overfitting.
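
To make the difference between the first two hybrids concrete, here is a small sketch on made-up numbers. The function names, the toy features and the toy candidate scores are assumptions for illustration only; the weight layout (a flat vector reshaped to features by candidates) is also an assumption of this sketch.

```python
import numpy as np

def linear_score(weights, features):
    # One weight per feature; the score is their weighted sum.
    return features @ weights

def selective_score(weights, features, candidates):
    n_features, n_candidates = features.shape[1], candidates.shape[1]
    # A linear "gate" scores every candidate for every row ...
    gate = features @ weights.reshape(n_features, n_candidates)
    # ... and the candidate with the highest gate score is passed through.
    choice = gate.argmax(axis=1)
    return candidates[np.arange(len(candidates)), choice]

# Toy data: 3 rows, 2 features, 2 candidate scores per row
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.2, 0.9]])
candidates = np.array([[4.0, 2.0],
                       [3.0, 5.0],
                       [1.0, 1.5]])

lin = linear_score(np.array([0.5, 2.0]), features)
sel = selective_score(np.array([1.0, 0.0, 0.0, 1.0]), features, candidates)
print(lin, sel)
```

The linear hybrid can produce scores that none of the base models gave, while the selective hybrid always returns one of the candidate scores unchanged.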

These hybrids seem promising because they take into account both all the available predictions (the differences between them can also carry information) and the item parameters that are relevant to our goals despite not being directly correlated with ratings.

The dataset will be split into a train set (the first 50 users), on which the weights of the hybrid models will be tuned, and a test set (the other 50 users), on which the final versions of all three models will be evaluated. Because the objective metric (FINAL) is non-differentiable, a zero-order optimization technique is required for tuning. I will use [the cross entropy method](https://en.wikipedia.org/wiki/Cross-entropy_method) with a Gaussian weight distribution.
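
As an illustration of the idea, here is a minimal sketch of the cross-entropy method on a toy non-differentiable objective. The function name, the toy objective, the fixed seed and the small diagonal jitter are all assumptions of this sketch. Unlike the notebook's optimizer, it also keeps the best sample seen across all iterations.

```python
import numpy as np

def cross_entropy_maximize(f, mean, cov, sample_size=100, n_iterations=30,
                           elite_frac=0.25, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    dim = mean.shape[0]
    n_elite = max(2, int(sample_size * elite_frac))
    best_x, best_score = None, -np.inf
    for _ in range(n_iterations):
        # 1. Sample candidate solutions from the current Gaussian.
        sample = rng.multivariate_normal(mean, cov, size=sample_size)
        scores = np.array([f(x) for x in sample])
        # 2. Keep the elite fraction with the highest scores.
        elite = sample[np.argsort(scores)[-n_elite:]]
        # 3. Refit the Gaussian to the elite (tiny jitter keeps cov well-behaved).
        mean = elite.mean(axis=0)
        cov = np.cov(elite.T) + 1e-12 * np.eye(dim)
        # Track the best sample seen so far.
        i = int(np.argmax(scores))
        if scores[i] > best_score:
            best_x, best_score = sample[i], scores[i]
    return best_x, best_score

# Toy objective: negative L1 distance to a target (not differentiable at its optimum).
target = np.array([1.0, -0.5])
objective = lambda x: -np.abs(x - target).sum()

best_x, best_score = cross_entropy_maximize(objective, np.zeros(2), np.eye(2))
print(best_x, best_score)
```

Only function evaluations are needed, which is why the method suits a metric built from Top-5 selections and medians.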

In [21]:
n_users = ratings.shape[1]
features = pd.DataFrame({
    # Item-level attributes, repeated once per user to match the flattened (item, user) order
    'avail': np.repeat(items['Availability'].to_numpy(), n_users),
    'norm_price': np.repeat(items['NormPrice'].to_numpy(), n_users),
    'sq_norm_price': np.repeat(items['NormPrice'].to_numpy()**2, n_users),
    # Rating predictions of the base algorithms
    'cbf': CBF.to_numpy().flatten(),
    'item_item': item_item.to_numpy().flatten(),
    'MF': MF.to_numpy().flatten(),
    'user_user': user_user.to_numpy().flatten(),
    'pers_bias': persbias.to_numpy().flatten()})
features

,avail,norm_price,sq_norm_price,cbf,item_item,MF,user_user,pers_bias
0,0.475237,-0.393748,0.155037,4.759110,5.048144,4.739904,4.590799,4.760019
1,0.475237,-0.393748,0.155037,4.970097,4.876618,4.886359,4.877711,4.934751
2,0.475237,-0.393748,0.155037,5.101361,5.298740,4.980338,4.458179,5.072294
3,0.475237,-0.393748,0.155037,4.982741,4.714621,4.925705,4.805026,5.014281
4,0.475237,-0.393748,0.155037,4.744962,4.685493,4.796032,4.997193,4.839983
...,...,...,...,...,...,...,...,...
19995,0.356766,-0.088056,0.007754,4.586171,4.909234,4.481505,4.676085,4.542387
19996,0.356766,-0.088056,0.007754,4.945749,4.721969,4.850921,4.846195,4.960696
19997,0.356766,-0.088056,0.007754,5.153897,4.864441,4.902315,5.059812,5.149694
19998,0.356766,-0.088056,0.007754,4.099508,4.286248,4.232878,3.657823,4.118861


In [22]:
linear_weights, linear_score = \
    CEOpt(lambda x: FINAL(Top5(pd.DataFrame(linear_hybrid(x, features.to_numpy()).reshape(ratings.shape)), 
                               range(50))), 
      np.zeros(features.shape[1]), np.eye(features.shape[1]))
linear_score

-1.224254496273481

In [23]:
candidates = features.iloc[:,3:].to_numpy()
selective_weights, selective_score = \
    CEOpt(lambda x: FINAL(Top5(pd.DataFrame(selective_hybrid(x, 
                                                             features.to_numpy(), 
                                                             candidates).reshape(ratings.shape)), 
                               range(50))), 
          np.zeros(features.shape[1]*candidates.shape[1]), np.eye(features.shape[1]*candidates.shape[1]), 
          n_iterations=75)
selective_score

-2.2538430828527463

In [24]:
mixed_weights, mixed_score = \
    CEOpt(lambda x: FINAL(Top5(pd.DataFrame(mixed_hybrid(x, features.to_numpy()).reshape(ratings.shape)), 
                               range(50))), 
          np.zeros(2*features.shape[1]**2), np.eye(2*features.shape[1]**2),
          n_iterations=100)
mixed_score

-1.001295017318381

### Evaluation

In [25]:
prediction_tables = list(map(lambda x: pd.DataFrame(x.reshape(ratings.shape)),
                             [linear_hybrid(linear_weights, features.to_numpy()), 
                              selective_hybrid(selective_weights, features.to_numpy(), candidates),
                              mixed_hybrid(mixed_weights, features.to_numpy())]))
hybrid_table = pd.DataFrame({'Hybrid': ['Linear', 'Selective', 'Mixed'],
                        'DIFF5': [DIFF5(Top5(predictions, range(50, 100))) 
                                  for predictions in prediction_tables],
                        'priceSTD': [priceSTD(Top5(predictions, range(50, 100))) 
                                  for predictions in prediction_tables],
                        'UNAVAIL': [UNAVAIL(Top5(predictions, range(50, 100))) 
                                  for predictions in prediction_tables],
                        'FINAL': [FINAL(Top5(predictions, range(50, 100))) 
                                  for predictions in prediction_tables]}).set_index('Hybrid')
hybrid_table

Hybrid,DIFF5,priceSTD,UNAVAIL,FINAL
Linear,0.224025,3.501326,0.344268,-1.309189
Selective,0.333296,1.058483,0.313979,-2.200318
Mixed,0.265311,3.801161,0.336412,-1.080965


The mixed hybrid is the best one in FINAL (as expected from the most expressive form of hybridization tried) and in priceSTD in particular. It is, however, beaten by the linear hybrid in UNAVAIL. The selective hybrid fares the worst in priceSTD, UNAVAIL and FINAL (it is still better in priceSTD and FINAL than the base algorithms) but is the best in DIFF5 (as its rating predictions are still the predictions of one of the base algorithms).

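DIFF5 is lower in every hybrid than in the best base algorithm (User-User, 0.347). UNAVAIL in the hybrids (0.31 to 0.34) is slightly above User-User's 0.302 but below the other four base algorithms (0.42 to 0.48). Both were sacrificed for much better price diversity. Note that the base table covers all 100 users, while the hybrid table covers only the 50 test users.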

The difference between the scores on the train and test data is relatively small (under 0.1 in FINAL for each hybrid), which suggests that no significant overfitting has occurred.
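
The train and test gap can be checked directly from the FINAL values printed above. This is a small sketch with the numbers copied by hand from the optimizer outputs and the hybrid table (rounded to six decimals); the dictionary names are only illustrative.

```python
# FINAL on the train users (optimizer outputs) and on the test users (hybrid table)
train_final = {'Linear': -1.224254, 'Selective': -2.253843, 'Mixed': -1.001295}
test_final = {'Linear': -1.309189, 'Selective': -2.200318, 'Mixed': -1.080965}

# Positive gap: the hybrid did better on the test users than on the train users.
gaps = {name: test_final[name] - train_final[name] for name in train_final}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.3f}")
```

The linear and mixed hybrids lose a little on the test users, while the selective hybrid scores slightly higher there than on the train users.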

### Sample Outputs

In [26]:
def display_output(predictions, user):
    top5idx = Top5(predictions, [user])[user]
    # Copy the slice so that adding columns does not modify (or warn about) `items`
    output = items.iloc[top5idx, :].copy()
    output['rating'] = ratings.iloc[top5idx, user]
    output['norm_rating'] = norm_rating.iloc[top5idx, user]
    return output

In [27]:
display_output(pd.DataFrame(linear_hybrid(linear_weights, 
                                          features.to_numpy()).reshape(ratings.shape)), 51)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
620,0.825379,B000J07BRQ,17.42,Scotch,"Scotch Heavy Duty Packaging Tape, 1.88 Inches ...",Packing Tape,Office Products/Office & School Supplies/Envel...,-0.028506,5.0,0.983495
1890,0.779626,B005IVL0RS,255.6,Epson,Epson WorkForce 845 Wireless All-in-One Color ...,Printers,Office Products/Office Electronics/Printers & ...,6.275358,,0.0
2255,0.475237,B00D7NYF62,13.51,,"Quartet Magnetic Dry-Erase Board, 11 x 14 Inch...",Dry Erase Boards,Office Products/Office & School Supplies/Prese...,-0.131991,5.0,0.983495
1876,0.655732,B005HFJFK4,285.29,Epson,Epson Artisan 837 Wireless All-in-One Color In...,Photo Printers,Office Products/Office Electronics/Printers & ...,7.061158,,0.0
2025,0.697887,B007TRUWG4,6.56,Paper Mate,Paper Mate Quick Flip 0.7MM Mechanical Pencil ...,Mechanical Pencils,Office Products/Office & School Supplies/Writi...,-0.315935,5.0,0.983495


In [28]:
display_output(pd.DataFrame(linear_hybrid(linear_weights, 
                                          features.to_numpy()).reshape(ratings.shape)), 52)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
1297,0.945595,B002K9M7MS,,,"3M Permanent Adhesive Shipping Labels, 2 x 4 I...",Shipping Labels,Office Products/Office & School Supplies/Label...,0.0,,0.0
1240,0.601346,B002CRKUZO,3.26,Post-It,"Post-it Durable IndexTabs, 1 Inch, Ideal For B...",Tab Inserts,Office Products/Office & School Supplies/Label...,-0.403276,,0.0
1292,0.973463,B002K9GOPE,35.0,3M,"3M Permanent Adhesive Address Labels, 1 x 2.62...",Address Labels,Office Products/Office & School Supplies/Label...,0.436781,5.0,0.928191
1032,0.644513,B001DPSUUI,23.0,Fellowes,"Fellowes Hot Laminating Pouches, Letter, 5 mil...",Laminating Supplies,Office Products/Office Electronics/Presentatio...,0.119179,,0.0
1890,0.779626,B005IVL0RS,255.6,Epson,Epson WorkForce 845 Wireless All-in-One Color ...,Printers,Office Products/Office Electronics/Printers & ...,6.275358,,0.0


In [29]:
display_output(pd.DataFrame(linear_hybrid(linear_weights, 
                                          features.to_numpy()).reshape(ratings.shape)), 53)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
1292,0.973463,B002K9GOPE,35.0,3M,"3M Permanent Adhesive Address Labels, 1 x 2.62...",Address Labels,Office Products/Office & School Supplies/Label...,0.436781,,0.0
1297,0.945595,B002K9M7MS,,,"3M Permanent Adhesive Shipping Labels, 2 x 4 I...",Shipping Labels,Office Products/Office & School Supplies/Label...,0.0,5.0,0.809174
1032,0.644513,B001DPSUUI,23.0,Fellowes,"Fellowes Hot Laminating Pouches, Letter, 5 mil...",Laminating Supplies,Office Products/Office Electronics/Presentatio...,0.119179,5.0,0.809174
1890,0.779626,B005IVL0RS,255.6,Epson,Epson WorkForce 845 Wireless All-in-One Color ...,Printers,Office Products/Office Electronics/Printers & ...,6.275358,,0.0
1876,0.655732,B005HFJFK4,285.29,Epson,Epson Artisan 837 Wireless All-in-One Color In...,Photo Printers,Office Products/Office Electronics/Printers & ...,7.061158,,0.0


In [30]:
display_output(pd.DataFrame(selective_hybrid(selective_weights, 
                                             features.to_numpy(), candidates).reshape(ratings.shape)), 51)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
1292,0.973463,B002K9GOPE,35.0,3M,"3M Permanent Adhesive Address Labels, 1 x 2.62...",Address Labels,Office Products/Office & School Supplies/Label...,0.436781,,0.0
1937,0.748812,B005X9VZ70,15.55,DYMO,DYMO LabelManager 160 Hand Held Label Maker,Label Makers,Office Products/Office Electronics/Other Offic...,-0.077999,,0.0
620,0.825379,B000J07BRQ,17.42,Scotch,"Scotch Heavy Duty Packaging Tape, 1.88 Inches ...",Packing Tape,Office Products/Office & School Supplies/Envel...,-0.028506,5.0,0.983495
2025,0.697887,B007TRUWG4,6.56,Paper Mate,Paper Mate Quick Flip 0.7MM Mechanical Pencil ...,Mechanical Pencils,Office Products/Office & School Supplies/Writi...,-0.315935,5.0,0.983495
2255,0.475237,B00D7NYF62,13.51,,"Quartet Magnetic Dry-Erase Board, 11 x 14 Inch...",Dry Erase Boards,Office Products/Office & School Supplies/Prese...,-0.131991,5.0,0.983495


In [31]:
display_output(pd.DataFrame(selective_hybrid(selective_weights, 
                                             features.to_numpy(), candidates).reshape(ratings.shape)), 52)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
1300,0.951505,B002K9XU0Q,,,Post-it® Super Sticky Removable File Folde...,File Folder Labels,Office Products/Office & School Supplies/Label...,0.0,5.0,0.928191
1217,0.991766,B0027CTFBO,21.79,Fellowes,Bankers Box SmoothMove Moving and Storage Boxe...,Box Mailers,Office Products/Office & School Supplies/Envel...,0.087154,5.0,0.928191
619,0.344697,B000J05GKA,41.63,Paper Pro,PaperPro 1210 Professional 65 Sheet Stapler,Heavy-Duty Staplers,Office Products/Office & School Supplies/Stapl...,0.612256,,0.0
1292,0.973463,B002K9GOPE,35.0,3M,"3M Permanent Adhesive Address Labels, 1 x 2.62...",Address Labels,Office Products/Office & School Supplies/Label...,0.436781,5.0,0.928191
1890,0.779626,B005IVL0RS,255.6,Epson,Epson WorkForce 845 Wireless All-in-One Color ...,Printers,Office Products/Office Electronics/Printers & ...,6.275358,,0.0


In [32]:
display_output(pd.DataFrame(selective_hybrid(selective_weights, 
                                             features.to_numpy(), candidates).reshape(ratings.shape)), 53)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
825,1.0,B0010T3QT2,7.9,Quality Park,"Quality Park Reveal-N-Seal Envelope, #10, 4-1/...",Envelopes,Office Products/Office & School Supplies/Envel...,-0.28047,5.0,0.809174
1292,0.973463,B002K9GOPE,35.0,3M,"3M Permanent Adhesive Address Labels, 1 x 2.62...",Address Labels,Office Products/Office & School Supplies/Label...,0.436781,,0.0
1297,0.945595,B002K9M7MS,,,"3M Permanent Adhesive Shipping Labels, 2 x 4 I...",Shipping Labels,Office Products/Office & School Supplies/Label...,0.0,5.0,0.809174
1032,0.644513,B001DPSUUI,23.0,Fellowes,"Fellowes Hot Laminating Pouches, Letter, 5 mil...",Laminating Supplies,Office Products/Office Electronics/Presentatio...,0.119179,5.0,0.809174
1680,0.613216,B004F9QBE6,3.99,BIC,"BIC Cristal For Her Ball Pen, 1.0mm, Black, 16...",Ballpoint Pens,Office Products/Office & School Supplies/Writi...,-0.383955,,0.0


In [33]:
display_output(pd.DataFrame(mixed_hybrid(mixed_weights, 
                                         features.to_numpy()).reshape(ratings.shape)), 51)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
620,0.825379,B000J07BRQ,17.42,Scotch,"Scotch Heavy Duty Packaging Tape, 1.88 Inches ...",Packing Tape,Office Products/Office & School Supplies/Envel...,-0.028506,5.0,0.983495
2255,0.475237,B00D7NYF62,13.51,,"Quartet Magnetic Dry-Erase Board, 11 x 14 Inch...",Dry Erase Boards,Office Products/Office & School Supplies/Prese...,-0.131991,5.0,0.983495
1890,0.779626,B005IVL0RS,255.6,Epson,Epson WorkForce 845 Wireless All-in-One Color ...,Printers,Office Products/Office Electronics/Printers & ...,6.275358,,0.0
1876,0.655732,B005HFJFK4,285.29,Epson,Epson Artisan 837 Wireless All-in-One Color In...,Photo Printers,Office Products/Office Electronics/Printers & ...,7.061158,,0.0
2025,0.697887,B007TRUWG4,6.56,Paper Mate,Paper Mate Quick Flip 0.7MM Mechanical Pencil ...,Mechanical Pencils,Office Products/Office & School Supplies/Writi...,-0.315935,5.0,0.983495


In [34]:
display_output(pd.DataFrame(mixed_hybrid(mixed_weights, 
                                         features.to_numpy()).reshape(ratings.shape)), 52)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
103,0.726813,B00006IBV7,10.53,Avery,Avery Two-Side Printable Clean Edge Business C...,Business Cards,Office Products/Office & School Supplies/Paper...,-0.210862,5.0,0.928191
1292,0.973463,B002K9GOPE,35.0,3M,"3M Permanent Adhesive Address Labels, 1 x 2.62...",Address Labels,Office Products/Office & School Supplies/Label...,0.436781,5.0,0.928191
1217,0.991766,B0027CTFBO,21.79,Fellowes,Bankers Box SmoothMove Moving and Storage Boxe...,Box Mailers,Office Products/Office & School Supplies/Envel...,0.087154,5.0,0.928191
1890,0.779626,B005IVL0RS,255.6,Epson,Epson WorkForce 845 Wireless All-in-One Color ...,Printers,Office Products/Office Electronics/Printers & ...,6.275358,,0.0
1876,0.655732,B005HFJFK4,285.29,Epson,Epson Artisan 837 Wireless All-in-One Color In...,Photo Printers,Office Products/Office Electronics/Printers & ...,7.061158,3.0,-1.392286


In [35]:
display_output(pd.DataFrame(mixed_hybrid(mixed_weights, 
                                         features.to_numpy()).reshape(ratings.shape)), 53)

Item,Availability,ASIN,Price,Brand,Title,LeafCat,FullCat,NormPrice,rating,norm_rating
1032,0.644513,B001DPSUUI,23.0,Fellowes,"Fellowes Hot Laminating Pouches, Letter, 5 mil...",Laminating Supplies,Office Products/Office Electronics/Presentatio...,0.119179,5.0,0.809174
1297,0.945595,B002K9M7MS,,,"3M Permanent Adhesive Shipping Labels, 2 x 4 I...",Shipping Labels,Office Products/Office & School Supplies/Label...,0.0,5.0,0.809174
1680,0.613216,B004F9QBE6,3.99,BIC,"BIC Cristal For Her Ball Pen, 1.0mm, Black, 16...",Ballpoint Pens,Office Products/Office & School Supplies/Writi...,-0.383955,,0.0
1890,0.779626,B005IVL0RS,255.6,Epson,Epson WorkForce 845 Wireless All-in-One Color ...,Printers,Office Products/Office Electronics/Printers & ...,6.275358,,0.0
1876,0.655732,B005HFJFK4,285.29,Epson,Epson Artisan 837 Wireless All-in-One Color In...,Photo Printers,Office Products/Office Electronics/Printers & ...,7.061158,,0.0


We see that the performance of the selective hybrid does not differ much from that of the base algorithms, which is expected.

The linear and mixed hybrids, on the other hand, improve price diversity and availability: most users now have 1-2 expensive items in their recommendation lists, and rare items are also a common sight. However, this comes at the cost of relevance. For example, the mixed hybrid recommends item 1876 to user 52 because it is both relatively rare and expensive, even though that user rated it poorly (3.0).

## Part 4. Proposal and Reflection.

### Proposal

When designing our recommender, we want the 5 items recommended to each user to:
- be relevant, so that the user will actually want to buy them
- include both expensive and cheap items, as a study shows that they are more likely to be bought together
- include "local specialities" that are not available in other shops, as they are a key factor that gives us an edge over competitors

To evaluate adherence to these objectives, I propose the following metrics:
- DIFF5: the average normalized difference of the ratings given by a user to the Top-5 items selected by the algorithm, with missing values replaced by 0. It measures how much the user will actually like the recommended items. Normalization accounts for the different rating scales used by different users. This metric is to be maximized.
- priceSTD: the average standard deviation of normalized price among the Top-5 items selected by the algorithm. It measures price diversity among the proposed items. This metric is to be maximized.
- UNAVAIL: the median of (1 - availability) among the Top-5 items. It measures how hard the proposed items are to find elsewhere. The reason for choosing the median instead of the mean is that we want common items to be recommended sometimes as well, provided that they fit the two other objectives. Like the other two, this metric is to be maximized.
- FINAL: the natural logarithm of the product of max(0, DIFF5), priceSTD and UNAVAIL (equivalently, three times the logarithm of their geometric mean). Its maximization ensures that we do not severely lag behind in any of our three objectives (recommending useful items, recommending items with diverse prices and recommending items that are less available in other shops).
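
As a worked example, the FINAL values reported in the summary table are reproduced by the natural logarithm of the product of the three component metrics. The sketch below uses the User-User row. The function name is hypothetical, and returning negative infinity when DIFF5 is not positive is an assumption of this sketch.

```python
import math

def final_score(diff5, price_std, unavail):
    # Product of the three components; a non-positive DIFF5 zeroes it out.
    product = max(0.0, diff5) * price_std * unavail
    if product <= 0:
        return float('-inf')   # assumed convention for a zeroed-out objective
    return math.log(product)

# User-User row of the summary table: DIFF5, priceSTD, UNAVAIL
user_user = final_score(0.347079, 0.591379, 0.302113)

# The log of the geometric mean is the same quantity divided by three.
log_geo_mean = user_user / 3

print(round(user_user, 3), round(log_geo_mean, 3))
```

Because the components are multiplied, a value near zero in any one of them drags the whole score down, which is what keeps the three objectives balanced.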

The final Top-5 items will be determined by a mixed hybrid: the final score is given by one of several linear combinations of features, selected by a linear classifier based on the same features.
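
A minimal sketch of such a mixed hybrid on made-up numbers follows. It assumes a flat weight vector whose first half holds one linear combination per feature and whose second half holds the classifier weights. The function name and the toy values are illustrative only.

```python
import numpy as np

def mixed_score(weights, features):
    n = features.shape[1]
    # First n*n weights: n linear hybrids, one per row of `linear_w`.
    linear_w = weights[:n * n].reshape(n, n)
    candidates = features @ linear_w.T          # shape: (rows, n hybrids)
    # Last n*n weights: linear classifier scoring each hybrid per row.
    gate_w = weights[n * n:].reshape(n, n)
    choice = (features @ gate_w).argmax(axis=1)
    # Return, for each row, the score of the hybrid the classifier picked.
    return candidates[np.arange(len(features)), choice]

# Toy data: 2 rows, 2 features
features = np.array([[3.0, 1.0],
                     [1.0, 4.0]])
weights = np.array([1.0, 1.0,    # hybrid 0: f0 + f1
                    2.0, 0.0,    # hybrid 1: 2 * f0
                    0.0, 1.0,    # classifier, row for feature 0
                    1.0, 0.0])   # classifier, row for feature 1

scores = mixed_score(weights, features)
print(scores)
```

Here the first row is scored by hybrid 1 and the second row by hybrid 0, so different rows can be ranked by different linear combinations.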

The following features will be used:
- Rating predictions of base algorithms
- Availability (significant for our goals, despite not being represented in the rating table)
- Price (significant for our goals, despite not being represented in the rating table)
- Squared deviation of price from the mean price among all items (it allows the linear models to take price "extremity" into account)

Base algorithms here are:
- a TF-IDF content-based recommender
- an item-item collaborative filtering recommender
- a matrix factorization (gradient descent) recommender
- a user-user collaborative filtering recommender
- a baseline recommender that uses product and customer ratings distributions to provide personally-scaled average predictions

The data will be split into three disjoint parts. The first one will be used to train the base algorithms. The second one will be used to tune the hybrid weights using [the cross entropy method](https://en.wikipedia.org/wiki/Cross-entropy_method) with a Gaussian weight distribution. The third one will be used to evaluate the final hybrid and to determine what quality the final model will have for new users.

This recommender has empirically shown better performance than the other evaluated models according to the aforementioned metrics, which were developed to quantitatively reflect the business goals formulated at the beginning of the proposal.

For example, after this procedure, user 51 was recommended 3 cheap items that they had already bought and liked and 2 expensive items that were new to them. Two of the cheap items and one of the expensive items were relatively unavailable (availability < 0.7), which coincides with our goals.
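
The user 51 example can be checked against the mixed-hybrid sample output shown earlier. In the sketch below the rows are copied by hand from that output. The 100-dollar price threshold separating cheap from expensive items is an assumption made only for this illustration, and the per-list median is computed for this single user.

```python
from statistics import median

# (item id, price, availability, rating by user 51 or None if unrated)
top5 = [
    (620, 17.42, 0.825379, 5.0),
    (2255, 13.51, 0.475237, 5.0),
    (1890, 255.60, 0.779626, None),
    (1876, 285.29, 0.655732, None),
    (2025, 6.56, 0.697887, 5.0),
]

EXPENSIVE = 100.0   # assumed price threshold, for illustration only
SCARCE = 0.7        # availability threshold used in the text

cheap_liked = [i for i, p, a, r in top5 if p < EXPENSIVE and r == 5.0]
expensive_new = [i for i, p, a, r in top5 if p >= EXPENSIVE and r is None]
scarce = [i for i, p, a, r in top5 if a < SCARCE]

# Median of (1 - availability) for this one recommendation list
unavail = median(1 - a for _, _, a, _ in top5)

print(cheap_liked, expensive_new, scarce, round(unavail, 6))
```

The list contains three cheap liked items, two expensive unrated ones, and three items below the 0.7 availability threshold.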

### Reflection

The process of translating business requirements into metrics was not easy. After all, there is no single obviously correct way of combining three business objectives, completely orthogonal to each other, into one metric.

Evaluating the different algorithms was an interesting task and yielded a lot of unexpected results. Selecting the hybrids was actually quite challenging: unable to devise good weights by hand, I had to resort to numerical optimization (which also required additional data management).

Data management issues have a different status at different stages of the capstone. For the base algorithms, it is hard to say anything about the presence or absence of data leaks, because the data on which they were trained was not specified and only precomputed results were available. For the hybrids, on the other hand, no data leaks were allowed (modulo possible leaks in the base algorithm predictions): the data was managed by dividing users into disjoint train and test groups. If the people who prepared the precomputed base predictions also trained on a separate dataset, one can conclude that the solution generalizes well, as no significant overfitting was detected.

Everything in this project was done with Python, NumPy and pandas. These tools are extremely useful and flexible.

I think the capstone achieved its goal of giving me one project to bring together the diverse set of materials I learned in this specialization. I feel more capable of, and more confident in, taking on applications of recommender systems.