## Non-Personalized Recommendations
- Needed when users are new to the system but we still want to provide recommendations
- Recommendations are based on general usage patterns, e.g. recommending trending articles
- Also useful whenever personalized recommendations are not possible

### Weak personalization
- Recommendations based on limited data about the user
    - zip code or location
    - age, gender, nationality, ethnicity
- Used for first pass 'stereotyped' personalization
- Product recommendations based on _product association_, i.e. when a user is looking at a certain product, related products may be recommended
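
Product association can be illustrated with a simple co-occurrence count. The sketch below is only one possible way to score "related" products; the `baskets` data, the product names, and the `associated_products` helper are all hypothetical and are not part of the movie dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data: each row is one user session, each column a product,
# and 1 means the product was viewed or bought in that session.
baskets = pd.DataFrame(
    {
        "laptop":   [1, 1, 1, 0, 1],
        "mouse":    [1, 1, 0, 0, 1],
        "backpack": [0, 1, 0, 1, 0],
    }
)


def associated_products(baskets, product, top_n=2):
    """Rank the other products by the share of sessions containing
    `product` that also contain them (a simple co-occurrence score)."""
    with_product = baskets[baskets[product] == 1]
    share = with_product.drop(columns=product).mean()
    return share.sort_values(ascending=False).head(top_n)


related = associated_products(baskets, "laptop")
print(related)
```

Here a user looking at the laptop would be shown the mouse first, because the mouse appears in a larger share of the sessions that contain the laptop.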


### Summary Statistics
- Aggregated reviews and ratings are one way to provide non-personalized recommendations
- If the majority of past users recommend a product in their reviews, it is a good candidate for recommendation
- However, what do the aggregated scores represent?
    - Does a high average rating mean it is the most popular product?
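
A small, made-up example helps answer that question. In the sketch below, `toy_ratings` and its two product names are hypothetical; the point is only that the mean and the number of ratings can disagree about which product is "best".

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are products, NaN means "not rated".
toy_ratings = pd.DataFrame(
    {
        "niche_product":   [5.0, np.nan, np.nan, np.nan, 5.0],
        "popular_product": [4.0, 5.0, 3.0, 4.0, 4.0],
    }
)

summary = pd.DataFrame(
    {
        "mean_rating": toy_ratings.mean(axis=0),    # NaN values are skipped
        "rating_count": toy_ratings.count(axis=0),  # number of non-NaN ratings
    }
)
print(summary)
```

The niche product has the higher mean but only two ratings, while the popular product has a lower mean backed by five ratings, so a high average alone does not imply popularity.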
    

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# Read the movie rating matrix and preview the first rows
movie_ratings_df = pd.read_csv('data/movie_ratings1.csv')
movie_ratings_df.head()

| | User | Gender (1 =F, 0=M) | 260: Star Wars: Episode IV - A New Hope (1977) | 1210: Star Wars: Episode VI - Return of the Jedi (1983) | 356: Forrest Gump (1994) | 318: Shawshank Redemption, The (1994) | 593: Silence of the Lambs, The (1991) | 3578: Gladiator (2000) | 1: Toy Story (1995) | 2028: Saving Private Ryan (1998) | ... | 2396: Shakespeare in Love (1998) | 2916: Total Recall (1990) | 780: Independence Day (ID4) (1996) | 541: Blade Runner (1982) | 1265: Groundhog Day (1993) | 2571: Matrix, The (1999) | 527: Schindler's List (1993) | 2762: Sixth Sense, The (1999) | 1198: Raiders of the Lost Ark (1981) | 34: Babe (1995) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 755 | 0 | 1.0 | 5.0 | 2.0 | NaN | 4.0 | 4.0 | 2.0 | 2.0 | ... | 2.0 | NaN | 5.0 | 2.0 | NaN | 4.0 | 2.0 | 5.0 | NaN | NaN |
| 1 | 5277 | 0 | 5.0 | 3.0 | NaN | 2.0 | 4.0 | 2.0 | 1.0 | NaN | ... | 3.0 | 2.0 | 2.0 | NaN | 2.0 | NaN | 5.0 | 1.0 | 3.0 | NaN |
| 2 | 1577 | 1 | NaN | NaN | NaN | 5.0 | 2.0 | NaN | 4.0 | NaN | ... | NaN | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 3.0 |
| 3 | 4388 | 0 | NaN | 3.0 | NaN | NaN | NaN | 1.0 | 2.0 | 3.0 | ... | NaN | 4.0 | 1.0 | 3.0 | 5.0 | NaN | 5.0 | 1.0 | 1.0 | 2.0 |
| 4 | 1202 | 1 | 4.0 | 3.0 | 4.0 | 1.0 | 4.0 | 1.0 | NaN | 4.0 | ... | 5.0 | 1.0 | NaN | 4.0 | NaN | 3.0 | 5.0 | 5.0 | NaN | NaN |


In [6]:
# view summary statistics of the data
movie_ratings_df.describe()

| | User | Gender (1 =F, 0=M) | 260: Star Wars: Episode IV - A New Hope (1977) | 1210: Star Wars: Episode VI - Return of the Jedi (1983) | 356: Forrest Gump (1994) | 318: Shawshank Redemption, The (1994) | 593: Silence of the Lambs, The (1991) | 3578: Gladiator (2000) | 1: Toy Story (1995) | 2028: Saving Private Ryan (1998) | ... | 2396: Shakespeare in Love (1998) | 2916: Total Recall (1990) | 780: Independence Day (ID4) (1996) | 541: Blade Runner (1982) | 1265: Groundhog Day (1993) | 2571: Matrix, The (1999) | 527: Schindler's List (1993) | 2762: Sixth Sense, The (1999) | 1198: Raiders of the Lost Ark (1981) | 34: Babe (1995) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20.0 | 20.0 | 15.0 | 14.0 | 10.0 | 10.0 | 16.0 | 12.0 | 17.0 | 11.0 | ... | 11.0 | 12.0 | 13.0 | 9.0 | 12.0 | 12.0 | 12.0 | 12.0 | 11.0 | 10.0 |
| mean | 3658.1 | 0.45 | 3.266667 | 3.0 | 2.7 | 3.6 | 3.0625 | 2.916667 | 2.823529 | 3.0 | ... | 2.909091 | 1.916667 | 2.769231 | 3.222222 | 3.166667 | 2.833333 | 3.0 | 2.833333 | 2.909091 | 3.0 |
| std | 1749.716756 | 0.510418 | 1.387015 | 1.467599 | 1.337494 | 1.646545 | 1.28938 | 1.564279 | 1.131111 | 1.414214 | ... | 1.513575 | 0.996205 | 1.235168 | 1.092906 | 1.585923 | 1.527525 | 1.595448 | 1.642245 | 1.578261 | 1.414214 |
| min | 139.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 2558.75 | 0.0 | 2.0 | 2.0 | 2.0 | 2.5 | 2.0 | 1.75 | 2.0 | 2.0 | ... | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.75 | 2.0 | 1.0 | 1.5 | 2.0 |
| 50% | 4252.5 | 0.0 | 4.0 | 3.0 | 2.5 | 4.0 | 3.0 | 3.0 | 2.0 | 3.0 | ... | 3.0 | 2.0 | 3.0 | 3.0 | 3.0 | 2.5 | 2.5 | 3.0 | 3.0 | 2.5 |
| 75% | 4916.25 | 1.0 | 4.0 | 4.0 | 3.75 | 5.0 | 4.0 | 4.0 | 4.0 | 4.0 | ... | 4.0 | 2.25 | 4.0 | 4.0 | 5.0 | 4.0 | 5.0 | 4.25 | 4.0 | 4.0 |
| max | 6037.0 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 5.0 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |


In [26]:
# Drop the User and Gender columns so that only the movie rating columns remain
# axis : {0 or 'index', 1 or 'columns'}, default 0
ratings_df = movie_ratings_df.drop(labels=['User', 'Gender (1 =F, 0=M)'], axis=1)

### Mean rating
___

In [28]:
# Mean Rating: Calculate the mean rating for each movie, order with the highest rating listed first, 
# and submit the top three (along with the mean scores for the top two).

# axis=0 aggregates down the rows, giving one mean per column (movie); NaN ratings are skipped
mean_movie_ratings = ratings_df.mean(axis=0)
mean_movie_ratings.sort_values(ascending=False)[:3]

318: Shawshank Redemption, The (1994)             3.600000
260: Star Wars: Episode IV - A New Hope (1977)    3.266667
541: Blade Runner (1982)                          3.222222
dtype: float64

### Rating Count
___

In [15]:
# Rating Count (popularity): Count the number of ratings for each movie, order with the most number of 
# ratings first, and submit the top three (along with the counts for the top two).

# The count() method returns the number of non-NaN values in each column
# Similarly, count(axis=1) returns the number of non-NaN values in each row.

count_of_movie_ratings = ratings_df.count(axis=0)
count_of_movie_ratings.sort_values(ascending=False)[:3]

1: Toy Story (1995)                               17
593: Silence of the Lambs, The (1991)             16
260: Star Wars: Episode IV - A New Hope (1977)    15
dtype: int64

### % of Ratings 4 or Higher
___

In [35]:
# % of ratings 4+ (liking): Calculate the percentage of ratings for each movie that are 4 or higher. 
# Order with the highest percentage first, and submit the top three (along with the percentage for the top two). 
# Notice that the three different measures of "best" reflect different priorities and give different results; 
# this should help you see why you need to be thoughtful about what metrics you use.

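# where() replaces ratings below 4 with NaN, and count() only counts non-NaN values,
# so the ratio below is the fraction (0 to 1) of each movie's ratings that are 4 or higher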

(ratings_df.where(ratings_df >= 4).count(axis=0) / ratings_df.count(axis=0)).sort_values(ascending=False)[:3]

318: Shawshank Redemption, The (1994)             0.700000
260: Star Wars: Episode IV - A New Hope (1977)    0.533333
3578: Gladiator (2000)                            0.500000
dtype: float64
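
To see how the three measures of "best" can disagree, it may help to put them side by side. The sketch below is a minimal illustration under stated assumptions: `summarize_ratings` is a hypothetical helper (not part of pandas or of this notebook), and `toy` is made-up data with the same shape as `ratings_df` (users in rows, movies in columns, NaN for missing ratings).

```python
import numpy as np
import pandas as pd


def summarize_ratings(ratings, like_threshold=4):
    """Return mean rating, rating count, and share of ratings at or above
    `like_threshold` for every column of a user x item rating matrix."""
    counts = ratings.count(axis=0)                    # non-NaN ratings per item
    liked = (ratings >= like_threshold).sum(axis=0)   # NaN compares as False
    return pd.DataFrame(
        {
            "mean": ratings.mean(axis=0),
            "count": counts,
            "share_liked": liked / counts,
        }
    )


# Hypothetical toy matrix, not the movie data used above.
toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, np.nan, np.nan, np.nan],
        "B": [3.0, 4.0, 2.0, 5.0, 3.0],
        "C": [4.0, 4.0, 4.0, 1.0, np.nan],
    }
)

summary = summarize_ratings(toy)
print(summary.sort_values("mean", ascending=False))
```

The same helper could be called on `ratings_df`, and sorting the result by each column in turn would reproduce the three rankings computed above.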
