### Recommendations with MovieTweetings: Most Popular Recommendation

Now that you have created the columns used throughout the rest of the lesson, let's get started with the first of our recommendation methods.

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.


In [465]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')
del movies['Unnamed: 0']
del reviews['Unnamed: 0']

#### 1. How To Find The Most Popular Movies

This notebook has a single task: no matter the user, provide a list of recommendations based simply on the most popular items.

For this task, we will consider what is "most popular" based on the following criteria:

* A movie with the highest average rating is considered best
* When average ratings are tied, the movie with more ratings ranks higher
* A movie must have a minimum of 5 ratings to be considered among the best movies
* If movies are tied in both their average rating and number of ratings, the movie with the most recent rating ranks higher

With these criteria, the goal for this notebook is to take a **user_id** and provide back the **n_top** recommendations.  Use the function below as the scaffolding that will be used for all the future recommendations as well.
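Before working with the full dataset, the criteria above can be checked on a few rows. The snippet below is a minimal sketch on made-up data (the movie names, ratings, and the lowered threshold of 3 are assumptions for illustration only), and it uses pandas named aggregation, which needs a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the MovieTweetings data
toy = pd.DataFrame({
    'movie':     ['A', 'A',
                  'B', 'B', 'B',
                  'C', 'C', 'C',
                  'D', 'D', 'D',
                  'E', 'E', 'E', 'E'],
    'rating':    [10, 10,
                  9, 9, 9,
                  9, 9, 9,
                  10, 10, 9,
                  9, 9, 9, 9],
    'timestamp': list(range(1, 16)),
})
min_ratings = 3  # the lesson uses 5; 3 keeps the toy data small

# One row per movie: average rating, number of ratings, most recent rating time
summary = toy.groupby('movie').agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'count'),
    last_rating=('timestamp', 'max'),
)

# Drop movies with too few ratings, then sort by the three criteria in order
ranked = (summary[summary['num_ratings'] >= min_ratings]
          .sort_values(['avg_rating', 'num_ratings', 'last_rating'],
                       ascending=False))
print(list(ranked.index))
```

Movie A has a perfect average but too few ratings, so it is dropped. D wins on average rating, E beats B and C on number of ratings, and C beats B because its latest rating is more recent.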

In [475]:
# merge movies with the first four review columns (joined on the shared movie_id column)
df = movies.merge(reviews.iloc[:, :4])
df.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Action,Documentary,Animation,Comedy,Short,Western,Thriller,user_id,rating,timestamp
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,1,0,0,1,0,0,40425,5,1396981211
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,1,0,0,1,0,0,48654,10,1412878553
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,1,0,0,1,0,0,38857,10,1439248579
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,30852,8,1488189899
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,1,...,0,0,0,0,1,0,0,9720,6,1385233195


In [364]:
# keep only the ratings of movies that have at least 5 ratings
df_filter=df[df.groupby('movie_id')['rating'].transform('count')>=5]
df_filter.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Action,Documentary,Animation,Comedy,Short,Western,Thriller,user_id,rating,timestamp
5,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,471,10,1437579236
6,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,3739,9,1434226311
7,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,6581,9,1474824191
8,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,9970,8,1414348320
9,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,14678,7,1440606444


In [365]:
df_filter.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 677742 entries, 5 to 712306
Data columns (total 38 columns):
movie_id       677742 non-null int64
movie          677742 non-null object
genre          677715 non-null object
date           677742 non-null int64
1800's         677742 non-null int64
1900's         677742 non-null int64
2000's         677742 non-null int64
History        677742 non-null int64
News           677742 non-null int64
Horror         677742 non-null int64
Musical        677742 non-null int64
Film-Noir      677742 non-null int64
Mystery        677742 non-null int64
Adventure      677742 non-null int64
Sport          677742 non-null int64
War            677742 non-null int64
Music          677742 non-null int64
Reality-TV     677742 non-null int64
Adult          677742 non-null int64
Crime          677742 non-null int64
Family         677742 non-null int64
Drama          677742 non-null int64
Talk-Show      677742 non-null int64
Biography      677742 non-null int64

In [366]:
s=df_filter.groupby('movie_id')['rating'].mean()
d={'movie_id':s.keys(),
'means':s.values}
#pd.DataFrame(d)

In [367]:
# average rating
df_avg=df_filter.merge(pd.DataFrame(d))

In [368]:
# With ties, movies that have more ratings are better
r=df_avg['movie_id'].value_counts()
d2={'movie_id':r.keys(),
    'counts':r.values}
df_count=df_avg.merge(pd.DataFrame(d2))

In [369]:
df_count.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Animation,Comedy,Short,Western,Thriller,user_id,rating,timestamp,means,counts
0,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,1,0,0,471,10,1437579236,8.368421,19
1,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,1,0,0,3739,9,1434226311,8.368421,19
2,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,1,0,0,6581,9,1474824191,8.368421,19
3,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,1,0,0,9970,8,1414348320,8.368421,19
4,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,1,0,0,14678,7,1440606444,8.368421,19


In [370]:
# keep, for each movie, only the row of its most recent rating
rec=df_count.groupby(['movie_id'])['timestamp'].max()
d3={'movie_id':rec.index,
    'timestamp':rec.values}
df_recent=df_count.merge(pd.DataFrame(d3),how='right')

In [371]:
# sorted by average rating, number of ratings, then most recent rating
df_recent=df_recent.sort_values(by=['means','counts','timestamp'], ascending=False).reset_index(drop=True)

> A better solution: aggregate once per movie instead of merging the aggregates back onto every rating row

In [503]:
means=df.groupby(['movie_id','movie'])['rating'].mean()
count=df.groupby(['movie_id','movie'])['rating'].count()
time=df.groupby(['movie_id','movie'])['timestamp'].max()

In [504]:
df_result=pd.DataFrame({'means':means,'count':count,'time':time})
df_result.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,means,count,time
movie_id,movie,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8,Edison Kinetoscopic Record of a Sneeze (1894),5.0,1,1396981211
10,La sortie des usines Lumière (1895),10.0,1,1412878553
12,The Arrival of a Train (1896),10.0,1,1439248579
25,The Oxford and Cambridge University Boat Race (1895),8.0,1,1488189899
91,Le manoir du diable (1896),6.0,1,1385233195


In [505]:
df_result=df_result[df_result['count']>=5].reset_index()
df_result.head()

Unnamed: 0,movie_id,movie,means,count,time
0,417,Le voyage dans la lune (1902),8.368421,19,1530047326
1,6864,Intolerance: Love's Struggle Throughout the Ag...,8.8,5,1508630790
2,10323,Das Cabinet des Dr. Caligari (1920),8.210526,19,1506666124
3,12349,The Kid (1921),8.5,60,1526267924
4,12364,Körkarlen (1921),9.625,8,1475259604


In [516]:
list(df_result.sort_values(by=['means','count','time'],ascending=False).iloc[:5,1].values)

['MSG 2 the Messenger (2015)',
 'Avengers: Age of Ultron Parody (2015)',
 'Sorry to Bother You (2018)',
 'Selam (2013)',
 "Quiet Riot: Well Now You're Here, There's No Way Back (2014)"]

In [561]:
def popular_recommendations(user_id, n_top):
    '''
    INPUT:
    user_id - the user_id of the individual you are making recommendations for
              (not used here: every user gets the same most-popular list)
    n_top - an integer number of recommendations you want back
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    # per-movie average rating, number of ratings, and most recent rating time
    means=df.groupby(['movie_id','movie'])['rating'].mean()
    count=df.groupby(['movie_id','movie'])['rating'].count()
    time=df.groupby(['movie_id','movie'])['timestamp'].max()
    # combine the three aggregates into one DataFrame
    df_result=pd.DataFrame({'means':means,'count':count,'timestamp':time})
    # keep only movies with at least 5 ratings
    df_result=df_result[df_result['count']>=5].reset_index()
    # order the result: average rating, then number of ratings, then most recent rating
    result=df_result.sort_values(by=['means','count','timestamp'],ascending=False)
    # column 1 is the movie title; take the first n_top titles, best to worst
    top_movies=list(result.iloc[:n_top,1].values)
    return top_movies # a list of the n_top movies as recommended

Using the criteria above, you should be able to put together the above function.  If you feel confident in your solution, check the results of your function against our solution. On the next page, you can see a walkthrough, and you can of course get the solution by looking at the solution notebook available in this workspace.

In [562]:
# Put your solutions for each of the cases here

# Top 20 movies recommended for id 1

recs_20_for_1 = popular_recommendations(1,20)

# Top 5 movies recommended for id 53968
recs_5_for_53968 = popular_recommendations(53968,5)

# Top 100 movies recommended for id 70000
recs_100_for_70000 = popular_recommendations(70000,100)

# Top 35 movies recommended for id 43
recs_35_for_43 = popular_recommendations(43,35)



In [563]:
### You Should Not Need To Modify Anything In This Cell
ranked_movies = t.create_ranked_df(movies, reviews) # only run this once - it is not fast

# check 1 
assert t.popular_recommendations('1', 20, ranked_movies) == recs_20_for_1,  "The first check failed..."
# check 2
assert t.popular_recommendations('53968', 5, ranked_movies) == recs_5_for_53968,  "The second check failed..."
# check 3
assert t.popular_recommendations('70000', 100, ranked_movies) == recs_100_for_70000,  "The third check failed..."
# check 4
assert t.popular_recommendations('43', 35, ranked_movies) == recs_35_for_43,  "The fourth check failed..."

print("If you got here, looks like you are good to go!  Nice job!")

If you got here, looks like you are good to go!  Nice job!
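The checks above pass a precomputed `ranked_movies` table into the test helpers, while `popular_recommendations` redoes its `groupby` on every call. The same "rank once, slice many times" idea can be sketched as follows. The helper names `build_ranked_table` and `top_titles` and the toy data are hypothetical, and this is not the implementation inside `tests`.

```python
import pandas as pd

def build_ranked_table(ratings, min_ratings=5):
    """Hypothetical helper: one row per movie, sorted best to worst."""
    summary = ratings.groupby(['movie_id', 'movie']).agg(
        means=('rating', 'mean'),
        count=('rating', 'count'),
        timestamp=('timestamp', 'max'),
    ).reset_index()
    # Keep only movies with enough ratings, then apply the three sort criteria
    summary = summary[summary['count'] >= min_ratings]
    return summary.sort_values(['means', 'count', 'timestamp'], ascending=False)

def top_titles(ranked, n_top):
    """Hypothetical helper: first n_top titles of an already ranked table."""
    return ranked['movie'].head(n_top).tolist()

# Made-up ratings: two movies with 5 ratings each, one with only 2
toy = pd.DataFrame({
    'movie_id':  [1] * 5 + [2] * 5 + [3] * 2,
    'movie':     ['Alpha (2000)'] * 5 + ['Beta (2001)'] * 5 + ['Gamma (2002)'] * 2,
    'rating':    [8] * 5 + [9] * 5 + [10] * 2,
    'timestamp': list(range(12)),
})

ranked = build_ranked_table(toy)   # computed once
print(top_titles(ranked, 1))       # cheap to call repeatedly
print(top_titles(ranked, 5))
```

Separating the expensive ranking step from the cheap slicing step means the table only has to be rebuilt when new ratings arrive.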


**Notice:** This wasn't the only way we could have determined the "top rated" movies.  You can imagine that in keeping track of trending news or trending social events, you would likely want to create a time window from the current time, and then pull the articles in the most recent time frame.  There are always going to be some subjective decisions to be made.  
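As a rough illustration of the time-window idea, the sketch below keeps only ratings from the last 30 days before ranking. The data, the 30-day window, and the choice to treat the latest timestamp in the data as "now" are all assumptions made for the example.

```python
import pandas as pd

day = 24 * 60 * 60  # seconds per day; timestamps are Unix seconds

# Hypothetical toy ratings: one older favorite and one recently rated movie
toy = pd.DataFrame({
    'movie':     ['Old Hit', 'Old Hit', 'Old Hit', 'New Hit', 'New Hit'],
    'rating':    [10, 10, 10, 8, 9],
    'timestamp': [0, 1 * day, 2 * day, 99 * day, 100 * day],
})

window_days = 30  # assumed window length
# Assumption: the newest rating in the data stands in for the current time
cutoff = toy['timestamp'].max() - window_days * day
recent = toy[toy['timestamp'] >= cutoff]

# Rank only the ratings that fall inside the window
trending = (recent.groupby('movie')['rating']
            .agg(['mean', 'count'])
            .sort_values(['mean', 'count'], ascending=False))
print(list(trending.index))
```

Here the older movie has the higher average overall, but it drops out entirely because none of its ratings fall inside the window.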

If you find that no one is paying any attention to your most popular recommendations, then it might be time to find a new way to recommend, which is what the next parts of the lesson should prepare us to do!


### Part II: Adding Filters

Now that you have created a function to give back the **n_top** movies, let's make it a bit more robust.  Add arguments that will act as filters for the movie **year** and **genre**.  

Use the cells below to adjust your existing function to accept **year** and **genre** arguments as **lists** of **strings**.  The final results should then contain only movies that match any of the provided years and any of the provided genres (the values within each list act as `or` conditions).  If no list is provided, no filter should be applied.

You can adjust other inputs as needed to retrieve the final results you are looking for!

Try writing a few tests against the corresponding function in our tests module.  The example below returns the top 20 movies for user 1 based on the specified year and genre filters.  Does yours return the same?

In [434]:
years=['2015', '2016', '2017', '2018']
genres=['History']
yearfilter=df_recent[df_recent['date'].isin(years)]
genresfilter=yearfilter[yearfilter[genres].sum(axis=1)>0]
genresfilter

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Animation,Comedy,Short,Western,Thriller,user_id,rating,timestamp,means,counts
53,6316138,Ayla: The Daughter of War (2017),Drama|History|War,2017,0,0,1,1,0,0,...,0,0,0,0,0,3380,5,1529705949,9.200000,35
84,5098836,I Believe in Miracles (2015),Documentary|History|Sport,2015,0,0,1,1,0,0,...,0,0,0,0,0,46052,9,1524867219,9.100000,10
206,6223974,The Farthest (2017),Documentary|History,2017,0,0,1,1,0,0,...,0,0,0,0,0,20977,10,1528689985,8.833333,6
348,4010918,Sado (2015),Drama|History,2015,0,0,1,1,0,0,...,0,0,0,0,0,4056,10,1521149142,8.666667,9
438,6068960,Hatred (2016),Drama|History|War,2016,0,0,1,1,0,0,...,0,0,0,0,0,18052,7,1512677796,8.600000,5
748,4964310,Kincsem (2017),Adventure|Drama|History,2017,0,0,1,1,0,0,...,0,0,0,0,0,15862,8,1513318937,8.400000,5
891,2168180,Nise - O Coração da Loucura (2015),Biography|Drama|History,2015,0,0,1,1,0,0,...,0,0,0,0,0,28288,8,1502442730,8.333333,9
898,6794424,LA 92 (2017),Documentary|History,2017,0,0,1,1,0,0,...,0,0,0,0,0,45903,8,1521379672,8.333333,6
943,1398426,Straight Outta Compton (2015),Biography|Drama|History,2015,0,0,1,1,0,0,...,0,0,0,0,0,43405,7,1528614871,8.306977,645
953,3449292,Manjhi: The Mountain Man (2015),Biography|Drama|History,2015,0,0,1,1,0,0,...,0,0,0,0,0,52959,7,1528028349,8.300000,10


In [538]:
df

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Action,Documentary,Animation,Comedy,Short,Western,Thriller,user_id,rating,timestamp
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,1,0,0,1,0,0,40425,5,1396981211
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,1,0,0,1,0,0,48654,10,1412878553
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,1,0,0,1,0,0,38857,10,1439248579
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,30852,8,1488189899
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,1,...,0,0,0,0,1,0,0,9720,6,1385233195
5,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,471,10,1437579236
6,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,3739,9,1434226311
7,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,6581,9,1474824191
8,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,9970,8,1414348320
9,417,Le voyage dans la lune (1902),Short|Adventure|Fantasy,1902,0,1,0,0,0,0,...,0,0,0,0,1,0,0,14678,7,1440606444


In [548]:
means=df.groupby(['movie_id','movie'])['rating'].mean()
count=df.groupby(['movie_id','movie'])['rating'].count()
time=df.groupby(['movie_id','movie'])['timestamp'].max()
# convert df
df_result=pd.DataFrame({'means':means,'count':count,'timestamp':time})
# filter >=5
df_result=df_result[df_result['count']>=5].reset_index()
# order the result
result=df_result.sort_values(by=['means','count','timestamp'],ascending=False)
result

Unnamed: 0,movie_id,movie,means,count,timestamp
9432,4921860,MSG 2 the Messenger (2015),10.000000,48,1471195010
9584,5262972,Avengers: Age of Ultron Parody (2015),10.000000,28,1452213883
9718,5688932,Sorry to Bother You (2018),10.000000,14,1529199888
8035,2737018,Selam (2013),10.000000,10,1431298561
7883,2560840,"Quiet Riot: Well Now You're Here, There's No W...",10.000000,6,1453509044
7279,2219210,Crawl Bitch Crawl (2012),10.000000,6,1374535852
9238,4448444,Make Like a Dog (2015),10.000000,5,1504965108
9532,5131914,Pandorica (2016),10.000000,5,1459749142
6958,2059318,Third Contact (2011),10.000000,5,1392306511
5652,1431149,Romeo Juliet (2009),10.000000,5,1374859141


In [549]:
result.merge(df,how='left')

Unnamed: 0,movie_id,movie,means,count,timestamp,genre,date,1800's,1900's,2000's,...,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller,user_id,rating
0,4921860,MSG 2 the Messenger (2015),10.000000,48,1471195010,Comedy|Drama|Fantasy,2015,0,0,1,...,0,0,0,0,1,0,0,0,25615,10
1,5262972,Avengers: Age of Ultron Parody (2015),10.000000,28,1452213883,Short|Comedy,2015,0,0,1,...,0,0,0,0,1,1,0,0,12017,10
2,5688932,Sorry to Bother You (2018),10.000000,14,1529199888,Comedy|Fantasy|Sci-Fi,2018,0,0,1,...,0,0,0,0,1,0,0,0,50300,10
3,2737018,Selam (2013),10.000000,10,1431298561,Drama|Romance,2013,0,0,1,...,0,0,0,0,0,0,0,0,40675,10
4,2560840,"Quiet Riot: Well Now You're Here, There's No W...",10.000000,6,1453509044,Documentary|Music,2014,0,0,1,...,0,0,1,0,0,0,0,0,30109,10
5,2219210,Crawl Bitch Crawl (2012),10.000000,6,1374535852,Horror|Sci-Fi|Thriller,2012,0,0,1,...,0,0,0,0,0,0,0,1,2358,10
6,4448444,Make Like a Dog (2015),10.000000,5,1504965108,Short|Comedy|Drama,2015,0,0,1,...,0,0,0,0,1,1,0,0,9701,10
7,5131914,Pandorica (2016),10.000000,5,1459749142,Sci-Fi,2016,0,0,1,...,0,0,0,0,0,0,0,0,5463,10
8,2059318,Third Contact (2011),10.000000,5,1392306511,Mystery|Sci-Fi|Thriller,2011,0,0,1,...,0,0,0,0,0,0,0,1,32418,10
9,1431149,Romeo Juliet (2009),10.000000,5,1374859141,Drama,2009,0,0,1,...,0,0,0,0,0,0,0,0,12355,10


In [556]:
def popular_filter(user_id, n_top, yearslist=None,genreslist=None):
    '''
    INPUT:
    user_id - the user_id of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
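    yearslist - a list of release years to keep (no year filter if None)
    genreslist - a list of genres to keep; a movie matches if it belongs to any of them (no genre filter if None)
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    # per-movie average rating, number of ratings, and most recent rating time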
    means=df.groupby(['movie_id','movie'])['rating'].mean()
    count=df.groupby(['movie_id','movie'])['rating'].count()
    time=df.groupby(['movie_id','movie'])['timestamp'].max()
    # convert df
    df_result=pd.DataFrame({'means':means,'count':count,'timestamp':time})
    # filter >=5
    df_result=df_result[df_result['count']>=5].reset_index()
    # order the result
    result=df_result.sort_values(by=['means','count','timestamp'],ascending=False)
    
    # merge df to obtain year and genres
    result=result.merge(df,how='left')
    # filter by year & genres
    if yearslist is not None:
        result=result[result['date'].isin(yearslist)]
    if genreslist is not None:
        result=result[result[genreslist].sum(axis=1)>0]
    # a list of the n_top recommended movies by movie title in order best to worst
    top_movies=list(result.iloc[:n_top,1].values)
    return top_movies # a list of the n_top movies as recommended
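One detail in `popular_filter` is worth noting: the years are passed as strings, while the `info()` output earlier shows `date` stored as `int64`. Whether `isin` treats `2015` and `'2015'` as equal may depend on the pandas version, so a more defensive variant compares both sides as strings. The sketch below shows that on a made-up, already ranked table. The helper name `filter_ranked` and the data are hypothetical.

```python
import pandas as pd

# Hypothetical one-row-per-movie table, already ordered best to worst
ranked = pd.DataFrame({
    'movie':   ['M1 (2015)', 'M2 (2010)', 'M3 (2016)', 'M4 (2017)'],
    'date':    [2015, 2010, 2016, 2017],   # integer years, as in the lesson data
    'History': [1, 1, 0, 0],
    'War':     [0, 0, 0, 1],
})

def filter_ranked(ranked, n_top, years=None, genres=None):
    """Hypothetical helper: apply optional filters, keeping the ranking order."""
    result = ranked
    if years is not None:
        # Compare as strings so ['2015'] and [2015] behave the same way
        wanted = [str(y) for y in years]
        result = result[result['date'].astype(str).isin(wanted)]
    if genres is not None:
        # A movie matches if at least one of the requested genre flags is set
        result = result[result[genres].sum(axis=1) > 0]
    return result['movie'].head(n_top).tolist()

print(filter_ranked(ranked, 10))
print(filter_ranked(ranked, 10, years=['2015', '2016', '2017']))
print(filter_ranked(ranked, 10, years=[2015, 2016], genres=['History']))
print(filter_ranked(ranked, 10, genres=['History', 'War']))
```

Because the filters only remove rows, the surviving titles stay in their original best-to-worst order.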

> Check with both lists passed

In [557]:
popular_filter('1', 20, ['2015', '2016', '2017', '2018'], ['History'])==t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])

True

> Check with no lists passed

In [558]:
popular_filter('1', 20)==recs_20_for_1

True

> Check with only one list passed

In [559]:
popular_filter('1', 20,['2015', '2016', '2017', '2018'])==t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'])

True

In [560]:
t.popular_recs_filtered('1', 20, ranked_movies, genres=['History'])==popular_filter('1', 20,genreslist=['History'])

True