<a href="https://colab.research.google.com/github/zhukuixi/Udacity_DataScientistNanoDegree/blob/main/Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!git clone https://github.com/zhukuixi/Udacity_DataScientistNanoDegree

Cloning into 'Udacity_DataScientistNanoDegree'...
remote: Enumerating objects: 277, done.
remote: Counting objects: 100% (100/100), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 277 (delta 51), reused 69 (delta 29), pack-reused 177
Receiving objects: 100% (277/277), 135.37 MiB | 19.52 MiB/s, done.
Resolving deltas: 100% (107/107), done.


# 1. Data Preparation

In [82]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


%matplotlib inline

# Read in the datasets
movies = pd.read_csv('/content/Udacity_DataScientistNanoDegree/MovieTweet/data/original_movies.dat',
                     delimiter='::',
                     header=None,
                     names=['movie_id', 'movie', 'genre'],
                     dtype={'movie_id': object}, engine='python')

reviews = pd.read_csv('/content/Udacity_DataScientistNanoDegree/MovieTweet/data/original_ratings.dat',
                      delimiter='::',
                      header=None,
                      names=['user_id', 'movie_id', 'rating', 'timestamp'],
                      dtype={'movie_id': object, 'user_id': object, 'timestamp': object},
                      engine='python')

# Reduce the size of the reviews dataset
# (.loc slices are label-inclusive, so this keeps rows 0 through 100000)
reviews = reviews.loc[:100000,:]

## 1.1 Data Exploration


In [147]:
# Check the data
print(movies.head())
print("\n")
print(reviews.head())
print("\n")

# Check the fraction of missing values in each column
print(movies.isna().mean())
print("\n")
print(reviews.isna().mean())

  movie_id                                              movie  \
0  0000008      Edison Kinetoscopic Record of a Sneeze (1894)   
1  0000010                La sortie des usines Lumière (1895)   
2  0000012                      The Arrival of a Train (1896)   
3       25  The Oxford and Cambridge University Boat Race ...   
4  0000091                         Le manoir du diable (1896)   

               genre  date  News  Romance  Horror  Crime  Adventure  \
0  Documentary|Short  1894     0        0       0      0          0   
1  Documentary|Short  1895     0        0       0      0          0   
2  Documentary|Short  1896     0        0       0      0          0   
3                NaN  1895     0        0       0      0          0   
4       Short|Horror  1896     0        0       1      0          0   

   Documentary  ...  History  Fantasy  Film-Noir  Music  Comedy  Game-Show  \
0            1  ...        0        0          0      0       0          0   
1            1  ...       

In [74]:
dict_sol1 = {
'The number of movies in the dataset': movies.shape[0],
'The number of ratings in the dataset': reviews.shape[0],
'The number of different genres': movies['genre'].str.split("|").to_frame('genre').explode('genre')['genre'].nunique(),
'The number of unique users in the dataset': reviews['user_id'].nunique(),
'The number of missing ratings in the reviews dataset': sum(pd.isna(reviews['rating'])),
'The average rating given across all ratings': reviews['rating'].mean(),
'The minimum rating given across all ratings': reviews['rating'].min(),
'The maximum rating given across all ratings': reviews['rating'].max()
}
dict_sol1

{'The number of movies in the dataset': 35479,
 'The number of ratings in the dataset': 100001,
 'The number of different genres': 28,
 'The number of unique users in the dataset': 8022,
 'The number of missing ratings in the reviews dataset': 0,
 'The average rating given across all ratings': 7.397666023339767,
 'The minimum rating given across all ratings': 0,
 'The maximum rating given across all ratings': 10}
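
The genre count in the summary comes from splitting the pipe-separated genre strings and counting distinct values. A minimal sketch of that step on a toy series (the values are illustrative, and `toy_genres` is a hypothetical name) shows how `explode` flattens the split lists and how missing genres are ignored:

```python
import pandas as pd

# Toy genre column with one missing entry (illustrative values)
toy_genres = pd.Series(['Documentary|Short', None, 'Short|Horror'])

# Split each string into a list, then put every genre on its own row
exploded = toy_genres.str.split('|').explode()

# value_counts and nunique both skip the missing entry
genre_counts = exploded.value_counts()
n_genres = exploded.nunique()
print(genre_counts)
print(n_genres)
```

The same `value_counts` call on the real column would also show how many movies carry each genre, which the single `nunique` figure hides.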

## 1.2 Data Cleaning

We need to pull some additional information out of the existing columns.  

Each dataset needs a few cleaning steps:  

Movies  
- Pull the release year from the title into a new date column  
- Create dummy columns of 1s and 0s for the century of each movie (1800's, 1900's, and 2000's)  
- Create dummy columns of 1s and 0s for each genre  

Reviews  
- Create a date from the timestamp  
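
The movie cleaning steps listed above can also be sketched with vectorized pandas string methods. This is an illustrative alternative on a toy dataframe, not the notebook's own implementation: `toy_movies` and its rows are made up, and the regular expression assumes every title ends with a four-digit year in parentheses.

```python
import pandas as pd

# Toy stand-in for the movies dataframe (rows are illustrative)
toy_movies = pd.DataFrame({
    'movie': ['Le manoir du diable (1896)', 'City Lights (1931)', 'Drishyam (2015)'],
    'genre': ['Short|Horror', 'Comedy|Drama|Romance', None],
})

# Pull the four-digit year out of the trailing "(YYYY)" in the title
toy_movies['date'] = toy_movies['movie'].str.extract(r'\((\d{4})\)\s*$', expand=False)

# One 0/1 column per century, e.g. "1800's"
century_dummies = pd.get_dummies(toy_movies['date'].str[:2] + "00's").astype(int)

# One 0/1 column per genre; a missing genre becomes an all-zero row
genre_dummies = toy_movies['genre'].fillna('').str.get_dummies(sep='|')

toy_clean = pd.concat([toy_movies, century_dummies, genre_dummies], axis=1)
print(toy_clean)
```

`str.get_dummies` splits on the separator before matching, so each genre is matched as a whole token rather than as a substring.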



In [148]:
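# The title ends with "(YYYY)", so the four characters before the closing parenthesis are the year
movies['date'] = movies['movie'].str[-5:-1]
# One dummy column per century, named from the first two digits of the year (e.g. "1800's")
dummy_time = pd.get_dummies(movies['date'].str[:2]+"00's")
movies_new = pd.concat([movies,dummy_time],axis=1)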

In [149]:
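# Collect every distinct genre that appears in the dataset
total_genres = set()
for gen in movies_new['genre'].dropna().str.split("|"):
  for g in gen:
    total_genres.add(g)

def getCategory(x,g):
  # Return 1 if genre g is one of the pipe-separated genres in x, else 0
  if pd.isna(x):
    return 0
  # Compare against the split list: a plain substring test would match 'Music' inside 'Musical'
  return 1 if g in x.split("|") else 0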

for g in total_genres:
  movies_new[g] = movies_new['genre'].map(lambda x:getCategory(x,g))




In [150]:
movies_new.head()

Unnamed: 0,movie_id,movie,genre,date,News,Romance,Horror,Crime,Adventure,Documentary,...,Music,Comedy,Game-Show,Talk-Show,Biography,Family,Sport,1800's,1900's,2000's
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [151]:
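# Convert Unix timestamps (seconds since the epoch) to datetimes.
# pd.to_datetime interprets them as UTC, so the result does not depend on the machine's time zone.
reviews['date'] = pd.to_datetime(reviews['timestamp'].astype('int64'), unit='s')
reviews_new = reviews  # an alias of reviews, not a copy
reviews_new.head()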

Unnamed: 0,user_id,movie_id,rating,timestamp,date
0,1,114508,8,1381006850,2013-10-05 21:00:50
1,2,208092,5,1586466072,2020-04-09 21:01:12
2,2,358273,9,1579057827,2020-01-15 03:10:27
3,2,10039344,5,1578603053,2020-01-09 20:50:53
4,2,6751668,9,1578955697,2020-01-13 22:48:17


# 2. Knowledge Based Recommendation (Most Popular Recommendation)

## 2.1 Part I: How To Find The Most Popular Movies?    
This part has a single task: no matter the user, we need to provide a list of recommendations based simply on the most popular items.    

For this task, "most popular" is defined by the following criteria:  

- A movie with the highest **average rating** is considered best
- When average ratings are tied, movies that have **more ratings** rank higher
- A movie must have a **minimum of 5 ratings** to be considered among the best movies
- If movies are tied in both average rating and number of ratings, the movie with the **most recent rating** ranks higher
  
With these criteria, the goal for this notebook is to take a user_id and return the n_top recommendations. Use the function below as the scaffolding for all future recommendations as well.
  
Before you implement the popular_recommendations function, we provide a helper function called create_ranked_df. It transforms the movies and reviews dataframes into a ranked_movies dataframe that is sorted by highest average rating, then number of ratings, then most recent rating, and that keeps only movies with more than 4 ratings.
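
The ranking criteria above can be sketched on toy data with a single named aggregation. This is a hedged illustration rather than the helper used in this notebook: `toy_reviews` and its values are made up so that movie `c` has the highest average but too few ratings, and `a` and `b` tie on average.

```python
import pandas as pd

# Toy ratings; movie ids, ratings, and dates are made up for illustration
toy_reviews = pd.DataFrame({
    'movie_id': ['a'] * 5 + ['b'] * 6 + ['c'] * 2,
    'rating': [9] * 5 + [9] * 6 + [10] * 2,
    'date': pd.to_datetime(['2020-01-01'] * 13),
})

# Named aggregation computes all three ranking keys in one pass
stats = toy_reviews.groupby('movie_id').agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'count'),
    last_rating=('date', 'max'),
)

# Keep movies with at least 5 ratings, then sort by the three keys in priority order
ranked = (stats[stats['num_ratings'] >= 5]
          .sort_values(['avg_rating', 'num_ratings', 'last_rating'], ascending=False))
print(ranked.index.tolist())
```

Movie `c` is dropped by the minimum-ratings rule despite its perfect average, and `b` outranks `a` because it has more ratings at the same average.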

In [152]:
# This helper function transforms the `movies` and `reviews` dataframes
# into a `ranked_movies` dataframe sorted by highest average rating,
# then number of ratings, then most recent rating, keeping only movies
# with more than 4 ratings.

def create_ranked_df(movies, reviews):
    '''
    INPUT
    movies - the movies dataframe
    reviews - the reviews dataframe (must already contain the 'date' column)

    OUTPUT
    ranked_movies - a dataframe of movies sorted by highest avg rating, then more ratings,
                    then most recent rating, restricted to movies with more than 4 ratings
    '''

    # Pull the average rating and number of ratings for each movie
    movie_ratings = reviews.groupby('movie_id')['rating']
    avg_ratings = movie_ratings.mean()
    num_ratings = movie_ratings.count()

    # Date of the most recent rating for each movie
    last_rating = reviews.groupby('movie_id')['date'].max().to_frame('last_rating')

    # Combine the three statistics into one dataframe indexed by movie_id
    rating_count_df = pd.DataFrame({'avg_rating': avg_ratings, 'num_ratings': num_ratings})
    rating_count_df = rating_count_df.join(last_rating)

    # Merge with the movies dataset
    movie_recs = movies.set_index('movie_id').join(rating_count_df)

    # Sort by avg rating, then number of ratings, then most recent rating
    ranked_movies = movie_recs.sort_values(['avg_rating', 'num_ratings', 'last_rating'], ascending=False)

    # Keep only movies with 5 or more ratings (this also drops movies with no ratings)
    ranked_movies = ranked_movies[ranked_movies['num_ratings'] > 4]

    return ranked_movies

ranked_movies = create_ranked_df(movies_new, reviews_new)

In [153]:
def popular_recommendations(user_id, n_top, ranked_movies):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
              (unused here: the popularity ranking is the same for every user)
    n_top - an integer of the number of recommendations you want back
    ranked_movies - the ranked dataframe returned by create_ranked_df

    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''

    # The dataframe is already ranked, so take the first n_top titles
    top_movies = ranked_movies['movie'][:n_top].to_list()
    return top_movies

In [154]:
# Top 20 movies recommended for id 1
recs_20_for_1 = popular_recommendations(1,20,ranked_movies)
recs_20_for_1

['Be Somebody (2016)',
 'Doctor Zhivago (1965)',
 'Taare Zameen Par (2007)',
 'Coldplay: A Head Full of Dreams (2018)',
 'City Lights (1931)',
 'Nema-ye Nazdik (1990)',
 'The Lord of the Rings: The Return of the King (2003)',
 'Tarzan (1999)',
 'Mimi wo sumaseba (1995)',
 'Drishyam (2015)',
 '12 Angry Men (1957)',
 'The Shawshank Redemption (1994)',
 'La meglio gioventù (2003)',
 "It's a Wonderful Life (1946)",
 'The Lord of the Rings: The Two Towers (2002)',
 'The Sound of Music (1965)',
 'Hotaru no haka (1988)',
 'Terminator 2: Judgment Day (1991)',
 'Hiroshima mon amour (1959)',
 'Aladdin (1992)']

## 2.2 Part II: Adding Filters    
Now that you have created a function that returns the n_top movies, let's make it a bit more robust. Add arguments that act as filters for the movie year and genre.  

Use the cells below to adjust your existing function to accept year and genre arguments as lists of strings. The results are then restricted to movies that match any of the provided years and any of the provided genres (an OR condition within each list). If a list is not provided, that filter is not applied.
  
You can adjust other inputs as needed to retrieve the final results you are looking for!  
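
The filter semantics described above (OR within a list, AND across the two lists, no filtering when an argument is omitted) can be checked on a toy table before touching the real function. This is a sketch under assumptions: `toy_ranked` is made-up data and `build_mask` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy ranked table; titles and genre flags are made up for illustration
toy_ranked = pd.DataFrame({
    'movie': ['A (2015)', 'B (2016)', 'C (1990)', 'D (2015)'],
    'date': ['2015', '2016', '1990', '2015'],
    'History': [1, 0, 1, 0],
    'News': [0, 0, 0, 1],
})

def build_mask(df, years=None, genres=None):
    """Hypothetical helper: True for rows that pass every supplied filter."""
    mask = np.ones(len(df), dtype=bool)  # with no filters, every row is kept
    if years is not None:
        mask &= df['date'].isin(years).to_numpy()        # any of the listed years
    if genres is not None:
        mask &= (df[genres].sum(axis=1) > 0).to_numpy()  # any of the listed genres
    return mask

both = toy_ranked.loc[build_mask(toy_ranked, years=['2015'], genres=['History', 'News']), 'movie'].tolist()
unfiltered = toy_ranked.loc[build_mask(toy_ranked), 'movie'].tolist()
print(both)
print(unfiltered)
```

Starting from an all-True mask means the no-filter case needs no special handling.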

In [155]:
def popular_recs_filtered(user_id, n_top, ranked_movies, years=None, genres=None):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
              (unused here: the popularity ranking is the same for every user)
    n_top - an integer of the number of recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time
    years - a list of strings with years of movies
    genres - a list of strings with genres of movies

    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''

    # Step 1: filter movies based on year and genre
    # Start with a mask that keeps every row, so omitting both filters returns the unfiltered ranking
    row_filter = np.ones(len(ranked_movies), dtype=bool)
    if years is not None:
        # Keep movies released in any of the given years
        row_filter &= ranked_movies['date'].isin(years).to_numpy()
    if genres is not None:
        # Keep movies flagged with at least one of the given genres
        row_filter &= (ranked_movies[genres].sum(axis=1) > 0).to_numpy()

    # Step 2: create top movies list
    top_movies = ranked_movies.loc[row_filter, 'movie'][:n_top].to_list()

    return top_movies



In [158]:
# Top 20 movies recommended for id 1 with years=['2015', '2016', '2017', '2018'], genres=['History']
recs_20_for_1_filtered = popular_recs_filtered(1,20,ranked_movies,years=['2015', '2016', '2017', '2018'], genres=['History'])
# Top 100 movies recommended for id 70000 with no year filter but genres=['History', 'News']
recs_100_for_70000_filtered = popular_recs_filtered(70000,100,ranked_movies, genres=['History','News'])

print(recs_20_for_1_filtered)
print(recs_100_for_70000_filtered)

['Taeksi woonjunsa (2017)', 'Ayla: The Daughter of War (2017)', 'Hacksaw Ridge (2016)', 'They Shall Not Grow Old (2018)', 'Straight Outta Compton (2015)', 'Hidden Figures (2016)', '13th (2016)', 'Little Boy (2015)', 'Under sandet (2015)', 'Hotel Mumbai (2018)', 'Darkest Hour (2017)', 'Kono sekai no katasumi ni (2016)', 'Bridge of Spies (2015)', 'Woman in Gold (2015)', 'The Birth of a Nation (2016)', 'The Big Short (2015)', 'Dunkirk (2017)', 'Victoria &amp; Abdul (2017)', 'Anthropoid (2016)', 'Truth (2015)']
['Hotel Rwanda (2004)', "Schindler's List (1993)", 'Amadeus (1984)', 'Gone with the Wind (1939)', 'Lawrence of Arabia (1962)', 'Braveheart (1995)', 'Barry Lyndon (1975)', 'Gandhi (1982)', 'Taeksi woonjunsa (2017)', 'Before the Flood (2016)', 'Ayla: The Daughter of War (2017)', 'The Grapes of Wrath (1940)', 'Hacksaw Ridge (2016)', 'Il gattopardo (1963)', 'Persepolis (2007)', 'Portrait de la jeune fille en feu (2019)', 'Good Night, and Good Luck. (2005)', 'Missing (1982)', 'Changeling