## Data science movie recommendation challenge 

- Create a function `get_recommendations` which will return the title of the most highly recommended movie for a given user.
- The function takes as arguments three dataframes (users, movies and ratings), the user's full name, the recommendation method and the release year of interest.
- All methods should return a movie title that has not yet been rated by the given user. If there is more than one movie that meets the condition, the function should return the first movie in alphabetical order. 

In [1]:
import pandas as pd
import numpy as np

### Read in data from csv files

- The files are pipe-delimited, so pass `sep="|"` to `pd.read_csv` when reading them (see [link](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data))

In [2]:
movies_df = pd.read_csv("/home/home02/earshar/data_science/main/data/movies.csv", sep="|")
ratings_df = pd.read_csv("/home/home02/earshar/data_science/main/data/ratings.csv", sep="|")
users_df = pd.read_csv("/home/home02/earshar/data_science/main/data/users.csv", sep="|")
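
A minimal sketch of why the separator matters, using an in-memory file instead of the real data. The sample rows and the names `sample_csv`, `wrong_df` and `right_df` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in a pipe-delimited layout (not the real challenge data)
sample_csv = "user id|full name|age\n1|Ada Example|34\n2|Bob Example|27\n"

# With the default sep="," each line is read as a single column
wrong_df = pd.read_csv(io.StringIO(sample_csv))

# With sep="|" each field gets its own column
right_df = pd.read_csv(io.StringIO(sample_csv), sep="|")

print(wrong_df.shape)
print(right_df.columns.tolist())
```

Checking `shape` or `columns` right after loading is a cheap way to confirm the file was parsed with the intended separator.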

### Get user ID for chosen user

In [3]:
def get_user_id(name: str, users_df: pd.DataFrame):
    """
    Get the user ID for a given user name.

    Arguments:

    -- name: full name of the user, as it appears in the 'full name' column
    -- users_df: pd.DataFrame containing user ID, name, age and gender
    """
    user_info = users_df[users_df['full name'] == name].reset_index(drop=True)
    # Take the first match; this assumes the name exists in users_df
    user_id = user_info.loc[0, 'user id']
    return user_id

In [4]:
# user_id = get_user_id(users_df['full name'].iloc[500], users_df)
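
The lookup can be tried on a toy table. The sketch below is a hypothetical variant, `lookup_user_id`, that also raises a clear error when the name is missing. The table contents are invented and only the `user id` and `full name` column names come from the notebook.

```python
import pandas as pd

# Hypothetical users table with the two columns the lookup relies on
toy_users = pd.DataFrame({
    "user id": [10, 11, 12],
    "full name": ["Ada Example", "Bob Example", "Cy Example"],
})

def lookup_user_id(name: str, users: pd.DataFrame) -> int:
    """Return the 'user id' of the first row whose 'full name' equals name."""
    matches = users.loc[users["full name"] == name, "user id"]
    if matches.empty:
        raise ValueError(f"{name} not found in users table")
    return int(matches.iloc[0])

print(lookup_user_id("Bob Example", toy_users))
```

Selecting the row and the column in one `.loc` call avoids chained indexing, and the explicit check turns an opaque `KeyError` into a readable message.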

### Ignore movies already rated by user

- Use the `~` operator with `isin` to keep only rows whose value is not in a given list (see [link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html))
- Filter dataframe by list as discussed [here](https://www.statology.org/pandas-filter-in-list/)
- Used [this link](https://sparkbyexamples.com/pandas/pandas-get-cell-value-from-dataframe/?expand_article=1) to get the values of `item id` for all rows rated by the user
- Apply the filter to `movies_not_rated_df` (the ratings from all other users), which is defined inside the function
- Sort by `item id` and reset index values

In [5]:
def ignore_movies_rated_by_user(user_id: int, ratings_df: pd.DataFrame):
    """
    Use the ratings DataFrame to collect all ratings of movies not rated by the user.
    Exclude all rows for which the value of 'item id' matches any movie rated by the user.

    Arguments:

    -- user_id: integer representing user ID
    -- ratings_df: pd.DataFrame containing movie ratings from all users
    """
    # Ratings made by everyone except the chosen user
    movies_not_rated_df = ratings_df[ratings_df['user id'] != user_id].reset_index(drop=True)
    # IDs of the movies the chosen user has already rated
    movies_rated_df = ratings_df[ratings_df['user id'] == user_id].reset_index(drop=True)
    movie_ids = movies_rated_df['item id']

    # Drop every rating of a movie the chosen user has already rated
    movie_ids_not_rated = movies_not_rated_df[~movies_not_rated_df['item id'].isin(movie_ids)]
    movie_ids_not_rated = movie_ids_not_rated.sort_values(by=['item id']).reset_index(drop=True)
    return movie_ids_not_rated

In [6]:
# movie_ids_not_rated = ignore_movies_rated_by_user(user_id, ratings_df)
# movie_ids_not_rated
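
The `~` plus `isin` filter can be checked on a toy ratings table. This is a small sketch with invented data; `toy_ratings`, `rated_by_user` and `unseen` are illustrative names.

```python
import pandas as pd

# Hypothetical ratings: user 1 has rated items 100 and 101
toy_ratings = pd.DataFrame({
    "user id": [1, 1, 2, 2, 3],
    "item id": [100, 101, 100, 102, 103],
    "rating":  [5, 3, 4, 2, 5],
})

user_id = 1

# Item IDs the chosen user has already rated
rated_by_user = toy_ratings.loc[toy_ratings["user id"] == user_id, "item id"]

# Keep only ratings of items the user has not rated
unseen = toy_ratings[~toy_ratings["item id"].isin(rated_by_user)]

print(unseen["item id"].tolist())
```

In this sketch a single mask is enough: the user's own rows all refer to items in `rated_by_user`, so they are removed by the same test.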

### Link the movie ID dataframe (for all users except the given user) to the movie title and year

- Create the `row index` column: the row of the second df (movies) that corresponds to each `item id` in the first df (ratings)
- Get the corresponding `movie title` and `release year` from second df
- Add these columns to first df so all information is easily accessible 

In [7]:
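def calculate_row_index(movie_id_df: pd.DataFrame):
    """
    Get the row index in the movies dataframe corresponding to each movie ID.
    The lookup assumes movie IDs start at 1 and rows are ordered by ID, so the
    row index is normally 'item id' - 1. There is no movie with ID 267 in the data,
    so every later movie sits one row earlier.
    Note: the 'row index' column is added to movie_id_df in place.
    """

    movie_id_df['row index'] = movie_id_df['item id'] - 1
    # Shift the rows after the missing ID 267 back by one
    movie_id_df['row index'] = np.where(movie_id_df['row index'] >= 267,
                                        movie_id_df['row index'] - 1,
                                        movie_id_df['row index'])

    return movie_id_df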


In [8]:
def collate_all_movie_info(movie_id_df: pd.DataFrame, movie_title_df: pd.DataFrame):
    """
    Get the movie title and release year for all movies not rated by the chosen user. 
    Put this information into the existing 'ratings' dataframe alongside movie ID ('item id'). 

    User arguments: 

    -- movie_id_df: pd.DataFrame containing all ratings and movie IDs not rated by the user 
    -- movie_title_df: pd.DataFrame containing movie title and release year for all movies
    """

    movie_id_df = calculate_row_index(movie_id_df)
    movie_title = movie_title_df['movie title'].iloc[movie_id_df['row index']].reset_index(drop=True)
    release_year = movie_title_df['release year'].iloc[movie_id_df['row index']].reset_index(drop=True)
    movie_id_df['movie title'] = movie_title
    movie_id_df['release year'] = release_year
    return movie_id_df

In [9]:
# movie_ids_and_titles = collate_all_movie_info(movie_ids_not_rated, movies_df)
# movie_ids_and_titles
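
A possible alternative to the row-index arithmetic is a key-based join, which does not depend on row order or on gaps in the IDs. The sketch below uses invented data and assumes the movies table has a `movie id` column, as listed in the docstring of `get_recommendations`; check the actual column name before using it.

```python
import pandas as pd

# Hypothetical ratings of unseen movies
toy_ratings = pd.DataFrame({"item id": [1, 3, 3], "rating": [4, 5, 2]})

# Hypothetical movies table; 'movie id' is an assumed column name,
# and ID 2 is deliberately missing to mimic the gap at 267
toy_movies = pd.DataFrame({
    "movie id": [1, 3],
    "movie title": ["Alpha (1990)", "Gamma (1991)"],
    "release year": [1990, 1991],
})

# A left join keeps every rating and attaches title and year by ID
merged = toy_ratings.merge(
    toy_movies, left_on="item id", right_on="movie id", how="left"
)

print(merged[["item id", "rating", "movie title", "release year"]])
```

With `how="left"`, a rating whose ID has no match in the movies table would get missing values for title and year rather than silently picking up the wrong movie.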

In [10]:
def select_single_year(year: int, movie_id_df: pd.DataFrame):
    """ 
    Select a subset of movies from a single year
    """

    movie_id_df = movie_id_df[movie_id_df['release year'] == year].reset_index(drop=True)

    return movie_id_df

### Different methods for choosing most highly rated movie

- By popularity: count the number of times each movie has been rated (see [this article](https://www.geeksforgeeks.org/how-to-extract-the-value-names-and-counts-from-value_counts-in-pandas/))
- By rating: calculate the mean rating of each movie using the groupby function (see [this article](https://stackoverflow.com/questions/30482071/how-to-calculate-mean-values-grouped-on-another-column))

In [11]:
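def choose_movie_by_popularity(movie_df: pd.DataFrame):
    """
    Find the movie rated by the most users.
    Ties are broken by returning the first title in alphabetical order.
    """
    # Number of ratings per title
    rating_counts = movie_df['movie title'].value_counts()
    # Keep every title that shares the highest count, then take the alphabetical first
    most_rated = rating_counts[rating_counts == rating_counts.max()]
    movie_title = most_rated.index.min()
    return movie_title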

In [12]:
def choose_movie_by_rating(movie_df: pd.DataFrame):
    """
    Find the movie with the highest mean rating.
    Ties are broken by returning the first title in alphabetical order.
    """
    mean_movie_rating = movie_df.groupby('movie title', as_index=False)['rating'].mean()
    # Highest mean rating first; alphabetical order among equal means
    mean_movie_rating = mean_movie_rating.sort_values(by=['rating', 'movie title'],
                                                      ascending=[False, True])
    movie_title = mean_movie_rating['movie title'].iloc[0]
    return movie_title
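
The tie-break rule from the challenge statement (first title in alphabetical order) can be illustrated on a toy table. This is a sketch with invented titles and ratings; `toy`, `most_popular` and `top_rated` are illustrative names.

```python
import pandas as pd

# Hypothetical ratings: Zeta and Alpha are each rated twice, Beta once,
# and all three titles have a mean rating of 4.0
toy = pd.DataFrame({
    "movie title": ["Zeta", "Zeta", "Alpha", "Alpha", "Beta"],
    "rating": [5, 3, 4, 4, 4],
})

# Popularity: count ratings per title, keep all titles at the maximum count
counts = toy["movie title"].value_counts()
most_popular = counts[counts == counts.max()].index.min()

# Rating: mean per title, keep all titles at the maximum mean
means = toy.groupby("movie title")["rating"].mean()
top_rated = means[means == means.max()].index.min()

print(most_popular, top_rated)
```

Filtering to the maximum first and then taking the smallest title makes the tie-break explicit and keeps it independent of how pandas orders equal values.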

### Define main function `get_recommendations`

- Use of `unique` found [here](https://sparkbyexamples.com/pandas/pandas-check-column-contains-a-value-in-dataframe/)

In [13]:
def get_recommendations(users_df: pd.DataFrame, 
                        movies_df: pd.DataFrame, 
                        ratings_df: pd.DataFrame, 
                        full_name: str, 
                        method: str, 
                        year: int):
    """
    Return the title of the most highly recommended movie for the given user.
    All methods should return a movie title that has not yet been rated by the given user.
    If there is more than one movie that meets the condition, the function should return the first
    movie in alphabetical order.

    users_df:   information about users with the columns (user id, full name, age, gender, zip code)
    movies_df:  information about movies with the columns (movie id, movie title and release year)
    ratings_df: information about movie ratings by users with the columns (user id, item id, rating, timestamp)
    full_name:  full name of the user for whom we want to return one recommended movie
    method:     method for recommending movie ('by_popularity' or 'by_rating';
                'by_similar_users' is not implemented yet)
    year:       release year that the recommended movie must match
    """

    if full_name not in users_df['full name'].unique():
        raise ValueError(f'{full_name} does not exist in the database, please check your input carefully!')

    if year not in movies_df['release year'].unique():
        raise ValueError(f'No movies released in {year} in the database, please choose again!')
    
    user_id = get_user_id(full_name, users_df)
    movie_ids_not_rated = ignore_movies_rated_by_user(user_id, ratings_df)
    movie_ids_not_rated = collate_all_movie_info(movie_ids_not_rated, movies_df)
    movie_ids_single_year = select_single_year(year, movie_ids_not_rated)

    if method == "by_popularity":
        movie_title = choose_movie_by_popularity(movie_ids_single_year)
    elif method == "by_rating":
        movie_title = choose_movie_by_rating(movie_ids_single_year)
    else:
        raise ValueError('Chosen method does not exist; please either choose "by_popularity" or "by_rating"!')

    return movie_title


### Call the main function `get_recommendations`

In [14]:
full_name = users_df['full name'].iloc[0]
year = 1991
movie_title = get_recommendations(users_df, movies_df, ratings_df, full_name, "by_popularity", year)
movie_title, full_name

('Beauty and the Beast (1991)', 'Ryan James')

### What are the next steps? 

- Add tests for edge cases 
- Work out how to incorporate the `by_similar_users` method for choosing a movie
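
As a starting point for `by_similar_users`, here is one possible interpretation, not a definition taken from the challenge: rank other users by the cosine similarity of their rating vectors, then look at what the closest users rated that the target user has not. The data, the helper `most_similar_users` and the parameter `k` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: user 2 agrees with user 1 on items 10 and 11
toy_ratings = pd.DataFrame({
    "user id": [1, 1, 2, 2, 2, 3, 3],
    "item id": [10, 11, 10, 11, 12, 10, 13],
    "rating":  [5, 4, 5, 4, 5, 1, 5],
})

def most_similar_users(ratings: pd.DataFrame, user_id: int, k: int = 1) -> list:
    """Rank other users by cosine similarity of rating vectors (unrated = 0)."""
    # Rows are users, columns are items, missing ratings are filled with 0
    matrix = ratings.pivot_table(index="user id", columns="item id",
                                 values="rating", fill_value=0)
    target = matrix.loc[user_id].to_numpy(dtype=float)
    others = matrix.drop(index=user_id)
    other_values = others.to_numpy(dtype=float)
    norms = np.linalg.norm(other_values, axis=1) * np.linalg.norm(target)
    similarity = other_values @ target / norms
    ranked = pd.Series(similarity, index=others.index).sort_values(ascending=False)
    return ranked.index[:k].tolist()

neighbours = most_similar_users(toy_ratings, user_id=1, k=1)

# Ratings by the nearest users for items the target user has not rated
seen = toy_ratings.loc[toy_ratings["user id"] == 1, "item id"].tolist()
candidates = toy_ratings[toy_ratings["user id"].isin(neighbours)
                         & ~toy_ratings["item id"].isin(seen)]

print(neighbours, candidates["item id"].tolist())
```

The candidate ratings could then be passed through the same year filter and the same selection logic as the other methods. Treating unrated movies as 0 is a simplification; other similarity measures may suit the data better.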