## Data science movie recommendation challenge 

- Create a function `get_recommendations` which will return the title of the most highly recommended movie for a given user.
- The function takes as arguments three dataframes, a name for the user, the year of interest and the recommendation method.
- All methods should return a movie title that has not yet been rated by the given user. If there is more than one movie that meets the condition, the function should return the first movie in alphabetical order. 

In [22]:
import pandas as pd
import numpy as np

### Read in data from csv files

- The CSV files are pipe-delimited, so pass `sep="|"` to `pd.read_csv` (see [link](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data))

In [23]:
movies_df = pd.read_csv("/home/home02/earshar/data_science/main/data/movies.csv", sep="|")
ratings_df = pd.read_csv("/home/home02/earshar/data_science/main/data/ratings.csv", sep="|")
users_df = pd.read_csv("/home/home02/earshar/data_science/main/data/users.csv", sep="|")
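
As a small illustration of the `sep` argument, the sketch below parses pipe-delimited text from an in-memory string rather than from the files above. The sample rows and the names `sample_csv` and `sample_users_df` are made up for this example and are not taken from the real data.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited content standing in for a file such as users.csv
sample_csv = "user id|full name|age\n1|Ada Example|30\n2|Bob Example|41\n"

# sep="|" tells pandas to split fields on the pipe character instead of commas
sample_users_df = pd.read_csv(io.StringIO(sample_csv), sep="|")

print(sample_users_df.columns.tolist())
```

Without `sep="|"`, each line would be read as a single comma-separated field, so the whole header would end up as one column name.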

In [25]:
# users_df

### Get user ID for chosen user

In [26]:
def get_user_id(name: str, users_df: pd.DataFrame):
    """ 
    Get the user ID for a given full name.

    User arguments: 

    -- name: full name of the user, matching an entry in the 'full name' column of users_df
    -- users_df: pd.DataFrame containing user ID, name, age and gender
    """
    # Keep only the rows for this user and take the ID from the first match
    user_info = users_df[users_df['full name'] == name].reset_index(drop=True)
    user_id = user_info.loc[0, 'user id']
    return user_id

user_id = get_user_id(users_df['full name'].iloc[500], users_df)
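
`get_user_id` takes the first matching row, so an unknown name surfaces as a `KeyError` and a duplicated name is silently resolved to the first match. A minimal sketch of a stricter lookup follows; the function name `get_user_id_checked` and the toy dataframe are hypothetical, and only the column names `full name` and `user id` come from the notebook.

```python
import pandas as pd


def get_user_id_checked(name: str, users_df: pd.DataFrame) -> int:
    """Hypothetical variant of get_user_id that fails loudly on bad input."""
    # Select the 'user id' values of every row whose full name matches
    matches = users_df.loc[users_df["full name"] == name, "user id"]
    if matches.empty:
        raise ValueError(f"No user named {name!r}")
    if len(matches) > 1:
        raise ValueError(f"{len(matches)} users share the name {name!r}")
    return int(matches.iloc[0])


# Made-up users for illustration; "Bob Example" appears twice on purpose
toy_users_df = pd.DataFrame({
    "user id": [10, 11, 12],
    "full name": ["Ada Example", "Bob Example", "Bob Example"],
})

ada_id = get_user_id_checked("Ada Example", toy_users_df)
print(ada_id)  # 10
```

Whether duplicate names should raise or be resolved some other way depends on the data, so treat the second check as one possible choice.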

In [28]:
# ratings_df

### Ignore movies already rated by user

- Use the `~` operator with `isin` to keep rows whose value is not in a given list (see [link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html))
- Filter the dataframe by a list as discussed [here](https://www.statology.org/pandas-filter-in-list/)
- See [this link](https://sparkbyexamples.com/pandas/pandas-get-cell-value-from-dataframe/?expand_article=1) for getting the values of `item id` for all rows
- Operate on the `movies_not_rated_df` dataframe defined inside the function (all ratings submitted by other users)
- Sort by `item id` and reset the index values

In [29]:
def ignore_movies_rated_by_user(user_id: int, ratings_df: pd.DataFrame):
    """
    Use the ratings DataFrame to collect the ratings (from other users) of all movies not rated by the user. 
    Exclude all rows for which the value of 'item id' matches any movie rated by the user. 

    User arguments: 

    -- user_id: integer representing user ID 
    -- ratings_df: pd.DataFrame containing movie ratings from all users 
    """
    # Ratings submitted by everyone except the chosen user
    movies_not_rated_df = ratings_df[ratings_df['user id'] != user_id].reset_index(drop=True)
    # IDs of the movies the chosen user has already rated
    movies_rated_df = ratings_df[ratings_df['user id'] == user_id].reset_index(drop=True)
    movie_ids = movies_rated_df['item id']

    # Drop every rating of a movie the user has already rated, then order by movie ID
    movie_ids_not_rated = movies_not_rated_df[~movies_not_rated_df['item id'].isin(movie_ids)]
    movie_ids_not_rated = movie_ids_not_rated.sort_values(by=['item id']).reset_index(drop=True)
    return movie_ids_not_rated
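
The `~` / `isin` pattern used in the function can be checked on a tiny example. The dataframe below is invented for illustration; only the column names match the notebook.

```python
import pandas as pd

# Made-up ratings: user 1 has rated items 10 and 20
toy_ratings_df = pd.DataFrame({
    "user id": [1, 1, 2, 2, 3],
    "item id": [10, 20, 10, 30, 40],
    "rating": [5, 3, 4, 2, 5],
})

target_user = 1

# Item IDs the target user has already rated
rated_ids = toy_ratings_df.loc[toy_ratings_df["user id"] == target_user, "item id"]

# isin() marks rows whose item was rated by the target user; ~ inverts that mask
unrated_df = toy_ratings_df[~toy_ratings_df["item id"].isin(rated_ids)]

print(unrated_df["item id"].tolist())  # [30, 40]
```

Note that user 2's rating of item 10 is dropped as well, because item 10 has been rated by the target user.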

In [30]:
movie_ids_not_rated = ignore_movies_rated_by_user(user_id, ratings_df)
movie_ids_not_rated

Unnamed: 0,user id,item id,rating,timestamp
0,1,1,5,874965758
1,336,1,3,877759342
2,303,1,5,879466966
3,886,1,4,876031433
4,49,1,2,888068651
...,...,...,...,...
86449,863,1678,1,889289570
86450,863,1679,3,889289491
86451,863,1680,2,889289570
86452,896,1681,3,887160722


### Link the movie ID dataframe (for all users except the given user) to the movie title and year

- Create the `row index` column: the row position in the second df (movies) corresponding to each `item id` in the first df (ratings)
- Look up the corresponding `movie title` and `release year` in the second df
- Add these columns to the first df so all information is easily accessible

In [31]:
def collate_all_movie_info(movie_id_df: pd.DataFrame, movie_title_df: pd.DataFrame):
    """
    Get the movie title and release year for all movies not rated by the chosen user. 
    Put this information into the existing 'ratings' dataframe alongside movie ID ('item id'). 

    User arguments: 

    -- movie_id_df: pd.DataFrame containing user ratings and movie IDs 
    -- movie_title_df: pd.DataFrame containing movie title and release year for each movie in the database
    """
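    # NOTE: this positional lookup assumes the movie with ID n sits at row n - 2 of movie_title_df.
    # That breaks for 'item id' 1, which gives row index -1; .iloc reads -1 as the last row.
    # Joining on the movie ID column instead would not depend on row order.
    movie_id_df['row index'] = movie_id_df['item id'] - 2
    movie_title = movie_title_df['movie title'].iloc[movie_id_df['row index']].reset_index(drop=True)
    release_year = movie_title_df['release year'].iloc[movie_id_df['row index']].reset_index(drop=True)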
    movie_id_df['movie title'] = movie_title
    movie_id_df['release year'] = release_year
    return movie_id_df

In [32]:
movie_ids_not_rated

Unnamed: 0,user id,item id,rating,timestamp
0,1,1,5,874965758
1,336,1,3,877759342
2,303,1,5,879466966
3,886,1,4,876031433
4,49,1,2,888068651
...,...,...,...,...
86449,863,1678,1,889289570
86450,863,1679,3,889289491
86451,863,1680,2,889289570
86452,896,1681,3,887160722


In [33]:
movie_ids_not_rated = collate_all_movie_info(movie_ids_not_rated, movies_df)
movie_ids_not_rated

Unnamed: 0,user id,item id,rating,timestamp,row index,movie title,release year
0,1,1,5,874965758,-1,Scream of Stone (Schrei aus Stein) (1991),1996
1,336,1,3,877759342,-1,Scream of Stone (Schrei aus Stein) (1991),1996
2,303,1,5,879466966,-1,Scream of Stone (Schrei aus Stein) (1991),1996
3,886,1,4,876031433,-1,Scream of Stone (Schrei aus Stein) (1991),1996
4,49,1,2,888068651,-1,Scream of Stone (Schrei aus Stein) (1991),1996
...,...,...,...,...,...,...,...
86449,863,1678,1,889289570,1676,Mat' i syn (1997),1998
86450,863,1679,3,889289491,1677,B. Monkey (1998),1998
86451,863,1680,2,889289570,1678,Sliding Doors (1998),1998
86452,896,1681,3,887160722,1679,You So Crazy (1994),1994
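
In the table above, `item id` 1 is mapped to `row index` -1, which `.iloc` interprets as the last row of the movies table, so every rating of item 1 is labelled with the last movie. One way to avoid depending on row position is to join the two dataframes on the ID column. The sketch below uses invented data; the function name `attach_movie_info` is hypothetical, and the column name `movie id` is taken from the docstring of `get_recommendations`, so it is an assumption about what `movies.csv` actually contains.

```python
import pandas as pd


def attach_movie_info(ratings_df: pd.DataFrame,
                      movies_df: pd.DataFrame,
                      movie_id_col: str = "movie id") -> pd.DataFrame:
    """Hypothetical helper: add title and release year by joining on the movie ID."""
    movie_cols = movies_df[[movie_id_col, "movie title", "release year"]]
    # A left join keeps every rating row, in its original order
    merged = ratings_df.merge(movie_cols, how="left",
                              left_on="item id", right_on=movie_id_col)
    # The right-hand key duplicates 'item id', so drop it
    return merged.drop(columns=[movie_id_col])


# Made-up movies; ID 2 is deliberately missing so that row positions and IDs diverge
toy_movies_df = pd.DataFrame({
    "movie id": [1, 3, 4],
    "movie title": ["Alpha (1995)", "Gamma (1998)", "Delta (1998)"],
    "release year": [1995, 1998, 1998],
})
toy_ratings_df = pd.DataFrame({
    "user id": [7, 8, 9],
    "item id": [1, 4, 3],
    "rating": [5, 2, 4],
})

labelled_df = attach_movie_info(toy_ratings_df, toy_movies_df)
print(labelled_df["movie title"].tolist())
# ['Alpha (1995)', 'Delta (1998)', 'Gamma (1998)']
```

Because the join matches on values rather than positions, a gap in the ID sequence does not shift the titles.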


### What are the next steps?

- Add more functions to calculate most highly recommended movie for given user 
- Check for bugs, as the script seems to output the same movies irrespective of the user
- Run entire script
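
As a starting point for the first item, the sketch below picks one title from a table of candidate ratings. The notebook does not define what `'by_rating'` means, so ranking by mean rating is an assumption here; the function name `top_movie_by_mean_rating` and the toy data are hypothetical. The alphabetical tie-break follows the challenge description.

```python
import pandas as pd


def top_movie_by_mean_rating(candidates_df: pd.DataFrame) -> str:
    """Hypothetical helper: title with the highest mean rating, ties broken alphabetically."""
    # Average the ratings of each movie across all users who rated it
    mean_ratings = candidates_df.groupby("movie title")["rating"].mean()
    best_score = mean_ratings.max()
    # All titles that share the best score
    tied_titles = mean_ratings[mean_ratings == best_score].index
    # min() on strings returns the first title in alphabetical order
    return min(tied_titles)


# Made-up candidate ratings: two movies tie on a mean rating of 5
toy_candidates_df = pd.DataFrame({
    "movie title": ["Zeta (1998)", "Zeta (1998)", "Beta (1998)", "Alpha (1998)"],
    "rating": [5, 5, 5, 3],
})

best_title = top_movie_by_mean_rating(toy_candidates_df)
print(best_title)  # Beta (1998)
```

A popularity-based method could follow the same shape with `.count()` in place of `.mean()`, again as an assumption about what that method should measure.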

In [57]:
def select_single_year(year: int, movie_ids_df: pd.DataFrame):
    """ 
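    Select the subset of movies released in a single year.

    User arguments: 

    -- year: release year to keep
    -- movie_ids_df: pd.DataFrame containing a 'release year' column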
    """

    movie_ids_df = movie_ids_df[movie_ids_df['release year'] == year].reset_index(drop=True)

    return movie_ids_df

### Define main function `get_recommendations`

- Use of `unique` found [here](https://sparkbyexamples.com/pandas/pandas-check-column-contains-a-value-in-dataframe/)

In [59]:
def get_recommendations(users_df: pd.DataFrame, 
                        movies_df: pd.DataFrame, 
                        ratings_df: pd.DataFrame, 
                        full_name: str, 
                        method: str, 
                        year: int):
    """
    Return the title of the most highly recommended movie for the given user. 
    All methods should return a movie title that has not yet been rated by the given user.
    If there is more than one movie that meets the condition, the function should return the first 
    movie in alphabetical order. 

    users_df:   information about users with the columns (user id, full name, age, gender, zip code)
    ratings_df: information about movie ratings by users with the columns (user id, item id, rating, timestamp)
    movies_df:  information about movies with the columns (movie id, movie title, release year)
    full_name:  full name of the user for whom we want to return one recommended movie
                whose release year is equal to year, using one of the three methods
    method:     method for recommending a movie ('by_popularity', 'by_rating' or 'by_similar_users')
    year:       movie release year
    """

    # Validate the inputs before doing any work
    if full_name not in users_df['full name'].unique():
        raise ValueError(f'{full_name} does not exist in the database, please check your input carefully!')
    
    if year not in movies_df['release year'].unique():
        raise ValueError(f'No movies released in {year} in the database, please choose again!')
    
    user_id = get_user_id(full_name, users_df)
    movie_ids_not_rated = ignore_movies_rated_by_user(user_id, ratings_df)
    movie_ids_not_rated = collate_all_movie_info(movie_ids_not_rated, movies_df)
    movie_ids_single_year = select_single_year(year, movie_ids_not_rated)
    # Work in progress: `method` is not used yet, so the filtered ratings
    # are returned here instead of a single movie title
    return movie_ids_single_year


### Call the main function `get_recommendations`

In [123]:
full_name = users_df['full name'].iloc[0]
print(full_name)
movies_single_year = get_recommendations(users_df, movies_df, ratings_df, full_name, "by_rating", 1998)

Ryan James


In [124]:
movies_single_year.sort_values(by=['rating'], ascending=False)

Unnamed: 0,user id,item id,rating,timestamp,row index,movie title,release year
0,592,315,5,885280156,313,Apt Pupil (1998),1998
985,883,1656,5,891692168,1654,Little City (1998),1998
967,466,1607,5,890284231,1605,Hurricane Streets (1998),1998
270,166,347,5,886397562,345,Wag the Dog (1997),1998
962,416,1594,5,893212484,1592,Everest (1998),1998
...,...,...,...,...,...,...,...
459,13,351,1,886302385,349,"Prophecy II, The (1998)",1998
674,181,885,1,878962006,883,Phantoms (1998),1998
673,451,885,1,879012890,883,Phantoms (1998),1998
833,818,1105,1,891883071,1103,Firestorm (1998),1998
