## Data science movie recommendation challenge 

- Create a function `get_recommendations` that returns the title of the most highly recommended movie for a given user.
- The function takes three dataframes (users, movies and ratings), the user's full name, the recommendation method and the release year of interest.
- Every method must return the title of a movie that the given user has not yet rated. If more than one movie meets the condition, the function should return the first one in alphabetical order.

In [1]:
import pandas as pd
import numpy as np

### Read in data from csv files

- The files are pipe-delimited, so pass `sep="|"` to `pd.read_csv` when reading them (see [link](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data))

In [2]:
movies_df = pd.read_csv("/home/home02/earshar/data_science/main/data/movies.csv", sep="|")
ratings_df = pd.read_csv("/home/home02/earshar/data_science/main/data/ratings.csv", sep="|")
users_df = pd.read_csv("/home/home02/earshar/data_science/main/data/users.csv", sep="|")
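
To illustrate why the separator matters, here is a minimal sketch on an in-memory sample. The two-row sample is made up for illustration and only mimics the pipe-delimited layout assumed for the real files.

```python
import io
import pandas as pd

# Hypothetical sample in the same pipe-delimited layout as the csv files
sample_text = "user id|full name|age\n1|Ryan James|24\n2|Alice Graves|53\n"

# With the default separator (a comma) each line is read as one single field
with_default_sep = pd.read_csv(io.StringIO(sample_text))

# With sep="|" the three columns are split as intended
with_pipe_sep = pd.read_csv(io.StringIO(sample_text), sep="|")

print(with_default_sep.shape)
print(with_pipe_sep.shape)
print(list(with_pipe_sep.columns))
```

Checking `shape` or `columns` straight after loading is a cheap way to confirm that the separator was chosen correctly.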

In [3]:
users_df

Unnamed: 0,user id,full name,age,gender,zip code
0,1,Ryan James,24,M,85711
1,2,Alice Graves,53,F,94043
2,3,Ambrose Smith,23,M,32067
3,4,Bobby Alvarez,24,M,43537
4,5,Latosha Jiles,33,F,15213
...,...,...,...,...,...
938,939,Melva Carrol,26,F,33319
939,940,Randall Hill,32,M,02215
940,941,Don Hancock,20,M,97229
941,942,Geri Wilson,48,F,78209


### Get user ID for chosen user

In [4]:
def get_user_id(name: str, users_df: pd.DataFrame):
    """
    Get the user ID for a given full name

    Arguments:

    -- name: full name of the user, as it appears in the 'full name' column
    -- users_df: pd.DataFrame containing user ID, full name, age, gender and zip code
    """
    # Select the 'user id' values of all rows whose full name matches
    matching_ids = users_df.loc[users_df['full name'] == name, 'user id']
    # Take the first match (raises IndexError if the name is not present)
    user_id = matching_ids.iloc[0]
    return user_id

user_id = get_user_id("Ryan James", users_df)

In [5]:
ratings_df

Unnamed: 0,user id,item id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99996,12,203,3,879959583
99997,913,288,5,881250949
99998,914,288,5,891717742
99999,915,288,5,878887116


### Ignore movies already rated by user

- Use the `~` operator together with `isin` to keep only the rows whose value is not in a given list (see [link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html))
- Filter the dataframe by a list as discussed [here](https://www.statology.org/pandas-filter-in-list/)
- Get the `item id` values of all rows rated by the user (see [this link](https://sparkbyexamples.com/pandas/pandas-get-cell-value-from-dataframe/?expand_article=1))
- Apply the filter to the `movies_not_rated_df` dataframe, which holds the ratings from all other users
- Sort by `item id` and reset the index values

In [6]:
def ignore_movies_rated_by_user(user_id: int, ratings_df: pd.DataFrame):
    """
    Use the ratings DataFrame to collect the ratings (by other users) of all movies not rated by the user

    Arguments:

    -- user_id: integer representing user ID
    -- ratings_df: pd.DataFrame containing movie ratings from all users
    """
    # Ratings submitted by everyone except the given user
    movies_not_rated_df = ratings_df[ratings_df['user id'] != user_id].reset_index(drop=True)
    # IDs of the movies that the given user has already rated
    movies_rated_df = ratings_df[ratings_df['user id'] == user_id].reset_index(drop=True)
    movie_ids = movies_rated_df['item id']

    # Drop every rating that refers to a movie the user has already rated
    movie_ids_not_rated = movies_not_rated_df[~movies_not_rated_df['item id'].isin(movie_ids)]
    movie_ids_not_rated = movie_ids_not_rated.sort_values(by=['item id']).reset_index(drop=True)
    return movie_ids_not_rated
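
The `~` / `isin` filtering described in the bullets above can be tried out on a tiny example. This is only a sketch: the toy ratings table and the names `toy_ratings`, `target_user`, `rated_ids` and `unseen` are made up for illustration.

```python
import pandas as pd

# Hypothetical ratings table with the same column names as ratings_df
toy_ratings = pd.DataFrame({
    "user id": [1, 1, 2, 2, 3],
    "item id": [10, 11, 10, 12, 13],
    "rating":  [5, 3, 4, 2, 1],
})
target_user = 1

# Item IDs the target user has already rated (10 and 11)
rated_ids = toy_ratings.loc[toy_ratings["user id"] == target_user, "item id"]

# Keep only ratings whose item ID is NOT among the rated ones
unseen = toy_ratings[~toy_ratings["item id"].isin(rated_ids)]

print(sorted(unseen["item id"].unique().tolist()))  # [12, 13]
print(target_user in unseen["user id"].values)      # False
```

Note that the user's own rows disappear automatically, because every movie they rated is in `rated_ids`. Movies that nobody else has rated never appear in the result either, since the filter works on rating rows rather than on the full movie list.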

In [7]:
movie_ids_not_rated = ignore_movies_rated_by_user(user_id, ratings_df)

### Link the movie ID dataframe (for all users except the given user) to the movie title and year

- Create a `row index` column in the first df, holding the row position in the second df that corresponds to each `item id`
- Get the corresponding `movie title` and `release year` from the second df
- Add these columns to the first df so all information is easily accessible

In [8]:
def collate_all_movie_info(movie_id_df: pd.DataFrame, movie_title_df: pd.DataFrame):
    """
    For all movies not rated by the given user, put the ID, title and year into a single DataFrame

    Arguments:

    -- movie_id_df: pd.DataFrame containing user ratings and movie IDs
    -- movie_title_df: pd.DataFrame containing movie title and release year for each movie in the database
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    movie_id_df = movie_id_df.copy()
    # NOTE: assumes the row position of a movie in movie_title_df is always 'item id' - 2;
    # this is fragile (e.g. item id 1 maps to position -1, which iloc reads as the last row)
    movie_id_df['row index'] = movie_id_df['item id'] - 2
    # Look up title and year by position; to_numpy() avoids index alignment on assignment
    movie_id_df['movie title'] = movie_title_df['movie title'].iloc[movie_id_df['row index']].to_numpy()
    movie_id_df['release year'] = movie_title_df['release year'].iloc[movie_id_df['row index']].to_numpy()
    return movie_id_df

In [9]:
movie_ids_not_rated = collate_all_movie_info(movie_ids_not_rated, movies_df)
movie_ids_not_rated

Unnamed: 0,user id,item id,rating,timestamp,row index,movie title,release year
0,634,273,3,875729069,271,Heat (1995),1995
1,567,273,5,882427068,271,Heat (1995),1995
2,458,273,4,886394730,271,Heat (1995),1995
3,425,273,4,878738435,271,Heat (1995),1995
4,291,273,3,874833705,271,Heat (1995),1995
...,...,...,...,...,...,...,...
58162,863,1678,1,889289570,1676,Mat' i syn (1997),1998
58163,863,1679,3,889289491,1677,B. Monkey (1998),1998
58164,863,1680,2,889289570,1678,Sliding Doors (1998),1998
58165,896,1681,3,887160722,1679,You So Crazy (1994),1994
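
The positional lookup above only works while each movie's row position in `movies_df` is exactly its `item id` minus two. An `item id` of 1, for instance, becomes position -1, which `iloc` resolves to the last row. One possible alternative is to join the two tables on the ID itself. The sketch below uses toy data and assumes the movies table has a `movie id` column, as the `get_recommendations` docstring suggests; that column name is not confirmed by any output shown here, and the names `toy_movies`, `toy_ratings` and `merged` are hypothetical.

```python
import pandas as pd

# Hypothetical movies table; ID 3 is deliberately missing, so no fixed
# offset between ID and row position can be correct for every row
toy_movies = pd.DataFrame({
    "movie id": [1, 2, 4],
    "movie title": ["A (1995)", "B (1995)", "D (1998)"],
    "release year": [1995, 1995, 1998],
})

# Hypothetical ratings table using the column names of ratings_df
toy_ratings = pd.DataFrame({
    "user id": [7, 8, 9],
    "item id": [4, 1, 2],
    "rating": [5, 3, 4],
})

# Left join: every rating keeps its row and picks up title and year by ID
merged = toy_ratings.merge(
    toy_movies, left_on="item id", right_on="movie id", how="left"
)

print(merged[["item id", "movie title", "release year"]])
```

With `how="left"`, a rating whose ID is absent from the movies table is kept with missing title and year rather than being silently paired with the wrong movie.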


### What are the next steps?

- Add more functions to calculate the most highly recommended movie for a given user
- Check for bugs, as the script seems to output the same movies irrespective of the user
- Run entire script

In [10]:
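def select_single_year(year: int, movies_id_df: pd.DataFrame):
    """
    Select the subset of movies released in a single year

    Arguments:

    -- year: integer release year to keep
    -- movies_id_df: pd.DataFrame containing a 'release year' column
    """
    # Keep only the rows whose release year matches the requested year
    movies_id_df = movies_id_df[movies_id_df['release year'] == year]
    return movies_id_df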
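
One of the next steps listed above is to turn the filtered ratings into a single recommended title. A rough sketch of how that could look follows. The helper name `pick_top_title` is hypothetical, and the scoring rules are assumptions: `by_popularity` is taken to mean the largest number of ratings and `by_rating` the highest mean rating. Ties are broken alphabetically, as the task statement requires.

```python
import pandas as pd

def pick_top_title(candidates: pd.DataFrame, method: str) -> str:
    """Hypothetical helper: return the best-scoring title, ties broken alphabetically."""
    grouped = candidates.groupby("movie title")["rating"]
    if method == "by_popularity":
        scores = grouped.count()   # assumption: popularity = number of ratings
    elif method == "by_rating":
        scores = grouped.mean()    # assumption: score = mean rating
    else:
        raise ValueError(f"Unsupported method: {method}")

    # Highest score first; equal scores are ordered by title (A to Z)
    ranked = scores.reset_index(name="score").sort_values(
        by=["score", "movie title"], ascending=[False, True]
    )
    return ranked.iloc[0]["movie title"]

# Toy candidates: A and B both have two ratings with mean 4, C has one rating of 5
toy_candidates = pd.DataFrame({
    "movie title": ["B", "A", "B", "A", "C"],
    "rating":      [5,   4,   3,   4,   5],
})

print(pick_top_title(toy_candidates, "by_popularity"))  # A
print(pick_top_title(toy_candidates, "by_rating"))      # C
```

Sorting on the score and the title in one call keeps the tie-breaking rule explicit instead of relying on the incidental order of the grouped result.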

### Define main function `get_recommendations`

- Use `unique` to check whether a column contains a given value (see [here](https://sparkbyexamples.com/pandas/pandas-check-column-contains-a-value-in-dataframe/))

In [14]:
def get_recommendations(users_df: pd.DataFrame, 
                        movies_df: pd.DataFrame, 
                        ratings_df: pd.DataFrame, 
                        full_name: str, 
                        method: str, 
                        year: int):
    """
    Return the title of the most highly recommended movie for the given user.
    All methods should return a movie title that has not yet been rated by the given user.
    If there is more than one movie that meets the condition, the function should return the first
    movie in alphabetical order.

    users_df:   information about users with the columns (user id, full name, age, gender, zip code)
    movies_df:  information about movies with the columns (movie id, movie title, release year)
    ratings_df: information about movie ratings by users with the columns (user id, item id, rating, timestamp)
    full_name:  full name of the user for whom we want to return one recommended movie
    method:     method for recommending a movie ('by_popularity', 'by_rating' or 'by_similar_users')
    year:       release year that the recommended movie must match
    """

    # Validate the inputs before doing any work
    if full_name not in users_df['full name'].unique():
        raise ValueError(f'{full_name} does not exist in the database, please check your input carefully!')

    if year not in movies_df['release year'].unique():
        raise ValueError(f'No movies released in {year} in the database, please choose again!')

    user_id = get_user_id(full_name, users_df)
    print(user_id)  # debug output
    movie_ids_not_rated = ignore_movies_rated_by_user(user_id, ratings_df)
    print(movie_ids_not_rated)  # debug output
    movie_ids_not_rated = collate_all_movie_info(movie_ids_not_rated, movies_df)
    movie_ids_not_rated = select_single_year(year, movie_ids_not_rated)
    # TODO: apply `method` and return a single title; for now the filtered DataFrame is returned
    return movie_ids_not_rated


### Call the main function `get_recommendations`

In [15]:
get_recommendations(users_df, movies_df, ratings_df, "Don Hancock", "by_popularity", 1998)

941
       user id  item id  rating  timestamp
0          429        2       3  882387599
1          416        2       4  886317115
2          757        2       3  888466490
3          201        2       2  884112487
4          627        2       3  879531352
...        ...      ...     ...        ...
94084      863     1678       1  889289570
94085      863     1679       3  889289491
94086      863     1680       2  889289570
94087      896     1681       3  887160722
94088      916     1682       3  880845755

[94089 rows x 4 columns]


Unnamed: 0,user id,item id,rating,timestamp,row index,movie title,release year
44054,205,315,4,888284245,313,Apt Pupil (1998),1998
44055,819,315,5,884618354,313,Apt Pupil (1998),1998
44056,870,315,2,883876178,313,Apt Pupil (1998),1998
44057,284,315,5,885329593,313,Apt Pupil (1998),1998
44058,344,315,5,884813342,313,Apt Pupil (1998),1998
...,...,...,...,...,...,...,...
94075,782,1670,3,891497793,1668,Tainted (1998),1998
94076,787,1671,1,888980193,1669,"Further Gesture, A (1996)",1998
94084,863,1678,1,889289570,1676,Mat' i syn (1997),1998
94085,863,1679,3,889289491,1677,B. Monkey (1998),1998
