## Data science movie recommendation challenge 

- Create a function `get_recommendations` that returns the title of the most highly recommended movie for a given user.
- The function takes six arguments: three dataframes (users, movies and ratings), the user's full name, the recommendation method and the release year of interest.
- Every method must return a movie that the given user has not yet rated. If several movies tie, the function returns the first one in alphabetical order.
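
The alphabetical tie-break can be sketched with a two-key sort. The titles and the `score` column below are made up for illustration; the real score depends on the recommendation method.

```python
import pandas as pd

# Hypothetical candidates; titles and scores are illustrative only.
candidates = pd.DataFrame({
    "movie title": ["Heat (1995)", "Casino (1995)", "Sabrina (1995)"],
    "score": [4.5, 4.5, 3.0],
})

# Sort by score (highest first), then by title (A to Z) to break ties.
ranked = candidates.sort_values(by=["score", "movie title"], ascending=[False, True])
best_title = ranked.iloc[0]["movie title"]
print(best_title)
```

Sorting on both keys at once avoids a separate pass to detect ties.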

In [1]:
import pandas as pd

### Read in data from csv files

- The files are pipe-delimited, so pass `sep="|"` to `read_csv`; the default comma separator does not parse them correctly (see [link](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data))

In [172]:
movies_df = pd.read_csv("/home/home02/earshar/data_science/main/data/movies.csv", sep="|")
ratings_df = pd.read_csv("/home/home02/earshar/data_science/main/data/ratings.csv", sep="|")
users_df = pd.read_csv("/home/home02/earshar/data_science/main/data/users.csv", sep="|")
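
A small sketch of what the separator changes, using an in-memory string instead of the real files. The sample assumes the files start with an unnamed index column, which is only a guess based on the `Unnamed: 0` column in the outputs below.

```python
import io
import pandas as pd

# Made-up, pipe-delimited sample; the leading unnamed column is an assumption.
raw = "|user id|full name|age\n0|1|Ryan James|24\n1|2|Alice Graves|53\n"

# With the default comma separator, each line lands in a single column.
wrong = pd.read_csv(io.StringIO(raw))

# sep="|" splits the fields; index_col=0 reuses the first column as the index.
right = pd.read_csv(io.StringIO(raw), sep="|", index_col=0)

print(wrong.shape)
print(list(right.columns))
```

If the assumption holds, `index_col=0` would also remove the `Unnamed: 0` column from the dataframes.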

In [173]:
users_df

Unnamed: 0,user id,full name,age,gender,zip code
0,1,Ryan James,24,M,85711
1,2,Alice Graves,53,F,94043
2,3,Ambrose Smith,23,M,32067
3,4,Bobby Alvarez,24,M,43537
4,5,Latosha Jiles,33,F,15213
...,...,...,...,...,...
938,939,Melva Carrol,26,F,33319
939,940,Randall Hill,32,M,02215
940,941,Don Hancock,20,M,97229
941,942,Geri Wilson,48,F,78209


In [174]:
name = "Ryan James"
# Find the row for this user, then read the id with a single `.loc` call
user_info = users_df[users_df['full name'] == name].reset_index(drop=True)
user_id = user_info.loc[0, 'user id']
user_id

1

In [175]:
ratings_df

Unnamed: 0,user id,item id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99996,12,203,3,879959583
99997,913,288,5,881250949
99998,914,288,5,891717742
99999,915,288,5,878887116


### Split the ratings by whether they were made by the user

In [176]:
# Ratings made by other users (these can still include movies the user has rated)
movies_not_rated_df = ratings_df[ratings_df['user id'] != user_id].reset_index(drop=True)
# Ratings made by the user
movies_rated_df = ratings_df[ratings_df['user id'] == user_id].reset_index(drop=True)

### Sort by `item id` and reset indices

In [177]:
# movies_rated_df.sort_values(by=['item id']).reset_index(drop=True)
# movies_not_rated_df.sort_values(by=['item id']).reset_index(drop=True)

### Use the `loc` indexer to subset the dataframe

- Used [this link](https://sparkbyexamples.com/pandas/pandas-get-cell-value-from-dataframe/?expand_article=1) to get the values of `item id` for all rows

In [178]:
movie_ids = movies_rated_df.loc[:, 'item id']
movie_ids

0       61
1      189
2       33
3      160
4       20
      ... 
267     28
268    172
269    122
270    152
271     94
Name: item id, Length: 272, dtype: int64

### Ignore movies already rated by user

- Negate the boolean mask returned by `isin` with the `~` operator to keep only the rows whose `item id` is not in `movie_ids` (see [link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html))

In [179]:
movie_ids_not_rated = movies_not_rated_df[~movies_not_rated_df['item id'].isin(movie_ids)]
#movie_ids_not_rated.sort_values(by=['item id']).reset_index(drop=True)
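
The same filter on a toy dataframe, as a minimal sketch. The ids and ratings are made up. Here the `~isin` mask is applied to the full ratings table, since any rating for an item the user has not rated must come from another user.

```python
import pandas as pd

# Toy data; ids and ratings are illustrative only.
ratings = pd.DataFrame({
    "user id": [1, 1, 2, 2, 3],
    "item id": [10, 20, 10, 30, 40],
    "rating":  [5, 3, 4, 2, 5],
})
user_id = 1

# Items the user has already rated.
rated_ids = ratings.loc[ratings["user id"] == user_id, "item id"]

# Ratings for items the user has not rated.
unrated = ratings[~ratings["item id"].isin(rated_ids)]
unrated_ids = sorted(unrated["item id"].unique().tolist())
print(unrated_ids)
```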

In [180]:
movies_df

Unnamed: 0,movie id,movie title,release year
0,1,Toy Story (1995),1995
1,2,GoldenEye (1995),1995
2,3,Four Rooms (1995),1995
3,4,Get Shorty (1995),1995
4,5,Copycat (1995),1995
...,...,...,...
1676,1678,Mat' i syn (1997),1998
1677,1679,B. Monkey (1998),1998
1678,1680,Sliding Doors (1998),1998
1679,1681,You So Crazy (1994),1994


### Filter `movies_df` to the movies the user has not rated

- Filter the dataframe by a list of ids, as discussed [here](https://www.statology.org/pandas-filter-in-list/)
- The result keeps the id, title and release year of every movie that the user has not rated

In [181]:
movie_titles_not_rated = movies_df[~movies_df['movie id'].isin(movie_ids)]
movie_titles_not_rated

Unnamed: 0,movie id,movie title,release year
271,273,Heat (1995),1995
272,274,Sabrina (1995),1995
273,275,Sense and Sensibility (1995),1995
274,276,Leaving Las Vegas (1995),1995
275,277,Restoration (1995),1995
...,...,...,...
1676,1678,Mat' i syn (1997),1998
1677,1679,B. Monkey (1998),1998
1678,1680,Sliding Doors (1998),1998
1679,1681,You So Crazy (1994),1994


In [182]:
movie_ids_not_rated

Unnamed: 0,user id,item id,rating,timestamp
1,186,302,3,891717742
2,22,377,1,878887116
4,166,346,1,886397596
5,298,474,4,884182806
7,253,465,5,891628467
...,...,...,...,...
99722,276,1090,1,874795795
99725,913,288,5,881250949
99726,914,288,5,891717742
99727,915,288,5,878887116


### Add the movie ratings to the `movie_titles_not_rated` dataframe

In [169]:
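# NOTE: chained indexing writes to a temporary copy, so no `rating` column is added here;
# the ratings also hold one row per (user, movie) pair, so they would not align row for row.
movie_titles_not_rated.loc[:]['rating'] = movie_ids_not_rated.loc[:]['rating'].reset_index(drop=True)
# movie_titles_not_rated.loc[:]['rating'] = movie_ids_not_rated['rating']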

In [170]:
movie_ids_not_rated['rating'].reset_index(drop=True)

0        3
1        1
2        1
3        4
4        5
        ..
58162    1
58163    5
58164    5
58165    5
58166    5
Name: rating, Length: 58167, dtype: int64

In [171]:
movie_titles_not_rated

Unnamed: 0,movie id,movie title,release year
271,273,Heat (1995),1995
272,274,Sabrina (1995),1995
273,275,Sense and Sensibility (1995),1995
274,276,Leaving Las Vegas (1995),1995
275,277,Restoration (1995),1995
...,...,...,...
1676,1678,Mat' i syn (1997),1998
1677,1679,B. Monkey (1998),1998
1678,1680,Sliding Doors (1998),1998
1679,1681,You So Crazy (1994),1994
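
The dataframe above still has no `rating` column. One possible way to attach ratings to movies, sketched on toy data, is to collapse the ratings to one value per movie and then merge on the id. Using the mean, and the column name `mean rating`, are assumptions made for this example rather than part of the challenge.

```python
import pandas as pd

# Toy data; ids, titles and ratings are illustrative only.
movies = pd.DataFrame({
    "movie id": [30, 40, 50],
    "movie title": ["Heat (1995)", "Sabrina (1995)", "Restoration (1995)"],
})
ratings = pd.DataFrame({
    "item id": [30, 30, 40],
    "rating":  [4, 2, 5],
})

# Collapse the ratings to one row per movie.
mean_ratings = (
    ratings.groupby("item id", as_index=False)["rating"]
    .mean()
    .rename(columns={"item id": "movie id", "rating": "mean rating"})
)

# A left merge keeps movies without ratings; their mean rating is NaN.
movies_with_ratings = movies.merge(mean_ratings, on="movie id", how="left")
print(movies_with_ratings)
```

Merging on the id matches rows by movie, which positional assignment cannot do when the two frames have different lengths.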


In [13]:
def get_recommendations(users: pd.DataFrame, 
                        movies: pd.DataFrame, 
                        ratings: pd.DataFrame, 
                        full_name: str, 
                        method: str, 
                        year: int):
    """
    Return the title of the most highly recommended movie for the given user.
    Every method returns a movie that the given user has not yet rated. If several
    movies tie, the function returns the first one in alphabetical order.

    users:     information about users with the columns (user id, full name, age, gender, zip code)
    movies:    information about movies with the columns (movie id, movie title, release year)
    ratings:   information about movie ratings by users with the columns (user id, item id, rating, timestamp)
    full_name: full name of the user for whom we want to return one recommended movie
    method:    method for recommending a movie ('by_popularity', 'by_rating' or 'by_similar_users')
    year:      release year that the recommended movie must have
    """

    # `in` on a Series tests the index, so compare against the column's unique values
    if full_name not in users['full name'].unique():
        raise ValueError(f'{full_name} does not exist in the database, please check your input carefully!')

    if year not in movies['release year'].unique():
        raise ValueError(f'No movies released in {year} in the database, please choose again!')

    # TODO: the recommendation logic for each method is not implemented yet

### Notes on the function above

- `in` applied to a Series checks the index labels rather than the values, so the membership tests go through `unique` (found [here](https://sparkbyexamples.com/pandas/pandas-check-column-contains-a-value-in-dataframe/))
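
A short demonstration of why the checks go through `unique()` rather than testing the column directly. The Series below is a made-up stand-in for `users['full name']`.

```python
import pandas as pd

names = pd.Series(["Ryan James", "Alice Graves"])

# `in` on a Series looks at the index labels (here 0 and 1), not the values.
in_series = "Ryan James" in names
# `unique()` returns the values, so this is a genuine value check.
in_values = "Ryan James" in names.unique()
# This succeeds only because 0 happens to be an index label.
label_hit = 0 in names

print(in_series, in_values, label_hit)
```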

In [15]:
get_recommendations(users_df, movies_df, ratings_df, "Ryan James", "by_popularity", 1995)
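
As written, the function only validates its inputs, so the call above returns `None`. The sketch below shows how one method could be filled in. It assumes that `'by_rating'` means "highest mean rating among the movies from `year` that the user has not rated", which the challenge text does not spell out. The helper name `recommend_by_mean_rating` and the toy data are hypothetical.

```python
import pandas as pd

def recommend_by_mean_rating(users, movies, ratings, full_name, year):
    """Hypothetical helper: best mean rating among the user's unrated movies from `year`."""
    user_id = users.loc[users["full name"] == full_name, "user id"].iloc[0]
    rated_ids = ratings.loc[ratings["user id"] == user_id, "item id"]

    # Candidates: released in `year` and not yet rated by the user.
    candidates = movies[(movies["release year"] == year)
                        & ~movies["movie id"].isin(rated_ids)]

    # Mean rating per movie; the inner merge drops movies nobody has rated.
    mean_ratings = ratings.groupby("item id")["rating"].mean().rename("mean rating")
    scored = candidates.merge(mean_ratings, left_on="movie id", right_index=True)
    if scored.empty:
        return None

    # Highest mean rating first; ties broken alphabetically by title.
    ranked = scored.sort_values(["mean rating", "movie title"], ascending=[False, True])
    return ranked.iloc[0]["movie title"]

# Toy data; every value below is made up for illustration.
users = pd.DataFrame({"user id": [1, 2], "full name": ["Ryan James", "Alice Graves"]})
movies = pd.DataFrame({
    "movie id": [1, 2, 3, 4],
    "movie title": ["Heat (1995)", "Casino (1995)", "Sabrina (1995)", "Fargo (1996)"],
    "release year": [1995, 1995, 1995, 1996],
})
ratings = pd.DataFrame({
    "user id": [1, 2, 2, 2],
    "item id": [3, 1, 2, 4],
    "rating":  [5, 4, 4, 5],
})

result = recommend_by_mean_rating(users, movies, ratings, "Ryan James", 1995)
print(result)
```

In the toy data, two unrated 1995 movies share the same mean rating, so the alphabetical tie-break decides the result.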