# Movie Recommendation System

The task is to build a model that provides the top 5 movie recommendations to a user, based on their ratings of other movies.

### EDA: right now, checking to see which data will help us with our business problem

In [3]:
# Importing pandas to analyze csv files
import pandas as pd

# Adding a variable for each csv file
links = pd.read_csv('data/links.csv')
movies = pd.read_csv('data/movies.csv')
ratings = pd.read_csv('data/ratings.csv')
tags = pd.read_csv('data/tags.csv')

### Looking at all dataframes to see what could be useful (or not) for the business problem

In [4]:
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9737,193581,5476944,432131.0
9738,193583,5914996,445030.0
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0


In [7]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [6]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [5]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


### Looking at these dataframes, I see several ways we could build a movie recommendation system from the present data. On paper, we could:

- Make a recommendation system that gives the top movies in certain genres based on ratings
- Make a recommendation system based on tags rather than genres, using ratings

#### Let's start with a recommendation system based on genre by rating for now.

In [10]:
genre_ratings = pd.merge(movies, ratings, how="left", on="movieId")

genre_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,9.649827e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,8.474350e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1.106636e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1.510578e+09
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1.305696e+09
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,1.537109e+09
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,1.537110e+09
100851,193585,Flint (2017),Drama,184.0,3.5,1.537110e+09
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,1.537110e+09


In [11]:
genre_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100854 entries, 0 to 100853
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   movieId    100854 non-null  int64  
 1   title      100854 non-null  object 
 2   genres     100854 non-null  object 
 3   userId     100836 non-null  float64
 4   rating     100836 non-null  float64
 5   timestamp  100836 non-null  float64
dtypes: float64(3), int64(1), object(2)
memory usage: 4.6+ MB
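
The `info()` output shows 100854 rows but only 100836 non-null values in `userId`, `rating`, and `timestamp`, so 18 rows came from movies with no match in `ratings`. One way to see which movies those are is the `indicator` option of `pd.merge`. The sketch below uses small made-up frames (`movies_demo` and `ratings_demo` are hypothetical stand-ins, not the real data) so it runs on its own:

```python
import pandas as pd

# Toy stand-ins for the movies and ratings frames (hypothetical values).
movies_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (1995)", "B (1996)", "C (1997)"],
})
ratings_demo = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.5, 5.0],
})

# indicator=True adds a "_merge" column saying where each row came from.
merged_demo = pd.merge(
    movies_demo, ratings_demo, how="left", on="movieId", indicator=True
)

# Rows tagged "left_only" are movies that nobody rated.
unrated = merged_demo.loc[merged_demo["_merge"] == "left_only", "title"].tolist()
print(unrated)
```

On the real data, the same filter on `genre_ratings` should list the unrated movies, assuming the merge is rerun with `indicator=True`.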


In [12]:
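## Only 18 of the ~100k rows have nulls in userId, rating, and timestamp (100854 - 100836). These are movies with no ratings, so let's drop them.
## Note: dropna() returns a new DataFrame, so this call only displays the result; genre_ratings itself is unchanged.
genre_ratings.dropna()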

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,9.649827e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,8.474350e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1.106636e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1.510578e+09
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1.305696e+09
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,1.537109e+09
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,1.537110e+09
100851,193585,Flint (2017),Drama,184.0,3.5,1.537110e+09
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,1.537110e+09


## Now that the movie genres are combined with the users' ratings, let's do more EDA to see how to properly build this recommendation system.

## Oh yeah, I just remembered that I forgot to convert the timestamp column into actual dates, which would let me filter for more recent ratings rather than all-time ratings.
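
A minimal sketch of that conversion, assuming the `timestamp` values are Unix seconds (they look like it, but that is an assumption here). The `rated_at` column name and the 2015 cutoff are made up for illustration, and the frame is a toy copy of two rows from `ratings`:

```python
import pandas as pd

# Two rows copied from the ratings preview above, as a toy frame.
ratings_demo = pd.DataFrame({
    "userId": [1, 610],
    "movieId": [1, 166534],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 1493848402],
})

# Assumption: timestamps are seconds since the Unix epoch.
ratings_demo["rated_at"] = pd.to_datetime(ratings_demo["timestamp"], unit="s")

# Keep only ratings on or after an arbitrary, hypothetical cutoff date.
cutoff = pd.Timestamp("2015-01-01")
recent = ratings_demo[ratings_demo["rated_at"] >= cutoff]
print(recent[["userId", "movieId", "rating", "rated_at"]])
```

The same call should also work on the merged frame, where the null timestamps would become `NaT`.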

In [52]:
# Analyzing each column before taking further steps

# Looking at how many different movies are in the dataframe
print("There are", genre_ratings['title'].nunique(), "total movies in the dataset\n")

# Looking at how many genre combos are in the dataframe
print("There are", genre_ratings['genres'].nunique(), "unique genre combinations in this dataframe, appearing from as much as 7k rows to 1 row.\n")

# Looking at how many users/user entries are in the data
genre_ratings['userId'].value_counts().sort_index()
print("There is data from 610 total users, where the max ratings given from one user is 2.7k and the minimum 20.")

There are 9737 total movies in the dataset

There are 951 unique genre combinations in this dataframe, appearing from as much as 7k rows to 1 row.

There is data from 610 total users, where the max ratings given from one user is 2.7k and the minimum 20.
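
The user counts in the last print statement are typed in by hand. A small sketch of computing them instead, so the message stays correct if the data changes (the `demo` frame below is made up and only stands in for `genre_ratings`):

```python
import pandas as pd

# Toy stand-in for genre_ratings (hypothetical values).
demo = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3, 3],
    "rating": [4.0, 3.0, 5.0, 2.5, 4.0, 4.5],
})

# Number of ratings each user gave.
ratings_per_user = demo["userId"].value_counts()
n_users = demo["userId"].nunique()

print(
    f"There is data from {n_users} users; ratings per user range "
    f"from {ratings_per_user.min()} to {ratings_per_user.max()}."
)
```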


#### Should I spend my time going through the genre column to separate each genre? I'm thinking of doing a for loop to split the genre values by the "|" and then finding how many individual unique genres there are, instead of genre combos, to broaden the recommendations we can give one person... whatever, let me try it...

In [84]:
genre_combos = list(genre_ratings['genres'].unique())
print("The genre_combos variable is a list of", type(genre_combos[0]), "based off the movies in the data.")

unique_genres = []

for genre_combo in genre_combos:
    genres = genre_combo.split("|")
    for genre in genres:
        if genre not in unique_genres:
            unique_genres.append(genre)
        
unique_genres

The genre_combos variable is a list of <class 'str'> based off the movies in the data.


['Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Fantasy',
 'Romance',
 'Drama',
 'Action',
 'Crime',
 'Thriller',
 'Horror',
 'Mystery',
 'Sci-Fi',
 'War',
 'Musical',
 'Documentary',
 'IMAX',
 'Western',
 'Film-Noir',
 '(no genres listed)']
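
The loop works fine. For comparison, here is a possible pandas-only version of the same idea using `str.split` and `explode`, sketched on a made-up `demo` frame rather than the real data:

```python
import pandas as pd

# Toy stand-in for the genres column (hypothetical values).
demo = pd.DataFrame({
    "genres": ["Adventure|Animation|Comedy", "Comedy|Romance", "Drama", "Comedy|Drama"],
})

# Split each combo into a list, put one genre per row, then keep the
# distinct values in order of first appearance.
unique_genres_demo = demo["genres"].str.split("|").explode().unique().tolist()
print(unique_genres_demo)
```

Like the loop, this keeps genres in the order they first show up.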

#### Now I have an idea: look at the mean rating of each genre, even though it most likely won't be useful for the analysis, at least in my head...
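
One way this could look, as a rough sketch on toy data: split the combos, explode to one row per (rating, genre) pair, then group. The `demo` frame and the `genre` column name are assumptions for illustration, not part of the dataset:

```python
import pandas as pd

# Toy stand-in for genre_ratings (hypothetical values).
demo = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "genres": ["Comedy|Romance", "Comedy|Romance", "Comedy", "Drama"],
    "rating": [4.0, 2.0, 5.0, 3.0],
})

# One row per (rating, single genre) pair.
per_genre = demo.assign(genre=demo["genres"].str.split("|")).explode("genre")

# Mean rating and number of ratings for each genre, best first.
genre_stats = (
    per_genre.groupby("genre")["rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(genre_stats)
```

Keeping `count` next to `mean` helps, since a genre with only a handful of ratings can have a misleading average.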