# Project 8 - Recommender System for Movies

## Problem Statement:

- This notebook implements a movie recommender system. 
- Recommender systems suggest items such as movies or songs to users based on their interests or usage history.
- For example, Netflix recommends movies based on the movies you have already watched.
- In this example, we will use item-based collaborative filtering.


- Dataset MovieLens: https://grouplens.org/datasets/movielens/100k/ 
- Photo Credit: https://pxhere.com/en/photo/1588369
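
To make the idea of item-based collaborative filtering concrete before loading the real data, here is a minimal sketch on a toy table. The ratings, the movie names ("Movie A" and so on) and the variable names are invented for illustration, and Pearson correlation is only one possible similarity measure, assumed here for simplicity.

```python
import pandas as pd

# Toy ratings, invented for illustration (not taken from MovieLens).
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "title": ["Movie A", "Movie B", "Movie C"] * 4,
    "rating": [5, 4, 1, 4, 5, 2, 1, 2, 5, 2, 1, 4],
})

# Rows are users, columns are movies, cells are ratings.
user_item = toy_ratings.pivot_table(index="user_id", columns="title", values="rating")

# Correlate every movie's rating column with the ratings of "Movie A".
similarity_to_a = user_item.corrwith(user_item["Movie A"])

# Most similar movies first, excluding the movie itself.
recommendations = similarity_to_a.drop("Movie A").sort_values(ascending=False)
print(recommendations)
```

In this toy table, users who liked "Movie A" also tended to like "Movie B" and to dislike "Movie C", so "Movie B" ranks first.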

# Stage 1 - Import the dataset and Libraries:

### Import the Libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Import the Dataset:

##### Dataset 1: Movie titles

In [2]:
data_movies_titles = pd.read_csv("Movie_Id_Titles")

In [3]:
data_movies_titles.head()

|  | item_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


##### Dataset 2: Movie Ratings

In [4]:
data_movies_ratings = pd.read_csv('u.data', sep = '\t', names=['user_id', 'item_id', 'rating', 'timestamp'])

In [5]:
data_movies_ratings.head()

|  | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 |
| 1 | 0 | 172 | 5 | 881250949 |
| 2 | 0 | 133 | 1 | 881250949 |
| 3 | 196 | 242 | 3 | 881250949 |
| 4 | 186 | 302 | 3 | 891717742 |
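
The ratings file is tab-separated and has no header row, which is why the `read_csv` call above passes both `sep` and `names`. The sketch below reproduces that behaviour on an in-memory string built from the first three rows shown above, so it runs without the `u.data` file. The names `sample`, `columns` and `sample_ratings` are illustrative only.

```python
import io
import pandas as pd

# First rows of the ratings data as shown above: tab-separated, no header line.
sample = "0\t50\t5\t881250949\n0\t172\t5\t881250949\n0\t133\t1\t881250949\n"

# Without names=, pandas would treat the first record as the header row.
columns = ["user_id", "item_id", "rating", "timestamp"]
sample_ratings = pd.read_csv(io.StringIO(sample), sep="\t", names=columns)
print(sample_ratings)
```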


In [7]:
data_movies_ratings.tail()

|  | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 99998 | 880 | 476 | 3 | 880175444 |
| 99999 | 716 | 204 | 5 | 879795543 |
| 100000 | 276 | 1090 | 1 | 874795795 |
| 100001 | 13 | 225 | 2 | 882399156 |
| 100002 | 12 | 203 | 3 | 879959583 |


As the dataframe above shows, the "timestamp" column holds data that is not needed at this stage.

##### Drop the "timestamp" column.

In [8]:
data_movies_ratings.drop(['timestamp'], axis = 1, inplace = True)

In [9]:
data_movies_ratings.head()

|  | user_id | item_id | rating |
|---|---|---|---|
| 0 | 0 | 50 | 5 |
| 1 | 0 | 172 | 5 |
| 2 | 0 | 133 | 1 |
| 3 | 196 | 242 | 3 |
| 4 | 186 | 302 | 3 |
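
For reference, the sketch below shows what the dropped column encodes and a non-mutating way to drop it. It assumes the timestamps are Unix seconds since 1970-01-01 UTC, as the MovieLens 100K documentation describes them. The two sample rows are copied from the output above, and the names `sample`, `rated_at` and `without_timestamp` are illustrative.

```python
import pandas as pd

# Two rows copied from the ratings output shown earlier.
sample = pd.DataFrame({
    "user_id": [0, 186],
    "item_id": [50, 302],
    "rating": [5, 3],
    "timestamp": [881250949, 891717742],
})

# Assumption: the integers are seconds since 1970-01-01 UTC.
rated_at = pd.to_datetime(sample["timestamp"], unit="s")
print(rated_at)

# drop() without inplace=True returns a new frame and leaves `sample` unchanged.
without_timestamp = sample.drop(columns=["timestamp"])
print(without_timestamp.columns.tolist())
```

Returning a new frame instead of using `inplace = True` keeps the original data available if the timestamps are needed later, for example to order ratings in time.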


#### Explore further details of the Dataset:

In [10]:
data_movies_ratings.describe()

|  | user_id | item_id | rating |
|---|---|---|---|
| count | 100003.0 | 100003.0 | 100003.0 |
| mean | 462.470876 | 425.520914 | 3.529864 |
| std | 266.622454 | 330.797791 | 1.125704 |
| min | 0.0 | 1.0 | 1.0 |
| 25% | 254.0 | 175.0 | 3.0 |
| 50% | 447.0 | 322.0 | 4.0 |
| 75% | 682.0 | 631.0 | 4.0 |
| max | 943.0 | 1682.0 | 5.0 |


In [11]:
data_movies_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100003 entries, 0 to 100002
Data columns (total 3 columns):
user_id    100003 non-null int64
item_id    100003 non-null int64
rating     100003 non-null int64
dtypes: int64(3)
memory usage: 2.3 MB


#### Create the dataset required: Merging Dataset 1 and 2 together:

Both datasets share the "item_id" column, so they are merged on that column.

In [12]:
data_movies = pd.merge(data_movies_ratings, data_movies_titles, on = 'item_id')

In [13]:
data_movies.head()

|  | user_id | item_id | rating | title |
|---|---|---|---|---|
| 0 | 0 | 50 | 5 | Star Wars (1977) |
| 1 | 290 | 50 | 5 | Star Wars (1977) |
| 2 | 79 | 50 | 4 | Star Wars (1977) |
| 3 | 2 | 50 | 5 | Star Wars (1977) |
| 4 | 8 | 50 | 5 | Star Wars (1977) |


In [14]:
data_movies.shape

(100003, 4)
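
The merged frame has the same number of rows as the ratings frame, which suggests every rating found exactly one matching title. A minimal sketch of how that expectation could be checked explicitly is shown below on toy frames. The frame names `ratings`, `titles` and `merged` and the third toy rating are illustrative assumptions; `validate` is a standard `pd.merge` parameter.

```python
import pandas as pd

# Toy frames with the same columns as the notebook's data (values partly invented).
ratings = pd.DataFrame({
    "user_id": [0, 290, 79],
    "item_id": [50, 50, 1],
    "rating": [5, 5, 4],
})
titles = pd.DataFrame({
    "item_id": [1, 50],
    "title": ["Toy Story (1995)", "Star Wars (1977)"],
})

# validate="many_to_one" raises MergeError if item_id is duplicated in `titles`.
merged = pd.merge(ratings, titles, on="item_id", how="inner", validate="many_to_one")

# An inner merge silently drops ratings with no matching title, so compare row counts.
print(len(merged) == len(ratings))
print(merged)
```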

# Stage 2 - Exploratory Data Analysis:

# Stage 3 - 

# Stage 4 - 

# Stage 5 - 