## Loading the datasets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
movies=pd.read_csv("Data/movies.csv")
ratings=pd.read_csv("Data/ratings.csv")
tags=pd.read_csv("Data/tags.csv")
links=pd.read_csv("Data/links.csv")


* First, we make sure that our dataframes don't contain null values using the following command:

In [2]:
movies.isnull().sum().sum()


0

In [3]:
ratings.isnull().sum().sum()

0

In [4]:
tags.isnull().sum().sum()

0

* We find no missing values in the movies, ratings and tags dataframes, while the links dataframe has 8 missing values, all in the tmdbId column, as can be seen below:

In [5]:
links[pd.isnull(links).any(axis=1)]

Unnamed: 0,movieId,imdbId,tmdbId
624,791,113610,
843,1107,102336,
2141,2851,81454,
3027,4051,56600,
5532,26587,92337,
5854,32600,377059,
6059,40697,105946,
7382,79299,874957,
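The per-dataframe checks above can also be run in one pass. The following is only a minimal sketch on toy stand-in data: the `frames` dictionary and its values are hypothetical and would be replaced by the real `movies`, `ratings`, `tags` and `links` dataframes.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real dataframes (hypothetical values, not the actual data).
frames = {
    "movies": pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]}),
    "links": pd.DataFrame({"movieId": [1, 2], "tmdbId": [862.0, np.nan]}),
}

# Total number of missing cells per dataframe.
null_counts = {name: int(df.isnull().sum().sum()) for name, df in frames.items()}

# Per-column breakdown, printed only for dataframes that have missing values.
for name, df in frames.items():
    if null_counts[name] > 0:
        print(name, df.isnull().sum().to_dict())
```

The per-column breakdown makes it easy to see which column is responsible for the missing values.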


* Next, we look at the number of unique movies in the movies dataframe using the following command:

In [6]:
movies['movieId'].nunique()


9742

* In total, we have 9742 unique movies in the dataset.
* Of these 9742 movies, only 9724 actually have a rating, as can be seen by executing the following command:

In [7]:
ratings['movieId'].nunique()

9724
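To see which movies have no rating, rather than only how many, one option is to filter on movieId membership. This is a sketch on toy data: `movies_toy` and `ratings_toy` are hypothetical stand-ins for `movies` and `ratings`. The count obtained by subtracting the two `nunique()` values only matches when every rated movieId also appears in the movies dataframe, which is assumed here.

```python
import pandas as pd

# Toy stand-ins (hypothetical values).
movies_toy = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings_toy = pd.DataFrame(
    {"userId": [1, 2, 2], "movieId": [1, 1, 3], "rating": [4.0, 3.5, 5.0]}
)

# Movies whose movieId never appears in the ratings dataframe.
unrated = movies_toy[~movies_toy["movieId"].isin(ratings_toy["movieId"])]

# Same count obtained by subtraction, as done with nunique() above.
n_unrated = movies_toy["movieId"].nunique() - ratings_toy["movieId"].nunique()
```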

## Data Cleaning and Merging

* I noticed that there is a timestamp column in both the ratings and the tags dataframes.
* I will delete this column because I don't think it provides any additional value for our recommendation system.

In [8]:
del ratings['timestamp']
del tags['timestamp']

* Now I will merge the tags and ratings dataframes to see what additional value I can obtain from the datasets.
* After going through the different join types described at the following link, I decided to do a left join between the ratings and the tags: http://chris.friedline.net/2015-12-15-rutgers/lessons/python2/04-merging-data.html
* The join is based on the userId and movieId columns, as shown below:

In [9]:
ratings_tags = pd.merge(ratings, tags, how='left', on=['userId', 'movieId'])
ratings_tags['movieId'].nunique()

9724

In [10]:
ratings_tags.head()

Unnamed: 0,userId,movieId,rating,tag
0,1,1,4.0,
1,1,3,4.0,
2,1,6,4.0,
3,1,47,5.0,
4,1,50,5.0,
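To make the behaviour of this left join on two keys easier to follow, here is a small sketch on toy data (`ratings_toy` and `tags_toy` are hypothetical). It uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column showing whether each row found a match.

```python
import pandas as pd

# Toy stand-ins (hypothetical values).
ratings_toy = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 5.0, 3.0]}
)
tags_toy = pd.DataFrame(
    {"userId": [2, 2, 3], "movieId": [10, 10, 30], "tag": ["funny", "classic", "dark"]}
)

# Left join on both keys; _merge is "both" for matched rows, "left_only" otherwise.
merged = pd.merge(
    ratings_toy, tags_toy, how="left", on=["userId", "movieId"], indicator=True
)
print(merged)
```

Two things are worth noticing in the result: a rating is repeated once for every tag the same user gave the same movie, and a tag from a user who did not rate that movie (user 3 here) does not appear at all.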


In [11]:
#ratings_tags[ratings_tags['tag'].isna()]

* Looking at our new dataframe "ratings_tags", we notice that many rows have a rating but no tag. Because the join uses both userId and movieId, a tag is only attached when the same user who rated the movie also tagged it.
* These rows show up as "NaN" values in the tag column of the ratings_tags dataframe.
* The next step is to create a dataframe containing only the rows that have both a rating and a tag.
* The new dataframe is called "have_ratings_tags" and is created using the following command:

In [12]:
have_ratings_tags=ratings_tags[ratings_tags.tag.notnull()]
have_ratings_tags['movieId'].nunique()


1464

In [13]:
tags['movieId'].nunique()

1572

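* We notice that the newly created dataframe has 1464 unique movieId values, while the original tags dataframe has 1572 unique movieId values.
* This means that there are 1572 - 1464 = 108 movies whose tags all come from users who did not rate that movie themselves. Since the join matches on both userId and movieId, these movies may still have ratings from other users.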

* Our new dataframe "have_ratings_tags" contains every rating that has a tag from the same user, and we can now use it to perform clustering.
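The 108 tagged movies that drop out of the join could be inspected further, for example to check whether they have ratings from other users or no ratings at all. The following is a sketch on toy data: `ratings_toy`, `tags_toy` and the derived names are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical values).
ratings_toy = pd.DataFrame({"userId": [1, 2], "movieId": [10, 20], "rating": [4.0, 3.0]})
tags_toy = pd.DataFrame(
    {"userId": [1, 3, 3], "movieId": [10, 20, 30], "tag": ["funny", "dark", "slow"]}
)

# Movies with at least one (userId, movieId) pair present in both dataframes.
matched = pd.merge(ratings_toy, tags_toy, on=["userId", "movieId"])["movieId"].unique()

# Tagged movies for which no tagging user also rated the movie.
unmatched = tags_toy.loc[~tags_toy["movieId"].isin(matched), "movieId"].drop_duplicates()

# Split them: rated by some other user vs. never rated at all.
rated_by_others = unmatched[unmatched.isin(ratings_toy["movieId"])]
never_rated = unmatched[~unmatched.isin(ratings_toy["movieId"])]
```

In the toy data, movie 20 is tagged by user 3 but rated only by user 2, while movie 30 has no rating at all.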

In [14]:
have_ratings_tags['tag'].nunique()

1543

* Now we will convert the tags to lowercase so that tags differing only in capitalization are treated as duplicates.

In [15]:
# Lowercase only the tag column; DataFrame.applymap is deprecated since pandas 2.1
have_ratings_tags = have_ratings_tags.assign(tag=have_ratings_tags['tag'].str.lower())

In [16]:
have_ratings_tags.head()

Unnamed: 0,userId,movieId,rating,tag
241,2,60756,5.0,funny
242,2,60756,5.0,highly quotable
243,2,60756,5.0,will ferrell
252,2,89774,5.0,boxing story
253,2,89774,5.0,mma


In [17]:
#have_ratings_tags['movieId'].nunique()


In [18]:
#tags['movieId'].nunique()

In [19]:
#have_ratings_tags['tag'].nunique()

* Here is our new dataframe have_ratings_tags, which contains all the movies that have both ratings and tags associated with them:

In [20]:
have_ratings_tags

Unnamed: 0,userId,movieId,rating,tag
241,2,60756,5.0,funny
242,2,60756,5.0,highly quotable
243,2,60756,5.0,will ferrell
252,2,89774,5.0,boxing story
253,2,89774,5.0,mma
...,...,...,...,...
99967,606,6107,4.0,world war ii
100087,606,7382,4.5,for katie
101553,610,3265,5.0,gun fu
101554,610,3265,5.0,heroic bloodshed


In [21]:
#have_ratings_tags=have_ratings_tags.drop_duplicates()
#have_ratings_tags

* We notice that some movies have multiple, differing ratings.
* We are going to compute an average rating for each movie over all of its rows in have_ratings_tags using the following commands:

In [22]:
average_ratings=have_ratings_tags.groupby('movieId')['rating'].mean().reset_index(name='average_ratings')
average_ratings.head()


Unnamed: 0,movieId,average_ratings
0,1,3.833333
1,2,3.75
2,3,2.5
3,5,1.5
4,7,3.0
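One thing to keep in mind with this average: have_ratings_tags has one row per tag, so a user who gave a movie several tags contributes their rating several times. If an average with one rating per user is preferred instead, a possible variation (an assumption on my part, not what the cell above does) is to drop duplicate (userId, movieId) pairs first. A sketch on toy data, where `df` is a hypothetical stand-in:

```python
import pandas as pd

# Toy stand-in for have_ratings_tags (hypothetical values).
df = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "movieId": [10, 10, 10, 10],
    "rating": [5.0, 5.0, 5.0, 3.0],
    "tag": ["funny", "quotable", "classic", "comedy"],
})

# Mean over tag rows: user 1's rating is counted three times.
per_tag_mean = df.groupby("movieId")["rating"].mean()

# Mean over one row per (userId, movieId): each user counts once.
per_user = df.drop_duplicates(subset=["userId", "movieId"])
per_user_mean = per_user.groupby("movieId")["rating"].mean()
```

Here the per-tag mean is (5 + 5 + 5 + 3) / 4 = 4.5, while the per-user mean is (5 + 3) / 2 = 4.0.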


Now we can merge the average ratings with the have_ratings_tags data frame.

In [23]:
have_ratings_tags=pd.merge(have_ratings_tags,average_ratings,on=['movieId'])
have_ratings_tags.head()

Unnamed: 0,userId,movieId,rating,tag,average_ratings
0,2,60756,5.0,funny,4.1875
1,2,60756,5.0,highly quotable,4.1875
2,2,60756,5.0,will ferrell,4.1875
3,62,60756,3.5,comedy,4.1875
4,62,60756,3.5,funny,4.1875
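As an alternative to computing the averages and merging them back, `groupby(...).transform("mean")` attaches the group mean to every row in one step and keeps the original row order and index. A sketch on toy data, where `df` is a hypothetical stand-in for have_ratings_tags:

```python
import pandas as pd

# Toy stand-in (hypothetical values).
df = pd.DataFrame({
    "userId": [2, 2, 62, 7],
    "movieId": [60756, 60756, 60756, 1],
    "rating": [5.0, 5.0, 3.5, 2.0],
    "tag": ["funny", "will ferrell", "comedy", "pixar"],
})

# transform returns a Series aligned with df, holding each row's group mean.
df["average_ratings"] = df.groupby("movieId")["rating"].transform("mean")
print(df)
```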
