In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data Description

In [2]:
genome_scores_df = pd.read_csv('genome-scores.csv')
genome_tags_df = pd.read_csv('genome-tags.csv')
links_df = pd.read_csv('links.csv')
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
tags_df = pd.read_csv('tags.csv')

**Ratings Data File Structure (ratings.csv)**

Each line of this file after the header row represents one rating of one movie by one user, and has the following format: `userId,movieId,rating,timestamp`

- The lines within this file are ordered first by userId, then, within user, by movieId.
- Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
- Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
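
A minimal sketch of how the three properties listed above could be checked. It uses a small hand-built frame named `sample` (an illustrative stand-in, copied from the first rows of ratings.csv) rather than the full `ratings_df`:

```python
import pandas as pd

# Illustrative stand-in for ratings_df: the first three rows of ratings.csv.
sample = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [296, 306, 307],
    "rating": [5.0, 3.5, 5.0],
    "timestamp": [1147880044, 1147868817, 1147868828],
})

# Rows should already be ordered by userId, then by movieId.
sorted_sample = sample.sort_values(["userId", "movieId"]).reset_index(drop=True)
is_sorted = sample.reset_index(drop=True).equals(sorted_sample)

# Ratings should lie between 0.5 and 5.0 in half-star steps.
valid_scale = sample["rating"].between(0.5, 5.0) & ((sample["rating"] * 2) % 1 == 0)

# Timestamps are epoch seconds, so unit="s" is the matching conversion.
rated_at = pd.to_datetime(sample["timestamp"], unit="s", utc=True)

print(is_sorted, valid_scale.all())
print(rated_at.iloc[0])  # 2006-05-17 15:34:04+00:00
```

The same checks should carry over to the full frame by replacing `sample` with `ratings_df`, at the cost of a full sort over 25 million rows.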

In [3]:
ratings_df.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828


In [4]:
len(ratings_df)

25000095

**Tags Data File Structure (tags.csv)**  

All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format: `userId,movieId,tag,timestamp`

- The lines within this file are ordered first by userId, then, within user, by movieId.
- Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag are determined by each user.
- Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

In [5]:
tags_df.head(3)

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598


In [6]:
len(tags_df)

1093360

**Movies Data File Structure (movies.csv)**  

Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format: `movieId,title,genres`

- Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

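- Genres are a pipe-separated list, and are selected from the following:
  - Action
  - Adventure
  - Animation
  - Children's (stored as `Children` in movies.csv)
  - Comedy
  - Crime
  - Documentary
  - Drama
  - Fantasy
  - Film-Noir
  - Horror
  - Musical
  - Mystery
  - Romance
  - Sci-Fi
  - Thriller
  - War
  - Western
  - (no genres listed)

  The file also contains an `IMAX` genre that this list omits.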
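
Since the release year is embedded in the title and the genres are packed into one string, both usually need parsing before use. A minimal sketch on a toy frame named `movies` (an illustrative copy of the first three rows of movies.csv); the regular expression assumes the year is the last parenthesized group in the title, which may not hold for every entry:

```python
import pandas as pd

# Illustrative stand-in for movies_df: the first three rows of movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Year: four digits in parentheses at the end of the title.
# Titles without such a suffix yield a missing value instead of an error.
year = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year).astype("Int64")

# Genres: split the pipe-separated string into one list per movie.
movies["genre_list"] = movies["genres"].str.split("|")

print(movies[["movieId", "year", "genre_list"]])
```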

In [7]:
movies_df.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [8]:
len(movies_df)

62423

**Links Data File Structure (links.csv)**  

Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format: `movieId,imdbId,tmdbId`

- movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

- imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

- tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.
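
The URL patterns above can be rebuilt from the identifiers. The sketch below uses a toy frame named `links` (an illustrative copy of the first three rows of links.csv). Two points are assumptions rather than documented facts: the IMDb identifier is padded to seven digits, which matches the Toy Story example `tt0114709`, and `tmdbId` is treated as possibly missing, which would explain why pandas reads it as a float. The `tmdb_url` helper is hypothetical:

```python
import pandas as pd

# Illustrative stand-in for links_df: the first three rows of links.csv.
links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, 8844.0, 15602.0],
})

links["movielens_url"] = "https://movielens.org/movies/" + links["movieId"].astype(str)

# imdbId is parsed as an integer, so leading zeros are lost.
# Padding to 7 digits is an assumption based on the tt0114709 example.
links["imdb_url"] = (
    "http://www.imdb.com/title/tt" + links["imdbId"].astype(str).str.zfill(7) + "/"
)


def tmdb_url(tmdb_id):
    """Hypothetical helper: build a TMDB URL, or return None when the id is missing."""
    if pd.isna(tmdb_id):
        return None
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"


links["tmdb_url"] = links["tmdbId"].map(tmdb_url)

print(links.loc[0, ["movielens_url", "imdb_url", "tmdb_url"]].tolist())
```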

In [9]:
links_df.head(3)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0


In [10]:
len(links_df)

62423

**Tag Genome (genome-scores.csv and genome-tags.csv)**

This data set includes a current copy of the Tag Genome.

The tag genome is a data structure that contains *tag relevance scores* for movies. The structure is a dense matrix: each movie in the genome has a value for every tag in the genome.

As described in the article cited below, the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

Please include the following citation if referencing tag genome data:

Jesse Vig, Shilad Sen, and John Riedl. 2012. The Tag Genome: Encoding Community Knowledge to Support Novel Interaction. ACM Trans. Interact. Intell. Syst. 2, 3: 13:1–13:44. https://doi.org/10.1145/2362394.2362395

In [11]:
genome_scores_df.head(3)

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625


In [12]:
len(genome_scores_df)

15584448

In [13]:
genome_tags_df.head(3)

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century


In [14]:
len(genome_tags_df)

1128
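
Because the genome is described as dense, genome-scores.csv should hold exactly one row per (movie, tag) pair. With 15,584,448 score rows and 1,128 tags, that implies 15,584,448 / 1,128 = 13,816 movies in the genome. A minimal sketch of that arithmetic, plus a pivot into the movie-by-tag matrix, on a toy frame named `scores`; the relevance values for movie 2 are made up for illustration:

```python
import pandas as pd

# Toy stand-in for genome_scores_df: 2 movies x 3 tags, one row per pair.
scores = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 2],
    "tagId": [1, 2, 3, 1, 2, 3],
    # Movie 1 values come from the head() output above; movie 2 values are made up.
    "relevance": [0.02875, 0.02375, 0.0625, 0.04, 0.05, 0.06],
})

# Long format -> movie x tag matrix. pivot() raises if a pair occurs twice.
matrix = scores.pivot(index="movieId", columns="tagId", values="relevance")

# Dense means no (movie, tag) cell is missing.
is_dense = not matrix.isna().any().any()

# Row counts reported in this notebook: score rows = movies * tags.
n_score_rows, n_tags = 15584448, 1128
n_movies = n_score_rows // n_tags

print(matrix.shape, is_dense, n_movies)
```

Pivoting the full `genome_scores_df` this way would produce a 13,816 x 1,128 matrix, which should fit in memory far more comfortably than the merged ratings table.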

# Data Merging & Data Cleaning

![relationship](relationship.png)

In [15]:
df = ratings_df.merge(tags_df,on=['userId','movieId','timestamp'],how='left')
df.head(3)

Unnamed: 0,userId,movieId,rating,timestamp,tag
0,1,296,5.0,1147880044,
1,1,306,3.5,1147868817,
2,1,307,5.0,1147868828,
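
This left merge keeps every rating and attaches a tag only when the same user tagged the same movie in the very same second, so most rows are likely to end up with an empty `tag`. A hedged sketch of how the match rate could be measured with the `indicator` argument of `merge`, on toy frames named `ratings` and `tags`; the rating by user 3 is made up so that one row matches:

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 3],
    "movieId": [296, 306, 260],
    "rating": [5.0, 3.5, 4.0],  # the rating by user 3 is made up
    "timestamp": [1147880044, 1147868817, 1439472355],
})
tags = pd.DataFrame({
    "userId": [3, 3],
    "movieId": [260, 260],
    "tag": ["classic", "sci-fi"],
    "timestamp": [1439472355, 1439472256],
})

# indicator=True adds a _merge column: "both" for matched rows,
# "left_only" for ratings without a tag at the same timestamp.
merged = ratings.merge(
    tags, on=["userId", "movieId", "timestamp"], how="left", indicator=True
)
match_counts = merged["_merge"].value_counts()

print(match_counts)
```

Here only the `classic` tag matches, because the `sci-fi` tag was applied 99 seconds earlier than the rating. Merging on `userId` and `movieId` alone would be an alternative, at the price of one output row per tag.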


In [16]:
# convert integer timestamp (seconds since the Unix epoch, UTC) into a datetime
# the values are seconds, not milliseconds, so they must not be divided by 1e3
import datetime
df = df.copy()
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
df.head(3)

Unnamed: 0,userId,movieId,rating,timestamp,tag
0,1,296,5.0,2006-05-17 15:34:04,
1,1,306,3.5,2006-05-17 12:26:57,
2,1,307,5.0,2006-05-17 12:27:08,


In [17]:
# convert genres into multiple columns with binary values
movie_genres = movies_df.set_index('movieId').genres.str.split('|', expand=True).stack()
genres = pd.get_dummies(movie_genres, prefix='g').groupby(level=0).sum().reset_index()
genres.head(3)

Unnamed: 0,movieId,g_(no genres listed),g_Action,g_Adventure,g_Animation,g_Children,g_Comedy,g_Crime,g_Documentary,g_Drama,...,g_Film-Noir,g_Horror,g_IMAX,g_Musical,g_Mystery,g_Romance,g_Sci-Fi,g_Thriller,g_War,g_Western
0,1,0,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
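
For comparison, pandas also offers `Series.str.get_dummies`, which splits and one-hot encodes in a single call. A minimal sketch on a toy frame named `movies` (an illustrative copy of the first three rows of movies.csv); `genres_alt` is a hypothetical name and this is an alternative, not the method used in this notebook:

```python
import pandas as pd

# Illustrative stand-in for movies_df: the first three rows of movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split on "|" and build one 0/1 column per genre, prefixed with g_.
genres_alt = (
    movies.set_index("movieId")["genres"]
    .str.get_dummies(sep="|")
    .add_prefix("g_")
    .reset_index()
)

print(genres_alt)
```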


In [18]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,tag
0,1,296,5.0,2006-05-17 15:34:04,
1,1,306,3.5,2006-05-17 12:26:57,
2,1,307,5.0,2006-05-17 12:27:08,
3,1,665,5.0,2006-05-17 15:13:40,
4,1,899,3.5,2006-05-17 12:21:50,


In [19]:
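# attach the movie title; the raw pipe-separated genres string is dropped
# in favour of the one-hot g_* columns built above
df = df.merge(movies_df,on=['movieId'],how='left')
df.drop(['genres'],inplace=True,axis=1)
df = df.merge(genres,on=['movieId'],how='left')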
df.head(3)

Unnamed: 0,userId,movieId,rating,timestamp,tag,title,g_(no genres listed),g_Action,g_Adventure,g_Animation,...,g_Film-Noir,g_Horror,g_IMAX,g_Musical,g_Mystery,g_Romance,g_Sci-Fi,g_Thriller,g_War,g_Western
0,1,296,5.0,2006-05-17 15:34:04,,Pulp Fiction (1994),0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1,306,3.5,2006-05-17 12:26:57,,Three Colors: Red (Trois couleurs: Rouge) (1994),0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,307,5.0,2006-05-17 12:27:08,,Three Colors: Blue (Trois couleurs: Bleu) (1993),0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
df = df.merge(links_df,on=['movieId'],how='left')
df.head(3)

Unnamed: 0,userId,movieId,rating,timestamp,tag,title,g_(no genres listed),g_Action,g_Adventure,g_Animation,...,g_IMAX,g_Musical,g_Mystery,g_Romance,g_Sci-Fi,g_Thriller,g_War,g_Western,imdbId,tmdbId
0,1,296,5.0,2006-05-17 15:34:04,,Pulp Fiction (1994),0,0,0,0,...,0,0,0,0,0,1,0,0,110912,680.0
1,1,306,3.5,2006-05-17 12:26:57,,Three Colors: Red (Trois couleurs: Rouge) (1994),0,0,0,0,...,0,0,0,0,0,0,0,0,111495,110.0
2,1,307,5.0,2006-05-17 12:27:08,,Three Colors: Blue (Trois couleurs: Bleu) (1993),0,0,0,0,...,0,0,0,0,0,0,0,0,108394,108.0
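
Both `movies_df` and `links_df` hold one row per movie, so merging them on `movieId` should leave the number of rows unchanged. A minimal sketch of how that could be enforced with the `validate` argument of `merge`, on toy frames named `facts` and `lookup`; the rating by user 2 is made up for illustration:

```python
import pandas as pd

facts = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [296, 306, 296],
    "rating": [5.0, 3.5, 4.0],  # the rating by user 2 is made up
})
lookup = pd.DataFrame({
    "movieId": [296, 306],
    "title": [
        "Pulp Fiction (1994)",
        "Three Colors: Red (Trois couleurs: Rouge) (1994)",
    ],
})

# many_to_one: the right-hand keys must be unique, so no row can be duplicated.
enriched = facts.merge(lookup, on="movieId", how="left", validate="many_to_one")
same_length = len(enriched) == len(facts)

# A lookup table with a repeated movieId makes the same merge fail loudly.
dup_lookup = pd.concat([lookup, lookup.iloc[[0]]], ignore_index=True)
try:
    facts.merge(dup_lookup, on="movieId", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(same_length, raised)
```

On a frame with tens of millions of rows, a silent duplication in a lookup table is easy to miss, which is why an explicit check may be worth its small cost.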


In [21]:
df = df.merge(genome_tags_df,on=['tag'],how='left')
df.head(3)

Unnamed: 0,userId,movieId,rating,timestamp,tag,title,g_(no genres listed),g_Action,g_Adventure,g_Animation,...,g_Musical,g_Mystery,g_Romance,g_Sci-Fi,g_Thriller,g_War,g_Western,imdbId,tmdbId,tagId
0,1,296,5.0,2006-05-17 15:34:04,,Pulp Fiction (1994),0,0,0,0,...,0,0,0,0,1,0,0,110912,680.0,
1,1,306,3.5,2006-05-17 12:26:57,,Three Colors: Red (Trois couleurs: Rouge) (1994),0,0,0,0,...,0,0,0,0,0,0,0,111495,110.0,
2,1,307,5.0,2006-05-17 12:27:08,,Three Colors: Blue (Trois couleurs: Bleu) (1993),0,0,0,0,...,0,0,0,0,0,0,0,108394,108.0,
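
The merge on `tag` is an exact string match, so a user tag that differs from a genome tag only in case or surrounding whitespace gets no `tagId`. A hedged sketch of matching on a normalized key instead. The `tag_key` column, the `normalize` helper, the `Sci-Fi ` spelling variant, and the tag ids 10 and 20 are all hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical user tags, including a variant with different case and a trailing space.
user_tags = pd.DataFrame({"tag": ["classic", "Sci-Fi ", "dark comedy"]})

# Hypothetical genome tags; the ids are placeholders, not real genome ids.
genome = pd.DataFrame({"tagId": [10, 20], "tag": ["classic", "sci-fi"]})


def normalize(tag_series):
    """Hypothetical helper: trim whitespace and lowercase a Series of tags."""
    return tag_series.str.strip().str.lower()


user_tags["tag_key"] = normalize(user_tags["tag"])
genome["tag_key"] = normalize(genome["tag"])

# Match on the normalized key; tags absent from the genome keep a missing tagId.
matched = user_tags.merge(genome[["tag_key", "tagId"]], on="tag_key", how="left")

print(matched)
```

Whether normalization is appropriate depends on how the genome tags are spelled, which should be checked against `genome_tags_df` before relying on it.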


In [22]:
df = df.merge(genome_scores_df,on=['movieId','tagId'],how='left')
df.head(3)

Unnamed: 0,userId,movieId,rating,timestamp,tag,title,g_(no genres listed),g_Action,g_Adventure,g_Animation,...,g_Mystery,g_Romance,g_Sci-Fi,g_Thriller,g_War,g_Western,imdbId,tmdbId,tagId,relevance
0,1,296,5.0,2006-05-17 15:34:04,,Pulp Fiction (1994),0,0,0,0,...,0,0,0,1,0,0,110912,680.0,,
1,1,306,3.5,2006-05-17 12:26:57,,Three Colors: Red (Trois couleurs: Rouge) (1994),0,0,0,0,...,0,0,0,0,0,0,111495,110.0,,
2,1,307,5.0,2006-05-17 12:27:08,,Three Colors: Blue (Trois couleurs: Bleu) (1993),0,0,0,0,...,0,0,0,0,0,0,108394,108.0,,


# Data Exploration