# MovieLens Dataset
Source: https://files.grouplens.org/datasets/movielens/ml-25m-README.html

In [18]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## movies.csv
Variables:
- movieId (integer)
- title (string) -> the release year follows the title in parentheses, e.g. "Toy Story (1995)"
- genres (string) -> one or more genres separated by |, e.g. "Comedy|Romance"

In [2]:
movies = pd.read_csv("../ml-25m/movies.csv")
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


In [None]:
# strip leading and trailing whitespace from strings
movies["title"] = movies["title"].str.strip()
movies["genres"] = movies["genres"].str.strip()

: 

In [None]:
movies.isnull().values.any()

: 

### Movie release year
We extract the release year from the "title" column into a new "year" column. Movies whose title does not end with a year in parentheses get NaN in the "year" column.
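
The extraction rule can be tried on a handful of titles before running it on the full file. The sketch below is only an illustration: the first two titles appear in movies.csv, while the last two are made up to show the cases that yield NaN.

```python
import pandas as pd

# Hypothetical sample: the last two titles are invented edge cases.
sample = pd.DataFrame({"title": [
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Untitled Example",                # no year at all
    "Example 2001 (Director's Cut)",   # four digits, but not a trailing "(YYYY)"
]})

# Capture the four digits inside a trailing "(YYYY)"; non-matching rows become NaN.
sample["year"] = sample["title"].str.extract(r"\((\d{4})\)$", expand=False)

print(sample)
```

Note that the extracted year is a string, not a number, so it needs a conversion before any arithmetic.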

In [None]:
# extract the four-digit release year from a trailing "(YYYY)" in the title as a new column
# (raw string avoids invalid-escape warnings; expand=False returns a Series)
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)$", expand=False)

: 

In [None]:
movies["year"].unique()

: 

In [None]:
movies[movies["year"].isna()] # movies without a year in the title

: 

From here, we can obtain the number of movies corresponding to each release year.

In [None]:
movies["year"].value_counts()

: 

In [None]:
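# order the bars chronologically (the year values are strings)
sns.countplot(x="year", data=movies, order=sorted(movies["year"].dropna().unique()))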

: 

### Movie genres
Genres for this dataset: Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, IMAX, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

Movies without any genre carry the placeholder value "(no genres listed)".

In [None]:
movies[movies["genres"] == "(no genres listed)"]

: 

There are 5062 movies with no genres listed.

In [None]:
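# split the pipe-separated genre string into a list; movies with no genres are excluded here,
# so index alignment on assignment leaves their "genres" value as NaN
movies_with_genre = movies[movies["genres"] != "(no genres listed)"]
movies["genres"] = movies_with_genre["genres"].str.split("|")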

: 

The genres of each movie are separated by the delimiter "|". The code above splits each genre string on that delimiter and stores the result as a list. For movies with "(no genres listed)", the value becomes NaN, because those rows are missing from the filtered dataframe and the assignment aligns on the index. We can confirm this in the next two code blocks.
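
The index alignment that produces the NaN values is easy to miss, so here is a minimal sketch on three rows taken from the movies.csv preview. The names `sample`, `with_genre` and `split_genres` are introduced only for this illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "A Girl Thing (2001)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Comedy|Romance",
        "(no genres listed)",
    ],
})

# Keep only rows that carry real genres, then split each string into a list.
with_genre = sample[sample["genres"] != "(no genres listed)"]
split_genres = with_genre["genres"].str.split("|")

# Assignment aligns on the index: row 2 is absent from split_genres, so it becomes NaN.
sample["genres"] = split_genres
print(sample)
```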

In [None]:
movies

: 

In [None]:
movies[movies["genres"].isnull()] # movies with no genres listed; the same rows as before the genres column was converted

: 

From here, we explore the number of movies categorised into each genre.
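
One way to count genres when each row holds a list is `Series.explode` followed by `value_counts`. The sketch below uses made-up rows, including one NaN that stands for a movie with no genres listed.

```python
import pandas as pd

# Hypothetical rows: each entry is the genre list of one movie.
genres = pd.Series([
    ["Adventure", "Children", "Fantasy"],
    ["Comedy", "Romance"],
    ["Comedy"],
    float("nan"),  # a movie with no genres listed
])

# explode() emits one row per list element; value_counts() tallies each genre
# and drops NaN by default.
counts = genres.explode().value_counts().sort_index()
print(counts)
```

Because a movie can belong to several genres, the counts add up to more than the number of movies.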

In [None]:
# one row per (movie, genre) pair, then count; NaN rows (no genres listed) are dropped by value_counts
genre_count = movies["genres"].explode().value_counts().sort_index()

: 

In [None]:
genre_count

: 

In [None]:
genre_count.plot.bar()

: 

## ratings.csv
Variables:
- userId (integer)
- movieId (integer)
- rating (float)
- timestamp (integer) -> seconds since midnight UTC of January 1, 1970 (Unix time)
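
To make the timestamp format concrete, here is a small sketch that converts two Unix timestamps to datetimes. The values are illustrative and not taken from ratings.csv.

```python
import pandas as pd

# Illustrative Unix timestamps (seconds since 1970-01-01 00:00:00 UTC).
timestamps = pd.Series([0, 1_000_000_000])

# unit="s" tells pandas the integers are seconds; the result is timezone-naive UTC.
datetimes = pd.to_datetime(timestamps, unit="s")
years = datetimes.dt.year

print(datetimes.tolist())
print(years.tolist())
```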

In [None]:
ratings = pd.read_csv("../ml-25m/ratings.csv")
ratings

: 

In [None]:
ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s") # convert Unix timestamp (seconds) to datetime
ratings["year"] = ratings["datetime"].dt.year # extract the year each rating was made

: 

### Distribution of ratings by year

In [None]:
ratings["year"].value_counts()

: 

In [None]:
rating_year_spread = sns.countplot(x="year", data=ratings)
rating_year_spread.ticklabel_format(style="plain", axis="y") # full counts instead of scientific notation
rating_year_spread.bar_label(rating_year_spread.containers[0], rotation="vertical", padding=5)
rating_year_spread.tick_params(axis="x", rotation=45) # rotate year labels without resetting the ticks
plt.show()

: 

## genome-tags.csv
Variables:
- tagId (integer)
- tag (text)

In [None]:
genome_tags = pd.read_csv("../ml-25m/genome-tags.csv")
genome_tags

: 

## genome-scores.csv
Variables:
- movieId (integer)
- tagId (integer)
- relevance (float) -> score for relevance of tag to the movie

Each movie in the tag genome has a relevance score for every tag.
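
Whether every movie with genome scores really has a score for every tag can be checked by comparing the row count with the number of movie and tag combinations. The sketch below runs that check on a made-up miniature of the file (2 movies, 3 tags) and also reshapes it into a movie-by-tag matrix. All values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of genome-scores.csv: 2 movies x 3 tags, values made up.
scores = pd.DataFrame({
    "movieId":   [1, 1, 1, 2, 2, 2],
    "tagId":     [1, 2, 3, 1, 2, 3],
    "relevance": [0.9, 0.1, 0.5, 0.2, 0.8, 0.4],
})

n_movies = scores["movieId"].nunique()
n_tags = scores["tagId"].nunique()

# Dense means every (movie, tag) pair occurs exactly once.
has_duplicates = bool(scores.duplicated(["movieId", "tagId"]).any())
is_dense = len(scores) == n_movies * n_tags and not has_duplicates

# Wide layout: one row per movie, one column per tag.
matrix = scores.pivot(index="movieId", columns="tagId", values="relevance")

print(is_dense)
print(matrix)
```

The same check applied to `genome_scores` would use the real dataframe in place of `scores`.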

In [None]:
genome_scores = pd.read_csv("../ml-25m/genome-scores.csv")
genome_scores

: 

### Mean relevance score of each tag across all movies

In [None]:
tag_mean_scores = (
    genome_scores.groupby("tagId", as_index=False)["relevance"].mean() # mean relevance per tag
    .merge(genome_tags, how="inner", on="tagId") # attach the tag text
    .sort_values("relevance", ascending=False) # most relevant tags first
)
tag_mean_scores

: 

## tags.csv
How this file can be integrated into our project still needs to be explored.

In [None]:
tags = pd.read_csv("../ml-25m/tags.csv")
tags

: 

## links.csv

Variables:
- movieId
- imdbId
- tmdbId

Potential use of this file: merging with other datasets (TMDB or IMDb) to obtain more information about each movie
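
A merge along these lines might look as follows. This is a sketch under assumptions: the two sample dataframes only mimic the shape of movies.csv and links.csv, their ID values are examples, and the `imdb_tt` column name is invented for the illustration. It also assumes that IMDb title identifiers take the form "tt" plus a number zero-padded to at least seven digits, which pandas loses when it reads imdbId as an integer.

```python
import pandas as pd

# Illustrative rows in the shape of movies.csv and links.csv; ID values are examples only.
movies_sample = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
})
links_sample = pd.DataFrame({
    "movieId": [1, 2],
    "imdbId": [114709, 113497],
    "tmdbId": [862.0, 8844.0],
})

# Left merge keeps every movie, even one without a links entry.
merged = movies_sample.merge(links_sample, on="movieId", how="left")

# Assumption: restore the leading zeros dropped by the integer parse.
merged["imdb_tt"] = "tt" + merged["imdbId"].astype(str).str.zfill(7)

# A nullable integer type keeps tmdbId integral even if some values are missing.
merged["tmdbId"] = merged["tmdbId"].astype("Int64")

print(merged)
```

The resulting `imdb_tt` and `tmdbId` columns could then serve as join keys against an external IMDb or TMDB table.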

In [None]:
links = pd.read_csv("../ml-25m/links.csv")
links

: 