# IMDb Dataset Details

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. <br>
The first line in each file contains headers that describe what is in each column. <br>
A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:<br>

**title.akas.tsv.gz** - Contains the following information for titles:<br>
titleId (string) - a tconst, an alphanumeric unique identifier of the title<br>
ordering (integer) – a number to uniquely identify rows for a given titleId<br>
title (string) – the localized title<br>
region (string) - the region for this version of the title<br>
language (string) - the language of the title<br>
types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay".<br>
attributes (array) - Additional terms to describe this alternative title, not enumerated<br>
isOriginalTitle (boolean) – 0: not original title; 1: original title
<br>
<br>
**title.basics.tsv.gz** - Contains the following information for titles:<br>
tconst (string) - alphanumeric unique identifier of the title<br>
titleType (string) – the type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc.)<br>
primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release<br>
originalTitle (string) - original title, in the original language<br>
isAdult (boolean) - 0: non-adult title; 1: adult title<br>
startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year<br>
endYear (YYYY) – TV Series end year. ‘\N’ for all other title types<br>
runtimeMinutes – primary runtime of the title, in minutes<br>
genres (string array) – includes up to three genres associated with the title
<br><br>
**title.ratings.tsv.gz** – Contains the IMDb rating and votes information for titles<br>
tconst (string) - alphanumeric unique identifier of the title<br>
averageRating – weighted average of all the individual user ratings<br>
numVotes - number of votes the title has received<br>
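
The format rules above (tab-separated columns, a header row, ‘\N’ for missing values, and a comma-separated `genres` field) can be tried on a tiny in-memory sample before downloading the full files. This is a minimal sketch under stated assumptions: the two rows are made-up placeholders laid out like title.basics, and the names `sample_tsv`, `sample`, and `genre_list` are introduced here for illustration only.

```python
import io

import pandas as pd

# Illustrative placeholder rows in the title.basics column layout
# (not taken from the real file). "\\N" is a literal backslash followed by N.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tExample Short\tExample Short\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000002\tmovie\tExample Movie\tExample Movie\t0\t\\N\t\\N\t\\N\t\\N\n"
)

# na_values asks the parser to treat the \N marker as missing while reading.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# genres arrives as one comma-separated string; split it into a list per row.
# Rows with a missing genres value stay missing.
sample["genre_list"] = sample["genres"].str.split(",")

print(sample[["tconst", "startYear", "endYear", "genre_list"]])
```

Because the missing years become NaN, pandas stores `startYear` and `endYear` as floats in this sketch; whether to convert them to a nullable integer type is a separate choice.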

In [11]:
import pandas as pd
import numpy as np

In [2]:
basics_url="https://datasets.imdbws.com/title.basics.tsv.gz"
akas_url="https://datasets.imdbws.com/title.akas.tsv.gz"
ratings_url="https://datasets.imdbws.com/title.ratings.tsv.gz"

In [17]:
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
basics.replace({'\\N':np.nan},inplace=True)

In [None]:
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
akas.replace({'\\N':np.nan},inplace=True)

In [None]:
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
ratings.replace({'\\N':np.nan},inplace=True)
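
Since `tconst` identifies a title in both title.basics and title.ratings, the two tables can be combined on that column. The sketch below uses small hypothetical stand-in frames (`basics_demo`, `ratings_demo`, with invented values) rather than the downloaded data, and assumes a left join is wanted so that titles without ratings are kept.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the basics and ratings tables.
basics_demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Example A", "Example B", "Example C"],
})
ratings_demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003"],
    "averageRating": [5.7, 6.5],
    "numVotes": [2000, 1500],
})

# A left join keeps every title; titles with no ratings row get NaN.
titles_with_ratings = basics_demo.merge(ratings_demo, on="tconst", how="left")

print(titles_with_ratings)
```

The same call should apply to the full `basics` and `ratings` frames loaded above, though memory use on the complete datasets is not something this small example demonstrates.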