# IMDB Datasets pipeline to TVShow Dataset

This is my pipeline for creating a new TV show dataset that makes analysis easier, without having to recombine the source files each time, which is very CPU-intensive on a Core i3 or on Kaggle.

# IMDB Non-Commercial Datasets
IMDb makes subsets of its data available for personal and non-commercial use. You can hold local copies of this data, subject to IMDb's terms and conditions. Please refer to the [Non-Commercial Licensing](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1ed1aea6-d2ad-4705-95fd-ba13f1b5014f&pf_rd_r=XRE3QWF2G5YWTD2SGT0V&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1) and [copyright/license](http://www.imdb.com/Copyright?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1ed1aea6-d2ad-4705-95fd-ba13f1b5014f&pf_rd_r=XRE3QWF2G5YWTD2SGT0V&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk2) pages and verify compliance.

# Data Location
The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

# IMDB Dataset Details
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name.
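
As a minimal sketch of what this format implies for loading, the block below builds a tiny gzipped TSV in memory and reads it with pandas. The sample rows and the names `sample_tsv` and `sample` are made up for illustration. Passing `quoting=csv.QUOTE_NONE` rests on the assumption that the files contain no quoted fields.

```python
import csv
import gzip
import io

import pandas as pd

# Hypothetical two-row sample in the IMDb layout:
# header line, tab-separated fields, '\N' for missing values.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\n"
    "tt0000001\ttvSeries\tExample Show\t2001\t\\N\n"
    "tt0000002\ttvEpisode\tExample Pilot\t\\N\t\\N\n"
)
compressed = gzip.compress(sample_tsv.encode("utf-8"))

sample = pd.read_csv(
    io.BytesIO(compressed),
    sep="\t",
    compression="gzip",
    encoding="utf-8",
    na_values="\\N",         # treat the IMDb null marker as NaN
    keep_default_na=False,   # keep strings such as "NA" as real values
    quoting=csv.QUOTE_NONE,  # assumption: fields are never quoted
)
print(sample)
```

Declaring the null marker at read time means no separate `.replace('\\N', ...)` pass is needed afterwards, although columns with missing values are then read as floats or objects rather than integers.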

In [None]:
import numpy as np 
import pandas as pd 

# Input data files are available in the read-only "../input/" directory

#main media title dataset
tbasics_file = "../input/imdb-basic-dataset/title.basics.tsv/data.tsv"
# TV show link table
episode_file = "../input/imdb-basic-dataset/title.episode.tsv/data.tsv"
# ratings
ratings_file = "../input/imdb-basic-dataset/title.ratings.tsv/data.tsv"


# imdb-basic-dataset/title.basics.tsv/data.tsv
- **tconst (string)** - alphanumeric unique identifier of the title
- **titleType (string)** – the type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc)
- **primaryTitle (string)** – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- **originalTitle (string)** - original title, in the original language
- **isAdult (boolean)** - 0: non-adult title; 1: adult title
- **startYear (YYYY)** – represents the release year of a title. In the case of TV Series, it is the series start year
- **endYear (YYYY)** – TV Series end year. ‘\N’ for all other title types
- **runtimeMinutes** – primary runtime of the title, in minutes
- **genres (string array)** – includes up to three genres associated with the title

In [None]:
tbasics = pd.read_csv(tbasics_file, sep='\t', low_memory=False)
tbasics

# imdb-basic-dataset/title.ratings.tsv/data.tsv
* **tconst (string)** - alphanumeric unique identifier of the title
* **averageRating** – weighted average of all the individual user ratings
* **numVotes** - number of votes the title has received

In [None]:
ratings = pd.read_csv(ratings_file, sep='\t')
ratings

# imdb-basic-dataset/title.episode.tsv/data.tsv

* **tconst (string)** - alphanumeric identifier of episode
* **parentTconst (string)** - alphanumeric identifier of the parent TV Series
* **seasonNumber (integer)** – season number the episode belongs to
* **episodeNumber (integer)** – episode number of the tconst in the TV series

In [None]:
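# Read every column as a string so season/episode numbers can be zero-padded later;
# missing values ('\N') become the string '0'
episode = pd.read_csv(episode_file, sep='\t', dtype=str).rename(columns={
            "seasonNumber": "S",
            "episodeNumber": "E"
        }).replace('\\N', '0')
episode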

# Join the Titles with the Ratings

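Every title record gets its average rating and number of votes. Titles without a rating get NaN in both columns, and the remaining '\N' placeholders are replaced with NaN in the same step.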

In [None]:
tbasics = tbasics.join(ratings.set_index('tconst'), on="tconst").replace('\\N', np.nan)
tbasics
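
To illustrate what this join does to unrated titles, here is a small sketch on made-up data. The `toy_` frames and their values are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical titles: tt2 has no entry in the ratings table.
toy_titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
toy_ratings = pd.DataFrame({
    "tconst": ["tt1", "tt3"],
    "averageRating": [8.1, 6.4],
    "numVotes": [120, 15],
})

# DataFrame.join is a left join by default: every title is kept,
# and titles without a rating get NaN in the rating columns.
toy_joined = toy_titles.join(toy_ratings.set_index("tconst"), on="tconst")
print(toy_joined)
```

Because of the NaN values, `numVotes` comes back as a float column rather than an integer one.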

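# Build the TVShow DataFrame
- joins the records in the *tbasics* DataFrame to the *episode* DataFrame twice: once on *parentTconst* for the series title, and once on *tconst* for the episode's own details
- conflicting columns are renamed before joining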

In [None]:
df = episode.join(tbasics.drop([
            "endYear",
            'isAdult',
            "numVotes",
            "genres",
            "averageRating",
            "titleType",
            "runtimeMinutes",
            "startYear"
        ], axis=1).rename(columns={
            "tconst": "parentTconst",
            "primaryTitle": "TVShow"
        }).set_index('parentTconst'), on="parentTconst").join(tbasics.rename(columns={
            "primaryTitle": "Episode",
            "originalTitle": "originalEpisode",
            "startYear": "year",
            "runtimeMinutes": "minutes"
        }).set_index('tconst'), on="tconst", rsuffix='r_').drop(columns=[
            'titleType', 
            'originalTitle', 
            'isAdult',
            "originalEpisode", 
            "genres", 
            "endYear"
        ]).drop(['parentTconst'], axis=1)
df
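
The same two lookups can be written as a chain of `merge` calls, which some readers may find easier to follow. This is only a sketch on made-up data, not a drop-in replacement for the cell above. The `toy_` frames, `shows`, and `episodes` are hypothetical names, and the toy basics table carries only the columns that survive in the final result.

```python
import pandas as pd

toy_episode = pd.DataFrame({
    "tconst": ["tt10", "tt11"],
    "parentTconst": ["tt1", "tt1"],
    "S": ["1", "1"],
    "E": ["1", "2"],
})
toy_basics = pd.DataFrame({
    "tconst": ["tt1", "tt10", "tt11"],
    "primaryTitle": ["Example Show", "Pilot", "Second One"],
    "startYear": ["2001", "2001", "2001"],
    "runtimeMinutes": ["45", "44", "46"],
    "averageRating": [8.0, 7.5, 7.9],
    "numVotes": [1000, 50, 40],
})

# Lookup 1: the series title, found through parentTconst.
shows = toy_basics[["tconst", "primaryTitle"]].rename(
    columns={"tconst": "parentTconst", "primaryTitle": "TVShow"}
)
# Lookup 2: the episode's own title, year, runtime and ratings, found through tconst.
episodes = toy_basics.rename(
    columns={"primaryTitle": "Episode", "startYear": "year", "runtimeMinutes": "minutes"}
)

toy_df = (
    toy_episode
    .merge(shows, on="parentTconst", how="left")
    .merge(episodes, on="tconst", how="left")
    .drop(columns=["parentTconst"])
)
print(toy_df)
```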

# Update the DataFrame for future analysis


In [None]:

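# Convert to nullable integers; missing or unparseable values become <NA>
df['year'] = pd.to_numeric(df['year'], errors='coerce').astype('Int64')
df['minutes'] = pd.to_numeric(df['minutes'], errors='coerce').astype('Int64')
# Prefix each episode title with a zero-padded SxxEyy label
df['Episode'] = 'S' + df['S'].astype(str).str.zfill(2) + 'E' + df['E'].astype(str).str.zfill(2) + ' ' + df['Episode']
df['S'] = pd.to_numeric(df['S'], errors='coerce').astype('Int64')
df['E'] = pd.to_numeric(df['E'], errors='coerce').astype('Int64')
df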
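
A small illustration of the two transformations in this step, using made-up values: building the zero-padded `SxxEyy` label, and converting a column with gaps to integers. The names `toy`, `toy_year`, and `as_int` are hypothetical. Using the nullable `Int64` dtype is one possible choice, not the only one.

```python
import pandas as pd

toy = pd.DataFrame({
    "S": ["1", "12"],
    "E": ["3", "10"],
    "Episode": ["Pilot", "Finale"],
})
# str.zfill pads on the left with zeros up to the given width.
toy["Episode"] = "S" + toy["S"].str.zfill(2) + "E" + toy["E"].str.zfill(2) + " " + toy["Episode"]
print(toy["Episode"].tolist())

# A year column with a missing value: plain int64 cannot hold NaN,
# while the nullable Int64 dtype stores the gap as <NA>.
toy_year = pd.Series(["2001", None, "1999"])
as_int = pd.to_numeric(toy_year, errors="coerce").astype("Int64")
print(as_int)
```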

# Save the new DataFrame to a new TSV file

In [None]:
df.to_csv('TVShows.tsv', sep='\t', na_rep='\\N', header=True, index=False, index_label=None, errors='strict')
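
When the saved file is loaded again for analysis, the `\N` marker has to be declared as a null value, otherwise it is read back as a literal string. A minimal round-trip sketch follows, using an in-memory buffer and made-up rows in place of the real `TVShows.tsv`.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt10", "tt11"],
    "TVShow": ["Example Show", "Example Show"],
    "averageRating": [7.5, None],
})

# Write with the same separator and null marker as the cell above.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", na_rep="\\N", index=False)

# Read back, mapping '\N' to NaN again.
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="\t", na_values="\\N")
print(reloaded)
```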