In our final project, we want to predict whether a new release will succeed and how much revenue it could generate. We did not find any dataset that met our requirements, so we decided to build our own. Here is the plan:

- Download the IMDb dataset files (the IMDb "interfaces"): http://www.imdb.com/interfaces
- Reduce the dataset to only `Movies` and set a year range from 1980 to now.
- Convert `IMDBId` to `TMDBId`. 
- Use https://www.themoviedb.org/documentation/api to build the final dataset.

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sys
sys.path.append('../source/')

import helpers
import json 

### IMDB Dataset

The IMDb interface file lists every title available on IMDb. Before building our dataset, let's get a quick overview of it and pull out some useful information.

In [2]:
title_basics = pd.read_csv("../data/pre-processed/title_basics.tsv", sep='\t')



In [3]:
title_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
2,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
3,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
4,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
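The `\N` values in the `endYear` column above are IMDb's marker for missing data. As a minimal sketch (the in-memory sample and the `tt9999999` row are made up for illustration, and this is not how the notebook loads the file), the marker can be parsed as `NaN` at load time with `na_values`:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for title_basics.tsv (illustrative rows only;
# the second row is hypothetical)
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\t1\n"
    "tt9999999\tmovie\tHypothetical Title\t\\N\t\\N\t\\N\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values=r"\N",  # raw string: a plain "\N" is not a valid Python escape
    dtype={"tconst": str, "titleType": str, "primaryTitle": str},
)

print(sample.dtypes)
print(sample.isna().sum())
```

With the marker handled at load time, the year and runtime columns come back as numeric columns with `NaN` for missing entries instead of as strings.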


In [4]:
title_basics.shape

(10894425, 9)

In [5]:
title_basics.titleType.unique()

array(['short', 'movie', 'tvMovie', 'tvSeries', 'tvEpisode', 'tvShort',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame'], dtype=object)

Almost 11 million items! Luckily, they are not all movies: the file also contains shorts, TV shows, video games and more. Before converting IMDB ids to TMDB ids, we need to filter the dataset down to movies only.

In [6]:
# .copy() gives an independent frame, so later column assignments do not act on a slice
title_basics = title_basics[title_basics.titleType == "movie"].copy()

In [7]:
title_basics.shape

(862220, 9)

We removed a huge amount of data we won't use. We should filter not only by title type but also by time frame: movies from 1940 won't tell us much, because the movie industry has changed a lot since then. We will set a year range from 1980 to 2020. To do so, we first need to convert the `startYear` column to numeric.

In [8]:
title_basics['startYear'] = pd.to_numeric(title_basics['startYear'], errors='coerce')
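To make the effect of `errors='coerce'` concrete, here is a small sketch on a made-up series (the values are illustrative, not taken from the dataset). Anything that cannot be parsed as a number, including IMDb's `\N` marker, becomes `NaN`:

```python
import pandas as pd

# Illustrative raw values: two valid years, the IMDb missing marker, and junk
raw_years = pd.Series(["1894", "2019", r"\N", "unknown"])

# errors="coerce" replaces unparseable entries with NaN instead of raising
years = pd.to_numeric(raw_years, errors="coerce")

print(years)
```

Because `NaN` is a float, the resulting column is stored as `float64`, which is why year values can show up with a trailing `.0`.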

Since we eventually want to predict the outcome of movies released in the coming years, it's time to split our dataset in two: the first part covers releases up to 2020 (to test our model) and the second part covers releases after 2020 (the ones to predict). Note that the first filter uses a strict `> 1980`, so that part effectively starts in 1981.

In [9]:
title_basics_after_2020 = title_basics[(title_basics["startYear"] > 2020) & (title_basics["startYear"] <= 2030)]

In [10]:
title_basics_before_2020 = title_basics[(title_basics["startYear"] > 1980.0) & (title_basics["startYear"] <= 2020)]

In [11]:
title_basics_after_2020.shape, title_basics_before_2020.shape

((2726, 9), (508702, 9))
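The boundary behaviour of these two filters is easy to check on a toy frame. The sketch below uses made-up years only; it shows that 1980 itself is excluded, 2020 lands in the first split, and rows whose year was coerced to `NaN` fall into neither split, since comparisons with `NaN` are always `False`:

```python
import numpy as np
import pandas as pd

# Illustrative years around each boundary, plus one missing value
years = pd.DataFrame({"startYear": [1979, 1980, 1981, 2020, 2021, 2031, np.nan]})

# Same conditions as the notebook cells, applied to the toy frame
before = years[(years["startYear"] > 1980) & (years["startYear"] <= 2020)]
after = years[(years["startYear"] > 2020) & (years["startYear"] <= 2030)]

print(before["startYear"].tolist())
print(after["startYear"].tolist())
```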

To summarize, starting from an IMDB interface with almost 11 million items, we ended up with 508,702 movies released after 1980 and up to 2020. The next step is to convert each IMDB id to its TMDB id, so let's export the two datasets.

In [12]:
title_basics_before_2020.to_csv("../data/processed/dataset_builder/title_basics_before_2020.csv")

In [13]:
title_basics_after_2020.to_csv("../data/processed/dataset_builder/title_basics_after_2020.csv")
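One detail worth knowing about these exports: `to_csv` writes the DataFrame index by default, and reading the file back with a plain `read_csv` turns that index into an extra `Unnamed: 0` column. The sketch below, on two illustrative rows and an in-memory buffer instead of a real file, compares the default with `index=False`:

```python
import io
import pandas as pd

# Two illustrative rows standing in for the exported frame
df = pd.DataFrame({"tconst": ["tt0000001", "tt0000002"], "startYear": [1894, 1892]})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False keeps only the real columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Either passing `index=False` on export or `index_col=0` on import avoids carrying the extra column into later notebooks.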

### TMDB Dataset

The IMDB interface was very useful to get a quick overview of the amount of data we are going to use. Sadly, it doesn't contain as much information as we need to test our model. Since IMDB does not have an open API, we will convert each IMDB id to its TMDB id and fetch the details from TMDB instead.

To do so, we created a Python script that uses a pool of threads to run the requests concurrently. We first tried issuing the requests one at a time, but that took almost 6 hours to complete.
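The script itself is not shown here, so the following is only a sketch of the thread-pool idea, not the actual `tmdb_retriever.py`. The names `fetch_tmdb_id`, `convert_ids` and `FAKE_TMDB_INDEX` are hypothetical, and the lookup is a local stub rather than a real HTTP call, so the example runs offline. In a real run, `fetch_tmdb_id` would query the TMDB API for each IMDB id.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the remote service: maps IMDB ids to TMDB ids.
# The two pairs are taken from the movie preview later in this notebook.
FAKE_TMDB_INDEX = {"tt0080495": 421114, "tt0082367": 91817}


def fetch_tmdb_id(imdb_id):
    """Return the TMDB id for an IMDB id, or None when there is no match.

    Stub only: a real implementation would perform an HTTP request here.
    """
    return FAKE_TMDB_INDEX.get(imdb_id)


def convert_ids(imdb_ids, max_workers=8):
    """Look up many ids concurrently and collect the results in a dict."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the id it was submitted for
        futures = {pool.submit(fetch_tmdb_id, imdb_id): imdb_id for imdb_id in imdb_ids}
        for future in as_completed(futures):
            imdb_id = futures[future]
            try:
                results[imdb_id] = future.result()
            except Exception:
                # A failed request is recorded as "no id" instead of aborting the run
                results[imdb_id] = None
    return results


converted = convert_ids(["tt0080495", "tt0082367", "tt0000000"])
print(converted)
```

Threads suit this workload because each request spends most of its time waiting on the network, so many lookups can be in flight at once.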

The output is a `JSON` file located at `../data/processed/json/tmdb_id_list.json`.

- `python3 ../source/tmdb_retriever.py`

[...] ~3600 seconds later...

In [14]:
json_path = "../data/processed/json/tmdb_id_list.json"

In [15]:
with open(json_path) as json_file:
    data = json.load(json_file)

# Convert the retriever output into a list of TMDB ids, then wrap it in a DataFrame
tmdb_ids = helpers.convert_output_id(data)
tmdb_ids_df = pd.DataFrame(tmdb_ids, columns=["tmdb_id"])

In [16]:
tmdb_ids_df.shape

(251039, 1)

Out of 508,702 IMDB ids, we obtained 251,039 TMDB ids, roughly 49% of the original set. An id can be lost for several reasons: it was not found on TMDB, there was a connection error, or another exception occurred.

Time to export the final result to the processed folder! We are now ready to retrieve all the data for each id.

In [17]:
tmdb_ids_df.to_csv("../data/processed/dataset_builder/tmdb_ids.csv")

### Movie Dataset

Now it is time to retrieve all movie information from our 251,039 ids.

We created a second Python script to request all the information. The output is a `JSON` file located at `../data/processed/json/tmdb_movie_list.json`.

- `python3 ../source/tmdb_movies.py`

In [18]:
movie_json_path = "../data/processed/json/tmdb_movie_list.json"

We created a helper to remove any badly formatted requests.

In [19]:
movie_json = helpers.get_transformed_json(movie_json_path)
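The helper's code is not part of this notebook. As a purely illustrative sketch (not the actual `helpers.get_transformed_json`), one possible cleaning policy is to keep only dictionary records that carry a movie `id` and no error fields. The function name `drop_bad_records`, the record layout and the error payload below are assumptions:

```python
def drop_bad_records(records):
    """Keep only dict records that look like successful movie payloads."""
    clean = []
    for record in records:
        if not isinstance(record, dict):
            continue  # e.g. None left behind by a failed request
        if "status_code" in record or "id" not in record:
            continue  # error payload or incomplete response
        clean.append(record)
    return clean


# Illustrative input: two movie records, one error payload, one empty slot
raw = [
    {"id": 421114, "imdb_id": "tt0080495", "title": "La capilla ardiente"},
    {"status_code": 34, "status_message": "The resource you requested could not be found."},
    None,
    {"id": 91817, "imdb_id": "tt0082367", "title": "Fear No Evil"},
]

clean = drop_bad_records(raw)
print(len(clean))
```

The real helper may apply a different rule, for example keeping error payloads so they can be inspected later.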

In [20]:
movies_df = pd.DataFrame(movie_json)

In [21]:
movies_df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,credits,status_code,status_message
0,False,,,0.0,"[{'id': 80, 'name': 'Crime'}, {'id': 9648, 'na...",,421114.0,tt0080495,es,La capilla ardiente,...,"[{'iso_639_1': 'es', 'name': 'Español'}]",Released,,La capilla ardiente,False,3.0,3.0,"{'cast': [{'cast_id': 0, 'character': 'Ángel',...",,
1,False,,,840000.0,"[{'id': 27, 'name': 'Horror'}]",,91817.0,tt0082367,en,Fear No Evil,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Alexandria High… class of '81 - All the studen...,Fear No Evil,False,4.4,14.0,"{'cast': [{'cast_id': 2, 'character': 'Andrew ...",,
2,False,,"{'id': 184977, 'name': 'Shaolin Temple Collect...",0.0,"[{'id': 28, 'name': 'Action'}]",,10275.0,tt0079891,cn,少林寺,...,"[{'iso_639_1': 'cn', 'name': '广州话 / 廣州話'}, {'i...",Released,,The Shaolin Temple,False,7.1,53.0,"{'cast': [{'cast_id': 7, 'character': 'Gong Yu...",,
3,False,,,0.0,[],,270810.0,tt0080311,en,...Maybe This Time,...,[],Released,,...Maybe This Time,False,0.0,0.0,"{'cast': [{'cast_id': 0, 'character': 'Fran', ...",,
4,False,/fpB6mNdhTG8vX3vjPLHGO6lKbiF.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}]",,301845.0,tt0082047,es,Barcelona sur,...,"[{'iso_639_1': 'es', 'name': 'Español'}]",Released,,Barcelona sur,False,5.0,2.0,"{'cast': [{'cast_id': 0, 'character': 'Gumer',...",,


In [22]:
len(movies_df)

250951

Out of 251,039 ids, we retrieved 250,951 movies, so only 88 records were lost. A good result!

This is the last step of the notebook: once the export below has run, the dataset is built. Please continue with the next notebook, `2.1.Pre_transformation.ipynb`, to pre-transform the dataset.

In [23]:
movies_df.to_csv("../data/processed/dataset_builder/movies_list.csv", sep=',', encoding='utf-8')