We have already retrieved the final dataset from the TMDB API, but only as raw JSON. It needs some pre-processing before we can explore it.

### Imports

In [1]:
import pandas as pd
import numpy as np

import sys
sys.path.append('../source/')

import helpers

### Pre-transformation

In [2]:
movies_df = pd.read_csv("../data/processed/movies_list.csv", index_col=0)



In [3]:
movies_df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,credits,status_code,status_message
0,False,,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 9648, 'na...",,421114.0,tt0080495,es,La capilla ardiente,...,"[{'iso_639_1': 'es', 'name': 'Español'}]",Released,,La capilla ardiente,False,3.0,3.0,"{'cast': [{'cast_id': 0, 'character': 'Ángel',...",,
1,False,,,840000,"[{'id': 27, 'name': 'Horror'}]",,91817.0,tt0082367,en,Fear No Evil,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Alexandria High… class of '81 - All the studen...,Fear No Evil,False,4.4,14.0,"{'cast': [{'cast_id': 2, 'character': 'Andrew ...",,
2,False,,"{'id': 184977, 'name': 'Shaolin Temple Collect...",0,"[{'id': 28, 'name': 'Action'}]",,10275.0,tt0079891,cn,少林寺,...,"[{'iso_639_1': 'cn', 'name': '广州话 / 廣州話'}, {'i...",Released,,The Shaolin Temple,False,7.1,53.0,"{'cast': [{'cast_id': 7, 'character': 'Gong Yu...",,
3,False,,,0,[],,270810.0,tt0080311,en,...Maybe This Time,...,[],Released,,...Maybe This Time,False,0.0,0.0,"{'cast': [{'cast_id': 0, 'character': 'Fran', ...",,
4,False,/fpB6mNdhTG8vX3vjPLHGO6lKbiF.jpg,,0,"[{'id': 18, 'name': 'Drama'}]",,301845.0,tt0082047,es,Barcelona sur,...,"[{'iso_639_1': 'es', 'name': 'Español'}]",Released,,Barcelona sur,False,5.0,2.0,"{'cast': [{'cast_id': 0, 'character': 'Gumer',...",,


As we can see, the `genres`, `production_companies`, `production_countries` and `spoken_languages` columns, along with the `cast` and `crew` entries nested inside `credits`, still contain JSON-like formatting. The purpose of this notebook is not to detect NaNs or outliers but to pre-transform the dataset so that it is as accurate as possible for the next steps.

### Genres

In [4]:
movies_df.genres

0         [{'id': 80, 'name': 'Crime'}, {'id': 9648, 'na...
1                            [{'id': 27, 'name': 'Horror'}]
2                            [{'id': 28, 'name': 'Action'}]
3                                                        []
4                             [{'id': 18, 'name': 'Drama'}]
                                ...                        
250946                  [{'id': 99, 'name': 'Documentary'}]
250947    [{'id': 36, 'name': 'History'}, {'id': 10752, ...
250948                  [{'id': 99, 'name': 'Documentary'}]
250949    [{'id': 99, 'name': 'Documentary'}, {'id': 18,...
250950                       [{'id': 35, 'name': 'Comedy'}]
Name: genres, Length: 4101750, dtype: object

In [5]:
helpers.from_json_to_array(movies_df, "genres", "name': '([^,]+)'}")

In [6]:
movies_df.genres

0         [Crime, Mystery, Drama, Thriller, Horror]
1                                          [Horror]
2                                          [Action]
3                                                []
4                                           [Drama]
                            ...                    
250946                                [Documentary]
250947                               [History, War]
250948                                [Documentary]
250949                         [Documentary, Drama]
250950                                     [Comedy]
Name: genres, Length: 4101750, dtype: object
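
The implementation of `helpers.from_json_to_array` is not shown in this notebook. Judging only from how it is called and from the output above, its per-cell work could look roughly like the sketch below. The function name `extract_matches` and the handling of missing cells are assumptions for illustration, not the actual helper.

```python
import re


def extract_matches(cell, pattern):
    """Return every capture of `pattern` found in the text form of `cell`.

    Hypothetical stand-in for the per-cell work of
    helpers.from_json_to_array; the real helper may differ.
    """
    # A missing cell read by pandas is a float NaN. NaN is the only value
    # that is not equal to itself, so this check needs no extra imports.
    if cell is None or cell != cell:
        return []
    # str() lets the same function accept raw strings or parsed lists.
    return re.findall(pattern, str(cell))


# Same pattern as the genres cell above.
genres_pattern = r"name': '([^,]+)'}"
genres_cell = "[{'id': 80, 'name': 'Crime'}, {'id': 9648, 'name': 'Mystery'}]"

genres = extract_matches(genres_cell, genres_pattern)
empty = extract_matches("[]", genres_pattern)
missing = extract_matches(float("nan"), genres_pattern)
print(genres, empty, missing)
```

On a DataFrame this could be applied with something like `movies_df["genres"].apply(lambda c: extract_matches(c, genres_pattern))`. One limitation of the pattern itself: `[^,]+` cannot span a comma, so a name that contains a comma would not be captured.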

### Production Companies

In [7]:
movies_df.production_companies

0         [{'id': 82343, 'logo_path': None, 'name': 'Pel...
1         [{'id': 43628, 'logo_path': None, 'name': 'LaL...
2                                                        []
3                                                        []
4         [{'id': 37786, 'logo_path': None, 'name': 'Fig...
                                ...                        
250946                                                   []
250947                                                   []
250948                                                   []
250949                                                   []
250950    [{'id': 90562, 'logo_path': '/qII3jJQ4S32FgJRl...
Name: production_companies, Length: 4101750, dtype: object

In [8]:
helpers.from_json_to_array(movies_df, "production_companies", "id': ([^,]+)")

In [9]:
movies_df.production_companies

0                                 [82343]
1                                 [43628]
2                                      []
3                                      []
4                    [37786, 37787, 1660]
                       ...               
250946                                 []
250947                                 []
250948                                 []
250949                                 []
250950    [90562, 8906, 7981, 356, 11773]
Name: production_companies, Length: 4101750, dtype: object

### Production Countries

In [10]:
movies_df.production_countries

0         [{'iso_3166_1': 'MX', 'name': 'Mexico'}, {'iso...
1         [{'iso_3166_1': 'US', 'name': 'United States o...
2         [{'iso_3166_1': 'CN', 'name': 'China'}, {'iso_...
3                                                        []
4         [{'iso_3166_1': 'MX', 'name': 'Mexico'}, {'iso...
                                ...                        
250946             [{'iso_3166_1': 'RU', 'name': 'Russia'}]
250947              [{'iso_3166_1': 'CN', 'name': 'China'}]
250948             [{'iso_3166_1': 'RU', 'name': 'Russia'}]
250949                                                   []
250950             [{'iso_3166_1': 'FR', 'name': 'France'}]
Name: production_countries, Length: 4101750, dtype: object

In [11]:
helpers.from_json_to_array(movies_df, "production_countries", "iso_3166_1': '([^,]+)'")

In [12]:
movies_df.production_countries

0         [MX, ES]
1             [US]
2         [CN, HK]
3               []
4         [MX, ES]
            ...   
250946        [RU]
250947        [CN]
250948        [RU]
250949          []
250950        [FR]
Name: production_countries, Length: 4101750, dtype: object

### Spoken Languages

In [13]:
movies_df.spoken_languages

0                  [{'iso_639_1': 'es', 'name': 'Español'}]
1                  [{'iso_639_1': 'en', 'name': 'English'}]
2         [{'iso_639_1': 'cn', 'name': '广州话 / 廣州話'}, {'i...
3                                                        []
4                  [{'iso_639_1': 'es', 'name': 'Español'}]
                                ...                        
250946             [{'iso_639_1': 'ru', 'name': 'Pусский'}]
250947                 [{'iso_639_1': 'zh', 'name': '普通话'}]
250948             [{'iso_639_1': 'ru', 'name': 'Pусский'}]
250949             [{'iso_639_1': 'es', 'name': 'Español'}]
250950            [{'iso_639_1': 'fr', 'name': 'Français'}]
Name: spoken_languages, Length: 4101750, dtype: object

In [14]:
helpers.from_json_to_array(movies_df, "spoken_languages", "iso_639_1': '([^,]+)'")

In [15]:
movies_df.spoken_languages

0             [es]
1             [en]
2         [cn, zh]
3               []
4             [es]
            ...   
250946        [ru]
250947        [zh]
250948        [ru]
250949        [es]
250950        [fr]
Name: spoken_languages, Length: 4101750, dtype: object

### Credits

In [16]:
movies_df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,credits,status_code,status_message
0,False,,,0,"[Crime, Mystery, Drama, Thriller, Horror]",,421114.0,tt0080495,es,La capilla ardiente,...,[es],Released,,La capilla ardiente,False,3.0,3.0,"{'cast': [{'cast_id': 0, 'character': 'Ángel',...",,
1,False,,,840000,[Horror],,91817.0,tt0082367,en,Fear No Evil,...,[en],Released,Alexandria High… class of '81 - All the studen...,Fear No Evil,False,4.4,14.0,"{'cast': [{'cast_id': 2, 'character': 'Andrew ...",,
2,False,,"{'id': 184977, 'name': 'Shaolin Temple Collect...",0,[Action],,10275.0,tt0079891,cn,少林寺,...,"[cn, zh]",Released,,The Shaolin Temple,False,7.1,53.0,"{'cast': [{'cast_id': 7, 'character': 'Gong Yu...",,
3,False,,,0,[],,270810.0,tt0080311,en,...Maybe This Time,...,[],Released,,...Maybe This Time,False,0.0,0.0,"{'cast': [{'cast_id': 0, 'character': 'Fran', ...",,
4,False,/fpB6mNdhTG8vX3vjPLHGO6lKbiF.jpg,,0,[Drama],,301845.0,tt0082047,es,Barcelona sur,...,[es],Released,,Barcelona sur,False,5.0,2.0,"{'cast': [{'cast_id': 0, 'character': 'Gumer',...",,


To pre-transform the `credits` column, we first split it into two new columns: `cast` and `crew`.

In [None]:
helpers.split_credits_column(movies_df)
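
`helpers.split_credits_column` is not shown either. A minimal sketch of the idea, assuming each `credits` cell is the string form of a Python dict with `cast` and `crew` keys, is to parse it with `ast.literal_eval` and pull the two lists apart. The function name `split_credits` and the abbreviated sample cell are hypothetical.

```python
import ast


def split_credits(cell):
    """Parse one credits cell and return its (cast, crew) lists.

    Hypothetical illustration; the real helper may work differently.
    """
    # Anything that is not a string (e.g. NaN) is treated as "no credits".
    if not isinstance(cell, str):
        return [], []
    # literal_eval safely parses Python literals, including None and
    # single-quoted strings, which json.loads would reject.
    credits = ast.literal_eval(cell)
    return credits.get("cast", []), credits.get("crew", [])


# Abbreviated sample in the shape shown by movies_df.head().
sample = (
    "{'cast': [{'cast_id': 0, 'character': 'Ángel', 'id': 266603, "
    "'name': 'Miguel Ayones', 'profile_path': None}], "
    "'crew': [{'id': 30823}]}"
)

cast, crew = split_credits(sample)
no_cast, no_crew = split_credits(float("nan"))
print(cast[0]["name"], crew[0]["id"])
```

`ast.literal_eval` is used here instead of `json.loads` because the cells use single quotes and `None`, which are valid Python literals but not valid JSON.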

In [22]:
movies_df.cast[0]

[{'cast_id': 0,
  'character': 'Ángel',
  'credit_id': '57ffc57ec3a36815490010dd',
  'gender': 2,
  'id': 266603,
  'name': 'Miguel Ayones',
  'order': 1,
  'profile_path': '/2W3RBABBUOTHNeg7stxc4NfJ68g.jpg'},
 {'cast_id': 1,
  'character': '',
  'credit_id': '57ffc58d9251411ac400120d',
  'gender': 2,
  'id': 19447,
  'name': 'Pancho Córdova',
  'order': 2,
  'profile_path': None},
 {'cast_id': 2,
  'character': '',
  'credit_id': '57ffc59f9251411ac900120b',
  'gender': 0,
  'id': 1178853,
  'name': 'Ricardo Díaz',
  'order': 3,
  'profile_path': None},
 {'cast_id': 3,
  'character': 'Sara',
  'credit_id': '57ffc5b2c3a368153d001138',
  'gender': 1,
  'id': 1686031,
  'name': 'Beatriz Galbó',
  'order': 4,
  'profile_path': None},
 {'cast_id': 4,
  'character': 'Elena',
  'credit_id': '57ffc5c89251411acd001146',
  'gender': 0,
  'id': 1489266,
  'name': 'Lucy Tovar',
  'order': 5,
  'profile_path': None}]

#### Cast

In [23]:
helpers.from_json_to_array(movies_df, "cast", "'id': ([^,]+)")

#### Crew

In [24]:
helpers.from_json_to_array(movies_df, "crew", "'id': ([^,]+)")
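
The patterns for `cast` and `crew` start with a quote (`'id':`), unlike the one used for `production_companies`. The short check below, run on an abbreviated version of the first cast record shown above, illustrates why that matters: without the leading quote the pattern also matches `cast_id` and `credit_id`.

```python
import re

# Abbreviated copy of the first cast record printed earlier.
record = (
    "{'cast_id': 0, 'character': 'Ángel', "
    "'credit_id': '57ffc57ec3a36815490010dd', 'gender': 2, "
    "'id': 266603, 'name': 'Miguel Ayones', 'order': 1}"
)

# The leading quote anchors the match to the key named exactly 'id'.
with_quote = re.findall(r"'id': ([^,]+)", record)

# Without it, any key that ends in "id" is picked up as well.
without_quote = re.findall(r"id': ([^,]+)", record)

print(with_quote)
print(without_quote)
```

Note that the captures are strings rather than integers, so a later step would need to convert them if numeric ids are required.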

In [52]:
movies_df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,status,tagline,title,video,vote_average,vote_count,status_code,status_message,cast,crew
0,False,,,0,"[Crime, Mystery, Drama, Thriller, Horror]",,421114.0,tt0080495,es,La capilla ardiente,...,Released,,La capilla ardiente,False,3.0,3.0,,,"[266603, 19447, 1178853, 1686031, 1489266]","[30823, 50542, 100897, 100897, 1489713]"
1,False,,,840000,[Horror],,91817.0,tt0082367,en,Fear No Evil,...,Released,Alexandria High… class of '81 - All the studen...,Fear No Evil,False,4.4,14.0,,,"[158489, 1219410, 995591, 104060, 1577541, 126...","[11470, 19172, 106667, 137193, 583023, 583023,..."
2,False,,"{'id': 184977, 'name': 'Shaolin Temple Collect...",0,[Action],,10275.0,tt0079891,cn,少林寺,...,Released,,The Shaolin Temple,False,7.1,53.0,,,"[1336, 1059241, 64707, 1342843, 1342844, 13428...","[64690, 64688, 64687, 64689, 64692, 64693, 646..."
3,False,,,0,[],,270810.0,tt0080311,en,...Maybe This Time,...,Released,,...Maybe This Time,False,0.0,0.0,,,"[58063, 23, 1321846, 1203483, 94807, 217028, 1...","[97327, 166750, 1311627]"
4,False,/fpB6mNdhTG8vX3vjPLHGO6lKbiF.jpg,,0,[Drama],,301845.0,tt0082047,es,Barcelona sur,...,Released,,Barcelona sur,False,5.0,2.0,,,"[1062327, 974896, 1383246, 144268, 1380767, 31...","[37540, 542246, 1057601, 1057601, 1086384, 110..."


We have finished pre-transforming our data. Now it's time to explore and visualize the dataset. To conclude this step, we export the dataset so it can be used properly later.

In [53]:
movies_df.to_csv("../data/processed/movies_transformed_list.csv")
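
One thing to keep in mind when reloading this file: CSV has no list type, so each list is stored as its string representation and comes back as a plain string. The sketch below reproduces that round trip with the standard library's `csv` module rather than pandas, so it illustrates the behaviour and is not the notebook's own code.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "genres"])
writer.writerow(
    ["La capilla ardiente", ["Crime", "Mystery", "Drama", "Thriller", "Horror"]]
)

# Read it back: the list has become its string representation.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["genres"]

# literal_eval turns that string back into a real list.
parsed = ast.literal_eval(raw)
print(type(raw).__name__, parsed)
```

With pandas, the same conversion can be requested at load time through the `converters` argument of `pd.read_csv`, for example `converters={"genres": ast.literal_eval}`.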