To encode our `Cast` column, we need data about each actor. In this notebook we build a dataset that holds, for every cast member, their movies grouped by release year.

## Imports

In [1]:
import pandas as pd
import re
import numpy as np
import json

import sys
sys.path.append('../source/')

import helpers

Here we retrieve the transformed dataset exported in the previous notebook.

In [2]:
movies_df = pd.read_csv("../data/processed/transformation/movies_transformed_list.csv", index_col=0)


In [3]:
movies_df.shape

(251024, 30)

## People Pre Transformation

### Cast

To start, we build a new dataset containing every cast member that appears in any of our movies. To do so, we use a helper, `helpers.unique_values`, that collects the unique values.

In [4]:
cast_ids = helpers.unique_values(movies_df.cast)
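The implementation of `helpers.unique_values` is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming each cell of the column holds a list of ids (or its string form after a CSV round trip); the name `unique_values_sketch` and the toy data are hypothetical:

```python
import ast

import pandas as pd


def unique_values_sketch(column: pd.Series) -> list:
    """Hypothetical stand-in for helpers.unique_values (the real code is not shown)."""
    seen = set()
    for cell in column.dropna():
        # Assumption: a cell is either a list or a stringified list such as "[1, 2]".
        items = ast.literal_eval(cell) if isinstance(cell, str) else cell
        seen.update(items)
    return sorted(seen)


toy_cast = pd.Series(["[3, 1]", "[1, 2]", None])
print(unique_values_sketch(toy_cast))
```

Collecting into a `set` keeps the pass linear in the total number of ids, which matters when the column has hundreds of thousands of rows.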

In [5]:
cast_id_df = pd.DataFrame(cast_ids, columns=["cast_id"])

In [6]:
cast_id_df.shape

(503733, 1)

In the following steps, we build a dataset indexed by cast member, where the columns are release years and each value is the mean `vote_average` of the movies that cast member released that year.

In [7]:
cast_id_df.to_csv("../data/processed/people_transformation/cast_ids.csv")

We don't have the cast members' details yet, so we wrote a Python script that asynchronously requests the information (movies, vote average, name) for each one. The result is a `JSON` file located at `../data/processed/json/tmdb_crew_list.json`.

``python3 ../source/tmdb_people.py``

[...] ~5839.39 seconds later...
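The script itself is not reproduced in this notebook. As a rough sketch of the asynchronous pattern described above, and assuming nothing about how `tmdb_people.py` is really written, the snippet below bounds the number of in-flight requests with an `asyncio.Semaphore`. `fetch_credits_stub`, `fetch_all`, and `max_concurrency` are hypothetical names, and the stub returns a fake payload instead of calling TMDB:

```python
import asyncio


async def fetch_credits_stub(person_id: int) -> dict:
    """Placeholder for the real HTTP call; returns a made-up payload."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"id": person_id, "cast": [], "crew": []}


async def fetch_all(person_ids, fetch, max_concurrency: int = 20) -> list:
    """Run `fetch` for every id with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(person_id):
        async with semaphore:
            return await fetch(person_id)

    # gather returns results in the same order as the input ids
    return await asyncio.gather(*(bounded(pid) for pid in person_ids))


results = asyncio.run(fetch_all([1, 2, 3], fetch_credits_stub, max_concurrency=2))
print([r["id"] for r in results])
```

Inside a notebook, where an event loop is already running, `await fetch_all(...)` would take the place of `asyncio.run(...)`.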

In [8]:
json_path = "../data/processed/json/tmdb_crew_list.json"

In [9]:
with open(json_path) as json_file:
    data = json.load(json_file)

We got our data! In this cell we filter out every entry of the `JSON` that is a string.

In [10]:
# Keep only the non-string entries of the loaded JSON
people_list = [item for item in data if not isinstance(item, str)]

In [11]:
people_df = pd.DataFrame(people_list)

In [12]:
people_df.head()

Unnamed: 0,cast,crew,id,status_code,status_message,success
0,"[{'character': 'Pre-Teen Gomez', 'credit_id': ...",[],2436202.0,,,
1,"[{'character': 'Herself', 'credit_id': '55119c...",[],1445097.0,,,
2,"[{'character': 'self', 'credit_id': '54ecae67c...",[],1431250.0,,,
3,"[{'character': 'Herself', 'credit_id': '57eba5...",[],1685918.0,,,
4,"[{'character': 'Réceptionniste à l'hospice', '...",[],2294782.0,,,


We use the TMDB id as the index and also keep a copy of it in a new `tmdb_id` column.

In [13]:
people_df.set_index('id', inplace=True)
people_df["tmdb_id"] = people_df.index

We got some extra columns we won't need. Time to remove them!

In [14]:
people_df.drop(["crew", "status_code", "status_message", "success"], axis=1, inplace=True)

In [15]:
people_df.head()

id,cast,tmdb_id
2436202.0,"[{'character': 'Pre-Teen Gomez', 'credit_id': ...",2436202.0
1445097.0,"[{'character': 'Herself', 'credit_id': '55119c...",1445097.0
1431250.0,"[{'character': 'self', 'credit_id': '54ecae67c...",1431250.0
1685918.0,"[{'character': 'Herself', 'credit_id': '57eba5...",1685918.0
2294782.0,"[{'character': 'Réceptionniste à l'hospice', '...",2294782.0


We create a new DataFrame that expands each cast list into separate columns, one per credit.

In [16]:
cast_df_series = people_df.cast.apply(pd.Series)

We merge `cast_df_series` (the expanded cast columns from the previous step) into the `people_df` DataFrame.

In [17]:
cast_df = people_df.merge(cast_df_series, left_index = True, right_index = True)

Here we unpivot the DataFrame from wide format to long format.

In [18]:
cast_df = cast_df.melt(id_vars = ['tmdb_id'], value_name = "cast")

In [19]:
cast_df.head()

Unnamed: 0,tmdb_id,variable,cast
0,1.0,cast,"[{'character': 'Himself', 'credit_id': '52fe45..."
1,2.0,cast,"[{'character': 'Luke Skywalker', 'credit_id': ..."
2,3.0,cast,"[{'character': 'Rick Deckard', 'credit_id': '5..."
3,4.0,cast,"[{'character': 'Herself', 'credit_id': '52fe48..."
4,5.0,cast,"[{'character': 'Baron Victor Frankenstein', 'c..."


As we don't need the variable information, we drop the column from our dataset.

In [20]:
cast_df = cast_df.drop("variable", axis=1)

Unpivoting our DataFrame made it clear that there are many NaNs. After we drop them, our DataFrame will be easier to work with.

In [21]:
cast_df = cast_df.dropna()
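The expand, merge, melt, and drop-NaN steps above reshape one list per person into one row per credit. A minimal sketch of the same reshaping on made-up data, using `DataFrame.explode` as an alternative technique (this is not what the notebook runs; `toy_people` and `toy_long` are hypothetical names):

```python
import pandas as pd

toy_people = pd.DataFrame(
    {
        "tmdb_id": [1, 2, 3],
        "cast": [
            [{"character": "A", "vote_average": 7.0}, {"character": "B", "vote_average": 5.0}],
            [{"character": "C", "vote_average": 6.5}],
            [],  # a person with no credits
        ],
    }
)

# explode emits one row per list element; an empty list becomes a single NaN row
toy_long = toy_people.explode("cast").dropna(subset=["cast"]).reset_index(drop=True)
print(toy_long)
```

Because `explode` never builds the wide intermediate frame, it avoids creating one column per credit position, most of which would be NaN for people with few credits.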

Let's sort by `tmdb_id` and reset our index, as it is no longer the correct one.

In [22]:
cast_df = cast_df.sort_values(by=['tmdb_id'], ascending=True)

In [23]:
# drop=True discards the stale index instead of keeping it as an extra "index" column
cast_df.reset_index(drop=True, inplace=True)

Our DataFrame is now sorted by `tmdb_id`, and each row is one movie credit of a cast member, but the credit is still stored in its `JSON` form. It's time to create new columns for `name` (taken from the credit's `character` field), `year` and `vote_average`. Then we remove the `cast` column.

In [24]:
# Character name of each credit; NaN when the field is absent
cast_df["name"] = cast_df["cast"].apply(lambda x: x["character"] if "character" in x else np.nan)

In [25]:
# Release date of each credit; NaN when the field is absent
cast_df["release_date"] = cast_df["cast"].apply(lambda x: x["release_date"] if "release_date" in x else np.nan)
cast_df['year'] = pd.DatetimeIndex(cast_df['release_date']).year
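Release dates coming from an API can be empty or malformed. A minimal sketch, on made-up values, of a tolerant way to derive the year with `pd.to_datetime(..., errors="coerce")`; whether the real data contains such values is an assumption, and `toy_dates` and `toy_years` are hypothetical names:

```python
import numpy as np
import pandas as pd

toy_dates = pd.Series(["1977-05-25", "", np.nan, "not a date"])

# errors="coerce" turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(toy_dates, errors="coerce")

# .dt.year yields NaN wherever the date is NaT
toy_years = parsed.dt.year
print(toy_years.tolist())
```

Rows with a missing year simply drop out of a later pivot on `year`, so coercing is usually safer than letting one bad string abort the whole cell.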

In [26]:
# Vote average of each credit; NaN when the field is absent
cast_df["vote_average"] = cast_df["cast"].apply(lambda x: x["vote_average"] if "vote_average" in x else np.nan)

In [27]:
cast_df.drop(["cast", "release_date"], axis=1, inplace=True)

Movies with a `vote_average` of 0 won't tell us what we want, so we filter our DataFrame to remove those rows.

In [28]:
cast_df = cast_df[cast_df.vote_average != 0]

We're almost done! Now we pivot our DataFrame with `tmdb_id` as the index: the columns are the release years, and each value is the mean `vote_average` of the cast member's movies for that year.

In [29]:
# One row per cast member, one column per year, mean vote_average as the value
cast_df_pivoted = pd.pivot_table(cast_df, values='vote_average', index=['tmdb_id'], columns=['year'], aggfunc='mean')
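To make the resulting shape concrete, here is a small sketch of the same pivot on made-up credits (`toy_credits` and `toy_pivot` are hypothetical names, and the numbers are invented for illustration):

```python
import pandas as pd

toy_credits = pd.DataFrame(
    {
        "tmdb_id": [1, 1, 1, 2],
        "year": [1999, 1999, 2001, 1999],
        "vote_average": [6.0, 8.0, 5.0, 7.5],
    }
)

# Two 1999 credits for person 1 are averaged; person 2 has no 2001 credit, so that cell is NaN
toy_pivot = pd.pivot_table(
    toy_credits, values="vote_average", index="tmdb_id", columns="year", aggfunc="mean"
)
print(toy_pivot)
```

Years in which a cast member released nothing stay NaN, so the real table is expected to be sparse.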

That completes our work here: the People dataset is built! Please go to the next notebook, `3.EDA.ipynb`, to visualize our dataset.

In [30]:
cast_df_pivoted.to_csv("../data/processed/people_transformation/people_cast_list.csv")