## Introduction
This kernel provides a brief introduction to the TMDb dataset. We'll cover:
- How to load the data
- Working with the json fields
- Flattening the json fields

In [1]:
import json
import pandas as pd

### Loading the data
The TMDb data includes several nested json fields, so loading the data requires an extra step beyond reading the csv files.

In [2]:
def load_tmdb_movies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).dt.date  # keep only the calendar date
    json_columns = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df


def load_tmdb_credits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df
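
To make the extra step concrete, here is a minimal sketch of what `json.loads` does to a single cell. The sample string is illustrative: it is written in the shape of a `genres` entry, not copied from the csv file.

```python
import json

# Illustrative raw cell in the shape of the `genres` column
# (an assumption for this sketch, not real file content).
raw_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# Before parsing, the cell is just a string; afterwards it is a list of dicts.
parsed = json.loads(raw_cell)

print(type(raw_cell).__name__, "->", type(parsed).__name__)
print([genre["name"] for genre in parsed])
```

Without this step, indexing into a cell would return single characters of the string rather than dictionaries.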

In [3]:
movies = load_tmdb_movies("../input/tmdb_5000_movies.csv")
credits = load_tmdb_credits("../input/tmdb_5000_credits.csv")

The movies file contains the bulk of the data that's comparable to the old IMDB data.

In [4]:
movies.head(3)

Credits contains information about the people who contributed to a film. There's a lot of information here; long credits can contain a few hundred names per film.

In [5]:
credits.head(3)

### Working with the json fields
As we've seen, almost all of the data in the credits file is actually nested json. Since we've loaded the json fields properly, those fields can be accessed just like any other list or dictionary.

The cast field contains the following keys:

In [6]:
print(sorted(credits.cast.iloc[0][0].keys()))

The crew field contains these keys:

In [7]:
print(sorted(credits.crew.iloc[0][0].keys()))

At least for the few entries I hand-checked, the cast and crew lists are already sorted by order of appearance in the film's credits. For example, we can get the names of the first five actors in Avatar with:

In [8]:
[actor['name'] for actor in credits['cast'].iloc[0][:5]]

For a more involved example, let's try checking how often men got the lead credit in the 10 films with the highest revenues. First, let's pull in a utility function from [How to Get IMDB Kernels Working With TMDb Data](https://www.kaggle.com/sohier/getting-imdb-kernels-working-with-tmdb-data/) to handle missing values in the json fields.

In [9]:
def safe_access(container, index_values):
    # return a missing value rather than an error upon indexing/key failure
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
    except (IndexError, KeyError):
        # a tuple is needed to catch both exception types
        return float('nan')
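
One detail worth checking when catching several exception types: Python expects them as a tuple. The expression `IndexError or KeyError` evaluates to just `IndexError`, so a missing dictionary key would not be caught. The sketch below is a self-contained variant, named `safe_access_sketch` only to mark it as illustrative, run against made-up cast entries.

```python
import math


def safe_access_sketch(container, index_values):
    """Follow a chain of indexes/keys; return NaN if any step is missing."""
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
    except (IndexError, KeyError):  # the tuple catches either exception
        return float("nan")


# `A or B` on two classes returns the first one, which is why a tuple is needed.
print((IndexError or KeyError) is IndexError)

toy_cast = [{"name": "Lead Actor", "gender": 2}]  # made-up example entry

print(safe_access_sketch(toy_cast, [0, "name"]))            # path exists
print(math.isnan(safe_access_sketch([], [0, "name"])))      # empty cast list
print(math.isnan(safe_access_sketch(toy_cast, [0, "id"])))  # missing key
```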

Now we can create a new column for the gender. It's trinary data ({other: 0, female: 1, male: 2}), but that's the best we can do with the data on hand.

In [10]:
credits['gender_of_lead'] = credits.cast.apply(lambda x: safe_access(x, [0, 'gender']))
credits['lead'] = credits.cast.apply(lambda x: safe_access(x, [0, 'name']))
credits.head(3)

In [11]:
credits.gender_of_lead.value_counts()
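
The raw codes are easier to read with labels attached. A minimal sketch, assuming the coding described above ({other: 0, female: 1, male: 2}) and using a made-up Series in place of `credits.gender_of_lead`:

```python
import pandas as pd

# Coding as described in this kernel; treat it as an assumption about the data.
gender_labels = {0: "other", 1: "female", 2: "male"}

# Made-up stand-in for credits.gender_of_lead, including one missing lead.
toy_gender = pd.Series([2, 2, 1, 0, float("nan"), 2])

# Codes that are missing stay missing after the mapping.
labelled = toy_gender.map(gender_labels)

# dropna=False keeps the missing leads visible in the tally.
print(labelled.value_counts(dropna=False))
```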

Now to get to the top 10 films by revenue.

In [12]:
df = pd.merge(movies, credits, left_on='id', right_on='movie_id')
df[['original_title', 'revenue', 'lead', 'gender_of_lead']].sort_values(by=['revenue'], ascending=False)[:10]

We could say that this list is somewhat skewed since we're only looking at a list of ~4,800 films.

We could also say that since it validates my personal belief that Robert Downey Jr. is very entertaining as Iron Man, it must be correct.
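
As an aside, the merge-and-sort step can also be written with `nlargest`, and `pd.merge` accepts a `validate` argument that raises an error if the join keys are not unique on both sides. This is a sketch on made-up frames, not the kernel's own cell; the toy titles, names and revenues are placeholders.

```python
import pandas as pd

# Made-up stand-ins for the movies and credits frames.
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "original_title": ["Film A", "Film B", "Film C"],
    "revenue": [300, 900, 500],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "lead": ["Person X", "Person Y", "Person Z"],
})

# validate="one_to_one" fails loudly if either key column has duplicates.
merged = pd.merge(
    toy_movies, toy_credits,
    left_on="id", right_on="movie_id",
    validate="one_to_one",
)

# nlargest sorts by revenue (descending) and keeps the top rows in one call.
top = merged.nlargest(2, "revenue")[["original_title", "revenue", "lead"]]
print(top)
```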

## Reformatting the Data
Nested json can be a pain; it often makes more sense to use an alternate data structure. In this section we'll make this data more [tidy](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).

We'll start by flattening out the credits file until we have one row per person. The cast data already includes an 'order' entry. I'm not sure whether the ordering of the crew data has any meaning, so we'll record each crew member's list position as 'order' just in case.

In [13]:
# Tag every cast and crew entry with its movie_id;
# crew entries also get their list position as 'order'.
for movie_id, cast_list, crew_list in zip(credits['movie_id'], credits['cast'], credits['crew']):
    for person in cast_list:
        person['movie_id'] = movie_id
    for order, person in enumerate(crew_list):
        person['movie_id'] = movie_id
        person['order'] = order

# Flatten the per-film lists into one row per person.
cast = pd.DataFrame([person for cast_list in credits['cast'] for person in cast_list])
cast['type'] = 'cast'

crew = pd.DataFrame([person for crew_list in credits['crew'] for person in crew_list])
crew['type'] = 'crew'

people = pd.concat([cast, crew], ignore_index=True)

In [14]:
people.sample(3)

Done! We've gotten rid of all of the json involved with the credits.
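
For comparison, this kind of flattening can also be done without mutating the nested dictionaries. The sketch below is an alternative, not the approach used in this kernel. It assumes a pandas version that provides `DataFrame.explode` and `pd.json_normalize`, and it runs on made-up data.

```python
import pandas as pd

# Made-up stand-in for the credits frame, cast column only.
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "cast": [
        [{"name": "Actor A", "order": 0}, {"name": "Actor B", "order": 1}],
        [{"name": "Actor C", "order": 0}],
        [],  # a film with no listed cast
    ],
})

# explode gives one row per list element; empty lists become NaN and are dropped.
exploded = (
    toy_credits[["movie_id", "cast"]]
    .explode("cast")
    .dropna(subset=["cast"])
    .reset_index(drop=True)
)

# json_normalize turns the remaining dicts into columns, row-aligned with `exploded`.
cast_flat = pd.concat(
    [exploded[["movie_id"]], pd.json_normalize(exploded["cast"].tolist())],
    axis=1,
)
cast_flat["type"] = "cast"
print(cast_flat)
```

The same steps would apply to the crew column, with the list position added separately if it needs to be kept.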