# Shortest path between actors

What's the shortest path between two actors, via films they've acted in together?

# Download IMDb Data

[IMDb Datasets](https://www.imdb.com/interfaces/) provide dumps of all movie data. We'll download 3 tables:

## name.basics.tsv.gz

**nconst**|**primaryName**|**birthYear**|**deathYear**|**primaryProfession**|**knownForTitles**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
nm0000001|Fred Astaire|1899|1987|soundtrack,actor,miscellaneous|tt0031983,tt0072308,tt0053137,tt0050419
nm0000002|Lauren Bacall|1924|2014|actress,soundtrack|tt0071877,tt0037382,tt0038355,tt0117057
nm0000003|Brigitte Bardot|1934|\N|actress,soundtrack,music\_department|tt0054452,tt0049189,tt0056404,tt0057345
nm0000004|John Belushi|1949|1982|actor,soundtrack,writer|tt0080455,tt0078723,tt0072562,tt0077975
nm0000005|Ingmar Bergman|1918|2007|writer,director,actor|tt0050976,tt0050986,tt0060827,tt0083922

## title.basics.tsv.gz

**tconst**|**titleType**|**primaryTitle**|**originalTitle**|**isAdult**|**startYear**|**endYear**|**runtimeMinutes**|**genres**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
tt0000001|short|Carmencita|Carmencita|0|1894|\N|1|Documentary,Short
tt0000002|short|Le clown et ses chiens|Le clown et ses chiens|0|1892|\N|5|Animation,Short
tt0000003|short|Pauvre Pierrot|Pauvre Pierrot|0|1892|\N|4|Animation,Comedy,Romance
tt0000004|short|Un bon bock|Un bon bock|0|1892|\N|12|Animation,Short
tt0000005|short|Blacksmith Scene|Blacksmith Scene|0|1893|\N|1|Comedy,Short

## title.principals.tsv.gz

**tconst**|**ordering**|**nconst**|**category**|**job**|**characters**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
tt0000001|1|nm1588970|self|\N|["Self"]
tt0000001|2|nm0005690|director|\N|\N
tt0000001|3|nm0374658|cinematographer|director of photography|\N
tt0000002|1|nm0721526|director|\N|\N
tt0000002|2|nm1335271|composer|\N|\N

In [1]:
# Download the data
# !rm -f *.tsv.gz
!curl --silent -C - -o name.basics.tsv.gz https://datasets.imdbws.com/name.basics.tsv.gz
!curl --silent -C - -o title.principals.tsv.gz https://datasets.imdbws.com/title.principals.tsv.gz
!curl --silent -C - -o title.basics.tsv.gz https://datasets.imdbws.com/title.basics.tsv.gz
!ls -la *.tsv.gz

-rw-rw-r-- 1 shakir shakir 263051457 Mar 17 13:44 name.basics.tsv.gz
-rw-rw-r-- 1 shakir shakir 186362322 Mar 17 13:44 title.basics.tsv.gz
-rw-rw-r-- 1 shakir shakir 470605995 Mar 17 13:44 title.principals.tsv.gz


In [2]:
# These gzip files have trailing garbage.
# Python's gzip module does not read GZIP files with trailing garbage.
# Let's create an equivalent of pandas.read_csv() that works around it.
# See https://stackoverflow.com/a/54608126/100904
import zlib
import io
import pandas as pd

def read_csv(path, **kwargs):
    with open(path, 'rb') as handle:
        raw = handle.read()
    stream = io.BytesIO(zlib.decompress(raw, zlib.MAX_WBITS|16))
    return pd.read_csv(stream, **kwargs)
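
The `read_csv()` helper above decompresses the whole file in memory. If memory is tight, a streaming variant is one option. The sketch below is a minimal illustration, not part of the original notebook: `iter_gunzip` and `chunk_size` are hypothetical names, and it assumes the file holds a single gzip member followed by extra bytes. It is demonstrated on a tiny made-up file rather than the IMDb dumps.

```python
import gzip
import os
import tempfile
import zlib

def iter_gunzip(path, chunk_size=1 << 20):
    """Yield decompressed chunks of the first gzip member in `path`.

    Anything after the end of that member (trailing garbage) is ignored.
    """
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    with open(path, 'rb') as handle:
        # decompressor.eof becomes True once the gzip stream has ended
        while not decompressor.eof:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield decompressor.decompress(chunk)

# Demo on a made-up file: a valid gzip stream followed by junk bytes
payload = b'tconst\tprimaryTitle\ntt0000009\tMiss Jerry\n'
with tempfile.NamedTemporaryFile(suffix='.tsv.gz', delete=False) as tmp:
    tmp.write(gzip.compress(payload) + b'trailing garbage')
recovered = b''.join(iter_gunzip(tmp.name))
os.remove(tmp.name)
print(recovered == payload)
```

The chunks could be written to an uncompressed file on disk and loaded from there, at the cost of extra disk space.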

In [3]:
# Load the movies. This needs ~1.4GB RAM, 15s
movies = read_csv('title.basics.tsv.gz', sep='\t', na_values='\\N', dtype={
    'tconst': 'str',
    'titleType': 'str',
    'primaryTitle': 'str',
    'startYear': 'Int64',
}, usecols=['tconst', 'titleType', 'primaryTitle', 'startYear']).set_index('tconst')

In [4]:
# Only consider movies, not TV series, etc. Shrinks data to ~5%
movies = movies[movies['titleType'] == 'movie']
del movies['titleType']
movies.head()

**tconst**|**primaryTitle**|**startYear**
:-----:|:-----:|:-----:
tt0000009|Miss Jerry|1894
tt0000147|The Corbett-Fitzsimmons Fight|1897
tt0000502|Bohemios|1905
tt0000574|The Story of the Kelly Gang|1906
tt0000591|The Prodigal Son|1907


In [5]:
# Load the cast of each film. 2.0 GB RAM. 30s
cast = read_csv('title.principals.tsv.gz', sep='\t', na_values='\\N', dtype={
    'tconst': 'str',
    'nconst': 'str',
    'category': 'str',
}, usecols=['tconst', 'nconst', 'category'])

In [6]:
# Only consider actors, not directors, composers, etc. Shrinks data to about 40%
# Only consider actors that have acted in movies, not TV series, etc.
cast = cast[cast.category.isin({'actor', 'actress'}) & cast['tconst'].isin(movies.index)]
cast.reset_index(drop=True, inplace=True)
cast.head()

**index**|**tconst**|**nconst**|**category**
:-----:|:-----:|:-----:|:-----:
0|tt0000009|nm0063086|actress
1|tt0000009|nm0183823|actor
2|tt0000009|nm1309758|actor
3|tt0000502|nm0215752|actor
4|tt0000502|nm0252720|actor


In [7]:
# Load 11m names with birth year. 16s
name = read_csv('name.basics.tsv.gz', sep='\t', na_values='\\N', dtype={
    'nconst': 'str',
    'primaryName': 'str',
    'birthYear': 'Int64'
}, usecols=['nconst', 'primaryName', 'birthYear']).set_index('nconst')

In [8]:
# Drop those who haven't acted in movies
name = name[name.index.isin(cast['nconst'])]
# name['titles'] has the number of movies they've acted in
name['titles'] = cast['nconst'].value_counts()

name.head()

**nconst**|**primaryName**|**birthYear**|**titles**
:-----:|:-----:|:-----:|:-----:
nm0000001|Fred Astaire|1899|35
nm0000002|Lauren Bacall|1924|37
nm0000003|Brigitte Bardot|1934|35
nm0000004|John Belushi|1949|7
nm0000005|Ingmar Bergman|1918|3


# Create a `networkx` graph from this

In [10]:
import networkx as nx
G = nx.from_pandas_edgelist(cast, 'tconst', 'nconst')
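
Each row of `cast` becomes an edge between a movie ID and an actor ID, so `G` contains both kinds of node and any path alternates actor, movie, actor. A minimal sketch on made-up IDs (`toy_cast` and `toy_G` are illustrative names, not part of the notebook's data):

```python
import networkx as nx
import pandas as pd

# Made-up IDs: tt* are movies, nm* are actors
toy_cast = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt2', 'tt2'],
    'nconst': ['nmA', 'nmB', 'nmB', 'nmC'],
})
toy_G = nx.from_pandas_edgelist(toy_cast, 'tconst', 'nconst')

# nmA and nmC never share a movie, but both acted with nmB
toy_path = nx.shortest_path(toy_G, 'nmA', 'nmC')
print(toy_path)  # ['nmA', 'tt1', 'nmB', 'tt2', 'nmC']
```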

In [11]:
# Load the full name.basics.tsv.gz file to look up actors by name
names_df = read_csv('name.basics.tsv.gz', sep='\t', low_memory=False)

In [12]:
# Search for Lazzarus Powell
lazzarus_powell = names_df[names_df['primaryName'] == 'Lazzarus Powell']
print("Lazzarus Powell's nconst:", lazzarus_powell['nconst'].values)

# Search for Kenneth Hadley
kenneth_hadley = names_df[names_df['primaryName'] == 'Kenneth Hadley']
print("Kenneth Hadley's nconst:", kenneth_hadley['nconst'].values)


Lazzarus Powell's nconst: ['nm2376957']
Kenneth Hadley's nconst: ['nm0352901']


In [None]:
kenneth_hadley['nconst'].values

In [20]:
# Assuming you have already created the graph G and found the nconst values for the actors
lazzarus_nconst = lazzarus_powell['nconst'].values[0]
kenneth_nconst = kenneth_hadley['nconst'].values[0]

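# Enumerating every simple path with nx.all_simple_paths() grows exponentially with
# path length and is unlikely to finish on a graph this large.
# List only the shortest paths between Lazzarus Powell and Kenneth Hadley instead.
all_paths = list(nx.all_shortest_paths(G, source=lazzarus_nconst, target=kenneth_nconst))

# Print all shortest paths
for path in all_paths:
    print("Path:", path)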


In [24]:
# Paths provided in the question
paths = [
    ["Lazzarus Powell", "Checkmate", "Fat Joe", "Thicker Than Water", "Ice Cube", "Kenneth Hadley"],
    ["Lazzarus Powell", "Checkmate", "Fat Joe", "Thicker Than Water", "Ice Cube", "xXx: State of the Union", "Scott Speedman", "Kenneth Hadley"],
    ["Lazzarus Powell", "Checkmate", "Fat Joe", "Thicker Than Water", "Ice Cube", "xXx: State of the Union", "Scott Speedman", "Underworld: Evolution", "Bill Nighy", "Sometimes Always Never", "Alice Lowe", "Sightseers", "Kenneth Hadley"]
]

# Function to check if a path is valid.
# A path must alternate actor, movie, actor, ..., so it needs an odd number of entries.
def is_valid_path(path):
    if len(path) % 2 == 0:
        return False
    for i in range(0, len(path) - 2, 2):
        actor, movie, next_actor = path[i], path[i + 1], path[i + 2]
        # Names and titles are not unique, so collect every matching ID.
        # An unknown name or title gives an empty match and the path is rejected.
        actor_ids = name.index[name['primaryName'] == actor]
        next_actor_ids = name.index[name['primaryName'] == next_actor]
        movie_ids = movies.index[movies['primaryTitle'] == movie]
        in_movie = cast[cast['tconst'].isin(movie_ids)]
        # Both actors must appear in the same movie with this title
        first = set(in_movie.loc[in_movie['nconst'].isin(actor_ids), 'tconst'])
        second = set(in_movie.loc[in_movie['nconst'].isin(next_actor_ids), 'tconst'])
        if not (first & second):
            return False
    return True

# Check each path
valid_paths = []
for path in paths:
    if is_valid_path(path):
        valid_paths.append(path)

# Print valid paths
for path in valid_paths:
    print("Valid path:", " - ".join(path))
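
Once a path is expressed as IDs rather than names, it can also be checked directly against the graph: every consecutive pair must be an edge. A minimal sketch on a made-up graph (`is_path_in_graph` and `toy_G` are hypothetical names, not part of the notebook):

```python
import networkx as nx

def is_path_in_graph(graph, id_path):
    """True if all IDs exist in `graph` and each consecutive pair is an edge."""
    if not id_path or any(node not in graph for node in id_path):
        return False
    return all(graph.has_edge(u, v) for u, v in zip(id_path, id_path[1:]))

# Made-up IDs: tt* are movies, nm* are actors
toy_G = nx.Graph([('nmA', 'tt1'), ('tt1', 'nmB'), ('nmB', 'tt2'), ('tt2', 'nmC')])

print(is_path_in_graph(toy_G, ['nmA', 'tt1', 'nmB', 'tt2', 'nmC']))  # True
print(is_path_in_graph(toy_G, ['nmA', 'tt2', 'nmC']))                # False: nmA is not in tt2
```

Working with IDs avoids the ambiguity of looking up people and films by name.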


In [13]:
if 'nm2376957' in G and 'nm0352901' in G:
    path = nx.shortest_path(G, 'nm2376957', 'nm0352901')
    print(path)
else:
    print("One or both nodes are not in the graph.")


['nm2376957', 'tt6059750', 'nm4098919', 'tt19866398', 'nm9629900', 'tt15532522', 'nm0234174', 'tt5207838', 'nm1546206', 'tt3449664', 'nm1361530', 'tt2023690', 'nm0352901']


In [14]:
# We can find the shortest path between 2 actors. For example, Lazzarus Powell (nm2376957) and Kenneth Hadley (nm0352901)
nx.shortest_path(G, 'nm2376957', 'nm0352901')

['nm2376957',
 'tt6059750',
 'nm4098919',
 'tt19866398',
 'nm9629900',
 'tt15532522',
 'nm0234174',
 'tt5207838',
 'nm1546206',
 'tt3449664',
 'nm1361530',
 'tt2023690',
 'nm0352901']

In [15]:
# Let's write a function that converts these IDs into names
def names(path):
    return ' - '.join((movies['primaryTitle'][p] if p.startswith('tt') else name['primaryName'][p]) for p in path)

# ... and a function that returns the shortest path between two actors, given their names
def path(source, target):
    source = name[name['primaryName'] == source].index[0]
    target = name[name['primaryName'] == target].index[0]
    return names(nx.shortest_path(G, source, target))
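
The lookups above take the first row that matches a name, but several people can share a name, and a missing name raises an `IndexError`. One possible tie-breaker is to prefer the person with the most titles. The sketch below assumes a table shaped like `name` (indexed by `nconst`, with `primaryName` and `titles` columns); `find_actor` and the toy rows are made up for illustration.

```python
import pandas as pd

def find_actor(name_table, primary_name):
    """Return the nconst of the matching person with the most titles, or None."""
    matches = name_table[name_table['primaryName'] == primary_name]
    if matches.empty:
        return None
    return matches['titles'].idxmax()

# Made-up rows: two people share the name 'Sam Doe'
toy_name = pd.DataFrame(
    {'primaryName': ['Sam Doe', 'Sam Doe', 'Alex Roe'], 'titles': [2, 40, 5]},
    index=pd.Index(['nm1', 'nm2', 'nm3'], name='nconst'),
)
print(find_actor(toy_name, 'Sam Doe'))  # nm2
print(find_actor(toy_name, 'Nobody'))   # None
```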

In [16]:
# This is the shortest path between Lazzarus Powell (nm2376957) and Kenneth Hadley (nm0352901)
path('Lazzarus Powell', 'Kenneth Hadley')

'Lazzarus Powell - Long Crime No See - Lester Greene - Driving Force - Michelle l Lamb - The Waterboyz - Omar J. Dorsey - Cargo - Warren Brown - Captain Webb - Steve Oram - Sightseers - Kenneth Hadley'

In [17]:
# There may be multiple shortest paths between them. Let's list them all
def paths(source, target):
    source = name[name['primaryName'] == source].index[0]
    target = name[name['primaryName'] == target].index[0]
    return [names(p) for p in nx.all_shortest_paths(G, source, target)]
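
In an unweighted graph like this one, a shortest path can be found with breadth-first search. The sketch below is a plain-Python illustration of that idea on a made-up adjacency list. It is not how `networkx` implements `shortest_path`, and `bfs_shortest_path` is a hypothetical helper.

```python
from collections import deque

def bfs_shortest_path(adjacency, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    parents = {source: None}  # also serves as the set of visited nodes
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parents to rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

# Made-up IDs: tt* are movies, nm* are actors
toy_adjacency = {
    'nmA': ['tt1'], 'tt1': ['nmA', 'nmB'],
    'nmB': ['tt1', 'tt2'], 'tt2': ['nmB', 'nmC'],
    'nmC': ['tt2'],
}
bfs_path = bfs_shortest_path(toy_adjacency, 'nmA', 'nmC')
print(bfs_path)                  # ['nmA', 'tt1', 'nmB', 'tt2', 'nmC']
print((len(bfs_path) - 1) // 2)  # 2 movies link nmA to nmC
```

Because paths alternate actor and movie, the number of movies on a path is half the number of edges.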

In [18]:
# These are all the shortest paths between Lazzarus Powell and Kenneth Hadley
paths('Lazzarus Powell', 'Kenneth Hadley')

['Lazzarus Powell - Long Crime No See - Lester Greene - Driving Force - Michelle l Lamb - The Waterboyz - Omar J. Dorsey - Cargo - Warren Brown - Captain Webb - Steve Oram - Sightseers - Kenneth Hadley',
 'Lazzarus Powell - Checkmate - Fat Joe - Thicker Than Water - Ice Cube - Anaconda - Eric Stoltz - A Murder of Crows - Marianne Jean-Baptiste - In Fabric - Steve Oram - Sightseers - Kenneth Hadley',
 'Lazzarus Powell - Checkmate - Fat Joe - This Is Me... Now - Jennifer Lopez - Anaconda - Eric Stoltz - A Murder of Crows - Marianne Jean-Baptiste - In Fabric - Steve Oram - Sightseers - Kenneth Hadley',
 'Lazzarus Powell - Checkmate - Fat Joe - Thicker Than Water - Ice Cube - Torque - Martin Henderson - The Moment - Marianne Jean-Baptiste - In Fabric - Steve Oram - Sightseers - Kenneth Hadley',
 'Lazzarus Powell - Checkmate - Fat Joe - This Is Me... Now - Ben Affleck - Paycheck - Aaron Eckhart - Rumble Through the Dark - Marianne Jean-Baptiste - In Fabric - Steve Oram - Sightseers - Kennet

# Let's explore the network

In [None]:
paths('Shahab Hosseini', 'Angelina Jolie')

In [None]:
paths('Onur Tuna', 'Angelina Jolie')

In [None]:
paths('Robin Williams', 'Jackie Chan')

In [None]:
paths('Rajinikanth', 'Jackie Chan')

In [None]:
paths('Clint Eastwood', 'Toshirô Mifune')

# Further explorations

In [None]:
paths('Kevin Bacon', 'Sivakarthikeyan')

In [None]:
paths('Ashish Vidyarthi', 'N!xau')

In [None]:
paths('Gandhimathi', 'N!xau')

In [None]:
# Note: Asad Dadarkar acted in Dil Chahta Hai: https://www.imdb.com/title/tt0292490/
# But his name is not in title.principals.tsv.gz, since it's a list of primary cast, not a complete list.
# An extended list is available from https://contribute.imdb.com/dataset
# But only to people with 1000+ contributions in the last 360 days: https://community-imdb.sprinklr.com/conversations/data-issues-policy-discussions/imdb-data-now-easily-available-to-contributors/5f4a7a0d8815453dba963bbc
paths('Asad Dadarkar', 'N!xau')
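
When an actor is missing from the graph, or two actors are not connected at all, `nx.shortest_path` raises an exception. A small wrapper can return `None` instead so that exploration cells keep running. This is a sketch on a made-up graph, and `safe_shortest_path` is a hypothetical helper.

```python
import networkx as nx

def safe_shortest_path(graph, source, target):
    """Return a shortest path, or None if a node is missing or no path exists."""
    try:
        return nx.shortest_path(graph, source, target)
    except (nx.NodeNotFound, nx.NetworkXNoPath):
        return None

# Made-up IDs: nmC acted only in tt2, which shares no actor with tt1
toy_G = nx.Graph([('nmA', 'tt1'), ('tt1', 'nmB'), ('nmC', 'tt2')])

print(safe_shortest_path(toy_G, 'nmA', 'nmB'))  # ['nmA', 'tt1', 'nmB']
print(safe_shortest_path(toy_G, 'nmA', 'nmC'))  # None: no connecting path
print(safe_shortest_path(toy_G, 'nmA', 'nmZ'))  # None: nmZ is not in the graph
```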

# Stories

- Who is the Kevin Bacon of Bollywood?
- MGR never allowed Jayalalitha to act with others for a while until they broke up.
- Senthil needed to pair up with Goundamani in the later years to get a chance
- No one acts with Prashanth after his Malaysian visit. Or Vadivelu after his Vijayakanth visit.
- Arjun was a top star in Tamil. Then he had a bad patch -- where he took refuge in Kannada. Then he moved back.
- Vadivelu may have been a highly connected actor earlier, but fell over time
- What about the Venkat Prabhu cluster?
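
One possible way to approach the "most connected actor" questions above is to count each actor's distinct co-stars. The sketch below uses a made-up table shaped like `cast`. Co-star count is only one notion of connectedness, and a self-merge like this may be heavy on the full dataset, so it would likely need to be restricted to a subset of films first.

```python
import pandas as pd

# Made-up IDs: tt* are movies, nm* are actors
toy_cast = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt1', 'tt2', 'tt2'],
    'nconst': ['nmA', 'nmB', 'nmC', 'nmC', 'nmD'],
})

# Pair every actor with everyone else in the same movie
pairs = toy_cast.merge(toy_cast, on='tconst', suffixes=('', '_costar'))
pairs = pairs[pairs['nconst'] != pairs['nconst_costar']]

# Count distinct co-stars per actor
costars = pairs.groupby('nconst')['nconst_costar'].nunique()
print(costars.idxmax())  # nmC, who acted with nmA, nmB and nmD
```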


# Notes

- Build a tool
- Allow annotations & story forms for users to create their stories
- Allow embedding -- of visual and of story