# Shortest path between actors

What's the shortest path between two actors, via films they've acted in together?

# Download IMDb Data

[IMDb Datasets](https://www.imdb.com/interfaces/) provide dumps of all movie data. We'll download 3 tables:

## name.basics.tsv.gz

**nconst**|**primaryName**|**birthYear**|**deathYear**|**primaryProfession**|**knownForTitles**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
nm0000001|Fred Astaire|1899|1987|soundtrack,actor,miscellaneous|tt0031983,tt0072308,tt0053137,tt0050419
nm0000002|Lauren Bacall|1924|2014|actress,soundtrack|tt0071877,tt0037382,tt0038355,tt0117057
nm0000003|Brigitte Bardot|1934|\N|actress,soundtrack,music\_department|tt0054452,tt0049189,tt0056404,tt0057345
nm0000004|John Belushi|1949|1982|actor,soundtrack,writer|tt0080455,tt0078723,tt0072562,tt0077975
nm0000005|Ingmar Bergman|1918|2007|writer,director,actor|tt0050976,tt0050986,tt0060827,tt0083922

## title.basics.tsv.gz

**tconst**|**titleType**|**primaryTitle**|**originalTitle**|**isAdult**|**startYear**|**endYear**|**runtimeMinutes**|**genres**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
tt0000001|short|Carmencita|Carmencita|0|1894|\N|1|Documentary,Short
tt0000002|short|Le clown et ses chiens|Le clown et ses chiens|0|1892|\N|5|Animation,Short
tt0000003|short|Pauvre Pierrot|Pauvre Pierrot|0|1892|\N|4|Animation,Comedy,Romance
tt0000004|short|Un bon bock|Un bon bock|0|1892|\N|12|Animation,Short
tt0000005|short|Blacksmith Scene|Blacksmith Scene|0|1893|\N|1|Comedy,Short

## title.principals.tsv.gz

**tconst**|**ordering**|**nconst**|**category**|**job**|**characters**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
tt0000001|1|nm1588970|self|\N|["Self"]
tt0000001|2|nm0005690|director|\N|\N
tt0000001|3|nm0374658|cinematographer|director of photography|\N
tt0000002|1|nm0721526|director|\N|\N
tt0000002|2|nm1335271|composer|\N|\N

In [1]:
# Download the data
# !rm -f *.tsv.gz
!curl --silent -C - -o name.basics.tsv.gz https://datasets.imdbws.com/name.basics.tsv.gz
!curl --silent -C - -o title.principals.tsv.gz https://datasets.imdbws.com/title.principals.tsv.gz
!curl --silent -C - -o title.basics.tsv.gz https://datasets.imdbws.com/title.basics.tsv.gz
!ls -la *.tsv.gz

-rwxrwxr-x+ 1 Anand S Anand S 229771318 Jul  1 08:32 name.basics.tsv.gz
-rwxrwxr-x+ 1 Anand S Anand S 157916340 Jul  1 08:32 title.basics.tsv.gz
-rwxrwxr-x+ 1 Anand S Anand S 402948527 Jul  1 08:32 title.principals.tsv.gz


In [2]:
# These gzip files have trailing garbage.
# Python's gzip module does not read GZIP files with trailing garbage.
# Let's create an equivalent of pandas.read_csv() that works around it.
# See https://stackoverflow.com/a/54608126/100904
import zlib
import io
import pandas as pd

def read_csv(path, **kwargs):
    with open(path, 'rb') as handle:
        raw = handle.read()
    stream = io.BytesIO(zlib.decompress(raw, zlib.MAX_WBITS|16))
    return pd.read_csv(stream, **kwargs)
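
The workaround relies on zlib stopping at the end of the gzip stream and setting aside whatever follows. Here is a minimal sketch of that behaviour on synthetic data (the payload and the garbage bytes are invented for illustration). It uses `zlib.decompressobj` rather than `zlib.decompress` so that the ignored bytes are visible in `unused_data`:

```python
import gzip
import zlib

# Synthetic stand-in for an IMDb dump: a tiny gzip-compressed TSV (assumed data)
original = b"tconst\tprimaryTitle\ntt0000001\tCarmencita\n"
garbage = b"\x00\x00trailing-bytes"
payload = gzip.compress(original) + garbage

# wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
decoded = decoder.decompress(payload)

print(decoded == original)   # whether the TSV content was recovered intact
print(decoder.unused_data)   # the bytes found after the end of the gzip stream
```

Note that this approach reads only the first gzip member, so it would not suit a file made of several concatenated gzip streams.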

In [3]:
# Load the movies. This needs ~1.4GB RAM, 15s
movies = read_csv('title.basics.tsv.gz', sep='\t', na_values='\\N', dtype={
    'tconst': 'str',
    'titleType': 'str',
    'primaryTitle': 'str',
    'startYear': 'Int64',
}, usecols=['tconst', 'titleType', 'primaryTitle', 'startYear']).set_index('tconst')

In [4]:
# Only consider movies, not TV series, etc. Shrinks data to ~5%
movies = movies[movies['titleType'] == 'movie']
del movies['titleType']
movies.head()

**tconst**|**primaryTitle**|**startYear**
:-----:|:-----:|:-----:
tt0000502|Bohemios|1905
tt0000574|The Story of the Kelly Gang|1906
tt0000591|The Prodigal Son|1907
tt0000615|Robbery Under Arms|1907
tt0000630|Hamlet|1908


In [6]:
# Load the cast of each film. 2.0 GB RAM. 30s
cast = read_csv('title.principals.tsv.gz', sep='\t', na_values='\\N', dtype={
    'tconst': 'str',
    'nconst': 'str',
    'category': 'str',
}, usecols=['tconst', 'nconst', 'category'])

In [7]:
# Only consider actors, not directors, composers, etc. Shrinks data to about 40%
# Only consider actors that have acted in movies, not TV series, etc.
cast = cast[cast.category.isin({'actor', 'actress'}) & cast['tconst'].isin(movies.index)]
cast.reset_index(drop=True, inplace=True)
cast.head()

| |**tconst**|**nconst**|**category**
|:-----:|:-----:|:-----:|:-----:
|0|tt0000502|nm0215752|actor
|1|tt0000502|nm0252720|actor
|2|tt0000574|nm0846887|actress
|3|tt0000574|nm0846894|actor
|4|tt0000574|nm1431224|actor


In [8]:
# Load 11m names with birth year. 16s
name = read_csv('name.basics.tsv.gz', sep='\t', na_values='\\N', dtype={
    'nconst': 'str',
    'primaryName': 'str',
    'birthYear': 'Int64'
}, usecols=['nconst', 'primaryName', 'birthYear']).set_index('nconst')

In [9]:
# Drop those who haven't acted in movies
name = name[name.index.isin(cast['nconst'])]
# name['titles'] has the number of movies they've acted in
name['titles'] = cast['nconst'].value_counts()
name.head()

**nconst**|**primaryName**|**birthYear**|**titles**
:-----:|:-----:|:-----:|:-----:
nm0000001|Fred Astaire|1899|35
nm0000002|Lauren Bacall|1924|37
nm0000003|Brigitte Bardot|1934|35
nm0000004|John Belushi|1949|7
nm0000005|Ingmar Bergman|1918|3
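
The `titles` assignment works because pandas aligns the `value_counts()` result with `name` by `nconst`, not by row position. Here is a small sketch on invented miniature tables (`toy_cast` and `toy_name` are made-up stand-ins, not IMDb data):

```python
import pandas as pd

# Invented miniature versions of `cast` and `name`
toy_cast = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt2', 'tt3'],
    'nconst': ['nm2', 'nm1', 'nm2', 'nm2'],
})
toy_name = pd.DataFrame(
    {'primaryName': ['Ana Roy', 'Sam Lee']},
    index=pd.Index(['nm1', 'nm2'], name='nconst'),
)

# value_counts() is indexed by nconst, so the assignment matches rows by ID
# even though the counts come back sorted by frequency, not by ID
toy_name['titles'] = toy_cast['nconst'].value_counts()
print(toy_name)
```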


# Create a `networkx` graph from this

In [10]:
import networkx as nx
G = nx.from_pandas_edgelist(cast, 'tconst', 'nconst')

In [11]:
# We can find the shortest path between 2 actors. For example, Robin Williams (nm0000245) and Angelina Jolie (nm0001401)
nx.shortest_path(G, 'nm0000245', 'nm0001401')

['nm0000245', 'tt0097165', 'nm0000160', 'tt0364045', 'nm0001401']
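
Because films and actors are both nodes, a path alternates between `nm…` and `tt…` IDs, so the number of films linking two actors is half the number of edges. Here is a small sketch on a hand-built graph that reuses the IDs from the output above (the `degrees` helper is hypothetical, not part of the notebook):

```python
import networkx as nx

# Hand-built slice of the film-actor graph, using the IDs from the path above
toy = nx.Graph()
toy.add_edges_from([
    ('tt0097165', 'nm0000245'),
    ('tt0097165', 'nm0000160'),
    ('tt0364045', 'nm0000160'),
    ('tt0364045', 'nm0001401'),
])

def degrees(graph, source, target):
    """Number of films on the shortest path between two actors (hypothetical helper)."""
    # Each film contributes two edges: actor -> film -> actor
    return nx.shortest_path_length(graph, source, target) // 2

toy_path = nx.shortest_path(toy, 'nm0000245', 'nm0001401')
print(toy_path)
print(degrees(toy, 'nm0000245', 'nm0001401'))
```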

In [12]:
# Let's write a function that converts these IDs into names
def names(path):
    return ' - '.join((movies['primaryTitle'][p] if p.startswith('tt') else name['primaryName'][p]) for p in path)

# ... and a function that finds the shortest path between two actors, given their names
def path(source, target):
    source = name[name['primaryName'] == source].index[0]
    target = name[name['primaryName'] == target].index[0]
    return names(nx.shortest_path(G, source, target))
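
The `index[0]` lookup takes the first row with a matching name, so it may pick the wrong person when two people share a name, and it raises a bare `IndexError` when nobody matches. Here is a sketch of a stricter lookup on an invented table (`find_nconst` is a hypothetical helper, the rows and IDs are made up, and preferring the person with the most titles is an assumption rather than the notebook's rule):

```python
import pandas as pd

# Invented stand-in for the `name` table: two different people share the name 'Sam Lee'
people = pd.DataFrame(
    {'primaryName': ['Sam Lee', 'Sam Lee', 'Ana Roy'], 'titles': [2, 40, 7]},
    index=pd.Index(['nm1', 'nm2', 'nm3'], name='nconst'),
)

def find_nconst(table, actor):
    """Return the ID for `actor`, preferring the namesake with the most titles."""
    matches = table[table['primaryName'] == actor]
    if matches.empty:
        raise KeyError(f'No actor named {actor!r} in the table')
    # Assumption: the most prolific namesake is the one the reader means
    return matches['titles'].idxmax()

print(find_nconst(people, 'Sam Lee'))
```

With a helper like this, `path()` and `paths()` could report a missing name clearly instead of failing on an empty index.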

In [13]:
# This is the shortest path between Robin Williams (nm0000245) and Angelina Jolie (nm0001401)
path('Robin Williams', 'Angelina Jolie')

'Robin Williams - Dead Poets Society - Ethan Hawke - Taking Lives - Angelina Jolie'

In [14]:
# There may be multiple shortest paths between them. Let's list them all
def paths(source, target):
    source = name[name['primaryName'] == source].index[0]
    target = name[name['primaryName'] == target].index[0]
    return [names(p) for p in nx.all_shortest_paths(G, source, target)]

In [15]:
# These are all the shortest paths between Robin Williams (nm0000245) and Angelina Jolie (nm0001401)
paths('Robin Williams', 'Angelina Jolie')

['Robin Williams - Dead Poets Society - Ethan Hawke - Taking Lives - Angelina Jolie',
 'Robin Williams - Awakenings - Robert De Niro - Shark Tale - Angelina Jolie',
 'Robin Williams - Awakenings - Robert De Niro - The Good Shepherd - Angelina Jolie',
 'Robin Williams - Good Will Hunting - Matt Damon - The Good Shepherd - Angelina Jolie',
 'Robin Williams - Hook - Dustin Hoffman - Kung Fu Panda - Angelina Jolie',
 'Robin Williams - Hook - Dustin Hoffman - Kung Fu Panda 2 - Angelina Jolie',
 'Robin Williams - Hook - Dustin Hoffman - Kung Fu Panda 3 - Angelina Jolie',
 'Robin Williams - Toys - Robin Wright - Beowulf - Angelina Jolie',
 'Robin Williams - Happy Feet - Brittany Murphy - Girl, Interrupted - Angelina Jolie',
 'Robin Williams - House of D - David Duchovny - Playing God - Angelina Jolie',
 'Robin Williams - The Big White - Giovanni Ribisi - Gone in 60 Seconds - Angelina Jolie',
 'Robin Williams - The Big White - Giovanni Ribisi - Sky Captain and the World of Tomorrow - Angelina 

# Let's explore the network

In [21]:
# Shahab Hosseini is a famous Iranian actor. How does he connect with Angelina Jolie?
paths('Shahab Hosseini', 'Angelina Jolie')

['Shahab Hosseini - Darbareye Elly - Golshifteh Farahani - The Song of Scorpions - Irrfan Khan - A Mighty Heart - Angelina Jolie',
 'Shahab Hosseini - A Separation - Payman Maadi - Last Knights - Clive Owen - Beyond Borders - Angelina Jolie',
 'Shahab Hosseini - A Separation - Payman Maadi - Last Knights - Morgan Freeman - Wanted - Angelina Jolie',
 'Shahab Hosseini - The Salesman - Babak Karimi - Zeros and Ones - Ethan Hawke - Taking Lives - Angelina Jolie']

In [22]:
# Robin Williams and Jackie Chan are both prolific comedians. Whom have they acted with?
paths('Robin Williams', 'Jackie Chan')

['Robin Williams - Hook - Dustin Hoffman - Kung Fu Panda 2 - Jackie Chan',
 'Robin Williams - What Dreams May Come - Max von Sydow - Rush Hour 3 - Jackie Chan',
 'Robin Williams - Night at the Museum: Secret of the Tomb - Owen Wilson - Shanghai Noon - Jackie Chan',
 'Robin Williams - Night at the Museum: Secret of the Tomb - Owen Wilson - Shanghai Knights - Jackie Chan']

In [23]:
# Rajinikanth and Jackie Chan are among Asia's highest-paid actors. How are they connected?
paths('Rajinikanth', 'Jackie Chan')

['Rajinikanth - Kabali - Winston Chao - 1911 - Jackie Chan']

In [24]:
# Clint Eastwood began his career playing Toshirô Mifune's roles. How are they connected?
paths('Clint Eastwood', 'Toshirô Mifune')

['Clint Eastwood - Paint Your Wagon - Lee Marvin - Hell in the Pacific - Toshirô Mifune',
 'Clint Eastwood - Space Cowboys - James Garner - Grand Prix - Toshirô Mifune']

In [25]:
# Kevin Bacon is extremely well connected. How can he reach the South Indian actor Sivakarthikeyan?
paths('Kevin Bacon', 'Sivakarthikeyan')

['Kevin Bacon - The Air I Breathe - Brendan Fraser - Line of Descent - Abhay Deol - Hero - Sivakarthikeyan']

In [26]:
# Do you remember N!xau from The Gods Must Be Crazy? Could he be connected with one of India's most cross-cultural actors, Ashish Vidyarthi?
paths('Ashish Vidyarthi', 'N!xau')

['Ashish Vidyarthi - Benaam - Mithun Chakraborty - CC2C - Chia-Hui Liu - Shaolin Warrior - Lung Chan - Crazy Safari - N!xau',
 "Ashish Vidyarthi - 12 O'Clock - Mithun Chakraborty - CC2C - Chia-Hui Liu - Shaolin Warrior - Lung Chan - Crazy Safari - N!xau",
 'Ashish Vidyarthi - Jole Jongole - Mithun Chakraborty - CC2C - Chia-Hui Liu - Shaolin Warrior - Lung Chan - Crazy Safari - N!xau',
 'Ashish Vidyarthi - Zindagi Khoobsoorat Hai - Rajit Kapoor - The Making of the Mahatma - Paul Slabolepszy - Saturday Night at the Palace - Marius Weyers - The Gods Must Be Crazy - N!xau',
 'Ashish Vidyarthi - Colours of Passion - Nandana Sen - Bokshu the Myth - Steven Berkoff - Charlie - Marius Weyers - The Gods Must Be Crazy - N!xau',
 'Ashish Vidyarthi - Agnee 2 - Amit Hasan - Nayok - Md Rafsan Jamil - Deep Bay of Bengal - Kent Cheng - The Gods Must Be Funny in China - N!xau',
 'Ashish Vidyarthi - Rokto - Amit Hasan - Nayok - Md Rafsan Jamil - Deep Bay of Bengal - Kent Cheng - The Gods Must Be Funny in

In [28]:
# How is N!xau connected with one of South India's most insular actresses, Gandhimathi?
paths('Gandhimathi', 'N!xau')[:5]

['Gandhimathi - Pathinaru Vayathinile - Kamal Haasan - Ladies Only - Seema Biswas - Cooking with Stella - Don McKellar - Meditation Park - Pei-Pei Cheng - The Gods Must Be Funny in China - N!xau',
 'Gandhimathi - Naan Avanillai - Kamal Haasan - Ladies Only - Seema Biswas - Cooking with Stella - Don McKellar - Meditation Park - Pei-Pei Cheng - The Gods Must Be Funny in China - N!xau',
 'Gandhimathi - Melnattu Marumagal - Kamal Haasan - Ladies Only - Seema Biswas - Cooking with Stella - Don McKellar - Meditation Park - Pei-Pei Cheng - The Gods Must Be Funny in China - N!xau',
 'Gandhimathi - Sattam En Kaiyil - Kamal Haasan - Ladies Only - Seema Biswas - Cooking with Stella - Don McKellar - Meditation Park - Pei-Pei Cheng - The Gods Must Be Funny in China - N!xau',
 'Gandhimathi - Unnai Sutrum Ulagam - Kamal Haasan - Ladies Only - Seema Biswas - Cooking with Stella - Don McKellar - Meditation Park - Pei-Pei Cheng - The Gods Must Be Funny in China - N!xau']

In [29]:
# Can we find 2 people that are NOT connected on this network?

# So far, no. We thought we had a promising start with Asad Dadarkar and N!xau.
# But that's not true.
# Asad Dadarkar acted in Dil Chahta Hai: https://www.imdb.com/title/tt0292490/
# But his name is not in title.principals.tsv.gz, since it's a list of primary cast, not a complete list.
# An extended list is available from https://contribute.imdb.com/dataset
# But only to people with 1000+ contributions in the last 360 days: https://community-imdb.sprinklr.com/conversations/data-issues-policy-discussions/imdb-data-now-easily-available-to-contributors/5f4a7a0d8815453dba963bbc
paths('Asad Dadarkar', 'N!xau')

IndexError: index 0 is out of bounds for axis 0 with size 0
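
Rather than guessing pairs, one could ask the graph directly: two actors are unreachable from each other exactly when they sit in different connected components. Here is a sketch on an invented toy graph (the IDs are placeholders), using `nx.connected_components` and `nx.has_path`:

```python
import networkx as nx

toy = nx.Graph()
toy.add_edges_from([
    ('tt1', 'nm1'), ('tt1', 'nm2'),   # nm1 and nm2 share film tt1
    ('tt2', 'nm2'), ('tt2', 'nm3'),   # nm2 and nm3 share film tt2
    ('tt3', 'nm4'), ('tt3', 'nm5'),   # nm4 and nm5 only share film tt3
])

# Largest component first; keep only the actor nodes in each component
components = sorted(nx.connected_components(toy), key=len, reverse=True)
actor_groups = [sorted(n for n in comp if n.startswith('nm')) for comp in components]
print(actor_groups)

# Actors in different components have no path between them
print(nx.has_path(toy, 'nm1', 'nm5'))
```

On the full graph `G`, any actor outside the largest component would be a candidate for an unconnected pair. For such a pair, `nx.shortest_path` raises `nx.NetworkXNoPath`.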

# Story ideas

In discussion with Srinivasan Ramani, The Hindu

- Who is the Kevin Bacon of Bollywood?
- "MGR never allowed Jayalalitha to act with others for a while until they broke up." Prove or disprove.
- Senthil needed to pair up with Goundamani in the latter years to get an acting chance.
- No one acts with Prashanth after his Malaysian visit. Or Vadivelu after his Vijayakanth visit.
- Arjun was a top star in Tamil. Then he had a bad patch -- where he took refuge in Kannada. Then he moved back.
- Vadivelu may have been a highly connected actor earlier, but fell over time.
- What about the Venkat Prabhu cluster?

Suggestions

- Build a tool for laymen to use
- Allow annotations & story forms for users to create their stories
- Allow embedding -- of visual and of story