In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
import numpy as np

The following files are imported from my personal Drive. They are all in the shared drive as well, so you can replace the paths with your own:

- `principals_filepath`: `Datasets/title_principals.tsv`
- `movies_filepath`: `Datasets/filtered_movies.csv`
- `cast_filepath`: `Datasets/movies_cast.csv`
- `names_filepath`: `Datasets/names_basic_data.tsv` -- essentially the same file, but I processed it locally and saved it as a CSV.
- `title_basic_filepath`: `Datasets/title_basic_data.tsv` -- as above, the same file, processed locally and saved as a CSV.

In [4]:
principals_filepath = '/content/drive/MyDrive/Year 4 (2022 - 2023)/Fall 2022/Computational Social Science/Project/title_principals.tsv'
movies_filepath = '/content/drive/MyDrive/Year 4 (2022 - 2023)/Fall 2022/Computational Social Science/Project/Movie details/filtered_movies.csv'
cast_filepath = '/content/drive/MyDrive/Year 4 (2022 - 2023)/Fall 2022/Computational Social Science/Project/Movie details/movies_cast.csv'
names_filepath = '/content/drive/MyDrive/Year 4 (2022 - 2023)/Fall 2022/Computational Social Science/Project/name_basics.csv'
title_basic_filepath = '/content/drive/MyDrive/Year 4 (2022 - 2023)/Fall 2022/Computational Social Science/Project/title_basic_data.csv'

# `title_principals.tsv`
This file is from IMDb and contains all the principal cast & crew listed for each title. Each row corresponds to a person with a specific role in a title. It has more than 52M rows, so proceed with caution.

In [9]:
p_big = pd.read_csv(principals_filepath, delimiter='\t')
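Since the full file has more than 52M rows, one way to stay within Colab's RAM might be to read it in chunks and keep only the rows of interest. The sketch below is a minimal illustration under assumptions: `filter_tsv_in_chunks` is a hypothetical helper (it is not part of the workflow in this notebook), and a tiny in-memory sample with made-up rows stands in for `title_principals.tsv`.

```python
import io

import pandas as pd


def filter_tsv_in_chunks(source, wanted_ids, chunksize=1_000_000):
    """Hypothetical helper: read a TSV piece by piece, keeping rows whose tconst is wanted."""
    wanted = set(wanted_ids)
    kept = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows each.
    for chunk in pd.read_csv(source, sep='\t', chunksize=chunksize):
        kept.append(chunk[chunk.tconst.isin(wanted)])
    return pd.concat(kept, ignore_index=True)


# Tiny stand-in for title_principals.tsv (same kind of columns, made-up rows).
sample_tsv = io.StringIO(
    "tconst\tordering\tnconst\tcategory\n"
    "tt0000001\t1\tnm0000001\tself\n"
    "tt0000002\t1\tnm0000002\tdirector\n"
    "tt0000003\t1\tnm0000003\tactor\n"
    "tt0000003\t2\tnm0000004\tactress\n"
)

small = filter_tsv_in_chunks(sample_tsv, ['tt0000003'], chunksize=2)
print(small)
```

With the real file, `source` would be `principals_filepath` and `wanted_ids` the list of movie IDs, so only the matching rows are ever held in memory at once.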

In [None]:
print(p_big.shape)
p_big.head()

(52116168, 6)


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N


# `filtered_movies.csv`
This file has the list of movies we finalized the other day (3587 movies in total). We won't use all of them; there are a few extra processing/filtering steps below. Some of the columns don't really matter, so ignore them for now.

In [None]:
movies = pd.read_csv(movies_filepath)
print(movies.shape)
movies.head()

(3587, 19)


Unnamed: 0,index,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tmdb_id,og_language,budget,revenue,ratio,ROI,averageRating,numVotes,Unnamed: 18
0,7363,tt1179904,movie,Paranormal Activity,Paranormal Activity,0,2007,,86,"Horror,Mystery",23827,en,230000,194183034,0.001184,844.274061,6.3,238846,Info says budget is $15k
1,4135,tt0390521,movie,Super Size Me,Super Size Me,0,2004,,100,Documentary,9372,en,65000,28575078,0.002275,439.616585,7.2,108792,
2,4907,tt2309260,movie,The Gallows,The Gallows,0,2015,,81,"Horror,Mystery,Thriller",299245,en,100000,42664410,0.002344,426.6441,4.2,20611,
3,8453,tt0876294,movie,Zombie Diaries,The Zombie Diaries,0,2006,,85,"Horror,Sci-Fi",25035,en,9300,2000000,0.00465,215.053763,4.0,4755,
4,9342,tt0907657,movie,Once,Once,0,2007,,86,"Drama,Music,Romance",5723,en,160000,20710513,0.007726,129.440706,7.8,116662,


`tconst` -- unique ID for the titles. Here I'm getting the IDs of all the finalized movies (our main dataset).

In [None]:
movie_ids = movies.tconst.to_list()
print(len(movie_ids))
movie_ids[:3]

3587


['tt1179904', 'tt0390521', 'tt2309260']

Back to the `p_big` (`title_principals.tsv`) for a bit. I'm trying to find the rows in `p_big` that correspond to the movies in our main dataset. The movie IDs in `p_big` are duplicated (one movie lists multiple principal cast & crew) so I'm taking unique IDs. This leaves us with >8M unique titles from `title_principals.tsv`.

In [None]:
p_movies = list(p_big.tconst.unique())
len(p_movies)

8309693

Get the overlap between `movie_ids` (IDs of movies from our main dataset) and `p_movies` (IDs of unique titles from `title_principals.tsv`). Two movies from our dataset are missing from `title_principals.tsv` entirely. There is no clear explanation for this.

In [None]:
overlap = list(set(movie_ids) & set(p_movies))
len(overlap), len(movie_ids)

(3585, 3587)

In [None]:
[i for i in movie_ids if i not in overlap]
# ['tt1233334', 'tt0273517']

['tt1233334', 'tt0273517']
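The overlap and the missing IDs can also be computed with a single set lookup that keeps the order of the movie list. This is only a sketch on toy data; the `_demo` names and IDs are made up for illustration.

```python
movie_ids_demo = ['tt0000010', 'tt0000020', 'tt0000030', 'tt0000040']      # toy IDs
principal_ids_demo = ['tt0000010', 'tt0000030', 'tt0000030', 'tt0000099']  # toy IDs, one duplicated

# Build the set once; each membership test is then a constant-time lookup.
principal_set = set(principal_ids_demo)

overlap_demo = [t for t in movie_ids_demo if t in principal_set]
missing_demo = [t for t in movie_ids_demo if t not in principal_set]

print(overlap_demo)  # IDs present in both lists
print(missing_demo)  # IDs that appear only in the movie list
```

Testing membership against a list rescans the list for every ID, which adds up with 3587 movies and millions of titles; a set avoids that.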

Now I keep only the rows of `p_big` that correspond to the movies in our dataset and save them to `principals`, which now has **all the principal cast & crew for the movies in our main dataset** (3585 of the 3587 movies, since two are missing from `title_principals.tsv`). `principals` currently has 35,665 rows.

In [None]:
principals = p_big[p_big.tconst.isin(movie_ids)]
print(principals.shape)

(35665, 6)


Continue filtering `principals` to keep only the rows that correspond to the cast (rows where `category` is `actor` or `actress`). This leaves us with 14,122 rows.

In [None]:
principals = principals[principals.category.isin(['actor', 'actress'])]
print(principals.shape)

(14122, 6)
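The two filters (by movie ID, then by category) can be combined into one boolean mask. A minimal sketch on toy data, where `demo` and its IDs are made up for illustration:

```python
import pandas as pd

demo = pd.DataFrame({
    'tconst':   ['tt1', 'tt1', 'tt2', 'tt3'],
    'nconst':   ['nm1', 'nm2', 'nm3', 'nm4'],
    'category': ['actor', 'director', 'actress', 'actor'],
})
wanted_titles = ['tt1', 'tt2']

# Both conditions must hold: the title is wanted AND the person is cast.
mask = demo.tconst.isin(wanted_titles) & demo.category.isin(['actor', 'actress'])
cast_rows = demo[mask].copy()  # .copy() gives an independent frame that is safe to modify later
print(cast_rows)
```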


At this point I will run out of RAM soon, so I save `principals`. This file is available in the shared folder: `Datasets/New processed data/principals.csv`.

In [None]:
principals.to_csv('principals.csv', index=False)
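On the RAM issue: columns with only a handful of distinct values, such as `category`, can be stored more compactly with the pandas `category` dtype. The sketch below only demonstrates the effect on toy data (`demo` is made up); whether it would be enough to avoid the RAM limit here is untested.

```python
import pandas as pd

# 10,000 toy rows with four repeating values, standing in for the category column.
demo = pd.DataFrame({'category': ['actor', 'actress', 'director', 'self'] * 2500})

before = demo.memory_usage(deep=True).sum()
demo['category'] = demo['category'].astype('category')  # small integer codes plus a lookup table
after = demo.memory_usage(deep=True).sum()

print(before, after)
```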

`principals` now covers 3540 unique movies, which is fewer than the number of movies in our main dataset (3587). The other 47 movies either don't list anyone as an actor/actress, which is usually the case for documentaries, or (for two of them) are missing from `title_principals.tsv` altogether.

In [None]:
principals.tconst.unique().shape

(3540,)

Reminder: this is what `principals` currently looks like. For each movie, each principal actor/actress has their own row. Next, I want to collapse this dataframe by grouping the rows that have the same `tconst` -- the same movie. All the IDs (`nconst`) of the principal actors/actresses for that movie will be appended to a list.

In [None]:
principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]"
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]"
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]"
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]"
1013071,tt0118589,1,nm0001014,actress,\N,"[""Billie Frank""]"


That collapsed dataframe is saved to `people_list`. Now, each movie has a single row containing the list of IDs of its principal cast. The length of this dataframe is 3540, same as the number of unique movie IDs from `principals` above.

In [None]:
people_list = pd.DataFrame(principals.groupby('tconst')['nconst'].apply(list)).reset_index()
print(people_list.shape)
people_list.head()

(3540, 2)


Unnamed: 0,tconst,nconst
0,tt0035423,"[nm0000212, nm0413168, nm0000630, nm0005227]"
1,tt0118589,"[nm0001014, nm0073160, nm0066586, nm0004771]"
2,tt0120467,"[nm0641168, nm0424682, nm0507915, nm0839486]"
3,tt0120667,"[nm0344435, nm0004821, nm0262635, nm0004695]"
4,tt0120679,"[nm0000161, nm0000547, nm0001691, nm0535502]"
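To make the collapse step concrete, here is a small sketch of the same group-and-collect pattern on toy data (the frame and IDs are made up for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt2', 'tt1', 'tt2'],
    'nconst': ['nm1', 'nm2', 'nm3', 'nm4', 'nm5'],
})

# One row per movie; the people IDs are gathered into a list in their original order.
collapsed = demo.groupby('tconst')['nconst'].agg(list).reset_index()
print(collapsed)
```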


Inspect the movies with no actor/actress listed. There are 47 such movies. You can expand the `genres` column a bit to see the individual genres distribution here.

In [None]:
cast_movie_ids = set(people_list.tconst)  # build the lookup once instead of once per movie
missing_cast = [i for i in movie_ids if i not in cast_movie_ids]
print(len(missing_cast))
movies[movies.tconst.isin(missing_cast)].shape

47


(47, 19)

In [None]:
movies[movies.tconst.isin(missing_cast)].head()

Unnamed: 0,index,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tmdb_id,og_language,budget,revenue,ratio,ROI,averageRating,numVotes,Unnamed: 18
1,4135,tt0390521,movie,Super Size Me,Super Size Me,0,2004,,100,Documentary,9372,en,65000,28575078,0.002275,439.616585,7.2,108792,
10,4492,tt1584016,movie,Catfish,Catfish,0,2010,,87,"Documentary,Drama,Mystery",42296,en,30000,3045943,0.009849,101.531433,7.1,43004,
29,911,tt0361596,movie,Fahrenheit 9/11,Fahrenheit 9/11,0,2004,,122,"Documentary,Drama,War",1777,en,6000000,222446882,0.026973,37.07448,7.5,129696,
49,10878,tt0343121,movie,Tupac: Resurrection,Tupac: Resurrection,0,2003,,112,"Biography,Documentary,Music",21525,en,300000,7808524,0.03842,26.028413,7.9,9573,
124,9108,tt0322802,movie,Jackass: The Movie,Jackass: The Movie,0,2002,,85,"Action,Comedy,Documentary",9012,en,5000000,64282312,0.077782,12.856462,6.6,95248,
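The same check can be expressed as a join: a left merge with `indicator=True` marks which movies found no match in the cast table. A minimal sketch on toy data, with made-up `_demo` frames:

```python
import pandas as pd

movies_demo = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'],
                            'primaryTitle': ['A', 'B', 'C']})
people_demo = pd.DataFrame({'tconst': ['tt1', 'tt3'],
                            'nconst': [['nm1', 'nm2'], ['nm3']]})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join.
checked = movies_demo.merge(people_demo[['tconst']], on='tconst',
                            how='left', indicator=True)
no_cast = checked.loc[checked['_merge'] == 'left_only', 'tconst'].tolist()
print(no_cast)
```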


So we drop these movies from our list. The remaining dataset has 3540 movies.

In [None]:
movies = movies[~movies.tconst.isin(missing_cast)]
print(movies.shape)
movies.head()

(3540, 19)


Unnamed: 0,index,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tmdb_id,og_language,budget,revenue,ratio,ROI,averageRating,numVotes,Unnamed: 18
0,7363,tt1179904,movie,Paranormal Activity,Paranormal Activity,0,2007,,86,"Horror,Mystery",23827,en,230000,194183034,0.001184,844.274061,6.3,238846,Info says budget is $15k
2,4907,tt2309260,movie,The Gallows,The Gallows,0,2015,,81,"Horror,Mystery,Thriller",299245,en,100000,42664410,0.002344,426.6441,4.2,20611,
3,8453,tt0876294,movie,Zombie Diaries,The Zombie Diaries,0,2006,,85,"Horror,Sci-Fi",25035,en,9300,2000000,0.00465,215.053763,4.0,4755,
4,9342,tt0907657,movie,Once,Once,0,2007,,86,"Drama,Music,Romance",5723,en,160000,20710513,0.007726,129.440706,7.8,116662,
5,2405,tt0374900,movie,Napoleon Dynamite,Napoleon Dynamite,0,2004,,96,Comedy,8193,en,400000,46118097,0.008673,115.295243,6.9,218880,


Reminder: we have the `people_list` dataframe in which each row has a movie ID and a list of actor/actress IDs.

In [None]:
people_list.head()

Unnamed: 0,tconst,nconst
0,tt0035423,"[nm0000212, nm0413168, nm0000630, nm0005227]"
1,tt0118589,"[nm0001014, nm0073160, nm0066586, nm0004771]"
2,tt0120467,"[nm0641168, nm0424682, nm0507915, nm0839486]"
3,tt0120667,"[nm0344435, nm0004821, nm0262635, nm0004695]"
4,tt0120679,"[nm0000161, nm0000547, nm0001691, nm0535502]"


Merge `movies` with `people_list` on `tconst`. Now each movie has its basic information as well as a list of its principal cast (IDs only, not actual names). Save this to `movies_cast_2`.

In [None]:
movies_cast_2 = pd.merge(movies, people_list, on='tconst', how='left')
movies_cast_2

Unnamed: 0,index,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tmdb_id,og_language,budget,revenue,ratio,ROI,averageRating,numVotes,Unnamed: 18,nconst
0,7363,tt1179904,movie,Paranormal Activity,Paranormal Activity,0,2007,,86,"Horror,Mystery",23827,en,230000,194183034,0.001184,844.274061,6.3,238846,Info says budget is $15k,"[nm2209370, nm2913790, nm2104166, nm2910808]"
1,4907,tt2309260,movie,The Gallows,The Gallows,0,2015,,81,"Horror,Mystery,Thriller",299245,en,100000,42664410,0.002344,426.644100,4.2,20611,,"[nm3790547, nm6018502, nm4090362, nm2258992]"
2,8453,tt0876294,movie,Zombie Diaries,The Zombie Diaries,0,2006,,85,"Horror,Sci-Fi",25035,en,9300,2000000,0.004650,215.053763,4.0,4755,,"[nm0014714, nm2428131, nm2408572, nm0286170]"
3,9342,tt0907657,movie,Once,Once,0,2007,,86,"Drama,Music,Romance",5723,en,160000,20710513,0.007726,129.440706,7.8,116662,,"[nm0360598, nm2461627, nm2008013, nm2478608]"
4,2405,tt0374900,movie,Napoleon Dynamite,Napoleon Dynamite,0,2004,,96,Comedy,8193,en,400000,46118097,0.008673,115.295243,6.9,218880,,"[nm1417647, nm0708293, nm0340973, nm1088052]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3535,10702,tt1935194,movie,Overdrive,Overdrive,0,2017,,93,"Action,Adventure,Crime",404733,en,28210303,7793,3619.954190,0.000276,5.3,14982,,"[nm2207222, nm7188414, nm1869101, nm4719349]"
3536,414,tt1376213,movie,The Adventurer: The Curse of the Midas Box,The Adventurer: The Curse of the Midas Box,0,2013,,100,"Adventure,Family,Fantasy",227707,en,25000000,6399,3906.860447,0.000256,5.4,6779,,"[nm2902567, nm0790688, nm0372176, nm0000554]"
3537,9979,tt7869070,movie,The Fanatic,The Fanatic,0,2019,,88,"Crime,Thriller",509853,en,18000000,3153,5708.848716,0.000175,4.2,11981,,"[nm0000237, nm0001701, nm3132111, nm3626674]"
3538,3185,tt0254703,movie,The Point Men,The Point Men,0,2001,,100,"Action,Drama",46943,en,6100000,714,8543.417367,0.000117,4.3,1161,,"[nm0000483, nm0289098, nm0716490, nm0531229]"


Inspect the dataframe: most movies list 4 principal cast members. The minimum is 1 and the maximum is 10 actors/actresses.

In [None]:
movies_cast_2.nconst.apply(len).describe()

count    3540.000000
mean        3.989266
std         0.298623
min         1.000000
25%         4.000000
50%         4.000000
75%         4.000000
max        10.000000
dtype: float64

Again, save these locally. I have already uploaded them to the shared folder:
- `people_list` -- `Datasets/New processed data/people_list.csv`
- `movies_cast_2` -- `Datasets/New processed data/movies_cast_2.csv`

In [None]:
people_list.to_csv('people_list.csv', index=False)

In [None]:
movies_cast_2.to_csv('movies_cast_2.csv', index=False)
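One thing to watch when reloading these files: a column of Python lists is written to CSV as text, so it comes back as strings unless it is parsed again. The sketch below shows one way to handle that with `ast.literal_eval`; the toy frame is made up and the round trip goes through an in-memory string rather than a real file.

```python
import ast
import io

import pandas as pd

demo = pd.DataFrame({'tconst': ['tt1', 'tt2'],
                     'nconst': [['nm1', 'nm2'], ['nm3']]})

csv_text = demo.to_csv(index=False)  # each list is written as its string form

# Plain read: the nconst column holds strings like "['nm1', 'nm2']".
as_text = pd.read_csv(io.StringIO(csv_text))

# With a converter, each cell is parsed back into a real list.
as_lists = pd.read_csv(io.StringIO(csv_text),
                       converters={'nconst': ast.literal_eval})

print(type(as_text.nconst[0]), type(as_lists.nconst[0]))
```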

Reminder: `principals` has 14,122 rows, corresponding to all principal cast members in our dataset of 3540 movies.

In [None]:
print(principals.shape)
principals.head()

(14122, 6)


Unnamed: 0,tconst,ordering,nconst,category,job,characters
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]"
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]"
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]"
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]"
1013071,tt0118589,1,nm0001014,actress,\N,"[""Billie Frank""]"


Across all these roles there are 5273 unique actors. Save the unique actor IDs to `casts`.

In [None]:
casts = list(principals.nconst.unique())
len(casts)

5273

# `name_basics.tsv`
This file is from IMDb and has the basic information about the actors. We need it to get their names (so far we only have their IDs -- `nconst`). It has about 12M rows, so proceed with caution.

In [None]:
names = pd.read_csv(names_filepath)

In [None]:
print(names.shape)
names.head()

(11931746, 6)


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0072308,tt0053137,tt0031983,tt0050419"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0071877,tt0117057"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0054452,tt0057345,tt0049189,tt0056404"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0077975,tt0078723,tt0072562,tt0080455"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0083922,tt0060827,tt0050976,tt0050986"


Above I got a list of all unique actors relevant to the movies dataset (`casts`, with 5273 unique IDs). Here I filter the `names` dataframe to keep only rows of the relevant actors, save it to `names_rel` (relevant names).

In [None]:
names_rel = names[names.nconst.isin(casts)]
print(names_rel.shape)
names_rel.head()

(5272, 6)


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0071877,tt0117057"
7,nm0000008,Marlon Brando,1924,2004,"actor,soundtrack,director","tt0070849,tt0078788,tt0068646,tt0047296"
83,nm0000084,Gong Li,1965,\N,"actress,producer","tt0473444,tt0430357,tt0397535,tt0101640"
89,nm0000090,Armin Mueller-Stahl,1930,\N,"actor,director,writer","tt0117631,tt0765443,tt0963178,tt0139809"
91,nm0000092,John Cleese,1939,\N,"actor,writer,producer","tt0071853,tt0063929,tt0085959,tt0095159"


`casts` has length 5273 and `names_rel` has length 5272, so one name disappeared in between. I'm trying to find that person here.

In [None]:
known_ids = set(names_rel.nconst)  # build the lookup once instead of once per ID
[i for i in casts if i not in known_ids]

['nm1789985']

As we said before, this is supposed to be Jon Daly (https://www.imdb.com/name/nm1789985/). I have no idea why he is completely absent from the `names` dataframe (remember, this dataframe is the IMDb dataset imported directly here, with no changes).

In [None]:
names[names.nconst == 'nm1789985']

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles


Again, export the file. It is also on Drive: `Datasets/New processed data/relevant_names.csv`, and it has the basic info of all actors relevant to the main movies dataset (5272 actors involved in 3540 movies).

In [None]:
names_rel.to_csv('relevant_names.csv', index=False)

# Getting previous movies of the actors

In [5]:
# New session
names_rel_path = '/content/drive/MyDrive/Year 4 (2022 - 2023)/Fall 2022/Computational Social Science/Project/New processed data/relevant_names.csv'

In [6]:
names_rel = pd.read_csv(names_rel_path)
print(names_rel.shape)
names_rel.head()

(5272, 6)


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0071877,tt0117057"
1,nm0000008,Marlon Brando,1924,2004,"actor,soundtrack,director","tt0070849,tt0078788,tt0068646,tt0047296"
2,nm0000084,Gong Li,1965,\N,"actress,producer","tt0473444,tt0430357,tt0397535,tt0101640"
3,nm0000090,Armin Mueller-Stahl,1930,\N,"actor,director,writer","tt0117631,tt0765443,tt0963178,tt0139809"
4,nm0000092,John Cleese,1939,\N,"actor,writer,producer","tt0071853,tt0063929,tt0085959,tt0095159"


In [7]:
cast_ids = names_rel.nconst.to_list()
print(len(cast_ids))
cast_ids[:5]

5272


['nm0000002', 'nm0000008', 'nm0000084', 'nm0000090', 'nm0000092']

In [10]:
prev_roles = p_big[p_big.nconst.isin(cast_ids)]
print(prev_roles.shape)
prev_roles.head()

(769942, 6)


Unnamed: 0,tconst,ordering,nconst,category,job,characters
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]"
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]"
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]"
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]"
285190,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]"


In [11]:
prev_roles.category.value_counts()

actor                  296944
self                   272802
actress                159281
archive_footage         18759
writer                  10747
director                 7491
producer                 2913
cinematographer           367
composer                  280
archive_sound             191
editor                    151
production_designer        16
Name: category, dtype: int64

In [None]:
prev_roles.to_csv('prev_roles.csv', index=False)

In [None]:
# After session crashed
prev_roles = pd.read_csv('./prev_roles.csv')
print(prev_roles.shape)
prev_roles.head()

(769942, 6)


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]"
1,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]"
2,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]"
3,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]"
4,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]"


In [12]:
prev_roles.tconst.unique().shape

(511566,)

In [13]:
title_basic = pd.read_csv(title_basic_filepath)
print(title_basic.shape)
title_basic.head()

(9233338, 9)


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894.0,,1.0,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892.0,,5.0,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892.0,,4.0,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892.0,,12.0,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893.0,,1.0,"Comedy,Short"


We need to filter `prev_roles` to keep only the rows for movies. After that we also need to filter by number of votes.

In [14]:
title_type_lookup = dict(zip(title_basic.tconst, title_basic.titleType))

In [15]:
def title_lookup(tconst):
  # dict.get returns None for IDs that are not in title_basic
  return title_type_lookup.get(tconst)

print(title_lookup('tt0000001'))
print(title_lookup('tt0086748'))

short
None


In [16]:
# .assign returns a new DataFrame, which avoids writing into a slice of p_big
prev_roles = prev_roles.assign(titleType=prev_roles.tconst.map(title_lookup))
prev_roles.head()


Unnamed: 0,tconst,ordering,nconst,category,job,characters,titleType
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]",movie
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]",movie
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]",movie
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]",movie
285190,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]",movie
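The lookup step can also be written without a helper function, by passing the dictionary straight to `Series.map`. A minimal sketch on toy data (the `_demo` frames and IDs are made up for illustration):

```python
import pandas as pd

titles_demo = pd.DataFrame({'tconst': ['tt1', 'tt2'],
                            'titleType': ['movie', 'short']})
roles_demo = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt9'],
                           'nconst': ['nm1', 'nm2', 'nm3']})

type_by_id = dict(zip(titles_demo.tconst, titles_demo.titleType))

# IDs missing from the dictionary become NaN; roles_demo itself is left unchanged.
roles_typed = roles_demo.assign(titleType=roles_demo.tconst.map(type_by_id))
print(roles_typed)
```

Rows with a missing title type drop out naturally in a later `titleType == 'movie'` filter, since a missing value never equals `'movie'`.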


In [17]:
prev_movie_roles = prev_roles[prev_roles.titleType == 'movie']
print(prev_movie_roles.shape)
prev_movie_roles.head()

(80787, 7)


Unnamed: 0,tconst,ordering,nconst,category,job,characters,titleType
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]",movie
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]",movie
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]",movie
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]",movie
285190,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]",movie


In [32]:
prev_movie_roles.to_csv('prev_movie_roles.csv', index=False)

In [18]:
prev_movie_roles.category.value_counts()

actor                  48977
actress                23984
self                    3440
producer                1617
director                1434
writer                   801
archive_footage          412
composer                  56
cinematographer           46
editor                    18
production_designer        2
Name: category, dtype: int64

In [30]:
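# Note: this cell was executed after the startYear, averageRatings and numVotes columns were added below (cell 30), which is why the output has 10 columns
prev_movie_roles = prev_movie_roles[prev_movie_roles.category.isin(['actor', 'actress'])]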
print(prev_movie_roles.shape)
prev_movie_roles.head()

(72961, 10)


Unnamed: 0,tconst,ordering,nconst,category,job,characters,titleType,startYear,averageRatings,numVotes
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]",movie,2001.0,6.4,84717.0
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]",movie,2001.0,6.4,84717.0
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]",movie,2001.0,6.4,84717.0
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]",movie,2001.0,6.4,84717.0
285190,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]",movie,1943.0,6.1,271.0


In [None]:
relevant_names = pd.read_csv('./relevant_names.csv')
print(relevant_names.shape)
relevant_names.head()

(5272, 6)


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0071877,tt0117057"
1,nm0000008,Marlon Brando,1924,2004,"actor,soundtrack,director","tt0070849,tt0078788,tt0068646,tt0047296"
2,nm0000084,Gong Li,1965,\N,"actress,producer","tt0473444,tt0430357,tt0397535,tt0101640"
3,nm0000090,Armin Mueller-Stahl,1930,\N,"actor,director,writer","tt0117631,tt0765443,tt0963178,tt0139809"
4,nm0000092,John Cleese,1939,\N,"actor,writer,producer","tt0071853,tt0063929,tt0085959,tt0095159"


In [None]:
principals = pd.read_csv('principals.csv')
print(principals.shape)
principals.head()

(14122, 6)


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]"
1,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]"
2,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]"
3,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]"
4,tt0118589,1,nm0001014,actress,\N,"[""Billie Frank""]"


In [19]:
year_lookup_dict = dict(zip(title_basic.tconst, title_basic.startYear))

def lookup_year(tconst):
  # dict.get returns None for IDs that are not in title_basic
  return year_lookup_dict.get(tconst)

print(lookup_year('tt0000001'))
print(lookup_year('tt0086748'))

1894.0
None


In [20]:
# .assign returns a new DataFrame, which avoids writing into a slice of prev_roles
prev_movie_roles = prev_movie_roles.assign(startYear=prev_movie_roles.tconst.map(lookup_year))
prev_movie_roles.head()


Unnamed: 0,tconst,ordering,nconst,category,job,characters,titleType,startYear
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]",movie,2001.0
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]",movie,2001.0
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]",movie,2001.0
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]",movie,2001.0
285190,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]",movie,1943.0


Attach the average rating and number of votes to each previous movie first, then filter by number of votes.

In [21]:
ratings_filepath = '/content/drive/MyDrive/Year 4 (2022 - 2023)/Fall 2022/Computational Social Science/Project/title_ratings_data.csv'
ratings = pd.read_csv(ratings_filepath)
print(ratings.shape)
ratings.head()

(1260951, 3)


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1911
1,tt0000002,5.8,256
2,tt0000003,6.5,1714
3,tt0000004,5.6,169
4,tt0000005,6.2,2529


In [22]:
ratings_dict = ratings.set_index('tconst').to_dict(orient='index')

In [23]:
def lookup_rating(tconst):
  # Titles without a ratings entry yield None
  return ratings_dict.get(tconst, {}).get('averageRating')

def lookup_votes(tconst):
  # Titles without a ratings entry yield None
  return ratings_dict.get(tconst, {}).get('numVotes')

print(lookup_rating('tt0162222'))
print(lookup_votes('tt0162222'))

7.8
586450


In [24]:
# .assign returns a new DataFrame, which avoids writing into a slice
prev_movie_roles = prev_movie_roles.assign(
  averageRatings=prev_movie_roles.tconst.map(lookup_rating),
  numVotes=prev_movie_roles.tconst.map(lookup_votes),
)
prev_movie_roles.head()


Unnamed: 0,tconst,ordering,nconst,category,job,characters,titleType,startYear,averageRatings,numVotes
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]",movie,2001.0,6.4,84717.0
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]",movie,2001.0,6.4,84717.0
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]",movie,2001.0,6.4,84717.0
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]",movie,2001.0,6.4,84717.0
285190,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]",movie,1943.0,6.1,271.0
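An alternative to per-row dictionary lookups is a single left join against the ratings table, which attaches both columns at once. The sketch below uses toy frames (`roles_demo` and `ratings_demo` are made up) and then applies the same vote threshold of 125 used in this notebook.

```python
import pandas as pd

roles_demo = pd.DataFrame({'tconst': ['tt1', 'tt1', 'tt2', 'tt9'],
                           'nconst': ['nm1', 'nm2', 'nm1', 'nm3']})
ratings_demo = pd.DataFrame({'tconst': ['tt1', 'tt2'],
                             'averageRating': [6.4, 6.1],
                             'numVotes': [84717, 271]})

# Left join: every role is kept; titles without ratings get NaN in both new columns.
rated = roles_demo.merge(ratings_demo, on='tconst', how='left')

# NaN > 125 is False, so unrated titles are dropped by the threshold as well.
significant = rated[rated.numVotes > 125]
print(significant)
```

Note that `merge` returns a fresh 0..n-1 index, so the original row labels from `p_big` would not be carried over.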


In [31]:
prev_significant_roles = prev_movie_roles[prev_movie_roles.numVotes > 125]
print(prev_significant_roles.shape)
prev_significant_roles.head()

(56066, 10)


Unnamed: 0,tconst,ordering,nconst,category,job,characters,titleType,startYear,averageRatings,numVotes
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]",movie,2001.0,6.4,84717.0
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]",movie,2001.0,6.4,84717.0
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]",movie,2001.0,6.4,84717.0
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]",movie,2001.0,6.4,84717.0
285190,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]",movie,1943.0,6.1,271.0


In [33]:
prev_significant_roles.to_csv('prev_significant_roles.csv', index=False)

In [34]:
def get_cast_movies_before_year(nconst, year):
  # Titles, ratings and vote counts of this person's movies released strictly before `year`
  query = prev_significant_roles[(prev_significant_roles.nconst == nconst) & (prev_significant_roles.startYear < year)]
  return query.tconst.to_list(), query.averageRatings.to_list(), query.numVotes.to_list()

In [35]:
t_hanks = 'nm0000158'
ca_year = 2000 # Cast Away
# get_cast_movies_before_year(t_hanks, ca_year)
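To show what the lookup returns, here is a sketch on toy data. `movies_before_year` is a hypothetical variant that takes the roles frame as an argument instead of reading the global `prev_significant_roles`; the frame and its values are made up.

```python
import pandas as pd

roles_demo = pd.DataFrame({
    'nconst':         ['nm1', 'nm1', 'nm1', 'nm2'],
    'tconst':         ['tt1', 'tt2', 'tt3', 'tt2'],
    'startYear':      [1995.0, 1998.0, 2000.0, 1998.0],
    'averageRatings': [7.0, 6.5, 7.8, 6.5],
    'numVotes':       [1000.0, 500.0, 90000.0, 500.0],
})


def movies_before_year(roles, nconst, year):
    """Hypothetical variant of the lookup that takes the roles frame as an argument."""
    query = roles[(roles.nconst == nconst) & (roles.startYear < year)]
    return query.tconst.to_list(), query.averageRatings.to_list(), query.numVotes.to_list()


titles, ratings_before, votes_before = movies_before_year(roles_demo, 'nm1', 2000)
print(titles, ratings_before, votes_before)
```

Because the comparison is strict (`<`), movies released in the same year as the reference movie are not counted as prior work.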

In [36]:
# prev_significant_roles['prior_movies'] = prev_significant_roles.apply(lambda x: get_cast_movies_before_year(x.nconst, x.startYear), axis=1)
# prev_significant_roles.head()

In [37]:
prev_tconst = []
prev_ratings = []
prev_numvotes = []
for i, row in prev_significant_roles.iterrows():
  nconst, year = row.nconst, row.startYear
  tconst_list, ratings_list, numvotes_list = get_cast_movies_before_year(nconst, year)
  prev_tconst.append(tconst_list)
  prev_ratings.append(ratings_list)
  prev_numvotes.append(numvotes_list)
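The loop above scans the whole 56,066-row frame once for every row. One possible speed-up, sketched here on toy data, is to split the frame by actor once and then filter only that actor's rows. The `_demo` names are made up and this variant has not been run on the real data.

```python
import pandas as pd

roles_demo = pd.DataFrame({
    'nconst':         ['nm1', 'nm2', 'nm1', 'nm1', 'nm2'],
    'tconst':         ['tt1', 'tt1', 'tt2', 'tt3', 'tt3'],
    'startYear':      [1995.0, 1995.0, 1998.0, 2000.0, 2000.0],
    'averageRatings': [7.0, 7.0, 6.5, 7.8, 7.8],
    'numVotes':       [1000.0, 1000.0, 500.0, 90000.0, 90000.0],
})

# Split the frame once per actor instead of scanning every row for every lookup.
by_actor = {nconst: group for nconst, group in roles_demo.groupby('nconst')}

prev_tconst_demo, prev_ratings_demo, prev_numvotes_demo = [], [], []
for row in roles_demo.itertuples(index=False):
    own = by_actor[row.nconst]                     # only this actor's roles
    earlier = own[own.startYear < row.startYear]   # strictly earlier movies
    prev_tconst_demo.append(earlier.tconst.to_list())
    prev_ratings_demo.append(earlier.averageRatings.to_list())
    prev_numvotes_demo.append(earlier.numVotes.to_list())

result = roles_demo.assign(prev_movies=prev_tconst_demo,
                           prev_movies_ratings=prev_ratings_demo,
                           prev_movies_numvotes=prev_numvotes_demo)
print(result[['nconst', 'tconst', 'prev_movies']])
```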

In [38]:
# .assign returns a new DataFrame, which avoids writing into a slice of prev_movie_roles
prev_significant_roles = prev_significant_roles.assign(
  prev_movies=prev_tconst,
  prev_movies_ratings=prev_ratings,
  prev_movies_numvotes=prev_numvotes,
)
prev_significant_roles.head()


Unnamed: 0,tconst,ordering,nconst,category,job,characters,titleType,startYear,averageRatings,numVotes,prev_movies,prev_movies_ratings,prev_movies_numvotes
282506,tt0035423,1,nm0000212,actress,\N,"[""Kate McKay""]",movie,2001.0,6.4,84717.0,"[tt0090660, tt0093260, tt0094933, tt0095897, t...","[5.7, 6.8, 6.1, 5.9, 5.7, 7.7, 5.9, 7.2, 5.6, ...","[11804.0, 60197.0, 10395.0, 18751.0, 1367.0, 2..."
282507,tt0035423,2,nm0413168,actor,\N,"[""Leopold""]",movie,2001.0,6.4,84717.0,"[tt0120903, tt0180037, tt0212936]","[7.3, 6.4, 6.3]","[612256.0, 2228.0, 610.0]"
282508,tt0035423,3,nm0000630,actor,\N,"[""Stuart Besser""]",movie,2001.0,6.4,84717.0,"[tt0116041, tt0119296, tt0119891, tt0134084, t...","[7.0, 5.3, 5.4, 5.6, 7.5, 7.6]","[5825.0, 142.0, 21117.0, 142872.0, 1476.0, 972..."
282509,tt0035423,4,nm0005227,actor,\N,"[""Charlie McKay""]",movie,2001.0,6.4,84717.0,"[tt0118925, tt0119937, tt0122743, tt0215129]","[6.7, 6.8, 4.8, 6.5]","[2533.0, 7359.0, 823.0, 169530.0]"
285190,tt0035726,3,nm0060931,actress,\N,"[""Mary Watson""]",movie,1943.0,6.1,271.0,[],[],[]


In [39]:
prev_significant_roles.shape

(56066, 13)

In [40]:
prev_significant_roles.to_csv('prev_significant_roles_3.csv', index=False)

In [44]:
newfile_path = '/content/drive/MyDrive/Year 4 (2022 - 2023)/Fall 2022/Computational Social Science/Project/New processed data/Newer data'
# Copy into the Drive folder, keeping the file name (IPython expands {newfile_path} in shell commands)
!cp prev_significant_roles_3.csv "{newfile_path}/"