# **Data 620 Project 1 Proposal**
Seung Min Song, Krutika Patel<br>

02/18/2024

# **Overview**

Centrality measures can be used to predict (positive or negative) outcomes for a node.

Your task in this week’s assignment is to identify an interesting set of network data that is available on the web (either through web scraping or web APIs) that could be used for analyzing and comparing centrality measures across nodes. As an additional constraint, there should be at least one categorical variable available for each node (such as “Male” or “Female”; “Republican”, “Democrat”, or “Undecided”, etc.).

In addition to identifying your data source, you should create a high-level plan that describes how you would load the data for analysis, and describe a hypothetical outcome that could be predicted from comparing degree centrality across categorical groups.

# **Research Question**

This project explores collaborations between actors and how those collaborations relate to film genres and directors. We plan to examine these relationships using three network metrics: degree centrality, closeness centrality, and betweenness centrality.
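
As a rough illustration of comparing degree centrality across categorical groups, the sketch below computes normalized degree centrality on a tiny hand-made graph and averages it by category. The node names, edges, and group labels are invented for illustration and are not drawn from the IMDb data. The helper functions are hypothetical; in the actual project a library call such as NetworkX's `nx.degree_centrality` would likely replace the hand-written version.

```python
from collections import defaultdict

# Hypothetical toy data: undirected edges and one category label per node.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
group = {"A": "Action", "B": "Action", "C": "Drama", "D": "Drama", "E": "Drama"}

def degree_centrality(edge_list):
    """Return degree / (n - 1) for every node in an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edge_list:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

def mean_by_group(values, labels):
    """Average a per-node value within each category."""
    buckets = defaultdict(list)
    for node, value in values.items():
        buckets[labels[node]].append(value)
    return {g: sum(vals) / len(vals) for g, vals in buckets.items()}

centrality = degree_centrality(edges)
group_means = mean_by_group(centrality, group)
print(group_means)
```

A difference in the group means would be the kind of hypothetical outcome the assignment asks for, for example whether actors associated with one genre tend to have more collaborators than actors associated with another.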

# **Data Source**

For this project we propose to use the following two datasets:

* IMDB Films By Actor For 10K Actors https://www.kaggle.com/datasets/darinhawley/imdb-films-by-actor-for-10k-actors
    * The dataset has the following attributes:
        * Actor, ActorID, Film, Year, Votes, Rating, FilmID
* IMDb Movie Dataset: All Movies by Genre https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre
    * The dataset has the following attributes:
        * movie_id, movie_name, year, certificate, runtime, genre, rating, description, director, director_id, star, star_id, votes, gross(in $)

# **Network**

The network will be a two-mode (bipartite) graph with two types of nodes: Movie and Actor. An edge between an actor and a movie indicates that the actor appeared in that film.
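
One way to picture this structure: each (actor, film) row becomes an edge between an actor node and a movie node, and two actors are linked in a projected collaboration network when they share at least one film. The sketch below uses a handful of names and film IDs that appear in the data previews in this notebook; the variable names are assumptions, and it uses only the standard library rather than any particular graph package.

```python
from collections import defaultdict
from itertools import combinations

# (actor, film_id) pairs; each pair is one edge of the two-mode network.
rows = [
    ("Fred Astaire", "tt0072308"),
    ("Paul Newman", "tt0072308"),
    ("Steve McQueen", "tt0072308"),
    ("Fred Astaire", "tt0082449"),
]

# Movie side: group the cast by film.
cast_by_film = defaultdict(set)
for actor, film_id in rows:
    cast_by_film[film_id].add(actor)

# Actor side: count films per actor (assumes no duplicate rows).
films_per_actor = defaultdict(int)
for actor, _ in rows:
    films_per_actor[actor] += 1

# Projection onto actors: edge weight = number of shared films.
collab_weight = defaultdict(int)
for cast in cast_by_film.values():
    for a, b in combinations(sorted(cast), 2):
        collab_weight[(a, b)] += 1

print(dict(collab_weight))
```

Keeping the two-mode edge list and the actor projection separate makes it possible to compute centrality on either view later.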


# **Data Wrangling**

The wrangling steps include:

* Merging the datasets
* Removing duplicate and unrelated columns

To create a single dataset, the two datasets will be joined on the 'FilmID' and 'movie_id' attributes, which hold the same film identifiers in both datasets. The cells below load only the Action genre file (action.csv) from the second dataset and use an inner join, so only films present in both tables are kept. Each row of the merged dataset will have the following values: Actor, ActorID, Film, Year, Votes, Rating, FilmID, certificate, runtime, genre, and director.

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset

actors = pd.read_csv('actorfilms.csv')
actors.head()

| | Actor | ActorID | Film | Year | Votes | Rating | FilmID |
|---|---|---|---|---|---|---|---|
| 0 | Fred Astaire | nm0000001 | Ghost Story | 1981 | 7731 | 6.3 | tt0082449 |
| 1 | Fred Astaire | nm0000001 | The Purple Taxi | 1977 | 533 | 6.6 | tt0076851 |
| 2 | Fred Astaire | nm0000001 | The Amazing Dobermans | 1976 | 369 | 5.3 | tt0074130 |
| 3 | Fred Astaire | nm0000001 | The Towering Inferno | 1974 | 39888 | 7.0 | tt0072308 |
| 4 | Fred Astaire | nm0000001 | Midas Run | 1969 | 123 | 4.8 | tt0064664 |


In [35]:
# Load the Action genre file from the movie dataset
action = pd.read_csv('action.csv')
action.head()

| | movie_id | movie_name | year | certificate | runtime | genre | rating | description | director | director_id | star | star_id | votes | gross(in $) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt9114286 | Black Panther: Wakanda Forever | 2022 | PG-13 | 161 min | Action, Adventure, Drama | 6.9 | The people of Wakanda fight to protect their h... | Ryan Coogler | /name/nm3363032/ | Letitia Wright, \nLupita Nyong'o, \nDanai Guri... | /name/nm4004793/,/name/nm2143282/,/name/nm1775... | 204835.0 | |
| 1 | tt1630029 | Avatar: The Way of Water | 2022 | PG-13 | 192 min | Action, Adventure, Fantasy | 7.8 | Jake Sully lives with his newfound family form... | James Cameron | /name/nm0000116/ | Sam Worthington, \nZoe Saldana, \nSigourney We... | /name/nm0941777/,/name/nm0757855/,/name/nm0000... | 295119.0 | |
| 2 | tt5884796 | Plane | 2023 | R | 107 min | Action, Thriller | 6.5 | A pilot finds himself caught in a war zone aft... | Jean-François Richet | /name/nm0724938/ | Gerard Butler, \nMike Colter, \nTony Goldwyn, ... | /name/nm0124930/,/name/nm1591496/,/name/nm0001... | 26220.0 | |
| 3 | tt6710474 | Everything Everywhere All at Once | 2022 | R | 139 min | Action, Adventure, Comedy | 8.0 | A middle-aged Chinese immigrant is swept up in... | Dan Kwan, \nDaniel Scheinert | /name/nm3453283/ | Michelle Yeoh, \nStephanie Hsu, \nJamie Lee Cu... | /name/nm3215397/,/name/nm0000706/,/name/nm3513... | 327858.0 | |
| 4 | tt5433140 | Fast X | 2023 | | | Action, Crime, Mystery | | Dom Toretto and his family are targeted by the... | Louis Leterrier | /name/nm0504642/ | Vin Diesel, \nJordana Brewster, \nTyrese Gibso... | /name/nm0004874/,/name/nm0108287/,/name/nm0879... | | |


In [29]:
# Inner join: keep only actor rows whose FilmID also appears as movie_id in the action file
merged_df = pd.merge(actors, action, left_on='FilmID', right_on='movie_id', how='inner')
merged_df.head(1)

| | Actor | ActorID | Film | Year | Votes | Rating | FilmID | movie_id | movie_name | year | ... | runtime | genre | rating | description | director | director_id | star | star_id | votes | gross(in $) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fred Astaire | nm0000001 | The Towering Inferno | 1974 | 39888 | 7.0 | tt0072308 | tt0072308 | The Towering Inferno | 1974 | ... | 165 min | Action, Drama, Thriller | 7.0 | At the opening party of a colossal, but poorly... | John Guillermin | /name/nm0347086/ | Paul Newman, \nSteve McQueen, \nWilliam Holden... | /name/nm0000056/,/name/nm0000537/,/name/nm0000... | 45059.0 | 116000000.0 |
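
Because an inner join silently drops unmatched rows, it may be worth checking how many actor rows find a match before committing to it. The sketch below is one possible check on two tiny stand-in tables (the `_demo` names and the choice of rows are assumptions for illustration, not the full data). It uses the `indicator` and `validate` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Tiny stand-ins for the two tables, built from values shown in the previews.
actors_demo = pd.DataFrame({
    "Actor": ["Fred Astaire", "Fred Astaire"],
    "FilmID": ["tt0072308", "tt0082449"],
})
action_demo = pd.DataFrame({
    "movie_id": ["tt0072308", "tt9114286"],
    "genre": ["Action, Drama, Thriller", "Action, Adventure, Drama"],
})

# A left join with indicator=True keeps every actor row and records
# whether a match was found in the "_merge" column.
# validate="many_to_one" raises an error if movie_id is duplicated on the right.
check = actors_demo.merge(
    action_demo,
    left_on="FilmID",
    right_on="movie_id",
    how="left",
    indicator=True,
    validate="many_to_one",
)
match_counts = check["_merge"].value_counts()
print(match_counts)
```

Rows labeled `left_only` are the ones an inner join would discard, which gives a sense of how much of the actor table the action file covers.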


In [30]:
# Check missing values
print(merged_df.isnull().sum())


Actor              0
ActorID            0
Film               0
Year               0
Votes              0
Rating             0
FilmID             0
movie_id           0
movie_name         0
year               0
certificate     1862
runtime           60
genre              0
rating             0
description        0
director           0
director_id        0
star              23
star_id            0
votes              0
gross(in $)    14984
dtype: int64


In [31]:
# Specify the list of columns to be deleted
columns_to_delete = ['description', 'director_id', 'star', 'star_id', 'votes', 'gross(in $)']

# Delete the specified columns from the DataFrame
merged_df = merged_df.drop(columns=columns_to_delete)

# Print the modified DataFrame
print(merged_df.head(1))

          Actor    ActorID                  Film  Year  Votes  Rating  \
0  Fred Astaire  nm0000001  The Towering Inferno  1974  39888     7.0   

      FilmID   movie_id            movie_name  year certificate  runtime  \
0  tt0072308  tt0072308  The Towering Inferno  1974          PG  165 min   

                     genre  rating         director  
0  Action, Drama, Thriller     7.0  John Guillermin  


In [32]:
# Delete repeated columns
repeated_columns = ['movie_id', 'movie_name', 'year', 'rating']

# Delete the specified columns from the DataFrame
merged_df = merged_df.drop(columns=repeated_columns)

# Print the modified dataframe
print(merged_df.head(1))

          Actor    ActorID                  Film  Year  Votes  Rating  \
0  Fred Astaire  nm0000001  The Towering Inferno  1974  39888     7.0   

      FilmID certificate  runtime                    genre         director  
0  tt0072308          PG  165 min  Action, Drama, Thriller  John Guillermin  
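
The `genre` column holds several genres in one comma-separated string, so it may need to be split before it can serve as a categorical variable per node. The sketch below shows one possible approach on a single demo row, using `str.split` followed by `DataFrame.explode`; the `demo` and `genres_long` names are assumptions, and whether to keep all genres or only the first is a design choice left open here.

```python
import pandas as pd

# One demo row copied from the merged preview.
demo = pd.DataFrame({
    "Actor": ["Fred Astaire"],
    "FilmID": ["tt0072308"],
    "genre": ["Action, Drama, Thriller"],
})

# Split on commas, give each genre its own row, then strip stray whitespace.
genres_long = (
    demo.assign(genre=demo["genre"].str.split(","))
        .explode("genre")
        .assign(genre=lambda d: d["genre"].str.strip())
        .reset_index(drop=True)
)
print(genres_long)
```

In this long format each actor-film pair appears once per genre, so group-level comparisons should account for films being counted under more than one genre.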


In [33]:
# Check missing values
print(merged_df.isnull().sum())

Actor             0
ActorID           0
Film              0
Year              0
Votes             0
Rating            0
FilmID            0
certificate    1862
runtime          60
genre             0
director          0
dtype: int64
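
The remaining missing values are in `certificate` and `runtime`. One possible way to handle them, sketched below on a small demo frame, is to convert `runtime` strings such as "165 min" to numbers (leaving gaps as NaN) and to label missing certificates explicitly rather than dropping rows. The `runtime_min` column name and the "Unknown" label are assumptions for illustration.

```python
import pandas as pd

# Demo frame with one missing value in each column.
demo = pd.DataFrame({
    "runtime": ["165 min", None, "107 min"],
    "certificate": ["PG", None, "R"],
})

# Pull the digits out of strings like "165 min"; missing values stay NaN.
demo["runtime_min"] = pd.to_numeric(
    demo["runtime"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

# Keep missing certificates as an explicit category instead of dropping rows.
demo["certificate"] = demo["certificate"].fillna("Unknown")
print(demo)
```

Keeping the rows preserves the actor-film edges for the network, since a missing certificate or runtime does not affect who appeared in which film.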
