I am going to implement some of the concepts covered in this online course from Stanford University: https://www.coursera.org/learn/social-economic-networks/

# Week 2
We are going to explore:

- Creating a real graph using the movie data!
- Centrality

(The concepts below will be covered in another notebook.)
- Thresholds and Phase Transitions
- Application – Diffusion Centrality

We have been exploring the power of graphs step by step.
Here we are going to use the metadata.json file from https://grouplens.org/datasets/movielens/tag-genome-2021/

In particular, this is the part that matters for us:


**raw/metadata.json**

The file contains information about movies from MovieLens - 84,661 lines of json objects that have the following fields:
- title – movie title (84,484 unique titles)
- directedBy – directors separated by comma (‘,’)
- starring – actors separated by comma (‘,’)
- dateAdded – date, when the movie was added to MovieLens
- avgRating – average rating of a movie on MovieLens
- imdbId – movie id on the IMDB website (84,661 unique ids)
- item_id – movie id, which is consistent across files (84,661 unique ids)
- Example line:

{"title": "Toy Story (1995)", "directedBy": "John Lasseter", "starring": "Tim Allen, Tom Hanks, Don Rickles, Jim Varney, 
John Ratzenberger, Wallace Shawn, Laurie Metcalf, John Morris, R. Lee Ermey, Annie Potts", "dateAdded": null, "avgRating":
3.89146, "imdbId": "0114709", "item_id": 1}

So what we are going to do is assume that all the actors who appear in the same movie are connected to each other and to the movie's director(s). 
This is just an interesting exercise. It is a first iteration, and we are going to see whether we can find the "most connected" actors, actresses or directors according to this metric.
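
As a minimal sketch of this assumption, the example record from the dataset description can be turned into a small graph in which every pair of people credited on the movie shares an edge. The names `movie`, `people` and `G_example` are only illustrative and are not used elsewhere in the notebook.

```python
from itertools import combinations

import networkx as nx

# Example record copied from the metadata description (only the fields we need).
movie = {
    "directedBy": "John Lasseter",
    "starring": "Tim Allen, Tom Hanks, Don Rickles, Jim Varney, John Ratzenberger, "
                "Wallace Shawn, Laurie Metcalf, John Morris, R. Lee Ermey, Annie Potts",
}

# One director plus ten actors, split on commas and stripped of padding.
people = [name.strip() for name in (movie["directedBy"] + "," + movie["starring"]).split(",")]

# Every pair of people on the same movie gets an edge.
G_example = nx.Graph()
G_example.add_edges_from(combinations(people, 2))

print(G_example.number_of_nodes(), G_example.number_of_edges())
```

With 11 people on the movie, this gives 11 nodes and 11·10/2 = 55 edges, so a single movie already contributes a fully connected group.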

In [85]:
import math
import networkx as nx
import matplotlib.pyplot as plt 
import numpy as np
import json
import operator
# to easily create the connections between nodes
from itertools import combinations
# save graph
import pickle

In [97]:
# PARAMETERS
# min_amount_movies_to_appear_to_be_considered_in_graph
min_m = 15

In [98]:
# Opening JSON file
filename = 'metadata_movies.json'
# We use sets to avoid duplicating the artists
all_actors = set()
all_directors = set() 
with open(filename) as file:
    for line in file:
        json_file = json.loads(line)
        actors_list = list(json_file['starring'].replace(' ','').split(','))
        directors_list = list(json_file['directedBy'].replace(' ','').split(','))
        
        all_actors.update(actors_list)
        all_directors.update(directors_list)



In [99]:
print (f""" Actors in database: {len(all_actors)} and
 Directors in database: {len(all_directors)} """)

 Actors in database: 123048 and
 Directors in database: 33594 


In [100]:
json_file['starring'].replace(' ','').split(',')

['SteveVaccariello',
 'BarbaraZaun',
 'RobBabik',
 'BradBergman',
 'ChristopherFragapane']
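
Splitting on every comma has a side effect that shows up later in the results: `'Jr.'` appears as if it were a person, presumably because names written as "Name, Jr." are cut at the comma. The sketch below is one possible way to parse the field while keeping the spaces inside names and re-attaching such suffixes. The helper `split_names` and the `SUFFIXES` set are hypothetical and are not part of the original notebook; the results shown in this notebook were produced with the simpler split above.

```python
# Hypothetical set of suffixes that should stay attached to the previous name.
SUFFIXES = {"Jr.", "Sr."}


def split_names(field):
    """Split a comma-separated credit field into a list of clean names."""
    names = []
    for part in field.split(","):
        part = part.strip()
        if not part:
            # Skip empty entries produced by empty fields or stray commas.
            continue
        if part in SUFFIXES and names:
            # Re-attach the suffix to the name it belongs to.
            names[-1] = f"{names[-1]} {part}"
        else:
            names.append(part)
    return names


print(split_names("Robert Downey, Jr., Jude Law"))
print(split_names(""))
```

An empty field returns an empty list here, so no empty-string key would need to be removed from the dictionaries afterwards.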

In [101]:
# These are too many, so we need to clean the data a little: we are going to keep only
# the people who appear in at least min_m movies. This process is extremely quick :) 

In [102]:
# We use dictionaries to count appearances, so that we can drop artists who appear
# in just a few movies (the count is stored as the value)

dict_director = {}
# initialise the directors dictionary
for director in list(all_directors):
    dict_director[director] = 0

# initialise the actors and actresses dictionary
dict_actors = {}
for actor in list(all_actors):
    dict_actors[actor] = 0    

In [103]:
with open(filename) as file:
    for line in file:
        json_file = json.loads(line)
        temp_direc_list = list(json_file['directedBy'].replace(' ','').split(','))
        temp_actor_list = list(json_file['starring'].replace(' ','').split(','))

        for director in temp_direc_list:
            dict_director[director] = dict_director[director] + 1
        for actor in  temp_actor_list:
            dict_actors[actor] = dict_actors[actor] + 1


In [104]:
# entries with no name produce an empty-string key, which is not a real person: remove it
dict_actors.pop("")
dict_director.pop("")

3153

In [105]:
# number of distinct appearance counts among the actors (not the number of actors)
len(set(dict_actors.values()))

117

In [106]:
# So we are going to delete people from the dictionaries.
# We build new dictionaries because we should not change a dictionary while iterating over it
new_temp_actors = {k: v for k, v in dict_actors.items() if v >= min_m}
print(len(set(new_temp_actors)))
dict_actors = new_temp_actors 

new_temp_director = {k: v for k, v in dict_director.items() if v >= min_m}
dict_director  = new_temp_director
print(len(set(dict_director)))



4469
705
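
The counting and filtering steps can also be done in a single pass over the file. The sketch below shows the idea with `collections.Counter` on a few made-up lines in the same JSON-lines layout; `sample_lines`, `min_movies` and `frequent_actors` are illustrative names, and the threshold of 2 only stands in for `min_m`.

```python
import json
from collections import Counter

# Made-up sample lines in the same layout as the metadata file.
sample_lines = [
    '{"directedBy": "Dir A", "starring": "Actor X, Actor Y"}',
    '{"directedBy": "Dir A", "starring": "Actor X"}',
    '{"directedBy": "Dir B", "starring": "Actor X, Actor Z"}',
]
min_movies = 2  # hypothetical threshold, plays the role of min_m

# Count how many movies each actor appears in.
actor_counts = Counter()
for line in sample_lines:
    record = json.loads(line)
    names = [n.strip() for n in record["starring"].split(",") if n.strip()]
    actor_counts.update(names)

# Keep only the actors who reach the threshold.
frequent_actors = {name: count for name, count in actor_counts.items() if count >= min_movies}
print(frequent_actors)
```

Only "Actor X" appears in at least two of the sample movies, so it is the only entry kept.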


In [108]:
# helper to create edges easily: all r-element combinations of arr (pairs by default)
def rSubset(arr, r=2):
    return list(combinations(arr, r))

In [115]:
G = nx.Graph()
with open(filename) as file:
    for line in file:
        json_file = json.loads(line)
        temp_direc_list = list(json_file['directedBy'].replace(' ','').split(','))
        temp_actor_list = list(json_file['starring'].replace(' ','').split(','))
        
        # connect every pair of people (directors and actors) credited on the same movie
        connected_nodes = rSubset(temp_direc_list + temp_actor_list)
        G.add_edges_from(connected_nodes)

In [120]:
# Let us work with the graph that has the most important actors/actresses/directors
G0 = G.subgraph(list(dict_actors.keys())+list(dict_director.keys()))
G0

<networkx.classes.graph.Graph at 0x1aa9211e350>

In [121]:
# save graph object to file
with open('filename.pickle', 'wb') as f:
    pickle.dump(G0, f)

# load graph object from file (from here on, G is the filtered graph)
with open('filename.pickle', 'rb') as f:
    G = pickle.load(f)

In [122]:
def report_grahp(G):
    # largest connected component
    # Gcc: set with the nodes of the largest connected component
    Gcc = max(nx.connected_components(G), key=len)

    # G0: subgraph induced by the nodes in Gcc
    G0 = G.subgraph(Gcc)
    degree_sequence_sub_graph = sorted((d for n, d in G0.degree()), reverse=True)
    
    number_of_nodes = len(degree_sequence_sub_graph)
    average_connection = sum(degree_sequence_sub_graph)/number_of_nodes
    average_eccentricity = sum((list(nx.eccentricity(G0).values())))/number_of_nodes
    information_review = str(f"""Diameter: {nx.diameter(G0)}, 
                                number_of_nodes: {number_of_nodes} 
                                average_connection: {round(average_connection,2)}, 
                                average_eccentricity: {round(average_eccentricity,2)}
                                average_clustering: {round(nx.average_clustering(G0),4)}
                            """ )
                            
    return G0,information_review

In [123]:
G0,information_review = report_grahp(G)

In [124]:
print(information_review)

Diameter: 8, 
                                number_of_nodes: 5125 
                                average_connection: 88.29, 
                                average_eccentricity: 5.66
                                average_clustering: 0.1817
                            


Interestingly, the diameter is quite large given the high average number of connections. The other interesting aspect is the clustering. Personally, I think it is very low, considering that all the actors who appear in the same movie are immediately connected to each other. So there are actors and directors who link people from distant parts of the graph who are not connected to each other. Well, maybe it is not so surprising if we think that they simply work together and are not necessarily friends.
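
To see how a person who works across groups pulls the clustering down, here is a minimal sketch with two made-up casts that share one member. The names `movies` and `G_small` are illustrative only; the numbers come from this toy graph, not from the movie data.

```python
from itertools import combinations

import networkx as nx

# Two made-up casts; "c" is the only person who appears in both.
movies = [["a", "b", "c"], ["c", "d", "e"]]

G_small = nx.Graph()
for cast in movies:
    G_small.add_edges_from(combinations(cast, 2))

clustering = nx.clustering(G_small)
print(clustering["a"], clustering["c"])
print(nx.average_clustering(G_small))
print(nx.diameter(G_small))
```

Everyone who appears in a single movie has clustering 1, because all of their contacts know each other. The shared member "c" has four contacts, and only 2 of the 6 possible pairs among them are connected, so its clustering is 1/3 and the average drops to 13/15. The more people work across many different casts, the further the average falls.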

In [125]:
eigenvector_centrality = nx.eigenvector_centrality(G)

In [132]:
eigenvector_centrality_order = dict(sorted(eigenvector_centrality.items(), key=lambda item: item[1],reverse=True))
dict_actors_order = dict(sorted(dict_actors.items(), key=lambda item: item[1],reverse=True))

In [133]:
eigenvector_centrality_order

{'SusanSarandon': 0.06733539520681518,
 'SamuelL.Jackson': 0.06688039156625447,
 'BruceWillis': 0.06584061166710133,
 'ChristopherWalken': 0.06392821167638296,
 'SteveBuscemi': 0.06304360174257274,
 'StanleyTucci': 0.06261756862588044,
 'NickNolte': 0.05985857585593803,
 'DonaldSutherland': 0.058978728031788434,
 'RobertDeNiro': 0.05788934843241732,
 'AlecBaldwin': 0.05632475278008087,
 'JackBlack': 0.054982062026815014,
 'HarveyKeitel': 0.0541615617872137,
 'WilliamH.Macy': 0.05365052454034322,
 'JohnCusack': 0.05359852180939199,
 'MorganFreeman': 0.05342584636450277,
 'JohnC.Reilly': 0.05178955392270744,
 'WoodyHarrelson': 0.0516543915929849,
 'OliverPlatt': 0.051085685363085394,
 'RayLiotta': 0.05102929712706255,
 'LiamNeeson': 0.05096863746580911,
 'JeffGoldblum': 0.05087816664502702,
 'WillFerrell': 0.050438201946103794,
 'BenStiller': 0.04952758781677541,
 'OwenWilson': 0.04934117599598813,
 'WillemDafoe': 0.04933722034920345,
 'PaulGiamatti': 0.049255381821452435,
 'DennisQuaid'

In [134]:
# this is the best-connected actress according to this metric, what do you think?
# (node_connected_component returns every node that can be reached from her)
nx.node_connected_component(G, 'SusanSarandon')


{'FrankMorgan',
 'DonDeFore',
 'TimThomerson',
 'NicolasDuvauchelle',
 'MichaelAnsara',
 'DickMiller',
 'AndrewSensenig',
 'JeanSorel',
 'HoltMcCallany',
 'PeterHaber',
 'MichelSimon',
 'JackHawkins',
 'ArchieMayo',
 'DevonBostick',
 'MarioCarotenuto',
 'DonStroud',
 'JohnAshton',
 'JohnRussell',
 'GeoffreyKeen',
 'JohnSaxon',
 'MichaelPate',
 'D.B.Sweeney',
 'StanleyTucci',
 'MelindaDillon',
 'CharlesMcGraw',
 'LamChing-Ying',
 'JamesGammon',
 'JessicaChastain',
 'CharlesJarrott',
 'SteveAustin',
 'SeanPatrickFlanery',
 'IanMcShane',
 'MortenGrunwald',
 'TomokazuSeki',
 'FredArmisen',
 'VincentElbaz',
 'SamTaylor',
 'VingRhames',
 'SylvesterStallone',
 'ElvisPresley',
 'EricBalfour',
 'ElinaLöwensohn',
 'RichardRoundtree',
 'FrankCapra',
 'DannyMcBride',
 'AlisonBrie',
 'CarloLizzani',
 'A.J.Buckley',
 'ChristinaCox',
 'JohnWayne',
 'ElizabethDaily',
 'KateBurton',
 'JohnAmos',
 'VivicaA.Fox',
 'TreatWilliams',
 'DouglasJackson',
 'LaurieMetcalf',
 'WarwickDavis',
 'RobertPattinson',


In [135]:
# Let us see the actors who appear most often in our dataset (that is to say, in the most movies)
dict_actors_order

{'EricRoberts': 171,
 'ChristopherLee': 166,
 'ClarenceNash': 155,
 'MelBlanc': 144,
 'DonaldSutherland': 141,
 'Jr.': 132,
 'JohnWayne': 127,
 'GérardDepardieu': 123,
 'RayMilland': 121,
 'MichaelIronside': 120,
 'DannyTrejo': 120,
 'SamuelL.Jackson': 119,
 'MichaelMadsen': 118,
 'MichaelCaine': 114,
 'JamesMason': 110,
 'MalcolmMcDowell': 110,
 'AlbertoSordi': 108,
 'DonaldPleasence': 107,
 'LanceHenriksen': 106,
 'DannyGlover': 102,
 'RobertDeNiro': 102,
 'JackieChan': 101,
 'HarveyKeitel': 101,
 'WillemDafoe': 101,
 'ChristopherPlummer': 101,
 'JohnHurt': 100,
 'ChristopherWalken': 100,
 'AnthonyQuinn': 99,
 'NicolasCage': 99,
 'JohnCarradine': 99,
 'SusanSarandon': 99,
 'MickeyRooney': 98,
 'DennisHopper': 96,
 'VincentPrice': 95,
 'BurtReynolds': 95,
 'MorganFreeman': 95,
 'AlecBaldwin': 94,
 'CatherineDeneuve': 93,
 'BruceWillis': 92,
 'LorettaYoung': 91,
 'JamesCaan': 91,
 'JohnMalkovich': 90,
 'MartinSheen': 90,
 'BruceDern': 90,
 'LionelBarrymore': 89,
 'MarcelloMastroianni':

Something really interesting, and worth analysing in more depth, is that even though Eric Roberts appears in more movies than **Susan Sarandon**, Sarandon has a higher eigenvector centrality. So Sarandon's contacts have more contacts themselves, giving her a richer network.
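
The same effect can be reproduced on a toy graph. In the sketch below (all node names are made up), "hub" has more connections than any ordinary member of the dense group, yet a lower eigenvector centrality, because most of its contacts are poorly connected themselves.

```python
import networkx as nx

# Dense group: nodes 0..5 are all connected to each other.
G_demo = nx.complete_graph(6)

# "hub" touches the dense group once and has six peripheral contacts.
G_demo.add_edge(0, "hub")
G_demo.add_edges_from(("hub", f"leaf{i}") for i in range(6))

centrality = nx.eigenvector_centrality(G_demo, max_iter=1000)

print(G_demo.degree("hub"), G_demo.degree(1))
print(centrality["hub"] < centrality[1])
```

Here "hub" has degree 7 and node 1 has degree 5, but node 1 scores higher because each of its neighbours is itself well connected.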

In [137]:
# the least connected actors/directors
dict(sorted(eigenvector_centrality.items(), key=lambda item: item[1],reverse=False))

{'LotteReiniger': 3.584946815473141e-46,
 'JamesH.White': 3.584946815473141e-46,
 'ChristianGonzález': 3.584946815473141e-46,
 'AlexandrePromio': 3.584946815473141e-46,
 'FrederickWiseman': 3.584946815473141e-46,
 'DzigaVertov': 3.584946815473141e-46,
 'LavDiaz': 3.584946815473141e-46,
 'KemalSunal': 3.584946815473141e-46,
 'TakashiIto': 3.759089191981565e-40,
 'LouisTheroux': 3.759089191981565e-40,
 'StanBrakhage': 3.759089191981565e-40,
 'NormanMcLaren': 3.759089191981565e-40,
 'WilliamK.L.Dickson': 3.759089191981565e-40,
 'SegundodeChomón': 3.759089191981565e-40,
 'GeorgesMéliès': 3.759089191981565e-40,
 'TakahiroSakurai': 1.3763183504389725e-09,
 'MaayaSakamoto': 6.922400908369351e-09,
 'TomokazuSeki': 1.9169493036493868e-08,
 'AjuVarghese': 5.145900780813246e-08,
 'LeonidGayday': 1.6247957402675293e-07,
 'ZdeněkSvěrák': 1.828419874820206e-07,
 'LadislavSmoljak': 1.8403651552157137e-07,
 'YuichiNakamura': 1.8474993096824648e-07,
 'KlaraRumyanova': 2.2071452276525148e-07,
 'VasiliyL

In [140]:
# It seems reasonable that directors or actors who appear in few movies are less connected, given how we calculate these values
dict_director['LotteReiniger'] 

15

In [141]:
# Who are the bridges? 
betweenness_centrality = nx.betweenness_centrality(G)

In [147]:
betweenness_centrality_order = dict(sorted(betweenness_centrality.items(), key=lambda item: item[1],reverse=True))

In [148]:
# save the centrality results to files (one file per measure)
with open('betweenness_centrality.pickle', 'wb') as f:
    pickle.dump(betweenness_centrality, f)
with open('eigenvector_centrality.pickle', 'wb') as f:
    pickle.dump(eigenvector_centrality_order, f)

In [149]:
betweenness_centrality_order

{'ChristopherLee': 0.009365700776509528,
 'Jr.': 0.0059400973147535775,
 'DonaldSutherland': 0.005740013017043543,
 'ErnestBorgnine': 0.005583401927038072,
 'AnthonyQuinn': 0.005524249696400307,
 'MalcolmMcDowell': 0.0054121359840835935,
 'EricRoberts': 0.005185176482905752,
 'ChristopherPlummer': 0.005113775635694579,
 'PriyankaChopra': 0.005046653194330893,
 'MickeyRooney': 0.005016657682630236,
 'OmPuri': 0.004942240175501012,
 'MichaelCaine': 0.00492872472890949,
 'MaxvonSydow': 0.004767852384127542,
 'GérardDepardieu': 0.004518502679841983,
 'WillemDafoe': 0.004426361776079331,
 'EliWallach': 0.004362986698290242,
 'DeanStockwell': 0.004206023709816889,
 'RodSteiger': 0.004201825867537865,
 'CatherineDeneuve': 0.004183197916252519,
 'RayMilland': 0.004145393206880905,
 'MichaelMadsen': 0.004027970096465236,
 'FrancoNero': 0.003944870034935656,
 'RobertMitchum': 0.003914586717412769,
 'BruceDern': 0.0038833833835829543,
 'DonaldPleasence': 0.003725052139013143,
 'RoddyMcDowall': 0.

Given the large diameter and the low clustering, this measure seems very important. **Christopher Lee** connects different groups of actors and directors.
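
A minimal sketch of what "being a bridge" means for this measure: in a toy graph made of two fully connected groups joined through a single node, that node lies on every shortest path between the groups and gets the highest betweenness. The graph comes from `nx.barbell_graph`, and the names `G_bridge` and `top_node` are illustrative.

```python
import networkx as nx

# Two fully connected groups of 4 nodes (0-3 and 5-8) joined through node 4.
G_bridge = nx.barbell_graph(4, 1)

betweenness = nx.betweenness_centrality(G_bridge)
top_node = max(betweenness, key=betweenness.get)

print(top_node, round(betweenness[top_node], 3))
```

Node 4 has only two connections, fewer than anyone else in this graph, yet it ranks first. Betweenness rewards position between groups rather than the number of contacts.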