In [1]:
import pandas as pd
import networkx as nx
from tqdm.notebook import tqdm

In [2]:
df = pd.read_csv('imdb_dataset.tsv', sep='\t', header=None)

In [3]:
edges = df.to_records(index=False)

In [4]:
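# Extract the release year from the movie title with Series.str.extract:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
# A simple regex is assumed to be sufficient; it captures the first run of four digits in the title.
df[2] = df[1].str.extract(r'(\d{4})', expand=True)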
df = df.rename(columns={0: "actor", 1: "movie", 2: "year"})
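
The pattern `(\d{4})` takes the first four-digit run anywhere in the title, so a title that itself contains a number could yield the wrong year. Below is a minimal sketch of a stricter alternative, assuming the year appears in parentheses as in `Zonde (2010)`. The helper `extract_year`, the constant `YEAR_IN_PARENS` and the sample titles are hypothetical and only serve as illustration.

```python
import re

# Hypothetical pattern: a four-digit year right after an opening parenthesis.
YEAR_IN_PARENS = re.compile(r"\((\d{4})")

def extract_year(title):
    """Return the parenthesised year as an int, or None if the title has none."""
    match = YEAR_IN_PARENS.search(title)
    return int(match.group(1)) if match else None

# Made-up sample titles, not taken from the dataset.
samples = ["Zonde (2010)", "2001: A Space Odyssey (1968)", "Untitled Project"]
years = [extract_year(t) for t in samples]
print(years)
```

The same pattern could be passed to `Series.str.extract`; whether it fits every title in the dataset has not been checked here.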

In [5]:
actors = df.actor.unique()
movies = df.movie.unique()
print(f"Number of actors is: {actors.size} \nNumber of movies is: {movies.size} \nTotal nodes will be: {actors.size + movies.size}")
print(f"Number of edges will be: {len(edges)}")

Number of actors is: 2364796 
Number of movies is: 745941 
Total nodes will be: 3110737
Number of edges will be: 8104335


In [6]:
movies_dict = df.drop(columns='actor').drop_duplicates().set_index('movie').to_dict('index')

In [8]:
oriGinal = nx.Graph()
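# Following the NetworkX documentation, tag every node with a "bipartite" attribute: 0 for actors, 1 for movies.
oriGinal.add_nodes_from(actors, bipartite=0)
print(f"Number of nodes after adding actors is {oriGinal.number_of_nodes()}")
oriGinal.add_nodes_from(movies_dict, bipartite=1)  # iterating a dict yields only its keys, so the year values are not attached here
# Expected 3110737 nodes; the count is 2 lower, possibly because two labels occur both as an actor and as a movie (re-adding an existing node only updates its attributes).
print(f"Number of nodes after adding movies is {oriGinal.number_of_nodes()}")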

Number of nodes after adding actors is 2364796
Number of nodes after adding movies is 3110735
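
The total is 2 lower than the 3110737 predicted in cell 5. One explanation that would fit is that two labels are used both as an actor name and as a movie title: NetworkX keeps a single node per label, and adding it again only updates its attributes. The sketch below reproduces that behaviour on made-up labels; it illustrates the mechanism and does not verify what happens in the real dataset.

```python
import networkx as nx

# Toy labels (made up): "Alias" is both an actor name and a movie title.
toy_actors = ["actor_a", "Alias"]
toy_movies = ["movie_x", "Alias"]

B = nx.Graph()
B.add_nodes_from(toy_actors, bipartite=0)
B.add_nodes_from(toy_movies, bipartite=1)  # "Alias" already exists: only its attribute changes

overlap = set(toy_actors) & set(toy_movies)
print(B.number_of_nodes())   # 3, not 4
print(overlap)               # {'Alias'}
print(B.nodes["Alias"])      # {'bipartite': 1}
```

If this is the cause, `set(actors) & set(movies)` on the real arrays would list the affected labels, and keying nodes by tuples such as `("actor", name)` and `("movie", title)` would be one way to keep them apart.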


In [9]:
oriGinal.add_edges_from(edges)

In [20]:
G = nx.convert_node_labels_to_integers(oriGinal, label_attribute='original_name')


In [22]:
actor_nodes = {n for n, d in G.nodes(data=True) if d["bipartite"] == 0}
movies_nodes = set(G) - actor_nodes

In [27]:
#TODO asserts
print(f"Number of actor nodes: {len(actor_nodes)}")
print(f"Number of movies nodes: {len(movies_nodes)}")
print(f"Total number of nodes: {len(actor_nodes) + len(movies_nodes)}")

print(f"#Nodes? {oriGinal.number_of_nodes() == G.number_of_nodes()}")
print(f"#Edges? {oriGinal.number_of_edges() == G.number_of_edges()}")

Number of actor nodes: 2364794
Number of movies nodes: 745941
Total number of nodes: 3110735
#Nodes? True
#Edges? True


In [18]:
names = nx.get_node_attributes(G, "original_name")
print(names[500])

Aaker, Lee


In [39]:
print(oriGinal["'t Hoen, Dani?l"])
print(oriGinal["'Kid Niagara' Kallet, Harry"])

print(G[40])

{'Zonde (2010)': {}}
{'Drug Demon Romance (2012)': {}}
{2364839: {}}


NetworkX stores a graph as a dictionary of dictionaries of dictionaries. From the docs:

NetworkX uses a “dictionary of dictionaries of dictionaries” as the basic network data structure. This allows fast lookup with reasonable storage for large sparse networks. The keys are nodes so G[u] returns an adjacency dictionary keyed by neighbor to the edge attribute dictionary. A view of the adjacency data structure is provided by the dict-like object G.adj as e.g. for node, nbrsdict in G.adj.items():. The expression G[u][v] returns the edge attribute dictionary itself. A dictionary of lists would have also been possible, but not allow fast edge detection nor convenient storage of edge data.
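
A small sketch of the nested-dictionary access described above, on a toy graph. The node labels and the `role` edge attribute are made up for illustration.

```python
import networkx as nx

H = nx.Graph()
H.add_edge("actor_a", "movie_x", role="lead")  # toy edge with a made-up attribute
H.add_edge("actor_a", "movie_y")

neighbours = dict(H["actor_a"])          # neighbour -> edge attribute dict
edge_data = H["actor_a"]["movie_x"]      # the edge attribute dict itself
has_edge = "movie_y" in H["actor_a"]     # edge test via dictionary lookup

print(neighbours)
print(edge_data)
print(has_edge)

# Iterate over the adjacency structure, as in the quoted passage.
for node, nbrsdict in H.adj.items():
    print(node, list(nbrsdict))
```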



In [13]:
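G.size  # missing call: without parentheses this only shows the bound method; G.size() returns the number of edges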

<bound method Graph.size of <networkx.classes.graph.Graph object at 0x2b6e3c520>>

In [14]:
G.adj

AtlasView({2364832: {}})

In [15]:
degree_sequence = sorted((d for n, d in G.degree()), reverse=True)
dmax = max(degree_sequence)

## Question 1
G) Considering only the movies up to year x with x in {1930,1940,1950,1960,1970,1980,1990,2000,2010,2020}, write a function which, given x, computes the average number of movies per actor up to year x. 

In [None]:
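def avgMoviesPerActorUpToYear(graph, year):
    # Split the bipartite graph into its two node sets (0 = actors, 1 = movies).
    actor_nodes = {n for n, d in graph.nodes(data=True) if d["bipartite"] == 0}
    movies_nodes = set(graph) - actor_nodes  # was set(B); B is not defined in this notebook
    # TODO: filter the movies by `year` and compute the average; not implemented yet.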
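
One possible way to compute the average is sketched below. It works on a DataFrame with `actor`, `movie` and `year` columns rather than on the graph, because the graph built above does not carry the year on its movie nodes. The function name, the toy data and the choice to average only over actors with at least one movie up to year x are assumptions, not part of the assignment text.

```python
import pandas as pd

def avg_movies_per_actor_up_to_year(df, year):
    """Average number of distinct movies per actor, over movies dated up to `year`."""
    years = pd.to_numeric(df["year"], errors="coerce")  # extracted years are strings or NaN
    subset = df[years <= year]                           # rows without a year are dropped
    if subset.empty:
        return 0.0
    # Distinct (actor, movie) pairs divided by the actors that appear in them.
    pairs = subset[["actor", "movie"]].drop_duplicates()
    return len(pairs) / pairs["actor"].nunique()

# Toy data (made up) in the same shape as the notebook's DataFrame.
toy = pd.DataFrame({
    "actor": ["a", "a", "b", "c"],
    "movie": ["m1 (1925)", "m2 (1935)", "m1 (1925)", "m3 (1955)"],
    "year": ["1925", "1935", "1925", "1955"],
})
for x in (1930, 1940, 1960):
    print(x, avg_movies_per_actor_up_to_year(toy, x))
```

Dividing by all actors in the dataset instead would give a different, smaller average; which definition is intended is left open here.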

## Question 2
3) Considering only the movies up to year x with x in {1930,1940,1950,1960,1970,1980,1990,2000,2010,2020}, and restricting to the largest connected component of the graph, approximate the closeness centrality of each node. Who are the top-10 actors?
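
A sampling-based sketch of one way to approximate closeness: run a breadth-first search from a few random pivot nodes and scale the summed distances up to the whole component. The function name `approx_closeness`, its parameters and the toy graph are assumptions, and the filtering by year is not included; the function expects the already-filtered graph.

```python
import random
import networkx as nx

def approx_closeness(graph, num_pivots, seed=0):
    """Estimate closeness on the largest connected component from BFS at random pivots."""
    component = max(nx.connected_components(graph), key=len)
    lcc = graph.subgraph(component)
    n = lcc.number_of_nodes()
    rng = random.Random(seed)
    pivots = rng.sample(sorted(lcc.nodes), min(num_pivots, n))  # needs sortable node labels
    total = dict.fromkeys(lcc.nodes, 0)
    for p in pivots:
        for node, dist in nx.single_source_shortest_path_length(lcc, p).items():
            total[node] += dist
    k = len(pivots)
    # Estimated distance sum is (n / k) * total; closeness is (n - 1) divided by that sum.
    # A node whose only sampled pivot is itself has total 0 and gets 0.0 here.
    return {v: k * (n - 1) / (n * s) if s > 0 else 0.0 for v, s in total.items()}

toy = nx.path_graph(5)   # 0-1-2-3-4
toy.add_edge(10, 11)     # smaller component, ignored by the function
est = approx_closeness(toy, num_pivots=5)  # every node is a pivot, so the estimate is exact
top = sorted(est, key=est.get, reverse=True)[:3]
print(top)
```

The top-10 actors would then be the ten highest-scoring nodes whose `bipartite` attribute is 0. Fewer pivots make the estimate cheaper but noisier.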

## Question 3
III) Which pair of movies shares the largest number of actors?
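
A possible approach, sketched on toy data: group the movies of each actor, count every pair of movies that an actor appears in, and take the most frequent pair. The function name and the toy edges are made up. The pair counting is quadratic in the number of movies per actor, so it may need more care on the full dataset.

```python
from collections import Counter, defaultdict
from itertools import combinations

def most_shared_pair(edges):
    """edges: iterable of (actor, movie). Returns ((movie_a, movie_b), shared_count) or None."""
    movies_by_actor = defaultdict(set)
    for actor, movie in edges:
        movies_by_actor[actor].add(movie)
    shared = Counter()
    for movies in movies_by_actor.values():
        # Sort so that each unordered pair is always counted under the same key.
        for pair in combinations(sorted(movies), 2):
            shared[pair] += 1
    return shared.most_common(1)[0] if shared else None

# Toy (actor, movie) edges, made up for illustration.
toy_edges = [("a", "m1"), ("a", "m2"), ("b", "m1"), ("b", "m2"), ("b", "m3"), ("c", "m3")]
best = most_shared_pair(toy_edges)
print(best)
```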

## Question 4
Also build the actor graph, whose nodes are actors only and in which two actors are connected if they appeared in a movie together. Answer the following question:

Which pair of actors collaborated the most?
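
A sketch of the actor graph on toy data, using `weighted_projected_graph` from NetworkX's bipartite module, which sets each edge weight to the number of shared neighbours (here: shared movies). The labels are made up, and projecting the full graph may be far more expensive than this toy case suggests.

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_nodes_from(["a", "b", "c"], bipartite=0)      # toy actors
B.add_nodes_from(["m1", "m2", "m3"], bipartite=1)   # toy movies
B.add_edges_from([("a", "m1"), ("a", "m2"), ("b", "m1"),
                  ("b", "m2"), ("b", "m3"), ("c", "m3")])

actor_nodes = {n for n, d in B.nodes(data=True) if d["bipartite"] == 0}
# Edge weight = number of movies the two actors have in common.
actor_graph = bipartite.weighted_projected_graph(B, actor_nodes)

u, v, w = max(actor_graph.edges(data="weight"), key=lambda e: e[2])
print(sorted((u, v)), w)
```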

### Notes
- [NetworkX docs on bipartite graphs](https://networkx.org/documentation/stable/reference/algorithms/bipartite.html): "However, if the input graph is not connected, there are more than one possible colorations. This is the reason why we require the user to pass a container with all nodes of one bipartite node set as an argument to most bipartite functions."
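
A short sketch of what this means in practice, on a made-up disconnected graph: `bipartite.sets` cannot decide the two sides on its own, but works once one node set is passed explicitly.

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_edges_from([("a", "m1"), ("b", "m2")])  # two separate components (toy labels)

try:
    bipartite.sets(B)          # no node set given: the colouring is ambiguous
    ambiguous = False
except nx.AmbiguousSolution:
    ambiguous = True

# Passing one side explicitly removes the ambiguity.
top, bottom = bipartite.sets(B, top_nodes={"a", "b"})
print(ambiguous, sorted(top), sorted(bottom))
```

This is why the notebook keeps the `bipartite` node attribute and derives `actor_nodes` from it.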