# Network Analysis Lab

Complete the following exercises to help solidify your understanding of network analysis.

In [1]:
import networkx as nx
import nxviz
import community
import pandas as pd
from itertools import combinations

## U.S. Men's Basketball Data Set

In the `us_mens_basketball.csv` data set, each row represents a single basketball player's participation in a single event at a single Olympics.

In [2]:
basketball = pd.read_csv('/Users/Dinis/Ironhack/Labs/lab-network-analysis/data/us_mens_basketball.csv')

In [3]:
basketball.head()

| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 351 | Julius Shareef Abdur-Rahim | M | 23.0 | 202.0 | 104.0 | United States | USA | 2000 Summer | 2000 | Summer | Sydney | Basketball | Basketball Men's Basketball | Gold |
| 1 | 2636 | Stephen Todd "Steve" Alford | M | 19.0 | 185.0 | 74.0 | United States | USA | 1984 Summer | 1984 | Summer | Los Angeles | Basketball | Basketball Men's Basketball | Gold |
| 2 | 2863 | Walter Ray Allen | M | 25.0 | 192.0 | 93.0 | United States | USA | 2000 Summer | 2000 | Summer | Sydney | Basketball | Basketball Men's Basketball | Gold |
| 3 | 3874 | William Lloyd "Willie" Anderson, Jr. | M | 21.0 | 200.0 | 86.0 | United States | USA | 1988 Summer | 1988 | Summer | Seoul | Basketball | Basketball Men's Basketball | Bronze |
| 4 | 4505 | Carmelo Kyan Anthony | M | 20.0 | 203.0 | 109.0 | United States | USA | 2004 Summer | 2004 | Summer | Athina | Basketball | Basketball Men's Basketball | Bronze |


## 1. Transform this data set into one that can be turned into a graph where the entities are represented by the Name field and the relationships are represented by whether the players played in the same Olympics together (Games field).

Sort descending by the number of pairwise interactions. Which pair of players have competed in the most Olympics together?

In [4]:
# Collecting all the Olympic Games in the data set 
print(basketball['Games'].unique())

# Collecting all the players that played in each Olympic Games
players = [basketball[basketball['Games']== game]['Name'].to_list() for game in basketball['Games'].unique()]
players[:1]

['2000 Summer' '1984 Summer' '1988 Summer' '2004 Summer' '2008 Summer'
 '2012 Summer' '2016 Summer' '1976 Summer' '1960 Summer' '1936 Summer'
 '1972 Summer' '1948 Summer' '1992 Summer' '1996 Summer' '1964 Summer'
 '1968 Summer' '1952 Summer' '1956 Summer']


[['Julius Shareef Abdur-Rahim',
  'Walter Ray Allen',
  'Vincent Lamont "Vin" Baker',
  'Vincent Lamar "Vince" Carter',
  'Kevin Maurice Garnett',
  'Timothy Duane "Tim" Hardaway',
  'Allan Wade Houston',
  'Jason Frederick Kidd',
  'Antonio Keithflen McDyess',
  'Alonzo Harding Mourning',
  'Gary Dwayne Payton',
  'Steven Delano "Steve" Smith']]

In [8]:
# itertools.combinations(iterable, 2) yields every unordered pair of players
# who played in the same Olympics together.
# The loop variable is named roster so that it does not shadow the players list.
combination_players = [list(combinations(roster, 2)) for roster in players]
combination_players

[[('Julius Shareef Abdur-Rahim', 'Walter Ray Allen'),
  ('Julius Shareef Abdur-Rahim', 'Vincent Lamont "Vin" Baker'),
  ('Julius Shareef Abdur-Rahim', 'Vincent Lamar "Vince" Carter'),
  ('Julius Shareef Abdur-Rahim', 'Kevin Maurice Garnett'),
  ('Julius Shareef Abdur-Rahim', 'Timothy Duane "Tim" Hardaway'),
  ('Julius Shareef Abdur-Rahim', 'Allan Wade Houston'),
  ('Julius Shareef Abdur-Rahim', 'Jason Frederick Kidd'),
  ('Julius Shareef Abdur-Rahim', 'Antonio Keithflen McDyess'),
  ('Julius Shareef Abdur-Rahim', 'Alonzo Harding Mourning'),
  ('Julius Shareef Abdur-Rahim', 'Gary Dwayne Payton'),
  ('Julius Shareef Abdur-Rahim', 'Steven Delano "Steve" Smith'),
  ('Walter Ray Allen', 'Vincent Lamont "Vin" Baker'),
  ('Walter Ray Allen', 'Vincent Lamar "Vince" Carter'),
  ('Walter Ray Allen', 'Kevin Maurice Garnett'),
  ('Walter Ray Allen', 'Timothy Duane "Tim" Hardaway'),
  ('Walter Ray Allen', 'Allan Wade Houston'),
  ('Walter Ray Allen', 'Jason Frederick Kidd'),
  ('Walter Ray Allen', 

In [53]:
combination_df = pd.concat([pd.DataFrame(game, columns =['Player 1', 'Player 2']) 
                            for game in combination_players], axis=0)

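# groupby(...).size() counts the rows in each group, i.e. how many Olympics each pair shared.
# Note that head(5) keeps only the five most frequent pairs, so games_df is not the full edge list.
games_df = combination_df.groupby(['Player 1', 'Player 2']).size().sort_values(ascending=False).reset_index().head(5)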
games_df

| | Player 1 | Player 2 | 0 |
|---|---|---|---|
| 0 | Carmelo Kyan Anthony | LeBron Raymone James | 3 |
| 1 | Carmelo Kyan Anthony | Carlos Austin Boozer, Jr. | 2 |
| 2 | Carmelo Kyan Anthony | Kobe Bean Bryant | 2 |
| 3 | Carmelo Kyan Anthony | Kevin Wayne Durant | 2 |
| 4 | Charles Wade Barkley | David Maurice Robinson | 2 |
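
One caveat with grouping on `Player 1` and `Player 2`: `('A', 'B')` and `('B', 'A')` count as different pairs, so the result is only reliable if every roster lists players in a consistent order (the rows shown by `head()` suggest the CSV is sorted by name, but that is an assumption). A minimal order-independent sketch, using hypothetical rosters rather than the CSV, sorts each roster before pairing:

```python
from collections import Counter
from itertools import combinations

# Hypothetical rosters (Olympics -> players); not taken from the CSV
rosters = {
    'Games A': ['Ann', 'Bob', 'Cid'],
    'Games B': ['Bob', 'Ann', 'Dee'],
    'Games C': ['Ann', 'Bob'],
}

pair_counts = Counter()
for roster in rosters.values():
    # Sorting makes ('Ann', 'Bob') and ('Bob', 'Ann') the same key
    pair_counts.update(combinations(sorted(roster), 2))

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

`Counter.most_common` returns the pairs already sorted in descending order of shared appearances, which corresponds to the `sort_values(ascending=False)` step in the pandas version.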


## 2. Use the `from_pandas_edgelist` function to turn the data frame into a graph.

In [63]:
# from_pandas_edgelist returns a graph from Pandas DataFrame containing an edge list
G = nx.from_pandas_edgelist(games_df, 'Player 1', 'Player 2')
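
The call above keeps only the two name columns, so the number of shared Olympics is dropped. As a sketch of one way to keep it, the example below passes the count through the `edge_attr` parameter of `from_pandas_edgelist`. It reuses three rows from the `games_df` output and assumes the count column has been renamed to a hypothetical `games_together` (in the notebook it is named `0`).

```python
import networkx as nx
import pandas as pd

# Three rows copied from the games_df output; 'games_together' is a
# hypothetical column name (the notebook's count column is called 0)
edges = pd.DataFrame({
    'Player 1': ['Carmelo Kyan Anthony', 'Carmelo Kyan Anthony', 'Charles Wade Barkley'],
    'Player 2': ['LeBron Raymone James', 'Kobe Bean Bryant', 'David Maurice Robinson'],
    'games_together': [3, 2, 2],
})

# edge_attr stores the named column as an attribute on every edge
weighted = nx.from_pandas_edgelist(edges, 'Player 1', 'Player 2', edge_attr='games_together')
shared = weighted['Carmelo Kyan Anthony']['LeBron Raymone James']['games_together']
print(shared)
```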

## 3. Compute and print the following graph statistics for the graph:

- Number of nodes
- Number of edges
- Average degree
- Density

In [60]:
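# G.size() returns the number of edges and G.order() the number of nodes,
# so the explicit accessors are used here to keep the labels correct
print('Number of nodes: ', G.number_of_nodes())
print('Number of edges: ', G.number_of_edges())
# Average degree is the sum of all node degrees divided by the number of nodes (2E / N);
# nx.average_degree_connectivity reports average neighbor degree instead
print('Average degree: ', sum(degree for _, degree in G.degree()) / G.number_of_nodes())
print('Density: ', nx.density(G))

Number of nodes:  7
Number of edges:  5
Average degree:  1.4285714285714286
Density:  0.23809523809523808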
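
As a hand check of these statistics: the five-row edge list printed earlier contains 7 distinct players and 5 pairs. For an undirected graph, average degree is 2E / N and density is 2E / (N(N - 1)). The sketch below plugs those counts in; the resulting density agrees with the value printed above.

```python
# Counts read off the five-row games_df output: 7 distinct players, 5 pairs
n_nodes = 7
n_edges = 5

# Every edge contributes to the degree of two nodes
average_degree = 2 * n_edges / n_nodes

# Share of the N * (N - 1) / 2 possible undirected edges that actually exist
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

print('Average degree: ', average_degree)
print('Density: ', density)
```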


## 4. Compute betweenness centrality for the graph and print the top 5 nodes with the highest centrality.
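
A minimal sketch of the pattern on a hypothetical toy graph (the node names are placeholders, not players from the data set). `nx.betweenness_centrality` returns a dictionary of node to score, so the top five are obtained by sorting its items.

```python
import networkx as nx

# Hypothetical toy graph: triangles A-B-C and E-F-G joined only through D
toy = nx.Graph([
    ('A', 'B'), ('A', 'C'), ('B', 'C'),
    ('C', 'D'), ('D', 'E'),
    ('E', 'F'), ('E', 'G'), ('F', 'G'),
])

betweenness = nx.betweenness_centrality(toy)  # normalized by default

# Sort nodes by score, highest first, and keep the first five
top5 = sorted(betweenness.items(), key=lambda item: item[1], reverse=True)[:5]
for node, score in top5:
    print(f'{node}: {score:.3f}')
```

In this toy graph the bridge node `D` ranks first even though it has only two neighbors, because every shortest path between the two triangles passes through it.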

## 5. Compute Eigenvector centrality for the graph and print the top 5 nodes with the highest centrality.
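
A comparable sketch for eigenvector centrality, again on a hypothetical toy graph. `nx.eigenvector_centrality` uses power iteration; raising `max_iter` above its default is a precaution assumed here, since larger or sparser graphs may need more iterations to converge.

```python
import networkx as nx

# Hypothetical toy graph: triangle A-B-C with a tail C-D-E
toy = nx.Graph([
    ('A', 'B'), ('A', 'C'), ('B', 'C'),
    ('C', 'D'), ('D', 'E'),
])

# max_iter is raised as a precaution against convergence failures
eigenvector = nx.eigenvector_centrality(toy, max_iter=1000)

top5 = sorted(eigenvector.items(), key=lambda item: item[1], reverse=True)[:5]
for node, score in top5:
    print(f'{node}: {score:.3f}')
```

A node scores highly here when its neighbors are themselves well connected, so `C` leads while the end of the tail, `E`, comes last.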

## 6. Compute degree centrality for the graph and print the top 5 nodes with the highest centrality.
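
As an illustration, the sketch below computes degree centrality on the five pairs printed in the `games_df` output. This is only the top-five edge list, not the full data set, so the scores are not the answer for the whole graph. The graph is named `G5` here, a hypothetical name chosen to avoid clashing with `G`.

```python
import networkx as nx

# The five pairs from the games_df output (top five only, not the full data set)
pairs = [
    ('Carmelo Kyan Anthony', 'LeBron Raymone James'),
    ('Carmelo Kyan Anthony', 'Carlos Austin Boozer, Jr.'),
    ('Carmelo Kyan Anthony', 'Kobe Bean Bryant'),
    ('Carmelo Kyan Anthony', 'Kevin Wayne Durant'),
    ('Charles Wade Barkley', 'David Maurice Robinson'),
]
G5 = nx.Graph(pairs)

# Degree centrality is a node's degree divided by (number of nodes - 1)
degree = nx.degree_centrality(G5)

top5 = sorted(degree.items(), key=lambda item: item[1], reverse=True)[:5]
for node, score in top5:
    print(f'{node}: {score:.3f}')
```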

## 7. Generate a network visualization for the entire graph using a Kamada-Kawai force-directed layout.
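
A sketch of one way to do this with networkx and matplotlib, shown on a hypothetical stand-in graph. `nx.kamada_kawai_layout` computes the node positions (it relies on SciPy being installed), and `nx.draw_networkx` renders them. The `Agg` backend and the output file name are assumptions for running outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts; not needed inside a notebook
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical stand-in graph: two 4-cliques joined through one extra node
toy = nx.barbell_graph(4, 1)

# Kamada-Kawai places nodes so that drawn distances approximate path lengths
pos = nx.kamada_kawai_layout(toy)  # dict: node -> (x, y) coordinates

fig, ax = plt.subplots(figsize=(6, 6))
nx.draw_networkx(toy, pos=pos, ax=ax, node_size=500, font_size=9)
ax.set_axis_off()
fig.savefig('kamada_kawai.png', dpi=100)  # hypothetical output file name
plt.close(fig)
```

networkx also provides `nx.draw_kamada_kawai(G)`, which combines the layout and drawing steps when no further customization is needed.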

## 8. Create and visualize an ego graph for the player with the highest betweenness centrality.
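
A minimal sketch of the two steps, using a generated toy graph instead of the player graph: pick the node with the highest betweenness, then extract its neighborhood with `nx.ego_graph`. The `radius=1` setting (direct neighbors only) is an assumption; a larger radius would pull in neighbors of neighbors.

```python
import networkx as nx

# Toy graph: two triangles (nodes 0-2 and 4-6) linked through node 3
toy = nx.barbell_graph(3, 1)

betweenness = nx.betweenness_centrality(toy)
center = max(betweenness, key=betweenness.get)  # node with the highest score

# The ego graph holds the center, its neighbors, and the edges among them
ego = nx.ego_graph(toy, center, radius=1)
print(center, sorted(ego.nodes()), sorted(ego.edges()))
```

The resulting `ego` object is an ordinary graph, so it can be passed to `nx.draw_networkx` like any other.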

## 9. Identify the communities within the entire graph and produce another visualization of it with the nodes color-coded by the community they belong to.
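
The sketch below shows the color-coding idea on a toy graph. Note the substitution: it uses networkx's built-in `greedy_modularity_communities` rather than the `community` package imported at the top of the notebook, whose `best_partition` function returns a node-to-community dictionary directly. Either way, the goal is one integer label per node that can be handed to the drawing function.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two 4-cliques (nodes 0-3 and 4-7) joined by the single edge 3-4
toy = nx.barbell_graph(4, 0)

# Returns a list of frozensets, one per detected community
communities = greedy_modularity_communities(toy)

# Map every node to the index of the community that contains it
node_to_community = {
    node: index
    for index, members in enumerate(communities)
    for node in members
}

# One label per node, in the same order that the graph iterates its nodes
node_colors = [node_to_community[node] for node in toy.nodes()]
print(node_colors)
```

Passing `node_color=node_colors` (optionally with a `cmap`) to `nx.draw_networkx` then colors each node by its community.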

## Bonus: Hierarchical Graphs

Thus far, we have analyzed graphs where the nodes represented individual players and the edges represented Olympic games that they have competed in together. We can also analyze the data at a higher level, stripping out the players as entities and analyzing the data at the Games level. To do this, we would need to reconstruct the graph so that the *Games* field represents the entities and then use the player names as the edge criteria, so that there is an edge between two Olympic games if a player played in both of them. You already have the tools in your toolbox to do this, so give it a try.

### Create a graph with Games as the entities and then print out the graph statistics.
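
One possible shape for this, sketched with hypothetical `(Name, Games)` records in place of the CSV columns: collect the set of Games per player, pair up the Games within each set, and use the number of shared players as an edge weight. The names `records`, `games_by_player`, and `games_graph` are illustrative only.

```python
from collections import Counter, defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical (Name, Games) records standing in for the two CSV columns
records = [
    ('Ann', '1992 Summer'), ('Ann', '1996 Summer'),
    ('Bob', '1996 Summer'), ('Bob', '2000 Summer'),
    ('Cid', '1992 Summer'), ('Cid', '1996 Summer'),
    ('Dee', '2004 Summer'),
]

# Collect the set of Games each player took part in
games_by_player = defaultdict(set)
for name, games in records:
    games_by_player[name].add(games)

# One edge per pair of Games that share a player; the count becomes the weight
shared_players = Counter()
for games in games_by_player.values():
    shared_players.update(combinations(sorted(games), 2))

games_graph = nx.Graph()
# Adding nodes first keeps Games whose players never appeared elsewhere
games_graph.add_nodes_from({games for _, games in records})
games_graph.add_weighted_edges_from((a, b, n) for (a, b), n in shared_players.items())

print('Number of nodes: ', games_graph.number_of_nodes())
print('Number of edges: ', games_graph.number_of_edges())
print('Density: ', nx.density(games_graph))
```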

### Generate a network visualization of this graph using the layout of your choice.