# Data Overview

#### Objectives:
- Introduce data problem
- Clean and prep source data
- Exploratory data analysis to understand node and edge data
- Create graph object

# Data Problem
Can you relate to either of the following scenarios?

Scenario 1:

***It's Monday morning and you see your co-workers huddled together furiously discussing something seemingly important. Upon approaching the group, you start to hear names of people and words that definitely do not relate to anything your team does daily. Who is Stark? Did they say dragons? What does Hodor mean!? And what the heck is a White Walker!? Holy moly, they are talking about Game of Thrones again and I couldn't care less.***

Scenario 2:

***It's Sunday night at 9pm. You're ready for the newest episode of GoT. But wait, the recap is testing your memory all the way back to season one!! I don't even remember half of those characters. Who are they again and why in the world are they important!? Aren't they dead?? This is going to be a tough episode to follow.***

Fear not, fuzzy brains! Because we are going to play...

## Game of Nodes!!!

<img src="https://orig00.deviantart.net/f351/f/2014/094/b/3/play_the_game_by_betteo-d7d0925.jpg" width="600">

Image Credit: [Patricio Betteo](http://betteo.blogspot.com)

Whether you couldn't care less or are a GoT master, we are going to uncover who the important characters are in Game of Thrones using graph analytics. This will demonstrate how graph analytics can quickly draw conclusions from large amounts of complex, relational data.

### Data Prep
Graph technologies tend to need incoming data in a certain format or file type in order to build the graph. For this workshop we will demonstrate building a graph from two datasets: one containing the nodes and their attributes, and the other containing the edges and their attributes.

The manipulation of the external data we are using in this workshop will not be covered as a lesson, but you can check out the prep steps in ~/notebooks/0X-data_prep.ipynb where we combined the two datasets.
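As a minimal sketch of the two-table layout described above, the snippet below builds a tiny node table and edge table from rows that appear later in this notebook, then checks that every edge endpoint has a matching node `Id`. The variable names (`nodes`, `edges`, `missing`) are illustrative and are not part of the workshop code.

```python
import pandas as pd

# Node table: one row per character (two rows copied from the node data in this notebook).
nodes = pd.DataFrame({
    "Id": ["eddard-stark", "robert-baratheon"],
    "Label": ["Eddard Stark", "Robert Baratheon"],
    "Allegiances": ["Stark", "Baratheon"],
})

# Edge table: one row per pair of interacting characters.
edges = pd.DataFrame({
    "Source": ["eddard-stark"],
    "Target": ["robert-baratheon"],
    "weight": [334],
})

# Every Source/Target value should match an Id in the node table.
endpoints = pd.concat([edges["Source"], edges["Target"]]).unique()
missing = sorted(set(endpoints) - set(nodes["Id"]))
print("Edge endpoints missing from the node table:", missing)
```

An empty list means the two files line up, which is what the node attribute lookup later in this notebook relies on.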

### **Knowledge check:**
***What are some example node and edge attributes for GoT characters that you think we will use?***

## Nodes Overview

Let's look at the values and counts in each column of our node data. Here are the last five rows:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from math import pi

from bokeh.models import ColumnDataSource, HoverTool, ranges, LabelSet, Range1d
from bokeh.plotting import figure, show, output_file, save

from IPython.display import IFrame

In [2]:
# Character node dataset with attributes
node = pd.read_csv("../data/processed/character_interactions_node.csv", sep= ",", keep_default_na=False, na_values=[''])
node.tail()

Unnamed: 0,Id,Label,Allegiances,Gender,Nobility,GoT,CoK,SoS,FfC,DwD,Dead
791,yorko-terys,Yorko Terys,,,,,,,,,
792,ysilla,Ysilla,Targaryen,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
793,yurkhaz-zo-yunzak,Yurkhaz zo Yunzak,,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
794,zei,Zei,Stark,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
795,zollo,Zollo,,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


**Number of Characters**

In [3]:
node['Id'].describe() # Number of Characters

count         796
unique        796
top       cressen
freq            1
Name: Id, dtype: object

**Number of Characters in House Allegiances**

In [4]:
# Distribution of allegiances visual
dist_houses = pd.DataFrame(node['Allegiances'].value_counts()).\
                    reset_index().\
                    rename(columns={'index': 'Allegiances', 'Allegiances': 'char_count'})
    
source = ColumnDataSource(dist_houses)
source.data['index']=source.data['index']+0.5
names = dist_houses.Allegiances.tolist()

h = figure(plot_width=400, 
           plot_height=400, 
           y_range = names,
          title = "Allegiance Distribution")
h.hbar(y='index', height=0.9, left=0,
       right='char_count', color="navy", source=source)

h.add_tools(HoverTool(tooltips=[("Count","@char_count")]))

output_file("../img/allegiances_distribution.html")
save(h,filename='../img/allegiances_distribution.html')

#Workaround for displaying bokeh
IFrame('../img/allegiances_distribution.html', width=700, height=400)

**Gender (1=Male, 0=Female)**

In [5]:
print(node.Gender.value_counts())
node.Gender.isnull().value_counts()

1.0    467
0.0    101
Name: Gender, dtype: int64


False    568
True     228
Name: Gender, dtype: int64

**Nobility (1=Noble, 0=Not noble)**

In [6]:
print(node.Nobility.value_counts())
node.Nobility.isnull().value_counts()

0.0    293
1.0    275
Name: Nobility, dtype: int64


False    568
True     228
Name: Nobility, dtype: int64

**Dead Characters (1=Dead, 0=Alive)**

In [7]:
print(node.Dead.value_counts())
node.Dead.isnull().value_counts()

0.0    364
1.0    204
Name: Dead, dtype: int64


False    568
True     228
Name: Dead, dtype: int64
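Gender, Nobility, and Dead each report exactly 228 missing values, which suggests (but does not prove) that the same characters are missing all three attributes. The sketch below shows one way to check this, using the five rows printed by `node.tail()` above as sample data. The names `sample`, `attrs`, `missing_per_column`, and `rows_missing_all` are illustrative.

```python
import numpy as np
import pandas as pd

# Five rows copied from the node.tail() output above (attribute columns only).
sample = pd.DataFrame({
    "Id": ["yorko-terys", "ysilla", "yurkhaz-zo-yunzak", "zei", "zollo"],
    "Gender": [np.nan, 1.0, 1.0, 0.0, 1.0],
    "Nobility": [np.nan, 0.0, 0.0, 0.0, 0.0],
    "Dead": [np.nan, 0.0, 1.0, 0.0, 0.0],
})

attrs = ["Gender", "Nobility", "Dead"]

# Missing values per column, in one call instead of one cell per column.
missing_per_column = sample[attrs].isnull().sum()

# Rows where all three attributes are missing at once.
rows_missing_all = sample[attrs].isnull().all(axis=1).sum()

print(missing_per_column)
print("Rows missing all three attributes:", rows_missing_all)
```

Running the same two lines on the full `node` dataframe would show whether the 228 missing rows really are the same characters in every column.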

**Number of Characters Appearing in Each Book**

In [8]:
# Number of characters in each book
book_chars = pd.DataFrame(node[['GoT','CoK','SoS','FfC','DwD']].sum()).\
                    reset_index().\
                    rename(columns={'index': 'Books', 0: 'char_count'})
        
source = ColumnDataSource(book_chars)
source.data['index']=source.data['index']+0.5

names = book_chars.Books.tolist()

h = figure(plot_width=400, 
           plot_height=400, 
           y_range = names,
          title = "Character Distribution by Book")
h.hbar(y="index", height=0.9, left=0,
       right="char_count", color="navy", source=source)

h.add_tools(HoverTool(tooltips=[("Count","@char_count")]))


output_file("../img/book_distribution.html")

save(h, filename="../img/book_distribution.html")

#Workaround for displaying bokeh
IFrame('../img/book_distribution.html', width=700, height=400)

#### Question?
What observations can we make from the number of characters in each book? How could this attribute be used in our graph?

## Edges Overview

Let's summarize the values in our edge dataset. Here are the first five rows:

In [9]:
# Character Interactions edge dataset with attributes
edge = pd.read_csv("../data/processed/character_interactions_edge.csv", sep= ",", keep_default_na=False, na_values=[''])
print(len(edge))
edge.head()

2823


Unnamed: 0,Source,Target,weight,weight_inv
0,addam-marbrand,brynden-tully,3,0.333333
1,addam-marbrand,cersei-lannister,3,0.333333
2,addam-marbrand,gyles-rosby,3,0.333333
3,addam-marbrand,jaime-lannister,14,0.071429
4,addam-marbrand,jalabhar-xho,3,0.333333
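The `weight_inv` values shown above look like the reciprocal of `weight` (3 maps to 0.333333, 14 maps to 0.071429). The quick check below confirms this on two of the printed rows. That the column was created as `1 / weight` in the prep notebook, presumably so strong interactions can act as short distances, is an assumption rather than something stated here.

```python
import numpy as np
import pandas as pd

# Two rows copied from the edge.head() output above.
sample_edges = pd.DataFrame({
    "Source": ["addam-marbrand", "addam-marbrand"],
    "Target": ["cersei-lannister", "jaime-lannister"],
    "weight": [3, 14],
    "weight_inv": [0.333333, 0.071429],
})

# Recompute the inverse and compare against the stored (rounded) values.
recomputed = 1 / sample_edges["weight"]
matches = np.allclose(recomputed, sample_edges["weight_inv"], atol=1e-6)
print("weight_inv equals 1 / weight:", matches)
```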


**Weight Summary Statistics**

In [10]:
print(edge.weight.describe())

count    2823.000000
mean       11.558271
std        19.976281
min         3.000000
25%         3.000000
50%         5.000000
75%        11.000000
max       334.000000
Name: weight, dtype: float64


**Weight Values Distribution Plot**

In [11]:
# Weight distribution visual
count, bins = np.histogram(edge.weight, bins = 'fd')
count = np.append(count,[0])

weight_hist = pd.DataFrame(data={'count':count, 'bins': bins})
source = ColumnDataSource(weight_hist)

h = figure(plot_width=700, 
           plot_height=300, 
           x_range=Range1d(0, max(source.data['bins'])), 
           y_range=Range1d(0, max(source.data['count'])),
           title = "Weight Distribution")

h.vbar(x='bins', top='count', width=0.5, color="navy", source=source)

h.add_tools(HoverTool(tooltips=[("Count", "@count")]))
output_file("../img/weight_distribution.html")

save(h, filename="../img/weight_distribution.html")

#Workaround for displaying bokeh
IFrame('../img/weight_distribution.html', width=900, height=350 )
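The cell above appends a zero to `count` because `np.histogram` returns one more bin edge than it returns counts. A small sketch of that relationship follows, along with an alternative that plots against bin centers instead of padding. The weights are an illustrative sample, not the full edge data, and a fixed `bins=4` replaces `'fd'` so the result is easy to follow.

```python
import numpy as np

# Illustrative sample of weights (not the full edge dataset).
weights = np.array([3, 3, 5, 11, 14, 19, 34, 334])

# np.histogram returns n counts and n + 1 bin edges.
count, bins = np.histogram(weights, bins=4)
print(len(count), len(bins))

# Bin centers have the same length as count, so no zero padding is needed.
centers = (bins[:-1] + bins[1:]) / 2
print(count)
print(centers)
```

With a skew like this, nearly every edge lands in the first bin, which mirrors the long right tail seen in the summary statistics.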

**Looking at the top 25% of the weights**

In [12]:
edge[edge.weight >= 11].weight.describe()

count    729.000000
mean      31.052126
std       31.968477
min       11.000000
25%       14.000000
50%       19.000000
75%       34.000000
max      334.000000
Name: weight, dtype: float64
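The cutoff of 11 above is the 75th percentile taken from the summary statistics. A hedged alternative is to compute the threshold with `quantile` so the filter still works if the data changes. The weights below are an illustrative sample, and `threshold` and `top_quarter` are hypothetical names.

```python
import pandas as pd

# Illustrative sample of weights (not the full edge dataset).
weights = pd.Series([3, 3, 5, 11, 14, 19, 34, 334], name="weight")

# Compute the 75th percentile instead of hard-coding it.
threshold = weights.quantile(0.75)

# Keep only the weights at or above that threshold.
top_quarter = weights[weights >= threshold]

print("75th percentile:", threshold)
print(top_quarter.tolist())
```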

**Finding characters with the max interaction count**

In [13]:
# Finding character info with top interaction count
top_interaction = edge[edge.weight == 334][['Source', 'Target']].values[0]
node[node['Id'].isin(top_interaction)]

Unnamed: 0,Id,Label,Allegiances,Gender,Nobility,GoT,CoK,SoS,FfC,DwD,Dead
189,eddard-stark,Eddard Stark,Stark,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
602,robert-baratheon,Robert Baratheon,Baratheon,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
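The lookup above hard-codes the maximum weight of 334. One possible alternative is `idxmax`, which finds the row with the largest weight directly. The sketch uses the five edge rows printed earlier as sample data, so the top pair here differs from the full-data result. Note that `idxmax` returns only the first row when there are ties.

```python
import pandas as pd

# Five rows copied from the edge.head() output earlier in this notebook.
sample = pd.DataFrame({
    "Source": ["addam-marbrand"] * 5,
    "Target": ["brynden-tully", "cersei-lannister", "gyles-rosby",
               "jaime-lannister", "jalabhar-xho"],
    "weight": [3, 3, 3, 14, 3],
})

# Row label of the largest weight, then the two characters on that edge.
top_row = sample.loc[sample["weight"].idxmax()]
pair = [top_row["Source"], top_row["Target"]]
print(pair, top_row["weight"])
```

The resulting `pair` can be passed to `node['Id'].isin(...)` in the same way as `top_interaction` in the cell above.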


## Create Graph Object

Now that we understand the data, we can create our graph object.

In [14]:
import networkx as nx

In [15]:
# Build base graph from edge dataframe
G = nx.from_pandas_dataframe(edge, 'Source', 'Target', ['weight', 'weight_inv'])
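`nx.from_pandas_dataframe` belongs to older NetworkX releases. In NetworkX 2.1 and later the equivalent call is `nx.from_pandas_edgelist`. The sketch below assumes a recent NetworkX install and uses the five edge rows printed earlier. `edge_sample` and `G_sample` are illustrative names.

```python
import networkx as nx
import pandas as pd

# Five rows copied from the edge.head() output earlier in this notebook.
edge_sample = pd.DataFrame({
    "Source": ["addam-marbrand"] * 5,
    "Target": ["brynden-tully", "cersei-lannister", "gyles-rosby",
               "jaime-lannister", "jalabhar-xho"],
    "weight": [3, 3, 3, 14, 3],
    "weight_inv": [0.333333, 0.333333, 0.333333, 0.071429, 0.333333],
})

# Build an undirected graph; each listed column becomes an edge attribute.
G_sample = nx.from_pandas_edgelist(
    edge_sample, source="Source", target="Target",
    edge_attr=["weight", "weight_inv"],
)

print(G_sample.number_of_nodes(), G_sample.number_of_edges())
print(G_sample["addam-marbrand"]["jaime-lannister"])
```

In these newer releases edge data is read with `G[u][v]` or `G.edges[u, v]` rather than `G.edge[u][v]`.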

In [16]:
# Adding node attributes
for i in sorted(G.nodes()):
    G.node[i]['Label'] = node.loc[node.Id == i,'Label'].values[0]
    G.node[i]['Allegiances'] = node.loc[node.Id==i,'Allegiances'].values[0]
    G.node[i]['Gender'] = node.loc[node.Id==i,'Gender'].values[0]
    G.node[i]['Nobility'] = node.loc[node.Id==i,'Nobility'].values[0]
    G.node[i]['GoT'] = node.loc[node.Id==i,'GoT'].values[0]
    G.node[i]['CoK'] = node.loc[node.Id==i,'CoK'].values[0]
    G.node[i]['SoS'] = node.loc[node.Id==i,'SoS'].values[0]
    G.node[i]['FfC'] = node.loc[node.Id==i,'FfC'].values[0]
    G.node[i]['DwD'] = node.loc[node.Id==i,'DwD'].values[0]
    G.node[i]['Dead'] = node.loc[node.Id==i,'Dead'].values[0]
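The loop above filters the node dataframe once per attribute per node. A more compact sketch, assuming NetworkX 2.x or later (where `nx.set_node_attributes` accepts a dict of dicts and `G.nodes[...]` replaces `G.node[...]`), converts the dataframe to a dictionary keyed by `Id` and attaches everything in one call. The sample rows are the two characters shown above, trimmed to a few columns.

```python
import networkx as nx
import pandas as pd

# Two rows copied from the node data shown above (subset of columns).
node_sample = pd.DataFrame({
    "Id": ["eddard-stark", "robert-baratheon"],
    "Label": ["Eddard Stark", "Robert Baratheon"],
    "Allegiances": ["Stark", "Baratheon"],
    "Dead": [1.0, 1.0],
})

G_sample = nx.Graph()
G_sample.add_edge("eddard-stark", "robert-baratheon", weight=334)

# {Id: {column: value, ...}} is the shape set_node_attributes expects.
attrs = node_sample.set_index("Id").to_dict(orient="index")
nx.set_node_attributes(G_sample, attrs)

print(G_sample.nodes["eddard-stark"])
```

Ids in the dictionary that are not in the graph are ignored, so this works even if the node table lists characters with no edges.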

**Summary of Graph**

In [17]:
# Summary of graph object
G.name = "Game of Thrones Character Interactions"
print(nx.info(G))

Name: Game of Thrones Character Interactions
Type: Graph
Number of nodes: 796
Number of edges: 2823
Average degree:   7.0930
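The average degree reported above can be reproduced by hand: in an undirected graph every edge contributes to the degree of two nodes, so the average degree is `2 * edges / nodes`. The short check below uses the counts from the summary. It is also a handy fallback where `nx.info` is unavailable, which is the case in recent NetworkX releases.

```python
# Counts taken from the graph summary above.
num_nodes = 796
num_edges = 2823

# Each undirected edge adds one to the degree of both of its endpoints.
average_degree = 2 * num_edges / num_nodes
print(f"Average degree: {average_degree:.4f}")
```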


**Finding the same data from our dataframe in our graph!**
<br>
Node attributes from our characters with the max weight

In [18]:
# Print sample node info (the two characters with the highest interaction weight)
for i in ['eddard-stark','robert-baratheon']:
    print(G.node[i])

{'Label': 'Eddard Stark', 'Allegiances': 'Stark', 'Gender': 1.0, 'Nobility': 1.0, 'GoT': 1.0, 'CoK': 0.0, 'SoS': 0.0, 'FfC': 0.0, 'DwD': 0.0, 'Dead': 1.0}
{'Label': 'Robert Baratheon', 'Allegiances': 'Baratheon', 'Gender': 1.0, 'Nobility': 1.0, 'GoT': 1.0, 'CoK': 0.0, 'SoS': 0.0, 'FfC': 0.0, 'DwD': 0.0, 'Dead': 1.0}


Edge attributes between those two characters

In [19]:
# Print sample edge info
G.edge['eddard-stark']['robert-baratheon']

{'weight': 334, 'weight_inv': 0.002994011976047905}

In [20]:
# Saving graph object
nx.write_gpickle(G,"../data/processed/got_graph.gpickle")
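`nx.write_gpickle` is not available in recent NetworkX releases. A minimal sketch of an equivalent, assuming the standard library `pickle` module is acceptable for your workflow, is shown below. It writes to a temporary directory rather than `../data/processed/`, and `G_sample` is a stand-in for the full graph.

```python
import os
import pickle
import tempfile

import networkx as nx

# Small stand-in graph using the edge shown above.
G_sample = nx.Graph(name="Game of Thrones Character Interactions")
G_sample.add_edge("eddard-stark", "robert-baratheon", weight=334)

# Save to a temporary location (swap in your own path as needed).
path = os.path.join(tempfile.mkdtemp(), "got_graph_sample.gpickle")
with open(path, "wb") as f:
    pickle.dump(G_sample, f, pickle.HIGHEST_PROTOCOL)

# Load it back and confirm the edge attribute survived the round trip.
with open(path, "rb") as f:
    G_loaded = pickle.load(f)

print(G_loaded["eddard-stark"]["robert-baratheon"])
```

As with any pickle file, only load graphs saved this way from sources you trust.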