## Creating a Network From Ace Attorney Data
This Jupyter Notebook, adapting code by Dr. B, builds a network that shows some data trends in the XML we've created from the Ace Attorney franchise.
First, we need to install a few packages.

In [1]:
# START WITH INSTALLS AND IMPORTS!

# If you're missing anything in the import cells below, you should install it with pip (or pip3) in your virtual environment. 

# INSTALLS
# (pathlib ships with Python 3, so it needs no install.)
#!pip install spacy
#!pip install saxonche
#!pip install pandas
#!pip install networkx
#!pip install pyvis

In [2]:
# IMPORTS for the text NLP processing
import pathlib
import spacy
from pathlib import Path
from saxonche import PySaxonProcessor
from collections import Counter

In [3]:
# IMPORTS For the network visualizations
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pyvis.network import Network

### Remember
You'll only need to run the top line below if you haven't downloaded the spaCy model before.

In [4]:
# spacy.cli.download("en_core_web_lg")
nlp = spacy.load('en_core_web_lg')

## Our XML
First, remember what our project XML looks like:
```xml
<line speaker="Phoenix"><cutIn type="holdIt">Hold it!</cutIn></line>
<line speaker="Phoenix">Was your sister calm by that time?</line>
<line speaker="Ini">Like, I guess so... I guess maybe, like, taking her revenge on Dr. Grey, like, made her feel a lot better...</line>
<line speaker="Phoenix"><thought>She says with her whip at the ready...</thought></line>
<line speaker="Ini">And, like, Ms. Morgan was the only one in the Channeling Chamber, you know?</line>
<line speaker="Phoenix">May I ask you one more thing, Ms. Miney?</line>
<line speaker="Ini">Like, sure.</line>
```

A simple network we can make is one that shows all the unique speakers of the series, or perhaps just the most frequent ones.
First, we need to define our input and output paths and get all the speakers using XQuery.

Notice how the speaker XQuery outputs each speaker's name once for every line they have. We'll count their lines later.
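
To see what "one name per line" means before running the full XQuery, here is a minimal sketch that uses Python's built-in `xml.etree.ElementTree` in place of Saxon. The `<scene>` wrapper is a hypothetical root element added only so the fragment parses; it is not part of the project XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A few lines from the sample above, wrapped in a hypothetical <scene> root
# so the fragment parses as one well-formed document.
sample = """<scene>
<line speaker="Phoenix"><cutIn type="holdIt">Hold it!</cutIn></line>
<line speaker="Phoenix">Was your sister calm by that time?</line>
<line speaker="Ini">Like, sure.</line>
</scene>"""

root = ET.fromstring(sample)

# One entry per <line>, so a speaker with two lines appears twice.
speakers = [line.get("speaker") for line in root.iter("line")]

# Counting the repeats gives each speaker's number of lines.
speaker_counts = Counter(speakers)

print(speakers)
print(speaker_counts)
```

The repeated names are the raw material for the counts, which is why the query does not deduplicate them.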

In [5]:
# DEFINE SOME FILE PATHS FOR INPUT, AND (ONCE WE'RE READY) OUTPUT
InputPath = 'AAcorpus-xml'
OutputPath = 'testOutputAA' # Named testOutputAA so as not to sully the original network experiment created in the same folder
# You'll need to create the folder named in OutputPath by hand

# NOTE: We need to use a return line on this function to return the string value of `r` as the result of our python function.
# With the return line, that makes it possible to call the function in the next cell when we need to deliver the output to nlp.

In [6]:
def xqueryAndNLP(InputPath):
    # XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        # Build the collection URI from InputPath instead of hard-coding the folder name.
        xq.set_query_content(f'''
let $AceAtt := collection('{InputPath}/?select=*.xml;recurse=yes')
let $speaker := $AceAtt//line/@speaker => sort()
return string-join($speaker, ' ')
''')
        r = xq.run_query_to_string()
        # print(r)
        r = str(r)
    return r

xqueryAndNLP(InputPath)

SaxonC-HE 12.4.2 from Saxonica


'<?xml version="1.0" encoding="UTF-8"?>??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ...

This output is pretty massive, and if you're eagle-eyed, you'll spot some 'speakers' who aren't speakers at all, but random text or locations left in the wrong place at some point during our data development. We may have to remove certain 'speakers' by hand as well.

Don't worry, those will be filtered out later once we process our data with NLP. Unfortunately, the spaCy NLP model works with each word, or token, individually, so the data for characters whose names are more than one word will probably turn out inaccurate. Regardless, let's move forward with NLP.
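
The multi-word problem can be shown on a toy list. This is a sketch under assumptions: the speaker values are hypothetical, and the pipe delimiter is an alternative the notebook does not use (it would mean changing the separator passed to `string-join` in the XQuery).

```python
from collections import Counter

# Hypothetical speaker values; "Van Zieks" is a two-word name.
speakers = ["Van Zieks", "Phoenix", "Van Zieks", "Maya"]

# Joining with a space and splitting on whitespace loses the name boundary,
# so "Van" and "Zieks" are counted as two separate speakers.
space_joined = " ".join(speakers)
split_on_space = Counter(space_joined.split())

# Joining with a delimiter that never appears in a name keeps each name whole.
pipe_joined = "|".join(speakers)
split_on_pipe = Counter(pipe_joined.split("|"))

print(split_on_space)
print(split_on_pipe)
```

When two halves of a name show identical counts in the lemma frequencies, this kind of split is a likely cause.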

## Important
As it turns out, the string with all the speakers is a bit too long for the nlp pipeline we loaded earlier, so we need to manually raise its max length (1,000,000 characters by default) a bit:

In [7]:
nlp.max_length = 2000000 

In [8]:
# If everything's working properly and you have lots of text for the computer to read, this cell may take a moment to run. 
inputstring = xqueryAndNLP(InputPath)

# start playing with spaCy and nlp:
words = nlp(inputstring)
# print(words)

# Collecting the lemmatized forms will be better than just all the words. (Remember what these are?)
Lemmas = []
for token in words:
    if token.pos_ == "PROPN":
        lemma = token.lemma_
        Lemmas.append(lemma)

# Okay, we'll use Python's Counter() to find out how many lines each speaker has.
# Counter() collapses duplicates and counts the number of times each item appears.
# It's a dictionary of key:value pairs, and printing it shows them from highest to lowest count.

# print(Lemmas)

lemmaFreq = Counter(Lemmas)
totalLemmaCount = len(lemmaFreq) 

print(f"Lemma count: {totalLemmaCount}")

print(f"Lemma frequency {lemmaFreq}")

SaxonC-HE 12.4.2 from Saxonica
Lemma count: 469
Lemma frequency Counter({'Phoenix': 37058, 'Edgeworth': 21778, 'Ryunosuke': 16077, 'Judge': 14607, 'Apollo': 12959, 'Maya': 7386, 'Athena': 5756, 'Susato': 5275, 'Gumshoe': 4577, 'Kay': 3940, 'Trucy': 2901, 'Ema': 2766, 'Sholmes': 2454, 'Iris': 2331, 'Van': 2147, 'Zieks': 2147, 'Layton': 2141, 'Luke': 1896, 'Mia': 1864, 'Juror': 1771, 'Franziska': 1723, 'Kazuma': 1570, 'Blackquill': 1556, 'Klavier': 1546, 'Lang': 1488, 'Pearl': 1428, 'Courtney': 1361, 'Rayfa': 1111, 'Ray': 1089, 'Nahyuta': 1038, 'Espella': 999, 'Payne': 961, 'Gina': 931, 'Barnham': 841, 'Gregory': 839, 'Stronghart': 789, 'Gregson': 751, 'Lotta': 738, 'Butz': 736, 'Fulbright': 724, 'Mikotoba': 674, 'Badd': 670, 'Debeste': 670, 'Darklaw': 668, 'Oldbag': 645, 'Ryutaro': 628, 'Larry': 613, 'Dhurke': 606, 'Auchi': 600, 'Andrews': 574, 'Vigilante': 499, 'Soseki': 456, 'Simon': 444, 'Hosonaga': 417, 'Kristoph': 405, 'Alba': 401, 'Karma': 387, 'Bikini': 381, 'Valant': 367, 'Byrde

Those familiar with the Ace Attorney franchise will recognize pretty much every main character's name near the top. Once the lemma frequency drops below 100, though, even someone who has played all the games might struggle to remember every character.

In the next step, let's trim our speaker dictionary by a bit:

In [10]:
# To access data in our Counter and keep it organized from highest to lowest value, we use `most_common()`.
# Then we can slice it to store however many we want to plot. [:10] would keep the first 10 values (indices 0 through 9), since the slice stops before index 10.

mostCommonSpeakers = dict(lemmaFreq.most_common()[:75])
print(f"Most Common Speakers: {mostCommonSpeakers}")

# Here we are unpacking our sliced dictionary of most common speakers into lists of the values and keys,
# and checking to make sure they remain in their dictionary order here. 

listCounts = list(mostCommonSpeakers.values())
listLems = list(mostCommonSpeakers.keys())
print(f"listCounts: {listCounts}")
print(f"listLems: {listLems}")

Most Common Speakers: {'Phoenix': 37058, 'Edgeworth': 21778, 'Ryunosuke': 16077, 'Judge': 14607, 'Apollo': 12959, 'Maya': 7386, 'Athena': 5756, 'Susato': 5275, 'Gumshoe': 4577, 'Kay': 3940, 'Trucy': 2901, 'Ema': 2766, 'Sholmes': 2454, 'Iris': 2331, 'Van': 2147, 'Zieks': 2147, 'Layton': 2141, 'Luke': 1896, 'Mia': 1864, 'Juror': 1771, 'Franziska': 1723, 'Kazuma': 1570, 'Blackquill': 1556, 'Klavier': 1546, 'Lang': 1488, 'Pearl': 1428, 'Courtney': 1361, 'Rayfa': 1111, 'Ray': 1089, 'Nahyuta': 1038, 'Espella': 999, 'Payne': 961, 'Gina': 931, 'Barnham': 841, 'Gregory': 839, 'Stronghart': 789, 'Gregson': 751, 'Lotta': 738, 'Butz': 736, 'Fulbright': 724, 'Mikotoba': 674, 'Badd': 670, 'Debeste': 670, 'Darklaw': 668, 'Oldbag': 645, 'Ryutaro': 628, 'Larry': 613, 'Dhurke': 606, 'Auchi': 600, 'Andrews': 574, 'Vigilante': 499, 'Soseki': 456, 'Simon': 444, 'Hosonaga': 417, 'Kristoph': 405, 'Alba': 401, 'Karma': 387, 'Bikini': 381, 'Valant': 367, 'Byrde': 352, 'Yew': 352, 'Ahlbi': 350, 'Gant': 342, 'In
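
The slicing behavior used in the cell above can be checked on a toy `Counter`. This is a small sketch with made-up counts standing in for `lemmaFreq`; it also shows that `most_common(n)` gives the same result as slicing the full ranking.

```python
from collections import Counter

# Toy counts standing in for lemmaFreq (values are made up for illustration).
toy_freq = Counter({"A": 5, "B": 9, "C": 1, "D": 7})

ranked = toy_freq.most_common()           # every pair, highest count first
top_two_slice = ranked[:2]                # indices 0 and 1, so two items
top_two_direct = toy_freq.most_common(2)  # same result without slicing

print(top_two_slice)
print(top_two_direct)

# A slice longer than the list stops at the end instead of raising an error.
print(len(ranked[:10]))
```

Because a slice never runs past the end, asking for 75 speakers is safe even if the corpus had fewer than 75 lemmas.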

## Network time!
Next, we can transform our speaker data into a network using a PyVis Network!

We'll use the mostCommonSpeakers dictionary from the last cell.
We'll iterate through the dictionary's items like this:

In [11]:
for speaker, lineCount in mostCommonSpeakers.items():
    print(f'{speaker}: {lineCount}')

Phoenix: 37058
Edgeworth: 21778
Ryunosuke: 16077
Judge: 14607
Apollo: 12959
Maya: 7386
Athena: 5756
Susato: 5275
Gumshoe: 4577
Kay: 3940
Trucy: 2901
Ema: 2766
Sholmes: 2454
Iris: 2331
Van: 2147
Zieks: 2147
Layton: 2141
Luke: 1896
Mia: 1864
Juror: 1771
Franziska: 1723
Kazuma: 1570
Blackquill: 1556
Klavier: 1546
Lang: 1488
Pearl: 1428
Courtney: 1361
Rayfa: 1111
Ray: 1089
Nahyuta: 1038
Espella: 999
Payne: 961
Gina: 931
Barnham: 841
Gregory: 839
Stronghart: 789
Gregson: 751
Lotta: 738
Butz: 736
Fulbright: 724
Mikotoba: 674
Badd: 670
Debeste: 670
Darklaw: 668
Oldbag: 645
Ryutaro: 628
Larry: 613
Dhurke: 606
Auchi: 600
Andrews: 574
Vigilante: 499
Soseki: 456
Simon: 444
Hosonaga: 417
Kristoph: 405
Alba: 401
Karma: 387
Bikini: 381
Valant: 367
Byrde: 352
Yew: 352
Ahlbi: 350
Gant: 342
Ini: 332
Moe: 331
Graydon: 329
Dahlia: 328
Woods: 328
Atmey: 323
Storyteller: 323
Angel: 319
Teneiro: 317
Powers: 316
Greyerl: 313
Kira: 301


Now then, let's actually build the network!

In [63]:
# Create the network graph
net = Network(height='600px', width='100%', bgcolor='#222222', font_color='white', notebook=True, select_menu=True, cdn_resources="in_line")

# Iterate through the data and add nodes
for speaker, lineCount in mostCommonSpeakers.items():
    if lineCount > 10000:
        net.add_node(speaker, label=speaker, shape='dot', size=lineCount/250, title=f"{speaker} - Lines: {lineCount}", color='#6884ff')
    elif lineCount > 1000:
        net.add_node(speaker, label=speaker, shape='square', size=lineCount/200, title=f"{speaker} - Lines: {lineCount}", color='#ff5347')
    else:
        net.add_node(speaker, label=speaker, shape='triangle', size=lineCount/150, title=f"{speaker} - Lines: {lineCount}", color='#faff6b')

# Customize the layout
# ebb: see PyVis docs: https://pyvis.readthedocs.io/en/latest/documentation.html#pyvis.network.Network.barnes_hut
net.barnes_hut(gravity=-10000, central_gravity=3, damping=0.09, overlap=0)

# Display the graph in the Jupyter Notebook
net.show_buttons(filter_=['physics'])
net.show('AAnetwork_graph.html')

AAnetwork_graph.html
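
The three hand-coded tiers can also be pulled into a small helper, which makes the thresholds easier to test without building a graph. This is a sketch: `node_style` is a hypothetical name, and the thresholds, divisors, and colors are copied from the cell above.

```python
def node_style(line_count):
    """Return (shape, size, color) for a speaker, using the cell's thresholds."""
    if line_count > 10000:
        return "dot", line_count / 250, "#6884ff"
    elif line_count > 1000:
        return "square", line_count / 200, "#ff5347"
    return "triangle", line_count / 150, "#faff6b"


# Try one speaker from each tier.
print(node_style(37058))
print(node_style(7386))
print(node_style(999))

# Compare two speakers on either side of the top threshold.
print(node_style(10001)[1], node_style(10000)[1])
```

One side effect of using a different divisor per tier is that sizes are not monotonic across a boundary: a speaker with 10,001 lines is drawn smaller than one with 10,000.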


## Let's adjust our network some more...
Right now, our network graph's size and color are coded by hand, split into 3 different categories.
But instead, we could create our network graph using our data trends themselves as part of the graph's skeleton!
We can add edges too, creating a cascading path from our most common speaker to our least common one.

In [81]:
# Create the network graph
net2 = Network(height='600px', width='100%', bgcolor='#3c3d3c', font_color='white', notebook=True, select_menu=True, cdn_resources="in_line")

mcsList = list(mostCommonSpeakers.items())
#print(mcsList)

# Iterate through the data and add nodes
for index, (speaker, lineCount) in enumerate(mostCommonSpeakers.items()):
    #print(f'{index+1}: {speaker} - {lineCount}')

    #This will keep our nodes' sizes consistent with each other, from largest to smallest. 
    size = 75 - index

    #This'll make the string of nodes transition from blue to red cleanly, from highest speaker to lowest.
    color = f"rgba({0+index * 3}, 0, {255 - index * 3}, 0.8)"
    #print(color)
    
    net2.add_node(speaker, label=speaker, shape='dot', size=size, title=f"{speaker} - Lines: {lineCount}", 
                 color=color)

    #We'll add edges between each node, starting from the PREVIOUS speaker, so we don't do anything for the first speaker.
    if index != 0:
        net2.add_edge(mcsList[index-1][0], mcsList[index][0], value=size/3,
                      title=f"{mcsList[index-1][0]} has at least as many lines as {mcsList[index][0]}",
                      color='white')

# Customize the layout
# ebb: see PyVis docs: https://pyvis.readthedocs.io/en/latest/documentation.html#pyvis.network.Network.barnes_hut
net2.barnes_hut(gravity=-10000, central_gravity=.7, damping=0.09, overlap=0)

# Display the graph in the Jupyter Notebook
net2.show_buttons(filter_=['physics'])
net2.show('AAnetwork_graph_2.html')

AAnetwork_graph_2.html
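
The cell above hard-codes 75 speakers and a step of 3 per color channel. As a sketch of how the same idea could adapt to any list length, the hypothetical helper `gradient_color` below scales the blend by the number of speakers, and `zip` builds the same chain of consecutive pairs that the edges follow. It uses only the standard library, so it can be tried without PyVis.

```python
def gradient_color(index, total):
    """Blend from blue (index 0) to red (last index) as an rgba() string."""
    # Fraction of the way through the ranking; guard against a single-item list.
    t = index / (total - 1) if total > 1 else 0.0
    red = round(255 * t)
    blue = 255 - red
    return f"rgba({red}, 0, {blue}, 0.8)"


# The first four names from the ranking above, highest line count first.
ranked = ["Phoenix", "Edgeworth", "Ryunosuke", "Judge"]

colors = [gradient_color(i, len(ranked)) for i in range(len(ranked))]

# Pair each speaker with the next one down: the same chain the edges form.
edges = list(zip(ranked, ranked[1:]))

print(colors)
print(edges)
```

Each pair in `edges` could be passed to `add_edge` in place of the index arithmetic on `mcsList`.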
