# Network extractor for Early Print plays

<br>
This script extracts weighted edgelists from plays for further analysis in Gephi.

-  Characters share an edge through the "speaking-in-turn" principle - i.e. one character's speech follows another in the text. Adjacent pairs that occur across Act and Scene divisions are excluded.

-  For this to work, the TEI-encoded plays must include appropriate speaker attributes. Specifically, each speaker tag ('sp') must have the 'who' attribute.

-  This has been designed to work with plays from the Early Print corpus. However, it could easily be adapted to other XML schemas by changing the extract_all_characters function. From that point, everything else should work as normal.
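
As a minimal illustration of the speaking-in-turn principle described above, the sketch below pairs adjacent speakers in a toy sequence and drops any pair that touches a division marker. The speaker names and the `#scene` marker are invented for the example, and `speaking_in_turn_pairs` is a hypothetical helper rather than part of the script itself.

```python
def speaking_in_turn_pairs(sequence, is_division):
    """Return adjacent (speaker, next_speaker) pairs, skipping divisions."""
    pairs = []
    for current, following in zip(sequence, sequence[1:]):
        # A pair that touches a division marker crosses an Act/Scene boundary
        if is_division(current) or is_division(following):
            continue
        pairs.append((current, following))
    return pairs


# Toy data: invented names, with '#scene' standing in for a division marker
toy_sequence = ["#scene", "alice", "bob", "alice", "#scene", "carol", "alice"]
toy_pairs = speaking_in_turn_pairs(toy_sequence, lambda label: label.startswith("#"))
print(toy_pairs)
```

Here the final "alice" of the first scene and "carol" at the start of the second never form a pair, because a division marker sits between them.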


**NB:** _Somewhat disappointingly, not all Early Print plays have the appropriate annotation. This means that it is not possible to extract like-for-like edgelists for all plays._
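
One way to check in advance whether a play carries the required annotation is to count how many 'sp' tags have a 'who' attribute. The sketch below is an assumption about how such a check might look: it uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra dependencies, `who_coverage` is a hypothetical helper, and the XML fragment is invented.

```python
import xml.etree.ElementTree as ET


def who_coverage(xml_text):
    """Return (speeches_with_who, total_speeches) for a TEI document string."""
    root = ET.fromstring(xml_text)
    total = 0
    with_who = 0
    for element in root.iter():
        # Drop any '{namespace}' prefix so the check works with or without one
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "sp":
            total += 1
            if element.get("who"):
                with_who += 1
    return with_who, total


# Invented fragment: two speeches, only one of which names its speaker
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="scene">
<sp who="alice"><l>First line.</l></sp>
<sp><l>Unattributed line.</l></sp>
</div>
</body></text></TEI>"""

print(who_coverage(sample))
```

A play where the first number falls well short of the second is unlikely to give a like-for-like edgelist.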

<br>

## 0 - Preflight checks

Importing packages and defining functions for later.

In [2]:
# import packages
import re
import pandas as pd
from bs4 import BeautifulSoup
from collections import defaultdict

In [3]:
# Function definitions

def extract_all_characters(soup):
    """
    Function to extract characters from XML file of a play.
    
    Extracts the value of two tag attributes, in document order:
    
        the 'type' attribute, which marks Act/Scene divisions
        on 'div' tags, and the 'who' attribute, which names the
        speaking character on 'sp' tags.
    
    This function should be modified to deal with different XML schema.
    """
    idList = []
    for a in soup.find_all(['div', 'sp']):
        if 'type' in a.attrs:
            idList.append(a.attrs['type'])
        elif 'who' in a.attrs:
            idList.append(a.attrs['who'])
    df = pd.DataFrame(idList, columns=['names'])
    return df


def character_pairings_in(l):
    """
    Function to create list of tuples of character pairings from extracted data
    
    This also (quite crudely) removes any Act or Scene divisions, which are
    expected to contain a '#' character.
    """
    # Accept either a plain list or a pandas Series
    l = list(l)
    # Create pairings of adjacent entries
    l2 = [(l[i], l[i+1]) for i in range(len(l)-1)]
    # Remove all Act and Scene markers
    x = [[t for t in a if '#' not in t] for a in l2]
    # Keep only pairs of characters
    y = [row for row in x if len(row) > 1]
    # Create list of tuples
    character_pairings = [tuple(pair) for pair in y]
    
    return character_pairings

def create_edgelist_from(pairs):
    """
    Function to create edgelists for "speaking-in-turn" pairs
    
    Returns results in a way that will be useful in Gephi
    """
    # Create edgelist using defaultdict
    edges = defaultdict(int)
    for people in pairs:
        for personA in people:
            for personB in people:
                if personA < personB:
                    edges[personA + ",undirected," + personB] += 1
    
    # Create a dataframe from the defaultDict
    df = pd.DataFrame.from_dict(edges, orient='index')
    df.reset_index(level=0, inplace=True)
    
    # Split cell on comma into multiple columns
    split = (df['index'].str.split(',', expand=True).rename(columns=lambda x: f"col{x+1}"))
    
    # Merge these split columns with the 'weights' from the first df
    merged = split.join(df[0])
    
    # Rename columns for use in Gephi
    merged.columns = ["Source", "Type", "Target", "Weight"]
    
    return merged
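
The weighting step can also be illustrated without pandas. The sketch below is an alternative formulation using `collections.Counter`, not the function above: it sorts each pair so that (a, b) and (b, a) count toward the same undirected edge, and it skips a character who follows themselves. The helper name `weighted_undirected_edges` and the speaker names are invented for illustration.

```python
from collections import Counter


def weighted_undirected_edges(pairs):
    """Count how often each unordered pair of distinct speakers occurs."""
    weights = Counter()
    for first, second in pairs:
        if first == second:
            continue  # a speaker following themselves is not an edge
        source, target = sorted((first, second))
        weights[(source, target)] += 1
    # Rows in the Source, Type, Target, Weight column order
    return [
        (source, "undirected", target, weight)
        for (source, target), weight in sorted(weights.items())
    ]


toy_pairs = [("alice", "bob"), ("bob", "alice"), ("carol", "alice"), ("bob", "bob")]
for row in weighted_undirected_edges(toy_pairs):
    print(row)
```

Keeping the two names as a tuple key, rather than joining them with commas, also avoids problems if a character label ever contains a comma.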

## 1 - Read in play and extract list of characters

This creates a dataframe called **__character_list__**. This dataframe contains all textual divisions and character attributions in the order that they appear in the play.

In [19]:
# Read in play and create BeautifulSoup object
filename = "/path/to/play.xml"
with open(filename, 'r', encoding='utf-8') as file:
    raw = file.read()
    soup = BeautifulSoup(raw, 'lxml')

In [20]:
# Create list using extract function
character_list = extract_all_characters(soup)

## 2 - Manual cleaning

So far, I have been manually cleaning up some bits and pieces using a text editor. This should be automated for future work at scale.

In [21]:
# Save externally for manual correction
# (no header row, so the file matches header=None when it is read back in)
character_list.to_csv("/path/to/play.csv", header=False, index=False)

## 3 - Create edgelist

This takes the manually cleaned list of character names and textual divisions and runs it through two functions. These have been strung together in a way that explains what they're doing!

In [31]:
# This will not be necessary, if/when I can automate the cleanup process.
data = pd.read_csv("/path/to/play.csv", header=None)

In [32]:
# If not reading in the csv, replace data[0] with character_list['names'] in the following line
edgelist_df = create_edgelist_from(character_pairings_in(data[0]))

In [33]:
# Save to csv
edgelist_df.to_csv("/path/to/edgelist.csv", sep=",", index=False, header=True)

**TODO** 
-  Create a main() function and iterate
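
One possible shape for the main() mentioned in the TODO is sketched below. It is only an assumption about how the iteration might look: `process_play` is a hypothetical callable standing in for the read, extract, and edgelist steps above, and the directory layout (one `.xml` file per play, one `_edgelist.csv` file per output) is invented for illustration.

```python
from pathlib import Path


def main(input_dir, output_dir, process_play):
    """Run process_play(xml_path, csv_path) for every .xml file in input_dir."""
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for xml_path in sorted(input_dir.glob("*.xml")):
        csv_path = output_dir / f"{xml_path.stem}_edgelist.csv"
        try:
            process_play(xml_path, csv_path)
        except Exception as error:
            # e.g. a play without usable 'who' attributes; report and move on
            print(f"Skipping {xml_path.name}: {error}")
            continue
        written.append(csv_path)
    return written
```

Passing the per-play step in as a function keeps the loop independent of the XML schema, so only that one callable needs to change for a different corpus.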