## Count and metric measure rankings

Script for generating ranking statistics for Early Modern plays.

NB: this script depends on the output from NetworkAnalysis.ipynb, and each play needs a Gephi file with node metrics.
    
Long term goals:

- Automate for a full folder of plays.
- Remove Gephi from the workflow and generate node stats using NetworkX.
  

## 0 - Preflight checks

Import packages and define functions for later

In [None]:
# Import required packages, modules
from os import listdir  # needed by list_xmlfiles and list_textfiles
import re
import pandas as pd
from bs4 import BeautifulSoup

In [None]:
# Define functions
def list_xmlfiles(directory):
    """
    Return a list of filenames ending in '.xml' in DIRECTORY.
    Not strictly necessary but will be useful if we try to scale.
    """
    xmlfiles = []
    for filename in listdir(directory):
        if filename.endswith(".xml"):
            xmlfiles.append(filename)
    return xmlfiles

def list_textfiles(directory):
    """
    Return a list of filenames ending in '.txt' in DIRECTORY.
    Not strictly necessary but will be useful if we try to scale.
    """
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(filename)
    return textfiles

def count_totals(character_list):
    """
    Count the total number of speech acts and words per character in a play.

    Relies on the global BeautifulSoup object `soup` created in section 1.
    The "lines" column holds the number of speech acts, i.e. elements whose
    `who` attribute matches the character.
    """
    counts = []
    
    for character in character_list:
        # One inner list per speech act: the text of each <l> or <p> inside it
        lines = [[line.text for line in speech(['l', 'p'])] for speech in soup.findAll(who=character)]
        # Replace line breaks so that they cannot fuse neighbouring words
        words = [[text.replace('\n', ' ').replace('\r', '') for text in speech] for speech in lines]
        
        # Word count for every <l>/<p> unit spoken by this character
        word_counts = []
        for speech in words:
            for text in speech:
                word_counts.append(len(re.findall(r'\w+', text)))
        
        speech_acts = len(lines)
        total_words = sum(word_counts)
        counts.append((character, speech_acts, total_words))
        
    df = pd.DataFrame(counts, columns=["character", "lines", "words"])
    
    return df

def total_rankings(df):
    """
    Create count rankings based on speech-act ("lines") and word counts.
    
    Ranks are dense (ties share a rank); count_rank is the mean of the two
    ranks, truncated to an integer.
    """
    # Column assignment aligns on the index, so no sorting is needed before ranking
    df["line_rank"] = df["lines"].rank(method='dense', ascending=False).astype(int)
    df["word_rank"] = df["words"].rank(method='dense', ascending=False).astype(int)
    df["count_rank"] = ((df["line_rank"] + df["word_rank"])/2).astype(int)
    return df

def metric_rankings(df):
    """
    Create metric rankings based on the node metrics exported from Gephi.
    
    I don't like this function very much. It's too pandas-y. But it works.
    """
    # Column names (including the spelling "betweenesscentrality") follow the Gephi table
    df["WD_rank"] = df["Weighted Degree"].rank(method='dense', ascending=False).astype(int)
    df["EC_rank"] = df["eigencentrality"].rank(method='dense', ascending=False).astype(int)
    df["degree_rank"] = df["Degree"].rank(method='dense', ascending=False).astype(int)
    df["BC_rank"] = df["betweenesscentrality"].rank(method='dense', ascending=False).astype(int)
    # Mean of the four ranks, truncated to an integer
    df["metrics_rank"] = ((df["WD_rank"] + df["EC_rank"] + df["degree_rank"] + df["BC_rank"])/4).astype(int)
    return df
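
The file-listing helpers above are the starting point for the long-term goal of processing a full folder of plays. As a minimal sketch of the same idea using `pathlib`, assuming one XML file per play in a single directory: the name `iter_play_files` and the temporary folder are hypothetical and only there for illustration.

```python
from pathlib import Path
import tempfile

def iter_play_files(directory, suffix=".xml"):
    """Return the files in DIRECTORY whose names end in SUFFIX, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )

# Demonstration on a throwaway folder with two fake plays and one stray file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_play.xml", "a_play.xml", "notes.txt"):
        (Path(tmp) / name).write_text("<text/>")
    found = [p.name for p in iter_play_files(tmp)]

print(found)
```

Each returned path could then be fed through sections 1 to 3 in a loop, once the play-specific cleanup in 1.b has been generalised.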

## 1 - Calculate count measures

First, we read in the Early Print XML play. Then we create idList, which holds the `who` attribute of every speech (`sp`) element in the play. We then take the unique set of these character IDs and feed that into the count_totals function.

This returns a dataframe called **_totals_** which contains the count measures (speech acts and words) for each character in the play.

In [None]:
# Read in plays and create BeautifulSoup object
filename = "/path/to/play.xml"
with open(filename, 'r') as file: 
    raw = file.read()
    soup = BeautifulSoup(raw, 'lxml')

In [None]:
# create list of characters based on lines
idList = []
for a in soup.findAll('sp'):
    if 'who' in a.attrs.keys():
        idList.append(a.attrs['who'])

# Only unique characters
unique = set(idList)

In [None]:
# Count totals
totals = count_totals(unique)
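
To make the counting logic easier to check by hand, here is a small self-contained sketch of the same idea on a made-up three-speech play. It swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so that it runs without extra packages; the sample XML, the character IDs, and the function name `count_sample` are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature play; real Early Print files are far richer
SAMPLE = """
<text>
  <sp who="A-phebe"><l>Good morrow, sir.</l><l>How fare you?</l></sp>
  <sp who="A-dick"><p>Well enough.</p></sp>
  <sp who="A-phebe"><l>I am glad.</l></sp>
</text>
"""

def count_sample(xml_string):
    """Return {character: (speech_acts, words)} for every <sp> with a who attribute."""
    root = ET.fromstring(xml_string)
    counts = {}
    for speech in root.iter("sp"):
        who = speech.get("who")
        if who is None:
            continue
        # Count \w+ tokens in every <l> or <p> inside this speech
        n_words = sum(
            len(re.findall(r"\w+", "".join(unit.itertext())))
            for unit in speech.iter()
            if unit.tag in ("l", "p")
        )
        acts, words = counts.get(who, (0, 0))
        counts[who] = (acts + 1, words + n_words)
    return counts

sample_counts = count_sample(SAMPLE)
print(sample_counts)
```

Here the character `A-phebe` has two speech acts and nine words, and `A-dick` has one speech act and two words, which is the shape of data that ends up in the "lines" and "words" columns.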

## 1.b - Cleanup tables and rank measures

There are still some errors in these tables that require a little fiddling around. The following lines are meant only as examples of the kind of cleaning up that can be (and has been) performed on the **_totals_** tables.

The cleaned-up **_totals_** table is then sent to the total_rankings function, returning a new dataframe called **_count_ranks_**.

In [None]:
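# Remove the TCP ID prefix (a literal string, so no regex is needed)
totals["character"] = totals["character"].str.replace('A77565_01-', '', regex=False)
# Normalise variant spellings, then re-aggregate so merged characters are summed
totals["character"] = totals["character"].str.replace('phib', 'phebe', regex=False)
totals = totals.groupby('character').sum().reset_index()
# Delete unwanted characters by row index label (10 is specific to this play)
totals = totals.drop([10])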

In [None]:
# Calculate + save count ranks
count_ranks = total_rankings(totals)
file1 = "/path/to/ranked_counts.csv"
count_ranks.to_csv(file1, header=True, sep="\t")
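
The ranking step is easier to follow on a toy table. The sketch below uses invented numbers (the dataframe `toy` is not from any play) to show what dense ranking does with ties and how truncating the mean of two ranks behaves.

```python
import pandas as pd

# Invented counts for four characters; "b" and "c" tie on lines
toy = pd.DataFrame({
    "character": ["a", "b", "c", "d"],
    "lines": [40, 25, 25, 3],
    "words": [300, 310, 120, 15],
})

# Dense ranking: tied values share a rank and the next rank is not skipped
toy["line_rank"] = toy["lines"].rank(method="dense", ascending=False).astype(int)
toy["word_rank"] = toy["words"].rank(method="dense", ascending=False).astype(int)

# Mean of the two ranks; astype(int) truncates, so 1.5 becomes 1
toy["count_rank"] = ((toy["line_rank"] + toy["word_rank"]) / 2).astype(int)

print(toy)
```

Note that characters "a" and "b" both end up with a count_rank of 1, because their mean rank of 1.5 is truncated rather than rounded.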

## 2 - Calculate metric measures

We now import the table of node metrics generated using Gephi. This table is then sent to the **_metric_rankings_** function, which returns a dataframe called **_metric_ranks_**.

In [None]:
# read csv
gephi = "/Users/au564346/Desktop/gephi_metrics_brome.csv"
metrics = pd.read_csv(gephi)
# Keep only the needed columns; copy so that rank columns can be added safely later
metrics = metrics[["Id", "Degree", "Weighted Degree", "eigencentrality", "betweenesscentrality"]].copy()

In [None]:
# Check for consistency
len(metrics) == len(count_ranks)
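
Equal lengths do not guarantee that the character names match, and the inner merge in section 3 drops any row without a partner. As a possible stricter check, the sketch below uses an outer merge with `indicator=True` to list unmatched names; the two small dataframes are invented for illustration.

```python
import pandas as pd

# Invented tables: same length, but "tom" and "Tom" differ in case
counts_demo = pd.DataFrame({"character": ["phebe", "dick", "tom"]})
metrics_demo = pd.DataFrame({"Id": ["phebe", "dick", "Tom"]})

# The _merge column records whether each row was found on the left, right, or both
check = counts_demo.merge(
    metrics_demo, left_on="character", right_on="Id",
    how="outer", indicator=True,
)
unmatched = check[check["_merge"] != "both"]

print(unmatched)
```

An empty `unmatched` table would mean that every character in the count table has a counterpart in the Gephi table and vice versa.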

In [None]:
# Calculate + save ranked metric measures
metric_ranks = metric_rankings(metrics)
metric_ranks.to_csv("/path/to/ranked_metrics.csv",
             header=True, sep="\t")
metric_ranks

## 3 - Combine tables

Firstly, we create an abridged table which has only the average count and metrics rankings.

In [None]:
# Save abridged ranks
ranks = count_ranks.merge(metric_ranks, left_on='character', right_on='Id')
ranks = ranks[["character", "count_rank", "metrics_rank"]]
ranks.to_csv("/path/to/ranked.csv",
            header=True, sep="\t")

Next, we create a larger table that brings together all of our desired metrics into a single table.

In [None]:
all_ranks = count_ranks.merge(metric_ranks, left_on='character', right_on='Id')
all_ranks = all_ranks[["character", "lines", "words", "Degree", "Weighted Degree",
                        "eigencentrality", "betweenesscentrality", "line_rank", "word_rank", "degree_rank", 
                           "WD_rank","BC_rank", "EC_rank", "count_rank", "metrics_rank"]]
all_ranks.to_csv("/path/to/complete_rankings.csv")

Lastly, we calculate Spearman's rho between the average count and metric rankings, in order to see how they compare.

In [None]:
# Calculate + save Spearman's rho (numeric columns only; "character" holds strings)
corr = ranks[["count_rank", "metrics_rank"]].corr(method='spearman')
corr.to_csv("/path/to/correlations.csv",
            header=True, sep="\t")
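
As a sanity check on what the correlation measures, the sketch below compares pandas' Spearman result with the textbook formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) on four invented rank pairs. The formula is only exact when there are no ties, which is an assumption of this toy example; the real tables use dense ranks and may contain ties.

```python
import pandas as pd

# Invented rank pairs without ties
demo = pd.DataFrame({
    "count_rank": [1, 2, 3, 4],
    "metrics_rank": [1, 3, 2, 4],
})

# Same call as in the notebook, restricted to the two numeric columns
rho_pandas = demo.corr(method="spearman").loc["count_rank", "metrics_rank"]

# Textbook formula, valid for untied ranks
n = len(demo)
d_squared = ((demo["count_rank"] - demo["metrics_rank"]) ** 2).sum()
rho_formula = 1 - 6 * d_squared / (n * (n ** 2 - 1))

print(rho_pandas, rho_formula)
```

Both values come out at 0.8 here: two of the four characters swap places between the rankings, so the agreement is strong but not perfect.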