<a href="https://colab.research.google.com/github/quadrismegistus/character-networks/blob/main/GenerateFictionalSocialNetwork.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Powerloom**: Visualizing Character Networks

### 🏃‍♀️ Starting up

In [14]:
#@title 📚 Choose a book to analyze
#@markdown Enter the title of a book, along with a URL to a text file. To find a URL, search:
#@markdown * [Project Gutenberg](https://gutenberg.org) for out-of-copyright texts
#@markdown * Z-lib for in-copyright texts

title = "Overstory" #@param {type:"string"}
url = "" #@param {type:"string"}



#### Initial imports and settings
import os,sys
from google.colab import drive,files
from ipywidgets import widgets
from IPython.display import Markdown, display, Javascript
# nicer print func
def printm(x): display(Markdown(x))
# elementary dirs
PATH_ROOT='/content'
PATH_LIB=os.path.join(PATH_ROOT,'lib')
PATH_TMP=os.path.join(PATH_ROOT,'tmp')
PATH_TMP_UPLOAD_FN_PRE=os.path.join(PATH_TMP,'uploaded_text')
PATH_TOOLS=os.path.join('/content','tools')
PATH_NOVELS=os.path.join(PATH_ROOT,'texts')
PATH_TO_BOOKNLP=os.path.join(PATH_TOOLS,'book-nlp')
if not os.path.exists(PATH_ROOT): os.makedirs(PATH_ROOT)
if not os.path.exists(PATH_LIB): os.makedirs(PATH_LIB)
if not os.path.exists(PATH_TMP): os.makedirs(PATH_TMP)
# add lib to python path
if not PATH_LIB in sys.path: sys.path.insert(0,PATH_LIB)




# offer button as well
def upload_f(x,tmpfnpre=PATH_TMP_UPLOAD_FN_PRE):
    from google.colab import files
    res = files.upload()
    fn=list(res.keys())[0]
    if fn:
        delete_f(None)
        fnpre,fnext=os.path.splitext(fn)
        tmpfn=tmpfnpre+fnext
        !mv "$fn" "$tmpfn"
        printm(f'* Moving to: {tmpfn}')
    Javascript('Jupyter.notebook.execute_cell_and_select_below()')

def get_uploaded_fn():
    tmp_fns=os.listdir(PATH_TMP)
    matches=[tmp_fn for tmp_fn in tmp_fns if tmp_fn.startswith(os.path.basename(PATH_TMP_UPLOAD_FN_PRE))]
    if matches:
        match=matches[0]
        matchpath=os.path.join(PATH_TMP,match)
        return matchpath

def delete_f(x):
    fn=get_uploaded_fn()  # look the file up once and reuse the result
    if fn:
        os.remove(fn)
        printm(f'* Deleted: {fn}')
    Javascript('Jupyter.notebook.execute_cell_and_select_below()')

printm('### Alternatively to a URL, you can upload a file (txt, epub, pdf, docx, ...)')

ubutton = widgets.Button(description="Upload text")
ubutton.on_click(upload_f)
display(ubutton)

# is uploaded file
if get_uploaded_fn():
    printm(f'Currently using uploaded file: {os.path.basename(get_uploaded_fn())}')
    delbutton = widgets.Button(description="Delete file")
    delbutton.on_click(delete_f)
    display(delbutton)


printm('### Checking input')
# Load widget code
# Valid input?
class InvalidInput(Exception):
    def _render_traceback_(self): pass
NOVEL_URL=url
NOVEL_TITLE_NICE=title.strip()
NOVEL_TITLE=NOVEL_TITLE_NICE.title().replace(' ','')
if not NOVEL_TITLE:
    print('Type a title in the box above')
    raise InvalidInput
else:
    printm(f'* Using title: **{NOVEL_TITLE_NICE}**')
    printm(f'* Using filename: **{NOVEL_TITLE}**')



# get txt
### getting text using code from a gist
!pip install bs4 kitchen wget fulltext epub-conversion pymupdf requests xml_cleaner html2text -q -q
import urllib.request
gisturl='https://gist.githubusercontent.com/quadrismegistus/f76c2ffcccedc496a638ca430b6851ab/raw/ede9744a01adc324a6223b924789f1553853f91c/brute_txt.py'
ofnbrute=os.path.join(PATH_LIB,'brute_txt.py')
if not os.path.exists(ofnbrute):
    urllib.request.urlretrieve(gisturl,ofnbrute)
from brute_txt import brute

def get_novel_txt():
    NOVEL_TXT=''
    # is uploaded file?
    if get_uploaded_fn():
        printm(f'* Loading text from uploaded file: {get_uploaded_fn()}')
        NOVEL_TXT=brute(get_uploaded_fn())
    elif url:
        printm(f'* Loading text from URL: {url}')
        NOVEL_TXT=brute(url)
    else:
        printm('Neither an uploaded file nor a URL was provided')
        raise InvalidInput
    # save text
    if not NOVEL_TXT:
        printm('Empty text')
        raise InvalidInput
    else:
        beginning=' '.join(NOVEL_TXT[:100].split())
        printm(f'* Loaded book with **{len(NOVEL_TXT.strip().split())}** words')
        printm(f'''* Book begins: {beginning}''')
    return NOVEL_TXT

NOVEL_TXT=get_novel_txt()

### Alternatively to a URL, you can upload a file (txt, epub, pdf, docx, ...)

Currently using uploaded file: uploaded_text.epub

### Checking input

* Using title: **Overstory**

* Using filename: **Overstory**

* Loading text from uploaded file: /content/tmp/uploaded_text.epub

* Loaded book with **56043** words

* Book begins: THE OVERSTORY A NOVEL RICHARD POWERS ![img](images/logo.jpg) W. W. NORTON & COMPANY _Independe

In [15]:
#@title 📂 Choose where to store data
#@markdown ##### Save text data to Google drive?
#@markdown <small>This will save all generated data to your Google Drive, making downloading/uploading annotations and other data unnecessary. If yes, specify a directory path relative to your root Drive folder.</small>
save = "Yes, save in Google Drive" #@param ["Yes, save in Google Drive", "No, keep on temporary storage"]
path = "DigHum/CharacterNetworks" #@param {type:"string"}

if 'Yes' in save:
    # mount
    if not os.path.exists('/content/drive'):
        from google.colab import drive
        drive.mount('/content/drive')
    else:
        printm('* Google Drive mounted at: /content/drive   ')
    # create basedir
    PATH_ROOT_DRIVE=os.path.join('/content/drive','My Drive',path)
    if not os.path.exists(PATH_ROOT_DRIVE): os.makedirs(PATH_ROOT_DRIVE)
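    PATH_NOVELS_DRIVE=os.path.join(PATH_ROOT_DRIVE,'texts')
    # create the link target first so the symlink below is not left dangling
    if not os.path.exists(PATH_NOVELS_DRIVE): os.makedirs(PATH_NOVELS_DRIVE)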
    if not os.path.exists(PATH_NOVELS):
        #@todo what if user changes mind about gdrive?
        os.symlink(PATH_NOVELS_DRIVE, PATH_NOVELS)
        printm(f'* Linking: {PATH_NOVELS} --> **{PATH_NOVELS_DRIVE}**')
    else:
        printm(f'* Linked: {PATH_NOVELS} --> **{PATH_NOVELS_DRIVE}**')


URL_CORENLP='http://nlp.stanford.edu/software/stanford-corenlp-4.1.0.zip'
MODEL_FN='stanford-corenlp-4.1.0-models.jar'
PATH_TO_BOOKNLP_BINARY=os.path.abspath(os.path.join(PATH_TO_BOOKNLP,'runjava'))
PATH_NOVEL=os.path.join(PATH_NOVELS,NOVEL_TITLE)
ofn_novel=os.path.join(PATH_NOVEL,f'{NOVEL_TITLE}.txt')
ofn=os.path.join(PATH_NOVEL,f'data.parses.{NOVEL_TITLE}.jsonl')
ofn_meta=os.path.join(PATH_NOVEL,f'data.charmeta.{NOVEL_TITLE}.csv')
ofn_meta_anno=os.path.join(PATH_NOVEL,f'data.charmeta.{NOVEL_TITLE}.anno.csv')
ofn_booknlp_out=os.path.splitext(ofn_novel)[0]+'.booknlp'
ofn_booknlp_toks=os.path.join(ofn_booknlp_out,'tokens.txt')
ofn_fig_dir=os.path.join(PATH_NOVEL,'imgs')
ofn_gif=os.path.join(PATH_NOVEL,'anim.gif')
ofn_mp4=os.path.join(PATH_NOVEL,'anim.mp4')
printm(f'* Path to novel data set to: **{PATH_NOVEL}**')
if not os.path.exists(PATH_NOVEL): os.makedirs(PATH_NOVEL)
if not os.path.exists(PATH_TOOLS): os.makedirs(PATH_TOOLS)
if not NOVEL_TXT:
    raise InvalidInput
else:
    with open(ofn_novel,'w') as of:
        of.write(NOVEL_TXT)

* Google Drive mounted at: /content/drive   

* Linked: /content/texts --> **/content/drive/My Drive/DigHum/CharacterNetworks/texts**

* Path to novel data set to: **/content/texts/Overstory**

In [16]:
#@title Once you are done, click on menu item **Runtime > Run all**



## 🔩 Installations

In [17]:
#@title Install dependencies
!pip install dynetx fa2 pandas psutil humanize colour dimcli numpy moviepy ffmpeg pyvis networkx gender-guesser imageio-ffmpeg imageio -q


#@title Import modules
# imports
import os,sys
import pandas as pd
import networkx as nx
import dynetx as dn
from collections import Counter
from shutil import which
from colour import Color
import numpy as np
from ipywidgets import interact, interactive, fixed, interact_manual, widgets
import warnings
warnings.filterwarnings('ignore')
import math,os
from tqdm import tqdm
from collections import defaultdict
import plotly.express as px
pd.options.display.max_rows=25
import psutil
import stat
from pathlib import Path
import time
import humanize
import datetime as dt




In [None]:
#@title Install Booknlp
os.chdir(PATH_TOOLS)
if not os.path.exists(PATH_TO_BOOKNLP):
    !git clone https://github.com/dbamman/book-nlp
PATH_BOOKNLP_MODELS=os.path.join(PATH_TO_BOOKNLP,'lib',MODEL_FN)
PATH_BOOKNLP_EXEC=os.path.join(PATH_TO_BOOKNLP,'runjava2')
if not os.path.exists(PATH_BOOKNLP_MODELS):
    !wget $URL_CORENLP
    corenlp_fn=URL_CORENLP.split('/')[-1]
    corenlp_dir=f'{corenlp_fn.split(".zip")[0]}'
    !unzip -q "$corenlp_fn"
    ifnfn=f'{corenlp_fn.split(".zip")[0]}/{MODEL_FN}'
    !mv "$ifnfn" "$PATH_BOOKNLP_MODELS"
    !rm "$corenlp_fn"
    !rm -rf "$corenlp_dir"
os.chdir(PATH_ROOT)

def java_size(bytes, units=['b','k','m','g']):
    """ Returns a human readable string representation of bytes """
    return str(bytes) + units[0] if bytes < 1024 else java_size(bytes>>10, units[1:])

memstr=java_size(int(float(psutil.virtual_memory().available)*0.75))

runjava_cmd=f"""#!/bin/sh

# add all the jars anywhere in the lib/ directory to our classpath
here=$(dirname $0)
CLASSES=$here/bin
CLASSES=$CLASSES:$(echo $here/lib/*.jar | tr ' ' :)
CLASSES=$CLASSES:$here/book-nlp.jar

java -XX:ParallelGCThreads=2 -Xmx{memstr} -ea -classpath $CLASSES $*
"""
with open(PATH_BOOKNLP_EXEC,'w') as of: of.write(runjava_cmd)
!chmod +x "$PATH_BOOKNLP_EXEC"

Cloning into 'book-nlp'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 313 (delta 7), reused 20 (delta 2), pack-reused 286
Receiving objects: 100% (313/313), 75.23 MiB | 21.57 MiB/s, done.
Resolving deltas: 100% (111/111), done.
--2020-11-02 20:06:24--  http://nlp.stanford.edu/software/stanford-corenlp-4.1.0.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/software/stanford-corenlp-4.1.0.zip [following]
--2020-11-02 20:06:24--  https://nlp.stanford.edu/software/stanford-corenlp-4.1.0.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 504773765 (481M) [application/zip]
Saving to: ‘stanford-corenlp-4.1.

## 🔨 Parse

Parsing takes roughly 10-15 minutes for most novels.

In [None]:
#@title Parse text using BookNLP
def parse_text(path_txt):
    import time
    now=time.time()
    if not path_txt: return    
    path_out=ofn_booknlp_out
    path_toks=ofn_booknlp_toks
    cmd=f'cd "{PATH_TO_BOOKNLP}" && ./{os.path.basename(PATH_BOOKNLP_EXEC)} novels/BookNLP -doc {path_txt} -printHTML -p {path_out} -tok {path_toks} -f'
    # print('>>',cmd)
    !{cmd} #os.system(cmd)
    os.rename(os.path.join(path_out,'book.id.html'), os.path.join(path_out,'parsed.html'))
    os.rename(os.path.join(path_out,'book.id.book'), os.path.join(path_out,'parsed.json'))
    nownow=time.time()
    with open(path_txt) as tf: numwords=len(tf.read().strip().split())
    speed=numwords/(nownow-now)
    print(f'\n >> Finished parsing in {humanize.naturaldelta(dt.timedelta(seconds=nownow-now))} ({speed} words/sec)')
#@title
# Parse! This will take 10-15 minutes for most novels... time to make coffee?
if not os.path.exists(ofn_booknlp_toks): parse_text(ofn_novel)

In [None]:
#@title Load generated character data
def read_parsed_json(path_parsed_json):
    import json,os
    from collections import defaultdict,Counter
    #json.loads(path_parsed_json)
    # use a context manager so the file handle is closed after reading
    with open(path_parsed_json) as f: dat=json.load(f)
    keys=[]
    nullchar=defaultdict(Counter)
    text_id=path_parsed_json.split('/')[-3]
    for char in dat['characters']:
        if not char['names']: continue
        names=[x['n'] for x in char['names']]
        
        chardx={'name':names[0], 'id':char['id'], 'names':', '.join(names), 'text_id':text_id}
        num=0
        for key in ['agent','patient','poss','mod','speaking']:
            chardx['num_'+key]=len(char[key])
            num+=chardx['num_'+key]
            # chardx['words_'+key]=words=[]
            # for event in char[key]:
            #     if 'w' in event:
            #         wtxt=event['w']
            #         wlist=word_tokenize(wtxt) if ' ' in wtxt else [wtxt]
            #         wlist=[w.lower() for w in wlist if w and w[0].isalpha()]
            #         words+=wlist
        num+=1
        chardx['num']=num
        yield chardx

# Load jsons into character metadata
char_jsons=list(read_parsed_json(os.path.join(ofn_booknlp_out,'parsed.json')))
char_df=pd.DataFrame(char_jsons)
id2name=dict(zip(char_df.id,char_df.name))
printm(f'Found **{len(char_jsons)}** characters, for a total of **{sum(char_df.num)}** mentions.')
topn=10
printm(f'#### Top {topn} characters')
display(char_df.sort_values('num',ascending=False)[['name','num','names']].head(topn))
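
To make the counting in `read_parsed_json` concrete, here is a minimal sketch on a hand-made record. The dictionary is a toy stand-in that only mimics the keys this notebook reads (`names`, `agent`, `patient`, `poss`, `mod`, `speaking`); its values are invented, not real BookNLP output, and `count_mentions` is a hypothetical helper name.

```python
# Toy character record; every value here is invented for illustration.
toy_char = {
    "id": 0,
    "names": [{"n": "Alice"}, {"n": "Miss Alice"}],
    "agent": [{"w": "ran"}, {"w": "said"}],
    "patient": [{"w": "saw"}],
    "poss": [],
    "mod": [{"w": "tall"}],
    "speaking": [{"w": "Hello"}],
}

ROLES = ["agent", "patient", "poss", "mod", "speaking"]


def count_mentions(char):
    """Count events per role and sum them, plus one, as read_parsed_json does."""
    per_role = {f"num_{role}": len(char[role]) for role in ROLES}
    total = sum(per_role.values()) + 1
    return {"name": char["names"][0]["n"], **per_role, "num": total}


print(count_mentions(toy_char))
```

The first entry in `names` becomes the display name, so the order of that list decides how a character is labelled until it is renamed in the annotation step.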

In [None]:
#@title Load word-by-word parse data
# load and gen
def load_tokdf(ofn_booknlp_toks=ofn_booknlp_toks):
    # on_bad_lines='skip' replaces error_bad_lines/warn_bad_lines, which were removed in pandas 2.0
    tok_df=pd.read_csv(ofn_booknlp_toks,sep='\t',on_bad_lines='skip',engine='python')
    tok_df['isChar']=tok_df['characterId'].apply(lambda x: int(x!=-1))
    tok_df['isWord']=1
    return tok_df
tok_df=load_tokdf(ofn_booknlp_toks=ofn_booknlp_toks)
# show
from IPython.display import Markdown, display, Javascript
# nicer print func
def printm(x): display(Markdown(x))
printm('#### Word-level data')
printm(f'* Number of words: {len(tok_df)}')
printm(f'* Number of sentences: {len(set(tok_df.sentenceID))}')
printm(f'* Number of paragraphs: {len(set(tok_df.paragraphId))}')
printm(f'* Number of character mentions: {sum(tok_df.isChar)}')
display(tok_df)

In [None]:
'``' in set(tok_df.originalWord)

## 🤔 Examine character data

In [None]:
#@title Prepare metadata for manual annotation
if 1:# not os.path.exists(ofn_meta):
    # Guess gender of characters?
    import gender_guesser.detector as gender
    gd = gender.Detector()

    prefix_clues={
        'female':{'Ms.', 'Ms ','Mrs.','Mrs ','Miss ','Madame','Mme.',
                  'Mme ','Signorina','Maestra ',
                  'Lady '},
        'male':{'Don ','Signor ','Mr.','Mr ','Maestro ','Lord ','Sir '}
    }

    def guess_gender(name,gd=gd):
        if not name: return None

        for gndr,prfxs in prefix_clues.items():
            for p in prfxs:
                if name.startswith(p):
                    return gndr
        
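        # gender_guesser looks up given names, so pass only the first token of the name
        first=name.split()[0] if name.strip() else name
        gend=gd.get_gender(first).replace('mostly_','')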
        return gend
    char_df['gender']=char_df.name.apply(guess_gender)
    dfm=char_df#.groupby(['name','gender','id']).sum().reset_index()

    # create other fields
    dfm['name_real']=dfm['name']
    dfm['other']=''
    dfm['notes']=''
    dfm['race']=''
    dfm['class']=''
    dfm_save=dfm[['id','name','name_real','num','gender','race','class','other','notes']].sort_values('num',ascending=False)
    dfm_save.to_csv(ofn_meta,index=False)
    # if not os.path.exists(ofn_meta_anno):
        # dfm_save.to_csv(ofn_meta_anno,index=False)
    printm('#### Metadata generated automatically')
    display(dfm_save)
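
The title-based shortcut in `guess_gender` can be sketched without the `gender_guesser` dependency. This is only an illustration of the prefix rule: `PREFIX_CLUES` is a shortened copy of the notebook's `prefix_clues`, and `guess_from_prefix` is a hypothetical name.

```python
# Shortened copy of the notebook's prefix table (assumption: a subset is enough to show the idea).
PREFIX_CLUES = {
    "female": ("Ms.", "Mrs.", "Miss ", "Lady "),
    "male": ("Mr.", "Lord ", "Sir "),
}


def guess_from_prefix(name):
    """Return 'female' or 'male' if the name starts with a known title, else None."""
    for gender, prefixes in PREFIX_CLUES.items():
        if name.startswith(prefixes):  # str.startswith accepts a tuple of prefixes
            return gender
    return None


for example in ["Mrs. Dalloway", "Sir Walter", "Clarissa"]:
    print(example, "->", guess_from_prefix(example))
```

Names without a title return `None` here; in the notebook those fall through to the dictionary lookup, and any guess can be overridden by hand in the annotation CSV.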

In [None]:
#@title Download/upload metadata for manual annotation
from google.colab import files
def download_anno(x):
    files.download(ofn_meta)
dbutton = widgets.Button(description="Download as CSV")
dbutton.on_click(download_anno)
printm('''### Download automatically generated character metadata''')
printm('''Open in a spreadsheet editor (e.g. Excel) and change the names
in the column 'name_real' to rename the character.
Delete the name there to declare that character name invalid.
The other columns allow you to set variables to color or size the networks by.
''')
display(dbutton)

printm('### Upload manually refined character metadata')
printm('''When you're done, click "Upload" and upload the new CSV below.
If you saved your sheet as an Excel file, export it as CSV before uploading.
''')




def upload_anno(x):
    from google.colab import files
    FILES = files.upload()
    if not FILES: return
    fn=list(FILES.keys())[0]
    fnpre,ext=os.path.splitext(fn)
    if ext.lower()!='.csv':
        print('File must be a CSV file (e.g. in Excel, export as CSV).')
        raise InvalidInput
    !mv "$fn" "$ofn_meta_anno"


button = widgets.Button(description="Upload annotations")
button.on_click(upload_anno)
display(button)
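
If you prefer to annotate in code rather than in a spreadsheet, the same edits can be sketched with Python's `csv` module. The rows below are invented sample data, and `annotate` is a hypothetical helper; only the column names `name` and `name_real` come from this notebook.

```python
import csv
import io

# Invented sample of the auto-generated metadata CSV.
raw = """id,name,name_real,num
0,Mr. Darcy,Mr. Darcy,120
1,Darcy,Darcy,45
2,Chapter,Chapter,9
"""


def annotate(rows, renames, invalid):
    """Rewrite name_real: apply renames, and blank out names declared invalid."""
    edited = []
    for row in rows:
        row = dict(row)
        if row["name"] in invalid:
            row["name_real"] = ""  # an empty name_real marks the entry as not a character
        else:
            row["name_real"] = renames.get(row["name"], row["name_real"])
        edited.append(row)
    return edited


rows = list(csv.DictReader(io.StringIO(raw)))
edited = annotate(rows, renames={"Darcy": "Mr. Darcy"}, invalid={"Chapter"})

# Write the annotated rows back out as CSV text, ready to be saved and uploaded.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(edited)
print(buf.getvalue())
```

Giving two rows the same `name_real` is how aliases are merged into one character in the filtering step.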

In [None]:
#@title Reload and filter metadata
# load csv

min_num_mentions=widgets.IntSlider(min=1,max=50,value=5)
printm('### Filter by the minimum number of mentions')

def load_dfm():
    ifn=ofn_meta_anno if os.path.exists(ofn_meta_anno) else ofn_meta
    dfm=pd.read_csv(ifn).fillna('').rename(columns={'num':'num_total', 'count':'num_total', 'name_standardized':'name_real','id':'characterId'})
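    # keep rows whose name_real starts with a letter; a blank name_real (declared invalid) is dropped
    dfm=dfm[dfm.name_real.apply(lambda x: str(x)[:1].isalpha())]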
    return dfm


def filter_metadata(min_num_mentions=min_num_mentions):
    dfm=load_dfm()
    # show changes
    name2real=dict(zip(dfm.name,dfm.name_real))
    for _name,_name_real in sorted(name2real.items()):
        if _name!=_name_real:
            printm(f'* Changed: {_name} --> {_name_real}')

    len1=len(dfm)    
    dfmr=dfm#.rename()
    strcols=set(dfmr.columns) - {'num_total','name'}
    
    newld=[]
    for name,namedf in dfmr.groupby('name_real'):
        num_total=namedf.num_total.sum()
        if num_total<min_num_mentions: continue
        ncdf=namedf.mode()
        common_d=dict(ncdf.iloc[0])
        namedx={**common_d, **{'num_total':num_total,}}
        newld.append(namedx)

    dfmr=pd.DataFrame(newld)#dfmr[dfmr.num_total>=min_num_mentions].groupby(list(strcols)).sum().reset_index()
    len2=len(dfmr)
    printm(f'* Filtered: {len2} of {len1} characters remaining because mentioned at least {min_num_mentions} times')
    # sum by new names
    dfmr=dfmr.sort_values('num_total') #[dfmr.name.str.startswith('Mrs. D')] #.tail(10)
    # print(res)
    return dfmr

live_dfm=interact(filter_metadata,min_num_mentions=min_num_mentions)
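
The merge-and-filter logic can be illustrated on a few invented rows. This is a simplified sketch using a plain dictionary instead of pandas; `merge_and_filter` is a hypothetical name, and only the mention totals are aggregated here (the notebook also carries the most common value of the other columns).

```python
from collections import defaultdict

# Invented rows in the shape produced after annotation.
rows = [
    {"name": "Mr. Darcy", "name_real": "Mr. Darcy", "num_total": 120},
    {"name": "Darcy", "name_real": "Mr. Darcy", "num_total": 45},
    {"name": "Chapter", "name_real": "", "num_total": 9},
    {"name": "Servant", "name_real": "Servant", "num_total": 2},
]


def merge_and_filter(rows, min_num_mentions=5):
    """Sum mentions per name_real, then keep names at or above the threshold."""
    totals = defaultdict(int)
    for row in rows:
        real = row["name_real"]
        if real and real[0].isalpha():  # skip blank or non-alphabetic names
            totals[real] += row["num_total"]
    return {name: n for name, n in totals.items() if n >= min_num_mentions}


print(merge_and_filter(rows))
```

Because the threshold is applied after merging, two rarely mentioned aliases can clear it together even when neither would alone.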

## 🕸 Generate social network

In [None]:
dfm = load_dfm()
dfm

In [None]:
dfmr = filter_metadata(**live_dfm.widget.kwargs)
dfmr

In [None]:
def load_alldf():
    # dfm
    dfm=load_dfm() 
    # filtered dfm
    dfmr = filter_metadata(**live_dfm.widget.kwargs)
    # tokens
    tok_df=load_tokdf(ofn_booknlp_toks=ofn_booknlp_toks)
    tok_cols = set(tok_df.columns) - set(dfm.columns)
    all_df=tok_df[['characterId']+list(tok_cols)].merge(dfm,on='characterId',how='left').fillna('')
    return all_df


In [None]:
# adf=load_alldf()
# adf[adf.characterId!=-1]

In [None]:
#@title Generate dynamic network from interactions
NET_STATS=['degree','degree_centrality','betweenness_centrality','eigenvector_centrality','closeness_centrality']
slice_length=widgets.Dropdown(options=[1,5,10,50,100,250,500,1000,2000,5000,10000,25000],value=500,description='Length')
weight_slider=widgets.IntSlider(min=1, max=10, step=1, value=2)
mindegree_slider=widgets.IntSlider(min=0, max=10, step=1, value=1)
weight_factor=widgets.IntSlider(min=1, max=100, step=1, value=5)
time_units=dict([('words','tokenId'), ('sentences','sentenceID'), ('paragraphs','paragraphId')])
time_type=widgets.Dropdown(options=list(time_units.keys()),description='Unit of time',value='words')

def make_dyn_charnet(name_key=fixed('name_real'),t_unit='words',slice_length=1000):
    roundby=slice_length
    t_key=time_units.get(t_unit,'tokenId')
    printm(f'* Divide text every {roundby} {t_unit}')

    from tqdm import tqdm
    # init
    dg = dn.DynGraph(edge_removal=False)
    all_df=load_alldf()
    all_df['slice']=all_df[t_key].apply(lambda x: x//roundby)
    
    # t=paragraph
    last_char=None
    ts=set()
    t=0
    edges=set()
    name2real=dict(zip(all_df.name, all_df.name_real))

    grps=sorted(list(all_df.groupby('slice')))
    grp_ld=[]
    edge_list=[]
    for sl,sldf in grps:
        t=sl
        chars_in_slice=[x for x in sorted(list(set(sldf[name_key]))) if x]
        #printm(f' * At t={t}, found {len(chars_in_slice)} unique characters: {", ".join(chars_in_slice)}')
        for a in chars_in_slice:
            for b in chars_in_slice:
                if b<=a: continue
                dg.add_interaction(u=a,v=b,t=t)
                edge_list.append((t,a,b))
        grp_dx={
            't':t,
            'chars':chars_in_slice,
            'num_chars':len(chars_in_slice),
        }
        grp_ld.append(grp_dx)
    grp_df=pd.DataFrame(grp_ld)

    # show stats
    ts=dg.temporal_snapshots_ids()
    chartups_dyn={tuple(sorted([u,v])+[t]) for (u, v, op, t) in dg.stream_interactions()}
    chartups_stat={tuple(sorted([u,v])) for (u, v, op, t) in dg.stream_interactions()}
    printm(f'* {len(ts)} time slices')
    printm(f'* {len(chartups_stat)} unique character-to-character edges')
    printm(f'* {len(edge_list)} total interactions')
    display(grp_df)
    # display(edge_list[:10])
    return dg,grp_df,edge_list

# show opts
# make_dyn_charnet loads its own token data and takes no all_df argument
charnet_dynamic_i=interactive(
    make_dyn_charnet,
    t_unit=time_type,
    slice_length=slice_length
)
charnet_dynamic_i
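
The core idea of `make_dyn_charnet` is that two characters interact whenever they appear in the same slice of text. A minimal sketch of that rule, assuming a toy token list and a hypothetical helper `cooccurrence_edges` (no `dynetx` involved):

```python
from itertools import combinations

# (token position, character name or None); invented data.
tokens = [
    (0, "Ann"), (1, None), (2, "Bob"), (3, None),
    (4, "Ann"), (5, "Cy"), (6, None), (7, "Bob"),
]


def cooccurrence_edges(tokens, slice_length):
    """Return (slice, a, b) for every pair of characters sharing a slice."""
    slices = {}
    for pos, name in tokens:
        if name:
            # integer division assigns each token to a fixed-width slice
            slices.setdefault(pos // slice_length, set()).add(name)
    edges = []
    for t in sorted(slices):
        # sorted names + combinations yields each unordered pair exactly once
        for a, b in combinations(sorted(slices[t]), 2):
            edges.append((t, a, b))
    return edges


print(cooccurrence_edges(tokens, slice_length=4))
```

Shorter slices produce fewer, more local edges; longer slices link characters who are merely mentioned in the same stretch of text, so the slice length is worth varying.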

In [None]:
#@title Display and fine-tune network
# Convert to static


def to_static(edge_list,t_start=None,t_end=None,min_weight=2,name_key='name_real',stats=NET_STATS,min_degree=1):
    # printm(f'#### Generating static network with minimum weight set to {min_weight}')
    import networkx as nx
    g=nx.Graph()

    num_interactions_d=Counter()
    for t,u,v in sorted(edge_list):
        if t_start and t<t_start: continue
        if t_end and t>t_end: continue
        num_interactions_d[u]+=1
        num_interactions_d[v]+=1
        if not g.has_edge(u,v):
            g.add_edge(u,v,t=[t],weight=1)
        else:
            g[u][v]['weight']+=1
            g[u][v]['t']+=[t]

    if min_weight:
        for a,b,d in list(g.edges(data=True)):
            if d['weight']<min_weight:
                g.remove_edge(a,b)

    # add metadata
    for n in g.nodes():
        # print(n,'??')
        ndf=dfm[dfm[name_key]==n]
        ncdf=ndf.mode()
        try:
            common_d=dict(ncdf.iloc[0])
        except IndexError:
            continue
        for k,v in common_d.items(): g.nodes[n][k]=v
        g.nodes[n]['num_interactions']=num_interactions_d[n]
        g.nodes[n]['num_mentions']=ndf['num_total'].sum() #.iloc[0]['num_total']

    # include node stats
    for stat in stats:
        try:
            func=getattr(nx,stat)
            for n,v in dict(func(g)).items():
                g.nodes[n][stat]=v
        except Exception:
            # some measures (e.g. eigenvector centrality) can fail to converge; skip them
            pass

    # return graph
    if min_degree:
        for n in list(g.nodes()):
            if g.nodes[n]['degree']<min_degree:
                g.remove_node(n)

    return g


#@title
def layout(g):
    from fa2 import ForceAtlas2
    forceatlas2 = ForceAtlas2(
        # Behavior alternatives
        outboundAttractionDistribution=True,  # Dissuade hubs
        linLogMode=False,  # NOT IMPLEMENTED
        adjustSizes=False,  # Prevent overlap (NOT IMPLEMENTED)
        edgeWeightInfluence=1.0,

        # Performance
        jitterTolerance=1.0,  # Tolerance
        barnesHutOptimize=True,
        barnesHutTheta=1.2,
        multiThreaded=False,  # NOT IMPLEMENTED

        # Tuning
        scalingRatio=2.0,
        strongGravityMode=False,
        gravity=1.0,

        # Log
        verbose=False
    )
    
    pos = forceatlas2.forceatlas2_networkx_layout(g, pos=None, iterations=2000)
    return pos


#@title Drawing static networks
def drawnet_nx(g,ofn='net.png',pos=None,weight_factor=1,size_by='degree',size_factor=1000,save=True,title=None,color_by=None,default_color='gray',color_start='red',color_end='blue',default_size=300):
    from matplotlib import pyplot as plt
    fig = plt.figure(figsize=(10,10))#,facecolor=(0, 0, 0))
    if title: fig.suptitle(title, fontsize=16)
    nodelist=g.nodes()
    labels=dict((n,n) for n in nodelist)
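    try:
        x=np.array([g.nodes[n].get(size_by,np.nan) for n in nodelist],dtype=float)
        lo,hi=np.nanmin(x),np.nanmax(x)
        # min-max scale to [0,1]; if every value is equal, use the midpoint
        normalized=(x-lo)/(hi-lo) if hi>lo else np.where(np.isnan(x),np.nan,0.5)
        # `is not np.nan` never matches array elements, so test with np.isnan
        node_size=[v*size_factor if not np.isnan(v) else default_size for v in normalized]
    except Exception:
        node_size=default_size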
    node_color=[]

    edgelist=list(g.edges())
    try:
        w=np.array([g[a][b]['weight'] for a,b in edgelist],dtype=float)
        lo,hi=w.min(),w.max()
        # min-max scale edge weights; if all weights are equal, draw every edge at full width
        edge_normalized=(w-lo)/(hi-lo) if hi>lo else np.ones(len(w))
        edge_size=[v*weight_factor for v in edge_normalized]
    except Exception:
        edge_size=1


    if color_by:
        color_types=sorted(list(set(g.nodes[n][color_by] for n in g.nodes())))
        num_colors=len(color_types)
        spectrum=list(Color(color_start).range_to(Color(color_end),num_colors))
        colormap=dict(zip(color_types, spectrum))
        node_color=[colormap[g.nodes[n][color_by]].hex for n in g.nodes()]
    else:
        node_color=default_color

    nx.draw_networkx(
        g,
        pos=pos,
        labels=labels,
        nodelist=nodelist,
        node_size=node_size,
        edgelist=edgelist,
        width=edge_size,
        font_color='black',
        font_weight='bold',
        font_size=12,
        node_color=node_color,
        edge_color='teal'
    )
    if save:
        plt.savefig(ofn)
        plt.close()
    else:
        return plt




# Generate
# printm('### Set minimum weight')
# display(weight_slider)
min_degree_slider = widgets.IntSlider(min=1,max=10,step=1,value=2,description='Minimum degree')

# @interact
def make_static(min_weight=weight_slider,min_degree=min_degree_slider):
    # get current data 
    charnet_dynamic,charnet_dynamic_df,charnet_dynamic_edgelist=charnet_dynamic_i.result
    dg=charnet_dynamic

    charnet_static=g=to_static(charnet_dynamic_edgelist,min_weight=min_weight,min_degree=min_degree)
    printm(f'Graph generated with {g.order()} nodes and {g.size()} edges')
    charnet_static_df=pd.DataFrame(dict(charnet_static.nodes[n]) for n in charnet_static.nodes())
    charnet_static_df_edges=pd.DataFrame({'source':a, 'target':b, **d} for a,b,d in charnet_static.edges(data=True))    
    
    # printm('### Edge data')
    # display(charnet_static_df_edges.sort_values('weight',ascending=False))

    # printm('### Node data')
    # display(charnet_static_df.sort_values('num_total',ascending=False))
    # printm('### Graph preview')
    pos=layout(g)
    title=f'{NOVEL_TITLE} (w>={min_weight})'
    try:
        drawnet_nx(
            g,
            save=False,
            pos=pos,
            title=title,
            weight_factor=10
        )
    except ValueError:
        pass
    return charnet_static, charnet_static_df, charnet_static_df_edges

# charnet_static_i=interactive(make_static,min_weight=weight_slider)
# charnet_static_i


#@title Fiddle with settings
widg_sizeby=widgets.Dropdown(options=['num_total']+sorted(list(NET_STATS)))
# charnet_static,charnet_static_df=charnet_static_i.result
# node_features=set(charnet_static_df.select_dtypes('number').columns) - {'id'}
widg_pos_method = widgets.Dropdown(options=[('Layout from final state','final'), ('Layout evolves','evolve')],description='Layout method')

def show_graph(t_start=time_slider1,
               t_end=time_slider2,
               min_weight=weight_slider,
               color_by=widg_colorby,
               size_by=widg_sizeby,
               weight_factor=weight_factor,
               min_degree=min_degree_slider,
               save=fixed(False),
               ofn=fixed(None),
               pos=fixed(None),
               title=fixed(None),
               pos_method=widg_pos_method):
    if not title: title=f'Character network for {NOVEL_TITLE} (t={t_start}-{t_end}) (w>={min_weight})'
    try:
        # load dynamic network
        charnet_dynamic,charnet_dynamic_df,charnet_dynamic_edgelist=charnet_dynamic_i.result
    
        # get network up to this point
        g_sofar=to_static(
            charnet_dynamic_edgelist,
            min_weight=min_weight,
            min_degree=min_degree,
            t_start=t_start,
            t_end=t_end
        )

        # positions set?
        if pos is None:
            # set by end stage?
            if pos_method=='final':
                # get total static
                g=to_static(
                    charnet_dynamic_edgelist,
                    min_weight=min_weight,
                    min_degree=min_degree,
                )
                # set positions from total static
                pos=layout(g)
            else:
                pos=layout(g_sofar)
        
        # draw figure
        drawnet_nx(
            g_sofar,
            save=save,
            ofn=ofn,
            pos=pos,
            size_by=size_by,
            color_by=color_by,
            weight_factor=weight_factor,
            title=title
        )
    except Exception as e:
        #printm(f'**!! Error: {e}**')
        pass


graph_configurator = interactive(show_graph)
graph_configurator
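
The collapsing step in `to_static` turns timed interactions into weighted edges and then drops weak ones. A minimal sketch of that idea, using a plain `Counter` instead of a `networkx` graph; the edge list is invented and `collapse` is a hypothetical name:

```python
from collections import Counter

# (time slice, character a, character b); invented interactions.
edge_list = [
    (0, "Ann", "Bob"),
    (1, "Ann", "Bob"),
    (1, "Ann", "Cy"),
    (1, "Bob", "Cy"),
    (2, "Ann", "Bob"),
]


def collapse(edge_list, min_weight=2, t_start=None, t_end=None):
    """Count interactions per pair within [t_start, t_end], keeping pairs with enough weight."""
    weights = Counter()
    for t, u, v in edge_list:
        if t_start is not None and t < t_start:
            continue
        if t_end is not None and t > t_end:
            continue
        weights[tuple(sorted((u, v)))] += 1  # undirected: (a, b) and (b, a) are one edge
    return {pair: w for pair, w in weights.items() if w >= min_weight}


print(collapse(edge_list))
print(collapse(edge_list, min_weight=1, t_end=1))
```

Restricting `t_end` is what lets the animation frames show the network "so far" at each time slice.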

## 🎥 Generating dynamic network visualizations

In [None]:
#@title Generate underlying images
def drawnets(odir=ofn_fig_dir,pos=None):
    charnet_dynamic,charnet_dynamic_df,charnet_dynamic_edgelist=charnet_dynamic_i.result
    ts=charnet_dynamic.temporal_snapshots_ids()
    if not os.path.exists(odir): os.makedirs(odir)
    
    # get pos
    if pos is None:
        g=to_static(
            charnet_dynamic_edgelist,
            min_weight=weight_slider.value,
            min_degree=min_degree_slider.value,
        )
        pos=layout(g)
    
    for t in tqdm(ts):
        ofn_img=os.path.join(odir,f'net-{str(t).zfill(4)}.png')
        title=f'{NOVEL_TITLE_NICE} (t={str(t).zfill(4)})'
        try:
            show_graph(
                t_start=0,
                t_end=t,
                save=True,
                ofn=ofn_img,
                title=title,
                min_weight=weight_slider.value,
                color_by=widg_colorby.value,
                size_by=widg_sizeby.value,
                weight_factor=weight_factor.value,
                min_degree=min_degree_slider.value,
                pos=pos
            )
        except IndexError:
            pass

def do_drawnets(*x,**y):
    drawnets()

if not os.path.exists(ofn_fig_dir) or not os.listdir(ofn_fig_dir):
    res=do_drawnets()

button2=widgets.Button(description='Regenerate images')
button2.on_click(do_drawnets)
button2

In [None]:
#@title Generate mp4 video from images
def make_vid_from_folder(*x,image_folder=ofn_fig_dir,ofn=ofn_mp4,fps=15):
    import moviepy.video.io.ImageSequenceClip

    image_files = [os.path.join(image_folder,img) for img in sorted(os.listdir(image_folder)) if img.endswith(".png")]
    clip = moviepy.video.io.ImageSequenceClip.ImageSequenceClip(image_files, fps=fps)
    clip.write_videofile(ofn)

if not os.path.exists(ofn_mp4): make_vid_from_folder()
button3=widgets.Button(description='Regenerate Video')
button3.on_click(make_vid_from_folder)
# display(button3)

# show video
from IPython.display import HTML
from IPython.display import display
from base64 import b64encode
with open(ofn_mp4,'rb') as f: mp4 = f.read()
display(button3)

In [None]:
#@title Show video
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
display(HTML("""
<video width=666 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url))

In [None]:
#@title Generate gif from images
def make_gif_from_folder(*x,folder=ofn_fig_dir,ofn=ofn_gif):
    # *x absorbs the button argument that on_click passes in
    import imageio
    images = []
    for fn in sorted(os.listdir(folder)):
        if fn.endswith('.png'):
            with open(os.path.join(folder,fn),'rb') as f:
                images.append(imageio.imread(f))
    imageio.mimsave(ofn, images)


if not os.path.exists(ofn_gif): make_gif_from_folder()

button2=widgets.Button(description='Regenerate GIF')
button2.on_click(make_gif_from_folder)

# show gif?
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# import the names directly: `from IPython import display` would rebind
# `display` to the module and break later display(...) calls such as printm()
from IPython.display import Image, display
from pathlib import Path
gifPath = Path(ofn_gif)
if os.path.exists(gifPath):
    # Display GIF in Jupyter, CoLab, IPython
    with open(gifPath,'rb') as f:
        display(Image(data=f.read(), format='png'))

button2

## 📐 Analyze results

In [None]:
#@title Distribution of character attention
# create
dfm=filter_metadata(min_num_mentions=min_num_mentions.value)

all_df=dfm.merge(tok_df,on='characterId',how='right').fillna('')
color_by=list(dfm.columns)[list(dfm.columns).index('gender'):]+['none']
dfm['none']='none'
# res=char_df.num.plot(kind='density',width=666)
widg_colorby=widgets.Dropdown(options=color_by,description='Group by')
num_top_chars=widgets.IntSlider(min=5,max=len(dfm)+5,step=5,value=25,description='# Top Chars')

@interact
def showtopn(num_top=num_top_chars):
    n=num_top
    printm(f'#### Median number of mentions per character (all): {int(dfm.num_total.median())}')
    return px.bar(
        dfm.sort_values('num_total',ascending=True).iloc[-n:],
        y="name_real",
        x="num_total",
        hover_data=['name_real','num_total','gender'],
        title=f'Distribution of mentions over top {n} characters',
        text='num_total',
        width=666,
        height=666/25 * n,
        orientation='h'
    )    

In [None]:
#@title Distribution of attention by group
@interact
def show_histogram(group_by=widg_colorby):
    

    printm(f'### Distribution by {group_by}')
    #if group_by!='none':
    xvals=[]
    for cat,catdf in sorted(dfm.groupby(group_by),key=lambda x: -len(x[1])):
        totalperc=round(sum(catdf.num_total)/sum(dfm.num_total)*100,1)
        printm(f'#### {len(catdf)} {cat} characters make up {totalperc}% of all mentions')
        # printm(f'* Median number of mentions per character ({cat}): {int(catdf.num_total.median())}')
        stats=[f'{row.name_real} ({row.num_total})' for i,row in catdf.sort_values('num_total',ascending=False).iterrows()]
        printm(f'''* {', '.join(stats)} [median={int(catdf.num_total.median())}]''')
        xvals+=[cat]
    
    # display(px.histogram(dfm, x='num_total', color=group,marginal='rug')

    fig=px.box(
        dfm.sort_values(group_by),
        y="num_total",
        range_y=(.9,max(dfm.num_total)+10),
        log_y=True,
        x=group_by,
        width=666,
        color=group_by,
        points="all",
        hover_data=[x for x in ['name_real','num_total'] + color_by if x!='none'],
        title=f'Number of mentions for characters by {group_by}'
    )

    return fig

In [None]:
#@title Show character density across length of text
# Other token stats
yopts=widgets.Dropdown(options=[('# of mentions','num_mentions'),('# of unique','num_chars')], description='Y value')
charnet_dynamic,charnet_dynamic_df,charnet_dynamic_edgelist=charnet_dynamic_i.result
ts=charnet_dynamic.temporal_snapshots_ids()
time_slider1=widgets.IntSlider(min=ts[0], max=ts[-1], step=10, value=ts[0])
time_slider2=widgets.IntSlider(min=ts[0], max=ts[-1], step=10, value=ts[-1])

@interact
def show_density(slice_length=slice_length,color_by=widg_colorby,y_value=yopts):
    num_words_in_preview=30
    slice2txt=defaultdict(list)
    all_df['slice']=all_df.tokenId.apply(lambda x: x//slice_length*slice_length)
    all_df['none']='none'
    slice_ld=[]
    for sl,sldf in all_df.groupby('slice'):
        slice_dx={'slice':sl}
        # get preview
        slice_dx['preview']=[]
        for i,row in sldf.iterrows():
            if not str(row['originalWord']).strip(): continue
            if len(slice_dx['preview'])>num_words_in_preview:break
            slice_dx['preview']+=[str(row['originalWord']).strip()+str(' ' if row['whitespaceAfter']=='S' else '')]
        slice_dx['preview']=''.join(slice_dx['preview'])+'...'
        # count by category
        num_words=len(sldf)
        for cat,catdf in sldf.groupby(color_by):
            num_mentions=sum(catdf['isChar'])
            num_chars=len(set(catdf['characterId']))
            slice_cat_dx=dict(**slice_dx, **{'color_by':cat, 'num_mentions':num_mentions, 'num_chars':num_chars})
            slice_ld.append(slice_cat_dx)

    slicedf=pd.DataFrame(slice_ld)
    printm(f'Median number of unique characters = **{slicedf.num_chars.median()}** per {slice_length} words')
    printm(f'Median number of character mentions = **{slicedf.num_mentions.median()}** per {slice_length} words')


    import plotly.express as px
    return px.line(slicedf,x='slice',y=y_value,color='color_by',hover_data=['preview'],
            title=f'{y_value} per {slice_length} words across {NOVEL_TITLE}',
            height=444,
            line_shape='hv')
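
The slicing step in `show_density` floors each token id to the start of its slice (`tokenId // slice_length * slice_length`) and then counts mentions and unique characters per slice. The sketch below shows the same idea on toy data using only the standard library. The `(token_id, character_id)` input format, the use of `None` for non-character tokens, and the names `bin_mentions` and `demo_tokens` are assumptions made for illustration, not part of the notebook.

```python
def bin_mentions(tokens, slice_length):
    """Count character mentions and unique characters per slice.

    tokens: iterable of (token_id, character_id) pairs; character_id is
    None for tokens that are not character mentions (assumed format).
    """
    counts = {}
    for token_id, character_id in tokens:
        # Floor the token id to the first token id of its slice
        sl = token_id // slice_length * slice_length
        entry = counts.setdefault(sl, {'num_mentions': 0, 'chars': set()})
        if character_id is not None:
            entry['num_mentions'] += 1
            entry['chars'].add(character_id)
    # Replace each set of characters with its size
    return {
        sl: {'num_mentions': e['num_mentions'], 'num_chars': len(e['chars'])}
        for sl, e in sorted(counts.items())
    }

# Toy data: seven tokens, four of which mention character 'A' or 'B'
demo_tokens = [(0, None), (3, 'A'), (7, 'B'), (9, 'A'),
               (12, None), (15, 'B'), (21, None)]
for sl, stats in bin_mentions(demo_tokens, slice_length=10).items():
    print(sl, stats)
```

With a slice length of 10, token ids 0 to 9 fall in slice 0, ids 10 to 19 in slice 10, and so on, so a slice is always labelled by its first token id.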

In [None]:
#@title Show syntactic statistics
#@todo ...
printm('Todo')

## Download data

In [None]:
#@title Zip data
PATH_ZIP=os.path.abspath(os.path.join(PATH_NOVEL,'..',NOVEL_TITLE+'.zip'))
# quote the paths so that titles containing spaces survive the shell
cmd=f'cd "{PATH_NOVEL}/.." && zip -q -r9 "{PATH_ZIP}" "{NOVEL_TITLE}"'
!{cmd}
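
A shell-free alternative, sketched here under assumptions, is `shutil.make_archive` from the standard library, which sidesteps shell quoting altogether. The helper name `zip_folder` and the throwaway demo directory are hypothetical. In the notebook the arguments would presumably be the parent of `PATH_NOVEL` and `NOVEL_TITLE`.

```python
import os
import shutil
import tempfile
import zipfile

def zip_folder(parent_dir, folder_name):
    """Zip parent_dir/folder_name into parent_dir/folder_name.zip and return the zip path."""
    # make_archive appends the '.zip' extension to base_name itself
    base_name = os.path.join(parent_dir, folder_name)
    return shutil.make_archive(base_name, 'zip', root_dir=parent_dir, base_dir=folder_name)

# Demo on a temporary directory whose folder name contains a space
demo_parent = tempfile.mkdtemp()
demo_folder = 'My Novel'
os.makedirs(os.path.join(demo_parent, demo_folder))
with open(os.path.join(demo_parent, demo_folder, 'text.txt'), 'w') as f:
    f.write('Call me Ishmael.')

demo_zip = zip_folder(demo_parent, demo_folder)
with zipfile.ZipFile(demo_zip) as zf:
    demo_names = zf.namelist()
print(demo_zip)
print(demo_names)
```

As with the `zip` command above, the archive stores paths relative to the parent directory, so unzipping recreates the title folder.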

In [None]:
#@title Download zip file
def dlzip(x): 
    from google.colab import files
    files.download(PATH_ZIP)
dlsize=os.path.getsize(PATH_ZIP)
def human_size(bytes, units=[' bytes','KB','MB','GB','TB', 'PB', 'EB']):
    """ Returns a human readable string representation of bytes """
    return str(bytes) + units[0] if bytes < 1024 else human_size(bytes>>10, units[1:])
dlbutton=widgets.Button(description=f'Download zip ({human_size(dlsize)})')
dlbutton.on_click(dlzip)
dlbutton

## Bibliography

### Digital character networks


* Sam Alexander, "[Social Network Analysis and the Scale of Modernist Fiction](https://doi.org/10.26597/mod.0086)", *Modernism/Modernity* 3.4 (2019)

* David Bamman, Ted Underwood and Noah Smith, "A Bayesian Mixed Effects Model of Literary Character," *ACL* 2014

* David Elson, Nicholas Dames, Kathleen McKeown, "[Extracting Social Networks from Literary Fiction](https://www.aclweb.org/anthology/P10-1015/)", *Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics* (2010)

* Vincent Labatut and Xavier Bost, "[Extraction and Analysis of Fictional Character Networks: A Survey](https://doi.org/10.1145/3344548)", *ACM Computing Surveys* 52.5 (2019), article 89

* Graham Sack, "[Character Networks for Narrative Generation: Structural Balance Theory and the Emergence of Proto-Narratives](http://www.panstanford.com/books/9789814463263.html)", in *Complexity and the Human Experience: Modeling Complexity in the Humanities and Social Sciences* (ed. Paul A. Youngman and Mirsad Hadzikadic, Singapore: Pan Stanford Publishing, 2014)

* Graham Sack, "[Character networks for narrative generation](http://www.aaai.org/ocs/index.php/AIIDE/AIIDE12/paper/view/5550)," in *Proceedings of the 8th Artificial Intelligence and Interactive Digital Entertainment Conference - Intelligent Narrative Technologies Workshop* (2012), 38-43.

### Character network theory

* Caroline Levine, *Forms: Whole, Rhythm, Hierarchy, Network* (Princeton, NJ: Princeton UP, 2015), 112ff

* Anna Gibson, "*Our Mutual Friend* and Network Form", *Novel: A Forum on Fiction* 48.1 (2015)

* Franco Moretti, "Network Theory, Plot Analysis", *New Left Review* 68 (2011)

* Alex Woloch, *The One vs. the Many: Minor Characters and the Space of the Protagonist in the Novel* (Princeton, NJ: Princeton UP, 2004)