# Text Analysis Computations
This notebook computes tf-idf values and wordcloud strings from each character's word list, for both Wowpedia pages and Wowhead comments.

In [1]:
import os
import nltk
import pickle
from glob import glob

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import community

import networkx as nx

import config
from text_helpers import init_collection, populate_collection, \
    get_paths

In [2]:
# Download and import "book"
nltk.download('book', quiet=True)
from nltk import book

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


# Load in necessary data

In [3]:
# get names of all characters that have wowhead comments (one .njson file each)
chars_with_comments = [
    os.path.splitext(os.path.basename(path))[0]  # portable: works with '/' and '\\'
    for path in glob('./data/char_comments/*.njson')
]

# read in character DataFrame
df = pd.read_csv(config.PATH_RES + 'df_chars.csv')

# remove characters that don't have comments on wowhead
df = df[df['Name'].isin(chars_with_comments)]

df.head()

Unnamed: 0,Name,Gender,Race,Faction,Status
0,A'dal,Unknown,Naaru,Neutral,Alive
2,Aegwynn,Female,Human,Neutral,Deceased
3,Aessina,Female,Wisp,Neutral,Unknown
5,Agamaggan,Male,Boar,Neutral,Deceased
6,Agatha,Female,Val'Kyr,Neutral,Deceased


In [4]:
# load graph
Gcc = nx.read_gexf(config.PATH_RES + 'Gcc_wow.gexf').to_undirected()

# remove nodes from the graph that don't have comments on wowhead
for node in list(Gcc.nodes()):
    if node.replace(' ', '_') not in chars_with_comments:
        Gcc.remove_node(node)

print(f'Number of\nNodes: {Gcc.number_of_nodes()}\nEdges: {Gcc.number_of_edges()}')

Number of
Nodes: 239
Edges: 2410


# Get Communities
Create or load community partitions.

In [5]:
# create communities if not done already, otherwise load
filename = config.PATH_RES + 'Communities.json'
if not os.path.isfile(filename):
    print('Creating new community partition.')
    partition = community.best_partition(Gcc)
    communities = []
    for p in set(partition.values()):
        names = [n for n in partition if partition[n] == p]
        communities.append(names)
    # note: the file is a pickle despite its .json extension
    with open(filename, 'wb') as f:
        pickle.dump(communities, f)
    print(f'Saved as pickle {filename}')
else:
    print('Loading existing community partition.')
    print(f'from pickle {filename}')
    with open(filename, 'rb') as f:
        communities = pickle.load(f)

# get top chars in each community
degs = list(Gcc.degree())
com_names = []
for com in communities:
    com_sorted = sorted([(n, v) for n, v in degs if n in com], key=lambda x: x[1], reverse=True)
    top_names = [n for n, _ in com_sorted[:3]]
    com_name = ', '.join(top_names)
    com_names.append(com_name)

Loading existing community partition.
from pickle ./store/Communities.json
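
The grouping and naming steps in the cell above can also be written as a single pass over the partition. The following is only a sketch: it assumes the partition is a dict mapping node to community id (which is what `community.best_partition` returns), and the helper names `partition_to_communities` and `name_community`, as well as the toy nodes and degrees, are hypothetical.

```python
from collections import defaultdict


def partition_to_communities(partition):
    """Group nodes by community id; returns a list of node lists."""
    groups = defaultdict(list)
    for node, label in partition.items():
        groups[label].append(node)
    # sort by community id so the order is reproducible
    return [groups[label] for label in sorted(groups)]


def name_community(members, degree, k=3):
    """Name a community after its k highest-degree members."""
    top = sorted(members, key=lambda n: degree[n], reverse=True)[:k]
    return ', '.join(top)


# hypothetical toy data, not taken from the real graph
partition = {'Thrall': 0, 'Jaina': 1, 'Orgrim': 0, 'Anduin': 1, 'Garrosh': 1}
degree = {'Thrall': 5, 'Orgrim': 3, 'Jaina': 4, 'Anduin': 6, 'Garrosh': 2}

toy_communities = partition_to_communities(partition)
toy_names = [name_community(c, degree, k=2) for c in toy_communities]
print(toy_names)  # ['Thrall, Orgrim', 'Anduin, Jaina']
```

With the graph loaded earlier, `degree` could be built as `dict(Gcc.degree())`.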


In [6]:
print('List of top 3 characters for each community based on node degree')
for i, com in enumerate(com_names):
    print(f'\t{i+1}. {com}')

List of top 3 characters for each community based on node degree
	1. Khadgar, Illidan Stormrage, Velen
	2. Deathwing, Sargeras, Yogg-Saron
	3. Sylvanas Windrunner, Lich King, Varian Wrynn
	4. Malfurion Stormrage, Tyrande Whisperwind, Alexstrasza
	5. Thrall, Ner'zhul, Orgrim Doomhammer
	6. Anzu, Terokk, Talon King Ikiss
	7. Jaina Proudmoore, Anduin Wrynn, Garrosh Hellscream


# Prepare Corpus etc.

In [32]:
def get_filelist(folder):
    """Return paths to character files in chars_with_comments from specified folder"""
    return [folder + n + '.txt' for n in chars_with_comments]


# word and clean files for wowpedia character pages
c_words_wiki = nltk.corpus.PlaintextCorpusReader('', get_filelist(config.PATH_WORDS))
t_words_wiki = nltk.Text(c_words_wiki.words())
c_clean_wiki = nltk.corpus.PlaintextCorpusReader('', get_filelist(config.PATH_CLEAN))

# word files for wowhead user comments
c_words_comments = nltk.corpus.PlaintextCorpusReader('', get_filelist(config.PATH_COMMENTS_WORDS))
t_words_comments = nltk.Text(c_words_comments.words())

In [7]:
# define what to look into
attr_lookup = {
    'Gender': ['Male', 'Female'],
    'Faction': ['Alliance', 'Horde'],
    'Status': ['Alive', 'Deceased']
}

# Create Collections
For both the Wowpedia page and the Wowhead comment word lists, calculate tf-idf values for each word and create wordcloud strings for the splits defined in `attr_lookup`. The results are saved to files with a `.json` extension, which are read back with `pickle` further down.
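
The helpers `init_collection` and `populate_collection` come from `text_helpers` and are not shown in this notebook. As a rough illustration of the kind of computation involved, here is a minimal sketch that assumes the textbook definitions tf = count / total tokens and idf = log(N / document frequency), with each split treated as one document. The function name `tfidf_per_split` and the toy tokens are hypothetical, and the real helper may weight terms differently.

```python
import math
from collections import Counter


def tfidf_per_split(docs):
    """Compute tf-idf for every word in every split.

    docs: dict mapping split name -> list of tokens.
    Returns a dict mapping split name -> {word: tf-idf}.
    Assumes tf = count / len(tokens) and idf = log(N / df); these are
    illustrative choices, not necessarily what text_helpers uses.
    """
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # document frequency: in how many splits each word occurs
    doc_freq = Counter(word for c in counts.values() for word in c)
    result = {}
    for name, c in counts.items():
        total = sum(c.values())
        result[name] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        }
    return result


# hypothetical toy data, not taken from the real corpus
toy = {
    'Alliance': ['naaru', 'king', 'king', 'war'],
    'Horde': ['loa', 'troll', 'war', 'war'],
}
scores = tfidf_per_split(toy)
```

Under these definitions a word that occurs in every split, such as `war` in the toy data, gets an idf of zero and therefore a tf-idf of zero.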

In [8]:
# create collections for attributes for both wowpedia pages and wowhead comments
for source, corpus, path_words in [
    ('wowpedia/', c_words_wiki, config.PATH_WORDS), 
    ('wowhead/', c_words_comments, config.PATH_COMMENTS_WORDS)
]:
    for attr in attr_lookup:
        # skip if the collection has already been created
        save_path = config.PATH_RES + source + attr + '_dict.json'
        if os.path.isfile(save_path):
            print(f'\nSkipping {attr} for {source} since it is already done.')
            continue
        else:
            print(f'\nDoing {attr} for {source}')
        
        # create collection and save it
        col = init_collection(df, attr, path_words, corpus)
        _ = populate_collection(col, save_path)


Skipping Gender for wowpedia/ since it is already done.

Skipping Faction for wowpedia/ since it is already done.

Skipping Status for wowpedia/ since it is already done.

Skipping Gender for wowhead/ since it is already done.

Skipping Faction for wowhead/ since it is already done.

Skipping Status for wowhead/ since it is already done.


In [10]:
# create collections for communities for both wowpedia pages and wowhead comments
for source, corpus, path_words in [
    ('wowpedia/', c_words_wiki, config.PATH_WORDS), 
    ('wowhead/', c_words_comments, config.PATH_COMMENTS_WORDS)
]:  
    col = {}
    save_path = config.PATH_RES + source + 'Louvain_dict.json'
    if os.path.isfile(save_path):
        print(f"Skipping collections for {source} since it is already done.")
        continue
    print(f'Computing collections for communities for {source}')

    for i, names in enumerate(communities): 
        paths = [
            path_words + n.replace(' ', '_') + '.txt' 
            for n in names
        ]
        # save text for community
        col[i] = {'text': nltk.Text(corpus.words(paths))}
    
    # create collection and save it
    col = populate_collection(col, save_path)

Skipping collections for wowpedia/ since it is already done.
Skipping collections for wowhead/ since it is already done.


# Top Words
Inspect top 5 words according to tf-idf for each attribute split and for the different communities by Louvain.
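
Picking the top words amounts to sorting word indices by score in descending order and keeping the first few. A minimal NumPy-free sketch of that ranking step follows; the function name `top_k` and the toy words and scores are hypothetical and are not values from the saved collections.

```python
def top_k(words, scores, k=5):
    """Return the k words with the highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [words[i] for i in order[:k]]


# hypothetical toy data
toy_words = ['the', 'demon', 'father', 'human']
toy_scores = [0.0, 0.9, 0.4, 0.7]
print(', '.join(top_k(toy_words, toy_scores, k=3)))  # demon, human, father
```

The cells below do the same thing with `np.argsort(...)[::-1]`, which requires `words` to be a NumPy array so that it can be indexed by the sorted index array.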

In [41]:
# display top words for attributes
for source in ['wowpedia/', 'wowhead/']:
    print(f"\n\nFor {source}")
    for attr in attr_lookup:
        print(f'\nTop 5 for attribute {attr}')
        # the *_dict.json files are pickles, so open them in binary mode
        with open(config.PATH_RES + source + attr + '_dict.json', 'rb') as f:
            col = pickle.load(f)
        for split in attr_lookup[attr]:
            tfidfs = col[split]['tfidf']
            idx = np.argsort(tfidfs)[::-1]
            top_5 = ', '.join(col[split]['words'][idx][:5])
            print(f'\t{split}: {top_5}')
            # print(f'\t{tfidfs[idx][:5]}')



For wowpedia/

Top 5 for attribute Gender
	Male: demon, human, father, jaina, dreadlords
	Female: mother, walker, musha, lady, demon

Top 5 for attribute Faction
	Alliance: alleria, naaru, genn, koltira, eredar
	Horde: bwonsamdi, darkspear, tyrathan, cairne, loa

Top 5 for attribute Status
	Alive: dragon, could, alliance, force, adventurer
	Deceased: dragon, could, first, god, force


For wowhead/

Top 5 for attribute Gender
	Male: razorgore, amalgamation, molten, spine, scion
	Female: whelp, yula, lift, ony, tail

Top 5 for attribute Faction
	Alliance: koltira, skybreaker, lurid, naaru, dreanei
	Horde: ya, troll, clan, orgrim, da

Top 5 for attribute Status
	Alive: get, pet, kill, fight, damage
	Deceased: get, kill, phase, razorgore, add


In [12]:
# display top words per community
for source in ['wowpedia/', 'wowhead/']:
    print(f"\n\nFor {source}")
    print('Top 5 words for each community')
    # Louvain_dict.json is a pickle, so open it in binary mode
    with open(config.PATH_RES + source + 'Louvain_dict.json', 'rb') as f:
        col = pickle.load(f)
    for i, com_name in enumerate(com_names):
        print(f'\n"{com_name}"')
        words = col[i]['words']
        tfidf = col[i]['tf'] * col[i]['idf']
        top_5 = ', '.join(words[np.argsort(tfidf)[::-1]][:5])
        print(top_5)



For wowpedia/
Top 5 words for each community

"Khadgar, Illidan Stormrage, Velen"
rommath, lorthemar, halduron, alleria, aethas

"Deathwing, Sargeras, Yogg-Saron"
tyr, algalon, prestor, dragon, titan

"Sylvanas Windrunner, Lich King, Varian Wrynn"
darion, genn, koltira, sylvanas, muradin

"Malfurion Stormrage, Tyrande Whisperwind, Alexstrasza"
tyrande, jarod, maiev, shandris, malfurion

"Thrall, Ner'zhul, Orgrim Doomhammer"
muln, horde, orgrim, maraad, doomhammer

"Anzu, Terokk, Talon King Ikiss"
ikiss, rukhmar, sethekk, skettis, sethe

"Jaina Proudmoore, Anduin Wrynn, Garrosh Hellscream"
li, chen, garrosh, horde, baine


For wowhead/
Top 5 words for each community

"Khadgar, Illidan Stormrage, Velen"
gravity, lapse, capernian, pyroblast, phoenix

"Deathwing, Sargeras, Yogg-Saron"
amalgamation, whelp, ony, tendon, sara

"Sylvanas Windrunner, Lich King, Varian Wrynn"
darion, frostmourne, arthas, lichking, kel

"Malfurion Stormrage, Tyrande Whisperwind, Alexstrasza"
drelanim, jarod, wh

In [40]:
# get texts of male and female characters from wowpedia
text_males = nltk.Text(c_clean_wiki.words(get_paths(df, 'Gender', 'Male', 'wowpedia')))
text_females = nltk.Text(c_clean_wiki.words(get_paths(df, 'Gender', 'Female', 'wowpedia')))

# print where 'demon' occurs
print('Concordance of "demon" for Female texts:')
text_females.concordance('demon', lines=10)
print('\nConcordance of "demon" for Male texts:')
text_males.concordance('demon', lines=10)

Concordance of "demon" for Female texts:
Displaying 10 of 28 matches:
hallenges was the destruction of the demon Zmodlor who had begun to possess chi
Aegwynn rushed in and vanquished the demon before damage to the children could 
 eradicated them . Yet , as the last demon was banished from the mortal world ,
sky above Northrend . Sargeras , the demon king and lord of the Burning Legion 
command , Aegwynn could not best the demon - possessed Medivh , but as their du
 Theramore . The trio confronted the demon , but Jaina was incapacitated by the
ar worse than that little twerp of a demon when your great - grandparents were 
nued to safeguard the world from the demon king ' s minions for nearly nine hun
 Dragon Soul ( also now known as the Demon Soul ) to steal a portion of their p
advise Nekros on how best to use the Demon Soul to control the red dragons . No

Concordance of "demon" for Male texts:
Displaying 10 of 176 matches:
 allied himself with the night elf / demon hybrid and aided 

In [45]:
# let's see how Horde texts mention the Alliance, and how Alliance texts mention the Horde
text_horde = nltk.Text(c_clean_wiki.words(get_paths(df, 'Faction', 'Horde', 'wowpedia')))
text_alliance = nltk.Text(c_clean_wiki.words(get_paths(df, 'Faction', 'Alliance', 'wowpedia')))

print('Concordance of "horde" for Alliance texts:')
text_alliance.concordance('horde', lines=10, width=140)
print('\nConcordance of "alliance" for Horde texts:')
text_horde.concordance('alliance', lines=10, width=140)

Concordance of "horde" for Alliance texts:
Displaying 10 of 120 matches:
While a hero of the Alliance , Khadgar is willing to work with the Horde for the greater good of Azeroth . As a member of the Council of Si
 the alternate Draenor , he led the forces of the Alliance and the Horde to shut down the Dark Portal , and later worked with them to cripp
n the Dark Portal , and later worked with them to cripple the Iron Horde in various areas of the world . He focused heavily on combating Gu
 portal , Khadgar made a plea to the Council of Six to readmit the Horde back into the Kirin Tor in order to fight the demons at full stren
magical powers , though his true intention was to buy time for the Horde to gain power . Lothar also spoke with Khadgar , telling him about
suspicious of his master ' s actions and motives . After meeting a Horde emissary , the half - orc assassin Garona , Khadgar unraveled Medi
rest of the kingdoms aligning together to fight against the coming Horde . He was also 