# Explainer Notebook for Warcraft Major Character Analysis

Resulting website: [youngpenguin.github.io/WOWenShittyWebsite/](https://youngpenguin.github.io/WOWenShittyWebsite/)

GitHub repos: 
- Analysis repo [github.com/simonamtoft/warcraft-major-character-analysis](https://github.com/simonamtoft/warcraft-major-character-analysis)
- Website repo [github.com/YoungPenguin/WOWenShittyWebsite](https://github.com/YoungPenguin/WOWenShittyWebsite)

## 1. Motivation
- What is your dataset?
- Why did you choose this/these particular dataset(s)?
- What was your goal for the end user's experience?

The dataset used in this project comes from:
- Character pages from [wowpedia.fandom.com/wiki/Wowpedia](https://wowpedia.fandom.com/wiki/Wowpedia)
    - We only inspect the major characters found on [wowpedia.fandom.com/wiki/Major_characters](https://wowpedia.fandom.com/wiki/Major_characters)
- Corresponding character pages from [www.wowhead.com/](https://www.wowhead.com/)


We chose the wowpedia data as our starting point in order to create a network of the major characters in Warcraft. We then added user comments from the corresponding character pages on wowhead, both to compare the sentiment of users' comments about the characters with the characters' own sentiment, and to support other text analysis.


The goal for the end user's experience is that the user can explore the characteristics of groups of major Warcraft characters by different attributes, e.g. by comparing wordclouds of Alliance and Horde characters, and see what other people associate with these characters (using wowhead comments).

## 2. Basic Statistics
Let's understand the dataset better
- Write about your choices in data cleaning and preprocessing
- Write a short section that discusses the dataset stats

In order to perform text and network analysis on the chosen data, a lot of data cleaning and preprocessing had to take place.

The primary tools that have been used to get the data are 
- `BeautifulSoup` and `requests` for wowhead comments (see `download_character_comments.py`)
- `urllib` for wowpedia pages (see `api.py`)

For the wowpedia pages, both the raw and clean pages have been downloaded (see `/data/wow_chars/` and `/data/wow_chars_clean/`), while for wowhead the comments were downloaded and stored along with all their metadata, and later cleaned.

For cleaning and preprocessing of the data, the primary tools that have been used are the text analysis library `nltk` and the regular expressions library `re`. <br> 
The cleaning, preprocessing and downloading from websites will be described more in-depth in the next section under Tool 1.

The data can be split into the following subdivisions:
1. Wowhead comments (~700 files, 31MB)
    1. Raw .njson files containing every comment on the wowhead character pages along with some metadata, like dates. (see `download_character_comments.py`)
        1. Note: Some characters have multiple wowhead pages, in which case we collected the comments from all of them.
    2. Processed .txt files, which consist of the comments from the raw files without any metadata (see `comments_clean.py`)
    3. Words .txt files, which are based on the processed .txt files: stopwords have been removed, the text has been tokenized and every word has been lemmatized (see `comments_to_words.py`)
2. Wowpedia pages (~1000 files, 19MB)
    1. Raw .txt files containing the entire wowpedia character page (see `download_character_pages.py`)
    2. Clean .txt files containing the clean version of the wowpedia character page (see `download_character_pages_clean.py`)
    3. Quotes .txt files, which are based on the raw .txt files and consist of all the quotes from the character's quote section on its wowpedia page (see `extract_character_quotes.py`)
    4. Words .txt files, which are based on the clean .txt files: stopwords have been removed, the text has been tokenized and every word has been lemmatized (see `pages_to_words.py`)

The resulting dataset consists of ~1700 files (totalling 50 MB).

The resulting network graph has 261 nodes and 4009 edges (see `Graph Analysis.ipynb`).

## 3. Tools, Theory and Analysis
Describe the process of theory to insight
- Talk about how you've worked with text, including regular expressions, unicode, etc.
- Describe which network science tools and data analysis strategies you've used, how those network science measures work, and why the tools you've chosen are right for the problem you're solving.
- How did you use the tools to understand your dataset?

### Tool 1: Downloading and Cleaning text

#### Wikipages

#### Wikipages quotes?



#### Wowhead comments
The goal here was to find some augmenting data from Wowhead users for the characters we extracted from the Wikipages. It became apparent that multiple NPCs exist in-game for every character (for technical reasons, etc.), and as a consequence of this, there are multiple Wowhead NPC pages for every Wikipage character.

By hopping around on Wowhead for a bit, we discovered that it has a search page for NPCs which allows filtering. Thus, we quickly set up a procedure for calling this endpoint with the filter `Has comments` set to `Yes`. Additionally, the search page seemed to fuzzily match the names from the Wikipage characters, but it was not completely resilient to discrepancies (e.g. `Thoras Trollbane` will find some pages, but `Thoras Trollbaen` will not). The search results were embedded in the page as `<div>` elements with a certain *id* attribute. This is where `beautifulsoup` came in handy: it parses HTML text into a virtual DOM structure, which allows for more resilient scraping than pure regular expressions can provide. After some DOM and text manipulation, we had extracted the links to the NPC pages for every character.

For every NPC page on Wowhead, we needed to scrape the user comments. Luckily, these seemed to be embedded in the NPC pages themselves in a certain `<script>` element. Conveniently, the comment data is already formatted as JSON and can be directly extracted and saved. We save the comments in a `.njson` format, which is simply newline-delimited JSON (i.e. every line of the file parses as its own JSON object).
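
As a minimal sketch of the newline-delimited JSON format described above: the helper names `write_njson` and `read_njson`, as well as the fields of the example records, are illustrative assumptions and are not taken from `download_character_comments.py`.

```python
import json
import os
import tempfile


def write_njson(path, records):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes control characters such as newlines,
            # so every record is guaranteed to stay on a single line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_njson(path):
    """Parse every non-empty line of the file as its own JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical comment records; the real Wowhead fields may differ.
comments = [
    {"id": 1, "body": "I say this guy becomes boss lvl.\r\n\r\nAfter experiencing the ..."},
    {"id": 2, "body": "Bring a healer."},
]

path = os.path.join(tempfile.mkdtemp(), "example.njson")
write_njson(path, comments)
restored = read_njson(path)
```

One benefit of this format is that records can be appended or read one line at a time, without loading a single large JSON array into memory.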

The entire procedure of taking Wikipage character names and outputting Wowhead comments is defined in the script `download_character_comments.py`. We also added a timeout to the script to avoid making too many requests to Wowhead and getting blacklisted.
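
The throttling idea can be sketched as a fixed pause between consecutive requests. This is an illustration only: the function name `fetch_all`, the `delay_seconds` parameter and the injected `fetch` callable are assumptions, and the actual script may limit its request rate differently.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first to keep the request rate low.
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results


# Stand-in for a real HTTP call, so the sketch runs without network access.
# With the requests library, this could be: lambda url: requests.get(url, timeout=10).text
def fake_fetch(url):
    return f"<html>{url}</html>"


# "page-1" and "page-2" are placeholders, not real Wowhead URLs.
pages = fetch_all(["page-1", "page-2"], fake_fetch, delay_seconds=0.1)
```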


The comment text data needed to be cleaned before we could use it for analysis, however. Thus, we created a few regular expressions to help clean up the comment texts:
* `re.sub(r"\[.+?\]", "", t)`
    * For instance, the comments would often include hyperlinks to other Wowhead pages, e.g. `"You need to complete [url=http://www.wowhead.com/?quest=10588]Cipher of Damnation[/url] first."`. This regular expression finds pairs of brackets `[` and `]` with 1 or more characters between them, and matches them lazily (i.e. as few chars as possible). Thus, the result of the example would be `"You need to complete Cipher of Damnation first."`

* `re.sub(r"(\\r|\\n|\\t)+", " ", t)`
    * Users also seemed to write out their comments using several lines. This means that a lot of carriage returns and newlines were present in the raw texts, e.g. `"I say this guy becomes boss lvl.\r\n\r\nAfter experiencing the ..."`. These sequences of whitespace characters can be replaced with a single space character. As such, we created a regular expression which finds all sequences of 1 or more consecutive `\r`, `\n` or `\t` sequences and replaces them with a single space character. Note that, because of the doubled backslashes, the pattern matches these sequences as literal text (a backslash followed by `r`, `n` or `t`) rather than as actual control characters.
    
* `re.findall(r"(.+?(?:[!.?]+|$))(?:\s|$)", t)`
    * Depending on how the text is to be analyzed, it can be useful to split a text blob into its individual sentences. The above expression lazily finds all sequences of characters that are followed by 1 or more sentence delimiter characters (i.e. `.`, `!` or `?`) or the end of the sequence, which are subsequently followed by either a whitespace character or the end of the sequence. It seems to work better for our use case than `sent_tokenize` from the `nltk` package.
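
The three expressions above can be tried out on small inputs as follows. The wrapper function names are illustrative assumptions; only the patterns themselves come from the text above.

```python
import re


def strip_bracket_tags(t):
    # Remove bracketed markup such as [url=...] and [/url]; the text between the tags is kept.
    return re.sub(r"\[.+?\]", "", t)


def collapse_escaped_whitespace(t):
    # Replace runs of the literal sequences \r, \n and \t (backslash + letter) with one space.
    return re.sub(r"(\\r|\\n|\\t)+", " ", t)


def split_sentences(t):
    # Lazily capture text up to a run of sentence delimiters (or the end of the text).
    return re.findall(r"(.+?(?:[!.?]+|$))(?:\s|$)", t)


linked = "You need to complete [url=http://www.wowhead.com/?quest=10588]Cipher of Damnation[/url] first."
print(strip_bracket_tags(linked))
# You need to complete Cipher of Damnation first.

# A raw string is used so that \r and \n stay literal backslash sequences.
multiline = r"I say this guy becomes boss lvl.\r\n\r\nAfter experiencing the ..."
print(collapse_escaped_whitespace(multiline))
# I say this guy becomes boss lvl. After experiencing the ...

print(split_sentences("He is tough! Bring a healer. Why?"))
# ['He is tough!', 'Bring a healer.', 'Why?']
```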
    

### Tool 2: Network Analysis

### Tool 3: Wordclouds

### Tool 4: Sentiment Analysis
The goal for this section is to produce a unidimensional sentiment score for some input text (be it a Wowhead comment, or a Wikipages quote). The idea is that performing this analysis will provide a new perspective on the data in our network.


For the sentiment analysis part to be carried out, we needed some text to analyze first. Here, we chose to work with text from the scraped Wowhead comments, and with text from the extracted quotes from the Wikipages.

We chose two different methods for performing the sentiment analysis:
* [VADER](https://github.com/cjhutto/vaderSentiment)
    * VADER is a dictionary- and rule-based approach to evaluating text sentiment. This means that under the hood, VADER has a lookup table which is used to assign sentiment scores to tokens individually. On top of this, it has rules which alter the sentiment values based on negations, various punctuation, degree modifiers (e.g. *very* or *somewhat*), slang words, emojis, and much more. Furthermore, it should be noted that VADER is specifically tuned for text from social media. The VADER dictionary (or lexicon, as they call it) consists of about 7500 tokens. VADER produces several dimensions for sentiment, but the `compound` dimension is stated as being the best unidimensional measure of sentiment (it is described as a "normalized, weighted composite score").

* [BERT (flairNLP)](https://github.com/flairNLP/flair)
    * This approach is based on BERT (which is a deep learning technique for natural language modeling). It works by breaking down the input text into tokens, and assigning an embedding to every token. These embeddings are then combined in a manner which takes context/sequence into account (through *attention* mechanisms) and produces a single embedding for the entire input text. Classification can then be performed on this resulting vector, which turns it into a sentiment score (either positive or negative). What is noteworthy about this approach is its ability to capture long-range dependencies in sequences, and its large vocabulary size (this depends on the specific BERT variant, but is mostly in the range of ~30K different tokens).


The VADER procedure is carried out as follows:
1. Break down input text into individual sentences (using previously described regex)
2. For every sentence, we pass it to VADER's `polarity_scores(...)` method and retrieve its `compound` score
3. Reduce all sentences to a single score by averaging them, and return it
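
The three steps above can be sketched as follows. To keep the sketch runnable without extra dependencies, the per-sentence scorer is passed in as a callable and demonstrated with a toy lexicon; `mean_sentence_score`, `toy_score` and the neutral fallback for empty input are assumptions for illustration, not the project's actual code.

```python
import re
from statistics import mean


def split_sentences(t):
    # Sentence-splitting regex described under Tool 1.
    return re.findall(r"(.+?(?:[!.?]+|$))(?:\s|$)", t)


def mean_sentence_score(text, score_sentence):
    """Score every sentence separately and average the results."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0  # assumption: treat empty input as neutral
    return mean(score_sentence(s) for s in sentences)


# Toy stand-in for VADER: sums hypothetical per-word scores.
TOY_LEXICON = {"love": 0.5, "hate": -0.5}


def toy_score(sentence):
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)


score = mean_sentence_score("I love this fight! The trash is boring.", toy_score)

# With the vaderSentiment package installed, the scorer could instead be:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   analyzer = SentimentIntensityAnalyzer()
#   mean_sentence_score(text, lambda s: analyzer.polarity_scores(s)["compound"])
```

Note how the second sentence contains no lexicon words and therefore contributes a score of 0, pulling the average towards neutral.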


The BERT procedure is carried out as follows:
1. Pass the input text directly to flair by creating a `flair.data.Sentence(...)` and subsequently passing it to the classifier's `.predict(...)` method
2. The output is converted from a class label with a confidence score in [0, 1] to a single continuum over [-1, 1] by multiplying the class score by -1 if the predicted class is `NEGATIVE`
3. Return the converted score
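
The conversion in step 2 can be sketched as a small function. The name `to_signed_score` and the input validation are assumptions for illustration; the flair calls in the trailing comments show one plausible way to obtain the label and confidence, not necessarily how the project does it.

```python
def to_signed_score(label, confidence):
    """Map a (class label, confidence in [0, 1]) pair to a score in [-1, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Negative predictions are mirrored onto the negative half of the axis.
    return -confidence if label == "NEGATIVE" else confidence


examples = [("POSITIVE", 0.98), ("NEGATIVE", 0.75)]
signed = [to_signed_score(label, conf) for label, conf in examples]

# With flair installed, the inputs could be obtained roughly like this:
#   from flair.data import Sentence
#   from flair.models import TextClassifier
#   classifier = TextClassifier.load("en-sentiment")
#   sentence = Sentence(text)
#   classifier.predict(sentence)
#   label = sentence.labels[0]
#   to_signed_score(label.value, label.score)
```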

With the two procedures in place, we can look at how the models compare for a couple of examples:
* E



At this point, it would be interesting to see how the VADER sentiment scores compare to the BERT scores. To do this, we can take the two scores for all comments and Wikipage quotes, and plot the following histograms:

![Sentiment distributions](visualizations/sentdist.png)

It is immediately apparent that the two methods produce very different scores for the text data we extracted. The VADER scores are densely concentrated around 0 and have very thin tails, with barely any scores near the minimum and maximum of -1 and 1 respectively. On the other hand, the BERT scores are heavily concentrated near the minimum and maximum, with more negative scores than positive in total.

This difference in score distributions could probably stem from the following facts:
* Many out-of-lexicon tokens will lead to a high concentration of 0-scores for the VADER method. The BERT method has a bigger vocabulary, and out-of-vocabulary tokens do not have as big an impact on this type of model.
* The BERT method uses a model which has been trained on a binary classification task. This could lead to a sharp decision boundary, with the class confidence ending up close to either 0 or 1.


## 4. Discussion
Think critically about your creation
- What went well?
- What is still missing? 
- What could be improved?

**What went well**


**Missing**


**Improvements**

If we were to improve the work, there are multiple things that could be done (the possibilities are nearly endless):
- Improve character sentiment by adding more character quotes from the wowpedia pages
    - Some quote sections link to collections of character quotes, e.g. from Warcraft 2, which could be downloaded and processed.
    - Manually find the different quote pages on wowpedia and process these, adding quotes to major characters.
    - Find quotes from outside wowpedia, e.g. the Warcraft movie script or the different Warcraft books (probably behind paywalls)
        - This could enable time series analysis of character sentiment, which could be compared to the sentiment time series of wowhead comments.
- Expand the network beyond the [major character page on wowpedia](https://wowpedia.fandom.com/wiki/Major_characters)

## 5. Contributions
Who did what? You should write (just briefly) which group member was the main responsible for which elements of the assignment. 

**Main responsibilities**
* Janus Ivert Johansen, s173917
    * Setting up the website 
    * Wordcloud masks
    * Network graph visualization
* Lucas Alexander Sørensen, s174461
    * Downloading comments from wowhead
    * Sentiment Analysis (BERT and VADER)
    * Timeseries sentiment
* Simon Amtoft Pedersen, s173936
    * Downloading & cleaning character pages
    * Extracting character quotes from wowpedia character pages
    * Wordcloud computations (tf-idf etc.)
    * Compute network measures