# Explainer Notebook for Warcraft Major Character Analysis

Resulting website: [youngpenguin.github.io/WOWenShittyWebsite/](https://youngpenguin.github.io/WOWenShittyWebsite/)

GitHub repos: 
- Analysis repo [github.com/simonamtoft/warcraft-major-character-analysis](https://github.com/simonamtoft/warcraft-major-character-analysis)
- Website repo [github.com/YoungPenguin/WOWenShittyWebsite](https://github.com/YoungPenguin/WOWenShittyWebsite)

## 1. Motivation
- What is your dataset?
- Why did you choose this/these particular dataset(s)?
- What was your goal for the end user's experience?

The dataset used in this project comes from:
- Character pages from [wowpedia.fandom.com/wiki/Wowpedia](https://wowpedia.fandom.com/wiki/Wowpedia)
    - We only inspect the major characters found on [wowpedia.fandom.com/wiki/Major_characters](https://wowpedia.fandom.com/wiki/Major_characters)
- Corresponding character pages from [www.wowhead.com/](https://www.wowhead.com/)


We chose the wowpedia data as our starting point for building a network of the major characters in Warcraft. We then added user comments from the corresponding wowhead character pages, so that we could compare the sentiment of users' comments about a character with the sentiment of the character itself, along with other text analysis.


The goal is for the end user to be able to explore how groups of major Warcraft characters differ by attribute, e.g. by comparing wordclouds of Alliance and Horde characters, and to see what other people associate with these characters (using wowhead comments).

## 2. Basic Statistics
Let's understand the dataset better
- Write about your choices in data cleaning and preprocessing
- Write a short section that discusses the dataset stats

In order to perform text and network analysis on the chosen data, a lot of data cleaning and preprocessing had to take place.

The primary tools that have been used to get the data are 
- `BeautifulSoup` and `requests` for wowhead comments (see `download_character_comments.py`)
- `urllib` for wowpedia pages (see `api.py`)
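
As a rough illustration of the `urllib` step, the sketch below builds a request URL for the raw wikitext of a character page. It assumes the pages are fetched through the standard MediaWiki `api.php` endpoint. The endpoint path, the query parameters and the helper names `build_query_url` and `fetch_page_json` are assumptions made for this example, not a description of what `api.py` actually does.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the standard MediaWiki API location for the wiki.
API_URL = "https://wowpedia.fandom.com/api.php"


def build_query_url(title, base_url=API_URL):
    """Build a MediaWiki query URL asking for the latest revision content of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "titles": title,
    }
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_page_json(title, timeout=10):
    """Download and decode the JSON response for one page (needs network access)."""
    with urllib.request.urlopen(build_query_url(title), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the URL is built here; call fetch_page_json(...) to actually download.
url = build_query_url("Jaina Proudmoore")
print(url)
```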

For the wowpedia pages, both the raw and clean pages have been downloaded (see `/data/wow_chars/` and `/data/wow_chars_clean/`), while the wowhead comments were downloaded and stored along with all their metadata, and cleaned afterwards.

For cleaning and preprocessing of the data, the primary tools that have been used are the text analysis library `nltk` and the regular expressions library `re`. <br> 
The cleaning, preprocessing and downloading from websites will be described more in-depth in the next section under Tool 1.

The data can be split into the following subdivisions:
1. Wowhead comments (~700 files, 31MB)
    1. Raw .njson files containing every comment on the wowhead character pages along with some metadata, like dates. (see `download_character_comments.py`)
        1. Note: Some characters have multiple wowhead pages, in which case comments were collected from all of them.
    2. Processed .txt files, which consist of the comments from the raw files without any metadata (see `comments_clean.py`)
    3. Words .txt files, derived from the processed .txt files by removing stopwords, tokenizing the text and lemmatizing every word (see `comments_to_words.py`)
2. Wowpedia pages (~1000 files, 19MB)
    1. Raw .txt files containing the entire wowpedia character page (see `download_character_pages.py`)
    2. Clean .txt files containing the clean version of the wowpedia character page (see `download_character_pages_clean.py`)
    3. Quotes .txt files, derived from the raw .txt files, which consist of all the quotes from the quote section of the character's wowpedia page (see `extract_character_quotes.py`)
    4. Words .txt files, derived from the clean .txt files by removing stopwords, tokenizing the text and lemmatizing every word (see `pages_to_words.py`)
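
The "words" files described above come from a tokenize, remove-stopwords, lemmatize pipeline. The sketch below shows the shape of such a pipeline using only `re` and a tiny hand-written stopword set so that it runs without downloading any `nltk` data. The function name `to_words`, the stopword set and the token pattern are illustrative assumptions. In the project these steps are done with `nltk`, and a real lemmatizer such as `nltk.stem.WordNetLemmatizer().lemmatize` could be passed in through the `lemmatize` argument.

```python
import re

# Tiny illustrative stopword set; the project uses nltk's stopword list instead.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "was", "in", "to"}

# Lowercase alphabetic tokens, optionally with an inner apostrophe (e.g. "king's").
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def to_words(text, stopwords=STOPWORDS, lemmatize=lambda word: word):
    """Tokenize text, drop stopwords and lemmatize the remaining tokens.

    `lemmatize` defaults to the identity function; pass a real lemmatizer
    to get actual lemmas.
    """
    tokens = TOKEN.findall(text.lower())
    return [lemmatize(token) for token in tokens if token not in stopwords]


words = to_words("The Lich King was the ruler of the Scourge.")
print(words)
```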

The resulting dataset consists of ~1700 files (totalling 50 MB).

The resulting network graph has 261 nodes and 4009 edges (see `Graph Analysis.ipynb`).
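
One way to obtain such a network from the raw pages is to treat every wiki link from one major character's page to another major character as a directed edge. The sketch below illustrates that idea on toy page texts. The link pattern, the helper names `extract_links` and `build_edges`, and the choice of directed edges are assumptions for this example; the actual construction is in `Graph Analysis.ipynb`.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_links(raw_page, characters):
    """Return the set of known characters that a raw page links to."""
    targets = {match.group(1).strip() for match in WIKILINK.finditer(raw_page)}
    return targets & characters


def build_edges(pages):
    """Build directed (source, target) edges between the characters in `pages`."""
    characters = set(pages)
    edges = set()
    for source, text in pages.items():
        for target in extract_links(text, characters):
            if target != source:  # ignore self-links
                edges.add((source, target))
    return edges


# Toy page texts, made up for illustration only.
pages = {
    "Thrall": "Worked with [[Jaina Proudmoore|Jaina]] near [[Orgrimmar#History|the city]].",
    "Jaina Proudmoore": "Daughter of [[Daelin Proudmoore]]; ally of [[Thrall]].",
    "Daelin Proudmoore": "Father of [[Jaina Proudmoore]].",
}
edges = build_edges(pages)
print(len(pages), "nodes,", len(edges), "edges")
```

The resulting edge set could then be loaded into a graph library for computing network measures.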

## 3. Tools, Theory and Analysis
Describe the process of theory to insight
- Talk about how you've worked with text, including regular expressions, unicode, etc.
- Describe which network science tools and data analysis strategies you've used, how those network science measures work, and why the tools you've chosen are right for the problem you're solving.
- How did you use the tools to understand your dataset?

**Tool 1: Downloading and Cleaning text**

**Tool 2: Network Analysis**

**Tool 3: Wordclouds**

**Tool 4: Sentiment Analysis**

## 4. Discussion
Think critically about your creation
- What went well?
- What is still missing? 
- What could be improved?

**What went well**


**Missing**


**Improvements**

If we were to improve the work, there are multiple things that could be done (the possibilities are somewhat endless):
- Improve character sentiment by adding more character quotes from the wowpedia pages
    - Some quote sections link to collections of character quotes, e.g. from Warcraft 2, which could be downloaded and processed.
    - Manually find the different quote pages on wowpedia and process these, adding quotes to major characters.
    - Find quotes from outside wowpedia, e.g. the Warcraft movie script or the different Warcraft books (probably behind paywalls)
        - This could lead to a time series analysis of character sentiment, which could be compared to the sentiment time series of wowhead comments.
- Expand the network beyond the characters on the [major character page on wowpedia](https://wowpedia.fandom.com/wiki/Major_characters)

## 5. Contributions
Who did what? You should write (just briefly) which group member was the main responsible for which elements of the assignment. 

**Main responsibilities**
* Janus Ivert Johansen, s173917
    * Setting up the website 
    * Wordcloud masks
    * Network graph visualization
* Lucas Alexander Sørensen, s174461
    * Downloading comments from wowhead
    * Sentiment Analysis (BERT and VADER)
    * Timeseries sentiment
* Simon Amtoft Pedersen, s173936
    * Downloading & cleaning character pages
    * Extracting character quotes from wowpedia character pages
    * Wordcloud computations (tf-idf etc.)
    * Compute network measures