# Sentiment Analysis of Game of Thrones wiki - Fandom

#### Chosen characters for sentiment analysis:
 - Cersei Lannister
 - Tyrion Lannister
 - Jon Snow
 - Daenerys Tagaryen
 - Arya Stark

Since there are 389 characters total, there would be too many to analyse and a lot of them are insignificant for this analysis. Thus, we have chosen to conduct sentiment analysis on the 5 characters mentioned above. These were chosen due to their longevity in the series and thus we will be able to conduct sentiment over time (seasons). We want to determine the opinion and subjectivity of wikipedia pages character description and we want to see how the sentiment over time differs for each character. We also would like to see how each characters are described, looking at their lexical diversity, which can also be compared to the timeline, here meaning the 8 seasons, to see which characters are more present at different times. 

Additionally we will look at if the overall descriptions of characters is positive/negative/neutral in language. Thus we can be able to conclude which characters are more well liked and if this changes over time. This fandom page contains many different indivudals that have contributed to this community, so there might be a slight skewness in the analysis, as different authors uses different language, and hence this will affect the sentiment analysis.

In [1]:
# Obtaining the necessary data (character descriptions from the fandom page)
import requests
from bs4 import BeautifulSoup
import re
from nltk import word_tokenize
import pandas as pd
from urllib.request import urlopen

In [2]:
characters = ['Jon Snow', 'Cersei Lannister', 'Tyrion Lannister', 'Daenerys Targaryen', 'Arya Stark']

In [3]:
char_wiki = ['/wiki/Jon_Snow',
             '/wiki/Cersei_Lannister',
             '/wiki/Tyrion_Lannister',
             '/wiki/Daenerys_Targaryen',
             '/wiki/Arya_Stark']

### Using remove function from Assignment 2
#### To clean the text for each character description

In [4]:
def remove(doc):
    digits = '[0-9]'
    symbols = '[!#$%&()*+-./:;<=>?@[\]^_`{|}~\n]'
    single_characters = '[\b[a-zA-Z]\b]'
    
    # remove digits
    doc = re.sub(digits, '', doc)
    # remove symbols
    doc = re.sub(symbols, '', doc)
    # remove single characters
    doc = re.sub(single_characters, '', doc)
    # remove links 
    doc = re.sub(r'http\S+', '', doc)
    # remove punctuations
    doc = re.sub(r'[^\w\s]', '', doc)
    # remove extremely long words
    doc = re.sub(r'\W*\b\w{10,100}\b', '', doc)
    # remove HTML syntax such as \t and \n
    #doc = re.sub(r'\s', '', doc)
    doc = re.sub(r'[\t\n\r]', '', doc)
    # remove wiki references
    # tokenize
    doc = word_tokenize(doc)
    
    return doc

### Obtaining relevant text for each character

In [5]:
# Obtaining character description for the 10 chosen characters

for char in range(len(char_wiki)):
    root = 'https://gameofthrones.fandom.com'
    endpath = char_wiki[char]
    path = root+endpath
    req = requests.get(path)
    html_page = req.content
    soup = BeautifulSoup(html_page, 'html.parser')
    text = soup.find_all(text=True)

    output = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head', 
        'input',
        'script',
    # there may be more elements you don't want, such as "style", etc.
    ]
    
    for words in text:
        if words.parent.name not in blacklist:
            output += '{} '.format(words)
            descrip = remove(output)
    
            from nltk.corpus import stopwords
            stop_words = set(stopwords.words('english'))
            for word in stop_words:
                if word in descrip:
                    descrip.remove(word)
            
            char_data = pd.DataFrame(columns = characters)
            for head in char_data.head():
                char_data[head] = descrip
            

In [6]:
char_data

Unnamed: 0,Jon Snow,Cersei Lannister,Tyrion Lannister,Daenerys Targaryen,Arya Stark
0,Arya,Arya,Arya,Arya,Arya
1,Stark,Stark,Stark,Stark,Stark
2,Game,Game,Game,Game,Game
3,Thrones,Thrones,Thrones,Thrones,Thrones
4,Wiki,Wiki,Wiki,Wiki,Wiki
...,...,...,...,...,...
21558,TV,TV,TV,TV,TV
21559,Community,Community,Community,Community,Community
21560,View,View,View,View,View
21561,Mobile,Mobile,Mobile,Mobile,Mobile


In [None]:
 # Every character description begins after the first 'src'
            #indx_before = descrip.index('src') # position of 'src'
            #descrip = descrip[indx_before+1:]