# Fallout 76 Case Study: Web Scraping and Sentiment Analysis Walkthrough

For our LIS3201 semester-long research project, my team and I investigated toxic reactions in video game journalism headlines, specifically headlines about the video game Fallout 76.

To accomplish this, I developed a Python script that scrapes Google News headlines about Fallout 76 and then scores their sentiment to get quantitative data on toxic reactions. Below you will find the complete code of my first working script, along with a walkthrough explaining each step of the process.

## Step 1: Import Statements

The first step to any script is to import all the necessary libraries. Here is a list of the libraries we imported and what they do:
* requests - access webpages
* bs4 - BeautifulSoup4, takes raw HTML from webpages and turns it into Python objects that can be manipulated easily
* nltk - Natural Language Toolkit, a leading Python library for working with human language data. We used it for tokenization, the sentiment analyzer and its stopwords list (all of these will be explained later).
* csv - exporting data as Comma Separated Value files

In [1]:
import requests
import bs4
import nltk
import csv
from nltk.corpus import stopwords

## Step 2: Requesting the Webpage

Before I request the webpage, I set up a `headers` variable and a `payload` variable that I could use to request the correct webpage. The `headers` variable contains information about you as the requester so that the page knows who you are and what browser you're using. I found that unless I used this specific header, Google would give me the wrong webpage and just generally not cooperate.

In [2]:
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}

The `payload` variable contains information about the query you're making. In my case it included the search term, the window of time we wanted to look for results in, and the specific subset of Google News results.

In [3]:
payload = {'as_epq': 'Fallout 76', 'tbs':'cdr:1,cd_min:11/14/2018,cd_max:12/14/2018', 'tbm':'nws'}

Now that I have my `headers` and `payload`, I can request the webpage and save the response as an object called `r`. I used the `requests.get()` function to fetch the page. I pass it three arguments: the base URL of the site I'm going to (in my case Google's search page), `params`, which is our `payload` variable, and `headers`, which is our `headers` variable.

In [4]:
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

Here I put a print statement just to verify that the request went to the right page.

In [5]:
print("URL = ", r.url)

URL =  https://www.google.com/search?as_epq=Fallout+76&tbs=cdr%3A1%2Ccd_min%3A11%2F14%2F2018%2Ccd_max%3A12%2F14%2F2018&tbm=nws
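
To see how the `payload` dictionary turns into that query string without sending a request, here is a small sketch using the standard library's `urllib.parse.urlencode`. It is only an illustration: `requests` does its own encoding internally, and the names `base_url`, `query_string`, and `full_url` are made up for this example.

```python
from urllib.parse import urlencode

# Same search parameters as the payload used in the walkthrough
payload = {
    'as_epq': 'Fallout 76',
    'tbs': 'cdr:1,cd_min:11/14/2018,cd_max:12/14/2018',
    'tbm': 'nws',
}

base_url = "https://www.google.com/search"

# urlencode turns spaces into "+" and percent-encodes ":", "," and "/"
query_string = urlencode(payload)
full_url = base_url + "?" + query_string

print(full_url)
```

The printed URL should match the one shown above: spaces become `+`, and the colons, commas, and slashes in the date range become `%3A`, `%2C`, and `%2F`.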


## Step 3: Getting the Headlines

So now that I've requested a webpage and have it saved as the `r` object, I can start to figure out where our headlines are and pull them out of the raw HTML code. The first step here is to bring in the BeautifulSoup library to make the `r` object easier to work with. I created a new object called `soup` by passing the response text (`r.text`) to BeautifulSoup along with `'lxml'`, which tells it to parse the HTML with the lxml parser so I can manipulate it more easily.

In [6]:
soup = bs4.BeautifulSoup(r.text, 'lxml')

This still contains all the HTML from the website, most of which is totally irrelevant to what I'm doing. So I used the `select()` method to pick out only the elements I wanted from the whole page. `select()` takes a CSS selector as its argument, so I passed in `div.g.card h3` because all the headlines are in `h3` tags inside divs that have both the `g` and `card` classes.

In [7]:
base = soup.select("div.g.card h3")

Now `base` only includes the headlines but they still have HTML elements wrapped around them, which I need to get rid of. Below is a print statement to illustrate what one looks like:

In [8]:
print("RAW HEADLINES=", base[3])

RAW HEADLINES= <h3 class="r dO0Ag"><a class="l lLrAF" href="https://www.forbes.com/sites/games/2018/11/20/fallout-76-review-xbox-one-x-look-upon-my-works-and-despair/" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.forbes.com/sites/games/2018/11/20/fallout-76-review-xbox-one-x-look-upon-my-works-and-despair/&amp;ved=0ahUKEwiyv-DTnvbhAhW2wMQHHbe2AnEQqQIIPygAMAM">'<em>Fallout 76</em>' Review (Xbox One X): Look Upon My Works And Despair</a></h3>


First, I declared an empty list called `headlines` to hold the clean headlines. To get rid of all the HTML clutter, I created a for loop that iterates over each item in the `base` list and uses the `.text` attribute to drop the HTML tags and give us only the text. It then appends each item to the empty `headlines` list.

In [9]:
headlines = []

for row in base:
    clean = row.text
    headlines.append(clean)
    
print("HEADLINES =", headlines[3])

HEADLINES = 'Fallout 76' Review (Xbox One X): Look Upon My Works And Despair
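
If you want to see what stripping the tags amounts to without installing anything, here is a dependency-free sketch that uses Python's built-in `html.parser` instead of BeautifulSoup. It is a swapped-in technique for illustration only, not what the script above does; the `HeadlineParser` class, the `sample_html` snippet, and its headline are all made up.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the visible text of every <h3> element it sees."""

    def __init__(self):
        super().__init__()
        self.inside_h3 = False   # True while we are between <h3> and </h3>
        self.parts = []          # text fragments of the current headline
        self.headlines = []      # finished headline strings

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.inside_h3 = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "h3" and self.inside_h3:
            self.inside_h3 = False
            self.headlines.append("".join(self.parts))

    def handle_data(self, data):
        # Nested tags such as <a> and <em> are skipped; only their text is kept
        if self.inside_h3:
            self.parts.append(data)


# Made-up markup shaped like the raw headline printed earlier
sample_html = (
    '<div class="g card"><h3 class="r"><a href="https://example.com/a">'
    "'<em>Fallout 76</em>' Review: A Made-Up Headline</a></h3></div>"
)

parser = HeadlineParser()
parser.feed(sample_html)
parser.close()
headlines_demo = parser.headlines
print(headlines_demo)
```

BeautifulSoup's `.text` does the same job in one step, and `select()` adds the class filtering that this sketch leaves out.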


## Step 4: Transforming the Headlines

The first transformation is tokenization, which means splitting each headline into a list of individual words and punctuation marks. The function below takes one headline string, tokenizes it with `nltk.word_tokenize()`, and lowercases every token.

In [10]:
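def clean_tokens(headline):
    # Split one headline string into word and punctuation tokens
    words = nltk.word_tokenize(headline)
    # Lowercase every token so it can be compared against the stopword list
    clean_words = []
    for word in words:
        clean_words.append(word.lower())
    return clean_words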

I then used a for loop to apply the `clean_tokens()` function above to each of the headlines, creating a new list of tokenized headlines.

In [11]:
tokens = []
for each in headlines:
    clean = clean_tokens(each)
    tokens.append(clean)

In [12]:
print('TOKENS =',tokens[3])

TOKENS = ["'fallout", '76', "'", 'review', '(', 'xbox', 'one', 'x', ')', ':', 'look', 'upon', 'my', 'works', 'and', 'despair']


Now that the headlines are tokenized I can remove stopwords. "Stopwords" are common English words that I don't want to include, such as a, the, was, is, etc. I imported our list of stopwords from the nltk library and then added the words "fallout" and "76", since they would show up in every search result and I didn't want to include them in the sentiment analysis. The code below is how I imported the stopwords into our program as a list.

In [13]:
stopwords = nltk.corpus.stopwords.words('english')
stopwords.append("fallout")
stopwords.append("76")

Next I have to iterate over the list of tokenized headlines and remove every instance of a stopword by comparing the tokens to the list of stopwords I just created. This is done with nested loops. First I declare an empty list called `filtered` to store the result. The outer loop says that for each individual list of tokens within the list of headlines, create a temporary list called `x`, run the inner loop, and then append `x` to `filtered`. The inner loop says that for each token in the list of tokens, if it isn't in the stopwords list, append it to `x`.

Now that sounds confusing but essentially I'm just performing three steps over and over again:
1. Make a temporary variable
2. Fill it with whatever words from the headline don't match the stopwords list
3. Append that full temporary variable to our filtered list

In [14]:
filtered = []
# token_list (rather than "list") avoids shadowing Python's built-in list type
for token_list in tokens:
    x = []
    for word in token_list:
        if word not in stopwords:
            x.append(word)
    filtered.append(x)
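
The same three steps can also be written as a nested list comprehension. This is a minimal sketch with a tiny stand-in stopword set and made-up token lists (`demo_stopwords`, `demo_tokens`, and `demo_filtered` are hypothetical names); the walkthrough itself uses NLTK's full English list.

```python
# Stand-in stopword set; a set makes each "in" check fast
demo_stopwords = {"my", "and", "fallout", "76"}

# Two made-up tokenized headlines
demo_tokens = [
    ["fallout", "76", "review", "look", "upon", "my", "works", "and", "despair"],
    ["fallout", "76", "patch", "notes"],
]

# Outer comprehension: one new list per headline
# Inner comprehension: keep only the tokens that are not stopwords
demo_filtered = [
    [word for word in headline if word not in demo_stopwords]
    for headline in demo_tokens
]

print(demo_filtered)
# [['review', 'look', 'upon', 'works', 'despair'], ['patch', 'notes']]
```

Both versions produce a list of lists; the comprehension is just a more compact way to write the nested loops.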

My last step was to take these lists of individual tokens and put them back into a form the sentiment analyzer can handle. The analyzer's `polarity_scores()` method takes a single string, and right now each headline exists as a list of tokens inside a larger list. I need to convert that back into one list of strings. To accomplish this, I wrote a loop that iterates over the filtered list and, for each list of tokens, joins the tokens together with a space between them. This results in a single list I called `combined` that contains one string for every headline.

In [15]:
combined = []
for token_list in filtered:
    combined.append(" ".join(token_list))
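
As a quick illustration of what `" ".join()` does here, the sketch below rebuilds strings from two made-up token lists. The names `demo_filtered` and `demo_combined` are hypothetical.

```python
# Made-up filtered token lists, one per headline
demo_filtered = [
    ["review", "look", "upon", "works", "despair"],
    ["patch", "notes"],
]

# " ".join() glues the tokens of one headline together with single spaces
demo_combined = [" ".join(token_list) for token_list in demo_filtered]

print(demo_combined)
# ['review look upon works despair', 'patch notes']
```

The string before `.join` is the separator, so `" "` puts exactly one space between neighboring tokens, including around any punctuation tokens that survived filtering.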

## Step 5: Finally Analyzing the Sentiment

Now that I have my list of clean headlines without the stopwords, I can plug them into the sentiment analyzer. First I have to import the sentiment analyzer, create an instance of it, and declare an empty list to hold the final results.

In [16]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
sia = SIA()
results = []

Then I just use another for loop to iterate the `polarity_scores()` method from `sia` over each of the headlines and then append the output to the new results list.

In [17]:
for line in combined:
    pol_score = sia.polarity_scores(line)
    results.append(pol_score)

That's it! The sentiment analyzer calculates polarity scores for each headline, creating a dictionary with the negative score, neutral score, positive score, and compound score as shown below:

In [18]:
print("")
for d in results:
    print(d)


{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.372, 'neu': 0.426, 'pos': 0.202, 'compound': -0.3818}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.247, 'neu': 0.753, 'pos': 0.0, 'compound': -0.3182}
{'neg': 0.373, 'neu': 0.423, 'pos': 0.204, 'compound': -0.34}
{'neg': 0.0, 'neu': 0.646, 'pos': 0.354, 'compound': 0.2944}
{'neg': 0.0, 'neu': 0.69, 'pos': 0.31, 'compound': 0.4019}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.727, 'pos': 0.273, 'compound': 0.4588}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
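
One common way to read the compound score is the threshold convention suggested in the VADER documentation: 0.05 or higher counts as positive, -0.05 or lower as negative, and anything in between as neutral. Here is a minimal sketch that applies it to the ten compound values printed above. The `label` helper and the tally are illustrative additions, not part of the original script.

```python
from collections import Counter

# Compound scores copied from the output above
compound_scores = [0.0, -0.3818, 0.0, -0.3182, -0.34,
                   0.2944, 0.4019, 0.0, 0.4588, 0.0]


def label(compound, threshold=0.05):
    """Map a compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Count how many headlines fall into each bucket
tally = Counter(label(score) for score in compound_scores)
mean_compound = sum(compound_scores) / len(compound_scores)

print(tally["negative"], tally["neutral"], tally["positive"])
print(round(mean_compound, 4))
```

For this sample the tally is 3 negative, 4 neutral, and 3 positive headlines.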


## Step 6: Writing the Data to a CSV File

Finally, I can write the data out to an external file to crunch it. I used the csv library to write the data out to a CSV file using the code below. Essentially what it's doing is iterating over the `results` list and, for each dictionary in it, writing a row containing the label "compound" followed by that dictionary's compound value.

In [20]:
with open('TEST.csv', 'w', newline='') as csv_file:
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    writer = csv.writer(csv_file)
    for d in results:
        writer.writerow(['compound', d['compound']])
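
If you would rather keep all four scores with a header row, `csv.DictWriter` is one option. The sketch below is an assumption about how that could look, not what our script did: `demo_results` holds two of the dictionaries printed in Step 5, and it writes to an in-memory `io.StringIO` buffer instead of a file so it can run anywhere.

```python
import csv
import io

# Two score dictionaries copied from the Step 5 output
demo_results = [
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.372, 'neu': 0.426, 'pos': 0.202, 'compound': -0.3818},
]

# In-memory stand-in for a file opened with newline=''
buffer = io.StringIO(newline='')

# fieldnames sets both the header row and the column order
writer = csv.DictWriter(buffer, fieldnames=['neg', 'neu', 'pos', 'compound'])
writer.writeheader()
writer.writerows(demo_results)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with `open('TEST.csv', 'w', newline='')` inside a `with` block.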

Thanks for reading! Hopefully this walkthrough helps you with your sentiment analyzing endeavors.