## Lab 4 - Web Scraping

The past few labs have dealt with basic programming concepts such as opening and closing files and writing structured code. While these concepts are important for computational text analysis, we still need text to apply them to.

There are two ways to obtain text:
* Retrieving text from a pre-existing corpus
* Creating your own corpus through web-scraping or other means

In this lab we will be looking into web-scraping and how we can create our very own dataset. Read through the following tutorial as it is a very good introduction to web scraping: [Python web scraping](https://www.dataquest.io/blog/web-scraping-tutorial-python/)

Python has a very easy-to-use web-scraping module called 'BeautifulSoup' that we will be learning to use. But before we can start writing our scraper, we need to identify where our data comes from and how the website is laid out.

For the purpose of this lab, we will be scraping some reviews for the Black Panther movie at [Rotten Tomatoes](https://www.rottentomatoes.com/m/black_panther_2018/reviews?type=&sort=).

We need to understand how a website is structured. A website is built up of blocks of html tags, and it is these tags that are important for web scraping. We need to find the specific tag that identifies individual reviews on the webpage. The simplest way to do so in Chrome is to right-click on the review and click "Inspect" at the bottom of the pop-up list. This will open a window that shows the html content of the page. As you move your cursor over the code, it will highlight the corresponding element on the actual webpage. You need to find the html tag that corresponds to a review, as shown in the image below.

![html](./image1.png)

The particular thing we are interested in is the class attribute of the tag. The class attribute is used to group similar items together so that they can be structured and styled in the same way. In the case of Rotten Tomatoes, all reviews have the "the_review" class assigned to them, which is highlighted in the image above. Now that we have this information, we can write the code that retrieves the reviews we want to scrape. To do this, we will use the "requests" module to request the page from the site, and then we will use BeautifulSoup to parse the html and pull out the contents we need.
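Before requesting the live page, it can help to try the idea on a small piece of html that you control. The sketch below is only an illustration: the html string is hand-written and is not real Rotten Tomatoes markup. It uses `find_all()`, which is the current BeautifulSoup 4 name for `findAll()`.

```python
from bs4 import BeautifulSoup

# A hand-written stand-in for a review page (an assumption, not real site markup)
html = """
<div class="review_desc">
  <div class="the_review">A fun film.</div>
  <div class="the_review">  Too long.  </div>
  <div class="other">Not a review.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python
review_tags = soup.find_all("div", class_="the_review")

# get_text() returns the text inside each tag; strip() trims the surrounding whitespace
review_texts = [tag.get_text().strip() for tag in review_tags]
print(review_texts)
```

Only the two `div` blocks with the "the_review" class are returned; the block with the "other" class and the outer container are ignored.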

In [1]:
# import the requests module and beautifulsoup
import requests
from bs4 import BeautifulSoup as bs

# first we open a new session which allows us to access websites using python
mySession = requests.session()

# next we retrieve the website
response = mySession.get("https://www.rottentomatoes.com/m/black_panther_2018/reviews")

# now we use beautifulsoup to retrieve and parse the html content
# this allows us to search for the review tags and retrieve them in the next step
soup = bs(response.content, 'html.parser')

# the soup variable contains the parsed html content which we will search through for reviews
# the find_all() function (findAll() is its older name) will find all blocks of html code that are assigned the "the_review" class
articleFeed = soup.find_all("div", {"class": "the_review"})

# we can loop through the reviews and print them to make sure our code is working
for review in articleFeed:
    print(review)
    

<div class="the_review" data-qa="review-text">
                    It's also a breath of fresh air for the cinematic universe its set in, one that has developed a few bad habits over the years, one that also needed a fresh perspective, and a divergence from the grander threads a play in the narrative.
                </div>
<div class="the_review" data-qa="review-text">
                    [It] showed not only was there audience appetite for a big-budget black superhero flick, but there was appetite for one that grappled with modern societal challenges like isolationism, oppression, and technological disparity.
                </div>
<div class="the_review" data-qa="review-text">
                    There's so much to celebrate about Black Panther.
                </div>
<div class="the_review" data-qa="review-text">
                    It's a cultural phenomenon.
                </div>
<div class="the_review" data-qa="review-text">
                    The best Marvel movie ye

In [3]:
#There's also a way to show the html code more cleanly, with prettify
for review in articleFeed:
    print(review.prettify())

<div class="the_review" data-qa="review-text">
 It's also a breath of fresh air for the cinematic universe its set in, one that has developed a few bad habits over the years, one that also needed a fresh perspective, and a divergence from the grander threads a play in the narrative.
</div>

<div class="the_review" data-qa="review-text">
 [It] showed not only was there audience appetite for a big-budget black superhero flick, but there was appetite for one that grappled with modern societal challenges like isolationism, oppression, and technological disparity.
</div>

<div class="the_review" data-qa="review-text">
 There's so much to celebrate about Black Panther.
</div>

<div class="the_review" data-qa="review-text">
 It's a cultural phenomenon.
</div>

<div class="the_review" data-qa="review-text">
 The best Marvel movie yet.
</div>

<div class="the_review" data-qa="review-text">
 A tentpole franchise film that broke the mold of what came before, reveling in a diversity and culture th

A few other things you can do are to get the first matching block of html with `find()` and to extract the text inside a tag with `get_text()`.

In [3]:
#get the first review
articleFirst = soup.find("div", {"class": "the_review"})
articleFirst

<div class="the_review" data-qa="review-text">
                    It's also a breath of fresh air for the cinematic universe its set in, one that has developed a few bad habits over the years, one that also needed a fresh perspective, and a divergence from the grander threads a play in the narrative.
                </div>

In [4]:
#print it a bit neater
articleFirst.prettify()

'<div class="the_review" data-qa="review-text">\n It\'s also a breath of fresh air for the cinematic universe its set in, one that has developed a few bad habits over the years, one that also needed a fresh perspective, and a divergence from the grander threads a play in the narrative.\n</div>\n'

In [8]:
#extract the contents of the "the_review" tag
textFirst = articleFirst.get_text()
textFirst

"\r\n                    It's also a breath of fresh air for the cinematic universe its set in, one that has developed a few bad habits over the years, one that also needed a fresh perspective, and a divergence from the grander threads a play in the narrative.\r\n                "

In [10]:
#you can also get the text of all the reviews, with a list comprehension
textAll = [review.get_text() for review in articleFeed]
textAll

["\r\n                    It's also a breath of fresh air for the cinematic universe its set in, one that has developed a few bad habits over the years, one that also needed a fresh perspective, and a divergence from the grander threads a play in the narrative.\r\n                ",
 '\r\n                    [It] showed not only was there audience appetite for a big-budget black superhero flick, but there was appetite for one that grappled with modern societal challenges like isolationism, oppression, and technological disparity.\r\n                ',
 "\r\n                    There's so much to celebrate about Black Panther.\r\n                ",
 "\r\n                    It's a cultural phenomenon.\r\n                ",
 '\r\n                    The best Marvel movie yet.\r\n                ',
 '\r\n                    A tentpole franchise film that broke the mold of what came before, reveling in a diversity and culture that had barely been touched on in past superhero films and that

The above example only gets one page of reviews, but a useful dataset often needs much more than that, which means scraping every page that contains reviews. When the url of a site contains a page parameter such as "&page=1", we can use that string to loop through all the pages. In the next code block we will re-code our scraper to retrieve the author and review text. It is up to you to complete the code and store the data to a pandas dataframe.

**Code updated**: the code below does not loop through multiple pages. We just put one page of reviews into pandas.

In order to loop through all the pages we will use something called **string formatting**. Work through the following tutorial to become familiar with string formatting: [string formatting tutorial](https://www.learnpython.org/en/String_Formatting#:~:text=String%20Formatting-,String%20Formatting,%22%20and%20%22%25d%22.)

We will also use something called **list comprehension**. Read the following short tutorial on this feature: [list comprehension](https://www.learnpython.org/en/List_Comprehensions)
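As a small illustration of how the two features fit together, the sketch below builds a list of page urls. The url pattern and its "page" parameter are hypothetical and chosen only for the example; a real site may paginate differently, so check the url in your browser first.

```python
# Hypothetical url pattern: the "page" parameter is an assumption for illustration
base_url = "https://www.example.com/reviews?sort=&page=%d"

# string formatting: %d is replaced by the page number
first_page = base_url % 1

# list comprehension: build the urls for pages 1 to 3 in a single line
page_urls = [base_url % page for page in range(1, 4)]

for url in page_urls:
    print(url)
```

Each url in `page_urls` could then be passed to `mySession.get()` inside a loop, one page at a time.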

In [13]:
mySession = requests.session()

response = mySession.get("https://www.rottentomatoes.com/m/black_panther_2018/reviews")
soup = bs(response.content, 'html.parser')
    
# we find the table containing the reviews
reviewTable = soup.find("div", {"class": "review_table"})
    
reviewTable

<div class="review_table">
<div class="row review_table_row" data-qa="review-item">
<div class="col-xs-8 critic-info">
<div class="col-sm-7 col-xs-16 critic_img">
<img class="critic_thumb fullWidth" src="http://resizing.flixster.com/bqiGx9sWuMzshbEuQDg5VW4ATP0=/128x128/v1.YzsxMDAwMDAyNTkzO2o7MTkwODk7MjA0ODs1MDA7NTAw" width="100px"/>
</div>
<div class="col-sm-17 col-xs-32 critic_name">
<a class="unstyled bold articleLink" data-qa="review-critic-link" href="/critics/cory-woodroof">Cory Woodroof</a>
<br>
<a href="/critics/source/100009562">
<em class="subtle critic-publication" data-qa="review-critic-publication">615 Film</em>
</a>
</br></div>
</div>
<div class="col-xs-16 review_container">
<div class="review_icon icon small fresh">
</div>
<div class="review_area">
<div class="review-date subtle small" data-qa="review-date">
                Feb 11, 2022
            </div>
<div class="review_desc">
<div class="the_review" data-qa="review-text">
                    It's also a breath of fre

In [14]:
mySession = requests.session()

response = mySession.get("https://www.rottentomatoes.com/m/black_panther_2018/reviews")
soup = bs(response.content, 'html.parser')
    
# we find the table containing the reviews
reviewTable = soup.find("div", {"class": "review_table"})
    
# We retrieve the name of the reviewer
# notice that we are using 3 classes in the select command to find the correct html element
reviewers = [r.get_text() for r in reviewTable.select(".review_table_row .critic_name .articleLink")]
    
# retrieve the review text
reviewText = [t.get_text() for t in reviewTable.select(".review_table_row .the_review")]
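Building two separate lists works as long as every row has both a reviewer name and a review text. If a row were missing one of them, the lists would no longer line up. One possible alternative, sketched here on hand-written html rather than the real page, is to go through the rows one at a time with `select()` and `select_one()`. The reviewer names and texts below are made up for the example.

```python
from bs4 import BeautifulSoup

# Hand-written html that imitates the class names used above (not real site data)
html = """
<div class="review_table">
  <div class="review_table_row">
    <div class="critic_name"><a class="articleLink" href="#">Reviewer A</a></div>
    <div class="the_review"> First review. </div>
  </div>
  <div class="review_table_row">
    <div class="critic_name"><a class="articleLink" href="#">Reviewer B</a></div>
    <div class="the_review"> Second review. </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# a space between two selectors means "the second element somewhere inside the first"
for row in soup.select(".review_table .review_table_row"):
    # select_one() returns the first match inside this row, or None if there is none
    name_tag = row.select_one(".critic_name .articleLink")
    text_tag = row.select_one(".the_review")
    rows.append({
        "Reviewer": name_tag.get_text().strip() if name_tag else None,
        "Text": text_tag.get_text().strip() if text_tag else None,
    })

print(rows)
```

Because each dictionary holds one reviewer together with that reviewer's text, a missing element shows up as `None` in its own row instead of shifting the rows that follow.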

In [15]:
reviewers

['Cory Woodroof',
 'Nathan Mattise',
 'Fletcher Powell',
 'Nicolás Delgadillo',
 'Olly Richards',
 'Jim Rohner',
 'Richard Crouse',
 'Jeffrey Zhang',
 'Isaac Feldberg',
 'Mike Massie',
 'Dan Buffa',
 'Paul McGuire Grimes',
 'Matthew St. Clair',
 'Ade Adeniji',
 'Brandon Avery',
 'Richard Propes',
 'Jason Fraley',
 'Kelechi Ehenulo',
 'Zehra Phelan',
 'Matt Cipolla']

In [18]:
reviewText

["\r\n                    It's also a breath of fresh air for the cinematic universe its set in, one that has developed a few bad habits over the years, one that also needed a fresh perspective, and a divergence from the grander threads a play in the narrative.\r\n                ",
 '\r\n                    [It] showed not only was there audience appetite for a big-budget black superhero flick, but there was appetite for one that grappled with modern societal challenges like isolationism, oppression, and technological disparity.\r\n                ',
 "\r\n                    There's so much to celebrate about Black Panther.\r\n                ",
 "\r\n                    It's a cultural phenomenon.\r\n                ",
 '\r\n                    The best Marvel movie yet.\r\n                ',
 '\r\n                    A tentpole franchise film that broke the mold of what came before, reveling in a diversity and culture that had barely been touched on in past superhero films and that

In [21]:
#optional: removing the \r\n
reviewText = [x.strip() for x in reviewText]
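Note that `strip()` only removes whitespace at the start and end of a string. If a scraped text also had line breaks in the middle, one common way to tidy it is to split on whitespace and join the pieces again. The string below is made up for the example, with an extra line break added in the middle.

```python
# A made-up string with whitespace at both ends and a line break in the middle
raw = "\r\n                    It's a cultural\r\n phenomenon.\r\n                "

# strip() only trims whitespace at the two ends of the string
trimmed = raw.strip()

# split() with no arguments splits on any run of whitespace,
# so joining the pieces with a single space also cleans up the middle
normalised = " ".join(raw.split())

print(repr(trimmed))
print(repr(normalised))
```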

In [22]:
reviewText

["It's also a breath of fresh air for the cinematic universe its set in, one that has developed a few bad habits over the years, one that also needed a fresh perspective, and a divergence from the grander threads a play in the narrative.",
 '[It] showed not only was there audience appetite for a big-budget black superhero flick, but there was appetite for one that grappled with modern societal challenges like isolationism, oppression, and technological disparity.',
 "There's so much to celebrate about Black Panther.",
 "It's a cultural phenomenon.",
 'The best Marvel movie yet.',
 'A tentpole franchise film that broke the mold of what came before, reveling in a diversity and culture that had barely been touched on in past superhero films and that had certainly not been embraced as widely and emphatically before...',
 "Boseman's Black Panther is not only capable of fighting the bad guys but is also a vessel for the film's study of the importance of legacy and identity.",
 'Sure,Black Pa

In [23]:
# now that we have the data, store it in a dataframe

import pandas as pd

reviews = {'Reviewer':reviewers, 'Text':reviewText}

reviews_df = pd.DataFrame(reviews) 

reviews_df

| | Reviewer | Text |
|---|---|---|
| 0 | Cory Woodroof | It's also a breath of fresh air for the cinema... |
| 1 | Nathan Mattise | [It] showed not only was there audience appeti... |
| 2 | Fletcher Powell | There's so much to celebrate about Black Panther. |
| 3 | Nicolás Delgadillo | It's a cultural phenomenon. |
| 4 | Olly Richards | The best Marvel movie yet. |
| 5 | Jim Rohner | A tentpole franchise film that broke the mold ... |
| 6 | Richard Crouse | Boseman's Black Panther is not only capable of... |
| 7 | Jeffrey Zhang | Sure,Black Panther is a mainstream superhero m... |
| 8 | Isaac Feldberg | A taut triumph of Afrofuturist iconography rea... |
| 9 | Mike Massie | Just about the same as every other Marvel title. |


The above example was fairly straightforward. Web scraping can get very technical for certain websites so it is important to practise web scraping on various sites. 

For the rest of this lab, find a website to scrape yourself. Write your own scraper and make sure to add enough comments to explain what you are doing. 