# WIM Workshop: API-Webscraping with Python

* Date: Nov 3, 2023
* Instructors: Eehyun Kim (eehkim@iu.edu), Anne Kavalerchik (akavaler@iu.edu)

## Example 1. Famous Quotes

Let's open this link for our first practice: http://quotes.toscrape.com/. It's a website with quotations, the people they are attributed to, and the short biographies of those people.

### Understanding the Structure of the Website

In your browser (for example, Chrome), open the settings menu (three vertical dots) and click `More Tools > Developer Tools` to inspect the HTML structure of the page.

Then, load the packages we will use, which are basically the same as the ones we used for APIs. We will use the Python `requests` library to send HTTP requests and `BeautifulSoup` to extract the elements we are interested in.

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

In [None]:
url = "http://quotes.toscrape.com/"
response = requests.get(url)
response

`<Response [200]>` means that our request was successful.
Usually what we want is the text from a website.
Let's get the text and print it. [Compare it to the source code of the actual webpage](view-source:http://quotes.toscrape.com/)
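A status of 200 is the success case, but a request can also fail or hang. As a minimal sketch (the helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the workshop code), the request can be wrapped so that an error status raises an exception instead of quietly returning an error page:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of `url`, raising an exception on an HTTP error status."""
    # timeout (in seconds) is an illustrative value; without one, requests can wait indefinitely
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# htmltext = fetch_html("http://quotes.toscrape.com/")
```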

In [None]:
htmltext = response.text
print(htmltext)

We could use a combination of regular expressions, string matching, and loops to navigate the html, but luckily the Beautiful Soup package makes it much easier. [BeautifulSoup documentation is here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [None]:
soup = bs(htmltext, 'html.parser')
# print(soup)  # looks much the same as before parsing, but the parsed object is easier to navigate

There are several ways to navigate the parsed page. Start by finding the element of interest, in this case the first quote (from Einstein), and then extract its information.

### The HTML for the first quote looks like this:
***
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> 
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

We will use `.find` with the name of the target tag and search by one of its attributes. Here, the tag is `div` and its `class` is `quote`.

In [None]:
print(soup.find("div", {"class": "quote"}))

Then, find the inner element nested inside that tag.

In [None]:
first_quote = soup.find("div", {"class": "quote"})
quote_text1 = first_quote.find("span", {"itemprop": "text"})
quote_text2 = first_quote.find("span", {"class": "text"})

print(quote_text1)
print(quote_text2)
print(first_quote.span)

In [None]:
print(quote_text2.text)
print(quote_text2.get_text())
print(first_quote.span.text)

In [None]:
# More concisely, you can chain the calls in a single line to retrieve the quote.

print(soup.find("div", {"class": "quote"}).find("span", {"itemprop": "text"}).text)
print(soup.find("div", {"class": "quote"}).find("span", {"class": "text"}).text)
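One caveat with chaining: `find` returns `None` when nothing matches, so calling `.find` or `.text` on that result raises an `AttributeError`. Below is a small sketch of a guard, using a made-up HTML snippet and a hypothetical helper named `text_or_none`:

```python
from bs4 import BeautifulSoup as bs

# Made-up snippet that mimics one quote block from the page
html = """
<div class="quote">
    <span class="text" itemprop="text">A sample quote.</span>
    <span>by <small class="author" itemprop="author">Sample Author</small></span>
</div>
"""
soup = bs(html, "html.parser")


def text_or_none(parent, name, attrs):
    """Return the stripped text of the first matching element, or None if there is no match."""
    if parent is None:
        return None
    element = parent.find(name, attrs)
    return element.get_text(strip=True) if element is not None else None


first_quote = soup.find("div", {"class": "quote"})
print(text_or_none(first_quote, "span", {"class": "text"}))     # A sample quote.
print(text_or_none(first_quote, "small", {"class": "author"}))  # Sample Author
print(text_or_none(first_quote, "a", {"class": "tag"}))         # None (no tag links in this snippet)
```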

### Exercise 1. Let's retrieve `author`, Albert Einstein, using `soup.find()`.

In [None]:
# A notebook cell only displays its last expression, so print both results
print(soup.find("div", {"class": "quote"}).find("small", {"itemprop": "author"}).text)
print(soup.find("div", {"class": "quote"}).find("small", {"class": "author"}).text)

You may have noticed that multiple search terms produce identical results. Choose whichever works best for you.

While `.find` returns only the first matching element, `.findAll` and `.select` return __all__ elements that match. Let's get all of the tags for that quotation and use `get_text` to get __only__ the text from each tag.

In [None]:
first_quote = soup.find("div", {"class": "quote"})

tags = first_quote.findAll("a", {"class": "tag"})
tags_list = []
for tag in tags:
    print(tag.get_text())
    tags_list.append(tag.get_text())
print(tags_list)

# We can do the equivalent task without a loop using this line:
tags_list = [tag.get_text() for tag in tags]
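`.select` is mentioned above but not used in this example. It takes a CSS selector string instead of a tag name plus an attribute dictionary. Here is a minimal sketch on a made-up snippet (the HTML is illustrative, not copied from the site) in which both approaches return the same tags. `find_all` is the current spelling of `findAll` in Beautiful Soup 4, and both work:

```python
from bs4 import BeautifulSoup as bs

# Made-up snippet with the same tag structure as a quote block
html = """
<div class="quote">
    <div class="tags">
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
"""
soup = bs(html, "html.parser")

# Tag name plus attribute dictionary
tags_find_all = [tag.get_text() for tag in soup.find_all("a", {"class": "tag"})]
# CSS selector: <a class="tag"> anywhere inside <div class="tags">
tags_select = [tag.get_text() for tag in soup.select("div.tags a.tag")]

print(tags_find_all)  # ['change', 'world']
print(tags_select)    # ['change', 'world']
```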

### Scrape all quotes using a for loop

Now let's go through every quote on this page and print its author, text, and tags.

In [None]:
all_quotes = soup.findAll('div', {'class':'quote'})

for quote in all_quotes:
    # author
    print("Author:", quote.small.text)
    # quote
    print("Quote:", quote.span.text)
    # tags
    tags = quote.findAll("a", {"class": "tag"})
    print("Tags:", ", ".join([tag.text for tag in tags]))
    print()


### Quick Review

Great! Now let's review the whole process. Write a function that takes a link, collects every person and quote on that page, and returns the information as a __list__.

In [None]:
def list_quotes(url):
    
    response = requests.get(url)
    htmltext = response.text
    soup = bs(htmltext,'html.parser')
    
    refined_list = []
    all_quotes = soup.findAll('div', {'class':'quote'})

    for quote in all_quotes:
        quote_author = quote.small.text
        quote_text = quote.span.text
        quote_tags = quote.findAll("a", {"class": "tag"})
        tags_str = ", ".join([tag.text for tag in quote_tags])
        refined_list.append([quote_author, quote_text, tags_str])
    
    return refined_list

In [None]:
url = "http://quotes.toscrape.com/"

result = list_quotes(url)

print(result)

What we __really__ want is a list of __every person on this website__. To do this, we need to use `requests` to fetch every page.

It's helpful to do some investigating first. Notice that [quotes.toscrape.com/page/1/](http://quotes.toscrape.com/page/1/) is the page we have been working with, [quotes.toscrape.com/page/2/](http://quotes.toscrape.com/page/2/) is the next page, and [quotes.toscrape.com/page/10/](http://quotes.toscrape.com/page/10/) is the last page. So our goal is to scrape these __10__ pages.

We can generate these 10 different URLs. Then, we are basically going to repeat the process that we did to get all the information from the first page for all 10 pages.

In [None]:
url = 'http://quotes.toscrape.com/page/'

all_quote_list = []

for page_num in range(1, 11):
    page_link = url + str(page_num)
    print(page_link)
    all_quote_list.extend(list_quotes(page_link))

print("Number of Quotes:", len(all_quote_list))
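The loop above hard-codes 10 pages. An alternative sketch is to follow the page's "Next" link until there is none. This assumes the link sits inside an `<li class="next">` element, so check the page source to confirm. `next_page_url` is a hypothetical helper, and the HTML strings below are made-up stand-ins so the example runs without a network connection:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs


def next_page_url(htmltext, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = bs(htmltext, "html.parser")
    # Assumption: the pager marks the next-page link as <li class="next"><a href="...">
    link = soup.select_one("li.next > a")
    if link is None or not link.get("href"):
        return None
    # The href is relative (for example "/page/2/"), so resolve it against the current URL
    return urljoin(current_url, link["href"])


# Made-up pager fragments: one with a "Next" link, one without
sample_page = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

print(next_page_url(sample_page, "http://quotes.toscrape.com/page/1/"))
print(next_page_url(last_page, "http://quotes.toscrape.com/page/10/"))
```

In a real loop, you would call `list_quotes` on each URL and stop when `next_page_url` returns `None`, ideally with a short `time.sleep` between requests.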

We did it! Here is the data we scraped. Let's use `pandas` and look at the data structure.

In [None]:
quote_df = pd.DataFrame(all_quote_list)
quote_df = quote_df.rename(columns={0: "Author", 1: "Quote", 2: "Tags"})

Let's display the DataFrame:

In [None]:
quote_df
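If a JSON version of the table is needed, `DataFrame.to_json` is one option. Here is a minimal sketch with two made-up rows standing in for the scraped data (`orient="records"` produces one JSON object per row):

```python
import json
import pandas as pd

# Made-up rows in the same [author, quote, tags] shape that list_quotes returns
sample_rows = [
    ["Author A", "First sample quote.", "life, humor"],
    ["Author B", "Second sample quote.", "books"],
]
sample_df = pd.DataFrame(sample_rows, columns=["Author", "Quote", "Tags"])

# orient="records" gives a list of {column: value} objects, one per row;
# force_ascii=False keeps characters such as curly quotes unescaped
json_text = sample_df.to_json(orient="records", force_ascii=False)

# Parse the string back to confirm its structure
records = json.loads(json_text)
print(records[0]["Author"])  # Author A
```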

You can save the `pandas` DataFrame as a CSV file, as below, or as an Excel file with `to_excel`.

In [None]:
quote_df.to_csv('all_quotes.csv')

## Practice with a Real-Life Example. Billboard Hot 100

Suppose we are interested in the Billboard Hot 100 and want to scrape song titles and performers from this link: https://www.billboard.com/charts/hot-100/ <br>
__NOTE__: The structure of this site is much more complicated!

In [None]:
url = "https://www.billboard.com/charts/hot-100/"

response = requests.get(url)
htmltext = response.text
soup = bs(htmltext,'html.parser')


In [None]:
song_titles = soup.select("li.o-chart-results-list__item > h3#title-of-a-story")
performers = soup.select("li.o-chart-results-list__item > span.a-no-trucate")

print(len(song_titles))
print(len(performers))
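The selector strings above combine three pieces of CSS syntax: `li.some-class` matches `<li>` elements with that class, `h3#some-id` matches an `<h3>` with that `id`, and `>` restricts the match to direct children. Here is a sketch on a made-up snippet (the class and id names are hypothetical, not Billboard's):

```python
from bs4 import BeautifulSoup as bs

# Made-up list; the id is repeated on purpose to mimic a page that reuses one id per entry
html = """
<ul>
    <li class="item"><h3 id="title">Direct child</h3></li>
    <li class="item"><div><h3 id="title">Nested deeper</h3></div></li>
    <li class="other"><h3 id="title">Different class</h3></li>
</ul>
"""
soup = bs(html, "html.parser")

# ">" keeps only <h3> elements that are direct children of <li class="item">
direct = [h3.get_text() for h3 in soup.select("li.item > h3#title")]
# A space (descendant combinator) also matches <h3> elements nested more deeply
nested = [h3.get_text() for h3 in soup.select("li.item h3#title")]

print(direct)  # ['Direct child']
print(nested)  # ['Direct child', 'Nested deeper']
```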

In [None]:
title_refined = [title.text.strip() for title in song_titles]
performer_refined = [performer.text.strip() for performer in performers]

In [None]:
chart_df = pd.DataFrame({"Song": title_refined, "Performer": performer_refined})
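# Ranks start at 1; using len() avoids an error if the page does not return exactly 100 rows
chart_df.index = range(1, len(chart_df) + 1)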
chart_df.head()