$\textbf{Welcome back! This is the activity 1 notebook.}$

In this notebook, we will explore how to implement some of the web scraping and text mining techniques that were covered in lecture.

Packages

In [None]:
#Please uncomment the lines below to install any packages you don't already have.
#!pip install --user nltk
#!pip install -U scikit-learn
#!pip install pandas
#!pip install -U matplotlib
#!pip install beautifulsoup4
#!pip install requests
#!pip install lxml

In [None]:
import nltk
import pandas as pd
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt
import requests
import re
from bs4 import BeautifulSoup

Web Scraping

As you learned in lecture, web scraping can be a valuable tool to build your own datasets.

It is especially useful because it automates manual data entry from websites.

Let's walk through an example.

We are going to scrape data from this bookstore's website: http://books.toscrape.com/

In [None]:
url = "http://books.toscrape.com/"
r = requests.get(url)

In [None]:
content = BeautifulSoup(r.text, "html.parser")

In [None]:
print(content.prettify())

Let's try to find the URLs of the books listed on the front page.

In [None]:
content.find("article", class_ = "product_pod")

In [None]:
content.find("article", class_="product_pod").div

In [None]:
content.find("article", class_="product_pod").div.img

In [None]:
content.find("article", class_ = "product_pod").div.a

In [None]:
content.find("article", class_ = "product_pod").div.a.get('href')

In [None]:
#Now that we know what to look for, we can use the findAll() method to grab every match at once.

In [None]:
lst = [y.div.a.get('href') for y in content.findAll("article", class_ = "product_pod")]

In [None]:
lst

Awesome! We grabbed all of the book URLs on the front page. Now let's dig for more.
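
To try the same navigation pattern without touching the network, here is a minimal sketch on a hand-written HTML snippet. The snippet and the name `sample_html` are made up for illustration and only imitate the `product_pod` layout; the real page contains more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating two product cards (not copied from the site).
sample_html = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/book-one_1/index.html"><img src="one.jpg" alt="Book One"></a>
  </div>
</article>
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/book-two_2/index.html"><img src="two.jpg" alt="Book Two"></a>
  </div>
</article>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first match; find_all() returns every match.
first_href = sample.find("article", class_="product_pod").div.a.get("href")
all_hrefs = [card.div.a.get("href")
             for card in sample.find_all("article", class_="product_pod")]

print(first_href)
print(all_hrefs)
```

Here `find_all` is the newer spelling of `findAll`; the two behave the same way.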

Next, we want the book category URLs, so that we can see how each book is classified.

Notice the consistent structure of the links you just saw. That consistency is what makes a site easy to scrape.

In [None]:
category_urls = [x.get('href') for x in content.findAll('a', href = re.compile('catalogue/category/books'))]

In [None]:
#Skip the first match, which is the top-level "Books" link rather than a specific category.
category_urls[1:]
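
Passing a compiled regular expression as `href` keeps every tag whose attribute value contains a match anywhere in the string (Beautiful Soup documents this as using the pattern's `search()` method). The sketch below applies the same filter to a plain list; the entries in `sample_hrefs` are invented for illustration.

```python
import re

pattern = re.compile("catalogue/category/books")

# Hypothetical hrefs, imitating the kinds of links a page might contain.
sample_hrefs = [
    "index.html",
    "catalogue/category/books_1/index.html",
    "catalogue/category/books/travel_2/index.html",
    "catalogue/page-2.html",
]

# search() matches anywhere in the string, so no leading wildcard is needed.
matches = [h for h in sample_hrefs if pattern.search(h)]
print(matches)
```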

Great! Now we've found extra information that will be useful.

Now let's put it together to get all the book information we can off this website.

In [None]:
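def parse(url):
    #a timeout keeps a stalled request from hanging the notebook
    result = requests.get(url, timeout=10)
    #raise an error for 4xx/5xx responses instead of parsing an error page
    result.raise_for_status()
    contents = BeautifulSoup(result.text, 'html.parser')
    return contents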

In [None]:
#We wrap this in a function because we will be fetching and parsing pages many times.

In [None]:
site_urls = [url]

contents = parse(site_urls[0])

# two matches mean that the page contains both a 'previous' and a 'next' button
# a single match means that we are on either the first page or the last page
# the len(site_urls) == 1 check lets the loop start from the first page; we stop at the last page

while len(contents.findAll("a", href=re.compile("page"))) == 2 or len(site_urls) == 1:
    
    # build the next page's full URL: drop the last path segment of the current URL, then append the relative href of the last matching link (the 'next' button)
    new_url = "/".join(site_urls[-1].split("/")[:-1]) + "/" + contents.findAll("a", href=re.compile("page"))[-1].get("href")
    
    # add the URL to the list
    site_urls.append(new_url)
    
    # parse the next page
    contents = parse(new_url)

In [None]:
print(site_urls)

Is this solution stable? What if the catalog changed?
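
One fragile spot is the string splitting used to build each URL. A possible alternative, offered as a sketch and not as the lecture's method, is `urllib.parse.urljoin` from the standard library, which resolves a relative href against the page it was found on. The helper name `next_page_url` is hypothetical, and the hrefs below are illustrative.

```python
from urllib.parse import urljoin

def next_page_url(current_url, relative_href):
    """Resolve a relative link against the URL of the page that contains it."""
    return urljoin(current_url, relative_href)

# A link found on the front page is resolved against the site root.
from_front = next_page_url("http://books.toscrape.com/", "catalogue/page-2.html")

# A link found on a catalogue page is resolved against that folder.
from_catalogue = next_page_url(
    "http://books.toscrape.com/catalogue/page-2.html", "page-3.html"
)

print(from_front)
print(from_catalogue)
```

This removes the need to reason about which path segment to drop, although it does not protect against the page layout itself changing.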

In [None]:
###Let's get the book URLs that we saw above, this time for every page.
def get_books(url):
    contents = parse(url)
    #same logic as we saw above, except that we now build the full URL.
    return ["/".join(url.split("/")[:-1]) + "/" + x.div.a.get('href') for x in contents.findAll("article", class_ = "product_pod")]

In [None]:
books = []
for page in site_urls:
    books.append(get_books(page))
#get_books returns a list, so books is now a list of lists; flatten it into a single list
books = [item for sublist in books for item in sublist]

In [None]:
print(books)

Great! Now we have the URL of every book. We can proceed to the final step: visiting each URL and collecting the data associated with that book.

In [None]:
##First, I will collect the star ratings.
ratings = []
for url in books:
    contents = parse(url)
    #the tag's class list looks like ['star-rating', 'Three'], so index 1 holds the rating word
    ratings.append(contents.find('p', class_ = re.compile("star-rating")).get("class")[1])
print(ratings)
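
The ratings come back as words taken from the tag's class list. If numbers are more convenient later on, a small lookup table can convert them. This is only a sketch: `RATING_WORDS` and `rating_to_int` are hypothetical names, and it assumes the words run from "One" to "Five".

```python
# Assumed mapping from rating words to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(word):
    """Return the numeric rating, or None if the word is not recognized."""
    return RATING_WORDS.get(word)

sample_ratings = ["Three", "One", "Five"]
numeric = [rating_to_int(w) for w in sample_ratings]
print(numeric)
```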

In [None]:
##Now it's your turn! Figure out a way to collect price data and category data for each book. 
prices = []
categories = []
for url in books:
    contents = parse(url)
    #WRITE YOUR CODE HERE
print(prices)

In [None]:
print(categories)
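
Once you have a price as text, you will probably want it as a number. A minimal sketch, assuming the text holds a currency symbol followed by a decimal number (the sample strings are invented, and `parse_price` is a hypothetical helper):

```python
import re

def parse_price(text):
    """Extract the first decimal number from a price string, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

price = parse_price("£51.77")
missing = parse_price("no price listed")
print(price)
print(missing)
```

Searching for the digits, instead of slicing off the first character, also copes with any stray characters that appear before the number.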

In [None]:
##Figure out a way to collect the name and number of books in stock for each book.
names = []
num_in_stock = []
for url in books:
    contents = parse(url)
    #WRITE YOUR CODE HERE
print(names)
print(num_in_stock)
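
The availability text may mix words with a count. A hedged sketch for pulling out the integer, assuming a string shaped like "In stock (22 available)" (that sample and the helper name `parse_stock` are assumptions for illustration):

```python
import re

def parse_stock(text):
    """Return the first integer in the availability text, or 0 if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0

in_stock = parse_stock("In stock (22 available)")
out_of_stock = parse_stock("Out of stock")
print(in_stock)
print(out_of_stock)
```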

In [None]:
combined_df = pd.DataFrame({'name': names, 'amount in stock': num_in_stock, 
                            'prices (in British Pounds)': prices, 'star rating': ratings, 
                            'categories': categories})

In [None]:
combined_df

Text Mining

In [None]:
##TO DO: Figure out how to collect the product description of each book. We will use these descriptions for text mining.
##HINT: Figure out how to handle exceptions or possible missing data.
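
As background for the hint: `find()` returns `None` when nothing matches, so chaining attribute access on a missing element raises an `AttributeError`. One way to handle this is to check for `None` explicitly. The sketch below assumes the description is a `<p>` that follows an element with `id="product_description"`; that layout, the sample HTML, and the name `get_description` are assumptions for illustration, so inspect a real book page before relying on them.

```python
from bs4 import BeautifulSoup

def get_description(soup):
    """Return the description text, or None when the page has none."""
    header = soup.find(id="product_description")
    if header is None:
        return None
    paragraph = header.find_next_sibling("p")
    return paragraph.get_text(strip=True) if paragraph else None

# Two made-up pages: one with a description, one without.
with_desc = BeautifulSoup(
    '<div id="product_description"><h2>Product Description</h2></div>'
    "<p>A made-up description.</p>",
    "html.parser",
)
without_desc = BeautifulSoup("<h1>No description here</h1>", "html.parser")

found = get_description(with_desc)
not_found = get_description(without_desc)
print(found)
print(not_found)
```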

In [None]:
#Now that you've seen me attempt to predict star ratings from the product descriptions with very low accuracy,
#try to predict the genre of a book from its product description using a multinomial Naive Bayes model.
#(This should be more accurate.)
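
As a starting point, here is a minimal sketch of a multinomial Naive Bayes text classifier in scikit-learn. The training sentences and labels are toy data invented for illustration; in the activity you would pass in the scraped descriptions and categories and hold out a test set to measure accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration; replace with the scraped descriptions.
train_texts = [
    "a wizard casts magic with a sword",
    "dragons and magic in a distant kingdom",
    "a detective solves a murder",
    "the crime scene held one clue for the detective",
]
train_labels = ["Fantasy", "Fantasy", "Mystery", "Mystery"]

# CountVectorizer turns each text into word counts, which the
# multinomial model uses as its features.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

prediction = model.predict(["the wizard found a magic sword"])
print(prediction[0])
```

Wrapping both steps in a pipeline means the vectorizer's vocabulary is learned only from the training texts and then reused on new text.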