Lab 1: Simple Web Scraping with Requests

Objective: Learn how to retrieve and parse data from a website in Python.

Successful Outcome: Access a piece of data, save it to a variable, and print it out.

Remember to shut down your server when you are done by clicking Control Panel -> Shut Down Server 

# Step 1: Preliminaries

This is where we import the needed modules and "get" the web page we want to get data from.

In [None]:
## Here we are importing the requests and BeautifulSoup modules
import requests
from bs4 import BeautifulSoup

## Requests is the module that actually goes out and accesses the website.
## BeautifulSoup helps with formatting and parsing the html.
page = requests.get("https://en.wikipedia.org/wiki/Nineteen_Eighty-Four")


## Here we access Wikipedia using the requests module. Running this block
## gives us a snapshot of the page at the time of access.

## When you run this block there should be no output. Go ahead, give it a try.

# Step 2: Reading the Content

In [None]:
## Now that we have grabbed a snapshot of the page, let's see what happens when we print it out!
print(page)


In [None]:
## As you can see, printing the page just shows the response object and its status code
## (for example, <Response [200]>), when what we want is the page content.
## To get it, access the .content attribute of the page object
## we created earlier and print it out.
print(page.content)

## This command will give you all of the page's content.
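Before using the content, it can help to confirm that the request actually succeeded. The sketch below is one possible way to do that. The helper name `fetch_html`, its `http` parameter, and the 10-second timeout are assumptions made for illustration and are not part of the lab. It relies on `raise_for_status()`, which raises an exception for 4xx/5xx responses, and on `.text`, which returns the body as a decoded string where `.content` returns raw bytes.

```python
import requests


def fetch_html(url, http=requests, timeout=10):
    """Return the decoded HTML of `url`, raising if the request failed.

    `fetch_html` is a hypothetical helper. `http` is anything with a
    requests-style get() method (the requests module or a Session).
    """
    response = http.get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx
    return response.text          # str; response.content would be bytes
```

With network access, `fetch_html("https://en.wikipedia.org/wiki/Nineteen_Eighty-Four")` would return the same page as a string, ready to pass to BeautifulSoup.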

In [None]:
## As you can see, we have the page content now, but it's not that easy to read. That's why Beautiful Soup is very useful.
## Now we are going to parse the content as HTML using Beautiful Soup's html.parser.
soup = BeautifulSoup(page.content, 'html.parser')

## Let's see what the parser got us!
print(soup)

# Step 3: Parsing the HTML

If you are not familiar with HTML, here is a link to an article that explains it much better than I could.

https://www.w3schools.com/html/html_intro.asp

In [None]:
## As you can see, the HTML is now much easier to read, and with some simple commands we can search for objects within it.
## The find_all command finds all instances of the tag name or class you give it.
## A tag name (optional) goes in the first argument; the class goes in the class_ keyword argument.
## For this notebook we are going to find the heading of the wiki article.
## We know the class we are looking for is 'firstHeading' (the capital H matters); you can see this in the output above.
## Google Chrome's Inspect Element tool is also very useful for finding the object you're looking for.
body = soup.find_all(class_='firstHeading')

## Let's see what the soup found.
print(body)
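To make the roles of the two arguments concrete, here is a small sketch that runs on a made-up HTML string instead of a downloaded page. The names `sample_html`, `sample_soup`, `paragraphs`, `headings`, and `intro` are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
sample_html = """
<html><body>
  <h1 id="firstHeading" class="firstHeading">Nineteen Eighty-Four</h1>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# First positional argument: the tag name to match.
paragraphs = sample_soup.find_all("p")

# class_ keyword: match on the CSS class ("class" is a reserved word in Python).
headings = sample_soup.find_all(class_="firstHeading")

# Both filters can be combined. find() returns only the first match, or None.
intro = sample_soup.find("p", class_="intro")

print(len(paragraphs))
print(headings)
print(intro.get_text())
```

Note that `find_all` always returns a list-like result, even when there is only one match, which is why the heading prints inside square brackets.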

In [None]:
## It's as easy as that. All we have is the HTML line, so some trimming is necessary. If you want just the title,
## you can cut away the rest of the line using Python's string slicing (string[startIndex:endIndex]).
## First we have to cast it to a string, though.
title = str(body)
title = title[57:77]
print(title)

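## 57 and 77 are the start and end positions of the title within that string.
## They depend on the exact markup, so adjust them if the printed text looks cut off.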
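Because slice positions shift whenever the markup changes, an alternative is to let Beautiful Soup strip the tags. The sketch below uses `get_text()` on a made-up heading string; `heading_html`, `matches`, and `title_text` are illustrative names, and the real page's markup may differ from this sample.

```python
from bs4 import BeautifulSoup

# Made-up markup resembling a heading element; the live page may differ.
heading_html = (
    '<h1 class="firstHeading" id="firstHeading">'
    "<i>Nineteen Eighty-Four</i></h1>"
)
matches = BeautifulSoup(heading_html, "html.parser").find_all(class_="firstHeading")

# get_text() returns only the text inside the tag, dropping the markup,
# so no start or end positions are needed.
title_text = matches[0].get_text(strip=True)
print(title_text)
```

The trade-off is that `get_text()` returns all text inside the element, so it suits cases where the whole element is the value you want.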

And there it is! With this simple script, a for loop, and a file of valid links, you could grab as many
titles from Wikipedia as you want. By changing the arguments passed to find_all, other pieces of data can be grabbed as well.
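The loop described above could be sketched roughly as follows. The function names `download` and `scrape_titles`, the `fetch` parameter, and the 10-second timeout are assumptions made for illustration. Passing the download step in as a parameter lets the loop be tried on saved HTML without network access.

```python
import requests
from bs4 import BeautifulSoup


def download(url):
    """Fetch a page and return its raw content (needs network access)."""
    return requests.get(url, timeout=10).content


def scrape_titles(urls, fetch=download):
    """Return {url: heading text} for each URL, or None if no heading is found."""
    titles = {}
    for url in urls:
        soup = BeautifulSoup(fetch(url), "html.parser")
        heading = soup.find(class_="firstHeading")
        titles[url] = heading.get_text(strip=True) if heading else None
    return titles
```

If the links were stored one per line in a file (say a hypothetical `links.txt`), they could be read with `urls = [line.strip() for line in open("links.txt") if line.strip()]` and passed to `scrape_titles(urls)`.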

# Step 4: More Complex Scraping

Now let's try something a bit more complex. Here is a website that just has example data tables. Let's learn how to grab all the information from a table.

In [None]:
## These first lines are just like what we saw earlier: getting the web page and creating a soup object
## from the page content for easy-to-read formatting and parsing.
page = requests.get("https://datatables.net/examples/basic_init/zero_configuration.html")
soup = BeautifulSoup(page.content, 'html.parser')


## The complex part is working with tables, which are a very common and relevant kind of data
## on a web page. The main thing you have to know is that a table is made of 'tr' and 'td' objects.
## A 'tr' object is a row (table row) and each 'td' is one cell of data in that row (table data).


## So here we run a nested list comprehension that visits every 'tr' object and collects the text of all of its
## 'td' objects. Rows that contain only 'th' header cells come back as empty lists.
data = [[cell.get_text(strip=True) for cell in row.find_all('td')] for row in soup.find_all("tr")]

## Now let's print out all of the row data for the table on the page!
for entry in data:
    print(str(entry))
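The row and cell structure can be seen more clearly on a tiny made-up table. The sketch below also reads the `th` header cells and pairs them with each row's `td` cells. The table contents and the names `table_html`, `table_soup`, `headers`, and `records` are invented for illustration.

```python
from bs4 import BeautifulSoup

# A made-up table: one header row of <th> cells, two data rows of <td> cells.
table_html = """
<table>
  <thead><tr><th>Name</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td>Ada</td><td>36</td></tr>
    <tr><td>Grace</td><td>45</td></tr>
  </tbody>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")

# Header cells use <th>, so they are read separately from the <td> data cells.
headers = [th.get_text(strip=True) for th in table_soup.find_all("th")]

records = []
for row in table_soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # the header row has no <td> cells, so skip it
        records.append(dict(zip(headers, cells)))

print(records)
```

Storing each row as a dictionary keyed by column name means later code can ask for `record["Age"]` instead of remembering column positions.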



As you can see, we got every single table entry in the example table. With some string manipulation we can extract each entry very easily to get the name, age, and other variables. Now let's trim and save that data.

In [None]:
## Here we are creating a text file called "scraper.txt" in your outputs/Lab1 folder.
import os
path = os.path.join(os.path.expanduser('~'), "outputs/Lab1", "scraper.txt")

## The with statement closes the file for us, even if an error occurs.
with open(path, "w") as txtfile:
    ## Here we are writing the header to the file, ending the line with "\n".
    txtfile.write("Name, Position, Office, Age, Start_date, Salary\n")

    ## Here we take each entry, join its cells with commas, and write it to the file.
    for entry in data:
        if not entry:  ## skip rows with no 'td' cells, such as the header row
            continue
        text = ", ".join(entry)
        print(text)
        txtfile.write(text + "\n")

## Go check the file out in the outputs/ folder, it's named scraper.txt!
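One caveat with comma-joined text is that a value such as a salary can itself contain a comma. A possible alternative, sketched below, is Python's built-in `csv` module, which quotes such values automatically. The sample row is made up, and `io.StringIO` stands in for a real file so the sketch runs anywhere.

```python
import csv
import io

# Made-up sample row; note the comma inside the salary value.
header = ["Name", "Position", "Office", "Age", "Start date", "Salary"]
rows = [
    ["Example Person", "Accountant", "Tokyo", "33", "2008-11-28", "$162,700"],
]

buffer = io.StringIO()        # stands in for an open file object
writer = csv.writer(buffer)
writer.writerow(header)       # one call per row
writer.writerows(rows)        # or many rows at once

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer could be replaced with `open(path, "w", newline="")`, where `path` is wherever the output file should go.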
    

Congratulations!
You have learned how to successfully scrape simple and complex structures from webpages!
If you want to train your scraping muscles a little bit more, try this exercise.

https://www.immobiliare.it/Roma/agenzie_immobiliari_provincia-Roma.html
Go to this website and grab/scrape 5 different fields and save them to a .txt file. This will help you identify how different objects on a webpage require different steps to grab them in their entirety.