# JOUR7280/COMM7780 Big Data Analytics for Media and Communication
# Tutorial: Python Web Scraping Using BeautifulSoup
In this tutorial, you will learn how to perform web scraping using Python 3 and the `BeautifulSoup` library. We’ll scrape weather forecasts from the [National Weather Service](https://www.weather.gov/) and then store the data using the pandas library.

In [None]:
from bs4 import BeautifulSoup
import requests 
import pandas as pd

The `requests` library makes a `GET` request to a web server, which downloads the HTML contents of a given web page for us.  
After running our request, we get a `Response` object. This object has a `status_code` attribute; a value of 200 indicates that the page was downloaded successfully.

In [None]:
url = 'https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168'
page = requests.get(url)
page
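
A minimal sketch of how the status check might be used in practice. The `Response` objects below are built by hand purely so the example runs without a network connection (an assumption for illustration); in the tutorial they come from `requests.get`.

```python
import requests

# Hand-built stand-ins for real responses (illustration only;
# normally a Response is returned by requests.get(url)).
ok_page = requests.Response()
ok_page.status_code = 200

missing_page = requests.Response()
missing_page.status_code = 404

for resp in (ok_page, missing_page):
    if resp.ok:  # True when status_code is below 400
        print(resp.status_code, "downloaded successfully")
    else:
        print(resp.status_code, "request failed")

# raise_for_status() turns a 4xx/5xx status into an exception
try:
    missing_page.raise_for_status()
    failed = False
except requests.exceptions.HTTPError:
    failed = True
print("404 raised an HTTPError:", failed)
```

Checking the status before parsing avoids feeding an error page into `BeautifulSoup` by accident.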

We can print out the HTML content of the page using the `content` attribute:

In [None]:
page.content

### Parsing a page with BeautifulSoup


In [None]:
soup = BeautifulSoup(page.content, 'html.parser')

## Exploring page structure with Chrome DevTools
We’ll extract weather information about downtown San Francisco from [this page](http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168).  
The first thing we’ll need to do is inspect the page using Chrome developer tools. Start the developer tools in Chrome by clicking `More Tools` -> `Developer Tools`  
<img src="../figs/dev.png" alt="drawing" width="550"/>
### Finding all instances of a tag at once
If we want to extract every instance of a tag, we use the `find_all` method, which returns a list of all the instances of that tag on a page.

In [None]:
soup.find_all('title')

In [None]:
soup.find_all('h2')

If you instead want only the **first** instance of a tag, use the `find` method.

In [None]:
soup.find('h2')
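
The difference between the two methods can be sketched on a tiny, made-up HTML string (the markup and the variable names here are assumptions for illustration, not part of the weather page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, small enough to read at a glance
html = """
<html><head><title>Demo</title></head>
<body><h2>First</h2><h2>Second</h2></body></html>
"""
demo = BeautifulSoup(html, "html.parser")

all_h2 = demo.find_all("h2")         # list-like collection of every <h2>
first_h2 = demo.find("h2")           # only the first <h2>
no_table = demo.find("table")        # None, because there is no <table>
no_tables = demo.find_all("table")   # empty collection, never None

print(len(all_h2), first_h2.get_text(), no_table, len(no_tables))
```

Because `find` returns `None` when nothing matches, it is worth checking its result before calling methods on it.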

### Structure of the Extended Forecast
By right-clicking on the page near where it says “Extended Forecast” and then clicking “Inspect”, we open the tag that contains the text “Extended Forecast” in the Elements panel.  
The enclosing `div`, which has the id `seven-day-forecast`, contains the extended forecast items.
If you explore this `div`, you’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a `div` with the class `tombstone-container`.
<img src="../figs/inspect.png" alt="drawing" width="550"/>

In the code below, we:
- Download the web page containing the forecast.
- Create a `BeautifulSoup` object to parse the page.
- Find the div with id `seven-day-forecast` and assign it to the variable `seven_day`.
- Inside `seven_day`, find each individual forecast item.
- Extract and print the first forecast item.

In [None]:
# Download and parse the web page
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')

seven_day = soup.find(id="seven-day-forecast") # Find seven-day-forecast
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0] # extract 1st forecast
print(tonight.prettify()) # Pretty-print this PageElement as a string.

In [None]:
# print every forecast item
for item in forecast_items:
    print(item.prettify(), '\n')
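
The same id-then-class lookup can be tried offline on simplified markup. This is a sketch under the assumption that the page keeps the `seven-day-forecast` / `tombstone-container` structure described above; the HTML string itself is hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup that mimics the forecast structure
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
  <div class="tombstone-container"><p class="period-name">Thursday</p></div>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

container = demo.find(id="seven-day-forecast")
if container is None:
    # find returns None when nothing matches, e.g. if the site layout changes
    raise ValueError("forecast container not found")

# class_ (with the underscore) is used because class is a Python keyword
items = container.find_all(class_="tombstone-container")
names = [item.find(class_="period-name").get_text() for item in items]
print(names)
```

The explicit `None` check gives a clear error message instead of an `AttributeError` later on.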

## Extracting information from the page
As you can see, the forecast item stored in the variable `tonight` contains all the information we want. There are 4 pieces of information we can extract:
- The name of the forecast item
- The description of the conditions, which is stored in the `title` attribute of the `img` tag
- A short description of the conditions
- The low/high temperature

In [None]:
# extract the name of the forecast item, the short description, and the temperature
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
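
When a tag contains line breaks, plain `get_text()` joins the pieces with nothing between them. The sketch below uses a hypothetical fragment to show the `separator` and `strip` arguments of `get_text`, which may give cleaner output:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a short description split over two lines by <br>
html = '<p class="short-desc">Mostly Sunny<br/>then Rain</p>'
tag = BeautifulSoup(html, "html.parser").find(class_="short-desc")

plain = tag.get_text()                            # pieces joined with no separator
spaced = tag.get_text(separator=" ", strip=True)  # pieces joined with a space
print(repr(plain))
print(repr(spaced))
```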

We can extract the `title` attribute from the *img* tag. To do this, we treat the tag (stored in the variable `img` below) like a <span style="color:orange">dictionary</span> and use the name of the attribute we want as the key.

In [None]:
# extract the full description of the conditions from the img tag's title attribute
img = tonight.find("img")
desc = img['title']
print(desc)

In [None]:
img
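
A short sketch of dictionary-style attribute access on a made-up `img` tag (the attribute values are hypothetical). It contrasts square-bracket access with `get`, which is safer when an attribute may be absent:

```python
from bs4 import BeautifulSoup

# Hypothetical tag for illustration
html = '<img src="newimages/medium/nfew.png" title="Tonight: Mostly clear" alt="icon">'
icon = BeautifulSoup(html, "html.parser").find("img")

title = icon["title"]            # raises KeyError if the attribute is missing
width = icon.get("width")        # returns None instead of raising
has_alt = icon.has_attr("alt")   # True if the attribute is present
print(icon.attrs)                # all attributes as a plain dictionary
```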

## Extracting all the information from the page


Here, we search for items with the class `period-name` and store them to a list.

The technique used here to build the list is called a **list comprehension**. Rather than creating an empty list and appending each element to the end, you define the list and its contents at the same time. See more [here](https://realpython.com/list-comprehension-python/#using-list-comprehensions).

In [None]:
periods = [day.text for day in seven_day.find_all(class_='period-name')]
periods
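
To make the comparison concrete, here is the same list built both ways. The sample strings are made up for illustration:

```python
raw_names = [" Tonight", "Thursday ", "Thursday Night"]

# Loop version: start with an empty list and append one element at a time
cleaned_loop = []
for name in raw_names:
    cleaned_loop.append(name.strip())

# List comprehension: define the list and its contents in one expression
cleaned_comp = [name.strip() for name in raw_names]

# An optional condition filters elements while the list is built
nights = [name for name in cleaned_comp if name.endswith("Night")]

print(cleaned_loop == cleaned_comp, nights)
```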

Apply the same technique to get the other `3` fields.

In [None]:
short_descs = [sd.text for sd in seven_day.find_all(class_='short-desc')]
temps = [temp.text for temp in seven_day.find_all(class_='temp')]
descs = [img['title'] for img in seven_day.find_all('img')]

print(short_descs)
print(temps)
print(descs)

## Combining our data into a pandas DataFrame
We pass the `4` lists into `pd.DataFrame` as part of a dictionary. Each dictionary key becomes a column in the DataFrame, and each list becomes the values in that column.

In [None]:
# create a dataframe
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
weather
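
Building a DataFrame from a dictionary of lists only works when every list has the same length. The sketch below, with made-up values, checks the lengths first and shows that pandas itself rejects mismatched columns:

```python
import pandas as pd

# Hypothetical sample values for illustration
columns = {
    "period": ["Tonight", "Thursday"],
    "temp": ["Low: 51 °F", "High: 66 °F"],
}

# Every column list must have the same number of entries
lengths = {name: len(values) for name, values in columns.items()}
if len(set(lengths.values())) > 1:
    raise ValueError(f"column lengths differ: {lengths}")

demo_df = pd.DataFrame(columns)
print(demo_df.shape)

# Lists of different lengths raise a ValueError
try:
    pd.DataFrame({"period": ["Tonight", "Thursday"], "temp": ["Low: 51 °F"]})
    rejected = False
except ValueError:
    rejected = True
print("mismatched lengths rejected:", rejected)
```

A length check like this is useful when scraping, because one missing tag on the page can leave one list shorter than the others.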

Save the dataframe to a `csv` file.

In [None]:
weather.to_csv('../data/weather.csv')
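
By default, `to_csv` also writes the row index as an unnamed first column. If that column is not wanted, `index=False` leaves it out. A small sketch with made-up data, written to an in-memory buffer instead of a file so that nothing is saved to disk:

```python
import io
import pandas as pd

# Hypothetical one-row DataFrame for illustration
demo_df = pd.DataFrame({"period": ["Tonight"], "temp": ["Low: 51 °F"]})

with_index = io.StringIO()
demo_df.to_csv(with_index)                  # default: index written as first column

without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)  # index column omitted

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```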

## Scrape images 

- **Please proceed with caution and always check the site's terms of use!**

As we can see, the page contains several images. Images are displayed with the `<img>` tag in HTML.

Next, let's scrape all the weather images and save them to the `../figs` folder.

Earlier, we found all the `img` tags in the 7-day forecast. The `src` attribute of each tag contains the link to the image.

In [None]:
for img in seven_day.find_all('img'):
    # img_url = img.get('src')  # alternative to img['src']
    img_url = img['src']
    print(img_url)

The links are relative and therefore incomplete: they lack the domain `https://forecast.weather.gov/`, so we prepend it.

In [None]:
img_urls = []
domain = 'https://forecast.weather.gov/'
for img in seven_day.find_all('img'):
    img_url = domain + img.get('src')
    img_urls.append(img_url)
    print(img_url)
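
String concatenation works when every `src` is a relative path without a leading slash. As a possible alternative, `urljoin` from the standard library also handles leading slashes and links that are already absolute. The `src` values below are hypothetical examples:

```python
from urllib.parse import urljoin

base = "https://forecast.weather.gov/"

# Hypothetical src values covering three common cases
srcs = [
    "newimages/medium/nfew.png",          # relative path
    "/newimages/medium/nfew.png",         # path starting at the site root
    "https://example.org/icons/sun.png",  # already a complete URL
]

resolved = [urljoin(base, src) for src in srcs]
for url in resolved:
    print(url)
```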

To download and save files with Python, we use `shutil`, a standard library module for file operations.
It is good practice to add a **time delay** between requests via the `time.sleep` function, so that we do not overload the server.

In [None]:
import shutil

import time 
from random import randint


for img_url in img_urls: 
    
    # time.sleep(5) # a fixed rate (delay 5 sec)
    time.sleep(randint(2,6))  # a random int delay between 2 to 6 sec 
    
    # we set stream = True to download/stream the content of the data
    img_source = requests.get(img_url, stream=True) 
    
    # open file connection, create file and write to it
    # 'wb': write and binary
    with open('../figs/'+img_url.split('/')[-1], 'wb') as file: 
        # save the raw file object, see the tutorial: https://docs.python.org/3/library/shutil.html
        shutil.copyfileobj(img_source.raw, file) 
        
    del img_source # drop the reference to the response object so its memory can be freed
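
The saving step can be sketched without a network connection. Here an in-memory byte stream stands in for `img_source.raw`, and `filename_from_url` is a hypothetical helper (not part of the tutorial) that takes the file name from the URL path, so that a query string, if present, does not end up in the file name. The URL and the bytes are made up.

```python
import io
import os
import posixpath
import shutil
import tempfile
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


# Hypothetical URL and fake image bytes, for illustration only
url = "https://forecast.weather.gov/newimages/medium/nfew.png?v=2"
payload = b"\x89PNG fake image bytes"
fake_download = io.BytesIO(payload)  # stands in for img_source.raw

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, filename_from_url(url))
    # 'wb': write in binary mode
    with open(target, "wb") as file:
        # copy from the file-like source to the open file in chunks
        shutil.copyfileobj(fake_download, file)
    saved_name = os.path.basename(target)
    saved_size = os.path.getsize(target)

print(saved_name, saved_size)
```

`shutil.copyfileobj` accepts any file-like object as its source, which is why it works both with the raw response stream and with the in-memory buffer used here.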

- The code in this notebook is adapted from various sources. All code is for educational purposes only and is released under CC1.0.