# Let’s build a web scraper and a dataset

Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll need to first download it using the requests.get method.

In [2]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

In [3]:
page.status_code

200

A status_code of 200 means that the page downloaded successfully. 
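Other status codes signal different outcomes. As a rough sketch of how they can be interpreted, the snippet below uses only the standard library's `http.HTTPStatus` enum. The helper name `describe_status` is hypothetical and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable summary of an HTTP status code.

    Hypothetical helper for illustration; raises ValueError for codes
    that HTTPStatus does not know about.
    """
    phrase = HTTPStatus(code).phrase  # e.g. "OK" for 200
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```

With requests itself, calling `page.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient way to stop early when a download fails.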

# We can print out the HTML content of the page using the content property:

In [4]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
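The leading `b` in the output shows that `content` holds raw bytes rather than a string. The sketch below decodes the same bytes by hand, assuming the page is UTF-8 encoded (an assumption; this sample page only contains ASCII characters).

```python
# The bytes shown in the output above, split over several lines for readability.
raw = (
    b"<!DOCTYPE html>\n<html>\n    <head>\n"
    b"        <title>A simple example page</title>\n    </head>\n"
    b"    <body>\n        <p>Here is some simple content for this page.</p>\n"
    b"    </body>\n</html>"
)

# Decode bytes into a regular Python string (assumes UTF-8).
html_text = raw.decode("utf-8")

print(type(raw).__name__)
print(type(html_text).__name__)
print(html_text)
```

The Response object also exposes a `text` property that returns the decoded string directly.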

## We use the BeautifulSoup library to parse this document and extract the text from the p tag. We first have to import the library and create an instance of the BeautifulSoup class to parse our document:

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:

In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


## If we want to extract every instance of a particular tag, we can use the find_all method, which finds all the instances of that tag on a page:

In [7]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract the text:

In [8]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
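Both approaches, looping and indexing, can be sketched on a small inline document so that no download is needed. The HTML string below is a hypothetical sample, not one of the pages used in this notebook. The sketch also shows `find`, which returns only the first match (or `None` when nothing matches).

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with two paragraphs.
html = """
<html>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result, so we can loop over it.
texts = [p.get_text() for p in soup.find_all("p")]

# find returns only the first matching tag.
first_paragraph = soup.find("p")

# When nothing matches, find returns None instead of raising an error.
missing = soup.find("table")

print(texts)
print(first_paragraph.get_text())
print(missing)
```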

## Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to select the specific elements we want. To illustrate this principle, we’ll work with the following page:

In [9]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [10]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [11]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
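Elements can also be looked up by id or with CSS selectors. The sketch below runs against a condensed copy of the page shown above (whitespace removed, which is a simplification), so it works without a network connection.

```python
from bs4 import BeautifulSoup

# Condensed copy of the ids_and_classes page shown above.
html = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by id: ids are meant to be unique, so we expect a single match.
by_id = [p.get_text(strip=True) for p in soup.find_all(id="first")]

# CSS selector: every p tag nested inside a div.
inside_div = [p.get_text(strip=True) for p in soup.select("div p")]

# CSS selector: every p tag carrying the class first-item.
first_items = [p.get_text(strip=True) for p in soup.select("p.first-item")]

print(by_id)
print(inside_div)
print(first_items)
```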

# Let’s build a weather dataset by web scraping

We now know enough to proceed with extracting information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We’ll use this page: https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168

## We’ll extract data about the extended forecast.

The extended forecast items are contained in a div tag with the id seven-day-forecast.

In summary, we will:

1. Download the web page containing the forecast.
2. Create a BeautifulSoup object to parse the page.
3. Find the div with id seven-day-forecast, and assign it to seven_day.
4. Inside seven_day, find each individual forecast item.
5. Extract and print the first forecast item.


In [14]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  NOW until
  <br/>
  8:00pm Thu
 </p>
 <p>
  <img alt="" class="forecast-icon" src="newimages/medium/hz.png" title=""/>
 </p>
 <p class="short-desc">
  Heat Advisory
 </p>
</div>


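We’ll extract the name of the forecast item and the short description first, since they’re retrieved the same way. The temperature would be extracted identically, but that lookup is commented out here because the first item is an advisory with no temp element:

In [16]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
# Skipped: this item has no class="temp" element, so find() would return None
# temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
# print(temp)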

NOW until8:00pm Thu
Heat Advisory
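The period name prints as "NOW until8:00pm Thu" because get_text joins the text on either side of the `<br/>` tag with nothing in between. A minimal sketch of two possible refinements follows: passing a separator to get_text, and guarding against a missing temp element. The markup is a trimmed copy of the forecast item printed earlier.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first forecast item shown above.
html = """
<div class="tombstone-container">
 <p class="period-name">NOW until<br/>8:00pm Thu</p>
 <p class="short-desc">Heat Advisory</p>
</div>
"""
item = BeautifulSoup(html, "html.parser").find(class_="tombstone-container")

period_tag = item.find(class_="period-name")

# Default behaviour: text fragments are concatenated directly.
joined = period_tag.get_text()

# With a separator, fragments around <br/> are joined by a space.
spaced = period_tag.get_text(separator=" ", strip=True)

# find returns None when the element is absent, so check before calling get_text.
temp_tag = item.find(class_="temp")
temp = temp_tag.get_text() if temp_tag is not None else None

print(joined)
print(spaced)
print(temp)
```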


Now, we can extract the title attribute from the img tag. To do this, we treat the Tag object like a dictionary and pass in the attribute we want as a key:

In [17]:
img = tonight.find("img")
desc = img['title']
print(desc)
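The output here is blank because the img tag of this first item has an empty title attribute. The sketch below, using a copy of that img tag, shows a few ways of reading attributes. Square brackets raise a KeyError for a missing attribute, while `get` returns `None`. The attribute name `data-id` is a made-up example of an attribute that is not present.

```python
from bs4 import BeautifulSoup

# Copy of the img tag from the first forecast item shown above.
html = '<p><img alt="" class="forecast-icon" src="newimages/medium/hz.png" title=""/></p>'
img = BeautifulSoup(html, "html.parser").find("img")

title = img["title"]          # dictionary-style access; empty string here
src = img.get("src")          # get works like dict.get
missing = img.get("data-id")  # absent attribute: None instead of KeyError
classes = img["class"]        # class is multi-valued, so this is a list
attr_names = sorted(img.attrs)  # attrs holds every attribute as a dict

print(repr(title))
print(src)
print(missing)
print(classes)
print(attr_names)
```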




In [18]:
# Let's extract the period name of every forecast item
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['NOW until8:00pm Thu',
 'Today',
 'Tonight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight']

## Let’s get the other fields too

In [21]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)

['Heat Advisory', 'Haze', 'Haze', 'Haze', 'Haze thenClear', 'Sunny', 'Mostly Cloudy', 'Partly Sunny', 'Mostly Cloudy']
['High: 92 °F', 'Low: 66 °F', 'High: 82 °F', 'Low: 59 °F', 'High: 75 °F', 'Low: 57 °F', 'High: 70 °F', 'Low: 57 °F']
['', 'Today: Widespread haze. Areas of smoke. Sunny, with a high near 92. Light west wind increasing to 5 to 10 mph in the afternoon. ', 'Tonight: Widespread haze. Areas of smoke. Mostly clear, with a low around 66. West wind 8 to 13 mph. ', 'Friday: Widespread haze. Areas of smoke. Sunny, with a high near 82. West southwest wind 5 to 11 mph. ', 'Friday Night: Widespread haze before 11pm. Areas of smoke before 11pm. Clear, with a low around 59. West wind 9 to 15 mph, with gusts as high as 20 mph. ', 'Saturday: Sunny, with a high near 75. West southwest wind 8 to 16 mph, with gusts as high as 22 mph. ', 'Saturday Night: Mostly cloudy, with a low around 57.', 'Sunday: Partly sunny, with a high near 70.', 'Sunday Night: Mostly cloudy, with a low around 57.']
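Note that temps has 8 entries while the other lists have 9, because the advisory item has no temperature. Selecting each field across the whole page loses track of which item a value came from. One possible alternative, sketched below, is to loop over the forecast items and build one record per item, storing `None` for missing fields. The markup is a hypothetical two-item excerpt with shortened text, and `text_or_none` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened excerpt mimicking the forecast markup above.
html = """
<div id="seven-day-forecast">
 <div class="tombstone-container">
  <p class="period-name">NOW until<br/>8:00pm Thu</p>
  <p><img class="forecast-icon" title=""/></p>
  <p class="short-desc">Heat Advisory</p>
 </div>
 <div class="tombstone-container">
  <p class="period-name">Today</p>
  <p><img class="forecast-icon" title="Today: Widespread haze."/></p>
  <p class="short-desc">Haze</p>
  <p class="temp">High: 92 °F</p>
 </div>
</div>
"""


def text_or_none(parent, css_class):
    """Return the text of the first child with css_class, or None if absent."""
    tag = parent.find(class_=css_class)
    return tag.get_text(separator=" ", strip=True) if tag is not None else None


seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")

records = []
for item in seven_day.find_all(class_="tombstone-container"):
    img = item.find("img")
    records.append({
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),  # None for the advisory item
        "desc": img.get("title") if img is not None else None,
    })

for record in records:
    print(record)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame`, which keeps every column the same length.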

# Save our data into a pandas DataFrame

In [23]:
import pandas as pd
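# temps is left out here: it has 8 entries while the other lists have 9,
# and pandas requires all columns to have the same length.
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "desc": descs
})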
weather

| | period | short_desc | desc |
|---|---|---|---|
| 0 | NOW until8:00pm Thu | Heat Advisory | |
| 1 | Today | Haze | Today: Widespread haze. Areas of smoke. Sunny,... |
| 2 | Tonight | Haze | Tonight: Widespread haze. Areas of smoke. Most... |
| 3 | Friday | Haze | Friday: Widespread haze. Areas of smoke. Sunny... |
| 4 | FridayNight | Haze thenClear | Friday Night: Widespread haze before 11pm. Are... |
| 5 | Saturday | Sunny | Saturday: Sunny, with a high near 75. West sou... |
| 6 | SaturdayNight | Mostly Cloudy | Saturday Night: Mostly cloudy, with a low arou... |
| 7 | Sunday | Partly Sunny | Sunday: Partly sunny, with a high near 70. |
| 8 | SundayNight | Mostly Cloudy | Sunday Night: Mostly cloudy, with a low around... |


## We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:

In [24]:
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums

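KeyError: 'temp'

The KeyError is raised because the weather DataFrame was built without a temp column: temps has only 8 entries, while periods, short_descs, and descs each have 9, since the first forecast item (the advisory) has no temperature. The remaining cells depend on that column, so they are shown without output.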
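Independently of the missing column, the regular expression needs `\d` (a digit) for the extraction to work; without the backslash, `d+` only matches the literal letter d. The sketch below checks both patterns with the standard `re` module against the temps list printed earlier.

```python
import re

# The temperature strings from the output above.
temps = ['High: 92 °F', 'Low: 66 °F', 'High: 82 °F', 'Low: 59 °F',
         'High: 75 °F', 'Low: 57 °F', 'High: 70 °F', 'Low: 57 °F']

# Without the backslash the pattern looks for the letter "d", not digits.
letter_pattern = re.compile("(?P<temp_num>d+)")

# A raw string with \d+ matches one or more digits.
digit_pattern = re.compile(r"(?P<temp_num>\d+)")

no_match = letter_pattern.search(temps[0])
temp_nums = [int(digit_pattern.search(t).group("temp_num")) for t in temps]

print(no_match)
print(temp_nums)
```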

In [None]:
weather["temp_num"].mean()

In [None]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

In [None]:
weather[is_night]
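For these analysis cells to run, the DataFrame needs a temp column of the same length as the others. The following is a minimal sketch under one assumption: that the 8 scraped temperatures belong to the 8 items after the advisory, so the first row gets `None`. It uses `pd.to_numeric` rather than `astype('int')`, since the missing value cannot be stored as an integer.

```python
import pandas as pd

periods = ["NOW until8:00pm Thu", "Today", "Tonight", "Friday", "FridayNight",
           "Saturday", "SaturdayNight", "Sunday", "SundayNight"]

# Assumption: the advisory row has no temperature, so it is padded with None.
temps = [None, "High: 92 °F", "Low: 66 °F", "High: 82 °F", "Low: 59 °F",
         "High: 75 °F", "Low: 57 °F", "High: 70 °F", "Low: 57 °F"]

weather = pd.DataFrame({"period": periods, "temp": temps})

# Pull out the digits; rows without a temperature become NaN.
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = pd.to_numeric(temp_nums)

# Night-time rows are the ones reporting a low; missing values count as False.
weather["is_night"] = weather["temp"].str.contains("Low", na=False)

mean_temp = weather["temp_num"].mean()  # NaN rows are skipped by default
night_periods = weather.loc[weather["is_night"], "period"].tolist()

print(mean_temp)
print(night_periods)
```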