# Web Scraping using Python

## The requests library

In [28]:
import requests

page = requests.get("https://shwetkm.github.io/A simple example page.html")
page

<Response [200]>
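
The URL in the request above contains literal spaces, which are not valid inside a URL path. The request still returned a 200, so the spaces were evidently encoded somewhere along the way. A minimal sketch of doing that encoding explicitly with the Python 3 standard library follows; `base` and `filename` are illustrative names, not part of the original notebook.

```python
from urllib.parse import quote

base = "https://shwetkm.github.io/"
filename = "A simple example page.html"

# quote() percent-encodes characters that are not allowed in a URL path,
# so each space in the file name becomes %20.
url = base + quote(filename)
print(url)
```

The resulting string can be passed to requests.get in place of the URL with raw spaces.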

In [29]:
page.content

'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

## Parsing a page with BeautifulSoup

In [30]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [31]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [32]:
list(soup.children)

[u'html',
 u'\n',
 <html>\n<head>\n<title>A simple example page</title>\n</head>\n<body>\n<p>Here is some simple content for this page.</p>\n</body>\n</html>]

In [33]:
[type(item) for item in soup.children]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

In [34]:
html = list(soup.children)[2]

In [35]:
list(html.children)

[u'\n',
 <head>\n<title>A simple example page</title>\n</head>,
 u'\n',
 <body>\n<p>Here is some simple content for this page.</p>\n</body>,
 u'\n']

In [36]:
body = list(html.children)[3]

In [37]:
list(body.children)

[u'\n', <p>Here is some simple content for this page.</p>, u'\n']

In [38]:
p = list(body.children)[1]

In [39]:
p.get_text()

u'Here is some simple content for this page.'
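
Indexing into list(...children) with positions such as [2], [3] and [1] depends on exactly where the whitespace strings fall between tags. A sketch of one alternative, assuming the same page content as above is pasted in as a string: keep only the children that are Tag objects, so the positions no longer depend on newlines. The helper name `tag_children` is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def tag_children(node):
    """Return only the direct children that are tags (hypothetical helper).

    This drops the Doctype and the newline strings between tags.
    """
    return [child for child in node.children if isinstance(child, Tag)]


html = tag_children(soup)[0]      # the only tag at the top level
head, body = tag_children(html)   # <head> and <body>, in document order
p = tag_children(body)[0]         # the single <p> inside <body>
text = p.get_text()
print(text)
```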

## Finding all instances of a tag at once

In [40]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [41]:
soup.find_all('p')[0].get_text()

u'Here is some simple content for this page.'

## Searching for tags by class and id

In [43]:
page = requests.get("https://shwetkm.github.io/ids-classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>



In [44]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">\n<b>\n                First outer paragraph.\n            </b>\n</p>,
 <p class="outer-text">\n<b>\n                Second outer paragraph.\n            </b>\n</p>]

In [45]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">\n<b>\n                First outer paragraph.\n            </b>\n</p>,
 <p class="outer-text">\n<b>\n                Second outer paragraph.\n            </b>\n</p>]

In [46]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">\n                First paragraph.\n            </p>]
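
The searches above match a class even when the tag carries several classes. A small sketch of that behaviour, assuming the ids-classes page content is pasted in as a string rather than downloaded: the class attribute comes back as a list, class_ matches any one entry in that list, and find returns a single tag or None.

```python
from bs4 import BeautifulSoup

html_doc = """<html>
<head><title>A simple example page</title></head>
<body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match as a single tag instead of a list.
first = soup.find(id="first")
classes = first["class"]          # multi-valued attribute, returned as a list

# class_ matches tags whose class list contains the given name.
first_items = soup.find_all("p", class_="first-item")
ids = [tag.get("id") for tag in first_items]

# When nothing matches, find() returns None and find_all() an empty list.
missing = soup.find(id="does-not-exist")

print(classes, ids, missing)
```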

## Using CSS Selectors
You can also search for items using CSS selectors, the patterns that the CSS language uses to specify which HTML tags a style applies to. Here are some examples:

<strong>p a</strong> – finds all a tags inside of a p tag.

<strong>body p a</strong> – finds all a tags inside of a p tag inside of a body tag.

<strong>html body</strong> – finds all body tags inside of an html tag.

<strong>p.outer-text</strong> – finds all p tags with a class of outer-text.

<strong>p#first</strong> – finds all p tags with an id of first.

<strong>body p.outer-text</strong> – finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

In [47]:
soup.select("div p")

[<p class="inner-text first-item" id="first">\n                First paragraph.\n            </p>,
 <p class="inner-text">\n                Second paragraph.\n            </p>]
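
The selector patterns listed in this section can be tried against the same page. The following is a sketch that assumes the ids-classes page content is pasted in as a string; it uses select for lists of matches and select_one for a single match.

```python
from bs4 import BeautifulSoup

html_doc = """<html>
<head><title>A simple example page</title></head>
<body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# p.outer-text: p tags with a class of outer-text.
outer = soup.select("p.outer-text")

# body p.outer-text: the same tags, required to sit inside a body tag.
outer_in_body = soup.select("body p.outer-text")

# p#first: the p tag with an id of first; select_one returns one tag or None.
first = soup.select_one("p#first")

# html body: body tags inside an html tag.
bodies = soup.select("html body")

outer_texts = [tag.get_text(strip=True) for tag in outer]
print(outer_texts)
print(first.get_text(strip=True))
```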

# Scraping Weather Data

We now know enough to extract information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from <a href="http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168">this</a> page.

In [48]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br>
   <br/>
  </br>
 </p>
 <p>
  <img alt="Tonight: Cloudy, with a low around 55. West southwest wind 5 to 10 mph becoming light and variable  after midnight. " class="forecast-icon" src="newimages/medium/novc.png" title="Tonight: Cloudy, with a low around 55. West southwest wind 5 to 10 mph becoming light and variable  after midnight. "/>
 </p>
 <p class="short-desc">
  Cloudy
 </p>
 <p class="temp temp-low">
  Low: 55 °F
 </p>
</div>


In [49]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Tonight
Cloudy
Low: 55 °F


In [50]:
img = tonight.find("img")
desc = img['title']

print(desc)

Tonight: Cloudy, with a low around 55. West southwest wind 5 to 10 mph becoming light and variable  after midnight. 


## Extracting all the information from the page


In [51]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

[u'Tonight',
 u'Sunday',
 u'SundayNight',
 u'Monday',
 u'MondayNight',
 u'Tuesday',
 u'TuesdayNight',
 u'Wednesday',
 u'WednesdayNight']

In [52]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

[u'Cloudy', u'Cloudy', u'Slight ChanceShowers', u'Chance Rain', u'ShowersLikely andBreezy', u'ChanceShowers', u'ChanceShowers', u'ChanceShowers', u'Slight ChanceShowers']
[u'Low: 55 \xb0F', u'High: 66 \xb0F', u'Low: 55 \xb0F', u'High: 66 \xb0F', u'Low: 56 \xb0F', u'High: 64 \xb0F', u'Low: 52 \xb0F', u'High: 61 \xb0F', u'Low: 50 \xb0F']
[u'Tonight: Cloudy, with a low around 55. West southwest wind 5 to 10 mph becoming light and variable  after midnight. ', u'Sunday: Cloudy, with a high near 66. Light and variable wind becoming west southwest 6 to 11 mph in the afternoon. ', u'Sunday Night: A 20 percent chance of showers after 11pm.  Cloudy, with a low around 55. Southwest wind 6 to 14 mph, with gusts as high as 18 mph.  New precipitation amounts of less than a tenth of an inch possible. ', u'Monday: A 40 percent chance of rain.  Cloudy, with a high near 66. Southeast wind 9 to 14 mph, with gusts as high as 18 mph.  New precipitation amounts of less than a tenth of an inch possible. ', u
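
Building four separate lists works as long as every forecast item contains every field. If one item lacked, say, a temperature, the lists would have different lengths and the values would no longer line up. A sketch of a per-item alternative follows; the sample HTML is made up for illustration (the second item deliberately has no temperature or image) and `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up sample in the shape of the forecast markup shown earlier.
sample = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img title="Tonight: Cloudy, with a low around 55."/></p>
    <p class="short-desc">Cloudy</p>
    <p class="temp temp-low">Low: 55 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Sunday</p>
    <p class="short-desc">Cloudy</p>
  </div>
</div>
"""


def text_or_none(container, selector):
    """Return the stripped text of the first match, or None (hypothetical helper)."""
    tag = container.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


seven_day = BeautifulSoup(sample, "html.parser").find(id="seven-day-forecast")

rows = []
for item in seven_day.select(".tombstone-container"):
    img = item.find("img")
    rows.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
        "desc": img.get("title") if img is not None else None,
    })

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to pd.DataFrame, with missing fields showing up as empty values instead of shifting the columns.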

## Combining our data into a pandas DataFrame


In [53]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
weather


| | desc | period | short_desc | temp |
|---|---|---|---|---|
| 0 | Tonight: Cloudy, with a low around 55. West so... | Tonight | Cloudy | Low: 55 °F |
| 1 | Sunday: Cloudy, with a high near 66. Light and... | Sunday | Cloudy | High: 66 °F |
| 2 | Sunday Night: A 20 percent chance of showers a... | SundayNight | Slight ChanceShowers | Low: 55 °F |
| 3 | Monday: A 40 percent chance of rain. Cloudy, ... | Monday | Chance Rain | High: 66 °F |
| 4 | Monday Night: Showers likely. Cloudy, with a ... | MondayNight | ShowersLikely andBreezy | Low: 56 °F |
| 5 | Tuesday: A chance of showers, with thunderstor... | Tuesday | ChanceShowers | High: 64 °F |
| 6 | Tuesday Night: A 30 percent chance of showers.... | TuesdayNight | ChanceShowers | Low: 52 °F |
| 7 | Wednesday: A chance of showers, with thunderst... | Wednesday | ChanceShowers | High: 61 °F |
| 8 | Wednesday Night: A slight chance of showers. ... | WednesdayNight | Slight ChanceShowers | Low: 50 °F |


In [55]:
# Raw string so that \d reaches the regex engine as a digit class.
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype("int")
weather

| | desc | period | short_desc | temp | temp_num |
|---|---|---|---|---|---|
| 0 | Tonight: Cloudy, with a low around 55. West so... | Tonight | Cloudy | Low: 55 °F | 55 |
| 1 | Sunday: Cloudy, with a high near 66. Light and... | Sunday | Cloudy | High: 66 °F | 66 |
| 2 | Sunday Night: A 20 percent chance of showers a... | SundayNight | Slight ChanceShowers | Low: 55 °F | 55 |
| 3 | Monday: A 40 percent chance of rain. Cloudy, ... | Monday | Chance Rain | High: 66 °F | 66 |
| 4 | Monday Night: Showers likely. Cloudy, with a ... | MondayNight | ShowersLikely andBreezy | Low: 56 °F | 56 |
| 5 | Tuesday: A chance of showers, with thunderstor... | Tuesday | ChanceShowers | High: 64 °F | 64 |
| 6 | Tuesday Night: A 30 percent chance of showers.... | TuesdayNight | ChanceShowers | Low: 52 °F | 52 |
| 7 | Wednesday: A chance of showers, with thunderst... | Wednesday | ChanceShowers | High: 61 °F | 61 |
| 8 | Wednesday Night: A slight chance of showers. ... | WednesdayNight | Slight ChanceShowers | Low: 50 °F | 50 |


In [56]:
weather["temp_num"].mean()

58.333333333333336
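
The overall mean mixes daytime highs with overnight lows. A sketch of splitting the two, using the temp strings printed earlier and assuming that a "Low" label marks a night period; the `is_night` column is an illustrative addition, not part of the notebook above.

```python
import pandas as pd

# The temp values scraped above, copied in so the example runs offline.
weather = pd.DataFrame({
    "temp": [
        "Low: 55 °F", "High: 66 °F", "Low: 55 °F", "High: 66 °F",
        "Low: 56 °F", "High: 64 °F", "Low: 52 °F", "High: 61 °F",
        "Low: 50 °F",
    ]
})

# Pull the digits out of each string and convert them to integers.
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# Assumption: rows labelled "Low" are night-time readings.
weather["is_night"] = weather["temp"].str.contains("Low")

night_mean = weather.loc[weather["is_night"], "temp_num"].mean()
day_mean = weather.loc[~weather["is_night"], "temp_num"].mean()

print(night_mean, day_mean)
```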