### Web Scraping basics using BeautifulSoup 
https://www.dataquest.io/blog/web-scraping-tutorial-python/

#### Components of a web page
When we visit a web page, our web browser makes a request to a web server. This request is called a GET request (for getting files from the server). The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

* **HTML** – contains the main content of the page.
* **CSS** – adds styling to make the page look nicer.
* **JS** – JavaScript files add interactivity to web pages.
* **Images** – image formats such as JPG and PNG allow web pages to show pictures.

#### HTML Tags

Common tags:
* `<html>` – tells the web browser that everything inside it is HTML.
* `<head>` – contains the page title and other metadata that generally isn’t useful in web scraping.
* `<body>` – contains the main content of the web page.
* `<p>` – defines a paragraph.
* `<a>` – defines a link.
* `<div>` – indicates a division, or area, of the page.
* `<b>` – bolds any text inside.
* `<i>` – italicizes any text inside.
* `<table>` – creates a table.
* `<form>` – creates an input form.
* `<style>` – contains style information for a document, or part of a document. By default, the style instructions inside it are expected to be CSS.
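
To see how these tags nest in a parsed document, here is a minimal sketch. The HTML string and the `https://example.com` link are made up for illustration and are not part of the tutorial pages.

```python
from bs4 import BeautifulSoup

# Hypothetical page that uses several of the tags listed above
html = """
<html>
  <head><title>Tag demo</title></head>
  <body>
    <div>
      <p>Some <b>bold</b> and <i>italic</i> text with a <a href="https://example.com">link</a>.</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag, in the order the tags open in the document
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)

# Tag attributes such as href are read with dictionary-style access
link = soup.find("a")
print(link["href"], link.get_text())
```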

#### Requests
After sending a GET request with `requests.get` from Python's **requests** library, we get a `Response` object. Its `status_code` property indicates whether the page was downloaded successfully: a code starting with 2 generally means success, while codes starting with 4 or 5 indicate an error.
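
The sketch below shows how the status code might be checked. To stay runnable offline it starts a small throwaway server on localhost; `DemoHandler`, `PAGE`, and the `/simple.html` and `/missing.html` paths are hypothetical stand-ins, not part of the tutorial's site.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAGE = b"<html><body><p>Here is some simple content for this page.</p></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: serves one page, answers 404 for anything else."""

    def do_GET(self):
        if self.path == "/simple.html":
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 picks any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

found = session.get(base_url + "/simple.html", timeout=5)
missing = session.get(base_url + "/missing.html", timeout=5)

print(found.status_code, found.ok)      # 200 True
print(missing.status_code, missing.ok)  # 404 False

server.shutdown()
server.server_close()
```

`Response.ok` is a shortcut that is true for status codes below 400; `Response.raise_for_status()` raises an exception instead, which can be handy in scripts that should stop on a failed download.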

### Example 1

#### TEST: Download and find the parent, child and sibling tags of a *simple* webpage:
`"http://dataquestio.github.io/web-scraping-pages/simple.html"`

In [71]:
import bs4
from bs4 import BeautifulSoup
import requests

#### Request the page and check its status

In [72]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page.status_code

200

#### Use the BeautifulSoup library to parse this page

In [73]:
soup = BeautifulSoup(page.content, "html.parser")

In [74]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


#### Pretty print soup

In [75]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


#### Children (child elements) of the BeautifulSoup Object

In [76]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [77]:
soup_children = list(soup.children)
print(len(soup_children))

3


In [78]:
for child in soup_children:
    print(type(child), "\n")

<class 'bs4.element.Doctype'> 

<class 'bs4.element.NavigableString'> 

<class 'bs4.element.Tag'> 
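
Since only `Tag` children can be navigated further, one option is to filter the children by type. This is a minimal sketch that assumes the page's HTML is pasted in as a string instead of being downloaded.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumed to match the simple example page
html = """<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only real tags; the Doctype and the "\n" string are skipped
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
print([tag.name for tag in top_level_tags])

# Doctype is a NavigableString subclass, so both non-tag children match here
non_tag_children = [child for child in soup.children if isinstance(child, NavigableString)]
print(len(non_tag_children))
```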



#### The last element of `soup.children` contains the tags that might be of interest

In [79]:
*_, html_tags = soup_children
print(html_tags.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [80]:
print(type(html_tags))

<class 'bs4.element.Tag'>


#### Now look at the children of html_tags

In [81]:
list(html_tags.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

#### Do the children have children?

In [82]:
for child in html_tags.children:
    if isinstance(child, bs4.element.Tag):  # skip the "\n" strings between tags
        print("Child(ren) of", child, ":", list(child.children), "\n")

Child(ren) of <head>
<title>A simple example page</title>
</head> : ['\n', <title>A simple example page</title>, '\n'] 

Child(ren) of <body>
<p>Here is some simple content for this page.</p>
</body> : ['\n', <p>Here is some simple content for this page.</p>, '\n'] 



**Getting the tags, the nested tags and their children is an iterative process. Prior knowledge of the HTML page content (from inspecting the page) allows us to efficiently extract particular information from the page.**
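
The iterative descent can be sketched as a small recursive helper. `walk` is a hypothetical name introduced here, and the HTML string is assumed to match the simple example page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""


def walk(tag, depth=0):
    """Yield (depth, tag name) for a tag and every tag nested inside it."""
    yield depth, tag.name
    for child in tag.children:
        if isinstance(child, Tag):  # skip strings such as "\n"
            yield from walk(child, depth + 1)


soup = BeautifulSoup(html, "html.parser")
outline = list(walk(soup.html))
for depth, name in outline:
    print("  " * depth + name)

# When the structure is known in advance, attribute access goes straight there
print(soup.body.p.get_text())
```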

* Let's extract the text content of the body tag.

In [83]:
body = list(html_tags.children)[3]
body

<body>
<p>Here is some simple content for this page.</p>
</body>

* Now, we can get the p tag by finding the children of the body tag:

In [84]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

* We can now isolate the p tag:

In [85]:
p = list(body.children)[1]
p

<p>Here is some simple content for this page.</p>

* Once we’ve isolated the tag, we can use the `get_text` method to extract all of the text inside it:

In [86]:
p.get_text()

'Here is some simple content for this page.'

### Finding all instances of a tag at once

In [87]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that `find_all` returns a list-like `ResultSet`, so we have to loop through it or index into it to extract text.

In [88]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
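
When a page has several matching tags, looping is usually more convenient than indexing. A minimal sketch, using a made-up three-paragraph page:

```python
from bs4 import BeautifulSoup

# Hypothetical page with more than one paragraph
html = "<html><body><p>First.</p><p>Second.</p><p>Third.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Loop over every match and collect its text
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)

# find returns only the first match, or None when nothing matches
first_p = soup.find("p")
no_table = soup.find("table")
print(first_p.get_text(), no_table)
```

Checking for `None` before calling `get_text` avoids an `AttributeError` on pages where the tag is absent.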

### Searching for tags by class and id

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. 

In [89]:
import bs4
from bs4 import BeautifulSoup
import requests

In [90]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


* **Search for any p tag that has the class outer-text**

In [91]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

* **Search for any tag that has the class outer-text**

In [92]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

* **Search for the tag with the id `first`**

In [93]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
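
A few more class and id lookups, as a sketch. The HTML string is a compact copy of the ids_and_classes page, pasted in so the example runs without a download.

```python
from bs4 import BeautifulSoup

# Compact copy of the structure of the ids_and_classes.html page
html = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches against any one of an element's classes
first_items = soup.find_all("p", class_="first-item")
print([p["id"] for p in first_items])

# class is a multi-valued attribute, so it comes back as a list
first_classes = soup.find(id="first")["class"]
print(first_classes)

# find returns a single tag, which suits id lookups
second_text = soup.find(id="second").get_text(strip=True)
print(second_text)
```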

### CSS Selectors

* **Select all `div` tags using a CSS selector**

In [94]:
soup.select("div")

[<div>
 <p class="inner-text first-item" id="first">
                 First paragraph.
             </p>
 <p class="inner-text">
                 Second paragraph.
             </p>
 </div>]

* **Select all `p` tags inside a `div`**

In [95]:
soup.select('div p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
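
Selectors can also combine tag names, classes and ids. The sketch below tries a few common forms on a compact copy of the same page; the specific selectors are illustrative choices, not taken from the tutorial.

```python
from bs4 import BeautifulSoup

# Compact copy of the structure of the ids_and_classes.html page
html = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

outer = soup.select("p.outer-text")            # p tags with the class outer-text
direct = soup.select("div > p")                # p tags that are direct children of a div
first = soup.select_one("#first")              # first element with the id first
both = soup.select("p.outer-text.first-item")  # p tags carrying both classes

print(len(outer), len(direct))
print(first.get_text(strip=True))
print([p["id"] for p in both])
```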

### Example 2

#### Extract the first item of a weather forecast page

* Download the web page containing the forecast using a GET request.
* Create a BeautifulSoup object to parse the page.
* Find the div with id `seven-day-forecast` and assign it to `seven_day`.
* Inside `seven_day`, find each individual forecast item.
* Extract and print the first forecast item.
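
The steps above can be tried offline first. In this sketch the HTML is a trimmed, hand-written stand-in for the forecast page, and the second forecast item is made-up data, so only the structure should be taken from it.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the forecast page (not the live response)
html = """
<div id="seven-day-forecast">
  <ul id="seven-day-forecast-list">
    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Tonight</p>
        <p class="short-desc">Chance Showers</p>
        <p class="temp temp-low">Low: 50 °F</p>
      </div>
    </li>
    <li class="forecast-tombstone">
      <div class="tombstone-container">
        <p class="period-name">Saturday</p>
        <p class="short-desc">Mostly Sunny</p>
        <p class="temp temp-high">High: 62 °F</p>
      </div>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the forecast block, then collect the items inside it
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")

first_item = forecast_items[0]
print(len(forecast_items))
print(first_item.prettify())
```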

In [9]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.WNXNIxIrJE4")

In [10]:
soup = BeautifulSoup(page.content, 'html.parser')

In [11]:
seven_day = soup.find(id="seven-day-forecast")

In [22]:
print(type(seven_day))

<class 'bs4.element.Tag'>


In [16]:
print(seven_day.prettify())

<div class="panel panel-default" id="seven-day-forecast">
 <div class="panel-heading">
  <b>
   Extended Forecast for
  </b>
  <h2 class="panel-title">
   San Francisco CA
  </h2>
 </div>
 <div class="panel-body" id="seven-day-forecast-body">
  <div id="seven-day-forecast-container">
   <ul class="list-unstyled" id="seven-day-forecast-list">
    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       Tonight
       <br>
        <br/>
       </br>
      </p>
      <p>
       <img alt="Tonight: A 40 percent chance of showers, mainly before 11pm.  Cloudy, then gradually becoming partly cloudy, with a low around 50. West wind 5 to 9 mph.  New precipitation amounts between a tenth and quarter of an inch possible. " class="forecast-icon" src="newimages/medium/nshra40.png" title="Tonight: A 40 percent chance of showers, mainly before 11pm.  Cloudy, then gradually becoming partly cloudy, with a low around 50. West wind 5 to 9 mph.  New precip

In [38]:
forecast_items = seven_day.find_all(class_="tombstone-container")

In [39]:
print(type(forecast_items))

<class 'bs4.element.ResultSet'>


In [24]:
print(len(forecast_items))

9


In [41]:
print(type(forecast_items[0]))

<class 'bs4.element.Tag'>


In [43]:
print(forecast_items[0].prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br>
   <br/>
  </br>
 </p>
 <p>
  <img alt="Tonight: A 40 percent chance of showers, mainly before 11pm.  Cloudy, then gradually becoming partly cloudy, with a low around 50. West wind 5 to 9 mph.  New precipitation amounts between a tenth and quarter of an inch possible. " class="forecast-icon" src="newimages/medium/nshra40.png" title="Tonight: A 40 percent chance of showers, mainly before 11pm.  Cloudy, then gradually becoming partly cloudy, with a low around 50. West wind 5 to 9 mph.  New precipitation amounts between a tenth and quarter of an inch possible. "/>
 </p>
 <p class="short-desc">
  Chance
  <br>
   Showers
  </br>
 </p>
 <p class="temp temp-low">
  Low: 50 °F
 </p>
</div>
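
From a single forecast item, the individual fields can be pulled out by class name. This is a sketch on a pasted copy of the item above, with the `img` title shortened to its first sentence to keep the example compact.

```python
from bs4 import BeautifulSoup

# Copy of the first forecast item; the img title is shortened for this example
html = """
<div class="tombstone-container">
  <p class="period-name">Tonight<br/></p>
  <p><img class="forecast-icon" src="newimages/medium/nshra40.png"
          title="Tonight: A 40 percent chance of showers, mainly before 11pm."/></p>
  <p class="short-desc">Chance<br/>Showers</p>
  <p class="temp temp-low">Low: 50 °F</p>
</div>
"""
tonight = BeautifulSoup(html, "html.parser").find(class_="tombstone-container")

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text(separator=" ")
temp = tonight.find(class_="temp").get_text()
description = tonight.find("img")["title"]  # attributes use dictionary-style access

print(period)
print(short_desc)
print(temp)
print(description)
```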


In [26]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight']
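
Names such as 'SaturdayNight' come out run together, presumably because the text on either side of a `<br>` tag is joined with nothing in between. A minimal sketch of one way around this, using a hand-written one-line stand-in for a period-name tag:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in: the period name is split across a <br/> tag
html = '<p class="period-name">Saturday<br/>Night</p>'
period_tag = BeautifulSoup(html, "html.parser").p

joined = period_tag.get_text()               # pieces joined with no separator
spaced = period_tag.get_text(separator=" ")  # pieces joined with a space

print(repr(joined))  # 'SaturdayNight'
print(repr(spaced))  # 'Saturday Night'
```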

In [27]:
# periods holds plain strings, which have no prettify(); loop over the Tag objects instead
for period_tag in period_tags:
    print(period_tag.get_text(), "\n", period_tag.prettify(), "\n\n")