# Web Scraping via Python: HTML

## The components of a web page

When we visit a web page, our web browser sends a request to a web server. This request is called a GET 
request, since we’re getting files from the server. The server then sends back files that tell our 
browser how to render the page for us. The files fall into a few main types:

-  HTML: contains the main content of the page.
-  CSS: adds styling to make the page look nicer.
-  JS: JavaScript files add interactivity to web pages.
-  Images: image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that 
happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re 
web scraping. When we perform web scraping, we’re interested in the main content of the web page, so 
let us first talk about HTML.
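
As a rough illustration of the request step, the sketch below builds a request object with Python’s standard library and checks which HTTP method it would use. The URL is a placeholder chosen for this example, and nothing is actually sent over the network.

```python
from urllib.request import Request

# Placeholder URL for illustration only; the request is built but never sent.
request = Request("https://example.com/index.html")

# With no request body attached, urllib defaults to the GET method.
method = request.get_method()
print(method, request.full_url)
```

Sending the request is a separate step, which the later sections of this tutorial do with the requests library.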

## HTML

HTML stands for HyperText Markup Language, the language in which web pages are written. HTML is not a 
programming language; it is a markup language that tells a browser how to lay out content. HTML 
allows you to do similar things to what you do in a word processor like Microsoft Word — make text bold, 
create paragraphs, and so on. 

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements 
called tags. The most basic tag is the <html> tag. This tag tells the web browser that everything 
inside of it is HTML. We can make a simple HTML document just using this tag:

<pre class="language-python"><code class="language-python">&lt;html&gt;
&lt;/html&gt;</code></pre>

We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we 
wouldn’t see anything:
    
    
<div style="border: 1px solid black;min-height: 25px;padding: 10px;margin-bottom:15px;margin-top:15px;"></div>

In HTML, tags can be nested inside other tags. We can add two more tags, the head tag and the body tag, inside 
the html tag. The head tag contains data about the page, such as its title, and other information that 
generally isn’t useful in web scraping. The main content of the web page goes into the body tag. 

<pre class="language-python"><code class="language-python">
&lt;html&gt;
    &lt;head&gt;
    &lt;/head&gt;
    &lt;body&gt;
    &lt;/body&gt;
&lt;/html&gt;
</code></pre>
Since we haven’t added any content to our page yet, we wouldn’t see anything in a web browser
for our HTML document.

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text 
inside the tag is shown as a separate paragraph:
<pre class="language-python"><code class="language-python">
&lt;html&gt;
    &lt;head&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p&gt;
            I like Python!
        &lt;/p&gt;
        &lt;p&gt;
            I would like to learn more about Python!
        &lt;/p&gt;
    &lt;/body&gt;
&lt;/html&gt;</code></pre>

Here is how this will look:
<div style="border: 1px solid black;min-height: 25px;padding: 10px;margin-bottom:15px;margin-top:15px;">
<p>I like Python!</p>
<p>I would like to learn more about Python!</p>
</div>

Tags have commonly used names that depend on their position in relation to other tags:

-  child: a tag nested directly inside another tag. The two p tags above are both children of the body tag.
-  parent: the tag that directly contains another tag. Above, the html tag is the parent of the body tag.
-  sibling: a tag that shares its parent with another tag. For example, head and 
   body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.
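
The parent and child relationships above can be checked programmatically. The following is a minimal sketch using only Python’s standard html.parser module; the ParentTracker class and its attribute names are made up for this illustration, and it assumes well-formed HTML in which every tag is closed in order.

```python
from html.parser import HTMLParser


class ParentTracker(HTMLParser):
    """Record (tag, parent) pairs while walking a document."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags that are currently open
        self.relations = []   # (tag, parent tag or None) in document order

    def handle_starttag(self, tag, attrs):
        parent = self.open_tags[-1] if self.open_tags else None
        self.relations.append((tag, parent))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes every tag is closed in the reverse order it was opened.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()


document = (
    "<html><head></head><body>"
    "<p>I like Python!</p>"
    "<p>I would like to learn more about Python!</p>"
    "</body></html>"
)

tracker = ParentTracker()
tracker.feed(document)
for tag, parent in tracker.relations:
    print(f"{tag} is a child of {parent}")
```

Tags that report the same parent, such as head and body, are siblings.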

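We can also add attributes (this tutorial also calls them properties) to HTML tags to change their behavior. Here, the href attribute of each a tag tells the browser where the link points: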
    
<pre class="language-python"><code class="language-python">
&lt;html&gt;
    &lt;head&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p&gt;
            I like Python!
            &lt;a href="https://www.python.org"&gt;Welcome to Python.org&lt;/a&gt;
        &lt;/p&gt;
        &lt;p&gt;
            I would like to learn more about Python!
            &lt;a href="https://catalog.fullerton.edu/preview_course_nopop.php?catoid=2&coid=5816"&gt;ISDS558&lt;/a&gt;        
        &lt;/p&gt;
    &lt;/body&gt;
&lt;/html&gt;</code></pre>

Here is how this will look:

<div style="border: 1px solid black;min-height: 25px;padding: 10px;margin-bottom:15px;margin-top:15px;">
<p>I like Python!       <a href="https://www.python.org">Welcome to Python.org</a></p>
<p>I would like to learn more about Python!        <a href="https://catalog.fullerton.edu/preview_course_nopop.php?catoid=2&coid=5816">ISDS558</a></p>
</div>

For a full list of tags, look [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

Before we move into actual web scraping, let’s learn about the class and id attributes. These special 
attributes give HTML elements names, and make them easier to select when we’re scraping. One 
element can have multiple classes, and a class can be shared between elements. Each element can only 
have one id, and an id should only be used once on a page. Classes and ids are optional, and not all 
elements will have them.

We can add classes and ids to our example:

<pre class="language-python"><code class="language-python">
&lt;html&gt;
    &lt;head&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p class="first-paragraph"&gt;
            I like Python!
            &lt;a href="https://www.python.org" id="python-link"&gt;Welcome to Python.org&lt;/a&gt;
        &lt;/p&gt;
        &lt;p class="second-paragraph"&gt;
            I would like to learn more about Python!
            &lt;a href="https://catalog.fullerton.edu/preview_course_nopop.php?catoid=2&coid=5816" 
                  id="ISDS558-link"&gt;ISDS558&lt;/a&gt;
        &lt;/p&gt;
    &lt;/body&gt;
&lt;/html&gt;</code></pre>

Here is how this will look:

<div style="border: 1px solid black;min-height: 25px;padding: 10px;margin-bottom:15px;margin-top:15px;">
<p>I like Python!       <a href="https://www.python.org">Welcome to Python.org</a></p>
<p>I would like to learn more about Python!        <a href="https://catalog.fullerton.edu/preview_course_nopop.php?catoid=2&coid=5816">ISDS558</a></p>
</div>

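As you can see, adding classes and ids doesn’t change how the tags are rendered at all. 

Now let’s move to Python. We download an example page with the requests library and 
parse its HTML with BeautifulSoup: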

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://assets.digitalocean.com/articles/eng_python/beautiful-soup/mockturtle.html"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using 
the prettify method on the BeautifulSoup object:

In [None]:
print(soup.prettify())

As all the tags are nested, we can move through the structure one level at 
a time. We can first select all the elements at the top level of the page 
using the children property of soup. Note that children returns an iterator 
rather than a list, so we need to call the list function on it:

In [None]:
list(soup.children)

The output tells us that there are two main items at the top level of the page: the 
initial DOCTYPE declaration and the html tag. There are newline 
characters (\n) in the list as well. Let’s see what the type of each 
element in the list is:

In [None]:
[type(item) for item in list(soup.children)]

As you can see, all of the items are objects defined by BeautifulSoup. The first is a 
Doctype object, which contains information about the type of the document. 
The second is a NavigableString, which represents text found in the HTML 
document. The third item is a Tag object, which contains other nested tags. 
The last is a NavigableString again. The most important object type, and 
the one we’ll deal with most often, is the Tag object.
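
Because whitespace between tags shows up as NavigableString items, the position of a tag in the children list depends on how the page happens to be formatted. One possible way to make the navigation less fragile is to filter by type instead of by position. The sketch below uses a small stand-in document (not the real page) to show the idea.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A small stand-in document; the real page in this tutorial is larger.
html_doc = """<!DOCTYPE html>
<html>
<head><title>Turtle Soup</title></head>
<body><h1>Turtle Soup</h1></body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag objects, skipping the Doctype and the whitespace strings.
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
html = top_level_tags[0]

# The same filter works one level down, inside the html tag.
nested_tags = [child.name for child in html.children if isinstance(child, Tag)]
print(html.name, nested_tags)
```

With this filter, the html tag is always the first element of the filtered list, no matter how many newline strings surround it.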

The Tag object allows us to navigate through an HTML document, and extract 
other tags and text. You can learn more about the various object types in the BeautifulSoup documentation.

We can now select the html tag and its children by taking the third item in the list:

In [None]:
html = list(soup.children)[2]
type(html)

The html item is a Tag object, so it has a children property of its own, 
just like soup.

Now, we can find the children inside the html tag:

In [None]:
list(html.children)

As you can see above, there are two tags here, head and body. To extract the 
text inside the h1 tag, we’ll dive into the body: 

In [None]:
body = list(html.children)[3]
body
type(body)

In [43]:
list(body.children)

['\n',
 <h1>Turtle Soup</h1>,
 '\n',
 <p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>,
 '\n',
 <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 '\n',
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>,
 '\n',
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>,
 '\n']

Now, we can get the h1 tag by finding the children of the body tag:

In [45]:
h1 = list(body.children)[1]
h1

<h1>Turtle Soup</h1>

Once we’ve isolated the tag, we can use the **get_text** method to extract all of 
the text inside the tag:

In [46]:
h1.get_text()

'Turtle Soup'
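
Walking the children lists step by step shows how the tree is organized, but BeautifulSoup also offers shortcuts to reach a tag directly. The sketch below, run on a small stand-in document rather than the real page, shows two of them: the find method and attribute-style navigation.

```python
from bs4 import BeautifulSoup

# Stand-in document with the same h1 as the example page.
html_doc = """
<html>
<head><title>Turtle Soup</title></head>
<body>
<h1>Turtle Soup</h1>
<p class="verse" id="first">Beautiful Soup, so rich and green,</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find returns the first matching tag anywhere in the document.
h1_by_name = soup.find("h1")

# Attribute-style navigation: the first body tag, then its first h1 tag.
h1_by_path = soup.body.h1

print(h1_by_name.get_text())
print(h1_by_path.get_text())
```

Both lines print the same heading text, without counting positions in a children list.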

Similarly, we can extract the text inside the first p tag:

In [48]:
p1 = list(body.children)[3]
p1

<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
  Waiting in a hot tureen!<br/>
  Who for such dainties would not stoop?<br/>
  Soup of the evening, beautiful Soup!<br/>
  Soup of the evening, beautiful Soup!<br/></p>

In [49]:
p1.get_text()

'Beautiful Soup, so rich and green,\n  Waiting in a hot tureen!\n  Who for such dainties would not stoop?\n  Soup of the evening, beautiful Soup!\n  Soup of the evening, beautiful Soup!'

We can use similar ideas to get the text in the other p tags.

In [50]:
p2 = list(body.children)[5]
p2.get_text()

'Beau--ootiful Soo--oop!\n  Beau--ootiful Soo--oop!\n  Soo--oop of the e--e--evening,\n  Beautiful, beautiful Soup!'

In [51]:
p3 = list(body.children)[7]
p3.get_text()

'Beautiful Soup! Who cares for fish,\n  Game or any other dish?\n  Who would not give all else for two\n  Pennyworth only of Beautiful Soup?\n  Pennyworth only of beautiful Soup?'

In [None]:
p4 = list(body.children)[9]
p4.get_text()
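
The strings returned above still contain the newlines and indentation of the original HTML. The get_text method accepts a separator and a strip flag that can tidy this up. Here is a short sketch on a shortened copy of the first verse.

```python
from bs4 import BeautifulSoup

# Shortened copy of the first verse, keeping the original line break and indent.
snippet = """<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
  Waiting in a hot tureen!<br/></p>"""
p = BeautifulSoup(snippet, "html.parser").p

# Default behavior: text pieces are concatenated exactly as they appear.
raw = p.get_text()

# Strip each text piece and join the pieces with a single space.
clean = p.get_text(" ", strip=True)

print(repr(raw))
print(repr(clean))
```

Whether the cleaned form is preferable depends on the use case; for poetry, the line breaks may be worth keeping.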

## Finding all instances of a tag at once

If we want to find all the p tags on the page at once, we can use the 
find_all method.

In [52]:
soup.find_all('p')

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>,
 <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list 
indexing, to extract text:

In [54]:
# Using a loop
for item in soup.find_all('p'):
    print("\n", item.get_text())


 Beautiful Soup, so rich and green,
  Waiting in a hot tureen!
  Who for such dainties would not stoop?
  Soup of the evening, beautiful Soup!
  Soup of the evening, beautiful Soup!

 Beau--ootiful Soo--oop!
  Beau--ootiful Soo--oop!
  Soo--oop of the e--e--evening,
  Beautiful, beautiful Soup!

 Beautiful Soup! Who cares for fish,
  Game or any other dish?
  Who would not give all else for two
  Pennyworth only of Beautiful Soup?
  Pennyworth only of beautiful Soup?

 Beau--ootiful Soo--oop!
  Beau--ootiful Soo--oop!
  Soo--oop of the e--e--evening,
  Beautiful, beauti--FUL SOUP!


In [57]:
# Using indexing
paragraphs = soup.find_all('p')  # search the page once and reuse the result

print("\n", paragraphs[0].get_text())

print("\n", paragraphs[1].get_text())

print("\n", paragraphs[2].get_text())

print("\n", paragraphs[3].get_text())


 Beautiful Soup, so rich and green,
  Waiting in a hot tureen!
  Who for such dainties would not stoop?
  Soup of the evening, beautiful Soup!
  Soup of the evening, beautiful Soup!

 Beau--ootiful Soo--oop!
  Beau--ootiful Soo--oop!
  Soo--oop of the e--e--evening,
  Beautiful, beautiful Soup!

 Beautiful Soup! Who cares for fish,
  Game or any other dish?
  Who would not give all else for two
  Pennyworth only of Beautiful Soup?
  Pennyworth only of beautiful Soup?

 Beau--ootiful Soo--oop!
  Beau--ootiful Soo--oop!
  Soo--oop of the e--e--evening,
  Beautiful, beauti--FUL SOUP!
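
When the text needs to be kept rather than printed, a list comprehension is a compact alternative to the loop. The sketch below, using a two-paragraph stand-in document, also shows the related find method and the limit argument of find_all.

```python
from bs4 import BeautifulSoup

# Two-paragraph stand-in for the example page.
html_doc = """
<body>
<p class="verse" id="first">Beautiful Soup, so rich and green,</p>
<p class="chorus" id="second">Beau--ootiful Soo--oop!</p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every p tag into a list.
texts = [p.get_text() for p in soup.find_all("p")]

# find returns only the first match, or None when nothing matches.
first_p = soup.find("p")
missing = soup.find("table")

# limit stops the search after the given number of matches.
only_one = soup.find_all("p", limit=1)

print(texts)
print(first_p["id"], missing, len(only_one))
```

Checking for None before calling get_text avoids an AttributeError on pages where a tag is absent.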


## Searching for tags by class and id

We introduced classes and ids earlier, but it probably wasn’t clear why 
they were useful. Classes and ids are used by CSS to determine which HTML 
elements to apply certain styles to. We can also use them when scraping to 
target the specific elements we want to scrape. Let us look at the same web page again.

In [58]:
import requests
from bs4 import BeautifulSoup

url = "https://assets.digitalocean.com/articles/eng_python/beautiful-soup/mockturtle.html"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Turtle Soup
  </title>
 </head>
 <body>
  <h1>
   Turtle Soup
  </h1>
  <p class="verse" id="first">
   Beautiful Soup, so rich and green,
   <br/>
   Waiting in a hot tureen!
   <br/>
   Who for such dainties would not stoop?
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
  </p>
  <p class="chorus" id="second">
   Beau--ootiful Soo--oop!
   <br/>
   Beau--ootiful Soo--oop!
   <br/>
   Soo--oop of the e--e--evening,
   <br/>
   Beautiful, beautiful Soup!
   <br/>
  </p>
  <p class="verse" id="third">
   Beautiful Soup! Who cares for fish,
   <br/>
   Game or any other dish?
   <br/>
   Who would not give all else for two
   <br/>
   Pennyworth only of 

Now, we can use the find_all method to search for items by class or by id. For example,
we’ll search for any p tag that has the class verse:

In [59]:
soup.find_all('p', class_='verse')

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>,
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>]

We can also look for any tag that has the class verse, regardless of the tag name:

In [60]:
soup.find_all(class_='verse')

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>,
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>]

We can also search for elements by id:

In [61]:
soup.find_all(id="first")

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>]

In [62]:
soup.find_all(id="third")

[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>]

We can also combine a class filter and an id filter in one call:

In [63]:
soup.find_all(class_='verse', id="third")

[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>]
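
Earlier we noted that one element can carry several classes. The sketch below shows how that looks from BeautifulSoup’s side. The highlight class is invented for this example and does not appear on the real page.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the extra "highlight" class is hypothetical.
html_doc = """
<p class="verse highlight" id="first">Beautiful Soup, so rich and green,</p>
<p class="chorus" id="second">Beau--ootiful Soo--oop!</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find returns a single Tag, which suits ids because they should be unique.
first = soup.find(id="first")

# The class attribute is returned as a list, one entry per class.
print(first["class"])

# class_ matches a tag if any one of its classes equals the given name.
verse_ids = [tag["id"] for tag in soup.find_all(class_="verse")]

# The attrs dictionary is another way to express several filters at once.
chorus_second = soup.find_all("p", attrs={"class": "chorus", "id": "second"})

print(verse_ids, len(chorus_second))
```

Since an id should appear only once per page, find is often more convenient than find_all for id lookups.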

## Using CSS Selectors

You can also search for items using [CSS selectors](https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Selectors). These selectors are 
how the CSS language allows developers to specify HTML tags to style. 
Here are some examples:

-  p a — finds all a tags inside of a p tag.
-  body p a — finds all a tags inside of a p tag inside of a body tag.
-  html body — finds all body tags inside of an html tag.
-  p.outer-text — finds all p tags with a class of outer-text.
-  p\#first — finds all p tags with an id of first.
-  body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup objects support searching a page via CSS selectors using the 
select method. We can use CSS selectors to find all the h1 tags in our page 
that are inside of a body like this:

In [64]:
soup.select("body h1")

[<h1>Turtle Soup</h1>]
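
A few of the selector patterns listed above can be tried on a small stand-in document. This is only a sketch: the link inside the first paragraph is added for illustration and is not part of the real page.

```python
from bs4 import BeautifulSoup

# Stand-in document; the a tag is added here only to demonstrate "body p a".
html_doc = """
<html><body>
<h1>Turtle Soup</h1>
<p class="verse" id="first">Beautiful Soup
  <a href="https://www.python.org">Python</a></p>
<p class="chorus" id="second">Beau--ootiful Soo--oop!</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# tag.class: all p tags with the class verse.
verses = soup.select("p.verse")

# tag#id: select_one returns the first match instead of a list.
first = soup.select_one("p#first")

# Descendant selector: a tags inside a p tag inside the body tag.
hrefs = [a["href"] for a in soup.select("body p a")]

print(len(verses), first["id"], hrefs)
```

The select method always returns a list, while select_one returns a single tag or None.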

## Example 1: Weather forecasts from the National Weather Service

We now know enough to proceed with extracting information about the local 
weather from the National Weather Service website. The first step is to 
find the page we want to scrape. We’ll extract weather information about 
Fullerton, CA from [here](https://forecast.weather.gov/MapClick.php?lon=-117.90458679199217&lat=33.87326611273156#.XK5dvS3MxQI).

### Downloading weather data

As you can see from [here](https://forecast.weather.gov/MapClick.php?lon=-117.90458679199217&lat=33.87326611273156#.XK5dvS3MxQI), the page has information about the extended 
forecast for the next week, including time of day, temperature, and a brief 
description of the conditions.

The first thing we’ll need to do is inspect the page. 

By right-clicking on the page near where it says “Extended Forecast”, then 
clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” 
in the elements panel of the browser’s developer tools.

We can then scroll up in the elements panel to find the “outermost” element that 
contains all of the text that corresponds to the extended forecasts. In this 
case, it’s a div tag with the id seven-day-forecast. If you explore the div, 
you’ll discover that each forecast item 
(like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div 
with the class tombstone-container.

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://forecast.weather.gov/MapClick.php?lon=-117.90458679199217&lat=33.87326611273156"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
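
Live pages can be slow or return an error status, in which case the parsing steps would run on an error page or never start. One possible safeguard is to wrap the download in a small helper. The fetch_html function below is a hypothetical helper written for this sketch, not part of requests; its get parameter exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Download a page and return its HTML text.

    Hypothetical helper for this tutorial. Raises requests.HTTPError
    when the server answers with a 4xx or 5xx status code.
    """
    # The timeout keeps the call from waiting indefinitely on a slow server.
    response = get(url, timeout=timeout)
    # Turn HTTP error statuses into exceptions instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the download step above could be written as soup = BeautifulSoup(fetch_html(url), 'html.parser').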

In [None]:
seven_day

In [None]:
forecast_items = seven_day.find_all(class_="tombstone-container")
first_time = forecast_items[0]
print(first_time.prettify())

### Extracting information from the page

As you can see, the first forecast item contains all the 
information we want. There are 4 pieces of information we can extract:

-  The name of the forecast item: in this case, This Afternoon.
-  The description of the conditions: this is stored in the title attribute of the img tag.
-  A short description of the conditions: in this case, Sunny.
-  The temperature high: in this case, 79 degrees.

We’ll extract the name of the forecast item, the short description, and the 
temperature first, since they’re all similar:

In [None]:
period = first_time.find(class_="period-name").get_text()
short_desc = first_time.find(class_="short-desc").get_text()
temp = first_time.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Now, we can extract the title attribute from the img tag. To do this, we 
just treat the Tag object like a dictionary, and pass in the 
attribute we want as a key:

In [None]:
img = first_time.find("img")
desc = img['title']

print(desc)

### Extracting all the information from the page

Now that we know how to extract each individual piece of information, we 
would like to extract everything at once. We can combine CSS selectors with 
list comprehensions to do this:

In [None]:
period_tags = seven_day.select(".tombstone-container .period-name")
period_tags

In [None]:
periods = [pt.get_text() for pt in period_tags]
periods
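
Period names such as ThisAfternoon appear without a space in the table later in this tutorial. A plausible cause, stated here as an assumption about the page’s markup, is that the two words are separated by a br tag rather than by a space. The sketch below reproduces the effect on a one-line stand-in snippet and shows one way to restore the space.

```python
from bs4 import BeautifulSoup

# Assumed markup: the two words are separated only by a br tag.
snippet = '<p class="period-name">This<br/>Afternoon</p>'
tag = BeautifulSoup(snippet, "html.parser").p

# Default get_text joins the text pieces with nothing in between.
joined = tag.get_text()

# A separator puts a space between the pieces; strip trims each piece.
spaced = tag.get_text(" ", strip=True)

print(joined)
print(spaced)
```

If the live page is structured this way, passing the same arguments to get_text in the list comprehension would yield names with spaces.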

As you can see above, our technique gets us each of the period names, in order. 
We can apply the same technique to get the other 3 fields:

In [66]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['Sunny', 'Mostly Clear', 'Sunny', 'Partly Cloudy', 'Sunny', 'Clear', 'Sunny', 'Partly Cloudy', 'Mostly Sunny']
['High: 79 °F', 'Low: 55 °F', 'High: 77 °F', 'Low: 56 °F', 'High: 76 °F', 'Low: 55 °F', 'High: 78 °F', 'Low: 55 °F', 'High: 75 °F']
['This Afternoon: Sunny, with a high near 79. Northwest wind around 10 mph. ', 'Tonight: Mostly clear, with a low around 55. Northwest wind 10 to 15 mph becoming light and variable  after midnight. Winds could gust as high as 25 mph. ', 'Thursday: Sunny, with a high near 77. Light and variable wind becoming southwest 10 to 15 mph in the afternoon. Winds could gust as high as 20 mph. ', 'Thursday Night: Partly cloudy, with a low around 56. West wind 10 to 15 mph becoming light and variable  after midnight. Winds could gust as high as 25 mph. ', 'Friday: Sunny, with a high near 76. South wind 5 to 10 mph becoming southwest in the afternoon. Winds could gust as high as 20 mph. ', 'Friday Night: Clear, with a low around 55.', 'Saturday: Sunny, with a
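
Building four separate lists works as long as every forecast item contains every field. If one item lacked, say, a temperature, the lists would end up with different lengths and drift out of alignment. An alternative sketch is to parse each item into its own dictionary. The markup below is a simplified stand-in for the forecast page (the live page may differ), and the helper names text_or_none and parse_item are made up for this example.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the forecast markup; the second item is
# deliberately missing its image and temperature.
html_doc = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img src="night.png" title="Tonight: Mostly clear, with a low around 55."/></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 55 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Thursday</p>
    <p class="short-desc">Sunny</p>
  </div>
</div>
"""


def text_or_none(parent, css_class):
    """Return the text of the first tag with this class, or None if absent."""
    found = parent.find(class_=css_class)
    return found.get_text(" ", strip=True) if found is not None else None


def parse_item(item):
    """Turn one forecast container into a dictionary of its four fields."""
    img = item.find("img")
    return {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
        "desc": img.get("title") if img is not None else None,
    }


seven_day = BeautifulSoup(html_doc, "html.parser").find(id="seven-day-forecast")
records = [parse_item(item) for item in seven_day.find_all(class_="tombstone-container")]
for record in records:
    print(record)
```

A list of dictionaries like records can also be passed directly to pd.DataFrame, with missing fields showing up as empty values.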

### Combining our data into a pandas DataFrame

We can now combine the data into a pandas DataFrame and analyze it. A DataFrame is 
an object that can store tabular data, making data analysis easy. 

In order to do this, we’ll call the DataFrame class, and pass in each list of 
items that we have. We pass them in as part of a dictionary. Each dictionary 
key will become a column in the DataFrame, and each list will become the 
values in the column:

In [67]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
    })
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | ThisAfternoon | Sunny | High: 79 °F | This Afternoon: Sunny, with a high near 79. No... |
| 1 | Tonight | Mostly Clear | Low: 55 °F | Tonight: Mostly clear, with a low around 55. N... |
| 2 | Thursday | Sunny | High: 77 °F | Thursday: Sunny, with a high near 77. Light an... |
| 3 | ThursdayNight | Partly Cloudy | Low: 56 °F | Thursday Night: Partly cloudy, with a low arou... |
| 4 | Friday | Sunny | High: 76 °F | Friday: Sunny, with a high near 76. South wind... |
| 5 | FridayNight | Clear | Low: 55 °F | Friday Night: Clear, with a low around 55. |
| 6 | Saturday | Sunny | High: 78 °F | Saturday: Sunny, with a high near 78. |
| 7 | SaturdayNight | Partly Cloudy | Low: 55 °F | Saturday Night: Partly cloudy, with a low arou... |
| 8 | Sunday | Mostly Sunny | High: 75 °F | Sunday: Mostly sunny, with a high near 75. |


We can now do some analysis on the data. For example, we can pull out 
the numeric temperature values:

In [89]:
import re

# Use a descriptive loop variable so the built-in str is not shadowed.
for temp_text in weather["temp"]:
    print(int(re.findall(r'[0-9]+', temp_text)[0]))

79
55
77
56
76
55
78
55
75
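
Indexing the result of re.findall with [0] raises an IndexError when a label contains no digits. A slightly more defensive variant, sketched below with a hypothetical helper named parse_temp, returns None in that case. The leading optional minus sign is an addition for below-zero readings and is not needed for the data shown here.

```python
import re


def parse_temp(text):
    """Return the first integer in a label such as 'High: 79 °F', or None."""
    # -? also accepts a negative reading such as 'Low: -5 °F'.
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


print(parse_temp("High: 79 °F"))
print(parse_temp("Low: 55 °F"))
print(parse_temp("no reading"))
```

The helper can be applied to a whole column with weather["temp"].map(parse_temp).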


We can also use the Series.str.extract method to pull out the numeric 
temperature values for the whole column at once:

In [85]:
# A raw string keeps \d from being treated as a string escape sequence.
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
weather

| | period | short_desc | temp | desc | temp_num |
|---|---|---|---|---|---|
| 0 | ThisAfternoon | Sunny | High: 79 °F | This Afternoon: Sunny, with a high near 79. No... | 79 |
| 1 | Tonight | Mostly Clear | Low: 55 °F | Tonight: Mostly clear, with a low around 55. N... | 55 |
| 2 | Thursday | Sunny | High: 77 °F | Thursday: Sunny, with a high near 77. Light an... | 77 |
| 3 | ThursdayNight | Partly Cloudy | Low: 56 °F | Thursday Night: Partly cloudy, with a low arou... | 56 |
| 4 | Friday | Sunny | High: 76 °F | Friday: Sunny, with a high near 76. South wind... | 76 |
| 5 | FridayNight | Clear | Low: 55 °F | Friday Night: Clear, with a low around 55. | 55 |
| 6 | Saturday | Sunny | High: 78 °F | Saturday: Sunny, with a high near 78. | 78 |
| 7 | SaturdayNight | Partly Cloudy | Low: 55 °F | Saturday Night: Partly cloudy, with a low arou... | 55 |
| 8 | Sunday | Mostly Sunny | High: 75 °F | Sunday: Mostly sunny, with a high near 75. | 75 |


We could then find the mean of all the high and low temperatures:

In [86]:
weather["temp_num"].mean()

67.33333333333333
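
A single mean over highs and lows together is hard to interpret. As a sketch of one alternative, the label and the number can be extracted into separate columns and averaged per label. The values below are copied from the temps list printed earlier; the column names kind and value are chosen for this example.

```python
import pandas as pd

# Temperature labels copied from the scraped output shown earlier.
temps = pd.Series(
    [
        "High: 79 °F", "Low: 55 °F", "High: 77 °F", "Low: 56 °F", "High: 76 °F",
        "Low: 55 °F", "High: 78 °F", "Low: 55 °F", "High: 75 °F",
    ],
    name="temp",
)

# Two named groups produce a DataFrame with one column per group.
parts = temps.str.extract(r"(?P<kind>High|Low): (?P<value>\d+)")
parts["value"] = parts["value"].astype(int)

# Average the highs and the lows separately.
means = parts.groupby("kind")["value"].mean()
print(means)
```

For this sample the highs average 77.0 and the lows 55.25, which together are consistent with the overall mean of about 67.33 shown above.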

We could also only select the rows that happen at night:

In [87]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [88]:
weather[is_night]

| | period | short_desc | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 1 | Tonight | Mostly Clear | Low: 55 °F | Tonight: Mostly clear, with a low around 55. N... | 55 | True |
| 3 | ThursdayNight | Partly Cloudy | Low: 56 °F | Thursday Night: Partly cloudy, with a low arou... | 56 | True |
| 5 | FridayNight | Clear | Low: 55 °F | Friday Night: Clear, with a low around 55. | 55 | True |
| 7 | SaturdayNight | Partly Cloudy | Low: 55 °F | Saturday Night: Partly cloudy, with a low arou... | 55 | True |
