# Supplement Notebook for Phase 2 Blog Post - Feynman and Me
* Due to the nature of the topic, the code and the websites scraped have been borrowed from the Flatiron lecture repository for demonstration purposes for Phase 2 blogging. Markdown cells are original content. Please refer to the original notebook [here](https://github.com/flatiron-school/ds-webscraping-opw32).
* My Phase 2 blog post, with a complete write-up on the steps below, can be found [here](https://medium.com/@ashley_63724).

In [4]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Tell Python Where To Go
This is where you pass the web address you want to scrape to `requests.get`.

In [5]:
request = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

## Let's Check the Status of Our Request
Status 200 - we're all good! For more statuses, check out these funny representations of HTTP status codes as cats: https://www.flickr.com/photos/girliemac/albums/72157628409467125/

In [7]:
request.status_code

200
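As a rough sketch of what those status codes mean in code, the standard library's `http.HTTPStatus` can turn a number into its official phrase. The helper names `describe_status` and `is_success` are made up for this illustration and are not part of the lecture notebook; in the notebook itself you would pass in `request.status_code`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label such as '200 OK' for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes outside the standard registry have no phrase to look up.
        return f"{code} (unrecognized)"
    return f"{code} {status.phrase}"


def is_success(code: int) -> bool:
    """Codes in the 2xx range mean the request succeeded."""
    return 200 <= code < 300


for code in (200, 404, 500):
    print(describe_status(code), "| success:", is_success(code))
```

If you would rather have Python stop with an error on a bad response, calling `request.raise_for_status()` right after the request is another option.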

## Now, BeautifulSoup, Do Your Thang!
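Pass the content of the request to BeautifulSoup and store the result in a variable. The `.children` attribute then gives us the top-level pieces of the document: the doctype, a newline, and the html tag. Inside the html tag for this particular example page, we have a title, a body, and a p-tag. Since `.children` is an iterator, we wrap it in `list()` to see everything at once.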

In [8]:
soup = BeautifulSoup(request.content, 'html.parser')  # name the parser explicitly to avoid a warning
list(soup.children)

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


## Element Type
Checking the types, only the last of the three children is a Tag; the first two are the doctype and a newline string. To navigate our list, let's index out the elements we want.

In [10]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
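Since only `Tag` objects are useful for navigation, one option is to filter the children by type instead of counting index positions. This is a minimal sketch that uses an inline copy of the example page (an assumption, so it runs without a network request); the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the example page, so no request is needed for this sketch.
SIMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""

demo_soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

# Keep only real tags, skipping the doctype and the whitespace strings.
top_level_tags = [child for child in demo_soup.children if isinstance(child, Tag)]
html_tag = top_level_tags[0]

# Repeat the same filter one level down to see what the html tag contains.
tag_names = [child.name for child in html_tag.children if isinstance(child, Tag)]
print(tag_names)
```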

In [18]:
html = list(soup.children)[2]
html

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [19]:
list(html.children)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

In [22]:
body = list(html.children)[3]
body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [23]:
p = list(body.children)[1]
p

<p>Here is some simple content for this page.</p>
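Indexing into lists of children works, but it breaks as soon as the page's whitespace changes. A shorter route, sketched here on an inline copy of the same page (an assumption for the demo), is to use tag names as attributes: `soup.body.p` means "the first p inside the first body".

```python
from bs4 import BeautifulSoup

# Inline copy of the example page, so no request is needed for this sketch.
SIMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""

demo_soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

# Dotted access returns the first matching tag at each step.
first_p = demo_soup.body.p

# get_text() returns the text inside the tag without the markup.
paragraph_text = first_p.get_text()
title_text = demo_soup.title.string

print(title_text)
print(paragraph_text)
```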

In [24]:
soup = BeautifulSoup(request.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

## Example 2 
I'm a fan of the `find_all` method. Below you'll see we're using a different page with similar sample content.

In [25]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup2 = BeautifulSoup(page.content, 'html.parser')
soup2

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

## Find_all
Now, by using BeautifulSoup's `find_all` method, we can pass in arguments and grab exactly what we're looking for. In the first example, that is every "p" tag with the outer-text class. In the second example, we search specifically by id='first'.

In [26]:
soup2.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [27]:
soup2.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
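The same searches can also be written as CSS selectors with BeautifulSoup's `select` and `select_one` methods. The sketch below runs on a trimmed inline version of the page (an assumption, so it needs no request); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed inline version of the ids_and_classes page for this sketch.
IDS_AND_CLASSES_HTML = """<html>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>"""

demo_soup2 = BeautifulSoup(IDS_AND_CLASSES_HTML, "html.parser")

# "p.outer-text" matches p tags that carry the outer-text class.
outer = demo_soup2.select("p.outer-text")

# "#first" matches the element whose id is "first"; select_one returns one tag.
first = demo_soup2.select_one("#first")

# "div > p" matches p tags that are direct children of a div.
inner_in_div = demo_soup2.select("div > p")

# find_all with class_ matches tags that have the class among several.
first_items = demo_soup2.find_all("p", class_="first-item")

print([p.get_text(strip=True) for p in outer])
print(first.get_text(strip=True))
```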

## Example 3 - Weather DataFrame
Bringing it all together. In the steps below, we scrape the weather forecast.
* By viewing the page's HTML, we can see we need the `period-name` class for the time and the `short-desc` class for the short description of the forecast. If you want or need more specific information than that, you can search the data by p class or children to find what you're looking for.
* Assign these to their respective variables, and then zip them together into a list of tuples.
* Lastly, we can use pd.DataFrame to take the list of tuples and create a clean dataframe of our weather forecasts!

In [34]:
url3 = 'https://forecast.weather.gov/MapClick.php?lat=41.8843&lon=-87.6324#.XdPlJUVKg6g'
request3 = requests.get(url3)
soup3 = BeautifulSoup(request3.content, 'html.parser')
print(soup3.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <!-- Meta -->
  <meta content="width=device-width" name="viewport"/>
  <link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
  <title>
   National Weather Service
  </title>
  <meta content="National Weather Service" name="DC.title">
   <meta content="NOAA National Weather Service National Weather Service" name="DC.description"/>
   <meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
   <meta content="" name="DC.date.created" scheme="ISO8601"/>
   <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
   <meta content="weather, National Weather Service" name="DC.keywords"/>
   <meta content="NOAA's National Weather Service" name="DC.publisher"/>
   <meta content="National Weather Service" name="DC.contributor"/>
   <meta content="//www.weather.gov/disclaimer.php" name="DC.rights"/>
   <meta content="General" name="rating"/>
   <meta content="index,follow" name="robots"/>
   <

In [29]:
times = soup3.find_all(class_='period-name')
times

[<p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Saturday<br/><br/></p>,
 <p class="period-name">Saturday<br/>Night</p>,
 <p class="period-name">Sunday<br/><br/></p>,
 <p class="period-name">Sunday<br/>Night</p>,
 <p class="period-name">Monday<br/><br/></p>,
 <p class="period-name">Monday<br/>Night</p>,
 <p class="period-name">Tuesday<br/><br/></p>,
 <p class="period-name">Tuesday<br/>Night</p>]

In [30]:
descs = soup3.find_all(class_='short-desc')
descs

[<p class="short-desc">Rain/Snow<br/>Likely</p>,
 <p class="short-desc">Chance<br/>Rain/Snow<br/>then Slight<br/>Chance<br/>Rain/Snow</p>,
 <p class="short-desc">Partly Cloudy</p>,
 <p class="short-desc">Sunny</p>,
 <p class="short-desc">Mostly Clear</p>,
 <p class="short-desc">Mostly Sunny</p>,
 <p class="short-desc">Mostly Clear</p>,
 <p class="short-desc">Mostly Sunny</p>,
 <p class="short-desc">Partly Cloudy</p>]

In [37]:
# Pair each period name with its description, keeping only the text
together = [(time.text, desc.text) for time, desc in zip(times, descs)]
together

[('Tonight', 'Rain/SnowLikely'),
 ('Saturday', 'ChanceRain/Snowthen SlightChanceRain/Snow'),
 ('SaturdayNight', 'Partly Cloudy'),
 ('Sunday', 'Sunny'),
 ('SundayNight', 'Mostly Clear'),
 ('Monday', 'Mostly Sunny'),
 ('MondayNight', 'Mostly Clear'),
 ('Tuesday', 'Mostly Sunny'),
 ('TuesdayNight', 'Partly Cloudy')]
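Notice that `.text` glues the pieces on either side of each `<br/>` together, which is how we end up with 'SaturdayNight' and 'Rain/SnowLikely'. One possible fix, sketched here on a small inline sample that mimics the forecast markup (an assumption, not the live page), is `get_text` with a separator. The helper name `clean_text` is made up for this example.

```python
from bs4 import BeautifulSoup

# Small inline sample shaped like the forecast markup shown above.
FORECAST_HTML = """
<div>
<p class="period-name">Tonight<br/><br/></p>
<p class="short-desc">Rain/Snow<br/>Likely</p>
</div>
<div>
<p class="period-name">Saturday<br/>Night</p>
<p class="short-desc">Partly Cloudy</p>
</div>
"""

demo_soup3 = BeautifulSoup(FORECAST_HTML, "html.parser")


def clean_text(tag):
    """Join the text pieces inside a tag with single spaces."""
    return tag.get_text(separator=" ", strip=True)


demo_times = demo_soup3.find_all(class_="period-name")
demo_descs = demo_soup3.find_all(class_="short-desc")

# zip stops at the shorter list without complaint, so check the lengths first.
if len(demo_times) != len(demo_descs):
    raise ValueError("period names and descriptions do not line up")

rows = [(clean_text(t), clean_text(d)) for t, d in zip(demo_times, demo_descs)]
print(rows)
```

A list like `rows` can be handed to `pd.DataFrame(rows, columns=['time', 'description'])` in the same way as the original list of tuples.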

In [32]:
weather = pd.DataFrame(together, columns=['time', 'description'])

weather

|   | time | description |
|---|---|---|
| 0 | Tonight | Rain/SnowLikely |
| 1 | Saturday | ChanceRain/Snowthen SlightChanceRain/Snow |
| 2 | SaturdayNight | Partly Cloudy |
| 3 | Sunday | Sunny |
| 4 | SundayNight | Mostly Clear |
| 5 | Monday | Mostly Sunny |
| 6 | MondayNight | Mostly Clear |
| 7 | Tuesday | Mostly Sunny |
| 8 | TuesdayNight | Partly Cloudy |
