In [1]:
import requests

#### Getting the HTML content of a website

http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html

In [2]:
res = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")

In [3]:
res.text

'<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <div>\n            <p class="inner-text first-item" id="first">\n                First paragraph.\n            </p>\n            <p class="inner-text">\n                Second paragraph.\n            </p>\n        </div>\n        <p class="outer-text first-item" id="second">\n            <b>\n                First outer paragraph.\n            </b>\n        </p>\n        <p class="outer-text">\n            <b>\n                Second outer paragraph.\n            </b>\n        </p>\n    </body>\n</html>'

In [4]:
type(res.text)

str

#### Parsing HTML

In [5]:
# in case bs4 throws error try
# !pip install --upgrade html5lib==1.0b8

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')

In [6]:
type(soup)

bs4.BeautifulSoup

In [7]:
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


In [8]:
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>, <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [9]:
soup.find_all('p')[1]

<p class="inner-text">
                Second paragraph.
            </p>

In [10]:
soup.find_all('p')[1].text

'\n                Second paragraph.\n            '
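
The text of a tag keeps the newlines and indentation of the source HTML, as the output above shows. A minimal sketch of cleaning it up with plain string methods, applied to a copy of that output (the variable names are illustrative). BeautifulSoup's `get_text(strip=True)` can do similar stripping directly on a tag.

```python
# Copy of the .text output shown above
raw_text = '\n                Second paragraph.\n            '

# strip() removes leading and trailing whitespace, including newlines
clean_text = raw_text.strip()

# split() + join() also collapses runs of whitespace inside the text
normalized_text = " ".join(raw_text.split())

print(repr(clean_text))
print(repr(normalized_text))
```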

In [11]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [12]:
soup.find_all('p', class_='inner-text')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]

In [13]:
len(soup.find_all('p', class_='inner-text'))

2

#### Finding the elements of the site

Since every web page is different and HTML can get very large and messy, the easiest way to find the elements you are interested in is to start from the browser window. Next we will look at how to find elements using the developer tools in your browser. Open the following webpage in your browser (preferably Chrome): http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579#.Wkwh8VQ-fVo

Find the developer tools in your browser. (In Chrome, go to View --> Developer --> Developer Tools, or press Control+Shift+C on Windows or Command+Shift+C on Mac.) You should end up with a panel at the bottom or on the right side of the browser. Make sure the Elements panel is highlighted.

In [14]:
res = requests.get("http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579")
soup = BeautifulSoup(res.text, 'html.parser')

In [15]:
soup.find_all('p', class_="myforecast-current-lrg")

[<p class="myforecast-current-lrg">70°F</p>]

In [16]:
soup.find_all('p', class_="myforecast-current-lrg")[0]

<p class="myforecast-current-lrg">70°F</p>

In [17]:
type(soup.find_all('p', class_="myforecast-current-lrg")[0])

bs4.element.Tag

In [18]:
soup.find_all('p', class_="myforecast-current-lrg")[0].text

'70°F'

In [19]:
soup.find_all('p', class_="myforecast-current-sm")[0].text

'21°C'
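
Both values come back as strings with a unit suffix. If you want numbers, a small helper along these lines could split them. `parse_temperature` and `fahrenheit_to_celsius` are hypothetical names, and the sketch assumes the text always has the form `<number>°<unit>`.

```python
def parse_temperature(text):
    """Split a string such as '70°F' into a number and a unit letter."""
    value, unit = text.strip().split("°")
    return float(value), unit


def fahrenheit_to_celsius(degrees_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (degrees_f - 32) * 5 / 9


temp_f, unit_f = parse_temperature('70°F')
temp_c, unit_c = parse_temperature('21°C')

# The two readings from the page should agree after conversion and rounding
print(round(fahrenheit_to_celsius(temp_f)), temp_c)
```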

#### Using a dictionary to make queries and collect responses

In [27]:
latlon_dict = {
    #'Honolulu':[21.3049, -157.8579],
    'Times Square':[40.757339, -73.985992],
    'Yosemite':[37.8651011, -119.5383294]
}
latlon_dict

{'Times Square': [40.757339, -73.985992],
 'Yosemite': [37.8651011, -119.5383294]}

In [21]:
import time

In [26]:
requests.get(url)

ConnectionError: ('Connection aborted.', gaierror(-2, 'Name or service not known'))

In [28]:
response_dict = {}
for place,coordinates in latlon_dict.items():
    url = "http://forecast.weather.gov/MapClick.php?lat={}&lon={}".format(
        coordinates[0], coordinates[1])
    print(place)
    print(url)
    resp = requests.get(url)
    time.sleep(3)
    soup = BeautifulSoup(resp.text, 'html.parser')
    temp_C = soup.find_all('p', class_="myforecast-current-sm")[0].text
    response_dict[place] = temp_C

Times Square
http://forecast.weather.gov/MapClick.php?lat=40.757339&lon=-73.985992


KeyboardInterrupt: 

In [None]:
response_dict

In [None]:
for place,temperature in response_dict.items():
    print("The current temperature in {} is {}.".format(place, temperature))

#### Saving dictionaries

In [None]:
import numpy as np

In [None]:
np.save('mydict.npy', response_dict) 

In [None]:
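# The dictionary is stored as a pickled object, so allow_pickle=True is needed
# on NumPy 1.16.3 and later, where it defaults to False
read_dictionary = np.load('mydict.npy', allow_pickle=True).item()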

In [None]:
read_dictionary
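
`np.save` stores a dictionary by pickling it inside the `.npy` file. For a dictionary of plain strings, the standard-library `json` module is a possible alternative that produces a human-readable file. A minimal sketch, using a made-up dictionary and a temporary directory so nothing in your working folder is overwritten:

```python
import json
import os
import tempfile

# Made-up example data with the same shape as response_dict (place -> temperature text)
example_dict = {'Place A': '21°C', 'Place B': '5°C'}

path = os.path.join(tempfile.mkdtemp(), 'mydict.json')

# Write the dictionary as UTF-8 text; ensure_ascii=False keeps '°' readable
with open(path, 'w', encoding='utf-8') as f:
    json.dump(example_dict, f, ensure_ascii=False)

# Read it back and compare with the original
with open(path, encoding='utf-8') as f:
    loaded_dict = json.load(f)

print(loaded_dict == example_dict)
```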

### 1 - exercise

We need the zip codes of the 5 landmarks in our data. Fortunately, Google shows the zip codes at a fixed place if you use the right search phrase. <br>
Open this link and, using the Inspect tool in the browser, try to find the class of the HTML element that holds the zip code shown at the top of the page! <br>
https://www.google.com/search?q=San+Jose+zip+code

### 1 - check yourself

The zip code is under a div of class "title" inside a div of class "IAznY"

### 2 - exercise
Now use the requests library to get the HTML content of this page and create a BeautifulSoup object called soup from this content.

In [None]:
### Your code here

### 2 - check yourself

In [None]:
if type(soup) == BeautifulSoup and '94089' in soup.text:
    print('Your soup object is correct')
else:
    print('Your soup object is NOT correct')

### 3 - exercise
Try to find all the div elements of class IAznY in your soup object. How many are there?

In [None]:
### Your code here

### 3 - check yourself
If you haven't found any div of this class, you were right.

### 4 - exercise
So it looks like the scraped HTML code doesn't have the elements you saw in the browser. The reason is that when you open the url in the browser, it uses JavaScript to format the page, but when we scraped it, only the plain HTML was sent. <br><br>
To see the same content in the browser, disable JavaScript by following these directions: https://productforums.google.com/forum/#!msg/chrome/BYOQskiuGU0/dO592rlLbJ0J <br><br>
Then open the page again and use the Inspect tool to find the HTML element containing the zip code!

### 4 - check yourself

The zip codes are under a div of class "Db7kif" and each zip code is under a span of class "ED44Kd"

### 5 - exercise
Try to find all the span elements of class ED44Kd in your soup object. How many are there?

In [None]:
### Your code here

### 5 - check yourself
You should find 67 elements

### 6 - exercise
Make a list called zipcode_list that contains the text from all the ED44Kd span elements!

In [None]:
### Your code here

### 6 - check yourself

In [None]:
if sorted(zipcode_list)[0] == ',94089' and len(zipcode_list) == 67:
    print('Your list is correct')
else:
    print('Your list is NOT correct')

### 7 - exercise
Read the weather csv into a pandas dataframe called station. <br>
Create a dictionary called zipcode_dict whose keys are the unique values from the landmark column and whose values are empty lists. You can print the unique values and create the dictionary by hand or, as an advanced task, try to create the dictionary without typing any landmark name!

In [None]:
### Your code here

### 7 - check yourself

In [None]:
if sorted(list(zipcode_dict.items())) == [('Mountain View', []),
                                         ('Palo Alto', []),
                                         ('Redwood City', []),
                                         ('San Francisco', []),
                                         ('San Jose', [])]:
    print('Your dictionary is correct')
else:
    print('Your dictionary is NOT correct')

### 8 - exercise
Loop over the keys of zipcode_dict and, for each key, use string formatting to print the url you would use to search Google for the zip codes of that city. <br>
For example, if the city is Palo Alto, the url should be: <br>
https://www.google.com/search?q=Palo Alto zip code
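
Note that the example url above contains spaces, while the link in exercise 1 uses `+` instead. One possible way to get the second form is `urlencode` from the standard library. This is a sketch of the encoding step only, with `build_search_url` as a hypothetical helper name:

```python
from urllib.parse import urlencode


def build_search_url(city):
    """Return a Google search url for '<city> zip code' with spaces encoded as '+'."""
    query = urlencode({'q': "{} zip code".format(city)})
    return "https://www.google.com/search?" + query


print(build_search_url('San Jose'))
```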

In [None]:
### Your code here

### 9 - exercise
As before, loop over the keys of zipcode_dict and for each key inside the loop:
- Get a response object for the url you would use to search Google for the zip codes of that city. <br>
- Wait 5 seconds with the sleep method
- Make a soup from that response object. <br>
- Make a list of all zip codes in the soup object. You can find the zip codes by looking for the text of the span elements of class 'ED44Kd'.<br>
- Assign this list as the value of that key in zipcode_dict

In [None]:
### Your code here

### 9 - check yourself

In [None]:
if sorted([len(x) for x in zipcode_dict.values()]) == [7, 7, 10, 56, 67]:
    print('Your dictionary is correct')
else:
    print('Your dictionary is NOT correct')

### 10 - exercise
Save the dictionary into a file!

In [None]:
### Your code here

### 10 - check yourself
Read the dictionary back from the file and check that it's the same as the original one.

In [None]:
### Your code here