## Web Crawling 

- Sydney Weather
- https://weather.com/weather/today/l/e2a77b6dabb86cf1bf18990058242f520c3490682bf751655fcad2a6c166e93f

In [1]:
!pip install requests beautifulsoup4 lxml



In [11]:
import requests

URL = "https://weather.com/weather/today/l/e2a77b6dabb86cf1bf18990058242f520c3490682bf751655fcad2a6c166e93f"
resp = requests.get(URL)
print(resp.status_code)
#print(resp.text)

200


## Read the textual response and get the HTML of the web page
Depending on how we want to specify the data, there are two ways to extract it:
1. Treat the HTML as a kind of XML document and use the XPath language to select the element (done here with lxml).

2. Use CSS selectors on the HTML document, which the BeautifulSoup library supports.

In [12]:
from lxml import etree
from bs4 import BeautifulSoup

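if resp.status_code == 200:
    # Using lxml: parse the HTML and select the element with an XPath expression
    dom = etree.HTML(resp.text)
    elements = dom.xpath("//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")
    # In lxml, .text holds only the text before the first child element
    print(elements[0].text)

    # Using BeautifulSoup: select the same element with a CSS selector
    soup = BeautifulSoup(resp.text, "lxml")
    elements = soup.select('span[data-testid="TemperatureValue"][class^="CurrentConditions"]')
    # In BeautifulSoup, .text joins the text of the element and all its descendants
    print(elements[0].text)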
    

67
67°
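
The two printed values differ (`67` and `67°`). One plausible cause is that the degree sign sits in a nested element: lxml's `.text` returns only the text before the first child, while BeautifulSoup's `.text` collects text from all descendants. The sketch below reproduces that behaviour offline with the standard library's `xml.etree.ElementTree`, whose `.text` and `itertext()` follow the same model as lxml. The markup in `SNIPPET` is hypothetical and is not copied from weather.com.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page markup (an assumption,
# not the real weather.com HTML).
SNIPPET = (
    '<div>'
    '<span data-testid="TemperatureValue" class="CurrentConditions--tempValue">'
    '67<span>°</span>'
    '</span>'
    '</div>'
)

root = ET.fromstring(SNIPPET)
span = root.find(".//span[@data-testid='TemperatureValue']")

text_only = span.text                # text before the first child element
all_text = "".join(span.itertext())  # text of the element and all descendants

print(text_only)
print(all_text)
```

If the real page is structured this way, joining `itertext()` on the lxml element would give the same string as BeautifulSoup's `.text`.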


## CSV data

In [13]:
import io
import pandas as pd
 
URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?id=T10YIE&cosd=2017-04-14&coed=2022-04-14"
resp = requests.get(URL)

if resp.status_code == 200:
    csvtext = resp.text
    csvbuffer = io.StringIO(csvtext)
    df = pd.read_csv(csvbuffer)
    print(df)

            DATE T10YIE
0     2017-04-17   1.88
1     2017-04-18   1.85
2     2017-04-19   1.85
3     2017-04-20   1.85
4     2017-04-21   1.84
...          ...    ...
1299  2022-04-08   2.87
1300  2022-04-11   2.91
1301  2022-04-12   2.86
1302  2022-04-13    2.8
1303  2022-04-14   2.89

[1304 rows x 2 columns]
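
The printed column shows `2.8` rather than `2.80`, which suggests the values may have been read as strings instead of floats, for instance if the file marks missing observations with a non-numeric placeholder. As a rough illustration of the same `io.StringIO` idea, the sketch below parses a small sample with the standard library `csv` module (swapped in for pandas so that it runs without extra packages). The first two rows come from the output above; the last row and the `.` placeholder are assumptions made for illustration.

```python
import csv
import io

# Sample text shaped like the download above. The first two data rows are
# taken from the printed output; the last row is invented, and "." as a
# missing-value marker is an assumption.
csvtext = "DATE,T10YIE\n2017-04-17,1.88\n2017-04-18,1.85\n2017-04-22,.\n"

# io.StringIO wraps the string so it can be read like an open file.
reader = csv.DictReader(io.StringIO(csvtext))

rows = []
for record in reader:
    raw = record["T10YIE"]
    # Convert numeric strings to float; map the assumed "." marker to None.
    value = None if raw == "." else float(raw)
    rows.append((record["DATE"], value))

print(rows)
```

With pandas, passing `na_values="."` to `pd.read_csv` is one way to get a similar effect, if the file does use that placeholder.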


## JSON data, convert to dictionary    

In [14]:
URL = "https://api.github.com/users/jbrownlee"
resp = requests.get(URL)
if resp.status_code == 200:
    data = resp.json()
    print(data)

{'login': 'jbrownlee', 'id': 12891, 'node_id': 'MDQ6VXNlcjEyODkx', 'avatar_url': 'https://avatars.githubusercontent.com/u/12891?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/jbrownlee', 'html_url': 'https://github.com/jbrownlee', 'followers_url': 'https://api.github.com/users/jbrownlee/followers', 'following_url': 'https://api.github.com/users/jbrownlee/following{/other_user}', 'gists_url': 'https://api.github.com/users/jbrownlee/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/jbrownlee/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/jbrownlee/subscriptions', 'organizations_url': 'https://api.github.com/users/jbrownlee/orgs', 'repos_url': 'https://api.github.com/users/jbrownlee/repos', 'events_url': 'https://api.github.com/users/jbrownlee/events{/privacy}', 'received_events_url': 'https://api.github.com/users/jbrownlee/received_events', 'type': 'User', 'site_admin': False, 'name': 'Machine Learning Mastery', 'company': 'Machine
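
Once decoded, the response is an ordinary Python dictionary. The sketch below shows the conversion offline with `json.loads`, which parses a JSON string in much the same way `resp.json()` parses the response body. The `body` string is a short excerpt built from fields in the output above; the `email` lookup is a hypothetical key used only to show safe access.

```python
import json

# A short excerpt of the response body, using fields shown in the output above.
body = (
    '{"login": "jbrownlee", "id": 12891, "type": "User", '
    '"site_admin": false, "name": "Machine Learning Mastery"}'
)

data = json.loads(body)  # JSON object -> Python dict

print(type(data).__name__)
print(data["login"], data["id"])
print(data["site_admin"])        # JSON false becomes Python False
print(data.get("email", "n/a"))  # .get() avoids a KeyError for absent keys
```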

## Binary data (ZIP file or image)    

In [15]:
URL = "https://en.wikipedia.org/static/images/project-logos/enwiki.png"
wikilogo = requests.get(URL)
if wikilogo.status_code == 200:
    with open("enwiki.png", "wb") as fp:
        fp.write(wikilogo.content)
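
The heading also mentions ZIP files. A downloaded archive does not have to be written to disk first: the bytes can be wrapped in `io.BytesIO` and opened with `zipfile`. The sketch below builds a tiny archive in memory to stand in for `resp.content`, so it runs without a network connection; the file name and contents are made up for illustration.

```python
import io
import zipfile

# Build a small ZIP archive in memory to stand in for resp.content
# (hypothetical data; a real download would supply these bytes).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("hello.txt", "hello from inside the archive")
zip_bytes = buffer.getvalue()

# Wrap the bytes in a file-like object and read the archive without touching disk.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    names = zf.namelist()
    text = zf.read("hello.txt").decode("utf-8")

print(names)
print(text)
```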