# Data-X Spring 2020

## Notebook: Web Scraping & Web Crawling

- How to obtain data the fun way and caveats to take care of when doing so

**Author List**: 
- Alexander Fred-Ojala
- Ishaan Malhi


**Original Sources**: 
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- https://www.dataquest.io/blog/web-scraping-tutorial-python/

**License**: Feel free to do whatever you want to with this code

**Compatibility:** Python >= 3.5

## What we are learning today

- Data is seldom all clean and readily available

- Oftentimes you might need to gather data from websites for any or all of the following reasons:

 - You need to update your data regularly (either as a live stream or in batches, e.g. every hour, day etc.)
 - The data isn't readily available, so you need to collect it yourself for your tasks
 - Because it's fun, so why not do it when you can

### Data Gathering

 - Web scraping is an important tool to have in your arsenal if you want to gather different types of data.
 - Usually it's a part of the "Data Gathering" stage.

- Once you gather your data, you clean, prepare/preprocess and use it in your tasks (prediction, analysis etc)

<center><img src="https://media.giphy.com/media/mG1MxDDEMSAVkF7da3/giphy.gif" /></center>

### Popular Web Scraping tools & libraries

This notebook mainly goes over how to get data with the Python packages `requests` and  `BeautifulSoup`. However, there are many other Python packages that can be used for scraping.

Two very popular and widely used are:

* **[Selenium:](http://selenium-python.readthedocs.io/)** Browser automation tool that can act like a human when visiting websites, almost like a macro. Because it drives a real browser, it can handle modern JavaScript-based websites built with React, Angular etc.


* **[Scrapy:](https://scrapy.org/)** A framework for automated scraping with a lot of built-in tools for web crawling that can facilitate the process (e.g. time-based crawling, IP rotation etc.). Mainly used for script-based scraping in larger projects.

Q: Why do you think a library like `requests` and `BeautifulSoup` can't scrape websites using Frontend Web Apps (like React, Angular, Ember etc)?

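A: The `requests` library only downloads the static HTML that the server returns. Frontend apps (Angular, React, Ember.js etc.) create and load content dynamically with JavaScript once the page runs in a browser, and `requests` never executes that JavaScript. This dynamic loading is part of what has made these frameworks popular, since the page does not need to be rendered from scratch on every change.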
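To make this concrete, here is a minimal sketch on a made-up HTML snippet (the page content and the `app` id are assumptions for illustration, not a real site). BeautifulSoup parses the markup it is given but never runs the `<script>`, so the element that JavaScript would fill in stays empty.

```python
import bs4 as bs

# Hypothetical page: the <div> is empty in the HTML source and would only be
# filled in by JavaScript when the page runs in a browser.
js_page = """
<html><body>
<div id="app"></div>
<script>
  document.getElementById("app").innerHTML = "<p>Loaded by JavaScript</p>";
</script>
</body></html>
"""

js_soup = bs.BeautifulSoup(js_page, features='html.parser')
app = js_soup.find(id='app')

print(repr(app.text))  # '' because the script was never executed
print(app.find('p'))   # None, the <p> would only exist after the JavaScript runs
```

A browser-driving tool such as Selenium would see the filled-in `<p>`, because it actually executes the script.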

Fun side note: Frontend web apps can also break search engine indexing. If we can't scrape the full site, neither can most search engine bots. A workaround for this is called `Server Side Rendering`. It's an interesting way to "run" a frontend web app on the backend and give back static content. Many sites do this nowadays, but this is a caveat you should keep in mind when scraping websites.

An interesting application in industry is [how Airbnb does this](https://github.com/airbnb/hypernova).

### API: Application Programming Interfaces

Many services offer APIs to grab data (Twitter, Wikipedia, Reddit etc.). We have already used an API in the Pandas notebook when we grabbed stock data in CSV format to do analysis. If a good API exists, it is usually the preferred method of obtaining data.


### APIs vs Web Scraping

Sometimes APIs don't give us everything we need, OR we need to gather data from websites that don't have an API. In this case, we use Web Scraping.

# Helpful Web Scraping Cheat Sheet

If you want good documentation of the functions in requests and BeautifulSoup (as well as how to save scraped data to an SQLite database), this is a good resource:

- https://blog.hartleybrody.com/web-scraping-cheat-sheet/

# Table of Contents
(Clickable document links)
___

### [0: Pre-setup](#sec0)
Document setup and Python 2 and Python 3 compatibility

### [1: Simple webscraping intro](#sec1)

Simple example of webscraping on a premade HTML template

#### Subsection: Scraping Caveats: How to be nice and not make enemies

### [2: IMDB top 250 movies w MetaScore](#sec3)

Scrape IMDB and compare MetaScore to user reviews.

### [3: Scrape Images and Files](#sec4)

Scrape a website for images, PDFs, CSV data or any other file type.

## [Breakout Problem: Scrape Weather Data](#secBK)

Scrape real time weather data in Berkeley.


### [Appendix](#sec5)

#### [Scrape Bloomberg sitemap for political news headlines](#sec6)

#### [Webcrawl Twitter, recursive URL link fetcher + depth](#sec7)

#### [SEO, visualize website categories as a tree](#sec8)

<a id='sec0'></a>
## Pre-Setup

In [None]:
# stretch Jupyter coding blocks to fit screen
from IPython.display import display, HTML # public import path; IPython.core.display is deprecated for this
display(HTML("<style>.container { width:90% !important; }</style>")) 
# if 100% it would fit the screen

In [None]:
# make it run on py2 and py3
from __future__ import division, print_function

<a id='sec1'></a>
# Webscraping intro

In order to scrape content from a website we first need to download the HTML contents of the website. This can be done with the Python library **requests** (with its `.get` method).

Then when we want to extract certain information from a website we use the scraping tool **BeautifulSoup4** (import bs4). In order to parse information with beautifulsoup we have to create a soup object from the HTML source code of a website.

In [None]:
import requests # The requests library is an HTTP library for getting and posting content etc.

import bs4 as bs 

import pandas as pd
"""
BeautifulSoup4 is a Python library 
for pulling data out of HTML and XML code.
We can query markup languages for specific content
"""

'\nBeautifulSoup4 is a Python library \nfor pulling data out of HTML and XML code.\nWe can query markup languages for specific content\n'

# Scraping a simple website

In [None]:
source = requests.get("https://alex.fo/other/data-x/") 
# a GET request will download the HTML webpage.

In [None]:
print(source) # If <Response [200]> then 
# the website has been downloaded successfully

<Response [200]>


**Different types of responses:**
Generally, a status code starting with 2 indicates success, while a status code starting with 4 or 5 indicates an error. Frequently seeing status codes like 404 (Not Found), 403 (Forbidden), 408 (Request Timeout) or 429 (Too Many Requests) might indicate that you have been blocked or rate limited by the server.
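As a small illustration of these status code families, the sketch below uses the standard library's `http.HTTPStatus` to translate a code into its reason phrase. The helper name `describe_status` is made up for this example, and it raises a `ValueError` for codes that `HTTPStatus` does not know.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # e.g. 'Not Found' for 404
    if code < 200:
        kind = 'informational'
    elif code < 300:
        kind = 'success'
    elif code < 400:
        kind = 'redirect'
    elif code < 500:
        kind = 'client error'
    else:
        kind = 'server error'
    return '{} {} ({})'.format(code, phrase, kind)

for code in (200, 403, 404, 429, 503):
    print(describe_status(code))
```

With `requests`, the code of a response is available as `source.status_code`, and `source.raise_for_status()` raises an exception for 4xx and 5xx responses.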

In [None]:
print(source.content) 
# This is the HTML content of the website,
# as you can see it's quite hard to decipher

b'<!DOCTYPE html>\n<html>\n<head>\n\n<title>Data-X Webscrape Tutorial</title>\n\n<style>\ndiv.container {\n    width: 100%;\n    border: 1px solid gray;\n}\n\n.header {\n    color:green;\n}\n\n#second {\n    font-style: italic;\n}\n\n</style>\n\n</head>\n\n<body style="background-color: pink">\n\n<h1 class="header">Simple Data-X site</h1>\n\n\n<h3 id="second">This site is only live to be scraped.</h3>\n\n\n<div class="container">\n<p>Some cool text in a container</p>\n</div>\n  \n\n  <h4> Random list </h4>\n<nav class="regular_list">\n  <ul>\n    <li><a href="https://en.wikipedia.org/wiki/London">London</a></li>\n    <li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>\n  </ul>\n</nav>\n\n\n\n\n  <h2>Random London Information within p tags</h2>\n\n  <p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>\n  <p>Standing on the River Thames, London has been a major settlement f

In [None]:
print(type(source.content)) # type bytes in Python 3

<class 'bytes'>


In [None]:
# Convert source.content to a beautifulsoup object 
# beautifulsoup can parse (extract specific information) HTML code

soup = bs.BeautifulSoup(source.content, features='html.parser') 
# we pass in the source content
# features specifies what type of code we are parsing, 
# here 'html.parser' specifies that we want beautiful soup to parse HTML code

In [None]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [None]:
print(soup) # looks a lot nicer!

<!DOCTYPE html>

<html>
<head>
<title>Data-X Webscrape Tutorial</title>
<style>
div.container {
    width: 100%;
    border: 1px solid gray;
}

.header {
    color:green;
}

#second {
    font-style: italic;
}

</style>
</head>
<body style="background-color: pink">
<h1 class="header">Simple Data-X site</h1>
<h3 id="second">This site is only live to be scraped.</h3>
<div class="container">
<p>Some cool text in a container</p>
</div>
<h4> Random list </h4>
<nav class="regular_list">
<ul>
<li><a href="https://en.wikipedia.org/wiki/London">London</a></li>
<li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>
</ul>
</nav>
<h2>Random London Information within p tags</h2>
<p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
<p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londi

Above we printed the HTML code of the website, parsed into a BeautifulSoup object.

### HTML tags
`<xxx> </xxx>`: these are HTML tags, which specify sections, styling etc. of the website. For more info: 
https://www.w3schools.com/tags/ref_byfunc.asp

Full list of HTML tags: https://developer.mozilla.org/en-US/docs/Web/HTML/Element

### HTML DOM Tree

The HTML DOM Tree is a logical tree that contains all the objects in a webpage.

Any dynamic execution (Javascript, SVG etc) interacts with the DOM tree.

Since HTML content has a hierarchy, a Tree structure appropriately models the relationships between different HTML elements.

For more, see the [Mozilla Docs](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model). There's a great intro to the DOM [here](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction).
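The tree structure can be explored directly in BeautifulSoup. Below is a minimal sketch on an inline copy of the list markup from the tutorial page; the variable names (`nav_soup`, `first_link`) are chosen for this example so that they do not clash with the rest of the notebook.

```python
import bs4 as bs

nav_html = """
<nav class="regular_list">
  <ul>
    <li><a href="https://en.wikipedia.org/wiki/London">London</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>
  </ul>
</nav>
"""
nav_soup = bs.BeautifulSoup(nav_html, features='html.parser')

first_link = nav_soup.find('a')
print(first_link.parent.name)                  # li, the direct parent of the first <a>
print(first_link.find_parent('nav')['class'])  # ['regular_list'], an ancestor further up

# Direct <li> children of the <ul> (recursive=False does not descend further)
items = nav_soup.find('ul').find_all('li', recursive=False)
print([li.a.text for li in items])             # ['London', 'Tokyo']
```

Moving up with `.parent` / `find_parent` and down with `find_all` is often enough to reach an element that has no `class` or `id` of its own.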





---
## `class` and `id`:

The `class` and `id` attributes of HTML tags are used as hooks to give unique styling to certain elements and to identify sections / parts of the page.

- **id:** is a unique identifier for a specific element (this often does not change)
- **class:** specifies a class of objects. Several elements in the HTML code can have the same class.

We can use these attributes of an HTML tag to select elements in BeautifulSoup.
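One way to use these hooks is through CSS selectors, which BeautifulSoup supports via `select` and `select_one` (`#name` matches an id, `.name` matches a class). The sketch below runs on an inline excerpt of the tutorial page, and the variable name `css_soup` is made up for this example.

```python
import bs4 as bs

css_html = """
<h1 class="header">Simple Data-X site</h1>
<h3 id="second">This site is only live to be scraped.</h3>
<div class="container"><p>Some cool text in a container</p></div>
"""
css_soup = bs.BeautifulSoup(css_html, features='html.parser')

print(css_soup.select_one('#second').text)    # element with id="second"
print(css_soup.select_one('h1.header').text)  # <h1> with class="header"

# select returns a list of all matches: every <p> inside a div.container
print([p.text for p in css_soup.select('div.container p')])
```

Selectors are handy when you need to combine a tag name, a class and nesting in a single query.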

### Suppose we want to extract content that is shown on the website

In [None]:
# Inside the <body> tag of the website is where all the main content is
print(soup.body)

<body style="background-color: pink">
<h1 class="header">Simple Data-X site</h1>
<h3 id="second">This site is only live to be scraped.</h3>
<div class="container">
<p>Some cool text in a container</p>
</div>
<h4> Random list </h4>
<nav class="regular_list">
<ul>
<li><a href="https://en.wikipedia.org/wiki/London">London</a></li>
<li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>
</ul>
</nav>
<h2>Random London Information within p tags</h2>
<p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
<p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
<footer>footer content</footer>
</body>


In [None]:
print(soup.title) # Title of the website

<title>Data-X Webscrape Tutorial</title>


In [None]:
print(soup.find('title')) # same as .title

<title>Data-X Webscrape Tutorial</title>


In [None]:
# If we want to extract specific text
print(soup.find('p')) # will only return first <p> tag

<p>Some cool text in a container</p>


In [None]:
print(soup.find('p').text) # extracts the string within the <p> tag, strips it of tag

Some cool text in a container


In [None]:
# If we want to extract all <p> tags
print(soup.find_all('p')) # returns list of all <p> tags

[<p>Some cool text in a container</p>, <p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>, <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>]


In [None]:
# we can also search for classes within all tags, using class_
# note that the keyword argument is spelled "class_" because `class` is a reserved keyword in Python

print(soup.find(class_='header')) 

<h1 class="header">Simple Data-X site</h1>


In [None]:
# We can also find tags with a specific id

print(soup.find(id='second'))

<h3 id="second">This site is only live to be scraped.</h3>


In [None]:
print(soup.find_all(class_='regular_list')) # find all returns list, 
# even if there is only one object

[<nav class="regular_list">
<ul>
<li><a href="https://en.wikipedia.org/wiki/London">London</a></li>
<li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>
</ul>
</nav>]


In [None]:
for p in soup.find_all('p'): # print all text paragraphs on the webpage
    print(p.text)

Some cool text in a container
London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.
Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.


In [None]:
# Extract links / urls
# Links in HTML are usually coded as <a href="url">
# where url is the link target

print(soup.a)
print(type(soup.a))


<a href="https://en.wikipedia.org/wiki/London">London</a>
<class 'bs4.element.Tag'>


In [None]:
soup.a.get('href') 
# to get the link from href attribute

'https://en.wikipedia.org/wiki/London'

In [None]:
links = soup.find_all('a')

links

[<a href="https://en.wikipedia.org/wiki/London">London</a>,
 <a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a>]

In [None]:
# if we want to list links and their text info

links = soup.find_all('a')

for l in links:
    print("Info about {}: ".format(l.text), \
          l.get('href')) 
# then we have extracted the link

Info about London:  https://en.wikipedia.org/wiki/London
Info about Tokyo:  https://en.wikipedia.org/wiki/Tokyo


# Scraping Caveats: How to be nice and not make enemies

- Webscraping is not always a welcome activity. 
    As a founder and/or engineer, you don't want to wake up in the middle of the night because your website is down due to scraping!
    
- When webscraping a website, be mindful and nice and make sure you are not inadvertently sending too many requests to the website, which can lead to a potential problem at the website.

- A pretty common reason for websites going temporarily offline is because they get scraped way too much.
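A simple way to be nice is to pause between requests. The helper below is a minimal sketch: the name `polite_fetch` and the default delay of one second are assumptions for illustration, not a rule that any website publishes. The download function is passed in as `fetch`, so you can hand it `requests.get`.

```python
import time

def polite_fetch(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Example usage with requests (not executed here):
# pages = polite_fetch(['https://alex.fo/other/data-x/'], requests.get, delay=2)
```

If a site states a crawl delay in its `robots.txt`, use that value for `delay` instead of guessing.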

## What do we look out for when scraping?

### Cached content

- Most website content is usually `cached`. This means the origin web servers do not serve every request directly; copies of the content are stored on nearby caching servers (often called Points of Presence, or POPs). Web requests usually hit these servers, which are able to serve cached content at a much higher rate.

    - You might have heard of services such as AWS Cloudfront or Cloudflare allowing web content to be cached.


- That said, some websites might not do this! One way to check is to see the response headers. Let's see what this looks like.

In [None]:
requests.get('https://wikipedia.org').headers

{'Date': 'Sat, 27 Jun 2020 07:36:36 GMT', 'Cache-Control': 's-maxage=86400, must-revalidate, max-age=3600', 'Server': 'ATS/8.0.8', 'ETag': 'W/"10cc5-5a8a9c4623b07"', 'Last-Modified': 'Mon, 22 Jun 2020 10:33:01 GMT', 'Content-Type': 'text/html', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'Age': '79576', 'X-Cache': 'cp5009 hit, cp5010 hit/216109', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'X-Client-IP': '35.229.190.9', 'Accept-Ranges': 'bytes', 'Content-Length': '17844', 'Connection': 'keep-alive'}

A `Cache-Control` header with a positive `max-age` (or `s-maxage`) value means the content may be cached for that many seconds.

An `X-Cache-Status` of `hit-front`, `hit` or similar means your request was answered from a cache, and the backend server did not have to service it directly.

Yay! This means we can scrape without worrying about taking down Wikipedia.org
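The header check can also be done in code. The sketch below works on a plain dict copied from the Wikipedia response above. The function names are made up for this example, and `X-Cache-Status` is what this particular server sends; other CDNs report cache hits under different header names.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(','):
        name, _, arg = part.strip().partition('=')
        if name:
            directives[name.lower()] = arg or True  # flags like must-revalidate have no argument
    return directives

def cache_info(headers):
    """Summarize the caching hints found in a dict of response headers."""
    directives = parse_cache_control(headers.get('Cache-Control', ''))
    max_age = directives.get('max-age')
    if isinstance(max_age, str) and max_age.isdigit():
        max_age = int(max_age)
    else:
        max_age = None
    return {
        'max_age': max_age,
        'served_from_cache': 'hit' in headers.get('X-Cache-Status', '').lower(),
    }

wiki_headers = {
    'Cache-Control': 's-maxage=86400, must-revalidate, max-age=3600',
    'X-Cache-Status': 'hit-front',
}
print(cache_info(wiki_headers))  # {'max_age': 3600, 'served_from_cache': True}
```

The same function can be called with `requests.get(url).headers`, which behaves like a dict with case-insensitive keys.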



# Other useful scraping tips

### robots.txt

Always check whether a website has a `robots.txt` document specifying which parts of the site you are allowed to scrape. The website cannot technically prevent `requests` from fetching its content, but I'd recommend you all to be nice and follow it. It may also contain information about the allowed scraping frequency etc.

E.g. 
- http://www.imdb.com/robots.txt
- http://www.nytimes.com/robots.txt
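The rules in a `robots.txt` file can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a small made-up rule set (the rules and the `example.com` URLs are assumptions for illustration), so it runs without any network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content
robots_txt = """
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

print(robots.can_fetch('*', 'https://example.com/news/'))       # True
print(robots.can_fetch('*', 'https://example.com/account/me'))  # False
print(robots.crawl_delay('*'))  # requested pause in seconds between requests, if any (Python 3.6+)
```

For a live site you would instead call `robots.set_url('http://www.imdb.com/robots.txt')` followed by `robots.read()` before asking `can_fetch`.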

### user-agent

When you're sending a request to a webpage (no matter if it comes from your computer, iPhone, or Python's `requests` package), you also include a user-agent. This lets the webserver know how to render the contents for you. You can also set the user-agent information yourself in a request (to specify who you are, for example, or to disguise that you're an automated scraper).

Find your machine's / browser's true user agent here: https://www.whoishostingthis.com/tools/user-agent/

In [None]:
# user-agent example

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'From': 'data-x@gmail.com' 
}

response = requests.get('http://alex.fo/other/data-x', headers=headers)
print(response)
print(response.headers) # the response will also have some meta information about the content

<Response [200]>
{'Date': 'Sun, 28 Jun 2020 05:42:54 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Tue, 19 May 2020 15:01:57 GMT', 'Vary': 'Accept-Encoding', 'Access-Control-Allow-Origin': '*', 'Expires': 'Sun, 28 Jun 2020 05:52:54 GMT', 'Cache-Control': 'max-age=600', 'X-Proxy-Cache': 'MISS', 'X-GitHub-Request-Id': '44D2:1151:2730B8:310DB2:5EF82DDE', 'CF-Cache-Status': 'DYNAMIC', 'cf-request-id': '039b0c43380000dae02f05b200000001', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '5aa5164b8d38dae0-TPE', 'Content-Encoding': 'gzip'}


# Keep a current list of the IMDB top 250 vs MetaScore

Let's say that we want to build an app that can display the most popular movies at the IMDB website.

We go to the URL that lists the top 250 movies according to the reviews: http://www.imdb.com/chart/top

We see that the entries are stored in a table format, so we try pandas.

In [None]:
df_imdb = pd.read_html('http://www.imdb.com/chart/top',attrs={'class':'chart full-width'})[0]

In [None]:
df_imdb.head()

Unnamed: 0.1,Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,Unnamed: 4
0,,1. The Shawshank Redemption (1994),9.2,12345678910 NOT YET RELEASED Seen,
1,,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,
2,,3. The Godfather: Part II (1974),9.0,12345678910 NOT YET RELEASED Seen,
3,,4. The Dark Knight (2008),9.0,12345678910 NOT YET RELEASED Seen,
4,,5. 12 Angry Men (1957),8.9,12345678910 NOT YET RELEASED Seen,


In [None]:
df_imdb.drop(df_imdb.columns[[0,3,4]],axis=1,inplace=True)

In [None]:
df_imdb.tail()

Unnamed: 0,Rank & Title,IMDb Rating
245,246. Swades (2004),8.0
246,247. Winter Sleep (2014),8.0
247,248. Aladdin (1992),8.0
248,249. Chak de! India (2007),8.0
249,250. A Silent Voice: The Movie (2016),8.0


In [None]:
# Extract all URLs to find meta score
imdb_html = requests.get('http://www.imdb.com/chart/top').content
soup = bs.BeautifulSoup(imdb_html, features='html.parser')

In [None]:
links = soup.find('table').find_all('a')
urls = ['http://www.imdb.com'+l.get('href') for l in links]
urls[0]

'http://www.imdb.com/title/tt0111161/'

In [None]:
urls[-1]

'http://www.imdb.com/title/tt5323662/'

In [None]:
import numpy as np
meta_scores = np.zeros(250, dtype=int)

In [None]:

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'From': 'data-x@gmail.com' 
}

for idx,url in enumerate(urls):
    print('Getting metascore for movie {}'.format(idx))
    film = requests.get(url, headers=headers, timeout=10)
    print(film)
    soup = bs.BeautifulSoup(film.content, features='html.parser')
    info = soup.find(class_='metacriticScore score_favorable titleReviewBarSubItem')
    if info is not None: # skip pages where no matching score element is found
        meta_scores[idx] = int(info.find('span').text)
    if idx == 5:
        break

Getting metascore for movie 0
<Response [200]>
Getting metascore for movie 1
<Response [200]>
Getting metascore for movie 2
<Response [200]>
Getting metascore for movie 3
<Response [200]>
Getting metascore for movie 4
<Response [200]>
Getting metascore for movie 5
<Response [200]>


In [None]:
df_imdb['meta_scores'] = meta_scores

In [None]:
df_imdb.head()

Unnamed: 0,Rank & Title,IMDb Rating,meta_scores
0,1. The Shawshank Redemption (1994),9.2,80
1,2. The Godfather (1972),9.1,80
2,3. The Godfather: Part II (1974),9.0,100
3,4. The Dark Knight (2008),9.0,100
4,5. 12 Angry Men (1957),8.9,90


<a id='sec4'></a>
# Scrape images and other files

Let's see how we can automatically find and download files linked at any website.

The data you need for your projects might not always be raw data, but in the form of files (images, .txt files etc)

In [None]:
# As we can see there are two images on the data-x.blog/resources
# say that we want to download them
# Images are displayed with the <img> tag in HTML

# open connection and create new soup

raw = requests.get('https://data-x.blog/resources/').content
soup = bs.BeautifulSoup(raw,features='html.parser')

print(soup.find('img')) 
# as we can see below the image urls 
# are stored in the src attribute inside the img tag

<img alt="Data-X at Berkeley" height="973" sizes="100vw" src="https://data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png" srcset="https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?w=2000&amp;ssl=1 2000w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?resize=300%2C146&amp;ssl=1 300w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?resize=768%2C374&amp;ssl=1 768w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?resize=1024%2C498&amp;ssl=1 1024w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?w=1480&amp;ssl=1 1480w" width="2000"/>


In [None]:
# Collect the URLs of all JPEG images
img_urls = list()
for img in soup.find_all('img'):
    img_url = img.get('src') # None if the <img> tag has no src attribute
    if img_url and ('.jpeg' in img_url or '.jpg' in img_url):
        print(img_url)
        img_urls.append(img_url)
    

https://i2.wp.com/data-x.blog/wp-content/uploads/2017/05/unnamed-2.jpg?resize=740%2C416&ssl=1


In [None]:
## Let's look at what our current file directory looks like

%ls

[0m[01;34msample_data[0m/


In [None]:
# To download and save files with Python we can use 
# the shutil library which is a file operations library
'''
The shutil module offers a number of high-level operations on files and 
collections of files. In particular, functions are provided which support 
file copying and removal.
'''

import shutil

for idx, img_url in enumerate(img_urls): 
    # enumerate to create an integer file name for every image
    
    # make a request to the image URL
    img_source = requests.get(img_url, stream=True) 
    # we set stream = True to download/ 
    # stream the content of the data
    
    with open('img'+str(idx)+'.jpg', 'wb') as file: 
        # open file connection, create file and write to it
        shutil.copyfileobj(img_source.raw, file) 
        # save the raw file object

    del img_source # to remove the file from memory

In [None]:
## Let's see if the file has been saved
%ls

img0.jpg  sample_data/


## Scraping function to download files of any type from a website

Below is a function that takes a URL and a file type and downloads up to `max` files of that type from the page.

In [None]:
# Extended scraping function of any file format
import os # To interact with operating system and format file name
import shutil # To copy file object from python to disk
import requests
import bs4 as bs

def py_file_scraper(url, html_tag='img', source_tag='src', file_type='.jpg',max=-1):
    
    '''
    Function that scrapes a website for certain file formats.
    The files will be placed in a folder called "files" 
    in the working directory.
    
    url = the url we want to scrape from
    html_tag = the file tag (usually img for images or 
    a for file links)
    
    source_tag = the source tag for the file url 
    (usually src for images or href for files)
    
    file_type = .png, .jpg, .pdf, .csv, .xls etc.
    
    max = integer (max number of files to scrape, 
    if = -1 it will scrape all files)
    '''
    
    # make a directory called 'files' 
    # for the files if it does not exist
    if not os.path.exists('files/'):
        os.makedirs('files/')
    print('Loading content from the url...')
    source = requests.get(url).content
    print('Creating content soup...')
    soup = bs.BeautifulSoup(source,'html.parser')
    
    i=0
    print('Finding tag:%s...'%html_tag)
    for n, link in enumerate(soup.find_all(html_tag)):
        file_url=link.get(source_tag)
        print ('\n',n+1,'. File url',file_url)
        
        
        if file_url and file_url.startswith('http'): # skip tags without the attribute, keep only absolute links
            print('It is a valid url..')
            
            
            if file_type in file_url: #only check for specific 
                # file type
                
                print('%s FILE TYPE FOUND IN THE URL...'%file_type)
                file_name = os.path.splitext(os.path.basename(file_url))[0] + file_type 
                #extract file name from url

                file_source = requests.get(file_url, stream = True)
             
                # open new stream connection

                with open('./files/'+file_name, 'wb') as file: 
                    # open file connection, create file and 
                    # write to it
                    
                    shutil.copyfileobj(file_source.raw, file) 
                    # save the raw file object
                    
                    print('DOWNLOADED:',file_name)
                    
                    i+=1
                    
                del file_source # delete from memory
            else:
                print('%s file type NOT found in url:'%file_type)
                print('EXCLUDED:',file_url) 
                # urls not downloaded from
                
        if i == max:
            print('Max reached')
            break
            

    print('Done!')
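The function above only downloads from absolute links. Many pages use relative links instead, and one possible extension is to resolve them against the page URL with the standard library's `urllib.parse`. The helper below is a sketch: the name `resolve_file_url` and the `example.com` URLs are made up for illustration, and it is not part of `py_file_scraper`.

```python
import os
from urllib.parse import urljoin, urlparse

def resolve_file_url(page_url, link):
    """Return (absolute_url, file_name) for a link found on page_url."""
    absolute = urljoin(page_url, link)   # leaves links that are already absolute unchanged
    path = urlparse(absolute).path       # drops the query string, e.g. ?w=300
    return absolute, os.path.basename(path)

print(resolve_file_url('https://example.com/gallery/index.html', 'img/cat.jpg?w=300'))
# ('https://example.com/gallery/img/cat.jpg?w=300', 'cat.jpg')
```

Taking the file name from the parsed path also keeps query strings out of the saved file names.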

# Scrape funny cat pictures

In [None]:
py_file_scraper('https://funcatpictures.com/') 
# scrape cats

Loading content from the url...
Creating content soup...
Finding tag:img...

 1 . File url https://funcatpictures.com/wp-content/uploads/2018/03/fcp2018.png
It is a valid url..
.jpg file type NOT found in url:
EXCLUDED: https://funcatpictures.com/wp-content/uploads/2018/03/fcp2018.png

 2 . File url https://funcatpictures.com/wp-content/uploads/2020/06/filmkvall.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: filmkvall.jpg

 3 . File url https://funcatpictures.com/wp-content/uploads/2020/05/funny-cat-greeting-700x859.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: funny-cat-greeting-700x859.jpg

 4 . File url https://funcatpictures.com/wp-content/uploads/2020/04/fun-cat-pictures-lazy-cat-700x881.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: fun-cat-pictures-lazy-cat-700x881.jpg

 5 . File url https://funcatpictures.com/wp-content/uploads/2020/03/funny-cat-when-you-think-youre-done-arguing-700x893.jpg
It is a valid u

In [None]:
!ls ./files

filmkvall-150x150.jpg
filmkvall.jpg
fun-cat-pictures-if-ed-sheeran-was-a-cat-150x150.jpg
fun-cat-pictures-if-ed-sheeran-was-a-cat-700x771.jpg
fun-cat-pictures-lazy-cat-150x150.jpg
fun-cat-pictures-lazy-cat-700x881.jpg
funny-cat-greeting-150x150.jpg
funny-cat-greeting-700x859.jpg
funny-cat-when-you-think-youre-done-arguing-150x150.jpg
funny-cat-when-you-think-youre-done-arguing-700x893.jpg


# Scrape real data CSV files from websites

In [None]:
py_file_scraper('http://www-eio.upc.edu/~pau/cms/rdata/datasets.html',
                html_tag='a', # R data sets
                source_tag='href', file_type='.csv',max=5)

Loading content from the url...
Creating content soup...
Finding tag:a...

 1 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/AirPassengers.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: AirPassengers.csv

 2 . File url http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/AirPassengers.html
It is a valid url..
.csv file type NOT found in url:
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/AirPassengers.html

 3 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/BJsales.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: BJsales.csv

 4 . File url http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/BJsales.html
It is a valid url..
.csv file type NOT found in url:
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/BJsales.html

 5 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/BOD.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: BOD.csv

 6 . File url http://www-e

# Extended tip: IP rotation

The website might get suspicious if a lot of requests are coming from the same IP address. Using a shared proxy, a VPN or Tor can help you get around that problem.

For example:

```python
proxies = {'http' : 'http://10.10.0.0:0000',  
          'https': 'http://120.10.0.0:0000'}
response = requests.get('https://whateverwebsite.com', proxies=proxies, timeout=5)

```

Also note the `timeout` argument: it specifies how many seconds to wait for the server to respond before giving up, so that a request cannot hang indefinitely.
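When a timeout is set, `requests` raises an exception instead of waiting forever, so it is worth catching it. The wrapper below is a minimal sketch; the name `safe_get` and the choice to return `None` on failure are assumptions for this example.

```python
import requests

def safe_get(url, timeout=5):
    """Return the response, or None if the request fails or times out."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
        return response
    except requests.exceptions.Timeout:
        print('Timed out after {} s: {}'.format(timeout, url))
    except requests.exceptions.RequestException as err:
        print('Request failed: {}'.format(err))
    return None
```

In a scraping loop you can then skip a URL whenever `safe_get(url)` returns `None` instead of having the whole run crash.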
 

By using a shared proxy, the website will see the IP address of the proxy server and not yours. A VPN connects you to another network and the IP address of the VPN provider will be sent to the website.

## Again, be NICE. Don't use IP rotation unless you know the content is cached and/or you know the website can handle your load.

## Websites often rely on IP blocking, rate limiting and other techniques to either:

a) Dissuade web scraping

b) Block large web scraping requests that turn to Denial of Service (DOS)

c) Block scraping on restricted parts of the website

## IP Blocking
- The IP address of your machine(s) is temporarily blocked. (If you are using a VPN, this often means the VPN's IP is blocked, which can inconvenience others using the VPN. Be careful.)

## Rate Limiting
- You might get a 4xx HTTP error (typically 429 Too Many Requests) if you scrape a website at a higher rate than it allows. Rate limiting is often multi-tiered, i.e. you can be limited per second (concurrent requests), per minute or per hour. Rate limits are sometimes temporary but can become permanent (IP block) if you routinely hit them.
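A common way to react to rate limiting is to wait longer after each failed attempt (exponential backoff) and to honor the `Retry-After` response header when the server sends one. The function below only computes the waiting time. Its name, the base of 1 second and the cap of 60 seconds are assumptions for this sketch, not values prescribed by any website.

```python
def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)  # server told us how long to wait
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    return min(base * 2 ** attempt, cap)  # 1, 2, 4, 8, ... seconds, capped

print([backoff_delay(n) for n in range(5)])  # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delay(2, retry_after='30'))    # 30.0
```

In a retry loop you would call `time.sleep(backoff_delay(attempt, response.headers.get('Retry-After')))` after receiving a 429 response.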

<center><img src="https://media.giphy.com/media/8abAbOrQ9rvLG/giphy.gif"></center>

## Honeypot Servers
- If you are not nice and continue scraping websites at high rates, websites can route your request to special "Honeypot" Servers, which essentially redirect your requests to servers that are designed to waste your CPU resources.

---
<a id='secBK'></a>
# Breakout problem


In this Breakout Problem you should extract live weather data in Berkeley from:

[http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971](http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971)

* Task: scrape
    * period / day (as Tonight, Friday, FridayNight etc.)
    * the temperature for the period (as Low, High)
    * the long weather description (e.g. Partly cloudy, with a low around 49..)
    
Store the scraped data strings in a Pandas DataFrame



**Hint:** The weather information is found in a div tag with `id='seven-day-forecast'`




# Appendix

<a id='sec6'></a>
# Scrape Bloomberg sitemap (XML) for current political news

In [None]:
# XML sitemaps list a site's URLs between tags and are both
# human and machine readable.
# To find the newest links, first FIND THE SITEMAP!
# News websites often have sitemaps per section (e.g. politics);
# bots that constantly track the news watch these sitemaps.

# Before scraping a website, look at its robots.txt file
bs.BeautifulSoup(requests.get('https://www.bloomberg.com/robots.txt').content,'lxml')

<html><body><p># Bot rules:
# 1. A bot may not injure a human being or, through inaction, allow a human being to come to harm.
# 2. A bot must obey orders given it by human beings except where such orders would conflict with the First Law.
# 3. A bot must protect its own existence as long as such protection does not conflict with the First or Second Law.
# If you can read this then you should apply here https://www.bloomberg.com/careers/
User-agent: *
Disallow: /polska
Disallow: /account/*

User-agent: Mediapartners-Google
Disallow: /about/careers
Disallow: /about/careers/
Disallow: /offlinemessage/
Disallow: /apps/fbk
Disallow: /bb/newsarchive/
Disallow: /apps/news

User-agent: Spinn3r
Disallow: /podcasts/
Disallow: /feed/podcast/
Disallow: /bb/avfile/

User-agent: Googlebot-News
Disallow: /sponsor/
Disallow: /news/sponsors/*

Sitemap: https://www.bloomberg.com/sitemap.xml
Sitemap: https://www.bloomberg.com/feeds/bbiz/sitemap_index.xml
Sitemap: https://www.bloomberg.com/feeds/bpol/sit
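
The rules printed above can also be checked programmatically. The sketch below uses the standard library's `urllib.robotparser` on a hand-copied excerpt of that output. `ROBOTS_EXCERPT` and the example paths are illustrative, not a complete copy of the file; for a live site you would call `set_url(...)` and `read()` instead of `parse(...)`. As far as we can tell, this parser treats rule paths as plain prefixes, so wildcard rules such as `/account/*` may not be matched the way the site intends, and the excerpt leaves them out.

```python
from urllib.robotparser import RobotFileParser

# Hand-copied excerpt of the robots.txt output shown above (assumption:
# only a subset of the rules, chosen for illustration)
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /polska

User-agent: Spinn3r
Disallow: /podcasts/
Disallow: /feed/podcast/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_EXCERPT.splitlines())

base = "https://www.bloomberg.com"
# Generic bots may not fetch /polska, but may fetch the politics sitemap
print(rp.can_fetch("*", base + "/polska"))
print(rp.can_fetch("*", base + "/feeds/bpol/sitemap_news.xml"))
# The Spinn3r bot has its own rule group
print(rp.can_fetch("Spinn3r", base + "/podcasts/some-show"))
```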

In [None]:
source = requests.get('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml').content
soup = bs.BeautifulSoup(source,'xml') # Note parser 'xml'

In [None]:
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
 <url>
  <loc>
   https://www.bloomberg.com/news/articles/2020-06-26/mexico-city-s-public-security-chief-wounded-in-shooting
  </loc>
  <news:news>
   <news:publication>
    <news:name>
     Bloomberg
    </news:name>
    <news:language>
     en
    </news:language>
   </news:publication>
   <news:publication_date>
    2020-06-28T00:31:32.800Z
   </news:publication_date>
   <news:title>
    Mexico City’s Police Chief Recovering From Bullet Wounds in Cartel Ambush
   </news:title>
   <news:keywords>
    Drugs, Foreign Affairs, Human Rights, Law, Economic Development, Android, Sport Utility Vehicles, iPhone, Government, War, Cities, National Security, Economics, ESG Concerns, Emerging Markets, Infrastructure, Mobile Phones, Automotive, Megacity, ESG, Consumer Discretion

In [None]:
# Find political news headlines
for news in soup.find_all({'news'}):
    print(news.title.text)
    print(news.publication_date.text)
    #print(news.keywords.text)
    print('\n')

Mexico City’s Police Chief Recovering From Bullet Wounds in Cartel Ambush
2020-06-28T00:31:32.800Z


Trump Tweets Wanted Posters of People in Statue-Teardown Attempt
2020-06-27T23:53:34.944Z


White House Denies Trump Briefed on Russia’s ‘Bounty’ Plan
2020-06-27T22:53:29.312Z


Pence’s Arizona, Florida Events Scrapped as Virus Cases Jump
2020-06-27T22:16:05.380Z


Pence Delays Campaign Events; U.S. Cases Jump 1.9%: Virus Update
2020-06-27T21:07:36.458Z


Trump and Biden Say Campaign Staffs Are More Than Half Women
2020-06-28T01:18:45.276Z


Florida Covid-19 Cases Rise by 9,585, Most During Pandemic
2020-06-27T19:40:39.665Z


Martin Wins Two Years to Beat Ireland’s Greatest Crisis
2020-06-27T18:03:09.383Z


China Lawmaking Body Adds HK Security Law to Agenda, NOW TV Says
2020-06-28T01:58:15.298Z


Malawi’s Opposition Leader Clinches Victory in Election Rerun
2020-06-27T21:07:36.095Z


Princeton Erases Wilson; John Wayne at Risk in OC: Protest Wrap
2020-06-28T04:55:48.835Z
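
The loop above relies on BeautifulSoup's XML parser. As an alternative sketch (a swapped-in technique, not what this notebook uses), the standard library's `xml.etree.ElementTree` can read the same structure once the namespaces are spelled out. `SAMPLE` is a shortened, hand-made excerpt of the sitemap output shown earlier, and `parse_news_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the news sitemap printed earlier (illustrative only)
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
 <url>
  <loc>https://www.bloomberg.com/news/articles/2020-06-26/mexico-city-s-public-security-chief-wounded-in-shooting</loc>
  <news:news>
   <news:publication_date>2020-06-28T00:31:32.800Z</news:publication_date>
   <news:title>Mexico City’s Police Chief Recovering From Bullet Wounds in Cartel Ambush</news:title>
  </news:news>
 </url>
</urlset>"""

# Prefix -> namespace URI, taken from the xmlns attributes of <urlset>
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def parse_news_sitemap(xml_text):
    """Return one dict per <url> entry with its link, title and date."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", NS):
        rows.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "title": url.findtext("news:news/news:title",
                                  default="", namespaces=NS).strip(),
            "published": url.findtext("news:news/news:publication_date",
                                      default="", namespaces=NS).strip(),
        })
    return rows


rows = parse_news_sitemap(SAMPLE)
for row in rows:
    print(row["published"], row["title"])
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)` if you want the headlines in a table.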




<a id='sec7'></a>
# Web crawl

Web crawling is closely related to web scraping: instead of extracting data from a single page, you follow the links of a specific website (and often its subpages) and extract meta information along the way. It can be seen as simple, recursive scraping, and it is used for web indexing (in order to build a web search engine).

## Web crawl Twitter account
**Authors:** Kunal Desai & Alexander Fred Ojala

In [None]:
import bs4
from bs4 import BeautifulSoup
import requests

In [None]:
# Helper function to maintain the urls and the number of times they appear

url_dict = dict()

def add_to_dict(url_d, key):
    if key in url_d:
        url_d[key] = url_d[key] + 1
    else:
        url_d[key] = 1

In [None]:
# Recursive function which extracts links from the given url up to a given 'depth'.
# Note: pages are not de-duplicated, so the same URL may be fetched more than once.

def get_urls(url, depth):
    if depth == 0:
        return
    r = requests.get(url, timeout=10)  # don't hang forever on a slow page
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        # Only follow absolute https links
        if link.has_attr('href') and link['href'].startswith('https://'):
            add_to_dict(url_dict, link['href'])
            get_urls(link['href'], depth - 1)

In [None]:
# Iterative version: processes URLs from a growing list and stops once more
# than 'depth' URLs have been collected (so 'depth' acts as a size limit here,
# not as a link depth).

def get_urls_iterative(url, depth):
    urls = [url]
    for current in urls:  # the list grows while we iterate over it
        r = requests.get(current, timeout=10)
        soup = BeautifulSoup(r.text, 'html.parser')
        for link in soup.find_all('a'):
            if link.has_attr('href') and link['href'].startswith('https://'):
                add_to_dict(url_dict, link['href'])
                urls.append(link['href'])
        if len(urls) > depth:
            break

In [None]:
get_urls("https://twitter.com/GolfWorld", 2)
for key in url_dict:
    print(str(key) + "  ----   " + str(url_dict[key]))
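
Both crawl functions above can fetch the same page more than once. A minimal sketch of a breadth-first crawl with a visited set and a true depth limit is shown below. `crawl`, `get_links` and `FAKE_SITE` are hypothetical names: `get_links` stands in for the `requests` + `BeautifulSoup` link extraction so that the example runs offline against a made-up link graph.

```python
from collections import deque


def crawl(start_url, max_depth, get_links):
    """Breadth-first crawl.

    Returns {url: depth at which the url was first seen}. Each url is
    expanded at most once, and pages at `max_depth` are not expanded.
    """
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth == max_depth:
            continue  # depth limit reached, do not follow links further
        for link in get_links(url):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


# Made-up link graph used instead of real HTTP requests
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
}

result = crawl("https://example.com/", 2, lambda u: FAKE_SITE.get(u, []))
for url, depth in result.items():
    print(depth, url)
# https://example.com/d is never reached: c sits at the depth limit
```

With a real site, `get_links` would wrap the request and parsing step, ideally with a pause between requests.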

<a id='sec8'></a>
# SEO: Visualize sitemap and categories in a website

**Source:** https://www.ayima.com/guides/how-to-visualize-an-xml-sitemap-using-python.html

In [None]:
# Visualize XML sitemap with categories!
import requests
from bs4 import BeautifulSoup

# Alternative site to try: 'https://www.sportchek.ca/sitemap.xml'
url = 'https://www.bloomberg.com/feeds/bpol/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))

Loaded page with: <Response [200]>
Created <class 'bs4.BeautifulSoup'> object


In [None]:
urls = [element.text for element in sitemap_index.find_all('loc')]
print(urls)

['https://www.bloomberg.com/feeds/bpol/sitemap_recent.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_news.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_video_recent.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2020_6.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2020_5.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2020_4.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2020_3.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2020_2.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2020_1.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_12.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_11.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_10.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_9.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_8.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_7.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_6.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_20

In [None]:
def extract_links(url):
    ''' Open an XML sitemap and find content wrapped in loc tags. '''

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.find_all('loc')]

    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))

Found 38,100 URLs in the sitemap


In [None]:
with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')

In [None]:
'''
Categorize a list of URLs by site path.
The file containing the URLs should exist in the working directory and be
named sitemap_urls.dat. It should contain one URL per line.
Categorization depth can be specified by executing a call like this in the
terminal (where we set the granularity depth level to 5):
    python categorize_urls.py --depth 5
The same result can be achieved by setting the categorization_depth variable
manually at the head of this file and running the script with:
    python categorize_urls.py
'''
import pandas as pd  # needed by peel_layers below


categorization_depth=3



# Main script functions


def peel_layers(urls, layers=3):
    ''' Builds a dataframe containing all unique page identifiers up
    to a specified depth and counts the number of sub-pages for each.
    Prints results to a CSV file.
    urls : list
        List of page URLs.
    layers : int
        Depth of automated URL search. Large values for this parameter
        may cause long runtimes depending on the number of URLs.
    '''

    # Store results in a dataframe
    sitemap_layers = pd.DataFrame()

    # Get base levels
    bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
    sitemap_layers[0] = bases

    # Get specified number of layers
    for layer in range(1, layers+1):

        page_layer = []
        for url, base in zip(urls, bases):
            try:
                page_layer.append(url.split(base)[-1].split('/')[layer])
            except IndexError:
                # There is nothing that deep!
                page_layer.append('')

        sitemap_layers[layer] = page_layer

    # Count and drop duplicate rows + sort
    sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
                     .rename('counts').reset_index()\
                     .sort_values('counts', ascending=False)\
                     .sort_values(list(range(0, layers)), ascending=True)\
                     .reset_index(drop=True)

    # Convert column names to string types and export
    sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
    sitemap_layers.to_csv('sitemap_layers.csv', index=False)

    # Return the dataframe
    return sitemap_layers




with open('sitemap_urls.dat', 'r') as f:
    sitemap_urls = f.read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))

print('Categorizing up to a depth of %d' % categorization_depth)
sitemap_layers = peel_layers(urls=sitemap_urls,
                             layers=categorization_depth)
print('Printed {:,} rows of data to sitemap_layers.csv'.format(len(sitemap_layers)))


Loaded 38,100 URLs
Categorizing up to a depth of 3
Printed 3,100 rows of data to sitemap_layers.csv
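
To see what "peeling layers" means for a single URL, here is a small stand-alone sketch that uses only the standard library. It performs roughly the same split as `peel_layers` (host first, then one path segment per layer, padded with empty strings) and counts identical prefixes with a `Counter` instead of a pandas `groupby`. `path_layers` is a hypothetical helper, and the second URL in the list is made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_layers(url, layers=3):
    """Return [host, segment_1, ..., segment_layers], padded with ''."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    segments = segments[:layers] + [""] * (layers - len(segments))
    return [parts.netloc] + segments


example_urls = [
    "https://www.bloomberg.com/news/articles/2020-06-26/mexico-city-s-public-security-chief-wounded-in-shooting",
    "https://www.bloomberg.com/news/articles/2020-06-27/some-other-story",  # made-up URL
    "https://www.bloomberg.com/politics",
]

print(path_layers(example_urls[0], layers=3))

# Count how many URLs share the same first two layers
counts = Counter(tuple(path_layers(u, layers=2)) for u in example_urls)
for prefix, n in counts.most_common():
    print(n, prefix)
```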


In [None]:
'''
Visualize a list of URLs by site path.
This script reads in the sitemap_layers.csv file created by the
categorize_urls.py script and builds a graph visualization using Graphviz.
Graph depth can be specified by executing a call like this in the
terminal:
    python visualize_urls.py --depth 4 --limit 10 --title "My Sitemap" --style "dark" --size "40"
The same result can be achieved by setting the variables manually at the head
of this file and running the script with:
    python visualize_urls.py
'''
from __future__ import print_function


# Set global variables

graph_depth = 3  # Number of layers deep to plot categorization
limit = 3       # Maximum number of nodes for a branch
title = ''       # Graph title
style = 'light'  # Graph style, can be "light" or "dark"
size = '8,5'     # Size of rendered PDF graph


# Import external library dependencies

import pandas as pd
import graphviz



# Main script functions

def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
    ''' Make a sitemap graph up to a specified layer depth.
    df : DataFrame
        The dataframe created by the peel_layers function
        containing sitemap information.
    layers : int
        Maximum depth to plot.
    limit : int
        The maximum number of node edge connections. Good to set this
        low for visualizing deep into site maps.
    size : str
        Size of the rendered graph, e.g. '8,5'.
    '''


    # Check to make sure we are not trying to plot too many layers.
    # Columns are the base ('0'), one per layer, and 'counts'.
    max_layers = len(df.columns) - 2
    if layers > max_layers:
        layers = max_layers
        print('There are only %d layers available to plot, setting layers=%d'
              % (layers, layers))


    # Initialize graph
    f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
    f.body.extend(['rankdir=LR', 'size="%s"' % size])


    def add_branch(f, names, vals, limit, connect_to=''):
        ''' Adds a set of nodes and edges to nodes on the previous layer. '''

        # Get the currently existing node names
        node_names = [item.split('"')[1] for item in f.body if 'label' in item]

        # Only add a new branch if it will connect to a previously created node
        if connect_to:
            if connect_to in node_names:
                for name, val in list(zip(names, vals))[:limit]:
                    f.node(name='%s-%s' % (connect_to, name), label=name)
                    f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))


    f.attr('node', shape='rectangle') # Plot nodes as rectangles

    # Add the first layer of nodes
    for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
                          .sort_values(['counts'], ascending=False).values:
        f.node(name=name, label='{} ({:,})'.format(name, counts))

    if layers == 0:
        return f

    f.attr('node', shape='oval') # Plot nodes as ovals
    f.graph_attr.update()

    # Loop over each layer adding nodes and edges to prior nodes
    for i in range(1, layers+1):
        cols = [str(i_) for i_ in range(i)]
        nodes = df[cols].drop_duplicates().values
        for j, k in enumerate(nodes):

            # Compute the mask to select correct data
            mask = True
            for j_, ki in enumerate(k):
                mask &= df[str(j_)] == ki

            # Select the data then count branch size, sort, and truncate
            data = df[mask].groupby([str(i)])['counts'].sum()\
                    .reset_index().sort_values(['counts'], ascending=False)

            # Add to the graph
            add_branch(f,
                       names=data[str(i)].values,
                       vals=data['counts'].values,
                       limit=limit,
                       connect_to='-'.join(['%s']*i) % tuple(k))

            print(('Built graph up to node %d / %d in layer %d' % (j, len(nodes), i))\
                    .ljust(50), end='\r')

    return f


def apply_style(f, style, title=''):
    ''' Apply the style and add a title if desired. More styling options are
    documented here: http://www.graphviz.org/doc/info/attrs.html#d:style
    f : graphviz.dot.Digraph
        The graph object as created by graphviz.
    style : str
        Available styles: 'light', 'dark'
    title : str
        Optional title placed at the bottom of the graph.
    '''

    dark_style = {
        'graph': {
            'label': title,
            'bgcolor': '#3a3a3a',
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'white',
        },
        'nodes': {
            'style': 'filled',
            'color': 'white',
            'fillcolor': 'black',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'white',
        },
        'edges': {
            'color': 'white',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'white',
        }
    }

    light_style = {
        'graph': {
            'label': title,
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'black',
        },
        'nodes': {
            'style': 'filled',
            'color': 'black',
            'fillcolor': '#dbdddd',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'black',
        },
        'edges': {
            'color': 'black',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'black',
        }
    }

    if style == 'light':
        chosen_style = light_style
    elif style == 'dark':
        chosen_style = dark_style
    else:
        raise ValueError("style must be 'light' or 'dark', got %r" % style)

    f.graph_attr = chosen_style['graph']
    f.node_attr = chosen_style['nodes']
    f.edge_attr = chosen_style['edges']

    return f




# Read in categorized data
sitemap_layers = pd.read_csv('sitemap_layers.csv', dtype=str)
# Convert numerical column to integer
sitemap_layers.counts = sitemap_layers.counts.apply(int)
print('Loaded {:,} rows of categorized data from sitemap_layers.csv'\
        .format(len(sitemap_layers)))

print('Building %d layer deep sitemap graph' % graph_depth)
f = make_sitemap_graph(sitemap_layers, layers=graph_depth,
                       limit=limit, size=size)
f = apply_style(f, style=style, title=title)

f.render(cleanup=True)
print('Exported graph to sitemap_graph_%d_layer.pdf' % graph_depth)




Loaded 3,100 rows of categorized data from sitemap_layers.csv
Building 3 layer deep sitemap graph
Exported graph to sitemap_graph_3_layer.pdf
