<img src="http://i67.tinypic.com/2jcbwcw.png" align="left"></img><br><br><br><br>


## SOLUTIONS Breakout Lecture 8: Web scraping & web crawling

**Author List**: Alexander Fred Ojala

**Original Sources**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ & https://www.dataquest.io/blog/web-scraping-tutorial-python/

**License**: Feel free to do whatever you want to with this code

**Compatibility:** Python 2.x and 3.x

# Table of Contents
(Clickable document links)
___

### [0: Pre-setup](#sec0)
Document setup and Python 2 and Python 3 compatibility

### [1: Simple webscraping intro](#sec1)

Simple example of webscraping on a premade HTML template

### [2: Scrape Data-X Schedule](#sec2)

Find and scrape the current Data-X schedule. 

### [3: Scrape Images and Files](#sec3)

Scrape a website for images, PDFs, CSV data or any other file type.

## [Breakout Problem: Scrape Weather Data](#sec4)

Scrape real time weather data in Berkeley.


### [Appendix](#sec5)

#### [Scrape Bloomberg sitemap for political news headlines](#sec6)

#### [Webcrawl Twitter, recursive URL link fetcher + depth](#sec7)

#### [SEO, visualize website categories as a tree](#sec8)

<a id='sec0'></a>
## Pre-Setup

In [1]:
# stretch Jupyter coding blocks to fit screen
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>")) # if 100% it would fit the screen

In [2]:
# make it run on py2 and py3
from __future__ import division, print_function

<a id='sec1'></a>
# Webscraping intro

To scrape content from a website we first need to download its HTML. This can be done with the Python library **requests** (using its `.get` method).

To extract specific information from the page we use the scraping library **BeautifulSoup4** (`import bs4`). BeautifulSoup works on a soup object, which we create from the HTML source code of the website.

In [3]:
import requests # The requests library is an HTTP library for getting content and posting etc.
import bs4 as bs # BeautifulSoup4 is a Python library for pulling data out of HTML and XML code.

# Scraping a simple website

In [4]:
source = requests.get("https://alexanderfo.github.io") # a GET request will download the HTML webpage.
print(source) # If <Response [200]> then the website has been downloaded successfully

<Response [200]>


**Different types of responses:**
Generally, a status code starting with 2 indicates success, while a status code starting with 4 or 5 indicates an error.
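
The status-code classes can be made explicit with a small helper. This is a minimal sketch using only the Python 3 standard library; `describe_status` is a hypothetical helper name and is not part of requests.

```python
from http import HTTPStatus  # Python 3 only

def describe_status(code):
    """Classify an HTTP status code by its first digit."""
    classes = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. 'OK' for 200
    return '{} {} ({})'.format(code, phrase, classes[code // 100])

for code in (200, 301, 404, 503):
    print(describe_status(code))
```

With requests, the integer code is available as `source.status_code`, and `source.raise_for_status()` raises an exception for 4xx and 5xx responses.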

In [5]:
print(source.content) # This is the HTML content of the website, as you can see it's quite hard to decipher

<!DOCTYPE html>

<head>
	<title>Data-X: Simple Git website</title>
	<meta name="author" content="afo" />
</head>

<!-- Website starts here" -->

<body>
<br><br><br><br>
	<div>

		<center>

			<h1 class="header" id="head1">Data-X Lecture 8<br></h1>

			<h2>Record Attendance at: </h2>

			<h3><a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a></h3>


			<br>
			<p class="regular" id="first"> Here is a paragraph of random text </p>

			<p class="regular" id="second"> Second paragraph of random text </p>

			<p class="italic"> <i>Third paragraph</i> </p>

		</center>
	
	</div>


</body>
</html>


In [6]:
print(type(source.content)) # bytes in Python 3, str in Python 2: .content is the raw, undecoded response body

<type 'str'>


In [7]:
# Read in source.content to beautifulsoup 
# beautifulsoup can parse (extract specific information) HTML code

soup = bs.BeautifulSoup(source.content ,features='lxml') # we pass in the source and choose a parser 

# features specifies what type of code we are parsing, here 'lxml' specifies an HTML parser

In [8]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [9]:
print(soup) # This is the HTML code of the website, decoded as a beautiful soup object

<!DOCTYPE html>
<html><head>
<title>Data-X: Simple Git website</title>
<meta content="afo" name="author"/>
</head>
<!-- Website starts here" -->
<body>
<br/><br/><br/><br/>
<div>
<center>
<h1 class="header" id="head1">Data-X Lecture 8<br/></h1>
<h2>Record Attendance at: </h2>
<h3><a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a></h3>
<br/>
<p class="regular" id="first"> Here is a paragraph of random text </p>
<p class="regular" id="second"> Second paragraph of random text </p>
<p class="italic"> <i>Third paragraph</i> </p>
</center>
</div>
</body>
</html>


In [10]:
# Suppose we want to extract content that is shown on the website

print(soup.body) # This is the main content of the website, located within the <body> tag

<body>
<br/><br/><br/><br/>
<div>
<center>
<h1 class="header" id="head1">Data-X Lecture 8<br/></h1>
<h2>Record Attendance at: </h2>
<h3><a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a></h3>
<br/>
<p class="regular" id="first"> Here is a paragraph of random text </p>
<p class="regular" id="second"> Second paragraph of random text </p>
<p class="italic"> <i>Third paragraph</i> </p>
</center>
</div>
</body>


In [11]:
print(soup.title) # Title of the website
print(soup.find('title')) # same as .title

<title>Data-X: Simple Git website</title>
<title>Data-X: Simple Git website</title>


In [12]:
# If we want to extract specific text
print(soup.find('p')) # will only return first <p> tag

<p class="regular" id="first"> Here is a paragraph of random text </p>


In [13]:
print(soup.find('p').text) # extracts the string within the <p> tag

 Here is a paragraph of random text 


In [14]:
# If we want to extract all <p> tags
print(soup.find_all('p')) # returns list of all <p> tags

[<p class="regular" id="first"> Here is a paragraph of random text </p>, <p class="regular" id="second"> Second paragraph of random text </p>, <p class="italic"> <i>Third paragraph</i> </p>]


In [15]:
print(soup.find(class_='header')) # we can also search for classes within all tags, using class_
print(soup.find(id='second'))
# note: the trailing _ is needed because class is a reserved keyword in Python

<h1 class="header" id="head1">Data-X Lecture 8<br/></h1>
<p class="regular" id="second"> Second paragraph of random text </p>


In [16]:
print(soup.find_all(class_='regular'))

[<p class="regular" id="first"> Here is a paragraph of random text </p>, <p class="regular" id="second"> Second paragraph of random text </p>]
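
The same lookups can also be written as CSS selectors with `select` and `select_one`. The sketch below runs on an inline copy of the paragraphs from the example page, so it needs no network access; `demo_soup` is a name chosen here for illustration, and the built-in `'html.parser'` is used instead of `'lxml'` only to avoid the extra dependency.

```python
import bs4 as bs

html = """
<h1 class="header" id="head1">Data-X Lecture 8</h1>
<p class="regular" id="first"> Here is a paragraph of random text </p>
<p class="regular" id="second"> Second paragraph of random text </p>
<p class="italic"> <i>Third paragraph</i> </p>
"""

demo_soup = bs.BeautifulSoup(html, 'html.parser')

regular = demo_soup.select('p.regular')    # all <p> tags with class "regular"
second = demo_soup.select_one('#second')   # first tag with id "second"

print([p.get('id') for p in regular])
print(second.get_text(strip=True))
```

Selectors are convenient when a tag name, class and id need to be combined in one query.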


In [17]:
for p in soup.find_all('p'): # print all p tags in the list
    print(p.text)

 Here is a paragraph of random text 
 Second paragraph of random text 
 Third paragraph 


In [18]:
# Extract links / urls
# Links in HTML are usually coded as <a href="url">, where url is the link target

print(soup.a)
print(type(soup.a))


<a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a>
<class 'bs4.element.Tag'>


In [19]:
# if we only want the link
attendance_link = soup.find('a').get('href') # get the string stored in the 'href' attribute of the <a> tag
print("To record attendance for today's lecture go to: ",attendance_link) # then we have extracted the link

To record attendance for today's lecture go to:  https://goo.gl/77iPL2


<a id='sec2'></a>

# Scrape the current Syllabus Schedule from the Data-X website


In [20]:
source = requests.get('https://data-x.blog/').content # get the source content

In [21]:
soup = bs.BeautifulSoup(source,'lxml')

In [22]:
print(soup.prettify()) # .prettify() method makes the HTML code more readable

# as you can see this code is more difficult to read than the simple example above

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <link href="https://data-x.blog/xmlrpc.php" rel="pingback"/>
  <title>
   Data-X – A Public and Open Website for the Data-X Course at UC Berkeley.
  </title>
  <script src="https://r-login.wordpress.com/remote-login.php?action=js&amp;host=data-x.blog&amp;id=120928364&amp;t=1489113457&amp;back=https%3A%2F%2Fdata-x.blog%2F" type="text/javascript">
  </script>
  <script type="text/javascript">
   /* <![CDATA[ */
			if ( 'function' === typeof WPRemoteLogin ) {
				document.cookie = "wordpress_test_cookie=test; path=/";
				if ( document.cookie.match( /(;|^)\s*wordpress_test_cookie\=/ ) ) {
					WPRemoteLogin();
				}
			}
		/* ]]> */
  </script>
  <link href="//s2.wp.com" rel="dns-prefetch"/>
  <link href="//s0.wp.com" rel="dns-prefetch"/>
  <link href="//datax911.wordpress.com" rel="dns-prefetch"/>
  <link href="/

In [23]:
print(soup.find('title').text) # we are at the correct website

Data-X – A Public and Open Website for the Data-X Course at UC Berkeley.


In [24]:
for p in soup.find_all('p'):
    print(p.text)

Instructor: Ikhlaq Sidhu, IEOR, UC Berkeley (contact)
You can find all the resources and code samples for Data-X on this page.  This content for this course is drawn from open source tools and publicly available materials.
At UC Berkeley, this course is 3 units, limited to 55 students in Spring 2017
Thursdays: 5:00 to 7:59 pm in 3108 Etcheverry Hall
In Spring, 2017, the course is run as an experimental section.
Suggestions for Data-X project may be submitted here:
https://goo.gl/forms/h6cAxZS3Il2F0k4F2
Data-X Breadth Perspectives:
Ref B01: Why you’re not getting value from your data science
Syllabus: Click Here
Getting Started:

Course Materials:
Lectures: 
Course Introduction (download)
Remaining Lectures, Homework, and Notebooks to be posted here
Cookbook Code Samples:
Follow this link
Coding Questions: Try Stack Overflow and/or simply ask Google
CS Tools Reference Materials:
Ref CS01: Python Quick Reference Guide, Python Review from Data 8
and Python Data Structures for 2.7.
Ref CS0

In [25]:
navigation_bar = soup.find('nav')
print(navigation_bar)

<nav class="main-navigation" id="site-navigation" role="navigation">
<h1 class="menu-toggle">Menu</h1>
<div class="screen-reader-text skip-link"><a href="#content" title="Skip to content">Skip to content</a></div>
<div class="menu-primary-container"><ul class="menu" id="menu-primary"><li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-item current_page_item menu-item-8" id="menu-item-8"><a href="/">Home</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item menu-item-9" id="menu-item-9"><a href="https://data-x.blog/">About</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-102" id="menu-item-102"><a href="https://data-x.blog/syllabus-data-x/">Syllabus: Data-X</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-129" id="menu-item-129"><a href="https://data-x.blog/breakouts/">Breakouts</a></li>
<l

In [26]:
# Now we want to find the Syllabus, however we are at the root web page, not displaying the syllabus
# Get links from the data-x website
for url in navigation_bar.find_all('a'): # look for links in the navigation bar. Tag <nav>
    link = url.get('href')
    if 'data-x.blog' in link:
        print(link) # we see that the syllabus is located at the url https://data-x.blog/syllabus-data-x/
        if 'syllabus' in link:
            syllabus_url = link

https://data-x.blog/
https://data-x.blog/syllabus-data-x/
https://data-x.blog/breakouts/
https://data-x.blog/contact/


In [27]:
print(syllabus_url)

https://data-x.blog/syllabus-data-x/


In [28]:
# Open new connection to the syllabus url. Replace soup object.
source = requests.get(syllabus_url).content
soup = bs.BeautifulSoup(source, 'lxml') # 'lxml' parser better for tables, very similar to 'html.parser'

print(soup.body.prettify()) # we can see that the table is stored within <td> tags

<body class="page-template page-template-page-templates page-template-full-width-page page-template-page-templatesfull-width-page-php page page-id-94 custom-background mp6 customizer-styles-applied not-multi-author display-header-text highlander-enabled highlander-light">
 <div class="hfeed site" id="page">
  <header class="site-header" id="masthead" role="banner">
   <div class="site-branding">
    <div class="site-image">
     <a class="header-image-link" href="https://data-x.blog/" rel="home" title="Data-X">
      <img alt="" height="154" src="https://datax911.files.wordpress.com/2016/12/cropped-banner_matrix1.jpg" width="912"/>
     </a>
    </div>
    <!-- .header-image -->
    <h1 class="site-title">
     <a href="https://data-x.blog/" rel="home" title="Data-X">
      Data-X
     </a>
    </h1>
    <h2 class="site-description">
     A Public and Open Website for the Data-X Course at UC Berkeley.
    </h2>
   </div>
   <!-- .site-branding -->
   <nav class="main-navigation" id="si

### Finding the course schedule table
Data on a website is often stored in tables: a `<table>` tag holds the rows (`<tr>`), and each row holds its cells in `<td>` tags. Here we want to extract the information in the Data-X syllabus.

In [29]:
# We can also get the table
table = soup.find('table')
print(table.prettify()) #HTML code of the table

<table width="518">
 <tbody>
  <tr>
   <td width="36">
    <strong>
     Lec #
    </strong>
   </td>
   <td width="131">
    <strong>
     Topic
    </strong>
   </td>
   <td width="72">
    <strong>
     Tools
    </strong>
   </td>
   <td width="108">
    <strong>
     Cookbook Examples
    </strong>
   </td>
   <td width="68">
    <strong>
     HW DUE
    </strong>
   </td>
   <td width="104">
    <strong>
     Lab
     <br/>
    </strong>
   </td>
  </tr>
  <tr>
   <td width="36">
    <strong>
     1
    </strong>
    <p>
     Jan 19
    </p>
   </td>
   <td width="131">
    Introduction: Overview of Frameworks for obtaining insights from data (Slides)
    <p>
     Slides: Python and Math/Probability Pre-requisites
    </p>
   </td>
   <td width="72">
    Anaconda, Python
   </td>
   <td width="108">
    Setting up Anaconda Environment
   </td>
   <td width="68">
    HW 1 Assigned
   </td>
   <td width="104">
   </td>
  </tr>
  <tr>
   <td width="36">
    <strong>
     2
    </str

In [30]:
# A new row in an HTML table starts with <tr> tag
# A new column entry is defined by <td> tag

In [31]:
table_result = list()
for row in table.find_all('tr'):
    row_cells = row.find_all('td') # find all table data
    row_entries = [cell.text for cell in row_cells]
    print(row_entries) 
    table_result.append(row_entries)# get all the table data into a list

[u'Lec #', u'Topic', u'Tools', u'Cookbook Examples', u'HW DUE', u'Lab\n']
[u'1\nJan 19', u'Introduction: Overview of Frameworks for obtaining insights from data (Slides)\nSlides: Python and Math/Probability Pre-requisites', u'Anaconda, Python', u'Setting up Anaconda Environment', u'HW 1 Assigned', u'']
[u'2\nJan 26', u'Notebook: Python Numpy Notebook\nSlides: Data Structure Outline\nSlides: Numpy Review', u'Python, Numpy, Pandas, JSON formatted files', u'Earthquake Data live query\nExample with JSON file', u'Bring 3 ideas to next class\nHW 1 Due', u'Form Teams']
[u'3\nFeb 2', u'Data signals in Tables.\xa0 Slides: Pandas Overview\nNotebook: Pandas Intro\nNotebook: Pandas and Stock Market', u'Pandas, Numpy, SciPy, Matplotlib', u'Stock market live download to Pandas DataFrame. Quant trading algorithm', u'HW 2 Due', u'Form Teams']
[u'4\nFeb 9', u'Scoring, Linear Prediction and Max Likelihood Prediction. Extending to multiple variables', u'Numpy, SciPy, Matplotlib', u'Code samples: 2 variab

In [32]:
# We can also read it in to a Pandas DataFrame
import pandas as pd    
df = pd.DataFrame(table_result)
df.head()

Unnamed: 0,0,1,2,3,4,5
0,Lec #,Topic,Tools,Cookbook Examples,HW DUE,Lab\n
1,1\nJan 19,Introduction: Overview of Frameworks for obtai...,"Anaconda, Python",Setting up Anaconda Environment,HW 1 Assigned,
2,2\nJan 26,Notebook: Python Numpy Notebook\nSlides: Data ...,"Python, Numpy, Pandas, JSON formatted files",Earthquake Data live query\nExample with JSON ...,Bring 3 ideas to next class\nHW 1 Due,Form Teams
3,3\nFeb 2,Data signals in Tables. Slides: Pandas Overvi...,"Pandas, Numpy, SciPy, Matplotlib",Stock market live download to Pandas DataFrame...,HW 2 Due,Form Teams
4,4\nFeb 9,"Scoring, Linear Prediction and Max Likelihood ...","Numpy, SciPy, Matplotlib",Code samples: 2 variable and multi-variable Li...,HW 3 Due,Validate and Adjust
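
The DataFrame above still has the header as row 0 and stray `\n` characters in the cells. A minimal sketch of one way to clean this up while scraping, run on a small made-up table with the same layout as the syllabus (`demo_table`, `header`, `body` and `records` are names introduced here for illustration):

```python
import bs4 as bs

html = """
<table>
  <tr><td><strong>Lec #</strong></td><td><strong>Topic</strong></td></tr>
  <tr><td><strong>1</strong><p>Jan 19</p></td><td>Introduction</td></tr>
  <tr><td><strong>2</strong><p>Jan 26</p></td><td>Numpy</td></tr>
</table>
"""

demo_table = bs.BeautifulSoup(html, 'html.parser').find('table')

rows = []
for tr in demo_table.find_all('tr'):
    # join the text pieces of each cell with a space and strip whitespace
    rows.append([td.get_text(' ', strip=True) for td in tr.find_all('td')])

header, body = rows[0], rows[1:]                     # first row holds the column names
records = [dict(zip(header, row)) for row in body]   # one dict per table row
print(records)
```

Passing the pieces to pandas as `pd.DataFrame(body, columns=header)` would then give a DataFrame with named columns.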


In [33]:
# Pandas can also grab tables from a website automatically

import pandas as pd

# requires html5lib: 
#!conda install --yes html5lib
dfs = pd.read_html('https://data-x.blog/syllabus-data-x/',header=0) # returns a list of all tables at url
# header = 0, indicates that first row is header



In [34]:
print(type(dfs)) #list of tables
print(len(dfs)) # we only have one table
print(type(dfs[0])) # stored as DataFrame
df = dfs[0]

<type 'list'>
1
<class 'pandas.core.frame.DataFrame'>


In [35]:
# Looks great
df.head(4)

Unnamed: 0,Lec #,Topic,Tools,Cookbook Examples,HW DUE,Lab
0,1 Jan 19,Introduction: Overview of Frameworks for obtai...,"Anaconda, Python",Setting up Anaconda Environment,HW 1 Assigned,
1,2 Jan 26,Notebook: Python Numpy Notebook Slides: Data S...,"Python, Numpy, Pandas, JSON formatted files",Earthquake Data live query Example with JSON file,Bring 3 ideas to next class HW 1 Due,Form Teams
2,3 Feb 2,Data signals in Tables. Slides: Pandas Overvi...,"Pandas, Numpy, SciPy, Matplotlib",Stock market live download to Pandas DataFrame...,HW 2 Due,Form Teams
3,4 Feb 9,"Scoring, Linear Prediction and Max Likelihood ...","Numpy, SciPy, Matplotlib",Code samples: 2 variable and multi-variable Li...,HW 3 Due,Validate and Adjust


<a id='sec3'></a>
# Scrape images and other files

In [36]:
# As we can see there are two images on the data-x syllabus site that we might want to download
# Images are displayed with the <img> tag in HTML

print(soup.find('img')) # as we can see below the image urls are stored as the src inside the img tag

<img alt="" height="154" src="https://datax911.files.wordpress.com/2016/12/cropped-banner_matrix1.jpg" width="912"/>


In [37]:
# Parse the URLs of all images
img_urls = list()
for img in soup.find_all('img'): 
    img_url = img.get('src') 
    print(img_url) # we only want images with .jpg extension
    if '.jpg' in img_url:
        img_urls.append(img_url)
    

https://datax911.files.wordpress.com/2016/12/cropped-banner_matrix1.jpg
https://datax911.files.wordpress.com/2017/01/course-model.jpg?w=1032
https://sb.scorecardresearch.com/p?c1=2&c2=7518284&c3=&c4=&c5=&c6=&c15=&cv=2.0&cj=1
https://pixel.wp.com/b.gif?v=noscript


In [38]:
print(img_urls)

['https://datax911.files.wordpress.com/2016/12/cropped-banner_matrix1.jpg', 'https://datax911.files.wordpress.com/2017/01/course-model.jpg?w=1032']
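
The substring test `'.jpg' in img_url` also matches URLs where `.jpg` appears somewhere other than the file extension. A stricter check, sketched here with the standard library, looks only at the path part of the URL; `has_extension` is a hypothetical helper and the third URL is made up for the demonstration.

```python
import os
from urllib.parse import urlparse  # Python 3; in Python 2: from urlparse import urlparse

def has_extension(url, ext):
    """Return True if the path part of `url` ends with the extension `ext`."""
    path = urlparse(url).path  # drops the query string, e.g. '?w=1032'
    return os.path.splitext(path)[1].lower() == ext.lower()

urls = [
    'https://datax911.files.wordpress.com/2017/01/course-model.jpg?w=1032',
    'https://pixel.wp.com/b.gif?v=noscript',
    'https://example.com/page?file=photo.jpg',  # made-up URL: '.jpg' only in the query
]
jpg_urls = [u for u in urls if has_extension(u, '.jpg')]
print(jpg_urls)
```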


In [39]:
# To download and save files with Python we can use the shutil library
# which is a file operations library

import shutil

for idx, img_url in enumerate(img_urls): # enumerate to create an integer file name for every image
    
    img_source = requests.get(img_url, stream=True) 
    # we set stream = True to download/ stream the content of the data
    
    with open('img'+str(idx)+'.jpg', 'wb') as file: # open file connection, create file and write to it
        shutil.copyfileobj(img_source.raw, file) # save the raw file object

    del img_source # to remove the file from memory

## Scraping function to download files of any type from a website

In [40]:
# Extended scraping function of any file format
import os # To format file name
import shutil # To copy file object from python to disk
import requests
import bs4 as bs

def py_file_scraper(url, html_tag='img', source_tag='src', file_type='.jpg',max=-1):
    
    '''
    Function that scrapes a website for certain file formats.
    The files will be placed in a folder called "files" in the working directory.
    
    url = the url we want to scrape from
    html_tag = the file tag (usually img for images or a for file links)
    source_tag = the source tag for the file url (usually src for images or href for files)
    file_type = .png, .jpg, .pdf, .csv, .xls etc.
    max = integer (max number of files to scrape, if = -1 it will scrape all files)
    '''
    
    # make a directory called 'files' for the files if it does not exist
    if not os.path.exists('files/'):
        os.makedirs('files/')

    source = requests.get(url).content
    soup = bs.BeautifulSoup(source,'lxml')
    
    i=0
    for link in soup.find_all(html_tag):
        file_url=link.get(source_tag)
        
        
        if file_url and 'http' in file_url: # skip tags without the attribute and keep only absolute links

            if file_type in file_url: #only check for specific file type

                file_name = os.path.splitext(os.path.basename(file_url))[0] + file_type 
                #extract file name from url

                file_source = requests.get(file_url, stream = True)
                # open new stream connection

                with open('./files/'+file_name, 'wb') as file: 
                    # open file connection, create file and write to it
                    shutil.copyfileobj(file_source.raw, file) # save the raw file object
                    print('DOWNLOADED:',file_name)
                    
                    i+=1
                    
                del file_source # delete from memory
            else:
                print('EXCLUDED:',file_url) # urls not downloaded from
                
        if i==max:
            print('Max reached')
            break
            

    print('Done!')

In [169]:
py_file_scraper('https://data-x.blog/syllabus-data-x/') # scrape images from data-x syllabus

DOWNLOADED: cropped-banner_matrix1.jpg
DOWNLOADED: course-model.jpg
EXCLUDED: https://sb.scorecardresearch.com/p?c1=2&c2=7518284&c3=&c4=&c5=&c6=&c15=&cv=2.0&cj=1
EXCLUDED: https://pixel.wp.com/b.gif?v=noscript
Done!
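
Because the scraper only follows links that contain `http`, relative links are skipped. One way to include them, sketched here as an assumption rather than as part of `py_file_scraper`, is to resolve every link against the page URL with `urljoin`; the `hrefs` values are hypothetical.

```python
from urllib.parse import urljoin  # Python 3; in Python 2: from urlparse import urljoin

base_url = 'https://data-x.blog/syllabus-data-x/'

# Hypothetical href values as they might appear in a page
hrefs = ['/contact/', 'files/notes.pdf', 'https://goo.gl/77iPL2']

absolute = [urljoin(base_url, href) for href in hrefs]  # absolute links stay unchanged
for url in absolute:
    print(url)
```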


In [170]:
# scrape PDFs from data-x site
py_file_scraper('https://data-x.blog/',html_tag='a',source_tag='href',file_type='.pdf',max=3)

EXCLUDED: https://data-x.blog/
EXCLUDED: https://data-x.blog/
EXCLUDED: https://data-x.blog/
EXCLUDED: https://data-x.blog/syllabus-data-x/
EXCLUDED: https://data-x.blog/breakouts/
EXCLUDED: https://data-x.blog/contact/
EXCLUDED: http://scet.berkeley.edu/data-x-course/
EXCLUDED: https://data-x.blog/contact/
EXCLUDED: https://goo.gl/forms/h6cAxZS3Il2F0k4F2
EXCLUDED: https://goo.gl/forms/h6cAxZS3Il2F0k4F2
DOWNLOADED: why-you_re-not-getting-value-from-your-data-science.pdf
EXCLUDED: https://data-x.blog/syllabus-data-x/
DOWNLOADED: installation-pre-reqs-osx_v5.pdf
DOWNLOADED: installation-pre-reqs-windows_v3.pdf
Max reached
Done!


In [171]:
# scrape csv files from website
py_file_scraper('http://www-eio.upc.edu/~pau/cms/rdata/datasets.html',html_tag='a', # R data sets
                source_tag='href', file_type='.csv',max=5)

DOWNLOADED: AirPassengers.csv
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/AirPassengers.html
DOWNLOADED: BJsales.csv
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/BJsales.html
DOWNLOADED: BOD.csv
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/BOD.html
DOWNLOADED: Formaldehyde.csv
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/Formaldehyde.html
DOWNLOADED: HairEyeColor.csv
Max reached
Done!


---
<a id='sec4'></a>
# Breakout problem


In this week's breakout you should extract live weather data in Berkeley from:

[http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971](http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971)

* Task: scrape
    * the period / day (e.g. Tonight, Friday, FridayNight etc.)
    * the temperature for the period (e.g. Low, High)
    * the long weather description (e.g. Partly cloudy, with a low around 49..)
    
Store the scraped data strings in a Pandas DataFrame



**Hint:** The weather information is found in a div tag with `id='seven-day-forecast'`



# Breakout solution

In [49]:
import requests
import bs4 as bs
import pandas as pd

source = requests.get('http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971').content
soup = bs.BeautifulSoup(source,features='lxml')

In [50]:
forecast = soup.find(id='seven-day-forecast')

In [51]:
print(forecast.prettify())

<div class="panel panel-default" id="seven-day-forecast">
 <div class="panel-heading">
  <b>
   Extended Forecast for
  </b>
  <h2 class="panel-title">
   Berkeley CA
  </h2>
 </div>
 <div class="panel-body" id="seven-day-forecast-body">
  <div id="seven-day-forecast-container">
   <ul class="list-unstyled" id="seven-day-forecast-list">
    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       Tonight
       <br/>
       <br/>
      </p>
      <p>
       <img alt="Tonight: Partly cloudy, with a low around 49. Southwest wind around 5 mph becoming calm  in the evening. " class="forecast-icon" src="newimages/medium/nsct.png" title="Tonight: Partly cloudy, with a low around 49. Southwest wind around 5 mph becoming calm  in the evening. "/>
      </p>
      <p class="short-desc">
       Partly Cloudy
      </p>
      <p class="temp temp-low">
       Low: 49 °F
      </p>
     </div>
    </li>
    <li class="forecast-tombstone">
     <div

In [52]:
day = [d.text for d in forecast.find_all(class_='period-name')]
temp = [temp.text for temp in forecast.find_all(class_='temp')]
desc = forecast.find_all('img')

In [53]:
print(day)
print()
print(temp)

[u'Tonight', u'Friday', u'FridayNight', u'Saturday', u'SaturdayNight', u'Sunday', u'SundayNight', u'Monday', u'MondayNight']

[u'Low: 49 \xb0F', u'High: 70 \xb0F', u'Low: 50 \xb0F', u'High: 70 \xb0F', u'Low: 50 \xb0F', u'High: 72 \xb0F', u'Low: 49 \xb0F', u'High: 71 \xb0F', u'Low: 51 \xb0F']


In [54]:
# extract weather description
desc_list=list()
for txt in desc:
    print(txt.get('alt'))
    desc_list.append(txt.get('alt'))

Tonight: Partly cloudy, with a low around 49. Southwest wind around 5 mph becoming calm  in the evening. 
Friday: Mostly sunny, with a high near 70. Calm wind becoming west around 6 mph in the afternoon. 
Friday Night: Partly cloudy, with a low around 50. Light north wind. 
Saturday: Mostly sunny, with a high near 70. Light southwest wind. 
Saturday Night: Partly cloudy, with a low around 50. West southwest wind around 6 mph becoming light and variable  in the evening. 
Sunday: Sunny, with a high near 72.
Sunday Night: Partly cloudy, with a low around 49.
Monday: Mostly sunny, with a high near 71.
Monday Night: Mostly cloudy, with a low around 51.


In [58]:
pd.set_option('display.max_colwidth', None) # to print full results (on pandas versions before 1.0 use -1 instead of None)
df = pd.DataFrame({'day':day,'temp':temp,'desc':desc_list})
print('Berkeley 7 day weather forecast')
df

Berkeley 7 day weather forecast


Unnamed: 0,day,desc,temp
0,Tonight,"Tonight: Partly cloudy, with a low around 49. Southwest wind around 5 mph becoming calm in the evening.",Low: 49 °F
1,Friday,"Friday: Mostly sunny, with a high near 70. Calm wind becoming west around 6 mph in the afternoon.",High: 70 °F
2,FridayNight,"Friday Night: Partly cloudy, with a low around 50. Light north wind.",Low: 50 °F
3,Saturday,"Saturday: Mostly sunny, with a high near 70. Light southwest wind.",High: 70 °F
4,SaturdayNight,"Saturday Night: Partly cloudy, with a low around 50. West southwest wind around 6 mph becoming light and variable in the evening.",Low: 50 °F
5,Sunday,"Sunday: Sunny, with a high near 72.",High: 72 °F
6,SundayNight,"Sunday Night: Partly cloudy, with a low around 49.",Low: 49 °F
7,Monday,"Monday: Mostly sunny, with a high near 71.",High: 71 °F
8,MondayNight,"Monday Night: Mostly cloudy, with a low around 51.",Low: 51 °F
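
The temperatures are stored as strings such as `Low: 49 °F`. If numeric values are needed, a regular expression can split each string into its label and its number. A minimal sketch, assuming every string follows that format (`parse_temp` is a hypothetical helper):

```python
import re

temps = ['Low: 49 °F', 'High: 70 °F', 'Low: 50 °F']  # same format as the scraped strings

def parse_temp(text):
    """Split a string like 'Low: 49 °F' into ('Low', 49)."""
    match = re.match(r'(\w+):\s*(-?\d+)', text)
    if match is None:
        return None  # unexpected format
    return match.group(1), int(match.group(2))

parsed = [parse_temp(t) for t in temps]
print(parsed)
```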


In [56]:
pd.options.display.max_colwidth=50 #change back to default max col_width

<a id='sec5'></a>
# Appendix

<a id='sec6'></a>
# Scrape Bloomberg sitemap (XML) for current political news

In [59]:
# Sitemaps are XML documents that list the URLs of a website between tags.
# XML is both human and machine readable.
# To find the newest links on a site, look for its sitemap.
# News websites often publish separate sitemaps per topic (e.g. politics);
# bots that track the news poll these sitemaps constantly.

# Before scraping a website look at robots.txt file
bs.BeautifulSoup(requests.get('https://www.bloomberg.com/robots.txt').content,'lxml')

<html><body><p># Bot rules:\n# 1. A bot may not injure a human being or, through inaction, allow a human being to come to harm.\n# 2. A bot must obey orders given it by human beings except where such orders would conflict with the First Law.\n# 3. A bot must protect its own existence as long as such protection does not conflict with the First or Second Law.\n# If you can read this then you should apply here https://www.bloomberg.com/careers/\nUser-agent: *\nDisallow: /news/live-blog/2016-03-11/bank-of-japan-monetary-policy-decision-and-kuroda-s-briefing\nDisallow: /polska\nUser-agent: Mediapartners-Google*\nDisallow: /about/careers\nDisallow: /about/careers/\nDisallow: /offlinemessage/\nDisallow: /apps/fbk\nDisallow: /bb/newsarchive/\nDisallow: /apps/news\nSitemap: https://www.bloomberg.com/feeds/bbiz/sitemap_index.xml\nSitemap: https://www.bloomberg.com/feeds/bpol/sitemap_index.xml\nSitemap: https://www.bloomberg.com/feeds/bview/sitemap_index.xml\nSitemap: https://www.bloomberg.com/fe
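
The rules in a robots.txt file can also be checked programmatically with the standard library. This sketch parses two rules copied from the output above as plain text, so it makes no network request; the user-agent name `my-course-bot` is made up.

```python
from urllib.robotparser import RobotFileParser  # Python 3 standard library

# Rules for all user agents, copied from the robots.txt output above
robots_txt = """User-agent: *
Disallow: /news/live-blog/2016-03-11/bank-of-japan-monetary-policy-decision-and-kuroda-s-briefing
Disallow: /polska
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse from text instead of fetching a URL

agent = 'my-course-bot'  # hypothetical user-agent name
blocked = parser.can_fetch(agent, 'https://www.bloomberg.com/polska')
allowed = parser.can_fetch(agent, 'https://www.bloomberg.com/feeds/bpol/sitemap_news.xml')
print(blocked, allowed)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the file instead.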

In [60]:
source = requests.get('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml').content
soup = bs.BeautifulSoup(source,'xml') # Note parser 'xml'

In [61]:
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
 <url>
  <loc>
   https://www.bloomberg.com/politics/videos/2017-03-10/greek-opposition-leader-says-party-won-t-back-austerity
  </loc>
  <news:news>
   <news:publication>
    <news:name>
     Bloomberg
    </news:name>
    <news:language>
     en
    </news:language>
   </news:publication>
   <news:title>
    Greek Opposition Leader Says Party Won't Back Austerity
   </news:title>
   <news:publication_date>
    2017-03-10T00:03:44.266Z
   </news:publication_date>
   <news:keywords>
    Greece
   </news:keywords>
   <news:stock_tickers/>
  </news:news>
 </url>
 <url>
  <loc>
   https://www.bloomberg.com/politics/videos/2017-03-09/south-korea-s-leadership-crisis-a-timeline
  </loc>
  <news:news>
   <news:publication>
    <news:name>
     Bloomberg
    </news:name>
    

In [62]:
# Find political news headlines
for news in soup.find_all({'news'}):
    print(news.title.text)
    print(news.publication_date.text)
    #print(news.keywords.text)
    print('\n')

Greek Opposition Leader Says Party Won't Back Austerity
2017-03-10T00:03:44.266Z


South Korea's Leadership Crisis: A Timeline
2017-03-09T22:54:56.977Z


Boris Johnson Says Vast Brexit Bill Wouldn't Be Reasonable
2017-03-09T21:26:37.179Z


Britain Told 15-Year Talks on EU Trade Can't Be Ruled Out
2017-03-09T21:03:10.063Z


South Korea's Park Ousted From Presidency, Triggering Vote
2017-03-10T02:38:36.027Z


South Korea’s Park Ousted From Presidency
2017-03-10T03:05:26.974Z


Gutting Dodd-Frank Is Hard, So GOP Focuses Elsewhere
2017-03-09T19:47:37.599Z


Fitch Warns of ‘Debt Challenge’ Confronting U.K. Government
2017-03-09T16:13:12.353Z


South Korea's Economic Woes Will Bedevil Its Next President
2017-03-10T02:29:39.849Z


Greece's Main Opposition Leader on Debt, Populism, Trump
2017-03-10T00:12:17.473Z


Trump’s Washington Hotel Accused of Unfairly Competing With Local Wine Bar
2017-03-09T22:01:30.334Z


White House Says Trump Was Unaware of Flynn’s Foreign Agent Work
2017-03-09T23:0
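
As an alternative to BeautifulSoup, the same sitemap structure can be read with the standard library's `xml.etree.ElementTree`, which handles the `news:` namespace through an explicit prefix mapping. The sketch below uses a shortened copy of the first sitemap entry shown above; the prefix names `sm` and `news` in `ns` are chosen here and can be anything.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sitemap structure shown above
sitemap = """<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
 <url>
  <loc>https://www.bloomberg.com/politics/videos/2017-03-10/greek-opposition-leader-says-party-won-t-back-austerity</loc>
  <news:news>
   <news:title>Greek Opposition Leader Says Party Won't Back Austerity</news:title>
   <news:publication_date>2017-03-10T00:03:44.266Z</news:publication_date>
  </news:news>
 </url>
</urlset>"""

# Map short prefixes to the namespace URIs declared in the document
ns = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}

root = ET.fromstring(sitemap.encode('utf-8'))  # parse from bytes
headlines = []
for url in root.findall('sm:url', ns):
    title = url.find('news:news/news:title', ns).text
    date = url.find('news:news/news:publication_date', ns).text
    headlines.append((title, date))
print(headlines)
```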

<a id='sec7'></a>
# Web crawl

Web crawling is similar to web scraping, but instead of extracting data from a single page you follow the links on a website (and often its subsites) and collect meta information along the way. It can be seen as simple, recursive scraping, and it is used for web indexing (for example to build a web search engine).

## Web crawl Twitter account
**Authors:** Kunal Desai & Alexander Fred Ojala

In [63]:
import bs4
from bs4 import BeautifulSoup
import requests

In [64]:
# Helper function to maintain the urls and the number of times they appear

url_dict = dict()

def add_to_dict(url_d, key):
    if key in url_d:
        url_d[key] = url_d[key] + 1
    else:
        url_d[key] = 1
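
The same bookkeeping can also be sketched with `collections.Counter` from the standard library, which treats missing keys as zero. The sample links below are made up for illustration.

```python
from collections import Counter

# Made-up links, standing in for the hrefs collected while crawling
found_links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
]

url_counts = Counter()
for href in found_links:
    url_counts[href] += 1  # no "key in dict" check needed

# most_common() lists the links from most to least frequent
for href, count in url_counts.most_common():
    print(href, "----", count)
```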

In [65]:
# Recursive function which extracts links from the given url up to a given 'depth'.

def get_urls(url, depth):
    if depth == 0:
        return
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        if link.has_attr('href') and "https://" in link['href']:
#             print(link['href'])
            add_to_dict(url_dict, link['href'])
            get_urls(link['href'], depth - 1)
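
The `"https://" in link['href']` test skips relative links such as `/about`, and it also accepts any href that merely contains `https://` somewhere inside it. One possible refinement, sketched here with `urllib.parse` (Python 3; in Python 2 the same functions live in the `urlparse` module), resolves each href against the page URL and then checks the scheme. `resolve_https_link` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urljoin, urlparse

def resolve_https_link(page_url, href):
    """Return the absolute https URL for `href` found on `page_url`, or None."""
    absolute = urljoin(page_url, href)  # resolves relative hrefs like "/about"
    parts = urlparse(absolute)
    if parts.scheme == "https" and parts.netloc:
        return absolute
    return None  # mailto:, http:, javascript: and similar links are dropped

page = "https://twitter.com/GolfWorld"
print(resolve_https_link(page, "/about"))                      # https://twitter.com/about
print(resolve_https_link(page, "https://t.co/cJznE7gdwP"))     # https://t.co/cJznE7gdwP
print(resolve_https_link(page, "mailto:someone@example.com"))  # None
```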

In [66]:
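# Iterative (breadth-first) function which extracts links from the given url up to a given 'depth'.
# It visits the same pages as get_urls, one level of links at a time.

def get_urls_iterative(url, depth):
    current_level = [url]
    for _ in range(depth):
        next_level = []
        for page_url in current_level:
            r = requests.get(page_url)
            soup = BeautifulSoup(r.text, 'html.parser')
            for link in soup.find_all('a'):
                if link.has_attr('href') and "https://" in link['href']:
                    add_to_dict(url_dict, link['href'])
                    next_level.append(link['href'])
        # The links found on this level are the pages to fetch on the next one
        current_level = next_level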

In [67]:
get_urls("https://twitter.com/GolfWorld", 2)
for key in url_dict:
    print(str(key) + "  ----   " + str(url_dict[key]))

https://twitter.com/z_blair/status/541760205340409857?ref_src=twsrc^tfw  ----   1
https://pbs.twimg.com/profile_images/636196994139054080/VBfyLr6U.jpg  ----   2
https://www.pinterest.com/golfdigest  ----   8
https://t.co/cJznE7gdwP  ----   1
https://twitter.com/BillyHo_Golf/status/839199197345820672  ----   1
https://www.linkedin.com/shareArticle?mini=true&url=http%3A%2F%2Fwww.golfdigest.com%2Fstory%2Fgolf-digest-podcast-david-feherty-on-phil-mickelsons-brain-spring-break-trips-he-cant-remember-and-the-infamous-fartgate-with-tiger-woods&title=Golf%20Digest%20Podcast%3A%20David%20Feherty%20on%20Phil%20Mickelson's%20brain%2C%20Spring%20Break%20trips%20he%20can't%20remember%20and%20the%20infamous%20%22Fartgate%22%20with%20Tiger%20Woods&summary=Golf%20Digest%20Podcast%3A%20David%20Feherty%20on%20Phil%20Mickelson's%20brain%2C%20Spring%20Break%20trips%20he%20can't%20remember%20and%20the%20infamous%20%22Fartgate%22%20with%20Tiger%20Woods&source=GolfDigest  ----   1
https://t.co/1ym5tXewcM  --
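
Since the raw listing is long, it can help to aggregate the counts by domain. This is a minimal sketch, assuming a dictionary shaped like `url_dict`; the sample entries are copied from the output above, and `counts_by_domain` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse

def counts_by_domain(url_counts):
    """Sum per-URL counts into per-domain counts."""
    totals = Counter()
    for url, count in url_counts.items():
        totals[urlparse(url).netloc] += count  # netloc is the host part of the URL
    return totals

sample = {
    "https://www.pinterest.com/golfdigest": 8,
    "https://t.co/cJznE7gdwP": 1,
    "https://twitter.com/BillyHo_Golf/status/839199197345820672": 1,
    "https://pbs.twimg.com/profile_images/636196994139054080/VBfyLr6U.jpg": 2,
}

for domain, total in counts_by_domain(sample).most_common():
    print(domain, "----", total)
```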

<a id='sec8'></a>
# SEO: Visualize sitemap and categories in a website

**Source:** https://www.ayima.com/guides/how-to-visualize-an-xml-sitemap-using-python.html

In [68]:
# Visualize XML sitemap with categories!
import requests
from bs4 import BeautifulSoup

# Alternative sitemap to try: 'https://www.sportchek.ca/sitemap.xml'
url = 'https://www.bloomberg.com/feeds/bpol/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))

Loaded page with: <Response [200]>
Created <class 'bs4.BeautifulSoup'> object


In [69]:
urls = [element.text for element in sitemap_index.find_all('loc')]
print(urls)

[u'https://www.bloomberg.com/feeds/bpol/sitemap_recent.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_news.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_video_recent.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2017_3.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2017_2.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2017_1.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_12.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_11.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_10.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_9.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_8.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_7.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_6.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_5.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_4.xml', u'https://www.bloomberg.com/feeds/bpol/sitemap_2016_3.xml', u'https://www.bloomberg.com/feed

In [70]:
def extract_links(url):
    ''' Open an XML sitemap and find content wrapped in loc tags. '''

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.find_all('loc')]

    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))

Found 18,530 URLs in the sitemap
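
Because a sitemap is XML, the `<loc>` entries can also be read with an XML parser. The sketch below uses the standard library's `xml.etree.ElementTree` on a small hand-written sitemap index (the two URLs are taken from the output above). It assumes the standard sitemaps.org namespace, and `locs_from_sitemap` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (assumed for this sketch)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.bloomberg.com/feeds/bpol/sitemap_recent.xml</loc>
  </sitemap>
  <sitemap>
    <loc>
      https://www.bloomberg.com/feeds/bpol/sitemap_news.xml
    </loc>
  </sitemap>
</sitemapindex>
"""

def locs_from_sitemap(xml_text):
    """Return the text of every <loc> element, with surrounding whitespace removed."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]

print(locs_from_sitemap(SAMPLE_INDEX))
```

Unlike the `html.parser` approach, this requires the namespace to be spelled out, but it fails loudly on malformed XML instead of guessing.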


In [71]:
with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')

In [72]:
'''
Categorize a list of URLs by site path.
The file containing the URLs should exist in the working directory and be
named sitemap_urls.dat. It should contain one URL per line.
Categorization depth can be specified by executing a call like this in the
terminal (where we set the granularity depth level to 5):
    python categorize_urls.py --depth 5
The same result can be achieved by setting the categorization_depth variable
manually at the head of this file and running the script with:
    python categorize_urls.py
'''
from __future__ import print_function

import pandas as pd  # used by peel_layers below


categorization_depth = 3  # number of URL path levels to categorize



# Main script functions


def peel_layers(urls, layers=3):
    ''' Builds a dataframe containing all unique page identifiers up
    to a specified depth and counts the number of sub-pages for each.
    Writes the result to sitemap_layers.csv.
    urls : list
        List of page URLs.
    layers : int
        Depth of automated URL search. Large values for this parameter
        may cause long runtimes depending on the number of URLs.
    '''

    # Store results in a dataframe
    sitemap_layers = pd.DataFrame()

    # Get base levels
    bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
    sitemap_layers[0] = bases

    # Get specified number of layers
    for layer in range(1, layers+1):

        page_layer = []
        for url, base in zip(urls, bases):
            try:
                page_layer.append(url.split(base)[-1].split('/')[layer])
            except IndexError:
                # There is nothing that deep!
                page_layer.append('')

        sitemap_layers[layer] = page_layer

    # Count and drop duplicate rows + sort
    sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
                     .rename('counts').reset_index()\
                     .sort_values('counts', ascending=False)\
                     .sort_values(list(range(0, layers)), ascending=True)\
                     .reset_index(drop=True)

    # Convert column names to string types and export
    sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
    sitemap_layers.to_csv('sitemap_layers.csv', index=False)

    # Return the dataframe
    return sitemap_layers




with open('sitemap_urls.dat', 'r') as url_file:
    sitemap_urls = url_file.read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))

print('Categorizing up to a depth of %d' % categorization_depth)
sitemap_layers = peel_layers(urls=sitemap_urls,
                             layers=categorization_depth)
print('Printed {:,} rows of data to sitemap_layers.csv'.format(len(sitemap_layers)))


Loaded 18,530 URLs
Categorizing up to a depth of 3
Printed 1,813 rows of data to sitemap_layers.csv
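
To see what `peel_layers` does to a single URL, here is a minimal sketch that applies the same splitting idea to one of the Bloomberg URLs from the sitemap above. `split_layers` is a hypothetical helper; it pads missing levels with empty strings, just as the `except` branch in `peel_layers` does.

```python
def split_layers(url, layers=3):
    """Return [base, level 1, ..., level `layers`] for one URL."""
    parts = url.split('//')[-1].split('/')  # drop the scheme, then split on "/"
    base, rest = parts[0], parts[1:]
    rest = rest[:layers]                    # keep only the requested depth
    rest += [''] * (layers - len(rest))     # pad URLs that are not that deep
    return [base] + rest

url = 'https://www.bloomberg.com/politics/videos/2017-03-09/south-korea-s-leadership-crisis-a-timeline'
print(split_layers(url))
# ['www.bloomberg.com', 'politics', 'videos', '2017-03-09']
print(split_layers('https://www.bloomberg.com/politics'))
# ['www.bloomberg.com', 'politics', '', '']
```

Each returned list corresponds to one row of the dataframe before the duplicate rows are counted and collapsed.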


In [73]:
'''
Visualize a list of URLs by site path.
This script reads in the sitemap_layers.csv file created by the
categorize_urls.py script and builds a graph visualization using Graphviz.
Graph depth can be specified by executing a call like this in the
terminal:
    python visualize_urls.py --depth 4 --limit 10 --title "My Sitemap" --style "dark" --size "40"
The same result can be achieved by setting the variables manually at the head
of this file and running the script with:
    python visualize_urls.py
'''
from __future__ import print_function


# Set global variables

graph_depth = 3  # Number of layers deep to plot categorization
limit = 3       # Maximum number of nodes for a branch
title = ''       # Graph title
style = 'light'  # Graph style, can be "light" or "dark"
size = '8,5'     # Size of rendered PDF graph


# Import external library dependencies

import pandas as pd
import graphviz



# Main script functions

def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
    ''' Make a sitemap graph up to a specified layer depth.
    df : DataFrame
        The dataframe created by the peel_layers function
        containing sitemap information.
    layers : int
        Maximum depth to plot.
    limit : int
        The maximum number of node edge connections. Good to set this
        low for visualizing deep into site maps.
    size : str
        Size of the rendered graph, passed to Graphviz (e.g. '8,5').
    '''


    # Check to make sure we are not trying to plot too many layers.
    # The columns are '0' ... 'N' plus 'counts', so N = len(df.columns) - 2.
    if layers > len(df.columns) - 2:
        layers = len(df.columns) - 2
        print('There are only %d layers available to plot, setting layers=%d'
              % (layers, layers))


    # Initialize graph
    f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
    f.body.extend(['rankdir=LR', 'size="%s"' % size])


    def add_branch(f, names, vals, limit, connect_to=''):
        ''' Adds a set of nodes and edges to nodes on the previous layer. '''

        # Get the currently existing node names
        node_names = [item.split('"')[1] for item in f.body if 'label' in item]

        # Only add a new branch if it will connect to a previously created node
        if connect_to:
            if connect_to in node_names:
                for name, val in list(zip(names, vals))[:limit]:
                    f.node(name='%s-%s' % (connect_to, name), label=name)
                    f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))


    f.attr('node', shape='rectangle') # Plot nodes as rectangles

    # Add the first layer of nodes
    for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
                          .sort_values(['counts'], ascending=False).values:
        f.node(name=name, label='{} ({:,})'.format(name, counts))

    if layers == 0:
        return f

    f.attr('node', shape='oval') # Plot nodes as ovals
    f.graph_attr.update()

    # Loop over each layer adding nodes and edges to prior nodes
    for i in range(1, layers+1):
        cols = [str(i_) for i_ in range(i)]
        nodes = df[cols].drop_duplicates().values
        for j, k in enumerate(nodes):

            # Compute the mask to select correct data
            mask = True
            for j_, ki in enumerate(k):
                mask &= df[str(j_)] == ki

            # Select the data then count branch size, sort, and truncate
            data = df[mask].groupby([str(i)])['counts'].sum()\
                    .reset_index().sort_values(['counts'], ascending=False)

            # Add to the graph
            add_branch(f,
                       names=data[str(i)].values,
                       vals=data['counts'].values,
                       limit=limit,
                       connect_to='-'.join(['%s']*i) % tuple(k))

            print(('Built graph up to node %d / %d in layer %d' % (j, len(nodes), i))\
                    .ljust(50), end='\r')

    return f


def apply_style(f, style, title=''):
    ''' Apply the style and add a title if desired. More styling options are
    documented here: http://www.graphviz.org/doc/info/attrs.html#d:style
    f : graphviz.dot.Digraph
        The graph object as created by graphviz.
    style : str
        Available styles: 'light', 'dark'
    title : str
        Optional title placed at the bottom of the graph.
    '''

    dark_style = {
        'graph': {
            'label': title,
            'bgcolor': '#3a3a3a',
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'white',
        },
        'nodes': {
            'style': 'filled',
            'color': 'white',
            'fillcolor': 'black',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'white',
        },
        'edges': {
            'color': 'white',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'white',
        }
    }

    light_style = {
        'graph': {
            'label': title,
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'black',
        },
        'nodes': {
            'style': 'filled',
            'color': 'black',
            'fillcolor': '#dbdddd',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'black',
        },
        'edges': {
            'color': 'black',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'black',
        }
    }

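    # Pick the style dictionary (named so it does not shadow this function)
    if style == 'light':
        chosen_style = light_style
    elif style == 'dark':
        chosen_style = dark_style
    else:
        raise ValueError("style must be 'light' or 'dark', got %r" % style)

    f.graph_attr = chosen_style['graph']
    f.node_attr = chosen_style['nodes']
    f.edge_attr = chosen_style['edges']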

    return f




# Read in categorized data
sitemap_layers = pd.read_csv('sitemap_layers.csv', dtype=str)
# Convert numerical column to integer
sitemap_layers.counts = sitemap_layers.counts.apply(int)
print('Loaded {:,} rows of categorized data from sitemap_layers.csv'\
        .format(len(sitemap_layers)))

print('Building %d layer deep sitemap graph' % graph_depth)
f = make_sitemap_graph(sitemap_layers, layers=graph_depth,
                       limit=limit, size=size)
f = apply_style(f, style=style, title=title)

f.render(cleanup=True)
print('Exported graph to sitemap_graph_%d_layer.pdf' % graph_depth)




Loaded 1,813 rows of categorized data from sitemap_layers.csv
Building 3 layer deep sitemap graph
Exported graph to sitemap_graph_3_layer.pdf       
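
If Graphviz is not installed, the same category counts can be inspected as an indented text tree. This is a rough sketch and not part of the original script: `tree_lines` is a hypothetical helper, and the rows below are made-up values shaped like the rows of `sitemap_layers.csv` (layer columns followed by a count).

```python
def tree_lines(rows, indent='  '):
    """Turn (layer0, layer1, ..., count) rows into indented 'name (count)' lines."""
    totals = {}  # path tuple -> summed count of all URLs below that path
    for row in rows:
        layers, count = row[:-1], row[-1]
        for depth in range(1, len(layers) + 1):
            if layers[depth - 1] == '':
                break  # nothing deeper for this row
            path = tuple(layers[:depth])
            totals[path] = totals.get(path, 0) + count
    # Sorting the path tuples lists every parent directly before its children
    return ['%s%s (%d)' % (indent * (len(path) - 1), path[-1], totals[path])
            for path in sorted(totals)]

# Made-up rows for illustration
rows = [
    ('www.example.com', 'news', 'articles', 120),
    ('www.example.com', 'news', 'videos', 30),
    ('www.example.com', 'about', '', 1),
]
print('\n'.join(tree_lines(rows)))
# www.example.com (151)
#   about (1)
#   news (150)
#     articles (120)
#     videos (30)
```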
