<center> <img src="https://github.ccs.neu.edu/caglar/DS3000/blob/master/img/ds3000.png?raw=true"> </center>

<center> <h1> Week 4 - Day 2 </h1> </center>

<center> <h2> Part 2: Web Scraping </h2></center>

## Outline
1. <a href='#1'>Reading Web Resources from URLs</a>
2. <a href='#2'>Web Scraping using BeautifulSoup</a>
3. <a href='#3'>Scraping Specific Tags from Webpages</a>
4. <a href='#4'>Scraping Web Pages by Tags and Attributes</a>
5. <a href='#5'>Scraping Child Tags under a Parent Tag</a>
6. <a href='#6'>Storing the Scraped Data</a>
7. <a href='#7'>Web Crawling</a>
8. <a href='#8'>More on Web Scraping</a>


## Web Scraping
* Using programs or scripts to browse websites automatically, examine their content, and extract data from them
* Also called web spidering, web crawling, web harvesting, or web data extraction

## Why Scrape?
* Pull your dataset from a website when your data is not readily available
* Extract different pieces of information from online resources when working with non-traditional datasets (e.g., social media posts)

## Why not Scrape?
* It may not be legal
* You can't republish copyrighted information
* Terms of service violations are not okay
* Some websites don't like you scraping their content!

<a id="1"></a>

## 1. Reading Web Resources from URLs
* The **`urllib.request`** module (part of Python's standard **`urllib`** package) allows you to read data from URLs

In [1]:
import urllib.request as urllib
html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
print(html.read())

b'<!DOCTYPE html>\n<!--[if (lte IE 9) ]><html lang="en" class="no-js oldie"><![endif]-->\n<!--[if (gt IE 9)|!(IE)]><!--><html lang="en" class="no-js"><!--<![endif]-->\n<head>\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<script src="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/lib/modernizr.min.js"></script>\n\t<link rel="shortcut icon" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/favicon.ico" type="image/x-icon" />\n\t<meta name="apple-mobile-web-app-title" content="NU Khoury">\n\t<link rel="apple-touch-icon" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" />\n\t<link rel="apple-touch-icon-precomposed" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" />\n\t<link rel="apple-touch-icon" sizes="180x180" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_them

### 1.1. urlopen() function
* Opens a URL, which can be a string or Request object
* Returns a file-like object containing the contents of the URL

In [2]:
import urllib.request as urllib
html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
print(html.read())

b'<!DOCTYPE html>\n<!--[if (lte IE 9) ]><html lang="en" class="no-js oldie"><![endif]-->\n<!--[if (gt IE 9)|!(IE)]><!--><html lang="en" class="no-js"><!--<![endif]-->\n<head>\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<script src="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/lib/modernizr.min.js"></script>\n\t<link rel="shortcut icon" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/favicon.ico" type="image/x-icon" />\n\t<meta name="apple-mobile-web-app-title" content="NU Khoury">\n\t<link rel="apple-touch-icon" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" />\n\t<link rel="apple-touch-icon-precomposed" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" />\n\t<link rel="apple-touch-icon" sizes="180x180" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_them

### 1.2. read() method
* Reads the entire content of the response object returned by `urlopen()` and returns it as `bytes` (hence the `b'...'` prefix in the output)

In [3]:
import urllib.request as urllib
html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
print(html.read())

b'<!DOCTYPE html>\n<!--[if (lte IE 9) ]><html lang="en" class="no-js oldie"><![endif]-->\n<!--[if (gt IE 9)|!(IE)]><!--><html lang="en" class="no-js"><!--<![endif]-->\n<head>\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<script src="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/lib/modernizr.min.js"></script>\n\t<link rel="shortcut icon" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/favicon.ico" type="image/x-icon" />\n\t<meta name="apple-mobile-web-app-title" content="NU Khoury">\n\t<link rel="apple-touch-icon" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" />\n\t<link rel="apple-touch-icon-precomposed" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" />\n\t<link rel="apple-touch-icon" sizes="180x180" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_them
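
Because `read()` returns raw bytes, it usually helps to decode them into a string before working with the page. The sketch below is one way to do that. It assumes a hypothetical stand-in page served through a `data:` URL so that it runs without network access; with a real site you would pass the `https://...` address instead.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical stand-in page; a data: URL keeps the sketch runnable offline.
page = "<html><body><h1>Hello, scraper</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(page)

# urlopen() returns a file-like response object; `with` closes it when done.
with urllib.request.urlopen(url) as response:
    raw = response.read()  # bytes, which print with a b'...' prefix
    # Use the charset declared in the response headers, falling back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"

text = raw.decode(charset)  # str, easier to read and to search
print(type(raw).__name__, type(text).__name__)
print(text)
```

Falling back to UTF-8 when no charset is declared is a common default, not a guarantee; some pages use other encodings.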

<a id="2"></a>

## 2. Web Scraping using BeautifulSoup
* A common web scraping library
* Helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects
* Need to install the library before you can use it in Python
    * pip install beautifulsoup4 (run this in Anaconda prompt)
    * Full documentation available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
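
To make "fixing bad HTML" concrete, here is a minimal sketch on a hypothetical snippet in which the `<b>` and `<p>` tags are never closed. It assumes Python's built-in `"html.parser"`; other parsers (such as `lxml`) may repair the same markup slightly differently.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed snippet: <p> and <b> are opened but never closed.
messy = "<div><h1>Title</h1><p>Some <b>bold text</div>"

# "html.parser" ships with Python, so no extra install is needed for it.
soup = BeautifulSoup(messy, "html.parser")

# The parsed tree can be navigated as if the markup had been well formed.
print(soup.b.get_text())
print(soup.p.get_text())

# prettify() prints the repaired tree with one tag per line.
print(soup.prettify())
```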

In [4]:
#imports the BeautifulSoup class from the bs4 library
from bs4 import BeautifulSoup

In [5]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
#names the parser explicitly to avoid the "no parser was explicitly specified" warning
#"lxml" must be installed (pip install lxml); "html.parser" is the built-in alternative
soup = BeautifulSoup(html.read(), "lxml")


### soup = BeautifulSoup(html.read(), "lxml")
* Transforms the HTML content into a BeautifulSoup object, called soup
* The BeautifulSoup object retains the general structure of a web page:
    * `<html></html>`
    * `<head></head>`
    * `<body></body>`

In [6]:
soup.html

<html class="no-js" lang="en"><!--<![endif]-->
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<script src="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/lib/modernizr.min.js"></script>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<meta content="NU Khoury" name="apple-mobile-web-app-title"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" rel="apple-touch-icon"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" rel="apple-touch-icon-precomposed"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon-180.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="https://fast.fonts.com/cssapi/cac43e8c-6965-44df-b8ca-9784607a3b53.css" rel="stylesheet" type="text/css"/>

In [7]:
soup.head

<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<script src="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/lib/modernizr.min.js"></script>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<meta content="NU Khoury" name="apple-mobile-web-app-title"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" rel="apple-touch-icon"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" rel="apple-touch-icon-precomposed"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon-180.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="https://fast.fonts.com/cssapi/cac43e8c-6965-44df-b8ca-9784607a3b53.css" rel="stylesheet" type="text/css"/>
<link href="https://www.khoury.northeastern.ed

In [8]:
soup.body

<body class="archive tax-role term-tenured-and-tenure-track-faculty term-23 tribe-no-js">
<!-- NU Google Tag Manager (noscript) -->
<noscript><iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-WGQLLJ" style="display:none;visibility:hidden" width="0"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<!-- Khoury Google Tag Manager (noscript) -->
<noscript><iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-KN6KMJB" style="display:none;visibility:hidden" width="0"></iframe></noscript>
<!-- End Google Tag Manager (noscript) --> <a class="sr-only" href="#main-content" id="top">Skip to main content</a>
<header id="site-header" role="banner">
<div id="contact-region">
<div class="container">
<div class="contact-contents expanded">
<div class="contact-col">
<h4>Main</h4>
<a href="mailto:khoury@northeastern.edu">khoury@northeastern.edu</a>
<a href="tel:6173732462">617.373.2462</a>
<a href="http://www.northeastern.edu/nupd/campus-safety/">Ca

In [9]:
soup.body.main

<main class="container" id="main-content">
<div class="row">
<section class="with-subnav" id="primary-content">
<div class="page-header">
<h1 class="page-title">Tenured and Tenure Track Faculty</h1>
</div>
<div class="row">
<nav class="sidebar-subnav" id="people-subnav">
<div class="menu-people-container"><ul class="menu menu-vertical" id="menu-people-1"><li class="menu-item menu-item-type-post_type menu-item-object-page current-menu-ancestor current-menu-parent current_page_parent current_page_ancestor menu-item-has-children menu-item-2282"><a href="https://www.khoury.northeastern.edu/people/" rel="nofollow">People</a>
<ul class="sub-menu">
<li class="menu-item menu-item-type-taxonomy menu-item-object-role current-menu-item menu-item-727"><a aria-current="page" href="https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/">Tenured and Tenure Track Faculty</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-role menu-item-8882"><a href="https://www

In [10]:
soup.body.main.h1

<h1 class="page-title">Tenured and Tenure Track Faculty</h1>

* What if you wanted to retrieve the text contained in `<h1> </h1>`?

### 2.1. get_text() method
* **`get_text()`** strips all tags from the element it is called on and returns a string containing only the text between the tags


In [11]:
soup.body.main.h1.get_text()

'Tenured and Tenure Track Faculty'

### 2.1. get_text() method cont'd
* Call `.get_text()` last, immediately before you print, store, or manipulate your final data
* It is a lot easier to find what you're looking for in a BeautifulSoup object than in a block of text
* Try to preserve the tag structure of a document as long as possible
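
The sketch below illustrates this advice on a small hypothetical card (not the real Khoury page): the tags are kept while locating the name, and text is extracted only at the end. It also shows the optional `separator` and `strip` arguments of `get_text()`, which tidy up the whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical card modelled loosely on a faculty listing; not the real page.
html = """
<div class="card">
  <h3>
        Ada Lovelace    </h3>
  <p class="position">Professor</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
card = soup.find("div")

# Keep the tags while locating things...
name_tag = card.find("h3")
# ...and convert to text only at the end.
name = name_tag.get_text().strip()

# strip=True trims each text fragment and drops empty ones;
# separator is placed between the remaining fragments.
summary = card.get_text(separator=" | ", strip=True)

print(name)
print(summary)
```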

In [12]:
soup.body.main

<main class="container" id="main-content">
<div class="row">
<section class="with-subnav" id="primary-content">
<div class="page-header">
<h1 class="page-title">Tenured and Tenure Track Faculty</h1>
</div>
<div class="row">
<nav class="sidebar-subnav" id="people-subnav">
<div class="menu-people-container"><ul class="menu menu-vertical" id="menu-people-1"><li class="menu-item menu-item-type-post_type menu-item-object-page current-menu-ancestor current-menu-parent current_page_parent current_page_ancestor menu-item-has-children menu-item-2282"><a href="https://www.khoury.northeastern.edu/people/" rel="nofollow">People</a>
<ul class="sub-menu">
<li class="menu-item menu-item-type-taxonomy menu-item-object-role current-menu-item menu-item-727"><a aria-current="page" href="https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/">Tenured and Tenure Track Faculty</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-role menu-item-8882"><a href="https://www

In [13]:
soup.body.main.get_text()

"\n\n\n\nTenured and Tenure Track Faculty\n\n\n\nPeople\n\nTenured and Tenure Track Faculty\nProfessors of the Practice\nTeaching Faculty\nResearch Faculty and Staff\nCourtesy Appointments\nCo-op and Advising\nAdministrative Staff\nSystems Staff\nPost Docs\nPhD Students\nView All\n\n\nWe’re Hiring View open positions\n\n \n\n\n\n\n\n\n\nTenured and Tenure Track Faculty\n\n\t\t\t\t\t\t\t\t\tAmal Ahmed\t\t\t\t\t\t\t\n\n\nAssociate Professor\nSy and Laurie Sternberg Interdisciplinary Associate Professor\n\n\n\n\n\n\n\n\n\nTenured and Tenure Track Faculty\n\n\t\t\t\t\t\t\t\t\tChristopher Amato\t\t\t\t\t\t\t\n\n\nAssistant Professor\n\n\n\n\n\n\n\n\n\nAdministrative Staff Tenured and Tenure Track Faculty\n\n\t\t\t\t\t\t\t\t\tJaved Aslam\t\t\t\t\t\t\t\n\n\nProfessor\nSenior Associate Dean - Academic Affairs\n\n\n\n\n\n\n\n\n\nTenured and Tenure Track Faculty\n\n\t\t\t\t\t\t\t\t\tKenneth Baclawski\t\t\t\t\t\t\t\n\n\nAssociate Professor Emeritus\n\n\n\n\n\n\n\n\n\nTenured and Tenure Track Facu

### 2.2. get() method
* Retrieves an attribute of a tag
* **tagName.get("attributeName")**

In [14]:
soup.body.main.a

<a href="https://www.khoury.northeastern.edu/people/" rel="nofollow">People</a>

In [15]:
soup.body.main.a.get("href")

'https://www.khoury.northeastern.edu/people/'

### 2.3. attrs attribute
* Returns a dictionary of the attributes of a tag
* **tagName.attrs**

In [16]:
soup.body.main.a.attrs

{'rel': ['nofollow'], 'href': 'https://www.khoury.northeastern.edu/people/'}

* **attrs** can also be used to retrieve attributes of a tag:

In [17]:
soup.body.main.a.attrs["href"]

'https://www.khoury.northeastern.edu/people/'

In [18]:
soup.body.main.a.get_text()

'People'
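
The two ways of reading an attribute behave differently when the attribute is missing. A minimal sketch, using a hypothetical link that has no `title` attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical link; note that it has no "title" attribute.
markup = '<a href="https://example.com/people/" rel="nofollow">People</a>'
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

print(link.get("href"))               # the attribute value
print(link.get("title"))              # missing attribute: get() returns None
print(link.get("title", "no title"))  # or a default of your choosing

try:
    link.attrs["title"]               # dictionary lookup raises on a missing key
except KeyError:
    print("attrs['title'] raised KeyError")

print(link["href"])                   # shorthand for link.attrs["href"]
```

When an attribute may be absent, `get()` is the safer choice because it lets you check for `None` instead of catching an exception.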

### 2.4. find() method
* Allows you to search through an HTML page and find a specific tag
* soup_name.find("tagName")
* Returns the first occurrence of the tag
* Returns None if the tag/attribute does not exist

In [19]:
soup.find("title")

<title>Tenured and Tenure Track Faculty | Khoury College of Computer Sciences</title>

In [20]:
soup.find("title").get_text()

'Tenured and Tenure Track Faculty | Khoury College of Computer Sciences'
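
Since `find()` returns `None` when nothing matches, chaining `.get_text()` directly onto it raises an `AttributeError` on pages that lack the tag. The sketch below guards against that with a hypothetical helper, `text_of`, which is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>Demo page</title></head><body></body></html>"
soup = BeautifulSoup(markup, "html.parser")

def text_of(node, tag_name, default=""):
    """Hypothetical helper: text of the first matching tag, or a default."""
    tag = node.find(tag_name)
    return tag.get_text().strip() if tag is not None else default

print(text_of(soup, "title"))
print(text_of(soup, "h1", default="(no h1 found)"))
```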

<a id="3"></a>

## 3. Scraping Specific Tags from Webpages
* **`find_all(tagName)`** returns **a list of all the tags** found within the page.

#### Let's extract all the links found on the Khoury Faculty page
* Links are placed in `<a>` tags 
* A typical link in HTML looks like this:
    * Backend: `<a href="https://www.khoury.northeastern.edu/people/carla-brodley/"> Carla E. Brodley </a>`
    * Frontend: <a href="https://www.khoury.northeastern.edu/people/carla-brodley/"> Carla E. Brodley </a>

In [21]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
soup = BeautifulSoup(html.read(), "lxml")

a_tags = soup.find_all("a")


In [22]:
a_tags

[<a class="sr-only" href="#main-content" id="top">Skip to main content</a>,
 <a href="mailto:khoury@northeastern.edu">khoury@northeastern.edu</a>,
 <a href="tel:6173732462">617.373.2462</a>,
 <a href="http://www.northeastern.edu/nupd/campus-safety/">Campus Safety</a>,
 <a href="https://www.northeastern.edu/campusmap/map/index.html">Campus Map</a>,
 <a href="tel:6173736519">617.373.6519</a>,
 <a href="mailto:admissions@northeastern.edu">admissions@northeastern.edu</a>,
 <a href="mailto:khoury-advising@northeastern.edu">khoury-advising@northeastern.edu</a>,
 <a href="tel:6173738613">617.373.5545</a>,
 <a href="mailto:khoury-gradschool@northeastern.edu">khoury-gradschool@northeastern.edu</a>,
 <a href="#" rel="nofollow">Contact – Parent</a>,
 <a href="https://www.khoury.northeastern.edu/directions-and-parking/">Directions &amp; Parking</a>,
 <a href="https://www.khoury.northeastern.edu/facilities/">Facilities</a>,
 <a href="https://www.khoury.northeastern.edu/systems/">Systems</a>,
 <a hr

In [23]:
for link in a_tags:
    if link.get('href').startswith("http"):
        print(link.get('href'))

http://www.northeastern.edu/nupd/campus-safety/
https://www.northeastern.edu/campusmap/map/index.html
https://www.khoury.northeastern.edu/directions-and-parking/
https://www.khoury.northeastern.edu/facilities/
https://www.khoury.northeastern.edu/systems/
https://www.khoury.northeastern.edu/about/
https://www.ccis.northeastern.edu/events/
https://www.khoury.northeastern.edu/news/
https://www.khoury.northeastern.edu/open-positions/
https://www.khoury.northeastern.edu/current-students/
https://www.khoury.northeastern.edu/industry/
https://www.khoury.northeastern.edu/diversity/
https://www.khoury.northeastern.edu/contact/
https://www.khoury.northeastern.edu
https://www.khoury.northeastern.edu/academics/
https://www.khoury.northeastern.edu/academics/undergraduate/
https://www.khoury.northeastern.edu/academics/masters/
https://www.khoury.northeastern.edu/academics/phd/
https://www.khoury.northeastern.edu/academics/certificate/
https://www.khoury.northeastern.edu/academics/courses/
https://ww

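* What if you just wanted to scrape the hyperlinks, not email addresses or phone numbers? Checking how each `href` starts does this; note that testing for `"https"` also drops plain `http://` links.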

In [24]:
for link in a_tags:
    href = link.get("href")
    if href.startswith("https"):
        print(href)

https://www.northeastern.edu/campusmap/map/index.html
https://www.khoury.northeastern.edu/directions-and-parking/
https://www.khoury.northeastern.edu/facilities/
https://www.khoury.northeastern.edu/systems/
https://www.khoury.northeastern.edu/about/
https://www.ccis.northeastern.edu/events/
https://www.khoury.northeastern.edu/news/
https://www.khoury.northeastern.edu/open-positions/
https://www.khoury.northeastern.edu/current-students/
https://www.khoury.northeastern.edu/industry/
https://www.khoury.northeastern.edu/diversity/
https://www.khoury.northeastern.edu/contact/
https://www.khoury.northeastern.edu
https://www.khoury.northeastern.edu/academics/
https://www.khoury.northeastern.edu/academics/undergraduate/
https://www.khoury.northeastern.edu/academics/masters/
https://www.khoury.northeastern.edu/academics/phd/
https://www.khoury.northeastern.edu/academics/certificate/
https://www.khoury.northeastern.edu/academics/courses/
https://www.khoury.northeastern.edu/research/
https://www.
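
The loops above assume every `<a>` tag has an `href`; a tag without one makes `link.get("href")` return `None`, and calling `.startswith()` on it fails. The sketch below is one possible, more defensive variant. The markup and `base_url` are hypothetical, chosen to cover the awkward cases (missing `href`, `mailto:`, `tel:`, same-page anchors, and relative links).

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

# Hypothetical base URL and markup, chosen to cover the awkward cases.
base_url = "https://example.com/faculty/"
html = """
<a id="top">anchor without href</a>
<a href="mailto:someone@example.com">email</a>
<a href="tel:555-0100">phone</a>
<a href="#main-content">same-page link</a>
<a href="/people/">relative link</a>
<a href="http://example.org/map">plain http link</a>
<a href="https://example.com/news/">https link</a>
"""
soup = BeautifulSoup(html, "html.parser")

web_links = []
# href=True keeps only the <a> tags that actually have an href attribute.
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.startswith("#"):
        continue  # skip links that only jump within the same page
    absolute = urljoin(base_url, href)  # resolves relative links
    # Keep web pages only; mailto: and tel: have other schemes.
    if urlparse(absolute).scheme in ("http", "https"):
        web_links.append(absolute)

for link in web_links:
    print(link)
```

Checking the parsed scheme rather than a string prefix keeps both `http` and `https` links while still excluding email addresses and phone numbers.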

<a id="4"></a>

## 4. Scraping Web Pages by Tags and Attributes
* Web pages use tags and attributes to style and format pages.
* The **`find_all()`** method allows you to search through a web page and extract useful information
* find_all(tagName, tagAttributes)
    * Looks through a tag’s descendants and retrieves all descendants that match your filters
    * Returns an empty list if no tag matches (unlike `find()`, which returns None)


* Let's retrieve the faculty names from the page:

<img src="res/html_tree.png" />

In [25]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
soup = BeautifulSoup(html.read(), "lxml")

#retrieves all h3 tags with class = "person-name"
faculty_list = soup.find_all("h3", {"class":"person-name"})


In [26]:
faculty_list

[<h3 class="person-name">
 									Amal Ahmed							</h3>, <h3 class="person-name">
 									Christopher Amato							</h3>, <h3 class="person-name">
 									Javed Aslam							</h3>, <h3 class="person-name">
 									Kenneth Baclawski							</h3>, <h3 class="person-name">
 									Albert-László Barabási							</h3>, <h3 class="person-name">
 									Timothy W. Bickmore							</h3>, <h3 class="person-name">
 									Michelle Borkin							</h3>, <h3 class="person-name">
 									Carla E. Brodley							</h3>, <h3 class="person-name">
 									Agnes H. Chan							</h3>, <h3 class="person-name">
 									David Choffnes							</h3>, <h3 class="person-name">
 									William D. Clinger							</h3>, <h3 class="person-name">
 									Seth Cooper							</h3>, <h3 class="person-name">
 									Gene Cooperman							</h3>, <h3 class="person-name">
 									Peter Desnoyers							</h3>, <h3 class="person-name">
 									Cody Dunne							</h3>, <h3 class="person-name">
 									Ehsan E

### find_all() method calls
* Both calls retrieve all h3 tags with class = "person-name"

In [53]:
faculty_list = soup.find_all("h3", {"class":"person-name"})

In [54]:
faculty_list = soup.find_all("h3", class_="person-name")
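
A small sketch on hypothetical markup confirms that the two call styles return the same tags, and shows what happens when nothing matches. The names and class values below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical headings; the second one carries two CSS classes.
html = """
<h3 class="person-name">Ada Lovelace</h3>
<h3 class="person-name featured">Alan Turing</h3>
<h3 class="section-title">Staff</h3>
"""
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("h3", {"class": "person-name"})
# class is a reserved word in Python, hence the trailing underscore.
by_keyword = soup.find_all("h3", class_="person-name")
print(by_dict == by_keyword)

# Matching is per CSS class, so a tag with several classes still matches.
names = [h3.get_text() for h3 in by_keyword]
print(names)

# No match gives an empty list, so looping over the result is always safe.
missing = soup.find_all("h3", class_="no-such-class")
print(len(missing))
```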

* Now that we have all h3 tags stored in a list, we can extract the text content contained in `<h3></h3>`
* Use **get_text()**

In [29]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
soup = BeautifulSoup(html.read(), "lxml")

#retrieves all h3 tags with class = "person-name"
faculty_list = soup.find_all("h3", {"class":"person-name"})

#retrieves the text contained in each prof's h3 tag
for prof in faculty_list:
    print(prof.get_text().strip())


Amal Ahmed
Christopher Amato
Javed Aslam
Kenneth Baclawski
Albert-László Barabási
Timothy W. Bickmore
Michelle Borkin
Carla E. Brodley
Agnes H. Chan
David Choffnes
William D. Clinger
Seth Cooper
Gene Cooperman
Peter Desnoyers
Cody Dunne
Ehsan Elhamifar
Tina Eliassi-Rad
Don Fallis
Matthias Felleisen
Larry Finkelstein
Yun (Raymond) Fu
Wolfgang Gatterbauer
Matthew Goodwin
Paul Hand
Woodrow Hartzog
Stephen Intille
Engin Kirda
David Lazer
Karl Lieberherr
Long Lu
Panagiotos (Pete) Manolios
Stacy C. Marsella
Renée Miller
Alan Mislove
Huy Lê Nguyen
Cristina Nita-Rotaru
Guevara Noubir
Alina Oprea
Andrea Grimes Parker
Rupal Patel
Robert Platt
Predrag Radivojac
Rajmohan Rajaraman
Aanjhan Ranganathan
Richard Rasala
Mirek Riedewald
Christoph Riedl
William Robertson
Magy Seif El-Nasr
Abhi Shelat
Olin Shivers
David Smith
Ravi Sundaram
Frank Tip
Stavros Tripakis
Jonathan Ullman
Jan-Willem van de Meent
Alessandro Vespignani
Emanuele Viola
Jan Vitek
Olga Vitek
Thomas Wahl
Byron Wallace
Mitchell Wand
Lu 

<a id="5"></a>

  
## 5. Scraping Child Tags under a Parent Tag
* Let's create our own record of faculty names, titles, webpage links, and profile picture URLs

<img src = "res/parent_child_tags.png" />

<center><img src = "res/khoury_grid_item.png" /></center>

In [30]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
soup = BeautifulSoup(html.read(), "lxml")

#retrieves all <div class = "grid-item"> tags
faculty_divs = soup.find_all("div", class_="grid-item")
faculty_divs


[<div class="grid-item">
 <a href="https://www.khoury.northeastern.edu/people/amal-ahmed/">
 <div class="grid-image"><img alt="" src="https://www.khoury.northeastern.edu/wp-content/uploads/2016/03/Amal-Ahmed-Index-Image.jpg"/></div>
 <p class="roles"><span class="role-23">Tenured and Tenure Track Faculty</span></p>
 <h3 class="person-name">
 									Amal Ahmed							</h3>
 </a>
 <div class="position-list">
 <p class="position">Associate Professor</p>
 <p class="position">Sy and Laurie Sternberg Interdisciplinary Associate Professor</p>
 </div>
 </div>, <div class="grid-item">
 <a href="https://www.khoury.northeastern.edu/people/chris-amato/">
 <div class="grid-image"><img alt="" src="https://www.khoury.northeastern.edu/wp-content/uploads/2016/08/Amato_Chris-index-image.jpg"/></div>
 <p class="roles"><span class="role-23">Tenured and Tenure Track Faculty</span></p>
 <h3 class="person-name">
 									Christopher Amato							</h3>
 </a>
 <div class="position-list">
 <p class="position">

### 5.1. Let's extract faculty names
* All names are marked with `<h3 class = "person-name">`
* Because there is one `<h3>` tag under `<div class = "grid-item">` we can use the **find()** method to scrape it
* Call **get_text()** on the result to get its content
* Strip the surrounding whitespace using **strip()**

In [31]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:
    fname = prof.find("h3").get_text().strip()
    print(fname)

Amal Ahmed
Christopher Amato
Javed Aslam
Kenneth Baclawski
Albert-László Barabási
Timothy W. Bickmore
Michelle Borkin
Carla E. Brodley
Agnes H. Chan
David Choffnes
William D. Clinger
Seth Cooper
Gene Cooperman
Peter Desnoyers
Cody Dunne
Ehsan Elhamifar
Tina Eliassi-Rad
Don Fallis
Matthias Felleisen
Larry Finkelstein
Yun (Raymond) Fu
Wolfgang Gatterbauer
Matthew Goodwin
Paul Hand
Woodrow Hartzog
Stephen Intille
Engin Kirda
David Lazer
Karl Lieberherr
Long Lu
Panagiotos (Pete) Manolios
Stacy C. Marsella
Renée Miller
Alan Mislove
Huy Lê Nguyen
Cristina Nita-Rotaru
Guevara Noubir
Alina Oprea
Andrea Grimes Parker
Rupal Patel
Robert Platt
Predrag Radivojac
Rajmohan Rajaraman
Aanjhan Ranganathan
Richard Rasala
Mirek Riedewald
Christoph Riedl
William Robertson
Magy Seif El-Nasr
Abhi Shelat
Olin Shivers
David Smith
Ravi Sundaram
Frank Tip
Stavros Tripakis
Jonathan Ullman
Jan-Willem van de Meent
Alessandro Vespignani
Emanuele Viola
Jan Vitek
Olga Vitek
Thomas Wahl
Byron Wallace
Mitchell Wand
Lu 

### 5.2. Similarly we can get the title of each faculty member
* Retrieve the first `<p class="position">` tag under `<div class = "grid-item">`

In [32]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:   
    fname = prof.find("h3").get_text().strip()
    ftitle= prof.find("p","position").get_text().strip()
    
    print(fname)
    print(ftitle)  


Amal Ahmed
Associate Professor
Christopher Amato
Assistant Professor
Javed Aslam
Professor
Kenneth Baclawski
Associate Professor Emeritus
Albert-László Barabási
Robert Gray Dodge Professor of Network Science
Timothy W. Bickmore
Professor
Michelle Borkin
Assistant Professor
Carla E. Brodley
Dean - Khoury College of Computer Sciences
Agnes H. Chan
Professor Emeritus
David Choffnes
Associate Professor
William D. Clinger
Professor Emeritus
Seth Cooper
Assistant Professor
Gene Cooperman
Professor
Peter Desnoyers
Associate Professor
Cody Dunne
Assistant Professor
Ehsan Elhamifar
Assistant Professor
Tina Eliassi-Rad
Associate Professor
Don Fallis
Professor of Philosophy and Computer Sciences
Matthias Felleisen
Trustee Professor
Larry Finkelstein
Professor Emeritus
Yun (Raymond) Fu
Professor
Wolfgang Gatterbauer
Associate Professor
Matthew Goodwin
Associate Professor
Paul Hand
Assistant Professor
Woodrow Hartzog
Professor
Stephen Intille
Associate Professor
Engin Kirda
Professor
David Lazer
Di

### 5.3. Can get the link to the faculty member's webpage
* Retrieve the **`href`** attribute of the **`<a>`** tag under `<div class = "grid-item">`

In [33]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:   
    fname = prof.find("h3").get_text().strip()
    ftitle= prof.find("p","position").get_text().strip()
    fpage = prof.find("a").get("href").strip()

    print(fname)
    print(ftitle)
    print(fpage)

Amal Ahmed
Associate Professor
https://www.khoury.northeastern.edu/people/amal-ahmed/
Christopher Amato
Assistant Professor
https://www.khoury.northeastern.edu/people/chris-amato/
Javed Aslam
Professor
https://www.khoury.northeastern.edu/people/jay-javed-aslam/
Kenneth Baclawski
Associate Professor Emeritus
https://www.khoury.northeastern.edu/people/kenneth-baclawski/
Albert-László Barabási
Robert Gray Dodge Professor of Network Science
https://www.khoury.northeastern.edu/people/albert-laszlo-barabasi/
Timothy W. Bickmore
Professor
https://www.khoury.northeastern.edu/people/timothy-bickmore/
Michelle Borkin
Assistant Professor
https://www.khoury.northeastern.edu/people/michelle-borkin/
Carla E. Brodley
Dean - Khoury College of Computer Sciences
https://www.khoury.northeastern.edu/people/carla-brodley/
Agnes H. Chan
Professor Emeritus
https://www.khoury.northeastern.edu/people/agnes-chan/
David Choffnes
Associate Professor
https://www.khoury.northeastern.edu/people/david-choffnes/
Willi

### 5.4. Can get the profile picture too
* Retrieve the **`src`** attribute of the **`<img>`** tag under `<div class = "grid-item">`

In [34]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:   
    fname = prof.find("h3").get_text().strip()
    ftitle= prof.find("p","position").get_text().strip()
    fpage = prof.find("a").get("href").strip()
    fpic = prof.find("img").get("src").strip()

    print(fname)
    print(ftitle)
    print(fpage)
    print(fpic)

Amal Ahmed
Associate Professor
https://www.khoury.northeastern.edu/people/amal-ahmed/
https://www.khoury.northeastern.edu/wp-content/uploads/2016/03/Amal-Ahmed-Index-Image.jpg
Christopher Amato
Assistant Professor
https://www.khoury.northeastern.edu/people/chris-amato/
https://www.khoury.northeastern.edu/wp-content/uploads/2016/08/Amato_Chris-index-image.jpg
Javed Aslam
Professor
https://www.khoury.northeastern.edu/people/jay-javed-aslam/
https://www.khoury.northeastern.edu/wp-content/uploads/2016/02/Javed-Aslam-index-image-e1456779428335.jpg
Kenneth Baclawski
Associate Professor Emeritus
https://www.khoury.northeastern.edu/people/kenneth-baclawski/
https://www.khoury.northeastern.edu/wp-content/uploads/2016/02/Kenneth-Baclawski-index-image-e1456779323580.jpg
Albert-László Barabási
Robert Gray Dodge Professor of Network Science
https://www.khoury.northeastern.edu/people/albert-laszlo-barabasi/
https://www.khoury.northeastern.edu/wp-content/uploads/2015/12/barabasi-index.jpg
Timothy W. 
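
The loops above take only the first `<p class="position">` of each person and assume every child tag is present. The following sketch shows one way to collect every title and tolerate missing children. The markup is hypothetical (shaped like the grid items above, with the second card deliberately incomplete), and `attr_or_none` is a made-up helper, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the grid items above; the second card
# deliberately lacks an image and a position.
html = """
<div class="grid-item">
  <a href="https://example.com/people/ada/">
    <div class="grid-image"><img alt="" src="https://example.com/img/ada.jpg"/></div>
    <h3 class="person-name">  Ada Lovelace  </h3>
  </a>
  <div class="position-list">
    <p class="position">Professor</p>
    <p class="position">Associate Dean</p>
  </div>
</div>
<div class="grid-item">
  <a href="https://example.com/people/alan/">
    <h3 class="person-name">Alan Turing</h3>
  </a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def attr_or_none(parent, tag_name, attr):
    """Hypothetical helper: attribute of the first matching child, or None."""
    tag = parent.find(tag_name)
    return tag.get(attr) if tag is not None else None

records = []
for card in soup.find_all("div", class_="grid-item"):
    records.append({
        "Name": card.find("h3", class_="person-name").get_text().strip(),
        # find_all() collects every title, not just the first one
        "Titles": [p.get_text().strip() for p in card.find_all("p", class_="position")],
        "Link": attr_or_none(card, "a", "href"),
        "Picture": attr_or_none(card, "img", "src"),
    })

for record in records:
    print(record)
```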

<a id="6"></a>

## 6. Storing the Scraped Data
* Most of the time, you'll want to store your scraped data in a file.
* Consider using DataFrames for tabular data.

In [35]:
import pandas as pd

df = pd.DataFrame(columns = ["Name", "Title", "Link", "Picture"])
df

Unnamed: 0,Name,Title,Link,Picture


### 6.1. Appending Rows to a DataFrame
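* `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0; use `pd.concat()` instead
* Wrap a dictionary of column names and values in a one-row DataFrame, then concatenate it onto the existing one

In [36]:
#wraps the dictionary in a one-row DataFrame and concatenates it onto df
new_row = pd.DataFrame([{"Name": "Dumbledore", "Title": "Headmaster", "Link": "hogwarts.edu",
                         "Picture": "hogwarts.edu/dumby.png"}])
df = pd.concat([df, new_row], ignore_index=True)
df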

Unnamed: 0,Name,Title,Link,Picture
0,Dumbledore,Headmaster,hogwarts.edu,hogwarts.edu/dumby.png


In [37]:
df = df.drop(0)

In [38]:
df

Unnamed: 0,Name,Title,Link,Picture


In [39]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:   
    fname = prof.find("h3").get_text().strip()
    ftitle= prof.find("p","position").get_text().strip()
    fpage = prof.find("a").get("href").strip()
    fpic = prof.find("img").get("src").strip()
    
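    #appends the fields to their respective columns
    #note the curly braces: each row is passed as a dictionary
    #pd.concat() replaces DataFrame.append(), which was removed in pandas 2.0
    new_row = pd.DataFrame([{"Name":fname, "Title":ftitle, "Link":fpage, "Picture": fpic}])
    df = pd.concat([df, new_row], ignore_index=True)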

In [40]:
df

Unnamed: 0,Name,Title,Link,Picture
0,Amal Ahmed,Associate Professor,https://www.khoury.northeastern.edu/people/ama...,https://www.khoury.northeastern.edu/wp-content...
1,Christopher Amato,Assistant Professor,https://www.khoury.northeastern.edu/people/chr...,https://www.khoury.northeastern.edu/wp-content...
2,Javed Aslam,Professor,https://www.khoury.northeastern.edu/people/jay...,https://www.khoury.northeastern.edu/wp-content...
3,Kenneth Baclawski,Associate Professor Emeritus,https://www.khoury.northeastern.edu/people/ken...,https://www.khoury.northeastern.edu/wp-content...
4,Albert-László Barabási,Robert Gray Dodge Professor of Network Science,https://www.khoury.northeastern.edu/people/alb...,https://www.khoury.northeastern.edu/wp-content...
5,Timothy W. Bickmore,Professor,https://www.khoury.northeastern.edu/people/tim...,https://www.khoury.northeastern.edu/wp-content...
6,Michelle Borkin,Assistant Professor,https://www.khoury.northeastern.edu/people/mic...,https://www.khoury.northeastern.edu/wp-content...
7,Carla E. Brodley,Dean - Khoury College of Computer Sciences,https://www.khoury.northeastern.edu/people/car...,https://www.khoury.northeastern.edu/wp-content...
8,Agnes H. Chan,Professor Emeritus,https://www.khoury.northeastern.edu/people/agn...,https://www.khoury.northeastern.edu/wp-content...
9,David Choffnes,Associate Professor,https://www.khoury.northeastern.edu/people/dav...,https://www.khoury.northeastern.edu/wp-content...


In [41]:
#displays first 5 rows
df.head()

Unnamed: 0,Name,Title,Link,Picture
0,Amal Ahmed,Associate Professor,https://www.khoury.northeastern.edu/people/ama...,https://www.khoury.northeastern.edu/wp-content...
1,Christopher Amato,Assistant Professor,https://www.khoury.northeastern.edu/people/chr...,https://www.khoury.northeastern.edu/wp-content...
2,Javed Aslam,Professor,https://www.khoury.northeastern.edu/people/jay...,https://www.khoury.northeastern.edu/wp-content...
3,Kenneth Baclawski,Associate Professor Emeritus,https://www.khoury.northeastern.edu/people/ken...,https://www.khoury.northeastern.edu/wp-content...
4,Albert-László Barabási,Robert Gray Dodge Professor of Network Science,https://www.khoury.northeastern.edu/people/alb...,https://www.khoury.northeastern.edu/wp-content...


In [42]:
#displays last 5 rows
df.tail()

Unnamed: 0,Name,Title,Link,Picture
64,Lu Wang,Assistant Professor,https://www.khoury.northeastern.edu/people/lu-...,https://www.khoury.northeastern.edu/wp-content...
65,Daniel Wichs,Associate Professor,https://www.khoury.northeastern.edu/people/dan...,https://www.khoury.northeastern.edu/wp-content...
66,Christo Wilson,Associate Professor,https://www.khoury.northeastern.edu/people/chr...,https://www.khoury.northeastern.edu/wp-content...
67,Lawson Wong,Assistant Professor,https://www.khoury.northeastern.edu/people/law...,https://www.khoury.northeastern.edu/wp-content...
68,Rose Yu,Assistant Professor,https://www.khoury.northeastern.edu/people/ros...,https://www.khoury.northeastern.edu/wp-content...


### 6.2. Writing the DataFrame to CSV

In [43]:
df.to_csv("khoury_faculty.csv")
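
By default `to_csv` also writes the index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below contrasts the default with `index=False`. It writes to in-memory buffers instead of a file, and the one-row DataFrame is a hypothetical stand-in for `df`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the faculty DataFrame
df = pd.DataFrame({"Name": ["Ada Lovelace"], "Title": ["Professor"]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```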

<a id="7"></a>

## 7. Web Crawling

In [44]:
deanURL = "https://www.khoury.northeastern.edu/people/carla-brodley/"

dean_page = urllib.urlopen(deanURL)
#names the parser explicitly; omitting it makes BeautifulSoup print a warning
page_soup = BeautifulSoup(dean_page.read(), "lxml")


In [45]:
page_soup

<!DOCTYPE html>
<!--[if (lte IE 9) ]><html lang="en" class="no-js oldie"><![endif]--><!--[if (gt IE 9)|!(IE)]><!--><html class="no-js" lang="en"><!--<![endif]-->
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<script src="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/lib/modernizr.min.js"></script>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<meta content="NU Khoury" name="apple-mobile-web-app-title"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" rel="apple-touch-icon"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" rel="apple-touch-icon-precomposed"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon-180.png" rel="apple-touch-icon" sizes="180x180"/>
<li

<img src = "res/email.png" />

In [46]:
deanURL = "https://www.khoury.northeastern.edu/people/carla-brodley/"

#let's open the page
dean_page = urllib.urlopen(deanURL)

#creates a BeautifulSoup object containing the content of the page
#the parser is named explicitly; omitting it makes BeautifulSoup print a warning
page_soup = BeautifulSoup(dean_page.read(), "lxml")

#finds the p tag with class = "contact-email"
email_container = page_soup.find("p", class_="contact-email")

#finds the a tag within the contact-email p tag and extracts the text (email address)
email = email_container.find("a").get_text()


In [47]:
email

'khoury-dean@northeastern.edu'

### Let's do this for everyone on the page!

### 7.1. Web Crawling
* Web crawlers are scrapers that traverse multiple pages, and even multiple sites
* A crawler retrieves the contents of the page at a URL, examines that page for further URLs, and then retrieves those pages or portions of them
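
The fetch, examine, follow cycle described above can be sketched as a small breadth-first crawler. To keep the example runnable offline, the "website" is a hypothetical dictionary of URL to HTML, and `crawl`, `PAGES`, and `max_pages` are names invented for this sketch. A real crawler would fetch each URL over the network and pause between requests.

```python
from collections import deque
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical in-memory "website": URL -> HTML, so the sketch runs offline
PAGES = {
    "https://example.edu/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.edu/a": '<a href="/b">B</a>',
    "https://example.edu/b": '<a href="/">home</a>',
}

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, collect its links, then visit them."""
    visited = []              # pages retrieved, in order
    seen = {start_url}        # URLs already queued, to avoid loops
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:      # page could not be retrieved
            continue
        visited.append(url)
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("https://example.edu/", PAGES.get)
print(order)
```

The `seen` set is what stops the crawler from revisiting pages that link back to each other, and `max_pages` caps the total work.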

In [48]:
import urllib.request as urllib
from bs4 import BeautifulSoup
import time
import random

#opens the faculty page
html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
#turns the page into a BeautifulSoup object
#the parser is named explicitly; omitting it makes BeautifulSoup print a warning
soup = BeautifulSoup(html.read(), "lxml")

#gets all the content under <div class = "grid-item">
faculty_divs = soup.find_all("div", class_="grid-item")

#defines an empty list that will contain the email addresses crawled from faculty webpages
emails = []

#only the first three profs are crawled here
for prof in faculty_divs[:3]:
    
    #gets faculty name for each prof
    fname = prof.find("h3").get_text().strip()
    #gets the URL to their webpage
    fpage = prof.find("a").get("href").strip()
    
    #opens the URL for each prof
    fac_page = urllib.urlopen(fpage)    
    #turns the page for each prof into a BeautifulSoup object
    page_soup = BeautifulSoup(fac_page.read(), "lxml")
    
    #closes the urllib connection so the website won't get mad at us
    fac_page.close()
    
    #on the new page, finds the email container, <p class = "contact-email">
    email_container = page_soup.find("p", class_="contact-email")
    #gets the text for the <a> tag, the email address
    email = email_container.find("a").get_text()
    
    #appends the email to a list and displays it
    emails.append(email)
    print(email)
    
    #waits for a random number of seconds (2-6; randint includes both endpoints) before moving on to the next prof
    #done to avoid overwhelming the website and getting blocked as a bot
    time.sleep(random.randint(2,6))

print("\n\n\nDone scraping the addresses")



amal@ccs.neu.edu
camato@ccs.neu.edu
jaa@ccs.neu.edu



Done scraping the addresses


In [49]:
#displays the list of email addresses
emails

['amal@ccs.neu.edu', 'camato@ccs.neu.edu', 'jaa@ccs.neu.edu']
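
The loop above assumes every faculty page has a `<p class="contact-email">` containing an `<a>` tag. If a page lacks one, `find` returns `None` and the next call raises an `AttributeError`. A hedged sketch of a more defensive helper follows. `extract_email` and the two sample pages are hypothetical, and `html.parser` is used here only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

def extract_email(page_html):
    """Return the address inside <p class="contact-email">, or None if absent."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    container = page_soup.find("p", class_="contact-email")
    if container is None:      # page has no email container
        return None
    link = container.find("a")
    if link is None:           # container exists but holds no <a> tag
        return None
    return link.get_text().strip()

# Hypothetical pages, one with an address and one without
with_email = '<p class="contact-email"><a href="mailto:prof@example.edu">prof@example.edu</a></p>'
without_email = '<p class="contact-phone">555-0100</p>'

print(extract_email(with_email))
print(extract_email(without_email))
```

Returning `None` instead of raising lets the crawl continue and keeps one entry per faculty member, so the list stays aligned with the rows it was built from.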

#### Let's add these email addresses to our dataframe, df

In [50]:
df.head()

Unnamed: 0,Name,Title,Link,Picture
0,Amal Ahmed,Associate Professor,https://www.khoury.northeastern.edu/people/ama...,https://www.khoury.northeastern.edu/wp-content...
1,Christopher Amato,Assistant Professor,https://www.khoury.northeastern.edu/people/chr...,https://www.khoury.northeastern.edu/wp-content...
2,Javed Aslam,Professor,https://www.khoury.northeastern.edu/people/jay...,https://www.khoury.northeastern.edu/wp-content...
3,Kenneth Baclawski,Associate Professor Emeritus,https://www.khoury.northeastern.edu/people/ken...,https://www.khoury.northeastern.edu/wp-content...
4,Albert-László Barabási,Robert Gray Dodge Professor of Network Science,https://www.khoury.northeastern.edu/people/alb...,https://www.khoury.northeastern.edu/wp-content...


In [51]:
len(emails)

3

In [52]:
#we can add a new column
df["Email"] = emails

ValueError: Length of values does not match length of index
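
The assignment fails because `emails` holds only 3 addresses (the loop crawled `faculty_divs[:3]`) while `df` has 69 rows. One way around this, sketched below under the assumption that the addresses belong to the first rows in order, is to wrap the list in a Series indexed by those rows. pandas then aligns on the index and fills the rest with `NaN`. The small DataFrame is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-ins: four rows, but only two crawled addresses
df = pd.DataFrame({"Name": ["Ada", "Alan", "Grace", "Edsger"]})
emails = ["ada@example.edu", "alan@example.edu"]

# A Series is aligned on the index, so rows without an address get NaN
df["Email"] = pd.Series(emails, index=df.index[:len(emails)])
print(df)
```

Crawling all faculty pages, so that `emails` has one entry per row, also resolves the mismatch.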

In [None]:
df.head()

<a id="8"></a>

## 8. More on Web Scraping
* **BeautifulSoup** Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* **Selenium:** https://selenium-python.readthedocs.io/index.html
    * Web automation and scraping; dynamic GET and POST requests; can interact with dynamic web pages, forms, etc.
* **Scrapy:** https://scrapy.org/
    * A framework optimized for web crawling tasks