[![AnalyticsDojo](../fig/final-logo.png)](http://rpi.analyticsdojo.com)
<center><h1>Introduction to Python - Web Mining</h1></center>
<center><h3><a href = 'http://rpi.analyticsdojo.com'>rpi.analyticsdojo.com</a></h3></center>

## This tutorial is taken directly from the BeautifulSoup documentation.
[https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Before you begin
If running locally, make sure that `beautifulsoup4` and `requests` are installed.
`conda install beautifulsoup4 requests`

In [None]:
# All HTML documents have a structure. Here is a basic HTML page.

In [2]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [3]:
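from bs4 import BeautifulSoup
import requests

#Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, 'html.parser')

#prettify() returns the parse tree as an indented string.
print(soup.prettify())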


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


### A Retrieved Beautiful Soup Object
- Can be navigated with dot notation to traverse down the hierarchy by *tag name*; use `find()` and `find_all()` to search by *class*, *id*, or other attributes.



In [4]:
soup


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [5]:
#Select the title tag.
soup.title
 


<title>The Dormouse's story</title>

In [6]:
#Name of the tag.
soup.title.name




'title'

In [7]:
#String contents inside the tag.
soup.title.string




"The Dormouse's story"

In [8]:
#Parent in hierarchy.
soup.title.parent.name




'head'

In [9]:
#List the first p tag.
soup.p




<p class="title"><b>The Dormouse's story</b></p>

In [10]:
#List the class of the first p tag.
soup.p['class']




['title']

In [11]:
#Show the first a tag.
soup.a




<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [12]:
#List all a tags.
soup.find_all('a')



[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [13]:

#Find the single tag whose id is link3.
soup.find(id="link3")


<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
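
The tags returned by `find_all()` and `find()` can be unpacked further. The following is a minimal sketch, assuming a shortened copy of the sample document; the names `sample_html`, `sample_soup`, `links`, and `names` are illustrative and not part of the original tutorial.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document used above (illustrative).
sample_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tag.get() returns the attribute value, or None if the attribute is missing.
links = [a.get('href') for a in sample_soup.find_all('a')]
print(links)

# class_ (with a trailing underscore) filters on the CSS class,
# because "class" is a reserved word in Python.
names = [a.string for a in sample_soup.find_all('a', class_='sister')]
print(names)
```

Using `get('href')` rather than `a['href']` avoids a `KeyError` on anchors that have no `href` attribute, which is common on real pages.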

In [14]:
#A site's robots.txt file lists which crawlers are allowed and which paths are off limits.
response = requests.get("https://en.wikipedia.org/robots.txt")
txt = response.text
print(txt)

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: 
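
Rules like these can also be checked in code before scraping. The sketch below uses the standard library's `urllib.robotparser` on two rules copied from the output above, so it runs without a network connection. The names `robots_lines`, `page`, `blocked`, and `allowed` are illustrative, and the page URL is only an example.

```python
from urllib.robotparser import RobotFileParser

# Two rules copied from the robots.txt output above.
robots_lines = """
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)

# An example page URL (an assumption for illustration).
page = "https://en.wikipedia.org/wiki/Web_scraping"

# "Disallow: /" blocks the whole site for this user agent.
blocked = parser.can_fetch("MJ12bot", page)
# An empty "Disallow:" line means nothing is disallowed.
allowed = parser.can_fetch("IsraBot", page)
print(blocked, allowed)
```

To check a live site instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file for you.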

In [15]:
#Download a live page and parse the returned HTML.
response = requests.get("https://www.rpi.edu")
txt = response.text
soup = BeautifulSoup(txt, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <!-- Global Site Tag (gtag.js) - Google Analytics -->
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-29465755-1">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments)};
  gtag('js', new Date());

  gtag('config', 'UA-29465755-1');
  </script>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <!--required for Bootstrap v4-->
  <title>
   Rensselaer Polytechnic Institute (RPI) :: Architecture, Business, Engineering, Humanities, IT &amp; Web Science, Science
  </title>
  <!-- BOOTSTRAP v4 CSS-->
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" integrity="sha384-rwoIResjU2yc3z8GV/NPeZWAv56rSmLldC3R/AZzGRnGxQQKnKkoFVhFQhNUwEyJ" rel="stylesheet"/>
  <script crossor

In [16]:
#List all a tags on the page.
soup.find_all('a')

[<a class="skip-main" href="#maincontent">Skip to main content</a>,
 <a href="https://rpi.edu"><img alt="Rensselaer Polytechnic Institute" height="115" src="https://www.rpi.edu/dept/cct/apps/web-branding/v2/header/meganav/img/RPIlogo_white.png" width="378"/></a>,
 <a href="http://studentlife.rpi.edu/">Students</a>,
 <a href="http://studentlife.rpi.edu/student-orientation/parents-families">Parents</a>,
 <a href="http://rpinfo.rpi.edu/">Faculty &amp; Staff</a>,
 <a href="http://alumni.rpi.edu">Alumni</a>,
 <a href="http://admissions.rpi.edu/">Apply</a>,
 <a href="https://info.rpi.edu/visit">Visit</a>,
 <a href="http://giving.rpi.edu/">Give </a>,
 <a href="https://info.rpi.edu/rpi-search" target="_blank">Search</a>,
 <a class="mmabou" href="http://rpi.edu/about/index.html">About Rensselaer</a>,
 <a class="mmacad" href="http://rpi.edu/academics/index.html">Academics</a>,
 <a href="http://www.arch.rpi.edu/">Architecture</a>,
 <a href="http://lallyschool.rpi.edu/">Business</a>,
 <a href="htt
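
The list above mixes absolute links with relative ones such as `#maincontent`. One possible way to turn each into a full URL is `urllib.parse.urljoin`. This is a sketch on three anchors copied from the output above; `base_url`, `snippet`, `snippet_soup`, and `resolved` are illustrative names, and the base URL with its trailing slash is an assumption.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed base address of the page the links came from.
base_url = "https://www.rpi.edu/"

# Three anchors copied from the output above.
snippet = """
<a class="skip-main" href="#maincontent">Skip to main content</a>
<a href="http://studentlife.rpi.edu/">Students</a>
<a href="https://info.rpi.edu/visit">Visit</a>
"""
snippet_soup = BeautifulSoup(snippet, 'html.parser')

resolved = []
# href=True keeps only the anchors that actually have an href attribute.
for a in snippet_soup.find_all('a', href=True):
    text = a.get_text(strip=True)
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    resolved.append((text, urljoin(base_url, a['href'])))

for text, url in resolved:
    print(text, '->', url)
```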

In [None]:
# Experiment with a website of your own: replace the placeholder below with a full URL, including http:// or https://.

response = requests.get("enter url here")
txt = response.text
soup = BeautifulSoup(txt, 'html.parser')

print(soup.prettify())

# For more info, see https://github.com/stanfordjournalism/search-script-scrape
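
When experimenting with arbitrary sites, requests can fail: a malformed URL, a timeout, or an error status code. The helper below is a hedged sketch of one way to handle that. `fetch_soup` is a hypothetical function, not part of `requests` or BeautifulSoup, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Return a BeautifulSoup object for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an exception for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(f"Could not fetch {url!r}: {err}")
        return None
    return BeautifulSoup(response.text, 'html.parser')

# The placeholder string is not a valid URL, so the helper reports the
# problem and returns None instead of raising.
page = fetch_soup("enter url here")
print(page)
```

Passing a real address, for example `fetch_soup("https://www.rpi.edu")`, returns a parsed page that can be searched with `find_all('a')` as shown earlier.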