## Outline
* Basic elements in html and what they are (html, body, tables, trs, tds, divs)
* Common attributes in html, such as class and id
* Concept of selectors relationships (select children or direct child, next sibling, nth child)
* Parsing HTML (Beautiful Soup)

## The result of scraping: HTML data

* A GET request to a web page returns HTML. What does this look like?
* Let's see what is on the NYU Computer Science department website.

# Requests

* `requests` is a Python package that lets you make HTTP requests, so you can interact with the Internet from Python!
* There are other packages (e.g. `urllib`), but `requests` is generally easier to use.

In fact, getting the NYU Computer Science home page is as simple as
```
import requests
text = requests.get("https://cs.nyu.edu/home/index.html").text
```

In [15]:
import requests
url = "https://cs.nyu.edu/home/index.html"

r = requests.get(url)
    
urlText = r.text

Nchars = 6000
print(urlText[:Nchars]) # Print the first 6000 characters
print("\n\n... " + str(len(urlText)-Nchars) + " additional characters")


<!DOCTYPE html>
<!--[if lt IE 7]>
<html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7]>
<html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8]>
<html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en">
 <!--<![endif]-->
 <head>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <title>
   NYU Computer Science
  </title>
  <meta content="The homepage of the Computer Science Department at the Courant Institute of Mathematical Sciences, a part of New York University." name="description"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="/home/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/>
  <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet"/>
  <link href="/home/css/main.css?20180702b" rel="stylesheet" type="text/css"/>
  <link href="https://cdnjs.cloudflare.com/ajax

In [16]:
len(r.text)

31159

Wow, that is gross looking! This is **raw** HTML, which the browser renders into the page you actually see.

## What is HTML?

* HTML (HyperText Markup Language) is the most basic building block of the Web. It describes and defines the content of a webpage along with its basic structure.

* HTML markup consists of "elements", which are written using tags such as
    * `<head>`, `<title>`, `<body>`, `<p>`, `<div>`, `<img>`, ...
    


1. `<head>` provides general information (metadata) about the document, such as its title. It is the first child of `<html>`.
2. `<body>` represents the content of an HTML document. There can be only one `<body>` element in a document.
3. `<div>` is the generic container for flow content. It has no effect on the content or layout until styled using CSS. As a "pure" container, the element does not inherently represent anything. Instead, it's used to group content so it can be easily styled using the class or id attributes, to mark a section of a document as being written in a different language (using the lang attribute), and so on.
4. `<table>` represents tabular data. Inside it, `<tr>` is a table row, `<th>` is a table header cell, and `<td>` is a table data cell.


See [this tutorial](http://fab.academany.org/2018/labs/fablaboshanghai/students/bob-wu/Fabclass/week2_project_management/HTML.html) for more reference.

In [27]:
!cat sample.html

<html>
	<head>
		<title> Page title </title>
	</head>

	<body>
		<h1> This is a heading </h1>
		<p> This is a paragraph </p>

		<p> This is another paragraph </p>

		<table>
			<caption>Alien football stars</caption>
			<tr> 
				<th scope="col">Player</th>
				<th scope="col">Gloobles</th>
				<th scope="col">Za'taak</th>
			</tr>
			<tr>
				<th scope="row">TR-7</th>
				<td>7</td>
				<td>4,569</td>
			</tr>
			<tr>
				<th scope="row">Khiresh Odo</th>
				<td>7</td>
				<td>7,223</td>
			</tr>
			<tr>
				<th scope="row">Mia Oolong</th>
				<td>9</td>
				<td>6,219</td>
			</tr>
		</table>
	</body>
</html>
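
To make the roles of `<tr>`, `<th>`, and `<td>` concrete, here is a minimal sketch that reads a table row by row using only Python's standard library. The class name `TableRowCollector` and the variable names are made up for this illustration, and the HTML string is a shortened copy of the table in `sample.html`.

```python
from html.parser import HTMLParser

# Shortened copy of the table from sample.html
SAMPLE_TABLE = """
<table>
  <caption>Alien football stars</caption>
  <tr>
    <th scope="col">Player</th>
    <th scope="col">Gloobles</th>
    <th scope="col">Za'taak</th>
  </tr>
  <tr>
    <th scope="row">TR-7</th>
    <td>7</td>
    <td>4,569</td>
  </tr>
  <tr>
    <th scope="row">Khiresh Odo</th>
    <td>7</td>
    <td>7,223</td>
  </tr>
</table>
"""


class TableRowCollector(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


collector = TableRowCollector()
collector.feed(SAMPLE_TABLE)
table_rows = collector.rows
for row in table_rows:
    print(row)
```

Each `<tr>` becomes one Python list, and each `<th>` or `<td>` inside it becomes one string in that list. The `<caption>` is skipped because it is not inside a row.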

In [17]:
# open a local HTML file in the default web browser

import webbrowser
import os
webbrowser.open('file://' + os.path.realpath("sample.html"));


##  Images and Hyperlinks

* Tag for a picture (`src` may be a local file name or a URL pointing to the image; `alt` is the text shown when the image cannot be displayed):

`<img src="HumDum.png" alt="Humpty Dumpty">`

* Tag for a hyperlink: 

  `<a href="https://nyu.edu/">Visit our page!</a>`


In [28]:
!cat pic_ref.html

<html>
	<head>
		<title> Page title </title>
	</head>

	<body>
		<h1> My morning starts with:  </h1>
        <img src="cup.jpg" alt="coffee cup">
		<p> You can find gigantic cups  
        <a href="https://www.amazon.com/Allures-Illusions-Worlds-Largest-Gigantic/dp/B00F690L0E?ref_=fsclp_pl_dp_6">HERE!</a>
        </p>

		<p> This is another paragraph </p>
	</body>
</html>

In [18]:
import webbrowser
import os
webbrowser.open('file://' + os.path.realpath("pic_ref.html"));

# The displayed size of an image can be changed with the width and height attributes of <img>.

## Attributes
An attribute gives more information about an element and how it behaves. It is written inside the opening tag as `name="value"`. The attributes most relevant to CSS styling are `class`, `id`, and `style`.

## `class` attribute 
The class attribute is often used to point to a class name in a style sheet. It can also be used by JavaScript to access and manipulate elements with that class name. Several elements can share the same class.

In [20]:
!cat class_attribute.html

<!DOCTYPE html>
<html>
<head>
<style>
.city {
  background-color: tomato;
  color: white;
  padding: 10px;
}
</style>
</head>
<body>

<h2 class="city">London</h2>
<p>London is the capital of England.</p>

<h2 class="city">Paris</h2>
<p>Paris is the capital of France.</p>

<h2 class="city">Tokyo</h2>
<p>Tokyo is the capital of Japan.</p>

</body>
</html>
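
Because several elements can share a class, the class name is a convenient handle when scraping. The following is a minimal sketch, assuming the `bs4` package (Beautiful Soup 4, the parser covered later in these notes) is installed. The HTML string is copied from the body of `class_attribute.html`, and the variable names are chosen for this illustration only.

```python
from bs4 import BeautifulSoup

# Body of class_attribute.html
CLASS_HTML = """
<h2 class="city">London</h2>
<p>London is the capital of England.</p>
<h2 class="city">Paris</h2>
<p>Paris is the capital of France.</p>
<h2 class="city">Tokyo</h2>
<p>Tokyo is the capital of Japan.</p>
"""

class_soup = BeautifulSoup(CLASS_HTML, "html.parser")

# `class` is a reserved word in Python, so Beautiful Soup uses the keyword `class_`.
city_headings = class_soup.find_all("h2", class_="city")
city_names = [h.get_text(strip=True) for h in city_headings]
print(city_names)

# The same lookup written as a CSS selector: a dot followed by the class name.
selected_names = [h.get_text(strip=True) for h in class_soup.select(".city")]
print(selected_names)

# An element can carry several classes, so the attribute value is a list.
first_classes = city_headings[0]["class"]
print(first_classes)
```

Both lookups return the three `<h2>` elements and skip the `<p>` elements, which have no class.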

In [19]:
import webbrowser
import os
webbrowser.open('file://' + os.path.realpath("class_attribute.html"));

## `id` attribute
The id attribute specifies a unique id for an HTML element. The value of the id attribute must be unique within the HTML document.

The id attribute is used to point to a specific style declaration in a style sheet. It is also used by JavaScript to access and manipulate the element with the specific id.

In a style sheet, an id is selected by writing a hash character (`#`) followed by the id name. The CSS properties are then defined within curly braces `{}`.

In [22]:
!cat id_attribute.html

<!DOCTYPE html>
<html>
<head>
<style>
#myHeader {
  background-color: lightblue;
  color: black;
  padding: 40px;
  text-align: center;
}
</style>
</head>
<body>

<h1 id="myHeader">My Header</h1>

</body>
</html>
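
Since an id is unique within a document, it identifies exactly one element. As a minimal sketch, again assuming the `bs4` package (Beautiful Soup 4) is installed, the header from `id_attribute.html` can be looked up either by its `id` attribute or with the `#` selector syntax. The variable names are chosen for this illustration only.

```python
from bs4 import BeautifulSoup

# Simplified copy of id_attribute.html (style sheet omitted)
ID_HTML = '<html><body><h1 id="myHeader">My Header</h1></body></html>'

id_soup = BeautifulSoup(ID_HTML, "html.parser")

# Look the element up by its id attribute ...
header = id_soup.find(id="myHeader")
# ... or with the CSS selector syntax: a hash followed by the id name.
same_header = id_soup.select_one("#myHeader")

print(header.name)        # tag name of the element
print(header["id"])       # value of the id attribute
print(header.get_text())  # text inside the element

# Looking up an id that does not exist returns None instead of raising an error.
missing = id_soup.find(id="noSuchId")
print(missing)
```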

In [21]:
import webbrowser
import os
webbrowser.open('file://' + os.path.realpath("id_attribute.html"));


## Selector Relationships
- Parent – An element that directly contains another element
- Child – An element directly contained within another element
- Descendant – Any element inside of an element, even if it is nested within further elements
- Ancestor – Any element that contains our element, even if it isn't the immediate parent
- Sibling – An element that shares the same parent as another element


See [this tutorial](https://www.w3schools.com/css/css_combinators.asp) on how to use CSS combinators.
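
These relationships map directly onto CSS combinators. Below is a minimal sketch, assuming the `bs4` package (Beautiful Soup 4) is installed. The HTML snippet, the helper `texts`, and all variable names are invented for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a <div> with three direct <p> children
# and one <p> nested more deeply inside a <section>.
REL_HTML = """
<div id="outer">
  <p>first</p>
  <p>second</p>
  <section>
    <p>nested</p>
  </section>
  <p>last</p>
</div>
"""

rel_soup = BeautifulSoup(REL_HTML, "html.parser")


def texts(tags):
    """Return the stripped text of each tag in a list of tags."""
    return [t.get_text(strip=True) for t in tags]


# Descendant (space): every <p> anywhere inside #outer
descendants = texts(rel_soup.select("#outer p"))
# Direct child (>): only <p> elements whose parent is #outer
children = texts(rel_soup.select("#outer > p"))
# Next sibling (+): the <p> that immediately follows a <section>
next_sibling = texts(rel_soup.select("section + p"))
# nth child: the second element child of #outer
second_child = texts(rel_soup.select("#outer > :nth-child(2)"))

print(descendants)
print(children)
print(next_sibling)
print(second_child)

# Navigating upward: the parent and all ancestors of the nested <p>
nested = rel_soup.select_one("section p")
parent_name = nested.parent.name
ancestor_names = [a.name for a in nested.parents]
print(parent_name, ancestor_names)
```

The nested paragraph shows the difference between the two combinators: it is a descendant of `#outer` but not a direct child, so it appears in the first list and not in the second.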

# Cleaning

* Now that we have an idea of what the basic structure of HTML looks like, we can start cleaning it.

* To process it, we can use [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

**Warning:** BeautifulSoup has changed quite a bit between versions, so make sure you are looking at documentation for the version you are using (4 here).

In [22]:
import requests
from bs4 import BeautifulSoup

url = "https://cs.nyu.edu/home/index.html"
r = requests.get(url)
urlText = r.text

# parse the raw HTML with Python's built-in HTML parser
soup = BeautifulSoup(urlText, 'html.parser')
soup

<!DOCTYPE html>

<!--[if lt IE 7]>
<html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7]>
<html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8]>
<html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en">
<!--<![endif]-->
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<title>
   NYU Computer Science
  </title>
<meta content="The homepage of the Computer Science Department at the Courant Institute of Mathematical Sciences, a part of New York University." name="description"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/home/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/>
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet"/>
<link href="/home/css/main.css?20180702b" rel="stylesheet" type="text/css"/>
<link href="https://cdnjs.cloudflare.com/ajax/libs/animate.css/3

In [24]:
# we can extract the title of the document

soup.title
soup.title.string

'\n   NYU Computer Science\n  '

In [25]:
# soup.p returns the first <p> (paragraph) element in the document

soup.p

# Open the page in a browser, right-click, and choose "View Page Source".
# Can you find the <p> tags and the hyperlinks (<a> tags)?

<p>
           Yann LeCun has been elected to the <a href="http://www.nasonline.org/news-and-multimedia/news/2021-nas-election.html">National Academy of Sciences</a>. Congratulations!
         </p>

In [11]:
# Grab all the links

all_links  = soup.find_all('a')
all_links
#type(all_links)

[<a href="#main-content" id="skiptomaincontent">
       Skip to main content
      </a>,
 <a href="https://www.nyu.edu/life/safety-health-wellness/coronavirus-information.html" style="color: #545157;">COVID-19 INFO</a>,
 <a href="https://cs.nyu.edu/home/courses/grad-fall21.html">
         Graduate
        </a>,
 <a href="https://cs.nyu.edu/home/courses/ug-fall21.html">
         Undergraduate
        </a>,
 <a href="http://www.nyu.edu">
          NYU
         </a>,
 <a href="http://cims.nyu.edu">
         COURANT
        </a>,
 <a href="/">
          Computer Science
         </a>,
 <a aria-expanded="false" href="/home/about/" id="nav-about" role="button">
         About
        </a>,
 <a href="/dynamic/news/">
           News &amp; Events
          </a>,
 <a href="/home/about/directions.html">
           Directions
          </a>,
 <a href="/home/about/contacts.html">
           Contacts
          </a>,
 <a aria-expanded="false" href="/home/research/overview.html" id="nav-research" rol

In [12]:
# print the visible text of each link
for link in soup.find_all('a'):
    print("new link: " + link.text + "\n")
    

new link: 
      Skip to main content
     

new link: COVID-19 INFO

new link: 
        Graduate
       

new link: 
        Undergraduate
       

new link: 
         NYU
        

new link: 
        COURANT
       

new link: 
         Computer Science
        

new link: 
        About
       

new link: 
          News & Events
         

new link: 
          Directions
         

new link: 
          Contacts
         

new link: 
        Research
       

new link: 
          Overview
         

new link: 
          Centers
         

new link: 
          Areas
         

new link: 
          Faculty Achievements
         

new link: 
          Theses & Reports
         

new link: 
          Colloquium
         

new link: 
          Seminars
         

new link: 
        People
       

new link: 
          Faculty
         

new link: 
          Researchers
         

new link: 
          Administration & Staff
         

new link: 
          Ph.D. Students
         

new lin

In [13]:
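# print the target URL (href attribute) of each link;
# link.get('href') returns None when a link has no href, hence the str()
for link in soup.find_all('a'):
    print("new link: " + str(link.get('href')) + "\n")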
 

new link: #main-content

new link: https://www.nyu.edu/life/safety-health-wellness/coronavirus-information.html

new link: https://cs.nyu.edu/home/courses/grad-fall21.html

new link: https://cs.nyu.edu/home/courses/ug-fall21.html

new link: http://www.nyu.edu

new link: http://cims.nyu.edu

new link: /

new link: /home/about/

new link: /dynamic/news/

new link: /home/about/directions.html

new link: /home/about/contacts.html

new link: /home/research/overview.html

new link: /home/research/overview.html

new link: /home/research/centers.html

new link: /dynamic/research/areas/

new link: /dynamic/people/achievements/

new link: /dynamic/reports/

new link: /dynamic/news/colloquium/

new link: /home/research/seminars.html

new link: /home/people/

new link: /dynamic/people/faculty/

new link: /dynamic/people/researchers/

new link: /dynamic/people/staff/

new link: /dynamic/people/phd_students/

new link: /home/people/alumni.html

new link: /home/people/in_memoriam.html

new link: http
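
Many of the `href` values above are relative (for example `/home/about/` or `#main-content`), so they cannot be requested as they are. One way to turn them into full URLs is `urljoin` from the standard library, which resolves each link against the URL of the page it came from. This is a minimal sketch: the list `hrefs` is a hand-copied subset of the output above, and the variable names are chosen for this illustration.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from
base_url = "https://cs.nyu.edu/home/index.html"

# A few href values copied from the output above
hrefs = [
    "#main-content",
    "http://www.nyu.edu",
    "/",
    "/home/about/",
    "/dynamic/news/",
    "/home/about/directions.html",
]

# Absolute URLs are left unchanged; relative ones are resolved against base_url.
absolute_links = [urljoin(base_url, href) for href in hrefs]
for href, full in zip(hrefs, absolute_links):
    print(href, "->", full)
```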

In [26]:
# Show the text
print(soup.get_text())












   NYU Computer Science
  































   $(function () {
            $("#news").load("/dynamic/news/short_list/");
        });
        $(function () {
            $("#events").load("/dynamic/news/event/short_list/");
        });
  












      Skip to main content
     


COVID-19 INFO | Fall 2021 Schedule Information:
        
        Graduate
       
       /
       
        Undergraduate
       






         NYU
        


        |
       

        COURANT
       







         Computer Science
        





       Toggle navigation
      













        About
       



          News & Events
         



          Directions
         



          Contacts
         





        Research
       



          Overview
         



          Centers
         



          Areas
         



          Faculty Achievements
         



          Theses & Reports
         



          Colloquium
         



          Seminars
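
The output above is mostly whitespace, and it even contains JavaScript code from a `<script>` element. A minimal sketch of one way to tidy this up, assuming the `bs4` package (Beautiful Soup 4) is installed: remove `<script>` and `<style>` elements first, then ask `get_text` to strip each piece of text. The HTML string is a small hand-made stand-in for the real page, and the variable names are chosen for this illustration. Newer releases of Beautiful Soup may already leave script contents out of `get_text()`, in which case the removal step is harmless.

```python
from bs4 import BeautifulSoup

# Small stand-in for the real page, including a <script> element
TEXT_HTML = """
<html><head><title>
   NYU Computer Science
  </title>
<script>$(function () { $("#news").load("/dynamic/news/short_list/"); });</script>
</head>
<body>
<a href="#main-content">
      Skip to main content
     </a>
<p>COVID-19 INFO</p>
</body></html>
"""

text_soup = BeautifulSoup(TEXT_HTML, "html.parser")

# Remove elements whose contents are code rather than visible text.
for tag in text_soup.find_all(["script", "style"]):
    tag.decompose()

# separator=" " keeps text from neighboring tags apart;
# strip=True trims each piece and drops pieces that are only whitespace.
clean_text = text_soup.get_text(separator=" ", strip=True)
print(clean_text)
```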
         