## Day 42 of 100DaysOfCode 🐍
### Web Scraping - HTTP, XPath notation, XPath syntax, and Selectors

### **HTTP 🌐📨**

HTTP stands for **Hypertext Transfer Protocol**.

- It is the foundation of data communication on the World Wide Web, allowing web browsers to fetch and display web pages.
- It works by transmitting requests from clients (like browsers) to servers and receiving responses back.
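
To make the request/response cycle concrete, here is a minimal sketch that uses only the Python standard library. The `HelloHandler` class and the `fetch_from_local_server` helper are hypothetical names made up for this illustration; the server runs on the local machine, so no real website is contacted.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET request with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><p>Hello World!</p></body></html>"
        self.send_response(200)  # status line of the response
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # response body

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        # An opener without proxies, so the local request is sent directly.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url) as response:  # the client sends a GET request
            status = response.status
            content_type = response.headers.get_content_type()
            text = response.read().decode("utf-8")
        return status, content_type, text
    finally:
        server.shutdown()
        server.server_close()


status, content_type, html = fetch_from_local_server()
print(status, content_type)
print(html)
```

The client side (the `opener.open` call) plays the role of the browser, and the handler plays the role of the web server.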

### **XPath notation 🧭📝**

XPath notation is a syntax for navigating and selecting elements in XML and HTML documents. It lets you locate specific nodes or elements based on their relationships or attributes within the document structure.

### **XPath syntax 🧭🧾**

XPath syntax is a concise and powerful way to describe the path to specific elements in XML and HTML documents. It uses a combination of slashes, node names, and predicates to define the location of the desired elements within the document.
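
As a rough illustration of slashes, node names, and predicates, the sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in. It supports only a limited subset of XPath, and its paths are written relative to the root element (here `<html>`) rather than starting with `/html`. The small document is made up for this example.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed document used only for this illustration.
doc = """
<html>
  <body>
    <div>
      <p>First</p>
      <p>Second</p>
    </div>
    <div>
      <span><p>Nested</p></span>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(doc)  # root is the <html> element

# Single slashes step down one level at a time, by node name.
direct = [p.text for p in root.findall("./body/div/p")]

# A double slash matches descendants at any depth.
anywhere = [p.text for p in root.findall(".//p")]

# Predicates in brackets narrow a step; positions start at 1.
second = [p.text for p in root.findall("./body/div[1]/p[2]")]

print(direct)
print(anywhere)
print(second)
```

With a full XPath engine such as the one behind Scrapy's `Selector`, the first path would typically be written as `/html/body/div/p`.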

### **Selectors 🎯📌**

Selectors in web scraping are patterns or expressions used to identify and extract specific elements from a webpage's HTML or XML structure. They let you target and retrieve desired data, such as text, images, or links, efficiently from web pages.

#### **Exercise - Choose DataCamp!**

In [None]:
# Consider the following HTML
<html>
  <body>
    <div>
      <p>Hello World!</p>
      <div>
        <p>Choose DataCamp!</p>
      </div>
    </div>
    <div>
      <p>Thanks for Watching!</p>
    </div>
  </body>
</html>

# Create an XPath string to the desired paragraph element
xpath = '/html/body/div/div/p'

#### **Attribute**

@ represents "attribute"
- @class
- @id
- @href

#### Brackets and Attributes
xpath = '//p[@class="class-1"]'

#### Contain with Contains
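XPath contains notation: **contains(@attri-name, "string-expr")**

It matches elements whose `attri-name` attribute value contains `string-expr` as a substring, whereas `@attri-name="string-expr"` requires the whole value to match exactly.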

#### **Exercise - Where it's @**

In [None]:
# Consider the following HTML
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>

# Create an XPath string to the desired paragraph element
xpath = '//*[@id="div3"]/p'
# or, equivalently for this HTML:
xpath = '//*[@class="class-2"]/p'

#### **Exercise - Check your Class**

In [None]:
# Consider the following HTML
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>

# Create an XPath string to the desired paragraph element
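# Selects the "Hello World!" paragraph, whose class attribute is exactly "class-1 class-2"
xpath = '//p[@class="class-1 class-2"]'
# Note: '//*[@class="class-2"]/p' selects a different paragraph ("Thanks for Watching!"),
# because div3 is the only element with class exactly "class-2" that has a p child.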

#### **Exercise - Hyper(link) Active**

In [None]:
# Consider the following HTML
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose
            <a href="http://datacamp.com">DataCamp!</a>!
        </p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>

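# Create an XPath string to the href attribute of the link inside the desired paragraph.
# The <a> element in this HTML has no class attribute, so the paragraph's id is used instead.
xpath = '//p[@id="p2"]/a/@href'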

#### **XPath Chaining**

In [None]:
# These two calls select the same elements
sel.xpath('//div/span/p[3]')
sel.xpath('//div').xpath('./span/p[3]')

#### **Exercise - Divvy Up This Exercise**

We have pre-loaded an HTML document into the string variable `html`. In this two-part problem you will use `html` to set up a `Selector` object and create a `SelectorList` that selects all `div` elements; then, you will check your understanding of what happens within the `SelectorList`.

<br>

> Instructions
- Set up the Selector object sel with the html variable passed as the text argument.
- Assign to the variable divs a SelectorList of all div elements within the HTML document.

In [None]:
from scrapy import Selector

# Create a Selector selecting html as the HTML document
sel = Selector(text=html)

# Create a SelectorList of all div elements in the HTML document
divs = sel.xpath('//div')

#### **Practice - HTML text to Selector**

In [None]:
# Importing the Selector from Scrapy
from scrapy import Selector

In [None]:
# Importing Requests and URL
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'

In [None]:
# Getting the contents using get() method
html = requests.get(url).content

In [None]:
# Passing the content to the Selector
sel = Selector(text=html)

#### **Requesting a Selector**

We have pre-loaded the URL for a particular website in the string variable `url` and used the requests library to put the content from the website into the variable `html`. Your task is to create a `Selector` object `sel` using the HTML source code stored in `html`.

<br>

> Instructions
- Fill in the two blanks below to create the Selector object sel, which takes html as its text input.

In [None]:
# Importing a scrapy Selector
from scrapy import Selector

# Importing requests
import requests

# Creating the string html containing the HTML source
html = requests.get(url).content

# Creating the Selector object sel from html
sel = Selector(text=html)

# Print out the number of elements in the HTML document
print("There are 1020 elements in the HTML document.")
print("You have found: ", len(sel.xpath('//*')))