# Web Scraping -

- Two ways to extract data from a website:
    - **Use the website's API**, if one exists (e.g. Facebook API, Twitter API, Amazon API).
    - **Web Scraping / Web Harvesting / Web Data Extraction** - accessing the HTML page of a website and extracting the required data from it.

### Steps involved in scraping -

1. **Send an HTTP request to the URL of the website** - the server responds to the request by returning the HTML content of the page.
    - The **requests** package is used for this.

2. **Parse the data** - once we have the HTML content, we have to parse it. The reasons are -
    - HTML is nested, so the required data cannot be pulled out reliably with plain string processing.
    - Parsing creates a nested / tree structure from the HTML data.
    - **html5lib** is the parser used here; it parses HTML the way a web browser does.
    
3. **Tree traversal** - navigating & searching the parse tree created in Step 2.
    - **BeautifulSoup** - a library for pulling data out of **HTML** & **XML** files.
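
To make the "nested / tree structure" idea concrete, here is a minimal sketch that prints the nesting of a tiny, made-up HTML string. It deliberately uses the standard library's `html.parser` instead of html5lib and BeautifulSoup so that it runs without any installs; `SAMPLE_HTML` and `TreePrinter` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML document standing in for a real server response.
SAMPLE_HTML = (
    "<html><body><div id='quotes'>"
    "<h5>HOPE</h5><p>Keep going.</p>"
    "</div></body></html>"
)


class TreePrinter(HTMLParser):
    """Records each opening tag together with its nesting depth.

    Assumes every tag in the input is explicitly closed, which holds for
    SAMPLE_HTML but not for real pages with void tags such as <img>.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
printer.close()

# Indent each tag by its depth to show the tree shape.
for depth, tag in printer.outline:
    print("  " * depth + tag)
```

The indentation of the printed tags mirrors the parse tree that Step 2 builds and Step 3 walks. A real parser such as html5lib also copes with unclosed and malformed tags, which this sketch does not attempt.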

#### Task 1 : Installing & Loading Required Third-party Packages

- **Installing 3rd-party libraries** -
    - !pip install requests
    - !pip install html5lib
    - !pip install bs4
    
> **pip** -  a **package management system** used to **install and manage software packages** written in Python.
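
Before running the install cells, it can help to check which packages are already importable. The snippet below is a small sketch using only the standard library; `REQUIRED` and `missing_packages` are hypothetical names chosen for this example.

```python
import importlib.util

# Import names used in this notebook. 'bs4' is the import name of the
# BeautifulSoup package.
REQUIRED = ["requests", "html5lib", "bs4"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All required packages are available.")
```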

In [1]:
# Installing 3rd-party libraries

!pip install requests



In [2]:
!pip install html5lib

Collecting html5lib
  Using cached html5lib-1.1-py2.py3-none-any.whl (112 kB)
Installing collected packages: html5lib
Successfully installed html5lib-1.1


In [3]:
!pip install bs4



- **Loading Required Packages**-
    - import requests
    - import html5lib
    - from bs4 import BeautifulSoup
    - import csv

In [1]:
# Loading Required Packages
import csv
import requests
import html5lib
from bs4 import BeautifulSoup

#### Task 2 : Accessing HTML content from website


In [5]:
url = 'https://www.geeksforgeeks.org/data-structures/' # URL of the page we want to scrape
r = requests.get(url) # send an HTTP request to the URL & save the server's response in a response object, 'r'
print(r.content) # raw HTML content of the page, as 'bytes' (use r.text for a 'str')

b'<!DOCTYPE html>\r\n<!--[if IE 7]>\r\n<html class="ie ie7" lang="en-US" prefix="og: http://ogp.me/ns#">\r\n<![endif]-->\r\n<!--[if IE 8]>\r\n<html class="ie ie8" lang="en-US" prefix="og: http://ogp.me/ns#">\r\n<![endif]-->\r\n<!--[if !(IE 7) | !(IE 8)  ]><!-->\r\n<html lang="en-US" prefix="og: http://ogp.me/ns#" >\r\n\r\n<!--<![endif]-->\r\n<head>\r\n<meta charset="UTF-8" />\r\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"> \r\n\r\n<link rel="shortcut icon" href="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png" type="image/x-icon" />\r\n<meta name="theme-color" content="#308D46" />\r\n\r\n<meta property="og:image" content="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_200x200-min.png">\r\n<meta property="og:image:type" content="image/png">\r\n<meta property="og:image:width" content="200">\r\n<meta property="og:image:height" content="200">\r\n<script defer src="https://apis.google.com/js/platform.js"></script>\r
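
The `b'...'` prefix in the output shows that the content is a `bytes` object, not a string. The sketch below illustrates the difference on a short, made-up byte string standing in for `r.content`, so it runs without a network connection. It assumes UTF-8 because the sample declares that charset.

```python
# A short byte string standing in for r.content (made up for illustration).
raw = (
    b'<!DOCTYPE html>\r\n<html lang="en-US">\r\n<head>\r\n'
    b'<meta charset="UTF-8" />\r\n</head>\r\n</html>'
)

print(type(raw).__name__)

# Decoding turns the bytes into text. UTF-8 is an assumption here, based
# on the <meta charset> tag in the sample.
text = raw.decode("utf-8")
print(type(text).__name__)

# Printing bytes shows \r\n as escape sequences; in the decoded text they
# act as real line breaks, so the markup can be split into lines.
lines = text.splitlines()
print(lines[0])
```

With requests, `r.text` performs a comparable decoding step for you, using the encoding it detects for the response.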

#### Task 3 : Parsing the HTML content

In [6]:
url = 'http://www.values.com/inspirational-quotes'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html5lib') # page.content = raw HTML Content, 'html5lib' = HTML parser
print(soup.prettify()) # gives a visual representation of the parse tree created from the raw HTML content.

# BeautifulSoup is built on top of HTML parsing libraries ['html5lib', 'lxml', 'html.parser', etc.]
# So the BeautifulSoup object is created and the parser library is specified in the same call.

<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en-US">
 <head>
  <title>
   Inspirational Quotes - Motivational Quotes - Leadership Quotes | PassItOn.com
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1.0,maximum-scale=1" name="viewport"/>
  <meta content="The Foundation for a Better Life | Pass It On.com" name="description"/>
  <link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
  <link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
  <link href="/site.webmanifest" rel="manifest"/>
  <link color="#c8102e" href="/safari-pinned-tab.svg" rel="mask-icon"/>
  <meta content="#c8102e" name="msapplication-TileColor"/>
  <meta content="#ffffff" name="theme-color"/>
  <link crossorigin="anonymous" href

#### Task 4 : Searching & Navigating through the Parse Tree

- Extract some useful data from the HTML content.
- The soup object contains all the data in a nested structure, which can be extracted programmatically.


>**Scraping Quotes**.

In [5]:
# loading URL
url = 'http://www.values.com/inspirational-quotes'

# sending HTTP request
page = requests.get(url)

# Creating HTML Parser
soup = BeautifulSoup(page.content, 'html5lib')

quotes = [] # a list to store quotes

tables = soup.find('div', attrs = {'id':'all_quotes'})

for row in tables.findAll('div', class_ = 'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(' #')[0]
    quote['author'] = row.img['alt'].split(' #')[1]
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline = '') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
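
Two details of the loop above are worth a closer look: `split(' #')[1]` raises an `IndexError` when an `alt` text has no ` #` marker, and the scraped `url` values are site-relative. The sketch below shows one possible way to handle both with the standard library; `split_alt` and `absolute` are hypothetical helper names, not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.values.com/inspirational-quotes"


def split_alt(alt_text):
    """Split '<quote> #<author>' into (quote, author).

    Uses str.partition, so a missing ' #' marker yields an empty author
    instead of raising an IndexError.
    """
    lines, _, author = alt_text.partition(" #")
    return lines, author


def absolute(link):
    """Resolve a site-relative link against the page URL."""
    return urljoin(BASE_URL, link)


print(split_alt("Friends are the family you choose. #<Author:0x0000557557fa6140>"))
print(split_alt("No author marker here"))
print(absolute("/inspirational-quotes/7826-the-world-needs-dreamers-and-the-world-needs"))
```

Note that `partition` splits only at the first ` #`, so a quote that itself contains ` #` would still need separate handling.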

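**Task 4 : line-by-line walkthrough**
- tag.**find()** - returns the first matching element
    - **'div'** argument - the HTML tag we want to search for.
    - **attrs** argument - a dictionary of additional attributes the tag must have, here `{'id':'all_quotes'}`.
- tag.**findAll()** - returns a list of all matching elements (`find_all()` is the current spelling of the same method)
    - **class_** argument - the value of the tag's `class` attribute to match; the trailing underscore is needed because `class` is a reserved word in Python.
- tables.**prettify()** - returns the matched element as an indented string, which makes its structure easier to inspect.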
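
The difference between the two search methods can be tried out on a small, made-up HTML string that mimics the shape of the quotes page. This sketch assumes `bs4` is installed and uses the built-in `'html.parser'` so that html5lib is not required. The `card` class and the `/q/1` style links are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the shape of the quotes page.
html = """
<div class="row" id="all_quotes">
  <div class="card"><h5>INNOVATION</h5><a href="/q/1"><img alt="First #A" src="1.jpg"/></a></div>
  <div class="card"><h5>FRIENDSHIP</h5><a href="/q/2"><img alt="Second #B" src="2.jpg"/></a></div>
</div>
"""

# 'html.parser' ships with Python, so this runs without html5lib installed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches.
container = soup.find("div", attrs={"id": "all_quotes"})

# find_all() returns every match inside the container.
cards = container.find_all("div", class_="card")

themes = [card.h5.text for card in cards]
links = [card.a["href"] for card in cards]
print(themes)
print(links)

# The sample has no <table>, so this search finds nothing.
no_table = soup.find("table")
print(no_table)
```

Because `find()` can return `None`, checking its result before chaining further calls avoids an `AttributeError` when a page's layout changes.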

In [13]:
print(tables.prettify())

<div class="row" id="all_quotes">
 <div class="col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top">
  <a href="/inspirational-quotes/7826-the-world-needs-dreamers-and-the-world-needs">
   <img alt="The world needs dreamers and the world needs doers. But above all what the world needs most are dreamers that do. #&lt;Author:0x0000557555ead380&gt;" class="margin-10px-bottom shadow" height="310" src="https://assets.passiton.com/quotes/quote_artwork/7826/medium/20210316_tuesday_quote.jpg?1615507606" width="310"/>
  </a>
  <h5 class="value_on_red">
   <a href="/inspirational-quotes/7826-the-world-needs-dreamers-and-the-world-needs">
    INNOVATION
   </a>
  </h5>
 </div>
 <div class="col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top">
  <a href="/inspirational-quotes/7824-what-good-is-an-idea-if-it-remains-an-idea">
   <img alt="What good is an idea if it remains an idea? Try. Experiment. Iterate. Fail. Try again. Change the world. #&lt;Author:0x00005575560d13a0&

In [11]:
import pandas as pd

file = pd.read_csv('inspirational_quotes.csv',encoding='cp1252')

# if there is a UTF-8 codec error, use 'cp1252' as the encoding

file

Unnamed: 0,theme,url,img,lines,author
0,INNOVATION,/inspirational-quotes/7826-the-world-needs-dre...,https://assets.passiton.com/quotes/quote_artwo...,The world needs dreamers and the world needs d...,<Author:0x0000557555ead380>
1,INNOVATION,/inspirational-quotes/7824-what-good-is-an-ide...,https://assets.passiton.com/quotes/quote_artwo...,What good is an idea if it remains an idea? Tr...,<Author:0x00005575560d13a0>
2,DETERMINATION,/inspirational-quotes/7031-nobody-grows-old-by...,https://assets.passiton.com/quotes/quote_artwo...,Nobody grows old by merely living a number of ...,<Author:0x00005575577de548>
3,DETERMINATION,/inspirational-quotes/3926-fall-seven-times-st...,https://assets.passiton.com/quotes/quote_artwo...,"Fall seven times, stand up eight.",<Author:0x00005575578a76f0>
4,DETERMINATION,/inspirational-quotes/8195-some-succeed-becaus...,https://assets.passiton.com/quotes/quote_artwo...,"Some succeed because they are destined to, but...",<Author:0x00005575579359a0>
5,DETERMINATION,/inspirational-quotes/6714-if-you-have-made-mi...,https://assets.passiton.com/quotes/quote_artwo...,"If you have made mistakes, even serious ones, ...",<Author:0x000055755799fc88>
6,DETERMINATION,/inspirational-quotes/4218-there-is-no-chance-...,https://assets.passiton.com/quotes/quote_artwo...,"There is no chance, no destiny, no fate, that ...",<Author:0x0000557557aa83a0>
7,FRIENDSHIP,/inspirational-quotes/8194-friendship-is-a-str...,https://assets.passiton.com/quotes/quote_artwo...,Friendship is a strong and habitual inclinatio...,<Author:0x0000557557c8bc80>
8,FRIENDSHIP,/inspirational-quotes/7438-the-most-beautiful-...,https://assets.passiton.com/quotes/quote_artwo...,The most beautiful discovery true friends make...,<Author:0x0000557557d7beb0>
9,FRIENDSHIP,/inspirational-quotes/8193-friends-are-the-fam...,https://assets.passiton.com/quotes/quote_artwo...,Friends are the family you choose.,<Author:0x0000557557fa6140>
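
The `cp1252` workaround suggests the CSV was written with the platform's default encoding rather than UTF-8. That is an assumption, since the notebook does not state which platform it ran on. One way to sidestep the issue is to name the encoding explicitly on both the write and the read, sketched here with the standard library's `csv` module, a temporary file, and a made-up row.

```python
import csv
import os
import tempfile

FIELDS = ["theme", "lines", "author"]
rows = [
    # A made-up row with non-ASCII characters (curly apostrophe, accent).
    {"theme": "HOPE", "lines": "Don\u2019t stop: caf\u00e9 wisdom", "author": "Unknown"},
]

# Write into a fresh temporary directory so no real file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "quotes_demo.csv")

# Write with an explicit encoding instead of the platform default.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Read back with the same encoding.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded == rows)
```

A file written this way can be read with `pd.read_csv(path, encoding='utf-8')`, so the `cp1252` fallback should not be needed.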
