# <center>WEB SCRAPING</center>
---

# BeautifulSoup 

## Workflow
* step 1: import the libraries -> `BeautifulSoup`, `requests`
* step 2: fetch the page -> `requests.get(url)`
* step 3: get the page content -> `requests.get(url).text`
* step 4: create the soup -> `BeautifulSoup(content, "lxml")`
* step 5: locate the element in the soup -> `soup.find(...)`

1. **Import the libraries**

In [1]:
# import libraries
from bs4 import BeautifulSoup
import requests

2. **Send request to the website**

In [14]:
# website
url = "https://subslikescript.com/movie/Titanic-120338"

# send the request; the response is returned
result = requests.get(url)

# store the response content
content = result.text

In [15]:
type(result),type(content)

(requests.models.Response, str)

In [19]:
result, content[:300]

(<Response [200]>,
 '<!doctype html>\n<html lang="en" dir="ltr">\n<head>\n\t<!-- Global site tag (gtag.js) - Google Analytics -->\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-120598793-1"></script>\n<script>\n  window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag')
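
`<Response [200]>` means the request succeeded. For other status codes the notebook would still try to parse whatever came back, so it can help to fail early. A minimal sketch, assuming a hypothetical helper named `fetch_html` and a 10-second timeout (neither is part of the original notebook):

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: return the page HTML or raise on failure."""
    # timeout stops the call from hanging forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raises requests.exceptions.HTTPError for 4xx / 5xx responses
    response.raise_for_status()
    return response.text


# usage (needs network access):
# content = fetch_html("https://subslikescript.com/movie/Titanic-120338")
```

With this helper, `content` is only assigned when the server answered successfully.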

3. **Create soup with parser: lxml**

In [20]:
# create soup
soup = BeautifulSoup(content,"lxml")

In [23]:
# make the HTML code look prettier
soup.prettify()[:300]

'<!DOCTYPE html>\n<html dir="ltr" lang="en">\n <head>\n  <!-- Global site tag (gtag.js) - Google Analytics -->\n  <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-120598793-1">\n  </script>\n  <script>\n   window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(argume'

---

## Getting HTML of a Website

In [26]:
# import library
from bs4 import BeautifulSoup
import requests

# url of website
url = "https://subslikescript.com/movie/Titanic-120338"

# send request and store response in result
result = requests.get(url)

# html content
content = result.text

# create soup with parser: lxml
soup = BeautifulSoup(content, "lxml")

# make the HTML code look prettier
soup.prettify()[:500]

'<!DOCTYPE html>\n<html dir="ltr" lang="en">\n <head>\n  <!-- Global site tag (gtag.js) - Google Analytics -->\n  <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-120598793-1">\n  </script>\n  <script>\n   window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag(\'js\', new Date());\n\n  gtag(\'config\', \'UA-120598793-1\');\n  </script>\n  <meta charset="utf-8"/>\n  <title>\n   Titanic (1997) Movie Script  | Subs like Script\n  </title>\n  <meta content="Rea'

---

## Single Page Web Scraping

### Find elements or access website content with:
1. **ID**
2. **Class Name**
3. **Tag Name, CSS Selectors**
4. **XPath (XML Path)**
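
The first three lookups can be tried on a small inline HTML string before touching a real page. This is only a sketch: the HTML snippet, the `id="content"` attribute and the variable names are made up for illustration, and it uses the built-in `"html.parser"` instead of `"lxml"` so that it runs without the extra dependency. XPath is left out because BeautifulSoup itself does not provide XPath queries.

```python
from bs4 import BeautifulSoup

# made-up HTML, loosely shaped like the page used in this notebook
html = """
<html><body>
<article class="main-article" id="content">
<h1>Sample Movie (2000) - full transcript</h1>
<div class="full-script">Hello there.</div>
<a href="/movie/a">A</a><a href="/movie/b">B</a>
</article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. by ID
by_id = soup.find(id="content")

# 2. by class name (class_ because class is a Python keyword)
by_class = soup.find("article", class_="main-article")

# 3a. by tag name: find_all returns every match as a list
links = soup.find_all("a")

# 3b. by CSS selector: select_one returns the first match
heading = soup.select_one("article.main-article > h1")

print(by_id.name, by_class.name)
print([a["href"] for a in links])
print(heading.get_text())
```

`find` returns the first match (or `None`), while `find_all` returns a list, so it is the one to use when several elements are expected.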

In [37]:
# import library
from bs4 import BeautifulSoup
import requests

# url of website
url = "https://subslikescript.com/movie/Titanic-120338"

# send request and store response in result
result = requests.get(url)

# html content
content = result.text

# create soup with parser: lxml
soup = BeautifulSoup(content, "lxml")

# make the HTML code look prettier
soup.prettify()[:500]

# access website content with the help of HTML tags and attributes
# <article class="main-article"> ... </article> is the outermost structure inside the body tag
outer_box = soup.find("article", class_="main-article")
type(outer_box)

bs4.element.Tag

**Get the Title**

In [47]:
print(type(soup.find("h1")))

<class 'bs4.element.Tag'>


In [48]:
print(soup.find("h1"))

<h1>Titanic (1997) - full transcript</h1>


In [58]:
# get the title
title = soup.find("h1").get_text()
title

'Titanic (1997) - full transcript'

In [59]:
title = outer_box.find("h1").get_text()
title

'Titanic (1997) - full transcript'

In [67]:
# get the transcript: strip each text piece and join the pieces with a space
transcript = outer_box.find("div", class_="full-script").get_text(strip=True, separator=" ")
print(transcript[:500])

13 meters. You should see it. Okay, take her up and over the bow rail. Mir 2, we're going over the bow.
Stay with us. Okay, quiet. We're rolling. Seeing her coming out of the
darkness like a ghost ship... ...still gets me every time. To see the sad ruin of the
great ship sitting here... ...where she landed at
2:30 in the morning Of April 15, 1912 ...after her long fall... ...from the world above. You are so full of shit, boss. Dive 6. Here we are again
on the deck of Titanic. 2% miles down
3,821
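
To see what `strip=True` and `separator=" "` change, here is a small sketch on a made-up snippet (the HTML is an assumption, and `"html.parser"` is used so it runs without `lxml`):

```python
from bs4 import BeautifulSoup

# made-up snippet: text pieces separated by <br/> tags
html = '<div class="full-script">Line one.<br/>Line two.<br/> Line three. </div>'
box = BeautifulSoup(html, "html.parser").find("div", class_="full-script")

# default: pieces are glued together exactly as they appear
raw = box.get_text()

# strip=True trims each piece, separator=" " puts a space between them
joined = box.get_text(strip=True, separator=" ")

print(repr(raw))
print(repr(joined))  # 'Line one. Line two. Line three.'
```

Without the separator, sentences that were split by tags would run into each other in the exported text.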


### Export the scraped data

In [68]:
!ls

[1m[36mBEAUTIFULSOUP[m[m      README.md          [1m[36mhtml[m[m               [1m[36mscrapy[m[m
[1m[36mPython[m[m             WEB SCRAPING.ipynb [1m[36mimages[m[m


In [70]:
# write the scraped transcript to a .txt file named after the title

with open(f"{title}.txt", "w", encoding="utf-8") as file:
    file.write(transcript)
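
The title works as a file name here, but other titles may contain characters such as `/` or `:` that are not allowed in file names. A minimal sketch of one way to handle that, assuming a hypothetical helper `safe_filename` (not part of the original notebook):

```python
import re


def safe_filename(title: str, extension: str = ".txt") -> str:
    """Hypothetical helper: replace characters that are unsafe in file names."""
    # replace \ / : * ? " < > | with an underscore
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return f"{cleaned}{extension}"


print(safe_filename("Titanic (1997) - full transcript"))
# Titanic (1997) - full transcript.txt
print(safe_filename("Face/Off (1997)"))
# Face_Off (1997).txt
```

The result can be passed to `open(...)` in place of `f"{title}.txt"`.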

In [71]:
!ls

[1m[36mBEAUTIFULSOUP[m[m                        WEB SCRAPING.ipynb
[1m[36mPython[m[m                               [1m[36mhtml[m[m
README.md                            [1m[36mimages[m[m
Titanic (1997) - full transcript.txt [1m[36mscrapy[m[m


In [74]:
!head "Titanic (1997) - full transcript.txt"

13 meters. You should see it. Okay, take her up and over the bow rail. Mir 2, we're going over the bow.
Stay with us. Okay, quiet. We're rolling. Seeing her coming out of the
darkness like a ghost ship... ...still gets me every time. To see the sad ruin of the
great ship sitting here... ...where she landed at
2:30 in the morning Of April 15, 1912 ...after her long fall... ...from the world above. You are so full of shit, boss. Dive 6. Here we are again
on the deck of Titanic. 2% miles down
3,821 meters- The pressure outside is 3% tons per square inch These windows are
9 inches thick. If they go, it's sayonara in two microseconds. All right, enough of that bullshit. Just put her down on the roof of
the officers' quarters like yesterday. Mir 2, we're landing right over the
Grand Staircase. You guys set to launch? Yeah, Brock. Launching Dunkin now.


## Multiple Page Web Scraping