# L9 Web Scraping

`Selenium`: a browser automation library, mainly used for web testing. <br>
`BeautifulSoup`: a Python package for parsing HTML and XML documents. It builds a parse tree that makes it easy to extract data from web pages.

To extract data using web scraping with Python, we follow these steps:
1. Find the URL that you want to scrape
2. Inspect the Page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format

In [33]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [35]:
# Set up the website we will scrape, including the tag page we are interested in
url_template = "https://www.goodreads.com/quotes/tag/{}"  # URL template for Goodreads; {} is a placeholder for the tag
url = url_template.format("inspirational-quotes")  # format() returns a new string, so keep the result in url
url  # The final URL used to scrape quotes tagged "inspirational-quotes"


'https://www.goodreads.com/quotes/tag/inspirational-quotes'
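
The same template can produce URLs for several tags. A minimal sketch, assuming the tag pages all follow the pattern above; `build_tag_url` is a hypothetical helper, and the tags other than `inspirational-quotes` are only illustrative:

```python
# URL template from the cell above; {} is replaced by the tag name.
URL_TEMPLATE = "https://www.goodreads.com/quotes/tag/{}"


def build_tag_url(tag):
    """Return the quotes URL for one tag (hypothetical helper)."""
    return URL_TEMPLATE.format(tag)


# str.format returns a new string; the template itself is unchanged.
urls = [build_tag_url(tag) for tag in ["inspirational-quotes", "love", "humor"]]
for u in urls:
    print(u)
```

Note that `format` does not modify the template in place, so its result has to be stored before it is passed to `requests.get`.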

`requests.get(url)`: Sends an HTTP GET request to the url (which is the Goodreads quotes page). <br>
`res`: Stores the server's response, which includes the HTML content of the page. <br>
`res.text`: Extracts the HTML source code from the response. <br>
`BeautifulSoup(res.text, "html.parser")`: Parses the HTML content into a BeautifulSoup object, which makes it easier to navigate and extract specific elements. Naming the parser explicitly avoids the warning BeautifulSoup emits when it has to guess one.

In [37]:
# Create a soup containing the full HTML content of the page
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
soup


In [103]:
print(soup.prettify())


`soup.find_all("div", attrs={"class": "quoteText"})` <br>
Searches the entire HTML document for all `<div>` elements that have class="quoteText". <br>
Returns a list of matching elements.
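
To see `find_all` in isolation, here is a small offline sketch. The HTML snippet is made up and only mimics the class names used on the quotes page; `soup_demo`, `matches`, `same`, and `authors` are illustrative names:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet that mimics the structure of the quotes page.
html = """
<div class="quoteText">First quote <span class="authorOrTitle">Author A</span></div>
<div class="other">Not a quote</div>
<div class="quoteText">Second quote <span class="authorOrTitle">Author B</span></div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet with every matching <div>.
matches = soup_demo.find_all("div", attrs={"class": "quoteText"})
print(len(matches))

# class_ is an equivalent shorthand for filtering by CSS class.
same = soup_demo.find_all("div", class_="quoteText")
print(len(same))

# Each match is a Tag, so it can be searched again.
authors = [m.find("span", class_="authorOrTitle").get_text(strip=True) for m in matches]
print(authors)
```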

In [39]:
# Send a browser-like User-Agent header in case the website blocks the default requests client
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")


In [41]:
# Use find_all to collect every quote <div> and keep the result
quote_divs = soup.find_all("div", attrs={"class": "quoteText"})
print(quote_divs)

[<div class="quoteText">
      “Be yourself; everyone else is already taken.”
    <br/>
  ―
  <span class="authorOrTitle">
    Oscar Wilde
  </span>
</div>, <div class="quoteText">
      “I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”
    <br/>
  ―
  <span class="authorOrTitle">
    Marilyn Monroe
  </span>
</div>, <div class="quoteText">
      “So many books, so little time.”
    <br/>
  ―
  <span class="authorOrTitle">
    Frank Zappa
  </span>
</div>, <div class="quoteText">
      “Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”
    <br/>
  ―
  <span class="authorOrTitle">
    Albert Einstein
  </span>
</div>, <div class="quoteText">
      “A room without books is like a body without a soul.”
    <br/>
  ―
  <span class="authorOrTitle">
    Marcus Tullius Cicero
  </span>
</div>, 

In [45]:
for quote in quote_divs:
    print(quote.get_text(strip=True, separator=" "))

“Be yourself; everyone else is already taken.” ― Oscar Wilde
“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.” ― Marilyn Monroe
“So many books, so little time.” ― Frank Zappa
“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.” ― Albert Einstein
“A room without books is like a body without a soul.” ― Marcus Tullius Cicero
“Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind.” ― Bernard M. Baruch
“You've gotta dance like there's nobody watching, Love like you'll never be hurt, Sing like there's nobody listening, And live like it's heaven on earth.” ― William W. Purkey
“You know you're in love when you can't fall asleep because reality is finally better than your dreams.” ― Dr. Seuss
“You only live once, but if you do it right, once is e

In [53]:
# Extract quotes and authors into a list of dictionaries
quote_list = []
for quote in quote_divs:
    text = quote.get_text(strip=True, separator=" ")
    
    # Splitting text into quote and author
    if "―" in text:
        quote_text, author = text.rsplit("―", 1)
    elif "—" in text:
        quote_text, author = text.rsplit("—", 1)
    else:
        quote_text = text.strip("“” ").strip()
        author = "Unknown"

    quote_text = quote_text.strip("“” ").strip()
    author = author.strip()

    quote_list.append({"quote": quote_text, "author": author})

print(quote_list)

[{'quote': 'Be yourself; everyone else is already taken.', 'author': 'Oscar Wilde'}, {'quote': "I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.", 'author': 'Marilyn Monroe'}, {'quote': 'So many books, so little time.', 'author': 'Frank Zappa'}, {'quote': "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.", 'author': 'Albert Einstein'}, {'quote': 'A room without books is like a body without a soul.', 'author': 'Marcus Tullius Cicero'}, {'quote': "Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind.", 'author': 'Bernard M. Baruch'}, {'quote': "You've gotta dance like there's nobody watching, Love like you'll never be hurt, Sing like there's nobody listening, And live like it's heaven on earth.", 'author': 'William W. Purkey'}, {'quote':
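
The splitting logic can also be wrapped in a small function so it is easy to try on single strings. This is a minimal sketch, not the notebook's own code: `split_quote` is a hypothetical name, and the separators and curly quotation marks are written as Unicode escapes to make them easy to tell apart.

```python
# Separators seen between quote and author: horizontal bar, then em dash.
SEPARATORS = ("\u2015", "\u2014")
# Curly quotation marks and spaces to trim from the quote text.
QUOTE_CHARS = "\u201c\u201d "


def split_quote(text):
    """Split 'quote SEP author' into a dict (hypothetical helper)."""
    for sep in SEPARATORS:
        if sep in text:
            # rsplit with maxsplit=1 splits at the LAST separator only,
            # so a dash inside the quote itself stays in the quote.
            quote_text, author = text.rsplit(sep, 1)
            break
    else:
        # No separator found: keep the whole text and mark the author.
        quote_text, author = text, "Unknown"
    return {"quote": quote_text.strip(QUOTE_CHARS), "author": author.strip()}


print(split_quote("\u201cSo many books, so little time.\u201d \u2015 Frank Zappa"))
print(split_quote("No author here"))
```

Using `rsplit` rather than `split` matters when the quote itself contains a dash: only the last separator is treated as the boundary before the author.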

In [57]:
# Convert to Pandas DataFrame
df = pd.DataFrame(quote_list)

# Display the first 5 rows of the DataFrame
print(df.head())

# Save to CSV
df.to_csv("quotes.csv", index=False, encoding="utf-8")

print("Quotes saved to quotes.csv")


                                               quote                 author
0       Be yourself; everyone else is already taken.            Oscar Wilde
1  I'm selfish, impatient and a little insecure. ...         Marilyn Monroe
2                     So many books, so little time.            Frank Zappa
3  Two things are infinite: the universe and huma...        Albert Einstein
4  A room without books is like a body without a ...  Marcus Tullius Cicero
Quotes saved to quotes.csv


### wrong code

In [15]:
data = []
for quote in quote_divs:
    text = quote.get_text(strip=True, separator=" ")
    data.append([text])

In [19]:
import csv

csv_filename = "quotes.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Quote"])
    writer.writerows(data)

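`quoteText_div = quote_div.find_next("div", attrs={"class": "quoteText"})` <br>
Finds the first `<div>` with the class `"quoteText"` that comes after the start of `quote_div` in document order; it never returns `quote_div` itself. <br>
Because `quote_div` is already a `quoteText` div, this returns the following quote, and it returns `None` when `quote_div` is the last quote on the page.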
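
The behaviour of `find_next` is easy to check offline. A small sketch on made-up HTML with two quote divs; `demo`, `first`, `second`, `after_first`, and `after_second` are illustrative names:

```python
from bs4 import BeautifulSoup

# Made-up HTML with two quote divs, mimicking the page structure.
html = """
<div class="quoteText">First quote</div>
<div class="quoteText">Second quote</div>
"""
demo = BeautifulSoup(html, "html.parser")
first, second = demo.find_all("div", attrs={"class": "quoteText"})

# find_next skips the element itself and returns the next match.
after_first = first.find_next("div", attrs={"class": "quoteText"})
print(after_first.get_text(strip=True))

# There is no quoteText div after the last one, so the result is None.
after_second = second.find_next("div", attrs={"class": "quoteText"})
print(after_second)
```

Calling `.text` on that `None` result is what raises `AttributeError: 'NoneType' object has no attribute 'text'`.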

In [11]:
quote_div = quote_divs[0]
quote_div

<div class="quoteText">
      “Be yourself; everyone else is already taken.”
    <br/>
  ―
  <span class="authorOrTitle">
    Oscar Wilde
  </span>
</div>

In [13]:
# Find the next quoteText div after quote_div
quoteText_div = quote_div.find_next("div", attrs={"class": "quoteText"})
print(quoteText_div)

<div class="quoteText">
      “I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”
    <br/>
  ―
  <span class="authorOrTitle">
    Marilyn Monroe
  </span>
</div>


`quoteText_div.text`: Extracts the text content from the quoteText_div HTML element. <br>
`.strip()`: Removes leading and trailing whitespace (spaces, newlines, tabs). <br>
Stores the cleaned text in the variable `striped`. Whitespace inside the text, including the newlines between the quote and the author, is kept.
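
A short sketch of what `.strip()` does and does not remove. The string `raw` is made up and only shaped like the text of one quote div:

```python
# A made-up string shaped like the text of one quote div.
raw = "\n      \u201cSo many books, so little time.\u201d\n    \n  \u2015\n  \n    Frank Zappa\n  \n"

cleaned = raw.strip()

# Leading and trailing whitespace is gone...
print(cleaned.startswith("\u201c"))
print(cleaned.endswith("Frank Zappa"))

# ...but the newlines between the quote and the author remain.
print(cleaned.count("\n"))
```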

In [15]:
# Use strip() to remove the surrounding whitespace
striped = quoteText_div.text.strip()
print(striped)

“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”
    
  ―
  
    Marilyn Monroe


`striped.split("\n")` <br>
Splits the string wherever there is a newline `(\n)`. <br>
Returns a list of substrings.
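
Splitting on newlines also keeps the whitespace-only pieces between them. A minimal sketch on a made-up string; the filtering step is an optional extra, not part of the notebook's code, and `demo_text`, `pieces`, and `parts` are illustrative names:

```python
# A made-up stripped text shaped like one quote followed by its author.
demo_text = "\u201cSo many books, so little time.\u201d\n    \n  \u2015\n  \n    Frank Zappa"

# split("\n") keeps whitespace-only pieces between the newlines.
pieces = demo_text.split("\n")
print(pieces)

# Stripping each piece and dropping the empty ones leaves quote, dash, author.
parts = [p.strip() for p in pieces if p.strip()]
print(parts)
```

With the empty pieces removed, the author is always the last element, so the code does not depend on a fixed index such as `4`.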

In [17]:
# Split the quote text into lines
striped_text = striped.split("\n") 
print(striped_text)

["“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”", '    ', '  ―', '  ', '    Marilyn Monroe']


`striped_text[0]`: Selects the first element from the striped_text list, which contains the quote. <br>
`[1:-1]`: Uses string slicing to remove the first and last character.
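
Slicing with `[1:-1]` removes one character from each end no matter what it is. A hedged comparison with `str.strip`, which removes only the characters it is given; the sample strings are made up:

```python
first_piece = "\u201cSo many books, so little time.\u201d"

# Slicing drops exactly one character from each end, whatever it is.
by_slice = first_piece[1:-1]

# strip with an argument removes only the listed characters, so a
# string without curly quotes is left untouched.
by_strip = first_piece.strip("\u201c\u201d")

print(by_slice == by_strip)

no_quotes = "So many books"
print(no_quotes[1:-1])
print(no_quotes.strip("\u201c\u201d"))
```

Slicing is fine when every quote is known to be wrapped in quotation marks; `strip` is the safer choice when some may not be.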

In [19]:
# Save the quote and the author in separate variables
quote = striped_text[0][1:-1]  
quote

"I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best."

In [21]:
author = striped_text[4].strip() 
author

'Marilyn Monroe'

In [23]:
def getAllQuotes(url): 
    quotes = []
    for quote_div in quote_divs:
        quoteText_div = quote_div.find_next("div", attrs={"class": "quoteText"})
        striped = quoteText_div.text.strip()
        striped_text= striped.split("\n")
        quote = striped_text[0][1:-1]
        author = striped_text[-1].strip()
        quote_item = {
            "quote" : quote,
            "author" : author
        }

In [25]:
quote_data = getAllQuotes(url)
print(quote_data)

AttributeError: 'NoneType' object has no attribute 'text'
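
The `AttributeError` comes from `find_next`: for the last element of `quote_divs` there is no later `quoteText` div, so `find_next` returns `None` and `.text` fails. The function also ignores its `url` argument and never appends to or returns `quotes`. One possible corrected sketch follows. It assumes each quote div looks like the ones shown earlier, with the author on the last non-empty line. `get_all_quotes` and the sample HTML are made up for illustration, and the function takes an HTML string rather than a URL so that it runs offline:

```python
from bs4 import BeautifulSoup


def get_all_quotes(html):
    """Return a list of {"quote", "author"} dicts parsed from html."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    # Work on each quote div directly; no find_next is needed.
    for quote_div in soup.find_all("div", attrs={"class": "quoteText"}):
        lines = [ln.strip() for ln in quote_div.text.strip().split("\n")]
        lines = [ln for ln in lines if ln]  # drop whitespace-only pieces
        if not lines:
            continue  # skip empty divs instead of raising IndexError
        quotes.append({
            "quote": lines[0].strip("\u201c\u201d"),  # trim curly quotes
            "author": lines[-1] if len(lines) > 1 else "Unknown",
        })
    return quotes  # the original function never returned its result


# Made-up sample that mimics two quote divs from the page.
sample = """
<div class="quoteText">
      \u201cSo many books, so little time.\u201d
    <br/>
  \u2015
  <span class="authorOrTitle">
    Frank Zappa
  </span>
</div>
<div class="quoteText">
      \u201cBe yourself; everyone else is already taken.\u201d
    <br/>
  \u2015
  <span class="authorOrTitle">
    Oscar Wilde
  </span>
</div>
"""
print(get_all_quotes(sample))
```

To use it on the live page, pass `res.text` from the earlier request, for example `get_all_quotes(res.text)`.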

In [132]:
import pandas as pd
df = pd.DataFrame(quote_data)
df.to_csv("scrap.csv", index=False)  # index=False leaves out the row index column