# Webscraping using BeautifulSoup

## Imports and installations

In [None]:
import requests
import json
import pandas as pd 
import time
from bs4 import BeautifulSoup as bs # this is the library that facilitates scraping in python

Here you can find the [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for BeautifulSoup

In [None]:
#!conda install -c anaconda beautifulsoup4

An important tool for examining the site structure is Google Chrome Developer Tools. You can open them inside Chrome with "Ctrl + Shift + I" (on macOS: "Cmd + Option + I").

## Setting up the link structure

In [None]:
url = "https://www.thedailystar.net/tags/road-accident"
base_url = "https://www.thedailystar.net"

## Disclaimer!

Webscraping collects data from websites in an automated fashion. Each request puts additional load on the server, and an aggressive scraper can overload a server and bring it down. Many websites do not want to be scraped. A site's rules for automated access are published in its *robots.txt* file.  
Be careful when scraping social media sites, because scraping their content usually violates their user agreement. You risk being banned from the site, and you can get into serious legal trouble.

Let's look at the rules for scraping at our desired website:   [https://www.thedailystar.net/robots.txt](https://www.thedailystar.net/robots.txt)
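
The rules in a robots.txt file can also be checked programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules and the `example.com` URLs are made up for illustration and are not the actual rules of The Daily Star.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; a real robots.txt may look different.
example_rules = """
User-agent: *
Disallow: /admin/
Disallow: /search/
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)  # parse rules from a list of lines, no network access needed

# can_fetch() answers: may this user agent request this URL?
allowed = parser.can_fetch("*", "https://www.example.com/tags/road-accident")
blocked = parser.can_fetch("*", "https://www.example.com/admin/settings")
print(f"tag page allowed: {allowed}, admin page allowed: {blocked}")
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.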

## Trying to request the page
If we get a <Response [200]>, the request succeeded and we are good to go. Any other response means the page could not be retrieved.

In [None]:
doc = requests.get(url)
doc

In [None]:
type(doc)

In [None]:
if doc.status_code == 200:
    # create a soup object that contains the navigable HTML representation of the page
    soup = bs(doc.content, 'html.parser')
    print(f"Retrieved url: {url}")
else:
    print(f"{url} cannot be reached.")

In [None]:
# putting it all together into a function
def make_soup(url):
    doc = requests.get(url)
    if doc.status_code == 200:
        # create a soup object that contains the navigable HTML representation of the page
        soup = bs(doc.content, 'html.parser')
        print(f"Retrieved url: {url}")
        return soup
    # return None explicitly; otherwise "soup" would be undefined here and raise an error
    print(f"{url} cannot be reached.")
    return None

## EDA for webscraping
Explore the soup object

In [None]:
soup

In [None]:
type(soup)

Wow... that is a lot of text. Do we have to find the information with regex?  
"soup" is NOT a text object but a "navigable" object. Let us explore the different ways to navigate to the information that we are looking for.  

## Knowing HTML syntax
It is good to have a basic understanding of the html syntax and how webpages are structured.  

**Important tags in a website are:**  
h1 - heading level 1  
h2 - heading level 2  
h3 - heading level 3  
h4 - heading level 4  
p - paragraph  
div - division  
ol - ordered list  
ul - unordered list  
li - list item  
a - link    
img - image

**Important attributes:**  
id - specifies the id for a unique HTML element  
class - specifies the class of several HTML elements for attaching CSS code  
href - attribute of a link, that indicates the link's destination  
src - attribute for the source of an image  

Good resource for learning HTML: [https://www.w3schools.com/html/](https://www.w3schools.com/html/)
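
To see how these tags and attributes show up in Beautiful Soup, here is a small sketch that parses a hand-written HTML snippet. The snippet, its ids, class names and paths are invented for illustration and are not taken from any real website.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page that uses the tags and attributes listed above.
html = """
<div id="main" class="news-list">
  <h1>Latest news</h1>
  <p>First paragraph.</p>
  <ul>
    <li><a href="/article-1">Article one</a></li>
    <li><a href="/article-2">Article two</a></li>
  </ul>
  <img src="/images/photo.jpg">
</div>
"""

demo = BeautifulSoup(html, "html.parser")

div = demo.find("div")
div_id = div["id"]           # attributes are accessed like dictionary keys
div_classes = div["class"]   # class can hold several values, so it is returned as a list
hrefs = [a["href"] for a in demo.find_all("a")]
img_src = demo.find("img")["src"]

print(div_id, div_classes, hrefs, img_src)
```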

## Common tasks for Beautiful Soup:
Getting all links from a page

In [None]:
# getting all links from a page
for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
# extracting all text from a page

print(soup.get_text())

Just grabbing everything also grabs a lot of whitespace and a lot of duplicate content. We need a more specific strategy that selects only the parts that are relevant for our search.
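
The whitespace problem can be reduced at extraction time with the `separator` and `strip` arguments of `get_text()`. This is only a sketch on a made-up snippet. It does not solve the duplicate-content problem, which still requires selecting specific elements.

```python
from bs4 import BeautifulSoup

# Made-up snippet with the kind of stray whitespace found on real pages.
html = "<div>\n  <h4>  Headline  </h4>\n  <p>Some text</p>\n</div>"
demo = BeautifulSoup(html, "html.parser")

raw = demo.get_text()                             # keeps all newlines and padding
clean = demo.get_text(separator=" ", strip=True)  # strips each piece, joins with a space

print(repr(raw))
print(repr(clean))
```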

## Introducing tags, find and find_all

In [None]:
#Tags
tag = soup.li # is just getting the first occurrence of <li> tag
tag

In [None]:
tag.name

In [None]:
tag.attrs

In [None]:
tag.text

In order to find the tags inside the soup we can use soup.find() or soup.find_all().  
* soup.find() returns only the first matching object (or None if nothing matches)
* soup.find_all() returns a list of all matching objects

**If you get stuck in drilling down in the soup object, you are most likely trying to call methods on results that were returned as a list. You have to loop over the elements in the list to continue to navigate the soup object.**
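
The difference, and the pitfall described above, can be reproduced on a small made-up snippet. This is a sketch for illustration; the HTML is invented.

```python
from bs4 import BeautifulSoup

html = "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>"
demo = BeautifulSoup(html, "html.parser")

first_item = demo.find("li")     # a single tag
all_items = demo.find_all("li")  # a list-like collection of tags

# Calling a tag method on the whole result list fails ...
try:
    all_items.find("a")
    error_raised = False
except AttributeError:
    error_raised = True

# ... so loop over the results and navigate each element individually.
hrefs = [item.find("a")["href"] for item in all_items]

print(first_item.text, len(all_items), error_raised, hrefs)
```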

##  Fishing in the soup

In [None]:
soup.find('h1')

In [None]:
soup.find('div')

In [None]:
soup.find_all('div') # use len() to find out how many items you have found

# Scraping accidents from "The Daily Star" 
## Finding the relevant div inside our webpage

We want to find the interesting parts on [www.thedailystar.net/tags/road-accident](https://www.thedailystar.net/tags/road-accident)  


Use Chrome Developer Tools to narrow down the div that contains all the information that we are interested in. 

class name: "view-sub-category-news-listing"

In [None]:
# accessing divs that are specified by a class name
container = soup.find("div", attrs={"class": "view-sub-category-news-listing"})
container

In [None]:
type(container)

To get an overview of how a single element is structured, we look only at the first element and explore it further. find_all returns a list of objects, so we can access the elements by indexing.

In [None]:
item = container.find_all('li')[0]
item

In [None]:
item.div

In [None]:
container.find_all('a')[0].attrs['href']

In [None]:
container.attrs

In [None]:
container.find("h4")

In [None]:
len(container.find_all("h4"))

In [None]:
links = []
headings = []
for row in container.find_all('h4'):
    # getting the heading
    heading = row.text
    headings.append(heading)
    
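    # getting the link to the article
    link = row.find('a')
    # find() returns None if the heading contains no <a> tag, so check before using it
    if link is not None and 'href' in link.attrs:
        print(f"{heading} - {link.attrs['href']}")
        links.append(link.attrs['href'])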

In [None]:
len(container.find_all('p'))

## Pagination
How can we navigate to the next page?

### Finding text on the webpage

In [None]:
next_button = soup.find(string="SHOW MORE")  # "string" is the current name of the older "text" argument
next_button

This did not work! Take a look at the webpage in the developer tools and find out why!

In [None]:
next_button = soup.find(string="Show more")  # "string" is the current name of the older "text" argument
next_button

In [None]:
next_button.parent

In [None]:
next_button_link = soup.find(string="Show more").parent.attrs['href']
next_button_link

The fact that the next page is accessed by a page number can be used to automatically create the link for the next page! Pagination starts at page 0 for the first page.  
Let us try to go to the third page: [https://www.thedailystar.net/tags/road-accident?page=2](https://www.thedailystar.net/tags/road-accident?page=2)
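
Based on this pattern, the URLs of the listing pages can be generated instead of being read from the button. A minimal sketch, where the helper name `page_url` is our own choice:

```python
url = "https://www.thedailystar.net/tags/road-accident"

def page_url(base, page_number):
    """Build the URL of a listing page; page 0 is the first page."""
    return f"{base}?page={page_number}"

# URLs for the first three pages (page numbers 0, 1 and 2)
page_urls = [page_url(url, n) for n in range(3)]
for u in page_urls:
    print(u)
```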

## Extracting the main article page

In [None]:
links[0]

As we can see, this is only a relative (site-internal) link. To get the complete URL, we have to combine it with the base URL.

In [None]:
page_link = base_url + links[0]
page_link
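
As an alternative to plain string concatenation, the standard library's `urljoin` also handles links that are already absolute. A small sketch; the relative path below is a hypothetical placeholder, not a real article link.

```python
from urllib.parse import urljoin

base_url = "https://www.thedailystar.net"
relative_link = "/example/article-path"  # hypothetical placeholder path

# A relative link is resolved against the base URL ...
full_link = urljoin(base_url, relative_link)

# ... while a link that is already absolute is returned unchanged.
absolute_link = urljoin(base_url, "https://www.example.com/other")

print(full_link)
print(absolute_link)
```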

In [None]:
page_soup = make_soup(page_link)

In [None]:
page_soup.get_text()

In [None]:
top = page_soup.find("div", attrs={"class": "pane-top"})
top

In [None]:
top.find('div', attrs={"class": "small-text"})

In [None]:
date_string = top.find('div', attrs={"class": "small-text"}).text
date_string

In [None]:
headline = top.find('h1').text
headline

In [None]:
author = page_soup.find("div", attrs={"class": "author-name"}).span.text
author

In [None]:
article = page_soup.find('div', attrs={"class": "field-body"})
article

In [None]:
paragraphs = article.find_all('p')
paragraphs

In [None]:
subheading = paragraphs[0].text
subheading

In [None]:
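article_text = ""
for i, paragraph in enumerate(paragraphs):
    if i == 0:
        # the first paragraph is the subheading; store its text, not the tag itself
        subheading = paragraph.text
    else:
        # separate paragraphs with a newline so that sentences do not run together
        article_text += paragraph.text + "\n"
print(article_text)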

## TODO: Put it all together
1. grab all the links from the first page
2. navigate to the next page 
3. repeat steps 1 and 2 until you have gathered all the article links
4. grab all the required content from each article page and save it in an appropriate format



Keep in mind:  
- scrapers tend to fail, so wrap fragile steps in try/except blocks
- scrape slowly (like a human) or you might get blocked from the website
- do not hit the website unnecessarily: grab each page once and then extract all the content from it. Iterative coding in Jupyter notebooks really helps for scraping
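
One possible shape for steps 1 to 3 is sketched below. It is not a finished solution. The function names (`extract_links`, `collect_links`, `fake_fetch`) are our own, and the download step is passed in as a function so that the sketch can be tried without touching the website. The fake pages only imitate the class name found above.

```python
import time
from bs4 import BeautifulSoup

def extract_links(html):
    """Return the article links found in the headings of one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": "view-sub-category-news-listing"})
    if container is None:
        return []
    links = []
    for heading in container.find_all("h4"):
        anchor = heading.find("a")
        if anchor is not None and anchor.get("href"):
            links.append(anchor["href"])
    return links

def collect_links(fetch_html, start_url, max_pages=3, delay_seconds=2.0):
    """Walk through listing pages 0..max_pages-1 and gather all article links."""
    all_links = []
    for page in range(max_pages):
        page_url = f"{start_url}?page={page}"
        try:
            html = fetch_html(page_url)
        except Exception as error:  # broad on purpose: one failed page should not stop the run
            print(f"Skipping {page_url}: {error}")
            continue
        links = extract_links(html)
        if not links:
            break  # an empty page is treated as the end of the listing
        all_links.extend(links)
        time.sleep(delay_seconds)  # pause between requests to keep the load low
    return all_links

# Stand-in for the real download step, using made-up HTML.
def fake_fetch(page_url):
    if page_url.endswith("page=0"):
        return '<div class="view-sub-category-news-listing"><h4><a href="/a">A</a></h4></div>'
    return "<div></div>"

collected = collect_links(fake_fetch, "https://www.example.com/tags/x", delay_seconds=0)
print(collected)
```

For real use, `fake_fetch` could be replaced by a function that calls `requests.get(page_url, timeout=10)` and returns the response's `.text`.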

# Additional resources:

Book: [Web Scraping with Python - O'Reilly](https://www.amazon.de/Web-Scraping-Python-Collecting-Modern/dp/1491985577) - absolutely worth it!

Scraping websites that render their content with JavaScript usually requires a browser automation tool such as Selenium: https://python.gotrained.com/selenium-scraping-booking-com/ 

Scraping framework: [Scrapy](https://scrapy.org/)

# Scraping without understanding content
## Trying to scrape text in Bangla

In [None]:
bangla_soup = make_soup('https://www.prothomalo.com/topic/%E0%A6%B8%E0%A7%9C%E0%A6%95-%E0%A6%A6%E0%A7%81%E0%A6%B0%E0%A7%8D%E0%A6%98%E0%A6%9F%E0%A6%A8%E0%A6%BE')

In [None]:
stories = bangla_soup.find_all("div", attrs={"class": "bn-story-card"})
stories[0].get_text()

In [None]:
tag = stories[0].find("div", attrs={'data-testid': 'tag-related'}).find('time').text
tag

In [None]:
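for story in stories:
    related = story.find("div", attrs={'data-testid': 'tag-related'})
    # not every story card necessarily has this block, so guard against None
    time_tag = related.find('time') if related is not None else None
    print(time_tag.text if time_tag is not None else None)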