# Webscraping using BeautifulSoup

## Imports and installations

In [1]:
import requests
import json
import pandas as pd 
import time
from bs4 import BeautifulSoup as bs # this is the library that facilitates scraping in python

Here you can find the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for BeautifulSoup.

In [2]:
#!conda install -c anaconda beautifulsoup4

An important tool for examining the structure of a site is Google Chrome Developer Tools. You can open them inside Chrome with "Ctrl + Shift + I".

## Setting up the link structure

In [4]:
url = "https://www.thedailystar.net/tags/road-accident"
base_url = "https://www.thedailystar.net"

## Disclaimer!

Web scraping collects data from websites automatically. Each request puts additional load on the server, and a scraper that sends many requests in quick succession can overload a server and even bring it down. Many websites do not want to be scraped. A site's rules for automated access are published in its *robots.txt* file.  
Be careful when scraping social media sites, because scraping their content is against their user agreements. You risk being banned from the site and may get into serious legal trouble.

Let's look at the rules for scraping at our desired website:   [https://www.thedailystar.net/robots.txt](https://www.thedailystar.net/robots.txt)
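
The rules in a robots.txt file can also be checked from Python with the standard library's `urllib.robotparser`. The sketch below is only an illustration: the `sample_rules` text is made up so that the example runs offline, and the real rules of the site may look different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
# The real rules for thedailystar.net may differ.
sample_rules = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(sample_rules.splitlines())

base_url = "https://www.thedailystar.net"
allowed = parser.can_fetch("*", base_url + "/tags/road-accident")
blocked = parser.can_fetch("*", base_url + "/admin/settings")
print(allowed)  # permitted by the sample rules
print(blocked)  # disallowed by the sample rules
```

To check the live file instead, call `parser.set_url(base_url + "/robots.txt")` followed by `parser.read()` before asking `can_fetch`.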

## Trying to request the page
If we get a <Response [200]>, we are good to go. Otherwise, the request failed and the page could not be retrieved.

In [5]:
doc = requests.get(url)
doc

<Response [200]>

In [6]:
type(doc)

requests.models.Response

In [9]:
if doc.status_code == 200:
    # create a soup object that contains the navigable HTML representation of the page
    soup = bs(doc.content, 'html.parser')
    print(f"Retrieved url: {url}")
else:
    print(f"{url} cannot be reached.")

Retrieved url: https://www.thedailystar.net/tags/road-accident


In [10]:
# putting it all together into a function
def make_soup(url):
    """Request url and return a BeautifulSoup object, or None if the request fails."""
    doc = requests.get(url)
    if doc.status_code != 200:
        print(f"{url} cannot be reached.")
        return None
    # create a soup object that contains the navigable HTML representation of the page
    return bs(doc.content, 'html.parser')
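
To keep the server load mentioned in the disclaimer low, it helps to pause between requests. The following is a minimal sketch of that idea, not part of the original notebook: `fetch_politely` and its `fetch` and `sleep` parameters are hypothetical names, and the demonstration uses a stand-in fetch function so that it runs without network access.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results

# Offline demonstration: a stand-in fetch function and a recorded "sleep".
pauses = []
pages = fetch_politely(
    ["/page-a", "/page-b", "/page-c"],
    fetch=lambda url: f"content of {url}",
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(pages["/page-b"])
print(pauses)
```

In the notebook, the same helper could be called as `fetch_politely(list_of_urls, make_soup)` to obtain one soup object per URL.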

## EDA for webscraping
Explore the soup object

In [28]:
soup

<!DOCTYPE html>

<!--[if lt IE 7]><html class="lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"><![endif]-->
<!--[if IE 7]><html class="lt-ie9 lt-ie8" lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9" lang="en" dir="ltr"><![endif]-->
<!--[if gt IE 8]><!--><html dir="ltr" lang="en"><!--<![endif]-->
<head>
<!--[if IE]><![endif]-->
<link href="//fonts.gstatic.com" rel="dns-prefetch"/>
<link href="//fonts.googleapis.com" rel="dns-prefetch"/>
<link href="//fonts.googleapis.com" rel="preconnect"/>
<link crossorigin="" href="//fonts.gstatic.com" rel="preconnect"/>
<link href="//maxcdn.bootstrapcdn.com" rel="preconnect"/>
<link href="//maxcdn.bootstrapcdn.com" rel="dns-prefetch"/>
<link href="//assetsds.cdnedge.bluemix.net" rel="preconnect"/>
<link href="//assetsds.cdnedge.bluemix.net" rel="dns-prefetch"/>
<link href="//ctools" rel="preconnect"/>
<link href="//ctools" rel="dns-prefetch"/>
<link href="//cpn" rel="preconnect"/>
<link href="//cpn" rel="dns-prefetch"/>
<link href="//aja

In [29]:
type(soup)

bs4.BeautifulSoup

Wow... that is a lot of text. Do we have to find the information with regex?  
"soup" is NOT a text object but a "navigable" object. Let us explore the different ways to navigate to the information that we are looking for.  

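What "navigable" means is easiest to see on a very small document. The HTML below is made up for illustration; the attribute-style access it shows (`demo.title`, `demo.p.b`, `.parent`) works the same way on the real `soup` object.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document so the navigation is easy to follow.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="top">Headline</h1>
    <p class="teaser">First <b>bold</b> paragraph.</p>
  </body>
</html>
"""
demo = BeautifulSoup(html, "html.parser")

print(demo.title.string)     # text of the <title> tag
print(demo.h1["id"])         # attributes are accessed like dictionary keys
print(demo.p.b.text)         # tags can be chained to drill down
print(demo.p.b.parent.name)  # every tag knows its parent
```
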
## Knowing HTML syntax
It is good to have a basic understanding of HTML syntax and of how web pages are structured.  

**Important tags in a website are:**
* h1 - heading level 1
* h2 - heading level 2
* h3 - heading level 3
* h4 - heading level 4
* p - paragraph
* div - division
* ol - ordered list
* ul - unordered list
* li - list item
* a - link
* img - image

**Important attributes:**
* id - a unique identifier for a single HTML element
* class - one or more class names shared by several HTML elements, typically used to attach CSS styles
* href - attribute of a link that indicates the link's destination
* src - attribute for the source of an image

Good resource for learning HTML: [https://www.w3schools.com/html/](https://www.w3schools.com/html/)

## Common tasks for Beautiful Soup:
Getting all links from a page

In [11]:
# getting all links from a page
for link in soup.find_all('a'):
    print(link.get('href'))

#main-content
/
/newspaper
/business
/opinion
/sports
/arts-entertainment
/lifestyle
/toggle
/showbiz
/shout
/satireday
/star-youth
http://epaper.thedailystar.net
/sections
/multimedia
/star-live
/star-weekend
/world
/spaces
/shift
/bytes
/next-step
/in-focus
/literature
/book-reviews
/health
/science
/law-our-rights
/wide-angle
/environment
/travel
/city
/country
/supplements
/round-tables
/google/search
http://www.thedailystar.net/bangla/
/bangladesh/news/3-killed-mymensingh-road-accident-2085249
/bangladesh/news/3-killed-mymensingh-road-accident-2085249
/bangladesh/news/3-killed-mymensingh-road-accident-2085249
/man-dies-after-truck-crashes-motorcycle-patuakhali-2057033
/man-dies-after-truck-crashes-motorcycle-patuakhali-2057033
/man-dies-after-truck-crashes-motorcycle-patuakhali-2057033
/3-motorcyclists-killed-in-natore-road-accident-2006733
/3-motorcyclists-killed-in-natore-road-accident-2006733
/3-motorcyclists-killed-in-natore-road-accident-2006733
/4-bangladeshi-expats-killed-1
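
The output contains relative paths, in-page anchors and repeated entries. A possible clean-up, sketched here with a handful of hrefs copied from the output above, is to turn every path into an absolute URL with `urllib.parse.urljoin` and to drop the repeats while keeping the original order. The variable names `hrefs`, `absolute` and `unique_links` are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://www.thedailystar.net"

# A few hrefs copied from the output above, including repeats.
hrefs = [
    "#main-content",
    "/newspaper",
    "http://epaper.thedailystar.net",
    "/bangladesh/news/3-killed-mymensingh-road-accident-2085249",
    "/bangladesh/news/3-killed-mymensingh-road-accident-2085249",
    None,  # link.get('href') returns None for <a> tags without an href
]

absolute = [
    urljoin(base_url, href)
    for href in hrefs
    if href and not href.startswith("#")  # skip missing hrefs and in-page anchors
]
unique_links = list(dict.fromkeys(absolute))  # drops repeats, keeps first-seen order
print(unique_links)
```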

In [12]:
# extracting all text from a page

print(soup.get_text())






















































road accident | The Daily Star





















Skip to main content








road accident

Your right to know
 







 







NewspaperBusinessOpinionSportsA & ELifestyleToggleShowbizShoutSatiredayStar YouthepaperAll SectionsMultimediaStar LiveStar WeekendWorldSpacesShiftBytesNext StepIn FocusLiteratureBook ReviewsHealthScienceLaw & Our RightsWide AngleEnvironmentTravelCityCountrySupplementsRound Tablesবাংলা
 














 
  
 3 killed in Mymensingh road accident 
        Three people were killed and another was injured as a truck hit a CNG-run auto-rickshaw in Tarakanda upazila of Mymensingh this morning.  

  
 Man dies after truck crashes into motorcycle in Patuakhali 
        A man was killed in a road accident this morning after a truck hit his motorbike from the opposite direction on the Patuakhali-Kuakata road at Eusufpur area.  

  
 3 motorcyclists killed in Natore road accident 
        Three men riding a motorcyc

So grabbing everything also grabs a lot of whitespace and a lot of duplicate content. We need a more specific strategy that selects only the parts that are relevant for our search.
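
The whitespace problem on its own can be reduced with the `separator` and `strip` arguments of `get_text()`. The snippet below is a small sketch on made-up markup that imitates one listing entry; it does not solve the duplicate-content problem, which still needs more specific selection.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates one entry of the listing.
html = """
<ul>
  <li>
    <h4> <a href="/a">First headline</a> </h4>
    <p>   Teaser text.  </p>
  </li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

raw = demo.get_text()                             # keeps every newline and space
clean = demo.get_text(separator=" ", strip=True)  # strips each piece, joins with one space
print(repr(raw))
print(repr(clean))
```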

## Introducing tags, find and find_all

In [33]:
# Tags
tag = soup.li # accessing a tag by name returns only the first occurrence of <li>
tag

<li class="d-hide tm-show mobile-search"><form action="/google/search"><input class="m-search-submit" type="submit" value=""/><input class="m-search-text" name="search" placeholder="type your keyword"/></form></li>

In [34]:
tag.name

'li'

In [35]:
tag.attrs

{'class': ['d-hide', 'tm-show', 'mobile-search']}

To find tags inside the soup, we can use soup.find() or soup.find_all().  
* soup.find() returns only the first matching object
* soup.find_all() returns a list of all matching objects

**If you get stuck in drilling down in the soup object, you are most likely trying to call methods on results that were returned as a list. You have to loop over the elements in the list to continue to navigate the soup object.**
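
The difference between the two return types can be sketched on a small, made-up list. `first_item` is a single tag that can be navigated further, while `all_items` has to be looped over.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/one">One</a></li>
  <li><a href="/two">Two</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

first_item = demo.find("li")     # a single Tag (or None if nothing matches)
all_items = demo.find_all("li")  # a list-like ResultSet of Tags
missing = demo.find("table")     # no <table> in this document

print(first_item.a["href"])      # works: first_item is a Tag

# all_items.a would fail, because a list of tags has no single <a> child.
# Loop over the list and navigate each element instead.
hrefs = [item.a["href"] for item in all_items]
print(hrefs)
```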

##  Fishing in the soup

In [36]:
soup.find_all('h1')

[<h1>road accident</h1>]

In [39]:
soup.find('div')

<div class="nocontent" id="skip-link">
<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>
</div>

In [15]:
soup.find_all('div') # use len() to find out how many items you have found

[<div class="nocontent" id="skip-link">
 <a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>
 </div>,
 <div id="___tdspushdiv"></div>,
 <div id="fb-root"></div>,
 <div class="header-top">
 <div class="container">
 <div class="row">
 <div class="logo two-25 pull-left page-title no-border">
 <h1>road accident</h1>
 <div class="region region-header-left"><div class="region-inner clearfix"><div class="block block-block header-date-wrapper no-title odd first last block-count-1 block-region-header-left block-2" id="block-block-2"><div class="block-inner clearfix">
 <div class="block-content content"><date class="site-date"></date><div class="date-beside uppercase">Your right to know</div></div>
 </div></div></div></div> </div>
 <div class="top-add pull-right two-75 align-right">
 <div class="region region-header-right"><div class="region-inner clearfix"><div class="block block-dfp no-title odd first last block-count-2 block-region-header-right block-g

# Scraping accidents from "The Daily Star" 
## Finding the relevant div inside our webpage

We want to find the interesting parts on [www.thedailystar.net/tags/road-accident](https://www.thedailystar.net/tags/road-accident)  


Use Chrome Developer Tools to narrow down the div that contains all the information that we are interested in. 

class name: "view-sub-category-news-listing"

In [19]:
# accessing divs that are specified by a class name
container = soup.find("div", attrs={"class": "view-sub-category-news-listing"})# .get_text()
container

<div class="view view-sub-category-news-listing view-id-sub_category_news_listing view-display-id-panel_pane_2 view-dom-id-f618f619b4d0e9d0d37319540502d0f6">
<div class="view-content">
<ul class="list-border list-border-dotted"> <li class="">
<div class="content-img pull-left pad-right-big"> <a class="thumb margin-bottom-zero" href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249"><a href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249"><img alt="Mymensingh Road Accident" class="image-style-medium-1" src="https://assetsds.cdnedge.bluemix.net/sites/default/files/styles/medium_1/public/feature/images/2021/04/28/accident_0.jpg?itok=BikeTdPq" title="Mymensingh Road Accident"/></a></a> </div>
<h4 class="pad-bottom-small"> <a href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249">3 killed in Mymensingh road accident</a> </h4>
<p>        Three people were killed and another was injured as a truck hit a CNG-run auto-rickshaw in Tarakanda upazila of Mymensingh 

In [42]:
type(container)

bs4.element.Tag
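
Beautiful Soup offers more than one way to select a tag by class. As a sketch on simplified stand-in markup, the `attrs` dictionary used above, the `class_` keyword and a CSS selector passed to `select_one()` all locate the same `div`. Which one to use is mostly a matter of taste.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the listing markup shown above.
html = """
<div class="view view-sub-category-news-listing">
  <ul><li><h4><a href="/story">Story</a></h4></li></ul>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

by_attrs = demo.find("div", attrs={"class": "view-sub-category-news-listing"})
by_keyword = demo.find("div", class_="view-sub-category-news-listing")
by_selector = demo.select_one("div.view-sub-category-news-listing")

# All three approaches locate the same tag in the tree.
print(by_attrs is by_keyword is by_selector)
```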

To get an overview of how a single entry is structured, we look only at the first one and explore it further. find_all returns a list of objects, so we can access individual elements by indexing.

In [44]:
item = container.find_all('li')[0]
item

<li class="">
<div class="content-img pull-left pad-right-big"> <a class="thumb margin-bottom-zero" href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249"><a href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249"><img alt="Mymensingh Road Accident" class="image-style-medium-1" src="https://assetsds.cdnedge.bluemix.net/sites/default/files/styles/medium_1/public/feature/images/2021/04/28/accident_0.jpg?itok=BikeTdPq" title="Mymensingh Road Accident"/></a></a> </div>
<h4 class="pad-bottom-small"> <a href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249">3 killed in Mymensingh road accident</a> </h4>
<p> Three people were killed and another was injured as a truck hit a CNG-run auto-rickshaw in Tarakanda upazila of Mymensingh this morning. </p></li>

In [49]:
item.div

<div class="content-img pull-left pad-right-big"> <a class="thumb margin-bottom-zero" href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249"><a href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249"><img alt="Mymensingh Road Accident" class="image-style-medium-1" src="https://assetsds.cdnedge.bluemix.net/sites/default/files/styles/medium_1/public/feature/images/2021/04/28/accident_0.jpg?itok=BikeTdPq" title="Mymensingh Road Accident"/></a></a> </div>

In [51]:
container.find_all('a')[0].attrs#['href']

{'href': '/bangladesh/news/3-killed-mymensingh-road-accident-2085249',
 'class': ['thumb', 'margin-bottom-zero']}

In [72]:
container.attrs

{'class': ['view',
  'view-sub-category-news-listing',
  'view-id-sub_category_news_listing',
  'view-display-id-panel_pane_2',
  'view-dom-id-57c4a0103b9ee3f8414412941a6088b0']}

In [57]:
container.find("h4")

<h4 class="pad-bottom-small"> <a href="/bangladesh/news/3-killed-mymensingh-road-accident-2085249">3 killed in Mymensingh road accident</a> </h4>

In [69]:
len(container.find_all("h4"))

30

In [75]:
links = []
headings = []
for row in container.find_all('h4'):
    heading = row.text
    print(heading)
    headings.append(heading)
    link = row.find('a')
    # skip headings that contain no link or a link without href
    if link is not None and 'href' in link.attrs:
        print(link.attrs['href'])
        links.append(link.attrs['href'])

 3 killed in Mymensingh road accident 
/bangladesh/news/3-killed-mymensingh-road-accident-2085249
 Man dies after truck crashes into motorcycle in Patuakhali 
/man-dies-after-truck-crashes-motorcycle-patuakhali-2057033
 3 motorcyclists killed in Natore road accident 
/3-motorcyclists-killed-in-natore-road-accident-2006733
 4 Bangladeshi expats killed, 1 injured in Mauritius road accident  
/4-bangladeshi-expats-killed-1-injured-in-mauritius-road-accident-1989965
 Top PIB official dies in road accident in Jatrabari 
/top-pib-official-dies-in-road-accident-in-jatrabari-1965045
 6 killed, 30 injured in Bagerhat road accident 
/country/bagerhat-road-accident-5-killed-25-injured-1880740
 Teacher, 10 students hurt in road accident 
/city/10-students-hurt-in-road-accident-teacher-hand-severed-1878889
 National handball team goalkeeper among 2 killed in Kushtia road accident 
/country/news/national-volleyball-team-goalkeeper-among-2-killed-kushtia-road-accident-1871194
 Biker killed after bus 

In [66]:
len(container.find_all('p'))

30
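
Since there are as many headings as teaser paragraphs, each `li` entry can be collected as one record. The sketch below uses simplified stand-in markup and hypothetical field names (`heading`, `link`, `teaser`); looping over the `li` elements rather than over separate lists keeps heading, link and teaser of the same story together.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for two <li> entries of the listing container.
html = """
<ul>
  <li><h4> <a href="/story-one">Story one</a> </h4><p> Teaser one. </p></li>
  <li><h4> <a href="/story-two">Story two</a> </h4><p> Teaser two. </p></li>
</ul>
"""
listing = BeautifulSoup(html, "html.parser")

records = []
for item in listing.find_all("li"):
    heading = item.find("h4")
    link = item.find("a")
    teaser = item.find("p")
    records.append({
        "heading": heading.get_text(" ", strip=True) if heading is not None else None,
        "link": link.get("href") if link is not None else None,
        "teaser": teaser.get_text(" ", strip=True) if teaser is not None else None,
    })

print(records[0])
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(records)` to get a table with one row per story.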

### Finding text on the webpage

In [184]:
next_button = soup.find(text="SHOW MORE")
next_button

This did not work! Take a look at the webpage in the developer tools and find out why!

In [189]:
next_button = soup.find(text="Show more")
next_button

'Show more'

In [188]:
next_button_link = soup.find(text="Show more").parent.attrs['href']
next_button_link

'/tags/road-accident?page=1'
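
String matching in `find()` is exact and case-sensitive, so a label that is displayed in capitals but written differently in the HTML will not be found. One way to be less dependent on the exact spelling is a case-insensitive regular expression. This is a sketch on made-up pager markup; it uses the `string` argument, which is the newer name for the `text` argument used above.

```python
import re
from bs4 import BeautifulSoup

# Made-up pager markup; the real page may be structured differently.
html = '<ul class="pager"><li><a href="/tags/road-accident?page=1">Show more</a></li></ul>'
demo = BeautifulSoup(html, "html.parser")

exact = demo.find(string="SHOW MORE")                        # None: the case must match exactly
relaxed = demo.find(string=re.compile(r"show more", re.I))   # matches regardless of case

print(exact)
print(relaxed.parent["href"])
```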

## Extracting the main article page

In [76]:
links[0]

'/bangladesh/news/3-killed-mymensingh-road-accident-2085249'

In [88]:
page_link = base_url + links[0]
page_link

'https://www.thedailystar.net/bangladesh/news/3-killed-mymensingh-road-accident-2085249'

In [89]:
doc = requests.get(page_link)
page_soup = bs(doc.content, 'html.parser')

In [90]:
page_soup

<!DOCTYPE html>

<!--[if lt IE 7]><html class="lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"><![endif]-->
<!--[if IE 7]><html class="lt-ie9 lt-ie8" lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9" lang="en" dir="ltr"><![endif]-->
<!--[if gt IE 8]><!--><html dir="ltr" lang="en"><!--<![endif]-->
<head>
<!--[if IE]><![endif]-->
<link href="//fonts.googleapis.com" rel="preconnect"/>
<link href="//fonts.googleapis.com" rel="dns-prefetch"/>
<link crossorigin="" href="//fonts.gstatic.com" rel="preconnect"/>
<link href="//fonts.gstatic.com" rel="dns-prefetch"/>
<link href="//maxcdn.bootstrapcdn.com" rel="preconnect"/>
<link href="//maxcdn.bootstrapcdn.com" rel="dns-prefetch"/>
<link href="//assetsds.cdnedge.bluemix.net" rel="dns-prefetch"/>
<link href="//assetsds.cdnedge.bluemix.net" rel="preconnect"/>
<link href="//ctools" rel="preconnect"/>
<link href="//ctools" rel="dns-prefetch"/>
<link href="//cpn" rel="dns-prefetch"/>
<link href="//cpn" rel="preconnect"/>
<link href="//aja

In [94]:
top = page_soup.find("div", attrs={"class": "pane-top"})
top

<div class="panel-pane pane-top no-title block">
<meta itemid="https://google.com/article" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"/>
<div class="breadcrumb" itemscope="" itemtype="http://schema.org/BreadcrumbList">
<span itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<a href="/" itemprop="item"><span itemprop="name">Home</span></a> <meta content="1" itemprop="position"/>
</span>
<i class="fa fa-angle-double-right" style="padding: 0 7px;"></i>
<span itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<a href="/bangladesh" itemprop="item"><span itemprop="name">Bangladesh</span></a> <meta content="2" itemprop="position"/>
</span>
</div>
<div class="small-text">
<meta content="2021-04-28T16:53:42+06:00" itemprop="datePublished"/>
<meta content="2021-04-28T16:53:42+06:00" itemprop="dateModified"/>
04:53 PM, April 28, 2021 / LAST MODIFIED: 05:12 PM, April 28, 2021</div>
<h1 itemprop="headline">3

In [97]:
top.find('div', attrs={"class": "small-text"})

<div class="small-text">
<meta content="2021-04-28T16:53:42+06:00" itemprop="datePublished"/>
<meta content="2021-04-28T16:53:42+06:00" itemprop="dateModified"/>
04:53 PM, April 28, 2021 / LAST MODIFIED: 05:12 PM, April 28, 2021</div>

In [99]:
date_string = top.find('div', attrs={"class": "small-text"}).text
date_string

'\n\n\n04:53 PM, April 28, 2021 / LAST MODIFIED: 05:12 PM, April 28, 2021'
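
The raw date string is hard to sort or filter on. A possible next step, sketched here under the assumption that the line always has the shape shown above and that month names are in English, is to split it into its two parts and parse each with `datetime.strptime`. The names `published` and `modified` are illustrative.

```python
from datetime import datetime

# The string returned in the previous cell.
date_string = '\n\n\n04:53 PM, April 28, 2021 / LAST MODIFIED: 05:12 PM, April 28, 2021'

# Split into the publication part and the "last modified" part.
published_raw, modified_raw = date_string.strip().split(" / ")
modified_raw = modified_raw.replace("LAST MODIFIED: ", "")

date_format = "%I:%M %p, %B %d, %Y"  # matches e.g. "04:53 PM, April 28, 2021"
published = datetime.strptime(published_raw, date_format)
modified = datetime.strptime(modified_raw, date_format)

print(published.isoformat())
print(modified.isoformat())
```

The `meta` tags in the same div carry the timestamps in ISO format, so reading their `content` attribute and passing it to `datetime.fromisoformat` would be an alternative that also keeps the time zone offset.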

In [100]:
headline = top.find('h1').text
headline

'3 killed in Mymensingh road accident'

In [107]:
author = page_soup.find("div", attrs={"class": "author-name"}).span.text
author

'\nStar Digital Report '

In [109]:
article = page_soup.find('div', attrs={"class": "field-body"})
article

<div class="field-body view-mode-full">
<p><strong>Three people were killed and another was injured as a truck hit a CNG-run auto-rickshaw in Tarakanda upazila of Mymensingh this morning.</strong></p>
<p>The deceased was identified as Shahid Mia (33), the auto-rickshaw driver; Khalil Mia (32); and Masum Mia (36) from Durgapur upazila of Netrakona.</p><h4><a href="https://news.google.com/publications/CAAiECW73usLivqPCSeQRsSUvRQqFAgKIhAlu97rC4r6jwknkEbElL0U" style="font-weight:600; color:#4285F3; border-bottom: 1px dotted #4285F3; font-size: 18px;"> <img src="/sites/all/themes/tds/images/google_news.svg" style="display: inline-block; margin-right: 15px; margin-bottom: -3px; height: 30px;"/>For all latest news, follow The Daily Star's Google News channel. </a></h4>
<p>Quoting locals, officer-in-charge of Tarakanda Police Station Md Abul Khayer said the accident happened at Khicha when the truck hit the Mymensingh-bound auto-rickshaw, leaving Shahid dead on the spot and three others injure

In [116]:
paragraphs = article.find_all('p')
paragraphs

[<p><strong>Three people were killed and another was injured as a truck hit a CNG-run auto-rickshaw in Tarakanda upazila of Mymensingh this morning.</strong></p>,
 <p>The deceased was identified as Shahid Mia (33), the auto-rickshaw driver; Khalil Mia (32); and Masum Mia (36) from Durgapur upazila of Netrakona.</p>,
 <p>Quoting locals, officer-in-charge of Tarakanda Police Station Md Abul Khayer said the accident happened at Khicha when the truck hit the Mymensingh-bound auto-rickshaw, leaving Shahid dead on the spot and three others injured around 10:00 am.</p>,
 <p>The injured were rushed to Mymensingh Medical College Hospital (MMCH), where doctors declared Khalil and Masum dead.</p>,
 <p>Another critically injured passenger -- Aminul Islam -- is undergoing treatment at MMCH.</p>,
 <p>Police seized the truck and auto-rickshaw, but the truck driver managed to flee.</p>,
 <p>A case was lodged.</p>]

In [117]:
subheading = paragraphs[0].text
subheading

'Three people were killed and another was injured as a truck hit a CNG-run auto-rickshaw in Tarakanda upazila of Mymensingh this morning.'

In [123]:
# the first paragraph is the article's lead; keep its text as the subheading
subheading = paragraphs[0].text
print(subheading)
# join the remaining paragraphs with a space so that sentences do not run together
article_text = " ".join(paragraph.text for paragraph in paragraphs[1:])
print(article_text)

Three people were killed and another was injured as a truck hit a CNG-run auto-rickshaw in Tarakanda upazila of Mymensingh this morning.
The deceased was identified as Shahid Mia (33), the auto-rickshaw driver; Khalil Mia (32); and Masum Mia (36) from Durgapur upazila of Netrakona. Quoting locals, officer-in-charge of Tarakanda Police Station Md Abul Khayer said the accident happened at Khicha when the truck hit the Mymensingh-bound auto-rickshaw, leaving Shahid dead on the spot and three others injured around 10:00 am. The injured were rushed to Mymensingh Medical College Hospital (MMCH), where doctors declared Khalil and Masum dead. Another critically injured passenger -- Aminul Islam -- is undergoing treatment at MMCH. Police seized the truck and auto-rickshaw, but the truck driver managed to flee. A case was lodged.
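
The individual steps of this section can be bundled into one function, in the same spirit as `make_soup`. The following is only a sketch: `parse_article` and `text_or_none` are hypothetical helpers, the class names are the ones found above, and the sample markup is a heavily shortened stand-in for a real article page.

```python
from bs4 import BeautifulSoup

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text().strip() if tag is not None else None

def parse_article(page_soup):
    """Collect headline, date line, author and body text from an article soup."""
    top = page_soup.find("div", attrs={"class": "pane-top"})
    body = page_soup.find("div", attrs={"class": "field-body"})
    author_box = page_soup.find("div", attrs={"class": "author-name"})
    paragraphs = [text_or_none(p) for p in body.find_all("p")] if body is not None else []
    return {
        "headline": text_or_none(top.find("h1")) if top is not None else None,
        "date_line": text_or_none(top.find("div", attrs={"class": "small-text"}))
        if top is not None else None,
        "author": text_or_none(author_box),
        "subheading": paragraphs[0] if paragraphs else None,
        "article_text": " ".join(paragraphs[1:]),
    }

# Shortened stand-in for an article page, using the class names found above.
sample_html = """
<div class="panel-pane pane-top">
  <div class="small-text">04:53 PM, April 28, 2021</div>
  <h1 itemprop="headline">Sample headline</h1>
</div>
<div class="author-name"><span>
Star Digital Report </span></div>
<div class="field-body view-mode-full">
  <p><strong>Lead sentence.</strong></p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</div>
"""
article_record = parse_article(BeautifulSoup(sample_html, "html.parser"))
print(article_record)
```

Returning `None` for missing parts, instead of raising an error, keeps a longer scraping run going when a single article page deviates from the expected layout.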


## Pagination