What is Web Scraping?

Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in tabular (spreadsheet-like) format.
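
The idea can be sketched without any network access. The following is a minimal illustration, not the approach used in the cells below: it parses a small, made-up HTML table with Python's standard-library `html.parser` and writes the rows out as CSV text. `SAMPLE_HTML`, `TableParser`, and the bank names are hypothetical and exist only for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML instead.
SAMPLE_HTML = """
<table>
  <tr><th>Bank Name</th><th>City</th></tr>
  <tr><td>Example Bank A</td><td>Springfield</td></tr>
  <tr><td>Example Bank B</td><td>Riverton</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Extract: turn the HTML into a list of rows.
parser = TableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

# Save: write the rows as CSV (to an in-memory buffer here; a file works the same way).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The notebook cells below do the same two steps (extract, then save) with `requests`, `BeautifulSoup`, and `pandas`, which handle real-world pages far more robustly than this sketch.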

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [2]:
# Step 1: send a GET request to the website
res = requests.get("https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/")

In [3]:
res

<Response [200]>
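
The `200` in the output above is the HTTP status code, and codes in the 2xx range mean the request succeeded. With `requests`, the code is available as `res.status_code`, and `res.raise_for_status()` raises an exception for 4xx/5xx responses. As a small offline sketch of how to read such codes, the helper below (`describe_status` is a hypothetical name, not part of `requests`) uses the standard-library `http.HTTPStatus` enum:

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label for an HTTP status code and whether it is a success (2xx)."""
    status = HTTPStatus(code)      # raises ValueError for unknown codes
    is_success = 200 <= code < 300
    return f"{code} {status.phrase}", is_success


print(describe_status(200))
print(describe_status(404))
```

Checking the status before parsing avoids feeding an error page into the later steps.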

In [4]:
res.content

b'<!DOCTYPE html>\r\n<!--[if lt IE 9]><html class="lt-ie9" lang="en-US"><![endif]-->\r\n<!--[if gt IE 8]><!-->\r\n<html lang="en-US">\r\n  <!--<![endif]-->\r\n\r\n  <head>\r\n    <!-- Title and Meta Description -->\r\n\r\n    <title>FDIC | Failed Bank List</title>\r\n    <meta property="og:title" content="FDIC | Failed Bank List" />\r\n\r\n    <meta\r\n      name="description"\r\n      content="Look up information on failed banks, including how your accounts and loans are affected and how vendors can file claims against receivership."\r\n    />\r\n    <meta\r\n      property="og:description"\r\n      content="Look up information on failed banks, including how your accounts and loans are affected and how vendors can file claims against receivership."\r\n    />\r\n\r\n    <link\r\n      rel="canonical"\r\n      href="https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/"\r\n    />\r\n    <meta\r\n      property="og:url"\r\n      content="https://www.fdic.gov/resource

In [5]:
# Step 2: parse the response content with BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

In [6]:
soup

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lt-ie9" lang="en-US"><![endif]--><!--[if gt IE 8]><!--><html lang="en-US">
<!--<![endif]-->
<head>
<!-- Title and Meta Description -->
<title>FDIC | Failed Bank List</title>
<meta content="FDIC | Failed Bank List" property="og:title"/>
<meta content="Look up information on failed banks, including how your accounts and loans are affected and how vendors can file claims against receivership." name="description"/>
<meta content="Look up information on failed banks, including how your accounts and loans are affected and how vendors can file claims against receivership." property="og:description"/>
<link href="https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/" rel="canonical"/>
<meta content="https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/" property="og:url"/>
<!-- Basic Page Needs -->
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<!-- Share image -->
<meta cont

In [7]:
# Step 3: find all the tables
table = soup.find_all('table')

In [8]:
table

[<table cellpadding="0" cellspacing="0" class="dataTable text-light dataTables-sidebar overflow-x-auto">
 <thead class="dataTables-content-header bg-blue">
 <tr>
 <th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
 <p class="font-size-16px font-serif-xs text-light margin-0 padding-0 text-white">
                   Bank Name
                 </p>
 </th>
 <th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
 <p class="font-size-16px font-serif-xs text-light margin-0 padding-0 text-white">
                   City
                 </p>
 </th>
 <th class="text-no-wrap text-left padding-left-2 desktop:padding-left-1 padding-right-105 padding-top-2 padding-bottom-1">
 <p class="font-size-16px font-serif-xs text-light margin-0 padding-0 text-white">
                   State
                 </p>
 </th>
 <th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1

In [9]:
len(table)

1

In [10]:
table[0]

<table cellpadding="0" cellspacing="0" class="dataTable text-light dataTables-sidebar overflow-x-auto">
<thead class="dataTables-content-header bg-blue">
<tr>
<th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
<p class="font-size-16px font-serif-xs text-light margin-0 padding-0 text-white">
                  Bank Name
                </p>
</th>
<th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
<p class="font-size-16px font-serif-xs text-light margin-0 padding-0 text-white">
                  City
                </p>
</th>
<th class="text-no-wrap text-left padding-left-2 desktop:padding-left-1 padding-right-105 padding-top-2 padding-bottom-1">
<p class="font-size-16px font-serif-xs text-light margin-0 padding-0 text-white">
                  State
                </p>
</th>
<th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
<p class="font-s

In [11]:
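# Step 4: convert the tables into a list of DataFrames
# (recent pandas versions expect a file-like object rather than a raw HTML string)
from io import StringIO
data = pd.read_html(StringIO(str(table)))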

In [12]:
data[0]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date
0,Almena State Bank,Almena,KS,15426,Equity Bank,"October 23, 2020"
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb","October 16, 2020"
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.","April 3, 2020"
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,"February 14, 2020"
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,"November 1, 2019"
...,...,...,...,...,...,...
558,"Superior Bank, FSB",Hinsdale,IL,32646,"Superior Federal, FSB","July 27, 2001"
559,Malta National Bank,Malta,OH,6629,North Valley Bank,"May 3, 2001"
560,First Alliance Bank & Trust Co.,Manchester,NH,34264,Southern New Hampshire Bank & Trust,"February 2, 2001"
561,National State Bank of Metropolis,Metropolis,IL,3815,Banterra Bank of Marion,"December 14, 2000"


In [13]:
data[0].to_csv('web_scrapping.csv')

In [None]:
# Practice URL : https://asrank.caida.org/asns/your_roll_no

In [14]:
res=requests.get("https://asrank.caida.org/asns/75")

In [15]:
res

<Response [200]>

In [16]:
res.content

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<!-- Global site tag (gtag.js) - Google Analytics -->\n\t<script async src="https://www.googletagmanager.com/gtag/js?id=UA-116819380-1"></script>\n\t<script>\n\t  window.dataLayer = window.dataLayer || [];\n\t  function gtag(){dataLayer.push(arguments);}\n\t  gtag(\'js\', new Date());\n\t  gtag(\'config\', \'UA-116819380-1\');\n\t</script>\n\n    <link rel="icon" type="image/x-icon" href="/favicon.ico" />\n    <!-- Required meta tags -->\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=1000, initial-scale=1, shrink-to-fit=yes">\n    <!-- meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no" -->\n    <meta charset="UTF-8">\n        <title>AS Rank: AS75  (Argonne National Laboratory) </title>\n    <META NAME="Description" CONTENT="AS Rank:71143 Customer Cone:1 Transit Degree:0">\n            <!-- Bootstrap CSS -->\n        <link href="/css/bootstrap.min.css" r

In [18]:
soup = BeautifulSoup(res.content, "html.parser")

In [None]:
table = soup.find_all('table')  # the tag name must be passed as a string

In [None]:
table

In [20]:
from io import StringIO
df = pd.read_html(StringIO(str(soup)))  # returns a list of DataFrames, one per table

In [21]:
df

[               0                            1                            2  \
 0      AS number                           75                           75   
 1        AS name                       ANL-AS                       ANL-AS   
 2   organization  Argonne National Laboratory  Argonne National Laboratory   
 3        country                United States                United States   
 4        AS rank                        71143                        71143   
 5  customer cone                        1 asn                     0 prefix   
 6      AS degree                     0 global                    0 transit   
 
                              3                            4  \
 0                           75                           75   
 1                       ANL-AS                       ANL-AS   
 2  Argonne National Laboratory  Argonne National Laboratory   
 3                United States                United States   
 4                        71143               

In [23]:
df[0]

Unnamed: 0,0,1,2,3,4,5,6,7
0,AS number,75,75,75,75,75,75,75
1,AS name,ANL-AS,ANL-AS,ANL-AS,ANL-AS,ANL-AS,ANL-AS,ANL-AS
2,organization,Argonne National Laboratory,Argonne National Laboratory,Argonne National Laboratory,Argonne National Laboratory,Argonne National Laboratory,Argonne National Laboratory,Argonne National Laboratory
3,country,United States,United States,United States,United States,United States,United States,United States
4,AS rank,71143,71143,71143,71143,71143,71143,71143
5,customer cone,1 asn,0 prefix,0 address,,,,
6,AS degree,0 global,0 transit,provider,peer,customer,,


In [25]:
df=df[0]

In [30]:
df.size

56

In [36]:
df.drop_duplicates().size

56

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7 non-null      object
 1   1       7 non-null      object
 2   2       7 non-null      object
 3   3       7 non-null      object
 4   4       6 non-null      object
 5   5       6 non-null      object
 6   6       5 non-null      object
 7   7       5 non-null      object
dtypes: object(8)
memory usage: 576.0+ bytes


In [40]:
df.isna()

Unnamed: 0,0,1,2,3,4,5,6,7
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,True,True,True,True
6,False,False,False,False,False,False,True,True


In [42]:
df.isnull()

Unnamed: 0,0,1,2,3,4,5,6,7
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,True,True,True,True
6,False,False,False,False,False,False,True,True


In [47]:
df.ffill().isna().any()  # fillna(method="ffill") is deprecated in recent pandas

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
dtype: bool
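
Forward fill replaces each missing value with the last non-missing value above it in the same column. The plain-Python sketch below (`forward_fill` is a hypothetical helper, not a pandas function) applies that rule to column 4 of the table above, with the missing cell written as `None`:

```python
def forward_fill(column):
    """Replace each None with the most recent non-None value above it."""
    filled = []
    last = None  # stays None until the first real value is seen
    for value in column:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Column 4 of the scraped table; row 5 ("customer cone") has no value there.
column_4 = [
    "75",
    "ANL-AS",
    "Argonne National Laboratory",
    "United States",
    "71143",
    None,
    "peer",
]

filled_4 = forward_fill(column_4)
print(filled_4[5])
```

Note that the gap in the "customer cone" row is filled with the "AS rank" value from the row above it. The result has no missing values, but the filled cells are not necessarily meaningful for this table, so it is worth checking whether forward fill suits the data before relying on it.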