# CORONA VIRUS DATA - Web Scraper - BeautifulSoup

To scrape a web page, we first need to download it. We can do this with the Python requests library. <br>
The requests library sends a GET request to the web server, which responds with the HTML contents of the page. <br>
Parsing a page with BeautifulSoup <br>
We can then use the BeautifulSoup library to parse the HTML document and extract the text we need.

Install the BeautifulSoup and Requests Python libraries
pip install beautifulsoup4
pip install requests

In [1]:
from bs4 import BeautifulSoup
import requests

Make a GET request to the web server with requests, then parse the response with BeautifulSoup

In [2]:
url = "https://www.worldometers.info/coronavirus/#countries"

request = requests.get(url)

In [3]:
soup = BeautifulSoup(request.content, 'html.parser')
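
The request above assumes that the server answers successfully. Below is a minimal sketch of a more defensive download step. The helper name `fetch_html`, the 10-second timeout, and the fake response used for the offline demo are assumptions for illustration and are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper).

    `get` defaults to requests.get; it is a parameter only so the function
    can be tried without network access.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.content


class _FakeResponse:
    """Stand-in for requests.Response, used only for this offline demo."""
    content = b"<html><body><p>hello</p></body></html>"

    def raise_for_status(self):
        return None


def _fake_get(url, timeout=None):
    return _FakeResponse()


html = fetch_html("https://example.invalid/", get=_fake_get)
soup_demo = BeautifulSoup(html, "html.parser")
print(soup_demo.p.text)  # hello
```

With the real site, the call would simply be `fetch_html(url)`, followed by the same `BeautifulSoup(..., 'html.parser')` step as in the cell above.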

Find the table we need by its class attribute

In [4]:
tables_container = soup.find('table', attrs={'class':'main_table_countries'})
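
To see how the class lookup behaves, here is a small sketch on made-up HTML (the markup and ids are invented for illustration, not copied from the real page). It shows that a class filter matches a tag carrying several classes, and that `find` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page.
html = """
<table id="t1" class="table main_table_countries"><tr><td>USA</td></tr></table>
<table id="t2" class="other"><tr><td>ignored</td></tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# A class filter matches if ANY of the tag's classes equals the given name.
table = soup_demo.find("table", attrs={"class": "main_table_countries"})
print(table["id"])  # t1

# find() returns None when nothing matches, so check before using the result.
missing = soup_demo.find("table", attrs={"class": "no_such_class"})
print(missing is None)  # True
```

If the page layout changes and the table is not found, a later call such as `tables_container.find_all('tr')` would fail with an `AttributeError`, so a `None` check right after `find` is a cheap safeguard.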

Create lists to store values

In [5]:
countries = []
total_cases = []
new_cases = []
total_deaths = []
new_deaths = []
total_recoveries = []
active_cases = []
serious_criticals = []
total_tests = []

Store all rows of the table in rows

In [6]:
rows = tables_container.find_all('tr')


Loop over the rows and append the value of each cell to the matching list

In [7]:
for row in rows:
    cells = row.find_all('td')

    # Skip rows without data cells (for example header rows)
    if len(cells) > 1:
        try:
            # The country name is the text of the link in the first cell
            countries.append(cells[0].a.text)
        except AttributeError:
            # No link in the first cell: mark the row so it can be dropped later
            countries.append('None')

        total_cases.append(cells[1].text)

        try:
            new_cases.append(cells[2].text)
        except IndexError:
            new_cases.append('None')

        total_deaths.append(cells[3].text)

        try:
            new_deaths.append(cells[4].text)
        except IndexError:
            new_deaths.append('None')

        total_recoveries.append(cells[5].text)
        active_cases.append(cells[6].text)
        serious_criticals.append(cells[7].text)
        total_tests.append(cells[10].text)
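
The loop indexes cells by fixed position, which fails if a row has fewer cells than expected. As a sketch of one way to guard against that, the example below uses a hypothetical helper `cell_text` and invented HTML. It collects one dict per row instead of separate lists, which is a design choice of this sketch rather than the notebook's approach.

```python
from bs4 import BeautifulSoup

# Invented table: a header row, a normal row, and a short row without a link.
html = """
<table>
  <tr><th>Country</th><th>Total</th><th>New</th></tr>
  <tr><td><a href="#">USA</a></td><td>560,433</td><td>+133</td></tr>
  <tr><td>Total:</td><td>1,000</td></tr>
</table>
"""
rows_demo = BeautifulSoup(html, "html.parser").find_all("tr")


def cell_text(cells, index):
    """Return the stripped text of cells[index], or '' if the cell is missing."""
    return cells[index].get_text(strip=True) if index < len(cells) else ""


records = []
for row in rows_demo:
    cells = row.find_all("td")
    if len(cells) <= 1:      # the header row has <th> cells only
        continue
    link = cells[0].a        # None when the first cell has no <a> tag
    records.append({
        "Country": link.get_text(strip=True) if link else "None",
        "Total_Cases": cell_text(cells, 1),
        "New_Cases": cell_text(cells, 2),
    })

print(records)
```

A list of dicts like `records` can be passed directly to `pd.DataFrame(records)`, which keeps every column the same length by construction.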

Print the length of each list to check that they all match, which is required to store them in a DataFrame

In [8]:
print(len(countries))

228


In [9]:
print(len(total_cases))

228


In [10]:
print(len(new_cases))

228


In [11]:
print(len(total_deaths))

228


In [12]:
print(len(new_deaths))
# new_deaths

228


In [13]:
print(len(total_recoveries))

228


In [14]:
print(len(active_cases))

228


In [15]:
print(len(serious_criticals))

228


In [16]:
print(len(total_tests))

228
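
The nine separate checks above can also be done in one step. A minimal sketch, using short made-up lists in place of the scraped ones:

```python
# Made-up short lists standing in for the scraped columns.
columns = {
    "countries": ["USA", "Spain", "Italy"],
    "total_cases": ["560,433", "169,496", "156,363"],
    "new_cases": ["+133", "+2,665", ""],
}

# Map each list name to its length, then check that only one length occurs.
lengths = {name: len(values) for name, values in columns.items()}
all_equal = len(set(lengths.values())) == 1

print(lengths)    # {'countries': 3, 'total_cases': 3, 'new_cases': 3}
print(all_equal)  # True
```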


Import the pandas and numpy libraries to clean and manipulate the data.
A DataFrame stores tabular data as rows of observations and columns of variables.

In [17]:
import pandas as pd
import numpy as np

Create a DataFrame that stores the values from each list in a separate column

In [24]:
df = pd.DataFrame({'Country': countries,
'Total_Cases':total_cases,
'New_Cases':new_cases,
'Total_Deaths':total_deaths,
'New_Deaths':new_deaths,
'Total_Recovered':total_recoveries,
'Active_Cases':active_cases,
'Serious_Critical':serious_criticals,
'Total_Tests':total_tests
})

Drop the rows where Country has the placeholder value 'None'

In [25]:
df.drop(df.index[df['Country'] == 'None'], inplace = True)

In [26]:
df['Country']

8                        USA
9                      Spain
10                     Italy
11                    France
12                   Germany
               ...          
215    Caribbean Netherlands
216         Papua New Guinea
217    Saint Pierre Miquelon
218                    Yemen
219                    China
Name: Country, Length: 210, dtype: object
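
The same rows can be removed with a boolean mask instead of `drop`. A small sketch on made-up data, which also shows why the index above starts at 8: the original row labels are kept after filtering.

```python
import pandas as pd

demo = pd.DataFrame({"Country": ["None", "USA", "None", "Spain"],
                     "Total_Cases": ["1", "2", "3", "4"]})

# Keep only the rows whose Country is not the placeholder string 'None'.
filtered = demo[demo["Country"] != "None"]

print(filtered["Country"].tolist())  # ['USA', 'Spain']
print(filtered.index.tolist())       # [1, 3]
```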

Reset the index to start from 1

In [27]:
# df.reset_index(drop=True, inplace=True)  # use this for an index starting at 0
df.index = np.arange(1, len(df)+1)  # index starting at 1
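
A short sketch comparing the two options on a made-up frame whose labels start at 8, as they do after the drop above:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Country": ["USA", "Spain", "Italy"]}, index=[8, 9, 10])

# Option 1: a fresh index starting at 0.
zero_based = demo.reset_index(drop=True)

# Option 2: a fresh index starting at 1.
one_based = demo.copy()
one_based.index = np.arange(1, len(one_based) + 1)

print(zero_based.index.tolist())  # [0, 1, 2]
print(one_based.index.tolist())   # [1, 2, 3]
```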

In [28]:
df

Unnamed: 0,Country,Total_Cases,New_Cases,Total_Deaths,New_Deaths,Total_Recovered,Active_Cases,Serious_Critical,Total_Tests
1,USA,560433,+133,22115,+10,32634,505684,11766,2833112
2,Spain,169496,+2665,17489,+280,64727,87280,7371,600000
3,Italy,156363,,19899,,34211,102253,3343,1010193
4,France,132591,,14393,,27186,91012,6845,333807
5,Germany,127854,,3022,,64300,60532,4895,1317887
...,...,...,...,...,...,...,...,...,...
206,Caribbean Netherlands,3,,,,,3,,10
207,Papua New Guinea,2,,,,,2,,72
208,Saint Pierre Miquelon,1,,,,,1,,
209,Yemen,1,,,,,1,,


In [30]:
df.tail()

Unnamed: 0,Country,Total_Cases,New_Cases,Total_Deaths,New_Deaths,Total_Recovered,Active_Cases,Serious_Critical,Total_Tests
206,Caribbean Netherlands,3,,,,,3,,10.0
207,Papua New Guinea,2,,,,,2,,72.0
208,Saint Pierre Miquelon,1,,,,,1,,
209,Yemen,1,,,,,1,,
210,China,82160,108.0,3341.0,2.0,77663.0,1156,121.0,


Remove the '+' symbol from the New_Deaths column

In [31]:
# Discarded attempts (do not use):
# df['New Deaths'] = df['New Deaths'].str.split('+')
# df['New Deaths'] = df['New Deaths'].apply(lambda x: {i for sub in x for i in sub.split('-')})
# df['New Deaths'] = df['New Deaths'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1')
# Replace '+' in a single column; regex=False treats '+' as a literal character
df['New_Deaths'] = df['New_Deaths'].str.replace('+', '', regex=False)


In [32]:
df['New_Deaths']

1       10
2      280
3         
4         
5         
      ... 
206       
207       
208       
209       
210      2
Name: New_Deaths, Length: 210, dtype: object

Remove the '+' symbol from the New_Cases column

In [33]:
df['New_Cases'] = df['New_Cases'].str.replace('+', '', regex=False)
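
Both columns can be cleaned in one loop. The sketch below uses made-up values and passes `regex=False` explicitly, because `+` is a quantifier in regular expressions and the default for the `regex` argument has differed between pandas versions.

```python
import pandas as pd

demo = pd.DataFrame({"New_Cases": ["+133", "", "108"],
                     "New_Deaths": ["+10", "+280", ""]})

# Treat '+' as a plain character and remove it from both columns.
for col in ["New_Cases", "New_Deaths"]:
    demo[col] = demo[col].str.replace("+", "", regex=False)

print(demo["New_Cases"].tolist())   # ['133', '', '108']
print(demo["New_Deaths"].tolist())  # ['10', '280', '']
```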

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 1 to 210
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Country           210 non-null    object
 1   Total_Cases       210 non-null    object
 2   New_Cases         210 non-null    object
 3   Total_Deaths      210 non-null    object
 4   New_Deaths        210 non-null    object
 5   Total_Recovered   210 non-null    object
 6   Active_Cases      210 non-null    object
 7   Serious_Critical  210 non-null    object
 8   Total_Tests       210 non-null    object
dtypes: object(9)
memory usage: 16.4+ KB


In [35]:
df.dtypes

Country             object
Total_Cases         object
New_Cases           object
Total_Deaths        object
New_Deaths          object
Total_Recovered     object
Active_Cases        object
Serious_Critical    object
Total_Tests         object
dtype: object

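Remove the thousands separators (commas) from all values

In [36]:
df = df.replace(',','', regex=True)

# Alternative (not used here): convert all object columns to numeric.
# Note that this would also turn the Country names into NaN.
# c = df.select_dtypes(object).columns
# df[c] = df[c].apply(pd.to_numeric, errors='coerce')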
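
A small sketch on made-up data shows both effects: the regex replace strips the commas, and coercing every object column to numeric would also wipe out the country names.

```python
import pandas as pd

demo = pd.DataFrame({"Country": ["USA", "Spain"],
                     "Total_Cases": ["560,433", "169,496"]})

# Remove commas inside every string value.
demo = demo.replace(",", "", regex=True)
print(demo["Total_Cases"].tolist())  # ['560433', '169496']

# Coercing EVERY column also hits Country: non-numeric text becomes NaN.
coerced = demo.apply(pd.to_numeric, errors="coerce")
print(coerced["Country"].isna().all())  # True
print(coerced["Total_Cases"].tolist())  # [560433, 169496]
```

This is why the later cells exclude the Country column before converting to numeric values.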

In [37]:
# df['Country'] = df['Country'].astype(str)
df = df.astype(str)

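Remove leading and trailing whitespace from the DataFrame

In [38]:
# Strip string values and assign the result back so the change is kept
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df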

Unnamed: 0,Country,Total_Cases,New_Cases,Total_Deaths,New_Deaths,Total_Recovered,Active_Cases,Serious_Critical,Total_Tests
1,USA,560433,133,22115,10,32634,505684,11766,2833112
2,Spain,169496,2665,17489,280,64727,87280,7371,600000
3,Italy,156363,,19899,,34211,102253,3343,1010193
4,France,132591,,14393,,27186,91012,6845,333807
5,Germany,127854,,3022,,64300,60532,4895,1317887
...,...,...,...,...,...,...,...,...,...
206,Caribbean Netherlands,3,,,,,3,,10
207,Papua New Guinea,2,,,,,2,,72
208,Saint Pierre Miquelon,1,,,,,1,,
209,Yemen,1,,,,,1,,


A separate function can also be used to remove whitespace:
def remove_whitespace(x):
    try:
        # collapse whitespace inside the string and trim both ends
        x = " ".join(x.split())
    except AttributeError:
        # non-string values are returned unchanged
        pass
    return x
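
As a sketch of how such a function could be applied to a whole DataFrame, the example below uses made-up values and an `isinstance` check in place of the `try`/`except`. It applies the function column by column with `Series.map`, which is an assumption of this sketch chosen because it behaves the same across pandas versions.

```python
import pandas as pd

demo = pd.DataFrame({"Country": [" USA ", "Saint  Pierre\nMiquelon"],
                     "New_Deaths": [" 10", ""]})


def remove_whitespace(x):
    """Collapse runs of whitespace and trim the ends; leave non-strings alone."""
    return " ".join(x.split()) if isinstance(x, str) else x


# Apply the function to every value, one column at a time.
cleaned = demo.apply(lambda col: col.map(remove_whitespace))

print(cleaned["Country"].tolist())     # ['USA', 'Saint Pierre Miquelon']
print(cleaned["New_Deaths"].tolist())  # ['10', '']
```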

In [40]:
df['New_Deaths']=df['New_Deaths'].apply(lambda x: x.replace(' ',''))

Convert all values except Country column into numeric values

In [41]:
df.loc[:,df.columns != 'Country'] = df.loc[:,df.columns != 'Country'].apply(pd.to_numeric,errors='coerce')

Fill all empty values with 0

In [42]:
df.loc[:,df.columns != 'Country'] = df.loc[:,df.columns != 'Country'].fillna(0)

Convert values to integer

In [43]:
df.loc[:,df.columns != 'Country'] = df.loc[:,df.columns != 'Country'].astype(int)
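
The three steps above (convert, fill, cast) can be chained per column. The sketch below uses made-up values and assigns each column back by name. This is a suggested variant, not the notebook's code: depending on the pandas version, assigning through `df.loc[:, ...]` may keep the original object dtype, while assigning a whole column replaces it.

```python
import pandas as pd

demo = pd.DataFrame({"Country": ["USA", "Yemen"],
                     "New_Cases": ["133", ""],
                     "Total_Tests": ["2833112", "N/A"]})

value_cols = demo.columns.drop("Country")
for col in value_cols:
    # Unparseable or empty strings become NaN, then 0, then the column is cast to int.
    demo[col] = pd.to_numeric(demo[col], errors="coerce").fillna(0).astype(int)

print(demo["New_Cases"].tolist())    # [133, 0]
print(demo["Total_Tests"].tolist())  # [2833112, 0]
print(pd.api.types.is_integer_dtype(demo["New_Cases"]))  # True
```

Note that filling with 0 makes "no data" indistinguishable from a real zero, which matters for columns such as Total_Tests.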

Sum each column separately with the sum() function and store the results

In [51]:
w_total_cases = df['Total_Cases'].sum()
w_new_cases = df['New_Cases'].sum()
w_total_deaths = df['Total_Deaths'].sum()
w_new_deaths = df['New_Deaths'].sum()
w_total_recovered = df['Total_Recovered'].sum()
w_active_cases = df['Active_Cases'].sum()
w_serious_critical = df['Serious_Critical'].sum()
w_total_tests = df['Total_Tests'].sum()

Create a separate DataFrame 'world_row' that holds the column sums at index 0, so it can be added at the top

In [52]:
world_row = pd.DataFrame({'Country': 'World', 
'Total_Cases':w_total_cases,
'New_Cases':w_new_cases,
'Total_Deaths':w_total_deaths,
'New_Deaths':w_new_deaths,
'Total_Recovered':w_total_recovered,
'Active_Cases':w_active_cases,
'Serious_Critical':w_serious_critical,
'Total_Tests':w_total_tests}, 
index =[0])

Concatenate 'world_row' with our existing DataFrame to add the world totals

In [53]:
df = pd.concat([world_row, df]).reset_index(drop = True) 
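
The totals row can also be built without one variable per column. A minimal sketch on made-up data with two numeric columns; the names `totals`, `world_row_demo` and `combined` are illustrative only.

```python
import pandas as pd

demo = pd.DataFrame({"Country": ["USA", "Spain"],
                     "Total_Cases": [560433, 169496],
                     "Total_Deaths": [22115, 17489]})

# Sum every column except Country in one call.
totals = demo.drop(columns="Country").sum()

# Build a one-row frame with the label 'World' followed by the sums.
world_row_demo = pd.DataFrame({"Country": "World", **totals.to_dict()}, index=[0])

# Put the totals row on top and renumber the rows.
combined = pd.concat([world_row_demo, demo]).reset_index(drop=True)

print(combined["Country"].tolist())          # ['World', 'USA', 'Spain']
print(int(combined.loc[0, "Total_Cases"]))   # 729929
print(int(combined.loc[0, "Total_Deaths"]))  # 39604
```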

Sort values by 'Total_Cases' in descending order

In [54]:
df = df.sort_values(by='Total_Cases', ascending=False)

Reset the index to start from 1 again

In [55]:
df.index = np.arange(1, len(df)+1) ### for index from 1

Store the DataFrame in a CSV file

In [56]:
df.to_csv('text.csv')
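
By default `to_csv` writes the index as a first column without a header, which shows up as `Unnamed: 0` when the file is read back. The sketch below writes to an in-memory buffer rather than a file and uses a hypothetical column label `Rank` to show one way to name that column.

```python
import io

import pandas as pd

demo = pd.DataFrame({"Country": ["USA", "Spain"],
                     "Total_Cases": [560433, 169496]},
                    index=[1, 2])

# Default: the index is written as an unnamed first column.
buf = io.StringIO()
demo.to_csv(buf)
buf.seek(0)
default_read = pd.read_csv(buf)
print(default_read.columns.tolist())  # ['Unnamed: 0', 'Country', 'Total_Cases']

# Naming the index column keeps the round trip clean.
buf = io.StringIO()
demo.to_csv(buf, index_label="Rank")
buf.seek(0)
named = pd.read_csv(buf, index_col="Rank")
print(named.index.tolist())    # [1, 2]
print(named.columns.tolist())  # ['Country', 'Total_Cases']
```

Passing `index=False` to `to_csv` is another option when the row numbers are not needed in the file.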

In [57]:
df.tail()

Unnamed: 0,Country,Total_Cases,New_Cases,Total_Deaths,New_Deaths,Total_Recovered,Active_Cases,Serious_Critical,Total_Tests
207,British Virgin Islands,3,0,0,0,2,1,0,0
208,Caribbean Netherlands,3,0,0,0,0,3,0,10
209,Papua New Guinea,2,0,0,0,0,2,0,72
210,Saint Pierre Miquelon,1,0,0,0,0,1,0,0
211,Yemen,1,0,0,0,0,1,0,0


You can also export the DataFrame to an HTML file and view it in a browser

In [48]:
df.to_html('index.html')