<h1>Assignment 2 - Web Scraping</h1>

**In this assignment you will work with data from the [worldometers](https://www.worldometers.info/coronavirus/) website. I want you to scrape the coronavirus case information for all 215 countries listed there.**
The data has to include:
- `Country name`
- `Total cases`
- `Total deaths`
- `Total recovered`
- `Active cases`
- `New cases`
- `New deaths`
- `Total tests`
- `Population`

**You need to use Beautiful Soup 4 and regular expressions for this task. Save the results to a CSV file, then read the dataset back in.**

<h3>Import Dependencies</h3>

In [213]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import requests

<h3>Scraping the data</h3>


In [235]:
url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url)

list_rows = []  # one list of cleaned cell strings per table row
tag_pattern = re.compile('<.*?>')  # matches a single HTML tag that sits on one line

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.find('table')  # first table on the page
    rows = table.find_all('tr')

    for row in rows:
        cells = row.find_all('td')
        clean_row = []

        for cell in cells:
            # Strip the tags, then trim and collapse whitespace
            clean_text = re.sub(tag_pattern, '', str(cell))
            clean_text = clean_text.strip()
            clean_text = re.sub(r'\s+', ' ', clean_text)
            clean_row.append(clean_text)
        list_rows.append(clean_row)
else:
    print("Failed to retrieve the webpage")

df = pd.DataFrame(list_rows)

# Keep only the rows in which some cell still contains this tag text,
# then limit the frame to the first 15 columns
string_to_check = '<td style="font-weight: bold; text-align:right;">'
df = df[df.apply(lambda row: row.astype(str).str.contains(string_to_check).any(), axis=1)]
df = df.iloc[:, :15]

# Remove the leftover tag text from the cells
df.replace(to_replace=string_to_check, value='', regex=True, inplace=True)

df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 1 | USA | 111790779 | | 1219159 | | 109761610.0 | | 810010.0 | 1009.0 | 333898 | 3641 | 1186851502 | 3544901 | 334805269 |
| 10 | 2 | India | 45034136 | | 533547 | | | | | | 32016 | 379 | 935879495 | 665334 | 1406631776 |
| 11 | 3 | France | 40138560 | | 167642 | | 39970918.0 | | 0.0 | | 612013 | 2556 | 271490188 | 4139547 | 65584518 |
| 12 | 4 | Germany | 38827570 | | 182952 | | 38240600.0 | | 404018.0 | | 462874 | 2181 | 122332384 | 1458359 | 83883596 |
| 13 | 5 | Brazil | 38729836 | | 711249 | | 36249161.0 | | 1769426.0 | | 179843 | 3303 | 63776166 | 296146 | 215353593 |
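
The cell-cleaning step above relies on two regular expressions: one that removes tags and one that collapses whitespace. As a minimal sketch of that step in isolation, the helper below (`clean_cell`, a hypothetical name) applies the same two patterns to a few hand-written `<td>` strings. The markup is made up for illustration and is not copied from the site.

```python
import re

TAG_RE = re.compile(r'<.*?>')  # non-greedy: removes one tag at a time
WS_RE = re.compile(r'\s+')     # any run of whitespace


def clean_cell(html_cell: str) -> str:
    """Strip tags from one serialized <td> and collapse whitespace."""
    text = TAG_RE.sub('', html_cell)
    return WS_RE.sub(' ', text).strip()


# Hypothetical cells; only the values 1, USA and 1219159 come from the preview.
sample_row = [
    '<td style="text-align:center;">1</td>',
    '<td><a href="country/us/">USA</a></td>',
    '<td style="font-weight: bold;">  1,219,159 </td>',
    # The opening tag spans two lines, so '<.*?>' cannot match it.
    '<td style="font-weight: bold;\n text-align:right;">5</td>',
]

cleaned = [clean_cell(cell) for cell in sample_row]
for value in cleaned:
    print(repr(value))
```

Because `.` does not match a newline by default, a tag whose attributes are spread over several lines is not removed by this pattern unless `re.DOTALL` is passed. The last sample cell shows this: its opening tag is left in the cleaned text.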


In [236]:
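# Assumption: the column labels are the <th> cells of the table scraped above
col_labels = table.find_all('th')
all_header = []

# Serialize the list of <th> tags, strip the markup, and normalize whitespace
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()
cleantext2 = cleantext2.strip()
cleantext2 = re.sub(r'\s+', ' ', cleantext2)
all_header.append(cleantext2)

# Split the single header string on commas and keep the first 17 pieces
df_header = pd.DataFrame(all_header)
df_header = df_header[0].str.split(',', expand=True)
df_header = df_header.iloc[0:1, :17]
# Drop pieces 2 and 11 so the header lines up with the 15 columns of df
df_header.drop(columns=[2], inplace=True)
df_header.drop(columns=[11], inplace=True)
df_header.columns = range(df_header.shape[1])

# Stack the header row on top of the data and promote it to column names
frames = [df_header, df]
data = pd.concat(frames)
data = data.rename(columns=data.iloc[0])
# Drop the first column (the row rank), renumber the rows, and remove spaces from the names
data.drop(data.columns[0], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data.columns = data.columns.str.replace(' ', '')
# Drop the header row itself, which is still the first row of the frame
data.drop(data.index[0], axis=0, inplace=True)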
data.head()

| | Country | TotalCases | NewCases | TotalDeaths | NewDeaths | TotalRecovered | NewRecovered | ActiveCases | Serious | TotCases/1Mpop | Deaths/1Mpop | TotalTests | Tests/1Mpop | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | USA | 111790779 | | 1219159 | | 109761610.0 | | 810010.0 | 1009.0 | 333898 | 3641 | 1186851502 | 3544901 | 334805269 |
| 2 | India | 45034136 | | 533547 | | | | | | 32016 | 379 | 935879495 | 665334 | 1406631776 |
| 3 | France | 40138560 | | 167642 | | 39970918.0 | | 0.0 | | 612013 | 2556 | 271490188 | 4139547 | 65584518 |
| 4 | Germany | 38827570 | | 182952 | | 38240600.0 | | 404018.0 | | 462874 | 2181 | 122332384 | 1458359 | 83883596 |
| 5 | Brazil | 38729836 | | 711249 | | 36249161.0 | | 1769426.0 | | 179843 | 3303 | 63776166 | 296146 | 215353593 |
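
Splitting the joined header string on commas also splits any label that itself contains a comma, which may be why extra pieces have to be dropped afterwards. As an alternative sketch, not the approach used above, each `<th>` can be read separately with `get_text`, giving one string per column. The HTML below is a hand-written stand-in and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two of the labels contain a comma.
html = """
<table>
  <tr>
    <th>#</th>
    <th>Country,<br/>Other</th>
    <th>Total<br/>Cases</th>
    <th>Serious,<br/>Critical</th>
  </tr>
  <tr><td>1</td><td>USA</td><td>111,790,779</td><td>1,009</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# One label per header cell; strip=True trims each text fragment before joining.
headers = [th.get_text(strip=True) for th in soup.find_all('th')]
print(headers)
```

With this approach the number of labels equals the number of header cells, so no positional clean-up is needed before the labels are assigned as column names.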


In [233]:
# Validation - column names
data.columns

Index(['Country', 'TotalCases', 'NewCases', 'TotalDeaths', 'NewDeaths',
       'TotalRecovered', 'NewRecovered', 'ActiveCases', 'Serious',
       'TotCases/1Mpop', 'Deaths/1Mpop', 'TotalTests', 'Tests/1Mpop',
       'Population'],
      dtype='object')
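
The column index printed above lists fourteen names, while the assignment asks for nine fields. Here is a minimal sketch of narrowing the frame to those nine. The mapping from the requested field names to the scraped column names is an assumption based on the labels, and `sample` is a two-row stand-in built from the preview rows so the block runs without network access.

```python
import pandas as pd

columns = ['Country', 'TotalCases', 'NewCases', 'TotalDeaths', 'NewDeaths',
           'TotalRecovered', 'NewRecovered', 'ActiveCases', 'Serious',
           'TotCases/1Mpop', 'Deaths/1Mpop', 'TotalTests', 'Tests/1Mpop',
           'Population']

# Values taken from the preview rows, written as plain strings.
rows = [
    ['USA', '111790779', '', '1219159', '', '109761610', '', '810010', '1009',
     '333898', '3641', '1186851502', '3544901', '334805269'],
    ['India', '45034136', '', '533547', '', '', '', '', '',
     '32016', '379', '935879495', '665334', '1406631776'],
]
sample = pd.DataFrame(rows, columns=columns)

# Assumed mapping: Country name -> Country, Total cases -> TotalCases, and so on.
required = ['Country', 'TotalCases', 'TotalDeaths', 'TotalRecovered',
            'ActiveCases', 'NewCases', 'NewDeaths', 'TotalTests', 'Population']

subset = sample[required]
print(subset.shape)
```

Selecting with a list of names also reorders the columns to match the list, and it raises a `KeyError` if any requested name is missing from the frame.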

<h3>Save & Read CSV</h3>

In [238]:
# Write file
data.to_csv('coronavirus_data.csv', index=False)

In [239]:
# Read file
new_df = pd.read_csv('coronavirus_data.csv')
new_df.head()

| | Country | TotalCases | NewCases | TotalDeaths | NewDeaths | TotalRecovered | NewRecovered | ActiveCases | Serious | TotCases/1Mpop | Deaths/1Mpop | TotalTests | Tests/1Mpop | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USA | 111790779 | | 1219159 | | 109761610.0 | | 810010.0 | 1009.0 | 333898 | 3641 | 1186851502 | 3544901 | 334805269 |
| 1 | India | 45034136 | | 533547 | | | | | | 32016 | 379 | 935879495 | 665334 | 1406631776 |
| 2 | France | 40138560 | | 167642 | | 39970918.0 | | 0.0 | | 612013 | 2556 | 271490188 | 4139547 | 65584518 |
| 3 | Germany | 38827570 | | 182952 | | 38240600.0 | | 404018.0 | | 462874 | 2181 | 122332384 | 1458359 | 83883596 |
| 4 | Brazil | 38729836 | | 711249 | | 36249161.0 | | 1769426.0 | | 179843 | 3303 | 63776166 | 296146 | 215353593 |
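
A minimal sketch of the same write-then-read round trip, using an in-memory buffer instead of a file so it runs anywhere. The three-column `sample` frame is a stand-in with values from the preview, and it illustrates how empty cells behave when the CSV is read back.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    'Country': ['USA', 'India'],
    'TotalCases': ['111790779', '45034136'],
    'TotalRecovered': ['109761610', ''],  # empty string: no value scraped
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.dtypes)
print(restored['TotalRecovered'].isna().tolist())
```

On reading, `pd.read_csv` infers numeric types and turns empty fields into `NaN`. A column with a missing value therefore becomes `float64`, which is one way a count can end up displayed with a trailing `.0`.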
