# Data Engineer - Web Scraping


## Objectives

In this part you will:

*   Use web scraping to get bank information


This lab uses Python and several Python libraries. Some of these libraries may already be installed in your lab environment or in SN Labs; others you may need to install yourself. The cell below installs these libraries when executed.


In [1]:
#!mamba install pandas==1.3.3 -y
#!mamba install requests==2.26.0 -y
!mamba install bs4==4.10.0 -y
!mamba install html5lib==1.1 -y

/bin/bash: mamba: command not found
/bin/bash: mamba: command not found


## Imports

Import any additional libraries you may need here.


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Extract Data Using Web Scraping


The Wikipedia page [https://en.wikipedia.org/wiki/List_of_largest_banks](https://en.wikipedia.org/wiki/List_of_largest_banks?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2022-01-01) lists the largest banks in the world by various parameters. Scrape the data from the 'By market capitalization' table and store it in a JSON file.


### Webpage Contents

Gather the contents of the webpage in text format using the `requests` library and assign it to the variable `html_data`.


In [3]:

URL = "https://en.wikipedia.org/wiki/List_of_largest_banks"
html_data = requests.get(URL).text
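The one-line request above neither sets a timeout nor checks the HTTP status. As a minimal sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_html` (not part of the lab) whose `get` parameter exists only so the function can be exercised without network access:

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and its `get` parameter are illustrative assumptions,
    not part of the lab or of the requests library.
    """
    # A timeout keeps the cell from hanging if the server never answers.
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper defined, `html_data = fetch_html(URL)` would play the same role as the line above.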


<b>Question 1</b> Run the following line and note its output, as it will be a quiz question:


In [4]:
html_data[101:124]

'List of largest banks -'
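Python slices are half-open, so `html_data[101:124]` returns the characters at positions 101 through 123, which is 23 characters in total. The exact offsets depend on the markup the server returns, so the result may differ if the page changes. A small sketch of the same slicing on a hypothetical `sample` string (a stand-in, not the real page):

```python
# Hypothetical stand-in for the start of a page; the real html_data differs.
sample = "<title>List of largest banks - Wikipedia</title>"

# Find where the title text starts, then take 23 characters (124 - 101).
start = sample.index("List")
end = start + (124 - 101)
snippet = sample[start:end]

print(snippet)
print(len(snippet))
```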

### Scraping the Data

<b>Question 2</b> Using the page contents and `BeautifulSoup`, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Display the first five rows using `head`.


Parse the contents of the webpage using BeautifulSoup.


In [5]:

soup = BeautifulSoup(html_data, 'html.parser')

Load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Using the empty dataframe `data` and the given loop, extract the necessary data from each row and append it to the dataframe.


In [6]:
data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])

# The fourth <tbody> on the page is assumed to hold the 'By market capitalization' table
for row in soup.find_all('tbody')[3].find_all('tr'):
    col = row.find_all("td")
    # Skip header and empty rows: only data rows have exactly three <td> cells
    if len(col) == 3:
        Bank_Name = col[1].text.strip()
        Market_Cap = col[2].text.strip()
        # Append the row in place; DataFrame.append was removed in pandas 2.0
        data.loc[len(data)] = [Bank_Name, Market_Cap]
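Selecting the table by position (`find_all('tbody')[3]`) breaks as soon as a table is added or removed earlier on the page. One possible alternative is to find the section heading first and take the table that follows it. The sketch below runs on a hypothetical, heavily simplified `sample_html`; it assumes the heading carries the id `By_market_capitalization`, which should be checked against the live page before relying on it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup (not real data).
sample_html = """
<h2 id="By_total_assets">By total assets</h2>
<table><tbody>
<tr><td>1</td><td>Other Bank</td><td>999</td></tr>
</tbody></table>
<h2 id="By_market_capitalization">By market capitalization</h2>
<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
<tr><td>1</td><td>Example Bank</td><td>123.45</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Locate the heading by its id, then take the first table after it.
heading = sample_soup.find(id="By_market_capitalization")
table = heading.find_next("table")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    # Header rows use <th>, so they have no <td> cells and are skipped.
    if len(cells) == 3:
        rows.append((cells[1].text.strip(), cells[2].text.strip()))

print(rows)
```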
    

**Question 3** Display the first five rows using the `head` function.


In [7]:
#Write your code here
data.head(5)

|   | Name | Market Cap (US$ Billion) |
|---|------|--------------------------|
| 0 | JPMorgan Chase | 400.37[6] |
| 1 | Industrial and Commercial Bank of China | 295.65 |
| 2 | Bank of America | 279.73 |
| 3 | Wells Fargo | 214.34 |
| 4 | China Construction Bank | 207.98 |
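The first value in the output, `400.37[6]`, still carries a footnote marker from the page, and every value in the column is stored as text. The lab does not ask for any cleaning, but if numeric values were needed, one possible approach is sketched below. `clean_market_cap` is a hypothetical helper, and the assumption that markers always look like `[digits]` is not verified against the full table.

```python
import re


def clean_market_cap(text):
    """Strip footnote markers such as '[6]' and return the value as a float.

    Hypothetical helper; assumes markers always have the form [digits].
    """
    without_notes = re.sub(r"\[\d+\]", "", text)
    # Remove thousands separators, if any, before converting.
    return float(without_notes.replace(",", "").strip())


print(clean_market_cap("400.37[6]"))
print(clean_market_cap("295.65"))
```

It could then be applied to the whole column with `data["Market Cap (US$ Billion)"].map(clean_market_cap)`.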


### Loading the Data

Usually you would load the `pandas` dataframe created above into a JSON file named `bank_market_cap.json` using the `to_json()` function. This time, however, the data will be sent to another team, who will split the data file into two files and inspect it. If you save the data, it will interfere with the next part of the assignment.


In [8]:
#Write your code here
data.to_json('bank_market_cap.json', orient='records', indent=5)
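For reference, `orient='records'` serialises each row as one JSON object inside a list. The sketch below shows this on a hypothetical two-row frame (the bank names and values are made up). Calling `to_json` without a path returns the JSON as a string, so nothing is written to disk.

```python
import json

import pandas as pd

# Hypothetical two-row frame with the same columns as `data`.
sample = pd.DataFrame(
    {
        "Name": ["Example Bank A", "Example Bank B"],
        "Market Cap (US$ Billion)": ["123.45", "67.89"],
    }
)

# With no path argument, to_json returns the JSON text instead of a file.
json_text = sample.to_json(orient="records")

# Parse the text back to confirm the structure: a list of row dictionaries.
records = json.loads(json_text)
print(records)
```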