# Web scraping with Python: part 1

This notebook introduces an easy approach to web scraping using the popular pandas library. The library has several "read" functions for reading CSV, Excel, and HTML files.

- **Advantage:** The `read_html` function is very easy to apply and reads all the HTML tables on a webpage. Numeric data on websites is usually presented as a table.

- **Drawback:** The `read_html` function reads only HTML tables. If the data on the page is not marked up as an HTML table (for example, it is laid out with other HTML elements), pandas will not read it.

Please note that the `read_html` function depends on several other libraries, which you need to install from the command line (cmd). Open a new command line window and type the following commands:
- pip install lxml
- pip install html5lib
- pip install BeautifulSoup4

All three of the above libraries are very useful for web scraping, and the `read_html` function from pandas uses them directly to parse tables from a website.
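Before running the notebook, it can help to confirm that the three parsers are importable. The snippet below is a minimal sketch using only the standard library. Note that the import names differ from the pip package names: BeautifulSoup4 is imported as `bs4`.

```python
from importlib.util import find_spec

# Import names of the three parsers (BeautifulSoup4 is imported as "bs4").
parsers = ["lxml", "html5lib", "bs4"]

# find_spec returns None when a top-level module cannot be found.
status = {name: find_spec(name) is not None for name in parsers}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any library reported as missing can be installed with the corresponding `pip install` command above.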

In [1]:
import pandas as pd

In [2]:
# paste the link to the webpage that you want to read
url = "http://rate.am/en/armenian-dram-exchange-rates/banks/non-cash"

In [3]:
# read the webpage and save it to the variable "scraped_data"
scraped_data = pd.read_html(url)

In [4]:
# print it to see what was scraped
print(scraped_data)

[       0                 1                     2                         3  \
0  Banks  Exchanges points  Credit Organizations  Investment Organizations   

              4                    5  
0  Central bank  International Rates  ,                         0                                                  1  \
0  Rates by previous date  Select date: Select time:  07:00 \t07:15 \t07:...   
1                     NaN                                                NaN   

    2   3  
0 NaN NaN  
1 NaN NaN  ,                                                    0                       1   \
0                                                 NaN                    Bank   
1                                                 Buy                    Sell   
2                                                  1.         ArmBusinessBank   
3                                                  2.           Converse Bank   
4                                                  3.              Ameriabank   

As you can see, the output is hard to read. This is because the webpage contains more than one table, and `read_html` returns all of them as a list of DataFrames. The table with exchange rates that we need is the 3rd table on the page, which means it has index 2:

In [6]:
our_table = scraped_data[2]

In [7]:
# view our_table
our_table.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,,Bank,Branches,Date,1 USD \t1 EUR \t1 RUR \t1 GBP \t1 GEL \t1 CHF ...,1 USD \t1 EUR \t1 RUR \t1 GBP \t1 GEL \t1 CHF ...,1 USD \t1 EUR \t1 RUR \t1 GBP \t1 GEL \t1 CHF ...,1 USD \t1 EUR \t1 RUR \t1 GBP \t1 GEL \t1 CHF ...,,,,,
1,Buy,Sell,Buy,Sell,Buy,Sell,Buy,Sell,,,,,
2,1.,ArmBusinessBank,,52,"17 Apr, 20:00",484,487.50,512.10,519.1,8.52,8.75,602.1,614.1
3,2.,Converse Bank,,35,"17 Apr, 20:00",484,488,512,520.0,8.57,8.78,604.0,613.0
4,3.,Ameriabank,,13,"17 Apr, 20:00",483.75,487.75,512.50,520.5,8.53,8.78,604.0,614.0
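Counting tables by hand can break if the page layout changes. As an alternative, `read_html` accepts a `match` argument that keeps only the tables whose text matches a given string or regular expression. The sketch below uses a small, made-up HTML snippet rather than the live page, so the table contents and the search string are assumptions for illustration only.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature page: a navigation table followed by a rates table.
html = """
<table>
  <tr><td>Home</td><td>About</td></tr>
</table>
<table>
  <tr><th>Bank</th><th>USD</th></tr>
  <tr><td>Ameriabank</td><td>483.75</td></tr>
  <tr><td>Converse Bank</td><td>484</td></tr>
</table>
"""

# Without match, every <table> in the document is returned, in page order.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing matching text are returned.
rate_tables = pd.read_html(StringIO(html), match="Ameriabank")

print(len(all_tables), len(rate_tables))
print(rate_tables[0])
```

The result is still a list, so the matched table is taken with `rate_tables[0]`.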


The data is now semi-structured, but it needs more cleaning. For example, we do not need the first 2 rows, which only provide headers for the table, and we also do not need the last rows: we need only the observations about commercial banks. As we start from the 3rd row (index 2) and there are 17 commercial banks, the slice must stop at index 2+17=19. The end of a slice is exclusive, so `[2:19]` selects the rows with indices 2 through 18.

In [9]:
our_data = scraped_data[2][2:19]

In [10]:
our_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
2,1.0,ArmBusinessBank,,52,"17 Apr, 20:00",484.0,487.5,512.1,519.1,8.52,8.75,602.1,614.1
3,2.0,Converse Bank,,35,"17 Apr, 20:00",484.0,488.0,512.0,520.0,8.57,8.78,604.0,613.0
4,3.0,Ameriabank,,13,"17 Apr, 20:00",483.75,487.75,512.5,520.5,8.53,8.78,604.0,614.0
5,4.0,Artsakhbank,,24,"17 Apr, 20:00",484.5,487.5,513.0,523.0,8.55,8.8,602.0,616.0
6,5.0,HSBC Bank Armenia,,9,"17 Apr, 20:00",484.0,488.0,510.5,522.5,8.49,8.81,602.0,616.0
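The row arithmetic above can be checked on a small stand-in table. The frame below is not the scraped data: it is a hypothetical table with 2 header rows, 17 bank rows, and 3 trailing rows, built only to show which rows a `2:19` slice keeps.

```python
import pandas as pd

# Stand-in for scraped_data[2]: 2 header rows, 17 banks, 3 trailing rows.
rows = ["header"] * 2 + [f"bank_{i}" for i in range(1, 18)] + ["footer"] * 3
table = pd.DataFrame({"name": rows})

# Positions 2 through 18 are selected; position 19 is excluded.
banks = table.iloc[2:19]

print(len(banks))                        # 17
print(banks.index[0], banks.index[-1])   # 2 18

# Optionally renumber the rows from 0 after slicing.
banks = banks.reset_index(drop=True)
```

Using `.iloc` makes it explicit that the slice is by position. The optional `reset_index(drop=True)` call is a design choice: it discards the original row labels, which is convenient once the header rows are gone.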


Now our data is structured and can be used for analysis. Yet it has some additional columns that we will not use. Let's drop those columns all at once.

In [11]:
# create list of columns to be dropped
dropable = [0,2,3]

In [12]:
# drop the unnecessary columns to have the final data
# axis=1 argument tells that we want to drop a column, not a row
data = our_data.drop(dropable,axis=1)

In [13]:
# view the first 5 observations
data.head()

Unnamed: 0,1,4,5,6,7,8,9,10,11,12
2,ArmBusinessBank,"17 Apr, 20:00",484.0,487.5,512.1,519.1,8.52,8.75,602.1,614.1
3,Converse Bank,"17 Apr, 20:00",484.0,488.0,512.0,520.0,8.57,8.78,604.0,613.0
4,Ameriabank,"17 Apr, 20:00",483.75,487.75,512.5,520.5,8.53,8.78,604.0,614.0
5,Artsakhbank,"17 Apr, 20:00",484.5,487.5,513.0,523.0,8.55,8.8,602.0,616.0
6,HSBC Bank Armenia,"17 Apr, 20:00",484.0,488.0,510.5,522.5,8.49,8.81,602.0,616.0
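As a side note, `drop` also accepts a `columns=` keyword, which some readers find clearer than `axis=1`. The sketch below uses a one-row stand-in frame (an assumption, not the scraped data) to show that both the keyword form works and that `drop` returns a new DataFrame instead of modifying the original.

```python
import pandas as pd

# Stand-in frame with integer column labels, like those produced by read_html.
frame = pd.DataFrame(
    [[1.0, "Ameriabank", None, 13, 483.75]],
    columns=[0, 1, 2, 3, 4],
)

# columns=[...] is equivalent to passing the labels together with axis=1.
trimmed = frame.drop(columns=[0, 2, 3])

print(list(trimmed.columns))   # [1, 4]
print(list(frame.columns))     # [0, 1, 2, 3, 4] (the original is unchanged)
```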


Now we have structured data that can be used for further analysis. An optional last step is renaming the columns. The code below assigns a list of strings to `data.columns` as the new column names.

In [14]:
data.columns = ["Bank_name","Date/Time","USD_Buy","USD_Sell","EUR_Buy","EUR_Sell","RUB_Buy","RUB_Sell","GBP_Buy","GBP_Sell"]

In [15]:
data.head()

Unnamed: 0,Bank_name,Date/Time,USD_Buy,USD_Sell,EUR_Buy,EUR_Sell,RUB_Buy,RUB_Sell,GBP_Buy,GBP_Sell
2,ArmBusinessBank,"17 Apr, 20:00",484.0,487.5,512.1,519.1,8.52,8.75,602.1,614.1
3,Converse Bank,"17 Apr, 20:00",484.0,488.0,512.0,520.0,8.57,8.78,604.0,613.0
4,Ameriabank,"17 Apr, 20:00",483.75,487.75,512.5,520.5,8.53,8.78,604.0,614.0
5,Artsakhbank,"17 Apr, 20:00",484.5,487.5,513.0,523.0,8.55,8.8,602.0,616.0
6,HSBC Bank Armenia,"17 Apr, 20:00",484.0,488.0,510.5,522.5,8.49,8.81,602.0,616.0
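Assigning a list to `data.columns` relies on the names being in exactly the same order as the columns. A possible alternative, sketched below on a small stand-in frame, is `rename` with a dictionary that maps each old label to its new name. The frame and the shortened mapping are assumptions for illustration.

```python
import pandas as pd

# Stand-in frame using some of the integer labels left after dropping columns.
frame = pd.DataFrame(
    [["Ameriabank", "17 Apr, 20:00", 483.75, 487.75]],
    columns=[1, 4, 5, 6],
)

# Map each old label to its new name; the order of the mapping does not matter.
new_names = {1: "Bank_name", 4: "Date/Time", 5: "USD_Buy", 6: "USD_Sell"}
renamed = frame.rename(columns=new_names)

print(list(renamed.columns))
```

With a mapping, a mismatch between names and columns is easier to spot, because each new name sits next to the label it replaces.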


If you wish to work further on the dataset using another programming language or software, you may save the data to a CSV file with the following command:

In [None]:
data.to_csv("rate_am_data.csv")
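By default, `to_csv` also writes the row labels (here 2, 3, 4, ...) as an unnamed first column. If those labels are not needed, passing `index=False` leaves them out. The sketch below uses a small stand-in frame and a temporary file (both assumptions for illustration) and reads the file back to show the result.

```python
import os
import tempfile

import pandas as pd

# Stand-in frame whose row labels do not start at 0, as in the sliced data.
frame = pd.DataFrame(
    {"Bank_name": ["Ameriabank", "Converse Bank"], "USD_Buy": [483.75, 484.0]},
    index=[4, 3],
)

# Write to a temporary location without the row labels.
path = os.path.join(tempfile.mkdtemp(), "rate_am_sample.csv")
frame.to_csv(path, index=False)

# Read the file back to confirm that only the two data columns were saved.
restored = pd.read_csv(path)
print(restored)
```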