**Note:** The databases used and the codes can be found in my GitHub repository: https://github.com/willneto/WebScraping

## Web Scraping - CVM Website - Paid-In Capital & Treasury Shares (VALE S.A.)
**Paid-In Capital = Capital Integralizado; Treasury Shares = Ações em Tesouraria**

- <span style='color:green'> Before we start: this program may stop working or raise errors if the websites it accesses change. If that happens, please let me know so that I can update the code and keep it useful. Thanks!</span>


In this study, we show how to extract information about a particular company in a given quarter from the website of the CVM (the government agency that supervises and regulates the Brazilian financial market). As an example, we retrieve the data for the company Vale S.A. as of June 30, 2020.

This information is presented on the CVM website as follows (we will call this the **main page**):

<img src="./img/pg 1.jpg">

**Note:** We mark the information we want to import with a red square and highlight with arrows what is worth observing on the CVM website page.

<span style='color:red'> However, this page can only be reached after first accessing the following page (we will call this the **home page**): </span>

- link: https://www.rad.cvm.gov.br/ENET/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento=95152&CodigoTipoInstituicao=1

<img src="./img/pg 2.jpg">

In this last image, we draw attention to two points:

- To enter the page you want (first image), you must click on the arrow (marked with a red circle) and then select the option "Dados da Empresa."

- We also highlight a number in the website URL. This number plays a significant role later in the code, so keep it in mind.

**To proceed, we must understand what kind of request is made when we access the main page:**

<img src="./img/info 1.jpg">

Note that the request is of type **GET**, so we can reach the information through the website's URL. The previous image also highlights some vital information in the request URL.

**Note:** When accessing the site, the browser issues several requests. We therefore need to identify which one is the main request for data; it is highlighted with a red square (on the left side of the image).

Next, let's look a little more carefully at the GET request information:

<img src="./img/info 2.jpg">

- This information is contained in the source code of the **home page**!

- <span style='color:red'> Recall that to access the home page, we needed a number that is part of the URL.</span>

- **This number is found in a CSV file made available by CVM.**

We could take the URL generated by the GET request and scrape from it directly. However, if the goal were to scrape many companies over several periods, collecting each URL by hand would be very laborious and, beyond a certain number of companies and periods, impractical. We will therefore follow this path:

<img src="./img/way.jpg">

Libraries for web scraping

In [1]:
from urllib import request
from bs4 import BeautifulSoup

Libraries for data science

In [2]:
import pandas as pd
import numpy as np
import datetime

Library to manipulate string

In [3]:
from unidecode import unidecode ## Library for removing accents

Libraries for downloading and unzipping files

In [4]:
import wget
from zipfile import ZipFile

## Step I

Link to the CVM website where the files containing the document numbers are published:

http://dados.cvm.gov.br/dados/CIA_ABERTA/DOC/ITR/DADOS

**We will only download the 2020 file.**

<img src="./img/cvm.jpg">

In [5]:
##Download
wget.download("http://dados.cvm.gov.br/dados/CIA_ABERTA/DOC/ITR/DADOS/" + "itr_cia_aberta_2020.zip")

'itr_cia_aberta_2020.zip'

In [6]:
#Unzipping
ZipFile("itr_cia_aberta_2020.zip",'r').extractall("CVM Data")
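The download and extraction steps above can also be done with the standard library alone, keeping the archive in memory instead of writing the ZIP to disk. This is only a minimal sketch; `download_bytes` and `extract_zip_bytes` are hypothetical helper names that do not appear in the original code.

```python
import io
import zipfile
from urllib import request


def download_bytes(url):
    """Fetch a URL and return the raw response body (performs a network call)."""
    with request.urlopen(url) as response:
        return response.read()


def extract_zip_bytes(data, dest_dir):
    """Extract an in-memory ZIP archive into dest_dir and return the member names."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

A call such as `extract_zip_bytes(download_bytes(base_url + "itr_cia_aberta_2020.zip"), "CVM Data")`, with `base_url` set to the CVM link above, would then replace the two cells.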

In [7]:
#Importing as a Data Frame
data_cvm_2020 = pd.read_csv("CVM Data/itr_cia_aberta_2020.csv",sep=";",decimal=",",encoding = "ISO-8859-1") #Forward slash works on every OS and avoids the invalid "\i" escape

In [8]:
#Look at the Data Frame!
data_cvm_2020.head()

| | CNPJ_CIA | DT_REFER | VERSAO | DENOM_CIA | CD_CVM | CATEG_DOC | ID_DOC | DT_RECEB | LINK_DOC |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 00.000.000/0001-91 | 2020-03-31 | 1 | BCO BRASIL S.A. | 1023 | ITR | 92782 | 2020-05-07 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |
| 1 | 00.000.000/0001-91 | 2020-06-30 | 1 | BCO BRASIL S.A. | 1023 | ITR | 95714 | 2020-08-06 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |
| 2 | 00.000.000/0001-91 | 2020-09-30 | 1 | BCO BRASIL S.A. | 1023 | ITR | 98051 | 2020-11-05 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |
| 3 | 00.000.208/0001-00 | 2020-03-31 | 1 | BRB BCO DE BRASILIA S.A. | 14206 | ITR | 93168 | 2020-05-15 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |
| 4 | 00.000.208/0001-00 | 2020-06-30 | 1 | BRB BCO DE BRASILIA S.A. | 14206 | ITR | 96358 | 2020-08-20 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |


In [9]:
##Information from Vale S.A.
data_cvm_2020.loc[data_cvm_2020["DENOM_CIA"]=="VALE S.A."]

| | CNPJ_CIA | DT_REFER | VERSAO | DENOM_CIA | CD_CVM | CATEG_DOC | ID_DOC | DT_RECEB | LINK_DOC |
|---|---|---|---|---|---|---|---|---|---|
| 1582 | 33.592.510/0001-54 | 2020-03-31 | 1 | VALE S.A. | 4170 | ITR | 92568 | 2020-04-28 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |
| 1583 | 33.592.510/0001-54 | 2020-06-30 | 1 | VALE S.A. | 4170 | ITR | 95152 | 2020-07-29 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |
| 1584 | 33.592.510/0001-54 | 2020-09-30 | 1 | VALE S.A. | 4170 | ITR | 97927 | 2020-10-28 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |
| 1585 | 33.592.510/0001-54 | 2020-09-30 | 2 | VALE S.A. | 4170 | ITR | 97948 | 2020-10-29 | http://www.rad.cvm.gov.br/ENETCONSULTA/frmDown... |


In [10]:
#Let's create a variable to store the number we need
id_doc = data_cvm_2020.loc[(data_cvm_2020["DENOM_CIA"]=="VALE S.A.") &
                           (data_cvm_2020["VERSAO"]==1) &
                           (data_cvm_2020["DT_REFER"]=="2020-06-30")]["ID_DOC"].values[0]

In [11]:
#Document number
id_doc

95152
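The filter above fixes `VERSAO == 1`, yet the Vale S.A. rows show that a reference date can have more than one version (2020-09-30 has versions 1 and 2). As a sketch only, a hypothetical helper `latest_doc_id` could pick the highest version instead. It assumes the rows are plain dictionaries, for example from `data_cvm_2020.to_dict("records")`.

```python
def latest_doc_id(records, company, ref_date):
    """Return ID_DOC of the highest VERSAO for a company and reference date."""
    matches = [
        row for row in records
        if row["DENOM_CIA"] == company and row["DT_REFER"] == ref_date
    ]
    if not matches:
        raise LookupError(f"no filing for {company} on {ref_date}")
    return max(matches, key=lambda row: int(row["VERSAO"]))["ID_DOC"]


# Rows copied from the Vale S.A. output above (only the columns used here).
vale_rows = [
    {"DENOM_CIA": "VALE S.A.", "DT_REFER": "2020-06-30", "VERSAO": 1, "ID_DOC": 95152},
    {"DENOM_CIA": "VALE S.A.", "DT_REFER": "2020-09-30", "VERSAO": 1, "ID_DOC": 97927},
    {"DENOM_CIA": "VALE S.A.", "DT_REFER": "2020-09-30", "VERSAO": 2, "ID_DOC": 97948},
]
print(latest_doc_id(vale_rows, "VALE S.A.", "2020-09-30"))  # 97948
```

Whether a later version also requires a different `Versao` value in the main page URL is not verified here, so treat that as an open assumption.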

## Step II

In [12]:
##Let's create the URL with the document number
url_I = ("https://www.rad.cvm.gov.br/ENET/frmGerenciaPaginaFRE.aspx?NumeroSequencialDocumento="+ 
         str(id_doc)+"&CodigoTipoInstituicao=1")
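As an alternative to string concatenation, the same URL can be assembled with `urllib.parse.urlencode`, which handles the `?`, `=` and `&` separators. This is a sketch; `HOME_PAGE` and `build_home_url` are hypothetical names introduced here.

```python
from urllib.parse import urlencode

HOME_PAGE = "https://www.rad.cvm.gov.br/ENET/frmGerenciaPaginaFRE.aspx"


def build_home_url(id_doc):
    """Build the home page URL for a given document number."""
    query = urlencode({
        "NumeroSequencialDocumento": id_doc,
        "CodigoTipoInstituicao": 1,
    })
    return f"{HOME_PAGE}?{query}"


print(build_home_url(95152))
```

For document 95152, this should produce the same link shown at the beginning of the article.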

We will set up the request for the home page

In [13]:
req_pre = request.Request(url_I, method="GET") ##Building the request
r_pre = request.urlopen(req_pre) #Sending the request
soup_pre = BeautifulSoup(r_pre, "html.parser") #Storing the page source in a BeautifulSoup object

At this point, we'll save the Cookie for future requests. **This step is essential** because we can't get the site information correctly without the Cookie!

In [14]:
c = r_pre.getheader("Set-Cookie")
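A `Set-Cookie` header usually carries attributes such as `path` next to the `name=value` pair, while a `Cookie` header only needs the pair. The sketch below uses the standard `http.cookies.SimpleCookie` class to keep just the pairs. The helper name `cookie_header_value` and the sample string are made up for illustration, and it is an assumption, not something tested against the CVM site, that the server accepts the trimmed form.

```python
from http.cookies import SimpleCookie


def cookie_header_value(set_cookie):
    """Reduce a Set-Cookie header to the 'name=value' pairs a Cookie header expects."""
    jar = SimpleCookie()
    jar.load(set_cookie)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Made-up example value, not one returned by the CVM site.
example = "SessionId=abc123; path=/; HttpOnly"
print(cookie_header_value(example))  # SessionId=abc123
```

Passing the header through unchanged, as the cell above does, is what this article actually relies on.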

**Now, inside the object of the type BeautifulSoup, we will identify the information we need for the main page request.**

Document type

In [15]:
#Let's create the variable
dis_dc = (str(soup_pre.find_all(id="cmbQuadro")[0]).split("value=")[-1].
          split("NomeTipoDocumento=")[-1].split("&")[0])

In [16]:
#Result!
dis_dc

'ITR'

Company Name

In [17]:
#Let's create the variable
empresa = (str(soup_pre.find_all(id="cmbQuadro")[0]).split("value=")[-1].
           split("Empresa=")[-1].split("&")[0].replace(" ","%20"))

In [18]:
#Result!
empresa

'VALE%20S.A.'

Information date

In [19]:
#Let's create the variable
data = str(soup_pre.find_all(id="cmbQuadro")[0]).split("value=")[-1].split("Referencia=")[-1].split("&")[0]

In [20]:
#Result!
data

'2020-06-30'
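The three cells above split the same string on different markers. Since the value looks like a URL with a query string, it could instead be parsed once with `urllib.parse`. This is a sketch under that assumption; `query_params` is a hypothetical helper and the sample value is only shaped like the final URL built later in this article.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(option_value):
    """Return the query parameters of a URL-like string as a flat dict."""
    parsed = parse_qs(urlsplit(option_value).query)
    return {key: values[0] for key, values in parsed.items()}


# Illustrative value shaped like the final URL in this article; the real
# attribute on the CVM page may differ.
sample = ("frmDadosComposicaoCapitalITR.aspx?NomeTipoDocumento=ITR"
          "&Empresa=VALE S.A.&DataReferencia=2020-06-30&Versao=1")
params = query_params(sample)
print(params["NomeTipoDocumento"], params["Empresa"], params["DataReferencia"])
```

With BeautifulSoup, the input string would presumably come from the last `<option>` of `cmbQuadro`, for instance `soup_pre.find(id="cmbQuadro").find_all("option")[-1]["value"]`, assuming that element is a `<select>`.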

Document number:

- We already have this information. However, we extract it again from the page so that an error is easier to locate.

In [21]:
#Let's create the variable
n_doc = str(soup_pre).split("window.frames[0].location=")[1].split("NumeroSequencialDocumento=")[1].split("&")[0]

In [22]:
#Result!
n_doc

'95152'

CVM code

In [23]:
#Let's create the variable
cd_cvm = str(soup_pre).split("window.frames[0].location=")[1].split("RegistroCvm=")[1].split("&")[0]

In [24]:
#Result!
cd_cvm

'1789'

HASH code

In [25]:
#Let's create the variable
cd_hash= str(soup_pre.find_all(id="hdnHash")[0]).split("value=")[-1][1:-3]

In [26]:
#Result!
cd_hash

'yW5OVr2URmRpOrNReF0FcxKTAoZVRISe7lAYUsA3A'
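The slice `[1:-3]` depends on exactly how the tag is serialized, so it may break if the markup changes slightly. Reading the `value` attribute directly is less brittle. The sketch below does that with the standard `html.parser` module; `InputValueFinder`, `input_value` and the sample markup are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class InputValueFinder(HTMLParser):
    """Collect the value attribute of the first <input> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "input" and self.value is None
                and attributes.get("id") == self.target_id):
            self.value = attributes.get("value")


def input_value(html, target_id):
    """Return the value of the input with target_id, or None if absent."""
    finder = InputValueFinder(target_id)
    finder.feed(html)
    return finder.value


# Made-up markup; the real hdnHash element may carry other attributes.
snippet = '<input type="hidden" name="hdnHash" id="hdnHash" value="abc123" />'
print(input_value(snippet, "hdnHash"))  # abc123
```

With the page already parsed, `soup_pre.find(id="hdnHash")["value"]` should give the same result in one line.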

**Building the URL to access the main page**

- Note that we identified which values should be kept constant within the URL and which ones we needed to change to access the page!

In [27]:
url_II = ("https://www.rad.cvm.gov.br/ENET/frmDadosComposicaoCapitalITR.aspx?"+
          "Grupo=Dados+da+Empresa&Quadro=Composi%c3%a7%c3%a3o+do+Capital&NomeTipoDocumento="+dis_dc+"&"+
          "Empresa="+empresa+"&"+
          "DataReferencia="+data+"&"+
          "Versao=1&CodTipoDocumento=3&"+
          "NumeroSequencialDocumento="+n_doc+"&"+
          "NumeroSequencialRegistroCvm="+cd_cvm+"&"+
          "CodigoTipoInstituicao=1&"+
          "Hash="+cd_hash)

We'll fix string problems (removing accents and special characters)

In [28]:
url_II = unidecode(url_II)

## Step III

In [29]:
##URL where we will access
url_II

'https://www.rad.cvm.gov.br/ENET/frmDadosComposicaoCapitalITR.aspx?Grupo=Dados+da+Empresa&Quadro=Composi%c3%a7%c3%a3o+do+Capital&NomeTipoDocumento=ITR&Empresa=VALE%20S.A.&DataReferencia=2020-06-30&Versao=1&CodTipoDocumento=3&NumeroSequencialDocumento=95152&NumeroSequencialRegistroCvm=1789&CodigoTipoInstituicao=1&Hash=yW5OVr2URmRpOrNReF0FcxKTAoZVRISe7lAYUsA3A'
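The concatenation in Step II can also be expressed as a dictionary of parameters passed to `urllib.parse.urlencode`, which percent-encodes accents and spaces on its own. This is a sketch with hypothetical names (`MAIN_PAGE`, `build_main_url`). Note that `urlencode` writes uppercase escapes (`%C3%A7`) and `+` for spaces, unlike the hand-built string. The sketch assumes the server treats both forms the same, which has not been verified.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

MAIN_PAGE = "https://www.rad.cvm.gov.br/ENET/frmDadosComposicaoCapitalITR.aspx"


def build_main_url(doc_type, company, ref_date, n_doc, cd_cvm, cd_hash):
    """Assemble the main page URL from the values scraped off the home page."""
    query = urlencode({
        "Grupo": "Dados da Empresa",
        "Quadro": "Composição do Capital",
        "NomeTipoDocumento": doc_type,
        "Empresa": company,  # raw name, e.g. "VALE S.A.", not pre-encoded
        "DataReferencia": ref_date,
        "Versao": 1,
        "CodTipoDocumento": 3,
        "NumeroSequencialDocumento": n_doc,
        "NumeroSequencialRegistroCvm": cd_cvm,
        "CodigoTipoInstituicao": 1,
        "Hash": cd_hash,
    })
    return f"{MAIN_PAGE}?{query}"


# "example-hash" is a placeholder, not a real hash from the site.
url = build_main_url("ITR", "VALE S.A.", "2020-06-30", "95152", "1789", "example-hash")
print(parse_qs(urlsplit(url).query)["Quadro"])  # ['Composição do Capital']
```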

<span style='color:red'>
The first version of this article did not have the code below.</span> However, the program later stopped working correctly, so we added this step, which accesses the main page URL once with the cookie. With this extra access, the rest of the code works correctly again.

In [30]:
op = request.build_opener() #Opener for the extra access
op.addheaders.append(('Cookie', c)) #Adding the cookie
pre = op.open(url_II) ##Opening the main page URL once
pre.close() ##Closing the connection

At this point, the reader can try an interesting exercise: access the page at the URL above **without having opened the home page.** The table will then come back with all values equal to zero. This problem happens **because we need to open the home page first using the same cookie.** So next, we will carry out this process:

In [31]:
opener = request.build_opener() #Opener for the home page
opener.addheaders.append(('Cookie', c)) #Adding the cookie
f = opener.open(url_I) ##Opening the home page (it stays open for the next request)

With the home page **open**, we can make the request for the main page.

In [32]:
req= request.Request(url_II, method="GET") #Building the request for the main page
req.add_header("Cookie",c) #Adding the cookie
r = request.urlopen(req) #Sending the request
soup= BeautifulSoup(r, "html.parser") #Storing the page source in a BeautifulSoup object
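The order of operations above (open the home page, then request the main page while it is still open, with the same cookie) can be wrapped in one function. The sketch below uses hypothetical names (`make_session_opener`, `fetch_main_page`). It also swaps the manual `Cookie` header for the standard `http.cookiejar.CookieJar`, assuming that automatic cookie handling is equivalent for this site, which has not been checked.

```python
from http.cookiejar import CookieJar
from urllib import request


def make_session_opener():
    """Create an opener that stores and resends cookies automatically."""
    return request.build_opener(request.HTTPCookieProcessor(CookieJar()))


def fetch_main_page(open_url, url_home, url_main):
    """Open the home page, then read the main page while it is still open."""
    with open_url(url_home):
        with open_url(url_main) as response:
            return response.read()
```

A call would look like `html = fetch_main_page(make_session_opener().open, url_I, url_II)`. The `with` blocks also close both responses, even if reading fails.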

The Pandas library provides the function `read_html`, which extracts the tables contained in HTML code; we will use it to simplify the process. For this, we have to:

- Convert the BeautifulSoup object to string format.
- Take the right element of the result: `read_html` returns a list of tables, and since our page has only one table, it is at position zero of the list.
- Finally, save the information in a data frame called "table".

In [33]:
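#Importing
from io import StringIO
#str(soup) gives the page HTML (soup.format_string is a method, not the markup);
#StringIO avoids the literal-string deprecation warning in recent pandas versions
table = pd.read_html(StringIO(str(soup)))[0]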

In [34]:
#Result!
table

| | 0 | 1 |
|---|---|---|
| 0 | Número de Ações (Mil) | 30/06/2020 |
| 1 | Do Capital Integralizado | Do Capital Integralizado |
| 2 | Ordinárias | 5.284.475 |
| 3 | Preferenciais | 0 |
| 4 | Total | 5.284.475 |
| 5 | Em Tesouraria | Em Tesouraria |
| 6 | Ordinárias | 154.564 |
| 7 | Preferenciais | 0 |
| 8 | Total | 154.564 |


Let's create another data frame to save the information in a more organized way!

**Note:**
- PN = Preferred Stock
- ON = Common Stock

In [35]:
##Creating the Data Frame
df = pd.DataFrame(table.iloc[[2,3,4,6,7,8]][1].values,
                  columns=['Amount'],
                  index=['Paid-In Capital (ON)',
                         'Paid-In Capital (PN)',
                         'Paid-In Capital (Total)',
                         'Treasury Shares (ON)',
                         'Treasury Shares (PN)',
                         'Treasury Shares (Total)'])

In [36]:
#Result!
df

| | Amount |
|---|---|
| Paid-In Capital (ON) | 5.284.475 |
| Paid-In Capital (PN) | 0 |
| Paid-In Capital (Total) | 5.284.475 |
| Treasury Shares (ON) | 154.564 |
| Treasury Shares (PN) | 0 |
| Treasury Shares (Total) | 154.564 |
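The amounts are stored as text with dots as thousands separators, and the source table reports them in thousands of shares ("Mil"). A small sketch for converting them to integers follows. It also checks for the all-zero table that, as noted earlier, appears when the home page was not opened first. `parse_br_int` and `all_zero` are hypothetical helper names.

```python
def parse_br_int(text):
    """Convert a Brazilian-formatted integer such as '5.284.475' to int."""
    return int(text.replace(".", ""))


def all_zero(values):
    """Return True when every amount is zero (symptom of a missing session)."""
    return all(parse_br_int(value) == 0 for value in values)


# Values copied from the table above (thousands of shares).
amounts = ["5.284.475", "0", "5.284.475", "154.564", "0", "154.564"]
print([parse_br_int(value) for value in amounts])
print(all_zero(amounts))  # False
```

On the data frame itself, something like `df["Amount"].str.replace(".", "", regex=False).astype(int)` should do the same conversion, assuming the column holds strings.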


**Let's now export this information through a CSV file!**

In [37]:
df.to_csv("VALE.csv")

<span style='color:red'> The home page is open, so we need to close it.</span>

In [38]:
f.close() ##Closing the website!