# Alderman's Analyzer

This script collects data on the aldermen's propositions from the website of the **City Hall of Indaial**, in the state of Santa Catarina, Brazil, and stores it in a database that supports incremental updates and feeds a **Power BI** dashboard.

This project was created for the Observatório Social, a project born in Indaial, in Santa Catarina.

**Python** is used for data collection and web scraping, and a local **SQL Server** database stores the data. Later, the data is consumed to create a dashboard.

## 1. Understanding the project

We intend to create a script that scrapes the City Hall of Indaial website to collect the measures of public interest that the aldermen suggest to the competent authorities and, with this data, to evaluate their engagement throughout the year.

The website is https://camaraindaial.sc.gov.br/pg/proposicoes, and we will follow the first option, which shows all propositions of each year:

<img src="Indaial's website.png" alt="Indaial's website" width="800"/>

Clicking on the link shows a list of all propositions, starting with the most recent one:

<img src="Propositions.png" alt="Propositions" width="800"/>

Finally, clicking on the first entry, which is the most recent proposition, shows the information we are going to collect:

<img src="Proposition940.png" alt="Proposition940" width="800"/>

The information contained on the page is:

- The date of the meeting
- The current situation of the proposition
- The subject of the proposition
- The author, and finally
- The text that explains what the proposition is about

### 1.1. Tools used and their versions

We will use the following free software:

- Python in version 3.9.12
    - Specific libraries in this project:
        - pip install beautifulsoup4==4.11.1
        - pip install pandas==1.4.2
        - pip install requests==2.27.1
        - pip install pyodbc
- SQL Server Developer Edition (most recent version)
- SSMS - SQL Server Management Studio (most recent version)

## 2. Getting started

Let's start by creating the variable that will serve as the user agent for our web scraper.

The "user-agent" is the identification that the browser sends to websites, which they use to deliver the appropriate content or layout.

In [1]:
agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
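
As a rough illustration of how the user agent travels with a request, the sketch below attaches it to a `urllib.request.Request` object and reads it back without sending anything over the network. The URL `https://example.com` is only a placeholder assumed for this example, not the site used in this project.

```python
from urllib.request import Request

# same user agent string as defined above
agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
         '(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36')

# placeholder URL, used only to build the request object (nothing is sent)
req = Request('https://example.com', headers={'User-Agent': agent})

# urllib stores header names capitalised as 'User-agent'
print(req.get_header('User-agent') == agent)  # True
```

The same `headers` dictionary is what gets passed to `Request` in the cells that follow.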

Next, let's import the libraries we will use to scrape the information:

In [2]:
import pandas as pd

from urllib.request import Request, urlopen

# to connect to the SQL Server
import pyodbc

# to parse the HTML
import bs4

### 2.1. Taking a look at the HTML

Now, let's take a look at what comes back when we make a request. To make the HTML easier to read, we will use the ```prettify``` method from the BeautifulSoup library.

In [3]:
# creating the url variable
url='https://www.legislador.com.br//LegisladorWEB.ASP?WCI=ProposicaoTexto&ID=3&TPProposicao=1&nrProposicao=940&aaProposicao=2022'

# loading the agent that will be used in the request
headers = {'User-Agent': agent}

# executing the query
req = Request(url, headers = headers)
response = urlopen(req)

print(bs4.BeautifulSoup(response, 'html.parser').prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <link href="legis.ico" rel="shortcut icon" type="image/x-ico"/>
  <title>
   Câmara Municipal de Indaial _ Indicação nº 940/2022 de 05/12/2022
  </title>
  <meta content="Câmara Municipal de Indaial _ Indicação nº 940/2022 de 05/12/2022" name="description">
   <link href="css/geral3.css" rel="stylesheet"/>
   <link href="https://d11gitgevq44cw.cloudfront.net/libs/font-awesome/5x/css/all.min.css" rel="stylesheet"/>
   <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css" integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" rel="stylesheet"/>
   <script crossorigin="anonymous" integrity="sha256-FgpCb/KJQlLNfOu91ta32o/NMZxltwRo8QtmkMRdAu8=" src="https://code.jquery.com/jquery-3.3.1.min.js">
   </script>
   <script crossorigin="anonymous" integrity="sha384-ApNb

### 2.2. Finding the information we need

If we take a closer look at the HTML tags, the information we need is in the ```dt```, ```dd``` and ```p``` tags. Knowing where the data lives makes it much easier to extract.

Let's create a function that will execute the query for us anytime we need it, and a function to parse the HTML:

In [4]:
# creating a function that will execute the query for us

def WebRequest(url):
    req = Request(url, headers = headers)
    response = urlopen(req)
    return response.read()

In [5]:
# creating a function to parse and read the HTML

def parse_html(url):
    html = WebRequest(url)
    soup = bs4.BeautifulSoup(html, 'html.parser')    
    return soup

In [6]:
html = parse_html(url)

In [7]:
# the labels of each piece of information

html.find_all('dt')

[<dt class="col-sm-3">Reunião</dt>,
 <dt class="col-sm-3">Situação</dt>,
 <dt class="col-sm-3">Assunto</dt>,
 <dt class="col-sm-3">Autor</dt>]

In [8]:
# getting just the text

for i in html.find_all('dt'):
    print(i.get_text())

Reunião
Situação
Assunto
Autor


In [9]:
# the information's content

html.find_all('dd')

[<dd class="col-sm-9">05/12/2022</dd>,
 <dd class="col-sm-9">Entrada no Expediente</dd>,
 <dd class="col-sm-9">Manutenção de via pública</dd>,
 <dd class="col-sm-9">Vereador <br/><b>Elton Marcos Possamai</b>.</dd>]

In [10]:
# getting just the text

for i in html.find_all('dd'):
    print(i.get_text())

05/12/2022
Entrada no Expediente
Manutenção de via pública
Vereador Elton Marcos Possamai.


### 2.3. Collecting the content

Now that we know where the information is, let's create a function that iterates through the HTML and puts each label and its value into a dictionary.

In [11]:
# creating a function to get the content of the page

def get_content(html):
    dt = html.find_all('dt')
    dd = html.find_all('dd')
    dic = {}
    for i in range(len(dt)):
        x = dt[i].get_text()
        y = dd[i].get_text()
        dic[x] = y
    return dic

In [12]:
# visualising the results

get_content(html)

{'Reunião': '05/12/2022',
 'Situação': 'Entrada no Expediente',
 'Assunto': 'Manutenção de via pública',
 'Autor': 'Vereador Elton Marcos Possamai.'}
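
The pairing done by ```get_content``` can also be pictured with plain strings. This is a minimal sketch, assuming the labels and values were already extracted as text. The `pair_fields` helper is hypothetical and uses `zip`, which stops at the shorter list instead of raising an `IndexError` when a value is missing.

```python
def pair_fields(labels, values):
    """Pair each label with the value found at the same position."""
    return dict(zip(labels, values))

# texts taken from the <dt> and <dd> tags shown above
labels = ['Reunião', 'Situação', 'Assunto', 'Autor']
values = ['05/12/2022', 'Entrada no Expediente',
          'Manutenção de via pública', 'Vereador Elton Marcos Possamai.']

fields = pair_fields(labels, values)
print(fields['Autor'])  # Vereador Elton Marcos Possamai.
```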

#### 2.3.1. Improving the ```get_content``` function

We also need the year and the number of the proposition, so we will improve the last function to put this information into the dictionary.

Additionally, we want the function to build the url by itself, so that it can move from one page to another. But how can we do that?

If we pay closer attention to the end of the url, we see that the year and the number of the proposition are there:

<img src="url.png" alt="url" width="800"/>

Therefore, the new function receives the number of the proposition and the year as arguments and inserts them into the url.

It also completes the dictionary with the number of the proposition, the year and the text that explains what the proposition is about:

In [13]:
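def get_fullcontent(prop, year):
    # build the url for the requested proposition number and year
    url = ('https://www.legislador.com.br//LegisladorWEB.ASP?WCI=ProposicaoTexto&ID=3&TPProposicao=1'
           f'&nrProposicao={prop}&aaProposicao={year}')
    html = parse_html(url)
    # labelled fields (Reunião, Situação, Assunto, Autor)
    dic = get_content(html)
    dic['Proposição'] = prop
    dic['Ano'] = year
    # html.p is the first <p> tag, which holds the text of the proposition
    dic['Texto'] = html.p.get_text()
    return dic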

In [14]:
get_fullcontent(937,2022)

{'Reunião': '05/12/2022',
 'Situação': 'Entrada no Expediente',
 'Assunto': 'Limpeza, Macadamização, Patrolamento, Retificação; Alargamento',
 'Autor': 'Vereador Flávio Augusto Ferri Molinari.',
 'Proposição': 937,
 'Ano': 2022,
 'Texto': 'O vereador que esta subscreve, no uso das atribuições que lhe confere o Regimento Interno desta Casa Legislativa, vem requerer encaminhamento de cópia da presente Indicação à Secretaria de Obras sugerindo o exposto a seguir: Macadamização, patrolamento e, sobretudo, fechamento de valetas geradas por chuvas na rua Ervino Schroeder, Encano Baixo. A pedido dos moradores.'}
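
The url construction can also be sketched with `urllib.parse.urlencode` instead of string concatenation. The `build_url` helper and the `BASE_URL` constant are hypothetical names introduced only for this example; the query parameters are the ones visible in the url above.

```python
from urllib.parse import urlencode

# base address copied from the url used in this notebook
BASE_URL = 'https://www.legislador.com.br//LegisladorWEB.ASP'

def build_url(prop, year):
    """Return the proposition url for a given number and year (hypothetical helper)."""
    params = {
        'WCI': 'ProposicaoTexto',
        'ID': 3,
        'TPProposicao': 1,
        'nrProposicao': prop,   # number of the proposition
        'aaProposicao': year,   # year of the proposition
    }
    return BASE_URL + '?' + urlencode(params)

print(build_url(937, 2022))
```

Building the query string from a dictionary keeps each parameter visible and avoids mistakes with `&` and `=` when more parameters are added.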

### 2.4. Query loop

Now that we have a function that queries all the information we need, we will create a loop to query as many propositions as we want.

In [15]:
def query_loop(start, quantity, year, possible_erros = 2, waitime = 0.5):
    import time
    
    last_query = start + quantity - 1
    
    # number of failed queries so far
    erros = 0
    
    # list that accumulates one dictionary per proposition
    props = []
    
    while start <= last_query and erros <= possible_erros:
        try:
            x = get_fullcontent(start, year)
            x['Situação'] # if there's no key named 'Situação', it means that the page is empty
            # reuse the result instead of requesting the same page twice
            props.append(x)
        except Exception:
            erros += 1
        
        # pause between requests
        time.sleep(waitime)
        
        start += 1
    
    return pd.DataFrame(props)
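
To see how the error budget in the loop behaves without hitting the website, here is a small sketch that swaps the real request for a fake one. Both `collect` and `fake_fetch` are hypothetical names made up for this example, and `fake_fetch` simply pretends that every proposition above number 3 is an empty page.

```python
def collect(fetch, start, quantity, year, max_errors=2):
    """Call fetch(number, year) over a range of numbers, tolerating a few failures."""
    rows, errors = [], 0
    last = start + quantity - 1
    number = start
    # stop at the end of the range or once the failures exceed the budget
    while number <= last and errors <= max_errors:
        try:
            rows.append(fetch(number, year))
        except Exception:
            errors += 1
        number += 1
    return rows

def fake_fetch(number, year):
    """Stand-in for get_fullcontent: numbers above 3 behave like empty pages."""
    if number > 3:
        raise KeyError('Situação')
    return {'Proposição': number, 'Ano': year}

rows = collect(fake_fetch, start=1, quantity=10, year=2022)
print(len(rows))  # 3
```

With `max_errors=2`, the loop gives up after the third failed page, so only propositions 1 to 3 are collected even though ten were requested.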

## 3. Connecting to SQL Server

After creating the database and the table in SQL Server, we will open the connection to it.

In [16]:
# connecting with Windows authentication

conn = pyodbc.connect('Trusted_Connection=yes', 
                      driver = '{ODBC Driver 17 for SQL Server}',
                      server = 'localhost', 
                      database = 'Indaial')

query = '''
    select 
        * 
    from Proposicoes
'''
# the read_sql_query method from pandas is used to read the data from the query
sql_query = pd.read_sql_query(query,conn)
sql_query



| | DataReuniao | DataDeliberacao | Situacao | Assunto | Autor | Proposicao | Ano | Texto |
|---|---|---|---|---|---|---|---|---|
| 0 | 1996-02-22 | 1996-02-22 | Proposição Aprovada | Serviços e Obras | Vereador Henrique Fritz. | 1 | 1996 | construção de Escola de 1º grau no Bairro Nova... |
| 1 | 1996-02-22 | 1996-02-22 | Proposição Aprovada | Serviços e Obras | Vereador Henrique Fritz. | 2 | 1996 | construção de Escola nas imediações dos Loteam... |
| 2 | 1996-02-22 | 1996-02-22 | Proposição Aprovada | Limpeza, Macadamização, Patrolamento, Retifica... | Vereador Henrique Fritz. | 3 | 1996 | alargamento da Rua ID 90 ... |
| 3 | 1996-02-22 | 1996-02-22 | Proposição Aprovada | Rede de Água / Esgoto / Pluvial | Vereador Henrique Fritz. | 4 | 1996 | prolongamento da rede d'água na Rua Reinhold S... |
| 4 | 1996-02-22 | 1996-02-22 | Proposição Aprovada | Rede de Água / Esgoto / Pluvial | Vereador Henrique Fritz. | 5 | 1996 | Prolongamento de rede d'água na Rua Lorenz até... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28578 | 1996-10-17 | 1996-10-17 | Proposição Aprovada | Diversos | Vereador Sílvio Gonçalves da Luz. | 395 | 1996 | Efetuar melhorias na rede elétrica do Beco Rau... |
| 28579 | 1996-10-21 | 1996-10-21 | Proposição Aprovada | Diversos | Vereador Antônio Carlos Fink. | 396 | 1996 | Alterar o artigo 1º da Lei nº 1.255/82, que in... |
| 28580 | 1996-10-29 | 1996-10-29 | Proposição Aprovada | Diversos | Vereador Henrique Fritz. | 397 | 1996 | Efetuar levantamento dos artesões estabelecido... |
| 28581 | 1996-12-02 | 1996-12-02 | Proposição Aprovada | Iluminação Pública e Rede de Energia Elétrica | Vereador Remir José de Faveri. | 398 | 1996 | Colocação de um transformador na Rua Venezuela... |


Some propositions don't have the 'Deliberação' date. To avoid inserting rows with the wrong number of columns into the table, we will create a data frame that contains every column and place the scraped data into it before inserting it into SQL Server.

In [17]:
# creating an empty data frame with every expected column
database = pd.DataFrame(columns=['Reunião', 'Deliberação', 'Situação', 'Assunto', 'Autor', 'Proposição', 
                                 'Ano', 'Texto'])
database

| | Reunião | Deliberação | Situação | Assunto | Autor | Proposição | Ano | Texto |
|---|---|---|---|---|---|---|---|---|
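
The idea of forcing every record into the same set of columns can be shown with plain dictionaries, before pandas enters the picture. This is only a sketch: `COLUMNS` mirrors the column list above, while `normalise_row` and the placeholder text `'...'` are assumptions made for the example.

```python
# expected columns, in the order used by the data frame above
COLUMNS = ['Reunião', 'Deliberação', 'Situação', 'Assunto',
           'Autor', 'Proposição', 'Ano', 'Texto']

def normalise_row(row, columns=COLUMNS, default=''):
    """Return a dict with every expected column, filling gaps with a default."""
    return {col: row.get(col, default) for col in columns}

# a scraped record without the 'Deliberação' date ('...' is placeholder text)
scraped = {'Reunião': '05/12/2022', 'Situação': 'Entrada no Expediente',
           'Assunto': 'Manutenção de via pública',
           'Autor': 'Vereador Elton Marcos Possamai.',
           'Proposição': 940, 'Ano': 2022, 'Texto': '...'}

row = normalise_row(scraped)
print(list(row) == COLUMNS)      # True
print(repr(row['Deliberação']))  # ''
```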


In [18]:
# testing the database variable. To use it, just remove the # symbol

#test = query_loop(900,50,2022)

#database = pd.concat([database, test])

#database

### 3.1. Creating useful functions

Let's create some useful functions to help us perform basic actions in SQL Server.

In [19]:
def SQLInsertProposicoes(PropTable):
    
    # creating the data frame with all columns that will be used to insert the data into SQL Server
    base = pd.DataFrame(columns=['Reunião', 'Deliberação', 'Situação', 'Assunto', 'Autor', 'Proposição',
                                 'Ano', 'Texto'])
    
    # concatenating the table with the query (PropTable) with the data frame (base) with all columns
    PropTable1 = pd.concat([base,PropTable]).fillna('')

    # connecting to the SQL Server
    conn = pyodbc.connect('Trusted_Connection=yes', 
                          driver = '{ODBC Driver 17 for SQL Server}',
                          server = 'localhost', 
                          database = 'Indaial')
    
    # creating the cursor that will execute the queries in SQL Server
    cursor = conn.cursor()

    # inserting the data into SQL Server
    for index, row in PropTable1.iterrows():

        cursor.execute('''

            INSERT INTO Proposicoes (
                DataReuniao,
                DataDeliberacao,
                Situacao,
                Assunto,
                Autor,
                Proposicao,
                Ano,
                Texto
            ) 
            values(?,?,?,?,?,?,?,?)''', 

            row['Reunião'], 
            row['Deliberação'], 
            row['Situação'], 
            row['Assunto'], 
            row['Autor'], 
            row['Proposição'], 
            row['Ano'], 
            row['Texto']

        )

    # committing the inserts, then closing the cursor and the connection
    conn.commit()
    cursor.close()
    conn.close()

In [20]:
#SQLInsertProposicoes(test)
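
The `?` placeholders used in the insert can be tried out without a SQL Server instance. The sketch below swaps in Python's built-in `sqlite3` module as a stand-in, since it accepts the same question-mark placeholder style. The column types in the `CREATE TABLE` statement and the sample text are assumptions, because the real table definition is not shown here.

```python
import sqlite3

# in-memory SQLite database used as a stand-in for SQL Server
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE Proposicoes (
        DataReuniao TEXT, DataDeliberacao TEXT, Situacao TEXT, Assunto TEXT,
        Autor TEXT, Proposicao INTEGER, Ano INTEGER, Texto TEXT
    )
''')

# one tuple per proposition, in the same order as the table columns
rows = [
    ('05/12/2022', '', 'Entrada no Expediente', 'Manutenção de via pública',
     'Vereador Elton Marcos Possamai.', 940, 2022, 'example text'),
]

# one "?" per column; the driver binds the values, so no manual quoting is needed
conn.executemany('INSERT INTO Proposicoes VALUES (?,?,?,?,?,?,?,?)', rows)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM Proposicoes').fetchone()[0]
print(count)  # 1
conn.close()
```

Binding values through placeholders, rather than formatting them into the SQL text, also keeps quotes inside the proposition text from breaking the statement.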

In [21]:
# function to Select everything from the SQL Server

def SQLSelect(query):
    
    # connecting to the SQL Server
    conn = pyodbc.connect('Trusted_Connection=yes', 
                          driver = '{ODBC Driver 17 for SQL Server}',
                          server = 'localhost', 
                          database = 'Indaial')

    out = pd.read_sql_query(query,conn)
    return out

In [22]:
# function to truncate the table from the SQL Server

def SQLTruncate(TableName):
    
    # connecting to the SQL Server
    conn = pyodbc.connect('Trusted_Connection=yes', 
                          driver = '{ODBC Driver 17 for SQL Server}',
                          server = 'localhost', 
                          database = 'Indaial')
    
    # creating the cursor that will execute the queries in SQL Server
    cursor = conn.cursor()

    cursor.execute(f'''

                   TRUNCATE TABLE {TableName}

                   ''')
    
    # committing the truncate, then closing the cursor and the connection
    conn.commit()
    cursor.close()
    conn.close()

In [23]:
# execute to clean the table

#SQLTruncate('Proposicoes')

### 3.2. Incremental update

Now that we are able to get the data from any page we want, connect to the database, run statements such as ```SELECT``` and ```TRUNCATE```, and insert the data into SQL Server, we need a way to **increment** the data based on the last information registered.

In [24]:
def InsertNextProp(year):

    # look for the last proposition number registered in SQL Server
    data_year = SQLSelect(f'select Proposicao = max(Proposicao) from Proposicoes where Ano = {year}')
    last_prop = data_year['Proposicao'].loc[0]

    # if nothing is registered for that year (NULL comes back as None or NaN), start from 1
    if pd.isna(last_prop):
        next_prop = 1
    else:
        next_prop = int(last_prop) + 1
    
    # get the data of the next proposition
    data = get_fullcontent(next_prop,year)
    
    # if the current page is empty, try the following one
    if data.get('Situação') is None:
        next_prop += 1
        data = get_fullcontent(next_prop,year)
    
    # insert only when the page has content; otherwise raise so the caller counts an error
    if data.get('Situação'):
        table = pd.DataFrame([data])
        SQLInsertProposicoes(table)
    else:
        raise ValueError(f'No data found for proposition {next_prop}/{year}')

The function above looks up the last proposition registered for that year. If there is none, it **inserts the first one**; otherwise, **it inserts the next one**.

Now, we will create a function that inserts the data for an entire year, one proposition at a time:
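
The "last number plus one" rule can be checked in isolation. The sketch below again uses the built-in `sqlite3` module as a stand-in for SQL Server, with a reduced two-column table assumed only for this example; `next_proposition` is a hypothetical helper, not part of the script.

```python
import sqlite3

def next_proposition(conn, year):
    """Return max(Proposicao) + 1 for the year, or 1 when the year has no rows."""
    cur = conn.execute(
        'SELECT MAX(Proposicao) FROM Proposicoes WHERE Ano = ?', (year,))
    last = cur.fetchone()[0]  # None when no row matches the year
    return 1 if last is None else int(last) + 1

# reduced stand-in table holding three propositions for 1996
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Proposicoes (Proposicao INTEGER, Ano INTEGER)')
conn.executemany('INSERT INTO Proposicoes VALUES (?, ?)',
                 [(1, 1996), (2, 1996), (3, 1996)])

print(next_proposition(conn, 1996))  # 4
print(next_proposition(conn, 1997))  # 1
```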

In [25]:
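def SearchRecordDataYear(year, quantity = 999999, possible_erros = 2, waiTime = 0):
    import time
    
    # number of failed attempts so far
    erros = 0

    # number of propositions inserted so far
    inserted = 0

    # keep inserting the next proposition until the failed attempts exceed
    # possible_erros or the requested quantity is reached
    while erros <= possible_erros and inserted < quantity:
        try:
            InsertNextProp(year)
            inserted += 1
        except Exception:
            erros += 1

        time.sleep(waiTime)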

### 3.3. Updating the database

Finally, we will create a loop that goes from one year to another, specified by the user:

In [26]:
# loop over the chosen range of years. To use it, just remove the # symbols

#start_year = 1996
#final_year = 1996

#for i in range(start_year, final_year+1):
#    print('Starting recording data of the year: ',i)
#    try:
#        SearchRecordDataYear(i)
#    except Exception:
#        pass

#print('Insert finished 😁')

Now, our database is ready to be imported into **Power BI** to create visualisations.

## 4. Dashboard

After getting the data we need, it's time to create our visualisations. The report below was created in Power BI.

The dashboard can be accessed at: [Aldermen's Analyser Dashboard](https://app.powerbi.com/view?r=eyJrIjoiZDVkNGM0NTEtZmRkYi00N2FhLTkwY2YtYzIxODc5NWJjNTJjIiwidCI6IjcyNmE2MjA3LTUwZjYtNDlkNS1iMGQ0LTFhNGYwNmRiYjM4OSJ9)