<a id='Top'></a>
<center>
    <h1><b>Using Machine Learning to Analyze the 2nd Round of the 2022 Brazilian Presidential Election</b></h1>
<h3>Author: Yuri Henrique Galvao - Student # 3151850</h3>
</center>

---
This is the Final Project for the Artificial Intelligence Diploma program of The University of Winnipeg - Professional, Applied and Continuing Education (PACE). The idea of this project is to extract the available data generated by the Electronic Voting Machines (EVM) that were used in the 2nd Round of the 2022 Brazilian Presidential Election, clean it, analyze it, and then use clustering models to find data patterns - especially hidden or non-intuitive patterns - and anomalies.

For these clustering and anomaly-detection tasks, I will use the following three unsupervised clustering algorithms, which will be presented and compared: K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Self-Organizing Maps (SOMs).

This notebook is divided into 6 main sections. The references I used while preparing it are listed at the end. To improve readability, some code cells are hidden; unhide them if you want to see the code.

### SECTIONS:  
1. [Introduction](#Intro)<br>  
2. [Data Ingestion](#Data_ingestion)<br>  
3. [Exploratory Data Analysis](#Exploratory_Data_Analysis)<br>
4. [Clustering](#Clustering)<br>
    4.1 [K-Means](#K-Means)<br>
    4.2 [DBSCAN](#DBSCAN)<br>
    4.3 [SOM](#SOM)<br>
5. [Conclusions](#Conclusions)<br>
6. [References](#References)<br>

<a id='Intro'></a>
## 1. Introduction  <a href='#Top' style="text-decoration: none;">^</a>
### 1.1. Context

In October 2022, Brazil held its fiercest presidential election of the past three decades. During the first round of the federal elections, more than 10 candidates ran for President of the Federative Republic of Brazil, but it was widely expected that only two candidates had a real chance of reaching the second round: Luiz Inácio “Lula” da Silva and Jair Messias Bolsonaro.

As expected, the presidential election went to a second round between Lula and Bolsonaro, and Brazil had a very controversial election in which Lula won with 50.9% of the valid votes against the 49.1% that Bolsonaro received.

That election was controversial because, among other issues, Brazil uses a 100% electronic voting system, which is based on a digital electronic voting machine (EVM) and on a digital vote-processing system that is considered a black box.

### 1.2. Project Idea

In an attempt to make the system more transparent, the public organization responsible for the elections, the Superior Electoral Court (Tribunal Superior Eleitoral – TSE), published all the data gathered by the EVMs on its own website. More precisely, it is possible to get the logs, the digital vote registry (Registro Digital de Voto – RDV), and the voting machine bulletin (Boletim de Urna – BU) of each machine used in the elections.

In this project, I will extract raw data from the log files and the BUs, transform it, load it into Pandas DataFrames, clean it, analyze it, and then use clustering models to find hidden or non-intuitive patterns and anomalies. Any such patterns would be a valuable finding and should reveal a lot about the dataset.

For that, this notebook focuses on three algorithms: K-Means, DBSCAN and SOMs. They are implemented using mainly two Python libraries: Scikit-Learn and MiniSom.

<a id='Data_ingestion'></a>
## 2. Data Ingestion <a href='#Top' style="text-decoration: none;">^</a>

### 2.1. Importing the necessary libraries for the ETL process

In [None]:
import pandas as pd
import numpy as np
import glob, py7zr, os

### 2.2. Extracting the Data

#### 2.2.1. Web Scraping
To scrape TSE's website and get the zip files of each EVM, I had to develop a Selenium-based web scraper. This scraper is able to download the files of all EVMs from every Federative Unit (also called a "scope"). Since Brazil has 26 states and 1 Federal District, plus 1 scope for the EVMs placed in other countries, there are 28 scopes / federative units in total. Therefore, you can run up to 28 instances of this web scraper in parallel to speed up the download of the zip files.

In [None]:
!git clone https://github.com/ygalvao/BRA_Scraper_2022.git

In [None]:
!python 2022brascraper/web_scraper.py

#### 2.2.2. Unzipping
The Bash script below automates the process of unzipping the __.bu__ and __.logjez__ files. It extracts the files into the __"extracted"__ subfolder.

In [None]:
!./extract_bu_and_log.sh

### 2.3. Transforming the Data
__The BU files (.bu) are in ASN.1 format__, which is neither human-readable nor natively readable by Python. Therefore, the Bash script below automates the process of dumping the data from the ASN.1 format files into readable flat files (.txt).

Moreover, the "bu_etl" function transforms the data from a BU text file and stores it into a Pandas DataFrame (then returns it).
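
To illustrate the parsing idea behind `bu_etl`, here is a minimal sketch on a made-up dump. The sample lines, the field paths, and the helper name `parse_dump` are hypothetical (the real output of `dump_bu.sh` may be laid out differently); only the `path = value` shape of each line is taken from the function defined in this section.

```python
# Hypothetical excerpt of a dumped BU text file (illustrative only)
sample_dump = """\
EntidadeBoletimUrna.fase = 'oficial'
EntidadeBoletimUrna.urna.municipio = 12345
EntidadeBoletimUrna.urna.zona = 7
a line without a separator
"""

def parse_dump(text: str) -> dict:
    """Map the last component of each 'path = value' line to its value."""
    parsed = {}
    for line in text.splitlines():
        # partition never raises: lines without ' = ' yield an empty separator
        left, sep, right = line.partition(' = ')
        if not sep:
            continue
        key = left.split('.')[-1].strip()
        parsed[key] = right.strip().strip("'")
    return parsed

fields = parse_dump(sample_dump)
print(fields)
```

Every value comes out as a string, which is why the data types still have to be corrected during data cleaning.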

In [None]:
!./dump_bu.sh

In [None]:
def bu_etl(file_path:str)->pd.DataFrame:
    """
    Transforms the data from a BU text file and stores it into a Pandas DataFrame.
    
    Arg.: (str) the path for the BU text file 
    
    Returns: a Pandas DataFrame.
    """
    
    # Read the .txt file
    with open(file_path) as f:
        data = f.read()

    # Split the data into lines
    lines = data.split('\n')

    # Initialize the containers: a dict for the DataFrame columns, a list for candidate codes, and a counter for blank and null votes
    data_dict = {}
    codes = []
    brancos_e_nulos = 0

    # Iterate through the lines of the file
    for i, line in enumerate(lines):

        try:
            # Split the line by ' = ' and extract the left and right parts
            left, right = line.split(' = ')
        except ValueError:
            # Skip lines that do not have exactly one ' = ' separator
            continue

        # Split the left part by '.', take the last element (the column name), and strip surrounding white space (if any)
        column_name = left.split('.')[-1].strip()

        # Names of the fields to keep in the DataFrame
        wanted_variables = (
            'fase',
            'local',
            'municipio',
            'zona',
            'secao',
            'qtdEleitoresCompBiometrico',
            'idEleicao',
            'qtdEleitoresAptos',
            'qtdComparecimento',
            'qtd_votos_13',
            'qtd_votos_22',
            'brancos_nulos',
            'versaoVotacao'
            )

        value = right.strip("'")

        if column_name == 'codigo':
            codes.append(value)

        if column_name == 'quantidadeVotos' and len(codes) == 1 and (i > 34 and i < 40):
            column_name = f'qtd_votos_{codes[0]}'
            codes.pop()

        if column_name == 'quantidadeVotos' and len(codes) == 1 and (i > 40  and i < 46):
            column_name = f'qtd_votos_{codes[0]}'
            codes.pop()

        if column_name == 'quantidadeVotos' and len(codes) == 0 and (i >= 46):
            column_name = 'brancos_nulos'
            brancos_e_nulos += int(value)
            value = brancos_e_nulos

        if column_name in wanted_variables:
            data_dict[column_name] = [value]

    # Create the DataFrame from the rows
    df = pd.DataFrame(data_dict)
    
    return df

### 2.4. Loading the Data
Below are the procedures (mainly loops) to, finally, load the data into DataFrames that can be used by us, by Scikit-Learn, and by MiniSom.

In [None]:
# Importing the data from the BUs into a Pandas DataFrame
files_df2 = glob.glob("./BU_e_RDV/extracted/*.txt")
df_bu_list = []

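## This loop transforms each BU text file into a one-row DataFrame and deletes the text file afterwards.
## Every 50 files, the accumulated rows are saved to a CSV checkpoint.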
for i, file_path in enumerate(files_df2):
    bu = bu_etl(file_path)
    df_bu_list.append(bu)
    os.remove(file_path)
    
    if i % 50 == 0:            
        df_bu = pd.concat(df_bu_list, ignore_index=True)
        df_bu.to_csv('df_bu.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
        df_bu_list = [df_bu]

if len(df_bu_list) > 1:
    df_bu = pd.concat(df_bu_list, ignore_index=True)
    df_bu.to_csv('df_bu.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future

In [None]:
# Importing the data from the log files into a Pandas DataFrame
files_df1 = glob.glob("./BU_e_RDV/extracted/*.logjez") # This is where I stored all the 7Zip logfiles from the "extract_bu_and_log.sh" script
df_logs_list = []

## This loop extracts the logs (in flat file format, .dat) and imports their data into a Pandas DataFrame.
## At the end of each iteration, it deletes the recently extracted .dat file (which has around 700kB)
## and the original 7Zip file (the .logjez file), in order to save disk space.
for i, file_path in enumerate(files_df1):
    logjez_file = py7zr.SevenZipFile(file_path, mode="r")
    logjez_file.extractall(path="./BU_e_RDV/extracted/")
    logjez_file.close()
    log = pd.read_csv(
        './BU_e_RDV/extracted/logd.dat',
        encoding='ISO 8859-1',
        header=None,
        names=['date_time', 'event_type', 'id_evm', 'system', 'description', 'authenticator'],
        sep=None,
        engine='python'
    )
    df_logs_list.append(log)
    os.remove('./BU_e_RDV/extracted/logd.dat')
    os.remove(file_path)
    
    if i % 50 == 0:
        df_logs = pd.concat(df_logs_list, ignore_index=True)
        df_logs.to_csv('df_logs.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
        df_logs_list = [df_logs]
        
    if i > 20000:
        break

if len(df_logs_list) > 1:
    df_logs = pd.concat(df_logs_list, ignore_index=True)
    df_logs.to_csv('df_logs.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future

In [None]:
# If this notebook crashes (or if you already have the CSV files),
# you can just read the CSV files that were created in the previous two cells.
# For that, just run this cell.
# Running this cell when it is not necessary is safe and harmless.
df_logs = pd.read_csv('df_logs.csv', index_col=0)
df_bu = pd.read_csv('df_bu.csv', index_col=0)

In [None]:
df_logs.info()

In [None]:
df_bu.info()

<a id='Exploratory_Data_Analysis'></a>
## 3. Exploratory Data Analysis <a href='#Top' style="text-decoration: none;">^</a>

### 3.1. Overall Analysis of the Logs DataFrame

In [None]:
from IPython.display import display as show

In [None]:
show(df_logs.head(10))
show(df_logs.tail(10))

In [None]:
df_logs.info()

We can see that each row in the df_logs DataFrame is one event of one of the systems running inside an EVM.

For feasibility reasons, this DataFrame contains only a fraction (around 2%) of the log data of all EVMs.

#### 3.1.1. Checking duplicated rows

In [None]:
df_logs.duplicated().sum()

#### 3.1.2. Checking null values

In [None]:
df_logs.isna().sum()

#### 3.1.3. Checking zero values

In [None]:
(df_logs == 0).any(axis=1).sum()

In [None]:
# Checking zero values on the columns
(df_logs == 0).any(axis=0)

#### 3.1.4. Checking the possible values for each variable and their frequencies

In [None]:
for column in df_logs.columns:
    print(f"Number of distinct values in '{column}':", df_logs[column].nunique(), '\n')

### 3.2. Overall Analysis of the BUs DataFrame

In [None]:
from IPython.display import display as show

In [None]:
show(df_bu.head(10))
show(df_bu.tail(10))

In [None]:
df_bu.info()

We can see that each row in the df_bu DataFrame is one EVM.

For feasibility reasons, this DataFrame contains only a fraction (around 13%) of the BUs of all EVMs.

#### 3.2.1. Checking duplicated rows

In [None]:
df_bu.duplicated().sum()

#### 3.2.2. Checking null values

In [None]:
df_bu.isna().sum()

#### 3.2.3. Checking zero values

In [None]:
(df_bu == 0).any(axis=1).sum()

In [None]:
# Checking zero values on the columns
(df_bu == 0).any(axis=0)

#### 3.2.4. Checking the possible values for each variable and their frequencies

In [None]:
for column in df_bu.columns:
    print(f"Number of distinct values in '{column}':", df_bu[column].nunique(), '\n')

In [None]:
df_bu.idEleicao.value_counts()

### 3.3. Data Cleaning - Logs DF

#### 3.3.1. Deleting unnecessary features

In [None]:
df_logs.drop(columns='authenticator', inplace=True)

In [None]:
df_logs.info()

#### 3.3.2. Correcting data types

In [None]:
df_logs['date_time'] = pd.to_datetime(df_logs.date_time)

In [None]:
for column in ['event_type', 'id_evm', 'system']:
    df_logs[column] = df_logs[column].astype('category')

In [None]:
df_logs.info()

#### 3.3.3. Deleting personal data
This step is done for ethical reasons: it deletes personal data of the poll workers / clerks that appears in the log files, such as their CPF (Cadastro de Pessoa Física), the Brazilian individual taxpayer number, which plays a role similar to a social insurance number.
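
As a rough illustration of this step, the sketch below removes purely numeric tokens from a made-up log description. The sample text and the number in it are invented, and the helper name `drop_numeric_tokens` is hypothetical; it shows one possible way to do it with a regular expression, not necessarily the approach used in the cells that follow.

```python
import re

def drop_numeric_tokens(description: str) -> str:
    """Remove space-delimited tokens made only of digits."""
    kept = [tok for tok in description.split(' ') if not re.fullmatch(r'\d+', tok)]
    return ' '.join(kept)

# Invented description with an invented 11-digit number
sample = 'Operator 12345678901 enabled voter'
print(drop_numeric_tokens(sample))
```

Note that this also removes any other digit-only token in a description, so it trades some detail in the logs for privacy.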

In [None]:
df_logs[df_logs.description.str.contains('mesário', case=False)]

In [None]:
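def delete_cpf(description:str)->str:
    """Deletes every purely numeric token (such as a CPF) from a log description."""
    
    def is_integer(token:str)->bool:
        try:
            int(token)
        except ValueError:
            return False
        return True
    
    # Build a new list instead of removing items from the list being iterated,
    # which would skip the token that follows each removed one
    kept = [token for token in description.split(' ') if not is_integer(token)]
    
    return ' '.join(kept)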

In [None]:
df_logs['description'] = df_logs.description.apply(delete_cpf)

In [None]:
df_logs[df_logs.description.str.contains('mesário', case=False)]

### 3.4. Data Cleaning - BUs DF

#### 3.4.1. Deleting unnecessary features

In [None]:
df_bu.drop(columns='fase', inplace=True)

In [None]:
df_bu = df_bu[df_bu.idEleicao=='545'].copy() # The id for the 2nd Round of the 2022 Presidential Election is 545

In [None]:
df_bu.drop(columns='idEleicao', inplace=True)

In [None]:
df_bu.drop(columns='versaoVotacao', inplace=True)

In [None]:
df_bu.info()

#### 3.4.2. Replacing null values
After analyzing the null values in this DF, I concluded that they actually represent zero (0) values.

In [None]:
df_bu.fillna(0, inplace=True)

#### 3.4.3. Correcting data types

In [None]:
for column in df_bu.columns.to_list():
    if column[:3] == 'qtd':
        df_bu[column] = df_bu[column].astype('int')
    else:
        df_bu[column] = df_bu[column].astype('category')
        
df_bu['brancos_nulos'] = df_bu['brancos_nulos'].astype('int')

In [None]:
df_bu.info()

### 3.5. Descriptive Statistics
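
The helper functions defined in this section rely on two standard rules: the 1.5 IQR criterion for removing outliers and the Freedman-Diaconis rule for choosing a histogram bin width. The sketch below applies both to synthetic data, only to make the arithmetic concrete; the data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 roughly normal values plus one planted extreme value
values = np.append(rng.normal(loc=300, scale=40, size=200), 5000.0)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# 1.5 IQR criterion: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
inside = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
trimmed = values[inside]

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
iqr_trimmed = np.subtract(*np.percentile(trimmed, [75, 25]))
bin_width = 2 * iqr_trimmed / np.cbrt(trimmed.size)
n_bins = int(np.ceil((trimmed.max() - trimmed.min()) / bin_width))

print(f'kept {trimmed.size} of {values.size} values, {n_bins} bins of width {bin_width:.1f}')
```

The notebook's own helper additionally divides the resulting bin count by 5, which gives a coarser histogram than the plain rule.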

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns

In [None]:
# Defining a function to classify columns into categorical, quantitative, and date/time variables
def check_variables(df: pd.DataFrame) -> tuple:
    """
    Separates the categorical, quantitative, and date/time variables, and returns one list of column names for each type.
    """
    
    date_cols = list(df.select_dtypes(include='datetime').columns)
    quantitative_cols = list(df.select_dtypes(include='number').columns)
    # Everything that is neither numeric nor date/time is treated as categorical (the original column order is kept)
    categorical_cols = [col for col in df.columns if col not in quantitative_cols and col not in date_cols]
    
    return categorical_cols, quantitative_cols, date_cols

In [None]:
# Defining a function to examine categorical variables
def examine_categorical(categ_var : pd.Series, top : int = 10, others : bool = True) -> None:
    '''
    This function gets a Pandas Series (a categorical column of a Pandas DataFrame) and: 
    - Gets the top 10 (or other chosen quantity) values
    - Compiles all the other values into "others" (or not, if chosen otherwise)
    - Prints a frequency distribution table
    - Plots a pie chart
    - Plots a bar chart
    '''
    
    vc = categ_var.value_counts()
    vc2 = vc.sort_values(ascending=False)[:top]
    new_row = pd.Series(data = {'others': vc.sort_values(ascending=False)[top:].sum()})
    vc3 = pd.concat([vc2, new_row])
    
    if others:
        vc = vc3
        msg = f'''Please note that only the top {top} values are shown individually in these calculations and visualizations.
All the other values were grouped under the "others" label.'''
    else:
        vc = vc2
        msg = f'''Please note that only the top {top} values were considered in these calculations and visualizations.'''
    
    # Frequency distribution
    print(f'''Frequency distribution table for different values of "{categ_var.name}" variable: \n\n{vc}\n''')
    
    print(msg)
    
    
    # Pie chart
    vc.plot(
    kind='pie',
    ylabel=categ_var.name,
    autopct='%.2f%%',
    figsize=(10,10))
    
    plt.show()
    plt.close()
    
    # Bar chart
    bar = vc.plot(
        kind='bar',
        figsize=(10,8),
        align='center')
    
    bar.bar_label(bar.containers[0])
    
    plt.show()
    plt.close()
    
    print('_' * 120+'\n' * 3)

In [None]:
# Defining a function to examine numerical variables
def examine_quant(
    variable:pd.Series,
    optmize_n_bins:bool=False,
    no_outliers:bool=False,
    n_bins:int=0
)->None:
    '''
    Gets a Pandas Series and: 
    - Prints measures of central tendency
    - Prints measures of spread
    - Removes the outliers using the 1.5 IQR criterion, if "no_outliers" == True
    - Tries to calculate the optimal number of bins for the histogram, if "optmize_n_bins" == True
    - Uses "n_bins" bins for the histogram, if a non-zero value is given and "optmize_n_bins" == False
    - Plots a histogram
    - Plots a box plot
    '''
    
    var_desc = variable.describe()
    
    IQR = var_desc['75%'] - var_desc['25%']
    
    print(f'''### Measures for variable '{variable.name}':
    
## Measures of center:
Mode: {variable.mode()[0]}
Mean: {var_desc['mean']}
Median: {var_desc['50%']}

## Measures of spread:
Min: {var_desc['min']}
Max: {var_desc['max']}
Range: {var_desc['max'] - var_desc['min']}

1st Quartile (Q25): {var_desc['25%']}
3rd Quartile (Q75): {var_desc['75%']}
IQR: {IQR}

Standard deviation: {var_desc['std']}\n''')
    
    if no_outliers:
        variable = variable[(variable <= (var_desc['75%'] + 1.5 * IQR)) & (variable >= (var_desc['25%'] - 1.5 * IQR))]
        
    def freedman_diaconis(variable : pd.Series) -> int:
        """
        Uses the Freedman-Diaconis rule to compute the histogram bin width, then derives a number of bins from it.
        """

        data = np.asarray(variable.values, dtype=np.float64)
        # IQR computed with NumPy, so no extra import is needed
        q75, q25 = np.percentile(data, [75, 25])
        IQR  = q75 - q25
        N    = data.size
        bw   = (2 * IQR) / np.power(N, 1/3)

        datmin, datmax = data.min(), data.max()
        datrng = datmax - datmin
        
        # Number of bins implied by the bin width, then divided by 5 (which yields a coarser histogram)
        result = int(((datrng / bw) + 1)/5)

        return result
    
    #Histogram
    if optmize_n_bins:
        try:
            n_bins_ = freedman_diaconis(variable)
        except Exception as e:
            print(e)
        else:            
            variable.hist(bins=n_bins_)
            plt.show()
            plt.close()
    elif n_bins:
        variable.hist(bins=n_bins)
        plt.show()
        plt.close()
    else:
        variable.hist()
        plt.show()
        plt.close()
    
    #Boxplot
    plt.boxplot(x=variable, labels=[variable.name])
    plt.ylabel(variable.name)
    plt.show()
    plt.close()
    
    #Separator line
    print('_' * 120+'\n' * 3)

#### 3.5.1. Visualizing the Data - Logs DF

In [None]:
mpl.rcParams['font.family'] = ['serif']

In [None]:
# Defining better settings for the visualizations using Seaborn
sns.set(rc={'figure.figsize':(8,6)}, style='whitegrid')

In [None]:
# Classifying variables by their type
cat_cols_logs, quan_cols_logs, date_cols_logs = check_variables(df_logs)

In [None]:
print('Categorical variables:')
show(cat_cols_logs)
print('\nQuantitative variables:')
show(quan_cols_logs)
print('\nFull date/time variables:')
show(date_cols_logs)

##### Visualizing and Examining Categorical Variables

In [None]:
# Examining categorical variables
for variable in cat_cols_logs:
    examine_categorical(df_logs[variable], top=8, others=True)

There are no quantitative variables in df_logs.

In [None]:
# Examining quantitative variables with outliers
for variable in quan_cols_logs:
    try:
        examine_quant(df_logs[variable], optmize_n_bins=False, no_outliers=False, n_bins=False)
    except Exception as e:
        print(e)

In [None]:
# Examining quantitative variables without outliers
for variable in quan_cols_logs:
    try:
        examine_quant(df_logs[variable], optmize_n_bins=False, no_outliers=True, n_bins=False)
    except Exception as e:
        print(e)

#### 3.5.2. Visualizing the Data - BUs DF

In [None]:
# Classifying variables by their type
cat_cols_bu, quan_cols_bu, date_cols_bu = check_variables(df_bu)
print('Categorical variables:')
show(cat_cols_bu)
print('\nQuantitative variables:')
show(quan_cols_bu)
print('\nFull date/time variables:')
show(date_cols_bu)

##### Visualizing and Examining Categorical Variables

In [None]:
# Examining categorical variables
for variable in cat_cols_bu:
    examine_categorical(df_bu[variable], top=12, others=True)

##### Visualizing and Examining Quantitative Variables

In [None]:
# Examining quantitative variables with outliers
for variable in quan_cols_bu:
    try:
        examine_quant(df_bu[variable], optmize_n_bins=False, no_outliers=False, n_bins=False)
    except Exception as e:
        print(e)

In [None]:
# Examining quantitative variables without outliers
for variable in quan_cols_bu:
    try:
        examine_quant(df_bu[variable], optmize_n_bins=False, no_outliers=True, n_bins=False)
    except Exception as e:
        print(e)

<a id='Clustering'></a>
## 4. Clustering <a href='#Top' style="text-decoration: none;">^</a>

### 4.1 Preprocessing the Data

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

#### 4.1.1. Logs DF

##### Label Encoding
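
Before applying it to the real columns, here is a small sketch of what `LabelEncoder` does. The event-type values are hypothetical and only serve to show the mapping from categories to integer codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical event types, for illustration only
toy = pd.Series(['INFO', 'ERROR', 'INFO', 'ALERT'], name='event_type')

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

print(list(encoder.classes_))  # classes are stored in sorted order
print(list(codes))             # each value is replaced by the index of its class
```

Keep in mind that the integer codes follow the sorted order of the labels, not any real ordering of the categories. Distance-based algorithms such as K-Means will still treat code 0 as "closer" to code 1 than to code 2, so clusters built on label-encoded features should be interpreted with care.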

In [None]:
labelencoder = LabelEncoder()

In [None]:
df_logs

In [None]:
for column in cat_cols_logs:
    df_logs[column+'_enc'] = labelencoder.fit_transform(df_logs[column])

In [None]:
df_logs

In [None]:
df_logs.info()

In [None]:
X_logs = df_logs[['event_type_enc', 'description_enc', 'system_enc', 'id_evm_enc']]

In [None]:
X_logs

##### Scaling the data
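
Min-max scaling matters here because K-Means is distance-based, so features with larger numeric ranges would otherwise dominate. The sketch below, on a made-up matrix, shows that `MinMaxScaler` computes `(x - min) / (max - min)` for each column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix with two features on very different scales
X = np.array([[0.0, 100.0],
              [5.0, 300.0],
              [10.0, 500.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

# Same result computed by hand: (x - min) / (max - min), column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
```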

In [None]:
sc = MinMaxScaler(feature_range = (0,1))
X_logs_scaled = sc.fit_transform(X_logs)

In [None]:
X_logs_scaled

In [None]:
X_logs.info()

#### 4.1.2. BUs DF

##### Translating features names to English

In [None]:
df_bu.rename(columns={
    'municipio' : 'municipality',
    'zona' : 'zone',
    'secao' : 'section',
    'qtdEleitoresCompBiometrico' : 'qty_voters_with_biometrics',
    'qtdEleitoresAptos' : 'qty_voters_able_to_vote',
    'qtdComparecimento' : 'qty_attendance',
    'qtd_votos_13' : 'qty_votes_on_13',
    'qtd_votos_22' : 'qty_votes_on_22',
    'brancos_nulos' : 'qty_blank_and_null_votes',
}, inplace=True)

In [None]:
# Classifying variables by their type again
cat_cols_bu, quan_cols_bu, date_cols_bu = check_variables(df_bu)
print('Categorical variables:')
show(cat_cols_bu)
print('\nQuantitative variables:')
show(quan_cols_bu)
print('\nFull date/time variables:')
show(date_cols_bu)

##### Label Encoding

In [None]:
labelencoder = LabelEncoder()

In [None]:
for column in cat_cols_bu:
    df_bu[column+'_enc'] = labelencoder.fit_transform(df_bu[column])

In [None]:
df_bu

In [None]:
df_bu.info()

In [None]:
X_bu = df_bu[quan_cols_bu + [column+'_enc' for column in cat_cols_bu]]

In [None]:
X_bu

In [None]:
X_bu.info()

In [None]:
X_bu_quant = df_bu[quan_cols_bu].copy() # Copy, so that adding columns later does not trigger a SettingWithCopyWarning

In [None]:
X_bu_quant

##### Feature Engineering
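
The attendance rate is the number of voters who showed up divided by the number of eligible voters. The sketch below computes it on invented rows and guards against a zero denominator; whether the real data contains sections with zero eligible voters is an assumption I have not verified, so treat the guard as a precaution.

```python
import numpy as np
import pandas as pd

# Hypothetical sections, for illustration only
toy = pd.DataFrame({
    'qty_attendance':          [240, 0, 150],
    'qty_voters_able_to_vote': [300, 0, 200],
})

# Replace a zero denominator with NaN so the division yields NaN instead of inf
eligible = toy['qty_voters_able_to_vote'].replace(0, np.nan)
toy['attendance_rate'] = toy['qty_attendance'] / eligible * 100

print(toy)
```

Rows with a NaN rate would then need to be dropped or imputed before clustering, since K-Means in Scikit-Learn does not accept missing values.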

In [None]:
# Let's create a new feature: attendance_rate
df_bu['attendance_rate'] = df_bu.qty_attendance / df_bu.qty_voters_able_to_vote

In [None]:
# Let's rescale it to range from 0 to 100, instead of 0 to 1
df_bu['attendance_rate'] = df_bu['attendance_rate'] * 100

In [None]:
X_bu_quant['attendance_rate'] = df_bu['attendance_rate']

In [None]:
X_bu_quant

In [None]:
X_bu_quant.info()

<a id='K-Means'></a>
### 4.2. K-Means

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D

#### 4.2.1. Logs DF

In [None]:
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X_logs_scaled)
visualizer.show()
plt.show()

##### Choosing optimal K
To find an appropriate number of clusters, I used the elbow method: the distortion score (the sum of squared distances from each point to its assigned cluster center, also called inertia) is calculated for a number of clusters between 2 and 10, and the rule is to choose the number of clusters at which the curve shows a kink, or "elbow".

The graph above shows the distortion score decreasing as the number of clusters increases. However, no clear "elbow" is visible. The visualizer's automatic elbow detection suggests 4 clusters, so a choice of 4 or 5 clusters seems fair.
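
To make the elbow method concrete without the visualizer, the sketch below computes the inertia by hand for several values of k. It runs on synthetic blobs, not on the election data, so the numbers it prints say nothing about the EVM logs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data generated around 4 centers (illustrative only)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=1)

inertias = {}
for k in range(2, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest center
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the value of k after which adding more clusters reduces the inertia only marginally.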

##### Building the model and getting the clusters

In [None]:
KM_5_clusters_logs = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_logs_scaled) # initialise and fit K-Means model

KM_5_clustered_logs = X_logs.copy()
KM_5_clustered_logs.loc[:,'Cluster'] = KM_5_clusters_logs.labels_ # append labels to points

In [None]:
KM_5_clustered_logs

##### Visualizing the clusters
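
The K-Means model for the logs was fitted on min-max-scaled features, while the scatter plots use the unscaled encoded columns. Cluster centers therefore have to be mapped back to the original units before they are overlaid on such plots. The sketch below shows the idea on synthetic data; all names in it are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
# Synthetic two-feature data on very different scales
X = np.column_stack([rng.uniform(0, 10, 300), rng.uniform(0, 5000, 300)])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Centers are expressed in the scaled [0, 1] space the model was fitted on
centers_scaled = kmeans.cluster_centers_
# Map them back to the original units before overlaying them on unscaled data
centers_original = scaler.inverse_transform(centers_scaled)

print(centers_original)
```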

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))


scat_1 = sns.scatterplot(
    KM_5_clustered_logs,
    x='event_type_enc',
    y='system_enc',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM_5_clustered_logs,
    x='id_evm_enc',
    y='event_type_enc',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

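# The model was fitted on the scaled data, so the centers are mapped back to the encoded units used in the plots
centers_logs = pd.DataFrame(sc.inverse_transform(KM_5_clusters_logs.cluster_centers_), columns=X_logs.columns)
axes[0].scatter(centers_logs['event_type_enc'], centers_logs['system_enc'], marker='s', s=40, c="blue")
axes[1].scatter(centers_logs['id_evm_enc'], centers_logs['event_type_enc'], marker='s', s=40, c="blue")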
plt.show()

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))


scat_1 = sns.scatterplot(
    KM_5_clustered_logs,
    x='description_enc',
    y='id_evm_enc',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM_5_clustered_logs,
    x='description_enc',
    y='event_type_enc',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

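# The model was fitted on the scaled data, so the centers are mapped back to the encoded units used in the plots
centers_logs = pd.DataFrame(sc.inverse_transform(KM_5_clusters_logs.cluster_centers_), columns=X_logs.columns)
axes[0].scatter(centers_logs['description_enc'], centers_logs['id_evm_enc'], marker='s', s=40, c="blue")
axes[1].scatter(centers_logs['description_enc'], centers_logs['event_type_enc'], marker='s', s=40, c="blue")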
plt.show()

##### Checking the size of the clusters

In [None]:
KM_5_clust_sizes_logs = KM_5_clustered_logs.groupby('Cluster').size().to_frame()
KM_5_clust_sizes_logs.columns = ["KM_size"]
KM_5_clust_sizes_logs

#### 4.2.2. BUs DF - with label-encoded categorical features

In [None]:
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X_bu)
visualizer.show()
plt.show()

##### Choosing optimal K
To find an appropriate number of clusters, I used the elbow method: the distortion score (the sum of squared distances from each point to its assigned cluster center, also called inertia) is calculated for a number of clusters between 2 and 10, and the rule is to choose the number of clusters at which the curve shows a kink, or "elbow".

The graph above shows the distortion score decreasing as the number of clusters increases. However, no clear "elbow" is visible. The visualizer's automatic elbow detection suggests 4 clusters, so a choice of 4 or 5 clusters seems fair.

##### Building the model and getting the clusters

In [None]:
KM_5_clusters = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_bu) # initialise and fit K-Means model

KM5_clustered = X_bu.copy()
KM5_clustered.loc[:,'Cluster'] = KM_5_clusters.labels_ # append labels to points

In [None]:
KM5_clustered

##### Visualizing the clusters

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered,
    x='municipality_enc',
    y='qty_votes_on_22',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered,
    x='municipality_enc',
    y='qty_votes_on_13',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

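# Select the center coordinates by column name, so they match the axes of each plot
centers_bu = pd.DataFrame(KM_5_clusters.cluster_centers_, columns=X_bu.columns)
axes[0].scatter(centers_bu['municipality_enc'], centers_bu['qty_votes_on_22'], marker='s', s=40, c="blue")
axes[1].scatter(centers_bu['municipality_enc'], centers_bu['qty_votes_on_13'], marker='s', s=40, c="blue")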
plt.show()

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered,
    x='qty_voters_with_biometrics',
    y='qty_votes_on_22',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered,
    x='qty_voters_with_biometrics',
    y='qty_votes_on_13',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

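# Select the center coordinates by column name, so they match the axes of each plot
centers_bu = pd.DataFrame(KM_5_clusters.cluster_centers_, columns=X_bu.columns)
axes[0].scatter(centers_bu['qty_voters_with_biometrics'], centers_bu['qty_votes_on_22'], marker='s', s=40, c="blue")
axes[1].scatter(centers_bu['qty_voters_with_biometrics'], centers_bu['qty_votes_on_13'], marker='s', s=40, c="blue")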
plt.show()

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered,
    x='qty_voters_with_biometrics',
    y='qty_blank_and_null_votes',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered,
    x='qty_blank_and_null_votes',
    y='qty_votes_on_22',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

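# Select the center coordinates by column name, so they match the axes of each plot
centers_bu = pd.DataFrame(KM_5_clusters.cluster_centers_, columns=X_bu.columns)
axes[0].scatter(centers_bu['qty_voters_with_biometrics'], centers_bu['qty_blank_and_null_votes'], marker='s', s=40, c="blue")
axes[1].scatter(centers_bu['qty_blank_and_null_votes'], centers_bu['qty_votes_on_22'], marker='s', s=40, c="blue")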
plt.show()

##### Checking the size of the clusters

In [None]:
KM5_clust_sizes = KM5_clustered.groupby('Cluster').size().to_frame()
KM5_clust_sizes.columns = ["KM_size"]
KM5_clust_sizes

#### 4.2.3. BUs DF - without categorical features

In [None]:
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X_bu_quant)
visualizer.show()
plt.show()

##### Choosing optimal K
To find an appropriate number of clusters, I used the elbow method: the distortion score (the sum of squared distances from each point to its assigned cluster center, also called inertia) is calculated for a number of clusters between 2 and 10, and the rule is to choose the number of clusters at which the curve shows a kink, or "elbow".

The graph above shows the distortion score decreasing as the number of clusters increases. However, no clear "elbow" is visible. The visualizer's automatic elbow detection suggests 4 clusters, so a choice of 4 or 5 clusters seems fair.

##### Building the model and getting the clusters

In [None]:
KM_5_clusters_only_quant = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_bu_quant) # initialise and fit K-Means model

KM5_clustered_only_quant = X_bu_quant.copy()
KM5_clustered_only_quant.loc[:,'Cluster'] = KM_5_clusters_only_quant.labels_ # append labels to points

In [None]:
KM5_clustered_only_quant

##### Visualizing the clusters

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_votes_on_13',
    y='qty_votes_on_22',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_blank_and_null_votes',
    y='qty_votes_on_13',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='qty_votes_on_22',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='qty_votes_on_13',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

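# Overlay the K-Means centroids; columns are looked up by name so they match the plotted axes
centers = KM_5_clusters_only_quant.cluster_centers_
col = X_bu_quant.columns.get_loc
axes[0].scatter(centers[:, col('qty_voters_with_biometrics')], centers[:, col('qty_votes_on_22')], marker='s', s=40, c="blue")
axes[1].scatter(centers[:, col('qty_voters_with_biometrics')], centers[:, col('qty_votes_on_13')], marker='s', s=40, c="blue")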
plt.show()

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='qty_blank_and_null_votes',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='qty_attendance',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
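# Overlay the K-Means centroids; columns are looked up by name so they match the plotted axes
centers = KM_5_clusters_only_quant.cluster_centers_
col = X_bu_quant.columns.get_loc
axes[1].scatter(centers[:, col('qty_voters_with_biometrics')], centers[:, col('qty_attendance')], marker='s', s=40, c="blue")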
plt.show()

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_able_to_vote',
    y='qty_blank_and_null_votes',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_able_to_vote',
    y='qty_attendance',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

sns.scatterplot(
    KM5_clustered_only_quant,
    x='attendance_rate',
    y='qty_votes_on_13',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='attendance_rate',
    y='qty_votes_on_22',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()

In [None]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='attendance_rate',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_able_to_vote',
    y='attendance_rate',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()

##### Checking the size of the clusters

In [None]:
KM5_clust_sizes = KM5_clustered_only_quant.groupby('Cluster').size().to_frame()
KM5_clust_sizes.columns = ["KM_size"]
KM5_clust_sizes

<a id='DBSCAN'></a>
### 4.2. DBSCAN
In DBSCAN there are two major hyperparameters:

- eps
- min_samples

It is difficult to say in advance which values will work best. Therefore, I will first create a grid of parameter combinations.
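
To make the role of the two hyperparameters concrete, here is a small sketch on toy data. The points (`X_toy`) and the helper `summarize` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy data (assumption): two tight groups plus one isolated point
group_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(20, 2))
isolated = np.array([[20.0, 20.0]])
X_toy = np.vstack([group_a, group_b, isolated])

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X_toy).labels_
    n_clusters = len(set(labels.tolist()) - {-1})  # label -1 marks noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

print(summarize(eps=1.0, min_samples=5))    # moderate radius
print(summarize(eps=100.0, min_samples=5))  # radius larger than the whole data spread
print(summarize(eps=1.0, min_samples=30))   # density requirement no group can meet
```

A larger `eps` merges groups, while a larger `min_samples` turns sparse groups into noise, which is why both values have to be explored together.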

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from itertools import product

#### 4.2.1. Logs DF

After many attempts, I concluded that DBSCAN is not workable on the Logs DF: it always crashed JupyterLab, whether on my own computer, on Google Colab, or on a powerful machine in Google Cloud Platform Workbench.

This is probably because the dataset is huge (more than 2,000,000 rows, even after it was already reduced) and contains only categorical variables (even though they are already encoded).

#### 4.2.2. BUs DF

##### Choosing optimal parameters

In [None]:
eps_values = np.arange(6,10.75,0.75) # eps values to be investigated
min_samples = np.arange(6,12) # min_samples values to be investigated

DBSCAN_params = list(product(eps_values, min_samples))

In [None]:
# Because DBSCAN creates clusters itself based on those two parameters let's check the number of generated clusters.

no_of_clusters = []
sil_score = []

for p in DBSCAN_params:
    DBS_clustering = DBSCAN(eps=p[0], min_samples=p[1]).fit(X_bu_quant)
    no_of_clusters.append(len(np.unique(DBS_clustering.labels_)))
    sil_score.append(silhouette_score(X_bu_quant, DBS_clustering.labels_))

In [None]:
tmp = pd.DataFrame.from_records(DBSCAN_params, columns =['Eps', 'Min_samples'])   
tmp['No_of_clusters'] = no_of_clusters

pivot_1 = pd.pivot_table(tmp, values='No_of_clusters', index='Min_samples', columns='Eps')

fig, ax = plt.subplots(figsize=(12,6))
sns.heatmap(pivot_1, annot=True,annot_kws={"size": 16}, cmap="YlGnBu", ax=ax)
ax.set_title('Number of clusters')
plt.show()

The heatmap immediately above shows that, with the given parameters, the number of clusters varies from 2 to 330 (the count includes the noise label `-1` whenever any point is labelled as noise, because it is computed with `np.unique` over all labels). Most of the combinations give more than 20 clusters. Nevertheless, we can safely choose values located in the bottom-left or the bottom-right corner of the heatmap.
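
The loop above also collects silhouette scores. As a hedged sketch of how the grid could be summarized with the noise label kept apart, the cell below runs on synthetic data; `X_demo` and the small parameter grid are assumptions, not the values used for the BUs DF.

```python
import numpy as np
import pandas as pd
from itertools import product
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_bu_quant (assumption), so the cell runs on its own
X_demo, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6], [-6, 6]],
                       cluster_std=0.5, random_state=0)

rows = []
for eps, min_samples in product([0.3, 0.6, 1.2], [5, 10]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo).labels_
    mask = labels != -1                          # keep only the clustered points
    n_clusters = len(set(labels[mask].tolist()))
    # silhouette_score needs at least 2 distinct labels
    score = silhouette_score(X_demo[mask], labels[mask]) if n_clusters >= 2 else np.nan
    rows.append({"Eps": eps, "Min_samples": min_samples, "Clusters": n_clusters,
                 "Noise": int((~mask).sum()), "Silhouette": score})

grid = pd.DataFrame(rows)
print(grid.pivot_table(values="Silhouette", index="Min_samples", columns="Eps"))
```

Reporting the noise count next to the cluster count makes it easier to spot settings that look good only because most points were discarded as noise.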

##### Cluster # 1

##### Building the model and getting the clusters

In [None]:
DBS_clustering = DBSCAN(eps=16, min_samples=14).fit(X_bu_quant)

DBSCAN_clustered = X_bu_quant.copy()
DBSCAN_clustered.loc[:,'Cluster'] = DBS_clustering.labels_ # append labels to points

##### Checking the size of the clusters

In [None]:
DBSCAN_clust_sizes = DBSCAN_clustered.groupby('Cluster').size().to_frame()
DBSCAN_clust_sizes.columns = ["DBSCAN_size"]
DBSCAN_clust_sizes

DBSCAN created 8 clusters plus the outlier group (label -1). The cluster sizes vary significantly. There are 6579 outliers.

##### Cluster # 2

##### Building the model and getting the clusters

In [None]:
DBS_clustering = DBSCAN(eps=6, min_samples=10).fit(X_bu_quant)

DBSCAN_clustered = X_bu_quant.copy()
DBSCAN_clustered.loc[:,'Cluster'] = DBS_clustering.labels_ # append labels to points

##### Checking the size of the clusters

In [None]:
DBSCAN_clust_sizes = DBSCAN_clustered.groupby('Cluster').size().to_frame()
DBSCAN_clust_sizes.columns = ["DBSCAN_size"]
DBSCAN_clust_sizes

DBSCAN created 3 clusters plus the outlier group (label -1). The cluster sizes are almost the same. There are 34613 outliers in this setup.
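
Since the points labelled `-1` are the anomaly candidates, they can be pulled out for inspection. The sketch below uses a toy DataFrame; the column names `feat_a` and `feat_b` and the parameter values are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Toy frame standing in for X_bu_quant (assumption); the column names are made up
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(loc=100.0, scale=1.0, size=(50, 2)),
                   columns=["feat_a", "feat_b"])
toy.loc[len(toy)] = [150.0, 20.0]  # one clearly atypical row, stored at index 50

labels = DBSCAN(eps=2.0, min_samples=5).fit(toy).labels_
clustered = toy.copy()
clustered["Cluster"] = labels  # append labels to points

outliers = clustered[clustered["Cluster"] == -1]  # DBSCAN labels noise as -1
sizes = clustered.groupby("Cluster").size()
print(sizes)
print(outliers)
```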

<a id='SOM'></a>
### 4.3. SOM
A Self-Organizing Map (SOM) is an unsupervised ANN that updates its weights through competitive learning, in three steps: competition, cooperation, and adaptation.

Each neuron of the output layer holds a weight vector of dimension n, the same dimension as the input. In the competition step, the distance between every output neuron's weight vector and the input sample is computed, and the neuron with the smallest distance is selected as the best fit (the winner, or best matching unit).

In the cooperation step, the winner's neighbours on the map are selected as well. In the adaptation step, the weight vectors of the winner and of its neighbours are moved toward the input: the larger the distance between a neuron's weights and the input, the larger the adjustment.
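
The three steps can be sketched in plain NumPy for a single input sample. This is an illustration under simple assumptions (a 3 x 3 grid, a Gaussian neighbourhood, fixed `learning_rate` and `sigma`); it is not MiniSom's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_rows, grid_cols, n_features = 3, 3, 2
weights = rng.random((grid_rows, grid_cols, n_features))  # one weight vector per neuron
x = np.array([0.9, 0.1])                                   # a single input sample
learning_rate, sigma = 0.5, 1.0

# Competition: the best matching unit (BMU) is the neuron closest to the input
distances = np.linalg.norm(weights - x, axis=2)
bmu = np.unravel_index(np.argmin(distances), distances.shape)

# Cooperation: a Gaussian neighbourhood on the grid, centred on the BMU
rows, cols = np.indices((grid_rows, grid_cols))
grid_dist_sq = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
neighbourhood = np.exp(-grid_dist_sq / (2 * sigma ** 2))

# Adaptation: move every neuron toward the input, scaled by its neighbourhood value
new_weights = weights + learning_rate * neighbourhood[..., None] * (x - weights)
new_distances = np.linalg.norm(new_weights - x, axis=2)
print("BMU:", bmu, "distance before/after:", distances[bmu], new_distances[bmu])
```

With a neighbourhood value of 1 at the BMU and a learning rate of 0.5, the BMU ends up halfway between its old position and the input; neurons farther away on the grid move less.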

In [None]:
from minisom import MiniSom 

#### 4.3.1. BUs DF

In [None]:
X_bu_quant.values.shape # Let's just check the number of columns in the dataset

##### Setup # 1

##### Building the model

In [None]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.1, sigma=1.5)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)

##### Visualizing the U-Matrix

The U-Matrix is a common way to visualize the results of a Self-Organizing Map (SOM). It is a 2D representation in which each cell corresponds to a neuron in the SOM, and the color of each cell represents the distance between that neuron's weight vector and those of its neighbors.

Regions of the U-Matrix with small distances contain neurons whose weight vectors are similar, so they tend to receive similar input vectors and form clusters in the input space; cells with large distances mark the boundaries between such regions.

Note that the cells below plot `som.activation_response(...)`, which counts how many input vectors each neuron wins. In these plots, with the `Blues` colormap, a darker cell means more samples were mapped to that neuron, not a smaller distance to its neighbors. The distance-based U-Matrix itself is available in MiniSom through `som.distance_map()`.
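
To show what a U-Matrix contains, here is a plain NumPy sketch that averages the distance from each neuron to its four direct grid neighbours. The function `u_matrix` and the toy weight grid are assumptions for illustration; the values are not meant to match MiniSom's own `distance_map()` output, which uses its own neighbourhood and scaling.

```python
import numpy as np

def u_matrix(weights):
    """Mean distance from each neuron's weight vector to its 4-connected grid neighbours."""
    n_rows, n_cols, _ = weights.shape
    umat = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n_rows and 0 <= nj < n_cols:
                    dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            umat[i, j] = np.mean(dists)
    return umat

# Toy weight grid (assumption): left two columns near 0, right two columns near 10
weights = np.zeros((3, 4, 2))
weights[:, 2:, :] = 10.0
print(u_matrix(weights))
```

The two outer columns get a value of 0 because all of their neighbours are identical, while the two middle columns get high values: they sit on the boundary between the two groups of neurons.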

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 2

##### Building the model

In [None]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.2, sigma=12)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 3

##### Building the model

In [None]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.5, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 4

##### Building the model

In [None]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.75, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 5

##### Building the model

In [None]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 6

##### Building the model

In [None]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 150000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 7

##### Building the model

In [None]:
neurons_a = 25
neurons_b = 25
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 200000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 8

##### Building the model

In [None]:
neurons_a = 40
neurons_b = 40
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 500000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 9

##### Building the model

In [None]:
neurons_a = 35
neurons_b = 35
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 500000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Setup # 10

##### Building the model

In [None]:
neurons_a = 10
neurons_b = 10
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Generating Clusters

Of course, we can also generate clusters from SOMs. These clusters are based on the winner neurons (basically, each winner neuron represents a cluster).
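
The conversion from a winner's grid coordinates to a single cluster number relies on `np.ravel_multi_index`. A small sketch with made-up winner coordinates for a hypothetical 2 x 3 map:

```python
import numpy as np

grid_shape = (2, 3)  # hypothetical 2 x 3 SOM
# (row, col) of the winning neuron for five samples, in the form som.winner(x) returns
winners = [(0, 0), (0, 2), (1, 0), (1, 2), (0, 2)]

winner_coordinates = np.array(winners).T  # shape (2, n_samples)
cluster_index = np.ravel_multi_index(winner_coordinates, grid_shape)
print(cluster_index)             # flat index = row * n_cols + col
print(np.unique(cluster_index))  # neurons that won at least one sample
```

Neurons that never win a sample do not appear in `np.unique(cluster_index)`, so the number of clusters can be smaller than the number of neurons on the map.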

In [None]:
# each winner neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in X_bu_quant.values]).T
# with np.ravel_multi_index we convert the bidimensional
# coordinates to a monodimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, (neurons_a, neurons_b))

# plotting the clusters using the first 2 dimensions of the data
for c in np.unique(cluster_index):
    plt.scatter(X_bu_quant.values[cluster_index == c, 0],
                X_bu_quant.values[cluster_index == c, 1], label='cluster='+str(c), alpha=.7)

# plotting centroids
#for centroid in som.get_weights():
#    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', 
#                s=80, linewidths=35, color='k', label='centroid')
plt.legend()
plt.show()

##### Setup # 11

##### Building the model

In [None]:
neurons_a = 3
neurons_b = 5
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=.5)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)

##### Visualizing the U-Matrix

In [None]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()

##### Generating Clusters

In [None]:
# each winner neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in X_bu_quant.values]).T
# with np.ravel_multi_index we convert the bidimensional
# coordinates to a monodimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, (neurons_a, neurons_b))

# plotting the clusters using the first 2 dimensions of the data
for c in np.unique(cluster_index):
    plt.scatter(X_bu_quant.values[cluster_index == c, 0],
                X_bu_quant.values[cluster_index == c, 1], label='cluster='+str(c), alpha=.7)

# plotting centroids
#for centroid in som.get_weights():
#    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', 
#                s=80, linewidths=35, color='k', label='centroid')
plt.legend()
plt.show()

##### Setup # 12

##### Building the model

In [None]:
neurons_a = 2
neurons_b = 3
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.3, sigma=.5)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)

##### Generating Clusters

In [None]:
# Defining a function to make it easier and faster
def generate_cluster_from_som(dim1:int, dim2:int)->None:
    # each winner neuron represents a cluster
    winner_coordinates = np.array([som.winner(x) for x in X_bu_quant.values]).T
    # with np.ravel_multi_index we convert the bidimensional
    # coordinates to a monodimensional index
    cluster_index = np.ravel_multi_index(winner_coordinates, (neurons_a, neurons_b))

    # plotting the clusters using the two selected dimensions (dim1, dim2) of the data
    for c in np.unique(cluster_index):
        plt.scatter(X_bu_quant.values[cluster_index == c, dim1],
                    X_bu_quant.values[cluster_index == c, dim2], label='cluster='+str(c), alpha=.7)

    plt.xlabel(X_bu_quant.columns.to_list()[dim1])
    plt.ylabel(X_bu_quant.columns.to_list()[dim2])
    # plotting centroids
    #for centroid in som.get_weights():
    #    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', 
    #                s=80, linewidths=35, color='k', label='centroid')
    plt.legend()
    plt.show()

In [None]:
generate_cluster_from_som(0, 1)

In [None]:
generate_cluster_from_som(2, 0)

In [None]:
generate_cluster_from_som(2, 3)

In [None]:
generate_cluster_from_som(6, 3)

In [None]:
generate_cluster_from_som(6, 0)

In [None]:
generate_cluster_from_som(4, 6)

In [None]:
generate_cluster_from_som(2, 6)

<a id='Conclusions'></a>
## 5. Conclusions <a href='#Top' style="text-decoration: none;">^</a>

### 5.1. Patterns Found in the Data

The K-Means and SOM models generated some interesting clusters. Although the clusters can be explained from different points of view (based on different variables), there are some visualizations in which the clusters stand out more clearly and are much more readable.

Basically, K-Means generated the following 5 clusters from the BUs DF:
- Red (0): EVMs with medium quantity of votes on 22 (Bolsonaro) and medium to high quantity of votes on 13 (Lula). This cluster also shows EVMs with high quantity of voters with biometrics and EVMs with a high attendance rate.

- Blue (1): EVMs with low quantity of votes on 22 (Bolsonaro) and low quantity of votes on 13 (Lula). This cluster also shows EVMs with low quantity of voters with biometrics.

- Green (2): EVMs with medium quantity of votes on 22 (Bolsonaro) and low quantity of votes on 13 (Lula).

- Purple (3): EVMs with high quantity of votes on 22 (Bolsonaro) and medium quantity of votes on 13 (Lula).

- Orange (4): EVMs with low quantity of votes on 22 (Bolsonaro) and high quantity of votes on 13 (Lula). This cluster also shows EVMs with medium quantity of voters with biometrics.

And SOM generated the following 6 clusters from the BUs DF:
- Dark Blue (0): EVMs with medium quantity of votes on 22 (Bolsonaro), medium to high quantity of votes on 13 (Lula), high attendance rate, and high quantity of voters with biometrics.

- Green (1): EVMs with low quantity of votes on 22 (Bolsonaro), high quantity of votes on 13 (Lula), high attendance rate, and medium quantity of voters with biometrics.

- Red (2): EVMs with low quantity of votes on 22 (Bolsonaro), medium quantity of votes on 13 (Lula), medium to high attendance rate, and medium quantity of voters with biometrics.

- Purple (3): EVMs with low quantity of votes on 22 (Bolsonaro), low (the lowest) quantity of votes on 13 (Lula), medium to low attendance rate (the lowest), and low quantity of voters with biometrics.

- Yellow (4): EVMs with high (the highest) quantity of votes on 22 (Bolsonaro), low quantity of votes on 13 (Lula), medium to high attendance rate, and medium quantity of voters with biometrics.

- Light Blue (5): EVMs with low quantity of votes on 22 (Bolsonaro), low quantity of votes on 13 (Lula), medium to high attendance rate, and low quantity of voters with biometrics (the lowest).

### 5.2. Algorithms Comparison

Regarding the three clustering algorithms used in this notebook, K-Means was the easiest to use, whereas SOM was the fastest and also seems to be the most complete and most customizable one. Coincidence or not (probably not), SOM is the only one of the three that is an Artificial Neural Network, although it is a shallow, single-layer network rather than a Deep Learning model.

DBSCAN is very heavy and needs a lot of computing power and tweaking to be really effective. Moreover, DBSCAN seems more suitable for anomaly detection than for clustering itself. Unfortunately, I wasn't able to detect anomalies effectively with DBSCAN in this project, for the reasons described in the next section.

### 5.3. Problems and Challenges

The first major challenge in this project was the data ingestion phase, because it needed a really efficient web scraper (which took me a lot of time to develop) and because of the huge volume of data: in total, around 490,000 EVMs were used in the 2nd round of the 2022 Brazilian Federal Elections. I ended up downloading the zip files of around 110,000 EVMs, and I couldn't keep going because of the limitations of my computer and of my Google Cloud Platform account (yes, I had to use my GCP Vertex AI Workbench for this project, since my computer wasn't enough).

The second major challenge was also a problem, and it was related to the first one: the lack of computing power to properly process this huge amount of data. Even using cloud resources, I got many crashes and had to decrease the size of the datasets (mainly the logs one) to keep working on this project. Trying to work with the whole data from all the logs and all the EVM bulletins (_Boletins de Urna_ - BUs) from 110,000 EVMs wasn't feasible at all.

The third and last problem was the data type of the logs. The fact that all the features of the logs were categorical didn't help at all. In fact, it was the major reason why the K-Means clusters of the logs data didn't seem useful. I tried to solve this problem by label encoding the features (since one-hot / dummy encoding wasn't feasible) and scaling them, but it didn't help much.
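
One possible explanation for the weak results on label-encoded data (an assumption, not something verified in this notebook) is that label encoding imposes an arbitrary order on unordered categories, which distorts the distances that K-Means relies on. The sketch below uses three hypothetical category names to show the effect.

```python
import numpy as np

categories = ["A", "B", "C"]  # hypothetical unordered log-event types

# Label encoding: A -> 0, B -> 1, C -> 2
label_codes = {cat: float(i) for i, cat in enumerate(categories)}
# One-hot encoding: each category gets its own axis
one_hot = {cat: np.eye(len(categories))[i] for i, cat in enumerate(categories)}

def label_distance(a, b):
    """Distance between two categories after label encoding."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between two categories after one-hot encoding."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

print(label_distance("A", "B"), label_distance("A", "C"))      # A looks twice as far from C as from B
print(one_hot_distance("A", "B"), one_hot_distance("A", "C"))  # all pairs are equally far apart
```

One-hot encoding avoids the artificial ordering, but it adds one column per category, which is the feasibility problem mentioned above.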

<a id='References'></a>
## 6. References <a href='#Top' style="text-decoration: none;">^</a>

<a href='https://analyticsindiamag.com/beginners-guide-to-self-organizing-maps/' style="text-decoration: none;">[1]</a> 	Victor Dey, "Beginners Guide to Self-Organizing Maps", AIM, 2021<br>
<a href='https://github.com/JustGlowing/minisom' style="text-decoration: none;">[2]</a> 	Giuseppe Vettigli, "MiniSom", GitHub, 2022<br>
<a href='https://www.kaggle.com/code/datark1/customers-clustering-k-means-dbscan-and-ap' style="text-decoration: none;">[3]</a> 	Robert Kwiatkowski, "Customers clustering: K-Means, DBSCAN and AP", Kaggle, 2022<br>
<a href='https://towardsdatascience.com/clustering-on-numerical-and-categorical-features-6e0ebcf1cbad' style="text-decoration: none;">[4]</a> 	Jorge Martín Lasaosa, "Clustering on numerical and categorical features", Towards Data Science, 2021<br>