# Tratamento de dados

Antes de realizar as análises e treinamento da rede neural, foi necessário realizar algumas modificações e tratamentos nos dados brutos. Este notebook documenta o processo de limpeza e preparação dos dados para garantir que estejam adequados para serem utilizados posteriormente na rede.

## Bibliotecas

In [23]:
# Importações utilizadas

import pandas as pd
import re
import matplotlib.pyplot as plt
import numpy as np

## Dataset

O dataset a seguir contém todos os dados fornecidos pelo Computational 2D Materials Database (C2DB) [1]:

In [24]:
df = pd.read_csv("C2DB_full.csv")
df

Unnamed: 0,Formula ✕,Band gap ✕,2D plasma frequency (x) ✕,2D plasma frequency (y) ✕,Band gap (G₀W₀) ✕,Band gap (HSE06) ✕,"Conduction band effective mass, direction 1 ✕","Conduction band effective mass, direction 2 ✕",Cond. band minimum ✕,Conduction band minimum (G₀W₀) ✕,...,"Stiffness tensor, 32-component ✕","Stiffness tensor, 33-component ✕",Stoichiometry ✕,Mass ✕,Age ✕,ID ✕,Unique identifier ✕,Username ✕,Vacuum level difference ✕,Volume ✕
0,Be4,0.0,-,-,-,-,-,-,-,-,...,-,-,A,36.049,6w,1,Be4-09dd42ad034e,cmr,-0.0,256.436
1,As4O8,3.232,-,-,-,-,1.117,1.155,-3.938,-,...,0.000,-69.689,AB2,427.678,6w,2,As4O8-5242a449d950,cmr,0.0,1411.694
2,Ca4As4,0.998,-,-,-,1.583,1.141,3.465,-1.024,-,...,0.003,26.732,AB,459.998,6w,3,As4Ca4-bf7bbbdbefe0,cmr,-0.0,1426.071
3,Fe4S8,0.0,-,-,-,-,-,-,-,-,...,3.617,55.290,AB2,479.860,6w,4,Fe4S8-897195c26aff,cmr,0.0,1080.629
4,In2Se2,1.63,-,-,-,2.254,0.207,1.067,-2.271,-,...,-0.000,13.382,AB,387.578,6w,5,In2Se2-0a48e35c06ea,cmr,-0.0,761.622
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4029,Rh2P2S6,0.0,4.043,4.045,-,0.000,-,-,-,-,...,0.021,36.035,ABC3,460.119,6w,4036,P2Rh2S6-b6a27022f56f,cmr,-0.001,580.987
4030,Ta2P2Se6,0.294,0.000,0.000,-,0.813,0.235,1.179,-0.799,-,...,-1.189,22.64,ABC3,897.669,6w,4037,P2Ta2Se6-e2c90519357b,cmr,-0.0,677.616
4031,Zr2P2Se6,0.394,0.000,0.000,-,1.074,0.741,1.006,-1.142,-,...,0.937,24.638,ABC3,718.222,6w,4038,P2Zr2Se6-2486f04ec8ea,cmr,0.0,755.942
4032,Mo2W2Se8,1.27,0.000,0.000,-,1.749,0.479,0.513,0.853,-,...,-0.002,90.448,ABC4,1191.348,6w,4039,Mo2W2Se8-a1d716aad84d,cmr,-0.0,700.669


## Seleção de atributos

Para realizar a previsão de band gap, precisaremos apenas de alguns parâmetros específicos. A escolha desses parâmetros foi feita com base em uma análise manual, selecionando aqueles que possuem maior sentido físico para o problema em questão. São eles:

- Formula
- Band gap [TARGET]
- Thermodynamic stability level	
- Energy
- Work function
- Heat of formation
- Space group number
- Volume of unit cell

Dessa forma, retiraremos todas as colunas que não serão utilizadas:

In [25]:
df.drop(columns=['2D plasma frequency (x)              ✕',
       '2D plasma frequency (y)              ✕',
       'Band gap (G₀W₀)              ✕', 'Band gap (HSE06)              ✕',
       'Conduction band effective mass, direction 1              ✕',
       'Conduction band effective mass, direction 2              ✕',
       'Cond. band minimum              ✕',
       'Conduction band minimum (G₀W₀)              ✕',
       'Conduction band minimum (HSE06)              ✕',
       'Dir. band gap              ✕', 'Direct band gap (G₀W₀)              ✕',
       'Direct band gap (HSE06)              ✕',
       'Dir. gap wo. soc.              ✕', 'Exc. bind. energy              ✕',
       'Gap wo. soc.              ✕',
       'Phonon dynamic stability (low/high)              ✕', 'Valence band effective mass, direction 1              ✕',
       'Valence band effective mass, direction 2              ✕',
       'Val. band maximum              ✕',
       'Valence band maximum (G₀W₀)              ✕',
       'Valence band maximum (HSE06)              ✕',
       'First class material              ✕', 'Calculator              ✕',
       'Anisotropic exchange (out-of-plane)              ✕',
       'Area of unit-cell              ✕', 'Topology              ✕',
       'Crystal type              ✕', 'DOS at ef              ✕',
       'DOS at ef no soc.              ✕',
       'Energy above convex hull              ✕', 'Fermi level              ✕', 'Magnetic anisotropy (E<sub>z</sub> - E<sub>x</sub>)              ✕',
       'Magnetic anisotropy (E<sub>z</sub> - E<sub>y</sub>)              ✕',
       'Magnetic easy axis              ✕', 'Magnetic state              ✕', 'Material class              ✕',
       'Material has inversion symmetry              ✕',
       'Magnetic              ✕', 'Material unique ID              ✕',
       'Maximum force              ✕', 'Maximum stress              ✕',
       'Maximum value of S_z at magnetic sites              ✕',
       'Minimum eigenvalue of Hessian              ✕',
       'Monolayer reported DOI              ✕',
       'Nearest neighbor exchange coupling              ✕',
       'Charge              ✕', 'Number of atoms              ✕',
       'Number of nearest neighbors              ✕', 'n-spins              ✕',
       'Out-of-plane dipole along +z axis              ✕',
       'Path to collection folder              ✕', 'PBC              ✕',
       'Point group              ✕', 'Unique ID              ✕',
       'Related COD id              ✕', 'Related ICSD id              ✕',
       'Single-ion anisotropy (out-of-plane)              ✕',
       'Soc. total energy, x-direction              ✕',
       'Soc. total energy, y-direction              ✕',
       'Soc. total energy, z-direction              ✕',
       'Space group              ✕', 'Speed of sound (x)              ✕',
       'Speed of sound (y)              ✕',
       'Static interband polarizability (x)              ✕',
       'Static interband polarizability (y)              ✕',
       'Static interband polarizability (z)              ✕',
       'Static lattice polarizability (x)              ✕',
       'Static lattice polarizability (y)              ✕',
       'Static lattice polarizability (z)              ✕',
       'Static total polarizability (x)              ✕',
       'Static total polarizability (y)              ✕',
       'Static total polarizability (z)              ✕',
       'Stiffness dynamic stability (low/high)              ✕',
       'Stiffness tensor, 11-component              ✕',
       'Stiffness tensor, 12-component              ✕',
       'Stiffness tensor, 13-component              ✕',
       'Stiffness tensor, 21-component              ✕',
       'Stiffness tensor, 22-component              ✕',
       'Stiffness tensor, 23-component              ✕',
       'Stiffness tensor, 31-component              ✕',
       'Stiffness tensor, 32-component              ✕',
       'Stiffness tensor, 33-component              ✕',
       'Stoichiometry              ✕', 'Mass              ✕',
       'Age              ✕', 'ID              ✕',
       'Unique identifier              ✕', 'Username              ✕',
       'Vacuum level difference              ✕', 'Vacuum level              ✕', 'Magnetic moment              ✕'], inplace=True)

Renomeando as colunas do dataset:

In [26]:
df.rename(columns={
'Formula              ✕': 'Formula', 
'Band gap              ✕': 'Band gap',
'Thermodynamic stability level              ✕': 'Thermodynamic stability level', 
'Energy              ✕': 'Energy',
'Work function (avg. if finite dipole)              ✕': 'Work function', 
'Heat of formation              ✕': 'Heat of formation',
'Space group number              ✕': 'Space group number',
'Volume              ✕': 'Volume of unit cell',
}, inplace=True)

## Retirando as linhas com dados faltantes

Já que temos uma quantidade considerável de dados e poucas linhas com dados faltantes, vamos simplesmente dropar as linhas que apresentam algum item faltante, sem usar nehuma técnica de preenchimento artificial:

In [27]:
# Dropping das linhas que contêm "-"

df_sem_hifen = df.drop(df[df.eq("-").any(axis=1)].index)
df = df_sem_hifen

# Remove linhas com células vazias

df = df.dropna()  

In [28]:
df

Unnamed: 0,Formula,Band gap,Thermodynamic stability level,Energy,Work function,Heat of formation,Space group number,Volume of unit cell
0,Be4,0.0,1,-13.110,5.102,0.425,67,256.436
1,As4O8,3.232,3,-72.425,6.94,-1.065,31,1411.694
2,Ca4As4,0.998,2,-32.647,2.781,-0.743,14,1426.071
3,Fe4S8,0.0,2,-70.802,5.0,-0.168,7,1080.629
4,In2Se2,1.63,3,-14.491,4.59,-0.500,12,761.622
...,...,...,...,...,...,...,...,...
4029,Rh2P2S6,0.0,3,-52.474,4.675,-0.298,162,580.987
4030,Ta2P2Se6,0.294,2,-54.430,4.853,-0.317,2,677.616
4031,Zr2P2Se6,0.394,3,-53.151,4.916,-0.672,2,755.942
4032,Mo2W2Se8,1.27,3,-81.079,4.417,-0.653,25,700.669


## Particionando as fórmulas moleculares

Como os dados apresentam fórmulas químicas, é interessante particioná-las para oferecer uma maior capacidade de capturar informações específicas de cada parte da fórmula, podendo melhorar a qualidade e a interpretabilidade do modelo. Para implementar o particionamento, utilizamos a função abaixo fornecida pelo ChatGPT [4]. A biblioteca `re` funciona utilizando a sintaxe de expressões regulares, que são ferramentas para realizar pesquisas e manipulações de texto [6]. 

In [31]:
def extract_elements(formula):
    elements = re.findall(r'([A-Z][a-z]*)(\d*)', formula)
    return dict((el, int(num) if num else 1) for el, num in elements)

df['Elementos'] = df['Formula'].apply(extract_elements)
df_elementos = pd.DataFrame(df['Elementos'].tolist()).fillna(0)

df = df.drop(columns=['Elementos'])

df_elementos

Unnamed: 0,Be,As,O,Ca,Fe,S,In,Se,Sc,V,...,Os,Hg,Ir,Mo,Re,Rh,Ru,Y,Cs,K
0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,4.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,4.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,4.0,8.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4018,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4019,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4020,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Eletronegatividade

Um parâmetro que pode influenciar de forma considerável na previsão e que não constava no dataset é a eletronegatividade. Assim, optou-se por incluir manualmente essa coluna ao conjunto de dados. Contudo, a eletronegatividade é um parâmetro atômico, portanto, para incluí-la utilizou-se uma expressão que calcula a eletronegatividade média da molécula considerando a contribuição de cada elemento com base em seu número de átomos e sua eletronegatividade:

$$\chi_m = \frac{\sum_i n_i \cdot \chi_i}{\sum_i n_i} $$

em que:

$\chi_m$ e $\chi_i$ são as eletronegatividades da molécula e do átomo, respectivamente;

e $n_i$ é o número de átomos.

O dicionário a seguir contém as eletronegatividades de todos os elementos presentes nas fórmulas do dataset, obtidas no site da FQ.pt [5].

In [32]:
eletronegatividades = {
 'H': 2.2,
 'Li': 0.98,
 'Be': 1.57,
 'B': 2.04,
 'C': 2.55,
 'N': 3.04,
 'O': 3.44,
 'F': 3.98,
 'Na': 0.93,
 'Mg': 1.31,
 'Al': 1.61,
 'Si': 1.9,
 'P': 2.19,
 'S': 2.58,
 'Cl': 3.16,
 'K': 0.82,
 'Ca': 1.0,
 'Sc': 1.36,
 'Ti': 1.54,
 'V': 1.63,
 'Cr': 1.66,
 'Mn': 1.55,
 'Fe': 1.83,
 'Co': 1.88,
 'Ni': 1.91,
 'Cu': 1.9,
 'Zn': 1.65,
 'Ga': 1.81,
 'Ge': 2.01,
 'As': 2.18,
 'Se': 2.55,
 'Br': 2.96,
 'Rb': 0.82,
 'Sr': 0.95,
 'Y': 1.22,
 'Zr': 1.33,
 'Nb': 1.6,
 'Mo': 2.16,
 'Ru': 2.2,
 'Rh': 2.28,
 'Pd': 2.2,
 'Ag': 1.93,
 'Cd': 1.69,
 'In': 1.78,
 'Sn': 1.96,
 'Sb': 2.05,
 'Te': 2.1,
 'I': 2.66,
 'Cs': 0.79,
 'Ba': 0.89,
 'Hf': 1.3,
 'Ta': 1.5,
 'W': 2.36,
 'Re': 1.9,
 'Os': 2.2,
 'Ir': 2.2,
 'Pt': 2.28,
 'Au': 2.54,
 'Hg': 2.0,
 'Tl': 1.62,
 'Pb': 2.33,
 'Bi': 2.02,
}

In [33]:
# Adiciona a eletronegatividade respectiva de cada molécula em uma lista

eletronegatividade_molecula = []

for indice, linha in df_elementos.iterrows():
    eletronegatividade = 0
    soma_elementos = linha.sum()

    for elemento, quantidade in linha.items():
        valor_eletronegatividade = eletronegatividades[elemento]
        contribuicao = (quantidade * valor_eletronegatividade) / soma_elementos
        eletronegatividade += contribuicao

    eletronegatividade_molecula.append(eletronegatividade)

In [34]:
# Cria uma coluna no dataset para os dados de eletronegatividade

df['Electronegativity'] = eletronegatividade_molecula

Juntando os dados particionados ao dataset:

In [35]:
df = df.join(df_elementos, how='right')

Retirando a coluna "Formula", já que agora ela está escrita com base nos elementos:

In [36]:
df_sem_formula = df.drop('Formula', axis=1)
df = df_sem_formula

## Salvando em arquivo do dataset, no formato .csv

In [37]:
df.to_csv('dataset_tratado.csv', index=False)

In [38]:
df

Unnamed: 0,Band gap,Thermodynamic stability level,Energy,Work function,Heat of formation,Space group number,Volume of unit cell,Electronegativity,Be,As,...,Os,Hg,Ir,Mo,Re,Rh,Ru,Y,Cs,K
0,0.0,1.0,-13.110,5.102,0.425,67.0,256.436,1.570000,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3.232,3.0,-72.425,6.94,-1.065,31.0,1411.694,3.020000,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.998,2.0,-32.647,2.781,-0.743,14.0,1426.071,1.590000,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,2.0,-70.802,5.0,-0.168,7.0,1080.629,2.330000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.63,3.0,-14.491,4.59,-0.500,12.0,761.622,2.165000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4017,0.0,3.0,-27.794,5.116,-0.398,59.0,369.582,2.506667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4018,1.007,3.0,-9.915,5.042,-0.183,156.0,259.417,2.420000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4019,0.0,3.0,-28.959,3.499,-0.312,129.0,330.404,1.646667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4020,0.0,3.0,-20.727,5.08,-0.947,156.0,199.128,2.243333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Referências

1. C2DB. (2024). Computational 2D Materials Database. Disponível em: https://c2db.fysik.dtu.dk/.
2. CASSAR, D. R. Transformação e normalização. (2023)
3. CASSAR, D. R. Conversão simbólico-numérico. (2023)
4. ChatGPT para ajuda na resolução de bugs. Disponível em: https://chatgpt.com/.
5. FQ.pt. (2024). Eletronegatividade. Disponível em: https://www.fq.pt/ligacao-quimica/eletronegatividade.
6. Python Software Foundation. (2024). re — Operações com expressões regulares. Disponível em: https://docs.python.org/pt-br/3/library/re.html.