<a href="https://colab.research.google.com/github/telmacarvalho/tcc-smishing/blob/main/train_ml_advanced_smishing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training the BERTimbau model

## Import libraries

In [None]:
# Data manipulation
import pandas as pd

In [None]:
# Classical machine learning (scikit-learn)
from sklearn.model_selection import train_test_split

In [None]:
# Advanced machine learning (BERTimbau)
!pip install datasets -q
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

## 9. Loading the data from the *refined* folder

In [None]:
# Grant access to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the refined data from the refined folder
project_path = '/content/drive/MyDrive/tcc/'
file_path = f'{project_path}data/refined/data_processed.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,id,source,text,label,comprimento,texto_processado
0,1_0,sms,"Oi, peço para transferir aquele valor para meu...",Legitimate,71,oi peo transfer val mpes nmer
1,2_0,sms,Bom dia babe..tudo bem? Não se esqueça de liga...,Legitimate,85,bom dia babetud bem esque lig pra tio boss h bj
2,3_0,sms,AEN8AFWXJHC Confirmado. Compraste 19.00MT de ...,Legitimate,157,aenafwxjhc confirm compr mt credit pmo nov sal...
3,4_0,sms,"7GD04E51YZM. Caro Cliente, o codigo para efect...",Legitimate,189,gdeyzm car client codig efectu levant cont mpe...
4,5_0,sms,Bom dia babe. Está bem. Vou comprar.\nBoa viagem,Legitimate,47,bom dia bab est bem vou compr boa viag


## 10. Splitting the data into training and test sets

In [None]:
# Split the data into training and test sets, stratified by label
X = df['texto_processado']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f'Number of training texts = {X_train.shape[0]} and test texts = {X_test.shape[0]}')
print(f'Number of training labels = {y_train.shape[0]} and test labels = {y_test.shape[0]}')

Number of training texts = 2048 and test texts = 513
Number of training labels = 2048 and test labels = 513
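
Because the split passes `stratify=y`, the class proportions in the training and test sets should be roughly the same. The sketch below shows one way to check this. The toy label lists are made up for illustration and stand in for `y_train.tolist()` and `y_test.tolist()`; the helper `class_proportions` is a hypothetical name, not part of the notebook.

```python
from collections import Counter

def class_proportions(labels):
    """Return the share of each class in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Toy labels (assumed data) standing in for y_train.tolist() and y_test.tolist()
toy_train = ['Legitimate'] * 6 + ['Smishing'] * 2
toy_test = ['Legitimate'] * 3 + ['Smishing'] * 1

train_props = class_proportions(toy_train)
test_props = class_proportions(toy_test)
print(train_props)
print(test_props)
```

On the real data, the two dictionaries would not be expected to match exactly, only to be close, since the class counts rarely divide evenly between the splits.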


## 11. Preparing the data in the Hugging Face format

In [None]:
# Convert the training and test splits to the Dataset format
# Note: by convention, the columns are named 'text' and 'label'
train_dict = {'text': X_train.tolist(), 'label': y_train.tolist()}
test_dict = {'text': X_test.tolist(), 'label': y_test.tolist()}

train_dataset = Dataset.from_dict(train_dict)
test_dataset = Dataset.from_dict(test_dict)

print("Datasets converted to the Hugging Face format:")
print(train_dataset)
print(test_dataset)

Datasets converted to the Hugging Face format:
Dataset({
    features: ['text', 'label'],
    num_rows: 2048
})
Dataset({
    features: ['text', 'label'],
    num_rows: 513
})
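
At this point the `label` column still holds strings (`'Legitimate'`, `'Smishing'`), while a two-class classification head generally expects integer class ids during training. The sketch below shows one possible encoding step. The names `label2id`, `id2label` and `encode_labels` are illustrative, and the choice of which class maps to 0 or 1 is an assumption, not something the notebook fixes.

```python
# Hypothetical mapping; which integer is assigned to each class is an assumption
label2id = {'Legitimate': 0, 'Smishing': 1}
id2label = {idx: name for name, idx in label2id.items()}

def encode_labels(labels):
    """Convert string labels to integer ids, failing loudly on unknown labels."""
    unknown = set(labels) - set(label2id)
    if unknown:
        raise ValueError(f'Unknown labels: {sorted(unknown)}')
    return [label2id[label] for label in labels]

encoded = encode_labels(['Legitimate', 'Smishing', 'Smishing'])
print(encoded)
```

In the notebook, this could be applied to `y_train.tolist()` and `y_test.tolist()` before building the dictionaries passed to `Dataset.from_dict`. The same two mappings can also be passed to `from_pretrained` as `id2label` and `label2id` so that predictions are reported with readable class names.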


## 12. Tokenization and loading the pre-trained model

In [None]:
model_name = 'neuralmind/bert-base-portuguese-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define a function to tokenize the data
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)


In [None]:
# Apply the tokenization to the datasets
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

print("\n Examples from the dataset after tokenization:")
print(f'\n {tokenized_train_dataset[0]}\n', tokenized_train_dataset[1])

 Examples from the dataset after tokenization:

 {'text': 'val enviam next cont vem nom helen orteci', 'label': 'Smishing', 'input_ids': [101, 1201, 5379, 22287, 872, 20220, 336, 3539, 202, 22287, 7920, 194, 438, 8948, 22283, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1,
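
The long run of zeros in `input_ids` above is the result of `padding='max_length'` with `max_length=128`: short messages are filled up to 128 positions, and `attention_mask` marks which positions are real tokens. The pure-Python sketch below illustrates that idea only. It is not the tokenizer's implementation; `pad_and_truncate` is a hypothetical helper, the sample ids are made up, and `pad_id=0` is assumed from the output above.

```python
def pad_and_truncate(token_ids, max_length=128, pad_id=0):
    """Illustrative stand-in for padding='max_length' with truncation=True.

    Simplification: a real tokenizer also keeps the closing special token
    when it truncates; this sketch just cuts the list.
    """
    # Cut sequences that are longer than max_length
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    # Real tokens get mask 1, padding positions get mask 0
    attention_mask = [1] * len(ids) + [0] * n_pad
    # Pad on the right up to max_length
    input_ids = ids + [pad_id] * n_pad
    return input_ids, attention_mask

# A short, made-up sequence in the same style as the output above
sample = [101, 1201, 5379, 22287, 872, 102]
short_ids, short_mask = pad_and_truncate(sample, max_length=8)
print(short_ids)
print(short_mask)

# A sequence longer than max_length is cut and needs no padding
long_ids, long_mask = pad_and_truncate(list(range(1, 11)), max_length=8)
print(long_ids)
print(long_mask)
```

Padding every message to a fixed length keeps all examples the same shape, at the cost of processing many padding positions for short SMS texts.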

In [None]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 13. Loading the pre-trained model