<a href="https://colab.research.google.com/github/slavallec/sextingscan/blob/master/Modelo_SextingScan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP model for detecting sexting

### 1. Installing the library for BERT

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)

### 2. Importing the required libraries

In [2]:
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
import numpy as np
from sklearn.model_selection import train_test_split
from torch import nn,optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from textwrap import wrap

### 3. Initial configuration

In [13]:
# Initialization: seed, batch size, dataset path and number of classes
RANDOM_SEED = 42
BATCH_SIZE = 16
DATASET_PATH = '/content/drive/My Drive/M2/NLP/sexting_spanish_dataset.csv'
NCLASESS = 2

np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


### 4. Loading the dataset from Google Drive

In [14]:
from google.colab import drive
drive.mount('/content/drive')

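# Malformed rows are skipped with a warning. on_bad_lines needs pandas >= 1.3;
# it replaces the deprecated error_bad_lines=False.
df = pd.read_csv(DATASET_PATH, on_bad_lines = 'warn')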
print(df.head())
print(df.shape)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
                             frase sentiment
0                Me mandas tu pack  Negativo
1           Que bonita estabas hoy  Positivo
2                  Que bonita eres  Positivo
3                   Pasame tu pack  Negativo
4  Mandame una foto de tus piernas  Negativo
(191, 2)


b'Skipping line 14: expected 2 fields, saw 3\nSkipping line 43: expected 2 fields, saw 3\nSkipping line 51: expected 2 fields, saw 3\nSkipping line 52: expected 2 fields, saw 3\nSkipping line 68: expected 2 fields, saw 4\nSkipping line 69: expected 2 fields, saw 3\nSkipping line 144: expected 2 fields, saw 3\nSkipping line 146: expected 2 fields, saw 3\nSkipping line 191: expected 2 fields, saw 3\n'


### 5. Normalizing the dataset

In [15]:
df['label'] = (df['sentiment']=='Positivo').astype(int)
df.drop('sentiment', axis=1, inplace=True)
df.head()

                             frase  label
0                Me mandas tu pack      0
1           Que bonita estabas hoy      1
2                  Que bonita eres      1
3                   Pasame tu pack      0
4  Mandame una foto de tus piernas      0
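
The comparison `df['sentiment']=='Positivo'` maps every value other than the exact string `Positivo` to 0, so a misspelled or whitespace-padded label would silently be encoded as 0. A stricter mapping could be sketched as follows; `LABEL_MAP` and `encode_label` are hypothetical names used only for illustration, not part of the notebook.

```python
# Hypothetical strict label encoder: unknown values raise instead of becoming 0.
LABEL_MAP = {"Positivo": 1, "Negativo": 0}

def encode_label(raw: str) -> int:
    """Map a raw sentiment string to its integer label, tolerating outer whitespace."""
    key = raw.strip()
    if key not in LABEL_MAP:
        raise ValueError(f"unexpected sentiment value: {raw!r}")
    return LABEL_MAP[key]

labels = [encode_label(s) for s in ["Negativo", "Positivo", "Positivo "]]
print(labels)
```

With pandas, the same function could be applied with `df['sentiment'].map(encode_label)`, assuming the column holds only strings.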


### 6. Tokenization

In [16]:
# Load the tokenizer of BETO, a Spanish BERT model
PRE_TRAINED_MODEL_NAME = "finiteautomata/beto-sentiment-analysis"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

### 7. Preparing the dataset for training

In [17]:
class SextingDataset(Dataset):
  def __init__(self,frases,labels,tokenizer):
    self.frases = frases
    self.labels = labels
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.frases)

  def __getitem__(self,item):
    frase = str(self.frases[item])
    label = self.labels[item]
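    # Use the tokenizer stored on the instance, not the global one
    encoding = self.tokenizer.encode_plus(
        frase,
        max_length = 15,
        truncation = True,
        add_special_tokens = True,
        return_token_type_ids = False,
        padding = 'max_length',  # replaces the deprecated pad_to_max_length=True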
        return_attention_mask = True,
        return_tensors = 'pt'
    )

    return {
        'frase': frase,
        'input_ids': encoding['input_ids'].flatten(),
        'attention_mask': encoding['attention_mask'].flatten(),
        'label': torch.tensor(label, dtype=torch.long)
    }
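
The encoding step above produces fixed-length `input_ids` plus an `attention_mask` that marks real tokens with 1 and padding with 0. The following is a simplified sketch of that idea in plain Python. The token IDs are made up, `pad_or_truncate` is a hypothetical helper, and unlike the real tokenizer it does not keep the closing special token when truncating.

```python
from typing import List, Tuple

def pad_or_truncate(token_ids: List[int], max_length: int,
                    pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # cut sequences that are too long
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)         # how many padding slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# Five made-up token IDs padded to a length of 8
ids, mask = pad_or_truncate([101, 7, 8, 9, 102], max_length=8)
print(ids)
print(mask)
```

The model then ignores the padded positions because the attention mask is passed alongside the IDs.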

In [18]:
# Data Loader
def data_loader(df,tokenizer, batch_size):
  dataset = SextingDataset(
      frases = df.frase.to_numpy(),
      labels = df.label.to_numpy(),
      tokenizer = tokenizer
  )

  # Use the batch_size argument rather than the global BATCH_SIZE
  return DataLoader(dataset, batch_size = batch_size, num_workers =  4)

### 8. Splitting into training and test sets

In [19]:
df_train, df_test = train_test_split(df, test_size = 0.2, random_state=RANDOM_SEED)

train_data_loader = data_loader(df_train, tokenizer, BATCH_SIZE)
test_data_loader = data_loader(df_test, tokenizer, BATCH_SIZE)



### 9. Creating the new model

In [20]:
# Define the model: BERT encoder, dropout and a linear classification layer
class SextingSentimentClassifier(nn.Module):
  def __init__(self,n_classes):
    super(SextingSentimentClassifier, self).__init__()
    self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME, return_dict=False)
    self.drop = nn.Dropout(p=0.3)
    self.linear = nn.Linear(self.bert.config.hidden_size, n_classes)

  def forward(self, input_ids, attention_mask):
    _, cls_output = self.bert(
        input_ids = input_ids,
        attention_mask = attention_mask
    )
    drop_output = self.drop(cls_output)
    output = self.linear(drop_output)
    return output

In [22]:
model = SextingSentimentClassifier(NCLASESS)
model = model.to(device)

Some weights of the model checkpoint at finiteautomata/beto-sentiment-analysis were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### 10. Training configuration

In [24]:
EPOCHS = 10
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps = 0,
    num_training_steps = total_steps
)
loss_fn = nn.CrossEntropyLoss().to(device)
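
With `num_warmup_steps = 0`, the linear schedule starts at the full learning rate and decays it linearly to zero over `total_steps`. A minimal sketch of that multiplier in plain Python follows. `linear_schedule_factor` is a hypothetical helper, and the value of 100 total steps is purely illustrative, not taken from the notebook.

```python
def linear_schedule_factor(step: int, num_warmup_steps: int,
                           num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
illustrative_total_steps = 100  # assumption for the example only
for step in (0, 50, 100):
    print(step, base_lr * linear_schedule_factor(step, 0, illustrative_total_steps))
```

Because `scheduler.step()` is called once per batch in the training loop, the step counter here corresponds to batches, not epochs.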

### 11. Training loop

In [27]:
def train_model(model,data_loader,loss_fn,optimizer,device,scheduler,n_examples):
  model = model.train()
  losses = []
  correct_predictions = 0
  for batch in data_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['label'].to(device)
    outputs = model(input_ids = input_ids, attention_mask = attention_mask)
    _, preds = torch.max(outputs, dim=1)
    loss = loss_fn(outputs,labels)
    correct_predictions += torch.sum(preds == labels)
    losses.append(loss.item())
    loss.backward()
    # clip_grad_norm_ (in-place) replaces the deprecated clip_grad_norm
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
  return correct_predictions.double()/n_examples, np.mean(losses)

def eval_model(model,data_loader,loss_fn,device,n_examples):
  model = model.eval()
  losses = []
  correct_predictions = 0
  with torch.no_grad():
    for batch in data_loader:
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['label'].to(device)
      outputs = model(input_ids = input_ids, attention_mask = attention_mask)
      _, preds = torch.max(outputs, dim=1)
      loss = loss_fn(outputs,labels)
      correct_predictions += torch.sum(preds == labels)
      losses.append(loss.item())
  return correct_predictions.double()/n_examples, np.mean(losses)
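
Both functions turn the model's raw outputs (logits) into a predicted class with an argmax and into a loss with cross-entropy. The sketch below reproduces those two computations in plain Python on made-up logits, to show what the returned accuracy and loss measure. The helper names are hypothetical, and the real loop uses `torch.max` and `nn.CrossEntropyLoss` instead.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

def accuracy(batch_logits: List[List[float]], labels: List[int]) -> float:
    """Fraction of rows whose argmax equals the true label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Made-up logits for three phrases and two classes
batch = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.0]]
labels = [0, 1, 1]
mean_loss = sum(cross_entropy(l, y) for l, y in zip(batch, labels)) / len(labels)
print(accuracy(batch, labels), mean_loss)
```

A confident wrong prediction contributes a large loss, which is one way accuracy can stay flat while the loss keeps growing.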

### 12. Running the training

In [28]:
for epoch in range(EPOCHS):
  print('Epoca {} de {}'.format(epoch+1, EPOCHS))
  print('------------------------')
  train_acc,train_loss = train_model(
      model,train_data_loader, loss_fn, optimizer, device, scheduler, len(df_train)
  )
  test_acc, test_loss = eval_model(
      model, test_data_loader, loss_fn, device, len(df_test)
  )
  print('Entrenamiento: Loss: {}, accuracy: {}'.format(train_loss, train_acc))
  print('Validación: Loss: {}, accuracy: {}'.format(test_loss, test_acc))
  print(' ')

Epoca 1 de 10
------------------------




Entrenamiento: Loss: 0.24396389557255638, accuracy: 0.9022556390977443
Validación: Loss: 1.1927785873413086, accuracy: 0.7931034482758621
 
Epoca 2 de 10
------------------------




Entrenamiento: Loss: 0.1906142846888138, accuracy: 0.9323308270676691
Validación: Loss: 1.3993825316429138, accuracy: 0.7241379310344828
 
Epoca 3 de 10
------------------------




Entrenamiento: Loss: 0.11082385056134728, accuracy: 0.9624060150375939
Validación: Loss: 1.4445909708738327, accuracy: 0.7586206896551724
 
Epoca 4 de 10
------------------------




Entrenamiento: Loss: 0.06669708644039929, accuracy: 0.9849624060150375
Validación: Loss: 1.4217157810926437, accuracy: 0.7758620689655172
 
Epoca 5 de 10
------------------------




Entrenamiento: Loss: 0.009119225898757577, accuracy: 1.0
Validación: Loss: 1.4893786758184433, accuracy: 0.7068965517241379
 
Epoca 6 de 10
------------------------




Entrenamiento: Loss: 0.00496845634188503, accuracy: 1.0
Validación: Loss: 1.5914259552955627, accuracy: 0.7586206896551724
 
Epoca 7 de 10
------------------------




Entrenamiento: Loss: 0.0012992755186537073, accuracy: 1.0
Validación: Loss: 1.644255056977272, accuracy: 0.7758620689655172
 
Epoca 8 de 10
------------------------




Entrenamiento: Loss: 0.000995175854768604, accuracy: 1.0
Validación: Loss: 1.6727291643619537, accuracy: 0.7758620689655172
 
Epoca 9 de 10
------------------------




Entrenamiento: Loss: 0.0008286610569080545, accuracy: 1.0
Validación: Loss: 1.6825901418924332, accuracy: 0.7758620689655172
 
Epoca 10 de 10
------------------------




Entrenamiento: Loss: 0.00081726011639047, accuracy: 1.0
Validación: Loss: 1.6825901418924332, accuracy: 0.7758620689655172
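
In the log above, training accuracy reaches 1.0 by epoch 5 while the validation loss rises from about 1.19 to about 1.68, a pattern that usually indicates overfitting. One common response is to keep the checkpoint with the lowest validation loss and stop once it has not improved for a few epochs. The sketch below applies that rule to the validation losses from the log, rounded to four decimals. `early_stop` and the patience of 2 are assumptions for illustration and are not part of the notebook.

```python
from typing import List, Tuple

def early_stop(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops after `patience` consecutive epochs without a new
    lowest validation loss.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses for epochs 1 to 10, copied from the log and rounded
val_losses = [1.1928, 1.3994, 1.4446, 1.4217, 1.4894,
              1.5914, 1.6443, 1.6727, 1.6826, 1.6826]
best_epoch, stop_epoch = early_stop(val_losses, patience=2)
print(best_epoch, stop_epoch)
```

In the training loop this would mean saving the model state whenever the validation loss improves and reloading that state after stopping.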
 


### 13. Creating a function to test the model

In [31]:
def clasificadordelSexting(frase_text):
  # Tokenize exactly as in training: fixed length of 15 tokens with padding
  encoding_frase = tokenizer.encode_plus(
      frase_text,
      max_length = 15,
      truncation = True,
      add_special_tokens = True,
      return_token_type_ids = False,
      padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
      return_attention_mask = True,
      return_tensors = 'pt'
  )
  input_ids = encoding_frase['input_ids'].to(device)
  attention_mask = encoding_frase['attention_mask'].to(device)
  # Disable dropout and gradient tracking for inference
  model.eval()
  with torch.no_grad():
    output = model(input_ids, attention_mask)
  _, prediction = torch.max(output, dim=1)
  print("\n".join(wrap(frase_text)))
  # Label 1 ('Positivo') is reported as consensual, label 0 as non-consensual
  if prediction.item() == 1:
    print('Sexting Consensuado: ')
  else:
    print('Sexting No Consensuado ')

### 14. Calling the model on an example

In [38]:
frase_text = "¿puedo verte?"
clasificadordelSexting(frase_text)

¿puedo verte?
Sexting Consensuado: 


