Hey there! Let's get started.
First, let's import pandas and NumPy.

In [8]:
import pandas as pd
import numpy as np

Now let's load our train and test datasets.

In [9]:
df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')


Let's see how the dataset looks.

In [10]:
df

      id     keyword  location  text                                               target
0     1      NaN      NaN       Our Deeds are the Reason of this #earthquake M...  1
1     4      NaN      NaN       Forest fire near La Ronge Sask. Canada             1
2     5      NaN      NaN       All residents asked to 'shelter in place' are ...  1
3     6      NaN      NaN       13,000 people receive #wildfires evacuation or...  1
4     7      NaN      NaN       Just got sent this photo from Ruby #Alaska as ...  1
...   ...    ...      ...       ...                                                ...
7608  10869  NaN      NaN       Two giant cranes holding a bridge collapse int...  1
7609  10870  NaN      NaN       @aria_ahrary @TheTawniest The out of control w...  1
7610  10871  NaN      NaN       M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...  1
7611  10872  NaN      NaN       Police investigating after an e-bike collided ...  1


The 'target' column indicates whether a tweet is disaster-related (1) or not (0). By using the .value_counts() method, we can see the distribution of these classes in the training dataset.

In [12]:
df['target'].value_counts()

target
0    4342
1    3271
Name: count, dtype: int64
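
To put those counts in perspective, here is a small sketch that turns them into class shares. The counts are copied from the output above; the variable names are only for illustration.

```python
# Counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}
total = sum(counts.values())

# Share of each class, rounded to three decimals.
shares = {label: round(n / total, 3) for label, n in counts.items()}
print(total, shares)
```

That is roughly 57% non-disaster vs 43% disaster tweets, so the classes are only mildly imbalanced. With pandas, df['target'].value_counts(normalize=True) should give the same shares directly.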

Before analyzing text data, we need to preprocess it. The clean_text function replaces slashes, hyphens, and apostrophes with spaces, pads '&' with spaces, and removes the remaining punctuation.

In [13]:
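def clean_text(x):
    x = str(x)
    # Replace characters that usually join words with a space.
    for punct in "/-'":
        x = x.replace(punct, ' ')
    # Keep '&' but pad it with spaces so it stands on its own.
    x = x.replace('&', ' & ')
    # Drop the remaining punctuation, including curly quotes.
    for punct in '?!.,"#$%()*+:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x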
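
If you want to see what the cleaning does to one tweet, here is a rough sketch of a single-pass variant built on str.translate. clean_text_translate and the sample sentence are made up for illustration; the character sets are meant to mirror the function above.

```python
# Hypothetical single-pass variant of clean_text using a translation table.
TO_SPACE = "/-'"                                  # replaced by a space
TO_DROP = '?!.,"#$%()*+:;<=>@[\\]^_`{|}~“”’'      # removed entirely

_TABLE = str.maketrans(
    {**{c: ' ' for c in TO_SPACE}, '&': ' & ', **{c: None for c in TO_DROP}}
)

def clean_text_translate(x):
    return str(x).translate(_TABLE)

sample = "Forest-fire near La Ronge, Sask. & #Canada!"   # made-up example
cleaned = clean_text_translate(sample)
collapsed = " ".join(cleaned.split())                   # squeeze repeated spaces
print(repr(cleaned))
print(repr(collapsed))
```

Note that padding '&' leaves double spaces behind. The tokenizer used later splits on whitespace anyway, so this should not matter much, but collapsing them as in the last step keeps the text tidy.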

Let's apply this function to our datasets now.

In [14]:
df["text"] = df["text"].apply(clean_text)

In [16]:
test["text"] = test["text"].apply(clean_text)

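Let's split the labelled dataset into training and evaluation parts. Note that test_df below is our held-out evaluation split (30% of the rows), not the competition test set loaded earlier.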

In [17]:

from sklearn.model_selection import train_test_split


train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)


train_texts, train_labels = train_df['text'].tolist(), train_df['target'].tolist()
test_texts, test_labels = test_df['text'].tolist(), test_df['target'].tolist()
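
As a quick sanity check, here is a sketch of how many rows should land in each part. It assumes scikit-learn rounds the held-out share up and gives the rest to training; the row count comes from the class counts shown earlier.

```python
import math

n_rows = 7613       # 4342 + 3271 rows, from the value_counts() output
test_size = 0.3

# Assumption: the held-out share is rounded up, the remainder is used for training.
n_eval = math.ceil(test_size * n_rows)
n_train = n_rows - n_eval
print(n_train, n_eval)
```

If you want both parts to keep the same class ratio, train_test_split also accepts a stratify argument, e.g. stratify=df['target']. The notebook does not use it, so treat that as an optional tweak.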


Let's load the tokenizer for our model, 'distilbert-base-uncased', from Hugging Face Transformers and tokenize our training and evaluation texts. Sequences longer than 128 tokens are truncated and shorter ones are padded.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128, return_tensors="pt")
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128, return_tensors="pt")
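
To make truncation=True and padding=True a bit more concrete, here is a toy sketch on plain lists of token ids. pad_and_truncate is a hypothetical helper and the ids are made up; unlike the real tokenizer, it simply cuts long sequences and does not keep the closing special token.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy imitation of truncation + padding on lists of token ids."""
    # Cut every sequence to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # Pad to the longest remaining sequence in the batch, not to max_length.
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids, three "tweets" of different lengths.
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model ignore the padding positions, which is why the tokenizer returns it alongside the input ids.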


We use PyTorch to handle our data for deep learning. In this step, we convert the train_labels and test_labels into PyTorch tensors using the torch.tensor() function.

In [None]:
import torch

train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)


To train a PyTorch model, we need to organize our data using the Dataset class. Here, we define a custom class, DisasterTweetsDataset, which inherits from torch.utils.data.Dataset. This custom dataset simplifies managing and accessing the encoded tweets and their labels.

In [None]:
from torch.utils.data import Dataset

class DisasterTweetsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # The encodings and labels are already tensors, so we index them
        # directly instead of wrapping them in torch.tensor() again.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item


train_dataset = DisasterTweetsDataset(train_encodings, train_labels)
test_dataset = DisasterTweetsDataset(test_encodings, test_labels)


We use the TrainingArguments class from the 🤗 Transformers library to configure our training process. This class allows us to define various parameters for training and evaluation.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
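    output_dir='./results',
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",        # must match the evaluation strategy for load_best_model_at_end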
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=1, 
)
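
To get a feel for what these settings mean in practice, here is a back-of-the-envelope sketch of the number of optimizer steps. It assumes a single device, no gradient accumulation, and about 5329 training rows (70% of 7613); adjust the numbers if your setup differs.

```python
import math

n_train = 5329        # assumed size of the 70% training split
batch_size = 16       # per_device_train_batch_size
epochs = 3            # num_train_epochs
logging_steps = 10

# Assumes one device and no gradient accumulation; the last batch may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps
print(steps_per_epoch, total_steps, log_events)
```

So with these arguments you can expect about a thousand training steps, a loss log roughly every ten of them, and one evaluation plus checkpoint at the end of each epoch.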


To evaluate the performance of our model, we define a custom function, compute_metrics, which takes the logits and true labels, picks the class with the highest logit for each tweet, and calculates the weighted F1-score.

In [None]:
from sklearn.metrics import f1_score


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    f1 = f1_score(labels, predictions, average="weighted")
    return {"f1": f1}
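
If the weighted F1-score feels like a black box, here is a small pure-Python sketch of the same idea: compute F1 per class, then average with each class weighted by how often it appears in the true labels. The helper names and the toy labels are made up for illustration.

```python
def f1_for_class(labels, preds, cls):
    # Count true positives, false positives and false negatives for one class.
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(labels, preds):
    # Average the per-class F1 scores, weighted by class support.
    total = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        support = sum(1 for y in labels if y == cls)
        score += support / total * f1_for_class(labels, preds, cls)
    return score

toy_labels = [1, 0, 1, 1, 0]   # made-up ground truth
toy_preds = [1, 0, 0, 1, 1]    # made-up predictions
print(f1_for_class(toy_labels, toy_preds, 1), f1_for_class(toy_labels, toy_preds, 0))
print(weighted_f1(toy_labels, toy_preds))
```

Here class 1 gets an F1 of 2/3 and class 0 gets 1/2; weighting them by 3/5 and 2/5 gives 0.6.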


To train and evaluate our model, we use the AutoModelForSequenceClassification class and the Trainer from the 🤗 Transformers library.

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", 
    num_labels=2  
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


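The next cell disables the Weights & Biases (W&B) integration by setting the environment variable WANDB_DISABLED to "true". W&B is a popular tool for tracking experiments and visualizing model performance, but you may want to disable it, for example when you have no W&B API key configured. It is probably safest to run this cell before creating TrainingArguments and the Trainer, since the logging integrations are picked up when those objects are built.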

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"


Let's train!

In [None]:
trainer.train()

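In this step, we use the tokenizer to preprocess the competition test set (the test dataframe loaded at the start). Note that this reuses the name test_encodings, which previously held the evaluation split. The tokenizer converts the raw text into input IDs and attention masks that can be fed into the model for prediction.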

In [None]:

test_encodings = tokenizer(
    test['text'].tolist(),  
    truncation=True,
    padding=True,
    max_length=128,
    return_tensors="pt"  
)


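After preparing the test data, we use the trained model to make predictions. We switch the model to evaluation mode so dropout is turned off, move the inputs to the same device as the model (the Trainer may have placed it on a GPU), and wrap the forward pass in torch.no_grad() so no gradients are tracked.

In [None]:

model.eval()  # turn off dropout for inference
inputs = {k: v.to(model.device) for k, v in test_encodings.items()}

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Move the predictions back to the CPU so .numpy() works later.
    predictions = torch.argmax(logits, dim=1).cpu()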
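
You might wonder why we take the argmax of the raw logits instead of converting them to probabilities first. Here is a small sketch with made-up logits showing that both routes pick the same class, because softmax preserves the ordering of the scores.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three tweets: [score for class 0, score for class 1].
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, -0.2]]

probs = [softmax(row) for row in toy_logits]
pred_from_logits = [row.index(max(row)) for row in toy_logits]
pred_from_probs = [p.index(max(p)) for p in probs]
print(pred_from_logits, pred_from_probs)
```

The probabilities are still handy if you want to inspect how confident the model is or pick a decision threshold other than 0.5.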


It's time to create the submission dataframe.

In [None]:

submission = pd.DataFrame({
    'id': test['id'],          
    'target': predictions.numpy()  
})


Finally, let's save our submission file in csv format ...

In [None]:

submission.to_csv('/kaggle/working/submission.csv', index=False)


Feel free to play with the training arguments & leave an upvote if you found this notebook helpful ...