Unlock the power of Natural Language Processing (NLP) by diving into Kaggle competitions! In this tutorial, I will cover essential NLP concepts, explore real-world datasets, and walk you through your first competition. Whether you’re a beginner or curious about NLP, this tutorial will kickstart your journey. 🚀

This tutorial is heavily inspired by a bunch of other resources and all the credit goes to them (all the resources can be found at the end of this notebook). This tutorial is my attempt to learn something new as well as help you to learn.

## The data
We are going to use the data from the competition ["Natural Language Processing with Disaster Tweets"](https://www.kaggle.com/competitions/nlp-getting-started). Kaggle hosts some "getting started" competitions that are easy enough for beginners to learn the tricks of the trade and challenging enough to practice different ML concepts. This is one such competition, where you can learn as you work through it.

Download the data from the competition page or, if you are working in a Kaggle notebook, attach the data to your notebook by following the steps described [here](https://www.kaggle.com/docs/notebooks).

In [None]:
import pandas as pd

The dataset has three files:
* sample_submission.csv - a sample file that shows what your submission should look like.
* train.csv - the data on which you train your model.
* test.csv - the data on which you make predictions once training is done.

In [None]:
train_path = "/kaggle/input/nlp-getting-started/train.csv"
eval_path = "/kaggle/input/nlp-getting-started/test.csv"

We will use pandas to read the data.

In [None]:
train_df = pd.read_csv(train_path)
eval_df = pd.read_csv(eval_path)

In [None]:
train_df.head(5) 

In [None]:
eval_df.head(5)

## Exploring the data
You can use the `describe` method in pandas to get a quick look at your dataset. For example, here you can see that the most common keyword in the data is "fatalities" and the most common location is "USA".

In [None]:
train_df.describe(include="object")
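
`describe` only reports the single most frequent value per column. If you want the full ranking, or a count of missing cells, `value_counts` and `isna` are worth a look. The sketch below uses a small made-up dataframe (`toy_df` and its values are invented for illustration, not taken from the competition data).

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "keyword": ["fire", "flood", "fire", None, "fire", "flood"],
    "location": ["USA", None, "USA", "India", None, "USA"],
})

# value_counts ranks every distinct value by frequency
# (missing values are skipped by default).
keyword_counts = toy_df["keyword"].value_counts()
print(keyword_counts)

# isna().sum() shows how many cells are missing in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

The same two calls should work unchanged on `train_df`.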

## Feature engineering
We can join the keyword, location and text columns into a single meaningful string, so that the model also sees the keyword and location information. Some rows have no keyword or location, so first we will put "N/A" in those cells.

You can use other strategies to handle rows that have NaN values in some columns, but don't remove those rows entirely, because that would throw away important data points.

`fillna` fills every NaN cell with the given value.

In [None]:
train_df = train_df.fillna("N/A")

In [None]:
train_df.head(5)

You can concatenate the values of multiple columns using the plus (`+`) operator.

In [None]:
train_df['input'] = 'keyword: ' + train_df.keyword + '; location: ' + train_df.location + '; text: ' + train_df.text

In [None]:
train_df.head()
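
To see why the `fillna` step has to come before the concatenation, here is a minimal sketch on a made-up two-row dataframe (`toy_df`, its values, and the helper `build_input` are hypothetical and only used for illustration). Without filling, a missing value in any column makes the whole combined string missing.

```python
import pandas as pd

# Toy rows standing in for train_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "keyword": ["wildfire", None],
    "location": [None, "Canada"],
    "text": ["Forest fire near the lake", "Heavy rain expected tonight"],
})

def build_input(df):
    # Element-wise string concatenation across columns, row by row.
    return ("keyword: " + df["keyword"]
            + "; location: " + df["location"]
            + "; text: " + df["text"])

# Without fillna, a missing value in any column makes the combined string missing.
without_fill = build_input(toy_df)

# With fillna, every row gets a complete string.
with_fill = build_input(toy_df.fillna("N/A"))
print(with_fill.tolist())
```

Both toy rows have one missing cell, so `without_fill` ends up entirely missing while `with_fill` contains two usable strings.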

Convert the dataframe to a Hugging Face dataset. This makes the data much easier and faster to work with.

In [None]:
from datasets import Dataset,DatasetDict
dataset = Dataset.from_pandas(train_df)

In [None]:
dataset.info

## The model
Since we are experimenting at this point, we can select a model that is small enough to run quickly. For this I have selected the model below, which has a good balance of size and performance.

Once you progress with your experiments and have evaluated your model's performance, you can opt for more complex models.

In [None]:
model_name = 'microsoft/deberta-v3-small'

## Tokenization
Neural networks don't understand text; they understand numbers. Before feeding data to the model, we need to break the text into smaller pieces called tokens (words or parts of words) and then convert those tokens to numbers. Breaking text into tokens is known as tokenization, and converting the tokens to numbers is known as numericalization.
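
To make the two steps concrete, here is a deliberately simplified sketch in plain Python. It splits on whitespace and builds its own tiny vocabulary; the helper names (`tokenize`, `build_vocab`, `numericalize`) and the `<unk>` token are assumptions made for this illustration. Real tokenizers, including the one used in this notebook, split text into subword pieces with a much larger, pretrained vocabulary.

```python
# A toy word-level tokenizer and vocabulary, for illustration only.
def tokenize(text):
    # Tokenization: split the text into tokens (here, lowercase words).
    return text.lower().split()

def build_vocab(texts):
    # Assign each distinct token an integer id; id 0 is reserved for unknown tokens.
    vocab = {"<unk>": 0}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def numericalize(text, vocab):
    # Numericalization: replace each token with its id from the vocabulary.
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]

vocab = build_vocab(["forest fire near the lake", "fire crews reach the forest"])
ids = numericalize("fire near the river", vocab)
print(vocab)
print(ids)  # "river" is not in the vocabulary, so it maps to id 0
```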

Hugging Face provides an `AutoTokenizer` API to tokenize text. It also numericalizes the tokens.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Let's try the tokenizer on a sample sentence.

In [None]:
test_tokz = tokenizer.tokenize("hello! have a good day")
test_tokz

Look how the tokenizer breaks the text down into tokens. An underscore-like character (`▁`) is added to the beginning of a token to mark the start of a word.

Let's create a function to tokenize the text.

In [None]:
def tokz(df):
    return tokenizer(df["input"])

The `map` function takes our function and applies it to every row of the dataset.

In [None]:
dataset = dataset.map(tokz)

In [None]:
dataset[1]

In [None]:
tokenizer.vocab['Forest']

## Preparing the data for training 

The Hugging Face `Trainer` expects the target column to be named `label` (or `labels`), so we rename the `target` column.

In [None]:
dataset = dataset.rename_column("target", "label")

In [None]:
dataset

## Splitting data into train and test
Next we will split the data into train and test sets. This matters for the following reasons:

* Generalization: By testing on unseen data, you assess how well your model generalizes beyond the training data. It ensures that your model doesn’t just memorize the training examples but learns meaningful patterns.

* Overfitting Prevention: If you train and test the model on the same data (without splitting), it can lead to overfitting. Overfit models perform well on the training data but fail to generalize to new instances. Train-test split helps prevent this issue.

* Bias-Variance Tradeoff: It also helps you understand the bias-variance tradeoff. A model with high bias (underfitting) may perform poorly on both training and testing data, while a model with high variance (overfitting) may perform well on training data but poorly on testing data.

We reserve 25% of our training data as the validation set. It is hidden from the model during training, and the model's predictions are evaluated on it at the end of each epoch.

In [None]:
train_test = dataset.train_test_split(test_size=0.25)

In [None]:
train_test

It is always a good idea to look into the data before deciding on a splitting strategy. A plain random split like the one above is not generally recommended, but for the sake of this tutorial it is fine.
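
One alternative you might consider is a stratified split, which keeps the share of each class roughly the same in both parts. The sketch below implements the idea in plain Python on made-up labels; the function `stratified_split` and its defaults are hypothetical and not part of the notebook. Fixing the random seed also makes the split reproducible between runs.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_indices, test_indices) with similar class proportions in both parts."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 60 negatives and 40 positives (made-up proportions).
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```

In this toy run the test part keeps the same 40% share of positives as the full data.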

## Evaluation
The competition page describes the evaluation metric used in the competition. Here the metric is the F1 score. You can check the competition [page](https://www.kaggle.com/competitions/nlp-getting-started) for a detailed description of the metric.

We need to create a function that takes our predictions and calculates the metric for us. For this we will use scikit-learn, a popular library for machine learning tasks other than deep learning.

In [None]:
from sklearn.metrics import f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    f1 = f1_score(labels, preds, average='weighted')

    return {'f1': f1}
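
If the F1 score is new to you, it may help to compute it once by hand. The sketch below is plain Python on made-up labels and predictions; the function `f1_from_predictions` is a hypothetical helper. It computes F1 for the positive class only, whereas the cell above uses `average='weighted'`, which averages the per-class F1 scores weighted by how many examples each class has.

```python
def f1_from_predictions(labels, preds, positive=1):
    # Count true positives, false positives and false negatives for the positive class.
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)

    # Precision: how many predicted positives are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: how many actual positives were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Made-up labels and predictions for eight tweets.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 1, 0, 1, 0]
score = f1_from_predictions(labels, preds)
print(score)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision and recall are both 0.75 and so is F1.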


## Fine tuning the model
The model we are going to use is already trained on loads of English text. However, it has not seen the tweet data on which we want to make predictions. That is why we will fine-tune the model on our data.

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name,num_labels=2)

We will pass our data in batches of 128 and keep the number of epochs low to train quickly. Keep the learning rate as given below; it will work just fine for a wide range of situations.

The learning rate is one of the most important hyperparameters. A bad learning rate can ruin your trained model's performance.

Watch this [lesson](https://course18.fast.ai/lessonsml1/lesson9.html) from the awesome fastai course to understand the importance of the learning rate.

In [None]:
bs = 128
epochs = 4
lr = 8e-5

The Hugging Face `Trainer` requires you to set the training arguments via the `TrainingArguments` object. Don't worry about all the arguments; the important ones are those listed above.

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
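
The `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` settings mean the learning rate is not constant during training: it ramps up linearly over the first 10% of the steps and then decays along a cosine curve. The sketch below is a plain-Python approximation of that shape, not the library's own code; the function `lr_at_step` is hypothetical, and the figure of 5,700 training rows is an illustrative assumption rather than a number taken from this notebook.

```python
import math

def lr_at_step(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    # Linear warmup from 0 to base_lr, then cosine decay from base_lr towards 0.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 5,700 training rows, batch size 128, 4 epochs, one device.
steps_per_epoch = math.ceil(5700 / 128)
total_steps = steps_per_epoch * 4
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
print(steps_per_epoch, total_steps, max(schedule))
```

The peak of the schedule equals `lr`, so the value you set is the highest learning rate the model ever sees, reached right after warmup.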

Now we create the trainer, pass our metric computation function to it, and begin training.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_test["train"], 
    eval_dataset=train_test["test"],   
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

The F1 score is showing a good increase.

As explained earlier, Kaggle competitions also provide a test set on which we need to make predictions after our model is trained. These predictions are then submitted to the competition.

## Preparing the test data
We will apply the same transformations to the test data as well.

In [None]:
eval_df = eval_df.fillna("N/A")

In [None]:
eval_df['input'] = 'keyword: ' + eval_df.keyword + '; location: ' + eval_df.location + '; text: ' + eval_df.text

In [None]:
eval_df.head(2)

In [None]:
eval_ds = Dataset.from_pandas(eval_df).map(tokz)
eval_ds

In [None]:
eval_ds[1]["input"]

## Submitting Predictions

We can use the trainer object to carry out the predictions.

In [None]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

In [None]:
import numpy as np
predictions = np.argmax(preds, axis=1)

In [None]:
predictions
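
The values returned by `predict` are raw scores (logits), one per class, and `argmax` picks the class with the highest score. If you want something closer to probabilities, you can apply a softmax first. Here is a small sketch with NumPy on made-up logits (the numbers are invented for illustration).

```python
import numpy as np

# Made-up logits for three tweets; each row is [score for class 0, score for class 1].
logits = np.array([
    [2.0, -1.0],
    [-0.5, 1.5],
    [0.2, 0.1],
])

# Softmax turns each row of logits into probabilities that sum to 1.
# Subtracting the row maximum first keeps the exponentials numerically stable.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# argmax picks the index of the largest value in each row, i.e. the predicted class.
predicted = np.argmax(logits, axis=1)
print(probs.round(3))
print(predicted)
```

Softmax does not change which class has the largest value, so taking `argmax` of the logits or of the probabilities gives the same prediction.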

Finally, we will create the submission.csv file, which is what we submit to the competition.

In [None]:
submission = Dataset.from_dict({
    'id': eval_ds['id'],
    'target': predictions
})

submission.to_csv('submission.csv', index=False)
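
If you prefer to stay in pandas, a roughly equivalent way to build the same two-column file is sketched below. The ids and predictions are made up, and the example writes to an in-memory buffer only so that the result can be printed; in the notebook you would pass a file path such as `'submission.csv'` instead.

```python
import io

import numpy as np
import pandas as pd

# Made-up ids and predictions standing in for eval_ds['id'] and predictions.
ids = [0, 2, 3]
predictions = np.array([1, 0, 1])

submission_df = pd.DataFrame({"id": ids, "target": predictions})

# Write to an in-memory buffer here; in the notebook you would pass a file path.
buffer = io.StringIO()
submission_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Printing the first few lines like this is a quick way to check that the header and column order match the sample submission file before you upload.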

## References
* https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners
* https://course.fast.ai/Lessons/lesson4.html
* https://huggingface.co/learn/nlp-course/chapter1/1