**Hugging Face Model**

In this notebook I clean the text and then use a Hugging Face model to make predictions.

The notebook performs the following steps:

1. Text cleaning
2. Custom K-fold
3. Setting up the Hugging Face model
4. Prediction

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import nltk
import regex as re
from nltk.stem import WordNetLemmatizer
import string
from scipy.special import softmax

import os

In [None]:
from sklearn import model_selection
from sklearn.metrics import mean_squared_error
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
#to install datasets library
!pip install datasets
!pip install fsspec==0.9.0
from datasets import Dataset,load_metric

In [None]:
train = pd.read_csv(r'/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv(r'/kaggle/input/nlp-getting-started/test.csv')

In [None]:
train.head()

# Text Cleaning

In [None]:
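# Remove punctuation and stop words from each sentence

nltk.download('stopwords', quiet=True)
# a set gives constant-time membership checks
stopwords = set(nltk.corpus.stopwords.words('english'))
def remove_pun_stopwords(text):
    # drop every character that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # lowercase once, then drop stop words
    text = [i for i in text.lower().split() if i not in stopwords]
    return ' '.join(text)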

> * Remove text in square brackets, links, HTML tags, punctuation, newlines, and words containing digits.

In [None]:
def clean_text(text):

    text = text.lower()
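    # raw strings avoid invalid-escape warnings in the patterns
    text = re.sub(r'\[.*?\]', '', text)               # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # links
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # words containing digits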
    return text

In [None]:
def clean(text):
    text = remove_pun_stopwords(text)
    text = clean_text(text)
    return text
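
One thing worth checking in `clean`: `remove_pun_stopwords` runs before `clean_text`, so the `:`, `/` and `.` characters of a link are already gone by the time the link pattern is applied. The sketch below illustrates the effect with the standard-library `re` module. The helper names and the sample string are hypothetical and only stand in for the two steps above.

```python
import re

def strip_punctuation(text):
    # same pattern as in remove_pun_stopwords
    return re.sub(r'[^\w\s]', '', text)

def strip_urls(text):
    # same link pattern as in clean_text
    return re.sub(r'https?://\S+|www\.\S+', '', text)

sample = "forest fire near la ronge http://t.co/abc"  # hypothetical input

# Order used in clean(): punctuation first, then links
punct_first = strip_urls(strip_punctuation(sample))

# Alternative order: links first, then punctuation
urls_first = strip_punctuation(strip_urls(sample))

print(repr(punct_first))
print(repr(urls_first))
```

With punctuation stripped first, the link survives as a single glued token, while stripping links first removes it. If the leftover tokens matter for your model, swapping the order is one option to try.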

In [None]:
train['text'] = train['text'].apply(clean)
test['text'] = test['text'].apply(clean)

> **Lemmatization**
> 
> Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem may not be an actual word, whereas a lemma is. Stemming applies a fixed sequence of rule-based steps to each word, which makes it faster.
> 
> > We will use lemmatization here because it keeps actual words.

In [None]:
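# WordNetLemmatizer needs the WordNet corpus; download it if it is not already present
nltk.download('wordnet', quiet=True)
lemmatizer = WordNetLemmatizer()

def word_lemmatizer(text):
    # lemmatize each whitespace-separated token and rebuild the sentence
    text = [lemmatizer.lemmatize(i) for i in text.split()]
    return ' '.join(text)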

train['text'] = train['text'].apply(word_lemmatizer)
test['text'] = test['text'].apply(word_lemmatizer)

In [None]:
def create_folds(data, num_splits):
    # we create a new column called kfold and fill it with -1
    data["kfold"] = -1
    
    # the next step is to randomize the rows of the data
    data = data.sample(frac=1,random_state=10).reset_index(drop=True)

    # calculate the number of bins with Sturges' rule
    # I take the floor of the value; you can also
    # just round it
    num_bins = int(np.floor(1 + np.log2(len(data))))
    
    # bin targets
    data.loc[:, "bins"] = pd.cut(
        data["target"], bins=num_bins, labels=False
    )
    
    # initiate the kfold class from model_selection module
    kf = model_selection.StratifiedKFold(n_splits=num_splits)
    
    # fill the new kfold column
    # note that, instead of targets, we use bins!
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, 'kfold'] = f
    
    # drop the bins column
    data = data.drop("bins", axis=1)

    # return dataframe with folds
    return data
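
For reference, Sturges' rule as used in `create_folds` is `num_bins = floor(1 + log2(n))`. Here is a minimal standard-library sketch of that formula. The function name and the row counts are hypothetical and chosen only for illustration.

```python
import math

def sturges_bins(num_rows):
    # floor(1 + log2(n)), the same expression used in create_folds
    return int(math.floor(1 + math.log2(num_rows)))

for n in (100, 1000, 1024):  # hypothetical dataset sizes
    print(n, sturges_bins(n))
```

Since the target in this dataset has only two values, most of the bins stay empty, and the stratification effectively happens on the two classes.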

> **Custom K-fold to create the training and validation sets**

In [None]:
# Create K folds for training and validation
train = create_folds(train, num_splits=5)
train = train.rename(columns={'target':'label'})
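
To give an intuition for what stratification achieves, here is a simplified pure-Python sketch. It is not scikit-learn's `StratifiedKFold` algorithm, only a round-robin stand-in that keeps the class proportions roughly equal across folds. The function name and the labels are hypothetical.

```python
from collections import Counter, defaultdict

def assign_stratified_folds(labels, num_splits):
    """Deal the row indices of each class out to the folds in round-robin order."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [-1] * len(labels)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[idx] = position % num_splits
    return folds

labels = [0] * 12 + [1] * 8  # hypothetical binary targets
folds = assign_stratified_folds(labels, num_splits=4)

# Count the classes that landed in each fold
fold_counts = [
    Counter(label for label, fold in zip(labels, folds) if fold == f)
    for f in range(4)
]
for f, counts in enumerate(fold_counts):
    print(f, dict(counts))
```

Every fold ends up with the same 3:2 class mix as the full list, which is the property the `kfold` column is meant to provide.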

> * **Defining batch size and max length**

In [None]:
batch_size = 16
max_length = 256

> * **Splitting the data into training and validation sets**

In [None]:
train_dataset = Dataset.from_pandas(train[train.kfold != 0].reset_index(drop=True))
valid_dataset = Dataset.from_pandas(train[train.kfold == 0].reset_index(drop=True))

> * **Tokenizing the data** 

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
def tokenize(batch):
    # pad to the longest sequence in the batch and truncate anything beyond max_length
    return tokenizer(batch['text'], padding=True, truncation=True, max_length=max_length)

In [None]:
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
valid_dataset = valid_dataset.map(tokenize, batched=True, batch_size=len(valid_dataset))
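
The `padding=True` and `truncation=True` arguments can be pictured with a toy example. The sketch below is not the Hugging Face tokenizer. It is a simplified stand-in with a hypothetical function name and made-up token ids, and unlike a real tokenizer it does not preserve the closing special token when truncating.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    # cut every sequence down to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch_ids]
    # pad every sequence up to the longest one left in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 7, 8, 102], [101, 7, 102], [101, 1, 2, 3, 4, 5, 102]]  # made-up ids
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because `map` is called with `batch_size=len(train_dataset)`, the whole dataset is treated as one batch, so every row is padded to the same length.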

**Modelling**

Here we use the **distilbert-base-uncased-finetuned-sst-2-english** model, but you can use any other model as well. Browse the following link to try different models:

[https://huggingface.co/models?pipeline_tag=text-classification](https://huggingface.co/models?pipeline_tag=text-classification)

In [None]:
# disable W&B logging as we don't have access to the internet
%env WANDB_DISABLED=True

In [None]:
metric = load_metric("accuracy")
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english', num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

args = TrainingArguments(
    "outputs_dir",
    evaluation_strategy = "epoch",
    # load_best_model_at_end needs checkpoints saved on the same schedule as evaluation
    save_strategy="epoch",
    learning_rate=2e-5,
    fp16=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    seed=7,
    weight_decay=0.005,
    load_best_model_at_end=True
)

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

columns_to_return = ['input_ids', 'label', 'attention_mask']
train_dataset.set_format(type='torch', columns=columns_to_return)
valid_dataset.set_format(type='torch', columns=columns_to_return)
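
What `compute_metrics` does can be sketched without any libraries: take the index of the largest logit in each row as the predicted class, then compare it with the labels. The function name, logits and labels below are hypothetical.

```python
def accuracy_from_logits(logits, labels):
    # argmax over each row of logits gives the predicted class
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

logits = [[1.2, -0.3], [0.1, 0.9], [-2.0, 0.5], [0.4, 0.2]]  # hypothetical model outputs
labels = [0, 1, 0, 0]                                        # hypothetical references

accuracy = accuracy_from_logits(logits, labels)
print(accuracy)
```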

In [None]:
trainer.train()

> * **Tokenizing the test data set**

In [None]:
test_dataset = Dataset.from_pandas(test)
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

In [None]:
columns_to_return = ['input_ids', 'attention_mask']
test_dataset.set_format(type='torch', columns=columns_to_return)

In [None]:
test_preds = trainer.predict(test_dataset)
# apply softmax to convert the logits to class probabilities
probabilities = softmax(test_preds[0], axis=1)
probabilities
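
As a rough illustration of the softmax step and the 0.5 threshold applied to the class-1 probability for the submission, here is a standard-library sketch. The logits are made up, and the helper is a plain-Python stand-in for `scipy.special.softmax` applied row by row.

```python
import math

def softmax_row(logits):
    # subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[2.0, -1.0], [-0.5, 1.5], [0.0, 0.0]]  # hypothetical model outputs
probs = [softmax_row(row) for row in logits]

# class 1 is predicted when its probability is at least 0.5
predicted = [int(p[1] >= 0.5) for p in probs]
print(probs)
print(predicted)
```

For two classes, thresholding the class-1 probability at 0.5 gives the same answer as picking the larger logit, except that an exact tie goes to class 1 here.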


In [None]:
test_ids = test['id'].values

submission = pd.DataFrame({
    'id': test_ids,
    'target': probabilities[:,1]
})

In [None]:
# threshold the class-1 probability at 0.5 to get a 0/1 label
submission['target'] = (submission['target'] >= 0.5).astype(int)

In [None]:
submission['target'].value_counts()

In [None]:
submission.to_csv('submission.csv',index = False)

**Please comment and upvote. Thanks**