# From shallow learning to 2020 SOTA (GPT-2, RoBERTa)

# Abstract

In this notebook I explore a variety of machine learning models, ranging from good old shallow learning (Naive Bayes, TF-IDF, SVMs) to the state of the art in NLP (GPT-2, RoBERTa), with the goal of finding the best possible model and preprocessing steps for the tweet classification task posted in this Kaggle competition: https://www.kaggle.com/c/nlp-getting-started  
As a by-product of this experimentation we also obtain a clear comparison across a number of popular NLP algorithms.

**Note**: Because Kaggle kernels can't run for more than 9 hours, I had to train the biggest models on my local computer, and this notebook loads them from a checkpoint. For grid search, the notebook also loads the cached results from CSV to save compute time.  
All the checkpoints and grid search results loaded in this notebook were generated solely with the code in this notebook.

# Index

- [Exploratory analysis](#epa)
- [Shallow Learning](#shallow_learning)
- [Fast text](#fast_text)
- [Text preprocessing](#text_preprop)
- [BERT & RoBERTa](#bert_e2e)
- [LSTMs](#Conclusions)
- [GPT2](#gpt2)
- [Conclusions](#Conclusions)

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
!pip install wordcloud
!pip install transformers==3.0.2
!pip install simpletransformers
!pip install scikit-learn
!pip install nltk
!pip install unidecode
!pip install normalise
!pip install contractions
import os
import string
import re
import sys
sys.path.insert(1, '/kaggle/input/pymodules4/') # link modules to be accessible from this Kaggle kernel
os.system('python3 -m spacy download en')  # it doesn't work when run directly from the terminal
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import TransformerMixin, BaseEstimator
import spacy
import nltk
import unidecode
from normalise import normalise
import contractions
from nltk.corpus import stopwords
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import dill
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
import logging
from simpletransformers.classification import ClassificationModel
from spacy_text_classifier import SpacyClassifier
train_df = pd.read_csv("/kaggle/input/data-baby2/train.csv")
test_df = pd.read_csv("/kaggle/input/data-baby2/test.csv")
msk = np.random.rand(len(train_df)) < 0.3
dev_df = train_df[msk]
train_df = train_df[~msk]
dev_df.reset_index(inplace=True)
train_df.reset_index(inplace=True)

## <a id="epa">Exploratory Analysis</a>

### A quick look at our data

Let's look at our data... first, an example of what is NOT a disaster tweet.

In [None]:
train_df[train_df["target"] == 0]["text"].values[1]

In [None]:
train_df[train_df["target"] == 1]["text"].values[1]

In [None]:
train_df.head()

In [None]:
train_df.describe()

In [None]:
train_df.info()

In [None]:
test_df.info()

In [None]:
train_has_keyword_df = train_df[train_df.keyword.notnull()]
train_has_location_df = train_df[train_df.location.notnull()]
train_has_keyword_df.head()

In [None]:
text = " ".join(keyword for keyword in train_has_keyword_df.keyword)
# len(text) would count characters, so split on whitespace to count words
print("There are {} words in the combination of all keywords.".format(len(text.split())))

In [None]:
import seaborn as sns
ax = sns.countplot(x=train_df['target'])  # pass the column by keyword; newer seaborn versions no longer accept it positionally

## wordclouds

In [None]:
def renderWordcloud(text):
    # Create and generate a word cloud image:
    wordcloud = WordCloud().generate(text)

    # Display the generated image:
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    F = plt.gcf()
    Size = F.get_size_inches()
    F.set_size_inches(Size[0]*2, Size[1]*2, forward=True) # Set forward to True to resize window along with plot in figure.
    plt.show()

### Wordcloud of keywords

In [None]:
text = " ".join(keyword for keyword in train_has_keyword_df.keyword)
renderWordcloud(text)

### Wordcloud of locations

In [None]:
text = " ".join(loc for loc in train_has_location_df.location)
renderWordcloud(text)

## Unique words

In [None]:
len(train_has_keyword_df.keyword.unique()), len(train_has_location_df.location.unique())

## Relationship between categorical vars & target

### relationship between keyword and target

In [None]:
group_keyword_sum_target = train_has_keyword_df.groupby("keyword").sum().sort_values("target")
group_keyword_len = train_has_keyword_df.groupby("keyword").count()
group_keyword_sum_target#['true/all'] = group_keyword_sum_target.target / group_keyword_len.target

In [None]:
pd.set_option('display.max_rows', None)
group_keyword_sum_target

### correlation between keyword and Target

#### contingency table

In [None]:
df_target_1 = train_has_keyword_df['target']==1
df_target_1.head()

In [None]:
pd.crosstab(train_has_keyword_df['keyword'], [df_target_1], rownames=['keyword'], colnames=['target'])

### relationship between location and target: percentage of true targets per location

In [None]:
group_location_sum_target = train_has_location_df.groupby("location").sum().sort_values("target")
group_location_len = train_has_location_df.groupby("location").count()
group_location_sum_target['true/all'] = group_location_sum_target.target / group_location_len.target

In [None]:
group_location_sum_target.sort_values(by=['target'], ascending=False)[:50]

In [None]:
pd.set_option('display.max_rows', 20)

### How are numbers formatted in tweets? Wordcloud of words from tweets that contain numbers

I'm interested in this because I want to see whether numbers correlate with the words/meaning of the tweet.

In [None]:
# Keep only the tweets whose text contains at least one digit
numbers_df = train_df[train_df['text'].str.contains('[0-9]', regex=True, na=False)]
text = " ".join(numbers_df.text)
renderWordcloud(text)




# Models

### Building vectors

The theory behind this model is pretty simple: the words contained in each tweet are a good indicator of whether they're about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's `CountVectorizer` to count the words in each tweet and turn them into data our machine learning model can process.

Note: a `vector` is, in this context, a set of numbers that a machine learning model can work with. We'll look at one in just a second.

In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [None]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

In [None]:
example_train_vectors[4]

The above tells us that:
1. There are 54 unique words (or "tokens") in the first five tweets.
2. The first tweet contains only some of those unique tokens - all of the non-zero counts above are the tokens that DO exist in the first tweet.
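
To make the "sparse" idea concrete, here is a minimal sketch in plain Python. The helper names `to_sparse` and `to_dense` are made up for this illustration; scikit-learn itself stores these vectors as SciPy sparse matrices, which work on the same principle of keeping only the non-zero entries.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {column_index: count}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def to_dense(sparse, size):
    """Rebuild the full vector, filling the missing positions with zeros."""
    return [sparse.get(i, 0) for i in range(size)]


# A toy count vector: most positions are zero, as in a real bag of words.
dense_vector = [0, 0, 1, 0, 2, 0, 0, 0]
sparse_vector = to_sparse(dense_vector)

print(sparse_vector)
print(to_dense(sparse_vector, len(dense_vector)))
```

With a vocabulary of thousands of tokens and tweets of a dozen words, storing only the non-zero positions saves most of the memory.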

Now let's create vectors for all of our tweets.

In [None]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])
## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
dev_vectors = count_vectorizer.transform(dev_df["text"])
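
The difference between `.fit_transform()` and `.transform()` can be sketched with a tiny stand-in for the vectorizer. `TinyCountVectorizer` below is a hypothetical, heavily simplified class written only for this illustration (it is not scikit-learn's implementation): `fit` learns the vocabulary, and `transform` counts only the tokens that are already in it.

```python
import re


class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer."""

    def fit(self, texts):
        # Learn the vocabulary: one column index per distinct token.
        tokens = sorted({tok for text in texts for tok in self._tokenize(text)})
        self.vocabulary_ = {tok: idx for idx, tok in enumerate(tokens)}
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for tok in self._tokenize(text):
                if tok in self.vocabulary_:  # tokens unseen during fit are dropped
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

    @staticmethod
    def _tokenize(text):
        # Lowercase and keep runs of two or more word characters.
        return re.findall(r"\b\w\w+\b", text.lower())


vec = TinyCountVectorizer().fit(["forest fire near the lake", "fire in the city"])
print(vec.vocabulary_)
print(vec.transform(["flood in the city"]))  # "flood" was never seen, so it is ignored
```

Because the dev vectors reuse the training vocabulary, both matrices have the same columns, which is what the classifier needs.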

### Our model

As we mentioned above, we think the words contained in each tweet are a good indicator of whether they're about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we're assuming here is a _linear_ connection. So let's build a linear model and see!

In [None]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()
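
To see what "pushing the weights toward 0" means, here is a rough sketch for the simplest possible case: one feature and no intercept, where ridge regression has the closed form `w = sum(x*y) / (sum(x*x) + alpha)`. The function name `ridge_weight` and the toy data are made up for this illustration, and the labels are encoded as -1 / +1, the way a ridge classifier treats a binary target as a regression problem.

```python
def ridge_weight(x, y, alpha):
    """Closed-form ridge solution for one feature and no intercept:
    w = sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


# Toy data: how often some word appears in 4 tweets, and the -1 / +1 labels.
x = [0, 1, 2, 1]
y = [-1, 1, 1, -1]

# A larger penalty (alpha) shrinks the weight toward 0 without making it exactly 0.
for alpha in (0.0, 1.0, 10.0):
    print(alpha, round(ridge_weight(x, y, alpha), 3))
```

With thousands of word columns and only a few thousand tweets, this shrinkage keeps rare words from getting huge weights based on one or two examples.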

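Let's test our model and see how well it does. A common way to do this is `cross-validation`, where we train on a portion of the known data, then validate on the rest; repeating this several times (with different portions) gives a good idea of how a particular model or method performs. The cell below takes a simpler route: it fits on the training split and evaluates on the held-out dev split we set aside earlier. Cross-validation comes back later in the grid searches.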
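
As a rough illustration of how k-fold cross-validation carves up the data, here is a minimal sketch in plain Python. The helper `k_fold_indices` is hypothetical and uses contiguous, unshuffled folds; library implementations offer shuffling and stratification on top of this.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


folds = list(k_fold_indices(7, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once, so averaging the per-fold scores gives a less noisy estimate than a single split.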

The metric for this competition is F1, so let's use that here.
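
For reference, F1 is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on made-up labels; `f1_score_binary` is a name chosen for this illustration, not a library function.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = disaster tweet, 0 = not a disaster tweet.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(f1_score_binary(y_true, y_pred))
```

Unlike accuracy, F1 ignores true negatives, so a model cannot score well just by predicting "not a disaster" for everything.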

In [None]:
%%time
clf.fit(train_vectors, train_df["target"])
predictions = clf.predict(dev_vectors);
print(classification_report(dev_df['target'], predictions))



* The above scores aren't terrible! It looks like our assumption will score roughly 0.8 on the leaderboard. There are lots of ways to potentially improve on this (TFIDF, LSA, LSTM / RNNs, the list is long!) - give any of them a shot!


# Feature engineering

### use keywords for prediction

In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer()
train_replaced_na_keyword = train_df.copy()
train_replaced_na_keyword['keyword'] = train_df['keyword'].fillna(' ')
train_vectors = count_vectorizer.fit_transform(train_replaced_na_keyword["keyword"])

dev_replaced_na_keyword = dev_df.copy()
dev_replaced_na_keyword['keyword'] = dev_df['keyword'].fillna(' ')
dev_vectors = count_vectorizer.transform(dev_replaced_na_keyword["keyword"])
train_replaced_na_keyword

In [None]:
train_vectors[1].todense()

In [None]:
clf = linear_model.RidgeClassifier()
clf.fit(train_vectors, train_df["target"])
predictions = clf.predict(dev_vectors);
print(classification_report(dev_df['target'], predictions))



### use tweet text in combination with keyword

In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer()
train_textandkeyword_df = train_df.copy()
train_textandkeyword_df['keyword'] = train_df['keyword'].fillna(' ')
train_textandkeyword_df['textandkeyword'] = train_textandkeyword_df['text'] + " // " + train_textandkeyword_df['keyword']
train_vectors = count_vectorizer.fit_transform(train_textandkeyword_df["textandkeyword"])



dev_textandkeyword_df = dev_df.copy()
dev_textandkeyword_df['keyword'] = dev_df['keyword'].fillna(' ')
dev_textandkeyword_df['textandkeyword'] = dev_textandkeyword_df['text'] + " // " + dev_textandkeyword_df['keyword']
dev_vectors = count_vectorizer.transform(dev_textandkeyword_df["textandkeyword"])



In [None]:
clf = linear_model.RidgeClassifier()
clf.fit(train_vectors, train_df["target"])
predictions = clf.predict(dev_vectors);
print(classification_report(dev_df['target'], predictions))




Adding keyword to text reduces the score

### use only text but only over tweets that have keywords

In [None]:
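# Keep only the tweets that actually have a keyword, so the experiment matches its title
train_kw_df = train_df[train_df["keyword"].notnull()]
dev_kw_df = dev_df[dev_df["keyword"].notnull()]
clf = linear_model.RidgeClassifier()
train_vectors = count_vectorizer.fit_transform(train_kw_df["text"])
dev_vectors = count_vectorizer.transform(dev_kw_df["text"])
clf.fit(train_vectors, train_kw_df["target"])
predictions = clf.predict(dev_vectors)
print(classification_report(dev_kw_df['target'], predictions))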




#  Vector representations

### Use TF-IDF to highlight important words in the text

Weight every word in each tweet by its TF-IDF score and feed the resulting vectors to a classifier.
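
As a minimal sketch of what the weighting does, the code below computes TF-IDF by hand. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is how scikit-learn documents the `TfidfVectorizer` defaults; the whitespace tokenization and the function name `tfidf` are simplifications made up for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    result = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        result.append({t: w / norm for t, w in weights.items()})
    return result


docs = ["the fire is spreading", "the game is on", "the storm hit the coast"]
weights = tfidf(docs)
# Words that appear in fewer documents get a higher weight.
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

In the first toy document, "fire" ends up with a higher weight than "is", which in turn outweighs "the", because "the" appears in every document.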

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
clf = linear_model.RidgeClassifier()
train_vectors = vectorizer.fit_transform(train_df["text"])
dev_vectors = vectorizer.transform(dev_df["text"])
clf.fit(train_vectors, train_df["target"])
predictions = clf.predict(dev_vectors);
print(classification_report(dev_df['target'], predictions))





Not removing stop words works better:

In [None]:

vectorizer = TfidfVectorizer()
clf = linear_model.RidgeClassifier()
train_vectors = vectorizer.fit_transform(train_df["text"])
dev_vectors = vectorizer.transform(dev_df["text"])
clf.fit(train_vectors, train_df["target"])
predictions = clf.predict(dev_vectors);
print(classification_report(dev_df['target'], predictions))



### <a id="best_shallow">Best model so far: ridge classifier using TF-IDF for text encoding</a>

In [None]:
clf = linear_model.RidgeClassifier()
pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))




# <a id="shallow_learning">Evaluate multiple classifier models</a>

In [None]:
from sklearn.base import BaseEstimator
class ClfSwitcher(BaseEstimator):
    def __init__(self, estimator = linear_model.RidgeClassifier()):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """ 
        self.estimator = estimator


    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self


    def predict(self, X, y=None):
        return self.estimator.predict(X)


    def predict_proba(self, X):
        return self.estimator.predict_proba(X)


    def score(self, X, y):
        return self.estimator.score(X, y)

In [None]:
%%time




clf = linear_model.RidgeClassifier()
pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')), ('clf', ClfSwitcher())])



parameters = [
    {
        'clf__estimator': [SVC()],
    },
    {
        'clf__estimator': [SGDClassifier()],
    },
    {
        'clf__estimator': [MultinomialNB()],
    },
    {
        'clf__estimator': [linear_model.RidgeClassifier()],
    },
    {
        'clf__estimator': [MLPClassifier(random_state=1, max_iter=200, early_stopping=True)],
    },
    {
        'clf__estimator': [RandomForestClassifier()]
    }
]


os.write(1, b"Starting grid search of models\n")
gscv = GridSearchCV(pipeline, parameters, cv=3, n_jobs=12, return_train_score=True, verbose=3, scoring='f1')
gscv.fit(train_df["text"], train_df["target"])

In [None]:
df = pd.DataFrame(gscv.cv_results_)
df

Compare the two best classifiers: MultinomialNB and MLPClassifier

In [None]:
%%time
clf = MultinomialNB()
pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))





In [None]:
%%time
clf = MLPClassifier(random_state=1, max_iter=200, early_stopping=True)
pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))



Hyperparameter search for MultinomialNB

In [None]:
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])



parameters = [
    {
        'clf__alpha': [1, 0, 0.5],
        'clf__fit_prior': [True, False]
    },
]

os.write(1, b"Starting grid search of alpha & fit_prior\n")
gscv = GridSearchCV(pipeline, parameters, cv=3, n_jobs=12, return_train_score=True, verbose=3, scoring='f1')
gscv.fit(train_df["text"], train_df["target"])

In [None]:
df = pd.DataFrame(gscv.cv_results_)
df

In [None]:
clf = MultinomialNB(fit_prior=False)
pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


# More advanced Vector representation techniques

## <a id="fast_text">Fast-text</a>

In [None]:
%%time
!pip install gensim


from typing import Callable, List, Optional, Tuple

import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
import gensim
from nltk import ngrams
from gensim.models.keyedvectors import FastTextKeyedVectors
import random
#api = gensim.downloader
#api.BASE_DIR = "."
#fastText_model = api.load("fasttext-wiki-news-subwords-300")  
from gensim.models import FastText
fastText_model = FastText.load('/kaggle/input/english-wikipedia-articles-20170820-models/enwiki_2017_08_20_fasttext.model')

def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return np.array([random.uniform(lower, upper) for i in range(n)])

def get_oov_fasttext(w):
    twograms = ngrams(w, 2)
    vectors = []
    for gram in twograms:
        word_2gram = gram[0] + gram[1]
        if word_2gram in fastText_model:
            vectors.append(fastText_model[word_2gram])
    if(len(vectors) > 0):
        return np.sum(vectors, axis=0)
    else:
        return randvec(w, n=300)

def fasttext_vec(w):
    """Return `w`'s fastText representation if available, else fall back
    to get_oov_fasttext (summed character 2-gram vectors or a random vector)."""
    if(w in fastText_model):
        return fastText_model[w]
    else:
        return get_oov_fasttext(w)
    
class FastTextTransformer(BaseEstimator, TransformerMixin):
    def __init__( self, combine_strategy="concatenate", max_sentence_length=30):
        assert (combine_strategy=="concatenate" or combine_strategy=='mean')
        self.combine_strategy = combine_strategy
        self.max_sentence_length = max_sentence_length
        self.empty_word_token = "EOF"
        
    def transform(self, text_list):
        texts = text_list.tolist()
        result = [];
        for text in texts:
            vectors = [];
            words = text.split()
            if(self.combine_strategy == 'concatenate'):
                max_index = self.max_sentence_length
            else:
                max_index = len(words) 
            for index in range(max_index):
                if(len(words) > index):
                    word = words[index]
                else:
                    word = self.empty_word_token
                vectors.append(fasttext_vec(word))
            if(self.combine_strategy == 'concatenate'):
                result.append(np.concatenate(vectors))
            elif(self.combine_strategy == 'mean'):
                result.append(np.mean(vectors, axis=0))
        return result;

    def fit(self, X, y=None):
        """No fitting necessary so we just return ourselves"""
        return self
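
The two ideas in the cell above, splitting an unknown word into character 2-grams and combining word vectors into one tweet vector, can be sketched with toy data. Everything below is illustrative: the 2-dimensional "embeddings" are made-up numbers, `char_bigrams` and `combine` are hypothetical helpers, and padding uses a zero vector, whereas the transformer above pads with the vector of a placeholder token.

```python
def char_bigrams(word):
    """Character 2-grams of a word, e.g. 'fire' -> ['fi', 'ir', 're']."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def combine(vectors, strategy, max_len, pad):
    """Combine per-word vectors into one fixed-size tweet vector."""
    if strategy == "mean":
        dim = len(pad)
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # "concatenate": pad or truncate to max_len words, then flatten
    padded = (vectors + [pad] * max_len)[:max_len]
    return [x for v in padded for x in v]


# Toy 2-dimensional "embeddings" (made-up numbers).
toy_vectors = {"forest": [1.0, 0.0], "fire": [0.0, 1.0]}
pad_vector = [0.0, 0.0]

vectors = [toy_vectors[w] for w in "forest fire".split()]
print(char_bigrams("fire"))
print(combine(vectors, "mean", max_len=3, pad=pad_vector))
print(combine(vectors, "concatenate", max_len=3, pad=pad_vector))
```

Note the trade-off: `mean` always yields a vector of the embedding size but loses word order, while `concatenate` keeps the order at the cost of a vector that is `max_len` times larger.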

In [None]:
os.write(1, b"Starting fasttext experiments\n")
clf = SGDClassifier()
pipe = Pipeline([('vectorizer', FastTextTransformer(combine_strategy='concatenate')), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))

#### Use Mean function for combining word vectors

In [None]:
clf = SGDClassifier()
pipe = Pipeline([('vectorizer', FastTextTransformer(combine_strategy='mean')), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


Compare same model using TF-IDF rather than fast-text for encoding

In [None]:
clf = SGDClassifier()
pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


Encoding text with the TF-IDF vectorizer works better than fastText here. That makes sense: TF-IDF helps our model pay attention to important words, while fastText doesn't. This suggests that a model with an attention mechanism may deliver promising results; I experiment with one later in this notebook.

# <a id="text_preprop">Pre-processing techniques</a>

In [None]:
# normalise has several nltk data dependencies. Install these by running the following python commands:

import nltk
for dependency in ("brown", "names", "wordnet", "averaged_perceptron_tagger", "universal_tagset"):
    nltk.download(dependency)

In [None]:
nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])
nltk.download('stopwords')
stops = stopwords.words("english")



def remove_accented_chars(text):
    """remove accented characters from text, e.g. café"""
    text = unidecode.unidecode(text)
    return text

def expand_contractions(text):
    """expand shortened words, e.g. don't to do not"""
    return contractions.fix(text);

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',str(text))


def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)


def remove_punctuation(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

# FIXME: for lemmatizing, normalizing and removing stop words we tokenize the text and then
# re-join the tokens with str.join. Some information (e.g. certain punctuation marks) is lost
# in the process, which reduces the model's accuracy.
# If you find a way to overcome this problem, please let me know in the comments!
def text_normalizer(comment, lemmatize, lowercase, remove_stopwords, remove_accents, normalize_contractions, normalize_URL, normalize_emoji, normalize_html, normalize_punctuation):
    if lowercase:
        comment = comment.lower()
    if(remove_accents):
        comment = remove_accented_chars(comment)
    if(normalize_contractions):
        comment = expand_contractions(comment)
    if(normalize_URL):
        comment = remove_URL(comment)
    if(normalize_emoji):
        comment = remove_emoji(comment)   
    if(normalize_html):
        comment = remove_html(comment)   
    if(normalize_punctuation):
        comment = remove_punctuation(comment)   
    if(remove_stopwords):
        comment = nlp(comment)
        # keep only the tokens that are not in the NLTK stop word list
        words = [token.text for token in comment if token.text not in stops]
        comment = " ".join(words)
    if(lemmatize):
        comment = nlp(comment)
        comment = " ".join(word.lemma_.strip() for word in comment)
    return comment


class PrePropTextTransformer(BaseEstimator, TransformerMixin):
    def __init__( self, lemmatize=False, lowercase=False, remove_stopwords=False, remove_accents=False, normalize_contractions=False, normalize_URL=False, normalize_emoji=False, normalize_html=False, normalize_punctuation=False):
        self.lemmatize=lemmatize
        self.lowercase=lowercase
        self.remove_stopwords=remove_stopwords
        self.remove_accents=remove_accents
        self.normalize_contractions=normalize_contractions
        self.normalize_URL=normalize_URL
        self.normalize_emoji=normalize_emoji
        self.normalize_html=normalize_html
        self.normalize_punctuation=normalize_punctuation
        
    def transform(self, text_list):
        texts = text_list.tolist()
        result = [];
        for text in texts:
            result.append(text_normalizer(text, self.lemmatize, self.lowercase, self.remove_stopwords, self.remove_accents, self.normalize_contractions, self.normalize_URL, self.normalize_emoji, self.normalize_html, self.normalize_punctuation))
        return pd.Series(result)

    def fit(self, X, y=None):
        """No fitting necessary so we just return ourselves"""
        return self
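
To get a feel for what some of these cleaning steps do to a single tweet, here is a small self-contained sketch that uses the same regular expressions for URLs and HTML tags as above. The function `clean_tweet` and the sample tweet are made up for this example, and the final whitespace collapse is an extra step that the transformer above does not perform.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, strip_punctuation=False):
    """Remove URLs and HTML tags, optionally followed by punctuation removal."""
    text = URL_RE.sub("", text)
    text = HTML_RE.sub("", text)
    if strip_punctuation:
        text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())  # collapse the whitespace left behind


sample = "<b>Wildfire</b> spreading fast! Details: https://t.co/abc123 #fire"
print(clean_tweet(sample))
print(clean_tweet(sample, strip_punctuation=True))
```

Stripping punctuation also removes the `#` from hashtags and the `!` that may signal urgency, which is one plausible reason a flag like `normalize_punctuation` can hurt rather than help.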



#### baseline:

In [None]:
%%time
clf = MultinomialNB(fit_prior=False)
pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


### use PrePropTextTransformer to try to improve the baseline

In [None]:
%%time
os.write(1, b"Starting experiments with text preprop\n")
clf = MultinomialNB()
pipe = Pipeline([('preprop', PrePropTextTransformer()), ('vectorizer', TfidfVectorizer()), ('predictor', clf)])

pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))



In [None]:
%%time
clf = MultinomialNB(fit_prior=False)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=True,
                                                    lowercase=True,
                                                    remove_stopwords=True,
                                                    remove_accents=True, 
                                                    normalize_contractions=True,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=True
                                                   )), ('vectorizer', TfidfVectorizer()), ('predictor', clf)])

pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


In [None]:
%%time
pipeline = Pipeline([('preprop', PrePropTextTransformer()),
                     ('vectorizer', TfidfVectorizer()),
                     ('predictor', MultinomialNB(fit_prior=False))])



parameters = [
        {'preprop__lemmatize': [True, False]},
        {'preprop__lowercase': [True, False]},
        {'preprop__remove_stopwords': [True, False]},
        {'preprop__remove_accents': [True, False]},
        {'preprop__normalize_contractions': [True, False]},
        {'preprop__normalize_URL': [True, False]},
        {'preprop__normalize_emoji': [True, False]},
        {'preprop__normalize_html': [True, False]},
        {'preprop__normalize_punctuation': [True, False]}
    ]

try:
    grid_search_pd = pd.read_csv("/kaggle/input/precomputedgridsearches2/text_preprop_gs_results.csv")
except FileNotFoundError:
    # FIXME: n_jobs has to be 1 or it crashes
    gscv = GridSearchCV(pipeline, parameters, cv=3, n_jobs=1, return_train_score=True, verbose=3, scoring='f1')
    gscv.fit(train_df["text"], train_df["target"])
    grid_search_pd = pd.DataFrame(gscv.cv_results_)
    # /kaggle/input is read-only, so cache the results in the writable working directory
    grid_search_pd.to_csv("/kaggle/working/text_preprop_gs_results.csv")
grid_search_pd
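
One detail of the search above is easy to miss: passing a list of dictionaries means each dictionary is explored as its own small grid, so every flag is toggled on its own while the others stay at their defaults. The sketch below counts the resulting candidates with plain Python; `count_candidates` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from itertools import product


def count_candidates(param_grid):
    """Number of parameter settings a list-of-dicts or single-dict grid expands to."""
    grids = param_grid if isinstance(param_grid, list) else [param_grid]
    total = 0
    for grid in grids:
        # Within one dict, every combination of values is tried.
        total += len(list(product(*grid.values())))
    return total


flags = ["lemmatize", "lowercase", "remove_stopwords", "remove_accents",
         "normalize_contractions", "normalize_URL", "normalize_emoji",
         "normalize_html", "normalize_punctuation"]

one_flag_at_a_time = [{f"preprop__{name}": [True, False]} for name in flags]
full_cross_product = {f"preprop__{name}": [True, False] for name in flags}

print(count_candidates(one_flag_at_a_time))  # 9 grids of 2 settings each
print(count_candidates(full_cross_product))  # 2 ** 9 combinations
```

Testing one flag at a time is far cheaper (18 candidates instead of 512), but it cannot detect interactions between flags, so combining the individually best settings is a heuristic rather than a guaranteed optimum.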


## best result so far using text preprocessing:

In [None]:
%%time
clf = MultinomialNB(fit_prior=False)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', TfidfVectorizer()), ('predictor', clf)])

pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


### Does using a tweet tokenizer improve it?

In [None]:
%%time
from nltk.tokenize import TweetTokenizer
clf = MultinomialNB(fit_prior=False)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', TfidfVectorizer(tokenizer=TweetTokenizer().tokenize)), ('predictor', clf)])

pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


# More advanced word embeddings

### BERT

In [None]:
from typing import Callable, List, Optional, Tuple

import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from spacy.util import minibatch
import torch
!pip install transformers
from transformers import BertModel, BertTokenizer



def mean_across_all_tokens(hidden_states):
    return torch.mean(hidden_states[-1], dim=1)

def sum_all_tokens(hidden_states):
    return torch.sum(hidden_states[-1], dim=1)

def concat_all_tokens(hidden_states):
    batch_size, max_tokens, emb_dim = hidden_states[-1].shape
    return torch.reshape(hidden_states[-1], (batch_size, max_tokens * emb_dim))

def CLS_token_embedding(hidden_states):
    return hidden_states[-1][:, 0, :]

class BertTransformer(BaseEstimator, TransformerMixin):
    def __init__(
            self,
            max_length: int = 60,
            tokenizer = BertTokenizer.from_pretrained("bert-base-uncased"),
            model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True),
            embedding_func = mean_across_all_tokens,
            combine_sentence_tokens=True
    ):
        self.tokenizer = tokenizer;
        self.combine_sentence_tokens = combine_sentence_tokens;
        self.embedding_func = embedding_func;
        self.model = model
        self.model.eval()
        self.max_length = max_length

    def _tokenize(self, text_list: List[str]) -> torch.Tensor:
        # Tokenize the text with the provided tokenizer
        input_ids = self.tokenizer.batch_encode_plus(text_list,
                                                    add_special_tokens=True,
                                                    max_length=self.max_length,
                                                    pad_to_max_length=True
                                                    )["input_ids"]

        return torch.LongTensor(input_ids)
         

    def _tokenize_and_predict(self, text_list: List[str]) -> torch.Tensor:
        input_ids_tensor = self._tokenize(text_list)
        out = self.model(input_ids=input_ids_tensor)
        hidden_states = out[2]
        if(self.combine_sentence_tokens):
            return self.embedding_func(hidden_states)
        else:
            return hidden_states[-1]
    
    def transform(self, text_list: List[str], batch_size=32):
        if isinstance(text_list, pd.Series):
            text_list = text_list.tolist()
        batches = minibatch(text_list, size=batch_size)
        predictions = []
        for batch in batches:
            with torch.no_grad():
                batch_predictions = self._tokenize_and_predict(batch)
            predictions.append(batch_predictions)
        return torch.cat(predictions, dim=0)

    def fit(self, X, y=None):
        """No fitting necessary so we just return ourselves"""
        return self

In [None]:
bertTransformer = BertTransformer(combine_sentence_tokens=False)
bertTransformer.transform(["pablo", "I love Pablo"]).shape

In [None]:
%%time
os.write(1, b"Starting experiments with BERT\n")
clf = linear_model.RidgeClassifier()
pipe = Pipeline([('vectorizer', BertTransformer()), ('predictor', clf)])
pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


Text preprocessing + BERT embeddings + RidgeClassifier

In [None]:
%%time
clf = linear_model.RidgeClassifier()
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', BertTransformer()), ('predictor', clf)])

pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


<a id='bert-NN'></a>
BERT embeddings fed to a feed-forward NN

In [None]:
%%time
clf = MLPClassifier(random_state=1, max_iter=200, early_stopping=True)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', BertTransformer()), ('predictor', clf)])

pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))


## Hyperparameter search for the MLP classifier

The number of epochs is reduced (`max_iter=100`) and `cv=2` is used to cut training time.
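
Because each search is slow, the grid-search cells in this notebook first try to load cached results from a CSV file and only run the search when the file is missing. A minimal sketch of that load-or-compute pattern is shown here; `load_or_compute` and `fake_grid_search` are hypothetical names used for illustration, and the toy dataframe stands in for `pd.DataFrame(gscv.cv_results_)`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_compute(cache_path, compute):
    """Return the DataFrame cached at cache_path, or build it and cache it.

    `compute` is a zero-argument callable that returns a DataFrame.
    """
    cache_path = Path(cache_path)
    try:
        return pd.read_csv(cache_path)
    except FileNotFoundError:
        # Only a missing file triggers the expensive computation;
        # any other error (e.g. a typo in the pipeline) still surfaces.
        results = compute()
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        results.to_csv(cache_path, index=False)
        return results


calls = []  # records how many times the expensive step actually runs


def fake_grid_search():
    calls.append(1)
    return pd.DataFrame({"param": ["(100,)", "(100, 50)"],
                         "mean_test_score": [0.5, 0.25]})


cache_file = Path(tempfile.mkdtemp()) / "mlp_clf_gs_results.csv"
first = load_or_compute(cache_file, fake_grid_search)   # computes and writes the CSV
second = load_or_compute(cache_file, fake_grid_search)  # reads the CSV back
print(len(calls), second.shape)
```

Catching `FileNotFoundError` instead of using a bare `except:` keeps unrelated failures visible. On Kaggle, `/kaggle/input` is mounted read-only as far as I know, so a fresh cache would have to be written to a writable location such as `/kaggle/working`.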

In [None]:
%%time
clf = MLPClassifier(random_state=1, max_iter=100, early_stopping=True)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', BertTransformer()), ('clf', clf)])

parameters = [
        {'clf__hidden_layer_sizes': [(100, 100), (100,), (100, 50), (50, 50)]}
    ]

try:
    grid_search_pd = pd.read_csv("/kaggle/input/precomputedgridsearches2/mlp_clf_gs_results.csv")
except FileNotFoundError:
    # No cached results, so run the search. FIXME: n_jobs has to be 1 or it crashes
    gscv = GridSearchCV(pipe, parameters, cv=2, n_jobs=1, return_train_score=True, verbose=3, scoring='f1')
    gscv.fit(train_df["text"], train_df["target"])
    grid_search_pd = pd.DataFrame(gscv.cv_results_)
    grid_search_pd.to_csv("/kaggle/input/precomputedgridsearches2/mlp_clf_gs_results.csv")
grid_search_pd


In [None]:
%%time
clf = MLPClassifier(random_state=1, max_iter=100, early_stopping=True)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', BertTransformer()), ('clf', clf)])

parameters = [
        {'clf__hidden_layer_sizes': [(100, 50, 50), (100,100, 50)]}
    ]

try:
    grid_search_pd = pd.read_csv("/kaggle/input/precomputedgridsearches2/mlp_clf2_gs_results.csv")
except FileNotFoundError:
    # No cached results, so run the search. FIXME: n_jobs has to be 1 or it crashes
    gscv = GridSearchCV(pipe, parameters, cv=2, n_jobs=1, return_train_score=True, verbose=3, scoring='f1')
    gscv.fit(train_df["text"], train_df["target"])
    grid_search_pd = pd.DataFrame(gscv.cv_results_)
    grid_search_pd.to_csv("/kaggle/input/precomputedgridsearches2/mlp_clf2_gs_results.csv")
grid_search_pd


Adding more layers to the feed-forward network does not bring a significant improvement.

### BERT: comparing ways of building sentence embeddings for classification (sum vs. mean vs. concatenation vs. [CLS] token embedding)
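
To make the four options concrete, here is a small sketch of what each pooling strategy does to the shape of a batch of token vectors. It uses NumPy and a toy array instead of real BERT output, and the function names are hypothetical: the notebook's own `sum_all_tokens`, `mean_across_all_tokens` and `concat_all_tokens` are defined in an earlier cell and may differ in detail (for example, in which layers they combine).

```python
import numpy as np

# Toy stand-in for the last hidden layer: (batch, tokens, hidden).
# With bert-base and max_length=60 the real shape would be (batch, 60, 768).
batch, tokens, hidden = 2, 4, 3
last_layer = np.arange(batch * tokens * hidden, dtype=float).reshape(batch, tokens, hidden)


def sum_pool(x):
    # Add the token vectors together -> (batch, hidden)
    return x.sum(axis=1)


def mean_pool(x):
    # Average the token vectors -> (batch, hidden)
    return x.mean(axis=1)


def concat_pool(x):
    # Lay the token vectors side by side -> (batch, tokens * hidden)
    return x.reshape(x.shape[0], -1)


def cls_pool(x):
    # Keep only the first position, where [CLS] sits -> (batch, hidden)
    return x[:, 0, :]


for name, fn in [("sum", sum_pool), ("mean", mean_pool),
                 ("concat", concat_pool), ("cls", cls_pool)]:
    print(name, fn(last_layer).shape)
```

Concatenation is the only option whose output size grows with `max_length`, which makes the downstream classifier larger; the other three always produce one vector of the hidden size per tweet.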

In [None]:
bertTransformer = BertTransformer()
bertTransformer.transform(["granola bars"]).shape

In [None]:
%%time
clf = MLPClassifier(random_state=1, max_iter=100, early_stopping=True)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', BertTransformer()), ('predictor', clf)])


parameters = [
        {'vectorizer__embedding_func': [sum_all_tokens, mean_across_all_tokens, concat_all_tokens, CLS_token_embedding]}
    ]

try:
    grid_search_pd = pd.read_csv("/kaggle/input/precomputedgridsearches2/emb_funcs_gs_results.csv")
except FileNotFoundError:
    # No cached results, so run the search. FIXME: n_jobs has to be 1 or it crashes
    gscv = GridSearchCV(pipe, parameters, cv=2, n_jobs=1, return_train_score=True, verbose=3, scoring='f1')
    gscv.fit(train_df["text"], train_df["target"])
    grid_search_pd = pd.DataFrame(gscv.cv_results_)
    grid_search_pd.to_csv("/kaggle/input/precomputedgridsearches2/emb_funcs_gs_results.csv")
grid_search_pd



## RoBERTa embeddings + XGBoost classifier

In [None]:
%%time 
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from transformers import RobertaTokenizer, RobertaModel
from xgboost import XGBClassifier
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base', output_hidden_states=True)
clf = XGBClassifier(n_estimators=300)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', BertTransformer(model=model, tokenizer=tokenizer)), ('predictor', clf)])

pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))

# Sequential models

So far we've combined the word embeddings in some way and trained feed-forward, shallow models on the result. Let's see what happens when the word embeddings are instead fed, one token at a time, to a sequential model!
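
One way to see what a sequential model adds: pooling such as the mean ignores the order of the token vectors it receives, while a recurrent model does not. The sketch below illustrates this with NumPy, random toy vectors and a plain tanh recurrence (not an LSTM); all names are hypothetical. Keep in mind that BERT's contextual vectors already carry some positional information, so this only illustrates the pooling step itself.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))  # 5 toy token vectors of size 4
reversed_tokens = tokens[::-1]    # the same vectors in reverse order


def mean_pool(x):
    # Average over the token axis: the order of the rows does not matter.
    return x.mean(axis=0)


def toy_rnn(x, w_in, w_rec):
    """Simple recurrence h_t = tanh(x_t W_in + h_{t-1} W_rec); returns the last state."""
    h = np.zeros(w_rec.shape[0])
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec)
    return h


w_in = rng.normal(size=(4, 3))
w_rec = rng.normal(size=(3, 3))

same_mean = np.allclose(mean_pool(tokens), mean_pool(reversed_tokens))
same_rnn = np.allclose(toy_rnn(tokens, w_in, w_rec),
                       toy_rnn(reversed_tokens, w_in, w_rec))
print("mean unchanged by reordering:", same_mean)
print("recurrent state unchanged by reordering:", same_rnn)
```

The mean is identical for both orderings, whereas the final recurrent state generally differs, because each step depends on everything that came before it.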

### <a id="lstms">LSTM classifiers</a>

#### Vanilla LSTM

In [None]:
%%time
os.write(1, b"Starting experiments with sequential models\n")
bertTransformer = BertTransformer(combine_sentence_tokens=False)
X_train = bertTransformer.transform(train_df["text"])
X_dev = bertTransformer.transform(dev_df["text"])
y_train = train_df["target"].tolist()
y_dev = dev_df["target"].tolist()

In [None]:
%%time
from torch_rnn_classifier_attn import TorchRNNClassifier
torch_rnn = TorchRNNClassifier(
        vocab=[],
        use_embedding=False,
        bidirectional=False,
        hidden_dim=50,
        max_iter=50,
        eta=0.05) 
_ = torch_rnn.fit(X_train, y_train)

In [None]:
%%time
from sklearn.metrics import classification_report
predictions = torch_rnn.predict(X_dev)
print(classification_report(y_dev, predictions))

#### Bi-LSTM classifier

In [None]:
%%time
from torch_rnn_classifier import TorchRNNClassifier

torch_bi_lstm = TorchRNNClassifier(
        vocab=[],
        use_embedding=False,
        bidirectional=True,
        hidden_dim=50,
        max_iter=50,
        eta=0.05) 
_ = torch_bi_lstm.fit(X_train, y_train)

In [None]:
%%time
predictions = torch_bi_lstm.predict(X_dev)
print(classification_report(y_dev, predictions))


#### BI-LSTM with attention
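
The `torch_rnn_classifier_attn` module used below is not reproduced in this notebook, so the following is only a generic sketch of attention pooling over LSTM outputs: each token's hidden state gets a score, the scores are turned into weights with a softmax, and the sentence vector is the weighted sum of the hidden states. It uses NumPy, tanh-based scoring and random values; the names are hypothetical and the module's actual attention variants may differ.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attention_pool(H, w):
    """Weighted sum of hidden states for one sentence.

    H: (tokens, hidden) matrix of LSTM outputs.
    w: (hidden,) scoring vector; learned in a real model, random here.
    """
    scores = np.tanh(H) @ w  # one scalar score per token
    alpha = softmax(scores)  # attention weights: non-negative, summing to 1
    return alpha @ H, alpha  # sentence vector (hidden,), weights (tokens,)


rng = np.random.default_rng(0)
H = rng.normal(size=(6, 100))  # e.g. 6 tokens, 2 * hidden_dim=50 for a bi-LSTM
w = rng.normal(size=100)
sentence_vec, alpha = attention_pool(H, w)
print(sentence_vec.shape, alpha.round(3))
```

With an all-zero scoring vector every token receives the same weight and the result reduces to mean pooling, so attention can be read as a learned generalisation of averaging.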

#### Luong, Pham & Manning 2015 global attention

In [None]:
%%time
from torch_rnn_classifier_attn import TorchRNNClassifier
torch_bilstm_attn_manning2015 = TorchRNNClassifier(
        vocab=[],
        use_embedding=False,
        attention="GlobalAttnManning2015",
        bidirectional=True,
        hidden_dim=50,
        max_iter=30,
        eta=0.05) 
_ = torch_bilstm_attn_manning2015.fit(X_train, y_train)

In [None]:
%%time
predictions = torch_bilstm_attn_manning2015.predict(X_dev)
print(classification_report(y_dev, predictions))


#### <a id="shou_peng_attn">Shou, Peng, et al. 2016 Attention</a>

In [None]:
%%time
torch_bilstm_attn_ShouPeng2016 = TorchRNNClassifier(
        vocab=[],
        use_embedding=False,
        attention="AttnShouPeng2016",
        bidirectional=True,
        hidden_dim=50,
        max_iter=30,
        eta=0.05) 
_ = torch_bilstm_attn_ShouPeng2016.fit(X_train, y_train)

In [None]:
%%time
predictions = torch_bilstm_attn_ShouPeng2016.predict(X_dev)
print(classification_report(y_dev, predictions))


# <a id="bert_e2e">BERT (fine-tuned)</a>

### Fine-tuning BERT models for the classification task

A BERT model with a single linear layer added on top for classification. As we feed in the training data, both the pre-trained BERT model and the new, untrained classification layer are trained on our specific task.

Although this notebook already tested a similar architecture ([BERT embeddings fed to a scikit-learn feed-forward NN](#bert-NN)), the model below has an advantage that makes it more promising: the weights of both the transformer that generates the word embeddings and the classification layer are adjusted (https://arxiv.org/abs/2004.14448).

In [None]:
class BertClassifierPredictor(BaseEstimator, ClassifierMixin):
    def __init__(self, model=ClassificationModel('roberta', 'roberta-base', args={"overwrite_output_dir": True},
                                                 use_cuda=False)):
        self.model = model
        logging.basicConfig(level=logging.INFO)
        transformers_logger = logging.getLogger("transformers")
        transformers_logger.setLevel(logging.WARNING)


    def fit(self, X, y):
        X = X.tolist()
        y = y.tolist()
        self.classes_ = unique_labels(y)
        self.X_ = X
        self.y_ = y
        d = {'text': X, 'labels': y}
        df = pd.DataFrame(data=d)
        self.model.train_model(df)
        return self

    def predict(self, X):
        return self.model.predict(X.tolist())[0]

In [None]:
%%time
os.write(1, b"Starting experiments w BERT finetuning\n")
model = ClassificationModel('bert', '/kaggle/input/simpletransformer-outputs/checkpoint-bert-epoch-1', args={"overwrite_output_dir": True}, use_cuda=False)

clf = BertClassifierPredictor(model=model)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('predictor', clf)])

# model already trained; uncomment the line below to continue training
#pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))

### <a id="winner">RoBERTa (fine-tuned)</a>

In [None]:
%%time
from simpletransformers.classification import ClassificationModel
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
# Create a ClassificationModel from the fine-tuned checkpoint
model = ClassificationModel('roberta', '/kaggle/input/simpletransformer-outputs/checkpoint-roberta-epoch-1', args={"overwrite_output_dir": True}, use_cuda=False)

clf = BertClassifierPredictor(model=model)
pipe_roberta_finetuned = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('predictor', clf)])

# model already trained; uncomment the line below to continue training
#pipe_roberta_finetuned.fit(train_df["text"], train_df["target"])
predictions = pipe_roberta_finetuned.predict(dev_df['text'])
print(classification_report(dev_df['target'], predictions))


## RoBERTa word vectors (not fine-tuned) + feed-forward NN

In [None]:
%%time
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base', output_hidden_states=True)
clf = MLPClassifier(random_state=1, max_iter=200, early_stopping=True)
pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('vectorizer', BertTransformer(model=model, tokenizer=tokenizer)), ('predictor', clf)])

pipe.fit(train_df["text"], train_df["target"])
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))

## RoBERTa word vectors used in the best model so far (bi-LSTM with attention)

In [None]:
%%time
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base', output_hidden_states=True)
bertTransformer = BertTransformer(model=model, tokenizer=tokenizer, combine_sentence_tokens=False)
X_train = bertTransformer.transform(train_df["text"])
X_dev = bertTransformer.transform(dev_df["text"])
y_train = train_df["target"].tolist()
y_dev = dev_df["target"].tolist()


In [None]:
%%time
from torch_rnn_classifier_attn import TorchRNNClassifier
torch_bilstm_attention_roberta_exp = TorchRNNClassifier(
        vocab=[],
        use_embedding=False,
        attention="AttnShouPeng2016",
        bidirectional=True,
        hidden_dim=50,
        max_iter=30,
        eta=0.05) 
_ = torch_bilstm_attention_roberta_exp.fit(X_train, y_train)

In [None]:
%%time
predictions = torch_bilstm_attention_roberta_exp.predict(X_dev)
print(classification_report(y_dev, predictions))


## spaCy classifier (architecture=ensemble)

From the docs (https://spacy.io/api/textcategorizer#architectures):  
Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention.

In [None]:
%%time
os.write(1, b"Starting experiments with spacyclassifier\n")
spacy_classifier = SpacyClassifier(n_iter=50)


pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('predictor', spacy_classifier)])

pipe.fit(train_df["text"], train_df["target"])

In [None]:
%%time
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))

# <a id="gpt2">GPT-2</a>

A frozen (not fine-tuned during training) DistilGPT2 language model generates the word embeddings, with a feed-forward classification layer on top.

In [None]:
%%time
from GPT2_classifier import GPT2Classifier
checkpoint_path = "/kaggle/input/models3/GPT2_CLS_CHECKPOINT_ephoc20.pth.tar"
gpt2_classifier = GPT2Classifier(max_iter=20, finetune_GPT2=False, batch_size=32, checkpoint_path=checkpoint_path,
                 base_dir=".", classes=[0,1])

pipe = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('predictor', gpt2_classifier)])

pipe.fit(train_df["text"], train_df["target"])




In [None]:
%%time
predictions = pipe.predict(dev_df["text"]);
print(classification_report(dev_df['target'], predictions))

# <a id="Conclusions">Conclusions</a>

Comparing all the models tested in this notebook, the best-performing model for tweet classification is the [fine-tuned RoBERTa classifier](#winner).

Other interesting findings:
- TF-IDF vectors + shallow learning (e.g. a ridge classifier) yield results that most of the deep learning models find hard to beat [Link](#best_shallow)
- Adding attention to an LSTM classifier increases its accuracy noticeably, particularly with the Shou, Peng, et al. 2016 attention [Link](#shou_peng_attn)

# Follow up work

- Train the word embeddings and the LSTM with attention together, so that the word vectors are fine-tuned; this should yield better results than training only the LSTM classifier on frozen word vectors
- Can we place a NN on top of the LSTM and have it behave as an attention layer? If not, why not?
- Oversample the minority class
- Try other types of attention with RNNs; Luong, Pham & Manning 2015 suggest a better type called local attention
- Explore bigger versions of GPT-2
- Explore other ways of generating sentence embeddings with GPT-2 (e.g. add a CLS token and use its hidden state as the sentence representation)
- Fine-tune the GPT-2 model: my model only trains the classification layer. I also tried to fine-tune GPT-2, and the current code is supposed to support it, but for some reason it doesn't work
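
The oversampling item in the list above could be tried with a few lines of pandas. This is a minimal sketch of random oversampling under stated assumptions: `oversample_minority` is a hypothetical helper, the toy dataframe only mimics the `text` and `target` columns of `train_df`, and nothing here has been evaluated on the competition data.

```python
import pandas as pd


def oversample_minority(df, label_col="target", random_state=0):
    """Top up every smaller class with resampled rows until it matches the largest class."""
    counts = df[label_col].value_counts()
    target_size = int(counts.max())
    parts = []
    for label, count in counts.items():
        group = df[df[label_col] == label]
        missing = target_size - int(count)
        if missing > 0:
            # Draw the missing rows with replacement from the same class.
            extra = group.sample(n=missing, replace=True, random_state=random_state)
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the duplicated rows are not clustered together.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


toy_df = pd.DataFrame({"text": list("abcdef"), "target": [0, 0, 0, 0, 1, 1]})
balanced = oversample_minority(toy_df)
print(balanced["target"].value_counts().to_dict())
```

Oversampling should only be applied to the training split; duplicating rows in the dev set would distort the reported scores.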

# Submit best model predictions to Kaggle

Train the final model on all available labelled data; the dataframe below contains the examples used for both the training and dev sets throughout the notebook.

The number of training epochs was increased from 1 to 2 for the submission.

In [None]:
entire_train_df = pd.concat([train_df, dev_df])

In [None]:
%%time
# retrain best model on all available data
model = ClassificationModel('roberta', '/kaggle/input/simpletransformer-outputs/checkpoint-roberta_submission-epoch-2', args={"overwrite_output_dir": True, "num_train_epochs": 2}, use_cuda=False)

clf = BertClassifierPredictor(model=model)
pipe_roberta_finetuned_submit = Pipeline([('preprop', PrePropTextTransformer(lemmatize=False,
                                                    lowercase=False,
                                                    remove_stopwords=False,
                                                    remove_accents=True, 
                                                    normalize_contractions=False,
                                                    normalize_URL=True,
                                                    normalize_emoji=True,
                                                    normalize_html=True,
                                                    normalize_punctuation=False
                                                   )), ('predictor', clf)])
# model already trained; uncomment the line below to continue training
#pipe_roberta_finetuned_submit.fit(entire_train_df["text"], entire_train_df["target"])


In [None]:
test_vectors = test_df['text']
sample_submission = pd.read_csv("/kaggle/input/data-baby2/sample_submission.csv")
sample_submission["target"] = pipe_roberta_finetuned_submit.predict(test_vectors)
sample_submission.head()

In [None]:
sample_submission.to_csv("submission.csv", index=False)