This notebook records my process of learning NLP techniques from almost zero to a level decent enough to participate in the [Coleridge Initiative - Show US the Data](https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data), which ends in 3 months. Since I expect the [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started) competition to be more approachable, I will focus on that first.

I never had the chance to work with NLP in my professional experience and I want to make it an option for the future.

It is meant to be more of a diary for myself, and it is public mainly to hold myself accountable for making significant progress in these 3 months. The starting point is [the Kaggle course on NLP](https://www.kaggle.com/learn/natural-language-processing), which already gives some nice information, but I feel I have to dig a bit deeper to be able to do some analysis without copying.

Given that the course uses spaCy, a quick online search suggests it is a good starting point. We will see where to go from there.

**Do not expect a brilliant notebook, nor a high scoring one. It will most likely be a fairly pedantic exploration of functionalities I don't know yet. Feel free to drop a suggestion in the comments.**


***Day count = 6***

In [None]:
!pip install tubesml==0.4.2

In [None]:
import numpy as np 
import pandas as pd

import spacy

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load an NLP model

The very first thing every tutorial I found does is to load a model with spaCy. They do so by running a load method; the resulting object has quite a few methods, and I am sure we are going to need a few of those shortly.

In [None]:
nlp = spacy.load('en')
[m for m in dir(nlp) if '__' not in m]

From the [documentation](https://spacy.io/usage/models#languages), I can see that several languages are supported, some more than others, as one would expect. For the English language, for example, I see they provide

* language data: it contains stopwords, some language exceptions, and other things I can't recognize. The language exceptions for the tokenizer (**check it later**) do not seem to be a comprehensive list (for example, in Italian there is nord-est (north-east), but not nord-ovest (north-west)), but they seem aimed at covering all the common contractions like I'm or You're.
* pipelines: as the name suggests, they are a list of operations you can perform in a certain order. The models are pretrained and in the documentation there is some accuracy value for each component. **I don't know accuracy against what, I will check it later**.

# Tokenizer and Lemma

I can use the loaded model to analyze a text as follows

In [None]:
doc = nlp("This is my first sentence I process, I don't know what is going to happen. Do you?")
doc.to_json()

It seems it automatically detects the sentences by looking for a full stop; my bad grammar must be a nightmare for that.

In [None]:
for sent in doc.sents:
    print(sent)

Let's see if it is only the full stop

In [None]:
doc = nlp("This is my first sentence I process. I don't know what is going to happen... We will see. Do you?")
for sent in doc.sents:
    print(sent)

And then it already detected the tokens, which are units of text like words or punctuation.

In [None]:
for token in doc:
    print(f'Token: {token},\t\tBase form: {token.lemma_},\t\t\tPart of speech: {token.pos_}\t\t\tSentiment: {token.sentiment} ')

Looking at the JSON above, it also looks like the model finds out, for example, whether a token is a pronoun or a verb. I can access this information via the [token attributes](https://spacy.io/api/token).

Can I just do this for a full dataframe?

In [None]:
df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')

tokens = []  # from https://stackoverflow.com/questions/44395656/applying-spacy-parser-to-pandas-dataframe-w-multiprocessing
lemma = []
pos = []

# I am curious about what is the time cost of each operation, don't mind the time parts
from time import time
tok_time = []
lem_time = []
po_time = []

tot_s = time()
for doc in nlp.pipe(df['text'].values, batch_size=50, n_process=4):
    if doc.is_parsed:
        s = time()
        tokens.append([n.text for n in doc])
        tok_time.append(time() - s)
        s = time()
        lemma.append([n.lemma_ for n in doc])
        lem_time.append(time() - s)
        s = time()
        pos.append([n.pos_ for n in doc])
        po_time.append(time() - s)
    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries of the original Dataframe, so add some blanks in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)
        
tot_time = time() - tot_s

df['tokens'] = tokens
df['lemma'] = lemma
df['pos'] = pos

df[['text', 'tokens', 'lemma', 'pos']].sample(10)

Not the greatest achievement in the history of achievements but yes, I can. **Not sure how it would work with a very large dataframe, I guess I'll find out later**.

The new thing I used here is [nlp.pipe](https://spacy.io/api/language#pipe), which processes the texts as a stream and allows for parallelization and batching (I suppose this would make it more memory friendly). Out of curiosity, I timed each operation in the hope of seeing where the computation time goes.

In [None]:
print(f'Total time for entire dataframe: \t\t{tot_time}')
print(f'Mean time of tokenizer: \t\t{np.mean(tok_time)} +- {np.std(tok_time)}')
print(f'Mean time of lemmatizer: \t\t{np.mean(lem_time)} +- {np.std(lem_time)}')
print(f'Mean time of morphologizer: \t\t{np.mean(po_time)} +- {np.std(po_time)}')
print(f'Total time for the model to run without the above operations: \t\t{tot_time - np.sum(tok_time) - np.sum(lem_time) - np.sum(po_time)}')

In other words, almost all the time goes into the operation `nlp(...)`, which makes sense since each element we extract is simply an attribute of the resulting object. Hence, I should try to do that operation as few times as possible.

## Takeaway of this section

spaCy offers nice pretrained models that can extract a lot of features from a string. The operation can be parallelized easily, which can compensate for the (at least for me) impossibility of leveraging the pandas indexing.

# Pattern matching

The second topic in many tutorials, Kaggle's included, is about how to match tokens or phrases within a document. It feels like a natural second step of this journey.

I am going to follow what SpaCy does [in its documentation](https://spacy.io/usage/rule-based-matching).

To match tokens based on rules we set, we can use `Matcher`

In [None]:
from spacy.matcher import Matcher

[m for m in dir(Matcher) if '__' not in m]

In [None]:
# nlp is our model for english we loaded earlier

matcher = Matcher(nlp.vocab, validate=True)  # initialize object that shares the same vocabulary
# validate=True tells the matcher to validate the patterns when they are added and, if necessary, raise an error

# Create a pattern as a list of dictionaries
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
# add the pattern to the matcher with an ID
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello World! These are other words, like hello, but not world. Hello,world")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(string_id, start, end, span.text)

Thus, to add patterns to the matcher, we pass a list of the patterns we are looking for. Each pattern is a list of dictionaries, one per token, that together define the sequence of tokens we want.
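
To convince myself of how a list of dictionaries can describe a sequence of tokens, here is a minimal pure-Python sketch. It is not how spaCy implements `Matcher`: the functions `token_matches` and `find_matches` are hypothetical helpers, tokens are plain strings, and only three keys (`LOWER`, `TEXT`, `IS_PUNCT`) are supported.

```python
import string


def token_matches(token, spec):
    """Check one token (a plain string) against one pattern dictionary."""
    for key, expected in spec.items():
        if key == "LOWER":
            ok = token.lower() == expected
        elif key == "TEXT":
            ok = token == expected
        elif key == "IS_PUNCT":
            is_punct = all(ch in string.punctuation for ch in token)
            ok = is_punct == expected
        else:
            raise ValueError(f"Unsupported key in this sketch: {key}")
        if not ok:
            return False
    return True


def find_matches(tokens, pattern):
    """Return (start, end) spans where each token satisfies its dictionary."""
    spans = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            spans.append((start, start + width))
    return spans


# Hand-tokenized text, assumed to mimic what a tokenizer would produce
tokens = ["Hello", ",", "world", "!", "Hello", "World", "!"]
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
spans = find_matches(tokens, pattern)
print(spans)
```

Each dictionary constrains exactly one token, so a pattern of three dictionaries can only match a span of three tokens. That is why "Hello World" without punctuation in between needs its own, shorter pattern.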

To test this, let's make more patterns with different IDs

In [None]:
patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}]
]
matcher.add("HelloWorld", patterns)

pattern = [{'TEXT': 'world'}]  # it should not then find World
matcher.add('World', [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(string_id, start, end, span.text)

It also appears you can overwrite an existing pattern by simply adding it with the same ID (as I did for `HelloWorld`). We can access the patterns with the `_patterns` attribute

In [None]:
matcher._patterns

Interestingly, it doesn't really overwrite but rather adds another element to the list. This is convenient for adding new patterns without making new IDs every time. Indeed, there are other methods to get and remove these patterns

In [None]:
matcher.get('HelloWorld')  # give the ID of the pattern you want

In [None]:
matcher.remove('World')  # remove a pattern 
# or throw an error if it doesn't exist

I see from the docs we can also use regular expressions to match the patterns but the possibility of doing something when a match is found seems more interesting. 

For this, there is the `on_match` option when we add a new pattern. It has to be a function that takes the matcher, the document, the index of the current match, and the list of all matches.

In [None]:
def callback_on_match(matcher, doc, i, matches):
    # i is the index of the current match inside the list `matches`
    print('Matched!', i)
    
pattern = [{'LOWER': 'world'}]  
matcher.add('World', [pattern], on_match=callback_on_match)

_ = matcher(doc)

To match phrases we can instead use the [`PhraseMatcher`](https://spacy.io/api/phrasematcher), which differs from the `Matcher` as it accepts patterns in the form of a Doc. 

A Doc is a sequence of Tokens

In [None]:
from spacy.matcher import PhraseMatcher

[m for m in dir(PhraseMatcher) if '__' not in m]

It is initialized as before with a vocabulary, which has to be the same as that of the model we are using, but it also offers the possibility of setting the token attribute to match on.

In [None]:
matcher = PhraseMatcher(nlp.vocab)
terms = ['NLP', 'difficulty', 'interaction']

patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("This journey into NLP is not easy and take patience "
          "the difficulty is in finding interactions between these techinques and the one that I already know")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

It feels like just a more user-friendly version of the Matcher above.

More interesting seems to be the [`DependencyMatcher`](https://spacy.io/api/dependencymatcher), which allows matching dependency trees, but *I will leave this topic for another time.*

# Text Classification

This appears to be the entry level for machine learning. Time to use those competition datasets

In [None]:
df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
df.head()

The Machine Learning practitioner that is in me tells me I need to prepare the data for the model. It is my understanding that this would be just another component in the NLP pipeline. Following the Kaggle course

In [None]:
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

Now, here is me trying to be less confused:

* `textcat` looked like a simple name to give to the pipeline step but I can't change it, so the next cell will show the available names
* `bow` stands for bag-of-words, which is how we want to represent the data. This model disregards grammar and word order. It essentially assigns to each word in the document a number that represents the times that word occurs in the document.
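
To make the bag-of-words idea concrete, here is a small sketch with plain Python. It only illustrates the counting described above and is not what the `bow` architecture does internally; the helper `bag_of_words` and the two example sentences are made up for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter


def bag_of_words(text):
    """Count how many times each whitespace-separated word occurs."""
    return Counter(text.split())


docs = ["the party is on fire", "fire fire everywhere"]

# The vocabulary is the sorted set of all words seen in the documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector of counts, one entry per vocabulary word
vectors = [[bag_of_words(doc)[word] for word in vocabulary] for doc in docs]

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(vector, doc)
```

Note how the word order is lost: any permutation of the words of a document would produce the same vector.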

Now, I am struggling to find a good list of available options, but I can see that `textcat` is a name available here

In [None]:
nlp.factories

And I can find some model architectures, like bag-of-words, here: https://spacy.io/api/architectures#TextCatBOW.

Digging deeper into the documentation, here is the list of [built-in pipeline components](https://spacy.io/usage/processing-pipelines#built-in)

Moreover, I can access the pipeline names at any time via

In [None]:
nlp.pipe_names

The list is very short because I started by using `spacy.blank()`, which creates a blank pipeline. A more traditional approach is to load a model, and in that case the pipeline shows more pretrained components.

Even longer if we load a model more complex than `en`.

In [None]:
nlp = spacy.load("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the loaded model
nlp.add_pipe(textcat)
nlp.pipe_names

In [None]:
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

Moving on to our model, I need to add labels

In [None]:
# Add labels to text classifier
textcat.add_label('disaster')
textcat.add_label('not-disaster')

The `TextCategorizer` requires the label to be in a dictionary of boolean values for each class and each entry.

In [None]:
train_texts = df['text'].values
train_labels = [{'cats': {'disaster': label == 1, 'not-disaster': label == 0}} for label in df['target']]

train_data = list(zip(train_texts, train_labels))
train_data[:3]

Now I can train the model. This is done in batches and epochs. I need an optimizer and the Kaggle course suggests using `begin_training`

In [None]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()
[d for d in dir(optimizer) if '__' not in d]

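This method will be called `initialize` in the future (spaCy v3) and, by default, it returns a [Thinc optimizer](https://thinc.ai/docs/api-optimizers) (Adam, as far as I can tell from the source, even though the argument that receives it is called `sgd`).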

The loop over epochs is very intuitive; the one over batches deserves some investigation.

In [None]:
batches = minibatch(train_data, size=5)  # this is a generator
batch = next(iter(batches))  # take only the first batch
print(batch)

So each batch is a list of (text, label) tuples, but to perform the training we need separate sequences for texts and labels, so we need to unzip it.

In [None]:
t, l = zip(*batch)  # unzip the first batch into texts and labels
print(l)
t

In [None]:
import random

random.seed(13)
spacy.util.fix_random_seed(48)

losses = {}
for epoch in range(10):
    random.shuffle(train_data)  # to avoid getting stuck in suboptimal solutions
    # Create the batch generator with batch size = 10
    batches = minibatch(train_data, size=10)
    # Iterate through minibatches
    for batch in batches:
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

Many things are going on here, I need to pause.

* `nlp.update` updates the models in the pipeline. We only have `textcat`, so that's the only model that gets updated. We have to provide an optimizer and it updates the dictionary of losses. It also allows for a dropout rate.
* The loss is somewhat mysterious. I would expect it to go down, and I struggle to find which loss function we are talking about. The optimizer has an `L2` attribute, so that would be a good candidate, but I am not sure why it would then increase with the training. **I need to search more about this.** Further research tells me that the loss is the mean squared error and that it is increasing because it is accumulated across epochs: I should be resetting it in each epoch. (More details at the end of the section).
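
The effect of not resetting the dictionary can be reproduced without spaCy. In the sketch below, `fake_update` is a hypothetical stand-in that only mimics the accumulation (adding the batch loss to the running entry), and the per-batch loss values are invented so that they shrink over the epochs.

```python
def fake_update(losses, batch_loss, name="textcat"):
    """Mimic the accumulation: add the batch loss to the running entry."""
    losses[name] = losses.get(name, 0.0) + batch_loss


# Hypothetical per-batch losses for three epochs; they decrease over time
epoch_batch_losses = [[4.0, 4.0], [2.0, 2.0], [1.0, 1.0]]

# Case 1: one dictionary shared by all epochs, so the value can only grow
shared = {}
cumulative = []
for batch_losses in epoch_batch_losses:
    for loss in batch_losses:
        fake_update(shared, loss)
    cumulative.append(shared["textcat"])

# Case 2: a fresh dictionary per epoch, so each value describes one epoch
per_epoch = []
for batch_losses in epoch_batch_losses:
    losses = {}
    for loss in batch_losses:
        fake_update(losses, loss)
    per_epoch.append(losses["textcat"])

print("shared dictionary:", cumulative)
print("reset every epoch:", per_epoch)
```

With the shared dictionary the printed number keeps growing even though every single batch loss is getting smaller, which is the behaviour that confused me above.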

The model is trained, I want to predict something with it. If I make up a sentence and ask for its classification, I need to 
* tokenize it
* extract the pipeline component I need to make the prediction
* make the prediction

In [None]:
texts = ["this is a calm tweet, shiny day",
         "everything is on fire", 
         "the party is on fire", 
         'I am bombing this test']
docs = [nlp.tokenizer(text) for text in texts]
docs

In [None]:
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

Which is surprisingly good given my expectations. 

If I want to predict on the test set, I can do the following

In [None]:
df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
test_texts = df['text'].values
docs = [nlp.tokenizer(text) for text in test_texts]
scores, _ = textcat.predict(docs)
predicted_labels = scores.argmax(axis=1)
labels = [textcat.labels[label] for label in predicted_labels]
labels[:3]

Easy enough, so I should be able to submit my first prediction, which is going to be correct 77% of the time.

In [None]:
sub = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')
sub['label'] = labels
sub['target'] = np.where(sub['label'] == 'disaster', 1, 0)
sub[['id', 'target']].to_csv('bow_sub.csv', index=False)
sub.head()

A good exercise for me is always to take what I learned and make it a bit more functional. For example, I definitely want to have a CV score when I train the model. Time to make a function or two

In [None]:
def prepare_train_data(text='text', target='target'):
    df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
    train_labels = [{'cats': {'disaster': label == 1, 'not-disaster': label == 0}} for label in df[target]]
    
    train_texts = df[text].values
    data = list(zip(train_texts, train_labels))
    
    return data


def prepare_test_data(nlp, text='text'):
    df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
    test_texts = df[text].values  # use the column name passed as argument
    docs = [nlp.tokenizer(t) for t in test_texts]
    
    return docs


def train_nlp(train_data, val_data, epochs=10, cv_loop=True):
    nlp = spacy.blank("en")
    # Create the TextCategorizer with exclusive classes and "bow" architecture
    textcat = nlp.create_pipe(
                  "textcat",
                  config={
                    "exclusive_classes": True,
                    "architecture": "bow"})
    # Add the TextCategorizer to the empty model
    nlp.add_pipe(textcat)
    # Add labels to text classifier
    textcat.add_label('disaster')
    textcat.add_label('not-disaster')
    
    spacy.util.fix_random_seed(1)
    optimizer = nlp.begin_training()
    
    #losses = {}
    for epoch in range(epochs):
        losses = {}  # I believe this is better
        random.shuffle(train_data)  # to avoid getting stuck in suboptimal solutions
        # Create the batch generator with batch size = 10
        batches = minibatch(train_data, size=10)
        # Iterate through minibatches
        for batch in batches:
            texts, labels = zip(*batch)
            nlp.update(texts, labels, sgd=optimizer, losses=losses)
        print(losses)
        
    # Use textcat to get the scores for each doc
    textcat = nlp.get_pipe('textcat')
    try:
        scores, _ = textcat.predict(val_data)
    except AttributeError:
        val_texts = [i[0] for i in val_data]
        docs = [nlp.tokenizer(text) for text in val_texts]
        scores, _ = textcat.predict(docs)
    pos_scores = scores[:, 0] # probability of being a disaster tweet
    
    if cv_loop:
        test_docs = prepare_test_data(nlp)
        # Use textcat to get the scores for each doc
        scores, _ = textcat.predict(test_docs)
        test_scores = scores[:, 0]  # probability of being a disaster tweet
        
        return pos_scores, test_scores
    
    return pos_scores


def evaluate_predictions(true_label, pred_label):
    print(f'Accuracy: \t\t {round(accuracy_score(y_true=true_label, y_pred=(pred_label>0.5).astype(int)), 4)}')
    print(f'AUC ROC score: \t\t {round(roc_auc_score(y_true=true_label, y_score=pred_label), 4)}')
    print(f'Log Loss: \t\t {round(log_loss(y_true=true_label, y_pred=pred_label), 4)}')


def cv_nlp(n_folds=5):
    df_train = np.array(prepare_train_data())
    
    kfolds = KFold(n_splits=n_folds, shuffle=True, random_state=2)
    
    oof = np.zeros(len(df_train))
    preds = None
    
    for n_fold, (train_index, test_index) in enumerate(kfolds.split(df_train)):
        
        train_set = list(df_train[train_index])
        test_set = list(df_train[test_index])
        
        print(f'Fold {n_fold}')
        oof[test_index], fold_preds = train_nlp(train_set, test_set, epochs=10, cv_loop=True)
        print('_'*40)
        print('\n')
        print('_'*40)
        
        if preds is None:
            preds = fold_preds / n_folds  # every fold contributes with weight 1/n_folds
        else:
            preds += fold_preds / n_folds
            
    true_labels = [int(i[1]['cats']['disaster']) for i in df_train]
    
    evaluate_predictions(true_labels, oof)
    df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
    tml.plot_classification_probs(data=df, true_label=df['target'], pred_label=oof)
    
    return oof, preds

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss
import tubesml as tml

oof, preds = cv_nlp(n_folds=5)

sub = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')
sub['target'] = (preds>0.5).astype(int)
sub[['id', 'target']].to_csv('bow_sub_5folds.csv', index=False)
sub.head()

In the out-of-fold prediction, we estimate an accuracy of 78% and we get 73% on the LB. It was particularly challenging to slice the train set appropriately, since it is a list of (string, dictionary) tuples, but with some index gymnastics (and missing the always convenient pandas indexing a lot) we can get the job reasonably done. The plot of the predictions vs true labels comes from the [**tubesML package**](https://pypi.org/project/tubesml/).
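
For reference, the index gymnastics can also be done with plain list comprehensions. The sketch below is only an illustration under stated assumptions: `kfold_indices` is a hypothetical, unshuffled replacement for `KFold`, `toy_predict` is a made-up stand-in for a trained model, and the data are invented. It also shows the averaging of the test predictions, where every fold must contribute with weight `1 / n_folds`.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) lists; folds are contiguous, no shuffling."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def toy_predict(train_set, texts):
    """Hypothetical model: predict the share of positives seen in training."""
    rate = sum(label["cats"]["disaster"] for _, label in train_set) / len(train_set)
    return [rate for _ in texts]


# Invented data in the same (text, {'cats': {...}}) format used for training
data = [(f"tweet {i}", {"cats": {"disaster": i % 2 == 0}}) for i in range(6)]
test_texts = ["new tweet 1", "new tweet 2"]
n_folds = 3

oof = [None] * len(data)             # out-of-fold prediction for each train row
test_pred = [0.0] * len(test_texts)  # average of the fold predictions

for train_idx, val_idx in kfold_indices(len(data), n_folds):
    train_set = [data[i] for i in train_idx]   # slice the list of tuples
    val_texts = [data[i][0] for i in val_idx]
    for i, score in zip(val_idx, toy_predict(train_set, val_texts)):
        oof[i] = score
    for j, score in enumerate(toy_predict(train_set, test_texts)):
        test_pred[j] += score / n_folds        # same weight for every fold

print(oof)
print(test_pred)
```

Starting the accumulator at zero and always adding `score / n_folds` keeps the averaged value inside the range of the individual fold predictions, so a threshold of 0.5 keeps its meaning.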

## Takeaway of this section

With a Bag of Words representation of the data, it is relatively easy to build a text classifier with a decent accuracy. This representation only accounts for the frequency of appearance of the words, not their function or context.

The more examples are shown to the model, the more unique words may appear, the more the vocabulary grows, hence leading to a very sparse matrix. A better pre-processing step would have been to clean the text a bit. For example

* By removing very common words
* By ignoring the case
* By ignoring misspelled words
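
A minimal sketch of the first two cleaning steps, assuming a tiny hand-written stopword list (`STOPWORDS` is hypothetical, a real one would come from the language data) and a whitespace tokenizer. Misspelled words are not handled here.

```python
import string
from collections import Counter

# Tiny hypothetical stopword list, only for illustration
STOPWORDS = {"the", "is", "on", "a", "and"}


def clean_tokens(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens


counts = Counter(clean_tokens("The party is ON fire, and the Fire is hot!"))
print(counts)
```

After cleaning, "fire," and "Fire" fall into the same vocabulary entry, which is exactly what keeps the vocabulary (and the resulting matrix) smaller.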

An interesting step I want to take in the future is to include all the above in a single pipeline, something I suspect is very simple given the structure of the spaCy methods.

Still a mystery is the interpretation of the `loss` that spaCy outputs during training. So I went to the code of the text categorizer and noticed that the update function contains the line `losses[self.name] += loss`, which explains why the loss is always increasing. I am not sure why they do that. The loss itself is computed by the `get_loss` method and it is `float(mean_square_error)`.

This highlights a mistake (I think) in the Kaggle course as well: defining `losses={}` outside the loop of the epochs leads to having this value always increased, but I would be more interested in seeing the progress of the loss over epochs, thus it must be reset at the beginning of each epoch. (I will follow up on that once I get an answer in the forum)

# Word embeddings

Following the common definition, word embeddings represent each word numerically so that the vector represents **the word meaning or its usage**. If bag of words does not consider the context of each word, these embeddings aim to do so.

Each word is represented by a vector in a defined vector space. The values of the vector components are learned based on the word usage. Therefore, we can expect seeing similar representations for words with similar meaning.
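
A common way to quantify "similar representations" is the cosine similarity between two vectors. The sketch below uses made-up 3-dimensional embeddings (real ones have hundreds of dimensions and are learned, not written by hand), so the numbers only illustrate the mechanics.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented embeddings: "fire" and "flames" point in a similar direction
embeddings = {
    "fire": [0.9, 0.1, 0.0],
    "flames": [0.8, 0.2, 0.1],
    "party": [0.0, 0.2, 0.9],
}

close = cosine_similarity(embeddings["fire"], embeddings["flames"])
far = cosine_similarity(embeddings["fire"], embeddings["party"])
print(f"fire vs flames: {close:.3f}")
print(f"fire vs party:  {far:.3f}")
```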

spaCy's larger models ship with pretrained word vectors. A popular method to learn such vectors is `Word2Vec`, which uses shallow, 2-layer neural networks trained to reconstruct the linguistic context of words. It can use either of 2 architectures:

* Continuous Bag of Words (CBoW): learns the embedding by predicting the current word based on its context. 
* Continuous Skip-gram: learns by predicting the surrounding words given the current one.
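
The difference between the two architectures is easier to see by looking at the training examples each one would build from a sentence. This is only a sketch of the example construction (no network is trained); the helper `context_windows` and the window size of 1 are assumptions made for illustration.

```python
def context_windows(tokens, window=1):
    """Yield (center, context) pairs; context = neighbours within the window."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


tokens = "I like vectors".split()

# CBoW: the context words are the input, the center word is the target
cbow_pairs = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: the center word is the input, each context word is a target
skipgram_pairs = [(center, word)
                  for center, context in context_windows(tokens)
                  for word in context]

print(cbow_pairs)
print(skipgram_pairs)
```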

In spaCy, the vectors come prebuilt in the models, but we need to pick a more complex one than the simple `en`

In [None]:
# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')
print(nlp.pipe_names)
[m for m in dir(nlp) if '__' not in m]

The pipeline already has more steps than before. According to the [documentation](https://spacy.io/models/en#en_core_web_lg), there should be more components (tok2vec, tagger, parser, ner, attribute_ruler, lemmatizer), so I am not sure what I am missing here.

As before, let's play around with it

In [None]:
text = "I like vectors because you can sum them"
text_tokens = nlp(text)
[m for m in dir(text_tokens) if '__' not in m]

Not only do I have a vector representation of the text we gave (which, as far as I understand, is the average of all the vectors representing each word)

In [None]:
text_tokens.vector[:10]

But also a 300-dimensional vector representation for each of the 8 words

In [None]:
print(np.array([token.vector for token in  text_tokens]).shape)
for token in text_tokens:
    print(f'{token.text},\t {token.has_vector},\t {token.vector_norm},\t {token.is_oov}')

Every word is in the vocabulary (`is_oov` flags words out of the vocabulary), which is not surprising as we loaded a model with 685k unique vectors.
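
The averaging mentioned earlier for the document vector is nothing more than a component-wise mean. A toy sketch with invented 2-dimensional token vectors (the helper `mean_vector` is hypothetical, and I am assuming the document vector is a plain unweighted average):

```python
def mean_vector(vectors):
    """Component-wise average of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Invented vectors for a three-token document
token_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
doc_vector = mean_vector(token_vectors)
print(doc_vector)
```

Whatever the number of tokens, the result has the same dimension as a single token vector, which is why every tweet ends up with the same number of features.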

Back to the Disaster tweets dataframe.

In [None]:
df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
doc_vectors = np.array([nlp(text).vector for text in df.text])
print(doc_vectors.shape)
doc_vectors

The simplest thing we can do now is to use these newly created features to treat our task like any other classification problem. For better visibility, we can start by creating a dataframe.

In [None]:
doc_df = pd.DataFrame(doc_vectors, columns=np.arange(0,300))
doc_df.head()

And create some CV scores to pick a model.

In [None]:
import tubesml as tml
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import lightgbm as lgb

In [None]:
models = [('Extra Tree', ExtraTreesClassifier(n_estimators=1000, n_jobs=-1, max_depth=8)),
          ('Logit', LogisticRegression(C=1, max_iter=2000)),
          ('Xgb', xgb.XGBClassifier(n_estimators=2000, n_jobs=-1, reg_alpha=0.3, learning_rate=0.05,
                                    reg_lambda=1, subsample=0.7, max_depth=4, 
                                    random_state=324,objective='binary:logistic',use_label_encoder=False, eval_metric='logloss')), 
          ('Lgb', lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05, reg_alpha=0.3, reg_lambda=1, subsample=0.7, n_jobs=-1))]

kfolds = KFold(n_splits=5, random_state=235, shuffle=True)

oof_res = {}
for model in models:
    print(model[0])

    full_pipe = Pipeline([('scaler', tml.DfScaler()), model])

    if 'gb' not in model[0]:
        oof = tml.cv_score(data=doc_df, target=df.target, estimator=full_pipe, cv=kfolds, predict_proba=True)
    else:
        oof = tml.cv_score(data=doc_df, target=df.target, estimator=full_pipe, cv=kfolds, predict_proba=True, 
                           early_stopping=100, eval_metric='logloss')
        
    oof_res[model[0]] = oof

    tml.eval_classification(doc_df, df.target, oof, plot=1, proba=True)
    
    print('_'*40)
    print('_'*40)

Which is a bit higher than the score we got in the previous section, and indeed it scores 80.8% on the public LB, improving by 4% on our previous attempt with `textcat`.

In [None]:
df_test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
doc_vectors = np.array([nlp(text).vector for text in df_test.text])
doc_df_test = pd.DataFrame(doc_vectors, columns=np.arange(0,300))
full_pipe = Pipeline([('scaler', tml.DfScaler()), 
                      ('Lgb', lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, 
                                                 reg_alpha=0.3, reg_lambda=1, subsample=0.7, n_jobs=-1))])
full_pipe.fit(doc_df, df.target)
preds = full_pipe.predict(doc_df_test)

sub = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')
sub['target'] = preds
sub[['id', 'target']].to_csv('word_embeddings_lgb.csv', index=False)
sub.head()