In [1]:
import numpy as np 
import pandas as pd
import spacy
from spacy.util import minibatch, compounding
from spacy.training import Example
import random



## 1 - Load the data

In [None]:
train_data = pd.read_csv('/content/train.csv')
test_data = pd.read_csv('/content/test.csv')

## 2 - Data Preprocessing, Exploratory Data Analysis

In [None]:
train_data.head()

In [None]:
train_data['text'] = train_data.comment_text.apply(lambda x: x.replace('\n', ' '))
test_data['text'] = test_data.comment_text.apply(lambda x: x.replace('\n', ' '))

The next cell converts the training data into the format spaCy expects for text classification.

`cats` lists the six labels the model will learn to predict, and `train_prepared_data` starts as an empty list that collects the converted examples.

`format_text_spacy` takes one row of `train_data` (despite the parameter name `text`, it receives a whole row, not a string). It returns a tuple of the cleaned comment text and a dictionary of the form `{'cats': {label: value}}` holding that row's value for each label.

The `for` loop applies `format_text_spacy` to every row of `train_data` and appends the resulting tuple to `train_prepared_data`.

In [None]:
# Labels the classifier will learn to predict
cats = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train_prepared_data = []

def format_text_spacy(text):
    # `text` is one row of train_data; return (comment text, {'cats': {label: value}})
    return (text.text, {'cats': {cat: text[cat] for cat in cats}})

for i in range(len(train_data)):
    text = train_data.iloc[i]
    train_prepared_data.append(format_text_spacy(text))

In [None]:
train_prepared_data[0:5]
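To make the target format concrete, here is a minimal sketch of the same conversion applied to a single hand-written row. The row is a plain dict standing in for one row of `train_data`; its text and label values are made up for illustration, and `format_row` is a hypothetical helper name.

```python
cats = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical row: a plain dict standing in for one row of train_data.
row = {'text': 'example comment', 'toxic': 1, 'severe_toxic': 0, 'obscene': 0,
       'threat': 0, 'insult': 1, 'identity_hate': 0}

def format_row(row):
    """Return (text, {'cats': {label: value}}), the tuple shape built above."""
    return (row['text'], {'cats': {cat: row[cat] for cat in cats}})

text, annotations = format_row(row)
print(text)
print(annotations)
```

Each element of `train_prepared_data` should have this shape: the text first, then a dictionary whose `'cats'` entry maps every label to its value.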

The next cell uses spaCy to create a blank English pipeline, `nlp`, and adds a text classification component to it.

The component is `textcat_multilabel`, which performs multi-label classification: a single comment can be assigned several labels at once, or none at all. The component is stored in the variable `textcat`.

The six labels are then registered with `add_label()`: "toxic", "severe_toxic", "obscene", "threat", "insult", and "identity_hate". These are the categories the model will try to predict for a given input text.

The result is a custom, untrained pipeline that contains only the text classifier and its six labels.

In [None]:
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
textcat.add_label("toxic")
textcat.add_label("severe_toxic")
textcat.add_label("obscene")
textcat.add_label("threat")
textcat.add_label("insult")
textcat.add_label("identity_hate")
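The difference between multi-label and single-label scoring can be illustrated with a small sketch in plain Python. It is not spaCy's internal implementation, and the raw scores below are made up: it only shows that independent per-label scores can all be high at once, while mutually exclusive scores compete and sum to 1.

```python
import math

# Hypothetical raw scores (logits) for three of the six labels.
logits = {'toxic': 2.0, 'obscene': 1.5, 'threat': -3.0}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Multi-label: each label is scored on its own, so several can be high at once.
multilabel = {label: sigmoid(score) for label, score in logits.items()}

# Single-label (mutually exclusive): scores compete and sum to 1.
total = sum(math.exp(score) for score in logits.values())
exclusive = {label: math.exp(score) / total for label, score in logits.items()}

print(multilabel)
print(exclusive)
```

For toxic-comment data, where one comment is often both "toxic" and "insult", the independent scoring is the one that matches the task.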

The next cell trains the text classifier. The component being trained is `textcat_multilabel`: it is the only pipe left enabled, because every other pipe is disabled for the duration of training. Batch sizes come from the `compounding` schedule, which starts at 4 and grows by a factor of 1.001 per batch up to a maximum of 32.

The code first builds `other_pipes`, a list of every pipeline component except the text classifier, and disables them with `nlp.disable_pipes(*other_pipes)` so that only the text classifier is updated. In this blank pipeline the classifier is the only component, so the list is empty and the step acts as a safeguard.

`nlp.begin_training()` then initializes the model and returns an optimizer with default settings. The optimizer is passed to `nlp.update` through the `sgd` argument; despite that name, spaCy v3's default optimizer is Adam rather than plain stochastic gradient descent.

The model is trained for 10 epochs on the first 10,000 examples of `train_prepared_data`. Within each epoch, the data is split into batches whose sizes follow the `compounding` schedule. Each batch is converted to `Example` objects and passed to `nlp.update` together with the optimizer, a dropout rate of 0.2, and the `losses` dictionary, which accumulates the training loss.

At the end of each epoch, the code prints the epoch number and the accumulated training loss.

In [None]:
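# Every pipe except the classifier (an empty list for this blank pipeline)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat_multilabel']
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()  # initialize the model and get the default optimizer
    print("Training the model...")
    for epoch in range(10):
        losses = {}  # accumulated over all batches of this epoch
        # First 10,000 examples only; batch size grows from 4 towards 32
        batches = minibatch(train_prepared_data[0:10000], size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            # Wrap each (text, annotations) pair in an Example for nlp.update
            examples = [Example.from_dict(nlp.make_doc(text), annot) for text, annot in batch]
            nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)
        print("Epoch: {} Loss: {}".format(epoch+1, losses))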
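To see what the batch-size schedule does in practice, the following sketch re-implements the idea behind `compounding` and `minibatch` in plain Python. The two `*_sketch` functions are hypothetical stand-ins written for illustration, not spaCy's own code, and the sketch assumes the start value is below the stop value.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def minibatch_sketch(items, size):
    """Split items into consecutive batches whose lengths follow the schedule."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(size))))
        if not batch:
            break
        yield batch

# Same schedule and example count as the training cell above.
schedule = compounding_sketch(4.0, 32.0, 1.001)
batch_lengths = [len(batch) for batch in minibatch_sketch(range(10000), schedule)]

print("number of batches:", len(batch_lengths))
print("first batch size:", batch_lengths[0])
print("largest batch size:", max(batch_lengths))
```

With a growth factor of 1.001 the batch size rises very slowly, and because the training cell creates a fresh schedule in every epoch, each epoch starts again from a batch size of 4.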

In [None]:
test = nlp("you are ugly")

test.cats
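`test.cats` is a dictionary that maps each label to a score between 0 and 1. A minimal sketch of turning such scores into a list of predicted labels is shown below; the scores are made up rather than real model output, and the 0.5 cutoff is an assumption that should be tuned on validation data.

```python
# Hypothetical scores in the shape of a doc.cats dict; not real model output.
scores = {'toxic': 0.91, 'severe_toxic': 0.08, 'obscene': 0.35,
          'threat': 0.02, 'insult': 0.77, 'identity_hate': 0.04}

THRESHOLD = 0.5  # assumed cutoff; tune it on validation data

# Keep every label whose score reaches the cutoff.
predicted = [label for label, score in scores.items() if score >= THRESHOLD]
print(predicted)  # ['toxic', 'insult']
```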

### *Conclusion:* The model recommended to the client is the spaCy classification model.

## 4 - Saving the model
Save the DS best model in the Jupyter notebook `model.ipynb` in one of the following formats:

- `network.save('model.h5')` # Keras
- `joblib.dump(model, "model.pkl")` # optional
- `torch.save(model.state_dict(), './model.pt')` # PyTorch
- `model.save('path/to/model')`
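
None of the formats listed above is specific to spaCy. A spaCy pipeline is usually written to a directory with `nlp.to_disk()` and read back with `spacy.load()`. The sketch below assumes spaCy v3 and uses a small untrained stand-in pipeline and a temporary directory; in the notebook, the trained `nlp` object and a real output path would take their place.

```python
import tempfile
import spacy

# Stand-in pipeline; in the notebook, use the trained `nlp` object instead.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
textcat.add_label("toxic")
textcat.add_label("insult")
nlp.initialize()

# Temporary directory used only for this sketch; pick a permanent path in practice.
with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)            # write the pipeline to the directory
    reloaded = spacy.load(model_dir)  # rebuild the pipeline from that directory
    print(reloaded.pipe_names)
```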
