## Overview

KerasNLP makes it easy to build simple model pipelines quickly. In this guide, we build a simple text classification pipeline from scratch, covering data loading, model building, and custom preprocessing.

## Imports & setup

This tutorial requires you to have KerasNLP installed:

```shell
pip install keras-nlp
```

We begin by importing all required packages:

In [None]:
import numpy as np
import pandas as pd
import keras_nlp
import tensorflow as tf
from tensorflow import keras

## Data loading

This guide uses the
[Quora Insincere Questions Classification Dataset](https://www.kaggle.com/competitions/quora-insincere-questions-classification/data)
for demonstration purposes.

To get started, we first load the dataset:


In [None]:
!wget https://storage.googleapis.com/kerascvnlp_data/train.csv

--2023-07-06 13:08:17--  https://storage.googleapis.com/kerascvnlp_data/train.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.79.128, 108.177.119.128, 108.177.126.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.79.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 124206772 (118M) [text/csv]
Saving to: ‘train.csv’


2023-07-06 13:08:22 (22.5 MB/s) - ‘train.csv’ saved [124206772/124206772]



In [None]:
df = pd.read_csv('train.csv')
df

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |
| ... | ... | ... | ... |
| 1306117 | ffffcc4e2331aaf1e41e | What other technical skills do you need as a c... | 0 |
| 1306118 | ffffd431801e5a2f4861 | Does MS in ECE have good job prospects in USA ... | 0 |
| 1306119 | ffffd48fb36b63db010c | Is foam insulation toxic? | 0 |
| 1306120 | ffffec519fa37cf60c78 | How can one start a research project based on ... | 0 |


In [None]:
text = df['question_text'].tolist()
target = df['target'].tolist()
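
Before training on a slice of these lists, it can help to check how the labels are distributed and to draw the subset at random instead of taking the first rows. The following is a minimal sketch using only the standard library; the short `text` and `target` lists are hypothetical stand-ins for the real data, and the subset size of 4 is chosen purely for illustration.

```python
import random
from collections import Counter

# Hypothetical stand-ins for the `text` and `target` lists built above.
text = ["q0", "q1", "q2", "q3", "q4", "q5"]
target = [0, 0, 0, 1, 0, 0]

# Count how many examples fall into each class.
counts = Counter(target)
total = len(target)
for label, n in sorted(counts.items()):
    print(f"class {label}: {n} examples ({n / total:.1%})")

# Draw a reproducible random subset instead of slicing the first rows.
rng = random.Random(0)  # fixed seed so the subset is the same on every run
indices = rng.sample(range(total), k=4)
subset_text = [text[i] for i in indices]
subset_target = [target[i] for i in indices]
```

With the real data, `k` would be the number of training examples you want, for instance 10000.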

## Model Building

We use the pretrained `RobertaClassifier` from KerasNLP to build a simple text classifier.

In [None]:
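# Load a pretrained RoBERTa backbone with a new 2-class classification head.
classifier = keras_nlp.models.RobertaClassifier.from_preset(
    "roberta_base_en",
    num_classes=2,
)
# Freeze the backbone so that only the classification head is trained.
classifier.backbone.trainable = False

# Train on the first 10,000 examples; the classifier tokenizes raw strings itself.
history = classifier.fit(x=text[:10000], y=target[:10000], verbose=1, epochs=1, batch_size=16)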



In [None]:
print(text[0])

How did Quebec nationalists see their province as a nation in the 1960s?


In [None]:
classifier.predict([text[0]])



array([[ 1.8242502, -1.719076 ]], dtype=float32)
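
The two numbers above look like unnormalized logits, one per class (the negative value rules out probabilities). As a minimal sketch under that assumption, a softmax turns them into class probabilities, and an argmax picks the predicted label. The `softmax` helper is written here for illustration and is not part of KerasNLP.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert logits to probabilities along the last axis."""
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# The prediction shown above, copied in by hand.
logits = np.array([[1.8242502, -1.719076]], dtype=np.float32)

probs = softmax(logits)
predicted = probs.argmax(axis=-1)

print(probs)      # one row of class probabilities that sums to 1
print(predicted)  # index of the most likely class for each input
```

For this example the model puts roughly 97% of the probability on class 0, which matches the `target` value of 0 shown for this question in the table above.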

## Custom Preprocessing

We can also use KerasNLP's preprocessing utilities directly. We start with the `RobertaTokenizer` to tokenize the text, then use the `MultiSegmentPacker` to pack the tokens into fixed-length model inputs.

In [None]:
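tokenizer = keras_nlp.models.RobertaTokenizer.from_preset("roberta_base_en")

# In RoBERTa's vocabulary, ID 0 is the start token, not padding, so pass
# the tokenizer's pad token ID to the packer explicitly.
packer = keras_nlp.layers.MultiSegmentPacker(
    start_value=tokenizer.start_token_id,
    end_value=tokenizer.end_token_id,
    pad_value=tokenizer.pad_token_id,
    sequence_length=64,
)

# The packer also returns segment IDs, but the RoBERTa backbone only takes
# `token_ids` and `padding_mask`, so the segment IDs are dropped here.
token_ids, _ = packer(tokenizer(text[:10000]))
x = {
    "token_ids": token_ids,
    # True for real tokens, False for padding.
    "padding_mask": token_ids != tokenizer.pad_token_id,
}

y = target[:10000]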
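
To make the packing step concrete, here is a minimal pure-Python sketch of what packing a single segment amounts to: wrap the token IDs in start and end tokens, truncate to a fixed length, pad the remainder, and build a matching padding mask. The function `pack_single_segment` and the example token IDs are hypothetical. The special-token IDs (0 for start, 1 for padding, 2 for end) follow RoBERTa's usual vocabulary layout and are hard-coded here as an assumption; this is not KerasNLP's implementation.

```python
from typing import List, Tuple

# Assumed special-token IDs (RoBERTa's usual layout), hard-coded for illustration.
START_ID, PAD_ID, END_ID = 0, 1, 2

def pack_single_segment(
    ids: List[int], sequence_length: int
) -> Tuple[List[int], List[bool]]:
    """Wrap ids in start/end tokens, truncate, and pad to a fixed length."""
    # Reserve two positions for the start and end tokens.
    body = ids[: sequence_length - 2]
    packed = [START_ID] + body + [END_ID]
    n_real = len(packed)
    # Fill the remaining positions with the padding ID.
    packed = packed + [PAD_ID] * (sequence_length - n_real)
    # The mask is True for real tokens and False for padding.
    mask = [i < n_real for i in range(sequence_length)]
    return packed, mask

example_ids, example_mask = pack_single_segment([713, 16, 10, 1296], sequence_length=8)
print(example_ids)   # [0, 713, 16, 10, 1296, 2, 1, 1]
print(example_mask)  # [True, True, True, True, True, True, False, False]
```

Because the start token has ID 0 in this layout, the mask has to be derived from the padding ID (or from the token count, as here) and not from a comparison with 0.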

In [None]:
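# `preprocessor=None` disables the built-in preprocessing, so the model
# expects already tokenized and packed inputs.
classifier = keras_nlp.models.RobertaClassifier.from_preset(
    "roberta_base_en",
    preprocessor=None,
    num_classes=2,
)
# Freeze the backbone so that only the classification head is trained.
classifier.backbone.trainable = False

classifier.fit(x, y, verbose=1, epochs=1, batch_size=16)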

<keras.callbacks.History at 0x7fe39444aa40>