# Text Classification with HuggingFace & ktrain

In this notebook, we'll perform text classification on the [NY Room Rental Ads](https://www.kaggle.com/vaishnavivenkatesan/newyork-room-rentalads) dataset with **HuggingFace Transformer models** using **ktrain**.

**ktrain** is a Python library that makes deep learning and AI more accessible and easier to apply.


We'll fine-tune the following pre-trained Transformer models and compare their accuracy:

* roberta-base
* bert-base-uncased
* distilbert-base-uncased
* xlm-roberta-base


**Note**: 0 = Vague & 1 = Not Vague


As always, I'll keep this notebook well commented & organized for easy reading. Please do UPVOTE if you find it helpful :)

## Libraries

In [None]:
# Install ktrain
!pip install --upgrade pip -q
!pip install -q ktrain

In [None]:
# Generic
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings, gc
warnings.filterwarnings("ignore")

# Tensorflow
import tensorflow as tf

# ktrain
import ktrain
from ktrain import text

# sklearn
from sklearn.model_selection import train_test_split

## Data

In [None]:
# Load
url = '../input/newyork-room-rentalads/room-rental-ads.csv'
df = pd.read_csv(url, header='infer')

# Dropping Null Values
df.dropna(inplace=True)

# Total Records
print("Total Records: ", df.shape[0])

# Inspect
df.head()

In [None]:
# Data Split
target = ['Vague/Not']
data = ['Description']

X = df[data]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.1, random_state=42)
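
Before reading the accuracy figures later in the notebook, it helps to know what a trivial classifier would score on this split. The sketch below computes the majority-class baseline, i.e. the accuracy of always predicting the most frequent label. The helper name `majority_baseline` and the toy labels are illustrative assumptions, not part of the dataset or of ktrain.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical labels for illustration only (0 = Vague, 1 = Not Vague)
toy_labels = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
label, baseline = majority_baseline(toy_labels)
print(f"Majority class: {label}, baseline accuracy: {baseline:.2f}")
```

On the real data, the same helper could be called with `y_train['Vague/Not'].tolist()`. A model whose accuracy sits at or below this baseline has learned little beyond the class prior.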

## Parameters

In [None]:
# Common Parameters
max_len = 500
batch_size = 6
learning_rate = 5e-5
epochs = 1

## With Transformer = roberta-base

In [None]:
# Transformer Model
model_ = 'roberta-base'
t_mod = text.Transformer(model_, maxlen=max_len, classes = [0,1])


# Convert the split data to lists so ktrain can process them
#train
X_tr = X_train['Description'].tolist()
y_tr = y_train['Vague/Not'].tolist()

#test
X_ts = X_test['Description'].tolist()
y_ts = y_test['Vague/Not'].tolist()


# Pre-processing training & test data
train = t_mod.preprocess_train(X_tr,y_tr)
# Use preprocess_test for the held-out split
test = t_mod.preprocess_test(X_ts,y_ts)

# Model Classifier
model = t_mod.get_classifier()

learner = ktrain.get_learner(model, train_data=train, val_data=test, batch_size=batch_size)

In [None]:
# Train Model
learner.fit_onecycle(learning_rate, epochs)
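
`fit_onecycle` trains with a one-cycle learning-rate policy, where `learning_rate` acts as the peak value. As a rough intuition only, the sketch below implements a simplified triangular schedule: linear warm-up to the peak at the midpoint, then linear decay. The function `one_cycle_lr` and the `min_lr_ratio` parameter are assumptions made for illustration, and ktrain's actual schedule may differ in its details.

```python
def one_cycle_lr(step, total_steps, max_lr, min_lr_ratio=0.1):
    """Simplified triangular schedule: warm up to max_lr at the midpoint, then decay."""
    min_lr = max_lr * min_lr_ratio  # hypothetical floor, not a ktrain setting
    half = total_steps / 2
    if step <= half:
        frac = step / half                   # rising phase
    else:
        frac = (total_steps - step) / half   # falling phase
    return min_lr + (max_lr - min_lr) * frac

max_lr = 5e-5       # same peak value as learning_rate in this notebook
total_steps = 10    # toy number of steps for illustration
schedule = [one_cycle_lr(s, total_steps, max_lr) for s in range(total_steps + 1)]
for s, lr in enumerate(schedule):
    print(f"step {s:2d}: lr = {lr:.2e}")
```

The printed values rise to the peak at the middle step and return to the starting value at the end, which is the shape that lets a single epoch use a relatively high peak rate without starting or finishing there.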

In [None]:
# Evaluate
x = learner.validate(class_names=t_mod.get_classes())

### Accuracy of ~77% achieved with RoBERTa
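
An accuracy figure like this one can be recomputed from a confusion matrix as the sum of the diagonal divided by the total count. The sketch below assumes a square matrix given as nested lists (a NumPy array indexes the same way). The helper `accuracy_from_confusion` and the toy counts are hypothetical and are not the notebook's actual results.

```python
def accuracy_from_confusion(cm):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Toy confusion matrix for illustration only: rows = true class, columns = predicted
toy_cm = [[30, 10],
          [5, 55]]
acc = accuracy_from_confusion(toy_cm)
print(f"Accuracy: {acc:.2%}")
```

If `x` from the evaluation cell holds the confusion matrix, `accuracy_from_confusion(x)` should give the same accuracy that the printed classification report shows.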

In [None]:
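# Prediction
classes = ['Vague', 'Not Vague']
predictor = ktrain.get_predictor(learner.model, preproc=t_mod)
# Positional indexing: a given row label is not guaranteed to be in the test split
pred_class = predictor.predict(X_test['Description'].iloc[0])
# Cast to int in case the predicted label comes back as a string
print("Predicted Class: ", classes[int(pred_class)])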

## With Transformer = bert-base-uncased

In [None]:
# Transformer Model
model_ = 'bert-base-uncased'
t_mod = text.Transformer(model_, maxlen=max_len, classes = [0,1])


# Convert the split data to lists so ktrain can process them
#train
X_tr = X_train['Description'].tolist()
y_tr = y_train['Vague/Not'].tolist()

#test
X_ts = X_test['Description'].tolist()
y_ts = y_test['Vague/Not'].tolist()


# Pre-processing training & test data
train = t_mod.preprocess_train(X_tr,y_tr)
test = t_mod.preprocess_test(X_ts,y_ts)

# Model Classifier
model = t_mod.get_classifier()

learner = ktrain.get_learner(model, train_data=train, val_data=test, batch_size=batch_size)

In [None]:
# Train Model
learner.fit_onecycle(learning_rate, epochs)

In [None]:
# Evaluate
x = learner.validate(class_names=t_mod.get_classes())

### Accuracy of ~81% achieved with BERT

## With Transformer = distilbert-base-uncased

In [None]:
# Transformer Model
model_ = 'distilbert-base-uncased'
t_mod = text.Transformer(model_, maxlen=max_len, classes = [0,1])


# Convert the split data to lists so ktrain can process them
#train
X_tr = X_train['Description'].tolist()
y_tr = y_train['Vague/Not'].tolist()

#test
X_ts = X_test['Description'].tolist()
y_ts = y_test['Vague/Not'].tolist()


# Pre-processing training & test data
train = t_mod.preprocess_train(X_tr,y_tr)
test = t_mod.preprocess_test(X_ts,y_ts)

# Model Classifier
model = t_mod.get_classifier()

learner = ktrain.get_learner(model, train_data=train, val_data=test, batch_size=batch_size)

In [None]:
# Train Model
learner.fit_onecycle(learning_rate, epochs)

In [None]:
# Evaluate
x = learner.validate(class_names=t_mod.get_classes())

### Accuracy of ~71% achieved with DistilBERT

## With Transformer = xlm-roberta-base

In [None]:
# Transformer Model
model_ = 'xlm-roberta-base'
t_mod = text.Transformer(model_, maxlen=max_len, classes = [0,1])


# Convert the split data to lists so ktrain can process them
#train
X_tr = X_train['Description'].tolist()
y_tr = y_train['Vague/Not'].tolist()

#test
X_ts = X_test['Description'].tolist()
y_ts = y_test['Vague/Not'].tolist()


# Pre-processing training & test data
train = t_mod.preprocess_train(X_tr,y_tr)
test = t_mod.preprocess_test(X_ts,y_ts)

# Model Classifier
model = t_mod.get_classifier()

learner = ktrain.get_learner(model, train_data=train, val_data=test, batch_size=batch_size)

In [None]:
# Train Model
learner.fit_onecycle(learning_rate, epochs)

In [None]:
# Evaluate
x = learner.validate(class_names=t_mod.get_classes())

### Accuracy of ~45% achieved with XLM-RoBERTa
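
To compare the four runs at a glance, the approximate accuracies reported in this notebook can be collected and ranked. The sketch below hard-codes those rounded figures, so it only summarizes what was already reported. The variable names are illustrative.

```python
# Approximate accuracies (in %) as reported in this notebook
reported = {
    "roberta-base": 77,
    "bert-base-uncased": 81,
    "distilbert-base-uncased": 71,
    "xlm-roberta-base": 45,
}

# Rank models from highest to lowest reported accuracy
ranked = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<26} ~{acc}%")

best_model = ranked[0][0]
print("Best in this run:", best_model)
```

These numbers come from a single epoch on a 10% hold-out split, so the ranking is best read as indicative rather than conclusive. More epochs or a different random seed could change the order.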

In [None]:
# Garbage Collect
gc.collect()

### Hope that was helpful & gave you an idea of ktrain & using pre-trained HuggingFace Models.