# <p style="background-color:skyblue; font-family:newtimeroman; font-size:150%; text-align:center">Training Transformers with Custom Data </p> 
Hugging Face Transformers is a library of pretrained models for tasks such as sentiment analysis and question answering.  
Some more examples are given in [this](https://www.kaggle.com/kabirnagpal/vaccine-tweet-analysis-with-hugging-face) notebook.  
However, you can also take the same pretrained architecture and fine-tune it on your own dataset.  
The library supports both TensorFlow and PyTorch and is easy to use.  
In this notebook I've used TensorFlow to predict the rating a user gives based on their review.  
You can refer to these for more example code:
1. [GitHub](https://github.com/katanaml/sample-apps/blob/master/02/sentiment-fine-tuning-huggingface.ipynb)
2. [TF Docs](https://www.tensorflow.org/tutorials/text/classify_text_with_bert)

# <p style="background-color:skyblue; font-family:newtimeroman; font-size:150%; text-align:center">Importing Packages </p>

In [None]:
import pandas as pd
import tensorflow as tf
!pip install contractions
import contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import re
import transformers
import warnings
warnings.filterwarnings("ignore")
lem = WordNetLemmatizer()

# <p style="background-color:skyblue; font-family:newtimeroman; font-size:150%; text-align:center">Loading Data</p>

In [None]:
data = pd.read_csv("../input/flipkart-customer-review-and-rating/data.csv")
data.head()

# <p style="background-color:skyblue; font-family:newtimeroman; font-size:150%; text-align:center">Preprocessing </p>

Preprocessing is a necessary step, as it helps remove errors, emojis and other unnecessary words or symbols.  
I've collected the most commonly used preprocessing methods in a separate notebook:  
[[Tips] List of preprocessing techniques in NLP](https://www.kaggle.com/kabirnagpal/tips-list-of-preprocessing-techniques-in-nlp)

In [None]:
data['review'][0]

In [None]:
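# Build the stopword set once instead of once per word.
# Note: NLTK's English list includes negations such as "not" and "no".
STOPWORDS = set(stopwords.words('english'))

def preprocess(x):
    x = x.replace("READ MORE", "")             # drop the "READ MORE" marker present in the reviews
    x = x.encode('ascii', 'ignore').decode()   # drop emojis and other non-ASCII characters
    x = x.lower()
    x = contractions.fix(x)                    # expand contractions, e.g. "don't" -> "do not"
    x = ' '.join(word for word in x.split() if word not in STOPWORDS)
    x = re.sub('[^a-zA-Z0-9]', ' ', x)         # replace punctuation and symbols with spaces
    # Lemmatize word by word; passing the whole string would treat it as a single word.
    x = ' '.join(lem.lemmatize(word) for word in x.split())
    return x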

In [None]:
data['review'] = data['review'].apply(preprocess)

In [None]:
data['review'][0]

# <p style="background-color:skyblue; font-family:newtimeroman; font-size:150%; text-align:center">Train Test Split</p>

In [None]:
from sklearn.model_selection import train_test_split
X = list(data['review'].values)
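# One-hot encode the ratings; cast to float32 because newer pandas versions return booleans.
y = pd.get_dummies(data['rating']).values.astype('float32')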

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
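
To make the label format concrete, here is a small dependency-free sketch of what a one-hot encoding like the one from `pd.get_dummies` looks like. The `one_hot` helper and the sample ratings are made up for illustration and are not part of the notebook's pipeline.

```python
def one_hot(labels):
    """Return (classes, rows): the sorted distinct labels and one 0/1 row per input label."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = [[1 if index[label] == j else 0 for j in range(len(classes))]
            for label in labels]
    return classes, rows

ratings = [5, 3, 1, 5, 4, 2]  # made-up ratings, not taken from the dataset
classes, rows = one_hot(ratings)
print(classes)   # [1, 2, 3, 4, 5]
print(rows[0])   # [0, 0, 0, 0, 1]
```

Column `i` stands for the `i`-th smallest rating present in the data, so column index and rating only line up as index + 1 when all five ratings occur in the dataset.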

# <p style="background-color:skyblue; font-family:newtimeroman; font-size:150%; text-align:center">Tokenising with Hugging Face</p> 
Tokenising converts text into integer token IDs.  
It is highly recommended to use the tokenizer that ships with the pretrained model rather than numbering words yourself, because  
the model was pretrained on a huge dataset with exactly this vocabulary, and its learned embeddings are tied to these IDs.  
Hence the embeddings of "good" and "awesome" tend to lie closer together than those of "good" and "bad".  
I've used a maximum length of 30 tokens only; however, it's a hyperparameter and can be tuned later.  
Along with the token IDs, the tokenizer also provides an attention mask, which tells the attention layers which positions are real tokens and which are padding.  
(Don't worry, you'll understand this while learning about transformers.)
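
As a rough picture of what truncation, padding and the mask mean, here is a toy sketch in plain Python. It is not how the Hugging Face tokenizer is implemented: `pad_and_mask` is a hypothetical helper, the token ids are made up, and it always pads to a fixed length and truncates by simple slicing, whereas the real tokenizer also takes care of special tokens.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Made-up ids for a short review; real ids come from the tokenizer's vocabulary.
input_ids, attention_mask = pad_and_mask([101, 2204, 4031, 102], max_length=6)
print(input_ids)       # [101, 2204, 4031, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

The model receives both lists, and the mask lets attention ignore the padded positions.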

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

X_train = tokenizer(X_train, truncation=True, padding=True, max_length=30)
X_test = tokenizer(X_test, truncation=True, padding=True, max_length=30)
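
Since `max_length` is a hyperparameter, one rough way to pick a starting value is to look at how long the reviews actually are. The sketch below is only a heuristic under my own assumptions: `length_at_percentile` is a hypothetical helper, the sample reviews are made up, and it counts whitespace-separated words, which underestimates the token count because the tokenizer splits rare words into pieces and adds special tokens.

```python
import math

def length_at_percentile(texts, percentile=95):
    """Word count that covers `percentile` percent of the texts (nearest-rank method)."""
    lengths = sorted(len(t.split()) for t in texts)
    rank = max(1, math.ceil(percentile / 100 * len(lengths)))
    return lengths[rank - 1]

# Made-up reviews, only to show the call.
sample_reviews = [
    "good product",
    "works fine battery lasts long",
    "bad",
    "value for money worth buying again",
]
print(length_at_percentile(sample_reviews, 75))  # 5
```

Running this on the training reviews and adding a few tokens of headroom gives a candidate value to compare against 30.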

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(X_train),
    y_train
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(X_test),
    y_test
))

# <p style="background-color:skyblue; font-family:newtimeroman; font-size:150%; text-align:center">DistilBERT Model</p> 

In [None]:
from transformers import TFDistilBertForSequenceClassification


model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=5)


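# Plain SGD with a learning rate of 1e-4.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4)
# The model returns raw logits, so the loss applies softmax itself via from_logits=True.
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
# shuffle() takes a buffer size (not a seed); using the training-set size shuffles fully.
# The datasets are already batched, so fit() needs no batch_size.
model.fit(train_dataset.shuffle(len(y_train)).batch(512),
          epochs=5,
          validation_data=val_dataset.batch(512))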
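
The sequence-classification head returns one raw score (a logit) per class rather than probabilities, so the cross-entropy has to apply a softmax before taking the log; in Keras this is what `from_logits=True` asks for. The following is a minimal plain-Python sketch of that computation for a single example. The function names and the numbers are my own and only illustrate the formula.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, one_hot_target):
    """Categorical cross-entropy for one example: -sum(target * log(softmax(logits)))."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t)

# With all logits equal, every class gets probability 1/5.
uniform = cross_entropy_from_logits([0.0] * 5, [0, 0, 0, 0, 1])
print(round(uniform, 4))  # 1.6094, i.e. log(5)
```

A loss near log(5) therefore means the model is doing no better than guessing uniformly over the five ratings.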

Many of the hyperparameters here can be tuned to improve the score;  
however, the aim of this notebook was to get you familiar with the method.  
**num_labels** is the number of classes we have, i.e. 5 (ratings 1 to 5).  
You can also use other pretrained models listed [here](https://huggingface.co/models).  
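
Once the model has produced a row of five logits for a review, turning it back into a rating is just a matter of picking the largest one. The sketch below assumes the one-hot columns are ordered as ratings 1 to 5; `predicted_rating` is a hypothetical helper and the logits are made up.

```python
def predicted_rating(logits, classes=(1, 2, 3, 4, 5)):
    """Return the class whose logit is largest (argmax)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return classes[best]

example_logits = [-1.2, -0.3, 0.1, 2.4, 0.9]  # made-up model output for one review
print(predicted_rating(example_logits))  # 4
```
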
## Happy Learning