 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# `Transfer Learning`

* taking features learned on one problem and using them to solve some other similar problem
    * e.g. using the features from a model that was trained to identify bicycles to train a model that can identify motorcycles

* inspired by how humans learn
    * when we learn a new skill we use old knowledge to help us learn faster
    * e.g. if you want to learn how to drive a stick shift car you can use the experience you gained driving an automatic transmission car

**Why use transfer learning:**

* there is an almost limitless amount of data available today
   * BUT: raw unstructured data doesn't help us train supervised learning algorithms


* to train algorithms we need clean, properly labeled data
    * `Deep Learning` models require A LOT of data for training
    

* by leveraging what other models already learned we can reduce the amount of data we need to train a `Deep Learning` model
 

##  `Transfer Learning in Deep Learning`

* Deep Learning models are extremely well suited to inductive learning

### Two most popular strategies
<br>
    

**1.** `Using pre-trained models as feature extractors`


**2.** `Fine tuning pre-trained models`

### `Using pre-trained models as feature extractors`

* each layer of a Deep Learning model's architecture learns different features

* perform transfer learning by using a trained model and 'freezing' certain layers of it
    * this acts as feature extraction

**The procedure**

* we find a model that solves a problem similar to the one we have
    
* we "cut" out some of the layers

* we connect the remaining layers of the original model to our own model

**Simplified:**

* remove the last layer from some already trained network

* add what we want to the end of that network
  * can be a classifier, etc.

<center><img src="https://edlitera-images.s3.amazonaws.com/transfer_learning_extract_features.png" width="1500"/>

source:
<br>
https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
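
The feature-extractor idea can be illustrated with a minimal pure-Python sketch. It is a toy stand-in rather than a real pre-trained network: `FROZEN_WEIGHTS`, `frozen_features` and `train_head` are hypothetical names, the "pre-trained" weights are made up, and the data is synthetic. The only point is that the frozen part is never updated, while the small new head on top of it is trained.

```python
import math
import random

# Hypothetical "pre-trained" layer: these weights are frozen (never updated).
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_features(x):
    """Frozen part of the network: a linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def predict_proba(head, features):
    """New trainable head: logistic regression on top of the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only the head with plain stochastic gradient descent."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = frozen_features(x)
            error = predict_proba(head, feats) - y
            head["weights"] = [w - lr * error * f for w, f in zip(head["weights"], feats)]
            head["bias"] -= lr * error
    return head


# Synthetic toy task: the label is 1 when the first input is larger than the second.
random.seed(0)
samples = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(40)]
labels = [1 if a > b else 0 for a, b in samples]

frozen_before = [row[:] for row in FROZEN_WEIGHTS]
head = train_head(samples, labels)

correct = sum(
    (predict_proba(head, frozen_features(x)) > 0.5) == bool(y)
    for x, y in zip(samples, labels)
)
accuracy = correct / len(samples)
print("frozen weights unchanged:", FROZEN_WEIGHTS == frozen_before)
print("training accuracy of the new head:", accuracy)
```

In a real project the frozen part would be the layers of an actual pre-trained model and the head would be whatever the new task needs (e.g. a classifier).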

# `Transformers`

* an architecture built on attention instead of recurrence
    * designed to compensate for the shortcomings of the RNN architecture

* architecture that revolutionized how we solve sequence-to-sequence tasks 

* has mostly replaced LSTMs in a lot of tasks today

**Main reasons:**

* today GPUs and TPUs are used for Deep Learning
    * most neural networks benefit from parallel processing
    
* problem with RNNs: 
    * data needs to be fed into the network sequentially
    * this stops us from utilizing the real power of GPUs and TPUs
    * therefore, RNNs take longer to train 

* transformers don't have this problem
    * they also model dependencies better
    * they achieve better results 
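
A rough sketch of why RNNs must be processed sequentially while attention does not need to be. This is heavily simplified and only illustrative: tokens are single numbers, the function names and weights are hypothetical, and the attention has no learned projections (queries, keys and values are all the raw inputs).

```python
import math


def rnn_states(inputs, w_in=0.5, w_rec=0.8):
    """Toy scalar RNN: every state needs the previous state, so the loop is sequential."""
    states, h = [], 0.0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def attention_output(position, inputs):
    """Toy scalar self-attention for ONE position; it only needs the raw inputs."""
    query = inputs[position]
    scores = [query * key for key in inputs]
    max_score = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    return sum(w / total * value for w, value in zip(weights, inputs))


inputs = [0.1, -0.4, 0.9, 0.3]

# RNN: step 3 cannot start before steps 0, 1 and 2 are finished
sequential = rnn_states(inputs)

# Attention: positions are independent, so any order (or all at once) works
any_order = {pos: attention_output(pos, inputs) for pos in [3, 0, 2, 1]}
in_order = [attention_output(pos, inputs) for pos in range(len(inputs))]

print(sequential)
print(in_order)
```

Because no attention output depends on another output, all positions can be computed at the same time, which is the kind of work GPUs and TPUs are good at.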

## `Transformers structure`


<center><img src="https://edlitera-images.s3.amazonaws.com/transformers_structure.png" width="600" >

source:
<br>
https://arxiv.org/abs/1706.03762

* easy to notice a few new concepts (positional encoding, attention, and a special type of normalization)

* the structure is fairly complicated

* explaining these concepts in detail is beyond the scope of this class


<center><img src="https://edlitera-images.s3.amazonaws.com/transformers_structure_parts.png" width="600" >


# `Popular transformers architectures`

* nowadays there are many transformers architectures

* some are widely known, some are more obscure

* the two most famous examples are **BERT** and **GPT**

* they are essentially variants of the transformer network where only one half is used
    * **BERT** - architecture created by **stacking the encoder part** of the transformers network
    * **GPT** - architecture created by **stacking the decoder part** of the transformers network

<center><img src="https://edlitera-images.s3.amazonaws.com/BERT_and_GPT.png" width="1400" >
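
One practical difference between the two halves is which tokens each token is allowed to look at. The sketch below builds the two attention masks for a four-token input; `attention_mask` is a hypothetical helper written for this illustration, not a library function.

```python
def attention_mask(num_tokens, causal):
    """Return a matrix where mask[i][j] is 1 if token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


encoder_mask = attention_mask(4, causal=False)  # BERT-style: every token sees the whole input
decoder_mask = attention_mask(4, causal=True)   # GPT-style: a token sees itself and earlier tokens

for name, mask in [("encoder (BERT)", encoder_mask), ("decoder (GPT)", decoder_mask)]:
    print(name)
    for row in mask:
        print(" ".join(str(value) for value in row))
```

The full matrix of ones is what makes the encoder bidirectional; the lower-triangular matrix is what lets the decoder generate text one token at a time.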

* for us, BERT (and its variants) will better serve our purpose so let's explain how it works

# `BERT - Bidirectional Encoder Representations from Transformers`

* created from the encoder part of the transformers network

* trains in two phases:
    <br>
    
    * **pre-training** - model learns about language and context 
    * **fine tuning** - model learns how to solve a specific problem

## `BERT pretraining`

* learns the concepts of language and context by training on unlabeled text (in a self-supervised way) on two problems at the same time:

   <br>
   
   * **Masked Language Modeling** 
   
   * **Next Sentence Prediction** 

**Masked Language Modeling**

<br>

* the model is given sentences with certain words hidden ("masked") and needs to learn to predict those masked words
    * layman's terms: fill in the blanks in the sentence problem


* helps BERT learn how bidirectional context works in sentences
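
A minimal sketch of how a "fill in the blanks" training example can be built. It is a simplification: it splits on whitespace instead of using BERT's subword tokenizer, always replaces a chosen word with `[MASK]`, and `mask_tokens` is a hypothetical helper. BERT masks about 15% of the tokens; the example call uses a higher probability only so that a short sentence is likely to show some masks.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide random tokens; return the masked sequence and the hidden words by position."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token  # what the model has to predict
        else:
            masked.append(token)
    return masked, targets


tokens = "the movie was surprisingly good and the acting was great".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)

print(masked)   # model input
print(targets)  # positions and words the model must recover
```

To predict a hidden word the model can use the words on both sides of the blank, which is where the bidirectional context comes from.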

**Next Sentence Prediction**

* similar to a binary classification problem

* the model is given two sentences: sentence A and sentence B

* it needs to predict whether sentence B follows sentence A

* helps BERT understand the concept of context over multiple sentences 
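
The sentence pairs for this task can be sketched as follows. This is an illustrative simplification: `make_nsp_pairs` is a hypothetical helper, the sentences are made up, and the "wrong" sentence B is drawn from the same short text, whereas BERT samples it from a different document. About half of the pairs get the real next sentence (label 1), the rest get a random one (label 0).

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence A, sentence B, label) triples; label 1 means B really follows A."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # pick any sentence that is neither A itself nor the true next sentence
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 0))
    return pairs


sentences = [
    "I rented the movie last night.",
    "The acting was excellent.",
    "The ending surprised me.",
    "I would watch it again.",
]

pairs = make_nsp_pairs(sentences, seed=3)
for sentence_a, sentence_b, label in pairs:
    print(label, "|", sentence_a, "->", sentence_b)
```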

## `BERT fine tuning`

* prime example of transfer learning

* after pretraining we can fine tune BERT to solve various NLP tasks

* **very fast: most of the model knowledge comes from what it learned during pretraining**

**How it works**

* we simply add output layers to the end of the network


* depending on the layers we add, we can fine tune BERT to solve different tasks


* **the majority of the network stays the same (the pretrained part), only the last few layers change depending on what task we want to solve**

# `Creating transformers models with Hugging Face and ktrain`

* there are a lot of different models that are used for text classification


* new neural network architectures get created very often, however most of them are not publicly available

* it is common practice to use pre-trained models and modify them for your own purposes 
    * unless you can do in-house research and development in the field of Deep Learning 

* **models based on transformers typically perform best for the purposes of text classification**

* the easiest way to work with transformers, i.e. to use pretrained transformers models for your own purposes, is to use the **`HuggingFace transformers library`**

* to simplify using HuggingFace transformers, a Python library called **`ktrain`** was created
    * lightweight wrapper for `TensorFlow` and `Keras`
    * you can build standard Deep Learning models with it
    * it can easily interface with the `HuggingFace` transformers package

## `Hugging Face`

* an extremely popular Python transformers package, available for both PyTorch and TensorFlow (and Keras)

* tons of community support

* provides an easy way to work with many different transformers architectures such as:
    <br>
    
    * BERT
    * RoBERTa
    * DistilBERT
    * GPT
    * GPT-2
    * XLNet
    * T5
    * etc.
    

**These are just some of the pre-trained models that are available; there are currently over 10,000 pretrained models available on Hugging Face's model page: https://huggingface.co/models?p=0**
    

* that doesn't mean there are over 10,000 different pre-trained architectures

* the package contains a finite number of transformers architectures (essentially every popular transformers architecture)

**There is also support for training models using Amazon SageMaker!**

## `ktrain`

* a Python library inspired by libraries such as `fastai` and `ludwig`

* designed to make creating complex models as accessible as possible

* great for transfer learning purposes
    <br>
    
    * supports a lot of pretrained models that work with text data, images, graph data and even tabular data

**Why is ktrain important to us?**

* allows us to fine tune transformers extremely easily


* we can choose one of the many models offered by HuggingFace 
    * this means that we can always find a pretrained transformers model that will help us solve our task


* solves one of the biggest problems of transformers for beginners: coding them
    * transformers are incredibly complex to code and tune properly
    * without libraries such as `ktrain`, implementing and using them is nearly impossible for beginners

**Why not use `ktrain` for everything?**

* it turns a neural network model into even more of a black box model


* offering complex functionality with just a few lines of code is great, BUT...
    * we are a lot more limited in what we can actually modify
    * customization is hard

**Conclusion**

* use **`ktrain`** to fine tune transformers models
    * try to find a model on HuggingFace that will best suit your needs
        * e.g. a model which was trained on data that is as similar as possible to yours
    * or just use one of the generally good models
        * e.g. BERT or its variations
   
   
* **do not always default to transformers**
    * training transformers, even just fine tuning, takes a lot more time than training simpler RNN based networks
    * try training an RNN based network first
    * if results are bad, try using transformers

**Also keep in mind:**


* if you have computational resources available, fine tune a transformers model using ktrain _before doing anything else_


* if you get extremely bad results with transformers, your data likely isn't very good for solving your problem

    * things to try in this case: add some preprocessing steps, remove others, etc.

# `Training a transformer model with HuggingFace and ktrain`

In [1]:
import ktrain
from ktrain import text
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
# Load in our data and create a DataFrame

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/imdb_dataset.csv")

In [3]:
# Take a look at the first five rows

df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [4]:
# Shuffle data

df = df.sample(frac=1).reset_index(drop=True)

In [5]:
# Define independent feature

X = df["review"]

# Define dependent feature

y = df["sentiment"]

In [6]:
# Separate data into training data and testing data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

In [7]:
# Separate data into training data and validation data

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)
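
Because the validation split is taken from what is left after the test split, its 20% is 20% of the remaining 80%, not of the whole dataset. A quick check of the resulting proportions:

```python
test_fraction = 0.20
valid_fraction = 0.20  # applied to what remains after the test split

remaining = 1.0 - test_fraction
train_share = remaining * (1.0 - valid_fraction)
valid_share = remaining * valid_fraction
test_share = test_fraction

for name, share in [("train", train_share), ("valid", valid_share), ("test", test_share)]:
    print(f"{name}: {share:.0%}")
```

So the model is trained on 64% of the data, validated on 16% and tested on 20%.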

In [8]:
# Choose one of the available models from the HuggingFace website

MODEL_NAME = 'distilbert-base-uncased'

In [9]:
# Define the transformer model

transformer = text.Transformer(MODEL_NAME, maxlen=100, class_names=["negative", "positive"])

In [10]:
# Preprocess train and validation data

train = transformer.preprocess_train(X_train.values, y_train.values)
valid = transformer.preprocess_test(X_valid.values, y_valid.values)

preprocessing train...
language: en
train sequence lengths:
	mean : 231
	95percentile : 591
	99percentile : 908


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 230
	95percentile : 577
	99percentile : 917
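
Note that the reported mean length (about 230) is well above the `maxlen=100` chosen earlier, so many reviews are cut short before they reach the model. The sketch below shows how one could estimate the share of truncated texts for different limits. The helper names and the length values are hypothetical (not taken from the dataset), and it assumes that truncation keeps the beginning of the text.

```python
def truncate(tokens, maxlen):
    """Keep only the first `maxlen` tokens (assumed truncation behaviour)."""
    return tokens[:maxlen]


def share_truncated(lengths, maxlen):
    """Fraction of sequences that are longer than `maxlen`."""
    return sum(length > maxlen for length in lengths) / len(lengths)


# Hypothetical token counts for six reviews
lengths = [80, 120, 231, 400, 591, 908]

for maxlen in (100, 256, 512):
    print(maxlen, round(share_truncated(lengths, maxlen), 2))

print(truncate("this movie was far better than I expected".split(), 4))
```

A larger `maxlen` keeps more of each review but makes training slower and uses more memory, so the value is a trade-off.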


In [11]:
# Build a classification model from the pretrained transformer

model = transformer.get_classifier()

In [12]:
# Wrap the model and the data in a ktrain learner

learner = ktrain.get_learner(model, train_data=train, val_data=valid, batch_size=16)

In [13]:
# Train for one epoch
# Set the maximum learning rate to 2e-5

learner.fit_onecycle(2e-5, 1)



begin training using onecycle policy with max lr of 2e-05...


<keras.callbacks.History at 0x21eb7966460>
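
The general idea behind a one-cycle policy is that the learning rate first rises to a maximum and then falls again. The sketch below is a simplified triangular version for illustration only: `one_cycle_lr` is a hypothetical function, the starting rate of `max_lr / 10` and the ten steps are assumptions, and the exact schedule that `ktrain` applies may differ.

```python
def one_cycle_lr(step, total_steps, max_lr, start_lr):
    """Triangular schedule: linear warm-up to max_lr, then linear decay back to start_lr."""
    half = total_steps / 2
    if step <= half:
        fraction = step / half
    else:
        fraction = (total_steps - step) / half
    return start_lr + (max_lr - start_lr) * fraction


max_lr = 2e-5             # the value passed to fit_onecycle above
start_lr = max_lr / 10    # assumption made for this illustration
total_steps = 10          # assumption: a real epoch has many more steps

schedule = [one_cycle_lr(step, total_steps, max_lr, start_lr) for step in range(total_steps + 1)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```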

In [14]:
# Get predictor

predictor = ktrain.get_predictor(learner.model, preproc=transformer)

In [15]:
# Save predictor

predictor.save("transformers_models/Distilbert_final")

In [16]:
# Load predictor

predictor = ktrain.load_predictor("transformers_models/Distilbert_final")

In [20]:
# Take a look at an example

X_test.tolist()[1]

'Initially I was put off renting this movie due to the jacket art for the DVD. In fact, this held true with friends of mine who didn\'t rent it due to the art and the mental image(s) it conjured of being a movie that held little or no interest to me (or to my friends). But, I rented and watched it and was truly amazed.<br /><br />I agree with another user\'s comments that this movie is not for everyone due to the blatant sexual inferences, so it is definitely not something I\'d want young children to watch (and doubt seriously if they would understand it anyway).<br /><br />I enjoy movies like this whereby the character\'s personalities and who they are are genuinely defined in a no-nonsense, direct way with no teasers to indicate they will turn out bad. The acting done ... was it acting? Ricci and Jackson performed so well, I was drawn into this movie not even realizing they were acting. Same thing with the story ... may seem far-fetched somewhat, but it was done so very, very well. I

In [21]:
# Make prediction for an example

predictor.predict(X_test.tolist()[1])

'positive'

In [22]:
# Make predictions for a small sample

y_pred = predictor.predict(X_test[:100].tolist())

In [23]:
# Convert predictions from strings to integers
# "positive" gets converted into 1, "negative" into 0

label_to_int = {"negative": 0, "positive": 1}
true_predictions = [label_to_int[prediction] for prediction in y_pred]

In [24]:
# Print classification report for all examples
# (only valid if y_pred was computed on all of X_test, not just the first 100)

# print(classification_report(y_test, true_predictions))

# Print classification report for the small sample

y_test_sample = y_test[:100]

print(classification_report(y_test_sample, true_predictions))



              precision    recall  f1-score   support

           0       0.91      0.87      0.89        47
           1       0.89      0.92      0.91        53

    accuracy                           0.90       100
   macro avg       0.90      0.90      0.90       100
weighted avg       0.90      0.90      0.90       100
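
To see where the numbers in the report come from, the metrics can be recomputed by hand. The four counts below are not printed by the notebook; they are inferred as the counts consistent with the report (47 negative and 53 positive reviews, 90 of 100 predicted correctly). `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(true_positive, false_positive, false_negative):
    """Precision, recall and F1 for one class, treating that class as the 'positive' one."""
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts (rows are the true classes):
# true 0 -> predicted 0: 41, predicted 1: 6
# true 1 -> predicted 0: 4,  predicted 1: 49
negative_metrics = per_class_metrics(true_positive=41, false_positive=4, false_negative=6)
positive_metrics = per_class_metrics(true_positive=49, false_positive=6, false_negative=4)
accuracy = (41 + 49) / 100

print("class 0:", [round(value, 2) for value in negative_metrics])
print("class 1:", [round(value, 2) for value in positive_metrics])
print("accuracy:", accuracy)
```

Precision answers "of everything predicted as this class, how much was right", recall answers "of everything that really is this class, how much was found", and F1 combines the two.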


