# [ktrain](https://github.com/amaiya/ktrain)

ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. 

ktrain is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners. 

With only a few lines of code, ktrain allows you to easily and quickly:

  - employ fast, accurate, and easy-to-use pre-canned models for text, vision, graph, and tabular data:

    - text data:
      - Text Classification: BERT, DistilBERT, NBSVM, fastText, and other models
      - Sequence Labeling (NER): Bidirectional LSTM with optional CRF layer and various embedding schemes such as pretrained BERT and fastText word embeddings and character embeddings
      - Ready-to-Use NER models for English, Chinese, and Russian with no training required
      - Sentence Pair Classification for tasks like paraphrase detection
      - Unsupervised Topic Modeling with LDA
      - Document Similarity with One-Class Learning: given some documents of interest, find and score new documents that are thematically similar to them using One-Class Text Classification
      - Document Recommendation Engines and Semantic Searches: given a text snippet from a sample document, recommend semantically related documents from a larger corpus
      - Text Summarization: summarize long documents with a pretrained BART model, with no training required
      - End-to-End Question-Answering: ask a large text corpus questions and receive exact answers
      - Easy-to-Use Built-In Search Engine: perform keyword searches on large collections of documents
      - Zero-Shot Learning: classify documents into user-provided topics without training examples
      - Language Translation: translate text from one language to another

    - vision data:
      - image classification (e.g., ResNet, Wide ResNet, Inception)
      - image regression for predicting numerical targets from photos (e.g., age prediction)

    - tabular data:
      - tabular classification (e.g., Titanic survival prediction)
      - causal inference using meta-learners

  - estimate an optimal learning rate for your model given your data using a Learning Rate Finder

  - utilize learning rate schedules such as the triangular policy, the 1cycle policy, and SGDR to effectively minimize loss and improve generalization

  - build text classifiers for any language (e.g., Arabic Sentiment Analysis with BERT, Chinese Sentiment Analysis with NBSVM)

  - easily train NER models for any language (e.g., Dutch NER)

  - load and preprocess text and image data from a variety of formats

  - inspect data points that were misclassified and provide explanations to help improve your model

  - leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data

  - export models to ONNX and TensorFlow Lite with built-in support (see example notebook for more information)


In [1]:
!pip install ktrain 

import numpy as np
import pandas as pd
import tensorflow as tf
import ktrain
from ktrain import text

Collecting ktrain
  Downloading ktrain-0.27.3.tar.gz (25.3 MB)
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting syntok
  Downloading syntok-1.3.1.tar.gz (23 kB)
Collecting seqeval==0.0.19
  Downloading seqeval-0.0.19.tar.gz (30 kB)
Collecting transformers<=4.3.3,>=4.0.0
  Downloading transformers-4.3.3-py3-none-any.whl (1.9 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (

### Data Preparation

In [2]:
# Clone the IMDB dataset from GitHub
!git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git


Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (10/10), done.


In [3]:
train_data = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx', dtype=str)
train_data.head()

|   | Reviews | Sentiment |
|---|---------|-----------|
| 0 | When I first tuned in on this morning news, I ... | neg |
| 1 | Mere thoughts of "Going Overboard" (aka "Babes... | neg |
| 2 | Why does this movie fall WELL below standards?... | neg |
| 3 | Wow and I thought that any Steven Segal movie ... | neg |
| 4 | The story is seen before, but that does'n matt... | neg |


In [4]:
test_data = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx', dtype=str)
test_data.head()

|   | Reviews | Sentiment |
|---|---------|-----------|
| 0 | Who would have thought that a movie about a ma... | pos |
| 1 | After realizing what is going on around us ...... | pos |
| 2 | I grew up watching the original Disney Cindere... | neg |
| 3 | David Mamet wrote the screenplay and made his ... | pos |
| 4 | Admittedly, I didn't have high expectations of... | neg |


In [5]:
train_data.shape, test_data.shape

((25000, 2), (25000, 2))

In [6]:
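# Tokenize the reviews for BERT and truncate/pad each one to 400 tokens;
# the test set is passed as validation data.
(x_train, y_train), (x_test, y_test), preprocess = text.texts_from_df(
    train_df=train_data,
    text_column='Reviews',
    label_columns='Sentiment',
    val_df=test_data,
    maxlen=400,
    preprocess_mode='bert')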

['neg', 'pos']
   neg  pos
0  1.0  0.0
1  1.0  0.0
2  1.0  0.0
3  1.0  0.0
4  1.0  0.0
['neg', 'pos']
   neg  pos
0  0.0  1.0
1  0.0  1.0
2  1.0  0.0
3  0.0  1.0
4  1.0  0.0
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


In [7]:
print(x_train)
print(x_train[0].shape)

[array([[ 101, 2043, 1045, ..., 2295, 1012,  102],
       [ 101, 8210, 4301, ...,    0,    0,    0],
       [ 101, 2339, 2515, ...,    0,    0,    0],
       ...,
       [ 101, 2026, 2643, ..., 3185, 2001,  102],
       [ 101, 2043, 1045, ..., 1997, 2702,  102],
       [ 101, 2061, 2339, ..., 7987, 1013,  102]]), array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])]
(25000, 400)
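
The printed `x_train` is a list of two arrays, each of shape `(25000, 400)`. In BERT's uncased vocabulary, ID 101 is `[CLS]`, 102 is `[SEP]`, and 0 is padding, so the first array appears to hold token IDs and the second segment IDs (all zeros, since each example is a single text). The sketch below mimics that layout on toy data with plain Python lists. `encode_ids` is a hypothetical helper, not part of ktrain, and the exact truncation rule used by the library is an assumption.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in BERT's uncased vocabulary

def encode_ids(token_ids, maxlen):
    """Wrap token IDs in [CLS] ... [SEP], then truncate or zero-pad to maxlen."""
    body = token_ids[: maxlen - 2]              # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    ids = ids + [PAD_ID] * (maxlen - len(ids))  # right-pad short sequences with zeros
    segments = [0] * maxlen                     # single-text input: every segment ID is 0
    return ids, segments

# Toy token IDs (illustrative values only)
short_ids, short_seg = encode_ids([2043, 1045], maxlen=6)
long_ids, long_seg = encode_ids([2043, 1045, 2034, 15757, 1999, 2006], maxlen=6)

print(short_ids)  # [101, 2043, 1045, 102, 0, 0]
print(long_ids)   # [101, 2043, 1045, 2034, 15757, 102]
```

This matches the pattern in the output above: short reviews end in a run of zeros, while reviews longer than `maxlen` are cut off and still end with 102.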


In [8]:
print(y_train)
print(y_train[0].shape)

[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
(2,)
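
Each row of `y_train` is a one-hot vector over the classes `['neg', 'pos']`, so `[1. 0.]` means `neg`. A minimal sketch of that encoding and its inverse, using plain lists in place of NumPy arrays; `one_hot` and `decode` are hypothetical helper names, not ktrain functions.

```python
class_names = ['neg', 'pos']  # same order as printed during preprocessing

def one_hot(label, classes):
    """Return a vector with 1.0 at the position of `label` and 0.0 elsewhere."""
    return [1.0 if label == c else 0.0 for c in classes]

def decode(vector, classes):
    """Return the class name at the position of the largest value."""
    best = max(range(len(vector)), key=vector.__getitem__)
    return classes[best]

labels = ['neg', 'pos', 'neg']
encoded = [one_hot(label, class_names) for label in labels]

print(encoded)                                    # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print([decode(v, class_names) for v in encoded])  # ['neg', 'pos', 'neg']
```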


In [9]:
model = text.text_classifier(name='bert', 
                             train_data = (x_train,y_train),
                             preproc = preprocess)

Is Multi-Label? False
maxlen is 400
done.


In [10]:
# Wrap the model and data in a Learner, which is used for the learning-rate search and for training

learner = ktrain.get_learner(model=model,
                             train_data = (x_train,y_train),
                             val_data = (x_test,y_test),
                             batch_size = 6)

In [11]:
# The learning-rate search is very slow on the full dataset (possibly days), so it is left commented out.

# learner.lr_find()
# learner.lr_plot()

__A learning rate of 2e-5 is used for this model; it is among the values commonly recommended for fine-tuning BERT.__
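
For context, a learning-rate finder generally trains briefly while raising the learning rate exponentially and recording the loss, then suggests a rate from the region where the loss falls fastest, before it diverges. The sketch below illustrates that idea only. The loss values are made up, the helper names are hypothetical, and the steepest-descent heuristic is one common choice rather than a description of what ktrain's `lr_find` does internally.

```python
import math

def exponential_lrs(start_lr, end_lr, num_steps):
    """Learning rates spaced evenly on a log scale, one per training step."""
    ratio = (end_lr / start_lr) ** (1 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]

def steepest_descent_lr(lrs, losses):
    """Return the learning rate at the start of the interval with the steepest loss drop."""
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log10(lrs[i + 1]) - math.log10(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    best = min(range(len(slopes)), key=slopes.__getitem__)
    return lrs[best]

lrs = exponential_lrs(1e-7, 1e-1, num_steps=7)           # 1e-7, 1e-6, ..., 1e-1
losses = [0.70, 0.69, 0.65, 0.40, 0.35, 0.90, 3.00]      # made-up loss curve
suggested = steepest_descent_lr(lrs, losses)

print(f"suggested learning rate: {suggested:.0e}")
```

On this toy curve the loss drops most sharply just after 1e-5 and blows up beyond 1e-3, so the heuristic picks a rate near 1e-5.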

## Model Fine Tuning 

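Fine-tuning is the slowest step and can take up to a couple of hours depending on hardware; the single epoch logged below took about 37 minutes.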
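
The training cell that follows uses `fit_onecycle`, which applies the 1cycle policy from the feature list. As a rough illustration, a simplified one-cycle schedule raises the learning rate linearly to a maximum at the midpoint of training and then lowers it again. The function below is a sketch under that assumption: `onecycle_lr` and `div_factor` are hypothetical names, and ktrain's actual schedule may differ in its details (the full policy also cycles momentum).

```python
def onecycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: rise linearly to max_lr at the midpoint, then fall back."""
    base_lr = max_lr / div_factor            # starting and ending learning rate (assumed ratio)
    half = total_steps / 2
    distance = abs(step - half) / half       # 1.0 at both ends, 0.0 at the midpoint
    return base_lr + (max_lr - base_lr) * (1 - distance)

total = 8
schedule = [onecycle_lr(s, total, max_lr=2e-5) for s in range(total + 1)]

for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```

With `max_lr=2e-5` the schedule starts at 2e-6, peaks at 2e-5 on step 4, and returns to 2e-6 on the last step.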

In [12]:
import time
start = time.time()

print('Start time: ', time.ctime(start))

learner.fit_onecycle(lr=2e-5, epochs=1)  # one epoch; training for 3 or more epochs tends to overfit

end = time.time()
print('Ending time: ', time.ctime(end))
print('Training time (hours): ', int((end - start) / 3600))

Start time:  Sat Sep 18 21:45:54 2021


begin training using onecycle policy with max lr of 2e-05...
Ending time:  Sat Sep 18 22:22:41 2021
Training time (hours):  0
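
Truncating the elapsed time to whole hours reports 0 for any run shorter than an hour, as in the output above, where the two timestamps are 36 minutes 47 seconds apart. A small helper that also reports minutes and seconds is more informative; `format_elapsed` is a hypothetical name introduced here for illustration.

```python
def format_elapsed(seconds):
    """Format a duration in seconds as hours, minutes, and seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"

# 21:45:54 to 22:22:41 is 2207 seconds
print(format_elapsed(2207))  # 0h 36m 47s
```

In the training cell this could be called as `format_elapsed(end - start)`.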


In [13]:
predictor = ktrain.get_predictor(learner.model,preprocess)

In [14]:
data = ['this movie was horrible. the plot was really boring. Acting was okay, though',
        'the film really sucked. there is no plot and acting was bad',
        'what a beautiful movie. great plot, great acting.  will see it again']

In [15]:
predictor.predict(data)

['neg', 'neg', 'pos']
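
The predictor returns class names directly. When confidence matters, class probabilities are more useful; ktrain's `predict` is understood to accept a `return_proba=True` argument for this, though that should be checked against the installed version. The sketch below shows how such probabilities map back to labels. The probability values are made up for illustration, and `to_labels` is a hypothetical helper.

```python
class_names = ['neg', 'pos']  # class order reported during preprocessing

def to_labels(probabilities, classes):
    """Pair each row of class probabilities with its most likely class and that probability."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        results.append((classes[best], row[best]))
    return results

# Made-up probabilities for three reviews (each row sums to 1)
probs = [[0.93, 0.07],
         [0.88, 0.12],
         [0.04, 0.96]]

for label, confidence in to_labels(probs, class_names):
    print(label, confidence)
```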