## Uber's Ludwig: A Datatype-Agnostic Toolbox Built on Top of TensorFlow

### by Scott Jones
### August 23, 2019

![title](ludwig_logo.png)

- easy to use (basically no coding)

- visualization of model performance (including model comparisons)

- flexible: many model parameters can be configured

- applicable to a wide range of use cases

- regularly updated (BERT support was added recently)

- can load data from AWS, GCP, Azure

## Resources

- github.com/uber/ludwig
- uber.github.io
    - Getting Started
    - Examples
    - User Guide
    - API
- https://towardsdatascience.com/introducing-ubers-ludwig-5bd275a73eda

![title](ludwig_example_dataset.png)

*ludwig train --data_csv reuters-allcats.csv --model_definition "{input_features: [{name: text, type: text, encoder: parallel_cnn, level: word}], output_features: [{name: class, type: category}]}"*

![title](terminal_training_run.png)

*ludwig predict --data_csv reuters-allcats.csv --model_path results/experiment_run_0/model/*

![title](predict_file_output.png)

### Or combine training and testing in a single command:

*ludwig experiment --data_csv reuters-allcats.csv --model_definition "{input_features: [{name: text, type: text, encoder: parallel_cnn, level: word}], output_features: [{name: class, type: category}]}"*

### The model definition can also go in a YAML file (passed with --model_definition_file):

input_features:  
    -  
        name: text  
        type: text  
        level: word  
        encoder: parallel_cnn

output_features:  
    -  
        name: class  
        type: category
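
As a rough sketch, the same definition can also be assembled as a Python dictionary and serialized for the command line. JSON is, for practical purposes, a subset of YAML, so the serialized string should be accepted by `--model_definition`. The helper `build_train_command` is illustrative and not part of Ludwig; only the flags shown earlier are used.

```python
import json
import shlex

# Same structure as the YAML above, written as a plain dictionary.
model_definition = {
    "input_features": [
        {"name": "text", "type": "text", "level": "word", "encoder": "parallel_cnn"}
    ],
    "output_features": [{"name": "class", "type": "category"}],
}


def build_train_command(data_csv, definition):
    """Return a shell-safe `ludwig train` command string (hypothetical helper)."""
    args = [
        "ludwig", "train",
        "--data_csv", data_csv,
        "--model_definition", json.dumps(definition),
    ]
    return " ".join(shlex.quote(arg) for arg in args)


command = build_train_command("reuters-allcats.csv", model_definition)
print(command)
```

Quoting with `shlex.quote` keeps the braces and spaces of the definition intact when the command is pasted into a shell.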

## Visualization

*ludwig visualize --visualization learning_curves --training_statistics path/to/training_statistics.json*

![title](accuracy.png)

![title](loss.png)
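
The learning curves are drawn from the per-epoch values stored in `training_statistics.json`. Below is a minimal sketch of reading such statistics by hand. It assumes a layout of split, then output feature, then metric, each metric holding one value per epoch. That layout, the sample numbers, and the helper `best_epoch` are assumptions for illustration, so check them against your own file.

```python
import json

# Hypothetical, hand-written statistics; real files come from `ludwig train`.
sample_stats = json.loads("""
{
  "train":      {"class": {"loss": [1.9, 1.2, 0.8], "accuracy": [0.35, 0.58, 0.74]}},
  "validation": {"class": {"loss": [1.8, 1.3, 1.4], "accuracy": [0.38, 0.55, 0.54]}}
}
""")


def best_epoch(stats, split, feature, metric, minimize=True):
    """Return (1-based epoch, value) at which the metric is best."""
    values = stats[split][feature][metric]
    pick = min if minimize else max
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


print(best_epoch(sample_stats, "validation", "class", "loss"))
print(best_epoch(sample_stats, "validation", "class", "accuracy", minimize=False))
```

In this made-up run the training loss keeps falling while the validation loss bottoms out at epoch 2, which is the overfitting pattern the plots make visible.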

## Python API

*!apt-get install libgmp3-dev*  
*!pip install ludwig*

### Train a model

*from ludwig.api import LudwigModel*  
*model_definition = {...}*  
*ludwig_model = LudwigModel(model_definition)*  
*train_stats = ludwig_model.train(data_csv=csv_file_path)*

### Train from a DataFrame

*train_stats = ludwig_model.train(data_df=dataframe)*

### Load a previously trained model

*ludwig_model = LudwigModel.load(model_dir)*

### Predict

*predictions = ludwig_model.predict(data_csv=csv_file_path) # data_df=dataframe also works*

### Test

*predictions, test_stats = ludwig_model.test(data_csv=csv_file_path) # data_df=dataframe also works*

### Release resources

*ludwig_model.close()*
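
Since `close()` should run even when training or prediction raises, one option is `contextlib.closing` from the standard library. The sketch below uses a stand-in class (`FakeModel`, hypothetical) so that it runs without Ludwig installed. The only assumption about the real model object is that it exposes `close()`, as shown above.

```python
from contextlib import closing


class FakeModel:
    """Hypothetical stand-in exposing the same close() hook as the model above."""

    def __init__(self):
        self.closed = False

    def predict(self, rows):
        # Placeholder logic: label every row the same way.
        return ["five stars" for _ in rows]

    def close(self):
        self.closed = True


fake = FakeModel()
with closing(fake) as model:
    predictions = model.predict(["great book", "too short"])

print(predictions, fake.closed)
```

With a real model, the `with` block would wrap the `train`, `predict`, or `test` calls in the same way.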

## Model Comparisons

*compare_classifiers_performance_from_pred*

![title](model_comparison.png)

*compare_classifiers_multiclass_multimetric*

![title](multiclass_performance.png)
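
Comparison plots like these summarize per-model metrics computed from predictions. Here is a rough sketch of the idea, comparing two hypothetical models by accuracy and macro-averaged F1 on made-up labels. It is not Ludwig's implementation, and the model names and data are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up ground truth and predictions from two hypothetical models.
truth = ["pos", "neg", "pos", "neg", "pos", "neg"]
model_preds = {
    "parallel_cnn": ["pos", "neg", "pos", "pos", "pos", "neg"],
    "baseline": ["pos", "pos", "pos", "pos", "pos", "pos"],
}

for name, preds in model_preds.items():
    print(name, round(accuracy(truth, preds), 3), round(macro_f1(truth, preds), 3))
```

The always-"pos" baseline reaches 50% accuracy on this balanced toy set but a much lower macro F1, which is why looking at several metrics side by side is useful.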

## Confusion Matrix

![title](confusion_matrix.png)
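
A confusion matrix counts, for each true class, how often each class was predicted. Below is a small pure-Python sketch on invented labels. Ludwig produces the plot itself; this only illustrates what the cells mean.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["one star", "three stars", "five stars"]
y_true = ["five stars", "five stars", "three stars", "one star", "three stars"]
y_pred = ["five stars", "three stars", "three stars", "one star", "five stars"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>12}: {row}")
```

Off-diagonal cells show which classes get mixed up, here "three stars" and "five stars".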

### A key advantage of Ludwig is its data-type-specific encoders and decoders (adaptability)

- preprocessing methods specific to your data

### For text data: 

- handled similarly to sequence data, with additional spaCy-based methods:
    - tokenizer
    - stopword removal
    - punctuation, number filter
    - lemmatization
- lowercase data, fill missing values
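
The steps above can be sketched in plain Python. This is a simplified stand-in, not spaCy and not Ludwig's code: it uses a regular-expression tokenizer and a tiny hand-picked stopword list, and it omits lemmatization, which needs real language data. The names `preprocess` and `STOPWORDS` and the `fill_value` default are illustrative.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "was", "of", "to"}


def preprocess(text, fill_value="<missing>"):
    """Lowercase, tokenize, and filter one text field."""
    if text is None or not text.strip():
        return [fill_value]  # fill missing values with a placeholder token
    tokens = re.findall(r"[a-z]+|\d+|[^\w\s]", text.lower())
    return [
        tok for tok in tokens
        if tok.isalpha() and tok not in STOPWORDS  # drops punctuation, numbers, stopwords
    ]


print(preprocess("This was a GOOD one, 5 stars!"))
print(preprocess(None))
```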

## NLP Applications

- Text classification
- Natural Language Understanding
- NER tagging
- Translation
- Chatbot modeling

In [7]:
import json
import os

import pandas as pd
from sklearn.model_selection import train_test_split
#!pip install bert-tensorflow

# Directory that holds the Kindle Store review data
path = "/Users/sjones/Google Drive/Kindle_rating_predict"
file = os.path.join(path, "reviews_Kindle_Store_5.json")

# The file is JSON Lines: one review object per line
reviews = []
with open(file, 'r') as f:
    for line in f:
        reviews.append(json.loads(line))

reviews_df = pd.DataFrame(reviews)

# Hold out 15% of the reviews for testing
train, test = train_test_split(reviews_df, test_size=0.15, random_state=42, shuffle=True)
train.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 208012 | B007SH90RM | [0, 0] | 5.0 | This was a good one, let's start by saying tha... | 05 13, 2014 | A2GU8WAUL1GREZ | Veritas Vincit "Bill" | Deep Mystery and Masterful Storytelling | 1399939200 |
| 590910 | B00E123W58 | [0, 0] | 5.0 | I was gifted my copy of RUNE II in exchange fo... | 07 23, 2013 | A1RUT4WFFKVI1I | Anne Nelson | What a explosion as it hit my kindle!!!! | 1374537600 |
| 563433 | B00DKB3LKM | [0, 0] | 5.0 | I really liked this book. Another great book i... | 06 26, 2013 | A30GJ3BNSILI9D | Bill McBride | A fact fills story | 1372204800 |
| 375492 | B00AM0WQD2 | [1, 1] | 5.0 | Great information for creating gifts! I have m... | 12 13, 2012 | A3KBGY56U6MPA3 | this is great | Definitely Recommended! | 1355356800 |
| 293305 | B0094B6QQ8 | [0, 0] | 2.0 | Very short story...not credible at all...and j... | 09 14, 2013 | A1JWFH1A4XT5PU | Diana Galan | Bah | 1379116800 |


In [8]:
# Keep only the rating and the review text
train_sm = train[['overall', 'reviewText']].copy()
test_sm = test[['overall', 'reviewText']].copy()

# Get rid of reviews with no text
train_sm = train_sm[train_sm['reviewText'] != ""]
test_sm = test_sm[test_sm['reviewText'] != ""]

# Restrict our data set so that training finishes quickly
#train_sm = train_sm.sample(frac=0.3, random_state=42)
train_sm = train_sm.sample(n=1000, random_state=42)
test_sm = test_sm.sample(n=200, random_state=42)

# Turn the numeric ratings into category labels
rating_labels = {1.0: "one star", 2.0: "two stars", 3.0: "three stars", 4.0: "four stars", 5.0: "five stars"}
train_sm = train_sm.replace({"overall": rating_labels})
test_sm = test_sm.replace({"overall": rating_labels})

In [10]:
from ludwig.api import LudwigModel

# train a model
model_definition = {
    'input_features': [{
        'name': 'reviewText',
        'type': 'text',
        'level': 'word',
        'encoder': 'parallel_cnn',
        'do_lower_case': True,
        'preprocessing': {'word_tokenizer': 'english_tokenize_filter'},
    }],
    'output_features': [{'name': 'overall', 'type': 'category'}],
}

#print(model_definition)

model = LudwigModel(model_definition)
train_stats = model.train(data_df=train_sm)

# or load a model
#model = LudwigModel.load(model_path)








In [None]:
# obtain predictions
predictions = model.predict(test_sm)

model.close()

# Embeddings

- allow shared representations across disparate data sets (transfer learning)
- lower-dimensional representations of words that preserve context
- also: language model

![title](transfer_learning_schematic.png)
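
To make the idea of a lower-dimensional representation concrete, here is a toy sketch: each word maps to a short dense vector, a text is represented by the mean of its word vectors, and similar texts end up with similar vectors. The three-dimensional vectors are invented for illustration; real embeddings are learned from data and are much larger.

```python
import math

# Invented 3-dimensional vectors; learned embeddings are far larger.
EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "good": [0.8, 0.2, 0.1],
    "boring": [-0.7, 0.3, 0.1],
    "book": [0.0, 0.9, 0.2],
}


def embed(tokens):
    """Mean of the vectors of known tokens (unknown tokens are skipped)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known tokens to embed")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similar = cosine(embed(["great", "book"]), embed(["good", "book"]))
different = cosine(embed(["great", "book"]), embed(["boring", "book"]))
print(round(similar, 3), round(different, 3))
```

Because the vectors live in a shared space, a table like this can be reused across data sets, which is the transfer-learning point made above.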

In [11]:
# Google's BERT (Bidirectional Encoder Representations from Transformers) is now also available in Ludwig:
bert_dir = '/content/drive/My Drive/Kindle_rating_predict/wwm_uncased_L-24_H-1024_A-16'
bert_preprocessing = {'word_tokenizer': 'bert', 'word_vocab_file': bert_dir + '/vocab.txt',
                      'padding_symbol': '[PAD]', 'unknown_symbol': '[UNK]'}
model_definition = {
    'input_features': [{'name': 'reviewText', 'type': 'text', 'encoder': 'bert',
                        'config_path': bert_dir + '/bert_config.json',
                        'checkpoint_path': bert_dir + '/bert_model.ckpt',
                        'do_lower_case': True, 'reduce_output': True,
                        'preprocessing': bert_preprocessing}],
    'output_features': [{'name': 'overall', 'type': 'category'}],
}



## Using BERT in Ludwig

- the BERT tokenizer must be used, so that tokens match BERT's vocabulary
- the encoder (the Transformer's encoder) maps each integer token ID in the sequence to its embedding
- the downloaded BERT weights, hyperparameters, and vocabulary must be loaded
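
The tokenizer requirement follows from how BERT's vocabulary works: words are split into WordPiece sub-tokens by greedy longest-match-first lookup, and each piece is replaced by its integer ID, which the encoder then maps to an embedding. Below is a toy sketch of that lookup with a made-up six-entry vocabulary. The real `vocab.txt` has roughly 30,000 entries, and the real tokenizer also splits punctuation and adds special tokens such as `[CLS]` and `[SEP]`, which are omitted here. `wordpiece` and `encode` are illustrative helpers, not the bert-tensorflow API.

```python
# Made-up vocabulary; IDs are list positions, as in a vocab.txt file.
VOCAB = ["[PAD]", "[UNK]", "read", "##ing", "##able", "kindle"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def wordpiece(word):
    """Split one lowercase word into pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in TOKEN_TO_ID:
                break
            end -= 1
        if end == start:  # no piece matched: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode(text, max_len=6):
    """Lowercase, split on whitespace, map pieces to IDs, pad to max_len."""
    pieces = [p for word in text.lower().split() for p in wordpiece(word)]
    ids = [TOKEN_TO_ID[p] for p in pieces][:max_len]
    return ids + [TOKEN_TO_ID["[PAD]"]] * (max_len - len(ids))


print(wordpiece("readable"))
print(encode("Kindle reading rocks"))
```

A tokenizer with a different vocabulary would produce IDs that point at the wrong rows of the pretrained embedding table, which is why the BERT tokenizer and vocabulary file have to be used together with the BERT weights.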


# Questions?