# Get your Data
Normally a large dataset will be loaded, here we write a small one out for demsontration purposes.
Say we have a set of Tweets. Surely they are very constructed but the principle works if the corpus is large enough.

In [1]:
corpus = [
    'OMG! I love driving my new Mercedes. It is so fast.',
    'Guys see how cool my friend looks driving my new VW. He\'s loving it!',
    'I have always dreamt of buying a campervan from my friend.'
    'Today I will finally get my new Computer.'
    'I have always dreamt of a campervan from VW.'
    'He doesn\'t seem to like driving my new Lamborghini! Maybe not fast enough :D'
         ]

# Define a seedlist
The goal is to find entities in your corpus without the need of using pretrained models to do so. This makes it more robust to spelling or grammer mistakes (especially in non-english contexts) and also lifts limitations of what an entity might be.

In this case we want to identify words IN CONTEXT that might be car brands:

In [2]:
seed_list = ['mercedes', 'lamborghini']

# Create Tagger
This is the main framework to do the iterative training described in README.md 
To apply less restrictions (since corpora may be very different in nature) the actual model definition and embedding used will be defined separately.


In [3]:
from src.NER import NERTagger

tagger = NERTagger(corpus,
                   entities = [{'name': 'CAR_BRAND', 'seed': seed_list}],
                   seed = 123456, # for reproducability
                   window = 3, # context window around desired word ( designed for not using advanced layers like LSTM)
                   n_jobs = 1, # for large datasets, multiple jobs will be faster
                   train_min_pos_rate = 0.4 # How confident does the model have to be in order to adjust the seed list
                   )



# Create Embedding and Model
Since this is a Proof of Concept, we will be using very simple word embeddings (basically counts) and neural networks since our corpus is too small for anything else.

In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding, Dropout, Input
from tensorflow.keras.optimizers import Adam

from tensorflow.keras.wrappers.scikit_learn import KerasClassifier


EMBEDDING_SIZE = 20
model_dims = tagger.get_required_dimensions()

mlp_model = Sequential()
mlp_model.add(Embedding(model_dims['num_labels'], EMBEDDING_SIZE, input_length=model_dims['in_dim']))
mlp_model.add(Flatten())
mlp_model.add(Dense(100, activation='relu'))
mlp_model.add(Dropout(0.5))
mlp_model.add(Dense(model_dims['out_dim'], activation='softmax'))

MODEL_PARAMS = {
    "epochs": 10,
    "batch_size": 5,
    "loss": "categorical_crossentropy",
    "metrics": ["accuracy"],
    "optimizer": Adam(amsgrad=False,
                      beta_1=0.9,
                      beta_2=0.999,
                      decay=0.00,
                      epsilon=1e-8,
                      lr=0.001),
}

def compile_model(model, loss, optimizer, metrics):
    model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
    return model

model = KerasClassifier(build_fn=compile_model, model=mlp_model, **MODEL_PARAMS)

tagger.set_model(model)

# Train

Now the model is trained in multiple iterations (see main README.md for details) to learn contextualo rules for where in a sentence a car brand might be situated.

In [5]:
import os

generate_config = {
    'max_iterations': 5,
    'min_probability': 0.3,
    'min_update_rate': 0.02
}

os.makedirs('example_run', exist_ok=True)
    
tagger.generate_predictive_rules(iteration_save_path='example_run',
                                  save_iterations=list(range(generate_config['max_iterations']+1)),
                                  **generate_config)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10






# Evaluate

Now the trained model is evaluated. Performance can obviously not be expected to be brilliant given the minimal dataset.

In [6]:
# Get the probability of every word in the sentence to be of the trained entity. 
# For simplicity, all words were lower cased and punctuation was reduced to spaces.

test_sentence = 'Guys see how cool my friend looks driving my new VW. He\'s loving it!'
probabs = tagger.predict_token_probabilities(test_sentence)
print(probabs)

[('guys', 0.02955764), ('see', 0.03723589), ('how', 0.061579213), ('cool', 0.1061618), ('my', 0.17758377), ('friend', 0.18183373), ('looks', 0.29528335), ('driving', 0.14679635), ('my', 0.14607784), ('new', 0.17787132), ('vw', 0.49876878), ("he's", 0.088515826), ('loving', 0.061647322), ('it', 0.025124522)]


In [7]:
# Now only print a sentences words that has a contextual likelihood > cutoff for being of the given entity.

cutoff = 0.5   # normally much higher for appropriately sized data sets

brands = [(p[0], ix) for ix, p in enumerate(probabs) if round(p[1],2)>=cutoff]

print('Words in sentence that are likely car brands:\n')
for b in brands:
    print(f'{b[0]} (Token number {b[1]})')

Words in sentence that might be car brands:

vw (Token number 10)


In [8]:
# Notice that this is ONLY CONTEXTUAL and not based on a word list -> Brand does not need to be known in advance
# Although the tagger is more designed to tag the corpus it is trained on, it would theoretically also work on 
# unknown words 

test_sentence = 'Guys see how cool my friend looks driving my new UnknownCarBrand. He\'s loving it!'

probabs = tagger.predict_token_probabilities(test_sentence)
brands = [(p[0], ix) for ix, p in enumerate(probabs) if round(p[1],2)>=cutoff]

print('Words in sentence that are likely car brands:\n')
for b in brands:
    print(f'{b[0]} (Token number {b[1]})')


Words in sentence that are likely car brands:

unknowncarbrand (Token number 10)
