
# Training A Neural Network Model - I

## Training and updating models

### Creating training data - I

In [11]:
import spacy

In [12]:
# Create an NLP object
from spacy.lang.en import English
nlp = English()

In [13]:
# Import the Doc, Span and Token classes
from spacy.tokens import Doc, Span, Token

In [14]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


spaCy's rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable TEXTS.

We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as 'GADGET'.

In [15]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [16]:
TEXTS = ['How to preorder the iPhone X',
 'iPhone X is coming',
 'Should I pay $1,000 for the iPhone X?',
 'The iPhone 8 reviews are here',
 'Your iPhone goes up to 11 today',
 'I need a new phone! Any tips?']

1. Write a pattern for two tokens whose lowercase forms match 'iphone' and 'x'.
2. Write a pattern for two tokens: one token whose lowercase form matches 'iphone' and an optional digit using the '?' operator.
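
To make the effect of the '?' operator concrete, here is a toy token matcher in plain Python. It is only a sketch: the helpers `token_matches` and `find_matches` are hypothetical, tokens are split on whitespace rather than by spaCy's tokenizer, and only the 'LOWER', 'IS_DIGIT' and 'OP': '?' keys are handled. It is not how spaCy implements the Matcher.

```python
def token_matches(token, spec):
    """Check one token string against one pattern dict (toy subset of keys)."""
    if 'LOWER' in spec and token.lower() != spec['LOWER']:
        return False
    if 'IS_DIGIT' in spec and token.isdigit() != spec['IS_DIGIT']:
        return False
    return True


def find_matches(tokens, pattern):
    """Return every (start, end) token span the pattern can cover."""
    results = []
    for start in range(len(tokens)):
        # Positions reachable after consuming the pattern entries seen so far
        ends = {start}
        for spec in pattern:
            next_ends = set()
            for pos in ends:
                if spec.get('OP') == '?':
                    # Optional entry: the pattern may skip it entirely
                    next_ends.add(pos)
                if pos < len(tokens) and token_matches(tokens[pos], spec):
                    next_ends.add(pos + 1)
            ends = next_ends
        results.extend((start, end) for end in sorted(ends) if end > start)
    return results


pattern = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]
print(find_matches('The iPhone 8 reviews are here'.split(), pattern))
print(find_matches('Your iPhone goes up to 11 today'.split(), pattern))
```

Because the digit is optional, the first sentence yields both the bare 'iPhone' span and the longer 'iPhone 8' span, while the second sentence yields only 'iPhone'.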

In [17]:
# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]

# Add patterns to the matcher (spaCy v2 signature; in spaCy v3 this is
# matcher.add('GADGET', [pattern1, pattern2]))
matcher.add('GADGET', None, pattern1, pattern2)

### Creating training data - II

Let's use the match patterns created above to bootstrap a set of training examples. The sentences are still available as the variable TEXTS.

In [18]:
# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]

# Add patterns to the matcher (spaCy v2 signature; in spaCy v3 this is
# matcher.add('GADGET', [pattern1, pattern2]))
matcher.add('GADGET', None, pattern1, pattern2)

1. Create a Doc object for each text using nlp.pipe and find the matches in it.
2. Create a list of (start, end, label) tuples for the matches.

In [19]:
# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Find the matches in the doc
    matches = matcher(doc)
    
    # Get a list of (start, end, label) tuples of matches in the text
    entities = [(start, end, 'GADGET') for match_id, start, end in matches]
    print(doc.text, entities)    

How to preorder the iPhone X [(4, 6, 'GADGET'), (4, 5, 'GADGET')]
iPhone X is coming [(0, 2, 'GADGET'), (0, 1, 'GADGET')]
Should I pay $1,000 for the iPhone X? [(7, 9, 'GADGET'), (7, 8, 'GADGET')]
The iPhone 8 reviews are here [(1, 2, 'GADGET'), (1, 3, 'GADGET')]
Your iPhone goes up to 11 today [(1, 2, 'GADGET')]
I need a new phone! Any tips? []
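
The tuples printed above are token indices, whereas entity annotations for training are expressed in character offsets. The sketch below shows the conversion in plain Python. It assumes whitespace tokenization as a stand-in for spaCy's tokenizer (the two differ on text such as '$1,000' or 'X?'), and the helpers `token_offsets` and `to_char_span` are hypothetical names used only for illustration.

```python
def token_offsets(text):
    """Return (start_char, end_char) for each whitespace-separated token."""
    offsets = []
    position = 0
    for token in text.split():
        start = text.index(token, position)
        end = start + len(token)
        offsets.append((start, end))
        position = end
    return offsets


def to_char_span(text, start_token, end_token):
    """Convert a token span [start_token, end_token) to character offsets."""
    offsets = token_offsets(text)
    return offsets[start_token][0], offsets[end_token - 1][1]


text = 'How to preorder the iPhone X'
start_char, end_char = to_char_span(text, 4, 6)
print(start_char, end_char, text[start_char:end_char])  # 20 28 iPhone X
```

With spaCy itself, the same information is available directly on a Span through its start_char and end_char attributes.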


1. Match on the doc and create a list of matched spans.
2. Format each example as a tuple of the text and a dict, mapping 'entities' to the entity tuples.
3. Append the example to TRAINING_DATA and inspect the printed data.
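
When inspecting the data, it can help to slice each annotated span back out of the text and confirm it reads as expected. The following is a minimal sketch of such a check; `entity_texts` is a hypothetical helper, and it assumes examples in the (text, {'entities': [(start_char, end_char, label), ...]}) format described in the steps above.

```python
def entity_texts(example):
    """Return the (surface text, label) pair for each annotated span."""
    text, annotations = example
    surface_forms = []
    for start, end, label in annotations['entities']:
        # Offsets must describe a non-empty slice inside the text
        if not 0 <= start < end <= len(text):
            raise ValueError(f'bad span {(start, end)} for {text!r}')
        surface_forms.append((text[start:end], label))
    return surface_forms


example = ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
print(entity_texts(example))  # [('iPhone X', 'GADGET')]
```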

In [20]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]
    
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)
    
print(*TRAINING_DATA, sep='\n')    

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 10, 'GADGET'), (4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
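
Several examples above contain two overlapping spans for the same mention, for instance (20, 28) and (20, 26), because the matcher reports both the shorter and the longer match. Named entity annotations are normally expected not to overlap, so one option is to keep only the longest span in each overlapping group. The helper below, `keep_longest_spans`, is a hypothetical plain-Python sketch of that idea, not part of spaCy. Recent spaCy versions also offer `spacy.util.filter_spans`, which applies a similar longest-first rule to Span objects.

```python
def keep_longest_spans(entities):
    """Drop any span that overlaps a longer span that was already kept."""
    # Longest spans first; ties are broken by the earliest start offset
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ordered:
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    return sorted(kept)


entities = [(20, 28, 'GADGET'), (20, 26, 'GADGET')]
print(keep_longest_spans(entities))  # [(20, 28, 'GADGET')]
```

Applied to each example's 'entities' list before appending to TRAINING_DATA, this would leave a single 'GADGET' span per mention.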
