# NER with spaCy

This notebook explores how effective a 'vanilla' pre-trained spaCy pipeline is at named entity recognition (NER), with no fine-tuning or custom components.

## Installs

In [26]:
# !pip install spacy
# !python3 -m spacy download en_core_web_sm
# !pip install s3fs
# !pip install boto
# !pip install boto3

In [18]:
import os
import spacy
import pandas as pd
from spacy import displacy

import s3fs
import boto3
import boto

## Pipeline Preparation

In [19]:
# Load the small English pipeline, which is optimized for CPU.
# Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
nlp = spacy.load("en_core_web_sm")

In [20]:
text = "STSM077030 - Companies and shareholders: company takeovers: Stamp Duty on block transfer Once a takeover offer is declared unconditional,\
    all the acceptances received up to that date are usually included in one ‘block transfer’- a single stock transfer form with an accompanying \
    schedule setting out the total number of shares in the target company to be transferred to the offeror, together with the consideration payable for each. \
    Stamp Duty will be chargeable on the amount or value of the consideration for each transfer, and should be set out in the schedule accompanying the stock transfer form.\
    STSM021190 provides more information on block transfers, including where the block transfer contains transfers on sale involving individual \
    shareholders where the consideration does not exceeed £1,000 and so may benefit from a £1,000 certificate of value and  not attract a Stamp Duty charge. \
    Separate block transfers must be prepared in respect of chargeable and non chargeable transfers. \
    STSM077040 - STSM077060 give details of how Stamp Duty is calculated on different types of consideration given for transfers of securities under a takeover. \
    Further block transfers will often be executed, for example covering acceptances received during a specified period after the offer is declared unconditional,\
    and/or to cover compulsory acquisitions from minority shareholders under section 979 Companies Act 2006."

In [21]:
doc = nlp(text)

In [22]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

STSM077030 - Companies 0 22 ORG
1,000 815 820 MONEY
1,000 848 853 MONEY
Stamp Duty 894 904 ORG
2006 1440 1444 DATE
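
With longer texts it can help to summarise the entities per label rather than read them line by line. The sketch below tallies the five entities printed above. The `entities` list is copied by hand from that output so the snippet runs without loading a model; in the notebook the same count could be taken directly from `doc.ents`.

```python
from collections import Counter

# (text, start_char, end_char, label) tuples copied from the output above.
entities = [
    ("STSM077030 - Companies", 0, 22, "ORG"),
    ("1,000", 815, 820, "MONEY"),
    ("1,000", 848, 853, "MONEY"),
    ("Stamp Duty", 894, 904, "ORG"),
    ("2006", 1440, 1444, "DATE"),
]

# Count how many entities were predicted for each label.
label_counts = Counter(label for _, _, _, label in entities)

# With a live pipeline the equivalent would be:
#   label_counts = Counter(ent.label_ for ent in doc.ents)
for label, count in label_counts.most_common():
    print(label, count)
```

Both ORG predictions look questionable here: "STSM077030 - Companies" is a manual section reference followed by a heading word, and "Stamp Duty" is a tax rather than an organisation.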


In [7]:
displacy.render(doc, style='ent')

In [34]:
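def find_vis_ents(text):
    """Run the pipeline on `text` and render its entities with displaCy."""
    doc = nlp(text)
    # Uncomment to also print each entity with its character offsets and label:
    # for ent in doc.ents:
    #     print(ent.text, ent.start_char, ent.end_char, ent.label_)
    displacy.render(doc, style='ent')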

In [24]:
find_vis_ents(text)


## Integrate Doccano Data

In [29]:
ner_data_file = "../data/processed/line_by_line_NER_data_sampled_12062020_more_ents.csv"

df = pd.read_csv(ner_data_file, sep="\t", low_memory=False)

In [30]:
df.head()

| Unnamed: 0 | text | text_token | labels | updated | original_labels | base_path | sampled | label_list |
|---|---|---|---|---|---|---|---|---|
| 0 | They can come to your home or somewhere nearby . | ['They', 'can', 'come', 'to', 'your', 'home', ... | [[22, 26, 'LOCATION'], [30, 39, 'LOCATION']] | True | [[22, 26, 'LOCATION'], [30, 39, 'LOCATION']] | /dealing-hmrc-additional-needs | False | ['O', 'O', 'O', 'O', 'O', 'LOCATION', 'O', 'LO... |
| 1 | If you think you should get it but haven ’ t c... | ['If', 'you', 'think', 'you', 'should', 'get',... | [[57, 71, 'ORGANIZATION'], [82, 96, 'LOCATION'... | True | [[82, 96, 'LOCATION'], [57, 71, 'ORGANIZATION'... | /christmas-bonus | False | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 2 | You can not update the memorandum once the com... | ['You', 'can', 'not', 'update', 'the', 'memora... | [[23, 33, 'FORM'], [43, 50, 'ORGANIZATION']] | True | [[22, 33, 'FORM'], [42, 49, 'ORGANIZATION']] | /limited-company-formation | False | ['O', 'O', 'O', 'O', 'O', 'FORM', 'O', 'O', 'O... |
| 3 | You ’ ll be told at the end of your registrati... | ['You', '’', 'll', 'be', 'told', 'at', 'the', ... | [[36, 65, 'EVENT'], [97, 103, 'ORGANIZATION']] | True | [[97, 103, 'ORGANIZATION'], [36, 65, 'EVENT'],... | /register-childminder-agency-england | False | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 4 | The charity or CASC will give you a form to si... | ['The', 'charity', 'or', 'CASC', 'will', 'give... | [[4, 11, 'ORGANIZATION'], [15, 19, 'ORGANIZATI... | True | [[4, 11, 'ORGANIZATION'], [36, 40, 'FORM'], [1... | /income-tax-reliefs | False | ['O', 'ORGANIZATION', 'O', 'ORGANIZATION', 'O'... |
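
The quoted list items in the preview suggest that `labels` and `label_list` are stored as strings in the CSV, so they would need parsing back into Python lists before use. The sketch below does this for row 0 with `ast.literal_eval`, then projects the character spans onto whitespace tokens, which appears to be how `label_list` relates to `labels` in that row. The helper name `spans_to_token_labels` is hypothetical, and splitting on single spaces is an assumption based on the pre-tokenised look of the `text` column.

```python
import ast

def spans_to_token_labels(text, spans):
    """Give each space-separated token the label of the span covering it, else 'O'."""
    labels = []
    pos = 0
    for token in text.split(" "):
        start, end = pos, pos + len(token)
        label = "O"
        for span_start, span_end, span_label in spans:
            # A token is labelled when it lies fully inside an annotated span.
            if start >= span_start and end <= span_end:
                label = span_label
                break
        labels.append(label)
        pos = end + 1  # step over the single space separator
    return labels

# Row 0 of the dataframe above; the labels cell is assumed to be a string.
row_text = "They can come to your home or somewhere nearby ."
raw_labels = "[[22, 26, 'LOCATION'], [30, 39, 'LOCATION']]"

spans = ast.literal_eval(raw_labels)  # safely parse the list literal
token_labels = spans_to_token_labels(row_text, spans)
print(token_labels)
```

In the notebook the parsing step could be applied to the whole column, for example with `df['labels'].apply(ast.literal_eval)`.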


In [31]:
for i in df['text'][:10]:
    print(i)

They can come to your home or somewhere nearby .
If you think you should get it but haven ’ t contact the Jobcentre Plus office or pension centre that deals with your payments .
You can not update the memorandum once the company has been registered .
You ’ ll be told at the end of your registration inspection visit if you can start working as an agency .
The charity or CASC will give you a form to sign .
If you need more help There ’ s more detailed guidance on trusts and Income Tax .
If you get into a dispute with your landlord you need to keep paying rent - otherwise you may be evicted .
They can ask for the meeting to be postponed if this person can ’ t make it .
To decide your tax code HMRC will estimate how much interest you ’ ll get in the current year by looking at how much you got the previous year .
Get help with the calculations You can get help to calculate a week ’ s pay from Acas ( Advisory Conciliation and Arbitration Service ) or Citizens Advice .


In [35]:
for i in df['text'][:10]:
    print(i)
    find_vis_ents(i)
    print()

They can come to your home or somewhere nearby .

If you think you should get it but haven ’ t contact the Jobcentre Plus office or pension centre that deals with your payments .

You can not update the memorandum once the company has been registered .

You ’ ll be told at the end of your registration inspection visit if you can start working as an agency .

The charity or CASC will give you a form to sign .

If you need more help There ’ s more detailed guidance on trusts and Income Tax .

If you get into a dispute with your landlord you need to keep paying rent - otherwise you may be evicted .

They can ask for the meeting to be postponed if this person can ’ t make it .

To decide your tax code HMRC will estimate how much interest you ’ ll get in the current year by looking at how much you got the previous year .

Get help with the calculations You can get help to calculate a week ’ s pay from Acas ( Advisory Conciliation and Arbitration Service ) or Citizens Advice .
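
Visual inspection gives a feel for the model, but a span-level score would make the comparison with the Doccano annotations concrete. The following is a minimal sketch of exact-match precision, recall and F1 over `(start, end, label)` triples. The function name `score_spans`, the `LABEL_MAP` from spaCy labels to the annotation scheme, and the predicted span in the example are all illustrative assumptions, not results from this notebook.

```python
# Hypothetical mapping from spaCy labels to the Doccano label scheme.
LABEL_MAP = {"ORG": "ORGANIZATION", "GPE": "LOCATION", "LOC": "LOCATION"}

def score_spans(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) triples."""
    gold_set = {tuple(span) for span in gold}
    # Translate predicted labels into the gold scheme before comparing.
    pred_set = {(start, end, LABEL_MAP.get(label, label))
                for start, end, label in predicted}
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1

# Gold spans taken from the `labels` column of row 2 in the dataframe preview.
gold = [[23, 33, "FORM"], [43, 50, "ORGANIZATION"]]

# Invented prediction for illustration only. In the notebook it could come from:
#   [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(row_text).ents]
predicted = [(43, 50, "ORG")]

precision, recall, f1 = score_spans(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a prediction that overlaps a gold span but differs by one character counts as both a false positive and a false negative, so a relaxed overlap criterion may be worth considering alongside it.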



