## Redaction of Names from Documents Using spaCy

#### Applications of Named Entity Recognition
- Sanitization (redaction) is the process of removing, or sometimes encrypting, sensitive information in a document or other message so that it can be distributed to a broader audience. Named entity recognition can locate such information, for example people's names, automatically.

#### Purposes of Sanitizing/Redacting a Document
- To keep the sources named in the document anonymous
- To ensure the document contains no sensitive or personally identifiable information
- Censorship

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
doc = nlp(open('covid_research.txt').read())

In [8]:
# doc.ents yields entity spans, not tokens; print those labelled as a person
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        print('{:<30}{:<30}'.format(ent.text, ent.label_))

Sonya Babu-Narayan            PERSON                        
Hugo Pedder                   PERSON                        
Guo et al.                    PERSON                        
Madjid et al.                 PERSON                        
Mohammad Madjid               PERSON                        
Guo                           PERSON                        
Prof Kevin McConway           PERSON                        
Guo et al.                    PERSON                        
Prof Naveed Sattar            PERSON                        
Prof Tim Chico                PERSON                        
Troponin                      PERSON                        
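
The listing above contains repeats ("Guo et al." appears twice) and at least one likely false positive ("Troponin" is a cardiac protein, not a person). Tallying the distinct names makes such cases easier to spot before redacting. The following is a minimal sketch, not part of the original notebook: `tally_entities` and the `found` list are hypothetical names, and `found` is a hand-copied subset of the output above. With spaCy the list could instead be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter


def tally_entities(entities, label="PERSON"):
    """Count how often each entity text occurs for the given label.

    `entities` is an iterable of (text, label) pairs.
    """
    return Counter(text for text, ent_label in entities if ent_label == label)


# Hypothetical sample: a hand-copied subset of the entities printed above
found = [
    ("Guo et al.", "PERSON"),
    ("Mohammad Madjid", "PERSON"),
    ("Guo", "PERSON"),
    ("Guo et al.", "PERSON"),
    ("Troponin", "PERSON"),
]

counts = tally_entities(found)
for name, n in counts.most_common():
    print('{:<30}{}'.format(name, n))
```

Entries that occur once and do not look like names are candidates for manual review.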


In [9]:
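# Function to sanitize/redact names
def sanitize_names(text):
    docx = nlp(text)
    redacted_tokens = []
    # Merge each multi-word entity into a single token
    # (Span.merge() was removed in spaCy v3; use the retokenizer instead)
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent)
    for token in docx:
        if token.ent_type_ == 'PERSON':
            # Keep the trailing whitespace so neighbouring words do not run together
            redacted_tokens.append("[REDACTED]" + token.whitespace_)
        else:
            # token.string was removed in spaCy v3; text_with_ws is the equivalent
            redacted_tokens.append(token.text_with_ws)
    return "".join(redacted_tokens)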
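
An alternative to merging tokens is to work from character offsets. The sketch below is an assumption-laden illustration rather than the notebook's method: `redact_spans` is a hypothetical helper that takes plain `(start_char, end_char, label)` tuples, which with spaCy could be built as `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`. It is written in pure Python so it runs without a model; the sample sentence and its spans are made up for the example.

```python
def redact_spans(text, spans, labels=("PERSON",), placeholder="[REDACTED]"):
    """Replace labelled character spans in `text` with a placeholder.

    `spans` is an iterable of (start_char, end_char, label) tuples.
    Spans with other labels, or overlapping an earlier redaction, are skipped.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in labels or start < cursor:
            continue
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # remainder after the last redaction
    return "".join(pieces)


def span_of(text, phrase, label):
    """Hypothetical helper: locate `phrase` in `text` and return its span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


sample = "Dr Sonya Babu-Narayan spoke with Prof Tim Chico in London."
sample_spans = [
    span_of(sample, "Sonya Babu-Narayan", "PERSON"),
    span_of(sample, "Prof Tim Chico", "PERSON"),
    span_of(sample, "London", "GPE"),
]

redacted = redact_spans(sample, sample_spans)
print(redacted)  # Dr [REDACTED] spoke with [REDACTED] in London.
```

Because only the text between spans is copied, the original whitespace and punctuation are preserved exactly, and other labels can be redacted by passing a different `labels` tuple.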

In [10]:
ex1 = open('covid_research.txt').read()

In [11]:
# Redact the Names
sanitize_names(ex1)

'Dr [REDACTED], Associate Medical Director at the British Heart Foundation and Honorary Consultant Cardiologist, said: \n\n“Every day we learn more about Covid-19. Information to date suggests that people with heart disease, or are at risk of heart disease due to factors such as high blood pressure, diabetes or being severely overweight with a body mass index higher than 40, are at an increased risk of complications caused by the virus.\n\n“If you have one of these conditions you should be taking all precautions possible to reduce your chance of catching the virus.\n\n“Viruses can cause significant inflammation which can injure the heart and can worsen a person’s existing heart condition even if the virus does not enter the heart directly.\n\n“Evidence shows that people with higher levels of a protein used to measure heart injury in their blood are more likely to die after contracting Covid-19. \n\n“However this kind of observational evidence can’t tell us why some people suffer heart 

In [12]:
from spacy import displacy

In [18]:
displacy.render(doc, style='ent')

In [16]:
mytext = nlp(sanitize_names(ex1))

In [17]:
displacy.render(mytext, style='ent')