# Named Entity Recognition: Chinese

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER) as applied to Chinese. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Dataset

The example text for Chinese is *敬告中国二万万女同胞* by 秋瑾. (Thanks to Paul Vierthaler for selecting and finding the text.)

**Here's a preview of spaCy's NER tagging of *敬告中国二万万女同胞*.**

If you compare the results to the [English example](Named-Entity-Recognition), you'll notice that the Chinese NER is much worse at recognizing entities, and is especially bad at distinguishing different kinds of entities, like ORG vs. LOC. You need a lot of examples to train a model to distinguish different entity types; currently, English is the only model that does a decent job of it.

You can read more about the [data sources used to train Chinese](https://spacy.io/models/zh) on the spaCy model page.

In [5]:
displacy.render(document, style="ent")

---

## NER with spaCy
If you've already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [2]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We're also going to import the `Counter` class (from Python's `collections` module) for counting people, places, and things, and the `pandas` library for organizing and displaying data. We're also raising the default pandas display limits for the maximum number of rows and the maximum column width.

### Download Language Model

Next we need to download the Chinese-language model (`zh_core_web_md`), which will process our texts and make predictions about them. You can read more about the [data sources used to train Chinese](https://spacy.io/models/zh) on the spaCy model page.

In [None]:
!python -m spacy download zh_core_web_md

### Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

**1.** We can import the model as a module and then load it from the module.

In [3]:
import zh_core_web_md
nlp = zh_core_web_md.load()

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/3r/55b5kjpd4s14_tg80r24vs7r0000gq/T/jieba.cache
Loading model cost 0.590 seconds.
Prefix dict has been built successfully.


**2.** We can load the model by name.

In [4]:
#nlp = spacy.load('zh_core_web_md')

If you just downloaded the model for the first time, it's advisable to use Option 1, because you can then use the model immediately. If you use Option 2 instead, you'll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel... in the JupyterLab menu).

## Process Document

We first need to process our `document` with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read the example document. Then we run `nlp()` on the text to create our `document`.

In [4]:
filepath = '../texts/other-languages/zh.txt'
# Read the whole file as UTF-8 text; the with block closes the file afterwards
with open(filepath, encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from the example document.

In [6]:
document.ents

(中国, 二万万, ;遇, 杂冒, 几岁, 两, 一二, 三年, 三天, 非洲, 陈后, 图安乐, 进学堂, 进学堂, 开学堂, 苦作)

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop.

For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [7]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

中国 GPE
二万万 PERCENT
;遇 PERSON
杂冒 WORK_OF_ART
几岁 DATE
两 CARDINAL
一二 CARDINAL
三年 DATE
三天 DATE
非洲 LOC
陈后 PERSON
图安乐 ORG
进学堂 LOC
进学堂 LOC
开学堂 ORG
苦作 PERSON
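
To get a quick sense of how the labels are distributed, the printed pairs can be tallied with `Counter`. The sketch below is a minimal illustration rather than part of the original notebook: the `entity_pairs` list is copied by hand from the output above so the block runs without spaCy, and the variable names are assumptions. With a live `document`, the same tally could be built from `named_entity.label_` for each entity in `document.ents`.

```python
from collections import Counter

# (text, label) pairs copied by hand from the output above,
# so this sketch runs without loading the spaCy model
entity_pairs = [
    ("中国", "GPE"), ("二万万", "PERCENT"), (";遇", "PERSON"),
    ("杂冒", "WORK_OF_ART"), ("几岁", "DATE"), ("两", "CARDINAL"),
    ("一二", "CARDINAL"), ("三年", "DATE"), ("三天", "DATE"),
    ("非洲", "LOC"), ("陈后", "PERSON"), ("图安乐", "ORG"),
    ("进学堂", "LOC"), ("进学堂", "LOC"), ("开学堂", "ORG"),
    ("苦作", "PERSON"),
]

# Count how many entities received each label
label_tally = Counter(label for _, label in entity_pairs)

# Print the labels from most to least frequent
for label, count in label_tally.most_common():
    print(label, count)
```

A tally like this makes it easier to see which labels dominate before filtering for a single type.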


To extract just the named entities that have been identified as `PERSON` (person), we can add a simple `if` statement into the mix:

In [9]:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)

;遇
陈后
苦作


## NER with Long Texts or Many Texts

In [10]:
import math
number_of_chunks = 80

chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

In [11]:
chunked_documents = list(nlp.pipe(text_chunks))
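
The cells above split the text into roughly equal character chunks and pass them to `nlp.pipe()`, which processes a sequence of texts and yields one processed document per text. As a minimal sketch of the chunking step on its own, here is the same logic wrapped in a hypothetical helper, `chunk_text`, applied to a toy string. The helper name and the guard against a zero chunk size are assumptions, not part of the original notebook.

```python
import math

def chunk_text(text, number_of_chunks):
    """Split text into at most number_of_chunks pieces of equal character length (the last may be shorter)."""
    # Guard against a chunk size of 0, which would make range() fail on empty text
    chunk_size = max(1, math.ceil(len(text) / number_of_chunks))
    return [text[start:start + chunk_size] for start in range(0, len(text), chunk_size)]

toy_chunks = chunk_text("abcdefghij", 3)
print(toy_chunks)  # ['abcd', 'efgh', 'ij']
```

Because the split is by character count, a chunk boundary can fall in the middle of a sentence or a name. That may be why the chunked tally below lists 陈 where the unchunked run found 陈后.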

## Get People

To extract and count the people, we will use an `if` statement that keeps an entity only if its label matches `PERSON`.

In [12]:
people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PERSON":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| | character | count |
|---|---|---|
| 0 | ;遇 | 1 |
| 1 | 陈 | 1 |
| 2 | 苦作 | 1 |


## Get Places

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for labels that match `LOC`.

In [13]:
places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| | place | count |
|---|---|---|
| 0 | 进学堂 | 2 |
| 1 | 非洲 | 1 |
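
Note that in the unchunked output earlier in this lesson, 中国 was tagged `GPE` (geopolitical entity) rather than `LOC`, so a `LOC`-only filter leaves it out. As a sketch of one way to widen the filter, the block below checks membership in a set of labels. The `entity_pairs` list is a handful of pairs copied by hand from that earlier output so the example runs without spaCy; `place_labels` and the other names are assumptions.

```python
from collections import Counter

# A few (text, label) pairs copied by hand from the earlier unchunked output
entity_pairs = [
    ("中国", "GPE"), ("三年", "DATE"), ("非洲", "LOC"), ("陈后", "PERSON"),
    ("进学堂", "LOC"), ("进学堂", "LOC"), ("开学堂", "ORG"),
]

# Treat both locations and geopolitical entities as places
place_labels = {"LOC", "GPE"}

# Keep only the entity texts whose label is in the set, then count them
wider_places_tally = Counter(
    text for text, label in entity_pairs if label in place_labels
)
print(wider_places_tally.most_common())
```

With the chunked documents, the equivalent change would be to replace the `==` comparison with `named_entity.label_ in place_labels`.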


## Get NER in Context

In [14]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):

    # Default to people, organizations, and locations if no labels are given
    if not desired_ner_labels:
        desired_ner_labels = ['PERSON', 'ORG', 'LOC']

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Re-run the model on the text of each sentence
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword is in the entity text (ignoring capitalization) and the label is one we want
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces
                sentence_text = re.sub('\n', ' ', sentence.text)
                # Bold the entity, again ignoring capitalization; re.escape keeps any
                # punctuation in the entity text from being read as regex syntax
                sentence_text = re.sub(re.escape(named_entity.text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                display(Markdown('---'))
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))

In [15]:
for document in chunked_documents:
    get_ner_in_context('进学堂', document)

---

**LOC**

成、名不就;生了儿子，就要送他**进学堂**

---

**LOC**

年姑娘的呢，若能够**进学堂**更好;就不进
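
The bolding step inside `get_ner_in_context` relies on `re.sub`, with the entity text serving as the search pattern. Entity text that happens to contain regex metacharacters (parentheses, `+`, `?`, and so on) could match the wrong thing or raise an error unless it is escaped. A minimal sketch of that one step in isolation, using a hypothetical helper `bold_entity` (the name is an assumption) and `re.escape`, applied to the first sentence shown above:

```python
import re

def bold_entity(sentence_text, entity_text):
    """Wrap every occurrence of entity_text in Markdown bold markers, ignoring case."""
    # re.escape makes the entity text match literally, even if it contains regex metacharacters
    pattern = re.escape(entity_text)
    # A function replacement avoids backslash handling in the replacement string
    return re.sub(pattern, lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

print(bold_entity("成、名不就;生了儿子，就要送他进学堂", "进学堂"))
# 成、名不就;生了儿子，就要送他**进学堂**
```

Since the match is literal, a plain `str.replace` would also work when case-insensitive matching is not needed.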