# Natural Language Processing

### The problem?

- Endless amounts of unstructured data found in emails, tweets, letters, memos, etc.
- Even in transcripts
- How can we make sense of all this data?
- How can we 'easily' find relevant information for our reporting?

### The solution?
- Write code that processes all that text using **natural language processing** (NLP)!
- <a href="https://machinelearningmastery.com/natural-language-processing/">Learn more</a> about the complexity and the history of NLP.

### Journalism examples

- <a href="http://doctors.ajc.com/part_1_license_to_betray/">License to betray</a> – Finding word stems and roots to uncover abuse. (<a href="http://doctors.ajc.com/about_this_investigation/?ecmp=doctorssexabuse_microsite_stories">More info</a>)
- <a href="https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/">Federal judge’s rulings favored companies in which he owned stock</a> – Finding all stock owned by judges in disclosure forms and comparing to caseloads.
- <a href="https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html">LAPD underreported serious assaults, skewing crime stats for 8 years</a> – Text classification analysis.

### The tools

- spaCy vs. NLTK
- NLTK launched in 2001, spaCy in 2015.
- NLTK has grown bloated and complex, and many tasks take several separate steps.
- spaCy is lean and modern, and can process some text 4x to 20x faster than NLTK.
- spaCy does **nearly** everything that NLTK does, but better.
- NLTK, however, is still the library of choice for sentiment analysis.

# Working with spaCy

## Step 1. Install spaCy

If this is the first time you are using spaCy on this computer, first install it with either ```!conda install``` or ```!pip install```:

In [1]:
## pip install
!pip install -U spacy




In [2]:
## import library

import spacy

#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

## Step 2. Install language model


In [3]:
!python -m spacy download en_core_web_trf


Collecting en-core-web-trf==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl (460.2 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [None]:
!python -m spacy download en_core_web_sm

In [4]:
## import that language model
import en_core_web_trf

### Place the English library into an ```nlp``` pipeline

In [5]:
## build the nlp pipeline (a function that tokenizes, parses and runs NER for us)
nlp = en_core_web_trf.load()

In [6]:
## what type of object is nlp
type(nlp)

spacy.lang.en.English

## Step 3. Text analysis

In [29]:
### Sample English text:
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
But he'd rather head to New York city to see Vincent van Gogh's The Starry Night at the Museum of Modern Art. \
Mount Washington, which is really Agiocochook, is the highest peak in the Northeastern United States \
at 6,288.2 ft and the most topographically prominent mountain east \
of the Mississippi River. It's not in the state of Mississippi. \
Ninety-year-old William Shatner, who gained fame portraying Captain James T. Kirk on \
the original "Star Trek," just hitched a ride aboard a suborbital spacecraft \
that grazed the edge of outer space before parachuting to a landing, \
making Shatner the oldest person ever to travel to space. \
This may ground other astronaughts from future flights.
'''

In [8]:
## print the text
print(text)

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to New York city to see Vincent van Gogh's The Starry Night at the Museum of Modern Art. Mount Washington, which is really Agiocochook, is the highest peak in the Northeastern United States at 6,288.2 ft and the most topographically prominent mountain east of the Mississippi River. It's not in the state of Mississippi. Ninety-year-old William Shatner, who gained fame portraying Captain James T. Kirk on the original "Star Trek," just hitched a ride aboard a suborbital spacecraft that grazed the edge of outer space before parachuting to a landing, making Shatner the oldest person ever to travel to space.



### Tokenize our text

- Tokenizing is always the first step in text analysis. 
- It breaks all text into isolated but related units (including spaces, symbols, punctuation, numbers, words etc.)
- However, it retains the connection between all the words, sentences, and paragraphs.
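The bullets above can be checked with a minimal sketch. It assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline that needs no model download. The sentence and the names `blank_nlp`, `sample` and `sample_doc` are illustrative and not part of this notebook.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained model).
blank_nlp = spacy.blank("en")

sample = "Microsoft paid $8.5 billion for Skype."
sample_doc = blank_nlp(sample)

# Each token knows its position in the doc (i) and its character offset (idx).
for token in sample_doc:
    print(token.i, token.idx, repr(token.text), token.is_punct)

# Tokenizing does not throw anything away: joining every token with its
# trailing whitespace rebuilds the original string.
rebuilt = "".join(token.text_with_ws for token in sample_doc)
print(rebuilt == sample)
```

Because the doc keeps the offsets and whitespace, later steps such as entity recognition can still point back to the exact place in the source text.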

In [30]:
## let's run the nlp function and create a spacy doc
doc = nlp(text)
doc

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to New York city to see Vincent van Gogh's The Starry Night at the Museum of Modern Art. Mount Washington, which is really Agiocochook, is the highest peak in the Northeastern United States at 6,288.2 ft and the most topographically prominent mountain east of the Mississippi River. It's not in the state of Mississippi. Ninety-year-old William Shatner, who gained fame portraying Captain James T. Kirk on the original "Star Trek," just hitched a ride aboard a suborbital spacecraft that grazed the edge of outer space before parachuting to a landing, making Shatner the oldest person ever to travel to space. This may ground other astronaughts from future flights.

In [10]:
## what type of data is it?
type(doc)

spacy.tokens.doc.Doc

In [11]:
## show each token
for item in doc:
    print(item)
    print("********")

On
********
May
********
10
********
,
********
2011
********
,
********
Microsoft
********
announced
********
its
********
acquisition
********
of
********
 
********
Skype
********
Technologies
********
,
********
creator
********
of
********
the
********
 
********
VoIP
********
 
********
service
********
 
********
Skype
********
,
********
for
********
$
********
8.5
********
billion
********
.
********
Microsoft
********
is
********
headquartered
********
near
********
Seattle
********
Washington
********
while
********
Skype
********
remains
********
in
********
Palo
********
Alto
********
,
********
California
********
.
********
Sandeep
********
Junnarkar
********
got
********
this
********
from
********
Wikipedia
********
.
********
But
********
he
********
'd
********
rather
********
head
********
to
********
New
********
York
********
city
********
to
********
see
********
Vincent
********
van
********
Gogh
********
's
********
The
********
Starry
********
Night
********
a

### Stop Words

- These are common words that add little meaning to our analysis.
- Words like ```the```, ```and``` and ```any```.
- spaCy has 326 ```stop words``` in its default English list.
- Read more on <a href="https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47">stop words</a>

In [12]:
## show all default stop words
nlp.Defaults.stop_words

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [18]:
## check if a word is a stop word (also try "have", "near", "be")
nlp.vocab["is"].is_stop


True

In [19]:
## how many stop words do we have?
len(nlp.Defaults.stop_words)

326

In [21]:
## Add your own stop word
nlp.Defaults.stop_words.add("lol")   ## adds stop word to the default set
nlp.vocab["lol"].is_stop = True      ## flags it as a stop word in current memory

In [23]:
## CHECK IF 'lol' is a stop word
nlp.vocab["lol"].is_stop

True

In [25]:
## how many stop words do we have now?
len(nlp.Defaults.stop_words)

327

In [31]:
## show each token's part of speech: numeric ID (pos) and readable label (pos_)
for token in doc:
    print(f"{token.text}: {token.pos} - {token.pos_}")

On: 85 - ADP
May: 96 - PROPN
10: 93 - NUM
,: 97 - PUNCT
2011: 93 - NUM
,: 97 - PUNCT
Microsoft: 96 - PROPN
announced: 100 - VERB
its: 95 - PRON
acquisition: 92 - NOUN
of: 85 - ADP
 : 103 - SPACE
Skype: 96 - PROPN
Technologies: 96 - PROPN
,: 97 - PUNCT
creator: 92 - NOUN
of: 85 - ADP
the: 90 - DET
 : 103 - SPACE
VoIP: 92 - NOUN
 : 103 - SPACE
service: 92 - NOUN
 : 103 - SPACE
Skype: 96 - PROPN
,: 97 - PUNCT
for: 85 - ADP
$: 99 - SYM
8.5: 93 - NUM
billion: 93 - NUM
.: 97 - PUNCT
Microsoft: 96 - PROPN
is: 87 - AUX
headquartered: 100 - VERB
near: 85 - ADP
Seattle: 96 - PROPN
Washington: 96 - PROPN
while: 98 - SCONJ
Skype: 96 - PROPN
remains: 100 - VERB
in: 85 - ADP
Palo: 96 - PROPN
Alto: 96 - PROPN
,: 97 - PUNCT
California: 96 - PROPN
.: 97 - PUNCT
Sandeep: 96 - PROPN
Junnarkar: 96 - PROPN
got: 100 - VERB
this: 95 - PRON
from: 85 - ADP
Wikipedia: 96 - PROPN
.: 97 - PUNCT
But: 89 - CCONJ
he: 95 - PRON
'd: 87 - AUX
rather: 86 - ADV
head: 100 - VERB
to: 85 - ADP
New: 96 - PROPN
York: 96 - PRO

In [None]:
## Remove a stop word from list because it is relevant.
## notice the word "empty" is a stop word.

## remove from default
 ## remove from current memory

In [None]:
## CHECK IF 'empty' is a stop word
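One way to fill in the two cells above, mirroring how `lol` was added earlier. This is a minimal sketch that assumes spaCy is installed; it uses a blank English pipeline named `stop_nlp` (an illustrative name) so it runs without a model download. In the notebook you would apply the same two lines to the existing `nlp` object.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because stop words
# come from the language defaults rather than from a trained model.
stop_nlp = spacy.blank("en")

# "empty" appears in the default stop-word list shown above.
print(stop_nlp.vocab["empty"].is_stop)

# Remove it from the default set, then unflag it in current memory.
stop_nlp.Defaults.stop_words.discard("empty")
stop_nlp.vocab["empty"].is_stop = False

# CHECK IF 'empty' is still a stop word
print(stop_nlp.vocab["empty"].is_stop)
```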


### Parts of speech



In [None]:
## print all parts of speech words


## Step 4. Named Entity Recognition (NER)

#### spaCy easily returns the words that matter to us, like names of companies, people, places, artworks, numbers, etc.

- ```.ents``` ------------> All entities found in the spaCy doc object.

- ```ent.text``` ------------> The actual text.

- ```ent.label``` ------------> A numeric code for the entity label.

- ```ent.label_``` ------------> The entity's category as readable text.

- ```spacy.explain(ent.label_)``` ---------> A description of the category.
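The four attributes listed above can be seen side by side in a small sketch. It assumes spaCy is installed and, to avoid a model download, labels one entity by hand on a blank pipeline; a trained model such as `en_core_web_trf` would predict the entities instead. The names `demo_nlp` and `demo_doc` are illustrative.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline, so the entity below is set by hand
# rather than predicted by a trained model.
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Microsoft bought Skype.")
demo_doc.ents = [Span(demo_doc, 0, 1, label="ORG")]  # token 0 is "Microsoft"

for ent in demo_doc.ents:
    print(ent.text)                   # the actual text
    print(ent.label)                  # numeric code for the label
    print(ent.label_)                 # readable label
    print(spacy.explain(ent.label_))  # description of the label
```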




In [32]:
## print text
print(text)

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to New York city to see Vincent van Gogh's The Starry Night at the Museum of Modern Art. Mount Washington, which is really Agiocochook, is the highest peak in the Northeastern United States at 6,288.2 ft and the most topographically prominent mountain east of the Mississippi River. It's not in the state of Mississippi. Ninety-year-old William Shatner, who gained fame portraying Captain James T. Kirk on the original "Star Trek," just hitched a ride aboard a suborbital spacecraft that grazed the edge of outer space before parachuting to a landing, making Shatner the oldest person ever to travel to space. This may ground other astronaughts from future flights.



In [None]:
## loop through doc


In [36]:
## find all entities
for word in doc.ents:
    print(f"{word}---> {word.label_} ----> {spacy.explain(word.label_)}")
    

May 10, 2011---> DATE ----> Absolute or relative dates or periods
Microsoft---> ORG ----> Companies, agencies, institutions, etc.
Skype Technologies---> ORG ----> Companies, agencies, institutions, etc.
$8.5 billion---> MONEY ----> Monetary values, including unit
Microsoft---> ORG ----> Companies, agencies, institutions, etc.
Seattle---> GPE ----> Countries, cities, states
Washington---> GPE ----> Countries, cities, states
Skype---> ORG ----> Companies, agencies, institutions, etc.
Palo Alto---> GPE ----> Countries, cities, states
California---> GPE ----> Countries, cities, states
Sandeep Junnarkar---> PERSON ----> People, including fictional
Wikipedia---> ORG ----> Companies, agencies, institutions, etc.
New York---> GPE ----> Countries, cities, states
Vincent van Gogh's---> PERSON ----> People, including fictional
The Starry Night---> WORK_OF_ART ----> Titles of books, songs, etc.
the Museum of Modern Art---> ORG ----> Companies, agencies, institutions, etc.
Mount Washington---> LOC 

In [None]:
## find all entities with their label


In [None]:
## find all entities with their label and label descriptors


### Create a CSV that holds all the organizations/companies in a document

In [39]:
## find all entities and place in a list using list comprehension


entities = [word.text for word in doc.ents] ## find all entities
entities 

## find all entity labels
ent_labels = [word.label_ for word in doc.ents]
ent_labels[0:10]

['DATE', 'ORG', 'ORG', 'MONEY', 'ORG', 'GPE', 'GPE', 'ORG', 'GPE', 'GPE']

In [40]:
entities[0:10]

['May 10, 2011',
 'Microsoft',
 'Skype Technologies',
 '$8.5 billion',
 'Microsoft',
 'Seattle',
 'Washington',
 'Skype',
 'Palo Alto',
 'California']

In [41]:
### Turn the two lists into a list of dictionaries using a for loop
my_entities_fl = []
for (key, value) in zip(ent_labels, entities):
    my_dict = {key: value}
    my_entities_fl.append(my_dict)
    
my_entities_fl

[{'DATE': 'May 10, 2011'},
 {'ORG': 'Microsoft'},
 {'ORG': 'Skype Technologies'},
 {'MONEY': '$8.5 billion'},
 {'ORG': 'Microsoft'},
 {'GPE': 'Seattle'},
 {'GPE': 'Washington'},
 {'ORG': 'Skype'},
 {'GPE': 'Palo Alto'},
 {'GPE': 'California'},
 {'PERSON': 'Sandeep Junnarkar'},
 {'ORG': 'Wikipedia'},
 {'GPE': 'New York'},
 {'PERSON': "Vincent van Gogh's"},
 {'WORK_OF_ART': 'The Starry Night'},
 {'ORG': 'the Museum of Modern Art'},
 {'LOC': 'Mount Washington'},
 {'LOC': 'Agiocochook'},
 {'LOC': 'the Northeastern United States'},
 {'QUANTITY': '6,288.2 ft'},
 {'LOC': 'the Mississippi River'},
 {'GPE': 'Mississippi'},
 {'DATE': 'Ninety-year-old'},
 {'PERSON': 'William Shatner'},
 {'PERSON': 'James T. Kirk'},
 {'WORK_OF_ART': 'Star Trek'},
 {'PERSON': 'Shatner'}]

In [43]:
my_entities_fl2 = []
for (key, value) in zip(ent_labels, entities):
    if key == "PERSON":
        my_dict = {key: value}
        my_entities_fl2.append(my_dict)
    
my_entities_fl2

[{'PERSON': 'Sandeep Junnarkar'},
 {'PERSON': "Vincent van Gogh's"},
 {'PERSON': 'William Shatner'},
 {'PERSON': 'James T. Kirk'},
 {'PERSON': 'Shatner'}]

In [44]:
import pandas as pd

In [46]:
df = pd.DataFrame(my_entities_fl2)
df.to_csv("people_only.csv", encoding = "UTF-8", index= False)

In [47]:
### Turn the two lists into a list of dictionaries using
### a dictionary comprehension within a list comprehension
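A possible one-line version of the for loop used earlier. The sketch below is self-contained, so it uses short sample lists (`sample_labels`, `sample_entities`, both illustrative names) copied from the output above; in the notebook you would zip `ent_labels` and `entities` instead.

```python
# Sample values copied from the entity output above.
sample_labels = ["DATE", "ORG", "ORG", "MONEY", "ORG"]
sample_entities = ["May 10, 2011", "Microsoft", "Skype Technologies",
                   "$8.5 billion", "Microsoft"]

# Inner braces build one {label: text} dict per pair;
# the outer brackets collect those dicts into a list.
my_entities_lc = [{label: text}
                  for label, text in zip(sample_labels, sample_entities)]
print(my_entities_lc)
```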



In [None]:
## the previous lists hold all entities. 
## let's narrow them down to the orgs/companies
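Adding a condition to the comprehension is one way to keep only the organizations, in the same spirit as the `PERSON` filter above. The sample lists are illustrative stand-ins for `ent_labels` and `entities`; the result is named `orgs_only` to match the CSV cell further down.

```python
# Sample values copied from the entity output above.
sample_labels = ["DATE", "ORG", "ORG", "MONEY", "ORG", "GPE"]
sample_entities = ["May 10, 2011", "Microsoft", "Skype Technologies",
                   "$8.5 billion", "Microsoft", "Seattle"]

# Keep a pair only when its label is ORG.
orgs_only = [{label: text}
             for label, text in zip(sample_labels, sample_entities)
             if label == "ORG"]
print(orgs_only)
```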


In [None]:
## What data types are these?


### Let's deduplicate

We could also wait and deduplicate later in pandas, for example with `unique()`.

In [None]:
## deduplicate a dictionary
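A minimal sketch of deduplicating a list of dictionaries while keeping the original order. Dictionaries cannot go into a set directly, so each one is converted to a tuple of its items first. The list `orgs_with_dupes` is illustrative sample data, not a variable from this notebook.

```python
# Sample data with one repeated organization.
orgs_with_dupes = [{"ORG": "Microsoft"},
                   {"ORG": "Skype Technologies"},
                   {"ORG": "Microsoft"}]

seen = set()
unique_orgs = []
for item in orgs_with_dupes:
    key = tuple(item.items())  # hashable stand-in for the dict
    if key not in seen:
        seen.add(key)
        unique_orgs.append(item)

print(unique_orgs)
```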


In [None]:
## import pandas
import pandas as pd

In [None]:
## use pandas to write to a csv file
filename = "test_entities.csv"
df = pd.DataFrame(orgs_only) ## turn our list of dicts into a dataframe called df
df.to_csv(filename, encoding='utf-8', index=False)

print(f"{filename} is in your project folder!")

In [49]:

## function to find entities
def show_entities(my_text):
    '''
    Print every entity in my_text with its label and label description.
    my_text must be a spaCy doc object that has
    already been run through the nlp pipeline.
    '''
    each_token = "Token"
    entity_type = "Entity"
    entity_def = "Entity Defined"
    print(f"{each_token:{30}}{entity_type:{15}}{entity_def}")
    if my_text.ents:
        ## loop over the doc that was passed in, not the global doc
        for word in my_text.ents:
            print(f"{word.text:{30}} {word.label_:{15}} {str(spacy.explain(word.label_))}")
    else:
        print("There are no entities in this text")


In [50]:
## show entities in my english sentence
show_entities(doc)

Token                         Entity         Entity Defined
May 10, 2011                   DATE            Absolute or relative dates or periods
Microsoft                      ORG             Companies, agencies, institutions, etc.
Skype Technologies             ORG             Companies, agencies, institutions, etc.
$8.5 billion                   MONEY           Monetary values, including unit
Microsoft                      ORG             Companies, agencies, institutions, etc.
Seattle                        GPE             Countries, cities, states
Washington                     GPE             Countries, cities, states
Skype                          ORG             Companies, agencies, institutions, etc.
Palo Alto                      GPE             Countries, cities, states
California                     GPE             Countries, cities, states
Sandeep Junnarkar              PERSON          People, including fictional
Wikipedia                      ORG             Companies, age

## Word Frequency

In [52]:
from collections import Counter  ## a class that helps us count up frequency
## Counter(some_variable)
## variable_name.most_common(some_number)

## remove stop words and punctuation



In [55]:
words = [word.text for word in doc
         if not word.is_stop
         and not word.is_punct
         and word.text != "\xa0"]
words

['10',
 '2011',
 'Microsoft',
 'announced',
 'acquisition',
 'Skype',
 'Technologies',
 'creator',
 'VoIP',
 'service',
 'Skype',
 '$',
 '8.5',
 'billion',
 'Microsoft',
 'headquartered',
 'near',
 'Seattle',
 'Washington',
 'Skype',
 'remains',
 'Palo',
 'Alto',
 'California',
 'Sandeep',
 'Junnarkar',
 'got',
 'Wikipedia',
 'head',
 'New',
 'York',
 'city',
 'Vincent',
 'van',
 'Gogh',
 'Starry',
 'Night',
 'Museum',
 'Modern',
 'Art',
 'Mount',
 'Washington',
 'Agiocochook',
 'highest',
 'peak',
 'Northeastern',
 'United',
 'States',
 '6,288.2',
 'ft',
 'topographically',
 'prominent',
 'mountain',
 'east',
 'Mississippi',
 'River',
 'state',
 'Mississippi',
 'Ninety',
 'year',
 'old',
 'William',
 'Shatner',
 'gained',
 'fame',
 'portraying',
 'Captain',
 'James',
 'T.',
 'Kirk',
 'original',
 'Star',
 'Trek',
 'hitched',
 'ride',
 'aboard',
 'suborbital',
 'spacecraft',
 'grazed',
 'edge',
 'outer',
 'space',
 'parachuting',
 'landing',
 'making',
 'Shatner',
 'oldest',
 'person',

In [56]:
word_freq = Counter(words)
word_freq

Counter({'10': 1,
         '2011': 1,
         'Microsoft': 2,
         'announced': 1,
         'acquisition': 1,
         'Skype': 3,
         'Technologies': 1,
         'creator': 1,
         'VoIP': 1,
         'service': 1,
         '$': 1,
         '8.5': 1,
         'billion': 1,
         'headquartered': 1,
         'near': 1,
         'Seattle': 1,
         'Washington': 2,
         'remains': 1,
         'Palo': 1,
         'Alto': 1,
         'California': 1,
         'Sandeep': 1,
         'Junnarkar': 1,
         'got': 1,
         'Wikipedia': 1,
         'head': 1,
         'New': 1,
         'York': 1,
         'city': 1,
         'Vincent': 1,
         'van': 1,
         'Gogh': 1,
         'Starry': 1,
         'Night': 1,
         'Museum': 1,
         'Modern': 1,
         'Art': 1,
         'Mount': 1,
         'Agiocochook': 1,
         'highest': 1,
         'peak': 1,
         'Northeastern': 1,
         'United': 1,
         'States': 1,
         '6,288.2': 1,

In [59]:
top_words = word_freq.most_common(4)
top_words

[('Skype', 3), ('Microsoft', 2), ('Washington', 2), ('Mississippi', 2)]

In [None]:
#remove that weird unicode
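The "weird unicode" is assumed here to be the non-breaking space `\xa0` that the word filter above already excludes. A more general sketch drops any whitespace-only string, relying on the fact that `str.strip()` also removes Unicode whitespace. The list `raw_words` is illustrative sample data.

```python
# Sample tokens; "\xa0" is a non-breaking space.
raw_words = ["acquisition", "\xa0", "Skype", " ", "VoIP", "\xa0"]

# Whitespace-only strings become empty after strip() and are filtered out.
clean_words = [word for word in raw_words if word.strip()]
print(clean_words)
```

When working with spaCy tokens directly, checking `token.is_space` in the list comprehension is an alternative to comparing against `"\xa0"`.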


## Install other languages
#### Other languages can be found at https://spacy.io/usage/models

#### Disclaimer: Language models are built by open source communities. English and German are the most advanced language models.

### Spanish language model

In [60]:
## download the Spanish language model
!python -m spacy download es_core_news_sm

Collecting es-core-news-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.2.0/es_core_news_sm-3.2.0-py3-none-any.whl (14.0 MB)
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [61]:
## import the library and create the nlp pipeline
import es_core_news_sm
nlp = es_core_news_sm.load()

In [62]:
### Sample Spanish Text (sorry!)
stext = """
El 10 de mayo de 2011, Microsoft anunció la adquisición de Skype Technologies, \
creador del servicio de VoIP Skype, por 8.500 millones de dólares. Microsoft tiene \
su sede cerca de Seattle Washington, mientras que Skype permanece en Palo Alto, California. \
Sandeep Junnarkar obtuvo esto de Wikipedia. Pero preferiría ir a la ciudad \
de Nueva York para ver La noche estrellada de Vincent van Gogh en el Museo de Arte Moderno. \
El monte Washington, que en realidad es Agiocochook, es el pico más alto del noreste \
de los Estados Unidos con 6.288,2 pies y la montaña más prominente topográficamente \
al este del río Mississippi. No está en el estado de Mississippi. \
William Shatner, de noventa años, quien saltó a la fama por interpretar al \
capitán James T. Kirk en la película original "Star Trek", acaba de subirse a \
una nave espacial suborbital que rozó el borde del espacio exterior antes \
de lanzarse en paracaídas hacia un aterrizaje, convirtiendo a Shatner en el más antiguo. \
persona que haya viajado al espacio.
"""

In [63]:
## run the Spanish text through the pipeline to create a spaCy doc
doc = nlp(stext)

In [None]:
## show the tokens


In [64]:
## show entities
show_entities(doc)

Token                         Entity         Entity Defined
Microsoft                      ORG             Companies, agencies, institutions, etc.
Skype Technologies             MISC            Miscellaneous entities, e.g. events, nationalities, products or works of art
VoIP Skype                     MISC            Miscellaneous entities, e.g. events, nationalities, products or works of art
Microsoft                      ORG             Companies, agencies, institutions, etc.
Seattle                        LOC             Non-GPE locations, mountain ranges, bodies of water
Washington                     ORG             Companies, agencies, institutions, etc.
Skype                          MISC            Miscellaneous entities, e.g. events, nationalities, products or works of art
Palo Alto                      LOC             Non-GPE locations, mountain ranges, bodies of water
California                     LOC             Non-GPE locations, mountain ranges, bodies of water
Sandeep Ju

### Chinese language model

In [None]:
## download the Chinese language model


In [None]:
## import the library and create the nlp pipeline


In [None]:
### Sample Chinese Text (sorry!)
ctext = '''
2011年5月10日，微軟宣布收購Skype Technologies，
VoIP服務的創造者，價格為85億美元。
微軟總部位於華盛頓州西雅圖市附近，而Skype仍位於加利福尼亞州帕洛阿爾托。\
Sandeep Junnarkar從Wikipedia獲得了此信息。\
但他寧願前往法國巴黎在羅浮宮看《蒙娜麗莎》。\
華盛頓山（實際上是Agiocochook）是美國東北部的最高峰\
位於6,288.2英尺，是東面地形最突出的山脈\
密西西比河。
'''

In [None]:
## create a spacy doc object


In [None]:
## run our function!
