[Reference](https://towardsdatascience.com/an-easy-introduction-to-natural-language-processing-b1e2801291c1)

# A few dependencies

First we’ll install a few useful Python NLP libraries that will aid us in analysing this text.

### Installing spaCy, general Python NLP lib
`pip3 install spacy`
### Downloading the English language model for spaCy
`python3 -m spacy download en_core_web_lg`
### Installing textacy, basically a useful add-on to spaCy
`pip3 install textacy`

# Entity Analysis 

Now that everything is installed, we can do a quick entity analysis of our text. Entity analysis will go through your text and identify all of the important words or “entities” in the text. When we say “important” what we really mean is words that have some kind of real-world semantic meaning or significance.

In [11]:
### coding: utf-8
import spacy

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."

In [6]:
### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

We first load spaCy’s learned ML model and initialise the text we want to process. We then run the model on our text to extract the entities. When you run the code below, you’ll get the following output:

In [7]:
### print out all the named entities that were detected
for entity in document.ents:
    print(entity.text, entity.label_)

Amazon.com, Inc. ORG
Amazon ORG
American NORP
Seattle GPE
Washington GPE
Jeff Bezos PERSON
July 5, 1994 DATE
second ORDINAL
Alibaba Group ORG
amazon.com ORG
Fire TV ORG
Echo -  LOC
PaaS ORG
Amazon ORG
AmazonBasics ORG


The [labels](https://spacy.io/usage/linguistic-features#entity-types) beside the text indicate the type of entity we are looking at. Looks like our model did a pretty good job! Jeff Bezos is indeed a person, the date is identified correctly, Amazon is an organisation, and both Seattle and Washington are geopolitical entities (i.e., countries, cities, states, etc.). The main mistakes are on product names: Fire TV and Echo are actually products, but the model labelled them as an organisation and a location respectively. It also missed out on the other things that Amazon sells, “*video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry*,” probably because they are common nouns in a big, uncapitalised list rather than proper names.
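
To make the labels easier to read, here is a minimal sketch that attaches a short description to each one. The descriptions follow spaCy's English label scheme, and the helper name `describe_entities` is just an illustration, not part of spaCy. In a live session, `spacy.explain(label)` gives you the same kind of description.

```python
# Descriptions for the labels that appear in the output above.
# This is a convenience table for illustration, not an exhaustive reference.
LABEL_DESCRIPTIONS = {
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "DATE": "Absolute or relative dates or periods",
    "ORDINAL": '"first", "second", etc.',
}


def describe_entities(pairs):
    """Turn (text, label) pairs into human-readable lines."""
    lines = []
    for text, label in pairs:
        description = LABEL_DESCRIPTIONS.get(label, "unknown label")
        lines.append(f"{text} ({label}): {description}")
    return lines


# A few (entity.text, entity.label_) pairs copied from the output above.
sample = [("Jeff Bezos", "PERSON"), ("Seattle", "GPE"), ("July 5, 1994", "DATE")]
for line in describe_entities(sample):
    print(line)
```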

Overall our model has accomplished what we wanted to. Imagine we had a huge document full of hundreds of pages of text. This NLP model could quickly get you an overview of what the document is about and what the key entities in it are.
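
As a rough sketch of what such an overview could look like, the snippet below groups the detected entities by label and counts how often each one occurs. It works on plain `(text, label)` pairs, here copied from the output above; with spaCy you could build the same list with `[(e.text, e.label_) for e in document.ents]`. The function name `summarize_entities` is only an illustration.

```python
from collections import Counter, defaultdict


def summarize_entities(pairs):
    """Group (text, label) pairs by label and count each entity text."""
    by_label = defaultdict(Counter)
    for text, label in pairs:
        by_label[label][text] += 1
    return by_label


# Entities copied from the output above.
pairs = [
    ("Amazon.com, Inc.", "ORG"), ("Amazon", "ORG"), ("American", "NORP"),
    ("Seattle", "GPE"), ("Washington", "GPE"), ("Jeff Bezos", "PERSON"),
    ("July 5, 1994", "DATE"), ("second", "ORDINAL"), ("Alibaba Group", "ORG"),
    ("amazon.com", "ORG"), ("Fire TV", "ORG"), ("Echo -", "LOC"),
    ("PaaS", "ORG"), ("Amazon", "ORG"), ("AmazonBasics", "ORG"),
]

summary = summarize_entities(pairs)

# Show the labels with the most mentions first, with their top entities.
for label, counts in sorted(summary.items(), key=lambda item: -sum(item[1].values())):
    print(label, counts.most_common(3))
```

On a long document, the most frequent organisations, people and places give a quick first impression of what the text is about.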

# Operating on entities

Let’s try something a bit more practical. Let’s say you have the same block of text as above, but you would like to remove the names of all people and organisations automatically, for privacy reasons. We can write a small `scrub` function on top of spaCy’s entity annotations to scrub away any entity categories we don’t want to see. Here’s what that would look like:

In [8]:
### Replace a specific entity with the word "PRIVATE"
def replace_entity_with_placeholder(token):
    if token.ent_iob != 0 and (token.ent_type_ == "PERSON" or token.ent_type_ == "ORG"):
        return "[PRIVATE] "
    else:
        ### token.string was removed in spaCy 3; text_with_ws is the equivalent
        return token.text_with_ws

In [9]:
### Loop through all the entities in a piece of text and apply entity replacement
def scrub(text):
    doc = nlp(text)
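    ### Merge each multi-word entity into a single token
    ### (Span.merge() was removed in spaCy 3; use the retokenizer instead)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)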
    tokens = map(replace_entity_with_placeholder, doc)
    return "".join(tokens)

In [10]:
print(scrub(text))

[PRIVATE] , doing business as [PRIVATE] , is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by [PRIVATE] on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after [PRIVATE] in terms of total sales. The [PRIVATE] website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, [PRIVATE] , and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and [PRIVATE] ). [PRIVATE] also sells certain low-end products under its in-house brand [PRIVATE] .


That worked great! This is actually an incredibly powerful technique. People use the `ctrl + f` function on their computer all the time to find and replace words in their document. But with NLP, we can find and replace *specific entities*, taking into account their semantic meaning and not just their raw text.
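
For comparison, here is a minimal sketch of a different way to do the same replacement: instead of merging tokens, it works on character offsets. It assumes you already have non-overlapping `(start, end, label)` spans, which is what `(ent.start_char, ent.end_char, ent.label_)` gives you for each entity in spaCy. The `span_of` helper is hypothetical and only stands in for the model here, so that the example runs without spaCy.

```python
def span_of(text, phrase, label):
    """Hypothetical stand-in for the NER model: locate `phrase` in `text`
    and return a (start, end, label) span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def redact(text, spans, labels=("PERSON", "ORG"), placeholder="[PRIVATE]"):
    """Replace every span whose label is in `labels` with `placeholder`.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in labels:
            continue
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # whatever follows the last entity
    return "".join(pieces)


sentence = "Jeff Bezos founded Amazon in Seattle."
spans = [
    span_of(sentence, "Jeff Bezos", "PERSON"),
    span_of(sentence, "Amazon", "ORG"),
    span_of(sentence, "Seattle", "GPE"),
]
print(redact(sentence, spans))
```

Because the original text is sliced rather than rebuilt from tokens, the spacing around punctuation is left exactly as it was.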



# Extracting information from text

The library that we installed previously, `textacy`, implements several common NLP information extraction algorithms on top of spaCy. It’ll let us do a few more advanced things than the simple out-of-the-box stuff.

One of the algorithms it implements is called [Semi-structured Statement Extraction](https://www.pydoc.io/pypi/textacy-0.5.0/autoapi/extract/index.html#extract.semistructured_statements). This algorithm takes some of the information that spaCy’s NLP model was able to extract and uses it to grab more specific information about certain entities. In a nutshell, we can extract certain “facts” about the entity of our choice.

Let’s see what that looks like in code. For this one, we’re going to take the *entire summary* of Washington, D.C.’s Wikipedia page.

In [12]:
# coding: utf-8

import spacy
import textacy.extract

In [14]:
### The text we want to examine
text = """Washington, D.C., formally the District of Columbia and commonly referred to as Washington or D.C., is the capital of the United States of America.[4] Founded after the American Revolution as the seat of government of the newly independent country, Washington was named after George Washington, first President of the United States and Founding Father.[5] Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6] As the seat of the United States federal government and several international organizations, the city is an important world political capital.[7] Washington is one of the most visited cities in the world, with more than 20 million annual tourists.[8][9]
The signing of the Residence Act on July 16, 1790, approved the creation of a capital district located along the Potomac River on the country's East Coast. The U.S. Constitution provided for a federal district under the exclusive jurisdiction of the Congress and the District is therefore not a part of any state. The states of Maryland and Virginia each donated land to form the federal district, which included the pre-existing settlements of Georgetown and Alexandria. Named in honor of President George Washington, the City of Washington was founded in 1791 to serve as the new national capital. In 1846, Congress returned the land originally ceded by Virginia; in 1871, it created a single municipal government for the remaining portion of the District.
Washington had an estimated population of 693,972 as of July 2017, making it the 20th largest American city by population. Commuters from the surrounding Maryland and Virginia suburbs raise the city's daytime population to more than one million during the workweek. The Washington metropolitan area, of which the District is the principal city, has a population of over 6 million, the sixth-largest metropolitan statistical area in the country.
All three branches of the U.S. federal government are centered in the District: U.S. Congress (legislative), President (executive), and the U.S. Supreme Court (judicial). Washington is home to many national monuments and museums, which are primarily situated on or around the National Mall. The city hosts 177 foreign embassies as well as the headquarters of many international organizations, trade unions, non-profit, lobbying groups, and professional associations, including the Organization of American States, AARP, the National Geographic Society, the Human Rights Campaign, the International Finance Corporation, and the American Red Cross.
A locally elected mayor and a 13‑member council have governed the District since 1973. However, Congress maintains supreme authority over the city and may overturn local laws. D.C. residents elect a non-voting, at-large congressional delegate to the House of Representatives, but the District has no representation in the Senate. The District receives three electoral votes in presidential elections as permitted by the Twenty-third Amendment to the United States Constitution, ratified in 1961."""
### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

In [15]:
### Extracting semi-structured statements
statements = textacy.extract.semistructured_statements(document, "Washington")

print("**** Information from Washington's Wikipedia page ****")
count = 1
for statement in statements:
    subject, verb, fact = statement
    print(str(count) + " - Statement: ", statement)
    print(str(count) + " - Fact: ", fact)
    count += 1

**** Information from Washington's Wikipedia page ****
1 - Statement:  (Washington, is, the capital of the United States of America.[4)
1 - Fact:  the capital of the United States of America.[4
2 - Statement:  (Washington, is, the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6)
2 - Fact:  the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6
3 - Statement:  (Washington, is, home to many national monuments and museums, which are primarily situated on or around the National Mall)
3 - Fact:  home to many national monuments and museums, which are primarily situated on or around the National Mall


Our NLP model found 3 useful facts about Washington, D.C. from that text:

(1) Washington is the capital of the USA

(2) Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977

(3) Washington is home to many national monuments and museums

The best part about this is that these are arguably among the most important pieces of information within that block of text!
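
One detail in the output above: the extracted facts end in fragments like `.[4` because the Wikipedia citation markers are still in the input text. Here is a minimal sketch of stripping them before parsing, assuming the markers always consist of digits in square brackets. The function name `strip_citation_markers` is only an illustration.

```python
import re

# Matches Wikipedia-style citation markers such as [4] or [12].
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citation_markers(text):
    """Remove numeric citation markers like [4] from the text."""
    return CITATION_MARKER.sub("", text)


sample = (
    "Washington is one of the most visited cities in the world, "
    "with more than 20 million annual tourists.[8][9]"
)
print(strip_citation_markers(sample))
```

You would call this on `text` before passing it to `nlp(...)`.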

# Try others

In [20]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg') # requires the model: python3 -m spacy download en_core_web_lg

In [26]:
doc = nlp("Based on the information provided on Google, The University of Muhammadiyah Surakarta is one of the top 10 universities in Indonesia. Yusuf, as a junior lecturer in Informatics Department, is pursuing his PhD in Japan.")
### With jupyter=True, displacy renders inline and returns None, so don't print it
displacy.render(doc, style='ent', jupyter=True)
print()
for x in doc.ents:
    print('Named Entity: {},{}'.format(x, x.label_))


Named Entity: Google,ORG
Named Entity: The University of Muhammadiyah Surakarta,ORG
Named Entity: 10,CARDINAL
Named Entity: Indonesia,GPE
Named Entity: Yusuf,PERSON
Named Entity: Informatics Department,ORG
Named Entity: PhD,WORK_OF_ART
Named Entity: Japan,GPE


In [65]:
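from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
import pandas as pd

### Build Sastrawi's Indonesian stop word remover
factory = StopWordRemoverFactory()
stopword = factory.create_stop_word_remover()

### Load the sentences ('kalimat' is Indonesian for 'sentence') and
### strip the stop words from each one
df = pd.read_csv("kalimat.csv")
for i, kalimat in enumerate(df['kalimat']):
    stop = stopword.remove(kalimat)
    print(i, stop)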

0 semua perbuatan tergantung niatnya ba... 
1 terkadang datang kepadaku suara gemeri... 
2 bacalah beliau menjawab aku bisa baca na... 
3 Dengan Menggunakan Python Library Sastrawi dapat melakukan proses Stopword Removal


In [66]:
df['kalimat'] = df['kalimat'].apply(lambda s: " ".join(stopword.remove(w) for w in s.split()))
df['kalimat'].head()

0            semua perbuatan tergantung niatnya  ba...
1           terkadang datang kepadaku  suara gemeri...
2             bacalah beliau menjawab aku   baca na...
3    Dengan Menggunakan Python  Library Sastrawi   ...
Name: kalimat, dtype: object
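
The last output contains double spaces (for example after `niatnya`). This is likely because the `apply` call cleans each word on its own: a removed stop word becomes an empty string, and the join keeps the spaces around it. Here is a minimal sketch of the difference in plain Python. The tiny stop word set is made up for illustration and is not Sastrawi's actual list.

```python
# Illustrative stop word list only; Sastrawi ships its own, much larger one.
STOPWORDS = {"dan", "ke", "yang", "di"}


def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Drop stop words, then join what is left with single spaces."""
    kept = [word for word in sentence.split() if word.lower() not in stopwords]
    return " ".join(kept)


def remove_stopwords_per_word(sentence, stopwords=STOPWORDS):
    """Mirror the shape of the apply() call above: each word is cleaned on
    its own, so a removed word turns into an empty string in the join."""
    cleaned = ["" if word.lower() in stopwords else word for word in sentence.split()]
    return " ".join(cleaned)


sentence = "saya dan kamu pergi ke pasar"
print(repr(remove_stopwords(sentence)))
print(repr(remove_stopwords_per_word(sentence)))
```

Filtering the words first and joining afterwards avoids the extra spaces.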