# Finding words, phrases, names and concepts

This chapter introduces the basics of text processing with spaCy. You'll learn about the data structures, how to work with statistical models, and how to use them to predict linguistic features in your text.

## Introduction to spaCy

### The nlp object

* contains the processing pipeline
* includes language-specific rules for tokenization etc.

In [1]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
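As a minimal sketch of the two points above, the snippet below inspects a blank English `nlp` object. It assumes a standard spaCy install; `nlp.lang` and `nlp.pipe_names` are attributes not used elsewhere in this chapter, and the example sentence is made up for illustration.

```python
from spacy.lang.en import English

# A blank English pipeline: tokenizer and language data only, no trained components
nlp = English()

print(nlp.lang)        # language code of the pipeline
print(nlp.pipe_names)  # names of the pipeline components (none in a blank pipeline)

# Language-specific tokenization rules split the contraction "don't" into two tokens
doc = nlp("I don't like Mondays.")
print([token.text for token in doc])
```

The tokenizer runs even though the pipeline has no components, which is why the lexical examples in this section work without loading a model.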

### The Doc object

In [2]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


### The Token object

In [3]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


### The Span object

In [4]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!


### Lexical attributes

In [5]:
doc = nlp("It costs $5.")

print('Index: ', [token.i for token in doc])
print('Text: ', [token.text for token in doc])
print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])

Index:  [0, 1, 2, 3, 4]
Text:  ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]


### Practice: Getting Started

Let's get started and try out spaCy! In this exercise, you can try some of the 45+ [available languages](https://spacy.io/usage/models#languages).

*This course introduces a lot of new concepts, so if you need a quick refresher, download the [spaCy Cheat Sheet](http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06) and keep it handy!*

In [8]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


In [9]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [10]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?


In [11]:
# Import the Indonesian language class
from spacy.lang.id import Indonesian

# Create the nlp object
nlp = Indonesian()

# Process a text (this is Indonesian for: "This is a sentence")
doc = nlp("Ini adalah kalimat")

# Print the document text
print(doc.text)

Ini adalah kalimat


### Documents, spans and tokens

When you call `nlp` on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you'll learn more about the `Doc`, as well as its views `Token` and `Span`.

In [12]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


In [13]:
# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals
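The slices above are views into the `Doc` rather than copies. The following sketch, which assumes a blank `English()` pipeline and uses the `Span` attributes `start`, `end` and `doc`, shows how a span keeps track of its position in the parent document.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# Tokens 2 and 3; as with Python lists, the end index is exclusive
span = doc[2:4]

print(span.text)
print(span.start, span.end)          # token offsets of the span within the Doc
print(len(doc), len(span))           # number of tokens in the Doc and in the Span
print([token.i for token in span])   # token.i is the index in the Doc, not in the Span
print(span.doc is doc)               # the span refers back to the same Doc object
```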


### Lexical attributes

In this example, you'll use spaCy's `Doc` and `Token` objects and lexical attributes to find percentages in a text. You'll be looking for two consecutive tokens: a number and a percent sign.

In [14]:
# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4


**Note**: As you can see, you can do a lot of very powerful analysis using tokens and their attributes.
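The percentage loop above can be wrapped in a small reusable function. This is only a sketch: `find_percentages` is a hypothetical helper name, not part of spaCy, and the bounds check is an added safeguard so that a number at the very end of a text does not raise an `IndexError`.

```python
from spacy.lang.en import English


def find_percentages(doc):
    """Return the text of every number-like token directly followed by '%'."""
    found = []
    for token in doc:
        # The last token has no successor, so skip the lookup in that case
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found


nlp = English()
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")
print(find_percentages(doc))

# A text that ends in a number is handled without an error
print(find_percentages(nlp("It costs 5")))
```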

## Statistical models

### What are statistical models?

* Enable spaCy to predict linguistic attributes in context
  * Part-of-speech tags
  * Syntactic dependencies
  * Named entities
* Trained on labeled example texts
* Can be updated with more examples to fine-tune predictions

### Model Packages

* Binary weights
* Vocabulary
* Meta information (language, pipeline)

In [1]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

### Predicting Part-of-speech Tags

In [20]:
# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


### Predicting Syntactic Dependencies

<img src="images/syntactic.png" width=450px height=450px align=left />

In [18]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


### Predicting Named Entities

In [21]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Tip: the explain method

Get quick definitions of the most common tags and labels.

In [22]:
spacy.explain('GPE')

'Countries, cities, states'

In [23]:
spacy.explain('NNP')

'noun, proper singular'

In [24]:
spacy.explain('dobj')

'direct object'
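When several labels need looking up at once, `spacy.explain` can be called in a loop. The sketch below uses labels that appear in the outputs earlier in this chapter plus one made-up label (`NOT_A_REAL_LABEL`, an assumption for illustration) to show that unknown labels yield `None` instead of a description.

```python
import spacy

# Labels from earlier outputs in this chapter, plus one deliberately unknown label
labels = ["GPE", "dobj", "NNP", "NOT_A_REAL_LABEL"]

for label in labels:
    # spacy.explain returns a short description, or None if the label is unknown
    description = spacy.explain(label)
    print(label, "->", description)
```

No model needs to be loaded for this; the descriptions ship with the library itself.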

### Loading models

Let's start by loading a model.

In [25]:
# Load the 'en_core_web_sm' model – spaCy is already imported
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


In [28]:
#!python -m spacy download de_core_news_sm

In [2]:
# Load the 'de_core_news_sm' model – spaCy is already imported
nlp = spacy.load('de_core_news_sm')

text = "Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht


**Note**: Now that you've practiced loading models, let's look at some of their predictions.

### Predicting linguistic annotations

You'll now get to try one of spaCy's pre-trained model packages and see its predictions in action. Feel free to try it out on your own text!

In [3]:
# Load the small English model
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          VERB      punct     
official    NOUN      ccomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [5]:
spacy.explain('PART')

'particle'

In [6]:
# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


**Note**: So far, the model has been correct every time. In the next exercise, you'll see what happens when the model is wrong, and how to adjust for it.

### Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing. Let's look at an example.

In [8]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

New iPhone EVENT
Apple ORG


In [9]:
# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print('Missing entity:', iphone_x.text)

Missing entity: iPhone X


**Note**: Of course, you don't always have to do this manually. In the next section, you'll learn about spaCy's rule-based matching, which can help you find certain words and phrases in text.

## Rule-based matching

### Why not just regular expressions?

* Match on `Doc` objects, not just strings
* Match on tokens and token attributes
* Use the model's predictions
* Example: "duck" (verb) vs. "duck" (noun)
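To make the first two points concrete, here is a rough side-by-side sketch. It assumes a blank `English()` pipeline and the list-of-patterns form of `Matcher.add` (the form spaCy v3 requires); the pattern name `IPHONE_X` is made up for this example.

```python
import re

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
text = "New iPhone X release date leaked"
doc = nlp(text)

# A regular expression works on the raw string and reports character offsets
regex_match = re.search(r"iPhone X", text)
print(regex_match.span())

# The Matcher works on the Doc and reports token offsets
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_X", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
token_matches = [(start, end) for match_id, start, end in matcher(doc)]
print(token_matches)

# Because the result is a Span, token attributes are available on the match
for start, end in token_matches:
    print([(token.text, token.is_alpha) for token in doc[start:end]])
```

The regular expression only knows about characters, while the match on the `Doc` gives direct access to every attribute of the matched tokens.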

### Match patterns

* Lists of dictionaries, one per token
* Match exact token texts
  * `[{'ORTH': 'iPhone'}, {'ORTH': 'X'}]`
* Match lexical attributes
  * `[{'LOWER': 'iphone'}, {'LOWER': 'x'}]`
* Match any token attributes
  * `[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]`
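The difference between exact text and lowercase matching can be tried out without a trained model. The sketch below assumes a blank `English()` pipeline and the list-of-patterns form of `Matcher.add`; the pattern names and the helper `count_matches` are hypothetical.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()

# One matcher for exact token texts, one for lowercase forms
exact = Matcher(nlp.vocab)
exact.add("EXACT", [[{"ORTH": "iPhone"}, {"ORTH": "X"}]])

case_insensitive = Matcher(nlp.vocab)
case_insensitive.add("LOWERCASE", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])


def count_matches(matcher, text):
    """Count how many matches the given matcher finds in the text."""
    return len(matcher(nlp(text)))


for text in ["New iPhone X release", "new iphone x release"]:
    print(text, "| exact:", count_matches(exact, text),
          "| lowercase:", count_matches(case_insensitive, text))
```

The `ORTH` pattern only fires on the original spelling, while the `LOWER` pattern matches both variants.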

### Using the Matcher (1)

In [10]:
import spacy
# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher (list-of-patterns form, required by spaCy v3)
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
matcher.add('IPHONE_PATTERN', [pattern])

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

In [11]:
matches

[(9528407286733565721, 1, 3)]

### Using the Matcher (2)

* `match_id` : hash value of the pattern name
* `start` : start index of matched span
* `end` : end index of matched span

In [12]:
# Call the matcher on the doc
doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
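Since `match_id` is only a hash, it can be useful to turn it back into the pattern name. A minimal sketch, assuming a blank `English()` pipeline and the list-of-patterns form of `Matcher.add`: the shared string store `nlp.vocab.strings` resolves the hash to the string and back.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"ORTH": "iPhone"}, {"ORTH": "X"}]])

doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

for match_id, start, end in matches:
    # Look up the integer hash in the string store to recover the pattern name
    name = nlp.vocab.strings[match_id]
    print(match_id, name, start, end, doc[start:end].text)
```

This becomes handy once a matcher holds several patterns and the code needs to tell which one produced a given match.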


### Matching lexical attributes

In [14]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

doc = nlp("2018 FIFA World Cup: France won!")
doc.text

'2018 FIFA World Cup: France won!'
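The cell above defines the pattern but does not run it. The following sketch applies it with a `Matcher`, assuming a blank `English()` pipeline (the pattern uses lexical attributes only, so no trained model is required) and a made-up pattern name `FIFA_PATTERN`.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},   # a token consisting of digits, e.g. a year
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},   # any punctuation token
]
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")
matched_texts = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched_texts)
```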

### Matching other token attributes

In [15]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

doc = nlp("I loved dogs but now I love cats more.")
doc.text

'I loved dogs but now I love cats more.'

### Using operators and quantifiers (1)

In [16]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'}, # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")
doc.text

"I bought a smartphone. Now I'm buying apps."

### Using operators and quantifiers (2)

<img src="images/operators.png" width=400px height=400px align=left />
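spaCy's basic `OP` values are `!` (match 0 times), `?` (match 0 or 1 times), `+` (match 1 or more times) and `*` (match 0 or more times). The sketch below tries the `?` operator on lexical attributes so that it runs on a blank `English()` pipeline; the example text and the pattern name `WORLD_CUP` are assumptions made for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True, "OP": "?"},  # optional: a year may or may not come first
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
]
matcher.add("WORLD_CUP", [pattern])

doc = nlp("The 2018 FIFA World Cup was held in Russia. The FIFA World Cup is popular.")

# Optional tokens can produce overlapping matches, so collect the distinct texts
found = {doc[start:end].text for match_id, start, end in matcher(doc)}
print(sorted(found))
```

Collecting the texts in a set sidesteps the question of how many overlapping spans the matcher reports for the same phrase.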

### Practice: Using the Matcher

Let's try spaCy's rule-based `Matcher`. You'll use the example from the previous exercise and write a pattern that matches the phrase "iPhone X" in the text.

In [19]:
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

In [20]:
# Import the Matcher and initialize it with the shared vocabulary
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Add the pattern to the matcher
matcher.add('IPHONE_X_PATTERN', [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print('Matches:', [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


**Note**: You successfully found one match: the tokens at `doc[1:3]`, which describe the span for "iPhone X".

### Writing match patterns

In this exercise, you'll practice writing more complex match patterns using different token attributes and operators. The matcher is already initialized and available as the variable `matcher`.

In [21]:
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [22]:
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [23]:
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


**Note**: Those were some pretty complex patterns! Let's move on to the next chapter and see how to use spaCy for more advanced text analysis.