# Introduction to spaCy - part 1

Let’s get started and try out spaCy! You’ll be able to try out some of the 55+ available languages.


#### Print your name

In [1]:
## Your code here 
print("Exercise by: Janne Bragge")

Exercise by: Janne Bragge


## Import spaCy and load pipelines

#### Exercise - Import spaCy

- Import `spacy`.
- Print the spaCy version.
- Download the spaCy text processing pipeline `en_core_web_sm` if needed.



In [2]:
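## Task 1:
## Your code here
# Prints the installed version, for example 3.7.2
import spacy
print(spacy.__version__)

# If the pipeline is missing, download it once from a terminal:
#   python -m spacy download en_core_web_sm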

3.8.4


## spaCy pipelines
### Inspecting the pipeline

Let’s inspect the small English model’s pipeline!

- Load the `en_core_web_sm` model and create the `nlp` object.
- Print the names of the pipeline components using `nlp.pipe_names`.
- Print the full pipeline of `(name, component)` tuples using `nlp.pipeline`.

In [3]:
# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7f4c4ded4c50>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f4c4ded4b90>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f4c52a0dee0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7f4c4dd0c550>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f4c4dd0b410>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7f4c524d9620>)]


#### Exercise - What happens when you call nlp?

What does spaCy do when you call nlp on a string of text?

```python
doc = nlp("This is a sentence.")
```

Your answer here...

"Tokenize the text and apply each pipeline component in order"

When the `nlp` object is called on a string, spaCy first tokenizes the text into words and punctuation marks and wraps the tokens in a `Doc` object. The `Doc` is then passed through each pipeline component in order, and each component adds its annotations in place. With `en_core_web_sm`, the tagger assigns part-of-speech tags, the parser adds the dependency structure, the lemmatizer adds base forms, and the entity recognizer marks named entities such as people or places. The `tok2vec` component produces context-sensitive token representations for the components that follow it; the small model does not ship with static word vectors. The call returns the fully annotated `Doc`.

(https://spacy.pythonhumanities.com/01_02_linguistic_annotations.html)
(https://gitlab.dclabra.fi/wiki/s/BJsE3w2Hd)
(https://course.spacy.io/en/chapter3)
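
The "tokenize first, then apply each component in order" behaviour can be observed directly. The following is a minimal sketch, assuming spaCy v3 is installed. It uses a blank English pipeline so that no model download is needed, and the component names `first_step` and `second_step` are hypothetical placeholders, not built-in components.

```python
import spacy
from spacy.language import Language

calls = []  # records the order in which the components run


@Language.component("first_step")  # hypothetical component name
def first_step(doc):
    # The tokenizer has already run, so the Doc contains tokens
    calls.append(("first_step", len(doc)))
    return doc


@Language.component("second_step")  # hypothetical component name
def second_step(doc):
    calls.append(("second_step", len(doc)))
    return doc


nlp = spacy.blank("en")  # tokenizer only, no trained components
nlp.add_pipe("first_step")
nlp.add_pipe("second_step")

doc = nlp("This is a sentence.")
print([token.text for token in doc])
print(nlp.pipe_names)
print(calls)
```

Each component receives the `Doc` that the tokenizer created, which is why both record the same token count.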

#### Exercise - Use cases for custom components

Which of these problems can be solved by custom pipeline components? Choose all that apply!

1. Updating the pre-trained models and improving their predictions
2. Computing your own values based on tokens and their attributes
3. Adding named entities, for example based on a dictionary
4. Implementing support for an additional language

*Your answer here...*

2. Computing your own values based on tokens and their attributes
3. Adding named entities, for example based on a dictionary

(https://course.spacy.io/en/)
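
As an illustration of options 2 and 3, here is a small sketch of a custom component, assuming spaCy v3. The component name `place_lookup`, the extension name `alpha_count`, and the `PLACES` set are all hypothetical; a blank English pipeline is used so the example runs without a trained model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span

# Hypothetical lookup table; a real project would load this from a file
PLACES = {"Helsinki", "Finland"}

# Custom Doc attribute that the component fills in (hypothetical name)
Doc.set_extension("alpha_count", default=0, force=True)


@Language.component("place_lookup")  # hypothetical component name
def place_lookup(doc):
    # Option 2: compute a custom value from token attributes
    doc._.alpha_count = sum(token.is_alpha for token in doc)
    # Option 3: add entities based on a dictionary (single-token matches only)
    doc.ents = [Span(doc, t.i, t.i + 1, label="GPE") for t in doc if t.text in PLACES]
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("place_lookup")

doc = nlp("Helsinki is the capital of Finland.")
print(doc._.alpha_count)
print([(ent.text, ent.label_) for ent in doc.ents])
```

The component only reads and writes annotations on the `Doc`; it does not touch any model weights, which is why option 1 is not a use case for custom components.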

### NLP pipelines, tokens and token attributes

#### Exercise - NLP pipeline 

Create a function `return_verbs()` that

- loads the spaCy text processing pipeline `en_core_web_sm` into the variable `nlp`
- creates a `doc` using the `nlp` pipeline
- prints the part of speech for the first 5 tokens
- returns a list of all verbs in `doc`


```python
def return_verbs(text) :

    # 1. Load the small English model
    
    # 2. Process the text
    
    # 3. Print the first 5 tokens and their part-of-speech tags (hint: use token.pos_)
    
    # 4. Find all verbs in doc 
    
    return verb_list
```


In [4]:
## Task 2:
## Your code here 

def return_verbs(text):
    """
    Load the small English model, process the text,
    print the first five tokens with their parts of speech,
    and return a list of the verbs.
    """

    # 1. Load the small English model
    nlp = spacy.load("en_core_web_sm")

    # 2. Process the text
    doc = nlp(text)

    # 3. Print the first five tokens and their parts of speech
    print("First 5 tokens and part-of-speechs:")
    for token in doc[:5]:
        print(f"{token.text:<15} {token.pos_:<10}")

    # 4. Find all verbs in the doc object
    verb_list = [token.text for token in doc if token.pos_ == "VERB"]

    return verb_list


In [5]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
verbs = return_verbs(text)

print("\nText verbs", verbs)
print("Total:", len(verbs))

First 5 tokens and part-of-speechs:
It              PRON      
’s              VERB      
official        NOUN      
:               PUNCT     
Apple           PROPN     

Text verbs ['’s', 'reach']
Total: 2


#### Exercise - Token lemma

Create a function `return_lemmas()` that

- loads the spaCy text processing pipeline `en_core_web_sm` into the variable `nlp`
- creates a `doc` using the `nlp` pipeline
- prints the lemma for the first 5 tokens
- returns a list of the lemmas of all tokens in `doc`


```python
def return_lemmas(text) :

    # 1. Load the small English model and Process the text
    
    # 2. Print first 5 tokens and lemmas  
    
    return lemma_list
```


In [6]:
## Task 3:
## Your code here 
import spacy

def return_lemmas(text):

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    print("First 5 tokens and lemmas:\n")

    for token in doc[:5]:
        print(f"{token.text:<20} {token.lemma_}")

    lemma_list = [token.lemma_ for token in doc]

    return lemma_list

 

In [7]:
text = "I was happy when I ran with my two older dogs."
lemmas = return_lemmas(text)

print("\nText lemmas:", lemmas)
print("Total:", len(lemmas))

First 5 tokens and lemmas:

I                    I
was                  be
happy                happy
when                 when
I                    I

Text lemmas: ['I', 'be', 'happy', 'when', 'I', 'run', 'with', 'my', 'two', 'old', 'dog', '.']
Total: 12


## spaCy EntityRecognizer

"A named entity is basically a real-life object which has proper identification and can be denoted with a proper name. Named Entities can be a place, person, organization, time, object, or geographic entity.

For example, named entities would be Roger Federer, Honda city, Samsung Galaxy S10. Named entities are usually instances of entity instances. For example, Roger Federer is an instance of a Tennis Player/person, Honda City is an instance of a car and Samsung Galaxy S10 is an instance of a Mobile Phone." 

https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/

In [8]:
## Let's see what "GPE" means
print(spacy.explain("GPE"))

Countries, cities, states
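
The same call works for the other entity labels that show up in this notebook's outputs. A short sketch, assuming spaCy v3 is installed (no model download is needed for `spacy.explain`):

```python
import spacy

# Entity labels that appear in the outputs of this notebook
labels = ["GPE", "ORG", "PERSON", "MONEY", "ORDINAL"]

for label in labels:
    print(f"{label:<10} {spacy.explain(label)}")
```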


#### Exercise - EntityRecognizer

Create a function `return_places()` that

- loads the spaCy text processing pipeline `en_core_web_sm` into the variable `nlp`
- creates a `doc` using the `nlp` pipeline
- prints the first 5 entities in the document
- returns a list of all places in `doc`


```python
def return_places(text) :

    # 1. Load the small English model and process the text
    
    # 2. Print document entity count
    
    # 3. Print document entities

    # 4. create list of places (entity label == GPE)
    
    return place_list
```


In [9]:
## Task 4:
## Your code here 

import spacy

def return_places(text):

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    print(f"Document entity count: {len(doc.ents)}")
    print("Document entities and entity labels:\n")
    
    for ent in doc.ents:
        print(f"{ent.text:<20} {ent.label_}")

    place_list = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    
    return place_list


In [10]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value. Current CEO is Tim Cook."
places = return_places(text)

print("\nText entity places:", places)
print("Total:", len(places))

Document entity count: 5
Document entities and entity labels:

Apple                ORG
first                ORDINAL
U.S.                 GPE
$1 trillion          MONEY
Tim Cook             PERSON

Text entity places: ['U.S.']
Total: 1


#### Exercise - Extend the Wikipedia search to persons and organizations:

```python
def wikipedia_search(text) :

    import spacy
    from spacy.tokens import Span
    
    nlp = spacy.load("en_core_web_sm")

    print("\n")
    
    def get_wikipedia_url(span):
        # Get a Wikipedia URL 
        if span.label_ in ("GPE",): ### -1- (the comma makes this a tuple, not a string)
            entity_text = span.text.replace(" ", "_")
            return "https://en.wikipedia.org/w/index.php?search=" + entity_text
    
    
    # Set the Span extension wikipedia_url using the getter get_wikipedia_url
    Span.set_extension("wikipedia_url", getter=get_wikipedia_url, force=True)

    urls = []
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ("GPE",): ### -2- (the comma makes this a tuple, not a string)
            print('%-15s' % ent.text, '%-10s' % ent.label_, '%-30s' %ent._.wikipedia_url)
            urls.append(ent._.wikipedia_url)
    
    return urls


```

In [11]:
## Task 5:
## Your code here 

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_wikipedia_url(span):
    if span.label_ in ("GPE", "PERSON", "ORG"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url, force=True)

def wikipedia_search(text):
    print("\n")
    urls = []
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ("GPE", "PERSON", "ORG"):
            print('%-15s' % ent.text, '%-10s' % ent.label_, '%-30s' % ent._.wikipedia_url)
            urls.append(ent._.wikipedia_url)
    return urls

 

In [12]:
text =  """
        In over fifty years from his very first recordings right through to his 
        last album, David Bowie was at the vanguard of contemporary culture. He lives in England. 
        But does he work for Nokia?
        """

urls=wikipedia_search(text)
urls



David Bowie     PERSON     https://en.wikipedia.org/w/index.php?search=David_Bowie
England         GPE        https://en.wikipedia.org/w/index.php?search=England
Nokia           ORG        https://en.wikipedia.org/w/index.php?search=Nokia


['https://en.wikipedia.org/w/index.php?search=David_Bowie',
 'https://en.wikipedia.org/w/index.php?search=England',
 'https://en.wikipedia.org/w/index.php?search=Nokia']
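
The getter builds the URL by replacing spaces with underscores, which works for the names above but does not escape characters such as `&` that have a special meaning in a query string. One possible hardening, shown here as a sketch and not part of the exercise solution, is to let the standard library encode the query. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(entity_text):
    """Build a Wikipedia search URL with a properly encoded query string."""
    # urlencode escapes unsafe characters and encodes spaces as "+"
    return "https://en.wikipedia.org/w/index.php?" + urlencode({"search": entity_text})


print(build_search_url("David Bowie"))
print(build_search_url("AT&T"))
```

Note that this encodes spaces as `+` where the exercise code uses `_`, so the resulting URLs differ in form from the ones printed above.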

#### Exercise - Why is the code below bad?

In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Berlin


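*Your answer here...*

The code copies the token texts and part-of-speech tags into plain lists of strings instead of using spaCy's native token attributes. This is less efficient, and it throws away the link between a token and its neighbours, so relationships in the text have to be rebuilt by hand with list indices.

The index arithmetic is also fragile: `pos_tags[index + 1]` raises an `IndexError` when the last token of the text is a proper noun. Iterating over `doc` and checking `token.pos_` and `doc[token.i + 1].pos_`, after confirming that a next token exists, avoids both problems.

The rule itself is simplistic as well. It only finds a proper noun when the very next token is a verb:

- "Berlin looks like a nice city and Helsinki metropoly attracts many tourists" -> only Berlin is found.
- "Berlin looks like a nice city and Helsinki attracts many tourists" -> Berlin and Helsinki are found.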

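A version of the same check that works on native token attributes could look like the sketch below, assuming spaCy v3. To keep it runnable without a trained model, the part-of-speech tags are assigned by hand when the `Doc` objects are built; in the notebook they would come from `en_core_web_sm`. The function name `proper_nouns_before_verbs` is hypothetical.

```python
import spacy
from spacy.tokens import Doc


def proper_nouns_before_verbs(doc):
    """Return the text of every proper noun directly followed by a verb."""
    found = []
    for token in doc:
        has_next = token.i + 1 < len(doc)  # guard against the last token
        if token.pos_ == "PROPN" and has_next and doc[token.i + 1].pos_ == "VERB":
            found.append(token.text)
    return found


nlp = spacy.blank("en")  # only the shared vocabulary is needed here

# Hand-assigned tags (an assumption for illustration, not model output)
doc_a = Doc(
    nlp.vocab,
    words=["Berlin", "looks", "like", "a", "nice", "city"],
    pos=["PROPN", "VERB", "ADP", "DET", "ADJ", "NOUN"],
)
# A text that ends in a proper noun would crash the list-based version
doc_b = Doc(nlp.vocab, words=["I", "like", "Berlin"], pos=["PRON", "VERB", "PROPN"])

print(proper_nouns_before_verbs(doc_a))
print(proper_nouns_before_verbs(doc_b))
```
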
# Reflection
1. What is spaCy?
2. Can you find Finnish pipelines in spaCy?
3. What’s not included in a model package that you can load into spaCy?
    - A meta file including the language, pipeline and license.
    - Binary weights to make statistical predictions.
    - The labelled data that the model was trained on.
    - Strings of the model's vocabulary and their hashes. 
4. What is `nlp` in these exercises?
5. Is it possible to update spaCy models? How?
6. Why should you be careful with training data if you retrain a spaCy model?
7. Why is natural language processing so difficult?


*Your answers here...*

1. spaCy is a Python library designed for advanced Natural Language Processing. It provides pre-trained models for various languages, container objects representing linguistic elements like sentences and words, and attributes for linguistic features such as parts of speech. spaCy also natively supports advanced features like word vectors and offers visualizers for syntactic structure and named entities. Users can customize models, train new ones, and integrate with other machine learning libraries, making it a versatile tool for NLP tasks. (Vasiliev, 2020)

2. Yes, spaCy provides Finnish pipelines:

    - `fi_core_news_sm`: a small pipeline, suitable for basic NLP tasks.
    - `fi_core_news_md`: a medium-sized pipeline, offering better accuracy for more complex tasks.
    - `fi_core_news_lg`: a large pipeline, providing the highest accuracy but requiring more resources.

    (https://spacy.io/models)

3. The labelled data that the model was trained on is not included.

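4. `nlp` is a spaCy `Language` object. It holds the processing pipeline (the tokenizer and the components listed in `nlp.pipe_names`) together with the shared vocabulary, and it is callable: `nlp(text)` returns a processed `Doc`. In these exercises it is created by loading the `en_core_web_sm` package with `spacy.load`.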

5. Yes, spaCy models can be updated. There are two main approaches:

- Training with new data

    The statistical components of an existing pipeline can be trained further on new labelled examples. This lets the model learn new patterns and improve its predictions for a specific task or domain.

(https://towardsdatascience.com/improving-the-ner-model-with-patent-texts-spacy-prodigy-and-a-bit-of-magic-44c86282ea99/)

- Custom components

    Custom pipeline components do not change the model's weights, but they add new functionality to the pipeline or adjust its output. This lets you tailor the pipeline to your specific requirements and integrate it with other tools or resources.

(https://spacy.io/usage/saving-loading)

6. Retraining a spaCy model has to be done carefully. The model learns whatever patterns the new training data contains, so biased data leads to a biased model. This can result in unfair or discriminatory outcomes, especially in sensitive applications like sentiment analysis or text classification.

Furthermore, retraining can cause the model to overfit to the new data, performing well on the training data but poorly on unseen data. It can also lead to catastrophic forgetting, where the model forgets previously learned patterns or knowledge, especially if the new data is significantly different from the original training data. (Vasiliev, 2020)

(https://spacy.io/usage/training)

7. Natural language processing is complex because language is inherently ambiguous. Words can have multiple meanings depending on context, and grammatical structures can be intricate and varied. Informal usage, such as slang, idiomatic expressions, and colloquial constructions, further challenge processing. Language is ever-evolving, with new words and expressions constantly emerging. Moreover, language is multifaceted and encompasses hidden meanings, like sarcasm and jokes, which significantly influence interpretation.

Compare, for example, the meaning of the Finnish verb "kehtaa" in Ostrobothnia (Pohjanmaa) with its meaning in Uusimaa.

---------------------
References 

Vasiliev, Y. (2020). Natural Language Processing with Python and spaCy: A Practical Introduction. No Starch Press.

### Check your answers by running the following cell:

In [14]:
# Do not change this code!

import sys
sys.path.insert(0, '../answers/spacy/')
from spacy1_check import *

print("Results:")
correct = spacy1_check(return_verbs, return_lemmas, return_places, wikipedia_search)

print("Correct answers", correct, "/ 4.")



Results:
First 5 tokens and part-of-speechs:
I               PRON      
like            VERB      
book            NOUN      
.               PUNCT     
I               PRON      


First 5 tokens and lemmas:

Running              run
is                   be
my                   my
hoppy                hoppy
.                    .


Document entity count: 2
Document entities and entity labels:

Helsinki             GPE
Finland              GPE




Iivo Niskanen   PERSON     https://en.wikipedia.org/w/index.php?search=Iivo_Niskanen
Kuopio          GPE        https://en.wikipedia.org/w/index.php?search=Kuopio
Swix            GPE        https://en.wikipedia.org/w/index.php?search=Swix


Correct answers 4 / 4.


### Nice work! 