# Introduction to SpaCy - part 2

Let's see how we can convert text to numbers and what we can do with the numbers.

#### Print your name

In [1]:
## Your code here 
print("Exercise by: Janne Bragge")

Exercise by: Janne Bragge


## Import SpaCy and load pipelines

#### Exercise - Import SpaCy 

- import spacy
- print spacy version
- You may need to download the spaCy text processing pipelines `en_core_web_sm`, `en_core_web_lg`, `de_core_news_sm` and `fi_core_news_lg`.

In [2]:
## Task 1:
## Your code here 
import spacy
print(spacy.__version__)
 

3.8.4


## Strings to hashes

- Look up the string “cat” in `nlp.vocab.strings` to get the hash.
- Look up the hash to get back the string.

In [3]:
import spacy

nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")
print('English:')
code_en1 = nlp_en.vocab.strings["cat"]
code_en2 = nlp_en.vocab.strings["Katze"]
print('cat = ', code_en1)
print('Katze = ',code_en2)
print('German:')
code_de1 = nlp_de.vocab.strings["cat"]
code_de2 = nlp_de.vocab.strings["Katze"]
print('cat = ', code_de1)
print('Katze = ',code_de2, '\n')
print('English & cat hash:', nlp_en.vocab.strings[code_en1])
print('German & cat hash:', nlp_de.vocab.strings[code_en1])
print('German & Katze hash:', nlp_de.vocab.strings[code_de2])

import gc
del nlp_en
del nlp_de
gc.collect()



English:
cat =  5439657043933447811
Katze =  13148695887159701572
German:
cat =  5439657043933447811
Katze =  13148695887159701572 

English & cat hash: cat
German & cat hash: cat
German & Katze hash: Katze


9200

#### Exercise - PERSON vs. Person vs. person hash?
- Use the `en_core_web_sm` package
- Create a function `str_to_hash(string)` that returns the hash codes for the strings “PERSON / Person / person”.
- Create a function `hash_to_str(hash)` that returns the strings for the hash codes.

In [4]:
## Task 2:
## Your code here 

nlp = spacy.load("en_core_web_sm")

def str_to_hash(string):
    """Get the hash code for a string."""
    return nlp.vocab.strings[string]

def hash_to_str(hash):
    """Get the string representation for a hash code."""
    return nlp.vocab.strings[hash]
 

In [5]:
print('Hash codes:')
print(str_to_hash("PERSON"))
print(str_to_hash("Person"))
print(str_to_hash("person"))

print('\nStrings:')
print(hash_to_str(380))
print(hash_to_str(2313063860588076218))
print(hash_to_str(14800503047316267216))


Hash codes:
380
2313063860588076218
14800503047316267216

Strings:
PERSON
Person
person


#### Exercise - Why does this code throw an error?

```python
import spacy

nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")
code = nlp_de.vocab.strings["Katze"]
word = nlp_en.vocab.strings[code]
```

*Your answer here...*

The code raises an error because it tries to resolve a hash obtained from the German vocabulary in the English vocabulary. Each pipeline has its own string store, and the English one has no entry for "Katze", so the hash cannot be mapped back to a string. The lookup `nlp_en.vocab.strings[code]` therefore fails.

#### Exercise - And why does this work?

```python
import spacy

nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")
code = nlp_de.vocab.strings["Katze"]

nlp_en.vocab.strings.add("Katze")
word = nlp_en.vocab.strings[code]
```

*Your answer here...*

The code works because "Katze" is explicitly added to the English string store with `nlp_en.vocab.strings.add("Katze")`. The hash is computed from the string itself, so the new entry gets the same hash as in the German vocabulary (the earlier output shows identical hashes for "Katze" in both pipelines). The lookup `nlp_en.vocab.strings[code]` now finds a matching entry, even though the hash originally came from the German vocabulary.
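
The two answers above can be illustrated without spaCy. The following is a minimal sketch, assuming a hypothetical `ToyStringStore` class and a `hashlib`-based stand-in hash (spaCy's own hash function and `StringStore` internals are different). It only shows the idea: string to hash is a pure computation, while hash to string needs a stored entry.

```python
import hashlib


def hash_string(text: str) -> int:
    """Deterministic 64-bit hash of a string (stand-in, not spaCy's hash)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Hypothetical, simplified two-way table between strings and hashes."""

    def __init__(self, strings=()):
        self._by_hash = {}
        for text in strings:
            self.add(text)

    def add(self, text: str) -> int:
        """Store the string so that its hash can be resolved later."""
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash is computed, so it works for any string.
            return hash_string(key)
        # Hash -> string only works if the string was stored earlier.
        return self._by_hash[key]


store_en = ToyStringStore(["cat"])
store_de = ToyStringStore(["cat", "Katze"])

code = store_de["Katze"]
assert code == store_en["Katze"]  # same string, same hash in both stores

try:
    store_en[code]
except KeyError:
    print("The English store cannot resolve the hash yet")

store_en.add("Katze")
print(store_en[code])  # Katze
```

The design point is that the hash never depends on which store computes it; only the reverse lookup depends on what has been added.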

## Inspecting word vectors

SpaCy also provides word vectors, and with them you can do much more than with hash codes alone.

#### Exercise - "bananas" word vector 1
- Use the small `en_core_web_sm` model
- Print the vector for `"bananas"` using the `token.vector` attribute.
- How long is the word vector?

In [6]:
## Task 3:
## Your code here 

# Load the small English model
nlp = spacy.load("en_core_web_sm")
# Get the vector for "bananas"
doc = nlp("bananas")
bananas_vector_sm = doc[0].vector

 

In [7]:
print(bananas_vector_sm)
print("Vector length:", len(bananas_vector_sm))

[-0.7758115   1.6605448  -0.39189708 -0.32252163  0.78028345 -0.4860569
  1.3714446   1.0607959  -0.33311647 -0.31107557  0.76571095 -0.18480885
 -0.19373725 -0.10193576 -0.05747509  1.138708   -0.5497646   0.27067184
  0.6271011  -1.2856631  -1.3586081  -0.19643408 -0.1460527   0.26088634
  0.80363554  0.11018915  0.62885725 -0.8537091  -0.38757703  0.3829352
 -0.40778205  1.0657502  -0.69960725  0.61539143 -1.0997975  -0.7574061
 -0.38754362  1.1433274  -0.01413831  0.01014978  0.6591927  -0.7107739
 -0.9430653  -0.69120467  0.3341462   0.09536986 -0.2903002   0.11057171
 -0.53214103 -1.05284     0.49351197 -0.62800545 -0.6515595  -0.5959477
  0.02242928  1.2707175   0.9217299   0.46632504  0.5323633   0.75816953
 -0.76025885  0.08518691  0.17911062 -0.8465437   0.02004845  1.1643531
  0.7171085  -0.87006867  0.00917868 -1.0137509  -0.5400829   0.18582483
  0.49860662  0.5396634   0.00840527 -0.7188309   0.69892013 -0.7034926
 -1.2734017  -0.08274153 -0.1123147  -0.3993013   0.084944

#### Exercise - "bananas" word vector 2
- Repeat the same with the large model `en_core_web_lg`

In [8]:
## Task 4:
## Your code here 

# Load the large English model
nlp = spacy.load("en_core_web_lg")

# Get the vector for "bananas"
doc = nlp("bananas")
bananas_vector_lg = doc[0].vector
 



In [9]:
print(bananas_vector_lg)
print("Vector length:", len(bananas_vector_lg))

[-2.1689e-01 -2.5989e+00 -1.3144e+00  2.2500e+00 -4.6767e-01 -2.0695e+00
 -6.3379e-01 -4.0222e-01 -3.4022e+00 -3.6932e-01 -7.9938e-01 -1.0412e+00
  9.3756e-01  1.6070e+00  8.8330e-01 -2.8483e+00  1.3349e-01 -3.1656e+00
  8.1896e-01 -4.8113e+00  1.5655e+00  1.6665e+00 -4.7081e-01 -1.9475e+00
 -1.1779e+00 -1.3810e+00 -2.0071e+00 -2.1639e-01  9.0609e-01  1.5279e+00
  1.2587e-04 -2.9000e+00  7.6069e-01 -2.2825e+00  1.2495e-02 -1.5653e+00
  2.0052e+00 -1.7747e+00  5.9220e-01 -1.1428e+00 -1.3441e+00  3.4784e-01
  1.7492e+00  1.9086e+00  1.0600e+00  1.2965e+00  4.1431e-01  7.9416e-01
 -1.1277e+00 -1.1403e+00  7.5891e-01 -9.4419e-01  1.4413e+00 -2.2554e+00
  1.6226e-01  3.8901e-01  1.2299e-01  1.1577e+00  1.5524e+00  1.3853e+00
  1.1112e+00  7.5767e-01  3.9431e+00 -2.8506e-01 -2.1645e+00 -1.0862e+00
 -1.4973e+00 -1.2781e+00  2.4643e+00 -1.5886e+00  2.5679e-01  6.4918e-01
  1.6809e-01  5.7693e-01  3.1121e-01 -4.5278e-01 -2.7555e+00 -2.1846e+00
  4.4865e+00  2.7107e-01 -5.3831e-01  8.3013e-01  6

## Comparing similarities

In this part, you’ll be using spaCy’s similarity methods to compare Doc, Token and Span objects and get similarity scores.

***Note.*** *SpaCy uses cosine similarity by default.*


In [10]:
nlp = spacy.load("en_core_web_lg")

doc1 = nlp("It's a warm summer day.")
doc2 = nlp("It's sunny outside.")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8573941222194427


#### Exercise - "bananas" similarity

- Use the `en_core_web_lg` model
- Use the `token.similarity` method to compare `bananas` to words:

```python
words = "bananas banana apple apples orange oranges monkey monkeys cat human car space"
```


In [11]:
## Task 5:
## Your code here 


# Define the list of words to compare
words = "bananas banana apple apples orange oranges monkey monkeys cat human car space"

# Split the words string into a list
word_list = words.split()

# Process the reference word once, outside the loop
bananas_doc = nlp("bananas")

# Calculate the similarity between "bananas" and each word in the list
bananas_similarities = []
for word in word_list:
    similarity = bananas_doc.similarity(nlp(word))
    bananas_similarities.append([word, similarity])
 

In [12]:
for pair in bananas_similarities:
    print(pair)

['bananas', 1.0]
['banana', 0.8712358421094191]
['apple', 0.6073679969526968]
['apples', 0.7323371266255725]
['orange', 0.4965249349408526]
['oranges', 0.6361825133240437]
['monkey', 0.38314181399815933]
['monkeys', 0.4383020464176]
['cat', 0.20118522144345108]
['human', 0.08909514704072133]
['car', 0.09095619833949749]
['space', 0.08698710343572413]


#### Exercise - Longer text similarity

- Use the `en_core_web_lg` model
- Create a function `text_similarity(ref, text)` that uses `span.similarity` to compare `ref` with `text`

**Note.** Both texts contain the same words, but the word order of `text2` is partly random.

In [13]:
## Task 6:
## Your code here 

def text_similarity(ref, text):
    """Return the similarity between two texts."""
    doc_ref = nlp(ref)
    doc_text = nlp(text)
    return doc_ref.similarity(doc_text)


 

In [14]:
ref = "Alice likes bananas and apples. She often eats fruits, when she comes from school."
text1 = "Alice likes bananas and apples. She eats fruits all the time, when she comes from work."
text2 = "Alice bananas apples likes and. She eats fruits all the time, when she comes from work."

print('Similarity:', text_similarity(ref, text1))
print('Similarity:', text_similarity(ref, text2))

Similarity: 0.9683894159504514
Similarity: 0.9683894156750373
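
The two scores are nearly identical even though `text2` is shuffled. A likely explanation is that the document vector is an average of the word vectors, and an average does not depend on word order. The sketch below illustrates this with made-up 3-dimensional vectors (the `toy_vectors` values are invented for illustration and are not spaCy's):

```python
import math

# Made-up 3-dimensional vectors, for illustration only (not spaCy's values).
toy_vectors = {
    "alice": [0.9, 0.1, 0.3],
    "likes": [0.2, 0.8, 0.1],
    "bananas": [0.1, 0.3, 0.9],
    "apples": [0.2, 0.4, 0.8],
    "and": [0.5, 0.5, 0.5],
}


def mean_vector(words):
    """Average the word vectors component by component."""
    vectors = [toy_vectors[word] for word in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


ordered = "alice likes bananas and apples".split()
shuffled = "alice bananas apples likes and".split()

sim = cosine(mean_vector(ordered), mean_vector(shuffled))
print(round(sim, 6))  # 1.0
```

Any difference between the two orderings comes only from floating-point rounding when the vectors are summed in a different order, which may also be why the two spaCy scores differ only around the tenth decimal.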


#### Exercise - Longer text similarity with BLEU

Create a similarity comparison function with BLEU. The "official" BLEU implementation is available with the following import. (You may need to install the `sacrebleu` library first.)

```python
from sacrebleu.metrics import BLEU
```
https://github.com/mjpost/sacrebleu?tab=readme-ov-file


In [15]:
## Task 7:
## Your code here 

from sacrebleu.metrics import BLEU

bleu = BLEU()

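def text_bleu(ref, text):
    """Return the corpus-level BLEU score of text against ref."""
    # corpus_score takes a list of hypotheses and a list of reference lists
    results = bleu.corpus_score([text], [[ref]])
    return results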
 

In [16]:
print('BLEU:', text_bleu(ref, text1))
print('BLEU:', text_bleu(ref, text2))

BLEU: BLEU = 54.02 78.9/61.1/47.1/37.5 (BP = 1.000 ratio = 1.118 hyp_len = 19 ref_len = 17)
BLEU: BLEU = 27.60 78.9/33.3/17.6/12.5 (BP = 1.000 ratio = 1.118 hyp_len = 19 ref_len = 17)
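
The four slash-separated numbers in the output are the 1-gram to 4-gram precisions. Unlike the averaged word vectors, they are sensitive to word order: shuffling leaves the 1-gram precision unchanged but breaks many longer n-grams. The following sketch recomputes the first two precisions by hand. It assumes a simplified regular-expression tokenizer (the helper names `tokenize`, `ngrams` and `ngram_precision` are hypothetical, and this is not sacrebleu's own implementation).

```python
import re
from collections import Counter


def tokenize(text):
    """Simplified tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(ref, hyp, n):
    """Return (matched, total) n-gram counts of hyp, clipped by ref counts."""
    ref_counts = ngrams(tokenize(ref), n)
    hyp_counts = ngrams(tokenize(hyp), n)
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


ref = "Alice likes bananas and apples. She often eats fruits, when she comes from school."
text1 = "Alice likes bananas and apples. She eats fruits all the time, when she comes from work."
text2 = "Alice bananas apples likes and. She eats fruits all the time, when she comes from work."

for name, hyp in [("text1", text1), ("text2", text2)]:
    for n in (1, 2):
        matched, total = ngram_precision(ref, hyp, n)
        print(f"{name} {n}-gram precision: {matched}/{total} = {100 * matched / total:.1f}%")
```

With this tokenizer the sketch gives 15/19 (78.9%) for the 1-grams of both texts, but 11/18 (61.1%) versus 6/18 (33.3%) for the 2-grams, which matches the first two precisions in the sacrebleu output.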


## Stop words

Stop words are very common words that carry little information for text analysis, so they are often removed before the analysis.

#### Exercise - Stop words

- Count Finnish stop words and print 5 of them
- Count English stop words and print 5 of them

**Hint.** Stop words are stored as Python sets, so you need to convert them to lists before you can select the first 5 of them.

In [17]:
## Task 8:
## Your code here 
from spacy.lang.fi.stop_words import STOP_WORDS as fi_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

# Count Finnish stop words and print 5 of them
len_stopwords_fi = len(fi_stop)
five_stopwords_fi = list(fi_stop)[:5]

# Count English stop words and print 5 of them
len_stopwords_en = len(en_stop)
five_stopwords_en = list(en_stop)[:5]

 

In [18]:
print("FI:", len_stopwords_fi)
print(five_stopwords_fi)
print("EN:", len_stopwords_en)
print(five_stopwords_en)

FI: 822
['niiksi', 'toinen', 'kun', 'josta', 'kenties']
EN: 326
['no', 'hereby', 'over', 'whole', 'eleven']
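
Because Python sets are unordered, `list(a_set)[:5]` can return a different sample from one Python session to the next. If a reproducible sample is wanted, sorting first is one option. A minimal sketch, using a made-up mini stop-word set (`toy_stop_words` is hypothetical and stands in for spaCy's `STOP_WORDS`):

```python
# Hypothetical mini stop-word set, standing in for spaCy's STOP_WORDS.
toy_stop_words = {"the", "and", "over", "no", "whole", "eleven", "hereby"}

# list(a_set)[:5] depends on the set's internal ordering, which can differ
# between Python sessions; sorting first makes the sample reproducible.
first_five = sorted(toy_stop_words)[:5]

print(len(toy_stop_words))  # 7
print(first_five)           # ['and', 'eleven', 'hereby', 'no', 'over']
```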


#### Exercise - Removing stop words

- Use the `en_core_web_lg` model
- Create a function `filter_stop_en` that filters stop words from the document
- Print the original and filtered text

**Hint.** You can find stop words in a document with the `token.is_stop` attribute.

In [19]:
## Task 9:
## Your code here 

def filter_stop_en(text):
    """Return the text with English stop words removed."""
    doc = nlp(text)
    # Keep only the tokens that are not flagged as stop words
    filtered_tokens = [token.text for token in doc if not token.is_stop]
    return " ".join(filtered_tokens)


 

In [20]:
alice_sentense = 'Alice like bananas and apples. She often eats fruits, when she come from school.'

# Print the text excluding stop words
print(alice_sentense)
print(filter_stop_en(alice_sentense))

Alice like bananas and apples. She often eats fruits, when she come from school.
Alice like bananas apples . eats fruits , come school .


#### Exercise - Removing stop words (Finnish)

- Repeat the same with the Finnish pipeline `fi_core_news_lg` and the sentence
  ```python
  alice_lause = 'Alice pitää banaaneista ja omenoista. Hän syö usein hedelmiä koulusta tullessaan.'
  ```

In [21]:
## Task 10:
## Your code here 

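# Load the Finnish pipeline under its own name so that the English `nlp`
# used by the earlier functions is not overwritten
nlp_fi = spacy.load("fi_core_news_lg")

alice_lause = 'Alice pitää banaaneista ja omenoista. Hän syö usein hedelmiä koulusta tullessaan.'

def filter_stop_fi(text):
    """Return the text with Finnish stop words removed."""
    doc = nlp_fi(text)
    filtered_tokens = [token.text for token in doc if not token.is_stop]
    return " ".join(filtered_tokens)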

 



In [22]:
alice_lause = 'Alice pitää banaaneista ja omenoista. Hän syö usein hedelmiä koulusta tullessaan.'

# Print the text excluding stop words
print(alice_lause)
print(filter_stop_fi(alice_lause))

Alice pitää banaaneista ja omenoista. Hän syö usein hedelmiä koulusta tullessaan.
Alice pitää banaaneista omenoista . syö hedelmiä koulusta tullessaan .


# Reflection
1. What is a "word vector"? How many features does a spaCy word vector have?
2. What is "bag of words"?
3. What is a lemma? What about lemmatization?
4. Why does spaCy not use stemming?
5. On what basis does spaCy calculate similarities between sentences?

*Your answers here...*

1) A **word vector** is a numeric representation of a word that captures its meaning and its relationships to other words. Depending on the model size, spaCy's word vectors have roughly 100 to 300 features; for example, the vectors in `en_core_web_lg` have 300 dimensions.

2) **Bag of words** is a text representation in which a text is described by how often each word occurs. The words are, so to speak, put in a "bag", and their order is ignored. For example, the Finnish sentence "kissa istuu matolla" ("the cat sits on the mat") would be represented as a word list in which each word occurs once. The bag-of-words model is often used in text classification and information retrieval.

3) A **lemma** is the base form of a word; for example, the lemma of the Finnish "juoksen" ("I run") is "juosta" ("to run"). Lemmatization is the process of reducing a word to its lemma. It matters because it merges the different inflected forms into one base form, which makes text analysis easier. For example, "juoksin", "juoksee" and "juoksemme" are all lemmatized to "juosta".

4) **Stemming** is a process that strips the ending off a word, leaving a word stem. For example, stemming "juokseminen" ("running") could give "juoksem". SpaCy uses lemmatization instead of stemming because lemmatization usually produces results that are more faithful to the meaning. Stemming can yield ambiguous or incorrect stems that are not real words. SpaCy aims to preserve the meaning of the word during analysis.

5) **SpaCy calculates sentence similarity using word vectors**. These vectors represent word meanings in a multidimensional space. The similarity of two sentences is computed from the cosine of the angle between the averages of their word vectors: the smaller the cosine distance (that is, the higher the cosine similarity), the more similar the sentences are. SpaCy's models are trained on large text corpora, from which they learn the semantic relationships between words and sentences.
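
The bag-of-words idea from answer 2 can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization and a hypothetical helper `bag_of_words`; real pipelines usually tokenize more carefully.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())


bag_a = bag_of_words("the cat sits on the mat")
bag_b = bag_of_words("on the mat the cat sits")

print(bag_a["the"])    # 2
print(bag_a == bag_b)  # True
```

The two sentences have different word orders but identical bags, which is exactly the information that this representation discards.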


### Check your answers by running the following cell:

In [23]:
# Do not change this code!

import sys
sys.path.insert(0, '../answers/spacy/')
from spacy2_check import *

print("Results:")
correct = spacy2_check(str_to_hash, hash_to_str, text_similarity, text_bleu, filter_stop_en, filter_stop_fi,
          bananas_vector_sm, bananas_vector_lg, bananas_similarities, 
          len_stopwords_fi, five_stopwords_fi, len_stopwords_en, five_stopwords_en)

print("Correct answers", correct, "/ 13.")



Results:
	 'bananas_vector_sm' is not correct. Please check your answer.
	 'filter_stop_en' is not correct. Please check your answer.
Correct answers 11 / 13.


### Nice work! 