## Exercise 1

The text used in this exercise is the first paragraph of https://en.wikipedia.org/wiki/Path_of_Exile

### Download the language model

In [4]:
%%capture output
!python -m spacy download en_core_web_lg

In [5]:
if "Download and installation successful" in output.stdout:
    print("Download and installation successful!")
else:
    print(output.stdout)
    print(output.stderr)

Download and installation successful!


### Get and preprocess text

In [57]:
import spacy
import re
import pandas as pd

text = "Path of Exile is a free-to-play action role-playing video game developed and published by Grinding Gear Games. Following an open beta phase, the game was released for Microsoft Windows in October 2013.[3][4][5][6][7] A version for Xbox One was released in August 2017, and a PlayStation 4 version was released in March 2019. Path of Exile takes place in a dark fantasy world, where the government of the island nation of Oriath exiles people to the continent of Wraeclast, a ruined continent home to many ancient gods. Taking control of an exile, players can choose to play as one of seven character classes – Marauder, Duelist, Ranger, Shadow, Witch, Templar, and Scion. Players are then tasked with fighting their way back to Oriath, defeating ancient gods and great evils during their journey. Path of Exile 2, a sequel, is currently in development. It was originally announced in 2019 as a large update for the original game. In 2023, the studio announced that it would instead be a separate game.[8]"
# preprocess: strip bracketed citation markers such as [3]
text_preprocessed = re.sub(r'\[[^\]]*\]', '', text)

print(text_preprocessed)

Path of Exile is a free-to-play action role-playing video game developed and published by Grinding Gear Games. Following an open beta phase, the game was released for Microsoft Windows in October 2013. A version for Xbox One was released in August 2017, and a PlayStation 4 version was released in March 2019. Path of Exile takes place in a dark fantasy world, where the government of the island nation of Oriath exiles people to the continent of Wraeclast, a ruined continent home to many ancient gods. Taking control of an exile, players can choose to play as one of seven character classes – Marauder, Duelist, Ranger, Shadow, Witch, Templar, and Scion. Players are then tasked with fighting their way back to Oriath, defeating ancient gods and great evils during their journey. Path of Exile 2, a sequel, is currently in development. It was originally announced in 2019 as a large update for the original game. In 2023, the studio announced that it would instead be a separate game.
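
The pattern used above removes every bracketed span, not only numeric citation markers. That is fine for this paragraph, but the following minimal sketch shows the difference on a made-up sample string. The narrower `\[\d+\]` pattern and the `sample` text are assumptions for illustration and are not part of the exercise.

```python
import re

# Hypothetical sample text, not taken from the article.
sample = "released in October 2013.[3][4] The [sic] marker stays.[8]"

# Broad pattern from the cell above: drops every bracketed span.
broad = re.sub(r'\[[^\]]*\]', '', sample)

# Narrower pattern (an assumption): drops only brackets that contain digits.
narrow = re.sub(r'\[\d+\]', '', sample)

print(broad)
print(narrow)
```

With the broad pattern the `[sic]` note disappears as well and leaves a double space behind, while the narrow pattern keeps it.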


### Select appropriate spaCy model

In [58]:
import en_core_web_lg
import random

nlp_en = en_core_web_lg.load()

### Select random sentence

In [82]:
article_en = nlp_en(text_preprocessed)
# alternative: join 3 random sentences instead of picking a single one
# random_sentence = ' '.join(map(str, random.sample(list(article_en.sents), 3)))
random_sentence = random.choice(list(article_en.sents)).text
print(random_sentence)

Taking control of an exile, players can choose to play as one of seven character classes – Marauder, Duelist, Ranger, Shadow, Witch, Templar, and Scion.
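
Because `random.choice` is unseeded here, re-running the cell can select a different sentence, so the outputs below only match this particular run. A minimal sketch of making the pick repeatable follows; the seed value `42` and the stand-in `sentences` list are arbitrary assumptions, and in the notebook the list would be `list(article_en.sents)`.

```python
import random

# Stand-in for list(article_en.sents); hypothetical placeholder strings.
sentences = ["First sentence.", "Second sentence.", "Third sentence."]

random.seed(42)  # arbitrary seed, chosen only for illustration
first_pick = random.choice(sentences)

random.seed(42)  # resetting the same seed reproduces the same choice
second_pick = random.choice(sentences)

print(first_pick == second_pick)
```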


### Tokenize sentence

In [83]:
# tokenize
tokens = nlp_en(random_sentence)
#tokens = nlp_en(text_preprocessed)

# print all tokens
print([token.text for token in tokens])

['Taking', 'control', 'of', 'an', 'exile', ',', 'players', 'can', 'choose', 'to', 'play', 'as', 'one', 'of', 'seven', 'character', 'classes', '–', 'Marauder', ',', 'Duelist', ',', 'Ranger', ',', 'Shadow', ',', 'Witch', ',', 'Templar', ',', 'and', 'Scion', '.']
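
For comparison, a plain whitespace split keeps punctuation attached to the neighbouring word, whereas the spaCy output above lists `','` as its own token. The sketch below uses only the beginning of the sentence; `sentence_start` and `naive_tokens` are names introduced here for illustration.

```python
# Beginning of the sentence selected above.
sentence_start = "Taking control of an exile, players can choose"

# Naive tokenization: split on whitespace only.
naive_tokens = sentence_start.split()

print(naive_tokens)
```

The fifth element is `'exile,'` with the comma still attached, where spaCy produced `'exile'` and `','` separately.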


### Lemmatize it

In [84]:
print([token.lemma_ for token in tokens])

['take', 'control', 'of', 'an', 'exile', ',', 'player', 'can', 'choose', 'to', 'play', 'as', 'one', 'of', 'seven', 'character', 'class', '–', 'Marauder', ',', 'Duelist', ',', 'Ranger', ',', 'Shadow', ',', 'Witch', ',', 'Templar', ',', 'and', 'scion', '.']


### Carry out POS-tagging, dependency plot

In [85]:
from spacy import displacy

tokens_data = [{
        'Token': token.text,
        'Lemma': token.lemma_,
        'POS': token.pos_,
        'Tag': token.tag_,
        'Dep': token.dep_,
        'Shape': token.shape_,
        'Is_alpha': token.is_alpha,
        'Is_stop': token.is_stop
    } for token in tokens]

print(pd.DataFrame(tokens_data))
displacy.render(tokens, style='dep', options={'distance': 100}, jupyter=True)

        Token      Lemma    POS  Tag       Dep  Shape  Is_alpha  Is_stop
0      Taking       take   VERB  VBG     advcl  Xxxxx      True    False
1     control    control   NOUN   NN      dobj   xxxx      True    False
2          of         of    ADP   IN      prep     xx      True     True
3          an         an    DET   DT       det     xx      True     True
4       exile      exile   NOUN   NN      pobj   xxxx      True    False
5           ,          ,  PUNCT    ,     punct      ,     False    False
6     players     player   NOUN  NNS     nsubj   xxxx      True    False
7         can        can    AUX   MD       aux    xxx      True     True
8      choose     choose   VERB   VB      ROOT   xxxx      True    False
9          to         to   PART   TO       aux     xx      True     True
10       play       play   VERB   VB     xcomp   xxxx      True    False
11         as         as    ADP   IN      prep     xx      True     True
12        one        one    NUM   CD      pobj    x

### Carry out NER

In [86]:
entities_data = [{
        'Text': ent.text,
        'Start_char': ent.start_char,
        'End_char': ent.end_char,
        'Label': ent.label_
    } for ent in tokens.ents]
print(pd.DataFrame(entities_data))

                             Text  Start_char  End_char     Label
0                             one          58        61  CARDINAL
1                           seven          65        70  CARDINAL
2                        Marauder          91        99   PRODUCT
3  Ranger, Shadow, Witch, Templar         110       140       ORG
4                           Scion         146       151   PRODUCT
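
`Start_char` and `End_char` are character offsets into the parsed sentence, with the end offset exclusive, so slicing the sentence string recovers each entity text. The sketch below checks this using the offsets and labels copied from the table above; `sentence`, `spans` and `recovered` are names introduced here for illustration.

```python
# The sentence that was selected and parsed above.
sentence = ("Taking control of an exile, players can choose to play as one of "
            "seven character classes – Marauder, Duelist, Ranger, Shadow, "
            "Witch, Templar, and Scion.")

# (start_char, end_char, label) rows copied from the NER output above.
spans = [
    (58, 61, "CARDINAL"),
    (65, 70, "CARDINAL"),
    (91, 99, "PRODUCT"),
    (110, 140, "ORG"),
    (146, 151, "PRODUCT"),
]

# Slice the sentence with each pair of offsets.
recovered = [sentence[start:end] for start, end, _ in spans]

for text, (_, _, label) in zip(recovered, spans):
    print(f"{label:<10} {text}")
```

Note that the text describes all seven names as character classes, so the `PRODUCT` and `ORG` labels look like model errors: four classes are merged into one `ORG` span, and `Duelist` is not recognised at all.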


In [87]:
# 'distance' only applies to the dependency visualizer, so no options are passed here
displacy.render(tokens, style='ent', jupyter=True)