## Tokenization, lemmatization, part-of-speech tagging
---

**Goal**:

   - Build a table with word / lemma / POS / Work / Songwriter.

==> spaCy toolkit: https://spacy.io/api/tokenizer
==> https://spacy.io/usage/linguistic-features

### A quick look at tokenization, lemmatization and part-of-speech tagging

   - What are they?
   - Why do them?

---
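
To make the three notions concrete before running spaCy, here is a toy sketch in plain Python. The regular expression and the two lookup tables (`TOY_LEMMAS`, `TOY_POS`) are hand-written assumptions for this one sentence. They only stand in for what a trained spaCy pipeline predicts and do not reflect how spaCy works internally.

```python
import re

# Hypothetical lookup tables, written by hand for this one example.
TOY_LEMMAS = {"words": "word", "are": "be", "flowing": "flow"}
TOY_POS = {"words": "NOUN", "are": "AUX", "flowing": "VERB", "out": "ADP"}


def toy_annotate(text):
    """Return one (word, lemma, pos) triple per token of `text`."""
    # Tokenization: split the string into word and punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    rows = []
    for token in tokens:
        key = token.lower()
        # Lemmatization: map an inflected form to its dictionary form.
        lemma = TOY_LEMMAS.get(key, key)
        # POS tagging: assign a grammatical category ("X" when unknown).
        pos = TOY_POS.get(key, "X")
        rows.append((token, lemma, pos))
    return rows


rows = toy_annotate("Words are flowing out")
for row in rows:
    print(row)
```

Each triple corresponds to one row of the word / lemma / POS table that the rest of the notebook builds with spaCy.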

In [1]:
# Install spaCy (run once in a shell):

# pip install spacy

# pandas: manipulation of tabular data (CSV files, DataFrames, etc.)
import pandas as pd

# Load and read the cleaned corpus:
corpus = pd.read_csv('./data/corpus_nettoye.csv')

## Build the dataset, separating songs with the marker " \n . \n"

In [2]:
import spacy
from spacy.lang.en import English
nlp = English()

# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

# Small setting that allows writing to the data without the
# pandas SettingWithCopyWarning safety check

pd.options.mode.chained_assignment = None  # default='warn' (Cf. https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas)


corpus_test = corpus

# Append " \n . \n" to the end of each song; this is useful for the trigrams
corpus_test["Lyrics"] = corpus_test["Lyrics"] + " \n . \n"

# Tokenize the lyrics of each song
corpus_test['words'] = corpus_test['Lyrics'].apply(lambda x: nlp.tokenizer(str(x)))


# Explode the data: one row per token
corpus_test = corpus_test.explode("words", ignore_index=True)
# Remove the leftover index column read from the CSV
del corpus_test['Unnamed: 0']

corpus_test.tail(10)

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words
41282,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,'ve
41283,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,got
41284,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,to
41285,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,hide
41286,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,your
41287,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,love
41288,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,away
41289,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,\n
41290,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,.
41291,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,\n


In [3]:
# To do first: download the model (run once in a shell):

# python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

## Method:

## Cf. https://stackoverflow.com/questions/44395656/applying-spacy-parser-to-pandas-dataframe-w-multiprocessing

## Spacy is highly optimised and does the multiprocessing for you. 
## As a result, I think your best bet is to take the data out of 
## the Dataframe and pass it to the Spacy pipeline as a list rather 
## than trying to use .apply directly.
## You then need to collate the results of the parse, and put 
## this back into the Dataframe. 

# Each row holds a single token, so every doc below is one token tagged
# without its sentence context; lemma and pos are one-element lists.
lemma = []
pos = []

# Note: Doc.is_parsed is a spaCy v2 attribute; spaCy v3 replaced it with
# doc.has_annotation("DEP").
for doc in nlp.pipe(corpus_test['words'].astype(str).values, batch_size=50):
    if doc.is_parsed:
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # Keep the result lists the same length as the original DataFrame:
        # add a blank entry in case the parse fails.
        lemma.append(None)
        pos.append(None)

corpus_test['lemma'] = lemma
corpus_test['pos'] = pos


In [4]:
# A look at the data:

corpus_test[0:5]

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos
0,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,Words,[word],[NOUN]
1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,are,[be],[AUX]
2,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,flowing,[flow],[VERB]
3,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,out,[out],[ADP]
4,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,like,[like],[SCONJ]


In [5]:
len(corpus_test)


41292

In [6]:
# Check that words / lemma / pos are correctly aligned
corpus_test.tail()

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos
41287,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,love,[love],[NOUN]
41288,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,away,[away],[ADV]
41289,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,\n,[\n ],[SPACE]
41290,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,.,[.],[PUNCT]
41291,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,\n,[\n],[SPACE]


In [7]:
import numpy as np  # needed for np.unique

song_names = np.unique(corpus_test['Song'])
song_name = song_names[5]
corpus_test.query("`Song`==@song_name")[-10:]

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos
981,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,All,[all],[DET]
982,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,together,[together],[ADV]
983,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,now,[now],[ADV]
984,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,\n,[\n],[SPACE]
985,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,All,[all],[DET]
986,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,together,[together],[ADV]
987,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,now,[now],[ADV]
988,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,\n,[\n ],[SPACE]
989,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,.,[.],[PUNCT]
990,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four\nCan I have a little more\n...,\n,[\n],[SPACE]


---

### Conversion to POS trigrams

   - What for?
   - See in particular: ZHAO, Ying, ZOBEL, Justin, "Searching with style: authorship attribution in classic literature", in Proceedings of the thirtieth Australasian conference on Computer science, Volume 62, Australian Computer Society, Inc., AUS, 2007, pp. 59–68.
    
---
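
As a rough illustration of what a POS trigram is, the sketch below slides a window of three tags over a short sequence in plain Python. The helper `sliding_trigrams` and the sample tags are illustrative assumptions; the notebook itself relies on `nltk.util.ngrams`.

```python
def sliding_trigrams(tags):
    """Return every run of three consecutive tags as a tuple."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]


# Sample tags, made up for the illustration.
tags = ["NOUN", "AUX", "VERB", "ADP", "SCONJ"]
trigrams = sliding_trigrams(tags)

for trigram in trigrams:
    print(trigram)

# A sequence of N tags yields N - 2 trigrams.
print(len(tags), len(trigrams))
```

The idea, roughly, is that the frequencies of such tag sequences reflect syntactic habits rather than vocabulary, which is what makes them candidate features for authorship attribution.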

In [8]:
# Conversion to trigrams:

from nltk.util import ngrams

liste_pos = list(corpus_test['pos'])
three_grams_pos = list(ngrams(liste_pos, 3))

# Adding the list to the DataFrame is not possible yet,
# because the lengths differ:
# corpus_test['3grams_pos'] = three_grams_pos
len(three_grams_pos)

41290

In [9]:
# The list is two items shorter than the DataFrame, which is expected:
# a sequence of N items yields N - 2 trigrams.
# Pad it with two placeholder trigrams so that the lengths match.

a = [(['NOUN'], ['ADV'], ['NaN'])]
b = [(['ADV'], ['NaN'], ['NaN'])]
three_grams_pos.extend(a)
three_grams_pos.extend(b)

len(three_grams_pos)

41292
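
The padding above hard-codes the two extra tuples. A more generic variant, sketched here in plain Python, pads the tag list itself with a placeholder so that every row gets the trigram starting at its own tag. The helper `padded_trigrams` and the `PAD` marker are hypothetical names, not part of the notebook.

```python
PAD = "NaN"  # hypothetical placeholder meaning "no tag here"


def padded_trigrams(tags, n=3, pad=PAD):
    """Return one n-gram per tag, padding past the end of the list."""
    padded = list(tags) + [pad] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(tags))]


# Sample tags, made up for the illustration.
tags = ["NOUN", "ADV", "SPACE", "PUNCT", "SPACE"]
trigrams = padded_trigrams(tags)

# Same length as the input, so the list can become a DataFrame column.
print(len(tags) == len(trigrams))
print(trigrams[-2:])
```

`nltk.util.ngrams` also accepts `pad_right=True` together with `right_pad_symbol`, which should give a similar result without a custom helper.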

In [10]:
corpus_test['3grams_pos'] = three_grams_pos

In [11]:
corpus_test.tail()

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos,3grams_pos
41287,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,love,[love],[NOUN],"([NOUN], [ADV], [SPACE])"
41288,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,away,[away],[ADV],"([ADV], [SPACE], [PUNCT])"
41289,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,\n,[\n ],[SPACE],"([SPACE], [PUNCT], [SPACE])"
41290,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,.,[.],[PUNCT],"([NOUN], [ADV], [NaN])"
41291,"""You've Got to Hide Your Love Away""",Help!,Lennon,Lennon,1965,Here I stand head in hand\nTurn my face to the...,\n,[\n],[SPACE],"([ADV], [NaN], [NaN])"


In [12]:
song_names = np.unique(corpus_test['Song'])
song_name = song_names[1]
corpus_test.query("`Song`==@song_name")

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos,3grams_pos
12118,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",It,[-PRON-],[PRON],"([PRON], [AUX], [AUX])"
12119,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",'s,[be],[AUX],"([AUX], [AUX], [X])"
12120,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",been,[be],[AUX],"([AUX], [X], [ADV])"
12121,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",a,[a],[X],"([X], [ADV], [NOUN])"
12122,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",hard,[hard],[ADV],"([ADV], [NOUN], [AUX])"
...,...,...,...,...,...,...,...,...,...,...
12364,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",alright,[alright],[INTJ],"([INTJ], [PUNCT], [SPACE])"
12365,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",...,[...],[PUNCT],"([PUNCT], [SPACE], [PUNCT])"
12366,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",\n,[\n ],[SPACE],"([SPACE], [PUNCT], [SPACE])"
12367,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",.,[.],[PUNCT],"([PUNCT], [SPACE], [PRON])"


In [13]:
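# Drop the last three rows of each song, i.e. the " \n . \n"
# separator tokens added before tokenization.
for song_name in song_names: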
    corpus_test.drop(index=corpus_test.query("`Song`==@song_name").index[-3:], inplace=True)
corpus_test.reset_index(drop=True, inplace=True)
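
# Why three rows? A small plain-Python sketch, assuming every song ends with
# exactly three separator tokens. Trigrams are computed over the whole corpus,
# so those starting on the last two separator tokens reach into the next song.
# The song data and variable names below are made up for the illustration.

```python
SEPARATOR = ["SPACE", "PUNCT", "SPACE"]  # tags of the " \n . \n" tokens

# Two made-up songs; each tag is paired with the song it belongs to.
songs = {"A": ["PRON", "VERB"], "B": ["DET", "NOUN", "VERB"]}
rows = [(name, tag) for name, tags in songs.items()
        for tag in tags + SEPARATOR]

# One trigram per row, computed across the whole corpus (padded at the end).
padded = rows + [(None, "NaN")] * 2
trigrams = [padded[i:i + 3] for i in range(len(rows))]

# Keep every row except the last three of its song.
kept = []
for name in songs:
    indices = [i for i, (song, _) in enumerate(rows) if song == name]
    kept.extend(indices[:-3])

# After the drop, no remaining trigram mixes tags from two songs.
crossing = [i for i in kept
            if len({song for song, _ in trigrams[i]}) > 1]
print(crossing)
```

# The trigrams kept for the last real tokens of a song still contain
# separator tags, which marks the end of the song in the feature itself.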

In [14]:
# Check the result: the song should no longer end with separator tokens
song_name = song_names[1]
corpus_test.query("`Song`==@song_name").tail(5)

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos,3grams_pos
12196,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",know,[know],[VERB],"([VERB], [PRON], [VERB])"
12197,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",I,[-PRON-],[PRON],"([PRON], [VERB], [INTJ])"
12198,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",feel,[feel],[VERB],"([VERB], [INTJ], [PUNCT])"
12199,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",alright,[alright],[INTJ],"([INTJ], [PUNCT], [SPACE])"
12200,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",...,[...],[PUNCT],"([PUNCT], [SPACE], [PUNCT])"


In [15]:
# Save as CSV:

corpus_test.to_csv('./data/corpus_tokmorph.csv')