## Tokenization, lemmatization, part-of-speech tagging
---

**Goal**:

    - Build a table with word / lemma / POS / work / songwriter.
    
==> spaCy toolkit: https://spacy.io/api/tokenizer 
==> https://spacy.io/usage/linguistic-features

### A quick look at tokenization, lemmatization and part-of-speech tagging

   - What are they?
   - Why do this?
   
---
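To make the three steps concrete, here is a toy sketch of what each one produces for a short phrase. It deliberately does not use spaCy: the regular-expression tokenizer and the hand-made `TOY_LEXICON` lookup are hypothetical stand-ins invented for this illustration, whereas a real pipeline predicts lemmas and tags with a trained model.

```python
import re

# Hypothetical hand-made lookup, for illustration only: a real pipeline
# (such as spaCy's) predicts lemmas and POS tags with a trained model.
TOY_LEXICON = {
    "words": ("word", "NOUN"),
    "are": ("be", "AUX"),
    "flowing": ("flow", "VERB"),
}

def toy_tokenize(text):
    """Tokenization: split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def toy_annotate(text):
    """Return one (word, lemma, POS) triple per token."""
    rows = []
    for token in toy_tokenize(text):
        # Lemmatization and tagging: look the token up, fall back to ("itself", "X").
        lemma, pos = TOY_LEXICON.get(token.lower(), (token.lower(), "X"))
        rows.append((token, lemma, pos))
    return rows

rows = toy_annotate("Words are flowing.")
for row in rows:
    print(row)
```

Each triple corresponds to one row of the target table; the work and songwriter columns would simply be repeated for every token of a song.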

In [5]:
# Installing spaCy:

# pip install spacy

# pandas library (handling CSV data, DataFrames, etc.)
import pandas as pd

# Import and read the corpus:
corpus = pd.read_csv('./data/corpus_nettoye_ajout_solo.csv')

# Note: this second read overwrites the first, so only this file is used below.
corpus = pd.read_csv('./Corpus/Extension Harrison/corpus_nettoye_extension_harrison.csv')

corpus

Unnamed: 0.1,Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics
0,1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...
1,2,"""All I've Got to Do""",UK: With the Beatles\n US: Meet the Beatles!,Lennon,Lennon,1963,Whenever I want you around yeh All I gotta do...
2,3,"""All My Loving""",UK: With the Beatles\n US: Meet the Beatles!,McCartney,McCartney,1963,Close your eyes and I'll kiss you Tomorrow I'l...
3,5,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...
4,6,"""All You Need Is Love""",Magical Mystery Tour,Lennon,Lennon,1967,"Love, love, love Love, love, love Love, love, ..."
...,...,...,...,...,...,...,...
227,266,Out The Blue,Mind Games,Lennon,Lennon,1973,Out the blue you came to me And blew away life...
228,267,Only People,Mind Games,Lennon,Lennon,1973,Only people just know how to talk to people On...
229,268,I Know (I Know),Mind Games,Lennon,Lennon,1973,The years have passed so quickly One thing I'v...
230,269,You Are Here,Mind Games,Lennon,Lennon,1973,From Liverpool to Tokyo What a way to go From ...


## Building the dataset, with songs separated by the marker "\n . \n"

In [6]:
import spacy
from spacy.lang.en import English
nlp = English()

# Create a Tokenizer with the default settings for English,
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

# Small setting that silences the pandas SettingWithCopyWarning when writing to the data
# (a pandas safeguard)

pd.options.mode.chained_assignment = None  # default='warn' (Cf. https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas)


# Work on a copy so that the original `corpus` DataFrame is left untouched
corpus_test = corpus.copy()

# Append " \n . \n" to the end of each song; this is useful for the 3-grams
corpus_test["Lyrics"] = corpus_test["Lyrics"] + " \n . \n"

# Tokenize the lyrics of each song
corpus_test['words'] = corpus_test['Lyrics'].apply(lambda x: tokenizer(str(x)))


# Explode the data: one row per token
corpus_test = corpus_test.explode("words", ignore_index=True)
del corpus_test['Unnamed: 0']

corpus_test.tail(10)

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words
48750,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,tomorrow
48751,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,Chickinsuckin
48752,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,mothertruckin
48753,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,Meat
48754,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,City
48755,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,shookdown
48756,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,U.S.A
48757,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,\n
48758,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,.
48759,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,\n
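
The reshaping step above relies on `DataFrame.explode`, which gives each element of a list-like cell its own row and repeats the other columns. Here is a minimal sketch on invented data: the song titles and lyrics are placeholders, and a plain whitespace split stands in for the spaCy tokenizer.

```python
import pandas as pd

# Two invented rows standing in for the corpus (titles and lyrics are placeholders).
toy = pd.DataFrame({
    "Song": ["Song A", "Song B"],
    "Lyrics": ["la la love", "yeah yeah"],
})

# A plain whitespace split stands in for the spaCy tokenizer here.
toy["words"] = toy["Lyrics"].str.split()

# explode() gives each list element its own row and repeats the other columns;
# ignore_index=True renumbers the rows from 0.
exploded = toy.explode("words", ignore_index=True)
print(exploded)
```

This is why the `Song`, `Year` and `Lyrics` values are identical on every row of a given song in the output above.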


In [7]:
# To do first: download the model:

# python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

## Method:

## Cf. https://stackoverflow.com/questions/44395656/applying-spacy-parser-to-pandas-dataframe-w-multiprocessing

## Spacy is highly optimised and does the multiprocessing for you. 
## As a result, I think your best bet is to take the data out of 
## the Dataframe and pass it to the Spacy pipeline as a list rather 
## than trying to use .apply directly.
## You then need to collate the results of the parse, and put 
## this back into the Dataframe. 

# Note: each row holds a single token, so the tagger sees every word
# in isolation, without its sentence context.
# Note: `doc.is_parsed` comes from spaCy 2.x; spaCy 3 deprecates it
# in favour of `doc.has_annotation("DEP")`.

lemma = []
pos = []

for doc in nlp.pipe(corpus_test['words'].astype(str).values, batch_size=50):
    if doc.is_parsed:
        #tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries as the original Dataframe, so add some blanks in case the parse fails
        # tokens.append(None)
        lemma.append(None)
        pos.append(None)
        
# corpus_test['tokens'] = tokens
corpus_test['lemma'] = lemma
corpus_test['pos'] = pos


In [8]:
# A look at the data:

corpus_test[0:5]

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos
0,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,Words,[word],[NOUN]
1,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,are,[be],[AUX]
2,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,flowing,[flow],[VERB]
3,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,out,[out],[ADP]
4,"""Across the Universe""",Let It Be,Lennon,Lennon,1968,Words are flowing out like endless rain into a...,like,[like],[SCONJ]


In [11]:
len(corpus_test)


48760

In [12]:
# Check that words / lemma / pos are properly aligned
corpus_test.tail()

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos
48755,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,shookdown,[shookdown],[PROPN]
48756,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,U.S.A,[U.S.A],[PROPN]
48757,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,\n,[\n ],[SPACE]
48758,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,.,[.],[PUNCT]
48759,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,\n,[\n],[SPACE]


In [14]:
import numpy as np
song_names = np.unique(corpus_test['Song'])
song_name = song_names[5]
corpus_test.query("`Song`==@song_name")[-10:]

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos
864,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,now,[now],[ADV]
865,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,All,[all],[DET]
866,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,together,[together],[ADV]
867,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,now,[now],[ADV]
868,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,All,[all],[DET]
869,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,together,[together],[ADV]
870,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,now,[now],[ADV]
871,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,\n,[\n ],[SPACE]
872,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,.,[.],[PUNCT]
873,"""All Together Now""",Yellow Submarine,Lennon-McCartney,McCartney,1967,One two three four Can I have a little more Fi...,\n,[\n],[SPACE]


---

### Converting the POS tags to 3-grams

   - What for?
   - On this point, see in particular: ZHAO, Ying, ZOBEL, Justin, "Searching with style: authorship attribution in classic literature", in Proceedings of the thirtieth Australasian conference on Computer science, Volume 62, Australian Computer Society, Inc., AUS, 2007, pp. 59–68.
    
---
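
One way to read the "what for?" question: sequences of three POS tags can serve as style features, since their relative frequencies may differ from one author to another. The sketch below is only an illustration of that idea and is not taken from the cited paper; the tag sequences, the author labels and the helper `pos_trigrams` are all invented for the example.

```python
from collections import Counter

def pos_trigrams(tags):
    """Return the list of consecutive POS triples in a tag sequence."""
    return list(zip(tags, tags[1:], tags[2:]))

# Invented tag sequences, for illustration only (not taken from the corpus).
tags_by_author = {
    "Author A": ["PRON", "AUX", "VERB", "PRON", "AUX", "VERB"],
    "Author B": ["DET", "NOUN", "VERB", "DET", "NOUN"],
}

profiles = {}
for author, tags in tags_by_author.items():
    counts = Counter(pos_trigrams(tags))
    total = sum(counts.values())
    # Relative frequencies make profiles comparable across texts of different lengths.
    profiles[author] = {gram: n / total for gram, n in counts.items()}

for author, profile in profiles.items():
    print(author, profile)
```

Comparing such frequency profiles between songwriters is the kind of analysis the 3-gram column is meant to support.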

In [15]:
# Conversion to 3-grams:

import nltk
from nltk.util import ngrams
liste_pos = list(corpus_test['pos'])
three_grams_pos = list(ngrams(liste_pos, 3))

# Adding to the DataFrame:
# corpus_test['3grams_pos'] = three_grams_pos
len(three_grams_pos)

48758

In [16]:
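# The list and the DataFrame differ in length, which is expected.
# Any idea why?

# Pad with two placeholder trigrams so that the list matches the DataFrame length.
# They land on the last two artificial marker rows, which the clean-up step further down drops.
a = [(['NOUN'], ['ADV'], ['NaN'])]
b = [(['ADV'], ['NaN'], ['NaN'])]
three_grams_pos.extend(a)
three_grams_pos.extend(b)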

len(three_grams_pos)

48760
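
The two-entry gap follows from how a sliding window works: a window of width 3 can start at every position except the last two, so n items yield n - 2 trigrams (48760 - 2 = 48758 here). A minimal sketch in plain Python rather than nltk; the helper name `trigrams` is made up for this illustration, and it pads with `None` instead of the placeholder tags used in the cell above.

```python
def trigrams(items):
    """Slide a window of width 3 over the sequence."""
    return list(zip(items, items[1:], items[2:]))

tags = ["NOUN", "AUX", "VERB", "ADP", "SCONJ"]   # 5 tags
grams = trigrams(tags)                            # 5 - 2 = 3 trigrams
print(len(tags), len(grams))

# Padding the input with two None values gives every position an entry;
# the last two windows are the incomplete ones.
padded = trigrams(tags + [None, None])
print(padded[-2:])
```

Padding the input rather than the output keeps the real tags in the final windows, which makes it easier to see where the sequence ends.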

In [17]:
corpus_test['3grams_pos'] = three_grams_pos

In [18]:
corpus_test.tail()

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos,3grams_pos
48755,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,shookdown,[shookdown],[PROPN],"([PROPN], [PROPN], [SPACE])"
48756,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,U.S.A,[U.S.A],[PROPN],"([PROPN], [SPACE], [PUNCT])"
48757,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,\n,[\n ],[SPACE],"([SPACE], [PUNCT], [SPACE])"
48758,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,.,[.],[PUNCT],"([NOUN], [ADV], [NaN])"
48759,Meat City,Mind Games,Lennon,Lennon,1973,Well I been Meat City to see for myself Been M...,\n,[\n],[SPACE],"([ADV], [NaN], [NaN])"


In [19]:
song_names = np.unique(corpus_test['Song'])
song_name = song_names[1]
corpus_test.query("`Song`==@song_name")

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos,3grams_pos
10725,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",It,[-PRON-],[PRON],"([PRON], [AUX], [AUX])"
10726,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",'s,[be],[AUX],"([AUX], [AUX], [X])"
10727,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",been,[be],[AUX],"([AUX], [X], [ADV])"
10728,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",a,[a],[X],"([X], [ADV], [NOUN])"
10729,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",hard,[hard],[ADV],"([ADV], [NOUN], [AUX])"
...,...,...,...,...,...,...,...,...,...,...
10951,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",alright,[alright],[INTJ],"([INTJ], [PUNCT], [SPACE])"
10952,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",...,[...],[PUNCT],"([PUNCT], [SPACE], [PUNCT])"
10953,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",\n,[\n ],[SPACE],"([SPACE], [PUNCT], [SPACE])"
10954,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",.,[.],[PUNCT],"([PUNCT], [SPACE], [PRON])"


In [20]:
# Remove the three artificially added rows at the end of each song
for song_name in song_names:
    corpus_test.drop(index=corpus_test.query("`Song`==@song_name").index[-3:], inplace=True)
corpus_test.reset_index(drop=True, inplace=True)
# Remove the appended marker " \n . \n" (6 characters, since each "\n" is a single character)
corpus_test["Lyrics"] = corpus_test["Lyrics"].apply(lambda x: x[:-len(" \n . \n")])
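
The marker appended to each song's lyrics earlier, `" \n . \n"`, is 6 characters long, because each `\n` escape is a single newline character. A small sketch of stripping it by its actual length rather than a hard-coded count; the names `MARKER` and `strip_marker` are hypothetical helpers introduced for this illustration.

```python
MARKER = " \n . \n"   # the separator appended to every song's lyrics

# Each "\n" escape is a single character, so the marker is 6 characters long.
marker_length = len(MARKER)
print(marker_length)

def strip_marker(lyrics):
    """Remove the trailing marker if present, leaving other text untouched."""
    if lyrics.endswith(MARKER):
        return lyrics[:-len(MARKER)]
    return lyrics

cleaned = strip_marker("All together now" + MARKER)
print(repr(cleaned))
```

Checking with `endswith` first also makes the step safe to run twice on the same column.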

In [21]:
# Check results
song_name = song_names[1]
corpus_test.query("`Song`==@song_name").tail(5)

Unnamed: 0,Song,Album Debut,Songwriter(s),Lead Vocal(s),Year,Lyrics,words,lemma,pos,3grams_pos
10783,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",know,[know],[VERB],"([VERB], [PRON], [VERB])"
10784,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",I,[-PRON-],[PRON],"([PRON], [VERB], [INTJ])"
10785,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",feel,[feel],[VERB],"([VERB], [INTJ], [PUNCT])"
10786,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",alright,[alright],[INTJ],"([INTJ], [PUNCT], [SPACE])"
10787,"""A Hard Day's Night""",UK: A Hard Day's Night\n US: 1962–1966,Lennon-McCartney,Lennon\n (with McCartney),1964,"It's been a hard day's night, and I been worki...",...,[...],[PUNCT],"([PUNCT], [SPACE], [PUNCT])"


In [23]:
# Save as CSV:

corpus_test.to_csv('./data/corpus_tokmorph_ajout_solo.csv')