# Feature extraction

<i>Warning: the embeddings take a long time to generate. On my computer (iMac 2019, 3.7 GHz 6-core Intel Core i5), it took about 6 hours. The embedding files are already in the embeddings folder, and the generate_embeddings call in the last cell only regenerates them if you specify overwrite=True.</i>

In this notebook we produce five embeddings for each chunk:
<ul>
<li><b>word2vec</b>, a classic word embedding from which we can make an embedding for our chunk by averaging over the words in the chunk</li>
<li><b>fastText</b>, the embedding used in the French spaCy model we have been using</li>
<li><b>universal sentence encoder</b>, a relatively recent and large multilingual embedding</li>
<li><b>camembert-large</b>, a relatively recent, large, French-specific embedding</li>
<li><b>camembert-base-wikipedia-4gb</b>, a smaller French-specific embedding</li>
</ul>
It is not easy to find a ready-made word2vec embedding for French, as the method is no longer in fashion. We therefore train our own model on our train set.
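
The averaging step mentioned in the list above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the toy vocabulary `toy_vectors`, its 3-dimensional vectors, and the helper `average_embedding` are hypothetical stand-ins for the trained word2vec model and its 200-dimensional vectors.

```python
import numpy as np

# Hypothetical toy vocabulary with 3-dimensional vectors
# (the model trained in this notebook uses 200 dimensions).
toy_vectors = {
    "chat": np.array([1.0, 0.0, 2.0]),
    "noir": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, vectors, size):
    """Return the mean vector of the known tokens, or zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# "inconnu" is not in the vocabulary, so it is ignored in the average.
chunk_vector = average_embedding(["chat", "noir", "inconnu"], toy_vectors, 3)
```

Out-of-vocabulary tokens are simply dropped, so a chunk made only of unknown words falls back to a zero vector.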

In [1]:
import os

import gensim
import pandas as pd
from gensim.models import Word2Vec
from nltk import sent_tokenize
from nltk import word_tokenize #we use an older tool for the older embedding
from nltk.corpus import stopwords


no_dupl=pd.read_csv('no_dupl.csv')
all_chunks={}

for t in ["valid1", "valid2", "valid3", "valid4", 'test']:
    all_chunks[t]=[]
    for author in no_dupl["author"].unique():
        with open(os.path.join(t, author), 'r') as f:
            all_chunks[t]+=f.read().split("\n\t\t\n")

fr_stop=stopwords.words('french')+['?', '°', '(', '.', '=', '!', ';', ')', ',', ':', "'", '-', '*', ']', '[', '"', '...']

def prepare(chunk):
    sents=sent_tokenize(chunk, language='french')

    return [[token.lower() for token in word_tokenize(sent, language='french') if token.lower() not in fr_stop] for sent in sents]

train_chunks=all_chunks["valid1"]+all_chunks["valid2"]+all_chunks["valid3"]+all_chunks["valid4"]
train_sents=[sent for chunk in train_chunks for sent in prepare(chunk)]

w2v_model = gensim.models.Word2Vec(vector_size=200,  workers=4, sg=0)  
w2v_model.build_vocab(train_sents)
w2v_model.train(train_sents, total_examples=w2v_model.corpus_count, epochs=5)



(5724820, 6432490)

In [2]:
emb_models=["w2v","fastText", "uni_sent_enc", "camembert-wiki", "camembert-large"]
vector_sizes=[200, 300, 512, 768, 1024]

In [3]:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' #we suppress the tensorflow info messages
import torch
import spacy
import numpy as np


We load the spaCy model that provides the fastText vectors:

In [4]:
nlp = spacy.load('fr_core_news_lg')

We load the universal sentence encoder from the tensorflow hub.

In [5]:
import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
uni_sent_enc = hub.load(module_url)

We load the two Camembert models using the transformers library:

In [6]:
from transformers import CamembertModel, CamembertTokenizer

tokenizer={}
camembert={}
tokenizer["camembert-large"] = CamembertTokenizer.from_pretrained("camembert/camembert-large")
camembert["camembert-large"] = CamembertModel.from_pretrained("camembert/camembert-large")

tokenizer["camembert-wiki"] = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert["camembert-wiki"] = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")

camembert["camembert-large"].eval();
camembert["camembert-wiki"].eval();

Some weights of the model checkpoint at camembert/camembert-large were not used when initializing CamembertModel: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at camembert/camembert-base-wikipedia-4gb were not used when initializing CamembertModel: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias']
- This

Before we start extracting features, we want to make sure that the chunks are not too big. Camembert has a limit of 512 tokens. What a token is depends on the tokenizer, which is different for the large and the wiki model. Let's see if we can ignore the chunks that are too big.

In [7]:
def chunk_length_measurer(filename, emb_model):
    with open(filename, 'r') as f:
        chunks=f.read().split("\n\t\t\n")
        lengths=[]
        for i, ch in enumerate(chunks):
            tokenized=tokenizer[emb_model].tokenize(ch)
            encoded=tokenizer[emb_model].encode(tokenized)
            encoded = torch.tensor(encoded).unsqueeze(0)
            lengths.append(encoded.shape[1])
    return lengths
chunk_lengths={}
too_big={}
for t in ["valid1", "valid2", "valid3", "valid4", "test"]:
    chunk_lengths[t]={}
    too_big[t]={}
    for a in no_dupl["author"].unique():
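        wiki_lengths=chunk_length_measurer(os.path.join(t, a), "camembert-wiki")
        large_lengths=chunk_length_measurer(os.path.join(t, a), "camembert-large")
        # element-wise maximum per chunk: max() on two lists would compare them lexicographically
        chunk_lengths[t][a]=[max(w, lg) for w, lg in zip(wiki_lengths, large_lengths)]
        too_big[t][a]=[]
        for (i, l) in enumerate(chunk_lengths[t][a]):
            if l>512:
                too_big[t][a]+=[i] # we save the indices of the chunks that are too big for later use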
        if len(too_big[t][a])>0:
            print("We have to skip {} chunk(s) in the {} set for {}".format(len(too_big[t][a]), t, a))


We have to skip 1 chunk(s) in the valid4 set for George Sand
We have to skip 2 chunk(s) in the valid4 set for Victor Hugo
We have to skip 4 chunk(s) in the test set for Victor Hugo
We have to skip 1 chunk(s) in the test set for Marcel Proust


It looks like these represent a very small fraction of the train/test set, so we will just ignore these chunks.

We save the too_big dictionary:
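
To put a number on "a very small fraction", one could compare the count of skipped chunks with the total count. The sketch below assumes dictionaries shaped like `chunk_lengths` and `too_big` above; the miniature data and the helper `skipped_fraction` are hypothetical and only illustrate the calculation.

```python
# Hypothetical miniature versions of the two dictionaries built above:
# set name -> author -> list of token lengths / list of skipped indices.
chunk_lengths = {"valid4": {"A": [120, 600, 300], "B": [90, 80]}}
too_big = {"valid4": {"A": [1], "B": []}}

def skipped_fraction(chunk_lengths, too_big):
    """Share of all chunks whose index is listed in too_big."""
    total = sum(len(lengths)
                for per_author in chunk_lengths.values()
                for lengths in per_author.values())
    skipped = sum(len(indices)
                  for per_author in too_big.values()
                  for indices in per_author.values())
    return skipped / total

fraction = skipped_fraction(chunk_lengths, too_big)  # 1 skipped chunk out of 5
```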

In [8]:
import pickle
with open("too_big.pkl", "wb") as file:
    pickle.dump(too_big, file)
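
A later notebook can read the dictionary back with `pickle.load`. The round trip below is a small sketch that uses a temporary directory and made-up contents (`too_big_example`), so it does not touch the real `too_big.pkl`.

```python
import os
import pickle
import tempfile

# Hypothetical contents, shaped like too_big: set name -> author -> chunk indices.
too_big_example = {"test": {"Victor Hugo": [3, 17]}}

# Write to a temporary location so the real file is left untouched.
path = os.path.join(tempfile.mkdtemp(), "too_big.pkl")
with open(path, "wb") as file:
    pickle.dump(too_big_example, file)

# Reading requires binary mode ("rb"), mirroring the "wb" used for writing.
with open(path, "rb") as file:
    restored = pickle.load(file)
```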

We define three functions to generate embeddings:
<ul>
<li><tt><b>chunk_to_vector</b></tt> for the chunk level</li>
<li><tt><b>embedder</b></tt> for the author level to produce embeddings for all the chunks for an author in the train/test set</li>
<li><tt><b>generate_embeddings</b></tt> for the embedding model level to generate embeddings for all authors</li>

</ul>

In [9]:
def chunk_to_vector(ch, emb_model):
    if emb_model=='w2v':
        tokens=[token for sent in prepare(ch) for token in sent]
        vectors=[w2v_model.wv.get_vector(token) for token in tokens if token in w2v_model.wv.key_to_index]
        if not vectors:
            vectors=[np.zeros(200)] # no known token in the chunk: fall back to a zero vector
        embedding=np.mean(vectors, axis=0)
    elif emb_model=="fastText":
        processed=nlp(ch)
        embedding=processed.vector
    elif emb_model=="uni_sent_enc":
        embedding=uni_sent_enc([ch])
    elif emb_model=="camembert-large" or emb_model=="camembert-wiki":
        tokenized=tokenizer[emb_model].tokenize(ch)
        encoded=tokenizer[emb_model].encode(tokenized)
        encoded = torch.tensor(encoded).unsqueeze(0)
        embedding=camembert[emb_model](encoded).pooler_output.detach().numpy()
    else:
        raise ValueError("unknown model: {}".format(emb_model))
    return embedding

In [10]:
import time
def embedder(testortrain, author, emb_model, vector_size):
    filename=os.path.join(testortrain, author)
    print(filename)
    print(time.ctime())
    j=0
    with open(filename, 'r') as f:
        chunks=f.read().split("\n\t\t\n")
    length=len(chunks)
    embeddings = np.zeros((length, vector_size))
    for i, ch in enumerate(chunks):
        if i in too_big[testortrain][author]: #we skip the chunks that are too big for one of the camembert models
            continue
        embedding=chunk_to_vector(ch, emb_model)
        embeddings[j]=embedding
        j+=1
    embeddings=embeddings[:j]
    labels=[author]*j
    return embeddings, labels

In [11]:
def generate_embeddings(emb_model, vector_size, t, filename, overwrite=False):
    if (not os.path.isfile(filename)) or overwrite:
        emb={}
        lab={}
        all_authors=no_dupl["author"].unique()
        for a in all_authors:
            emb[a], lab[a]=embedder(t, a, emb_model, vector_size)
        X=np.vstack([emb[a] for a in all_authors])          
        y=np.hstack([np.array(lab[a]) for a in all_authors]) 

        np.savez(filename, X=X, y=y)
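
The saved files can be read back with `np.load`, using the same keys `X` and `y` that were passed to `np.savez`. The sketch below writes a toy file to a temporary directory first; the file name `toy_valid1.npz` and the array contents are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the embedding matrix and the author labels.
X = np.zeros((3, 4))
y = np.array(["George Sand", "George Sand", "Victor Hugo"])

path = os.path.join(tempfile.mkdtemp(), "toy_valid1.npz")
np.savez(path, X=X, y=y)

# np.load returns a lazy archive; indexing by key reads each array.
with np.load(path) as data:
    X_loaded, y_loaded = data["X"], data["y"]
```

Each row of `X_loaded` lines up with the label at the same position in `y_loaded`.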


Finally, we run generate_embeddings for all five embedding models. As mentioned above, existing files are only regenerated if the last argument is changed to overwrite=True, and doing so takes quite some time:

In [12]:
os.makedirs("embeddings", exist_ok=True)

for emb_model, vector_size in zip(emb_models, vector_sizes):
    
    print("Generating embeddings for embedding {}".format(emb_model))
    for t in ["valid1", "valid2", "valid3", "valid4", "test"]:
        generate_embeddings(emb_model, vector_size, t, os.path.join("embeddings", emb_model+'_'+t+'.npz'), overwrite=False) 

Generating embeddings for embedding w2v
valid1/George Sand
Sat Apr  1 19:52:03 2023
valid1/Émile Zola
Sat Apr  1 19:52:04 2023
valid1/Alphonse de Lamartine
Sat Apr  1 19:52:04 2023
valid1/Anatole France
Sat Apr  1 19:52:05 2023
valid1/Jules Verne
Sat Apr  1 19:52:05 2023
valid1/Victor Hugo
Sat Apr  1 19:52:06 2023
valid1/Guy de Maupassant
Sat Apr  1 19:52:06 2023
valid1/Alexandre Dumas
Sat Apr  1 19:52:07 2023
valid1/Marcel Proust
Sat Apr  1 19:52:07 2023
valid1/Gustave Flaubert
Sat Apr  1 19:52:08 2023
valid2/George Sand
Sat Apr  1 19:52:08 2023
valid2/Émile Zola
Sat Apr  1 19:52:09 2023
valid2/Alphonse de Lamartine
Sat Apr  1 19:52:09 2023
valid2/Anatole France
Sat Apr  1 19:52:10 2023
valid2/Jules Verne
Sat Apr  1 19:52:10 2023
valid2/Victor Hugo
Sat Apr  1 19:52:11 2023
valid2/Guy de Maupassant
Sat Apr  1 19:52:12 2023
valid2/Alexandre Dumas
Sat Apr  1 19:52:12 2023
valid2/Marcel Proust
Sat Apr  1 19:52:13 2023
valid2/Gustave Flaubert
Sat Apr  1 19:52:13 2023
valid3/George Sand
Sat