Word2Vec

# VINCENT VAN GOGH STANDARD DATA TEXT PROCESS

This script takes the gallery's text from the local data folder and processes it into a standard representation using Natural Language Processing and Word2Vec methods.

The process goes as follows:

1. Load the CSV into a pandas DataFrame.
2. Transform text columns into word lists.
3. Clean the text to find tokens (remove stop words).
4. Lemmatize the tokens to find their base forms (lemmas).
5. Save the complete corpus dictionary.
6. Execute Bag of Words to find a vector representation (doc2bow).
7. Map the tokens in the text to their dictionary indices (doc2idx).
8. Execute the Word2Vec standardization process with the TF-IDF method (Term Frequency - Inverse Document Frequency).
9. Create the dense standard vector text representation.
10. Save the resulting CSV from the pandas DataFrame.

The following links were useful in creating this proof of concept:

- Gensim Word2Vec Tutorial, URL: https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial#Getting-Started
- Gensim Core Concepts, URL: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
- Document similarity queries, URL: https://radimrehurek.com/gensim/similarities/docsim.html
- Tutorial 19: Text analysis with gensim, URL: https://statsmaths.github.io/stat289-f18/solutions/tutorial19-gensim.html
- Gensim TF-IDF model, URL: https://radimrehurek.com/gensim/models/tfidfmodel.html?highlight=tfidfmodel#module-gensim.models.tfidfmodel
- Gensim, Construct word<->id mappings, URL: https://radimrehurek.com/gensim/corpora/dictionary.html?highlight=doc2idx#gensim.corpora.dictionary.Dictionary.doc2idx
- tensorflow, Word2Vec, URL: https://www.tensorflow.org/tutorials/text/word2vec
- Introduction to Word Embeddings, URL: https://pub.towardsai.net/introduction-to-word-embedding-5ba5cf97d296
- Python for NLP: Working with the Gensim Library (Part 1), URL: https://stackabuse.com/python-for-nlp-working-with-the-gensim-library-part-1/
- Packt main github repository, URL: https://github.com/PacktPublishing


In [44]:
"""
* Copyright 2020, Maestria de Humanidades Digitales,
* Universidad de Los Andes
*
* Developed for the Msc graduation project in Digital Humanities
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program.  If not, see <http://www.gnu.org/licenses/>.
"""

# ===============================
# native python libraries
# ===============================
import os
import copy
import sys
import csv
import re
import pprint

# ===============================
# extension python libraries
# ===============================
import pandas as pd
import numpy as np
import gensim
from gensim import models
from gensim import matutils
from gensim import similarities
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# downloading nltk data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


# ===============================
# developed python libraries
# ===============================


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [45]:
# notebook variable definitions
# root folder
dataf = "Data"

# subfolder with the OCR transcribed txt data
prepf = "Prep"

# subfolder with the CSV files containing the ML pandas dataframe
stdf = "Std"

# dataframe file extension
fext = "csv"

# dictionary extension
dext = "dict"

# dataframe file name
small_fn = "VVG-Gallery-Text-Data-Small" + "." + fext
large_fn = "VVG-Gallery-Text-Data-Large" + "." + fext


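# regex for _TEXT
text_re = r"\w+_TEXT"

# regex for ID
id_re = r"ID{1}"

# regex for others (URLs|Categories)
# raw strings keep "\b" as a word boundary instead of a backspace character
# the second assignment overrides the first; as written it cannot match any
# column name because "^" follows "ID", and cat_cols is not used later on
cat_re = r"\b(?!(ID{1}|\w+_TEXT))\b(\w+\W+)+"
cat_re = r"ID{1}(^\w+( \w+)*$)"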

# default values
work_fn = small_fn
# work_fn = large_fn

In [46]:
# loading the CSV file into pandas
# reading an existing CSV file to create the dataframe
fn_path = os.path.join(os.getcwd(), dataf, prepf, work_fn)
print(fn_path)
text_df = pd.read_csv(
                fn_path,
                sep=",",
                encoding="utf-8",
                engine="python",
            )

c:\Users\Felipe\Documents\GitHub\sa-artea\VVG-Gallery-StdDataProcessor\Notebooks\Data\Prep\VVG-Gallery-Text-Data-Small.csv


In [47]:
# getting the df columns
df_cols = list(text_df)

# getting the text columns
text_r = re.compile(text_re)
text_cols = list(filter(text_r.match, df_cols))

# getting the ID column
id_r = re.compile(id_re)
id_cols = list(filter(id_r.match, df_cols))

# getting the URLs/Category columns
cat_r = re.compile(cat_re)
cat_cols = list(filter(cat_r.match, df_cols))
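
As a small illustration of the column selection above, the sketch below applies the same `filter(pattern.match, names)` idiom to a hypothetical list of column names (the names in `sample_cols` are made up for the example, they are not read from the CSV). It also shows one thing worth keeping in mind: `re.match` only anchors at the start of the string, so a pattern such as `ID{1}` accepts any name that merely begins with `ID`, while `re.fullmatch` requires the whole name to match.

```python
import re

# hypothetical column names, used only for this illustration
sample_cols = ["ID", "CORE_TEXT", "EXT_TEXT", "drawing", "IDX_TOKENS"]

text_r = re.compile(r"\w+_TEXT")
id_r = re.compile(r"ID{1}")

# re.match anchors only at the beginning of each name
text_cols = list(filter(text_r.match, sample_cols))
id_cols = list(filter(id_r.match, sample_cols))

# re.fullmatch requires the whole name to match the pattern
strict_id_cols = [name for name in sample_cols if id_r.fullmatch(name)]

print(text_cols)
print(id_cols)
print(strict_id_cols)
```

If only the exact `ID` column is wanted, the `fullmatch` variant is probably the safer choice once columns such as `IDX_TOKENS` are added to the dataframe.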

In [48]:
# getting the original working text
text_corpus = list(text_df[text_cols[0]])
print(len(text_corpus))

59


In [49]:
# lowercasing the working text
text_clean = list()
for text in text_corpus:
    text = text.lower()
    text_clean.append(text)

print(len(text_clean), len(text_corpus))

59 59


In [50]:
# cleaning and preprocessing text for word2vec
for i, text in enumerate(text_clean):
    # replacing special (non-word) characters with spaces
    text = re.sub(r"\W", " ", text)
    # restoring the missing decimal points between numbers
    text = re.sub(r"(\d{1,3}) (\d{1,2})", r"\1.\2", text)
    # removing excessive spaces
    text = re.sub(r"\s+", " ", text)
    text_clean[i] = text

print(len(text_clean), len(text_corpus))

59 59
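
To make the effect of the three substitutions easier to follow, here is a minimal sketch that wraps them in a helper and runs them on a made-up sample string. The function name `clean_text` and the sample strings are assumptions for the illustration; the regular expressions are the ones used in the cell above.

```python
import re

def clean_text(text):
    """Apply the notebook's cleaning steps to a single string."""
    text = text.lower()
    # every non-word character (punctuation, brackets, dots) becomes a space
    text = re.sub(r"\W", " ", text)
    # a short number, one space, a short number is read as a lost decimal point
    text = re.sub(r"(\d{1,3}) (\d{1,2})", r"\1.\2", text)
    # runs of whitespace collapse into a single space
    text = re.sub(r"\s+", " ", text)
    return text

# made-up sample in the style of the gallery descriptions
sample = "F0388r 43.5 cm x 36.2 cm, (1853 - 1890)"
print(repr(clean_text(sample)))

# caveat: two unrelated numbers separated by one space are joined as well
print(repr(clean_text("rooms 10 20")))
```

The years survive as separate tokens only because the dash between them leaves more than one space before the whitespace is collapsed. Two numbers that were originally separated by a single space are merged into one decimal, which may or may not be what is wanted for a given corpus.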


In [51]:
# tokenising text
text_tokens = list()

for text in text_clean:
    text = text.split()
    text_tokens.append(text)
    # print(text)

print(len(text_tokens), len(text_clean), len(text_corpus))

59 59 59


In [52]:
# removing stopwords
text_nsw_tokens = list()
# building the stop word set once instead of once per token
stop_words = set(stopwords.words('english'))

for tokens in text_tokens:

    clear_tokens = list()

    for token in tokens:
        if token not in stop_words:
            clear_tokens.append(token)

    # clear_tokens is a new list on every iteration, so no copy is needed
    text_nsw_tokens.append(clear_tokens)
    # print(clear_tokens)

print(len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

59 59 59 59
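
The stop word step can be sketched without NLTK as a simple set lookup. The stop word set below is a tiny, hypothetical stand-in chosen for the example; the notebook itself relies on NLTK's English list, which is much longer.

```python
# hypothetical, tiny stop word list used only for this illustration
toy_stopwords = {"of", "a", "an", "the", "with"}

def remove_stopwords(tokens, stop_set):
    """Keep the tokens that are not in the stop word set, preserving order."""
    return [token for token in tokens if token not in stop_set]

tokens = ["portrait", "of", "an", "old", "woman"]
print(remove_stopwords(tokens, toy_stopwords))
```

Using a `set` keeps each membership test at roughly constant cost, which matters once the corpus grows beyond a few dozen texts.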


In [53]:
# lemmatization of the text
text_lemmas = list()
token_lematizer = WordNetLemmatizer()

for tokens in text_nsw_tokens:

    lemma_tokens = list()

    for token in tokens:
        
        ans = token_lematizer.lemmatize(token)
        lemma_tokens.append(ans)

    tlemmas = copy.deepcopy(lemma_tokens)
    text_lemmas.append(tlemmas)

print(len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

59 59 59 59 59


In [54]:
text_df["TOKENS"] = text_tokens
text_df["PREP_TOKENS"] = text_lemmas

In [55]:
text_df.head()

Unnamed: 0,ID,CORE_TEXT,EXT_TEXT,complementary colours,this torso of Venus,drew,Van Gogh wrote,standing torso of Venus,he wrote,The Potato Eaters,...,1884,1887,animal art,drawing,1890,cityscape,1881,Brussels,TOKENS,PREP_TOKENS
0,s0004V1962r,Head of a Woman Vincent van Gogh (1853 - 1890)...,F0388r JH0782 s0004V1962r 43.5 cm x 36.2 cm,localhost,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,"[head, of, a, woman, vincent, van, gogh, 1853,...","[head, woman, vincent, van, gogh, 1853, 1890, ..."
1,s0006V1962,Head of a Woman Vincent van Gogh (1853 - 1890)...,"F0160 JH0722 s0006V1962 43.2 cm x 30.0 cm, 2.2...",https://www.vangoghmuseum.nl/en/stories/lookin...,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,"[head, of, a, woman, vincent, van, gogh, 1853,...","[head, woman, vincent, van, gogh, 1853, 1890, ..."
2,s0010V1962,Portrait of an Old Woman Vincent van Gogh (185...,"F0174 JH0978 s0010V1962 50.5 cm x 39.8 cm, 68....",localhost,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,"[portrait, of, an, old, woman, vincent, van, g...","[portrait, old, woman, vincent, van, gogh, 185..."
3,s0056V1962,"Torso of Venus Vincent van Gogh (1853 - 1890),...","F0216a JH1054 s0056V1962 46.0 cm x 38.0 cm, 55...",localhost,https://www.vangoghmuseum.nl/en/collection/s01...,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,"[torso, of, venus, vincent, van, gogh, 1853, 1...","[torso, venus, vincent, van, gogh, 1853, 1890,..."
4,s0058V1962,Woman with a Mourning Shawl Vincent van Gogh (...,"F0161 JH0788 s0058V1962 45.5 cm x 33.0 cm, 60 ...",localhost,localhost,https://www.vangoghmuseum.nl/en/collection/d00...,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,"[woman, with, a, mourning, shawl, vincent, van...","[woman, mourning, shawl, vincent, van, gogh, 1..."


In [56]:
# saving gensim words dictionary
vvg_dict = gensim.corpora.Dictionary(text_lemmas)
print(vvg_dict)
work_dict = work_fn.split(".")
work_dict = work_dict[0] + "." + dext
dict_pfn = os.path.join(dataf, stdf, work_dict)
print(dict_pfn)
vvg_dict.save(dict_pfn) 
# os.path.join("Data","VVG-gallery-text.dict"))
# pprint.pprint(vvg_dict.token2id)

Dictionary(660 unique tokens: ['1', '11', '16', '1853', '1885']...)
Data\Std\VVG-Gallery-Text-Data-Small.dict


In [57]:
# text representation to numeric representation
text_bows = list()
text_idxs = list()

for lemmas in text_lemmas:

    # bow loses the word order/semantics
    t_bow = vvg_dict.doc2bow(lemmas, allow_update=True)
    text_bows.append(t_bow)
    # idx keeps the word order/semantics
    t_idx = vvg_dict.doc2idx(lemmas)
    text_idxs.append(t_idx)

print(len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

59 59 59 59 59 59 59
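
The difference between the two representations may be easier to see on toy data. The sketch below is a pure-Python imitation of what `doc2bow` and `doc2idx` return, not gensim's implementation: the helper names are made up, and ids are assigned here in first-seen order, which will generally differ from the ids gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to every distinct token (first-seen order)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Bag of words: sorted (token_id, count) pairs, word order is lost."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

def to_idx(doc, token2id, unknown=-1):
    """Index sequence: one id per token, word order is kept."""
    return [token2id.get(t, unknown) for t in doc]

toy_docs = [["head", "woman", "head"], ["torso", "venus", "woman"]]
toy_ids = build_token2id(toy_docs)

print(to_bow(toy_docs[0], toy_ids))
print(to_idx(toy_docs[0], toy_ids))
print(to_idx(["woman", "sunflower"], toy_ids))
```

The bag of words has one entry per distinct token, while the index sequence has one entry per token occurrence; tokens that are not in the dictionary show up as `-1` in the index sequence.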


In [58]:
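# train the model
# with an explicit dictionary gensim takes the document frequencies from it,
# so no separate training corpus is passed
tfidf = gensim.models.TfidfModel(dictionary=vvg_dict, normalize=True)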
corpus_tfidf = tfidf[text_bows]
print(len(corpus_tfidf), len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

59 59 59 59 59 59 59 59
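
For readers who want to see what the TF-IDF weights stand for, the following is a small pure-Python sketch. It assumes the default weighting documented for gensim's `TfidfModel` (raw term frequency, times the base-2 logarithm of the number of documents over the document frequency, followed by scaling each document to unit length). The function name and the toy corpus are made up, and the sketch is not a substitute for the library.

```python
import math

def tfidf_weights(bows):
    """Compute unit-length tf-idf vectors from (term_id, count) pairs."""
    num_docs = len(bows)

    # document frequency: in how many documents each term appears
    dfs = {}
    for bow in bows:
        for term_id, _ in bow:
            dfs[term_id] = dfs.get(term_id, 0) + 1

    result = []
    for bow in bows:
        # term frequency times inverse document frequency (log base 2)
        raw = [(t, tf * math.log2(num_docs / dfs[t])) for t, tf in bow]
        # terms present in every document get weight 0 and are dropped
        raw = [(t, w) for t, w in raw if abs(w) > 1e-12]
        # scale the document vector to unit (L2) length
        norm = math.sqrt(sum(w * w for _, w in raw))
        result.append([(t, w / norm) for t, w in raw] if norm else [])
    return result

toy_bows = [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
toy_tfidf = tfidf_weights(toy_bows)
for doc in toy_tfidf:
    print(doc)
```

In the toy corpus term `1` occurs in both documents, so it carries no weight and disappears from both vectors. The same effect explains why very common tokens have tiny weights in the `TFIDF_TOKENS` column.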


In [59]:
text_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 52 columns):
 #   Column                                                                                              Non-Null Count  Dtype 
---  ------                                                                                              --------------  ----- 
 0   ID                                                                                                  59 non-null     object
 1   CORE_TEXT                                                                                           59 non-null     object
 2   EXT_TEXT                                                                                            59 non-null     object
 3   complementary colours                                                                               59 non-null     object
 4   this torso of Venus                                                                                 59 non-null     object
 

In [60]:
text_df["BOWS_TOKENS"] = text_bows
text_df["IDX_TOKENS"] = text_idxs
text_df["TFIDF_TOKENS"] = corpus_tfidf

  return array(a, dtype, copy=False, order=order)


In [61]:
# checking everything is okay
text_df.head()

Unnamed: 0,ID,CORE_TEXT,EXT_TEXT,complementary colours,this torso of Venus,drew,Van Gogh wrote,standing torso of Venus,he wrote,The Potato Eaters,...,drawing,1890,cityscape,1881,Brussels,TOKENS,PREP_TOKENS,BOWS_TOKENS,IDX_TOKENS,TFIDF_TOKENS
0,s0004V1962r,Head of a Woman Vincent van Gogh (1853 - 1890)...,F0388r JH0782 s0004V1962r 43.5 cm x 36.2 cm,localhost,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,"[head, of, a, woman, vincent, van, gogh, 1853,...","[head, woman, vincent, van, gogh, 1853, 1890, ...","[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1...","[39, 73, 70, 69, 38, 3, 5, 52, 49, 4, 53, 28, ...","[(0, 0.0014366751686058629), (1, 0.20742760148..."
1,s0006V1962,Head of a Woman Vincent van Gogh (1853 - 1890)...,"F0160 JH0722 s0006V1962 43.2 cm x 30.0 cm, 2.2...",https://www.vangoghmuseum.nl/en/stories/lookin...,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,"[head, of, a, woman, vincent, van, gogh, 1853,...","[head, woman, vincent, van, gogh, 1853, 1890, ...","[(0, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1...","[39, 73, 70, 69, 38, 3, 5, 52, 81, 4, 53, 28, ...","[(0, 0.0006632619049359525), (2, 0.10441989952..."
2,s0010V1962,Portrait of an Old Woman Vincent van Gogh (185...,"F0174 JH0978 s0010V1962 50.5 cm x 39.8 cm, 68....",localhost,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,"[portrait, of, an, old, woman, vincent, van, g...","[portrait, old, woman, vincent, van, gogh, 185...","[(0, 1), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1...","[151, 150, 73, 70, 69, 38, 3, 5, 127, 33, 4, 5...","[(0, 0.0007658221722692316), (4, 0.10636887288..."
3,s0056V1962,"Torso of Venus Vincent van Gogh (1853 - 1890),...","F0216a JH1054 s0056V1962 46.0 cm x 38.0 cm, 55...",localhost,https://www.vangoghmuseum.nl/en/collection/s01...,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,"[torso, of, venus, vincent, van, gogh, 1853, 1...","[torso, venus, vincent, van, gogh, 1853, 1890,...","[(0, 1), (3, 1), (5, 1), (6, 1), (7, 1), (11, ...","[190, 193, 70, 69, 38, 3, 5, 55, 44, 160, 53, ...","[(0, 0.0008712743083825235), (6, 0.00175770197..."
4,s0058V1962,Woman with a Mourning Shawl Vincent van Gogh (...,"F0161 JH0788 s0058V1962 45.5 cm x 33.0 cm, 60 ...",localhost,localhost,https://www.vangoghmuseum.nl/en/collection/d00...,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,localhost,"[woman, with, a, mourning, shawl, vincent, van...","[woman, mourning, shawl, vincent, van, gogh, 1...","[(0, 1), (3, 1), (4, 3), (5, 1), (6, 1), (7, 1...","[73, 205, 210, 70, 69, 38, 3, 5, 52, 48, 49, 4...","[(0, 0.0006110115453988034), (4, 0.12729967560..."


In [64]:
# creating the dense vector standard representation of the text
text_dvector = list()

# iterating over each text with its tfidf word bag
for t_idtokens, tfidf_tokens in zip(text_idxs, corpus_tfidf):
    # dense vector representation
    tdvect = list()

    # transforming the tfidf pairs into a dict, once per text
    tokens_dict = dict(tfidf_tokens)

    # creating the dense representation for each text
    for t_token in t_idtokens:

        # looking for each word; tokens without a tfidf weight are skipped
        if t_token in tokens_dict:
            # appending std word representation into array
            tdvect.append(tokens_dict[t_token])

    # adding the std dense vector of the text to the corpus column
    text_dvector.append(tdvect)

# checking the size of all columns
print(len(text_dvector), len(corpus_tfidf), len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

# adding the dense representation into the dataframe
text_df["STD_DVEC_TOKENS"] = text_dvector

59 59 59 59 59 59 59 59 59
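
The construction above can be tried on toy values to see what shape the result takes. The helper name and the numbers below are made up for the illustration; the logic mirrors the loop in the cell.

```python
def weights_in_token_order(idx_tokens, tfidf_pairs):
    """Replace each token id by its tf-idf weight, keeping the text order."""
    lookup = dict(tfidf_pairs)
    # ids without a tf-idf weight are skipped rather than filled with zeros
    return [lookup[i] for i in idx_tokens if i in lookup]

# hypothetical index sequence and tf-idf pairs for one text
idx_tokens = [2, 0, 2, 1]
tfidf_pairs = [(0, 0.6), (2, 0.8)]

print(weights_in_token_order(idx_tokens, tfidf_pairs))
```

A repeated token contributes its weight once per occurrence and a token without a weight is left out, so the vectors produced this way differ in length from text to text. Anything that needs fixed-length input would require padding or the term-by-document matrix instead.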


In [86]:
# complete text representation
# corpus2dense returns a matrix of shape (num_terms, num_docs)
tf_dense_corpus = matutils.corpus2dense(text_bows, num_terms = len(vvg_dict.token2id), num_docs = len(corpus_tfidf))
# text_df["STD_CVEC_TOKENS"] = tf_dense_corpus
tf_dense_corpus[0:]

array([[1., 1., 1., ..., 1., 1., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 2.]], dtype=float32)
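
The orientation of this matrix is easy to get wrong: `corpus2dense` is documented as returning one row per term and one column per document. The pure-Python sketch below rebuilds that layout from toy bag-of-words data and then transposes it to get one row per document. The helper names and toy values are assumptions for the illustration.

```python
def bows_to_matrix(bows, num_terms):
    """Build a term-by-document count matrix as nested lists."""
    matrix = [[0.0] * len(bows) for _ in range(num_terms)]
    for doc_no, bow in enumerate(bows):
        for term_id, count in bow:
            matrix[term_id][doc_no] = float(count)
    return matrix

def transpose(matrix):
    """Swap rows and columns: one row per document afterwards."""
    return [list(row) for row in zip(*matrix)]

toy_bows = [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
term_doc = bows_to_matrix(toy_bows, num_terms=3)
doc_term = transpose(term_doc)

print(term_doc)
print(doc_term)
```

Only the transposed form lines up with the dataframe, where each row is one text.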

In [None]:
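# sparse tf-idf matrix; corpus2csc returns shape (num_terms, num_docs)
tf_sparse_corpus = matutils.corpus2csc(corpus_tfidf, num_terms=len(vvg_dict))
# one dense row per document, so the column lines up with the dataframe rows
text_df["STD_SVEC_TOKENS"] = list(tf_sparse_corpus.T.toarray())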

In [65]:
# checking everything is okay
text_df.head()

Unnamed: 0,ID,CORE_TEXT,EXT_TEXT,complementary colours,this torso of Venus,drew,Van Gogh wrote,standing torso of Venus,he wrote,The Potato Eaters,...,1890,cityscape,1881,Brussels,TOKENS,PREP_TOKENS,BOWS_TOKENS,IDX_TOKENS,TFIDF_TOKENS,STD_DVEC_TOKENS
0,s0004V1962r,Head of a Woman Vincent van Gogh (1853 - 1890)...,F0388r JH0782 s0004V1962r 43.5 cm x 36.2 cm,localhost,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,"[head, of, a, woman, vincent, van, gogh, 1853,...","[head, woman, vincent, van, gogh, 1853, 1890, ...","[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1...","[39, 73, 70, 69, 38, 3, 5, 52, 49, 4, 53, 28, ...","[(0, 0.0014366751686058629), (1, 0.20742760148...","[0.12089483268014274, 0.10457729123726997, 0.1..."
1,s0006V1962,Head of a Woman Vincent van Gogh (1853 - 1890)...,"F0160 JH0722 s0006V1962 43.2 cm x 30.0 cm, 2.2...",https://www.vangoghmuseum.nl/en/stories/lookin...,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,"[head, of, a, woman, vincent, van, gogh, 1853,...","[head, woman, vincent, van, gogh, 1853, 1890, ...","[(0, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1...","[39, 73, 70, 69, 38, 3, 5, 52, 81, 4, 53, 28, ...","[(0, 0.0006632619049359525), (2, 0.10441989952...","[0.11162570186015741, 0.09655924305614885, 0.1..."
2,s0010V1962,Portrait of an Old Woman Vincent van Gogh (185...,"F0174 JH0978 s0010V1962 50.5 cm x 39.8 cm, 68....",localhost,localhost,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,"[portrait, of, an, old, woman, vincent, van, g...","[portrait, old, woman, vincent, van, gogh, 185...","[(0, 1), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1...","[151, 150, 73, 70, 69, 38, 3, 5, 127, 33, 4, 5...","[(0, 0.0007658221722692316), (4, 0.10636887288...","[0.22113928023709148, 0.30323802477749107, 0.1..."
3,s0056V1962,"Torso of Venus Vincent van Gogh (1853 - 1890),...","F0216a JH1054 s0056V1962 46.0 cm x 38.0 cm, 55...",localhost,https://www.vangoghmuseum.nl/en/collection/s01...,localhost,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,"[torso, of, venus, vincent, van, gogh, 1853, 1...","[torso, venus, vincent, van, gogh, 1853, 1890,...","[(0, 1), (3, 1), (5, 1), (6, 1), (7, 1), (11, ...","[190, 193, 70, 69, 38, 3, 5, 55, 44, 160, 53, ...","[(0, 0.0008712743083825235), (6, 0.00175770197...","[0.11027555366028634, 0.08752905610112195, 0.0..."
4,s0058V1962,Woman with a Mourning Shawl Vincent van Gogh (...,"F0161 JH0788 s0058V1962 45.5 cm x 33.0 cm, 60 ...",localhost,localhost,https://www.vangoghmuseum.nl/en/collection/d00...,localhost,localhost,localhost,localhost,...,localhost,localhost,localhost,localhost,"[woman, with, a, mourning, shawl, vincent, van...","[woman, mourning, shawl, vincent, van, gogh, 1...","[(0, 1), (3, 1), (4, 3), (5, 1), (6, 1), (7, 1...","[73, 205, 210, 70, 69, 38, 3, 5, 52, 48, 49, 4...","[(0, 0.0006110115453988034), (4, 0.12729967560...","[0.1334287674669934, 0.43723399354449305, 0.43..."


In [66]:
# saving the pandas dataframe into a CSV file
# overwriting any existing CSV file with the updated dataframe
target_fn = "std-" + work_fn
fn_tpath = os.path.join(os.getcwd(), dataf, stdf, target_fn)
print(fn_tpath)
text_df.to_csv(fn_tpath,
                sep=",",
                index=False,
                encoding="utf-8",
                mode="w",
                )

c:\Users\Felipe\Documents\GitHub\sa-artea\VVG-Gallery-StdDataProcessor\Notebooks\Data\Std\std-VVG-Gallery-Text-Data-Small.csv
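
One consequence of saving list-valued columns such as `IDX_TOKENS` to CSV is that each list is written out as text and comes back as a string when the file is read again. The sketch below shows the round trip with the standard `csv` module as a stand-in for pandas, using an in-memory buffer and a made-up row (the `doc-1` id is hypothetical); `ast.literal_eval` is one way to turn the string back into a list.

```python
import ast
import csv
import io

# made-up row with a list-valued cell, written to an in-memory CSV
rows = [{"ID": "doc-1", "IDX_TOKENS": [39, 73, 70]}]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", "IDX_TOKENS"])
writer.writeheader()
writer.writerows(rows)

# reading the CSV back: the list is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_cell = loaded[0]["IDX_TOKENS"]
print(type(raw_cell).__name__, raw_cell)

# literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(raw_cell)
print(type(restored).__name__, restored)
```

With pandas the same parsing could be applied per column after `read_csv`, for example through the `converters` argument.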


In [67]:
# I do not remember why I did this
# sim_index = gensim.similarities.SparseMatrixSimilarity(corpus_tfidf, num_features=len(vvg_dict))