# VINCENT VAN GOGH STANDARD DATA TEXT PROCESS

This script takes the gallery's text from the local data folder and processes it into a standard representation using Natural Language Processing and Word2Vec methods.

The process goes as follows:

1. Load the CSV into a pandas DataFrame.
2. Transform the text columns into word lists.
3. Clean the text to find tokens (remove stop words).
4. Lemmatize the tokens to find their base forms (lemmas).
5. Save the complete corpus dictionary.
6. Execute Bag of Words to find the vector representation (doc2bow).
7. Map the tokens in the text to their dictionary ids to find the ordered vector representation (doc2idx).
8. Execute the Word2Vec standardization process with the TF-IDF method (Term Frequency - Inverse Document Frequency).
9. Create the dense standard vector text representation.
10. Save the resulting pandas DataFrame as a CSV.
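
The first steps of the list above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `STOP_WORDS` is a hypothetical mini list (the notebook uses NLTK's English list), `build_token2id` is a hypothetical stand-in for the gensim dictionary and does not reproduce gensim's id ordering, and lemmatization is omitted.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and"}


def tokenize(text):
    """Lowercase the text, blank out non-word characters, split on whitespace."""
    text = re.sub(r"\W", " ", text.lower())
    return text.split()


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_token2id(docs):
    """Give each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


texts = ["Head of a Woman", "The head of an old man"]
docs = [remove_stop_words(tokenize(t)) for t in texts]
token2id = build_token2id(docs)

print(docs)
print(token2id)
```

The two sample titles are shortened variants of column names that appear in the data; any list of strings works the same way.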

The following links were useful to create this proof of concept:

- Gensim Word2Vec Tutorial, URL: https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial#Getting-Started
- Gensim Core Concepts, URL: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
- Document similarity queries, URL: https://radimrehurek.com/gensim/similarities/docsim.html
- Tutorial 19: Text analysis with gensim, URL: https://statsmaths.github.io/stat289-f18/solutions/tutorial19-gensim.html
- Gensim TF-IDF model, URL: https://radimrehurek.com/gensim/models/tfidfmodel.html?highlight=tfidfmodel#module-gensim.models.tfidfmodel
- Gensim, Construct word<->id mappings, URL: https://radimrehurek.com/gensim/corpora/dictionary.html?highlight=doc2idx#gensim.corpora.dictionary.Dictionary.doc2idx
- tensorflow, Word2Vec, URL: https://www.tensorflow.org/tutorials/text/word2vec
- Introduction to Word Embeddings, URL: https://pub.towardsai.net/introduction-to-word-embedding-5ba5cf97d296
- Python for NLP: Working with the Gensim Library (Part 1), URL: https://stackabuse.com/python-for-nlp-working-with-the-gensim-library-part-1/
- Packt main github repository, URL: https://github.com/PacktPublishing
- Regex Complement, URL: https://stackoverflow.com/questions/3977455/how-do-i-turn-any-regex-into-an-complement-of-itself-without-complex-hand-editin


In [2]:
"""
* Copyright 2020, Maestria de Humanidades Digitales,
* Universidad de Los Andes
*
* Developed for the Msc graduation project in Digital Humanities
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program.  If not, see <http://www.gnu.org/licenses/>.
"""

# ===============================
# native python libraries
# ===============================
import os
import copy
import sys
import csv
import re
import pprint
import datetime

# ===============================
# extension python libraries
# ===============================
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np
import gensim
from gensim import models
from gensim import matutils
from gensim import similarities
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# downloading nltk data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


# ===============================
# developed python libraries
# ===============================


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# notebook variable definitions
# root folder
dataf = "Data"

# subfolder with the OCR transcribed txt data
prepf = "Prep"

# subfolder with the CSV files containing the ML pandas dataframe
stdf = "Std"

# subfolder for reports
reportf = "Reports"

# dataframe file extension
fext = "csv"

# dictionary extension
dext = "dict"

# dataframe file name
small_fn = "VVG-Gallery-Text-Data-Small" + "." + fext
large_fn = "VVG-Gallery-Text-Data-Large" + "." + fext

# report names
str_date = datetime.date.today().strftime("%d-%b-%Y")
small_report = "VVG-TextData-Report-Small-" + str_date + "." + "html"
large_report = "VVG-TextData-Report-Large-" + str_date + "." + "html"


# columns I really need
# 'ID', 'F-number', 'JH-number', 'creator-date', 'creator-place', 'Dimensions', 'details', 'credits', 'CORE_TEXT'

# regex for _TEXT (raw strings keep the \w escapes intact)
text_re = r"(\w+_TEXT)"

# regex for ID
in_re = r"((ID{1})|([A-Z]+-number){1,3}|^(creator){1,3}|(Dimensions)|(details)|(credits))"

# regex for others (URLs|Categories)
# A regex cannot simply be inverted. For example, if you match numeric strings
# as [0-9]*, you anchor it as ^[0-9]*$ to match the entire string, and the
# complement is then written as ^(?!^[0-9]*$).*$ (a negative lookahead).
# Concatenating such negated regexes is, as far as I can tell, not feasible.
cat_re = r"^(?!^((\w+_TEXT)|((ID{1})|([A-Z]+-number){1,3}|^(creator-\w+){1,3}|(Dimensions)|(details)|(credits)))*$).*$"
# cat_re = u"(\w+_TEXT)|((ID{1})|([A-Z]+-number){1,3}|^(creator){1,3}|(Dimensions)|(details)|(credits))"
# cat_re = u"ID{1}(^\w+( \w+)*$)"

# default values
work_fn, work_report = small_fn, small_report
# work_fn, work_report = large_fn, large_report
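
The three patterns defined in this cell split the column names into identifier, text, and category groups. The sketch below applies them to a handful of column names taken from the DataFrame output further down. The patterns are copied from the cell and written as raw strings; the `columns` sample list is an assumption chosen for illustration.

```python
import re

# Patterns copied from the notebook cell, written as raw strings.
text_r = re.compile(r"(\w+_TEXT)")
in_r = re.compile(
    r"((ID{1})|([A-Z]+-number){1,3}|^(creator){1,3}|(Dimensions)|(details)|(credits))"
)
# Negative lookahead: match only names that are NOT text or identifier columns.
cat_r = re.compile(
    r"^(?!^((\w+_TEXT)|((ID{1})|([A-Z]+-number){1,3}|^(creator-\w+){1,3}"
    r"|(Dimensions)|(details)|(credits)))*$).*$"
)

# Sample of column names taken from the DataFrame shown in this notebook.
columns = ["ID", "F-number", "creator-date", "CORE_TEXT", "1885", "Nuenen", "still life"]

in_cols = [c for c in columns if in_r.match(c)]
text_cols = [c for c in columns if text_r.match(c)]
cat_cols = [c for c in columns if cat_r.match(c)]

print("in:", in_cols)
print("text:", text_cols)
print("category:", cat_cols)
```

Each sample name lands in exactly one group, because the category pattern is the complement of the other two.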

In [4]:
root_folder = os.getcwd()
root_folder = os.path.split(root_folder)[0]
root_folder = os.path.normpath(root_folder)
print(root_folder)

c:\Users\Felipe\Documents\GitHub\sa-artea\VVG-MLData-Preparer


In [5]:
# loading the CSV file into pandas
# read an existing CSV file to update the dataframe
fn_path = os.path.join(root_folder, dataf, prepf, work_fn)
print(fn_path)
text_df = pd.read_csv(
                fn_path,
                sep=",",
                encoding="utf-8",
                engine="python",
            )

c:\Users\Felipe\Documents\GitHub\sa-artea\VVG-MLData-Preparer\Data\Prep\VVG-Gallery-Text-Data-Small.csv


In [6]:
text_df = pd.DataFrame(text_df)
text_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 56 columns):
 #   Column                                                                                              Non-Null Count  Dtype 
---  ------                                                                                              --------------  ----- 
 0   ID                                                                                                  59 non-null     object
 1   F-number                                                                                            59 non-null     object
 2   JH-number                                                                                           59 non-null     object
 3   creator-date                                                                                        59 non-null     object
 4   creator-place                                                                                       59 non-null     object
 

In [13]:
text_df.head(5)

Unnamed: 0,ID,F-number,JH-number,creator-date,creator-place,Dimensions,details,credits,CORE_TEXT,1881,...,drew,he wrote,heads,nude,painting,portrait,standing torso of Venus,still life,this torso of Venus,which he painted a number of times
0,s0004V1962r,F0388r,JH0782,May 1885,Nuenen,43.5 cm x 36.2 cm,oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Head of a Woman With his brother Theo van Gog...,localhost,...,localhost,localhost,https://vangoghmuseum.nl/en/collection?Genre=h...,localhost,https://vangoghmuseum.nl/en/collection?Type=pa...,localhost,localhost,localhost,localhost,localhost
1,s0006V1962,F0160,JH0722,April 1885,Nuenen,"43.2 cm x 30.0 cm, 2.2 cm x 59.0 cm, 46.2 cm x...",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Head of a Woman This woman is Gordina de Groot...,localhost,...,localhost,localhost,https://vangoghmuseum.nl/en/collection?Genre=h...,localhost,https://vangoghmuseum.nl/en/collection?Type=pa...,localhost,localhost,localhost,localhost,localhost
2,s0010V1962,F0174,JH0978,December 1885,Antwerp,"50.5 cm x 39.8 cm, 68.1 cm x 57.7 cm",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Portrait of an Old Woman The old womans grey h...,localhost,...,localhost,localhost,localhost,localhost,https://vangoghmuseum.nl/en/collection?Type=pa...,https://vangoghmuseum.nl/en/collection?Genre=p...,localhost,localhost,localhost,localhost
3,s0056V1962,F0216a,JH1054,June 1886,Paris,"46.0 cm x 38.0 cm, 55.1 cm x 46.5 cm",oil on cardboard,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Torso of Venus Van Gogh was thorough in everyt...,localhost,...,localhost,localhost,localhost,https://vangoghmuseum.nl/en/collection?Genre=nude,https://vangoghmuseum.nl/en/collection?Type=pa...,localhost,localhost,https://vangoghmuseum.nl/en/collection?Genre=s...,https://www.vangoghmuseum.nl/en/collection/s01...,localhost
4,s0058V1962,F0161,JH0788,March-May 1885,Nuenen,"45.5 cm x 33.0 cm, 60 cm x 48 cm",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...","Woman with a Mourning Shawl Here, Van Gogh was...",localhost,...,https://www.vangoghmuseum.nl/en/collection/d00...,localhost,https://vangoghmuseum.nl/en/collection?Genre=h...,localhost,https://vangoghmuseum.nl/en/collection?Type=pa...,localhost,localhost,localhost,localhost,localhost


In [8]:
# getting the df columns
df_cols = list(text_df.columns.values)

# getting the text columns
text_r = re.compile(text_re)
text_cols = list(filter(text_r.match, df_cols))

# getting the ID column
in_r = re.compile(in_re)
in_cols = list(filter(in_r.match, df_cols))

# getting the URLs/Category columns
cat_r = re.compile(cat_re)
cat_cols = list(filter(cat_r.match, df_cols))
cat_cols = sorted(cat_cols, reverse=False) # .sort(key=str)
df_cols = in_cols + text_cols + cat_cols

In [9]:
print("df columns:\n", df_cols)
print("text column:\n", text_cols)
print("in columns:\n", in_cols)
print("category columns:\n", cat_cols)
print("ordered df columns:\n", df_cols)

df columns:
 ['ID', 'F-number', 'JH-number', 'creator-date', 'creator-place', 'Dimensions', 'details', 'credits', 'CORE_TEXT', '1881', '1884', '1885', '1886', '1887', '1890', 'Antwerp', 'Brussels', 'He would use this technique more than once in his later work', 'Head of a Man', 'Head of a Prostitute', 'Head of a Woman', 'Head of a Woman 1', 'Head of a Woman 2', 'Head of a Woman 3', 'Head of a Woman 4', 'Head of an Old Man', 'Horse', 'Kneeling Ecorche', 'Letter from Vincent van Gogh to Theo van Gogh with sketches of Head of a Woman and Head of a Woman', 'Male Torso', 'Nuenen', 'Paris', 'Plaster Cast of a Womans Torso', 'Plaster Cast of a Womans Torso 1', 'Portrait of a Prostitute', 'The Potato Eaters', 'Torso of Venus', 'Torso of Venus 1', 'Torso of Venus 2', 'Torso of Venus 3', 'Van Gogh wrote', 'Woman Sewing', 'animal art', 'cityscape', 'complementary colours', 'drawing', 'drew', 'he wrote', 'heads', 'nude', 'painting', 'portrait', 'standing torso of Venus', 'still life', 'this torso 

In [10]:
text_df = text_df.reindex(columns=df_cols)
text_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 56 columns):
 #   Column                                                                                              Non-Null Count  Dtype 
---  ------                                                                                              --------------  ----- 
 0   ID                                                                                                  59 non-null     object
 1   F-number                                                                                            59 non-null     object
 2   JH-number                                                                                           59 non-null     object
 3   creator-date                                                                                        59 non-null     object
 4   creator-place                                                                                       59 non-null     object
 

In [14]:
# pandas profile
profile = ProfileReport(text_df, title="Pandas Profiling Report", explorative=True, dark_mode=True)

In [18]:
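# NOTE: this path skips the data folder (dataf) that the later export cell uses,
# and to_file() does not create missing folders, hence the error below
rfp = os.path.join(root_folder, reportf, small_report)
# print(rfp)
profile.to_file(rfp)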

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]


FileNotFoundError: [Errno 2] No such file or directory: 'c:\\Users\\Felipe\\Documents\\GitHub\\sa-artea\\VVG-MLData-Preparer\\Reports\\VVG-TextData-Report-Small-13-Apr-2021.html'

In [15]:
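# calculating correlation phik
# NOTE: phik_matrix() is registered on DataFrames by `import phik`;
# that import is missing from the imports cell, hence the error below
phik_overview = text_df.phik_matrix()  # (interval_cols=interval_cols)
phik_overview
# I need to choose the columns that most influence my target variables (p >= 0.95) and that are independent of each other (p ~= 0.0)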

AttributeError: 'DataFrame' object has no attribute 'phik_matrix'

In [None]:
# Create correlation matrix
# corr_matrix = df.corr().abs()
upper = phik_overview.abs()
valuable_cols = in_cols + text_cols
# Select upper triangle of correlation matrix
upper = upper.where(np.triu(np.ones(upper.shape), k=1).astype(bool))
upper

In [None]:

# Find index of feature columns with correlation greater than 0.95
to_drop = list()

for column in upper.columns:
    row = upper[column]
    row = np.array(row)
    row = row[~np.isnan(row)]

    if len(row) > 0:
        cor_count = 0
        ncor_count = 0
        for data in row:

            if data >= 0.95:
                cor_count = cor_count + 1
            else:
                ncor_count = ncor_count + 1
            
            ans = cor_count - ncor_count
        if ans > 0 and (column not in to_drop):
            to_drop.append(column)
        # if avg > 0.50 and (column not in to_drop) and (column not in valuable_cols):
        #     to_drop.append(column)
        # # print(column, col)

# to_drop = [column for column in upper.columns if (np.average(upper[column]) > 0.2)]
# print(len(to_drop))
print(to_drop)
# upper

In [None]:
# dropping correlated ones
threshold = 0.95
# list of columns
phik_cols = list(phik_overview.columns.values)
print(phik_cols)

for col in phik_cols:
    # get the phik values
    tdata = phik_overview[col].values
    # transform to list to operate.
    tdata = list(tdata)
    # remove the phik own value
    index = phik_cols.index(col)
    tdata.pop(index)
    # transform to np.ndarray to operate fast
    tdata = np.asarray(tdata)
    avg_phik = np.average(tdata)
    print(col, ": ", avg_phik)

# columns = np.full((phik_overview.shape[0],), True, dtype=bool)
# print(columns)
# for i in range(phik_overview.shape[0]):
#     for j in range(i+1, phik_overview.shape[0]):
#         if phik_overview.iloc[i,j] >= threshold:
#             if columns[j]:
#                 columns[j] = False
# selected_columns = text_df.columns[columns]
# selected_columns
# digest_df = text_df[selected_columns]

In [None]:
# digest_df.info()

In [None]:
# profile.to_widgets()
rf_path = os.path.join(root_folder, dataf, reportf, work_report)
profile.to_file(rf_path)

In [None]:
duplicates = profile.get_duplicates()
rej_cols = list(profile.get_rejected_variables())
print(duplicates)
print(rej_cols)

In [None]:
target_df = pd.DataFrame(text_df)

In [None]:
# getting the original working text
text_corpus = list(text_df[text_cols[0]])
print(len(text_corpus))

In [None]:
# to working text
text_clean = list()
for text in text_corpus:
    text = text.lower()
    text_clean.append(text)

print(len(text_clean), len(text_corpus))

In [None]:
# cleaning and preprocessing text for word2vec
for i, text in enumerate(text_clean):
    # removing special characters
    text = re.sub(r"\W", " ", text)
    # restoring the decimal points lost between numbers (e.g. "43 5" -> "43.5")
    text = re.sub(r"(\d{1,3}) (\d{1,2})", r"\1.\2", text)
    # removing excessive spaces
    text = re.sub(r"\s+", " ", text)
    text_clean[i] = text

print(len(text_clean), len(text_corpus))
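
To see what the three substitutions do, here is a small sketch that applies them to a single string. The function name `clean_text` and both sample strings are assumptions for illustration; the regular expressions are the ones used in the cell above.

```python
import re


def clean_text(text):
    """Apply the notebook's cleaning steps to one string."""
    text = text.lower()
    # replace every non-word character (punctuation, decimal points) with a space
    text = re.sub(r"\W", " ", text)
    # re-join digit groups separated by exactly one space: "43 5" -> "43.5"
    text = re.sub(r"(\d{1,3}) (\d{1,2})", r"\1.\2", text)
    # collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text


dimensions = clean_text("Oil on canvas, 43.5 cm x 36.2 cm")
caveat = clean_text("painted in 1885 12 heads")

print(dimensions)
print(caveat)
```

The second call shows a side effect worth knowing about: two unrelated numbers separated by a single space are also joined, so "1885 12" becomes "1885.12".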

In [None]:
# tokenising text
text_tokens = list()

for text in text_clean:
    text = text.split()
    text_tokens.append(text)
    # print(text)

print(len(text_tokens), len(text_clean), len(text_corpus))

In [None]:
# removing stopwords
text_nsw_tokens = list()
# load the stop-word list once, as a set, for fast membership tests
english_stopwords = set(stopwords.words('english'))

for tokens in text_tokens:

    clear_tokens = list()

    for token in tokens:
        if token not in english_stopwords:
            clear_tokens.append(token)

    text_nsw_tokens.append(clear_tokens)
    # print(clear_tokens)

print(len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

In [None]:
# lemmatization of the text
text_lemmas = list()
token_lematizer = WordNetLemmatizer()

for tokens in text_nsw_tokens:

    lemma_tokens = list()

    for token in tokens:
        
        ans = token_lematizer.lemmatize(token)
        lemma_tokens.append(ans)

    tlemmas = copy.deepcopy(lemma_tokens)
    text_lemmas.append(tlemmas)

print(len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

In [None]:
target_df["TOKENS"] = text_tokens
target_df["PREP_TOKENS"] = text_lemmas

In [None]:
target_df.head()

In [None]:
# saving gensim words dictionary
vvg_dict = gensim.corpora.Dictionary(text_lemmas)
print(vvg_dict)
work_dict = work_fn.split(".")
work_dict = work_dict[0] + "." + dext
dict_pfn = os.path.join(root_folder, dataf, stdf, work_dict)
print(dict_pfn)
vvg_dict.save(dict_pfn) 
# os.path.join("Data","VVG-gallery-text.dict"))
# pprint.pprint(vvg_dict.token2id)

In [None]:
# text representation to numeric representation
text_bows = list()
text_idxs = list()

for lemmas in text_lemmas:

    # bow loses the word order/semantics
    t_bow = vvg_dict.doc2bow(lemmas, allow_update=True)
    text_bows.append(t_bow)
    # idx keeps the word order/semantics
    t_idx = vvg_dict.doc2idx(lemmas)
    text_idxs.append(t_idx)

print(len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))
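
The difference between the two representations is easier to see on a toy document. The functions below are pure-Python stand-ins that mimic, as I understand them, what gensim's `Dictionary.doc2bow` (without `allow_update`) and `Dictionary.doc2idx` return; they are not the gensim implementation, and the `token2id` mapping and `tokens` list are made up for illustration.

```python
from collections import Counter


def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def doc2idx(tokens, token2id, unknown_word_index=-1):
    """Return one id per token, in the original order."""
    return [token2id.get(t, unknown_word_index) for t in tokens]


# Hypothetical dictionary and document.
token2id = {"head": 0, "woman": 1, "portrait": 2}
tokens = ["portrait", "woman", "head", "woman", "shawl"]

bow = doc2bow(tokens, token2id)
idx = doc2idx(tokens, token2id)

print(bow)
print(idx)
```

The bag of words counts "woman" twice and drops the unknown word "shawl", while the index list keeps every position and marks the unknown word with -1.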

In [None]:
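# train the model
# the IDF statistics come from the dictionary; gensim ignores a corpus
# argument when a dictionary is supplied, so none is passed
tfidf = gensim.models.TfidfModel(dictionary=vvg_dict, normalize=True)
corpus_tfidf = tfidf[text_bows]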
print(len(corpus_tfidf), len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))
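
As a rough guide to what the TF-IDF model computes, the sketch below reproduces the weighting by hand under the assumption of gensim's default settings: raw term frequency, `log2(N / df)` as the inverse document frequency, and L2 normalization per document. The function name `tfidf_transform` and the toy `bows` corpus are hypothetical, and the sketch derives document frequencies from the corpus itself rather than from a saved dictionary.

```python
import math


def tfidf_transform(bows):
    """Weight each (token_id, count) pair by tf * log2(N / df), then L2-normalize."""
    num_docs = len(bows)

    # document frequency: in how many documents each token id appears
    doc_freq = {}
    for bow in bows:
        for token_id, _count in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    corpus = []
    for bow in bows:
        weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
        # tokens present in every document get a zero weight and are dropped
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _tid, w in weights))
        if norm > 0.0:
            weights = [(tid, w / norm) for tid, w in weights]
        corpus.append(weights)
    return corpus


# Toy bag-of-words corpus: token 0 appears in every document.
bows = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 2), (1, 1), (2, 1)],
]
corpus_tfidf = tfidf_transform(bows)

for doc in corpus_tfidf:
    print(doc)
```

Token 0 disappears from every document because it occurs in all of them, which is why the TF-IDF vectors can be shorter than the bags of words they come from.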

In [None]:
text_df.info()

In [None]:
target_df["BOWS_TOKENS"] = text_bows
target_df["IDX_TOKENS"] = text_idxs
# materialize the lazily transformed corpus before storing it as a column
target_df["TFIDF_TOKENS"] = list(corpus_tfidf)

In [None]:
# checking everything is okay
target_df.head()

In [None]:
# creating the dense vector standard representation of the text
text_dvector = list()

# iterating over each text with its tfidf word bag
for t_idtokens, tfidf_tokens in zip(text_idxs, corpus_tfidf):
    # print("===============================")
    # print(len(t_idtokens), len(tfidf_tokens))
    # print(type(t_idtokens), type(tfidf_tokens))
    # dense vector representation
    tdvect = list()

    # transforming the tfidf pairs into a dict, once per text
    tokens_dict = dict(tfidf_tokens)

    # creating the dense representation for each text
    for t_token in t_idtokens:

        # looking for each word; tokens without a tfidf weight are skipped
        if t_token in tokens_dict:
            # appending std word representation into array
            tdvect.append(tokens_dict[t_token])

    # adding the std dense vector into the corpus column
    text_dvector.append(tdvect)

# checking the size of all columns
print(len(text_dvector), len(corpus_tfidf), len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

# adding the dense representation into the dataframe
target_df["STD_DVEC_TOKENS"] = text_dvector
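
The lookup performed in this cell can be illustrated on one short document. The helper name `align_weights` and the sample ids and weights are hypothetical; the logic mirrors the loop above, which walks the ordered token ids and picks up the TF-IDF weight of each one.

```python
def align_weights(token_ids, tfidf_pairs):
    """Return the TF-IDF weight of each token id, in reading order."""
    weights = dict(tfidf_pairs)
    # ids without a weight (dropped by the TF-IDF step) are skipped
    return [weights[t] for t in token_ids if t in weights]


# Hypothetical document: ordered token ids and its sparse TF-IDF pairs.
token_ids = [2, 1, 0, 1]
tfidf_pairs = [(1, 0.8), (2, 0.6)]

dense = align_weights(token_ids, tfidf_pairs)
print(dense)
```

Because token 0 has no weight in this example, the resulting vector has three entries for a four-token document, so the vectors stored in `STD_DVEC_TOKENS` can differ in length from the token lists.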

In [None]:
# complete text representation
tf_dense_corpus = matutils.corpus2dense(text_bows, num_terms = len(vvg_dict.token2id), num_docs = len(corpus_tfidf))
# text_df["STD_CVEC_TOKENS"] = tf_dense_corpus
tf_dense_corpus[0:]

In [None]:
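# corpus2csc returns a (terms x documents) sparse matrix;
# transpose it and store one row of weights per document
tf_sparse_corpus = matutils.corpus2csc(corpus_tfidf, num_terms=len(vvg_dict))
target_df["STD_SVEC_TOKENS"] = tf_sparse_corpus.T.toarray().tolist()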

In [None]:
# checking everything is okay
target_df.head()

In [None]:
# saving the pandas DataFrame into a CSV file
# overwriting any existing CSV file with the updated dataframe
target_fn = "std-" + work_fn
fn_tpath = os.path.join(root_folder, dataf, stdf, target_fn)
print(fn_tpath)
target_df.to_csv(fn_tpath,
                sep=",",
                index=False,
                encoding="utf-8",
                mode="w",
                )

In [None]:
# I do not remember why I did this
# sim_index = gensim.similarities.SparseMatrixSimilarity(corpus_tfidf, num_features=len(vvg_dict))