# Word2Vec: VINCENT VAN GOGH STANDARD DATA TEXT PROCESS

This script takes the gallery's text from the local data folder and processes it into a standard representation using Natural Language Processing and Word2Vec methods.

The process goes as follows:

1. Load the CSV into a pandas DataFrame.
2. Transform the text columns into word lists.
3. Clean the text to find tokens (remove stop words).
4. Lemmatize the tokens to reduce words to their base forms.
5. Save the complete corpus dictionary.
6. Execute Bag of Words to find the vector representation (doc2bow).
7. Execute the mapping of tokens in the text to find the index representation (doc2idx).
8. Execute the Word2Vec standardization process with the TF-IDF method (Term Frequency - Inverse Document Frequency).
9. Create the dense standard vector text representation.
10. Save the resulting CSV from the pandas DataFrame.

The following links were useful to create this proof of concept:

- Gensim Word2Vec Tutorial, URL: https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial#Getting-Started
- Gensim Core Concepts, URL: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
- Document similarity queries, URL: https://radimrehurek.com/gensim/similarities/docsim.html
- Tutorial 19: Text analysis with gensim, URL: https://statsmaths.github.io/stat289-f18/solutions/tutorial19-gensim.html
- Gensim TF-IDF model, URL: https://radimrehurek.com/gensim/models/tfidfmodel.html?highlight=tfidfmodel#module-gensim.models.tfidfmodel
- Gensim, Construct word<->id mappings, URL: https://radimrehurek.com/gensim/corpora/dictionary.html?highlight=doc2idx#gensim.corpora.dictionary.Dictionary.doc2idx
- tensorflow, Word2Vec, URL: https://www.tensorflow.org/tutorials/text/word2vec
- Introduction to Word Embeddings, URL: https://pub.towardsai.net/introduction-to-word-embedding-5ba5cf97d296
- Python for NLP: Working with the Gensim Library (Part 1), URL: https://stackabuse.com/python-for-nlp-working-with-the-gensim-library-part-1/
- Packt main github repository, URL: https://github.com/PacktPublishing


In [1]:
"""
* Copyright 2020, Maestria de Humanidades Digitales,
* Universidad de Los Andes
*
* Developed for the Msc graduation project in Digital Humanities
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program.  If not, see <http://www.gnu.org/licenses/>.
"""

# ===============================
# native python libraries
# ===============================
import os
import copy
import sys
import csv
import re
import pprint
import datetime

# ===============================
# extension python libraries
# ===============================
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np
import gensim
from gensim import models
from gensim import matutils
from gensim import similarities
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# downloading nltk data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


# ===============================
# developed python libraries
# ===============================


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# notebook variable definitions
# root folder
dataf = "Data"

# subfolder with the OCR transcribed txt data
prepf = "Prep"

#  subfolder with the CSV files containing the ML pandas dataframe
stdf = "Std"

# subfolder for reports
reportf = "Reports"

# dataframe file extension
fext = "csv"

# dictionary extension
dext = "dict"

# dataframe file name
small_fn = "VVG-Gallery-Text-Data-Small" + "." + fext
large_fn = "VVG-Gallery-Text-Data-Large" + "." + fext

# report names
str_date = datetime.date.today().strftime("%d-%b-%Y")
small_report = "VVG-TextData-Report-Small-" + str_date + "." + "html"
large_report = "VVG-TextData-Report-Large-" + str_date + "." + "html"

# regex for _TEXT
text_re = r"\w+_TEXT"

# regex for ID
id_re = r"ID{1}"

# regex for others (URLs|Categories)
# note: only the second assignment takes effect, it overrides the first
cat_re = r"\b(?!(ID{1}|\w+_TEXT))\b(\w+\W+)+"
cat_re = r"ID{1}(^\w+( \w+)*$)"

# default values
work_fn, work_report = small_fn, small_report
# work_fn, work_report = large_fn, large_report

In [3]:
root_folder = os.getcwd()
root_folder = os.path.split(root_folder)[0]
root_folder = os.path.normpath(root_folder)
print(root_folder)

c:\Users\Felipe\Documents\GitHub\sa-artea\VVG-Gallery-StdDataProcessor


In [4]:
# loading the CSV file into pandas
# read an existing CSV file to update the dataframe
fn_path = os.path.join(root_folder, dataf, prepf, work_fn)
print(fn_path)
text_df = pd.read_csv(
                fn_path,
                sep=",",
                encoding="utf-8",
                engine="python",
            )

c:\Users\Felipe\Documents\GitHub\sa-artea\VVG-Gallery-StdDataProcessor\Data\Prep\VVG-Gallery-Text-Data-Small.csv


In [5]:
# pandas profile
profile = ProfileReport(text_df, title="Pandas Profiling Report", explorative=True)

In [6]:
# profile.to_widgets()
rf_path = os.path.join(root_folder, dataf, reportf, work_report)
profile.to_file(rf_path)

Summarize dataset: 100%|██████████| 69/69 [01:27<00:00,  1.26s/it, Completed]
Generate report structure: 100%|██████████| 1/1 [00:56<00:00, 56.20s/it]
Render HTML: 100%|██████████| 1/1 [00:06<00:00,  6.72s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 43.48it/s]


In [7]:
# getting the df columns
df_cols = list(text_df)

# getting the text columns
text_r = re.compile(text_re)
text_cols = list(filter(text_r.match, df_cols))

# getting the ID column
id_r = re.compile(id_re)
id_cols = list(filter(id_r.match, df_cols))

# getting the URLs/Category columns
cat_r = re.compile(cat_re)
cat_cols = list(filter(cat_r.match, df_cols))

In [8]:
print("df columns:\n", df_cols)
print("ID column:\n", id_cols)
print("text columns:\n", text_cols)
print("category columns:\n", cat_cols)

df columns:
 ['ID', 'F-number', 'JH-number', 'creator-date', 'creator-place', 'Dimensions', 'details', 'credits', 'CORE_TEXT', 'heads', 'Head of a Woman', 'Head of a Man', 'Head of a Woman 1', 'Head of a Woman 2', 'Head of a Woman 3', 'Head of a Woman 4', 'Head of a Prostitute', 'Head of an Old Man', 'Torso of Venus', 'Torso of Venus 1', 'Torso of Venus 2', 'Torso of Venus 3', 'this torso of Venus', 'standing torso of Venus', 'Plaster Cast of a Womans Torso', 'Plaster Cast of a Womans Torso 1', 'Male Torso', 'drawing', 'painting', 'portrait', 'still life', 'nude', 'cityscape', 'animal art', 'Horse', 'Kneeling Ecorche', 'Portrait of a Prostitute', 'Woman Sewing', 'The Potato Eaters', 'Letter from Vincent van Gogh to Theo van Gogh with sketches of Head of a Woman and Head of a Woman', 'Van Gogh wrote', 'he wrote', 'drew', 'complementary colours', 'which he painted a number of times', 'He would use this technique more than once in his later work', '1881', '1884', '1885', '1886', '1887', '
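
The column selection above relies on `re.match`, which anchors the pattern only at the start of each column name. A minimal sketch of that behaviour follows, assuming a short hypothetical list of column names (the name `IDEAS` is invented for illustration) instead of the real dataframe:

```python
import re

# hypothetical column names, a small subset in the style of the dataframe
cols = ["ID", "F-number", "CORE_TEXT", "heads", "IDEAS"]

text_r = re.compile(r"\w+_TEXT")
id_r = re.compile(r"ID{1}")

# re.match anchors only at the start of the string
text_cols = list(filter(text_r.match, cols))
id_cols = list(filter(id_r.match, cols))

# re.fullmatch requires the whole name to match the pattern
strict_id_cols = [c for c in cols if id_r.fullmatch(c)]

print(text_cols)
print(id_cols)
print(strict_id_cols)
```

With `match`, any column whose name merely starts with `ID` is selected, so `fullmatch` may be the safer choice if such columns can appear.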

In [9]:
# getting the original working text
text_corpus = list(text_df[text_cols[0]])
print(len(text_corpus))

59


In [10]:
# lowercasing the working text
text_clean = list()
for text in text_corpus:
    text = text.lower()
    text_clean.append(text)

print(len(text_clean), len(text_corpus))

59 59


In [11]:
# cleaning and preprocessing text for word2vec
for i, text in enumerate(text_clean):
    # replacing special (non-word) characters with spaces
    text = re.sub(r"\W", " ", text)
    # restoring the decimal points removed above between numbers
    text = re.sub(r"(\d{1,3}) (\d{1,2})", r"\1.\2", text)
    # collapsing repeated whitespace
    text = re.sub(r"\s+", " ", text)
    text_clean[i] = text

print(len(text_clean), len(text_corpus))

59 59
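
The effect of the three substitutions is easier to see on a single string. The sketch below wraps them in a hypothetical helper, `clean_text`, and applies it to invented sample sentences; it is an illustration of the cell above, not part of the original pipeline:

```python
import re

def clean_text(text):
    """Apply the lowercasing and the three substitutions to one string."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"(\d{1,3}) (\d{1,2})", r"\1.\2", text)
    text = re.sub(r"\s+", " ", text)
    return text

# hypothetical input in the style of the Dimensions column
sample = "43.5 cm x 36.2 cm, Oil on canvas"
tokens = clean_text(sample).split()
print(tokens)

# caveat: any two numbers separated by one space are joined,
# not only numbers that were decimals in the source text
joined = clean_text("In 1885 12 studies survive")
print(joined)
```

The second call shows a side effect worth keeping in mind: a year followed by a count is merged into one pseudo-decimal token.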


In [12]:
# tokenising text
text_tokens = list()

for text in text_clean:
    text = text.split()
    text_tokens.append(text)
    # print(text)

print(len(text_tokens), len(text_clean), len(text_corpus))

59 59 59


In [13]:
# removing stopwords
text_nsw_tokens = list()
# building the stop word set once instead of reloading the list per token
stop_words = set(stopwords.words('english'))

for tokens in text_tokens:

    clear_tokens = list()

    for token in tokens:
        if token not in stop_words:
            clear_tokens.append(token)

    text_nsw_tokens.append(clear_tokens)
    # print(clear_tokens)

print(len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

59 59 59 59


In [14]:
# lemmatization of the text
text_lemmas = list()
token_lematizer = WordNetLemmatizer()

for tokens in text_nsw_tokens:

    lemma_tokens = list()

    for token in tokens:
        
        ans = token_lematizer.lemmatize(token)
        lemma_tokens.append(ans)

    tlemmas = copy.deepcopy(lemma_tokens)
    text_lemmas.append(tlemmas)

print(len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

59 59 59 59 59


In [15]:
text_df["TOKENS"] = text_tokens
text_df["PREP_TOKENS"] = text_lemmas

In [16]:
text_df.head()

Unnamed: 0,ID,F-number,JH-number,creator-date,creator-place,Dimensions,details,credits,CORE_TEXT,heads,...,1885,1886,1887,1890,Nuenen,Antwerp,Paris,Brussels,TOKENS,PREP_TOKENS
0,s0004V1962r,F0388r,JH0782,May 1885,Nuenen,43.5 cm x 36.2 cm,oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Head of a Woman With his brother Theo van Gog...,https://vangoghmuseum.nl/en/collection?Genre=h...,...,https://vangoghmuseum.nl/en/collection?Date=1885,localhost,localhost,localhost,https://vangoghmuseum.nl/en/collection?Place=N...,localhost,localhost,localhost,"[head, of, a, woman, with, his, brother, theo,...","[head, woman, brother, theo, van, gogh, paris,..."
1,s0006V1962,F0160,JH0722,April 1885,Nuenen,"43.2 cm x 30.0 cm, 2.2 cm x 59.0 cm, 46.2 cm x...",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Head of a Woman This woman is Gordina de Groot...,https://vangoghmuseum.nl/en/collection?Genre=h...,...,https://vangoghmuseum.nl/en/collection?Date=1885,localhost,localhost,localhost,https://vangoghmuseum.nl/en/collection?Place=N...,localhost,localhost,localhost,"[head, of, a, woman, this, woman, is, gordina,...","[head, woman, woman, gordina, de, groot, posed..."
2,s0010V1962,F0174,JH0978,December 1885,Antwerp,"50.5 cm x 39.8 cm, 68.1 cm x 57.7 cm",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Portrait of an Old Woman The old womans grey h...,localhost,...,https://vangoghmuseum.nl/en/collection?Date=1885,localhost,localhost,localhost,localhost,https://vangoghmuseum.nl/en/collection?Place=A...,localhost,localhost,"[portrait, of, an, old, woman, the, old, woman...","[portrait, old, woman, old, woman, grey, hair,..."
3,s0056V1962,F0216a,JH1054,June 1886,Paris,"46.0 cm x 38.0 cm, 55.1 cm x 46.5 cm",oil on cardboard,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Torso of Venus Van Gogh was thorough in everyt...,localhost,...,localhost,https://vangoghmuseum.nl/en/collection?Date=1886,localhost,localhost,localhost,localhost,https://vangoghmuseum.nl/en/collection?Place=P...,localhost,"[torso, of, venus, van, gogh, was, thorough, i...","[torso, venus, van, gogh, thorough, everything..."
4,s0058V1962,F0161,JH0788,March-May 1885,Nuenen,"45.5 cm x 33.0 cm, 60 cm x 48 cm",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...","Woman with a Mourning Shawl Here, Van Gogh was...",https://vangoghmuseum.nl/en/collection?Genre=h...,...,https://vangoghmuseum.nl/en/collection?Date=1885,localhost,localhost,localhost,https://vangoghmuseum.nl/en/collection?Place=N...,localhost,localhost,localhost,"[woman, with, a, mourning, shawl, here, van, g...","[woman, mourning, shawl, van, gogh, practising..."


In [17]:
# saving gensim words dictionary
vvg_dict = gensim.corpora.Dictionary(text_lemmas)
print(vvg_dict)
work_dict = work_fn.split(".")
work_dict = work_dict[0] + "." + dext
dict_pfn = os.path.join(root_folder, dataf, stdf, work_dict)
print(dict_pfn)
vvg_dict.save(dict_pfn) 
# os.path.join("Data","VVG-gallery-text.dict"))
# pprint.pprint(vvg_dict.token2id)

Dictionary(563 unique tokens: ['1', '11', '16', '1885', '1891']...)
c:\Users\Felipe\Documents\GitHub\sa-artea\VVG-Gallery-StdDataProcessor\Data\Std\VVG-Gallery-Text-Data-Small.dict


In [18]:
# text representation to numeric representation
text_bows = list()
text_idxs = list()

for lemmas in text_lemmas:

    # bow loses the token order
    # the dictionary already contains these documents, so it is not updated again
    t_bow = vvg_dict.doc2bow(lemmas)
    text_bows.append(t_bow)
    # idx keeps the token order
    t_idx = vvg_dict.doc2idx(lemmas)
    text_idxs.append(t_idx)

print(len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

59 59 59 59 59 59 59
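
The difference between the two representations can be sketched in plain Python. The functions `to_bow` and `to_idx` and the `token2id` mapping below are hypothetical stand-ins written for illustration; they imitate the shape of what `doc2bow` and `doc2idx` return and are not gensim's implementation:

```python
from collections import Counter

# hypothetical token-to-id mapping, standing in for vvg_dict.token2id
token2id = {"head": 0, "woman": 1, "gordina": 2}

def to_bow(tokens, token2id):
    """Sorted (token_id, count) pairs; the token order is lost."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

def to_idx(tokens, token2id, unknown=-1):
    """One id per token in the original order; unknown tokens map to -1."""
    return [token2id.get(t, unknown) for t in tokens]

doc = ["head", "woman", "woman", "gordina", "posed"]
bow = to_bow(doc, token2id)
idx = to_idx(doc, token2id)
print(bow)
print(idx)
```

The bag of words collapses the repeated `woman` into a count, while the index list keeps one entry per token and marks the out-of-vocabulary `posed` with `-1`.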


In [19]:
# train the model
# the IDF statistics are taken from the dictionary, so no corpus argument is needed
tfidf = gensim.models.TfidfModel(dictionary=vvg_dict, normalize=True)
corpus_tfidf = tfidf[text_bows]
print(len(corpus_tfidf), len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

59 59 59 59 59 59 59 59
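
To make the weighting concrete, here is a minimal plain-Python sketch of TF-IDF. It assumes the scheme that gensim documents as its default: raw term frequency, an inverse document frequency of log base 2 of N over df, and L2 normalisation per document. The function `tfidf_weights` and the toy corpus are hypothetical and only approximate what `TfidfModel` does internally:

```python
import math

def tfidf_weights(bows):
    """TF-IDF with idf = log2(N / df) and L2 normalisation per document."""
    num_docs = len(bows)
    # document frequency: number of documents containing each token id
    dfs = {}
    for bow in bows:
        for token_id, _ in bow:
            dfs[token_id] = dfs.get(token_id, 0) + 1

    weighted = []
    for bow in bows:
        vec = [(tid, tf * math.log2(num_docs / dfs[tid])) for tid, tf in bow]
        # tokens present in every document get idf 0 and are dropped
        vec = [(tid, w) for tid, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

# hypothetical bag-of-words corpus: token 0 appears in every document
bows = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 3), (1, 1), (2, 1)]]
result = tfidf_weights(bows)
for vec in result:
    print(vec)
```

Token 0 occurs in all three documents, so it carries no weight and disappears from every vector, which is the intended effect of the inverse document frequency.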


In [20]:
text_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 58 columns):
 #   Column                                                                                              Non-Null Count  Dtype 
---  ------                                                                                              --------------  ----- 
 0   ID                                                                                                  59 non-null     object
 1   F-number                                                                                            59 non-null     object
 2   JH-number                                                                                           59 non-null     object
 3   creator-date                                                                                        59 non-null     object
 4   creator-place                                                                                       59 non-null     object
 

In [21]:
text_df["BOWS_TOKENS"] = text_bows
text_df["IDX_TOKENS"] = text_idxs
# materialising the lazy gensim corpus into a list before storing it
text_df["TFIDF_TOKENS"] = list(corpus_tfidf)

  return array(a, dtype, copy=False, order=order)


In [22]:
# checking everything is okay
text_df.head()

Unnamed: 0,ID,F-number,JH-number,creator-date,creator-place,Dimensions,details,credits,CORE_TEXT,heads,...,1890,Nuenen,Antwerp,Paris,Brussels,TOKENS,PREP_TOKENS,BOWS_TOKENS,IDX_TOKENS,TFIDF_TOKENS
0,s0004V1962r,F0388r,JH0782,May 1885,Nuenen,43.5 cm x 36.2 cm,oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Head of a Woman With his brother Theo van Gog...,https://vangoghmuseum.nl/en/collection?Genre=h...,...,localhost,https://vangoghmuseum.nl/en/collection?Place=N...,localhost,localhost,localhost,"[head, of, a, woman, with, his, brother, theo,...","[head, woman, brother, theo, van, gogh, paris,...","[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1...","[33, 65, 22, 58, 61, 32, 47, 43, 3, 59, 26, 15...","[(0, 0.0016635089512223552), (1, 0.24017793259..."
1,s0006V1962,F0160,JH0722,April 1885,Nuenen,"43.2 cm x 30.0 cm, 2.2 cm x 59.0 cm, 46.2 cm x...",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Head of a Woman This woman is Gordina de Groot...,https://vangoghmuseum.nl/en/collection?Genre=h...,...,localhost,https://vangoghmuseum.nl/en/collection?Place=N...,localhost,localhost,localhost,"[head, of, a, woman, this, woman, is, gordina,...","[head, woman, woman, gordina, de, groot, posed...","[(0, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","[33, 65, 65, 81, 74, 83, 98, 78, 99, 76, 100, ...","[(0, 0.0006961245434044776), (2, 0.10959359241..."
2,s0010V1962,F0174,JH0978,December 1885,Antwerp,"50.5 cm x 39.8 cm, 68.1 cm x 57.7 cm",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Portrait of an Old Woman The old womans grey h...,localhost,...,localhost,localhost,https://vangoghmuseum.nl/en/collection?Place=A...,localhost,localhost,"[portrait, of, an, old, woman, the, old, woman...","[portrait, old, woman, old, woman, grey, hair,...","[(0, 1), (3, 1), (4, 1), (5, 1), (9, 2), (10, ...","[137, 136, 65, 136, 65, 127, 128, 141, 145, 11...","[(0, 0.000815611219197146), (3, 0.056642161351..."
3,s0056V1962,F0216a,JH1054,June 1886,Paris,"46.0 cm x 38.0 cm, 55.1 cm x 46.5 cm",oil on cardboard,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Torso of Venus Van Gogh was thorough in everyt...,localhost,...,localhost,localhost,localhost,https://vangoghmuseum.nl/en/collection?Place=P...,localhost,"[torso, of, venus, van, gogh, was, thorough, i...","[torso, venus, van, gogh, thorough, everything...","[(0, 1), (4, 1), (5, 1), (9, 2), (10, 2), (11,...","[173, 176, 61, 32, 172, 158, 156, 168, 94, 167...","[(0, 0.0009126521072534764), (4, 0.00184117721..."
4,s0058V1962,F0161,JH0788,March-May 1885,Nuenen,"45.5 cm x 33.0 cm, 60 cm x 48 cm",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...","Woman with a Mourning Shawl Here, Van Gogh was...",https://vangoghmuseum.nl/en/collection?Genre=h...,...,localhost,https://vangoghmuseum.nl/en/collection?Place=N...,localhost,localhost,localhost,"[woman, with, a, mourning, shawl, here, van, g...","[woman, mourning, shawl, van, gogh, practising...","[(0, 1), (3, 2), (4, 1), (5, 1), (9, 2), (10, ...","[65, 186, 191, 61, 32, 189, 94, 194, 120, 184,...","[(0, 0.0006248248836673449), (3, 0.08678505406..."


In [23]:
# creating the dense vector standard representation of the text
text_dvector = list()

# iterating in each text with the tfidf word bag
for t_idtokens, tfidf_tokens in zip(text_idxs, corpus_tfidf):
    # transforming the tfidf into a dict once per text
    tokens_dict = dict(tfidf_tokens)

    # dense vector representation
    tdvect = list()

    # creating the dense representation for each text
    for t_token in t_idtokens:

        # looking for each word, tokens without a tfidf weight are skipped
        if t_token in tokens_dict:
            # appending std word representation into array
            tdvect.append(tokens_dict[t_token])

    # adding the std dense vector to the corpus column
    text_dvector.append(tdvect)

# checking the size of all columns
print(len(text_dvector), len(corpus_tfidf), len(text_bows), len(text_idxs), len(text_lemmas), len(text_nsw_tokens), len(text_tokens), len(text_clean), len(text_corpus))

# adding the dense representation into the dataframe
text_df["STD_DVEC_TOKENS"] = text_dvector

59 59 59 59 59 59 59 59 59
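
The lookup performed in the loop above can be isolated in a few lines. The helper `weights_in_order` and its input values are hypothetical and serve only to show the shape of the result:

```python
def weights_in_order(idx_tokens, tfidf_tokens):
    """Return the TF-IDF weight of each token id, following the text order."""
    weights = dict(tfidf_tokens)
    # ids without a weight (for example, dropped by the model) are skipped
    return [weights[t] for t in idx_tokens if t in weights]

# hypothetical values for one document
idx_tokens = [3, 1, 3, 7]
tfidf_tokens = [(1, 0.6), (3, 0.8)]
ordered = weights_in_order(idx_tokens, tfidf_tokens)
print(ordered)
```

Each vector has one entry per weighted token in the text, so its length varies from document to document; it is a weight sequence, not a fixed-width vector over the vocabulary.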


In [24]:
# complete text representation
tf_dense_corpus = matutils.corpus2dense(text_bows, num_terms = len(vvg_dict.token2id), num_docs = len(corpus_tfidf))
# text_df["STD_CVEC_TOKENS"] = tf_dense_corpus
tf_dense_corpus[0:]

array([[1., 1., 1., ..., 1., 1., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 2.]], dtype=float32)
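
The printed array holds raw counts because the cell passes `text_bows` rather than `corpus_tfidf`, and its rows are terms while its columns are documents. A plain-Python sketch of that layout, with a hypothetical helper `bows_to_dense` and an invented two-document corpus:

```python
def bows_to_dense(bows, num_terms):
    """Build a terms x documents matrix of counts as nested lists."""
    dense = [[0.0] * len(bows) for _ in range(num_terms)]
    for doc_no, bow in enumerate(bows):
        for token_id, count in bow:
            dense[token_id][doc_no] = float(count)
    return dense

# hypothetical corpus: 2 documents over a vocabulary of 3 token ids
bows = [[(0, 1), (2, 2)], [(1, 1)]]
dense = bows_to_dense(bows, num_terms=3)
for row in dense:
    print(row)

# transposing gives one row per document, the usual layout for a dataframe
per_document = [list(col) for col in zip(*dense)]
print(per_document)
```

Transposing first would be needed before storing one fixed-width vector per row of `text_df`.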

In [25]:
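# note: corpus2csc returns a single terms x documents sparse matrix,
# so this assignment stores the same whole matrix in every row (see the output below)
tf_sparse_corpus = matutils.corpus2csc(corpus_tfidf)
text_df["STD_SVEC_TOKENS"] = tf_sparse_corpus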

In [26]:
# checking everything is okay
text_df.head()

Unnamed: 0,ID,F-number,JH-number,creator-date,creator-place,Dimensions,details,credits,CORE_TEXT,heads,...,Antwerp,Paris,Brussels,TOKENS,PREP_TOKENS,BOWS_TOKENS,IDX_TOKENS,TFIDF_TOKENS,STD_DVEC_TOKENS,STD_SVEC_TOKENS
0,s0004V1962r,F0388r,JH0782,May 1885,Nuenen,43.5 cm x 36.2 cm,oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Head of a Woman With his brother Theo van Gog...,https://vangoghmuseum.nl/en/collection?Genre=h...,...,localhost,localhost,localhost,"[head, of, a, woman, with, his, brother, theo,...","[head, woman, brother, theo, van, gogh, paris,...","[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1...","[33, 65, 22, 58, 61, 32, 47, 43, 3, 59, 26, 15...","[(0, 0.0016635089512223552), (1, 0.24017793259...","[0.13998267716640597, 0.12108879158578355, 0.0...","(0, 0)\t0.0016635089512223552\n (1, 0)\t0.2..."
1,s0006V1962,F0160,JH0722,April 1885,Nuenen,"43.2 cm x 30.0 cm, 2.2 cm x 59.0 cm, 46.2 cm x...",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Head of a Woman This woman is Gordina de Groot...,https://vangoghmuseum.nl/en/collection?Genre=h...,...,localhost,localhost,localhost,"[head, of, a, woman, this, woman, is, gordina,...","[head, woman, woman, gordina, de, groot, posed...","[(0, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","[33, 65, 65, 81, 74, 83, 98, 78, 99, 76, 100, ...","[(0, 0.0006961245434044776), (2, 0.10959359241...","[0.11715642065574383, 0.10134346399773161, 0.1...","(0, 0)\t0.0016635089512223552\n (1, 0)\t0.2..."
2,s0010V1962,F0174,JH0978,December 1885,Antwerp,"50.5 cm x 39.8 cm, 68.1 cm x 57.7 cm",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Portrait of an Old Woman The old womans grey h...,localhost,...,https://vangoghmuseum.nl/en/collection?Place=A...,localhost,localhost,"[portrait, of, an, old, woman, the, old, woman...","[portrait, old, woman, old, woman, grey, hair,...","[(0, 1), (3, 1), (4, 1), (5, 1), (9, 2), (10, ...","[137, 136, 65, 136, 65, 127, 128, 141, 145, 11...","[(0, 0.000815611219197146), (3, 0.056642161351...","[0.23551639596972265, 0.32295269587566183, 0.1...","(0, 0)\t0.0016635089512223552\n (1, 0)\t0.2..."
3,s0056V1962,F0216a,JH1054,June 1886,Paris,"46.0 cm x 38.0 cm, 55.1 cm x 46.5 cm",oil on cardboard,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...",Torso of Venus Van Gogh was thorough in everyt...,localhost,...,localhost,https://vangoghmuseum.nl/en/collection?Place=P...,localhost,"[torso, of, venus, van, gogh, was, thorough, i...","[torso, venus, van, gogh, thorough, everything...","[(0, 1), (4, 1), (5, 1), (9, 2), (10, 2), (11,...","[173, 176, 61, 32, 172, 158, 156, 168, 94, 167...","[(0, 0.0009126521072534764), (4, 0.00184117721...","[0.11551266398918976, 0.09168590962460091, 0.2...","(0, 0)\t0.0016635089512223552\n (1, 0)\t0.2..."
4,s0058V1962,F0161,JH0788,March-May 1885,Nuenen,"45.5 cm x 33.0 cm, 60 cm x 48 cm",oil on canvas,"Van Gogh Museum, Amsterdam (Vincent van Gogh F...","Woman with a Mourning Shawl Here, Van Gogh was...",https://vangoghmuseum.nl/en/collection?Genre=h...,...,localhost,localhost,localhost,"[woman, with, a, mourning, shawl, here, van, g...","[woman, mourning, shawl, van, gogh, practising...","[(0, 1), (3, 2), (4, 1), (5, 1), (9, 2), (10, ...","[65, 186, 191, 61, 32, 189, 94, 194, 120, 184,...","[(0, 0.0006248248836673449), (3, 0.08678505406...","[0.13644523534498276, 0.44711868574191016, 0.4...","(0, 0)\t0.0016635089512223552\n (1, 0)\t0.2..."


In [27]:
# saving the pandas DataFrame into a CSV file
# overwriting any existing CSV file with the updated dataframe
target_fn = "std-" + work_fn
fn_tpath = os.path.join(root_folder, dataf, stdf, target_fn)
print(fn_tpath)
text_df.to_csv(fn_tpath,
                sep=",",
                index=False,
                encoding="utf-8",
                mode="w",
                )

c:\Users\Felipe\Documents\GitHub\sa-artea\VVG-Gallery-StdDataProcessor\Data\Std\std-VVG-Gallery-Text-Data-Small.csv
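
One consequence of saving list-valued columns to CSV is that they come back as strings when the file is read again. The sketch below shows the round trip with the standard library `csv` module and a hypothetical one-row table; pandas would be expected to behave similarly for object columns, but that is an assumption and not something verified here:

```python
import ast
import csv
import io

# hypothetical row with a list-valued column, as in BOWS_TOKENS
rows = [{"ID": "doc-1", "BOWS_TOKENS": [(0, 1), (3, 2)]}]

# writing: the csv module stores the list as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", "BOWS_TOKENS"])
writer.writeheader()
writer.writerows(rows)

# reading: every value comes back as a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["BOWS_TOKENS"]

# ast.literal_eval safely parses the Python literal back into a list
restored = ast.literal_eval(raw)
print(type(raw).__name__, restored)
```

Any later notebook that loads the standardized CSV would need a similar parsing step before using the token columns as lists.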


In [28]:
# I do not remember why I did this
# sim_index = gensim.similarities.SparseMatrixSimilarity(corpus_tfidf, num_features=len(vvg_dict))