# Cleaning the Metadata of the New Dataset

We use this version of the [wikiART dataset](https://archive.org/details/wikiart-dataset) because the other datasets had many issues, including missing files, inconsistent metadata, and missing classes. We preferred it over calling the [wikiART API](https://www.wikiart.org/en/App/GetApi), which has a request limit of 400 per hour.
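
To put the rate limit in perspective, a back-of-the-envelope estimate helps. The sketch below assumes one API request per painting and uses a hypothetical collection size of 80,000 images; neither figure comes from the API documentation, and `hours_to_fetch` is an illustrative helper.

```python
import math

REQUESTS_PER_HOUR = 400  # limit quoted above


def hours_to_fetch(n_paintings, requests_per_painting=1):
    """Lower bound on the hours needed under the hourly request limit."""
    total_requests = n_paintings * requests_per_painting
    return math.ceil(total_requests / REQUESTS_PER_HOUR)


# 80_000 is a hypothetical dataset size, used only for illustration
print(hours_to_fetch(80_000))  # 200 hours, a little over 8 days
```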

In [18]:
import pandas as pd
data_raw = pd.read_csv('wclasses.csv')

In [19]:
data_raw.head(2)

| | file | artist | genre | style |
|---|---|---|---|---|
| 0 | Realism/vincent-van-gogh_pine-trees-in-the-fen... | 22 | 133 | 161 |
| 1 | Baroque/rembrandt_the-angel-appearing-to-the-s... | 20 | 136 | 144 |


The target output of our model should be one of artist, genre, or style. All three are stored as integer codes, and the code-to-name mapping lives in the `classes.php` file. Unfortunately, many of the artists are missing from that file, so we convert the genre and style codes with the mapping and obtain the artist from the filename instead. Since working with `.php` files in Python is unusual, it is faster to copy the relevant parts of the mapping into a Python dictionary.
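
If the mapping ever needs to be regenerated, the copy step could in principle be automated. The following is only a sketch: it assumes the entries in `classes.php` look like `129 => 'abstract_painting'`, which is a guess about the file's layout, and `php_excerpt` is a made-up sample rather than the real file contents.

```python
import re

# Hypothetical excerpt; the real layout of classes.php may differ.
php_excerpt = """
    129 => 'abstract_painting',
    130 => 'cityscape',
    166 => 'Ukiyo_e',
"""

# Capture an integer key and a single-quoted value on either side of "=>"
pattern = re.compile(r"(\d+)\s*=>\s*'([^']*)'")
parsed = {int(code): name for code, name in pattern.findall(php_excerpt)}
print(parsed)
```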

In [20]:
# Copied from classes.php: codes 129-139 are genres, 140-166 are styles
class_dict = {129 : 'abstract_painting',
    130 : 'cityscape',
    131 : 'genre_painting',
    132 : 'illustration',
    133 : 'landscape',
    134 : 'nude_painting',
    135 : 'portrait',
    136 : 'religious_painting',
    137 : 'sketch_and_study',
    138 : 'still_life',
    139 : 'Unknown Genre',
    140 : 'Abstract_Expressionism',
    141 : 'Action_painting',
    142 : 'Analytical_Cubism',
    143 : 'Art_Nouveau',
    144 : 'Baroque',
    145 : 'Color_Field_Painting',
    146 : 'Contemporary_Realism',
    147 : 'Cubism',
    148 : 'Early_Renaissance',
    149 : 'Expressionism',
    150 : 'Fauvism',
    151 : 'High_Renaissance',
    152 : 'Impressionism',
    153 : 'Mannerism_Late_Renaissance',
    154 : 'Minimalism',
    155 : 'Naive_Art_Primitivism',
    156 : 'New_Realism',
    157 : 'Northern_Renaissance',
    158 : 'Pointillism',
    159 : 'Pop_Art',
    160 : 'Post_Impressionism',
    161 : 'Realism',
    162 : 'Rococo',
    163 : 'Romanticism',
    164 : 'Symbolism',
    165 : 'Synthetic_Cubism',
    166 : 'Ukiyo_e'}

We map the encodings in the dataframe to their values in the dictionary.

In [21]:
df = data_raw.copy()  # work on a copy so data_raw keeps the original codes
df['genre'] = df['genre'].map(class_dict)
df['art_style'] = df['style'].map(class_dict)
df.drop(columns=['style'], inplace=True)
df.head(5)

| | file | artist | genre | art_style |
|---|---|---|---|---|
| 0 | Realism/vincent-van-gogh_pine-trees-in-the-fen... | 22 | landscape | Realism |
| 1 | Baroque/rembrandt_the-angel-appearing-to-the-s... | 20 | religious_painting | Baroque |
| 2 | Post_Impressionism/paul-cezanne_portrait-of-th... | 16 | portrait | Post_Impressionism |
| 3 | Impressionism/pierre-auguste-renoir_young-girl... | 17 | genre_painting | Impressionism |
| 4 | Romanticism/ivan-aivazovsky_morning-1851.jpg | 9 | Unknown Genre | Romanticism |
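
`Series.map` with a dictionary silently yields NaN for any value that is not a key, so a quick check for unmapped codes is worthwhile. A minimal sketch on toy data follows; the names `codes` and `lookup` and the code 999 are illustrative and do not come from the dataset.

```python
import pandas as pd

# Toy data; code 999 is deliberately absent from the lookup table
codes = pd.Series([133, 136, 999])
lookup = {133: 'landscape', 136: 'religious_painting'}

mapped = codes.map(lookup)

# Original codes whose mapped value came back as NaN
unmapped = codes[mapped.isna()].unique()
print(unmapped)
```

An empty result for both `genre` and `style` would indicate that the dictionary covers every code present in the data.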


In [22]:
df.genre.unique()

array(['landscape', 'religious_painting', 'portrait', 'genre_painting',
       'Unknown Genre', 'still_life', 'sketch_and_study', 'illustration',
       'cityscape', 'nude_painting', 'abstract_painting'], dtype=object)

In [24]:
df.art_style.unique()

array(['Realism', 'Baroque', 'Post_Impressionism', 'Impressionism',
       'Romanticism', 'Art_Nouveau', 'Northern_Renaissance', 'Symbolism',
       'Naive_Art_Primitivism', 'Expressionism', 'Cubism', 'Fauvism',
       'Analytical_Cubism', 'Abstract_Expressionism', 'Synthetic_Cubism',
       'Pointillism', 'Early_Renaissance', 'Color_Field_Painting',
       'New_Realism', 'Ukiyo_e', 'Rococo', 'High_Renaissance',
       'Mannerism_Late_Renaissance', 'Pop_Art', 'Contemporary_Realism',
       'Minimalism', 'Action_painting'], dtype=object)

Replace the `Unknown Genre` placeholder with NaN so that pandas treats those entries as missing values.

In [26]:
import numpy as np
df['genre'] = df['genre'].replace('Unknown Genre', np.nan)

In [27]:
df.head(5)

| | file | artist | genre | art_style |
|---|---|---|---|---|
| 0 | Realism/vincent-van-gogh_pine-trees-in-the-fen... | 22 | landscape | Realism |
| 1 | Baroque/rembrandt_the-angel-appearing-to-the-s... | 20 | religious_painting | Baroque |
| 2 | Post_Impressionism/paul-cezanne_portrait-of-th... | 16 | portrait | Post_Impressionism |
| 3 | Impressionism/pierre-auguste-renoir_young-girl... | 17 | genre_painting | Impressionism |
| 4 | Romanticism/ivan-aivazovsky_morning-1851.jpg | 9 | NaN | Romanticism |
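
With the placeholder converted to NaN, pandas' missing-data tools apply directly. The sketch below uses a toy frame (`toy` and `genre_rows` are illustrative names) to count missing genres and to keep only the labelled rows, which is one possible way to handle them when genre is the target.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring two of the columns above
toy = pd.DataFrame({
    'genre': ['landscape', np.nan, 'portrait'],
    'art_style': ['Realism', 'Romanticism', 'Post_Impressionism'],
})

# Number of missing values per column
print(toy.isna().sum())

# Keep only rows that have a genre label
genre_rows = toy.dropna(subset=['genre'])
print(len(genre_rows))
```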


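We get the artist from the filename. Each path has the form `<style>/<artist>_<title>.jpg`, so the artist is the text between the `/` and the first `_`.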

In [28]:
df['artist'] = df['file'].str.split('/').str[1].str.split('_').str[0]

In [29]:
df.artist.unique()

array(['vincent-van-gogh', 'rembrandt', 'paul-cezanne', ...,
       'isa-genzken', 'james-turrell', 'frida-kahlo'], dtype=object)
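
To make the chained string operations concrete, here is the same split applied to a single path taken from the rows shown earlier. This is a sketch; it relies on artist slugs containing no underscore, which holds for the rows displayed here but is not verified for the whole dataset.

```python
import pandas as pd

files = pd.Series(['Romanticism/ivan-aivazovsky_morning-1851.jpg'])

# The part before '/' is the style folder
folder = files.str.split('/').str[0]
# The part after '/' is '<artist>_<title>.jpg'; keep what precedes the first '_'
artist = files.str.split('/').str[1].str.split('_').str[0]

print(folder[0])  # Romanticism
print(artist[0])  # ivan-aivazovsky
```

The folder name obtained this way could also be compared with `art_style` as a consistency check on the mapping.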

In [31]:
df = df.rename(columns={'file': 'filename'})

In [32]:
df.head(5)

| | filename | artist | genre | art_style |
|---|---|---|---|---|
| 0 | Realism/vincent-van-gogh_pine-trees-in-the-fen... | vincent-van-gogh | landscape | Realism |
| 1 | Baroque/rembrandt_the-angel-appearing-to-the-s... | rembrandt | religious_painting | Baroque |
| 2 | Post_Impressionism/paul-cezanne_portrait-of-th... | paul-cezanne | portrait | Post_Impressionism |
| 3 | Impressionism/pierre-auguste-renoir_young-girl... | pierre-auguste-renoir | genre_painting | Impressionism |
| 4 | Romanticism/ivan-aivazovsky_morning-1851.jpg | ivan-aivazovsky | NaN | Romanticism |


In [34]:
# Check that no filename appears more than once (keep=False flags every copy)
duplicate_rows = df[df.duplicated('filename', keep=False)]
print(duplicate_rows)

Empty DataFrame
Columns: [filename, artist, genre, art_style]
Index: []
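
The `keep=False` argument matters for this check: with the default `keep='first'`, only the later repeats are flagged, whereas `keep=False` flags every row in a duplicated group. A small sketch on toy filenames (not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({'filename': ['a.jpg', 'b.jpg', 'a.jpg']})

# Default keep='first': only the second 'a.jpg' is marked
first_only = toy.duplicated('filename').tolist()
print(first_only)  # [False, False, True]

# keep=False: both 'a.jpg' rows are marked
all_copies = toy.duplicated('filename', keep=False).tolist()
print(all_copies)  # [True, False, True]
```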


Save the results to `labels.csv`.

In [33]:
df.to_csv('labels.csv', index=False)
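
As a sanity check, the saved file can be read back to confirm that the columns and missing values survive the round trip. The sketch below does this on a one-row toy frame written to a temporary directory, so it does not touch the real `labels.csv`; `toy` and `reloaded` are illustrative names.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'filename': ['Romanticism/ivan-aivazovsky_morning-1851.jpg'],
    'artist': ['ivan-aivazovsky'],
    'genre': [np.nan],
    'art_style': ['Romanticism'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'labels.csv')
    toy.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded['genre'].isna().all())  # the empty field is read back as NaN
```

Without `index=False`, the index would be written as an unnamed first column and `read_csv` would load it back as `Unnamed: 0`.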