# Data Preprocessing

This scraped dataset contains a number of characters that could be misinterpreted as special characters, so before doing any cleaning I'm going to start with some preprocessing to find and escape the potential problem characters.

In [1]:
import pandas as pd
import numpy as np

As an example, take a look at how the price columns are rendered:

In [2]:
df = pd.read_csv('data/auction_data.csv', usecols=[0, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13, 14])
df.head()

Unnamed: 0,artist_name,title,date,medium,dims,auction_date,auction_house,auction_sale,auction_lot,price_realized,estimate,bought_in
0,Pablo PICASSO,Fillette au bateau (Maya),", 1938",oil on canvas,73.3 x 60 cm,"Mar 1, 2023",Sotheby's• London,Modern & Contemporary Evening Auction,Lot6002,,,
1,Pablo PICASSO,Homme assis,", 1969",oil on panel laid down on cradled panel,28.7 x 56.6 cm,"Nov 30, 2022",Christie's,Live Auction 19901 20th/21st Century Art Eveni...,Lot35,"HK$10,650,000• US$1,363,489","HK$8,200,000–HK$12,800,000(est)",
2,Pablo PICASSO,Buffalo Bill,", 1911",oil and sand on canvas,33.3 x 46.3 cm,"Nov 17, 2022",Christie's,Live Auction 20988 20th Century Evening Sale,Lot12,"US$12,412,500","US$10,000,000–US$15,000,000(est)",
3,Pablo PICASSO,Homme à la moustache,", 1970",oil and oil stick on panel,65.1 x 129.4 cm,"Nov 17, 2022",Christie's,Live Auction 20988 20th Century Evening Sale,Lot30,"US$4,620,000","US$4,000,000–US$6,000,000(est)",
4,Pablo PICASSO,Le peintre et son modèle,", 1964",oil and ripolin on canvas,195.0 x 130.0 cm,"Nov 17, 2022",Christie's,Live Auction 20988 20th Century Evening Sale,Lot36,"US$10,351,500","US$8,000,000–US$12,000,000(est)",


The issue here is that fields like `price_realized` and `estimate` can contain several currency symbols, and the notebook can interpret these as special characters (most likely as math delimiters, since Jupyter renders text between a pair of `$` as LaTeX). Notice, for instance, that in the rendered table the `estimate` values for these rows are missing the `$` signs that should be there.
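To see which columns are affected before changing anything, a quick scan of the raw CSV text can help. The following is only a minimal sketch: `count_char_by_column` is a hypothetical helper (not part of this notebook), and the sample text is a shortened version of one of the rows shown above.

```python
import csv
import io
from collections import Counter


def count_char_by_column(csv_text, char="$"):
    """Count, per column, how many cells contain `char` (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        for column, value in row.items():
            # Empty cells come back as '' and are skipped.
            if value and char in value:
                counts[column] += 1
    return dict(counts)


# Shortened sample based on the rows above; only three columns are kept.
sample = (
    "auction_lot,price_realized,estimate\n"
    'Lot35,"HK$10,650,000• US$1,363,489","HK$8,200,000–HK$12,800,000(est)"\n'
    "Lot6002,,\n"
)

print(count_char_by_column(sample))
```

On the full file, the same function could be called with the text read from `data/auction_data.csv`, and with other suspect characters passed as `char`.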

My solution is to simply escape the dollar signs and write the changes to a new .csv as follows:

In [3]:
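# Read the raw file and escape every dollar sign with a backslash.
# encoding='utf-8' is assumed here; it matches the default used by pd.read_csv.
with open('data/auction_data.csv', 'r', encoding='utf-8') as f:
    data = f.read().replace('$', '\\$')

# Write the escaped text to a new file so the original stays untouched.
with open('data/auction_data_processed.csv', 'w', encoding='utf-8') as f:
    f.write(data)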
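One caveat with a plain `str.replace`: if the escaping is ever applied to text that is already escaped, each `\$` becomes `\\$`. A possible guard, sketched here with a hypothetical helper named `escape_dollars`, is to escape only the dollar signs that are not already preceded by a backslash.

```python
import re


def escape_dollars(text):
    """Escape each '$' that is not already preceded by a backslash.

    Hypothetical helper: the negative lookbehind makes the operation
    idempotent, so applying it twice gives the same result as applying it once.
    """
    return re.sub(r'(?<!\\)\$', r'\\$', text)


raw = "US$10,000,000–US$15,000,000(est)"
once = escape_dollars(raw)
twice = escape_dollars(once)

print(once)
print(once == twice)
```

This is a sketch rather than a drop-in replacement. It assumes that a backslash in front of a `$` always means "already escaped", which may not hold for every dataset.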

In [4]:
df = pd.read_csv('data/auction_data_processed.csv', usecols=[0, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13, 14])
df.head()

Unnamed: 0,artist_name,title,date,medium,dims,auction_date,auction_house,auction_sale,auction_lot,price_realized,estimate,bought_in
0,Pablo PICASSO,Fillette au bateau (Maya),", 1938",oil on canvas,73.3 x 60 cm,"Mar 1, 2023",Sotheby's• London,Modern & Contemporary Evening Auction,Lot6002,,,
1,Pablo PICASSO,Homme assis,", 1969",oil on panel laid down on cradled panel,28.7 x 56.6 cm,"Nov 30, 2022",Christie's,Live Auction 19901 20th/21st Century Art Eveni...,Lot35,"HK\$10,650,000• US\$1,363,489","HK\$8,200,000–HK\$12,800,000(est)",
2,Pablo PICASSO,Buffalo Bill,", 1911",oil and sand on canvas,33.3 x 46.3 cm,"Nov 17, 2022",Christie's,Live Auction 20988 20th Century Evening Sale,Lot12,"US\$12,412,500","US\$10,000,000–US\$15,000,000(est)",
3,Pablo PICASSO,Homme à la moustache,", 1970",oil and oil stick on panel,65.1 x 129.4 cm,"Nov 17, 2022",Christie's,Live Auction 20988 20th Century Evening Sale,Lot30,"US\$4,620,000","US\$4,000,000–US\$6,000,000(est)",
4,Pablo PICASSO,Le peintre et son modèle,", 1964",oil and ripolin on canvas,195.0 x 130.0 cm,"Nov 17, 2022",Christie's,Live Auction 20988 20th Century Evening Sale,Lot36,"US\$10,351,500","US\$8,000,000–US\$12,000,000(est)",


Et voilà: the currency symbols now show up in both price columns.
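If the symbols are indeed being swallowed by MathJax, an alternative that avoids rewriting the file may be the pandas display option `display.html.use_mathjax`, which controls whether DataFrame HTML is processed by MathJax in the notebook. I haven't verified it against this dataset, so treat the following as a sketch under that assumption.

```python
import pandas as pd

# Ask pandas not to have DataFrame HTML processed by MathJax, so that text
# between a pair of '$' is shown literally instead of being typeset as math.
pd.set_option('display.html.use_mathjax', False)

print(pd.get_option('display.html.use_mathjax'))
```

This only affects how tables are displayed. The escaped copy of the file is still the safer choice if the data will later be pasted into Markdown cells or exported.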