# The Price of Art

## Question
What determines the price of (modern) art?

## Hypothesis
An interaction of artist, artwork features, and representation.

## Methodology

1. Examine a model where price is just being predicted by artist, to see if that model can account for significant amounts of variance. 
2. Examine a model where price is just being predicted by artwork features, """ 
3. Examine a model where price is just being predicted by representation, """
4. See if we can do some structural equation modelling of this

# Step 1: Data processing

## Data import

### Import raw data from CSV

In [331]:
import numpy as np
import pandas as pd
import plotly.express as px 

# import from raw CSV
artsy_data = pd.read_csv("./data/artworks_list_FINAL.csv", header=0)
artsy_data = artsy_data.drop(columns=['collection_index'])

f"Our raw dataset has {len(artsy_data)} entries."

'Our raw dataset has 2970 entries.'

### Remove error rows
These are entries where we were unable to collect useable data, most likely because the artwork had already been sold at the time of the data scraping, even though it was in the Artsy listings index.

In [332]:
# find rows where error collection has failed
error_rows = artsy_data.index[(artsy_data == "error").any(axis=1)].tolist()

# exclude these rows from the overall dataframe
artsy_listings = artsy_data.drop(error_rows, axis=0)

# how much of original dataframe did we exclude?
rows_excluded = len(error_rows)
percent_excluded = round((len(artsy_data) - len(artsy_listings)) / len(artsy_data) * 100, 2)
f"We excluded {rows_excluded} entries, or {percent_excluded}% of the original dataset, because of errors in data collection."

'We excluded 9 entries, or 0.3% of the original dataset, because of errors in data collection.'

In [333]:
# look at data types
artsy_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2961 entries, 0 to 2969
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   page_url            2961 non-null   object
 1   artist              2961 non-null   object
 2   artist_nationality  2929 non-null   object
 3   artist_birthdate    2002 non-null   object
 4   title               2961 non-null   object
 5   image_url           2961 non-null   object
 6   year                2957 non-null   object
 7   gallery             2960 non-null   object
 8   gallery_location    2960 non-null   object
 9   medium              2961 non-null   object
 10  medium_details      2961 non-null   object
 11  size_inches         2961 non-null   object
 12  size_cm             2961 non-null   object
 13  condition           1341 non-null   object
 14  classification      2961 non-null   object
 15  signed              2821 non-null   object
 16  authenticated       2090

## Clean target variable

The target variable is price, but these are all in different currencies. We will standardise price in GBP.

In [334]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings['price'])

0                                           55,000
1                                            3,500
2                                            2,000
3                                            6,500
5                                            1,250
6                                           15,000
7                                            1,800
8                                              700
9                                            8,000
10                                          11,000
11                                           2,000
12                                           1,000
13                                           3,650
14                                           4,500
15                                           3,200
17                                           2,200
18                                             800
19                                          80,000
20                                           4,500
21                             

In [335]:
# remove all commas from the prices
artsy_listings['price'] = artsy_listings['price'].replace(',','', regex=True)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings['price'])

0                                            55000
1                                             3500
2                                             2000
3                                             6500
5                                             1250
6                                            15000
7                                             1800
8                                              700
9                                             8000
10                                           11000
11                                            2000
12                                            1000
13                                            3650
14                                            4500
15                                            3200
17                                            2200
18                                             800
19                                           80000
20                                            4500
21                             

In [336]:
# convert all strings into numerics where possible, otherwise fill with NaNs
def convertNumeric(x):
    if type(x) == str:
        if x.isnumeric():
            x = float(x)
        else:
            x = np.NaN
    return x  

# change the price list to the updated values
artsy_listings['price'] = artsy_listings['price'].apply(lambda x: convertNumeric(x))

# take a look and do a summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings['price'])

artsy_listings['price'].describe()

0        55000.0
1         3500.0
2         2000.0
3         6500.0
5         1250.0
6        15000.0
7         1800.0
8          700.0
9         8000.0
10       11000.0
11        2000.0
12        1000.0
13        3650.0
14        4500.0
15        3200.0
17        2200.0
18         800.0
19       80000.0
20        4500.0
21        3495.0
22         800.0
23         800.0
24         695.0
25        3500.0
26        1800.0
27        3900.0
28      120000.0
29        2000.0
30       16000.0
31         250.0
32        1900.0
33        1000.0
34        2400.0
35       18000.0
36        6500.0
37       23000.0
38         390.0
39       22000.0
40         250.0
41       13000.0
42        1500.0
43       50000.0
44       12900.0
45        1725.0
46        3200.0
47        1650.0
48        6000.0
49        5500.0
50       22500.0
51       45000.0
52       20000.0
53         200.0
54       10320.0
55       31250.0
56         580.0
57        1000.0
58         800.0
59       17000.0
60        1300

count      2513.000000
mean      14264.425388
std       45704.491797
min          32.000000
25%        2000.000000
50%        4750.000000
75%       12500.000000
max      850000.000000
Name: price, dtype: float64

Now we take a look at currency so we can standardise all these prices into GBP at today's market rate.

In [337]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings[['currency', 'price', 'gallery', 'gallery_location']])

     currency     price                                            gallery  \
0           $   55000.0                              Corridor Contemporary   
1           €    3500.0                                         Make offer   
2           €    2000.0                                          Artistics   
3           $    6500.0                                       Gallery 1261   
5           $    1250.0                                    Signari Gallery   
6           $   15000.0                                        Los Angeles   
7           $    1800.0                                       Gallery 1261   
8           $     700.0                                     Bakker Gallery   
9           €    8000.0                                         Make offer   
10          $   11000.0                          Luis De Jesus Los Angeles   
11          $    2000.0                                           New York   
12          $    1000.0                                       Ga

So we've got a few cases here:

1. Most artwork prices are in USD ($), GBP (£), or EUR (€).
2. We've got two other characters occasionally - "T" and "E" - which mark instances where the information about price was either not available or not in the standard price section. We will exclude these entries.
3. NaN entries - to be excluded.

In [338]:
# What are the rates for USD > GBP and EUR > GBP?
from currency_converter import CurrencyConverter
from datetime import date
c = CurrencyConverter()

# choose a date around when the data was collected
# default is the most recent rate available, but more recent dates in May 2021 are not available yet
c.convert(100, 'EUR', 'GBP', date=date(2021,3,1))
c.convert(100, 'USD', 'GBP', date=date(2021,3,1))

def currencySignConvert(sign, price):
    if sign == "$":
        price_GBP = c.convert(price, 'USD', 'GBP')
    elif sign == "€":
        price_GBP = c.convert(price, 'EUR', 'GBP')
    elif sign == '£':
        price_GBP = price
    else:
        price_GBP = np.NaN

    return price_GBP


## apply function to every row of the dataframe
price_GBP = []
for index, row in artsy_listings.iterrows():
    # print(row['currency'])
    # print(row['price'])
    price_GBP.append(round(currencySignConvert(row['currency'], row['price']),2))
artsy_listings['price_GBP'] = price_GBP

# describe the data
artsy_listings['price_GBP'].describe()
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings[['currency', 'price', 'price_GBP']])

     currency     price  price_GBP
0           $   55000.0   39049.86
1           €    3500.0    3017.53
2           €    2000.0    1724.30
3           $    6500.0    4614.98
5           $    1250.0     887.50
6           $   15000.0   10649.96
7           $    1800.0    1278.00
8           $     700.0     497.00
9           €    8000.0    6897.20
10          $   11000.0    7809.97
11          $    2000.0    1420.00
12          $    1000.0     710.00
13          $    3650.0    2591.49
14          $    4500.0    3194.99
15          €    3200.0    2758.88
17          $    2200.0    1561.99
18          $     800.0     568.00
19          $   80000.0   56799.80
20          $    4500.0    3194.99
21          £    3495.0    3495.00
22          €     800.0     689.72
23          €     800.0     689.72
24          €     695.0     599.19
25          €    3500.0    3017.53
26          $    1800.0    1278.00
27          $    3900.0    2768.99
28          $  120000.0   85199.70
29          $    200

Omit rows where there is no price information.

In [339]:
# find rows where there is no price information
error_rows = artsy_listings.index[np.isnan(artsy_listings['price_GBP'])]

# exclude these rows from the overall dataframe
artsy_listings = artsy_listings.drop(error_rows, axis=0)

# # how much of original dataframe did we exclude?
rows_excluded = len(error_rows)
percent_excluded = round((len(artsy_data) - len(artsy_listings)) / len(artsy_data) * 100, 2)
f"We excluded {rows_excluded} entries, or {percent_excluded}% of the original dataset, because of errors in data collection."

'We excluded 448 entries, or 15.39% of the original dataset, because of errors in data collection.'

In [340]:
print(artsy_listings['price_GBP'].describe())
px.histogram(artsy_listings, x="price_GBP", title="Distribution of artwork prices")

count      2513.000000
mean      11058.887577
std       36315.699465
min          22.720000
25%        1490.990000
50%        3549.990000
75%        9621.590000
max      689720.000000
Name: price_GBP, dtype: float64


In [341]:
artsy_listings['price_GBP_log'] = np.log(artsy_listings['price_GBP'])
px.histogram(artsy_listings, x="price_GBP_log", title="Distribution of artwork prices")

## Clean predictor variables

### 1. Year
Let's start with the artwork side, specifically, **year** (and therefore age) of the artwork.

In [342]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings['year'])

0                                                ca. 2012
1                                                    2017
2                                                    2015
3                                                    2016
5                                                    2019
6                                                    2020
7                                                    2020
8                                            Late 20th c.
9                                                    2017
10                                                     II
11                                                   2018
12                                                   2020
13                                                   2020
14                                                   2020
15                                                   2017
17                                                   2020
18                                                   2020
19            

In [343]:
# check types
# artsy_listings['year'].apply(lambda x: print(type(x)))

# replace any NaNs with none types
artsy_listings['year'] = artsy_listings['year'].replace({np.nan: 'none'})

# check types again
artsy_listings['year'].apply(lambda x: print(type(x)))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class

0       None
1       None
2       None
3       None
5       None
        ... 
2962    None
2966    None
2967    None
2968    None
2969    None
Name: year, Length: 2513, dtype: object

In [344]:
# remove 'ca' and 'circa'
artsy_listings['year'] = artsy_listings['year'].replace('ca.','', regex=True)
artsy_listings['year'] = artsy_listings['year'].replace('circa', '', regex=True)
artsy_listings['year'] = artsy_listings['year'].replace('Cir', '', regex=True)

# remove periods
import string
def remove_periods(text):
    text = text.replace('.', '')
    return text

artsy_listings['year'] = artsy_listings['year'].apply(remove_periods)

# find entries containing 1 or 2, otherwise insert "none" (str)
def findYears(entry):
    if "1" not in entry and "2" not in entry:
        entry = "none"
    return entry

artsy_listings['year'] = artsy_listings['year'].apply(findYears)

# replace century with averaged years
def replaceCenturies(entry):
    if "19th" in entry:
        if "early" in entry:
            entry = "1825"
        elif "late" in entry:
            entry = "1875"
        else:
            entry = "1850"
    elif "20th" in entry:
        if "early" in entry:
            entry = "1925"
        elif "late" in entry:
            entry = "1975"
        else:
            entry = "1950"
    elif "21st" in entry:
        entry = "2010"
    elif "contemporary" in entry:
        entry = "2010"
    return entry

artsy_listings['year'] = artsy_listings['year'].apply(replaceCenturies)

# replace split years with later year
def replaceYearSplit(entry):
    delimiter = ["/", "-", "–"]
    for c in delimiter:
        index = entry.find(c)
        entry = entry[index+1:]
    if "to" in entry:
        index = entry.find("to")
        entry = entry[index+3:]
    return entry

artsy_listings['year'] = artsy_listings['year'].apply(replaceYearSplit)

# replace decades ("0s", "0's")
def replaceDecades(entry):
    if "0s" in entry:
        entry = entry.replace("s", "")
        entry = entry.replace("0", "5")
    if "0's" in entry:
        entry = entry.replace("'s", "")
        entry = entry.replace("0", "5")
    return entry

artsy_listings['year'] = artsy_listings['year'].apply(replaceDecades)


# remove 's', 'c', 'cir', extra spaces, and any other punctuation
def removeFaff(entry):
    faff = ["cir", "c", "C", "s", "AD", " "]
    for c in faff:
        entry = entry.replace(c, "")
    return entry
    
artsy_listings['year'] = artsy_listings['year'].apply(removeFaff)


# remove any lines that don't have 4 characters
def noYear(entry):
    if len(entry) != 4:
        entry = "none"
    return entry

artsy_listings['year'] = artsy_listings['year'].apply(noYear)

# convert all years to numerics where possible, otherwise convert to NaNs
def convertYearNumeric(entry):
    if entry == "none":
        return np.NaN
    else:
        return int(entry)

artsy_listings['year'] = artsy_listings['year'].apply(convertYearNumeric)

# eliminate any with unrealistic years
currentYear = 2021
artsy_listings['year'] = artsy_listings['year'].apply(lambda x: np.NaN if x > currentYear else x)

In [345]:
# find rows where there is no price information
error_rows = artsy_listings.index[np.isnan(artsy_listings['year'])]

# exclude these rows from the overall dataframe
artsy_listings = artsy_listings.drop(error_rows, axis=0)

# # how much of original dataframe did we exclude?
rows_excluded = len(error_rows)
percent_excluded = round((len(artsy_data) - len(artsy_listings)) / len(artsy_data) * 100, 2)
print(f"We excluded {rows_excluded} entries, or {percent_excluded}% of the original dataset, because of errors in data collection.")

# with pd.option_context('display.max_rows', None, 'display.max_columns', None):
#     print(artsy_listings['year'])
print(artsy_listings['year'].describe())

We excluded 128 entries, or 19.7% of the original dataset, because of errors in data collection.
count    2385.000000
mean     2005.494759
std        25.290329
min      1755.000000
25%      2002.000000
50%      2018.000000
75%      2020.000000
max      2021.000000
Name: year, dtype: float64


In [346]:
px.histogram(artsy_listings, x="year", title="Distribution of years")

### Convert year into an age variable.

In [347]:
# convert year into age
artsy_listings['artwork_age'] = 2021 - artsy_listings['year']

artsy_listings['artwork_age'].describe()

count    2385.000000
mean       15.505241
std        25.290329
min         0.000000
25%         1.000000
50%         3.000000
75%        19.000000
max       266.000000
Name: artwork_age, dtype: float64

In [348]:
px.histogram(artsy_listings, x="artwork_age", title="Distribution of artwork age")

In [349]:
artsy_listings['artwork_age_log'] = np.log(artsy_listings['artwork_age'] + 1) 
print(artsy_listings['artwork_age_log'].describe())
px.histogram(artsy_listings, x="artwork_age_log", title="Distribution of artwork age (log)")

count    2385.000000
mean        1.905050
std         1.294567
min         0.000000
25%         0.693147
50%         1.386294
75%         2.995732
max         5.587249
Name: artwork_age_log, dtype: float64


### 2. Medium

In [350]:
print(artsy_listings['medium'].unique())
print(artsy_listings['medium'].value_counts())

['Painting' 'Drawing, Collage or other Work on Paper' 'Sculpture' 'Print'
 'Photography' 'Installation' 'Mixed Media' 'Design/Decorative Art'
 'Textile Arts' 'Other' 'Books and Portfolios' 'Posters'
 'Fashion Design and Wearable Art']
Painting                                   1595
Drawing, Collage or other Work on Paper     330
Sculpture                                   173
Photography                                  79
Mixed Media                                  77
Print                                        61
Design/Decorative Art                        39
Installation                                 13
Textile Arts                                  6
Other                                         5
Books and Portfolios                          3
Posters                                       2
Fashion Design and Wearable Art               2
Name: medium, dtype: int64


In [351]:
# convert medium variable into categorical variable
artsy_listings['medium'] = artsy_listings['medium'].astype("category")
print(artsy_listings['medium'].dtype)

category


### 3. Size (in metric units)

In [352]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings['size_cm'])

0                  91.4 × 137.2 cm
1                       54 × 45 cm
2                       55 × 70 cm
3                     61 × 76.2 cm
5                     61 × 22.9 cm
6           134.6 × 53.3 × 53.3 cm
7                   35.6 × 27.9 cm
8                   32.4 × 27.3 cm
9                      70 × 100 cm
11                  50.8 × 40.6 cm
12                  20.3 × 25.4 cm
13                121.9 × 213.4 cm
14                  50.8 × 40.6 cm
15                      40 × 50 cm
17            86.4 × 73.7 × 7.6 cm
18                  30.5 × 30.5 cm
19            34.3 × 44.5 × 1.9 cm
20                      76 × 55 cm
21                149.9 × 109.2 cm
22                      24 × 24 cm
23                      24 × 28 cm
24                     115 × 75 cm
25                      54 × 45 cm
26                  35.6 × 27.9 cm
27                      90 × 88 cm
28            38.1 × 30.5 × 5.1 cm
29                  55.9 × 45.7 cm
30                 76.2 × 152.4 cm
31                  

There are both two dimensional and three dimensional sizes, so we will convert these to two columns: size (in cm^2) and volume (in cm^3).

In [353]:
# return area where possible
def calculateArea(entry):
    # check if it's in metric
    if "cm" in entry:
        # remove the cm
        entry = entry.replace("cm", "")
        # count the number of times the multiplication sign says
        if entry.count("×") == 1:
            index = entry.find("×")
            factor_1 = float(entry[:index])
            factor_2 = float(entry[index+1:])
            # multiple the two sides
            area = round(factor_1 * factor_2, 2)
            return area    
        else:
            return np.NaN
    else:
        return np.NaN


def calculateVolume(entry):
    # check if it's in metric
    if "cm" in entry:
        # remove the cm
        entry = entry.replace("cm", "")
        # count the number of times the multiplication sign says
        if entry.count("×") == 2:
            indices = [n for n in range(len(entry)) if entry.find('×', n) == n]
            index_1 = indices[0]
            index_2 = indices[1]
            factor_1 = float(entry[:index_1])
            factor_2 = float(entry[index_1+1:index_2])
            factor_3 = float(entry[index_2+1:])
            # multiple the two sides
            volume = round(factor_1 * factor_2 * factor_3, 2)
            return volume
        else:
            return np.NaN
    else:
        return np.NaN


def calculateSize(entry):
    # check if it's in metric
    if "cm" in entry:
        # remove the cm
        entry = entry.replace("cm", "")
        # count the number of times the multiplication sign says
        if entry.count("×") == 2:
            indices = [n for n in range(len(entry)) if entry.find('×', n) == n]
            index_1 = indices[0]
            index_2 = indices[1]
            factor_1 = float(entry[:index_1])
            factor_2 = float(entry[index_1+1:index_2])
            factor_3 = float(entry[index_2+1:])
            # multiple the two sides
            volume = round(factor_1 * factor_2 * factor_3, 2)
            return volume
        elif entry.count("×") == 1:
            index = entry.find("×")
            factor_1 = float(entry[:index])
            factor_2 = float(entry[index+1:])
            # multiple the two sides
            area = round(factor_1 * factor_2, 2)
            return area   
        else:
            return 0
    else:
        return 0


In [354]:
artsy_listings['area'] = artsy_listings['size_cm'].apply(calculateArea)
artsy_listings['volume'] = artsy_listings['size_cm'].apply(calculateVolume)
artsy_listings['size_combined'] = artsy_listings['size_cm'].apply(calculateSize)
artsy_listings['size_combined'].describe()

count    2.385000e+03
mean     2.971724e+04
std      2.592980e+05
min      0.000000e+00
25%      1.584200e+03
50%      4.900000e+03
75%      1.440000e+04
max      1.032385e+07
Name: size_combined, dtype: float64

In [355]:
px.histogram(artsy_listings, x="size_combined", title="Distribution of sizes")

In [356]:
# eliminate any with unrealistic sizes
error_rows = artsy_listings.index[artsy_listings['size_combined'] <= 0]

# exclude these rows from the overall dataframe
artsy_listings = artsy_listings.drop(error_rows, axis=0)

# # how much of original dataframe did we exclude?
rows_excluded = len(error_rows)
percent_excluded = round((len(artsy_data) - len(artsy_listings)) / len(artsy_data) * 100, 2)
print(f"We excluded {rows_excluded} entries, or {percent_excluded}% of the original dataset, because of errors in data collection.")

We excluded 35 entries, or 20.88% of the original dataset, because of errors in data collection.


In [357]:
artsy_listings['size_combined_log'] = np.log(artsy_listings['size_combined'] + 1)
print(artsy_listings['size_combined_log'])
px.histogram(artsy_listings, x="size_combined_log", title="Distribution of sizes (log)")

0       9.436765
1       7.796058
2       8.256088
3       8.444450
5       7.242726
          ...   
2962    9.215378
2966    6.969781
2967    9.459308
2968    7.650958
2969    9.819725
Name: size_combined_log, Length: 2350, dtype: float64


In [358]:
# check if this makes sense by comparing with the medium information we have

print(artsy_listings['medium'].value_counts())

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings[['area', 'volume']][artsy_listings['medium'] == "Installation"])

Painting                                   1587
Drawing, Collage or other Work on Paper     329
Sculpture                                   160
Photography                                  77
Mixed Media                                  76
Print                                        61
Design/Decorative Art                        30
Installation                                 13
Textile Arts                                  6
Other                                         5
Books and Portfolios                          2
Fashion Design and Wearable Art               2
Posters                                       2
Name: medium, dtype: int64
          area     volume
54    10374.00        NaN
112        NaN   45360.00
442   12791.61        NaN
1070   4800.00        NaN
1072       NaN    5436.00
1207       NaN  992312.98
1254       NaN  421867.58
1358       NaN  882000.00
1536       NaN   52000.00
1700  10427.12        NaN
1705  36000.00        NaN
2144       NaN   32372.81
2822  18468

Create dummy variables (2D vs 3D) to use alongside size variable within a linear regression model.

In [359]:
artsy_listings['three_dimensional'] = artsy_listings['volume'].apply(lambda x: 0 if np.isnan(x) else 1)

artsy_listings['three_dimensional'].value_counts()

0    1769
1     581
Name: three_dimensional, dtype: int64

### 4. Classification 

In [360]:
print(artsy_listings['classification'].value_counts())

This is a unique work.                           2321
This work is part of a limited edition set.        26
This work is from an edition of unknown size.       3
Name: classification, dtype: int64


So there appears to be two general categories: unique work (majority) and limited edition (minority). Let's make a boolean variable called isUnique.

In [361]:
def rateUniqueness(entry):
    if "unique" in entry.lower():
        return 1
    else:
        return 0

artsy_listings['unique'] = artsy_listings['classification'].apply(rateUniqueness)
artsy_listings['unique'].value_counts()

1    2321
0      29
Name: unique, dtype: int64

### 5. Signed

In [362]:
print(artsy_listings['signed'].value_counts())

Hand-signed by artist                                                              1025
Not signed                                                                          123
Hand-signed by artist, Lower Right (see photo)                                       62
Hand-signed by artist, verso                                                         26
Hand-signed by artist, Initialed on bottom                                           25
                                                                                   ... 
Hand-signed by artist, Signed, dated & inscribed verso 'Nicole Eisenman 95 162'       1
Hand-signed by artist, stamped by artist's estate, Signed lower left G. Braque        1
Hand-signed by artist, Variously inscribed on verso                                   1
Hand-signed by artist, bottom, right corner                                           1
Hand-signed by artist, Signed, dated and titled on verso                              1
Name: signed, Length: 540, dtype

In [363]:
print(artsy_listings['signed'].unique())

['Hand-signed by artist' nan 'Hand-signed by artist, sticker label'
 'Hand-signed by artist, Hand-signed and dated in black marker on reverse.'
 'Hand-signed by artist, Initialed on bottom'
 'Hand-signed by artist, Signed lower right' 'Not signed'
 'Hand-signed by artist, Right bottom corner'
 'Hand-signed by artist, Signature on back'
 'Hand-signed by artist, signed on the back'
 "Hand-signed by artist, Signed lower right 'Frankenthaler'."
 'Hand-signed by artist, Signed on the bottom right hand corner'
 "Hand-signed by artist, stamped by artist's estate, Front"
 'Hand-signed by artist, Signed on the back'
 'For more information read description.'
 'Hand-signed by artist, Signed, titled and dated on the back'
 'Hand-signed by artist, verso' 'Hand-signed by artist, signed on reverse'
 'Hand-signed by artist, on the verso '
 "Hand-signed by artist, Signed and dated '1955' lower left"
 'Hand-signed by artist, On the back of the painting bottom right'
 'Hand-signed by artist, Lower Left (

In [364]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings['signed'])

0                                   Hand-signed by artist
1                                                     NaN
2                    Hand-signed by artist, sticker label
3                                   Hand-signed by artist
5       Hand-signed by artist, Hand-signed and dated i...
6              Hand-signed by artist, Initialed on bottom
7                                   Hand-signed by artist
8               Hand-signed by artist, Signed lower right
9                                   Hand-signed by artist
11                                             Not signed
12                                  Hand-signed by artist
13                                  Hand-signed by artist
14                                  Hand-signed by artist
15             Hand-signed by artist, Right bottom corner
17               Hand-signed by artist, Signature on back
18              Hand-signed by artist, signed on the back
19      Hand-signed by artist, Signed lower right 'Fra...
20            

Again, convert this variable into a boolean where any signature is considered a True.

In [365]:
def rateSignature(entry):
    signature_types = ["hand-signed", "signed"]
    non_signature_types = ["not signed", "unsigned"]

    if type(entry) == float:
        return "unknown"
    elif any(([True if category in entry.lower() else False for category in signature_types])) and not any(([True if category in entry.lower() else False for category in non_signature_types])):
        return "signed"
    elif "stamped" in entry.lower() or "stamp" in entry.lower():
        return "stamped"
    elif "sticker" in entry.lower():
        return "sticker"
    elif "certificate" in entry.lower():
        return "certificate"
    elif any(([True if category in entry.lower() else False for category in non_signature_types])):
        return "unsigned"
    else:
        return "other mark"

def rateSigned(entry):
    if entry == "signed":
        return 1
    else:
        return 0

artsy_listings['artist_mark'] = artsy_listings['signed'].apply(rateSignature)
artsy_listings['artist_mark'] = artsy_listings['artist_mark'].astype("category")
artsy_listings['signed'] = artsy_listings['artist_mark'].apply(rateSigned)

print(artsy_listings['artist_mark'].dtype)
print(artsy_listings['signed'].dtype)

category
int64


In [366]:
print(artsy_listings['signed'].value_counts())

1    2058
0     292
Name: signed, dtype: int64


### 6. Authenticated

In [367]:
print(artsy_listings['authenticated'].value_counts())
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings['authenticated'])

Included    1654
Name: authenticated, dtype: int64
0       Included
1       Included
2       Included
3            NaN
5       Included
6       Included
7            NaN
8            NaN
9       Included
11           NaN
12           NaN
13      Included
14           NaN
15      Included
17      Included
18           NaN
19           NaN
20      Included
21      Included
22      Included
23      Included
24      Included
25      Included
26           NaN
27      Included
28      Included
29      Included
30           NaN
31      Included
32      Included
33      Included
34           NaN
35           NaN
36      Included
38      Included
39           NaN
40           NaN
41           NaN
42      Included
43      Included
45      Included
47      Included
48      Included
49      Included
50           NaN
51      Included
52           NaN
53      Included
54      Included
55      Included
56      Included
57      Included
58      Included
59           NaN
60           NaN
61           N

In [368]:
def rateAuthentication(entry):
    if type(entry) == str and entry.lower() == "included":
        return 1
    else:
        return 0

artsy_listings['authenticated'] = artsy_listings['authenticated'].apply(rateAuthentication)
print(artsy_listings['authenticated'].value_counts())

1    1654
0     696
Name: authenticated, dtype: int64


### 7. Framed

In [369]:
artsy_listings['framed'] = artsy_listings['framed'].apply(lambda x: 0 if x.lower() == "not included" else 1)

print(artsy_listings['framed'].value_counts())

0    1425
1     925
Name: framed, dtype: int64


## Exclusions

### 1. Artwork condition

Variable has too many NaNs for us to use it.

In [370]:
print(artsy_listings['condition'].value_counts())

Excellent                                                                                                      215
Excellent Condition - Like New                                                                                 149
New                                                                                                             78
Perfect                                                                                                         37
Excellent                                                                                                       26
                                                                                                              ... 
Accompanied by a certificate of authenticity signed by Davide Manfredi, Italian former agent of the artist.      1
Good condition in vintage frame to preserve provenance                                                           1
Good, tear in the paper (see photo) does not affect the drawing.                

### 2. Medium details

Most entries are too detailed for regression or classification uses.

In [371]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(artsy_listings['medium_details'])

0                                           Oil on canvas
1                                    Mixed media on paper
2                                      Acrylic on canvas.
3                                            Oil on Linen
5       Hand-painted spray/acrylic on hand-pulled, 100...
6                        Hopkins white. Glazed stoneware.
7                                            Oil on Panel
8                                       Graphite on paper
9                                             Oil on wood
11                                       Enamel on canvas
12                            Oil on cradled wooden panel
13                Acrylic, Latex, & Spray Paint on Canvas
14                                           Oil on Panel
15                                            Oil on wood
17                      Stoneware, glaze, mounted on wood
18            Mixed media monoprint and painting on paper
19                        Unique painting on ceramic tile
20            

### 3. Any artist-related variables

This includes artist name, artist birthdate, artist nationality.

There are more than 1000 unique artists, indicating that most of the artists in our dataset are one-offs. It is likely that this variable, while highly predictive, also skews the model significantly. 

In [372]:
print(artsy_listings['artist'].unique())
print(artsy_listings['artist'].value_counts())
print(artsy_listings['artist'].dtype)

['Yigal Ozeri' 'Gert & Uwe Tobias' 'Ramon Enrich' 'Hollis Dunlap' 'SEEN'
 'Galia Linn' 'Nancy Ellen Craig' 'Egmont Hartwig' 'KATSU' 'Mia Bergeron'
 'Amber Goldhammer' 'Thomas Alban' 'Deirdre Murphy' 'Helen Frankenthaler'
 'Balint Zsako' 'Daniel Hooper' 'Tove Worum' 'Lea Ezekielle'
 'Tat Shing Chu' 'George Condo' 'Preston Paperboy' 'Ilana Manolson'
 'Sophie Ratzsch' 'Robert Lebsack' 'Nobuyoshi Araki' 'Robert Bissell'
 'America Martin' 'Eugene Berman' 'Jason Martin' 'Piero Pizzi Cannella'
 'Charles Buckley' 'Becky Yazdan' 'Jon James' 'Diego Anaya'
 'Samuel John Lamorna Birch' 'Keflione' 'Richard Hambleton'
 'Jérôme Mesnager' 'Jean Wolff' 'Jim Dine' 'Zoe Keramea' 'Mark Kostabi'
 'Ansel Adams' 'Cali Almy' 'RISK' 'Barbara Strasen' 'Nicole Leidenfrost'
 'Peter Max' 'Sonia Babini' "Sherie' Franssen" 'Peter Mars' 'Andy Warhol'
 'Luca Bartoli' 'Carlos Oviedo' 'Ron English' 'Carolyn Todd' 'Álex Soler'
 'Eric Basstein' 'Dale Chihuly' 'Mary Weatherford' 'Sadamasa Motonaga'
 'Adam Miller' 'Silvia L

In [373]:
# print(artsy_listings['artist_birthdate'].unique())
# print(artsy_listings['artist_birthdate'].value_counts())
print(f" The number of missing data entries: {artsy_listings['artist_birthdate'].isna().sum()}")


 The number of missing data entries: 690


In [374]:
print(artsy_listings['artist_nationality'].unique())
print(artsy_listings['artist_nationality'].value_counts())
print(f" The number of missing data entries: {artsy_listings['artist_nationality'].isna().sum()}")

['Israeli' 'Romanian-German' 'Spanish' 'American' 'American, 1927–2015'
 'Netherlandish' 'Follow' 'American, 1928–2011' 'Hungarian' 'British'
 'Norwegian' 'French' 'Chinese' 'Canadian' 'Japanese' 'English'
 'Colombian-American' 'American, 1899–1972' 'Italian' 'Mexican'
 'Canadian, 1952–2017' 'American, 1902–1984' 'German'
 'American, 1928–1987' 'Dutch' 'Japanese, 1922–2011' 'Hungarian-American'
 'Icelandic' 'Ukraninan-American, 1917–2004' 'Lithuanian' 'Brazilian'
 'American-Bulgarian, 1935–2020' 'Argentine' 'American, 1923–1971'
 'Polish' 'British, 1769–1842' 'American, 1916–1991' nan 'Irish'
 'American, 1960–2017' 'French-Dutch, 1877–1968' 'American, 1933–1989'
 'Venezuelan, 1923–2019' 'South African' 'Romanian' 'British, 1841–1918'
 'Belgian' 'Peruvian, 1925–2017' 'Mexican-American' 'Argentinian'
 'American, 1923–2012' 'British, 1838–1907' 'Colombian' 'Swiss' 'Russian'
 'Australian' 'Portuguese' 'American, 1927–2020' '1900–1979'
 'American, 1943–2010' 'American, 1895–1981' 'French-Am

This variable may be better suited to a PCA style approach?

### 4. Any representation-related variables

This includes gallery name and gallery location(s).

In [375]:
print(artsy_listings['gallery'].unique())
print(artsy_listings['gallery'].value_counts())
print(f" The number of missing data entries: {artsy_listings['gallery'].isna().sum()}")

['Corridor Contemporary' 'Make offer' 'Artistics' 'Gallery 1261'
 'Signari Gallery' 'Los Angeles' 'Bakker Gallery' 'New York'
 'Los Angeles, Zürich' 'Barba Contemporary Art' 'Susanna Gold Gallery'
 'George Thornton Art' 'Art of Nature Contemporary' 'HG Contemporary'
 'MIAMI' 'Petion-Ville' 'San Francisco' 'JoAnne Artman Gallery'
 'Susan Eley Fine Art' 'New Haven' 'Red Fox Art Gallery' 'ARTI.NYC'
 'La Galerie Paris 1839' 'West Chelsea Contemporary' 'BAM Art Advisory'
 'Reem Gallery' 'Original Art Broker' 'Abend Gallery' 'QART.COM'
 'Franklin' 'Pawtucket' 'David Leonardis Gallery'
 'West Hollywood, Los Angeles' 'Belhaus' 'Pen Project' 'Mortal Machine'
 'Gallery Sitka' 'Modern Artifact'
 'Newport Beach, Los Angeles , Palm Springs' 'Aux Gallery' 'Philadelphia'
 'Edward Cella Art and Architecture' 'Dellupi Arte'
 'Philip Douglas Fine Art + Garvey Rita Fine Art' 'The Art Design Project'
 'Brooklyn, New York' 'Kubik Gallery' 'Certificate of authenticity'
 'Caviar20' 'Chicago' 'Gallery Art' 'L

In [376]:
print(artsy_listings['gallery_location'].unique())
print(artsy_listings['gallery_location'].value_counts())
print(f" The number of missing data entries: {artsy_listings['gallery_location'].isna().sum()}")

['Tel-Aviv, Philadelphia' 'Cassina Projects' 'Paris' 'Denver'
 'Certificate of authenticity' 'Provincetown' 'Desiderio Gallery'
 'Secure payment' 'Bryn Mawr' 'Bernard Jacobson Gallery'
 'Mimmo Scognamiglio / Placido' 'Nottingham' 'promoart21' 'Galerie Arnaud'
 'Hong Kong' 'Saint Helena, Napa Valley' 'Grob Gallery'
 'New York, Laguna Beach' 'Wallector' 'Buchmann Galerie'
 'New York, Hudson' 'Pound Ridge' 'Haynes Fine Art'
 'Austin , Lakeway, Austin' 'Galerie Brugier-Rigail' 'Larchmont'
 'Camberley' 'Hamburger Kunstgalerie' 'SmART Coast Gallery' 'Chicago'
 'Collezionando Gallery' 'Phoenix, Venice' 'Miami Beach' 'New Orleans'
 'SHIRLEY' 'StolenSpace Gallery' 'Repetto Gallery' 'PUXAGALLERY'
 'Ellia Art Gallery' 'Galerie Eva Vautier' 'West Hollywood' 'Milano'
 'Orleans' 'Galerija VARTAI' 'Porto'
 'This work includes a certificate of authenticity.' 'Toronto'
 'SAGE Paris' 'Aventura' 'THROWN'
 'Gallery Katarzyna Napiorkowska | Warsaw & Brussels'
 'New York, London, Dallas' 'Gibbons & Nicholas

## Remaining variables

### Target variable
price_GBP_log

### Predictor variables:
1. artwork_age_log (FLOAT)
2. three_dimensional (BOOL as 0/1)
3. size_combined_log (FLOAT, 2 decimal points)
5. unique (BOOL as 0/1)
6. signed (BOOL as 0/1)
7. authenticated (BOOL as 0/1)
8. framed (BOOL as 0/1) 

### Optional variables:
1. year (INT)
2. medium (CAT, 14 types)
3. artist_mark (CAT, 7 types)
4. area (FLOAT, 2 decimal points)
5. volume (FLOAT, 2 decimal points)

In [377]:
artsy_listings_cleaned = artsy_listings[['artwork_age_log', 'three_dimensional', 'size_combined_log', 'unique', 'signed', 'authenticated', 'framed', 'price_GBP_log']]

# check data types
print(artsy_listings_cleaned.info())
print(artsy_listings_cleaned.describe())

# check number of NaNs
print(artsy_listings_cleaned.isna().sum())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2350 entries, 0 to 2969
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   artwork_age_log    2350 non-null   float64
 1   three_dimensional  2350 non-null   int64  
 2   size_combined_log  2350 non-null   float64
 3   unique             2350 non-null   int64  
 4   signed             2350 non-null   int64  
 5   authenticated      2350 non-null   int64  
 6   framed             2350 non-null   int64  
 7   price_GBP_log      2350 non-null   float64
dtypes: float64(3), int64(5)
memory usage: 165.2 KB
None
       artwork_age_log  three_dimensional  size_combined_log       unique  \
count      2350.000000        2350.000000        2350.000000  2350.000000   
mean          1.910846           0.247234           8.494285     0.987660   
std           1.293640           0.431496           1.711313     0.110423   
min           0.000000           0.000000        

# Step 2: Model fitting

## Split dataset for modelling

In [378]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

target = artsy_listings_cleaned['price_GBP_log']
predictors = artsy_listings_cleaned.drop(columns=['price_GBP_log'])

print(f"Number of samples in dataset: {len(predictors)}")

# split original dataset into training and test datasets
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3)

# carve out a validation dataset from the test dataset
predictors_test, predictors_validation, target_test, target_validation = train_test_split(
    predictors_test, target_test, test_size=0.5
)

print("Number of samples in:")
print(f"    Training: {len(target_train)} ({round(len(target_train)/len(target) * 100, 2)}% of dataset)")
print(f"    Validation: {len(target_validation)} ({round(len(target_validation)/len(target) * 100, 2)}% of dataset)")
print(f"    Testing: {len(target_test)} ({round(len(target_test)/len(target) * 100, 2)}% of dataset)")

Number of samples in dataset: 2350
Number of samples in:
    Training: 1645 (70.0% of dataset)
    Validation: 353 (15.02% of dataset)
    Testing: 352 (14.98% of dataset)


## Fit to linear regression model

### Model 1: OLS with all predictors

In [379]:
from statsmodels.formula.api import ols
from scipy import stats

model1 = ols('price_GBP_log ~ artwork_age_log + three_dimensional + size_combined_log + unique + signed + authenticated + framed', data = artsy_listings_cleaned)
results1 = model1.fit()
mse1 = results1.mse_model
residuals1 = results1.resid

print(f"MSE: {mse1}")
print(results1.summary())

MSE: 271.62786295887867
                            OLS Regression Results                            
Dep. Variable:          price_GBP_log   R-squared:                       0.390
Model:                            OLS   Adj. R-squared:                  0.388
Method:                 Least Squares   F-statistic:                     213.6
Date:                Tue, 01 Jun 2021   Prob (F-statistic):          1.10e-245
Time:                        17:11:23   Log-Likelihood:                -3612.9
No. Observations:                2350   AIC:                             7242.
Df Residuals:                    2342   BIC:                             7288.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept     

In [380]:
fig = px.scatter(x=artsy_listings_cleaned.artwork_age_log, y=residuals1, title="Model residual plot of log artwork age vs. log price GBP")
fig.add_hline(y=0,line_color="red")
fig.show()

In [381]:
fig = px.scatter(x=artsy_listings_cleaned.size_combined_log, y=residuals1, title="Model residual plot of log size vs. log price GBP")
fig.add_hline(y=0, line_color="red")
fig.show()

### Model 2: OLS with only significant predictors

In [382]:
model2 = ols('price_GBP_log ~ artwork_age_log + three_dimensional + size_combined_log + unique + authenticated + framed', data = artsy_listings_cleaned)
results2 = model2.fit()
mse2 = results2.mse_model

print(f"MSE: {mse2}")
print(f"MSE difference from original model: {mse2 - mse1}")
print(results2.summary())

MSE: 316.4673623852587
MSE difference from original model: 44.83949942638003
                            OLS Regression Results                            
Dep. Variable:          price_GBP_log   R-squared:                       0.389
Model:                            OLS   Adj. R-squared:                  0.388
Method:                 Least Squares   F-statistic:                     248.7
Date:                Tue, 01 Jun 2021   Prob (F-statistic):          1.85e-246
Time:                        17:11:23   Log-Likelihood:                -3614.0
No. Observations:                2350   AIC:                             7242.
Df Residuals:                    2343   BIC:                             7282.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------

In [383]:
px.scatter(artsy_listings_cleaned, "size_combined_log", "price_GBP_log", title = "Artwork price by artwork age")

In [384]:
px.scatter(artsy_listings_cleaned, "artwork_age_log", "price_GBP_log", title = "Artwork price by artwork age")

### Default sklearn

In [388]:
from sklearn import metrics, decomposition
from sklearn.linear_model import LinearRegression
# from sklearn.pipeline import Pipeline

np.random.seed(100)

simple_model = LinearRegression()
simple_model.fit(predictors_train, target_train)

target_train_pred = simple_model.predict(predictors_train)
train_loss = metrics.mean_squared_error(target_train, target_train_pred)

target_validation_pred = simple_model.predict(predictors_validation)
validation_loss = metrics.mean_squared_error(target_validation, target_validation_pred)

# print(regression.coef_)
print(f"LinearRegression model (sklearn default)")
print(f"     Parameter values: {simple_model.get_params()}")
print(f"     Train loss: {train_loss}")
print(f"     Train R^2: {simple_model.score(predictors_train, target_train)}")
print(f"     Validation loss: {validation_loss}")
print(f"     Validation R^2: {simple_model.score(predictors_validation, target_validation)}")

LinearRegression model (sklearn default)
     Parameter values: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False, 'positive': False}
     Train loss: 1.2736420755611073
     Train R^2: 0.4020689057075785
     Validation loss: 1.2488573564869425
     Validation R^2: 0.32934130036794873


### Default sklearn with k-fold cross validation

In [386]:
def k_fold(dataset, n_splits: int = 5):
    chunks = np.array_split(dataset, n_splits)
    for i in range(n_splits):
        training = chunks[:i] + chunks[i + 1 :]
        validation = chunks[i]
        yield np.concatenate(training), validation

# define K-Fold parameters
loss = 0
n_splits = 5

# K-Fold evaluation
for (predictors_train, predictors_validation), (target_train, target_validation) in zip(k_fold(predictors, n_splits), k_fold(target, n_splits)):

    simple_model2 = LinearRegression()
    simple_model2.fit(predictors_train, target_train)

    target_validation_pred = simple_model2.predict(predictors_validation)
    fold_loss = metrics.mean_squared_error(target_validation, target_validation_pred)
    loss += fold_loss

print(f"K-Fold estimated loss: {loss / n_splits}")
print(f"K-Fold validation R^2: {simple_model2.score(predictors_validation, target_validation)}")

K-Fold estimated loss: 1.2873465360495149
K-Fold validation R^2: 0.32934130036794873


# Step 3: Model evaluation

In [387]:
# make predictions on the test set
target_test_pred = simple_model.predict(predictors_test)

from sklearn import metrics

target_test_loss = metrics.mean_squared_error(target_test, target_test_pred)
target_test_max_error = metrics.max_error(target_test, target_test_pred)
target_test_MAPE = metrics.mean_absolute_percentage_error(target_test, target_test_pred)
target_test_R2 = metrics.r2_score(target_test, target_test_pred, multioutput='variance_weighted')


# final metrics
print(f"MSE: {target_test_loss}")
print(f"Max error: {target_test_max_error}")
print(f"MAPE: {target_test_MAPE}")
print(f"R squared: {target_test_R2}")

# final model parameters
predictor_coefficients = pd.DataFrame(simple_model.coef_, columns=['coefficient'], index=list(predictors.columns))
display(predictor_coefficients)
print(f"Final model bias (intercept): {simple_model.intercept_}")

MSE: 1.412043262325649
Max error: 4.550259205797561
MAPE: 0.11297242198195159
R squared: 0.3573472300695658


Unnamed: 0,coefficient
artwork_age_log,0.394424
three_dimensional,-0.483186
size_combined_log,0.467579
unique,0.685874
signed,0.160348
authenticated,-0.133108
framed,0.35932


Final model bias (intercept): 2.680218784117187
