## Color Chronology of Van Gogh's Palettes
### CSC630: Rudd Fawcett


Below, please find specifications about the development of my text project. The website [is live here](http://csc630.rudd.io/projects/van-goghs-palettes/).

This project presents the five most dominant colors from 913 of Vincent van Gogh’s oil and watercolor paintings. The goals of this project were twofold:

1. Pull dominant colors from almost one thousand Van Gogh paintings.
2. Produce a playful, dead-simple visualization.

Images and metadata were sourced from a Wikipedia list of Van Gogh's works, which is available under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Data was scraped and processed using Python, with help from [this blog post on finding dominant image colors](https://zeevgilovitz.com/detecting-dominant-colours-in-python). The algorithm uses K-Means clustering to extract the *n* most dominant colors of an image (I ran it with *n* = 5).
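
To make the clustering idea concrete, here is a minimal pure-Python sketch of K-Means on RGB pixels. It is not the `Kmeans` class from the blog post: the function names `nearest` and `dominant_colors`, the fixed seed, and the iteration cap are all assumptions made for illustration. It also assumes the input contains at least `k` distinct colors.

```python
import random


def nearest(pixel, centers):
    """Index of the center closest to pixel (squared Euclidean distance in RGB)."""
    distances = [sum((p - c) ** 2 for p, c in zip(pixel, center)) for center in centers]
    return distances.index(min(distances))


def dominant_colors(pixels, k=5, iterations=20, seed=0):
    """Cluster RGB tuples with K-Means and return k colors, most common first."""
    rng = random.Random(seed)
    # Assumption: there are at least k distinct colors to seed the centers with.
    centers = rng.sample(sorted(set(pixels)), k)

    for _ in range(iterations):
        # Assignment step: attach every pixel to its nearest center.
        clusters = [[] for _ in centers]
        for pixel in pixels:
            clusters[nearest(pixel, centers)].append(pixel)

        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        updated = [
            tuple(sum(channel) / len(cluster) for channel in zip(*cluster)) if cluster else center
            for center, cluster in zip(centers, clusters)
        ]
        if updated == centers:
            break  # converged: no center moved
        centers = updated

    # Rank the final centers by how many pixels they attract.
    counts = [0] * k
    for pixel in pixels:
        counts[nearest(pixel, centers)] += 1
    ranked = sorted(zip(counts, centers), key=lambda pair: pair[0], reverse=True)
    return [tuple(round(value) for value in center) for _, center in ranked]


# Toy "image": six red pixels, three green, one blue.
pixels = [(250, 10, 10)] * 6 + [(10, 250, 10)] * 3 + [(10, 10, 250)]
print(dominant_colors(pixels, k=3))  # [(250, 10, 10), (10, 250, 10), (10, 10, 250)]
```

On a real painting, `pixels` would be the list of RGB tuples read from the image, and a library implementation would be far faster than this loop-based sketch.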

### Preparing Data for Use

To prepare the data, I first had to download all of the images from Wikipedia and extract their metadata from the list page. For the parsing I chose BeautifulSoup, a Python library for pulling data out of HTML.

### Scraping Data

In [None]:
from bs4 import BeautifulSoup
import pandas as pd

# Parse the saved copy of the Wikipedia list; the file is closed once parsed.
with open('../data/works.html') as page:
    soup = BeautifulSoup(page, 'lxml')
entries = []

table = soup.find('table')

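for row in table.find_all('tr'):
    cells = row.find_all('td')

    # Only rows with all five columns describe a painting.
    if len(cells) == 5:
        no = int(cells[0].text)

        # Skip rows without a thumbnail; otherwise `image` would be undefined
        # or silently carry over from the previous row.
        img_tag = cells[1].find('img')
        if img_tag is None:
            continue
        image = img_tag.get('src').replace('./images/', '')

        title = cells[2].text.strip()
        year = cells[3].text.strip()
        location = cells[4].text.strip()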

        entry = {
            'number': no,
            'image': image,
            'title': title,
            'year': year,
            'location': location
        }

        entries.append(entry)

df = pd.DataFrame(entries)
df.to_csv('../data/van-gogh-works.csv', index=False, columns=['number', 'title', 'image', 'year', 'location'])

After I had a rough cut of all of my data in a primitive CSV file, I could work on cleaning it up.

### Cleaning CSV and Calculating Dominant Colors

In [None]:
from PIL import Image
from kmeans import Kmeans
import csv
from base64 import b16encode
import pandas as pd

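entries = []
k = Kmeans(k=5)

with open('../data/van-gogh-works.csv', newline='') as works:
    for work in csv.DictReader(works):
        src = work['image']

        try:
            # Open inside the try so an unreadable image skips the row
            # instead of stopping the whole run.
            img = Image.open('../data/images/{}'.format(src))
            rgbs = k.run(img)

            # Convert each cluster center to an uppercase hex string.
            for idx, rgb in enumerate(rgbs, start=1):
                channels = bytes(int(value) for value in rgb)
                work['color_{}'.format(idx)] = '#' + b16encode(channels).decode('utf-8')

            entries.append(work)
        except Exception as err:
            # Report the skipped work rather than failing silently.
            print('Skipping {}: {}'.format(src, err))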

all_works = pd.DataFrame(entries)
cols = ['number', 'title', 'image', 'year', 'location', 'color_1', 'color_2', 'color_3', 'color_4', 'color_5']
all_works.to_csv('../data/van-gogh-works-colors.csv', index=False, columns=cols)

As mentioned in the introduction above, this part would not have been possible without the help of the Kmeans color clustering library that is linked in the blog post I referenced.
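
The other step in the cell above, turning a cluster center into a hex color string, can be tried on its own. The sketch below isolates it in a helper named `rgb_to_hex`, which is a hypothetical name and not part of the referenced library. It assumes each channel value falls between 0 and 255 and, like the cell above, truncates floats rather than rounding them.

```python
from base64 import b16encode


def rgb_to_hex(rgb):
    """Convert an (r, g, b) sequence of numbers into a string like '#A1B2C3'."""
    # int() truncates float cluster centers; bytes() rejects values outside 0..255.
    channels = bytes(int(value) for value in rgb)
    # b16encode returns uppercase hexadecimal as bytes, so decode it to text.
    return '#' + b16encode(channels).decode('utf-8')


print(rgb_to_hex((161.7, 178.2, 195.0)))  # #A1B2C3
```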

## Final Thoughts
This was one of the more exciting projects I have worked on, and I think that is because I was generating my own data set, one that, as far as I know, no one had put together before.

The presentation isn't the absolute best it could be, but it is what it is.

## Citations and Attributions

- https://zeevgilovitz.com/detecting-dominant-colours-in-python
