# Visualization of The Carnegie Museum of Art's Trends
by [David Roberts](https://www.kaggle.com/davidroberts13) Aug 2020

This notebook aims to garner a new perspective on the ever-turbulent world of art. By examining how often the Carnegie Museum of Art acquires pieces in various categories, we might find out what was popular, and when, according to the Museum's preferences. I hope you enjoy it!!

![](https://media-api.xogrp.com/images/8355276b-1913-4025-8a61-70e88b6882f2)
##### A quick caveat: looking solely through the lens of acquisitions could be problematic, because the Museum doesn't necessarily have a say in what art is donated. Nevertheless, if people are donating it, then it must be in 'vogue'.

### Table of Contents
* [1. Introduction](#chapter1)
    * [1.1 Load and Check Data](#section_1_1)
    * [1.2 Exploration](#section_1_2)
* [2. Feature Engineering](#chapter2)
    * [2.1 Material Breakdown](#section_2_1)
    * [2.2 Date & Time Extraction](#section_2_2)
* [3. Visualizations](#chapter3)
    * [3.1 Temporal Visualizations](#section_3_1)
    * [3.2 Word Clouds](#section_3_2)
* [4. Conclusion](#chapter4)

# 1. Introduction <a class="anchor" id="chapter1"></a>
In celebration of their 120th anniversary, the Carnegie Museum of Art is making public the collections records of all of its accessioned artworks. This release contains data on approximately 28269 objects across all departments of the museum; fine arts, decorative arts, photography, contemporary art, and the Heinz Architectural Center.

This repository contains the files containing all of the records, as well as a description of the data, the data structure, and some guidelines on using the data. Please take a minute to familiarize yourself with the structure and guidelines below.


##### *Side note: you will see '[gelatin silver print](https://en.wikipedia.org/wiki/Gelatin_silver_process#:~:text=The%20gelatin%20silver%20process%20is,%2C%20or%20resin%2Dcoated%20paper.)' listed as a medium. It is an early form of photography.*

## 1.1 Load and Check Data <a class="anchor" id="section_1_1"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import matplotlib as plt  # NOTE: the top-level package, not pyplot; later cells call plt.style and plt.pyplot
import matplotlib.pyplot as pyplot  # pyplot interface, used for the word clouds
import seaborn as sns  # statistical plotting and figure styling
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
from scipy.stats import boxcox
from wordcloud import WordCloud, STOPWORDS

import plotly
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import init_notebook_mode, iplot
import cufflinks  # adds .iplot() to pandas objects

cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', 100)


In [None]:
df = pd.read_csv('/kaggle/input/carnegie-museum-of-art/cmoa.csv', delimiter=',')
df.head(1)
#looks good!

In [None]:
null_perc = df.isnull().sum()/len(df)*100
null_perc.sort_values(ascending = False).head(10)

These numbers are the percentage of null values in each column. Five columns are more than 30% null: 'image_url', 'death_date', 'birth_place', 'death_place', and 'role'. They are hard to impute because they are not numeric, or because the share of missing values is too high for us to find a trend and fill in the mean, so we will ignore them in later analysis and visualizations. This bit of code is from fellow Kaggler [Subhanjan Das](https://www.kaggle.com/subhanjandas).
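
To make the calculation concrete, here is a minimal sketch on a toy frame. The column names mirror the real ones, but the values and the 30% cut-off variable are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the museum data; the values are invented.
toy = pd.DataFrame({
    "title": ["Untitled", "Harbor", "Portrait", "Vase"],
    "death_date": ["1950-01-01", np.nan, np.nan, np.nan],
    "role": [np.nan, np.nan, "artist", "artist"],
})

# isnull() marks missing cells, mean() turns the boolean mask into a share,
# and * 100 expresses that share as a percentage.
null_perc = toy.isnull().mean() * 100

# Columns above the (assumed) 30% cut-off are candidates to leave out.
cutoff = 30
sparse_cols = null_perc[null_perc > cutoff].index.tolist()

print(null_perc.sort_values(ascending=False))
print(sparse_cols)
```

Using `mean()` on the boolean mask is equivalent to `sum() / len(df)` in the cell above; it just skips the explicit division.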

## 1.2 Exploration <a class="anchor" id="section_1_2"></a>
*Now let us take a look around and find some patterns that could allow us to perform some feature engineering.*

For now this is done by slicing the first 10 rows, looking around, and brainstorming ideas for the exploration.

In [None]:
df.head(10)

Initial thoughts on the possibilities:
* Split the 'medium' column to isolate its elements
* Pull the year out of the various date columns into its own column
* See trends of 'credit_line' over time
* Explore 'department' over time
* Word clouds of nationality and mediums
* See if there are common title words that point to popular themes

###### Number of unique credit lines below 

###### (I've been having an issue with x-axis labels getting chopped off on smaller monitors, sorry!)

In [None]:
print(df['credit_line'].nunique())  # Number of unique credit lines (contribution avenues); far too many to graph
# My hunch is that a small percentage of the organizations do the vast majority of the work
plt.style.use('seaborn-darkgrid')  # Visual style
sns.set(rc={'figure.figsize': (15, 8.27)})  # Set figure size
top_credit_lines = df['credit_line'].value_counts()[:10].to_frame()  # Ten most frequent credit lines
top_credit_lines.iplot(kind='barh', xTitle='Count', yTitle='Credit Line', title='Popular Credit Lines', color='orange')
plt.pyplot.show()

# 2. Feature Engineering <a class="anchor" id="chapter2"></a>


## 2.1 Material Breakdown <a class="anchor" id="section_2_1"></a>


In [None]:
df.head(3)

In [None]:
Medium=df['medium'].str.split(' on ',n=1,expand=True) #Splitting 'medium' on the whole word ' on ' so words like 'bronze' stay intact
df['material']=Medium[1] #Assigning everything after the keyword to a new column 'material'
df['art component']=Medium[0] #Assigning everything before the keyword to a new column 'art component'
df.head(1)

This will be an interesting look at how different mediums rise and fall in popularity. There are limits to how effective it is, though: not every string is set up as an 'on' statement, which leaves about half of the 'material' column as NA.
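
The limitation is easy to see on a toy example. This sketch assumes the delimiter is the whole word `' on '` with surrounding spaces; the sample strings are invented.

```python
import pandas as pd

# A few invented medium strings, some with an ' on ' clause and some without.
mediums = pd.Series([
    "oil on canvas",
    "woodblock print on paper",
    "bronze",
    "gelatin silver print",
])

# Split once on the whole word; rows without ' on ' get a missing material.
parts = mediums.str.split(" on ", n=1, expand=True)
parts.columns = ["art component", "material"]

# Share of rows where no material could be extracted.
na_share = parts["material"].isna().mean()

# A bare 'on' would also match inside words, which is why the spaces matter.
bare_split = "bronze".split("on", 1)

print(parts)
print(na_share)
print(bare_split)
```

In this toy data half of the rows have no material, which matches the kind of gap described above.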

## 2.2 Date & Time Extraction <a class="anchor" id="section_2_2"></a>


In [None]:
df.head(3)

The first relevant date column we want to break down is 'date_acquired'; this will help us explore art trends.
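
Before touching the real column, here is a minimal sketch of the idea on invented dates. The use of `errors="coerce"` is an assumption added for illustration; it turns unparseable entries into `NaT` instead of raising.

```python
import pandas as pd

# Invented raw values, including a missing entry and a malformed one.
raw = pd.Series(["1989-05-12", "2016-05-03", None, "not a date"])

# Parse to datetimes; anything that cannot be parsed becomes NaT.
dates = pd.to_datetime(raw, errors="coerce")

# Two ways to pull out the year: numeric (float, NaN for missing dates)
# or as a string label (NaN for missing dates).
years = dates.dt.year
year_labels = dates.dt.strftime("%Y")

print(dates)
print(years)
print(year_labels)
```

The numeric form sorts and plots as a number, while the string form behaves like a category label; either works as a groupby key.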

In [None]:
df['date_acquired']=pd.to_datetime(df['date_acquired']) #Converting 'date_acquired' into datetimes we can work with
df['year_acquired']=df['date_acquired'].dt.strftime('%Y') #Extracting the year (as a string) from 'date_acquired'

df.head(1)

# 3. Visualizations <a class="anchor" id="chapter3"></a>

## 3.1 Temporal Visualizations <a class="anchor" id="section_3_1"></a> 

###### Number of unique mediums below

In [None]:
print(df['medium'].nunique())#Number of unique mediums in the Carnegie Museums collection. Far too many to graph
plt.style.use('seaborn-darkgrid')#Visual Style
sns.set(rc={'figure.figsize':(11.7,8.27)})
ax=df['medium'].value_counts()[:10].iplot(kind='barh',
                                          xTitle='Pieces of art',
                                          yTitle='Medium',
                                        title='Overall Top 10 Mediums')
plt.pyplot.show()

As you can see, there are over 5,000 different mediums, so it was important to narrow them down to the most popular. We will now do the same for the classification of each piece.

###### Number of unique classifications below

In [None]:
print(df['classification'].nunique()) #Returns number of unique classifications of art
plt.style.use('seaborn-darkgrid')#Visual Style
sns.set(rc={'figure.figsize':(11.7,8.27)})#Set Figure Size
ax=df['classification'].value_counts()[:10].iplot(kind='barh',
                                          xTitle='Pieces of art',
                                          yTitle='Classifications',
                                        title='Overall Top 10 Artistic Classifications')
plt.pyplot.show()

Not quite the 5000 different mediums but 41 is still too many classifications to visualize; 10 is much better. 
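
Narrowing a column down to its most frequent values can also be done without typing the labels by hand. The sketch below uses invented data and a hypothetical `top_n` variable; it is one way to do it, not necessarily how the cells in this notebook were built.

```python
import pandas as pd

# Invented classifications standing in for the real column.
toy = pd.DataFrame({
    "classification": ["prints", "prints", "prints",
                       "photographs", "photographs",
                       "Glass", "Wood"],
})

top_n = 2  # hypothetical cut-off; the notebook uses 10

# value_counts() tallies each label, nlargest() keeps the top_n of them,
# and .index gives back just the label names.
top_labels = toy["classification"].value_counts().nlargest(top_n).index

# Keep only the rows whose label is in the top list.
top_df = toy[toy["classification"].isin(top_labels)]

print(list(top_labels))
print(len(top_df))
```

Deriving the list from the data keeps the filter in sync if the dataset is updated or the spelling of a label changes.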

In [None]:
#Let's make a new DF that is home to our Top 10 overall 
#Mediums of art by acquisition number
df1=df[df['medium'].isin(['gelatin silver print',
                          'woodblock print on paper',
                          'lithograph on paper',
                          'etching','oil on canvas',
                          'engraving','woodcut on paper',
                          'porcelain','ink on linen',
                          'pencil on tracing paper'])]
df1.tail(1)

In [None]:
plt.style.use('seaborn-darkgrid')#Visual Style
sns.set_palette("tab20",5)#Color Scheme
yearly_counts = df1.groupby(['year_acquired','art component']).count()['department'].unstack() #Pieces per year and art component
yearly_counts.iplot(xTitle='Year',
                    yTitle='Number of Pieces',
                    title='Popularity of Art Acquisition: Key Component of Art')
plt.pyplot.show()

There seems to be a massive acquisition of woodblock prints all at once in 1989, which makes it hard to get a clear look at the other trends. If you are not familiar with Plotly, you can click 'woodblock print' in the legend to turn it off, and the graph will rescale automatically. You can also click and drag over a portion of the graph to zoom in, then double-click to return to the full view.




In [None]:
plt.style.use('seaborn-darkgrid') #Visual Style
sns.set_palette("tab20",5) #Color Scheme
daily_counts = df1.groupby(['date_acquired','art component']).count()['department'].unstack() #Pieces per acquisition date and art component
daily_counts.iplot(xTitle='Date Acquired',
                   yTitle='Number of Pieces',
                   title='Popularity of Artistic Elements<br>Through Art Acquisition')
plt.pyplot.show()

Normally it would not be necessary to post this more temporally detailed figure, but for some reason woodblock only shows up for me in this version. To get a closer look, just turn 'woodblock print' off. If you know why, could you drop a comment?
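
One possible explanation, offered as an assumption and not checked against the real data: if a series has a value in only one year, the pivoted table holds a single number surrounded by missing values, and a line chart has no second point to connect it to. The toy sketch below shows that shape and one way to fill the gaps with zeros.

```python
import pandas as pd

# Invented acquisitions: woodblock prints arrive in a single year only.
toy = pd.DataFrame({
    "year_acquired": ["1988", "1989", "1989", "1989", "1990"],
    "art component": ["oil", "woodblock print", "woodblock print", "oil", "oil"],
})

grouped = toy.groupby(["year_acquired", "art component"]).size()

# Plain unstack leaves NaN where a component has no pieces in a year.
sparse = grouped.unstack()

# fill_value=0 records "no acquisitions" explicitly, so every series
# has a value for every year.
filled = grouped.unstack(fill_value=0)

print(sparse)
print(filled)
```

If this is what is happening, plotting the zero-filled table, or drawing markers as well as lines, would be worth trying.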

In [None]:
plt.style.use('seaborn-darkgrid')#Visual Style
sns.set_palette("tab20",5)#Color Scheme
department_counts = df.groupby(['date_acquired','department']).count()['classification'].unstack() #Pieces per acquisition date and department
department_counts.iplot(xTitle='Date Acquired',
                        yTitle='Number of Pieces',
                        title='Popularity of Art Acquisition: Carnegie Art Departments')
plt.pyplot.show()

In [None]:
#Let's make a new DF that is home to our Top 10 overall 
#classifications of art by acquisition number
df3=df[df['classification'].isin(['print',
                                  'drawings and watercolors',
                                  'photographs',
                                  'Ceramics',
                                  'paintings',
                                  'Metals',
                                  'containers',
                                  'sculpture',
                                  'Glass',
                                  'Wood'])]
df3.tail(1)


In [None]:
plt.style.use('seaborn-darkgrid')#Visual Style
sns.set_palette("tab20",5)#Color Scheme
classification_counts = df3.groupby(['date_acquired','classification']).count()['department'].unstack() #Pieces per acquisition date and classification
classification_counts.iplot(xTitle='Date Acquired',
                            yTitle='Number of Pieces',
                            title='Popularity of Art Acquisition: Art Classification')
plt.pyplot.show()

Above are the top 10 classifications of art over the years. It's interesting that we see large booms in 'drawings and watercolors' in the early '90s, as well as a massive uptick in October 1978. We also see 'photographs' boom in 1982 and then hold a rather high position through today. I attribute this to photography being a growing field of art that really only matured in the early '80s. Acquisition counts are a hard way to show true popularity, though, because of the difference in the cost of these pieces. Take sculptures, for example: their most 'popular' month was May 2016, when 17 sculptures were acquired, but these could have been great pieces that outweigh the huge volume of photographs that came into the Museum's possession.

## 3.2 Word Clouds (first attempt) <a class="anchor" id="section_3_2"></a> 
*Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance.*
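
The frequency idea behind a word cloud can be sketched with the standard library alone. The titles and the tiny stop-word set below are invented for illustration; the `wordcloud` package ships its own, much longer `STOPWORDS` set.

```python
from collections import Counter
import re

# Invented titles, including a missing one.
titles = ["View of the Harbor", "Harbor at Dusk", None, "Portrait of a Woman"]

# Join every real title into one lowercase string.
text = " ".join(t for t in titles if isinstance(t, str)).lower()

# Break the text into words and drop a few common filler words.
words = re.findall(r"[a-z']+", text)
stopwords = {"of", "the", "at", "a"}  # assumed minimal list
freq = Counter(w for w in words if w not in stopwords)

print(freq.most_common(3))
```

A word cloud draws each word at a size proportional to a count like this; `WordCloud.generate_from_frequencies` accepts such a mapping directly.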


In [None]:
df.tail(1)

## A Word Cloud of the Titles 
This is what I believe to be the most interesting word cloud I was able to create at my current skill level. The idea was to suss out themes in popular art. We see terms like 'Attack' and 'Saint' as very popular subjects of art, and location words like 'Embassy'. If you use this to plan your next painting and want to make it as successful as possible, I might suggest the title be
#### *'The Attack on Saint Keith, during the Reconstruction of the West Embassy'*


In [None]:
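# Join every non-null title into one string; str() on a large NumPy array
# abbreviates it to its first and last few entries, hiding most titles
text = ' '.join(df['title'].dropna().astype(str))
wordcloud = WordCloud().generate(text)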

pyplot.figure(figsize = (8, 8), facecolor = 'white') 
pyplot.imshow(wordcloud)
pyplot.axis("off")
pyplot.tight_layout(pad = 0) 
pyplot.show()

## A Word Cloud of the Medium

Like the cloud above, this shows that the most popular options for an artist are oil on canvas or gelatin silver print which, as I mentioned earlier, is just a fancy name for photography.

In [None]:
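# Join every non-null medium into one string; str() on a large NumPy array
# abbreviates it to its first and last few entries, hiding most mediums
text = ' '.join(df['medium'].dropna().astype(str))
wordcloud = WordCloud().generate(text)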

pyplot.figure(figsize = (8, 8), facecolor = 'white') 
pyplot.imshow(wordcloud)
pyplot.axis("off")
pyplot.tight_layout(pad = 0) 
pyplot.show()


# 4. Conclusion <a class="anchor" id="chapter4"></a>

After creating this notebook, I know that it is barely a surface-level evaluation of this dataset, and I would love to see what all of you are able to do with it! The part I found most interesting was the title word cloud, which hints at which themes are prevalent in the Carnegie Museum of Art's acquisition strategy. I also enjoyed seeing the popularity of art classifications change over time. Seeing the boom of photography in the early '80s was an interesting way to visualize the maturing and acceptance of a relatively young art form.

## Thank you
I just want to say thank you, and I appreciate you if you made it this far through my first Kaggle dataset. I look forward to many more. Again, I'm a certified newbie trying to learn as fast as I can, so if you see an area for improvement, please reach out in the comments!!
-Have a good one