## Drill: Playing with layers

Now it's your turn. Using the space below, experiment with different hidden layer structures. You can work on a subset of the data to reduce runtime. See how the results vary and which settings seem to matter most. Feel free to adjust other parameters as well. It may also help to do some real feature selection work.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [None]:
artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')

In [None]:
artworks.columns

In [None]:
artworks.shape

In [None]:
artworks.head()

We'll also do a bit of data processing and cleaning, selecting columns of interest and converting URLs to booleans indicating whether they are present.

In [None]:
# Select columns. The .copy() makes the assignments below act on an
# independent frame, avoiding pandas' SettingWithCopyWarning.
artworks = artworks[['Artist', 'Nationality', 'Gender', 'Date', 'Department',
                    'DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']].copy()

# Convert URLs to booleans.
artworks['URL'] = artworks['URL'].notnull()
artworks['ThumbnailURL'] = artworks['ThumbnailURL'].notnull()

# Drop films and some other tricky rows.
artworks = artworks[artworks['Department']!='Film']
artworks = artworks[artworks['Department']!='Media and Performance Art']
artworks = artworks[artworks['Department']!='Fluxus Collection']

# Drop missing data.
artworks = artworks.dropna()

In [None]:
# Choose small set of data
#artworks = artworks.iloc[:10000,:]

In [None]:
artworks.shape

## Building a Model

Now, let's see if multi-layer perceptron ("MLP") modeling can classify the department a piece belongs to, using everything but the department name.

Before we import the MLP from scikit-learn and establish the model, we first have to ensure the data is correctly typed and do some other cleaning.

In [None]:
# Get data types.
artworks.dtypes

Some more miscellaneous cleaning:

In [None]:
artworks['DateAcquired'] = pd.to_datetime(artworks.DateAcquired)
artworks['YearAcquired'] = artworks.DateAcquired.dt.year
artworks['YearAcquired'].dtype

In [None]:
# Collapse rows with multiple genders, nationalities, or artists into one category each.
# Raw strings keep the regex escapes intact.
artworks.loc[artworks['Gender'].str.contains(r'\) \('), 'Gender'] = '(multiple_persons)'
artworks.loc[artworks['Nationality'].str.contains(r'\) \('), 'Nationality'] = '(multiple_nationalities)'
artworks.loc[artworks['Artist'].str.contains(','), 'Artist'] = 'Multiple_Artists'

# Reduce dates to the first four-digit year, cutting down the number of distinct values.
# Rows without a four-digit year become NaN and get all-zero date dummies below.
artworks['Date'] = artworks['Date'].str.extract(r'([0-9]{4})', expand=False)

# Drop the target and the columns that are encoded separately below.
X = artworks.drop(columns=['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'])

# Create dummies separately.
artists = pd.get_dummies(artworks.Artist)
nationalities = pd.get_dummies(artworks.Nationality)
dates = pd.get_dummies(artworks.Date)

# Concat with other variables, but artists slows this wayyyyy down so we'll keep it out for now
X = pd.get_dummies(X, sparse=True)
X = pd.concat([X, nationalities, dates], axis=1)

Y = artworks.Department
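
One way to bring `Artist` back in without creating a dummy column per artist is to keep only the most frequent artists and lump everyone else into a single bucket. The sketch below illustrates the idea on a toy series; the names, the `top_n` cutoff, and the `"Other"` label are all assumptions made for the example, not values taken from the MoMA data.

```python
import pandas as pd

# Hypothetical toy data standing in for artworks.Artist.
artists = pd.Series(
    ["Picasso", "Picasso", "Picasso", "Matisse", "Matisse", "Klee", "Miro"],
    name="Artist",
)

# Keep the top_n most frequent artists; lump the rest into one bucket.
top_n = 2  # assumed cutoff, tune to taste
top_artists = artists.value_counts().nlargest(top_n).index
artists_lumped = artists.where(artists.isin(top_artists), "Other")

# One dummy column per retained artist, plus one for the "Other" bucket.
artist_dummies = pd.get_dummies(artists_lumped, prefix="Artist")
print(artist_dummies.columns.tolist())
```

On the real data, the resulting frame could be concatenated into `X` the same way `nationalities` and `dates` are, at a cost of `top_n + 1` extra columns.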

In [None]:
# Alright! We've done our prep, let's build the model.
# Neural networks are hugely computationally intensive.
# This may take several minutes to run.

# Import the model.
from sklearn.neural_network import MLPClassifier

# Establish and fit the model, with a single hidden layer of 100 units.
mlp = MLPClassifier(hidden_layer_sizes=(100,))
mlp.fit(X, Y)

In [None]:
mlp.score(X, Y)

In [None]:
Y.value_counts()/len(Y)
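
The class proportions above give the accuracy of always predicting the largest department, which is the number any model score should be compared against. A minimal sketch of that baseline, using scikit-learn's `DummyClassifier` on made-up labels (the label counts here are invented for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels standing in for Y (department names).
y_demo = np.array(["Prints"] * 6 + ["Photography"] * 3 + ["Drawings"] * 1)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(baseline_accuracy)
```

The baseline accuracy equals the share of the largest class, so on the real data it should match the top entry of `Y.value_counts()/len(Y)`.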

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)
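
MLPs are sensitive to feature scaling, and columns such as height, width, and year live on very different scales from the 0/1 dummies. A hedged sketch of one common remedy: wrap a `StandardScaler` and the classifier in a pipeline so the scaler is refit on each training fold. The data here is synthetic (`make_classification`), and the inflated first column is an assumption added to mimic an unscaled measurement.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and Y: 3 classes, 10 numeric features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
# Inflate one feature to mimic a column measured in much larger units.
X_demo[:, 0] *= 1000

# The scaler is fit inside each CV fold, so no test-fold statistics leak in.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0),
)
scaled_scores = cross_val_score(scaled_mlp, X_demo, y_demo, cv=5)
print(scaled_scores.mean())
```

On the real feature matrix it may be preferable to scale only the numeric columns and pass the dummies through unchanged; that variation is not shown here.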

In [None]:
# Your code here. Experiment with hidden layers to build your own model.

# Establish and fit the model with no hidden layers (inputs feed the output layer directly).
mlp = MLPClassifier(hidden_layer_sizes=())
mlp.fit(X, Y)

In [None]:
mlp.score(X, Y)

In [None]:
# 5-fold Cross Validation
scores = cross_val_score(mlp, X, Y, cv=5)
scores

In [None]:
# Train on a 20% subset to keep runtime manageable; the remaining 80% is held out.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.2, random_state=42)
print(len(X))
print(len(X_train))

In [None]:
# Establish and fit the model, with default settings and training set.
mlp = MLPClassifier()
mlp.fit(X_train, Y_train)

In [None]:
mlp.score(X_train, Y_train)

In [None]:
# 5-fold Cross Validation
cross_val_score(mlp, X_train, Y_train, cv=5)

In [None]:
# Logistic (sigmoid) activation: base model.

# Establish and fit the model, with logistic activation and the training set.
mlp_sig = MLPClassifier(activation='logistic')
mlp_sig.fit(X_train, Y_train)

In [None]:
mlp_sig.score(X_train, Y_train)

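This is an improvement over the previous model's score. Note that both scores are measured on the training data, so the gain may not carry over to unseen data.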

In [None]:
# Add Multiple Layers to the base
mlp_sig2 = MLPClassifier(activation='logistic', hidden_layer_sizes=(100, 50, 25))
mlp_sig2.fit(X_train, Y_train)

In [None]:
mlp_sig2.score(X_train, Y_train)

In [None]:
cross_val_score(mlp_sig2, X_train, Y_train, cv=5)

In [None]:
# Decrease alpha (the L2 regularization strength) from its default of 1e-4.
mlp_sig3 = MLPClassifier(activation='logistic', alpha=1e-6)
mlp_sig3.fit(X_train, Y_train)

In [None]:
mlp_sig3.score(X_train, Y_train)

In [None]:
cross_val_score(mlp_sig3, X_train, Y_train, cv=5)

In [None]:
# Combine the two changes above (multiple layers, lower alpha) and add more neurons per layer.
mlp_sig4 = MLPClassifier(activation='logistic', alpha=1e-6, 
                         hidden_layer_sizes=(1000, 1000))
mlp_sig4.fit(X_train, Y_train)

In [None]:
mlp_sig4.score(X_train, Y_train)

In [None]:
cross_val_score(mlp_sig4, X_train, Y_train, cv=5)
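
The scores above are all computed on the training split, and `X_test` is never used. A sketch of how the architectures could be compared on held-out data instead, reporting training and test accuracy side by side. It runs on synthetic data so it is self-contained; the dataset, the layer sizes, and the `max_iter` setting are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for X and Y: 3 classes, 20 numeric features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Hidden layer structures to compare (hypothetical choices).
architectures = [(10,), (100,), (100, 50, 25)]

results = {}
for layers in architectures:
    model = MLPClassifier(hidden_layer_sizes=layers, max_iter=500, random_state=42)
    model.fit(X_tr, y_tr)
    # Store (training accuracy, held-out accuracy) for each architecture.
    results[layers] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for layers, (train_acc, test_acc) in results.items():
    print(f"{str(layers):>15}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between the two columns for a given architecture suggests overfitting; with the notebook's own split, `X_test` and `Y_test` would take the place of `X_te` and `y_te`.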