![](https://image.slidesharecdn.com/artimitateslifeargumenttoturnin-140310190421-phpapp01/95/art-imitates-life-3-638.jpg?cb=1394478444)

# Introduction 

" I don't think about art when I'm working. I try to think about life.
I don’t listen to what art critics say. I don’t know anybody who needs a critic to find out what art is." - Jean-Michel Basquiat

* Art is often considered the process or product of deliberately arranging elements in a way that appeals to the senses or emotions. It encompasses a diverse range of human activities, creations and ways of expression, including music, literature, film, sculpture and painting. The meaning of art is explored in a branch of philosophy known as aesthetics.
* The visual arts are art forms such as painting, drawing, printmaking, sculpture, ceramics, photography, video, filmmaking, design, crafts, and architecture. Many artistic disciplines, such as performing arts, conceptual art and textile arts, also involve aspects of the visual arts as well as arts of other types.

# Problem Statement

* With a collection of artworks by 50 of the most influential artists of all time, the aim is to create a **convolutional neural network** that recognises the artist from the colors used and the geometric patterns inside the pictures.
* This could help detect forgeries in the art world, potentially with greater accuracy than even trained art critics.

# Metric of Success 
* Accuracy Score
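
Accuracy is the fraction of predictions that match the true labels. A minimal sketch, assuming integer-encoded artist labels; the function name `accuracy_score_simple` and the sample lists are illustrative and not part of the notebook:

```python
def accuracy_score_simple(y_true, y_pred):
    """Return the fraction of predictions equal to the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for five paintings (0 and 1 stand for two artists)
example_true = [0, 1, 1, 0, 1]
example_pred = [0, 1, 0, 0, 1]
print(accuracy_score_simple(example_true, example_pred))  # 4 of 5 correct
```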

# Understanding the Context

This dataset contains three files:

* artists.csv: dataset of information for each artist
* images.zip: collection of images (full size), divided in folders and sequentially numbered
* resized.zip: same collection but images have been resized and extracted from folder structure


# Experimental Design

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide data mining efforts. This is the methodology that will be used to carry out this classification experiment. The steps are as follows:
* Business understanding - assessing the situation (fact finding)
* Data understanding - acquiring the data and understanding its strengths and weaknesses
* Data preparation - cleaning the data and performing feature engineering
* Data modelling - identifying the modelling technique
* Evaluation - gauging the extent to which the model meets the set business objectives
* Deployment - summarizing the deployment approach, including the necessary steps that are taken and how they were performed

# Importing the libraries and datasets

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import warnings
warnings.filterwarnings('ignore')      
# Used to ignore the warnings displayed by python

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.



In [None]:
# loading the dataset
# loading the artists csv just to preview information on the artists

artsy=pd.read_csv("../input/best-artworks-of-all-time/artists.csv")

In [None]:
# a random preview of our dataset
artsy.take(np.random.permutation(len(artsy))[:15])

* Here's a short description of each column:

| Features | Description |
|---|---|
| id | integer identifier of each artist |
| name | the artist's name |
| years | the artist's years of birth and death |
| genre | the artist's style of art |
| nationality | the artist's country of origin |
| bio | details about the artist |
| wikipedia | a link to the artist's Wikipedia page |
| paintings | number of paintings the artist has |


* Some columns, namely id, bio and wikipedia, are irrelevant and we will therefore drop them later on.

In [None]:
# shape of the dataset 
print('Our dataset has', artsy.shape[0], 'rows and', artsy.shape[1], 'columns')

In [None]:
# confirming the datatypes
artsy.dtypes

* Most of the columns are objects except id and paintings, whose datatypes are integers

In [None]:
# statistical summary of the datasets
artsy.describe().transpose()

In [None]:
# checking for duplicates
artsy.duplicated().sum()

In [None]:
# check for null values

artsy.isnull().sum()

* It seems like our dataset is free of duplicates and null values

In [None]:
# getting information on the dataset
artsy.info()

In [None]:
artsy.head()

* From here we can see the name of each artist in the dataset, together with their nationality, the genre of their paintings, their bio and their Wikipedia link.
* Because some well-known artists are missing from the dataset, I will append them to the data and add a folder of their images.

In [None]:
# getting the unique features
artsy.nunique()

In [None]:
# dropping the irrelevant columns
artsy.drop(columns=['id','bio','wikipedia'],inplace =True)

* Before performing any feature engineering, we drop the irrelevant columns so that we can then add columns that give us better information on the dataset.

# Loading Images

In [None]:
# Image manipulation.
import PIL.Image
from IPython.display import display
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
import cv2

def plotImages(artist, directory):
    """Show up to 25 images matched by the glob pattern `directory` in a 5x5 grid."""
    print(artist)
    multipleImages = glob(directory)
    plt.rcParams['figure.figsize'] = (15, 15)
    plt.subplots_adjust(wspace=0, hspace=0)
    for i_, l in enumerate(multipleImages[:25]):
        im = cv2.imread(l)
        if im is None:  # cv2.imread returns None when a file cannot be read as an image
            continue
        im = cv2.resize(im, (128, 128))
        plt.subplot(5, 5, i_ + 1)
        # OpenCV loads images as BGR, so convert to RGB before plotting
        plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB)); plt.axis('off')

In [None]:
print(os.listdir("/kaggle/input/best-artworks-of-all-time/images/images"))

In [None]:
# Read images
import os
from skimage import io
from PIL import Image

def upload_art_train_images(image_path, best_artwork, height, width):
    """Load, resize and label the images found in each artist's directory."""
    images = []
    labels = []
    # Loop across the artist directories
    for category in best_artwork:
        # Append the artist directory to the main path
        full_image_path = image_path + category + "/"
        # Retrieve the file names from the artist directory
        image_file_names = [os.path.join(full_image_path, f) for f in os.listdir(full_image_path)]
        for file in image_file_names:
            # Read the image pixels
            image = io.imread(file)
            # Convert to RGB so that grayscale and RGBA images also get three channels
            image_from_array = Image.fromarray(image).convert('RGB')
            # Resize image (PIL expects the size as (width, height))
            size_image = image_from_array.resize((width, height))
            # Append image into list
            images.append(np.array(size_image))
            # Label each image with its directory name
            labels.append(category)
    return images, labels

## Invoke the function
# Image resize parameters
height = 30
width = 30
# The two artists used for this first experiment
best_artwork = ['Claude_Monet', 'Alfred_Sisley']
# Number of classes
num_classes = len(best_artwork)
train_images, train_labels = upload_art_train_images('/kaggle/input/best-artworks-of-all-time/images/images/', best_artwork, height, width)
from tensorflow.keras.utils import to_categorical
# to_categorical expects integer class ids, so map each artist name to its index
y_train = np.array([best_artwork.index(label) for label in train_labels])
y_train = to_categorical(y_train, num_classes)


In [None]:
plotImages("Jean-Michel Basquiat","/kaggle/input/new-images/basquiat/**")

In [None]:
plotImages("Keith Haring","/kaggle/input/new-images/haring/**")

* We can see that the images we added on our own are displaying

In [None]:
plotImages("Vincent van Gogh","/kaggle/input/best-artworks-of-all-time/images/images/Vincent_van_Gogh/**")

# Feature Engineering 

In [None]:
artsy.columns

* Because some artists we wanted to include were missing, we create a new dataframe with the two new artists and then concatenate the two dataframes into a new dataframe called 'art'.

In [None]:
# We want to obtain the age of the artists, so I will split the years column into birth and death columns
# I will then drop the years column
# n is passed by keyword because newer pandas versions no longer accept it positionally
artsy_year = pd.DataFrame(artsy.years.str.split(' ', n=2).tolist(),columns = ['birth','-','death'])
artsy_year.drop(["-"],axis=1,inplace=True)
artsy["birth"]=artsy_year.birth
artsy["death"]=artsy_year.death
artsy.drop(["years"],axis=1,inplace=True)

In [None]:
artsy["birth"]=artsy["birth"].apply(lambda x: int(x))
artsy["death"]=artsy["death"].apply(lambda x: int(x))

In [None]:
artsy2 = pd.DataFrame({'name': ['Jean-Michel Basquiat','Keith Haring'],
                            'birth': ['1960','1958'],
                            'death':['1988','1990'],
                            'genre': ['Neo-expressionism', 'Pop Art'],
                            'nationality': ['American', 'American'],
                            'paintings':[600, 79]})
frames= (artsy,artsy2)
art=pd.concat(frames,ignore_index=True)



In [None]:
art.birth=art.birth.astype('int')
art.death=art.death.astype('int')

In [None]:
art["age"]=art.death-art.birth

In [None]:
# specifying bins for when we visualize the distribution
# creating a new column that shows each artist's age group

bins=[27,55,65,77,98]
labels=["young adult","early adult","adult","senior"]
art['age_group']=pd.cut(art['age'],bins,labels=labels)
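
`pd.cut` treats each bin as right-closed by default, so with the edges above an age of 55 falls in (27, 55] and an age of exactly 27 falls in no bin. A small sketch on made-up ages shows the difference that `include_lowest=True` would make; the `sample_ages` values are illustrative and not taken from the dataset:

```python
import pandas as pd

sample_ages = pd.Series([27, 28, 55, 56, 77, 98])  # hypothetical ages
bins = [27, 55, 65, 77, 98]
labels = ["young adult", "early adult", "adult", "senior"]

# Default behaviour: intervals are (27, 55], (55, 65], (65, 77], (77, 98]
default_groups = pd.cut(sample_ages, bins, labels=labels)

# include_lowest=True also places the lowest edge (27) in the first bin
inclusive_groups = pd.cut(sample_ages, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": sample_ages,
                    "default": default_groups,
                    "include_lowest": inclusive_groups}))
```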

In [None]:
# creating a century column from the year of death
art['century'] = (art['death'] // 100) + 1
# a random preview of the combined dataset
art.take(np.random.permutation(len(art))[:52])
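
The integer division above maps a year to a century number. A short sketch of the arithmetic follows; note that under the strict convention a year divisible by 100 (for example 1900) closes the earlier century, which the variant `century_strict` accounts for. Both helper names are hypothetical and not part of the notebook:

```python
def century_simple(year):
    """Century as computed in the notebook: (year // 100) + 1."""
    return (year // 100) + 1

def century_strict(year):
    """Century where years ending in 00 belong to the century they close."""
    return ((year - 1) // 100) + 1

# The two versions only differ for years that are exact multiples of 100
for year in (1890, 1900, 1988):
    print(year, century_simple(year), century_strict(year))
```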

In [None]:
# Dropping more irrelevant columns 
art.drop(columns=['birth','death'], inplace= True)

# Exploratory Data Analysis

## Univariate Analysis
* Non-graphical Analysis

Here we will carry out the following computations:
1. Measures of central tendency: mean, mode and median for numerical data, and mode for categorical data
2. Measures of dispersion

## Non-graphical analysis

### Measures of Central Tendency

In [None]:
# Calculating the mean of the numeric features
numeric = ['age', 'century', 'paintings']
for col in numeric:
  print(art[[col]].mean())

In [None]:
# Determining the mode of each of the numeric features

for col in numeric:
  print(art[[col]].mode())

In [None]:
# Identifying the median 

for col in numeric:
  print(art[[col]].median())

### Measures of Dispersion

In [None]:
# The InterQuartile Range (IQR)
# IQR is also called the midspread or middle 50%

# Calculating IQR for the numeric features



for i in numeric:

  Q1 = art[i].quantile(0.25)
  Q3 = art[i].quantile(0.75)
  IQR = Q3 - Q1
  print(i, ':', IQR)

## Graphical Analysis

In [None]:
# finding outliers
columns=['age','paintings','century']
fig, ax = plt.subplots(len(columns), figsize=(8,40))
for i, values in enumerate(columns):

    sns.boxplot(y=art[values], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(values), fontsize=8)
    ax[i].set_xlabel(values, fontsize=8)
plt.show()

* The outliers flagged in the box plots above do not necessarily need to be removed. In the paintings column they simply show that only 4 artists produced 400 or more paintings.
* Therefore, there is no need for us to eliminate them.
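
Box plots conventionally flag points that lie more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule to a small made-up series to show how such points can be listed; `iqr_outliers` and the sample values are illustrative assumptions, not the notebook's data:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical painting counts; one artist is far more prolific than the rest
sample_paintings = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 877])
print(iqr_outliers(sample_paintings))
```

Listing the flagged rows this way makes it easier to judge whether they are genuine values, as here, or data errors.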

In [None]:
# Distribution Plots
# plots to check for the distribution of the numeric features of our data

fig, axes = plt.subplots(nrows = 3, ncols = 1, figsize = (20, 25))

for ax, name in zip(axes.flatten(), numeric):
  # histplot with kde=True replaces the deprecated distplot
  sns.histplot(art[name], kde = True, ax = ax, bins = 20, color = 'crimson')
plt.suptitle('Distribution Plots for Numeric Features', fontsize = 16)
plt.tight_layout()

* The age column appears roughly normally distributed, since its histogram is approximately symmetrical.
* The century column is negatively skewed (skewed to the left).
* The paintings column is positively skewed (skewed to the right).
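
These visual impressions can be cross-checked with the sample skewness, which is near zero for a symmetric distribution, negative for a left-skewed one and positive for a right-skewed one. A minimal sketch on made-up data using pandas' `Series.skew`; the three sample series are illustrative and are not the notebook's columns:

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])       # long tail to the right
left_skewed = pd.Series([-20, -3, -2, -2, -1, -1])  # long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right_skewed", right_skewed),
                   ("left_skewed", left_skewed)]:
    print(name, round(data.skew(), 3))
```

On the notebook's data the same check would be `art[numeric].skew()`.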

In [None]:
# a plot showing the most popular genre
plt.style.use('fivethirtyeight')
art['genre'].value_counts().plot.bar()

* From the plot above we can see that Impressionism, Post-Impressionism, Baroque and Northern Renaissance are the most popular art forms.

In [None]:
# a plot showing the most popular genre grouped by the century
# crosstab counts artists per (century, genre) pair; each bar is one century
pd.crosstab(art['century'], art['genre']).plot.bar(stacked=True)

In [None]:
# visualization showing the Age Group count per Art
art['age_group'].value_counts().plot.bar(rot =0)
plt.xlabel("age_group",fontsize=15)
plt.ylabel("Count",fontsize=15)
plt.title("Age Group count per Artist",fontsize=15)
plt.show()


In [None]:
plt.figure(figsize=(5,5))
art_genre = sns.countplot(y='genre',data=art)
art_genre



In [None]:

plt.figure(figsize=(5,5))
art_nationality = sns.countplot(y='nationality',data=art)
art_nationality



In [None]:
# visualization showing the number of artists per century
art['century'].value_counts().plot.bar(rot =0)
plt.xlabel("century",fontsize=15)
plt.ylabel("Count",fontsize=15)
plt.title("Century count per Artist",fontsize=15)
plt.show()



* From the nationality count plot above we see that the artists belong to 17 different nationalities. Most of the artists are French.

In [None]:
# matplotlib histogram
plt.hist(art['age'], color = 'blue', edgecolor = 'black',
         bins = int(180/5))





In [None]:
# seaborn histogram (histplot replaces the deprecated distplot)
sns.histplot(art['age'], kde=False,
             bins=int(180/5), color = 'blue',
             edgecolor='black')
# Add labels
plt.title('Histogram of Age of Artists')
plt.xlabel('Age (years)')
plt.ylabel('No. of Artists')

# Modeling
## Baseline Model

* **Model used** - Convolutional Neural Network (CNN)
* **TensorFlow** - modelling with Keras (high-level API to TensorFlow)

* Validation set - 20%
* RGB values are encoded as 8 bit
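
The two settings above can be sketched as follows: 8-bit RGB values lie between 0 and 255, so dividing by 255 rescales them to [0, 1], and a shuffled index split holds out 20% of the images for validation. This is an illustrative sketch on random arrays; `split_train_validation`, the array shapes and the seeds are assumptions, not the notebook's code:

```python
import numpy as np

def split_train_validation(images, labels, val_fraction=0.2, seed=0):
    """Shuffle and split arrays into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_val = int(round(len(images) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return images[train_idx], labels[train_idx], images[val_idx], labels[val_idx]

# Hypothetical batch: 10 RGB images of 30x30 pixels stored as 8-bit integers
fake_images = np.random.default_rng(1).integers(0, 256, size=(10, 30, 30, 3), dtype=np.uint8)
fake_labels = np.arange(10) % 2

scaled = fake_images.astype("float32") / 255.0  # values now lie in [0, 1]
x_train, y_train, x_val, y_val = split_train_validation(scaled, fake_labels)
print(x_train.shape, x_val.shape)
```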


In [None]:
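## TensorFlow and keras
import tensorflow as tf
from tensorflow import keras
# import from tensorflow.keras so that all layers come from the same Keras implementation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Flatten, Dropout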