## Final Project Submission

Please fill out:
* Student name: Steven Rosa
* Student pace: part time
* Project review date/time: Monday 1 April 2019 11am ET
* Instructor name: Jeff Herman
* Blog post URL:


"Database for The Scratched Voices Begging to be Heard: The Graffiti of Pompeii and Today"

by Alexa Rose

https://core.tdar.org/dataset/445837/database-for-the-scratched-voices-begging-to-be-heard-the-graffiti-of-pompeii-and-today

<a id = 'top'></a>

# Contents
- Libraries and helper functions
- [A first look at the data](#obtain)
- [Cleaning the raw data](#scrub)
- [Exploratory data analysis](#explore)
- Modeling
  - [Model \#1](#model1)
  - [Model \#2](#model2)
  - [Model \#3](#model3)
- [Conclusions](#concl)

# Libraries and helper functions

In [1]:
import pandas as pd #For working with DataFrames
import matplotlib.pyplot as plt #For visualizing plots
import numpy as np #For mathematical operations

In [2]:
import time

In [3]:
#Function to draw in-line histograms
def inline_hists(xs, data, bins = 50):
    #squeeze=False keeps axs two-dimensional, so a single column also works
    fig, axs = plt.subplots(1, len(xs), sharey=False, figsize=(5 * len(xs), 4), squeeze=False)
    for i, x in enumerate(xs):
        data[x].hist(ax=axs[0, i], label=x, xlabelsize=5, bins=bins)
        axs[0, i].legend()
    plt.show()

<a id = 'obtain'></a>

[(Back to top)](#top)

# A first look at the data

In [4]:
#Obtain the raw data
df_raw = pd.read_csv('graffiti.csv')

In [None]:
df_raw.head(20)

In [None]:
df_raw.info()

Columns to drop:

- 'found?'
- 'org. ' (?)
- 'comments' (?)

Change 'Literacy (1-3)' to integer before categorizing. Rename the column to 'Literacy'.

Rename 'Image ' as 'Image'.

Categorical variables to transform: Reggio, Insula, Literacy, Context type specific, Context type general, Famous House (?), Socio-economic status


Null values to fill: Reggio, Insula, Entrance, Context type specific, Context type general, Famous House

Target: 'Category'
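
The notes above can be cross-checked with a quick per-column summary. The following is a minimal sketch on a tiny made-up frame (the values are illustrative and not taken from `graffiti.csv`); `summarize_columns` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def summarize_columns(df):
    """Return dtype, null count, and number of distinct values per column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'nulls': df.isna().sum(),
        'unique': df.nunique(dropna=True),
    })

# Illustrative stand-in for df_raw (made-up values)
demo = pd.DataFrame({
    'Reggio': ['1', '6_7', np.nan],
    'Category': ['Social', 'Insult', 'Social'],
})
summary = summarize_columns(demo)
print(summary)
```

Calling `summarize_columns(df_raw)` would give the same overview for the real data.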

<a id = 'scrub'></a>

[(Back to top)](#top)

# Cleaning the data

## Column by column

### 'Reggio'

In [None]:
df_raw['Reggio'].value_counts()

In [None]:
print(df_raw['Reggio'].isna().sum())

In [5]:
df_raw['Reggio'] = df_raw['Reggio'].fillna(0)

In [None]:
#What's with the one 6_7 value?
df_raw[df_raw['Reggio'] == '6_7']

In [6]:
#Change 6_7 to 6
df_raw.at[994, 'Reggio'] = '6'
#Change data type to integer
df_raw['Reggio'] = df_raw['Reggio'].astype(int)

Zero values can be filled later once more is known about the reggios.
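
The `6_7` fix above depends on the row label 994. A label-independent alternative might look like the sketch below, assuming that every underscore value should keep its first number (the choice made above for `6_7`) and that all other values are whole numbers. The sample values are made up.

```python
import pandas as pd

# Made-up values standing in for df_raw['Reggio'] after fillna(0)
reggio = pd.Series(['1', '6_7', 0, '9'])

# Cast to str, keep the text before the first underscore, then cast to int
reggio_clean = reggio.astype(str).str.split('_').str[0].astype(int)
print(reggio_clean.tolist())
```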

### 'Insula'

In [None]:
df_raw['Insula'].value_counts()

In [None]:
df_raw['Insula'].value_counts().sum()

In [None]:
df_raw['Insula'].isna().sum()

In [7]:
#Fill null values
df_raw['Insula'] = df_raw['Insula'].fillna(0)

#Replace the values with underscores
df_raw.at[985, 'Insula'] = '4'
df_raw.at[986, 'Insula'] = '4'
df_raw.at[983, 'Insula'] = '4'
df_raw.at[984, 'Insula'] = '4'
df_raw.at[988, 'Insula'] = '9'
df_raw.at[987, 'Insula'] = '8'
df_raw.at[982, 'Insula'] = '12'
df_raw.at[981, 'Insula'] = '1'

#Change data type to integer
df_raw['Insula'] = df_raw['Insula'].astype(int)
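
The row labels used above were presumably found by inspecting the value counts. As a rough sketch, the values that block the integer conversion can also be listed directly. The sample values here are made up, since the original underscore values are not shown in this notebook.

```python
import pandas as pd

# Made-up values standing in for df_raw['Insula'] after fillna(0)
insula = pd.Series(['4', '4_5', 0, '12', '8_9'])

# Anything that is not made up purely of digits needs manual attention
is_plain_number = insula.astype(str).str.isdigit()
needs_fixing = insula[~is_plain_number]
print(needs_fixing)
```

The index of `needs_fixing` gives the row labels to correct by hand.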

### 'Entrance'

In [None]:
df_raw['Entrance'].value_counts()

In [8]:
#Fill null values
df_raw['Entrance'] = df_raw['Entrance'].fillna('unknown')

#Clean up the values that contain underscores or hyphens
#Dict mapping each such value to its replacement
entrance_replacements = dict()
entrance_values = df_raw['Entrance'].value_counts()

#Iterate over the Entrance values to look for underscores and hyphens
#Make a dict with values to replace the _/- values in the dataframe
#I'm choosing to take the first numerical value from each pair
for index in entrance_values.index:
    if '_' in index:
        index_split = index.split('_')
        entrance_replacements[index] = index_split[0]
    elif '-' in index:
        index_split = index.split('-')
        entrance_replacements[index] = index_split[0]

df_raw['Entrance'] = df_raw['Entrance'].replace(entrance_replacements)
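
The same idea (keep the first value of each pair) can be sketched without the loop by using a regular expression. This is an alternative, not the approach used above, and it differs slightly for a value that contains both a hyphen and an underscore. The sample values are made up.

```python
import pandas as pd

# Made-up values standing in for df_raw['Entrance'] after fillna('unknown')
entrance = pd.Series(['7', '3_4', '10-11', 'unknown', 'f'])

# Drop everything from the first underscore or hyphen onward
entrance_clean = entrance.str.replace(r'[_-].*', '', regex=True)
print(entrance_clean.tolist())
```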

In [None]:
#Why are some of the entrances calendar dates?
df_raw[df_raw['Entrance'] == '?']

In [9]:
#Change 'F' to 'f'
df_raw.at[661, 'Entrance'] = df_raw.at[661, 'Entrance'].lower()
#Change '4/5/' to '4'
df_raw.at[6, 'Entrance'] = '4'
#Replace 'I' and '?'
df_raw['Entrance'] = df_raw['Entrance'].replace({'I': 'i', '?': 'unknown'})

This is better, but 'Entrance' may still have to be categorized.

### 'found?'

In [None]:
df_raw['found?'].isna().sum()

In [10]:
#Can just be dropped
df_raw.drop(['found?'], axis = 1, inplace = True)

### 'In English'

In [None]:
df_raw['In English'].isna().sum()

In [11]:
#Fill nulls so that they can be read
df_raw['In English'] = df_raw['In English'].fillna('')

#Rows with null values or unhelpful 'CHECK' values
blank_indexes = df_raw.index[df_raw['In English'] == '']
check_indexes = df_raw.index[df_raw['In English'] == '[CHECK]']

#Drop the empty rows. They aren't useful if they don't have the English text of the graffiti.
df_raw.drop(blank_indexes, inplace = True)
df_raw.drop(check_indexes, inplace = True)
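
For comparison, the two drops can be expressed as one boolean mask. A minimal sketch on a made-up frame; `demo` and `unusable` are illustrative names only.

```python
import pandas as pd

demo = pd.DataFrame({'In English': ['Bye, Aper', None, '[CHECK]', 'Greetings']})

# Treat missing text and the '[CHECK]' placeholder the same way
text = demo['In English'].fillna('')
unusable = text.isin(['', '[CHECK]'])
demo_clean = demo[~unusable]
print(demo_clean)
```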

### 'org. '

In [None]:
df_raw['org. '].value_counts()[:10]

In [12]:
#Won't be useful here. Can be dropped.
df_raw.drop(['org. '], axis = 1, inplace = True)

### 'Literacy'

In [None]:
df_raw['Literacy (1-3)'].value_counts()

In [None]:
df_raw['Literacy (1-3)'].isna().sum()

In [13]:
#Rename column
df_raw = df_raw.rename(columns = {'Literacy (1-3)': 'Literacy'})

#Turn the few 1 values into 2s.
df_raw['Literacy'] = df_raw['Literacy'].replace({1.0: 2})
df_raw['Literacy'] = df_raw['Literacy'].fillna(2)

#Turn floats into integers
df_raw['Literacy'] = df_raw['Literacy'].astype(int)

### 'In org. language'

In [None]:
df_raw['In org. language'].value_counts().sum()

Not sure what to do with this at this point.

### 'Context type specific'

In [None]:
df_raw['Context type specific'].value_counts()

In [None]:
df_raw['Context type specific'].isna().sum()

In [14]:
replacements = {'façade': 'facade',
                'tablinium': 'tablinum',
                'Workshop': 'workshop'
                }

df_raw['Context type specific'] = df_raw['Context type specific'].replace(replacements)

### 'Context type general'

In [None]:
df_raw['Context type general'].value_counts()

In [None]:
df_raw['Context type general'].isna().sum()

In [15]:
#Rows with no specific context, no general context, no reggio, and no insula
no_spec_no_gen = df_raw[df_raw['Context type specific'].isna() & df_raw['Context type general'].isna()]

#Must drop the rows that don't have a reggio, insula, specific context or general context
to_drop = no_spec_no_gen[(no_spec_no_gen['Reggio'] == 0) & (no_spec_no_gen['Insula'] == 0)]
print(to_drop.shape)
df_raw.drop(to_drop.index, axis = 0, inplace = True)

In [16]:
#Maybe famous house can fill in for general context where it's missing?
famoushouse_nogen = df_raw[
    (df_raw['Famous House'].notna())
    & 
    (df_raw['Context type general'].isna())]

#Get indexes of all rows without a gen context but with a famous house
indexes = famoushouse_nogen.index

In [28]:
#Delete after test
df_raw.at[indexes[0], 'Famous House']

'Basilica'

In [32]:
#Delete after test
#Test each spelling separately; `list in str` raises a TypeError
any(name in df_raw.at[indexes[0], 'Famous House'] for name in ['Baths', 'baths'])

In [None]:
famoushouse_nogen.shape

In [33]:
famoushouse_nogen['Famous House'].value_counts()

Basilica                          72
Building of Eumachia              11
Praedia of Julia Felix            11
House of Gaius Julius Polybius     5
Villa of the Mysteries             4
Outside Porta Marina               2
house of the prince of naples      1
House of the silver wedding        1
northwest corner of block          1
House of the Dioscuri              1
House of the Mosaic Columns        1
House of the Ceii                  1
Workshop of Verecundus             1
near the Porta Vesuvio             1
Name: Famous House, dtype: int64

In [45]:
famoushouse_gencontexts = {
    'Praedia ': 'building',
    'Basilica': 'basilica',
    'House of': 'house',
    'house of': 'house',
    'Villa of': 'house',
    'Building': 'building',
    'near the Porta Vesuvio': 'necropolis',
    'Workshop': 'workshop'
}

#Replace gen context with the building type from its famous house
#Iterate over all the rows which have a famous house but lack a gen context
for index in indexes:
    #Iterate over the keys of famous houses
    for key, val in famoushouse_gencontexts.items():
        #If the row's famous house matches one from the dict
        if key in df_raw.at[index, 'Famous House']:
            #Fill missing gen context value with value from dict
            df_raw.at[index, 'Context type general'] = val
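
One way to see which famous-house names the mapping misses is sketched below. It uses only a subset of the keys above, matches with `startswith` rather than the substring test used in the loop, and stops at the first match. `building_type_from_name` is a hypothetical helper; the three names come from the value counts above.

```python
import pandas as pd

# Prefix -> building type; a subset of the mapping above
prefix_to_type = {
    'Basilica': 'basilica',
    'House of': 'house',
    'house of': 'house',
    'Villa of': 'house',
}

def building_type_from_name(name):
    """Return the type for the first matching prefix, or None if none match."""
    for prefix, building_type in prefix_to_type.items():
        if name.startswith(prefix):
            return building_type
    return None

names = pd.Series(['Basilica', 'House of the Ceii', 'Outside Porta Marina'])
types = names.map(building_type_from_name)

# Names with no matching prefix still need a manual decision
unmatched = names[types.isna()]
print(unmatched.tolist())
```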

In [42]:
#249 rows before change
#136 rows left without gen
df_raw[df_raw['Context type general'].isna()]

|  | CIL IV # | Reggio | Insula | Entrance | In English | Literacy | In org. language | Context type specific | Context type general | Famous House | Category | Written by | Work | Meter | Repetition | Foreign language | Image | Flohr Score | Socio-economic status | comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8426 | 2 | 1 | unknown | By the holy gods of the house, I ask you to… | 3 | (per) lares sanctos rogo te vt | altar |  |  | Religious |  |  |  |  |  |  |  | low |  |
| 40 | 640 | 7 | 1 | 39 | Bye, Aper | 2 |  | outer wall |  |  | Social |  |  |  |  |  |  |  | high |  |
| 153 | 2960 | 9 | 1 | unknown | I ask you, fall ill! | 2 |  |  |  |  | Reference |  | Virgil |  | yes |  |  |  | high |  |
| 154 | 3889 | 1 | 2 | 6 | All Fell silent/ all/ and atent (ively) | 3 | Conticvere Omnes Omn(es) Intentiq(..) s | atrium |  |  | Reference | Virgil | Aeneid 2,1 |  |  |  |  | 15.31 | low |  |
| 155 | 3888 | 1 | 2 | 6 | On November 19th I attended the meeting | 2 | XII K Dec in conventv veni | atrium |  |  | Civic |  |  |  |  |  |  | 15.31 | low |  |
| 156 | 3928 | 1 | 2 | 19 | Best wishes to serena from her friends | 2 | Serenae sodales sal | latrine |  |  | Social |  |  |  |  |  |  | 22.2 | medium |  |
| 157 | 3926 | 1 | 2 | 19 | Diadum in us here and everywhere | 2 | Diadvmus hic et vbique | latrine |  |  | Political/Social |  |  |  |  |  |  | 22.2 | medium |  |
| 158 | 3925 | 1 | 2 | 19 | Saturnius, don't lick cunt | 2 | Satvrnine cvnvm linge re nol(i) | latrine |  |  | Insult/Sexual |  |  |  |  |  |  | 22.2 | medium |  |
| 159 | 3891 | 1 | 2 | 6 | Bye Actius Anicetus!/ Bye Horus | 2 | Acti anicete va hore va | peristyle |  |  | Social |  |  |  |  |  |  | 15.31 | low |  |
| 160 | 3948 | 1 | 2 | 24 | May such lies cost you dearly innkeeper! You s... | 2 | … | peristyle |  |  | Insult |  |  | metrical |  |  |  | 18 | low |  |


Still not perfect: some rows remain without a general context. See below.

In [46]:
df_raw[
    (df_raw['Famous House'].notna())
    & 
    (df_raw['Context type general'].notna())]

|  | CIL IV # | Reggio | Insula | Entrance | In English | Literacy | In org. language | Context type specific | Context type general | Famous House | Category | Written by | Work | Meter | Repetition | Foreign language | Image | Flohr Score | Socio-economic status | comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14 | 1589-1590 | 5 | 1 | 7 | Aphrodite, mistress Euche, mistress | 2 |  | atrium | house | House of the bull | Social |  |  |  |  |  |  |  |  | Aphrodite references name, not godess |
| 15 | 1592 | 5 | 1 | 7 | Genialis, Euche | 2 |  | atrium | house | House of the bull | Social |  |  |  |  |  |  |  |  |  |
| 16 | 4036 | 5 | 1 | 18 | All fell silent ... if one shakes, she shakes... | 3 |  | entrance | house | House of the epigrams | Reference | Virgil | Aeneid 2, 1 | Quote + comments |  |  |  | 28.65 | medium |  |
| 20 | 4066 | 5 | 1 | 18 | Daphnicus ... here with Felicla | 3 |  | facade | house | House of the epigrams | Sexual |  |  |  |  |  |  | 28.65 | medium |  |
| 22 | 4091 | 5 | 1 | 26 | Long live whoever loves! Awar with him, who do... | 3 |  | peristyle | house | House of Lucius Caeciliis Jucundus | Romantic |  |  | metrical | yes |  |  | 46.57 | high | found in 12 other locations |
| 23 | 4087 | 5 | 1 | 23 | Staphylus ... here with Quieta | 2 |  | peristyle | house | House of Lucius Caeciliis Jucundus | Sexual |  |  |  |  |  |  | 46.57 | high |  |
| 24 | 4049 | 5 | 1 | 18 | Room of Rufinus. Greetings. | 2 |  | peristyle | house | House of the epigrams | Social |  |  |  |  |  |  | 28.65 | medium |  |
| 25 | 3407f | 5 | 1 | 18 | Even if you devour me up to my roots, still I ... | 3 |  | room | house | House of the epigrams | Religious |  |  |  |  |  |  | 28.65 | medium |  |
| 26 | 4080 | 5 | 1 | 26 | It is not me, I do not laze about | 3 |  | tablinum | house | House of Lucius Caeciliis Jucundus | Insult |  |  |  |  |  |  | 46.57 | high |  |
| 27 | 4078 | 5 | 1 | 26 | And adressing him... | 2 |  | tablinum | house | House of Lucius Caeciliis Jucundus | Social |  |  | Metrical, Homeric |  |  |  | 46.57 | high |  |


In [None]:
df_raw['Context type general'].value_counts()

In [None]:
replacements = {'Baths': 'baths',
               'bakery': 'workshop',
               'apartment': 'house'}

In [None]:
df_raw['Context type general'] = df_raw['Context type general'].replace(replacements)

In [None]:
df_raw = df_raw.rename(columns = {'Context type specific': 'Position',
                                  'Context type general': 'Building Type'})

### 'Famous House'

In [None]:
#To do: encode as 0 (no) or 1 (yes)

In [None]:
df_raw['Famous House'].value_counts()

In [35]:
#df_raw[df_raw['Famous House'].str.contains('bar', na=False)]

In [None]:
df_raw['Famous House'].isna().sum()

In [None]:
df_raw['Famous House'] = df_raw['Famous House'].fillna('no')

In [None]:
#Nulls were filled with 'no' above, so compare against 'no' instead of using notna()
df_raw[(df_raw['Building Type'].isna()) & (df_raw['Famous House'] != 'no')]
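
The note at the start of this section suggests encoding the column as 0 or 1. A minimal sketch of one way to do that, on made-up values that cover both a missing entry and the 'no' filler used above:

```python
import numpy as np
import pandas as pd

famous = pd.Series(['Basilica', np.nan, 'House of the Ceii', 'no'])

# 1 when a name is present, 0 when the value is missing or the 'no' filler
is_famous = (famous.notna() & (famous != 'no')).astype(int)
print(is_famous.tolist())
```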

### Target: 'Category'

In [None]:
#Consider reclassifying all with 'beware' into a 'Warning' category
#Change all 'Insult/Threat' to just 'Insult'
#Change all 'Political/Social' to just 'Political'
#Change all 'Romantic/Sexual' to just 'Romantic'
#Tagging/violence are just military things
#Sexual/social are ??
#Romantic/social are ??
df_raw[df_raw['Category'] == 'Religious/Romantic']
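
The three merges that the notes above already settle on could be applied with a single `replace`. A sketch on made-up rows; the undecided combinations (such as 'Sexual/social') are deliberately left out of the hypothetical `merge_map`.

```python
import pandas as pd

category = pd.Series(['Insult/Threat', 'Political/Social',
                      'Romantic/Sexual', 'Social'])

# Mappings taken from the notes above; other labels pass through unchanged
merge_map = {
    'Insult/Threat': 'Insult',
    'Political/Social': 'Political',
    'Romantic/Sexual': 'Romantic',
}
category_merged = category.replace(merge_map)
print(category_merged.value_counts())
```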

### 'Written by'

### 'Work', 'Meter', and 'Repetition'

### 'Foreign language'

In [None]:
df_raw['Foreign language'].value_counts()

In [None]:
df_raw['Foreign language'].isna().sum()

In [None]:
#Can be dropped
df_raw.drop(['Foreign language'], axis = 1, inplace = True)

### 'Image'

In [None]:
df_raw['Image '].value_counts()

In [None]:
df_raw = df_raw.rename(columns = {'Image ': 'Image'})

### 'Flohr Score'

### 'Socio-economic status'

### 'Comments'

<a id = 'explore'></a>

[(Back to top)](#top)

# Exploratory data analysis

What is a reggio?

What is an insula?

From https://sites.google.com/site/ad79eruption/pompeii/map-of-pompeii

"Pompeii, however, has an additional level of numbering. It has been divided firstly into 9 regions (Regio), numbered in Roman numerals. Each of these regions contains several Insulae which are numbered 1, 2 3, etc. As with Herculaneum, each building within an insula has its own entrance number, again numbered 1, 2, 3 etc. For example, the House of Trebius Valens is labelled (Reg III, Ins 2, 1)."
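
To make the numbering scheme concrete, the three columns can be combined into a label in the quoted format. This is only an illustrative sketch: `location_label` and the `Location` column are hypothetical, the rows are made up, and 0 (the placeholder used earlier for an unknown reggio) is shown as '?'.

```python
import pandas as pd

# Roman numerals for the nine regions; 0 is the placeholder for unknown
ROMAN = {1: 'I', 2: 'II', 3: 'III', 4: 'IV', 5: 'V',
         6: 'VI', 7: 'VII', 8: 'VIII', 9: 'IX'}

def location_label(reggio, insula, entrance):
    """Format a location like the quoted example, e.g. 'Reg III, Ins 2, 1'."""
    region = ROMAN.get(int(reggio), '?')
    return f'Reg {region}, Ins {insula}, {entrance}'

demo = pd.DataFrame({'Reggio': [3, 0], 'Insula': [2, 1], 'Entrance': ['1', '7']})
demo['Location'] = [
    location_label(r, i, e)
    for r, i, e in zip(demo['Reggio'], demo['Insula'], demo['Entrance'])
]
print(demo['Location'].tolist())
```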

<a id = 'model1'></a>

[(Back to top)](#top)

# Modeling: Model \#1

<a id = 'model2'></a>

[(Back to top)](#top)

# Model \#2

<a id = 'model3'></a>

[(Back to top)](#top)

# Model \#3

<a id = 'concl'></a>

[(Back to top)](#top)

# Conclusions