## Final Project Submission

Please fill out:
* Student name: Steven Rosa
* Student pace: part time
* Project review date/time: Monday 1 April 2019 11am ET
* Instructor name: Jeff Herman
* Blog post URL:


"Database for The Scratched Voices Begging to be Heard: The Graffiti of Pompeii and Today"

by Alexa Rose

https://core.tdar.org/dataset/445837/database-for-the-scratched-voices-begging-to-be-heard-the-graffiti-of-pompeii-and-today

<a id = 'top'></a>

# Contents
- Libraries and helper functions
- [A first look at the data](#obtain)
- [Cleaning the data](#scrub)
- [Exploratory data analysis](#explore)
- Modeling
 - [Model \#1](#model1)
 - [Model \#2](#model2)
 - [Model \#3](#model3)
- [Conclusions](#concl)

# Libraries and helper functions

In [1]:
import pandas as pd #For working with DataFrames
import matplotlib.pyplot as plt #For visualizing plots
import numpy as np #For mathematical operations

In [2]:
import time

In [3]:
#Function to draw in-line histograms
def inline_hists(xs, data, bins = 50):
    fig, axs = plt.subplots(1, len(xs), sharey=False, figsize=((5 * len(xs), 4)))
    for i, x in enumerate(xs):
        data[x].hist(ax=axs[i], label=x, xlabelsize=5, bins=bins)
        axs[i].legend()
    plt.show()

<a id = 'obtain'></a>

[(Back to top)](#top)

# A first look at the data

In [4]:
#Obtain the raw data
df_raw = pd.read_csv('graffiti.csv')

In [None]:
df_raw.head(20)

In [None]:
df_raw.info()

Columns to drop:

- 'found?'
- 'org. ' (?)
- 'comments' (?)

Change 'Literacy (1-3)' to integer before categorizing, and rename the column to 'Literacy'.

Rename 'Image ' as 'Image'.

Categorical variables to transform: Reggio, Insula, Literacy, Context type specific, Context type general, Famous House (?), Socio-economic status

Null values to fill: Reggio, Insula, Entrance, Context type specific, Context type general, Famous House

Target: 'Category'
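The checklist above can be backed by a quick per-column summary. This is a minimal sketch, not part of the original notebook: `summarize_columns` is a hypothetical helper and the toy frame stands in for `df_raw` with invented values.

```python
import pandas as pd


def summarize_columns(df, columns):
    """Return null count, unique count and dtype for each requested column.

    Columns that are not present in df are silently skipped.
    """
    present = [c for c in columns if c in df.columns]
    return pd.DataFrame({
        'nulls': df[present].isna().sum(),
        'unique': df[present].nunique(),
        'dtype': df[present].dtypes.astype(str),
    })


# Toy stand-in for df_raw (values invented for illustration)
toy = pd.DataFrame({
    'Reggio': [1.0, None, 6.0],
    'Insula': ['4', '12', None],
    'Category': ['Social', 'Insult', 'Social'],
})

summary = summarize_columns(toy, ['Reggio', 'Insula', 'Category', 'Entrance'])
print(summary)
```

Running the same helper on `df_raw` with the column names listed above would show at a glance which columns still need nulls filled or a dtype change.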

<a id = 'scrub'></a>

[(Back to top)](#top)

# Cleaning the data

## Column by column

### 'Reggio'

In [None]:
df_raw['Reggio'].value_counts()

In [None]:
print(df_raw['Reggio'].isna().sum())

In [5]:
#Fill NaN values
df_raw['Reggio'] = df_raw['Reggio'].fillna(0)
#Change 6_7 to 6
df_raw.at[994, 'Reggio'] = '6'
#Change data type to integer
df_raw['Reggio'] = df_raw['Reggio'].astype(float).astype(int)

Zero values can be filled later once more is known about the reggios.

### 'Insula'

In [None]:
df_raw['Insula'].value_counts()

In [None]:
df_raw['Insula'].value_counts().sum()

In [None]:
df_raw['Insula'].isna().sum()

In [6]:
#Fill null values
df_raw['Insula'] = df_raw['Insula'].fillna(0)

#Replace the values with underscores
df_raw.at[985, 'Insula'] = '4'
df_raw.at[986, 'Insula'] = '4'
df_raw.at[983, 'Insula'] = '4'
df_raw.at[984, 'Insula'] = '4'
df_raw.at[988, 'Insula'] = '9'
df_raw.at[987, 'Insula'] = '8'
df_raw.at[982, 'Insula'] = '12'
df_raw.at[981, 'Insula'] = '1'

#Change data type to integer
df_raw['Insula'] = df_raw['Insula'].astype(float).astype(int)

### 'Entrance'

In [None]:
df_raw['Entrance'].value_counts()

In [7]:
#Fill null values
df_raw['Entrance'] = df_raw['Entrance'].fillna('unknown')

#Replace all values with underscores or hyphens
#Dict to fill values from 'Entrance'
entrance_replacements = dict()
entrance_values = df_raw['Entrance'].value_counts()

#Iterate over Entrance values to look for underscore and hyphen
#Make a dict with values to replace the _/- values in the dataframe
#I'm choosing to take the first numerical value from each pair
for index in entrance_values.index:
    if '_' in index:
        index_split = index.split('_')
        entrance_replacements[index] = index_split[0]
    elif '-' in index:
        index_split = index.split('-')
        entrance_replacements[index] = index_split[0]
            
df_raw['Entrance'] = df_raw['Entrance'].replace(entrance_replacements)

#Change 'F' to 'f'
df_raw.at[661, 'Entrance'] = df_raw.at[661, 'Entrance'].lower()
#Change '4/5/' to '4'
df_raw.at[6, 'Entrance'] = '4'
#Replace 'I' and '?'
df_raw['Entrance'] = df_raw['Entrance'].replace({'I': 'i', '?': 'unknown'})

This is cleaner, but 'Entrance' may still need to be encoded as a categorical variable.
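One possible way to categorize a column like 'Entrance' is to lump rarely seen values into a single level before converting to a pandas categorical. The sketch below is an assumption about how that could look, not the notebook's method: `categorize_with_other`, the `min_count` threshold and the toy values are all hypothetical.

```python
import pandas as pd


def categorize_with_other(series, min_count=2, other_label='other'):
    """Lump values seen fewer than min_count times into other_label,
    then convert the result to a pandas categorical."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rare ones
    lumped = series.where(~series.isin(rare), other_label)
    return lumped.astype('category')


# Toy stand-in for the cleaned 'Entrance' column (values invented)
toy_entrance = pd.Series(['1', '1', '6', 'unknown', 'unknown', 'f', '39'])
entrance_cat = categorize_with_other(toy_entrance, min_count=2)
print(entrance_cat.value_counts())
```

Whether lumping is appropriate depends on how many distinct entrance labels remain after cleaning; with few levels, a plain `astype('category')` would be enough.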

### 'found?'

In [None]:
df_raw['found?'].isna().sum()

In [8]:
#Can just be dropped
df_raw.drop(['found?'], axis = 1, inplace = True)

### 'In English'

In [None]:
df_raw['In English'].isna().sum()

In [9]:
#Fill nulls so that they can be read
df_raw['In English'] = df_raw['In English'].fillna('')

#Rows with null values or unhelpful 'CHECK' values
blank_indexes = df_raw.index[df_raw['In English'] == '']
check_indexes = df_raw.index[df_raw['In English'] == '[CHECK]']

#Drop the empty rows. They aren't useful if they don't have the English text of the graffiti.
df_raw.drop(blank_indexes, inplace = True)
df_raw.drop(check_indexes, inplace = True)

### 'org. '

In [None]:
df_raw['org. '].value_counts()[:10]

In [10]:
#Won't be useful here. Can be dropped.
df_raw.drop(['org. '], axis = 1, inplace = True)

### 'Literacy'

In [None]:
df_raw['Literacy (1-3)'].value_counts()

In [None]:
df_raw['Literacy (1-3)'].isna().sum()

In [11]:
#Rename column
df_raw = df_raw.rename(columns = {'Literacy (1-3)': 'Literacy'})

#Turn the few 1 values into 2s.
df_raw['Literacy'] = df_raw['Literacy'].replace({1.0: 2})
df_raw['Literacy'] = df_raw['Literacy'].fillna(2)

#Turn floats into integers
df_raw['Literacy'] = df_raw['Literacy'].astype(float).astype(int)

### 'In org. language'

In [None]:
df_raw['In org. language'].value_counts().sum()

Not sure what to do with this at this point.

### 'Context type general'

In [None]:
df_raw['Context type general'].value_counts()

In [None]:
df_raw['Context type general'].isna().sum()

In [57]:
#Rows with no specific context and no general context
no_spec_no_gen = df_raw[df_raw['Context type specific'].isna() & df_raw['Context type general'].isna()]

#Of those, the rows that also lack a reggio and an insula
to_drop = no_spec_no_gen[(no_spec_no_gen['Reggio'] == 0) & (no_spec_no_gen['Insula'] == 0)]
print(to_drop.shape)

#Must drop the 43 rows that don't have a reggio, insula, specific context or general context
df_raw.drop(to_drop.index, axis = 0, inplace = True)

#Maybe famous house can fill in for general context where it's missing?
famoushouse_nogen = df_raw[
    (df_raw['Famous House'].notna())
    & 
    (df_raw['Context type general'].isna())]

#Get indexes of all rows without a gen context but with a famous house
indexes = famoushouse_nogen.index

famoushouse_gencontexts = {
    'Praedia ': 'building',
    'Basilica': 'basilica',
    'House of': 'house',
    'house of': 'house',
    'Villa of': 'house',
    'Building': 'building',
    'near the Porta Vesuvio': 'necropolis',
    'Workshop': 'workshop'
}

#Replace gen context with the building type from its famous house
#Iterate over all the rows which have a famous house but lack a gen context
for index in indexes:
    #Iterate over the keys of famous houses
    for key, val in famoushouse_gencontexts.items():
        #If the row's famous house matches one from the dict
        if key in df_raw.at[index, 'Famous House']:
            #Fill missing gen context value with value from dict
            df_raw.at[index, 'Context type general'] = val
            
#Noticed that Bar of Sotericus has gen context of "house"
indexes = df_raw[df_raw['Famous House'] == 'Bar of Sotericus']['Context type general']
#Replace 'house' with 'bar' for these
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'bar'
    
#Noticed that 'Outside Porta Marina' had two NaN gen contexts to fix
#997 and 998
df_raw.at[997, 'Context type general'] = 'house'
df_raw.at[998, 'Context type general'] = 'house'

#change all building types for spec context "workshop" to gen context "workshop"
indexes = df_raw[df_raw['Context type specific'] == 'workshop']
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'workshop'

#Same for 'Workshop'
indexes = df_raw[df_raw['Context type specific'] == 'Workshop']
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'workshop'

#Specific context "dining room" to general context "house"
indexes = df_raw[
    (df_raw['Context type specific'] == 'dining room')
    & 
    (df_raw['Context type general'].isna())]
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'house'

#Spec context "shop" to general context "shop"
indexes = df_raw[
    (df_raw['Context type specific'] == 'shop')
    & 
    (df_raw['Context type general'].isna())]
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'shop'
    
#Spec context "kitchen" to general context "house"    
indexes = df_raw[
    (df_raw['Context type specific'] == 'kitchen')
    & 
    (df_raw['Context type general'].isna())]
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'house'    
    
#Add consistency to a few of the values
replacements = {'Baths': 'baths',
                'unit': 'apartment'}
df_raw['Context type general'] = df_raw['Context type general'].replace(replacements)

#Changing the name of the column
df_raw = df_raw.rename(columns = {'Context type general': 'Building Type'})

KeyError: 'Context type general'

In [60]:
df_raw['Building Type'].isna().sum()

133

In [None]:
#Still 133 left. Maybe the specific context type can help.
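The per-row loops that infer a general context from a specific context could also be written as a single mapping step. This is a sketch under assumptions: `fill_general_from_specific` is a hypothetical helper, the toy frame uses invented rows, and the mapping only covers the "fill if missing" rules (dining room, kitchen, shop). The workshop rule in the notebook overwrites existing values, so it is deliberately left out here.

```python
import pandas as pd

# Specific contexts that imply a building type (taken from the cleaning cell)
spec_to_gen = {'dining room': 'house', 'kitchen': 'house', 'shop': 'shop'}


def fill_general_from_specific(df, spec_col, gen_col, mapping):
    """Fill missing gen_col values with mapping applied to spec_col.

    Existing gen_col values are never overwritten.
    """
    out = df.copy()
    inferred = out[spec_col].map(mapping)  # NaN where no rule applies
    out[gen_col] = out[gen_col].fillna(inferred)
    return out


# Toy stand-in for df_raw (rows invented for illustration)
toy = pd.DataFrame({
    'Context type specific': ['kitchen', 'shop', 'atrium', 'dining room'],
    'Context type general': [None, None, None, 'bar'],
})

filled = fill_general_from_specific(
    toy, 'Context type specific', 'Context type general', spec_to_gen)
print(filled)
```

Keeping the rules in one dictionary makes it easier to add further specific contexts later without another loop.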

### 'Context type specific'

In [14]:
df_raw['Context type specific'].value_counts()

peristyle      166
entrance       152
façade          88
atrium          65
outer wall      39
facade          26
dining room     22
garden          21
room            19
staircase       18
column          18
latrine         14
kitchen         11
shop             7
workshop         6
tablinum         5
altar            4
portico          4
bath             4
counter          3
tablinium        3
marble           2
Workshop         1
Name: Context type specific, dtype: int64

In [62]:
#Rows with a spec context but not a building type
df_raw['Context type specific'][
    (df_raw['Context type specific'].notna())
    & 
    (df_raw['Building Type'].isna())].value_counts()

façade        36
outer wall    19
peristyle     14
entrance      12
atrium        10
facade         7
latrine        3
tablinum       1
room           1
altar          1
staircase      1
garden         1
Name: Context type specific, dtype: int64

In [63]:
#Reggio and insula for rows with a spec context but not a building type
regIns = df_raw[['Reggio', 'Insula']][
    (df_raw['Context type specific'].notna())
    & 
    (df_raw['Building Type'].isna())]

regins_tuples = []
for row in regIns.index:
    regins_tuples.append((regIns.at[row, 'Reggio'], regIns.at[row, 'Insula']))
    
#These reggios and insulae can be looked up for building type
set(regins_tuples)    

{(0.0, 0.0),
 (0.0, 9.0),
 (1.0, 2.0),
 (1.0, 3.0),
 (1.0, 4.0),
 (1.0, 6.0),
 (1.0, 9.0),
 (2.0, 1.0),
 (2.0, 2.0),
 (2.0, 5.0),
 (2.0, 9.0),
 (3.0, 3.0),
 (3.0, 4.0),
 (3.0, 5.0),
 (3.0, 6.0),
 (5.0, 2.0),
 (5.0, 7.0),
 (6.0, 5.0),
 (6.0, 9.0),
 (6.0, 12.0),
 (6.0, 13.0),
 (6.0, 14.0),
 (6.0, 15.0),
 (6.0, 16.0),
 (7.0, 1.0),
 (7.0, 2.0),
 (7.0, 3.0),
 (7.0, 4.0),
 (7.0, 7.0),
 (7.0, 12.0),
 (7.0, 13.0),
 (7.0, 15.0),
 (9.0, 0.0),
 (9.0, 2.0),
 (9.0, 3.0),
 (9.0, 6.0),
 (9.0, 7.0),
 (9.0, 8.0),
 (9.0, 11.0),
 (9.0, 12.0)}
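The same set of unique (reggio, insula) pairs can be collected without an explicit loop. A minimal sketch on an invented toy frame, using `drop_duplicates` and `itertuples`:

```python
import pandas as pd

# Toy stand-in for the regIns frame (values invented for illustration)
toy = pd.DataFrame({
    'Reggio': [1, 1, 6, 6, 9],
    'Insula': [2, 2, 12, 13, 0],
})

# Drop repeated rows, then turn each remaining row into a plain tuple
unique_pairs = set(
    toy[['Reggio', 'Insula']]
    .drop_duplicates()
    .itertuples(index=False, name=None)
)
print(sorted(unique_pairs))
```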

In [43]:
#Rows with no spec context but a building type
regIns = df_raw[['Reggio', 'Insula']][
    (df_raw['Context type specific'].isna())
    & 
    (df_raw['Building Type'].notna())]

In [45]:
regIns.shape # It's probably good enough that these all have a Building Type

(273, 2)

In [None]:
#Add consistency to values
replacements = {'façade': 'facade',
                'tablinium': 'tablinum',
                'Workshop': 'workshop'
                }

df_raw['Context type specific'] = df_raw['Context type specific'].replace(replacements)


#Rename column
df_raw = df_raw.rename(columns = {'Context type specific': 'Position'})

### 'Famous House'

In [44]:
df_raw['Famous House'].value_counts()[60:90]

House of N Popidius Priscus                   2
House of Narcissus                            2
Bar of Salvius                                1
House of Jason                                1
House of Fabius Amandius                      1
House of the large altar                      1
House of Lucius Clodius Varus and Pelagria    1
House of the Ceii                             1
House of Balbus                               1
House of the ceii                             1
House of the Mosaic Columns                   1
House of Hercules                             1
House of cither player                        1
House of Oppius Gratus                        1
Temple of Jupiter                             1
House of Popidius Metellicus                  1
near the Porta Vesuvio                        1
Workshop of Verecundus                        1
House of Gaius Vibius Italicus                1
House of Cipius Pamphilius Felix              1
Casa delle caccia antica                

In [64]:
df_raw['Famous House'].isna().sum()

537

In [65]:
#Turn this column into a 0/1 for no/yes
indexes = df_raw[df_raw['Famous House'].notna()]

for index in indexes.index:
    df_raw.at[index, 'Famous House'] = 1
    
df_raw['Famous House'] = df_raw['Famous House'].fillna(0)
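The 0/1 conversion can also be done in one vectorized step, which avoids writing integers into a column of strings row by row. A minimal sketch on an invented toy frame:

```python
import pandas as pd

# Toy stand-in for df_raw (house names taken from the listing above)
toy = pd.DataFrame({
    'Famous House': ['House of Jason', None, 'Bar of Salvius', None],
})

# 1 where a famous house is recorded, 0 where the value is missing
toy['Famous House'] = toy['Famous House'].notna().astype(int)
print(toy['Famous House'].value_counts())
```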

In [70]:
df_raw['Famous House'].value_counts()

0    535
1    463
Name: Famous House, dtype: int64

In [68]:
df_raw[(df_raw['Building Type'].isna()) & (df_raw['Famous House'] == 0)]

| Unnamed: 0 | CIL IV # | Reggio | Insula | Entrance | In English | Literacy | In org. language | Context type specific | Building Type | Famous House | Category | Written by | Work | Meter | Repetition | Foreign language | Image | Flohr Score | Socio-economic status | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8426 | 2.0 | 1.0 | unknown | By the holy gods of the house, I ask you to… | 3.0 | (per) lares sanctos rogo te vt | altar | | no | Religious | | | | | | | | low | |
| 40 | 640 | 7.0 | 1.0 | 39 | Bye, Aper | 2.0 | | outer wall | | no | Social | | | | | | | | high | |
| 153 | 2960 | 9.0 | 1.0 | unknown | I ask you, fall ill! | 2.0 | | | | no | Reference | | Virgil | | yes | | | | high | |
| 154 | 3889 | 1.0 | 2.0 | 6 | All Fell silent/ all/ and atent (ively) | 3.0 | Conticvere Omnes Omn(es) Intentiq(..) s | atrium | | no | Reference | Virgil | Aeneid 2,1 | | | | | 15.31 | low | |
| 155 | 3888 | 1.0 | 2.0 | 6 | On November 19th I attended the meeting | 2.0 | XII K Dec in conventv veni | atrium | | no | Civic | | | | | | | 15.31 | low | |
| 156 | 3928 | 1.0 | 2.0 | 19 | Best wishes to serena from her friends | 2.0 | Serenae sodales sal | latrine | | no | Social | | | | | | | 22.2 | medium | |
| 157 | 3926 | 1.0 | 2.0 | 19 | Diadum in us here and everywhere | 2.0 | Diadvmus hic et vbique | latrine | | no | Political/Social | | | | | | | 22.2 | medium | |
| 158 | 3925 | 1.0 | 2.0 | 19 | Saturnius, don't lick cunt | 2.0 | Satvrnine cvnvm linge re nol(i) | latrine | | no | Insult/Sexual | | | | | | | 22.2 | medium | |
| 159 | 3891 | 1.0 | 2.0 | 6 | Bye Actius Anicetus!/ Bye Horus | 2.0 | Acti anicete va hore va | peristyle | | no | Social | | | | | | | 15.31 | low | |
| 160 | 3948 | 1.0 | 2.0 | 24 | May such lies cost you dearly innkeeper! You s... | 2.0 | … | peristyle | | no | Insult | | | metrical | | | | 18 | low | |


### Target: 'Category'

In [None]:
#Consider reclassifying all with 'beware' into a 'Warning' category
#Change all 'Insult/Threat' to just 'Insult'
#Change all 'Political/Social' to just 'Political'
#Change all 'Romantic/Sexual' to just 'Romantic'
#Tagging/violence are just military things
#Sexual/social are ??
#Romantic/social are ??
df_raw[df_raw['Category'] == 'Religious/Romantic']
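The three relabeling rules that are already decided in the comments above could be applied with one `replace` call. This is a sketch on an invented toy series; the undecided cases (the 'beware' warnings and the compound labels marked with question marks) are intentionally not handled.

```python
import pandas as pd

# Compound labels to collapse, as listed in the notes above
category_replacements = {
    'Insult/Threat': 'Insult',
    'Political/Social': 'Political',
    'Romantic/Sexual': 'Romantic',
}

# Toy stand-in for df_raw['Category'] (values invented for illustration)
toy_category = pd.Series([
    'Insult/Threat', 'Social', 'Political/Social',
    'Romantic/Sexual', 'Religious',
])

collapsed = toy_category.replace(category_replacements)
print(collapsed.value_counts())
```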

### 'Written by'

### 'Work', 'Meter', and 'Repetition'

### 'Foreign language'

In [None]:
df_raw['Foreign language'].value_counts()

In [None]:
df_raw['Foreign language'].isna().sum()

In [None]:
#Can be dropped
df_raw.drop(['Foreign language'], axis = 1, inplace = True)

### 'Image'

In [None]:
df_raw['Image '].value_counts()

In [None]:
df_raw = df_raw.rename(columns = {'Image ': 'Image'})

### 'Flohr Score'

### 'Socio-economic status'

### 'Comments'

<a id = 'explore'></a>

[(Back to top)](#top)

# Exploratory data analysis

What is a reggio?

What is an insula?

From https://sites.google.com/site/ad79eruption/pompeii/map-of-pompeii

"Pompeii, however, has an additional level of numbering. It has been divided firstly into 9 regions (Regio), numbered in Roman numerals. Each of these regions contains several Insulae which are numbered 1, 2 3, etc. As with Herculaneum, each building within an insula has its own entrance number, again numbered 1, 2, 3 etc. For example, the House of Trebius Valens is labelled (Reg III, Ins 2, 1)."
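To connect the quoted numbering scheme to the 'Reggio', 'Insula' and 'Entrance' columns, a location label can be built from the three values. This is an illustrative sketch only: `to_roman` and `format_address` are hypothetical helpers, and the output format simply imitates the example in the quote.

```python
def to_roman(number):
    """Convert an integer from 1 to 3999 into a Roman numeral string."""
    if not 1 <= number <= 3999:
        raise ValueError('number must be between 1 and 3999')
    symbols = [
        (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
        (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
        (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
    ]
    result = ''
    for value, symbol in symbols:
        # Take as many copies of this symbol as fit, keep the remainder
        count, number = divmod(number, value)
        result += symbol * count
    return result


def format_address(regio, insula, entrance):
    """Build a label in the (Reg, Ins, entrance) style quoted above."""
    return f'(Reg {to_roman(regio)}, Ins {insula}, {entrance})'


# The House of Trebius Valens example from the quote
print(format_address(3, 2, 1))  # (Reg III, Ins 2, 1)
```

Rows where 'Reggio' was filled with the placeholder 0 would raise a `ValueError` here, so they would need to be skipped or resolved first.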

<a id = 'model1'></a>

[(Back to top)](#top)

# Modeling: Model \#1

<a id = 'model2'></a>

[(Back to top)](#top)

# Model \#2

<a id = 'model3'></a>

[(Back to top)](#top)

# Model \#3

<a id = 'concl'></a>

[(Back to top)](#top)

# Conclusions