## Final Project Submission

Please fill out:
* Student name: Steven Rosa
* Student pace: part time
* Project review date/time: Monday 1 April 2019 11am ET
* Instructor name: Jeff Herman
* Blog post URL:


"Database for The Scratched Voices Begging to be Heard: The Graffiti of Pompeii and Today"

by Alexa Rose

https://core.tdar.org/dataset/445837/database-for-the-scratched-voices-begging-to-be-heard-the-graffiti-of-pompeii-and-today

<a id = 'top'></a>

# Contents
- Libraries and helper functions
- [A first look at the data](#obtain)
- [Cleaning the data](#scrub)
- [Exploratory data analysis](#explore)
- Modeling
 - [Model \#1](#model1)
 - [Model \#2](#model2)
 - [Model \#3](#model3)
- [Conclusions](#concl)

# Libraries and helper functions

In [1]:
import pandas as pd #For working with DataFrames
import matplotlib.pyplot as plt #For visualizing plots

In [2]:
import time

In [3]:
#Function to draw in-line histograms, one per column name in xs
def inline_hists(xs, data, bins = 50):
    #squeeze=False keeps axs two-dimensional, so a single column also works
    fig, axs = plt.subplots(1, len(xs), sharey=False, squeeze=False, figsize=(5 * len(xs), 4))
    for i, x in enumerate(xs):
        data[x].hist(ax=axs[0, i], label=x, xlabelsize=5, bins=bins)
        axs[0, i].legend()
    plt.show()

<a id = 'obtain'></a>

[(Back to top)](#top)

# A first look at the data

In [4]:
#Obtain the raw data
df_raw = pd.read_csv('graffiti.csv')

In [None]:
df_raw.head(10)

In [None]:
df_raw.info()

Columns to drop:

- 'found?'
- 'org. '?
- 'comments'?

Change 'Literacy (1-3)' to integer before categorizing, and rename the column to 'Literacy'.

Rename 'Image ' as 'Image'.

Categorical variables to transform: Reggio, Insula, Literacy, Context type specific, Context type general, Famous House (?), Socio-economic status

Null values to fill: Reggio, Insula, Entrance, Context type specific, Context type general, Famous House

Target: 'Category'
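
The notes above amount to a per-column checklist. A minimal sketch of a helper that tabulates dtype, null count, and number of distinct values for each column may make that checklist easier to verify. The name `summarize_columns` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column: dtype, null count, distinct non-null values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'nulls': df.isna().sum(),
        'distinct': df.nunique(),
    })

# Toy frame standing in for the graffiti data (values are made up)
toy = pd.DataFrame({
    'Reggio': ['1', '6_7', None],
    'Category': ['Social', 'Insult', 'Social'],
})
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df_raw)` would give the same overview for the real data.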

<a id = 'scrub'></a>

[(Back to top)](#top)

# Cleaning the data

## Column by column

### 'Reggio'

In [None]:
df_raw['Reggio'].value_counts()

In [None]:
print(df_raw['Reggio'].isna().sum())

In [5]:
df_raw['Reggio'] = df_raw['Reggio'].fillna(0)

In [None]:
#What's with the one 6_7 value?
df_raw[df_raw['Reggio'] == '6_7']

In [6]:
#Change 6_7 to 6
df_raw.at[994, 'Reggio'] = '6'

In [7]:
df_raw['Reggio'] = df_raw['Reggio'].astype(int)

Zero values can be filled later once more is known about the reggios.
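
The fix above relies on row label 994, which holds only for this particular file. A label-free sketch follows, assuming every compound value should keep the number before the underscore (the same choice made for 6_7). The helper `first_number` is hypothetical.

```python
import pandas as pd

def first_number(series):
    """Keep the part before the first underscore and cast to int."""
    as_text = series.astype(str)
    return as_text.str.split('_').str[0].astype(int)

# Toy column shaped like 'Reggio' after fillna(0): strings plus a zero
toy = pd.Series(['1', '6_7', 0, '9'])
cleaned = first_number(toy)
print(cleaned.tolist())
```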

### 'Insula'

In [None]:
df_raw['Insula'].value_counts()

In [None]:
df_raw['Insula'].value_counts().sum()

In [None]:
df_raw['Insula'].isna().sum()

In [8]:
df_raw['Insula'] = df_raw['Insula'].fillna(0)

In [9]:
#Replace each value that contains an underscore with a single insula number
df_raw.at[985, 'Insula'] = '4'
df_raw.at[986, 'Insula'] = '4'
df_raw.at[983, 'Insula'] = '4'
df_raw.at[984, 'Insula'] = '4'
df_raw.at[988, 'Insula'] = '9'
df_raw.at[987, 'Insula'] = '8'
df_raw.at[982, 'Insula'] = '12'
df_raw.at[981, 'Insula'] = '1'

In [None]:
df_raw['Insula'].value_counts()

In [10]:
df_raw['Insula'] = df_raw['Insula'].astype(int)

### 'Entrance'

In [None]:
#What is the entrance column?
df_raw['Entrance'].value_counts()

In [None]:
#Reggio 7 Insula 12 had a brothel at entrances 18 through 20. Will change these values to 18.
#df_raw.replace('18_20', '18')

In [11]:
df_raw['Entrance'] = df_raw['Entrance'].fillna('unknown')

In [12]:
#To fill values from 'Entrance'
entrance_replacements = dict()

In [13]:
entrance_values = df_raw['Entrance'].value_counts()

In [14]:
#Iterate over Entrance values to look for an underscore or hyphen
#Make a dict with values to replace the _/- values in the dataframe
#I'm choosing to take the first numerical value from each pair
for index in entrance_values.index:
    if '_' in index:
        entrance_replacements[index] = index.split('_')[0]
    elif '-' in index:
        entrance_replacements[index] = index.split('-')[0]

In [15]:
df_raw['Entrance'] = df_raw['Entrance'].replace(entrance_replacements)

In [None]:
df_raw['Entrance'].value_counts()

In [None]:
#Why are some of the entrances calendar dates?
#First, look at the rows whose entrance is '?'
df_raw[df_raw['Entrance'] == '?']

In [17]:
#Change 'F' to 'f'
df_raw.at[661, 'Entrance'] = df_raw.at[661, 'Entrance'].lower()
#Change '4/5/' to '4'
df_raw.at[6, 'Entrance'] = '4'
#Replace 'I' and '?'
df_raw['Entrance'] = df_raw['Entrance'].replace({'I': 'i', '?': 'unknown'})

This is better, but it may have to be categorized.
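
One way the categorization mentioned above might look, assuming rare entrance labels can be pooled into a single 'other' bucket before converting to a categorical dtype. The threshold `min_count` and the helper name `pool_rare` are assumptions.

```python
import pandas as pd

def pool_rare(series, min_count=2, other='other'):
    """Replace labels seen fewer than min_count times with a shared label."""
    counts = series.value_counts()
    rare = counts.index[counts < min_count]
    # where() keeps values whose condition is True and swaps in `other` elsewhere
    return series.where(~series.isin(rare), other).astype('category')

toy = pd.Series(['1', '1', 'unknown', 'unknown', 'f', '18'])
pooled = pool_rare(toy)
print(pooled.value_counts())
```

Pooling keeps the number of categories small, at the cost of losing the distinction between the rare entrances.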

### 'found?'

In [None]:
df_raw['found?'].isna().sum()

In [18]:
#Can just be dropped
df_raw.drop(['found?'], axis = 1, inplace = True)

### 'In English'

In [25]:
df_raw['In English'].isna().sum()

8

In [26]:
df_raw['In English'] = df_raw['In English'].fillna('')


In [46]:
blank_indexes = df_raw.index[df_raw['In English'] == '']
check_indexes = df_raw.index[df_raw['In English'] == '[CHECK]']

In [47]:
#Drop the empty rows and the rows marked '[CHECK]'.
#They aren't useful if they don't have the English text of the graffiti.
df_raw.drop(blank_indexes, inplace = True)
df_raw.drop(check_indexes, inplace = True)
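
The two drops above can also be written as one boolean filter, which avoids tracking index labels by hand. A sketch, assuming the empty string and '[CHECK]' are the only placeholders; the toy frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({'In English': ['Bye, Aper', None, '[CHECK]', 'greetings']})

# Treat missing text as empty, then keep only rows with real text
text = toy['In English'].fillna('')
unusable = text.isin(['', '[CHECK]'])
kept = toy[~unusable]
print(len(toy), '->', len(kept))
```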

### 'org. '

In [55]:
df_raw['org. '].value_counts()[:10]

?                             170
Benefiel 2010a                  8
Varone 2002                     4
Varone 2011 288                 3
Cugusi 2008                     3
Milnor 2009                     2
Biville 2003, 220               2
Biville 2003                    2
Milnor 2014 89-90               2
Garraffoni/Funari 2009 187      2
Name: org. , dtype: int64

In [56]:
#Won't be useful here. Can be dropped.
df_raw.drop(['org. '], axis = 1, inplace = True)

### 'Literacy'

In [62]:
df_raw['Literacy (1-3)'].value_counts()

2.0    592
3.0    444
1.0      2
Name: Literacy (1-3), dtype: int64

In [61]:
df_raw['Literacy (1-3)'].isna().sum()

1

In [63]:
#Rename column (index=str is left out so the row labels stay integers)
df_raw = df_raw.rename(columns = {'Literacy (1-3)': 'Literacy'})

In [65]:
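#Fold the two level-1 rows into level 2. The column stays float while it still has a missing value.
df_raw['Literacy'] = df_raw['Literacy'].replace({1.0: 2.0})

In [67]:
#Fill the one missing value with 2, then turn the values into integers. To be categorized later.
df_raw['Literacy'] = df_raw['Literacy'].fillna(2).astype(int)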

### 'In org. language'

In [71]:
df_raw['In org. language'].value_counts().sum()

315

Only 315 rows have the text in its original language. Not sure what to do with this column at this point.

### 'Context type specific'

**TODO:** Resume here. Finish comparing 'specific', 'general', and 'famous house' to see how they can be combined or collapsed to eliminate redundancy and null values.

In [78]:
df_raw['Context type specific'].value_counts()

peristyle      166
entrance       152
façade          88
atrium          65
outer wall      39
facade          26
dining room     22
garden          21
room            19
staircase       18
column          18
latrine         14
kitchen         11
shop             7
workshop         6
tablinum         5
portico          4
bath             4
altar            4
counter          3
tablinium        3
marble           2
Workshop         1
Name: Context type specific, dtype: int64

In [79]:
df_raw['Context type specific'].isna().sum()

341
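
The counts above list several labels twice under different spellings (façade/facade, tablinum/tablinium, workshop/Workshop). A sketch of merging them, assuming the variants mean the same thing. Keeping 'facade' and 'tablinum' is an arbitrary choice made for illustration.

```python
import pandas as pd

# Variant spelling -> spelling to keep (chosen for illustration)
SPELLING_FIXES = {'façade': 'facade', 'tablinium': 'tablinum'}

def normalize_context(series):
    """Trim, lowercase, and merge known spelling variants; NaN stays NaN."""
    return series.str.strip().str.lower().replace(SPELLING_FIXES)

toy = pd.Series(['façade', 'facade', 'Workshop', 'workshop', 'tablinium', None])
normalized = normalize_context(toy)
print(normalized.value_counts())
```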

### 'Context type general'

In [80]:
df_raw['Context type general'].value_counts()

house           438
brothel          70
tabernae         69
palaestra        35
unit             27
bar              23
theatre          22
inn              19
Baths             9
bakery            9
amphitheatre      8
temple            7
apartment         6
shop              3
altar             2
Name: Context type general, dtype: int64

In [81]:
df_raw['Context type general'].isna().sum()

292

In [82]:
#No specific context or general context
#145 rows
df_raw[df_raw['Context type specific'].isna() & df_raw['Context type general'].isna()]

Unnamed: 0,CIL IV #,Reggio,Insula,Entrance,In English,Literacy,In org. language,Context type specific,Context type general,Famous House,Category,Written by,Work,Meter,Repetition,Image,Flohr Score,Socio-economic status,comments
59,1939,8,1,1,"(In Rome) once lived the very rich Vibii, but ...",3,,,,Basilica,Political,,,,,,,low,
60,1863,8,1,1,"(Take) the cook, if you wish. Permission granted",3,,,,Basilica,Civic,,,yes,,,,low,
61,1860,8,1,1,"….as many as I have written once, and you read...",3,,,,Basilica,Insult,,,,,,,low,
62,1830,8,1,1,A hairy cunt is much better fucking than a bal...,3,,,,Basilica,Sexual,,,yes,,,,low,
63,1899,8,1,1,A orator makes a man (of someone); whoever buy...,3,,,,Basilica,Insult,,,,,,,low,
64,1839,8,1,1,"Agatho, slave of Herennius, asks Venus…that he...",3,,,,Basilica,Religious,,,,,,,low,
65,1934,8,1,1,Amicus sends greetings to Pyrrhus. Written by ...,2,,,,Basilica,Social,,,,yes,,,low,
66,1845,8,1,1,and salvius were here,2,,,,Basilica,Sexual,,,,,,,low,
67,1950,8,1,1,"Anyone who is a lover, may walk along Scythia'...",3,,,,Basilica,Reference,,Propertius,yes,,,,low,
68,1824,8,1,1,Anyone who is in love: come! I want to break V...,3,,,,Basilica,Religious,,,,,,,low,


In [85]:
#No specific context but a general context
#196 rows
df_raw[df_raw['Context type specific'].isna() & df_raw['Context type general'].notna()]

Unnamed: 0,CIL IV #,Reggio,Insula,Entrance,In English,Literacy,In org. language,Context type specific,Context type general,Famous House,Category,Written by,Work,Meter,Repetition,Image,Flohr Score,Socio-economic status,comments
12,10085,2,1,10,I sing the greatest songs of the man…,3,Carmina aio svmma viri,,house,,Reference,,Aeneid,"Metrical, parody",,,8.74,low,
13,10085b,2,1,10,The phallus of Crescens: hard and gigantic,3,Phallvs dvrvs cr(escentis) vastvs,,house,,Sexual,,,,,,8.74,low,
45,2124,7,1,8,(Best wishes) to Nero Caesar Augustus,3,NEROI CAESARI AGVSTO,,Baths,Stabian Baths,Political/Social,,,,,,1.69-1.94,low,
46,2107,7,1,8,Aegyptus to his Gallus: greetings!,2,,,Baths,Stabian Baths,Social,,,,,,1.69-1.94,low,
47,2081,7,1,8,Colepius senior licks cunt,2,,,Baths,Stabian Baths,Sexual,,,,,,1.69-1.94,low,
48,2111,7,1,8,"Iarinus, you are living here.",2,,,Baths,Stabian Baths,Social,,,,,,1.69-1.94,low,
49,760,7,1,8,"Lick my cock My cock, you must lick it well I ...",2,,,Baths,Stabian Baths,Sexual,,,,,,1.69-1.94,low,
50,2083,7,1,8,"Myrtilus, may the emperor favor you",2,,,Baths,Stabian Baths,Political/Social,,,,,,1.69-1.94,low,
51,2082,7,1,8,To a cross they should nail you!,3,,,Baths,Stabian Baths,Insult,,,,,,1.69-1.94,low,
52,2098,7,1,8,"Vettius Proclus, bye!",2,,,Baths,Stabian Baths,Social,,,,,,1.69-1.94,low,


In [87]:
#No general context but a specific context
#147 rows
df_raw[df_raw['Context type specific'].notna() & df_raw['Context type general'].isna()]

Unnamed: 0,CIL IV #,Reggio,Insula,Entrance,In English,Literacy,In org. language,Context type specific,Context type general,Famous House,Category,Written by,Work,Meter,Repetition,Image,Flohr Score,Socio-economic status,comments
0,8426,2,1,unknown,"By the holy gods of the house, I ask you to…",3,(per) lares sanctos rogo te vt,altar,,,Religious,,,,,,,low,
40,640,7,1,39,"Bye, Aper",2,,outer wall,,,Social,,,,,,,high,
54,1787,8,1,1,"Epaphra, give that pen back!",3,,entrance,,Basilica,Social,,,,yes,,,low,
55,1796,8,1,1,If anyone is looking for tender embraces in th...,3,,entrance,,Basilica,Romantic,,,,,,,low,
56,1781,8,1,1,"My darling, my sweet, let us play for a while ...",3,,entrance,,Basilica,Sexual,,,yes,,,,low,
57,1783,8,1,1,"Philodamus was (a slave) of Craudelius Festus,...",3,,entrance,,Basilica,Civic,,,,,,,low,
58,1791,8,1,1,Sweet is love for our hearts…and for lethargy ...,2,,entrance,,Basilica,Romantic,,,,,,,low,
154,3889,1,2,6,All Fell silent/ all/ and atent (ively),3,Conticvere Omnes Omn(es) Intentiq(..) s,atrium,,,Reference,Virgil,"Aeneid 2,1",,,,15.31,low,
155,3888,1,2,6,On November 19th I attended the meeting,2,XII K Dec in conventv veni,atrium,,,Civic,,,,,,15.31,low,
156,3928,1,2,19,Best wishes to serena from her friends,2,Serenae sodales sal,latrine,,,Social,,,,,,22.2,medium,


### 'Famous House'

### Target: 'Category'

In [None]:
#Consider reclassifying all with 'beware' into a 'Warning' category
#Change all 'Insult/Threat' to just 'Insult'
#Change all 'Political/Social' to just 'Political'
#Change all 'Romantic/Sexual' to just 'Romantic'
#Tagging/violence are just military things
#Sexual/social are ??
#Romantic/social are ??
df_raw[df_raw['Category'] == 'Religious/Romantic']
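
The three merges already decided in the notes above could be applied with a single mapping. A sketch on a made-up series; the compound labels still marked '??' are deliberately left out of the mapping.

```python
import pandas as pd

# Compound label -> single label, taken from the notes above
CATEGORY_MERGES = {
    'Insult/Threat': 'Insult',
    'Political/Social': 'Political',
    'Romantic/Sexual': 'Romantic',
}

toy = pd.Series(['Insult/Threat', 'Social', 'Political/Social', 'Romantic/Sexual'])
merged = toy.replace(CATEGORY_MERGES)
print(merged.tolist())
```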

### 'Written by'

### 'Work', 'Meter', and 'Repetition'

### 'Foreign language'

In [None]:
df_raw['Foreign language'].value_counts()

In [None]:
df_raw['Foreign language'].isna().sum()

In [75]:
#Can be dropped
df_raw.drop(['Foreign language'], axis = 1, inplace = True)

### 'Image'

In [None]:
df_raw['Image '].value_counts()

### 'Flohr Score'

### 'Socio-economic status'

### 'Comments'

In [None]:
df_raw.columns

### Dropping columns

### Renaming columns

In [None]:
#Rename only the columns; passing index=str would turn the row labels into strings
df_raw = df_raw.rename(columns = {'Literacy (1-3)': 'Literacy',
                                  'Image ': 'Image'})

<a id = 'explore'></a>

[(Back to top)](#top)

# Exploratory data analysis

What is a reggio?

What is an insula?

From https://sites.google.com/site/ad79eruption/pompeii/map-of-pompeii

"Pompeii, however, has an additional level of numbering. It has been divided firstly into 9 regions (Regio), numbered in Roman numerals. Each of these regions contains several Insulae which are numbered 1, 2 3, etc. As with Herculaneum, each building within an insula has its own entrance number, again numbered 1, 2, 3 etc. For example, the House of Trebius Valens is labelled (Reg III, Ins 2, 1)."

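The numbering scheme in the quote maps directly onto the 'Reggio', 'Insula', and 'Entrance' columns. A small sketch that formats the three values the way the quoted example does; the helper `address_label` and the `ROMAN` lookup are assumptions for illustration.

```python
# Roman numerals for the nine regions named in the quote
ROMAN = {1: 'I', 2: 'II', 3: 'III', 4: 'IV', 5: 'V',
         6: 'VI', 7: 'VII', 8: 'VIII', 9: 'IX'}

def address_label(regio, insula, entrance):
    """Format an address in the style of the quote: (Reg III, Ins 2, 1)."""
    return f"(Reg {ROMAN[regio]}, Ins {insula}, {entrance})"

# The House of Trebius Valens from the quoted example
print(address_label(3, 2, 1))
```

Rows where 'Reggio' was filled with 0 have no Roman numeral and would raise a `KeyError` here, so they would need handling first.
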
<a id = 'model1'></a>

[(Back to top)](#top)

# Modeling: Model \#1

<a id = 'model2'></a>

[(Back to top)](#top)

# Model \#2

<a id = 'model3'></a>

[(Back to top)](#top)

# Model \#3

<a id = 'concl'></a>

[(Back to top)](#top)

# Conclusions