## Final Project Submission

Please fill out:
* Student name: Steven Rosa
* Student pace: part time
* Project review date/time: Monday 1 April 2019 11am ET
* Instructor name: Jeff Herman
* Blog post URL:


"Database for The Scratched Voices Begging to be Heard: The Graffiti of Pompeii and Today"

by Alexa Rose

https://core.tdar.org/dataset/445837/database-for-the-scratched-voices-begging-to-be-heard-the-graffiti-of-pompeii-and-today

<a id = 'top'></a>

# Contents
- Libraries and helper functions
- [A first look at the data](#obtain)
- [Cleaning the raw data](#scrub)
- [Exploratory data analysis](#explore)
- Modeling
 - [Model \#1](#model1)
 - [Model \#2](#model2)
 - [Model \#3](#model3)
- [Conclusions](#concl)

# Libraries and helper functions

In [1]:
import pandas as pd #For working with DataFrames
import matplotlib.pyplot as plt #For visualizing plots
import numpy as np #For mathematical operations
import random                   #for generating random numbers for train/test split
import copy                     #for making deep copies of mutable objects

#for dividing data into a training set and a testing set
from sklearn.model_selection import train_test_split 
#For building regular logistic regression models
from sklearn.linear_model import LogisticRegression
#To view the ROC of a given class and  "area under the curve"
from sklearn.metrics import accuracy_score, roc_curve, auc
#For building decision trees
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree 
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
#For visualizing decision trees:
from io import StringIO  #sklearn.externals.six is no longer available in newer scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
#For assessing accuracy of logistic regression or decision trees
from sklearn.metrics import confusion_matrix 
import itertools #To iteratively append labels to cells in a confusion matrix

In [None]:
import time

In [2]:
#Function to draw in-line histograms
def inline_hists(xs, data, bins = 50):
    fig, axs = plt.subplots(1, len(xs), sharey=False, figsize=((5 * len(xs), 4)))
    for i, x in enumerate(xs):
        data[x].hist(ax=axs[i], label=x, xlabelsize=5, bins=bins)
        axs[i].legend()
    plt.show()

In [3]:
#Example function to visualize a confusion matrix without Yellowbrick
def plot_conf_matrix(cm, classes, normalize=False, 
                          title='Confusion Matrix', cmap=plt.cm.Blues):
    #Optionally convert raw counts to row-wise proportions
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

<a id = 'obtain'></a>

[(Back to top)](#top)

# A first look at the data

In [4]:
#Obtain the raw data
df_raw = pd.read_csv('graffiti.csv')

In [None]:
df_raw.head(20)

In [None]:
df_raw.info()

Columns to drop:

- 'found?'
- 'org. '?
- 'comments'?

Change 'Literacy' to integer before categorizing. Rename column.

Rename 'Image ' as 'Image'.

Categorical variables to transform: Reggio, Insula, Literacy, Context type specific, Context type general, Famous House (?), Socio-economic status


Null values to fill: Reggio, Insula, Entrance, Context type specific, Context type general, Famous House

Target: 'Category'
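
Several headers in the raw file carry stray whitespace ('Image ', 'org. '). A minimal sketch of normalizing them in one step, using a toy frame with invented values rather than `df_raw`. The cleaning cells below refer to the original names, so they would need updating if this were applied to the real data.

```python
import pandas as pd

# Toy frame whose headers mimic the stray spaces in the raw file (values are invented)
toy = pd.DataFrame({'Image ': ['yes', None],
                    'org. ': ['a', 'b'],
                    'Category': ['Social', 'Insult']})

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))  # ['Image', 'org.', 'Category']
```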

<a id = 'scrub'></a>

[(Back to top)](#top)

# Cleaning the data

## Column by column

### 'CIL IV \#'

In [None]:
df_raw['CIL IV #'] = df_raw['CIL IV #'].fillna(0)

### 'Reggio'

In [None]:
df_raw['Reggio'].value_counts()

In [None]:
print(df_raw['Reggio'].isna().sum())

In [5]:
#Fill NaN values
df_raw['Reggio'] = df_raw['Reggio'].fillna(0)
#Change 6_7 to 6
df_raw.at[994, 'Reggio'] = '6'
#Change data type to integer
df_raw['Reggio'] = df_raw['Reggio'].astype(float).astype(int)

Zero values can be filled later once more is known about the reggios.

### 'Insula'

In [None]:
df_raw['Insula'].value_counts()

In [None]:
df_raw['Insula'].value_counts().sum()

In [None]:
df_raw['Insula'].isna().sum()

In [6]:
#Fill null values
df_raw['Insula'] = df_raw['Insula'].fillna(0)

#Replace the values with underscores
df_raw.at[985, 'Insula'] = '4'
df_raw.at[986, 'Insula'] = '4'
df_raw.at[983, 'Insula'] = '4'
df_raw.at[984, 'Insula'] = '4'
df_raw.at[988, 'Insula'] = '9'
df_raw.at[987, 'Insula'] = '8'
df_raw.at[982, 'Insula'] = '12'
df_raw.at[981, 'Insula'] = '1'

#Change data type to integer
df_raw['Insula'] = df_raw['Insula'].astype(float).astype(int)

### 'Entrance'

In [None]:
df_raw['Entrance'].value_counts()

In [7]:
#Fill null values
df_raw['Entrance'] = df_raw['Entrance'].fillna('unknown')

#Replace all values with underscores or hyphens
#Dict to fill values from 'Entrance'
entrance_replacements = dict()
entrance_values = df_raw['Entrance'].value_counts()

#Iterate over Entrance values to look for underscore and hyphen
#Make a dict with values to replace the _/- values in the dataframe
#I'm choosing to take the first numerical value from each pair
for index in entrance_values.index:
        if '_' in index:
            index_split = index.split('_')
            entrance_replacements[index] = index_split[0]
        elif '-' in index:
            index_split = index.split('-')
            entrance_replacements[index] = index_split[0]
            
df_raw['Entrance'] = df_raw['Entrance'].replace(entrance_replacements)

#Change 'F' to 'f'
df_raw.at[661, 'Entrance'] = df_raw.at[661, 'Entrance'].lower()
#Change '4/5/' to '4'
df_raw.at[6, 'Entrance'] = '4'
#Replace 'I' and '?'
df_raw['Entrance'] = df_raw['Entrance'].replace({'I': 'i', '?': 'unknown'})

This is better, but it may have to be categorized.
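
The "take the first value from each pair" rule used above can also be written as one small function. This is only a sketch: `first_entrance` is a hypothetical helper, and treating '/' as a separator (to cover the '4/5/' case fixed by hand above) is an assumption.

```python
import re

def first_entrance(value):
    """Return the part of an entrance label before the first '_', '-' or '/'."""
    # Hypothetical helper, not part of the notebook above
    return re.split(r'[_\-/]', str(value), maxsplit=1)[0]

examples = ['7_8', '12-13', '4/5/', 'f', 'unknown']
print([first_entrance(v) for v in examples])  # ['7', '12', '4', 'f', 'unknown']
```

It could be applied with `df_raw['Entrance'].map(first_entrance)` once nulls have been filled.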

### 'found?'

In [None]:
df_raw['found?'].isna().sum()

In [None]:
#Can just be dropped
df_raw.drop(['found?'], axis = 1, inplace = True)

### 'In English'

In [None]:
df_raw['In English'].isna().sum()

In [8]:
#Fill nulls so that they can be read
df_raw['In English'] = df_raw['In English'].fillna('')

#Rows with null values or unhelpful 'CHECK' values
blank_indexes = df_raw.index[df_raw['In English'] == '']
check_indexes = df_raw.index[df_raw['In English'] == '[CHECK]']

#Drop the empty rows. They aren't useful if they don't have the English text of the graffiti.
df_raw.drop(blank_indexes, inplace = True)
df_raw.drop(check_indexes, inplace = True)

### 'org. '

In [None]:
df_raw['org. '].value_counts()[:10]

In [9]:
#Won't be useful here. Can be dropped.
df_raw.drop(['org. '], axis = 1, inplace = True)

### 'Literacy'

In [None]:
df_raw['Literacy (1-3)'].value_counts()

In [None]:
df_raw['Literacy (1-3)'].isna().sum()

In [10]:
#Rename column
df_raw = df_raw.rename(index=str, columns = {'Literacy (1-3)': 'Literacy'})

#Turn the few 1 values into 2s
df_raw['Literacy'] = df_raw['Literacy'].replace({1.0: 2})

#Fill nulls with 2
df_raw['Literacy'] = df_raw['Literacy'].fillna(2)

#Turn floats into integers
df_raw['Literacy'] = df_raw['Literacy'].astype(float).astype(int)

### 'In org. language'

In [None]:
df_raw['In org. language'].value_counts().sum()

In [11]:
#Dropping for now
df_raw.drop(['In org. language'], axis = 1, inplace = True)

Not sure what to do with this at this point.

### 'Context type general'

In [None]:
df_raw['Context type general'].value_counts()

In [None]:
df_raw['Context type general'].isna().sum()

In [12]:
# No specific, no general, no reggio, insula
no_spec_no_gen = df_raw[df_raw['Context type specific'].isna() & df_raw['Context type general'].isna()]

no_spec_no_gen[(no_spec_no_gen['Reggio'] == 0) & (no_spec_no_gen['Insula'] == 0)].shape

#Must drop the 43 rows that don't have a reggio, insula, specific context or general context
to_drop = no_spec_no_gen[(no_spec_no_gen['Reggio'] == 0) & (no_spec_no_gen['Insula'] == 0)]
df_raw.drop(to_drop.index, axis = 0, inplace = True)

#Maybe famous house can fill in for general context where it's missing?
famoushouse_nogen = df_raw[
    (df_raw['Famous House'].notna())
    & 
    (df_raw['Context type general'].isna())]

#Get indexes of all rows without a gen context but with a famous house
indexes = famoushouse_nogen.index

famoushouse_gencontexts = {
    'Praedia ': 'building',
    'Basilica': 'basilica',
    'House of': 'house',
    'house of': 'house',
    'Villa of': 'house',
    'Building': 'building',
    'near the Porta Vesuvio': 'necropolis',
    'Workshop': 'workshop'
}

#Replace gen context with the building type from its famous house
#Iterate over all the rows which have a famous house but lack a gen context
for index in indexes:
    #Iterate over the keys of famous houses
    for key, val in famoushouse_gencontexts.items():
        #If the row's famous house matches one from the dict
        if key in df_raw.at[index, 'Famous House']:
            #Fill missing gen context value with value from dict
            df_raw.at[index, 'Context type general'] = val
            
#Noticed that Bar of Sotericus has gen context of "house"
indexes = df_raw[df_raw['Famous House'] == 'Bar of Sotericus']['Context type general']
#Replace 'house' with 'bar' for these
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'bar'
    
#Noticed that 'Outside Porta Marina' had two NaN gen contexts to fix
#Will drop these because they're missing too many columns
df_raw.drop(['997', '998'], axis = 0, inplace = True)

#change all building types for spec context "workshop" to gen context "workshop"
indexes = df_raw[df_raw['Context type specific'] == 'workshop']
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'workshop'

#Same for 'Workshop'
indexes = df_raw[df_raw['Context type specific'] == 'Workshop']
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'workshop'

#Specific context "dining room" to general context "house"
indexes = df_raw[
    (df_raw['Context type specific'] == 'dining room')
    & 
    (df_raw['Context type general'].isna())]
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'house'

#Spec context "shop" to general context "shop"
indexes = df_raw[
    (df_raw['Context type specific'] == 'shop')
    & 
    (df_raw['Context type general'].isna())]
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'shop'
    
#Spec context "kitchen" to general context "house"    
indexes = df_raw[
    (df_raw['Context type specific'] == 'kitchen')
    & 
    (df_raw['Context type general'].isna())]
for index in indexes.index:
    df_raw.at[index, 'Context type general'] = 'house'    
    
#Add consistency to a few of the values
replacements = {'Baths': 'baths',
                'unit': 'apartment'}
df_raw['Context type general'] = df_raw['Context type general'].replace(replacements)

#Fill null values
df_raw['Context type general'] = df_raw['Context type general'].fillna('unknown')

#Changing the name of the column
df_raw = df_raw.rename(index=str, columns ={
              'Context type general': 'Building Type'})
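
The repeated "specific context implies general context" loops above follow one pattern, which could be expressed as a single mapping. A minimal sketch on invented rows, using the column names from before the rename; it only fills general contexts that are missing, as the loops for 'dining room', 'shop' and 'kitchen' do.

```python
import pandas as pd

# Toy data; column names follow the raw file, values are invented
toy = pd.DataFrame({
    'Context type specific': ['kitchen', 'shop', 'dining room', 'atrium', 'kitchen'],
    'Context type general': [None, None, None, None, 'bar'],
})

# Specific contexts that imply a general context (taken from the cell above)
spec_to_gen = {'kitchen': 'house', 'dining room': 'house', 'shop': 'shop'}

# map() yields NaN for unmapped specifics; fillna() only touches missing generals
inferred = toy['Context type specific'].map(spec_to_gen)
toy['Context type general'] = toy['Context type general'].fillna(inferred)

print(toy['Context type general'].tolist())
```

The 'atrium' row stays missing and the existing 'bar' value is left alone.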

### 'Context type specific'

In [None]:
df_raw['Context type specific'].value_counts()

In [None]:
#Rows with a spec context but not a building type
df_raw['Context type specific'][
    (df_raw['Context type specific'].notna())
    & 
    (df_raw['Building Type'].isna())].value_counts()

In [None]:
#Reggio and insula for rows with a spec context but not a building type
regIns = df_raw[['Reggio', 'Insula']][
    (df_raw['Context type specific'].notna())
    & 
    (df_raw['Building Type'].isna())]

regins_tuples = []
for row in regIns.index:
    regins_tuples.append(tuple((regIns.at[row, 'Reggio'], regIns.at[row, 'Insula'])))
    
#These reggios and insulae can be looked up for building type
set(regins_tuples)    

In [None]:
#Rows with no spec context but a building type
regIns = df_raw[['Reggio', 'Insula']][
    (df_raw['Context type specific'].isna())
    & 
    (df_raw['Building Type'].notna())]

In [None]:
regIns.shape # It's probably good enough that these all have a Building Type

In [13]:
#Add consistency to values
replacements = {'façade': 'facade',
                'tablinium': 'tablinum',
                'Workshop': 'workshop'
                }

df_raw['Context type specific'] = df_raw['Context type specific'].replace(replacements)

#Fill null values
df_raw['Context type specific'] = df_raw['Context type specific'].fillna('unknown')

#Rename column
df_raw = df_raw.rename(index=str, columns ={'Context type specific': 'Position'})

### 'Famous House'

In [None]:
df_raw['Famous House'].value_counts()[60:90]

In [None]:
df_raw['Famous House'].isna().sum()

In [14]:
#Turn this column into a 0/1 for no/yes
indexes = df_raw[df_raw['Famous House'].notna()]

for index in indexes.index:
    df_raw.at[index, 'Famous House'] = 1
    
df_raw['Famous House'] = df_raw['Famous House'].fillna(0)
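
The same 0/1 conversion can be done without a loop. A short sketch on invented values (the house names are only placeholders):

```python
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({'Famous House': ['House of the Faun', None, 'Basilica']})

# notna() gives True/False per row; astype(int) turns that into 1/0
toy['Famous House'] = toy['Famous House'].notna().astype(int)

print(toy['Famous House'].tolist())  # [1, 0, 1]
```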

In [None]:
#save for later?
df_raw[(df_raw['Building Type'].isna()) & (df_raw['Famous House'] == 0)]

### Target: 'Category'

In [None]:
#Consider reclassifying all with 'beware' into a 'Warning' category
#Consider a "Greetings" category
#Consider a "Blessing" category
df_raw[df_raw['Category'] == 'Religious/Romantic']

In [None]:
df_raw['Category'].value_counts()

In [15]:
#Replace all values containing forward slashes
#Dict to fill values from 'Category'
cat_replacements = dict()
cat_values = df_raw['Category'].value_counts()

#Iterate over Category values to look for a forward slash
#Make a dict with values to replace the slash-separated values in the dataframe
#I'm choosing to take the first value from each pair
for index in cat_values.index:
        if '/' in index:
            index_split = index.split('/')
            cat_replacements[index] = index_split[0]
            
df_raw['Category'] = df_raw['Category'].replace(cat_replacements)

#The split leaves one instance of 'Political ' with a trailing space
df_raw['Category'] = df_raw['Category'].replace({'Political ': 'Political'})

In [32]:
#For now just filling empties with 'unknown'
df_raw['Category'] = df_raw['Category'].fillna('unknown')

### 'Written by'

In [143]:
df_raw['Written by'].value_counts()

unknown                                     895
Virgil                                       36
children?                                     8
Ennius                                        5
Ovid                                          5
Lucretius                                     4
woman                                         4
Virgil                                        3
woman?                                        2
Woman                                         2
two writers                                   2
by two writers                                1
multiple                                      1
at least 3 young writers                      1
? Popular poem                                1
Herodutus                                     1
last line by different hand                   1
Vergil                                        1
allusion to virgil                            1
Epaphra/Elea                                  1
Written by two writers                  

In [139]:
df_raw['Written by'] = df_raw['Written by'].fillna('unknown')

In [142]:
ovids = df_raw[df_raw['Written by'].str.contains('Ovid', regex = False, case = False)]

for index in ovids.index:
    df_raw.at[index, 'Written by'] = 'Ovid'

In [None]:
#Can be dropped
#df_raw.drop(['Written by'], axis = 1, inplace = True)

### 'Work', 'Meter', and 'Repetition'

In [None]:
#Change meter to 0/1 no/yes  
indexes = df_raw[df_raw['Meter'].notna()]
for index in indexes.index:
    df_raw.at[index, 'Meter'] = 1   
    
df_raw['Meter'] = df_raw['Meter'].fillna(0)
    
#Change name of 'Meter' to 'Literary'
df_raw = df_raw.rename(index=str, columns ={'Meter': 'Literary'})

#Work and Repetition can be dropped
df_raw.drop(['Work', 'Repetition'], axis = 1, inplace = True)

### 'Foreign language'

In [None]:
df_raw['Foreign language'].value_counts()

In [None]:
df_raw['Foreign language'].isna().sum()

In [16]:
#Can be dropped
df_raw.drop(['Foreign language'], axis = 1, inplace = True)

### 'Image '

In [None]:
#Interesting; further investigation could help with categorizing, but will drop for now
#df_raw['Image '].value_counts()

In [None]:
df_raw.drop(['Image '], axis = 1, inplace = True)

In [None]:
#df_raw = df_raw.rename(index=str, columns = {'Image ': 'Image'})

### 'Flohr Score'

In [None]:
df_raw['Flohr Score'].value_counts()

In [None]:
#Use a loop later
df_raw['Flohr Score'] = df_raw['Flohr Score'].replace({'1.69-1.94': 1.69,
                                                      '3.44-4.52': 3.44,
                                                      '2.96-3.15': 2.96,
                                                      '17-0': 17.0})

In [None]:
df_raw['Flohr Score'] = df_raw['Flohr Score'].astype('float')

In [None]:
#Just going to put zeros for now
df_raw['Flohr Score'] =df_raw['Flohr Score'].fillna(0)
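
The "use a loop later" note above could also be handled without listing each range by hand. A sketch, assuming the rule stays "keep the number before the hyphen" and that the column mixes range strings, plain numbers and nulls; the sample values are in the style of the ones replaced above.

```python
import pandas as pd

# Sample values in the style of the column: ranges, a plain number, a null
scores = pd.Series(['1.69-1.94', '17-0', 5.54, None])

# Cast to text, keep what precedes the first hyphen, then convert to numbers;
# anything that is not a number (such as a null) becomes NaN
lower = pd.to_numeric(scores.astype(str).str.split('-').str[0], errors='coerce')

print(lower.tolist())
```

Scores cannot be negative here, so a leading minus sign is not a concern for the split.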

### 'Socio-economic status'

In [None]:
df_raw['Socio-economic status'].value_counts()

In [None]:
df_raw['Socio-economic status'] = df_raw['Socio-economic status'].fillna('medium')

### 'Comments'

In [None]:
#Interesting; further investigation could help with categorizing, but will drop for now
df_raw.drop(['comments'], axis = 1, inplace = True)

<a id = 'explore'></a>

[(Back to top)](#top)

# Exploratory data analysis

What is a reggio?

What is an insula?

From https://sites.google.com/site/ad79eruption/pompeii/map-of-pompeii

"Pompeii, however, has an additional level of numbering. It has been divided firstly into 9 regions (Regio), numbered in Roman numerals. Each of these regions contains several Insulae which are numbered 1, 2 3, etc. As with Herculaneum, each building within an insula has its own entrance number, again numbered 1, 2, 3 etc. For example, the House of Trebius Valens is labelled (Reg III, Ins 2, 1)."

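To make the numbering scheme in the quotation concrete, here is a small sketch that turns the dataset's 'Reggio', 'Insula' and 'Entrance' values into the "(Reg III, Ins 2, 1)" style label. `format_address` and `ROMAN` are hypothetical names introduced only for this illustration.

```python
# Roman numerals for the nine regions mentioned in the quotation
ROMAN = {1: 'I', 2: 'II', 3: 'III', 4: 'IV', 5: 'V',
         6: 'VI', 7: 'VII', 8: 'VIII', 9: 'IX'}

def format_address(reggio, insula, entrance):
    """Build a '(Reg III, Ins 2, 1)' style label (hypothetical helper)."""
    return '(Reg {}, Ins {}, {})'.format(ROMAN[int(reggio)], int(insula), entrance)

# The House of Trebius Valens example from the quotation
print(format_address(3, 2, 1))  # (Reg III, Ins 2, 1)
```

Rows where 'Reggio' was filled with 0 have no Roman numeral and would need to be skipped.
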
In [33]:
df_raw['Category'].value_counts()

Social       225
Sexual       199
Insult       160
Civic         84
Reference     73
Romantic      62
Tagging       61
Religious     60
Violence      35
Political     23
unknown       16
Name: Category, dtype: int64

In [134]:
df_raw[df_raw['Category'] == 'Social'][200:]

Unnamed: 0,CIL IV #,Reggio,Insula,Entrance,found?,In English,Literacy,Position,Building Type,Famous House,Category,Written by,Work,Meter,Repetition,Image,Flohr Score,Socio-economic status,comments
927,4597.0,6.0,15.0,1,,Greetings! Greetings!,2.0,peristyle,house,1,Social,,,,,,39.4,medium,
928,4596.0,6.0,15.0,1,,"Vitalio, hi! Actius (sends) Cossinia his mothe...",2.0,peristyle,house,1,Social,,,,,,39.4,medium,
933,4611.0,6.0,15.0,2,,Send my best to!,2.0,peristyle,unknown,0,Social,,,,,,,low,
937,4826.0,7.0,15.0,7,,What have I to do with….,3.0,atrium,tabernae,0,Social,,,,,,8.21,low,
950,6817.0,6.0,16.0,7,,Campylus sends greetings to Poppaea,2.0,entrance,house,1,Social,,,yes,,,2.01,low,
955,9143.0,7.0,16.0,20,,Greetings to Pompeians Everywhere,2.0,atrium,house,1,Social,,,,,,5.54,low,
956,,7.0,16.0,20,,Hello!,2.0,entrance,house,1,Social,,,,,,5.54,low,
957,,7.0,16.0,20,,Good luck to Rufo,2.0,dining room,house,1,Social,,,,,,5.54,low,
959,,7.0,16.0,20,,Having obtained the opportunity…I have not let...,3.0,dining room,house,1,Social,,,,,,5.54,low,
960,,7.0,16.0,20,,Lyaeus writes most amicably to Fabius Rufus,3.0,dining room,house,1,Social,,,,,,5.54,low,


the girls 6

slave 25


Nero 16


Love:
    'venus'

Hello and Goodbye:
    'greet' (69)
    'bye' (53), some lewd

Glory:
    'soldier'
    'fight'
    'victor'
    'mars' 4

Leave most 'Insult'

"Lewd": 
    'suck' 49
    'fuck'
    'cunt'
    'cock'
    'bugger'
    'faggot' 15
    
Blessing:
    'best wishes'
    'favor'
    'favour'
    'good luck'
    'bravo' 5
    happy 10
    
Curse:
    All of insult?
    'beware' (not very many)
    'anger'
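
Counts like the ones in these notes can be gathered in one pass. A minimal sketch, assuming the keyword groups are still provisional; the texts are invented stand-ins for `df_raw['In English']`, and only a few of the keywords are shown.

```python
import pandas as pd

# Invented sample texts; the real ones live in df_raw['In English']
texts = pd.Series(['Greetings to Rufus', 'Good luck to you', 'Beware of the dog', None])

# A few keyword groups from the notes above (group names are provisional)
keywords = {'Hello and Goodbye': ['greet', 'bye'],
            'Blessing': ['good luck', 'best wishes', 'bravo'],
            'Curse': ['beware', 'anger']}

counts = {}
for group, words in keywords.items():
    mask = pd.Series(False, index=texts.index)
    for word in words:
        # Case-insensitive literal match; na=False treats missing text as no match
        mask |= texts.str.contains(word, case=False, regex=False, na=False)
    counts[group] = int(mask.sum())

print(counts)  # {'Hello and Goodbye': 1, 'Blessing': 1, 'Curse': 1}
```

Short keywords such as 'bye' also match inside longer words, so the matches are worth reading before any rows are recategorized.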

In [132]:
df_raw[df_raw['In English'].str.contains('dying', regex = False, case = False)]

Unnamed: 0,CIL IV #,Reggio,Insula,Entrance,found?,In English,Literacy,Position,Building Type,Famous House,Category,Written by,Work,Meter,Repetition,Image,Flohr Score,Socio-economic status,comments
644,9054,7.0,9.0,1,yes,I'm dying of love for you…lover..I'm consumed ...,3.0,unknown,building,1,Romantic,,,,,,,high,
768,2258,7.0,12.0,18,,"Africanus is dying. A boy writes this, Rusticu...",3.0,unknown,brothel,0,Social,,,,,,5.59,low,


In [None]:
df_raw()

ValueError: cannot index with vector containing NA / NaN values
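
This message is what older pandas versions typically raise when a DataFrame is indexed with a boolean mask that contains missing values, for example a `.str.contains()` result on a text column that still has nulls. A sketch of the usual remedy, on an invented series:

```python
import pandas as pd

# Invented sample with one missing text
texts = pd.Series(['dying of love', None, 'hello'])

# Without na=..., older pandas leaves the missing entry as NaN in the mask
raw_mask = texts.str.contains('dying', regex=False)
print(raw_mask.tolist())

# With na=False the mask is purely boolean and safe to index with
safe_mask = texts.str.contains('dying', regex=False, na=False)
print(texts[safe_mask].tolist())  # ['dying of love']
```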

<a id = 'model1'></a>

[(Back to top)](#top)

# Modeling: Model \#1

In [None]:
df_raw.info()

In [None]:
#Remaining empty reggio and insula rows
#'Context type general' was renamed to 'Building Type' during cleaning
indexes = df_raw[
    (df_raw['Reggio'].isna())
    & 
    (df_raw['Building Type'].isna())]
for index in indexes.index:
    df_raw.at[index, 'Building Type'] = 'house'    

In [None]:
empties = df_raw[df_raw['Reggio'].isna()]

In [None]:
df_raw.drop(['In English'], axis = 1, inplace = True)

## Logistic regression in progress

In [None]:
#Make categories
df_raw['Entrance'] = df_raw['Entrance'].astype('category')
df_raw['Position'] = df_raw['Position'].astype('category')
df_raw['Building Type'] = df_raw['Building Type'].astype('category')
df_raw['Famous House'] = df_raw['Famous House'].astype('bool').astype('category')
df_raw['Literary'] = df_raw['Literary'].astype('bool').astype('category')
df_raw['Socio-economic status'] = df_raw['Socio-economic status'].astype('category')

In [None]:
#Get dummies
#entrance_dummies = pd.get_dummies(df_raw['Entrance'], prefix = 'Entrance')
position_dummies = pd.get_dummies(df_raw['Position'], prefix = 'Position')
build_type_dummies = pd.get_dummies(df_raw['Building Type'], prefix = 'Building_Type')
famous_dummies = pd.get_dummies(df_raw['Famous House'], prefix = 'Famous')
literary_dummies = pd.get_dummies(df_raw['Literary'], prefix = 'Literary')
econ_status_dummies = pd.get_dummies(df_raw['Socio-economic status'], prefix = 'Econ_Status')

In [None]:
#What if I get dummies for y
category_dummies = pd.get_dummies(df_raw['Category'], prefix = 'Category')

In [None]:
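#'Written by' is also dropped if it is still present, since its free-text values cannot be fed to the model
X = df_raw.drop(['CIL IV #', 'Entrance', 'Position', 'Building Type', 'Famous House', 'Literary', 'Category', 'Socio-economic status', 'Written by'], axis = 1, errors = 'ignore')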
X = pd.concat([X, position_dummies, build_type_dummies, famous_dummies, literary_dummies, econ_status_dummies], axis = 1)

In [None]:
#105 columns
X.head()

In [None]:
y = df_raw['Category']

In [None]:
#from sklearn doc
#from sklearn.preprocessing import label_binarize

#y = label_binarize(y, classes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
#from sklearn doc
#Number of classes for which to get ROCs
#n_classes = y.shape[1]

In [None]:
#Create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)
#print(y_train.value_counts(),'\n', y_test.value_counts())

In [None]:
#Build a logistic regression model
logreg = LogisticRegression(fit_intercept=False, C=1e16)
logreg.fit(X_train, y_train)

In [None]:
y_score = logreg.decision_function(X_test)

In [None]:
y_score.shape

In [None]:
#This works
y_hat = logreg.predict(X_train)

In [None]:
#Create a confusion matrix with the results (true labels first, then predictions)
conf_matrix = confusion_matrix(y_train, y_hat)
#Create labels for the classes, sorted to match the matrix's row/column order
class_names = sorted(set(y))
#Draw a figure
plt.figure(figsize = (12,8))
#Call the custom function to draw the conf matrix
plot_conf_matrix(conf_matrix, classes = class_names)

Even on the training data, this model does passably only on the Romantic, Political, Religious, and Sexual categories.
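
One way to put numbers on "passably" is per-class recall, read off the confusion matrix. A sketch with invented labels (not the notebook's predictions), assuming true labels are passed first so that rows are true classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels purely for illustration
y_true = ['Social', 'Social', 'Sexual', 'Insult', 'Insult', 'Insult']
y_pred = ['Social', 'Insult', 'Sexual', 'Insult', 'Insult', 'Social']
labels = sorted(set(y_true))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class = correct predictions / number of true members
recall = cm.diagonal() / cm.sum(axis=1)
for name, value in zip(labels, recall):
    print(name, round(float(value), 2))
```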

In [None]:
y_train.value_counts()

<a id = 'model2'></a>

[(Back to top)](#top)

# Model \#2

## Decision tree in progress

In [None]:
df = copy.deepcopy(df_raw)

In [None]:
# Create label encoder instance
lb = LabelEncoder() 

In [None]:
df.info()

In [None]:
# Create Numerical labels for classes
df['Reggio_'] = lb.fit_transform(df['Reggio'])
df['Insula_'] = lb.fit_transform(df['Insula'])
df['Literacy_'] = lb.fit_transform(df['Literacy'])
df['Position_'] = lb.fit_transform(df['Position'])
df['Build_Type_'] = lb.fit_transform(df['Building Type'])
df['Famous_'] = lb.fit_transform(df['Famous House'])
df['Literary_'] = lb.fit_transform(df['Literary'])
df['Econ_Status_'] = lb.fit_transform(df['Socio-economic status'])
#the target
df['Category_'] = lb.fit_transform(df['Category'])

In [None]:
#Sorted so the names line up with LabelEncoder's class order
class_names = sorted(set(df['Category']))

In [None]:
# Split features and target variable
X = df[['Reggio_', 'Insula_', 'Literacy_', 'Position_', 'Build_Type_', 'Famous_', 'Literary_', 'Econ_Status_']]
y = df['Category_']

In [None]:
# Instantiate a one hot encoder
enc = OneHotEncoder()

In [None]:
# Fit the feature set X
enc.fit(X)

In [None]:
# Transform X to onehot array 
onehotX = enc.transform(X).toarray()

onehotX, onehotX.shape, X.shape
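
As a possible simplification, `OneHotEncoder` in scikit-learn 0.20 and later accepts string columns directly, so the `LabelEncoder` step is not strictly needed for the features. A sketch on invented rows, using two of the notebook's column names:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented rows using two of the notebook's column names
toy = pd.DataFrame({'Position': ['atrium', 'peristyle', 'atrium'],
                    'Building Type': ['house', 'house', 'bar']})

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
onehot = enc.fit_transform(toy).toarray()

print(onehot.shape)     # (3, 4)
print(enc.categories_)  # one array of category names per column
```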

In [None]:
# Create a 70/30 split
X_train, X_test, y_train, y_test = train_test_split(onehotX, y, test_size = 0.3, random_state = 12)

In [None]:
# Train the classifier and make predictions
clf = DecisionTreeClassifier(criterion = 'entropy')
clf.fit(X_train,y_train) 
y_hat = clf.predict(X_test)

In [None]:
# Calculate Accuracy 
acc = accuracy_score(y_test, y_hat) * 100
print("Accuracy is :{0}".format(acc))

In [None]:
# Check the AUC for predictions
#false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
#roc_auc = auc(false_positive_rate, true_positive_rate)
#print("\nAUC is :{0}".format(round(roc_auc,2)))

In [None]:
#Create a confusion matrix with the results (true labels first, then predictions)
conf_matrix = confusion_matrix(y_test, y_hat)
#Create labels for the classes in the conf matrix
#or use labels created before y is encoded
#Draw a figure
plt.figure(figsize = (12,8))
#Call the custom function to draw the conf matrix
plot_conf_matrix(conf_matrix, classes = class_names)

This model performed about as poorly as the logistic regression in Model \#1.
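
A useful yardstick for any accuracy figure here is the majority-class baseline: the score from always predicting the most common category. A sketch using the class counts printed in the exploratory section; these are counts for the whole dataset, so the baseline on a particular test split will differ slightly.

```python
# Class counts copied from the value_counts() output in the EDA section
counts = {'Social': 225, 'Sexual': 199, 'Insult': 160, 'Civic': 84,
          'Reference': 73, 'Romantic': 62, 'Tagging': 61, 'Religious': 60,
          'Violence': 35, 'Political': 23, 'unknown': 16}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline = counts[majority_class] / total * 100

print(majority_class, total, round(baseline, 1))  # Social 998 22.5
```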

In [None]:
#And now an attempt to use Graph Viz
# Visualize the decision tree using graph viz library 
dot_data = StringIO()

In [None]:
#Feeds from decision tree classifier instantiated above
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,special_characters=True)

In [None]:
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 

In [None]:
Image(graph.create_png())
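
If Graphviz or pydotplus is not installed, a plain-text view of the tree is an alternative. A sketch assuming scikit-learn 0.21 or later, which provides `export_text`; the data and feature names below are placeholders, not the notebook's.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny invented dataset: two binary features, two classes
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = [0, 0, 1, 1]

toy_clf = DecisionTreeClassifier(criterion='entropy', random_state=12)
toy_clf.fit(X_toy, y_toy)

# feature_names here are placeholders, not columns from the notebook
rules = export_text(toy_clf, feature_names=['feature_a', 'feature_b'])
print(rules)
```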

<a id = 'model3'></a>

[(Back to top)](#top)

# Model \#3

<a id = 'concl'></a>

[(Back to top)](#top)

# Conclusions