In [1]:
import numpy as np

In [2]:
import pandas as pd

Now that I have imported pandas and numpy, I will import what I need to process my data and create a tree from it.

In [3]:
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn import metrics  # used to calculate how accurate the model is
from sklearn import preprocessing  # provides LabelEncoder

Reading my Shakespeare csv into a dataframe

In [4]:
Shakespeare = pd.read_csv('Shakespeare_data.csv')
Shakespeare.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


Feature engineering can be used to ease access to this data set in a number of ways, for example:
1) The Dataline column can be removed, as all it does is track the number of lines already in the table. This is redundant, because pandas already assigns a row index when it reads in the CSV.

2) I can remove all entries where the Player value is NaN, as these rows do not help us determine player line patterns.

3) I can also break apart ActSceneLine into 3 separate columns, so that the numerical value of each can be used.

Below I am dropping the Dataline column, the PlayerLine column, and all rows that contain a NaN value.

Next I split up ActSceneLine into 3 groups (Act, Scene, Line) and push them back onto my table in their own columns. Once I have done this, I can delete the ActSceneLine column as it is no longer needed.

I also remove the PlayerLine column. While the information found in this column could be used to help classify players, it would take extensive language-processing work to learn how each player speaks, and this data set would likely be too small for language patterns to be valuable in classifying players. It was easier to classify players based on when, and in which play, they speak rather than on what they actually say.

In [5]:
Shakespeare = Shakespeare.drop(columns=['Dataline'])
Shakespeare = Shakespeare.dropna()
Shakespeare = Shakespeare.drop(columns=['PlayerLine'])
ASL = Shakespeare['ActSceneLine'].str.split(pat='.', n=2, expand=True)
Shakespeare['act'] = ASL[0]
Shakespeare['scene'] = ASL[1]
Shakespeare['line'] = ASL[2]
Shakespeare = Shakespeare.drop(columns=['ActSceneLine'])
Shakespeare.head()

Unnamed: 0,Play,PlayerLinenumber,Player,act,scene,line
3,Henry IV,1.0,KING HENRY IV,1,1,1
4,Henry IV,1.0,KING HENRY IV,1,1,2
5,Henry IV,1.0,KING HENRY IV,1,1,3
6,Henry IV,1.0,KING HENRY IV,1,1,4
7,Henry IV,1.0,KING HENRY IV,1,1,5
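
A note on the split above: `str.split` returns strings, so the new act, scene, and line columns hold text such as '1' rather than integers. The sketch below shows one way to cast them explicitly. It uses a small made-up frame (`toy`) rather than the real CSV, and the cast itself is an assumption about what is wanted here, not a step from the notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "Player": ["KING HENRY IV", "KING HENRY IV", "FALSTAFF"],
    "ActSceneLine": ["1.1.1", "1.1.2", "2.3.10"],
})

# Split on the literal '.' into at most three parts, one column per part.
parts = toy["ActSceneLine"].str.split(pat=".", n=2, expand=True)
parts.columns = ["act", "scene", "line"]

# The pieces are strings at this point, so cast them to integers.
parts = parts.astype(int)

# Replace the combined column with the three numeric columns.
toy = pd.concat([toy.drop(columns=["ActSceneLine"]), parts], axis=1)
print(toy.dtypes)
```

With integer columns, a line value of 10 sorts after 2, which is not the case when the values are compared as strings.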


Below I am using a label encoder to transform the plays and the players into numerical values so that my decision tree can make decisions about them.

In [6]:
label_encoder = preprocessing.LabelEncoder()
Shakespeare['Play'] = label_encoder.fit_transform(Shakespeare['Play'])
Shakespeare['Player'] = label_encoder.fit_transform(Shakespeare['Player'])
Shakespeare.head()

Unnamed: 0,Play,PlayerLinenumber,Player,act,scene,line
3,9,1.0,457,1,1,1
4,9,1.0,457,1,1,2
5,9,1.0,457,1,1,3
6,9,1.0,457,1,1,4
7,9,1.0,457,1,1,5
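
One thing to be aware of in the cell above: a `LabelEncoder` only remembers the classes from its most recent fit, so after the second `fit_transform` the same object can no longer map Play codes back to play names. A minimal sketch of one alternative, using one encoder per column on made-up toy data (`toy`, `play_encoder`, and `player_encoder` are names I am introducing for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only; not taken from the Shakespeare CSV.
toy = pd.DataFrame({
    "Play": ["Henry IV", "Hamlet", "Henry IV"],
    "Player": ["KING HENRY IV", "HAMLET", "FALSTAFF"],
})

# One encoder per column, so each mapping stays available afterwards.
play_encoder = LabelEncoder()
player_encoder = LabelEncoder()
toy["Play"] = play_encoder.fit_transform(toy["Play"])
toy["Player"] = player_encoder.fit_transform(toy["Player"])

# Codes can be turned back into names with inverse_transform.
decoded_plays = play_encoder.inverse_transform(toy["Play"])
decoded_players = player_encoder.inverse_transform(toy["Player"])
print(list(decoded_players))
```

Keeping the player encoder around would make it possible to report predictions as character names rather than integer codes.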


Below I am separating my columns into 2 groups, my independent variables which my tree will use to determine the player (play, player line number, act, scene, line) and the dependent variable which I am trying to predict (player).

In [7]:
X = Shakespeare[['Play', 'PlayerLinenumber', 'act', 'scene', 'line']]
Y = Shakespeare[['Player']]

Now I will split my data into a training and test set:

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
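
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical stand-ins for X and Y). It shows that 30% of the rows go to the test set and that fixing `random_state` gives the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows with two features each; values are illustrative only.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same arguments as the cell above, applied to the toy arrays.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

# Repeating the call with the same random_state reproduces the split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

print(len(X_tr), len(X_te))
print((y_te == y_te2).all())
```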

In [9]:
# Create the decision tree classifier object
clf = DecisionTreeClassifier()

# Train the decision tree classifier
clf = clf.fit(X_train, Y_train)

# Predict the response for the test data set
y_pred = clf.predict(X_test)

In [10]:
print("Accuracy:",metrics.accuracy_score(Y_test, y_pred))

Accuracy: 0.773632156216319


My accuracy of ~0.77 means that roughly 77% of the lines in the test set were assigned to the correct player by my decision tree, which was trained on my training set (70% of my CSV data) and then tested on the remaining 30%. While this accuracy is good for a first attempt, I believe results could be improved if natural language processing were used to evaluate each character's lines, so that the text of a line could also be used when classifying characters. Because my program currently classifies characters based on when they speak rather than what they say, the ~23% of misclassified lines are likely the result of characters speaking out of turn, or reentering a scene they did not play a large role in.
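
One way to put the ~0.77 figure in context is to compare it against a trivial baseline that always predicts the most frequent player in the training set. The sketch below uses scikit-learn's `DummyClassifier` on made-up toy labels (the `*_toy` arrays are hypothetical, not taken from the Shakespeare data), so the number it prints says nothing about the real data set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels for illustration only; class 0 is the most frequent in training.
y_train_toy = np.array([0, 0, 0, 1, 2])
y_test_toy = np.array([0, 1, 0, 2])

# The baseline ignores the features, so placeholder columns are enough.
X_train_toy = np.zeros((5, 1))
X_test_toy = np.zeros((4, 1))

# Always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print("Baseline accuracy:", baseline_acc)
```

On the real data, the same comparison would pass X_train, Y_train, X_test, and Y_test in place of the toy arrays. The gap between the baseline and the tree's accuracy indicates how much the position features contribute.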