# Shakespeare Player Classification #

In this project we'll try to determine which Shakespearean character is speaking during any given line using only information about the play and the line itself. To accomplish this, we'll engineer a few features on the given dataset and also compare a few classifiers (SVM, K-Nearest Neighbors, and Decision Tree).

Sources:
- Shakespeare Line Data: https://www.kaggle.com/kingburrito666/shakespeare-plays/data


## Read and clean data
First we'll read in our dataset and throw out values that are either misformatted or not relevant to our project.

In [1]:
#But imports first!
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from numpy import concatenate, hstack, unique
import scipy.sparse as sparse
from sklearn.metrics import f1_score, classification_report, confusion_matrix


raw_data = pd.read_csv('./data/shakespeare_data.csv')
raw_data.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


## Dataset info
**Column definitions** 

| Column | Description |
| :--- | :--- |
| `Dataline` | Unique, sequential ID for each row |
| `Play` | Play the line is from |
| `PlayerLinenumber` | Number of the speech the line belongs to (consecutive lines of one speech share it) |
| `ActSceneLine` | Act, scene, and line number of the line, formatted as `Act.Scene.Line` |
| `Player` | Character speaking the line |
| `PlayerLine` | Text of the line being spoken |

## Cleaning Data
#### Narrator
There seem to be narrator and stage-direction lines throughout our dataset with `Player` set to `NaN`. While we could reassign these to a value such as `NARRATOR`, we'll instead remove them, since the narrator is quite different from all of the other characters across the dataset.

#### Dataline Column
The `Dataline` column isn't contributing anything to our dataset; it's just a sequential ID. We'll drop it so the dataset contains only the info that matters.

#### Stage Directions & Scene Info
Many lines contain information about the setting, stage directions, and scene info. These typically have null values in at least one of the `PlayerLinenumber`, `ActSceneLine`, or `Player` columns. We'll remove these lines as well.


In [2]:
#These are the rows with missing values that we'll take out!
raw_data.query("ActSceneLine.isnull() or Player.isnull() or PlayerLinenumber.isnull()", engine = 'python')

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
111,112,Henry IV,10.0,,WESTMORELAND,Exeunt
112,113,Henry IV,10.0,,WESTMORELAND,SCENE II. London. An apartment of the Prince's.
...,...,...,...,...,...,...
111208,111209,A Winters Tale,38.0,,Clown,"Enter LEONTES, POLIXENES, FLORIZEL, PERDITA, C..."
111232,111233,A Winters Tale,4.0,,PAULINA,"PAULINA draws a curtain, and discovers HERMION..."
111330,111331,A Winters Tale,30.0,,PAULINA,Music
111336,111337,A Winters Tale,30.0,,PAULINA,HERMIONE comes down


In [3]:
#Cleaned dataset!
lines = raw_data.query("ActSceneLine.notnull() and Player.notnull() and PlayerLinenumber.notnull()", engine = 'python').copy()
del lines['Dataline']
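
The same filter can also be written with `dropna`. The snippet below is a minimal sketch on a made-up four-row frame (the `toy` data and the `required` list are illustrative, not part of the notebook), showing that rows missing any of the three columns are dropped.

```python
import pandas as pd

# Toy frame mimicking the raw columns; the values are made up for illustration.
toy = pd.DataFrame({
    "Play": ["Henry IV"] * 4,
    "PlayerLinenumber": [None, 1.0, 1.0, 10.0],
    "ActSceneLine": [None, "1.1.1", "1.1.2", None],
    "Player": [None, "KING HENRY IV", "KING HENRY IV", "WESTMORELAND"],
    "PlayerLine": ["ACT I", "So shaken as we are,", "Find we a time", "Exeunt"],
})

# Keep only rows where all three columns are present.
required = ["PlayerLinenumber", "ActSceneLine", "Player"]
spoken = toy.dropna(subset=required).copy()

print(spoken[["Player", "PlayerLine"]])
```

Only the two spoken lines survive; the act heading and the `Exeunt` stage direction are removed because at least one required column is missing.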

### Creating usable data
Let's give each Act, Scene, and Line its own column and convert them to integers. We'll also record each line's length and encode each play using one-hot encoding.

In [4]:
lines.head()

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
3,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil


In [5]:
#Create new columns for act, scene, and line
lines['Act'] = lines.apply(lambda row : int(str(row.ActSceneLine).split('.')[0]), axis=1)
lines['Scene'] = lines.apply(lambda row : int(str(row.ActSceneLine).split('.')[1]), axis=1)
lines['Line'] = lines.apply(lambda row : int(str(row.ActSceneLine).split('.')[2]), axis=1)
lines.drop(columns=["ActSceneLine"], inplace=True)
lines['LineLength'] = lines.apply(lambda row : len(row.PlayerLine), axis=1)

# One hot encode the Play name
lines = pd.get_dummies(lines, columns = ["Play"])

# Standardize PlayerLine (remove punctuation & all lowercase)
# The class uses A-Z, not A-z: the range A-z would also keep [ \ ] ^ _ and `
lines['PlayerLine'] = lines['PlayerLine'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x).lower())

# Show us that progress!
lines.head()

Unnamed: 0,PlayerLinenumber,Player,PlayerLine,Act,Scene,Line,LineLength,Play_A Comedy of Errors,Play_A Midsummer nights dream,Play_A Winters Tale,...,Play_Richard III,Play_Romeo and Juliet,Play_Taming of the Shrew,Play_The Tempest,Play_Timon of Athens,Play_Titus Andronicus,Play_Troilus and Cressida,Play_Twelfth Night,Play_Two Gentlemen of Verona,Play_macbeth
3,1.0,KING HENRY IV,so shaken as we are so wan with care,1,1,1,38,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1.0,KING HENRY IV,find we a time for frighted peace to pant,1,1,2,42,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1.0,KING HENRY IV,and breathe shortwinded accents of new broils,1,1,3,46,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1.0,KING HENRY IV,to be commenced in strands afar remote,1,1,4,39,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1.0,KING HENRY IV,no more the thirsty entrance of this soil,1,1,5,41,0,0,0,...,0,0,0,0,0,0,0,0,0,0
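
As an alternative to the row-wise `apply` calls, the `Act.Scene.Line` split can be done in one vectorized step. This is a sketch on a made-up three-row frame (`toy` and `parts` are illustrative names), not a replacement the notebook itself uses.

```python
import pandas as pd

# Toy data: the ActSceneLine values are made up for illustration.
toy = pd.DataFrame({
    "Play": ["Henry IV", "Henry IV", "macbeth"],
    "ActSceneLine": ["1.1.1", "1.1.2", "3.2.15"],
})

# Split "Act.Scene.Line" into three columns at once and convert to integers.
parts = toy["ActSceneLine"].str.split(".", expand=True).astype(int)
parts.columns = ["Act", "Scene", "Line"]

# Replace the combined column with the three new ones.
toy = toy.drop(columns=["ActSceneLine"]).join(parts)
print(toy)
```

On a frame with over 100,000 rows this usually runs noticeably faster than three separate `apply(..., axis=1)` passes.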


### Bag of Words 

In order to give a numerical representation to our `PlayerLine` column, we'll be using the Bag of Words approach described here: https://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation

In [6]:
# Init the Vectorizer
vectorizer = CountVectorizer()

# Vectorize the PlayerLines
vectorized_lines = vectorizer.fit_transform(lines['PlayerLine'])
vectorized_lines

<105152x27199 sparse matrix of type '<class 'numpy.int64'>'
	with 729870 stored elements in Compressed Sparse Row format>
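
To make the representation concrete, here is a small sketch that vectorizes just two of the cleaned lines shown earlier. The names `toy_lines`, `toy_vectorizer`, and `toy_counts` are illustrative; the default `CountVectorizer` settings are assumed, as in the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two cleaned lines taken from the head of the dataset.
toy_lines = [
    "so shaken as we are so wan with care",
    "find we a time for frighted peace to pant",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_lines)

# vocabulary_ maps each word to its column index in the count matrix.
vocab = toy_vectorizer.vocabulary_

print(toy_counts.shape)               # (number of lines, number of distinct words)
print(toy_counts[0, vocab["so"]])     # "so" appears twice in the first line
print("a" in vocab)                   # single-character tokens are dropped by default
```

Each row is one line and each column is one word, so the full matrix above has one column per distinct word across all plays.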

In [7]:
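# Numeric features: everything except the label and the raw text
features = lines.drop(columns=["Player", "PlayerLine"])
# Cast to float so the mixed numeric and dummy columns form one numeric block,
# then append the bag-of-words counts column-wise
complete_features = sparse.hstack([sparse.csr_matrix(features.astype(float)), vectorized_lines], format="csr")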
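
The column-wise stacking can be checked on a tiny example. This sketch uses made-up numbers (`dense_part` and `word_counts` are illustrative stand-ins for the numeric features and the bag-of-words matrix).

```python
import numpy as np
import scipy.sparse as sparse

# Two rows of made-up numeric features, e.g. Line and LineLength.
dense_part = np.array([[1.0, 38.0], [2.0, 42.0]])

# Two rows of made-up word counts over a three-word vocabulary.
word_counts = sparse.csr_matrix(np.array([[0, 2, 1], [1, 0, 0]]))

# Stack side by side: rows stay aligned, columns are concatenated.
combined = sparse.hstack([sparse.csr_matrix(dense_part), word_counts], format="csr")

print(combined.shape)
print(combined.toarray())
```

The result keeps one row per line, with the numeric columns first and the word-count columns after them, which is why both inputs must have the same number of rows in the same order.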

## Classification
To show the added value of the features we've added, let's try to classify the players with a few different methods, using only the data we have been provided.

We'll try an SVM, a KNN model, and a Decision Tree to see what kind of results we get. For KNN, we'll choose K to be the number of distinct players in our dataset:

In [8]:
# Find the K for K nearest neighbors
players = lines['Player'].unique()
K = len(players)
K

934
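
Note that `n_neighbors` is the number of neighbors that vote on each prediction, not the number of classes. The toy sketch below (made-up one-dimensional data, illustrative names) shows how a very large value can pull predictions toward the most frequent label; it is not a claim about which K suits this dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six points of class "A" near 0 and two points of class "B" near 10.
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [10.0], [10.1]])
y = np.array(["A"] * 6 + ["B"] * 2)

# A query point sitting right between the two "B" points.
query = np.array([[10.05]])

# With one neighbor, the closest point decides.
near = KNeighborsClassifier(n_neighbors=1).fit(X, y).predict(query)[0]

# With every training point voting, the majority class wins regardless of distance.
far = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y).predict(query)[0]

print(near, far)
```

With one neighbor the query is labeled `B`; with all eight points voting it is labeled `A`, even though every `A` point is far away.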

In [9]:
from sklearn.model_selection import train_test_split

#Define our labels & features, then split into training and test data
labels = lines["Player"] #Player is our label of interest (1-D Series; a one-column DataFrame triggers a DataConversionWarning in fit)
feature_train, feature_test, label_train, label_test = train_test_split(complete_features, labels, test_size=0.20, random_state=27)

### Initialize, Train, and Test the Models
Now we'll train and test our models: an SVM, K-Nearest Neighbors, and a Decision Tree. Our test data is 20% of our total dataset.

To get a better understanding of our results, we'll also look at the confusion matrix for our SVM (`SVC`) model and the classification report for our KNN model (K = 934).

In [10]:
from sklearn.svm import SVC
#Initialize the model
SVM_model = SVC()

#Training
SVM_model.fit(feature_train, label_train)

#Predict
SVM_prediction = SVM_model.predict(feature_test)



In [11]:
from sklearn.tree import DecisionTreeClassifier
#Initialize the model
DT_model = DecisionTreeClassifier()

#Training
DT_model.fit(feature_train, label_train)

#Predict
DT_prediction = DT_model.predict(feature_test)

In [12]:
from sklearn.neighbors import KNeighborsClassifier
#Initialize the model
KNN_model = KNeighborsClassifier(n_neighbors=K)

#Training
KNN_model.fit(feature_train, label_train)

#Predict
KNN_prediction = KNN_model.predict(feature_test)

  


In [13]:
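#View our results for SVM
#Note: scikit-learn documents the argument order as (y_true, y_pred); predictions are passed first here,
#so the 'weighted' average is weighted by predicted-label counts and the matrix rows are predictions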
print(f1_score(SVM_prediction, label_test, average = 'weighted', labels = unique(SVM_prediction)))
print(confusion_matrix(SVM_prediction, label_test))

0.05791334695723195
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
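
The scoring calls in this notebook pass the predictions as the first argument, while scikit-learn documents the order as `f1_score(y_true, y_pred)`. The sketch below uses four made-up labels to show that the order matters for the weighted average: per-class F1 is unchanged by the swap, but the class weights come from whichever array is passed first.

```python
from sklearn.metrics import f1_score

# Made-up labels: three true "a" lines and one true "b" line.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]

# Documented order: class weights come from the true label counts (3 and 1).
expected_order = f1_score(y_true, y_pred, average="weighted")

# Swapped order: class weights come from the predicted label counts (2 and 2).
swapped_order = f1_score(y_pred, y_true, average="weighted")

print(round(expected_order, 4), round(swapped_order, 4))
```

Here the two calls give roughly 0.7667 and 0.7333, so scores computed with swapped arguments are not directly comparable to scores computed in the documented order.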


In [16]:
#View our results for Decision Tree Classification
#Note: predictions are passed first here; scikit-learn documents the order as (y_true, y_pred)
print(f1_score(DT_prediction, label_test, average = 'weighted', labels = unique(DT_prediction)))

0.6969477652121824


In [15]:
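#View our results for KNN
#Note: predictions are passed first here, so the 'support' column below counts predicted labels, not true labels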
print(f1_score(KNN_prediction, label_test, average = 'weighted', labels = unique(KNN_prediction)))
print(classification_report(KNN_prediction, label_test))

0.07170349235415517




                    precision    recall  f1-score   support

            A Lord       0.00      0.00      0.00         0
          A Player       0.00      0.00      0.00         0
             AARON       0.03      0.04      0.04        50
       ABERGAVENNY       0.00      0.00      0.00         0
          ABHORSON       0.00      0.00      0.00         0
           ABRAHAM       0.00      0.00      0.00         0
          ACHILLES       0.00      0.00      0.00         0
              ADAM       0.00      0.00      0.00         0
            ADRIAN       0.00      0.00      0.00         0
           ADRIANA       0.16      0.05      0.08       154
 ADRIANO DE ARMADO       0.00      0.00      0.00         0
            AEGEON       0.00      0.00      0.00         0
           AEMELIA       0.00      0.00      0.00         0
          AEMILIUS       0.00      0.00      0.00         0
            AENEAS       0.00      0.00      0.00         1
            AEdile       0.00      0.00

## Summary
SVM struggled, with a weighted F1 score of about 0.06.

Decision Tree did well, with a weighted F1 score of about 0.70.

KNN did poorly, with a weighted F1 score of about 0.07. Using so many neighbors (K = 934) with so few data points for many players likely contributes to this.

Going forward, other methods such as random forest may do well, since the play name neatly partitions the labels!
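
As a rough illustration of the random forest idea, the sketch below fits `RandomForestClassifier` on synthetic data in which a one-hot play indicator splits the speakers into disjoint groups. Everything here is made up for illustration (two plays, speakers `A` to `D`, a length threshold of 30); it says nothing about how a forest would score on the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: two plays, each with its own pair of speakers.
# Within a play, the speaker depends only on the line length.
rows, speakers = [], []
for play_idx, (short_speaker, long_speaker) in enumerate([("A", "B"), ("C", "D")]):
    one_hot = [1, 0] if play_idx == 0 else [0, 1]
    for length in range(10, 50):
        rows.append(one_hot + [length])
        speakers.append(short_speaker if length < 30 else long_speaker)

X = np.array(rows)
y = np.array(speakers)

# Hyperparameters are arbitrary choices for the sketch.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Accuracy on the training data only; a real evaluation would use held-out lines.
train_accuracy = forest.score(X, y)
print(train_accuracy)
```

Because the play columns cleanly separate the speaker groups, the trees can first split on play and then on the remaining features, which is the property the note above points to.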

