This project was developed on Google Colab. <br>

# **Importing Libraries**<br>
Importing all the required libraries

**TfidfVectorizer** - Builds a TF-IDF weighted bag-of-words representation of the sentences. It combines CountVectorizer(), which counts how many times each word is used, and TfidfTransformer(), which rescales those counts by term frequency and inverse document frequency. This gives a better representation of the data because it down-weights words that appear across many documents.<br>
**LabelEncoder** - Encodes the labels as integers (not one-hot vectors): it finds the unique classes and maps each one to a number from 0 to (n-1), where n is the number of unique classes.<br>
**Metrics** - accuracy_score and confusion_matrix for evaluation on the test dataset.<br><br>

**Other libraries used but not considered:**<br>
from sklearn.feature_extraction.text import TfidfTransformer<br>
from sklearn.feature_extraction.text import CountVectorizer<br>
These ended up using a lot of RAM, so TfidfVectorizer(), which combines both steps and ran more efficiently here, is used instead.
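
As a small check of the relationship described above, the sketch below compares the two-step route (CountVectorizer followed by TfidfTransformer) with TfidfVectorizer on a toy corpus. The three sentences are made up for illustration and default parameters are assumed for all three classes.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy corpus (illustrative only; not read from the dataset).
docs = [
    "so shaken as we are",
    "so wan with care",
    "find we a time for peace",
]

# Two-step route: raw term counts, then TF-IDF re-weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer performs both steps internally.
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings both routes should give the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

Both routes produce a SciPy sparse matrix; the values only become a dense array when .toarray() is called.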

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

import io

# Reading Data
Google Colab requires the file to be uploaded before it can be read.

In [2]:
from google.colab import files
uploaded = files.upload()

Saving Shakespeare_data.csv to Shakespeare_data (34).csv


Reading the uploaded data file into a pandas dataframe and printing out the head to see the data.

In [3]:
path = io.BytesIO(uploaded['Shakespeare_data.csv'])
df = pd.read_csv(path)
df.head(10)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
8,9,Henry IV,1.0,1.1.6,KING HENRY IV,Shall daub her lips with her own children's bl...
9,10,Henry IV,1.0,1.1.7,KING HENRY IV,"Nor more shall trenching war channel her fields,"


# Cleaning the Dataset<br>
Dropping rows with a NaN value (for example, the act and scene headings shown above, which have no Player), as these could add noise to training and testing.<br>

In [4]:
df = df.dropna()
df = df.reset_index(drop=True)  # renumber rows without keeping the old index as a column

# Data Preparation<br>
I considered PlayerLine and Play as the input features for the ML model, which predicts the Player.

In [5]:
df[["PlayerLine","Play","Player"]].head(3)

Unnamed: 0,PlayerLine,Play,Player
0,"So shaken as we are, so wan with care,",Henry IV,KING HENRY IV
1,"Find we a time for frighted peace to pant,",Henry IV,KING HENRY IV
2,And breathe short-winded accents of new broils,Henry IV,KING HENRY IV


Shape of the input features and target label.

In [6]:
print(df[["PlayerLine","Play"]].shape,df["Player"].shape)

(105152, 2) (105152,)


# Transforming Data
TfidfVectorizer() is used here with min_df=1, which keeps every term that appears in at least one line; raising min_df (for example to 10) would shrink the feature space by ignoring terms that occur in fewer than that many lines. ngram_range is set to (1,2), which tells the vectorizer to use both unigrams and bigrams. stop_words is set to 'english' (the only built-in string option currently available), which reduces the feature space by ignoring frequently occurring common English words.

In [7]:
tfidf = TfidfVectorizer(min_df=1, ngram_range=(1, 2), stop_words='english')
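
To make the effect of ngram_range and min_df concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences and variable names are illustrative assumptions, and stop_words is left unset so that every token stays visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
docs = [
    "the king rides to war",
    "the king returns from war",
    "the prince rides home",
]

# Unigrams only, every term kept.
uni = TfidfVectorizer(ngram_range=(1, 1), min_df=1).fit(docs)
# Unigrams and bigrams, every term kept.
uni_bi = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit(docs)
# Unigrams and bigrams, keeping only terms found in at least 2 documents.
pruned = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(docs)

print(len(uni.vocabulary_), len(uni_bi.vocabulary_), len(pruned.vocabulary_))
print(sorted(pruned.vocabulary_))
```

Adding bigrams grows the vocabulary, while a higher min_df prunes it. Note that min_df counts the number of documents (here, lines) containing a term, not the total number of occurrences.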

Limiting the dataset to just a single play. In this notebook play "Henry IV" is used.

In [8]:
df = df.loc[df['Play']=='Henry IV']

Transforming the text in the PlayerLine column into a TF-IDF matrix.

In [9]:
X_pl = tfidf.fit_transform(df['PlayerLine']).toarray()

Creating label encoders for Player and Play, which map the unique classes to integers from 0 to (n-1), where n is the number of unique classes.

In [10]:
playerLabel = LabelEncoder()
playLabel = LabelEncoder()
X_pl.shape

(3044, 10656)

Transforming the Player and Play column into numbers.

In [11]:
y_t = playerLabel.fit_transform(df['Player'])
X_p = playLabel.fit_transform(df['Play']).reshape(-1,1)
print(y_t.shape,X_p.shape)

(3044,) (3044, 1)
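
The encoding step above can be illustrated on a handful of labels. This is a minimal sketch; the sample list is an assumption chosen for illustration rather than a slice of the dataframe.

```python
from sklearn.preprocessing import LabelEncoder

# Sample labels (illustrative only).
players = ["KING HENRY IV", "FALSTAFF", "KING HENRY IV", "HOTSPUR"]

encoder = LabelEncoder()
codes = encoder.fit_transform(players)

# classes_ holds the unique labels in sorted order; each code indexes into it.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps integer codes back to the original strings.
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

This is the same round trip the model cells rely on when they call playerLabel.inverse_transform on a prediction.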


Concatenating the training features, using only PlayerLine and Play because including the other features increases the memory requirements. Since the data is limited to a single play, the Play column is constant here.

In [12]:
X_f = np.concatenate([X_pl,X_p],axis=1)
X_f.shape

(3044, 10657)
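
Since memory is the stated constraint, one possible alternative (not what this notebook does) is to keep the TF-IDF output sparse and append the extra column with scipy.sparse.hstack instead of calling .toarray() and np.concatenate. The sketch below uses made-up lines and a hypothetical encoded Play column.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy inputs (illustrative only).
lines = ["so shaken as we are", "so wan with care", "find we a time"]
play_ids = np.array([0, 0, 1]).reshape(-1, 1)  # hypothetical encoded Play column

# fit_transform returns a SciPy sparse matrix; it is left sparse here.
X_text = TfidfVectorizer().fit_transform(lines)
X_sparse = hstack([X_text, csr_matrix(play_ids)]).tocsr()

# Dense equivalent, built the way the notebook does it.
X_dense = np.concatenate([X_text.toarray(), play_ids], axis=1)

print(X_sparse.shape, X_dense.shape)
print(np.allclose(X_sparse.toarray(), X_dense))
```

As far as I know, MultinomialNB and LogisticRegression accept sparse input directly, whereas GaussianNB and LinearDiscriminantAnalysis need a dense array, so the saving would apply only to some of the models used later.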

# Train Test Split
Splitting into train and test datasets.

In [13]:
X_train = X_f[:2000]
y_train = y_t[:2000]

In [14]:
X_pl = []
X_p = []
X_test = X_f[2000:]
y_test = y_t[2000:]
print(X_test.shape,y_test.shape)

(1044, 10657) (1044,)
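
The slices above take the first 2000 lines in play order for training, so a character who speaks mostly late in the play may be under-represented in the training set. A shuffled split is one alternative worth comparing. The sketch below is not the method used in this notebook and would change the reported accuracies; it uses synthetic stand-in arrays with the same row count.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the notebook's row count (values are random).
rng = np.random.default_rng(0)
X_demo = rng.random((3044, 5))
y_demo = rng.integers(0, 10, size=3044)

# An integer test_size is an absolute row count; 1044 matches the slice-based split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1044, shuffle=True, random_state=0)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Fixing random_state keeps the split reproducible between runs.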


# Implementing ML Models
Data preparation complete, now we start testing different ML models.<br><br>
**MODELS USED**<br>
Multinomial Naive Bayes <br>
Gaussian Naive Bayes <br>
Multi-Layer Perceptron Classifier <br>
Linear Discriminant Analysis <br>
Logistic Regression <br>

In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

In [16]:
mnb = MultinomialNB().fit(X_train, y_train)
predsMNB =  mnb.predict(X_test)
print("\nAccuracy score:" , accuracy_score(y_test, predsMNB))
playerLabel.inverse_transform(mnb.predict([X_test[0]]))


Accuracy score: 0.22126436781609196


array(['KING HENRY IV'], dtype=object)

In [17]:
gnb = GaussianNB().fit(X_train, y_train)
predsGNB = gnb.predict(X_test)
print("\nAccuracy score:" , accuracy_score(y_test, predsGNB))
playerLabel.inverse_transform(gnb.predict([X_test[0]]))


Accuracy score: 0.15517241379310345


array(['Vintner'], dtype=object)

In [18]:
mlp = MLPClassifier(random_state=1, max_iter=200).fit(X_train, y_train)
predsMLP = mlp.predict(X_test)
print("\nAccuracy score:" , accuracy_score(y_test, predsMLP))
playerLabel.inverse_transform(mlp.predict([X_test[0]]))


Accuracy score: 0.20402298850574713


array(['GADSHILL'], dtype=object)

In [19]:
lda = LinearDiscriminantAnalysis().fit(X_train,y_train)
predsLDA = lda.predict(X_test)
print("\nAccuracy score:" , accuracy_score(y_test, predsLDA))
playerLabel.inverse_transform(lda.predict([X_test[0]]))


Accuracy score: 0.044061302681992334


array(['Chamberlain'], dtype=object)

In [20]:
lr = LogisticRegression(random_state=0,multi_class="multinomial",solver="newton-cg").fit(X_train, y_train)
predsLR = lr.predict(X_test)
print("\nAccuracy score:" , accuracy_score(y_test, predsLR))
playerLabel.inverse_transform(lr.predict([X_test[0]]))


Accuracy score: 0.24042145593869732


array(['FALSTAFF'], dtype=object)

**Observations**<br>
On this limited dataset of a single play, "Henry IV", LogisticRegression achieves the highest test accuracy at 24.04% when predicting the Player from the PlayerLine, with Multinomial Naive Bayes second at 22.13%.
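
confusion_matrix is imported at the top but not called in the cells above. As a minimal sketch of how it could complement the accuracy score, the example below uses hypothetical true and predicted labels for three classes; these are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for three classes (not notebook results).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)

# The diagonal holds the correctly classified counts, so it recovers accuracy.
print(np.trace(cm) / cm.sum())
```

On the real data, the same call with y_test and, for example, predsLR would show which players are most often confused with one another.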