This project was developed on Google Colab.<br>

# **Importing Libraries**<br>
Importing all the required libraries

**TfidfVectorizer** - Builds a bag-of-words representation of the sentences. It combines CountVectorizer(), which counts the number of times each word is used, and TfidfTransformer(), which converts those counts into term frequencies weighted by inverse document frequency. This gives a better representation of the data because it down-weights words that appear in many documents and so carry little information.<br>
**LabelEncoder** - Encodes the labels as integers (not one-hot vectors): it finds the unique classes and maps each one to a number from 0 to (n-1), where n is the number of unique classes.<br>
**Metrics** - accuracy_score and confusion_matrix for evaluation on the test dataset.<br><br>

**Other libraries tried but not used:**<br>
from sklearn.feature_extraction.text import TfidfTransformer<br>
from sklearn.feature_extraction.text import CountVectorizer<br>
These ended up using a lot of RAM, so I switched to TfidfVectorizer(), which combines the two steps and runs more efficiently.
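
As a small illustration of how TfidfVectorizer() combines the two classes above, the sketch below runs both routes on a made-up three-line corpus (the `corpus` list is an assumption for illustration, not the project data) and checks that they give the same matrix under default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, made up for illustration only.
corpus = [
    "so shaken as we are so wan with care",
    "find we a time for frighted peace to pant",
    "no more the thirsty entrance of this soil",
]

# Route 1: count the words, then re-weight the counts with TF-IDF.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: TfidfVectorizer performs both steps in one object.
one_step = TfidfVectorizer().fit_transform(corpus)

# Both results are sparse matrices; compare them as dense arrays.
same = np.allclose(two_step.toarray(), one_step.toarray())
print("Same result:", same)
```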

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

import io

# Reading Data
Google Colab requires the file to be uploaded before it can be read.

In [2]:
from google.colab import files
uploaded = files.upload()

Saving Shakespeare_data.csv to Shakespeare_data (3).csv


Reading the uploaded data file into a pandas dataframe and printing out the head to see the data.

In [3]:
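# Read the first uploaded file instead of hard-coding its name, since Colab can
# save a re-uploaded file under a new name (see the "Saving ... to ..." message above).
path = io.BytesIO(next(iter(uploaded.values())))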
df = pd.read_csv(path)
df.head(10)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
8,9,Henry IV,1.0,1.1.6,KING HENRY IV,Shall daub her lips with her own children's bl...
9,10,Henry IV,1.0,1.1.7,KING HENRY IV,"Nor more shall trenching war channel her fields,"


# Cleaning the Dataset<br>
Dropping rows that contain a NaN value (for example the act and scene headings and stage directions shown above, which have no Player), as these would add noise to training and testing.<br>

In [4]:
df = df.dropna()
df = df.reset_index(drop=True)
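
A minimal sketch of what this cell does, on a made-up three-row frame (`toy` is hypothetical, not the project data): the row without a Player is dropped and the index is renumbered from 0.

```python
import numpy as np
import pandas as pd

# Toy frame: the first row mimics an act heading with no Player.
toy = pd.DataFrame({
    "Player": [np.nan, "KING HENRY IV", "KING HENRY IV"],
    "PlayerLine": [
        "ACT I",
        "So shaken as we are, so wan with care,",
        "Find we a time for frighted peace to pant,",
    ],
})

# Drop rows with any NaN, then renumber the index without keeping the old one.
cleaned = toy.dropna().reset_index(drop=True)
print(cleaned)
```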

# Data Preparation<br>
I used PlayerLine and Play as the input features for the ML model, which predicts the Player.<br>

Both input columns are text, and the target column is a string as well, so all three need a numeric representation before a machine learning algorithm can make sense of the data and the target.<br>

The PlayerLine column contains sentences, so I used the bag-of-words technique. For this I used TfidfVectorizer() from scikit-learn's feature extraction module. It counts how often each word occurs in a document (here, a single line) and multiplies that count by an inverse document frequency, which down-weights words that occur in many documents. The end result is a bag of words in a numeric representation that a machine learning model can take as input.<br>
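
To make the weighting concrete, the sketch below recomputes the inverse document frequency by hand and compares it with the values TfidfVectorizer() stores. It assumes scikit-learn's default settings (`smooth_idf=True`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The `docs` list is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration only.
docs = [
    "so shaken as we are",
    "find we a time",
    "so wan with care",
]

vec = TfidfVectorizer()
vec.fit(docs)

# Vocabulary terms in the same order as the columns of the matrix.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
doc_freq = np.array([sum(term in doc.split() for doc in docs) for term in terms])

# Smoothed inverse document frequency, computed by hand.
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

for term, idf in zip(terms, manual_idf):
    print(f"{term:8s} idf = {idf:.3f}")

matches = np.allclose(manual_idf, vec.idf_)
print("Matches vec.idf_:", matches)
```

Words that occur in more lines, such as "so" and "we" here, get a lower weight than words that occur in only one.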

For the Play and Player (target) columns I used LabelEncoder() from scikit-learn, which assigns each unique class a number from 0 to (n-1), where n is the number of unique classes. With this I transformed these columns into a numeric representation that can be used as input and target data for a machine learning model.
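
A minimal sketch of the encoding on a made-up list of players (`players` is hypothetical): the classes are sorted, numbered from 0, and can be mapped back to their names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, made up for illustration only.
players = ["KING HENRY IV", "WESTMORELAND", "KING HENRY IV", "FALSTAFF"]

encoder = LabelEncoder()
codes = encoder.fit_transform(players)

print(list(encoder.classes_))  # ['FALSTAFF', 'KING HENRY IV', 'WESTMORELAND']
print(codes.tolist())          # [1, 2, 1, 0]

# inverse_transform maps the integers back to the original names.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```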

In [5]:
df[["PlayerLine","Play","Player"]].head(3)

Unnamed: 0,PlayerLine,Play,Player
0,"So shaken as we are, so wan with care,",Henry IV,KING HENRY IV
1,"Find we a time for frighted peace to pant,",Henry IV,KING HENRY IV
2,And breathe short-winded accents of new broils,Henry IV,KING HENRY IV


Shape of the input features and target label.

In [6]:
print(df[["PlayerLine","Play"]].shape,df["Player"].shape)

(105152, 2) (105152,)


# Transforming Data
Because the whole dataset is used, the dimension space has to be limited. TfidfVectorizer() is used here with a min_df of 10, which reduces the dimension space by ignoring terms that appear in fewer than 10 documents (here, lines). Most of my time was spent tuning this parameter so that the whole dataset could be prepared for use. ngram_range is set to (1,2), which tells the vectorizer to use both unigrams and bigrams. stop_words is set to 'english' (the only built-in string option currently available), which reduces the dimension space further by ignoring frequently occurring common English words.

In [7]:
tfidf = TfidfVectorizer(min_df=10, ngram_range=(1, 2), stop_words='english')
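
The effect of these three parameters on the vocabulary size can be seen on a tiny made-up corpus. This is only a sketch: `docs` and the helper `vocab` are hypothetical, and min_df is set to 2 rather than 10 because the corpus has only three lines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
docs = ["the king is here", "the king is gone", "the queen is here"]


def vocab(**kwargs):
    """Return the sorted vocabulary learned with the given vectorizer settings."""
    return sorted(TfidfVectorizer(**kwargs).fit(docs).vocabulary_)


unigrams = vocab()                                   # single words only
with_bigrams = vocab(ngram_range=(1, 2))             # words plus word pairs
pruned = vocab(ngram_range=(1, 2), min_df=2)         # keep terms in >= 2 lines
no_stop_words = vocab(ngram_range=(1, 2), stop_words="english")

print(len(unigrams), len(with_bigrams), len(pruned))  # 6 12 7
print(pruned)
print(no_stop_words)
```

Adding bigrams doubles the vocabulary here, while min_df and stop_words shrink it again, which is the trade-off being tuned in the cell above.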

**Shuffling the whole dataset so that the plays are not grouped together and the training and testing datasets are fairly balanced**

In [8]:
# df = df.loc[df['Play']=='Henry IV']
df = df.sample(frac = 1)

Transforming the text in the PlayerLine column into a TF-IDF weighted term matrix and converting it to a dense array.

In [None]:
X_pl = tfidf.fit_transform(df['PlayerLine']).toarray()

Making the label encoders for Player and Play, which convert the unique classes to numbers from 0 to (n-1).

In [10]:
playerLabel = LabelEncoder()
playLabel = LabelEncoder()
X_pl.shape

(105152, 5563)

Transforming the Player and Play column into numbers.

In [11]:
y_t = playerLabel.fit_transform(df['Player'])
X_p = playLabel.fit_transform(df['Play']).reshape(-1,1)
print(y_t.shape,X_p.shape)

(105152,) (105152, 1)


Concatenating the training features. Only PlayerLine and Play are used, as adding the other features increases the memory requirements.

In [12]:
X_f = np.concatenate([X_pl,X_p],axis=1)
X_f.shape

(105152, 5564)

# Train Test Split
Splitting into train and test datasets: the first 75,000 shuffled rows are used for training and the remaining 30,152 for testing.

In [13]:
X_train = X_f[:75000]
y_train = y_t[:75000]

In [14]:
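# Drop the references to the intermediate arrays so their memory can be freed;
# X_f already holds a copy of both.
X_pl = []
X_p = []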
X_test = X_f[75000:]
y_test = y_t[75000:]
print(X_test.shape,y_test.shape)

(30152, 5564) (30152,)
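
The shuffle and the manual slicing can also be done in one call. The sketch below is an alternative, not what this notebook ran: it uses scikit-learn's train_test_split() on made-up arrays (`X_toy` and `y_toy` stand in for X_f and y_t), with a fixed `random_state` (an assumed value) so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins for X_f and y_t: 10 rows, 3 features.
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.arange(10)

# Shuffle and split in one step; features and labels stay aligned.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=7, shuffle=True, random_state=0
)
print(X_tr.shape, X_te.shape)  # (7, 3) (3, 3)
```

For the full dataset the equivalent setting would be train_size=75000.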


# Implementing ML Models

**Logistic regression had the highest accuracy, so only that model is used here**

Data preparation complete, now we start testing ML models.<br><br>
**MODELS USED**<br>
Logistic Regression <br>
~~Multinomial Naive Bayes~~ <br>
~~Gaussian Naive Bayes~~<br>
~~Multi-Layer Perceptron Classifier~~ <br>
~~Linear Discriminant Analysis~~

In [15]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression(random_state=0,multi_class="multinomial",solver="newton-cg").fit(X_train, y_train)
predsLR = lr.predict(X_test)
print("\nAccuracy score:" , accuracy_score(y_test, predsLR))
playerLabel.inverse_transform(lr.predict([X_test[0]]))
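
confusion_matrix is imported at the top but not used in the cell above. The sketch below shows how both metrics behave on made-up labels (`y_true` and `y_pred` are hypothetical, not results from this model).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted class ids for five samples.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

# Fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.8

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In the matrix, the single off-diagonal entry in row 1, column 2 is the one sample of class 1 that was predicted as class 2.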

# **Observations**<br>
We can observe that when the whole dataset is used the accuracy of the logistic regression model is ...<br> 
*Training and inference to be finished*

Previous testing (not present in this notebook) with the same optimization but using Multinomial Naive Bayes resulted in an accuracy of 1.3% on the whole dataset. This is significantly lower than the 20.04% that Multinomial Naive Bayes achieved when the dataset was limited to a single play. This indicates that the logistic regression model would also reach a lower accuracy on the whole dataset, even though it had an accuracy of 24.4% on the limited dataset. This is due to the information that is lost in the bag-of-words representation when the whole dataset is used.

**Future Work**<br>
As the data dimensions increase, both the training time of a machine learning model and its RAM requirements increase. Since I did not have enough memory, I had to limit the bag-of-words representation of the PlayerLine column. This lost some information, so the machine learning model was not able to predict the target correctly, hence the low accuracy. I think the transformation can be optimized to reduce its memory requirements, which would make it possible to train a better ML model.
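
One possible direction, offered as an untested assumption rather than something this notebook ran, is to keep the TF-IDF output sparse instead of calling .toarray(), and to join the Play column with scipy.sparse.hstack(). The sketch below uses made-up `lines` and `plays` and compares the memory of the sparse matrix with a float64 dense array of the same shape.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy data, made up for illustration only.
lines = [
    "so shaken as we are so wan with care",
    "find we a time for frighted peace to pant",
    "if music be the food of love play on",
]
plays = ["Henry IV", "Henry IV", "Twelfth Night"]

# fit_transform already returns a sparse matrix; do not densify it.
X_text = TfidfVectorizer().fit_transform(lines)

# Encode Play as integers and wrap the column as a sparse matrix.
play_ids = LabelEncoder().fit_transform(plays).reshape(-1, 1)
X_sparse = sparse.hstack([X_text, sparse.csr_matrix(play_ids)]).tocsr()

# Memory held by the sparse matrix versus a float64 dense array of equal shape.
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * 8
print(X_sparse.shape, sparse_bytes, dense_bytes)
```

scikit-learn estimators such as LogisticRegression accept sparse input, so a matrix built this way could be passed to fit() directly. Whether this leaves enough memory to lower min_df on the full dataset would need to be measured.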