Ulises Marian

This project uses Natural Language Processing (NLP) to build a model that analyzes sentences from the “History of Philosophy” dataset and predicts which school of philosophy each sentence belongs to. The dataset contains over 50 texts spanning 13 schools of philosophy and thousands of years (350 BC – 1985).

I chose this topic for two reasons:
<br>1.- I wanted to create a model that classifies data among multiple labels, rather than just two ("yes" or "no"). This dataset has 13 labels, so I am very curious to see how accurate the model is.
<br>2.- The fact that the input texts span from 350 BC until 1985 is simply astonishing. Being able to use philosophy data spanning thousands of years speaks to the legacy of humanity and philosophy, of human thought and development.

File: philosophy_data.csv
<br>Source: https://www.kaggle.com/kouroshalizadeh/history-of-philosophy
        

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import random
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import nltk

nltk.download('punkt') 
from nltk.tokenize import word_tokenize 
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import TweetTokenizer

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ulisesmarian/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


1. This is where I will import the file "philosophy_data.csv",
<br> and show you how it looks before making any modifications to it.

In [26]:
data = pd.read_csv("philosophy_data.csv", encoding='ISO-8859-1')
print(f"dataframe shape: {data.shape}")
print(f"number of rows: {data.shape[0]}")
print(f"number of columns: {data.shape[1]}")
set(data.original_publication_date)
data.head()

dataframe shape: (360808, 11)
number of rows: 360808
number of columns: 11


Unnamed: 0,title,author,school,sentence_spacy,sentence_str,original_publication_date,corpus_edition_date,sentence_length,sentence_lowered,tokenized_txt,lemmatized_str
0,Plato - Complete Works,Plato,plato,"What's new, Socrates, to make you leave your ...","What's new, Socrates, to make you leave your ...",-350,1997,125,"what's new, socrates, to make you leave your ...","['what', 'new', 'socrates', 'to', 'make', 'you...","what be new , Socrates , to make -PRON- lea..."
1,Plato - Complete Works,Plato,plato,Surely you are not prosecuting anyone before t...,Surely you are not prosecuting anyone before t...,-350,1997,69,surely you are not prosecuting anyone before t...,"['surely', 'you', 'are', 'not', 'prosecuting',...",surely -PRON- be not prosecute anyone before ...
2,Plato - Complete Works,Plato,plato,The Athenians do not call this a prosecution b...,The Athenians do not call this a prosecution b...,-350,1997,74,the athenians do not call this a prosecution b...,"['the', 'athenians', 'do', 'not', 'call', 'thi...",the Athenians do not call this a prosecution ...
3,Plato - Complete Works,Plato,plato,What is this you say?,What is this you say?,-350,1997,21,what is this you say?,"['what', 'is', 'this', 'you', 'say']",what be this -PRON- say ?
4,Plato - Complete Works,Plato,plato,"Someone must have indicted you, for you are no...","Someone must have indicted you, for you are no...",-350,1997,101,"someone must have indicted you, for you are no...","['someone', 'must', 'have', 'indicted', 'you',...","someone must have indict -PRON- , for -PRON- ..."


2. For the purposes of this project, I only need the sentences and their labels, so I will drop all the other columns before cleaning the data for NLP.
<br>Create the X variable from the DataFrame.
<br>Print the size of X and the first five lines of X.

In [49]:
X = data.sentence_str.to_frame()
print(X.shape)
X.head()

(360808, 1)


Unnamed: 0,sentence_str
0,"What's new, Socrates, to make you leave your ..."
1,Surely you are not prosecuting anyone before t...
2,The Athenians do not call this a prosecution b...
3,What is this you say?
4,"Someone must have indicted you, for you are no..."


3. See how many sentences per school of philosophy there are.

In [50]:
for school in set(data["school"]):
    print(school,len(data[data["school"]==school]))

plato 38366
stoicism 2535
empiricism 19931
analytic 55425
rationalism 22949
phenomenology 28573
german_idealism 42136
nietzsche 13548
capitalism 18194
continental 33779
feminism 18635
aristotle 48779
communism 17958


4. Create the y variable from the DataFrame.
<br>Print the size of y and the first five lines of y.

In [4]:
y = data.school
print(y.shape)
y.head()

(360808,)


0    plato
1    plato
2    plato
3    plato
4    plato
Name: school, dtype: object

5. Pre-process the data
<br>The tokenizer separates each sentence into words, keeping only runs of alphabetic characters, so numbers and punctuation are discarded.
<br> *I will convert all the sentences to lowercase.
<br> *Then, I will remove all stop words (e.g., 'an', 'the', 'to', 'that'), because they usually don't help the model distinguish between labels. Eliminating them also reduces the already huge amount of data.
<br>  *Similarly, to further reduce the number of distinct words to analyze, I extract the stem of each word, so that words sharing the same root are grouped together and counted as one.
<br>  *The final product is a DataFrame in which every sentence is lowercase, stemmed, and free of stop words.

In [5]:
nltk.download('stopwords') 
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

tokenizer = RegexpTokenizer("[a-zA-Z]+")  # keep runs of letters only

def preprocess(sentence):
    words = tokenizer.tokenize(sentence.lower())                 # lowercase, then split into words
    words = [word for word in words if word not in stop_words]   # remove stop words
    words = [stemmer.stem(word) for word in words]               # reduce each word to its stem
    return ' '.join(words)                                       # join back into a string

# Assign the whole column at once: X.sentence_str[i] = w is chained assignment,
# which is slow and can end up writing to a temporary copy.
X["sentence_str"] = X["sentence_str"].apply(preprocess)

X.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ulisesmarian/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,sentence_str
0,new socrat make leav usual haunt lyceum spend ...
1,sure prosecut anyon king archon
2,athenian call prosecut indict euthyphro
3,say
4,someon must indict go tell indict someon els


6. Since ML models need numbers to work, I convert the strings in X into vectors of numbers (the CountVectorizer counts how often each word appears in each sentence).
<br> Show the size of the resulting matrix.

In [51]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
X_vectors = vect.fit_transform(X["sentence_str"])    # learn the vocabulary and count the words in one pass

print(X_vectors.shape)

(360808, 90797)
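
To make the counting step concrete, here is a minimal standard-library sketch of the bag-of-words idea behind CountVectorizer. It is not scikit-learn's implementation: the helper `bag_of_words` is hypothetical, tokens are simply split on whitespace, and the result is a dense list of lists rather than a sparse matrix. The two toy sentences are in the same stemmed form as the cleaned X.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in sentences[i]."""
    # Hypothetical helper: whitespace tokenization, alphabetically sorted vocabulary.
    vocabulary = sorted({word for s in sentences for word in s.split()})
    rows = []
    for s in sentences:
        counts = Counter(s.split())              # word -> number of occurrences in this sentence
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

toy = ["sure prosecut anyon king archon",
       "athenian call prosecut indict euthyphro"]
vocabulary, rows = bag_of_words(toy)
print(vocabulary)
print(rows)
```

Each row gets one column per distinct word in the whole corpus, which is why the real matrix has tens of thousands of columns and is stored in sparse form.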


7. Split the dataset into training and testing sets and show the size of the sets.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X_vectors,y,test_size=0.2)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(288646, 60482) (288646,) (72162, 60482) (72162,)


8. Use the Multinomial Naive Bayes model to train, then test the model.

In [8]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
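
For readers curious about what the classifier does with the word counts, the following is a rough standard-library sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. The functions `train_nb` and `predict_nb` and the four toy documents are hypothetical and only illustrate the idea; scikit-learn's MultinomialNB works on sparse matrices and is far more efficient.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial Naive Bayes model on whitespace-tokenized documents."""
    vocab = {word for doc in docs for word in doc.split()}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)           # class -> word -> count
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label in docs_per_class:
        total_words = sum(word_counts[label].values())
        log_prior = math.log(docs_per_class[label] / len(docs))
        # Smoothed log-probability of each vocabulary word given the class
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha)
                           / (total_words + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log-posterior; unseen words are ignored."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[word] for word in doc.split() if word in log_likelihood)
    return max(scores, key=scores.get)

# Made-up stemmed documents, for illustration only
docs = ["soul virtu form", "virtu soul citi", "capit labour valu", "labour valu wage"]
labels = ["plato", "plato", "communism", "communism"]
model = train_nb(docs, labels)
prediction = predict_nb(model, "soul labour virtu")
print(prediction)
```

The smoothing term `alpha` keeps a word that never appeared with a class from driving that class's probability to zero.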

9. Show the accuracy of the model by displaying the accuracy score, the weighted F1 score, and the confusion matrix.

In [52]:
from sklearn.metrics import f1_score

print(metrics.accuracy_score(y_test, y_pred))

print(f1_score(y_test, y_pred, average='weighted'))

print(metrics.confusion_matrix(y_test, y_pred))

0.7187300795432499
0.7170341025208494
[[8762  406   53   47  237  156   69  473   25  264  467  129   17]
 [ 565 7312  101   35   99  128   81  346   29   64  894  148    3]
 [ 113   67 3082  127   57   80   30   29    9    6   70   20    1]
 [ 111   90  348 2511  124   30  104   62   36   27  105   18    8]
 [ 363  231   75  131 4599   55  161  418   61  291  184  113    5]
 [ 251  243   47    7   44 2786   45  182   21   24  148  197    5]
 [  99  158   36   91  174   81 2508   81   74   54  263   86    2]
 [ 476  256   42   61  167   72   63 6627   47  226  232  124   12]
 [ 127  172   23   25  128   43  107  118 1552   42  272   93   11]
 [ 480  171   13   43  329   53   50  596   29 3675  164   79   10]
 [ 617  903   72   34   89   59  103  166   24   60 5452  133    2]
 [ 283  375   17   13   66  266   57  264   24   48  236 2905   30]
 [   9   80    6    2    4   84    4   13   88    3   73   25   94]]
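
The confusion matrix is easier to read once it is reduced to per-school recall. The sketch below copies the matrix printed above and assumes that its rows follow the alphabetical label order that `confusion_matrix` uses when no `labels` argument is given; the variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows = true school, columns = predicted school)
cm = [
    [8762, 406, 53, 47, 237, 156, 69, 473, 25, 264, 467, 129, 17],
    [565, 7312, 101, 35, 99, 128, 81, 346, 29, 64, 894, 148, 3],
    [113, 67, 3082, 127, 57, 80, 30, 29, 9, 6, 70, 20, 1],
    [111, 90, 348, 2511, 124, 30, 104, 62, 36, 27, 105, 18, 8],
    [363, 231, 75, 131, 4599, 55, 161, 418, 61, 291, 184, 113, 5],
    [251, 243, 47, 7, 44, 2786, 45, 182, 21, 24, 148, 197, 5],
    [99, 158, 36, 91, 174, 81, 2508, 81, 74, 54, 263, 86, 2],
    [476, 256, 42, 61, 167, 72, 63, 6627, 47, 226, 232, 124, 12],
    [127, 172, 23, 25, 128, 43, 107, 118, 1552, 42, 272, 93, 11],
    [480, 171, 13, 43, 329, 53, 50, 596, 29, 3675, 164, 79, 10],
    [617, 903, 72, 34, 89, 59, 103, 166, 24, 60, 5452, 133, 2],
    [283, 375, 17, 13, 66, 266, 57, 264, 24, 48, 236, 2905, 30],
    [9, 80, 6, 2, 4, 84, 4, 13, 88, 3, 73, 25, 94],
]
# Assumption: alphabetical label order (sklearn's default when labels=None)
schools = sorted(["plato", "stoicism", "empiricism", "analytic", "rationalism",
                  "phenomenology", "german_idealism", "nietzsche", "capitalism",
                  "continental", "feminism", "aristotle", "communism"])

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))      # diagonal = correct predictions
accuracy = correct / total

# Recall per school: correct predictions divided by the true sentences of that school
recall = {school: cm[i][i] / sum(cm[i]) for i, school in enumerate(schools)}
for school, value in sorted(recall.items(), key=lambda item: item[1]):
    print(f"{school:16s} {value:.3f}")
print(f"accuracy {accuracy:.4f}")
```

Under that ordering assumption, the last row corresponds to stoicism, where only 94 of 485 test sentences are classified correctly, so the overall accuracy hides a large gap between schools.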


10. The results show that the model is moderately accurate (71.9%). It could definitely be worse, but it could also be better. On the other hand, considering that all the texts, and thus all the sentences and their labels, are about the same topic, philosophy, the model did well.
<br> If the topics had varied, for instance, if there had also been a text about biology, another about baseball, and another about astronomy, the predictions would probably have been more accurate, since the vocabularies would overlap less. In this dataset, however, everything is philosophy: there are 13 schools, but they all share the same broad subject.
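
One way to put the accuracy score in context is to compare it with a majority-class baseline, that is, a model that always predicts the most frequent school. The sketch below uses the sentence counts from step 3; it is computed on the full dataset, so it only approximates the baseline on the test split.

```python
# Sentence counts per school, copied from the output of step 3
counts = {
    "plato": 38366, "stoicism": 2535, "empiricism": 19931, "analytic": 55425,
    "rationalism": 22949, "phenomenology": 28573, "german_idealism": 42136,
    "nietzsche": 13548, "capitalism": 18194, "continental": 33779,
    "feminism": 18635, "aristotle": 48779, "communism": 17958,
}

total = sum(counts.values())
majority_school = max(counts, key=counts.get)     # school with the most sentences
baseline = counts[majority_school] / total        # accuracy of always predicting it

print(majority_school, f"{baseline:.3f}")
```

Always guessing the largest school would be right roughly 15% of the time, so the Naive Bayes model is well above that floor.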



Conclusion

The model that I've created to predict which school of philosophy each sentence from the “History of Philosophy” dataset belongs to could certainly be improved. A 28.1% chance of misclassifying a sentence could indeed be a source of concern. But if this model is simply used as a reference, and taking into account that all the sentences are within the realm of philosophy and span a time frame of more than 2,000 years (350 BC – 1985), saying that the model isn't good would be unfair.

How to make the model more accurate?
I wonder whether using the sentences in other languages, such as German, would result in a more accurate model, since certain words often cannot be translated directly from one language into another, for example from German into English. Ironically, this happens most often at the philosophical/emotional/intangible level, e.g., "Fernweh", "Zeitgeist", "Weltanschauung".

Another option that comes to mind is to pair the sentences up instead of predicting each one separately, that is, to put two sentences together in each row, and see whether that results in a higher accuracy. Since all the sentences are about philosophy, increasing the number of words per 'sentence' could raise the chances that a 'keyword' associated with a particular school of philosophy appears in each pair.
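
A minimal sketch of that pairing idea, using only the standard library. The helper `pair_sentences` and the placeholder rows are hypothetical; the sketch assumes the rows are already ordered as in the dataset and only joins sentences that share the same title and school, so a pair never mixes two labels.

```python
from itertools import groupby

def pair_sentences(rows):
    """Join consecutive (title, school, sentence) rows into two-sentence rows."""
    paired = []
    # groupby only merges neighbours, so the original sentence order is preserved
    for (title, school), group in groupby(rows, key=lambda row: (row[0], row[1])):
        sentences = [row[2] for row in group]
        # Step through the group two sentences at a time; an odd leftover stays alone
        for i in range(0, len(sentences), 2):
            paired.append((title, school, " ".join(sentences[i:i + 2])))
    return paired

# Placeholder rows, for illustration only
rows = [
    ("Title A", "plato", "first sentence"),
    ("Title A", "plato", "second sentence"),
    ("Title A", "plato", "third sentence"),
    ("Title B", "stoicism", "fourth sentence"),
]
paired = pair_sentences(rows)
for row in paired:
    print(row)
```

Pairing halves the number of rows, so any gain in accuracy per row would have to be weighed against having fewer training examples.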

Another option is to add more data.
As we can see from #3 (output copied below), stoicism has only 2,535 sentences, as opposed to aristotle (48,779)
or german_idealism (42,136). Nietzsche (13,548) is also low, as are the other schools below 20,000.

plato 38366
stoicism 2535
empiricism 19931
analytic 55425
rationalism 22949
phenomenology 28573
german_idealism 42136
nietzsche 13548
capitalism 18194
continental 33779
feminism 18635
aristotle 48779
communism 17958

Or, on the other hand, to reduce the data: that is, to cap the number of sentences for the largest schools, e.g., bring all those over 35,000 down to 35,000, or perhaps bring all those over 20,000 down to 20,000. This would result in a more balanced input across labels.
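
Capping the larger schools could look roughly like the sketch below. The helper `undersample`, the cap, and the toy labels are hypothetical; in practice the cap would probably be applied to the training split only, so that the test set keeps the original class proportions.

```python
import random

def undersample(labels, cap, seed=0):
    """Return sorted row indices, keeping at most `cap` rows per label."""
    rng = random.Random(seed)                  # fixed seed so the selection is reproducible
    indices_by_label = {}
    for index, label in enumerate(labels):
        indices_by_label.setdefault(label, []).append(index)
    kept = []
    for label, indices in indices_by_label.items():
        if len(indices) > cap:
            indices = rng.sample(indices, cap)   # random subset of the oversized class
        kept.extend(indices)
    return sorted(kept)

# Toy labels standing in for the school column
labels = ["analytic"] * 6 + ["stoicism"] * 2 + ["plato"] * 4
kept = undersample(labels, cap=3)
kept_labels = [labels[i] for i in kept]
print({label: kept_labels.count(label) for label in sorted(set(kept_labels))})
```

Classes already below the cap, such as stoicism here, are left untouched, so undersampling narrows the imbalance without removing it entirely.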



Lastly, maybe removing the stop words had a negative impact on the accuracy of the model. Nonetheless, not removing them would have resulted in a much slower model.

Similarly, the tokenizer could have a direct impact on the accuracy of the model. Eliminating anything that is not alphabetic, such as numbers, probably results in a significant loss of information. For instance, in this particular dataset the numbers might often be years, and those years could point to a school of philosophy, an era, a philosopher, etc.
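
The effect is easy to see with the regular expression alone. The sketch below uses `re.findall`, which should behave like `RegexpTokenizer` in its default (non-gap) mode for a pattern like this one; the example sentence and the alternative pattern are made up for illustration.

```python
import re

sentence = "Published in 1781 and revised in 1787"    # hypothetical example sentence

# Pattern used in step 5: runs of letters only, so the years disappear
letters_only = re.findall(r"[a-zA-Z]+", sentence.lower())

# Possible alternative: also keep runs of digits
letters_and_digits = re.findall(r"[a-zA-Z0-9]+", sentence.lower())

print(letters_only)
print(letters_and_digits)
```

Whether keeping the years actually helps would have to be tested, since it also enlarges the vocabulary.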

It's very interesting to see, after doing the analysis and conclusion, that even the (apparently) most mundane choices (such as selecting the tokenizer or the stop words) could have a large impact on the accuracy of the model.