# Dostoevsky or Tolstoy?

When a conversation turns to Russian literature, "wo/men of letters" quite frequently encounter the question "Who do you think is greater, Tolstoy or Dostoevsky?", "Are you more Tolstoy or Dostoevsky?", or, more briefly, "Tolstoy or Dostoevsky?". The answer is clearly closer to "one (or the other) is greater for me" than to "one (or the other) is greater than the other", just as one literary person once noted:
___
                   "I loved them both: Tolstoy, for the story he told, 
                                and Dostoevsky, for the thoughts he provoked."
                                                              (Raquel Chanto)
                                               
  <img style="width: 600px; height: 600px;" src="https://bit.ly/38EWEwM" class=center>

___
This Kaggle notebook first develops a machine learning model to identify the author of a quote from the works of either of the two Russian writers. It then repeats the same approach with Turgenev added, which I believe makes the task even more interesting.

For this purpose, I'll use the relevant folders of [the Russian Literature Dataset](https://www.kaggle.com/d0rj3228/russian-literature) and create training and testing subdatasets out of them. 

A final note in the intro: I will be very happy if you could help me with your corrections and advice. 
Thank you in advance.

## Libraries and Modules

Just a quick initial note: WordNetLemmatizer did not work well with the Russian texts, as it relies on the English WordNet. So I searched for a proper lemmatizer and came across MorphAnalyzer from pymorphy2. I'm very satisfied with the result and I believe you'll also agree.

In [None]:
pip install pymorphy2

In [None]:
import numpy as np 
import pandas as pd 
import random 
import glob 
import pickle
from collections import Counter
import re

from pymorphy2 import MorphAnalyzer
from nltk.corpus import stopwords
from nltk import tokenize

from sklearn.preprocessing import LabelEncoder 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt
import seaborn as sns

labelencoder=LabelEncoder()

## Data Preparation

In [None]:
def split_text(filepath, min_len=50):
    """Read a file and return its sentences of at least min_len characters."""
    with open(filepath, "r", encoding="utf8") as file:
        sentences = tokenize.sent_tokenize(file.read())

    # Drop very short sentences, which carry little signal after lemmatization.
    return [sentence for sentence in sentences if len(sentence) >= min_len]

*split_text* simply splits each text into sentences and keeps those of at least 50 characters, so that meaningful fragments remain after the lemmatization.
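As a rough illustration of the length filter, here is a minimal sketch. It uses a naive regex splitter in place of NLTK's `sent_tokenize`, an assumption made only to keep the example free of dependencies, and the helper name `filter_sentences` is hypothetical.

```python
import re

def filter_sentences(text, min_len=50):
    """Split text on sentence-ending punctuation and keep the longer pieces.

    The regex splitter is a simplified stand-in for nltk's sent_tokenize.
    """
    # Split after '.', '!', '?' or '…' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [s for s in sentences if len(s) >= min_len]

sample = (
    "Да. "
    "Все счастливые семьи похожи друг на друга, "
    "каждая несчастливая семья несчастлива по-своему. "
    "Нет!"
)
kept = filter_sentences(sample)
print(len(kept), kept)
```

Only the long middle sentence survives; the two short replies are dropped before they ever reach the lemmatizer.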

In [None]:
def prepare_data(auth):
    
    text=[]
    for path in glob.glob('../input/russian-literature/prose/{}/*.txt'.format(auth)):
        text += split_text(path)  
    
    return text

authors=["Dostoevsky", "Tolstoy", "Turgenev"]
all_texts={}
all_texts_train={}
all_texts_test={}

for author in authors:
    all_texts[author]=prepare_data(author)
    all_texts_train[author]=all_texts[author][:10000]
    all_texts_test[author]=all_texts[author][-2000:]
    
np.random.seed(1)

for author in authors:
    all_texts_train[author]=np.random.choice(all_texts_train[author], 10000, replace=False)
    all_texts_test[author]=np.random.choice(all_texts_test[author], 1000, replace=False)

Here we read the data from the files by author and store it in a dictionary. For the sake of balance, we take the first 10,000 sentences of each of the three authors and shuffle them. Please note that I prepared two sets: one for training and validation and a smaller one for post-testing. So that the post-test set is not too similar to the training/validation set, I took it from the other end of the data: 1,000 sentences drawn at random from each author's last 2,000. To summarize, we have 10,000 sentences per author in the training/validation set and another 1,000 per author in the post-test set.
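One caveat about taking the two sets from opposite ends: the slices `[:10000]` and `[-2000:]` stay disjoint only if an author has at least 12,000 sentences. The check below is a minimal sketch of that arithmetic. The helper name `head_tail_overlap` is hypothetical and not part of the notebook.

```python
def head_tail_overlap(n_items, head=10000, tail=2000):
    """Return how many positions are shared by items[:head] and items[-tail:]."""
    head_end = min(head, n_items)          # items[:head] covers [0, head_end)
    tail_start = max(n_items - tail, 0)    # items[-tail:] covers [tail_start, n_items)
    return max(head_end - tail_start, 0)

for n in (15000, 12000, 11000):
    print(n, head_tail_overlap(n))
```

Calling it with `len(all_texts[author])` for each author would show whether any training sentence can also appear in the post-test set.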

In [None]:
for key in all_texts_train.keys():
    print(key, ':', len(all_texts_train[key]), 'sentences')

This is some simple code to double-check if we have an equal number of sentences from each author.

In [None]:
random.seed(5)  # random.shuffle below uses Python's random module, not NumPy's

tmp=[]

for key, value in all_texts_train.items():    
    for v in value:
        tmp.append((str(v), key))  # (sentence, author) pair
random.shuffle(tmp)
unv=pd.DataFrame(tmp,columns=["text","author"])

We convert the "author:text list" dictionary into a dataframe, which will hold all the texts for the three authors in the training/validation set.

In [None]:
# DataFrame.append was removed in pandas 2.0, so the two subsets are concatenated instead.
td=pd.concat([unv[unv["author"] == "Dostoevsky"], unv[unv["author"] == "Tolstoy"]])

We'll first work on "Tolstoy or Dostoevsky", so for now we need the data for these two authors only. Alternatively, we could have prepared the data for these two authors first and rerun the same function later in the kernel for the three-author comparison.

In [None]:
td=td.sample(frac=1, random_state=1)  # shuffle the rows; the seed value is arbitrary

It is a good idea to make sure the rows are shuffled rather than all Dostoevsky first and all Tolstoy next.

In [None]:
td=td.reset_index(drop=True)

This will reset the indices inherited from the main dataframe above.

In [None]:
td.head(7) #7 is my lucky number :-P

Now let's convert the smaller post-testing data into a dataframe as well:

In [None]:
tmp=[]
for key, value in all_texts_test.items():    
    for v in value:
        tmp.append((str(v), key))  # (sentence, author) pair
random.shuffle(tmp)
unvt=pd.DataFrame(tmp,columns=["text","author"])

# DataFrame.append was removed in pandas 2.0, so the two subsets are concatenated instead.
tdt=pd.concat([unvt[unvt["author"] == "Dostoevsky"], unvt[unvt["author"] == "Tolstoy"]])
tdt=tdt.sample(frac=1, random_state=1).reset_index(drop=True)  # shuffle rows; seed is arbitrary

tdt.head()

## Data Preprocessing

In [None]:
# Characters to strip from the text (used as a set of characters, not as a regex).
character_set = set("!#$%&'()*+,./:;<=>?@[\\]^_`{|}„“~—\"-–+«»…")
stopwords_ru = set(stopwords.words("russian")+ ["это", "твой","свой","всё", "который", "ещё"])

with open ("../input/toldostur/romans.csv", "r") as f:
    roman_nums={item.strip() for item in f}

morph = MorphAnalyzer()

def lemmatize(sent):
    """Return the lemmas of sent, without punctuation, numbers and stopwords."""
    # Remove every token that contains a digit.
    sent = re.sub(r'\w*\d\w*', '', sent)
    
    punct_free=''.join(character for character in sent if character not in character_set)
    
    lemmas = []
    for word in punct_free.split():
        if word.upper() not in roman_nums:
            # Take the most probable normal form suggested by pymorphy2.
            lemma = morph.normal_forms(word)[0]
            if lemma not in stopwords_ru and len(lemma)!=1:
                lemmas.append(lemma)

    return lemmas

**Some important remarks about the code above:**
1. Initially I used string.punctuation to clean the punctuation marks from the text, but it was ineffective with, for example, "«»", the marks used in place of quotation marks in Russian texts, or when a punctuation mark was attached to a word. For the same reason, I added several other punctuation marks to character_set as I came across exceptions by trial and error.
2. I initially included 0-9 in the character set, but that did not remove the numbers from the text either, so I used a simple regular expression instead.
3. I added several items to the NLTK Russian stopwords.
4. There were Roman numerals in the texts, so I removed them by reading the numerals from a separate file. It is, of course, possible to write separate code to identify and eliminate them, but that is not the job of this kernel.
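For readers who would rather not depend on a separate file, the Roman numerals mentioned in the last remark could be detected with a regular expression. This is only a sketch of one possible approach, not what the notebook does. The names `ROMAN_RE` and `is_roman_numeral` are hypothetical, and the pattern assumes numerals written in Latin letters in standard form (1 to 3999).

```python
import re

# Standard-form Roman numerals from 1 to 3999, written in Latin capitals.
# The lookahead rejects the empty string, which the rest of the pattern would accept.
ROMAN_RE = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

def is_roman_numeral(token):
    """True if token (case-insensitive) is a well-formed Roman numeral."""
    return bool(ROMAN_RE.match(token.upper()))

tokens = ["XIV", "глава", "iii", "IIII", "MCMXCIV"]
print([t for t in tokens if not is_roman_numeral(t)])
```

Numerals typed with look-alike Cyrillic letters would not match this pattern, which is one reason a hand-made list can still be the more practical choice.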

In [None]:
'''
td["lems"]=td["text"].apply(lemmatize)
td["lemphrases"]=td["lems"].apply(" ".join)
'''

#with open('td.pickle', 'wb') as p:
#     pickle.dump(td, p, protocol=pickle.HIGHEST_PROTOCOL)

with open('../input/toldostur/td.pickle', 'rb') as p:
    td = pickle.load(p)  
  


In [None]:
td.head(3)

In [None]:
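# Column assignment avoids the chained indexing (tdt['lems'][i] = ...) that pandas warns about.
tdt["lems"]=tdt["text"].apply(lemmatize)        # list of lemmas per sentence
tdt["lemphrases"]=tdt["lems"].apply(" ".join)   # lemmas joined back into one string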


In [None]:
tdt=tdt.sample(frac=1, random_state=1).reset_index(drop=True)  # reshuffle rows; seed is arbitrary
tdt.head()

In [None]:
#cv.get_feature_names()

## Feature Selection and Model Creation

Deciding which features distinguish one author from the others is the most important step of this task. Style and content are certainly two of the most important factors: the author's choice of words, the way s/he orders them, the use of unique vocabulary, and so on. Here I'm going to experiment with the original text (without any preprocessing) and with the lemmatized phrases. First, I'll train and test on the original texts, then repeat the experiment with the lemmatized phrases. Finally, I'll create a pipeline that combines the two features.
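To make "choice of words and the way they are ordered" concrete, the sketch below counts word n-grams of length 1 to 4, which is the kind of feature a bag-of-n-grams vectorizer builds. It is a simplified stand-in that splits on whitespace only, whereas scikit-learn's `CountVectorizer` by default also drops punctuation and one-character tokens. The helper name `word_ngrams` is hypothetical.

```python
from collections import Counter

def word_ngrams(text, n_min=1, n_max=4):
    """Count contiguous word n-grams for every n in [n_min, n_max]."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

original = "он сказал что он придёт"
counts = word_ngrams(original)
print(counts["он"], counts["он сказал"], len(counts))
```

Single words capture vocabulary, while the longer n-grams capture short stretches of word order. Lemmatization collapses inflected forms, so the two inputs give the model different views of the same sentence.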

In my previous versions the code was rather messy and repetitive, so I have put it into a tidier format, though in a bigger chunk. In this new version I test the models on the training/validation and post-testing sets and show the results in a table.

In [None]:
training_scores=[]
validation_scores=[]
post_test_scores=[]

def model_test(d1,d2):

    y=d1['author']
    yt=d2['author']
    
    y = labelencoder.fit_transform(y)
    yt = labelencoder.transform(yt)  # reuse the mapping fitted on the training labels
    
    for i in ['text', 'lemphrases']:
        
        X=d1[i]
        Xt=d2[i]
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2294)

        cv=CountVectorizer(ngram_range=(1,4), min_df=2)
        transformer=cv.fit(X_train)

        text_train=transformer.transform(X_train)
        text_test=transformer.transform(X_test)
        model = MultinomialNB()
        model = model.fit(text_train, y_train)
        
        score=model.score(text_train, y_train)
        training_scores.append(score)        
 
        score=model.score(text_test, y_test)
        validation_scores.append(score)

        Xt=transformer.transform(Xt)
        score=model.score(Xt, yt)
        post_test_scores.append(score)
        
    
    Xp=d1.drop(['author', 'lems'], axis=1)
    yp=d1['author']
    yp = labelencoder.fit_transform(yp)
    Xp_train, Xp_test, yp_train, yp_test = train_test_split(Xp, yp, test_size=0.2, random_state=1887)
    
    p_transformer=make_column_transformer((CountVectorizer(ngram_range=(1,4), min_df=2), "text"),
                                     (CountVectorizer(ngram_range=(1,3), min_df=2), "lemphrases"))

    p_model=make_pipeline(p_transformer, MultinomialNB())
    p_model.fit(Xp_train, yp_train)
    
    score=p_model.score(Xp_train, yp_train)
    training_scores.append(score)
    
    score=p_model.score(Xp_test, yp_test)
    validation_scores.append(score)
    
    Xpt=d2.drop(['author', 'lems'], axis=1)
    ypt=d2['author']
    ypt = labelencoder.transform(ypt)  # reuse the mapping fitted on yp above
    score=p_model.score(Xpt, ypt)
    post_test_scores.append(score)
 

model_test(td, tdt) 

results=pd.DataFrame()
results["Features"]=["Original Text", "Lemmatized" , "Pipeline"]
results["Training"]=training_scores
results["Validation"]=validation_scores
results["Test"]=post_test_scores

results


We see that we get better results with the two features combined. Of course, it is possible to build a more complex pipeline with tuned parameters, and to consider other features such as the unique vocabulary used by each author.

We also see that the model is less accurate on totally new data, which it has not seen at all. Training on a much bigger dataset may improve the results on such unseen data.
However, I must add that I get test-set results that are about 10% better on my personal notebook (using the same data and the same code, of course :)). I'm sharing a screenshot here:

<img src="https://www.linkpicture.com/q/pipe_results.png" class=center>

---
Now we'll repeat almost everything for the three great Russian authors: Dostoevsky, Tolstoy and Turgenev.


## Your favorite Russian author: Tolstoy, Dostoevsky or Turgenev?
---
<img src="https://www.linkpicture.com/q/dtt.png" class=center>

---
It may be worth noting in advance that readers of Russian literature usually compare Dostoevsky with Tolstoy, or Tolstoy with Turgenev. Dostoevsky and Turgenev are rarely compared, as they are quite different from each other. Yet the results of this task suggest otherwise: the relative difference between the authors is pretty much the same.
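Once predictions are available, one way to examine how often each pair of authors is mistaken for the other is to tabulate the confusions per pair. The sketch below shows the idea on invented labels. Both helper names are hypothetical and the toy data are not results from this notebook.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true_author, predicted_author) pairs."""
    return Counter(zip(y_true, y_pred))

def pair_confusion(counts, a, b):
    """Total number of times a and b were mistaken for each other."""
    return counts[(a, b)] + counts[(b, a)]

# Invented labels for illustration only; not results from this notebook.
y_true = ["Dostoevsky", "Dostoevsky", "Tolstoy", "Tolstoy", "Turgenev", "Turgenev"]
y_pred = ["Dostoevsky", "Tolstoy", "Tolstoy", "Turgenev", "Turgenev", "Dostoevsky"]

pair_counts = confusion_counts(y_true, y_pred)
for a, b in [("Dostoevsky", "Tolstoy"), ("Tolstoy", "Turgenev"), ("Dostoevsky", "Turgenev")]:
    print(a, "vs", b, pair_confusion(pair_counts, a, b))
```

With real predictions, similar totals across the three pairs would support the claim that the authors are about equally hard to tell apart.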

In [None]:
'''
unv["lems"]=unv["text"].apply(lemmatize)
unv["lemphrases"]=unv["lems"].apply(" ".join)
'''
    
#with open('unv.pickle', 'wb') as pu:
#    pickle.dump(unv, pu, protocol=pickle.HIGHEST_PROTOCOL)

with open('../input/toldostur/unv.pickle', 'rb') as pu:
    unv = pickle.load(pu)
    
unv.head(3)

Here we repeat the same and create two more columns: one with the lemmatized lists of the texts and another with the lemmatized sentences put together as phrases.

And we repeat the same for the smaller post-testing data:

In [None]:
# Column assignment avoids the chained indexing (unvt['lems'][i] = ...) that pandas warns about.
unvt["lems"]=unvt["text"].apply(lemmatize)
unvt["lemphrases"]=unvt["lems"].apply(" ".join)
    
unvt.head(3)

Although I'm not going to use unique vocabulary by "one and only one" author as a feature at this moment, I'd like to share some plots to show the uniqueness of the vocabulary each of the three authors uses:

In [None]:
'''
all_words={"Dostoevsky":[], "Tolstoy":[], "Turgenev":[]}

for auth in all_words.keys():
    for line in unv[unv["author"]==auth]["lems"]:
        all_words[auth].extend(line)

# Keep every occurrence of the words that the other two authors never use.
# The other authors' words go into a set so that each lookup is fast.
all_words_unique={}
for auth in all_words:
    others=set()
    for other in all_words:
        if other != auth:
            others.update(all_words[other])
    all_words_unique[auth]=[x for x in all_words[auth] if x not in others]
'''

In [None]:
#with open('all_words_unique.pickle', 'wb') as p:
#    pickle.dump(all_words_unique, p, protocol=pickle.HIGHEST_PROTOCOL)

with open('../input/toldostur/all_words_unique.pickle', 'rb') as p:
    all_words_unique = pickle.load(p)


In [None]:
Counter(all_words_unique["Tolstoy"]).most_common()[:10]

In [None]:
auth_counts={author: len(words) for author, words in all_words_unique.items()}

fig_sizes = {'S' : (6.5,4),
             'M' : (9.75,6),
             'L' : (13,8)}

def show_plot(f_size=(6.5,4),plot_title="",x_title="",y_title=""):
    """Create a titled, labelled figure and return its axes."""
    plt.figure(figsize=f_size)
    plt.xlabel(x_title)
    plt.ylabel(y_title)
    plt.title(plot_title)
    return plt.gca()

ax_bp = show_plot(fig_sizes['S'],'Unique Vocabulary by Author','Author','Count')
sns.barplot(x=list(auth_counts.keys()), y=list(auth_counts.values()), ax=ax_bp)
plt.show()

We see that Turgenev has the most unique vocabulary among the three Russian authors. Unique vocabulary by author may be added as a feature to the models in future versions.
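As a possible starting point for such a feature, the sketch below builds each author's exclusive vocabulary with sets and counts how many lemmas of a sentence fall into each one. This is an assumption about how the feature could look, not something the notebook implements. The helper names and the toy word lists are invented for illustration.

```python
def exclusive_vocabulary(words_by_author):
    """Map each author to the set of words no other author uses."""
    exclusive = {}
    for author, words in words_by_author.items():
        others = set()
        for other, other_words in words_by_author.items():
            if other != author:
                others.update(other_words)
        exclusive[author] = set(words) - others
    return exclusive

def exclusive_hits(lemmas, exclusive):
    """Count, per author, how many lemmas belong to that author's exclusive set."""
    return {a: sum(1 for w in lemmas if w in vocab) for a, vocab in exclusive.items()}

# Toy word lists, invented for illustration only.
toy = {"A": ["дом", "река", "поле"], "B": ["дом", "город"], "C": ["река", "степь"]}
excl = exclusive_vocabulary(toy)
hits = exclusive_hits(["поле", "город", "дом"], excl)
print(hits)
```

If this were used in a model, the exclusive sets would have to be built from the training sentences only, otherwise information from the validation and test sets would leak into the features.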

In [None]:
awu=set(all_words_unique['Turgenev'])| set(all_words_unique['Dostoevsky']) | set(all_words_unique['Tolstoy'])

In [None]:
unv.head()

In [None]:
training_scores=[]
validation_scores=[]
post_test_scores=[]

def model_test(d1,d2):

    y=d1['author']
    yt=d2['author']
    
    y = labelencoder.fit_transform(y)
    yt = labelencoder.transform(yt)  # reuse the mapping fitted on the training labels
    
    for i in ['text', 'lemphrases']:
        
        X=d1[i]
        Xt=d2[i]
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

        CV=CountVectorizer(ngram_range=(1,4), min_df=2)
        transformer=CV.fit(X_train)

        text_train=transformer.transform(X_train)
        text_test=transformer.transform(X_test)

        model = LogisticRegression(penalty="l2", max_iter=2000, solver="newton-cg") #MultinomialNB() 
        model = model.fit(text_train, y_train)
        
        score=model.score(text_train, y_train)
        training_scores.append(score)        
 
        score=model.score(text_test, y_test)
        validation_scores.append(score)        

        Xt=transformer.transform(Xt)  
        score=model.score(Xt, yt)
        post_test_scores.append(score)
        

    yp=d1['author']
    Xp=d1.drop(['author', 'lems'], axis=1)

    yp = labelencoder.fit_transform(yp)
    Xp_train, Xp_test, yp_train, yp_test = train_test_split(Xp, yp, test_size=0.2, random_state=1567)
    
    p_transformer=make_column_transformer((CountVectorizer(ngram_range=(1,4), min_df=2), "text"),
                                     (CountVectorizer(ngram_range=(1,3), min_df=2), "lemphrases"))                                      

    p_model=make_pipeline(p_transformer, LogisticRegression(penalty="l2", max_iter=2000, solver="newton-cg")) #MultinomialNB())
    
    p_model.fit(Xp_train, yp_train)
    
    score=p_model.score(Xp_train, yp_train)
    training_scores.append(score)
    
    score=p_model.score(Xp_test, yp_test)
    validation_scores.append(score)
    
    
    ypt=d2['author']
    Xpt=d2.drop(['author', 'lems'], axis=1)
    
    ypt = labelencoder.transform(ypt)  # reuse the mapping fitted on yp above
    score=p_model.score(Xpt, ypt)
    post_test_scores.append(score)    
    

model_test(unv, unvt) 

results=pd.DataFrame()
results["Features"]=["Original Text", "Lemmatized" , "Pipeline"]
results["Training"]=training_scores
results["Validation"]=validation_scores
results["Test"]=post_test_scores

results

The pipeline score is again better. Besides, the LogisticRegression model gives much better results for the multiclass dataset. I should repeat that this pipeline is pretty simple, and more complex ones may achieve much better results. Also, the scores on totally new data are not very good: the model needs training on much more data, more complex pipelines and some further feature engineering. Below I check the pipeline model with k-fold cross-validation, which gives a slightly better result.
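For readers new to the technique, here is a minimal sketch of what k-fold splitting does: the shuffled sample indices are cut into k folds, and each fold serves once as the validation set while the rest is used for training. It illustrates the idea only and does not reproduce scikit-learn's exact shuffling. The function name `kfold_indices` is hypothetical.

```python
import random

def kfold_indices(n_samples, n_splits=3, seed=2323):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one; the first n_samples % n_splits folds get the extra item.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10))
print([len(val) for _, val in folds])
```

The reported accuracy is then the mean of the k validation scores, which makes it less dependent on one particular train/validation split.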


In [None]:
from sklearn import model_selection

p_transformer=make_column_transformer((CountVectorizer(ngram_range=(1,4), min_df=2), "text"),
                                     (CountVectorizer(ngram_range=(1,3), min_df=2), "lemphrases"))                                      

p_model=make_pipeline(p_transformer, LogisticRegression(penalty="l2", max_iter=2000, solver="newton-cg"))

yp=unv['author']
Xp=unv.drop(['author', 'lems'], axis=1)

kfold = model_selection.KFold(n_splits=3, shuffle=True, random_state=2323)
results = model_selection.cross_val_score(p_model, Xp, yp, cv=kfold)
print("Accuracy: %.1f%%" % (results.mean()*100.0))

In [None]:
f_model=p_model.fit(Xp, yp)

In [None]:
ypt=unvt['author']
Xpt=unvt.drop(['author', 'lems'], axis=1)

f_model.score(Xpt,ypt)

### Bonus :)

I'm aware of the fact that some of us don't like wordclouds much and I must admit I also don't know to what extent they could be useful, but they "kinda" look nice :)

In [None]:
from wordcloud import WordCloud

for author, color in [("Dostoevsky","orange"), ("Tolstoy", "lightgreen"), ("Turgenev","lightblue")]:
    wc=WordCloud(font_path='../input/toldostur/a_RussDecor.ttf', random_state=42, 
                         background_color=color,  width=1200, height=900, collocations=False,
                         max_words=200) 
    wordcloud=wc.generate(' '.join(all_words_unique[author]))
    plt.figure(figsize=(9, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.suptitle(author, size="x-large", weight="bold")
    plt.axis('off')
    plt.show()