# NLP: classification task

## Notes

This notebook is a complete text classification project: it contains every step needed to analyze the data and train the model, apart from collecting the raw texts, and it produces the final result.

Each numbered part is essentially a stand-alone notebook, so the notebook can be split into three independent parts.

If you are only interested in the classification model, then skip directly to step 3.

# 1) Dataset creation

## Overview

This part builds the raw dataset for classifying sentences by author from the original texts. The output is a CSV file with two columns: "text" and "author".

The following works are used (as of the start of this project):


|Author|Works|
|---------  |-------|
|А.П. Чехов | Collection of stories |
|Ф.М. Достоевский| Collection of selected works |
|Л.Н. Толстой| Most Popular Writings |

## File creation

### Packages import

In [None]:
from typing import List
import random

import glob
from nltk import tokenize, download
import numpy as np
import pandas as pd

The sentence tokenizer needs the `punkt` data; downloading it once per machine is enough:

In [None]:
download('punkt')

### Function for loading and preprocessing text

Let's build a list of sentences that are at least 5 characters long (`min_char`), since shorter ones most likely carry no information useful for attribution. Strictly speaking, such short sentences can express an author's style, sometimes quite vividly, but the model does not use them.

To help the sentence tokenizer, some character combinations are replaced beforehand. This separates lines of dialogue from the author's narration and should resolve the problem with quotation marks.

In [None]:
def split_text(filepath: str, min_char: int = 5) -> List[str]:
    
    text = str()
    with open(filepath, 'r', encoding='utf8') as file:
        text = file.read().replace('\n', '. ')
        text = text.replace('.”', '”.').replace('."', '".').replace('?”', '”?').replace('!”', '”!')
        text = text.replace('--', ' ').replace('. . .', '').replace('_', '')
    
    sentences = tokenize.sent_tokenize(text)    
    sentences = [sentence for sentence in sentences if len(sentence) >= min_char]

    return list(sentences)
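To make the effect of the quote-related replacements concrete, here is a minimal sketch that applies the same swaps to a short sample string. The helper name `move_punctuation_outside_quotes` and the sample text are illustrative assumptions and are not part of the notebook.

```python
def move_punctuation_outside_quotes(text: str) -> str:
    """Apply the same quote-related swaps as split_text (illustrative helper)."""
    swaps = [('.”', '”.'), ('."', '".'), ('?”', '”?'), ('!”', '”!')]
    for old, new in swaps:
        text = text.replace(old, new)
    return text


# Hypothetical sample text, used only to show the transformation.
sample = '“Come in.” He sat down. “Who is it?” she asked.'
print(move_punctuation_outside_quotes(sample))
```

With the terminal punctuation moved outside the closing quote, a sentence boundary can fall right after the quoted line, which is the separation of dialogue from narration described above.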

### Create a sentence set for each author

In [None]:
chekhov = []
for path in glob.glob('../input/russian-literature/prose/Chekhov/*.txt'):
    chekhov += split_text(path)
    
dostoevsky = []
for path in glob.glob('../input/russian-literature/prose/Dostoevsky/*.txt'):
    dostoevsky += split_text(path)

tolstoy = []
for path in glob.glob('../input/russian-literature/prose/Tolstoy/*.txt'):
    tolstoy += split_text(path)

In [None]:
text_dict = { 'Chekhov': chekhov, 'Dostoevsky': dostoevsky, 'Tolstoy': tolstoy }

for key in text_dict.keys():
    print(key, ':', len(text_dict[key]), ' sentences')

Each list contains 21'860 to 117'861 sentences. In order to have an even distribution of authors in our set, we will limit the set for each, for example, to 20'000 sentences.

### Combining sentences

In [None]:
np.random.seed(1)

max_len = 20_000

names = [chekhov, dostoevsky, tolstoy]

combined = []
for name in names:
    name = np.random.choice(name, max_len, replace = False)
    combined += list(name)

print('Length of the combined list (shuffled within each author):', len(combined))

### Create the list of labels

At this point the author labels must be listed in the same order as in the previous step; otherwise the labels will not match the sentences. The ordering is maintained by hand here, with no automatic safeguard.

In [None]:
labels = ['Chekhov'] * max_len + ['Dostoevsky'] * max_len + ['Tolstoy'] * max_len

print('Length of the label list:', len(labels))

The length is printed as an extra check on the data and labels. If the two lengths are equal, every sentence in the dataset has a label; whether that label is correct depends on the ordering in the previous step.

In [None]:
len(combined) == len(labels)
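One possible way to avoid relying on the order of two separately built lists is to derive sentences and labels from a single mapping in one pass. The sketch below assumes a dictionary shaped like `text_dict`; the function name `sample_with_labels` and the toy data are hypothetical, and it uses the standard `random` module rather than the notebook's NumPy sampling.

```python
import random
from typing import Dict, List, Tuple


def sample_with_labels(texts: Dict[str, List[str]], max_len: int,
                       seed: int = 1) -> Tuple[List[str], List[str]]:
    """Draw max_len sentences per author and label each one as it is drawn."""
    rng = random.Random(seed)
    pairs = []
    for author_name, sentences in texts.items():
        # rng.sample draws without replacement, like replace=False above.
        for sentence in rng.sample(sentences, max_len):
            pairs.append((sentence, author_name))
    rng.shuffle(pairs)  # shuffle pairs, so labels travel with their sentences
    sentences_out, labels_out = zip(*pairs)
    return list(sentences_out), list(labels_out)


# Toy data for illustration only.
toy = {
    'Chekhov': ['c1', 'c2', 'c3'],
    'Dostoevsky': ['d1', 'd2', 'd3'],
    'Tolstoy': ['t1', 't2', 't3'],
}
toy_sentences, toy_labels = sample_with_labels(toy, max_len=2)
print(list(zip(toy_sentences, toy_labels)))
```

Because each label is attached at the moment its sentence is drawn, a mismatch between the two lists cannot arise from reordering the authors.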

### Randomly shuffle the data

In [None]:
random.seed(3)

zipped = list(zip(combined, labels))
random.shuffle(zipped)
combined, labels = zip(*zipped)

### Exporting the resulting dataset

In [None]:
out_data = pd.DataFrame()
out_data['text'] = combined
out_data['author'] = labels

In [None]:
print(out_data.head())
print(out_data.tail())

In [None]:
out_data.to_csv('author_data.csv', index=False)

# 2) Dataset preprocessing

This part explores the data and prepares it for model training.

## Importing packages and loading data

In [None]:
import string
import time

import numpy as np
import pandas as pd
from collections import Counter

import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.stem.porter import PorterStemmer

In [None]:
data = pd.read_csv('author_data.csv', encoding='utf8')
print(data.head())

In [None]:
text = list(data['text'].values)
author = list(data['author'].values)

print('Dataset contains {} sentences.'.format(len(text)))

## Data exploration

Number of sentences for each author:

In [None]:
authors = Counter(author)
authors

In [None]:
author_names = list(authors.keys())
author_names

Let's look at some sample sentences:

In [None]:
np.random.seed(73)
n = len(text)

for _ in range(5):
    print(text[np.random.randint(0, n)])

## Calculating statistics by words:

In [None]:
word_count = np.array([len(sent.split()) for sent in text])
char_count = np.array([len(sent) for sent in text])
ave_length = char_count / word_count

In [None]:
def get_stats(var):    
    print('\t Min: ', np.min(var))
    print('\t Max: ', np.max(var))
    print('\t Average: ', np.mean(var))
    print('\t Median: ', np.median(var))
    print('\t Percentile 1%: ', np.percentile(var, 1))
    print('\t Percentile 95%: ', np.percentile(var, 95))
    print('\t Percentile 99%: ', np.percentile(var, 99))
    print('\t Percentile 99.5%: ', np.percentile(var, 99.5))
    print('\t Percentile 99.9%: ', np.percentile(var, 99.9))

### Word count

In [None]:
print('Word count statistics:')
get_stats(word_count)

In [None]:
sns.distplot(word_count, kde=True, bins=80, color='green').set_title('Distribution of word count')
plt.xlabel('Sentence length in words')
plt.ylabel('Number of sentences')
plt.xlim(0, 100)
plt.savefig('word_count.png')

### Character count

In [None]:
print('Character count statistics:')
get_stats(char_count)

In [None]:
sns.distplot(char_count, kde=True, bins=80, color='green').set_title('Distribution of characters')
plt.xlabel('Sentence length in characters')
plt.ylabel('Number of sentences')
plt.xlim(0, 400)
plt.savefig('char_count.png')

### Average length

In [None]:
print('Average length statistics:')
get_stats(ave_length)

In [None]:
sns.distplot(ave_length, kde=True, bins=80, color='green').set_title('Distribution of average word length')
plt.xlabel('Average word length in characters')
plt.ylabel('Number of sentences')
plt.xlim(0, 10)
plt.savefig('ave_length.png')

## Examining outliers in data

### Extremely long sentences

In [None]:
word_outliers = np.where(word_count > 150)

for i in word_outliers[0][:5]:
    print('Author: {}, Sentence length: {}'.format(author[i], word_count[i]))
    print(text[i], '\n')

In [None]:
max_authors = {author : 0 for author in author_names}

for i in word_outliers[0]:
    max_authors[author[i]] += 1

Counter(max_authors)

### Extremely short sentences

In [None]:
word_outliers = np.where(word_count < 2)

for i in word_outliers[0][:10]:
    print('Sentence length: {}'.format(word_count[i]))
    print(text[i], '\n')

## Exploring symbols

Let's create a dictionary that counts how many times each character occurs in the dataset.

In [None]:
# Join all sentences into one lowercase string and count every character.
text_string = ''.join(sents.lower() for sents in text)

char_cnt = Counter(text_string)
print(char_cnt)
print(len(char_cnt), 'distinct symbols in data.')

All symbols used:

In [None]:
print(list(char_cnt.keys()))

Many of them fall outside the expected set of Cyrillic letters and punctuation marks. Let's find the sentences in which they occur.

In [None]:
accented_chars = ['f', 'u', 'r', 's', 'i', 'c', 'h', '́', 'n', 'd', 'p', 'e', 'a', 't', 'o', 'l', 'x', 'm', 'j', 'é', 'ô', 'v', 'q', 'ê', 'g', 'b', 'k', 'y', 'à', 'і', 'z', 'w', 'è', 'ó', 'ö', '°', 'ç', 'ï', 'á', 'ü', 'ù', 'û', 'î', 'ѣ', 'â']

accented_text = []
for i in range(len(text)):
    for j in text[i]:
        if j in accented_chars:
            accented_text.append(i)
        
accented_text = list(set(accented_text))
 
print(len(accented_text), 'sentences contain unusual symbols.')

In [None]:
for i in accented_text[:10]:
    print('Sentence number {}: '.format(i))
    print(text[i], '\n')
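As an alternative to listing the unusual characters by hand, they could be found by checking each character against a whitelist. The sketch below is only an assumption about what counts as "expected" (Russian letters, digits, whitespace, and ASCII punctuation); the names `ALLOWED` and `unexpected_chars` and the sample sentences are hypothetical.

```python
import string
from collections import Counter
from typing import Iterable

# Assumed whitelist: Russian letters, digits, whitespace, and ASCII punctuation.
RUSSIAN_LETTERS = set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя')
ALLOWED = RUSSIAN_LETTERS | set(string.digits + string.whitespace + string.punctuation)


def unexpected_chars(sentences: Iterable[str]) -> Counter:
    """Count characters that fall outside the assumed whitelist."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ch for ch in sentence.lower() if ch not in ALLOWED)
    return counts


# Sample sentences for illustration only.
sample_sentences = ['Он сказал: «Bonjour!»', 'Температура была 20°.']
unexpected = unexpected_chars(sample_sentences)
print(unexpected)
```

Typographic quotes and dashes would also show up in this count, so the whitelist may need extending depending on which marks are considered normal for the corpus.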

Judging by the sentences examined above, the data is quite suitable for analysis. The only cleanup needed is to remove the runs of repeated spaces and a few invalid characters that are artifacts of the original texts.

## Data preparation

Here we remove the invalid, uninformative characters.

In [None]:
text = [excerpt.replace('\xa0', '').replace('\x7f', '') for excerpt in text]

Next, we count and collapse the runs of repeated spaces.

In [None]:
ctr = 0
for excerpt in text:
    if '  ' in excerpt:
        ctr += 1

print(ctr, 'occurrences of large blocks of indentation.')

In [None]:
new_text = []
for excerpt in text:
    while '  ' in excerpt:
        excerpt = excerpt.replace('  ',' ')
    new_text.append(excerpt)

text = new_text
print(len(text))
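The same collapsing of repeated spaces can also be written with a regular expression. This is a minimal sketch; the helper name `collapse_spaces` is an assumption introduced for illustration.

```python
import re


def collapse_spaces(sentence: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r' {2,}', ' ', sentence)


print(collapse_spaces('one   two     three'))
```

It could be applied as `text = [collapse_spaces(excerpt) for excerpt in text]` in place of the `while` loop.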

Remove punctuation and convert all letters of the sentence to lowercase.

In [None]:
normed_text = []

for sent in text:
    new = sent.lower()
    new = new.translate(str.maketrans('','', string.punctuation))
    new = new.replace('“', '').replace('”', '') # curly double quotes
    new = new.replace('‟', '') # reversed high double quote
    new = new.replace('«', '').replace('»', '') # guillemets (Russian "ёлочки" quotes)
    new = new.replace('—', '').replace('–', '') # em dash and en dash
    new = new.replace('(', '').replace(')', '') # already covered by string.punctuation; kept as a safeguard
    new = new.replace('…', '') # ellipsis as one character
    
    normed_text.append(new)
    
print(normed_text[0:5])
print(len(normed_text))
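The chain of `replace` calls can be folded into a single translation table. The sketch below assumes the same set of marks as the cell above; the names `EXTRA_MARKS`, `STRIP_TABLE`, and `normalize` are hypothetical.

```python
import string

# Characters to strip: ASCII punctuation plus the typographic marks handled above.
EXTRA_MARKS = '“”‟«»—–…'
STRIP_TABLE = str.maketrans('', '', string.punctuation + EXTRA_MARKS)


def normalize(sentence: str) -> str:
    """Lowercase a sentence and delete all characters listed in STRIP_TABLE."""
    return sentence.lower().translate(STRIP_TABLE)


normalized = normalize('«Да», — сказал он… (тихо).')
print(repr(normalized))
```

Note that deleting a dash that sits between two spaces leaves a double space behind, so it may be worth collapsing spaces again after this step.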

## Save the prepared data

In [None]:
data['text'] = normed_text

data.to_csv('preprocessed_data.csv', index=False)

# 3) Analysis

Training and evaluating the models.

## Importing packages and loading pre-prepared data

In [None]:
from typing import List

import numpy as np
import pandas as pd
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt
import string
import time

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.preprocessing import LabelBinarizer

import keras
from keras.models import Model
from keras.layers import Input, Dense, Flatten, Dropout, Embedding
from keras.layers import Conv1D, MaxPooling1D, concatenate
from keras.optimizers import Adam
from keras.preprocessing.text import one_hot
from keras.callbacks import ModelCheckpoint 

from scipy import stats

In [None]:
data = pd.read_csv("preprocessed_data.csv", encoding='utf8')
print(data.head())

In [None]:
normed_text = list(data['text'])
author = list(data['author'])

authors_names = list(Counter(author).keys())
authors_count = len(authors_names)

normed_text = [str(i) for i in normed_text]

## Section with statistics and output functions

In [None]:
def plot_confusion_matrix(cm, classes: List[str],
                          normalize: bool = False,
                          title: str = 'Confusion matrix',
                          cmap = plt.cm.Greens):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print('Normalized confusion matrix')
    else:
        print('Unnormalized confusion matrix')

    print(cm)
       
    df_cm = pd.DataFrame(cm, index = classes,
                  columns = classes)
    sns.heatmap(df_cm, annot=True, cmap = cmap)
    plt.ylabel('True author')
    plt.xlabel('Predicted author')
    plt.title(title)

In [None]:
def plot_history_of_accurancy(history):
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epochs')
    plt.legend(['training data', 'validation data'], loc='upper left')

In [None]:
def plot_history_of_loss(history):
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('loss')
    plt.xlabel('epochs')
    plt.legend(['training data', 'validation data'], loc='upper left')

## Preparing data for direct use

### Split into training and test sets

In [None]:
text_train, text_test, author_train, author_test = train_test_split(normed_text, author, test_size=0.2, random_state=5)

In [None]:
print(np.shape(text_train))
print(np.shape(text_test))
print(np.shape(author_train))
print(np.shape(author_test))

### Create n-gram sequences

In [None]:
def create_n_grams(excerpt_list: List[str], n: int, vocab_size: int, seq_size: int):
    n_gram_list = []

    for excerpt in excerpt_list:
        excerpt = excerpt.replace(" ", "")

        n_grams = [excerpt[i:i + n] for i in range(len(excerpt) - n + 1)]

        new_string = " ".join(n_grams)

        hot = one_hot(new_string, round(vocab_size * 1.3))

        hot_len = len(hot)
        if hot_len >= seq_size:
            hot = hot[0:seq_size]
        else:
            diff = seq_size - hot_len
            extra = [0]*diff
            hot = hot + extra

        n_gram_list.append(hot)
    
    n_gram_array = np.array(n_gram_list)
    
    return n_gram_array
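To illustrate what `create_n_grams` does before hashing, the sketch below shows the character n-gram extraction and the pad-or-truncate step in isolation. The helper names `char_ngrams` and `pad_or_truncate` are assumptions introduced for this example; the hashing done by `one_hot` is left out.

```python
from typing import List


def char_ngrams(sentence: str, n: int) -> List[str]:
    """Return overlapping character n-grams after removing spaces."""
    compact = sentence.replace(' ', '')
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]


def pad_or_truncate(ids: List[int], seq_size: int) -> List[int]:
    """Cut the sequence to seq_size or right-pad it with zeros."""
    return ids[:seq_size] + [0] * max(0, seq_size - len(ids))


print(char_ngrams('он шёл', 3))
print(pad_or_truncate([5, 9, 2], 5))
```

Because spaces are removed first, n-grams span word boundaries, so the model sees character patterns rather than whole words.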

In [None]:
def get_vocab_size(excerpt_list: List[str], n: int, seq_size: int) -> int:
    n_gram_list = []

    for excerpt in excerpt_list:
        excerpt = excerpt.replace(" ", "")
   
        n_grams = [excerpt[i:i + n] for i in range(len(excerpt) - n + 1)]

        gram_len = len(n_grams)
        if gram_len >= seq_size:
            n_grams = n_grams[0:seq_size]
        else:
            diff = seq_size - gram_len
            extra = [0]*diff
            n_grams = n_grams + extra
        
        n_gram_list.append(n_grams)
    
    n_gram_list = list(np.array(n_gram_list).flat)
    
    n_gram_cnt = Counter(n_gram_list)
    vocab_size = len(n_gram_cnt)
    
    return vocab_size
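A lighter-weight way to get a comparable vocabulary size is to collect n-grams into a set instead of building a padded array first. This is a sketch under the assumption that the padding entry does not need to be counted, so the result can differ from `get_vocab_size` by one; the function name `count_unique_ngrams` is hypothetical.

```python
from typing import Iterable, Set


def count_unique_ngrams(sentences: Iterable[str], n: int, seq_size: int) -> int:
    """Count distinct character n-grams among the first seq_size n-grams of each sentence."""
    seen: Set[str] = set()
    for sentence in sentences:
        compact = sentence.replace(' ', '')
        # Number of n-grams to keep; negative values yield an empty range.
        keep = min(len(compact) - n + 1, seq_size)
        seen.update(compact[i:i + n] for i in range(keep))
    return len(seen)


print(count_unique_ngrams(['abab', 'abc'], n=2, seq_size=350))
```

This avoids holding a `len(excerpt_list) × seq_size` array of strings in memory, which may matter for tens of thousands of sentences.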

Determine the vocabulary size for n from 1 to 3 inclusive:

In [None]:
vocab_sizes = []
for i in range(1, 4):
    vocab_sizes.append(get_vocab_size(text_train, i, 350))
    print('Size for n =', i, 'is:', vocab_sizes[i - 1])

And create lists of n-grams:

In [None]:
gram1_train = create_n_grams(text_train, 1, vocab_sizes[0], 350)
gram2_train = create_n_grams(text_train, 2, vocab_sizes[1], 350)
gram3_train = create_n_grams(text_train, 3, vocab_sizes[2], 350)

In [None]:
gram1_test = create_n_grams(text_test, 1, vocab_sizes[0], 350)
gram2_test = create_n_grams(text_test, 2, vocab_sizes[1], 350)
gram3_test = create_n_grams(text_test, 3, vocab_sizes[2], 350)

In [None]:
print(np.shape(gram1_train))
print(np.shape(gram2_train))
print(np.shape(gram3_train))

print(np.shape(gram1_test))
print(np.shape(gram2_test))
print(np.shape(gram3_test))

Let's determine the maximum n-gram index value, which will be used when building the network.

In [None]:
max_1gram = np.max(gram1_train)
max_2gram = np.max(gram2_train)
max_3gram = np.max(gram3_train)

print('Max value for unigrams: ', max_1gram)
print('Max value for bigrams: ', max_2gram)
print('Max value for trigrams: ', max_3gram)

## Vectorization

In [None]:
processed_train = text_train
processed_test = text_test

print(processed_train[0:5])

In [None]:
vectorizer = TfidfVectorizer(strip_accents = 'unicode', min_df = 6)
vectorizer.fit(processed_train)

print('Dictionary size: ', len(vectorizer.vocabulary_))

words_train = vectorizer.transform(processed_train)
words_test = vectorizer.transform(processed_test)

In [None]:
author_lb = LabelBinarizer()

author_lb.fit(author_train)
author_train_hot = author_lb.transform(author_train)
author_test_hot = author_lb.transform(author_test)

## Model implementation

Reference for the multichannel n-gram CNN architecture: https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/

In [None]:
def define_model(input_len: int, output_size: int, vocab_size : int, embedding_dim: int, verbose: bool = True,
                drop_out_pct: float = 0.25, conv_filters: int = 500, activation_fn: str = 'relu', pool_size: int = 2, learning: float = 0.0001):
    inputs1 = Input(shape=(input_len,))
    embedding1 = Embedding(vocab_size, embedding_dim)(inputs1)
    drop1 = Dropout(drop_out_pct)(embedding1)
    conv1 = Conv1D(filters=conv_filters, kernel_size=3, activation=activation_fn)(drop1)
    pool1 = MaxPooling1D(pool_size=pool_size)(conv1)
    flat1 = Flatten()(pool1)
    
    inputs2 = Input(shape=(input_len,))
    embedding2 = Embedding(vocab_size, embedding_dim)(inputs2)
    drop2 = Dropout(drop_out_pct)(embedding2)
    conv2 = Conv1D(filters=conv_filters, kernel_size=4, activation=activation_fn)(drop2)
    pool2 = MaxPooling1D(pool_size=pool_size)(conv2)
    flat2 = Flatten()(pool2)
    
    inputs3 = Input(shape=(input_len,))
    embedding3= Embedding(vocab_size, embedding_dim)(inputs3)
    drop3 = Dropout(drop_out_pct)(embedding3)
    conv3 = Conv1D(filters=conv_filters, kernel_size=5, activation=activation_fn)(drop3)
    pool3 = MaxPooling1D(pool_size=pool_size)(conv3)
    flat3 = Flatten()(pool3)
    
    merged = concatenate([flat1, flat2, flat3])
    
    output = Dense(output_size, activation='softmax')(merged)
    
    model = Model(inputs=[inputs1, inputs2, inputs3], outputs=output)
    
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=learning), metrics=['accuracy'])
    
    if verbose:
        print(model.summary())
        
    return model
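To get a feel for the size of this network, the flattened output of each branch can be worked out by hand. The sketch below assumes the Keras defaults that `define_model` relies on ("valid" padding, convolution stride 1, pooling stride equal to `pool_size`) and the three author classes; the helper name `branch_features` is hypothetical.

```python
def branch_features(input_len: int, kernel_size: int,
                    conv_filters: int = 500, pool_size: int = 2) -> int:
    """Flattened feature count of one Conv1D -> MaxPooling1D -> Flatten branch."""
    conv_len = input_len - kernel_size + 1              # 'valid' convolution, stride 1
    pool_len = (conv_len - pool_size) // pool_size + 1  # 'valid' pooling, stride = pool_size
    return pool_len * conv_filters


sizes = [branch_features(350, k) for k in (3, 4, 5)]
merged = sum(sizes)
dense_params = merged * 3 + 3  # weights plus biases for 3 output classes
print(sizes, merged, dense_params)
```

Under these assumptions the concatenated vector has 260,000 features, so the final `Dense` layer alone holds 780,003 parameters; the numbers reported by `model.summary()` remain the authoritative ones.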

## Calculations

In [None]:
gram1_model = define_model(350, authors_count, max_1gram + 1, 26)

In [None]:
gram1_model_history = gram1_model.fit([gram1_train, gram1_train, gram1_train], author_train_hot, epochs=10, batch_size=32, 
                verbose = 1, validation_split = 0.2)

In [None]:
gram2_model = define_model(350, authors_count, max_2gram + 1, 300)

In [None]:
gram2_model_history = gram2_model.fit([gram2_train, gram2_train, gram2_train], author_train_hot, epochs=10, batch_size=32, 
                verbose = 1, validation_split = 0.2)

In [None]:
t0 = time.time()
gram3_model = define_model(350, authors_count, max_3gram + 1, 600)

In [None]:
gram3_model_history = gram3_model.fit([gram3_train, gram3_train, gram3_train], author_train_hot, epochs=10, batch_size=32, 
                verbose=1, validation_split=0.2)
t1 = time.time()

## Statistics for the first trigram model

In [None]:
author_pred1 = gram3_model.predict([gram3_test, gram3_test, gram3_test])

t2 = time.time()

author_pred1 = author_lb.inverse_transform(author_pred1)

accuracy = accuracy_score(author_test, author_pred1)
precision, recall, f1, support = score(author_test, author_pred1)
ave_precision = np.average(precision, weights = support/np.sum(support))
ave_recall = np.average(recall, weights = support/np.sum(support))
ave_f1 = np.average(f1, weights = support/np.sum(support))
confusion = confusion_matrix(author_test, author_pred1, labels=authors_names)
    
print('Accuracy:', accuracy)
print('Average Precision:', ave_precision)
print('Average Recall:', ave_recall)
print('Average F1 Score:', ave_f1)
print('Training time:', (t1 - t0), 'seconds')
print('Prediction time:', (t2 - t1), 'seconds')
print('Confusion matrix:\n', confusion)
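The "average" metrics above are per-class scores weighted by class support. As a hedged illustration of that calculation, the sketch below computes support-weighted precision in plain Python on made-up labels; the function name `weighted_precision` and the toy data are assumptions, and a class that is never predicted is given precision 0.

```python
from collections import Counter
from typing import Sequence


def weighted_precision(true: Sequence[str], pred: Sequence[str]) -> float:
    """Per-class precision averaged with weights equal to each class's support."""
    support = Counter(true)      # how often each class really occurs
    predicted = Counter(pred)    # how often each class was predicted
    correct = Counter(t for t, p in zip(true, pred) if t == p)
    total = 0.0
    for label, n_true in support.items():
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        total += precision * n_true
    return total / len(true)


# Toy labels for illustration only; they are not results from this notebook.
y_true = ['A', 'A', 'A', 'B', 'B', 'C']
y_pred = ['A', 'A', 'B', 'B', 'B', 'A']
print(weighted_precision(y_true, y_pred))
```

Since `np.average` normalizes its weights, passing `weights=support` would give the same value as `weights=support/np.sum(support)`.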

In [None]:
plot_confusion_matrix(confusion, classes=authors_names, \
                      normalize=True, title='Normalized confusion matrix - Model 1')

plt.savefig('confusion_model1.png')

In [None]:
plot_history_of_accurancy(gram3_model_history)
plt.savefig('accurancy_model1.png')

In [None]:
plot_history_of_loss(gram3_model_history)
plt.savefig('loss_model1.png')

In [None]:
keras.utils.plot_model(gram3_model, 'gram3_model1_arh.png')

The trigram model showed the best accuracy, so it is chosen as the main model.

The improved version should only be trained for 5 epochs, because the graph shows a plateau and even a decline in model accuracy after this point.
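Rather than reading the epoch off the plot, it could be taken from the training history directly. This is a minimal sketch that assumes a dictionary shaped like `history.history`; the helper name `best_epoch` is hypothetical and the numbers are invented purely for illustration.

```python
from typing import Dict, List


def best_epoch(history: Dict[str, List[float]], key: str = 'val_accuracy') -> int:
    """Return the 1-based epoch with the highest value of the chosen metric."""
    values = history[key]
    return max(range(len(values)), key=values.__getitem__) + 1


# Made-up numbers for illustration only; they are not results from this notebook.
fake_history = {'val_accuracy': [0.61, 0.70, 0.74, 0.76, 0.77, 0.76, 0.75]}
print(best_epoch(fake_history))
```

In the notebook this would be called as `best_epoch(gram3_model_history.history)`.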

## Improvement

Retrain the trigram model with one additional channel.

In [None]:
def define_model2(input_len: int, output_size: int, vocab_size: int, embedding_dim: int, verbose: bool = True,
                drop_out_pct: float = 0.25, conv_filters: int = 500, activation_fn: str = 'relu', pool_size:int = 2, learning: float = 0.0001):
    
    inputs1 = Input(shape=(input_len,))
    embedding1 = Embedding(vocab_size, embedding_dim)(inputs1)
    drop1 = Dropout(drop_out_pct)(embedding1)
    conv1 = Conv1D(filters=conv_filters, kernel_size=3, activation=activation_fn)(drop1)
    pool1 = MaxPooling1D(pool_size=pool_size)(conv1)
    flat1 = Flatten()(pool1)
    
    inputs2 = Input(shape=(input_len,))
    embedding2 = Embedding(vocab_size, embedding_dim)(inputs2)
    drop2 = Dropout(drop_out_pct)(embedding2)
    conv2 = Conv1D(filters=conv_filters, kernel_size=4, activation=activation_fn)(drop2)
    pool2 = MaxPooling1D(pool_size=pool_size)(conv2)
    flat2 = Flatten()(pool2)

    inputs3 = Input(shape=(input_len,))
    embedding3= Embedding(vocab_size, embedding_dim)(inputs3)
    drop3 = Dropout(drop_out_pct)(embedding3)
    conv3 = Conv1D(filters=conv_filters, kernel_size=5, activation=activation_fn)(drop3)
    pool3 = MaxPooling1D(pool_size=pool_size)(conv3)
    flat3 = Flatten()(pool3)
    
    inputs4 = Input(shape=(input_len,))
    embedding4 = Embedding(vocab_size, embedding_dim)(inputs4)
    drop4 = Dropout(drop_out_pct)(embedding4)
    conv4 = Conv1D(filters=conv_filters, kernel_size=6, activation=activation_fn)(drop4)
    pool4 = MaxPooling1D(pool_size=pool_size)(conv4)
    flat4 = Flatten()(pool4)
    
    merged = concatenate([flat1, flat2, flat3, flat4])
    
    output = Dense(output_size, activation='softmax')(merged)
    
    model = Model(inputs = [inputs1, inputs2, inputs3, inputs4], outputs = output)
    
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=learning), metrics=['accuracy'])
    
    if verbose:
        print(model.summary())
        
    return model

In [None]:
t0 = time.time()
gram3_model2 = define_model2(350, authors_count, max_3gram + 1, 600)

In [None]:
gram3_model2_history = gram3_model2.fit([gram3_train, gram3_train, gram3_train, gram3_train], author_train_hot, epochs=5, batch_size=32, 
                verbose=1, validation_split=0.2)
t1 = time.time()

In [None]:
author_pred2 = gram3_model2.predict([gram3_test, gram3_test, gram3_test, gram3_test])

t2 = time.time()

author_pred2 = author_lb.inverse_transform(author_pred2)

accuracy = accuracy_score(author_test, author_pred2)
precision, recall, f1, support=score(author_test, author_pred2)
ave_precision = np.average(precision, weights=support/np.sum(support))
ave_recall = np.average(recall, weights=support/np.sum(support))
ave_f1 = np.average(f1, weights=support/np.sum(support))
confusion = confusion_matrix(author_test, author_pred2, labels=authors_names)
    
print('Accuracy:', accuracy)
print('Average Precision:', ave_precision)
print('Average Recall:', ave_recall)
print('Average F1 Score:', ave_f1)
print('Training time:', (t1 - t0), 'seconds')
print('Prediction time:', (t2 - t1), 'seconds')
print('Confusion matrix:\n', confusion)

In [None]:
plot_confusion_matrix(confusion, classes=authors_names, \
                      normalize=True, title='Normalized confusion matrix - Model 2')

plt.savefig('confusion_model2.png')

In [None]:
plot_history_of_accurancy(gram3_model2_history)
plt.savefig('accurancy_model2.png')

In [None]:
plot_history_of_loss(gram3_model2_history)
plt.savefig('loss_model2.png')

In [None]:
keras.utils.plot_model(gram3_model2, 'gram3_model2_arh.png')

# 4*) Benchmarks and comparison

In [None]:
accuracy_list = []
prec_list = []
recall_list = []
f1_list = []

for i in range(10):
    author_pred3 = np.random.choice(authors_names, len(author_test))

    accuracy = accuracy_score(author_test, author_pred3)
    precision, recall, f1, support = score(author_test, author_pred3)
    ave_precision = np.average(precision, weights = support/np.sum(support))
    ave_recall = np.average(recall, weights = support/np.sum(support))
    ave_f1 = np.average(f1, weights = support/np.sum(support))
    
    accuracy_list.append(accuracy)
    prec_list.append(ave_precision)
    recall_list.append(ave_recall)
    f1_list.append(ave_f1)

print('Accuracy:', accuracy_list, np.mean(accuracy_list), np.std(accuracy_list))
print('Average Precision:', prec_list, np.mean(prec_list), np.std(prec_list))
print('Average Recall:', recall_list, np.mean(recall_list), np.std(recall_list))
print('Average F1 Score:', f1_list, np.mean(f1_list), np.std(f1_list))
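For three equally frequent authors, uniformly random guessing should be right about one time in three, which is the reference point for the baseline above. The sketch below simulates this with the standard `random` module on synthetic balanced labels; the sample size and seed are arbitrary assumptions.

```python
import random

rng = random.Random(0)
classes = ['Chekhov', 'Dostoevsky', 'Tolstoy']
n = 12_000

# Balanced true labels and uniformly random guesses.
true_labels = [classes[i % 3] for i in range(n)]
guesses = [rng.choice(classes) for _ in range(n)]

baseline_accuracy = sum(t == g for t, g in zip(true_labels, guesses)) / n
print(round(baseline_accuracy, 3))  # expected to be close to 1/3
```

Any trained model should clear this level by a wide margin to be considered useful.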

In [None]:
for i in range(100):
    print('Sentence', i, '- True author =', author_test[i], '| Model 1 prediction =', author_pred1[i],
          '| Model 2 prediction =', author_pred2[i])
    print(text_test[i], '\n')

In [None]:
def calculate_averages(true, pred, text):
    
    correct_len_chars = []
    incorrect_len_chars = []
    correct_len_words = []
    incorrect_len_words = []

    
    for i in range(len(true)):
        if true[i] == pred[i]:
            correct_len_chars.append(len(text[i]))
            correct_len_words.append(len(text[i].split()))
        else:
            incorrect_len_chars.append(len(text[i]))
            incorrect_len_words.append(len(text[i].split()))
    
    correct_ave_chars = np.mean(correct_len_chars)
    correct_ave_words = np.mean(correct_len_words)
    incorrect_ave_chars = np.mean(incorrect_len_chars)
    incorrect_ave_words = np.mean(incorrect_len_words)
    
    print('t-test for characters')
    print(stats.ttest_ind(correct_len_chars, incorrect_len_chars, equal_var = False))
    
    print('\nt-test for words')
    print(stats.ttest_ind(correct_len_words, incorrect_len_words, equal_var = False))
    
    return correct_ave_chars, correct_ave_words, incorrect_ave_chars, incorrect_ave_words
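Calling `stats.ttest_ind` with `equal_var=False` performs Welch's t-test, which does not assume equal variances in the two groups. As an illustration of the statistic it computes, here is a plain-Python sketch; the function name `welch_t` and the sample values are assumptions, and the p-value calculation is omitted.

```python
import math
from statistics import mean, variance
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    # statistics.variance is the sample variance (divides by n - 1).
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


t_value = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_value)
```

A large absolute value of the statistic would suggest that correctly and incorrectly classified sentences differ in average length.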

In [None]:
correct_ave_chars1, correct_ave_words1, incorrect_ave_chars1, incorrect_ave_words1\
= calculate_averages(author_test, author_pred1, text_test)

In [None]:
correct_ave_chars2, correct_ave_words2, incorrect_ave_chars2, incorrect_ave_words2\
= calculate_averages(author_test, author_pred2, text_test)

In [None]:
print('Model 1 - Average length of correctly predicted sentences in characters =', correct_ave_chars1,
      ', incorrectly predicted =', incorrect_ave_chars1)
print('Model 2 - Average length of correctly predicted sentences in characters =', correct_ave_chars2,
      ', incorrectly predicted =', incorrect_ave_chars2)

print('\nModel 1 - Average length of correctly predicted sentences in words =', correct_ave_words1,
      ', incorrectly predicted =', incorrect_ave_words1)
print('Model 2 - Average length of correctly predicted sentences in words =', correct_ave_words2,
      ', incorrectly predicted =', incorrect_ave_words2)
      ', incorrect =', incorrect_ave_words2)