# <center> Exploring Pre-WWII Soviet Socialist Realism with Text Analysis

<img src="image.png">

About: This notebook uses logistic regression and topic modeling to explore the particularities of the Soviet literary genre of socialist realism in the pre-WWII Soviet Union.  

The Python modules nltk, gensim, pymystem3, pyLDAvis, pandas, and scikit-learn are used in this notebook.

In [1]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords 
import string
import gensim
import scipy
from sklearn.utils import shuffle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from gensim import corpora
from gensim import models
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt
from pymystem3 import Mystem
import pyLDAvis
import re

# Section 1 : Predictive Analysis

This section splits the corpus (both socialist realism and banned literature) into training and testing data and uses logistic regression to predict which of the two categories each document in the testing data belongs to.   

Load .csv with both socialist realism and banned literature. 

In [2]:
df = pd.read_csv("csv_files/data_for_log_reg_smaller_chunks.csv")
df = shuffle(df)

y = df['bannedorsoclit']


Clean and lemmatize the text.

In [3]:
stop = list(stopwords.words('russian'))
stop.extend(['это', 'свой', 'то', ' ', 'то', 'который'])
mystem = Mystem() 
exclude = list(string.punctuation)

def clean(text):
    tokens = text.split(" ")
    tokens = [i.strip().lower() for i in tokens]
    tokens = [i for i in tokens if i not in exclude]
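    # Strip the characters ! | , ? . : ; wherever they occur in a token; hyphens are kept, so words like "кто-то" survive
    tokens = [re.sub(r'[!|,?.:;]', "", item) for item in tokens]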
    tokens = [i for i in tokens if i not in stop]
    cleanedtext = " ".join(tokens)
    lemmatized = mystem.lemmatize(cleanedtext.lower())
    lemmatized = [i for i in lemmatized if i not in exclude]
    lemmatized = [i for i in lemmatized if i not in stop]
    lemmatized = " ".join(str(x) for x in lemmatized)
    return lemmatized


df["cleanedtext"] = df["text"].apply(clean)

df.head()


Unnamed: 0,source,title,date,text,bannedorsoclit,cleanedtext
222,\n\nhttp://www.lib.ru/PROZA/SHOLOHOW/tihijdon1...,\nТихий Дон\n,4,"в рубке , постарше - отвиливали от занятий . Л...",1,рубка старший отвиливать занятие человек хрипн...
187,\n\nhttp://lib.ru/RUSSLIT/OSTROWSKIJ/rozhdenn_...,\n\nРожденные бурей\n\n,4,"даже тогда были далеки от Черного моря ) , и п...",1,далекий черный море погибать оттого каждый уез...
197,\n\nhttp://lib.ru/RUSSLIT/OSTROWSKIJ/kak_zakal...,\n\nКак закалялась сталь\n\n,5,это время со стороны мельницы в город въезжал ...,1,время сторона мельница город въезжать вооружен...
135,\n\nhttp://librebook.me/cement\n\n,\nЦемент\n,2,"проклятых… Вишь , морды какие нахолили ! .. Мо...",1,проклятый … вишь морда нахолить чертол коза...
136,\n\nhttp://librebook.me/cement\n\n,\nЦемент\n,3,"Здравствуйте , товарищ Чумалова ! Заведующий с...",1,здравствовать товарищ чумалов заведующий прийт...
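
To make the punctuation step in `clean` concrete, the sketch below applies a pattern equivalent to the one used there to a few tokens. The sample tokens are made up for illustration. Because the character class is not anchored to the end of the string, the listed characters are removed wherever they occur in a token, while hyphens are left alone.

```python
import re

# Equivalent to the substitution in clean(); the sample tokens are hypothetical.
PATTERN = r'[!|,?.:;]'

def strip_punct(token):
    """Remove the characters ! | , ? . : ; wherever they occur in the token."""
    return re.sub(PATTERN, "", token)

samples = ["дон.", "кто-то", "т.е.", "что?!", "..."]
stripped = [strip_punct(t) for t in samples]
print(stripped)
```

Tokens that consist only of punctuation become empty strings, which is why a later filtering step is needed to drop them.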


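Vectorize the text and split it into training and testing data. Note that the vectorizer is fit on the raw `text` column rather than on `cleanedtext`, so the results below reflect unlemmatized text with stopwords included.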

In [4]:
count_vectorizer = CountVectorizer(stop_words=None)
# count_vectorizer = TfidfVectorizer()
features = count_vectorizer.fit_transform(
    df['text'])

features_nd = features.toarray()

X_train, X_test, y_train, y_test = train_test_split(
                                             features_nd[0:len(df['text'])], y, 
                                             test_size=0.4, 
                                             random_state=53)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(163, 120538) (110, 120538) (163,) (110,)
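
The shape printed above is (documents, vocabulary terms): each row of the count matrix is one document and each column is one distinct token. As a rough illustration of what such a matrix holds, here is a pure-Python sketch. It is not scikit-learn's implementation; the helper names and the two sample documents are hypothetical, and the tokenizer only mirrors `CountVectorizer`'s default of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

def count_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical two-document corpus for illustration.
docs = ["Товарищ сказал: завод, завод!", "Профессор и товарищ"]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
print(rows)
```

With 120,538 columns and only 273 rows, most entries of the real matrix are zero, which is why `fit_transform` returns a sparse matrix before `toarray()` is called.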


Apply logistic regression.

In [5]:
log_model = LogisticRegression()
log_model = log_model.fit(X=X_train, y=y_train)
y_pred = log_model.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(log_model.score(X_test, y_test)))

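# random_state only has an effect when shuffle=True; recent scikit-learn versions raise an error if it is set without shuffling
kfold = model_selection.KFold(n_splits=10)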
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))

# Use a separate name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print('confusion_matrix', cm)

print(classification_report(y_test, y_pred))




Accuracy of logistic regression classifier on test set: 0.95
10-fold cross validation average accuracy: 0.951
confusion_matrix [[30  6]
 [ 0 74]]
             precision    recall  f1-score   support

          0       1.00      0.83      0.91        36
          1       0.93      1.00      0.96        74

avg / total       0.95      0.95      0.94       110



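The results indicate that logistic regression can distinguish the socialist realism corpus from the banned literature corpus with about 95% accuracy on the test set, in line with the 10-fold cross-validation average. All six errors in the confusion matrix are documents with true label 0 that were predicted as label 1; no label-1 document was misclassified. Deep learning techniques might raise accuracy further, but that was not tested here.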
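
The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below does so in plain Python, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `per_class_metrics` is introduced here for illustration only.

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels.
cm = [[30, 6],
      [0, 74]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.3f}")
for k in (0, 1):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The lower recall for class 0 comes entirely from the six class-0 documents that were assigned to class 1.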

# Section 2: Topic Modeling

Import .csv with data from entire socialist realist literature corpus. 

In [6]:
litdata = pd.read_csv("csv_files/socialist_realism_for_tm.csv")
litdata['text'].head(5)

0     СОДЕРЖАНИЕ I. Морозка II . Мечик III . Шестое...
1     . Многие открыто роптали на `` подлость и нес...
2     бровей на тоненькую девушку с длинными косами...
3     время люди очнулись и поняли , что наступило ...
4     кричал Рябец , навалившись на одно слово и вс...
Name: text, dtype: object

Define stopwords, punctuation, read in output from named entity recognition (NER) script.

In [7]:
with open('NERoutput.txt') as fp:
    lines = fp.read().splitlines()

lines = [i.lower() for i in lines]

stop = list(stopwords.words('russian'))
stop.extend(['это', 'свой', 'то', ' ','который'])
stop.extend(lines)

punctuation = list(string.punctuation)
punctuation.extend(['--', '—', '«', '»'," -- ", "\n", "\r", "\t","-" ,"-", "...", "…", " - ", " « ", " ", "..", "``", "\"\"","\'\'"])
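
Membership tests against a Python list take time proportional to its length, and the NER output can add many names to the stop list. One possible variation, sketched here with small hypothetical stand-ins for the NLTK stopword list and the lines of `NERoutput.txt`, is to keep the stop words in a `set`:

```python
# Hypothetical stand-ins for stopwords.words('russian') and the NERoutput.txt lines.
base_stopwords = ["и", "в", "не", "на"]
ner_lines = ["Морозка", "Мечик"]

# A set gives constant-time membership tests; names are lowercased
# so that they match the lowercased tokens.
stop_set = set(base_stopwords)
stop_set.update(["это", "свой", "то", " ", "который"])
stop_set.update(name.lower() for name in ner_lines)

def drop_stopwords(tokens):
    """Keep only the tokens that are not in stop_set."""
    return [t for t in tokens if t not in stop_set]

filtered = drop_stopwords(["морозка", "и", "мечик", "шли", "на", "завод"])
print(filtered)
```

The filtering result is the same as with a list; only the lookup cost changes.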


Use function to create a new column in pandas df with cleaned text.

In [8]:
# Create new column in dataframe that is literary texts without stopwords and punctuation, as well as without
# named entities for people, as identified in NERoutput.txt, which was created with NERscript.py

def cleantext(text):
    tokens = text.split(" ")
    tokenslower = [i.strip().lower() for i in tokens]
    # the following list comprehension strips the characters ! | , ? . : ; from each token; hyphens are untouched, so words like "кто-то" are retained
    tokens_no_word_final_punc = [re.sub(r'[!|,?.:;]', "", item) for item in tokenslower]
    tokens_no_stop_or_punc = [i for i in tokens_no_word_final_punc if i not in stop if i not in punctuation]
#the following line gets rid of blank, empty spaces in each list    
    tokens_no_stop_or_punc = list(filter(None, tokens_no_stop_or_punc))
    return tokens_no_stop_or_punc
    
litdata["no_punc_or_stop"] = litdata["text"].apply(cleantext)
no_punc_or_stop = litdata["no_punc_or_stop"]

# Print first five rows of new column

no_punc_or_stop.head(5)

0    [содержание, i, ii, iii, шестое, чувство, iv, ...
1    [многие, открыто, роптали, подлость, несознани...
2    [бровей, тоненькую, девушку, длинными, косами,...
3    [время, люди, очнулись, поняли, наступило, утр...
4    [кричал, рябец, навалившись, одно, слово, веря...
Name: no_punc_or_stop, dtype: object

Use pymystem to lemmatize the text. 

In [9]:
mystem = Mystem() 


def lemmatize(text):
    joinedtokens = " ".join(text)
    lemmatized = mystem.lemmatize(joinedtokens)
    lemmatized = [i for i in lemmatized if i not in stop if i not in punctuation]
    return lemmatized
    
litdata["lemmatized"] = litdata["no_punc_or_stop"].apply(lemmatize)

litdata["lemmatized"].head(5)

0    [содержание, i, ii, iii, шестой, чувство, iv, ...
1    [многие, открыто, роптать, подлость, несознани...
2    [бровь, тоненький, девушка, длинный, коса, под...
3    [время, человек, очнуться, понимать, наступать...
4    [кричать, рябец, наваливаться, слово, верить, ...
Name: lemmatized, dtype: object

Topic modeling with gensim, using Latent Semantic Indexing (LSI)

In [10]:
dictionary = corpora.Dictionary(litdata["lemmatized"] )
doc_term_matrix = [dictionary.doc2bow(item) for item in litdata["lemmatized"] ]
lsi = models.LsiModel(doc_term_matrix, num_topics=4, id2word=dictionary)
output = lsi.print_topics(num_topics=4, num_words = 4)
print (output)

[(0, '0.286*"сказать" + 0.255*"человек" + 0.228*"рука" + 0.204*"говорить"'), (1, '0.324*"мать" + 0.322*"сказать" + -0.246*"казак" + 0.227*"человек"'), (2, '-0.202*"знать" + -0.183*"дело" + 0.177*"глаз" + 0.168*"голова"'), (3, '-0.286*"казак" + 0.228*"человек" + -0.223*"мать" + -0.157*"говорить"')]
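
The `doc_term_matrix` passed to `LsiModel` is a list of sparse bag-of-words vectors, one per document, each a list of `(token_id, count)` pairs. The pure-Python sketch below imitates that representation. It is a simplification with hypothetical helper names and sample documents; gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical lemmatized documents.
docs = [["казак", "рука", "казак"], ["мать", "рука"]]
token2id = build_dictionary(docs)
bows = [doc2bow(d, token2id) for d in docs]
print(token2id)
print(bows)
```

Because the vectors hold raw counts, a document or work with many more tokens contributes proportionally more weight to the decomposition.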


These results are skewed by the length of certain works in the corpus, primarily "Quiet Flows the Don": note the weight of "казак" (Cossack) in topics 1 and 3. The next step tries to correct this by using a smaller, more even amount of text from each work in the corpus.
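
The notebook does not show how the chunked .csv files were produced. As one possible approach, and purely as an assumption, a corpus could be evened out by cutting every work into fixed-size chunks and keeping the same maximum number of chunks from each. The function names, chunk size, and sample corpus below are hypothetical.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def balanced_sample(works, chunk_size, max_chunks):
    """Keep at most max_chunks chunks per work so long works do not dominate."""
    sample = {}
    for title, tokens in works.items():
        sample[title] = chunk_tokens(tokens, chunk_size)[:max_chunks]
    return sample

# Hypothetical corpus: one long work and one short work.
works = {
    "long_work": ["слово"] * 1000,
    "short_work": ["слово"] * 250,
}
sample = balanced_sample(works, chunk_size=100, max_chunks=2)
chunk_counts = {title: len(chunks) for title, chunks in sample.items()}
print(chunk_counts)
```

Taking the first chunks of each work is an arbitrary choice in this sketch; sampling chunks at random would be another option.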

## What are the results with smaller chunked files?

In [11]:
import pandas as pd

newsocreal = pd.read_csv("csv_files/socreal_files_chunked_smaller.csv")
newsocreal.head(5)

Unnamed: 0,title,name,date,english metadata,source,text
0,Разгром,"Фадеев , Александр Александрович",1926,"Aleksandr Fadeev , The Rout , 1926",http : //lib.ru/RUSSLIT/FADEEW/razgrom.txt,СОДЕРЖАНИЕ I. Морозка II . Мечик III . Шестое...
1,\n\nРазгром\n\n,"Фадеев, Александр Александрович\n\n",2,"\n\nAleksandr Fadeev, The Rout, 1926\n\n",\n\nhttp://lib.ru/RUSSLIT/FADEEW/razgrom.txt\n\n,"пыльным , сумасшедшим карьером . -- Обожди-и ,..."
2,\n\nРазгром\n\n,"Фадеев, Александр Александрович\n\n",3,"\n\nAleksandr Fadeev, The Rout, 1926\n\n",\n\nhttp://lib.ru/RUSSLIT/FADEEW/razgrom.txt\n\n,.. -- снова повторил Дубов . -- Блуди ! Посмот...
3,\n\nРазгром\n\n,"Фадеев, Александр Александрович\n\n",4,"\n\nAleksandr Fadeev, The Rout, 1926\n\n",\n\nhttp://lib.ru/RUSSLIT/FADEEW/razgrom.txt\n\n,"Ведь надо как-то , умеют же другие ... '' В мы..."
4,\n\nРазгром\n\n,"Фадеев, Александр Александрович\n\n",5,"\n\nAleksandr Fadeev, The Rout, 1926\n\n",\n\nhttp://lib.ru/RUSSLIT/FADEEW/razgrom.txt\n\n,"ребят , вернувшихся из-за реки . Дубов узнал ,..."


Apply all of the same steps to the .csv with smaller chunks of each individual work in the corpus.  

In [12]:
with open('NERoutput.txt') as fp:
    lines = fp.read().splitlines()

lines = [i.lower() for i in lines]
lines = " ".join(lines)
mystem = Mystem() 
lemmatizedlines = mystem.lemmatize(lines)
lemmatizedlines = [i for i in lemmatizedlines if i != " "]

stop = list(stopwords.words('russian'))
stop.extend(['это', 'свой', 'то', ' ','который', 'говорить', 'сказать', 'знать', 'спрашивать', 'вопрос'])
stop.extend(lemmatizedlines)

punctuation = list(string.punctuation)
punctuation.extend(['--', '—', " – ", ' —', '«', '»'," -- ", " – – ", "\n", "\r", "\t","-" ,"-", "...", "…", " - ", " « ", " ", "..", "``", "\"\"","\'\'"])

def cleantext(text):
    tokens = text.split(" ")
    tokenslower = [i.strip().lower() for i in tokens]
    # the following list comprehension strips the characters ! | , ? . : ; from each token; hyphens are untouched, so words like "кто-то" are retained
    tokens_no_word_final_punc = [re.sub(r'[!|,?.:;]', "", item) for item in tokenslower]
    tokens_no_stop_or_punc = [i for i in tokens_no_word_final_punc if i not in stop if i not in punctuation]
#the following line gets rid of blank, empty spaces in each list    
    tokens_no_stop_or_punc = list(filter(None, tokens_no_stop_or_punc))
    return tokens_no_stop_or_punc

newsocreal["no_punc_or_stop"] = newsocreal["text"].apply(cleantext)
cleaned = newsocreal["no_punc_or_stop"]

mystem = Mystem() 

def lemmatize(text):
    joinedtokens = " ".join(text)
    lemmatized = mystem.lemmatize(joinedtokens)
    lemmatized = [i for i in lemmatized if i not in stop if i not in punctuation]
    return lemmatized

newsocreal["lemmatized"] = newsocreal["no_punc_or_stop"].apply(lemmatize)

#Topic Modeling with Gensim, using Latent Semantic Indexing
dictionary = corpora.Dictionary(newsocreal["lemmatized"] )
doc_term_matrix = [dictionary.doc2bow(item) for item in newsocreal["lemmatized"] ]
lsi = models.LsiModel(doc_term_matrix, num_topics=4, id2word=dictionary)
output = lsi.print_topics(num_topics=4, num_words = 4)

print (output)




[(0, '-0.268*"человек" + -0.246*"рука" + -0.218*"глаз" + -0.158*"голова"'), (1, '0.416*"мать" + 0.205*"человек" + -0.173*"крепость" + -0.165*"дело"'), (2, '0.177*"крепость" + 0.167*"дело" + -0.158*"голова" + -0.155*"казак"'), (3, '0.288*"товарищ" + 0.165*"завод" + 0.163*"рабочий" + -0.142*"двор"')]


What happens with Latent Dirichlet Allocation (LDA) as opposed to LSI?

In [13]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

ldamodel = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=3, id2word = dictionary, passes=20)
ldaoutput = ldamodel.print_topics(num_topics=3, num_words=4)
print (ldaoutput)

[(0, '0.005*"глаз" + 0.004*"рука" + 0.004*"голова" + 0.004*"становиться"'), (1, '0.006*"товарищ" + 0.005*"рука" + 0.004*"глаз" + 0.003*"человек"'), (2, '0.007*"человек" + 0.005*"рука" + 0.005*"глаз" + 0.004*"лицо"')]
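
The three LDA topics share several of their top words. One rough way to quantify this, offered only as a sketch, is the Jaccard overlap between the top-word sets parsed from the printed output. The helper names are introduced here for illustration, and with only four words per topic the scores are coarse.

```python
import re
from itertools import combinations

# Topic strings copied from the LDA output above.
lda_topics = [
    (0, '0.005*"глаз" + 0.004*"рука" + 0.004*"голова" + 0.004*"становиться"'),
    (1, '0.006*"товарищ" + 0.005*"рука" + 0.004*"глаз" + 0.003*"человек"'),
    (2, '0.007*"человек" + 0.005*"рука" + 0.005*"глаз" + 0.004*"лицо"'),
]

def top_words(topic_string):
    """Extract the quoted words from a print_topics string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

words = {topic_id: top_words(s) for topic_id, s in lda_topics}
overlap = {
    (i, j): jaccard(words[i], words[j])
    for i, j in combinations(sorted(words), 2)
}
for pair, score in overlap.items():
    print(pair, round(score, 2))
```

Every pair of topics shares "рука" and "глаз", which suggests the topics are not well separated at this level of detail.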


In [14]:
import pyLDAvis.gensim  # in pyLDAvis 3.x this module is named pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary)

## Now, let's examine the banned literature corpus. 

In [15]:
bannedlit = pd.read_csv("csv_files/banned_lit_for_tm_smaller_chunks.csv")
bannedlit.head(5)

Unnamed: 0,title,name,date,english metadata,source,text
0,Собачье сердце,"Булгаков , Михаил",1925,"Mikhail Bulgakov , A Dog ’ s Heart , 1924",http : //bulgakov.lit-info.ru/bulgakov/proza/...,"У-у-у-у-у-гу-гуг-гуу ! О , гляньте на меня , ..."
1,\n\nСобачье сердце\n\n,"Булгаков, Михаил\n",2,"\n\nMikhail Bulgakov, A Dog’s Heart, 1924\n\n\n",\n\nhttp://bulgakov.lit-info.ru/bulgakov/proza...,", дорогой профессор , только в виде опыта . - ..."
2,\n\nСобачье сердце\n\n,"Булгаков, Михаил\n",3,"\n\nMikhail Bulgakov, A Dog’s Heart, 1924\n\n\n",\n\nhttp://bulgakov.lit-info.ru/bulgakov/proza...,Ястребиные ноздри его раздувались . Набравшись...
3,\n\nСобачье сердце\n\n,"Булгаков, Михаил\n",4,"\n\nMikhail Bulgakov, A Dog’s Heart, 1924\n\n\n",\n\nhttp://bulgakov.lit-info.ru/bulgakov/proza...,правом боку следы совершенно зажившего ожога ....
4,\n\nСобачье сердце\n\n,"Булгаков, Михаил\n",5,"\n\nMikhail Bulgakov, A Dog’s Heart, 1924\n\n\n",\n\nhttp://bulgakov.lit-info.ru/bulgakov/proza...,трубку в рогульки . Голубая радость разлилась ...


In [19]:
with open('NERoutputBanned.txt') as fp:
    lines = fp.read().splitlines()

lines = [i.lower() for i in lines]
joinedlines = " ".join(lines)
mystem = Mystem() 
lemmatizedlines = mystem.lemmatize(joinedlines)
lemmatizedlines = [i for i in lemmatizedlines if i != " "]

stop = list(stopwords.words('russian'))
stop.extend(['это', 'свой', 'то', ' ','который', 'говорить', 'сказать', 'знать', 'спрашивать', 'вопрос','наш'])
stop.extend(lemmatizedlines)

punctuation = list(string.punctuation)
punctuation.extend(['--', '—', " – ", ' —', '«', '»'," -- ", " – – ", "\n", "\r", "\t","-" ,"-", "...", "…", " - ", " « ", " ", "..", "``", "\"\"","\'\'"])

def cleantext(text):
    tokens = text.split(" ")
    tokenslower = [i.strip().lower() for i in tokens]
    # the following list comprehension strips the characters ! | , ? . : ; from each token; hyphens are untouched, so words like "кто-то" are retained
    tokens_no_word_final_punc = [re.sub(r'[!|,?.:;]', "", item) for item in tokenslower]
    tokens_no_stop_or_punc = [i for i in tokens_no_word_final_punc if i not in stop if i not in punctuation]
#the following line gets rid of blank, empty spaces in each list    
    tokens_no_stop_or_punc = list(filter(None, tokens_no_stop_or_punc))
    return tokens_no_stop_or_punc

bannedlit["no_punc_or_stop"] = bannedlit["text"].apply(cleantext)
cleaned = bannedlit["no_punc_or_stop"]

mystem = Mystem() 

def lemmatize(text):
    joinedtokens = " ".join(text)
    lemmatized = mystem.lemmatize(joinedtokens)
    lemmatized = [i for i in lemmatized if i not in stop if i not in punctuation]
    return lemmatized

bannedlit["lemmatized"] = bannedlit["no_punc_or_stop"].apply(lemmatize)

#Topic Modeling with Gensim, using Latent Semantic Indexing
dictionaryb = corpora.Dictionary(bannedlit["lemmatized"] )
doc_term_matrixb = [dictionaryb.doc2bow(item) for item in bannedlit["lemmatized"] ]
lsi = models.LsiModel(doc_term_matrixb, num_topics=4, id2word=dictionaryb)
output = lsi.print_topics(num_topics=4, num_words = 4)

print (output)


[(0, '0.333*"человек" + 0.184*"рука" + 0.177*"глаз" + 0.155*"жизнь"'), (1, '-0.269*"человек" + 0.227*"профессор" + 0.178*"глаз" + -0.159*"жизнь"'), (2, '-0.359*"профессор" + -0.214*"отвечать" + -0.213*"человек" + -0.138*"кабинет"'), (3, '-0.203*"год" + -0.192*"человек" + -0.143*"отец" + -0.136*"дом"')]
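
To compare the two corpora at a glance, the top LSI words printed for the chunked socialist realism corpus and for the banned literature corpus can be set side by side. The sketch below parses the two printed outputs and reports which words are shared and which appear in only one list. It is a coarse comparison of four words per topic, not a statistical test, and the helper name `topic_words` is hypothetical.

```python
import re

# Topic strings copied from the two LSI outputs on the chunked corpora.
socreal_lsi = [
    '-0.268*"человек" + -0.246*"рука" + -0.218*"глаз" + -0.158*"голова"',
    '0.416*"мать" + 0.205*"человек" + -0.173*"крепость" + -0.165*"дело"',
    '0.177*"крепость" + 0.167*"дело" + -0.158*"голова" + -0.155*"казак"',
    '0.288*"товарищ" + 0.165*"завод" + 0.163*"рабочий" + -0.142*"двор"',
]
banned_lsi = [
    '0.333*"человек" + 0.184*"рука" + 0.177*"глаз" + 0.155*"жизнь"',
    '-0.269*"человек" + 0.227*"профессор" + 0.178*"глаз" + -0.159*"жизнь"',
    '-0.359*"профессор" + -0.214*"отвечать" + -0.213*"человек" + -0.138*"кабинет"',
    '-0.203*"год" + -0.192*"человек" + -0.143*"отец" + -0.136*"дом"',
]

def topic_words(topics):
    """Collect every quoted word that appears in a list of topic strings."""
    words = set()
    for topic_string in topics:
        words.update(re.findall(r'"([^"]+)"', topic_string))
    return words

soc = topic_words(socreal_lsi)
ban = topic_words(banned_lsi)
print("shared:", sorted(soc & ban))
print("socialist realism only:", sorted(soc - ban))
print("banned literature only:", sorted(ban - soc))
```

Words such as "товарищ", "завод", and "рабочий" appear only in the socialist realism list, while "профессор" and "кабинет" appear only in the banned literature list.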


Now, what happens with LDA?

In [22]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

ldamodel = gensim.models.ldamodel.LdaModel(doc_term_matrixb, num_topics=4, id2word = dictionaryb, passes=20)
ldaoutput = ldamodel.print_topics(num_topics=3, num_words=4)
print (ldaoutput)

[(3, '0.007*"человек" + 0.004*"жизнь" + 0.004*"рука" + 0.004*"весь"'), (1, '0.007*"человек" + 0.004*"рука" + 0.004*"жизнь" + 0.003*"год"'), (2, '0.007*"человек" + 0.004*"рука" + 0.003*"глаз" + 0.003*"идти"')]


In [23]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrixb, dictionaryb)