# <center> Exploring Pre-WWII Soviet Socialist Realism with Text Analysis

<img src="image.png">

About: This notebook uses logistic regression and topic modeling to explore the particularities of socialist realism, a Soviet literary genre, in the pre-WWII Soviet Union.

The Python modules nltk, gensim, pymystem3, and pyLDAvis are used in this notebook.

In [4]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords 
import string
import gensim
import scipy
from gensim import corpora
from gensim import models
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt
from pymystem3 import Mystem
import pyLDAvis
import re

# Section 1 : Logistic Regression

This section applies logistic regression as a predictive model.

Results indicate that logistic regression can differentiate between the socialist realism and banned literature corpora with a certain degree of accuracy. Accuracy might be higher still if deep learning techniques were applied.
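The code for this section is not included here. Purely as an illustration of the idea, the sketch below trains a tiny logistic regression classifier from scratch on relative word frequencies. The toy documents, the labels, and the helper names (`featurize`, `train_logreg`, `predict`) are all hypothetical and are not the notebook's actual method or data; a real analysis would more likely use a library implementation such as scikit-learn's `LogisticRegression`.

```python
import math
from collections import Counter

def featurize(tokens, vocab):
    """Return the relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[word] / total for word in vocab]

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b):
    """Return 1 if the linear score is non-negative, else 0."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z >= 0 else 0

# Hypothetical, already-lemmatized toy documents.
# Label 1 = socialist realism, label 0 = banned literature.
docs = [
    (["товарищ", "завод", "рабочий", "товарищ"], 1),
    (["завод", "рабочий", "колхоз", "товарищ"], 1),
    (["профессор", "жизнь", "глаз", "профессор"], 0),
    (["жизнь", "профессор", "стена", "глаз"], 0),
]
vocab = sorted({token for tokens, _ in docs for token in tokens})
X = [featurize(tokens, vocab) for tokens, _ in docs]
y = [label for _, label in docs]

w, b = train_logreg(X, y)
preds = [predict(x, w, b) for x in X]
print(preds)
```

On real data the classifier would be evaluated on held-out texts rather than on the training documents, as is done here only to keep the example short.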

# Section 2: Topic Modeling

Import the .csv with data from the entire socialist realist literature corpus.

In [2]:
litdata = pd.read_csv("csv_files/socialist_realism_for_tm.csv")
litdata['text'].head(5)

0     СОДЕРЖАНИЕ I. Морозка II . Мечик III . Шестое...
1     . Многие открыто роптали на `` подлость и нес...
2     бровей на тоненькую девушку с длинными косами...
3     время люди очнулись и поняли , что наступило ...
4     кричал Рябец , навалившись на одно слово и вс...
Name: text, dtype: object

Define stopwords and punctuation, and read in the output from the named entity recognition (NER) script.

In [3]:
with open('NERoutput.txt') as fp:
    lines = fp.read().splitlines()

lines = [i.lower() for i in lines]

stop = list(stopwords.words('russian'))
stop.extend(['это', 'свой', 'то', ' ','который'])
stop.extend(lines)

punctuation = list(string.punctuation)
punctuation.extend(['--', '—', '«', '»'," -- ", "\n", "\r", "\t","-" ,"-", "...", "…", " - ", " « ", " ", "..", "``", "\"\"","\'\'"])


Use function to create a new column in pandas df with cleaned text.

In [4]:
# Create new column in dataframe that is literary texts without stopwords and punctuation, as well as without
# named entities for people, as identified in NERoutput.txt, which was created with NERscript.py

def cleantext(text):
    tokens = text.split(" ")
    tokenslower = [i.strip().lower() for i in tokens]
#the following list comprehension gets rid of word-final punctuation, while retaining word-medial punctuation in words like "кто-то"
    tokens_no_word_final_punc = [re.sub(r'[!,?.:;]+$', "", item) for item in tokenslower]
    tokens_no_stop_or_punc = [i for i in tokens_no_word_final_punc if i not in stop if i not in punctuation]
#the following line gets rid of blank, empty spaces in each list    
    tokens_no_stop_or_punc = list(filter(None, tokens_no_stop_or_punc))
    return tokens_no_stop_or_punc
    
litdata["no_punc_or_stop"] = litdata["text"].apply(cleantext)
no_punc_or_stop = litdata["no_punc_or_stop"]

# Print first five rows of new column

no_punc_or_stop.head(5)

0    [содержание, i, ii, iii, шестое, чувство, iv, ...
1    [многие, открыто, роптали, подлость, несознани...
2    [бровей, тоненькую, девушку, длинными, косами,...
3    [время, люди, очнулись, поняли, наступило, утр...
4    [кричал, рябец, навалившись, одно, слово, веря...
Name: no_punc_or_stop, dtype: object
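The comment in the cell above describes stripping word-final punctuation while keeping word-medial punctuation such as the hyphen in "кто-то". A minimal sketch of a pattern with that behavior is shown below; the names `WORD_FINAL_PUNC` and `strip_final_punctuation` are hypothetical, and the example tokens are made up for illustration.

```python
import re

# One or more of ! , ? . : ; anchored at the end of the token.
WORD_FINAL_PUNC = re.compile(r"[!,?.:;]+$")

def strip_final_punctuation(token):
    """Remove trailing punctuation but leave hyphens inside the word alone."""
    return WORD_FINAL_PUNC.sub("", token)

examples = ["i.", "кто-то", "утро...", "что?!", "из-за", "..."]
cleaned = [strip_final_punctuation(token) for token in examples]
print(cleaned)
# ['i', 'кто-то', 'утро', 'что', 'из-за', '']
```

Tokens that consist only of punctuation become empty strings, which is why a later filtering step for empty items is still needed.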

Use pymystem to lemmatize the text. 

In [2]:
mystem = Mystem() 


def lemmatize(text):
    joinedtokens = " ".join(text)
    lemmatized = mystem.lemmatize(joinedtokens)
    lemmatized = [i for i in lemmatized if i not in stop if i not in punctuation]
    return lemmatized
    
litdata["lemmatized"] = litdata["no_punc_or_stop"].apply(lemmatize)

litdata["lemmatized"].head(5)


Topic modeling with gensim, using Latent Semantic Indexing (LSI)

In [6]:
dictionary = corpora.Dictionary(litdata["lemmatized"] )
doc_term_matrix = [dictionary.doc2bow(item) for item in litdata["lemmatized"] ]
lsi = models.LsiModel(doc_term_matrix, num_topics=4, id2word=dictionary)
output = lsi.print_topics(num_topics=4, num_words = 4)
print (output)

[(0, '0.286*"сказать" + 0.255*"человек" + 0.228*"рука" + 0.204*"говорить"'), (1, '0.324*"мать" + 0.322*"сказать" + -0.246*"казак" + 0.227*"человек"'), (2, '-0.203*"знать" + -0.183*"дело" + 0.177*"глаз" + 0.168*"голова"'), (3, '0.286*"казак" + -0.228*"человек" + 0.224*"мать" + 0.157*"говорить"')]


Results are skewed by the length of certain works in the corpus, primarily "Quiet Flows the Don" (note the weight of "казак", "Cossack", in topics 1 and 3). The next step is to try to correct for this by using a smaller, more even amount of text from each work in the corpus.
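How the smaller chunked file was produced is not shown in this notebook. As one possible approach, the sketch below splits each work into fixed-size token chunks and keeps at most a fixed number of chunks per work, so that a very long novel cannot dominate the corpus. The function names, the chunk size, and the cap are assumptions made for illustration.

```python
from collections import Counter

def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks; the last one may be shorter."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def balanced_chunks(works, chunk_size=1000, max_chunks=20):
    """Return (title, chunk_text) rows with at most max_chunks rows per work."""
    rows = []
    for title, text in works.items():
        chunks = chunk_tokens(text.split(), chunk_size)
        for chunk in chunks[:max_chunks]:
            rows.append((title, " ".join(chunk)))
    return rows

# Hypothetical toy corpus: one long work and one short work.
works = {
    "long_novel": " ".join(["слово"] * 50),
    "short_story": " ".join(["слово"] * 12),
}
rows = balanced_chunks(works, chunk_size=5, max_chunks=3)
rows_per_work = Counter(title for title, _ in rows)
print(rows_per_work)
```

Keeping only the first chunks biases the sample toward the opening of each work; sampling chunks at random would be an alternative design.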

## What are the results with smaller chunked files?

In [7]:
import pandas as pd

newsocreal = pd.read_csv("csv_files/socreal_files_chunked_smaller.csv")
newsocreal.head(5)

Unnamed: 0,title,name,date,english metadata,source,text
0,Разгром,"Фадеев , Александр Александрович",1926,"Aleksandr Fadeev , The Rout , 1926",http : //lib.ru/RUSSLIT/FADEEW/razgrom.txt,СОДЕРЖАНИЕ I. Морозка II . Мечик III . Шестое...
1,\n\nРазгром\n\n,"Фадеев, Александр Александрович\n\n",2,"\n\nAleksandr Fadeev, The Rout, 1926\n\n",\n\nhttp://lib.ru/RUSSLIT/FADEEW/razgrom.txt\n\n,"пыльным , сумасшедшим карьером . -- Обожди-и ,..."
2,\n\nРазгром\n\n,"Фадеев, Александр Александрович\n\n",3,"\n\nAleksandr Fadeev, The Rout, 1926\n\n",\n\nhttp://lib.ru/RUSSLIT/FADEEW/razgrom.txt\n\n,.. -- снова повторил Дубов . -- Блуди ! Посмот...
3,\n\nРазгром\n\n,"Фадеев, Александр Александрович\n\n",4,"\n\nAleksandr Fadeev, The Rout, 1926\n\n",\n\nhttp://lib.ru/RUSSLIT/FADEEW/razgrom.txt\n\n,"Ведь надо как-то , умеют же другие ... '' В мы..."
4,\n\nРазгром\n\n,"Фадеев, Александр Александрович\n\n",5,"\n\nAleksandr Fadeev, The Rout, 1926\n\n",\n\nhttp://lib.ru/RUSSLIT/FADEEW/razgrom.txt\n\n,"ребят , вернувшихся из-за реки . Дубов узнал ,..."


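Apply the same steps to the .csv with smaller chunks of each individual work in the corpus. This time the NER output is lemmatized before being added to the stopword list, and several high-frequency words (for example "говорить" and "сказать") are added as stopwords.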

In [8]:
with open('NERoutput.txt') as fp:
    lines = fp.read().splitlines()

lines = [i.lower() for i in lines]
lines = " ".join(lines)
mystem = Mystem() 
lemmatizedlines = mystem.lemmatize(lines)
lemmatizedlines = [i for i in lemmatizedlines if i != " "]

stop = list(stopwords.words('russian'))
stop.extend(['это', 'свой', 'то', ' ','который', 'говорить', 'сказать', 'знать', 'спрашивать', 'вопрос'])
stop.extend(lemmatizedlines)

punctuation = list(string.punctuation)
punctuation.extend(['--', '—', " – ", ' —', '«', '»'," -- ", " – – ", "\n", "\r", "\t","-" ,"-", "...", "…", " - ", " « ", " ", "..", "``", "\"\"","\'\'"])

def cleantext(text):
    tokens = text.split(" ")
    tokenslower = [i.strip().lower() for i in tokens]
#the following list comprehension gets rid of word-final punctuation, while retaining word-medial punctuation in words like "кто-то"
    tokens_no_word_final_punc = [re.sub(r'[!,?.:;]+$', "", item) for item in tokenslower]
    tokens_no_stop_or_punc = [i for i in tokens_no_word_final_punc if i not in stop if i not in punctuation]
#the following line gets rid of blank, empty spaces in each list    
    tokens_no_stop_or_punc = list(filter(None, tokens_no_stop_or_punc))
    return tokens_no_stop_or_punc

newsocreal["no_punc_or_stop"] = newsocreal["text"].apply(cleantext)
cleaned = newsocreal["no_punc_or_stop"]

mystem = Mystem() 

def lemmatize(text):
    joinedtokens = " ".join(text)
    lemmatized = mystem.lemmatize(joinedtokens)
    lemmatized = [i for i in lemmatized if i not in stop if i not in punctuation]
    return lemmatized

newsocreal["lemmatized"] = newsocreal["no_punc_or_stop"].apply(lemmatize)

#Topic Modeling with Gensim, using Latent Semantic Indexing
dictionary = corpora.Dictionary(newsocreal["lemmatized"] )
doc_term_matrix = [dictionary.doc2bow(item) for item in newsocreal["lemmatized"] ]
lsi = models.LsiModel(doc_term_matrix, num_topics=4, id2word=dictionary)
output = lsi.print_topics(num_topics=4, num_words = 4)

print (output)




[(0, '-0.268*"человек" + -0.246*"рука" + -0.218*"глаз" + -0.158*"голова"'), (1, '-0.416*"мать" + -0.205*"человек" + 0.173*"крепость" + 0.165*"дело"'), (2, '-0.176*"крепость" + -0.167*"дело" + 0.158*"голова" + 0.155*"казак"'), (3, '-0.335*"товарищ" + -0.191*"завод" + -0.191*"рабочий" + 0.140*"крепость"')]


What happens with Latent Dirichlet Allocation (LDA) as opposed to LSI?

In [9]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

ldamodel = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=3, id2word = dictionary, passes=20)
ldaoutput = ldamodel.print_topics(num_topics=3, num_words=4)
print (ldaoutput)

[(0, '0.005*"товарищ" + 0.004*"рука" + 0.004*"человек" + 0.004*"глаз"'), (1, '0.005*"рука" + 0.005*"глаз" + 0.004*"голова" + 0.004*"человек"'), (2, '0.008*"человек" + 0.006*"рука" + 0.005*"глаз" + 0.004*"лицо"')]
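The three LDA topics printed above share most of their top words. One rough way to quantify this is the Jaccard overlap between the top-word sets of each pair of topics. The sketch below parses the printed output directly; the helper names `top_words` and `jaccard` are hypothetical.

```python
import re
from itertools import combinations

# LDA output copied from the cell above.
lda_output = [
    (0, '0.005*"товарищ" + 0.004*"рука" + 0.004*"человек" + 0.004*"глаз"'),
    (1, '0.005*"рука" + 0.005*"глаз" + 0.004*"голова" + 0.004*"человек"'),
    (2, '0.008*"человек" + 0.006*"рука" + 0.005*"глаз" + 0.004*"лицо"'),
]

def top_words(topic_string):
    """Extract the quoted words from a gensim print_topics string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)

topics = {topic_id: top_words(text) for topic_id, text in lda_output}
overlaps = {
    (i, j): jaccard(topics[i], topics[j])
    for i, j in combinations(sorted(topics), 2)
}
print(overlaps)
# {(0, 1): 0.6, (0, 2): 0.6, (1, 2): 0.6}
```

Every pair of topics shares three of its four top words, which suggests the topics are not well separated at this setting.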


In [10]:
# In pyLDAvis 3.x this module was renamed to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary)

## Now, let's examine the banned literature corpus. 

In [6]:
bannedlit = pd.read_csv("csv_files/banned_lit_for_tm_smaller_chunks.csv")
bannedlit.head(5)

Unnamed: 0,title,name,date,english metadata,source,text
0,Собачье сердце,"Булгаков , Михаил",1925,"Mikhail Bulgakov , A Dog ’ s Heart , 1924",http : //bulgakov.lit-info.ru/bulgakov/proza/...,"У-у-у-у-у-гу-гуг-гуу ! О , гляньте на меня , ..."
1,\n\nСобачье сердце\n\n,"Булгаков, Михаил\n",2,"\n\nMikhail Bulgakov, A Dog’s Heart, 1924\n\n\n",\n\nhttp://bulgakov.lit-info.ru/bulgakov/proza...,", дорогой профессор , только в виде опыта . - ..."
2,\n\nСобачье сердце\n\n,"Булгаков, Михаил\n",3,"\n\nMikhail Bulgakov, A Dog’s Heart, 1924\n\n\n",\n\nhttp://bulgakov.lit-info.ru/bulgakov/proza...,Ястребиные ноздри его раздувались . Набравшись...
3,\n\nСобачье сердце\n\n,"Булгаков, Михаил\n",4,"\n\nMikhail Bulgakov, A Dog’s Heart, 1924\n\n\n",\n\nhttp://bulgakov.lit-info.ru/bulgakov/proza...,правом боку следы совершенно зажившего ожога ....
4,\n\nСобачье сердце\n\n,"Булгаков, Михаил\n",5,"\n\nMikhail Bulgakov, A Dog’s Heart, 1924\n\n\n",\n\nhttp://bulgakov.lit-info.ru/bulgakov/proza...,трубку в рогульки . Голубая радость разлилась ...


In [8]:
with open('NERoutputBanned.txt') as fp:
    lines = fp.read().splitlines()

lines = [i.lower() for i in lines]
joinedlines = " ".join(lines)
mystem = Mystem() 
lemmatizedlines = mystem.lemmatize(joinedlines)
lemmatizedlines = [i for i in lemmatizedlines if i != " "]

stop = list(stopwords.words('russian'))
stop.extend(['это', 'свой', 'то', ' ','который', 'говорить', 'сказать', 'знать', 'спрашивать', 'вопрос'])
stop.extend(lemmatizedlines)

punctuation = list(string.punctuation)
punctuation.extend(['--', '—', " – ", ' —', '«', '»'," -- ", " – – ", "\n", "\r", "\t","-" ,"-", "...", "…", " - ", " « ", " ", "..", "``", "\"\"","\'\'"])

def cleantext(text):
    tokens = text.split(" ")
    tokenslower = [i.strip().lower() for i in tokens]
#the following list comprehension gets rid of word-final punctuation, while retaining word-medial punctuation in words like "кто-то"
    tokens_no_word_final_punc = [re.sub(r'[!,?.:;]+$', "", item) for item in tokenslower]
    tokens_no_stop_or_punc = [i for i in tokens_no_word_final_punc if i not in stop if i not in punctuation]
#the following line gets rid of blank, empty spaces in each list    
    tokens_no_stop_or_punc = list(filter(None, tokens_no_stop_or_punc))
    return tokens_no_stop_or_punc

bannedlit["no_punc_or_stop"] = bannedlit["text"].apply(cleantext)
cleaned = bannedlit["no_punc_or_stop"]

mystem = Mystem() 

def lemmatize(text):
    joinedtokens = " ".join(text)
    lemmatized = mystem.lemmatize(joinedtokens)
    lemmatized = [i for i in lemmatized if i not in stop if i not in punctuation]
    return lemmatized

bannedlit["lemmatized"] = bannedlit["no_punc_or_stop"].apply(lemmatize)

#Topic Modeling with Gensim, using Latent Semantic Indexing
dictionaryb = corpora.Dictionary(bannedlit["lemmatized"] )
doc_term_matrixb = [dictionaryb.doc2bow(item) for item in bannedlit["lemmatized"] ]
lsi = models.LsiModel(doc_term_matrixb, num_topics=4, id2word=dictionaryb)
output = lsi.print_topics(num_topics=4, num_words = 4)

print (output)


[(0, '0.332*"человек" + 0.184*"рука" + 0.176*"глаз" + 0.154*"жизнь"'), (1, '-0.275*"человек" + 0.220*"профессор" + 0.176*"глаз" + -0.159*"жизнь"'), (2, '0.362*"профессор" + 0.218*"человек" + 0.215*"отвечать" + -0.165*"наш"'), (3, '0.206*"год" + 0.182*"человек" + 0.146*"отец" + -0.137*"колхоз"')]
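A rough way to compare this output with the LSI output for the chunked socialist realism corpus is to collect the top words of all four topics in each corpus and look at which words are shared and which appear in only one corpus. The sketch below does this on the printed strings; the variable and function names are hypothetical, and comparing top-four words is only a coarse heuristic.

```python
import re

# LSI outputs copied from the two cells above (chunked corpora).
socreal_lsi = [
    '-0.268*"человек" + -0.246*"рука" + -0.218*"глаз" + -0.158*"голова"',
    '-0.416*"мать" + -0.205*"человек" + 0.173*"крепость" + 0.165*"дело"',
    '-0.176*"крепость" + -0.167*"дело" + 0.158*"голова" + 0.155*"казак"',
    '-0.335*"товарищ" + -0.191*"завод" + -0.191*"рабочий" + 0.140*"крепость"',
]
banned_lsi = [
    '0.332*"человек" + 0.184*"рука" + 0.176*"глаз" + 0.154*"жизнь"',
    '-0.275*"человек" + 0.220*"профессор" + 0.176*"глаз" + -0.159*"жизнь"',
    '0.362*"профессор" + 0.218*"человек" + 0.215*"отвечать" + -0.165*"наш"',
    '0.206*"год" + 0.182*"человек" + 0.146*"отец" + -0.137*"колхоз"',
]

def all_top_words(topic_strings):
    """Collect every quoted word across a list of print_topics strings."""
    words = set()
    for text in topic_strings:
        words.update(re.findall(r'"([^"]+)"', text))
    return words

socreal_words = all_top_words(socreal_lsi)
banned_words = all_top_words(banned_lsi)

shared = socreal_words & banned_words
only_socreal = socreal_words - banned_words
only_banned = banned_words - socreal_words

print("shared:", sorted(shared))
print("socialist realism only:", sorted(only_socreal))
print("banned only:", sorted(only_banned))
```

The signs of LSI weights are ignored here, since only the presence of a word among a topic's top terms is compared.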


Now, what happens with LDA?

In [15]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Note: the model is trained with 4 topics, but only 3 of them are printed below
ldamodel = gensim.models.ldamodel.LdaModel(doc_term_matrixb, num_topics=4, id2word = dictionaryb, passes=20)
ldaoutput = ldamodel.print_topics(num_topics=3, num_words=4)
print(ldaoutput)

[(0, '0.005*"человек" + 0.004*"рука" + 0.004*"глаз" + 0.004*"профессор"'), (2, '0.009*"человек" + 0.004*"жизнь" + 0.004*"весь" + 0.004*"становиться"'), (1, '0.006*"глаз" + 0.006*"рука" + 0.004*"стена" + 0.004*"i"')]


In [16]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrixb, dictionaryb)