# Russian lemmatizer

This Jupyter notebook takes a Russian text file, or a folder of Russian text files, and creates lemmatized derivative files in which every word appears in its dictionary form rather than an inflected form. You can then use these lemmatized files for searching or for other computational text analysis methods.

## First-time setup
The first code cell below installs the *pymystem3* package, which does the actual lemmatizing. You only need to run it the first time you use this notebook in a particular environment (laptop, virtual machine, etc.). You can skip it the next time you use the notebook, but nothing bad will happen if you re-run it.

In [None]:
#Imports the sys module, which knows which copy of Python is running this notebook
#so that pip installs the new module into that same environment
import sys

#Installs pymystem3
!{sys.executable} -m pip install pymystem3
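
If you are not sure whether the package is already installed in your current environment, a check along the following lines can tell you. This is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is an illustrative choice and is not part of pymystem3.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# Reports whether pymystem3 is available in the current environment
print("pymystem3 installed:", is_installed("pymystem3"))
```

If this prints `False`, run the install cell above before continuing.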

## Importing modules

The next code cell imports the modules you need to run this notebook. Run it every time you open this notebook. 

The first time you run it, you'll see a message about "Installing mystem".

In [None]:
#os is used for navigating your filesystem and getting to the files you want to lemmatize
import os

#pymystem3 is the module that does the actual lemmatizing
from pymystem3 import Mystem

#loads mystem
mystem = Mystem()

## Lemmatizing a single file
Put the full path to your text file in the cell below, using the correct syntax for your operating system. 

For instance, the default path to a text file in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/YOUR-TEXT-FILE.txt'
* On Windows: 'C:\\\Users\\\YOUR-USER-NAME\\\Documents\\\YOUR-TEXT-FILE.txt'

In [None]:
#Put the full path to your file between the single quotes here
filepath = '/Users/qad/Documents/baiki-iz-kantiniy-mos-aisli.txt'

#The outname is the name of the lemmatized file that this notebook creates
#If you want it to be named something other than the original file name + -lemmatized
#you can change that here
outname = filepath.replace('.txt', '-lemmatized.txt')
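
One thing to watch for: `replace` swaps every occurrence of `.txt` in the path, and it leaves the path unchanged if your file has a different extension. In that case the output path would equal the input path, and the next cell would empty your original file. A more defensive way to build the name splits off the extension first. The sketch below assumes a hypothetical helper called `lemmatized_name`, which is not part of this notebook or of pymystem3.

```python
import os

def lemmatized_name(path, suffix='-lemmatized'):
    """Insert the suffix between the file name and its extension."""
    # splitext only splits off the final extension of the last path component
    base, ext = os.path.splitext(path)
    return base + suffix + ext

# Example with the placeholder path used earlier in this notebook
print(lemmatized_name('/Users/YOUR-USER-NAME/Documents/YOUR-TEXT-FILE.txt'))
```

With this approach, a file without a `.txt` extension still gets a distinct output name, so the original is never overwritten.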

In [None]:
#Opens the file you specified
with open(filepath, 'r', encoding='utf8') as f:
    #Creates an empty text file with -lemmatized.txt appended to the name
    with open(outname, 'w', encoding='utf8') as out:
        #Reads the text of the file you specified
        text = f.read()
        #Lemmatizes the text
        tokens = mystem.lemmatize(text)
        #mystem returns a Python list of lemmas, with the spaces and
        #punctuation between them included as separate items
        #Joining the list with an empty string turns it back into text
        #that keeps the original spacing
        lemmatized = "".join(tokens)
        #Writes the lemmas to the text file with -lemmatized appended to the name
        out.write(lemmatized)
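
The join step in the cell above uses an empty string because the list that `lemmatize` returns already contains the spaces, punctuation, and line breaks as separate items. The sketch below illustrates this with a hand-written sample list shaped like that output (it is not captured from a real run), and adds a hypothetical helper, `words_only`, for cases where you want just the lemmas without the spacing.

```python
# Hand-written sample shaped like the list lemmatize() returns
# (an illustration, not output captured from a real run)
sample_tokens = ['красивый', ' ', 'мама', ' ', 'красиво', ' ',
                 'мыть', ' ', 'рама', '\n']

def words_only(tokens):
    """Keep only items that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

# Joining with an empty string rebuilds the text with its original spacing
print(''.join(sample_tokens), end='')

# Filtering drops the spaces and the final line break
print(words_only(sample_tokens))
```

A list without the spacing items can be handy if you want to count lemmas or write one lemma per line.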

## Lemmatizing a folder of text files
Put the full path to your folder of text files in the cell below, using the correct syntax for your operating system. 

For instance, the default path to a folder called "russian" in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/russian'
* On Windows: 'C:\\\Users\\\YOUR-USER-NAME\\\Documents\\\russian'

In [None]:
#Put the full path to your folder between single quotes here
textfolder = '/Users/qad/Documents/russian'
#Changes the working directory to the folder you specified
os.chdir(textfolder)

In [None]:
#For every file in the folder you specified...
for filename in os.listdir(textfolder):
    #If it's a text file, but not one of the text files with just lemmas
    if filename.endswith('.txt') and not filename.endswith('-lemmatized.txt'):
        #The outname is the name of the lemmatized file that this notebook creates
        #If you want it to be named something other than the original file name + -lemmatized
        #you can change that here
        outname = filename.replace('.txt', '-lemmatized.txt')
        #Opens the current text file from the folder
        with open(filename, 'r', encoding='utf8') as f:
            #Creates an empty text file with -lemmatized.txt appended to the name
            with open(outname, 'w', encoding='utf8') as out:
                #Reads the text of the current file
                text = f.read()
                #Lemmatizes the text
                tokens = mystem.lemmatize(text)
                #mystem returns a Python list of lemmas, with the spaces and
                #punctuation between them included as separate items
                #Joining the list with an empty string turns it back into text
                #that keeps the original spacing
                lemmatized = "".join(tokens)
                #Writes the lemmas to the text file with -lemmatized appended to the name
                out.write(lemmatized)
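
If you would rather not change the working directory, the same loop can be written as a function that works with full paths. The following is a sketch under stated assumptions, not a replacement for the cells above: `lemmatize_folder` and `fake_lemmatize` are hypothetical names, and `fake_lemmatize` is only a stand-in (it lowercases the text) so that the example runs without pymystem3.

```python
import tempfile
from pathlib import Path

def lemmatize_folder(folder, lemmatize, suffix='-lemmatized'):
    """Write a lemmatized copy of every .txt file in folder; return the new paths."""
    written = []
    # sorted() collects the file list up front, so files created
    # during the loop are not picked up in the same run
    for path in sorted(Path(folder).glob('*.txt')):
        # Skips files produced by an earlier run
        if path.stem.endswith(suffix):
            continue
        text = path.read_text(encoding='utf8')
        out_path = path.with_name(path.stem + suffix + path.suffix)
        out_path.write_text(''.join(lemmatize(text)), encoding='utf8')
        written.append(out_path)
    return written

# Stand-in for mystem.lemmatize: returns the lowercased text as a one-item list
def fake_lemmatize(text):
    return [text.lower()]

# Small demonstration in a temporary folder
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, 'sample.txt').write_text('Мама мыла раму', encoding='utf8')
    for new_file in lemmatize_folder(demo_dir, fake_lemmatize):
        print(new_file.name)
```

In this notebook you would pass the real lemmatizer instead of the stand-in, for example `lemmatize_folder(textfolder, mystem.lemmatize)`.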

## About

This Jupyter notebook was originally developed by Quinn Dombrowski for use in [DLCL 204: Digital Humanities Across Borders](https://github.com/quinnanya/dlcl204) at Stanford University, fall 2020. 