# Arabic stemmer

This Jupyter notebook takes a text file, or folder of text files in Arabic, and creates a set of stemmed derivative files (where all words have prefixes and suffixes removed; the result may not always be an actual root form). These stemmed files can then be used for searching, or other computational text analysis methods.

Before you can use this notebook, you need the Java SDK (1.7 or later) installed on your system and available on your computer's PATH. (Here are some [instructions for Windows, Mac, or Linux](https://community.akamai.com/customers/s/article/Adding-JDK-Path-in-Mac-OS-X-Linux-or-Windows?language=en_US).)
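
If you aren't sure whether Java is on your PATH, a quick check from Python can help. The sketch below is not part of the original notebook. It uses only the standard library, and the helper name `find_java` is made up for illustration.

```python
import shutil
import subprocess

def find_java():
    """Return the full path to the `java` executable on PATH, or None if it is absent."""
    return shutil.which("java")

java_path = find_java()
if java_path is None:
    print("Java was not found on your PATH; install it before continuing.")
else:
    # `java -version` usually prints its version report to stderr rather than stdout
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    print(result.stderr.strip() or result.stdout.strip())
```

If the first message appears, revisit the PATH instructions linked above before running the rest of the notebook.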

## First-time setup
The first code cell below installs the *farasapy* package, a Python wrapper for Farasa, which is written in Java. Farasa does the actual stemming. You only need to run this cell the first time you use this notebook in a particular environment (laptop, virtual machine, etc.). You can skip it the next time you use the notebook, but nothing bad will happen if you re-run it.

In [None]:
#Imports the sys module which you can use to perform system-level things 
#like installing new modules
import sys

#Installs farasapy
!{sys.executable} -m pip install farasapy

## Importing modules
Every time you run this notebook, run the cells below to import the modules you'll need.

The first time you run the notebook, you'll see a red message saying it's performing a system check. If you have at least Java 1.7 installed and on your PATH, it should recognize it and download "zipped binaries" (the code that does the actual text processing). This isn't an error, just part of the loading process.

The next time you run it, you should see a red message that "dependencies seem to be satisfied" and "task \[STEM\] is initalized in STANDALONE mode..."

In [None]:
from farasa.stemmer import FarasaStemmer
stemmer = FarasaStemmer()

## Stemming a single file
Put the full path to your text file in the cell below, using the correct syntax for your operating system. 

For instance, the default path to a text file in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/YOUR-TEXT-FILE.txt'
* On Windows: 'C:\\\Users\\\YOUR-USER-NAME\\\Documents\\\YOUR-TEXT-FILE.txt'
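
Windows paths need extra care because Python treats a backslash inside an ordinary string as the start of an escape sequence. The short sketch below shows three equivalent ways to spell the same Windows path, using the placeholder names from the list above. It is an illustration rather than part of the original notebook, and it relies only on the standard library.

```python
from pathlib import PureWindowsPath

# Doubled backslashes and a raw string (r'...') spell the same Windows path
escaped = 'C:\\Users\\YOUR-USER-NAME\\Documents\\YOUR-TEXT-FILE.txt'
raw = r'C:\Users\YOUR-USER-NAME\Documents\YOUR-TEXT-FILE.txt'
print(escaped == raw)

# Forward slashes are also understood as separators in Windows paths
forward = 'C:/Users/YOUR-USER-NAME/Documents/YOUR-TEXT-FILE.txt'
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```

Any of the three spellings can be used for the path in the cell below.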

In [None]:
#Put the full path to your file between the single quotes here
filepath = '/Users/qad/Documents/arabic.txt'

#The outname is the name of the stemmed file that this notebook creates
#If you want it to be named something other than the original file name + -lemmatized,
#you can change that here
outname = filepath.replace('.txt', '-lemmatized.txt')
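
One caveat about the cell above: `str.replace` swaps every occurrence of `.txt` in the path, including one that happens to appear in a folder name. A possible alternative, sketched here with the standard `pathlib` module, changes only the file name itself. The helper name `stemmed_name` and its `suffix` parameter are hypothetical, not part of the original notebook.

```python
from pathlib import Path

def stemmed_name(path, suffix='-lemmatized'):
    """Insert `suffix` just before the file extension of `path`."""
    p = Path(path)
    # p.stem is the file name without its extension; p.suffix is the extension
    return str(p.with_name(p.stem + suffix + p.suffix))

print(stemmed_name('/Users/qad/Documents/arabic.txt'))
```

With this helper, the output name could be set with `outname = stemmed_name(filepath)`.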

In [None]:
#Opens the file you specified
with open(filepath, 'r', encoding="utf8") as f:
    #Creates the output file, named with -lemmatized before the .txt extension
    with open(outname, 'w', encoding="utf8") as out:
        #Reads the text of the file you specified
        text = f.read()
        #Stems the text
        stemmed = stemmer.stem(text)
        #Writes the result to the output file
        out.write(stemmed)

## Stemming a folder of text files
Put the full path to your folder of text files in the cell below, using the correct syntax for your operating system. 

For instance, the default path to a folder called "arabic" in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/arabic'
* On Windows: 'C:\\\Users\\\YOUR-USER-NAME\\\Documents\\\arabic'

In [None]:
#Imports the os module, which is used to work with folders and files
import os

#Put the full path to your folder between single quotes here
textfolder = '/Users/qad/Documents/arabic'
#Changes the working directory to the folder you specified
os.chdir(textfolder)

In [None]:
#For every file in the folder you specified...
for filename in os.listdir(textfolder):
    #If it's a text file, but not one of the already-stemmed output files
    if filename.endswith('.txt') and not filename.endswith('-lemmatized.txt'):
        #The outname is the name of the stemmed file that this notebook creates
        #If you want it to be named something other than the original file name + -lemmatized,
        #you can change that here
        outname = filename.replace('.txt', '-lemmatized.txt')
        #Opens the current text file
        with open(filename, 'r', encoding="utf8") as f:
            #Creates the output file, named with -lemmatized before the .txt extension
            with open(outname, 'w', encoding="utf8") as out:
                #Reads the text of the current file
                text = f.read()
                #Stems the text
                stemmed = stemmer.stem(text)
                #Writes the result to the output file
                out.write(stemmed)
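
The loop above depends on the working directory having been changed to your folder. As a possible alternative, the sketch below wraps the same steps in a function that works from full paths and needs no `os.chdir`. The function name `stem_folder` and its parameters are made up for illustration. The `stem` argument is assumed to be any function that takes a string and returns a string, such as `stemmer.stem` from the import cell.

```python
from pathlib import Path

def stem_folder(folder, stem, suffix='-lemmatized'):
    """Stem every .txt file in `folder`, skipping files whose name already ends in `suffix`.

    `stem` is any function that maps a string to a string.
    Returns the list of output paths that were written.
    """
    written = []
    # sorted() collects the file list up front, before any output files are created
    for path in sorted(Path(folder).glob('*.txt')):
        if path.stem.endswith(suffix):
            continue  # an output file from an earlier run
        out_path = path.with_name(path.stem + suffix + path.suffix)
        text = path.read_text(encoding='utf8')
        out_path.write_text(stem(text), encoding='utf8')
        written.append(out_path)
    return written
```

Once the stemmer has been loaded, the function could be called as `stem_folder(textfolder, stemmer.stem)`. Passing the stemming function as an argument also makes it possible to try the file handling with a simple stand-in before running Farasa on a large folder.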

## About

This Jupyter notebook was originally developed by Quinn Dombrowski for use in [DLCL 204: Digital Humanities Across Borders](https://github.com/quinnanya/dlcl204) at Stanford University, fall 2020. 