# Spanish lemmatizer

This Jupyter notebook takes a text file, or folder of text files in Spanish, and creates a set of lemmatized derivative files (where all words are in their dictionary form, and not inflected). These lemmatized files can then be used for searching, or other computational text analysis methods.

### Mounting Google Drive
Run the code cell below. You'll get a link that will take you to a screen where you can choose which Google account to authenticate with. After you choose an account and approve the connection, you'll get a long string of numbers and letters. Copy and paste it into the box that will appear in the code cell below, and hit enter.

If you see `Mounted at /content/gdrive`, then you know it worked.

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

## Setup
The code cells below install the *spaCy* package, which does the actual lemmatizing, and download its Spanish language model.

In [None]:
#Imports the module you need to download and install the spaCy modules
import sys
#Installs spaCy
!{sys.executable} -m pip install spacy==3.0

In [None]:
#Replace es_core_news_lg with another model name here for other languages
import spacy
!{sys.executable} -m spacy download es_core_news_lg

### Import modules
To use a [spaCy language model](https://spacy.io/models) for a language other than Spanish, replace `es_core_news_lg` with that model's name in the cell below.

In [None]:
# os is used for navigating directories
import os
# spacy is used for lemmatizing the text
import spacy
#Replace es_core_news_lg with another model name here for other languages
import es_core_news_lg
#Replace es_core_news_lg with another model name here for other languages
nlp = spacy.load("es_core_news_lg")

## Lemmatizing a single file


**Skip down to the next section if you want to lemmatize a whole folder**

After uploading the text file to Drive, put in the full path to where it's located. All paths should begin with '/content/gdrive/My Drive/'.

If your file is located just in the root of your Drive (not in a subfolder), the path should look like '/content/gdrive/My Drive/your-text-file.txt'

If it's in a subfolder, it should look like: '/content/gdrive/My Drive/subfolder/your-text-file.txt' (you can also include multiple subfolders if it's nested)
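
The path patterns above can also be built piece by piece rather than typed out by hand. The following is a minimal sketch, not part of the original notebook: `DRIVE_ROOT` and `drive_path` are hypothetical names, and the root is assumed to be the '/content/gdrive/My Drive' prefix described above.

```python
import posixpath

# Assumption: the Drive root is the prefix described in the text above.
DRIVE_ROOT = '/content/gdrive/My Drive'

def drive_path(*parts):
    """Join folder and file names onto the Drive root (hypothetical helper)."""
    return posixpath.join(DRIVE_ROOT, *parts)

# A file in the root of your Drive
print(drive_path('your-text-file.txt'))
# A file inside a subfolder (add more names for deeper nesting)
print(drive_path('subfolder', 'your-text-file.txt'))
```

Passing each folder name as a separate argument avoids missing or doubled slashes when a file is nested several levels deep.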

In [None]:
#Put the full path to your file between the single quotes here
filepath = '/content/gdrive/MyDrive/intro-to-nlp-es-files/fortunata_y_jacinta_1.txt'

#The outname is the name of the lemmatized file that this notebook creates
#If you want it to be named something other than the original file name + -lemmatized
#you can change that here
outname = filepath.replace('.txt', '-lemmatized.txt')
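
Note that `str.replace` changes every occurrence of '.txt' in the path, not just the extension at the end. If a folder or file name might contain '.txt' somewhere else, one possible alternative is to split off the extension first. This is only a sketch; `lemmatized_name` is a hypothetical helper, not something the notebook defines.

```python
import os

def lemmatized_name(path, suffix='-lemmatized'):
    """Insert a suffix just before the file extension (hypothetical helper)."""
    # splitext only splits at the last dot of the final path component
    root, ext = os.path.splitext(path)
    return root + suffix + ext

# Made-up path with '.txt' in a folder name, for illustration only
example = '/content/gdrive/MyDrive/notes.txt.backup/chapter.txt'
print(lemmatized_name(example))
```

With this helper, `outname = lemmatized_name(filepath)` would play the same role as the `replace` call above.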

In [None]:
#Opens the file you specified
with open(filepath, 'r', encoding='utf8') as f:
    #Creates an empty text file with -lemmatized.txt appended to the name
    with open(outname, 'w', encoding='utf8') as out:
        #Reads the text of the file you specified
        text = f.read()
        #Does Spanish NLP on the text
        doc = nlp(text)
        #For each word in the text...
        for token in doc:
            #Write the lemma to the new text file with the lemmatized text
            out.write(token.lemma_)
            #Write a space after each word
            out.write(' ')
            #Print the lemmas to the screen below, with a space between them
            print(token.lemma_, end=' ')
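
The loop above writes every token, including the tokens spaCy creates for line breaks and other runs of whitespace. If you would rather have the lemmas separated by single spaces only, one option is to skip tokens whose `is_space` attribute is true. The sketch below assumes you do not need the original line breaks; `lemmas_to_text` is a hypothetical helper, and the stand-in tokens are there only so the example runs without a language model.

```python
from types import SimpleNamespace

def lemmas_to_text(tokens):
    """Join lemmas with single spaces, skipping whitespace-only tokens."""
    return ' '.join(t.lemma_ for t in tokens if not t.is_space)

# Stand-in tokens that mimic the two spaCy token attributes used here.
# A real spaCy Doc could be passed the same way: lemmas_to_text(nlp(text))
fake_doc = [
    SimpleNamespace(lemma_='el', is_space=False),
    SimpleNamespace(lemma_='niño', is_space=False),
    SimpleNamespace(lemma_='\n\n', is_space=True),
    SimpleNamespace(lemma_='correr', is_space=False),
]
print(lemmas_to_text(fake_doc))
```

The result could then be written in one step with `out.write(lemmas_to_text(doc))` instead of writing each lemma and space separately.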

## Lemmatizing an entire folder of text files
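If you want to lemmatize an entire folder of files, put the path to the folder here. *Please include only paths to folders that contain text (.txt) files. If you have a folder that contains sub-folders of text files, run this section once for each sub-folder, because the code below only reads files directly inside the folder you specify.*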

Paths should look like: '/content/gdrive/My Drive/subfolder' (you can also include multiple subfolders if it's nested)

In [None]:
#Put the full path to your folder between single quotes here
textfolder = '/content/gdrive/My Drive/intro-to-nlp-es-files'
#Changes the working directory to the folder you specified
os.chdir(textfolder)

In [None]:
#For every file in the folder you specified...
for filename in os.listdir(textfolder):
    #If it's a text file, but not one of the text files with just lemmas
    if filename.endswith('.txt') and not filename.endswith('-lemmatized.txt'):
        #The outname is the name of the lemmatized file that this notebook creates
        #If you want it to be named something other than the original file name + -lemmatized
        #you can change that here
        outname = filename.replace('.txt', '-lemmatized.txt')
        #Opens the current text file in the folder
        with open(filename, 'r', encoding='utf8') as f:
            #Creates an empty text file with -lemmatized.txt appended to the name
            with open(outname, 'w', encoding='utf8') as out:
                #Reads the text of the current file
                text = f.read()
                #Does Spanish NLP on the text
                doc = nlp(text)
                #For each word in the text...
                for token in doc:
                    #Write the lemma to the new text file with the lemmatized text
                    out.write(token.lemma_)
                    #Write a space after each word
                    out.write(' ')
                    #Uncomment the next line to also print the lemmas to the screen below
                    #print(token.lemma_, end=' ')
print('Your folder now has lemmatized files!')
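
The loop above only looks at files directly inside the folder, because `os.listdir` does not descend into sub-folders. If your text files are nested, one possible approach is to walk the whole folder tree with `os.walk` and collect input/output path pairs first. This is a sketch under that assumption, not part of the original notebook; `find_text_files` is a hypothetical helper.

```python
import os

def find_text_files(folder):
    """Yield (input_path, output_path) pairs for every .txt file under
    folder, including sub-folders, skipping already-lemmatized files."""
    for dirpath, _dirnames, filenames in os.walk(folder):
        for filename in sorted(filenames):
            if filename.endswith('.txt') and not filename.endswith('-lemmatized.txt'):
                root, ext = os.path.splitext(filename)
                yield (os.path.join(dirpath, filename),
                       os.path.join(dirpath, root + '-lemmatized' + ext))

# Hypothetical usage with the textfolder variable from the cell above:
# for inpath, outpath in find_text_files(textfolder):
#     print(inpath, '->', outpath)
```

Because the helper returns full paths, the files can be opened directly without changing the working directory first.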

## About

This Jupyter notebook was originally developed by Quinn Dombrowski for use in [DLCL 204: Digital Humanities Across Borders](https://github.com/quinnanya/dlcl204) at Stanford University, fall 2020. 