# Intro to NLP (Spanish)

This Jupyter notebook lets you explore the different ways the spaCy Python library can annotate Spanish-language text.

### Mounting Google Drive
Run the code cell below. You'll get a link that will take you to a screen where you can choose which Google account to authenticate with. After you choose an account and approve the connection, you'll get a long string of numbers and letters. Copy and paste it into the box that will appear in the code cell below, and hit enter.

If you see `Mounted at /content/gdrive`, then you know it worked.

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

## Setup
The code cells below install the *spaCy* package and the Spanish language model, which we will use for NLP.

In [None]:
#Imports the module you need to download and install the spaCy modules
import sys
#Installs spaCy
!{sys.executable} -m pip install spacy==3.0

In [None]:
#Replace es_core_news_lg with another model name here for other languages
import spacy
!{sys.executable} -m spacy download es_core_news_lg
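
Before loading the model, it can be worth confirming that the download actually succeeded. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, not part of spaCy. It relies on the model being installed as an ordinary Python package, which is also why the notebook can `import es_core_news_lg` in the next section.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None

# Assumption: the model was installed under the package name used in this notebook
model_name = "es_core_news_lg"
if is_installed(model_name):
    print(model_name, "is installed")
else:
    print(model_name, "is missing; try re-running the download cell above")
```

If you swapped in a model for another language, change `model_name` to match.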

### Import modules
To use a [spaCy language model](https://spacy.io/models) other than Spanish, replace `es_core_news_lg` in the cell below with the name of that model.

In [None]:
# os is used for navigating directories
import os
# spacy is used for identifying the subjects and verbs
import spacy
from spacy.symbols import nsubj, VERB
#Replace es_core_news_lg with another model name here for other languages
import es_core_news_lg
#Replace es_core_news_lg with another model name here for other languages
nlp = spacy.load("es_core_news_lg")

## Exploring spaCy tagging

Before you run spaCy on a whole text, try it on a few sentences in order to understand what the different annotations are and how they work.

In [None]:
example = nlp("Empezó Maximiliano sus estudios el 69, y su hermano y su tía le ponderaban lo bonita que era la Farmacia y lo mucho que con ella se ganaba, por ser muy caros los medicamentos y muy baratas las primeras materias: agua del pozo, ceniza del fogón, tierra de los tiestos, etcétera... El pobre chico, que era muy dócil, con todo se mostraba conforme.")

for token in example:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)
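
Of the columns printed above, `shape_` and `is_alpha` depend only on the characters of the token itself, so they are easy to approximate by hand. The sketch below is a rough plain-Python illustration of what those two columns mean. `approximate_shape` is a hypothetical helper written for this explanation, not spaCy's own code, and it may differ from spaCy on edge cases.

```python
def approximate_shape(text):
    """Approximate token.shape_: letters become X or x, digits become d,
    and runs of the same shape character are capped at four."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Punctuation and other symbols are kept as they are
            shape_char = char
        # Track how many times in a row the same shape character has appeared
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)

# Words taken from the example sentence above
for word in ["Maximiliano", "69", "etcétera", "..."]:
    # word.isalpha() stands in for token.is_alpha
    print(word, approximate_shape(word), word.isalpha())
```

The remaining columns (`lemma_`, `pos_`, `tag_`, `dep_`) come from the language model's pipeline and cannot be reproduced this simply.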

In [None]:
from spacy import displacy
displacy.render(example, style='dep', jupyter=True, options={'distance': 90})

## Getting all nouns
When comparing texts based on their content, it can be useful to create derivative files with only the words that give us the best information about the content of the text. Nouns usually convey the most information about what a text is about. The code cells below take each of the text files in the *intro-to-nlp-es-files* folder in your Google Drive and create a derivative file that contains only the lemmas of the nouns in the text.

In [None]:
#Put the full path to your folder between single quotes here
textfolder = '/content/gdrive/My Drive/intro-to-nlp-es-files'
#Changes the working directory to the folder you specified
os.chdir(textfolder)
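
The loops in this notebook decide which files to process by checking file-name suffixes, so it can help to preview that selection before running them. A minimal sketch, assuming the `-lemmatized.txt` and `-nouns.txt` naming used in this notebook; `files_to_process` and the sample file names are hypothetical.

```python
def files_to_process(filenames, derived_suffixes=('-lemmatized.txt', '-nouns.txt')):
    """Return the .txt files that are originals rather than derivative files."""
    return sorted(
        name for name in filenames
        # str.endswith accepts a tuple and checks each suffix in turn
        if name.endswith('.txt') and not name.endswith(derived_suffixes)
    )

# Hypothetical file names, for illustration only
sample = ['parte1.txt', 'parte1-nouns.txt', 'parte1-lemmatized.txt', 'notas.csv']
print(files_to_process(sample))
```

In the notebook itself, you could call `files_to_process(os.listdir(textfolder))` to see which of your own files would be read.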

In [None]:
#For every file in the folder you specified...
for filename in os.listdir(textfolder):
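    #If it's a text file, but not one of the text files with just lemmas or nouns
    #(skipping -nouns.txt files keeps a second run from processing its own output)
    if filename.endswith('.txt') and not filename.endswith(('-lemmatized.txt', '-nouns.txt')):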
        #The outname is the name of the nouns-only file that this notebook creates
        #If you want it to be named something other than the original file name + -nouns
        #you can change that here
        outname = filename.replace('.txt', '-nouns.txt')
        #Opens the file you specified
        with open(filename, 'r', encoding='utf8') as f:
            #Creates an empty text file with -nouns.txt appended to the name
            with open(outname, 'w', encoding='utf8') as out:
                #Reads the text of the file you specified
                text = f.read()
                #Does Spanish NLP on the text
                doc = nlp(text)
                #For each word in the text...
                for token in doc:
                    #If the word is a noun...
                    if token.pos_ == 'NOUN':
                        #Write the noun's lemma to the new nouns-only file
                        out.write(token.lemma_)
                        #Write a space after each word
                        out.write(' ')
print('Your folder now has files with only nouns!')
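
Each `-nouns.txt` file is a space-separated list of noun lemmas, so a first look at what a text is about only needs a word count. The sketch below uses the standard library's `Counter`; `top_nouns` and the sample string are hypothetical stand-ins for one of your own files.

```python
from collections import Counter

def top_nouns(nouns_text, n=3):
    """Count space-separated lemmas and return the n most common ones."""
    return Counter(nouns_text.split()).most_common(n)

# Hypothetical content standing in for one of the -nouns.txt files
sample = 'estudio hermano tía farmacia medicamento agua pozo ceniza agua tierra agua hermano '
print(top_nouns(sample, n=2))
```

To try it on a real file, read the file's text with `open(..., encoding='utf8')` and pass it to `top_nouns`.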

## Extracting character verbs
You can also use spaCy's dependency parse to try to identify all of a character's verbs: what are different characters *doing* in the text?

Ideally, another step in the pipeline would perform *co-reference resolution* (figuring out which character each 'he', 'she', 'I', etc. refers to), but that is still a very hard computational problem. (The only relatively easy-to-use tool that does it for English, somewhat successfully, is David Bamman's [BookNLP](https://github.com/dbamman/book-nlp).)

Here, we've defined a list of major characters in *Fortunata and Jacinta*. The code below looks for every token that matches one of those character names and is the grammatical subject (`nsubj`) of a verb, and writes out the character name together with that verb.
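
One consequence of this approach is worth knowing before you edit the list: the cell further down compares each name against one token at a time, so an entry made of several words can never match. The sketch below illustrates this with a whitespace split standing in for spaCy's tokenizer (an assumption for illustration; spaCy also splits off punctuation). `single_token_names` is a hypothetical helper.

```python
def single_token_names(names):
    """Return the names that consist of a single whitespace-delimited word."""
    return [name for name in names if len(name.split()) == 1]

# A shortened copy of the character list, for illustration
sample_names = ['Fortunata', 'Juanito Santa Cruz', 'Maxi', 'el Negro', 'D. Evaristo']
print(single_token_names(sample_names))
```

Only the names this function returns can produce matches, so multi-word characters are effectively found through their one-word forms (for example `Juanito` rather than `Juanito Santa Cruz`).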

In [None]:
names = ['Fortunata', 'Juanito', 'Juanito Santa Cruz', 'Jacinta', 'Maximiliano', 'Maxi', 'Juárez', 'el Negro', 'Juan Evaristo', 'D. Evaristo', 'D. José', 'José Izquierdo', 'Mauricia', 'Aurora']

In [None]:
#You can give your output file a different name here if you like
#If you do, also change the file name in the lemmatizing cell at the end of the notebook
#This file will appear in the same folder in Drive as the text files
charverbfile = 'fortunata-jacinda-verbs2.csv'

In [None]:
#Put the full path to your folder between single quotes here
textfolder = '/content/gdrive/My Drive/intro-to-nlp-es-files'
#Changes the working directory to the folder you specified
os.chdir(textfolder)

In [None]:
#Opens the output file that will hold the character/verb pairs
with open(charverbfile, 'w', encoding='utf8') as out:
  #For every file in the folder you specified...
  for filename in os.listdir(textfolder):
      #If it's a text file, but not one of the text files with just lemmas or nouns
      if filename.endswith('.txt') and not filename.endswith('-lemmatized.txt') and not filename.endswith('-nouns.txt'):
          #Opens each file
          with open(filename, 'r', encoding='utf8') as bookfile:
              #Reads in the text in the file
              book = bookfile.read()
              #NLP parse of the text
              doc = nlp(book)
              #For each word in the text...
              for possible_subject in doc:
                #If the word is one of the character names...
                if possible_subject.text in names:
                  #...and it is the subject of a verb
                  if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
                    #Write the name, then the verb it is the subject of
                    out.write(str(possible_subject) + ', ')
                    out.write(str(possible_subject.head) + '\n')

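## Lemmatizing character verbs
The verbs extracted above are inflected forms as they appear in the text. The cell below adds a column with each verb's lemma, so that different forms of the same verb can be counted together.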

In [None]:
#Import CSV reader
import csv
#Open a new file that will include a column for the lemmatized verbs
with open('verbs-lemmatized.csv', 'w', encoding='utf8') as out:
  #Writes header row
  out.write('Character, Verb, VerbLemma\n')
  #Opens the character verb file
  with open('fortunata-jacinda-verbs2.csv', 'r', encoding='utf8') as csvfile:
    #Reads the character verb file, dropping the space that follows each comma
    #(without this, each verb starts with a space that spaCy treats as an extra token)
    csvreader = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
    #For each row...
    for row in csvreader:
      #Skip any row that doesn't have both a character and a verb
      if len(row) < 2 or not row[1]:
        continue
      #Character is first column
      character = row[0]
      #Verb is second column
      verb = row[1]
      #NLP on verb
      analyzed = nlp(verb)
      #The verb is a single word, so its lemma is the lemma of the first token
      out.write(character + ', ' + verb + ', ' + analyzed[0].lemma_ + '\n')
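
With the lemma column in place, one way to start answering "what are different characters doing?" is to count lemmas per character. This is a minimal sketch using only the standard library. It assumes the three-column layout and header row written above; `lemma_counts_by_character` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

def lemma_counts_by_character(csvfile):
    """Map each character name to a Counter of that character's verb lemmas."""
    counts = defaultdict(Counter)
    # skipinitialspace drops the space that follows each comma in this file,
    # in the header row as well as in the data rows
    reader = csv.DictReader(csvfile, skipinitialspace=True)
    for row in reader:
        counts[row['Character']][row['VerbLemma']] += 1
    return counts

# Hypothetical rows in the same layout as verbs-lemmatized.csv
sample = io.StringIO(
    'Character, Verb, VerbLemma\n'
    'Fortunata, dijo, decir\n'
    'Fortunata, dice, decir\n'
    'Jacinta, miraba, mirar\n'
)
counts = lemma_counts_by_character(sample)
print(counts['Fortunata'].most_common(1))
```

To run it on your own results, open `verbs-lemmatized.csv` with `encoding='utf8'` and pass the open file to the function instead of `sample`.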