# Intro to NLP (Spanish)

This Jupyter notebook lets you explore the different ways the spaCy Python library can annotate Spanish-language text.

### Mounting Google Drive
Run the code cell below. You'll get a link that will take you to a screen where you can choose which Google account to authenticate with. After you choose an account and approve the connection, you'll get a long string of numbers and letters. Copy and paste it into the box that will appear in the code cell below, and hit enter.

If you see `Mounted at /content/gdrive`, then you know it worked.

In [1]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Setup
The code cells below install the *spaCy* package, which we will use for NLP.

In [2]:
#Imports the module you need to download and install the spaCy modules
import sys
#Installs spaCy
!{sys.executable} -m pip install spacy==3.0

Collecting spacy==3.0
  Downloading https://files.pythonhosted.org/packages/8b/62/a98c61912ea57344816dd4886ed71e34d8aeec55b79e5bed05a7c2a1ae52/spacy-3.0.0-cp37-cp37m-manylinux2014_x86_64.whl (12.7MB)
     |████████████████████████████████| 12.7MB 227kB/s 
Collecting thinc<8.1.0,>=8.0.0
  Downloading https://files.pythonhosted.org/packages/61/87/decceba68a0c6ca356ddcb6aea8b2500e71d9bc187f148aae19b747b7d3c/thinc-8.0.3-cp37-cp37m-manylinux2014_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 33.7MB/s 
Collecting srsly<3.0.0,>=2.4.0
  Downloading https://files.pythonhosted.org/packages/c3/84/dfdfc9f6f04f6b88207d96d9520b911e5fec0c67ff47a0dea31ab5429a1e/srsly-2.4.1-cp37-cp37m-manylinux2014_x86_64.whl (456kB)
     |████████████████████████████████| 460kB 34.1MB/s 
Collecting catalogue<2.1.0,>=2.0.1
  Downloading https://files.pythonhosted.org/packages/9c/10/dbc1203a4b1367c7b02fddf08cb2981d9aa3e688d398f587cea0ab9e3bec/catalogue-2.0.4-py3-none-an
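
Once the install finishes, it can be worth confirming which version you actually ended up with. The sketch below is one way to check using only Python's standard library. The helper name `installed_version` is made up for this example, and `importlib.metadata` assumes Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if spaCy is installed, or None if it is not
print(installed_version("spacy"))
```

If this prints `None`, the install cell above probably did not finish, and it is worth re-running it before moving on.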

In [3]:
#Replace es_core_news_lg with another model name here for other languages
import spacy
!{sys.executable} -m spacy download es_core_news_lg

2021-05-17 03:51:09.806720: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Collecting es-core-news-lg==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-3.0.0/es_core_news_lg-3.0.0-py3-none-any.whl (569.7MB)
     |████████████████████████████████| 569.7MB 26kB/s 
Installing collected packages: es-core-news-lg
Successfully installed es-core-news-lg-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_lg')


### Import modules
To use a [spaCy language model](https://spacy.io/models) for a language other than Spanish, replace `es_core_news_lg` with the name of that model in the cell below.

In [4]:
# os is used for navigating directories
import os
# spacy is used for identifying the subjects and verbs
import spacy
from spacy.symbols import nsubj, VERB
#Replace es_core_news_lg with another model name here for other languages
import es_core_news_lg
#Replace es_core_news_lg with another model name here for other languages
nlp = spacy.load("es_core_news_lg")

## Exploring spaCy tagging

Before you run spaCy on a whole text, try it on a few sentences in order to understand what the different annotations are and how they work.

In [5]:
example = nlp("Empezó Maximiliano sus estudios el 69, y su hermano y su tía le ponderaban lo bonita que era la Farmacia y lo mucho que con ella se ganaba, por ser muy caros los medicamentos y muy baratas las primeras materias: agua del pozo, ceniza del fogón, tierra de los tiestos, etcétera... El pobre chico, que era muy dócil, con todo se mostraba conforme.")

for token in example:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Empezó empezar VERB VERB ROOT Xxxxx True False
Maximiliano Maximiliano PROPN PROPN nsubj Xxxxx True False
sus su DET DET det xxx True True
estudios estudio NOUN NOUN nsubj xxxx True False
el el DET DET det xx True True
69 69 NUM NUM obl dd False False
, , PUNCT PUNCT punct , False False
y y CCONJ CCONJ cc x True False
su su DET DET det xx True True
hermano hermano NOUN NOUN dep xxxx True False
y y CCONJ CCONJ cc x True False
su su DET DET det xx True True
tía tía NOUN NOUN nsubj xxx True False
le él PRON PRON obj xx True True
ponderaban ponderar VERB VERB conj xxxx True False
lo él PRON PRON det xx True True
bonita bonito ADJ ADJ obj xxxx True False
que que PRON PRON nsubj xxx True True
era ser AUX AUX cop xxx True True
la el DET DET det xx True True
Farmacia Farmacia PROPN PROPN ccomp Xxxxx True False
y y CCONJ CCONJ cc x True False
lo él PRON PRON det xx True True
mucho mucho ADV ADV dep xxxx True True
que que PRON PRON obj xxx True True
con con ADP ADP case xxx True True
ella él PRO
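
The sixth column in the output above, `token.shape_`, can look cryptic. The sketch below approximates how it appears to be built: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, other characters are kept as they are, and any run of the same symbol is capped at four. The function `word_shape_sketch` is a hypothetical re-implementation for illustration, not spaCy's own code, and it may differ from the library on edge cases.

```python
def word_shape_sketch(text):
    """Approximate a spaCy-style word shape for a single word."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        # Map each character to a shape symbol
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char
        # Count how many times in a row this symbol has appeared
        if symbol == previous:
            run_length += 1
        else:
            previous = symbol
            run_length = 1
        # Keep at most four of the same symbol in a row
        if run_length <= 4:
            shape.append(symbol)
    return "".join(shape)


for word in ["Empezó", "estudios", "69", ","]:
    print(word, word_shape_sketch(word))
```

Comparing the printed shapes with the rows for the same words in the output above is a quick way to see why a long name and a short one can end up with the same shape.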

In [6]:
from spacy import displacy
displacy.render(example, style='dep', jupyter=True, options={'distance': 90})

## Getting all nouns
When comparing texts based on their content, it can be useful to create derivative files with only the words that give us the best information about the content of the text. Nouns usually convey the most information about what a text is about. The code cells below take each of the text files in the *intro-to-nlp-es-files* folder in your Google Drive and create a derivative file that has only the nouns from the text.

What can you do with a nouns-only file? If you do this for multiple novels, you can compare them, e.g. as described in the *Programming Historian* lesson "[Understanding and Using Common Similarity Measures](https://programminghistorian.org/en/lessons/common-similarity-measures)" by John R. Ladd.
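
As a rough illustration of that kind of comparison, the sketch below computes the cosine similarity between two nouns-only texts based on word counts. The two strings are made-up stand-ins for the contents of two `-nouns.txt` files, and the function name `cosine_similarity` is our own. The *Programming Historian* lesson covers more robust approaches.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two space-separated texts, using raw word counts."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Only words that appear in both texts contribute to the dot product
    shared_words = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    # An empty text has no direction, so treat its similarity as 0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up examples of what two nouns-only files might contain
nouns_a = "casa calle hermano tía casa"
nouns_b = "casa farmacia medicamento tía"
print(cosine_similarity(nouns_a, nouns_b))
```

A score near 1 means the two texts use nouns in similar proportions, and a score of 0 means they share no nouns at all.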

In [7]:
#Put the full path to your folder between single quotes here
textfolder = '/content/gdrive/My Drive/intro-to-nlp-es-files'
#Changes the working directory to the folder you specified
os.chdir(textfolder)

In [8]:
#For every file in the folder you specified...
for filename in os.listdir(textfolder):
    #If it's a source text file, not one of the derivative files this notebook creates
    if filename.endswith('.txt') and not filename.endswith(('-lemmatized.txt', '-nouns.txt')):
        #The outname is the name of the nouns-only file that this notebook creates
        #If you want it to be named something other than the original file name + -nouns
        #you can change that here
        outname = filename.replace('.txt', '-nouns.txt')
        #Opens the source file for reading
        with open(filename, 'r', encoding='utf8') as f:
            #Creates an empty text file with -nouns.txt appended to the name
            with open(outname, 'w', encoding='utf8') as out:
                #Reads the text of the source file
                text = f.read()
                #Does Spanish NLP on the text
                doc = nlp(text)
                #For each word in the text...
                for token in doc:
                    #If the word is tagged as a common noun...
                    if token.pos_ == 'NOUN':
                        #Write the noun's lemma to the nouns-only file
                        out.write(token.lemma_)
                        #Write a space after each word
                        out.write(' ')
print('Your folder now has files with only nouns!')

Your folder now has files with only nouns!
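
One thing to watch for with whole novels: spaCy refuses to process a text longer than `nlp.max_length`, which defaults to 1,000,000 characters. If your files are longer than that, one option is to split each text into smaller pieces at paragraph breaks and run `nlp` on each piece. The sketch below assumes paragraphs are separated by blank lines. The name `chunk_paragraphs` and the default `limit` are illustrative choices, not part of spaCy.

```python
def chunk_paragraphs(text, limit=100_000):
    """Yield pieces of text made of whole paragraphs, aiming for at most `limit` characters each.

    A single paragraph longer than `limit` is kept whole, so a piece can exceed the limit.
    """
    current = []
    current_length = 0
    for paragraph in text.split("\n\n"):
        # Start a new piece if adding this paragraph would go over the limit
        if current and current_length + len(paragraph) > limit:
            yield "\n\n".join(current)
            current = []
            current_length = 0
        current.append(paragraph)
        # The +2 accounts for the blank line that joins paragraphs
        current_length += len(paragraph) + 2
    if current:
        yield "\n\n".join(current)


# Made-up text with three short paragraphs and a tiny limit, just to show the splitting
sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
for piece in chunk_paragraphs(sample, limit=25):
    print(len(piece))
```

In the noun-extraction loop above, you would then call `nlp(piece)` for each piece instead of calling `nlp(text)` once on the whole file.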


## Extracting character verbs
You can also use spaCy's dependency parse to try to identify all of a character's verbs: what are different characters *doing* in the text?

Ideally, there'd be another step in the pipeline that performs *co-reference resolution* (figuring out which character all the 'he', 'she', 'I', etc. are referring to), but that is still a very hard computational problem. (The only relatively easy-to-use tool that does it for English, somewhat successfully, is David Bamman's [BookNLP](https://github.com/dbamman/book-nlp).)

Here, we've defined a list of major characters in *Fortunata and Jacinta*. The code below looks for every token that matches one of those character names, is labeled as a subject (nsubj), and is attached to a verb. It then writes out that character and that verb.

In [9]:
names = ['Fortunata', 'Juanito', 'Juanito Santa Cruz', 'Jacinta', 'Maximiliano', 'Maxi', 'Juárez', 'el Negro', 'Juan Evaristo', 'D. Evaristo', 'D. José', 'José Izquierdo', 'Mauricia', 'Aurora']

Next, name the CSV file where you'll write out all the character verbs.

In [10]:
#You can give your output file a different name here if you like
#This file will appear in the same folder in Drive as the text files
charverbfile = 'fortunata-jacinta-verbs.csv'

Put in the path to the folder with your source texts.

In [11]:
#Put the full path to your folder between single quotes here
textfolder = '/content/gdrive/My Drive/intro-to-nlp-es-files'
#Changes the working directory to the folder you specified
os.chdir(textfolder)

The code below creates a CSV file with all the occurrences it can find of one of the character names listed above, plus a verb.

In [13]:
#Creates the CSV file that will hold each character and verb
with open(charverbfile, 'w', encoding='utf8') as out:
  #Write header row
  out.write('Character,Verb\n')
  #For every file in the folder you specified...
  for filename in os.listdir(textfolder):
      #If it's a text file, but not one of the text files with just lemmas or nouns
      if filename.endswith('.txt') and not filename.endswith('-lemmatized.txt') and not filename.endswith('-nouns.txt'):
          #Opens each file
          with open(filename, 'r', encoding='utf8') as bookfile:
              #Reads in the text in the file
              book = bookfile.read()
              #NLP parse of the text
              doc = nlp(book)
              #For each possible subject (each one is a single token)
              for possible_subject in doc:
                #For each character name you listed above
                for name in names:
                  #If the text of a possible subject matches a name
                  if possible_subject.text == name:
                    #If the possible subject is labeled nsubj and is associated with a verb
                    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
                      #Write out the subject, a comma, and the verb
                      out.write(str(possible_subject) + ',' + str(possible_subject.head) + '\n')
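
One caveat about the cell above: `possible_subject.text` is always a single token, so multi-word entries in `names` such as 'Juanito Santa Cruz' or 'el Negro' can never equal it. The sketch below shows one possible way to find multi-word names in a sequence of token strings. It works on a plain Python list so that it runs on its own; with spaCy you might pass in `[token.text for token in doc]`. The helper `find_name_spans` and the sample tokens are hypothetical, and splitting names on spaces is an assumption that may not line up with spaCy's tokenizer for abbreviations like 'D. Evaristo'.

```python
def find_name_spans(tokens, names):
    """Return (start, end, name) for each place a name's words appear consecutively in tokens."""
    matches = []
    for name in names:
        name_tokens = name.split()
        width = len(name_tokens)
        # Slide a window the width of the name across the token list
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] == name_tokens:
                matches.append((start, start + width, name))
    return sorted(matches)


# Made-up sentence, already split into tokens
tokens = ["Entonces", "Juanito", "Santa", "Cruz", "miró", "a", "Fortunata"]
sample_names = ["Fortunata", "Juanito", "Juanito Santa Cruz"]
print(find_name_spans(tokens, sample_names))
```

You would still need to decide which token inside a matched span to check for the `nsubj` label, and how to handle overlaps such as 'Juanito' inside 'Juanito Santa Cruz'.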

## Lemmatizing character verbs
To reduce variation in the verb forms, it can help to lemmatize the verbs that appear in your CSV. The code cells below create another CSV file with a column for lemmatized verb form, then lemmatize the verbs in your first CSV.

In [14]:
lemmatizedverbfilename = 'verbs-lemmatized.csv'

In [16]:
#Import CSV reader
import csv
#Open a new file that will include a column for the lemmatized verbs
with open(lemmatizedverbfilename, 'w', encoding='utf8') as out:
  #Writes header row
  out.write('Character,Verb,VerbLemma\n')
  #Opens the character verb file you named earlier
  with open(charverbfile, 'r', encoding='utf8') as csvfile:
    #Reads the character verb file
    csvreader = csv.reader(csvfile, delimiter=',')
    #Skips the header row so it isn't treated as a character and verb
    next(csvreader, None)
    #For each row...
    for row in csvreader:
      #Character is first column
      character = row[0]
      #Verb is second column
      verb = row[1]
      #NLP on verb
      analyzed = nlp(verb)
      #For word in analyzed text
      for token in analyzed:
          #Write out the character, the verb, and the verb's lemma
          out.write(character + ',' + verb + ',' + token.lemma_ + '\n')

## What now?
Now you can take the lemmatized verb CSV and explore it with an environment like [RAWGraphs](https://app.rawgraphs.io/), or just look through the data in your favorite tool for tabular data. ([OpenRefine](https://openrefine.org/) is a nice one, and there's a *Programming Historian* lesson on "[Limpieza de datos con OpenRefine](https://programminghistorian.org/es/lecciones/limpieza-de-datos-con-OpenRefine)" by Seth van Hooland, Ruben Verborgh, and Max De Wilde.)
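
If you would rather stay in Python for a first look, the sketch below counts the most frequent verb lemmas for each character using only the standard library. The sample rows are invented to keep the example self-contained; in practice you would open your lemmatized CSV instead of using the `io.StringIO` stand-in. The function name `top_lemmas` is our own, and `skipinitialspace=True` is there so that the column names are read correctly whether or not there are spaces after the commas.

```python
import csv
import io
from collections import Counter, defaultdict


def top_lemmas(csvfile, n=3):
    """Return the n most common verb lemmas for each character in an open CSV file."""
    verbs_by_character = defaultdict(Counter)
    for row in csv.DictReader(csvfile, skipinitialspace=True):
        verbs_by_character[row["Character"]][row["VerbLemma"]] += 1
    return {
        character: counts.most_common(n)
        for character, counts in verbs_by_character.items()
    }


# Invented rows standing in for the lemmatized verb CSV
sample = io.StringIO(
    "Character,Verb,VerbLemma\n"
    "Fortunata,dijo,decir\n"
    "Fortunata,decía,decir\n"
    "Fortunata,miró,mirar\n"
    "Jacinta,pensó,pensar\n"
)
print(top_lemmas(sample))
```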

If you split up the data so that there's one .txt file with all the verbs for a particular character, you can use TF-IDF (as described in the *Programming Historian* lesson "[Analyzing Documents with TF-IDF](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf)" by Matthew J. Lavin) to identify which verbs are especially characteristic of a particular character.
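
Here is a minimal sketch of that splitting step, assuming you have rows of (character, lemma) pairs like those in the lemmatized CSV. The function name `write_character_files`, the output file naming scheme, and the sample rows are all assumptions for illustration.

```python
import os
import tempfile
from collections import defaultdict


def write_character_files(rows, outfolder):
    """Write one space-separated .txt file of verb lemmas per character and return the paths."""
    lemmas = defaultdict(list)
    for character, lemma in rows:
        lemmas[character].append(lemma)
    paths = []
    for character, words in lemmas.items():
        # Replace spaces so names like 'Juan Evaristo' make tidy file names
        path = os.path.join(outfolder, character.replace(" ", "-") + "-verbs.txt")
        with open(path, "w", encoding="utf8") as out:
            out.write(" ".join(words))
        paths.append(path)
    return paths


# Invented rows; a temporary folder stands in for your Drive folder
rows = [("Fortunata", "decir"), ("Jacinta", "pensar"), ("Fortunata", "mirar")]
outfolder = tempfile.mkdtemp()
print(write_character_files(rows, outfolder))
```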