### Step 0. Download SpaCy and language model
If you don't already have the spaCy library installed, the code below installs it and downloads a language model. This notebook should work for any [language with a spaCy model](https://spacy.io/models). The cells here use the large Spanish model, `es_core_news_lg`; to work with another language, substitute that model's name (for instance, `en_core_web_sm` for English or `lt_core_news_sm` for Lithuanian). Places where you need to make this substitution are commented in the code.

In [1]:
import sys
!{sys.executable} -m pip install spacy

You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.


In [2]:
#Replace es_core_news_lg with another model name here for other languages
!{sys.executable} -m spacy download es_core_news_lg

Collecting es-core-news-lg==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-3.0.0/es_core_news_lg-3.0.0-py3-none-any.whl (569.7 MB)
Installing collected packages: es-core-news-lg
  Attempting uninstall: es-core-news-lg
    Found existing installation: es-core-news-lg 2.3.1
    Uninstalling es-core-news-lg-2.3.1:
      Successfully uninstalled es-core-news-lg-2
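
Because the model download is large (the log above reports 569.7 MB), it can be worth checking whether the model is already installed before re-running the download cell. The following is a minimal sketch, assuming the model was installed with pip as a regular Python package, which is how `spacy download` installs it. The helper name `model_is_installed` is hypothetical and not part of spaCy.

```python
import importlib.util

def model_is_installed(model_name):
    """Return True if a Python package with this name can be found."""
    return importlib.util.find_spec(model_name) is not None

# Replace es_core_news_lg with another model name here for other languages
if model_is_installed("es_core_news_lg"):
    print("Model already installed; the download cell can be skipped.")
else:
    print("Model not found; run the download cell above.")
```

Note that this only checks that the package is present. It does not check that the model version matches your spaCy version, so re-running the download after upgrading spaCy may still be necessary.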

### Step 1. Import modules
The cell below loads the Spanish model `es_core_news_lg`. To use a different [spaCy language model](https://spacy.io/models), replace `es_core_news_lg` with that model's name in the cell below.

In [3]:
# os is used for navigating directories
import os
# spacy is used for identifying the subjects and verbs
import spacy
#Replace es_core_news_lg with another model name here for other languages
import es_core_news_lg
#Replace es_core_news_lg with another model name here for other languages
nlp = spacy.load("es_core_news_lg")



## Exploring spaCy tagging

In [15]:
example = nlp("Empezó Maximiliano sus estudios el 69, y su hermano y su tía le ponderaban lo bonita que era la Farmacia y lo mucho que con ella se ganaba, por ser muy caros los medicamentos y muy baratas las primeras materias: agua del pozo, ceniza del fogón, tierra de los tiestos, etcétera... El pobre chico, que era muy dócil, con todo se mostraba conforme.")

for token in example:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Empezó empezar VERB VERB ROOT Xxxxx True False
Maximiliano Maximiliano PROPN PROPN nsubj Xxxxx True False
sus su DET DET det xxx True True
estudios estudio NOUN NOUN nsubj xxxx True False
el el DET DET det xx True True
69 69 NUM NUM obl dd False False
, , PUNCT PUNCT punct , False False
y y CCONJ CCONJ cc x True False
su su DET DET det xx True True
hermano hermano NOUN NOUN dep xxxx True False
y y CCONJ CCONJ cc x True False
su su DET DET det xx True True
tía tía NOUN NOUN nsubj xxx True False
le él PRON PRON obj xx True True
ponderaban ponderar VERB VERB conj xxxx True False
lo él PRON PRON det xx True True
bonita bonito ADJ ADJ obj xxxx True False
que que PRON PRON nsubj xxx True True
era ser AUX AUX cop xxx True True
la el DET DET det xx True True
Farmacia Farmacia PROPN PROPN ccomp Xxxxx True False
y y CCONJ CCONJ cc x True False
lo él PRON PRON det xx True True
mucho mucho ADV ADV dep xxxx True True
que que PRON PRON obj xxx True True
con con ADP ADP case xxx True True
ella él PRO

In [16]:
from spacy import displacy
displacy.render(example, style="dep")

### Step 2. Select directory
Edit the code cell below to put in the full path to the directory/folder with the text files (.txt) you want to work with. Note that this notebook will process *all* text files in the directory you specify, so you may need to put just the files you're interested in into their own directory.

If you have only one text file (e.g. a single novel) and aren't comfortable reworking the code to run on only one file, you can create a directory and put the text file inside it.
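
If you would rather do this from Python than from your file manager, here is a minimal sketch that copies one text file into its own folder. The helper name `isolate_file` and the example paths are hypothetical; replace them with your own.

```python
import shutil
from pathlib import Path

def isolate_file(text_file, new_folder):
    """Copy one text file into its own folder, creating the folder if needed."""
    target_dir = Path(new_folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 leaves the original file in place and returns the new file's path
    return Path(shutil.copy2(text_file, target_dir))

# Hypothetical paths; replace with your own before uncommenting
# isolate_file('/Users/YOUR-USER-NAME/Documents/novel.txt',
#              '/Users/YOUR-USER-NAME/Documents/YOUR-FOLDER')
```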

Here's the syntax to specify the full path to a folder called *YOUR-FOLDER* within the Documents directory:

* On Mac: '/Users/YOUR-USER-NAME/Documents/YOUR-FOLDER'
* On Windows: 'C:\\Users\\YOUR-USER-NAME\\Documents\\YOUR-FOLDER'
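
As an alternative to typing the path by hand, Python's `pathlib` can build it in a way that works on both systems. This is a sketch, not a required step: it assumes your folder sits inside Documents in your home directory, and `list_text_files` is a hypothetical helper for checking which files the notebook would pick up.

```python
from pathlib import Path

# Path.home() is typically /Users/YOUR-USER-NAME on Mac
# and C:\Users\YOUR-USER-NAME on Windows
directory = Path.home() / 'Documents' / 'YOUR-FOLDER'

def list_text_files(folder):
    """Return the names of the .txt files in a folder, sorted alphabetically."""
    return sorted(p.name for p in Path(folder).glob('*.txt'))

if directory.is_dir():
    print(list_text_files(directory))
else:
    print('Folder not found:', directory)
```

A `Path` object can be passed to `os.chdir` and `os.listdir` in the same way as a string.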

In [10]:
directory = '/Users/qad/sampletexts'

In [11]:
#Changes the notebook's working directory to the directory you specified
os.chdir(directory)

### Step 3. Find subjects
The code cells below read each text file in the directory, find every subject and its verb, and write each pair to a CSV file along with the name of the file it came from. The files are processed alphabetically, so that if something breaks (e.g. text files bigger than 1 MB may trigger a memory error), you can figure out which files are left to be done.

The CSV file gets created inside the directory with the text files. In this notebook it is named `fortunata-jacinda-verbs.csv`; you can give the file a different name in the code cell below, but keeping `.csv` is recommended.

Doing the NLP parse can be time-consuming, particularly if you have a large number of files. If your text is on the scale of hundreds of novels, expect it to take hours.
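
The size problem mentioned above is likely related to spaCy's `nlp.max_length` setting, which defaults to 1,000,000 characters. One possible workaround, sketched here, is to split a long text at paragraph breaks and parse the pieces one at a time. `split_text` is a hypothetical helper, and it assumes paragraphs are separated by blank lines. A single paragraph longer than the limit is kept whole rather than cut mid-sentence.

```python
def split_text(text, max_chars=1_000_000):
    """Split text into pieces of at most max_chars, breaking at blank lines."""
    pieces, current = [], ''
    for paragraph in text.split('\n\n'):
        # Try adding the next paragraph to the piece being built
        candidate = current + '\n\n' + paragraph if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The piece is full: store it and start a new one
            pieces.append(current)
            current = paragraph
    if current:
        pieces.append(current)
    return pieces
```

Each piece can then be passed to `nlp` in turn, for example with `for piece in split_text(book): doc = nlp(piece)`. Sentences are not split across pieces as long as they do not span a blank line.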

In [12]:
charverbfile = 'fortunata-jacinda-verbs.csv'

In [17]:
#Opens the output file (UTF-8 so that accented characters are written correctly)
with open(charverbfile, 'w', encoding='utf-8') as out:
    out.write('Filename, Subject, Verb' + '\n')
    #Sorts the files alphabetically
    for filename in sorted(os.listdir(directory)):
        #Looks for .txt files
        if filename.endswith('.txt'):
            #Opens each file (assumes the text files are UTF-8 encoded)
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as bookfile:
                #Reads in the text in the file
                book = bookfile.read()
                #NLP parse of the text
                doc = nlp(book)
                #Noun chunks are the part of the spaCy dependency parse that we need
                for chunk in doc.noun_chunks:
                    #If the dependency relation is 'nsubj' (nominal subject)
                    if chunk.root.dep_ == 'nsubj':
                        #Removes commas from the noun chunk so they don't break the CSV columns
                        cleansubj = chunk.text.replace(',', '')
                        #Write the filename, the noun chunk, the verb, and then a newline character
                        out.write(filename + ', ' + cleansubj + ', ' + chunk.root.head.text + '\n')
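
The cell above builds each CSV row by hand and strips commas from the subject. An alternative, sketched below, is Python's built-in `csv` module, which quotes any field that contains a comma so nothing has to be removed. The helpers `subject_verb_rows` and `write_rows` are hypothetical names. The sketch assumes `doc` is a parsed spaCy document as in the cell above, and it relies only on `doc.noun_chunks`, `chunk.root.dep_`, and `chunk.root.head.text`.

```python
import csv

def subject_verb_rows(doc, filename):
    """Yield (filename, subject, verb) for each noun chunk that is a subject."""
    for chunk in doc.noun_chunks:
        if chunk.root.dep_ == 'nsubj':
            # Collapse line breaks inside the chunk so each row stays on one line
            subject = ' '.join(chunk.text.split())
            yield (filename, subject, chunk.root.head.text)

def write_rows(rows, outpath):
    """Write a header plus the given rows to a CSV file."""
    with open(outpath, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['Filename', 'Subject', 'Verb'])
        writer.writerows(rows)
```

Inside the file loop, the rows for each book could be collected with `subject_verb_rows(doc, filename)` and written once all the files have been parsed. The resulting file has no space after each comma, which spreadsheet programs and `csv.reader` generally handle more predictably.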