# Hist 3368
## Tutorial on Speed

Let's say you want to use spaCy, a resource-intensive software package, to extract named entities from the speeches of Congress.

#### Set up spaCy

In [1]:
import pandas as pd, spacy
from datetime import datetime

In [2]:
nlp = spacy.load('en_core_web_sm')

Getting an error? 

* Please note that to use spaCy on M2 you must go to My Interactive Sessions/JupyterLab and add **source /hpc/applications/python_environments/spacy/bin/activate** to the **“Custom environment settings”** field.


#### Load some data

We're going to load the speeches of Congress.

In [5]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [6]:
congress = pd.read_csv("congress1967-2010.csv")
#congress = pd.read_csv("eighties_data.csv")

In [9]:
cd ~/digital-history

/users/jguldi/digital-history
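
The cells that follow call ner_finder, which is not defined anywhere in this tutorial (it presumably comes from an earlier lesson). Here is a minimal sketch of what such a function might look like. It is an assumption, not the course's actual code. Unlike the two-argument calls below, it takes the loaded pipeline as an explicit third argument, so the sketch does not depend on a global variable.

```python
def ner_finder(text, label, nlp):
    """Return the text of each entity in `text` that carries the given label.

    `nlp` is any callable that returns an object with an `.ents` sequence whose
    items have `.text` and `.label_` attributes, as a loaded spaCy pipeline does.
    This signature is hypothetical; the course's own ner_finder may differ.
    """
    doc = nlp(text)
    # Keep only the entities whose label matches, e.g. 'LAW'.
    return [ent.text for ent in doc.ents if ent.label_ == label]
```

In a notebook where nlp has already been loaded, functools.partial(ner_finder, nlp=nlp) would give the two-argument form used in the rest of this tutorial.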


#### Notice that the Code is slow

Let's apply our entity recognizer, ner_finder, to just a sample: the first 20 speeches.

In [2]:
sample = [ner_finder(speech, 'LAW') for speech in congress_1968['speech'][:20]]
sample

NameError: name 'congress_1968' is not defined

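Notice that the code hangs for a minute. (If you instead see a NameError like the one above, congress_1968 or ner_finder has not been defined in your session yet, and you will need to define it first.) spaCy is computationally intensive: it needs a lot of processing time and memory. Coders have tricks to speed things up. Let's talk about that.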

#### Tracking Speed with time.time()


Many coders like to keep track of how fast different approaches are so that they can choose the speediest approach when they move from small data to big data. Let's do that.  We'll import the *time* module and call

    time.time() 
    
to get the current time in seconds (as a decimal number, so fractions of a second are included). Then we run the same line of code, call time.time() again afterwards, and subtract the start time from the finish time.

Taking the time before and after an operation in this way shows how quick or slow each operation is.
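
Here is a minimal, self-contained sketch of that pattern. The function slow_step is a hypothetical stand-in for a slow operation such as NER, so the example runs without spaCy. It also shows time.perf_counter(), a standard-library clock designed for measuring short intervals.

```python
import time

def slow_step(n):
    """Hypothetical stand-in for a slow operation such as running NER on one speech."""
    return sum(i * i for i in range(n))

start = time.time()            # seconds since the epoch, as a float
result = slow_step(200_000)
finish = time.time()

elapsed = finish - start       # elapsed wall-clock time, in seconds
print(f"time.time(): {elapsed:.4f} seconds")

# time.perf_counter() is a higher-resolution clock meant for timing short intervals.
start = time.perf_counter()
result = slow_step(200_000)
elapsed_precise = time.perf_counter() - start
print(f"time.perf_counter(): {elapsed_precise:.4f} seconds")
```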

Here's the same code you just ran again, with timing instructions around it.

In [None]:
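import time

# Record the time before the operation starts.
start = time.time()

sample = [ner_finder(speech, 'LAW') for speech in congress_1968['speech'][:20]]

# Record the time after it finishes.
finish = time.time()

print(sample)
print()

# Elapsed time in seconds.
finish-start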

Next, let's try a potentially speedier approach. Let's call ner_finder through the pandas .apply() method to search for mentions of laws in just one year.

Again, we'll run the sample code on a tiny sample.  Again, we'll keep track of how long it takes.  


#### Speeding things up with .apply()

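To speed things up, we can try the pandas .apply() method, which hands every row of a column to the same function in a single call. Note that .apply() is not truly "parallel" processing: pandas still works through the rows one at a time on a single processor, so any gain over a list comprehension is usually small.

We'll use a 'lambda' function, which lets us write a short, unnamed function after "lambda x" and "apply" it to every row in the dataframe. A lambda is only a compact way of writing a function; it does not by itself make the code run in parallel.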

Note these two elements of the grammar.

    .apply()
    lambda x: [function to be applied]
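
To make this grammar concrete, here is a small sketch on a made-up pandas Series. The function count_words and the three sample sentences are hypothetical stand-ins for ner_finder and the real speeches. The sketch shows that .apply() with a lambda produces the same values as the equivalent list comprehension.

```python
import pandas as pd

def count_words(text):
    """Hypothetical stand-in for a slower function such as ner_finder."""
    return len(text.split())

# Made-up sample data, for illustration only.
speeches = pd.Series([
    "Mr. Speaker, I rise today",
    "The bill is passed",
    "I yield back",
])

# .apply() calls the function once per element and returns a new Series.
with_apply = speeches.apply(lambda x: count_words(x))

# The equivalent list comprehension returns a plain Python list.
with_comprehension = [count_words(x) for x in speeches]

print(with_apply.tolist())
print(with_comprehension)
```

Both approaches give the same values. The .apply() result is a pandas Series, which is convenient when you want to store it as a new column.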


Here's a tutorial about using .apply().

In [None]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/smPLY_5gVv4" title="YouTube video player"></iframe>')




And here's some code using .apply() with ner_finder() to search for the laws mentioned in the first five records of the Dallas minutes.

Note that we are also using *time.time()* to take the time in seconds before and after running the function, so that we can compare the speed of the .apply() method with that of the similar list-comprehension code above.

**This may still take a minute.** NER is slow whichever way you call it, so expect .apply() to be at best modestly faster than the list comprehension.

*Note: You will see a pink warning label. It isn't an error, and the data is still running.*

In [None]:
start = time.time()

sample2 = dallas_minutes['Text'][:5].apply(lambda x: ner_finder(x, 'LAW'))

finish = time.time()

print(sample2)
print()

finish-start

The winner is... the .apply() method, faster by a hair! (*NOTE: Your mileage may vary, and the comparison is only fair when both approaches run on the same texts.*)
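
A single timing can be thrown off by whatever else the machine is doing, which is one reason mileage varies. Below is a minimal sketch of a steadier comparison, using the standard library's timeit module to repeat each approach several times. The function fake_ner and the sample texts are hypothetical stand-ins so that the sketch runs without spaCy, and map() stands in for .apply() so that it runs without pandas.

```python
import timeit

def fake_ner(text):
    """Hypothetical stand-in for ner_finder: keeps the capitalized words."""
    return [word for word in text.split() if word.istitle()]

# Made-up sample data, for illustration only.
texts = ["The Voting Rights Act passed the House"] * 1_000

def with_comprehension():
    return [fake_ner(t) for t in texts]

def with_map():
    return list(map(fake_ner, texts))

# Run each approach 5 times (20 calls per run) and keep the best time,
# which is the one least affected by other activity on the machine.
best_comprehension = min(timeit.repeat(with_comprehension, number=20, repeat=5))
best_map = min(timeit.repeat(with_map, number=20, repeat=5))

print(f"list comprehension: {best_comprehension:.4f} s")
print(f"map():              {best_map:.4f} s")
```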

Let's run it on a slightly larger sample of text -- the whole year 2019.  

***We chose the faster method on purpose, but NER is a slow process. This process takes about 30 minutes in my session. Get a cup of tea.***

*You can also limit the amount of text you're working with by using square brackets, e.g. dallas_minutes_year1['Text'][:100]*
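
Before committing to a long run, it can help to time a small slice and scale the result up. Here is a rough sketch of that arithmetic, assuming every row costs about the same (which real texts of varying length will not quite satisfy). The function name and the numbers are illustrative only, not measurements from this notebook.

```python
def estimate_total_seconds(sample_seconds, sample_rows, total_rows):
    """Scale a measured sample time up to the full dataset, assuming a constant cost per row."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_seconds / sample_rows * total_rows

# Illustrative numbers only: suppose 5 rows took 9 seconds and the full set has 1,000 rows.
estimate = estimate_total_seconds(sample_seconds=9.0, sample_rows=5, total_rows=1_000)
print(f"Estimated run time: {estimate / 60:.0f} minutes")
```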

In [None]:
start = time.time()


dallas_minutes_year1['Laws'] = dallas_minutes_year1['Text'].apply(lambda x: ner_finder(x, 'LAW'))

finish = time.time()
print(finish-start)


dallas_minutes_year1[:5]

In [None]:
# Here's the code for applying nlp to the entire archive of Dallas City Council minutes, not just one year.  
#dallas_minutes['Laws'] = dallas_minutes['Text'].apply(lambda x: ner_finder(x, 'LAW'))
#dallas_minutes                                                                  