# Basic NLP
## by Damian Trilling

# Preparation
We assume that you have NLTK (Bird, Loper, & Klein, 2009) installed. If you use Anaconda, you have it anyway. You also have to download some data for some specific NLTK modules. Download them by executing the following cell (you only have to do this once):

Bird, S., Loper, E., & Klein, E. (2009). *Natural language processing with Python*. Sebastopol, CA: O'Reilly.

In [None]:
import nltk
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Warming up

Think back to what you already know about Python. Use the cell below to do the following task:
- Create a list that contains strings with numbers inside, something like ["12","42","11"]
- Write a loop that converts the strings to integers, prints them, and adds them to a new list
- Modify your loop in such a way that it multiplies the numbers by two before adding them to the new list.

# Let's get started!

## Import modules
Before we start, let's import some modules that we need today. It is good practice to do so at the beginning of a script, so we'll do it right now and not later when we need them. The benefit is that you immediately see if something goes wrong (for instance, because the module is not installed).

In [7]:
import csv
import re
from nltk.sentiment import vader
from nltk.corpus import stopwords
import nltk

## Download the data
We will use a dataset by Schumacher et al. (2016). From the abstract:
> This paper presents EUSpeech, a new dataset of 18,403 speeches from EU leaders (i.e., heads of government in 10 member states, EU commissioners, party leaders in the European Parliament, and ECB and IMF leaders) from 2007 to 2015. These speeches vary in sentiment, topics and ideology, allowing for fine-grained, over-time comparison of representation in the EU. The member states we included are Czech Republic, France, Germany, Greece, Netherlands, Italy, Spain, United Kingdom, Poland and Portugal.

Schumacher, G., Schoonvelde, M., Dahiya, T., Traber, D., & de Vries, E. (2016). *EUSpeech: a New Dataset of EU Elite Speeches*. [doi:10.7910/DVN/XPCVEI](http://dx.doi.org/10.7910/DVN/XPCVEI)

Download and unpack the following file:
```
speeches_csv.tar.gz
```

In the .tar.gz file, you find a .zip file. Extract the whole folder to your home directory.
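
If you prefer to unpack the archive from within Python rather than with a file manager, the standard library can do it. The following is only a minimal sketch: the helper name `unpack_nested` is made up for illustration, and the example call assumes that `speeches_csv.tar.gz` lies in the current working directory.

```python
import tarfile
import zipfile
from pathlib import Path

def unpack_nested(archive_path, target_dir="."):
    """Extract a .tar.gz archive and then any .zip files that were inside it."""
    target = Path(target_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        member_names = tar.getnames()   # remember what the archive contains
        tar.extractall(target)
    for name in member_names:
        if name.lower().endswith(".zip"):
            zip_path = target / name
            with zipfile.ZipFile(zip_path) as zf:
                zf.extractall(zip_path.parent)   # unpack next to the .zip file

# Example call (assumes the download lies in the current directory):
# unpack_nested("speeches_csv.tar.gz")
```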

Let's have a look at the files we downloaded. The following cell does this (assuming that you work on Linux or macOS *and* that you saved the files in the same directory where you started your notebook server and where this notebook lies).

In [4]:
%ls Cleaned_Speeches/

Speeches_ALDE_Cleaned.csv       Speeches_GR_Cleaned.csv
Speeches_CZ_Cleaned.csv         Speeches_IMF_Cleaned.csv
Speeches_DE_Cleaned.csv         Speeches_IT_Cleaned.csv
Speeches_ECB_Cleaned.csv        Speeches_NL_Cleaned.csv
Speeches_EC_Cleaned.csv         Speeches_PL_Cleaned.csv
Speeches_ECR_Cleaned.csv        Speeches_PO_Cleaned.csv
Speeches_EP_Cleaned.csv         Speeches_SP_Cleaned.csv
Speeches_EUCouncil_Cleaned.csv  Speeches_UK_Cleaned.csv
Speeches_FR_Cleaned.csv         Translated/


## Get some idea about the data
Let us inspect the data, looking only at the first row for now:

In [8]:
with open("Cleaned_Speeches/Speeches_NL_Cleaned.csv") as fi:
    reader=csv.reader(fi)
    firstrow=next(reader)
    print("It looks like we have",len(firstrow),"columns.")
    print("\nThis is the content:\n")
    print(firstrow)

It looks like we have 8 columns.

This is the content:

["'A roadmap for sustainable recovery'", '23-09-2010', 'Netherlands', 'J.P. Balkenende', '1404', "<p>Ladies and gentlemen,</p><p>It is an honour to be here today to introduce the theme of 'recession and recovery'. If you will permit, I would like to suggest that this afternoon we focus more on recovery than on recession. I think we know enough about the recession side of the story.</p><p>It started with the fall of Lehman Brothers on 15 September 2008. I happened to be here, at the Blouin Creative Leadership Summit, only ten days later. Everyone was talking about the collapse of Lehman. They were shocked and alarmed. But even then we could hardly imagine that its impact would be so dramatic, so historic.</p><p>As we now know, this event triggered a global financial and economic crisis. Governments were forced to give cash injections running into billions to prevent an economic and financial meltdown. When credit dried up and deman

As you can see, we can directly address a specific element from this row (we start counting at zero!). Which one might be most interesting for us? Just **play around** a bit! Note down (on a piece of paper or in a file) what the structure of the dataset looks like!

In [None]:
firstrow[0]
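
To get an overview of all columns at once, one could loop over the row with `enumerate`. This is only a sketch: `preview_columns` is a made-up helper, and `sample_row` is a shortened, hand-typed stand-in for `firstrow` so that the cell runs on its own.

```python
# Shortened stand-in for `firstrow` (values abbreviated for illustration)
sample_row = ["'A roadmap for sustainable recovery'", "23-09-2010", "Netherlands",
              "J.P. Balkenende", "1404", "<p>Ladies and gentlemen,</p><p>It is an honour ..."]

def preview_columns(row, width=40):
    """Return one line per column: its index plus the first `width` characters."""
    lines = []
    for index, value in enumerate(row):
        lines.append(f"{index}: {value[:width]}")
    return lines

for line in preview_columns(sample_row):
    print(line)
```

With the real data, you would call `preview_columns(firstrow)` instead.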

## Let's start!
Now that we know what the data look like, we can *loop* over all rows in the file in order to retrieve a list of all speeches:

In [11]:
with open("Cleaned_Speeches/Speeches_NL_Cleaned.csv") as fi:
    reader=csv.reader(fi)
    speeches=[]
    for row in reader:
        speeches.append(row[5])

In [12]:
len(speeches)

391

We'll clean up a bit. You don't know the technique used here yet (it's called 'list comprehension'), and I can explain it to you later. It is basically a short form of writing a for-loop. We store the result in a new list, `speeches_nl`, so that the original `speeches` stay available.

In [13]:
from string import punctuation

speeches_nl=[speech.replace('<p>',' ').replace('</p>',' ') for speech in speeches]   #remove HTML tags
speeches_nl=["".join([l for l in speech if l not in punctuation]) for speech in speeches_nl]  #remove punctuation
speeches_nl=[speech.lower() for speech in speeches_nl]  # convert to lower case
speeches_nl=[" ".join(speech.split()) for speech in speeches_nl]   # remove double spaces by splitting the strings into words and joining these words again

Let's look at the first speech to check everything's fine.

In [None]:
speeches_nl[0]
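
To see why a list comprehension is just a compact for-loop, compare the two versions in this small sketch. The word list is made up for illustration; both versions produce the same result.

```python
words = ["Ladies", "and", "Gentlemen"]

# Version 1: an explicit for-loop that fills a new list
lowered_loop = []
for word in words:
    lowered_loop.append(word.lower())

# Version 2: the same thing as a list comprehension
lowered_comprehension = [word.lower() for word in words]

print(lowered_loop == lowered_comprehension)
```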

# Sentiment analysis
We will do our first analysis, using the algorithm by Hutto and Gilbert (2014). It is already implemented in NLTK, so we can run the analysis with just two lines of code! 
The only thing we have to care about is providing the input data and storing the output.

Hutto, C.J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. *Eighth International AAAI Conference on Weblogs and Social Media.*

In [14]:
senti=vader.SentimentIntensityAnalyzer()

In [15]:
senti.polarity_scores(speeches[0])

{'neg': 0.061, 'neu': 0.78, 'pos': 0.158, 'compound': 0.9993}

In [16]:
senti.polarity_scores(speeches[1])

{'neg': 0.058, 'neu': 0.796, 'pos': 0.146, 'compound': 0.9977}
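
The `compound` value summarizes the result on a scale from -1 (most negative) to +1 (most positive). A commonly used convention, suggested by the VADER authors, treats values of at least 0.05 as positive and values of at most -0.05 as negative. The sketch below applies that convention to a dictionary like the ones shown above; `label_sentiment` is a made-up helper, and the threshold is a convention, not something NLTK enforces.

```python
def label_sentiment(scores, threshold=0.05):
    """Turn a VADER-style score dictionary into a coarse label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the first speech above
example = {'neg': 0.061, 'neu': 0.78, 'pos': 0.158, 'compound': 0.9993}
print(label_sentiment(example))
```

For long texts such as speeches, the compound score tends to end up close to -1 or +1, so the separate `pos` and `neg` shares are often more informative.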

So, how could we apply this to the whole dataset? With a loop! I'll give you a basic example with a lot of possibilities for improvement:

In [17]:
with open("Cleaned_Speeches/Speeches_NL_Cleaned.csv") as fi, open('myoutput.csv',mode='w',newline='') as fo:   # newline='' avoids blank rows on Windows
    reader=csv.reader(fi)
    writer=csv.writer(fo)
    for row in reader:
        speech=row[5]
        sentiment = senti.polarity_scores(speech)
        writer.writerow([speech[:100],sentiment['pos']])

In [18]:
!head myoutput.csv

"<p>Ladies and gentlemen,</p><p>It is an honour to be here today to introduce the theme of 'recession",0.158
"<p>Ladies and gentlemen,</p><p>Today we are looking back at the ten years since the Millennium Devel",0.146
"<p>Monsieur le Premier ministre, Mesdames et Messieurs les ambassadeurs, très chers vétérans, Mesdam",0.031
"<p>Prime Minister Harper, Excellencies, honoured veterans, ladies and gentlemen, </p><p>He was seven",0.193
"<p>Ni hao pengyoumen, </p><p> Your Excellency, Professor Qin, </p><p> Professor Poppema, </p><p> Lad",0.19
"<p>Honoured veterans, ladies and gentlemen,</p><p>Today, the official funeral takes place of the 46 ",0.206
"<p>President Han-Joong Kim, Dean Tae-Young Lee, Your Excellencies, ladies and gentlemen,</p><p>It is",0.189
"<p>Ambassador Stanczyk, Your Excellencies, ladies and gentlemen,</p><p>Saturday's disaster in Smolen",0.197
"<p>Ladies and gentlemen,</p><p>Last night, the whole world was treated to the marvellous Opening Cer",0.215
"<

## It's your turn!
Your task: write better code that
- outputs more info
- preprocesses the string (remove p-tags, for example)

If you feel a bit more adventurous: 
- Add an if-statement to filter out the French speeches! Modify your script by including a structure like
```
if APPROPRIATECOLUMN=='en':
    DO SOMETHING
```

# Regular Expressions
There are a lot of online tutorials explaining regular expressions (and you can read up in my book or on the slides), so I won't go into detail here how to construct one. But let's look at a prototypical use case: counting how often something is mentioned in texts. Let's start by examining one single speech:

In [19]:
speeches[0]

"<p>Ladies and gentlemen,</p><p>It is an honour to be here today to introduce the theme of 'recession and recovery'. If you will permit, I would like to suggest that this afternoon we focus more on recovery than on recession. I think we know enough about the recession side of the story.</p><p>It started with the fall of Lehman Brothers on 15 September 2008. I happened to be here, at the Blouin Creative Leadership Summit, only ten days later. Everyone was talking about the collapse of Lehman. They were shocked and alarmed. But even then we could hardly imagine that its impact would be so dramatic, so historic.</p><p>As we now know, this event triggered a global financial and economic crisis. Governments were forced to give cash injections running into billions to prevent an economic and financial meltdown. When credit dried up and demand fell, businesses struggled to keep their heads above water, and many went under. Ordinary people's jobs, homes and pensions were at risk.</p><p>The aft

Then we can get a list with all substrings that match the regexp. And, as with any list, we can calculate its length!

In [20]:
re.findall(r"[Ee]conomy|[Ee]conomic",speeches[0])

['economic',
 'economic',
 'economic',
 'economic',
 'economic',
 'economic',
 'economic',
 'economic',
 'economic',
 'economy']

In [21]:
len(re.findall(r"[Ee]conomy|[Ee]conomic",speeches[0]))

10
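
A pattern like the one above can also be written with a group and the `re.IGNORECASE` flag, and word boundaries (`\b`) control whether longer words such as 'economics' are counted as well. The sketch below uses a made-up sentence rather than a real speech; note that, unlike `[Ee]`, the flag would also match spellings such as 'ECONOMY'.

```python
import re

sample = "The economy is recovering. Economic growth helps the economy, as economics teaches us."

# Without word boundaries: also matches the 'economic' inside 'economics'
loose = re.findall(r"econom(?:y|ic)", sample, flags=re.IGNORECASE)
print(loose)

# With word boundaries: only the whole words 'economy' and 'economic'
strict = re.findall(r"\beconom(?:y|ic)\b", sample, flags=re.IGNORECASE)
print(len(strict))
```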

## It's your turn!
Let's write a loop to count the number of references to the economy per speech and output it to a csv file!

# NLP
As a prerequisite for many techniques we want to use tomorrow, we want to clean up the text. Typical steps involve:
- converting to lowercase
- removing punctuation
- removing stopwords
- stemming
- parsing (= determining the grammatical function of words).

Of course, depending on the task at hand, we don't want to do all of them, and the order matters, too. If we want to parse a sentence, we had better still have a sentence (and not have removed stopwords and punctuation already).
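
The first three steps can be chained in a few lines of plain Python. This is a minimal sketch under stated assumptions: `basic_clean` is a made-up helper, and `TOY_STOPWORDS` is a tiny hand-picked list for illustration (NLTK's English stopword list is much longer).

```python
import string

# Tiny illustrative stopword list (an assumption, not NLTK's list)
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to", "we"}

def basic_clean(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip ASCII punctuation, then drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in stopwords]
    return " ".join(words)

print(basic_clean("We know enough: the recession is over!"))
```

The order matters here as well: if the stopwords were removed before lowercasing, 'We' would survive, because the stopword list contains only lowercase words.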

Below, you find some examples:

## Stopword removal

In [22]:
stops=set(stopwords.words('english'))   # build the stopword set once; looking up words in a set is fast
cleanedspeeches=[]
for speech in speeches:
    # convert to lower case and strip . , " ' ?
    speech=speech.lower().replace(".","").replace(",","").replace('"','').replace("'","").replace("?","")
    words=speech.split()
    words = [w for w in words if w not in stops]
    speechnew = " ".join(words)
    cleanedspeeches.append(speechnew)

In [23]:
cleanedspeeches

['<p>ladies gentlemen</p><p>it honour today introduce theme recession recovery permit would like suggest afternoon focus recovery recession think know enough recession side story</p><p>it started fall lehman brothers 15 september 2008 happened blouin creative leadership summit ten days later everyone talking collapse lehman shocked alarmed even could hardly imagine impact would dramatic historic</p><p>as know event triggered global financial economic crisis governments forced give cash injections running billions prevent economic financial meltdown credit dried demand fell businesses struggled keep heads water many went ordinary peoples jobs homes pensions risk</p><p>the aftermath high unemployment us europe: around ten per cent many countries public finances thrown completely balance</p><p>the question is: road economic recovery look like road sustainable balanced recovery firmly add sustainable balanced recovery prevent future imbalances economic system sustainable balanced recove

## Stemming
Stemming can be useful to prevent 'economics', 'economic', and 'economy' from being treated as different concepts by, for instance, a topic model. In practice, however, standard stemming algorithms are far from perfect:

In [24]:
stemmer = nltk.stem.snowball.EnglishStemmer()
# stem every word of the stopword-free speeches created in the previous section
speeches_nl_stemmed = [" ".join([stemmer.stem(word) for word in speech.split()]) for speech in cleanedspeeches]
speeches_nl_stemmed[0][:500]

## Parsing and retaining only nouns and adjectives
Depending on the specific use case at hand, one might discover that some parts of speech (POS) are more informative than others. We could, for instance, create a topic model based on only the nouns and adjectives in a text, disregarding everything else. Look at the NLTK documentation to find out what each code means (e.g., 'NN' is 'noun').

In [None]:
speechesnounsadj=[]
for speech in speeches:
    tokens = nltk.word_tokenize(speech)
    tagged = nltk.pos_tag(tokens)
    cleanspeech = ""
    for element in tagged:
        if element[1] in ('NN','NNP','JJ'):
            cleanspeech=cleanspeech+element[0]+" "
    speechesnounsadj.append(cleanspeech)

In [None]:
speechesnounsadj
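
The loop above keeps only the tags 'NN', 'NNP', and 'JJ', so plural nouns ('NNS', 'NNPS') and comparative or superlative adjectives ('JJR', 'JJS') are dropped. If those should be kept as well, one option is to match on the tag prefix. The sketch below works on hand-typed (token, tag) pairs in the format that `nltk.pos_tag` returns; the pairs and the helper name are made up for illustration and were not produced by the tagger.

```python
# Hand-typed (token, tag) pairs, mimicking the output format of nltk.pos_tag
tagged = [("sustainable", "JJ"), ("economies", "NNS"), ("recover", "VBP"),
          ("faster", "RBR"), ("Europe", "NNP"), ("stronger", "JJR")]

def keep_nouns_and_adjectives(tagged_tokens):
    """Keep tokens whose Penn Treebank tag starts with NN (nouns) or JJ (adjectives)."""
    kept = [token for token, tag in tagged_tokens if tag.startswith(("NN", "JJ"))]
    return " ".join(kept)

print(keep_nouns_and_adjectives(tagged))
```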