<font color="grey">Qi Yu (University of Konstanz)  |  ZHAW, March 03-04, 2022</font>

# 1. Basics

## 1.1 Install NLTK

In [None]:
!pip install nltk

## 1.2 Import NLTK

In [None]:
import nltk

## 1.3 Install additional components

**1. The following line opens a separate window in which you can choose additional components to install:**

In [None]:
nltk.download()

**2. Alternatively, you can install a specific component by passing its name. E.g., to download the Brown corpus:**

In [None]:
nltk.download("brown")
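
A downloaded component stays on disk, so it only needs to be fetched once. The following is a minimal sketch of checking for a component before downloading it, assuming the usual NLTK data layout in which the Brown corpus is found under `corpora/brown`. The helper names `is_installed` and `ensure_resource` are hypothetical and not part of NLTK.

```python
import nltk

def is_installed(resource_path):
    """Return True if NLTK can locate the resource in its data directories."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.data.find raises LookupError when the resource is missing
        return False

def ensure_resource(resource_path, package_id):
    """Download package_id only if resource_path is not installed yet."""
    if not is_installed(resource_path):
        nltk.download(package_id)

# Example call (left commented out because it may trigger a download):
# ensure_resource("corpora/brown", "brown")
```

The resource path and the package name differ: the path includes the category folder (here `corpora`), while `nltk.download` takes only the package name.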

# 2. Accessing corpora provided by NLTK

**Use the submodule ```nltk.corpus``` to access corpora:**

In [None]:
from nltk.corpus import brown
brown_words = brown.words()
print(brown_words)
print("Total number of tokens:", len(brown_words))

# 3. Processing own text data with NLTK

**We will work with the file ```peterpan_cleaned.txt```, which you created in the exercise ```peterpan.ipynb```.**

In [None]:
# The with-statement closes the file automatically, even if reading fails
with open('peterpan_cleaned.txt', 'r') as f:
    text = f.readlines()

In [None]:
text

**Remove line breaks by using the method ```strip()```.**

**```strip()``` removes leading and trailing characters from a string. If no argument is passed to it, it removes leading and trailing whitespace, which includes spaces, tabs, and line breaks.**
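
A small sketch of how ```strip()``` behaves with and without an argument. The sample strings are made up for illustration.

```python
raw_line = "  All children, except one, grow up.\n"

# Without an argument, strip() removes whitespace (spaces, tabs, line breaks)
# from both ends of the string.
stripped = raw_line.strip()

# With an argument, strip() removes any of the given characters from both ends.
no_dots = "...Wendy...".strip(".")

# Whitespace inside the string is never touched.
inner = " Peter  Pan ".strip()

print(repr(stripped))
print(repr(no_dots))
print(repr(inner))
```

Because a line that contains only whitespace becomes an empty string after ```strip()```, a check such as `if line:` is enough to skip blank lines.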

In [None]:
lines_cleaned = []
for line in text:
    line = line.strip()
    if line:
        lines_cleaned.append(line)

In [None]:
lines_cleaned

In [None]:
text = " ".join(lines_cleaned)
text

## 3.1 Tokenization

**1. Tokenize sentences:**

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
sentences = sent_tokenize(text)

In [None]:
sentences

In [None]:
len(sentences)
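
To see why a dedicated sentence tokenizer is useful, compare it with a naive approach. The sketch below splits a made-up sample text after every `.`, `!` or `?` using only the standard library, and the abbreviation `Mrs.` is wrongly treated as a sentence end. ```sent_tokenize``` is designed to handle common abbreviations like this one.

```python
import re

# Made-up sample text containing an abbreviation
sample = "Mrs. Darling heard a noise. She went to the window."

# Naive approach: split at whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

for sentence in naive_sentences:
    print(repr(sentence))

print("Number of 'sentences':", len(naive_sentences))
```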

**2. Tokenize words:**

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
text_tokenized = word_tokenize(text)

In [None]:
text_tokenized
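
Word tokenization does more than split at spaces. The following sketch compares ```str.split()``` with ```TreebankWordTokenizer```, a rule-based tokenizer from the same ```nltk.tokenize``` module that needs no additional downloads. It is used here as a stand-in, so its output may differ in details from that of ```word_tokenize```. The sample sentence is made up.

```python
from nltk.tokenize import TreebankWordTokenizer

sample = "Don't go, Peter!"

# Splitting at whitespace leaves punctuation attached to the words
whitespace_tokens = sample.split()

# The Treebank tokenizer separates punctuation and splits contractions
treebank_tokens = TreebankWordTokenizer().tokenize(sample)

print(whitespace_tokens)
print(treebank_tokens)
```

Note that punctuation marks become tokens of their own, which matters later when counting or filtering tokens.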

## 3.2 POS-tagging and lemmatizing

**NLTK provides the class ```WordNetLemmatizer``` for lemmatization.**

**Attention: ```WordNetLemmatizer``` needs the POS tag of a token to lemmatize it correctly (without one, it treats every token as a noun). So always POS-tag first, then lemmatize!**

In [None]:
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

In [None]:
pos_tagged = pos_tag(text_tokenized)
pos_tagged

In [None]:
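lemmatizer = WordNetLemmatizer()

for word, pos in pos_tagged:
    # pos_tag returns Penn Treebank tags; WordNet expects 'a', 'r', 'n' or 'v'
    pos = pos[0].lower()
    if pos == 'j':
        pos = 'a'  # Penn adjective tags (JJ, JJR, JJS) start with 'J'
    elif pos not in ['r', 'n', 'v']:
        pos = 'n'  # fall back to noun, the lemmatizer's default

    word_lemmatized = lemmatizer.lemmatize(word, pos)
    print(word, "-->", word_lemmatized)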

## 3.3 Stemming

**Here we will use the Snowball Stemmer:**

In [None]:
from nltk.stem import SnowballStemmer

In [None]:
stemmer = SnowballStemmer(language='english')

In [None]:
for word in text_tokenized: 
    word_stemmed = stemmer.stem(word)
    print(word, "-->", word_stemmed)
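
Unlike lemmatizing, stemming removes suffixes by rule and needs neither POS tags nor a dictionary, so the resulting stems are not always real words. A short sketch on a few made-up example words, which also lists the languages supported by the Snowball stemmer:

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language="english")

# A few example words; inflected forms of one word share a stem
words = ["running", "runs", "cats", "flying"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "-->", stem)

# Languages for which a Snowball stemmer is available
print(SnowballStemmer.languages)
```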

## 3.4 Removing stop words and punctuation

**Load the list of stop words:**

In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = stopwords.words('english')
stop_words

**Load the punctuation characters, then remove stop words and punctuation from the tokens:**

In [None]:
import string

In [None]:
punct = string.punctuation
punct

In [None]:
text_cleaned = []
for token in text_tokenized:
    # The stop word list is lowercase, so compare the lowercased token
    if token.lower() not in stop_words and token not in punct:
        text_cleaned.append(token)

In [None]:
text_cleaned
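
Two details are easy to miss when filtering. First, the stop word list is lowercase, so a capitalized token such as `The` only matches after lowercasing. Second, ```string.punctuation``` is a single string, so a multi-character token such as `...` is not found in it. The sketch below handles both cases, using a tiny hand-written stop word set and token list as stand-ins for the real data; the helper `is_punctuation` is hypothetical.

```python
import string

# Hand-written stand-ins for the stop word list and the tokenizer output
stop_words = {"the", "was", "a"}
tokens = ["The", "window", "was", "open", "...", "``", "Peter", "''", "!"]

# A set allows fast membership tests on single characters
punct = set(string.punctuation)

def is_punctuation(token):
    """True if every character of the token is a punctuation character."""
    return all(char in punct for char in token)

cleaned = [
    token for token in tokens
    if token.lower() not in stop_words and not is_punctuation(token)
]
print(cleaned)
```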

# 4. Parsing

**We will use the Stanford CoreNLP API in NLTK to do constituency parsing and dependency parsing (See the general information [here](https://stanfordnlp.github.io/CoreNLP/other-languages.html#python)).**

**Stanford CoreNLP is a Java library, so it requires [Java](https://www.java.com/en/download/) to be installed on your computer.**

**To start the Stanford CoreNLP server, please follow the steps below:**

1. Download Stanford CoreNLP [here](https://stanfordnlp.github.io/CoreNLP/download.html).

1. Open a new Command Prompt window (Windows) / Terminal window (Linux/MacOS), and execute the following commands:
    1. Change the working directory to the folder where Stanford CoreNLP is located by executing the following command:
    
    ```cd PATH_OF_STANFORD_NLP``` (Please change ```PATH_OF_STANFORD_NLP``` to your own path.)

    2. Start the server by executing the following command in the Command Prompt (Windows) / Terminal (Linux/MacOS):
    
    ```java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos,lemma,ner,parse,depparse -status_port 9000 -port 9000 -timeout 15000 &```. 
    
    If this step is executed successfully, you will see the line ```[main] INFO CoreNLP - StanfordCoreNLPServer listening at /[0:0:0:0:0:0:0:0]:9000```.

**Once the steps above are successfully done, you can start using the Stanford CoreNLP API in NLTK (see cells below).**

**See also the official instructions [here](https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK) (including instructions for parsing languages other than English).**
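
Before running the parsing cells, it can help to verify from within the notebook that the server started in the steps above is reachable. The following is a minimal sketch using only the standard library, assuming the server listens on `http://localhost:9000` as in the command above. The function name `server_is_up` is hypothetical, and the check only confirms that some HTTP server answers at that address, not that it is CoreNLP.

```python
import urllib.error
import urllib.request

def server_is_up(url="http://localhost:9000", timeout=2):
    """Return True if an HTTP server answers at url, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status
        return True
    except OSError:
        # Covers URLError, refused connections and timeouts
        return False

print("Server reachable:", server_is_up())
```

If this prints `False`, check that the terminal window with the server is still open and that the port matches the one in the start command.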

## 4.1 Constituency parsing

In [None]:
from nltk.parse import CoreNLPParser

In [None]:
parser = CoreNLPParser(url='http://localhost:9000')

**For demonstration purposes, we will use the following short sentence as an example:**

In [None]:
demo_sent = sentences[60]
demo_sent

**Get parse tree of the sentence:**

When executing the cell that displays the parse tree, you may encounter the error message ```ModuleNotFoundError: No module named 'svgling'```. To fix this, install the module by running the following cell:

In [None]:
!pip install svgling

In [None]:
next(parser.raw_parse(demo_sent))

**Alternatively, you can get the parsing result as a list for further processing:**

In [None]:
list(parser.raw_parse(demo_sent))
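
The parser returns ```nltk.Tree``` objects, which offer methods for inspecting the structure. The sketch below builds a tree from a hand-written bracketed string instead of real server output, so it runs without the CoreNLP server; the sentence and its analysis are made up for illustration.

```python
from nltk import Tree

# Hand-written bracketed parse, used here in place of real server output
tree = Tree.fromstring(
    "(ROOT (S (NP (NNP Peter)) (VP (VBD flew) (ADVP (RB away)))))"
)

print(tree.label())   # label of the root node
print(tree.leaves())  # the tokens of the sentence
print(tree.pos())     # (token, POS tag) pairs

# Collect the text of all noun phrases (subtrees labeled NP)
noun_phrases = [
    " ".join(subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)
```

The same methods can be applied to the tree obtained from `next(parser.raw_parse(demo_sent))`.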

## 4.2 Dependency parsing

In [None]:
from nltk.parse.corenlp import CoreNLPDependencyParser

In [None]:
dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')

In [None]:
parses = dep_parser.raw_parse(demo_sent)

In [None]:
for parse in list(parses):
    for governor, dep, dependent in parse.triples():
        print("Head: ", governor, "\tDependency Relation: ", dep, "\tDependent: ", dependent)
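
Each triple has the form `((head word, head tag), relation, (dependent word, dependent tag))`, which makes it easy to filter for specific relations. The sketch below extracts subject-verb pairs from hand-written triples of that shape, so it runs without the CoreNLP server; the triples are made up for illustration.

```python
# Hand-written triples in the shape that parse.triples() yields
triples = [
    (("flew", "VBD"), "nsubj", ("Peter", "NNP")),
    (("flew", "VBD"), "advmod", ("away", "RB")),
    (("flew", "VBD"), "punct", (".", ".")),
]

# Keep only subject relations and unpack the (word, tag) pairs
subjects = [
    (dep_word, head_word)
    for (head_word, head_tag), relation, (dep_word, dep_tag) in triples
    if relation == "nsubj"
]
print(subjects)
```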