# POS Chunking
**1. Create a chunker that detects noun-phrases (NPs) and lists the NPs in the text below.**

- Both [NLTK](https://www.nltk.org/book/ch07.html) and [spaCy](https://spacy.io/api/matcher) support chunking.
- Look up RegEx parsing for NLTK and the document object for spaCy.
- Make use of what you've learned about tokenization.

In [None]:
text = "The language model predicted the next word. It was a very nice word!"
# TODO: set up a pos tagger and a chunker.
# Output: a list of all tokens, grouped as noun-phrases where applicable
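
As a starting point, here is a minimal sketch of regex-based NP chunking with NLTK's `RegexpParser`. The POS tags are written by hand for the first sentence only (an assumption, so no tagger model is needed); in your solution they should come from a real tagger, whose output may differ. The NP pattern is one possible choice, not the only valid one.

```python
from nltk import RegexpParser

# Hand-written Penn Treebank tags (an assumption; a real tagger may differ).
tagged = [
    ("The", "DT"), ("language", "NN"), ("model", "NN"),
    ("predicted", "VBD"),
    ("the", "DT"), ("next", "JJ"), ("word", "NN"),
    (".", "."),
]

# NP: optional determiner, any number of adjectives, one or more nouns.
np_parser = RegexpParser("NP: {<DT>?<JJ.*>*<NN.*>+}")
tree = np_parser.parse(tagged)

# Collect the words under every NP subtree.
noun_phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Tokens that are not part of an NP stay as plain `(word, tag)` leaves in `tree`, so iterating over `tree` gives the grouped token list the task asks for.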

**2. Modify the chunker to handle verb phrases (VPs) as well.**
- This can be done by using a RegEx parser in NLTK or using a spaCy Matcher.
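
One way to approach this in NLTK is a cascaded grammar, where a later rule refers to chunks built by an earlier one. The sketch below assumes hand-written tags and a single, deliberately narrow VP rule (a verb followed by an NP); it is an illustration, not a complete answer.

```python
from nltk import RegexpParser

# Hand-written Penn Treebank tags (an assumption; a real tagger may differ).
tagged = [
    ("The", "DT"), ("language", "NN"), ("model", "NN"),
    ("predicted", "VBD"),
    ("the", "DT"), ("next", "JJ"), ("word", "NN"),
    (".", "."),
]

# Rules are applied in order: NPs are chunked first, then VPs are
# built from a verb followed by an already-chunked NP.
cascaded_grammar = r"""
    NP: {<DT>?<JJ.*>*<NN.*>+}
    VP: {<VB.*><NP>}
"""
vp_parser = RegexpParser(cascaded_grammar)
vp_tree = vp_parser.parse(tagged)

verb_phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in vp_tree.subtrees(lambda t: t.label() == "VP")
]
print(verb_phrases)
```

Note that the object NP is absorbed into the VP, so it appears nested inside the VP subtree rather than at the top level.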

In [None]:
# TODO: set up grammars to chunk VPs

grammar = """
    VP: {MYGRAMMAR}
"""

**3. Verb-phrases (VPs) can be defined by many different grammatical rules. Give four examples.**
- Hint: see Context-Free Grammars, chapter 8 of the NLTK book.

Your answer here!

**4. After these applications, do you find chunking to be beneficial in the context of language modeling and next-word prediction? Why or why not?**

Your answer here!

___

# Dependency Parsing

**1. Use spaCy to inspect and visualize the dependency tree of the text provided below.**
- Optional addition: visualize the dependencies as a graph using `networkx`.

In [None]:
text = "The language model predicted the next word"
# TODO: use spaCy and displaCy to visualize the dependency tree

**2. What is the root of the sentence? Try to spot it yourself first, but produce the answer with code.**

In [None]:
# TODO: implement a function to find the root of the document
# Return both the word and its POS tag
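
A minimal sketch of the idea: in a spaCy parse, the root token carries the dependency label `ROOT`. The function below only relies on the `text`, `pos_`, and `dep_` attributes, so it should work on a spaCy `Doc`; to keep the example runnable without a model, it is shown on stand-in tokens whose labels are assigned by hand (an assumption, not actual parser output).

```python
from types import SimpleNamespace


def find_root(doc):
    """Return (text, pos) of the first token labelled ROOT, or None."""
    for token in doc:
        if token.dep_ == "ROOT":
            return token.text, token.pos_
    return None


# Stand-in tokens with hand-assigned labels (hypothetical parse).
toy_doc = [
    SimpleNamespace(text="The", pos_="DET", dep_="det"),
    SimpleNamespace(text="language", pos_="NOUN", dep_="compound"),
    SimpleNamespace(text="model", pos_="NOUN", dep_="nsubj"),
    SimpleNamespace(text="predicted", pos_="VERB", dep_="ROOT"),
    SimpleNamespace(text="the", pos_="DET", dep_="det"),
    SimpleNamespace(text="next", pos_="ADJ", dep_="amod"),
    SimpleNamespace(text="word", pos_="NOUN", dep_="dobj"),
]

print(find_root(toy_doc))
```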

**3. Find the subject and object of a sentence. Print the results for the sentence above.**

In [None]:
# TODO: implement a function to find the subjects + objects in the document
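
Subjects and objects can be picked out by filtering on dependency labels. The sketch below assumes the label sets `SUBJECT_DEPS` and `OBJECT_DEPS` (hypothetical names; the exact labels depend on the parser and model you use) and again runs on hand-labelled stand-in tokens instead of real spaCy output.

```python
from types import SimpleNamespace

# Assumed label sets; check which labels your model actually emits.
SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj"}
OBJECT_DEPS = {"dobj", "obj", "iobj", "pobj"}


def find_subjects_objects(doc):
    """Return two lists: subject tokens and object tokens (as text)."""
    subjects = [t.text for t in doc if t.dep_ in SUBJECT_DEPS]
    objects = [t.text for t in doc if t.dep_ in OBJECT_DEPS]
    return subjects, objects


# Stand-in tokens with hand-assigned labels (hypothetical parse).
toy_parse = [
    SimpleNamespace(text="The", dep_="det"),
    SimpleNamespace(text="language", dep_="compound"),
    SimpleNamespace(text="model", dep_="nsubj"),
    SimpleNamespace(text="predicted", dep_="ROOT"),
    SimpleNamespace(text="the", dep_="det"),
    SimpleNamespace(text="next", dep_="amod"),
    SimpleNamespace(text="word", dep_="dobj"),
]

subjects, objects = find_subjects_objects(toy_parse)
print("subjects:", subjects)
print("objects:", objects)
```

This returns only the head word of each phrase; with a real spaCy token you could expand it to the full phrase through the token's subtree.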

**4. How would you use the relationships extracted from dependency parsing in language modeling contexts?**

___

# WordNet

**1. Use WordNet (from NLTK) to create a function that returns all synonyms of a word of your choice. Try it with "language".**

In [None]:
from nltk.corpus import wordnet as wn
# TODO: find synonyms

**2. For the same word, extract at least four additional features from WordNet (such as hyponyms). Describe each category briefly.**

In [None]:
# TODO: expand the function to find more features!

___

# Machine Learning Exercise - A sentiment classifier
- A rule-based approach with SentiWordNet + A machine learning classifier

**1. Building a classifier, or any machine learning application for textual data, takes several steps. For data consisting of (INPUT_TEXT, LABEL) pairs, list the steps of a typical classification pipeline.**

Your answer here!

**2. Before developing a classifier, having a baseline is very useful. Build a baseline model for sentiment classification using SentiWordNet.**
- How you decide to aggregate sentiment is up to you. Explain your approach.
- It should report the accuracy of the classifier.

In [None]:
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
import spacy


# TODO: implement a function to get the sentiment of a text
# Must use the sentiwordnet lexicon

# Evaluate it on the following sentences:
sents = [
    "I liked it! Did you?",
    "It's not bad but... Nevermind, it is.",
    "It's awful",
    "I don't care if you loved it - it was terrible!",
    "I don't care if you hated it, I think it was awesome"
]
# 0: negative, 1: positive
y_true = [1, 0, 0, 0, 1]

## The SST-2 binary sentiment dataset

**3. Split the training set into a training and test set. Choose a split size, and justify your choice.**

In [None]:
from datasets import load_dataset
dataset = load_dataset("sst2")

train_df = dataset["train"].to_pandas().drop(columns=["idx"])
train_df = train_df.sample(10000, random_state=42)  # a small subset, seeded for reproducibility
print(train_df.label.value_counts())
train_df.head()

In [None]:
# TODO: split the data
train_df = ...
test_df = ...
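
One common way to do the split is scikit-learn's `train_test_split`. The sketch below uses a tiny made-up DataFrame (`toy_df`, a hypothetical stand-in for `train_df`) and an 80/20 split with stratification on the label; the split size is only an example, and you still need to justify your own choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for train_df with the same column names.
toy_df = pd.DataFrame({
    "sentence": [f"example sentence {i}" for i in range(10)],
    "label": [1, 0] * 5,
})

# 80/20 split; stratify keeps the label ratio similar in both parts,
# and random_state makes the split reproducible.
toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["label"],
    random_state=42,
)

print(len(toy_train), len(toy_test))
```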

**4. Evaluate your baseline model on the test set.**

- Additionally, compare it against a random baseline, that is, a random guess for each example.

In [None]:
# TODO: evaluate on test set + random guess
# Report results in terms of accuracy
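
The random baseline can be sketched with the standard library alone. The helper names `accuracy` and `random_baseline` are hypothetical, and the labels below are a toy list, not the SST-2 test set.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def random_baseline(n_examples, seed=0):
    """Guess 0 or 1 uniformly at random for each example."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.randint(0, 1) for _ in range(n_examples)]


toy_labels = [1, 0, 0, 0, 1]
toy_guesses = random_baseline(len(toy_labels))
print(f"random-guess accuracy: {accuracy(toy_labels, toy_guesses):.2f}")
```

On a balanced binary dataset this baseline should hover around 0.5, with more variance the smaller the test set is.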

**5. Did you beat the random baseline?**

If not, can you think of any reasons why?

Your answer here!

## Classification with Naive Bayes and TF-IDF
This is the final task of the lab. You will use high-level libraries to build a TF-IDF vectorizer and train a Naive Bayes classifier on your data.

In [None]:
# TODO: use scikit-learn to...
# - normalize
# - vectorize/extract features
# - train a classifier
# - evaluate the classifier using `classification_report` and `accuracy_score`
# 
# expect an accuracy of > 0.8
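
A minimal sketch of how these steps can fit together in a scikit-learn pipeline. It runs on a handful of made-up sentences (an assumption, chosen only to keep the example self-contained), so the printed scores say nothing about SST-2; swap in your own train and test splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data; 0: negative, 1: positive.
toy_train_texts = [
    "a wonderful and moving film",
    "truly awful acting",
    "great story and great cast",
    "boring and terrible plot",
]
toy_train_labels = [1, 0, 1, 0]
toy_test_texts = ["a great film", "terrible acting"]
toy_test_labels = [1, 0]

# TfidfVectorizer lowercases and tokenizes (normalization + features);
# MultinomialNB is the Naive Bayes classifier trained on those features.
model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
model.fit(toy_train_texts, toy_train_labels)

toy_pred = model.predict(toy_test_texts)
print("accuracy:", accuracy_score(toy_test_labels, toy_pred))
print(classification_report(toy_test_labels, toy_pred, zero_division=0))
```

Wrapping both steps in one pipeline ensures the vectorizer is fitted on the training data only and then reused unchanged on the test data.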

## Optional task: using a pre-trained transformer model
If you want to push the accuracy as far as you can, look into BERT-based or other pre-trained language models. A good starting point is a model already fine-tuned on the SST-2 dataset: [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)

**Advanced:**

Going beyond this, you could add a *classification head* on top of the pooling layer of a BERT-based model. This is a common approach to fine-tuning these models for classification or regression problems.