# Advanced NLP

*Notebook for Assignment 1 of Advanced NLP - Group 2 by Simone Colombo, Sophie van Duin, Iris Lau & Romera Voerman*

This notebook includes code that extracts the features proposed in part b) of the Syntax section. The features are:
* The full constituent starting from a head word
* A feature including the head of a target word
* A feature including the dependent(s) of a target word
* A feature based on dependency relations (e.g. paths, relation to head, etc.)
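
The first feature, the full constituent headed by a word, can be read off a dependency parse by collecting the head word together with all of its direct and indirect dependents. The sketch below illustrates this on a hand-written toy parse rather than on parser output; the `heads` list and the `constituent` helper are assumptions made for illustration (in spaCy, `token.subtree` yields the token and all of its descendants).

```python
# Toy parse for "the sun set behind the mountains" (hand-written, not parser output).
# heads[i] is the index of token i's head; the root points to itself.
words = ["the", "sun", "set", "behind", "the", "mountains"]
heads = [1, 2, 2, 2, 5, 3]


def constituent(head_index, words, heads):
    """Return the words of the subtree rooted at head_index, in sentence order."""
    members = {head_index}
    changed = True
    while changed:  # keep adding tokens whose head is already in the subtree
        changed = False
        for i, h in enumerate(heads):
            if h in members and i not in members:
                members.add(i)
                changed = True
    return [words[i] for i in sorted(members)]


print(constituent(3, words, heads))  # constituent headed by "behind"
print(constituent(1, words, heads))  # constituent headed by "sun"
```

Working with indices rather than word strings keeps the two occurrences of "the" apart.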

In [1]:
import nltk
import spacy
import pandas as pd
import networkx as nx
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [2]:
sentence = "As the sun set behind the mountains and the sky turned shades of orange and pink, Mary and Tom walked hand in hand along the deserted beach, listening to the sound of the waves crashing against the shore."

#### Token

In [3]:
tokens = word_tokenize(sentence)

Create a dataframe from the tokenized sentence; each extracted feature is added to it as a new column.

In [4]:
df = pd.DataFrame(tokens, columns=['token'])

#### Lemma

In [5]:
# create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# lemmatize each token; without a pos argument, every token is treated as a noun
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokens]

df['lemma'] = lemmatized_words

#### Part-of-Speech

In [6]:
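# perform part-of-speech tagging; nltk.pos_tag returns (token, tag) tuples
pos_tags = nltk.pos_tag(tokens)

# each cell of the 'pos' column therefore holds a (token, tag) tuple
df['pos'] = pos_tags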
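
If only the tag itself is wanted as a feature, the pairs can be unpacked before they are stored. A small sketch, using three pairs copied from the output table at the end of this notebook; the `tags` name is an assumption.

```python
# (token, tag) pairs in the shape returned by nltk.pos_tag
pos_tags = [("As", "IN"), ("the", "DT"), ("sun", "NN")]

# keep only the tag from each pair
tags = [tag for _, tag in pos_tags]
print(tags)
```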

#### Dependency labels

In [7]:
# load the English language model
nlp = spacy.load('en_core_web_sm')

# perform dependency parsing
doc = nlp(sentence)
dep_labels = [token.dep_ for token in doc]

df['dep_labels'] = dep_labels
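
Assigning the spaCy labels to `df` assumes that spaCy and NLTK split the sentence into the same tokens, since the dataframe rows come from `word_tokenize`. A minimal sketch of a sanity check that could be run before the assignment; `check_alignment` is a hypothetical helper and the token lists are hand-written stand-ins.

```python
def check_alignment(tokens_a, tokens_b):
    """Return the positions at which two tokenizations disagree."""
    if len(tokens_a) != len(tokens_b):
        # a length mismatch means the rows cannot be paired one-to-one
        raise ValueError(f"token counts differ: {len(tokens_a)} vs {len(tokens_b)}")
    return [i for i, (a, b) in enumerate(zip(tokens_a, tokens_b)) if a != b]


nltk_tokens = ["Mary", "and", "Tom", "walked", "."]
spacy_tokens = ["Mary", "and", "Tom", "walked", "."]
print(check_alignment(nltk_tokens, spacy_tokens))  # empty list when aligned
```

In this notebook the call would be `check_alignment(tokens, [t.text for t in doc])`.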

#### Preceding (left-dependent) Part-of-Speech tags and dependency labels

In [8]:
# token.lefts yields the syntactic children that appear to the left of each token
preceding_pos = []
preceding_dep = []
for token in doc:
    preceding_pos.append([t.pos_ for t in token.lefts])
    preceding_dep.append([t.dep_ for t in token.lefts])

In [9]:
df['prec_pos'] = preceding_pos
df['prec_dep'] = preceding_dep
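
Note that `token.lefts` returns the left-hand syntactic children of a token, not the tokens that linearly precede it. The sketch below reproduces that behaviour on a hand-written fragment of the sentence; the `heads` and `pos` lists and the `left_children` helper are assumptions for illustration, not parser output.

```python
# Fragment "As the sun set"; heads[i] is the index of token i's head.
words = ["As", "the", "sun", "set"]
pos = ["SCONJ", "DET", "NOUN", "VERB"]
heads = [3, 2, 3, 3]  # "set" is treated as the root of this fragment


def left_children(i, heads):
    """Indices of the tokens that depend on token i and stand to its left."""
    return [j for j, h in enumerate(heads) if h == i and j < i]


left_pos = [[pos[j] for j in left_children(i, heads)] for i in range(len(words))]
print(left_pos)
```

For "set" this gives the tags of "As" and "sun", which agrees with the `prec_pos` entry for that token in the output table.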

#### Head of each token

In [10]:
# store the head's text rather than the spaCy Token object
head_tags = [token.head.text for token in doc]
df['head_tags'] = head_tags

#### Length of the path connecting each token with the ROOT of the sentence

In [11]:
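# build an undirected graph with one edge per head-child pair
# note: nodes are keyed by token text, so repeated words (e.g. "the") share a single node
edges = []
for token in doc:
    for child in token.children:
        edges.append((token.text, child.text))
graph = nx.Graph(edges)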

In [12]:
root = df.loc[df.dep_labels == 'ROOT', 'token'].values[0]

In [13]:
root_path = []

for token in doc:
    root_path.append(nx.shortest_path_length(graph, source=str(token), target=root))

In [14]:
df['token-ROOT_path'] = root_path
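
Because the graph above uses token text as node names, repeated words such as "the" collapse into a single node, and every occurrence receives the shortest distance found for any of them. A sketch of an alternative that follows head indices up to the root, so that each occurrence is measured separately; it runs on a hand-written toy parse, and `depth_to_root` is a hypothetical helper (with spaCy, `token.i` and `token.head.i` would supply the indices).

```python
# Toy parse for "the sun set behind the mountains"; the root points to itself.
words = ["the", "sun", "set", "behind", "the", "mountains"]
heads = [1, 2, 2, 2, 5, 3]


def depth_to_root(i, heads):
    """Count the edges between token i and the root by following head links."""
    steps = 0
    while heads[i] != i:
        i = heads[i]
        steps += 1
    return steps


depths = [depth_to_root(i, heads) for i in range(len(words))]
print(list(zip(words, depths)))
```

In this toy parse the first "the" is two edges from the root and the second is three, a distinction that a text-keyed graph cannot represent.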

#### Children

In [15]:
children = []
for token in doc:
    token_children = [child.text for child in token.children]
    children.append(token_children)

df['children'] = children

#### Word shape

In [19]:
word_shapes = [token.shape_ for token in doc]
df['word_shapes'] = word_shapes

## Full dataframe including extracted features

In [20]:
df.head(10)

| | token | lemma | pos | dep_labels | prec_pos | prec_dep | head_tags | token-ROOT_path | children | word_shapes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | As | As | (As, IN) | mark | [] | [] | set | 2 | [] | Xx |
| 1 | the | the | (the, DT) | det | [] | [] | sun | 3 | [] | xxx |
| 2 | sun | sun | (sun, NN) | nsubj | [DET] | [det] | set | 2 | [the] | xxx |
| 3 | set | set | (set, VBN) | advcl | [SCONJ, NOUN] | [mark, nsubj] | walked | 1 | [As, sun, behind, and, turned] | xxx |
| 4 | behind | behind | (behind, IN) | prep | [] | [] | set | 2 | [mountains] | xxxx |
| 5 | the | the | (the, DT) | det | [] | [] | mountains | 3 | [] | xxx |
| 6 | mountains | mountain | (mountains, NNS) | pobj | [DET] | [det] | behind | 3 | [the] | xxxx |
| 7 | and | and | (and, CC) | cc | [] | [] | set | 2 | [] | xxx |
| 8 | the | the | (the, DT) | det | [] | [] | sky | 3 | [] | xxx |
| 9 | sky | sky | (sky, NN) | nsubj | [DET] | [det] | turned | 3 | [the] | xxx |
