# Approach

## Problem Description

The assignment is to classify whether a preposition attaches to the noun or to the verb. The training set contains <b>20801</b> samples with binary labels <b>N</b> or <b>V</b>. The training data is fairly balanced, with <b>10865</b> noun attachments and <b>9936</b> verb attachments. To beat a majority-class predictor, a classifier needs an accuracy above <b>0.522</b> (10865 / 20801).
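The majority-class figure follows directly from the label counts. A minimal sketch of that arithmetic, where the helper name `majority_class_accuracy` is only illustrative and not part of the assignment code:

```python
def majority_class_accuracy(label_counts):
    """Accuracy obtained by always predicting the most frequent label."""
    total = sum(label_counts.values())
    if total == 0:
        raise ValueError("label_counts must contain at least one sample")
    return max(label_counts.values()) / total


# Label counts quoted in the problem description
training_counts = {"N": 10865, "V": 9936}
baseline = majority_class_accuracy(training_counts)
print(f"Majority-class baseline: {baseline:.4f}")  # → Majority-class baseline: 0.5223
```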

## Approach

My approach to this classification problem is to first analyze the data to gain some insight into the features. I then implement a classifier using logistic regression to make predictions from those features. This classifier can serve as a baseline for future classification experiments.

# Implementation

#### Importing the Data

In [1]:
import os
import pandas as pd
import numpy as np

DATA_ROOT   = 'data/PPAttachData'
TRAINING    = 'training'
TESTING     = 'test'
VALIDATION  = 'devset'

COLUMNS = ['idx', 'w0', 'w1', 'w2', 'w3', 'y']

def load_split(name):
    # Columns: index, verb (w0), noun (w1), preposition (w2), noun (w3), label.
    # The label becomes a boolean: True for noun attachment ('N'), False for 'V'.
    return pd.read_csv(os.path.join(DATA_ROOT, name),
                       delimiter=' ',
                       header=None,
                       names=COLUMNS,
                       converters={'y': lambda label: label == 'N'})

training_df   = load_split(TRAINING)
testing_df    = load_split(TESTING)
validation_df = load_split(VALIDATION)

training_df.head()

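|   | idx | w0 | w1 | w2 | w3 | y |
|---|-----|----|----|----|----|---|
| 0 | 0 | join | board | as | director | False |
| 1 | 1 | is | chairman | of | N.V. | True |
| 2 | 2 | named | director | of | conglomerate | True |
| 3 | 3 | caused | percentage | of | deaths | True |
| 4 | 5 | using | crocidolite | in | filters | False |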


#### Dataset balance

In [2]:
print('Training:   ', np.sum(training_df['y']==True) / len(training_df))
print('Testing:    ', np.sum(testing_df['y']==True) / len(testing_df))
print('Validation: ', np.sum(validation_df['y']==True) / len(validation_df))

Training:    0.5223306571799433
Testing:     0.589602841459477
Validation:  0.5303292894280762


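The testing set is more imbalanced than the other two (about 0.590 noun attachments versus 0.522 and 0.530). A majority-class predictor would therefore score about 0.590 on the test set, so that is the figure a classifier evaluated there has to beat. A shift like this between the training and test distributions is also a realistic problem in practice.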

#### Merge training and validation

The validation set is merged into the training set so that K-fold cross-validation can be used in place of a fixed validation split.

In [3]:
training_df = pd.concat([training_df, validation_df])
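To make the idea of K-fold splitting concrete, here is a minimal sketch in plain Python. The function name `k_fold_indices` is hypothetical, and the sketch uses contiguous, unshuffled folds; scikit-learn's own splitters additionally support shuffling and stratification, which are omitted here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each
        stop = start + base + (1 if fold < extra else 0)
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


# Every sample is held out exactly once across the k folds
for train_idx, val_idx in k_fold_indices(10, 5):
    print("validate on", val_idx, "train on", len(train_idx), "samples")
```

Each fold serves once as the held-out set while the remaining folds are used for fitting, which is why a separate fixed validation split is no longer needed.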

#### Unique Word Counts

I looked at the number of unique words in each column to get some insight into the data.

In [4]:
print("Unique verbs       : ", training_df['w0'].str.lower().nunique())
print("Unique nouns (w1)  : ", training_df['w1'].str.lower().nunique())
print("Unique prepositions: ", training_df['w2'].str.lower().nunique())
print("Unique nouns (w3)  : ", training_df['w3'].str.lower().nunique())

Unique verbs       :  3496
Unique nouns (w1)  :  4791
Unique prepositions:  69
Unique nouns (w3)  :  6054


Given the comparatively small feature space of the prepositions, I thought it would be interesting to see how a classifier performs on this information alone.

In [5]:
from sklearn.preprocessing import LabelBinarizer

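# Lowercase the prepositions so that e.g. 'Of' and 'of' share one category
training_prep = training_df['w2'].str.lower()
testing_prep  = testing_df['w2'].str.lower()

# Fit on both splits so every preposition seen at test time has its own column
enc = LabelBinarizer().fit(pd.concat([training_prep, testing_prep]))

# One-hot encode: one binary column per distinct preposition
training_prep_oh = enc.transform(training_prep)
testing_prep_oh  = enc.transform(testing_prep)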

from sklearn.linear_model import LogisticRegressionCV

# 5-fold cross validation
model = LogisticRegressionCV(cv=5, random_state=0)

print("Accuracy on Test Data")
model.fit(training_prep_oh, training_df['y']).score(testing_prep_oh, testing_df['y'])

Accuracy on Test Data


0.7190829835324507
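With the preposition as the only input, a classifier can do little more than learn which attachment is more common for each preposition. As a rough illustration of that idea (a swapped-in lookup table, not the logistic regression model used above), the sketch below picks the majority label per preposition. The function names and the toy rows, taken from the `head()` output, are for illustration only.

```python
from collections import Counter, defaultdict


def fit_preposition_lookup(prepositions, labels):
    """Map each preposition to its most frequent label; also return a fallback label."""
    prepositions, labels = list(prepositions), list(labels)
    counts = defaultdict(Counter)
    for prep, label in zip(prepositions, labels):
        counts[prep.lower()][label] += 1
    # Fallback for prepositions never seen during fitting: the overall majority label
    default = Counter(labels).most_common(1)[0][0]
    table = {prep: c.most_common(1)[0][0] for prep, c in counts.items()}
    return table, default


def predict(table, default, prepositions):
    """Look up each preposition, falling back to the default label if unseen."""
    return [table.get(prep.lower(), default) for prep in prepositions]


# Toy data: the five rows shown by training_df.head() (True = noun attachment)
toy_preps = ["as", "of", "of", "of", "in"]
toy_labels = [False, True, True, True, False]

table, default = fit_preposition_lookup(toy_preps, toy_labels)
print(predict(table, default, ["of", "in", "with"]))  # → [True, False, True]
```

On real data a regularized logistic regression may disagree with this table for rare prepositions, so the two should be expected to behave similarly rather than identically.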

# Discussion of Results

The 0.719 test accuracy (p < 0.000001) is well above the majority-class baseline, which suggests that the preposition alone is a significant feature for this classification task. My intuition is that the trailing noun (w3) might offer additional information for an improved classifier. Future work could extract more features and use more complex models that make better use of them.
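The notebook does not show how the p-value was obtained. One plausible check, sketched here as an assumption and not as the method actually used, is a one-sided binomial test: the probability that a predictor with the baseline accuracy gets at least as many test items right by chance. The function name and the example counts below are hypothetical, since the test set size is not stated above.

```python
import math


def binomial_tail_p_value(n_correct, n_total, baseline_acc):
    """P(X >= n_correct) for X ~ Binomial(n_total, baseline_acc), computed in log space."""
    if not 0.0 < baseline_acc < 1.0:
        raise ValueError("baseline_acc must be strictly between 0 and 1")
    if not 0 <= n_correct <= n_total:
        raise ValueError("n_correct must be between 0 and n_total")
    log_p = math.log(baseline_acc)
    log_q = math.log1p(-baseline_acc)
    log_terms = []
    for k in range(n_correct, n_total + 1):
        # log of C(n_total, k), via lgamma to avoid huge intermediate values
        log_choose = (math.lgamma(n_total + 1)
                      - math.lgamma(k + 1)
                      - math.lgamma(n_total - k + 1))
        log_terms.append(log_choose + k * log_p + (n_total - k) * log_q)
    # Log-sum-exp keeps the sum stable when individual terms are tiny
    peak = max(log_terms)
    return math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)


# Hypothetical counts for illustration only: 719 of 1000 correct vs. a 0.59 baseline
print(binomial_tail_p_value(719, 1000, 0.59))
```

Substituting the real number of test samples and correct predictions would give the corresponding p-value against whichever baseline is chosen.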

# Contributors

Myself