# Feature Design Bakeoff: Native Language Identification

## The Task

Given an essay written in English by a non-native speaker, identify their native language. This is a task published by ETS (Educational Testing Service), with the data coming from TOEFL essays.

**Data:** Files for training, development, and test, automatically downloaded and parsed by the code below. The test data does not have labels. The languages in the data are:

    ARA = Arabic
    CHI = Chinese
    FRE = French
    GER = German
    HIN = Hindi
    ITA = Italian
    JPN = Japanese
    KOR = Korean
    SPA = Spanish
    TEL = Telugu
    TUR = Turkish

**Learning Algorithm:** One-versus-rest logistic regression

**Features:** You design 'em! A baseline with word counts is given to you.

## 1. Load Data

(This will take a few seconds since it has to read the data over the Internet.)


In [1]:
import urllib2

def load_data(url):
    """read a data file from the web"""
    obj = urllib2.urlopen(url)
    return [line.strip() for line in obj.readlines()]

train_text = load_data('http://cs.wellesley.edu/~sravana/ml/nli/training.essays')
train_labels = load_data('http://cs.wellesley.edu/~sravana/ml/nli/training.labels')
   
dev_text = load_data('http://cs.wellesley.edu/~sravana/ml/nli/development.essays')
dev_labels = load_data('http://cs.wellesley.edu/~sravana/ml/nli/development.labels')

test_text = load_data('http://cs.wellesley.edu/~sravana/ml/nli/test.essays')

### Preview data

Here is what the essays look like.

In [2]:
print 'There are', len(train_text), 'essays in the training set'
print 'A sample essay by a speaker of', train_labels[0]
print train_text[0]

There are 6435 essays in the training set
A sample essay by a speaker of CHI
I agree with the successful people try new things and take risks rather than only doing what they already know how to do well . I know and heard many examples about this kind of things . I think a really successful person must be smart and honest . They can be successed in this area , because they know how to be successd in this area , but a man can not always stay and do now go forward . Successful people always want to be success in diffrent area , so they can prove themselves that they are really should be successed .  First , I think many peple heard that some popular movie stars or singers also doing their own business , such as have a resteruant , clothes stores , hotels and so on . Movie stars and singers they earnd too much money than others , why they should do these things , they need more money ? or they have too much money so they just do it for fun ? I think they just want to show people that they

## 2. Featurizing and Classification

The code below transforms the language labels ('ARA', 'CHI', etc.) into integer labels for the training and development data.

In [3]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

trainy = le.fit_transform(train_labels)  
devy = le.transform(dev_labels)  
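
To make the mapping concrete, here is a minimal pure-Python sketch of the same idea. It assumes, as `LabelEncoder` does, that classes are numbered in sorted order. The helper names `fit_labels`, `encode`, and `decode` are hypothetical and exist only for illustration; the notebook itself relies on `le.fit_transform`, `le.transform`, and `le.inverse_transform`.

```python
def fit_labels(labels):
    """Return the sorted distinct labels; a label's position is its integer code."""
    return sorted(set(labels))

def encode(classes, labels):
    """Map each string label to its index in `classes`."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]

def decode(classes, codes):
    """Map integer codes back to string labels."""
    return [classes[c] for c in codes]

toy_labels = ['CHI', 'ARA', 'TEL', 'CHI']   # toy stand-in for train_labels
classes = fit_labels(toy_labels)            # ['ARA', 'CHI', 'TEL']
codes = encode(classes, toy_labels)         # [1, 0, 2, 1]
```

Because the development labels are encoded with the mapping learned from the training labels, the same language always receives the same integer in both sets.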

A baseline featurizer with bag-of-words counts is implemented below. It is commented out because the cell after it replaces the counts with TF-IDF weights.

In [4]:
# from sklearn.feature_extraction.text import CountVectorizer

# vectorizer = CountVectorizer()  # initialize object with CountVectorizer defaults

# # convert to array where each row is an essay, each dimension is a word, 
# # and each value is the count of that word in the essay
# trainX = vectorizer.fit_transform(train_text)  

# trainX = trainX.toarray()   # make dense

# devX = vectorizer.transform(dev_text)  # featurize the development text
# devX = devX.toarray()

# testX = vectorizer.transform(test_text)  # featurize the testing text
# testX = testX.toarray()

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True)  # TF-IDF with sublinear term-frequency scaling; other options at defaults

# convert to a sparse matrix where each row is an essay, each dimension is a word,
# and each value is the TF-IDF weight of that word in the essay
trainX = vectorizer.fit_transform(train_text)

# LogisticRegression accepts sparse input, so no dense conversion is needed
devX = vectorizer.transform(dev_text)  # featurize the development text

testX = vectorizer.transform(test_text)  # featurize the testing text
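
For intuition about what the feature values are, the following is a rough pure-Python sketch of the weighting that `TfidfVectorizer(sublinear_tf=True)` applies under its documented defaults: term frequency `1 + ln(count)`, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function names `tokenize` and `tfidf_rows` and the toy documents are assumptions made for illustration, and the tokenizer only approximates sklearn's default.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    """Return one {word: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter()                      # number of documents containing each word
    for c in counts:
        df.update(c.keys())
    idf = {w: math.log((1.0 + n) / (1.0 + d)) + 1.0 for w, d in df.items()}
    rows = []
    for c in counts:
        # sublinear term frequency times inverse document frequency
        row = {w: (1.0 + math.log(k)) * idf[w] for w, k in c.items()}
        norm = math.sqrt(sum(v * v for v in row.values())) or 1.0
        rows.append({w: v / norm for w, v in row.items()})
    return rows

toy_docs = ["the risk the reward", "the plan", "the risk"]
rows = tfidf_rows(toy_docs)
```

In the second toy document, `plan` gets a larger weight than `the` even though both occur once, because `the` appears in every document.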

Now train a multi-class logistic regression model and evaluate it on the development set. This code may take a few seconds to run.

In [6]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()  # one-versus-rest logistic regression

model.fit(trainX, trainy)
accuracy = model.score(devX, devy)
print 'Classification accuracy', accuracy

Classification accuracy 0.777479892761


Try different options for the `CountVectorizer` (which supports word and character n-grams), or use the `TfidfVectorizer` and explore *its* different options. See [the documentation on text feature extraction](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text).
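
As one possible starting point, the sketch below configures a word n-gram vectorizer and a character n-gram vectorizer. The specific n-gram ranges and the toy sentences are arbitrary choices for illustration, not recommended settings; `analyzer`, `ngram_range`, and `sublinear_tf` are existing `TfidfVectorizer` parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up toy sentences standing in for train_text
toy_essays = [
    "I am agree with this statement .",
    "I agree with the statement because of many reasons .",
    "The peoples take risks in the life .",
]

# word unigrams and bigrams
word_vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), sublinear_tf=True)
# character 1- to 3-grams that do not cross word boundaries
char_vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3), sublinear_tf=True)

word_X = word_vec.fit_transform(toy_essays)
char_X = char_vec.fit_transform(toy_essays)
```

To try either one in the notebook, assign it to `vectorizer` in the featurizing cell and rerun the cells that follow. Longer n-grams grow the feature space quickly, so the `min_df` and `max_features` parameters may be worth exploring as well.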

You can also experiment with some of sklearn's [feature scaling and normalization options](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).

In addition to feature engineering, you can also try alternate hyperparameters for the logistic regression model above.
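
One hyperparameter worth trying is `C`, the inverse of the regularization strength in `LogisticRegression`. The sketch below fits one model per candidate value and records the development accuracy. The helper `sweep_C`, the candidate values, and the toy data are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def sweep_C(trainX, trainy, devX, devy, C_values=(0.1, 1.0, 10.0, 100.0)):
    """Fit one model per value of C and return {C: development accuracy}."""
    scores = {}
    for C in C_values:
        model = LogisticRegression(C=C)   # smaller C means stronger regularization
        model.fit(trainX, trainy)
        scores[C] = model.score(devX, devy)
    return scores

# toy stand-ins for the notebook's trainX, trainy, devX, devy
toy_train = ["aa bb aa", "aa aa cc", "dd ee dd", "ee dd ff"]
toy_trainy = [0, 0, 1, 1]
toy_dev = ["aa bb", "dd ee"]
toy_devy = [0, 1]

vec = TfidfVectorizer()
scores = sweep_C(vec.fit_transform(toy_train), toy_trainy,
                 vec.transform(toy_dev), toy_devy)
best_C = max(scores, key=scores.get)   # value of C with the highest dev accuracy
```

In the notebook, the call would be `sweep_C(trainX, trainy, devX, devy)`. Selecting `C` on the development set keeps the unlabeled test set out of the tuning loop.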

If you're really efficient, you can concatenate some hand-designed features (average word length, etc.) to the `trainX`, `devX`, and `testX` matrices. This takes some time to implement, so leave it to the end.
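
A minimal sketch of this idea follows, assuming the essays are whitespace-tokenized as in the sample above. The three features and the helper names `hand_features` and `append_features` are illustrative assumptions, not part of the provided baseline.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def hand_features(essays):
    """One row per essay: [average token length, token count, comma rate]."""
    rows = []
    for essay in essays:
        tokens = essay.split()
        n = max(len(tokens), 1)               # avoid division by zero on empty essays
        avg_len = sum(len(t) for t in tokens) / float(n)
        comma_rate = tokens.count(',') / float(n)
        rows.append([avg_len, float(len(tokens)), comma_rate])
    return np.array(rows)

def append_features(X, essays):
    """Append the hand-designed columns to the right of feature matrix X."""
    # csr_matrix() accepts both dense arrays and sparse matrices
    return hstack([csr_matrix(X), csr_matrix(hand_features(essays))]).tocsr()

toy_essays = ["I agree , but not fully .", "Yes ."]
toy_X = csr_matrix(np.eye(2))                 # stand-in for vectorizer output
combined = append_features(toy_X, toy_essays)
```

In the notebook this would be applied to all three splits, for example `trainX = append_features(trainX, train_text)`, before fitting the model. Raw token counts are on a much larger scale than TF-IDF weights, so the scaling options mentioned above may matter here.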

When you're satisfied with the development accuracy, predict labels for test data by running the code below.

In [None]:
predictions = model.predict(testX)
predictions = le.inverse_transform(predictions)  # transform list of indices into list of languages

team_name = raw_input('Enter your team name: ').strip().replace(' ', '_')
student_names = raw_input('Enter all student names, separated by commas: ').strip()

with open(team_name+'.results', 'w') as o:
    o.write(student_names+'\n')
    o.write('\n'.join(predictions))
print 'Wrote results to file'

Enter your team name: lma2-sye2


Drop the `<your_team_name>.results` file that was created in the current working directory into
[this Google Drive folder](https://drive.google.com/open?id=0B8FnZZJ_NRjiMXBkN2YtMDF2dzg).