Task 2 - Example of submission 
We train a simple system that computes a single feature, the length of each sentence, learns a model on the training set, and makes predictions on the development set.
It is assumed that the folders "train" and "dev" are in the same folder as this notebook.

In [1]:
train_folder = "train" # if train and dev folders are not in the same folder as this notebook, change
dev_folder = "dev"     # these variables accordingly

from sklearn.linear_model import LogisticRegression
import glob
import os.path
import numpy as np

Loading training data

In [2]:
# loading articles' content from *.txt files in the train folder
# (sorted so that articles and label files are read in the same order;
#  this assumes each article and its label file share the same base name)
file_list = sorted(glob.glob(os.path.join(train_folder, "*.txt")))
sentence_list = []
for filename in file_list:
    with open(filename, "r") as f:
        for row in f.readlines():
            sentence_list.append(row.rstrip())

# loading article ids, sentence ids and gold labels from *.task2.labels files in the train folder
gold_file_list = sorted(glob.glob(os.path.join(train_folder, "*.task2.labels")))
articles_id, sentence_id_list, gold_labels = ([], [], [])
for filename in gold_file_list:
    with open(filename, "r") as f:
        for row in f.readlines():
            article_id, sentence_id, gold_label = row.rstrip().split("\t")
            articles_id.append(article_id)
            sentence_id_list.append(sentence_id)
            gold_labels.append(gold_label)
print("Loaded %d sentences from %d articles" % (len(sentence_list), len(file_list)))


Loaded 15171 sentences from 293 articles
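
The sentences come from the *.txt files and the labels from the *.task2.labels files, so the two lists can only be paired if they have the same length and were read in the same order. A minimal sanity check for the length part might look like the sketch below; `check_alignment` is a hypothetical helper, not part of the task's tooling, and the toy sentences and labels are made up.

```python
def check_alignment(sentences, labels):
    """Raise ValueError if sentences and labels cannot be paired one-to-one."""
    if len(sentences) != len(labels):
        raise ValueError(
            "Got %d sentences but %d labels" % (len(sentences), len(labels))
        )
    return len(sentences)


# Toy data standing in for sentence_list and gold_labels (made-up values).
toy_sentences = ["First sentence.", "", "Third sentence."]
toy_labels = ["non-propaganda", "non-propaganda", "propaganda"]
n_pairs = check_alignment(toy_sentences, toy_labels)
print("Checked %d sentence/label pairs" % n_pairs)
```

In the notebook itself, the same call would be `check_alignment(sentence_list, gold_labels)` right after the loading cell.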


We create one feature per sentence, namely its length in characters, and train a model using only that feature.

In [3]:
train = np.array([ len(sentence) for sentence in sentence_list ]).reshape(-1, 1)
model = LogisticRegression(penalty='l2', class_weight='balanced', solver="lbfgs")
model.fit(train, gold_labels)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
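
The model above is trained with class_weight='balanced'. According to the scikit-learn documentation, this mode weights each class by n_samples / (n_classes * class_count), so the rarer class counts for more during training. The sketch below reproduces that formula in plain Python on a made-up label list; `balanced_class_weights` is a hypothetical helper and the label strings are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Made-up label distribution: 3 of 4 sentences belong to the majority class.
toy_labels = ["non-propaganda"] * 3 + ["propaganda"]
weights = balanced_class_weights(toy_labels)
print(weights)
```

With this toy distribution the minority class receives a weight three times larger than the majority class.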

We now load the dev set, compute the length feature, and make predictions with 'model'.

In [4]:
# sorted so that articles and label files are read in the same order
# (assumes each article and its label file share the same base name)
file_list = sorted(glob.glob(os.path.join(dev_folder, "*.txt")))
dev_sentence_list = []  # holds the length of each dev sentence, not its text
for filename in file_list:
    with open(filename, "r") as f:
        for row in f.readlines():
            dev_sentence_list.append(len(row.rstrip()))

# only the article id and sentence id are read from the dev label files
gold_file_list = sorted(glob.glob(os.path.join(dev_folder, "*.task2.labels")))
dev_articles_id, dev_sentence_id_list = ([], [])
for filename in gold_file_list:
    with open(filename, "r") as f:
        for row in f.readlines():
            article_id, sentence_id = row.rstrip().split("\t")[0:2]
            dev_articles_id.append(article_id)
            dev_sentence_id_list.append(sentence_id)

dev = np.array(dev_sentence_list).reshape(-1, 1)
predictions = model.predict(dev)


Save the predictions in the output format for the competition.

In [5]:
with open("example-submission-task2-predictions.txt", "w") as fout:
    for article_id, sentence_id, prediction in zip(dev_articles_id, dev_sentence_id_list, predictions):
        fout.write("%s\t%s\t%s\n" % (article_id, sentence_id, prediction))
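
Before submitting, it can help to read the file back and confirm that every line has exactly three tab-separated fields (article id, sentence id, label). The following is a minimal sketch under that assumption; `read_submission` is a hypothetical helper, and the ids and labels in the toy file are made up.

```python
import os
import tempfile


def read_submission(path):
    """Parse a tab-separated prediction file into (article_id, sentence_id, label) tuples."""
    rows = []
    with open(path, "r") as f:
        for line_number, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                raise ValueError(
                    "Line %d has %d fields, expected 3" % (line_number, len(fields))
                )
            rows.append(tuple(fields))
    return rows


# Write a tiny made-up file in the same format, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-predictions.txt")
    with open(path, "w") as fout:
        fout.write("%s\t%s\t%s\n" % ("111", "1", "propaganda"))
        fout.write("%s\t%s\t%s\n" % ("111", "2", "non-propaganda"))
    rows = read_submission(path)

print(rows)
```

On the real output, `read_submission("example-submission-task2-predictions.txt")` should return one tuple per dev sentence.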

Running the scorer on file "example-submission-task2-predictions.txt" gives the following results:
Precision=0.307815
Recall=0.582202
F1=0.402713
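
The scorer's implementation is not shown here, but assuming it reports the standard F1 (the harmonic mean of precision and recall), the three numbers above can be cross-checked with a few lines. `f1_score` below is a hypothetical helper written for this check, not the scorer's own function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported by the scorer above.
f1 = f1_score(0.307815, 0.582202)
print("F1=%f" % f1)
```

The result agrees with the reported F1 up to the rounding of the precision and recall values.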