# **Feature Extraction** for **Extractive Summarisation**

**Features Considered for each line:**
* Number of Verbs
* Number of Stop words
* Number of Named Entities
* Number of Pronouns
* Position in document
* Sentence Length
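
The last two features in the list above (position and sentence length) need no external resources. The following is a minimal sketch, not the notebook's own implementation: the function name `position_and_length`, the choice to normalise position to the range [0, 1], and measuring length in whitespace-separated tokens are all assumptions made for illustration.

```python
def position_and_length(lines):
    """Return a (relative_position, token_count) pair for each line of one document."""
    n = len(lines)
    features = []
    for index, line in enumerate(lines):
        # Assumption: position is scaled to [0, 1]; a one-line document gets 0.0.
        position = index / (n - 1) if n > 1 else 0.0
        # Assumption: length is the number of whitespace-separated tokens.
        length = len(line.split())
        features.append((position, length))
    return features


example = ["the cat sat on the mat .", "it purred .", "then it slept"]
print(position_and_length(example))
```

Scaling the position keeps the feature comparable across documents of different lengths; a raw line index would work equally well if that is preferred.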

**May include in future:**
* Discourse Cues
* Sentiment
* Salience
* Uniqueness
* Has Money

## Contents

* Parsing raw data
* POS tagging
* Stop words
* Count Named Entities

In [22]:
import os
import numpy as np
import json
import nltk

In [84]:
nltk.download('tagsets')
nltk.download('stopwords')

[nltk_data] Downloading package tagsets to
[nltk_data]     /home/ramkishore.s/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ramkishore.s/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
TRAIN = os.listdir('Dataset/all/training/')
DEV = os.listdir("Dataset/all/validation/")
TEST = os.listdir('Dataset/all/test/')
TRAIN_FOLDER = 'Dataset/all/training/'
DEV_FOLDER = 'Dataset/all/validation/'
TEST_FOLDER = 'Dataset/all/test/'

In [4]:
len(TRAIN), len(TEST), len(DEV)

(277554, 11443, 13367)

## Model dataset file:

### Introduction

This dataset contains CNN and Daily Mail articles used for training a summarization system. The script used to create the dataset is modified from the release of Hermann et al. (2015).

### Format:

Each file contains four parts separated by a blank line (`\n\n`). They are:
* url of the original article;
* sentences in the article and their labels (for sentence-based extractive summarization);
* extractable highlights (for word extraction-based abstractive summarization);
* named entity mapping.
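
To make the four-part layout concrete, here is a small sketch that splits an in-memory record. The `SAMPLE` string is entirely hypothetical (including the tab separator before each label and the example URL), and `parse_record` is an illustrative helper, not part of the dataset release. The only things taken from this notebook are that the label is the last character of each sentence line and that entity-map entries have the form `id:name`.

```python
# Hypothetical record in the four-part format described above.
SAMPLE = (
    "http://example.com/article\n\n"
    "@entity1 reported the news\t\t\t1\n"
    "it rained later\t\t\t0\n\n"
    "@entity1 reported news\n\n"
    "@entity1:CNN"
)


def parse_record(text):
    """Split one record into url, sentences, labels, highlights and entity map."""
    url, sentence_block, highlight_block, entity_block = text.split("\n\n")[:4]
    sentences, labels = [], []
    for row in sentence_block.split("\n"):
        labels.append(int(row[-1]))          # label is the final character
        sentences.append(row[:-1].rstrip())  # drop label and trailing whitespace
    entity_map = dict(item.split(":", 1) for item in entity_block.split("\n"))
    return {
        "url": url,
        "lines": sentences,
        "output": labels,
        "summary": highlight_block.split("\n"),
        "entities": entity_map,
    }


record = parse_record(SAMPLE)
print(record["lines"], record["output"], record["entities"])
```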

### Sentence labels

There are three labels for the sentences: 1, 2 and 0.
* **1**: the sentence should be extracted;
* **2**: the sentence might be extracted;
* **0**: the sentence shouldn't be extracted.

### Extractable highlights

The extractable highlights are created by examining if a word (or its morphological transformation) in the highlight appears in the article or a general purpose stop-word list, which together constitute the output space (i.e., the allowed vocabulary during summary generation).

## Loading all the Data to process

In [5]:
entities = {}
docs = []

In [6]:
def file_parse(file):
    global entities
    # generating data
    
    contents = file.read()
    parts = contents.split('\n\n')
    lines = parts[1].split('\n')
    output = [int(line[-1]) for line in lines]
    lines = [line[:-1].rstrip() for line in lines]
    
    # storing entities
    entity_map = parts[3].split('\n')
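    for entry in entity_map:
        # each entry looks like "<entity id>:<name>"; avoid shadowing the builtin id()
        entity_id, name = entry.split(":")[:2]
        entities[entity_id] = name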
    file.close()
    return {"lines": lines, "output": output, "summary": parts[2].split('\n')}

In [7]:
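def get_processed_files(folder, file_names):
    docs = []
    for file in file_names:
        try:
            # file_parse closes the file handle itself once parsing succeeds
            docs.append(file_parse(open(folder + file)))
        except ValueError:
            # skip files whose labels or entity map cannot be parsed
            pass
    return docs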

In [8]:
TRAIN = get_processed_files(TRAIN_FOLDER, TRAIN)

In [9]:
TEST = get_processed_files(TEST_FOLDER, TEST)
DEV = get_processed_files(DEV_FOLDER, DEV)

In [74]:
def write_files(target_folder, string, docs, filename_pattern=None):
    # filename_pattern defaults to the key name so three-argument calls work
    if not filename_pattern: filename_pattern = string
    for index, doc in enumerate(docs):
        with open(target_folder + filename_pattern + '.' + str(index) + '.txt', 'w+') as file:
            for line in doc[string]:
                file.write(str(line))
                file.write('\n')

In [None]:
os.mkdir('Dataset/Summaries/')
os.mkdir('Dataset/Summaries/train/')
os.mkdir('Dataset/Summaries/test')
os.mkdir('Dataset/Summaries/dev/')

In [62]:
write_files('Dataset/Summaries/train/', 'summary', TRAIN)
write_files('Dataset/Summaries/test/', 'summary', TEST)
write_files('Dataset/Summaries/dev/', 'summary', DEV)

In [65]:
os.mkdir('Dataset/Text/')
os.mkdir('Dataset/Text/train/')
os.mkdir('Dataset/Text/test/')
os.mkdir('Dataset/Text/dev/')

In [70]:
write_files('Dataset/Text/train/', 'lines', TRAIN, 'doc')
write_files('Dataset/Text/test/', 'lines', TEST, 'doc')
write_files('Dataset/Text/dev/', 'lines', DEV, 'doc')

In [71]:
os.mkdir('Dataset/Cheng_outputs/')

In [72]:
os.mkdir('Dataset/Cheng_outputs/train/')
os.mkdir('Dataset/Cheng_outputs/test/')
os.mkdir('Dataset/Cheng_outputs/dev/')

In [75]:
write_files('Dataset/Cheng_outputs/train/', 'output', TRAIN, 'output')
write_files('Dataset/Cheng_outputs/test/', 'output', TEST, 'output')
write_files('Dataset/Cheng_outputs/dev/', 'output', DEV, 'output')

## POS data

**NOTE: Since we only use primitive information such as the number of verbs and the number of pronouns, we don't need accurate tags. So we simply tag each word in isolation and cache the result, which gives one fixed tag per word regardless of context. This reduces the computation time required for POS tagging.**

In [28]:
TAGS = {}
CACHED_TAGS = {}

In [36]:
def cache(word):
    if word not in CACHED_TAGS:
        CACHED_TAGS[word] = nltk.pos_tag([word])[0][1]
    return CACHED_TAGS[word]

In [37]:
TRAIN_TAGS = []
TEST_TAGS = []
DEV_TAGS = []

In [38]:
def tag_data(raw, final):
    for doc in raw:
        dic = []
        for line in doc['lines']:
            l = []
            for word in line.split():
                tag = cache(word)
                l.append(tag)
            dic.append(l)
        final.append(dic)

In [39]:
tag_data(TRAIN, TRAIN_TAGS)
tag_data(TEST, TEST_TAGS)
tag_data(DEV, DEV_TAGS)

In [106]:
def write(target_folder, filename_pattern, docs):
    for doc, index in zip(docs, range(len(docs))):
        with open(target_folder + filename_pattern + '.' + str(index) + '.txt', 'w+') as file:
            for line in doc:
                for tag in line:
                    file.write(str(tag))
                    file.write(' ')
                file.write('\n')

In [None]:
os.mkdir('Dataset/Tags')
os.mkdir('Dataset/Tags/Train')
os.mkdir('Dataset/Tags/Test')
os.mkdir('Dataset/Tags/Dev')

In [46]:
write('Dataset/Tags/Train/', 'tags', TRAIN_TAGS)

In [47]:
write('Dataset/Tags/Test/', 'tags', TEST_TAGS)
write('Dataset/Tags/Dev/', 'tags', DEV_TAGS)

In [48]:
del TRAIN_TAGS, TEST_TAGS, DEV_TAGS, TAGS, CACHED_TAGS
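
The verb and pronoun counts listed among the features can be derived from tag sequences like the ones written above. The following sketch assumes Penn Treebank tags (the default tagset of `nltk.pos_tag`); the helper name `count_verbs_and_pronouns` and the decision to include wh-pronouns (`WP`, `WP$`) are assumptions, not something this notebook specifies.

```python
# Penn Treebank verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
VERB_PREFIX = "VB"
# Assumption: personal, possessive and wh-pronouns all count as pronouns.
PRONOUN_TAGS = {"PRP", "PRP$", "WP", "WP$"}


def count_verbs_and_pronouns(doc_tags):
    """Return a (verb_count, pronoun_count) pair for each tagged line of a document."""
    counts = []
    for line_tags in doc_tags:
        verbs = sum(1 for tag in line_tags if tag.startswith(VERB_PREFIX))
        pronouns = sum(1 for tag in line_tags if tag in PRONOUN_TAGS)
        counts.append((verbs, pronouns))
    return counts


# One document with two lines, in the same nested-list shape tag_data produces.
example_tags = [["PRP", "VBD", "DT", "NN"], ["NNP", "VBZ", "VBG"]]
print(count_verbs_and_pronouns(example_tags))
```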

## Stop words

In [77]:
from nltk.corpus import stopwords

In [103]:
STOPWORDS = set(stopwords.words('english'))

In [104]:
def count_stops(data):
    final = []
    for doc in data:
        stops = []
        for line in doc['lines']:
            count = 0
            for word in line.split():
                if word.lower() in STOPWORDS: count += 1
            stops.append(count)
        final.append(stops)
    return final

In [105]:
TRAIN_STOPS = count_stops(TRAIN)
TEST_STOPS = count_stops(TEST)
DEV_STOPS = count_stops(DEV)

In [108]:
os.mkdir('Dataset/StopCounts/')
os.mkdir('Dataset/StopCounts/train')
os.mkdir('Dataset/StopCounts/test')
os.mkdir('Dataset/StopCounts/dev')

In [112]:
def write(target_folder, filename_pattern, docs):
    for doc, index in zip(docs, range(len(docs))):
        with open(target_folder + filename_pattern + '.' + str(index) + '.txt', 'w+') as file:
            for line in doc:
                file.write(str(line))
                file.write('\n')

In [113]:
write('Dataset/StopCounts/train/', 'stop_counts', TRAIN_STOPS)
write('Dataset/StopCounts/test/', 'stop_counts', TEST_STOPS)
write('Dataset/StopCounts/dev/', 'stop_counts', DEV_STOPS)

In [114]:
del TRAIN_STOPS, TEST_STOPS, DEV_STOPS

## GloVe 100-dimensional vectors

In [14]:
words = []
word2id = {}
vectors = []
id = 0
dims = 100
glove_file = '/home/ramkishore.s/glove/glove.6B.100d.txt'
with open(glove_file) as f:
    for l in f:
        line = l.split()
        word = line[0]
        words.append(word)
        word2id[word] , id = id, id + 1
        # np.float was removed in NumPy 1.24; the builtin float is equivalent here
        vect = np.array(line[1:]).astype(float)
        vectors.append(vect)
        
print(len(words), len(word2id), len(vectors))

400000 400000 400000


## Random NER vectors

In [15]:
entity_names = ['@entity' + str(num) for num in range(0,1000)]
entity2ids = {}
i = len(words)
for name in entity_names:
    entity2ids[name], i = i, i + 1
entity_vectors = np.random.randn(1000, 100)

In [16]:
words.extend(entity_names)
word2id.update(entity2ids)
vectors.extend(entity_vectors)
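
Because entities appear in the text as anonymised `@entity<N>` tokens (the same names generated in `entity_names` above), the named-entity count per line can be sketched with a simple token match. The helper name `count_entities` and the exact regular expression are assumptions for illustration.

```python
import re

# Matches anonymised entity tokens such as "@entity0" or "@entity42".
ENTITY_TOKEN = re.compile(r"^@entity\d+$")


def count_entities(lines):
    """Return the number of @entity<N> tokens in each line of a document."""
    return [
        sum(1 for token in line.split() if ENTITY_TOKEN.match(token))
        for line in lines
    ]


example_lines = ["@entity1 met @entity22 in @entity3 .", "no entities here"]
print(count_entities(example_lines))
```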

## Adding **meta** vectors

* `unknown`
* `number`
* `time`
* `money`

In [17]:
UNK = '@@@UNK@@@'
words.append(UNK)
word2id[UNK] = len(words) - 1
vectors.extend(np.random.randn(1, 100))

In [18]:
MONEY = '@@@MON@@@'
TIME = '@@@TIME@@@'
NUMBER = '@@@NUM@@@'
words.extend([MONEY, TIME, NUMBER])
# words was already extended above, so the new ids are the last three positions
word2id[MONEY], word2id[TIME], word2id[NUMBER] = len(words) - 3, len(words) - 2, len(words) - 1
vectors.extend(np.random.randn(3, 100))

In [19]:
len(words), len(vectors), len(word2id)

(401004, 401004, 401004)

In [20]:
word2id[MONEY] = 401001
word2id[TIME] = 401002
word2id[NUMBER] = 401003
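
With the vocabulary in place, a line can be mapped to vector ids, falling back to the meta tokens for anything outside the vocabulary. This is a hedged sketch only: `tokens_to_ids`, the number pattern, and lowercasing tokens before lookup are assumptions, and it uses a toy `word2id` rather than the full table built above. Detection of money and time expressions is deliberately left out because the notebook does not define it.

```python
import re

UNK = '@@@UNK@@@'
NUMBER = '@@@NUM@@@'
# Assumption: a "number" is digits with optional , or . separated digit groups.
NUMBER_PATTERN = re.compile(r"^\d+(?:[.,]\d+)*$")


def tokens_to_ids(line, word2id):
    """Map each whitespace token to an id, using NUMBER and UNK as fallbacks."""
    ids = []
    for token in line.split():
        key = token.lower()  # assumption: the vocabulary is lowercased
        if key in word2id:
            ids.append(word2id[key])
        elif NUMBER_PATTERN.match(key):
            ids.append(word2id[NUMBER])
        else:
            ids.append(word2id[UNK])
    return ids


# Toy vocabulary standing in for the full word2id table.
toy_word2id = {"the": 0, "cost": 1, UNK: 2, NUMBER: 3}
print(tokens_to_ids("The cost rose 1,250", toy_word2id))
```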