# Feature extraction

In this notebook we will learn how to extract different features from a text and how to combine them. The steps are simple, but keeping this part well organized will be really useful later on. So, let's get started!

In [1]:
import nltk
from sklearn.pipeline import FeatureUnion
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import preprocessing

In [2]:
from sklearn.preprocessing import LabelEncoder
import os
import glob
import json
import argparse
import time
import codecs
from collections import defaultdict
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.calibration import CalibratedClassifierCV

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

In [3]:
import re
import random
from sklearn.pipeline import Pipeline

In [4]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/sallyisa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sallyisa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
import os

In [45]:
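from nltk.tokenize import word_tokenize

def get_pos_ngrams(sents):
    # Tag every non-empty text; each entry is a list of (token, tag) pairs.
    # nltk.pos_tag needs a tagger model in addition to 'punkt'
    # (for example nltk.download('averaged_perceptron_tagger')).
    pos_tags = [nltk.pos_tag(word_tokenize(sent)) for sent in sents if sent != '']

    # Replace each text by the space-separated sequence of its PoS tags
    pos_sents = [' '.join(tag for _, tag in sent) for sent in pos_tags]

    # Count PoS unigrams; this token pattern also keeps one-character tokens
    vectorizer = CountVectorizer(ngram_range=(1, 1), token_pattern=r"(?u)\b\w+\b", min_df=1)

    pos_ngram = vectorizer.fit_transform(pos_sents)
    return pos_ngram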

get_pos_ngrams(train_sents)

<7x31 sparse matrix of type '<class 'numpy.int64'>'
	with 184 stored elements in Compressed Sparse Row format>
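
The counting step of `get_pos_ngrams` can be checked on a toy input without running the tagger. The following is a minimal sketch, assuming two hand-written tag sequences (they are made up for illustration and are not tagger output) and `lowercase=False`, which the function above does not set:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, hand-written PoS sequences standing in for tagger output
pos_sents = ["DT NN VBZ DT NN", "PRP VBZ DT JJ NN"]

# lowercase=False keeps the tags as written; the default would lowercase them
vectorizer = CountVectorizer(ngram_range=(1, 1), token_pattern=r"(?u)\b\w+\b",
                             lowercase=False)
pos_counts = vectorizer.fit_transform(pos_sents)

# Recover the column order from the fitted vocabulary
tags = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(tags)                  # one entry per matrix column
print(pos_counts.toarray())  # one row per text, one column per tag
```

With this token pattern, tags made only of punctuation (such as `.` or `,`) are dropped, and a tag like `PRP$` is counted as `PRP`, so the number of columns can be smaller than the number of distinct tags.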

In [105]:
# -*- coding: utf-8 -*-

"""
 A baseline authorship attribution method 
 based on a character n-gram representation
 and a linear SVM classifier.
 It has a reject option to leave documents unattributed
 (when the probabilities of the two most likely training classes are too close)
 
 Questions/comments: stamatatos@aegean.gr

 It can be applied to datasets of PAN-19 cross-domain authorship attribution task
 See details here: http://pan.webis.de/clef19/pan19-web/author-identification.html
 Dependencies:
 - Python 2.7 or 3.6 (we recommend the Anaconda Python distribution)
 - scikit-learn

 Usage from command line: 
    > python pan19-cdaa-baseline.py -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY [-n N-GRAM-ORDER] [-ft FREQUENCY-THRESHOLD] [-pt PROBABILITY-THRESHOLD]
 EVALUATION-DIRECTORY (str) is the main folder of a PAN-19 collection of attribution problems
 OUTPUT-DIRECTORY (str) is an existing folder where the predictions are saved in the PAN-19 format
 Optional parameters of the model:
   N-GRAM-ORDER (int) is the length of character n-grams (default=3)
   FREQUENCY-THRESHOLD (int) is the cutoff threshold used to filter out rare n-grams (default=5)
   PROBABILITY-THRESHOLD (float) is the threshold for the reject option assigning test documents to the <UNK> class (default=0.1)
                                 Let P1 and P2 be the two maximum probabilities of training classes for a test document. If P1-P2<pt then the test document is assigned to the <UNK> class.
   
 Example:

     >  python pan19-cdaa-baseline-svm.py -i ".\pan19-cross-domain-authorship-attribution-training-dataset-2019-01-23\" -o ".\answers-trigram\" -n 3
"""

from __future__ import print_function
import os
import glob
import json
import argparse
import time
import codecs
from collections import defaultdict
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.calibration import CalibratedClassifierCV

def represent_text(text,n):
    # Extracts all character 'n'-grams from a 'text' and counts them
    tokens = []  # stays empty when n<=0, so the function returns no counts
    if n>0:
        tokens = [text[i:i+n] for i in range(len(text)-n+1)]
    frequency = defaultdict(int)
    for token in tokens:
        frequency[token] += 1
    return frequency

represent_text(train_sents[0], 2)
                           

defaultdict(int,
            {'gr': 7,
             'ra': 6,
             'ac': 10,
             'ce': 16,
             'ef': 5,
             'fu': 11,
             'ul': 8,
             'l ': 9,
             ' o': 29,
             'on': 25,
             'ne': 27,
             'es': 13,
             's.': 4,
             '.\n': 18,
             '\n\n': 34,
             '\n"': 9,
             '"O': 1,
             'On': 1,
             'e ': 131,
             ' m': 20,
             'mo': 6,
             'or': 26,
             're': 41,
             'e,': 9,
             ',"': 6,
             '" ': 8,
             ' M': 7,
             'Ma': 11,
             'ar': 27,
             'rv': 11,
             've': 27,
             'el': 28,
             'lo': 22,
             'ou': 44,
             'us': 28,
             's ': 68,
             ' s': 72,
             'sa': 13,
             'ai': 18,
             'id': 12,
             'd,': 13,
             ', ': 42,
             'so': 5,
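
The sliding-window idea behind `represent_text` is easier to verify on a short string. A minimal sketch, assuming a hypothetical helper `char_ngrams` built on `collections.Counter` (it is not part of the baseline script):

```python
from collections import Counter

def char_ngrams(text, n):
    # Slide a window of width n over the text: len(text) - n + 1 windows in total
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("banana", 2)
print(counts)  # 'an' and 'na' occur twice, 'ba' once
```

A text shorter than `n` simply yields an empty counter, because the range is empty.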
    

In [108]:
def read_files(path,label):
    # Reads all text files located in the 'path' and assigns them to 'label' class
    files = glob.glob(path+os.sep+label+os.sep+'*.txt')
    texts=[]
    for filename in files:
        # The context manager closes the file even if reading fails
        with codecs.open(filename,'r',encoding='utf-8') as f:
            texts.append((f.read(),label))
    return texts

def extract_vocabulary(texts,n,ft):
    # Extracts all character n-grams of length 1 to 'n' occurring at least 'ft' times in a set of 'texts'
    occurrences=defaultdict(int)
    for (text,label) in texts:
        text_occurrences = {}
        # A non-integer 'n' contributes no n-grams
        if isinstance(n, int):
            for x in range(1,n+1):
                text_occurrences.update(represent_text(text,x))
        for ngram in text_occurrences:
            occurrences[ngram]+=text_occurrences[ngram]
    # Keep only the n-grams that reach the frequency threshold
    vocabulary=[ngram for ngram in occurrences if occurrences[ngram]>=ft]
    return vocabulary

extract_vocabulary([(x,i) for i, x in enumerate(train_sents)], 3, 5)

['g',
 'r',
 'a',
 'c',
 'e',
 'f',
 'u',
 'l',
 ' ',
 'o',
 'n',
 's',
 '.',
 '\n',
 '"',
 'O',
 'm',
 ',',
 'M',
 'v',
 'i',
 'd',
 'y',
 'b',
 'h',
 't',
 'S',
 '’',
 'J',
 'k',
 '(',
 'j',
 ')',
 'H',
 'w',
 ';',
 'x',
 'p',
 'B',
 '-',
 'L',
 'T',
 'A',
 'z',
 'Y',
 '“',
 '”',
 'I',
 'Z',
 '?',
 '—',
 'W',
 'C',
 '!',
 'gr',
 'ra',
 'ac',
 'ce',
 'ef',
 'fu',
 'ul',
 'l ',
 ' o',
 'on',
 'ne',
 'es',
 's.',
 '.\n',
 '\n\n',
 '\n"',
 'e ',
 ' m',
 'mo',
 'or',
 're',
 'e,',
 ',"',
 '" ',
 ' M',
 'Ma',
 'ar',
 'rv',
 've',
 'el',
 'lo',
 'ou',
 'us',
 's ',
 ' s',
 'sa',
 'ai',
 'id',
 'd,',
 ', ',
 'so',
 'un',
 'nd',
 'di',
 'in',
 'ng',
 'g ',
 ' r',
 'ro',
 'oy',
 'al',
 'll',
 'ly',
 'y ',
 ' b',
 'bo',
 'ed',
 'd ',
 ' f',
 'fr',
 'om',
 'm ',
 ' h',
 'hi',
 'is',
 'se',
 'ea',
 'at',
 't.',
 'Sh',
 'he',
 'e’',
 '’s',
 ' t',
 'ti',
 'ir',
 ' J',
 'Jo',
 'oe',
 'th',
 'ho',
 'ug',
 'gh',
 'h ',
 ' n',
 'no',
 'ot',
 't ',
 ' u',
 'nk',
 'ki',
 'y.',
 '. ',
 ' (',
 '(t',
 'uc',

In [117]:
def baseline(path,outpath,n=3,ft=5,pt=0.1):
    start_time = time.time()
    # Reading information about the collection
    infocollection = path+os.sep+'collection-info.json'
    problems = []
    language = []
    with open(infocollection, 'r') as f:
        for attrib in json.load(f):
            problems.append(attrib['problem-name'])
            language.append(attrib['language'])
    for index,problem in enumerate(problems):
        print(problem)
        # Reading information about the problem
        infoproblem = path+os.sep+problem+os.sep+'problem-info.json'
        candidates = []
        with open(infoproblem, 'r') as f:
            fj = json.load(f)
            unk_folder = fj['unknown-folder']
            for attrib in fj['candidate-authors']:
                candidates.append(attrib['author-name'])
        # Building training set
        train_docs=[]
        for candidate in candidates:
            train_docs.extend(read_files(path+os.sep+problem,candidate))
        train_texts = [text for text,label in train_docs]
        train_labels = [label for text,label in train_docs]
        vocabulary = extract_vocabulary(train_docs,n,ft)
        # Note: with ngram_range=(n,n) only n-grams of length n are counted, so
        # vocabulary entries shorter than n produce all-zero columns
        vectorizer = CountVectorizer(analyzer='char',ngram_range=(n,n),lowercase=False,vocabulary=vocabulary)
        train_data = vectorizer.fit_transform(train_texts)
        train_data = train_data.astype(float)
        print(train_data.shape)
        for i,v in enumerate(train_texts):
            train_data[i]=train_data[i]/len(train_texts[i]) # divide the counts by the text length in characters
        print('\t', 'language: ', language[index])
        print('\t', len(candidates), 'candidate authors')
        print('\t', len(train_texts), 'known texts')
        print('\t', 'vocabulary size:', len(vocabulary))
        # Building test set
        test_docs=read_files(path+os.sep+problem,unk_folder)
        test_texts = [text for text,label in test_docs]
        test_data = vectorizer.transform(test_texts)
        test_data = test_data.astype(float)
        for i,v in enumerate(test_texts):
            test_data[i]=test_data[i]/len(test_texts[i])
        print('\t', len(test_texts), 'unknown texts')
        # Applying SVM
        max_abs_scaler = preprocessing.MaxAbsScaler()
        scaled_train_data = max_abs_scaler.fit_transform(train_data)
        scaled_test_data = max_abs_scaler.transform(test_data)
        clf=CalibratedClassifierCV(OneVsRestClassifier(SVC(C=1)))
        clf.fit(scaled_train_data, train_labels)
        predictions=clf.predict(scaled_test_data)
        proba=clf.predict_proba(scaled_test_data)
        # Reject option (used in open-set cases)
        count=0
        for i,p in enumerate(predictions):
            sproba=sorted(proba[i],reverse=True)
            if sproba[0]-sproba[1]<pt:
                predictions[i]=u'<UNK>'
                count=count+1
        print('\t',count,'texts left unattributed')
        # Saving output data
        out_data=[]
        unk_filelist = glob.glob(path+os.sep+problem+os.sep+unk_folder+os.sep+'*.txt')
        pathlen=len(path+os.sep+problem+os.sep+unk_folder+os.sep)
        for i,v in enumerate(predictions):
            out_data.append({'unknown-text': unk_filelist[i][pathlen:], 'predicted-author': v})
        with open(outpath+os.sep+'answers-'+problem+'.json', 'w') as f:
            json.dump(out_data, f, indent=4)
        print('\t', 'answers saved to file','answers-'+problem+'.json')
    print('elapsed time:', time.time() - start_time)

base_dir='pan18-cross-domain-authorship-attribution-training-dataset-2017-12-02'
out_dir = base_dir+os.sep+'output-dir'
eval_dir = base_dir+os.sep+'eval-dir'
baseline(base_dir,out_dir,n=5,ft=3,pt=0.05)

problem00001
(140, 64239)
	 language:  en
	 20 candidate authors
	 140 known texts
	 vocabulary size: 64239
	 105 unknown texts




	 30 texts left unattributed
	 answers saved to file answers-problem00001.json
problem00002
(35, 26761)
	 language:  en
	 5 candidate authors
	 35 known texts
	 vocabulary size: 26761
	 21 unknown texts




	 7 texts left unattributed
	 answers saved to file answers-problem00002.json
elapsed time: 32.0657012462616
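
The reject option is the part of `baseline` that produces the "texts left unattributed" counts above. A minimal sketch of the rule in isolation, assuming a hypothetical helper `apply_reject_option` and made-up probabilities (plain lists are used here instead of the arrays returned by `predict_proba`):

```python
def apply_reject_option(labels, probabilities, pt=0.1):
    # labels: predicted label per document
    # probabilities: per-document sequence of class probabilities
    decided = []
    for label, probs in zip(labels, probabilities):
        top = sorted(probs, reverse=True)
        # Too small a gap between the two best classes: leave the text unattributed
        decided.append('<UNK>' if top[0] - top[1] < pt else label)
    return decided

result = apply_reject_option(
    ['candidate00001', 'candidate00002'],
    [[0.50, 0.45, 0.05], [0.10, 0.70, 0.20]],
)
print(result)  # the first document is rejected, the second keeps its label
```

A larger `pt` rejects more documents, which trades recall on the known authors for fewer wrong attributions.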


In [118]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
# Evaluation script for the Cross-Domain Authorship Attribution task @PAN2019.
We use the F1 metric (macro-average) as implemented in scikit-learn:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
We include the following ad hoc rules:
- If authors are predicted which were not seen during training,
  these predictions will count as false predictions ('<UNK>' class)
  and they will negatively affect performance.
- If texts are left unattributed they will be assigned to the ('<UNK>'
  class) and they will negatively affect performance.
- The <UNK> class is excluded from the macro-average across classes.
- If multiple test attributions are given for a single unknown document,
  only the first one will be taken into consideration.

Dependencies:
- Python 2.7 or 3.6 (we recommend the Anaconda Python distribution)
- scikit-learn

Usage from the command line:
>>> python pan19-cdaa-evaluator.py -i COLLECTION -a ANSWERS -o OUTPUT
where
    COLLECTION is the path to the main folder of the evaluation collection
    ANSWERS is the path to the answers folder of a submitted method
    OUTPUT is the path to the folder where the results of the evaluation will be saved

Example: 
>>>  python pan19-cdaa-evaluator.py -i ".\pan19-cross-domain-authorship-attribution-training-dataset-2019-01-23\" -a ".\answers-unigram" -o ".\eval-unigram\"

# References:
@article{scikit-learn,
 title={Scikit-learn: Machine Learning in {P}ython},
 author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
 journal={Journal of Machine Learning Research},
 volume={12},
 pages={2825--2830},
 year={2011}
}
"""

import argparse
import os
import json
import warnings
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np

def eval_measures(gt, pred):
    """Compute macro-averaged F1-scores, macro-averaged precision, 
    macro-averaged recall, and micro-averaged accuracy according to the ad hoc
    rules discussed at the top of this file.
    Parameters
    ----------
    gt : dict
        Ground truth, where keys indicate text file names
        (e.g. `unknown00002.txt`), and values represent
        author labels (e.g. `candidate00003`)
    pred : dict
        Predicted attribution, where keys indicate text file names
        (e.g. `unknown00002.txt`), and values represent
        author labels (e.g. `candidate00003`)
    Returns
    -------
    f1 : float
        Macro-averaged F1-score
    precision : float
        Macro-averaged precision
    recall : float
        Macro-averaged recall
    Note that micro-averaged accuracy is also computed inside the
    function, but it is not returned.
    """

    actual_authors = list(gt.values())
    encoder = LabelEncoder().fit(['<UNK>'] + actual_authors)

    text_ids, gold_authors, silver_authors = [], [], []
    for text_id in sorted(gt):
        text_ids.append(text_id)
        gold_authors.append(gt[text_id])
        try:
            silver_authors.append(pred[text_id])
        except KeyError:
            # missing attributions get <UNK>:
            silver_authors.append('<UNK>')

    assert len(text_ids) == len(gold_authors)
    assert len(text_ids) == len(silver_authors)

    # replace non-existent silver authors with '<UNK>':
    silver_authors = [a if a in encoder.classes_ else '<UNK>' 
                      for a in silver_authors]

    gold_author_ints = encoder.transform(gold_authors)
    silver_author_ints = encoder.transform(silver_authors)

    # get F1 for individual classes (and suppress warnings):
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        # Exclude the <UNK> class (a new list avoids removing items while iterating)
        unk_int = encoder.transform(['<UNK>'])[0]
        labels = [x for x in set(gold_author_ints) if x != unk_int]
        # 'labels' is keyword-only in recent scikit-learn versions
        f1 = f1_score(gold_author_ints,
                  silver_author_ints,
                  labels=labels,
                  average='macro')
        precision = precision_score(gold_author_ints,
                  silver_author_ints,
                  labels=labels,
                  average='macro')
        recall = recall_score(gold_author_ints,
                  silver_author_ints,
                  labels=labels,
                  average='macro')
        # accuracy is computed for reference but not returned
        accuracy = accuracy_score(gold_author_ints,
                  silver_author_ints)

    return f1,precision,recall

def evaluate(ground_truth_file,predictions_file):
    # Calculates evaluation measures for a single attribution problem
    gt = {}
    with open(ground_truth_file, 'r') as f:
        for attrib in json.load(f)['ground_truth']:
            gt[attrib['unknown-text']] = attrib['true-author']

    pred = {}
    with open(predictions_file, 'r') as f:
        for attrib in json.load(f):
            if attrib['unknown-text'] not in pred:
                pred[attrib['unknown-text']] = attrib['predicted-author']
    f1,precision,recall =  eval_measures(gt,pred)
    return round(f1,3), round(precision,3), round(recall,3)

def evaluate_all(path_collection,path_answers,path_out):
    # Calculates evaluation measures for a PAN-18 collection of attribution problems
    infocollection = path_collection+os.sep+'collection-info.json'
    problems = []
    data = []
    with open(infocollection, 'r') as f:
        for attrib in json.load(f):
            problems.append(attrib['problem-name'])
    scores=[]
    for problem in problems:
        f1,precision,recall=evaluate(path_collection+os.sep+problem+os.sep+'ground-truth.json',path_answers+os.sep+'answers-'+problem+'.json')
        scores.append(f1)
        data.append({'problem-name': problem, 'macro-f1': round(f1,3), 'macro-precision': round(precision,3), 'macro-recall': round(recall,3)})
        print(str(problem),'Macro-F1:',round(f1,3))
    overall_score=sum(scores)/len(scores)
    # Saving data to output files (out.json and evaluation.prototext)
    with open(path_out+os.sep+'out.json', 'w') as f:
        json.dump({'problems': data, 'overall_score': round(overall_score,3)}, f, indent=4, sort_keys=True)
    print('Overall score:', round(overall_score,3))
    prototext='measure {\n key: "mean macro-f1"\n value: "'+str(round(overall_score,3))+'"\n}\n'
    with open(path_out+os.sep+'evaluation.prototext', 'w') as f:
        f.write(prototext)
   
evaluate_all(base_dir,out_dir,eval_dir)

problem00001 Macro-F1: 0.559
problem00002 Macro-F1: 0.633
Overall score: 0.596
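
The effect of leaving `<UNK>` out of the macro-average can be seen on a handful of labels. A minimal sketch, assuming made-up gold and predicted labels with two authors `A` and `B`:

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels for five unknown texts
gold = ['A', 'A', 'B', 'B', '<UNK>']
pred = ['A', '<UNK>', 'B', 'B', 'A']

# Average only over the real authors; '<UNK>' is left out of the label list
authors = sorted(set(gold) - {'<UNK>'})
macro_f1 = f1_score(gold, pred, labels=authors, average='macro')
print(authors, round(macro_f1, 3))
```

Author `A` gets precision 0.5 and recall 0.5 (one text wrongly left unattributed, one `<UNK>` text wrongly assigned to `A`), author `B` is perfect, so the macro-average is 0.75. Mistakes involving `<UNK>` still hurt the real authors' scores even though `<UNK>` itself is not averaged.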


In [23]:
def process_dir_files(path):
    dir_files = []
    for file in os.listdir(path):
        current = os.path.join(path, file)
        if os.path.isfile(current):
            dir_files.append(open_file(current))
    return dir_files
                             

In [40]:
def open_file(path):
    # Read-only access is enough here; strip each line and rejoin with newlines
    with open(path, 'r', encoding='utf-8') as f:
        return '\n'.join([line.strip() for line in f])


train_sents= process_dir_files('pan18-cross-domain-authorship-attribution-training-dataset-2017-12-02/problem00001/candidate00001')
train_sents
                  

['graceful ones.\n\n"One more," Marvelous said, sounding royally bored from his seat.\n\n"She’s tired," Joe said, though not unkindly. (the fucking jerk).\n\nHe was right; her muscles have long since turned to cotton with exhaustion and her knees refused to support her upright. But damned if she was going to take Joe’s offered hand to help her up, and damned if she was going to make herself look like a weakling in front of Marvelous, who - whether she liked it or not - was the fucking captain.\n\nLuka was determined to prove her worth to both of them, no matter what it took.\n\nSo she slapped Joe’s hand away and pushed herself off the fucking floor, and aimed the sword at his neck, willing her arm not to betray her.\n\n"Three more," she said, voice trembling, and Joe responded with a smirk.\n\nMarvelous yawned.\n\n(fucking jerks, both of them.)\n\n-\n\nA few weeks into their training, and Luka realized Joe was just that good in using his sword.\n\n"You have to learn how to read your op


### Exercise 5: Combine all features for each sentence.

Combine all the previous features into a single matrix that encodes unigrams, bigrams, trigrams and PoS tags. The resulting matrix should have the following dimensions: 3x31

You could use the `sklearn.pipeline.FeatureUnion` class.
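
`FeatureUnion` fits each transformer on the same input and stacks the resulting matrices side by side. A minimal sketch on two made-up sentences, assuming word unigrams and word bigrams as the two feature blocks (the step names are arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy sentences
docs = ["the cat sat", "the dog sat down"]

union = FeatureUnion([
    ("word_unigrams", CountVectorizer(ngram_range=(1, 1))),
    ("word_bigrams", CountVectorizer(ngram_range=(2, 2))),
])
features = union.fit_transform(docs)

# After fitting, each vectorizer holds its own vocabulary
n_uni = len(union.transformer_list[0][1].vocabulary_)
n_bi = len(union.transformer_list[1][1].vocabulary_)
print(features.shape, n_uni, n_bi)
```

The number of columns of the combined matrix is the sum of the columns of the individual blocks, which is a quick sanity check when combining more feature types.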

In [None]:
# Tokenizer that maps a sentence to its sequence of PoS tags
def pos_tokenizer(sentence):
    return [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]

# FeatureUnion needs transformers, so the PoS tags are counted with a second CountVectorizer
union = FeatureUnion([("ngrams", CountVectorizer(ngram_range = (1,3), min_df = 1)),
                      ("PoS", CountVectorizer(tokenizer = pos_tokenizer, lowercase = False))])

# train_sentences is the list of sentences used in the previous exercises
combined = union.fit_transform(train_sentences)


### Extra to play with: check this website and think about it. Do you think you could use this for something (for example, in the exam)?

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

## SHARE YOUR KNOWLEDGE!

### Do you know any other way of representing the features of the training/testing set?

Please share your knowledge on the Absalon forum!