Brace yourselves, friends, 'cause this is going to be long. I'll try to keep you entertained.
I'm going to create a huge document where I want to accomplish two different goals.

<h1>Goals</h1>
1. Introduce people who are not necessarily in the field of NLP to the task of text classification.
1. Present my stylometry framework and my specific approach to the author identification task.

And given the length of the document, I also provide an index:
1. Introduction to the task
1. Feature Engineering
1. Author Identification (and profiling)
1. Approach description
1. Framework description

Feel free to skip any part if you feel it is too basic/broad/boring.

<h1>1. Introduction to the task</h1>

First of all, let's talk about document classification. This is a classic task in the field of NLP in which the goal is to classify texts with respect to a predefined set of candidate categories. Each one of these categories will be referred to as a class or label from now on.  
Some examples of document classification tasks:
* Given a web page and a query (google search query for example), determine if a document is relevant.
* Given a text, determine if the content is considered positive or negative.
* Given a word, determine which one of its senses is being used.
* Given a set of assignments, determine if there have been cases of plagiarism.
* Given a text, determine if it contains hate speech or not.
...
So, yeah, there are maaaaany examples and each one of them is a whole area of study in the NLP community.

Let us use this task as an example. We have a corpus of texts written by 3 different authors. The goal is to determine which of the three candidates wrote each unseen text.  

**How do we do that?**

We could manually code some rules to classify them (e.g., if "Cthulhu" appears, then H.P. Lovecraft is the answer). This sort of rule-based approach is usually not scalable: if an unexpected case appears, new rules need to be implemented. 
A popular alternative is to use machine learning. The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience. Machine learning algorithms learn from the provided data and extract underlying regular patterns from it. The extracted patterns can then be applied to the classification of unseen data instances. This is the approach that I'm going to be using.

The typical machine learning-driven text classification approach works like this:
1) we determine which features of the text are distinctive (i.e., can help distinguish between text classes).
2) we extract said features.       
3) we feed the feature vectors, along with the correct classes, to a standard supervised machine learning algorithm (such as SVM, Random Forests, etc.). I'm assuming there exists a set of texts for which "the correct answer" is known, usually called the training set.
4) we use the model that is extracted by the machine learning algorithm to classify unseen instances. 
5) if we have the correct classes of the unseen instances, we evaluate how well we did.
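To make the five steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the texts are invented, the two features are hand-picked, and a toy nearest-centroid classifier stands in for a real algorithm such as an SVM. It is not the approach used later in this document.

```python
from collections import Counter

# Toy corpus: invented snippets, pre-tokenized with spaces, for two made-up authors.
train = [
    ("the old house , dark and silent , waited", "A"),
    ("dark , cold , and silent was the night", "A"),
    ("she ran home and she laughed and she sang", "B"),
    ("he came and he saw and he left", "B"),
]
held_out = [
    ("silent , grey , and still", "A"),
    ("they sang and they danced", "B"),
]

def extract_features(text):
    # Steps 1-2: two hand-picked features, relative frequency of "," and "and".
    tokens = text.split()
    counts = Counter(tokens)
    return [counts[","] / len(tokens), counts["and"] / len(tokens)]

def fit(examples):
    # Step 3: "learn" one mean feature vector (centroid) per author.
    grouped = {}
    for text, label in examples:
        grouped.setdefault(label, []).append(extract_features(text))
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def predict(model, text):
    # Step 4: pick the author whose centroid is closest (squared Euclidean distance).
    vec = extract_features(text)
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(vec, model[label]))
    return min(model, key=distance)

model = fit(train)
predictions = [predict(model, text) for text, _ in held_out]

# Step 5: accuracy, i.e. the fraction of held-out texts attributed correctly.
accuracy = sum(p == gold for p, (_, gold) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

With real data you would swap `fit` and `predict` for an off-the-shelf classifier; the shape of the pipeline stays the same.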

All right, still with me folks? So, given the presented steps, which one do you think is the critical step? 

There are plenty of machine learning algorithms that can be easily used off-the-shelf and that work very well. Extracting features is just a matter of coding, and the evaluation usually consists of applying a formula to the output of the classifier. The choice of feature set is therefore the key component in this type of approach. This process of designing and selecting features is often called "feature engineering".

<h1>2. Feature Engineering</h1>
Feature engineering is a vaguely defined set of tasks related to designing feature sets for machine learning applications (in some cases, it is considered an art). The first step towards a well-designed feature set is to understand the properties of the problem at hand and assess how they might interact with the chosen classifier. After understanding the problem, hypotheses need to be drawn. Feature engineering is thus a cycle: a set of features is proposed, experiments with this feature set are performed, and, after analyzing the results, the feature set is modified until the performance is satisfactory. 

Although it is often possible to obtain competitive performance using fairly simple and obvious sets of features, there is room for significant performance improvement. Carefully constructed feature sets require thorough understanding of the task at hand, but can significantly outperform basic feature sets. In short, better features mean better results.

The data needs to be characterized by a group of features that differentiate between the instances that belong to a class with respect to the other classes. Irrelevant or partially relevant features can negatively impact the performance of the classifier. An example of an irrelevant feature would be one that takes a fixed value for any instance in the input data.
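As a small illustration of that last point, the sketch below finds and drops features that take a single value across all instances. The matrix and the helper names (`constant_columns`, `drop_columns`) are made up for this example.

```python
# Rows are instances, columns are features; the values are invented.
X = [
    [0.12, 1.0, 3.4],
    [0.30, 1.0, 2.9],
    [0.25, 1.0, 3.1],
]

def constant_columns(matrix):
    # A column with a single distinct value carries no information for the classifier.
    n_cols = len(matrix[0])
    return [j for j in range(n_cols) if len({row[j] for row in matrix}) == 1]

def drop_columns(matrix, to_drop):
    # Return a copy of the matrix without the given column indices.
    skip = set(to_drop)
    return [[v for j, v in enumerate(row) if j not in skip] for row in matrix]

useless = constant_columns(X)
X_reduced = drop_columns(X, useless)
print(useless)
print(X_reduced)
```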

Optimal feature selection helps the algorithms extract patterns that generalize to unseen instances without needing complex parametrization of the classifier to perform competitively, preventing overfitting. Models created by the machine learning algorithm which contain the  "knowledge" extracted from training data are faster to run, easier to understand and to maintain if the feature set is appropriate.









After talking about the general task of text classification, let us talk about the specific task at hand.

<h1>3. Author Identification</h1>

The task at hand can be considered an author identification problem. The goal is to determine the author of a text from a predefined set of candidate authors.

So, the question is, are we the first ones to think about this? 

The answer is...

Yeah, no.

Given how wrong you were if you thought the answer was YES, let me give you a...

<div style="color:red;background:black;">**HISTORY LESSON!**</div>

(real shame the marquee tag didn't work)

One of the first proposals that implemented a data-driven author identification system is:

Frederick Mosteller and David L. Wallace, ‘Inference in an authorship problem’, Journal of the American Statistical Association, 58(302), 275–309 (1963). 

Yeah, 1963! (some of us were not even planned yet)

The authors tried to clarify the authorship of the Federalist Papers drawing upon function words and Naive Bayes classification. The Federalist Papers is a collection of 85 articles and essays written by Alexander Hamilton, James Madison, and John Jay to promote the ratification of the United States Constitution. The authorship of 73 of the essays is fairly certain. The remaining 12 are disputed and have been studied by several scholars.

The other classic problem involves this guy:
![](https://www.biography.com/.image/c_fill%2Ccs_srgb%2Cg_face%2Ch_300%2Cq_80%2Cw_300/MTE1ODA0OTcxNzgzMzkwNzMz/william-shakespeare-194895-1-402.jpg)

He looks like he is hiding something.

Long story short, some people believe that William Shakespeare did not actually write some of his best plays. His biography, humble origins and obscure life (not much is known of his personal life) made people question how, given his background, he could be the greatest writer of all time. So, many researchers have used NLP to investigate the question. None of the results are absolutely conclusive, but it seems that some of his works are stylistically related to other authors. If you want to read more about the topic, refer to:

* Hierarchical and Non-Hierarchical Linear and Non-Linear Clustering Methods to "Shakespeare Authorship Question" by Refat Aljumily 

and 

* Neural computation in stylometry I: An application to the works of Shakespeare and Fletcher by Robert Matthews and Thomas Merriam.

Another very interesting authorship problem that was in the news a couple of years ago is the case of Robert Galbraith. In 2013 a novel called The Cuckoo’s Calling was published by an unknown author called Robert Galbraith. A newspaper received an anonymous tip that this author was actually J.K. Rowling, who wanted to publish a crime novel without the whole "Harry Potter" universe affecting its success. Said newspaper wanted to prove it and hired a researcher in the field of author identification. The researcher observed many similarities between the styles of J.K. Rowling and Robert Galbraith. J.K. Rowling then admitted it was her all along (probably bummed out to be outed by us nerds).

<div style="color:red;background:black;">**END OF HISTORY LESSON!**</div>

So yeah, these are some of the mainstream cases. But there is MUCH more previous work on the task. Author identification is often applied in forensic linguistics (here you can see the kind of linguistic analysis that can be done in a police investigation: [https://www2.fbi.gov/publications/leb/1996/oct964.txt](https://www2.fbi.gov/publications/leb/1996/oct964.txt)).

How do some of these approaches tackle this task? What features do they use? Let me present some examples:
* Character/token n-grams (frequencies of sequences of characters/words)
* Frequency of specific words (the most relevant, look up function words, tf-idf computation and all of that good stuff)
* Usage of specific parts of speech (frequencies of adjectives, nouns, etc).
* Punctuation mark usage (very stylistic in some languages).
* Syntactic structural features (analysis on the syntactic trees of the sentences).
* Topic models
* Word embeddings
...

and a large ETC.
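To give a feel for the first item in the list, here is a minimal sketch of character n-gram relative frequencies using only the standard library. The function name and the example sentence are mine, not part of the framework described later.

```python
from collections import Counter

def char_ngram_frequencies(text, n=3):
    # Slide a window of n characters over the text and count each sequence.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    # Normalize by the number of n-grams so texts of different lengths are comparable.
    return {gram: count / len(grams) for gram, count in counts.items()}

freqs = char_ngram_frequencies("the cat and the hat", n=3)
print(sorted(freqs.items(), key=lambda item: -item[1])[:3])
```

Token n-grams work the same way, with a list of words in place of the string.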


<h1>4. Approach description</h1>

At this point you might be thinking: "WTF IS THIS, I WANT TO SEE CODE" or "WOW, THIS IS MARVELOUS". I am especially eloquent, basically because I defended my PhD thesis on author identification and profiling (instead of identifying the author of a text, profiling tries to identify demographic traits of the author: gender, age, etc.) a couple of months ago. 

So now that I have put everything in a bit of context, let me introduce my approach and then I'll explain the Python framework I created (during the development of my PhD thesis) and show you some of the things it can do. 

An overview of the flow of the system is shown in the following image:

![](https://i.imgur.com/ljqGT9Y.png)

So, it seems like a standard text classification machine learning flow. The only peculiar thing is that I use both syntactic parsing and some dictionaries. 

The important part of the system is the selected feature set. 

The feature set is composed of six subgroups of features:

**Character-based features**

are composed of the ratios between upper case characters, periods, commas, parentheses, exclamations, colons,
number digits, semicolons, hyphens and quotation marks and the total number of characters in a text.

**Word-based features**

are composed of the mean number of characters per word, vocabulary richness, acronyms, stopwords, first person pronouns, usage of words composed by two or three characters, standard deviation of word length and the difference between the longest and shortest words.
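A rough sketch of a few of these word-based measures follows. Note the assumptions: I take "vocabulary richness" to be the type-token ratio here, which is only one of several possible definitions, and the function name and sample sentence are invented for the example.

```python
from statistics import mean, pstdev

def word_based_features(tokens):
    # Keep alphabetic tokens only and ignore case.
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return {}
    lengths = [len(w) for w in words]
    return {
        "mean_word_length": mean(lengths),
        "std_word_length": pstdev(lengths),
        "length_range": max(lengths) - min(lengths),
        # Assumption: vocabulary richness approximated by distinct words / total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of words made of two or three characters.
        "short_word_ratio": sum(1 for n in lengths if n in (2, 3)) / len(words),
    }

feats = word_based_features("The man in black fled across the desert".split())
print(feats)
```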

**Sentence-based features**

are composed of the mean number of words per sentence, standard deviation of words per sentence and the difference between the maximum and minimum number of words per sentence in a text.

**Dictionary-based features**

consist of the ratios of discourse markers, interjections, abbreviations, curse words, and polar words (positive and negative words in polarity dictionaries) with respect to the total number of words in a text.
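The idea behind these ratios can be sketched in a few lines. The two mini-lexicons below are hypothetical stand-ins; the actual dictionaries are much larger and are not reproduced here.

```python
# Hypothetical mini-lexicons, for illustration only.
DISCOURSE_MARKERS = {"however", "therefore", "moreover"}
INTERJECTIONS = {"oh", "wow", "ouch"}

def dictionary_ratio(tokens, lexicon):
    # Fraction of the tokens in a text that appear in the given dictionary.
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in lexicon)
    return hits / len(tokens)

tokens = "Oh wow however it works".split()
interjection_ratio = dictionary_ratio(tokens, INTERJECTIONS)
marker_ratio = dictionary_ratio(tokens, DISCOURSE_MARKERS)
print(interjection_ratio, marker_ratio)
```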

**Syntactic features**

Three types of syntactic features are distinguished:

Part-of-Speech features are given by the relative frequency of each PoS tag in a text (we use the Penn Treebank tagset: [http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)), the relative frequency of comparative/superlative adjectives and adverbs, and the relative frequency of the present and past tenses. In addition to the fine-grained Penn Treebank tags, we introduce general grammatical categories (such as "verb", "noun", etc.) and calculate their frequencies.

Dependency features reflect the occurrence of syntactic dependency relations in the dependency trees of the text. The tagset used by the parser is the standard Penn Treebank dependency tagset. We extract the frequency of each individual dependency relation per sentence, the percentage of modifier relations used per tree, the frequency of adverbial dependencies (they give information on manner, direction, purpose, etc.), the ratio of modal verbs with respect to the total number of verbs, and the percentage of verbs that appear in complex tenses, referred to as "verb chains" (VCs).

Tree features measure the tree width, the tree depth and the ramification factor. Tree depth is defined as the maximum number of nodes between the root and a leaf node; the width is the maximum number of siblings at any of the levels of the tree; and the ramification factor is the mean number of children per level. In other words, the tree  features characterize the complexity of the dependency structure of the sentences. 

These measures are also applied to subordinate and coordinate clauses.
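Here is a small worked example of depth and width on a single dependency tree. The head indices are copied from the sample conll sentence shown in the framework section ("The man in black fled across the desert, and the gunslinger followed."). Two assumptions of this sketch: depth is counted in arcs from the root to the deepest leaf, and width is taken as the largest number of children under any single node. The ramification factor is left out.

```python
from collections import defaultdict

# token id -> head id (0 marks the root), taken from the sample parse.
heads = {1: 2, 2: 5, 3: 2, 4: 3, 5: 0, 6: 5, 7: 8, 8: 6, 9: 5,
         10: 5, 11: 12, 12: 13, 13: 10, 14: 5}

# Invert the head map into a parent -> children map.
children = defaultdict(list)
root = None
for node, head in heads.items():
    if head == 0:
        root = node
    else:
        children[head].append(node)

def depth(node):
    # Number of arcs on the longest path from this node down to a leaf.
    kids = children.get(node, [])
    if not kids:
        return 0
    return 1 + max(depth(kid) for kid in kids)

max_depth = depth(root)
max_width = max(len(kids) for kids in children.values())
print(root, max_depth, max_width)
```

The root is token 5 ("fled"); the deepest path goes through "and", "followed" and "gunslinger" down to "the".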

(I usually use discourse features as well, but I'm going to pass in this case)

**Lexical features**

Super simple, just the frequency of the N most frequent words in the training set.
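In code, the idea looks roughly like this. It is a sketch with invented sentences and hypothetical helper names (`top_n_words`, `lexical_vector`), not the framework's own lexical feature class.

```python
from collections import Counter

# Invented, pre-tokenized training documents.
train_docs = [
    "the man fled and the gunslinger followed".split(),
    "the desert was huge and the sky and the sand burned".split(),
]

def top_n_words(docs, n):
    # Vocabulary: the n most frequent words over the whole training set.
    counts = Counter(word for doc in docs for word in doc)
    return [word for word, _ in counts.most_common(n)]

def lexical_vector(tokens, vocabulary):
    # One feature per vocabulary word: its relative frequency in this text.
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in vocabulary]

vocab = top_n_words(train_docs, n=2)
vector = lexical_vector("the man and the dog".split(), vocab)
print(vocab, vector)
```

The vocabulary is computed on the training set only, so unseen texts are described with the same N words.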

The full set is composed of fewer than 200 features, and in several tasks it performs at state-of-the-art level.



<h1>5. Framework Description</h1>

All right, if you have jumped to this section directly, welcome!

If you have stuck around in my long rant, you deserve a treat!
![](https://media.giphy.com/media/HlYYLuI3WsAW4/giphy.gif)

Let's get to the code.

In my PhD research I had to code a lot, so I ended up creating a framework to perform author profiling/identification/text classification problems. 

Let's see the general file system organization:

TreeLib/

        tree.py
        treeOperations.py

dicts/       

        several dictionaries

featureClasses/

        characterBasedFeatures.py
        dictionaryBasedFeatures.py
        lexicalFeatures.py
        sentenceBasedFeatures.py
        syntacticFeatures.py
        utils.py
        wordBasedFeatures.py

featureManager.py

instanceManager.py


The instance and feature manager classes are key, so let us see some code (FINALLY).





This code models each feature individually and the concept of a feature set.
Each Feature has a name and a value.
The feature set has a feature dict, in which you can access a specific feature like this:

self.featureDict["typeOfFeature"]["nameOfFeature"]
Each group of features will be a type of feature: characterBased, wordBased...

In [None]:
class Feature:

	def __init__(self, featureName, featureValue):
		self.name = featureName
		self.value = featureValue

	def __repr__(self):
		return str(self.value)

class FeatureSet:

	def __init__(self):
		self.featureDict = {}

	def __repr__(self):
		return str(self.featureDict)

	def initFeatureType(self, featureType):
		self.featureDict[featureType] = {}

	def addFeature(self, featureType, featureName, featureValue):
		self.featureDict[featureType][featureName] = Feature(featureName, featureValue)

	def updateFeature(self, featureType, featureName, increment, operation="sum"):
		if operation == "sum":
			self.featureDict[featureType][featureName].value += increment
		elif operation == "division":
			self.featureDict[featureType][featureName].value /= increment
		else:
			raise ValueError("Incorrect Operation")

	def getFeatureNames(self, featuresSelected=None):
		featureNames = []
		
		if featuresSelected is None:
			featuresSelected = self.featureDict.keys()
		
		for featType in featuresSelected:
			featureNames.extend(self.featureDict[featType].keys())

		return featureNames

	def getFeatureTypeNames(self,featuresSelected=None):
		featureTypeNames = []
		if featuresSelected is None:
			featuresSelected = self.featureDict.keys()

		for featType in featuresSelected:
			for featName in self.featureDict[featType].keys():
				featureTypeNames.append((featType,featName))

		return featureTypeNames

	def getFeatureVector(self, featureTypeNames):
		featureVector = []

		for featType, featName in featureTypeNames:
			featValue = self.featureDict[featType][featName].value
			featureVector.append(featValue)

		return featureVector

Now, we have the classes that represent an Instance (a text transformed into a feature vector) and an Instance collection (the representation of a corpus as vectors).

Each instance has a name (the file name), a FeatureSet, the correct label, the text itself, the tokens and sentences of the text and the tokens in lower case. This way, we have the tokenization and sentence splitting precomputed. The conll that represents the syntactic trees is also stored inside the instance (we will talk about conll later).

The instance collection class contains an array of instances and an instance dict (to directly access them by name), and has functions that, for instance, transform the instance collection to sklearn input format.

In [None]:
from nltk import word_tokenize
import codecs
import nltk
import numpy as np  # used by InstanceCollection.getMeanFeatValuesPerClass
import os

class Instance:

	def __init__(self, name, label, paths):
		self.name = name
		self.featureSet = FeatureSet()
		self.label = label
		tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')	
		self.paths = paths
		self.text = codecs.open(self.paths["clean"],"r", encoding="utf-8").read()
		if "synParsed" in self.paths:
			self.conll = codecs.open(self.paths["synParsed"],"r", encoding="utf-8").read()
		else:
			self.conll = None
		self.tokens = word_tokenize(self.text)
		self.lowerTokens = self.text.lower().split()
		self.sentences = tokenizer.tokenize(self.text)

	def getFeatureNames(self, featuresSelected):
		return self.featureSet.getFeatureNames(featuresSelected)

	def getFeatureTypeNames(self, featuresSelected):
		return self.featureSet.getFeatureTypeNames(featuresSelected)

	def getFeatureVector(self, featuresSelected):
		return self.featureSet.getFeatureVector(featuresSelected)

	def initFeatureType(self, featureType):
		self.featureSet.initFeatureType(featureType)

	def addFeature(self, featureType, featureName, featureValue):
		self.featureSet.addFeature(featureType, featureName, featureValue)

	def updateFeature(self, featureType, featureName, increment, operation="sum"):
		self.featureSet.updateFeature(featureType, featureName, increment, operation)

	def __repr__(self):
		return self.name + "\n" + self.label+ "\n" + str(self.featureSet) #+ "\n"+str(self.tokens)

	def getSklearnInput(self):
		featureTypeNames = self.getFeatureTypeNames(None)
		X = self.getFeatureVector(featureTypeNames)
		return X, self.label, self.name.split("_")[0]

class InstanceCollection:

	def __init__(self):
		self.instances = []
		self.labels = set()
		self.instanceDict = {}
		self.featurePath = "path to store precomputed features"

	def __repr__(self):
		strCollection = ""
		for instance in self.instances:
			strCollection += "---------\n"+ str(instance) +"\n---------"
		return strCollection
    
	def initFeatureType(self, featureType):
		for instance in self.instances:
			instance.initFeatureType(featureType)

	def addInstance(self, instance):
		self.instances.append(instance)
		self.instanceDict[instance.name] = instance
		self.labels.add(instance.label)


	def getFeatureNames(self, featuresSelected):
		return self.instances[0].getFeatureNames(featuresSelected)

	def getFeatureTypeNames(self, featuresSelected):
		return self.instances[0].getFeatureTypeNames(featuresSelected)

	def getSklearnInput(self, featuresSelected = None):
		X = []
		Y = []

		featureTypeNames = self.getFeatureTypeNames(featuresSelected)

		for instance in self.instances:
			featureVector = instance.getFeatureVector(featureTypeNames)
			X.append(featureVector)
			Y.append(instance.label)

		return X, Y

	def getMeanFeatValuesPerClass(self, featuresSelected=None):
		featureTypeNames = self.getFeatureTypeNames(featuresSelected)
		nFeats = len(featureTypeNames)

		dictPerClass = {}

		for instance in self.instances:
			featureVector = instance.getFeatureVector(featureTypeNames)
			label = instance.label
			if label not in dictPerClass:
				dictPerClass[label] = np.array([featureVector],dtype=np.float64)
			else:
				dictPerClass[label] = np.append(dictPerClass[label],[featureVector],axis=0)

		outDict = {}
		for label, matrix in dictPerClass.items():
			i=0
			outDict[label] = {}
			while i < nFeats:
				
				featureValues = matrix[:,i]
				featureType, featureName = featureTypeNames[i]
				
				mean = np.mean(featureValues)
				median = np.median(featureValues)
				std = np.std(featureValues)
				
				outDict[label][featureName] = {}
				outDict[label][featureName]["mean"] = mean
				outDict[label][featureName]["median"] = median
				outDict[label][featureName]["std"] = std

				i+=1

		return outDict

All right, so now we have this sort of class structure

Instance Collection
        Instance
                FeatureSet

Let us create an empty instance collection using some sample files that I uploaded (4 txt files with their corresponding 4 conll dependency parses), which contain literary texts from Project Gutenberg.

In [None]:

def createInstanceCollection(path, labelPosition=1, separator = "_", selectedLabels = None):
    iC = InstanceCollection()
    for fname in os.listdir(path):
        if fname.endswith(".txt"):
            paths = {}
            paths["clean"] = path+fname
            paths["synParsed"] = path+fname.replace(".txt",".conll")
            pieces = fname.split(separator)
            label = pieces[labelPosition]
            instance = Instance(fname, label, paths)
            iC.addInstance(instance)

    return iC

labelPosition = 3
path = "../input/test-data/"

iC = createInstanceCollection(path,labelPosition)
print(iC)
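A note on how the label is read from the file name: `createInstanceCollection` splits the full file name, extension included, on the separator. If the label is the last piece of the name, it will therefore carry the `.txt` suffix. The sketch below shows one way to avoid that by stripping the extension first. The helper name and the file name pattern are made up; the real naming scheme of the sample files is not shown in this document.

```python
import os

def label_from_filename(fname, label_position, separator="_"):
    # Drop the extension before splitting, so a label in last position stays clean.
    stem, _extension = os.path.splitext(fname)
    return stem.split(separator)[label_position]

# Hypothetical file name: source_id_language_author.txt
fname = "gutenberg_0042_en_poe.txt"
naive_label = fname.split("_")[3]          # what a plain split gives
clean_label = label_from_filename(fname, 3)
print(naive_label, clean_label)
```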

All right, now we have an empty instance collection. Let's fill it with some features.
Each kind of feature has a specific class that computes them. 
We will start with the simplest group of features, the SentenceBasedFeatures.

Each class that we want to define in this framework receives the instance collection and the model name (we could call this "kaggle_author_identification", for instance).
In the initialization, we set the class basic info, such as the type of feature that this class contains.

Then we have the words per sentence feature function, which computes the mean number of words per sentence, as well as the statistical range and standard deviation of this value for all of the instances in our instance collection.

In [None]:
import numpy as np

class SentenceBasedFeatures:

	def __init__(self,iC, modelName):
		self.iC = iC
		self.type = "SentenceBasedFeatures"
		self.iC.initFeatureType(self.type)
		self.modelName = modelName
			
	def get_wordsPerSentence_stdandrange(self):
		for instance in self.iC.instances:
			sentences = instance.sentences
			lengths = []
			for sentence in sentences:
				lengths.append(len(word_tokenize(sentence)))
			
			std = np.std(lengths)
			mean = np.mean(lengths)
			rng = np.amax(lengths) - np.amin(lengths)

			instance.addFeature(self.type, self.type+"_STD", std)
			instance.addFeature(self.type, self.type+"_Range", rng)
			instance.addFeature(self.type, self.type+"_wordsPerSentence", mean)

Easy, right? We do our thing, and then add the feature to the instance. Let's use this class and see what the features look like.

In [None]:
from pprint import pprint
modelName = "kaggle_author_identification"
iSent = SentenceBasedFeatures(iC,modelName)
iSent.get_wordsPerSentence_stdandrange()
pprint(iC)

All right, this is looking better. Let us add some character-based features, which are simple but stupidly effective. We can see how the class is very similar to the previous one (same initialization, but different feature functions).

In [None]:
import re
class CharacterBasedFeatures:

	def __init__(self,iC, modelName):
		self.iC = iC
		self.type = "CharacterBasedFeatures"
		self.iC.initFeatureType(self.type)
		self.modelName = modelName

	def get_uppers(self):
		for instance in self.iC.instances:
			ratio = 0.0
			nChars = len(instance.text)
			# Count ASCII upper-case letters; guard against empty texts.
			upperCases = len(re.findall("[A-Z]", instance.text))
			if nChars > 0:
				ratio = upperCases / nChars
			instance.addFeature(self.type, self.type+"_UpperCases", ratio)

	def get_in_parenthesis_stats(self):
		for instance in self.iC.instances:
			matches = re.findall(r"\((.*?)\)", instance.text)
			npar = len(matches)
			totalchars = 0
			totalwords = 0

			for match in matches:
				totalchars += len(match)
				words = word_tokenize(match)
				totalwords += len(words)

			charsInParenthesis = 0.0
			wordsInParenthesis = 0.0
			if npar > 0:
				charsInParenthesis = totalchars / npar
				wordsInParenthesis = totalwords / npar

			instance.addFeature(self.type, self.type+"_charsinparenthesis", charsInParenthesis)
			instance.addFeature(self.type, self.type+"_wordsinparenthesis", wordsInParenthesis)
		
	def get_numbers(self):
		for instance in self.iC.instances:
			matches = re.findall("[0-9]", instance.text)
			ratio = 0.0
			nchars = len(instance.text)

			if nchars > 0:
				ratio = len(matches) / nchars

			instance.addFeature(self.type, self.type+"_Numbers", ratio)

	def get_symbols(self,symbols, featureName):
		for instance in self.iC.instances:
			nChars = len(instance.text)
			matches = 0
			ratio = 0.0
			
			for char in instance.text:
				if char in symbols:
					matches = matches + 1
			
			if nChars > 0:
				ratio = matches / nChars

			instance.addFeature(self.type, self.type+"_"+featureName, ratio)      

Let's now extract some character-based features.

In [None]:
iChar = CharacterBasedFeatures(iC,modelName)
iChar.get_uppers()
iChar.get_numbers()
iChar.get_symbols([","],"commas")
iChar.get_symbols(["."],"dots")
iChar.get_symbols(['?',"¿"],"questions")
iChar.get_symbols(['!','¡'],"exclamations")
iChar.get_symbols([":"],"colons")
iChar.get_symbols([";"],"semicolons")
iChar.get_symbols(['"',"'","”","“", "’"],"quotations")
iChar.get_symbols(["—","-","_"],"hyphens")
iChar.get_symbols(["(",")"],"parenthesis")
iChar.get_in_parenthesis_stats()
print(iC)

We have more features now, which is cool. But I know you guys are pretty smart and you want to see more complex stuff.
All right, let us talk about the syntactic features. First of all, a disclaimer: this whole framework assumes that you have two things:

- a collection of raw texts
- their conll files with the dependency parses

So, before executing anything, you need to have everything parsed (I use mate-tools). 
Before going into more detail, let's see what a conll file looks like:




In [None]:
conll = "1\tThe\tthe\tthe\tDT\tDT\tend_string=3|spos=DT|start_string=0\tspos=DT\t2\t2\tNMOD\tNMOD\t_\t_\n2\tman\tman\tman\tNN\tNN\tend_string=7|number=SG|spos=NN|start_string=4\tnumber=SG|spos=NN\t5\t5\tSBJ\tSBJ\t_\t_\n3\tin\tin\tin\tIN\tIN\tend_string=10|spos=IN|start_string=8\tspos=IN\t2\t2\tNMOD\tNMOD\t_\t_\n4\tblack\tblack\tblack\tNN\tNN\tend_string=16|number=SG|spos=NN|start_string=11\tnumber=SG|spos=NN\t3\t3\tPMOD\tPMOD\t_\t_\n5\tfled\tflee\tflee\tVBD\tVBD\tend_string=21|finiteness=FIN|person=3|spos=VV|start_string=17|tense=PAST\tfiniteness=FIN|person=3|spos=VV|tense=PAST\t0\t0\tROOT\tROOT\t_\t_\n6\tacross\tacross\tacross\tIN\tIN\tend_string=28|spos=IN|start_string=22\tspos=IN\t5\t5\tLOC\tLOC\t_\t_\n7\tthe\tthe\tthe\tDT\tDT\tend_string=32|spos=DT|start_string=29\tspos=DT\t8\t8\tNMOD\tNMOD\t_\t_\n8\tdesert\tdesert\tdesert\tNN\tNN\tend_string=39|number=SG|spos=NN|start_string=33\tnumber=SG|spos=NN\t6\t6\tPMOD\tPMOD\t_\t_\n9\t,\t,\t,\t,\t,\tend_string=40|spos=,|start_string=39\tspos=,\t5\t5\tP\tP\t_\t_\n10\tand\tand\tand\tCC\tCC\tend_string=44|spos=CC|start_string=41\tspos=CC\t5\t5\tCOORD\tCOORD\t_\t_\n11\tthe\tthe\tthe\tDT\tDT\tend_string=48|spos=DT|start_string=45\tspos=DT\t12\t12\tNMOD\tNMOD\t_\t_\n12\tgunslinger\tgunslinger\tgunslinger\tNN\tNN\tend_string=59|number=SG|spos=NN|start_string=49\tnumber=SG|spos=NN\t13\t13\tSBJ\tSBJ\t_\t_\n13\tfollowed\tfollow\tfollow\tVBD\tVBD\tend_string=68|finiteness=FIN|person=3|spos=VV|start_string=60|tense=PAST\tfiniteness=FIN|person=3|spos=VV|tense=PAST\t10\t10\tCONJ\tCONJ\t_\t_\n14\t.\t.\t.\t.\t.\tend_string=69|spos=.|start_string=68\tspos=.\t5\t5\tP\tP\t_\t_\n\n"
print(conll)

All right, maybe that looks weird (bonus points for recognizing the sentence shown). Basically a conll file is a tab separated file that contains info such as the part of speech of each token, the syntactic dependencies that link two words, the lemma of the word, etc.  The important part, is that each sentence is represented as a tree in this file. Now, I'm going to show you the tree, which will look much nicer.

![](https://i.imgur.com/DwFOZQ4.png)

Much nicer, right? (Now you might know where the sentence comes from. If anyone talks about the movie, I'm going to be so pissed.)
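Going back to the raw conll string for a moment, this is roughly how one line of it can be read. The line is the second token of the sample above. The dictionary keys are my own labels for the columns; the column positions for word, lemma, PoS and head are the ones the framework's `SyntacticNode` class reads, and the arc label is taken from column 10.

```python
# One line of the sample conll output: the token "man".
line = ("2\tman\tman\tman\tNN\tNN\t"
        "end_string=7|number=SG|spos=NN|start_string=4\tnumber=SG|spos=NN\t"
        "5\t5\tSBJ\tSBJ\t_\t_")

def parse_conll_line(conll_line):
    # Columns are tab-separated; only the ones used in this document are kept.
    cols = conll_line.split("\t")
    return {
        "id": int(cols[0]),
        "word": cols[1],
        "lemma": cols[2],
        "pos": cols[4],
        "head": int(cols[8]),   # id of the parent token, 0 for the root
        "deprel": cols[10],     # label of the arc that reaches this token
    }

token = parse_conll_line(line)
print(token)
```

So "man" is a singular noun whose parent is token 5 ("fled"), attached with the subject relation `SBJ`.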

Let us now see how we use these files. Let me introduce the classes that manipulate the trees and extract our syntactic features.
First, the Tree class. A Tree is a root node and a node dict. A node contains metadata, an array of nodes that are its children, a parent id, the id of the node and the label of the arc that reaches the node. This is all info that can be found in the conll string. We can also see that the Node class has a subclass called SyntacticNode, which has specific characteristics found in the conll file.



In [None]:
class Tree:

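	def __init__(self, rootNode, nodeDict=None):
		# Avoid a mutable default argument: a shared {} would leak nodes between trees.
		self.nodeDict = nodeDict if nodeDict is not None else {}
		self.root = rootNode

		if self.root:
			if not self.root.id in self.nodeDict:
				self.nodeDict[self.root.id] = rootNode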

	def getDepthIterator(self, initNode = None):
		if not initNode:
			initNode = self.root
		
		stack = []
		stack.append(initNode)

		while stack:
			current = stack.pop(0)
			if current:
				yield current
				for child in current.children:
					stack.insert(0,child)

	def getWidthIterator(self, initNode = None):
		if not initNode:
			initNode = self.root

		queue = []
		queue.append(initNode)

		while queue:
			current = queue.pop()
			if current:
				yield current
				for child in current.children:
					queue.insert(0,child)

	def __str__(self):
		strRepr = ""
		
		queue = []
		queue.append(self.root)
		strRepr += "ROOT-> "+ str(self.root.id) + "\n"
		i = 1
		while queue:
			current = queue.pop()
			strRepr += "CHILDREN OF "+str(current.id)+" -> "
			for child in current.children:
				strRepr += str(child.id) + "\t"
				queue.insert(0,child)
	
			strRepr +="\n"
			i+=1

		return strRepr

class Node:
	def __init__(self, meta, idNode, arcLabel, parentId):
		self.meta = meta
		self.children = []
		self.parent = parentId
		self.id = idNode
		self.arcLabel = arcLabel

	def setParent(self, parentNode):
		self.parent = parentNode

	def addChild(self, childNode):
		self.children.append(childNode)

	def __str__(self):
		strRepr = ""
		strRepr += str(self.id) + " " + self.meta + " " + self.arcLabel
		return strRepr


class SyntacticNode(Node):

	def __init__(self, meta, idNode, arcLabel, parentId):
		super().__init__(meta, idNode, arcLabel, parentId)

		pieces = meta.split("\t")
		self.word = pieces[1]
		self.lemma = pieces[2]
		self.pos = pieces[4]
		self.features = pieces[6]
		self.parentid = pieces[8]

	def __str__(self):
		strRepr = ""
		strRepr += self.word + " " + self.pos + " " + self.features + " " + self.parentid + " " + self.arcLabel
		return strRepr

Now, I present the TreeOperations class and SyntacticTreeOperations subclass, which uses the Tree class and has many functions that extract information from the syntactic tree.

In [None]:
class TreeOperations:

	def __init__(self, conllStringSentence):
		conllStringSentence = conllStringSentence.strip()
		if not conllStringSentence:
			raise ValueError("Please input a correct conll sentence")
		self.tree = self.conll_to_tree(conllStringSentence)

	def conll_to_tree(self, conllString):
		conllArray = conllString.split("\n")
		nodes, root = self.create_nodes(conllArray)
		self.link_nodes(nodes)
		return Tree(root, nodes)

	def create_nodes(self, conllArray):
		nodeDict = {}
		root = None

		for line in conllArray:
			pieces = line.split("\t")
			idNode = int(pieces[0])
			arcLabel = pieces[10]
			parentId = int(pieces[9])
			iNode = Node(line, idNode, arcLabel, parentId)
			nodeDict[idNode] = iNode
			if parentId == 0:
				root = iNode

		return nodeDict, root


	def link_nodes(self, nodeDict):
		for idNode, iNode in nodeDict.items():
			if iNode.parent > 0:
				iParent = nodeDict[iNode.parent]
				iParent.addChild(iNode)
				iNode.setParent(iParent)

	
	def get_ramification_factor(self, initNode = None):
		# getWidthIterator already falls back to the root when initNode is None
		it = self.tree.getWidthIterator(initNode)

		acumChilds = 0
		# levels starts at 1 and grows by one for every node that has children
		levels = 1
		for current in it:
			nchilds = len(current.children)
			if nchilds > 0:
				acumChilds+=nchilds
				levels+=1

		return acumChilds / levels

	def get_max_width(self, initNode = None):
		it = self.tree.getWidthIterator(initNode)
		maxWidth = 0

		for current in it:
			nchilds = len(current.children)
			if nchilds > maxWidth:
				maxWidth = nchilds

		return maxWidth

	def get_max_depth(self, initNode = None):
		if not initNode:
			initNode = self.tree.root

		return self.get_max_depth_recursive(initNode)


	def get_max_depth_recursive(self, node):
		# A missing node or a leaf adds no further depth
		if not node or not node.children:
			return 0

		return 1 + max(self.get_max_depth_recursive(child) for child in node.children)

	def get_node_depth(self, node):
		current = node
		depth = 0
		while current.parent:
			depth+=1
			current = current.parent
		return depth

class SyntacticTreeOperations(TreeOperations):

	'''
		Gets the maximum width and depth below a node that has a given relation 
		with its father. EX: For every subordinate clause, we get the maximum value
		of width and depth of the subtree BELOW the node which has a SUB relation with its father.
	'''
	def get_relation_width_depth(self, relation):
		
		it = self.tree.getWidthIterator()
		widthDepths = []

		for current in it:
			if current.arcLabel == relation:
				width = self.get_max_width(current)
				depth = self.get_max_depth(current)
				widthDepths.append((width,depth))

		return widthDepths

	def get_relation_depth_level(self, relation):
		it = self.tree.getWidthIterator()
		levels = []
		for current in it:
			if current.arcLabel == relation:
				level = self.get_node_depth(current)
				levels.append(level)

		return levels

	def get_relation_ramification_factor(self, relation):
		it = self.tree.getWidthIterator()
		ramFactors = []
		for current in it:
			if current.arcLabel == relation:
				ramFactor = self.get_ramification_factor(current)
				ramFactors.append(ramFactor)

		return ramFactors

	def search_deps_frequency(self, searchedRels = []):

		it = self.tree.getWidthIterator()
		relFreq = {}
		searchAll = False
		if not searchedRels:
			searchAll = True

		total = 0
		for current in it:
			if current:
				for child in current.children:
					if child.arcLabel in searchedRels or searchAll:
						if child.arcLabel in relFreq:
							relFreq[child.arcLabel] +=1
						else:
							relFreq[child.arcLabel] =1
						total+=1

		return relFreq, total

	def search_pos_frequency(self, searchedPos = []):
		it = self.tree.getWidthIterator()
		posFreq = {}

		searchAll = False
		if not searchedPos:
			searchAll = True

		total = 0
		for current in it:
			if current:
				for child in current.children:
					if child.pos in searchedPos or searchAll:
						if child.pos in posFreq:
							posFreq[child.pos] +=1
						else:
							posFreq[child.pos] =1
						total+=1

		return posFreq, total

	def get_composed_verb_ratio(self):
		verbTags = ["VB","VBD","VBG","VBN","VBP","VBZ", "MD"]
		verbFreq, total = self.search_pos_frequency(verbTags)
		depFreq, vcFreq = self.search_deps_frequency(["VC"])

		if vcFreq > 0 and total > 0:
			composedVerbRatio = vcFreq / total
		else:
			composedVerbRatio = 0.0

		return composedVerbRatio

	def get_modal_ratio(self):
		verbTags = ["VB","VBD","VBG","VBN","VBP","VBZ", "MD"]
		verbFreq, total = self.search_pos_frequency(verbTags)

		if total > 0 and "MD" in verbFreq:
			modalRatio = verbFreq["MD"]/ total
		else:
			modalRatio = 0.0

		return modalRatio

	def create_nodes(self, conllArray):
		
		nodeDict = {}
		root = None
		for line in conllArray:
			pieces = line.split("\t")
			idNode = int(pieces[0])
			arcLabel = pieces[11]
			parentId = int(pieces[9])

			iNode = SyntacticNode(line, idNode, arcLabel, parentId)
			nodeDict[idNode] = iNode
			if parentId == 0:
				root = iNode

		return nodeDict, root
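
The shape metrics above are easier to follow on a toy tree. The sketch below is a simplified standalone version that uses plain dictionaries instead of the Tree and Node classes, so it is an illustration of the idea and not the framework code itself. The sentence structure and the names `heads`, `children_of`, `bfs_order`, `max_depth`, `max_children` and `ramification_factor` are made up for this example.

```python
from collections import deque

# Toy dependency tree as {token id: head id}; head 0 marks the root.
heads = {1: 2, 2: 0, 3: 5, 4: 5, 5: 2, 6: 2}

# Invert the head map into a children map, as link_nodes does with addChild.
children_of = {node: [] for node in heads}
root = None
for node, head in heads.items():
    if head == 0:
        root = node
    else:
        children_of[head].append(node)

def bfs_order(start):
    """Visit nodes level by level, like getWidthIterator."""
    queue = deque([start])
    while queue:
        current = queue.popleft()
        yield current
        queue.extend(children_of[current])

def max_depth(node):
    """Number of edges on the longest path from node down to a leaf."""
    if not children_of[node]:
        return 0
    return 1 + max(max_depth(child) for child in children_of[node])

def max_children(start):
    """Largest number of direct children of any node (what get_max_width returns)."""
    return max(len(children_of[n]) for n in bfs_order(start))

def ramification_factor(start):
    """Total children divided by (nodes with children + 1), as in get_ramification_factor."""
    total_children = 0
    levels = 1
    for n in bfs_order(start):
        if children_of[n]:
            total_children += len(children_of[n])
            levels += 1
    return total_children / levels

order = list(bfs_order(root))
print(order, max_depth(root), max_children(root), ramification_factor(root))
```

Here the root (token 2) has three children and token 5 has two, so the widest node has 3 children and the longest path from the root to a leaf has 2 edges.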

All right, this might seem a bit convoluted. Let's see how all of this stuff is used. Presenting the SyntacticFeatures class.

In [None]:
class SyntacticFeatures:

	adverbialRelations = ["ADV","TMP","LOC","DIR","MNR","PRP","EXT"]
	modifierRelations = ["NMOD","PMOD","AMOD"]

	verbTags = ["VB","VBD","VBG","VBN","VBP","VBZ", "MD"]
	nounTags = ["NN","NNS","NNP","NNPS"]
	adverbTags = ["RB","RBR","RBS","WRB"]
	adjectiveTags = ["JJ","JJR","JJS"]
	pronounTags = ["PRP","PRP$","WP","WP$"]
	determinerTags = ["DT","PDT","WDT"]
	conjunctionTags = ["CC","IN"]

	superlatives = ["JJS","RBS"]
	comparatives = ["JJR","RBR"]
	
	pastVerbs = ["VBD","VBN"]
	presentVerbs = ["VBG","VBP","VBZ"]

	def __init__(self,iC, modelName, load=True):	
		
		self.iC = iC
		self.type = "SyntacticFeatures"
		self.iC.initFeatureType(self.type)
		# Read the list of known feature names and close the file afterwards
		with open("../input/dictss/allRelationsPos.txt","r") as featFile:
			self.allRelationsPos = featFile.read().split("\n")
		self.modelName = modelName
		self.load = load

	def compute_syntactic_features(self):
		nPosts = len(self.iC.instances)
		nProcessed = 0
		for instance in self.iC.instances:
			conllSents = instance.conll.split("\n\n")
			iTrees = []
			conllSents = conllSents[:-1]
			for conllSent in conllSents:
				try:
					iTree = SyntacticTreeOperations(conllSent)
					iTrees.append(iTree)
				except ValueError:
					# Skip sentences that fail with ValueError (for example, empty chunks)
					continue

			self.get_relation_usage(iTrees, instance)
			self.get_relationgroup_usage(iTrees, instance)
			self.get_pos_usage(iTrees, instance)
			self.get_posgroup_usage(iTrees, instance)
			
			self.get_shape_features(iTrees, instance)
			self.get_subcoord_features(iTrees, instance)
			self.get_verb_features(iTrees, instance)
			nProcessed +=1

		self.adjust_features()
		
	#to be used after get_relation_usage and get_pos_usage
	def adjust_features(self):
		for instance in self.iC.instances:
			for featName in self.allRelationsPos:
				if featName not in instance.featureSet.featureDict["SyntacticFeatures"]:
					instance.addFeature(self.type, featName, 0.0)

	def get_relation_usage(self, iTrees, instance):
		nTrees = len(iTrees)
		for iTree in iTrees:
			depFreq,_ = iTree.search_deps_frequency()
			for dep, freq in depFreq.items():
				if "SYNDEP_"+ dep in self.allRelationsPos:	
					if "SYNDEP_"+ dep not in instance.featureSet.featureDict["SyntacticFeatures"].keys():
						instance.addFeature(self.type, "SYNDEP_"+dep, 0.0)
					
					instance.updateFeature(self.type, "SYNDEP_"+dep, freq / nTrees)

				

	def get_relationgroup_usage(self,iTrees, instance):
		nTrees = len(iTrees)
		instance.addFeature(self.type, "SYNDEP_modifierRelations", 0.0)
		instance.addFeature(self.type, "SYNDEP_adverbialRelations", 0.0)

		for iTree in iTrees:
			depFreq, total = iTree.search_deps_frequency(self.adverbialRelations)
			instance.updateFeature(self.type, "SYNDEP_adverbialRelations", total / nTrees)

			depFreq, total = iTree.search_deps_frequency(self.modifierRelations)
			instance.updateFeature(self.type, "SYNDEP_modifierRelations", total / nTrees)


	def get_posgroup_usage(self, iTrees, instance):
		nTrees = len(iTrees)
		# Names of the class-level tag lists; each one becomes a SYNPOS_<name> feature
		posGroups = ["verbTags", "nounTags", "adverbTags", "adjectiveTags", "pronounTags",
			"determinerTags", "conjunctionTags", "superlatives", "comparatives",
			"pastVerbs", "presentVerbs"]

		for groupName in posGroups:
			instance.addFeature(self.type, "SYNPOS_" + groupName, 0.0)

		for iTree in iTrees:
			for groupName in posGroups:
				# total = number of matching tokens counted in this tree
				_, total = iTree.search_pos_frequency(getattr(self, groupName))
				instance.updateFeature(self.type, "SYNPOS_" + groupName, total / nTrees)


	def get_pos_usage(self,iTrees, instance):
		nTrees = len(iTrees)
		for iTree in iTrees:
			posFreq, _ = iTree.search_pos_frequency()
			for pos, freq in posFreq.items():
				if "SYNPOS_"+ pos in self.allRelationsPos:
					if "SYNPOS_"+pos not in instance.featureSet.featureDict["SyntacticFeatures"]:
						instance.addFeature(self.type, "SYNPOS_"+pos, 0.0)

					instance.updateFeature(self.type, "SYNPOS_"+pos, freq / nTrees)


	def get_shape_features(self,iTrees, instance):
		nTrees = len(iTrees)
		instance.addFeature(self.type, "SYNSHAPE_width", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_depth", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_ramFactor", 0.0)

		for iTree in iTrees:
			ramFact = iTree.get_ramification_factor()
			width = iTree.get_max_width()
			depth = iTree.get_max_depth()
			instance.updateFeature(self.type, "SYNSHAPE_width", width / nTrees)
			instance.updateFeature(self.type, "SYNSHAPE_depth", depth / nTrees)
			instance.updateFeature(self.type, "SYNSHAPE_ramFactor", ramFact / nTrees)

	def get_subcoord_features(self, iTrees, instance):
		nSubs = 0
		nCoords = 0

		instance.addFeature(self.type, "SYNSHAPE_subDepth", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_subWidth", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_subRamFact", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_subLevel", 0.0)

		instance.addFeature(self.type, "SYNSHAPE_coordDepth", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_coordWidth", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_coordRamFact", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_coordLevel", 0.0)


		for iTree in iTrees:
			subFreq, numS =  iTree.search_deps_frequency(["SUB"])
			if subFreq:
				nSubs += numS

			coordFreq, numC =  iTree.search_deps_frequency(["COORD"])
			if coordFreq:
				nCoords += numC

			widthDepth = iTree.get_relation_width_depth("SUB")
			if widthDepth:
				incrementW = sum([pair[0] for pair in widthDepth]) / len(widthDepth)
				incrementD = sum([pair[1] for pair in widthDepth]) / len(widthDepth)

				instance.updateFeature(self.type, "SYNSHAPE_subWidth", incrementW)
				instance.updateFeature(self.type, "SYNSHAPE_subDepth", incrementD)

			ramFactors = iTree.get_relation_ramification_factor("SUB")
			if ramFactors:
				incrementR = np.sum(np.array(ramFactors)) / len(ramFactors)
				instance.updateFeature(self.type, "SYNSHAPE_subRamFact", incrementR)

			levels = iTree.get_relation_depth_level("SUB")
			if levels:
				incrementSL = np.sum(np.array(levels)) / len(levels)
				instance.updateFeature(self.type, "SYNSHAPE_subLevel", incrementSL)

			widthDepth = iTree.get_relation_width_depth("COORD")
			if widthDepth:
				incrementCW = sum([pair[0] for pair in widthDepth]) / len(widthDepth)
				incrementCD = sum([pair[1] for pair in widthDepth]) / len(widthDepth)

				instance.updateFeature(self.type, "SYNSHAPE_coordWidth", incrementCW)
				instance.updateFeature(self.type, "SYNSHAPE_coordDepth", incrementCD)

			ramFactors = iTree.get_relation_ramification_factor("COORD")
			if ramFactors:
				incrementCR = np.sum(np.array(ramFactors)) / len(ramFactors)
				instance.updateFeature(self.type, "SYNSHAPE_coordRamFact", incrementCR)

			levels = iTree.get_relation_depth_level("COORD")
			if levels:
				incrementCL = np.sum(np.array(levels)) / len(levels)
				instance.updateFeature(self.type, "SYNSHAPE_coordLevel", incrementCL)

		if nSubs > 0:
			instance.updateFeature(self.type, "SYNSHAPE_subDepth", nSubs, "division")
			instance.updateFeature(self.type, "SYNSHAPE_subWidth", nSubs, "division")
			instance.updateFeature(self.type, "SYNSHAPE_subRamFact", nSubs, "division")
			instance.updateFeature(self.type, "SYNSHAPE_subLevel", nSubs, "division")

		if nCoords > 0:
			instance.updateFeature(self.type, "SYNSHAPE_coordDepth", nCoords, "division")
			instance.updateFeature(self.type, "SYNSHAPE_coordWidth", nCoords, "division")
			instance.updateFeature(self.type, "SYNSHAPE_coordRamFact", nCoords, "division")
			instance.updateFeature(self.type, "SYNSHAPE_coordLevel", nCoords, "division")

	def get_verb_features(self, iTrees, instance):
		nTrees = len(iTrees)
		instance.addFeature(self.type, "SYNSHAPE_composedVerbRatio", 0.0)
		instance.addFeature(self.type, "SYNSHAPE_modalRatio", 0.0)

		for iTree in iTrees:
			composedVerbRatio = iTree.get_composed_verb_ratio()
			modalRatio = iTree.get_modal_ratio()
			instance.updateFeature(self.type, "SYNSHAPE_composedVerbRatio", composedVerbRatio / nTrees)
			instance.updateFeature(self.type, "SYNSHAPE_modalRatio", modalRatio / nTrees)
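
Most of the features above follow the same pattern: count something in each sentence tree, divide by the number of trees, and accumulate the result, which gives a per-sentence mean for the whole text. Here is a minimal standalone sketch of that pattern. It assumes that `updateFeature` adds its argument to the stored value by default, which the calls above suggest but do not show. The tagged sentences and the `mean_group_count` helper are invented for illustration, and the sketch counts every token, whereas the tree-based methods above count the children of each visited node.

```python
# Tag group copied from SyntacticFeatures.verbTags.
verb_tags = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"]

# Made-up document: one list of POS tags per sentence.
sentences = [
    ["PRP", "VBD", "DT", "NN"],
    ["PRP", "MD", "VB", "RB"],
    ["DT", "NN", "VBZ", "JJ"],
    ["UH"],
]

def mean_group_count(tagged_sentences, group):
    """Accumulate count / n_sentences per sentence, i.e. the mean count per sentence."""
    n_sentences = len(tagged_sentences)
    value = 0.0
    for tags in tagged_sentences:
        total = sum(1 for tag in tags if tag in group)
        value += total / n_sentences
    return value

verb_mean = mean_group_count(sentences, verb_tags)
print(verb_mean)
```

The four sentences contain 1, 2, 1 and 0 verb tags, so the accumulated value is the mean number of verb tags per sentence.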


Let's add these bad boys to our feature set.

In [None]:
iSyntactic = SyntacticFeatures(iC,modelName)
iSyntactic.compute_syntactic_features()
pprint(iC)

Now we have a TON of features, and everything we computed can be used for classification. An easy way to do it is to get the sklearn input and train a classifier, which can then predict the class of an unseen instance.

In [None]:
from sklearn.svm import SVC
X, Y = iC.getSklearnInput()
# gamma and coef0 have no effect with a linear kernel; the other omitted arguments are sklearn defaults
clfLinear = SVC(C=1.0, kernel="linear", probability=True)
clfLinear.fit(X, Y)

path = "../input/testtest/"
labelPosition = 3

iCTest = createInstanceCollection(path,labelPosition)

iSent = SentenceBasedFeatures(iCTest,modelName)
iSent.get_wordsPerSentence_stdandrange()

iChar = CharacterBasedFeatures(iCTest,modelName)
iChar.get_uppers()
iChar.get_numbers()
iChar.get_symbols([","],"commas")
iChar.get_symbols(["."],"dots")
iChar.get_symbols(['?',"¿"],"questions")
iChar.get_symbols(['!','¡'],"exclamations")
iChar.get_symbols([":"],"colons")
iChar.get_symbols([";"],"semicolons")
iChar.get_symbols(['"',"'","”","“", "’"],"quotations")
iChar.get_symbols(["—","-","_"],"hyphens")
iChar.get_symbols(["(",")"],"parenthesis")
iChar.get_in_parenthesis_stats()

iSyntactic = SyntacticFeatures(iCTest,modelName)
iSyntactic.compute_syntactic_features()

for instance in iCTest.instances:
	X, _ , idx = instance.getSklearnInput()
	print(clfLinear.predict([X]).tolist())


And as you can see, the system tells us that the test instance was written by Arthur Conan Doyle, which is in fact true. 
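
Since the classifier is created with `probability=True`, it can also report how confident it is about each candidate author. Here is a minimal sketch with made-up data: the feature vectors, the labels "A", "B" and "C", and the variable names are placeholders for illustration, not output from the framework.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up training data: two features per text, ten texts per author.
rng = np.random.RandomState(0)
centers = {"A": (0.0, 0.0), "B": (5.0, 5.0), "C": (0.0, 5.0)}
X_train, y_train = [], []
for label, center in centers.items():
    X_train.extend(rng.normal(loc=center, scale=0.5, size=(10, 2)).tolist())
    y_train.extend([label] * 10)

clf = SVC(C=1.0, kernel="linear", probability=True, random_state=0)
clf.fit(X_train, y_train)

# One unseen (also made-up) feature vector close to the cluster of label "B".
x_new = [[4.8, 5.1]]
predicted = clf.predict(x_new)[0]

# predict_proba returns one column per class, in the order given by clf.classes_.
probabilities = dict(zip(clf.classes_, clf.predict_proba(x_new)[0]))
print(predicted, probabilities)
```

With real data, the same two calls could be applied to the vector returned by `instance.getSklearnInput()` to see how close the decision was between the candidates.
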
Now you know how my framework and my approach work. If you want to use the full version of the code, please help yourselves: [Link to code](https://github.com/joanSolCom/author_profiling_tools/tree/master/author_profiling_code).

If you are interested in my research and want to read my articles, visit my ResearchGate profile: [Link to researchgate profile](https://www.researchgate.net/profile/Juan_Soler_Company).