# Functions For Coding Text Features

Below are the functions we developed to code text features for use as predictors in our ML model.

In [22]:
import random
import pandas as pd
import numpy as np
import scipy as sp
import nltk
import PyPDF2

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

## Past Tense Words

The function below tags each text answer for past tense words as classified by the nltk package (part-of-speech tags VBD and VBN).  Past tense is one of the many indices used in classic LIWC studies (e.g., Pennebaker & Francis, 1996/1999), and these indices have been shown to significantly predict personality traits.  We cannot use the LIWC library here because it is not open access, but we can approximate it with the nltk package.  This function returns the proportion of past tense words in each answer.

In [5]:
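def pastTag(variable):
    # variable is a single text answer; tokenize it on whitespace
    words = variable.split()
    if not words:
        return 0.0
    past = 0
    # nltk.pos_tag takes a list of tokens and returns (token, tag) pairs
    for word, tag in nltk.pos_tag(words):
        # VBD = past tense verb, VBN = past participle
        if tag in ("VBD", "VBN"):
            past += 1
    # Fraction (0-1) of tokens tagged as past tense
    percentpast = past / len(words)
    return percentpast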

## Present Tense Words

The presentTag() function returns the proportion of present tense words in each text answer, as classified by the nltk package (part-of-speech tags VBG, VBP, and VBZ).  It is also an approximation to the LIWC measure of present tense.  Fewer present tense words have been found to be predictive of openness to experience.

In [2]:
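def presentTag(variable):
    # variable is a single text answer; tokenize it on whitespace
    words = variable.split()
    if not words:
        return 0.0
    present = 0
    # nltk.pos_tag takes a list of tokens and returns (token, tag) pairs
    for word, tag in nltk.pos_tag(words):
        # VBG = gerund/present participle, VBP and VBZ = present tense verbs
        if tag in ("VBG", "VBP", "VBZ"):
            present += 1
    # Fraction (0-1) of tokens tagged as present tense
    percentpresent = present / len(words)
    return percentpresent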
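
nltk.pos_tag returns a list of (token, tag) tuples, so both tense features come down to counting the tuples whose tag falls in a target set. The sketch below shows that counting step on its own. The tagged list is hand-written in the shape pos_tag returns (the tags are illustrative assumptions, not real tagger output), and the helper name proportion_with_tags is hypothetical.

```python
# Hand-written (token, tag) pairs in the shape nltk.pos_tag returns.
# These tags are illustrative assumptions, not real tagger output.
tagged = [
    ("She", "PRP"), ("walked", "VBD"), ("home", "NN"),
    ("and", "CC"), ("is", "VBZ"), ("resting", "VBG"),
]

PAST_TAGS = {"VBD", "VBN"}            # past tense, past participle
PRESENT_TAGS = {"VBG", "VBP", "VBZ"}  # present participle, present tense


def proportion_with_tags(tagged_tokens, tags):
    """Return the fraction (0-1) of tokens whose POS tag is in `tags`."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _token, tag in tagged_tokens if tag in tags)
    return hits / len(tagged_tokens)


past_share = proportion_with_tags(tagged, PAST_TAGS)
present_share = proportion_with_tags(tagged, PRESENT_TAGS)
print(past_share, present_share)
```

Keeping the tag sets as named constants makes it easy to see, and to change, which Penn Treebank tags count toward each tense.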

## Articles

The use of fewer articles (i.e., "a" or "the") has been found to positively predict agreeableness.  The following function calculates the proportion of articles in each text answer.

In [3]:
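def articleTag(variable):
    # variable is a single text answer; tokenize it on whitespace
    words = variable.split()
    if not words:
        return 0.0
    articles = 0
    for word in words:
        # Case-insensitive match against the two articles
        if word.lower() in ("a", "the"):
            articles += 1
    # Fraction (0-1) of tokens that are articles
    percentarticles = articles / len(words)
    return percentarticles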
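
Splitting on whitespace leaves punctuation attached to tokens, so a token such as "the," or "(a" is not counted as an article. One possible refinement, sketched below, pulls word tokens out with a regular expression before counting. The names tokenize and article_share are hypothetical, and the regular expression is a simple assumption about what counts as a word.

```python
import re

ARTICLES = {"a", "the"}  # the same two articles named in the text


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def article_share(text):
    """Return the fraction (0-1) of word tokens that are articles."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in ARTICLES for token in tokens) / len(tokens)


sample = "We took the, frankly, long road (a detour)."
print(tokenize(sample))
print(article_share(sample))
```

In this sample a plain split() would yield "the," and "(a", neither of which equals "the" or "a", so both articles would be missed.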

## First Person Singular

The use of more first person singular pronouns (i.e., "I", "me", "my" or "mine") has been shown to positively predict neuroticism and negatively predict openness to experience.  The following function calculates the proportion of first person singular pronouns used in each text answer.

In [4]:
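def firstpersonTag(variable):
    # variable is a single text answer; tokenize it on whitespace
    words = variable.split()
    if not words:
        return 0.0
    firstperson = 0
    for word in words:
        # Case-insensitive match against first person singular pronouns
        if word.lower() in ("i", "me", "my", "mine"):
            firstperson += 1
    # Fraction (0-1) of tokens that are first person singular pronouns
    percentfirst = firstperson / len(words)
    return percentfirst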

## Word Length

The wordLengths function measures the length of every word in a text answer and returns the percentage of words that have more than five letters.  A higher percentage of words longer than five letters has been found to be a good predictor of openness to experience.

In [1]:
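def wordLengths(variable):
    # Length of every whitespace-delimited word in the answer
    words = list(map(len, variable.split()))
    if not words:
        return 0.0
    # Number of words with more than five letters
    sumlengths = sum(i > 5 for i in words)
    # Percentage of words (not characters) longer than five letters
    percentoverfive = sumlengths / len(words) * 100
    return percentoverfive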
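
The cutoff matters here: "five or more letters" and "more than five letters" give different counts for the same answer. The sketch below makes the cutoff an explicit parameter so that either reading can be checked. The function name long_word_percentage and the min_length parameter are hypothetical, not part of the original notebook.

```python
def long_word_percentage(text, min_length=6):
    """Percentage (0-100) of words with at least `min_length` characters."""
    words = text.split()
    if not words:
        return 0.0
    long_words = sum(1 for word in words if len(word) >= min_length)
    return long_words / len(words) * 100


answer = "I enjoy exploring unfamiliar ideas"

# Word lengths are 1, 5, 9, 10, 5.
over_five = long_word_percentage(answer, min_length=6)     # more than five letters
five_or_more = long_word_percentage(answer, min_length=5)  # five or more letters
print(over_five, five_or_more)
```

With this answer, two of the five words are longer than five letters, while four of the five have five or more, so the two readings differ by a factor of two.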

## Negative Affect Words

Negative affect is another index known to predict traits such as neuroticism (positively) and extraversion, agreeableness, and conscientiousness (negatively).  We chose the publicly available Loughran-McDonald sentiment word lists, available here: https://sraf.nd.edu/textual-analysis/resources/#LM%20Sentiment%20Word%20Lists or here: https://github.com/lcdm-uiuc/ml-finance.  The following code reads the word list from a downloaded PDF and converts it into a list of words that can be used to tag the text answers for the presence of negative words.

In [92]:
file = open('LM_Negative.pdf', 'rb')

# creating a pdf reader object
fileReader = PyPDF2.PdfFileReader(file)
pageObj = fileReader.getPage(0)
page_content = pageObj.extractText()
#print(page_content)

In [90]:
negativewords = ""
for page in range(fileReader.numPages):
    pageObj = fileReader.getPage(page)
    page_content = pageObj.extractText()
    negativewords = negativewords + page_content
#print(negativewords)
negwords = negativewords.split()

negwords = negwords[7:]

The following function takes the list of words imported above (i.e., "negwords") and returns the proportion of negative words in each text answer.

In [79]:
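def negAffect(variable):
    # variable is a single text answer; tokenize it on whitespace
    words = variable.split()
    if not words:
        return 0.0
    negative = 0
    for word in words:
        # Compare in upper case, as the word list is matched in upper case
        if word.upper() in negwords:
            negative = negative + 1
    # Fraction (0-1) of tokens found in the negative word list
    percentnegative = negative / len(words)
    return percentnegative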
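
The word-list features in this notebook (negative, positive, tentative, and causal words) all follow the same pattern: tokenize the answer, count the tokens that appear in a list, and divide by the number of tokens. A minimal sketch of that shared pattern is shown below. The helper name lexicon_share is hypothetical, the two-word list is a toy stand-in rather than the Loughran-McDonald list, and stripping punctuation is an added assumption.

```python
import string


def lexicon_share(text, lexicon):
    """Fraction (0-1) of tokens found in `lexicon`, ignoring case."""
    # A set gives constant-time membership checks, unlike a long list.
    lookup = {entry.upper() for entry in lexicon}
    # Strip leading/trailing punctuation so "loss." still matches "LOSS".
    tokens = [token.strip(string.punctuation).upper() for token in text.split()]
    tokens = [token for token in tokens if token]
    if not tokens:
        return 0.0
    return sum(token in lookup for token in tokens) / len(tokens)


# Toy stand-in list for illustration only.
toy_negative = ["LOSS", "FAILURE"]
share = lexicon_share("The failure led to a loss.", toy_negative)
print(share)
```

Passing the word list as an argument lets one function serve every list, and the set lookup helps when the list holds thousands of entries.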

## Positive Affect Words

We used the Loughran-McDonald sentiment word list again (this time from https://github.com/lcdm-uiuc/ml-finance) to code for positive words.  The proportion of positive words has been found to positively predict extraversion, conscientiousness, and agreeableness and to negatively predict neuroticism.

In [101]:
file = open('LM_Positive.pdf', 'rb')

# creating a pdf reader object
fileReader = PyPDF2.PdfFileReader(file)
pageObj = fileReader.getPage(0)
page_content = pageObj.extractText()
#print(page_content)

In [103]:
positivewords = ""
for page in range(fileReader.numPages):
    pageObj = fileReader.getPage(page)
    page_content = pageObj.extractText()
    positivewords = positivewords + page_content
#print(positivewords)
poswords = positivewords.split()

poswords = poswords[7:]

In [104]:
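def posAffect(variable):
    # variable is a single text answer; tokenize it on whitespace
    words = variable.split()
    if not words:
        return 0.0
    positive = 0
    for word in words:
        # Compare in upper case, as the word list is matched in upper case
        if word.upper() in poswords:
            positive = positive + 1
    # Fraction (0-1) of tokens found in the positive word list
    percentpositive = positive / len(words)
    return percentpositive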

## Tentative Words

The use of tentative words that lessen certainty about a statement (e.g., "may", "possibly", "seems to") has been found to positively predict openness to experience and negatively predict extraversion.  We created a quick and dirty list of tentative words by looking online (mainly here: https://lo.unisa.edu.au/pluginfile.php/499128/mod_resource/content/2/Tentative%20language%20Nov%202015.pdf) and use it to calculate the proportion of tentative words in each text answer.

In [116]:
tentwords = ["may", "might", "can", "could", "possibly", "probably", "likely", "possible", "unlikely", "probable", "tends", "appears", "suggests", "seems"]

In [117]:
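def tentAffect(variable):
    # variable is a single text answer; tokenize it on whitespace
    words = variable.split()
    if not words:
        return 0.0
    tentative = 0
    for word in words:
        # The tentative word list is lower case, so compare in lower case
        if word.lower() in tentwords:
            tentative = tentative + 1
    # Fraction (0-1) of tokens found in the tentative word list
    percenttentative = tentative / len(words)
    return percenttentative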
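
Matching one token at a time cannot detect multi-word hedges such as "seems to". If phrases were added to the list, one option would be to count whole-word matches with a regular expression, as in the sketch below. The function name phrase_hits and the short hedges list are hypothetical and only illustrate the idea.

```python
import re


def phrase_hits(text, phrases):
    """Count whole-word, case-insensitive occurrences of each phrase."""
    lowered = text.lower()
    total = 0
    for phrase in phrases:
        # \b anchors keep "may" from matching inside "maybe".
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        total += len(re.findall(pattern, lowered))
    return total


# Illustrative list mixing single words and one two-word phrase.
hedges = ["may", "possibly", "seems to"]
text = "It seems to work, and it may possibly help. Maybe."
hits = phrase_hits(text, hedges)
print(hits)
```

The count could then be divided by the number of words in the answer to get a rate comparable to the other features, although a two-word phrase then counts as one hit against two tokens.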

## Causal Words

Although more complicated text mining is usually needed to fully detect causal speech, we used a crude list of causal words from this publication (https://www.aaai.org/Papers/FLAIRS/2002/FLAIRS02-071.pdf) to do a quick tagging of causal speech as well.  Decreased use of causal words is usually positively associated with extraversion, conscientiousness, and openness to experience.

In [118]:
causewords = ["induce", "produce", "generate", "effect", "affect", "bring", "arouse", "provoke", "elicit", "lead", "trigger", 
             "derive", "associate", "relate", "link", "stem", "originate", "result", "stir", "entail", "contribute", "set", "commence",
             "conduce", "educe", "spark", "evoke", "implicate", "activate", "actuate", "kindle", "fire", "stimulate", "call", "unleash",
             "effectuate", "kick", "give", "birth", "call", "put", "create", "launch", "develop", "start", "make", "begin", "rise"]

In [119]:
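def causeAffect(variable):
    # variable is a single text answer; tokenize it on whitespace
    words = variable.split()
    if not words:
        return 0.0
    causal = 0
    for word in words:
        # The causal word list is lower case, so compare in lower case
        if word.lower() in causewords:
            causal = causal + 1
    # Fraction (0-1) of tokens found in the causal word list
    percentcausal = causal / len(words)
    return percentcausal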