<h1>Exploring Associations Between the Mach-IV Inventory, Ten-Item Personality (TIP) Inventory, and the Test Taker</h1>
<h2>Siddharth Tiwari</h2>
<h3>HDAG Summer Program Project</h3>

<h2>Purpose</h2>
<p>Machiavellianism is a personality trait characterized by cunning, a willingness to manipulate others, and a drive to use whatever means necessary to gain power. The Mach-IV, a three-dimensional, 20-item self-report psychometric inventory devised in 1970, measures the relative strength of this trait in individuals. In this project, I explore possible relationships between the Mach-IV responses of roughly 73,000 individuals and their TIP results, as well as other personal information (age, family size, profession, etc.). The results may help derive associations similar to “Hofstede’s Dimensions” and support future work on how demographic and cultural factors influence the presence of Machiavellian traits in different groups of people.</p>


In [1]:
#import statements

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pingouin as pg
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import metrics
from sklearn.pipeline import Pipeline 

In [2]:
#load dataset
df_r = pd.read_excel(r"data/data.xlsx")

<h2>Describe Dataset</h2>

<p>The following list describes the features included in this dataset. For additional information, visit the Open Psychometrics Project or look through the test itself: <a href="https://openpsychometrics.org/tests/MACH-IV/" target="_blank">https://openpsychometrics.org/tests/MACH-IV/</a>. The following features from the dataset will be used for analysis:</p>
<ul>
    <li><h4>20 Question <i>Mach-IV</i> Inventory.</h4></li>
    <ul style="list-style-type='circle'">
      <li>Consists of 20 questions used to measure the presence of the "Machiavellian Construct" in the survey respondent.</li>
      <li>Three values are recorded for each question (ex. Q1):</li>
      <ul style="list-style-type='circle'">
          <li>The user's answer ranging from 1 to 5 - 1 = disagree, 5 = agree (ex. feature name: <b>Q1A</b>)</li>
          <li>The position of the item in the survey (ex. feature name: <b>Q1I</b>)</li>
          <li>The time spent on the question in milliseconds (ex. feature name: <b>Q1E</b>)</li>
      </ul>
      <li>Only the selected answers (Q1A) will be used to calculate the three-dimensional Mach-IV scores for regressions/classifications. Elapsed time (Q1E) will be used to exclude outliers but not for any analysis, and item position (Q1I) will be removed from the analysis entirely.</li>
    </ul>
    <br>
    <li><h4>Ten Item Personality Inventory</h4></li>
    <ul style="list-style-type='circle'">
      <li>The Ten-Item Personality Inventory was used to briefly capture the respondent's personality traits</li>
      <li>Features are labeled as <b>"TIPI"</b> followed by the question number <b>(ex. TIPI1)</b></li>
      <li>Will be used to calculate the Big Five personality traits for respondents</li>
    </ul>
    <br>
    <li><h4>Demographic Variables</h4></li>
    <ul style="list-style-type='circle'">
      <li>The following information was catalogued for each respondent:</li>
      <ul style="list-style-type='circle'">
          <li><b>age</b>:		the user's age</li>
          <li><b>country</b>:		the user's network location</li>
          <li><b>education</b>:			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree</li>
          <li><b>engnat</b>: "Is English your native language?", 1=Yes, 2=No</li>
          <li><b>urban</b>:				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)</li>
          <li><b>gender</b>:				"What is your gender?", 1=Male, 2=Female, 3=Other</li>
          <li><b>religion</b>:			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other</li>
          <li><b>orientation</b>:			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other</li>
          <li><b>voted</b>:   "Have you voted in a national election in the past year?", 1=Yes, 2=No</li>
          <li><b>race</b>:			"What is your race?", 10=Asian, 20=Arab, 30=Black, 40=Indigenous Australian, 50=Native American, 60=White, 70=Other</li>
          <li><b>familysize</b>:				"Including you, how many children did your mother have?"</li>
          <li><b>married</b>:				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married</li>
          <li><b>major</b>: "If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"</li>
      </ul>
      <li>To see other features of the original dataset, visit the <b>codebook</b> (within the data folder). The demographic variables included above are explored in the "Data Visualization" section of this project.</li>
    </ul>
</ul>
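
<p>As a minimal sketch of how the numeric codes listed above could be turned into readable labels, the snippet below maps a few of the demographic variables with <code>Series.map</code>. The dictionaries are copied from the descriptions above; the helper name <code>label_codes</code> and the toy values are assumptions made for illustration, not part of the original analysis.</p>

```python
import pandas as pd

# Code-to-label maps copied from the feature descriptions above.
EDUCATION = {1: "Less than high school", 2: "High school",
             3: "University degree", 4: "Graduate degree"}
GENDER = {1: "Male", 2: "Female", 3: "Other"}

def label_codes(series, mapping):
    """Hypothetical helper: replace integer codes with labels.

    Codes that are not in the mapping become NaN instead of raising.
    """
    return series.map(mapping)

# Toy frame standing in for the real data (values are made up).
toy_dem = pd.DataFrame({"education": [1, 3, 4, 0], "gender": [2, 1, 3, 1]})
toy_dem["education_label"] = label_codes(toy_dem["education"], EDUCATION)
toy_dem["gender_label"] = label_codes(toy_dem["gender"], GENDER)
print(toy_dem)
```

<p>Because unmapped codes come back as NaN, a quick <code>isna().sum()</code> on the label column shows how many responses fell outside the documented answer options.</p>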

The head and shape of the original dataset are included below:

In [3]:
print(df_r.head())

   Q1A   Q1I      Q1E  Q2A   Q2I      Q2E  Q3A   Q3I      Q3E  Q4A  ...  \
0  3.0   6.0  21017.0  3.0   7.0  18600.0  5.0  20.0  14957.0  2.0  ...   
1  5.0  17.0   3818.0  5.0   9.0   7850.0  1.0  16.0   5902.0  3.0  ...   
2  5.0  16.0   4186.0  5.0  12.0   2900.0  1.0   2.0   7160.0  1.0  ...   
3  2.0  12.0   9373.0  4.0   1.0  10171.0  2.0   7.0  10117.0  1.0  ...   
4  5.0  13.0   9465.0  5.0   7.0   5284.0  2.0  19.0   8872.0  1.0  ...   

   screenw  screenh  hand  religion  orientation  race  voted  married  \
0   1440.0    900.0     1         7            1    30      1        2   
1   1536.0    864.0     1         1            1    60      2        1   
2    375.0    667.0     1         2            2    10      2        1   
3   1280.0    720.0     1         6            1    60      1        3   
4    360.0    640.0     1         4            3    60      1        1   

   familysize                    major  
0           5               Marketing   
1           2         

<h4>Cleaning Data</h4>

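<p>Remove features that aren’t used in the analyses, ensure data types are consistent, remove NaNs, and remove respondents with an extreme value (|z-score| ≥ 3) in any numeric column. The last step covers test-takers who took “too long” on a question, but it also applies to the answers, age, and family size.</p>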

In [18]:
#drop unused features
df = df_r.drop(['Q1I', 'Q2I', 'Q3I', 'Q4I', 'Q5I', 'Q6I', 'Q7I', 'Q8I', 'Q9I', 'Q10I', 'Q11I', 'Q12I', 'Q13I', 'Q14I', 'Q15I', 'Q16I', 'Q17I', 'Q18I', 'Q19I', 'Q20I', 'VCL1', 'VCL2', 'VCL3', 'VCL4', 'VCL5', 'VCL6', 'VCL7', 'VCL8', 'VCL9', 'VCL10', 'VCL11', 'VCL12', 'VCL13', 'VCL14', 'VCL15', 'VCL16', 'introelapse', 'testelapse', 'surveyelapse', 'screenw', 'screenh', 'hand'], axis = 1)

#convert numeric categories to numeric datatype
vars_num = ['Q1A', 'Q2A', 'Q3A', 'Q4A', 'Q5A', 'Q6A', 'Q7A', 'Q8A', 'Q9A', 'Q10A', 'Q11A', 'Q12A', 'Q13A', 'Q14A', 'Q15A', 'Q16A', 'Q17A', 'Q18A', 'Q19A', 'Q20A', 'Q1E', 'Q2E', 'Q3E', 'Q4E', 'Q5E', 'Q6E', 'Q7E', 'Q8E', 'Q9E', 'Q10E', 'Q11E', 'Q12E', 'Q13E', 'Q14E', 'Q15E', 'Q16E', 'Q17E', 'Q18E', 'Q19E', 'Q20E', 'TIPI1', 'TIPI2', 'TIPI3', 'TIPI4', 'TIPI5', 'TIPI6', 'TIPI7', 'TIPI8', 'TIPI9', 'TIPI10', 'age', 'familysize'] 
df_num = df.filter(vars_num)
df_num = df_num.apply(pd.to_numeric)

#remove nan and inf values 
df_num = df_num[~df_num.isin([np.nan, np.inf, -np.inf]).any(axis=1)].astype(np.float64)

#remove outliers
df_num = df_num[(np.abs(stats.zscore(df_num)) < 3).all(axis=1)]

#concatenate numeric columns with categorical columns using inner join
df_cat = df.drop(vars_num, axis = 1)
df = pd.concat([df_num, df_cat], axis = 1, join="inner")
df.reset_index(inplace=True)

print("Initial shape of Dataset: ")
print(df_r.shape)
print("")

print("Final shape: ")
print(df.shape)

Initial shape of Dataset: 
(73489, 105)

Final shape: 
(68919, 64)
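
<p>To make the outlier rule concrete, here is a small sketch of the same z-score filter on made-up data. The toy frame and the names <code>toy</code>, <code>keep</code>, and <code>filtered</code> are assumptions for illustration; only the filter expression mirrors the cleaning cell above.</p>

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data: 20 ordinary rows plus one row with an extreme response time.
toy = pd.DataFrame({
    "Q1A": [1.0, 2.0, 3.0, 4.0, 5.0] * 4 + [3.0],
    "Q1E": [5000.0 + 100.0 * (i % 5) for i in range(20)] + [600000.0],
})

# Column-wise z-scores; a row is kept only if every column is within 3 SDs.
z = np.abs(stats.zscore(toy.to_numpy(), axis=0))
keep = (z < 3).all(axis=1)
filtered = toy[keep]

print(len(toy), "->", len(filtered))
```

<p>Note that the rule is applied to every numeric column at once, so a respondent is dropped when any single column is extreme. Very large outliers also inflate the standard deviation they are measured against, which makes this filter fairly conservative.</p>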


<h4>Separating Variables</h4>
<p>These are the sets of variables that I use in my regressions/classifications below:</p>
<ul>
    <li><b>mach_iv</b> (20 features): selected responses to the Mach-IV inventory; will be used to calculate Machiavellianism scores (stored in <b>scores</b>)</li>
    <li><b>tip_i</b> (10 features): selected responses to the Ten-Item Personality Inventory; will be used to calculate Big Five scores (stored in <b>b5</b>)</li>
    <li><b>dem</b> (13 features): recorded responses to various demographic questions (country, education, engnat, urban, gender, religion, orientation, voted, race, familysize, age, married, major)</li>
</ul>

In [17]:
#separate data
mach_iv = df.filter(['Q1A', 'Q2A', 'Q3A', 'Q4A', 'Q5A', 'Q6A', 'Q7A', 'Q8A', 'Q9A', 'Q10A', 'Q11A', 'Q12A', 'Q13A', 'Q14A', 'Q15A', 'Q16A', 'Q17A', 'Q18A', 'Q19A', 'Q20A'])
tip_i = df.filter(['TIPI1', 'TIPI2', 'TIPI3', 'TIPI4', 'TIPI5', 'TIPI6', 'TIPI7', 'TIPI8', 'TIPI9', 'TIPI10'])
dem = df.filter(['country', 'education', 'engnat', 'urban', 'gender', 'religion', 'orientation', 'voted', 'race', 'familysize', 'age', 'married', 'major'])


#calculate mach scores
#reverse-key a Mach-IV item (answers range from 1 to 5, so 1 <-> 5, 2 <-> 4)
def rev(col):
    return 6 - col

scores = pd.DataFrame({'tactics': (mach_iv['Q1A'] + mach_iv['Q2A'] + rev(mach_iv['Q3A']) + mach_iv['Q9A'] + rev(mach_iv['Q10A']) + mach_iv['Q15A'] + rev(mach_iv['Q17A']))* (100/35),
                       'humanity': (rev(mach_iv['Q4A']) + mach_iv['Q5A'] + rev(mach_iv['Q7A']) + rev(mach_iv['Q8A']) + rev(mach_iv['Q11A']) + rev(mach_iv['Q14A']) + mach_iv['Q20A']) * (100/35),
                       'morality': (rev(mach_iv['Q6A']) + mach_iv['Q12A'] + mach_iv['Q13A'] + rev(mach_iv['Q16A']) + mach_iv['Q18A'] + mach_iv['Q19A']) * (100/30)})

scores['total'] = scores.mean(axis = 1)

#calculate tip scores
#reverse-key a TIPI item as 8 - x (1-7 scale); named separately so the Mach-IV rev() above is not overwritten
def rev_tipi(col):
    return 8 - col

b5 = pd.DataFrame({'extrav': (tip_i['TIPI1'] + rev_tipi(tip_i['TIPI6']))/2,
                   'agree': (rev_tipi(tip_i['TIPI2']) + tip_i['TIPI7'])/2,
                   'consc': (tip_i['TIPI3'] + rev_tipi(tip_i['TIPI8']))/2,
                   'emot': (rev_tipi(tip_i['TIPI4']) + tip_i['TIPI9'])/2,
                   'open': (tip_i['TIPI5'] + rev_tipi(tip_i['TIPI10']))/2})
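
<p>The reverse-keying and rescaling used in the scoring cell can be written as one general pattern. The sketch below is illustrative only: <code>reverse_key</code> and <code>scale_to_percent</code> are hypothetical helpers, and the two-item "subscale" is a toy example rather than one of the real Mach-IV dimensions.</p>

```python
import pandas as pd

def reverse_key(col, low, high):
    """Reverse a Likert item scored on [low, high]; the midpoint stays put."""
    return (low + high) - col

def scale_to_percent(total, n_items, high):
    """Express a summed score as a percentage of the maximum possible sum."""
    return total * 100 / (n_items * high)

# Two made-up respondents answering one regular and one reverse-keyed item.
toy_items = pd.DataFrame({"Q1A": [5, 1], "Q3A": [1, 5]})
raw = toy_items["Q1A"] + reverse_key(toy_items["Q3A"], low=1, high=5)
pct = scale_to_percent(raw, n_items=2, high=5)
print(pct.tolist())  # [100.0, 20.0]
```

<p>With answers on a 1 to 5 scale, the lowest possible sum is one point per item, so scores scaled this way range from 20 to 100 rather than from 0 to 100.</p>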

In [6]:
#apply proper labels to demographic variables


#process majors using string similarity checker



In [7]:
for i in range(1,11):
    print("'TIPI" + str(i) + "'", end = ", ")

'TIPI1', 'TIPI2', 'TIPI3', 'TIPI4', 'TIPI5', 'TIPI6', 'TIPI7', 'TIPI8', 'TIPI9', 'TIPI10', 
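
<p>Rather than printing the names and pasting them into a list, the column lists used in the cleaning cell could be built directly. This is a minimal sketch; <code>item_columns</code> is a hypothetical helper, and the column naming pattern (Q1A, Q1I, Q1E, TIPI1, ...) is taken from the dataset description above.</p>

```python
def item_columns(prefix, n, suffix=""):
    """Build names such as Q1A..Q20A or TIPI1..TIPI10."""
    return [f"{prefix}{i}{suffix}" for i in range(1, n + 1)]

answers = item_columns("Q", 20, "A")    # selected answers
positions = item_columns("Q", 20, "I")  # item positions (dropped in cleaning)
elapsed = item_columns("Q", 20, "E")    # time spent per question
tipi = item_columns("TIPI", 10)

# Same order as the hand-written vars_num list in the cleaning cell.
vars_num_generated = answers + elapsed + tipi + ["age", "familysize"]
print(len(vars_num_generated))  # 52
```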