<h1>Exploring Associations Between the Mach-IV Inventory, Ten-Item Personality (TIP) Inventory, and the Test Taker</h1>
<h2>Siddharth Tiwari</h2>
<h3>HDAG Summer Program Project</h3>

<h2>Purpose</h2>
<p>Machiavellianism is a personality trait that denotes cunningness, the ability to be manipulative, and a drive to use whatever means necessary to gain power; the Mach-IV, a three-dimensional, 20-item self-reported psychometric inventory, was devised in 1970 to measure the relative strength of this trait within individuals. In this project, I aim to explore the (possible) relationships between the selected responses for the Mach-IV test from 73k individuals and their TIP results as well as other personal information (age, family size, profession, etc.). Results can help derive associations similar to “Hofstede’s Dimensions” to explore how different demographic and cultural factors may influence the presence of Machiavellian traits within different groups of people in the future.</p>


In [125]:
#import statements

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import metrics
from sklearn.pipeline import Pipeline 


In [126]:
#load dataset
df_r = pd.read_excel(r"data/data.xlsx")

<h2>Describe Dataset</h2>

<p>The following list contains descriptions of the features included in this dataset. For additional information, visit the Open Psychometrics Project (link above) or, alternatively, feel free to look through the test as well: <a href="https://openpsychometrics.org/tests/MACH-IV/" target="_blank">https://openpsychometrics.org/tests/MACH-IV/.</a> The following features from the dataset will be used for analysis:</p>
<ul>
    <li><h4>20 Question <i>Mach-IV</i> Inventory.</h4></li>
    <ul style="list-style-type='circle'">
      <li>Consists of 20 questions used to measure the presence of the "Machiavellian Construct" in the survey respondent.</li>
      <li>Three values are recorded for each question (ex. Q1):</li>
      <ul style="list-style-type='circle'">
          <li>The user's answer ranging from 1 to 5 - 1 = disagree, 5 = agree (ex. feature name: <b>Q1A</b>)</li>
          <li>The position of the item in the survey (ex. feature name: <b>Q1I</b>)</li>
          <li>The time spent on the question in milliseconds (ex. feature name: <b>Q1E</b>)</li>
      </ul>
      <li>Only selected answers (QIA) will be used to calculate three-dim Mach-IV scores for regressions/classifications, elapsed time (Q1E) will be used to exclude outliers but not for any analysis, response position will be removed entirely from analysis (Q1I)</li>
    </ul>
    <br>
    <li><h4>Ten Item Personality Inventory</h4></li>
    <ul style="list-style-type='circle'">
      <li>Ten-Item Personality Inventory was used to briefly capture the respondent's personality traits</li>
      <li>Features are labeled as <b>"TIPI"</b> followed by the question number <b>(ex. TIP1)</b></li>
      <li>Will be used to calculate Big 5 Character traits for respondents</li>
    </ul>
    <br>
    <li><h4>Demographic Variables</h4></li>
    <ul style="list-style-type='circle'">
      <li>The following information was catalogued for each respondent:</li>
      <ul style="list-style-type='circle'">
          <li><b>age</b>:		the user's age</li>
          <li><b>familysize</b>:				"Including you, how many children did your mother have?"</li>
          <li><b>education</b>:			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree</li>
          <li><b>engnat</b>: "Is English your native language?", 1=Yes, 2=No</li>
          <li><b>urban</b>:				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)</li>
          <li><b>gender</b>:				"What is your gender?", 1=Male, 2=Female, 3=Other</li>
          <li><b>religion</b>:			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other</li>
          <li><b>orientation</b>:			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other</li>
          <li><b>voted</b>:   "Have you voted in a national election in the past year?", 1=Yes, 2=No</li>
          <li><b>race	</b>:			"What is your race?", 10=Asian, 20=Arab, 30=Black, 40=Indigenous Australian, 50=Native American, 60=White, 70=Other</li>
          <li><b>married</b>:				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married</li>
          <li><b>country</b>:		the user's network location</li>
          <li><b>major</b>: "If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"</li>
      </ul>
      <li>To see other features of the original dataset, visit the <b>codebook</b> (within the data folder). The demographic variables included above are explored in the "Data Visualization" section of this project.</li>
    </ul>
</ul>

The head and shape of the original dataset are included below:

In [127]:
print(df_r.head())

   Q1A   Q1I      Q1E  Q2A   Q2I      Q2E  Q3A   Q3I      Q3E  Q4A  ...  \
0  3.0   6.0  21017.0  3.0   7.0  18600.0  5.0  20.0  14957.0  2.0  ...   
1  5.0  17.0   3818.0  5.0   9.0   7850.0  1.0  16.0   5902.0  3.0  ...   
2  5.0  16.0   4186.0  5.0  12.0   2900.0  1.0   2.0   7160.0  1.0  ...   
3  2.0  12.0   9373.0  4.0   1.0  10171.0  2.0   7.0  10117.0  1.0  ...   
4  5.0  13.0   9465.0  5.0   7.0   5284.0  2.0  19.0   8872.0  1.0  ...   

   screenw  screenh  hand  religion  orientation  race  voted  married  \
0   1440.0    900.0     1         7            1    30      1        2   
1   1536.0    864.0     1         1            1    60      2        1   
2    375.0    667.0     1         2            2    10      2        1   
3   1280.0    720.0     1         6            1    60      1        3   
4    360.0    640.0     1         4            3    60      1        1   

   familysize                    major  
0           5               Marketing   
1           2         

<h2>1. Cleaning Data</h2>

<p>Remove test-takers who took “too long” to complete the test (STD > 3), remove features that aren’t used in analyses, ensure data types are proper and consistent, remove NaNs, assign proper labels to categorical data</p>

In [128]:
#drop unused features
df = df_r.drop(['Q1I', 'Q2I', 'Q3I', 'Q4I', 'Q5I', 'Q6I', 'Q7I', 'Q8I', 'Q9I', 'Q10I', 'Q11I', 'Q12I', 'Q13I', 'Q14I', 'Q15I', 'Q16I', 'Q17I', 'Q18I', 'Q19I', 'Q20I', 'VCL1', 'VCL2', 'VCL3', 'VCL4', 'VCL5', 'VCL6', 'VCL7', 'VCL8', 'VCL9', 'VCL10', 'VCL11', 'VCL12', 'VCL13', 'VCL14', 'VCL15', 'VCL16', 'introelapse', 'testelapse', 'surveyelapse', 'screenw', 'screenh', 'hand'], axis = 1)

#convert numeric categories to numeric datatype
df_num = df.filter(['Q1A', 'Q2A', 'Q3A', 'Q4A', 'Q5A', 'Q6A', 'Q7A', 'Q8A', 'Q9A', 'Q10A', 'Q11A', 'Q12A', 'Q13A', 'Q14A', 'Q15A', 'Q16A', 'Q17A', 'Q18A', 'Q19A', 'Q20A', 'Q1E', 'Q2E', 'Q3E', 'Q4E', 'Q5E', 'Q6E', 'Q7E', 'Q8E', 'Q9E', 'Q10E', 'Q11E', 'Q12E', 'Q13E', 'Q14E', 'Q15E', 'Q16E', 'Q17E', 'Q18E', 'Q19E', 'Q20E', 'TIPI1', 'TIPI2', 'TIPI3', 'TIPI4', 'TIPI5', 'TIPI6', 'TIPI7', 'TIPI8', 'TIPI9', 'TIPI10', 'age', 'familysize'] )
df_num = df_num.apply(pd.to_numeric)

#replace empty values with NaNs
df_num = df_num.mask(df_num == '')

#remove nan and inf values from numerical data
df_num = df_num[~df_num.isin([np.nan, np.inf, -np.inf]).any(1)].astype(np.float64)

#remove outliers
df_num = df_num[(np.abs(stats.zscore(df_num)) < 3).all(axis=1)]

#assign proper labels to categorical columns
df_cat = df.drop(['Q1A', 'Q2A', 'Q3A', 'Q4A', 'Q5A', 'Q6A', 'Q7A', 'Q8A', 'Q9A', 'Q10A', 'Q11A', 'Q12A', 'Q13A', 'Q14A', 'Q15A', 'Q16A', 'Q17A', 'Q18A', 'Q19A', 'Q20A', 'Q1E', 'Q2E', 'Q3E', 'Q4E', 'Q5E', 'Q6E', 'Q7E', 'Q8E', 'Q9E', 'Q10E', 'Q11E', 'Q12E', 'Q13E', 'Q14E', 'Q15E', 'Q16E', 'Q17E', 'Q18E', 'Q19E', 'Q20E', 'TIPI1', 'TIPI2', 'TIPI3', 'TIPI4', 'TIPI5', 'TIPI6', 'TIPI7', 'TIPI8', 'TIPI9', 'TIPI10', 'age', 'familysize', 'major'] , axis = 1)

def remap(col, pts, labs):
    df_cat[col] = df_cat[col].map(dict((zip(pts,labs)))).astype("category")

remap('education', [1,2,3,4], ['<HS','HSD','CD','GD'])
remap('engnat', [2,1], ['No', 'Yes'])
remap('urban', [1,2,3], ['Rural', 'Suburban', 'Urban'])
remap('gender', [1,2,3], ['Male', 'Female', 'Other'])
remap('religion', np.arange(1,13), ['Agnostic', 'Atheist', 'Buddhist', 'Christian (Catholic)', 'Christian (Mormon)', 'Christian (Protestant)', 'Christian (Other)', 'Hindu', 'Jewish', 'Muslim', 'Sikh', 'Other'])
remap('orientation', [1,2,3,4,5], ['Heterosexual', 'Bisexual', 'Homosexual', 'Asexual', 'Other'])
remap('voted', [2,1], ['No', 'Yes'])
remap('race', [10,20,30,40,50,60,70], ['Asian', 'Arab', 'Black', 'Indigenous Australian', 'Native American', 'White', 'Other'])
remap('married', [1,2,3], ['Never married', 'Currently married', 'Previously married'])

#convert alpha2 country names to full country names
import pycountry

countries = []
codes = []
for country in list(pycountry.countries):
    countries.append(country.name)
    codes.append(country.alpha_2)
remap('country', codes, countries)

#replace empty values with NaNs
df_cat = df_cat.mask(df_cat == '')

#remove nan from categorical data (excluding major)
df_cat = df_cat[~df_cat.isin([np.nan]).any(1)]

#concatenate numeric columns with categorical columns using inner join
df = pd.concat([df_num, df_cat, df['major']], axis = 1, join="inner")

#reset index
df.reset_index(inplace=True)

print("Initial shape of Dataset: ")
print(df_r.shape)
print("")

print("Final shape: ")
print(df.shape)

Initial shape of Dataset: 
(73489, 105)

Final shape: 
(63169, 64)


In [129]:
#use string similarity checker to assign category names for 'major' column
from fuzzywuzzy import process, fuzz

df['major'] = [str(s).lower() for s in df['major']]

#remove leading and trailing spaces for majors in dataframe
print(len(df['major']))
print(df['major'].isna().sum())

#set any major that begins with characters other than a letter or a number to NaN
reg = df['major'].str.match(r'^([a-zA-Z\s])*$').astype(bool)

print(len(df['major'].loc[reg]))

df['major'].loc[~reg] = np.nan
print(df['major'].isna().sum())

#view unique majors
unique_majors = df['major'].unique().tolist()


#print(sorted(df['major'].dropna())[0:100])

set_majors = ['business', 'nursing', 'psychology', 'biology', 'engineering', 'education', 'communications', 'finance', 'accounting', 'sociology', 'anthropology', 'computer science', 'english', 'economics', 'political science', 'history', 'math', 'environmental sciences', 'chemistry', 'physics', 'philosophy']

#Create tuples of brand names, matched brand names, and the score
score_sort = [(x,) + i
             for x in set_majors 
             for i in process.extract(x, unique_majors, scorer=fuzz.token_sort_ratio)]
#Create a dataframe from the tuples
similarity_sort = pd.DataFrame(score_sort, columns=['major_sort','match_sort','score_sort'])

similarity_sort['sorted_major_sort'] = np.minimum(similarity_sort['major_sort'], similarity_sort['match_sort'])

high_score_sort = similarity_sort[(similarity_sort['score_sort'] >= 80) &
                (similarity_sort['major_sort'] !=  similarity_sort['match_sort']) &
                (similarity_sort['sorted_major_sort'] != similarity_sort['match_sort'])]

high_score_sort = high_score_sort.drop('sorted_major_sort',axis=1).copy()

high_score_sort.groupby(['major_sort','score_sort']).agg(
                        {'match_sort': ', '.join}).sort_values(
                        ['score_sort'], ascending=False)

#map majors in dataset to matched majors

63169
0
61807
1362


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['major'].loc[~reg] = np.nan


Unnamed: 0_level_0,Unnamed: 1_level_0,match_sort
major_sort,score_sort,Unnamed: 2_level_1
sociology,100,sociology
nursing,100,nursing
history,100,history
philosophy,100,philosophy
finance,100,finance
physics,100,physics
english,100,english
accounting,100,accounting
engineering,100,engineering
education,100,education


<h2>2. Score individuals on Mach-IV and TIP Inventories</h2>

<p>These are the sets of variables that I use in my regressions/classifications below:</p>
<ul>
    <li><b>mach_iv</b> (20 features): selected responses to the Mach-IV inventory; will be used to calculate Machiavellianism scores (stored in <b>scores</b>)</li>
    <li><b>tip_i</b> (10 features): selected response to the Ten-Item Personality Inventory; will be used to calculate Big Five scores (stored in <b>b5</b>)</li>
    <li><b>dem</b> (13 features): recorded responses to various demographic questions (country, education, engnat, urban, gender, religion, orientation, voted, race, familysize, age, married, major)</li>
</ul>

<p>The Mach-IV and TIPI are scored below</p>

In [131]:
#separate data
mach_iv = df.filter(['Q1A', 'Q2A', 'Q3A', 'Q4A', 'Q5A', 'Q6A', 'Q7A', 'Q8A', 'Q9A', 'Q10A', 'Q11A', 'Q12A', 'Q13A', 'Q14A', 'Q15A', 'Q16A', 'Q17A', 'Q18A', 'Q19A', 'Q20A'])
tip_i = df.filter(['TIPI1', 'TIPI2', 'TIPI3', 'TIPI4', 'TIPI5', 'TIPI6', 'TIPI7', 'TIPI8', 'TIPI9', 'TIPI10'])
dem = df.filter(['age', 'familysize', 'education', 'engnat', 'urban', 'gender', 'religion', 'orientation', 'voted', 'race', 'married', 'country', 'major'])


#calculate mach scores
def rev(col):
    return 6 - col

scores = pd.DataFrame({'tactics': (mach_iv['Q1A'] + mach_iv['Q2A'] + rev(mach_iv['Q3A']) + mach_iv['Q9A'] + rev(mach_iv['Q10A']) + mach_iv['Q15A'] + rev(mach_iv['Q17A']))* (100/35),
                       'humanity': (rev(mach_iv['Q4A']) + mach_iv['Q5A'] + rev(mach_iv['Q7A']) + rev(mach_iv['Q8A']) + rev(mach_iv['Q11A']) + rev(mach_iv['Q14A']) + mach_iv['Q20A']) * (100/35),
                       'morality': (rev(mach_iv['Q6A']) + mach_iv['Q12A'] + mach_iv['Q13A'] + rev(mach_iv['Q16A']) + mach_iv['Q18A'] + mach_iv['Q19A']) * (100/30)})

scores['total'] = scores.mean(axis = 1)

print("Mach-IV Scores: ")
print(scores.head())
print("")

#calculate tip scores
def rev(col):
    return 8 - col

b5 = pd.DataFrame({'extrav': (tip_i['TIPI1'] + rev(tip_i['TIPI6']))/2,
                   'agree': (rev(tip_i['TIPI2']) + tip_i['TIPI7'])/2,
                   'consc': (tip_i['TIPI3'] + rev(tip_i['TIPI8']))/2,
                   'emot': (rev(tip_i['TIPI4']) + tip_i['TIPI9'])/2,
                   'open': (tip_i['TIPI5'] + rev(tip_i['TIPI10']))/2})

print("TIP Inventory Scores: ")
print(b5.head())

Mach-IV Scores: 
     tactics   humanity    morality      total
0  57.142857  77.142857   86.666667  73.650794
1  88.571429  77.142857   96.666667  87.460317
2  88.571429  82.857143  100.000000  90.476190
3  77.142857  88.571429   80.000000  81.904762
4  85.714286  68.571429   90.000000  81.428571

TIP Inventory Scores: 
   extrav  agree  consc  emot  open
0     5.5    5.0    5.0   7.0   7.0
1     2.0    4.0    5.0   6.0   3.5
2     1.0    1.0    5.0   1.0   4.5
3     6.0    4.5    5.5   1.5   6.0
4     2.0    4.0    5.0   3.0   5.0


<h2>3. Construct Models</h2>

<p>Look for relationships between Machiavellian scores and Big 5 personality traits obtained from TIPI (use multiple regression model: y ~ b0 + b1*x1 + b2*x2 +...+ b5*x5; obtain semi-partial correlations) and demographic characteristics (use simple regression models with most* variables as factors)
</p>

In [155]:
#Multiple regression model for mach-iv and big 5 personality traits
import pingouin as pg
import statsmodels.api as sm

X = b5
y = scores
data = pd.concat([X,y], axis=1)

x = sm.add_constant(X)
model = sm.OLS(y, x).fit()

#print(model.summary())

def semipartial_corr(data, x, y, x_covar):
    return float(pg.partial_corr(data = data, x=x, y=y, x_covar=x_covar, method = 'spearman')['r'])

#retrieve semipartial correlations
personality_corrs = pd.DataFrame(index = X.columns, columns = y.columns)
for dim in y.columns:
    for trait in range(5):
        personality_corrs[dim][X.columns[trait]] = semipartial_corr(data, X.columns[trait], dim, X.columns[~trait])

print(personality_corrs)


         tactics  humanity  morality     total
extrav -0.056016 -0.097142 -0.090663 -0.095229
agree  -0.353775 -0.380659 -0.416449 -0.448071
consc  -0.064453 -0.057533 -0.097489 -0.087744
emot    0.081107  0.045628  0.044057   0.06639
open    0.026809  0.068155  0.046496  0.054772


In [57]:
#for i in range(1,11):
#    print("'TIPI" + str(i) + "'", end = ", ")


AS
