This is a technique I'm experimenting with. Let's say you have a bunch of psychometric data, and you want to figure out which questions correlate together. The goal is to combine them in order to create new factors. What if we tried doing this manually, and let the *data* tell us how many factors we need?

Let's walk through how we might do it.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
# Responses to a personality test
# Pre-wrangled, because getting it nice and tidy isn't the interesting part here
data = pd.read_csv('Personality_AB5C_prewrangled.csv')
del data['Unnamed: 0']
def absolute_correlations(col, df=data, threshold=.5):
    '''Finds related questions, with both positive and negative correlations'''
    corrs = pd.DataFrame(df.select_dtypes(include=[np.number]).corrwith(df[col]), columns=['correlation'])
    corrs['absol'] = np.abs(corrs['correlation'])
    return corrs[corrs.absol > threshold].sort_values('absol', ascending=False).drop('absol', axis=1)

def rev(item):
    '''Questions are scored out of 7, and many need to be reverse-scored'''
    return 8 - item

data.shape

(86, 512)

In [3]:
from random import choice

# 30 sample questions out of 500+ total (no need to see all of them!)
for i in range(30):
    print(choice(data.columns))

iinsultpeople
iseldomnoticedetails
iseldomgetlostinthought
iusemybrain
ifollowdirections
itrytooutdoothers
iseldomgetlostinthought
iprefervarietytoroutine
iamquicktojudgeothers
iampreciseinmywork
idontmindeatingalone
iamdeeplymovedbyothersmisfortunes
ispendtimereflectingonthings
isenseotherswishes
icantcomeupwithnewideas
iameasilyfrightened
ispendtimereflectingonthings
isticktotherules
iretreatfromothers
iamnotasstrictasishouldbe
idontgetexcitedaboutthings
igeteasilyagitated
iseekdanger
idontmindeatingalone
itaketimeoutforothers
ineversplurge
igetaheadstartonothers
iamnotsurewheremylifeisgoing
imakewellconsidereddecisions
ihavecryingfits


Okay, here's the rationale. Let's look at the questions with the highest standard deviations *first*. These controversial ones will probably tell us more about someone than the questions that everyone answers similarly.

In [4]:
# Transpose the dataframe, so we're analyzing questions instead of users
qs = data.select_dtypes(include=[np.number]).fillna(data.mean()).T

qs['stdev'] = qs.std(axis=1)

questions_sorted = qs.sort_values('stdev', ascending=False)['stdev'].index

# Most controversial at top; least controversial at bottom
questions_sorted

Index(['icryduringmovies', 'icryeasily', 'iburstintotears',
       'iameasilymovedtotears', 'iloveagoodfight', 'iwouldntharmafly',
       'itrytooutdoothers', 'idontcallpeoplejusttotalk', 'idonotlikepoetry',
       'idontmindeatingalone',
       ...
       'itakeothersinterestsintoaccount', 'ishowmygratitude',
       'irespecttheprivacyofothers', 'ienjoythebeautyofnature',
       'ilikeharmonyinmylife', 'ilovebeautifulthings',
       'irespectothersfeelings', 'ilikemusic', 'irespectothers',
       'iappreciategoodmanners'],
      dtype='object', length=512)

So apparently there are two types of people in the world: those who cry during movies and those who don't.

On the other hand, just about everyone appreciates good manners!

Here's where things get interesting. We're going to iterate over each question and see if there are at least 3 other questions whose correlation with it is stronger than ±0.5, i.e. an absolute correlation above 0.5 (I'd go higher, but we don't have enough data currently).

If there are 3 or more related questions, we'll group them all together and rescale the combined score so that it tops out at 100. As a matter of personal preference, I'll use each question only once. Priority will go to the question with the highest standard deviation, since we walk through the questions in that order.
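
Before the full loop, here's a small sketch of the scoring arithmetic on its own. The responses below are made up, and `irarelycry` is an invented item name used only to show reverse-scoring. Dividing the summed items by `.07 * n_items` is the same as expressing the mean item score as a percentage of 7, so someone who answers 1 on everything lands at about 14.3 rather than 1.

```python
import pandas as pd

# Toy responses on a 1-7 scale (invented data, not from the real survey)
toy = pd.DataFrame({
    'icryeasily':      [7, 1, 4],
    'iburstintotears': [7, 1, 4],
    'irarelycry':      [1, 7, 4],  # hypothetical negatively-keyed item
})

pos_items = ['icryeasily', 'iburstintotears']
neg_items = ['irarelycry']
n_items = len(pos_items) + len(neg_items)

# Reverse-score the negative items (8 - x), add everything up,
# then rescale so that answering 7 on every item gives 100
raw = toy[pos_items].sum(axis=1) + (8 - toy[neg_items]).sum(axis=1)
score = raw / (.07 * n_items)

print(score.round(1).tolist())  # [100.0, 14.3, 57.1]
```

The three toy respondents are the extreme high scorer, the extreme low scorer, and someone who answers 4 across the board.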

In [5]:
from collections import defaultdict

# Record which questions go in each factor
factors = defaultdict(list)

# Use each question only once
questions_used = []

# Put users' scores in a dataframe
df_factors = pd.DataFrame()

for question in questions_sorted:
    if question not in questions_used:
        related = absolute_correlations(
            question,
            df=data[[i for i in data.columns if i not in questions_used]],
            threshold=.50
        )        
        if len(related) >= 4:
            for related_question in related.index:
                questions_used.append(related_question)
            # Store the question names as a flat list rather than a nested Index
            factors[question].extend(related.index)
            
            pos_items = related[related['correlation'] > 0].index
            neg_items = related[related['correlation'] < 0].index
            all_items = list(pos_items) + list(neg_items)   

            # Reverse-score the negatively correlated items, then rescale so the maximum is 100
            df_factors[question] = (data[pos_items].sum(axis=1) + rev(data[neg_items]).sum(axis=1)) / (.07 * len(all_items))
            
print('found', str(df_factors.shape[1]), 'personality traits')

found 41 personality traits


We ended up with 41 factors, which isn't far off from the [AB5C personality test](http://ipip.ori.org/newab5ckey.htm) that purports to have 45. So the technique seems to work!

But I think there's a way we can improve this further. Perhaps we should prioritize factors that are orthogonal to the ones we've already discovered. So if the 1st factor is the "cry during movies" factor, the 2nd one should be one that has a near-zero correlation with the 1st.

In [None]:
# TODO: Find factors orthogonally
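
One possible starting point for that TODO, as a rough sketch rather than a finished method: take a dataframe of factor scores (like `df_factors` above) and greedily keep a factor only if its absolute correlation with every factor already kept is below some cutoff. Note that this filters factors *after* they've been built, which is a simpler stand-in for prioritizing orthogonal factors during discovery. The function name `pick_orthogonal`, the `max_abs_corr` cutoff of 0.2, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def pick_orthogonal(scores, max_abs_corr=0.2):
    '''Greedily keep columns that are nearly uncorrelated with those already kept.

    Hypothetical helper: columns are visited in their existing order, so put
    the highest-priority factor first.
    '''
    corr = scores.corr().abs()
    kept = []
    for name in scores.columns:
        # The first column is always kept, since there's nothing to compare against
        if all(corr.loc[name, k] <= max_abs_corr for k in kept):
            kept.append(name)
    return kept

# Toy factor scores (invented): b nearly duplicates a, c is unrelated to both
toy_scores = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8],
    'b': [2, 4, 6, 8, 10, 12, 14, 17],
    'c': [1, -1, -1, 1, 1, -1, -1, 1],
})

print(pick_orthogonal(toy_scores))  # ['a', 'c']
```

With real data, `pick_orthogonal(df_factors)` would return the subset of factor names that survive the cutoff. How strict the cutoff should be is an open question, especially with only 86 respondents.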