# Factor Analysis

(The tool you might not be using)

Everyone knows that you can use clustering to assign labels to the observations in your dataset.

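But what if you want to assign labels to the features? Did you know there's a tool for that? It's called factor analysis, and it's commonly used in psychology research. It groups correlated features under a small number of hidden "factors", and each feature gets a loading that says how strongly it is tied to each factor. This notebook is a quick introduction to how it works.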
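
For the curious, here is the idea in symbols. This is the standard textbook formulation, written in notation I'm choosing for illustration (the symbols are my own assumptions, not something taken from the dataset). Each observation $x$ with $p$ features is assumed to be generated from $k$ hidden factors $z$:

$$
x = \mu + W z + \varepsilon, \qquad z \sim \mathcal{N}(0, I_k), \qquad \varepsilon \sim \mathcal{N}(0, \Psi), \qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_p)
$$

Here $W$ is the $p \times k$ matrix of loadings and $\Psi$ holds a separate noise variance for each feature. Because the noise is independent per feature, any correlation between features has to come through the shared factors:

$$
\operatorname{Cov}(x) = W W^\top + \Psi
$$

A feature with a large entry (in absolute value) in column $j$ of $W$ is strongly tied to factor $j$, which is what makes the loadings usable as feature labels.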

**The dataset**

The dataset I'll be using is a fun one: *The Young People Survey* by Miroslav Sabo. It examines various personality traits as well as things that Slovakian college students like and don't like.

You can find it here:

https://www.kaggle.com/miroslavsabo/young-people-survey


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.decomposition import FactorAnalysis

%matplotlib inline


In [88]:
# Load the (lightly-wrangled) data
data = pd.read_csv('responses2.csv')

# Set these aside: unlike the survey answers, they are not on a 1-5 scale,
# and they will mess up our analysis if we keep them in the dataframe
weight = data.pop('weight')
height = data.pop('height')
age = data.pop('age')

# Remove extra column(s)
for col in data.columns:
    if 'unnamed' in col.lower():
        del data[col]

data.head()

| | music | slow_songs_or_fast_songs | dance | folk | country | classical_music | musical | pop | rock | metal_or_hardrock | ... | number_of_siblings | gender | left__right_handed | education | only_child | village__town | house__block_of_flats | i_am_always_on_time | i_lie_to_others | i_spend_a_lot_of_time_online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 3.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 5.0 | 5.0 | 1.0 | ... | 1.0 | female | right handed | college/bachelor degree | no | village | block of flats | 5 | 1 | 3 |
| 1 | 4.0 | 4.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 | 4.0 | ... | 2.0 | female | right handed | college/bachelor degree | no | city | block of flats | 3 | 3 | 3 |
| 2 | 5.0 | 5.0 | 2.0 | 2.0 | 3.0 | 4.0 | 5.0 | 3.0 | 5.0 | 3.0 | ... | 2.0 | female | right handed | secondary school | no | city | block of flats | 1 | 3 | 3 |
| 3 | 5.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | ... | 1.0 | female | right handed | college/bachelor degree | yes | city | house/bungalow | 3 | 2 | 5 |
| 4 | 5.0 | 3.0 | 4.0 | 3.0 | 2.0 | 4.0 | 3.0 | 5.0 | 3.0 | 1.0 | ... | 1.0 | female | right handed | secondary school | no | village | house/bungalow | 5 | 5 | 3 |


In [89]:
# Dummify the categorical variables

for col in data.columns:
    if data[col].dtype == 'O':
        # prefix=col+'_' plus the default separator '_' yields names
        # like 'gender__female'
        dums = pd.get_dummies(data[col], prefix=col + '_')
        del data[col]
        data = pd.concat([data, dums], axis=1)

# Now that everything is numerical, let's just use the median
# for missing values.
data.fillna(data.median(), inplace=True)

data.dtypes.value_counts()

float64    131
uint8       26
int64        8
dtype: int64

# Determining the number of factors

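Just like with clustering, you typically don't know in advance how many "factors" are ideal for your data, so expect some trial and error.

In psychology, the widely used Big Five model describes personality with 5 factors (or dimensions, if you prefer), so 5 is a natural starting point for this walkthrough.

As a rough sanity check, the next cell fits models with 1 to 20 factors. For each model it takes the 4 strongest absolute loadings of every factor and reports the weakest of them. If that number is small, at least one factor has few features strongly tied to it. With 5 factors the value is still about 0.40; with 6 it drops to about 0.29.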

In [90]:
print('Minimum factor loadings for n factors:\n')

for n_comp in range(1, 21):
    print(n_comp, end=': ')
    fa = FactorAnalysis(n_components=n_comp).fit(data)

    # One row per feature, one column per factor
    factors = pd.DataFrame(fa.components_, columns=data.columns).T

    # For each factor, take its 4 largest absolute loadings and keep
    # the smallest of them
    mins = [factors[i].abs().nlargest(4).min() for i in factors.columns]

    # Report the weakest of those values across all factors
    print(np.min(mins))

Minimum factor loadings for n factors:

1: 0.5602341221357283
2: 0.560463461131604
3: 0.5609574210742054
4: 0.45260815447720626
5: 0.40342127447605636
6: 0.294939713278729
7: 0.29502709319550374
8: 0.2748624015308651
9: 0.2748486391519195
10: 0.19773762195784178
11: 0.1977399387115115
12: 0.19238189322998026
13: 0.19206553949447674
14: 0.19251423633329182
15: 0.19168970185477138
16: 0.19200738290256628
17: 0.19259667730020377
18: 0.1927836156254579
19: 0.15274986149834205
20: 0.17110742889338112
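
The minimum-loading check above is a home-made heuristic. A different way to compare candidate factor counts, sketched here as an alternative rather than as what this notebook does, is to score each model on held-out rows. `FactorAnalysis.score` returns the average log-likelihood of the samples it is given, and `cross_val_score` uses that method when no `scoring` argument is passed. The helper name `cv_loglik_by_n_factors` and the synthetic `toy` data are assumptions made up for illustration; on the survey you would pass `data` instead.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score


def cv_loglik_by_n_factors(X, candidates, cv=5):
    """Mean cross-validated log-likelihood for each candidate factor count.

    Hypothetical helper, not part of the original notebook.
    """
    scores = {}
    for n in candidates:
        fa = FactorAnalysis(n_components=n)
        # No `scoring` argument: cross_val_score falls back to
        # FactorAnalysis.score (average log-likelihood of held-out rows)
        scores[n] = cross_val_score(fa, X, cv=cv).mean()
    return pd.Series(scores, name='mean_cv_loglik')


# Synthetic stand-in for `data`: 3 hidden factors driving 12 features
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 12))
toy = latent @ mixing + 0.5 * rng.normal(size=(500, 12))

cv_scores = cv_loglik_by_n_factors(toy, candidates=range(1, 7))
print(cv_scores)
print('Best n:', cv_scores.idxmax())
```

Higher is better. On data like this you would expect the score to climb until the true number of factors is reached and then level off, so look for the point where adding factors stops helping rather than only at the single best value.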


In [91]:
fa = FactorAnalysis(n_components=5).fit(data)

# Stick it in a dataframe
factors = pd.DataFrame(fa.components_, columns=data.columns).T

factors

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| music | -0.069381 | 0.070280 | -0.078323 | 0.040963 | 0.081981 |
| slow_songs_or_fast_songs | 0.069407 | -0.058750 | -0.150641 | 0.073819 | -0.046562 |
| dance | -0.052838 | 0.051696 | -0.477774 | -0.184094 | -0.058575 |
| folk | -0.052562 | 0.494050 | 0.144885 | -0.139008 | -0.032167 |
| country | 0.081229 | 0.365244 | 0.115704 | -0.045446 | 0.002401 |
| classical_music | -0.001539 | 0.719751 | 0.340046 | 0.012277 | 0.260682 |
| musical | -0.373405 | 0.491881 | 0.017753 | -0.108138 | 0.115664 |
| pop | -0.191700 | -0.019034 | -0.375983 | -0.263318 | 0.022136 |
| rock | 0.016471 | 0.283631 | 0.300872 | 0.300420 | 0.268782 |
| metal_or_hardrock | 0.245242 | 0.276230 | 0.425969 | 0.395199 | 0.177400 |
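
To get from this table to the "labels for features" promised at the start, one option is to tag each feature with the factor it loads on most strongly and drop features that don't load strongly on anything. This is a minimal sketch, not part of the original analysis: the helper name `label_features` is hypothetical, and the 0.4 cutoff is an assumption borrowed from the threshold used elsewhere in this notebook. The toy frame reuses five rows from the table above.

```python
import pandas as pd


def label_features(factors, threshold=0.4):
    """Tag each feature with the factor it loads on most strongly.

    Hypothetical helper. Features whose strongest absolute loading is
    below `threshold` are left out.
    """
    abs_loadings = factors.abs()
    labels = pd.DataFrame({
        'factor': abs_loadings.idxmax(axis=1),     # column with the largest |loading|
        'abs_loading': abs_loadings.max(axis=1),   # size of that loading
    })
    return labels[labels['abs_loading'] >= threshold]


# Five rows copied from the loadings table above
toy_factors = pd.DataFrame(
    [
        [-0.069381, 0.070280, -0.078323, 0.040963, 0.081981],
        [-0.052838, 0.051696, -0.477774, -0.184094, -0.058575],
        [-0.052562, 0.494050, 0.144885, -0.139008, -0.032167],
        [-0.001539, 0.719751, 0.340046, 0.012277, 0.260682],
        [0.245242, 0.276230, 0.425969, 0.395199, 0.177400],
    ],
    index=['music', 'dance', 'folk', 'classical_music', 'metal_or_hardrock'],
)

feature_labels = label_features(toy_factors)
print(feature_labels)
```

Taking only the single strongest factor is a simplification: a feature such as `metal_or_hardrock` loads almost as heavily on factor 3 as on factor 2, so a hard label hides some of the picture.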


In [94]:
def show_loadings(factor=0, factors=factors):
    """Return one factor's loadings with absolute value above 0.4,
    strongest first, keeping the original signs."""
    loading = factors[factor]
    strong = loading[loading.abs() > 0.4]
    # Order by absolute value, but report the signed loading
    order = strong.abs().sort_values(ascending=False).index
    return strong.loc[order].to_frame('loading')

show_loadings(0)

| | loading |
|---|---|
| i_cry_when_i_feel_down_or_things_dont_go_the_right_way | -0.791775 |
| cars | 0.636347 |
| pc | 0.614638 |
| war | 0.561039 |
| reading | -0.551145 |
| romantic | -0.541651 |
| phobia_spiders | -0.523123 |
| action | 0.51081 |
| shopping | -0.508355 |
| gender__female | -0.488717 |


In [95]:
show_loadings(1)

| | loading |
|---|---|
| classical_music | 0.719751 |
| opera | 0.648563 |
| art_exhibitions | 0.648239 |
| musical_instruments | 0.635492 |
| swing_jazz | 0.630821 |
| theatre | 0.60741 |
| reading | 0.588138 |
| religion | 0.552343 |
| history | 0.523211 |
| medicine | 0.515634 |


In [96]:
show_loadings(2)

| | loading |
|---|---|
| i_spend_a_lot_of_money_on_my_appearance | -0.710439 |
| i_enjoy_going_to_large_shopping_centres | -0.679245 |
| shopping | -0.678412 |
| hiphop_rap | -0.638089 |
| i_prefer_branded_clothing_to_non_branded | -0.572782 |
| i_have_lots_of_friends | -0.507483 |
| adrenaline_sports | -0.499998 |
| active_sport | -0.49839 |
| i_am_always_full_of_life_and_energy | -0.490674 |
| dance | -0.477774 |


In [97]:
show_loadings(3)

| | loading |
|---|---|
| phobia_dangerous_dogs | -0.553878 |
| phobia_snakes | -0.533101 |
| phobia_rats | -0.469641 |
| i_have_to_be_well_prepared_before_public_speaking | -0.436367 |
| i_worry_about_my_health | -0.42688 |
| i_prefer_big_dangerous_dogs_to_smaller_calmer_dogs | 0.422508 |


In [98]:
show_loadings(4)

| | loading |
|---|---|
| i_wish_i_could_change_the_past_because_of_the_things_i_have_done | 0.440735 |
| i_am_always_on_time | -0.416822 |
| i_feel_lonely_in_life | 0.414363 |
| phobia_aging | 0.403421 |
