# SELVE Item Pool Selection

## Purpose
Select the best 15-20 items per dimension from validated sources to build the item pool from which the final SELVE assessment is drawn.

## Selection Criteria
1. **Statistical Quality**: Highest corrected item-total correlations, i.e. each item against the mean of the remaining items (r > 0.40 preferred)
2. **Content Coverage**: Balanced representation of subfacets
3. **Language Quality**: Clear, modern, unambiguous wording
4. **Balance**: Mix of positively and negatively worded items
5. **Cultural Neutrality**: Avoid culture-specific references
6. **Non-Redundancy**: Remove items that are too similar

## Target
- **120-160 items total** across 8 dimensions
- **15-20 items per dimension**
- Final assessment: 40-60 items (via adaptive testing)

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from scipy import stats
import json

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

print("✅ Libraries imported")

✅ Libraries imported


## 1. LUMEN ✨ - Social Energy & Enthusiasm
**Source**: Big Five Extraversion (items E1-E10 of the 50-item inventory)

In [2]:
# Load Big Five data
big5_data = pd.read_csv('/home/chris/selve/data/openpsychometrics-rawdata/BIG5/data.csv', sep='\t')

# LUMEN items (Extraversion - E1 to E10)
lumen_items = [f'E{i}' for i in range(1, 11)]

# Item texts from Big Five codebook
lumen_item_texts = {
    'E1': "I am the life of the party.",
    'E2': "I don't talk a lot.",  # Reversed
    'E3': "I feel comfortable around people.",
    'E4': "I keep in the background.",  # Reversed
    'E5': "I start conversations.",
    'E6': "I have little to say.",  # Reversed
    'E7': "I talk to a lot of different people at parties.",
    'E8': "I don't like to draw attention to myself.",  # Reversed
    'E9': "I don't mind being the center of attention.",
    'E10': "I am quiet around strangers."  # Reversed
}

# Extract and clean LUMEN data
lumen_df = big5_data[lumen_items].dropna()

# Reverse score negatively keyed items
reverse_items = ['E2', 'E4', 'E6', 'E8', 'E10']
lumen_df_scored = lumen_df.copy()
for item in reverse_items:
    lumen_df_scored[item] = 6 - lumen_df_scored[item]  # 5-point scale

# Calculate item-total correlations
lumen_correlations = []
for item in lumen_items:
    other_items = [i for i in lumen_items if i != item]
    total_without_item = lumen_df_scored[other_items].mean(axis=1)
    corr = lumen_df_scored[item].corr(total_without_item)
    lumen_correlations.append({
        'item': item,
        'text': lumen_item_texts[item],
        'correlation': corr,
        'reversed': item in reverse_items
    })

lumen_corr_df = pd.DataFrame(lumen_correlations).sort_values('correlation', ascending=False)

print("LUMEN - Top Items by Correlation:")
print(lumen_corr_df.to_string(index=False))
print(f"\nAverage correlation: {lumen_corr_df['correlation'].mean():.3f}")

LUMEN - Top Items by Correlation:
item                                            text  correlation  reversed
  E5                          I start conversations.     0.711078     False
  E7 I talk to a lot of different people at parties.     0.703092     False
  E4                       I keep in the background.     0.684289      True
  E3               I feel comfortable around people.     0.651022     False
  E2                             I don't talk a lot.     0.648044      True
 E10                    I am quiet around strangers.     0.635780      True
  E1                     I am the life of the party.     0.625926     False
  E9     I don't mind being the center of attention.     0.576892     False
  E6                           I have little to say.     0.573090      True
  E8       I don't like to draw attention to myself.     0.521523      True

Average correlation: 0.633
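
The same leave-one-out loop recurs in every dimension's cell. A minimal sketch of how it could be factored into one helper is given here; `corrected_item_total`, its `scale_max` parameter, and the synthetic `demo` data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def corrected_item_total(df, items, reverse_items=(), scale_max=5):
    """Correlate each item with the mean of the remaining items.

    Reverse-keyed items are flipped as (scale_max + 1) - response,
    which matches the `6 - x` step used for 5-point scales.
    """
    scored = df[list(items)].dropna().copy()
    for item in reverse_items:
        if item in scored.columns:
            scored[item] = (scale_max + 1) - scored[item]

    rows = []
    for item in items:
        # Mean of all other items, so the item is not correlated with itself
        rest = scored.drop(columns=item).mean(axis=1)
        rows.append({
            "item": item,
            "correlation": scored[item].corr(rest),
            "reversed": item in reverse_items,
        })
    return pd.DataFrame(rows).sort_values("correlation", ascending=False)


# Synthetic demo data (NOT the Big Five file): three items driven by one trait.
rng = np.random.default_rng(0)
trait = rng.normal(size=500)
demo = pd.DataFrame({
    "X1": np.clip(np.round(3 + trait + rng.normal(scale=0.5, size=500)), 1, 5),
    "X2": np.clip(np.round(3 - trait + rng.normal(scale=0.5, size=500)), 1, 5),  # reverse-keyed
    "X3": np.clip(np.round(3 + trait + rng.normal(scale=0.5, size=500)), 1, 5),
})
result = corrected_item_total(demo, ["X1", "X2", "X3"], reverse_items=["X2"])
print(result.to_string(index=False))
```

With the real data, a call such as `corrected_item_total(big5_data, lumen_items, ['E2', 'E4', 'E6', 'E8', 'E10'])` should give the same ranking as the LUMEN cell, provided the columns hold 1-5 responses. If the raw file encodes skipped answers as 0 rather than NaN (worth checking in the codebook), those rows would need filtering before reverse scoring, since `dropna()` would not catch them.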


In [3]:
# Select LUMEN items
# Keep all 10 Big Five E-scale items: each has a corrected item-total correlation above 0.40
# Note: a production pool would add items from other extraversion scales to reach 15-20

lumen_selected = lumen_corr_df.copy()

print(f"✅ LUMEN: Selected {len(lumen_selected)} items")
print("\nNote: In production, we would expand to 15-20 items by including:")
print("  - Enthusiasm/Energy items from HEXACO Extraversion")
print("  - Sociability items from 16PF")
print("  - Additional validated extraversion items")

✅ LUMEN: Selected 10 items

Note: In production, we would expand to 15-20 items by including:
  - Enthusiasm/Energy items from HEXACO Extraversion
  - Sociability items from 16PF
  - Additional validated extraversion items


## 2. AETHER 🌫️ - Emotional Stability
**Source**: Big Five Emotional Stability (items N1-N10 of the 50-item inventory)

In [4]:
# AETHER items (Emotional Stability - N1 to N10, reversed)
aether_items = [f'N{i}' for i in range(1, 11)]

aether_item_texts = {
    'N1': "I get stressed out easily.",  # Reversed for stability
    'N2': "I am relaxed most of the time.",
    'N3': "I worry about things.",  # Reversed
    'N4': "I seldom feel blue.",
    'N5': "I am easily disturbed.",  # Reversed
    'N6': "I get upset easily.",  # Reversed
    'N7': "I change my mood a lot.",  # Reversed
    'N8': "I have frequent mood swings.",  # Reversed
    'N9': "I get irritated easily.",  # Reversed
    'N10': "I often feel blue."  # Reversed
}

aether_df = big5_data[aether_items].dropna()

# For emotional stability, most items are reversed (higher neuroticism = lower stability)
reverse_items = ['N1', 'N3', 'N5', 'N6', 'N7', 'N8', 'N9', 'N10']
aether_df_scored = aether_df.copy()
for item in reverse_items:
    aether_df_scored[item] = 6 - aether_df_scored[item]

aether_correlations = []
for item in aether_items:
    other_items = [i for i in aether_items if i != item]
    total_without_item = aether_df_scored[other_items].mean(axis=1)
    corr = aether_df_scored[item].corr(total_without_item)
    aether_correlations.append({
        'item': item,
        'text': aether_item_texts[item],
        'correlation': corr,
        'reversed': item in reverse_items
    })

aether_corr_df = pd.DataFrame(aether_correlations).sort_values('correlation', ascending=False)

print("AETHER - Top Items by Correlation:")
print(aether_corr_df.to_string(index=False))
print(f"\nAverage correlation: {aether_corr_df['correlation'].mean():.3f}")

AETHER - Top Items by Correlation:
item                           text  correlation  reversed
  N6            I get upset easily.     0.691438      True
  N8   I have frequent mood swings.     0.690576      True
  N7        I change my mood a lot.     0.654902      True
  N1     I get stressed out easily.     0.647910      True
  N9        I get irritated easily.     0.643396      True
 N10             I often feel blue.     0.618206      True
  N3          I worry about things.     0.561335      True
  N5         I am easily disturbed.     0.499990      True
  N2 I am relaxed most of the time.     0.495044     False
  N4            I seldom feel blue.     0.342877     False

Average correlation: 0.585
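
The Statistical Quality criterion prefers r > 0.40, and one AETHER item falls short of it. A small sketch of how that cut-off could be applied is shown here, using the correlations copied from the output above; treating 0.40 as a hard filter rather than a preference is an assumption made for illustration.

```python
# Corrected item-total correlations copied from the AETHER output.
aether_r = {
    "N6": 0.691438, "N8": 0.690576, "N7": 0.654902, "N1": 0.647910,
    "N9": 0.643396, "N10": 0.618206, "N3": 0.561335, "N5": 0.499990,
    "N2": 0.495044, "N4": 0.342877,
}

R_THRESHOLD = 0.40  # "preferred" cut-off from the Selection Criteria

# Items below the cut-off are flagged for review, not silently dropped.
flagged = {item: r for item, r in aether_r.items() if r < R_THRESHOLD}
retained = [item for item in aether_r if item not in flagged]

print("Below threshold:", flagged)
print("Retained:", retained)
```

Only N4 ("I seldom feel blue.") is flagged. Whether to drop it is a judgment call, because it is one of just two positively keyed items in this scale.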


## 3. ORPHEUS 🎵 - Empathy & Compassion
**Source**: Big Five Agreeableness (items A1-A10 of the 50-item inventory)

In [5]:
# ORPHEUS items (Agreeableness - A1 to A10)
orpheus_items = [f'A{i}' for i in range(1, 11)]

orpheus_item_texts = {
    'A1': "I feel little concern for others.",  # Reversed
    'A2': "I am interested in people.",
    'A3': "I insult people.",  # Reversed
    'A4': "I sympathize with others' feelings.",
    'A5': "I am not interested in other people's problems.",  # Reversed
    'A6': "I have a soft heart.",
    'A7': "I am not really interested in others.",  # Reversed
    'A8': "I take time out for others.",
    'A9': "I feel others' emotions.",
    'A10': "I make people feel at ease."
}

orpheus_df = big5_data[orpheus_items].dropna()

reverse_items = ['A1', 'A3', 'A5', 'A7']
orpheus_df_scored = orpheus_df.copy()
for item in reverse_items:
    orpheus_df_scored[item] = 6 - orpheus_df_scored[item]

orpheus_correlations = []
for item in orpheus_items:
    other_items = [i for i in orpheus_items if i != item]
    total_without_item = orpheus_df_scored[other_items].mean(axis=1)
    corr = orpheus_df_scored[item].corr(total_without_item)
    orpheus_correlations.append({
        'item': item,
        'text': orpheus_item_texts[item],
        'correlation': corr,
        'reversed': item in reverse_items
    })

orpheus_corr_df = pd.DataFrame(orpheus_correlations).sort_values('correlation', ascending=False)

print("ORPHEUS - Top Items by Correlation:")
print(orpheus_corr_df.to_string(index=False))
print(f"\nAverage correlation: {orpheus_corr_df['correlation'].mean():.3f}")

ORPHEUS - Top Items by Correlation:
item                                            text  correlation  reversed
  A4             I sympathize with others' feelings.     0.692467     False
  A9                        I feel others' emotions.     0.631258     False
  A7           I am not really interested in others.     0.618795      True
  A5 I am not interested in other people's problems.     0.610444      True
  A8                     I take time out for others.     0.549323     False
  A2                      I am interested in people.     0.530051     False
  A6                            I have a soft heart.     0.503718     False
 A10                     I make people feel at ease.     0.415147     False
  A1               I feel little concern for others.     0.387814      True
  A3                                I insult people.     0.345268      True

Average correlation: 0.528
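
The Balance criterion asks for a mix of positively and negatively worded items. The sketch below checks that mix for the three dimensions analysed so far, with counts read off the `reversed` column of each output; the 30-70% tolerance band is a hypothetical choice, not a value from the notebook.

```python
# (reverse-keyed count, total items) per dimension, taken from the outputs.
keying = {"LUMEN": (5, 10), "AETHER": (8, 10), "ORPHEUS": (4, 10)}

# Hypothetical tolerance: a 30-70% reverse-keyed share counts as balanced.
LOW, HIGH = 0.30, 0.70

shares = {dim: rev / total for dim, (rev, total) in keying.items()}
unbalanced = [dim for dim, share in shares.items() if not LOW <= share <= HIGH]

for dim, share in shares.items():
    status = "unbalanced" if dim in unbalanced else "balanced"
    print(f"{dim:8s} {share:.0%} reverse-keyed -> {status}")
```

Under this band only AETHER stands out, because eight of its ten items are worded in the neuroticism direction.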


## 4. ORIN 🧭 - Organization & Discipline
**Source**: Big Five Conscientiousness (items C1-C10 of the 50-item inventory)

In [6]:
# ORIN items (Conscientiousness - C1 to C10)
orin_items = [f'C{i}' for i in range(1, 11)]

orin_item_texts = {
    'C1': "I am always prepared.",
    'C2': "I leave my belongings around.",  # Reversed
    'C3': "I pay attention to details.",
    'C4': "I make a mess of things.",  # Reversed
    'C5': "I get chores done right away.",
    'C6': "I often forget to put things back in their proper place.",  # Reversed
    'C7': "I like order.",
    'C8': "I shirk my duties.",  # Reversed
    'C9': "I follow a schedule.",
    'C10': "I am exacting in my work."
}

orin_df = big5_data[orin_items].dropna()

reverse_items = ['C2', 'C4', 'C6', 'C8']
orin_df_scored = orin_df.copy()
for item in reverse_items:
    orin_df_scored[item] = 6 - orin_df_scored[item]

orin_correlations = []
for item in orin_items:
    other_items = [i for i in orin_items if i != item]
    total_without_item = orin_df_scored[other_items].mean(axis=1)
    corr = orin_df_scored[item].corr(total_without_item)
    orin_correlations.append({
        'item': item,
        'text': orin_item_texts[item],
        'correlation': corr,
        'reversed': item in reverse_items
    })

orin_corr_df = pd.DataFrame(orin_correlations).sort_values('correlation', ascending=False)

print("ORIN - Top Items by Correlation:")
print(orin_corr_df.to_string(index=False))
print(f"\nAverage correlation: {orin_corr_df['correlation'].mean():.3f}")

ORIN - Top Items by Correlation:
item                                                     text  correlation  reversed
  C5                            I get chores done right away.     0.561349     False
  C6 I often forget to put things back in their proper place.     0.558501      True
  C4                                 I make a mess of things.     0.544094      True
  C1                                    I am always prepared.     0.539909     False
  C9                                     I follow a schedule.     0.539470     False
  C2                            I leave my belongings around.     0.479405      True
  C8                                       I shirk my duties.     0.462484      True
  C7                                            I like order.     0.456748     False
 C10                                I am exacting in my work.     0.412265     False
  C3                              I pay attention to details.     0.354028     False

Average correlation: 0.491


## 5. LYRA 🦋 - Openness & Curiosity
**Source**: Big Five Openness (items O1-O10 of the 50-item inventory)

In [7]:
# LYRA items (Openness - O1 to O10)
lyra_items = [f'O{i}' for i in range(1, 11)]

lyra_item_texts = {
    'O1': "I have a rich vocabulary.",
    'O2': "I have difficulty understanding abstract ideas.",  # Reversed
    'O3': "I have a vivid imagination.",
    'O4': "I am not interested in abstract ideas.",  # Reversed
    'O5': "I have excellent ideas.",
    'O6': "I do not have a good imagination.",  # Reversed
    'O7': "I am quick to understand things.",
    'O8': "I use difficult words.",
    'O9': "I spend time reflecting on things.",
    'O10': "I am full of ideas."
}

lyra_df = big5_data[lyra_items].dropna()

reverse_items = ['O2', 'O4', 'O6']
lyra_df_scored = lyra_df.copy()
for item in reverse_items:
    lyra_df_scored[item] = 6 - lyra_df_scored[item]

lyra_correlations = []
for item in lyra_items:
    other_items = [i for i in lyra_items if i != item]
    total_without_item = lyra_df_scored[other_items].mean(axis=1)
    corr = lyra_df_scored[item].corr(total_without_item)
    lyra_correlations.append({
        'item': item,
        'text': lyra_item_texts[item],
        'correlation': corr,
        'reversed': item in reverse_items
    })

lyra_corr_df = pd.DataFrame(lyra_correlations).sort_values('correlation', ascending=False)

print("LYRA - Top Items by Correlation:")
print(lyra_corr_df.to_string(index=False))
print(f"\nAverage correlation: {lyra_corr_df['correlation'].mean():.3f}")

LYRA - Top Items by Correlation:
item                                            text  correlation  reversed
 O10                             I am full of ideas.     0.589556     False
  O1                       I have a rich vocabulary.     0.534944     False
  O2 I have difficulty understanding abstract ideas.     0.515194      True
  O5                         I have excellent ideas.     0.515165     False
  O3                     I have a vivid imagination.     0.463691     False
  O8                          I use difficult words.     0.460209     False
  O6               I do not have a good imagination.     0.452209      True
  O4          I am not interested in abstract ideas.     0.438116      True
  O7                I am quick to understand things.     0.430434     False
  O9              I spend time reflecting on things.     0.274489     False

Average correlation: 0.467
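
Several LYRA items are worded as near-opposites or near-synonyms (for example O3 and O6 on imagination, O5 and O10 on ideas), which is the kind of overlap the Non-Redundancy criterion targets. Whether they are statistically redundant is an empirical question. The sketch below shows one way it could be checked with inter-item correlations; `redundant_pairs`, the 0.70 threshold, and the synthetic `demo` data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def redundant_pairs(scored, threshold=0.70):
    """Return (item_a, item_b, r) for item pairs with |r| >= threshold."""
    corr = scored.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


# Synthetic demo (NOT survey data): Q2 is nearly a copy of Q1, Q3 is noise.
rng = np.random.default_rng(1)
q1 = rng.normal(size=400)
demo = pd.DataFrame({
    "Q1": q1,
    "Q2": q1 + rng.normal(scale=0.2, size=400),
    "Q3": rng.normal(size=400),
})
pairs = redundant_pairs(demo)
print(pairs)
```

Applied to a reverse-scored frame such as `lyra_df_scored`, any pair this returns would be a candidate for keeping only the item with the higher item-total correlation.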


## 6. VARA ⚖️ - Honesty & Humility
**Source**: HEXACO Honesty-Humility (40 items)

In [8]:
# Load HEXACO data
hexaco_data = pd.read_csv('/home/chris/selve/data/openpsychometrics-rawdata/HEXACO/data.csv', sep='\t')

# VARA items - we'll select top items from each subfacet
# Sincerity, Fairness, Greed-Avoidance, Modesty

vara_items = [
    # Sincerity (select top 4)
    'HSinc1', 'HSinc2', 'HSinc3', 'HSinc4',
    # Fairness (select top 4)
    'HFair1', 'HFair2', 'HFair3', 'HFair5',
    # Greed-Avoidance (select top 4)
    'HGree1', 'HGree2', 'HGree4', 'HGree6',
    # Modesty (select top 4)
    'HMode1', 'HMode2', 'HMode3', 'HMode4'
]

# Sample item texts (would need full codebook for all)
vara_item_texts = {
    'HSinc1': "I don't pretend to be more than I am.",
    'HFair1': "I would never take things that aren't mine.",
    'HGree1': "I would not enjoy being a famous celebrity.",
    'HMode1': "I don't think that I'm better than other people.",
    # ... (would include all item texts)
}

vara_df = hexaco_data[vara_items].dropna()

# Identify reversed items (from HEXACO documentation)
reverse_items = ['HSinc2', 'HSinc3', 'HSinc4', 'HFair6', 'HFair7', 'HFair8', 
                'HGree3', 'HGree5', 'HGree6', 'HGree8', 'HGree9', 'HGree10',
                'HMode5', 'HMode6', 'HMode7', 'HMode8', 'HMode9', 'HMode10']

vara_df_scored = vara_df.copy()
for item in reverse_items:
    if item in vara_items:
        vara_df_scored[item] = 8 - vara_df_scored[item]  # 7-point scale

vara_correlations = []
for item in vara_items:
    other_items = [i for i in vara_items if i != item]
    total_without_item = vara_df_scored[other_items].mean(axis=1)
    corr = vara_df_scored[item].corr(total_without_item)
    vara_correlations.append({
        'item': item,
        'text': vara_item_texts.get(item, f'{item} (text needed)'),
        'correlation': corr,
        'reversed': item in reverse_items
    })

vara_corr_df = pd.DataFrame(vara_correlations).sort_values('correlation', ascending=False)

print("VARA - Top Items by Correlation:")
print(vara_corr_df.head(16).to_string(index=False))
print(f"\nAverage correlation: {vara_corr_df['correlation'].mean():.3f}")


VARA - Top Items by Correlation:
  item                                             text  correlation  reversed
HMode4                             HMode4 (text needed)     0.451200     False
HMode2                             HMode2 (text needed)     0.426635     False
HMode3                             HMode3 (text needed)     0.421768     False
HFair1      I would never take things that aren't mine.     0.418303     False
HMode1 I don't think that I'm better than other people.     0.405948     False
HSinc2                             HSinc2 (text needed)     0.369481      True
HSinc3                             HSinc3 (text needed)     0.360696      True
HFair2                             HFair2 (text needed)     0.355167     False
HSinc1            I don't pretend to be more than I am.     0.338978     False
HSinc4                             HSinc4 (text needed)     0.322982      True
HFair5                             HFair5 (text needed)     0.307299     False
HFair3             
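
The VARA cell lists 18 reverse-keyed codes but selects only 16 items, so most of the reverse list never takes effect. A short consistency check along these lines makes the keying that is actually applied explicit; both lists are copied from the cell above, and the check says nothing about whether the keys themselves match the HEXACO documentation.

```python
vara_items = [
    "HSinc1", "HSinc2", "HSinc3", "HSinc4",
    "HFair1", "HFair2", "HFair3", "HFair5",
    "HGree1", "HGree2", "HGree4", "HGree6",
    "HMode1", "HMode2", "HMode3", "HMode4",
]
reverse_items = [
    "HSinc2", "HSinc3", "HSinc4", "HFair6", "HFair7", "HFair8",
    "HGree3", "HGree5", "HGree6", "HGree8", "HGree9", "HGree10",
    "HMode5", "HMode6", "HMode7", "HMode8", "HMode9", "HMode10",
]

# Reverse keys that are actually applied to the selected items
applied = [item for item in vara_items if item in reverse_items]
# Reverse keys that refer to items outside the selection
unused = [item for item in reverse_items if item not in vara_items]

print(f"Reverse-scored in this selection: {applied}")
print(f"Listed but not selected: {len(unused)} items")
```

Only four of the sixteen selected items are reverse-scored, so a single wrong key would have a visible effect on the correlations. Confirming each key against the codebook may be worthwhile given that VARA has the lowest average correlation in the pool.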

## 7. CHRONOS ⏳ - Patience & Flow
**Source**: HEXACO Agreeableness (40 items)

In [9]:
# CHRONOS items - select top items from each subfacet
# Forgiveness, Gentleness, Flexibility, Patience

chronos_items = [
    # Patience (select top 5 - this was the best subfacet)
    'APati1', 'APati2', 'APati3', 'APati4', 'APati5',
    # Forgiveness (select top 4)
    'AForg1', 'AForg2', 'AForg3', 'AForg4',
    # Gentleness (select top 4)
    'AGent1', 'AGent2', 'AGent3', 'AGent4',
    # Flexibility (select top 3)
    'AFlex1', 'AFlex2', 'AFlex5'
]

chronos_df = hexaco_data[chronos_items].dropna()

reverse_items = ['APati6', 'APati7', 'APati8', 'APati9', 'APati10',
                'AForg5', 'AForg6', 'AForg7', 'AForg8', 'AForg9', 'AForg10',
                'AGent5', 'AGent6', 'AGent7', 'AGent8', 'AGent9', 'AGent10',
                'AFlex3', 'AFlex4', 'AFlex5', 'AFlex6', 'AFlex7', 'AFlex8', 'AFlex9', 'AFlex10']

chronos_df_scored = chronos_df.copy()
for item in reverse_items:
    if item in chronos_items:
        chronos_df_scored[item] = 8 - chronos_df_scored[item]

chronos_correlations = []
for item in chronos_items:
    other_items = [i for i in chronos_items if i != item]
    total_without_item = chronos_df_scored[other_items].mean(axis=1)
    corr = chronos_df_scored[item].corr(total_without_item)
    chronos_correlations.append({
        'item': item,
        'correlation': corr,
        'reversed': item in reverse_items
    })

chronos_corr_df = pd.DataFrame(chronos_correlations).sort_values('correlation', ascending=False)

print("CHRONOS - Top Items by Correlation:")
print(chronos_corr_df.to_string(index=False))
print(f"\nAverage correlation: {chronos_corr_df['correlation'].mean():.3f}")

CHRONOS - Top Items by Correlation:
  item  correlation  reversed
APati2     0.689469     False
APati1     0.664349     False
APati5     0.614542     False
APati4     0.600113     False
AForg3     0.583282     False
APati3     0.549151     False
AForg2     0.545534     False
AGent4     0.490379     False
AGent1     0.433046     False
AForg4     0.432046     False
AGent3     0.426323     False
AForg1     0.358908     False
AFlex2     0.337960     False
AGent2     0.329034     False
AFlex1     0.320728     False
AFlex5     0.243742      True

Average correlation: 0.476
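
Ranking by correlation alone would fill the top of CHRONOS almost entirely with Patience items, which works against the Content Coverage criterion. The sketch below picks the strongest items within each subfacet instead, using the correlations from the output above; `TOP_N = 2` and deriving the subfacet from the item-code prefix are illustrative assumptions.

```python
from collections import defaultdict

# Corrected item-total correlations copied from the CHRONOS output.
chronos_r = {
    "APati2": 0.689469, "APati1": 0.664349, "APati5": 0.614542, "APati4": 0.600113,
    "AForg3": 0.583282, "APati3": 0.549151, "AForg2": 0.545534, "AGent4": 0.490379,
    "AGent1": 0.433046, "AForg4": 0.432046, "AGent3": 0.426323, "AForg1": 0.358908,
    "AFlex2": 0.337960, "AGent2": 0.329034, "AFlex1": 0.320728, "AFlex5": 0.243742,
}

# Group items by subfacet, taking the code without its trailing digits
by_facet = defaultdict(list)
for item, r in chronos_r.items():
    facet = item.rstrip("0123456789")  # 'APati2' -> 'APati'
    by_facet[facet].append((r, item))

TOP_N = 2  # hypothetical number of items to keep per subfacet
top_per_facet = {
    facet: [item for r, item in sorted(rows, reverse=True)[:TOP_N]]
    for facet, rows in by_facet.items()
}

for facet, items in top_per_facet.items():
    print(f"{facet}: {items}")
```

This keeps every subfacet represented, at the cost of admitting Flexibility items whose correlations sit below the preferred 0.40.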


## 8. KAEL 🔥 - Assertiveness & Leadership
**Source**: 16PF Dominance (10 items)

In [10]:
# Load 16PF data
pf16_data = pd.read_csv('/home/chris/selve/data/openpsychometrics-rawdata/16PF/data.csv', sep='\t')

# KAEL items (Dominance - D1 to D10)
kael_items = [f'D{i}' for i in range(1, 11)]

kael_item_texts = {
    'D1': "I am good at making impromptu speeches.",
    'D2': "I don't mind being the center of attention.",
    'D3': "I feel comfortable around people.",
    'D4': "I have leadership abilities.",
    'D5': "I have a strong personality.",
    'D6': "I know how to captivate people.",
    'D7': "I would be afraid to give a speech in public.",  # Reversed
    'D8': "I find it difficult to approach others.",  # Reversed
    'D9': "I hate being the center of attention.",  # Reversed
    'D10': "I have little to say."  # Reversed
}

kael_df = pf16_data[kael_items].dropna()

reverse_items = ['D7', 'D8', 'D9', 'D10']
kael_df_scored = kael_df.copy()
for item in reverse_items:
    kael_df_scored[item] = 6 - kael_df_scored[item]  # 5-point scale

kael_correlations = []
for item in kael_items:
    other_items = [i for i in kael_items if i != item]
    total_without_item = kael_df_scored[other_items].mean(axis=1)
    corr = kael_df_scored[item].corr(total_without_item)
    kael_correlations.append({
        'item': item,
        'text': kael_item_texts[item],
        'correlation': corr,
        'reversed': item in reverse_items
    })

kael_corr_df = pd.DataFrame(kael_correlations).sort_values('correlation', ascending=False)

print("KAEL - Top Items by Correlation:")
print(kael_corr_df.to_string(index=False))
print(f"\nAverage correlation: {kael_corr_df['correlation'].mean():.3f}")

KAEL - Top Items by Correlation:
item                                          text  correlation  reversed
  D1       I am good at making impromptu speeches.     0.651006     False
  D5                  I have a strong personality.     0.577612     False
  D7 I would be afraid to give a speech in public.     0.557877      True
  D2   I don't mind being the center of attention.     0.490668     False
  D9         I hate being the center of attention.     0.473375      True
  D3             I feel comfortable around people.     0.419996     False
  D4                  I have leadership abilities.     0.419400     False
  D6               I know how to captivate people.     0.408249     False
 D10                         I have little to say.     0.394739      True
  D8       I find it difficult to approach others.     0.334377      True

Average correlation: 0.473
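
Three KAEL item texts are worded identically to LUMEN items, which matters for the Non-Redundancy criterion once the dimensions are pooled into one questionnaire. The sketch below detects such overlaps by exact text match; the dictionaries repeat the item texts from the cells above, and exact string matching is an assumption that would miss paraphrases.

```python
lumen_item_texts = {
    "E1": "I am the life of the party.",
    "E2": "I don't talk a lot.",
    "E3": "I feel comfortable around people.",
    "E4": "I keep in the background.",
    "E5": "I start conversations.",
    "E6": "I have little to say.",
    "E7": "I talk to a lot of different people at parties.",
    "E8": "I don't like to draw attention to myself.",
    "E9": "I don't mind being the center of attention.",
    "E10": "I am quiet around strangers.",
}
kael_item_texts = {
    "D1": "I am good at making impromptu speeches.",
    "D2": "I don't mind being the center of attention.",
    "D3": "I feel comfortable around people.",
    "D4": "I have leadership abilities.",
    "D5": "I have a strong personality.",
    "D6": "I know how to captivate people.",
    "D7": "I would be afraid to give a speech in public.",
    "D8": "I find it difficult to approach others.",
    "D9": "I hate being the center of attention.",
    "D10": "I have little to say.",
}

# Map each LUMEN text back to its code, then look up every KAEL text in it
text_to_lumen = {text: item for item, text in lumen_item_texts.items()}
duplicates = [
    (text_to_lumen[text], item, text)
    for item, text in kael_item_texts.items()
    if text in text_to_lumen
]

for lumen_code, kael_code, text in duplicates:
    print(f"{lumen_code} == {kael_code}: {text}")
```

The matches are E9/D2, E3/D3, and E6/D10. In a combined questionnaire each of these would need to be asked once and assigned to a single dimension, or replaced in one of them.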


## Summary & Export
Compile the analysed items into a single SELVE item pool and export it as JSON.

In [11]:
# Compile all selected items
item_pool = {
    'LUMEN': lumen_corr_df.to_dict('records'),
    'AETHER': aether_corr_df.to_dict('records'),
    'ORPHEUS': orpheus_corr_df.to_dict('records'),
    'ORIN': orin_corr_df.to_dict('records'),
    'LYRA': lyra_corr_df.to_dict('records'),
    'VARA': vara_corr_df.to_dict('records'),
    'CHRONOS': chronos_corr_df.to_dict('records'),
    'KAEL': kael_corr_df.to_dict('records')
}

# Export to JSON (json is already imported in the first cell)
with open('/home/chris/selve/data/selve_item_pool.json', 'w') as f:
    json.dump(item_pool, f, indent=2)

print("✅ Item pool exported to: /home/chris/selve/data/selve_item_pool.json")

# Summary statistics
print("\n" + "="*60)
print("SELVE ITEM POOL SUMMARY")
print("="*60)
for dimension, items in item_pool.items():
    avg_corr = np.mean([item['correlation'] for item in items])
    print(f"{dimension:12s} {len(items):2d} items, avg r={avg_corr:.3f}")

total_items = sum(len(items) for items in item_pool.values())
print(f"\nTotal items: {total_items}")
print("\nNext steps:")
print("  1. Review and modernize item wording")
print("  2. Add more items to reach 15-20 per dimension")
print("  3. Build scoring algorithm")
print("  4. Create adaptive testing logic")

✅ Item pool exported to: /home/chris/selve/data/selve_item_pool.json

SELVE ITEM POOL SUMMARY
LUMEN        10 items, avg r=0.633
AETHER       10 items, avg r=0.585
ORPHEUS      10 items, avg r=0.528
ORIN         10 items, avg r=0.491
LYRA         10 items, avg r=0.467
VARA         16 items, avg r=0.289
CHRONOS      16 items, avg r=0.476
KAEL         10 items, avg r=0.473

Total items: 92

Next steps:
  1. Review and modernize item wording
  2. Add more items to reach 15-20 per dimension
  3. Build scoring algorithm
  4. Create adaptive testing logic
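
The summary shows 92 items against a target of 120-160. A quick sketch of the gap per dimension, using the counts from the summary above, is given below; reading the target as a minimum of 15 items per dimension follows the Target section, and the variable names are illustrative.

```python
# Item counts per dimension, copied from the summary output.
pool_sizes = {
    "LUMEN": 10, "AETHER": 10, "ORPHEUS": 10, "ORIN": 10,
    "LYRA": 10, "VARA": 16, "CHRONOS": 16, "KAEL": 10,
}

TARGET_MIN, TARGET_MAX = 15, 20  # items per dimension, from the Target section

# Items still needed to bring each dimension up to the minimum
shortfall = {dim: max(0, TARGET_MIN - n) for dim, n in pool_sizes.items()}
total_needed = sum(shortfall.values())

for dim, missing in shortfall.items():
    print(f"{dim:8s} has {pool_sizes[dim]:2d}, needs {missing} more")
print(f"Items to add for the minimum target: {total_needed}")
print(f"Pool size after additions: {sum(pool_sizes.values()) + total_needed}")
```

Six dimensions each need at least five more items. Adding those 30 would bring the pool to 122, just above the lower bound of the overall target.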
