In [4]:
import json
import pandas as pd
import numpy as np
from pathlib import Path

# Load item pool
item_pool_path = Path('../data/selve_item_pool_expanded.json')
with open(item_pool_path) as f:
    item_pool = json.load(f)

# Flatten to DataFrame
items = []
for dimension, item_list in item_pool.items():
    for item in item_list:
        items.append({
            'dimension': dimension,
            'item_id': item['item'],
            'text': item['text'],
            'correlation': item['correlation'],
            'reversed': item['reversed'],
            'source': item['source']
        })

df = pd.DataFrame(items)
print(f"Total items: {len(df)}")
print(f"\nItems per dimension:")
print(df['dimension'].value_counts())

Total items: 98

Items per dimension:
dimension
CHRONOS    18
VARA       18
LUMEN      16
ORIN       10
ORPHEUS     9
AETHER      9
LYRA        9
KAEL        9
Name: count, dtype: int64


## 1. Classify Items: Self-Report vs Observable Behavior

**Observable Behaviors**: Friend can directly observe
- "I talk a lot"
- "I start conversations"
- "I get chores done right away"

**Internal States**: Only accessible to self
- "I feel comfortable around people" (friend can't know how you feel)
- "I worry about things" (internal mental state)
- "I have a rich inner world" (literally internal)

**Ambiguous**: Could be inferred from behavior
- "I get stressed out easily" (friend sees reactions)
- "I enjoy being around people" (friend infers from participation)

In [5]:
# Keywords for classification
INTERNAL_KEYWORDS = [
    'feel', 'feels', 'worry', 'worried', 'think', 'thinks', 'believe',
    'inner', 'imagine', 'daydream', 'afraid', 'anxious', 'comfortable',
    'enjoy', 'like', 'dislike', 'prefer', 'love', 'hate',
    'stressed', 'relaxed', 'calm', 'nervous', 'tense'
]

OBSERVABLE_KEYWORDS = [
    'talk', 'say', 'speak', 'tell', 'ask', 'do', 'make', 'get',
    'start', 'finish', 'complete', 'organize', 'plan', 'prepare',
    'react', 'respond', 'act', 'behave', 'work', 'help',
    'keep', 'leave', 'put', 'clean', 'mess', 'tidy'
]

def classify_observability(text):
    """
    Classify if question is observable by others.
    Returns: 'observable', 'internal', or 'ambiguous'
    """
    text_lower = text.lower()
    
    # Check for internal state keywords
    internal_count = sum(1 for kw in INTERNAL_KEYWORDS if kw in text_lower)
    
    # Check for observable behavior keywords
    observable_count = sum(1 for kw in OBSERVABLE_KEYWORDS if kw in text_lower)
    
    if internal_count > observable_count:
        return 'internal'
    elif observable_count > internal_count:
        return 'observable'
    else:
        return 'ambiguous'

# Apply classification
df['observability'] = df['text'].apply(classify_observability)

print("Observability Distribution:")
print(df['observability'].value_counts())
print(f"\nPercentage Observable: {(df['observability'] == 'observable').sum() / len(df) * 100:.1f}%")

Observability Distribution:
observability
ambiguous     48
observable    34
internal      16
Name: count, dtype: int64

Percentage Observable: 34.7%
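
One caveat about the classifier above: `kw in text_lower` is a substring test, so short keywords can fire inside unrelated words (`'do'` in "seldom", `'get'` in "forget", `'like'` in "unlikely"). The following is a minimal sketch of a whole-word variant, assuming a simple suffix rule is acceptable. The names `keyword_hits` and `classify_whole_word` and the shortened keyword lists are hypothetical, and switching to this matcher would change the counts reported above.

```python
import re

# Shortened, illustrative keyword lists (assumption); substitute the full lists above.
INTERNAL = ['feel', 'worry', 'enjoy', 'like']
OBSERVABLE = ['talk', 'do', 'get', 'act', 'start']

def keyword_hits(text, keywords):
    """Count keywords that appear as whole words, allowing a few common suffixes."""
    text_lower = text.lower()
    hits = 0
    for kw in keywords:
        # \b anchors the keyword at word boundaries; the optional group covers
        # simple inflections such as "talks" or "talking"
        pattern = r"\b" + re.escape(kw) + r"(?:s|es|ed|ing)?\b"
        if re.search(pattern, text_lower):
            hits += 1
    return hits

def classify_whole_word(text, internal=INTERNAL, observable=OBSERVABLE):
    """Same decision rule as classify_observability, but with whole-word matching."""
    internal_count = keyword_hits(text, internal)
    observable_count = keyword_hits(text, observable)
    if internal_count > observable_count:
        return 'internal'
    if observable_count > internal_count:
        return 'observable'
    return 'ambiguous'

print(classify_whole_word("I seldom forget things"))
print(classify_whole_word("I talk a lot"))
```

The suffix rule is deliberately crude (it will not map "worries" to "worry"), so irregular forms still need their own keyword entries.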


In [6]:
# Analyze observability by dimension
observability_by_dim = pd.crosstab(df['dimension'], df['observability'], normalize='index') * 100
observability_by_dim = observability_by_dim.round(1)

print("\nObservability by Dimension (% of items):")
print(observability_by_dim)

# Most observable dimensions
print("\n📊 Dimensions Ranked by Observability:")
if 'observable' in observability_by_dim.columns:
    print(observability_by_dim['observable'].sort_values(ascending=False))


Observability by Dimension (% of items):
observability  ambiguous  internal  observable
dimension                                     
AETHER              44.4      33.3        22.2
CHRONOS             55.6       5.6        38.9
KAEL                44.4      33.3        22.2
LUMEN               31.2       6.2        62.5
LYRA                66.7       0.0        33.3
ORIN                30.0      10.0        60.0
ORPHEUS             66.7      33.3         0.0
VARA                55.6      22.2        22.2

📊 Dimensions Ranked by Observability:
dimension
LUMEN      62.5
ORIN       60.0
CHRONOS    38.9
LYRA       33.3
KAEL       22.2
AETHER     22.2
VARA       22.2
ORPHEUS     0.0
Name: observable, dtype: float64


## 2. Sample Observable Items by Dimension

Let's examine the top observable items for each dimension to see if they make sense for friends.

In [8]:
import re

# Defined before first use; calling it ahead of its definition raises NameError.
def convert_to_third_person(text):
    """
    Convert first-person to third-person with placeholder name.
    Uses whole-word matches so letters inside other words are left alone.
    Verbs are not re-conjugated ("I talk" becomes "{Name} talk").
    """
    text = re.sub(r"\bI am\b", '{Name} is', text)
    text = re.sub(r"\bI\b(?!')", '{Name}', text)  # skip contractions such as "I've"
    text = re.sub(r"\b[Mm]y\b", "{Name}'s", text)
    text = re.sub(r"\bmyself\b", '{Name}', text)
    return text

# Get top 3 observable items per dimension (highest correlation)
observable_df = df[df['observability'] == 'observable'].copy()
observable_df = observable_df.sort_values(['dimension', 'correlation'], ascending=[True, False])

print("\n🔍 Top Observable Items Per Dimension\n")
for dimension in df['dimension'].unique():
    dim_items = observable_df[observable_df['dimension'] == dimension].head(3)
    print(f"\n{'='*60}")
    print(f"📍 {dimension}")
    print(f"{'='*60}")
    for _, row in dim_items.iterrows():
        # Convert to third-person
        friend_text = convert_to_third_person(row['text'])
        print(f"  • {row['text']}")
        print(f"    → Friend version: \"{friend_text}\"")
        print(f"    Correlation: {row['correlation']:.3f}")
        print()

## 3. Research: Peer Report Accuracy & Item Count

**Psychometric Research Findings**:

### Observer Agreement (Self-Other Correlation)
From a meta-analysis of personality peer reports:

| Trait | Self-Other Correlation | Observability |
|-------|------------------------|---------------|
| **Extraversion** | r = .55 | **HIGH** - Most visible |
| **Conscientiousness** | r = .42 | **MEDIUM** - Behaviors visible |
| **Agreeableness** | r = .38 | **MEDIUM** - Depends on interaction |
| **Openness** | r = .35 | **LOW-MEDIUM** - Intellectual traits less visible |
| **Neuroticism** | r = .30 | **LOW** - Internal emotional states |

*Source: Connelly & Ones (2010), "An Other Perspective on Personality"*

### Optimal Peer Report Length
- **360 Feedback**: 10-25 items (industry standard)
- **Big Five Peer Reports**: 15-30 items (research standard)
- **CliftonStrengths Observer**: 20 items
- **HEXACO Peer Report**: 40 items (short form)

### Recommendation for SELVE Friend Assessment
**Target: 20-25 items total** (2-4 items per dimension)

**Why?**
- ✅ Friends have less patience than self-assessment takers
- ✅ Should take 3-5 minutes max; note that 20-25 items at 30-45 seconds per question would take about 10-19 minutes, so this target implies roughly 7-15 seconds per question
- ✅ Higher completion rate with shorter length
- ✅ Focus on **highest correlation** observable items
- ✅ Multiple friends → aggregate reduces noise

### "Not Sure" Option Research
**Should we include "Don't Know" / "Not Sure"?**

✅ **YES - Industry Best Practice**

**Reasons**:
1. **Reduces guessing**: Friend may not have seen relevant behavior
2. **Improves data quality**: Forced guesses add noise
3. **Ethical**: Respects uncertainty ("I haven't seen this situation")
4. **Detection tool**: High "Not Sure" rate = weak relationship or attention issues

**Implementation**:
- Add "Not Sure" as middle option OR separate button
- Track "Not Sure" frequency per respondent
- Weight responses: Confident answers > "Not Sure" answers
- Flag respondent if >40% "Not Sure" (may not know person well)

**Scale Options**:
- **Option A**: 1-5 scale + "Not Sure" button
- **Option B**: 1-7 scale with 4 = "Neutral/Not Sure"
- **Recommendation**: Option A - keeps neutral separate from uncertain
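
As a rough illustration of Option A, the sketch below stores a "Not Sure" click as a missing value rather than as a scale point, so it never counts as a neutral 3, and applies the ">40% Not Sure" flag per respondent. The response matrix is made up, and the names `responses` and `NOT_SURE_FLAG_THRESHOLD` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up responses for illustration: rows are friends, columns are items.
# np.nan stands for a "Not Sure" click (Option A keeps it off the 1-5 scale).
responses = pd.DataFrame(
    {
        'q1': [4, 5, np.nan],
        'q2': [2, np.nan, np.nan],
        'q3': [3, 4, 1],
        'q4': [5, 4, np.nan],
    },
    index=['friend_a', 'friend_b', 'friend_c'],
)

NOT_SURE_FLAG_THRESHOLD = 0.40  # from the ">40% Not Sure" rule above

# Share of items each friend answered with "Not Sure"
not_sure_rate = responses.isna().mean(axis=1)
flagged = not_sure_rate > NOT_SURE_FLAG_THRESHOLD

# Column means skip NaN by default, so "Not Sure" does not pull scores toward 3
item_means = responses.mean(axis=0)

print(not_sure_rate)
print(flagged)
print(item_means)
```

Keeping uncertainty as a missing value is one way to keep "neutral" and "uncertain" separate in the stored data as well as in the UI.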

## 4. Curated Friend Item Pool Strategy

**Two Approaches**:

### Option A: Separate Curated Pool ✅ RECOMMENDED
**Pros**:
- Only observable behaviors included
- Optimized for third-party perspective
- Shorter, focused assessment
- Better user experience for friends

**Cons**:
- Requires manual curation
- May lose some psychometric precision

### Option B: Use Same Pool with Filtering
**Pros**:
- No duplicate work
- Consistency with self-assessment

**Cons**:
- Many items not suitable for observers
- Confusing wording ("Does John feel comfortable...?")
- Longer filtering logic

---

### 📋 Recommendation: Create Curated Friend Pool

**Criteria for inclusion**:
1. ✅ Observable behavior (not internal state)
2. ✅ High correlation with dimension (r > 0.65)
3. ✅ Clear wording when converted to third-person
4. ✅ Frequent opportunity to observe (not rare situations)
5. ✅ Not embarrassing/invasive (friend comfortable answering)

**Target**: 3 items × 8 dimensions = **24 items**

In [None]:
# Create curated friend pool
# Criteria: observable, correlation > 0.65, top 3 per dimension

friend_pool = []

for dimension in df['dimension'].unique():
    # Filter: observable and high correlation
    dim_items = df[
        (df['dimension'] == dimension) &
        (df['observability'] == 'observable') &
        (df['correlation'] > 0.65)
    ].copy()
    
    # Sort by correlation (best items first)
    dim_items = dim_items.sort_values('correlation', ascending=False)
    
    # Take top 3
    top_items = dim_items.head(3)
    
    for idx, row in top_items.iterrows():
        friend_pool.append({
            'dimension': row['dimension'],
            'item_id': row['item_id'],
            'self_version': row['text'],
            'friend_version': convert_to_third_person(row['text']),
            'correlation': row['correlation'],
            'source': row['source'],
            'reversed': row['reversed']
        })

friend_df = pd.DataFrame(friend_pool)

print(f"\n✅ Curated Friend Assessment Pool: {len(friend_df)} items")
print(f"\nItems per dimension:")
print(friend_df['dimension'].value_counts().sort_index())
print(f"\nAverage correlation: {friend_df['correlation'].mean():.3f}")
print(f"Min correlation: {friend_df['correlation'].min():.3f}")
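
The observability table earlier shows ORPHEUS with 0.0% observable items, so the filter above cannot supply three items for every dimension and the pool may fall short of 24. A small coverage check along the following lines could surface such gaps before export. This is a sketch on toy data; `coverage_gaps` and `TARGET_PER_DIMENSION` are hypothetical names, and in the notebook the input would be `friend_df`.

```python
import pandas as pd

TARGET_PER_DIMENSION = 3  # from the "3 items per dimension" target

def coverage_gaps(pool_df, all_dimensions, target=TARGET_PER_DIMENSION):
    """Return {dimension: missing_count} for dimensions below the target."""
    if 'dimension' in pool_df.columns:
        counts = pool_df['dimension'].value_counts()
    else:
        # An empty pool has no columns at all
        counts = pd.Series(dtype=int)
    gaps = {}
    for dim in all_dimensions:
        have = int(counts.get(dim, 0))
        if have < target:
            gaps[dim] = target - have
    return gaps

# Toy pool for illustration only
toy_pool = pd.DataFrame({'dimension': ['LUMEN', 'LUMEN', 'LUMEN', 'ORIN', 'ORIN']})
gaps = coverage_gaps(toy_pool, ['LUMEN', 'ORIN', 'ORPHEUS'])
print(gaps)
```

Dimensions that show up in the result would need manually curated or reworded items, which is the manual curation cost already listed under Option A.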

In [None]:
# Preview friend assessment questions
print("\n📋 PREVIEW: Friend Assessment Questions\n")
print("="*70)
print("Instructions for friend:")
print("'Please answer these questions about [User's Name].")
print("If you're not sure about a behavior, select 'Not Sure'.'\n")
print("="*70)

for idx, row in friend_df.iterrows():
    print(f"\nQ{idx+1}. {row['friend_version']}")
    print(f"    1 - Strongly Disagree")
    print(f"    2 - Disagree")
    print(f"    3 - Neutral")
    print(f"    4 - Agree")
    print(f"    5 - Strongly Agree")
    print(f"    [Not Sure]")
    print(f"    (Dimension: {row['dimension']}, r={row['correlation']:.2f})")

## 5. Attention & Consistency Detection

**How to detect if friend is paying attention or just guessing?**

### Detection Methods:

#### 1. Response Time
- **Too fast**: < 3 seconds per question → likely not reading
- **Reasonable**: 10-30 seconds per question
- **Flag**: If median response time < 5 seconds

#### 2. Consistency Checks (Reversed Items)
- Pair: "John talks a lot" (agree) with "John says little" (disagree)
- **Inconsistent**: Agree to both or disagree to both
- **Flag**: If >30% of reversed pairs show inconsistency
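
A minimal sketch of the reversed-pair check, assuming 1-5 scores with `None` for "Not Sure" and treating a neutral 3 on either item as not inconsistent (that last choice is an assumption, not something the rule above specifies). The function names and the sample pairs are hypothetical.

```python
INCONSISTENCY_FLAG_THRESHOLD = 0.30  # from the ">30% of reversed pairs" rule

def pair_is_inconsistent(normal_score, reversed_score):
    """
    True when both answers fall on the same side of neutral
    (agree to both or disagree to both). Neutral (3) or missing
    (None) answers are treated as not inconsistent.
    """
    if normal_score is None or reversed_score is None:
        return False
    return (normal_score - 3) * (reversed_score - 3) > 0

def inconsistency_rate(pairs):
    """Share of scorable pairs that are inconsistent; None if no pair is scorable."""
    scorable = [(n, r) for n, r in pairs if n is not None and r is not None]
    if not scorable:
        return None
    bad = sum(pair_is_inconsistent(n, r) for n, r in scorable)
    return bad / len(scorable)

# Made-up (normal, reversed) score pairs for one friend
pairs = [(5, 1), (4, 4), (2, 5), (1, 2)]
rate = inconsistency_rate(pairs)
print(rate, rate > INCONSISTENCY_FLAG_THRESHOLD)
```

With only a handful of pairs per friend the rate moves in large steps, so the 30% threshold is coarse in practice.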

#### 3. "Not Sure" Frequency
- **Reasonable**: 10-30% not sure (genuine uncertainty)
- **Too high**: >40% not sure → doesn't know person well
- **Too low**: <5% not sure → may be guessing confidently
- **Flag**: Either extreme

#### 4. Middle-Response Bias
- **Avoiding commitment**: Always choosing "Neutral" (3)
- **Flag**: If >50% responses are neutral

#### 5. Extreme Response Bias
- **Always extreme**: Only 1s and 5s, never 2-4
- **Flag**: If 80%+ responses are extreme (1 or 5)
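
The response-time, "Not Sure", neutral, and extreme-response flags above can be sketched together as follows. The thresholds come from the lists above; computing the neutral and extreme rates over answered items only (excluding "Not Sure") is an assumption, and `response_style_flags` and the sample data are hypothetical.

```python
from statistics import median

def response_style_flags(scores, seconds):
    """
    scores: list of 1-5 answers, with None for "Not Sure".
    seconds: per-question response times in seconds.
    Returns a dict of booleans using the thresholds listed above.
    """
    if not scores or not seconds:
        raise ValueError("scores and seconds must be non-empty")
    answered = [s for s in scores if s is not None]
    not_sure_rate = (len(scores) - len(answered)) / len(scores)
    # Rates over answered items only (assumption)
    neutral_rate = sum(s == 3 for s in answered) / len(answered) if answered else 0.0
    extreme_rate = sum(s in (1, 5) for s in answered) / len(answered) if answered else 0.0
    return {
        'too_fast': median(seconds) < 5,
        'not_sure_extreme': not_sure_rate > 0.40 or not_sure_rate < 0.05,
        'mostly_neutral': neutral_rate > 0.50,
        'mostly_extreme': extreme_rate >= 0.80,
    }

# Made-up responses from one friend
scores = [5, 1, 5, 5, None, 1, 5, 1, 5, 5]
seconds = [2, 3, 2, 4, 6, 2, 3, 2, 3, 2]
flags = response_style_flags(scores, seconds)
print(flags)
```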

### Quality Score Calculation

```python
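def quality_score(response_time_validity, consistency_score,
                  not_sure_appropriateness, response_variance):
    """Weighted quality score; each component is expected in [0, 1]."""
    return (
        response_time_validity * 0.3 +
        consistency_score * 0.4 +
        not_sure_appropriateness * 0.2 +
        response_variance * 0.1
    )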
```

**Threshold**:
- Quality Score ≥ 70%: Include in analysis
- Quality Score < 70%: Weight down or exclude
- Quality Score < 50%: Flag to user ("This friend may not have paid attention")
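
One possible reading of these thresholds, sketched below, is: 70% and above is included at full weight, 50% to 70% is weighted down, and below 50% is excluded and counted toward the user-facing flag. That split is an assumption, since the list above leaves "weight down or exclude" open. The names `WEIGHTS`, `quality_score_pct`, and `quality_action` are hypothetical, and each component is assumed to lie in [0, 1].

```python
# Component weights taken from the quality score formula above
WEIGHTS = {
    'response_time_validity': 0.3,
    'consistency_score': 0.4,
    'not_sure_appropriateness': 0.2,
    'response_variance': 0.1,
}

def quality_score_pct(components):
    """Weighted sum of components (each assumed in [0, 1]), as a percentage."""
    return 100 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

def quality_action(score_pct):
    """Map a 0-100 quality score to a handling decision (assumed reading)."""
    if score_pct >= 70:
        return 'include'
    if score_pct >= 50:
        return 'downweight'
    return 'exclude_and_flag'

# Made-up component values for one friend
components = {
    'response_time_validity': 1.0,
    'consistency_score': 0.5,
    'not_sure_appropriateness': 1.0,
    'response_variance': 0.5,
}
score = quality_score_pct(components)
print(round(score, 1), quality_action(score))
```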

### User Notification Strategy

**Don't expose individuals**, but give aggregate feedback:
- ✅ "3 out of 5 friends provided high-quality responses"
- ✅ "Some responses showed inconsistencies and were weighted lower"
- ❌ "John's responses were flagged as inattentive" (too revealing)

In [None]:
# Identify reversed item pairs for consistency checks
print("\n🔍 Reversed Item Pairs for Consistency Checking\n")
print("These pairs should show opposite patterns if friend is consistent:\n")

for dimension in friend_df['dimension'].unique():
    dim_items = friend_df[friend_df['dimension'] == dimension]
    
    # Find reversed and non-reversed items
    reversed_items = dim_items[dim_items['reversed'] == True]
    normal_items = dim_items[dim_items['reversed'] == False]
    
    if len(reversed_items) > 0 and len(normal_items) > 0:
        print(f"\n📍 {dimension}:")
        print(f"  Normal: {normal_items.iloc[0]['friend_version']}")
        print(f"  Reversed: {reversed_items.iloc[0]['friend_version']}")
        print(f"  → Should have opposite scores (if agree to one, disagree to other)")

## 6. Export Friend Item Pool

Save the curated friend assessment pool to JSON for use in the app.

In [None]:
# Export to JSON
output_path = Path('../data/selve_friend_item_pool.json')

# Convert DataFrame to structured format
friend_pool_json = {}
for dimension in friend_df['dimension'].unique():
    dim_items = friend_df[friend_df['dimension'] == dimension]
    friend_pool_json[dimension] = dim_items.to_dict('records')

with open(output_path, 'w') as f:
    json.dump(friend_pool_json, f, indent=2)

print(f"\n✅ Friend item pool saved to: {output_path}")
print(f"\n📊 Summary:")
print(f"  • Total items: {len(friend_df)}")
print(f"  • Estimated completion time: {len(friend_df) * 0.5} - {len(friend_df) * 0.75} minutes")
print(f"  • Average item quality (correlation): {friend_df['correlation'].mean():.3f}")
print(f"\n💡 Next Steps:")
print(f"  1. Build friend assessment UI with 'Not Sure' option")
print(f"  2. Implement response time tracking")
print(f"  3. Add consistency check algorithm")
print(f"  4. Create quality scoring system")
print(f"  5. Design aggregate results dashboard (anonymous)")

## Key Findings Summary

### ✅ Answers to Original Questions

**1. Observable vs Internal States?**
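- About 35% of items (34 of 98) classify as observable; by dimension this ranges from 0.0% to 62.5%
- LUMEN (Extraversion) most observable (62.5%), followed by ORIN (60.0%)
- ORPHEUS least observable (0.0%); AETHER (Emotional Stability), KAEL, and VARA tie at 22.2%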

**2. Should we add 'Not Sure'?**
- **YES** - Industry best practice for peer reports
- Improves data quality, reduces guessing
- Use as attention/relationship strength indicator

**3. How many questions?**
- **20-25 items** (3 per dimension)
- Target: 3-5 minute completion time (at 24 items this is roughly 8-13 seconds per item; 30 seconds per item takes about 12 minutes)
- Focus on highest-correlation observable items

**4. Separate pool or same?**
- **Separate curated pool** recommended
- Filter: Observable + High Correlation + Clear wording
- Export: `selve_friend_item_pool.json`

**5. Detect guessing/inattention?**
- ✅ Response time (< 5 sec median = flag)
- ✅ Consistency checks (reversed item pairs)
- ✅ "Not Sure" frequency (0-5% or 40%+ = flag)
- ✅ Response variance (all neutral or all extreme = flag)
- Calculate quality score, weight responses accordingly

### 📋 Implementation Recommendations

1. **Use curated pool** (24 items × 30 sec = ~12 min ideal, 5-7 min fast)
2. **Add 'Not Sure' button** separate from 1-5 scale
3. **Track response times** on backend
4. **Calculate quality scores** for each friend's responses
5. **Show aggregate results** to user (don't expose individual friends)
6. **Weight responses** by quality score in final aggregation
7. **Notify user** if multiple friends show low quality ("Consider inviting people who know you well")
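
Recommendation 6 can be illustrated with a quality-weighted mean per item. This is a sketch under stated assumptions: scores are on the 1-5 scale with `None` for "Not Sure", each friend has one non-negative quality weight, and excluded friends get weight 0. The function name `weighted_item_mean` and the sample values are hypothetical.

```python
def weighted_item_mean(scores, weights):
    """
    Quality-weighted mean of one item's scores across friends.
    scores: 1-5 answers, None for "Not Sure"; weights: quality weights >= 0.
    Returns None when no friend gave a scorable answer.
    """
    pairs = [(s, w) for s, w in zip(scores, weights) if s is not None and w > 0]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None
    return sum(s * w for s, w in pairs) / total_weight

# Made-up example: four friends, the third answered "Not Sure",
# the fourth was weighted down for low quality
scores = [4, 5, None, 2]
weights = [1.0, 1.0, 1.0, 0.5]
result = weighted_item_mean(scores, weights)
print(result)
```

Only the aggregate is returned, which fits the recommendation to avoid exposing any individual friend's answers.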