# Network-Aware User Profile Search with Elasticsearch

This notebook builds a search system that ranks users based on:
1. **Profile matching** - Traditional keyword relevance
2. **Social distance** - How many hops away in the network
3. **Mutual friends** - Number of shared connections
4. **Community membership** - Shared circles/communities

## Setup Requirements

```bash
# Install Elasticsearch on macOS (or skip this and use the Docker container in Step 4)
brew install elasticsearch
brew services start elasticsearch

# Install required Python packages
pip install elasticsearch networkx pandas numpy
```

In [1]:
%pip install elasticsearch networkx pandas numpy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import urllib.request
import tarfile
import networkx as nx
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch, helpers
from collections import defaultdict, Counter
import json
from pathlib import Path

## Step 1: Download and Extract Dataset

In [3]:
# Download the dataset
url = "https://snap.stanford.edu/data/facebook.tar.gz"
data_dir = Path("./facebook_data")
data_dir.mkdir(exist_ok=True)

tar_path = data_dir / "facebook.tar.gz"

if not tar_path.exists():
    print("Downloading dataset...")
    urllib.request.urlretrieve(url, tar_path)
    print("Download complete!")
else:
    print("Dataset already downloaded")

# Extract independently of the download, so an archive left over from an
# interrupted run still gets unpacked
if not (data_dir / "facebook").exists():
    print("Extracting...")
    with tarfile.open(tar_path, 'r:gz') as tar:
        tar.extractall(data_dir)
    print("Extraction complete!")

Dataset already downloaded


## Step 2: Explore Dataset Structure

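Each ego network in the Facebook dataset is stored as a set of files named after the ego's ID:
- `.edges` - Edge lists (friendships)
- `.feat` - Feature vectors for each user
- `.egofeat` - Feature vector for the ego user (not loaded in this notebook)
- `.featnames` - Feature names
- `.circles` - Friend circles (communities)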
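
As a rough illustration of how these files are laid out, the sketch below parses one line of each kind. The sample lines are invented for illustration and only mirror the layout that the loader in Step 3 expects: whitespace-separated IDs, a leading node ID in `.feat`, a leading index in `.featnames`, and a leading circle name in `.circles`. The `parse_*` helper names are hypothetical.

```python
# Illustrative sample lines (invented, not copied from the dataset)
edge_line = "236 186"
feat_line = "236 0 1 0 1"
featname_line = "3 education;school;id;anonymized feature 3"
circle_line = "circle0\t71\t215\t54"


def parse_edge(line):
    """Return the two endpoints of a friendship edge."""
    u, v = line.split()
    return int(u), int(v)


def parse_feat(line):
    """Return (user_id, binary feature vector)."""
    parts = line.split()
    return int(parts[0]), [int(x) for x in parts[1:]]


def parse_featname(line):
    """Return (feature index, feature name); the name may contain spaces."""
    index, name = line.strip().split(" ", 1)
    return int(index), name


def parse_circle(line):
    """Return (circle name, member IDs)."""
    parts = line.split()
    return parts[0], [int(x) for x in parts[1:]]


print(parse_edge(edge_line))
print(parse_feat(feat_line))
print(parse_featname(featname_line))
print(parse_circle(circle_line))
```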

In [4]:
# Find all ego networks in the dataset
facebook_dir = data_dir / "facebook"
ego_users = []

for file in facebook_dir.glob("*.edges"):
    ego_id = file.stem
    ego_users.append(ego_id)

print(f"Found {len(ego_users)} ego networks")
print(f"Ego users: {ego_users}")

Found 10 ego networks
Ego users: ['686', '348', '3437', '1912', '1684', '0', '698', '3980', '414', '107']


## Step 3: Load and Process Network Data

In [5]:
class FacebookNetworkProcessor:
    def __init__(self, data_dir):
        self.data_dir = Path(data_dir)
        self.G = nx.Graph()
        self.user_features = {}
        self.user_circles = defaultdict(list)
        self.feature_names = []
        
    def load_network(self, ego_id):
        """Load ego network for a specific user"""
        edges_file = self.data_dir / f"{ego_id}.edges"
        feat_file = self.data_dir / f"{ego_id}.feat"
        featnames_file = self.data_dir / f"{ego_id}.featnames"
        circles_file = self.data_dir / f"{ego_id}.circles"
        
        # Load edges
        if edges_file.exists():
            with open(edges_file, 'r') as f:
                for line in f:
                    u, v = line.strip().split()
                    self.G.add_edge(int(u), int(v))
        
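        # Add ego node and connect it to every node currently in the graph.
        # Note: when several ego networks are loaded into the same graph, this
        # includes nodes from previously loaded networks, not only the ego's
        # own friends.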
        ego_id_int = int(ego_id)
        for node in list(self.G.nodes()):
            if node != ego_id_int:
                self.G.add_edge(ego_id_int, node)
        
        # Load feature names (replaces any names loaded for a previous ego network)
        if featnames_file.exists():
            with open(featnames_file, 'r') as f:
                self.feature_names = [line.strip().split(' ', 1) for line in f]
        
        # Load features
        if feat_file.exists():
            with open(feat_file, 'r') as f:
                for line in f:
                    parts = line.strip().split()
                    user_id = int(parts[0])
                    features = [int(x) for x in parts[1:]]
                    self.user_features[user_id] = features
        
        # Placeholder all-zero features for the ego (the `.egofeat` file is not read here)
        if self.feature_names:
            self.user_features[ego_id_int] = [0] * len(self.feature_names)
        
        # Load circles
        if circles_file.exists():
            with open(circles_file, 'r') as f:
                for line in f:
                    parts = line.strip().split()
                    circle_name = parts[0]
                    members = [int(x) for x in parts[1:]]
                    for member in members:
                        self.user_circles[member].append(circle_name)
    
    def load_all_networks(self):
        """Load all ego networks and combine them"""
        for file in self.data_dir.glob("*.edges"):
            ego_id = file.stem
            print(f"Loading network for user {ego_id}...")
            self.load_network(ego_id)
        
        print(f"\nTotal nodes: {self.G.number_of_nodes()}")
        print(f"Total edges: {self.G.number_of_edges()}")
        print(f"Users with features: {len(self.user_features)}")
    
    def compute_network_features(self, user_id, reference_user_id):
        """Compute network-based features between two users"""
        features = {}
        
        if user_id not in self.G or reference_user_id not in self.G:
            return {
                'social_distance': 999,
                'mutual_friends': 0,
                'shared_circles': 0
            }
        
        # Social distance (shortest path)
        try:
            features['social_distance'] = nx.shortest_path_length(
                self.G, user_id, reference_user_id
            )
        except nx.NetworkXNoPath:
            features['social_distance'] = 999  # Not connected
        
        # Mutual friends
        user_friends = set(self.G.neighbors(user_id))
        ref_friends = set(self.G.neighbors(reference_user_id))
        features['mutual_friends'] = len(user_friends & ref_friends)
        
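        # Shared circles (matched by name; names such as "circle3" are local
        # to each ego network, so a match across networks is by label only)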
        user_circles = set(self.user_circles.get(user_id, []))
        ref_circles = set(self.user_circles.get(reference_user_id, []))
        features['shared_circles'] = len(user_circles & ref_circles)
        
        return features
    
    def create_user_profile(self, user_id):
        """Create a searchable profile for a user"""
        profile = {
            'user_id': user_id,
            'degree': self.G.degree(user_id) if user_id in self.G else 0,
            'circles': self.user_circles.get(user_id, []),
        }
        
        # Convert features to named attributes
        if user_id in self.user_features and self.feature_names:
            features = self.user_features[user_id]
            profile['attributes'] = []
            
            for i, (feat_id, feat_name) in enumerate(self.feature_names):
                if i < len(features) and features[i] == 1:
                    profile['attributes'].append(feat_name)
        
        return profile

# Initialize processor
processor = FacebookNetworkProcessor(facebook_dir)
processor.load_all_networks()

Loading network for user 686...
Loading network for user 348...
Loading network for user 3437...
Loading network for user 1912...
Loading network for user 1684...
Loading network for user 0...
Loading network for user 698...
Loading network for user 3980...
Loading network for user 414...
Loading network for user 107...

Total nodes: 3963
Total edges: 105082
Users with features: 4039
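
Because `load_network` links each ego to every node already in the combined graph, egos loaded later gain edges to users from earlier ego networks. A minimal sketch of a stricter variant, which links an ego only to the nodes that appear in its own edge list, could look like the following. It uses plain dictionaries instead of NetworkX to stay self-contained; the helper name `add_ego_network` and the toy edge lists are hypothetical.

```python
from collections import defaultdict


def add_ego_network(adjacency, ego_id, edges):
    """Add one ego network, linking the ego only to its own alters."""
    alters = set()
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
        alters.update((u, v))

    # Connect the ego to the nodes seen in this edge list only
    for node in alters:
        if node != ego_id:
            adjacency[ego_id].add(node)
            adjacency[node].add(ego_id)
    return alters


adjacency = defaultdict(set)
add_ego_network(adjacency, 0, [(1, 2), (2, 3)])
add_ego_network(adjacency, 100, [(101, 102)])

print(sorted(adjacency[0]))    # [1, 2, 3]
print(sorted(adjacency[100]))  # [101, 102]
```

With this variant, ego 100 is not linked to nodes 1 to 3, which belong to ego 0's network. Applying it to the notebook would change the node and edge counts and the distances reported in later steps.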


## Step 4: Setup Elasticsearch Connection

```bash
# Pull the Elasticsearch image (ARM64 compatible)
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.0

# Run Elasticsearch container
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "xpack.security.http.ssl.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:8.11.0

# Check if container is running
docker ps

# Test connection
curl http://localhost:9200

# You should see JSON output with cluster info
```

In [6]:
# Connect to Elasticsearch
es = Elasticsearch(
    ['http://localhost:9200'],
    basic_auth=None  # Add auth if needed
)

# Check connection
if es.ping():
    print("✓ Connected to Elasticsearch")
    print(f"Cluster info: {es.info()['version']['number']}")
else:
    print("✗ Could not connect to Elasticsearch")
    print("Make sure Elasticsearch is running: docker start elasticsearch (or brew services start elasticsearch)")

✓ Connected to Elasticsearch
Cluster info: 8.11.0
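
A freshly started container can take a few seconds before it answers requests, so the first `es.ping()` may fail. One way to handle this is to poll a few times before giving up. The `wait_until` helper below is a hypothetical convenience, not part of the Elasticsearch client; it accepts any zero-argument callable, so `es.ping` could be passed in directly.

```python
import time


def wait_until(check, attempts=10, delay=1.0):
    """Call `check` until it returns a truthy value or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except Exception:
            # Treat any error (e.g. connection refused) as "not ready yet"
            pass
        if attempt < attempts:
            time.sleep(delay)
    return False


# Stand-in for es.ping: fails twice, then succeeds
responses = iter([False, False, True])
print(wait_until(lambda: next(responses), attempts=5, delay=0.01))  # True
```

In the notebook this might be used as `wait_until(es.ping, attempts=30, delay=2.0)` before the connection check.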


## Step 5: Create Index with Custom Mapping

In [7]:
index_name = "facebook_users"

# Delete index if it exists
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)
    print(f"Deleted existing index: {index_name}")

# Create index with mapping
index_mapping = {
    "mappings": {
        "properties": {
            "user_id": {"type": "integer"},
            "degree": {"type": "integer"},
            "circles": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"}
                }
            },
            "attributes": {
                "type": "text",
                "analyzer": "standard"
            }
        }
    },
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

es.indices.create(index=index_name, body=index_mapping)
print(f"✓ Created index: {index_name}")

Deleted existing index: facebook_users
✓ Created index: facebook_users


## Step 6: Index User Profiles

In [8]:
def generate_docs():
    """Generator for bulk indexing"""
    for user_id in processor.G.nodes():
        profile = processor.create_user_profile(user_id)
        yield {
            "_index": index_name,
            "_id": user_id,
            "_source": profile
        }

# Bulk index
print("Indexing users...")
success, failed = helpers.bulk(es, generate_docs(), stats_only=True)
print(f"✓ Indexed {success} users")
if failed:
    print(f"✗ Failed: {failed}")

# Refresh index
es.indices.refresh(index=index_name)

Indexing users...
✓ Indexed 3963 users


ObjectApiResponse({'_shards': {'total': 1, 'successful': 1, 'failed': 0}})

## Step 7: Implement Network-Aware Search

This step fetches candidates from Elasticsearch and reranks them with a custom score, a weighted sum of four components:
- Text relevance (Elasticsearch BM25)
- Social distance (exponential decay in the number of hops)
- Mutual friends (linear boost, capped at 20 friends)
- Shared circles (fixed boost per shared circle)
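
To make the combination concrete, the sketch below recomputes a final score by hand, using the same component formulas and default weights as the `NetworkAwareSearch` class. The input values (text score 6.731, distance 2, 8 mutual friends, 0 shared circles) are those of the top result in Step 8; the function name `combined_score` is illustrative.

```python
import math

DEFAULT_WEIGHTS = {"text": 1.0, "distance": 0.5, "mutual": 0.3, "circles": 0.2}


def combined_score(text_score, distance, mutual_friends, shared_circles,
                   weights=DEFAULT_WEIGHTS):
    # Exponential decay in hop count; 999 marks "unreachable"
    distance_score = 0.0 if distance >= 999 else math.exp(-0.5 * distance) * 10
    # Linear in mutual friends, capped at 20 friends
    mutual_score = min(mutual_friends, 20) * 0.5
    # Fixed boost per shared circle
    circle_score = shared_circles * 2.0
    return (weights["text"] * text_score
            + weights["distance"] * distance_score
            + weights["mutual"] * mutual_score
            + weights["circles"] * circle_score)


print(round(combined_score(6.731, 2, 8, 0), 2))  # 9.77
```

Because the text score is an unbounded BM25 value while the network components have fixed ranges, the weights are not directly comparable across queries; normalizing the text score first would be one option to explore.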

In [9]:
class NetworkAwareSearch:
    def __init__(self, es_client, processor, index_name):
        self.es = es_client
        self.processor = processor
        self.index = index_name
    
    def search(self, query, searcher_id, size=10, weights=None):
        """
        Perform network-aware search.
        
        Args:
            query: Search query string
            searcher_id: User performing the search
            size: Number of results to return
            weights: Dict with keys: text, distance, mutual, circles
        """
        if weights is None:
            weights = {
                'text': 1.0,
                'distance': 0.5,
                'mutual': 0.3,
                'circles': 0.2
            }
        
        # Step 1: Get candidate results from Elasticsearch
        es_query = {
            "query": {
                "multi_match": {
                    "query": query,
                    "fields": ["attributes^2", "circles"],
                    "type": "best_fields"
                }
            },
            "size": size * 3  # Get more candidates for reranking
        }
        
        response = self.es.search(index=self.index, body=es_query)
        
        # Step 2: Rerank with network features
        results = []
        for hit in response['hits']['hits']:
            user_id = hit['_source']['user_id']
            text_score = hit['_score']
            
            # Compute network features
            net_features = self.processor.compute_network_features(
                user_id, searcher_id
            )
            
            # Compute network score components
            distance_score = self._distance_score(net_features['social_distance'])
            mutual_score = self._mutual_score(net_features['mutual_friends'])
            circle_score = self._circle_score(net_features['shared_circles'])
            
            # Combined score
            final_score = (
                weights['text'] * text_score +
                weights['distance'] * distance_score +
                weights['mutual'] * mutual_score +
                weights['circles'] * circle_score
            )
            
            results.append({
                'user_id': user_id,
                'profile': hit['_source'],
                'scores': {
                    'final': final_score,
                    'text': text_score,
                    'distance': distance_score,
                    'mutual': mutual_score,
                    'circles': circle_score
                },
                'network': net_features
            })
        
        # Sort by final score
        results.sort(key=lambda x: x['scores']['final'], reverse=True)
        
        return results[:size]
    
    def _distance_score(self, distance):
        """Exponential decay based on social distance"""
        if distance >= 999:
            return 0.0
        return np.exp(-0.5 * distance) * 10
    
    def _mutual_score(self, mutual_friends):
        """Linear score for mutual friends, capped at 20 friends (max 10.0)"""
        return min(mutual_friends, 20) * 0.5
    
    def _circle_score(self, shared_circles):
        """Score for shared community membership"""
        return shared_circles * 2.0
    
    def compare_with_baseline(self, query, searcher_id, size=10):
        """Compare network-aware vs baseline search"""
        # Baseline: text-only
        baseline_query = {
            "query": {
                "multi_match": {
                    "query": query,
                    "fields": ["attributes^2", "circles"]
                }
            },
            "size": size
        }
        
        baseline_response = self.es.search(index=self.index, body=baseline_query)
        baseline_results = [
            {
                'user_id': hit['_source']['user_id'],
                'score': hit['_score']
            }
            for hit in baseline_response['hits']['hits']
        ]
        
        # Network-aware
        network_results = self.search(query, searcher_id, size)
        
        return {
            'baseline': baseline_results,
            'network_aware': network_results,
            'query': query,
            'searcher_id': searcher_id
        }

# Initialize search engine
search_engine = NetworkAwareSearch(es, processor, index_name)
print("✓ Network-aware search engine ready!")

✓ Network-aware search engine ready!


## Step 8: Test the Search System

In [10]:
# Pick a searcher (ego user)
if ego_users:
    searcher_id = int(ego_users[0])
else:
    searcher_id = list(processor.G.nodes())[0]

print(f"Searching as user: {searcher_id}")
print(f"User has {processor.G.degree(searcher_id)} friends\n")

Searching as user: 686
User has 176 friends



In [11]:
# Example search
query = "education work"  # Adjust based on available features

results = search_engine.search(query, searcher_id, size=5)

print(f"Query: '{query}'\n")
print("="*80)

for i, result in enumerate(results, 1):
    print(f"\n{i}. User {result['user_id']}")
    print(f"   Final Score: {result['scores']['final']:.3f}")
    print(f"   └─ Text: {result['scores']['text']:.3f} | "
          f"Distance: {result['scores']['distance']:.3f} | "
          f"Mutual: {result['scores']['mutual']:.3f} | "
          f"Circles: {result['scores']['circles']:.3f}")
    print(f"   Network:")
    print(f"   └─ Social distance: {result['network']['social_distance']} hops")
    print(f"   └─ Mutual friends: {result['network']['mutual_friends']}")
    print(f"   └─ Shared circles: {result['network']['shared_circles']}")
    print(f"   Profile:")
    print(f"   └─ Degree: {result['profile']['degree']} friends")
    if result['profile'].get('circles'):
        print(f"   └─ Circles: {', '.join(result['profile']['circles'][:3])}")

Query: 'education work'


1. User 1085
   Final Score: 9.770
   └─ Text: 6.731 | Distance: 3.679 | Mutual: 4.000 | Circles: 0.000
   Network:
   └─ Social distance: 2 hops
   └─ Mutual friends: 8
   └─ Shared circles: 0
   Profile:
   └─ Degree: 72 friends
   └─ Circles: circle3, circle4, circle19

2. User 2107
   Final Score: 9.611
   └─ Text: 6.722 | Distance: 3.679 | Mutual: 3.500 | Circles: 0.000
   Network:
   └─ Social distance: 2 hops
   └─ Mutual friends: 7
   └─ Shared circles: 0
   Profile:
   └─ Degree: 43 friends

3. User 1779
   Final Score: 8.848
   └─ Text: 6.858 | Distance: 3.679 | Mutual: 0.500 | Circles: 0.000
   Network:
   └─ Social distance: 2 hops
   └─ Mutual friends: 1
   └─ Shared circles: 0
   Profile:
   └─ Degree: 22 friends
   └─ Circles: circle6

4. User 909
   Final Score: 8.822
   └─ Text: 6.832 | Distance: 3.679 | Mutual: 0.500 | Circles: 0.000
   Network:
   └─ Social distance: 2 hops
   └─ Mutual friends: 1
   └─ Shared circles: 0
   Profile:
   └─ De

## Step 9: Compare with Baseline

In [12]:
comparison = search_engine.compare_with_baseline(query, searcher_id, size=5)

print("BASELINE SEARCH (text-only):")
print("="*50)
for i, result in enumerate(comparison['baseline'], 1):
    print(f"{i}. User {result['user_id']} - Score: {result['score']:.3f}")

print("\n\nNETWORK-AWARE SEARCH:")
print("="*50)
for i, result in enumerate(comparison['network_aware'], 1):
    print(f"{i}. User {result['user_id']} - Score: {result['scores']['final']:.3f}")
    print(f"   └─ {result['network']['social_distance']} hops, "
          f"{result['network']['mutual_friends']} mutual friends")

BASELINE SEARCH (text-only):
1. User 1779 - Score: 6.858
2. User 909 - Score: 6.832
3. User 992 - Score: 6.789
4. User 963 - Score: 6.770
5. User 1845 - Score: 6.760


NETWORK-AWARE SEARCH:
1. User 1085 - Score: 9.770
   └─ 2 hops, 8 mutual friends
2. User 2107 - Score: 9.611
   └─ 2 hops, 7 mutual friends
3. User 1779 - Score: 8.848
   └─ 2 hops, 1 mutual friends
4. User 909 - Score: 8.822
   └─ 2 hops, 1 mutual friends
5. User 992 - Score: 8.779
   └─ 2 hops, 1 mutual friends
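
One way to quantify how much the reranking changed the list is to compare the two rankings directly. The sketch below uses the user IDs printed above and reports the overlap plus each user's previous position; the helper name `rank_changes` is illustrative.

```python
def rank_changes(baseline_ids, reranked_ids):
    """Map each reranked user to (new_rank, old_rank or None)."""
    old_rank = {uid: i for i, uid in enumerate(baseline_ids, 1)}
    return {uid: (i, old_rank.get(uid)) for i, uid in enumerate(reranked_ids, 1)}


baseline = [1779, 909, 992, 963, 1845]
network_aware = [1085, 2107, 1779, 909, 992]

changes = rank_changes(baseline, network_aware)
overlap = len(set(baseline) & set(network_aware))

print(f"Overlap: {overlap} of {len(baseline)}")
for uid, (new, old) in changes.items():
    origin = f"was #{old}" if old else "not in baseline top 5"
    print(f"#{new} user {uid} ({origin})")
```

Here three of the five baseline users survive, each pushed down two places by users 1085 and 2107, who have more mutual friends with the searcher.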


## Step 10: Experiment with Different Weights

In [13]:
# Try different weight configurations
weight_configs = [
    {'name': 'Text-focused', 'weights': {'text': 1.0, 'distance': 0.1, 'mutual': 0.1, 'circles': 0.1}},
    {'name': 'Network-focused', 'weights': {'text': 0.3, 'distance': 0.5, 'mutual': 0.5, 'circles': 0.3}},
    {'name': 'Balanced', 'weights': {'text': 0.5, 'distance': 0.3, 'mutual': 0.3, 'circles': 0.2}},
]

query = "work education"  # Adjust as needed

for config in weight_configs:
    print(f"\n{'='*60}")
    print(f"{config['name'].upper()} Configuration")
    print(f"{'='*60}")
    
    results = search_engine.search(
        query, searcher_id, size=3, weights=config['weights']
    )
    
    for i, result in enumerate(results, 1):
        print(f"\n{i}. User {result['user_id']} - Score: {result['scores']['final']:.3f}")
        print(f"   Distance: {result['network']['social_distance']} hops, "
              f"Mutual: {result['network']['mutual_friends']}")


TEXT-FOCUSED Configuration

1. User 1779 - Score: 7.276
   Distance: 2 hops, Mutual: 1

2. User 909 - Score: 7.250
   Distance: 2 hops, Mutual: 1

3. User 992 - Score: 7.207
   Distance: 2 hops, Mutual: 1

NETWORK-FOCUSED Configuration

1. User 1779 - Score: 4.147
   Distance: 2 hops, Mutual: 1

2. User 909 - Score: 4.139
   Distance: 2 hops, Mutual: 1

3. User 992 - Score: 4.126
   Distance: 2 hops, Mutual: 1

BALANCED Configuration

1. User 1779 - Score: 4.683
   Distance: 2 hops, Mutual: 1

2. User 909 - Score: 4.670
   Distance: 2 hops, Mutual: 1

3. User 992 - Score: 4.648
   Distance: 2 hops, Mutual: 1
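
All three configurations return the same three users, and users 1085 and 2107 from Step 8 do not appear in any of them, although their component scores would place them first under each set of weights. A likely reason is that `search` fetches only `size * 3` candidates from Elasticsearch, so with `size=3` the reranker sees just the nine best text matches. The sketch below illustrates the effect on invented data, using a hypothetical `rerank` function whose candidate pool is set independently of the result size.

```python
def rerank(candidates, text_weight, network_weight, size, pool_size):
    """Rerank the `pool_size` best text matches and return the top `size`."""
    pool = sorted(candidates, key=lambda c: c["text"], reverse=True)[:pool_size]
    ranked = sorted(
        pool,
        key=lambda c: text_weight * c["text"] + network_weight * c["network"],
        reverse=True,
    )
    return [c["user"] for c in ranked[:size]]


# Invented candidates: "d" is a slightly weaker text match with a strong network score
candidates = [
    {"user": "a", "text": 6.9, "network": 0.5},
    {"user": "b", "text": 6.8, "network": 0.5},
    {"user": "c", "text": 6.7, "network": 0.5},
    {"user": "d", "text": 6.6, "network": 4.0},
]

print(rerank(candidates, 0.3, 0.5, size=2, pool_size=3))  # ['a', 'b']
print(rerank(candidates, 0.3, 0.5, size=2, pool_size=4))  # ['d', 'a']
```

With the smaller pool, candidate "d" is cut before the network score is ever considered, so no choice of weights can surface it.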


## Step 11: Analyze Network Connectivity

In [14]:
# Network statistics
print("Network Statistics:")
print("="*50)
print(f"Total users: {processor.G.number_of_nodes()}")
print(f"Total friendships: {processor.G.number_of_edges()}")
print(f"Average degree: {sum(dict(processor.G.degree()).values()) / processor.G.number_of_nodes():.2f}")
print(f"Network density: {nx.density(processor.G):.4f}")

# Check if graph is connected
if nx.is_connected(processor.G):
    print(f"Average shortest path length: {nx.average_shortest_path_length(processor.G):.2f}")
    print(f"Diameter: {nx.diameter(processor.G)}")
else:
    print("Graph is not fully connected")
    components = list(nx.connected_components(processor.G))
    print(f"Number of components: {len(components)}")
    print(f"Largest component size: {len(max(components, key=len))}")

Network Statistics:
Total users: 3963
Total friendships: 105082
Average degree: 53.03
Network density: 0.0134
Average shortest path length: 1.99
Diameter: 2
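
A diameter of 2 follows from the way `load_network` builds the graph: each ego is linked to every node present when it is loaded, so the last ego loaded is adjacent to every other user and any two users are at most two hops apart. The sketch below shows this hub effect on a toy path graph, using a plain breadth-first search in place of NetworkX. The graph and helper names are invented for illustration.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop counts from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(adjacency):
    """Largest shortest-path length (assumes a connected graph)."""
    return max(max(bfs_distances(adjacency, n).values()) for n in adjacency)


# Path graph 1-2-3-4-5
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(diameter(path))  # 4

# Add a hub (node 0) linked to every existing node
with_hub = {n: nbrs | {0} for n, nbrs in path.items()}
with_hub[0] = set(path)
print(diameter(with_hub))  # 2
```

This also means the social distance component can only take the values for 0, 1, or 2 hops in this graph, which limits how much it can separate candidates.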


## Step 12: Interactive Search Interface

In [16]:
def interactive_search():
    """Interactive search session"""
    print("Network-Aware Search Interface")
    print("="*50)
    print(f"Searching as user: {searcher_id}\n")
    
    while True:
        query = input("\nEnter search query (or 'quit' to exit): ").strip()
        
        if query.lower() in ['quit', 'exit', 'q']:
            print("Goodbye!")
            break
        
        if not query:
            continue
        
        try:
            results = search_engine.search(query, searcher_id, size=5)
            
            print(f"\nTop 5 results for '{query}':")
            print("-"*50)
            
            for i, result in enumerate(results, 1):
                print(f"\n{i}. User {result['user_id']} (Score: {result['scores']['final']:.2f})")
                print(f"   Network: {result['network']['social_distance']} hops, "
                      f"{result['network']['mutual_friends']} mutual friends")
                
                if result['profile'].get('attributes'):
                    attrs = result['profile']['attributes'][:3]
                    print(f"   Attributes: {', '.join(attrs)}")
        
        except Exception as e:
            print(f"Error: {e}")

# Start the interactive session (comment this out for non-interactive runs)
interactive_search()

Network-Aware Search Interface
Searching as user: 686

Enter search query (or 'quit' to exit):  stanford

Top 5 results for 'stanford':
--------------------------------------------------

Enter search query (or 'quit' to exit):  sport

Top 5 results for 'sport':
--------------------------------------------------

Enter search query (or 'quit' to exit):  stanford

Top 5 results for 'stanford':
--------------------------------------------------

Enter search query (or 'quit' to exit):  quit
Goodbye!
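
The queries in this session return no hits. A plausible explanation is that the dataset's feature names are anonymized, so the indexed `attributes` text contains category words and numeric IDs rather than real school or interest names. Assuming names of the form `education;school;id;anonymized feature 52` (an assumed example, not copied from the data), the sketch below approximates the standard analyzer with a lowercase word split to show which query terms could match. The regex is a simplification of the real analyzer.

```python
import re


def approx_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


# Assumed feature-name format
attribute = "education;school;id;anonymized feature 52"
tokens = set(approx_tokens(attribute))
print(sorted(tokens))

for query in ["education work", "stanford", "sport"]:
    matched = [t for t in approx_tokens(query) if t in tokens]
    print(f"{query!r}: {matched or 'no matching terms'}")
```

Under that assumption, category words such as "education" or "work" are the useful query terms, which is consistent with the queries used in Steps 8 and 10.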


## Summary

You've built a network-aware search system that:

1. **Loads social network data** from the Facebook dataset
2. **Indexes user profiles** in Elasticsearch with attributes and circles
3. **Computes network features** (social distance, mutual friends, shared circles)
4. **Reranks search results** by combining text relevance with network proximity
5. **Provides interpretable results** by reporting each score component alongside the final score

### Key Insights:

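- **Social distance** decays exponentially: with the formula used here, a user 1 hop away scores about 6.07 and a user 2 hops away about 3.68. Every result shown above was 2 hops from the searcher, so distance did not change the ordering in these runs.
- **Mutual friends** act as a trust signal and drove the rank changes in Step 9: users 1085 and 2107 moved to the top with 8 and 7 mutual friends.
- **Shared circles** are meant to indicate common interests/communities, but every result shown above had 0 shared circles with the searcher, so this component had no effect here.
- Weight tuning lets you balance text relevance against social proximity.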

### Next Steps:

1. Add temporal features (recent interactions)
2. Implement learning-to-rank with user feedback
3. Add faceted search by circles/communities
4. Scale to larger networks with approximate algorithms
5. Build a web UI for easier interaction