# User-to-User Similarity Script
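This script uses scikit-learn's *DictVectorizer* class (from `sklearn.feature_extraction`) to turn per-user artist counts into a user-by-artist matrix, then compares users with cosine-based similarity measures.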
>Link to DictVectorizer documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html

Section 1's code cells define callable functions so the analysis is reproducible on different, or multiple, datasets. To adjust inputs (data and parameters), changes are only required in Section 2.

### Imports

In [1]:
import json # load data
import pandas as pd
import numpy as np
from collections import Counter # count artist occurrences per user
from sklearn.feature_extraction import DictVectorizer # convert json data to matrix format
from sklearn.metrics.pairwise import cosine_similarity

# Section 1: Callable Functions

#### Cosine similarity variations
Three different cosine similarity formulas are applied here and then evaluated to see which performs best. The 3 variations are: 
1. **Standard Cosine Similarity**  
$$
\text{cosine}(u, v) = \frac{u \cdot v}{\|u\| \cdot \|v\|}
$$

2. **Scaled Cosine Similarity**  
$$
\text{scaled_cosine}(u, v) = \text{cosine}(u, v) \cdot \min\left( \frac{\|u\|}{\|v\|}, \frac{\|v\|}{\|u\|} \right)
$$

3. **Soft Cosine Similarity** (with tunable exponent $ \alpha $)  
$$
\text{soft_cosine}_\alpha(u, v) = \frac{u \cdot v}{\|u\|^\alpha \cdot \|v\|^\alpha}, \quad \alpha \in [0, 1]
$$

The variations weigh vector direction against magnitude differently. Standard Cosine Similarity considers only the direction of the user vectors, while Scaled Cosine and Soft Cosine also account for vector magnitude. Scaled Cosine penalizes pairs whose norms differ, and Soft Cosine introduces a tunable parameter $ \alpha $ that interpolates between the pure dot product ($ \alpha = 0 $) and standard cosine similarity ($ \alpha = 1 $).
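To make the three formulas concrete, here is a minimal dependency-free sketch on two hypothetical count vectors that point in the same direction but differ in magnitude. The helper names (`dot`, `norm`, `cosine`, `scaled_cosine`, `soft_cosine`) and the vectors are illustrative assumptions, not the functions defined in Section 1.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def scaled_cosine(u, v):
    nu, nv = norm(u), norm(v)
    return cosine(u, v) * min(nu / nv, nv / nu)

def soft_cosine(u, v, alpha):
    return dot(u, v) / (norm(u) ** alpha * norm(v) ** alpha)

# Hypothetical artist-count vectors: same direction, v is twice as large as u.
u = [3.0, 4.0]  # norm 5
v = [6.0, 8.0]  # norm 10

print(cosine(u, v))            # 1.0  (direction only)
print(scaled_cosine(u, v))     # 0.5  (penalized by the norm ratio 5/10)
print(soft_cosine(u, v, 0.0))  # 50.0 (pure dot product)
print(soft_cosine(u, v, 0.5))  # about 7.07
print(soft_cosine(u, v, 1.0))  # 1.0  (same as standard cosine)
```

The two vectors are identical in direction, so standard cosine reports a perfect match, scaled cosine halves it because one user has twice the listening volume, and soft cosine with $ \alpha < 1 $ is no longer bounded by 1.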

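<span style="color:blue">

Nick's Notes (temp):
> Scaled cosine will ALWAYS produce a raw score that is less than or equal to the standard cosine score, because the norm-ratio factor is at most 1 and cosine scores of count vectors are non-negative.

> Soft cosine with $ \alpha < 1 $ inflates the raw scores because the denominator shrinks, so the printed raw scores are often larger than 100%.

</span>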
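
Both notes can be checked by rewriting each variant in terms of standard cosine. A short derivation, assuming non-negative count vectors with non-zero norms:

$$
\text{scaled_cosine}(u, v) = \text{cosine}(u, v) \cdot \underbrace{\min\left( \frac{\|u\|}{\|v\|}, \frac{\|v\|}{\|u\|} \right)}_{\le 1} \le \text{cosine}(u, v) \quad \text{when } \text{cosine}(u, v) \ge 0
$$

$$
\text{soft_cosine}_\alpha(u, v) = \frac{u \cdot v}{\|u\| \cdot \|v\|} \cdot \left( \|u\| \cdot \|v\| \right)^{1 - \alpha} = \text{cosine}(u, v) \cdot \left( \|u\| \cdot \|v\| \right)^{1 - \alpha}
$$

So for $ \alpha < 1 $ and $ \|u\| \cdot \|v\| > 1 $, the soft score exceeds the standard cosine score by a factor that grows with both users' listening volume.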

In [2]:
### load the data ### 
def load_pseudo_users(json_paths):
    all_users = []
    for path in json_paths:
        with open(path, "r") as f:
            data = json.load(f)
            all_users.extend(data)
    return all_users


### build the user matrix from artist counts ### 
def build_artist_vectors(user_json):
    user_ids = []
    artist_count_dicts = []

    for user in user_json:
        user_id = user["username"]
        artist_counter = Counter()

        for playlist in user["playlists"]:
            for track in playlist["tracks"]:
                artist_uri = track["artist_uri"]
                artist_counter[artist_uri] += 1

        user_ids.append(user_id)
        artist_count_dicts.append(dict(artist_counter))

    vectorizer = DictVectorizer(sparse=False)
    matrix = vectorizer.fit_transform(artist_count_dicts)
    artist_df = pd.DataFrame(matrix, index=user_ids, columns=vectorizer.get_feature_names_out())
    return artist_df

##########################################################################################
##########################################################################################

### compute standard cosine similarity matrix (baseline) ###
def compute_cosine_similarity(artist_df):
    sim_matrix = cosine_similarity(artist_df.values)
    return pd.DataFrame(sim_matrix, index=artist_df.index, columns=artist_df.index)


### compute scaled cosine similarity matrix ###
def get_scaled_cosine_similarity(artist_df):
    X = artist_df.values
    sim_matrix = cosine_similarity(X) # get standard cosine matrix
    
    # calculate norms of vectors from the artist df
    norms = np.linalg.norm(X, axis=1)
    norm_ratios = np.minimum.outer(norms, norms) / np.maximum.outer(norms, norms)
    
    # multiply standard cosine matrix by norm matrix 
    scaled_sim = sim_matrix * norm_ratios
    return pd.DataFrame(scaled_sim, index=artist_df.index, columns=artist_df.index)


### compute soft cosine similarity matrix ###
def get_soft_cosine_similarity(artist_df, alpha=1.0):
    X = artist_df.values
    
    # numerator: compute dot product
    dot = X @ X.T
    
    # denominator: compute norms and raise to alpha
    norms = np.linalg.norm(X, axis=1)
    norm_matrix = np.outer(norms, norms) ** alpha
    
    # calculate final soft cosine
    soft_cosine = dot / norm_matrix
    # self-similarity would otherwise be ||u||^(2 - 2*alpha); pin it to 1.0 by convention
    np.fill_diagonal(soft_cosine, 1.0)
    
    return pd.DataFrame(soft_cosine, index=artist_df.index, columns=artist_df.index)

##########################################################################################
##########################################################################################

### get similarity scores for a given user or pair of users ### 
def get_user_similarity(df, user1, user2=None, top_x=5):
    """
    Parameters:
        df (pd.DataFrame): square similarity df (users x users)
        user1 (str): The first user's ID.
        user2 (str, optional): The second user's ID, for comparing a pair
        top_x (int): Number of top similar users to return if user2 is not provided
    Returns:
        float if user2 is given, otherwise pd.Series of the top_x most similar users
    """
    if user1 not in df.index:
        raise ValueError(f"User {user1} not found")
    if user2 is not None:
        if user2 not in df.columns:
            raise ValueError(f"User {user2} not found")
        return df.loc[user1, user2].round(3)
    else:
        top_users = df.loc[user1].drop(user1).sort_values(ascending=False).head(top_x)
        return top_users.round(3)
    
    
### transform raw cosine score to user interpretable value ### 
def transform_match_score(scores, df, include_raw=False):
    """
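    Converts raw similarity score(s) into interpretable match message(s).
    
    Parameters:
        scores (float or pd.Series): Output from `get_user_similarity()`
        df (pd.DataFrame): similarity df the scores were taken from
        include_raw (bool): If True, append the raw score x100 to each message
    
    Each message reports:
        1) the min-max scaled score (0–1 over all user pairs), x100
        2) the percentile rank among all user pairs, shown as "top X%"
        3) [only if include_raw] the raw cosine score x100
    
    Returns:
        str: one message for a single score, or newline-joined messages
             (one per user) for a pd.Series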
    """
    # k=1 keeps only the upper triangle above the diagonal, so each user pair
    # is counted once and self-similarity is excluded
    all_scores = df.values[np.triu_indices_from(df, k=1)]  
    min_score = all_scores.min()
    max_score = all_scores.max()

    def interpret(score, user_label=None):
        # raw score x 100
        raw_score = round(score * 100, 1)
        
        # min max scaled score x 100
        scaled_score = (score - min_score) / (max_score - min_score)
        scaled_score = round(scaled_score*100,1)
        
        # percentile rank score (inverted to show top %)
        rank = (all_scores < score).sum()
        percentile_score = round(100*rank / len(all_scores), 1)
        top_percentile_score = round(100 - percentile_score,1)
        
        base_return = f"{scaled_score}% match, placing you in the top {top_percentile_score}% of user similarities"
        if include_raw:
            base_return += f" (Raw score of {raw_score}%)"
        if user_label:
            return f"- {user_label}: {base_return}"
        return base_return
        
    if isinstance(scores, pd.Series):
        lines = [interpret(score, user_label=user) for user, score in scores.items()]
        return "\n".join(lines)
    else:
        return interpret(scores)




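
To illustrate what `build_artist_vectors` produces, here is a small standard-library sketch of the same two steps: count artists per user, then lay the counts out as one dense row per user. The user names and `artist:x`-style URIs are hypothetical. The sorted column order mirrors DictVectorizer's default behavior (`sort=True`), but this sketch is a stand-in and does not call DictVectorizer itself.

```python
from collections import Counter

# Hypothetical input in the shape build_artist_vectors expects.
users = [
    {"username": "user_a", "playlists": [
        {"tracks": [{"artist_uri": "artist:x"}, {"artist_uri": "artist:y"}]},
        {"tracks": [{"artist_uri": "artist:x"}]},
    ]},
    {"username": "user_b", "playlists": [
        {"tracks": [{"artist_uri": "artist:y"}, {"artist_uri": "artist:z"}]},
    ]},
]

# Step 1: one Counter of artist occurrences per user, across all playlists.
counts = {
    user["username"]: Counter(
        track["artist_uri"]
        for playlist in user["playlists"]
        for track in playlist["tracks"]
    )
    for user in users
}

# Step 2: a shared, sorted column order, then one dense row per user.
artists = sorted(set().union(*counts.values()))
matrix = {name: [c[a] for a in artists] for name, c in counts.items()}

print(artists)  # ['artist:x', 'artist:y', 'artist:z']
for name, row in matrix.items():
    print(name, row)  # user_a [2, 1, 0] then user_b [0, 1, 1]
```

Artists a user never played become zeros, which is why the real matrix is wide and mostly empty when users share few artists.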

# Section 2: Function Usage

## Usage 1: First 1000 playlists Elbow Method data | Standard Cosine

### *Similarity Matrix*

In [3]:
### creating similarity matrix ###
json_paths = ["data/pseudo-users-elbow-data.json"]
pseudo_users = load_pseudo_users(json_paths)
artist_df = build_artist_vectors(pseudo_users)
cosine_df = compute_cosine_similarity(artist_df)

cosine_df.round(3).head()

| | user_0009 | user_000d | user_0001 | user_000c | user_0003 | user_000b | user_0005 | user_000f | user_0007 | user_0012 | user_0002 | user_000e | user_0008 | user_0010 | user_000a | user_0006 | user_0004 | user_0000 | user_0011 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_0009 | 1.0 | 0.023 | 0.263 | 0.286 | 0.142 | 0.111 | 0.034 | 0.299 | 0.131 | 0.404 | 0.327 | 0.087 | 0.067 | 0.334 | 0.095 | 0.393 | 0.37 | 0.1 | 0.017 |
| user_000d | 0.023 | 1.0 | 0.084 | 0.007 | 0.075 | 0.095 | 0.348 | 0.062 | 0.072 | 0.019 | 0.045 | 0.064 | 0.027 | 0.129 | 0.416 | 0.01 | 0.024 | 0.0 | 0.028 |
| user_0001 | 0.263 | 0.084 | 1.0 | 0.201 | 0.201 | 0.161 | 0.116 | 0.187 | 0.124 | 0.241 | 0.167 | 0.051 | 0.023 | 0.186 | 0.13 | 0.163 | 0.147 | 0.088 | 0.039 |
| user_000c | 0.286 | 0.007 | 0.201 | 1.0 | 0.057 | 0.076 | 0.031 | 0.117 | 0.073 | 0.218 | 0.445 | 0.024 | 0.008 | 0.14 | 0.017 | 0.604 | 0.144 | 0.567 | 0.001 |
| user_0003 | 0.142 | 0.075 | 0.201 | 0.057 | 1.0 | 0.228 | 0.116 | 0.291 | 0.137 | 0.205 | 0.059 | 0.028 | 0.009 | 0.144 | 0.063 | 0.044 | 0.035 | 0.012 | 0.009 |


### *Similarity between users*

In [4]:
### getting top 5 most similar users ###
print("=== TOP 5 ===")
print(get_user_similarity(cosine_df, user1="user_0009"))

### getting similarity between 2 users ###
print("\n=== SIMILARITY BETWEEN 2 USERS ===")
print(get_user_similarity(cosine_df, user1="user_0009", user2="user_0006"))

### get transformed scores for a single user ###
print("\n=== TRANSFORMED TOP 5 ===")
print(transform_match_score(get_user_similarity(cosine_df, user1="user_0009"), cosine_df, include_raw=True))

### get transformed scores for a user pair ###
print("\n=== TRANSFORMED SIMILARITY BETWEEN 2 USERS ===")
print(transform_match_score(get_user_similarity(cosine_df, user1="user_0009", user2="user_0006"), cosine_df))

=== TOP 5 ===
user_0012    0.404
user_0006    0.393
user_0004    0.370
user_0010    0.334
user_0002    0.327
Name: user_0009, dtype: float64

=== SIMILARITY BETWEEN 2 USERS ===
0.393

=== TRANSFORMED TOP 5 ===
- user_0012: 64.2% match, placing you in the top 2.9% of user similarities (Raw score of 40.4%)
- user_0006: 62.5% match, placing you in the top 4.1% of user similarities (Raw score of 39.3%)
- user_0004: 58.8% match, placing you in the top 5.3% of user similarities (Raw score of 37.0%)
- user_0010: 53.1% match, placing you in the top 7.6% of user similarities (Raw score of 33.4%)
- user_0002: 52.0% match, placing you in the top 7.6% of user similarities (Raw score of 32.7%)

=== TRANSFORMED SIMILARITY BETWEEN 2 USERS ===
62.5% match, placing you in the top 4.1% of user similarities
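
The "% match" and "top X%" figures come from min-max scaling and a percentile rank over all user pairs. A minimal sketch of that arithmetic on hypothetical scores follows; `match_summary` and `all_scores` are illustrative names, and the logic is a simplified restatement of `transform_match_score`, not a replacement for it.

```python
# Hypothetical pairwise scores (the upper triangle of a small similarity matrix).
all_scores = [0.0, 0.1, 0.2, 0.3, 0.5]

def match_summary(score, all_scores):
    lo, hi = min(all_scores), max(all_scores)
    # min-max scale to 0-100 relative to the weakest and strongest pair
    scaled = round(100 * (score - lo) / (hi - lo), 1)
    # share of pairs scoring strictly lower, inverted into a "top X%" figure
    below = sum(s < score for s in all_scores)
    top_pct = round(100 - 100 * below / len(all_scores), 1)
    return scaled, top_pct

print(match_summary(0.3, all_scores))  # (60.0, 40.0)
print(match_summary(0.5, all_scores))  # (100.0, 20.0)
```

Because the rank counts only strictly lower scores, even the strongest pair lands in the "top 1/n" rather than the "top 0%". With the 19 users shown in the matrix header there are 19 × 18 / 2 = 171 pairs, so the smallest reportable figure is about 0.6%.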


## Usage 2: First 1000 playlists Elbow Method data | Scaled Cosine

### *Similarity Matrix*

In [5]:
scaled_cosine_df = get_scaled_cosine_similarity(artist_df)

scaled_cosine_df.round(3).head()

| | user_0009 | user_000d | user_0001 | user_000c | user_0003 | user_000b | user_0005 | user_000f | user_0007 | user_0012 | user_0002 | user_000e | user_0008 | user_0010 | user_000a | user_0006 | user_0004 | user_0000 | user_0011 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_0009 | 1.0 | 0.014 | 0.182 | 0.21 | 0.097 | 0.061 | 0.007 | 0.109 | 0.06 | 0.209 | 0.288 | 0.07 | 0.03 | 0.176 | 0.028 | 0.241 | 0.199 | 0.057 | 0.003 |
| user_000d | 0.014 | 1.0 | 0.036 | 0.003 | 0.067 | 0.086 | 0.116 | 0.037 | 0.054 | 0.016 | 0.031 | 0.032 | 0.02 | 0.112 | 0.198 | 0.004 | 0.021 | 0.0 | 0.008 |
| user_0001 | 0.182 | 0.036 | 1.0 | 0.19 | 0.095 | 0.061 | 0.016 | 0.047 | 0.039 | 0.087 | 0.102 | 0.044 | 0.007 | 0.068 | 0.026 | 0.145 | 0.055 | 0.035 | 0.004 |
| user_000c | 0.21 | 0.003 | 0.19 | 1.0 | 0.028 | 0.031 | 0.005 | 0.031 | 0.025 | 0.083 | 0.288 | 0.022 | 0.002 | 0.054 | 0.004 | 0.505 | 0.057 | 0.235 | 0.0 |
| user_0003 | 0.097 | 0.067 | 0.095 | 0.028 | 1.0 | 0.183 | 0.035 | 0.155 | 0.092 | 0.155 | 0.046 | 0.015 | 0.006 | 0.111 | 0.027 | 0.018 | 0.028 | 0.01 | 0.002 |


### *Similarity between users*

In [6]:
### getting top 5 most similar users ###
print("=== TOP 5 ===")
print(get_user_similarity(scaled_cosine_df, user1="user_0009"))

### getting similarity between 2 users ###
print("\n=== SIMILARITY BETWEEN 2 USERS ===")
print(get_user_similarity(scaled_cosine_df, user1="user_0009", user2="user_0006"))

### get transformed scores for a single user ###
print("\n=== TRANSFORMED TOP 5 ===")
print(transform_match_score(get_user_similarity(scaled_cosine_df, user1="user_0009"), scaled_cosine_df, include_raw=True))

### get transformed scores for a user pair ###
print("\n=== TRANSFORMED SIMILARITY BETWEEN 2 USERS ===")
print(transform_match_score(get_user_similarity(scaled_cosine_df, user1="user_0009", user2="user_0006"), scaled_cosine_df))

=== TOP 5 ===
user_0002    0.288
user_0006    0.241
user_000c    0.210
user_0012    0.209
user_0004    0.199
Name: user_0009, dtype: float64

=== SIMILARITY BETWEEN 2 USERS ===
0.241

=== TRANSFORMED TOP 5 ===
- user_0002: 57.0% match, placing you in the top 1.2% of user similarities (Raw score of 28.8%)
- user_0006: 47.7% match, placing you in the top 3.5% of user similarities (Raw score of 24.1%)
- user_000c: 41.6% match, placing you in the top 4.7% of user similarities (Raw score of 21.0%)
- user_0012: 41.4% match, placing you in the top 5.8% of user similarities (Raw score of 20.9%)
- user_0004: 39.4% match, placing you in the top 7.6% of user similarities (Raw score of 19.9%)

=== TRANSFORMED SIMILARITY BETWEEN 2 USERS ===
47.7% match, placing you in the top 3.5% of user similarities


## Usage 3: First 1000 playlists Elbow Method data | Soft Cosine

### *Similarity Matrix*

In [7]:
soft_cosine_df = get_soft_cosine_similarity(artist_df, alpha = 0.5)

soft_cosine_df.round(3).head()

| | user_0009 | user_000d | user_0001 | user_000c | user_0003 | user_000b | user_0005 | user_000f | user_0007 | user_0012 | user_0002 | user_000e | user_0008 | user_0010 | user_000a | user_0006 | user_0004 | user_0000 | user_0011 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_0009 | 1.0 | 6.773 | 116.512 | 123.305 | 43.259 | 30.391 | 5.726 | 66.727 | 32.752 | 107.216 | 113.281 | 35.823 | 16.491 | 89.605 | 18.821 | 185.181 | 100.246 | 27.762 | 2.51 |
| user_000d | 6.773 | 1.0 | 29.187 | 2.289 | 17.773 | 20.391 | 45.321 | 10.773 | 13.991 | 3.888 | 12.235 | 20.738 | 5.199 | 27.086 | 64.649 | 3.592 | 4.974 | 0.106 | 3.279 |
| user_0001 | 116.512 | 29.187 | 1.0 | 104.113 | 73.663 | 53.037 | 23.293 | 50.03 | 37.266 | 76.972 | 69.418 | 25.326 | 6.864 | 59.752 | 30.986 | 92.308 | 47.744 | 29.406 | 7.004 |
| user_000c | 123.305 | 2.289 | 104.113 | 1.0 | 20.156 | 24.401 | 5.99 | 30.354 | 21.334 | 67.755 | 180.176 | 11.453 | 2.203 | 43.83 | 3.876 | 332.12 | 45.396 | 183.724 | 0.193 |
| user_0003 | 43.259 | 17.773 | 73.663 | 20.156 | 1.0 | 51.491 | 15.932 | 53.55 | 28.27 | 44.934 | 17.031 | 9.368 | 1.768 | 31.982 | 10.281 | 17.012 | 7.811 | 2.809 | 1.138 |


### *Similarity between users*

In [8]:
### getting top 5 most similar users ###
print("=== TOP 5 ===")
print(get_user_similarity(soft_cosine_df, user1="user_0009"))

### getting similarity between 2 users ###
print("\n=== SIMILARITY BETWEEN 2 USERS ===")
print(get_user_similarity(soft_cosine_df, user1="user_0009", user2="user_0006"))

### get transformed scores for a single user ###
print("\n=== TRANSFORMED TOP 5 ===")
print(transform_match_score(
    get_user_similarity(soft_cosine_df, user1="user_0009"), soft_cosine_df, include_raw=True))

### get transformed scores for a user pair ###
print("\n=== TRANSFORMED SIMILARITY BETWEEN 2 USERS ===")
print(transform_match_score(
    get_user_similarity(soft_cosine_df, user1="user_0009", user2="user_0006"), soft_cosine_df))

=== TOP 5 ===
user_0006    185.181
user_000c    123.305
user_0001    116.512
user_0002    113.281
user_0012    107.216
Name: user_0009, dtype: float64

=== SIMILARITY BETWEEN 2 USERS ===
185.181

=== TRANSFORMED TOP 5 ===
- user_0006: 55.8% match, placing you in the top 0.6% of user similarities (Raw score of 18518.1%)
- user_000c: 37.1% match, placing you in the top 4.7% of user similarities (Raw score of 12330.5%)
- user_0001: 35.1% match, placing you in the top 4.7% of user similarities (Raw score of 11651.2%)
- user_0002: 34.1% match, placing you in the top 5.3% of user similarities (Raw score of 11328.1%)
- user_0012: 32.3% match, placing you in the top 5.8% of user similarities (Raw score of 10721.6%)

=== TRANSFORMED SIMILARITY BETWEEN 2 USERS ===
55.8% match, placing you in the top 0.6% of user similarities
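
One simple way to compare the three variants is to look at how much their top-5 lists for `user_0009` overlap. The sketch below copies the user IDs from the printed outputs above. The `top5` dictionary and its keys are illustrative, and the comparison is purely descriptive: it does not say which ranking is better.

```python
# Top-5 matches for user_0009, copied from the printed outputs above.
top5 = {
    "standard": ["user_0012", "user_0006", "user_0004", "user_0010", "user_0002"],
    "scaled":   ["user_0002", "user_0006", "user_000c", "user_0012", "user_0004"],
    "soft_0.5": ["user_0006", "user_000c", "user_0001", "user_0002", "user_0012"],
}

# Users that every variant places in the top 5.
shared = set.intersection(*(set(users) for users in top5.values()))
print(sorted(shared))  # ['user_0002', 'user_0006', 'user_0012']

# Pairwise overlap between variants (number of shared users out of 5).
names = list(top5)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        overlap = len(set(top5[a]) & set(top5[b]))
        print(f"{a} vs {b}: {overlap}/5 shared")
```

Three users appear in every list, but their order differs: the top match is `user_0012` under standard cosine, `user_0002` under scaled cosine, and `user_0006` under soft cosine with $ \alpha = 0.5 $.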
