## Person Re-Identification Task
#### Name: Sanjeev Khannan

For this person re-identification task, I grouped the detection IDs into exactly 5 clusters, each representing a unique individual. To do this, I first standardized the feature vectors of the detections and reduced their dimensionality. This preprocessing puts the features on a consistent scale and simplifies the data, which helps the clustering. Next, I used a Gaussian Mixture Model (GMM) for the clustering, as it can model overlapping clusters. Finally, to evaluate the predicted clusters, I wrote a custom evaluation procedure that checks the predictions against the available true labels.

In [1]:
import json
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


In [2]:
# Load the detection data from the JSON file
with open('final_detections.json', 'r') as f:
    detections = json.load(f)
    

In [3]:
# Extract features and detection IDs
features = []
detection_ids = []

for det in detections:
    features.append(det['feature'])
    detection_ids.append(det['detection_id'])

features = np.array(features)
features


array([[-0.01484233, -0.05363799, -0.05816172, ..., -0.00557502,
        -0.03075798, -0.00266532],
       [-0.03776949, -0.02715539, -0.03124019, ..., -0.02639771,
        -0.02372385, -0.01186128],
       [-0.01462968, -0.02831179, -0.0522408 , ..., -0.00551042,
        -0.04961643, -0.00492361],
       ...,
       [ 0.01070334,  0.00016948, -0.047523  , ..., -0.03672916,
        -0.04369174, -0.00983843],
       [ 0.01693735,  0.0021647 , -0.05257468, ..., -0.04102672,
        -0.04835961, -0.01173849],
       [ 0.01585637, -0.00551623, -0.05567126, ..., -0.03639167,
        -0.04980672, -0.00640311]])

In [4]:
# Standardize the features (zero mean, unit variance per dimension)
scaler = StandardScaler()
features = scaler.fit_transform(features)
features


array([[ 0.57868783, -0.57985488, -1.52276282, ...,  0.90700305,
        -0.65727432, -0.00303677],
       [-0.29531476,  0.84781311, -0.11710985, ..., -0.06168926,
        -0.32965385, -0.36456722],
       [ 0.58679419,  0.78547182, -1.21361385, ...,  0.91000819,
        -1.53562211, -0.09181942],
       ...,
       [ 1.55250989,  2.32088789, -0.9672839 , ..., -0.54231892,
        -1.25967481, -0.2850407 ],
       [ 1.79015538,  2.4284497 , -1.23104705, ..., -0.74224566,
        -1.4770847 , -0.35973994],
       [ 1.7489475 ,  2.01437325, -1.39272829, ..., -0.52661842,
        -1.54448492, -0.14998462]])

In [5]:
# Dimensionality Reduction with PCA
pca = PCA(n_components=256)  # Reduce to 256 dimensions for better clustering performance
features = pca.fit_transform(features)
features.shape


(11000, 256)
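
The choice of 256 components above is a fixed setting. One way to sanity-check such a choice is to look at the cumulative explained variance of the principal components. The sketch below is only an illustration on synthetic data: the helper name `components_for_variance` and the 0.95 target are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(features, target=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target` (hypothetical helper)."""
    scaled = StandardScaler().fit_transform(features)
    pca = PCA().fit(scaled)  # keep all components to inspect the spectrum
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index where the cumulative ratio reaches the target, as a count;
    # clamp in case rounding keeps the final sum just under the target.
    return int(min(np.searchsorted(cumulative, target) + 1, len(cumulative)))


# Synthetic stand-in for the real features: rank-5 structure in 32 dimensions
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 32))
demo += 0.01 * rng.normal(size=demo.shape)  # small noise

n_keep = components_for_variance(demo, target=0.95)
print(n_keep)
```

On the real features, the same helper could be called on the raw `features` array before the PCA step to see how many components a given variance target would keep.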

In [6]:
# Perform clustering with Gaussian Mixture Models
# Since we have data of 5 persons, setting num_clusters to 5
num_clusters = 5
gmm = GaussianMixture(n_components=num_clusters, random_state=0)
labels = gmm.fit_predict(features)
labels


array([1, 3, 1, ..., 3, 3, 3])
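
Because `fit_predict` assigns each sample to its most likely component, nothing forces every component to receive samples. A quick size check per cluster can confirm that all 5 clusters are actually used. This is a minimal sketch on made-up labels; `cluster_sizes` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np


def cluster_sizes(labels, num_clusters):
    """Count how many samples were assigned to each cluster ID."""
    return np.bincount(labels, minlength=num_clusters)


# Made-up labels for illustration; clusters 2 and 4 receive no samples
demo_labels = np.array([1, 3, 1, 0, 3, 3])
sizes = cluster_sizes(demo_labels, num_clusters=5)
empty = [int(i) for i in np.flatnonzero(sizes == 0)]

print(sizes.tolist())  # [1, 2, 0, 3, 0]
print(empty)           # [2, 4]
```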

In [7]:
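# Group detection IDs by predicted cluster label.
# Pre-creating all 5 keys guarantees exactly 5 groups in the output
# (a group would be empty if no detection was assigned to that component).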
clusters = {i: [] for i in range(num_clusters)}
for i, label in enumerate(labels):
    clusters[label].append(detection_ids[i])


In [8]:
# Prepare output in the format required
prediction = list(clusters.values())

# Save the prediction results to a JSON file
with open('prediction.json', 'w') as f:
    json.dump(prediction, f, indent=4)
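
Before relying on the saved file, it can help to verify that the prediction is a proper partition: 5 non-empty groups in which every detection ID appears exactly once. The check below is a sketch under that assumption about the expected format; `is_valid_partition` and the toy data are hypothetical.

```python
def is_valid_partition(groups, all_ids, expected_groups=5):
    """Return True if `groups` splits `all_ids` into `expected_groups`
    non-empty groups with no ID missing or repeated."""
    flat = [det_id for group in groups for det_id in group]
    return (
        len(groups) == expected_groups
        and all(len(group) > 0 for group in groups)
        and len(flat) == len(set(flat))   # no ID appears twice
        and set(flat) == set(all_ids)     # no ID is missing or unknown
    )


# Toy data for illustration
demo_ids = [1, 2, 3, 4, 5, 6]
ok_good = is_valid_partition([[1], [2], [3], [4], [5, 6]], demo_ids)
ok_bad = is_valid_partition([[1, 1], [2], [3], [4], [5, 6]], demo_ids)
print(ok_good, ok_bad)  # True False
```

In the notebook, this could be called as `is_valid_partition(prediction, detection_ids)`.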


## Inspection of labels.json

In [9]:
# Load the ground-truth labels provided for evaluation
with open('labels.json') as f:
    true_labels = json.load(f)

In [10]:
[len(sub_l) for sub_l in true_labels]

[113, 90, 113, 112, 132]

### The detections data contains 11000 detections, and the task is to group together the detections of the same person. However, labels.json contains only 560 detection IDs in total, so I assume only partial data was provided for evaluation.

In [11]:
print(true_labels[1])

[567, 447, 567, 2260, 2503, 2508, 2743, 2748, 2979, 2984, 3184, 3188, 3376, 3607, 4924, 5135, 5141, 5301, 5306, 5461, 5465, 5636, 5642, 7749, 8192, 8335, 332, 677, 792, 1210, 1816, 2036, 2038, 2263, 2268, 2500, 3372, 3612, 3867, 3871, 4116, 4711, 4930, 5839, 6004, 6541, 7626, 7850, 7943, 8055, 8056, 795, 912, 915, 1050, 1818, 1598, 1396, 1390, 2043, 4108, 6160, 6332, 8495, 8497, 8653, 1054, 1605, 8194, 1214, 1823, 8337, 7484, 8781, 8923, 8650, 9066, 8778, 5138, 6865, 6534, 2975, 6534, 2740, 2740, 6325, 5833, 5833, 8919, 8919]


### There are also duplicate entries in labels.json. In the group above, for example, detection_id 567 appears twice. If detection_id is unique, as stated in the problem statement, how can there be duplicate entries?
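
The duplicates can be listed systematically rather than by eye. The sketch below reports IDs repeated within a group and IDs that appear in more than one group. It runs on a small made-up list; `find_duplicates` is a hypothetical helper, and on the real data it would be called with `true_labels`.

```python
from collections import Counter


def find_duplicates(groups):
    """Return (within, across):
    within maps group index -> IDs repeated inside that group,
    across maps ID -> indices of all groups containing it (if more than one)."""
    within = {}
    membership = {}
    for idx, group in enumerate(groups):
        repeated = sorted(d for d, count in Counter(group).items() if count > 1)
        if repeated:
            within[idx] = repeated
        for det_id in set(group):
            membership.setdefault(det_id, []).append(idx)
    across = {d: idxs for d, idxs in membership.items() if len(idxs) > 1}
    return within, across


# Made-up groups for illustration
demo = [[567, 447, 567, 2260], [10, 11], [11, 12, 12]]
within, across = find_duplicates(demo)
print(within)  # {0: [567], 2: [12]}
print(across)  # {11: [1, 2]}
```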

<br>

## Evaluation Technique
Since labels.json is partial and contains duplicates, I will use a custom evaluation technique that compares the grouping in labels.json with the clusters in our predictions.

Each detection whose predicted cluster matches the expected cluster for its group is scored as 1, and each mismatch as 0. This measures how well our predicted clusters agree with the limited labels.json data.

In [12]:
# Map each detection ID to its predicted cluster ID
prediction_map = {}

for cluster_id, detection_id_list in enumerate(prediction):
    for detection_id in detection_id_list:
        prediction_map[detection_id]=cluster_id

In [13]:
# Each detection_id is assigned a cluster_number
len(prediction_map)

11000

Now, for each cluster in true_labels:
- take the first `detection_id` and look up its `cluster_id` in our prediction_map;
- check whether each of the remaining detection_ids maps to the same `cluster_id`.

In [14]:
matched_prediction_count = 0
total_count = 0

for true_detection_list in true_labels:
    expected_cluster_id = prediction_map[true_detection_list[0]]

    for detection_id in true_detection_list[1:]:
        if expected_cluster_id == prediction_map[detection_id]:
            matched_prediction_count+=1
        total_count+=1


In [15]:
matched_prediction_count, total_count

(467, 555)

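I got 467 correct predictions out of 555. The total is 555 rather than 560 because the first detection of each of the 5 groups serves as the reference and is not counted.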

In [16]:
print(f"Average accuracy of the clustering - {matched_prediction_count/total_count}")

Average accuracy of the clustering - 0.8414414414414414
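
The score above depends on which detection happens to come first in each group: if that one detection is mispredicted, the rest of the group is counted as wrong. A possible alternative, shown here only as a sketch and not as the method used above, is to take the most common predicted cluster in each group as the reference and to ignore duplicate IDs. The helper `majority_vote_accuracy` and the toy data are hypothetical.

```python
from collections import Counter


def majority_vote_accuracy(true_groups, prediction_map):
    """Count detections that fall in their group's most common predicted
    cluster. Duplicate IDs within a group are counted once."""
    matched = 0
    total = 0
    for group in true_groups:
        unique_ids = list(dict.fromkeys(group))  # drop duplicates, keep order
        if not unique_ids:
            continue  # nothing to score for an empty group
        votes = Counter(prediction_map[det_id] for det_id in unique_ids)
        matched += votes.most_common(1)[0][1]  # size of the majority cluster
        total += len(unique_ids)
    return matched, total


# Toy data for illustration: ID 3 is duplicated, ID 1 is the odd one out
demo_groups = [[1, 2, 3, 3], [4, 5, 6]]
demo_map = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2}
matched, total = majority_vote_accuracy(demo_groups, demo_map)
print(matched, total)  # 5 6
```

Note that this purity-style score does not penalize two true groups being merged into the same predicted cluster, so it is best read alongside the cluster sizes rather than on its own.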


## I got around 84% accuracy on the person re-identification grouping