# Experiment 2: Part Type Collision Analysis

This notebook gathers the results of Experiment 2.

# Part Type Collision Analysis
## Methodology
For each part type, run the experiment repeatedly across a range of hyperparameter values. Specifically, isolate one hyperparameter, sweep it over a range of values while holding the others fixed, and record the computed collision rate for each run. Repeat this for every hyperparameter and every part type.
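
The sweep described above can be sketched as a small generator of run configurations. This is only an illustration: the baseline and sweep values are taken from the parameter grids later in this notebook, while `OTHER` is a placeholder part type and `one_at_a_time_runs` and the `swept` bookkeeping key are hypothetical names, not part of `psig_matcher`.

```python
from typing import Any, Dict, Iterator, List

# Baseline values mirror the single-run grid used later in this notebook.
BASELINE: Dict[str, Any] = {
    "part_dim": 2,
    "num_samples": 1000,
    "meta_pdf_ci": 0.6,
    "part_pdf_ci": 0.5,
    "confidence_bound": 0.5,
}

# Ranges to sweep, one hyperparameter at a time (assumed for illustration).
SWEEPS: Dict[str, List[Any]] = {
    "part_dim": [1, 5, 10, 50],
    "confidence_bound": [0.5, 0.8, 0.9, 0.95, 0.99, 0.999],
}


def one_at_a_time_runs(
    part_types: List[str],
    baseline: Dict[str, Any],
    sweeps: Dict[str, List[Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield one config per (part type, hyperparameter, value), varying a single hyperparameter."""
    for part_type in part_types:
        for name, values in sweeps.items():
            for value in values:
                # Start from the baseline and override only the swept hyperparameter.
                yield {**baseline, "part_type": part_type, "swept": name, name: value}


# "CON" appears in this notebook; "OTHER" is a placeholder part type.
runs = list(one_at_a_time_runs(["CON", "OTHER"], BASELINE, SWEEPS))
print(f"{len(runs)} runs planned")
```

The `swept` key only records which hyperparameter a run belongs to, which makes grouping results easier later; it would need to be dropped before passing a config to an experiment function.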
## Deliverables
Graphs and analysis of the impact of different hyperparameter values. How do they affect the final collision rate? Why do they affect the collision rate that way? What does this tell us?
Graphs and analysis comparing the results across part types. Are different part types affected in the same way by the same change in hyperparameters? How close are their collision rates? What does this tell us about the relative importance of hyperparameters and part types?
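
One possible way to organize results for these comparisons is sketched below, assuming each run is reduced to a `(part_type, swept value, collision rate)` record. The record layout, the helper names, and the part type `OTHER` are assumptions, and the numbers are placeholders that only show the data shape; they are not experiment results.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, float, float]  # (part_type, swept value, collision rate)

# Placeholder records, NOT measured results.
records: List[Record] = [
    ("CON", 0.5, 0.10), ("CON", 0.9, 0.30),
    ("OTHER", 0.5, 0.20), ("OTHER", 0.9, 0.25),
]


def series_by_part_type(rows: List[Record]) -> Dict[str, List[Tuple[float, float]]]:
    """Group records into one (value, rate) series per part type, sorted by swept value."""
    series = defaultdict(list)
    for part_type, value, rate in rows:
        series[part_type].append((value, rate))
    return {part_type: sorted(points) for part_type, points in series.items()}


def spread_across_part_types(rows: List[Record]) -> Dict[float, float]:
    """For each swept value, the gap between the highest and lowest collision rate."""
    by_value = defaultdict(list)
    for _, value, rate in rows:
        by_value[value].append(rate)
    return {value: max(rates) - min(rates) for value, rates in by_value.items()}


series = series_by_part_type(records)
spread = spread_across_part_types(records)
```

Each entry of `series` is ready to plot as one line per part type, and `spread` gives a rough answer to how close the collision rates of different part types are at each hyperparameter value.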

## Setting Up Your Environment
Installing required libraries

Change the working directory to the repository root (the parent of `psig_matcher`) so that external data files can be loaded. Adjust the path below to match where this repo lives on your machine.

In [None]:
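import os
import sys

user_path = '~/GitHub/matcher'  # CHANGE THIS LINE AS NEEDED FOR YOUR ENVIRONMENT
repo_root = os.path.expanduser(user_path)
os.chdir(repo_root)
# Also put the repo root on sys.path so `psig_matcher` imports do not depend on the
# current working directory (for example in spawned multiprocessing workers).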

## Source Code

The section below contains all of our source code.

In [None]:
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np
import os
import glob
import scipy.stats as st
from scipy import stats
import dataclasses
import mlflow

@dataclass
class NormalDistribution:
    mean: float
    std: float

@dataclass
class Part:

    type: str
    sub_part_name: str
    sensor: str
    signals: List   # Signal is numpy array of (500,3) with [frequency, Z, X]


def load_part_data(part_type: str) -> List[Part]:

    parts = []
    for part_dir in os.listdir(f'psig_matcher/data/{part_type}'):

        sensor = part_dir[1:]
        measurement_files = glob.glob(f'psig_matcher/data/{part_type}/{part_dir}/*.npy')
        measurements = [np.load(f) for f in measurement_files]
        parts.append(Part(part_type, part_dir, sensor, measurements))

    return parts

def limit_deminsionality(parts: List[Part], frequency_indexes: List[int]) -> List[Part]:
    """Use only a subset of the frequencies for the analysis. This effectively reduces the
    500-dimensional multivariate distribution to an n-dimensional one, where n is the
    length of frequency_indexes.
    Each retained value is column 1 of the signal row, which is Z under the
    [frequency, Z, X] layout noted on Part.signals (X would be column 2)."""

    return [
        dataclasses.replace(part, signals=[[signal[index][1] for index in frequency_indexes] for signal in part.signals])
        for part in parts]

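def compute_normal_ci(x: List[float], confidence: float) -> Tuple[float, float]:
    """Computes the two-sided confidence interval for the given confidence level."""

    # Degenerate case: a zero-mean sample gets a zero-width interval.
    if np.mean(x) == 0: return (0, 0)
    
    if len(x) < 30:
        # Small sample: Student's t interval scaled by the standard error of the mean.
        return st.t.interval(confidence, len(x)-1, loc=np.mean(x), scale=st.sem(x))
    else:
        # NOTE: scaled by the sample standard deviation rather than the standard error,
        # so this interval does not narrow as more samples are added.
        return stats.norm.interval(confidence, loc=np.mean(x), scale=np.std(x))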

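def estimate_normal_dist(x: List[float], confidence: float) -> NormalDistribution:
    """Estimate the normal distribution for the given data.
    The standard deviation is recovered from the confidence interval as
    SD = sqrt(N) * (upper - lower) / (2 * critical value), following:
    https://handbook-5-1.cochrane.org/chapter_7/7_7_3_2_obtaining_standard_deviations_from_standard_errors_and.htm#:~:text=The%20standard%20deviation%20for%20each,should%20be%20replaced%20by%205.15.
    """

    lower, upper = compute_normal_ci(x, confidence)

    # Two-sided critical value: t distribution for small samples, normal otherwise.
    quantile = (1 + confidence) / 2
    if len(x) < 30:
        val = st.t.ppf(quantile, len(x)-1)
    else:
        val = stats.norm.ppf(quantile)

    # Divide by the full interval width in units of the critical value (2 * val).
    std = np.sqrt(len(x))*(upper-lower)/(2*val)
    return NormalDistribution(np.mean(x, axis=0), std)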


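def probability_of_multivariant_point(mu: List[float], cov: List[List[float]], x: List[float]) -> float:
    """Tail probability of x under N(mu, cov): the chance of drawing a point at least as
    far from mu (in Mahalanobis distance) as x."""

    #https://stats.stackexchange.com/questions/331283/how-to-calculate-the-probability-of-a-data-point-belonging-to-a-multivariate-nor
    # Squared Mahalanobis distance: (x - mu)^T cov^-1 (x - mu)
    diff = np.asarray(x) - np.asarray(mu)
    m_dist_x = np.dot(np.dot(diff, np.linalg.inv(cov)), diff)
    # The squared distance follows a chi-square distribution whose degrees of freedom
    # equal the number of dimensions, so use len(diff) rather than a fixed 3.
    return 1-stats.chi2.cdf(m_dist_x, len(diff))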

def estimate_overlap_of_set_with_sample_signals(parts: List[Part], samples: int, meta_pdf_ci: float, part_pdf_ci: float, confidence_bound: float) -> float:
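    """ Estimates the collision rate by Monte Carlo sampling. A normal distribution is fitted to each
    part and a "meta" distribution to all signals pooled together; `samples` points are drawn from the
    meta distribution, and each point that matches more than one part counts as a collision.
    We believe this is the best of the candidate solutions: it directly models the distribution/state
    space the signals come from, ties directly to the CI, and is intuitive. See Notion for more
    details and a defense. """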

    min_confidence = 1 - confidence_bound
    signals = [
        signal for part in parts
        for signal in part.signals]

    part_pdfs = [estimate_normal_dist(part.signals, part_pdf_ci) for part in parts]
    sample_pdf = estimate_normal_dist(signals, meta_pdf_ci)
    
    # Covariance matrices are diagonal and hold variances, so square the standard deviations.
    state_space_samples = np.random.multivariate_normal(sample_pdf.mean, np.diag(np.square(sample_pdf.std)), samples)

    # using probability_of_multivariant_point no longer directly equates to false negative rate.
    # TODO (henry): Figure out relationship between integrated pdf range and false negative rate
    sample_confidences = [
        [probability_of_multivariant_point(pdf.mean, np.diag(np.square(pdf.std)), sample) for pdf in part_pdfs]
        for sample in state_space_samples]

    filtered_confidences = [
        list(filter(lambda confidence: confidence >= min_confidence, sample_confidence))
        for sample_confidence in sample_confidences]

    # We're ok with up to 1 match, but every one more than that is a conflict.
    collisions = [max(len(confidences)-1, 0) for confidences in filtered_confidences]
    return sum(collisions)/(samples*len(part_pdfs))

def run_meta_markov_multivariant_analysis(parts: List[Part], part_dim: int, num_samples: int, meta_pdf_ci: float, part_pdf_ci: float, confidence_bound: float):
    """ Runs the Monte Carlo approximation of the multivariate collision rate using the signal sample
    meta pdf methodology. The approximation is repeated until more than 100 runs have completed and
    the mean confidence-interval width over the last 10 runs is no smaller than the mean over the
    last 100 runs, i.e. the interval has stopped shrinking. Returns the upper bound of the 95%
    confidence interval of the collision rate."""

    collisions = []
    confidence_ranges = []
    while True:

        multivariant_parts = limit_deminsionality(parts, list(range(part_dim)))
        collision_rate = estimate_overlap_of_set_with_sample_signals(multivariant_parts, num_samples, meta_pdf_ci, part_pdf_ci, confidence_bound)

        collisions.append(collision_rate)
        mlflow.log_metric("monte_carlo_collision_rate", collision_rate)

        lower, upper = compute_normal_ci(collisions, 0.95)
        confidence_ranges.append(upper - lower)
        mlflow.log_metric("monte_carlo_confidence_interval", upper - lower)

        # print(f"Estimated collision rate from sample distributiion has range: {upper - lower}")

        if len(confidence_ranges) > 100 and np.mean(confidence_ranges[-10:]) >= np.mean(confidence_ranges[-100:]):
            return upper
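
The `probability_of_multivariant_point` helper above turns a squared Mahalanobis distance into a tail probability through the chi-square distribution. For a k-dimensional normal distribution the squared distance follows a chi-square distribution with k degrees of freedom, and for k = 2 the tail probability has the closed form exp(-d²/2). The sketch below checks that relationship by simulation using only the standard library. It assumes a diagonal covariance, and all names in it (`squared_mahalanobis_diag`, `chi2_tail_2d`) are hypothetical helpers, not part of the notebook's code.

```python
import math
import random
from typing import List


def squared_mahalanobis_diag(x: List[float], mean: List[float], variances: List[float]) -> float:
    """Squared Mahalanobis distance of x from mean for a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))


def chi2_tail_2d(d2: float) -> float:
    """P(chi-square with 2 degrees of freedom > d2), which equals exp(-d2 / 2)."""
    return math.exp(-d2 / 2.0)


rng = random.Random(0)      # fixed seed so the check is repeatable
mean = [1.0, -2.0]          # arbitrary illustrative mean
variances = [4.0, 0.25]     # diagonal of the covariance: variances, not standard deviations

threshold = 3.0             # squared-distance threshold to test
n = 20000
exceed = 0
for _ in range(n):
    # random.gauss takes a standard deviation, hence the square root of each variance.
    point = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, variances)]
    if squared_mahalanobis_diag(point, mean, variances) > threshold:
        exceed += 1

empirical = exceed / n
exact = chi2_tail_2d(threshold)
print(f"empirical tail = {empirical:.3f}, chi-square tail = {exact:.3f}")
```

The two numbers should agree up to sampling noise. The agreement depends on using the number of dimensions as the degrees of freedom and on passing variances, not standard deviations, as the diagonal of the covariance.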

## Experimentation

The sections below give example scenarios that illustrate the working code and validate the proposed approach.
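
As a small worked scenario, the collision-counting rule used in `estimate_overlap_of_set_with_sample_signals` can be reproduced on a toy table of confidences. The standalone `collision_rate` function and the toy values are illustrative assumptions; only the counting rule (one match is fine, each additional match is a collision, normalized by samples times parts) comes from the source code above.

```python
from typing import List


def collision_rate(sample_confidences: List[List[float]], min_confidence: float) -> float:
    """Fraction of (sample, part) pairs counted as collisions.

    A sample matching at most one part contributes nothing; every match
    beyond the first counts as one collision.
    """
    n_samples = len(sample_confidences)
    n_parts = len(sample_confidences[0])
    collisions = 0
    for confidences in sample_confidences:
        matches = sum(1 for c in confidences if c >= min_confidence)
        collisions += max(matches - 1, 0)
    return collisions / (n_samples * n_parts)


# Hypothetical confidences for 3 samples (rows) against 3 parts (columns).
toy = [
    [0.9, 0.1, 0.2],  # one match    -> 0 collisions
    [0.8, 0.7, 0.1],  # two matches  -> 1 collision
    [0.6, 0.9, 0.7],  # three matches -> 2 collisions
]
rate = collision_rate(toy, min_confidence=0.5)
print(f"collision rate = {rate:.4f}")
```

With this toy table there are 3 collisions over 9 sample-part pairs, so the rate is 1/3. Raising `min_confidence` can only keep or lower the number of matches, which is one way to sanity-check results from the real sweep.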

### *Baseline*

In [18]:
from sklearn.model_selection import ParameterGrid
import multiprocessing as mp
from psig_matcher.experiments import run_experiment

import importlib
importlib.reload(run_experiment)

def run_parallel_experiment(part_type: str, part_dim: int, num_samples: int, meta_pdf_ci: float, part_pdf_ci: float, confidence_bound: float):
    with mlflow.start_run():
        run_experiment.run_experiment(part_type, part_dim, num_samples, meta_pdf_ci, part_pdf_ci, confidence_bound)

# param_values = {
#     'part_type': ["CON"],
#     'part_dim' : [1, 5, 10, 50],
#     'num_samples': [100],
#     'meta_pdf_ci' : [0.5, 0.8, 0.9, 0.95, 0.99, 0.999],
#     'part_pdf_ci' : [0.5, 0.8, 0.9, 0.95, 0.99, 0.999],
#     'confidence_bound' : [0.5, 0.8, 0.9, 0.95, 0.99, 0.999]}

mlflow.set_experiment("Experiment 2")
param_values = {
    'part_type': ["CON"],
    'part_dim' : [2],
    'num_samples': [1000],
    'meta_pdf_ci' : [0.6],
    'part_pdf_ci' : [0.5],
    'confidence_bound' : [0.5]}

parameter_grid = list(ParameterGrid(param_values))
print(f"Running {len(parameter_grid)} experiments")

pool = mp.Pool(mp.cpu_count())
for params in parameter_grid:
    pool.apply_async(run_parallel_experiment, kwds=params)
    #run_parallel_experiment(**params)

pool.close()
pool.join()
    
        


Running 1 experiments


Process SpawnPoolWorker-104:
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/cs-6362/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/cs-6362/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/cs-6362/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/cs-6362/lib/python3.10/multiprocessing/queues.py", line 367, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'psig_matcher'
Process SpawnPoolWorker-105:
Process SpawnPoolWorker-107:
Process SpawnPoolWorker-106:
Process SpawnPoolWorker-109:
Process SpawnPoolWorker-102:
Process SpawnPoolWorker-101:
Process SpawnPoolWorker-103:
Process SpawnPoolWorker-108:
Process SpawnPoolWorker-88:
Traceback (most recent call last):
  ...
    task = get()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/cs-6362/lib/python3.10/multiprocessing/queues.py", line 365, in get
    res = self._reader.recv_bytes()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/cs-6362/lib/python3.10/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/cs-6362/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/cs-6362/lib/python3.10/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
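
In the cell above, the handles returned by `pool.apply_async` are discarded, so an exception raised inside a task never reaches the notebook, and `pool.join()` waits with no upper bound. A minimal sketch of one alternative is to keep each `AsyncResult` and call `.get(timeout=...)` on it, which re-raises task exceptions in the caller and bounds the wait. The sketch uses `multiprocessing.pool.ThreadPool`, which offers the same `apply_async`/`get` interface as `mp.Pool`, so that it runs anywhere without pickling concerns. `run_one` and `run_all` are hypothetical stand-ins for the experiment call, and this does not by itself resolve the `ModuleNotFoundError` shown in the traceback.

```python
from multiprocessing.pool import ThreadPool
from typing import Any, Dict, List, Tuple


def run_one(part_type: str, part_dim: int) -> str:
    """Hypothetical stand-in for a single experiment run."""
    if part_dim <= 0:
        raise ValueError(f"part_dim must be positive, got {part_dim}")
    return f"{part_type}:{part_dim}"


def run_all(grid: List[Dict[str, Any]], timeout_s: float = 30.0) -> Tuple[list, list]:
    """Submit every config, then collect results and failures explicitly."""
    completed, failed = [], []
    with ThreadPool(2) as pool:
        # Keep the AsyncResult handles instead of discarding them.
        pending = [(params, pool.apply_async(run_one, kwds=params)) for params in grid]
        for params, result in pending:
            try:
                # get() re-raises any exception from the task and raises a
                # TimeoutError if the task has not finished in time.
                completed.append(result.get(timeout=timeout_s))
            except Exception as exc:
                failed.append((params, repr(exc)))
    return completed, failed


grid = [
    {"part_type": "CON", "part_dim": 2},
    {"part_type": "CON", "part_dim": 0},  # deliberately invalid to show error reporting
]
completed, failed = run_all(grid)
print("completed:", completed)
print("failed:", failed)
```

Collecting failures alongside successes keeps one bad parameter combination from hiding the outcome of the rest of the grid.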


---

## Conclusion

TBD.