# Experiment 1: Part Signal PDF Convergence

This notebook gathers the results from Experiment 1.

# Part Signal PDF convergence
## Methodology 
For each part we have data for, run the PDF convergence simulation 100 times. Randomly shuffle the part signals between runs; otherwise every run would yield the same result. For each run, track the part, the part type, the number of signals needed to reach convergence, and the relative variance.
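
The shuffle-and-repeat procedure above can be sketched roughly as follows. This is only an illustration on synthetic data: `signals_until_convergence` is a hypothetical stand-in for the project's convergence function (here, a 95% confidence interval on the running mean that must narrow below a tolerance), and treating "relative variance" as the variance divided by the squared mean is an assumption, not something the notebook defines.

```python
import numpy as np


def signals_until_convergence(signals, tolerance=0.05, min_signals=10):
    """Hypothetical stand-in for the real convergence check.

    Returns how many signals are consumed before the 95% CI half-width of
    the running mean drops to tolerance * |running mean| or below.
    """
    for n in range(min_signals, len(signals) + 1):
        window = signals[:n]
        half_width = 1.96 * window.std(ddof=1) / np.sqrt(n)
        if half_width <= tolerance * abs(window.mean()):
            return n
    return len(signals)  # never converged within the available signals


def repeated_convergence(signals, n_runs=100, seed=0):
    """Shuffle the signals before each run so the runs differ from each other."""
    rng = np.random.default_rng(seed)
    return [
        signals_until_convergence(rng.permutation(signals))
        for _ in range(n_runs)
    ]


# Synthetic signals for one made-up part.
synthetic_signals = np.random.default_rng(42).normal(loc=10.0, scale=2.0, size=500)
counts = repeated_convergence(synthetic_signals)

# Assumed definition of relative variance: variance over squared mean.
relative_variance = np.var(counts) / np.mean(counts) ** 2
print(len(counts), min(counts), max(counts), relative_variance)
```

Without the shuffle, every run would walk through the signals in the same order and return the same count, which is why the permutation sits inside the loop.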
## Deliverables
Associated graphs for each part run showing the convergence of the CI.
Graphs and analysis for the combined average of each part type. What does it tell us? What can we conclude about the part type and why it is behaving that way?
Graphs and analysis comparing the averages of the different types. How different are they? How can we explain this? Does this validate our assumptions? 

## Source Code

The sections below contain all of our source code.

In [1]:
import mlflow
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:

import os 

user_path = '~/matcher'  # CHANGE THIS LINE AS NEEDED FOR YOUR ENVIRONMENT
os.chdir(os.path.expanduser(user_path))


In [3]:
def get_metrics_series(mlruns_path: str, experiment_id: str, run_id: str, metric_name: str) -> list:
    """Get a series of metric values for a given metric name."""
    with open(f'{mlruns_path}/{experiment_id}/{run_id}/metrics/{metric_name}') as f:
        file_lines = f.readlines()
    return [float(line.split()[1]) for line in file_lines]
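
As a rough illustration of what this helper parses: MLflow's local file store is assumed here to write one line per logged value, with whitespace-separated timestamp, value, and step fields, so the second field is the metric value. The file contents below are made up, and `parse_metric_file` is a hypothetical copy of the parsing step that takes a full path.

```python
import tempfile
from pathlib import Path


def parse_metric_file(path):
    """Return the value column (second field) of an MLflow-style metric file."""
    with open(path) as f:
        return [float(line.split()[1]) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Made-up metric file: "<timestamp> <value> <step>" on each line.
    metric_path = Path(tmp_dir) / 'num_samples_for_convergence'
    metric_path.write_text('1670707247000 123.0 0\n1670707248000 117.0 1\n')
    values = parse_metric_file(metric_path)

print(values)  # [123.0, 117.0]
```

A run that logged the metric once yields a one-element list, while a run that logged it on every iteration yields the whole series.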


In [12]:
experiment_id = mlflow.get_experiment_by_name(name='Experiment 1').experiment_id
runs_df = mlflow.search_runs(experiment_ids=[experiment_id], max_results=10_000)
# Load the full logged series of each run's convergence metric from the local mlruns store.
runs_df['num_samples_for_convergence'] = runs_df.apply(
    lambda row: get_metrics_series(
        mlruns_path='mlruns',
        experiment_id=experiment_id,
        run_id=row['run_id'],
        metric_name='num_samples_for_convergence'),
    axis=1)

print(runs_df['num_samples_for_convergence'])

0                                                [123.0]
1                                                [117.0]
2                                                [101.0]
3                                                [101.0]
4                                                [102.0]
                             ...                        
595    [112.0, 153.0, 111.0, 119.0, 102.0, 104.0, 144...
596    [162.0, 112.0, 115.0, 103.0, 114.0, 101.0, 352...
597    [101.0, 231.0, 116.0, 122.0, 138.0, 102.0, 135...
598    [101.0, 133.0, 104.0, 115.0, 106.0, 116.0, 107...
599    [105.0, 124.0, 260.0, 159.0, 194.0, 101.0, 101...
Name: num_samples_for_convergence, Length: 600, dtype: object


In [17]:
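# Keep only the runs that logged all 100 convergence counts.
# len() must be applied per row; len() of the whole column is a single number,
# which is why the earlier comparison produced a bare boolean and a KeyError.
runs_df = runs_df.loc[runs_df['num_samples_for_convergence'].apply(len) == 100]
print(runs_df['num_samples_for_convergence'])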
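
Filtering rows by the length of a list-valued column needs a boolean Series with one entry per row, not a single boolean. A minimal sketch on toy data (the frame, the run ids, and the expected length of 3 are made up for illustration):

```python
import pandas as pd

toy_runs_df = pd.DataFrame({
    'run_id': ['a', 'b', 'c'],
    'num_samples_for_convergence': [
        [123.0],
        [101.0, 117.0, 102.0],
        [105.0, 124.0, 260.0],
    ],
})
expected_length = 3

# One length per row, so the comparison yields a boolean mask aligned with the frame.
series_lengths = toy_runs_df['num_samples_for_convergence'].apply(len)
complete_runs_df = toy_runs_df.loc[series_lengths == expected_length]

print(complete_runs_df['run_id'].tolist())  # ['b', 'c']
```

By contrast, `len(toy_runs_df['num_samples_for_convergence'])` is the number of rows, so comparing it to a constant gives a plain `True` or `False` that `.loc` cannot use as a mask.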

In [14]:
mlflow.end_run()
mlflow.start_run()


meta_pdf_ci_analysis_df = runs_df.loc[
    (runs_df['params.confidence_bound'] == base_confidence_bound) &
    (runs_df['params.part_pdf_ci'] == base_part_pdf_ci)]
part_pdf_ci_analysis_df = runs_df.loc[
    (runs_df['params.confidence_bound'] == base_confidence_bound) &
    (runs_df['params.meta_pdf_ci'] == base_meta_pdf_ci)]
confidence_bound_analysis_df = runs_df.loc[
    (runs_df['params.meta_pdf_ci'] == base_meta_pdf_ci) &
    (runs_df['params.part_pdf_ci'] == base_part_pdf_ci)]

meta_pdf_ci_part_groups = meta_pdf_ci_analysis_df.groupby('params.part_type')
part_pdf_ci_part_groups = part_pdf_ci_analysis_df.groupby('params.part_type')
confidence_bound_part_groups = confidence_bound_analysis_df.groupby('params.part_type')


def run_experiment(df_groups, param_col: str):
    """Plot the variance of each run's collision-rate series against param_col, one line per part type."""
    graph_path = f'psig_matcher/experiments/graphs/variance_of_collision_rates_vs_{param_col}.png'

    for part_type, part_group in df_groups:
        # Sort a copy rather than modifying the grouped frame in place.
        part_group = part_group.sort_values(by=param_col)
        # MLflow returns params as strings, so convert before plotting and correlating.
        param_values = part_group[param_col].astype(float)
        # Named `variances` to avoid shadowing the built-in vars().
        variances = [
            np.var(np.array(collision_rates))
            for collision_rates in part_group['monte_carlo_upper_collision_rate_series'].to_numpy()]
        correlation = np.corrcoef(param_values, variances)[0, 1]
        plt.plot(param_values, variances, label=f'{part_type} - Correlation: {correlation:.2f}')

    plt.legend()
    plt.title(f'Variance of Collision Rates vs {param_col}')
    plt.xlabel(f'{param_col}')
    plt.ylabel('Variance of Collision Rates')
    plt.savefig(graph_path)
    mlflow.log_artifact(graph_path)
    plt.clf()
        
run_experiment(meta_pdf_ci_part_groups, 'params.meta_pdf_ci')
run_experiment(part_pdf_ci_part_groups, 'params.part_pdf_ci')
run_experiment(confidence_bound_part_groups, 'params.confidence_bound')

KeyError: 'params.confidence_bound'

In [15]:
mlflow.set_experiment("Experiment 1 Analysis")
for analysis_type in analysis_groups:
   
    group = analysis_groups[analysis_type]
    x_vals = []
    y_vals = []
    
    for index, df in group:

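        # Every row in a group should share a single value for the analysed parameter.
        col_vals = set(df[analysis_type].to_list())
        if len(col_vals) != 1:
            raise ValueError(f"More than one {analysis_type} value in group")

        x_vals.append(col_vals.pop())
        # TODO: reduce collision_rates to one y value per group and append it to y_vals.
        collision_rates = df['monte_carlo_upper_collision_rate_series'].to_list()

    print(x_vals)
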
    # plt.plot(x_vals, y_vals, label=f'{analysis_type}s vs upper_collision_rate')
    # plt.xlabel(analysis_type)
    # plt.ylabel(f"Averaged upper_collision_rate across all tested parts")
    # plt.savefig(f"psig_matcher/experiments/graphs/{analysis_type}_vs_upper_collision_rate.png")
    # mlflow.log_artifact(f"psig_matcher/experiments/graphs/{analysis_type}_vs_upper_collision_rate.png")
    


2022/12/10 16:20:47 INFO mlflow.tracking.fluent: Experiment with name 'Experiment 1 Analysis' does not exist. Creating a new experiment.


NameError: name 'analysis_groups' is not defined

---

## Conclusion

TBD.