# Full Model Notebook

## OUTSTANDING WORK
* Sync train-test split
* Incorporate ALMA text preprocessing and compare performance to our preprocessing
* Ensure measurements with width greater than 5 GHz are dropped (I think there are only 2)
* Band EDA
* Remove outlier projects (> 26.5 measurements) from Band prediction
* Remove projects that have incorrectly formatted band data
    * E.g. 2011.0.00008.E has an observation line with band = '3 6'
* Test different text preprocessing options and compare results
* Consider removing bands 1 AND 2 from band prediction
    * Only 21 measurements in band 1, no measurements in band 2

## Workflow Outline:
We run two parallel pipelines. Once each model has finished training and prediction, their outputs are combined to recommend median frequencies to explore.

All projects for this phase of the overall pipeline are 'line' projects.

### Frequency Mining Pipeline
* OPTIONAL: remove projects with > 26.5 measurements **CURRENTLY REMOVING**
    * Tested both options; keeping these projects did not raise hit rates enough to offset the roughly 1k clusters they add

* Run projects through LDA to generate topic model with $N=50$ topics
    * Currently using count vectorization of combined title and abstract with lemmatized_no_sw_text
* Group projects to max topic by taking argmax of document-topic table
* Run HDBSCAN on each of the topics to create measurement clusters, referred to as "areas of interest"
    * Currently areas of interest are taken from min and max median frequency for each cluster generated
    * NOTE: each of the 50 HDBSCAN models can (and probably should) be tuned individually
        * We should make sure generated clusters are not too large unless it makes sense
            * E.g. a large cluster from 700-750 GHz might make sense since measurements in this range are generally sparse
            * These large clusters arise because HDBSCAN adjusts the "neighborhood size" $\epsilon$ dynamically (using hierarchical clustering under the hood) to account for areas of varying density, whereas DBSCAN uses one fixed $\epsilon$ for all measurements within a topic.
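
The "areas of interest" step above can be sketched in a few lines. This is a minimal illustration, assuming the cluster labels have already been produced by HDBSCAN (noise labelled -1); the function name `areas_of_interest` and the sample frequencies are hypothetical and not part of the notebook.

```python
from collections import defaultdict

def areas_of_interest(med_freqs, labels, noise_label=-1):
    """Return {cluster_label: (min_freq, max_freq)} for every non-noise cluster."""
    grouped = defaultdict(list)
    for freq, label in zip(med_freqs, labels):
        if label != noise_label:  # HDBSCAN marks noise with -1
            grouped[label].append(freq)
    # An area of interest spans the smallest to the largest median frequency in a cluster
    return {label: (min(freqs), max(freqs)) for label, freqs in grouped.items()}

# Hypothetical median frequencies (GHz) and cluster labels for one topic
med_freqs = [230.5, 231.0, 229.8, 345.1, 346.0, 344.7, 690.0]
labels = [0, 0, 0, 1, 1, 1, -1]

areas = areas_of_interest(med_freqs, labels)
print(areas)
```

The width of each area (`max_freq - min_freq`) is what needs watching when tuning a topic's HDBSCAN model.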

### Band Prediction Pipeline
* OPTIONAL: remove projects with > 26.5 measurements **NOT CURRENTLY REMOVING NEED TO CHANGE**
* Predict band for project with Naive Bayes
    * Currently using TF-IDF vectorization of combined title and abstract with **NEED TO CHOOSE TEXT**
* Choose band(s) using hard classification into one or two bands
    * We remove band 2 entirely because there are so few measurements
    * We do this to be able to give a final hit rate of approx. 75%
        * This shows we have a good prediction model to match projects to band
* Ultimately we will use probability vector output (not hard classification) to order mined recommendations by full band prediction
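
One possible shape for the final ordering step is sketched below. It assumes each mined cluster carries a single band and that the classifier exposes one probability per band; the function `rank_by_band_probability`, the dictionary layout, and all numbers are hypothetical.

```python
def rank_by_band_probability(clusters, band_probs):
    """Sort clusters so that those in the most probable bands come first."""
    # Bands missing from band_probs are treated as probability 0.0
    return sorted(clusters, key=lambda c: band_probs.get(c["band"], 0.0), reverse=True)

# Hypothetical mined clusters and per-band probabilities for one project
clusters = [
    {"band": 3, "min_freq": 87.9, "max_freq": 99.1},
    {"band": 7, "min_freq": 329.3, "max_freq": 351.9},
    {"band": 6, "min_freq": 258.1, "max_freq": 269.1},
]
band_probs = {3: 0.10, 6: 0.55, 7: 0.35}

ranked = rank_by_band_probability(clusters, band_probs)
print([c["band"] for c in ranked])
```

Unlike hard classification, this keeps every recommendation and only changes the order in which they are shown.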

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly_express as px
import plotly.figure_factory as ff
from ast import literal_eval
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import DBSCAN, HDBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

SEED = 42

## Read data
Training and testing projects and measurements are created in the /data/Data_Ingestion.ipynb notebook

In [2]:
train_projects = pd.read_csv("../data/train_projects.csv")
train_projects = train_projects.set_index('project_code')
train_projects.shape

(2383, 12)

In [3]:
test_projects = pd.read_csv("../data/test_projects.csv")
test_projects = test_projects.set_index('project_code')
test_projects.shape

(795, 12)

In [4]:
train_measurements = pd.read_csv('../../train_measurements.csv')
train_measurements = train_measurements.set_index('project_code')
train_measurements.shape

(17638, 16)

In [5]:
test_measurements = pd.read_csv('../../test_measurements.csv')
test_measurements = test_measurements.set_index('project_code')
test_measurements.shape

(5844, 16)

## Read in band predictions
This data frame lists, for each test project, the bands ordered from least likely to most likely, as produced by the Band Classification part of the project.

In [6]:
band_predictions = pd.read_csv('../data/diverse_band_prediction.csv')
band_predictions = band_predictions.set_index('project_code')
band_predictions.head()

| project_code | band_predictions |
|---|---|
| 2016.1.00485.S | [1, 10, 9, 5, 6, 7, 4, 3, 8] |
| 2017.1.00824.S | [1, 10, 5, 9, 8, 3, 6, 4, 7] |
| 2015.1.01088.S | [1, 10, 9, 6, 5, 8, 4, 7, 3] |
| 2013.1.00781.S | [1, 10, 9, 5, 8, 7, 4, 3, 6] |
| 2016.1.00800.S | [1, 10, 5, 9, 4, 8, 3, 6, 7] |
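
Because the predictions are stored as strings and ordered from least to most likely, the most likely bands sit at the end of each list. The sketch below shows one way to pull out the top few; the helper name `top_bands` is hypothetical, and it relies on the fact that in Python the slice `[-0:]` is the same as `[0:]`.

```python
from ast import literal_eval

def top_bands(prediction_str, limit=0):
    """Parse a stored prediction list and return the `limit` most likely bands.

    The list is ordered least likely -> most likely, so the top bands are at
    the end. With limit=0 the slice [-0:] equals [0:], i.e. all bands are kept.
    """
    bands = literal_eval(prediction_str)
    return bands[-limit:]

row = "[1, 10, 9, 5, 6, 7, 4, 3, 8]"  # prediction string for 2016.1.00485.S
print(top_bands(row, limit=2))  # the two most likely bands
print(top_bands(row))           # limit=0 keeps every band
```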


## Band Cutoffs from ALMA

In [7]:
band_cutoffs = [35, 51, 84, 125, 158, 211, 275, 385, 602, 787]
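
A quick way to use these cutoffs is to map a frequency to a band number. The sketch below assumes each value is the lower edge (in GHz) of bands 1 through 10, which the notebook does not state explicitly; the helper `freq_to_band` is hypothetical, and gaps between bands are ignored.

```python
from bisect import bisect_right

band_cutoffs = [35, 51, 84, 125, 158, 211, 275, 385, 602, 787]

def freq_to_band(freq_ghz, cutoffs=band_cutoffs):
    """Return the 1-based band whose lower edge is the last cutoff <= freq_ghz.

    Assumes `cutoffs` holds ascending lower band edges.
    Returns 0 when the frequency lies below the first cutoff.
    """
    return bisect_right(cutoffs, freq_ghz)

print(freq_to_band(88.66))
print(freq_to_band(345.8))
```

Under this assumption, 88.66 GHz maps to band 3, which agrees with the `band` column recorded for that measurement in the training data.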

## LDA Model

### Create training and testing text groups to fit LDA

In [8]:
train_texts = train_projects.lemmatized_no_sw_text
test_texts = test_projects.lemmatized_no_sw_text

### LDA class

In [9]:
class LDA_Model:
    def __init__(self, N_topics=50):
        self.N_topics = N_topics
        self.countVectorizer = CountVectorizer()
        self.lda = LatentDirichletAllocation(n_components=self.N_topics, random_state=SEED)
    
    def fit(self, corpus):
        termFrequency = self.countVectorizer.fit_transform(corpus)
        self.lda.fit(termFrequency)
        return self.lda.transform(termFrequency)

    # Additional method to transform new data
    def transform(self, corpus):
        termFrequency = self.countVectorizer.transform(corpus)
        return self.lda.transform(termFrequency)

#### Initialize Model

In [10]:
lda_model = LDA_Model(N_topics=50)

#### Fit model on training set

In [11]:
train_topics = lda_model.fit(train_texts)

In [12]:
words = lda_model.countVectorizer.get_feature_names_out()

### Inspect top words for topics to see if they are salient
We can also use these later in user-facing tools for transparency

In [13]:
N = 10 #number of top words to show
topic_components = lda_model.lda.components_

for topic_idx, topic in enumerate(topic_components):
    print(f"Topic {topic_idx}:")
    # Get the indices of the top N words for this topic
    top_word_indices = topic.argsort()[-N:][::-1]
    # Print these words with their weights
    for word_idx in top_word_indices:
        print(f"{words[word_idx]} (weight: {topic[word_idx]:.2f})")
    print("\n")

Topic 0:
line (weight: 57.87)
bd (weight: 48.98)
compact (weight: 36.90)
velocity (weight: 26.63)
nucleus (weight: 24.94)
con (weight: 21.02)
obscure (weight: 17.56)
bds (weight: 17.02)
variation (weight: 16.27)
nuclei (weight: 15.03)


Topic 1:
dust (weight: 188.60)
gas (weight: 122.90)
ci (weight: 59.72)
grain (weight: 40.69)
evolution (weight: 39.17)
observation (weight: 36.21)
star (weight: 30.49)
carbon (weight: 30.10)
destruction (weight: 28.51)
study (weight: 26.06)


Topic 2:
agb (weight: 141.51)
star (weight: 127.80)
mass (weight: 107.10)
loss (weight: 81.86)
bipolar (weight: 70.32)
outflow (weight: 61.83)
jet (weight: 59.40)
nebula (weight: 53.18)
planetary (weight: 40.83)
wind (weight: 40.55)


Topic 3:
system (weight: 74.58)
debris (weight: 63.54)
belt (weight: 47.11)
planet (weight: 46.05)
disk (weight: 36.87)
collision (weight: 34.83)
structure (weight: 34.27)
disc (weight: 30.83)
observation (weight: 30.48)
scale (weight: 28.42)


Topic 4:
galaxy (weight: 212.35)
scale (

### Inspect training document-topic data frames

In [14]:
train_doc_topic = pd.DataFrame(train_topics)
train_doc_topic = train_doc_topic.set_index(train_texts.index.values)
train_doc_topic.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
2016.1.01288.S,0.000392,0.000392,0.000392,0.000392,0.000392,0.065492,0.000392,0.000392,0.000392,0.000392,...,0.000392,0.000392,0.000392,0.000392,0.000392,0.000392,0.464354,0.000392,0.000392,0.000392
2018.1.01077.S,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,...,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.376111,0.000194,0.000194
2018.1.00437.S,0.000192,0.000192,0.000192,0.127172,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,...,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192
2021.1.00637.S,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,...,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222
2012.1.00786.S,0.000171,0.000171,0.192147,0.000171,0.000171,0.000171,0.000171,0.740628,0.000171,0.000171,...,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171


### Match test data into topics

In [15]:
test_topics = lda_model.transform(test_texts)

### Inspect testing document-topic data frames

In [16]:
test_doc_topic = pd.DataFrame(test_topics.tolist())
test_doc_topic = test_doc_topic.set_index(test_texts.index.values)
test_doc_topic.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
2016.1.00485.S,0.000215,0.000215,0.000215,0.029035,0.000215,0.000215,0.000215,0.000215,0.000215,0.000215,...,0.000215,0.000215,0.000215,0.000215,0.000215,0.000215,0.000215,0.031655,0.000215,0.113737
2017.1.00824.S,0.000169,0.098192,0.080003,0.000169,0.000169,0.000169,0.000169,0.000169,0.000169,0.000169,...,0.000169,0.130644,0.000169,0.14033,0.000169,0.000169,0.000169,0.022981,0.000169,0.000169
2015.1.01088.S,0.188665,0.000211,0.000211,0.000211,0.017753,0.134495,0.000211,0.06318,0.063875,0.000211,...,0.000211,0.000211,0.000211,0.109437,0.000211,0.020642,0.146402,0.000211,0.000211,0.070094
2013.1.00781.S,0.000294,0.000294,0.000294,0.000294,0.000294,0.000294,0.135778,0.000294,0.000294,0.122034,...,0.000294,0.000294,0.000294,0.000294,0.000294,0.000294,0.000294,0.087903,0.000294,0.461476
2016.1.00800.S,0.317005,0.000182,0.000182,0.000182,0.000182,0.000182,0.000182,0.000182,0.000182,0.000182,...,0.000182,0.000182,0.000182,0.076003,0.000182,0.000182,0.000182,0.000182,0.052253,0.000182


### Group documents to highest matching topic

Combine project topic vector frames for convenience.
* Note you can subset this dataframe to train and test texts using `proj_topics.loc[train_texts.index]`

In [17]:
train_texts = pd.DataFrame(train_texts)
test_texts = pd.DataFrame(test_texts)
proj_topics = pd.concat([train_doc_topic, test_doc_topic])
proj_topics.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
2016.1.01288.S,0.000392,0.000392,0.000392,0.000392,0.000392,0.065492,0.000392,0.000392,0.000392,0.000392,...,0.000392,0.000392,0.000392,0.000392,0.000392,0.000392,0.464354,0.000392,0.000392,0.000392
2018.1.01077.S,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,...,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.000194,0.376111,0.000194,0.000194
2018.1.00437.S,0.000192,0.000192,0.000192,0.127172,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,...,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192,0.000192
2021.1.00637.S,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,...,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222
2012.1.00786.S,0.000171,0.000171,0.192147,0.000171,0.000171,0.000171,0.000171,0.740628,0.000171,0.000171,...,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171,0.000171


### Take highest matching topic for each project

In [18]:
proj_topics['max_topic'] = proj_topics.apply(lambda x: x.argmax(), axis=1)

### Create data frame with project id and max topic

In [19]:
proj_max_topic = proj_topics['max_topic'].to_frame()
proj_max_topic.head()

|  | max_topic |
|---|---|
| 2016.1.01288.S | 46 |
| 2018.1.01077.S | 37 |
| 2018.1.00437.S | 17 |
| 2021.1.00637.S | 13 |
| 2012.1.00786.S | 7 |
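
Assigning each project to its highest-weight topic is a row-wise argmax. A dependency-free sketch of the same idea follows; the helper `max_topic`, the project codes, and the 5-topic vectors are hypothetical.

```python
def max_topic(topic_weights):
    """Return the index of the largest weight (the first index wins on ties)."""
    return max(range(len(topic_weights)), key=topic_weights.__getitem__)

# Hypothetical 5-topic weight vectors keyed by project code
doc_topic = {
    "proj_a": [0.05, 0.10, 0.70, 0.10, 0.05],
    "proj_b": [0.60, 0.10, 0.10, 0.10, 0.10],
}

assignments = {code: max_topic(weights) for code, weights in doc_topic.items()}
print(assignments)
```

A hard assignment like this discards the rest of the topic vector, which is worth keeping in mind for projects whose weight is spread over several topics.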


In [91]:
proj_max_topic.max_topic.value_counts().to_frame().sort_index()

| max_topic | count |
|---|---|
| 0 | 23 |
| 1 | 29 |
| 2 | 42 |
| 3 | 24 |
| 4 | 47 |
| 5 | 37 |
| 6 | 50 |
| 7 | 25 |
| 8 | 28 |
| 9 | 74 |


### Inspect some topic stats

In [20]:
proj_max_topic.value_counts().describe()

count     50.0000
mean      63.5600
std       64.2737
min        7.0000
25%       24.2500
50%       33.5000
75%       92.7500
max      293.0000
Name: count, dtype: float64

A few topics match a large number of documents (the largest has 293 projects, against a median of 33.5). Perhaps we need a better topic model, or to group documents by the similarity of their project-topic vectors instead.

### Eyeball comparison of documents by max topic
This requires looking at the online explorer since printing out abstracts in here gets messy.

In [21]:
proj_max_topic[proj_max_topic.max_topic == 3].head()

|  | max_topic |
|---|---|
| 2022.1.00793.S | 3 |
| 2015.1.01260.S | 3 |
| 2017.1.00167.S | 3 |
| 2019.1.01443.T | 3 |
| 2015.1.00032.S | 3 |


### Add `max_topic` to `measurements` frame to be able to group measurements by max topic

In [22]:
train_measurements = pd.merge(train_measurements, proj_max_topic, left_index=True, right_index=True)

### Generate test projects measurements
This will be useful for calculating hit rates to evaluate model performance.

**NOTE!!!**
You should not sort these, however tempting. The lists in each row are positionally aligned: entry i of every list belongs to the same measurement, so sorting any one list breaks that correspondence and loses measurement information.
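
To see why sorting is dangerous, consider the small sketch below. The two lists are hypothetical values shaped like one aggregated row; sorting one of them on its own pairs each low edge with the wrong high edge.

```python
# Hypothetical aggregated row: entry i of each list describes the same measurement
low_freq = [319.07, 90.38, 217.59]
high_freq = [320.94, 90.62, 218.53]

# Sorting one list on its own misaligns the pairs
broken = list(zip(sorted(low_freq), high_freq))

# If an ordering is ever needed, sort whole measurements as tuples instead
aligned = sorted(zip(low_freq, high_freq))

print(broken[0])
print(aligned[0])
```

In the broken pairing, the 90.38 GHz low edge ends up next to a 320.94 GHz high edge, a measurement that never existed.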

In [23]:
test_proj_meas = test_measurements.loc[test_texts.index]
test_proj_meas = test_proj_meas.groupby(test_proj_meas.index)\
    .agg({
        'low_freq': lambda x: round(x, 4).tolist(),
        'high_freq': lambda x: round(x, 4).tolist(),
        'med_freq': lambda x: round(x, 4).tolist(),
        'diff_freq': lambda x: round(x, 4).tolist()
    })
test_proj_meas.head()

| project_code | low_freq | high_freq | med_freq | diff_freq |
|---|---|---|---|---|
| 2011.0.00010.S | [90.38, 90.7, 91.69, 92.89, 217.59, 218.67, 21... | [90.62, 90.93, 91.92, 93.12, 218.53, 219.6, 21... | [90.5, 90.815, 91.805, 93.005, 218.06, 219.135... | [0.24, 0.23, 0.23, 0.23, 0.94, 0.93, 0.94, 0.9... |
| 2011.0.00064.S | [288.96, 290.79, 300.84, 302.71, 288.94, 290.7... | [290.84, 292.67, 302.71, 304.59, 290.82, 292.6... | [289.9, 291.73, 301.775, 303.65, 289.88, 291.7... | [1.88, 1.88, 1.87, 1.88, 1.88, 1.87, 1.88, 1.87] |
| 2011.0.00121.S | [319.07, 320.48, 319.83, 319.36, 319.71, 316.59] | [320.94, 322.35, 321.71, 321.24, 321.58, 318.47] | [320.005, 321.415, 320.77, 320.3, 320.645, 317... | [1.87, 1.87, 1.88, 1.88, 1.87, 1.88] |
| 2011.0.00136.S | [335.29, 335.98, 345.67, 346.47] | [335.52, 336.22, 345.91, 346.7] | [335.405, 336.1, 345.79, 346.585] | [0.23, 0.24, 0.24, 0.23] |
| 2011.0.00199.S | [639.15, 645.41, 657.7, 661.7, 320.98, 322.12,... | [640.11, 646.37, 658.66, 662.66, 321.46, 322.6... | [639.63, 645.89, 658.18, 662.18, 321.22, 322.3... | [0.96, 0.96, 0.96, 0.96, 0.48, 0.48, 0.49, 0.48] |


### Generate train topic measurements
We will use these to engineer 'areas of interest' among topics using HDBSCAN

**NOTE!!!**
You should not sort these, however tempting. The lists in each row are positionally aligned: entry i of every list belongs to the same measurement, so sorting any one list breaks that correspondence and loses measurement information.

In [24]:
train_measurements.head()

Unnamed: 0_level_0,project_title,project_abstract,fs_type,low_freq,high_freq,science_category,science_keyword,band,target,diff_freq,med_freq,raw_text,standardized_text,no_sw_text,lemmatized_sw_text,lemmatized_no_sw_text,max_topic
project_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2011.0.00017.S,Expanding the frontiers of chemical complexity...,The search for complex pre-biotic and biotic m...,line,87.72,89.6,ISM and star formation,"Inter-Stellar Medium (ISM)/Molecular clouds, A...",3.0,1,1.88,88.66,Expanding the frontiers of chemical complexity...,expanding the frontiers of chemical complexity...,expanding frontiers chemical complexity comple...,expand the frontier of chemical complexity wit...,expand frontier chemical complexity complex pr...,39
2011.0.00017.S,Expanding the frontiers of chemical complexity...,The search for complex pre-biotic and biotic m...,line,89.54,91.42,ISM and star formation,"Inter-Stellar Medium (ISM)/Molecular clouds, A...",3.0,1,1.88,90.48,Expanding the frontiers of chemical complexity...,expanding the frontiers of chemical complexity...,expanding frontiers chemical complexity comple...,expand the frontier of chemical complexity wit...,expand frontier chemical complexity complex pr...,39
2011.0.00017.S,Expanding the frontiers of chemical complexity...,The search for complex pre-biotic and biotic m...,line,99.72,101.59,ISM and star formation,"Inter-Stellar Medium (ISM)/Molecular clouds, A...",3.0,1,1.87,100.655,Expanding the frontiers of chemical complexity...,expanding the frontiers of chemical complexity...,expanding frontiers chemical complexity comple...,expand the frontier of chemical complexity wit...,expand frontier chemical complexity complex pr...,39
2011.0.00017.S,Expanding the frontiers of chemical complexity...,The search for complex pre-biotic and biotic m...,line,101.54,103.42,ISM and star formation,"Inter-Stellar Medium (ISM)/Molecular clouds, A...",3.0,1,1.88,102.48,Expanding the frontiers of chemical complexity...,expanding the frontiers of chemical complexity...,expanding frontiers chemical complexity comple...,expand the frontier of chemical complexity wit...,expand frontier chemical complexity complex pr...,39
2011.0.00017.S,Expanding the frontiers of chemical complexity...,The search for complex pre-biotic and biotic m...,line,91.37,93.24,ISM and star formation,"Inter-Stellar Medium (ISM)/Molecular clouds, A...",3.0,1,1.87,92.305,Expanding the frontiers of chemical complexity...,expanding the frontiers of chemical complexity...,expanding frontiers chemical complexity comple...,expand the frontier of chemical complexity wit...,expand frontier chemical complexity complex pr...,39


In [25]:
train_topic_freqs = train_measurements.loc[train_texts.index]\
    .reset_index()\
    .groupby('max_topic')\
    .agg({
        'project_code': lambda x: x.tolist(), 
        'low_freq': lambda x: round(x, 4).tolist(),
        'high_freq': lambda x: round(x, 4).tolist(),
        'med_freq': lambda x: round(x, 4).tolist(),
        'diff_freq': lambda x: round(x, 4).tolist(),
        'band': lambda x: x.astype('int64').tolist()
    })
train_topic_freqs.head()

| max_topic | project_code | low_freq | high_freq | med_freq | diff_freq | band |
|---|---|---|---|---|---|---|
| 0 | [2017.1.00598.S, 2017.1.00598.S, 2017.1.00598.... | [257.18, 259.1, 260.42, 262.37, 260.84, 262.79... | [259.04, 260.97, 262.29, 264.24, 262.71, 264.6... | [258.11, 260.035, 261.355, 263.305, 261.775, 2... | [1.86, 1.87, 1.87, 1.87, 1.87, 1.87, 1.87, 1.8... | [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, ... |
| 1 | [2018.1.00341.S, 2018.1.00341.S, 2018.1.00341.... | [342.03, 343.99, 354.03, 355.91, 478.54, 480.5... | [343.9, 345.86, 355.9, 357.79, 480.54, 482.54,... | [342.965, 344.925, 354.965, 356.85, 479.54, 48... | [1.87, 1.87, 1.87, 1.88, 2.0, 2.0, 2.0, 0.25, ... | [7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 6, 6, 6, 6, 6, ... |
| 2 | [2017.1.00595.S, 2017.1.00595.S, 2017.1.00595.... | [330.25, 331.25, 342.52, 345.1, 215.39, 217.28... | [331.25, 333.25, 344.52, 346.1, 217.39, 219.28... | [330.75, 332.25, 343.52, 345.6, 216.39, 218.28... | [1.0, 2.0, 2.0, 1.0, 2.0, 2.0, 1.0, 2.0, 2.0, ... | [7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... |
| 3 | [2022.1.00793.S, 2022.1.00793.S, 2022.1.00793.... | [327.94, 329.94, 339.99, 342.24, 344.24, 354.2... | [329.82, 331.82, 341.86, 344.12, 346.12, 356.1... | [328.88, 330.88, 340.925, 343.18, 345.18, 355.... | [1.88, 1.88, 1.87, 1.88, 1.88, 1.87, 1.87, 1.8... | [7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 7, 7, 6, 6, 6, ... |
| 4 | [2017.1.00707.S, 2017.1.00707.S, 2017.1.00707.... | [216.05, 217.07, 217.21, 218.19, 218.45, 219.5... | [216.17, 217.13, 217.27, 218.25, 218.5, 219.59... | [216.11, 217.1, 217.24, 218.22, 218.475, 219.5... | [0.12, 0.06, 0.06, 0.06, 0.05, 0.06, 0.06, 0.1... | [6, 6, 6, 6, 6, 6, 6, 6, 6, 3, 3, 3, 3, 3, 3, ... |


In [26]:
len(train_topic_freqs.loc[10].project_code)

1234

## Cluster cleaning function

How do we handle > 2 band overlaps?

* Currently we just roll with it.
    
* Should call out which topic, and cluster

* Alternatively, throw error <- this is hard because it completely stops any training
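
As a middle ground between ignoring overlaps and raising an error, the overlapping clusters could be collected and reported by topic and cluster. The sketch below is one hypothetical way to do that; the function `find_band_overlaps` and the `(topic, cluster_label, band)` row layout are assumptions, not part of the notebook.

```python
from collections import defaultdict

def find_band_overlaps(rows):
    """Return [(topic, cluster, bands)] for clusters whose measurements span several bands.

    `rows` is an iterable of (topic, cluster_label, band) tuples; noise (-1) is skipped.
    """
    bands_by_cluster = defaultdict(set)
    for topic, cluster, band in rows:
        if cluster != -1:
            bands_by_cluster[(topic, cluster)].add(band)
    # Report only the clusters that contain more than one distinct band
    return [
        (topic, cluster, sorted(bands_by_cluster[(topic, cluster)]))
        for topic, cluster in sorted(bands_by_cluster)
        if len(bands_by_cluster[(topic, cluster)]) > 1
    ]

# Hypothetical measurements: cluster 2 of topic 0 spans bands 6 and 7
rows = [(0, 1, 6), (0, 1, 6), (0, 2, 6), (0, 2, 7), (3, 0, 3), (3, -1, 9)]
overlaps = find_band_overlaps(rows)
print(overlaps)
```

Training can continue while the returned list is logged or reviewed afterwards.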

In [27]:
# Split clusters that span two or more bands into one cluster per band
def cluster_cleaning(topic_meas_df, min_cluster_size):
    dummy_label = 100000  # Used to make new labels and ensure we're adding new clusters. Cluster labels will be reset eventually

    for clst in np.unique(topic_meas_df.cluster_label):
        # HDBSCAN labels noise as -1 so we skip this row for cleaning
        if clst != -1:
            # Subset topic measurement dataframe to current cluster
            clst_subset = topic_meas_df[topic_meas_df.cluster_label == clst].sort_values('med_freq')
            
            # Extract band information for cluster subset
            band = np.unique(clst_subset.band)

            # If there are multiple bands in measurements for cluster we want to break them up
            if band.size > 1:
                # Loop over bands in cluster and make a dataframe for each band
                bnd_dict = [clst_subset[clst_subset.band == bnd] for bnd in band]

                # Loop over cluster-band frames
                for df in bnd_dict:
                    # Keep this band's part of the cluster only if it has at least min_cluster_size measurements.
                    # Smaller parts are assigned back to noise, since we don't want clusters that are too small.
                    # Larger parts have enough measurements to stand alone, so they become a new cluster.
                    if df.shape[0] < min_cluster_size:
                        topic_meas_df.loc[df.index, 'cluster_label'] = -1
                    else:
                        topic_meas_df.loc[df.index, 'cluster_label'] = dummy_label
                        dummy_label += 1
                        
    # Re-label non-noise clusters to a continuous range 0..N, keeping -1 reserved for noise.
    # (Enumerating every label from -1 would turn the first real cluster into noise whenever no noise is present.)
    cluster_labels = [lbl for lbl in np.unique(topic_meas_df.cluster_label) if lbl != -1]

    # Sorted old labels are always >= their new label, so an update never collides with a label still to be processed
    for new_label, old_label in enumerate(cluster_labels):
        topic_meas_df.loc[topic_meas_df.cluster_label == old_label, 'cluster_label'] = new_label

## HDBSCAN Train/Test Code
### Loop over topics and find accuracy measurements

In [68]:
band_prediction_limit = 0               # Number of top band predictions to include. 0 to include all
test_project_hits = 0                   # Hits for all projects if at least one measurement is matched
test_project_meas_hit_rate = []         # List of hit rates by project
topic_cluster_widths = []               # List of cluster widths by topic to ensure generated clusters are not too wide (list of lists)
total_num_clusters = 0                  # Running total of clusters across all topics
topic_cluster_stat_list = []            # List of dataframes with clusters by topic. Used to make a main topic-cluster data frame later
topic_measurement_stat_list = []        # List of dataframes with measurements by topic. Used to make a main topic-measurement data frame later

# Loop over topics
for tpc in set(proj_max_topic.max_topic.values):
    # Fit HDBSCAN for each topic
    # Note that these can be parameterized for each of the topics generated
    # Note one can give HDBSCAN a max_cluster_size parameter to ensure clusters do not grow too large
    db = HDBSCAN(min_cluster_size=5)\
        .fit(list(zip(train_topic_freqs.loc[tpc].med_freq)))
    
    # Get labels from HDBSCAN
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_points = len(list(labels))
    n_noise = list(labels).count(-1)

    # Stat callouts
    print(f'HDBSCAN Results for topic {tpc}')
    #print(f'Estimated number of clusters: {n_clusters}')
    print(f'Number of projects in topic: {proj_max_topic.loc[train_texts.index].query(f"max_topic == {tpc}").shape[0]}')
    print(f'Total number of measurements: {n_points}')
    print(f'Estimated number of noise measurements: {n_noise}')
    print(f'Noise proportion: {round(list(labels).count(-1)/labels.shape[0], 3)}')
    print(f'Signal proportion: {round(1-list(labels).count(-1)/labels.shape[0], 3)}')

    topic_measurement = pd.DataFrame.from_dict({'med_freq':train_topic_freqs.loc[tpc].med_freq,
                                            'band':train_topic_freqs.loc[tpc].band,
                                            'project_code':train_topic_freqs.loc[tpc].project_code,
                                            'cluster_label':labels})
    
    # Add the topic as an outer index level of topic_measurement
    topic_measurement = pd.concat({tpc: topic_measurement}, names=['topic'])
    topic_measurement.index.names = ['topic', 'measurement']

    # Append topic_measurement to topic_measurement_stat_list for analysis later
    # (cluster_cleaning below edits the same object in place, so the stored frame holds the cleaned labels)
    topic_measurement_stat_list.append(topic_measurement)
    
    # Clean clusters in topic_measurement, breaking up clusters that span more than one band
    cluster_cleaning(topic_measurement, 5)

    # Generate topic_cluster data frame for this topic
    # Clusters are defined by the minimum and maximum median_freq for all labeled measurements
    topic_cluster = topic_measurement.groupby('cluster_label').agg(
        mean_freq=('med_freq', 'mean'),
        min_freq=('med_freq', 'min'),
        max_freq=('med_freq', 'max'),
        count_freq=('med_freq', 'count'),
        count_proj=('project_code', 'nunique'),
        band_min=('band', 'min'),
        band_max=('band', 'max'),
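        # NOTE: despite the name, this is the mean band; it equals the band itself for cleaned, single-band clusters
        band_mode=('band', 'mean')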
    )

    # Sort index
    topic_cluster = topic_cluster.sort_index()

    # Add width for cleaned clusters
    topic_cluster['width'] = topic_cluster.max_freq - topic_cluster.min_freq

    # Add topic index to topic_cluster
    topic_cluster = pd.concat({tpc: topic_cluster}, names=['topic'])
    topic_cluster.index.names = ['topic', 'cluster']

    # Append topic_cluster to topic_cluster_stats for analysis later
    topic_cluster_stat_list.append(topic_cluster)


    # Cluster stats
    # Loop over generated clusters, collect their widths, and flag any that still span more than one band
    cluster_widths = []

    # If there are any measurements in that topic, print stats
    # This if statement is to avoid errors if topics only have noise and no clusters
    if topic_cluster.shape[0] != 0:
        for clst in topic_cluster.index:
            min_freq, max_freq = topic_cluster.loc[clst].min_freq, topic_cluster.loc[clst].max_freq
            cluster_widths.append(max_freq - min_freq)
            total_num_clusters += 1
            if topic_cluster.loc[clst].band_min != topic_cluster.loc[clst].band_max:
                print("BAND OVERLAP")
        print('')
        print(f'Topic {tpc} cluster width stats:')
        print(np.round(pd.Series(cluster_widths).describe(), 4))
    else: print('No clusters for this topic')

    # Print cluster data frame with relevant columns for tuning
    print('')
    print(f'Cluster data frame for topic {tpc}')
    with pd.option_context('display.max_rows', None):
        print(topic_cluster[['min_freq', 'max_freq', 'count_freq', 'band_mode', 'width']]\
          .sort_values(['width', 'min_freq'], ascending=False))

    # Get a list of test project codes
    tps = proj_max_topic.loc[test_texts.index].query(f'max_topic == {tpc}')

    #Begin test projects
    print('')
    # print('Begin tests')

    # Loop over test projects
    for tp in tps.index:
        tp_hr = 0   # Number of measurement hits for this specific project

        # Subset `topic_cluster` to clusters in the top `band_prediction_limit` predicted bands for this project
        # (band_predictions is ordered least to most likely; a limit of 0 keeps all bands)
        topic_cluster_subset = topic_cluster[topic_cluster.band_mode.isin(literal_eval(band_predictions.loc[tp].band_predictions)[-band_prediction_limit:])]
        if topic_cluster.shape[0] != 0:
            print(f'Ratio of recommended clusters to total clusters: {topic_cluster_subset.shape[0]/topic_cluster.shape[0]}')
        else: print(f'No clusters for topic {tpc}')

        for meas in test_proj_meas.loc[tp].med_freq:
            # Loop over clusters in topic
            # Note we have a multi index so we want to get to the 'cluster' index, being level 1
            for clust in topic_cluster_subset.index.get_level_values(level=1):
                # Skip noise
                if clust != -1:
                    lower_bound = round(topic_cluster_subset.loc[tpc, clust].min_freq, 3)
                    upper_bound = round(topic_cluster_subset.loc[tpc, clust].max_freq, 3)
                    if ((meas >= lower_bound) and (meas <= upper_bound)):
                        tp_hr += 1
                        break
        test_project_meas_hit_rate.append(round(tp_hr/len(list(test_proj_meas.loc[tp].med_freq)), 3))
        # Stats for individual test projects
        print(f'Number of measurements: {len(test_proj_meas.loc[tp].med_freq)}')
        print(f'Hits: {tp_hr}')
        print(f'Hit rate: {round(tp_hr/len(list(test_proj_meas.loc[tp].med_freq)), 3)}')
        print('')

        # Increment test_project_hits if at least one measurement in the project matched
        if (tp_hr > 0):
            test_project_hits +=1
    print('=========================================\n')

print(f'Total number of clusters across topics: {total_num_clusters}')
print(f'Number of test projects with at least one measurement match: {test_project_hits}')
print(f'Ratio of test project hits to number of test projects: {round(test_project_hits/test_texts.shape[0], 4)}')
print(f'Average hit rate per project: {round(sum(test_project_meas_hit_rate)/test_texts.shape[0], 4)}')

HDBSCAN Results for topic 0
Number of projects in topic: 17
Total number of measurements: 125
Estimated number of noise measurements: 5
Noise proportion: 0.04
Signal proportion: 0.96
BAND OVERLAP

Topic 0 cluster width stats:
count     11.0000
mean      22.7227
std       45.7554
min        2.7400
25%        3.7575
50%        7.2350
75%       14.5075
max      159.3400
dtype: float64

Cluster data frame for topic 0
               min_freq  max_freq  count_freq  band_mode    width
topic cluster                                                    
0     -1         86.845   246.185           7   4.428571  159.340
       0        329.325   351.930          10   7.000000   22.605
       9        164.165   182.000           6   5.000000   17.835
       8         87.925    99.105          10   3.000000   11.180
       5        258.110   269.140          25   6.000000   11.030
       6        247.295   254.530           9   6.000000    7.235
       2        459.095   466.225          10   8.00000
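
The per-project hit rate reported above counts a measurement as a hit when its median frequency falls inside at least one recommended cluster. A compact, dependency-free sketch of that metric follows; the function `project_hit_rate` and the sample values are hypothetical, and interval bounds are treated as inclusive, as in the loop.

```python
def project_hit_rate(med_freqs, intervals, ndigits=3):
    """Fraction of measurements whose median frequency falls in any interval.

    `intervals` is a list of (lower, upper) bounds in GHz, inclusive on both ends.
    """
    hits = sum(
        any(lower <= freq <= upper for lower, upper in intervals)
        for freq in med_freqs
    )
    return round(hits / len(med_freqs), ndigits)

# Hypothetical test project and recommended areas of interest
med_freqs = [258.5, 262.0, 340.0, 500.0]
intervals = [(258.11, 269.14), (329.325, 351.93)]

rate = project_hit_rate(med_freqs, intervals)
print(rate)
```

Each measurement counts at most once, which matches the `break` after the first matching cluster in the loop.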

### Build full topic-measurement data frame

In [29]:
topic_measurement_frame = pd.concat(topic_measurement_stat_list)

In [30]:
topic_measurement_frame.loc[0]

| measurement | med_freq | band | project_code | cluster_label |
|---|---|---|---|---|
| 0 | 258.110 | 6 | 2017.1.00598.S | 5 |
| 1 | 260.035 | 6 | 2017.1.00598.S | 5 |
| 2 | 261.355 | 6 | 2017.1.00598.S | 5 |
| 3 | 263.305 | 6 | 2017.1.00598.S | 5 |
| 4 | 261.775 | 6 | 2017.1.00598.S | 5 |
| ... | ... | ... | ... | ... |
| 120 | 247.295 | 6 | 2017.1.00113.S | 6 |
| 121 | 249.065 | 6 | 2017.1.00113.S | 6 |
| 122 | 261.235 | 6 | 2017.1.00113.S | 5 |
| 123 | 263.005 | 6 | 2017.1.00113.S | 5 |


### Visualization of HDBSCAN on Topics
Choose a topic in `inspect_topic` and let it rip! It uses the `topic_measurement_frame` to generate the images. There's a little bit of extra processing though, so we create a helper dataframe, `inspect_topic_frame` to make sure everything runs smoothly.

* This should probably become a function, at least the plotting part
* Maybe check out plotly "strip" charts

In [31]:
inspect_topic = 25
inspect_topic_frame = topic_measurement_frame.loc[inspect_topic]

inspect_topic_frame = inspect_topic_frame.sort_values('cluster_label', ascending=False)
inspect_topic_frame.cluster_label = inspect_topic_frame.cluster_label.astype('str')

# Add noise binary column for plot symbol
inspect_topic_frame['noise'] = np.where(inspect_topic_frame.cluster_label == '-1', 1, 0)

# Cheat some indexes to group clusters together along the index for the y-axis
# This gives us a "meaningless" 2nd-dimension to an otherwise 1-D plot, but helps with visualization
# inspect_topic_frame = inspect_topic_frame.reset_index().drop('index', axis=1)
# inspect_topic_frame = inspect_topic_frame.reset_index()

# Noise and Signal
itf_noise = inspect_topic_frame.noise.sum()
itf_signal = inspect_topic_frame.shape[0] - itf_noise

# Set symbols for plot
# px assigns symbols in order of first appearance in the frame, so build the list in that order
# All clusters are 'circle' (px symbol number 0); only noise ('-1') is 'x', even if a topic has no noise
symbols = ['x' if label == '-1' else 0 for label in inspect_topic_frame.cluster_label.unique()]

# Create plot
fig = px.scatter(inspect_topic_frame,
                 x='med_freq',
                 y='cluster_label',
                 color='cluster_label',
                 symbol='cluster_label',
                 symbol_sequence=symbols,
                 title=f"HDBSCAN Generated Clusters for Topic {inspect_topic} <br><sup>{itf_signal} Clustered Measurements with {itf_noise} Noise Measurements</sup>",
                 labels={
                     'med_freq':'Median Frequency (GHz)',
                     'index':'Index',
                     'cluster_label':'Cluster Label'
                 })
fig.update_traces(marker={'size': 15, 'opacity':0.5})

# Idea to use boxplots and points
# Create plot
# fig = px.box(inspect_topic_frame,
#                  x='med_freq',
#                  y='cluster_label',
#                  points='all',
#                  color='cluster_label',
#                  title=f"HDBSCAN Generated Clusters for Topic {inspect_topic} <br><sup>{itf_signal} Clustered Measurements with {itf_noise} Noise Measurements</sup>",
#                  labels={
#                      'med_freq':'Median Frequency (GHz)',
#                      'index':'Index',
#                      'cluster_label':'Cluster Label'
#                  })
# fig.update_traces(marker={'size': 5, 'opacity':0.5})

fig.show()

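Ad-hoc version of the code above: it re-fits HDBSCAN on a single topic's measurements instead of reading the labels from `topic_measurement_frame`.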

In [72]:
inspect_topic = 25
inspect_topic_frame = pd.DataFrame.from_dict(dict(train_topic_freqs.loc[inspect_topic]))

hdb = HDBSCAN(min_cluster_size=5)\
    .fit(list(zip(inspect_topic_frame.med_freq)))

# Add labels to inspect_topic_frame
inspect_topic_frame['cluster_label'] = hdb.labels_
inspect_topic_frame = inspect_topic_frame.sort_values('cluster_label', ascending=False)
inspect_topic_frame.cluster_label = inspect_topic_frame.cluster_label.astype('str')

# Add noise binary column for plot symbol
inspect_topic_frame['noise'] = np.where(inspect_topic_frame.cluster_label == '-1', 1, 0)

# Cheat some indexes to group clusters together along the index for the y-axis
# This gives us a "meaningless" 2nd-dimension to an otherwise 1-D plot, but helps with visualization
# inspect_topic_frame = inspect_topic_frame.reset_index().drop('index', axis=1)
# inspect_topic_frame = inspect_topic_frame.reset_index()

# Noise and Signal
itf_noise = inspect_topic_frame.noise.sum()
itf_signal = inspect_topic_frame.shape[0] - itf_noise

# Set symbols for plot
# px assigns symbols in order of first appearance in the frame, so build the list in that order
# All clusters are 'circle' (px symbol number 0); only noise ('-1') is 'x', even if a topic has no noise
symbols = ['x' if label == '-1' else 0 for label in inspect_topic_frame.cluster_label.unique()]

# Create plot
fig = px.scatter(inspect_topic_frame,
                 x='med_freq',
                 y='cluster_label',
                 color='cluster_label',
                 symbol='cluster_label',
                 symbol_sequence=symbols,
                 title=f"HDBSCAN Generated Clusters for Topic {inspect_topic} <br><sup>{itf_signal} Clustered Measurements with {itf_noise} Noise Measurements</sup>",
                 labels={
                     'med_freq':'Median Frequency (GHz)',
                     'index':'Index',
                     'cluster_label':'Cluster'
                 })
fig.update_traces(marker={'size': 15, 'opacity':0.5})

fig.update_layout(
    autosize=False,
    width=1500,
    height=750,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    title_font=dict(size=40),  # Increase title font size
    xaxis=dict(title_font=dict(size=30), tickfont=dict(size=22)),  # Increase x-axis title font size
    yaxis=dict(title_font=dict(size=30), tickfont=dict(size=22)),
    legend=dict(
        font=dict(size=30),  # Increase legend font size
    ))

# Idea to use boxplots and points
# Create plot
# fig = px.box(inspect_topic_frame,
#                  x='med_freq',
#                  y='cluster_label',
#                  points='all',
#                  color='cluster_label',
#                  title=f"HDBSCAN Generated Clusters for Topic {inspect_topic} <br><sup>{itf_signal} Clustered Measurements with {itf_noise} Noise Measurements</sup>",
#                  labels={
#                      'med_freq':'Median Frequency (GHz)',
#                      'index':'Index',
#                      'cluster_label':'Cluster Label'
#                  })
# fig.update_traces(marker={'size': 5, 'opacity':0.5})

fig.show()

In [84]:
fig.write_image("topic_25.png", width=1541, height=750, scale=1)

In [34]:
topic_measurement_frame.loc[25]

measurement,med_freq,band,project_code,cluster_label
0,140.935,4,2015.1.00117.S,0
1,142.815,4,2015.1.00117.S,0
2,152.935,4,2015.1.00117.S,0
3,154.815,4,2015.1.00117.S,0
4,96.165,3,2015.1.00117.S,8
...,...,...,...,...
134,159.115,4,2017.1.00321.S,0
135,132.550,4,2017.1.00321.S,0
136,134.415,4,2017.1.00321.S,0
137,144.465,4,2017.1.00321.S,0


### Build full topic-cluster data frame

## **THIS IS A VERY IMPORTANT DATAFRAME: IT IS THE CORE RESULT OF THE MINING APPROACH!**

This data frame holds all of the cluster info for each of the generated topics.

* Pretty much all of the cluster stats printed by the HDBSCAN code cell above can be derived from this

In [35]:
topic_cluster_stats = pd.concat(topic_cluster_stat_list)

In [67]:
topic_cluster_stats.loc[0].sort_values('count_proj', ascending=False)

cluster,mean_freq,min_freq,max_freq,count_freq,count_proj,band_min,band_max,band_mode,width,relevance
5,263.7548,258.11,269.14,25,5,6,6,6.0,11.03,0
3,231.342727,230.47,233.81,11,4,6,6,6.0,3.34,0
-1,162.422143,86.845,246.185,7,3,3,6,4.428571,159.34,0
0,341.727,329.325,351.93,10,3,7,7,7.0,22.605,0
4,218.598333,216.86,220.395,15,3,6,6,6.0,3.535,0
6,251.748889,247.295,254.53,9,3,6,6,6.0,7.235,0
7,109.520385,108.155,110.895,13,3,3,3,3.0,2.74,0
8,94.6465,87.925,99.105,10,3,3,3,3.0,11.18,0
1,453.106667,450.495,454.475,9,1,8,8,8.0,3.98,0
2,463.914,459.095,466.225,10,1,8,8,8.0,7.13,0
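
The cell that fills `topic_cluster_stat_list` is not shown here. As a sketch under that caveat, most of these columns can be derived from a topic's measurement frame with a single `groupby`. The helper name `summarize_clusters` is hypothetical, and the sample rows are only the nine topic 0 rows displayed earlier, so the counts differ from the full table. `band_mode` is left out because its computation is not shown (the non-integer value in the noise row suggests it may not be a plain mode).

```python
import pandas as pd


def summarize_clusters(measurements: pd.DataFrame) -> pd.DataFrame:
    """Per-cluster stats for one topic's measurements (hypothetical helper)."""
    stats = measurements.groupby('cluster_label').agg(
        mean_freq=('med_freq', 'mean'),
        min_freq=('med_freq', 'min'),
        max_freq=('med_freq', 'max'),
        count_freq=('med_freq', 'size'),          # measurements in the cluster
        count_proj=('project_code', 'nunique'),   # distinct projects in the cluster
        band_min=('band', 'min'),
        band_max=('band', 'max'),
    )
    # Area-of-interest width: max median frequency minus min median frequency
    stats['width'] = stats['max_freq'] - stats['min_freq']
    return stats.rename_axis('cluster')


# The nine topic 0 rows displayed earlier in the notebook output
sample = pd.DataFrame({
    'med_freq': [258.110, 260.035, 261.355, 263.305, 261.775,
                 247.295, 249.065, 261.235, 263.005],
    'band': [6] * 9,
    'project_code': ['2017.1.00598.S'] * 5 + ['2017.1.00113.S'] * 4,
    'cluster_label': [5, 5, 5, 5, 5, 6, 6, 5, 5],
})
cluster_stats = summarize_clusters(sample)
print(cluster_stats)
```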


In [37]:
topic_cluster_stats.sample(10)

topic,cluster,mean_freq,min_freq,max_freq,count_freq,count_proj,band_min,band_max,band_mode,width
30,2,261.751364,260.255,264.27,11,2,6,6,6.0,4.015
42,17,217.051818,216.65,217.24,11,6,6,6,6.0,0.59
13,49,252.903167,249.695,256.345,30,10,6,6,6.0,6.65
9,27,219.56,219.56,219.56,6,6,6,6,6.0,0.0
25,6,355.194167,355.14,355.285,6,3,7,7,7.0,0.145
43,93,230.559444,230.555,230.565,9,5,6,6,6.0,0.01
43,45,145.872,145.58,146.05,10,6,4,4,4.0,0.47
16,22,345.800833,345.79,345.81,6,3,7,7,7.0,0.02
20,5,109.529,108.605,110.095,5,2,3,3,3.0,1.49
32,14,102.731111,102.47,102.935,9,4,3,3,3.0,0.465


## Inspect topic-clusters

In [38]:
topic_cluster_stats.describe()

,mean_freq,min_freq,max_freq,count_freq,count_proj,band_min,band_max,band_mode,width
count,1538.0,1538.0,1538.0,1538.0,1538.0,1538.0,1538.0,1538.0,1538.0
mean,231.223314,225.208349,240.138882,11.46814,5.729519,5.342003,5.482445,5.413672,14.930533
std,118.634591,119.21215,130.281123,14.68478,6.160058,1.720198,1.737119,1.680843,70.668754
min,39.58875,36.08,43.105,5.0,1.0,1.0,1.0,1.0,0.0
25%,114.70425,111.81625,115.0425,6.0,3.0,3.0,3.0,3.0,0.235
50%,227.484615,223.465,230.3925,9.0,5.0,6.0,6.0,6.0,0.87
75%,290.43637,277.98375,315.175,12.0,7.0,7.0,7.0,7.0,3.49625
max,890.962188,878.23,906.795,277.0,107.0,10.0,10.0,10.0,750.67


In [39]:
topic_cluster_stats.width.describe()

count    1538.000000
mean       14.930533
std        70.668754
min         0.000000
25%         0.235000
50%         0.870000
75%         3.496250
max       750.670000
Name: width, dtype: float64

In [40]:
topic_cluster_stats.query('width > 10 and cluster != -1')\
    .sort_values(['count_freq', 'width'], ascending=False)\
    .head(50)


topic,cluster,mean_freq,min_freq,max_freq,count_freq,count_proj,band_min,band_max,band_mode,width
17,0,144.955,128.64,161.73,73,16,4,4,4.0,33.09
18,1,342.669255,305.365,372.865,47,13,7,7,7.0,67.5
5,5,255.655111,248.805,263.955,45,7,6,6,6.0,15.15
12,2,141.371447,126.035,158.505,38,9,4,4,4.0,32.47
10,96,146.574459,127.505,161.965,37,6,4,4,4.0,34.46
15,3,142.426034,128.315,158.265,29,7,4,4,4.0,29.95
25,7,112.277586,105.285,115.29,29,8,3,3,3.0,10.005
49,2,671.203214,648.0,688.52,28,4,9,9,9.0,40.52
7,0,337.877778,310.995,356.73,27,6,7,7,7.0,45.735
49,10,478.765769,472.905,484.37,26,7,8,8,8.0,11.465


### Check to see there are no clusters spanning bands

In [41]:
topic_cluster_stats.query('band_max - band_min > 1 and cluster != -1')\
    .sort_values(['width', 'count_freq'], ascending=False)\
    .head(30)

topic,cluster,mean_freq,min_freq,max_freq,count_freq,count_proj,band_min,band_max,band_mode,width
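
Note that `band_max - band_min > 1` only flags clusters whose band range exceeds one step, so a cluster straddling two adjacent bands would not appear in the empty result above. Whether adjacent-band clusters are meant to be tolerated is not stated. The toy frame below (hypothetical values) shows the difference between that condition and a stricter `band_max > band_min` check.

```python
import pandas as pd

# Toy cluster stats with a (topic, cluster) index; values are illustrative only
toy = pd.DataFrame(
    {'band_min': [6, 3, 4], 'band_max': [6, 6, 5]},
    index=pd.MultiIndex.from_tuples([(0, 5), (0, -1), (1, 2)], names=['topic', 'cluster']),
)

# Condition used in the notebook: misses cluster (1, 2), which spans bands 4 and 5
wide_span = toy.query('band_max - band_min > 1 and cluster != -1')

# Stricter alternative: any non-noise cluster touching more than one band
any_span = toy.query('band_max > band_min and cluster != -1')

print(len(wide_span), len(any_span))
```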


### Inspect an individual topic's clusters

In [42]:
topic_cluster_stats.loc[0]

cluster,mean_freq,min_freq,max_freq,count_freq,count_proj,band_min,band_max,band_mode,width
-1,162.422143,86.845,246.185,7,3,3,6,4.428571,159.34
0,341.727,329.325,351.93,10,3,7,7,7.0,22.605
1,453.106667,450.495,454.475,9,1,8,8,8.0,3.98
2,463.914,459.095,466.225,10,1,8,8,8.0,7.13
3,231.342727,230.47,233.81,11,4,6,6,6.0,3.34
4,218.598333,216.86,220.395,15,3,6,6,6.0,3.535
5,263.7548,258.11,269.14,25,5,6,6,6.0,11.03
6,251.748889,247.295,254.53,9,3,6,6,6.0,7.235
7,109.520385,108.155,110.895,13,3,3,3,3.0,2.74
8,94.6465,87.925,99.105,10,3,3,3,3.0,11.18


### Add cluster "relevance" by comparing each cluster to the raw measurements in its frequency range
#### Add the relevance column that we will use

In [43]:
topic_cluster_stats['relevance'] = 0
topic_cluster_stats.head()

topic,cluster,mean_freq,min_freq,max_freq,count_freq,count_proj,band_min,band_max,band_mode,width,relevance
0,-1,162.422143,86.845,246.185,7,3,3,6,4.428571,159.34,0
0,0,341.727,329.325,351.93,10,3,7,7,7.0,22.605,0
0,1,453.106667,450.495,454.475,9,1,8,8,8.0,3.98,0
0,2,463.914,459.095,466.225,10,1,8,8,8.0,7.13,0
0,3,231.342727,230.47,233.81,11,4,6,6,6.0,3.34,0


In [44]:
min_test = 450.495
max_test = 454.475
np.unique(train_measurements[train_measurements.med_freq.between(min_test, max_test, inclusive='right')].index.values).shape[0]

6

In [45]:
# We want to generate something like a score for clusters to use for recommendations.
# The idea is to compare the number of measurements in a topic cluster to the number of raw measurements in the cluster's frequency range.
# In theory, if the cluster contains all of the raw measurements in this range, it is "relevant" to the topic.
# This needs to be normalized by number of projects somehow, because we don't want to overweight clusters made up mostly of one project.
# NOTE: inclusive='right' gives (min_freq, max_freq], so measurements exactly at min_freq are not counted.
# Loop over topic clusters to get min_freq and max_freq (first 10 clusters only for now)
for row in topic_cluster_stats.head(10).index:
    clust_min = topic_cluster_stats.loc[row].min_freq   # Current cluster minimum frequency
    clust_max = topic_cluster_stats.loc[row].max_freq   # Current cluster maximum frequency
    # Raw measurements in the cluster range, filtered once and reused for both counts
    in_range = train_measurements[train_measurements.med_freq.between(clust_min, clust_max, inclusive='right')]
    meas_count = in_range.shape[0]                            # Count of raw measurements in cluster range
    proj_count = np.unique(in_range.index.values).shape[0]    # Count of raw projects in cluster range
    print(clust_min, clust_max, meas_count, proj_count)
    

86.845 246.185 11134 1694
329.325 351.93 2039 480
450.495 454.475 16 6
459.095 466.225 63 21
230.47 233.81 1308 546
216.86 220.395 1758 492
258.11 269.14 683 199
247.295 254.53 374 125
108.155 110.895 458 183
87.925 99.105 1678 472
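
One possible way to turn these counts into a score, offered only as a sketch: divide each cluster's own counts by the raw counts in its range, giving a measurement share and a project share. The column names `raw_meas`, `raw_proj`, `meas_share` and `proj_share` are hypothetical, and this is not the notebook's final relevance definition. The numbers are the first four rows of `topic_cluster_stats` and the matching printed counts above.

```python
import pandas as pd

# count_freq / count_proj come from topic_cluster_stats; raw_meas / raw_proj from the printed counts
scores = pd.DataFrame(
    {
        'count_freq': [7, 10, 9, 10],
        'count_proj': [3, 3, 1, 1],
        'raw_meas': [11134, 2039, 16, 63],
        'raw_proj': [1694, 480, 6, 21],
    },
    index=pd.MultiIndex.from_tuples([(0, -1), (0, 0), (0, 1), (0, 2)], names=['topic', 'cluster']),
)

# Share of the raw measurements in the range that belong to this topic cluster
scores['meas_share'] = scores['count_freq'] / scores['raw_meas']
# Same idea at the project level, to expose clusters dominated by one project
scores['proj_share'] = scores['count_proj'] / scores['raw_proj']

print(scores[['meas_share', 'proj_share']])
```

For cluster (0, 1) this gives a measurement share of 9/16 = 0.5625 but a project share of only 1/6, which is the single-project situation the comments above warn about.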


### Histogram of topic clusters

**Compare this to the scatter plot above. These two charts in tandem are good. Maybe we can combine them somehow**

Hover info needs work

In [46]:
import plotly.graph_objects as go

def plot_topic_clusters(tc_frame: pd.DataFrame, topic: int):
    """Bar chart of the non-noise clusters ("areas of interest") for one topic."""
    # Select the topic's clusters once, excluding the noise cluster (-1)
    clusters = tc_frame.query(f'topic == {topic} and cluster != -1')
    figure = go.Figure()
    figure.add_trace(
        go.Bar(
            x=clusters.mean_freq,            # Bar center: mean frequency of the cluster
            y=clusters.count_freq,           # Bar height: number of measurements in the cluster
            width=clusters.width.to_list(),  # Bar width: max_freq - min_freq (GHz)
            # name=dict(color=clusters.index.get_level_values(level=1)),
            # hoverinfo=(clusters.min_freq, clusters.max_freq)
        )
    )
    figure.update_layout(
        title=f'Areas of Interest for Topic {topic}',
        xaxis_title='Frequency (GHz)',
        yaxis_title='Count of Measurements',
    )
    figure.show()

In [47]:
plot_topic_clusters(topic_cluster_stats, 25)