# Full Model Notebook

## OUTSTANDING WORK
* Sync train-test split
* Incorporate ALMA text preprocessing and compare performance to our preprocessing
* Ensure measurements with width greater than 5GHz are dropped (I think this is only 2)
* Band EDA
* Remove outlier projects (> 26.5 measurement) from Band prediction
* Remove projects that have incorrectly formatted band data
    * E.G. 2011.0.00008.E has an observation line with band = '3 6'
* Test different text preprocessing options and compare results
* Consider removing bands 1 AND 2 from band prediction
    * Only 21 measurements in band 1, no measurements in band 2

## Workflow Outline:
We run two parallel pipelines and, once each model has finished training and prediction, combine their outputs to recommend median frequencies to explore.

All projects for this phase of the overall pipeline are 'line' projects.

### Frequency Mining Pipeline
* OPTIONAL: remove projects with > 26.5 measurements **CURRENTLY REMOVING**
    * Tested both options; keeping these projects did not raise hit rates enough to offset the roughly 1k clusters it adds

* Run projects through LDA to generate topic model with $N=50$ topics
    * Currently using count vectorization of combined title and abstract with lemmatized_no_sw_text
* Group projects to max topic by taking argmax of document-topic table
* Run HDBSCAN on each of the topics to create measurement clusters, referred to as "areas of interest"
    * Currently areas of interest are taken from min and max median frequency for each cluster generated
    * NOTE: each of the 50 HDBSCAN models can (and probably should) be tuned individually
        * We should make sure generated clusters are not too large unless it makes sense
            * E.G. a large cluster from 700-750GHz might make sense since measurements in this range are generally sparse
            * These large clusters arise because HDBSCAN adjusts the "neighborhood size" $\epsilon$ dynamically (using hierarchical clustering under the hood) to account for areas of varying density, whereas DBSCAN uses a flat $\epsilon$ for all measurements within a topic.

### Band Prediction Pipeline
* OPTIONAL: remove projects with > 26.5 measurements **NOT CURRENTLY REMOVING NEED TO CHANGE**
* Predict band for project with Naive Bayes
    * Currently using TF-IDF vectorization of combined title and abstract with **NEED TO CHOOSE TEXT**
* Choose band(s) using hard classification into one or two bands
    * We remove band 2 entirely because it has so few measurements
    * We do this to be able to give a final hit rate of approximately 75%
        * This shows we have a good prediction model to match projects to band
* Ultimately we will use probability vector output (not hard classification) to order mined recommendations by full band prediction
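
The band ranking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's model: the corpus and labels are invented, `MultinomialNB` is assumed as the Naive Bayes variant, and default TF-IDF settings are used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training texts and band labels, for illustration only
texts = [
    "high redshift galaxy molecular gas",
    "molecular gas in distant galaxy",
    "protoplanetary disk dust continuum",
    "dust continuum imaging of disk",
    "massive star formation outflow",
    "outflow from massive protostar",
]
bands = [3, 3, 6, 6, 7, 7]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, bands)

# Probability vector over bands for one new project
proba = model.predict_proba(["dust continuum in a nearby disk"])[0]

# Order bands from least likely to most likely
ranked = model.classes_[np.argsort(proba)].tolist()

# Hard classification into the two most likely bands
top_two = ranked[-2:]

print("Ranked bands (least -> most likely):", ranked)
print("Top two bands:", top_two)
```

Keeping the full probability vector, rather than only `top_two`, is what would allow mined recommendations to be ordered by band likelihood later.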

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly_express as px
import plotly.figure_factory as ff
from ast import literal_eval
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import DBSCAN, HDBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

SEED = 42

## Read data

In [2]:
train_projects = pd.read_csv("../data/train_projects.csv")
train_projects = train_projects.set_index('project_code')
train_projects.shape

(2383, 12)

In [3]:
test_projects = pd.read_csv("../data/test_projects.csv")
test_projects = test_projects.set_index('project_code')
test_projects.shape

(795, 12)

In [4]:
train_measurements = pd.read_csv('../../train_measurements.csv')
train_measurements = train_measurements.set_index('project_code')

In [5]:
test_measurements = pd.read_csv('../../test_measurements.csv')
test_measurements = test_measurements.set_index('project_code')

## Read in band predictions
This data frame lists the bands for each test project, ordered from least likely to most likely.

In [6]:
band_predictions = pd.read_csv('../data/band_prediction.csv')
band_predictions = band_predictions.set_index('project_code')
band_predictions.head()

project_code,band_predictions
2016.1.00485.S,"[1, 10, 9, 5, 4, 8, 7, 6, 3]"
2017.1.00824.S,"[1, 10, 9, 5, 8, 4, 3, 7, 6]"
2015.1.01088.S,"[1, 10, 9, 5, 8, 4, 7, 6, 3]"
2013.1.00781.S,"[1, 10, 9, 5, 8, 4, 7, 3, 6]"
2016.1.00800.S,"[1, 10, 9, 5, 8, 4, 3, 7, 6]"
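
Because the CSV stores each ranking as a string, it has to be parsed before use. A small sketch using the first row shown above (the variable names are illustrative):

```python
from ast import literal_eval

# One raw cell from band_prediction.csv, as read by pandas (a string, not a list)
raw = "[1, 10, 9, 5, 4, 8, 7, 6, 3]"

ranked_bands = literal_eval(raw)   # least likely -> most likely
top_two = ranked_bands[-2:]        # the two most likely bands
most_likely = ranked_bands[-1]     # the single most likely band

print(top_two)      # [6, 3]
print(most_likely)  # 3
```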


## Hand-set band lower-bound cutoffs
Each entry is the lower frequency bound (in GHz) of a band, listed in band order. To avoid possible conflicts, we simply set the cutoffs for bands 1 and 2 to 0 and 1 GHz, respectively.

In [7]:
band_cutoffs = [0, 1, 84, 120, 163, 211, 275, 385, 602, 787]
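
As an illustration of how these lower bounds map a frequency to a band, here is a small sketch. `band_for_frequency` is a hypothetical helper, not part of the notebook, and it assumes a band runs from its own cutoff up to the next band's cutoff, with no upper limit on band 10.

```python
from bisect import bisect_right

band_cutoffs = [0, 1, 84, 120, 163, 211, 275, 385, 602, 787]


def band_for_frequency(freq_ghz):
    """Return the 1-based band whose lower bound is the largest cutoff <= freq_ghz."""
    if freq_ghz < band_cutoffs[0]:
        raise ValueError("frequency is below the first cutoff")
    # The number of cutoffs at or below the frequency is the band number
    return bisect_right(band_cutoffs, freq_ghz)


print(band_for_frequency(127.3))    # 4
print(band_for_frequency(254.945))  # 6
print(band_for_frequency(329.33))   # 7
```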

In [8]:
train_texts = train_projects.lemmatized_no_sw_text
test_texts = test_projects.lemmatized_no_sw_text

### LDA class

In [9]:
class LDA_Model:
    """Count-vectorize a corpus and fit an LDA topic model on it."""
    def __init__(self, N_topics=3):
        self.N_topics = N_topics
        self.countVectorizer = CountVectorizer(stop_words='english')
        self.lda = LatentDirichletAllocation(n_components=self.N_topics, random_state=SEED)
    
    # Fit the vectorizer and LDA on the corpus and return its document-topic matrix
    def fit(self, corpus):
        termFrequency = self.countVectorizer.fit_transform(corpus)
        self.lda.fit(termFrequency)
        return self.lda.transform(termFrequency)

    # Transform new data with the already fitted vectorizer and LDA
    def transform(self, corpus):
        termFrequency = self.countVectorizer.transform(corpus)
        return self.lda.transform(termFrequency)
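
The fit-then-transform pattern used by this class can be tried on a toy corpus. The sketch below calls the scikit-learn objects directly with an invented corpus and two topics. It only shows the shapes involved and that each document's topic weights sum to 1.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus, for illustration only
toy_corpus = [
    "molecular gas in a high redshift galaxy",
    "star formation and molecular gas in galaxies",
    "dust continuum emission from a protoplanetary disk",
    "high resolution imaging of dust in a disk",
]

vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(toy_corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit(counts).transform(counts)

# New documents reuse the fitted vocabulary and topics
new_doc_topic = lda.transform(vectorizer.transform(["cold gas in a star forming galaxy"]))

print(doc_topic.shape)        # one row per document, one column per topic
print(doc_topic.sum(axis=1))  # each row is a distribution over topics
```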

#### Initialize Model

In [10]:
lda_model = LDA_Model(N_topics=50)

#### Fit model on training set

In [11]:
train_topics = lda_model.fit(train_texts)

In [12]:
words = lda_model.countVectorizer.get_feature_names_out()

In [13]:
N = 10 #number of top words to show
topic_components = lda_model.lda.components_

for topic_idx, topic in enumerate(topic_components):
    print(f"Topic {topic_idx}:")
    # Get the indices of the top N words for this topic
    top_word_indices = topic.argsort()[-N:][::-1]
    # Print these words with their weights
    for word_idx in top_word_indices:
        print(f"{words[word_idx]} (weight: {topic[word_idx]:.2f})")
    print("\n")

Topic 0:
ci (weight: 337.85)
galaxy (weight: 281.26)
gas (weight: 228.73)
line (weight: 163.63)
molecular (weight: 120.74)
high (weight: 119.00)
redshift (weight: 109.37)
propose (weight: 103.92)
observation (weight: 103.48)
tracer (weight: 97.54)


Topic 1:
galaxy (weight: 1115.54)
gas (weight: 974.40)
star (weight: 642.74)
molecular (weight: 468.13)
formation (weight: 402.50)
high (weight: 254.33)
study (weight: 213.17)
form (weight: 212.59)
observation (weight: 203.25)
propose (weight: 196.43)


Topic 2:
dust (weight: 180.68)
galaxy (weight: 155.78)
high (weight: 134.75)
continuum (weight: 120.43)
observation (weight: 111.98)
resolution (weight: 109.43)
image (weight: 101.03)
line (weight: 88.76)
massive (weight: 70.79)
propose (weight: 65.92)


Topic 3:
cycle (weight: 108.41)
observation (weight: 93.76)
resolution (weight: 84.14)
line (weight: 84.08)
gas (weight: 78.04)
emission (weight: 75.91)
high (weight: 70.46)
resolve (weight: 68.94)
band (weight: 50.77)
propose (weight: 48.12)

In [14]:
train_doc_topic = pd.DataFrame(train_topics)
train_doc_topic = train_doc_topic.set_index(train_texts.index.values)
train_doc_topic.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
2016.1.01288.S,0.00037,0.00037,0.00037,0.00037,0.00037,0.00037,0.00037,0.00037,0.00037,0.00037,...,0.00037,0.00037,0.171151,0.00037,0.00037,0.00037,0.00037,0.00037,0.00037,0.00037
2018.1.01077.S,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,...,0.0002,0.0002,0.0002,0.0002,0.139093,0.0002,0.0002,0.0002,0.0002,0.0002
2018.1.00437.S,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.410149,0.000175,0.000175,0.000175,...,0.118838,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.462768
2021.1.00637.S,0.000196,0.000196,0.148858,0.000196,0.000196,0.000196,0.000196,0.000196,0.000196,0.000196,...,0.000196,0.000196,0.000196,0.000196,0.000196,0.599945,0.000196,0.000196,0.122562,0.000196
2012.1.00786.S,0.00013,0.00013,0.158401,0.00013,0.456044,0.17161,0.00013,0.00013,0.00013,0.00013,...,0.00013,0.00013,0.00013,0.00013,0.00013,0.00013,0.00013,0.00013,0.00013,0.00013


In [15]:
train_texts = pd.DataFrame(train_texts)

### Match test data into topics

In [16]:
test_topics = lda_model.transform(test_texts)

In [17]:
test_doc_topic = pd.DataFrame(test_topics.tolist())
test_doc_topic = test_doc_topic.set_index(test_texts.index.values)
test_doc_topic.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
2016.1.00485.S,0.000198,0.160898,0.000198,0.000198,0.000198,0.000198,0.000198,0.158033,0.000198,0.000198,...,0.000198,0.000198,0.000198,0.000198,0.000198,0.23667,0.000198,0.000198,0.085407,0.000198
2017.1.00824.S,0.000168,0.118801,0.000168,0.000168,0.000168,0.383218,0.000168,0.000168,0.000168,0.000168,...,0.000168,0.104688,0.000168,0.000168,0.000168,0.000168,0.000168,0.000168,0.000168,0.000168
2015.1.01088.S,0.000182,0.104138,0.104443,0.000182,0.000182,0.000182,0.000182,0.000182,0.000182,0.044875,...,0.000182,0.065091,0.143142,0.000182,0.000182,0.000182,0.014204,0.000182,0.037287,0.000182
2013.1.00781.S,0.314379,0.000225,0.201856,0.000225,0.000225,0.000225,0.000225,0.000225,0.000225,0.000225,...,0.000225,0.000225,0.000225,0.000225,0.000225,0.150238,0.000225,0.000225,0.000225,0.069341
2016.1.00800.S,0.035727,0.369768,0.164091,0.000164,0.000164,0.000164,0.000164,0.000164,0.000164,0.000164,...,0.000164,0.063034,0.157391,0.000164,0.000164,0.000164,0.000164,0.000164,0.000164,0.000164


In [18]:
test_texts = pd.DataFrame(test_texts)

### Group documents to highest matching topic

Combine project topic vector frames

In [19]:
proj_topics = pd.concat([train_doc_topic, test_doc_topic])
proj_topics

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
2016.1.01288.S,0.000370,0.000370,0.000370,0.000370,0.000370,0.000370,0.000370,0.000370,0.000370,0.000370,...,0.000370,0.000370,0.171151,0.000370,0.000370,0.000370,0.000370,0.000370,0.000370,0.000370
2018.1.01077.S,0.000200,0.000200,0.000200,0.000200,0.000200,0.000200,0.000200,0.000200,0.000200,0.000200,...,0.000200,0.000200,0.000200,0.000200,0.139093,0.000200,0.000200,0.000200,0.000200,0.000200
2018.1.00437.S,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.410149,0.000175,0.000175,0.000175,...,0.118838,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.462768
2021.1.00637.S,0.000196,0.000196,0.148858,0.000196,0.000196,0.000196,0.000196,0.000196,0.000196,0.000196,...,0.000196,0.000196,0.000196,0.000196,0.000196,0.599945,0.000196,0.000196,0.122562,0.000196
2012.1.00786.S,0.000130,0.000130,0.158401,0.000130,0.456044,0.171610,0.000130,0.000130,0.000130,0.000130,...,0.000130,0.000130,0.000130,0.000130,0.000130,0.000130,0.000130,0.000130,0.000130,0.000130
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2013.1.01058.S,0.000222,0.000222,0.000222,0.131865,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,...,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222,0.000222
2016.1.00004.S,0.000230,0.232734,0.000230,0.120931,0.000230,0.000230,0.000230,0.000230,0.034291,0.000230,...,0.000230,0.000230,0.000230,0.000230,0.000230,0.000230,0.000230,0.000230,0.000230,0.000230
2017.1.00935.S,0.000175,0.875598,0.000175,0.000175,0.029094,0.000175,0.000175,0.000175,0.000175,0.000175,...,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175,0.000175
2018.1.00216.S,0.000194,0.594620,0.000194,0.000194,0.000194,0.000194,0.029841,0.017551,0.000194,0.000194,...,0.000194,0.000194,0.000194,0.000194,0.059090,0.000194,0.000194,0.000194,0.000194,0.046483


Take highest matching topic for each project

In [20]:
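# Use only the topic columns so that re-running this cell ignores an existing max_topic column
proj_topics['max_topic'] = proj_topics[list(range(lda_model.N_topics))].idxmax(axis=1)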

Create data frame with project id and max topic

In [21]:
proj_max_topic = proj_topics['max_topic'].to_frame()
proj_max_topic.head()

Unnamed: 0,max_topic
2016.1.01288.S,29
2018.1.01077.S,38
2018.1.00437.S,49
2021.1.00637.S,45
2012.1.00786.S,4
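
A toy version of the hard assignment shows what it keeps and what it discards. The numbers are invented, and `margin` is a hypothetical diagnostic that is not computed in this notebook.

```python
import pandas as pd

# Invented document-topic weights for three projects and three topics
doc_topic = pd.DataFrame(
    [[0.10, 0.70, 0.20],
     [0.60, 0.30, 0.10],
     [0.20, 0.20, 0.60]],
    index=["proj_a", "proj_b", "proj_c"],
)

# Hard assignment: the column label of the largest weight in each row
max_topic = doc_topic.idxmax(axis=1)

# Gap between the best and second-best topic; a small gap means a less clear-cut assignment
margin = doc_topic.apply(
    lambda row: row.nlargest(2).iloc[0] - row.nlargest(2).iloc[1], axis=1
)

print(max_topic.tolist())  # [1, 0, 2]
print(margin.round(2).tolist())
```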


### Add `max_topic` to `measurements` frame to be able to group measurements by max topic

In [22]:
train_measurements = pd.merge(train_measurements, proj_max_topic, left_index=True, right_index=True)

In [23]:
proj_max_topic.value_counts().describe()

count     50.0000
mean      63.5600
std       65.1855
min        3.0000
25%       22.2500
50%       42.0000
75%       81.0000
max      325.0000
Name: count, dtype: float64

There are a few topics that match to a large number of documents. Perhaps we need a better topic model or to group documents by project_topic vector similarity.

Eyeball comparison of documents by max topic. This requires looking at the online explorer since printing out abstracts in here gets messy.

In [24]:
proj_max_topic[proj_max_topic.max_topic == 3].head()

Unnamed: 0,max_topic
2015.1.01549.S,3
2018.1.00047.S,3
2022.1.01618.S,3
2012.1.00075.S,3
2016.1.00413.V,3


### Generate test projects measurements
This will be useful for calculating hit rates to evaluate model performance.

**NOTE!!!**
Do not sort these lists, however tempting. Entries at the same position across the columns belong to the same measurement, so sorting any one list breaks that correspondence and loses measurement information.

In [25]:
test_proj_meas = test_measurements.loc[test_texts.index]
test_proj_meas = test_proj_meas.groupby(test_proj_meas.index)\
    .agg({
        'low_freq': lambda x: round(x, 4).tolist(),
        'high_freq': lambda x: round(x, 4).tolist(),
        'med_freq': lambda x: round(x, 4).tolist(),
        'diff_freq': lambda x: round(x, 4).tolist()
    })
test_proj_meas.head()

project_code,low_freq,high_freq,med_freq,diff_freq
2011.0.00010.S,"[90.38, 90.7, 91.69, 92.89, 217.59, 218.67, 21...","[90.62, 90.93, 91.92, 93.12, 218.53, 219.6, 21...","[90.5, 90.815, 91.805, 93.005, 218.06, 219.135...","[0.24, 0.23, 0.23, 0.23, 0.94, 0.93, 0.94, 0.9..."
2011.0.00064.S,"[288.96, 290.79, 300.84, 302.71, 288.94, 290.7...","[290.84, 292.67, 302.71, 304.59, 290.82, 292.6...","[289.9, 291.73, 301.775, 303.65, 289.88, 291.7...","[1.88, 1.88, 1.87, 1.88, 1.88, 1.87, 1.88, 1.87]"
2011.0.00121.S,"[319.07, 320.48, 319.83, 319.36, 319.71, 316.59]","[320.94, 322.35, 321.71, 321.24, 321.58, 318.47]","[320.005, 321.415, 320.77, 320.3, 320.645, 317...","[1.87, 1.87, 1.88, 1.88, 1.87, 1.88]"
2011.0.00136.S,"[335.29, 335.98, 345.67, 346.47]","[335.52, 336.22, 345.91, 346.7]","[335.405, 336.1, 345.79, 346.585]","[0.23, 0.24, 0.24, 0.23]"
2011.0.00199.S,"[639.15, 645.41, 657.7, 661.7, 320.98, 322.12,...","[640.11, 646.37, 658.66, 662.66, 321.46, 322.6...","[639.63, 645.89, 658.18, 662.18, 321.22, 322.3...","[0.96, 0.96, 0.96, 0.96, 0.48, 0.48, 0.49, 0.48]"
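
The warning about sorting can be made concrete with a toy project. The values below are invented. The point is only that position `i` in every list refers to the same measurement.

```python
import pandas as pd

# Three invented measurements for a single project
meas = pd.DataFrame(
    {"low_freq": [345.0, 90.0, 230.0], "high_freq": [347.0, 92.0, 231.0]},
    index=pd.Index(["P1", "P1", "P1"], name="project_code"),
)
meas["med_freq"] = (meas.low_freq + meas.high_freq) / 2

# Same aggregation pattern as above: one list per column per project
grouped = meas.groupby(meas.index).agg({
    "low_freq": lambda x: x.round(4).tolist(),
    "high_freq": lambda x: x.round(4).tolist(),
    "med_freq": lambda x: x.round(4).tolist(),
})
row = grouped.loc["P1"]

# Unsorted lists stay aligned: each tuple is one real measurement
aligned = list(zip(row.low_freq, row.med_freq, row.high_freq))

# Sorting a single list pairs values from different measurements
broken = list(zip(sorted(row.low_freq), row.med_freq, row.high_freq))

print(aligned)
print(broken)
```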


### Generate train topic measurements
We will use these to engineer 'areas of interest' among topics using HDBSCAN

**NOTE!!!**
Do not sort these lists, however tempting. Entries at the same position across the columns belong to the same measurement, so sorting any one list breaks that correspondence and loses measurement information.

In [26]:
train_topic_freqs = train_measurements.loc[train_texts.index]\
    .groupby('max_topic')\
    .agg({
        'low_freq': lambda x: round(x, 4).tolist(),
        'high_freq': lambda x: round(x, 4).tolist(),
        'med_freq': lambda x: round(x, 4).tolist(),
        'diff_freq': lambda x: round(x, 4).tolist(),
        'band': lambda x: x.astype('int64').tolist()
    })
train_topic_freqs.head()

max_topic,low_freq,high_freq,med_freq,diff_freq,band
0,"[126.36, 127.55, 254.01, 255.61, 218.23, 218.4...","[128.24, 129.43, 255.88, 257.49, 218.48, 218.7...","[127.3, 128.49, 254.945, 256.55, 218.355, 218....","[1.88, 1.88, 1.87, 1.88, 0.25, 0.25, 0.25, 0.2...","[4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
1,"[84.89, 86.78, 96.89, 98.84, 87.77, 89.54, 99....","[86.76, 88.66, 98.76, 100.72, 89.65, 91.42, 10...","[85.825, 87.72, 97.825, 99.78, 88.71, 90.48, 1...","[1.87, 1.88, 1.87, 1.88, 1.88, 1.88, 1.87, 1.8...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ..."
2,"[249.28, 251.16, 264.95, 266.82, 142.24, 143.9...","[251.16, 253.03, 266.82, 268.7, 144.11, 145.8,...","[250.22, 252.095, 265.885, 267.76, 143.175, 14...","[1.88, 1.87, 1.87, 1.88, 1.87, 1.87, 1.87, 1.8...","[6, 6, 6, 6, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, ..."
3,"[329.3, 330.55, 339.31, 340.68, 329.29, 339.3,...","[329.36, 330.61, 339.37, 340.74, 329.35, 339.3...","[329.33, 330.58, 339.34, 340.71, 329.32, 339.3...","[0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 2.0, 2.0,...","[7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 7, 7, 7, ..."
4,"[339.86, 341.56, 351.86, 353.57, 332.86, 334.5...","[341.74, 343.44, 353.74, 355.44, 334.74, 336.4...","[340.8, 342.5, 352.8, 354.505, 333.8, 335.5, 3...","[1.88, 1.88, 1.88, 1.87, 1.88, 1.88, 1.88, 1.8...","[7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 6, 6, 6, ..."


In [27]:
pd.DataFrame(train_topic_freqs.loc[0].med_freq,
             train_topic_freqs.loc[0].band)\
             .reset_index()

Unnamed: 0,index,0
0,4,127.300
1,4,128.490
2,6,254.945
3,6,256.550
4,6,218.355
...,...,...
619,6,224.590
620,6,224.575
621,6,226.125
622,6,259.185


## Cluster cleaning function

In [28]:
# Code to check for clusters that span at least two bands
def cluster_cleaning(cluster_df, topic_df):
    new_rows = []
    bad_rows = []
    for clst in cluster_df.index:
        if cluster_df.loc[clst].band_min != cluster_df.loc[clst].band_max:
            olap_clst = clst
            olap_band_min = cluster_df.loc[clst].band_min.astype('int64')
            olap_band_max = cluster_df.loc[clst].band_max.astype('int64')
            olap_band_mode = cluster_df.loc[clst].band_mode.astype('int64')
            olap_min_freq = cluster_df.loc[clst].min_freq
            olap_max_freq = cluster_df.loc[clst].max_freq

            # Check to see that the cluster doesn't span more than two bands
            # If it does, the cluster is far too large and the clustering needs to be tuned better
            if olap_band_max - olap_band_min > 1:
                raise ValueError('Cluster spans more than 2 bands. Re-parameterize clusters.')
            
            # Otherwise, we split the cluster into two different clusters on the band boundaries
            new_row1_min_freq = olap_min_freq
            new_row1_max_freq = band_cutoffs[olap_band_max-1]
            new_row2_min_freq = band_cutoffs[olap_band_max-1]
            new_row2_max_freq = olap_max_freq

            # There are cases where the band transition in the data is wrong
            # Measurements very close to a band boundary can carry the neighboring band's label (e.g. band 4 measurements labeled as band 5)
            # If this is the case, we don't make new rows and keep the cluster as is, with its existing band mode
            # Use `continue` rather than `break` so the remaining clusters are still checked
            if (new_row1_min_freq > new_row1_max_freq) or (new_row2_min_freq > new_row2_max_freq):
                continue

            reassign_measures = topic_df[topic_df.cluster_label == clst].med_freq\
                .sort_values()\
                .to_list()
            new_row1_measures = []
            new_row2_measures = []

            # Loop over reassign_measures and build lists for each new cluster
            for meas in reassign_measures:
                if meas <= new_row2_min_freq:
                    new_row1_measures.append(meas)
                else:
                    new_row2_measures.append(meas)
            
            # Generate column values for new rows (clusters)
            new_row1_count = len(new_row1_measures)
            new_row2_count = len(new_row2_measures)
            new_row1_mean = np.mean(new_row1_measures)
            new_row2_mean = np.mean(new_row2_measures)
            new_row1_band_min = olap_band_min
            new_row1_band_max = olap_band_min
            new_row2_band_min = olap_band_max
            new_row2_band_max = olap_band_max
            # After the split each new cluster lies within a single band, so that band is its mode
            new_row1_band_mode = olap_band_min
            new_row2_band_mode = olap_band_max

            # Make new row lists to add to data frame
            new_row1 = [new_row1_mean,
                        new_row1_min_freq,
                        new_row1_max_freq,
                        new_row1_count,
                        new_row1_band_min,
                        new_row1_band_max,
                        new_row1_band_mode
                        ]
            
            new_row2 = [new_row2_mean,
                        new_row2_min_freq,
                        new_row2_max_freq,
                        new_row2_count,
                        new_row2_band_min,
                        new_row2_band_max,
                        new_row2_band_mode
                        ]
            
            # Collect the new rows (clusters) and the label of the cluster they replace
            # We don't want to alter the data frame we're looping over in the loop
            new_rows.append(new_row1)
            new_rows.append(new_row2)
            bad_rows.append(clst)

    # Drop the split clusters and append their replacements
    cluster_df = cluster_df.drop(bad_rows, axis=0)
    if new_rows:
        cluster_df = pd.concat([cluster_df, pd.DataFrame(new_rows, columns=cluster_df.columns)])

    # Renumber clusters from 0; the original cluster labels are no longer needed
    cluster_df = cluster_df.reset_index(drop=True)
    return cluster_df
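
The split step can be illustrated on a single cluster. `split_at_band_boundary` is a hypothetical, simplified stand-in for the logic in `cluster_cleaning`, operating on a plain list of median frequencies rather than on the cluster data frame.

```python
band_cutoffs = [0, 1, 84, 120, 163, 211, 275, 385, 602, 787]


def split_at_band_boundary(med_freqs, band_min, band_max):
    """Split one cluster that spans two adjacent bands at the boundary between them."""
    if band_max - band_min != 1:
        raise ValueError("expected a cluster spanning exactly two adjacent bands")
    boundary = band_cutoffs[band_max - 1]  # lower bound of the upper band
    lower = [f for f in med_freqs if f <= boundary]
    upper = [f for f in med_freqs if f > boundary]
    return (
        {"band": band_min, "min_freq": min(med_freqs), "max_freq": boundary, "count": len(lower)},
        {"band": band_max, "min_freq": boundary, "max_freq": max(med_freqs), "count": len(upper)},
    )


# A cluster straddling the band 6 / band 7 boundary at 275 GHz
low_part, high_part = split_at_band_boundary([270.0, 273.5, 276.0, 280.0], band_min=6, band_max=7)

print(low_part)   # {'band': 6, 'min_freq': 270.0, 'max_freq': 275, 'count': 2}
print(high_part)  # {'band': 7, 'min_freq': 275, 'max_freq': 280.0, 'count': 2}
```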

### Loop over topics and find accuracy measurements

In [30]:
test_project_hits = 0               # Hits for all projects if at least one measurement is matched
test_project_meas_hit_rate = []     # List of hit rates by project
topic_cluster_widths = []           # List of cluster widths by topic to ensure generated clusters are not too wide (list of lists)
total_num_clusters = 0              # Running count of clusters across all topics

# Loop over topics
for tpc in set(proj_max_topic.max_topic.values):
    # Cluster this topic's median frequencies with HDBSCAN
    # Earlier DBSCAN parameterizations are kept below, commented out, for comparison
    # BASIC COMPARISON PARAMETERIZATION: eps=0.5, min_samples=2
    # db = DBSCAN(eps=0.25, min_samples=2)\
    #     .fit(list(zip(train_topic_freqs.loc[tpc].med_freq)))
    # db = DBSCAN(eps=params_frame.loc[tpc].eps, min_samples=2)\
    # .fit(list(zip(train_topic_freqs.loc[tpc].med_freq)))
    db = HDBSCAN(max_cluster_size=200, min_cluster_size=5)\
    .fit(list(zip(train_topic_freqs.loc[tpc].med_freq)))
    
    # Get labels from HDBSCAN
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_points = len(list(labels))
    n_noise = list(labels).count(-1)

    # Stat callouts
    print(f'HDBSCAN Results for topic {tpc}')
    #print(f'Estimated number of clusters: {n_clusters}')
    print(f'Number of projects in topic: {proj_max_topic.loc[train_texts.index].query(f"max_topic == {tpc}").shape[0]}')
    print(f'Total number of measurements: {n_points}')
    print(f'Estimated number of noise measurements: {n_noise}')
    print(f'Noise percentage: {round(list(labels).count(-1)/labels.shape[0], 3)}')
    print(f'Signal to noise ratio: {round(1-list(labels).count(-1)/labels.shape[0], 3)}')

    # Create data frame for measurements in this specific topic
    selected_topic = pd.DataFrame(train_topic_freqs.loc[tpc].med_freq,
                                  train_topic_freqs.loc[tpc].band)\
    .reset_index()
    selected_topic.columns = ['band', 'med_freq']
    selected_topic['cluster_label'] = labels

    # Aggregate med_freq and band by cluster to generate areas of interest
    topic_cluster = selected_topic.groupby('cluster_label').agg(
        mean_freq=('med_freq', 'mean'),
        min_freq=('med_freq', 'min'),
        max_freq=('med_freq', 'max'),
        count_freq=('med_freq', 'count'),
        band_min=('band', 'min'),
        band_max=('band', 'max'),
        band_mode=('band', 'mean')
    )

    # Check to see if there is noise from clustering
    # If so, drop noise cluster as we do not want to count hits there
    if (-1 in topic_cluster.index):
        topic_cluster = topic_cluster.drop(-1, axis=0) # Drop noise (label -1)
    topic_cluster = topic_cluster.sort_index()

    # Break up clusters that span more than one band
    topic_cluster = cluster_cleaning(topic_cluster, selected_topic)
    topic_cluster = topic_cluster.sort_index()

    # Loop over generated clusters and print cluster stats
    # Initialize list of cluster widths
    cluster_widths = []

    # If after dropping noise there are still clusters, print the stats
    # This if statement is to avoid errors if topics only have noise and no clusters
    if topic_cluster.shape[0] != 0:
        for clst in topic_cluster.index:
            min_freq, max_freq = topic_cluster.loc[clst].min_freq, topic_cluster.loc[clst].max_freq
            cluster_widths.append(max_freq - min_freq)
            total_num_clusters += 1
            if (topic_cluster.loc[clst].band_min != topic_cluster.loc[clst].band_max):
                print("BAND OVERLAP")
        print(f'Topic {tpc} cluster width stats:')
        print(np.round(pd.Series(cluster_widths).describe(), 4))
    else: print('No clusters for this topic')

    # Print topic cluster dataframe for HDBSCAN tuning
    print(f'Stats for HDBSCAN clusters for topic {tpc}')
    print(topic_cluster.describe())
    print(topic_cluster.sort_values('count_freq', ascending=False))

    # Get a list of test project codes
    tps = proj_max_topic.loc[test_texts.index].query(f'max_topic == {tpc}')

    # Check to see if there are any test projects assigned to this topic
    # ADD IF, ELSE STATEMENT HERE, ATTACH ELSE TO FOLLOWING CODE

    #Begin test projects
    print('')
    # print('Begin tests')

    # Loop over test projects
    for tp in tps.index:
        tp_hr = 0   # Number of matched measurements (hits) for this specific project
        #print(f'Test project {tp}:')
        # Loop over measurements in test project
        for meas in test_proj_meas.loc[tp].med_freq:
            # Loop over clusters in topic
            for clust in topic_cluster.index.values:
                lower_bound = round(topic_cluster.loc[clust].min_freq, 3)
                upper_bound = round(topic_cluster.loc[clust].max_freq, 3)
                if ((meas >= lower_bound) and (meas <= upper_bound)):
                    tp_hr += 1
                    break
        test_project_meas_hit_rate.append(round(tp_hr/len(list(test_proj_meas.loc[tp].med_freq)), 3))
        #Print some stats
        # print(f'Number of measurements: {len(test_proj_meas.loc[tp].med_freq)}')
        # print(f'Hits: {tp_hr}')
        # print(f'Hit rate: {round(tp_hr/len(list(test_proj_meas.loc[tp].med_freq)), 3)}')
        # print('')

        # Increment test_project_hits if at least one measurement in the project matched
        if (tp_hr > 0):
            test_project_hits +=1
    print('=========================================\n')

print(f'Total number of clusters across topics: {total_num_clusters}')
print(f'Number of test projects with at least one measurement match: {test_project_hits}')
print(f'Ratio of test project hits to number of test projects: {round(test_project_hits/test_texts.shape[0], 4)}')
print(f'Average hit rate per project: {round(sum(test_project_meas_hit_rate)/test_texts.shape[0], 4)}')

HDBSCAN Results for topic 0
Number of projects in topic: 83
Total number of measurements: 624
Estimated number of noise measurements: 107
Noise percentage: 0.171
Signal to noise ratio: 0.829
Topic 0 cluster stats:
Stats for HDBSCAN clusters for topic 0
        mean_freq    min_freq    max_freq  count_freq   band_min   band_max  \
count   55.000000   55.000000   55.000000   55.000000  55.000000  55.000000   
mean   225.470697  224.301818  226.649364    9.400000   5.109091   5.109091   
std    148.084750  147.327456  148.859073    5.363111   1.911687   1.911687   
min     85.916000   85.800000   86.025000    3.000000   3.000000   3.000000   
25%    110.266607  109.837500  110.750000    5.500000   3.000000   3.000000   
50%    216.195556  215.900000  216.660000    8.000000   6.000000   6.000000   
75%    254.912652  254.310000  255.747500   12.000000   6.000000   6.000000   
max    684.098750  679.680000  688.520000   34.000000   9.000000   9.000000   

       band_mode  
count  55.000000

## Looping in band prediction assessments
Taking the hit rates above as the upper limit, we check how restricting our recommendations to the 'areas of interest' from the top 2 predicted bands affects our hit rate.
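
The effect of the restriction can be seen on a toy example. The clusters and test measurements below are invented, and `hit_rate` is a hypothetical helper that mirrors the per-project hit rate used in this notebook (a measurement is a hit if it falls inside any cluster's frequency range).

```python
from ast import literal_eval

import pandas as pd

# Invented areas of interest for one topic
topic_cluster = pd.DataFrame({
    "min_freq": [85.0, 215.0, 330.0],
    "max_freq": [90.0, 220.0, 345.0],
    "band_mode": [3, 6, 7],
})

predicted = "[1, 10, 9, 5, 4, 8, 7, 6, 3]"    # least -> most likely band
test_med_freqs = [86.5, 217.0, 340.0, 500.0]  # one test project's measurements


def hit_rate(med_freqs, clusters):
    """Fraction of measurements that fall inside at least one cluster range."""
    hits = sum(
        ((clusters.min_freq <= f) & (f <= clusters.max_freq)).any()
        for f in med_freqs
    )
    return hits / len(med_freqs)


# Keep only clusters whose band is one of the two most likely bands
top_two = literal_eval(predicted)[-2:]
subset = topic_cluster[topic_cluster.band_mode.isin(top_two)]

print(hit_rate(test_med_freqs, topic_cluster))  # 0.75
print(hit_rate(test_med_freqs, subset))         # 0.5
```

The restricted hit rate can never exceed the unrestricted one, because the subset only removes clusters.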

In [50]:
test_project_hits = 0               # Hits for all projects if at least one measurement is matched
test_project_meas_hit_rate = []     # List of hit rates by project
topic_cluster_widths = []           # List of cluster widths by topic to ensure generated clusters are not too wide (list of lists)
total_num_clusters = 0              # Running count of clusters across all topics

# Loop over topics
for tpc in set(proj_max_topic.max_topic.values):
    kaleigh_test = []
    # Cluster this topic's median frequencies with HDBSCAN
    # Earlier DBSCAN parameterizations are kept below, commented out, for comparison
    # BASIC COMPARISON PARAMETERIZATION: eps=0.5, min_samples=2
    # db = DBSCAN(eps=0.25, min_samples=2)\
    #     .fit(list(zip(train_topic_freqs.loc[tpc].med_freq)))
    # db = DBSCAN(eps=params_frame.loc[tpc].eps, min_samples=2)\
    # .fit(list(zip(train_topic_freqs.loc[tpc].med_freq)))
    db = HDBSCAN(max_cluster_size=200, min_cluster_size=5)\
    .fit(list(zip(train_topic_freqs.loc[tpc].med_freq)))
    
    # Get labels from HDBSCAN
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_points = len(list(labels))
    n_noise = list(labels).count(-1)

    # Stat callouts
    print(f'HDBSCAN Results for topic {tpc}')
    #print(f'Estimated number of clusters: {n_clusters}')
    print(f'Number of projects in topic: {proj_max_topic.loc[train_texts.index].query(f"max_topic == {tpc}").shape[0]}')
    print(f'Total number of measurements: {n_points}')
    print(f'Estimated number of noise measurements: {n_noise}')
    print(f'Noise percentage: {round(list(labels).count(-1)/labels.shape[0], 3)}')
    print(f'Signal to noise ratio: {round(1-list(labels).count(-1)/labels.shape[0], 3)}')

    # Create data frame for measurements in this specific topic
    selected_topic = pd.DataFrame(train_topic_freqs.loc[tpc].med_freq,
                                  train_topic_freqs.loc[tpc].band)\
    .reset_index()
    selected_topic.columns = ['band', 'med_freq']
    selected_topic['cluster_label'] = labels

    # Aggregate med_freq and band by cluster to generate areas of interest
    topic_cluster = selected_topic.groupby('cluster_label').agg(
        mean_freq=('med_freq', 'mean'),
        min_freq=('med_freq', 'min'),
        max_freq=('med_freq', 'max'),
        count_freq=('med_freq', 'count'),
        band_min=('band', 'min'),
        band_max=('band', 'max'),
        band_mode=('band', 'mean')
    )
    # We want the band mode for each cluster, but computing it inside .agg is not working
    # Instead we take the mean of band and round to the nearest integer to approximate the mode
    topic_cluster.band_mode = round(topic_cluster.band_mode, 0).astype('int')

    # Check to see if there is noise from clustering
    # If so, drop noise cluster as we do not want to count hits there
    if (-1 in topic_cluster.index):
        topic_cluster = topic_cluster.drop(-1, axis=0) # Drop noise (label -1)
    topic_cluster = topic_cluster.sort_index()

    # Break up clusters that span more than one band
    topic_cluster = cluster_cleaning(topic_cluster, selected_topic)
    topic_cluster = topic_cluster.sort_index()

    # Loop over generated clusters and print cluster stats
    # Initialize list of cluster widths
    new_rows = []
    bad_rows = []
    cluster_widths = []

    # Only report stats if clusters remain after dropping noise
    # (topics can consist of noise only, which would otherwise raise errors)
    if not topic_cluster.empty:
        for clst in topic_cluster.index:
            row = topic_cluster.loc[clst]
            cluster_widths.append(row.max_freq - row.min_freq)
            total_num_clusters += 1
            # Flag clusters that still span more than one band after cleaning
            if row.band_min != row.band_max:
                print("BAND OVERLAP")
        print('')
        print(f'Topic {tpc} cluster width stats:')
        print(np.round(pd.Series(cluster_widths).describe(), 4))
    else:
        print('No clusters for this topic')

    # Print topic cluster dataframe for HDBSCAN tuning
    print('')
    print(topic_cluster.sort_values('count_freq', ascending=False))

    # Get the test projects assigned to this topic
    tps = proj_max_topic.loc[test_texts.index].query(f'max_topic == {tpc}')

    # TODO: add an if/else to report topics with no test projects
    # (when `tps` is empty, the loop below is simply skipped)

    # Begin test projects
    print('')
    # print('Begin tests')

    # Loop over test projects
    for tp in tps.index:
        tp_hr = 0   # Number of measurement hits for this specific project
        #print(f'Test project {tp}:')

        # Keep only clusters whose band is among this project's top two predicted bands
        # (the last two entries of the stored `band_predictions` list)
        top_bands = literal_eval(band_predictions.loc[tp].band_predictions)[-2:]
        topic_cluster_subset = topic_cluster[topic_cluster.band_mode.isin(top_bands)]
        if not topic_cluster.empty:
            print(f'Ratio of recommended clusters to total clusters: {topic_cluster_subset.shape[0]/topic_cluster.shape[0]}')
        else:
            print(f'No clusters for topic {tpc}')

        # Count the measurements that fall inside at least one recommended cluster
        for meas in test_proj_meas.loc[tp].med_freq:
            # Loop over recommended clusters in topic
            for clust in topic_cluster_subset.index.values:
                lower_bound = round(topic_cluster_subset.loc[clust].min_freq, 3)
                upper_bound = round(topic_cluster_subset.loc[clust].max_freq, 3)
                if lower_bound <= meas <= upper_bound:
                    tp_hr += 1
                    break  # Count each measurement at most once
        # Per-project hit rate: hits divided by the project's number of measurements
        test_project_meas_hit_rate.append(round(tp_hr/len(list(test_proj_meas.loc[tp].med_freq)), 3))
        #Print some stats
        # print(f'Number of measurements: {len(test_proj_meas.loc[tp].med_freq)}')
        # print(f'Hits: {tp_hr}')
        # print(f'Hit rate: {round(tp_hr/len(list(test_proj_meas.loc[tp].med_freq)), 3)}')
        # print('')

        # Increment test_project_hits if at least one measurement in the project matched
        if tp_hr > 0:
            test_project_hits += 1
    print('=========================================\n')

print(f'Total number of clusters across topics: {total_num_clusters}')
print(f'Number of test projects with at least one measurement match: {test_project_hits}')
print(f'Ratio of test project hits to number of test projects: {round(test_project_hits/test_texts.shape[0], 4)}')
print(f'Average hit rate per project: {round(sum(test_project_meas_hit_rate)/test_texts.shape[0], 4)}')
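
As a cross-check on the loops above, the per-cluster summary and the per-project hit rate can be sketched on toy data. This is a minimal sketch under stated assumptions, not the notebook's implementation: `summarize_clusters` and `project_hit_rate` are hypothetical helper names, the toy frequencies, bands, and labels are made up, the band is taken as the true per-cluster mode through a callable (ties resolve to the lowest band) instead of the rounded mean, and the cluster bounds are not rounded to 3 decimals as they are in the cell above.

```python
import numpy as np
import pandas as pd


def summarize_clusters(measurements: pd.DataFrame) -> pd.DataFrame:
    """Per-cluster frequency bounds and band mode, with the noise label (-1) dropped."""
    summary = measurements.groupby('cluster_label').agg(
        min_freq=('med_freq', 'min'),
        max_freq=('med_freq', 'max'),
        count_freq=('med_freq', 'count'),
        # Series.mode() returns every tied value in ascending order; keep the lowest
        band_mode=('band', lambda s: s.mode().iloc[0]),
    )
    return summary.drop(index=-1, errors='ignore')


def project_hit_rate(meas_freqs, clusters: pd.DataFrame) -> float:
    """Fraction of measurements that fall inside at least one cluster interval."""
    freqs = np.asarray(meas_freqs, dtype=float)
    if freqs.size == 0 or clusters.empty:
        return 0.0
    lower = clusters['min_freq'].to_numpy()
    upper = clusters['max_freq'].to_numpy()
    # Broadcast to a (n_measurements, n_clusters) boolean matrix
    inside = (freqs[:, None] >= lower) & (freqs[:, None] <= upper)
    # A measurement counts once, no matter how many clusters contain it
    return float(inside.any(axis=1).mean())


# Toy measurements (hypothetical values): two clusters plus one noise point
toy = pd.DataFrame({
    'band': [3, 3, 3, 6, 6, 6, 9],
    'med_freq': [100.7, 101.5, 103.0, 230.5, 231.0, 231.2, 650.0],
    'cluster_label': [0, 0, 0, 1, 1, 1, -1],
})
clusters = summarize_clusters(toy)

# Restrict to clusters in the (assumed) top two predicted bands, then score one project
top_bands = [3, 6]
recommended = clusters[clusters['band_mode'].isin(top_bands)]
rate = project_hit_rate([101.0, 230.8, 250.0, 345.0], recommended)
print(rate)  # 2 of the 4 measurements fall inside a recommended cluster
```

Building the full measurements-by-clusters matrix should be cheap at the sizes this notebook prints (hundreds of measurements and tens of clusters per topic); for much larger inputs a sorted-interval lookup may be a better fit.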

HDBSCAN Results for topic 0
Number of projects in topic: 83
Total number of measurements: 624
Estimated number of noise measurements: 107
Noise percentage: 0.171
Signal to noise ratio: 0.829

Topic 0 cluster width stats:
count    55.0000
mean      2.3475
std       3.3641
min       0.0450
25%       0.5200
50%       0.9800
75%       2.1075
max      15.8700
dtype: float64

     mean_freq  min_freq  max_freq  count_freq  band_min  band_max  band_mode
15  344.085294   336.325   352.195        34.0       7.0       7.0        7.0
27  101.746591   100.695   103.045        22.0       3.0       3.0        3.0
31  239.497381   236.420   242.795        21.0       6.0       6.0        6.0
29  142.870313   139.835   146.610        16.0       4.0       4.0        4.0
18  269.946250   265.775   273.905        16.0       6.0       6.0        6.0
47  218.575667   218.305   218.915        15.0       6.0       6.0        6.0
52  230.712143   230.515   231.030        14.0       6.0       6.0        6.0
41 