# Potential Data Discretization

The goal of this notebook is to explore further options and develop a function that discretizes the possible frequency ranges into targets for a deep learning (or other) model.

## Things to fix
- Drop ranges that are "impossible", i.e. ranges in which nothing has been observed
- Probably convert to NumPy instead of base Python
- Return a matrix instead of a list/vector of probabilities (this may not be necessary)

In [8]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import plotly.express as px

measurements = pd.read_csv('./nrao_measurements.csv')

In [10]:
measurements = measurements[measurements.diff_freq < 5]
lines = measurements.query('fs_type == "line"')
lines = lines[['project_code', 'diff_freq', 'med_freq']]
lines.head()

| | project_code | diff_freq | med_freq |
|---|---|---|---|
| 0 | 2011.0.00010.S | 0.24 | 90.5 |
| 1 | 2011.0.00010.S | 0.23 | 90.815 |
| 2 | 2011.0.00010.S | 0.23 | 91.805 |
| 3 | 2011.0.00010.S | 0.23 | 93.005 |
| 4 | 2011.0.00010.S | 0.94 | 218.06 |


In [27]:
len(sorted(lines['diff_freq'].round(2).unique()))

74

In [26]:
len(sorted(lines['med_freq'].round(2).unique()))

14800

In [28]:
len(lines)

44230

In [30]:
sorted(lines['diff_freq'].round(2).unique())

[0.05,
 0.06,
 0.07,
 0.08,
 0.09,
 0.1,
 0.11,
 0.12,
 0.13,
 0.14,
 0.15,
 0.17,
 0.18,
 0.23,
 0.24,
 0.25,
 0.26,
 0.27,
 0.28,
 0.29,
 0.3,
 0.39,
 0.46,
 0.47,
 0.48,
 0.49,
 0.5,
 0.51,
 0.52,
 0.53,
 0.59,
 0.63,
 0.93,
 0.94,
 0.95,
 0.96,
 0.97,
 0.98,
 0.99,
 1.0,
 1.01,
 1.02,
 1.03,
 1.05,
 1.06,
 1.43,
 1.86,
 1.87,
 1.88,
 1.89,
 1.9,
 1.91,
 1.92,
 1.93,
 1.94,
 1.97,
 1.98,
 1.99,
 2.0,
 2.01,
 2.02,
 2.03,
 2.04,
 2.05,
 2.06,
 2.07,
 2.08,
 2.09,
 2.1,
 2.12,
 2.13,
 2.14,
 2.15,
 3.89]

From the above, we can see that the greatest possible number of targets (when rounding to 2 decimal places) is 74 x 14800 = 1,095,200.
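The product above is an upper bound, since it pairs every rounded difference with every rounded median. A minimal sketch of how that bound could be compared with the pairs that actually occur, using a small made-up table (`toy` and its values are assumptions for illustration, not the real `lines` data):

```python
import pandas as pd

# Toy stand-in for the `lines` table (values are made up for illustration).
toy = pd.DataFrame({
    "diff_freq": [0.24, 0.23, 0.23, 0.94, 0.94],
    "med_freq": [90.5, 90.815, 91.805, 218.06, 218.06],
})

rounded = toy[["diff_freq", "med_freq"]].round(2)

# Upper bound: every rounded diff paired with every rounded median.
upper_bound = rounded["diff_freq"].nunique() * rounded["med_freq"].nunique()

# Pairs that actually occur together in the data.
observed_pairs = len(rounded.drop_duplicates())

print(upper_bound, observed_pairs)
```

On the real table, the same two numbers would show how sparse the full grid is before any windowing is applied.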

In [48]:
diffs = sorted(lines['diff_freq'].round(2).unique())
meds = sorted(lines['med_freq'].round(2).unique())

The functions below create the possible ranges from the desired window sizes.
- group_values_by_range: helper that builds the ranges along one axis (median frequency or frequency difference). It starts at the lowest unique value in the lists above and only opens a window at a value that actually occurs (i.e. it doesn't create windows in which no project has an observation).
- build_range: calls group_values_by_range for both the median and the difference in frequency and returns a list of [difference range, median range] pairs, one for each "box" on the graph made in the EDA notebook.

In [51]:
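def group_values_by_range(lst, group_size):
    '''
    Returns [start, end) windows of width group_size covering the values in lst.
    A new window only opens at a value that actually occurs in lst.
    '''
    lst = sorted(lst)  # Sorted copy, so the caller's list is not modified
    ranges = []
    start = lst[0]  # Start with the minimum value

    # Note: because of the strict "<", a maximum value that would open its own
    # window is left without a range.
    while start < lst[-1]:
        end = start + group_size
        ranges.append([start, end])

        # Move to the next value that's greater than or equal to the current end
        start = next((val for val in lst if val >= end), lst[-1] + 1)

    return ranges

def build_range(med_vals, diff_vals=0.2, remove_imp=True):
    '''
    Returns every [diff range, med range] pair, using a window width of
    diff_vals for diff_freq and med_vals for med_freq.
    Relies on the notebook-level lists diffs and meds.
    remove_imp is not used yet (see "Things to fix").
    '''
    diff_range = group_values_by_range(diffs, diff_vals)
    med_range = group_values_by_range(meds, med_vals)

    ranges = []
    for item1 in diff_range:
        for item2 in med_range:
            ranges.append([item1, item2])

    return ranges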
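To make the windowing concrete, here is a small sketch on made-up integer values. The helper is restated so the snippet runs on its own. `group_values_inclusive` is a hypothetical variant, not part of the notebook, that also opens a window when the largest value sits alone at the end of the list.

```python
def group_values_by_range(lst, group_size):
    # Restated from the notebook so this snippet is self-contained.
    lst = sorted(lst)
    ranges = []
    start = lst[0]
    while start < lst[-1]:
        end = start + group_size
        ranges.append([start, end])
        start = next((val for val in lst if val >= end), lst[-1] + 1)
    return ranges


def group_values_inclusive(lst, group_size):
    # Hypothetical variant: "<=" lets an isolated maximum open its own window.
    lst = sorted(lst)
    ranges = []
    start = lst[0]
    while start <= lst[-1]:
        end = start + group_size
        ranges.append([start, end])
        start = next((val for val in lst if val >= end), lst[-1] + 1)
    return ranges


values = [1, 2, 3, 10, 11, 20]  # made-up values with gaps
print(group_values_by_range(values, 5))   # [[1, 6], [10, 15]]
print(group_values_inclusive(values, 5))  # [[1, 6], [10, 15], [20, 25]]
```

No window is created for the gap between 6 and 10, which is the intended behavior. In the first version the value 20 ends up without a range. Whether that matters for the real data depends on whether the maximum of `diffs` or `meds` happens to start a new window.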

Below is a function that creates a new table with the new target values.

In [69]:
def target_discretization(df, ranges):
    '''
    This function creates a new dataframe with the project code and a new target vector,
    which represents the target probabilities of a given project code
    to use for a machine learning model. It discretizes a large range of values
    (in this case, frequencies) into the categories set in the ranges parameter.

    param df, pandas.DataFrame: dataframe of measurements

    param ranges, nested list: list of [diff range, med range] pairs to sort each
    project's observations into. The length of the first dimension of this list
    determines the length of the truth vector returned, e.g.
    [[[0.05, 0.25], [36.08, 41.08]],]
    '''
    df_new = pd.DataFrame(columns=['project_code', 'target'])  # new df to return
    for pc in df.project_code.unique():  # loop through all line projects
        truth_vals = [0 for _ in range(len(ranges))]  # one count per range
        df_small = df[df['project_code'] == pc]  # subset to this project's observations
        for i in range(len(df_small)):  # loop through all observations
            diff_f = df_small.iloc[i]['diff_freq']
            med_f = df_small.iloc[i]['med_freq']
            # Loop through ranges and match the observation to its range
            for a, (diff_rng, med_rng) in enumerate(ranges):
                if diff_rng[0] <= diff_f < diff_rng[1] and med_rng[0] <= med_f < med_rng[1]:
                    truth_vals[a] += 1  # add 1 for each observation in the given range
        pos = sum(truth_vals)  # now we change counts to probabilities
        for a in range(len(truth_vals)):
            if truth_vals[a] != 0:
                truth_vals[a] = truth_vals[a] / pos
        df_new.loc[len(df_new)] = [pc, truth_vals]  # append to return dataframe

    return df_new
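The "Things to fix" list mentions moving to NumPy and returning a matrix. A minimal sketch of what that could look like, assuming the same column names and the same nested `ranges` layout as above. `target_matrix` is a hypothetical name, and the toy data is made up.

```python
import numpy as np
import pandas as pd


def target_matrix(df, ranges):
    """Hypothetical vectorized counterpart: one row per project, one column per box."""
    bounds = np.asarray(ranges, dtype=float)        # shape (n_boxes, 2, 2)
    diff = df["diff_freq"].to_numpy()[:, None]      # shape (n_obs, 1)
    med = df["med_freq"].to_numpy()[:, None]        # shape (n_obs, 1)

    # Broadcasting compares every observation against every box at once.
    in_box = (
        (diff >= bounds[:, 0, 0]) & (diff < bounds[:, 0, 1])
        & (med >= bounds[:, 1, 0]) & (med < bounds[:, 1, 1])
    )                                               # shape (n_obs, n_boxes)

    # Count hits per project, keeping projects in order of first appearance.
    counts = (
        pd.DataFrame(in_box.astype(int), index=df["project_code"].to_numpy())
        .groupby(level=0, sort=False)
        .sum()
    )

    # Normalize each row; projects with no hit in any box keep a row of zeros.
    totals = counts.sum(axis=1)
    return counts.div(totals.where(totals > 0, 1), axis=0)


toy = pd.DataFrame({
    "project_code": ["A", "A", "A", "B"],
    "diff_freq": [0.24, 0.23, 0.94, 0.94],
    "med_freq": [90.5, 91.8, 218.06, 218.06],
})
toy_ranges = [[[0.0, 0.5], [90.0, 95.0]], [[0.5, 1.0], [215.0, 220.0]]]
probs = target_matrix(toy, toy_ranges)
print(probs)
```

The intermediate boolean array has one entry per observation and box, so for about 44,000 observations and 854 boxes it holds roughly 38 million values. That should fit in memory, but the array grows quickly with finer windows.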

## Let's see if it worked!

In [71]:
target = target_discretization(lines, build_range(5))

In [76]:
target.iloc[0]['target'] 

[0,
 0,
 0,
 0.3,
 0.2,
 0.05,
 0.05,
 0,
 0.2,
 0,
 0,
 ...

(output truncated; every further entry displayed is 0)


In [81]:
sum(target.iloc[84]['target']) # sum to 1, good!

1.0

In [77]:
len(target.iloc[0]['target'])

854

In [59]:
lines.iloc[0]['diff_freq']

0.2400000000000091

In [78]:
len(build_range(5)) # yes! shape of target array should match number of targets

854
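Since each target vector has one entry per box, the "impossible" boxes from the "Things to fix" list are the columns that stay at zero for every project. A minimal sketch of how they could be dropped, assuming the targets are stacked into a matrix. `drop_unobserved_boxes` is a hypothetical helper, and the toy values are made up.

```python
import numpy as np


def drop_unobserved_boxes(targets, ranges):
    """Hypothetical helper: keep only boxes hit by at least one project."""
    matrix = np.asarray(targets, dtype=float)   # shape (n_projects, n_boxes)
    keep = matrix.sum(axis=0) > 0               # True where any project has mass
    kept_ranges = [rng for rng, flag in zip(ranges, keep) if flag]
    return matrix[:, keep], kept_ranges


toy_targets = [[0, 0.5, 0.5], [0, 1, 0]]
toy_ranges = [
    [[0.05, 0.25], [36.0, 41.0]],
    [[0.05, 0.25], [90.0, 95.0]],
    [[0.93, 1.13], [215.0, 220.0]],
]
matrix, kept_ranges = drop_unobserved_boxes(toy_targets, toy_ranges)
print(matrix.shape, len(kept_ranges))
```

On the notebook's data this would presumably be called as `drop_unobserved_boxes(target['target'].tolist(), build_range(5))`. Rows still sum to 1 afterwards, because only all-zero columns are removed.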

## Let's check whether it correctly created probabilities for a random project

In [82]:
target.iloc[1234]

project_code                                       2016.1.00854.S
target          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: 1234, dtype: object

In [83]:
lines[lines['project_code'] == '2016.1.00854.S']

| | project_code | diff_freq | med_freq |
|---|---|---|---|
| 18518 | 2016.1.00854.S | 1.87 | 228.435 |


In [85]:
target.iloc[1234]['target'].index(1)

640

In [86]:
build_range(5)[640]

[[1.86, 2.06], [226.76, 231.76]]
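The manual check above can be wrapped in a small lookup that returns the index of the box containing a single observation. This is only a sketch: `find_box` is a hypothetical helper, and the two-box `toy_ranges` list stands in for the output of `build_range(5)`.

```python
def find_box(ranges, diff_f, med_f):
    """Hypothetical helper: index of the first box containing the observation, else None."""
    for i, (diff_rng, med_rng) in enumerate(ranges):
        if diff_rng[0] <= diff_f < diff_rng[1] and med_rng[0] <= med_f < med_rng[1]:
            return i
    return None


toy_ranges = [
    [[0.05, 0.25], [226.76, 231.76]],
    [[1.86, 2.06], [226.76, 231.76]],
]
print(find_box(toy_ranges, 1.87, 228.435))  # 1
print(find_box(toy_ranges, 3.89, 228.435))  # None
```

A `None` result flags an observation that falls outside every box and would therefore contribute nothing to its project's target vector.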