⚔️ Side Quest Notebook: Imputation Optimization ⚔️
==============================================================

**Author:** Xavier R Nogueira

**Overview:** In my first competition notebook, `NB1_PreProcessing_Data.ipynb`, missing values in the Protein and Peptide training datasets were imputed using both Iterative and KNN imputation. That notebook remains the first step in my workflow. Here we explore whether tuning each method's parameters improves its imputation accuracy. In later notebooks we will make predictions using training data filled by both methods, and evaluate the results at the prediction-task level.

**Methodology:**
1. Load the columnar-formatted `protein_data_raw.parquet` and `peptide_data_raw.parquet` training data files into `pd.DataFrame`s. Combine them into one table.
2. Combine the Protein/Peptide boolean missing data masks. Make a dictionary that returns the indices where there IS data for a given column.
3. Set up a version of K-Fold CV where a different subset of cells is converted to `np.nan` in each fold, such that every non-empty cell gets converted exactly once. Evaluate imputation accuracy on the converted cells.
4. Run `Optuna` evaluation for both imputation methods across their parameter space.
5. Record all results in a `pd.DataFrame` such that if we eliminate features later, we can focus on the imputation method that provides the best performance for our subselection of columns.
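
As a rough illustration of the mask-and-score idea in step 3, here is a minimal sketch on a toy table. The helper name `masked_rmse`, the toy data, and the choice of RMSE as the accuracy metric are all assumptions made for this example; the notebook itself does not fix a metric at this point.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def masked_rmse(data_df, holdout_mask, imputer):
    """Hide the cells flagged in holdout_mask, impute, and score only those cells.

    Hypothetical helper: holdout_mask must only flag cells that hold real values.
    """
    holdout = holdout_mask.to_numpy()
    masked_df = data_df.mask(holdout_mask)  # flagged cells become NaN
    imputed = np.asarray(imputer.fit_transform(masked_df))
    truth = data_df.to_numpy()[holdout]
    preds = imputed[holdout]
    return float(np.sqrt(np.mean((truth - preds) ** 2)))


# toy table with no missing values, so every held-out cell has a known truth
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(rng.normal(size=(20, 4)), columns=list('abcd'))

# hold out a few cells in two of the columns
holdout = pd.DataFrame(False, index=toy_df.index, columns=toy_df.columns)
holdout.iloc[::5, 0] = True
holdout.iloc[2::5, 2] = True

score = masked_rmse(toy_df, holdout, KNNImputer(n_neighbors=3))
```

The same helper would accept an `IterativeImputer` instance, which is what makes a side-by-side comparison of the two methods possible.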

In [52]:
# core imports
import pandas as pd
import numpy as np
import hvplot.pandas
from typing import (
    List,
    Dict,
    Optional,
)

# enable experimental imputer
from sklearn.experimental import enable_iterative_imputer

# import our imputation algos
from sklearn.impute import (
    IterativeImputer,
    KNNImputer,
)

# Pull in data

## Combine raw data tables

In [3]:
# load in data from parquet
proteins_df = pd.read_parquet(
    'prepped_inputs/protein_data_raw.parquet',
    engine='pyarrow',
)
peptide_df = pd.read_parquet(
    'prepped_inputs/peptide_data_raw.parquet',
    engine='pyarrow',
)

In [4]:
# keep track of our protein / peptide columns
protein_cols = proteins_df.columns
peptide_cols = peptide_df.columns

# join the protein / peptide data
prot_and_peps_df = pd.concat(
    [proteins_df, peptide_df],
    axis=1,
)

In [5]:
prot_and_peps_df.head()

Unnamed: 0_level_0,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,O75144,O75326,O94919,P00441,P00450,P00734,P00736,P00738,P00746,P00747,P00748,P00751,P01008,P01009,P01011,P01019,P01023,P01024,P01031,P01033,P01034,P01042,P01344,P01591,P01594,P01608,P01621,P01717,P01780,P01833,P01834,P01857,P01859,P01860,P01861,P01876,P01877,P02452,P02647,P02649,P02652,P02655,P02656,P02671,P02675,P02679,P02747,P02748,P02749,P02750,...,VSPTDC(UniMod_4)SAVEPEAEK,VSTLPAITLK,VTAAPQSVC(UniMod_4)ALR,VTEIWQEVMQR,VTEPISAESGEQVER,VTGVVLFR,VTIKPAPETEKRPQDAK,VTIPTDLIASSGDIIK,VTLTC(UniMod_4)VAPLSGVDFQLR,VTSIQDWVQK,VTTVASHTSDSDVPSGVTEVVVK,VVEESELAR,VVEQMC(UniMod_4)ITQYER,VVVNFAPTIQEIK,VYAC(UniMod_4)EVTHQGLSSPVTK,VYC(UniMod_4)DMNTENGGWTVIQNR,VYTVDLGR,WC(UniMod_4)AVSEHEATK,WEAEPVYVQR,WELALGR,WGYC(UniMod_4)LEPK,WKNFPSPVDAAFR,WLPSSSPVTGYR,WQEEMELYR,WSGQTAIC(UniMod_4)DNGAGYC(UniMod_4)SNPGIPIGTR,WSRPQAPITGYR,WSSTSPHRPR,WYEIEKIPTTFENGR,WYFDVTEGK,YAMVYGYNAAYNR,YANC(UniMod_4)HLAR,YFIDFVAR,YGFIEGHVVIPR,YGLDSDLSC(UniMod_4)K,YGLVTYATYPK,YGQTIRPIC(UniMod_4)LPC(UniMod_4)TEGTTR,YHDRDVWKPEPC(UniMod_4)R,YIETDPANR,YIFHNFMER,YIVSGTPTFVPYLIK,YKAAFTEC(UniMod_4)C(UniMod_4)QAADK,YLFLNGNK,YLGEEYVK,YLQEIYNSNNQK,YLYEIAR,YNSQNQSNNQFVLYR,YPGPQAEGDSEGLSQGLVDREK,YPNC(UniMod_4)AYR,YPSLSIHGIEGAFDEPGTK,YQC(UniMod_4)YC(UniMod_4)YGR,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id
10053_0,9104.27,402321.0,,,7150.57,2497.84,83002.9,15113.6,167327.0,129048.0,53069.5,,11074.6,,774736.0,474672.0,,3594820.0,34217.2,365510.0,28713.8,475601.0,1849090.0,12825300.0,1084770.0,1197290.0,1005230.0,2669740.0,,79917.7,18811700.0,541909.0,120502.0,8277.92,,8776.14,196510.0,88369.3,12877.0,,2440800.0,3750260.0,2238830.0,49324.7,83537.6,1874270.0,75415.7,18309.7,4032650.0,5158030.0,123201.0,,33735.9,413906.0,336093.0,189063.0,117286.0,57607.3,1000890.0,194973.0,...,49307.3,86217.6,437354.0,85433.3,2497.84,,,85231.1,66733.6,339999.0,1579800.0,,292483.0,43339.3,1296660.0,61108.9,,561125.0,19881.1,,,594013.0,176748.0,187165.0,19940.3,117771.0,,723167.0,30318.9,,,,102861.0,229992.0,153939.0,127619.0,,,,11505.5,2831940.0,19533.7,1747090.0,34810.6,24442300.0,29038.9,114029.0,,103946.0,,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.3
10053_12,10464.2,435586.0,,,,,197117.0,15099.1,164268.0,108114.0,55856.4,,44516.3,,1025080.0,391328.0,,1992590.0,119396.0,365969.0,23347.6,406037.0,1769370.0,11871200.0,1135570.0,1178170.0,1230990.0,3360790.0,,78499.7,19343900.0,584371.0,76739.9,7124.72,14098.5,7592.37,185831.0,85195.6,5802.1,1860110.0,2030500.0,4872420.0,2165970.0,131029.0,178106.0,1800070.0,11026.3,30729.5,5111760.0,5201730.0,,17404.8,43910.3,562354.0,349269.0,156655.0,106468.0,60908.1,871500.0,236112.0,...,19065.9,78750.3,437547.0,126827.0,,,,62315.8,97854.4,195976.0,1616750.0,,289648.0,51576.8,996287.0,55446.7,1860110.0,,23515.0,,,681018.0,189668.0,237930.0,18230.6,123852.0,,617841.0,60135.0,,,,114651.0,211126.0,182877.0,112425.0,,,,30119.8,2527260.0,26771.7,1837800.0,28912.5,23326600.0,37132.6,143767.0,,118192.0,,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.8
10053_18,13235.7,507386.0,7126.96,24525.7,,2372.71,126506.0,16289.6,168107.0,163776.0,74672.3,,57117.2,53157.9,1104540.0,491425.0,75603.2,2683290.0,57081.2,309751.0,89467.3,523401.0,2821870.0,14319900.0,1444430.0,1597540.0,1333940.0,3728860.0,7907.66,87682.5,18772400.0,971184.0,94953.3,10847.8,13656.1,14548.2,89982.2,87116.5,18800.8,3372600.0,2106740.0,4331390.0,1420530.0,1191760.0,150159.0,1508300.0,92023.6,60339.2,5713180.0,8381290.0,245486.0,20860.6,27293.8,545170.0,565451.0,181571.0,168103.0,115004.0,1231770.0,246669.0,...,58428.7,115663.0,408756.0,159321.0,2372.71,,26433.2,74930.9,89718.0,220181.0,1631220.0,51511.0,291653.0,69825.6,849681.0,61443.9,3372600.0,1335870.0,22260.6,,25154.3,768850.0,237317.0,261271.0,15235.9,121775.0,7117.34,907390.0,,6072.97,66376.6,140059.0,154033.0,180510.0,200660.0,112493.0,26941.6,118795.0,13527.1,30630.3,2499730.0,32547.4,1688080.0,38889.5,18684300.0,30693.7,136396.0,128936.0,123094.0,6584.04,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.7
10138_12,12600.2,494581.0,9165.06,27193.5,22506.1,6015.9,156313.0,54546.4,204013.0,56725.0,62369.7,8008.41,29401.2,43100.8,1116740.0,996035.0,71763.9,2170220.0,112861.0,481992.0,71270.2,713993.0,3265820.0,11024900.0,1725440.0,1971000.0,1302590.0,4276570.0,18786.4,86710.8,19010200.0,1627070.0,101591.0,12807.2,17559.0,25025.4,164879.0,163960.0,18861.9,,4071840.0,6921380.0,3399840.0,967013.0,89678.0,5456100.0,139666.0,42239.7,22825800.0,12307400.0,1110760.0,,155479.0,1298780.0,1025430.0,323897.0,114470.0,115162.0,1420800.0,171188.0,...,86160.1,125382.0,410874.0,107322.0,6015.9,212331.0,31558.7,73569.3,49913.8,236691.0,1829510.0,51194.2,296983.0,65222.9,1809710.0,32784.1,,1461420.0,24210.4,412328.0,,862388.0,226891.0,607357.0,23988.0,136148.0,11908.8,723083.0,84101.6,5434.61,90660.2,262285.0,111201.0,236779.0,235144.0,169551.0,15406.9,171070.0,25189.9,21336.8,3549440.0,19802.1,2461190.0,75843.7,25956100.0,76535.8,96779.6,115465.0,120734.0,8246.13,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,12003.2,522138.0,4498.51,17189.8,29112.4,2665.15,151169.0,52338.1,240892.0,85767.1,70809.0,3589.96,49287.0,28399.4,1318710.0,1071020.0,67342.9,2953540.0,123219.0,368621.0,91217.9,781739.0,2085790.0,14145200.0,1544440.0,1562190.0,1221060.0,4570310.0,8922.14,131876.0,19243800.0,1301680.0,87044.3,9129.22,4250.41,20883.7,229282.0,218329.0,10807.2,2429060.0,2421520.0,6264990.0,1821340.0,640657.0,49962.2,3032880.0,105030.0,50106.3,11136600.0,10068700.0,571613.0,,95510.9,624499.0,930273.0,228675.0,110548.0,73480.3,1264300.0,205385.0,...,54389.4,127807.0,276816.0,130509.0,2665.15,285732.0,19171.9,93298.2,51890.0,302019.0,2675500.0,25011.6,475865.0,74352.1,925239.0,87236.2,2429060.0,550905.0,19792.2,516413.0,30321.5,731279.0,283783.0,297050.0,28874.8,120299.0,5818.59,893687.0,76436.4,2579.41,42120.0,237349.0,173130.0,196896.0,276282.0,218992.0,13685.0,94790.8,19992.0,33892.1,2416050.0,26228.7,2872410.0,45285.6,38684200.0,85872.0,104356.0,98727.5,96599.6,6023.11,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09


## Combine missing value matrices

In [6]:
# load in data from parquet
proteins_mask_df = pd.read_parquet(
    'prepped_inputs/protein_data_missing_values_mask.parquet',
    engine='pyarrow',
)
peptide_mask_df = pd.read_parquet(
    'prepped_inputs/peptide_data_missing_values_mask.parquet',
    engine='pyarrow',
)

In [8]:
# join the protein / peptide missing value masks
bool_mask_df = pd.concat(
    [proteins_mask_df, peptide_mask_df],
    axis=1,
)
bool_mask_df.head()

Unnamed: 0_level_0,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,O75144,O75326,O94919,P00441,P00450,P00734,P00736,P00738,P00746,P00747,P00748,P00751,P01008,P01009,P01011,P01019,P01023,P01024,P01031,P01033,P01034,P01042,P01344,P01591,P01594,P01608,P01621,P01717,P01780,P01833,P01834,P01857,P01859,P01860,P01861,P01876,P01877,P02452,P02647,P02649,P02652,P02655,P02656,P02671,P02675,P02679,P02747,P02748,P02749,P02750,...,VSPTDC(UniMod_4)SAVEPEAEK,VSTLPAITLK,VTAAPQSVC(UniMod_4)ALR,VTEIWQEVMQR,VTEPISAESGEQVER,VTGVVLFR,VTIKPAPETEKRPQDAK,VTIPTDLIASSGDIIK,VTLTC(UniMod_4)VAPLSGVDFQLR,VTSIQDWVQK,VTTVASHTSDSDVPSGVTEVVVK,VVEESELAR,VVEQMC(UniMod_4)ITQYER,VVVNFAPTIQEIK,VYAC(UniMod_4)EVTHQGLSSPVTK,VYC(UniMod_4)DMNTENGGWTVIQNR,VYTVDLGR,WC(UniMod_4)AVSEHEATK,WEAEPVYVQR,WELALGR,WGYC(UniMod_4)LEPK,WKNFPSPVDAAFR,WLPSSSPVTGYR,WQEEMELYR,WSGQTAIC(UniMod_4)DNGAGYC(UniMod_4)SNPGIPIGTR,WSRPQAPITGYR,WSSTSPHRPR,WYEIEKIPTTFENGR,WYFDVTEGK,YAMVYGYNAAYNR,YANC(UniMod_4)HLAR,YFIDFVAR,YGFIEGHVVIPR,YGLDSDLSC(UniMod_4)K,YGLVTYATYPK,YGQTIRPIC(UniMod_4)LPC(UniMod_4)TEGTTR,YHDRDVWKPEPC(UniMod_4)R,YIETDPANR,YIFHNFMER,YIVSGTPTFVPYLIK,YKAAFTEC(UniMod_4)C(UniMod_4)QAADK,YLFLNGNK,YLGEEYVK,YLQEIYNSNNQK,YLYEIAR,YNSQNQSNNQFVLYR,YPGPQAEGDSEGLSQGLVDREK,YPNC(UniMod_4)AYR,YPSLSIHGIEGAFDEPGTK,YQC(UniMod_4)YC(UniMod_4)YGR,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id
10053_0,False,False,True,True,False,False,False,False,False,False,False,True,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,False,False,False,False,True,False,False,False,False,True,False,False,True,True,False,False,False,False,False,True,False,False,True,True,True,False,False,False,False,True,True,True,False,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False,False,False,False,True,False
10053_12,False,False,True,True,True,True,False,False,False,False,False,True,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,True,False,False,False,False,True,False,False,False,False,False,True,False,True,True,False,False,False,False,False,True,False,False,True,True,True,False,False,False,False,True,True,True,False,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False,False,False,False,False,False
10053_18,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
10138_12,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
10138_24,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False


## Make a dictionary containing column headers as keys, and non-empty cell indices as values

In [17]:
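%%time
# True in the mask marks a missing cell, so for each column we keep
# the visit_id labels where the mask is False (i.e. a value exists)
values_exist_dict = {}
for col in bool_mask_df.columns:
    values_exist_dict[col] = list(
        bool_mask_df.loc[
            (bool_mask_df[col] == False)
        ].index
    )

# one entry per protein / peptide column
len(values_exist_dict)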

Wall time: 1.7 s


1195
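
To make the shape of this dictionary concrete, here is a small sketch on a toy mask. The names `toy_mask`, `toy_values`, and `observed` are made up for the example. The final check assumes the mask is simply the null pattern of the raw table, which matches the rows printed above but is not verified here for the full dataset.

```python
import numpy as np
import pandas as pd

# toy missing-value mask: True means the cell is missing
toy_mask = pd.DataFrame(
    {'P1': [False, True, False], 'PEP1': [True, True, False]},
    index=pd.Index(['10_0', '10_6', '11_0'], name='visit_id'),
)

# column -> list of index labels where a value exists
observed = {
    col: toy_mask.index[(~toy_mask[col]).to_numpy()].tolist()
    for col in toy_mask.columns
}

# toy raw values whose NaN pattern matches the mask
toy_values = pd.DataFrame(
    {'P1': [1.0, np.nan, 3.0], 'PEP1': [np.nan, np.nan, 6.0]},
    index=toy_mask.index,
)
mask_matches_data = bool((toy_values.isna() == toy_mask).all().all())
```

Running the equivalent of `mask_matches_data` on `prot_and_peps_df` and `bool_mask_df` would be a cheap way to confirm the two tables are aligned before any cells are masked.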

# Define functions for evaluation

**Note:** The following is a modified version of our main workflow defined in `Tabular_MachineLearning_Projects/ml_models`.

## Define function to run full matrix K-Fold

This function will need to randomly convert a proportion (1/K) of the real cell values to NaN for each fold, without ever converting the same cell twice.
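
One possible way to guarantee the "no cell twice" property is to assign every non-empty cell a fold number up front, column by column. The sketch below does that on a toy table. `assign_folds` and its parameters are hypothetical names, and it uses `np.array_split`, which spreads any remainder across folds rather than flooring with `// K` as the cells below do.

```python
import numpy as np
import pandas as pd


def assign_folds(data_df, n_folds=5, seed=0):
    """Give each non-empty cell a fold id in [0, n_folds); empty cells get -1."""
    rng = np.random.default_rng(seed)
    fold_ids = np.full(data_df.shape, -1, dtype=int)
    observed = data_df.notna().to_numpy()
    for j in range(data_df.shape[1]):
        # row positions holding real values in this column, in random order
        rows = np.flatnonzero(observed[:, j])
        rng.shuffle(rows)
        # disjoint chunks, so no cell can land in two folds
        for fold, chunk in enumerate(np.array_split(rows, n_folds)):
            fold_ids[chunk, j] = fold
    return pd.DataFrame(fold_ids, index=data_df.index, columns=data_df.columns)


toy_df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    'b': [np.nan, 2.0, 3.0, np.nan, 5.0, 6.0],
})
folds = assign_folds(toy_df, n_folds=3)

# the training matrix for fold 0: its cells are hidden, everything else is kept
fold_0_df = toy_df.mask(folds == 0)
```

Because the fold ids are computed once, looping over `range(n_folds)` and masking with `folds == k` touches every non-empty cell exactly once across the whole run.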

In [24]:
df = prot_and_peps_df.copy()
num_values_to_convert = df.apply(lambda x: len(x.dropna()) // 5)
num_values_to_convert.sort_values()

QALPQVR                    97
Q99829                     97
EPQVYTLPPSRDELTK          112
TPSGLYLGTC(UniMod_4)ER    118
SLEDQVEMLR                119
                         ... 
P00751                    222
P02749                    222
P02750                    222
P12109                    222
P36222                    222
Length: 1195, dtype: int64

In [28]:
np.random.choice(values_exist_dict['P00751'])

'42003_0'

In [47]:
choose_from_dict = values_exist_dict.copy()

In [None]:
# Set the values to NaN in the DataFrame at the selected indices
# (unexecuted sketch: `converted_rows` is not defined anywhere above)
df.values[~converted_rows, np.arange(len(df.columns))] = np.nan

In [None]:
def get_fold_matrix(
    data_df: pd.DataFrame,
    values_exist_dict: Dict[str, List[str]],
) -> pd.DataFrame:
    pass

In [56]:
def k_fold_cv(
    data_df: pd.DataFrame,
    values_exist_dict: Dict[str, List[str]],
    results_df: Optional[pd.DataFrame] = None,
    kfolds: int = 5,
):
    # find the number of values to include in our folds for each column
    num_values_to_convert = data_df.apply(lambda x: len(x.dropna()) // kfolds)

    # make a copy of our values_exist_dict to choose from
    # (each list is copied too, so the caller's dictionary is never mutated)
    choose_from_dict = {
        col: list(indices) for col, indices in values_exist_dict.items()
    }

    # set up true vs predicted values dict
    true_preds_dict = {}

    for fold in range(kfolds):
        # get the corresponding valid value indices for each column, excluding converted rows
        indices_to_convert = np.array(
            [np.random.choice(
                choose_from_dict[col],
                size=num_values_to_convert[col] - 1,
                replace=False,
            ) for col in list(choose_from_dict.keys())
            ],
            dtype='object',
        ).T

        # 


In [54]:
# find the number of values to include in our folds for each column
num_values_to_convert = df.apply(lambda x: len(x.dropna()) // 5)

# get the corresponding valid value indices for each column, excluding converted rows
indices_to_convert = np.array(
    [np.random.choice(
        values_exist_dict[col],
        size=num_values_to_convert[col],
        replace=False,
        #p=prob_dist[col],
    ) for col in list(values_exist_dict.keys())
    ],
    dtype='object',
).T

zipped_dict = dict(zip(values_exist_dict.keys(), indices_to_convert))