# Hypertrophic Cardiomyopathy Genes Cross-Validation
##### Selin Kubali
##### 12/13/2023
## Goal
Determine whether HCM risk differs between the bottom 25% and top 25% of carriers of missense and deleterious variants, ranked by predicted hazard score, in key hypertrophic cardiomyopathy-related genes.

#### How the code functions
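Use cross-validation to fit a Cox-PH model and predict a hazard score for each carrier. Then split carriers into a bottom and a top group by hazard score and test whether there is a statistically significant difference in HCM between them. The code in this notebook sweeps a single percentile cut-off from 1 to 100 and, at each cut-off, fits a second Cox-PH model on a top-versus-bottom indicator to obtain a p-value and a hazard ratio.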
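
As a minimal sketch of the bottom-25%/top-25% split named in the Goal, the snippet below builds the two groups from a vector of hazard scores with `np.percentile`. The function name `split_by_hazard` and the example scores are hypothetical and only illustrate the idea; they are not taken from the notebook's data.

```python
import numpy as np

def split_by_hazard(hazard, lower_pct=25, upper_pct=75):
    """Return boolean masks for the bottom and top groups by hazard score."""
    hazard = np.asarray(hazard, dtype=float)
    low_cut, high_cut = np.percentile(hazard, [lower_pct, upper_pct])
    bottom_mask = hazard <= low_cut   # carriers at or below the lower cut-off
    top_mask = hazard >= high_cut     # carriers at or above the upper cut-off
    return bottom_mask, top_mask

# Hypothetical hazard scores for eight carriers.
scores = [0.2, 0.5, 0.9, 1.0, 1.1, 1.4, 2.5, 4.0]
bottom_mask, top_mask = split_by_hazard(scores)
print("bottom group size:", int(bottom_mask.sum()))
print("top group size:", int(top_mask.sum()))
```

Carriers between the two cut-offs belong to neither group, which is what separates this quartile comparison from a single-threshold split.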

Cross-validation is done by splitting on variant data, to ensure there are an equal number of variants in each fold and prevent overfitting on high-frequency variants.
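
The variant-level split can be sketched as follows. This is an illustration under assumptions, not the notebook's implementation: it uses a NumPy permutation in place of `KFold`, and the carrier-to-variant mapping `carrier_variant` is made up. The point is that folds are formed over variants, so all carriers of one variant land in the same fold.

```python
import numpy as np

def variant_folds(n_variants, n_splits=5, seed=1):
    """Shuffle variant indices and cut them into n_splits near-equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_variants)
    return np.array_split(shuffled, n_splits)

# Hypothetical data: entry i is the index of the variant carried by person i.
# Variant 0 is common (four carriers); most others have a single carrier.
carrier_variant = np.array([0, 0, 0, 0, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9])

folds = variant_folds(n_variants=10)

# Map each variant to its fold, then each carrier to the fold of their variant.
fold_of_variant = np.empty(10, dtype=int)
for fold_id, test_variants in enumerate(folds):
    fold_of_variant[test_variants] = fold_id
fold_of_carrier = fold_of_variant[carrier_variant]

for fold_id, test_variants in enumerate(folds):
    n_carriers = int((fold_of_carrier == fold_id).sum())
    print(f"fold {fold_id}: {len(test_variants)} variants, {n_carriers} carriers")
```

Each fold holds the same number of variants, while the number of carriers per fold can differ when some variants are more frequent than others.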

#### Inputs
Lifelines files - from running generate_extracts_gnomAD.ipynb on UKBiobank in Cassa Lab Shared Project/selected_genes/hcm/notebooks. Stored in Cassa Lab Shared Project/selected_genes/hcm/lifelines_data. 
Variant data files - from running vep_processing.ipynb on UKBiobank in Cassa Lab Shared Project/selected_genes/hcm/notebooks. Stored in Cassa Lab Shared Project/selected_genes/hcm/parsed_vep_files

#### Note
Two HCM-related genes, DES and PLN, were excluded because they had too few variants for the model to converge.
PTPN11, TNNI3, and TTR each have few HCM cases among carriers of missense or deleterious variants, which may hinder convergence.

In [251]:
import pandas as pd
import numpy as np
from lifelines import CoxPHFitter
from sklearn.model_selection import KFold
from statsmodels.stats.multitest import multipletests
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt
from lifelines.statistics import logrank_test


In [252]:
def cross_val(gene):
        cph = CoxPHFitter(penalizer=0.0000001)

        # load lifelines file
        file_name = gene + '.csv'
        lifelines_data = pd.read_csv("/Users/uriel/Downloads/work_temp/cross_val_lifelines/" + file_name, dtype={
                'is_family_hist': 'boolean',
                'is_hcm': 'boolean'
                })

        # load variant data file (same file name, different directory)
        variant_data = pd.read_csv("/Users/uriel/Downloads/work_temp/variant_files/" + file_name)
        variant_data = variant_data[['Name']].copy()
        variant_data['var_index'] = variant_data.index

        # attach each variant's row index (var_index) to the matching lifelines rows;
        # var_index stays a regular column because the fold split below filters on it
        lifelines_data = variant_data.merge(lifelines_data, how="outer")

        # clean lifelines file; set pathogenicity for deleterious variants to 1
        # (am_pathogenicity is dropped on the next line, so it does not enter the model directly)
        lifelines_data.loc[lifelines_data['deleterious'] == 1, 'am_pathogenicity'] = 1
        lifelines_data = lifelines_data.drop(["Name", 'Carrier', 'index', 'deleterious', 'missense_variant', 'am_pathogenicity'], axis=1)
        lifelines_data = lifelines_data.dropna()

        # cross validation: split up phenotypic data file based on variant file index
        kf = KFold(n_splits=5, shuffle=True, random_state=1)
        testing_set = []
        for train_idx, test_idx in kf.split(variant_data):
                train = lifelines_data[lifelines_data['var_index'].isin(train_idx)]
                test = lifelines_data[lifelines_data['var_index'].isin(test_idx)]

                train = train.drop(['var_index'], axis=1)
                test = test.drop(['var_index'], axis=1)

                # fit CPH on the training folds and score the held-out fold
                cph.fit(train, duration_col="duration", event_col="is_hcm", fit_options={"step_size": 0.1})
                test = test.assign(hazard=cph.predict_partial_hazard(test))
                testing_set.append(test)

        # create new lifelines_data df by joining all testing sets
        lifelines_data = pd.concat(testing_set)

        return lifelines_data
    

In [253]:

def find_params(gene):
        thresholds_list = list(range(1, 101))

        p_vals = {}
        hazard_ratios = {}

        lifelines_data = cross_val(gene)

        # for each percentile threshold i, split patients into those below (bottom)
        # and at or above (top) the i-th percentile of hazard score
        for i in thresholds_list:
                percentiles = np.percentile(lifelines_data['hazard'], [i])
                bottom = lifelines_data[lifelines_data['hazard'] < percentiles[0]]
                top = lifelines_data[lifelines_data['hazard'] >= percentiles[0]]

                # build a two-group survival table: E = HCM event (0/1), T = duration
                dfA = pd.DataFrame({'E': bottom['is_hcm'].astype(int), 'T': bottom['duration'], 'is_highest': 0})
                dfB = pd.DataFrame({'E': top['is_hcm'].astype(int), 'T': top['duration'], 'is_highest': 1})
                df = pd.concat([dfA, dfB])

                # Cox-PH on the group indicator gives the hazard ratio and p-value for top vs. bottom
                cph = CoxPHFitter().fit(df, 'T', 'E', fit_options={"step_size": 0.1})
                hazard_ratios.update({i: cph.hazard_ratios_.at['is_highest']})
                p_vals.update({i: cph.summary['p'].at['is_highest']})

        return p_vals, hazard_ratios





#### Find threshold with lowest p-value 
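
The selection step in the cell below reduces to picking the dictionary key with the smallest value. The sketch here shows that step in isolation on a three-entry dictionary with illustrative values. The helper name `best_threshold` is hypothetical, and skipping NaN p-values is an added assumption (a fit that fails to converge could in principle yield NaN), not something the notebook does.

```python
import math

def best_threshold(p_vals):
    """Return the threshold whose p-value is smallest, ignoring NaN entries."""
    valid = {t: p for t, p in p_vals.items() if not math.isnan(p)}
    return min(valid, key=valid.get)

# Illustrative p-values keyed by percentile threshold.
p_vals = {1: 0.0216, 2: 0.1325, 3: 0.2774, 4: float("nan")}
chosen = best_threshold(p_vals)
print(chosen, p_vals[chosen])
```

Because `min` compares values through `key=valid.get`, ties resolve to the first threshold encountered in dictionary order.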

In [263]:
genes = ['ACTN2', 'ALPK3', 'FLNC','MYBPC3','MYH6', 'MYH7', 'PTPN11', 'TNNI3', 'TTR']
lowest_thresholds_by_p_val = {}
for gene in genes:
    p_vals, hazard_ratios = find_params(gene)
    # By minimizing p-value
    print(p_vals)
    associated_threshold = min(p_vals, key=p_vals.get)
    lowest_thresholds_by_p_val.update({gene:associated_threshold})




>>> events = df['E'].astype(bool)
>>> print(df.loc[events, 'is_highest'].var())
>>> print(df.loc[~events, 'is_highest'].var())

A very low variance means that the column is_highest completely determines whether a subject dies or not. See https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression.
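
The check suggested by this warning can be tried on a small table. The sketch below uses made-up data in which every event falls in the `is_highest == 1` group, so the variance of `is_highest` among events is zero. The column names follow the warning text; the values are hypothetical.

```python
import pandas as pd

# Hypothetical two-group table: E = event indicator, is_highest = group label.
df = pd.DataFrame({
    "E":          [1, 1, 1, 0, 0, 0, 0, 0],
    "is_highest": [1, 1, 1, 0, 1, 0, 0, 1],
})

events = df["E"].astype(bool)
var_events = df.loc[events, "is_highest"].var()
var_non_events = df.loc[~events, "is_highest"].var()
print(var_events, var_non_events)

# Zero variance on either side means the group label is constant there,
# i.e. one group contains all of the events or all of the non-events.
separated = (var_events == 0) or (var_non_events == 0)
print("possible complete separation:", bool(separated))
```

When one group has no events at all, the Cox coefficient for the group indicator has no finite maximum-likelihood estimate, which is why the fit struggles to converge in that situation.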




{1: 0.02164727509735597, 2: 0.13252223523664508, 3: 0.27737072332419743, 4: 0.43036051775470263, 5: 0.5770014255921483, 6: 0.7046043028953338, 7: 0.8200137563659363, 8: 0.9177498368707155, 9: 0.9901429769886402, 10: 0.8985654895467867, 11: 0.8160240430183959, 12: 0.7460135143136477, 13: 0.6820525836725156, 14: 0.6236754394842168, 15: 0.7938672769562953, 16: 0.8742263806095617, 17: 0.9510320492617251, 18: 0.9793631291496168, 19: 0.5150548018684715, 20: 0.5757982176085625, 21: 0.6379028511812646, 22: 0.7005898168404885, 23: 0.76660023643641, 24: 0.8312535512550009, 25: 0.8933101115307157, 26: 0.9568366043598333, 27: 0.9801246183374095, 28: 0.9163162559788205, 29: 0.8588863786553176, 30: 0.2979971428621576, 31: 0.336086572515725, 32: 0.3756541413825305, 33: 0.15729698190519692, 34: 0.17926156927397963, 35: 0.20313762055437154, 36: 0.23177388389319714, 37: 0.25886029403663935, 38: 0.28682980725865165, 39: 0.31840113353763233, 40: 0.3520308740004845, 41: 0.3921900301207517, 42: 0.1727320267



{1: 0.9963279105261834, 2: 0.7770654693744591, 3: 0.9421265070867624, 4: 0.7517308051486183, 5: 0.6027109525465482, 6: 0.4920128459609362, 7: 0.40524003434074507, 8: 0.7234187370745171, 9: 0.8474374872747565, 10: 0.9730580713505989, 11: 0.8896405798872771, 12: 0.7479679137371429, 13: 0.609529220911849, 14: 0.5065006926554803, 15: 0.41267557665534305, 16: 0.3365716568003532, 17: 0.27889592646241984, 18: 0.22969568149543923, 19: 0.1893089742007019, 20: 0.15397794845832913, 21: 0.1238195777665573, 22: 0.10349345313973464, 23: 0.08550281429308408, 24: 0.07126911585055871, 25: 0.05848123724211728, 26: 0.04783986028385734, 27: 0.03972394049456865, 28: 0.03263473333161455, 29: 0.027127662830300877, 30: 0.02227348870382817, 31: 0.018308609940783163, 32: 0.014814306641765472, 33: 0.012228610857609066, 34: 0.009963144098360537, 35: 0.008079124010368768, 36: 0.006497376615612944, 37: 0.005098462915947554, 38: 0.004039766774210364, 39: 0.00328741116947114, 40: 0.002647119805645713, 41: 0.002096807



{1: 0.02196724077428712, 2: 0.2103902282599324, 3: 0.5079473086270471, 4: 0.786298358733567, 5: 0.9610162535608322, 6: 0.7186272707766317, 7: 0.9404750707779458, 8: 0.8651757236895047, 9: 0.6955430416719636, 10: 0.5539659476286709, 11: 0.43420712850167054, 12: 0.33766980911377814, 13: 0.2612898225453203, 14: 0.41881079891258144, 15: 0.3365774087053597, 16: 0.5106751304506538, 17: 0.4231495277653028, 18: 0.34215862793884133, 19: 0.27204457073388594, 20: 0.2131605558905465, 21: 0.16923470917590194, 22: 0.1333635181653688, 23: 0.10236287939228987, 24: 0.0807352421585531, 25: 0.0633766886406579, 26: 0.04842625156008207, 27: 0.036802596184869134, 28: 0.028051061276865984, 29: 0.021033027657262907, 30: 0.03779498015477863, 31: 0.02808811653838781, 32: 0.021157490209343943, 33: 0.01553208045736618, 34: 0.028605146837479136, 35: 0.02112842417865996, 36: 0.08816059043474987, 37: 0.0684891534439287, 38: 0.05241954974166877, 39: 0.03858706943199308, 40: 0.028683372369153203, 41: 0.021158099335248



{1: 0.9952942946774394, 2: 0.9952542654311343, 3: 0.9963307473522626, 4: 0.8003690047515505, 5: 0.5705568279721067, 6: 0.4076388853546742, 7: 0.3036435742536168, 8: 0.23031289702684585, 9: 0.17592655988143016, 10: 0.1340004004814885, 11: 0.10608623518063737, 12: 0.08391169209784277, 13: 0.06681108268032884, 14: 0.09018688969110487, 15: 0.06849350480590323, 16: 0.05269575620934089, 17: 0.04055694207318303, 18: 0.0316053895502049, 19: 0.024587050443867713, 20: 0.019568137591264818, 21: 0.015263361985809583, 22: 0.011980832048950021, 23: 0.009604025500278458, 24: 0.007602691823191878, 25: 0.006001660062685523, 26: 0.004766905200133009, 27: 0.003782103434895189, 28: 0.0030763070315006054, 29: 0.0033738494219438656, 30: 0.004708730980501644, 31: 0.007444787958097422, 32: 0.005466860162230972, 33: 0.00900659975199026, 34: 0.006619110836142151, 35: 0.0047825920307109325, 36: 0.007645766772361312, 37: 0.012658497520415025, 38: 0.009315679959449564, 39: 0.006775889943727782, 40: 0.0049044805437



{1: 0.9952858411310026, 2: 0.9953717717136688, 3: 0.996178388503601, 4: 0.9955772294154747, 5: 0.9950859419689282, 6: 0.994599503377342, 7: 0.9961154133492475, 8: 0.995914705344727, 9: 0.9956583153218572, 10: 0.9954666243139264, 11: 0.9951939067291511, 12: 0.9949991721586774, 13: 0.9948178671685016, 14: 0.9946844726483903, 15: 0.9945298352772258, 16: 0.9943560363329939, 17: 0.9941503034568495, 18: 0.9961698277484359, 19: 0.995967155442538, 20: 0.23798323971716281, 21: 0.21567069882306833, 22: 0.19640014950519438, 23: 0.1802509987660681, 24: 0.16574911302042966, 25: 0.15109467474943075, 26: 0.13916516264323578, 27: 0.1278305068551842, 28: 0.11743467534646537, 29: 0.10718181677740483, 30: 0.09732972918053691, 31: 0.0887222006598893, 32: 0.07963145185298681, 33: 0.07319574670897264, 34: 0.06636977828717167, 35: 0.06005030364005828, 36: 0.12151365981235664, 37: 0.10792915971866045, 38: 0.09684404488357036, 39: 0.08655509445335595, 40: 0.07821720640292705, 41: 0.07005780158501233, 42: 0.063



{1: 0.3575379222551429, 2: 0.9411605841234768, 3: 0.899667772062504, 4: 0.7141632312421748, 5: 0.4646403494593315, 6: 0.6036316803085846, 7: 0.41305611695688416, 8: 0.5234890742048508, 9: 0.3562742080726218, 10: 0.24945709976773534, 11: 0.17969907136658458, 12: 0.12331216265274925, 13: 0.08697683412948595, 14: 0.0610090784101023, 15: 0.04395329531052076, 16: 0.031135030627880825, 17: 0.02189088556129764, 18: 0.015516408547305752, 19: 0.021999292394067952, 20: 0.015110546611774031, 21: 0.021671154330144552, 22: 0.01500495034399758, 23: 0.010417272219643548, 24: 0.007030461685195436, 25: 0.009951341580960636, 26: 0.00671437283880389, 27: 0.004676344150114684, 28: 0.0031319154957471676, 29: 0.00461578112603859, 30: 0.0067642093120910875, 31: 0.009721537822472318, 32: 0.014259941432082566, 33: 0.01980254258925297, 34: 0.013265958478533306, 35: 0.019158048234728737, 36: 0.028058080999636217, 37: 0.019908024082959833, 38: 0.012862524932589393, 39: 0.018927632766071328, 40: 0.0123133591683333


>>> events = df['is_hcm'].astype(bool)
>>> print(df.loc[events, 'trv'].var())
>>> print(df.loc[~events, 'trv'].var())

A very low variance means that the column trv completely determines whether a subject dies or not. See https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression.




{1: 0.0017207118110580985, 2: 0.03125088348793606, 3: 0.07415935085591022, 4: 0.10388727241477305, 5: 0.15552051736334105, 6: 0.22073591136765747, 7: 0.27165720958475476, 8: 0.3468735390369163, 9: 0.40268315777274344, 10: 0.4484386808555989, 11: 0.5031857017588324, 12: 0.5941002019287218, 13: 0.6689462061871001, 14: 0.7322727390742186, 15: 0.7881952806917616, 16: 0.8675557565594734, 17: 0.9113936815591639, 18: 0.9743759148063592, 19: 0.9744661876063584, 20: 0.9328632868742173, 21: 0.8739374742250433, 22: 0.823138567138845, 23: 0.7668601078275756, 24: 0.716819354978791, 25: 0.579565642112161, 26: 0.6290512870014072, 27: 0.6754870648914618, 28: 0.7265633663446471, 29: 0.7779769334021264, 30: 0.8327323816188987, 31: 0.8875321474480165, 32: 0.9420415292551751, 33: 0.9929472425610365, 34: 0.9599104509465918, 35: 0.9270873584147941, 36: 0.8723011466706934, 37: 0.8240525912628013, 38: 0.7799097847369295, 39: 0.7323891310555646, 40: 0.6836992701052436, 41: 0.6578144136764781, 42: 0.61426730397


>>> events = df['is_hcm'].astype(bool)
>>> print(df.loc[events, 'is_family_hist'].var())
>>> print(df.loc[~events, 'is_family_hist'].var())

A very low variance means that the column is_family_hist completely determines whether a subject dies or not. See https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression.




{1: 0.996068153353168, 2: 0.9967741677866598, 3: 0.9962364851186819, 4: 0.9955419392301329, 5: 0.9967658840063406, 6: 0.996468598526387, 7: 0.9961944903351694, 8: 0.9958380052582614, 9: 0.9956030727387317, 10: 0.9953354284086007, 11: 0.9950403739804764, 12: 0.9965774543048526, 13: 0.9964074696172962, 14: 0.9962684645540155, 15: 0.9961120449645077, 16: 0.995940240206917, 17: 0.9957960861326497, 18: 0.5259979992197131, 19: 0.5612819379974585, 20: 0.5903474424135251, 21: 0.6367818324251706, 22: 0.6825497897683529, 23: 0.7106087442580076, 24: 0.7496435040830485, 25: 0.7825749922554437, 26: 0.8205587471233649, 27: 0.8579600126066507, 28: 0.8947724773972408, 29: 0.9309922998720455, 30: 0.9566157369710148, 31: 0.9864897373477173, 32: 0.2590393755858508, 33: 0.27490756572710273, 34: 0.28849058371096525, 35: 0.30512028461052965, 36: 0.32216457310538577, 37: 0.3396222858386344, 38: 0.3574925488541877, 39: 0.37271249958985486, 40: 0.39133472878472997, 41: 0.41036839336226727, 42: 0.42981362460052



{1: 0.9968485075149895, 2: 0.9956842506616739, 3: 0.9965726265924288, 4: 0.9960648622965997, 5: 0.9956154581634772, 6: 0.9968761620860932, 7: 0.99663158188522, 8: 0.9964036148677201, 9: 0.996189274979772, 10: 0.9959863752518738, 11: 0.9957932567658444, 12: 0.9956086256559346, 13: 0.9954314496200577, 14: 0.9952608891492659, 15: 0.9950962506838417, 16: 0.9949369531955304, 17: 0.9966174535050475, 18: 0.9965204069285097, 19: 0.996426007476802, 20: 0.9963340485514909, 21: 0.9962317098512381, 22: 0.9961443976300909, 23: 0.9960590231607164, 24: 0.9959754612503484, 25: 0.9958935989684963, 26: 0.9958133344537252, 27: 0.9957345753775485, 28: 0.9956572374799159, 29: 0.9955812437233712, 30: 0.9955065235802468, 31: 0.9954330119996749, 32: 0.9953606486609252, 33: 0.9952893785385895, 34: 0.9952191493943311, 35: 0.9951499134041577, 36: 0.9950816254421861, 37: 0.9950142437425605, 38: 0.9949477286927774, 39: 0.9948820429711674, 40: 0.9948171515616043, 41: 0.9947439205193348, 42: 0.9946806224515216, 43: 

#### Find threshold with highest hazard ratio

In [272]:
genes = ['ACTN2', 'ALPK3', 'FLNC','MYBPC3','MYH6', 'MYH7', 'PTPN11', 'TNNI3', 'TTR']
# note: the dict name says "odds ratio", but the values compared are Cox hazard ratios
threshold_by_odds_ratio = {}
for gene in genes:
    p_vals, hazard_ratios = find_params(gene)
    # By maximizing hazard ratio
    associated_threshold = max(hazard_ratios, key=hazard_ratios.get)
    threshold_by_odds_ratio.update({gene:associated_threshold})



#### Rerun CoxPH cross-validation with chosen threshold
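
The cell below adjusts the per-gene p-values with `multipletests(..., method='bonferroni')`. As a hand-rolled sketch of what that adjustment does, the snippet multiplies each p-value by the number of tests and caps the result at 1. The function name `bonferroni` and the gene labels and p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    n_tests = len(p_values)
    return {key: min(p * n_tests, 1.0) for key, p in p_values.items()}

# Hypothetical raw p-values for three genes.
raw = {"geneA": 0.001, "geneB": 0.02, "geneC": 0.4}
adjusted = bonferroni(raw)
for gene, p in adjusted.items():
    print(gene, p)
```

With nine genes, as in this notebook, any raw p-value above roughly 0.111 is capped at 1 after adjustment.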

In [269]:
def find_threshold_vals(thresholds_by_gene):

        p_vals = {}
        hazard_ratios = {}

        # for each gene, split patients at the chosen hazard-score percentile

        for gene in thresholds_by_gene:
                threshold = thresholds_by_gene[gene]

                lifelines_data = cross_val(gene)

                percentiles = np.percentile(lifelines_data['hazard'], [threshold])
                bottom = lifelines_data[lifelines_data['hazard'] < percentiles[0]]
                top = lifelines_data[lifelines_data['hazard'] >= percentiles[0]]

                # build a two-group survival table: E = HCM event (0/1), T = duration
                dfA = pd.DataFrame({'E': bottom['is_hcm'].astype(int), 'T': bottom['duration'], 'is_highest': 0})
                dfB = pd.DataFrame({'E': top['is_hcm'].astype(int), 'T': top['duration'], 'is_highest': 1})
                df = pd.concat([dfA, dfB])

                cph = CoxPHFitter().fit(df, 'T', 'E', fit_options={"step_size": 0.1})
                hazard_ratios.update({gene: cph.hazard_ratios_.at['is_highest']})
                p_vals.update({gene: cph.summary['p'].at['is_highest']})

        # Bonferroni-adjust the per-gene p-values once all genes are fitted
        p_adjusted = multipletests(list(p_vals.values()), alpha=0.05, method='bonferroni')
        updated_p_dict = {key: new_p_val for key, new_p_val in zip(p_vals.keys(), p_adjusted[1])}

        return updated_p_dict, hazard_ratios

In [257]:
## graphs
# Kaplan-Meier curves for the bottom and top hazard-score groups of one gene.
# `bottom` and `top` are the two groups built inside find_threshold_vals.
def plot_km(gene, bottom, top):
        plt.figure()

        kmf_lowest_25_variant = KaplanMeierFitter()
        kmf_lowest_25_variant.fit(durations=bottom['duration'], event_observed=bottom['is_hcm'], label='bottom')
        ax = kmf_lowest_25_variant.plot_survival_function()

        kmf_highest_25_variant = KaplanMeierFitter()
        kmf_highest_25_variant.fit(durations=top['duration'], event_observed=top['is_hcm'], label='top')
        kmf_highest_25_variant.plot_survival_function(ax=ax)

        plt.title(gene)

#### Convert lowest p-values to dataframe

In [266]:
updated_p_vals, hazard_ratios = find_threshold_vals(lowest_thresholds_by_p_val)
p_vals = pd.DataFrame.from_dict(updated_p_vals, orient = 'index')
p_vals.columns = ["P-value"]
hazard_ratios = pd.DataFrame.from_dict(hazard_ratios, orient = 'index')
hazard_ratios.columns = ["Hazard ratio"]
thresholds = pd.DataFrame.from_dict(lowest_thresholds_by_p_val, orient = 'index')
thresholds.columns = ["Thresholds"]
df = p_vals.join(hazard_ratios).join(thresholds)
df

| Gene | P-value | Hazard ratio | Thresholds |
|---|---|---|---|
| ACTN2 | 0.1948255 | 0.089751 | 1 |
| ALPK3 | 0.0002521928 | 4.965914 | 76 |
| FLNC | 8.005691e-06 | 7.4641 | 70 |
| MYBPC3 | 2.388252e-09 | 9.13097 | 90 |
| MYH6 | 0.5129942 | 4.465227 | 43 |
| MYH7 | 0.02373538 | 2.605919 | 78 |
| PTPN11 | 0.01548641 | 0.028561 | 1 |
| TNNI3 | 1.0 | 0.250699 | 32 |
| TTR | 1.0 | 3.734898 | 79 |


#### Convert highest hazard ratios to dataframe

In [274]:
updated_p_vals, hazard_ratios = find_threshold_vals(threshold_by_odds_ratio)
p_vals = pd.DataFrame.from_dict(updated_p_vals, orient = 'index')
p_vals.columns = ["P-value"]
hazard_ratios = pd.DataFrame.from_dict(hazard_ratios, orient = 'index')
hazard_ratios.columns = ["Hazard ratio"]
thresholds = pd.DataFrame.from_dict(threshold_by_odds_ratio, orient = 'index')
thresholds.columns = ["Thresholds"]
df = p_vals.join(hazard_ratios).join(thresholds)
df

| Gene | P-value | Hazard ratio | Thresholds |
|---|---|---|---|
| ACTN2 | 1.0 | 3.081714 | 97 |
| ALPK3 | 1.0 | 4502671.0 | 1 |
| FLNC | 0.000417 | 8.012676 | 89 |
| MYBPC3 | 1.0 | 11673590.0 | 3 |
| MYH6 | 1.0 | 45845290.0 | 18 |
| MYH7 | 1.0 | 3.934282 | 99 |
| PTPN11 | 0.129233 | 15.71873 | 99 |
| TNNI3 | 1.0 | 5533091.0 | 17 |
| TTR | 1.0 | 27287260.0 | 51 |
