This folder contains three text files (Simulation.txt, EmpiricalStudy.txt, Ratio_t.txt) with the R code used to run the simulation and empirical studies in the paper ‘Threshold Selection for Covariance Estimation’. The file ‘khan_train.csv’ contains microarray expression levels of 2308 genes from the small round blue-cell tumors experiment, as described in the Empirical Study section of the paper. The empirical study is run on a subset drawn from this data.
The file ‘Simulation.txt’ contains the code for the Simulation section of the paper. It evaluates the tuning parameters in (3.10), (3.12), and (4.13), and the CV estimator. The tuning parameters in (3.10) and (3.12) are denoted ‘True’ and ‘Theoretical’, respectively. The code averages the proposed estimator in (4.13) and the CV estimator over 1000 replications; these averages are denoted ‘Prop’ and ‘CV’, respectively. The resulting values are stored in the variable ‘deltas’ and printed at the end. The standard deviations of these estimators are stored in ‘stddeltas’ and are also printed at the end. Table 1 of the paper presents the results obtained by this code.
The code also creates box plots of the Frobenius and spectral losses for each covariance structure and each combination of n and p. A compact version of the box plots for all covariance structures and dimensions with n=40 is presented in Figure 1 of the paper.
By default, the code produces the results for covariance structure A with n=40 and p=100 over 1000 replications. To obtain the results for other values of the dimension p and the sample size n, change the assigned values n=40 and p=100 under the section ‘Run the code’. The default assignment of ‘Sig_til’ under the same section selects covariance structure A. To switch to covariance structure B, C, or D, use the second, third, or fourth commented assignment of ‘Sig_til’, respectively.
The cases with p=100 can be run on a regular computer. However, because of the high dimensionality and the computational cost of the CV method, the cases with higher values of p should be run on a supercomputer.
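As a rough illustration of the replication loop described for ‘Simulation.txt’, the sketch below averages the Frobenius and spectral losses of a hard-thresholded sample covariance matrix over a few replications. It is not the code in ‘Simulation.txt’: the identity covariance, the fixed threshold ‘delta’, the helper ‘hard_threshold’, and the small number of replications are all assumptions made for illustration, and the threshold is not the estimator in (4.13).

```r
# Illustrative sketch only; not the code in Simulation.txt.
set.seed(1)
n <- 40            # sample size (default setting in the README)
p <- 100           # dimension (default setting in the README)
reps <- 20         # kept small here; the README reports 1000 replications
Sigma <- diag(p)   # assumed stand-in covariance, not structure A, B, C or D
delta <- 0.5       # assumed fixed threshold, not the estimator in (4.13)

# Hard thresholding: set off-diagonal entries below delta (in absolute value) to zero.
hard_threshold <- function(S, delta) {
  S_thr <- S * (abs(S) >= delta)
  diag(S_thr) <- diag(S)   # keep the variances unchanged
  S_thr
}

# One row per replication, one column per loss.
losses <- t(replicate(reps, {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)   # rows drawn from N(0, Sigma)
  err <- hard_threshold(cov(X), delta) - Sigma
  c(frobenius = norm(err, type = "F"), spectral = norm(err, type = "2"))
}))

print(colMeans(losses))       # average loss over replications
print(apply(losses, 2, sd))   # standard deviation over replications
```

Calling boxplot(losses) on the result would draw one box per loss, in the spirit of the box plots mentioned above.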
The code in the file ‘Ratio_t.txt’ evaluates the ratio of the running time of one replication of the CV method to that of the proposed method, for each covariance structure and each combination of n and p. The default setting corresponds to structure A with n=40 and p=100. The code prints the time ratio of the CV method over the proposed method at the end. To obtain the time ratios for the other structures and other combinations of n and p, change the values and assignments under the section ‘Run the code’ in the same way as for ‘Simulation.txt’. These time ratios are reported in the column ‘Ratio_t’ of Table 1 for each corresponding case. Note that the code may not reproduce the values in Table 1 exactly on every run, because of randomness and performance differences between the computers used to run the code. However, very close results can be obtained.
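A time ratio of this kind can be measured in R with system.time. The sketch below is only an illustration and is not the code in ‘Ratio_t.txt’: ‘run_cv_method’ and ‘run_proposed_method’ are hypothetical placeholders that simply wait, standing in for one replication of each method.

```r
# Illustrative sketch only; not the code in Ratio_t.txt.
# Both functions are hypothetical placeholders that just wait for a fixed time.
run_cv_method <- function() Sys.sleep(0.3)
run_proposed_method <- function() Sys.sleep(0.05)

# Elapsed wall-clock time, in seconds, of a single call to f.
elapsed_seconds <- function(f) {
  unname(system.time(f())["elapsed"])
}

time_cv <- elapsed_seconds(run_cv_method)
time_prop <- elapsed_seconds(run_proposed_method)

# Ratio of the CV running time over the proposed method's running time.
ratio_t <- time_cv / time_prop
print(ratio_t)
```

Because elapsed time depends on the machine and on its load, the printed ratio varies slightly from run to run.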
The code used to run the empirical study is in the file ‘EmpiricalStudy.txt’. It first reads the original data from ‘khan_train.csv’. The section ‘Creating the new data set by drawing the top 100 and bottom 30 genes from the original data’ then extracts the subset of the data formed by the top and bottom genes, as described in the Empirical Study section of the paper, and stores it in the file ‘newdata.csv’. The empirical study is run on the resulting ‘newdata.csv’. The code creates the heatmaps and co-expression networks in Figure 2 of the paper. It also prints the values of the resulting estimators, the percentage of zeros in the resulting covariance estimators, and the running times of the CV and proposed methods. These values are reported in the Empirical Study section of the paper.
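The percentage of zeros in a thresholded covariance estimator can be computed as in the sketch below. It is only an illustration and is not the code in ‘EmpiricalStudy.txt’: the 3 x 3 matrix and the threshold ‘delta’ are made up, hard thresholding is assumed, and the percentage is taken over all entries of the matrix, which may differ from the convention used in the paper.

```r
# Illustrative sketch only; not the code in EmpiricalStudy.txt.
# Made-up covariance matrix and threshold.
S <- matrix(c(1.0, 0.2, 0.6,
              0.2, 1.0, 0.1,
              0.6, 0.1, 1.0), nrow = 3, byrow = TRUE)
delta <- 0.5

# Assumed hard thresholding of the off-diagonal entries.
S_thr <- S * (abs(S) >= delta)
diag(S_thr) <- diag(S)

# Percentage of entries that are exactly zero (4 of the 9 entries here).
pct_zeros <- 100 * mean(S_thr == 0)
print(round(pct_zeros, 2))   # prints 44.44
```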