## Importing Required Libraries

In this section, we import the necessary libraries:
- `os`: This library allows us to interact with the operating system, particularly for file and directory manipulation.
- `pandas as pd`: The `pandas` library is used for data manipulation and analysis, providing data structures like DataFrames for handling structured data.

In [5]:
import os
import pandas as pd

## Defining Input Directories

Here, we define a list of input directories. Each directory corresponds to a dataset used for different classification tasks:
- `ad_vs_nc_train`: Alzheimer's Disease vs. Normal Controls
- `dlb_vs_nc_train`: Dementia with Lewy Bodies vs. Normal Controls
- `mci_vs_nc_train`: Mild Cognitive Impairment vs. Normal Controls
- `nph_vs_nc_train`: Normal Pressure Hydrocephalus vs. Normal Controls
- `vad_vs_nc_train`: Vascular Dementia vs. Normal Controls

In [6]:
input_directories = [
    "/home/aghasemi/CompBio481/feature_selection/feat_select_res_branch2_overall/ad_vs_nc_train",
    "/home/aghasemi/CompBio481/feature_selection/feat_select_res_branch2_overall/dlb_vs_nc_train",
    "/home/aghasemi/CompBio481/feature_selection/feat_select_res_branch2_overall/mci_vs_nc_train",
    "/home/aghasemi/CompBio481/feature_selection/feat_select_res_branch2_overall/nph_vs_nc_train",
    "/home/aghasemi/CompBio481/feature_selection/feat_select_res_branch2_overall/vad_vs_nc_train",
]

## Filtering Significant Features from Welch ANOVA Results

### Function: `filter_significant_features(directory)`

This function filters significant features based on Welch ANOVA results:

1. **Listing Relevant Files**:
    - The function lists only CSV files in the specified directory that contain `_welch` in their filenames.

2. **File Count Validation**:
    - It checks if there are exactly three Welch ANOVA CSV files. If not, it raises a `ValueError`.

3. **Filtering Process**:
    - For each file, the function reads the CSV data.
    - It adds the file's row count to a running total of features before filtering, so the reported total is summed across all three files.
    - It filters features with a `P-Value` less than 0.03 and adds them to a set.

4. **Finding Common Features**:
    - It computes the intersection of significant features across all three files.

5. **Return Values**:
    - The function returns the common significant features and the total number of features before filtering.

In [None]:
def filter_significant_features(directory):
    # List only CSV files containing '_welch' in the filename
    files = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith('.csv') and '_welch' in f]
    
    if len(files) != 3:
        raise ValueError("Each directory must contain exactly three Welch ANOVA CSV files.")

    significant_features_sets = []
    total_features = 0
    
    for file in files:
        data = pd.read_csv(file)
        total_features += len(data)  # Sum up all features before filtering
        significant_features = data[data['P-Value'] < 0.03]['Feature']
        significant_features_sets.append(set(significant_features))

    # Intersection of significant features across all files
    common_features = set.intersection(*significant_features_sets)
    return common_features, total_features

## Processing Welch ANOVA Results for All Datasets

### Dictionaries:
- `welch_anova_common_features_dict`: Stores common significant features for each dataset.
- `total_initial_features_dict`: Stores the total initial number of features for each dataset.

### Loop Through Directories:
- For each directory in `input_directories`, the script:
    - Extracts the dataset type from the directory name.
    - Tries to filter significant features using `filter_significant_features(directory)`.
    - Updates the dictionaries with the results.
    - Prints the initial feature count, the number of features filtered out, and the number of significant features remaining.
- If there is an error (e.g., incorrect number of files), it catches the `ValueError` and prints an error message.

In [None]:
welch_anova_common_features_dict = {}
total_initial_features_dict = {}

for directory in input_directories:
    dataset_type = directory.split('/')[-1]
    try:
        common_features, total_features = filter_significant_features(directory)
        welch_anova_common_features_dict[dataset_type] = common_features
        total_initial_features_dict[dataset_type] = total_features
        filtered_out_features_count = total_features - len(common_features)
        print(f"{dataset_type}: Initially {total_features} features, {filtered_out_features_count} features filtered out, leaving {len(common_features)} significant features.")
    except ValueError as e:
        print(f"Error for {dataset_type}: {e}")
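
The loop above takes the dataset type from `directory.split('/')[-1]`, which yields an empty string if a path ends with a slash. As a hedged alternative, the sketch below uses `os.path.normpath` and `os.path.basename`; the helper name `dataset_type_from_dir` and the example paths are hypothetical, and nothing is read from disk.

```python
import os


def dataset_type_from_dir(directory):
    """Return the last path component, ignoring any trailing slash (hypothetical helper)."""
    return os.path.basename(os.path.normpath(directory))


# Illustrative paths only.
print(dataset_type_from_dir("/data/results/ad_vs_nc_train"))
print(dataset_type_from_dir("/data/results/ad_vs_nc_train/"))

# The plain split returns an empty string when the path has a trailing slash.
print(repr("/data/results/ad_vs_nc_train/".split("/")[-1]))
```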

## Filtering Top Features from ReliefF Results

### Function: `filter_top_relieff_features(directory)`

This function filters the top features based on ReliefF scores:

1. **Listing Relevant Files**:
    - The function lists only CSV files in the specified directory that contain `_relieff` in their filenames.

2. **File Count Validation**:
    - It checks if there are exactly three ReliefF CSV files. If not, it raises a `ValueError`.

3. **Filtering Process**:
    - For each file, the function reads the CSV data.
    - It adds the file's row count to a running total of features before filtering, so the reported total is summed across all three files.
    - It selects the top 5% of features based on the `ReliefF Score` and adds them to a set.

4. **Finding Common Features**:
    - It computes the intersection of top features across all three files.

5. **Return Values**:
    - The function returns the common top features and the total number of features before filtering.

In [None]:
def filter_top_relieff_features(directory):
    # List only CSV files containing '_relieff' in the filename
    files = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith('.csv') and '_relieff' in f]
    
    if len(files) != 3:
        raise ValueError("Each directory must contain exactly three ReliefF CSV files.")

    top_features_sets = []
    total_features = 0
    
    for file in files:
        data = pd.read_csv(file)
        total_features += len(data)  # Sum up all features before filtering
        top_count = int(len(data) * 0.05)  # Top 5%, rounded down
        top_features = set(data.nlargest(top_count, 'ReliefF Score')['Feature'])
        top_features_sets.append(top_features)

    # Intersection of top features across all files
    common_features = set.intersection(*top_features_sets)
    return common_features, total_features
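
The top-5% selection can be tried on a small in-memory table. This is a sketch under stated assumptions: the feature names and scores are invented, and only the `Feature` and `ReliefF Score` column names and the 0.05 fraction come from the function above. It also shows that `int()` truncates the count to zero when a table has fewer than 20 rows.

```python
import pandas as pd

# Toy ReliefF table with 40 features; scores are made up for illustration.
scores = pd.DataFrame({
    "Feature": [f"f{i}" for i in range(40)],
    "ReliefF Score": [i / 100 for i in range(40)],
})

top_count = int(len(scores) * 0.05)  # 40 rows -> 2 features
top_features = set(scores.nlargest(top_count, "ReliefF Score")["Feature"])
print(top_count, sorted(top_features))

# With 10 rows, 10 * 0.05 = 0.5 truncates to 0, so no feature would be selected.
small_count = int(10 * 0.05)
print(small_count)
```

If an empty selection is not wanted for small tables, rounding up (for example with `math.ceil`) is one option, though that would change the behavior of the original function.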

## Processing ReliefF Results for All Datasets

### Dictionaries:
- `relieff_common_features_dict`: Stores common top features for each dataset.
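- `total_initial_features_dict`: Stores the total initial number of features for each dataset. It is re-initialized in this cell, so the totals recorded during the Welch ANOVA step are overwritten.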

### Loop Through Directories:
- For each directory in `input_directories`, the script:
    - Extracts the dataset type from the directory name.
    - Tries to filter top features using `filter_top_relieff_features(directory)`.
    - Updates the dictionaries with the results.
    - Prints the initial feature count, the number of features filtered out, and the number of features remaining.
- If there is an error (e.g., incorrect number of files), it catches the `ValueError` and prints an error message.

In [None]:
relieff_common_features_dict = {}
total_initial_features_dict = {}

for directory in input_directories:
    dataset_type = directory.split('/')[-1]
    try:
        common_features, total_features = filter_top_relieff_features(directory)
        relieff_common_features_dict[dataset_type] = common_features
        total_initial_features_dict[dataset_type] = total_features
        filtered_out_features_count = total_features - len(common_features)
        print(f"{dataset_type}: Initially {total_features} features, {filtered_out_features_count} features filtered out, leaving {len(common_features)} top features.")
    except ValueError as e:
        print(f"Error for {dataset_type}: {e}")

## Combining Feature Dictionaries

In [14]:
def combine_feature_dictionaries(dict1, dict2):
    # Combine two dictionaries where values are sets of features
    combined_dict = {}
    for key in dict1:
        if key in dict2:
            combined_dict[key] = dict1[key].union(dict2[key])
        else:
            combined_dict[key] = dict1[key]  # Only dict1 has this key
    for key in dict2:
        if key not in dict1:
            combined_dict[key] = dict2[key]  # Only dict2 has this key
    return combined_dict

# Assuming welch_anova_common_features_dict and relieff_common_features_dict are already defined
combined_features_dict = combine_feature_dictionaries(welch_anova_common_features_dict, relieff_common_features_dict)

# Print the combined results for each dataset condition
for condition, features in combined_features_dict.items():
    print(f"{condition}: {len(features)} significant features combined")

ad_vs_nc_train: 624 significant features combined
dlb_vs_nc_train: 364 significant features combined
mci_vs_nc_train: 42 significant features combined
nph_vs_nc_train: 247 significant features combined
vad_vs_nc_train: 531 significant features combined
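
The key-by-key union performed by `combine_feature_dictionaries` can also be written as a single dict comprehension. The variant below is a hypothetical sketch, not the function used to produce the counts above; the toy dictionaries and feature names are invented for illustration.

```python
def union_feature_dicts(dict1, dict2):
    """Union the feature sets of two dicts key by key (hypothetical compact variant)."""
    return {
        key: dict1.get(key, set()) | dict2.get(key, set())
        for key in dict1.keys() | dict2.keys()
    }


# Toy inputs; feature names are made up for illustration.
welch = {"ad_vs_nc_train": {"f1", "f2"}, "mci_vs_nc_train": {"f3"}}
relieff = {"ad_vs_nc_train": {"f2", "f4"}, "dlb_vs_nc_train": {"f5"}}

combined = union_feature_dicts(welch, relieff)
for condition in sorted(combined):
    print(condition, sorted(combined[condition]))
```

One difference from the original function: this variant always builds new sets, so later changes to `combined` do not alter the input dictionaries.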


## Filtering Datasets

In [15]:
import os
import pandas as pd

In [21]:
datasets = [
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/ad_vs_nc_train.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/dlb_vs_nc_train.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/mci_vs_nc_train.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/nph_vs_nc_train.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/vad_vs_nc_train.csv",
]

In [17]:
def filter_datasets(datasets, features_dict, output_directory):
    # Ensure the output directory exists
    os.makedirs(output_directory, exist_ok=True)

    # Fixed columns to retain in each dataset
    fixed_columns = ['ID_1', 'Diagnosis', 'Age', 'Sex', 'APOE4']

    for dataset_path in datasets:
        # Extract the condition name from the file path
        condition = os.path.basename(dataset_path).replace('.csv', '')
        if condition in features_dict:
            # Load the dataset
            df = pd.read_csv(dataset_path)

            # Combine significant features with fixed columns
            significant_features = features_dict[condition]
            columns_to_keep = fixed_columns + list(significant_features)

            # Filter the DataFrame to include only the required columns
            filtered_df = df.loc[:, df.columns.isin(columns_to_keep)]

            # Reorder columns to ensure ID_1 is the first column
            columns_order = ['ID_1'] + [col for col in filtered_df.columns if col != 'ID_1']
            filtered_df = filtered_df[columns_order]

            # Prepare output file path
            output_file_path = os.path.join(output_directory, os.path.basename(dataset_path))

            # Save the filtered DataFrame
            filtered_df.to_csv(output_file_path, index=False)

            print(f"Filtered dataset saved to {output_file_path}")
        else:
            print(f"No features found for {condition}, skipping...")
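
Because `filter_datasets` selects columns with `df.columns.isin(columns_to_keep)`, any requested column that is absent from the dataset is skipped without a warning. The sketch below shows one way to surface such columns before saving; the toy DataFrame, the feature names, and the shortened fixed-column list are assumptions made for illustration.

```python
import pandas as pd

# Toy dataset; feature columns and values are made up for illustration.
df = pd.DataFrame({
    "ID_1": [1, 2],
    "Diagnosis": ["AD", "NC"],
    "f1": [0.1, 0.2],
    "f2": [0.3, 0.4],
})
selected = {"f1", "f9"}  # "f9" does not exist in df

columns_to_keep = ["ID_1", "Diagnosis"] + sorted(selected)

# Requested columns that the dataset does not contain.
missing = [col for col in columns_to_keep if col not in df.columns]

# Same selection as in filter_datasets: absent columns are skipped silently.
filtered_df = df.loc[:, df.columns.isin(columns_to_keep)]

print(missing)
print(list(filtered_df.columns))
```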

In [18]:
# Define the output directory
output_directory = "/home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall"

In [20]:
# Call the function
filter_datasets(datasets, combined_features_dict, output_directory)

Filtered dataset saved to /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/ad_vs_nc_train.csv
Filtered dataset saved to /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/dlb_vs_nc_train.csv
Filtered dataset saved to /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/mci_vs_nc_train.csv
Filtered dataset saved to /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/nph_vs_nc_train.csv
Filtered dataset saved to /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/vad_vs_nc_train.csv


## Filtering Test Datasets

In [23]:
test_datasets = [
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/ad_vs_nc_test.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/dlb_vs_nc_test.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/mci_vs_nc_test.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/nph_vs_nc_test.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/vad_vs_nc_test.csv",
]

train_datasets = [
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/ad_vs_nc_train.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/dlb_vs_nc_train.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/mci_vs_nc_train.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/nph_vs_nc_train.csv",
    "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/vad_vs_nc_train.csv",
]

In [25]:
# Define the output directory and create it if it does not exist
output_directory = "/home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall"
os.makedirs(output_directory, exist_ok=True)

# Process each pair of test and train datasets
for test_path, train_path in zip(test_datasets, train_datasets):
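    # Load the filtered train dataset written by filter_datasets so that only the
    # selected columns are kept (the unfiltered file at train_path has every column)
    train_df = pd.read_csv(os.path.join(output_directory, os.path.basename(train_path)))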
    
    # Load test dataset
    test_df = pd.read_csv(test_path)
    
    # Align columns in test dataset to match train dataset
    filtered_test_df = test_df[train_df.columns]
    
    # Keep the original test filename for the filtered test dataset
    filtered_filename = os.path.basename(test_path)
    
    # Save the filtered test dataset in the specified output directory
    filtered_test_df.to_csv(os.path.join(output_directory, filtered_filename), index=False)

    print(f"Filtered test dataset saved as {os.path.join(output_directory, filtered_filename)}")

Filtered test dataset saved as /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/ad_vs_nc_test.csv
Filtered test dataset saved as /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/dlb_vs_nc_test.csv
Filtered test dataset saved as /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/mci_vs_nc_test.csv
Filtered test dataset saved as /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/nph_vs_nc_test.csv
Filtered test dataset saved as /home/aghasemi/CompBio481/datasets/filtered_datasets_after_rank_feat_select_branch2_overall/vad_vs_nc_test.csv
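
Selecting `test_df[train_df.columns]` raises a `KeyError` if the test file lacks any train column. As a minimal sketch, the check below reports the missing columns explicitly before aligning; the toy frames, column values, and the name `aligned_test_df` are assumptions made for illustration.

```python
import pandas as pd

# Toy frames; feature columns and values are made up for illustration.
train_df = pd.DataFrame({"ID_1": [1], "Diagnosis": ["AD"], "f1": [0.1]})
test_df = pd.DataFrame({"f1": [0.5], "f2": [0.7], "ID_1": [9], "Diagnosis": ["NC"]})

# Columns present in train but absent from test.
missing = train_df.columns.difference(test_df.columns)
if len(missing) > 0:
    raise KeyError(f"Test set lacks columns: {list(missing)}")

# Select the train columns, in train order, from the test frame.
aligned_test_df = test_df[train_df.columns]
print(list(aligned_test_df.columns))
```

The extra test column `f2` is dropped and the remaining columns follow the train order.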
