### Transpose Main Datasets (GSE120584 & GSE167559)

**Description:**

- **Load the Dataset:** Reads the dataset from a CSV file.
- **Transpose the DataFrame:** Swaps rows and columns.
- **Set Column Headers:** Uses the first row as the column headers.
- **Drop Original Header Row:** Removes the row used for headers.
- **Set Row Index Header:** Renames the row index to "ID_1".
- **Save to CSV:** Optionally saves the transposed DataFrame to a new CSV file.
- **Example Usage:** Applies the function to each file path provided, saving the transposed datasets.


In [1]:
import pandas as pd

In [2]:
def transpose_and_clean_dataset(file_path, output_file=None):
    # Load the dataset
    dataset = pd.read_csv(file_path)
    
    # Transpose the DataFrame
    dataset = dataset.T
    
    # Set the first row as column headers
    dataset.columns = dataset.iloc[0]
    
    # Drop the original header row
    dataset = dataset.drop(dataset.index[0])
    
    # Set the row index header to "ID_1"
    dataset.index.rename("ID_1", inplace=True)
    
    # Save to CSV if an output file path is provided
    if output_file:
        dataset.to_csv(output_file)
    
    return dataset

In [3]:
# Example usage for the provided file paths
file_paths = [
    "/home/aghasemi/CompBio481/datasets/original_datasets/GSE120584.csv",
    "/home/aghasemi/CompBio481/datasets/original_datasets/GSE167559.csv"
]

In [4]:
# You can iterate over the file paths to apply the function to each file
for file_path in file_paths:
    output_file = file_path.replace(".csv", "_T.csv")  # Naming the output file based on the input file
    transposed_dataset = transpose_and_clean_dataset(file_path, output_file)
    print(transposed_dataset.head())  # Display the first few rows of the transformed dataset

ID_REF     MIMAT0000062 MIMAT0000063 MIMAT0000064 MIMAT0000065 MIMAT0000066  \
ID_1                                                                          
GSM3403761     2.307579     2.307579     2.307579     2.307579     2.307579   
GSM3403762     1.503044      2.50538     1.503044     1.503044     1.503044   
GSM3403763     1.549877     1.983125     1.549877     1.549877     1.549877   
GSM3403764     1.560269     1.560269     1.560269     1.560269     2.232974   
GSM3403765     3.179096     3.302472     3.179096     3.179096      4.79347   

ID_REF     MIMAT0000067 MIMAT0000068 MIMAT0000069 MIMAT0000070 MIMAT0000071  \
ID_1                                                                          
GSM3403761     2.307579     2.307579     2.307579     2.307579     2.843409   
GSM3403762     1.503044     1.503044     1.503044     1.503044     3.349936   
GSM3403763     1.549877     1.549877     1.549877     1.549877     3.081569   
GSM3403764     1.560269     1.560269     1.560269  
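The uneven number formatting in the output above (for example `2.50538` next to `2.307579`) suggests the transposed columns may be stored as `object` dtype, which is what pandas produces when a frame that still contains the string ID column is transposed. A minimal sketch of one alternative, assuming the first CSV column holds the miRNA IDs: read that column as the index so the transpose only moves numbers. The toy IDs and values below are illustrative and not taken from the real files.

```python
import io
import pandas as pd

# Toy stand-in for an expression matrix: one row per miRNA, one column per sample.
# IDs and values are made up for illustration.
raw_csv = io.StringIO(
    "ID_REF,GSM_A,GSM_B\n"
    "MIMAT_X,2.3,1.5\n"
    "MIMAT_Y,2.3,2.5\n"
)

# Reading the first column as the index keeps the ID strings out of the data,
# so the transposed frame retains a numeric dtype.
transposed = pd.read_csv(raw_csv, index_col=0).T
transposed.index.name = "ID_1"

print(transposed.dtypes)
print(transposed)
```

Because the transposed file is written to CSV and read back in the next step, the values are parsed as numbers again there; the difference mainly matters if the returned DataFrame is used directly.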

### Merge Main Datasets (GSE120584_T & GSE167559_T)

**Description:**

- **Load the Datasets:** Reads two datasets from CSV files.
- **Concatenate DataFrames:** Combines the datasets row-wise.
- **Rename Column:** Renames the first unnamed column to 'ID_1', if applicable.
- **Save to CSV:** Optionally saves the combined DataFrame to a new CSV file.
- **Example Usage:** Applies the function to merge the specified datasets and saves the combined result.


In [5]:
import pandas as pd

In [6]:
def concatenate_datasets(file_path_1, file_path_2, output_file=None):
    # Load the datasets
    dataset_1 = pd.read_csv(file_path_1)
    dataset_2 = pd.read_csv(file_path_2)
    
    # Concatenate the two DataFrames row-wise
    combined_df = pd.concat([dataset_1, dataset_2], axis=0, ignore_index=True)
    
    # Rename the first unnamed column to 'ID_1' if it exists
    if 'Unnamed: 0' in combined_df.columns:
        combined_df.rename(columns={'Unnamed: 0': 'ID_1'}, inplace=True)
    
    # Save to CSV if an output file path is provided
    if output_file:
        combined_df.to_csv(output_file, index=False)
    
    return combined_df

In [7]:
# Example usage
file_path_1 = "/home/aghasemi/CompBio481/datasets/original_datasets/GSE120584_T.csv"
file_path_2 = "/home/aghasemi/CompBio481/datasets/original_datasets/GSE167559_T.csv"
output_file = "/home/aghasemi/CompBio481/datasets/processed_datasets/GSE120584_T_&_GSE167559_T_complete_dataset_without_clinical_factors.csv"

In [8]:
combined_dataset = concatenate_datasets(file_path_1, file_path_2, output_file)
print(combined_dataset.head())

         ID_1  MIMAT0000062  MIMAT0000063  MIMAT0000064  MIMAT0000065  \
0  GSM3403761      2.307579      2.307579      2.307579      2.307579   
1  GSM3403762      1.503044      2.505380      1.503044      1.503044   
2  GSM3403763      1.549877      1.983125      1.549877      1.549877   
3  GSM3403764      1.560269      1.560269      1.560269      1.560269   
4  GSM3403765      3.179096      3.302472      3.179096      3.179096   

   MIMAT0000066  MIMAT0000067  MIMAT0000068  MIMAT0000069  MIMAT0000070  ...  \
0      2.307579      2.307579      2.307579      2.307579      2.307579  ...   
1      1.503044      1.503044      1.503044      1.503044      1.503044  ...   
2      1.549877      1.549877      1.549877      1.549877      1.549877  ...   
3      2.232974      1.560269      1.560269      1.560269      1.560269  ...   
4      4.793470      3.179096      3.179096      3.179096      3.179096  ...   

   MIMAT0031893  MIMAT0032026  MIMAT0032029  MIMAT0032110  \
0      2.307579    
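`pd.concat` aligns the two frames on column names, so any miRNA column present in only one dataset is kept and padded with `NaN` for the other dataset's rows. A small sketch of a pre-merge check, using two hypothetical toy frames in place of the transposed files:

```python
import pandas as pd

# Toy frames standing in for the two transposed datasets (names are hypothetical).
df_a = pd.DataFrame({"ID_1": ["S1", "S2"], "MIMAT_X": [1.0, 2.0], "MIMAT_Y": [3.0, 4.0]})
df_b = pd.DataFrame({"ID_1": ["S3"], "MIMAT_X": [5.0], "MIMAT_Z": [6.0]})

# Columns that appear in only one of the two frames
only_in_a = sorted(set(df_a.columns) - set(df_b.columns))
only_in_b = sorted(set(df_b.columns) - set(df_a.columns))
print("Only in first:", only_in_a)
print("Only in second:", only_in_b)

# Row-wise concatenation fills the non-shared columns with NaN
combined = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(combined.isna().sum())
```

If both lists come back empty for the real files, the concatenation introduces no missing values.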

### Merge Main Dataset (GSE120584_T_&_GSE167559_T_complete_dataset_without_clinical_factors) with Clinical Factors

**Description:**

- **Load the Datasets:** Reads the main dataset and clinical factors from CSV files.
- **Rename Columns:** Updates column names in the clinical factors DataFrame.
- **Merge DataFrames:** Combines the datasets on a common column.
- **Apply Value Mapping:** Maps values in the specified column according to a given dictionary.
- **Drop Column:** Removes a column if specified.
- **Save to CSV:** Optionally saves the processed DataFrame to a new CSV file.
- **Example Usage:** Merges the datasets, processes columns, and saves the result.


In [9]:
import pandas as pd

In [25]:
dataset_file_path = '/home/aghasemi/CompBio481/datasets/processed_datasets/GSE120584_T_&_GSE167559_T_complete_dataset_without_clinical_factors.csv'
clinical_factors_file_path = '/home/aghasemi/CompBio481/datasets/processed_datasets/GSE120584_&_GSE167559_clinical_factors_combined.csv'
output_file = '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/complete_dataset.csv'

In [26]:
def merge_and_process_datasets(dataset_file_path, clinical_factors_file_path, common_column, rename_column_dict, value_map_column, value_map_dict, drop_column, output_file=None):
    # Load the datasets
    dataset = pd.read_csv(dataset_file_path)
    clinical_factors = pd.read_csv(clinical_factors_file_path)
    
    # Rename the specified column in the clinical factors DataFrame
    clinical_factors.rename(columns=rename_column_dict, inplace=True)
    
    # Merge the DataFrames on the common column
    merged_df = pd.merge(dataset, clinical_factors, on=common_column, how='inner')
    
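    # Apply the same renames to the merged frame. This is a no-op for the clinical
    # columns renamed above and only matters if the main dataset carries an old name.
    merged_df.rename(columns=rename_column_dict, inplace=True)
    
    # Map the values in value_map_column (e.g. 'female'/'male' to 0/1);
    # values missing from value_map_dict become NaN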
    if value_map_column in merged_df.columns:
        merged_df[value_map_column] = merged_df[value_map_column].map(value_map_dict)
    
    # Drop the specified column if needed
    if drop_column in merged_df.columns:
        merged_df = merged_df.drop(drop_column, axis=1)
    
    # Save to CSV if an output file path is provided
    if output_file:
        merged_df.to_csv(output_file, index=False)
    
    return merged_df

In [27]:
merged_dataset = merge_and_process_datasets(
    dataset_file_path=dataset_file_path,
    clinical_factors_file_path=clinical_factors_file_path,
    common_column='ID_1',
    rename_column_dict={
        'SampleID': 'ID_1',  # Rename 'SampleID' to 'ID_1'
        'sex': 'Sex',  # Correctly rename 'sex' to 'Sex' before mapping
        'age': 'Age',  # Capitalize 'age'
        'apoe4': 'APOE4',  # Change 'apoe4' to 'APOE4'
        'diagnosis': 'Diagnosis'  # Capitalize 'diagnosis'
    },
    value_map_column='Sex',  # Use the new capitalized 'Sex' column for mapping
    value_map_dict={'female': 0, 'male': 1},
    drop_column=None,  # No column is being dropped after processing
    output_file=output_file
)
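An inner merge silently drops samples that appear in only one file, and `Series.map` turns any value missing from the dictionary into `NaN`. The sketch below shows one way to audit both effects on toy data; the frames and their values are hypothetical, while `indicator` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the expression data and the clinical factors
expression = pd.DataFrame({"ID_1": ["S1", "S2", "S3"], "MIMAT_X": [1.0, 2.0, 3.0]})
clinical = pd.DataFrame({"ID_1": ["S1", "S2", "S4"], "Sex": ["female", "Male", "male"]})

# An outer merge with indicator=True labels each row as both / left_only / right_only
audit = pd.merge(expression, clinical, on="ID_1", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", ["ID_1", "_merge"]]
print(unmatched)

# validate="one_to_one" raises if a sample ID is repeated in either frame
merged = pd.merge(expression, clinical, on="ID_1", how="inner", validate="one_to_one")

# 'Male' is not a key in the mapping, so it becomes NaN
merged["Sex"] = merged["Sex"].map({"female": 0, "male": 1})
print("Unmapped Sex values:", merged["Sex"].isna().sum())
```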

**Split the Dataset into NC vs. Each Disease Group**

In [28]:
import pandas as pd

In [33]:
def create_diagnosis_subdatasets(file_path):
    # Load the dataset
    dataset = pd.read_csv(file_path)
    
    # Get unique diagnosis values excluding 'NC'
    unique_diagnoses = dataset['Diagnosis'].unique()
    non_nc_diagnoses = [diagnosis for diagnosis in unique_diagnoses if diagnosis != 'NC']
    
    # Iterate over non-NC diagnoses to create subdatasets
    for diagnosis in non_nc_diagnoses:
        # Filter the dataset for 'NC' and the current non-NC diagnosis
        filtered_dataset = dataset[dataset['Diagnosis'].isin(['NC', diagnosis])]
        
        # Save the filtered dataset to a CSV file
        output_file = f'/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_{diagnosis}.csv'
        filtered_dataset.to_csv(output_file, index=False)
        print(f'Saved: {output_file}')

In [35]:
# Example usage
file_path = "/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/complete_dataset.csv"  # Update with the path to your complete dataset
create_diagnosis_subdatasets(file_path)

Saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_AD.csv
Saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_DLB.csv
Saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_MCI.csv
Saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_VaD.csv
Saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_NPH.csv
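Each saved file pairs the full NC group with one disease group, so class balance differs from file to file. A minimal sketch of the same filter on an in-memory toy frame, printing the class counts per subset; the toy labels are illustrative, and the `dropna()` call is an added assumption that guards against a missing diagnosis producing its own subset.

```python
import pandas as pd

# Toy data; the real dataset has one row per sample and a 'Diagnosis' column
toy = pd.DataFrame({
    "ID_1": ["S1", "S2", "S3", "S4", "S5", "S6", "S7"],
    "Diagnosis": ["NC", "NC", "AD", "AD", "AD", "MCI", None],
})

# Disease labels, skipping 'NC' and missing values
disease_labels = [d for d in toy["Diagnosis"].dropna().unique() if d != "NC"]

# One subset per disease label: all NC rows plus the rows for that label
subsets = {d: toy[toy["Diagnosis"].isin(["NC", d])] for d in disease_labels}

for label, subset in subsets.items():
    print(label, subset["Diagnosis"].value_counts().to_dict())
```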


**Binarizing Diagnosis Values**

In [36]:
import pandas as pd
import os

In [37]:
def binarize_diagnosis(file_paths):
    for file_path in file_paths:
        # Load the dataset
        df = pd.read_csv(file_path)
        
        # Binarize the 'Diagnosis' column: 0 for 'NC', 1 for others
        df['Diagnosis'] = df['Diagnosis'].apply(lambda x: 0 if x == 'NC' else 1)
        
        # Save the modified dataset back to its original file (overwrites in place;
        # re-running on an already binarized file would set every label to 1)
        df.to_csv(file_path, index=False)
        print(f'Binarized and saved: {file_path}')

In [38]:
# List of file paths to your datasets
file_paths = [
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_AD.csv',
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_DLB.csv',
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_MCI.csv',
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_VaD.csv',
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_NPH.csv'
]

In [39]:
# Apply the binarization function to each dataset
binarize_diagnosis(file_paths)

Binarized and saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_AD.csv
Binarized and saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_DLB.csv
Binarized and saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_MCI.csv
Binarized and saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_VaD.csv
Binarized and saved: /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_NPH.csv
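Because the files are overwritten in place, running the binarization cell a second time compares the integers 0 and 1 against `'NC'` and relabels every row as 1. One possible guard, sketched here with a hypothetical helper `binarize_labels`, is to skip columns that are already numeric:

```python
import pandas as pd

def binarize_labels(series: pd.Series) -> pd.Series:
    """Map 'NC' to 0 and every other label to 1; leave numeric columns untouched."""
    if pd.api.types.is_numeric_dtype(series):
        # Assumption: a numeric column was binarized by an earlier run
        return series
    return (series != "NC").astype(int)

labels = pd.Series(["NC", "AD", "NC", "AD"], name="Diagnosis")
once = binarize_labels(labels)
twice = binarize_labels(once)

print(once.tolist())
print(twice.tolist())
```

Writing the binarized data to a new file instead of the original would avoid the problem as well.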


**Check for Duplicate Columns and Rows**

In [45]:
import pandas as pd
import os

In [46]:
# Define the directory where the files are stored
directory_path = '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/'

In [47]:
# Function to check for duplicate columns and rows in a dataframe
def check_duplicates_in_dataframe(df):
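    # Boolean mask marking repeated column names. Note that pd.read_csv renames
    # repeated headers by default (e.g. 'X', 'X.1'), so this mask is expected
    # to be all False for DataFrames loaded from CSV.
    duplicate_columns = df.columns.duplicated()
    # True if any row is an exact duplicate of an earlier row
    duplicate_rows = df.duplicated().any()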
    
    return duplicate_columns, duplicate_rows

In [48]:
# List to keep track of the results for each file
results = []

# List the CSV files in the directory
csv_files = [file for file in os.listdir(directory_path) if file.endswith('.csv')]

# Iterate over the CSV files to check for duplicates
for file_name in csv_files:
    # Read the CSV file into a DataFrame
    df = pd.read_csv(os.path.join(directory_path, file_name))
    
    # Check for duplicates in the DataFrame
    duplicate_columns, duplicate_rows = check_duplicates_in_dataframe(df)
    
    # Store the results
    results.append({
        'file': file_name,
        'duplicate_columns': duplicate_columns,
        'duplicate_rows': duplicate_rows
    })

In [49]:
results

[{'file': 'NC_vs_MCI.csv',
  'duplicate_columns': array([False, False, False, ..., False, False, False]),
  'duplicate_rows': False},
 {'file': 'complete_dataset.csv',
  'duplicate_columns': array([False, False, False, ..., False, False, False]),
  'duplicate_rows': False},
 {'file': 'NC_vs_VaD.csv',
  'duplicate_columns': array([False, False, False, ..., False, False, False]),
  'duplicate_rows': False},
 {'file': 'NC_vs_AD.csv',
  'duplicate_columns': array([False, False, False, ..., False, False, False]),
  'duplicate_rows': False},
 {'file': 'NC_vs_NPH.csv',
  'duplicate_columns': array([False, False, False, ..., False, False, False]),
  'duplicate_rows': False},
 {'file': 'NC_vs_DLB.csv',
  'duplicate_columns': array([False, False, False, ..., False, False, False]),
  'duplicate_rows': False}]
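By default `pd.read_csv` renames repeated header names (`A`, `A.1`, ...), so a name-based check on a loaded DataFrame cannot flag them. The sketch below shows two complementary checks on a toy CSV: inspecting the raw header line, and comparing column contents with `df.T.duplicated()`. The CSV text is made up for illustration.

```python
import csv
import io
import pandas as pd

# Toy CSV: header 'A' appears twice, and column 'B' repeats the values of the first 'A'
csv_text = "A,A,B\n1,2,1\n3,4,3\n"

df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))  # the second 'A' has been renamed by read_csv

# Check 1: repeated names in the raw header line
header = next(csv.reader(io.StringIO(csv_text)))
repeated_names = sorted({name for name in header if header.count(name) > 1})
print("Repeated header names:", repeated_names)

# Check 2: columns whose values duplicate an earlier column
duplicate_content = df.columns[df.T.duplicated()].tolist()
print("Columns duplicating earlier content:", duplicate_content)
```

A column with identical values is not necessarily an error, since two miRNAs can both sit at the same floor value in every sample, so the second check is best read as a prompt for inspection.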

**Train-Test Split for Each Dataset**

In [50]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split

In [51]:
def split_and_save_datasets(dataset_paths, save_directory, test_size=0.2, random_state=42):
    for dataset_path in dataset_paths:
        # Extract the base name for the dataset
        base_name = os.path.basename(dataset_path).replace('.csv', '')
        
        # Load the dataset
        dataset = pd.read_csv(dataset_path)
        
        # Perform the train-test split
        train, test = train_test_split(dataset, test_size=test_size, random_state=random_state)
        
        # Create the file names for the training and test sets
        train_file_name = f"{base_name}_train.csv"
        test_file_name = f"{base_name}_test.csv"
        
        # Construct the full paths for the new training and test set files
        train_file_path = os.path.join(save_directory, train_file_name)
        test_file_path = os.path.join(save_directory, test_file_name)
        
        # Save the training and test sets to their respective files
        train.to_csv(train_file_path, index=False)
        test.to_csv(test_file_path, index=False)
        
        print(f"Train set saved to {train_file_path}")
        print(f"Test set saved to {test_file_path}")

In [52]:
# Example usage:
file_paths = [
    # List your file paths here
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_AD.csv',
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_DLB.csv',
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_MCI.csv',
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_NPH.csv',
    '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch1/NC_vs_VaD.csv',
]

In [53]:
save_directory = '/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2'

In [54]:
# Now call the function
split_and_save_datasets(file_paths, save_directory)

Train set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/NC_vs_AD_train.csv
Test set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/NC_vs_AD_test.csv
Train set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/NC_vs_DLB_train.csv
Test set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/NC_vs_DLB_test.csv
Train set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/NC_vs_MCI_train.csv
Test set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/NC_vs_MCI_test.csv
Train set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/NC_vs_NPH_train.csv
Test set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch2/NC_vs_NPH_test.csv
Train set saved to /home/aghasemi/CompBio481/datasets/processed_datasets/u
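The split above is purely random, so a small disease group can end up under-represented in the test set. A hedged sketch of a stratified variant using the `stratify` parameter of `train_test_split`, on toy data with a 15:5 class ratio (the frame and its `feature` column are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy binarized dataset: 15 controls (0) and 5 cases (1)
toy = pd.DataFrame({
    "feature": range(20),
    "Diagnosis": [0] * 15 + [1] * 5,
})

# stratify keeps the class proportions the same in both partitions
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Diagnosis"]
)

print("Train:", train["Diagnosis"].value_counts().to_dict())
print("Test:", test["Diagnosis"].value_counts().to_dict())
```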

**Multi-Class Dataset Construction (Branch 3)**

In [57]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

In [58]:
dataset = pd.read_csv("/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch3/complete_dataset.csv")

In [59]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

In [60]:
# Fit LabelEncoder to the 'Diagnosis' column and transform it
dataset['Diagnosis'] = label_encoder.fit_transform(dataset['Diagnosis'])

In [61]:
# Capture the first few rows of the transformed dataset and the unique classes (their positions are the encoded labels) for display in the next cell
transformed_head = dataset.head()
unique_classes = label_encoder.classes_

In [62]:
(transformed_head, unique_classes)

(         ID_1  MIMAT0000062  MIMAT0000063  MIMAT0000064  MIMAT0000065  \
 0  GSM3403761      2.307579      2.307579      2.307579      2.307579   
 1  GSM3403762      1.503044      2.505380      1.503044      1.503044   
 2  GSM3403763      1.549877      1.983125      1.549877      1.549877   
 3  GSM3403764      1.560269      1.560269      1.560269      1.560269   
 4  GSM3403765      3.179096      3.302472      3.179096      3.179096   
 
    MIMAT0000066  MIMAT0000067  MIMAT0000068  MIMAT0000069  MIMAT0000070  ...  \
 0      2.307579      2.307579      2.307579      2.307579      2.307579  ...   
 1      1.503044      1.503044      1.503044      1.503044      1.503044  ...   
 2      1.549877      1.549877      1.549877      1.549877      1.549877  ...   
 3      2.232974      1.560269      1.560269      1.560269      1.560269  ...   
 4      4.793470      3.179096      3.179096      3.179096      3.179096  ...   
 
    MIMAT0032114  MIMAT0032115  MIMAT0032116  MIMAT0033692  MIMAT0

In [63]:
dataset.to_csv("/home/aghasemi/CompBio481/datasets/processed_datasets/usable_datasets_branch3/multi_class_dataset.csv", index=False)
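`LabelEncoder` assigns integer codes in sorted order of the class names, so `NC` does not necessarily map to 0 as it does in the binarized files. A short sketch, on toy labels, of recovering the class-to-code mapping and decoding the integers back; the label values here are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(["NC", "AD", "MCI", "AD", "NC"], name="Diagnosis")

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted; the position of each class is its integer code
label_map = {str(cls): code for code, cls in enumerate(encoder.classes_)}
print(label_map)

# inverse_transform turns the integer codes back into the original labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Saving `label_map` next to the encoded CSV keeps the integer labels interpretable later.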