### Extract and Clean Clinical Factors

**Import Libraries:** 
Imports the `pandas` library for data manipulation.

In [8]:
import pandas as pd

**Define `extract_and_clean_clinical_factors` Function:** 
- **Load the Data:** Reads the clinical factors data from a CSV file into a DataFrame.
- **Extract Sample IDs:** Finds the row labeled `!Sample_geo_accession` and takes its values as the sample IDs.
- **Extract Characteristics:** Selects the rows whose label starts with `!Sample_characteristics_ch1`.
- **Initialize Dictionary:** Creates a dictionary keyed by column name, starting with the `SampleID` column.
- **Iterate Over Rows:** Splits each `name: value` cell into a characteristic name and value, normalizes the name (lowercase, underscores for spaces), and stores the value at that sample's position.
- **Transform Data:** Converts the dictionary into a DataFrame.
- **Save Data:** Optionally saves the cleaned DataFrame to a CSV file.
- **Return DataFrame:** Returns the cleaned DataFrame.
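
The steps above can be sketched on a tiny in-memory table. The layout used here (row labels in the first column, one column per sample, cells of the form `name: value`) is an assumption about the input file, and the sample IDs and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the input: the first column holds the row label,
# the remaining columns hold one sample each (values invented for illustration).
raw = pd.DataFrame(
    [
        ["!Sample_geo_accession", "GSM0000001", "GSM0000002"],
        ["!Sample_characteristics_ch1", "age: 83", "age: 75"],
        ["!Sample_characteristics_ch1", "Sex: male", "Sex: female"],
    ]
)

labels = raw.iloc[:, 0]
sample_ids = raw[labels == "!Sample_geo_accession"].iloc[0, 1:].tolist()
char_rows = raw[labels.str.startswith("!Sample_characteristics_ch1")]

parsed = {"SampleID": sample_ids}
for _, row in char_rows.iterrows():
    for i, cell in enumerate(row.iloc[1:]):
        # Split on the first colon only, so values that contain ':' stay whole
        name, value = cell.split(":", 1)
        name = name.strip().replace(" ", "_").lower()
        parsed.setdefault(name, [None] * len(sample_ids))[i] = value.strip()

toy_df = pd.DataFrame(parsed)
print(toy_df)
```

Every parsed value is still a string at this point (for example `'83'`), so numeric columns such as `age` need an explicit cast, or a round trip through CSV, before any arithmetic.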

**Example Usage:** 
Calls the function with a file path and an optional output file path, and prints the first few rows of the cleaned DataFrame.

In [9]:
def extract_and_clean_clinical_factors(file_path, output_file=None):
    # Load the clinical factors data
    clinical_factors = pd.read_csv(file_path)

    # Extract Sample IDs
    sample_ids_row = clinical_factors[clinical_factors.iloc[:, 0] == '!Sample_geo_accession']
    sample_ids = sample_ids_row.iloc[0, 1:].values  # Exclude the first column, which holds the row label

    # Extract rows that contain characteristics
    characteristics_rows = clinical_factors[clinical_factors.iloc[:, 0].str.startswith('!Sample_characteristics_ch1', na=False)]

    # Initialize the dictionary to hold the characteristics data
    characteristics_data = {'SampleID': sample_ids}

    # Iterate over each characteristic row to extract and separate the data
    for _, row in characteristics_rows.iterrows():
        for i, value in enumerate(row.iloc[1:].values):  # Skip the row label
            # Split on the first colon only, so values that themselves contain ':' stay whole
            characteristic_name, characteristic_value = value.split(':', 1)
            characteristic_name = characteristic_name.strip().replace(' ', '_').lower()

            # If the characteristic name is not in the dictionary, add it
            if characteristic_name not in characteristics_data:
                characteristics_data[characteristic_name] = [None] * len(sample_ids)
            
            # Update the characteristic value for the current sample
            characteristics_data[characteristic_name][i] = characteristic_value.strip()

    # Transform the corrected data into a DataFrame
    cleaned_clinical_factors = pd.DataFrame(characteristics_data)

    # Save to CSV if an output file path is provided
    if output_file:
        cleaned_clinical_factors.to_csv(output_file, index=False)

    return cleaned_clinical_factors

In [11]:
# Example usage
file_path = '/home/aghasemi/CompBio481/datasets/original_datasets/GSE167559_clinical_factors.csv'
output_file = '/home/aghasemi/CompBio481/datasets/extracted_clinical_factors/GSE167559_clinical_factors_extracted.csv'
cleaned_df = extract_and_clean_clinical_factors(file_path, output_file)
print(cleaned_df.head())

     SampleID tissue diagnosis age     sex apoe4
0  GSM5107459  serum       NPH  83    male     0
1  GSM5107460  serum       NPH  75    male     0
2  GSM5107461  serum       NPH  87  female     0
3  GSM5107462  serum       NPH  73  female     0
4  GSM5107463  serum       NPH  79  female     0


### Combine Clinical Factors

**Define `combine_dataframes` Function:** 
- **Read DataFrames:** Reads each CSV file into a DataFrame and stores them in a list.
- **Concatenate DataFrames:** Combines the DataFrames row-wise into a single DataFrame.
- **Reset Index:** Resets the index of the combined DataFrame.
- **Remove Column:** Drops the 'tissue' column if it exists.
- **Save Data:** Optionally saves the combined DataFrame to a CSV file.
- **Return DataFrame:** Returns the combined DataFrame.
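
Row-wise concatenation aligns columns by name, so a column that exists in only some of the input files is filled with `NaN` for the others. The sketch below uses two invented frames to show this; the frame names and values are hypothetical, and `ignore_index=True` is used as a shorthand that produces the same result as concatenating and then calling `reset_index(drop=True)`.

```python
import pandas as pd

# Two hypothetical extracted tables; the second lacks the 'tissue' column
df_a = pd.DataFrame(
    {"SampleID": ["A1", "A2"], "tissue": ["serum", "serum"], "age": [83, 75]}
)
df_b = pd.DataFrame({"SampleID": ["B1"], "age": [70]})

# Stack the rows and renumber the index from 0
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(stacked)

# Columns present in only some inputs come back with missing values
missing_per_column = stacked.isna().sum()
print(missing_per_column)
```

Checking `isna().sum()` after combining is a quick way to spot characteristics that were recorded in one dataset but not the other.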

**Example Usage:** 
Calls the function with file paths and an optional output file path, and prints the first few rows of the combined DataFrame.

In [13]:
def combine_dataframes(file_paths, output_file=None):
    # Read each CSV file into a DataFrame and store them in a list
    dataframes = [pd.read_csv(file_path) for file_path in file_paths]
    
    # Concatenate the DataFrames row-wise
    combined_df = pd.concat(dataframes, axis=0)
    
    # Reset the index of the combined DataFrame
    combined_df.reset_index(drop=True, inplace=True)
    
    # Remove the 'tissue' column if it exists
    if 'tissue' in combined_df.columns:
        combined_df.drop(columns=['tissue'], inplace=True)
    
    # Save to CSV if an output file path is provided
    if output_file:
        combined_df.to_csv(output_file, index=False)
    
    return combined_df

In [14]:
# Example usage
file_paths = [
    '/home/aghasemi/CompBio481/datasets/extracted_clinical_factors/GSE167559_clinical_factors_extracted.csv',
    '/home/aghasemi/CompBio481/datasets/extracted_clinical_factors/GSE120584_clinical_factors_extracted.csv'
]
output_file = '/home/aghasemi/CompBio481/datasets/processed_datasets/GSE120584_&_GSE167559_clinical_factors_combined.csv'

In [15]:
combined_df = combine_dataframes(file_paths, output_file)
print(combined_df.head())

     SampleID diagnosis  age     sex  apoe4
0  GSM5107459       NPH   83    male      0
1  GSM5107460       NPH   75    male      0
2  GSM5107461       NPH   87  female      0
3  GSM5107462       NPH   73  female      0
4  GSM5107463       NPH   79  female      0


### Check for Unique Clinical Factors

**Import Libraries:** 
Relies on the `pandas` import from the first section; no new import is needed here.

**Read Combined Data:** 
Reads the combined clinical factors data from a CSV file into a DataFrame.

**Find Duplicates:** 
Flags every row that exactly repeats an earlier row across all columns; `duplicated()` leaves the first occurrence unmarked.

**Print Duplicates:** 
Prints the duplicate rows for inspection; an empty DataFrame means no row is repeated exactly.
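
A full-row check does not catch a sample ID that appears twice with different values in another column. As a complementary sketch (not part of the original analysis), `duplicated` can be restricted to the `SampleID` column with `subset`, and `keep=False` flags every occurrence rather than only the later ones. The data below is invented for illustration.

```python
import pandas as pd

# Invented example: the same sample appears twice with different ages
demo = pd.DataFrame(
    {
        "SampleID": ["S1", "S2", "S1"],
        "age": [70, 81, 71],
        "sex": ["male", "female", "male"],
    }
)

# Full-row check: no row repeats an earlier row exactly
full_row_dups = demo[demo.duplicated()]

# ID-only check: keep=False flags every row that shares a SampleID
id_dups = demo[demo.duplicated(subset=["SampleID"], keep=False)]

print(len(full_row_dups), len(id_dups))
```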

In [16]:
clinical_factors = pd.read_csv("/home/aghasemi/CompBio481/datasets/processed_datasets/GSE120584_&_GSE167559_clinical_factors_combined.csv")

In [19]:
# Find duplicate rows, considering all columns
duplicates = clinical_factors[clinical_factors.duplicated()]

In [20]:
# Print the duplicate rows
print(duplicates)

Empty DataFrame
Columns: [SampleID, diagnosis, age, sex, apoe4]
Index: []
