# ASD Classification Project: Data Processing and Preparation

This repository contains the code and documentation for a machine learning project focused on classifying individuals with Autism Spectrum Disorder (ASD) based on their behavioral responses and Social Responsiveness Scale (SRS) scores.

## Project Overview

The goal of this project is to develop a machine learning model that can accurately classify individuals into different groups (e.g., ASD, High-functioning, Typically Developing) based on their performance in a series of interactive tasks and their SRS scores.

This README provides a detailed explanation of the data processing and preparation steps performed before training the classification model.

## Data Description

The project utilizes two main datasets:

1.  **SRS Data (`SRS_data_total_1004_all_Oct232024_Nov01F.csv`):** This dataset contains Social Responsiveness Scale (SRS) scores for each individual. It includes the following columns (at a minimum):
    -   `SubjectId`: Unique identifier for each individual.
    -   `class`: The individual's classification (e.g., `asd`, `high`, `td`).
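    -   `combined`: A per-subject column from the SRS dataset that is appended to the model input text as `Combined: ...` (see `preprocess_function_2`). Its exact contents are not documented here.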

2.  **Interaction Data (`text_exc_merged_df_F_1004_Oct18F.csv`):** This dataset contains information about each individual's interactions during the tasks, likely derived from speech-to-text analysis. The original columns included:
    -   `SubjectId`: Unique identifier for each individual.
    -   `Class_txt`: Text-based classification.
    -   `Visit`: Visit number.
    -   Columns ending with `_txt`: Textual descriptions of responses to specific tasks (e.g., `Responded to name_txt`, `Mimicked actions1_txt1`, etc.).
    -   Columns without `_txt`: Likely indicate 'Y'/'N' (Yes/No) responses to tasks (e.g., `Responded to name`, `Mimicked actions1`, etc.).
    -   `Class_ex`: Another classification column.
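
Before running the pipeline, it can help to confirm that a loaded file actually exposes the columns listed above. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns`, the constant `SRS_REQUIRED`, and the toy frame are hypothetical names introduced for illustration, and the toy values are made up.

```python
import pandas as pd

# Column names taken from the SRS dataset description above.
SRS_REQUIRED = {"SubjectId", "class", "combined"}


def missing_columns(df: pd.DataFrame, required: set) -> set:
    """Return the required column names that are absent from df."""
    return set(required) - set(df.columns)


# Toy frame standing in for the real SRS file (values are invented).
toy_srs = pd.DataFrame({"SubjectId": [1, 2], "class": ["asd", "td"]})

print(missing_columns(toy_srs, SRS_REQUIRED))  # {'combined'}
```

An empty set means every expected column is present. The same helper could be pointed at the interaction data with its own set of required names.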

## Data Processing and Preparation Steps

The following steps were performed to prepare the data for the machine learning model:
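
The core idea of the steps below is to turn each `Y`/`N` task result into a short phrase and join the phrases into one text string per subject. Here is a simplified sketch of that idea on toy data. The names `TASKS`, `outcome_text`, `build_text`, and `toy` are hypothetical, only two of the task columns are used, and the separator differs slightly from the notebook's own format.

```python
import pandas as pd

# Subset of the real task columns, used here for illustration only.
TASKS = ["Responded to name", "Played catch"]


def outcome_text(value, task):
    """Map 'Y'/'N' to a short phrase; any other value becomes an empty string."""
    if value == "Y":
        return f"Success of {task}"
    if value == "N":
        return f"Failure of {task}"
    return ""


def build_text(row):
    """Join the non-empty task phrases of one row into a single string."""
    parts = [outcome_text(row[task], task) for task in TASKS]
    return "Task Results: " + ", ".join(p for p in parts if p)


# Invented rows; the second subject has no recorded result for 'Played catch'.
toy = pd.DataFrame({
    "SubjectId": [101, 102],
    "Responded to name": ["Y", "N"],
    "Played catch": ["N", None],
})

toy["text"] = toy.apply(build_text, axis=1)
print(toy["text"].tolist())
```

Missing results simply drop out of the text, so a subject with fewer recorded tasks gets a shorter string.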

### 1. Import Libraries

```python
# --- Import libraries ---
import pandas as pd
import numpy as np

# --- Load data ---
# NOTE: these are absolute paths from the author's machine; adjust them to your environment.
# 1004

# Earlier or alternative SRS inputs and exports, kept for reference:
# SRS_df = pd.read_csv('/home/skbae/Documents/skbae/ASD/paper/4.Multimodal_RiskST/Git/data/SRS_data_total_1004_all_Oct232024_Nov01F.csv')
# SRS_df2.to_csv('./SRS_data_total_20250102_all_mapped_F.csv', header=True, index=False)
# SRS_df = pd.read_csv('/home/skbae/Documents/skbae/ASD/paper/4.Multimodal_RiskST/Process/SRS_data_total_20250102_all_mapped_F.csv')

# Active SRS input (its file name differs from the one listed in the data description).
SRS_df = pd.read_csv('/home/skbae/Documents/skbae/ASD/paper/4.Multimodal_RiskST/Process/SRS_data_total_1004_all_nov13F.csv')

# Interaction data (server copy: /home/skbae/ASD/speech/text_exc_merged_df_F_1004_Oct18F.csv)
df_interation = pd.read_csv('/home/skbae/Documents/skbae/ASD/paper/4.Multimodal_RiskST/Git/data/text_exc_merged_df_F_1004_Oct18F.csv')

# --- Rename columns ---
# Positional rename: the list must match the CSV's column order and count (17 columns).
df_interation.columns = [
    'SubjectId', 'Class_txt', 'Visit',
    'Responded to name_txt', 'Mimicked actions1_txt1', 'Mimicked actions2_txt2',
    'Played catch_txt', 'Fed baby doll_txt', 'Reacted to snack_txt',
    'na_columns_name',
    'Responded to name', 'Mimicked actions1', 'Mimicked actions2',
    'Played catch', 'Fed baby doll', 'Reacted to snack',
    'Class_ex',
]

# --- Create "Success/Failure" columns ---
columns_to_check = ['Responded to name', 'Mimicked actions1', 'Mimicked actions2',
                    'Played catch', 'Fed baby doll', 'Reacted to snack']

# 'Y' -> 'Success of <task>', 'N' -> 'Failure of <task>', anything else -> ''.
for column in columns_to_check:
    df_interation[column + '_new2'] = df_interation[column].apply(
        lambda x: 'Success of ' + column if x == 'Y'
        else ('Failure of ' + column if x == 'N' else '')
    )

# --- Combine mimicked-action text descriptions ---
# Treat the literal string 'None' in the Y/N columns as missing.
df_interation['Mimicked actions1'] = df_interation['Mimicked actions1'].replace('None', np.nan)
df_interation['Mimicked actions2'] = df_interation['Mimicked actions2'].replace('None', np.nan)

def combine_text_columns(row):
    """Join the two mimicked-action text descriptions, skipping missing ones."""
    texts = [row['Mimicked actions1_txt1'], row['Mimicked actions2_txt2']]
    filtered_texts = [str(text) for text in texts if pd.notna(text)]
    return ", ".join(filtered_texts)

df_interation['Mimicked actions_tot_txt'] = df_interation.apply(combine_text_columns, axis=1)

# --- Combine mimicked actions (Y/N) ---
# Treat '-' as missing in both Y/N columns.
df_interation[['Mimicked actions1', 'Mimicked actions2']] = df_interation[['Mimicked actions1', 'Mimicked actions2']].replace('-', np.nan)

def combine_actions(row):
    """Collapse the two attempts into one result: 'Y' if either succeeded, else 'N' if either failed."""
    if pd.isna(row['Mimicked actions1']) and pd.isna(row['Mimicked actions2']):
        return np.nan  # Both attempts are missing
    elif (row['Mimicked actions1'] == 'Y') or (row['Mimicked actions2'] == 'Y'):
        return 'Y'
    elif (row['Mimicked actions1'] == 'N') or (row['Mimicked actions2'] == 'N'):
        return 'N'
    else:
        return '-'  # Any other value

df_interation['Mimicked actions_tot'] = df_interation.apply(combine_actions, axis=1)
columns_to_check = ['Mimicked actions_tot']

for column in columns_to_check:
    df_interation[column + '_new'] = df_interation[column].apply(
        lambda x: 'Success of ' + column if x == 'Y'
        else ('Failure of ' + column if x == 'N' else '')
    )

# --- Select and merge DataFrames ---
df_interation2 = df_interation[['SubjectId', 'Class_ex', 'Responded to name_new2',
                                'Mimicked actions1_new2', 'Mimicked actions2_new2',
                                'Played catch_new2', 'Fed baby doll_new2', 'Reacted to snack_new2']]

# Default inner join: only subjects present in both datasets are kept.
merged_int_SRS_df = df_interation2.merge(SRS_df[['SubjectId', 'class', 'combined']], on='SubjectId')

# Normalize the SRS class names.
merged_int_SRS_df['class'] = merged_int_SRS_df['class'].replace({'asd': 'ASD', 'high': 'High', 'td': 'TD'})

# --- Save to JSON and reload ---
# JSON Lines format: one record per line.
json_file_path = './merged_int_SRS_df_op1_2nd_1004_Nov04F2.json'
merged_int_SRS_df.to_json(json_file_path, orient='records', lines=True)

# Reload, so that the remaining steps can be run from the saved file.
merged_int_SRS_df = pd.read_json(json_file_path, orient='records', lines=True)

# --- Preprocess data for modeling ---
label_mapping = {'TD': 0, 'High': 1, 'ASD': 2}

def preprocess_function_2(row):
    """Build the model input text and the three-class label for one subject row."""
    # The two mimicked-action phrases are concatenated without a separator.
    task_results = f" {row['Responded to name_new2']} ,{row['Mimicked actions1_new2'] + row['Mimicked actions2_new2']} ,{row['Played catch_new2']} ,{row['Fed baby doll_new2']} ,{row['Reacted to snack_new2']}"
    input_text = f"Task Results: {task_results} Combined: {row['combined']}"
    # The label comes from 'Class_ex' (interaction data), not the SRS 'class' column;
    # its values must be keys of label_mapping.
    label = label_mapping[row['Class_ex']]
    return {"SubjectId": row['SubjectId'], "text": input_text, "label": label}

preprocessed_2 = merged_int_SRS_df.apply(preprocess_function_2, axis=1)
preprocessed_df2 = pd.DataFrame(preprocessed_2.tolist())

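# --- Inspect the preprocessed data ---
print(preprocessed_df2[['SubjectId', 'text', 'label']].head())

# Keep the three-class label as 'Class'; rename() returns a new frame,
# so adding 'label' does not trigger pandas' SettingWithCopyWarning.
df_sF = preprocessed_df2[['SubjectId', 'text', 'label']].rename(columns={'label': 'Class'})
df_sF['label'] = df_sF['Class']
df_sF.label.value_counts()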

# --- Binary target: High vs. ASD ---
ASD_mapping2 = {
    # 0: 0,  # TD (excluded)
    1: 0,  # High risk of ASD
    2: 1,  # ASD
}

# Keep only the 'High' (1) and 'ASD' (2) rows; .copy() avoids SettingWithCopyWarning.
df_sF2 = df_sF[df_sF['Class'].isin([1, 2])].copy()
# df_sF2 = df_sF  # alternative: keep all three classes

df_sF2['label'] = df_sF2['Class'].replace(ASD_mapping2)
df_sF2.label.value_counts()
df_m2 = df_sF2

df_m2['label'].value_counts()
# Output:
# label
# 1    353
# 0    162
# Name: count, dtype: int64

# --- Save the binary dataset ---
df_m2.to_json('./df_m2_5tasks_SRS_1004_Jan07F2.json', orient='records', lines=True)
# df_m2.to_json('./df_m2_5tasks_SRS_20250102_Jan07F.json', orient='records', lines=True)

# --- Check GPU availability and fix random seeds ---
import random

import torch

if torch.cuda.is_available():
    print("GPU is available")
# Output:
# GPU is available

def set_seed(random_seed):
    """Seed PyTorch, NumPy, and Python's random module for reproducibility."""
    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    # torch.cuda.manual_seed_all(random_seed)  # if using multiple GPUs
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(random_seed)
    random.seed(random_seed)

random_seed = 42
set_seed(random_seed)

# --- Train / validation / test split ---
from sklearn.model_selection import StratifiedGroupKFold

# Grouping by SubjectId keeps all rows of a subject in the same split;
# stratifying on the label keeps class proportions similar across splits.
# With 5 folds, the first fold holds out roughly 20% of the data.
sgkf = StratifiedGroupKFold(n_splits=5)

# Indices for the training set and the held-out set
train_idx, temp_idx = next(sgkf.split(df_m2, df_m2['label'], groups=df_m2['SubjectId']))

# Create the training and held-out sets
train = df_m2.iloc[train_idx]
temp = df_m2.iloc[temp_idx]

# Split the held-out set in half: validation and test (roughly 10% each)
sgkf = StratifiedGroupKFold(n_splits=2, shuffle=True, random_state=42)
val_idx, test_idx = next(sgkf.split(temp, temp['label'], groups=temp['SubjectId']))

# Create the validation and test sets
val = temp.iloc[val_idx]
test = temp.iloc[test_idx]

# Verify the label distribution
print("\nLabel distribution in Train set:")
print(train['label'].value_counts(normalize=True))
print("\nLabel distribution in Validation set:")
print(val['label'].value_counts(normalize=True))
print("\nLabel distribution in Test set:")
print(test['label'].value_counts(normalize=True))

# Output:
#
# Label distribution in Train set:
# label
# 1    0.686893
# 0    0.313107
# Name: proportion, dtype: float64
#
# Label distribution in Validation set:
# label
# 1    0.634615
# 0    0.365385
# Name: proportion, dtype: float64
#
# Label distribution in Test set:
# label
# 1    0.72549
# 0    0.27451
# Name: proportion, dtype: float64


# Absolute label counts per split
print(train['label'].value_counts())
print(val['label'].value_counts())
print(test['label'].value_counts())

# Output:
# label
# 1    283
# 0    129
# Name: count, dtype: int64
# label
# 1    33
# 0    19
# Name: count, dtype: int64
# label
# 1    37
# 0    14
# Name: count, dtype: int64


# --- Save the splits (JSON Lines) ---
train.to_json('./train_5tasks_SRS_1004_Jan07F2.json', orient='records', lines=True)
val.to_json('./val_5tasks_SRS_1004_Jan07F2.json', orient='records', lines=True)
test.to_json('./test_5tasks_SRS_1004_Jan07F2.json', orient='records', lines=True)

# Alternative outputs for the 20250102 dataset:
# train.to_json('./train_5tasks_SRS_20250102_Jan07F.json', orient='records', lines=True)
# val.to_json('./val_5tasks_SRS_20250102_Jan07F.json', orient='records', lines=True)
# test.to_json('./test_5tasks_SRS_20250102_Jan07F.json', orient='records', lines=True)
```