# **Supervised Learning - Autism Dataset for Toddlers**

## Autism Spectrum Disorder (ASD) Diagnosis

# **Introduction**

The dataset is adapted from Kaggle's Autism Dataset for Toddlers page and contains 1054 reported cases, each with 19 attributes. The most important attribute is the Q-Chat-10 score (`Qchat-10-Score`), the aggregate of the answers to a 10-question questionnaire. Each question corresponds to a binary attribute (`A1` to `A10`), and these answers are the most influential features for identifying autistic traits. The other labeled attributes are `Sex`, `Ethnicity`, `Jaundice`, `Family_mem_with_ASD`, `Who completed the test`, and `Class/ASD Traits`, the last of which is derived from the Q-Chat-10 score. The dataset also records the age in months (`Age_Mons`) and the case number (`Case_No`), a unique identifier for each row.

The goal of this project is to classify ASD traits from the features evaluated in the questionnaire, and to examine how those traits are distributed across the other labeled factors: `Sex`, `Ethnicity`, `Jaundice`, `Family_mem_with_ASD`, and `Who completed the test`.

The solution to this problem is a supervised learning model trained on the dataset described above. The model is fitted on a training set and then evaluated on a held-out test set. The main evaluation metric is accuracy, that is, the percentage of cases whose `Class/ASD Traits` value is predicted correctly; precision, recall, and F1 score are reported alongside it.

---

This project was made possible by:

| Name | Email |
|-|-|
| André Silva | up202108724@up.pt |
| Bernardo Pinto | up202108842@up.pt |
| Francisco Sousa | up202108838@up.pt |

Group: T10 - G104

### Importing libraries

Libraries were added incrementally throughout the study, so it is important to install them all. This can be done by running the following command in a terminal (make sure you are in the project's root directory):

```bash
pip install -r requirements.txt
```

Then, we can import the libraries we will use in this project.

Note that we also disabled warnings to keep the notebook output clean.

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stat
from tabulate import tabulate
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
#from sklearn.neural_network import MLPClassifier # not working, for now
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import time
#import pycaret


In [22]:
#import pandas as pd

dataframe = pd.read_csv("./Autism_dataset.csv")
dataframe.head()

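| | Case_No | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Age_Mons | Qchat-10-Score | Sex | Ethnicity | Jaundice | Family_mem_with_ASD | Who completed the test | Class/ASD Traits |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 28 | 3 | f | middle eastern | yes | no | family member | No |
| 1 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 36 | 4 | m | White European | yes | no | family member | Yes |
| 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 36 | 4 | m | middle eastern | yes | no | family member | Yes |
| 3 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 24 | 10 | m | Hispanic | no | no | family member | Yes |
| 4 | 5 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 20 | 9 | f | White European | no | yes | family member | Yes |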


# Data pre-processing

## Pre analysis

The first step is to analyze the dataset to understand its structure and the type of data it contains. This allows us to identify any missing values, outliers, or other issues that need to be addressed before training the model.

In [ ]:
dataframe.describe()


In [ ]:
dataframe.isna().any()
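
The cells above summarise the columns and check for missing values. For outliers, one common heuristic is the 1.5 × IQR rule. The sketch below applies it to made-up ages; the values and the helper name `iqr_outliers` are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Hypothetical ages in months, with one implausible entry (240)
ages = pd.Series([24, 28, 30, 33, 36, 36, 20, 240], name="Age_Mons")
outliers = iqr_outliers(ages)
print(outliers)
```

The same helper could be called on `dataframe["Age_Mons"]`, the only continuous column in this dataset.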

In [ ]:
dataframe= dataframe.drop(columns=['Case_No','Qchat-10-Score'])
print(dataframe['Class/ASD Traits'].head())
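
`Qchat-10-Score` is dropped because the target is derived from it, so keeping it would leak the label into the features. The sketch below re-types the five rows shown by `dataframe.head()` and checks two things: that the score equals the sum of `A1` to `A10`, and that a simple threshold reproduces the label. The threshold of 3 is an assumption that is only verified here against these five rows.

```python
import pandas as pd

# The five rows shown by dataframe.head(), re-typed by hand
question_cols = [f"A{i}" for i in range(1, 11)]
rows = [
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
]
sample = pd.DataFrame(rows, columns=question_cols)
sample["Qchat-10-Score"] = [3, 4, 4, 10, 9]
sample["Class/ASD Traits"] = ["No", "Yes", "Yes", "Yes", "Yes"]

# The score is the sum of the ten binary answers
recomputed_score = sample[question_cols].sum(axis=1)
score_matches = bool((recomputed_score == sample["Qchat-10-Score"]).all())

# Assumed rule: a score above 3 is labelled "Yes" (consistent with these rows only)
THRESHOLD = 3
predicted = (recomputed_score > THRESHOLD).map({True: "Yes", False: "No"})
label_matches = bool((predicted == sample["Class/ASD Traits"]).all())

print(score_matches, label_matches)
```

If the same rule holds for the full dataset, the `A1` to `A10` columns alone determine the label, which is worth keeping in mind when interpreting very high classifier scores.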

In [ ]:

encoder= LabelEncoder()

binary_cols = ["Sex", "Jaundice", "Family_mem_with_ASD", "Class/ASD Traits"]

# Label-encode each two-valued column as 0/1
for binary_attribute in binary_cols:
    dataframe[binary_attribute] = encoder.fit_transform(dataframe[binary_attribute])

# Merge the two spellings of the same respondent category before one-hot encoding
dataframe["Who completed the test"] = dataframe["Who completed the test"].replace("Health care professional", "Health Care Professional")

#print(dataframe["Who completed the test"].unique())

categorical_cols = ["Ethnicity", "Who completed the test"]


# Applying one-hot encoding to categorical columns
one_hot_encoded = pd.get_dummies(dataframe[categorical_cols])

# Concatenating one-hot encoded columns with the original dataframe
dataframe_encoded = pd.concat([dataframe, one_hot_encoded], axis=1)

# Dropping the original categorical columns
dataframe_encoded.drop(categorical_cols, axis=1, inplace=True)

# Displaying the resulting dataframe
#np.unique(dataframe_encoded.iloc[:,17:18])
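
The encode, concatenate, and drop steps above can also be done in a single call by passing `columns=` to `pd.get_dummies`. The sketch below shows this on a made-up three-row frame; `toy` and `toy_encoded` are hypothetical names, and `dtype=int` is used only so that the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical miniature version of the dataset
toy = pd.DataFrame({
    "Age_Mons": [28, 36, 24],
    "Ethnicity": ["middle eastern", "White European", "Hispanic"],
})

# `columns=` one-hot encodes the listed columns and drops the originals
toy_encoded = pd.get_dummies(toy, columns=["Ethnicity"], dtype=int)
print(toy_encoded)
```

Each row has exactly one `Ethnicity_*` column set to 1, which is what makes the per-column bar charts in the analysis below meaningful.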



# Dataset Analysis

We can now analyze the dataset to better understand the distribution of the data and the relationships between the different features. This will help us identify any patterns or trends that may be useful for training the model.

In [ ]:
plt.figure(figsize=(20, 20))

# Assuming dataframe_encoded is your dataframe after one-hot encoding

# Splitting the dataframe based on the target variable
df1 = dataframe_encoded[dataframe_encoded['Class/ASD Traits'] == 1].drop(columns=['Class/ASD Traits'])
df2 = dataframe_encoded[dataframe_encoded['Class/ASD Traits'] == 0].drop(columns=['Class/ASD Traits'])
print(dataframe_encoded.columns)

# Plotting the distributions
num_columns = len(df1.columns)
plot_index = 1

for i in range(num_columns):
    column_name = df1.columns[i]

    # Age_Mons is continuous and gets its own plot below, so skip it before creating a subplot
    if column_name == 'Age_Mons':
        continue

    plt.subplot(6, 6, plot_index)

    # Count occurrences of 0s and 1s; reindex so both values are always present
    count_asd = df1[column_name].astype(int).value_counts().reindex([0, 1], fill_value=0)
    count_non_asd = df2[column_name].astype(int).value_counts().reindex([0, 1], fill_value=0)

    # Plot grouped bars
    bar_width = 0.35
    positions = np.array([0, 1])
    plt.bar(positions - bar_width/2, count_asd.values, width=bar_width, label='ASD', color='blue', alpha=0.7)
    plt.bar(positions + bar_width/2, count_non_asd.values, width=bar_width, label='Non-ASD', color='orange', alpha=0.7)
    plt.xlabel(column_name)
    plt.ylabel('Count')
    plt.xticks([0, 1])
    plt.legend()
    plot_index += 1

plt.tight_layout()


# Special plot for the continuous variable 'Age_Mons'
df1['Group'] = 'ASD'
df2['Group'] = 'Non-ASD'
df = pd.concat([df1, df2])
        

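binwidth = 1
age_min = int(df['Age_Mons'].min())
# With right=False the bins are half-open, [a, b), so the last edge must lie above
# the maximum age; otherwise the oldest toddlers fall outside every bin
age_max = int(df['Age_Mons'].max()) + 2 * binwidth
bins = range(age_min, age_max, binwidth)
df['Age_Binned'] = pd.cut(df['Age_Mons'], bins, right=False, include_lowest=True)
# Count occurrences within each bin and group
count_df = df.groupby(['Age_Binned', 'Group'], observed=False).size().unstack(fill_value=0)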

# Plot the data (DataFrame.plot creates its own figure, so the size is passed directly)
count_df.plot(kind='bar', width=0.8, color=['blue', 'orange'], figsize=(12, 6))
plt.xlabel('Age (Months)')
plt.ylabel('Count')
plt.title('Age Distribution by ASD and Non-ASD Groups')
plt.legend(title='Group')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
df1.drop(columns=['Group'], inplace=True)
df2.drop(columns=['Group'], inplace=True)
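
When ages are binned with `pd.cut(..., right=False)`, each bin is half-open, so a value equal to the last edge is not assigned to any bin. The sketch below illustrates this with made-up ages; the bin edges are chosen for illustration only.

```python
import pandas as pd

ages = pd.Series([12, 24, 36])

# Edges stop at the maximum age: the last bin is [35, 36), which excludes 36
short_bins = list(range(12, 37))
dropped = pd.cut(ages, short_bins, right=False)

# One extra edge adds a final bin [36, 37) that contains the maximum
long_bins = list(range(12, 38))
kept = pd.cut(ages, long_bins, right=False)

print(dropped.isna().tolist())
print(kept.isna().tolist())
```

Checking `isna()` on the binned column is a quick way to confirm that no rows were silently left out of the age plot.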

In [ ]:
def cramers_v(x, y):
    """Calculate Cramér's V statistic for categorical-categorical association."""
    # Named `contingency` to avoid shadowing sklearn's confusion_matrix imported above
    contingency = pd.crosstab(x, y)
    chi2 = stat.chi2_contingency(contingency)[0]
    n = contingency.sum().sum()
    return np.sqrt(chi2 / (n * (min(contingency.shape) - 1)))

def calculate_correlations(dataframe, target_column):
    correlations = {}
    for column in dataframe.columns:
        if column == target_column:
            continue
        if column=="Age_Mons":
            corr = stat.pointbiserialr(dataframe[target_column], dataframe[column])[0]
        else:
            corr = cramers_v(dataframe[column], dataframe[target_column])    
        correlations[column] = corr
    return correlations
# Correlations are computed on `dataframe`, i.e. before one-hot encoding
target_column = 'Class/ASD Traits'

for column in dataframe.columns:
    if dataframe[column].dtype == 'object':
        dataframe[column] = encoder.fit_transform(dataframe[column])

#print(before_one_hot["Ethnicity"])

# Check if the target column is in the dataframe's columns

if target_column in dataframe.columns:
    # Calculate correlations
    correlations = calculate_correlations(dataframe, target_column)
    
    # Convert to DataFrame for heatmap
    correlation_df = pd.DataFrame.from_dict(correlations, orient='index', columns=[target_column])
    correlation_df = correlation_df.sort_values(by=target_column, ascending=False)
    
    # Plot heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation_df, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
    plt.title(f'Correlations with {target_column}')
    plt.show()
else:
    print(f"Column '{target_column}' not found in the dataframe.")
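
Cramér's V is defined as $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$, where $n$ is the number of observations and $r$, $c$ are the dimensions of the contingency table. The sketch below checks the two extremes on made-up binary columns. It uses a hypothetical helper, `cramers_v_demo`, which differs from the function above in one respect: it passes `correction=False`, because SciPy's default Yates correction for 2×2 tables would keep a perfect association slightly below 1.

```python
import numpy as np
import pandas as pd
import scipy.stats as stat

def cramers_v_demo(x, y):
    """Cramér's V without Yates' continuity correction (illustration only)."""
    table = pd.crosstab(x, y)
    chi2 = stat.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Made-up binary columns
a = pd.Series([0, 0, 1, 1] * 5)
identical = a.copy()                        # perfectly associated with a
independent = pd.Series([0, 1, 0, 1] * 5)   # balanced within each value of a

v_same = cramers_v_demo(a, identical)
v_indep = cramers_v_demo(a, independent)
print(v_same, v_indep)
```

Because V is always between 0 and 1, it carries no sign, unlike the point-biserial coefficient used for `Age_Mons`.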
    

In [ ]:
'''
def calculatecorrelations_columns(dataframe,leftColumn, rightColumn):
    if dataframe[leftColumn].nunique() == 2:  # Binary
        if dataframe[rightColumn].nunique() == 2:
            # Both are binary -> Cramér's V
            #print("Cramers_v : " + leftColumn + rightColumn )
            corr = cramers_v(dataframe[leftColumn], dataframe[rightColumn])
        else:
            # Binary - Continuous -> Point Biserial
            #print("Point Biserial Continuous/Binary : " + leftColumn + rightColumn)
            corr, _ = stat.pointbiserialr(dataframe[leftColumn], dataframe[rightColumn])
    elif dataframe[rightColumn].nunique() == 2:
        # Continuous - Binary -> Point Biserial
        #print("Point Biserial Binary(Continuous : " + leftColumn + rightColumn)
        corr, _ = stat.pointbiserialr(dataframe[rightColumn], dataframe[leftColumn])
    else:
        # Continuous - Continuous -> Pearson's
        #print("Pearson's : " + leftColumn + rightColumn)
        corr = dataframe[leftColumn].corr(dataframe[rightColumn]) # Never reached, since the dataset has only one continuous column
    return corr
'''
def calculatecorrelations_notencodedcolumns(dataframe,leftColumn, rightColumn):
    
    if (leftColumn=="Age_Mons" and rightColumn=="Ethnicity") or (leftColumn=="Ethnicity" and rightColumn=="Age_Mons"):
        grouped_data = [dataframe[dataframe["Ethnicity"] == category]["Age_Mons"] for category in dataframe["Ethnicity"].unique()]
        f_stat, p_value = stat.f_oneway(*grouped_data)
        corr = p_value # Maybe change the algorithm (this is a p-value, not a correlation coefficient)
    elif (leftColumn=="Age_Mons" and rightColumn=="Who completed the test") or (leftColumn=="Who completed the test" and rightColumn=="Age_Mons"):
        grouped_data = [dataframe[dataframe["Who completed the test"] == category]["Age_Mons"] for category in dataframe["Who completed the test"].unique()]
        f_stat, p_value = stat.f_oneway(*grouped_data)
        corr = p_value # Maybe change the algorithm (this is a p-value, not a correlation coefficient)
    elif ((leftColumn=="Age_Mons"  and dataframe[rightColumn].nunique() == 2 ) or (rightColumn=="Age_Mons" and dataframe[leftColumn].nunique() == 2 ) ):
         if leftColumn=="Age_Mons":
            corr = stat.pointbiserialr(dataframe[rightColumn], dataframe[leftColumn])[0]
         else:
            corr = stat.pointbiserialr(dataframe[leftColumn], dataframe[rightColumn])[0]    
    else:
        corr = cramers_v(dataframe[leftColumn], dataframe[rightColumn])         
    return corr
higher_correlation = []
correlations_values = []
df_columns = dataframe.columns

for i, col1 in enumerate(df_columns):
    if col1 == 'Class/ASD Traits':
        continue
    for col2 in df_columns[i::]:
        if col1 == col2 or col2 == 'Class/ASD Traits':
            continue
        a_corr= calculatecorrelations_notencodedcolumns(dataframe,col1, col2)
        if a_corr > 0.8:    
            print(f"Correlation between {col1} and {col2}: {a_corr} \n")
        correlations_values.append(a_corr)
#print(sorted(correlations_values))
#print(len(correlations_values))
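
For the pairs that combine `Age_Mons` with a multi-category column, the cell above stores the ANOVA p-value, which measures evidence against "no difference" rather than strength of association. One possible alternative, not used in the original notebook, is the correlation ratio (eta), which lies between 0 and 1 like Cramér's V. The sketch below uses a hypothetical helper `correlation_ratio` and made-up data.

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0  # no variance to explain
    # Replace every value by the mean of its category
    group_means = values.groupby(categories).transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total))

# Made-up data: in the first case the category fully determines the age
toy = pd.DataFrame({
    "Ethnicity": ["a", "a", "b", "b", "c", "c"],
    "Age_Mons": [20, 20, 28, 28, 36, 36],
    "Age_Unrelated": [20, 36, 20, 36, 20, 36],
})
eta_strong = correlation_ratio(toy["Ethnicity"], toy["Age_Mons"])
eta_none = correlation_ratio(toy["Ethnicity"], toy["Age_Unrelated"])
print(eta_strong, eta_none)
```

With a measure like this, the `> 0.8` threshold in the loop would have the same meaning for every pair of columns.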

# Dataset Training

In [ ]:
features = dataframe.drop(['Class/ASD Traits'],axis=1)
labels = dataframe['Class/ASD Traits']

np.unique(labels, return_counts=True)

In [ ]:
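# Fix the seed for reproducibility and stratify so both splits keep the class ratio
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42, stratify=labels)
# Oversample the minority class in the training set only, so the test set stays untouched
smote = SMOTE(random_state=42)
x_train, y_train = smote.fit_resample(x_train, y_train)
print(np.unique(y_train, return_counts=True))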
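
SMOTE balances the classes by creating synthetic minority samples on the line segments between existing minority samples and their neighbours. The sketch below is a simplified illustration of that interpolation step in plain NumPy, not imbalanced-learn's implementation: it always uses the single nearest neighbour, whereas the library picks among several. The points and the helper name `synthesize` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])

def synthesize(points: np.ndarray, n_new: int, rng: np.random.Generator) -> np.ndarray:
    """Create n_new points, each on the segment between a sample and its nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from the chosen sample to every minority sample
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf  # exclude the sample itself
        j = np.argmin(dists)
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=4, rng=rng)
print(synthetic)
```

Since every synthetic point is built from training samples, resampling before the split would let information from test rows shape the training data, which is why the resampling is applied to the training set only.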


In [ ]:
classifiers = []
classifiers.append(["Decision tree classifier",DecisionTreeClassifier()])
classifiers.append(["KNeighbors Classifier",KNeighborsClassifier()])
classifiers.append(["SVM",SVC()])
classifiers.append(["Random Forest",RandomForestClassifier()])
classifiers.append(["Gradient Boosting",GradientBoostingClassifier()])

accuracy_scores = {}
precision_scores = {}
recall_scores = {}
f1_scores = {}
confusion_matrices = {}
training_times = {}

for name, classifier in classifiers:
    # Time only the fit call so that prediction is not counted as training time
    start_time = time.time()
    classifier.fit(x_train, y_train)
    training_times[name] = time.time() - start_time

    y_pred = classifier.predict(x_test)

    accuracy_scores[name] = accuracy_score(y_test, y_pred)
    precision_scores[name] = precision_score(y_test, y_pred)
    recall_scores[name] = recall_score(y_test, y_pred)
    f1_scores[name] = f1_score(y_test, y_pred)
    confusion_matrices[name] = confusion_matrix(y_test, y_pred)

    
for clf_name, clf in classifiers:
    print(f"Classifier: {clf_name}")
    print(f"Accuracy: {accuracy_scores[clf_name]}")
    print(f"Precision: {precision_scores[clf_name]}")
    print(f"Recall: {recall_scores[clf_name]}")
    print(f"F1 Score: {f1_scores[clf_name]}")
    print(f"Confusion Matrix:\n{confusion_matrices[clf_name]}")
    print(f"Training Time: {training_times[clf_name]} seconds")
    print("----------------------------------------")
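
All four scores printed above can be read off the confusion matrix. The sketch below recomputes them by hand from a made-up 2×2 matrix in scikit-learn's layout (rows are true classes, columns are predicted classes, class 0 first); the counts are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical confusion matrix: [[TN, FP], [FN, TP]]
cm = np.array([[50, 10],
               [5, 135]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are real
recall = tp / (tp + fn)           # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In a screening setting, recall is often the score to watch, since a false negative means a toddler with ASD traits is missed.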
    