# **Supervised Learning - Autism Dataset for Toddlers**

## Autism Spectrum Disorder (ASD) Diagnosis

# **Introduction**

The dataset is adapted from Kaggle's Autism Dataset for Toddlers page and contains 1054 reported cases, each with 19 attributes. The most important attribute is the Q-Chat-10 score (`Qchat-10-Score`), the aggregate of the answers to a 10-question questionnaire. Each question is also stored as its own binary attribute (`A1` to `A10`), and these items are the influential features for determining autistic traits. The labelled attributes are `Sex`, `Ethnicity`, `Jaundice`, `Family_mem_with_ASD`, `Who completed the test`, and the target `"Class/ASD Traits "`, which is derived from the Q-Chat-10 score. The remaining numeric attributes are `Age_Mons`, the age in months, and `Case_No`, a unique identifier for each row.

The goal of this project is to improve the classification of ASD traits from the features evaluated in the questionnaire, and to examine how the outcomes are distributed across the other labelled factors: `Sex`, `Ethnicity`, `Jaundice`, `Family_mem_with_ASD`, and `Who completed the test`.

We treat this as a supervised learning problem. A model is trained on the training set and then evaluated on a held-out test set. The evaluation metric is accuracy, that is, the percentage of cases whose diagnosis is predicted correctly, taking all the labelled factors into account.
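
As a minimal sketch of the metric described above, accuracy is the number of correct predictions divided by the number of test cases. The labels below are made up for illustration and the helper name `accuracy` is our own; scikit-learn's `accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for four held-out cases (1 = ASD traits, 0 = no traits)
y_test = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

score = accuracy(y_test, y_pred)
print(score)  # three of the four predictions are correct
```

Accuracy alone can be misleading when one class dominates, so it may be worth checking the class balance before relying on it.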

---

This project was made possible by:

| Name | Email |
|-|-|
| André Silva | up202108724@up.pt |
| Bernardo Pinto | up202108842@up.pt |
| Francisco Sousa | up202108838@up.pt |

**Group:** T10 - G104

### Importing libraries

Libraries were added incrementally throughout the study, so install them all before running the notebook by executing the following command in a terminal from the project's root directory:

```bash
pip install -r requirements.txt
```

Then, we can import the libraries we will use in this project.

Note that we also disabled warnings to keep the notebook output clean.

In [1]:
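import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.preprocessing import LabelEncoder
import pycaret

# Silence library warnings to keep the notebook output clean
warnings.filterwarnings("ignore")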


In [2]:
# Load the toddler ASD screening dataset from the project directory
dataframe = pd.read_csv("./Autism_dataset.csv")
dataframe.head()

Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Age_Mons,Qchat-10-Score,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who completed the test,Class/ASD Traits
0,1,0,0,0,0,0,0,1,1,0,1,28,3,f,middle eastern,yes,no,family member,No
1,2,1,1,0,0,0,1,1,0,0,0,36,4,m,White European,yes,no,family member,Yes
2,3,1,0,0,0,0,0,1,1,0,1,36,4,m,middle eastern,yes,no,family member,Yes
3,4,1,1,1,1,1,1,1,1,1,1,24,10,m,Hispanic,no,no,family member,Yes
4,5,1,1,0,1,1,1,1,1,1,1,20,9,f,White European,no,yes,family member,Yes
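
The relationship between the questionnaire items, the aggregate score, and the class can be checked directly. The sketch below copies the five rows shown above by hand and verifies that `Qchat-10-Score` equals the sum of `A1` to `A10`. The rule "class is Yes when the score is greater than 3" is an assumption that happens to fit these rows; it should be confirmed on the full dataset.

```python
import pandas as pd

item_cols = [f"A{i}" for i in range(1, 11)]

# The five rows displayed by dataframe.head(), copied by hand
sample = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 3, "No"],
        [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 4, "Yes"],
        [1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 4, "Yes"],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10, "Yes"],
        [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 9, "Yes"],
    ],
    columns=item_cols + ["Qchat-10-Score", "Class/ASD Traits "],
)

# The aggregate score should be the row-wise sum of the ten binary items
score_matches = bool(
    (sample[item_cols].sum(axis=1) == sample["Qchat-10-Score"]).all()
)

# Assumed rule: the class is "Yes" when the score exceeds 3
derived_class = (sample["Qchat-10-Score"] > 3).map({True: "Yes", False: "No"})
class_matches = bool((derived_class == sample["Class/ASD Traits "]).all())

print(score_matches, class_matches)
```

If the rule holds on all rows, keeping `Qchat-10-Score` as a feature would leak the target into the model.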


# Data pre-processing



In [3]:
dataframe.describe()


Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Age_Mons,Qchat-10-Score
count,1054.0,1054.0,1054.0,1054.0,1054.0,1054.0,1054.0,1054.0,1054.0,1054.0,1054.0,1054.0,1054.0
mean,527.5,0.563567,0.448767,0.401328,0.512334,0.524668,0.57685,0.649905,0.459203,0.489564,0.586338,27.867173,5.212524
std,304.407895,0.496178,0.497604,0.4904,0.500085,0.499628,0.494293,0.477226,0.498569,0.500128,0.492723,7.980354,2.907304
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0
25%,264.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.0,3.0
50%,527.5,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,30.0,5.0
75%,790.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,36.0,8.0
max,1054.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,36.0,10.0


In [4]:
dataframe.isna().any()

Case_No                   False
A1                        False
A2                        False
A3                        False
A4                        False
A5                        False
A6                        False
A7                        False
A8                        False
A9                        False
A10                       False
Age_Mons                  False
Qchat-10-Score            False
Sex                       False
Ethnicity                 False
Jaundice                  False
Family_mem_with_ASD       False
Who completed the test    False
Class/ASD Traits          False
dtype: bool
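
`isna().any()` only reports whether a column contains at least one missing value. When a column does have gaps, `isna().sum()` gives the count per column. The frame below is a made-up toy example, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
toy = pd.DataFrame({"Age_Mons": [28, np.nan, 36], "Sex": ["f", "m", None]})

missing_per_column = toy.isna().sum()        # missing values per column
total_missing = int(missing_per_column.sum())  # missing values overall

print(missing_per_column)
print(total_missing)
```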

In [5]:
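# Preview without the identifier and the aggregate score; drop() returns a copy, so `dataframe` is unchanged unless reassigned
dataframe.drop(columns=['Case_No','Qchat-10-Score'])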

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Age_Mons,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who completed the test,Class/ASD Traits
0,0,0,0,0,0,0,1,1,0,1,28,f,middle eastern,yes,no,family member,No
1,1,1,0,0,0,1,1,0,0,0,36,m,White European,yes,no,family member,Yes
2,1,0,0,0,0,0,1,1,0,1,36,m,middle eastern,yes,no,family member,Yes
3,1,1,1,1,1,1,1,1,1,1,24,m,Hispanic,no,no,family member,Yes
4,1,1,0,1,1,1,1,1,1,1,20,f,White European,no,yes,family member,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1049,0,0,0,0,0,0,0,0,0,1,24,f,White European,no,yes,family member,No
1050,0,0,1,1,1,0,1,0,1,0,12,m,black,yes,no,family member,Yes
1051,1,0,1,1,1,1,1,1,1,1,18,m,middle eastern,yes,no,family member,Yes
1052,1,0,0,0,0,0,0,1,0,1,19,m,White European,no,yes,family member,No
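
By default `DataFrame.drop` returns a new frame and leaves the original untouched, so the result has to be assigned if the columns should really be removed. A minimal sketch on a made-up three-column frame (`toy` and `features` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({"Case_No": [1, 2], "A1": [0, 1], "Qchat-10-Score": [3, 4]})

# Calling drop() without assignment discards the result
toy.drop(columns=["Case_No", "Qchat-10-Score"])
unchanged_cols = list(toy.columns)

# Assigning the result keeps the reduced frame
features = toy.drop(columns=["Case_No", "Qchat-10-Score"])
remaining_cols = list(features.columns)

print(unchanged_cols)
print(remaining_cols)
```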


In [6]:
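# Encode each categorical feature as integers, keeping one encoder per column
# so the original labels can be recovered later with inverse_transform
label_encoders = {}
categorical_cols = ["Sex", "Ethnicity", "Jaundice", "Family_mem_with_ASD", "Who completed the test"]
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    dataframe[col] = label_encoders[col].fit_transform(dataframe[col])

# Encode the target variable (note the trailing space in the column name)
encoder = LabelEncoder()
dataframe['Class/ASD Traits '] = encoder.fit_transform(dataframe['Class/ASD Traits '])
dataframe.head()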


Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Age_Mons,Qchat-10-Score,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who completed the test,Class/ASD Traits
0,1,0,0,0,0,0,0,1,1,0,1,28,3,0,8,1,0,4,0
1,2,1,1,0,0,0,1,1,0,0,0,36,4,1,5,1,0,4,1
2,3,1,0,0,0,0,0,1,1,0,1,36,4,1,8,1,0,4,1
3,4,1,1,1,1,1,1,1,1,1,1,24,10,1,0,0,0,4,1
4,5,1,1,0,1,1,1,1,1,1,1,20,9,0,5,0,1,4,1
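
`LabelEncoder` assigns integer codes in the sorted order of the labels it sees, which is why the codes in the table above look arbitrary. The sketch below uses only the three ethnicity values visible in the first rows, so its codes differ from the ones in the notebook, where the encoder was fitted on every category in the column.

```python
from sklearn.preprocessing import LabelEncoder

# Only the three ethnicity labels that appear in the first five rows
values = ["middle eastern", "White European", "middle eastern", "Hispanic", "White European"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the labels in code order, so enumerate() gives the mapping
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original labels from the codes
decoded = list(encoder.inverse_transform(codes))

print(mapping)
print(list(codes))
```

Integer codes impose an order on nominal features such as `Ethnicity`. Tree-based models are usually tolerant of this, while for linear or distance-based models one-hot encoding may be a better fit.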


# Dataset Analysis

In [6]:
correlation_matrix = dataframe.corr()
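
One common way to read a correlation matrix is to rank the features by the absolute value of their correlation with the target. The sketch below does this on a made-up toy frame; the column `target` stands in for the encoded `"Class/ASD Traits "` column, and the numbers are not taken from the dataset.

```python
import pandas as pd

# Hypothetical encoded data; "target" plays the role of the class column
toy = pd.DataFrame(
    {
        "A1": [0, 1, 1, 1, 0, 0],
        "A2": [0, 1, 0, 1, 1, 0],
        "Age_Mons": [28, 36, 36, 24, 20, 12],
        "target": [0, 1, 1, 1, 0, 0],
    }
)

toy_corr = toy.corr()  # pairwise Pearson correlations

# Rank features by the strength of their linear relationship with the target
target_corr = (
    toy_corr["target"].drop("target").abs().sort_values(ascending=False)
)

print(target_corr)
```

The same ranking applied to `correlation_matrix` would highlight which questionnaire items move most closely with the class label.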