# Asthma Disease Prediction - Exploratory Data Analysis

In this notebook, we perform an exploratory data analysis of the Kaggle dataset [Asthma Disease Prediction](https://www.kaggle.com/datasets/deepayanthakur/asthma-disease-prediction) by Deepayan Thakur. The dataset is a collection of symptoms and other factors recorded for patients with and without asthma. The goal of the project is to build a machine learning model that predicts whether a patient has asthma based on these symptoms and factors and, if so, how severe it is.

- **Author:** [Sergio Cuéllar](https://www.linkedin.com/in/sergiocuellaralmagro/)
- **Date:** September 2023
- **Dataset:** [Kaggle](https://www.kaggle.com/datasets/deepayanthakur/asthma-disease-prediction)
- **Python Version:** 3.10.1

## Objectives

- **TODO:** Define the objectives of this analysis.

## Preliminary Setup

We will be using the following libraries for this EDA:

- **Pandas:** Data manipulation and analysis.
- **Matplotlib:** Data visualization.
- **Seaborn:** Data visualization.

In [127]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We will also be using the following settings to display the plots:

In [128]:
sns.set_style('darkgrid',{
    'axes.facecolor': '0.9',
    'grid.color': '0.8',
    'grid.linestyle': '--',
    'grid.linewidth': 0.5,
    })

## Data Loading and Preliminary Inspection

In [129]:
data = pd.read_csv('dataset.csv')
data.head(10)

Unnamed: 0,Tiredness,Dry-Cough,Difficulty-in-Breathing,Sore-Throat,None_Sympton,Pains,Nasal-Congestion,Runny-Nose,None_Experiencing,Age_0-9,Age_10-19,Age_20-24,Age_25-59,Age_60+,Gender_Female,Gender_Male,Severity_Mild,Severity_Moderate,Severity_None
0,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,1,0,0
1,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,1,0,0
2,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,1,0,0
3,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,0,1,0
4,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,0,1,0
5,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,0,1,0
6,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,0,0,0
7,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,0,0,0
8,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,0,0,0
9,1,1,1,1,0,1,1,1,0,1,0,0,0,0,0,1,0,0,1


In [130]:
shape = data.shape
print(f'The dataset has {shape[0]} rows and {shape[1]} columns.')

The dataset has 316800 rows and 19 columns.


In [131]:
columns = data.columns
print('The columns are:')
for column in columns:
    print(f'  - {column}')

The columns are:
  - Tiredness
  - Dry-Cough
  - Difficulty-in-Breathing
  - Sore-Throat
  - None_Sympton
  - Pains
  - Nasal-Congestion
  - Runny-Nose
  - None_Experiencing
  - Age_0-9
  - Age_10-19
  - Age_20-24
  - Age_25-59
  - Age_60+
  - Gender_Female
  - Gender_Male
  - Severity_Mild
  - Severity_Moderate
  - Severity_None


In [132]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316800 entries, 0 to 316799
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   Tiredness                316800 non-null  int64
 1   Dry-Cough                316800 non-null  int64
 2   Difficulty-in-Breathing  316800 non-null  int64
 3   Sore-Throat              316800 non-null  int64
 4   None_Sympton             316800 non-null  int64
 5   Pains                    316800 non-null  int64
 6   Nasal-Congestion         316800 non-null  int64
 7   Runny-Nose               316800 non-null  int64
 8   None_Experiencing        316800 non-null  int64
 9   Age_0-9                  316800 non-null  int64
 10  Age_10-19                316800 non-null  int64
 11  Age_20-24                316800 non-null  int64
 12  Age_25-59                316800 non-null  int64
 13  Age_60+                  316800 non-null  int64
 14  Gender_Female            316800 non-

In [133]:
print('The possible values for each column are:')
for column in columns:
    print(f'{column}: {data[column].unique()}')

The possible values for each column are:
Tiredness: [1 0]
Dry-Cough: [1 0]
Difficulty-in-Breathing: [1 0]
Sore-Throat: [1 0]
None_Sympton: [0 1]
Pains: [1 0]
Nasal-Congestion: [1 0]
Runny-Nose: [1 0]
None_Experiencing: [0 1]
Age_0-9: [1 0]
Age_10-19: [0 1]
Age_20-24: [0 1]
Age_25-59: [0 1]
Age_60+: [0 1]
Gender_Female: [0 1]
Gender_Male: [1 0]
Severity_Mild: [1 0]
Severity_Moderate: [0 1]
Severity_None: [0 1]


The preliminary inspection shows that the dataset has 316800 rows (data points) and 19 columns (features). It is fairly clean: there are no missing values, and every feature is a binary indicator stored as an integer (0 or 1). Several features, namely age, gender, and asthma severity, are one-hot encoded.

This means that the data won't require much (if any) cleaning and pre-processing for the machine learning model. For the purposes of this exploratory data analysis, however, we will be performing some data manipulation (reversing the one-hot encoding and turning those features into categorical ones) in order to aid with the visualization of the data.
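
The observations above (no missing values, strictly binary columns) can be checked programmatically, together with a property the one-hot groups should satisfy before they are reversed: exactly one indicator set per row. The following is a minimal sketch on a small toy frame; the helper name `one_hot_group_report` and the toy values are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd


def one_hot_group_report(df, prefix):
    """Count rows by the number of active indicators in a one-hot group.

    Hypothetical helper: `prefix` selects the columns of one group,
    for example 'Severity_'.
    """
    group = df.filter(regex=f'^{prefix}')
    # A well-formed one-hot group has exactly one 1 per row.
    return group.sum(axis=1).value_counts().sort_index()


# Toy frame standing in for the real dataset (assumed values).
toy = pd.DataFrame({
    'Severity_Mild':     [1, 0, 0, 0],
    'Severity_Moderate': [0, 1, 0, 0],
    'Severity_None':     [0, 0, 1, 0],
})

# True if any cell in the frame is missing.
has_missing = bool(toy.isna().any().any())

# True if every cell is either 0 or 1.
is_binary = bool(toy.isin([0, 1]).all().all())

# Index: number of active indicators per row; value: number of such rows.
report = one_hot_group_report(toy, 'Severity_')
print(has_missing, is_binary)
print(report)
```

On the real dataframe, any count reported under an index other than 1 would mark rows for which a plain `idxmax` decoding has to pick a label arbitrarily.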

## Data Cleaning and Pre-processing

In this section, we prepare the dataframe for the exploratory data analysis. As mentioned in the previous section, we reverse the one-hot encoding of the age, severity, and gender features and turn each group into a single categorical feature. We will also rename some of the columns and their values to make them more readable.

In [134]:
## Reverse one-hot encoding

# Age column
age_columns = ['Age_0-9', 'Age_10-19', 'Age_20-24', 'Age_25-59', 'Age_60+']

# Take an explicit copy of the age columns so that adding a column
# below does not raise a SettingWithCopyWarning
temp_age = data[age_columns].copy()

# Create a new column with the label of the active age column
temp_age['Age'] = temp_age[age_columns].idxmax(axis=1)

# Drop the prefix from the label
temp_age['Age'] = temp_age['Age'].str.replace('Age_', '', regex=False)

# Drop the one-hot encoded columns from the original dataframe
data = data.drop(age_columns, axis=1)

# Add the new column to the original dataframe
data['Age_range'] = temp_age['Age']

data.head(10)

Unnamed: 0,Tiredness,Dry-Cough,Difficulty-in-Breathing,Sore-Throat,None_Sympton,Pains,Nasal-Congestion,Runny-Nose,None_Experiencing,Gender_Female,Gender_Male,Severity_Mild,Severity_Moderate,Severity_None,Age_range
0,1,1,1,1,0,1,1,1,0,0,1,1,0,0,0-9
1,1,1,1,1,0,1,1,1,0,0,1,1,0,0,0-9
2,1,1,1,1,0,1,1,1,0,0,1,1,0,0,0-9
3,1,1,1,1,0,1,1,1,0,0,1,0,1,0,0-9
4,1,1,1,1,0,1,1,1,0,0,1,0,1,0,0-9
5,1,1,1,1,0,1,1,1,0,0,1,0,1,0,0-9
6,1,1,1,1,0,1,1,1,0,0,1,0,0,0,0-9
7,1,1,1,1,0,1,1,1,0,0,1,0,0,0,0-9
8,1,1,1,1,0,1,1,1,0,0,1,0,0,0,0-9
9,1,1,1,1,0,1,1,1,0,0,1,0,0,1,0-9


In [135]:
# Severity column
severity_columns = ['Severity_Mild', 'Severity_Moderate', 'Severity_None']

# Take an explicit copy of the severity columns so that adding a column
# below does not raise a SettingWithCopyWarning
temp_severity = data[severity_columns].copy()

# Create a new column with the label of the active severity column.
# Note: idxmax returns the first column when a row has no indicator set,
# so such rows are labelled 'Mild'
temp_severity['Severity'] = temp_severity[severity_columns].idxmax(axis=1)

# Drop the prefix from the label
temp_severity['Severity'] = temp_severity['Severity'].str.replace('Severity_', '', regex=False)

# Drop the one-hot encoded columns from the original dataframe
data = data.drop(severity_columns, axis=1)

# Add the new column to the original dataframe
data['Severity'] = temp_severity['Severity']

data.head(10)

Unnamed: 0,Tiredness,Dry-Cough,Difficulty-in-Breathing,Sore-Throat,None_Sympton,Pains,Nasal-Congestion,Runny-Nose,None_Experiencing,Gender_Female,Gender_Male,Age_range,Severity
0,1,1,1,1,0,1,1,1,0,0,1,0-9,Mild
1,1,1,1,1,0,1,1,1,0,0,1,0-9,Mild
2,1,1,1,1,0,1,1,1,0,0,1,0-9,Mild
3,1,1,1,1,0,1,1,1,0,0,1,0-9,Moderate
4,1,1,1,1,0,1,1,1,0,0,1,0-9,Moderate
5,1,1,1,1,0,1,1,1,0,0,1,0-9,Moderate
6,1,1,1,1,0,1,1,1,0,0,1,0-9,Mild
7,1,1,1,1,0,1,1,1,0,0,1,0-9,Mild
8,1,1,1,1,0,1,1,1,0,0,1,0-9,Mild
9,1,1,1,1,0,1,1,1,0,0,1,0-9,None
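
Note that rows 6 to 8 in the preview above had all three severity indicators set to 0 before decoding, yet they come out labelled `Mild`. This follows from how `idxmax` resolves ties: it returns the first column holding the maximum, which in an all-zero row is the first column of the group. The sketch below reproduces this behaviour on toy data and shows one possible way to keep such rows distinguishable. The `'Unknown'` label is an assumption made for illustration, since the dataset does not state what an all-zero severity row means.

```python
import pandas as pd

severity_columns = ['Severity_Mild', 'Severity_Moderate', 'Severity_None']

# Toy rows: Mild, Moderate, None, and one row with no indicator set.
toy = pd.DataFrame(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]],
    columns=severity_columns,
)

# Plain idxmax: the all-zero row falls back to the first column label.
naive = toy.idxmax(axis=1).str.replace('Severity_', '', regex=False)

# Guarded version: keep the idxmax label only where exactly one indicator is set.
# 'Unknown' is a placeholder label (assumption), not a value from the dataset.
exactly_one = toy.sum(axis=1).eq(1)
guarded = naive.where(exactly_one, 'Unknown')

print(naive.tolist())
print(guarded.tolist())
```

Whether such rows should be treated as a separate class, dropped, or mapped to an existing level is a modelling decision that depends on how the dataset was constructed.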


In [136]:
# Gender column
gender_columns = ['Gender_Male', 'Gender_Female']

# Take an explicit copy of the gender columns so that adding a column
# below does not raise a SettingWithCopyWarning
temp_gender = data[gender_columns].copy()

# Create a new column with the label of the active gender column
temp_gender['Gender'] = temp_gender[gender_columns].idxmax(axis=1)

# Drop the prefix from the label
temp_gender['Gender'] = temp_gender['Gender'].str.replace('Gender_', '', regex=False)

# Drop the one-hot encoded columns from the original dataframe
data = data.drop(gender_columns, axis=1)

# Add the new column to the original dataframe
data['Gender'] = temp_gender['Gender']

data.head(10)

Unnamed: 0,Tiredness,Dry-Cough,Difficulty-in-Breathing,Sore-Throat,None_Sympton,Pains,Nasal-Congestion,Runny-Nose,None_Experiencing,Age_range,Severity,Gender
0,1,1,1,1,0,1,1,1,0,0-9,Mild,Male
1,1,1,1,1,0,1,1,1,0,0-9,Mild,Male
2,1,1,1,1,0,1,1,1,0,0-9,Mild,Male
3,1,1,1,1,0,1,1,1,0,0-9,Moderate,Male
4,1,1,1,1,0,1,1,1,0,0-9,Moderate,Male
5,1,1,1,1,0,1,1,1,0,0-9,Moderate,Male
6,1,1,1,1,0,1,1,1,0,0-9,Mild,Male
7,1,1,1,1,0,1,1,1,0,0-9,Mild,Male
8,1,1,1,1,0,1,1,1,0,0-9,Mild,Male
9,1,1,1,1,0,1,1,1,0,0-9,None,Male
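
The three cells above repeat the same steps, and the resulting columns hold plain strings rather than pandas categoricals. As a hedged sketch, the decoding could be factored into a single helper that also returns an ordered `Categorical`, which keeps the age ranges in their natural order when sorting or plotting. The function name `reverse_one_hot` and the toy frame are assumptions for illustration; the helper inherits the `idxmax` tie behaviour, so it assumes exactly one indicator is set per row.

```python
import pandas as pd


def reverse_one_hot(df, columns, prefix):
    """Collapse one-hot `columns` into one ordered categorical Series.

    Hypothetical helper; assumes each row has exactly one indicator set.
    """
    categories = [name.replace(prefix, '') for name in columns]
    labels = df[columns].idxmax(axis=1).str.replace(prefix, '', regex=False)
    # The category order follows the order of `columns`.
    return pd.Series(
        pd.Categorical(labels, categories=categories, ordered=True),
        index=df.index,
    )


# Toy frame with a subset of the age columns (assumed values).
toy = pd.DataFrame({
    'Age_0-9':   [1, 0, 0],
    'Age_10-19': [0, 0, 1],
    'Age_60+':   [0, 1, 0],
})

age = reverse_one_hot(toy, list(toy.columns), 'Age_')
print(age.tolist())
print(age.sort_values().tolist())
```

Because the categories are ordered, sorting follows the age ranges rather than alphabetical order, which matters for labels such as `10-19` and `0-9`.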
