# **DATA CLEANING**

## Objectives

* Load, inspect, clean, review and export the Mpox dataset

## Inputs

* Mpox dataset sourced from Kaggle and already saved as CSV in the project folder 

## Outputs

* Cleaned, analysis-ready dataset


---

# Change working directory

Changing the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\zzama\\OneDrive\\Documents\\Data Analytics with AI Course\\Capstone Project\\Risk-Factors-for-MonkeyPox-Infection\\jupyter_notebooks'

Making the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [6]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\zzama\\OneDrive\\Documents\\Data Analytics with AI Course\\Capstone Project\\Risk-Factors-for-MonkeyPox-Infection'
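Re-running the `os.chdir` cell above moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown above); `project_root_from` is a hypothetical helper, not part of the original notebook:

```python
import os

def project_root_from(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent directory only when current_dir is the notebooks folder."""
    normalized = os.path.normpath(current_dir)
    # basename gives the last path component, e.g. 'jupyter_notebooks'
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    # Already outside the notebooks folder: leave the path unchanged
    return current_dir

# Usage (commented out so the sketch has no side effects):
# os.chdir(project_root_from(os.getcwd()))
```

With this guard, running the cell a second time leaves the working directory where it is.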

# Section 1: Load dataset and Inspect the data

Import core libraries and load data

In [7]:
# Import core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

---

In [8]:
# Load dataset
df = pd.read_csv("Dataset/MonkeyPoxData.csv")

Dataset Inspection <br>
I conducted several checks (listed below) to understand the data and identify issues that need to be addressed during data cleaning.
* Checked the first and last five rows. This is a quick way to see if all columns look aligned and if the data visually makes sense.
* Checked the shape of the data to confirm number of rows and columns
* Checked columns names, data types and non-null count for variables
* Checked unique values for all columns
* Quantified the missing values in each column
* Checked for duplicates.

Key observations
* Only one variable (Systemic Illness) has missing values: 6,216 of 25,000 rows (about 25%), which is substantial
* 3 of the 11 columns are of type object (Patient_ID, Systemic Illness and MonkeyPox); the other 8 are boolean
* The dataset includes Patient_ID, which could potentially be used to identify patients
* No duplicates were identified
* Apart from Patient_ID, the data contains no personal information or individual attributes
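The checks listed above can also be collected into a single table. The following is a minimal sketch on toy data (not the real dataset); `inspection_summary` is a hypothetical helper introduced here for illustration:

```python
import pandas as pd

def inspection_summary(df):
    """One row per column: dtype, non-null count, missing count and percent, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(dropna=True),
    })

# Toy data for illustration only
toy = pd.DataFrame({
    "Patient_ID": ["P0", "P1", "P2", "P3"],
    "Systemic Illness": [None, "Fever", "Fever", None],
    "Rectal Pain": [False, True, False, True],
})
summary = inspection_summary(toy)
print(summary)
```

Calling `inspection_summary(df)` on the loaded dataset should give the same figures as the individual cells below, in one view.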

In [9]:
# Check the first few rows of the dataset
df.head()

Unnamed: 0,Patient_ID,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
0,P0,,False,True,True,True,False,True,False,False,Negative
1,P1,Fever,True,False,True,True,False,False,True,False,Positive
2,P2,Fever,False,True,True,False,False,False,True,False,Positive
3,P3,,True,False,False,False,True,True,True,False,Positive
4,P4,Swollen Lymph Nodes,True,True,True,False,False,True,True,False,Positive


In [10]:
# Check the last few rows of the dataset
df.tail()

Unnamed: 0,Patient_ID,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
24995,P24995,,True,True,False,True,True,False,False,True,Positive
24996,P24996,Fever,False,True,True,False,True,True,True,True,Positive
24997,P24997,,True,True,False,False,True,True,False,False,Positive
24998,P24998,Swollen Lymph Nodes,False,True,False,True,True,True,False,False,Negative
24999,P24999,Swollen Lymph Nodes,False,False,True,False,False,True,True,False,Positive


In [12]:
# Check the shape of the dataset
df.shape

(25000, 11)

In [13]:
# Check column data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Patient_ID                      25000 non-null  object
 1   Systemic Illness                18784 non-null  object
 2   Rectal Pain                     25000 non-null  bool  
 3   Sore Throat                     25000 non-null  bool  
 4   Penile Oedema                   25000 non-null  bool  
 5   Oral Lesions                    25000 non-null  bool  
 6   Solitary Lesion                 25000 non-null  bool  
 7   Swollen Tonsils                 25000 non-null  bool  
 8   HIV Infection                   25000 non-null  bool  
 9   Sexually Transmitted Infection  25000 non-null  bool  
 10  MonkeyPox                       25000 non-null  object
dtypes: bool(8), object(3)
memory usage: 781.4+ KB


In [14]:
# Check column names
df.columns

Index(['Patient_ID', 'Systemic Illness', 'Rectal Pain', 'Sore Throat',
       'Penile Oedema', 'Oral Lesions', 'Solitary Lesion', 'Swollen Tonsils',
       'HIV Infection', 'Sexually Transmitted Infection', 'MonkeyPox'],
      dtype='object')

In [15]:
# Describe the dataset to get summary statistics

df.describe().T # Transpose for better readability

Unnamed: 0,count,unique,top,freq
Patient_ID,25000,25000,P0,1
Systemic Illness,18784,3,Fever,6382
Rectal Pain,25000,2,False,12655
Sore Throat,25000,2,True,12554
Penile Oedema,25000,2,True,12612
Oral Lesions,25000,2,False,12514
Solitary Lesion,25000,2,True,12527
Swollen Tonsils,25000,2,True,12533
HIV Infection,25000,2,True,12584
Sexually Transmitted Infection,25000,2,False,12554


In [16]:
# Print unique values for all columns
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values in column '{column}': {unique_values}\n")

Unique values in column 'Patient_ID': ['P0' 'P1' 'P2' ... 'P24997' 'P24998' 'P24999']

Unique values in column 'Systemic Illness': [nan 'Fever' 'Swollen Lymph Nodes' 'Muscle Aches and Pain']

Unique values in column 'Rectal Pain': [False  True]

Unique values in column 'Sore Throat': [ True False]

Unique values in column 'Penile Oedema': [ True False]

Unique values in column 'Oral Lesions': [ True False]

Unique values in column 'Solitary Lesion': [False  True]

Unique values in column 'Swollen Tonsils': [ True False]

Unique values in column 'HIV Infection': [False  True]

Unique values in column 'Sexually Transmitted Infection': [False  True]

Unique values in column 'MonkeyPox': ['Negative' 'Positive']



In [17]:
# Check for and quantify missing values in each column 
df.isnull().sum()

Patient_ID                           0
Systemic Illness                  6216
Rectal Pain                          0
Sore Throat                          0
Penile Oedema                        0
Oral Lesions                         0
Solitary Lesion                      0
Swollen Tonsils                      0
HIV Infection                        0
Sexually Transmitted Infection       0
MonkeyPox                            0
dtype: int64

In [18]:
# Value count for systemic illness
df['Systemic Illness'].value_counts()

Systemic Illness
Fever                    6382
Swollen Lymph Nodes      6252
Muscle Aches and Pain    6150
Name: count, dtype: int64

In [20]:
# Check for duplicates
df.duplicated().sum()

0
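Because Patient_ID is unique for every row, a full-row duplicate check cannot flag anything on its own. A small sketch on toy data (not the real dataset) of checking the ID column and the remaining columns separately; repeated symptom profiles are expected with mostly boolean columns and are not necessarily errors:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Patient_ID": ["P0", "P1", "P2"],
    "Rectal Pain": [True, True, False],
    "MonkeyPox": ["Positive", "Positive", "Negative"],
})

# Full-row duplicates: the unique ID makes every row distinct
full_row_dupes = int(toy.duplicated().sum())

# Duplicated IDs would indicate a patient recorded twice
id_dupes = int(toy["Patient_ID"].duplicated().sum())

# Rows sharing the same values once the ID is set aside
profile_dupes = int(toy.drop(columns=["Patient_ID"]).duplicated().sum())

print(full_row_dupes, id_dupes, profile_dupes)
```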

# Section 2: Data Cleaning

Data manipulation

* Handle missing data
* Encode and standardize variables
* Save the cleaned and encoded dataset separately

Load data, create new index and drop patient ID

In [19]:
# Load data again to start fresh
df = pd.read_csv("Dataset/MonkeyPoxData.csv")

# Reset the index to the default 0..n-1 range (no new column is created)
df.reset_index(drop=True, inplace=True)
df.head()

# Drop Patient_ID column in case it is linked to personal information
df.drop(columns=['Patient_ID'], inplace=True)
df.head()

Unnamed: 0,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
0,,False,True,True,True,False,True,False,False,Negative
1,Fever,True,False,True,True,False,False,True,False,Positive
2,Fever,False,True,True,False,False,False,True,False,Positive
3,,True,False,False,False,True,True,True,False,Positive
4,Swollen Lymph Nodes,True,True,True,False,False,True,True,False,Positive


Handling missing data

*  Systemic Illness is an important feature for predictive modelling, but about 25% of its values (6,216 of 25,000) are missing.
*  Given the nature of the variable, the missing values most likely represent patients who had none of the listed systemic symptoms.
*  I therefore labelled all missing values as 'No'.
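The missing share follows from the counts above: 6,216 / 25,000 = 24.9%. The sketch below, on toy data rather than the real dataset, shows the same calculation and checks that the 'No' label survives a CSV round trip. Depending on the pandas version, a literal string such as `None` may be parsed as missing by `read_csv`, which is one reason a label like 'No' is the safer choice; treat that as an assumption to verify against your installed version.

```python
import io
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Systemic Illness": [None, "Fever", "Swollen Lymph Nodes", None]})

# Share of missing values, as a percentage
missing_pct = toy["Systemic Illness"].isna().mean() * 100

# Assign the result back instead of using inplace=True on a single column
toy["Systemic Illness"] = toy["Systemic Illness"].fillna("No")

# Round trip through CSV text to confirm the label is not read back as missing
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
remaining_missing = int(reloaded["Systemic Illness"].isna().sum())

print(missing_pct, remaining_missing)
```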

In [20]:
# Label missing values as 'No' for Systemic Illness (assign back rather than using inplace on a column)
df['Systemic Illness'] = df['Systemic Illness'].fillna('No')

# Value count for systemic illness after imputation
print(df['Systemic Illness'].value_counts())
df.head()

Systemic Illness
Fever                    6382
Swollen Lymph Nodes      6252
No                       6216
Muscle Aches and Pain    6150
Name: count, dtype: int64


Unnamed: 0,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
0,No,False,True,True,True,False,True,False,False,Negative
1,Fever,True,False,True,True,False,False,True,False,Positive
2,Fever,False,True,True,False,False,False,True,False,Positive
3,No,True,False,False,False,True,True,True,False,Positive
4,Swollen Lymph Nodes,True,True,True,False,False,True,True,False,Positive


Save cleaned dataset

* Saved the cleaned dataset before encoding, as EDA does not require the encoded dataset
* This also preserves the data for use in the Power BI dashboard

In [21]:
# Save cleaned data to a new CSV file
df.to_csv("Dataset/Mpox_Cleaned.csv", index=False)

Encoding the variables <br>

To streamline further analysis, including later modelling, the following encoding was done:

* Boolean values were replaced with 1 for True and 0 for False
* MonkeyPox test results were encoded as 1 for Positive and 0 for Negative
* Systemic Illness was split into three 0/1 indicator columns (Fever, Swollen Nodes, Muscle Aches); rows labelled 'No' are 0 in all three
* The data type of all columns was then changed to object to avoid statistical computations on these variables
* The dataset was then saved under a different name.
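As a compact illustration of the encoding scheme, the sketch below applies it to toy data. It uses `pd.get_dummies` for the indicator columns, which is a swapped-in alternative to the `np.where` approach used in the notebook cells, so the indicator column names here are the full category labels rather than the shortened names.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Systemic Illness": ["No", "Fever", "Swollen Lymph Nodes", "Muscle Aches and Pain"],
    "Rectal Pain": [False, True, True, False],
    "MonkeyPox": ["Negative", "Positive", "Positive", "Negative"],
})

encoded = toy.copy()

# Booleans to 1/0
bool_cols = encoded.select_dtypes(include="bool").columns
encoded = encoded.astype({col: int for col in bool_cols})

# Target to 1/0, scoped to the MonkeyPox column only
encoded["MonkeyPox"] = encoded["MonkeyPox"].map({"Positive": 1, "Negative": 0})

# One indicator column per symptom; 'No' is dropped so it becomes the all-zero baseline
dummies = pd.get_dummies(encoded["Systemic Illness"], dtype=int).drop(columns=["No"])
encoded = pd.concat([encoded.drop(columns=["Systemic Illness"]), dummies], axis=1)

print(encoded)
```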

In [32]:
# Load cleaned data
df = pd.read_csv("Dataset/Mpox_Cleaned.csv")
df.head()

Unnamed: 0,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
0,No,False,True,True,True,False,True,False,False,Negative
1,Fever,True,False,True,True,False,False,True,False,Positive
2,Fever,False,True,True,False,False,False,True,False,Positive
3,No,True,False,False,False,True,True,True,False,Positive
4,Swollen Lymph Nodes,True,True,True,False,False,True,True,False,Positive


In [33]:
# Encode booleans as 1 (True) and 0 (False) for easier analysis
bool_cols = df.select_dtypes(include=['bool']).columns

for col in bool_cols:
    df[col] = df[col].astype(int)

# Encode the MonkeyPox test result: Positive = 1, Negative = 0
# (mapping the single column avoids touching any other column by accident)
df['MonkeyPox'] = df['MonkeyPox'].map({'Positive': 1, 'Negative': 0})

df.head()

Unnamed: 0,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
0,No,0,1,1,1,0,1,0,0,0
1,Fever,1,0,1,1,0,0,1,0,1
2,Fever,0,1,1,0,0,0,1,0,1
3,No,1,0,0,0,1,1,1,0,1
4,Swollen Lymph Nodes,1,1,1,0,0,1,1,0,1


In [34]:
# Split systemic illness into 0 and 1 columns for fever, swollen lymph nodes, muscle aches
df['Fever'] = np.where(df['Systemic Illness'] == 'Fever', 1, 0)
df['Swollen Nodes'] = np.where(df['Systemic Illness'] == 'Swollen Lymph Nodes', 1, 0)
df['Muscle Aches'] = np.where(df['Systemic Illness'] == 'Muscle Aches and Pain', 1, 0)

# Drop original Systemic Illness column
df.drop(columns=['Systemic Illness'], inplace=True)

df.head()

Unnamed: 0,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox,Fever,Swollen Nodes,Muscle Aches
0,0,1,1,1,0,1,0,0,0,0,0,0
1,1,0,1,1,0,0,1,0,1,1,0,0
2,0,1,1,0,0,0,1,0,1,1,0,0
3,1,0,0,0,1,1,1,0,1,0,0,0
4,1,1,1,0,0,1,1,0,1,0,1,0


In [35]:
# Convert all columns to object type so they are treated as categorical and excluded from numeric summaries
df = df.astype('object')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Rectal Pain                     25000 non-null  object
 1   Sore Throat                     25000 non-null  object
 2   Penile Oedema                   25000 non-null  object
 3   Oral Lesions                    25000 non-null  object
 4   Solitary Lesion                 25000 non-null  object
 5   Swollen Tonsils                 25000 non-null  object
 6   HIV Infection                   25000 non-null  object
 7   Sexually Transmitted Infection  25000 non-null  object
 8   MonkeyPox                       25000 non-null  object
 9   Fever                           25000 non-null  object
 10  Swollen Nodes                   25000 non-null  object
 11  Muscle Aches                    25000 non-null  object
dtypes: object(12)
memory usage: 2.3+ MB


---

In [38]:
# Save encoded cleaned data to a new CSV file
df.to_csv("Dataset/Mpox_Encoded2.csv", index=False)
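One caveat with the object conversion: CSV files do not store data types, so the 0/1 columns are inferred as integers when the encoded file is read again. The sketch below shows this on toy data and one possible workaround, passing `dtype="category"` to `pd.read_csv`; whether that suits the later analysis is an assumption.

```python
import io
import pandas as pd

# Toy data for illustration only, converted to object as in the notebook
toy = pd.DataFrame({"Fever": [0, 1, 0], "MonkeyPox": [1, 1, 0]}).astype("object")

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: dtypes are inferred again, so the columns come back as integers
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Reading with an explicit dtype keeps the columns categorical
buffer.seek(0)
categorical_read = pd.read_csv(buffer, dtype="category")

print(default_read.dtypes)
print(categorical_read.dtypes)
```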


# Section 3: Final Review and Data Exporting

I conducted final checks to ensure that all the issues I noted have been addressed and that the data is ready for analysis. These included:

* Checking dataset shape
* Checking missing values
* Checking column names
* Checking unique values
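These final checks can also be written as a function that reports problems instead of relying on visual inspection. This is a minimal sketch; `final_checks` and its parameters are hypothetical names introduced for illustration:

```python
import pandas as pd

def final_checks(df, expected_columns, allowed_values):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Column names and order
    if list(df.columns) != list(expected_columns):
        problems.append("unexpected columns")
    # Missing values per column
    missing = df.isna().sum()
    for col in missing[missing > 0].index:
        problems.append(f"missing values in {col}")
    # Unique values restricted to an allowed set
    for col, allowed in allowed_values.items():
        if col not in df.columns:
            continue
        unexpected = set(df[col].dropna().unique()) - set(allowed)
        if unexpected:
            problems.append(f"unexpected values in {col}: {sorted(unexpected)}")
    return problems

# Toy data for illustration only
clean = pd.DataFrame({"Systemic Illness": ["No", "Fever"],
                      "MonkeyPox": ["Negative", "Positive"]})
dirty = pd.DataFrame({"Systemic Illness": [None, "Fever"],
                      "MonkeyPox": ["Negative", "Pos"]})

columns = ["Systemic Illness", "MonkeyPox"]
allowed = {"MonkeyPox": {"Positive", "Negative"}}
clean_problems = final_checks(clean, columns, allowed)
dirty_problems = final_checks(dirty, columns, allowed)
print(clean_problems)
print(dirty_problems)
```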

In [39]:
# Load cleaned data
df = pd.read_csv("Dataset/Mpox_Cleaned.csv")
print(df.shape)
df.head()

(25000, 10)


Unnamed: 0,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
0,No,False,True,True,True,False,True,False,False,Negative
1,Fever,True,False,True,True,False,False,True,False,Positive
2,Fever,False,True,True,False,False,False,True,False,Positive
3,No,True,False,False,False,True,True,True,False,Positive
4,Swollen Lymph Nodes,True,True,True,False,False,True,True,False,Positive


In [40]:
# Checking data types, non-null counts, and column names after cleaning

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Systemic Illness                25000 non-null  object
 1   Rectal Pain                     25000 non-null  bool  
 2   Sore Throat                     25000 non-null  bool  
 3   Penile Oedema                   25000 non-null  bool  
 4   Oral Lesions                    25000 non-null  bool  
 5   Solitary Lesion                 25000 non-null  bool  
 6   Swollen Tonsils                 25000 non-null  bool  
 7   HIV Infection                   25000 non-null  bool  
 8   Sexually Transmitted Infection  25000 non-null  bool  
 9   MonkeyPox                       25000 non-null  object
dtypes: bool(8), object(2)
memory usage: 586.1+ KB


In [41]:
# Checking unique values after cleaning
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values in column '{column}': {unique_values}\n")

Unique values in column 'Systemic Illness': ['No' 'Fever' 'Swollen Lymph Nodes' 'Muscle Aches and Pain']

Unique values in column 'Rectal Pain': [False  True]

Unique values in column 'Sore Throat': [ True False]

Unique values in column 'Penile Oedema': [ True False]

Unique values in column 'Oral Lesions': [ True False]

Unique values in column 'Solitary Lesion': [False  True]

Unique values in column 'Swollen Tonsils': [ True False]

Unique values in column 'HIV Infection': [False  True]

Unique values in column 'Sexually Transmitted Infection': [False  True]

Unique values in column 'MonkeyPox': ['Negative' 'Positive']



In [42]:
# Describe the dataset to get summary statistics

df.describe().T # Transpose for better readability

Unnamed: 0,count,unique,top,freq
Systemic Illness,25000,4,Fever,6382
Rectal Pain,25000,2,False,12655
Sore Throat,25000,2,True,12554
Penile Oedema,25000,2,True,12612
Oral Lesions,25000,2,False,12514
Solitary Lesion,25000,2,True,12527
Swollen Tonsils,25000,2,True,12533
HIV Infection,25000,2,True,12584
Sexually Transmitted Infection,25000,2,False,12554
MonkeyPox,25000,2,Positive,15909


---

# Conclusion and next steps

* I have performed data inspection and data cleaning
* The issues identified during inspection (missing values in Systemic Illness and the potentially identifying Patient_ID column) were addressed during data cleaning
* The cleaned data has been saved and pushed to GitHub
* The next step is data exploration