# **DATA MODELLING**

## Objectives

* To identify clinical indicators that signify the presence of an Mpox infection, using logistic regression both as an inferential statistical method and as a machine learning algorithm in a pipeline.

## Inputs

* Cleaned Mpox dataset
* Encoded Mpox data
* Libraries such as pandas, NumPy, matplotlib, seaborn, plotly, statsmodels and scikit-learn

## Outputs

* Statistical test results
* Machine learning model


---

# Change working directory

Changing the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\zzama\\OneDrive\\Documents\\Data Analytics with AI Course\\Capstone Project\\Risk-Factors-for-MonkeyPox-Infection\\jupyter_notebooks'

Making the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\zzama\\OneDrive\\Documents\\Data Analytics with AI Course\\Capstone Project\\Risk-Factors-for-MonkeyPox-Infection'
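The same parent-folder lookup can also be sketched with `pathlib`, which avoids string handling of paths. This is only an illustrative alternative, not the notebook's method; the helper name `parent_of` is hypothetical.

```python
from pathlib import Path


def parent_of(directory):
    """Return the parent folder of `directory` as an absolute Path (hypothetical helper)."""
    return Path(directory).resolve().parent


notebook_dir = Path.cwd()                # equivalent of os.getcwd()
project_root = parent_of(notebook_dir)   # equivalent of os.path.dirname(current_dir)
print(project_root)
# os.chdir(project_root) would then make the parent the working directory
```

Calling `os.chdir(project_root)` afterwards has the same effect as the cells above.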

# Section 1: Import libraries and load cleaned dataset

Import libraries

In [6]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set_theme(style="whitegrid")
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans 
import warnings
warnings.filterwarnings("ignore") 

Load cleaned dataset

---

In [7]:
# Load cleaned and encoded datasets for data analysis and modeling
df_Cleaned = pd.read_csv("Dataset/Mpox_Cleaned.csv")
df_Encoded = pd.read_csv("Dataset/Mpox_Encoded.csv")

print(df_Cleaned.head())
print(df_Encoded.head())

      Systemic Illness  Rectal Pain  Sore Throat  Penile Oedema  Oral Lesions  \
0                   No        False         True           True          True   
1                Fever         True        False           True          True   
2                Fever        False         True           True         False   
3                   No         True        False          False         False   
4  Swollen Lymph Nodes         True         True           True         False   

   Solitary Lesion  Swollen Tonsils  HIV Infection  \
0            False             True          False   
1            False            False           True   
2            False            False           True   
3             True             True           True   
4            False             True           True   

   Sexually Transmitted Infection MonkeyPox  
0                           False  Negative  
1                           False  Positive  
2                           False  Positive  
3   

# Section 2: Multivariate analysis

* Explore data in individual variables through visualisations and statistics
* To get a better understanding of the data in individual variables, including distribution or frequencies

Mpox Diagnosis

* About 2/3 of the people tested had Mpox
* Test positivity rate was 63.6 percent

Systemic Illness
* About 75 percent of the people tested had at least one systemic illness or symptom.
* The data do not allow analysis of whether people had more than one symptom

Other variables

* About half of the people tested had rectal pain, sore throat, penile oedema, oral lesions, a solitary lesion, swollen tonsils, HIV infection and sexually transmitted infections, while the other half did not.
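Percentages like those above can be computed as means of boolean columns. The following is a minimal sketch on a small hypothetical table: the `toy` values are invented for illustration and are not taken from the Mpox dataset; only the column names follow the preview shown earlier.

```python
import pandas as pd

# Hypothetical five-row sample, NOT real data
toy = pd.DataFrame({
    "MonkeyPox": ["Positive", "Positive", "Negative", "Positive", "Negative"],
    "Systemic Illness": ["Fever", "No", "Fever", "Swollen Lymph Nodes", "No"],
    "Rectal Pain": [True, False, True, False, True],
})

# Test positivity rate: share of rows with a positive diagnosis
positivity_rate = (toy["MonkeyPox"] == "Positive").mean() * 100

# Share of people with at least one systemic illness (anything other than "No")
any_systemic = (toy["Systemic Illness"] != "No").mean() * 100

# Share of people with a given boolean symptom
rectal_pain_share = toy["Rectal Pain"].mean() * 100

print(round(positivity_rate, 1), round(any_systemic, 1), round(rectal_pain_share, 1))
```

Applied to `df_Cleaned`, the same expressions should give the frequencies summarised in this section, assuming the columns are coded as in the preview.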

In [None]:
# Logistic regression model to estimate odds ratios for MonkeyPox infection based on risk factors
# Define features and target variable



KeyError: "['MonkeyPox_Infection'] not found in axis"
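The `KeyError` indicates that the target column is not called `MonkeyPox_Infection`; in the dataset preview above it appears as `MonkeyPox`. To make the idea of an odds ratio concrete, here is a minimal sketch for a single binary risk factor, computed from a 2x2 table. The counts are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical counts, NOT real data:
# 6 exposed/positive, 2 exposed/negative, 3 unexposed/positive, 4 unexposed/negative
rows = (
    [(True, "Positive")] * 6 + [(True, "Negative")] * 2
    + [(False, "Positive")] * 3 + [(False, "Negative")] * 4
)
toy = pd.DataFrame(rows, columns=["HIV Infection", "MonkeyPox"])

exposed = toy["HIV Infection"]
positive = toy["MonkeyPox"] == "Positive"

a = (exposed & positive).sum()    # exposed, positive
b = (exposed & ~positive).sum()   # exposed, negative
c = (~exposed & positive).sum()   # unexposed, positive
d = (~exposed & ~positive).sum()  # unexposed, negative

# Odds ratio = odds of a positive test if exposed / odds if unexposed
odds_ratio = (a * d) / (b * c)
print(odds_ratio)
```

For a single binary predictor, an unregularised logistic regression would give a coefficient equal to the natural log of this ratio, which is why exponentiated coefficients are read as odds ratios.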

In [None]:
# Logistic Regression Analysis with MonkeyPox as target variable
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
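# Prepare data
# Use the cleaned dataset loaded above (no DataFrame named `df` is defined in this notebook)
X = df_Cleaned.drop('MonkeyPox', axis=1)
y = df_Cleaned['MonkeyPox']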
# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True) # Avoid dummy variable trap
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression(max_iter=1000) # Increase max_iter to ensure convergence
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Visualise coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_[0]})
coefficients = coefficients.sort_values(by='Coefficient', ascending=False)
plt.figure(figsize=(10,6))
sns.barplot(data=coefficients, x='Coefficient', y='Feature', palette='viridis')
plt.title('Logistic Regression Coefficients')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.show()
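# Seed NumPy and Python's random module for any later stochastic steps
# (the split above is already reproducible through random_state=42)
import random
np.random.seed(42)
random.seed(42)
# End of script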

# Conclusion and next steps

* The EDA has addressed the question about the test positivity rate for Mpox, which is 63.6 percent
* There is also an indication that most of the clinical features are not significantly associated with an Mpox diagnosis
* Next step is data modelling - notebook 3