# DEVELOPMENT OF THE MODEL
The proposed approach uses a classification model to predict the most likely source of infection from the list of relevant symptoms a patient presents. The symptoms are first grouped into clusters that were identified earlier through clustering analysis on text data. Each of the patient's symptoms is assigned to its cluster, and the feature representation is the number of symptoms per cluster. This representation is used to train a RandomForest model in this notebook and a small Multilayer Perceptron in `alternative_class_mlp.ipynb`, both of which predict the most likely source of infection.
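
As an illustration of the feature representation described above, the following is a minimal sketch of turning a symptom list into per-cluster counts. The mapping `SYMPTOM_TO_CLUSTER`, the constant `N_CLUSTERS`, and the helper `cluster_counts` are hypothetical names with made-up values; the real assignment comes from the earlier clustering step, which is not part of this notebook.

```python
from collections import Counter

# Hypothetical symptom-to-cluster mapping (assumption for illustration only;
# the real assignment comes from the earlier clustering analysis).
SYMPTOM_TO_CLUSTER = {"fever": 0, "chills": 0, "cough": 1, "dysuria": 2}
N_CLUSTERS = 3


def cluster_counts(symptoms, mapping=SYMPTOM_TO_CLUSTER, n_clusters=N_CLUSTERS):
    """Return a list whose i-th entry is the number of symptoms in cluster i."""
    # Symptoms that are missing from the mapping are skipped.
    counts = Counter(mapping[s] for s in symptoms if s in mapping)
    return [counts.get(i, 0) for i in range(n_clusters)]


features = cluster_counts(["fever", "chills", "cough", "headache"])
print(features)  # [2, 1, 0]
```

Each patient is thus reduced to a fixed-length vector, regardless of how many symptoms were recorded.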

In [17]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

In [18]:
# Load the dataset
data = pd.read_csv("df_last_task.csv")

After loading, the dataset is divided into two parts:

- The features (X), which contain the data used for making predictions. In this case, the features are taken from the 'cluster_counter' column.
- The labels (y), which contain the targets to be predicted. In this case, the labels are all columns except 'cluster_counter'.

In [19]:
# Split the dataset into features and labels
y = data.drop(columns=['cluster_counter'])
X = data['cluster_counter']

The 'cluster_counter' column contains lists of numerical values represented as strings. This step parses each string into a list of integers (each token is read as a float and then cast to int); `.values` then returns a numpy array whose elements are these lists.

In [20]:
# Parse each 'cluster_counter' string into a list of ints (numpy array of lists)
X = X.apply(lambda x: [int(float(i)) for i in x.strip('[]').split()]).values
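
To make the parsing step concrete, here is a small sketch that applies the same expression to a single string. The sample value is an assumption: it presumes the column stores NumPy-style, whitespace-separated text such as `[1. 0. 3.]`. The helper name `parse_counts` is hypothetical.

```python
def parse_counts(text):
    """Same parsing expression as in the cell above, wrapped in a function."""
    return [int(float(tok)) for tok in text.strip('[]').split()]


sample = "[1. 0. 3.]"  # assumed format: whitespace-separated values
parsed = parse_counts(sample)
print(parsed)  # [1, 0, 3]

# A comma-separated string is not handled by this expression:
# the token "1," cannot be converted to float.
try:
    parse_counts("[1, 0, 3]")
    comma_ok = True
except ValueError:
    comma_ok = False
print(comma_ok)  # False
```

If the CSV stored comma-separated lists instead, the split would need a different separator.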

The dataset is split into a training set, used to fit the model, and a test set, used to evaluate the trained model. The parameter test_size=0.2 reserves 20% of the rows for testing, and random_state=42 makes the split reproducible.

In [21]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Next, the lists of numbers taken from the 'cluster_counter' column, now held in X_train and X_test, are converted into two-dimensional numpy arrays. This is necessary because the scikit-learn Random Forest classifier expects a two-dimensional input of shape (n_samples, n_features).

In [22]:
# Convert the 'cluster_counter' column to a two-dimensional numpy array
X_train = np.array([np.array(x) for x in X_train])
X_test = np.array([np.array(x) for x in X_test])
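
The effect of this conversion can be seen on a toy example. This is only a sketch with made-up values, assuming every row has the same number of cluster counts: a pandas Series of lists yields a one-dimensional object array, and stacking the rows yields the two-dimensional array the classifier needs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the parsed 'cluster_counter' column (made-up values).
toy = pd.Series([[1, 0, 2], [0, 3, 1]])

raw = toy.values
print(raw.shape, raw.dtype)  # (2,) object

# Same conversion as in the cell above.
stacked = np.array([np.array(x) for x in raw])
print(stacked.shape)  # (2, 3)
```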

The RandomForest model is created and trained using the RandomForestClassifier class from scikit-learn. During training, the model learns patterns and relationships in the training data to make predictions.

After training, the model's performance is evaluated on the test data. The trained model makes predictions on the test dataset, and these predictions are compared with the true labels to assess how well the model generalizes to unseen data.

Finally, a classification report is printed, which provides various performance metrics such as precision, recall, F1-score, and support for each class. This report offers insights into how effectively the model classifies different classes in the test dataset.

In [23]:
# Create and train the RandomForest model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Evaluate the model on the test data
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.00      0.00      0.00         6
           2       0.00      0.00      0.00         8
           3       0.25      0.04      0.07        25
           4       0.00      0.00      0.00         3
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00        13
           7       0.00      0.00      0.00         1
           8       0.00      0.00      0.00         0
           9       0.00      0.00      0.00        19
          10       0.00      0.00      0.00         2
          11       0.00      0.00      0.00         3

   micro avg       0.20      0.01      0.02        86
   macro avg       0.02      0.00      0.01        86
weighted avg       0.07      0.01      0.02        86
 samples avg       0.01      0.01      0.01        86





When explaining the obtained results, it's important to consider the insights gained from the clustering analysis conducted earlier. This analysis revealed that the identified clusters are not representative of specific health issues, primarily due to the lack of consistency in symptoms and the presence of multiple variables involved in the clustering process. Moreover, issues like symptom overlap, individual variability of each patient, and the large number of symptoms contained in the dataset further complicated the analysis.

These factors made it difficult to identify clear patterns within the data. The results show that the RandomForest-based classification model did not produce usable predictions for the source of infection. In the classification report, only class 3 has non-zero scores (precision 0.25, recall 0.04, F1-score 0.07); every other class scores 0.00, and the micro-averaged F1-score is 0.02 over a total support of 86. This suggests that the model was unable to learn the relationships between symptoms and sources of infection, and therefore could not provide accurate predictions.
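
To put the reported numbers in perspective, the sketch below recomputes precision, recall, and F1-score from raw counts. The counts are an assumption: the notebook does not report them, and they are simply the smallest integers consistent with the rounded values in the report (one correct prediction out of four for class 3, and one correct out of five positive predictions overall). The helper name `prf` is hypothetical.

```python
def prf(tp, predicted, actual):
    """Precision, recall and F1 from true positives, predicted and actual positives."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Class 3 (support 25): assumed 1 true positive among 4 predictions.
p3, r3, f3 = prf(tp=1, predicted=4, actual=25)
print(round(p3, 2), round(r3, 2), round(f3, 2))  # 0.25 0.04 0.07

# Micro average (support 86): assumed 1 true positive among 5 predictions.
pm, rm, fm = prf(tp=1, predicted=5, actual=86)
print(round(pm, 2), round(rm, 2), round(fm, 2))  # 0.2 0.01 0.02
```

Under these assumed counts, the model would have assigned a positive label only a handful of times across the whole test set, which is consistent with the near-zero recall in the report.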

One way to improve the analysis is to integrate domain expertise and medical knowledge into feature engineering and model development. With input from medical professionals, we can design more informative features and select algorithms suited to the specific characteristics of medical data. Further research into the underlying mechanisms of different infections and how they manifest as symptoms could also help. This knowledge can inform the development of more accurate and robust classification models, ultimately improving diagnostic accuracy and patient care.
