Question 1: A) The given dataset poses a multiclass classification problem. Multiclass classification is a supervised learning task in which each instance must be assigned to exactly one of three or more classes. In this dataset, the target is one of 42 different types of diseases, predicted from 132 symptom features.

For example, if we have a dataset of images of fruits and we want to classify them into different types of fruits such as apples, oranges, and bananas, then it would be a multiclass classification problem. Similarly, if we have a dataset of emails and we want to classify them into different categories such as spam, promotional, and personal, then it would also be a multiclass classification problem.
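As a quick check of this definition, scikit-learn's `type_of_target` utility reports how a label vector would be treated. The sketch below uses made-up fruit and email labels; they are assumptions for illustration and not data from this assignment.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical label vectors, invented purely for illustration
fruit_labels = ["apple", "orange", "banana", "apple"]
email_labels = ["spam", "not spam", "spam"]

fruit_kind = type_of_target(fruit_labels)  # three distinct classes
email_kind = type_of_target(email_labels)  # only two distinct classes

print(fruit_kind)
print(email_kind)
```

With three or more distinct labels the target is reported as multiclass, while two labels give a binary target.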

B) Popular algorithms for training a classification model:
There are several popular algorithms that can be used for training a classification model on the given dataset. Here are three of them with their pros and cons:

Random Forest Classifier:
Pros:

- Can handle large datasets with high dimensionality.
- Less prone to overfitting than a single decision tree, because it averages many trees.
- Can be used for feature selection.

Cons:

- Can be slow for large datasets.
- Can produce biased results if there are imbalanced classes.
- May not work well for datasets with noisy features.

Logistic Regression:
Pros:

- Fast and easy to implement.
- Works well for datasets with linearly separable features.
- Can be regularized to avoid overfitting.

Cons:

- May not work well for datasets with nonlinearly separable features.
- Assumes a linear relationship between the features and the log-odds of each class.
- Can produce biased results if there are imbalanced classes.

Support Vector Machines:
Pros:

- Can handle datasets with nonlinearly separable features.
- Can produce accurate results with high-dimensional datasets.
- Can be used for feature selection when a linear kernel is used (for example, with an L1 penalty).

Cons:

- Can be slow for large datasets.
- May not work well for datasets with noisy features.
- Can produce biased results if there are imbalanced classes.
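All three algorithms list class imbalance as a weakness. One common mitigation, sketched here under stated assumptions, is scikit-learn's `class_weight="balanced"` option, which all three estimators accept. The data below is synthetic and deliberately imbalanced; it is not the disease dataset, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Synthetic binary "symptom" features and an imbalanced 3-class target (assumption)
X = rng.integers(0, 2, size=(300, 10))
y = np.array([0] * 240 + [1] * 40 + [2] * 20)

# "balanced" weights are n_samples / (n_classes * class_count)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))

# Each estimator applies the same weighting internally when asked to
models = {
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=0),
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Support Vector Machine": SVC(class_weight="balanced"),
}
for name, clf in models.items():
    clf.fit(X, y)
    print(name, "fitted on", len(clf.classes_), "classes")
```

Rare classes receive larger weights, so errors on them cost more during training. Whether this helps on a particular dataset is an empirical question.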

In [2]:
import pandas as pd
import numpy as np

# Load the dataset
train_df = pd.read_csv('./Disease_data/Training.csv')

# Remove unnecessary columns
train_df.drop(['Unnamed: 133'], axis=1, inplace=True)

# Separate the target labels (the "prognosis" column) from the symptom features
prognosis = train_df['prognosis']
train_df.drop(['prognosis'], axis=1, inplace=True)

# For each symptom column, count the rows in which the symptom is present (value 1)
count_ones = train_df.sum(axis=0)

count_ones

itching                 678
skin_rash               786
nodal_skin_eruptions    108
continuous_sneezing     222
shivering               108
                       ... 
small_dents_in_nails    114
inflammatory_nails      114
blister                 114
red_sore_around_nose    114
yellow_crust_ooze       114
Length: 132, dtype: int64

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create the model
dtc = DecisionTreeClassifier(random_state=42)

# Train the model
dtc.fit(train_df, prognosis)

# Load the test set and separate its labels from the symptom features
test_df = pd.read_csv('./Disease_data/Testing.csv')
test_prognosis = test_df['prognosis']
test_df.drop(['prognosis'], axis=1, inplace=True)

# Make predictions on the test set
dtc_predictions = dtc.predict(test_df)

# Evaluate the accuracy of the model
dtc_accuracy = accuracy_score(test_prognosis, dtc_predictions)
print("Decision Tree Classifier accuracy:", dtc_accuracy)

Decision Tree Classifier accuracy: 0.9761904761904762


In [4]:
from sklearn.ensemble import RandomForestClassifier

# Create the model
rfc = RandomForestClassifier(random_state=42)
# Train the model
rfc.fit(train_df, prognosis)


# Make predictions on the test set
rfc_predictions = rfc.predict(test_df)

# Evaluate the accuracy of the model
rfc_accuracy = accuracy_score(test_prognosis, rfc_predictions)
print("Random Forest Classifier accuracy:", rfc_accuracy)


Random Forest Classifier accuracy: 0.9761904761904762


In [5]:
from sklearn.svm import SVC

# Create the model
svm = SVC(random_state=42)

# Train the model
svm.fit(train_df, prognosis)

# Make predictions on the test set
svm_predictions = svm.predict(test_df)

# Evaluate the accuracy of the model
svm_accuracy = accuracy_score(test_prognosis, svm_predictions)
print("Support Vector Machine Classifier accuracy:", svm_accuracy)


Support Vector Machine Classifier accuracy: 1.0


C) Performance evaluation methods are important for assessing the effectiveness of machine learning models. They quantify how well a model's predictions match the true labels.

There are several common performance evaluation metrics used for classification models, including accuracy, precision, recall, and F1-score. Each of these metrics measures a different aspect of the model's performance and provides valuable insights into how well the model is performing.

Accuracy measures the proportion of correctly classified instances out of the total number of instances in the dataset. In a multiclass problem, precision and recall are computed per class and then averaged. For a given class, precision is the proportion of instances predicted as that class that truly belong to it (true positives divided by all predicted positives), and recall is the proportion of instances that truly belong to the class that the model identified (true positives divided by all actual positives). F1-score is the harmonic mean of precision and recall, and provides a balanced measure of both metrics.
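To make these definitions concrete, the following sketch derives per-class precision, recall and F1 from a confusion matrix and checks them against scikit-learn's weighted averages. The six labels are a toy example invented for illustration; they are not taken from the disease dataset.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels (assumption): three classes, one "a" misclassified as "b"
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
labels = ["a", "b", "c"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
tp = np.diag(cm)                    # correct predictions per class
precision = tp / cm.sum(axis=0)     # divide by column sums (predicted as class)
recall = tp / cm.sum(axis=1)        # divide by row sums (actually in class)
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)            # number of true instances per class

# "Weighted" averaging weights each class by its support
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)
weighted_f1 = np.average(f1, weights=support)

print("per-class precision:", precision.round(3))
print("per-class recall:   ", recall.round(3))
print("weighted precision / recall / F1:",
      round(weighted_precision, 3), round(weighted_recall, 3), round(weighted_f1, 3))
print("accuracy:", round(accuracy_score(y_true, y_pred), 3))
```

Support-weighted recall always equals accuracy, because both reduce to the number of correct predictions divided by the number of instances.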

The significance of performance evaluation methods lies in their ability to help us identify the strengths and weaknesses of our models. By evaluating the model's performance on a test dataset, we can determine how well the model is likely to perform on new, unseen data. This information can be used to fine-tune the model, adjust its hyperparameters, or choose a different algorithm altogether.

In addition, performance evaluation methods help us to compare the performance of different models, algorithms, or hyperparameters. By comparing the accuracy, precision, recall, and F1-score of different models, we can determine which one is likely to be the most effective for a given task.
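One way to make such comparisons less dependent on a single, small test split is k-fold cross-validation. The sketch below compares three classifiers with `cross_val_score` on synthetic data generated by `make_classification`; the data, fold count and model settings are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for a real dataset (assumption)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
}

# Score every model on the same folds so the comparison is like for like
cv_scores = {}
for name, clf in candidates.items():
    cv_scores[name] = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print("{}: mean accuracy {:.3f} (std {:.3f})".format(
        name, cv_scores[name].mean(), cv_scores[name].std()))
```

Reporting the spread across folds alongside the mean gives a sense of how much of a difference between models could be due to the split alone.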

Overall, performance evaluation methods are essential for building effective machine learning models and ensuring that they perform well in real-world applications.

In [7]:
import pandas as pd

# Read the dataset
df_train = pd.read_csv('./Disease_data/Training.csv')
# Remove unnecessary columns
df_train.drop(['Unnamed: 133'], axis=1, inplace=True)
df_test = pd.read_csv('./Disease_data/Testing.csv')
# Extract the symptoms columns
symptoms_train = df_train.columns[:-1]
symptoms_test = df_test.columns[:-1]

# Force every symptom value to be strictly 0 or 1 (anything other than 1 becomes 0)
df_train[symptoms_train] = (df_train[symptoms_train] == 1).astype(int)
df_test[symptoms_test] = (df_test[symptoms_test] == 1).astype(int)

# Convert the disease column into categorical values
df_train['prognosis'] = pd.Categorical(df_train['prognosis'])
df_test['prognosis'] = pd.Categorical(df_test['prognosis'])

# Split the dataset into X and y
X_train = df_train.iloc[:, :-1]
y_train = df_train.iloc[:, -1]
X_test = df_test.iloc[:, :-1]
y_test = df_test.iloc[:, -1]

In [8]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred_rf = model.predict(df_test.drop('prognosis', axis=1))


In [9]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(df_test.drop('prognosis', axis=1))

In [10]:
# Support Vector Machines
from sklearn.svm import SVC
model_svm = SVC()
model_svm.fit(X_train, y_train)
y_pred_svm = model_svm.predict(df_test.drop('prognosis', axis=1))

In [11]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = df_test['prognosis']
predictions = {
    'Random Forest': y_pred_rf,
    'Logistic Regression': y_pred_lr,
    'Support Vector Machine': y_pred_svm,
}

# Accuracy: fraction of test rows classified correctly
for name, preds in predictions.items():
    print('Accuracy ({}): {:.2f}%'.format(name, accuracy_score(y_true, preds) * 100))

# Precision, recall and F1, averaged over classes and weighted by class support
weighted_metrics = {
    'Precision': precision_score,
    'Recall': recall_score,
    'F1': f1_score,
}
for metric_name, metric in weighted_metrics.items():
    print('{} Score'.format(metric_name))
    for name, preds in predictions.items():
        score = metric(y_true, preds, average='weighted')
        print('{} ({}): {:.2f}%'.format(metric_name, name, score * 100))

Accuracy (Random Forest): 97.62%
Accuracy (Logistic Regression): 100.00%
Accuracy (Support Vector Machine): 100.00%
Precision Score
Precision (Random Forest): 98.81%
Precision (Logistic Regression): 100.00%
Precision (Support Vector Machine): 100.00%
Recall Score
Recall (Random Forest): 97.62%
Recall (Logistic Regression): 100.00%
Recall (Support Vector Machine): 100.00%
F1 Score
F1 (Random Forest): 97.62%
F1 (Logistic Regression): 100.00%
F1 (Support Vector Machine): 100.00%


In [17]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

# Separate the symptom features from the labels; passing the whole DataFrame
# (strings mixed with integers) to np.unique raises a TypeError
X_roc = df_test.drop('prognosis', axis=1)
y_roc = df_test['prognosis'].astype(str)

# Binarize the true labels one-vs-rest, in the class order the model learned
classes = model.classes_
n_classes = len(classes)
y_test_binary = label_binarize(y_roc, classes=classes)
y_pred_proba = model.predict_proba(X_roc)

# Micro-average ROC: pool every (sample, class) decision into one binary problem
fpr, tpr, thresholds = roc_curve(y_test_binary.ravel(), y_pred_proba.ravel())
roc_auc = auc(fpr, tpr)

# Plot the ROC curve against the chance diagonal
plt.figure()
plt.plot(fpr, tpr, lw=2, label='Micro-average ROC (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Chance')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

Question 2: A) Hyperparameter tuning refers to the process of selecting the best possible values for the hyperparameters of a machine learning algorithm, that is, the settings chosen before training rather than learned from the data. These hyperparameters play a crucial role in determining the performance and behavior of the model. Some examples of hyperparameters include learning rate, regularization strength, and the number of hidden layers in a neural network.

There are several methods for hyperparameter tuning, including:

Grid search: This involves specifying a range of values for each hyperparameter and evaluating the model's performance on all possible combinations of these values. It can be computationally expensive but can lead to the best results if the range of values is well chosen.

Random search: This involves randomly sampling hyperparameters from a specified distribution and evaluating the model's performance on each combination. It is less computationally expensive than grid search but may not always result in the best performance.

Bayesian optimization: This involves using a probabilistic model to select the next set of hyperparameters to evaluate based on the results of previous evaluations. It can be more efficient than grid search or random search, but requires more expertise to implement.
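The cost difference between grid search and random search can be sketched as follows. `ParameterGrid` counts how many candidates an exhaustive grid would evaluate, and `RandomizedSearchCV` samples a fixed number of candidates instead. The synthetic data, the parameter ranges and the budget of five iterations are assumptions for illustration; Bayesian optimization is not shown because it needs a library outside scikit-learn.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Synthetic 3-class data standing in for a real training set (assumption)
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Grid search evaluates every combination: 3 * 3 * 3 candidates per CV fold
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}
n_grid_candidates = len(ParameterGrid(param_grid))
print("Grid candidates:", n_grid_candidates)

# Random search draws n_iter candidates from distributions instead
param_distributions = {
    "n_estimators": randint(20, 101),      # integers from 20 to 100
    "max_depth": randint(3, 16),           # integers from 3 to 15
    "min_samples_split": randint(2, 11),   # integers from 2 to 10
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # evaluation budget: 5 sampled candidates
    cv=3,
    random_state=0,
)
search.fit(X, y)
print("Sampled candidates:", len(search.cv_results_["params"]))
print("Best sampled hyperparameters:", search.best_params_)
```

The budget for random search is set directly through `n_iter`, whereas the cost of a grid grows multiplicatively with every hyperparameter added.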

To use hyperparameter tuning to improve the performance of a classification model on a given dataset, we can use one of these methods to search for the best hyperparameters for a given algorithm. We can then use these optimal hyperparameters to train the model on the training dataset and evaluate its performance on the test dataset. This can lead to a more accurate and effective classification model that can be used to predict the prognosis of a given disease.

Hyperparameter tuning is essential because hyperparameters significantly influence the model's accuracy and generalization performance. The process can help identify the optimal hyperparameters, leading to a better understanding of the underlying patterns in the data, avoiding overfitting or underfitting, and improving the model's performance on unseen data.

In [24]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Load the training and testing data
train_data = pd.read_csv('./Disease_data/Training.csv')
train_data.drop('Unnamed: 133', axis=1, inplace=True)
test_data = pd.read_csv('./Disease_data/Testing.csv')

# Split the training data into features (X) and labels (y)
X_train = train_data.drop('prognosis', axis=1)
y_train = train_data['prognosis']

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Define the classifier
clf = RandomForestClassifier(random_state=42)

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5)
# Treat any missing symptom value as "symptom absent"
X_train = X_train.fillna(0)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and their score
print("Best hyperparameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

# best_estimator_ has already been refit on the full training set
# with the best hyperparameters (GridSearchCV uses refit=True by default)
best_clf = grid_search.best_estimator_

# Use the model to make predictions on the test data
X_test = test_data.drop('prognosis', axis=1)
y_test = test_data['prognosis']
y_pred = best_clf.predict(X_test)

# Print the classification report and accuracy score
print(classification_report(y_test, y_pred))
print("Accuracy: ", accuracy_score(y_test, y_pred))


Best hyperparameters:  {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 200}
Best score:  1.0
                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00         1
                                   AIDS       1.00      1.00      1.00         1
                                   Acne       1.00      1.00      1.00         1
                    Alcoholic hepatitis       1.00      1.00      1.00         1
                                Allergy       1.00      1.00      1.00         1
                              Arthritis       1.00      1.00      1.00         1
                       Bronchial Asthma       1.00      1.00      1.00         1
                   Cervical spondylosis       1.00      1.00      1.00         1
                            Chicken pox       1.00      1.00      1.00         1
                    Chronic cholestasis       1.00      1.00      1.00         1
     