### Step 1:  Install the necessary libraries

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


### Step 2:  Import the necessary libraries

In [3]:
import pandas as pd

### Step 3:  Read the xlsx files

In [4]:
textSummary = pd.read_excel('text_summary_datasets_v2.xlsx')
trainingData = pd.read_excel('training_data_v2.xlsx')

### Step 4:  Explore the data


In [5]:
print('Text Summary')
print(textSummary.head())
print(textSummary.info())
print('\n')
print('Training Data')
print(trainingData.head())
print(trainingData.info())


Text Summary
   Index                                          Paragraph  \
0      0  the diagnosis of vkh was made according to the...   
1      1  changes in the is / os junction , the cost lin...   
2      2  twenty - nine patients ( 15 males and 14 femal...   
3      3  our analysis revealed a strong correlation bet...   
4      4  better oct findings correlated to better visio...   

                                             Summary  Category  
0  The diagnosis of VKH followed revised diagnost...         0  
1  The study assessed changes in the IS/OS juncti...         0  
2  The study included 29 patients (15 males, 14 f...         0  
3  Our analysis demonstrated a strong correlation...         0  
4  Better OCT findings were associated with impro...         0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Index      200 non-null    int6

### Step 5:  Apply KNN and Decision Tree


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV


In [7]:
# Separate the BERT embeddings (features) from the category labels (target)
x = trainingData.drop(columns=["Index", "Category"])  # BERT embeddings as features
y = trainingData["Category"]  # target labels (0, 1, 2, 3)

# Split the data into training and testing sets
# 20% of the data will be used for testing
# 80% of the data will be used for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


# Define range of n_neighbors values to search
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}

# Initialize KNN classifier
knn_classifier = KNeighborsClassifier()

# Perform grid search to find the best value of n_neighbors
grid_search = GridSearchCV(knn_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train)

# Get the best value of n_neighbors
best_n_neighbors = grid_search.best_params_['n_neighbors']
print("Best value of n_neighbors:", best_n_neighbors)

# train KNN classifier
knn = KNeighborsClassifier(n_neighbors=best_n_neighbors)
knn.fit(x_train, y_train)

# train decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)

# test and evaluate the KNN classifier
knn_pred = knn.predict(x_test)
knn_accuracy = accuracy_score(y_test, knn_pred)
knn_classification_report = classification_report(y_test, knn_pred)

# test and evaluate the decision tree classifier
dt_pred = dt.predict(x_test)
dt_accuracy = accuracy_score(y_test, dt_pred)
dt_classification_report = classification_report(y_test, dt_pred)


# print evaluation results
print('KNN Classifier')
print('Accuracy:', knn_accuracy)
print(knn_classification_report)
print('\n')
print('Decision Tree Classifier')
print('Accuracy:', dt_accuracy)
print(dt_classification_report)

Best value of n_neighbors: 3
KNN Classifier
Accuracy: 0.9
              precision    recall  f1-score   support

           0       0.86      1.00      0.92         6
           1       1.00      0.87      0.93        15
           2       1.00      0.78      0.88         9
           3       0.77      1.00      0.87        10

    accuracy                           0.90        40
   macro avg       0.91      0.91      0.90        40
weighted avg       0.92      0.90      0.90        40



Decision Tree Classifier
Accuracy: 0.625
              precision    recall  f1-score   support

           0       0.62      0.83      0.71         6
           1       0.92      0.80      0.86        15
           2       0.20      0.11      0.14         9
           3       0.50      0.70      0.58        10

    accuracy                           0.62        40
   macro avg       0.56      0.61      0.57        40
weighted avg       0.61      0.62      0.61        40
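
Note that `DecisionTreeClassifier()` is created above without a `random_state`, so the decision tree figures may differ somewhat from run to run, because scikit-learn permutes the candidate features at each split. The sketch below illustrates how fixing the seed makes the predictions repeatable. It runs on synthetic data from `make_classification` as a stand-in for the embeddings (an assumption, not the notebook's data), and the helper `fit_and_predict` is a hypothetical name used only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the BERT embeddings (assumption).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_and_predict(seed):
    """Train a decision tree with a fixed seed and predict the test set."""
    tree = DecisionTreeClassifier(random_state=seed)
    tree.fit(X_train, y_train)
    return tree.predict(X_test)

# Two independent fits with the same seed give identical predictions.
first = fit_and_predict(42)
second = fit_and_predict(42)
same = bool((first == second).all())
print("Identical predictions with a fixed seed:", same)
```

Passing the same `random_state` in the notebook's own cell would make its reported decision tree accuracy reproducible, although the exact numbers could differ from the run shown above.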



### Step 6:  Results

Best value of n_neighbors: 3

KNN Classifier

Accuracy: 0.9

| Category     | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.86      | 1.00   | 0.92     | 6       |
| 1            | 1.00      | 0.87   | 0.93     | 15      |
| 2            | 1.00      | 0.78   | 0.88     | 9       |
| 3            | 0.77      | 1.00   | 0.87     | 10      |
| accuracy     |           |        | 0.90     | 40      |
| macro avg    | 0.91      | 0.91   | 0.90     | 40      |
| weighted avg | 0.92      | 0.90   | 0.90     | 40      |

Decision Tree Classifier

Accuracy: 0.625

| Category     | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.62      | 0.83   | 0.71     | 6       |
| 1            | 0.92      | 0.80   | 0.86     | 15      |
| 2            | 0.20      | 0.11   | 0.14     | 9       |
| 3            | 0.50      | 0.70   | 0.58     | 10      |
| accuracy     |           |        | 0.62     | 40      |
| macro avg    | 0.56      | 0.61   | 0.57     | 40      |
| weighted avg | 0.61      | 0.62   | 0.61     | 40      |
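
The `macro avg` and `weighted avg` rows can be reproduced from the per-category rows: the macro average is the plain mean over the four categories, while the weighted average weights each category by its support. The following is a minimal sketch using the rounded decision tree figures from the table above, so it matches the report only to two decimals. The function names are illustrative, not part of scikit-learn.

```python
# Per-category precision, recall and support for the decision tree,
# copied from the table above (already rounded to two decimals).
precision = [0.62, 0.92, 0.20, 0.50]
recall = [0.83, 0.80, 0.11, 0.70]
support = [6, 15, 9, 10]

def macro_average(values):
    """Unweighted mean: every category counts equally."""
    return sum(values) / len(values)

def weighted_average(values, weights):
    """Mean weighted by the number of true instances per category."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

macro_precision = macro_average(precision)
weighted_precision = weighted_average(precision, support)
macro_recall = macro_average(recall)
weighted_recall = weighted_average(recall, support)

print(f"macro precision:    {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
print(f"macro recall:       {macro_recall:.2f}")
print(f"weighted recall:    {weighted_recall:.2f}")
```

The weighted recall coincides with the overall accuracy (0.62 after rounding), which is expected when every sample has exactly one label.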

### Step 7:  Initial evaluation

#### KNN Classifier:
1. Category 0 (Healthcare):
- Precision of 0.86 indicates that when the classifier predicts a data point as belonging to healthcare, it is correct around 86% of the time.
- Recall of 1.00 suggests that the classifier effectively captures all true healthcare instances.
- F1-score of 0.92 balances precision and recall, providing a single metric for evaluation.

2. Category 1 (AI):
- Precision of 1.00 means that every test instance predicted as AI is indeed AI-related.
- Recall of 0.87 implies that the classifier captures a significant portion of true AI instances.
- F1-score of 0.93 reflects a strong balance between precision and recall.

3. Category 2 (IoT):
- Precision of 1.00 means that every test instance predicted as IoT is a true IoT instance.
- Recall of 0.78 suggests that the classifier captures a good portion of true IoT instances but misses some.
- F1-score of 0.88 reflects a relatively high performance in this category.

4. Category 3 (Blockchain):
- Precision of 0.77 indicates that the classifier is moderately accurate in predicting blockchain instances.
- Perfect recall of 1.00 implies that the classifier captures all true blockchain instances.
- F1-score of 0.87 balances precision and recall effectively.
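
The figures above follow directly from counts of true positives, false positives and false negatives. As a sketch, the counts below for Category 3 are inferred from the report (support 10, recall 1.00, precision 0.77) rather than read from an actual confusion matrix, and the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts for one category."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts for KNN on Category 3 (Blockchain): all 10 true instances
# are found (fn = 0), and 3 instances of other categories are labeled as 3.
p, r, f1 = precision_recall_f1(tp=10, fp=3, fn=0)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.77 recall=1.00 f1=0.87
```

Because F1 is the harmonic mean of precision and recall, it sits closer to the lower of the two values than a plain average would.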


#### Decision Tree Classifier:
1. Category 0 (Healthcare):
- Moderate precision of 0.62 suggests that the classifier's healthcare predictions are correct around 62% of the time.
- Recall of 0.83 indicates that the classifier captures a good portion of true healthcare instances.
- F1-score of 0.71 sits between the modest precision and the higher recall.

2. Category 1 (AI):
- High precision of 0.92 indicates strong accuracy in predicting AI instances.
- Recall of 0.80 implies that the classifier captures most true AI instances but misses some.
- F1-score of 0.86 reflects a good balance between precision and recall.

3. Category 2 (IoT):
- Low precision of 0.20 indicates that the classifier's IoT predictions are inaccurate most of the time.
- Recall of 0.11 suggests that the classifier misses the majority of true IoT instances.
- Very low F1-score of 0.14 highlights poor performance in this category.

4. Category 3 (Blockchain):
- Precision of 0.50 means that only half of the instances predicted as blockchain are actually blockchain.
- Recall of 0.70 indicates that the classifier captures a good portion of true blockchain instances.
- F1-score of 0.58 reflects a moderate balance between precision and recall.


In summary, the KNN Classifier outperforms the Decision Tree Classifier in every category on this 40-sample test set, with an overall accuracy of 0.90 versus 0.625. The gap is largest for IoT, where the Decision Tree Classifier reaches an F1-score of only 0.14, and smallest for AI, where it comes closest to KNN (F1-score 0.86 versus 0.93).
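
The per-category comparison behind this summary can be checked with a few lines of Python. This is a small sketch that uses the rounded F1-scores reported in Step 6; the variable names are illustrative.

```python
categories = ["Healthcare", "AI", "IoT", "Blockchain"]
knn_f1 = [0.92, 0.93, 0.88, 0.87]  # F1-scores of the KNN classifier
dt_f1 = [0.71, 0.86, 0.14, 0.58]   # F1-scores of the decision tree

# Difference in F1-score per category (positive means KNN is ahead).
gaps = {name: round(k - d, 2) for name, k, d in zip(categories, knn_f1, dt_f1)}
for name, gap in gaps.items():
    print(f"{name:<10} KNN leads by {gap:.2f}")

largest = max(gaps, key=gaps.get)
print("Largest gap:", largest)
```

With only 40 test samples, a single misclassified instance shifts the accuracy by 2.5 percentage points, so the smaller gaps should be read with some caution.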
