
# Introduction

This project aims to predict how a patient with a specific condition responds, within a defined period of time, to a drug administered in therapy cycles. The analysis focuses on the response in 2 time periods: 1 day and 3 days before the scheduled drug administration.

In [1]:
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import pandas as pd
from tabulate import tabulate
import csv
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import ConfusionMatrixDisplay

# Functions
Helper functions used throughout the notebook.

In [2]:
def duplicate_check(df, cols):
    # Check for duplicates based on the specified columns
    duplicate_rows = df[df.duplicated(subset=cols, keep=False)]

    return duplicate_rows

"""def filling_UltimoPrelievo(df):
    #Having GiornoDaTerapiaPrecedente equal to GiorniTeoriciDaTerapiaPrecedente means that the schedule has been respected. Considering that the cases with we are assuming
    df['UltimoPrelievo'] = df['UltimoPrelievo'].where(df['UltimoPrelievo'] == '' & df['GiornoDaTerapiaPrecedente']==df['GiorniTeoriciDaTerapiaPrecedente'], df['DataSomministrazione'] - pd.DateOffset(days=1))

    #Removing data where GiornoDaTerapiaPrecedente is NOT equal to GiorniTeoriciDaTerapiaPrecedente
    (...)"""


def not_ready_therapies(df):
    # The therapy was executed later than scheduled
    condition1 = df['GiornoDaTerapiaPrecedente'] > df['GiorniTeoriciDaTerapiaPrecedente']
    # More than one blood test was recorded
    condition2 = df['NumeroPrelievi'] > 1
    filtered_df = df[condition1 | condition2]
    return filtered_df

def del_rows_missing_val (df):
    condition = df['UltimoPrelievo'] == ''

    rows_to_drop = df[condition].index
    return df.drop(rows_to_drop)

def data_cleaning(df):
    condition1 = df['UltimoPrelievo'] < df['PrimoPrelievo'] #this information would be caused by a collection error
    #condition2 = df['classeeta'] ==   & condition2""

    rows_to_drop = df[condition1].index
    return df.drop(rows_to_drop)

def attrib_removal(df, attrib_list):
    for x in attrib_list:
        df = df.drop(x, axis=1)

    return df

def encoding_attrib(df, cols):
  label_mappings = {}

  for x in cols:
      label_encoder = LabelEncoder()
      df[x] = label_encoder.fit_transform(df[x])

      # Store the label mapping of the current column in the main dictionary
      label_mappings[x] = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

  return df, label_mappings

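def get_unique(df, col, condit):
  # `condit` is a Python expression (given as a string) that is evaluated on each element `x`.
  # Note: a cell holding a single ingredient is a plain string, so iterating over it yields
  # single characters; with a length-based condition these end up in anomal_values.
  unique_values_set = set()
  anomal_values = set()
  indices = set()

  for idx, row in df.iterrows():
    arr = row.loc[col]
    for x in arr:
        if (x not in unique_values_set):
          if eval(condit):
            unique_values_set.add(x)
          else:
            anomal_values.add(x)
            indices.add(idx)
  return unique_values_set, anomal_values, indices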

def check_value_presence(row):
    return {f'{value}': value in row['PrincipiAttivi'] for value in unique_values_set}

# Which data are you interested about?

Before starting, it is important to choose which data to analyze:
1.  therapies with the blood test executed 1 day before
2.  therapies with the blood test executed 3 days before
3.  both

In [3]:
#choice = input("Enter your choice (1, 2 or any other character for the 3rd option): ")
choice = '3'  # kept as a string: input() returns strings and the filter below compares against '1' and '2'

# Data loading

The data is imported from the CSV file into a dictionary, which structures the information by column, and the dictionary is then used to create the DataFrame.
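
The loading step can also be sketched on a tiny in-memory sample. This is only an illustration under assumptions: the two sample rows and the reduced set of columns are made up, and the ingredient column is always split into a list (even when it holds a single code), which differs from the cell below, where only cells containing a comma become lists.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample using the same ';' delimiter as the real file.
sample = io.StringIO(
    "IDPaziente;PrincipiAttivi;NumeroPrelievi\n"
    "p1;L01DB03,L01AA01;1\n"
    "p2;L01XA03;2\n"
)

rows = []
for record in csv.DictReader(sample, delimiter=";"):
    # Always store the active ingredients as a list, even for a single code.
    record["PrincipiAttivi"] = [code.strip() for code in record["PrincipiAttivi"].split(",")]
    rows.append(record)

sample_df = pd.DataFrame(rows)
print(sample_df)
```

Keeping one type per column (always a list) avoids having to handle strings and lists separately in the later steps.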

In [4]:
dataset_file_path = 'dataset_farmacia_anonimo(3).csv'

keys = ['IDPaziente',	'genere',	'classeeta',	'IDTerapia',	'CodiceSedeTumore',	'PrincipiAttivi',	'PrincipiAttiviDescrizione',	'NumeroSomministrazione',	'DataSomministrazioneProgrammata',	'DataSomministrazione',	'DataSomministrazioneProgrammataPrecedente',	'DataSomministrazionePrecedente',	'GiornoDaTerapiaPrecedente',	'GiorniTeoriciDaTerapiaPrecedente',	'PrimoPrelievo',	'UltimoPrelievo', 'NumeroPrelievi']
b_tests = {x: [] for x in keys}

with open(dataset_file_path, 'r') as file:
    reader = csv.reader(file, delimiter=";")
    for row_index, row in enumerate(reader):
        if row_index > 0:
            for idx, cell in enumerate(row):
                if "," in cell:             #to save the list of active ingredients as array
                    b_tests[keys[idx]].append(cell.split(","))
                else:
                    b_tests[keys[idx]].append(cell)

df = pd.DataFrame(b_tests)

Data visualization

In [5]:
print(df[:10].to_string(index=False))

#another way to visualize data
#print(tabulate(df, df.columns.tolist(), tablefmt='grid'))

                      IDPaziente genere classeeta IDTerapia CodiceSedeTumore     PrincipiAttivi       PrincipiAttiviDescrizione NumeroSomministrazione DataSomministrazioneProgrammata DataSomministrazione DataSomministrazioneProgrammataPrecedente DataSomministrazionePrecedente GiornoDaTerapiaPrecedente GiorniTeoriciDaTerapiaPrecedente PrimoPrelievo UltimoPrelievo NumeroPrelievi
660f6a81899cf623712934e2b1fb7da0      F     51-70    136427               20 [L01DB03, L01AA01]   [epirubicina, ciclofosfamide]                      2                      17/03/2023           17/03/2023                                24/02/2023                     24/02/2023                        21                               21    16/03/2023     16/03/2023              1
9f779da0518825145e638e0c8f0b0b37      M     71-90    135572               33 [L01XA03, L01BC02] [oxaliplatino, 5-fluorouracile]                      2                      15/02/2023           15/02/2023                                31/01

# Data pre processing
The data presents several issues that need to be fixed.

1. Duplicated data: Duplicated rows would count the same information multiple times and distort the picture of reality, so a check is relevant (even if in this case no duplicates occur).
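
As a minimal sketch of this check, the toy frame below (made-up values, not taken from the dataset) shows how `duplicated` with `keep=False` flags every row of a duplicated group, and how `drop_duplicates` could remove the repeats if any were found.

```python
import pandas as pd

# Hypothetical rows: the first two share the same patient, therapy and administration number.
toy = pd.DataFrame({
    "IDPaziente": ["a", "a", "b"],
    "IDTerapia": ["1", "1", "2"],
    "NumeroSomministrazione": ["2", "2", "1"],
})
key = ["IDPaziente", "IDTerapia", "NumeroSomministrazione"]

# keep=False marks all members of a duplicated group, not only the repeats.
dups = toy[toy.duplicated(subset=key, keep=False)]

# If duplicates existed, keeping the first occurrence would be one way to remove them.
deduplicated = toy.drop_duplicates(subset=key, keep="first")

print(dups)
print(deduplicated)
```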

In [6]:
dup_rows = duplicate_check(df, ['IDPaziente', 'IDTerapia', 'NumeroSomministrazione'])
print("\n\nDuplicated rows:\n", dup_rows)



Duplicated rows:
 Empty DataFrame
Columns: [IDPaziente, genere, classeeta, IDTerapia, CodiceSedeTumore, PrincipiAttivi, PrincipiAttiviDescrizione, NumeroSomministrazione, DataSomministrazioneProgrammata, DataSomministrazione, DataSomministrazioneProgrammataPrecedente, DataSomministrazionePrecedente, GiornoDaTerapiaPrecedente, GiorniTeoriciDaTerapiaPrecedente, PrimoPrelievo, UltimoPrelievo, NumeroPrelievi]
Index: []


2.  Missing values: There are many missing values in the fields "PrimoPrelievo" and "UltimoPrelievo", and I considered filling "UltimoPrelievo". The plan was to fill that field with "DataSomministrazione" minus 1 day for each case where "GiornoDaTerapiaPrecedente" is equal to "GiorniTeoriciDaTerapiaPrecedente" (so the schedule has been respected), but this option assumed that all these therapy executions were scheduled on the last 4 days of the week. This is not a feasible assumption when the goal is to compare the percentage of blood tests done 1 day before with those done 3 days before, as this project aims to do. So the rows where "UltimoPrelievo" (and jointly "PrimoPrelievo") is missing are deleted below.

In [7]:
df = del_rows_missing_val(df)

3.  Casting of data: the import step reads every value as a string. This is a problem when the values must be processed as what they really are (integers and dates).
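
A minimal sketch of the casting on made-up values follows. The option `errors="coerce"` is an addition for illustration only: it turns unparsable or empty strings into `NaT` instead of raising an error, which is not needed in the cell below because the rows with missing dates have already been removed.

```python
import pandas as pd

# Hypothetical raw values, all strings as they come out of the CSV reader.
raw = pd.DataFrame({
    "UltimoPrelievo": ["16/03/2023", ""],
    "NumeroPrelievi": ["1", "3"],
})

# Day/month/year strings become timestamps; the empty string becomes NaT.
raw["UltimoPrelievo"] = pd.to_datetime(raw["UltimoPrelievo"], format="%d/%m/%Y", errors="coerce")
raw["NumeroPrelievi"] = raw["NumeroPrelievi"].astype(int)

print(raw.dtypes)
```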

In [8]:
df['NumeroPrelievi'] = df['NumeroPrelievi'].astype(int)
df['DataSomministrazione'] = pd.to_datetime(df['DataSomministrazione'], format='%d/%m/%Y')
df['UltimoPrelievo'] = pd.to_datetime(df['UltimoPrelievo'], format='%d/%m/%Y')
df['PrimoPrelievo'] = pd.to_datetime(df['PrimoPrelievo'], format='%d/%m/%Y')
df['DataSomministrazioneProgrammata'] = pd.to_datetime(df['DataSomministrazioneProgrammata'], format='%d/%m/%Y')
df['DataSomministrazioneProgrammataPrecedente'] = pd.to_datetime(df['DataSomministrazioneProgrammataPrecedente'], format='%d/%m/%Y')
df['DataSomministrazionePrecedente'] = pd.to_datetime(df['DataSomministrazionePrecedente'], format='%d/%m/%Y')

df['GiorniTeoriciDaTerapiaPrecedente'] = df['GiorniTeoriciDaTerapiaPrecedente'].astype(int)
df['GiornoDaTerapiaPrecedente'] = df['GiornoDaTerapiaPrecedente'].astype(int)

4.  Noise: Errors may occur during the data collection phase, so it is important to verify correctness by means of conditions based on the logic of the data. These conditions are listed and applied in the function "data_cleaning".

In [9]:
df = data_cleaning(df)

5.  Creation of the class attribute

The goal of this phase is to create the class attribute that the classifier needs in order to reach the goal of the project.

To do so, the first step is to define the classes.
In this problem the classes are:


*   Ready for the therapy
*   Not ready for the therapy

To distinguish them, 2 conditions have been used:


*   The number of days elapsed since the previous therapy is greater than the scheduled number of days
*   The number of blood tests is greater than 1

The rows that satisfy at least one of these 2 conditions are considered as "not ready for the therapy". In theory both conditions should hold at the same time, so one of them alone would be sufficient, but using both helps to account for undetected noise in the dataset.
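
The labelling rule can be sketched on a toy frame. The values and the column name `label` are made up for illustration; the encoding (0 for "not ready", 1 for "ready") mirrors the one used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical therapy executions.
toy = pd.DataFrame({
    "GiornoDaTerapiaPrecedente": [21, 28, 21, 30],
    "GiorniTeoriciDaTerapiaPrecedente": [21, 21, 21, 21],
    "NumeroPrelievi": [1, 1, 2, 3],
})

delayed = toy["GiornoDaTerapiaPrecedente"] > toy["GiorniTeoriciDaTerapiaPrecedente"]
repeated = toy["NumeroPrelievi"] > 1

# A row is "not ready" if at least one of the two conditions holds.
not_ready = delayed | repeated
toy["label"] = np.where(not_ready, 0, 1)

print(toy)
```

Only the first row, executed on schedule with a single blood test, is labelled as ready.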



In [10]:
th_not_ready = not_ready_therapies(df)

print("There are {} failed blood tests: \n{}\n".format(len(th_not_ready), tabulate(th_not_ready[:10], df.columns.tolist(), tablefmt='grid')))
print("\n\nThere are {} failed blood tests ({} % of the total)\n".format(len(th_not_ready), len(th_not_ready)/len(df)*100))

There are 14275 failed blood tests: 
+----+----------------------------------+----------+-------------+-------------+--------------------+-----------------------------------+--------------------------------------------------+--------------------------+-----------------------------------+------------------------+---------------------------------------------+----------------------------------+-----------------------------+------------------------------------+---------------------+---------------------+------------------+
|    | IDPaziente                       | genere   | classeeta   |   IDTerapia |   CodiceSedeTumore | PrincipiAttivi                    | PrincipiAttiviDescrizione                        |   NumeroSomministrazione | DataSomministrazioneProgrammata   | DataSomministrazione   | DataSomministrazioneProgrammataPrecedente   | DataSomministrazionePrecedente   |   GiornoDaTerapiaPrecedente |   GiorniTeoriciDaTerapiaPrecedente | PrimoPrelievo       | UltimoPrelievo      |   Nume

There is a high incidence of failed blood tests, and it is important to keep this in mind when evaluating the solution.

Given the rows belonging to each class, a new column is created to assign the class.


*   0: therapy executions where the patient has done more than one blood test (not ready)
*   1: therapy executions where the patient has done 1 blood test (ready)



In [11]:
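# Rows of th_not_ready keep their original index, so membership can be checked on the index
# (this avoids comparing every cell, which fails on missing values such as NaT)
new_column_values = np.where(df.index.isin(th_not_ready.index), 0, 1)
df['Single_btest'] = new_column_values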

print(df.columns.tolist())

['IDPaziente', 'genere', 'classeeta', 'IDTerapia', 'CodiceSedeTumore', 'PrincipiAttivi', 'PrincipiAttiviDescrizione', 'NumeroSomministrazione', 'DataSomministrazioneProgrammata', 'DataSomministrazione', 'DataSomministrazioneProgrammataPrecedente', 'DataSomministrazionePrecedente', 'GiornoDaTerapiaPrecedente', 'GiorniTeoriciDaTerapiaPrecedente', 'PrimoPrelievo', 'UltimoPrelievo', 'NumeroPrelievi', 'Single_btest']


To confirm that the column has been created correctly:

In [12]:
#print(tabulate(df, df.columns.tolist(), tablefmt='grid'))
#print("\nPercentage of failures of the first blood test: {} %".format((df['Single_btest']==False).sum() / len(df) * 100))

For the purpose of the project, it was also fundamental to create another column that counts the days elapsed between the last blood test and the therapy execution.
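
As a small illustration on made-up dates, the difference between two datetime columns is a timedelta whose `.dt.days` accessor gives the number of whole days directly. The column name `dist_days` is hypothetical; the cell below obtains the same quantity by going through seconds.

```python
import pandas as pd

# Hypothetical dates: a test the day before, and a Friday test for a Monday therapy.
toy = pd.DataFrame({
    "DataSomministrazione": pd.to_datetime(["17/03/2023", "20/03/2023"], format="%d/%m/%Y"),
    "UltimoPrelievo": pd.to_datetime(["16/03/2023", "17/03/2023"], format="%d/%m/%Y"),
})

toy["dist_days"] = (toy["DataSomministrazione"] - toy["UltimoPrelievo"]).dt.days

print(toy)
```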

In [13]:
diff_days = (df['DataSomministrazione'] - df['UltimoPrelievo']).dt.total_seconds().fillna(-1).astype(int)
df['Dist_bt_th'] = round(diff_days / (3600 * 24))

print(tabulate(df[:10], df.columns.tolist(), tablefmt='grid'))

print("\nPercentage of blood tests executed 1 day before therapy execution: {} % ({})".format((df['Dist_bt_th']==1).sum() / len(df) * 100, (df['Dist_bt_th']==1).sum()))
print("\nPercentage of blood tests executed 3 days before therapy execution: {} % ({})".format((df['Dist_bt_th']==3).sum() / len(df) * 100, (df['Dist_bt_th']==3).sum()))
print("\nPercentage of blood tests executed with more than 3 days before therapy execution: {} % ({})".format((df['Dist_bt_th']>3).sum() / len(df) * 100, (df['Dist_bt_th']>3).sum()))

+----+----------------------------------+----------+-------------+-------------+--------------------+------------------------+-------------------------------------+--------------------------+-----------------------------------+------------------------+---------------------------------------------+----------------------------------+-----------------------------+------------------------------------+---------------------+---------------------+------------------+----------------+--------------+
|    | IDPaziente                       | genere   | classeeta   |   IDTerapia |   CodiceSedeTumore | PrincipiAttivi         | PrincipiAttiviDescrizione           |   NumeroSomministrazione | DataSomministrazioneProgrammata   | DataSomministrazione   | DataSomministrazioneProgrammataPrecedente   | DataSomministrazionePrecedente   |   GiornoDaTerapiaPrecedente |   GiorniTeoriciDaTerapiaPrecedente | PrimoPrelievo       | UltimoPrelievo      |   NumeroPrelievi |   Single_btest |   Dist_bt_th |
|  0 | 6

Looking at the data, among the rows where more than 3 days passed between the last blood test and the therapy execution we can distinguish 2 cases:

1.   Dist_bt_th close to 3 (4 or 5): the blood test was probably done on Friday (for Monday) and the therapy execution was postponed for other reasons (not relevant here)
2.   Dist_bt_th far from 3 (>>3): the data has been collected incorrectly, e.g. the date of the last blood test has not been updated

For these reasons, the rows with Dist_bt_th greater than 3 will be deleted, since they would require additional considerations and assumptions. Depending on the choice made above, only the rows with Dist_bt_th equal to 1, only those equal to 3, or all those up to 3 are kept.

[PART ADDED LATER]

In [14]:
df['Days'] = df['GiorniTeoriciDaTerapiaPrecedente'] - df['Dist_bt_th']

In [15]:
#CODE FOR ADDITIONAL CONSIDERATIONS (NOT TO BE CONSIDERED FOR THIS PROJECT)

"""for index, row in df[:100].iterrows():
  if row['Dist_bt_th']>3:
    print(row)"""

"for index, row in df[:100].iterrows():\n  if row['Dist_bt_th']>3:\n    print(row)"

In [16]:
if choice == '1':
  df = df[df['Dist_bt_th'] == 1]
  print("1 day considered")
elif choice == '2':
  df = df[df['Dist_bt_th'] == 3]
  print("Only 3 days considered")
else:
  df = df[df['Dist_bt_th'] <= 3]
  print("1-3 days considered")

original_df = df.copy(deep=True)

1-3 days considered


6.  Management of the attribute "PrincipiAttivi", which contains arrays: this attribute contains an array of active ingredients, but in this format it is not suitable for the algorithm's operations.


> So, to make it usable, one attribute is created for each possible value and set to True for the rows where that value is present.
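
One possible way to sketch this encoding with pandas alone is shown below. It is not the approach used in the following cells: the toy rows are made up, every cell is first normalised to a list (an assumption, since in this notebook single ingredients are stored as plain strings), and `explode` plus `get_dummies` replace the row-by-row function.

```python
import pandas as pd

# Hypothetical rows: two lists and one single ingredient stored as a plain string.
toy = pd.DataFrame({
    "PrincipiAttivi": [["L01DB03", "L01AA01"], "L01XA03", ["L01XA03", "L01BC02"]],
})

# Normalise every cell to a list so a single code is not iterated character by character.
as_lists = toy["PrincipiAttivi"].apply(lambda v: v if isinstance(v, list) else [v])

# One row per (therapy, ingredient), then one boolean column per ingredient.
exploded = as_lists.explode().str.strip()
flags = pd.get_dummies(exploded).groupby(level=0).max().astype(bool)

toy_encoded = pd.concat([toy, flags], axis=1)
print(toy_encoded)
```

The `str.strip()` call also removes stray spaces around a code, so that two spellings of the same ingredient do not produce two columns.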



In [29]:
#How many active ingredients are there?
cond = 'len(str(x)) > 5'
unique_val, anomal_val, indices = get_unique(df, 'PrincipiAttivi', cond)

print("There are {} active ingredients:\n{} ".format(len(unique_val), unique_val))
print("\nThere are {} anomalous values:\n{} ".format(len(anomal_val), anomal_val))
#print(df.index)
print(indices)
#print("Indices are all inside: ", (all(index in df.index for index in indices)))
wrong_rows = df.loc[list(indices)]
print("\nSome rows of anomalous values are:\n{} ".format(wrong_rows[:10].to_string(index=False)))

There are 86 active ingredients:
{'L01FD01', 'L02AE02', 'L01EA05', 'L01FX15', 'L01BC08', 'L01XA02', 'L01AA03', 'L01FG02', 'L01FX04', 'L01EC03', 'L01FX12', 'L01FC02', 'L01FE02', 'L01CX01', 'L01XX27', 'L01BA03', 'L01FD02', 'L01XC24', 'L01FD03', 'L01FC01', 'L01FF01', 'L01XX50', 'L01BC06', 'L01CD01', 'L01XG01', 'L01EH01', 'L01CE02', 'L01AA01', 'L01EE01', 'L01AA09', 'L01EX07', 'L01DA01', 'L01BA01', 'L01AX03', 'L01XA03', 'L04AX04', 'L01FF03', 'L01FF04', 'L01FA03', 'L01FF07', 'L01CA01', 'L01CA02', 'L01XX45', 'L04AX06', 'L01EF02', 'L01XA01', 'L01EF01', 'L02BA03', 'L01BC05', 'L01XX44 ', 'L01XE31', 'L01XK01', 'L02BG04', 'L01DB07', 'L01FF05', 'L01FX08', 'L01FG01', 'L01DC03', 'L01EF03', 'L03AA13', 'L01BC07', 'L01EE03', 'L01FF02', 'L01CD02', 'L01AX04', 'L01BC01', 'L01AA06', 'L01EX08', 'L01CB01', 'L01EC02', 'L01FE01', 'L01DB01', 'L01XX54', 'L01XX52', 'L01FA01', 'L01FX05', 'L01DC01', 'L01EK01', 'L01BC02', 'L01DB03', 'L02BB03', 'L01CA04', 'L01EG02', 'L01BA04', 'L04AX02', 'L01XX14'} 

There are 24 anom

In [None]:
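# check_value_presence reads the global name unique_values_set, so bind it to the set computed above
unique_values_set = unique_val
df = pd.concat([df, df.apply(check_value_presence, axis=1, result_type='expand')], axis=1)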

N° of active ingredients:  65
{'L01BA01', 'L01BC05', 'L04AX04', 'L01FF03', 'L01XX14', 'L01XG01', 'L01CA01', 'L01XA01', 'L01AA06', 'L02BB03', 'L01CA02', 'L01BC07', 'L01FE01', 'L01DC03', 'L01FD01', 'L01FG01', 'L01XX52', 'L01XC24', 'L01EF03', 'L01EK01', 'L01BC02', 'L01FF01', 'L01XA02', 'L01EF01', 'L01FX15', 'L01EE03', 'L01DB03', 'L01EF02', 'L01BA03', 'L01EC02', 'L01FF04', 'L01FC02', 'L01XK01', 'L01CE02', 'L01BC08', 'L01BC01', 'L01FX05', 'L01FA01', 'L01CD02', 'L01XA03', 'L01DB01', 'L01CA04', 'L01FF05', 'L01FF02', 'L01FE02', 'L04AX02', 'L01EH01', 'L01AX04', 'L01AA01', 'L02BG04', 'L01FD03', 'L01AX03', 'L01AA09', 'L01EC03', 'L01DC01', 'L01CB01', 'L01FG02', 'L02AE02', 'L01EX07', 'L01CD01', 'L01BC06', 'L01EX08', 'L01XX27', 'L01EE01', 'L01BA04'}


7.  Removal of useless attributes: some attributes are not relevant to the ML algorithm and are removed:



In [None]:
# Named cols_to_drop to avoid shadowing the built-in `list`
cols_to_drop = ['IDPaziente', 'IDTerapia', 'PrincipiAttivi', 'PrincipiAttiviDescrizione', 'NumeroSomministrazione',
        'DataSomministrazioneProgrammata', 'DataSomministrazione', 'DataSomministrazioneProgrammataPrecedente',
        'DataSomministrazionePrecedente', 'GiornoDaTerapiaPrecedente',	'GiorniTeoriciDaTerapiaPrecedente',
        'PrimoPrelievo', 'UltimoPrelievo', 'NumeroPrelievi']
df = attrib_removal(df, cols_to_drop)
print(df.columns.tolist())

['genere', 'classeeta', 'CodiceSedeTumore', 'Single_btest', 'Dist_bt_th', 'Days', 'L01BA01', 'L01BC05', 'L04AX04', 'L01FF03', 'L01XX14', 'L01XG01', 'L01CA01', 'L01XA01', 'L01AA06', 'L02BB03', 'L01CA02', 'L01BC07', 'L01FE01', 'L01DC03', 'L01FD01', 'L01FG01', 'L01XX52', 'L01XC24', 'L01EF03', 'L01EK01', 'L01BC02', 'L01FF01', 'L01XA02', 'L01EF01', 'L01FX15', 'L01EE03', 'L01DB03', 'L01EF02', 'L01BA03', 'L01EC02', 'L01FF04', 'L01FC02', 'L01XK01', 'L01CE02', 'L01BC08', 'L01BC01', 'L01FX05', 'L01FA01', 'L01CD02', 'L01XA03', 'L01DB01', 'L01CA04', 'L01FF05', 'L01FF02', 'L01FE02', 'L04AX02', 'L01EH01', 'L01AX04', 'L01AA01', 'L02BG04', 'L01FD03', 'L01AX03', 'L01AA09', 'L01EC03', 'L01DC01', 'L01CB01', 'L01FG02', 'L02AE02', 'L01EX07', 'L01CD01', 'L01BC06', 'L01EX08', 'L01XX27', 'L01EE01', 'L01BA04']


8.  Encoding of strings: the ML algorithms that are going to be applied cannot handle non-numerical data, so strings and booleans must be converted into numbers:
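
As a small illustration on made-up values, `LabelEncoder` assigns integers following the sorted order of the distinct labels, which is why the age classes end up ordered from the youngest to the oldest. The variable names are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical age classes in arbitrary order.
ages = ["51-70", "19-30", "71-90", "31-50"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ages)

# classes_ holds the labels in sorted order; the position of a label is its code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(codes, mapping)
```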

In [None]:
cols = ['genere', 'classeeta']
df, label_enc_map = encoding_attrib(df, cols)

print(label_enc_map)

{'genere': {'F': 0, 'M': 1}, 'classeeta': {'19-30': 0, '31-50': 1, '51-70': 2, '71-90': 3, 'null': 4}}


# Model generation

This section focuses on the concrete creation of the classifier.


The first step of the model creation is to separate the dataset from its class labels, keeping the index as a "primary key" to trace the relation between the 2 dataframes.

After that, both dataframes are divided into:

*   Training set (70%)
*   Test set (20%)
*   Validation set (10%)
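
The cell below splits the rows in their original order. As an alternative sketch (not the method used in this notebook), the same 70/10/20 proportions can be obtained with a shuffled, stratified split using `train_test_split`; the data here is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the features and the class column.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# First carve out the 20% test set, then take 10% of the total (12.5% of the rest) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the proportion of the two classes similar across the three sets, which a purely sequential split does not guarantee.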

In [None]:
print(df.columns.tolist())

['genere', 'classeeta', 'CodiceSedeTumore', 'Single_btest', 'Dist_bt_th', 'Days', 'L01BA01', 'L01BC05', 'L04AX04', 'L01FF03', 'L01XX14', 'L01XG01', 'L01CA01', 'L01XA01', 'L01AA06', 'L02BB03', 'L01CA02', 'L01BC07', 'L01FE01', 'L01DC03', 'L01FD01', 'L01FG01', 'L01XX52', 'L01XC24', 'L01EF03', 'L01EK01', 'L01BC02', 'L01FF01', 'L01XA02', 'L01EF01', 'L01FX15', 'L01EE03', 'L01DB03', 'L01EF02', 'L01BA03', 'L01EC02', 'L01FF04', 'L01FC02', 'L01XK01', 'L01CE02', 'L01BC08', 'L01BC01', 'L01FX05', 'L01FA01', 'L01CD02', 'L01XA03', 'L01DB01', 'L01CA04', 'L01FF05', 'L01FF02', 'L01FE02', 'L04AX02', 'L01EH01', 'L01AX04', 'L01AA01', 'L02BG04', 'L01FD03', 'L01AX03', 'L01AA09', 'L01EC03', 'L01DC01', 'L01CB01', 'L01FG02', 'L02AE02', 'L01EX07', 'L01CD01', 'L01BC06', 'L01EX08', 'L01XX27', 'L01EE01', 'L01BA04']


In [None]:
th_classes = df['Single_btest'].copy()
#th_classes['idx'] = df.index
df = df.drop(['Single_btest'], axis=1)
#df['idx'] = df.index

cutpoint1, cutpoint2 = 0.7, 0.8
trainset = df[:int(cutpoint1*len(df))]
train_cl = th_classes[:int(cutpoint1*len(df))]
valset = df[int(cutpoint1*len(df)):int(cutpoint2*len(df))]
val_cl = th_classes[int(cutpoint1*len(df)):int(cutpoint2*len(df))]
testset = df[int(cutpoint2*len(df)):]
test_cl = th_classes[int(cutpoint2*len(df)):]
print('Total: {} split in Train: {}, Val: {} and Test: {}'.format(len(df), len(trainset), len(valset), len(testset)))

Total: 3376 split in Train: 2363, Val: 337 and Test: 676


Now it's time to create the classifier.

There are many possibilities, and here we are going to try **Decision tree classification** with the following parameters:
*   min_samples_leaf: 7. It appears to be the best minimum size for the leaves of the decision tree
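
A value such as this one can be chosen by comparing validation accuracy over a few candidates. The sketch below shows the idea on synthetic data generated with `make_classification`; the candidate list and the data are assumptions, so the best value it finds says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training and validation sets.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = [1, 3, 5, 7, 10, 20]
scores = {}
for leaf in candidates:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    tree.fit(X_train, y_train)
    # Accuracy on data not seen during training.
    scores[leaf] = tree.score(X_val, y_val)

best_leaf = max(scores, key=scores.get)
print(scores)
print("Best min_samples_leaf:", best_leaf)
```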

In [None]:
clf = DecisionTreeClassifier(min_samples_leaf=7)

#max_depth=10 --> reduces the performance

In [None]:
clf.fit(trainset, train_cl)

It is useful to observe the rules generated by the decision tree:

In [None]:
from sklearn.tree import export_text

tree_rules_text = export_text(clf, feature_names=df.columns.tolist())
print("Decision Tree Rules:\n", tree_rules_text)

Decision Tree Rules:
 |--- Days <= 26.50
|   |--- L01XX52 <= 0.50
|   |   |--- Days <= 7.00
|   |   |   |--- CodiceSedeTumore <= 52.50
|   |   |   |   |--- CodiceSedeTumore <= 42.50
|   |   |   |   |   |--- L01DC03 <= 0.50
|   |   |   |   |   |   |--- L01BC05 <= 0.50
|   |   |   |   |   |   |   |--- L01XA02 <= 0.50
|   |   |   |   |   |   |   |   |--- classeeta <= 1.50
|   |   |   |   |   |   |   |   |   |--- CodiceSedeTumore <= 22.50
|   |   |   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |   |   |   |--- CodiceSedeTumore >  22.50
|   |   |   |   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |   |   |--- classeeta >  1.50
|   |   |   |   |   |   |   |   |   |--- L01FD01 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- L01CD01 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- L01CD01 >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 

Validation of the classifier

In [None]:
val_score = clf.score(valset, val_cl)

print('Validation accuracy: {:.3f}'.format(val_score))

Validation accuracy: 0.843


Test phase:

In [None]:
pred = clf.predict(testset)
#print('Predicted {} samples: {}'.format(len(pred), pred))
#print('GT {} samples: {}'.format(len(test_cl), test_cl))
acc_score = accuracy_score(test_cl, pred)

print('\n Final Accuracy: {:.3f} \n'.format(acc_score))


 Final Accuracy: 0.788 



In [None]:
best_score = 0

if acc_score > best_score:
  best_alg = clf
  best_score =  acc_score
  best_pred = pred

### Further classification

After the decision tree, other classifiers are tried. There are many possibilities, and here we are going to try the following:



1.   Support Vector Machines with the following parameters (considering this [source](https://vitalflux.com/svm-rbf-kernel-parameters-code-sample/#:~:text=The%20gamma%20parameter%20defines%20how,high%20values%20meaning%20'close'.)):
*   kernel = rbf: the dimensionality of the data is low
*   C = 20: it aims to classify all training examples correctly (instead of favouring a simpler decision function)
*   gamma = 0.1: to guarantee sufficient influence of the training samples without leading to over-fitting


---


2.   Random forest with the following parameters
(considering this [source](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html))
*   n_estimators: 200, to improve the performance sufficiently without causing over-fitting
*   min_samples_leaf: 4 has been verified to be the smallest number of samples for a node to be considered a leaf
*   (removed) min_samples_split: 10 has been verified to be the smallest number of samples for a node to be split


---
3.   AdaBoost with the following parameters (considering this [source](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html))
*   n_estimators: 100 is the best trade-off between simplicity and over-fitting
*   learning_rate: 0.5 appears to be the right contribution of each classifier







In [None]:
clf_list = [
    svm.SVC(gamma=0.1, C=20, kernel='rbf'),
    RandomForestClassifier(n_estimators=200, min_samples_leaf=4),
    AdaBoostClassifier(n_estimators=100, learning_rate=0.5)
]

for x in clf_list:
  x.fit(trainset, train_cl)
  print('Classifier: {}'.format(x))

  val_score = x.score(valset, val_cl)
  print('Validation accuracy: {:.3f}'.format(val_score))

  pred = x.predict(testset)
  acc_score = accuracy_score(test_cl, pred)
  print('\nFinal Accuracy: {:.3f} \n'.format(acc_score))

  if acc_score > best_score:
    best_alg = x
    best_score =  acc_score
    best_pred = pred

print("{} is the best classifier with the score of {}".format(best_alg, best_score))

Classifier: SVC(C=20, gamma=0.1)
Validation accuracy: 0.831

Final Accuracy: 0.797 

Classifier: RandomForestClassifier(min_samples_leaf=4, n_estimators=200)
Validation accuracy: 0.843

Final Accuracy: 0.781 

Classifier: AdaBoostClassifier(learning_rate=0.5, n_estimators=100)
Validation accuracy: 0.855

Final Accuracy: 0.783 

SVC(C=20, gamma=0.1) is the best classifier with the score of 0.7973372781065089


### Confusion matrix

The confusion matrix is a simple way to evaluate the accuracy of a classification and to better understand the performance of the trained model by counting how many values have been classified as:
*   True positive (TP)
*   False positive (FP)
*   True negative (TN)
*   False negative (FN)

In [None]:
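# `pred` holds the predictions of the last classifier fitted in the loop above.
# Rows are true classes, columns are predicted classes; class 0 is treated as the "positive" class.
matrix = confusion_matrix(test_cl, pred)
print("TP: {}\nFP: {}\nTN: {}\nFN: {}\n".format(matrix[0][0], matrix[1][0], matrix[1][1], matrix[0][1]))
print(matrix)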

TP: 273
FP: 17
TN: 256
FN: 130

[[273 130]
 [ 17 256]]
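
As a worked example, the usual summary metrics can be derived from the matrix printed above. The sketch below hard-codes those four counts and, following the labelling used in this notebook, treats class 0 as the positive class; the variable names are only illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix above (rows: true class, columns: predicted class).
matrix = np.array([[273, 130],
                   [17, 256]])

# Same convention as the print statement above: class 0 is the "positive" class.
tp, fn = matrix[0]
fp, tn = matrix[1]

accuracy = (tp + tn) / matrix.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("Accuracy: {:.3f}".format(accuracy))
print("Precision (class 0): {:.3f}".format(precision))
print("Recall (class 0): {:.3f}".format(recall))
```

With these counts the accuracy is about 0.783, the precision for class 0 about 0.941 and its recall about 0.677, so most of the errors are class 0 rows predicted as class 1.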


# Conclusion: compare the results

Now it is fundamental to understand "**whether the analysis has brought benefits compared with the initial condition**".

So, we are going to compare the initial percentage of failures with the classification failures of the best classifier that emerged in this work.

In [None]:
original_failed = not_ready_therapies(original_df)
qty_original_correct = len(original_df) - len(original_failed)
ptg_init = qty_original_correct/len(original_df)*100
print("Percentage of correct classified in the initial condition (just after deleting noise, missing values and duplicated values):\n{}%".format(ptg_init))

print("Percentage of correct classification of the best ML algorithm ({}): \n{}%".format(best_alg, best_score*100))

Percentage of correct classified in the initial condition (just after deleting noise, missing values and duplicated values):
44.52014218009479%
Percentage of correct classification of the best ML algorithm (SVC(C=20, gamma=0.1)): 
79.73372781065089%
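
One possible complementary reference point, which is not part of the original analysis, is a baseline that always predicts the most frequent class of the test set. The sketch below derives it from the counts in the confusion matrix reported earlier and compares it with the best accuracy printed above; all names are illustrative.

```python
# True class counts in the test set, taken from the rows of the confusion matrix.
test_counts = {0: 273 + 130, 1: 17 + 256}

# Accuracy of a hypothetical classifier that always answers with the majority class.
majority_baseline = max(test_counts.values()) / sum(test_counts.values())

# Best test accuracy reported above (SVC).
best_accuracy = 0.7973372781065089
gain = best_accuracy - majority_baseline

print("Majority-class baseline: {:.3f}".format(majority_baseline))
print("Gain of the best classifier over the baseline: {:.3f}".format(gain))
```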
