This notebook builds a classification pipeline to predict the type of thyroid condition (hyperthyroid, hypothyroid, or negative) from the given training data.

I downloaded the datasets from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease

*   allhyper.data
*   allhyper.test
*   allhypo.data
*   allhypo.test
*   ann-test.data
*   ann-train.data
*   hypothyroid.data
*   sick-euthyroid.data
*   thyroid0387.data

**CONTENTS**

1. Merging Data

2. Data Preprocessing

3. Training of the Classifiers

4. Model Selection and Saving

In [6]:
# Import the required libraries (Python 3)

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from imblearn.over_sampling import SMOTE
from sklearn.utils import resample
from scipy import stats
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier


import os

**1: Data Integration**

This step merges the different datasets into a single one.

A: Importing 'allhyper' and 'allhypo' datasets

In [7]:
allHyperTEST = pd.read_csv("allhyperTEST.CSV")
allHyperDATA = pd.read_csv("allhyperDATA.CSV")
allHypoTEST = pd.read_csv("allhypoTEST.csv")
allHypoDATA = pd.read_csv("allhypoDATA.CSV")

display(allHypoTEST.head(10))


Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Target,ID
0,35,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,f,?,f,?,f,?,f,?,other,negative,219
1,63,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,3.5,t,2.5,t,108,t,0.96,t,113,f,?,SVI,negative,2059
2,25,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.6,t,2.4,t,61,t,0.82,t,75,f,?,SVHD,negative,399
3,53,F,f,f,f,f,f,f,f,t,f,f,f,f,f,f,t,0.25,t,2.1,t,145,t,1.03,t,141,f,?,other,negative,1911
4,92,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.7,t,1.3,t,120,t,0.84,t,143,f,?,SVI,negative,487
5,67,M,f,f,f,f,f,f,f,t,f,f,f,f,f,f,t,0.81,f,?,t,84,t,0.83,t,101,f,?,other,negative,1234
6,60,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.2,t,2.6,t,117,t,1.31,t,90,f,?,other,negative,1113
7,60,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,27,t,1.8,t,65,t,0.99,t,66,f,?,SVI,compensated_hypothyroid,1344
8,48,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.8,f,?,t,112,t,0.92,t,121,f,?,other,negative,2758
9,27,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.6,t,2.2,t,94,t,0.89,t,106,f,?,SVI,negative,3230


Check for duplicate instances, i.e. rows that share the same value of the 'ID' attribute.

In [8]:
allHyperDATA.duplicated('ID')
allHyperTEST.duplicated('ID')
allHypoDATA.duplicated('ID')
allHypoTEST.duplicated('ID')


0      False
1      False
2      False
3      False
4      False
       ...  
967    False
968    False
969    False
970    False
971    False
Length: 972, dtype: bool
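
The cell above only displays the result for the last frame, and a row-by-row boolean Series is hard to scan. A minimal sketch of a more compact check follows; the two small frames are hypothetical stand-ins for the loaded datasets, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the loaded frames
frames = {
    "allHyperDATA": pd.DataFrame({"ID": [1, 2, 3]}),
    "allHypoTEST": pd.DataFrame({"ID": [7, 8, 8]}),
}

# Count the rows whose 'ID' repeats an earlier row, one number per frame
duplicate_counts = {name: int(df.duplicated("ID").sum()) for name, df in frames.items()}
print(duplicate_counts)
```

A count of zero for every frame would confirm that no 'ID' value is repeated.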

Drop the 'ID' attribute, as it is not helpful for the classification.

In [9]:
del allHyperTEST["ID"]
del allHyperDATA["ID"]
del allHypoTEST["ID"]
del allHypoDATA["ID"]

Keep only the non-negative instances of the allhyper and allhypo datasets and relabel their 'Target' values as 'hyperthyroid' or 'hypothyroid', respectively.


In [10]:
def notCorrect_TargetFilter(df,correct_Target,target):
    df = df[df.Target.isin(correct_Target)]
    df.replace(correct_Target,target,inplace = True)
    return df
    
allHyperTEST = notCorrect_TargetFilter(allHyperTEST,["hyperthyroid","T3_toxic","goitre","secondary_toxic"],"hyperthyroid")
allHyperDATA = notCorrect_TargetFilter(allHyperDATA,["hyperthyroid","T3_toxic","goitre","secondary_toxic"],"hyperthyroid")
allHypoTEST = notCorrect_TargetFilter(allHypoTEST,["hypothyroid", "primary_hypothyroid", "compensated_hypothyroid", "secondary_hypothyroid"],"hypothyroid")
allHypoDATA = notCorrect_TargetFilter(allHypoDATA,["hypothyroid", "primary_hypothyroid", "compensated_hypothyroid", "secondary_hypothyroid"],"hypothyroid")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,
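
The warning above is raised because `replace(..., inplace=True)` is applied to a filtered slice of the original frame. The sketch below shows one way to avoid it, by working on an explicit copy and relabelling only the 'Target' column. The function name `filter_and_relabel` and the sample frame are illustrative and not part of the notebook.

```python
import pandas as pd

def filter_and_relabel(df, accepted_targets, new_label):
    # Keep only rows whose Target is in the accepted list, on an explicit copy
    out = df[df["Target"].isin(accepted_targets)].copy()
    # Relabel the Target column only, leaving every other column untouched
    out["Target"] = new_label
    return out

# Hypothetical sample frame
sample = pd.DataFrame({"age": [35, 63, 60],
                       "Target": ["negative", "T3_toxic", "goitre"]})
relabelled = filter_and_relabel(
    sample, ["hyperthyroid", "T3_toxic", "goitre", "secondary_toxic"], "hyperthyroid")
print(relabelled)
```

Restricting the change to 'Target' also guards against accidentally rewriting a matching string in another column.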


Merging the two allhyper and the two allhypo datasets

In [11]:
allDataset = pd.concat([allHyperTEST,allHyperDATA,allHypoTEST,allHypoDATA], ignore_index = True)
display(allDataset.shape)

(393, 30)

B: Now importing the 'thyroid0387' dataset

In [12]:
thyroid0387 = pd.read_csv("thyroid0387.CSV")
display(thyroid0387.head(10))


Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Target,ID
0,29,F,f,f,f,f,f,f,f,t,f,f,f,f,f,f,t,0.3,f,?,f,?,f,?,f,?,f,?,other,-,840801013
1,29,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.6,t,1.9,t,128,f,?,f,?,f,?,other,-,840801014
2,41,F,f,f,f,f,f,f,f,f,t,f,f,f,f,f,f,?,f,?,f,?,f,?,f,?,t,11,other,-,840801042
3,36,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,f,?,f,?,f,?,t,26,other,-,840803046
4,32,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,f,?,f,?,f,?,t,36,other,S,840803047
5,60,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,f,?,f,?,f,?,t,26,other,-,840803048
6,77,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,f,?,f,?,f,?,t,21,other,-,840803068
7,28,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.7,t,2.6,t,116,f,?,f,?,f,?,SVI,-,840807019
8,28,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.2,t,1.8,t,76,f,?,f,?,f,?,other,-,840808060
9,28,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.9,t,1.7,t,83,f,?,f,?,f,?,other,-,840808073


Remove the 'ID' attribute, since it is not useful for the classification.

In [13]:
del thyroid0387["ID"]

Convert the categorical 'sex' values into numerical ones by mapping.

In [14]:
thyroid0387['sex'] = thyroid0387['sex'].map({'F': 1, 'M': 0})


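This dataset uses letter codes for its classes: A, B, C, D, E, F, G, H.

* A, B, C, D are replaced with 'hyperthyroid'
* E, F, G, H are replaced with 'hypothyroid'
* all other values are treated as 'negative'

Because `replace` acts on every column of the frame, the 'sex' column had to be mapped to numbers beforehand; otherwise its 'F' values would also become 'hypothyroid'.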

In [15]:
thyroid0387.replace(['A','B','C','D'],"hyperthyroid",inplace = True)
thyroid0387.replace(['E','F','G','H'],"hypothyroid",inplace = True)

for value in set(thyroid0387['Target']):
    if(value != 'hypothyroid' and value != 'hyperthyroid'):
        thyroid0387.replace(value,'negative',inplace=True)
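
`DataFrame.replace` without a column selection touches every column, so a class code that also appears elsewhere in the frame would be rewritten there too. A sketch of a column-scoped alternative is shown below; the sample frame is hypothetical, and the mapping assumes exact single-letter codes as described above.

```python
import pandas as pd

# Hypothetical sample with single-letter class codes and other values
sample = pd.DataFrame({"Target": ["A", "-", "F", "S", "H", "K"]})

label_map = {code: "hyperthyroid" for code in "ABCD"}
label_map.update({code: "hypothyroid" for code in "EFGH"})

# Codes missing from the map become NaN, which fillna turns into 'negative'
sample["Target"] = sample["Target"].map(label_map).fillna("negative")
print(sample["Target"].tolist())
```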

C: Now importing the 'hypothyroid' dataset

In [16]:
hypothyroid = pd.read_csv("hypothyroid.csv")
display(hypothyroid.shape)
display(hypothyroid.head(10))

(3163, 26)

Unnamed: 0.1,Unnamed: 0,Age,Sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,sick,tumor,lithium,goitre,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
0,hypothyroid,72,M,f,f,f,f,f,f,f,f,f,f,f,y,30.0,y,0.6,y,15.0,y,1.48,y,10.0,n,?
1,hypothyroid,15,F,t,f,f,f,f,f,f,f,f,f,f,y,145.0,y,1.7,y,19.0,y,1.13,y,17.0,n,?
2,hypothyroid,24,M,f,f,f,f,f,f,f,f,f,f,f,y,0.0,y,0.2,y,4.0,y,1.0,y,0.0,n,?
3,hypothyroid,24,F,f,f,f,f,f,f,f,f,f,f,f,y,430.0,y,0.4,y,6.0,y,1.04,y,6.0,n,?
4,hypothyroid,77,M,f,f,f,f,f,f,f,f,f,f,f,y,7.3,y,1.2,y,57.0,y,1.28,y,44.0,n,?
5,hypothyroid,85,F,f,f,f,f,t,f,f,f,f,f,f,y,138.0,y,1.1,y,27.0,y,1.19,y,23.0,n,?
6,hypothyroid,64,F,f,f,f,t,f,f,f,f,f,f,f,y,7.7,y,1.3,y,54.0,y,0.86,y,63.0,n,?
7,hypothyroid,72,F,f,f,f,f,f,f,f,f,f,f,f,y,21.0,y,1.9,y,34.0,y,1.05,y,32.0,n,?
8,hypothyroid,20,F,f,f,f,f,t,f,f,f,f,f,f,y,92.0,n,?,y,39.0,y,1.21,y,32.0,n,?
9,hypothyroid,42,F,f,f,f,f,f,f,f,f,f,f,f,y,48.0,n,?,y,7.6,y,1.02,y,7.5,n,?


The 'Unnamed: 0' attribute holds the target of the data, so rename it to 'Target' (and rename 'Age' and 'Sex' to lowercase to match the other datasets).

Then keep only the instances of the 'hypothyroid' class.

In [17]:
hypothyroid = hypothyroid.rename(columns={hypothyroid.columns[0]:"Target",hypothyroid.columns[1]:"age",hypothyroid.columns[2]:"sex" })
hypothyroid = hypothyroid[hypothyroid.Target.isin(['hypothyroid'])]
display(hypothyroid.shape)

(151, 26)

C: Now importing the 'sick-euthyroid' dataset

In [18]:
sick_euthyroid = pd.read_csv("sick-euthyroid.CSV")
display(sick_euthyroid.shape)
display(sick_euthyroid.head(10))


(3163, 26)

Unnamed: 0,Target,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,sick,tumor,lithium,goitre,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
0,sick-euthyroid,72,M,f,f,f,f,f,f,f,f,f,f,f,n,?,y,1.0,y,83,y,0.95,y,87,n,?
1,sick-euthyroid,45,F,f,f,f,f,f,f,f,f,f,f,f,y,1.90,y,1.0,y,82,y,0.73,y,112,n,?
2,sick-euthyroid,64,F,f,f,f,f,f,f,f,t,f,f,f,y,0.09,y,1.0,y,101,y,0.82,y,123,n,?
3,sick-euthyroid,56,M,f,f,f,f,f,f,f,f,f,f,f,y,0,y,0.8,y,76,y,0.77,y,99,n,?
4,sick-euthyroid,78,F,t,f,f,f,t,f,f,f,f,f,f,y,2.60,y,0.3,y,87,y,0.95,y,91,n,?
5,sick-euthyroid,80,M,f,f,f,f,f,f,f,f,f,f,f,y,1.40,y,0.8,y,105,y,0.88,y,120,n,?
6,sick-euthyroid,74,F,f,f,f,f,f,f,f,f,f,f,f,y,0,y,0.7,y,98,y,0.81,y,121,n,?
7,sick-euthyroid,?,F,f,f,f,f,f,f,f,f,f,f,f,y,1.40,y,1.1,y,121,y,1.11,y,109,n,?
8,sick-euthyroid,42,F,f,f,f,f,f,f,f,f,f,f,f,y,2.30,y,1.1,y,93,y,0.73,y,127,n,?
9,sick-euthyroid,89,M,f,f,f,f,f,f,f,f,f,f,f,y,0.80,y,0.8,y,111,y,0.68,y,165,n,?


Keep only the 'negative' instances of this dataset; the 'sick-euthyroid' instances are discarded.

In [19]:
sick_euthyroid = sick_euthyroid[sick_euthyroid.Target.isin(['negative'])]
display(sick_euthyroid.shape)

(2870, 26)

Note: the 'hypothyroid' and 'sick-euthyroid' datasets do not have the 'I131_treatment', 'hypopituitary', 'psych' and 'referral_source' attributes.
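
When frames with different column sets are concatenated, pandas fills the columns that a frame lacks with NaN. The small example below illustrates this with two hypothetical frames, the second of which has no 'psych' column.

```python
import pandas as pd

# Two small hypothetical frames: the second lacks the 'psych' column
full = pd.DataFrame({"age": [35], "psych": ["f"], "Target": ["negative"]})
partial = pd.DataFrame({"age": [72], "Target": ["hypothyroid"]})

merged = pd.concat([full, partial], ignore_index=True)
print(merged)
print(merged.isna().sum())
```

The missing attributes of these two datasets will therefore show up as NaN values once all datasets are merged.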

D: Now importing the 'ann-test' and 'ann-train' datasets

In [20]:
ann_train = pd.read_csv("ann-train.CSV")
ann_test = pd.read_csv("ann-test.CSV")
display(ann_test.head(10))


Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH,T3,TT4,T4U,FTI,Target
0,0.29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0061,0.028,0.111,0.131,0.085,2
1,0.32,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0013,0.019,0.084,0.078,0.107,3
2,0.35,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.031,0.239,0.1,0.239,3
3,0.21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.001,0.018,0.087,0.088,0.099,3
4,0.22,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0.0004,0.022,0.134,0.135,0.099,3
5,0.22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0016,0.02,0.123,0.113,0.109,3
6,0.39,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0.0016,0.036,0.133,0.144,0.093,3
7,0.77,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00081,0.02,0.08,0.096,0.08316,3
8,0.23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00025,0.014,0.113,0.096,0.11746,3
9,0.23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0026,0.011,0.104,0.104,0.099,3


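The 'Target' column contains the classes 3, 2 and 1, and the 'ann' datasets have no '_measured' attributes, so these will have to be derived from the corresponding value columns. First, check the class distribution.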


In [21]:
target1 = pd.Series(ann_test[ann_test.columns[-1]].values)
display(target1.value_counts())
target2 = pd.Series(ann_train[ann_train.columns[-1]].values)
display(target2.value_counts())

3    3178
2     177
1      73
dtype: int64

3    3488
2     191
1      93
dtype: int64

Looking at the distribution of the values for the 'Target' attribute, we can infer that:
* 3 is referring to the 'negative' class
* 2 is referring to the 'hypothyroid' class
* 1 is referring to the 'hyperthyroid' class

Now we should analyze the distribution of the 'sex' attribute in the other datasets to understand how to treat it in the 'ann' datasets.




In [22]:
print("Sex thyroid0387 1=F,0=M:")
sex_series1 = pd.Series(thyroid0387[thyroid0387.columns[1]].values)
display(sex_series1.value_counts())
print("Sick-euthyroid:")
sex_series2 = pd.Series(sick_euthyroid[sick_euthyroid.columns[2]].values)
display(sex_series2.value_counts())

Sex thyroid0387 1=F,0=M:


1.0    6073
0.0    2792
dtype: int64

Sick-euthyroid:


F    2003
M     800
?      67
dtype: int64

There are more female than male patients in the datasets above. Now consider the 'ann' datasets to work out the coding of their 'sex' attribute.

In [23]:
sex1 = pd.Series(ann_test[ann_test.columns[1]].values)
display(sex1.value_counts())
sex2 = pd.Series(ann_train[ann_train.columns[1]].values)
display(sex2.value_counts())

0    2380
1    1048
dtype: int64

0    2629
1    1143
dtype: int64

Since female patients are the majority in the other datasets, we can assume that in the 'ann' datasets:

'0' refers to female

'1' refers to male
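
The assumption can be checked roughly by comparing class shares. The sketch below uses the counts copied from the outputs above (thyroid0387 and ann_test); it only shows that the shares are similar and does not prove the coding.

```python
import pandas as pd

# Counts copied from the outputs above
known = pd.Series({"F": 6073, "M": 2792})   # thyroid0387, known coding
ann_codes = pd.Series({0: 2380, 1: 1048})   # ann_test, unknown coding

female_share = known["F"] / known.sum()
code0_share = ann_codes[0] / ann_codes.sum()
print(round(female_share, 3), round(code0_share, 3))
```

Both shares are close to 0.69, which is consistent with code 0 standing for female.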

Concatenate the ann_train and ann_test datasets and map the 'sex' and 'Target' attributes.

Another important step is to multiply all the continuous numerical attributes by 100 and to add the 'measured' attributes.


In [24]:
ann = pd.concat([ann_train,ann_test], ignore_index = True)
ann['sex'] = ann['sex'].map({0:'F',1:'M'})
ann['Target'] = ann['Target'].map({3:'negative',2:'hypothyroid',1:'hyperthyroid'})

continuos_attributes = ['age','TSH','T3','TT4','T4U','FTI']
for attribute in continuos_attributes:
    ann[attribute] = ann[attribute] * 100

def fillNewAttributes(row,attribute):
    if row[attribute] > 0:
        return 'y'
    else:
        return 'n'

ann['TSH_measured'] = ann.apply(lambda row: fillNewAttributes(row,'TSH'), axis=1)
ann['T3_measured'] = ann.apply(lambda row: fillNewAttributes(row,'T3'), axis=1)
ann['TT4_measured'] = ann.apply(lambda row: fillNewAttributes(row,'TT4'), axis=1)
ann['T4U_measured'] = ann.apply(lambda row: fillNewAttributes(row,'T4U'), axis=1)
ann['FTI_measured'] = ann.apply(lambda row: fillNewAttributes(row,'FTI'), axis=1)
display(ann.dtypes)

age                          float64
sex                           object
on_thyroxine                   int64
query_on_thyroxine             int64
on_antithyroid_medication      int64
sick                           int64
pregnant                       int64
thyroid_surgery                int64
I131_treatment                 int64
query_hypothyroid              int64
query_hyperthyroid             int64
lithium                        int64
goitre                         int64
tumor                          int64
hypopituitary                  int64
psych                          int64
TSH                          float64
T3                           float64
TT4                          float64
T4U                          float64
FTI                          float64
Target                        object
TSH_measured                  object
T3_measured                   object
TT4_measured                  object
T4U_measured                  object
FTI_measured                  object
dtype: object
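
The row-wise `apply` calls above can also be written as a vectorized comparison. The following is a sketch on a hypothetical sample frame, using the same rule as the notebook (a positive value counts as measured).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of rescaled lab values
sample = pd.DataFrame({"TSH": [0.61, 0.0, 0.13], "T3": [2.8, 1.9, 0.0]})

for attribute in ["TSH", "T3"]:
    # 'y' where a positive value is present, 'n' otherwise
    sample[attribute + "_measured"] = np.where(sample[attribute] > 0, "y", "n")
print(sample)
```

As in the original cell, a genuine reading of exactly zero would be flagged as not measured under this rule.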

E: Merge all the datasets into one dataset

In [25]:
data = pd.concat([allDataset,thyroid0387,hypothyroid,sick_euthyroid,ann], ignore_index = True)
display(data.shape)
display(data.dtypes)

(19786, 30)

age                          object
sex                          object
on_thyroxine                 object
query_on_thyroxine           object
on_antithyroid_medication    object
sick                         object
pregnant                     object
thyroid_surgery              object
I131_treatment               object
query_hypothyroid            object
query_hyperthyroid           object
lithium                      object
goitre                       object
tumor                        object
hypopituitary                object
psych                        object
TSH_measured                 object
TSH                          object
T3_measured                  object
T3                           object
TT4_measured                 object
TT4                          object
T4U_measured                 object
T4U                          object
FTI_measured                 object
FTI                          object
TBG_measured                 object
TBG                         

**2: Data preprocessing**

Observing the set of possible values for each attribute.

In [26]:
for column in data.columns:
    listOfValues=set(data[column])
    print(column,": ",listOfValues)

age :  {'60', 1, 2, 3, 4, 5, 6, 7, 8, 9, '23', 11, 12, 13, 14, '45', 15, 16, 18, 19, 20, 21, '58', '62', 24, 25, 26, 27, 28, '50', '48', 31, 32, 33, 34, 35, 36, 37, 38, 39, '37', 40, 42, 43, '74', 45, '75', '51', '36', 41, 50, 51, 46, 53, 54, '29', 56, 55, 58, 59, '54', 60, 62, 63, 61, 65, 64, 67, 68, 69, 70, 71, 66, 73, 74, 75, 76, 77, '87', 79, 80, 78, 82, 83, 81, 85, 86, 17, 88, 84, '59', 90, 92, 89, 87, 93, 91, 97, 94, 95, 7.000000000000001, '2', '72', '18', '66', '90', 22, '77', '9', 23, '10', '8', '4', '57', 29, 30, '20', '22', '15', '80', '26', 14.000000000000002, '82', '92', 44, '83', '81', 47, 48, '76', 49, '14', '5', '67', 10, 51.5, 52, 52.190000000000005, '21', '65', '30', 55.00000000000001, '33', '44', 56.99999999999999, 56.00000000000001, '52', '79', '85', 57, '17', 57.99999999999999, '47', '63', '53', '19', '84', '56', '41', '31', '28', '35', '68', '46', '24', '49', '42', '61', '55', 72, '13', '40', '98', '88', '39', '71', '86', '6', 28.000000000000004, 28.999999999999996

Replace '?' with NaN.

In [27]:
data = data.replace({"?": np.nan})
data.isna().sum()


age                            409
sex                            394
on_thyroxine                     0
query_on_thyroxine               0
on_antithyroid_medication        0
sick                             0
pregnant                         0
thyroid_surgery                  0
I131_treatment                3021
query_hypothyroid                0
query_hyperthyroid               0
lithium                          0
goitre                           0
tumor                            0
hypopituitary                 3021
psych                         3021
TSH_measured                     0
TSH                           1321
T3_measured                      0
T3                            3372
TT4_measured                     0
TT4                            696
T4U_measured                     0
T4U                           1083
FTI_measured                     0
FTI                           1075
TBG_measured                  7200
TBG                          19174
referral_source     

The 'TBG', 'referral_source' and 'TBG_measured' attributes have too many NaN values, so we drop them, along with the 'sex' attribute.

In [28]:
del data['TBG']
del data['referral_source']
del data['TBG_measured']
del data['sex']

After dropping these columns, 26 attributes remain and a row can contain at most nine NaN values. Rows with fewer than 20 non-NaN values (that is, more than six NaN values) are dropped, because they carry too little information for the classification.

In [29]:
data.dropna(axis = 0, thresh = 20, inplace = True)
data.isna().sum()

age                           380
on_thyroxine                    0
query_on_thyroxine              0
on_antithyroid_medication       0
sick                            0
pregnant                        0
thyroid_surgery                 0
I131_treatment               2773
query_hypothyroid               0
query_hyperthyroid              0
lithium                         0
goitre                          0
tumor                           0
hypopituitary                2773
psych                        2773
TSH_measured                    0
TSH                          1074
T3_measured                     0
T3                           3125
TT4_measured                    0
TT4                           448
T4U_measured                    0
T4U                           835
FTI_measured                    0
FTI                           828
Target                          0
dtype: int64
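
The `thresh` argument of `dropna` is the minimum number of non-NaN values a row needs in order to be kept, not the number of NaN values allowed. A small illustration on a hypothetical four-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 4 columns and a varying number of NaN per row
sample = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 4.0, 4.0],
})

# thresh=2 keeps rows with at least 2 non-NaN values
kept = sample.dropna(axis=0, thresh=2)
print(kept.index.tolist())
```

Here the last row has only one non-NaN value and is removed, while the row with exactly two is kept.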

Convert the categorical values into numerical values, because for the classification it is important that the dataset has only numerical attributes.

In [30]:
data = data.replace({"t":1,"f":0, "y":1, "n":0, "hypothyroid":2, "negative":0,"hyperthyroid":1, "F":1, "M":0})
display(data.dtypes)

age                           object
on_thyroxine                   int64
query_on_thyroxine             int64
on_antithyroid_medication      int64
sick                           int64
pregnant                       int64
thyroid_surgery                int64
I131_treatment               float64
query_hypothyroid              int64
query_hyperthyroid             int64
lithium                        int64
goitre                         int64
tumor                          int64
hypopituitary                float64
psych                        float64
TSH_measured                   int64
TSH                           object
T3_measured                    int64
T3                            object
TT4_measured                   int64
TT4                           object
T4U_measured                   int64
T4U                           object
FTI_measured                   int64
FTI                           object
Target                         int64
dtype: object

Convert the object columns to numeric.

In [31]:
cols = data.columns[data.dtypes.eq('object')]
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce')
display(data.dtypes)

age                          float64
on_thyroxine                   int64
query_on_thyroxine             int64
on_antithyroid_medication      int64
sick                           int64
pregnant                       int64
thyroid_surgery                int64
I131_treatment               float64
query_hypothyroid              int64
query_hyperthyroid             int64
lithium                        int64
goitre                         int64
tumor                          int64
hypopituitary                float64
psych                        float64
TSH_measured                   int64
TSH                          float64
T3_measured                    int64
T3                           float64
TT4_measured                   int64
TT4                          float64
T4U_measured                   int64
T4U                          float64
FTI_measured                   int64
FTI                          float64
Target                         int64
dtype: object
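
With `errors='coerce'`, `pd.to_numeric` parses numeric strings and silently turns anything it cannot parse into NaN, so the NaN count can grow at this step. A short sketch on a hypothetical mixed column, similar to the 'age' values listed earlier:

```python
import pandas as pd

# Hypothetical object column mixing strings, numbers and an unparseable token
mixed = pd.Series(["60", 23, 7.000000000000001, "unknown"], dtype=object)
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.tolist())
print(numeric.dtype)
```

Comparing `isna().sum()` before and after the conversion is a quick way to see whether any values were lost.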

In [32]:
data.to_csv('final_dataset.csv')

**3: Training of the Classifiers**

First, we have to find the attributes that are most correlated with the target.

In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19538 entries, 0 to 19785
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        19158 non-null  float64
 1   on_thyroxine               19538 non-null  int64  
 2   query_on_thyroxine         19538 non-null  int64  
 3   on_antithyroid_medication  19538 non-null  int64  
 4   sick                       19538 non-null  int64  
 5   pregnant                   19538 non-null  int64  
 6   thyroid_surgery            19538 non-null  int64  
 7   I131_treatment             16765 non-null  float64
 8   query_hypothyroid          19538 non-null  int64  
 9   query_hyperthyroid         19538 non-null  int64  
 10  lithium                    19538 non-null  int64  
 11  goitre                     19538 non-null  int64  
 12  tumor                      19538 non-null  int64  
 13  hypopituitary              16765 non-null  flo

In [34]:
corr_values = data.corr()['Target'].abs()
corr_values = corr_values.drop('Target')
corr_values = corr_values[corr_values > 0.04]
display(corr_values)


on_thyroxine         0.089933
query_hypothyroid    0.072636
psych                0.043933
TSH_measured         0.057784
TSH                  0.359483
T3                   0.100027
TT4                  0.061453
T4U                  0.055530
FTI                  0.060555
Name: Target, dtype: float64

Now, divide the dataset into two sets: a training set (70%) and a testing set (30%).

In [35]:
corr_values.index

Index(['on_thyroxine', 'query_hypothyroid', 'psych', 'TSH_measured', 'TSH',
       'T3', 'TT4', 'T4U', 'FTI'],
      dtype='object')

In [38]:
def holdout(dataframe):
  x = dataframe[['age','on_thyroxine', 'query_hypothyroid', 'query_hyperthyroid', 'psych',
       'TSH', 'FTI']]
  y = dataframe['Target']
  X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42) 
  return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = holdout(data)

In [39]:
data1 = data.interpolate(method = 'spline', order = 3)
display(data1.isna().sum())

age                          0
on_thyroxine                 0
query_on_thyroxine           0
on_antithyroid_medication    0
sick                         0
pregnant                     0
thyroid_surgery              0
I131_treatment               0
query_hypothyroid            0
query_hyperthyroid           0
lithium                      0
goitre                       0
tumor                        0
hypopituitary                0
psych                        0
TSH_measured                 0
TSH                          0
T3_measured                  0
T3                           0
TT4_measured                 0
TT4                          0
T4U_measured                 0
T4U                          0
FTI_measured                 0
FTI                          0
Target                       0
dtype: int64

In [46]:
classifiers1 = {
    "XGBClassifier" : XGBClassifier(learning_rate=0.01),
    "Nearest Neighbors" : KNeighborsClassifier(4),
    "Decision Tree" : DecisionTreeClassifier(class_weight = 'balanced'),
    "Random Forest": RandomForestClassifier(class_weight = 'balanced',random_state = 1),
}

In [47]:
def classification(classifiers, X_train, X_test, y_train, y_test):
    # Collect one row of metrics per classifier, then build the results DataFrame
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        # Macro averaging gives each of the three classes the same weight
        pr, rc, fs, sup = metrics.precision_recall_fscore_support(y_test, y_pred, average='macro')
        rows.append({"Classifier": name, "Accuracy": round(metrics.accuracy_score(y_test, y_pred), 4),
                     "Precision": round(pr, 4), "Recall": round(rc, 4), "FScore": round(fs, 4)})
        print("Confusion matrix for: ", name)
        display(confusion_matrix(y_test, y_pred))
    res = pd.DataFrame(rows, columns=["Classifier", "Accuracy", "Precision", "Recall", "FScore"])
    res.set_index("FScore", inplace=True)
    res.sort_values(by="FScore", ascending=False, inplace=True)
    return res

In [48]:
corr_values = data1.corr()['Target'].abs()
corr_values = corr_values.drop('Target')
corr_values = corr_values[corr_values > 0.04]
display(corr_values)

X_train1, X_test1, y_train1, y_test1 = holdout(data1)

display(classification(classifiers1,X_train1, X_test1, y_train1, y_test1))

on_thyroxine         0.089933
query_hypothyroid    0.072636
psych                0.041390
TSH_measured         0.057784
TSH                  0.321522
T3                   0.092222
TT4                  0.064251
T4U                  0.053964
FTI                  0.059411
Name: Target, dtype: float64

Confusion matrix for:  XGBClassifier


array([[5167,   11,  101],
       [  89,   52,    9],
       [ 123,    1,  309]])

Confusion matrix for:  Nearest Neighbors


array([[5176,    8,   95],
       [ 100,   47,    3],
       [ 293,    3,  137]])

Confusion matrix for:  Decision Tree


array([[5159,   42,   78],
       [  44,  105,    1],
       [  63,    2,  368]])

Confusion matrix for:  Random Forest


array([[5170,   26,   83],
       [  46,  102,    2],
       [  34,    0,  399]])

Unnamed: 0_level_0,Classifier,Accuracy,Precision,Recall
FScore,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.862,Random Forest,0.9674,0.8687,0.8603
0.8391,Decision Tree,0.9608,0.8359,0.8424
0.727,XGBClassifier,0.943,0.8369,0.6797
0.6055,Nearest Neighbors,0.9144,0.7743,0.5367
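
The macro-averaged scores in the table can be recomputed directly from a confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below does this for the Random Forest matrix shown above, using the class order 0 = negative, 1 = hyperthyroid, 2 = hypothyroid from the earlier mapping.

```python
import numpy as np

# Random Forest confusion matrix from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[5170, 26, 83],
               [46, 102, 2],
               [34, 0, 399]])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # per true class
precision = true_positives / cm.sum(axis=0)   # per predicted class
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_positives.sum() / cm.sum()
print(round(accuracy, 4), round(precision.mean(), 4),
      round(recall.mean(), 4), round(f1.mean(), 4))
print(recall.round(4))
```

The per-class recall shows that the hyperthyroid class (102 of 150 correct) is the weakest, which the high overall accuracy hides because the negative class dominates the test set.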


In [49]:
model=RandomForestClassifier(class_weight = 'balanced',random_state = 1)
model.fit(X_train1,y_train1)

RandomForestClassifier(class_weight='balanced', random_state=1)

In [50]:
fet=[]
fet.append(int(40))
fet.append(int(0))
fet.append(int(0))
fet.append(int(0))
fet.append(float(0))
fet.append(float(0.003))
fet.append(float(0))
model.predict([np.array(fet)])

  "X does not have valid feature names, but"


array([0])
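
The feature-name warning above appears because the model was fitted on a DataFrame but asked to predict on a plain array. One way to avoid it is to pass a one-row DataFrame whose columns match the features selected in `holdout()`. The sketch below only builds that input, with the same values as the cell above.

```python
import pandas as pd

# Column order must match the features selected in holdout()
feature_names = ['age', 'on_thyroxine', 'query_hypothyroid', 'query_hyperthyroid',
                 'psych', 'TSH', 'FTI']
values = [40, 0, 0, 0, 0.0, 0.003, 0.0]

sample = pd.DataFrame([values], columns=feature_names)
print(sample)
# With the fitted model from the previous cell: model.predict(sample)
```

Naming the columns also makes it harder to pass the values in the wrong order.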

In [51]:
import pickle
# Use the public pickle module and close the file once the model is written
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)