# Introduction

**In this _data processing_ notebook, five CSV datasets are merged into one file and the target classes are reduced to three.**

The original dataset is from: https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease. I selected the datasets that contain hypothyroid, hyperthyroid and negative classes, converted them from _.data_ to _.csv_, and stored them in the _Dataset_ folder.

Each file uses different classes, according to its documentation:
- _allhyper-train.csv_ and _allhyper-test.csv_: hyperthyroid, T3 toxic, goitre, secondary toxic and negative
- _allhypo-train.csv_ and _allhypo-test.csv_: hypothyroid, primary hypothyroid, compensated hypothyroid, secondary hypothyroid and negative
- _ann-train.csv_ and _ann-test.csv_: normal (not hypothyroid), hyperfunction and subnormal functioning
- _hypothyroid.csv_: hypothyroid and negative
- _thyroid0387.csv_: A, B, C, D - hyperthyroid conditions; E, F, G, H - hypothyroid conditions; I, J - binding protein; K - general health; L, M, N - replacement therapy; R - discordant results

**We reduce these to three classes: hypothyroid, hyperthyroid and negative.**

In [688]:
import numpy as np
import pandas as pd

# 1. Data Integration

In [689]:
"""Read dataset"""
allhyper_train = pd.read_csv('Dataset/allhyper-train.csv')
allhyper_test = pd.read_csv('Dataset/allhyper-test.csv')

allhypo_train = pd.read_csv('Dataset/allhypo-train.csv')
allhypo_test = pd.read_csv('Dataset/allhypo-test.csv')

ann_train = pd.read_csv('Dataset/ann-train.csv')
ann_test = pd.read_csv('Dataset/ann-test.csv')

hypothyroid = pd.read_csv('Dataset/hypothyroid.csv')
# sick_euthyroid = pd.read_csv('Dataset/sick-euthyroid.csv')
thyroid0387 = pd.read_csv('Dataset/thyroid0387.csv')
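
The raw files mark missing values with `?`. As an optional alternative to replacing them after the merge, `pd.read_csv` can convert them while loading through its `na_values` parameter. The sketch below uses a small in-memory sample (invented for illustration) instead of the real files:

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same style as the raw files.
sample_csv = io.StringIO(
    "age,sex,TSH,Target\n"
    "41,F,1.3,negative\n"
    "23,F,?,negative\n"
    "?,M,0.98,negative\n"
)

# na_values="?" turns every "?" cell into NaN at load time,
# so numeric columns are parsed as numbers instead of strings.
sample = pd.read_csv(sample_csv, na_values="?")

print(sample.dtypes)
print(sample.isna().sum())
```

With this option the numeric columns arrive as floats, so no later type conversion is needed for them.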

In [690]:
allhyper_train.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Target,ID
0,41,F,f,f,f,f,f,f,f,f,...,125,t,1.14,t,109,f,?,SVHC,negative,3733
1,23,F,f,f,f,f,f,f,f,f,...,102,f,?,f,?,f,?,other,negative,1442
2,46,M,f,f,f,f,f,f,f,f,...,109,t,0.91,t,120,f,?,other,negative,2965
3,70,F,t,f,f,f,f,f,f,f,...,175,f,?,f,?,f,?,other,negative,806
4,70,F,f,f,f,f,f,f,f,f,...,61,t,0.87,t,70,f,?,SVI,negative,2807


In [691]:
allhypo_train.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Target,ID
0,41,F,f,f,f,f,f,f,f,f,...,125,t,1.14,t,109,f,?,SVHC,negative,3733
1,23,F,f,f,f,f,f,f,f,f,...,102,f,?,f,?,f,?,other,negative,1442
2,46,M,f,f,f,f,f,f,f,f,...,109,t,0.91,t,120,f,?,other,negative,2965
3,70,F,t,f,f,f,f,f,f,f,...,175,f,?,f,?,f,?,other,negative,806
4,70,F,f,f,f,f,f,f,f,f,...,61,t,0.87,t,70,f,?,SVI,negative,2807


In [692]:
ann_train.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,goitre,tumor,hypopituitary,psych,TSH,T3,TT4,T4U,FTI,Target
0,0.73,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0.0006,0.015,0.12,0.082,0.146,3
1,0.24,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0.00025,0.03,0.143,0.133,0.108,3
2,0.47,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0.0019,0.024,0.102,0.131,0.078,3
3,0.64,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0.0009,0.017,0.077,0.09,0.085,3
4,0.23,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0.00025,0.026,0.139,0.09,0.153,3


In [693]:
hypothyroid.head()

Unnamed: 0.1,Unnamed: 0,Age,Sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,...,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
0,hypothyroid,72,M,f,f,f,f,f,f,f,...,y,0.6,y,15,y,1.48,y,10,n,?
1,hypothyroid,15,F,t,f,f,f,f,f,f,...,y,1.7,y,19,y,1.13,y,17,n,?
2,hypothyroid,24,M,f,f,f,f,f,f,f,...,y,0.2,y,4,y,1.0,y,0,n,?
3,hypothyroid,24,F,f,f,f,f,f,f,f,...,y,0.4,y,6,y,1.04,y,6,n,?
4,hypothyroid,77,M,f,f,f,f,f,f,f,...,y,1.2,y,57,y,1.28,y,44,n,?


In [694]:
thyroid0387.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Target,ID
0,29,F,f,f,f,f,f,f,f,t,...,?,f,?,f,?,f,?,other,-,840801013
1,29,F,f,f,f,f,f,f,f,f,...,128,f,?,f,?,f,?,other,-,840801014
2,41,F,f,f,f,f,f,f,f,f,...,?,f,?,f,?,t,11,other,-,840801042
3,36,F,f,f,f,f,f,f,f,f,...,?,f,?,f,?,t,26,other,-,840803046
4,32,F,f,f,f,f,f,f,f,f,...,?,f,?,f,?,t,36,other,S,840803047


In [695]:
allhypo_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 972 entries, 0 to 971
Data columns (total 31 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        972 non-null    int64 
 1   sex                        972 non-null    object
 2   on_thyroxine               972 non-null    object
 3   query_on_thyroxine         972 non-null    object
 4   on_antithyroid_medication  972 non-null    object
 5   sick                       972 non-null    object
 6   pregnant                   972 non-null    object
 7   thyroid_surgery            972 non-null    object
 8   I131_treatment             972 non-null    object
 9   query_hypothyroid          972 non-null    object
 10  query_hyperthyroid         972 non-null    object
 11  lithium                    972 non-null    object
 12  goitre                     972 non-null    object
 13  tumor                      972 non-null    object
 14  hypopituit

In [696]:
hypothyroid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3163 entries, 0 to 3162
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Unnamed: 0                 3163 non-null   object
 1   Age                        3163 non-null   object
 2   Sex                        3163 non-null   object
 3   on_thyroxine               3163 non-null   object
 4   query_on_thyroxine         3163 non-null   object
 5   on_antithyroid_medication  3163 non-null   object
 6   thyroid_surgery            3163 non-null   object
 7   query_hypothyroid          3163 non-null   object
 8   query_hyperthyroid         3163 non-null   object
 9   pregnant                   3163 non-null   object
 10  sick                       3163 non-null   object
 11  tumor                      3163 non-null   object
 12  lithium                    3163 non-null   object
 13  goitre                     3163 non-null   object
 14  TSH_meas

In [697]:
thyroid0387.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9172 entries, 0 to 9171
Data columns (total 31 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        9172 non-null   int64 
 1   sex                        9172 non-null   object
 2   on_thyroxine               9172 non-null   object
 3   query_on_thyroxine         9172 non-null   object
 4   on_antithyroid_medication  9172 non-null   object
 5   sick                       9172 non-null   object
 6   pregnant                   9172 non-null   object
 7   thyroid_surgery            9172 non-null   object
 8   I131_treatment             9172 non-null   object
 9   query_hypothyroid          9172 non-null   object
 10  query_hyperthyroid         9172 non-null   object
 11  lithium                    9172 non-null   object
 12  goitre                     9172 non-null   object
 13  tumor                      9172 non-null   object
 14  hypopitu

## Replace the classes - Target

In [698]:
# allhyper: collapse all hyperthyroid subtypes into a single class
hyper_classes = ["hyperthyroid", "T3_toxic", "goitre", "secondary_toxic"]
allhyper_train["Target"] = allhyper_train["Target"].replace(hyper_classes, "hyperthyroid")
allhyper_test["Target"] = allhyper_test["Target"].replace(hyper_classes, "hyperthyroid")

# allhypo: collapse all hypothyroid subtypes into a single class
hypo_classes = ["hypothyroid", "primary_hypothyroid", "compensated_hypothyroid", "secondary_hypothyroid"]
allhypo_train["Target"] = allhypo_train["Target"].replace(hypo_classes, "hypothyroid")
allhypo_test["Target"] = allhypo_test["Target"].replace(hypo_classes, "hypothyroid")

In [699]:
# hypothyroid: the first column already holds the class label, so only rename it
hypothyroid.rename(columns={hypothyroid.columns[0]:"Target"}, inplace=True)

In [700]:
# thyroid0387: map the letter codes to the three target classes
thyroid0387["Target"] = thyroid0387["Target"].replace(['A','B','C','D'], "hyperthyroid")
thyroid0387["Target"] = thyroid0387["Target"].replace(['E','F','G','H'], "hypothyroid")
thyroid0387["Target"] = thyroid0387["Target"].replace('-', "negative")

# Drop rows with any other class, such as I,J - binding protein, K - general health, L,M,N - replacement therapy and R - discordant results
thyroid0387 = thyroid0387.loc[thyroid0387['Target'].isin(["hyperthyroid","hypothyroid","negative"])]
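
The relabelling above can also be written as a single lookup table, which keeps the class definitions in one place. This is a minimal sketch on made-up labels; `CLASS_MAP` and `collapse_target` are hypothetical names that do not appear in the notebook:

```python
import pandas as pd

# Hypothetical lookup table built from the class lists used above.
CLASS_MAP = {
    **dict.fromkeys(
        ["hyperthyroid", "T3_toxic", "goitre", "secondary_toxic", "A", "B", "C", "D"],
        "hyperthyroid",
    ),
    **dict.fromkeys(
        ["hypothyroid", "primary_hypothyroid", "compensated_hypothyroid",
         "secondary_hypothyroid", "E", "F", "G", "H"],
        "hypothyroid",
    ),
    **dict.fromkeys(["negative", "-"], "negative"),
}


def collapse_target(df: pd.DataFrame) -> pd.DataFrame:
    """Map Target to the three classes and drop rows with any other label."""
    out = df.copy()
    out["Target"] = out["Target"].map(CLASS_MAP)  # unknown labels become NaN
    return out.dropna(subset=["Target"]).reset_index(drop=True)


toy = pd.DataFrame({"Target": ["T3_toxic", "-", "K", "F", "negative"]})
collapsed = collapse_target(toy)
print(collapsed["Target"].tolist())
```

Unlike the cell above, this sketch resets the index after dropping rows; whether that is wanted depends on how the frame is used later.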

### ANN

In [701]:
print(ann_train.shape)
print(ann_test.shape)

(3772, 22)
(3428, 22)


In [702]:
# merge ann datasets together
ann_data = pd.concat([ann_train, ann_test], ignore_index=True)
ann_data.shape

(7200, 22)

In [703]:
ann_data_target = pd.Series(ann_data[ann_data.columns[-1]].values)

print("ann_data targets:")
print(ann_data_target.value_counts())

ann_data targets:
3    6666
2     368
1     166
dtype: int64


According to the description of the ann files:

> The problem is to determine whether a patient referred to the clinic is hypothyroid. Therefore three classes are built: normal (not hypothyroid), hyperfunction and subnormal functioning. Because 92 percent of the patients are not hyperthyroid a good classifier must be significant better than 92%.

Class 3 covers 6666 of the 7200 rows (about 92.6%), which matches the share of normal patients in the description. We therefore take **'3' as 'negative', '2' as 'hypothyroid' and '1' as 'hyperthyroid'**.
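
The share of each class can be computed directly from the counts printed above, which makes the comparison with the documented 92% explicit. A small sketch that rebuilds the counts by hand instead of reading the files:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({3: 6666, 2: 368, 1: 166})

# Share of each class in percent, rounded to two decimals.
share = (counts / counts.sum() * 100).round(2)
print(share)
```

On the real data the same numbers should come from `ann_data_target.value_counts(normalize=True)`.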

In [704]:
# Replace class in ann
ann_data.Target = ann_data.Target.map({3:'negative', 2:'hypothyroid', 1:'hyperthyroid'})

In [705]:
ann_data.Target

0           negative
1           negative
2           negative
3           negative
4           negative
            ...     
7195        negative
7196    hyperthyroid
7197        negative
7198        negative
7199        negative
Name: Target, Length: 7200, dtype: object

### Sex in ann

In the ann files, sex is encoded as 0 and 1. We need to work out which value is female and which is male, then replace them with F and M.

In [706]:
thyroid0387.groupby(['sex'])[['ID']].count()

Unnamed: 0_level_0,ID
sex,Unnamed: 1_level_1
?,250
F,4900
M,2396


In [707]:
ann_data.groupby(['sex'])[['age']].count()

Unnamed: 0_level_0,age
sex,Unnamed: 1_level_1
0,5009
1,2191


In thyroid0387, female patients (4900) clearly outnumber male patients (2396). In ann, value 0 (5009 rows) is likewise about twice as frequent as value 1 (2191 rows). Assuming the ann files have a similar sex distribution, we take **0 as female and 1 as male**.
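
The frequency argument can be made explicit by ranking both encodings by how often they occur and pairing them up. This is only a heuristic sketch: it assumes both sources share roughly the same sex distribution, and the counts are copied by hand from the two outputs above.

```python
import pandas as pd

# Counts copied from the two groupby outputs above.
ref = pd.Series({"F": 4900, "M": 2396})  # thyroid0387, ignoring '?'
ann = pd.Series({0: 5009, 1: 2191})      # ann, numeric codes

ref_share = ref / ref.sum()
ann_share = ann / ann.sum()

# Pair each ann code with the reference label of the same frequency rank.
ref_ranked = ref_share.sort_values(ascending=False).index
ann_ranked = ann_share.sort_values(ascending=False).index
code_to_label = dict(zip(ann_ranked, ref_ranked))

print(ref_share.round(3).to_dict())
print(ann_share.round(3).to_dict())
print(code_to_label)
```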

In [708]:
ann_data.sex = ann_data.sex.map({0:'F', 1:'M'})

In [709]:
# Check the results
ann_data.sex

0       F
1       F
2       F
3       M
4       F
       ..
7195    F
7196    F
7197    F
7198    M
7199    F
Name: sex, Length: 7200, dtype: object

### Other variables represented by 0 and 1 in ann

These are handled the same way as sex. In ann the flags are 0 and 1, while in the other files they are f and t, so we convert them:

- 0 means f

- 1 means t

In [710]:
ann_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7200 entries, 0 to 7199
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        7200 non-null   float64
 1   sex                        7200 non-null   object 
 2   on_thyroxine               7200 non-null   int64  
 3   query_on_thyroxine         7200 non-null   int64  
 4   on_antithyroid_medication  7200 non-null   int64  
 5   sick                       7200 non-null   int64  
 6   pregnant                   7200 non-null   int64  
 7   thyroid_surgery            7200 non-null   int64  
 8   I131_treatment             7200 non-null   int64  
 9   query_hypothyroid          7200 non-null   int64  
 10  query_hyperthyroid         7200 non-null   int64  
 11  lithium                    7200 non-null   int64  
 12  goitre                     7200 non-null   int64  
 13  tumor                      7200 non-null   int64

In [711]:
ann_data.head().T

Unnamed: 0,0,1,2,3,4
age,0.73,0.24,0.47,0.64,0.23
sex,F,F,F,M,F
on_thyroxine,1,0,0,0,0
query_on_thyroxine,0,0,0,0,0
on_antithyroid_medication,0,0,0,0,0
sick,0,0,0,0,0
pregnant,0,0,0,0,0
thyroid_surgery,0,0,0,0,0
I131_treatment,1,0,0,0,0
query_hypothyroid,0,0,0,0,0


In [712]:
handling_cols = ann_data.columns[2:16]
print(handling_cols)

Index(['on_thyroxine', 'query_on_thyroxine', 'on_antithyroid_medication',
       'sick', 'pregnant', 'thyroid_surgery', 'I131_treatment',
       'query_hypothyroid', 'query_hyperthyroid', 'lithium', 'goitre', 'tumor',
       'hypopituitary', 'psych'],
      dtype='object')


In [713]:
# replace
for col in handling_cols:
    ann_data[col] = ann_data[col].map({0:'f', 1:'t'})
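
The same recoding can be done without an explicit loop. The sketch below applies one `replace` call to several columns of a made-up frame; the column names are borrowed from the ann file, the values are invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "on_thyroxine": [1, 0, 0],
    "sick": [0, 0, 1],
    "TSH": [0.0006, 0.00025, 0.0019],  # continuous column, left untouched
})

flag_cols = ["on_thyroxine", "sick"]

# Recode 0/1 to 'f'/'t' for all flag columns in one call.
toy[flag_cols] = toy[flag_cols].replace({0: "f", 1: "t"})
print(toy)
```

Restricting the call to `flag_cols` matters: applying it to the whole frame would also rewrite any continuous value that happens to equal 0 or 1.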

In [714]:
ann_data.head().T

Unnamed: 0,0,1,2,3,4
age,0.73,0.24,0.47,0.64,0.23
sex,F,F,F,M,F
on_thyroxine,t,f,f,f,f
query_on_thyroxine,f,f,f,f,f
on_antithyroid_medication,f,f,f,f,f
sick,f,f,f,f,f
pregnant,f,f,f,f,f
thyroid_surgery,f,f,f,f,f
I131_treatment,t,f,f,f,f
query_hypothyroid,f,f,f,f,f


### Continuous values in ann

The age column (0.73 rather than 73, for example) suggests that the continuous values in ann were divided by 100, so we multiply every continuous column by 100.

In [715]:
continuous_columns = ['age','TSH','T3','TT4','T4U','FTI']
for col in continuous_columns:
    ann_data[col] = ann_data[col] * 100

## Merge all the datasets

In [716]:
allhyper_train.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Target,ID
0,41,F,f,f,f,f,f,f,f,f,...,125,t,1.14,t,109,f,?,SVHC,negative,3733
1,23,F,f,f,f,f,f,f,f,f,...,102,f,?,f,?,f,?,other,negative,1442
2,46,M,f,f,f,f,f,f,f,f,...,109,t,0.91,t,120,f,?,other,negative,2965
3,70,F,t,f,f,f,f,f,f,f,...,175,f,?,f,?,f,?,other,negative,806
4,70,F,f,f,f,f,f,f,f,f,...,61,t,0.87,t,70,f,?,SVI,negative,2807


In [717]:
allhypo_train.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Target,ID
0,41,F,f,f,f,f,f,f,f,f,...,125,t,1.14,t,109,f,?,SVHC,negative,3733
1,23,F,f,f,f,f,f,f,f,f,...,102,f,?,f,?,f,?,other,negative,1442
2,46,M,f,f,f,f,f,f,f,f,...,109,t,0.91,t,120,f,?,other,negative,2965
3,70,F,t,f,f,f,f,f,f,f,...,175,f,?,f,?,f,?,other,negative,806
4,70,F,f,f,f,f,f,f,f,f,...,61,t,0.87,t,70,f,?,SVI,negative,2807


In [718]:
print(allhyper_train.shape)
print(allhyper_test.shape)
print(allhypo_train.shape)
print(allhypo_test.shape)

(2800, 31)
(972, 31)
(2800, 31)
(972, 31)


In [719]:
# merge the allhypo and allhyper series
allhy_data = pd.concat([allhyper_train, allhyper_test, allhypo_train, allhypo_test], ignore_index=True)
allhy_data.shape

(7544, 31)
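
The previews of allhyper_train and allhypo_train above list the same IDs, so it may be worth checking how many IDs repeat after concatenation. A sketch on made-up rows (the frames and values here are invented stand-ins, not the real data):

```python
import pandas as pd

# Made-up stand-ins for two files that share patient IDs.
hyper = pd.DataFrame({
    "ID": [3733, 1442, 2965],
    "Target": ["negative", "hyperthyroid", "negative"],
})
hypo = pd.DataFrame({
    "ID": [3733, 1442, 806],
    "Target": ["negative", "negative", "hypothyroid"],
})

merged = pd.concat([hyper, hypo], ignore_index=True)

# Number of rows whose ID already appeared earlier in the merged frame.
n_repeated = int(merged.duplicated(subset="ID").sum())
print(n_repeated)
```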

In [720]:
allhy_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7544 entries, 0 to 7543
Data columns (total 31 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        7544 non-null   object
 1   sex                        7544 non-null   object
 2   on_thyroxine               7544 non-null   object
 3   query_on_thyroxine         7544 non-null   object
 4   on_antithyroid_medication  7544 non-null   object
 5   sick                       7544 non-null   object
 6   pregnant                   7544 non-null   object
 7   thyroid_surgery            7544 non-null   object
 8   I131_treatment             7544 non-null   object
 9   query_hypothyroid          7544 non-null   object
 10  query_hyperthyroid         7544 non-null   object
 11  lithium                    7544 non-null   object
 12  goitre                     7544 non-null   object
 13  tumor                      7544 non-null   object
 14  hypopitu

### Merge hypothyroid and thyroid0387

In [721]:
hypothyroid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3163 entries, 0 to 3162
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Target                     3163 non-null   object
 1   Age                        3163 non-null   object
 2   Sex                        3163 non-null   object
 3   on_thyroxine               3163 non-null   object
 4   query_on_thyroxine         3163 non-null   object
 5   on_antithyroid_medication  3163 non-null   object
 6   thyroid_surgery            3163 non-null   object
 7   query_hypothyroid          3163 non-null   object
 8   query_hyperthyroid         3163 non-null   object
 9   pregnant                   3163 non-null   object
 10  sick                       3163 non-null   object
 11  tumor                      3163 non-null   object
 12  lithium                    3163 non-null   object
 13  goitre                     3163 non-null   object
 14  TSH_meas

In [722]:
thyroid0387.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7546 entries, 0 to 9171
Data columns (total 31 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        7546 non-null   int64 
 1   sex                        7546 non-null   object
 2   on_thyroxine               7546 non-null   object
 3   query_on_thyroxine         7546 non-null   object
 4   on_antithyroid_medication  7546 non-null   object
 5   sick                       7546 non-null   object
 6   pregnant                   7546 non-null   object
 7   thyroid_surgery            7546 non-null   object
 8   I131_treatment             7546 non-null   object
 9   query_hypothyroid          7546 non-null   object
 10  query_hyperthyroid         7546 non-null   object
 11  lithium                    7546 non-null   object
 12  goitre                     7546 non-null   object
 13  tumor                      7546 non-null   object
 14  hypopitu

The hypothyroid dataset has only 26 columns, so we need to drop the extra columns from the allhy and thyroid0387 datasets before merging.
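
Which columns have to go can be worked out from the column sets instead of by eye. A sketch on abbreviated, hypothetical column lists (the real frames have 31 and 26 columns):

```python
import pandas as pd

# Abbreviated stand-ins for allhy_data and hypothyroid.
wide = pd.DataFrame(columns=["age", "sex", "TSH", "referral_source", "Target", "ID"])
narrow = pd.DataFrame(columns=["Target", "age", "sex", "TSH"])

# Columns present in the wide frame but missing from the narrow one.
extra_cols = wide.columns.difference(narrow.columns).tolist()
print(extra_cols)

# Dropping them leaves both frames with the same set of columns.
aligned = wide.drop(columns=extra_cols)
```

The comparison is case-sensitive, so on the real data the `Age` and `Sex` columns of hypothyroid would need to be renamed to lower case first.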

In [723]:
thyroid0387.shape

(7546, 31)

In [724]:
allhy_0387_data = pd.concat([allhy_data, thyroid0387], ignore_index=True)

In [725]:
allhy_0387_data.shape

(15090, 31)

In [726]:
# Drop the columns that hypothyroid does not have
allhy_0387_data.drop(['ID','referral_source','psych','hypopituitary','I131_treatment'],axis=1,inplace=True)

In [727]:
allhy_0387_data.shape

(15090, 26)

In [728]:
# Rename the columns
hypothyroid.rename(columns={'Sex':'sex', 'Age':'age'}, inplace=True)

In [729]:
# merge hypothyroid
new_data1 = pd.concat([allhy_0387_data, hypothyroid], ignore_index=True)
new_data1.shape

(18253, 26)

In [730]:
new_data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18253 entries, 0 to 18252
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        18253 non-null  object
 1   sex                        18253 non-null  object
 2   on_thyroxine               18253 non-null  object
 3   query_on_thyroxine         18253 non-null  object
 4   on_antithyroid_medication  18253 non-null  object
 5   sick                       18253 non-null  object
 6   pregnant                   18253 non-null  object
 7   thyroid_surgery            18253 non-null  object
 8   query_hypothyroid          18253 non-null  object
 9   query_hyperthyroid         18253 non-null  object
 10  lithium                    18253 non-null  object
 11  goitre                     18253 non-null  object
 12  tumor                      18253 non-null  object
 13  TSH_measured               18253 non-null  object
 14  TSH   

### Merge with ann dataset

To merge with ann, both frames must share the same columns, so we drop the ones that are not common to both. The 'measured' columns (TSH_measured, T3_measured and so on) only record whether a test was performed. When the flag is 'f' (or 'n' in hypothyroid), the corresponding hormone value is missing, so the flag adds nothing beyond the value column itself and the value columns alone are enough.
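
Whether a 'measured' flag really is redundant can be checked before dropping it. The sketch below does this for one column on invented rows; on the real data the same comparison would be repeated for each flag and value pair:

```python
import pandas as pd

# Made-up rows in the style of the raw data: '?' marks a missing value.
toy = pd.DataFrame({
    "TSH_measured": ["t", "f", "t", "f"],
    "TSH": ["1.3", "?", "0.98", "?"],
})

value_missing = toy["TSH"].eq("?")
flag_says_missing = toy["TSH_measured"].isin(["f", "n"])

# True when the flag and the missing-value pattern agree on every row.
flag_is_redundant = bool((value_missing == flag_says_missing).all())
print(flag_is_redundant)
```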

In [731]:
# Drop useless columns
new_data1.drop(['TSH_measured','T3_measured','TT4_measured','T4U_measured','FTI_measured','TBG_measured','TBG'],axis=1,inplace=True)

ann_data.drop(['I131_treatment','hypopituitary','psych'],axis=1,inplace=True)

In [732]:
print(new_data1.shape)
print(ann_data.shape)

(18253, 19)
(7200, 19)


In [733]:
# Merge all the datasets
new_data = pd.concat([new_data1, ann_data], ignore_index=True)
new_data.shape

(25453, 19)

In [734]:
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25453 entries, 0 to 25452
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        25453 non-null  object
 1   sex                        25453 non-null  object
 2   on_thyroxine               25453 non-null  object
 3   query_on_thyroxine         25453 non-null  object
 4   on_antithyroid_medication  25453 non-null  object
 5   sick                       25453 non-null  object
 6   pregnant                   25453 non-null  object
 7   thyroid_surgery            25453 non-null  object
 8   query_hypothyroid          25453 non-null  object
 9   query_hyperthyroid         25453 non-null  object
 10  lithium                    25453 non-null  object
 11  goitre                     25453 non-null  object
 12  tumor                      25453 non-null  object
 13  TSH                        25453 non-null  object
 14  T3    

In [735]:
new_data

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,TSH,T3,TT4,T4U,FTI,Target
0,41,F,f,f,f,f,f,f,f,f,f,f,f,1.3,2.5,125,1.14,109,negative
1,23,F,f,f,f,f,f,f,f,f,f,f,f,4.1,2,102,?,?,negative
2,46,M,f,f,f,f,f,f,f,f,f,f,f,0.98,?,109,0.91,120,negative
3,70,F,t,f,f,f,f,f,f,f,f,f,f,0.16,1.9,175,?,?,negative
4,70,F,f,f,f,f,f,f,f,f,f,f,f,0.72,1.2,61,0.87,70,negative
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25448,59,F,f,f,f,f,f,f,f,f,f,f,f,0.25,2.08,7.9,9.9,8,negative
25449,51,F,f,f,f,f,f,f,f,f,f,f,f,10.6,0.6,0.5,8.9,0.55,hyperthyroid
25450,51,F,f,f,f,f,f,f,f,f,f,f,f,0.076,2.01,9,6.7,13.4,negative
25451,35,M,f,f,f,f,f,f,f,f,f,f,f,0.28,2.01,9,8.9,10.1,negative


### _new_data_ is the final clean dataset, which combines all the raw data and has 25453 entries and 19 columns.


In [736]:
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25453 entries, 0 to 25452
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        25453 non-null  object
 1   sex                        25453 non-null  object
 2   on_thyroxine               25453 non-null  object
 3   query_on_thyroxine         25453 non-null  object
 4   on_antithyroid_medication  25453 non-null  object
 5   sick                       25453 non-null  object
 6   pregnant                   25453 non-null  object
 7   thyroid_surgery            25453 non-null  object
 8   query_hypothyroid          25453 non-null  object
 9   query_hyperthyroid         25453 non-null  object
 10  lithium                    25453 non-null  object
 11  goitre                     25453 non-null  object
 12  tumor                      25453 non-null  object
 13  TSH                        25453 non-null  object
 14  T3    

In [737]:
# Replace '?' with NaN
new_data = new_data.replace({"?": np.nan})
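
After the `?` markers become NaN, the numeric columns are still stored as `object`, as the `info()` output shows. One possible follow-up, which is not part of the original notebook, is to convert them with `pd.to_numeric`. A sketch on invented values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": ["41", "?", 23],
    "TSH": ["1.3", "4.1", "?"],
    "sex": ["F", "?", "M"],
})

toy = toy.replace({"?": np.nan})

numeric_cols = ["age", "TSH"]
# errors="coerce" turns anything that still fails to parse into NaN.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```

Only the columns listed in `numeric_cols` are converted; categorical columns such as `sex` keep their string values.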

In [738]:
# export this dataset
new_data.to_csv('Dataset_edited/new_data.csv', index=False, header=True)

In [739]:
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25453 entries, 0 to 25452
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        25005 non-null  object
 1   sex                        24830 non-null  object
 2   on_thyroxine               25453 non-null  object
 3   query_on_thyroxine         25453 non-null  object
 4   on_antithyroid_medication  25453 non-null  object
 5   sick                       25453 non-null  object
 6   pregnant                   25453 non-null  object
 7   thyroid_surgery            25453 non-null  object
 8   query_hypothyroid          25453 non-null  object
 9   query_hyperthyroid         25453 non-null  object
 10  lithium                    25453 non-null  object
 11  goitre                     25453 non-null  object
 12  tumor                      25453 non-null  object
 13  TSH                        23525 non-null  object
 14  T3    