# Using Decision Trees and SVM to determine whether a patient's symptoms indicate Hypothyroidism

## Defining the Question

### a) Problem Statement

Can we determine whether a patient suffers from hypothyroidism from the symptoms they exhibit?

### b) Success Metrics
Identify a model that achieves over 90% prediction accuracy on held-out test data.


### c) Understanding the context 

A hospital running a clinical camp as a community service effort wishes to reduce the time spent diagnosing patients with hypothyroidism. The hospital therefore focuses on **building a model that helps classify a case as hypothyroid from the symptoms recorded for a patient**.

### d) Recording the Experimental Design

* Read and explore the given dataset.
* Assess whether the available data is appropriate for answering the question.
* Find and deal with outliers, anomalies, and missing data within the dataset.
* Perform univariate, bivariate, and multivariate analysis, recording your observations.
* Build a classification model using the following decision tree techniques:
    1. Random forest
    2. AdaBoosted trees
    3. Gradient boosted trees
* Build a classification model using SVM with the following kernel functions:
    1. Polynomial
    2. Linear
    3. RBF
* Tune the parameters.
* Challenge your solution by providing insights into how the model could be improved.
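
The "Tune the parameters" step above could be sketched as a cross-validated grid search. This is a minimal illustration, not the notebook's own tuning code: the data comes from `make_classification` as a synthetic stand-in for the encoded hypothyroid features, and the grid values and the names `X_demo`, `y_demo`, and `search` are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): swap in the encoded hypothyroid features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Illustrative grid (assumption): deliberately small so the search finishes quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# 3-fold cross-validation over every combination in the grid, scored by accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_te, y_te))
```

The same pattern applies to the boosted trees and the SVM by changing the estimator and the keys of `param_grid`.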

### e) Data Relevance

The data provided contains information on the most common symptoms observed for hypothyroidism.



## Reading the Data and Loading Dependencies

In [15]:
# DEPENDENCIES

# Standard libraries
import pandas as pd
import numpy as np

# ML libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
# Loading data
data = pd.read_csv('/content/hypothyroid.csv')

## Checking the Data

In [None]:
# Number of records (rows) and columns in our dataset
data.shape

(3163, 26)

In [None]:
# Previewing the data
data.head()

Unnamed: 0,status,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,sick,tumor,lithium,goitre,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
0,hypothyroid,72,M,f,f,f,f,f,f,f,f,f,f,f,y,30.0,y,0.6,y,15,y,1.48,y,10,n,?
1,hypothyroid,15,F,t,f,f,f,f,f,f,f,f,f,f,y,145.0,y,1.7,y,19,y,1.13,y,17,n,?
2,hypothyroid,24,M,f,f,f,f,f,f,f,f,f,f,f,y,0.0,y,0.2,y,4,y,1.0,y,0,n,?
3,hypothyroid,24,F,f,f,f,f,f,f,f,f,f,f,f,y,430.0,y,0.4,y,6,y,1.04,y,6,n,?
4,hypothyroid,77,M,f,f,f,f,f,f,f,f,f,f,f,y,7.3,y,1.2,y,57,y,1.28,y,44,n,?


In [None]:
# Checking whether each column has an appropriate datatype
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3163 entries, 0 to 3162
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   status                     3163 non-null   object
 1   age                        3163 non-null   object
 2   sex                        3163 non-null   object
 3   on_thyroxine               3163 non-null   object
 4   query_on_thyroxine         3163 non-null   object
 5   on_antithyroid_medication  3163 non-null   object
 6   thyroid_surgery            3163 non-null   object
 7   query_hypothyroid          3163 non-null   object
 8   query_hyperthyroid         3163 non-null   object
 9   pregnant                   3163 non-null   object
 10  sick                       3163 non-null   object
 11  tumor                      3163 non-null   object
 12  lithium                    3163 non-null   object
 13  goitre                     3163 non-null   object
 14  TSH_meas

In [None]:
data.columns

Index(['status', 'age', 'sex', 'on_thyroxine', 'query_on_thyroxine',
       'on_antithyroid_medication', 'thyroid_surgery', 'query_hypothyroid',
       'query_hyperthyroid', 'pregnant', 'sick', 'tumor', 'lithium', 'goitre',
       'TSH_measured', 'TSH', 'T3_measured', 'T3', 'TT4_measured', 'TT4',
       'T4U_measured', 'T4U', 'FTI_measured', 'FTI', 'TBG_measured', 'TBG'],
      dtype='object')

* 2182 females, 908 males, and 73 participants whose sex was not recorded took part in the test.

In [None]:
data.TBG_measured.value_counts()

n    2903
y     260
Name: TBG_measured, dtype: int64
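
The preview shows `?` in the `TBG` column, and `data.info()` reports every column as `object`, which suggests that `?` marks a missing value and keeps the numeric columns from being parsed as numbers. The sketch below shows one way to handle this on a toy frame; the frame `demo` and its values are illustrative assumptions, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few columns of the dataset (illustrative values only).
demo = pd.DataFrame({
    "age": ["72", "15", "?"],
    "TSH": ["30.0", "145.0", "7.3"],
    "TBG": ["?", "?", "28"],
})

# Treat '?' as missing, then convert the measurement columns to numbers.
demo = demo.replace("?", np.nan)
numeric_cols = ["age", "TSH", "TBG"]
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
print(demo.isna().sum())
```

An alternative is to pass `na_values='?'` to `pd.read_csv` so the placeholder is converted at load time.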

## Preprocessing

In [16]:
# Split the independent and dependent variables
y = data['status'].values

test_features = data.columns.to_list()
test_features.remove('status')

# Use every remaining column as a feature.
X = data[test_features].values

In [17]:
# Hold out 30% of the data for testing and train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
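
If the `status` classes turn out to be unevenly represented (not verified in this notebook), a stratified split keeps the class proportions the same in the training and test sets. The sketch below uses toy labels; the 90/10 ratio and the label `negative` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption): 90 'negative' and 10 'hypothyroid' rows.
y_demo = np.array(["negative"] * 90 + ["hypothyroid"] * 10)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo asks the splitter to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Share of positive cases in each split; both should stay close to 10%.
print((y_tr == "hypothyroid").mean(), (y_te == "hypothyroid").mean())
```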

## Building the Models

### DECISION TREE

Random Forest

In [13]:
# Instantiate the random forest classifier
tree = RandomForestClassifier()

# Training on our training set
tree = tree.fit(X_train, y_train) 

# Predict based on the model we've trained
y_pred = tree.predict(X_test)

# Model accuracy
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))

ValueError: ignored
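
The traceback is elided, but a likely cause (an assumption, since the message is not shown) is that `X` still contains strings such as `'M'`, `'f'`, `'y'`, and `'?'`, which scikit-learn estimators cannot convert to floats. The sketch below encodes a toy frame before fitting; the rows, the label `negative`, and the median imputation are illustrative choices, not the notebook's confirmed approach.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy rows shaped like the dataset (illustrative values, not real records).
demo = pd.DataFrame({
    "status": ["hypothyroid", "negative", "negative", "hypothyroid", "negative", "negative"],
    "age": ["72", "15", "?", "24", "77", "41"],
    "sex": ["M", "F", "F", "?", "M", "F"],
    "on_thyroxine": ["f", "t", "f", "f", "f", "t"],
    "TSH": ["30.0", "1.2", "0.8", "430.0", "?", "2.1"],
})

features = demo.drop(columns="status").replace("?", np.nan)

# Numeric measurements: parse as floats and fill gaps with the column median.
for col in ["age", "TSH"]:
    features[col] = pd.to_numeric(features[col])
    features[col] = features[col].fillna(features[col].median())

# Flag columns: map the single-letter codes to 0/1.
features["on_thyroxine"] = features["on_thyroxine"].map({"f": 0, "t": 1})

# Sex: one-hot encode; rows with missing sex get 0 in both indicator columns.
features = pd.get_dummies(features, columns=["sex"], dtype=int)

# With an all-numeric matrix the classifier can be fitted.
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(features.values, demo["status"].values)
print(features.dtypes)
```

The same encoding would need to be applied to all 25 feature columns of the real dataset before calling `train_test_split`.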

AdaBoosted Decision Tree
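
This section is empty in the notebook. As a hedged sketch of what an AdaBoosted tree model could look like, the example below boosts depth-1 decision trees on synthetic data from `make_classification`; the data, the hyperparameter values, and the names `ada` and `ada_pred` are assumptions for illustration.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): swap in the encoded hypothyroid features.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Depth-1 trees ("stumps") as weak learners. The base learner is passed
# positionally because its keyword name differs between scikit-learn versions.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0)
ada.fit(X_tr, y_tr)

ada_pred = ada.predict(X_te)
print("Accuracy:", metrics.accuracy_score(y_te, ada_pred))
```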

Gradient Boosted Decision Tree
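
This section is also empty in the notebook. A minimal sketch of a gradient boosted model, again on synthetic data, might look like the following; the hyperparameter values shown are scikit-learn's common defaults written out explicitly, and the names `gbc` and `gbc_pred` are assumptions.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): swap in the encoded hypothyroid features.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Each new tree is fitted to the errors of the ensemble built so far;
# learning_rate scales how much each tree contributes.
gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbc.fit(X_tr, y_tr)

gbc_pred = gbc.predict(X_te)
print("Accuracy:", metrics.accuracy_score(y_te, gbc_pred))
```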

Polynomial Regression

### SUPPORT VECTOR MACHINE
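
The experimental design calls for polynomial, linear, and RBF kernels. The sketch below compares the three on synthetic data; the data and the names `svm_scores` and `clf` are assumptions. Features are standardized first because SVMs are sensitive to feature scale, which is a general property of the method rather than something established in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): swap in the encoded hypothyroid features.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

svm_scores = {}
for kernel in ["poly", "linear", "rbf"]:
    # The scaler is fitted on the training data only, inside the pipeline.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(X_tr, y_tr)
    svm_scores[kernel] = clf.score(X_te, y_te)

# Test-set accuracy per kernel.
print(svm_scores)
```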