# MATH 189 - Project

> NAME: $\color{red}{\text{    Anh Tran     }}$
> 
> EMAIL: $\color{red}{\text{    ant033@ucsd.edu     }}$
> 
> PID: $\color{red}{\text{    A17634100     }}$

> NAME: $\color{red}{\text{    Sang Tran     }}$
> 
> EMAIL: $\color{red}{\text{    stt008@ucsd.edu     }}$
> 
> PID: $\color{red}{\text{    A17603500     }}$

# Research Question

Which population groups are most affected by thyroid disease, and is there a relationship between those groups and the most effective treatment, based on the thyroid disease dataset?




## Background and Prior Work

According to the Cleveland Clinic, the thyroid gland is part of the endocrine system and sits at the base of the larynx (voice box), wrapped around the trachea (windpipe). It produces hormones and releases them into the bloodstream to regulate body functions such as metabolism. Because every cell needs energy from metabolized food in order to function, the thyroid plays an important role in keeping the body working properly. The Cleveland Clinic describes thyroid disease as fairly common. Conditions that reduce or overactivate thyroid function can have a major impact on the body, with symptoms such as fatigue, weight loss or gain, depression, heavy or irregular menstrual periods, and mood swings. However, these diseases are generally not debilitating and are treatable with medication, therapy, or surgery.

References:

Cleveland Clinic medical professional. (n.d.). Thyroid disease. Cleveland Clinic. https://my.clevelandclinic.org/health/diseases/8541-thyroid-disease#overview

# Hypothesis


* Identification of high-risk factors for disease: Through the analyses proposed in this project, we aim to identify demographic and clinical factors associated with an increased risk of thyroid disease. This can help healthcare professionals better understand the populations most vulnerable to thyroid dysfunction.
* Prediction of thyroid disease: By training machine learning models on the dataset, we expect to develop predictive models that can accurately classify individuals as either having thyroid disease or being healthy based on their demographic and clinical characteristics.
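
To make the prediction goal more concrete, a binary label could be derived from the dataset's `target` column. The sketch below runs on a small made-up frame, not the real data. It assumes that `-` marks a record with no diagnosed condition, which should be checked against the dataset documentation, and the column name `has_condition` is hypothetical.

```python
import pandas as pd

# Made-up target codes that mimic the values seen in the dataset preview.
toy = pd.DataFrame({"target": ["-", "-", "S", "I", "-"]})

# Assumption: '-' means no diagnosed condition; any other code counts as a condition.
toy["has_condition"] = (toy["target"] != "-").astype(int)

# Share of each class, useful for spotting class imbalance before modeling.
class_share = toy["has_condition"].value_counts(normalize=True)
print(class_share)
```

Checking the class shares first is one way to decide whether plain accuracy would be a misleading metric for the models we try.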



# Data

This dataset from [Kaggle](https://www.kaggle.com/datasets/emmanuelfwerr/thyroid-disease-data) was prepared by Kaggle user Emmanuel F. Werr as a more readable version of the original data located at [Thyroid Disease Dataset](https://archive.ics.uci.edu/ml/datasets/thyroid+disease) - UCI Machine Learning Repository.
* This dataset includes 9172 observations and 31 attributes.

&emsp; Analyzing the dataset can give us insight into the risk factors and clinical manifestations of thyroid disease. This information is crucial for healthcare professionals to better understand the disease and its impact on affected individuals.

&emsp; Moreover, we can potentially identify high-risk groups, and the conclusions from this project can inform targeted screening and prevention strategies and help optimize diagnosis and treatment plans.

&emsp; Lastly, analysis of this dataset can generate hypotheses for further research into the underlying mechanisms of thyroid disease and potential therapeutic targets. By contributing to the scientific understanding of the disease, the expected outcomes can support innovation in diagnosis, treatment, and prevention strategies.


# Proposed Objectives

* Data Cleaning
* EDA
* Feature Selection
* Modeling
* Interpretation

We plan to clean the data by removing unneeded variables, handling missing values, one-hot encoding categorical variables, and combining data where needed so that the dataset is tidy. We then plan to run exploratory data analysis to see how the variables relate to each other and to thyroid disease. Visualizing these features is key, so we plan to use scatter plots, ECDFs, Q-Q plots, correlation matrices, and similar graphs to gain a better understanding of them. After that, we plan to experiment with different methods such as linear regression and PCA. Lastly, we will interpret the results.
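
The cleaning steps named above could look roughly like the following sketch. It uses a small made-up frame that borrows a few column names from the dataset. Dropping rows with a missing `TSH` is only one possible strategy (imputation is another), and the choice of columns to drop or encode is an assumption for illustration.

```python
import pandas as pd

# Toy rows that mimic a few columns of the thyroid data (values are made up).
toy = pd.DataFrame({
    "age": [29, 41, 56],
    "sex": ["F", "F", "M"],
    "referral_source": ["other", "other", "SVI"],
    "TSH": [0.3, None, 1.4],
    "patient_id": [1, 2, 3],
})

# Drop an identifier column that carries no clinical information.
cleaned = toy.drop(columns=["patient_id"])

# Drop rows where the lab value is missing (one option among several).
cleaned = cleaned.dropna(subset=["TSH"])

# One-hot encode the categorical columns into 0/1 indicator columns.
encoded = pd.get_dummies(cleaned, columns=["sex", "referral_source"], dtype=int)
print(encoded)
```

Each category becomes its own indicator column, for example `sex_F` and `sex_M`, which keeps the encoded frame fully numeric for later modeling.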

## Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
path = '../Math189/data/thyroidDF.csv'
df = pd.read_csv(path)
df.head()

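| | age | sex | on_thyroxine | query_on_thyroxine | on_antithyroid_meds | sick | pregnant | thyroid_surgery | I131_treatment | query_hypothyroid | ... | TT4 | T4U_measured | T4U | FTI_measured | FTI | TBG_measured | TBG | referral_source | target | patient_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | F | f | f | f | f | f | f | f | t | ... | NaN | f | NaN | f | NaN | f | NaN | other | - | 840801013 |
| 1 | 29 | F | f | f | f | f | f | f | f | f | ... | 128.0 | f | NaN | f | NaN | f | NaN | other | - | 840801014 |
| 2 | 41 | F | f | f | f | f | f | f | f | f | ... | NaN | f | NaN | f | NaN | t | 11.0 | other | - | 840801042 |
| 3 | 36 | F | f | f | f | f | f | f | f | f | ... | NaN | f | NaN | f | NaN | t | 26.0 | other | - | 840803046 |
| 4 | 32 | F | f | f | f | f | f | f | f | f | ... | NaN | f | NaN | f | NaN | t | 36.0 | other | S | 840803047 |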


In [5]:
df.dtypes

age                      int64
sex                     object
on_thyroxine            object
query_on_thyroxine      object
on_antithyroid_meds     object
sick                    object
pregnant                object
thyroid_surgery         object
I131_treatment          object
query_hypothyroid       object
query_hyperthyroid      object
lithium                 object
goitre                  object
tumor                   object
hypopituitary           object
psych                   object
TSH_measured            object
TSH                    float64
T3_measured             object
T3                     float64
TT4_measured            object
TT4                    float64
T4U_measured            object
T4U                    float64
FTI_measured            object
FTI                    float64
TBG_measured            object
TBG                    float64
referral_source         object
target                  object
patient_id               int64
dtype: object

In [6]:
df.describe()

| | age | TSH | T3 | TT4 | T4U | FTI | TBG | patient_id |
|---|---|---|---|---|---|---|---|---|
| count | 9172.0 | 8330.0 | 6568.0 | 8730.0 | 8363.0 | 8370.0 | 349.0 | 9172.0 |
| mean | 73.555822 | 5.218403 | 1.970629 | 108.700305 | 0.976056 | 113.640746 | 29.870057 | 852947300.0 |
| std | 1183.976718 | 24.184006 | 0.887579 | 37.52267 | 0.20036 | 41.55165 | 21.080504 | 7581969.0 |
| min | 1.0 | 0.005 | 0.05 | 2.0 | 0.17 | 1.4 | 0.1 | 840801000.0 |
| 25% | 37.0 | 0.46 | 1.5 | 87.0 | 0.86 | 93.0 | 21.0 | 850409000.0 |
| 50% | 55.0 | 1.4 | 1.9 | 104.0 | 0.96 | 109.0 | 26.0 | 851004000.0 |
| 75% | 68.0 | 2.7 | 2.3 | 126.0 | 1.065 | 128.0 | 31.0 | 860711000.0 |
| max | 65526.0 | 530.0 | 18.0 | 600.0 | 2.33 | 881.0 | 200.0 | 870119000.0 |
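
Two things stand out in the summary above: `age` has a maximum of 65526, which inflates its mean and standard deviation, and `TBG` has only 349 non-missing values out of 9172 rows. The sketch below shows one way these could be checked and handled, using a small made-up frame. The cutoff `MAX_AGE = 100` is an assumed threshold, not something taken from the dataset documentation.

```python
import numpy as np
import pandas as pd

# Made-up rows that reproduce the two issues: an impossible age and sparse TBG.
toy = pd.DataFrame({
    "age": [29, 41, 65526, 68],
    "TBG": [np.nan, 11.0, np.nan, np.nan],
    "TSH": [0.3, 1.4, np.nan, 2.7],
})

# Fraction of missing values per column; very sparse columns may be candidates to drop.
missing_share = toy.isna().mean()
print(missing_share)

# Keep only rows with a plausible age (MAX_AGE is an assumed cutoff).
MAX_AGE = 100
plausible = toy[toy["age"].between(0, MAX_AGE)]
print(plausible)
```

On the real data, the same two checks would show how many rows the age filter removes and which lab columns are too sparse to use directly.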


In [11]:
df.replace({'t': 1, 'f': 0}, inplace=True)
df

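| | age | sex | on_thyroxine | query_on_thyroxine | on_antithyroid_meds | sick | pregnant | thyroid_surgery | I131_treatment | query_hypothyroid | ... | TT4 | T4U_measured | T4U | FTI_measured | FTI | TBG_measured | TBG | referral_source | target | patient_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | NaN | 0 | NaN | 0 | NaN | 0 | NaN | other | - | 840801013 |
| 1 | 29 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 128.0 | 0 | NaN | 0 | NaN | 0 | NaN | other | - | 840801014 |
| 2 | 41 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | NaN | 1 | 11.0 | other | - | 840801042 |
| 3 | 36 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | NaN | 1 | 26.0 | other | - | 840803046 |
| 4 | 32 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | NaN | 1 | 36.0 | other | S | 840803047 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9167 | 56 | M | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 64.0 | 1 | 0.83 | 1 | 77.0 | 0 | NaN | SVI | - | 870119022 |
| 9168 | 22 | M | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 91.0 | 1 | 0.92 | 1 | 99.0 | 0 | NaN | SVI | - | 870119023 |
| 9169 | 69 | M | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 113.0 | 1 | 1.27 | 1 | 89.0 | 0 | NaN | SVI | I | 870119025 |
| 9170 | 47 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 75.0 | 1 | 0.85 | 1 | 88.0 | 0 | NaN | other | - | 870119027 |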
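
The `replace` call above acts on every column of the frame. A possible alternative, sketched here on a small made-up frame, is to first pick out the columns whose values are only `t` or `f` and map just those, so that other text columns cannot be changed by accident. The helper name `flag_cols` is hypothetical.

```python
import pandas as pd

# Made-up rows with two flag columns and two ordinary text columns.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "on_thyroxine": ["f", "t", "f"],
    "TSH_measured": ["t", "t", "f"],
    "referral_source": ["other", "SVI", "other"],
})

# Select the columns whose non-missing values are all 't' or 'f'.
flag_cols = [
    col for col in toy.columns
    if toy[col].dropna().isin(["t", "f"]).all()
]

# Map 't' -> 1 and 'f' -> 0 in the flag columns only.
toy[flag_cols] = toy[flag_cols].apply(lambda s: s.map({"t": 1, "f": 0}))
print(toy)
```

Listing the selected columns before mapping also gives a quick check that no unexpected column is being treated as a flag.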


# Ethics & Privacy

&emsp;...



# Team Expectations

* *Show up to scheduled team meetings.*
* *Communicate through group chat if we can't finish something in time, miss a meeting, need help, etc.*
* *Be understanding and respectful of one another.*



