
## Using sex, cholesterol and maximum heart rate achieved to classify heart disease patients from Cleveland.


### Introduction:
Heart disease covers a range of conditions that affect the heart and blood vessels, and it is the leading cause of death worldwide. Key types include coronary artery disease, heart attack, heart failure, arrhythmias, and heart valve problems. Major risk factors include high blood pressure, high cholesterol, smoking, and diabetes. Resting blood pressure is the pressure in the arteries measured while the body is at rest. A normal reading is around 120/80 mmHg, and hypertension (high blood pressure) is a resting blood pressure consistently above 140/90 mmHg, which is a significant risk factor for heart disease. Serum cholesterol is the total amount of cholesterol in the blood, and a desirable total cholesterol level is below 200 mg/dL. High serum cholesterol contributes to the buildup of fatty plaque in the arteries (atherosclerosis), which can lead to coronary artery disease and, eventually, heart attack or heart failure. This project uses patient measurements related to these risk factors to classify patients by their likelihood of having heart disease. The question is: can we predict whether a new patient is likely to have heart disease based on their sex, cholesterol level, and maximum heart rate achieved?

### Preliminary exploratory data analysis:

In [31]:
# importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)

In [32]:
set.seed(1)
# read the data frame from the URL and assign column names;
# every column is numeric and missing values are recorded as "?"
cleveland_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                           col_names = c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", 
                                         "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"),
                           col_types = cols(.default = col_double()),
                           na = "?")

# clean and wrangle: categorical codes become factors and num becomes TRUE/FALSE.
# ca and thal are read as numbers first so that the factor levels keep the
# original codes (0-3 and 3/6/7) instead of internal level indices
cleveland_tidy <- cleveland_data |>
                    mutate(across(c(sex, cp, fbs, restecg, exang, slope, ca, thal),
                                  \(x) as.factor(as.integer(x)))) |>
                    mutate(num = num > 0) |>   # TRUE = heart disease present (num 1-4)
                    mutate(sex = fct_recode(sex, "Female" = "0", "Male" = "1"))

# 75/25 split, stratified by diagnosis
cleveland_split <- initial_split(cleveland_tidy, prop = 0.75, strata = num)

cleveland_training <- training(cleveland_split)
cleveland_testing <- testing(cleveland_split)

head(cleveland_training)

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<dbl>,<fct>,<fct>,<fct>,<lgl>
63,Male,1,145,233,1,2,150,0,2.3,3,0,6,FALSE
37,Male,3,130,250,0,0,187,0,3.5,3,0,3,FALSE
41,Female,2,130,204,0,2,172,0,1.4,1,0,3,FALSE
57,Male,4,140,192,0,0,148,0,0.4,2,0,6,FALSE
56,Female,2,140,294,0,2,153,0,1.3,2,0,3,FALSE
57,Male,3,150,168,0,0,174,0,1.6,1,0,3,FALSE


In [33]:

# downsample every level of factor_col to the size of the smallest level
balance_factors <- function(df, factor_col) {
  counts <- table(df[[factor_col]])
  min_count <- min(counts)

  balanced_df <- do.call(rbind, lapply(levels(df[[factor_col]]), function(level) {
    level_rows <- df[df[[factor_col]] == level, ]
    # draw min_count rows at random, without replacement
    level_rows[sample(nrow(level_rows), min_count), ]
  }))

  return(balanced_df)
}
balance_df <- cleveland_training |>
              balance_factors("sex")
balance_df

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<dbl>,<fct>,<fct>,<fct>,<lgl>
67,Female,3,152,277,0,0,172,0,0.0,1,1,3,FALSE
54,Female,3,110,214,0,0,158,0,1.6,2,0,3,FALSE
50,Female,2,120,244,0,0,162,0,1.1,1,0,3,FALSE
58,Female,3,120,340,0,0,172,0,0.0,1,0,3,FALSE
66,Female,3,146,278,0,2,152,0,0.0,2,1,3,FALSE
60,Female,3,102,318,0,0,160,0,0.0,1,1,3,FALSE
37,Female,3,120,215,0,0,170,0,0.0,1,0,3,FALSE
53,Female,3,128,216,0,2,115,0,0.0,1,0,,FALSE
53,Female,4,130,264,0,2,143,0,0.4,2,0,3,FALSE
41,Female,2,126,306,0,0,163,0,0.0,1,0,3,FALSE


In [34]:
# count each sex/diagnosis combination as a percentage of the balanced training set
sex_proportions <- balance_df |>
                      group_by(sex, num) |>
                      summarize(n = n(), .groups = "drop") |>
                      mutate(percent = 100*n/nrow(balance_df))
sex_proportions

sex,num,n,percent
<fct>,<lgl>,<int>,<dbl>
Female,FALSE,56,38.88889
Female,TRUE,16,11.11111
Male,FALSE,33,22.91667
Male,TRUE,39,27.08333


In [None]:
# plot the sex-balanced training data, one panel per sex
plot1 <- balance_df |>
                 ggplot(aes(x = thalach, y = chol, color = num)) +
                 geom_point() +
                 facet_grid(rows = vars(sex)) +
                 labs(x = "Maximum heart rate achieved (bpm)", y = "Serum cholesterol (mg/dl)", color = "Heart disease") +
                 scale_color_brewer(palette = "Set1")
plot1

### Methods

I am using processed.cleveland.data from the Heart Disease Database, originally collected at the Cleveland Clinic Foundation, to build a K-nearest neighbors (KNN) model that predicts whether a patient from Cleveland is likely to have heart disease. The columns are as follows:

- age: age
- sex: sex (1 = male, 0 = female)
- cp: chest pain type
- trestbps: resting blood pressure in mmHg
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl? (1 = True, 0 = False)
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: whether exercise induced angina (1 = True, 0 = False)
- oldpeak: ST depression induced by exercise, relative to rest
- slope: the slope of the peak exercise ST segment (1 = upslope, 2 = flat, 3 = downslope)
- ca: number of major vessels (0-3) coloured by fluoroscopy
- thal: (3 = normal, 6 = fixed defect, 7 = reversible defect)
- num: diagnosis of heart disease (1,2,3,4 = presence, 0 = no presence)

The data set has 303 rows. Every column is numeric-valued, and missing values are recorded as the string "?".

First, I clean the data by changing all "?" values to NA. The 'num' column uses integers to distinguish the presence (1,2,3,4) from the absence (0) of heart disease. Since I only want to determine whether or not a patient has heart disease, I overwrite 'num' so that 0 becomes "False" and 1, 2, 3 and 4 become "True". I also overwrite the sex column, changing 1 to "Male" and 0 to "Female". Cleaning the data this way keeps the class labels unambiguous for the predictive model.
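The cleaning steps described above can be sketched on a tiny inline example. The three rows below are made up for illustration and only mimic the layout of the sex, chol and num columns; `toy_csv`, `toy` and `toy_tidy` are hypothetical names, not objects from the analysis.

```r
library(readr)
library(dplyr)

# Three made-up rows (sex, chol, num); illustrative values only,
# not taken from the Cleveland data
toy_csv <- I("1,233,0\n0,?,2\n1,250,4\n")

toy <- read_csv(toy_csv,
                col_names = c("sex", "chol", "num"),
                col_types = "ddd",
                na = "?")            # treat "?" as missing while parsing

toy_tidy <- toy |>
  mutate(sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
         num = num > 0)              # 0 -> FALSE, 1-4 -> TRUE, NA stays NA

toy_tidy
```

Declaring "?" as a missing-value marker at read time means the affected columns can be parsed as numbers directly, rather than being cleaned up afterwards.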

I split the data into 75% for training and 25% for testing using initial_split(), stratified by diagnosis ('num') so that both sets contain similar proportions of patients with and without heart disease. I used only the training set for the exploratory analysis. Next, I wrote a function that downsamples the training set to equal numbers of males and females, so that the relationship between maximum heart rate achieved and serum cholesterol can be compared between the sexes on an equal footing.
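As an alternative way to express the same balancing idea, the sketch below downsamples each sex to the size of the smaller group with dplyr's `slice_sample()`. It is not the function used in the notebook, it will not reproduce the same random draws, and the data in `toy_training` are made up for illustration.

```r
library(dplyr)

set.seed(1)

# Made-up training data: 6 males and 3 females (illustrative values only)
toy_training <- tibble(
  sex  = factor(c(rep("Male", 6), rep("Female", 3))),
  chol = c(233, 250, 192, 168, 286, 229, 204, 294, 354)
)

# Size of the smallest sex group
min_count <- toy_training |> count(sex) |> pull(n) |> min()

# Draw min_count rows at random from each sex, without replacement
toy_balanced <- toy_training |>
  group_by(sex) |>
  slice_sample(n = min_count) |>
  ungroup()

count(toy_balanced, sex)
```

Downsampling discards rows from the larger group, so it is best suited to exploration; whether to balance the data used for model fitting is a separate decision.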

To summarize the data, I grouped by sex and 'num' and computed the percentage of the balanced training set in each group. Females have a much lower rate of heart disease than males: 16 of 72 females (about 22%) compared with 39 of 72 males (about 54%).
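The rate of heart disease within each sex follows from the counts in the summary table by dividing each count by its sex total rather than by the size of the whole balanced set. A minimal sketch, with the four counts copied from that table and `sex_counts` and `sex_rates` as illustrative names:

```r
library(dplyr)

# Counts copied from the summary table of the balanced training set
sex_counts <- tibble(
  sex = c("Female", "Female", "Male", "Male"),
  num = c(FALSE, TRUE, FALSE, TRUE),
  n   = c(56, 16, 33, 39)
)

# Percentage within each sex, rather than of the whole balanced set
sex_rates <- sex_counts |>
  group_by(sex) |>
  mutate(percent_within_sex = 100 * n / sum(n)) |>
  ungroup()

sex_rates
```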

To better understand the data, I created scatter plots comparing different numerical variables, with separate panels for females and males. This helped me identify which variables best match the risk factors described in the introduction. By plotting 'chol' against 'thalach' and colour-coding the points by 'num', I could identify three regions dominated by a single diagnosis, as well as regions where the two diagnoses overlap. This visual exploration supported my initial hypotheses and suggested that these variables carry useful information for a predictive model.

Therefore, I have decided to use sex, chol and thalach as the predictors.
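A minimal sketch of how a KNN classifier on these three predictors might be set up with tidymodels is given below. It rests on several assumptions: the `kknn` package is installed, the outcome is stored as a factor (tidymodels classifiers require one), `neighbors = 5` is an arbitrary placeholder that would normally be tuned, and `toy_training` is randomly generated stand-in data, so the prediction itself is meaningless.

```r
library(tidymodels)

set.seed(1)

# Made-up stand-in for the training set (illustrative values only);
# the outcome is a factor because tidymodels classifiers require one
toy_training <- tibble(
  sex     = factor(sample(c("Female", "Male"), 40, replace = TRUE)),
  chol    = round(rnorm(40, mean = 245, sd = 50)),
  thalach = round(rnorm(40, mean = 150, sd = 22)),
  num     = factor(sample(c("FALSE", "TRUE"), 40, replace = TRUE))
)

# Turn sex into a 0/1 indicator, then standardize all predictors so that
# no single variable dominates the distance calculation
knn_recipe <- recipe(num ~ sex + chol + thalach, data = toy_training) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# neighbors = 5 is an arbitrary placeholder; it would normally be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_training)

# Classify one hypothetical new patient
new_patient <- tibble(sex = factor("Male", levels = c("Female", "Male")),
                      chol = 230, thalach = 160)
prediction <- predict(knn_fit, new_patient)
prediction
```

Standardizing matters here because chol and thalach are on different scales, and KNN relies on distances between observations.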

### Expected outcomes and significance

I expect that, on average, patients with heart disease will have higher cholesterol levels and lower maximum heart rates than patients without it, and that heart disease will be more common among males than females. Such findings could improve how heart disease risk is understood and managed.

A classification system for heart disease could simplify and improve the accuracy of patient diagnosis, leading to earlier treatment.

Some future questions this could lead to are:

1. Which risk factors tend to act together to cause heart disease?
2. How should heart disease be treated and prevented differently in each sex?