<a href="https://www.kaggle.com/code/shwetakolekar/disease-prediction-using-machine-learning?scriptVersionId=163964359" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

This notebook uses machine learning to predict heart disease from clinical data. The dataset records diagnostic information for heart disease patients, and the goal is to predict whether an individual has heart disease from features such as chest pain type, age, and sex.

Heart disease is a leading cause of morbidity and mortality worldwide, which makes predicting cardiovascular disease an important area of clinical data analysis. The healthcare industry holds large volumes of data, and data mining can turn that data into insights that support decision-making and prediction.

We train several machine learning models to predict the presence or absence of heart disease from a set of patient attributes, using the Cleveland Heart Disease dataset from the UCI repository.

The aim is to give healthcare professionals and individuals data-driven insights that support early and accurate prediction of heart disease.

# **Module 1**:
* **Task 1: Importing Health Data**

In [None]:
import pandas as pd

#--- Read in dataset (heart_cleveland_upload.csv) ---
df = pd.read_csv("/kaggle/input/disease-prediction/heart_cleveland_upload.csv")

#--- Inspect data ---
df

* **Task 2: Identifying Null Values**

In [None]:
sumofnull = df.isnull().sum()
sumofnull

* **Task 3: Examining Data Types**

In [None]:
datatype = df.dtypes
datatype

* **Task 4: Identifying Numerical and Categorical Features**

In [None]:
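# condition is the binary target; it is listed with the numeric features so it appears in the correlation analysis
numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'condition']
cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']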

* **Task 5: Converting Features to Categorical Data Types**

In [None]:
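# Columns that hold category codes rather than measured quantities
lst = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

# Store them as object so pandas no longer treats the codes as numbers
df[lst] = df[lst].astype(object)

# Confirm the new data types
dtype = df.dtypes
dtype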
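
As an aside, pandas also offers a dedicated `category` dtype, which makes it easy to select coded columns later with `select_dtypes`. The sketch below is not part of the original workflow; it uses a small made-up frame (`toy`) in place of the real data, and the column values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the heart-disease data (values are made up).
toy = pd.DataFrame({"age": [63, 45, 58], "sex": [1, 0, 1], "cp": [0, 2, 3]})

# Convert the coded columns to the pandas 'category' dtype.
toy = toy.astype({"sex": "category", "cp": "category"})

# Split the columns by dtype instead of maintaining two hand-written lists.
cat_cols = toy.select_dtypes(include="category").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()

print("categorical:", cat_cols)
print("numeric:", num_cols)
```

Selecting by dtype keeps the feature lists in sync with the frame if columns are added or renamed.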

# **Module 2**:
* **Task 1: Exploring Feature Correlations**

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

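# Correlation matrix of the numeric features (including the target, condition)
numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'condition']
corr_data = df[numeric_features].corr()

# Plot the matrix itself; calling .corr() on it again would correlate the correlations
sns.heatmap(corr_data, annot=True, cmap="Reds", linewidths=0.1)
plt.show()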
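
A heatmap shows every pairwise correlation at once. When the question is which features track the target, a sorted list of correlations with `condition` can be easier to read. The following is a minimal sketch on synthetic data: the frame `toy` and the effect sizes are assumptions made up for illustration, not values from the Cleveland dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; the relationships are invented.
rng = np.random.default_rng(0)
n = 200
condition = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "age": 50 + 5 * condition + rng.normal(0, 8, size=n),
    "thalach": 160 - 15 * condition + rng.normal(0, 15, size=n),
    "chol": rng.normal(240, 40, size=n),
    "condition": condition,
})

# Pearson correlation matrix (the pandas default).
corr = toy.corr()

# Correlation of each feature with the target, strongest first by magnitude.
target_corr = (
    corr["condition"]
    .drop("condition")
    .sort_values(key=abs, ascending=False)
)
print(target_corr.round(2))
```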

* **Task 2: Visualizing Health Conditions**

In [None]:
condition_ax = sns.countplot(x='condition', data=df)
plt.show()

* **Task 3: Analyzing Health Conditions by Gender**

In [None]:
sex_ax =sns.countplot(x=df['sex'],hue=df['condition'])

* **Task 4: Examining Chest Pain Types and Health Conditions**

In [None]:
cp_ax = sns.countplot(x=df['cp'],hue=df['condition'])

* **Task 5: Investigating Fasting Blood Sugar Levels and Health Conditions**

In [None]:
fbs_ax =sns.countplot(x=df['fbs'],hue=df['condition'])

* **Task 6: Analyzing Resting Electrocardiographic Results and Health Conditions**

In [None]:
restecg_ax = sns.countplot(x=df['restecg'],hue=df['condition'])

* **Task 7: Examining Exercise-Induced Angina and Health Conditions**

In [None]:
exang_ax = sns.countplot(x=df['exang'],hue=df['condition'])

* **Task 8: Investigating the Slope of the ST Segment and Health Conditions**

In [None]:
slope_ax = sns.countplot(x=df['slope'],hue=df['condition'])

* **Task 9: Analyzing the Number of Major Vessels Colored by Fluoroscopy and Health Conditions**

In [None]:
ca_ax = sns.countplot(x=df['ca'],hue=df['condition'])

* **Task 10: Examining Thalassemia and Health Conditions**

In [None]:
thal_ax = sns.countplot(x=df['thal'],hue=df['condition'])
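
The count plots above show raw counts per category. To read off exact numbers, or to compare the share of positive cases between groups of different sizes, `pd.crosstab` gives the same information as a table. This is a small sketch on a made-up frame (`toy`); the values are not taken from the dataset.

```python
import pandas as pd

# Made-up sample: sex (0/1) against condition (0 = no disease, 1 = disease).
toy = pd.DataFrame({
    "sex": [0, 0, 0, 1, 1, 1, 1, 1],
    "condition": [0, 0, 1, 0, 1, 1, 1, 0],
})

# Raw counts: the numbers a count plot with hue='condition' would draw.
counts = pd.crosstab(toy["sex"], toy["condition"])

# Row-normalised shares: fraction of each sex group in each condition.
rates = pd.crosstab(toy["sex"], toy["condition"], normalize="index")

print(counts)
print(rates.round(2))
```

Row-normalised shares are usually the fairer comparison when one group has many more rows than the other.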

# **Module 3**:
* **Task 1: Visualizing Age Distribution**

In [None]:
age_col = df['age']
plt.figure(figsize=(10, 6))
plt.hist(age_col, bins=20, color='skyblue', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

* **Task 2: Visualizing Resting Blood Pressure Distribution**

In [None]:
trestbps_col =df['trestbps']
plt.figure(figsize=(10, 6))
plt.hist(trestbps_col, bins=20, color='lightcoral', alpha=0.7)
plt.xlabel('Resting Blood Pressure (trestbps)')
plt.ylabel('Frequency')
plt.title('Resting Blood Pressure Distribution')
plt.show()


* **Task 3: Visualizing Cholesterol Distribution**

In [None]:
chol_col = df['chol']
plt.figure(figsize=(10, 6))
plt.hist(chol_col, bins=20, color='lightgreen', alpha=0.7)
plt.xlabel('Cholesterol (chol)')
plt.ylabel('Frequency')
plt.title('Cholesterol Distribution')
plt.show()

* **Task 4: Visualizing Maximum Heart Rate Distribution**

In [None]:
thalach_col = df['thalach']
plt.figure(figsize=(10, 6))
plt.hist(thalach_col, bins=20, color='lightblue', alpha=0.7)
plt.xlabel('Maximum Heart Rate (thalach)')  
plt.ylabel('Frequency')
plt.title('Maximum Heart Rate Distribution')
plt.show()

* **Task 5: Visualizing ST Depression Distribution**

In [None]:
oldpeak_col = df['oldpeak']
plt.figure(figsize=(10, 6))
plt.hist(oldpeak_col, bins=20, color='lightblue', alpha=0.7)
plt.xlabel('ST Depression (oldpeak)')
plt.ylabel('Frequency')
plt.title('ST Depression Distribution')
plt.show()


* **Task 6: Visualizing Chest Pain Types, Age, and Health Conditions**

In [None]:
violinplt = sns.catplot(x='cp',y='age',hue='condition',data=df,kind='violin')

* **Task 7: Analyzing Fasting Blood Sugar Levels and Health Conditions**

In [None]:
countplt = sns.catplot(x='fbs',hue='condition',kind='count',data=df)

# **Module 4**:
* **Task 1: Encoding Categorical Features**

In [None]:
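categorical_cols = ['cp', 'thal', 'slope']

# One-hot encode the multi-level categorical columns as 0/1 integers
df_encoded = pd.get_dummies(df, columns=categorical_cols, prefix_sep='_', dtype=int)

# Convert the remaining coded columns to int.
# Casting the whole frame with astype(int) would truncate the float column oldpeak.
coded_cols = ['sex', 'fbs', 'restecg', 'exang', 'ca']
df_encoded = df_encoded.astype({col: int for col in coded_cols})
df_encoded.dtypes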
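
The column order produced by `pd.get_dummies` matters later, because any sample passed to a trained model must list its values in the same order. The sketch below uses a made-up three-row frame (`toy`) to show two behaviours: untouched columns stay first, followed by the dummy columns in the order given in `columns`, and only category levels that actually occur get a column.

```python
import pandas as pd

# Made-up frame; cp level 1 is deliberately absent.
toy = pd.DataFrame({
    "age": [63, 45, 58],
    "cp": [0, 2, 3],
    "oldpeak": [2.3, 0.0, 1.4],
    "slope": [1, 0, 2],
})

encoded = pd.get_dummies(toy, columns=["cp", "slope"], prefix_sep="_", dtype=int)

# Untouched columns first, then cp_* and slope_* in the order requested.
print(encoded.columns.tolist())
print(encoded.dtypes)
```

Because `cp` never takes the value 1 in this toy frame, no `cp_1` column is created. The same thing can happen on a small subset of real data.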

* **Task 2: Preparing Features and Target Variable**

In [None]:
x = df_encoded.drop('condition',axis=1)
y = df_encoded['condition']

* **Task 3: Scaling Features**

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to the [0, 1] range
scaler=MinMaxScaler()
x = scaler.fit_transform(x)
x

* **Task 4: Splitting the Data into Training and Testing Sets**

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test  = train_test_split(x,y,test_size=0.2,random_state=4)
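
The cells above fit `MinMaxScaler` on all rows before splitting, so the test rows influence the scaling ranges. One common alternative is to split first and put the scaler inside a scikit-learn pipeline, so it is refit on the training portion only, including within each cross-validation fold. This is a hedged sketch on synthetic data: `X`, `y`, and the choice of `LogisticRegression` are assumptions for illustration, and `stratify=y` is an optional extra that keeps class proportions equal in both splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and binary target (not the heart-disease data).
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Split first, so the test rows never touch the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y
)

# The scaler is fitted inside the pipeline, on training data only.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_tr, y_tr, cv=10)

pipe.fit(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)
print("CV mean accuracy:", round(scores.mean(), 4))
print("Test accuracy:", round(test_acc, 4))
```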

# **Module 5**:
* **Task 1: Building and Evaluating Logistic Regression Model**

In [None]:
from sklearn.linear_model import LogisticRegression
lr_model=LogisticRegression()
lr_model.fit(X_train, Y_train)

#--- Import cross_val_score from sklearn.model_selection ---
from sklearn.model_selection import cross_val_score 
lr_cv_results=cross_val_score(lr_model,X_train,Y_train,cv=10)

# Mean accuracy across the 10 folds
lr_mean_score  = round(lr_cv_results.mean(),4)
lr_mean_score 


* **Task 2: Building and Evaluating Linear Discriminant Analysis Model**

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
ldr_model=LinearDiscriminantAnalysis()
ldr_model.fit(X_train, Y_train)
# Cross-validate on the training split only, as for the other models
ldr_cv_results=cross_val_score(ldr_model, X_train, Y_train, cv=10)

# Mean accuracy across the 10 folds
ldr_mean_score  = round(ldr_cv_results.mean(),4)
ldr_mean_score

* **Task 3: Building and Evaluating K-Nearest Neighbors (KNN) Model**

In [None]:
from  sklearn.neighbors import KNeighborsClassifier
knn_model=KNeighborsClassifier()
knn_model.fit(X_train, Y_train)
# Cross-validate on the training split only, as for the other models
knn_cv_results=cross_val_score(knn_model, X_train, Y_train, cv=10)
# Mean accuracy across the 10 folds
knn_mean_score  = round(knn_cv_results.mean(),4)
knn_mean_score 

* **Task 4: Building and Evaluating Decision Tree Classifier Model**

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_model=DecisionTreeClassifier()
dt_model.fit(X_train, Y_train)
dt_cv_results=cross_val_score(dt_model,X_train, Y_train, cv=10)
# Mean accuracy across the 10 folds
dt_mean_score  = round(dt_cv_results.mean(),4)
dt_mean_score 

* **Task 5: Building and Evaluating Gaussian Naive Bayes Model**

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb_model=GaussianNB()
gnb_model.fit(X_train, Y_train)
gnb_cv_results=cross_val_score(gnb_model,X_train, Y_train, cv=10)

# Mean accuracy across the 10 folds
gnb_mean_score  = round(gnb_cv_results.mean(),4)
gnb_mean_score 

* **Task 6: Building and Evaluating Random Forest Classifier Model**

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_model=RandomForestClassifier(n_estimators=100, max_features=3)
rf_model.fit(X_train, Y_train)
rf_cv_results=cross_val_score(rf_model,X_train, Y_train, cv=10)
# Mean accuracy across the 10 folds
rf_mean_score  = round(rf_cv_results.mean(),4)
rf_mean_score

* **Task 7: Building and Evaluating Support Vector Classifier (SVC) Model**

In [None]:
#--- Import SVC ---
from sklearn.svm import SVC
sv_model=SVC()
sv_model.fit(X_train, Y_train)
sv_cv_results=cross_val_score(sv_model,X_train, Y_train, cv=10)

# Mean accuracy across the 10 folds
sv_mean_score  = round(sv_cv_results.mean(),4)
sv_mean_score
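
The seven tasks above repeat the same pattern: build a model, run 10-fold cross-validation, and take the mean score. The pattern can be collapsed into one loop over a dictionary of models, which also makes the scores easy to rank. The sketch below runs on synthetic data (`X`, `y`) and covers only four of the classifiers to stay short; the dictionary name `models` and the data are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic features in [0, 1] and a binary target (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 6))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
}

# cross_val_score clones each estimator, so no prior fit() call is needed.
results = {
    name: round(cross_val_score(model, X, y, cv=10).mean(), 4)
    for name, model in models.items()
}

# Print the models from best to worst mean accuracy.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.4f}")
```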

* **Task 8: Evaluating Model Performance**

In [None]:
#--- Import accuracy_score, confusion_matrix, classification_report from sklearn.metrics ---
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on the held-out test set with the Gaussian Naive Bayes model
y_pred = gnb_model.predict(X_test)

# Only the last expression of a cell is displayed, so print each result
accuracy = accuracy_score(Y_test, y_pred)
print("Accuracy:", accuracy)

cm = confusion_matrix(Y_test, y_pred)
print(cm)

cr = classification_report(Y_test, y_pred)
print(cr)
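
For a binary problem, the confusion matrix from scikit-learn is laid out as `[[TN, FP], [FN, TP]]`, with true labels on the rows and predictions on the columns. In a screening context, sensitivity (the share of actual disease cases caught) is often more informative than overall accuracy. The sketch below uses hand-written labels (`y_true`, `y_hat`) that are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 healthy (0) and 6 diseased (1) patients.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# ravel() flattens [[TN, FP], [FN, TP]] into four scalars.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # recall for the disease class
specificity = tn / (tn + fp)  # recall for the healthy class
precision = tp / (tp + fp)    # share of positive predictions that are correct

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} precision={precision:.3f}")
```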

* **Task 9: Making Predictions with Gaussian Naive Bayes Model**

In [None]:
# One illustrative sample: 20 values in the same column order as x.
# The labels assume thal takes the values 0-2 in this file, so there is no thal_3 column.
data = [[0.254, 1, 0.487, 0.362,  ## age_scaled, sex, trestbps_scaled, chol_scaled
             1, 0.5, 0.641, 1,  ## fbs, restecg_scaled, thalach_scaled, exang
             0.672, 0.863, 0, 0,  ## oldpeak_scaled, ca_scaled, cp_0, cp_1
             0, 1, 0, 0,  ## cp_2, cp_3, thal_0, thal_1
             0, 1, 0, 1]]  ## thal_2, slope_0, slope_1, slope_2

# Pass the sample to the trained model to get a prediction
prediction = gnb_model.predict(data)
prediction
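
Typing scaled and one-hot values by hand is error-prone, since every value must sit in the right position. One alternative is to start from raw patient values and push them through the same encoding and scaling steps used for training. The sketch below shows the idea on a tiny made-up training frame; the names `train`, `feature_cols`, and `patient` are hypothetical, and only three features are used to keep it short.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up training set (not the Cleveland data).
train = pd.DataFrame({
    "age": [63, 45, 58, 39, 70, 52],
    "sex": [1, 0, 1, 0, 1, 1],
    "cp": [0, 2, 3, 1, 3, 2],
    "condition": [1, 0, 1, 0, 1, 0],
})

# Encode the training features and remember the resulting column order.
X_train_df = pd.get_dummies(train.drop(columns="condition"), columns=["cp"], dtype=int)
feature_cols = X_train_df.columns

scaler = MinMaxScaler().fit(X_train_df)
model = GaussianNB().fit(scaler.transform(X_train_df), train["condition"])

# A new patient given as raw, unscaled values.
patient = pd.DataFrame([{"age": 54, "sex": 1, "cp": 3}])
patient_enc = pd.get_dummies(patient, columns=["cp"], dtype=int)

# Align to the training columns; dummy columns the patient lacks become 0.
patient_enc = patient_enc.reindex(columns=feature_cols, fill_value=0)

pred = model.predict(scaler.transform(patient_enc))
print(patient_enc)
print("prediction:", pred)
```

The `reindex` call is what guarantees the sample has the same columns, in the same order, as the data the model was trained on.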
