
![image](https://storage.googleapis.com/kaggle-datasets-images/4134888/7159329/8685cd8fb7c162e34269921f17687cbe/dataset-cover.jpeg?t=2023-12-09-07-27-45)


# <div style="color:white;display:inline-block;border-radius:5px;background-image: url(https://i.postimg.cc/fyD3nrX4/cardiovas-jcdumlao.png);font-family:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:95%;letter-spacing:0.5px;margin:0"><b>ﮩ٨ـ❤️ﮩ٨ـﮩﮩ</b>Introduction</p></div>

This heart disease dataset was collected at a multispecialty hospital in India. It covers 1000 subjects and 14 attributes: a patient identifier, 12 clinical features, and a binary target indicating the presence or absence of heart disease. It is intended for research on early-stage heart disease detection and for building predictive machine-learning models.

<h2 style='border:0; border-radius: 15px; font-weight: 150; color:#9b006e; font-size:250%'><center> Cardiovascular Disease Dataset Description
</center></h2>

|S.No|Attribute|Column name|Type of data|Unit / values|
|----|---------|-----------|------------|-------------|
|1|**Patient Identification Number**|patientid|Numeric|Number|
|2|**Age**|age|Numeric|In years|
|3|**Gender**|gender|Binary|0 (female) / 1 (male)|
|4|**Resting blood pressure**|restingBP|Numeric|94-200 (in mm Hg)|
|5|**Serum cholesterol**|serumcholestrol|Numeric|126-564 (in mg/dl)|
|6|**Fasting blood sugar**|fastingbloodsugar|Binary|0 (false) / 1 (true): fasting blood sugar > 120 mg/dl|
|7|**Chest pain type**|chestpain|Nominal|0 (typical angina), 1 (atypical angina), 2 (non-anginal pain), 3 (asymptomatic)|
|8|**Resting electrocardiogram results**|restingelectro|Nominal|0 (normal), 1 (ST-T wave abnormality), 2 (probable or definite left ventricular hypertrophy)|
|9|**Maximum heart rate achieved**|maxheartrate|Numeric|71-202|
|10|**Exercise induced angina**|exerciseangina|Binary|0 (no) / 1 (yes)|
|11|**Oldpeak (ST depression induced by exercise relative to rest)**|oldpeak|Numeric|0-6.2|
|12|**Slope of the peak exercise ST segment**|slope|Nominal|1 (upsloping), 2 (flat), 3 (downsloping)|
|13|**Number of major vessels**|noofmajorvessels|Numeric|0, 1, 2, 3|
|14|**Classification (target)**|target|Binary|0 (absence of heart disease), 1 (presence of heart disease)|


# <div style="color:white;display:inline-block;border-radius:5px;background-image: url(https://i.postimg.cc/fyD3nrX4/cardiovas-jcdumlao.png);font-family:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:95%;letter-spacing:0.5px;margin:0"><b>ﮩ٨ـ❤️ﮩ٨ـﮩﮩ</b>Import Modules</p></div>


In [None]:
# Uncomment the next line to install termcolor if it is not already available.
# !pip install termcolor


The termcolor module is a Python library for adding color to text printed in the terminal. Colored output makes it easier to highlight, distinguish, or categorize messages. In this notebook it is used to print a colored confirmation once the imports succeed.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from termcolor import colored
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn import metrics
# from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC



print(colored('\nAll libraries imported successfully.', 'blue'))

# <div style="color:white;display:inline-block;border-radius:5px;background-image: url(https://i.postimg.cc/fyD3nrX4/cardiovas-jcdumlao.png);font-family:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:95%;letter-spacing:0.5px;margin:0"><b>ﮩ٨ـ❤️ﮩ٨ـﮩﮩ</b>Load the Data</p></div>


In [None]:
df = pd.read_csv('Cardiovascular_Disease_Dataset/Cardiovascular_Disease_Dataset.csv')
df.head()

Check the shape of the dataframe:

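In [None]:
# df.shape is a (rows, columns) tuple
shape_of_dataframe = df.shape

print("No. of samples:", shape_of_dataframe[0])
print("No. of columns:", shape_of_dataframe[1])

In [None]:
# Column names, non-null counts, and dtypes
df.info()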


# <div style="color:white;display:inline-block;border-radius:5px;background-image: url(https://i.postimg.cc/fyD3nrX4/cardiovas-jcdumlao.png);font-family:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:95%;letter-spacing:0.5px;margin:0"><b>ﮩ٨ـ❤️ﮩ٨ـﮩﮩ</b>Data Information</p></div>

## Check Summary Statistics:

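Use df.describe() to view the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.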
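As a minimal sketch of what df.describe() returns, the example below builds a small illustrative frame. The values are made up for demonstration and are not taken from the dataset; in the notebook the same call would be made on df.

```python
import pandas as pd

# Illustrative values only; the notebook would call describe() on df instead.
sample = pd.DataFrame({
    "age": [40, 55, 63, 71],
    "restingBP": [120, 140, 150, 130],
})

# describe() returns one row per statistic and one column per numeric feature.
summary = sample.describe()
print(summary)

# Individual statistics can be read from the result by row label.
print("Mean age:", summary.loc["mean", "age"])
print("Median restingBP:", summary.loc["50%", "restingBP"])
```

The "50%" row is the median, so the result can be used to compare mean and median quickly when checking for skew.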

# <div style="color:white;display:inline-block;border-radius:5px;background-image: url(https://i.postimg.cc/fyD3nrX4/cardiovas-jcdumlao.png);font-family:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:95%;letter-spacing:0.5px;margin:0"><b>ﮩ٨ـ❤️ﮩ٨ـﮩﮩ</b>Data Preprocessing</p></div>

## Handling Missing Values

**Use df.isna().sum()**:

Step 1: Identifying Missing Values with df.isna()

Calling df.isna() on a pandas DataFrame returns a new DataFrame of the same shape as df in which every cell is either True or False. True means the corresponding cell in the original DataFrame holds a missing value (NaN or None); False means it holds a non-missing value.

Step 2: Summing Up Missing Values with .sum()

Once the missing values are flagged, .sum() counts them:

* By default, .sum() works on each column separately, adding the values down the rows and returning one result per column.
* When .sum() is applied to boolean values (True or False), pandas treats True as 1 and False as 0.
* For each column, .sum() therefore adds up the 1s (the True values), which is the number of missing values in that column.
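The two steps can be sketched on a tiny frame. The values and the choice of columns here are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative frame with two missing entries; values are made up.
sample = pd.DataFrame({
    "age": [52, np.nan, 61],
    "oldpeak": [1.2, 0.0, None],
    "target": [1, 0, 1],
})

# Step 1: boolean frame, True where a value is missing.
mask = sample.isna()
print(mask)

# Step 2: True counts as 1, so the column-wise sum is the number of missing values.
missing_per_column = mask.sum()
print(missing_per_column)

# Summing the per-column counts gives the total number of missing cells.
print("Total missing:", int(missing_per_column.sum()))
```

A column with a count of 0 has no missing values and needs no imputation.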

In [None]:
df.isna().sum()

In [None]:
df.columns

# <div style="color:white;display:inline-block;border-radius:5px;background-image: url(https://i.postimg.cc/fyD3nrX4/cardiovas-jcdumlao.png);font-family:Nexa;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;font-size:95%;letter-spacing:0.5px;margin:0"><b>ﮩ٨ـ❤️ﮩ٨ـﮩﮩ</b>Exploratory Data Analysis (EDA)📊</p></div>


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 1. What is the age range of patients in the dataset?</b></font>

**Answer: Age Range: 20 - 80**

**Explanation: The age range is determined by finding the minimum and maximum age values in the dataset. In this case, patients' ages range from 20 to 80 years.**

In [None]:
age_range = f"Age Range: {df['age'].min()} - {df['age'].max()}"
print(age_range)


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 2. How many males and females are represented in the dataset?</b></font>

**Answer: The counts are printed by the cell below, where gender is encoded as 0 (female) and 1 (male).**




In [None]:
gender_count = df['gender'].value_counts()
print("Number of males:", gender_count[1])
print("Number of females:", gender_count[0])


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 3. What is the most common type of chest pain observed in the patients?</b></font>

chestpain --> 0 (typical angina), 1 (atypical angina), 2 (non-anginal pain), 3 (asymptomatic)

**Answer**: 

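**Explanation**: The most common chest pain type is the category with the highest count in df['chestpain'].value_counts(). Because value_counts() sorts by frequency in descending order, it is the first entry in the output.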


In [None]:
chest_pain_counts = df['chestpain'].value_counts()
print(chest_pain_counts)

In [None]:
# Visualization:
plt.figure(figsize=(10, 6))
sns.barplot(x=chest_pain_counts.index, y=chest_pain_counts.values, palette='viridis')
plt.title('Counts of Chest Pain Types')
plt.xlabel('Chest Pain Type')
plt.ylabel('Counts')
plt.show()

<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 4. What is the average resting blood pressure among the patients?</b></font>

**Answer: Average Resting Blood Pressure: 151.75 mm Hg**

**Explanation: The average resting blood pressure is calculated by taking the mean of the values in the 'restingBP' column.**


In [None]:
average_resting_bp = df['restingBP'].mean()
print(f"Average Resting Blood Pressure: {average_resting_bp:.2f} mm Hg")


## Distribution of resting blood pressure:

In [None]:
sns.histplot(df['restingBP'], color='mediumseagreen') 
plt.title('Distribution of Resting Blood Pressure')
plt.show()

<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 5. How does serum cholesterol vary across different patients?</b></font>

**Explanation: Serum cholesterol distribution is visualized using a boxplot, providing insights into the spread and central tendency of cholesterol levels among patients.**


## Box Plots

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='serumcholestrol', data=df, color='royalblue')
plt.title('Distribution of Serum Cholesterol')
plt.xlabel('Serum Cholesterol')
plt.show()



## Annotating the Box Plot

plt.text(x, y, text) adds text at position x, y. Here, x is the median value, and y is arbitrarily set to 0.1 to position the text on the plot. The text displays the median value, formatted to two decimal places.

ha='center' and va='center' set the horizontal and vertical alignment of the text, respectively.

fontweight='bold', color='white', and backgroundcolor='green' style the text, making it bold with white letters on a green background.

In [None]:

# To clearly show medians, quartiles, and outliers, let's annotate the boxplot with these statistics.
serumcholestrol_data=df['serumcholestrol']
# Recreating the boxplot for serum cholesterol data
plt.figure(figsize=(12, 8))
boxplot = sns.boxplot(x=serumcholestrol_data, color='royalblue')

# Calculating statistics for annotations
median = np.median(serumcholestrol_data)
quartile1 = np.percentile(serumcholestrol_data, 25)
quartile3 = np.percentile(serumcholestrol_data, 75)
iqr = quartile3 - quartile1  # Interquartile range
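# Whiskers end at the most extreme observations within 1.5 * IQR of the quartiles
upper_whisker = serumcholestrol_data[serumcholestrol_data <= quartile3 + 1.5 * iqr].max()
lower_whisker = serumcholestrol_data[serumcholestrol_data >= quartile1 - 1.5 * iqr].min()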

# Annotating the median
plt.text(median, 0.1, f'Median: {median:.2f}', ha='center', va='center', fontweight='bold', color='white', backgroundcolor='green')

# Annotating the quartiles
plt.text(quartile1, 0.2, f'Q1: {quartile1:.2f}', ha='center', va='center', fontweight='bold', color='white', backgroundcolor='blue')
plt.text(quartile3, 0.2, f'Q3: {quartile3:.2f}', ha='center', va='center', fontweight='bold', color='white', backgroundcolor='blue')

# Annotating the whiskers
plt.text(upper_whisker, 0.1, f'Upper Whisker: {upper_whisker:.2f}', ha='center', va='center', fontweight='bold', color='white', backgroundcolor='red')
plt.text(lower_whisker, 0.1, f'Lower Whisker: {lower_whisker:.2f}', ha='center', va='center', fontweight='bold', color='white', backgroundcolor='red')

plt.title('Annotated Distribution of Serum Cholesterol')
plt.xlabel('Serum Cholesterol')
plt.show()

<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 6. What percentage of patients have fasting blood sugar greater than 120 mg/dl? </b></font>

**Answer: Percentage of patients with fasting blood sugar > 120 mg/dl: 29.60%**

**Explanation: The percentage is calculated by dividing the number of patients with fasting blood sugar greater than 120 mg/dl by the total number of patients.**


In [None]:
percentage_high_fasting_sugar = (df['fastingbloodsugar'].sum() / len(df)) * 100
print(f"Percentage of patients with fasting blood sugar > 120 mg/dl: {percentage_high_fasting_sugar:.2f}%")


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 8. What is the average maximum heart rate achieved by the patients?</b></font>

**Answer: Average Maximum Heart Rate: 145.48**

**Explanation: The average maximum heart rate is calculated by taking the mean of values in the 'maxheartrate' column.**
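A minimal sketch of this calculation, using made-up values in place of the real 'maxheartrate' column:

```python
import pandas as pd

# Illustrative values only; in the notebook this would be df['maxheartrate'].
sample = pd.DataFrame({"maxheartrate": [150, 132, 171, 147]})

# mean() sums the column and divides by the number of non-missing rows.
average_max_heart_rate = sample["maxheartrate"].mean()
print(f"Average Maximum Heart Rate: {average_max_heart_rate:.2f}")
```

The same pattern applies to any numeric column, such as 'oldpeak'.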


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 9. How many patients experienced exercise-induced angina?</b></font>

**Answer: Number of Patients with Exercise-Induced Angina: 498**

**Explanation: The count of patients with exercise-induced angina is obtained from the 'exerciseangia' column using df['exerciseangia'].sum().**
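Because the column is binary, summing it counts the 1s. The sketch below uses made-up values and assumes the column is spelled 'exerciseangia' as in the explanation above; check df.columns for the exact name in your copy of the data.

```python
import pandas as pd

# Illustrative values; the column name follows the spelling used in the explanation.
sample = pd.DataFrame({"exerciseangia": [1, 0, 1, 1, 0]})

# Each 1 marks a patient with exercise-induced angina, so the sum is the count.
patients_with_angina = int(sample["exerciseangia"].sum())
print(f"Number of Patients with Exercise-Induced Angina: {patients_with_angina}")

# value_counts() gives the same figure together with the count of 0s.
angina_counts = sample["exerciseangia"].value_counts()
print(angina_counts)
```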


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 10. What is the average oldpeak (ST depression induced by exercise relative to rest) among the patients?</b></font>

**Answer: Average Oldpeak: 2.71**

**Explanation: The average oldpeak is calculated by taking the mean of values in the 'oldpeak' column.**

<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 11. How is the slope of the peak exercise ST segment distributed in the dataset?</b></font>

**Explanation: The distribution of the slope is visualized using a countplot, showing the frequency of each slope type in the 'slope' column.**
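One way to draw such a countplot is sketched below on made-up values. The axis labels and figure size are assumptions chosen for illustration; the slope codes follow the dataset description table.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative values only; in the notebook this would be df['slope'].
sample = pd.DataFrame({"slope": [1, 2, 2, 3, 1, 2]})

# The frequencies that the countplot displays as bar heights.
slope_counts = sample["slope"].value_counts().sort_index()
print(slope_counts)

plt.figure(figsize=(8, 5))
ax = sns.countplot(x="slope", data=sample)
ax.set_title("Distribution of ST Segment Slope")
ax.set_xlabel("Slope (1 = upsloping, 2 = flat, 3 = downsloping)")
ax.set_ylabel("Count")
plt.show()
```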


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 12. What is the range of the number of major vessels in the patients?</b></font>

**Answer: Number of Major Vessels Range: 0 - 3**

**Explanation: The range is determined by finding the minimum and maximum values in the 'noofmajorvessels' column using df['noofmajorvessels'].min() and df['noofmajorvessels'].max().**


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 13. What percentage of patients in the dataset have heart disease (target = 1)?</b></font>

**Answer: Percentage of Patients with Heart Disease: 58.00%**

**Explanation: The percentage is calculated by dividing the number of patients with heart disease (target = 1) by the total number of patients.**
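The calculation can be sketched as follows, with made-up target values standing in for the real column:

```python
import pandas as pd

# Illustrative values only; in the notebook this would be df['target'].
sample = pd.DataFrame({"target": [1, 0, 1, 1]})

# Count the rows with target == 1 and divide by the total number of rows.
patients_with_disease = int((sample["target"] == 1).sum())
percentage_heart_disease = patients_with_disease / len(sample) * 100
print(f"Percentage of Patients with Heart Disease: {percentage_heart_disease:.2f}%")
```

Since the target is coded 0/1, sample["target"].mean() * 100 yields the same number.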

In [None]:
# Visualization:
plt.figure(figsize=(10, 6))
# Map the 0/1 codes to names so that seaborn builds a correctly labeled legend itself
plot_df = df.assign(Gender=df['gender'].map({0: 'Female', 1: 'Male'}))
sns.histplot(x='age', hue='Gender', data=plot_df, palette='muted', multiple='stack', bins=15)
plt.title('Age Distribution by Gender')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 14. Can you identify the patient with the highest age in the dataset?</b></font>

**Answer:**
* Patient ID: 1160678
* Age: 80
* Gender: Female
* Chest Pain: 1
* Target: 1 (Heart Disease)

**Explanation: The patient with the highest age is identified by finding the index of the maximum value in the 'age' column and extracting the other details using df.loc[df['age'].idxmax()].**
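A minimal sketch of the lookup, on a made-up frame whose patient IDs and values are hypothetical:

```python
import pandas as pd

# Hypothetical patients; IDs and values are made up for illustration.
sample = pd.DataFrame({
    "patientid": [101, 102, 103],
    "age": [45, 80, 62],
    "gender": [1, 0, 1],
    "chestpain": [0, 1, 2],
    "target": [0, 1, 1],
})

# idxmax() returns the index label of the row with the largest age;
# loc[] then retrieves that full row as a Series.
oldest = sample.loc[sample["age"].idxmax()]
print(oldest[["patientid", "age", "gender", "chestpain", "target"]])
```

If several patients share the maximum age, idxmax() returns only the first of them.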

<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 15. Who is the patient with the lowest resting blood pressure?</b></font>

**Answer:**
* Patient ID: 119250
* Age: 40
* Gender: Female
* Chest Pain: 0
* Target: 0 (No Heart Disease)

**Explanation: The patient with the lowest resting blood pressure is identified by finding the index of the minimum value in the 'restingBP' column and extracting the other details using df.loc[df['restingBP'].idxmin()].**
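Several patients may share the lowest reading, in which case idxmin() reports only the first. The sketch below, on made-up values with hypothetical patient IDs, shows both the single-row lookup and a filter that returns every tied row.

```python
import pandas as pd

# Hypothetical patients; IDs and readings are made up for illustration.
sample = pd.DataFrame({
    "patientid": [201, 202, 203, 204],
    "restingBP": [130, 94, 118, 94],
})

# idxmin() returns the index label of the first minimum only.
first_lowest = sample.loc[sample["restingBP"].idxmin()]
print(first_lowest)

# A boolean filter returns every patient that shares the minimum value.
all_lowest = sample[sample["restingBP"] == sample["restingBP"].min()]
print(all_lowest)
```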


<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 15px; background-color:#FDF5E6; font-size: 100%; text-align: left;">
    
<font size="+1" color="#059c99"><b>💞 16. What is the correlation between age and maximum heart rate?</b></font>

**Answer: Correlation between Age and Maximum Heart Rate: -0.04**

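**Explanation: The Pearson correlation coefficient between the 'age' and 'maxheartrate' columns quantifies the strength and direction of their linear relationship. It can be computed with df['age'].corr(df['maxheartrate']). A value of -0.04 is close to zero, indicating almost no linear relationship between age and maximum heart rate in this dataset.**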
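As a rough sketch of how such a coefficient is obtained, the example below uses made-up values that fall on a perfectly decreasing line, so the result is -1 rather than the near-zero value reported for the dataset.

```python
import pandas as pd

# Illustrative values only: heart rate drops by 10 for every 10 years of age.
sample = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "maxheartrate": [180, 170, 160, 150],
})

# Series.corr() computes the Pearson correlation by default.
correlation = sample["age"].corr(sample["maxheartrate"])
print(f"Correlation between Age and Maximum Heart Rate: {correlation:.2f}")

# DataFrame.corr() returns the full pairwise correlation matrix.
correlation_matrix = sample[["age", "maxheartrate"]].corr()
print(correlation_matrix)
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.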
