## Import Libraries

Import the libraries used throughout the notebook and fix the NumPy random seed so that results are reproducible.

In [168]:
# Import necessary libraries
import pandas as pd
import numpy as np

np.random.seed(42)

## Heart Disease Predictive Analysis

For the Heart Disease dataset, the goal is to predict whether a person has heart disease from medical and demographic features. The target variable is the 'target' column, which is commonly treated as binary:

0 represents the absence of heart disease.
1 represents the presence of heart disease.

In the raw Cleveland file the column is integer valued from 0 to 4, and values 1 to 4 are usually collapsed to 1 (presence).

The objective is to build a machine learning model that accurately predicts whether a patient has heart disease based on the input features. Such a prediction can support early diagnosis and timely medical intervention.

In summary, the goal of this analysis is to build a predictive model that determines the presence or absence of heart disease in patients from their health-related attributes.

## Load The Data

The dataset is stored locally at '/Users/ramyamuthineni/Downloads/heart+disease/processed.cleveland.data'.

In [142]:
# Load the Heart Disease dataset from a local path
file_path = '/Users/ramyamuthineni/Downloads/heart+disease/processed.cleveland.data'
df = pd.read_csv(file_path)

The dataset is now loaded into the DataFrame df. The next cell inspects its column names.

In [143]:
column_names = df.columns
print(column_names)

Index(['63.0', '1.0', '1.0.1', '145.0', '233.0', '1.0.2', '2.0', '150.0',
       '0.0', '2.3', '3.0', '0.0.1', '6.0', '0'],
      dtype='object')


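The dataset has 14 columns describing heart disease-related factors. The column names look numeric because the file has no header row: pd.read_csv used the first data record as the header, so that record is missing from the DataFrame, and pandas appended suffixes such as '.1' and '.2' to keep repeated values unique.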
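One way to avoid losing the first record is to tell pd.read_csv that the file has no header and to pass the column names directly. The sketch below is an alternative to the cell above, not what the notebook ran. It uses an in-memory sample instead of the local file; the two records are taken from the column index and the df.head() output shown in this notebook, and the names COLUMN_NAMES and df_demo are introduced only for illustration.

```python
import io

import pandas as pd

# Column names used later in this notebook.
COLUMN_NAMES = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# Two records in the same comma-separated layout as the data file.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
)

# header=None keeps the first line as data; names assigns the column labels;
# na_values='?' converts the '?' placeholder to NaN while reading.
df_demo = pd.read_csv(sample, header=None, names=COLUMN_NAMES, na_values='?')

print(df_demo.shape)  # (2, 14)
```

With the real file, the same three arguments would be passed together with file_path, and the separate renaming step would no longer be needed.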

Rename the columns to give them readable, consistent names.

In [144]:
df.columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 
         'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

The column names and descriptions are provided below:

'age': Age of the patient.

'sex': Gender of the patient (1 = male, 0 = female).

'cp': Chest pain type (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic).

'trestbps': Resting blood pressure of the patient (in mm Hg on admission to the hospital).

'chol': Serum cholesterol level of the patient (in mg/dl).

'fbs': Fasting blood sugar > 120 mg/dl (1 = true, 0 = false).

'restecg': Resting electrocardiographic results (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy).

'thalach': Maximum heart rate achieved.

'exang': Exercise induced angina (1 = yes, 0 = no).

'oldpeak': ST depression induced by exercise relative to rest.

'slope': Slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping).

'ca': Number of major vessels colored by fluoroscopy (0-3).

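'thal': Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect). The name is often expanded as thalassemia, but the codes describe perfusion defects.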

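'target': Diagnosis of heart disease. In this file the value ranges from 0 (absence) to 4; values 1 to 4 indicate presence and are often collapsed to 1 for binary classification.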
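Because the df.head() output in this notebook shows a target value of 2, the column is not strictly binary as loaded. A minimal sketch of one common way to collapse it into a 0/1 label is shown below. The five values are the ones visible in the df.head() output; the name binary_target is hypothetical and this step is not part of the original notebook.

```python
import pandas as pd

# Target values from the first five rows shown by df.head().
target = pd.Series([2, 1, 0, 0, 0], name='target')

# Any value above 0 is treated as presence of heart disease.
binary_target = (target > 0).astype(int)

print(binary_target.tolist())  # [1, 1, 0, 0, 0]
```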

## Explore the dataset

Explore the data to understand its structure, characteristics, and size.

The shape attribute of a DataFrame in pandas represents its dimensions, where the first value represents the number of rows and the second value represents the number of columns.

In [154]:
# Explore the dataset
# print the number of rows and columns
print(df.shape)

(302, 14)


The output (302, 14) indicates that the DataFrame df has 302 rows and 14 columns. 

In [155]:
# Explore the dataset
# read the first few rows of the dataset 
print(df.head())

    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5   
1  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6   
2  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5   
3  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4   
4  56.0  1.0  2.0     120.0  236.0  0.0      0.0    178.0    0.0      0.8   

   slope   ca thal  target  
0    2.0  3.0  3.0       2  
1    2.0  2.0  7.0       1  
2    3.0  0.0  3.0       0  
3    1.0  0.0  3.0       0  
4    1.0  0.0  3.0       0  


df.head() returns the first five rows of the dataset. Note that the first row has a target value of 2, so the column is not strictly binary.

In [170]:
# read the column names of the dataset 
print(df.columns)

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')


df.columns returns the column names, which now reflect the renaming above.

In [171]:
# print descriptive statistics for the numeric columns
print(df.describe())

              age         sex          cp    trestbps        chol         fbs  \
count  302.000000  302.000000  302.000000  302.000000  302.000000  302.000000   
mean    54.410596    0.678808    3.165563  131.645695  246.738411    0.145695   
std      9.040163    0.467709    0.953612   17.612202   51.856829    0.353386   
min     29.000000    0.000000    1.000000   94.000000  126.000000    0.000000   
25%     48.000000    0.000000    3.000000  120.000000  211.000000    0.000000   
50%     55.500000    1.000000    3.000000  130.000000  241.500000    0.000000   
75%     61.000000    1.000000    4.000000  140.000000  275.000000    0.000000   
max     77.000000    1.000000    4.000000  200.000000  564.000000    1.000000   

          restecg     thalach       exang     oldpeak       slope      target  
count  302.000000  302.000000  302.000000  302.000000  302.000000  302.000000  
mean     0.986755  149.605960    0.327815    1.035430    1.596026    0.940397  
std      0.994916   22.912959 

The code above generates descriptive statistics for the numeric columns of the DataFrame: count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum. The 'ca' and 'thal' columns do not appear in the output because they are stored as object dtype rather than as numbers.
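To see why object-typed columns drop out of the summary, the sketch below compares the default describe() call with describe(include='all') on a small illustrative frame. The frame demo and its values are made up for the example and only mimic the mix of float and string columns in df.

```python
import pandas as pd

# Small illustrative frame: one numeric column and one column of strings.
demo = pd.DataFrame({
    'age': [67.0, 67.0, 37.0],
    'ca': ['3.0', '2.0', '0.0'],  # strings, as in an object-typed column
})

numeric_only = demo.describe()             # numeric columns only (default)
everything = demo.describe(include='all')  # every column, any dtype

print(list(numeric_only.columns))  # ['age']
print(list(everything.columns))    # ['age', 'ca']
```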

In [172]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302 entries, 0 to 301
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       302 non-null    float64
 1   sex       302 non-null    float64
 2   cp        302 non-null    float64
 3   trestbps  302 non-null    float64
 4   chol      302 non-null    float64
 5   fbs       302 non-null    float64
 6   restecg   302 non-null    float64
 7   thalach   302 non-null    float64
 8   exang     302 non-null    float64
 9   oldpeak   302 non-null    float64
 10  slope     302 non-null    float64
 11  ca        302 non-null    object 
 12  thal      302 non-null    object 
 13  target    302 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.2+ KB
None


df.info() provides a concise summary of the DataFrame, including the dtype and number of non-null values for each column and the memory usage. Here, 'ca' and 'thal' have dtype object, which indicates that they contain non-numeric entries even though no nulls are reported.
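Since df.info() counts a placeholder string as non-null, a separate check is needed to find it. The sketch below counts '?' entries per column on a small made-up frame; demo and placeholder_counts are illustrative names, and the '?' positions are invented for the example.

```python
import pandas as pd

# Illustrative frame with '?' placeholders in two string columns.
demo = pd.DataFrame({
    'age': [67.0, 37.0, 41.0],
    'ca': ['3.0', '?', '0.0'],
    'thal': ['3.0', '7.0', '?'],
})

# Element-wise comparison yields a boolean frame; summing counts the matches.
placeholder_counts = (demo == '?').sum()

print(placeholder_counts)
```

The same expression applied to df would show which columns hold the placeholder before any replacement is made.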

## Clean data

After exploring the data, we strip any leading or trailing whitespace from the column names.

In [173]:
# strip leading/trailing whitespace from the column names (a safeguard; the names were set manually above)
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

The code above removes any leading or trailing whitespace from the column names and displays the result. Because the columns were already renamed manually, the names are unchanged.
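The effect of the stripping step is easier to see on names that actually contain whitespace. The sketch below uses a made-up empty frame and the vectorized str.strip() accessor on the column index, which is equivalent to the list comprehension above.

```python
import pandas as pd

# Illustrative frame whose column names carry stray whitespace.
demo = pd.DataFrame(columns=[' age', 'sex ', ' cp '])

# Strip whitespace from every column name in one call.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))  # ['age', 'sex', 'cp']
```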

## Handle Missing Values

Missing values in this file are encoded as the string '?'. The cell below converts them to NaN and then fills them.

In [174]:
# Replace the '?' placeholder with NaN everywhere in the DataFrame
df[df == '?'] = np.nan

# Calculate mean for numeric columns and fill NaN values with mean
numeric_cols_mean = df.select_dtypes(include='number').mean()
df.fillna(numeric_cols_mean, inplace=True)

# For categorical columns, fill NaN values with a specific value (for example, 'unknown')
categorical_cols = df.select_dtypes(include='object').columns
df[categorical_cols] = df[categorical_cols].fillna('unknown')

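The code above handles missing values in three steps: it replaces the '?' placeholder with NaN, fills NaN values in numeric columns with the column mean, and fills NaN values in object columns with the string 'unknown'. According to the df.info() output, every numeric column is already complete, so the mean fill changes nothing here. Any '?' entries sit in 'ca' and 'thal', which stay object-typed and receive 'unknown'.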
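Most models expect numeric inputs, so leaving 'ca' and 'thal' as strings mixed with 'unknown' may cause problems later. As an alternative to the cell above (not the approach this notebook takes), the sketch below converts both columns to numbers and fills the gaps with the most frequent value, which suits discrete codes. The frame demo and its '?' positions are made up for illustration.

```python
import pandas as pd

# Illustrative frame mimicking the two object-typed columns.
demo = pd.DataFrame({
    'ca': ['3.0', '?', '0.0', '0.0'],
    'thal': ['3.0', '7.0', '?', '3.0'],
})

for col in ['ca', 'thal']:
    # errors='coerce' turns anything non-numeric (here '?') into NaN.
    demo[col] = pd.to_numeric(demo[col], errors='coerce')
    # Fill the gaps with the most frequent value of the column.
    demo[col] = demo[col].fillna(demo[col].mode().iloc[0])

print(demo.dtypes)
print(demo)
```

Mode imputation is only one option; dropping the affected rows is another reasonable choice when few values are missing.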

## Save the preprocessed Data to CSV file

Store the preprocessed data in a CSV file.

In [175]:
# Save Preprocessed Data to CSV file
preprocessed_df = df.copy()
preprocessed_df.to_csv('preprocessed_heart.csv', index=False)

In [176]:
print("\nPreprocessed data saved successfully.")


Preprocessed data saved successfully.


The code above saves a copy of the preprocessed DataFrame, preprocessed_df, to a CSV file named 'preprocessed_heart.csv' in the current working directory. Passing index=False to to_csv() leaves the row index out of the file.
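A quick way to confirm that a save worked is to read the file back and compare it with the original. The sketch below does this with a small made-up frame and a temporary directory, so it does not touch 'preprocessed_heart.csv'; the file name preprocessed_demo.csv and the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small illustrative frame standing in for the preprocessed data.
demo = pd.DataFrame({'age': [67.0, 37.0], 'target': [2, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessed_demo.csv')
    demo.to_csv(path, index=False)  # write without the row index
    reloaded = pd.read_csv(path)    # read the file back

# True when values, dtypes, and column labels all match.
round_trip_ok = reloaded.equals(demo)
print(round_trip_ok)  # True
```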