# Purpose of the analysis

This study evaluates how various factors affect an individual's risk of developing coronary heart disease (CHD) within the next ten years. To do so, we statistically analyze individual health data related to CHD. The dataset covers personal basic information, lifestyle habits, health status, and medical test indicators. In-depth analysis of this data lets us identify the key factors most closely associated with the risk of CHD onset.

The results can be used to build a CHD risk prediction model that helps medical institutions and healthcare providers assess the likelihood of an individual developing CHD. The analysis also supports more informed decisions about preventive measures, diagnostic plans, and treatment strategies. Individuals, in turn, can use their personal risk assessment to take targeted preventive action, such as adjusting their lifestyle and improving health management, to reduce their risk of developing CHD.

# The short plan

* Import the data.
* Make a brief preliminary overview of the data.
* If necessary, pre-process and clean each field.
* Check the final data set for duplicates and omissions, and evaluate its suitability for the study.
* Conduct an exploratory analysis of the data and formulate conclusions.
* Based on the findings, formulate a set of recommendations for medical research.

# A brief overview of the data

In [37]:
# Importing library modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the source data from the CSV file into the DataFrame df
df = pd.read_csv('framingham.csv')

# Display the first ten rows to check visually that the data loaded correctly
display(df.head(10))

# Display summary information about the DataFrame
# (df.info() prints its report itself and returns None, so print() adds a trailing "None")
print(df.info())

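| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |
| 5 | 0 | 43 | 2.0 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 228.0 | 180.0 | 110.0 | 30.3 | 77.0 | 99.0 | 0 |
| 6 | 0 | 63 | 1.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 205.0 | 138.0 | 71.0 | 33.11 | 60.0 | 85.0 | 1 |
| 7 | 0 | 45 | 2.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 313.0 | 100.0 | 71.0 | 21.68 | 79.0 | 78.0 | 0 |
| 8 | 1 | 52 | 1.0 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 260.0 | 141.5 | 89.0 | 26.36 | 76.0 | 79.0 | 0 |
| 9 | 1 | 43 | 1.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 162.0 | 107.0 | 23.61 | 93.0 | 88.0 | 0 |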


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB
None
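
The `df.info()` report shows that several columns have fewer than 4238 non-null entries; `glucose` has the most gaps (4238 − 3850 = 388 rows, roughly 9%). A minimal sketch of how these gaps could be tabulated per column follows. The helper name `missing_summary` and the toy frame are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with gaps, largest gap first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for df (values are illustrative, not the real dataset)
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, None, None, 103.0],
    "BMI": [26.97, 28.73, None, 28.58],
})

print(missing_summary(toy))
```

In the notebook, the same helper could be called as `missing_summary(df)` before any imputation takes place.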


# Data preprocessing

In [38]:
# Initial number of rows
data_len_start = df.shape[0]
print("Rows in the original set:", data_len_start)

Rows in the original set: 4238


## Education

In [39]:
# Count the number of missing entries in the 'education' column
missing_education = df['education'].isnull().sum()

# Calculate the mode (most frequent value) of education level
mode_education = df['education'].mode()[0]

# Fill missing values with the mode and convert to category type 
df['education'] = df['education'].fillna(mode_education).astype('category')

print(f"Processing 'education' column: filled {missing_education} missing values")

Processing 'education' column: filled 105 missing values
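
One detail of mode imputation is worth keeping in mind: `Series.mode()` can return more than one value when there is a tie, and the result is sorted, so `mode()[0]` picks the smallest of the tied values. The sketch below shows this on a toy series; the values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy series with a tie between 1.0 and 2.0 (illustrative values only)
education = pd.Series([1.0, 2.0, 2.0, None, 1.0])

modes = education.mode()      # NaN is ignored by default; the result is sorted
mode_value = modes.iloc[0]    # first entry, i.e. the smallest of the tied values

# Same pattern as in the notebook cell: fill with the mode, then convert to category
filled = education.fillna(mode_value).astype("category")

print("Modes:", modes.tolist())
print("Chosen fill value:", mode_value)
print(filled.tolist())
```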


## CigsPerDay

In [40]:
# Distinguish smokers and non-smokers
smoker_mask = df['currentSmoker'] == 1  # Smoker mask
non_smoker_mask = df['currentSmoker'] == 0  # Non-smoker mask

# Count the missing values separately for smokers and non-smokers
missing_cigs_smoker = df.loc[smoker_mask, 'cigsPerDay'].isnull().sum()
missing_cigs_non_smoker = df.loc[non_smoker_mask, 'cigsPerDay'].isnull().sum()
total_missing_cigs = missing_cigs_smoker + missing_cigs_non_smoker

# Calculate the median daily cigarette consumption for smokers (more resistant to outliers)
median_cigs = df.loc[smoker_mask, 'cigsPerDay'].median()

# Fill missing values with median for smokers, and 0 for non-smokers (reasonable assumption that non-smokers have 0 cigarette consumption)
df.loc[smoker_mask, 'cigsPerDay'] = df.loc[smoker_mask, 'cigsPerDay'].fillna(median_cigs)
df.loc[non_smoker_mask, 'cigsPerDay'] = df.loc[non_smoker_mask, 'cigsPerDay'].fillna(0)

print(f"Processing the 'cigsPerDay' column: a total of {total_missing_cigs} missing values were filled in ({missing_cigs_smoker} among smokers, {missing_cigs_non_smoker} among non-smokers)")

Processing the 'cigsPerDay' column: a total of 29 missing values were filled in (29 among smokers, 0 among non-smokers)
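
To see why the median is taken from smokers only, consider the toy example below. It is a sketch with made-up values, not data from the study: a median over all rows would be pulled toward zero by the non-smokers, whereas the smoker-only median reflects typical consumption among people who actually smoke.

```python
import pandas as pd

# Toy data: three smokers (one with a missing value) and two non-smokers
toy = pd.DataFrame({
    "currentSmoker": [1, 1, 1, 0, 0],
    "cigsPerDay": [10.0, 30.0, None, 0.0, None],
})

smoker_mask = toy["currentSmoker"] == 1

overall_median = toy["cigsPerDay"].median()                  # includes the non-smokers' zeros
smoker_median = toy.loc[smoker_mask, "cigsPerDay"].median()  # smokers only

# Fill smokers with the smoker-only median and non-smokers with 0
toy.loc[smoker_mask, "cigsPerDay"] = toy.loc[smoker_mask, "cigsPerDay"].fillna(smoker_median)
toy.loc[~smoker_mask, "cigsPerDay"] = toy.loc[~smoker_mask, "cigsPerDay"].fillna(0)

print("Median over all rows:", overall_median)
print("Median over smokers:", smoker_median)
print(toy)
```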


## BPMeds


In [41]:
# Calculate the mode of blood pressure medication usage
mode_bpmeds = df['BPMeds'].mode()[0]
# Fill missing values with the mode and convert to integer type
df['BPMeds'] = df['BPMeds'].fillna(mode_bpmeds).astype('int32')

## TotChol

In [42]:
# Calculate the median of total cholesterol (resistant to outliers)
median_totChol = df['totChol'].median()
# Fill missing values with the median
df['totChol'] = df['totChol'].fillna(median_totChol)

## BMI

In [43]:
# Calculate the median of BMI
median_BMI = df['BMI'].median()
# Fill missing values with the median
df['BMI'] = df['BMI'].fillna(median_BMI)

## HeartRate

In [44]:
# Calculate the median of heart rate
median_heartRate = df['heartRate'].median()
# Fill missing values with the median (only 1 missing value, simple to handle)
df['heartRate'] = df['heartRate'].fillna(median_heartRate)

## Glucose

In [45]:
# Calculate the median of glucose
median_glucose = df['glucose'].median()
# Fill missing values with the median
df['glucose'] = df['glucose'].fillna(median_glucose)
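
The four median imputations above follow the same pattern, and the plan also calls for a final check for duplicates and omissions. A minimal sketch of both steps is shown below, assuming a hypothetical helper `fill_with_median` and a toy frame with made-up values; neither is part of the original notebook.

```python
import pandas as pd


def fill_with_median(frame: pd.DataFrame, columns: list) -> dict:
    """Fill gaps in each listed column with that column's median.

    Modifies the frame in place and returns the number of values filled per column.
    """
    filled = {}
    for col in columns:
        filled[col] = int(frame[col].isnull().sum())
        frame[col] = frame[col].fillna(frame[col].median())
    return filled


# Toy frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    "totChol": [195.0, None, 245.0],
    "BMI": [26.97, 28.73, None],
    "heartRate": [80.0, 95.0, 75.0],
    "glucose": [None, 76.0, 70.0],
})

filled = fill_with_median(toy, ["totChol", "BMI", "heartRate", "glucose"])
print("Values filled per column:", filled)

# Final checks: no omissions left and no fully duplicated rows
print("Remaining missing values:", int(toy.isnull().sum().sum()))
print("Duplicate rows:", int(toy.duplicated().sum()))
```

Applied to the notebook's `df`, the last two lines would confirm whether the cleaned data set is free of gaps and duplicate rows before the exploratory analysis begins.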

# Exploratory analysis and visualization