In [1]:
import pandas as pd
import numpy as np

# Part 1

In [2]:
heartTraindf = pd.read_csv("Part1_dataset/heart-train.csv")
heartTestdf = pd.read_csv("Part1_dataset/heart-test.csv")

### 1. Classify the dataset columns as nominal, categorical, continuous, etc.

In [9]:
heartTraindf.head()

Unnamed: 0,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
1,132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
2,142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
3,114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
4,114,0.0,3.83,19.4,Present,49,24.86,2.49,29,0


In [10]:
heartTraindf.dtypes

sbp            int64
tobacco      float64
ldl          float64
adiposity    float64
famhist       object
typea          int64
obesity      float64
alcohol      float64
age            int64
chd            int64
dtype: object

By observing the values in each column, we can conclude the following:

sbp (Systolic blood pressure):
  - Datatype: Continuous

tobacco (Cumulative tobacco consumption, in kg):
  - Datatype: Continuous

ldl (Low-density lipoprotein cholesterol):
  - Datatype: Continuous

adiposity (Adipose tissue concentration):
  - Datatype: Continuous

famhist (Family history of heart disease; stored as the strings `Present` and `Absent`):
  - Datatype: Nominal

typea (Score on a test designed to measure type-A behavior):
  - Datatype: Continuous

obesity (Obesity):
  - Datatype: Continuous

alcohol (Current consumption of alcohol):
  - Datatype: Continuous

age (Age of subject):
  - Datatype: Continuous

chd (Coronary heart disease; 1=Yes, 0=No):
  - Datatype: Nominal


_The `age` column is sometimes treated as a discrete variable. In this analysis, however, we treat it as continuous._
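The classification above can also be recorded in code, so that later steps can select columns by kind. The following is a minimal sketch, assuming the ten columns shown in the `dtypes` output; the names `sample`, `nominal_cols`, `continuous_cols`, and `typed` are introduced here for illustration, and the five rows are copied from the `head()` output so that the block runs without the CSV files.

```python
import pandas as pd

# Five rows copied from the heartTraindf.head() output above.
# In the notebook itself, heartTraindf would be used instead of this sample.
sample = pd.DataFrame({
    "sbp": [134, 132, 142, 114, 114],
    "tobacco": [13.6, 6.2, 4.05, 4.08, 0.0],
    "ldl": [3.5, 6.47, 3.38, 4.59, 3.83],
    "adiposity": [27.78, 36.21, 16.2, 14.6, 19.4],
    "famhist": ["Present", "Present", "Absent", "Present", "Present"],
    "typea": [60, 62, 59, 62, 49],
    "obesity": [25.99, 30.77, 20.81, 23.11, 24.86],
    "alcohol": [57.34, 14.14, 2.62, 6.72, 2.49],
    "age": [49, 45, 38, 58, 29],
    "chd": [1, 0, 0, 1, 0],
})

# Column groups following the manual classification above (an assumption
# of this sketch, not something pandas infers on its own).
nominal_cols = ["famhist", "chd"]
continuous_cols = [c for c in sample.columns if c not in nominal_cols]

# Mark the nominal columns explicitly so they are not treated as numbers.
typed = sample.copy()
for col in nominal_cols:
    typed[col] = typed[col].astype("category")

print(continuous_cols)
print(typed.dtypes)
```

Converting `chd` to the `category` dtype keeps it out of numeric summaries such as `describe()`, which otherwise treats the 0/1 label as an ordinary integer column.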

### 2. Present insights about the data.

In [22]:
# General statistics for the continuous columns
# (iloc drops the chd label; describe() skips the non-numeric famhist column).

heartTraindf.iloc[:,:-1].describe()

Unnamed: 0,sbp,tobacco,ldl,adiposity,typea,obesity,alcohol,age
count,412.0,412.0,412.0,412.0,412.0,412.0,412.0,412.0
mean,139.240291,3.666262,4.589539,25.151214,52.135922,25.802112,18.030073,42.686893
std,20.451903,4.518501,1.883744,7.740794,9.592727,4.081745,25.298909,15.129338
min,101.0,0.0,0.98,6.74,20.0,17.89,0.0,15.0
25%,125.5,0.0375,3.24,19.3975,46.0,22.7375,0.4475,30.75
50%,136.0,1.805,4.225,26.09,52.0,25.635,7.51,45.0
75%,148.0,5.85,5.5275,30.755,58.0,28.1675,24.96,57.0
max,218.0,27.4,14.16,42.49,73.0,45.72,145.29,64.0


In [26]:
print(heartTraindf['famhist'].value_counts())
print(heartTraindf['chd'].value_counts())

Absent     239
Present    173
Name: famhist, dtype: int64
0    275
1    137
Name: chd, dtype: int64


_**Insight**_



#### Insights from Simple Statistics:
sbp (Systolic Blood Pressure):
 - The average systolic blood pressure is around 139.24 with a standard deviation of 20.45.
 - The minimum blood pressure recorded is 101, and the maximum is 218.

tobacco (Cumulative Tobacco Consumption):
 - The average cumulative tobacco consumption is approximately 3.67 kg, with a standard deviation of 4.52.
 - There is a wide range of tobacco consumption, from 0 to 27.4 kg.

ldl (Low-Density Lipoprotein Cholesterol):
 - The average LDL cholesterol level is around 4.59, with a standard deviation of 1.88.
 - The minimum LDL cholesterol level is 0.98, and the maximum is 14.16.

adiposity (Adipose Tissue Concentration):
 - The average adiposity is 25.15, with a standard deviation of 7.74.
 - Adiposity ranges from 6.74 to 42.49.

typea (Type-A Behavior Score):
 - The average type-A behavior score is 52.14, with a standard deviation of 9.59.
 - Scores range from 20 to 73.

obesity:
 - The average obesity level is 25.80, with a standard deviation of 4.08.
 - Obesity levels range from 17.89 to 45.72.

alcohol (Alcohol Consumption):
 - The average alcohol consumption is 18.03, with a wide standard deviation of 25.30.
 - Some individuals have very high alcohol consumption, as indicated by the maximum value of 145.29.

age:
 - The average age is 42.69, with a standard deviation of 15.13.
 - Age ranges from 15 to 64.

#### Insights from Value Counts:
famhist (Family History of Heart Disease):
 - 239 individuals (about 58%) have no family history of heart disease (Absent), while 173 (about 42%) have a family history (Present).

chd (Coronary Heart Disease):
 - 275 individuals (about 67%) do not have coronary heart disease (chd=0), while 137 (about 33%) do (chd=1), so the two classes are moderately imbalanced.
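The counts and summary statistics above can be turned into proportions and a rough skewness check. This is a minimal sketch: the two label series are rebuilt from the printed counts, and the mean and median values are copied from the `describe()` output, so the block runs without the CSV. The names `chd_share`, `famhist_share`, and `summary` are introduced here for illustration, and the mean-to-median ratio is only an informal indicator, not a formal skewness statistic.

```python
import pandas as pd

# Rebuild the two label columns from the value counts printed above.
chd = pd.Series([0] * 275 + [1] * 137, name="chd")
famhist = pd.Series(["Absent"] * 239 + ["Present"] * 173, name="famhist")

# normalize=True returns class proportions instead of raw counts.
chd_share = chd.value_counts(normalize=True)
famhist_share = famhist.value_counts(normalize=True)
print(chd_share.round(3))
print(famhist_share.round(3))

# Mean and median (the 50% row) copied from the describe() output above.
summary = pd.DataFrame(
    {"mean": [3.666262, 18.030073, 42.686893], "median": [1.805, 7.51, 45.0]},
    index=["tobacco", "alcohol", "age"],
)

# A mean far above the median hints at a long right tail.
summary["mean_over_median"] = summary["mean"] / summary["median"]
print(summary.round(2))
```

On the real data, `heartTraindf['chd'].value_counts(normalize=True)` gives the same proportions directly. For `tobacco` and `alcohol` the mean is more than twice the median, which suggests that a small number of heavy consumers pull the average up, whereas `age` shows no such gap.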

### 3. Find the number of null values for each column

In [36]:
heartTraindf.isnull().sum()

sbp          0
tobacco      0
ldl          0
adiposity    0
famhist      0
typea        0
obesity      0
alcohol      0
age          0
chd          0
dtype: int64

In [37]:
heartTestdf.isnull().sum()

ID           0
sbp          0
tobacco      0
ldl          0
adiposity    0
famhist      0
typea        0
obesity      0
alcohol      0
age          0
dtype: int64
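The two null checks above can be combined into one side-by-side report. This is a minimal sketch: `null_report` is a hypothetical helper, not part of pandas, and the small `train` and `test` frames are toy data (the `test` rows, including the one missing value, are made up purely to show what a non-zero count looks like).

```python
import numpy as np
import pandas as pd


def null_report(frames):
    """Return null counts per column, with one report column per named DataFrame."""
    return pd.DataFrame({name: df.isnull().sum() for name, df in frames.items()})


# Toy frames: the train rows come from the head() output above;
# the test rows are invented for illustration and are NOT from heart-test.csv.
train = pd.DataFrame({
    "sbp": [134, 132, 142],
    "famhist": ["Present", "Present", "Absent"],
    "chd": [1, 0, 0],
})
test = pd.DataFrame({
    "ID": [0, 1, 2],
    "sbp": [120.0, np.nan, 118.0],  # one artificial missing value
    "famhist": ["Absent", "Present", "Absent"],
})

report = null_report({"train": train, "test": test})
print(report)
```

A `NaN` in the report means that the column does not exist in that frame, which mirrors the real data: `chd` is absent from the test set and `ID` is absent from the training set. In the notebook, the call would be `null_report({"train": heartTraindf, "test": heartTestdf})`.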

### 4. Know about the patients: