This notebook gives a concise summary of the birth weight column (bw) in the IHDP (Infant Health and Development Program) dataset. It first uses pandas to compute the column's mean, median, and mode, then repeats the same calculations using only the Python standard library as a cross-check. Finally, it visualizes the distribution of birth weights. Because pandas is the only library allowed for the visualization, we cannot draw a graphical histogram or bar chart, since those require a plotting library such as matplotlib or plotly; the notebook prints a text-based histogram instead.

In [None]:
#Data summary for IHDP dataset using pandas
import pandas as pd

df = pd.read_csv("HW1_IHDP.csv")

bw = df["bw"]

bw_mean = bw.mean()
bw_median = bw.median()
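# mode() returns a Series because several values can tie for most frequent;
# the Series is sorted, so iloc[0] picks the smallest of the tied values.
bw_mode_series = bw.mode()
bw_mode = bw_mode_series.iloc[0]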

print("Birth weight (bw) statistics using pandas:")
print("Mean  :", bw_mean)
print("Median:", bw_median)
print("Mode  :", bw_mode)


Birth weight (bw) statistics using pandas:
Mean  : 1795.867005076142
Median: 1860.0
Mode  : 2340
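
The cell above keeps only the first entry of `bw.mode()`. As a minimal sketch of why that choice matters, the example below uses a small made-up sample (hypothetical values, not rows from HW1_IHDP.csv) in which two values tie for the mode. With pandas' default settings, missing values are ignored and the tied modes come back in ascending order.

```python
import pandas as pd

# Hypothetical sample values, not taken from HW1_IHDP.csv
sample = pd.Series([2340, 1860, 2340, 1500, 1860, None])

modes = sample.mode()        # all tied modes, sorted ascending; NaN is dropped by default
first_mode = modes.iloc[0]   # what the notebook's .iloc[0] would report

print("All modes :", modes.tolist())
print("First mode:", first_mode)
```

If ties are possible in the real column, reporting the full list rather than only the first entry avoids hiding a second mode.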


In [None]:
#Data summary for IHDP dataset using only the Python standard library
import csv

bw_values = []

with open("HW1_IHDP.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        value_str = row["bw"].strip()
        if value_str:  # skip blank cells, which pandas also treats as missing
            bw_values.append(float(value_str))

n = len(bw_values)

# mean
bw_sum = 0
for v in bw_values:
    bw_sum += v
bw_mean = bw_sum / n

# median
sorted_bw = sorted(bw_values)

if n % 2 == 1:
    bw_median = sorted_bw[n // 2]
else:
    mid1 = sorted_bw[n // 2 - 1]
    mid2 = sorted_bw[n // 2]
    bw_median = (mid1 + mid2) / 2

# mode
counts = {}
for v in bw_values:
    if v in counts:
        counts[v] += 1
    else:
        counts[v] = 1

max_count = 0
for v in counts:
    if counts[v] > max_count:
        max_count = counts[v]

bw_modes = []
for v in counts:
    if counts[v] == max_count:
        bw_modes.append(v)

print("Birth weight (bw) statistics using only the Python standard library:")
print("Number of observations:", n)
print("Mean   :", bw_mean)
print("Median :", bw_median)
print("Mode(s):", bw_modes)

Birth weight (bw) statistics using only the Python standard library:
Number of observations: 985
Mean   : 1795.867005076142
Median : 1860.0
Mode(s): [2340.0]
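
The hand-written loops above can also be cross-checked against the standard library's `statistics` module. The sketch below assumes Python 3.8 or later (for `fmean` and `multimode`) and uses a small made-up sample rather than the IHDP data.

```python
import statistics

# Hypothetical sample values, not taken from HW1_IHDP.csv
sample = [2340.0, 1860.0, 2340.0, 1500.0, 1860.0, 2100.0]

mean = statistics.fmean(sample)        # arithmetic mean as a float
median = statistics.median(sample)     # averages the two middle values when n is even
modes = statistics.multimode(sample)   # every tied mode, in order of first appearance

print("Mean   :", mean)
print("Median :", median)
print("Mode(s):", modes)
```

Unlike pandas' `mode()`, `multimode` does not sort its result, so tied modes may appear in a different order between the two approaches.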


In [1]:
#Data visualization using pandas
import pandas as pd

df = pd.read_csv("HW1_IHDP.csv")
bw = df["bw"].dropna()

num_bins = 10

min_bw = bw.min()
max_bw = bw.max()

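bin_width = (max_bw - min_bw) / num_bins
if bin_width == 0:
    bin_width = 1.0  # all values identical: use a nominal width to avoid dividing by zero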

bin_edges = [min_bw + i * bin_width for i in range(num_bins + 1)]

counts = [0] * num_bins

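for v in bw:
    # Each bin covers [left, right); the index is the number of whole bin widths above the minimum.
    idx = int((v - min_bw) / bin_width)
    # The maximum value falls exactly on the last edge, so fold it into the final bin.
    if idx == num_bins:
        idx -= 1
    counts[idx] += 1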

max_count = max(counts)
max_bar_width = 40  # number of characters used to draw the tallest bar

print("Histogram of birth weight (bw) \n")

for i, count in enumerate(counts):
    left = int(bin_edges[i])
    right = int(bin_edges[i + 1])

    if max_count > 0:
        bar_len = int(count / max_count * max_bar_width)
    else:
        bar_len = 0

    if count > 0 and bar_len == 0:
        bar_len = 1

    bar = "█" * bar_len

    print(f"{left:4d}-{right:4d} | {bar} ({count})")

Histogram of birth weight (bw) 

 540- 736 | █ (5)
 736- 932 | ███████████ (45)
 932-1128 | █████████████ (55)
1128-1324 | █████████████████ (69)
1324-1520 | █████████████████████ (86)
1520-1716 | ██████████████████████████████ (123)
1716-1912 | ██████████████████████████████████████ (155)
1912-2108 | ███████████████████████████████████ (144)
2108-2304 | ████████████████████████████████████████ (161)
2304-2500 | ███████████████████████████████████ (142)
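
To make the binning rule concrete, here is a small worked sketch of the index calculation. The range 540 to 2500 and the 10 bins are read off the histogram labels above (so each bin is 196 wide); the test values are illustrative edge cases, not specific rows of the dataset, and `bin_index` is a helper name introduced only for this example.

```python
# Range and bin count read off the histogram output above
min_bw, max_bw, num_bins = 540, 2500, 10
bin_width = (max_bw - min_bw) / num_bins  # 196.0

def bin_index(value):
    """Return the 0-based bin for value, folding the maximum into the last bin."""
    idx = int((value - min_bw) / bin_width)
    return min(idx, num_bins - 1)

# Illustrative values chosen to sit on or near bin edges
for v in (540, 735, 736, 1860, 2500):
    print(v, "-> bin", bin_index(v))
```

A value equal to a bin's right edge (736 here) belongs to the next bin, and only the overall maximum is pulled back into the final bin.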
