## Analyzing Birth Statistics in California for Medi-Cal Patients
Salma Ibrahim, Dalia Cruz, Lisette Vasquez Perez

## Project Overview

In this project, we aim to design and implement a modular, reusable software system for analyzing real-world birth statistics data from California. Our focus is to apply **functional** and **object-oriented programming (OOP)** concepts in building a structured data analysis tool that can load, clean, transform, and visualize key patterns in the dataset.

We will also extract insights that answer meaningful public health questions related to maternal and infant health, while ensuring our code is clean, maintainable, and easy to test.

## 🎯 Goals and Learning Outcomes

- Enhance problem-solving and critical thinking through real-world data exploration.
- Apply **data cleaning**, **missing value handling**, and **data transformation** techniques.
- Perform **exploratory data analysis (EDA)** to identify trends and correlations.
- Utilize **object-oriented programming** to structure analysis tasks into reusable classes.
- Leverage **functional programming** (e.g., lambda functions, `map`, `filter`) for data transformation.
- Create clear and compelling **visualizations** using `pandas`, `matplotlib`, and `seaborn`.
- Design clean, modular code using separation of concerns and reusable components.
- Use **UML diagrams** to communicate program architecture and design.
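
As a small illustration of the functional style named in the goals above, the sketch below chains `filter`, `map`, and `reduce` over a few hand-written records. The records and the field names `payer` and `births` are invented for illustration and are not taken from the dataset.

```python
from functools import reduce

# Hypothetical records; the values are illustrative, not real dataset rows.
records = [
    {"payer": "Medi-Cal", "births": "1,200"},
    {"payer": "Private", "births": "950"},
    {"payer": "Medi-Cal", "births": "300"},
]

# filter: keep only the Medi-Cal rows
medi_cal = filter(lambda r: r["payer"] == "Medi-Cal", records)

# map: convert the formatted count strings to integers
counts = map(lambda r: int(r["births"].replace(",", "")), medi_cal)

# reduce: fold the counts into a single total
total_births = reduce(lambda acc, n: acc + n, counts, 0)
print(total_births)
```

Because `filter` and `map` return lazy iterators, no work happens until `reduce` consumes them, which keeps each step small and easy to test on its own.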

## 🧾 Dataset Background

- **Dataset**: Medi-Cal Birth Statistics by Select Characteristics  
- **Source**: [catalog.data.gov](https://catalog.data.gov/dataset/medi-cal-birth-statistics-by-select-characteristics-and-california-resident-hospital-birth-caf33)
- **Format**: CSV
- **Features include**:
  - Mother's age
  - Mother's health details
  - Mother's socioeconomic details
  - Birth weight
  - Prenatal care received
  - Delivery method
  - Infant health outcomes
  - Insurance type (e.g., Medi-Cal)

We chose this dataset because it reflects important real-world issues in maternal and infant health, which can inform healthcare policy, patient outreach, and education.

---

## ❓ Research Questions We Aim to Explore

1. Does maternal age impact birth outcomes (e.g., birth weight)?
2. Is there a difference in outcomes between Medi-Cal and non-Medi-Cal patients?
3. How does the number of prenatal care visits affect delivery method or outcome?
4. Are there regional differences in birth statistics within California?
5. What trends exist over time (e.g., changes in delivery methods or prenatal care)?

---

## 🔧 Planned Implementation Approach

- **Data Cleaning**:
  - Remove or impute missing data
  - Merge appropriate datasets
  - Standardize column names and data types

- **Object-Oriented Components**:
  - `DataLoader` class to handle file loading and basic preprocessing
  - `Analyzer` class to compute descriptive statistics and correlations
  - `Visualizer` class to produce all relevant plots

- **Functional Programming**:
  - Use `lambda`, `map`, `filter`, and higher-order functions for data transformation
  - Modular design for reusability across different datasets

- **Planned Visualizations**:
  - Histograms and box plots of birth weight and maternal age
  - Bar charts for categorical features
  - Heatmap of correlations
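
One possible shape for the planned classes is sketched below. It is a minimal sketch under assumptions, not the final design: the method names `load` and `total_by` and the in-memory sample data are hypothetical, and the `Visualizer` class is left out to keep the example short.

```python
import io

import pandas as pd


class DataLoader:
    """Loads a CSV and applies basic preprocessing (hypothetical interface)."""

    def __init__(self, source):
        # `source` may be a file path or any file-like object pandas can read.
        self.source = source

    def load(self):
        df = pd.read_csv(self.source)
        # Standardize column names: trim, lowercase, spaces -> underscores.
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        return df


class Analyzer:
    """Computes simple summaries on an already loaded DataFrame."""

    def __init__(self, df):
        self.df = df

    def total_by(self, group_col, value_col):
        # Sum the value column within each group.
        return self.df.groupby(group_col)[value_col].sum()


# Tiny in-memory CSV standing in for a real file; the values are invented.
sample_csv = io.StringIO(
    "Calendar Year,Payer,Number of Births\n"
    "2021,Medi-Cal,100\n"
    "2021,Private,80\n"
    "2022,Medi-Cal,120\n"
)

df = DataLoader(sample_csv).load()
totals = Analyzer(df).total_by("payer", "number_of_births")
print(totals)
```

Keeping loading and analysis in separate classes means the `Analyzer` can be tested with any small DataFrame, without touching the file system.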

---

## 📈 Expected Results and Challenges

We expect to uncover statistically significant relationships between demographic or care-related factors and birth outcomes. Potential challenges include:
- Incomplete or inconsistent data
- Bias in reported metrics (e.g., underreporting)
- Limited documentation
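
A quick audit can show how much of the data is incomplete before any analysis starts. The sketch below runs on an invented sample; the `*` marker for a suppressed count is an assumption, since the source does not say how such values are encoded.

```python
import io

import pandas as pd

# Invented sample: one blank count and one non-numeric marker ("*").
sample_csv = io.StringIO(
    "Calendar Year,Payer,Number of Births\n"
    "2021,Medi-Cal,100\n"
    "2021,Private,\n"
    "2022,Medi-Cal,*\n"
)
df = pd.read_csv(sample_csv)

# Coerce counts to numbers; anything unparseable becomes NaN.
df["Number of Births"] = pd.to_numeric(df["Number of Births"], errors="coerce")

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Coercing with `errors="coerce"` treats blanks and unparseable markers the same way, so a single `isna()` count covers both kinds of gaps.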


## 📚 References

- Dataset: https://catalog.data.gov/dataset/medi-cal-birth-statistics-by-select-characteristics-and-california-resident-hospital-birth-caf33


In [14]:
## Imports and Setup 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from functools import reduce

In [31]:

# Load datasets
data_dir = "data"

datasets = {
    "ethnicity": pd.read_csv(f"{data_dir}/ca-resident-births-by-cy-payer-maternal-race-ethnicity-and-select-birth-characteristics.csv"),
    "birthweight": pd.read_csv(f"{data_dir}/ca-resident-births-by-cy-payer-birthweight-range-and-select-birth-characteristics.csv"),
    "preterm": pd.read_csv(f"{data_dir}/ca-resident-births-by-cy-payer-very-preterm-status-and-select-birth-characteristics.csv"),
    "gestation": pd.read_csv(f"{data_dir}/ca-resident-births-by-cy-payer-length-of-gestation-and-select-birth-characteristics.csv"),
    # Note: the "education" key loads the method-of-delivery file;
    # education levels are selected later through the Grouper column.
    "education": pd.read_csv(f"{data_dir}/ca-resident-births-by-cy-payer-method-of-delivery-and-select-birth-characteristics.csv"),
}

# Filter helper
def filter_df(df, grouper=None, payer="Medi-Cal"):
    if payer and "Payer" in df.columns:
        df = df[df["Payer"] == payer]
    if grouper:
        df = df[df["Grouper"] == grouper]
    return df

# 1. Maternal Age and Ethnicity vs Birth Outcomes
q1_age = filter_df(datasets["birthweight"], grouper="Maternal Age Group")
q1_ethnicity = filter_df(datasets["ethnicity"], grouper="Maternal Race/Ethnicity")
q1 = pd.merge(q1_age, q1_ethnicity, on=["Calendar Year", "Payer", "Grouper", "Birth Characteristic"], how="outer")
q1 = q1.drop(q1.columns[[6, 7]], axis=1)


# 2. Maternal Education vs Birth Outcomes
q2 = filter_df(datasets["education"], grouper="Maternal Education Status")

# 3. Smoking During Pregnancy (filtering keywords)
q3 = filter_df(datasets["education"])
q3 = q3[q3["Birth Characteristic"].str.contains("smoking", case=False, na=False)]

# 4. BMI and Birth Outcomes
q4_bmi = filter_df(datasets["birthweight"], grouper="Pre-pregnancy BMI Group")

# 5. Racial Disparities in Birth Outcomes
q5_ethnicity = filter_df(datasets["ethnicity"], grouper="Maternal Race/Ethnicity")
q5_preterm = filter_df(datasets["preterm"], grouper="Maternal Race/Ethnicity")
q5 = pd.merge(q5_ethnicity, q5_preterm, on=["Calendar Year", "Payer", "Grouper", "Birth Characteristic"], how="outer")

q2.tail(20)


| | Calendar Year | Payer | Method of Delivery | Grouper | Birth Characteristic | Number of Births |
|---|---|---|---|---|---|---|
| 2767 | 2022 | Medi-Cal | Cesarean-Primary | Maternal Education Status | Bachelor's Degree or Higher | 3651 |
| 2768 | 2022 | Medi-Cal | Cesarean-Primary | Maternal Education Status | High School Graduate | 9277 |
| 2769 | 2022 | Medi-Cal | Cesarean-Primary | Maternal Education Status | Less Than High School | 3829 |
| 2770 | 2022 | Medi-Cal | Cesarean-Primary | Maternal Education Status | Some College or Associate's Degree | 8076 |
| 2771 | 2022 | Medi-Cal | Cesarean-Primary | Maternal Education Status | Unknown Education | 2477 |
| 2790 | 2022 | Medi-Cal | Cesarean-Repeat | Maternal Education Status | Bachelor's Degree or Higher | 2716 |
| 2791 | 2022 | Medi-Cal | Cesarean-Repeat | Maternal Education Status | High School Graduate | 10145 |
| 2792 | 2022 | Medi-Cal | Cesarean-Repeat | Maternal Education Status | Less Than High School | 5712 |
| 2793 | 2022 | Medi-Cal | Cesarean-Repeat | Maternal Education Status | Some College or Associate's Degree | 8124 |
| 2794 | 2022 | Medi-Cal | Cesarean-Repeat | Maternal Education Status | Unknown Education | 2659 |
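
Counts like those in the output above are easier to compare as shares. The sketch below pivots four of the 2022 Medi-Cal rows, copied by hand from that output, into an education-by-delivery-method table. The variable names are illustrative, and the shares are computed only across the two cesarean categories included here, so they are not overall cesarean rates.

```python
import pandas as pd

# Four rows copied from the output above (2022, Medi-Cal).
rows = pd.DataFrame(
    {
        "Method of Delivery": [
            "Cesarean-Primary",
            "Cesarean-Primary",
            "Cesarean-Repeat",
            "Cesarean-Repeat",
        ],
        "Birth Characteristic": [
            "High School Graduate",
            "Less Than High School",
            "High School Graduate",
            "Less Than High School",
        ],
        "Number of Births": [9277, 3829, 10145, 5712],
    }
)

# One row per education level, one column per delivery method.
table = rows.pivot_table(
    index="Birth Characteristic",
    columns="Method of Delivery",
    values="Number of Births",
    aggfunc="sum",
)

# Share of each delivery method within an education level (rows sum to 1).
shares = table.div(table.sum(axis=1), axis=0)
print(shares.round(3))
```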


In [None]:
## Cleaning data

In [None]:
## Exploratory Data Analysis

In [None]:
## Questions to be answered with data

In [None]:
## Visualizations Summary

In [None]:
## Conclusions