# Exploratory Data Analysis

- I use `ydata-profiling` to generate a complete statistical exploration of the data quickly.
- First, let's look at the current project folder structure.

In [4]:
!tree

.
├── artifacts
├── CODEOWNERS
├── container
│   ├── app
│   │   └── main.py
│   ├── Dockerfile
│   └── requirements.txt
├── data
│   └── census.csv
├── EDA.ipynb
├── LICENSE.txt
├── model_card_template.md
├── README.md
├── requirements.txt
├── screenshots
├── setup.py
├── src
│   ├── __init__.py
│   ├── ml
│   │   ├── data.py
│   │   ├── __init__.py
│   │   └── model.py
│   └── train_model.py
└── tests
    ├── README.md
    ├── sanitycheck.py
    ├── test_api.py
    └── test_model.py

9 directories, 20 files


- Read the data.
- Load it into a pandas DataFrame.
- Use `ydata-profiling` to export the statistical analysis in HTML format.

In [6]:
import os
import pandas as pd
from ydata_profiling import ProfileReport

# Read data (a notebook does not define __file__, so use a path relative to the working directory)
data_path = os.path.join("data", "census.csv")
df = pd.read_csv(data_path)

df.head()

| | age | workclass | fnlgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [7]:
# Generate profiling report
profile = ProfileReport(df, title="Data Profiling Report", explorative=True)

# Export to HTML
output_file = "artifacts/profiling_report.html"
profile.to_file(output_file=output_file)

## **General Dataset Overview**
- **Observations**: 32,561 rows
- **Variables**: 15 total (6 numeric, 9 categorical)
- **Memory usage**: ~20.2 MiB
- **Missing Values**: None (0%)
- **Duplicate Rows**: 23 (0.1%)
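
The overview figures can be cross-checked directly with pandas. The following is a minimal sketch, not the profiler's own computation: `dataset_overview` is a hypothetical helper, and the `toy` frame is a made-up stand-in for `census.csv` so the snippet runs on its own.

```python
import pandas as pd


def dataset_overview(df: pd.DataFrame) -> dict:
    """Recompute headline numbers similar to the profiling overview."""
    numeric_cols = df.select_dtypes(include="number").columns
    return {
        "rows": len(df),
        "variables": df.shape[1],
        "numeric": len(numeric_cols),
        "categorical": df.shape[1] - len(numeric_cols),
        "missing_cells": int(df.isna().sum().sum()),
        # Rows that repeat an earlier row (first occurrence not counted)
        "duplicate_rows": int(df.duplicated().sum()),
        "memory_mib": df.memory_usage(deep=True).sum() / 1024**2,
    }


# Toy stand-in for census.csv (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "age": [39, 50, 39],
    "workclass": ["State-gov", "Private", "State-gov"],
    "hours-per-week": [40, 13, 40],
})

overview = dataset_overview(toy)
print(overview)
```

In the notebook, calling `dataset_overview(df)` on the loaded census data should give numbers comparable to the report, although the profiler may count duplicates or memory slightly differently.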

---

## **Notable Alerts**
- **High Correlations**:
  - `education` ↔ `education-num`
  - `relationship` ↔ `sex`
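- **Imbalanced Features**:
  - `race`: flagged as imbalanced (65.6%)
  - `native-country`: flagged as imbalanced (82.5%); the dominant value is likely "United-States"
  - These percentages appear to be the profiler's imbalance scores rather than the share of rows in the majority class.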
- **Sparse Features**:
  - `capital-gain`: 91.7% zeros
  - `capital-loss`: 95.3% zeros
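
To quantify sparsity and class concentration by hand, one option is to compute the share of zeros and the share of the most frequent value per column. This is a sketch under assumptions: `zero_share` and `majority_share` are hypothetical helper names, and the `toy` frame contains made-up values.

```python
import pandas as pd


def zero_share(series: pd.Series) -> float:
    """Fraction of entries that are exactly zero."""
    return float((series == 0).mean())


def majority_share(series: pd.Series) -> float:
    """Fraction of entries taken by the most frequent value."""
    # value_counts sorts in descending order, so the first entry is the majority class
    return float(series.value_counts(normalize=True).iloc[0])


# Toy stand-in for census.csv (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "capital-gain": [0, 0, 0, 2174],
    "native-country": ["United-States", "United-States", "Cuba", "United-States"],
})

gain_zero_share = zero_share(toy["capital-gain"])
country_majority = majority_share(toy["native-country"])
print(f"capital-gain zeros: {gain_zero_share:.1%}")
print(f"native-country majority share: {country_majority:.1%}")
```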

---

## **Skewed & Interesting Distributions**
- **Skewness**:
  - `capital-gain`: 11.95 (very positively skewed)
  - `capital-loss`: 4.59 (also heavily skewed)
  - `fnlgt`: 1.45 (moderately skewed)
- This skew indicates strong outliers and a long right tail, especially in `capital-gain` and `capital-loss`.
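
Skewness can be recomputed with `Series.skew()`, and a `log1p` transform is one common way to shorten a long right tail. The sketch below uses a made-up `toy` column dominated by zeros; whether a log transform suits the real features is an assumption to check against the model being trained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a zero-heavy column such as capital-gain (hypothetical values)
toy = pd.DataFrame({"capital-gain": [0, 0, 0, 0, 0, 0, 0, 0, 2174, 99999]})

# Sample skewness of the raw values
raw_skew = toy["capital-gain"].skew()

# log1p(x) = log(1 + x) is defined at zero and compresses large values
log_skew = np.log1p(toy["capital-gain"]).skew()

print(f"raw skew: {raw_skew:.2f}, skew after log1p: {log_skew:.2f}")
```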

---

### **Duplicates**
Examples of repeated rows include:
- A 25-year-old female, private workclass, with `1st-4th` education from Guatemala repeated **3 times**.
- Several 19-year-olds with different occupations, repeated **2 times** each.
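
Repeated rows like these can be listed with a group-by over all columns. This is an illustrative sketch: `duplicate_groups` is a hypothetical helper, the `toy` rows are made up, and it assumes the frame has no column already named `count`.

```python
import pandas as pd


def duplicate_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Return each distinct row that occurs more than once, with its count."""
    counts = (
        df.groupby(list(df.columns), dropna=False)
        .size()
        .reset_index(name="count")
    )
    return counts[counts["count"] > 1].sort_values("count", ascending=False)


# Toy stand-in for census.csv (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "age": [25, 25, 25, 19, 19, 39],
    "workclass": ["Private"] * 5 + ["State-gov"],
    "education": ["1st-4th", "1st-4th", "1st-4th", "HS-grad", "HS-grad", "Bachelors"],
})

repeats = duplicate_groups(toy)
print(repeats)
```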

---

### **Correlations**
- As expected:
  - `education` and `education-num` are highly correlated (likely encoding the same info differently).
  - `relationship` and `sex` are also flagged as highly correlated.
  - A minor correlation was observed between `age` and `hours-per-week`.
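
Two quick checks can back up these observations: whether `education` and `education-num` map one-to-one, and how `sex` is distributed within each `relationship` value. The sketch below is an assumption-laden illustration: `is_one_to_one` is a hypothetical helper and the `toy` rows are made up.

```python
import pandas as pd


def is_one_to_one(df: pd.DataFrame, a: str, b: str) -> bool:
    """True if every value of column a pairs with exactly one value of b, and vice versa."""
    a_to_b = df.groupby(a)[b].nunique() == 1
    b_to_a = df.groupby(b)[a].nunique() == 1
    return bool(a_to_b.all() and b_to_a.all())


# Toy stand-in for census.csv (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "11th"],
    "education-num": [13, 9, 13, 7],
    "relationship": ["Husband", "Wife", "Husband", "Not-in-family"],
    "sex": ["Male", "Female", "Male", "Male"],
})

same_info = is_one_to_one(toy, "education", "education-num")

# Row-normalised contingency table: share of each sex within a relationship value
sex_by_relationship = pd.crosstab(toy["relationship"], toy["sex"], normalize="index")

print(same_info)
print(sex_by_relationship)
```

If the one-to-one check holds on the real data, one of the two education columns is redundant and can be dropped.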

---

### **Key Takeaways**
- The dataset has **no missing values** reported, although it does contain a few duplicate rows.
- Some fields are **heavily imbalanced** or **sparse**, which may affect modeling (e.g., logistic regression).
- Duplicates and strong correlations should be addressed in preprocessing.
- Outlier detection and treatment will be crucial, especially for skewed numeric features.
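
As one possible starting point for these preprocessing steps, the sketch below drops exact duplicates, removes `education` on the assumption that `education-num` carries the same information, and applies `log1p` to the two sparse capital columns. `basic_preprocess` is a hypothetical helper rather than code from `src/ml/data.py`, and the choices it makes are assumptions to validate against the model.

```python
import numpy as np
import pandas as pd


def basic_preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, a redundant column, and compress skewed columns."""
    out = df.drop_duplicates().reset_index(drop=True)
    # Assumes education-num encodes the same information as education
    out = out.drop(columns=["education"], errors="ignore")
    for col in ("capital-gain", "capital-loss"):
        if col in out.columns:
            out[col] = np.log1p(out[col])
    return out


# Toy stand-in for census.csv (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "age": [39, 39, 50],
    "education": ["Bachelors", "Bachelors", "HS-grad"],
    "education-num": [13, 13, 9],
    "capital-gain": [2174, 2174, 0],
    "capital-loss": [0, 0, 0],
})

cleaned = basic_preprocess(toy)
print(cleaned)
```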