In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Optional: larger, consistent style
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["axes.grid"] = True

In [None]:
df = pd.read_csv("adult.csv")
numeric_cols = ["age", "hours.per.week", "capital.gain", "capital.loss", "education.num"]
df[numeric_cols].head()

In [None]:
summary_stats = df[numeric_cols].describe().T
summary_stats

In [None]:
for col in numeric_cols:
    fig, ax = plt.subplots()
    ax.hist(df[col].dropna(), bins=40, density=True, alpha=0.6)
    
    # KDE using pandas (wrap in try in case of constant columns)
    try:
        df[col].plot(kind="kde", ax=ax)
    except Exception as e:
        print(f"Could not plot KDE for {col}: {e}")
    
    ax.set_title(f"Histogram + KDE of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Density")
    plt.show()

In [None]:
# Boxplots for outlier detection

for col in numeric_cols:
    fig, ax = plt.subplots(figsize=(7, 2.5))
    ax.boxplot(df[col].dropna(), vert=False)
    ax.set_title(f"Boxplot of {col}")
    ax.set_xlabel(col)
    plt.show()

In [None]:
# Simple skewness + IQR-based outlier counts (numeric summary)

skewness = df[numeric_cols].skew()

outlier_info = {}

for col in numeric_cols:
    x = df[col].dropna()
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = x[(x < lower) | (x > upper)]
    outlier_info[col] = {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_bound": lower,
        "upper_bound": upper,
        "num_outliers": outliers.shape[0]
    }

skewness, outlier_info

In [None]:
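# DataFrame.skew() returns the bias-adjusted Fisher-Pearson coefficient,
# not one of Pearson's mode/median skewness coefficients.
print("Skewness (adjusted Fisher-Pearson, pandas default):")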
print(skewness)
print("\nOutlier summary (IQR rule):")
for col, info in outlier_info.items():
    print(f"\nColumn: {col}")
    print(f"  Q1 = {info['q1']:.2f}, Q3 = {info['q3']:.2f}, IQR = {info['iqr']:.2f}")
    print(f"  Lower bound = {info['lower_bound']:.2f}, Upper bound = {info['upper_bound']:.2f}")
    print(f"  # of outliers = {info['num_outliers']}")

# Interpretation of Numeric Feature Distributions, Skewness, Outliers, and Correlation

## 1. Age
**Skewness:** 0.56  
The age distribution is moderately right-skewed. Most individuals fall between ages 25 and 50, and the density gradually decreases as age increases. The right tail contains a small number of older individuals (ages 78+), who appear as outliers in the boxplot. These outliers are plausible natural extremes rather than errors.

**Outliers (IQR):** 143  
These represent older individuals and are reasonable observations.

**Conclusion:** Age is roughly bell-shaped with moderate right skew and only a few meaningful outliers.
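
The skewness values quoted in this write-up come from pandas' `skew()`, which reports the bias-adjusted Fisher-Pearson coefficient. As a minimal sketch of what that number measures, the snippet below recomputes it by hand and compares it with pandas. The helper `adjusted_skew` and the sample values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def adjusted_skew(values):
    """Bias-adjusted Fisher-Pearson skewness, computed by hand."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]
    n = x.size
    if n < 3:
        return float("nan")  # the adjustment is undefined for n < 3
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # biased second central moment
    m3 = np.mean(centered ** 3)  # biased third central moment
    if m2 == 0:
        return 0.0  # no spread, so no asymmetry to measure
    g1 = m3 / m2 ** 1.5  # unadjusted (population) skewness
    return float(np.sqrt(n * (n - 1)) / (n - 2) * g1)


# Made-up, right-skewed sample (not taken from adult.csv)
sample = pd.Series([22, 25, 31, 38, 44, 52, 67, 90])
manual_skew = adjusted_skew(sample)
pandas_skew = float(sample.skew())
print(f"manual: {manual_skew:.4f}, pandas: {pandas_skew:.4f}")
```

A positive value means the right tail is longer than the left, which is the situation described for age.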

---

## 2. Hours per Week
**Skewness:** 0.23  
This distribution is nearly symmetric but contains an extremely strong peak at 40 hours, representing full-time work. The histogram and KDE both reflect a very concentrated center with part-time workers (1–30 hours) forming a left tail and overtime workers (50–99 hours) forming a right tail.

**Outliers (IQR):** 9,008  
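The IQR method marks a very large number of values as outliers because the interquartile range is narrow (Q1 = 40, Q3 = 45, so IQR = 5). The fences therefore sit at 32.5 and 52.5 hours, and ordinary part-time and overtime schedules fall outside them. These are not true anomalies; they reflect natural variability in work hours.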

**Conclusion:** The distribution is sharply centered around 40 hours, and the high outlier count is a limitation of the IQR method, not actual data issues.
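
To make the limitation concrete, here is a small sketch that applies the same 1.5 × IQR rule to synthetic hours data with a spike at 40. The helper name `iqr_outlier_mask` and the made-up values are assumptions for illustration; they are not drawn from `adult.csv`.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return (mask, lower, upper) for the k * IQR outlier rule."""
    x = pd.Series(values, dtype=float)
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper), lower, upper


# Synthetic hours: half the sample at 40, a quarter at 45,
# plus part-time (20, 30) and overtime (60) workers.
hours = [20] * 10 + [30] * 5 + [40] * 50 + [45] * 25 + [60] * 10
mask, lower, upper = iqr_outlier_mask(hours)

print(f"fences: [{lower}, {upper}]")          # fences: [32.5, 52.5]
print(f"flagged: {int(mask.sum())} of {len(hours)}")  # flagged: 25 of 100
```

In this toy sample a quarter of perfectly ordinary schedules are flagged, purely because the middle 50% of the data is packed into a 5-hour window.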

---

## 3. Capital Gain
**Skewness:** 11.95  
This is an extremely right-skewed distribution. Almost all individuals have a capital gain of 0, with very sparse but very high values ranging from 10,000 to over 90,000. The histogram and KDE show a tight spike at zero and a long, sparse right tail.

**Outliers (IQR):** 2,712  
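With the bulk of the values at 0, Q1 and Q3 both equal 0 and the IQR fences collapse to 0, so the rule flags every non-zero gain, not only the very large ones. These values reflect the relatively few individuals who report investment gains.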

**Conclusion:** Capital gain is a zero-inflated variable with meaningful, but rare, large values. For modeling, a `log1p` transform (which is defined at zero, unlike a plain log) or a binary "has any gain" indicator would be appropriate.
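
A minimal sketch of both options, assuming a zero-inflated column shaped like `capital.gain`. The values below are made up for illustration, and the feature names `has_gain` and `log_gain` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a zero-inflated column such as capital.gain
gains = pd.Series([0, 0, 0, 0, 0, 0, 2174, 0, 14084, 99999], name="capital.gain")

features = pd.DataFrame({
    # Binarization: 1 if any gain was reported, else 0
    "has_gain": (gains > 0).astype(int),
    # log1p(x) = log(1 + x), so zeros map to 0 instead of -inf
    "log_gain": np.log1p(gains),
})

print(features)
```

The two features can also be used together: the indicator captures whether a gain exists, and the log-scaled value captures its size without letting the largest values dominate.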

---

## 4. Capital Loss
**Skewness:** 4.59  
Capital loss also exhibits strong right skew. Most values are zero, and the remainder cluster around a few common loss amounts (e.g., 1,880, 2,000, 2,400). The KDE confirms a sharp peak at zero with long, sparse right tail behavior.

**Outliers (IQR):** 1,519  
Like capital gain, these reflect uncommon but legitimate loss values.

**Conclusion:** Capital loss behaves similarly to capital gain: sparse, heavily skewed, and dominated by zeros. It should be treated as a zero-inflated feature.

---

## 5. Education Num
**Skewness:** -0.31  
The education level distribution is slightly left-skewed. Most individuals have values between 9 and 12, roughly high school graduate through some college or an associate degree. Individuals with values below 5 appear as outliers in the boxplot.

**Outliers (IQR):** 1,198  
These low-education values represent valid categories (e.g., elementary school).

**Conclusion:** Education num is close to symmetric, with a concentration around typical schooling levels and a group of meaningful lower-end outliers.

---

## Correlation Analysis
Correlation among the numeric variables is generally weak, indicating limited linear relationships:

- **Age** has weak positive correlations with education level and hours worked, suggesting that older individuals may have slightly higher education or work more consistently.
- **Education num** shows weak positive correlation with capital gain, reflecting that more educated individuals may be more likely to have investment income.
- **Capital gain** and **capital loss** are only weakly associated with each other; both are zero for most individuals, so neither is a useful linear predictor of the other.
- **Hours per week** shows no strong correlation with any other numeric variable.

**Conclusion:**  
There is no problematic multicollinearity among the numeric features. Capital gain and loss behave more like sparse, binary-style indicators than traditional continuous variables. Age, hours per week, and education num behave as more typical continuous variables with modest spread and predictable distribution shapes.
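
A correlation table of this kind can be computed with `DataFrame.corr()`. The sketch below runs on randomly generated stand-in data whose column names mirror `numeric_cols`; the data, the Spearman comparison, and the "largest off-diagonal |r|" check are assumptions added for illustration, not the procedure used for the statements above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Randomly generated stand-in data; column names mirror numeric_cols
demo = pd.DataFrame({
    "age": rng.integers(17, 91, size=n),
    "hours.per.week": rng.integers(1, 100, size=n),
    "education.num": rng.integers(1, 17, size=n),
    # Zero-inflated column: mostly zeros, occasional large values
    "capital.gain": np.where(
        rng.random(n) < 0.9, 0, rng.integers(1000, 100000, size=n)
    ),
})

pearson = demo.corr(method="pearson")    # linear association
spearman = demo.corr(method="spearman")  # rank-based, less tail-sensitive

# Largest absolute correlation between distinct columns,
# as a quick screen for multicollinearity.
off_diag = pearson.where(~np.eye(len(pearson), dtype=bool))
max_abs_corr = float(off_diag.abs().max().max())

print(pearson.round(2))
print(f"Largest |r| between distinct columns: {max_abs_corr:.2f}")
```

On the real data, replacing `demo` with `df[numeric_cols]` would give the corresponding table. Spearman is worth checking alongside Pearson here because the capital columns are dominated by zeros and a few extreme values.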

---

## Overall Summary
- Age shows moderate right skew with a few reasonable high-age outliers.  
- Hours per week is tightly centered around 40; IQR identifies many values as outliers, but these are normal variations in work schedules.  
- Capital gain and loss are extremely right-skewed, zero-inflated, and dominated by rare large values.  
- Education num is mostly symmetric with meaningful low-end outliers.  
- Correlations among the numeric variables are weak, indicating that they carry largely non-redundant linear information. Weak correlation alone does not establish statistical independence.

Together, these observations summarize the distribution shapes, skewness, potential outliers, and linear relationships among the numeric variables in the dataset.
