# Task 3: Hypothesis Generation

The hypotheses below are statistically testable against the UCI Adult dataset.  
Each hypothesis has:
- **H0 (Null):** no association between the variables, or no difference between groups
- **H1 (Alternative):** an association exists, or the groups differ


## Hypothesis Statements

### Hypothesis 1 — Education vs Income
- **H0:** Income category is independent of education level.
- **H1:** Income category depends on education level.

**Why:** Education is commonly linked to earning potential.

---

### Hypothesis 2 — Hours per Week vs Income
- **H0:** Mean hours worked per week is the same for `<=50K` and `>50K`.
- **H1:** Mean hours worked per week differs between `<=50K` and `>50K`.

**Why:** Higher-earning jobs may involve longer working hours.

---

### Hypothesis 3 — Sex vs Income
- **H0:** Income category is independent of sex.
- **H1:** Income category depends on sex.

**Why:** Pay gaps between men and women are widely documented, so the share earning `>50K` may differ by sex.

---

### Hypothesis 4 — Workclass vs Income
- **H0:** Income category is independent of workclass.
- **H1:** Income category depends on workclass.

**Why:** Private-sector, government, and self-employed workers may differ in the share earning `>50K`.

---

### Hypothesis 5 — Marital Status vs Income
- **H0:** Income category is independent of marital status.
- **H1:** Income category depends on marital status.

**Why:** Household/earning patterns often correlate with marital status.

---


In [47]:
## Quick Data Check Supporting Hypotheses

# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Adult dataset column names (per UCI documentation)
columns = [
    "Age", "Workclass", "Fnlwgt", "Education", "Education_Num",
    "Marital_Status", "Occupation", "Relationship", "Race", "Sex",
    "Capital_Gain", "Capital_Loss", "Hours_per_Week", "Native_Country", "Income"
]

# Load adult.data (no header). Treat '?' as missing.
df = pd.read_csv(
    "adult.data",
    header=None,
    names=columns,
    na_values="?",
    skipinitialspace=True
)

# Basic target distribution
df["Income"].value_counts(dropna=False)



Income
<=50K    24720
>50K      7841
Name: count, dtype: int64

### Hypothesis 1 — Education vs Income


In [50]:
# Quick cross-tab example (Education vs Income) - not a statistical test yet
pd.crosstab(df["Education"], df["Income"], normalize="index").head(10)

| Education | <=50K | >50K |
|---|---|---|
| 10th | 0.933548 | 0.066452 |
| 11th | 0.948936 | 0.051064 |
| 12th | 0.923788 | 0.076212 |
| 1st-4th | 0.964286 | 0.035714 |
| 5th-6th | 0.951952 | 0.048048 |
| 7th-8th | 0.93808 | 0.06192 |
| 9th | 0.947471 | 0.052529 |
| Assoc-acdm | 0.75164 | 0.24836 |
| Assoc-voc | 0.738784 | 0.261216 |
| Bachelors | 0.585247 | 0.414753 |
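
The row shares above are descriptive only. One possible way to test H0 for a pair of categorical columns is a chi-square test of independence on the raw counts. The sketch below assumes SciPy is available; the helper name `chi_square_independence` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_independence(frame, row_col, col_col="Income"):
    """Chi-square test of independence between two categorical columns.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # The test needs raw counts, so no normalize= argument here.
    table = pd.crosstab(frame[row_col], frame[col_col])
    # For 2x2 tables SciPy applies Yates' continuity correction by default.
    chi2, p_value, dof, expected = chi2_contingency(table)
    return {
        "chi2": chi2,
        "p_value": p_value,
        "dof": dof,
        "min_expected": expected.min(),
    }


# Made-up toy data, only to show the call; these are NOT Adult dataset values.
toy = pd.DataFrame({
    "Education": ["HS-grad"] * 40 + ["Bachelors"] * 40,
    "Income": ["<=50K"] * 30 + [">50K"] * 10 + ["<=50K"] * 15 + [">50K"] * 25,
})
result = chi_square_independence(toy, "Education")
print(result)
```

On the loaded data the equivalent call would be `chi_square_independence(df, "Education")`, and the same helper could be reused with `"Sex"`, `"Workclass"`, or `"Marital_Status"`. Rows with a missing value in either column are left out by `pd.crosstab`.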


### Hypothesis 2 — Hours per Week vs Income


In [53]:
# Quick cross-tab example (Hours per Week vs Income) - not a statistical test yet
pd.crosstab(df["Hours_per_Week"], df["Income"], normalize="index").head(10)

| Hours_per_Week | <=50K | >50K |
|---|---|---|
| 1 | 0.9 | 0.1 |
| 2 | 0.75 | 0.25 |
| 3 | 0.974359 | 0.025641 |
| 4 | 0.944444 | 0.055556 |
| 5 | 0.883333 | 0.116667 |
| 6 | 0.875 | 0.125 |
| 7 | 0.846154 | 0.153846 |
| 8 | 0.924138 | 0.075862 |
| 9 | 0.944444 | 0.055556 |
| 10 | 0.928058 | 0.071942 |
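
A cross-tab over every distinct hours value does not directly address Hypothesis 2, which compares group means. A more direct check, sketched here under the assumption that SciPy is installed, computes the mean hours per income group and runs Welch's t-test, which does not assume equal variances. The function name `compare_hours_by_income` and the toy values are made up for illustration.

```python
import pandas as pd
from scipy.stats import ttest_ind


def compare_hours_by_income(frame, value_col="Hours_per_Week", group_col="Income"):
    """Compare mean hours between the two income groups (hypothetical helper)."""
    low = frame.loc[frame[group_col] == "<=50K", value_col].dropna()
    high = frame.loc[frame[group_col] == ">50K", value_col].dropna()
    # equal_var=False selects Welch's t-test instead of Student's t-test.
    t_stat, p_value = ttest_ind(high, low, equal_var=False)
    return {
        "mean_low": low.mean(),
        "mean_high": high.mean(),
        "t_stat": t_stat,
        "p_value": p_value,
    }


# Made-up toy data, only to show the call; these are NOT Adult dataset values.
toy = pd.DataFrame({
    "Hours_per_Week": [35, 40, 40, 38, 42, 45, 50, 55, 48, 60],
    "Income": ["<=50K"] * 5 + [">50K"] * 5,
})
result = compare_hours_by_income(toy)
print(result)
```

On the loaded data the call would be `compare_hours_by_income(df)`. If the hours distribution turns out to be strongly skewed, a rank-based test such as Mann-Whitney U may be a reasonable alternative.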


### Hypothesis 3 — Sex vs Income


In [56]:
# Quick cross-tab example (Sex vs Income) - not a statistical test yet
pd.crosstab(df["Sex"], df["Income"], normalize="index").head(10)

| Sex | <=50K | >50K |
|---|---|---|
| Female | 0.890539 | 0.109461 |
| Male | 0.694263 | 0.305737 |


### Hypothesis 4 — Workclass vs Income


In [59]:
# Quick cross-tab example (Workclass vs Income) - not a statistical test yet
pd.crosstab(df["Workclass"], df["Income"], normalize="index").head(10)

| Workclass | <=50K | >50K |
|---|---|---|
| Federal-gov | 0.613542 | 0.386458 |
| Local-gov | 0.705208 | 0.294792 |
| Never-worked | 1.0 | 0.0 |
| Private | 0.781327 | 0.218673 |
| Self-emp-inc | 0.442652 | 0.557348 |
| Self-emp-not-inc | 0.715073 | 0.284927 |
| State-gov | 0.728043 | 0.271957 |
| Without-pay | 1.0 | 0.0 |
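
Shares of exactly 1.0 and 0.0, as for `Never-worked` and `Without-pay`, often point to small groups. A common rule of thumb is that the chi-square approximation becomes unreliable when expected counts fall below about 5, so it may be worth checking before testing. The sketch below is one way to flag such categories; `low_expected_categories`, the threshold default, and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def low_expected_categories(frame, row_col, col_col="Income", threshold=5.0):
    """List row categories with any expected count below `threshold`.

    Hypothetical helper; expected counts are computed under independence
    as row_total * column_total / grand_total.
    """
    table = pd.crosstab(frame[row_col], frame[col_col])
    row_totals = table.sum(axis=1)
    col_totals = table.sum(axis=0)
    grand_total = table.to_numpy().sum()
    expected = pd.DataFrame(
        np.outer(row_totals, col_totals) / grand_total,
        index=table.index,
        columns=table.columns,
    )
    return expected[(expected < threshold).any(axis=1)].index.tolist()


# Made-up toy data, only to show the call; these are NOT Adult dataset values.
toy = pd.DataFrame({
    "Workclass": ["Private"] * 60 + ["Never-worked"] * 4,
    "Income": ["<=50K"] * 45 + [">50K"] * 15 + ["<=50K"] * 4,
})
flagged = low_expected_categories(toy, "Workclass")
print(flagged)
```

If `low_expected_categories(df, "Workclass")` returns any categories, merging them into a broader group or dropping them before the test are two options to consider.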


### Hypothesis 5 — Marital Status vs Income



In [62]:
# Quick cross-tab example (Marital Status vs Income) - not a statistical test yet
pd.crosstab(df["Marital_Status"], df["Income"], normalize="index").head(10)

| Marital_Status | <=50K | >50K |
|---|---|---|
| Divorced | 0.895791 | 0.104209 |
| Married-AF-spouse | 0.565217 | 0.434783 |
| Married-civ-spouse | 0.553152 | 0.446848 |
| Married-spouse-absent | 0.91866 | 0.08134 |
| Never-married | 0.954039 | 0.045961 |
| Separated | 0.93561 | 0.06439 |
| Widowed | 0.914401 | 0.085599 |
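
With 32,561 rows in the target distribution (24,720 + 7,841), even weak associations can produce very small p-values, so an effect size may be a useful companion to each test. One common choice for categorical pairs is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The sketch below assumes SciPy is available; `cramers_v` and the toy frames are illustrative names and data, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(frame, row_col, col_col="Income"):
    """Cramér's V for two categorical columns (hypothetical helper)."""
    table = pd.crosstab(frame[row_col], frame[col_col])
    # Uncorrected chi-square statistic, as used in the usual definition of V.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Made-up toy data, only to show the two extremes; NOT Adult dataset values.
perfect = pd.DataFrame({
    "Marital_Status": ["Never-married"] * 10 + ["Married-civ-spouse"] * 10,
    "Income": ["<=50K"] * 10 + [">50K"] * 10,
})
none = pd.DataFrame({
    "Marital_Status": ["Never-married"] * 10 + ["Married-civ-spouse"] * 10,
    "Income": (["<=50K"] * 5 + [">50K"] * 5) * 2,
})
v_perfect = cramers_v(perfect, "Marital_Status")
v_none = cramers_v(none, "Marital_Status")
print(v_perfect, v_none)
```

On the loaded data, `cramers_v(df, "Marital_Status")` and the same call for the other categorical columns would give a rough ranking of which variables are most strongly associated with income.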
