# Task 2: Initial Exploration
1. Check the first few rows of the dataset using the head() or sample() function to get an overview of the data.
2. Examine the columns and their data types using info(). Check for missing values using isnull() or info().
3. Investigate basic summary statistics with describe().

### Initial Exploration 

Required checks:
- `head()` / `sample()` for quick view
- `info()` to examine columns and types
- Missing value check via `isnull()` / `info()`
- Summary stats via `describe()`

## 1. First Rows + Random Sample

In [11]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [19]:
columns = [
    "Age", "Workclass", "Fnlwgt", "Education", "Education_Num",
    "Marital_Status", "Occupation", "Relationship", "Race", "Sex",
    "Capital_Gain", "Capital_Loss", "Hours_per_Week", "Native_Country", "Income"
]

# Load adult.data (no header). Treat '?' as missing.
df = pd.read_csv(
    "adult.data",
    header=None,
    names=columns,
    na_values="?",
    skipinitialspace=True
)
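
The two parsing options in the cell above work together, and a small in-memory check can make that visible. The sketch below is illustrative only: the two rows are made up to mimic the comma-plus-space layout of `adult.data`, and `raw`, `cols`, `loose`, and `strict` are hypothetical names that are not used elsewhere in the notebook.

```python
import io

import pandas as pd

# Two made-up rows that mimic the ", "-separated layout of adult.data.
raw = "39, State-gov, 77516\n54, ?, 180211\n"
cols = ["Age", "Workclass", "Fnlwgt"]

# Without skipinitialspace the field is read as " ?" (leading space),
# which does not match the na_values entry "?".
loose = pd.read_csv(io.StringIO(raw), header=None, names=cols, na_values="?")

# With skipinitialspace the leading space is dropped before the comparison,
# so the field matches "?" and becomes NaN.
strict = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=cols,
    na_values="?",
    skipinitialspace=True,
)

print("missing without skipinitialspace:", loose["Workclass"].isna().sum())
print("missing with skipinitialspace:", strict["Workclass"].isna().sum())
```

If the non-null counts reported later by `info()` ever equal the full row count for every column, a mismatch of this kind is one plausible cause worth ruling out.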

In [21]:
df.head(10)

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education_Num,Marital_Status,Occupation,Relationship,Race,Sex,Capital_Gain,Capital_Loss,Hours_per_Week,Native_Country,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [23]:
df.sample(10, random_state=42)


Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education_Num,Marital_Status,Occupation,Relationship,Race,Sex,Capital_Gain,Capital_Loss,Hours_per_Week,Native_Country,Income
14160,27,Private,160178,Some-college,10,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,38,United-States,<=50K
27048,45,State-gov,50567,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
28868,29,Private,185908,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,55,United-States,>50K
5667,30,Private,190040,Bachelors,13,Never-married,Machine-op-inspct,Not-in-family,White,Female,0,0,40,United-States,<=50K
7827,29,Self-emp-not-inc,189346,Some-college,10,Divorced,Craft-repair,Not-in-family,White,Male,2202,0,50,United-States,<=50K
15382,51,Private,108435,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,47,United-States,>50K
4641,58,Self-emp-not-inc,93664,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,60,United-States,>50K
8943,22,Private,148431,HS-grad,9,Never-married,Adm-clerical,Not-in-family,Other,Female,0,0,40,United-States,<=50K
216,50,Private,313321,Assoc-acdm,12,Divorced,Sales,Not-in-family,White,Female,0,0,40,United-States,<=50K
5121,50,Private,71417,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,3103,0,40,United-States,>50K


## 2. Columns, Data Types, and Missing Values

In [25]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Age             32561 non-null  int64 
 1   Workclass       30725 non-null  object
 2   Fnlwgt          32561 non-null  int64 
 3   Education       32561 non-null  object
 4   Education_Num   32561 non-null  int64 
 5   Marital_Status  32561 non-null  object
 6   Occupation      30718 non-null  object
 7   Relationship    32561 non-null  object
 8   Race            32561 non-null  object
 9   Sex             32561 non-null  object
 10  Capital_Gain    32561 non-null  int64 
 11  Capital_Loss    32561 non-null  int64 
 12  Hours_per_Week  32561 non-null  int64 
 13  Native_Country  31978 non-null  object
 14  Income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [28]:
missing_count = df.isnull().sum().sort_values(ascending=False)
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)

missing_summary = pd.DataFrame({
    "Missing_Count": missing_count,
    "Missing_%": missing_pct.round(2)
})

missing_summary[missing_summary["Missing_Count"] > 0]

Unnamed: 0,Missing_Count,Missing_%
Occupation,1843,5.66
Workclass,1836,5.64
Native_Country,583,1.79
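
The counts for `Occupation` and `Workclass` are nearly identical, which suggests (but does not prove) that the gaps fall largely on the same rows. One way to check is to count missing fields per row and cross-tabulate the two columns. The sketch below runs on a toy frame with made-up values; `toy`, `missing_per_row`, and `overlap` are hypothetical names, and the same calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are illustrative, not taken from adult.data.
toy = pd.DataFrame({
    "Workclass": ["Private", np.nan, np.nan, "State-gov"],
    "Occupation": ["Sales", np.nan, np.nan, np.nan],
    "Native_Country": ["Cuba", "United-States", np.nan, "Jamaica"],
})

# Number of missing fields in each row.
missing_per_row = toy.isnull().sum(axis=1)

# Label each cell as missing/present, then cross-tabulate the two columns
# to see whether their gaps coincide.
labels = {True: "missing", False: "present"}
workclass_state = toy["Workclass"].isnull().map(labels)
occupation_state = toy["Occupation"].isnull().map(labels)
overlap = pd.crosstab(workclass_state, occupation_state)

print(missing_per_row.value_counts().sort_index())
print(overlap)
```

A large "missing/missing" cell relative to the off-diagonal cells would indicate that the two columns are usually missing together, which matters when deciding whether to drop rows or impute.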


## 3. Basic Summary Statistics

In [33]:
df.describe(include="all").T


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Age,32561.0,,,,38.581647,13.640433,17.0,28.0,37.0,48.0,90.0
Workclass,30725.0,8.0,Private,22696.0,,,,,,,
Fnlwgt,32561.0,,,,189778.366512,105549.977697,12285.0,117827.0,178356.0,237051.0,1484705.0
Education,32561.0,16.0,HS-grad,10501.0,,,,,,,
Education_Num,32561.0,,,,10.080679,2.57272,1.0,9.0,10.0,12.0,16.0
Marital_Status,32561.0,7.0,Married-civ-spouse,14976.0,,,,,,,
Occupation,30718.0,14.0,Prof-specialty,4140.0,,,,,,,
Relationship,32561.0,6.0,Husband,13193.0,,,,,,,
Race,32561.0,5.0,White,27816.0,,,,,,,
Sex,32561.0,2.0,Male,21790.0,,,,,,,


### Task 2 Notes (Quick Observations)
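- Missing values in this dataset are represented as **"?"** and were converted to NaN using `na_values="?"` (with `skipinitialspace=True` so that space-padded fields match). Three columns are affected: `Occupation` (1843 rows, 5.66%), `Workclass` (1836 rows, 5.64%), and `Native_Country` (583 rows, 1.79%).
- Nine of the 15 columns have dtype `object` (eight categorical features plus the `Income` target) and will require encoding later.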
- Numeric variables may contain outliers (especially `Capital_Gain`, `Capital_Loss`, and possibly `Hours_per_Week`).
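
The outlier note above can be turned into a first numeric check. The sketch below uses the common 1.5 × IQR rule, which is an assumption here rather than a method prescribed by the task; `iqr_outlier_counts` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Hours_per_Week": [40, 40, 38, 45, 40, 99],
    "Capital_Gain": [0, 0, 0, 0, 0, 14084],
    "Sex": ["Male", "Female", "Male", "Male", "Female", "Male"],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

When most values in a column are identical (as with the zeros in the toy `Capital_Gain`), the IQR is 0 and every differing value is flagged, so the counts are best read as a prompt for closer inspection rather than a verdict.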