# Task 4: Feature Encoding and Scaling

## Objective
The goal of this task is to preprocess the Adult Income dataset for machine learning models by:
- Separating the categorical and numerical features
- Applying one-hot encoding to the categorical features
- Applying standard scaling to the numerical features
- Combining both transformations into a single preprocessing step

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

In [None]:
import os
os.getcwd()

'/content'

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

--2026-01-20 06:58:41--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘adult.data’

adult.data              [   <=>              ]   3.79M  7.27MB/s    in 0.5s    

2026-01-20 06:58:42 (7.27 MB/s) - ‘adult.data’ saved [3974305]



In [None]:
import os
os.listdir('/content')

['.config', 'adult.data', 'sample_data']

In [None]:
import pandas as pd

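df = pd.read_csv(
    "/content/adult.data",
    header=None,      # the file has no header row
    sep=", ",         # fields are separated by a comma followed by a space
    engine="python"   # multi-character separators require the Python engine
)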

df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## Dataset Description

The Adult Income dataset contains census data used to predict whether a person's income exceeds $50K/year.

- Rows: 32,561
- Target variable: `income`
- Categorical features (8): workclass, education, marital-status, occupation, relationship, race, sex, native-country
- Numerical features (6): age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week

In [None]:
df.columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week",
    "native-country", "income"
]

df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [None]:
X = df.drop("income", axis=1)
y = df["income"]

X.shape, y.shape

((32561, 14), (32561,))

In [None]:
categorical_cols = X.select_dtypes(include=["object"]).columns
numerical_cols = X.select_dtypes(exclude=["object"]).columns

categorical_cols, numerical_cols

(Index(['workclass', 'education', 'marital-status', 'occupation',
        'relationship', 'race', 'sex', 'native-country'],
       dtype='object'),
 Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
        'hours-per-week'],
       dtype='object'))

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)
    ]
)

In [None]:
X_processed = preprocessor.fit_transform(X)

X_processed

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 455854 stored elements and shape (32561, 108)>

In [None]:
X_processed.shape

(32561, 108)

## Feature Separation

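- The target column `income` is separated from the input features, leaving 14 feature columns.
- Categorical and numerical columns are identified by data type: columns with dtype `object` are treated as categorical, and all other columns as numerical.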
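
The two steps above can be sketched on a small frame. The data here is a hypothetical three-row sample, and the string columns are given `dtype=object` explicitly, which is an assumption made so the example does not depend on how a particular pandas version infers string dtypes:

```python
import pandas as pd

# Hypothetical toy frame mirroring three of the dataset's columns.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": pd.Series(
        ["State-gov", "Self-emp-not-inc", "Private"], dtype=object
    ),
    "income": pd.Series(["<=50K", "<=50K", "<=50K"], dtype=object),
})

# Separate the target from the input features.
X_toy = toy.drop("income", axis=1)
y_toy = toy["income"]

# Split feature columns by dtype: object -> categorical, everything else -> numerical.
cat_cols = X_toy.select_dtypes(include=["object"]).columns.tolist()
num_cols = X_toy.select_dtypes(exclude=["object"]).columns.tolist()

print(cat_cols)
print(num_cols)
```

Because the split relies only on dtypes, a numeric column that was parsed as text would land in the categorical group, so it is worth checking the two lists before fitting.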

## Feature Encoding and Scaling

- **Numerical features** are scaled using `StandardScaler`.
  - After scaling, each feature has mean 0 and standard deviation 1 on the data the scaler was fitted on.
- **Categorical features** are encoded using `OneHotEncoder`.
  - Each categorical value is converted into a binary indicator vector.
  - `handle_unknown="ignore"` encodes any category not seen during fitting as an all-zero vector instead of raising an error at transform time.
- `ColumnTransformer` applies each transformer to its own set of columns and concatenates the results.
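
A minimal sketch of these behaviours on hypothetical toy data (the frames `train` and `new` are illustrative, not part of the notebook). It checks the scaled column's mean and standard deviation, then transforms a row whose `workclass` value was not present during fitting. `sparse_threshold=0.0` is set here only to get a dense array that is easy to inspect:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data used to fit the transformers.
train = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "workclass": pd.Series(
        ["Private", "State-gov", "Private", "State-gov"], dtype=object
    ),
})

# Hypothetical new row with a category that was not seen during fitting.
new = pd.DataFrame({
    "age": [35.0],
    "workclass": pd.Series(["Never-worked"], dtype=object),
})

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

train_out = ct.fit_transform(train)
new_out = ct.transform(new)

# Column 0 is the scaled age; columns 1-2 are the one-hot indicators.
print(train_out[:, 0].mean(), train_out[:, 0].std())
print(new_out)
```

`StandardScaler` divides by the population standard deviation, which matches NumPy's default `std()` but not pandas' default (`ddof=1`). The unseen category produces zeros in both indicator columns rather than an error.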