# **Final Project Task 1 - Census Data Preprocess**

Requirements

- Target variable specification:
    - The target variable for this project is hours-per-week. 
    - Ensure all preprocessing steps are designed to support regression analysis on this target variable.
- Encode data  **3p**
- Handle missing values if any **1p**
- Correct errors, inconsistencies, remove duplicates if any **1p**
- Outlier detection and treatment if any **1p**
- Normalization / Standardization if necessary **1p**
- Feature engineering **3p**
- Train test split, save it.
- Others?


Deliverable:

- Notebook code with no errors.
- Preprocessed data as csv.

In [25]:
import pandas as pd

In [26]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

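# Note: with skipinitialspace=True the missing-value marker is parsed as "?",
# so na_values=" ?" (with a leading space) does not match it.
data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)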
data.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [27]:
# Import the required libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [28]:

print(data.columns)

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='str')


In [29]:
# Dataset info
data.info()

# Missing values per column
data.isnull().sum()

# Summary statistics
data.describe()

# Number of duplicated rows
data.duplicated().sum()


<class 'pandas.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  str  
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  str  
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  str  
 6   occupation      32561 non-null  str  
 7   relationship    32561 non-null  str  
 8   race            32561 non-null  str  
 9   sex             32561 non-null  str  
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  str  
 14  income          32561 non-null  str  
dtypes: int64(6), str(9)
memory usage: 3.7 MB


np.int64(24)

In [30]:
numeric_cols = data.select_dtypes(include=np.number).columns.tolist()
print("Numeric columns for scaling and outlier treatment:", numeric_cols)

Numeric columns for scaling and outlier treatment: ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']


In [31]:
# Inspect the first 5 rows
print("First 5 rows:")
print(data.head())
print("\n")

First 5 rows:
   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country income  
0          2174             0              40  United-States  <=50K  
1             0        

In [32]:
# Check for missing values
print("Number of missing values per column:")
print(data.isna().sum())
print("\n")

Number of missing values per column:
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
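
The zero counts above are worth double-checking, since the Adult data marks unknown entries with `?`. The sketch below parses two made-up rows in the same comma-plus-space layout and shows the marker being picked up when `na_values` names the token as it looks after `skipinitialspace=True` has stripped the blank. The sample rows and the reduced column list are assumptions for illustration only.

```python
import io

import pandas as pd

# Two illustrative rows in the raw adult.data layout (comma + space separated);
# the second row uses "?" for workclass and occupation.
raw = (
    "39, State-gov, Adm-clerical, 40\n"
    "54, ?, ?, 60\n"
)
cols = ["age", "workclass", "occupation", "hours-per-week"]

# skipinitialspace=True strips the blank after each comma, so the parsed
# token is "?" and that is the string na_values has to name.
sample = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=cols,
    na_values="?",
    skipinitialspace=True,
)

missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Running the same check on the full dataset would show which columns actually need imputation.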




In [33]:
# Drop duplicate rows and the unneeded 'fnlwgt' column
data.drop_duplicates(inplace=True)
data.drop(columns=['fnlwgt'], inplace=True)
print(f"Shape after removing duplicates and fnlwgt: {data.shape}\n")

Shape after removing duplicates and fnlwgt: (32537, 14)



In [34]:
# Check whether any duplicates remain. Rows that differed only in 'fnlwgt'
# became identical once that column was dropped, so this pass removes more rows.
print(f"Number of rows before removing duplicates: {data.shape[0]}")
data.drop_duplicates(inplace=True)
print(f"Number of rows after removing duplicates: {data.shape[0]}")
print("\n")

Number of rows before removing duplicates: 32537
Number of rows after removing duplicates: 29096
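
The drop from 32537 to 29096 rows comes from rows that were distinct only through `fnlwgt`. A minimal sketch of that effect on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Rows 0 and 1 describe different records that differ only in 'fnlwgt'.
toy = pd.DataFrame({
    "age": [25, 25, 40],
    "fnlwgt": [1001, 2002, 3003],
    "hours-per-week": [40, 40, 50],
})

# With 'fnlwgt' present, no row is an exact copy of another.
dups_with_weight = int(toy.duplicated().sum())

# Without it, row 1 becomes indistinguishable from row 0.
dups_without_weight = int(toy.drop(columns=["fnlwgt"]).duplicated().sum())

print(dups_with_weight, dups_without_weight)
```

Whether such rows should be removed is a modelling decision: they may be separate people with identical attributes rather than data-entry repeats.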




In [35]:
# Handle missing values
# In this dataset, missing values occur in the categorical columns: workclass, occupation, native-country
categorical_cols = ["workclass", "occupation", "native-country"]

# Fill missing values with the most frequent value of each column
imputer = SimpleImputer(strategy="most_frequent")
data[categorical_cols] = imputer.fit_transform(data[categorical_cols])

print("After filling missing values:")
print(data.isna().sum())
print("\n")

After filling missing values:
age               0
workclass         0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
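
Most-frequent imputation only has an effect if the unknown entries are real missing values rather than the literal string `?`. The sketch below, on a made-up frame, first turns `?` into NaN and then fills each column with its mode. It uses plain pandas as a stand-in for `SimpleImputer(strategy="most_frequent")`; the toy values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "Sales", "?", "Tech-support"],
})
cat_cols = ["workclass", "occupation"]

# Turn the literal "?" marker into a real missing value (NaN).
toy[cat_cols] = toy[cat_cols].mask(toy[cat_cols] == "?")
n_missing = int(toy[cat_cols].isna().sum().sum())

# Fill each column with its most frequent non-missing value.
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(n_missing)
print(toy)
```

In a train/test workflow the mode would normally be computed on the training rows only and then reused for the test rows.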




In [36]:
# Feature engineering
# A. Recode 'income' as binary 0/1
data['income_binary'] = data['income'].apply(lambda x: 1 if x == '>50K' else 0)
data.drop('income', axis=1, inplace=True)

# B. Net capital
data['net_capital'] = data['capital-gain'] - data['capital-loss']
data.drop(['capital-gain', 'capital-loss'], axis=1, inplace=True)

# C. Work intensity
# Note: this feature is computed from the target (hours-per-week), so it leaks the
# target into the features and should be excluded before fitting a regression model.
data['work_intensity'] = data['hours-per-week'] / (data['age'] + 1)  # +1 avoids division by zero

# D. Group marital status
data['marital_status_grouped'] = data['marital-status'].replace({
    'Married-civ-spouse': 'Married', 'Married-AF-spouse': 'Married', 'Married-spouse-absent': 'Married',
    'Divorced': 'Not-Married', 'Never-married': 'Not-Married', 'Separated': 'Not-Married', 'Widowed': 'Not-Married'
})
data.drop('marital-status', axis=1, inplace=True)

print("First 5 rows after feature engineering:")
print(data.head())
print("\n")

First 5 rows after feature engineering:
   age         workclass  education  education-num         occupation  \
0   39         State-gov  Bachelors             13       Adm-clerical   
1   50  Self-emp-not-inc  Bachelors             13    Exec-managerial   
2   38           Private    HS-grad              9  Handlers-cleaners   
3   53           Private       11th              7  Handlers-cleaners   
4   28           Private  Bachelors             13     Prof-specialty   

    relationship   race     sex  hours-per-week native-country  income_binary  \
0  Not-in-family  White    Male              40  United-States              0   
1        Husband  White    Male              13  United-States              0   
2  Not-in-family  White    Male              40  United-States              0   
3        Husband  Black    Male              40  United-States              0   
4           Wife  Black  Female              40           Cuba              0   

   net_capital  work_intensity
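
Engineered features for a regression on `hours-per-week` are safest when they are built from predictor columns only. The sketch below shows two such features on a made-up frame. The names `has_capital_activity` and `age_group`, and the bin edges, are hypothetical choices for illustration and not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [22, 39, 67],
    "capital-gain": [0, 2174, 0],
    "capital-loss": [0, 0, 1500],
    "hours-per-week": [20, 40, 10],  # target, never used below
})

# Net capital, as in the notebook.
toy["net_capital"] = toy["capital-gain"] - toy["capital-loss"]

# Hypothetical flag: 1 if the person reported any capital gain or loss.
toy["has_capital_activity"] = (toy["net_capital"] != 0).astype(int)

# Hypothetical age bands; intervals are right-closed, e.g. (25, 45].
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 25, 45, 65, 120],
    labels=["<=25", "26-45", "46-65", "65+"],
)

# Every feature column is derived without touching the target.
feature_cols = [c for c in toy.columns if c != "hours-per-week"]
print(toy[feature_cols])
```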

In [37]:
# Numeric columns to scale (excluding the target column)
numeric_features = data.select_dtypes(include=np.number).columns.tolist()
numeric_features.remove('hours-per-week')  # remove the target column


In [38]:
# 'education' is not listed: 'education-num' already encodes it as an ordinal value,
# and ColumnTransformer drops unlisted columns by default (remainder='drop')
categorical_features = ['workclass', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'marital_status_grouped']
target = 'hours-per-week'

X = data.drop(target, axis=1)
y = data[target]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),  # scale numeric columns
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)  # one-hot encode categorical columns
    ]
)

X_processed = preprocessor.fit_transform(X)

In [39]:
# Column names after one-hot encoding
ohe_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)

# Combine the numeric column names with the encoded ones
all_feature_names = numeric_features + list(ohe_feature_names)

# Convert to a DataFrame for readability
X_processed_df = pd.DataFrame(X_processed, columns=all_feature_names)

print(f"Dataset shape after encoding and standardization: {X_processed_df.shape}\n")

Dataset shape after encoding and standardization: (29096, 86)



In [40]:
# Identify the numeric columns for scaling and outlier treatment
numeric_cols = data.select_dtypes(include=np.number).columns.tolist()
numeric_cols.remove('hours-per-week')  # exclude the target
print("Numeric columns:", numeric_cols)
print("\n")

# Outlier detection and treatment (IQR method)
# Note: when Q1 == Q3 (IQR = 0), both bounds equal that quartile and the whole
# column collapses to a single value; this affects mostly-zero columns such as
# 'income_binary' and 'net_capital'.
for col in numeric_cols:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Clip values below/above the bounds
    data[col] = np.where(data[col] < lower_bound, lower_bound, data[col])
    data[col] = np.where(data[col] > upper_bound, upper_bound, data[col])

print("Outliers adjusted for the numeric columns.")
print("\n")

Numeric columns: ['age', 'education-num', 'income_binary', 'net_capital', 'work_intensity']


Outliers adjusted for the numeric columns.
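
IQR clipping behaves badly on columns whose quartiles coincide, because both bounds then equal the same value. The sketch below is one possible guard, not the notebook's method: a hypothetical helper `clip_iqr` that skips any column with a zero IQR. The toy values are made up.

```python
import pandas as pd


def clip_iqr(frame, columns, k=1.5):
    """Clip each listed column to [Q1 - k*IQR, Q3 + k*IQR]; skip zero-IQR columns."""
    out = frame.copy()
    for col in columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        if iqr == 0:
            # Clipping here would flatten the column to one constant value.
            continue
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


toy = pd.DataFrame({
    "age": [30, 32, 34, 36, 38, 90],       # 90 is an outlier
    "income_binary": [0, 0, 0, 0, 0, 1],   # Q1 == Q3 == 0
})

clipped = clip_iqr(toy, ["age", "income_binary"])
print(clipped)
```

Here `age` has Q1 = 32.5 and Q3 = 37.5, so 90 is pulled down to the upper bound of 45, while `income_binary` is left untouched.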




In [41]:
# Create a copy. Note: despite its name, this copy still holds the raw string
# categories; the one-hot encoded X_processed_df is not used from here on.
data_encoded = data.copy()
numeric_cols = data_encoded.select_dtypes(include=np.number).columns.tolist()
numeric_cols.remove('hours-per-week')

# IQR treatment (repeats the previous cell on the copy)
for col in numeric_cols:
    Q1 = data_encoded[col].quantile(0.25)
    Q3 = data_encoded[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    data_encoded[col] = np.where(data_encoded[col] < lower_bound, lower_bound, data_encoded[col])
    data_encoded[col] = np.where(data_encoded[col] > upper_bound, upper_bound, data_encoded[col])

print("Outliers adjusted for the numeric columns.")
print("\n")

Outliers adjusted for the numeric columns.




In [42]:
# Normalization / standardization
scaler = StandardScaler()
data_encoded[numeric_cols] = scaler.fit_transform(data_encoded[numeric_cols])

print("The numeric columns have been standardized.")
print("\n")

The numeric columns have been standardized.




In [43]:
# Prepare the data for the regression model
X = data_encoded.drop("hours-per-week", axis=1)
y = data_encoded["hours-per-week"]

# Split the data into train and test sets (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")
print("\n")

Training set shape: (23276, 13), test set shape: (5820, 13)




In [44]:
# Combine X and y for export
train_export = pd.concat([X_train, y_train], axis=1)
test_export = pd.concat([X_test, y_test], axis=1)

In [45]:
# Save to CSV
train_export.to_csv('adult_train_preprocessed.csv', index=False)
test_export.to_csv('adult_test_preprocessed.csv', index=False)
print("\nFiles 'adult_train_preprocessed.csv' and 'adult_test_preprocessed.csv' saved successfully.")



Files 'adult_train_preprocessed.csv' and 'adult_test_preprocessed.csv' saved successfully.


In [46]:
# Read the CSV files back to verify them
train_df = pd.read_csv('adult_train_preprocessed.csv')
test_df = pd.read_csv('adult_test_preprocessed.csv')

print("\nFirst 5 rows of train_df:")
print(train_df.head())

print("\nFirst 5 rows of test_df:")
print(test_df.head())


First 5 rows of train_df:
        age     workclass     education  education-num         occupation  \
0  0.719053  Self-emp-inc  Some-college      -0.042717       Craft-repair   
1  0.939690       Private          12th      -0.806639  Machine-op-inspct   
2 -0.825403       Private     Bachelors       1.103166              Sales   
3 -0.751857       Private           9th      -1.952523  Machine-op-inspct   
4 -0.825403       Private       HS-grad      -0.424678  Handlers-cleaners   

     relationship                race     sex native-country  income_binary  \
0         Husband  Asian-Pac-Islander    Male  United-States            0.0   
1  Other-relative               White  Female           Cuba            0.0   
2            Wife               Black  Female  United-States            0.0   
3         Husband               White    Male  United-States            0.0   
4       Own-child               Black  Female  United-States            0.0   

   net_capital  work_intensit