# **Final Project Task 1 - Census Data Preprocess**

Requirements

- Target variable specification:
    - The target variable for this project is hours-per-week. 
    - Ensure all preprocessing steps are designed to support regression analysis on this target variable.
- Encode data **3p**
- Handle missing values if any **1p**
- Correct errors, inconsistencies, remove duplicates if any **1p**
- Outlier detection and treatment if any **1p**
- Normalization / Standardization if necessary **1p**
- Feature engineering **3p**
- Train test split, save it.
- Others?


Deliverable:

- Notebook code with no errors.
- Preprocessed data as csv.

In [None]:
import pandas as pd

In [37]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

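# NOTE: skipinitialspace=True strips the blank after each comma, so missing fields
# are read as "?" and na_values=" ?" (with a leading space) does not match them.
data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)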
data.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [38]:
# Specify the target variable


y = data["hours-per-week"]
x = data.drop("hours-per-week", axis=1)

print(x.head())

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss native-country income  
0          2174             0  United-States  <=50K  
1             0             0  United-States  <=50K  
2             0     

In [39]:
# Handle missing values

# Print the per-column count of missing values, then drop the incomplete rows.
print(data.isnull().sum())
data = data.dropna()
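
The loading call combines `na_values=" ?"` with `skipinitialspace=True`, so the `?` placeholders may reach the DataFrame as ordinary strings and `dropna()` would then remove nothing. The sketch below shows two ways to flag them. It uses a hypothetical three-row sample rather than the real file, and the names `raw`, `df_parsed`, `df_loaded` and `df_replaced` are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample in the ", "-separated layout of adult.data.
raw = "39, State-gov, 40\n54, ?, 60\n32, Private, 45\n"
names = ["age", "workclass", "hours-per-week"]

# Option A: declare the bare "?" (what remains once the blank is stripped) as missing.
df_parsed = pd.read_csv(io.StringIO(raw), header=None, names=names,
                        na_values="?", skipinitialspace=True)

# Option B: load first, then turn any remaining "?" cells into NaN.
df_loaded = pd.read_csv(io.StringIO(raw), header=None, names=names,
                        skipinitialspace=True)
df_replaced = df_loaded.replace("?", np.nan)

parsed_missing = int(df_parsed["workclass"].isna().sum())
replaced_missing = int(df_replaced["workclass"].isna().sum())
print(parsed_missing, replaced_missing)
```

Either option makes the placeholder visible to `isnull()` and `dropna()`. Whether to drop or impute those rows is a separate modelling choice.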

In [40]:
# Correct errors and remove duplicates

# Print the number of fully duplicated rows, then drop them.
print(data.duplicated().sum())
data = data.drop_duplicates()
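
Duplicate detection compares cell values exactly, so inconsistent labels (for example stray whitespace) can hide duplicates. The following is a minimal sketch of normalizing text columns before deduplicating, on a made-up four-row frame. The column values are hypothetical and not taken from the census file.

```python
import pandas as pd

# Hypothetical frame: the second row differs from the first only by a leading blank.
df = pd.DataFrame({
    "workclass": ["Private", " Private", "State-gov", "Private"],
    "age": [30, 30, 45, 30],
})

# Strip surrounding whitespace in every non-numeric column so labels compare equal.
text_cols = df.select_dtypes(exclude="number").columns
df[text_cols] = df[text_cols].apply(lambda s: s.str.strip())

# Count the duplicates, then keep the first occurrence of each row.
n_dupes = int(df.duplicated().sum())
deduped = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```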


In [41]:
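# Outlier detection and treatment

# IQR rule: keep rows whose hours-per-week lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
Q1 = data["hours-per-week"].quantile(0.25)
Q3 = data["hours-per-week"].quantile(0.75)
IQR = Q3 - Q1

# .copy() makes the filtered frame independent, so new columns can be added without warnings.
data = data[(data["hours-per-week"] >= Q1 - 1.5 * IQR) & (data["hours-per-week"] <= Q3 + 1.5 * IQR)].copy()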
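
The IQR rule above removes every row outside the fences. A possible alternative is to clip values to the fences, which keeps the rows. The sketch below compares both on a small made-up series. The helper `iqr_bounds` and the sample values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical weekly-hours values with one very low and one very high entry.
hours = pd.Series([40, 40, 38, 45, 42, 40, 99, 1])
low, high = iqr_bounds(hours)

# Treatment 1: drop the values outside the fences (what the cell above does).
kept = hours[hours.between(low, high)]

# Treatment 2: keep every row but clip extreme values to the fences.
clipped = hours.clip(lower=low, upper=high)

print(len(kept), len(clipped))
```

Dropping shrinks the dataset and, when applied to the target, also narrows the range a regression model can learn. Clipping preserves the row count at the cost of distorting the extreme values.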

In [42]:
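# Feature engineering

# NOTE: full_time is derived from hours-per-week (the target), so it leaks the
# target into the features; drop it before fitting a regression model.
data["full_time"] = (data["hours-per-week"] >= 40).astype(int)
# Bin age into ordered groups; pd.cut returns a category-dtype column.
data["age_group"] = pd.cut(data["age"], bins=[0, 20, 40, 60, 80, 100], labels=["young", "adult", "mid-age", "senior", "elder"])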
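
Since `full_time` is computed from `hours-per-week`, which is the regression target, it would hand the model part of the answer. A minimal sketch of features built only from predictor columns follows. The names `capital_net` and `has_capital` are hypothetical suggestions, and the five sample rows are made up.

```python
import pandas as pd

# Hypothetical rows using the same column names as the census data.
df = pd.DataFrame({
    "age": [17, 25, 40, 61, 90],
    "capital-gain": [0, 2174, 0, 0, 15024],
    "capital-loss": [0, 0, 1902, 0, 0],
    "hours-per-week": [20, 40, 45, 38, 10],
})

# Net capital income: uses only predictor columns, never the target.
df["capital_net"] = df["capital-gain"] - df["capital-loss"]

# Flag rows that report any capital gain or loss.
df["has_capital"] = ((df["capital-gain"] > 0) | (df["capital-loss"] > 0)).astype(int)

# Age bins as in the notebook; intervals are right-inclusive, so 40 falls in "adult".
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 20, 40, 60, 80, 100],
    labels=["young", "adult", "mid-age", "senior", "elder"],
)

print(df[["capital_net", "has_capital", "age_group"]])
```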


In [43]:
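# Encode data

# One-hot encode the text columns, dropping the first level of each to avoid redundancy.
# NOTE: age_group has category dtype, so include=['object'] does not select it
# and it is left unencoded.
cat_cols = data.select_dtypes(include=['object']).columns
data = pd.get_dummies(data, columns=cat_cols, drop_first=True)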

See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  cat_cols = data.select_dtypes(include=['object']).columns
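
Selecting by `include=['object']` depends on how the installed pandas version stores strings, and it does not pick up the `category` column produced by `pd.cut`. One possible way around both issues, sketched here on a made-up three-row frame, is to select everything that is not numeric or boolean. This is an assumption about what the encoding step should cover, not the notebook's original behaviour.

```python
import pandas as pd

# Hypothetical frame with one text column and one pd.cut (category) column.
df = pd.DataFrame({
    "age": [25, 45, 70],
    "sex": ["Male", "Female", "Male"],
})
df["age_group"] = pd.cut(df["age"], bins=[0, 40, 60, 100],
                         labels=["adult", "mid-age", "senior"])

# Everything that is neither numeric nor boolean is treated as categorical.
cat_cols = df.select_dtypes(exclude=["number", "bool"]).columns

# One-hot encode, dropping the first level of each column; dtype=int gives 0/1 columns.
encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=int)

# True when no non-numeric column is left after encoding.
all_numeric = encoded.select_dtypes(exclude="number").empty
print(list(encoded.columns), all_numeric)
```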


In [44]:
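# Normalization / Standardization

from sklearn.preprocessing import StandardScaler

# NOTE: the scaler is fitted on the full dataset, before the train/test split,
# so test-set statistics influence the scaling; fitting on the training rows only avoids this.
scaler = StandardScaler()
num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss']
data[num_cols] = scaler.fit_transform(data[num_cols])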
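
Fitting `StandardScaler` on all rows lets test-set means and variances shape the transformation. A common alternative is to split first, fit on the training part and reuse that fit on the test part. The sketch below uses randomly generated stand-in data and only two numeric columns. It illustrates the ordering and is not a drop-in replacement for the cells here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data (hypothetical, not the census file).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(17, 90, size=50),
    "capital-gain": rng.integers(0, 5000, size=50),
    "hours-per-week": rng.integers(20, 60, size=50),
})
num_cols = ["age", "capital-gain"]

# Split first, so the test rows stay unseen while the scaler is fitted.
X = df.drop(columns="hours-per-week").astype(float)
y = df["hours-per-week"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit on the training rows only, then apply the same transformation to the test rows.
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
```

With this ordering the training columns have zero mean by construction, while the test columns generally do not, which is expected.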

In [45]:
# Train test split

from sklearn.model_selection import train_test_split

x = data.drop("hours-per-week", axis=1)
y = data["hours-per-week"]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

Train shape: (18828, 102) (18828,)
Test shape: (4707, 102) (4707,)


In [46]:
# Save the train and test splits

X_train.to_csv("X_train.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
y_test.to_csv("y_test.csv", index=False)
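
A quick way to confirm that the saved files are usable is to read them back and compare shapes. This is a minimal sketch on tiny stand-in frames written to a temporary directory. The names `X_demo` and `y_demo` are hypothetical placeholders for the real splits.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-ins for the real splits (hypothetical values).
X_demo = pd.DataFrame({"age": [0.5, -1.2], "sex_Male": [1, 0]})
y_demo = pd.Series([40, 38], name="hours-per-week")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_train.csv"
    y_path = Path(tmp) / "y_train.csv"

    # Same call pattern as the saving cell above: no index column in the files.
    X_demo.to_csv(x_path, index=False)
    y_demo.to_csv(y_path, index=False)

    # Read the files back; the target comes back as a one-column frame.
    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)["hours-per-week"]

shapes_match = X_back.shape == X_demo.shape and len(y_back) == len(y_demo)
print(shapes_match)
```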