# **Final Project Task 1 - Census Data Preprocess**

Requirements

- Target variable specification:
    - The target variable for this project is hours-per-week. 
    - Ensure all preprocessing steps are designed to support regression analysis on this target variable.
- Encode data  **3p**
- Handle missing values if any **1p**
- Correct errors, inconsistencies, remove duplicates if any **1p**
- Outlier detection and treatment if any **1p**
- Normalization / Standardization if necessary **1p**
- Feature engineering **3p**
- Train test split, save it.
- Others?


Deliverable:

- Notebook code with no errors.
- Preprocessed data as csv.

In [1]:
import pandas as pd

In [2]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
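
The raw Adult file separates fields with a comma followed by a space and marks unknown entries with a question mark. The sketch below shows how the `na_values` marker interacts with `skipinitialspace` on a tiny in-memory sample. The two sample rows and the names `sample`, `cols` and `parsed` are made up for illustration, and the comment about stripping reflects how the parser appears to behave rather than a guarantee.

```python
import io
import pandas as pd

# Two hypothetical rows in the same ", "-separated layout as adult.data.
sample = io.StringIO("39, State-gov, 40\n54, ?, 60\n")
cols = ["age", "workclass", "hours-per-week"]

# With skipinitialspace=True the space after each comma is dropped while
# parsing, so the marker to declare here is "?" (without a leading space).
parsed = pd.read_csv(sample, header=None, names=cols,
                     na_values="?", skipinitialspace=True)

missing_per_column = parsed.isnull().sum()
print(missing_per_column)
```

If the count for a column known to contain `?` comes back as zero, the marker and the whitespace handling are probably out of sync.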


### I will start by importing the required libraries, then check the dataset for null/missing values

In [3]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer

In [4]:
# Initial check
print("Missing values per column", data.isnull().sum())
# Also drop the duplicate rows
data = data.drop_duplicates()
# For the categorical columns, fill any nulls with the most frequent value (the mode)
for col in ['workclass','native-country','occupation']:
    data[col] = data[col].fillna(data[col].mode()[0])

print(f"Shape of the cleaned data: {data.shape}")


Missing values per column age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
Shape of the cleaned data: (32537, 15)
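
As a minimal sketch of the cleaning step above, the snippet below removes duplicate rows and then fills missing categorical values with the column mode on a small made-up frame. The helper name `fill_with_mode` and the `toy` data are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def fill_with_mode(frame, columns):
    """Return a copy where NaN in the given columns is replaced by that column's mode."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Hypothetical rows: rows 2 and 4 repeat row 0, and row 1 has a missing workclass.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov", "Private", "Private"],
    "hours-per-week": [40, 35, 40, 50, 40, 45],
})

deduped = toy.drop_duplicates()                   # keeps the first copy of each row
cleaned = fill_with_mode(deduped, ["workclass"])  # mode is computed after deduplication
print(cleaned)
```

Working on a copy inside the helper avoids modifying the caller's frame in place.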


In [5]:
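# The check above reports no missing values
# Note: the raw file marks unknown entries with "?"; since skipinitialspace=True strips the leading space, the na_values=" ?" marker likely never matches, so those entries stay in the data as a "?" category
# Next, since our target is "hours-per-week", we address outliers
# We can do this with the IQR rule, which uses the spread between Q1 and Q3 of the numeric columns (val_num)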

val_num = ['age','fnlwgt','education-num','hours-per-week']
for col in val_num:
    quartile1 = data[col].quantile(0.25)  # first quartile (Q1)
    quartile3 = data[col].quantile(0.75)  # third quartile (Q3)
    IQR = quartile3 - quartile1
    limita_min = quartile1 - 1.5 * IQR
    limita_max = quartile3 + 1.5 * IQR

    data[col] = np.where(data[col]< limita_min, limita_min, data[col])
    data[col] = np.where(data[col]> limita_max, limita_max, data[col])

# Capping the extreme values this way keeps the data within limits ahead of the later train-test split
# I also drop the education column, since education-num already encodes it
data.drop(['education'], axis=1, inplace=True)
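
The IQR rule used above caps values below Q1 - 1.5·IQR or above Q3 + 1.5·IQR at those limits. A minimal sketch on a made-up series follows. It uses `Series.clip` in place of the two `np.where` calls, and the names `cap_iqr` and `hours` are hypothetical.

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical weekly hours with one very low and one very high value.
hours = pd.Series([40, 40, 38, 45, 42, 99, 1])
capped = cap_iqr(hours)

# Here Q1 = 39 and Q3 = 43.5, so the limits are 32.25 and 50.25.
print(capped.tolist())
```

Capping keeps every row, unlike filtering, but it also flattens the tails of the target when applied to `hours-per-week`.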


In [6]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 32537 entries, 0 to 32560
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32537 non-null  float64
 1   workclass       32537 non-null  object 
 2   fnlwgt          32537 non-null  float64
 3   education-num   32537 non-null  float64
 4   marital-status  32537 non-null  object 
 5   occupation      32537 non-null  object 
 6   relationship    32537 non-null  object 
 7   race            32537 non-null  object 
 8   sex             32537 non-null  object 
 9   capital-gain    32537 non-null  int64  
 10  capital-loss    32537 non-null  int64  
 11  hours-per-week  32537 non-null  float64
 12  native-country  32537 non-null  object 
 13  income          32537 non-null  object 
dtypes: float64(4), int64(2), object(8)
memory usage: 3.7+ MB
None


In [7]:
# To support accurate predictions, we can rely on engineered features
# I propose the following:
# The capital-gain and capital-loss columns can be combined into a single net capital feature
data['capital'] = data['capital-gain'] - data['capital-loss']
# Then drop the original columns, now that they are folded into the new feature
data = data.drop(['capital-gain', 'capital-loss'], axis=1)

# The sex column can be made binary as well
data['sex'] = data['sex'].map({'Male': 1, 'Female': 0})

# Ages can be binned into groups, to give a more descriptive view of the target
data['age-group'] = pd.cut(data['age'], bins = [0, 25, 45, 65, 100], labels = ['Young-Adult', 'Middle-Aged', 'Senior', 'Old'])

# Convert marital status into a binary flag ahead of the later one-hot encoding
data['is-married'] = data['marital-status'].apply(lambda x: 1 if 'Married' in x else 0)

# Since income has only two categories, it can be mapped to a binary variable
data['income'] = data['income'].map({'<=50K':0, '>50K':1})
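
Two details of the feature engineering above are easy to miss: `pd.cut` bins are right-inclusive by default, and the `'Married' in x` test is case-sensitive. The sketch below checks both on made-up values. The series `ages` and `status` are hypothetical.

```python
import pandas as pd

ages = pd.Series([25, 26, 45, 50, 70])
status = pd.Series(["Never-married", "Married-civ-spouse", "Divorced"])

# Same bins and labels as in the cell above; each bin includes its right edge,
# so 25 falls in (0, 25] and 45 falls in (25, 45].
groups = pd.cut(ages, bins=[0, 25, 45, 65, 100],
                labels=["Young-Adult", "Middle-Aged", "Senior", "Old"])

# "Never-married" has a lowercase "m", so it does not contain "Married".
is_married = status.apply(lambda x: 1 if "Married" in x else 0)

print(groups.astype(str).tolist())
print(is_married.tolist())
```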



In [8]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 32537 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             32537 non-null  float64 
 1   workclass       32537 non-null  object  
 2   fnlwgt          32537 non-null  float64 
 3   education-num   32537 non-null  float64 
 4   marital-status  32537 non-null  object  
 5   occupation      32537 non-null  object  
 6   relationship    32537 non-null  object  
 7   race            32537 non-null  object  
 8   sex             32537 non-null  int64   
 9   hours-per-week  32537 non-null  float64 
 10  native-country  32537 non-null  object  
 11  income          32537 non-null  int64   
 12  capital         32537 non-null  int64   
 13  age-group       32537 non-null  category
 14  is-married      32537 non-null  int64   
dtypes: category(1), float64(4), int64(4), object(6)
memory usage: 3.8+ MB
None


In [9]:
# With the binary variables in place, we can apply one-hot encoding to the categorical columns
categories = ['workclass', 'is-married', 'occupation', 'relationship', 'native-country', 'race']
data = pd.get_dummies(data, columns =  categories, drop_first=True)
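
With `drop_first=True`, `pd.get_dummies` drops the first category of each column in sorted order, and that category becomes the implicit baseline. A minimal sketch on a made-up frame follows. The `toy` data is hypothetical, and `dtype=int` is an optional choice here to get 0/1 columns instead of booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    "race": ["White", "Black", "Other", "White"],
    "hours-per-week": [40, 35, 20, 45],
})

# Categories sort as Black, Other, White; "Black" is dropped and is
# represented by zeros in both remaining indicator columns.
encoded = pd.get_dummies(toy, columns=["race"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for linear regression on `hours-per-week`.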

In [10]:
# Quick check
data.head()

Unnamed: 0,age,fnlwgt,education-num,marital-status,sex,hours-per-week,income,capital,age-group,workclass_Federal-gov,...,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,race_Asian-Pac-Islander,race_Black,race_Other,race_White
0,39.0,77516.0,13.0,Never-married,1,40.0,0,2174,Middle-Aged,False,...,False,False,False,True,False,False,False,False,False,True
1,50.0,83311.0,13.0,Married-civ-spouse,1,32.5,0,0,Senior,False,...,False,False,False,True,False,False,False,False,False,True
2,38.0,215646.0,9.0,Divorced,1,40.0,0,0,Middle-Aged,False,...,False,False,False,True,False,False,False,False,False,True
3,53.0,234721.0,7.0,Married-civ-spouse,1,40.0,0,0,Senior,False,...,False,False,False,True,False,False,False,True,False,False
4,28.0,338409.0,13.0,Married-civ-spouse,0,40.0,0,0,Middle-Aged,False,...,False,False,False,False,False,False,False,True,False,False


In [11]:
print(data.info())
# Check the data types

<class 'pandas.core.frame.DataFrame'>
Index: 32537 entries, 0 to 32560
Data columns (total 82 columns):
 #   Column                                     Non-Null Count  Dtype   
---  ------                                     --------------  -----   
 0   age                                        32537 non-null  float64 
 1   fnlwgt                                     32537 non-null  float64 
 2   education-num                              32537 non-null  float64 
 3   marital-status                             32537 non-null  object  
 4   sex                                        32537 non-null  int64   
 5   hours-per-week                             32537 non-null  float64 
 6   income                                     32537 non-null  int64   
 7   capital                                    32537 non-null  int64   
 8   age-group                                  32537 non-null  category
 9   workclass_Federal-gov                      32537 non-null  bool    
 10  workclass_Local

In [12]:
data = data.drop(['marital-status', 'age-group'], axis=1)

In [13]:
print(data.info())
# No object columns remain: marital-status is covered by is-married, and age-group was derived from age

<class 'pandas.core.frame.DataFrame'>
Index: 32537 entries, 0 to 32560
Data columns (total 80 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   age                                        32537 non-null  float64
 1   fnlwgt                                     32537 non-null  float64
 2   education-num                              32537 non-null  float64
 3   sex                                        32537 non-null  int64  
 4   hours-per-week                             32537 non-null  float64
 5   income                                     32537 non-null  int64  
 6   capital                                    32537 non-null  int64  
 7   workclass_Federal-gov                      32537 non-null  bool   
 8   workclass_Local-gov                        32537 non-null  bool   
 9   workclass_Never-worked                     32537 non-null  bool   
 10  workclass_Private          

In [14]:
# Drop other variables considered redundant
data = data.drop(['sex', 'income'], axis=1)

In [15]:
# Standardize the remaining numeric features, leaving the target untouched
# Note: the scaler is fitted on the full dataset here, before the split, which is not the same as fitting on train and only transforming test
scaler_standard = StandardScaler()
coloane_standard = ['age', 'capital', 'education-num', 'fnlwgt']
data[coloane_standard] = scaler_standard.fit_transform(data[coloane_standard])
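
A common alternative to scaling the full dataset is to split first, fit the scaler on the training rows only, and reuse those statistics for the test rows. The sketch below shows that order on randomly generated data. The `toy` frame, its column values and the name `num_cols` are assumptions for illustration, not the notebook's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the preprocessed census data.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "age": rng.integers(17, 90, size=200).astype(float),
    "capital": rng.normal(0, 1000, size=200),
    "hours-per-week": rng.integers(1, 99, size=200).astype(float),
})

X = toy.drop("hours-per-week", axis=1)
y = toy["hours-per-week"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

num_cols = ["age", "capital"]
scaler = StandardScaler()
# Mean and standard deviation come from the training rows only.
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
# The test rows are transformed with the training statistics.
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

With this order the test set does not influence the scaling parameters, so its columns will generally not have mean exactly zero.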

In [16]:
# Finally we split the data: 80% for training and 20% for testing, after separating the target column
X = data.drop('hours-per-week', axis=1)
y = data['hours-per-week']
# Save the full preprocessed dataset as a CSV
data.to_csv('Date_curat.csv', index=False)

In [17]:
# The target is now separated from the features; next comes the train/test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
# Finally, save all the splits as deliverables by writing the CSV files
X_train.to_csv('X_train_FIN.csv', index=False)
X_test.to_csv('X_test_FIN.csv', index=False)
y_train.to_csv('y_train_FIN.csv', index=False)
y_test.to_csv('y_test_FIN.csv', index=False)
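
Because the files are written with `index=False`, the original row labels are not stored, and features and targets stay aligned only by row order. The sketch below does a round trip for a small target series in a temporary directory. The values and the file name `y_demo.csv` are made up.

```python
import os
import tempfile
import pandas as pd

# Hypothetical target values carrying the same column name as the notebook's target.
y_demo = pd.Series([40.0, 32.5, 52.5], name="hours-per-week")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_demo.csv")
    y_demo.to_csv(path, index=False)   # header row holds the series name
    restored = pd.read_csv(path)["hours-per-week"]

print(restored.tolist())
```

Reading a saved target back this way yields a one-column frame, so selecting the column by name recovers the series.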