In [76]:
import numpy as np
import pandas as pd

**Loading the Dataset**  
The dataset is loaded using `pd.read_csv`, which reads the CSV file containing wage data.

In [78]:
data = pd.read_csv("DT-Wage.csv")
# data

**Displaying Dataset Overview**  
This cell prints the first few rows, summary statistics, and the count of missing values for each column.

In [79]:
print("Displaying the first few rows of the dataset:")
print(data.head())

print("---------------------------------------------")
print("Displaying summary statistics for the dataset:")
print(data.describe())

print("---------------------------------------------")
print("Displaying the count of missing values for each column:")
print(data.isnull().sum())

Displaying the first few rows of the dataset:
   year  age            maritl      race        education              region  \
0  2006   18  1. Never Married  1. White     1. < HS Grad  2. Middle Atlantic   
1  2004   24  1. Never Married  1. White  4. College Grad  2. Middle Atlantic   
2  2003   45        2. Married  1. White  3. Some College  2. Middle Atlantic   
3  2003   43        2. Married  3. Asian  4. College Grad  2. Middle Atlantic   
4  2005   50       4. Divorced  1. White       2. HS Grad  2. Middle Atlantic   

         jobclass          health health_ins   logwage        wage  
0   1. Industrial       1. <=Good      2. No  4.318063   75.043154  
1  2. Information  2. >=Very Good      2. No  4.255273   70.476020  
2   1. Industrial       1. <=Good     1. Yes  4.875061  130.982177  
3  2. Information  2. >=Very Good     1. Yes  5.041393  154.685293  
4  2. Information       1. <=Good     1. Yes  4.318063   75.043154  
---------------------------------------------
Display

**Dropping Uninformative Column**  
The `region` column is dropped because all its values are identical.

In [81]:
data = data.drop('region', axis=1)
# data
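
Whether a column is constant can be checked before it is dropped. The following is a minimal sketch on a toy frame; the toy values and the name `constant_cols` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the wage data; the values are illustrative only.
toy = pd.DataFrame({
    "age": [18, 24, 45],
    "region": ["2. Middle Atlantic"] * 3,
})

# A column with a single distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)

# Drop every constant column in one call.
toy = toy.drop(columns=constant_cols)
print(toy.columns.tolist())
```

Applied to the real `data` frame, the same list comprehension should report `region` if the column is indeed constant.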

**One-Hot Encoding for Categorical Features**  
This cell converts categorical columns (`maritl`, `race`, `education`, `jobclass`, `health`, `health_ins`) into binary dummy variables, allowing the model to process them as numerical features.

In [83]:
categorical_cols = ['maritl', 'race', 'education', 'jobclass', 'health', 'health_ins']
# prefix defaults to the column names, so it does not need to be passed explicitly
data = pd.get_dummies(data, columns=categorical_cols)
# data
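
To make the effect of one-hot encoding concrete, here is a small sketch on a single toy column. The frame `toy` and its three rows are made up for illustration.

```python
import pandas as pd

# One categorical column with two levels, mirroring `jobclass`.
toy = pd.DataFrame({"jobclass": ["1. Industrial", "2. Information", "1. Industrial"]})

# Each level becomes its own indicator column named <prefix>_<level>.
encoded = pd.get_dummies(toy, columns=["jobclass"])
print(encoded.columns.tolist())

# Viewing the indicators as 0/1 makes the encoding easier to read.
print(encoded.astype(int))
```

For a two-level column the two indicators are redundant (one is always the complement of the other); `pd.get_dummies` accepts `drop_first=True` if that redundancy is unwanted.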

**Ensuring Consistent Data Types**  
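Here, all columns in the dataset are cast to integer type so that the dummy columns become 0/1 integers. Note that this cast also truncates the fractional part of the continuous columns `logwage` and `wage` (for example, a wage of 75.043154 becomes 75).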

In [85]:
data = data.astype(int)
# data
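
Casting the whole frame with `astype(int)` also truncates continuous columns such as `logwage` and `wage`. If that loss of precision is not intended, one alternative is to cast only the dummy columns. This is a sketch under that assumption, not the notebook's method; the toy frame and the name `dummy_cols` are hypothetical.

```python
import pandas as pd

# Toy frame with two continuous columns and one categorical column.
toy = pd.DataFrame({
    "logwage": [4.318063, 4.875061],
    "wage": [75.043154, 130.982177],
    "health_ins": ["2. No", "1. Yes"],
})
toy = pd.get_dummies(toy, columns=["health_ins"])

# Select only the indicator columns created by get_dummies.
dummy_cols = [c for c in toy.columns if c.startswith("health_ins_")]

# astype accepts a {column: dtype} mapping, so other columns keep their dtype.
toy = toy.astype({c: int for c in dummy_cols})
print(toy.dtypes)
```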

**Cleaning Column Names**  
Punctuation is stripped from the column names, leaving only letters, digits, underscores, and whitespace. XGBoost, for example, rejects feature names that contain `[`, `]`, or `<`, and the `education` and `health` dummy columns contain `<`.

In [87]:
data.columns = data.columns.str.replace(r'[^\w\s]', '', regex=True)
data

Unnamed: 0,year,age,logwage,wage,maritl_1 Never Married,maritl_2 Married,maritl_3 Widowed,maritl_4 Divorced,maritl_5 Separated,race_1 White,...,education_2 HS Grad,education_3 Some College,education_4 College Grad,education_5 Advanced Degree,jobclass_1 Industrial,jobclass_2 Information,health_1 Good,health_2 Very Good,health_ins_1 Yes,health_ins_2 No
0,2006,18,4,75,1,0,0,0,0,1,...,0,0,0,0,1,0,1,0,0,1
1,2004,24,4,70,1,0,0,0,0,1,...,0,0,1,0,0,1,0,1,0,1
2,2003,45,4,130,0,1,0,0,0,1,...,0,1,0,0,1,0,1,0,1,0
3,2003,43,5,154,0,1,0,0,0,0,...,0,0,1,0,0,1,0,1,1,0
4,2005,50,4,75,0,0,0,1,0,1,...,1,0,0,0,0,1,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,2008,44,5,154,0,1,0,0,0,1,...,0,1,0,0,1,0,0,1,1,0
2996,2007,30,4,99,0,1,0,0,0,1,...,1,0,0,0,1,0,0,1,0,1
2997,2005,27,4,66,0,1,0,0,0,0,...,0,0,0,0,1,0,1,0,0,1
2998,2005,27,4,87,1,0,0,0,0,1,...,0,1,0,0,1,0,0,1,1,0
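
The effect of the regular expression `[^\w\s]` can be seen on a few sample names. The sketch below applies it to three column names of the kind produced by the encoding step; the second replacement, which turns whitespace into underscores, is an optional extra that the notebook does not perform.

```python
import pandas as pd

cols = pd.Index(["health_1. <=Good", "education_1. < HS Grad", "maritl_2. Married"])

# Remove every character that is not a word character or whitespace.
cleaned = cols.str.replace(r"[^\w\s]", "", regex=True)
print(cleaned.tolist())

# Optional (assumption, not in the notebook): collapse whitespace runs into "_".
compact = cleaned.str.replace(r"\s+", "_", regex=True)
print(compact.tolist())
```

Removing `<` from `education_1. < HS Grad` leaves two consecutive spaces in the cleaned name, which is one reason the whitespace-collapsing step can be convenient.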


**Separating Features and Target Variable**  
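The target variable (`wage`) is separated from the feature set (`X`) in preparation for model training and testing. Note that `X` still contains `logwage`, the logarithm of the target, which can leak target information into the features.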

In [89]:
X = data.drop(columns=['wage'])
y = data['wage']
print(type(X))
print(type(y))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
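
Because `logwage` appears to be the natural logarithm of `wage`, keeping it in `X` may let a model recover the target almost directly. The sketch below checks that relationship on the first three rows of the `head()` output shown earlier and then drops both columns from the features. Excluding `logwage` is a suggested variation rather than what the notebook does, and the names `X_toy` and `y_toy` are illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the first three rows of the head() output.
toy = pd.DataFrame({
    "age": [18, 24, 45],
    "logwage": [4.318063, 4.255273, 4.875061],
    "wage": [75.043154, 70.476020, 130.982177],
})

# If logwage matches log(wage) up to rounding, it reveals the target.
is_log_of_target = np.allclose(np.log(toy["wage"]), toy["logwage"], atol=1e-4)
print(is_log_of_target)

# Keep the target-derived column out of the feature set.
X_toy = toy.drop(columns=["wage", "logwage"])
y_toy = toy["wage"]
print(X_toy.columns.tolist())
```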


**Splitting the Data into Training, Validation, and Test Sets**  
The dataset is split into training, validation, and test sets in two steps: 70% of the rows go to training, and the remaining 30% is halved into validation and test sets of 15% each.

In [91]:
from sklearn.model_selection import train_test_split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)
print(X_test.shape)
print(y_test.shape)

(2100, 23)
(2100,)
(450, 23)
(450,)
(450, 23)
(450,)
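
The two-step split can be sanity-checked by confirming the resulting proportions and that no row lands in more than one subset. This sketch uses a synthetic array of 3000 rows, matching the row count implied by the shapes above; the `_demo` names and shortened variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3000 rows, each identified by a unique integer.
X_demo = np.arange(3000).reshape(-1, 1)
y_demo = np.arange(3000)

# Step 1: hold out 30% of the rows. Step 2: halve the held-out part.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

n = len(X_demo)
print(len(X_tr) / n, len(X_va) / n, len(X_te) / n)

# The three subsets should be pairwise disjoint and cover every row.
ids = [set(y_tr), set(y_va), set(y_te)]
disjoint = not (ids[0] & ids[1] or ids[0] & ids[2] or ids[1] & ids[2])
print(disjoint, len(ids[0] | ids[1] | ids[2]) == n)
```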
