# Kaggle Project: Salary prediction

## Dataset Description

### URL : https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer

### Task:
        1. Import the required libraries
        2. Create and split the dataset
        3. Define the models: Decision Tree Regression & Neural Network
        4. Train and validate each model
        5. Evaluate final performance on the test data

### Datasets: 373 samples split into train : validation : test at a ratio of roughly 6:2:2
  * Train dataset: 223 samples (about 60%)
  * Validation dataset: 75 samples (about 20%)
  * Test dataset: 75 samples (about 20%)

### Features(x): Age, Gender, Education Level, Job Title, Years of Experience

### Target(y): Salary

## Import Library

In [158]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

## Build Decision Tree Regression Model

#### Data preprocessing

In [159]:
# Load the data
salary_data = pd.read_csv('Salary Data.csv')

In [160]:
salary_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  373 non-null    int64  
 1   Gender               373 non-null    object 
 2   Education Level      373 non-null    object 
 3   Job Title            373 non-null    object 
 4   Years of Experience  373 non-null    float64
 5   Salary               373 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 17.6+ KB


In [161]:
salary_data.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32,Male,Bachelor's,Software Engineer,5.0,90000
1,28,Female,Master's,Data Analyst,3.0,65000
2,45,Male,PhD,Senior Manager,15.0,150000
3,36,Female,Bachelor's,Sales Associate,7.0,60000
4,52,Male,Master's,Director,20.0,200000


In [162]:
# Check for missing values
salary_data.isnull().sum()

Age                    0
Gender                 0
Education Level        0
Job Title              0
Years of Experience    0
Salary                 0
dtype: int64

In [163]:
salary_data.shape

(373, 6)

In [164]:
# Convert the categorical columns (Gender, Education Level, Job Title) to numeric codes

salary_data['Gender'] = salary_data['Gender'].replace({'Male':1, 'Female':0})

edu_label_encoder = LabelEncoder()
salary_data['Education Level'] = edu_label_encoder.fit_transform(salary_data['Education Level'])

job_title_encoder = LabelEncoder()
salary_data['Job Title'] = job_title_encoder.fit_transform(salary_data['Job Title'])
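
As a minimal sketch of what `LabelEncoder` does in the cell above: it sorts the distinct values and assigns each one an integer code, which is consistent with `Bachelor's`, `Master's`, and `PhD` showing up as 0, 1, and 2 in the rows displayed below. The toy list `levels` is made up for illustration and is not read from the dataset file.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values, standing in for the 'Education Level' column
levels = ["Bachelor's", "Master's", "PhD", "Master's"]

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ holds the sorted distinct labels; a label's index is its code
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps codes back to the original labels
print(encoder.inverse_transform([2, 0]).tolist())
```

Because the codes simply follow sorted order, the integers assigned to `Job Title` carry no meaningful ranking. One-hot encoding might be worth trying as an alternative, though it is not what this notebook does.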

In [165]:
# Inspect the data
salary_data

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32,1,0,159,5.0,90000
1,28,0,1,17,3.0,65000
2,45,1,2,130,15.0,150000
3,36,0,0,101,7.0,60000
4,52,1,1,22,20.0,200000
...,...,...,...,...,...,...
368,35,0,0,131,8.0,85000
369,43,1,1,30,19.0,170000
370,29,0,0,70,2.0,40000
371,34,1,0,137,7.0,90000


#### Model Construction

In [166]:
# Split into train, validation, and test sets

Y = salary_data['Salary']
X = salary_data.drop(['Salary'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size = 0.25, random_state=0)

In [167]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
print(x_val.shape)
print(y_val.shape)

(223, 5)
(75, 5)
(223,)
(75,)
(75, 5)
(75,)
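
The shapes printed above follow from applying the two splits in sequence. The sketch below reproduces the arithmetic in plain Python, assuming that `train_test_split` rounds a fractional `test_size` up to the next whole sample; the helper `split_sizes` is a hypothetical name used only here.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up.

    Assumption: this mirrors how scikit-learn's train_test_split
    treats a float test_size.
    """
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 373
n_trainval, n_test = split_sizes(n_total, 0.2)    # first split: hold out the test set
n_train, n_val = split_sizes(n_trainval, 0.25)    # second split: 25% of the remainder

print(n_train, n_val, n_test)  # 223 75 75
```

Taking 25% of the remaining 80% gives about 20% of the full dataset, which is how the two calls produce a roughly 6:2:2 split.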


#### Train Model & Select Model

In [168]:
model = DecisionTreeRegressor(max_depth= 5, max_features= None, min_samples_leaf= 1)
model.fit(x_train, y_train)

DecisionTreeRegressor(max_depth=5)

In [169]:
# Hyperparameter optimization via grid search

param_grid = {
    'max_depth': [3, 5, 7, 9, 100],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}

# Run the grid search
grid_model = GridSearchCV(model, param_grid, cv=5)
grid_model.fit(x_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters:", grid_model.best_params_)

Best hyperparameters: {'max_depth': 5, 'max_features': 'log2', 'min_samples_leaf': 4}
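
A fitted `GridSearchCV` object also exposes the refitted best model as `best_estimator_`. The sketch below shows one way the tuned tree could be retrieved and scored. It runs on synthetic stand-in data (the arrays `X` and `y` are made up, not the salary dataset), so the printed values will differ from anything in this notebook.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical, not the salary dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 4] + rng.normal(0, 1, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

param_grid = {
    "max_depth": [3, 5, 7, 9, 100],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
}

grid = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ has been refitted on all of X_train
tuned_tree = grid.best_estimator_
n_candidates = len(grid.cv_results_["params"])  # 5 * 3 * 3 = 45 combinations
val_r2 = r2_score(y_val, tuned_tree.predict(X_val))

print("Candidates tried:", n_candidates)
print("Best parameters:", grid.best_params_)
print("Validation R-squared of tuned tree:", val_r2)
```

Scoring `tuned_tree` on the validation set, rather than the estimator passed into the search, is what lets the grid search actually influence the reported numbers.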


## Performance

In [170]:
# Note: `model` is the depth-5 tree fitted above, not grid_model.best_estimator_
# Predictions on the train set
y_train_pred = model.predict(x_train)

# R-squared on the train set
r2_train = r2_score(y_train, y_train_pred)

# Predictions on the validation set
y_val_pred = model.predict(x_val)

# R-squared on the validation set
r2_val = r2_score(y_val, y_val_pred)

# Predictions on the test set
y_test_pred = model.predict(x_test)

# R-squared on the test set
r2_test = r2_score(y_test, y_test_pred)

print("Train set R-squared:", r2_train)
print("Validation set R-squared:", r2_val)
print("Test set R-squared:", r2_test)

Train set R-squared: 0.9630213129532259
Validation set R-squared: 0.8125884097282398
Test set R-squared: 0.9061828247680328
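
For reference, the R-squared reported above is one minus the ratio of the residual sum of squares to the total sum of squares. A small plain-Python sketch of that formula follows. The `actual` values are the first four salaries shown by `salary_data.head()`, while the `predicted` values are made up purely for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

actual = [90000, 65000, 150000, 60000]      # first four salaries from head()
predicted = [85000, 70000, 140000, 65000]   # hypothetical predictions

score = r_squared(actual, predicted)
print(round(score, 4))  # 0.9658
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible when the model does worse than that baseline.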


## Build Neural Network Model

#### Data preprocessing

In [171]:
df = pd.read_csv('Salary Data.csv')

df['Gender'] = df['Gender'].replace({'Male':1, 'Female':0})

edu_label_encoder = LabelEncoder()
df['Education Level'] = edu_label_encoder.fit_transform(df['Education Level'])

job_title_encoder = LabelEncoder()
df['Job Title'] = job_title_encoder.fit_transform(df['Job Title'])

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [172]:
df

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32,1,0,159,5.0,90000
1,28,0,1,17,3.0,65000
2,45,1,2,130,15.0,150000
3,36,0,0,101,7.0,60000
4,52,1,1,22,20.0,200000
...,...,...,...,...,...,...
368,35,0,0,131,8.0,85000
369,43,1,1,30,19.0,170000
370,29,0,0,70,2.0,40000
371,34,1,0,137,7.0,90000


#### Model Construction

In [173]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.25, random_state=42)

x_train = torch.Tensor(train_data.drop(['Salary'], axis=1).values)
y_train = torch.Tensor(train_data['Salary'].values)

x_val = torch.Tensor(val_data.drop(['Salary'], axis=1).values)
y_val = torch.Tensor(val_data['Salary'].values)

x_test = torch.Tensor(test_data.drop(['Salary'], axis=1).values)
y_test = torch.Tensor(test_data['Salary'].values)

In [174]:
print(x_train.shape)
print(y_train.shape)
print(x_val.shape)

torch.Size([223, 5])
torch.Size([223])
torch.Size([75, 5])


#### Train Model & Select Model

In [175]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(5, 10)
        self.fc2 = nn.Linear(10, 5)
        self.fc3 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = Net()

# Train the model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

train_losses = []
val_losses = []

num_epochs = 1000

for epoch in range(num_epochs):
    optimizer.zero_grad()
    output = model(x_train)
    train_loss = criterion(output, y_train.unsqueeze(1))
    train_losses.append(train_loss.item())
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        val_output = model(x_val)
        val_loss = criterion(val_output, y_val.unsqueeze(1))
        val_losses.append(val_loss.item())

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Train Loss: {train_loss.item()}, Val Loss: {val_loss.item()}")

Epoch 0, Train Loss: 12690216960.0, Val Loss: 11223384064.0
Epoch 100, Train Loss: 11066668032.0, Val Loss: 9827176448.0
Epoch 200, Train Loss: 2564344832.0, Val Loss: 2853864192.0
Epoch 300, Train Loss: 1861012224.0, Val Loss: 2258898688.0
Epoch 400, Train Loss: 1589024000.0, Val Loss: 1925349632.0
Epoch 500, Train Loss: 1299972992.0, Val Loss: 1568980352.0
Epoch 600, Train Loss: 1005891520.0, Val Loss: 1203628928.0
Epoch 700, Train Loss: 738773248.0, Val Loss: 867736320.0
Epoch 800, Train Loss: 546372160.0, Val Loss: 620410944.0
Epoch 900, Train Loss: 448187776.0, Val Loss: 489423680.0
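
The `unsqueeze(1)` calls in the loop above matter because the network outputs a tensor of shape `(N, 1)` while the targets have shape `(N,)`. The sketch below, using small made-up tensors, illustrates what `nn.MSELoss` computes with and without matching shapes. Without the `unsqueeze`, PyTorch broadcasts the two tensors to `(N, N)` and averages over every pairing, which is usually not what is intended.

```python
import warnings

import torch
import torch.nn as nn

# Made-up tensors: predictions are perfect, so the correct loss is 0
pred = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like the model output
target = torch.tensor([1.0, 2.0, 3.0])       # shape (3,), like y_train

criterion = nn.MSELoss()

# Shapes match: (3, 1) vs (3, 1)
matched = criterion(pred, target.unsqueeze(1))

# Shapes differ: PyTorch warns and broadcasts both tensors to (3, 3)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    mismatched = criterion(pred, target)

print(matched.item())     # 0.0
print(mismatched.item())  # non-zero, even though the predictions are perfect
```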


## Performance

In [176]:
# Evaluate the model
with torch.no_grad():
    train_output = model(x_train)
    train_mse = nn.functional.mse_loss(train_output, y_train.unsqueeze(1)).item()
    train_r2 = r2_score(y_train, train_output.numpy().flatten())

    val_output = model(x_val)
    val_mse = nn.functional.mse_loss(val_output, y_val.unsqueeze(1)).item()
    val_r2 = r2_score(y_val, val_output.numpy().flatten())

    test_output = model(x_test)
    test_mse = nn.functional.mse_loss(test_output, y_test.unsqueeze(1)).item()
    test_r2 = r2_score(y_test, test_output.numpy().flatten())

print("----------------------------------------------------------")    
print(f"Train MSE: {train_mse}, Train R^2: {train_r2}")
print(f"Validation MSE: {val_mse}, Validation R^2: {val_r2}")
print(f"Test MSE: {test_mse}, Test R^2: {test_r2}")

----------------------------------------------------------
Train MSE: 408524224.0, Train R^2: 0.8202534378581079
Validation MSE: 435900032.0, Validation R^2: 0.8134891012092795
Test MSE: 372709216.0, Test R^2: 0.844547898493238
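
The MSE values above are hard to read because they are in squared salary units. Taking the square root gives the RMSE, which is in the same unit as `Salary`. A short sketch of the conversion, using the three MSE values copied from the output above:

```python
import math

# MSE values copied from the evaluation output
mse = {"train": 408524224.0, "validation": 435900032.0, "test": 372709216.0}

# RMSE is the square root of MSE, so it is expressed in salary units
rmse = {split: math.sqrt(value) for split, value in mse.items()}

for split, value in rmse.items():
    print(f"{split} RMSE: {value:,.0f}")
```

All three come out at roughly 19,000 to 21,000 in salary units, which gives a more tangible sense of the typical prediction error than the raw MSE.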


##### Both models perform reasonably well, but I concluded that the neural network model is the better fit.