# Kaggle Project: Salary prediction

## Describe my Dataset

### URL : https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer

### Task:
1. Import the required libraries
2. Create and split the dataset
3. Define the models: Decision Tree Regression & Neural Network
4. Train and validate each model
5. Evaluate final performance on the test data

### Datasets: 373 samples split into train : validation : test at a ratio of roughly 6:2:2
  * Train dataset: 223 samples (about 60%)
  * Validation dataset: 75 samples (about 20%)
  * Test dataset: 75 samples (about 20%)

### Features(x): Gender, Age, Education Level, Years of Experience (Job Title is dropped during preprocessing)

### Target(y): Salary

## Import Library

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler

## Build Decision Tree Regression Model

#### Data preprocessing

In [129]:
# Load the data
salary_data = pd.read_csv('Salary Data.csv')

In [130]:
salary_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  373 non-null    int64  
 1   Gender               373 non-null    object 
 2   Education Level      373 non-null    object 
 3   Job Title            373 non-null    object 
 4   Years of Experience  373 non-null    float64
 5   Salary               373 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 17.6+ KB


In [131]:
salary_data.head()

| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 0 | 32 | Male | Bachelor's | Software Engineer | 5.0 | 90000 |
| 1 | 28 | Female | Master's | Data Analyst | 3.0 | 65000 |
| 2 | 45 | Male | PhD | Senior Manager | 15.0 | 150000 |
| 3 | 36 | Female | Bachelor's | Sales Associate | 7.0 | 60000 |
| 4 | 52 | Male | Master's | Director | 20.0 | 200000 |


In [132]:
# Check for missing values
salary_data.isnull().sum()

Age                    0
Gender                 0
Education Level        0
Job Title              0
Years of Experience    0
Salary                 0
dtype: int64

In [133]:
salary_data.shape

(373, 6)

In [134]:
# Convert the categorical values in Gender and Job Title to numbers

salary_data['Gender'] = salary_data['Gender'].replace({'Male':1, 'Female':0})

job_title_encoder = LabelEncoder()
salary_data['Job Title'] = job_title_encoder.fit_transform(salary_data['Job Title'])

In [136]:
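# Correlation analysis (uses the 'Scaled Years of Experience' column created in cell In [135])
job_correlation = salary_data['Job Title'].corr(salary_data['Salary'])
years_correlation = salary_data['Scaled Years of Experience'].corr(salary_data['Salary'])
age_correlation = salary_data['Age'].corr(salary_data['Salary'])

print('Job Title correlation:', job_correlation)
print('Years of Experience correlation:', years_correlation)
print('Age correlation:', age_correlation)

# The Job Title correlation is very low, so drop Job Title.
# Note: label-encoded integers follow an arbitrary (alphabetical) order,
# so this coefficient is only a rough signal.
salary_data = salary_data.drop('Job Title', axis=1)

Job Title correlation: 0.13620643703632004
Years of Experience correlation: 0.9303377227618325
Age correlation: 0.9223352439166448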
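
The low Job Title coefficient deserves a cautious reading. `LabelEncoder` numbers the categories in sorted label order, so the Pearson correlation depends on that ordering rather than on how informative the job title is. The sketch below uses made-up job titles and salaries (not rows from the Kaggle dataset) to show the same categories giving very different coefficients under two numberings.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Kaggle dataset.
df = pd.DataFrame({
    "job": ["Analyst", "Director", "Engineer", "Analyst", "Director", "Engineer"],
    "salary": [60000, 200000, 90000, 65000, 190000, 95000],
})

# Encoding 1: alphabetical codes, the ordering LabelEncoder would produce.
alphabetical = {"Analyst": 0, "Director": 1, "Engineer": 2}
# Encoding 2: the same categories, numbered by their mean salary instead.
by_salary = {"Analyst": 0, "Engineer": 1, "Director": 2}

corr_alpha = df["job"].map(alphabetical).corr(df["salary"])
corr_sorted = df["job"].map(by_salary).corr(df["salary"])

print(f"alphabetical codes:   {corr_alpha:.3f}")
print(f"salary-ordered codes: {corr_sorted:.3f}")
```

In this toy case the alphabetical numbering yields a weak coefficient and the salary-ordered numbering a strong one, even though the underlying column is identical.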


In [135]:
# Normalize Years of Experience

scaler = MinMaxScaler()

# Extract the numeric column
years_of_experience = salary_data['Years of Experience']

# Reshape to a 2D array, then normalize
scaled_years_of_experience = scaler.fit_transform(years_of_experience.values.reshape(-1, 1))

# Add the scaled values to the data
salary_data['Scaled Years of Experience'] = scaled_years_of_experience

# Drop the Years of Experience column
salary_data = salary_data.drop('Years of Experience', axis=1)

In [137]:
# Normalize Age

# Extract the numeric column
age = salary_data['Age']

# Reshape to a 2D array, then normalize
scaled_age = scaler.fit_transform(age.values.reshape(-1, 1))

# Add the scaled values to the data
salary_data['Scaled_Age'] = scaled_age

# Drop the Age column
salary_data = salary_data.drop('Age', axis=1)
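
With its default range, `MinMaxScaler` maps each value to `(x - min) / (max - min)`. The following is a minimal NumPy sketch of that formula. The sample values and the extremes of 0 and 25 years are assumptions: they are consistent with the outputs shown in this notebook (5.0 years becomes 0.20, 3.0 becomes 0.12), but the notebook never prints the actual minimum and maximum.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] via (x - min) / (max - min).

    Assumes the array is not constant, so max > min.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

# Hypothetical sample that includes the assumed extremes of 0 and 25 years.
years = np.array([0.0, 3.0, 5.0, 15.0, 20.0, 25.0])
scaled = min_max_scale(years)
print(scaled)
```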

In [138]:
# One-hot encode Education Level
education_level_encoded = pd.get_dummies(salary_data['Education Level'], prefix='Education')

# Append the encoded columns to the DataFrame
salary_data = pd.concat([salary_data, education_level_encoded], axis=1)

# Check the result
print(salary_data.head())

# Drop the Education Level column
salary_data = salary_data.drop('Education Level', axis=1)

   Gender Education Level  Salary  Scaled Years of Experience  Scaled_Age  \
0       1      Bachelor's   90000                        0.20    0.300000   
1       0        Master's   65000                        0.12    0.166667   
2       1             PhD  150000                        0.60    0.733333   
3       0      Bachelor's   60000                        0.28    0.433333   
4       1        Master's  200000                        0.80    0.966667   

   Education_Bachelor's  Education_Master's  Education_PhD  
0                     1                   0              0  
1                     0                   1              0  
2                     0                   0              1  
3                     1                   0              0  
4                     0                   1              0  
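
The 0/1 columns above are what older pandas versions return. From pandas 2.0 onward, `pd.get_dummies` produces boolean columns by default, which can matter later when the frame is converted to tensors. A small sketch, using a hypothetical four-row frame with the same three education levels, shows how passing `dtype=int` keeps the 0/1 representation:

```python
import pandas as pd

# Hypothetical mini-frame using the same three education levels.
df = pd.DataFrame({"Education Level": ["Bachelor's", "Master's", "PhD", "Bachelor's"]})

# dtype=int forces integer 0/1 columns regardless of the pandas default.
encoded = pd.get_dummies(df["Education Level"], prefix="Education", dtype=int)
print(encoded)
```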


In [139]:
# Inspect the data
salary_data

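| | Gender | Salary | Scaled Years of Experience | Scaled_Age | Education_Bachelor's | Education_Master's | Education_PhD |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 90000 | 0.20 | 0.300000 | 1 | 0 | 0 |
| 1 | 0 | 65000 | 0.12 | 0.166667 | 0 | 1 | 0 |
| 2 | 1 | 150000 | 0.60 | 0.733333 | 0 | 0 | 1 |
| 3 | 0 | 60000 | 0.28 | 0.433333 | 1 | 0 | 0 |
| 4 | 1 | 200000 | 0.80 | 0.966667 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 368 | 0 | 85000 | 0.32 | 0.400000 | 1 | 0 | 0 |
| 369 | 1 | 170000 | 0.76 | 0.666667 | 0 | 1 | 0 |
| 370 | 0 | 40000 | 0.08 | 0.200000 | 1 | 0 | 0 |
| 371 | 1 | 90000 | 0.28 | 0.366667 | 1 | 0 | 0 |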


#### Model Construction

In [140]:
# Split into train, validation, and test sets (60/20/20)

Y = salary_data['Salary']
X = salary_data.drop(['Salary'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size = 0.25, random_state=0)
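
In this notebook the scalers are fit on all 373 rows before the split, so the validation and test rows influence the scaling. One common alternative, sketched here as an assumption rather than as the notebook's method, is to split first and fit the scaler on the training rows only. The feature and target below are synthetic stand-ins, not the Kaggle data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for one numeric feature (e.g. years of experience).
rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(373, 1))
y = 30000 + 6000 * X[:, 0]

# Same two-stage 60/20/20 split as in the notebook.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Fit the scaler on the training rows only, then reuse it for the other splits.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(), X_train_scaled.max())
```

With this ordering, validation and test values can fall slightly outside [0, 1], which is expected and harmless for a tree model.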

In [141]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
print(x_val.shape)
print(y_val.shape)

(223, 6)
(75, 6)
(223,)
(75,)
(75, 6)
(75,)
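
The shapes above follow from the two fractional splits. As a rough check, the helper below (a hypothetical function written for illustration) reproduces the counts under the assumption that the held-out size is rounded up at each stage, which is how `train_test_split` treats a float `test_size`.

```python
import math

def two_stage_split_sizes(n_samples, test_size=0.2, val_size=0.25):
    """Return (train, val, test) counts for two successive fractional splits.

    Assumes the held-out count is rounded up at each stage.
    """
    n_test = math.ceil(test_size * n_samples)
    remaining = n_samples - n_test
    n_val = math.ceil(val_size * remaining)
    n_train = remaining - n_val
    return n_train, n_val, n_test

print(two_stage_split_sizes(373))  # (223, 75, 75)
```

Taking 25% of the remaining 80% gives the validation set about 20% of the full data, so the overall ratio is close to 6:2:2.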


#### Train Model & Select Model

In [149]:
# Final model, using the best hyperparameters reported by the grid search in cell In [148]
model = DecisionTreeRegressor(max_depth=5, max_features='sqrt', min_samples_leaf=1)
model.fit(x_train, y_train)

DecisionTreeRegressor(max_depth=5, max_features='sqrt')

In [148]:
# Hyperparameter optimization with grid search

param_grid = {
    'max_depth': [3, 5, 7, 9, 100],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}

# Run the grid search (5-fold cross-validation on the training set)
grid_model = GridSearchCV(model, param_grid, cv=5)
grid_model.fit(x_train, y_train)

# Print the best hyperparameter values
print("Best hyperparameters:", grid_model.best_params_)

Best hyperparameters: {'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1}
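
Two details of this search may be worth making explicit. With `max_features` set to `'sqrt'` or `'log2'`, the tree samples a subset of features at each split, so results can change between runs unless `random_state` is fixed. Also, with the default `refit=True`, `GridSearchCV` already refits the best configuration, so `best_estimator_` can be used directly. The sketch below assumes synthetic data in place of the six salary features and a smaller grid.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression data standing in for the six salary features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 6))
y = 50000 + 100000 * X[:, 0] + rng.normal(0, 5000, size=200)

param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
}

# random_state fixes the feature subsampling used by 'sqrt' and 'log2'.
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)

# With the default refit=True, best_estimator_ is already refit on all of X, y.
best_tree = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

For a regressor, the default score reported in `best_score_` is the mean cross-validated R-squared.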


## Performance

In [150]:
# Predict on the train set
y_train_pred = model.predict(x_train)

# Compute R-squared on the train set
r2_train = r2_score(y_train, y_train_pred)

# Predict on the validation set
y_val_pred = model.predict(x_val)

# Compute R-squared on the validation set
r2_val = r2_score(y_val, y_val_pred)

# Predict on the test set
y_test_pred = model.predict(x_test)

# Compute R-squared on the test set
r2_test = r2_score(y_test, y_test_pred)

print("Train set R-squared:", r2_train)
print("Validation set R-squared:", r2_val)
print("Test set R-squared:", r2_test)

Train set R-squared: 0.941485611479774
Validation set R-squared: 0.8675093882733917
Test set R-squared: 0.9091321889156794
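
For reference, `r2_score` reports the coefficient of determination, `1 - SS_res / SS_tot`: the share of variation around the mean that the predictions account for. Here is a minimal NumPy sketch of that formula, using hypothetical values chosen only to exercise it.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical values, not salary data.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(round(r_squared(y_true, y_pred), 4))  # 0.9486
```

A value of 1.0 means perfect predictions, and 0.0 means no better than always predicting the mean.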


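#### Midterm assignment Decision Tree performance
##### Train set R-squared: 0.9630213129532259
##### Validation set R-squared: 0.8125884097282398
##### Test set R-squared: 0.9061828247680328
##### Performance improved over the previous model: validation R-squared rose from about 0.813 to 0.868 and test R-squared from about 0.906 to 0.909, while train R-squared fell from about 0.963 to 0.941.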

## Build Neural Network Model

#### Data preprocessing

In [152]:
salary_data = salary_data.reindex(['Gender', "Education_Bachelor's", "Education_Master's", 'Education_PhD', 'Scaled_Age', 'Scaled Years of Experience', 'Salary'], axis=1)
salary_data

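| | Gender | Education_Bachelor's | Education_Master's | Education_PhD | Scaled_Age | Scaled Years of Experience | Salary |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0.300000 | 0.20 | 90000 |
| 1 | 0 | 0 | 1 | 0 | 0.166667 | 0.12 | 65000 |
| 2 | 1 | 0 | 0 | 1 | 0.733333 | 0.60 | 150000 |
| 3 | 0 | 1 | 0 | 0 | 0.433333 | 0.28 | 60000 |
| 4 | 1 | 0 | 1 | 0 | 0.966667 | 0.80 | 200000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 368 | 0 | 1 | 0 | 0 | 0.400000 | 0.32 | 85000 |
| 369 | 1 | 0 | 1 | 0 | 0.666667 | 0.76 | 170000 |
| 370 | 0 | 1 | 0 | 0 | 0.200000 | 0.08 | 40000 |
| 371 | 1 | 1 | 0 | 0 | 0.366667 | 0.28 | 90000 |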


#### Model Construction

In [154]:
train_data, test_data = train_test_split(salary_data, test_size=0.2, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.25, random_state=42)

x_train = torch.Tensor(train_data.drop(['Salary'], axis=1).values)
y_train = torch.Tensor(train_data['Salary'].values)

x_val = torch.Tensor(val_data.drop(['Salary'], axis=1).values)
y_val = torch.Tensor(val_data['Salary'].values)

x_test = torch.Tensor(test_data.drop(['Salary'], axis=1).values)
y_test = torch.Tensor(test_data['Salary'].values)

In [155]:
print(x_train.shape)
print(y_train.shape)
print(x_val.shape)

torch.Size([223, 6])
torch.Size([223])
torch.Size([75, 6])


#### Train Model & Select Model

In [175]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(6, 120)
        self.fc2 = nn.Linear(120, 30)
        self.fc3 = nn.Linear(30, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = Net()

# Train the model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

train_losses = []
val_losses = []

num_epochs = 1000

for epoch in range(num_epochs):
    optimizer.zero_grad()
    output = model(x_train)
    train_loss = criterion(output, y_train.unsqueeze(1))
    train_losses.append(train_loss.item())
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        val_output = model(x_val)
        val_loss = criterion(val_output, y_val.unsqueeze(1))
        val_losses.append(val_loss.item())

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Train Loss: {train_loss.item()}, Val Loss: {val_loss.item()}")

Epoch 0, Train Loss: 12689928192.0, Val Loss: 11223294976.0
Epoch 100, Train Loss: 8854967296.0, Val Loss: 7647623168.0
Epoch 200, Train Loss: 991776704.0, Val Loss: 1002403904.0
Epoch 300, Train Loss: 546295808.0, Val Loss: 512708768.0
Epoch 400, Train Loss: 402083616.0, Val Loss: 349829440.0
Epoch 500, Train Loss: 349552128.0, Val Loss: 295919072.0
Epoch 600, Train Loss: 319503936.0, Val Loss: 271123648.0
Epoch 700, Train Loss: 297297568.0, Val Loss: 254992272.0
Epoch 800, Train Loss: 279400416.0, Val Loss: 242731008.0
Epoch 900, Train Loss: 264924288.0, Val Loss: 233509520.0
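
The training loop calls `y_train.unsqueeze(1)` because the network outputs a tensor of shape `[N, 1]` while the targets have shape `[N]`. If the shapes are left mismatched, broadcasting compares every prediction with every target instead of pairing them up. A small sketch with a hypothetical batch of four values illustrates the difference:

```python
import torch

# Hypothetical tiny batch: 4 predictions of shape [4, 1] and 4 targets of shape [4].
output = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
target = torch.tensor([1.0, 2.0, 3.0, 4.0])

# Without unsqueeze, subtraction broadcasts to a [4, 4] grid of all pairs.
print((output - target).shape)               # torch.Size([4, 4])

# With unsqueeze(1), the shapes match and the difference is element-wise.
print((output - target.unsqueeze(1)).shape)  # torch.Size([4, 1])

wrong = ((output - target) ** 2).mean()
right = ((output - target.unsqueeze(1)) ** 2).mean()
print(wrong.item(), right.item())
```

Here the predictions equal the targets exactly, so the correctly paired loss is zero while the broadcast version is not.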


## Performance

In [176]:
# Evaluate the model
with torch.no_grad():
    train_output = model(x_train)
    train_mse = nn.functional.mse_loss(train_output, y_train.unsqueeze(1)).item()
    train_r2 = r2_score(y_train, train_output.numpy().flatten())

    val_output = model(x_val)
    val_mse = nn.functional.mse_loss(val_output, y_val.unsqueeze(1)).item()
    val_r2 = r2_score(y_val, val_output.numpy().flatten())

    test_output = model(x_test)
    test_mse = nn.functional.mse_loss(test_output, y_test.unsqueeze(1)).item()
    test_r2 = r2_score(y_test, test_output.numpy().flatten())

print("----------------------------------------------------------")    
print(f"Train MSE: {train_mse}, Train R^2: {train_r2}")
print(f"Validation MSE: {val_mse}, Validation R^2: {val_r2}")
print(f"Test MSE: {test_mse}, Test R^2: {test_r2}")

----------------------------------------------------------
Train MSE: 253624848.0, Train R^2: 0.8884076190126133
Validation MSE: 227293408.0, Validation R^2: 0.9027467486276141
Test MSE: 272488480.0, Test R^2: 0.8863486291496709
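
MSE values in the hundreds of millions are hard to interpret because they are in squared salary units. Taking the square root gives the RMSE, which is in the same units as Salary. The sketch below simply applies that conversion to the three MSE values printed above.

```python
import math

# MSE values copied from the evaluation output above.
mse = {"train": 253624848.0, "validation": 227293408.0, "test": 272488480.0}

# RMSE is the square root of MSE, so it is in the same units as Salary.
rmse = {split: math.sqrt(value) for split, value in mse.items()}

for split, value in rmse.items():
    print(f"{split} RMSE: {value:,.0f}")
```

On the test set this works out to a typical error of roughly 16,500 salary units.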


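#### Midterm assignment neural network performance
##### Train MSE: 408524224.0, Train R^2: 0.8202534378581079
##### Validation MSE: 435900032.0, Validation R^2: 0.8134891012092795
##### Test MSE: 372709216.0, Test R^2: 0.844547898493238
##### Performance improved: test R^2 rose from about 0.845 to 0.886 and validation R^2 from about 0.813 to 0.903, with test MSE falling from about 3.73e8 to 2.72e8.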