# Kaspi Lab solution
This notebook presents my solution to the Kaspi Lab task for the second stage of the application to the "Data Science Academy". The data comes from the Kaggle competition attached to this stage (https://www.kaggle.com/c/salarykz/). My best submission placed 10th on the public leaderboard (RMSE 48496.51460). The notebook walks through several versions of the solution, each with its public leaderboard score. Models 1 to 3 use linear regression and differ only in preprocessing; Models 4 and 5 try gradient boosting and XGBoost for comparison.

### About the Task:

The goal is to predict the salaries of the workers in salary_predict.csv, using salary_train.csv as training data.

### About the Data:
+ Id - id of the worker
+ algebra - score in algebra (independent variable, IV)
+ programming - score in programming (IV)
+ data science - score in data science (IV)
+ robotics - score in robotics (IV)
+ economics - score in economics (IV)
+ job - job of the worker (IV)
+ salary - salary of the worker (dependent variable, DV)

In [2]:
import pandas as pd 
import numpy as np

In [3]:
df_train = pd.read_csv('salary_train.csv')
df_train

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary
0,0,87,62,86,61,90,junior developer,140000
1,1,76,84,76,80,79,data scientist,780000
2,2,56,55,99,82,98,developer,210000
3,3,99,66,65,84,58,economist,420000
4,4,73,87,56,84,73,data scientist,760000
...,...,...,...,...,...,...,...,...
8995,8995,58,85,68,62,97,senior developer,590000
8996,8996,92,58,99,77,81,robotics engineer,1050000
8997,8997,92,54,81,63,74,developer,300000
8998,8998,98,90,51,96,56,developer,420000


### To begin:
It is convenient to map each job title to an integer so that the data can be split by job and each group handled separately.

In [4]:
set(df_train.job)

{'data scientist',
 'developer',
 'economist',
 'junior developer',
 'robotics engineer',
 'senior developer'}

In [5]:
dict_jobs = dict(zip(set(df_train.job), np.arange(len(set(df_train.job)))))
dict_jobs

{'developer': 0,
 'senior developer': 1,
 'robotics engineer': 2,
 'data scientist': 3,
 'junior developer': 4,
 'economist': 5}
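
Python randomizes string hashes per process by default, so the iteration order of `set(df_train.job)` can change between kernel sessions and the same job may get a different number after a restart (later outputs in this notebook do show different `job_num` values). A minimal sketch of a reproducible mapping, assuming a short hypothetical list `jobs` in place of the real column and a hypothetical name `stable_jobs`:

```python
# Hypothetical sample of job titles standing in for df_train.job.
jobs = ['junior developer', 'data scientist', 'developer',
        'economist', 'data scientist', 'senior developer',
        'robotics engineer']

# Sorting the unique titles fixes their order, so the mapping is reproducible.
stable_jobs = {job: idx for idx, job in enumerate(sorted(set(jobs)))}
job_nums = [stable_jobs[job] for job in jobs]

print(stable_jobs)
print(job_nums)
```

With a sorted mapping, the per-job lists built later always refer to the same job at the same position.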

In [6]:
df_train['job_num'] = [dict_jobs[i] for i in df_train.job]
df_train

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num
0,0,87,62,86,61,90,junior developer,140000,4
1,1,76,84,76,80,79,data scientist,780000,3
2,2,56,55,99,82,98,developer,210000,0
3,3,99,66,65,84,58,economist,420000,5
4,4,73,87,56,84,73,data scientist,760000,3
...,...,...,...,...,...,...,...,...,...
8995,8995,58,85,68,62,97,senior developer,590000,1
8996,8996,92,58,99,77,81,robotics engineer,1050000,2
8997,8997,92,54,81,63,74,developer,300000,0
8998,8998,98,90,51,96,56,developer,420000,0


In [7]:
dfs = [df_train[df_train['job_num'] == i] for i in range(6)]

In [8]:
dfs[0]

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num
2,2,56,55,99,82,98,developer,210000,0
8,8,91,95,99,62,96,developer,380000,0
9,9,98,93,66,94,64,developer,350000,0
11,11,58,99,99,92,79,developer,380000,0
14,14,64,97,98,54,74,developer,310000,0
...,...,...,...,...,...,...,...,...,...
8984,8984,55,66,84,85,63,developer,240000,0
8990,8990,61,82,69,86,63,developer,280000,0
8993,8993,64,82,57,58,79,developer,340000,0
8997,8997,92,54,81,63,74,developer,300000,0


# Model 1. Minimum preprocessing, sklearn LR (RMSE 48587.93)
Like every other model in this notebook, this one fits a separate linear regression for each job. It applies no preprocessing to the scores.

In [9]:
#coefficients from minimum preprocessing 
from sklearn.linear_model import LinearRegression
models_lin_reg = []
for i in range(6) :
    X = dfs[i][['algebra', 'programming', 'data science', 'robotics', 'economics']]
    y = dfs[i]['salary']
    model_i = LinearRegression().fit(X,y)
    models_lin_reg.append(model_i)
    coef_df = pd.DataFrame(model_i.coef_, X.columns, columns=['Coefficient'])
    print('\n', list(dict_jobs.keys())[i], 'model')
    print(coef_df)


 developer model
              Coefficient
algebra       1457.806051
programming   2820.854794
data science     6.997125
robotics        31.702213
economics      -57.339652

 senior developer model
              Coefficient
algebra       1140.154412
programming   3511.629981
data science  1304.205680
robotics      1133.130216
economics      -15.219287

 robotics engineer model
              Coefficient
algebra       1620.306866
programming   4920.040929
data science  1597.544809
robotics      4682.955139
economics      261.165496

 data scientist model
              Coefficient
algebra       1981.566925
programming   6279.180771
data science  2165.464924
robotics        65.011062
economics      102.452328

 junior developer model
              Coefficient
algebra        996.642201
programming   1054.335249
data science   -24.636493
robotics       -21.075026
economics       -2.359474

 economist model
              Coefficient
algebra       1858.605171
programming    656.524531
data sc
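
The RMSE values quoted in this notebook come from the public leaderboard. A local estimate can be obtained by holding out part of the training rows. The following is only a sketch of that idea for a single job, using synthetic data with made-up coefficients and plain NumPy least squares rather than the notebook's `LinearRegression` objects; `add_intercept` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one job's rows: five scores in [50, 100) and a
# salary that is linear in the scores plus noise (made-up coefficients).
n = 500
X = rng.integers(50, 100, size=(n, 5)).astype(float)
true_coef = np.array([1500.0, 2800.0, 0.0, 0.0, 0.0])
y = X @ true_coef + 20000.0 + rng.normal(0.0, 30000.0, size=n)

# Hold out the last 20% of rows for validation.
split = int(0.8 * n)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]


def add_intercept(features):
    """Append a column of ones so the fit includes an intercept."""
    return np.column_stack((features, np.ones(len(features))))


# Ordinary least squares on the training part only.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# RMSE on rows the model has not seen.
pred_val = add_intercept(X_val) @ coef
rmse_val = np.sqrt(np.mean((y_val - pred_val) ** 2))
print(f'validation RMSE: {rmse_val:.2f}')
```

Repeating this per job and pooling the squared errors would give one number comparable in spirit to the leaderboard score.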

# Model 2. Minimum preprocessing, statsmodels.api OLS (RMSE 48554.29)
Fitting with statsmodels (`sm.OLS`) gave a slightly lower RMSE. I also used it to look for outliers via standardized residuals (none exceeded the threshold of 3).

In [10]:
import statsmodels.api as sm

for i in range(6):
    Y = dfs[i]['salary']
    X = dfs[i][['algebra', 'programming', 'data science', 'robotics', 'economics']]
    # sm.OLS does not add an intercept on its own; X is used as-is here.
    model = sm.OLS(Y, X)
    ress = model.fit()
    # Internally studentized residuals, used below to screen for outliers.
    influence = ress.get_influence()
    standardized_residuals = influence.resid_studentized_internal
    # Work on an explicit copy so the assignment does not raise SettingWithCopyWarning.
    dfs[i] = dfs[i].copy()
    dfs[i]['std residuals'] = standardized_residuals
dfs[0]


Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num,std residuals
2,2,56,55,99,82,98,developer,210000,0,-0.719604
8,8,91,95,99,62,96,developer,380000,0,-0.435389
9,9,98,93,66,94,64,developer,350000,0,-1.877444
11,11,58,99,99,92,79,developer,380000,0,0.892707
14,14,64,97,98,54,74,developer,310000,0,-1.898121
...,...,...,...,...,...,...,...,...,...,...
8984,8984,55,66,84,85,63,developer,240000,0,-0.800896
8990,8990,61,82,69,86,63,developer,280000,0,-1.325150
8993,8993,64,82,57,58,79,developer,340000,0,0.866745
8997,8997,92,54,81,63,74,developer,300000,0,0.786833


In [12]:
# removing outliers using a z-score threshold on the standardized residuals (thresholds below 3 give worse results)
from scipy import stats

z_score = 3

for i in range(6) :
    print('before:', dfs[i].shape)
    abs_z_scores = np.abs(dfs[i]['std residuals'])
    filtered_entries = (abs_z_scores < z_score)
    dfs[i] = dfs[i][filtered_entries]
    print('after:', dfs[i].shape)

before: (1528, 10)
after: (1528, 10)
before: (1487, 10)
after: (1487, 10)
before: (1543, 10)
after: (1543, 10)
before: (1478, 10)
after: (1478, 10)
before: (1467, 10)
after: (1467, 10)
before: (1497, 10)
after: (1497, 10)
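
For reference, the internally studentized residual of row i is r_i = e_i / (s * sqrt(1 - h_ii)), where e_i is the raw residual, s the residual standard error and h_ii the leverage. The sketch below recomputes this with NumPy on synthetic data (made-up coefficients, no intercept, mirroring the `sm.OLS(Y, X)` call above). It illustrates the formula and does not replace the statsmodels call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix (hypothetical data): 200 rows, 5 score columns.
n, p = 200, 5
X = rng.integers(50, 100, size=(n, p)).astype(float)
y = X @ np.array([1500.0, 2800.0, 10.0, 30.0, -50.0]) + rng.normal(0.0, 30000.0, size=n)

# Least-squares fit without an intercept.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverages h_ii: the diagonal of the hat matrix X (X^T X)^{-1} X^T.
hat_diag = np.einsum('ij,ij->i', X @ np.linalg.inv(X.T @ X), X)

# Residual standard error with n - p degrees of freedom.
sigma = np.sqrt(resid @ resid / (n - p))

# Internally studentized residuals and the |r| < 3 filter used in the notebook.
std_resid = resid / (sigma * np.sqrt(1.0 - hat_diag))
keep = np.abs(std_resid) < 3
print(keep.sum(), 'of', n, 'rows kept')
```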


# Model 3. Applying square root to data to reduce Right Skewness (RMSE 48496.51)

In [20]:
from scipy.stats import skew
features = ['algebra', 'programming', 'data science', 'robotics', 'economics']
for i, job in enumerate(dict_jobs):
    print(job)
    # Skewness of each score column within this job's rows.
    for col in features:
        print(f'{col}:', skew(dfs[i][col]))
    print('\n')

developer
algebra: 0.04940241498366506
programming: -0.018110800230112672
data science: -0.016915605374284107
robotics: -0.03119132122410138
economics: -0.03429252748383259


senior developer
algebra: -0.020257427267430916
programming: 0.015055471100512275
data science: -0.020316750033975685
robotics: 0.006589869455353788
economics: 0.027055441280793992


robotics engineer
algebra: 0.011971334030475787
programming: 0.017896158809475784
data science: 0.004261876937539874
robotics: 0.07672211692404061
economics: 0.0033699056551626987


data scientist
algebra: 0.026104991161351537
programming: 0.013926640327198946
data science: 0.046909644935809805
robotics: -0.038578253794357906
economics: -0.032738166997999814


junior developer
algebra: -0.060870945013425805
programming: -0.035124152308338465
data science: 0.0031402195368877817
robotics: 0.060931998224275905
economics: -0.031119923968141453


economist
algebra: 0.03367083897363352
programming: -0.02348388662726023
data science: -0.0265
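
The skewness values printed above are all close to zero. As a rough illustration of what a square-root transform does to skewness, the sketch below compares a clearly right-skewed synthetic sample with roughly uniform synthetic scores. `sample_skewness` is a hypothetical helper that follows the default (biased) definition used by `scipy.stats.skew`; none of the data comes from the competition.

```python
import numpy as np


def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5."""
    centered = np.asarray(values, dtype=float) - np.mean(values)
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5


rng = np.random.default_rng(0)

# A right-skewed sample (exponential) versus roughly uniform scores.
skewed = rng.exponential(scale=1.0, size=10_000)
scores = rng.integers(50, 100, size=10_000)

print('exponential:      ', sample_skewness(skewed))
print('sqrt(exponential):', sample_skewness(np.sqrt(skewed)))
print('uniform scores:   ', sample_skewness(scores))
print('sqrt(scores):     ', sample_skewness(np.sqrt(scores)))
```

The transform lowers skewness in both cases. For data that is already nearly symmetric, that means pushing it slightly negative.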

In [35]:

features = ['algebra', 'programming', 'data science', 'robotics', 'economics']

for i in range(6):
    # Square-root-transform the training scores and append an intercept column.
    X = np.sqrt(dfs[i][features].to_numpy())
    X = np.column_stack((X, np.ones(len(X))))
    y = dfs[i]['salary']

    # Ordinary least squares; rcond=None silences the FutureWarning from lstsq.
    a = np.linalg.lstsq(X, y, rcond=None)[0]

    # dfs_test is assumed to be the per-job split of the test set
    # (its definition is not shown in this notebook).
    X_test = np.sqrt(dfs_test[i][features].to_numpy())
    # assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
    dfs_test[i] = dfs_test[i].assign(salary=X_test @ a[:5] + a[5])


In [36]:
dfs_test[1]

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num
12,9012,91,79,97,89,55,data scientist,888079.115091,1
21,9021,86,68,61,93,61,data scientist,736053.099099,1
25,9025,62,91,52,79,99,data scientist,807471.658719,1
29,9029,82,86,69,97,72,data scientist,858359.355688,1
30,9030,92,87,73,73,73,data scientist,889604.611899,1
...,...,...,...,...,...,...,...,...,...
970,9970,91,58,68,66,81,data scientist,693710.176932,1
973,9973,55,57,54,67,77,data scientist,580955.454411,1
980,9980,87,85,69,79,60,data scientist,859406.026561,1
986,9986,89,62,73,86,65,data scientist,728630.255944,1


In [37]:
df_temp2 = pd.concat([dfs_test[i] for i in range(6)])
df_temp2

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num
4,9004,52,85,92,87,62,junior developer,142674.490504,0
6,9006,56,50,76,83,61,junior developer,109040.600785,0
7,9007,93,51,56,91,64,junior developer,147472.649910,0
10,9010,56,90,93,60,82,junior developer,152590.867214,0
16,9016,68,77,87,91,86,junior developer,152230.772860,0
...,...,...,...,...,...,...,...,...,...
957,9957,56,69,91,91,98,developer,269982.779530,5
975,9975,93,96,81,51,56,developer,396882.612660,5
984,9984,97,78,91,84,75,developer,355220.208324,5
988,9988,97,81,65,76,50,developer,364667.365034,5


In [38]:
df_temp2 = df_temp2.sort_values(by = ['Id'], ascending=True)
df_temp2

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num
0,9000,73,59,57,54,61,robotics engineer,7.267714e+05,3
1,9001,77,80,53,93,80,senior developer,5.587033e+05,4
2,9002,95,72,88,63,84,developer,3.347953e+05,5
3,9003,83,88,97,75,50,robotics engineer,1.048556e+06,3
4,9004,52,85,92,87,62,junior developer,1.426745e+05,0
...,...,...,...,...,...,...,...,...,...
995,9995,83,98,71,83,61,junior developer,1.878669e+05,0
996,9996,98,59,74,79,52,economist,3.750020e+05,2
997,9997,91,68,76,50,92,developer,3.171248e+05,5
998,9998,61,70,95,51,87,junior developer,1.381405e+05,0


In [39]:
df_results3 = pd.DataFrame()
df_results3['Id'] = df_temp2['Id']
df_results3['salary'] = np.round(df_temp2['salary'], 0).astype(int)
df_results3 = df_results3.set_index('Id')
df_results3

Id,salary
9000,726771
9001,558703
9002,334795
9003,1048556
9004,142674
...,...
9995,187867
9996,375002
9997,317125
9998,138141


In [40]:
df_results3['salary'] = [i if i < 1000000 else 1000000 for i in df_results3['salary']]
df_results3

Id,salary
9000,726771
9001,558703
9002,334795
9003,1000000
9004,142674
...,...
9995,187867
9996,375002
9997,317125
9998,138141
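
The cap at 1,000,000 applied above can also be written as a vectorized minimum. This is a small sketch using a few of the rounded predictions shown earlier; with a pandas Series, `Series.clip(upper=1_000_000)` does the same.

```python
import numpy as np

# A few rounded predictions from the table above, one of them over the cap.
predicted = np.array([726771, 558703, 1048556, 142674])

# Cap at 1,000,000; values at or below the cap are left unchanged.
capped = np.minimum(predicted, 1_000_000)
print(capped)
```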


In [41]:
df_results3.to_csv('salary_results.csv')

In [42]:
df_train

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num,pred 1st model
0,0,87,62,86,61,90,junior developer,140000,0,154696
1,1,76,84,76,80,79,data scientist,780000,1,845031
2,2,56,55,99,82,98,developer,210000,5,228996
3,3,99,66,65,84,58,economist,420000,2,390894
4,4,73,87,56,84,73,data scientist,760000,1,814260
...,...,...,...,...,...,...,...,...,...,...
8995,8995,58,85,68,62,97,senior developer,590000,4,538016
8996,8996,92,58,99,77,81,robotics engineer,1050000,3,934841
8997,8997,92,54,81,63,74,developer,300000,5,279304
8998,8998,98,90,51,96,56,developer,420000,5,391470


In [43]:
dfs[0]

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num,pred 1st model,std residuals
0,0,87,62,86,61,90,junior developer,140000,0,154696,-1.117452
22,22,50,72,68,88,96,junior developer,120000,0,128224,-0.615276
25,25,72,98,91,99,91,junior developer,170000,0,176776,-0.601019
37,37,68,55,66,51,74,junior developer,120000,0,129120,-0.608096
43,43,83,85,93,87,58,junior developer,190000,0,174314,1.141339
...,...,...,...,...,...,...,...,...,...,...,...
8946,8946,93,72,80,63,81,junior developer,160000,0,171346,-0.868296
8965,8965,79,86,51,86,56,junior developer,180000,0,172442,0.592535
8969,8969,51,59,81,93,51,junior developer,100000,0,115194,-1.089890
8973,8973,98,83,77,56,56,junior developer,200000,0,188207,0.893814


In [44]:
dfs_test[0]

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary,job_num
4,9004,52,85,92,87,62,junior developer,142674.490504,0
6,9006,56,50,76,83,61,junior developer,109040.600785,0
7,9007,93,51,56,91,64,junior developer,147472.649910,0
10,9010,56,90,93,60,82,junior developer,152590.867214,0
16,9016,68,77,87,91,86,junior developer,152230.772860,0
...,...,...,...,...,...,...,...,...,...
983,9983,59,83,86,66,59,junior developer,149306.499564,0
987,9987,78,75,62,60,84,junior developer,161391.985145,0
992,9992,52,89,70,65,58,junior developer,147508.160985,0
995,9995,83,98,71,83,61,junior developer,187866.860827,0


# Model 4. Gradient Boosting Regressor (RMSE 50292.96)

In [45]:
from sklearn.ensemble import GradientBoostingRegressor
features = ['algebra', 'programming', 'data science', 'robotics', 'economics']
for i in range(6):
    # One gradient-boosting model per job, trained on the raw scores.
    reg = GradientBoostingRegressor(random_state=0)
    reg.fit(dfs[i][features], dfs[i]['salary'])
    # assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
    dfs_test[i] = dfs_test[i].assign(salary2=reg.predict(dfs_test[i][features]))

# Model 5. XGBoost (RMSE 54539.94)

In [21]:
# Necessary imports 
import numpy as np 
import pandas as pd 
import xgboost as xgb
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error as MSE 

# Load the data 
df_train = pd.read_csv("salary_train.csv") 
df_train

Unnamed: 0,Id,algebra,programming,data science,robotics,economics,job,salary
0,0,87,62,86,61,90,junior developer,140000
1,1,76,84,76,80,79,data scientist,780000
2,2,56,55,99,82,98,developer,210000
3,3,99,66,65,84,58,economist,420000
4,4,73,87,56,84,73,data scientist,760000
...,...,...,...,...,...,...,...,...
8995,8995,58,85,68,62,97,senior developer,590000
8996,8996,92,58,99,77,81,robotics engineer,1050000
8997,8997,92,54,81,63,74,developer,300000
8998,8998,98,90,51,96,56,developer,420000


In [22]:
dict_jobs = dict(zip(set(df_train.job), np.arange(len(set(df_train.job)))))
dict_jobs

{'developer': 0,
 'senior developer': 1,
 'robotics engineer': 2,
 'data scientist': 3,
 'junior developer': 4,
 'economist': 5}

In [23]:
df_train['job_num'] = [dict_jobs[i] for i in df_train.job]
dfs = [df_train[df_train['job_num'] == i] for i in range(6)]

In [24]:
models_lin_reg = []
for i in range(6) :
    X = dfs[i][['algebra', 'programming', 'data science', 'robotics', 'economics']]
    y = dfs[i]['salary']

    train_dmatrix = xgb.DMatrix(data = X, label = y) 
    # Parameter dictionary specifying base learner 
    # "reg:squarederror" is the current name of the deprecated "reg:linear" objective.
    param = {"booster": "gblinear", "objective": "reg:squarederror"}
  
    xgb_r = xgb.train(params = param, dtrain = train_dmatrix, num_boost_round = 1000) 
    
    models_lin_reg.append(xgb_r)



In [25]:
df_test = pd.read_csv('salary_predict.csv')
df_test['job_num'] = [dict_jobs[i] for i in df_test.job]

In [26]:
# Predict row by row, picking the model that matches each row's job.
features = ['algebra', 'programming', 'data science', 'robotics', 'economics']
results = [
    models_lin_reg[df_test['job_num'][i]].predict(xgb.DMatrix(df_test.iloc[[i]][features]))
    for i in range(df_test.shape[0])
]
df_results = pd.DataFrame()
df_results['Id'] = df_test['Id']
df_results['salary'] = (np.round(np.array(results), 0)).astype(int)
df_results = df_results.set_index('Id')
df_results

Id,salary
9000,729280
9001,557923
9002,333901
9003,1047116
9004,143433
...,...
9995,188629
9996,378838
9997,315856
9998,137220
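
The cell above calls `predict` once per test row. One alternative is to predict each job's rows in a single call and write the results back by index label, which also keeps the original row order. The sketch below shows the pattern with plain linear coefficients in place of the XGBoost models. The tiny `df_test` frame, `coef_by_job` and its numbers are all hypothetical.

```python
import numpy as np
import pandas as pd

features = ['algebra', 'programming']

# Hypothetical test frame and per-job linear coefficients (last entry = intercept).
df_test = pd.DataFrame({
    'algebra': [80, 60, 90, 70],
    'programming': [70, 90, 50, 60],
    'job': ['developer', 'economist', 'developer', 'economist'],
})
coef_by_job = {
    'developer': np.array([1000.0, 2000.0, 500.0]),
    'economist': np.array([3000.0, 100.0, 0.0]),
}

# Predict each job's rows in one call and write them back by index label.
pred = pd.Series(np.nan, index=df_test.index)
for job, rows in df_test.groupby('job'):
    coef = coef_by_job[job]
    pred.loc[rows.index] = rows[features].to_numpy() @ coef[:-1] + coef[-1]

df_test['salary'] = pred.round().astype(int)
print(df_test)
```

Because the predictions are aligned on the index, no concatenation or re-sorting by `Id` is needed afterwards.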


In [27]:

df_results['salary'] = [i if i < 1000000 else 1000000 for i in df_results['salary']]
df_results


Id,salary
9000,729280
9001,557923
9002,333901
9003,1000000
9004,143433
...,...
9995,188629
9996,378838
9997,315856
9998,137220


In [28]:
df_results.to_csv('salary_resultsXGB.csv')

In [29]:
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

In [30]:
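def objective(space):
    # Relies on the globals X and y set in the loop below. Only reg_alpha and
    # colsample_bytree from the search space are actually used.
    clf = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=1400,
                           reg_alpha=space['reg_alpha'],
                           colsample_bytree=space['colsample_bytree'])

    clf.fit(X, y, verbose=False)

    # Note: the loss is the RMSE on the training data itself, not on a hold-out set.
    pred = clf.predict(X)
    mse = mean_squared_error(y, pred)
    return {'loss': np.sqrt(mse), 'status': STATUS_OK}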

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import xgboost as xgb
hyper_params_matrix = []
for i in range(6) :
    X = dfs[i][['algebra', 'programming', 'data science', 'robotics', 'economics']]
    scaler = StandardScaler()
    scaler.fit(X)
    X = scaler.transform(X)
    y = dfs[i]['salary']

    space={'max_depth': hp.quniform("max_depth", 3, 18, 1),
            'gamma': hp.uniform ('gamma', 1,9),
            'reg_alpha' : hp.quniform('reg_alpha', 0.0,0.3,0.001),
            'reg_lambda' : hp.uniform('reg_lambda', 0,1),
            'colsample_bytree' : hp.uniform('colsample_bytree', 0.5,1),
            'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
            'n_estimators': 180,
            'seed': 0
        }
    trials = Trials()

    best_hyperparams = fmin(fn = objective,
                            space = space,
                            
                            
                            algo = tpe.suggest,
                            max_evals = 10,
                            trials = trials
                           )
    hyper_params_matrix.append(best_hyperparams)

100%|███████████████████████████████████████████████| 10/10 [00:11<00:00,  1.17s/trial, best loss: 0.05198387782836091]
100%|███████████████████████████████████████████████| 10/10 [00:10<00:00,  1.08s/trial, best loss: 0.10122751873756773]
100%|████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.06s/trial, best loss: 0.1777834491617604]
100%|███████████████████████████████████████████████| 10/10 [00:11<00:00,  1.12s/trial, best loss: 0.10555847355683091]


100%|██████████████████████████████████████████████| 10/10 [00:11<00:00,  1.12s/trial, best loss: 0.036539267091432924]
100%|███████████████████████████████████████████████| 10/10 [00:11<00:00,  1.11s/trial, best loss: 0.05921316234678638]


In [32]:
models_lin_reg = []
for i in range(6) :
    X = dfs[i][['algebra', 'programming', 'data science', 'robotics', 'economics']]
    scaler = StandardScaler()
    scaler.fit(X)
    X = scaler.transform(X)
    y = dfs[i]['salary']
    space = hyper_params_matrix[i]
    # "reg:squarederror" is the current name of the deprecated "reg:linear" objective.
    xgb_r = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=1400,
                             reg_alpha=space['reg_alpha'],
                             colsample_bytree=space['colsample_bytree'])
    xgb_r.fit(X, y)
    
    models_lin_reg.append(xgb_r)



In [33]:
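# NOTE: `scaler` is the one left over from the last loop iteration (job 5), so
# every test row is scaled with that job's statistics rather than its own.
features = ['algebra', 'programming', 'data science', 'robotics', 'economics']
results = [
    models_lin_reg[df_test['job_num'][i]].predict(scaler.transform(df_test.iloc[[i]][features]))
    for i in range(df_test.shape[0])
]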
df_results = pd.DataFrame()
df_results['Id'] = df_test['Id']
df_results['salary'] = (np.round(np.array(results), 0)).astype(int)
df_results = df_results.set_index('Id')
df_results

Id,salary
9000,740818
9001,509797
9002,353423
9003,1047886
9004,151659
...,...
9995,189323
9996,344500
9997,325248
9998,137172


In [34]:
df_results['salary'] = [i if i < 1000000 else 1000000 for i in df_results['salary']]
df_results.to_csv('salary_resultsXGB2.csv')