Import all the relevant packages.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import category_encoders as ce

The <b>dropIrrelevantColumns</b> function drops the following columns from the dataset passed as its argument:
<ul>
    <li>Wears Glasses</li>
    <li>Hair Color</li>
    <li>Instance</li>
</ul>

In [2]:
def dropIrrelevantColumns(data) :
    # Drop the columns that are not used as model features in a single call
    return data.drop(columns=['Wears Glasses', 'Hair Color', 'Instance'])

<p>The <b>preprocessData</b> function is the most important function. It performs all the imputations and transformations needed to clean and preprocess the data.</p>
The function takes 2 arguments:
<ol>
    <li>Training Dataset</li>
    <li>Test Dataset</li>
</ol>
The following steps are performed:
<ol>
    <li>Split the data into independent variables (the features) and the dependent variable (<b>Income in EUR</b>).</li>
    <li>Add a new column named <b>train</b> to both datasets, set to 1 for the training dataset and 0 for the test dataset. This flag identifies which rows belong to which dataset.</li>
    <li>Combine the training and test datasets so that both are imputed in the same way. This is done because the test dataset contains a few values that do not occur in the training dataset.</li>
    <li>As part of preprocessing, the following operations are performed:
        <ol>
            <li>Fill <i>NaN</i> in <b>Gender</b> with <i>unknown</i></li>
            <li>Replace <i>0</i> in <b>Gender</b> with <i>unknown</i></li>
            <li>Fill <i>NaN</i> in <b>University Degree</b> with <i>No</i></li>
            <li>Replace <i>0</i> in <b>University Degree</b> with <i>No</i></li>
            <li>Fill <i>NaN</i> in <b>Profession</b> with the <i>modal value</i></li>
            <li>Fill <i>NaN</i> in <b>Country</b> with the <i>modal value</i></li>
            <li>Fill <i>NaN</i> in <b>Age</b> with the <i>mean value</i></li>
            <li>Fill <i>NaN</i> in <b>Year of Record</b> with the <i>median value</i></li>
            <li>Fill <i>NaN</i> in <b>Body Height [cm]</b> with the <i>mean value</i></li>
            <li>Fill <i>NaN</i> in <b>Size of City</b> with the <i>mean value of the Size of City for the particular country and year.</i></li>
            <li>Split the dataset back into Training and Testing datasets</li>
            <li>Target Encode all the categorical data</li>
        </ol>
    </li>
</ol>
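The group-wise fill described for <b>Size of City</b> can be illustrated on a toy table. This is a minimal sketch with made-up values, not the notebook's data: each missing entry takes the mean of the rows that share its country and year.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with two missing city sizes.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year of Record": [2000, 2000, 2000, 2001, 2001],
    "Size of City": [100.0, 300.0, np.nan, 50.0, np.nan],
})

# Mean of 'Size of City' within each (Country, Year of Record) group,
# broadcast back to the original row order.
group_mean = (
    toy.groupby(["Country", "Year of Record"])["Size of City"].transform("mean")
)

# Only the missing entries are replaced; observed values stay untouched.
toy["Size of City"] = toy["Size of City"].fillna(group_mean)
print(toy)
```

If every row of a group is missing, the group mean is itself NaN and the gap remains, so a global fallback may still be needed on real data.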

In [3]:
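def preprocessData(data, data_test) :
    # Features are every column except the last; the target is 'Income in EUR'
    X = pd.DataFrame(data.iloc[:, :-1])
    X_test = pd.DataFrame(data_test.iloc[:, :-1])
    Y = pd.DataFrame(data['Income in EUR'])

    # Flag the origin of each row, then stack both sets so they are imputed together.
    # ignore_index=True keeps the row labels unique in the combined frame.
    X['train'] = 1
    X_test['train'] = 0
    cmb = pd.concat([X, X_test], ignore_index=True)
    del X
    del X_test

    # Categorical columns: NaN and the placeholder '0' get an explicit label
    cmb['Gender'] = cmb['Gender'].fillna('unknown')
    cmb['Gender'] = cmb['Gender'].replace('0', 'unknown')
    cmb['University Degree'] = cmb['University Degree'].fillna('No')
    cmb['University Degree'] = cmb['University Degree'].replace('0', 'No')
    cmb['Profession'] = cmb['Profession'].fillna(cmb['Profession'].mode()[0])
    cmb['Country'] = cmb['Country'].fillna(cmb['Country'].mode()[0])

    # Numeric columns (assignment instead of inplace fillna on a column selection)
    cmb['Age'] = cmb['Age'].fillna(cmb['Age'].mean())
    cmb['Year of Record'] = cmb['Year of Record'].fillna(cmb['Year of Record'].median())
    cmb['Body Height [cm]'] = cmb['Body Height [cm]'].fillna(cmb['Body Height [cm]'].mean())

    # Missing 'Size of City' takes the mean of its (Country, Year of Record) group
    group_mean = cmb.groupby(['Country', 'Year of Record'])['Size of City'].transform('mean')
    cmb['Size of City'] = cmb['Size of City'].fillna(group_mean)

    # Split back into training and test rows and drop the helper flag
    X = cmb[cmb['train'] == 1].drop(columns='train')
    X_test = cmb[cmb['train'] == 0].drop(columns='train')

    # Fit the target encoder on the training rows only, then apply it to the test rows
    te = ce.TargetEncoder()
    X = te.fit_transform(X, Y, verbose = 1)
    X_test = te.transform(X_test)
    return (X,Y, X_test)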
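
To show what target encoding does to a categorical column, here is a simplified sketch in plain pandas. It is not the <b>category_encoders</b> implementation: <b>TargetEncoder</b> additionally smooths each category mean toward the overall mean, while this version uses the raw per-category mean. The column values and the fallback for unseen categories are assumptions made for illustration.

```python
import pandas as pd

# Toy training and test data (hypothetical values).
train = pd.DataFrame({
    "Profession": ["nurse", "nurse", "pilot", "pilot", "chef"],
    "Income": [30.0, 50.0, 90.0, 110.0, 40.0],
})
test = pd.DataFrame({"Profession": ["nurse", "pilot", "actor"]})

# Statistics are learned from the training rows only.
category_means = train.groupby("Profession")["Income"].mean()
global_mean = train["Income"].mean()

# Each category is replaced by the mean target value seen for it in training.
train_encoded = train["Profession"].map(category_means)

# A category never seen in training ("actor") falls back to the global mean.
test_encoded = test["Profession"].map(category_means).fillna(global_mean)

print(train_encoded.tolist())
print(test_encoded.tolist())
```

Learning the means from the training rows alone is the reason the encoder is fitted after the combined data has been split back into training and test sets.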

Load both datasets and call <b>dropIrrelevantColumns</b> and <b>preprocessData</b> on them.

In [4]:
data = pd.read_csv('data.csv')
data_test = pd.read_csv('data_test.csv')
data = dropIrrelevantColumns(data)
data_test = dropIrrelevantColumns(data_test)
X , Y , X_test = preprocessData(data, data_test)

Once preprocessing is done, the next step is to fit the regressor to the training data. The code below trains a RandomForestRegressor with 50 trees, predicts the income for every test row, and writes the predictions to PredictedSalary.csv.

In [5]:
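regressor = RandomForestRegressor(n_estimators=50, verbose=1, n_jobs=-1)
# scikit-learn expects a 1-D target, so flatten the single-column DataFrame
regressor.fit(X, Y.values.ravel())
Y_pred = regressor.predict(X_test)
with open("PredictedSalary.csv", "w") as file:
    file.write("Instance,Income"+"\n")
    # Instance IDs are assumed to run consecutively from 111994;
    # enumerate pairs each ID with its prediction instead of adding the offset to the prediction
    for instance, income in enumerate(Y_pred, start=111994) :
        file.write(str(instance) + "," + str(income) + "\n")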

  
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    3.6s finished
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done  50 out of  50 | elapsed:    0.1s finished
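
The submission file can also be built as a DataFrame, which makes the pairing of IDs and predictions explicit. This is a sketch only: <b>write_submission</b> and its parameters are hypothetical names, and it assumes the Instance IDs are consecutive starting from a known first value.

```python
import io

import numpy as np
import pandas as pd


def write_submission(predictions, first_instance, target):
    """Write 'Instance,Income' rows to a path or file-like object.

    Hypothetical helper: assumes IDs are consecutive from first_instance.
    """
    predictions = np.asarray(predictions).ravel()
    submission = pd.DataFrame({
        "Instance": np.arange(first_instance, first_instance + len(predictions)),
        "Income": predictions,
    })
    submission.to_csv(target, index=False)
    return submission


# Usage with an in-memory buffer and made-up predictions.
buffer = io.StringIO()
result = write_submission([1000.5, 2000.0], first_instance=111994, target=buffer)
print(buffer.getvalue())
```

Passing a file name such as "PredictedSalary.csv" as <b>target</b> writes the same rows to disk.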
