# **Job Salary Prediction**
This notebook uses Python machine learning and data science libraries to build a model that predicts job salaries.

We're going to take the following approach:
1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

# 1. Problem Definition

Successful models will incorporate some analysis of the impact of including different keywords or phrases, as well as making use of the structured data fields like location, hours or company.  Some of the structured data shown (such as category) is 'inferred' by Adzuna's own processes, based on where an ad came from or its contents, and may not be "correct" but is representative of the real data.

You will be provided with a training data set on which to build your model, which will include all variables including salary. A second data set will be used to provide feedback on the public leaderboard. After approximately 6 weeks, Kaggle will release a final data set that does not include the salary field to participants, who will then be required to submit their salary predictions against each job for evaluation.

# 2. Data

The main dataset consists of a large number of rows representing individual job ads, and a series of fields about each job ad.

# 3. Evaluation

Our evaluation data set is simply a random subset of ads for which we know the salary, that were not included in the training and public testing datasets.

The evaluation metric for this competition is Mean Absolute Error (MAE): the average absolute difference between the predicted and the actual salary.
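
As a minimal sketch of what this metric measures, the snippet below computes it by hand on made-up salaries. The function name `mean_absolute_error_manual` and all numbers are illustrative assumptions, not competition data.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all ads."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up salaries, for illustration only
actual = [25000, 30000, 45000]
predicted = [27000, 30000, 40000]

# Absolute errors are 2000, 0 and 5000, so the mean is 7000 / 3
mae_example = mean_absolute_error_manual(actual, predicted)
print(mae_example)
```

Because every error counts in proportion to its size, a handful of very badly predicted ads moves this score less than it would move a squared-error metric.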

Sample submission files can be downloaded from the data page. Submission files should be formatted as follows:

- Have a header: "Id,SalaryNormalized"
- Contain two columns:
  - Id: Id for the ads in the validation set, in sorted order
  - SalaryNormalized: your predicted salary for the job ad
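
The format can be sketched with the standard library alone. The helper `write_submission` and the two example rows are hypothetical, and sorting by `Id` is one reading of "in sorted order".

```python
import csv
import io

def write_submission(rows, handle):
    """Write (Id, SalaryNormalized) pairs, sorted by Id, under the required header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Id", "SalaryNormalized"])
    for ad_id, salary in sorted(rows):
        writer.writerow([ad_id, salary])

# Made-up ids and predictions, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_submission([(13656201, 36205.0), (11888454, 74570.5)], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```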
# 4. Features

The fields recorded for each job ad are as follows:

Id - A unique identifier for each job ad

Title - A freetext field supplied to us by the job advertiser as the Title of the job ad.  Normally this is a summary of the job title or role.

FullDescription - The full text of the job ad as provided by the job advertiser.  Where you see ***s, we have stripped values from the description in order to ensure that no salary information appears within the descriptions.  There may be some collateral damage here where we have also removed other numerics.

LocationRaw - The freetext location as provided by the job advertiser.

LocationNormalized - Adzuna's normalised location from within our own location tree, interpreted by us based on the raw location.  Our normaliser is not perfect!

ContractType - full_time or part_time, interpreted by Adzuna from description or a specific additional field we received from the advertiser.

ContractTime - permanent or contract, interpreted by Adzuna from description or a specific additional field we received from the advertiser.

Company - the name of the employer as supplied to us by the job advertiser.

Category - which of 30 standard job categories this ad fits into, inferred in a very messy way based on the source the ad came from.  We know there is a lot of noise and error in this field.

SalaryRaw - the freetext salary field we received in the job advert from the advertiser.

SalaryNormalized - the annualised salary interpreted by Adzuna from the raw salary.  Note that this is always a single value based on the midpoint of any range found in the raw salary.  This is the value we are trying to predict.

SourceName - the name of the website or advertiser from whom we received the job advert. 

All of the data is real, live data used in job ads so is clearly subject to lots of real world noise, including but not limited to: ads that are not UK based, salaries that are incorrectly stated, fields that are incorrectly normalised and duplicate adverts.

### Location Tree

This is a supplemental data set that describes the hierarchical relationship between the different Normalised Locations shown in the job data.  It is likely that there are meaningful relationships between the salaries of jobs in a similar geographical area; for example, average salaries in London and the South East are higher than in the rest of the UK.
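
One way such a hierarchy could feed a model is by mapping each location to a broader ancestor region. The sketch below assumes each entry of the tree is a path written from broadest to most specific and separated by `~`; that delimiter, the helper `build_region_lookup`, and the sample paths are all assumptions made for illustration, so check the actual file before relying on them.

```python
def build_region_lookup(paths, level=1, sep="~"):
    """Map every location name to its ancestor at depth `level` (0 = root)."""
    lookup = {}
    for path in paths:
        parts = [part.strip() for part in path.split(sep)]
        for depth, name in enumerate(parts):
            # Names shallower than `level` have no such ancestor and map to themselves
            ancestor = parts[level] if depth >= level else name
            lookup.setdefault(name, ancestor)
    return lookup

# Made-up paths in the assumed format
sample_paths = [
    "UK~London~Central London",
    "UK~London~East London",
    "UK~South East England~Surrey~Guildford",
]
region_of = build_region_lookup(sample_paths)
print(region_of["Guildford"])
```

A lookup like this would let a column such as LocationNormalized be collapsed into a smaller set of regions, which may be easier for a model to learn from than thousands of distinct place names.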

In [None]:
# Importing some tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load data

In [None]:
df_train = pd.read_csv('../input/job-salary-prediction/Train_rev1.zip', compression='zip', header=0, sep=',', quotechar='"')
df_train.head()

# Data Exploration

In [None]:
df_train.describe()

In [None]:
df_train.info()

In [None]:
# Check missing values
df_train.isna().sum()

In [None]:
# List the columns that hold string data
for label,content in df_train.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# List the columns that hold numeric data
for label,content in df_train.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
# Convert every string column to an ordered pandas category
for label, content in df_train.items():
    if pd.api.types.is_string_dtype(content):
        df_train[label] = content.astype("category").cat.as_ordered()


In [None]:
# Encode the non-numeric columns as integers and record where values were missing
for label, content in df_train.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add a binary column indicating whether the sample had a missing value
        df_train[label+"is_missing"] = pd.isnull(content)
        # Category codes use -1 for missing values, so adding 1 maps missing to 0
        df_train[label] = pd.Categorical(content).codes+1
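
To make the effect of this encoding concrete, here is a small sketch on a made-up column. The values are illustrative only; the calls are the same `pd.isnull`-style check and `pd.Categorical(...).codes + 1` used in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up ContractType-like column with one missing entry
contract = pd.Series(["full_time", np.nan, "part_time", "full_time"])

# Flag missing entries before they are encoded away
was_missing = contract.isna()

# Categories are sorted alphabetically: full_time -> 0, part_time -> 1, NaN -> -1.
# Adding 1 shifts these to 1, 2 and 0 respectively.
codes = pd.Categorical(contract).codes + 1

print(codes.tolist())        # [1, 0, 2, 1]
print(was_missing.tolist())  # [False, True, False, False]
```

Note that the integers carry an alphabetical order that has no real meaning for the feature; tree-based models tolerate this reasonably well, while linear models usually would not.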

In [None]:
df_train.isna().sum()

# Data Visualization

In [None]:
# Horizontal bar chart of the first 10 salaries
ms = df_train["SalaryNormalized"][:10].plot.barh(figsize=(15,10))

In [None]:
df_train["SalaryNormalized"].hist()

In [None]:
# Work on a copy so the original training DataFrame stays untouched
df_tmp = df_train.copy()

In [None]:
df_tmp.head()

In [None]:
# Split the data into X & y
X = df_tmp.drop("SalaryNormalized",axis=1)
y = df_tmp["SalaryNormalized"]
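
One thing worth checking at this step is whether any feature gives the target away. SalaryRaw is the free-text field from which SalaryNormalized is derived, and the final data set does not include the salary field, so a model trained on it may score well locally for the wrong reason. The sketch below shows one possible alternative on a made-up frame: dropping every column whose name starts with `SalaryRaw`. The toy data and the names `toy`, `X_toy` and `y_toy` are hypothetical.

```python
import pandas as pd

# Made-up frame that mimics the encoded training data
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Category": [4, 7, 4],
    "SalaryRaw": [10, 11, 12],                     # encoded salary text
    "SalaryRawis_missing": [False, False, False],  # its missing-value flag
    "SalaryNormalized": [25000, 30000, 45000],
})

target = "SalaryNormalized"
# Columns derived from the raw salary text
leaky = [col for col in toy.columns if col.startswith("SalaryRaw")]

X_toy = toy.drop(columns=[target] + leaky)
y_toy = toy[target]
print(list(X_toy.columns))  # ['Id', 'Category']
```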

# Modeling

In [None]:
# Build a baseline random forest model on an 80/20 train/validation split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
model = RandomForestRegressor(n_jobs=-1)
model.fit(X_train,y_train)

# Evaluation

In [None]:
# Evaluate model using mean absolute error
from sklearn.metrics import mean_absolute_error
y_preds_0 = model.predict(X_test)
mae_rf = mean_absolute_error(y_test,y_preds_0)
mae_rf
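
An MAE figure is easier to interpret next to a naive reference. A common choice is to predict the training median for every ad, since the median is the constant that minimises absolute error. The following is a sketch on made-up numbers; the helper `mae` and both salary lists are illustrative assumptions.

```python
from statistics import median

def mae(y_true, y_pred):
    """Mean absolute error of two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up salaries standing in for y_train and y_test
train_salaries = [20000, 25000, 30000, 45000, 90000]
valid_salaries = [22000, 28000, 60000]

# Predict the training median for every validation ad
baseline_value = median(train_salaries)
baseline_preds = [baseline_value] * len(valid_salaries)

baseline_mae = mae(valid_salaries, baseline_preds)
print(baseline_value, baseline_mae)
```

A trained model is only adding value to the extent that its MAE falls below this kind of baseline.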

# Hyperparameter tuning with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
np.random.seed(42)
grid = {
    "n_estimators": np.arange(10, 100, 10),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
    # "auto" was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
    "max_features": [0.5, 1, "sqrt", 1.0],
    "max_samples": [10000, 12000, 15000, 20000],
}
rs_model = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions=grid,
    n_iter=5,
    cv=5,
    verbose=True,
)
rs_model.fit(X_train, y_train)

In [None]:
rs_model.best_params_

In [None]:
# Compare the tuned model's MAE with the baseline model's MAE
y_preds_rs = rs_model.predict(X_test)
mae_hyp = mean_absolute_error(y_test,y_preds_rs)
mae_hyp,mae_rf

# Make predictions

In [None]:
# Importing test data
df_test = pd.read_csv('../input/job-salary-prediction/Test_rev1.zip', compression='zip', header=0, sep=',', quotechar='"')

In [None]:
# Check for missing values
df_test.isna().sum()

In [None]:
df_test.head()

In [None]:
# List the columns that hold string data
for label,content in df_test.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# List the columns that hold numeric data
for label,content in df_test.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
# Convert every string column to an ordered pandas category
for label, content in df_test.items():
    if pd.api.types.is_string_dtype(content):
        df_test[label] = content.astype("category").cat.as_ordered()

# Encode the non-numeric columns as integers and record where values were missing
for label, content in df_test.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add a binary column indicating whether the sample had a missing value
        df_test[label+"is_missing"] = pd.isnull(content)
        # Category codes use -1 for missing values, so adding 1 maps missing to 0
        df_test[label] = pd.Categorical(content).codes+1
X_test.shape,y_test.shape
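
A caveat with encoding the test set on its own: category codes are assigned from each frame's own sorted list of values, so the same string can receive a different integer in the training and test data. The sketch below, on made-up columns, shows the mismatch and one possible remedy, which is to reuse the training categories when encoding the test column. It is an illustration under those assumptions, not what the cells in this notebook do.

```python
import pandas as pd

# Made-up ContractTime-like columns; "temporary" never appears in training
train_col = pd.Series(["contract", "permanent", "permanent"])
test_col = pd.Series(["permanent", "temporary", "permanent"])

# Training categories sort to ['contract', 'permanent'] -> codes 1 and 2 after the +1 shift
train_categories = pd.Categorical(train_col).categories
train_codes = (pd.Categorical(train_col).codes + 1).tolist()

# Encoded independently, "permanent" becomes 1 in the test data but is 2 in training
independent = (pd.Categorical(test_col).codes + 1).tolist()

# Encoded with the training categories, "permanent" stays 2 and unseen values become 0
shared = (pd.Categorical(test_col, categories=train_categories).codes + 1).tolist()

print(train_codes, independent, shared)
```

With the shared encoding, unseen test values land on 0, the same code used for missing values, which is usually a safer default than a silently shifted mapping.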

In [None]:
# Find the columns present in X_train but missing from df_test
set(X_train.columns)-set(df_test.columns)

In [None]:
# SalaryRaw is absent from the test set; add placeholder columns so the features match X_train
df_test["SalaryRaw"] = False
df_test["SalaryRawis_missing"] = False

In [None]:
X_train.shape,df_test.shape

In [None]:
# Make predictions, passing the columns in the same order the model was trained on
y_preds = model.predict(df_test[X_train.columns])

In [None]:
# Format predictions into the same format Kaggle is after
df_preds = pd.DataFrame()
df_preds["Id"] = df_test["Id"]
df_preds["SalaryNormalized"] = y_preds
df_preds.to_csv("./Submission.csv", index=False)
df_preds.head()

### Feature Importance

Feature importance seeks to figure out which attributes of the data were most important when it comes to predicting the **target variable** (SalaryNormalized).

In [None]:
# Feature importances of the fitted baseline model
model.feature_importances_

In [None]:
# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importances": importances})
          .sort_values("feature_importances", ascending=False)
          .reset_index(drop=True))
    
    # Plot the dataframe
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature importance")
    ax.invert_yaxis()

In [None]:
plot_features(X_train.columns,model.feature_importances_)

In [None]:
df_tmp["SalaryRaw"].value_counts()