# **Engineering Graduate Salary Prediction**

#### Our objective is to predict the salary of an engineering graduate in India.

## **Importing Required Libraries**

In [None]:
import numpy as np
import pandas as pd
#pd.set_option('max_columns', None)

import matplotlib.pyplot as plt
import seaborn as sns 

from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go 

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score 
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor


import warnings
warnings.filterwarnings('ignore')

## **Loading Dataset**

In [None]:
df = pd.read_csv('../input/engineering-graduate-salary-prediction/Engineering_graduate_salary.csv')

In [None]:
# let's take a glimpse of the first five rows of the data
df.head().style.bar(
    color='#606ff2').background_gradient(cmap='plasma')

In [None]:
# checking shape of dataframe
df.shape

In [None]:
# let's list all the columns
df.columns

## **Data Description**

In [None]:
df.info()

Detailed description about the data

* ID: A unique ID to identify a candidate
* Salary: Annual CTC offered to the candidate (in INR)
* Gender: Candidate's gender
* DOB: Date of birth of the candidate
* 10percentage: Overall marks obtained in grade 10 examinations
* 10board: The school board whose curriculum the candidate followed in grade 10
* 12graduation: Year of graduation - senior year high school
* 12percentage: Overall marks obtained in grade 12 examinations
* 12board: The school board whose curriculum the candidate followed in grade 12
* CollegeID: Unique ID identifying the university/college which the candidate attended for their undergraduate degree
* CollegeTier: Each college has been annotated as 1 or 2. The annotations have been computed from the average AMCAT scores obtained by the students in the college/university. Colleges with an average score above a threshold are tagged as 1 and others as 2.
* Degree: Degree obtained/pursued by the candidate
* Specialization: Specialization pursued by the candidate
* CollegeGPA: Aggregate GPA at graduation
* CollegeCityID: A unique ID to identify the city in which the college is located.
* CollegeCityTier: The tier of the city in which the college is located. This is annotated based on the population of the cities.
* CollegeState: Name of the state in which the college is located
* GraduationYear: Year of graduation (Bachelor's degree)
* English: Scores in AMCAT English section
* Logical: Score in AMCAT Logical ability section
* Quant: Score in AMCAT's Quantitative ability section
* Domain: Scores in AMCAT's domain module
* ComputerProgramming: Score in AMCAT's Computer programming section
* ElectronicsAndSemicon: Score in AMCAT's Electronics & Semiconductor Engineering section
* ComputerScience: Score in AMCAT's Computer Science section
* MechanicalEngg: Score in AMCAT's Mechanical Engineering section
* ElectricalEngg: Score in AMCAT's Electrical Engineering section
* TelecomEngg: Score in AMCAT's Telecommunication Engineering section
* CivilEngg: Score in AMCAT's Civil Engineering section
* conscientiousness: Scores in one of the sections of AMCAT's personality test
* agreeableness: Scores in one of the sections of AMCAT's personality test
* extraversion: Scores in one of the sections of AMCAT's personality test
* nueroticism: Scores in one of the sections of AMCAT's personality test
* openess_to_experience: Scores in one of the sections of AMCAT's personality test

Note: For context, AMCAT is a job portal.

In [None]:
# let's check for missing values
df.isnull().sum()

##### **No missing values are present.**

In [None]:
# summary statistics
df.describe().T

* ##### **collegeGPA may contain some outliers because its minimum is far from its mean. We can check this later with a plot.**
* ##### **We need to handle the -1 values, so we will first convert them to NaN and then substitute the mean or median in place of NaN, as required.**
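
To illustrate the plan in the second point, here is a minimal sketch on a small made-up frame. The values are hypothetical and not taken from the dataset. It counts the `-1` placeholders per column, converts them to `NaN`, and compares the mean and the median as candidate fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores; -1 marks a section the candidate did not attempt
toy = pd.DataFrame({
    "ComputerProgramming": [455, -1, 525, 415, -1],
    "Domain": [0.6, 0.8, -1, 0.4, 0.9],
})

# Count the placeholder values per column before touching them
sentinel_counts = (toy == -1).sum()

# Convert the placeholder to NaN so that it no longer distorts the statistics
toy_clean = toy.replace(-1, np.nan)

# Candidate fill values: the mean is sensitive to outliers, the median is not
fill_mean = toy_clean.mean()
fill_median = toy_clean.median()

print(sentinel_counts)
print(fill_mean)
print(fill_median)
```

When a column is strongly skewed, the median is usually the safer choice of the two.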

In [None]:
# dropping features that are not useful for predicting salary
df.drop(['ID', '10board','12graduation','12board' ,'CollegeID' , 'CollegeCityID','CollegeState'
                                     , 'CollegeCityTier'], axis = 1, inplace = True)

In [None]:
# lets check the shape again
df.shape

## **Data Cleaning**

In [None]:
# convert the -1 placeholder values to NaN so that they are treated as missing
df.replace(-1, np.nan, inplace=True)

In [None]:
# let's check the missing values again
df.isnull().sum()

In [None]:
# list of columns with null values
missing_values_columns = df.columns[df.isnull().any()].tolist()

In [None]:
# function for missing values substitution
def fill_missing_values(df, missing_values_columns):
    '''Fill missing values in the given columns with the column mean.'''
    data = df.copy()
    for col in missing_values_columns:
        data[col] = data[col].fillna(data[col].mean())

    return data

# let's use this function to fill the missing values
df = fill_missing_values(df, missing_values_columns)

## **Exploratory Data Analysis**

### Correlation Analysis

In [None]:
plt.figure(figsize=(16,12))
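# numeric_only=True (pandas 1.5 or newer) skips text columns such as Gender and DOB
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='viridis')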
plt.show()

### Analysis of Variable Salary

In [None]:
plt.figure(figsize = (12, 6))

plt.subplot(121)
plt.title('Salary Distribution')
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['Salary'], kde=True)

plt.subplot(122)
plt.scatter(range(df.shape[0]), np.sort(df.Salary.values))
plt.title("Salary Curve Distribution", fontsize=15)
plt.xlabel("")
plt.ylabel("Salary", fontsize=12)

plt.subplots_adjust(wspace = 0.3, hspace = 0.5,
                    top = 0.9)
plt.show()

##### **Most of the graduates have salaries under 10 lakhs.**
##### **The right tail of the distribution is longer than the left tail, which shows that the distribution is positively skewed.**
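
As a rough numeric check of the skew described above, the following sketch uses a handful of hypothetical salaries, not values from the dataset. A positive sample skewness, together with a mean that sits above the median, is what a longer right tail looks like in numbers.

```python
import pandas as pd

# Hypothetical salaries in INR with a few very large values in the right tail
salary = pd.Series([200000, 250000, 300000, 300000,
                    350000, 400000, 1200000, 2500000])

# Sample skewness: a positive value indicates a longer right tail
skewness = salary.skew()

# In a right-skewed distribution the mean is pulled above the median
mean_salary = salary.mean()
median_salary = salary.median()

print(f"skewness: {skewness:.2f}")
print(f"mean:     {mean_salary:.0f}")
print(f"median:   {median_salary:.0f}")
```

The same call, `df['Salary'].skew()`, should work on the real column.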

### Analysis of Salary and Gender

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot( x=df['Gender'], y=df['Salary'] )

plt.title('Statistical Distribution of Gender versus Salary')
plt.show()

##### **The plot shows that the median salaries for men and women are almost the same.**

### Analysis of 10th and 12th percentage by college tier

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x = '10percentage', y = '12percentage', hue = 'CollegeTier',palette='dark', data = df)
plt.show()

##### **The correlation plot and the scatterplot both show that the 10th and 12th percentages are positively correlated. This is a case of multicollinearity, so I have decided to keep only one of them.**
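
A minimal sketch of how one of two correlated columns could be dropped is shown below. The marks are hypothetical, and both the 0.8 cut-off and the choice to keep `10percentage` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical marks; the two school-level columns rise together by construction
toy = pd.DataFrame({
    "10percentage": [62.0, 70.5, 78.0, 84.4, 91.0],
    "12percentage": [58.0, 66.0, 75.5, 80.0, 88.5],
    "collegeGPA": [71.0, 65.0, 80.0, 68.0, 74.0],
})

# Pearson correlation between the two school-level percentages
r = toy["10percentage"].corr(toy["12percentage"])

# Assumed cut-off for "too correlated"; tune it for your own data
THRESHOLD = 0.8

# Keep 10percentage and drop 12percentage when the pair is highly correlated
if abs(r) > THRESHOLD:
    reduced = toy.drop(columns=["12percentage"])
else:
    reduced = toy

print(f"correlation: {r:.3f}")
print(reduced.columns.tolist())
```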

### Analysis of Salary and Specialization

In [None]:
# checking the count of unique specialization present in dataframe
df.Specialization.value_counts()

In [None]:
# count of rows for each category in specialization
value_count = df['Specialization'].value_counts()
# categories that appear 10 times or fewer
rare_specializations = set(value_count[value_count <= 10].index)

def map_to_other_specialization(var):
    ''' if a category appears 10 times or fewer, replace it with 'other' '''
    if var in rare_specializations:
        return 'other'
    else:
        return var

# apply the function to specialization to get the results
df['Specialization'] = df.Specialization.apply(map_to_other_specialization)
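
The same grouping of rare categories can also be written without a row-wise function. The sketch below is one possible vectorized variant on hypothetical data, using `Series.isin` and `Series.where`. It is an alternative, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical specializations; the last two categories are rare
spec = pd.Series(
    ["computer engineering"] * 12
    + ["mechanical engineering"] * 11
    + ["ceramic engineering"] * 2
    + ["polymer technology"]
)

counts = spec.value_counts()

# Categories with 10 or fewer rows, matching the cut-off used above
rare = counts[counts <= 10].index

# Keep frequent categories as they are and replace rare ones with 'other'
grouped = spec.where(~spec.isin(rare), "other")

print(grouped.value_counts())
```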

In [None]:
# count plot of unique categories in specialization 
plt.figure(figsize = (16, 8))
total = float(len(df))
ax = sns.countplot(x='Specialization',data=df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 1,
            '{:1.2f}%'.format((height/total) * 100),
            ha="center",fontsize=10) 
plt.xticks(rotation = 90)
plt.show()

In [None]:
# "electronics & instrumentation eng" repeats an existing category under a slightly different name, so merge the two
df['Specialization'] = df['Specialization'].str.replace('electronics & instrumentation eng',
                                                'electronics and instrumentation engineering',
                                                regex=False)

In [None]:
# average salary by specialization and sort them in decreasing order
avg_sal_per_specialization = df.groupby('Specialization').agg(mean_salary =("Salary", 'mean')).sort_values(by = 'mean_salary',ascending=False)

# barplot of mean salary and specialization
plt.figure(figsize = (12, 6))
sns.barplot(x = avg_sal_per_specialization.index,y = 'mean_salary',data = avg_sal_per_specialization,palette='rocket')
plt.xticks(rotation = 90)
plt.show()

##### **ICE Engineers, Computer Engineers and Electronics Engineers have the highest mean salaries.**

### Analysis of Salary and College GPA

In [None]:
# inspect the graduate(s) with the highest salary
df[df['Salary'] == df.Salary.max()]

In [None]:
# let's plot collegeGPA against Salary to check the outliers we suspected in the summary statistics
plt.figure(figsize = (12, 6))
sns.scatterplot(x ='Salary', y = 'collegeGPA',hue='CollegeTier',data=df,palette = 'dark')
plt.show()

##### **Let's drop some outlier points:**
* collegeGPA < 40
* Salary > 15 lakh

In [None]:
# filter the dataframe where collegeGPA > 40 and salary is less than 15 lakh
df = df.loc[(df['collegeGPA'] > 40) & (df['Salary'] < 1500000)] 

In [None]:
# let's check the shape again
df.shape
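
The filter above uses strict inequalities, so rows that sit exactly on a threshold are removed as well. The following sketch, on hypothetical rows, shows how many rows each condition removes.

```python
import pandas as pd

# Hypothetical rows: one low GPA, one very high salary, one salary exactly at 15 lakh
toy = pd.DataFrame({
    "collegeGPA": [8.5, 65.0, 72.3, 81.0, 69.5],
    "Salary": [300000, 325000, 4000000, 450000, 1500000],
})

# The same two conditions as in the filter above
gpa_ok = toy["collegeGPA"] > 40
salary_ok = toy["Salary"] < 1500000

filtered = toy.loc[gpa_ok & salary_ok]

# Rows removed by each condition and in total
removed_by_gpa = int((~gpa_ok).sum())
removed_by_salary = int((~salary_ok).sum())
removed_total = len(toy) - len(filtered)

print(removed_by_gpa, removed_by_salary, removed_total)
```

In this toy frame the row with a salary of exactly 1500000 is dropped, because `<` excludes the boundary value.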

### Analysis of Salary and Degree

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot( x=df['Degree'], y=df['Salary'] )

plt.title('Statistical Distribution of Degree versus Salary')
plt.show()

In [None]:
df['Degree'].value_counts()

##### **Average salary is higher for BE/B.Tech graduates than for graduates of any other degree.**

## **Data Preprocessing**

In [None]:
# creating list of categorical columns for one hot encoding
categorical_columns = [col for col in df.columns if df.dtypes[col] == 'object']

# creating list of numerical columns to standardized data 
numerical_columns = [col for col in df.columns if (df.dtypes[col] != 'object')]

print('Numerical Features are : ',numerical_columns)
print('\n')
print('Categorical Features are : ',categorical_columns)

In [None]:
# remove DOB & GraduationYear from the categorical features list

# items to be removed
unwanted_cat = ['DOB','GraduationYear']
categorical_columns = [ele for ele in categorical_columns if ele not in unwanted_cat]
print(categorical_columns)

### Handling Datetime feature

In [None]:
# let's convert DOB to datetime
df['DOB'] = pd.to_datetime(df['DOB'])
# GraduationYear is assumed to already hold the year as a number; calling pd.to_datetime
# on integer years without a format would read them as nanoseconds since 1970
df['GraduationYear'] = pd.to_numeric(df['GraduationYear'])

In [None]:
# let's create a new feature birth_year, which is useful for seeing how old the candidate was when he/she completed the degree

df['birth_year'] = df['DOB'].dt.year

# lets drop DOB
df.drop('DOB',axis=1,inplace=True)
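
The comment above mentions how old the candidate was at graduation. A minimal sketch of that derived feature follows. The rows are hypothetical, and `age_at_graduation` is an illustrative column name that does not appear in the notebook.

```python
import pandas as pd

# Hypothetical rows; GraduationYear is assumed to be stored as an integer year
toy = pd.DataFrame({
    "DOB": ["1990-10-22", "1992-03-05", "1989-12-31"],
    "GraduationYear": [2012, 2014, 2012],
})

# Extract the birth year from the date of birth
toy["birth_year"] = pd.to_datetime(toy["DOB"]).dt.year

# Approximate age at graduation as a difference of years (illustrative name)
toy["age_at_graduation"] = toy["GraduationYear"] - toy["birth_year"]

print(toy[["birth_year", "GraduationYear", "age_at_graduation"]])
```

Because only years are compared, the result can be off by up to one year relative to the true age.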

### OneHotEncoding for Categorical Features

In [None]:
# one hot encoding function for categorical features 
def onehot_encoder(df, cols):
    df = df.copy()
    for col in cols:
        dummies = pd.get_dummies(df[col])
        # concatenating dummies and original dataframe
        df = pd.concat([df, dummies], axis=1)
        
        # dropping the original column for which encoding is applied
        df.drop(col, axis=1,inplace=True)
    return df

In [None]:
df = onehot_encoder(df,categorical_columns)

In [None]:
# let's drop one column from each encoded categorical feature to avoid the dummy variable trap
df.drop(['f','M.Sc. (Tech.)','biotechnology'],axis=1,inplace=True)
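
As an alternative to dropping the columns by name, `pd.get_dummies` can drop one level per feature itself through `drop_first=True`. The sketch below uses hypothetical rows and labels. Note that it removes the first level in sorted order, which may differ from the levels dropped by hand above.

```python
import pandas as pd

# Hypothetical rows; the labels are illustrative
toy = pd.DataFrame({
    "Gender": ["m", "f", "m", "f"],
    "Degree": ["B.Tech/B.E.", "MCA", "B.Tech/B.E.", "M.Tech./M.E."],
})

# drop_first=True removes the first level (in sorted order) of each encoded column,
# so k categories become k - 1 indicator columns
encoded = pd.get_dummies(toy, columns=["Gender", "Degree"],
                         drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

With this variant the encoded columns carry a prefix such as `Gender_m`, so no rename step is needed afterwards.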

In [None]:
# rename the gender column
df.rename({'m':'Gender'},axis=1,inplace=True)

### Handling Numerical Features

In [None]:
df[numerical_columns].head()

In [None]:
# let's remove CollegeTier from the numerical features list because it is a binary variable
# also remove Salary and GraduationYear
# items to be removed
unwanted_num = ['CollegeTier','Salary','GraduationYear']
numerical_columns = [ele for ele in numerical_columns if ele not in unwanted_num]
print(numerical_columns)

In [None]:
# Split df into dependent (y) and independent (X) variables
X = df.drop('Salary',axis=1)
y = df['Salary']

### Train-Test Split

In [None]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

### Scaling Numerical Features

In [None]:
# Scaling Numerical Features

sc = StandardScaler()
X_train[numerical_columns] = sc.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = sc.transform(X_test[numerical_columns])
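
The scaler above is fitted on the training data only and then reused for the test data. The following sketch reproduces that arithmetic by hand with NumPy on a hypothetical feature, using the population standard deviation that `StandardScaler` also uses.

```python
import numpy as np

# Hypothetical single feature, already split into train and test parts
x_train = np.array([400.0, 450.0, 500.0, 550.0, 600.0])
x_test = np.array([475.0, 650.0])

# Statistics are computed from the training data only
mu = x_train.mean()
sigma = x_train.std()  # population standard deviation (ddof=0)

# z = (x - mean) / std, with the training statistics reused for the test data
z_train = (x_train - mu) / sigma
z_test = (x_test - mu) / sigma

print(z_train.mean(), z_train.std())
print(z_test)
```

The scaled training values have mean 0 and standard deviation 1, while the test values generally do not, which is expected.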

## **Model Building**

### Linear Regression

In [None]:
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred_linear_reg = linear_reg.predict(X_test)

linear_reg_r2_score = linear_reg.score(X_test, y_test)

print("Linear Regression R^2 Score: {:.4f}".format(linear_reg_r2_score))
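
For reference, the R^2 value reported by `score` can be reproduced by hand. The sketch below uses hypothetical salaries and predictions to show the calculation: one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical true salaries and model predictions
y_true = np.array([300000.0, 350000.0, 400000.0, 500000.0])
y_hat = np.array([320000.0, 340000.0, 420000.0, 460000.0])

# Residual sum of squares: error left after using the model
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot

# A model that always predicts the mean scores exactly 0
r2_baseline = 1 - ss_tot / ss_tot

print(f"R^2: {r2:.4f}")
print(f"R^2 of the mean baseline: {r2_baseline:.1f}")
```

A negative R^2 on the test set would mean the model does worse than this mean baseline.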

### XGBoost Model

In [None]:
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

xgb_r2 = xgb.score(X_test, y_test)

print("XGBoost R^2 Score: {:.5f}".format(xgb_r2))

#### Error Plot

In [None]:
# Linear Regression Error plot
errors_reg = y_test - y_pred_linear_reg

plt.figure(figsize = (10,6))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(errors_reg, kde=True)
plt.show()

In [None]:
# XGBoost Error plot
errors = y_test - y_pred_xgb

plt.figure(figsize = (10,6))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(errors, kde=True)
plt.show()
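
The error plots can be complemented with a few summary numbers. The sketch below uses hypothetical predictions from two models, and `summarize_errors` is an illustrative helper that is not part of the notebook. It reports the mean error, the mean absolute error and the root mean squared error.

```python
import numpy as np

# Hypothetical true salaries and predictions from two models
y_true = np.array([300000.0, 350000.0, 400000.0, 500000.0])
pred_a = np.array([320000.0, 340000.0, 420000.0, 460000.0])
pred_b = np.array([310000.0, 345000.0, 390000.0, 520000.0])

def summarize_errors(y, pred):
    """Return mean error, mean absolute error and root mean squared error."""
    err = y - pred
    return {
        "mean_error": float(err.mean()),
        "mae": float(np.abs(err).mean()),
        "rmse": float(np.sqrt((err ** 2).mean())),
    }

summary_a = summarize_errors(y_true, pred_a)
summary_b = summarize_errors(y_true, pred_b)

print(summary_a)
print(summary_b)
```

A mean error close to zero suggests the errors are centred, while the MAE and RMSE show their typical size in rupees.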