# HR Analytics Job Change of Data Scientists. Accuracy = 85%

## About this Dataset

### Context and Content

A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which candidates really want to work for it after training and which are looking for new employment elsewhere, because this helps to reduce cost and time and to improve the quality of training, course planning, and the categorization of candidates. Information on demographics, education, and experience is available from the candidates' signup and enrollment.

This dataset is also designed for HR research into the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.

The data is divided into train and test sets. The target is not included in the test set, but a file with the test target values is available for related tasks. A sample submission corresponding to the enrollee_id values of the test set is also provided, with the columns enrollee_id and target.

### Note:

*The dataset is imbalanced.*
Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality.
Imputation of missing values can be part of your pipeline as well.

### Goal

Predict the probability that a candidate will work for the company.
Interpret the model(s) in a way that illustrates which features affect the candidate's decision.

### The source of the dataset
https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015



## Imports

### Import libraries

In [None]:
#load packages
import os           # Operating system dependent functionality
import sys
print("Python version: {}". format(sys.version))

import numpy  as np # Numerical computing tools.
print("NumPy version: {}". format(np.__version__))

import pandas as pd # Data analysis and manipulation tool
print("pandas version: {}". format(pd.__version__))

from scipy.stats import gamma

import matplotlib
import matplotlib.pyplot as plt     # Provides a MATLAB-like plotting framework. 
# The source: https://matplotlib.org/api/pyplot_api.html
print("matplotlib version: {}". format(matplotlib.__version__))

import matplotlib.mlab   as mlab    # Numerical python functions 
# written for compatibility with MATLAB commands with the same names. 
# The source: https://matplotlib.org/3.3.3/api/mlab_api.html 

import seaborn           as sns     # Python data visualization library based on matplotlib

import cufflinks as cf              # Python data visualization library for dynamical plots
# Source: https://github.com/santosjorge/cufflinks
# Some useful examples: https://www.kaggle.com/kyleos/cufflinks
cf.set_config_file(offline=True)

import plotly.io as pio             # Themes for iplot
# Some useful examples: https://plotly.com/python/templates/

import plotly.express as px         # High-level plotting interface of plotly

import colorama     # ANSI escape character sequences have long been used 
# to produce colored terminal text and cursor positioning on Unix and Macs
# The source: https://pypi.org/project/colorama/

import warnings    # alert the user of some condition in a program, where that condition 
#(normally) doesn’t warrant raising an exception and terminating the program.
# The source: https://docs.python.org/3/library/warnings.html
warnings.filterwarnings('ignore')


from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report,confusion_matrix


print('-'*40)

from subprocess import check_output
print(check_output(["ls", "-L"]).decode("utf8"))


from sklearn import metrics


#  show plots in Jupyter Notebook browser
%matplotlib inline 

### Import data

In [None]:
# Import data from aug_train.csv. 
# It will be separated into 2 parts later. 

# files
file_train       = '../input/hr-analytics-job-change-of-data-scientists/aug_train.csv'
file_validation  = '../input/hr-analytics-job-change-of-data-scientists/aug_test.csv'

# Check that the data file exists and load it according to its file type.
# Functions used: read_csv or read_excel
# read_csv   https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
# read_excel https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
if (not os.path.exists(file_train)): 
    if str(__name__) == '__main__':
        raise SystemExit("There is no file with data!")
else: 
    file_name, file_extension = os.path.splitext(file_train)
    if (file_extension  == '.csv'):
        train_set_init = pd.read_csv(file_train)
    elif(file_extension == '.xlsx'):
        train_set_init = pd.read_excel(file_train)
        
if (not os.path.exists(file_validation)): 
    if str(__name__) == '__main__':
        raise SystemExit("There is no file with data!")
else: 
    file_name, file_extension = os.path.splitext(file_validation)
    if (file_extension  == '.csv'):
        valid_set = pd.read_csv(file_validation)
    elif(file_extension == '.xlsx'):
        valid_set = pd.read_excel(file_validation)

# Make a copy         
train_set     = train_set_init.copy()

n_rows        =  train_set.shape[0]    
n_columns     =  train_set.shape[1]

n_rows_val    =  valid_set.shape[0]    
n_columns_val =  valid_set.shape[1]

# To process datasets together, they will be collected into one array.
two_sets_data = [train_set, valid_set]

# information about train dataset
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
train_set.info()

# # selection of numeric columns 
df_numeric   = train_set.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print("\n Numeric columns: ")
print(numeric_cols)

# # selection of non-numeric columns 
df_non_numeric   = train_set.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print("\n Non-numeric columns: ")
print(non_numeric_cols)

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html
train_set.head() 

# Data processing

Let's take a general look at the data: the number of unique values of each attribute and the number of missing values.
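
As a compact alternative to building the table by hand, the same overview can be sketched with pandas alone. The helper name `summarize_columns` and the toy frame are illustrative assumptions, not part of the original notebook; note that `nunique` here ignores NaN, whereas `len(col.unique())` counts NaN as a value.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: unique values, null count, null share in %."""
    return pd.DataFrame({
        "unique": df.nunique(dropna=True),          # NaN is not counted as a value
        "nulls": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(1),
    })

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "training_hours": [36, 47, 83, 36],
})
summary = summarize_columns(toy)
print(summary)
```

On the real data the call would presumably be `summarize_columns(train_set)`.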

### Unique elements and non-null elements

In [None]:
''' 
The solution how to draw a table in the Jupyter Notebook 
is presented here:
https://stackoverflow.com/questions/35160256/how-do-i-output-lists-as-a-table-in-jupyter-notebook
'''
from IPython.display import HTML, display
def display_table(data):
    html = "<table>"
    for row in data:
        html += "<tr>"
        for field in row:
            html += "<td><h4>%s</h4></td>"%(field)
        html += "</tr>"
    html += "</table>"
    display(HTML(html))

table = [] # Name, Unique elements, Null elements
table.append(['Name', 'Unique elements','Null Elements'])
for (columnName, columnData) in train_set.items():  # iteritems() was removed in pandas 2.0
    counted_number = train_set[columnName].isnull().sum()
    line = [columnName, len(columnData.unique()), counted_number]
    table.append(line)
        
print("The total number of rows in the training dataset is {}.".format(n_rows))
display_table(table)

print('-'*40)

table = [] # Name, Unique elements, Null elements
table.append(['Name', 'Unique elements','Null Elements'])
for (columnName, columnData) in valid_set.items():  # iteritems() was removed in pandas 2.0
    counted_number = valid_set[columnName].isnull().sum()
    line = [columnName, len(columnData.unique()), counted_number]
    table.append(line)
        
print("The total number of rows in the validation dataset is {}.".format(n_rows_val))
display_table(table)

***Correcting***

***Completing***
There are null values in the gender, enrolled university, education level, major discipline, experience, company size, company type, and time gap between last and new job fields. 
Numerical attributes are to be filled with the median value. Categorical attributes are left unchanged for the Decision Tree technique and are combined into a special category for other algorithms.
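
The completing rule described above can be sketched as a small helper. This is a minimal illustration under assumptions: the function name `fill_missing`, the placeholder label, and the toy frame are hypothetical, and the notebook itself fills each column individually in later cells.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame, placeholder: str = "Undetermined") -> pd.DataFrame:
    """Median for numeric columns, a placeholder category for all other columns."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(placeholder)
    return out

# Toy frame with one numeric and one categorical gap (values are made up)
toy = pd.DataFrame({
    "training_hours": [10.0, None, 30.0],
    "company_type": ["Pvt Ltd", None, "NGO"],
})
filled = fill_missing(toy)
print(filled)
```

If the validation set is filled as well, it is usually safer to reuse the medians computed on the training set rather than recomputing them.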

***Creating***

***Converting***

# Clean data

### Missing values on a heat map

In [None]:
# Construct a heat map
cols = train_set.columns
colours = ['#000099', '#ffff00'] 
fig, ax0 = plt.subplots(1,1, figsize = (5, 5))
sns.heatmap(train_set[cols].isnull(), cmap=sns.color_palette(colours), vmin=0, vmax=1)

# Print the percentage of missing values per column
print("The percentage of missing values")
for col in train_set.columns:
    pct_missing = np.mean(train_set[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100))) 
    
# Indicator columns for attributes with missing data
train_set_copy = train_set.copy()
for col in train_set_copy.columns:
    missing     = train_set_copy[col].isnull()
    num_missing = np.sum(missing)
    
    if num_missing > 0:  
        train_set_copy['{}_ismissing'.format(col)] = missing

# Bar chart: number of rows for each count of missing values per row
ismissing_cols = [col for col in train_set_copy.columns if 'ismissing' in col]
train_set_copy['num_missing'] = train_set_copy[ismissing_cols].sum(axis=1)
# sort_index() keeps the bars ordered and does not depend on the column
# names produced by reset_index(), which changed in pandas 2.0
train_set_copy['num_missing'].value_counts().sort_index().plot.bar(
    title='The number of rows vs number of missing values')
plt.show()

In the heat map, missing values are shown in yellow and present values in blue. 

In the next parts we will look at the data attributes and their distributions, fill the missing values, and prepare the dataset for modeling.

## Gender

In [None]:
# Replace NaN values with the string value 'No gender'
for dataset in two_sets_data:
    dataset["gender"] = dataset["gender"].fillna("No gender")
# calculate the number of unique genders 
n_genders = len(train_set.gender.unique()) 

# Create a subset to plot the data presentation and write 
# the total number of different values of the attribute
df = pd.DataFrame(train_set.groupby('gender').size())
print(df)

df.columns = ['Count']
# one additional column for plotting
df['gender'] = df.index
fig = px.pie(df, values='Count', names='gender', color='gender',
             title='Gender distribution in the training dataset',
             color_discrete_sequence=px.colors.sequential.Plasma)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()    

The dominant gender in the dataset is Male. Missing values make up a large part of the data, so we fill them with the category "No gender", meaning that the gender field was not filled in.

## Cities

#### Unique cities

In [None]:
# Write the number of null elements
counted_number = train_set['city'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

n_cities = len(train_set.city.unique())
cities = []
number = []
for i in range(n_cities):
    city           = train_set.city.unique()[i]
    counted_number = train_set[train_set.city == city].city.count()
    cities.append(city)
    number.append(counted_number)

fig, ax0 = plt.subplots(1,1, figsize = (18, 5))
ax0.bar(cities, number, facecolor='g', alpha=0.5)
plt.title('Distributions of the people in cities')
plt.xlabel('Unique city')
plt.ylabel('The number of people')
plt.show()

# max in the distribution
index = number.index(max(number))
print (str(cities[index]) + ' \t: ' + str(max(number)))

The distribution is dominated by a single narrow peak, close to a delta distribution. This means that most people in the set come from one city.
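
To put a number on how dominant the top city is, `value_counts` can replace the manual loop. The snippet below is a sketch on made-up city labels, so the printed share says nothing about the real dataset.

```python
import pandas as pd

# Toy column standing in for train_set['city'] (values are made up)
cities = pd.Series(["city_103"] * 6 + ["city_21"] * 3 + ["city_16"])

counts = cities.value_counts()             # rows per city, largest first
top_city = counts.idxmax()                 # label of the most frequent city
top_share = counts.max() / counts.sum()    # its share of all rows

print(f"{top_city}: {counts.max()} rows ({top_share:.0%})")
```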

#### Unique city_development_index

In [None]:
x = train_set.city_development_index
fig, ax0 = plt.subplots(1,1, figsize = (18, 5))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(x, kde=True)
plt.title('Distributions of the people in cities')
plt.xlabel('City development index')
plt.ylabel('The number of people')
plt.show()

fig, axes = plt.subplots(1,1, figsize = (5, 5))
x = train_set.city_development_index
sns.boxplot(data=x)
plt.show()

cities = []
number = []
n_cities = len(train_set.city_development_index.unique())

for i in range(n_cities):
    city         = train_set.city_development_index.unique()[i]
    number_count = train_set[train_set.city_development_index == city].city_development_index.count()
    cities.append(city)
    number.append(number_count)

# max in the distribution
index = number.index(max(number))
print (str(cities[index]) + ' \t: ' + str(max(number)))

The box plot shows that there is an outlier.
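
The points that a box plot draws as outliers can also be listed explicitly. The sketch below assumes the usual whisker rule of 1.5 times the interquartile range (the matplotlib and seaborn default); the function name and the sample values are illustrative only.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Boolean mask of points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - whis * iqr) | (values > q3 + whis * iqr)

# Made-up development index values with one clearly low entry
cdi = pd.Series([0.92, 0.91, 0.90, 0.92, 0.89, 0.45])
mask = iqr_outliers(cdi)
print(cdi[mask])
```

Applied to the notebook's data, the call would presumably be `iqr_outliers(train_set.city_development_index)`.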

## Education and qualification of employee

#### Education

In [None]:
def plot_hist_ed_level():
    x = train_set.education_level
    fig, ax0 = plt.subplots(1,1, figsize = (18, 5))
    sns.countplot(x=x)
    plt.title('Education')
    plt.xlabel('Education')
    plt.ylabel('The number of people')
    plt.show()

# Write the number of null elements
counted_number = train_set['education_level'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

# Fill missing values with 'Undetermined'
for dataset in two_sets_data:
    dataset["education_level"] = dataset["education_level"].fillna("Undetermined")

plot_hist_ed_level()

To make this feature more balanced, we group its values into two categories: 'Graduate' and 'Others', which contains all other values. 
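
The grouping can also be written without a loop over unique values. The sketch below uses `Series.where` and, as an assumption that goes beyond the notebook's cell (which changes only the training set), applies the same rule to a second frame so both stay consistent. The helper name and the toy frames are hypothetical.

```python
import pandas as pd

def keep_or_other(col: pd.Series, keep: str, other: str = "Others") -> pd.Series:
    """Keep one category and map every other value to `other`."""
    return col.where(col == keep, other)

# Toy frames standing in for the training and validation sets
train_toy = pd.DataFrame({"education_level": ["Graduate", "Masters", "Undetermined"]})
valid_toy = pd.DataFrame({"education_level": ["Phd", "Graduate"]})

for frame in (train_toy, valid_toy):
    frame["education_level"] = keep_or_other(frame["education_level"], "Graduate")

print(train_toy["education_level"].tolist())
print(valid_toy["education_level"].tolist())
```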

In [None]:
# Change all unique values except 'Graduate' to 'Others'.
n_ed_levels = train_set.education_level.unique()

for education_level in n_ed_levels:
    if education_level != 'Graduate':
        train_set['education_level'] = train_set['education_level'].replace([education_level],'Others')
        
plot_hist_ed_level()

#### Major discipline

In [None]:
def plot_hist_major_disc():
    x = train_set.major_discipline
    fig, ax0 = plt.subplots(1,1, figsize = (18, 5))
    sns.countplot(x=x)
    plt.title("Major discipline")
    plt.xlabel('Discipline')
    plt.ylabel('The number of people')
    plt.show()

# Write the number of null elements
counted_number = train_set['major_discipline'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

# Fill missing values with 'Undetermined'
for dataset in two_sets_data:
    dataset["major_discipline"] = dataset["major_discipline"].fillna("Undetermined")

plot_hist_major_disc()

The prevailing value is clearly 'STEM'. This means that people in data science mostly have a technical education such as engineering or mathematics. To make this feature more balanced, we create two groups: 'STEM' and 'Others', which contains all other values.

In [None]:
# Change all unique values except 'STEM' to 'Others'.
n_mj_disc = train_set.major_discipline.unique()

for major_discipline in n_mj_disc:
    if major_discipline != 'STEM':
        train_set['major_discipline'] = train_set['major_discipline'].replace([major_discipline],'Others')
        
plot_hist_major_disc()

#### Type of University course enrolled if any

In [None]:
# Write the number of null elements
counted_number = train_set['enrolled_university'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

# Fill missing values with 'Undetermined'
for dataset in two_sets_data:
    dataset["enrolled_university"] = dataset["enrolled_university"].fillna("Undetermined")
    
n_univ = len(train_set.enrolled_university.unique())
x = train_set.enrolled_university
fig, ax0 = plt.subplots(1,1, figsize = (18, 5))
sns.set(style="whitegrid")
sns.countplot(x=x, palette='Spectral')
plt.title('Enrolled university')
plt.xlabel('Type of university course')
plt.ylabel('The number of people')
plt.show()

#### Training hours completed

In [None]:
n_training = len(train_set.training_hours.unique())
n_training
x = train_set.training_hours
fig, axes = plt.subplots(1,1, figsize = (18, 5))
sns.set(style="whitegrid")
sns.countplot(x=x, palette='Spectral')
plt.title('Training hours')
plt.xlabel('Hours')
plt.ylabel('The number of people')
plt.show()

fig, axes = plt.subplots(1,1, figsize = (18, 5))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(x, kde=True)
plt.title('Training hours')
plt.xlabel('Hours')
plt.ylabel('The number of people')
plt.show()

fig, axes = plt.subplots(1,1, figsize = (5, 5))
x = train_set.training_hours
sns.boxplot(data=x)
plt.show()

## Experience

#### relevent_experience: Relevant experience of candidate

In [None]:
# Write the number of null elements
counted_number = train_set['relevent_experience'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

train_set.relevent_experience.unique()
x = train_set.relevent_experience
fig, ax0 = plt.subplots(1,1, figsize = (18, 5))
sns.set(style="whitegrid")
sns.countplot(x=x, palette='Spectral')
plt.title('Relevant experience')
plt.xlabel('Experience')
plt.ylabel('The number of people')
plt.show()

#### Experience: Candidate total experience in years

In [None]:
# Write the number of null elements
counted_number = train_set['experience'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

# Fill missing values with '-1'.
for dataset in two_sets_data:
    dataset["experience"] = dataset["experience"].fillna("-1")

sizes = []
train_set_sorted = train_set.sort_values(by=['experience'])
labels     = train_set_sorted.experience.unique()
experience = len(train_set_sorted.experience.unique())

# check if there are nan values
# n_nan = train_set_sorted[train_set_sorted.experience == 'nan'].enrollee_id.count()

for exp in labels:
    # Number of rows with this experience value (same order as `labels`)
    counted_number = (train_set_sorted.experience == exp).sum()
    sizes.append(counted_number)

fig1, ax1 = plt.subplots(1,1, figsize = (7, 7))

from palettable.colorbrewer.qualitative import Pastel1_9
# Create a circle for the center of the plot
my_circle=plt.Circle( (0,0), 0.7, color='white')
plt.pie(sizes, labels=labels, colors=Pastel1_9.hex_colors)
p=plt.gcf()
p.gca().add_artist(my_circle)
# plt.legend()
plt.show()

## Companies

#### Company size

In [None]:
# Write the number of null elements
counted_number = train_set['company_size'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

# Fill missing values with '-1'.
for dataset in two_sets_data:
    dataset["company_size"] = dataset["company_size"].fillna("-1")

train_set.company_size.unique()
x = train_set.company_size
fig, ax0 = plt.subplots(1,1, figsize = (18, 5))
sns.set(style="whitegrid")
sns.countplot(x=x, palette='Spectral')
plt.title('Company size')
plt.xlabel('Companies')
plt.ylabel('The number of people')
plt.show()

#### Company type

In [None]:
# Write the number of null elements
counted_number = train_set['company_type'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

# Fill missing values with 'Undetermined'
for dataset in two_sets_data:
    dataset["company_type"] = dataset["company_type"].fillna("Undetermined")

sizes = []
labels = train_set.company_type.unique()
n_company_type = len(train_set.company_type.unique())
for i in range(len(train_set.company_type.unique())):
    c_type           = train_set.company_type.unique()[i]
    counted_number   = train_set[train_set.company_type == c_type].company_type.count()
    sizes.append(counted_number)
    # Convert non-string labels (e.g. NaN, which is a float) to str before printing
    c_type_str = c_type if isinstance(c_type, str) else str(c_type)
    print (c_type_str + " \t: ", counted_number)
    
a = 0.25
# Pull out the second wedge; the list is sized to the actual number of categories
explode = [a if i == 1 else 0 for i in range(n_company_type)]

cmap = plt.get_cmap('Spectral')
colors = [cmap(i) for i in np.linspace(0, 1, n_company_type)]

fig1, ax1 = plt.subplots(1,1, figsize = (8, 8))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        colors=colors, shadow=True, startangle=180)
ax1.axis('equal')  
plt.title("Company type")
plt.legend()
plt.show()

## Job

#### Difference in years between previous job and current job (last_new_job)

In [None]:
# Write the number of null elements
counted_number = train_set['last_new_job'].isnull().sum()
print("The number of null elements is {}.".format(counted_number))

# Fill missing values with '-1'
for dataset in two_sets_data:
    dataset["last_new_job"] = dataset["last_new_job"].fillna("-1")

train_set.last_new_job.unique()
x = train_set.last_new_job
fig, ax0 = plt.subplots(1,1, figsize = (18, 5))
sns.set(style="whitegrid")
sns.countplot(x=x, palette='Spectral')
plt.title('Years between previous and current job')
plt.xlabel('Time gap between last job and new job')
plt.ylabel('The number of people')
plt.show()

In [None]:
# print(train_set.last_new_job.unique())

# fit = preprocessing.LabelEncoder()
# fit.fit(['1', '>4', 'never', '4', '3', '2', '-1'])
# train_set['last_new_job'].replace('>4', '5', inplace=True)
# train_set['last_new_job'].replace('never', '0', inplace=True)
# train_set['last_new_job'].replace('-1', '6', inplace=True)
# train_set['last_new_job'] = train_set['last_new_job'].astype(int)
# train_set['last_new_job'].dtypes

# print(train_set_droped2.last_new_job.unique())

#### target: 0 – Not looking for job change, 1 – Looking for a job change

In [None]:
x = train_set.target
fig, ax0 = plt.subplots(1,1, figsize = (10, 10))
sns.set(style="whitegrid")
sns.countplot(x=x, palette='Spectral')
plt.title('Target')
plt.xlabel('')
plt.ylabel('The number of people')
plt.show()

---

## Intersections of the data and correlations

#### Intersections
The dataset is imbalanced: for most attributes, one value is dominant. This can lead to mispredictions later on. 
Nevertheless, let's look at how the dominant characteristics overlap.

The highly dominant characteristics are:
1. Gender: Male

2. Education level: Graduate

3. City: city_103 (city development index = 0.920)

4. Major discipline: STEM

5. Enrolled university: no_enrollment

6. Company type: Pvt Ltd

7. Experience: 1 year

Let's have a look at the intersections of these criteria. 
The criteria of separation are the dominant characteristics. To this end we will construct Venn diagrams.
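
The numbers behind such a Venn diagram can be checked directly with boolean masks. The sketch below uses a made-up four-row frame with the notebook's column names; it only illustrates how the overlap sizes are counted.

```python
import pandas as pd

# Toy frame using the notebook's column names (values are made up)
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Male"],
    "major_discipline": ["STEM", "Others", "STEM", "STEM"],
    "company_type": ["Pvt Ltd", "Pvt Ltd", "NGO", "Pvt Ltd"],
})

is_male = toy["gender"] == "Male"
is_stem = toy["major_discipline"] == "STEM"
is_pvt = toy["company_type"] == "Pvt Ltd"

# Size of each pairwise overlap and of the triple overlap
overlap = {
    "Male & STEM": int((is_male & is_stem).sum()),
    "Male & Pvt Ltd": int((is_male & is_pvt).sum()),
    "STEM & Pvt Ltd": int((is_stem & is_pvt).sum()),
    "all three": int((is_male & is_stem & is_pvt).sum()),
}
print(overlap)
```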

In [None]:
gender            = 'Male'
subset_gender     = train_set[train_set.gender == gender].enrollee_id
print('subset_gender : ', len(subset_gender))

major_discipline  = 'STEM'
subset_discipline = train_set[train_set.major_discipline == major_discipline].enrollee_id
print('subset_discipline : ', len(subset_discipline))

company_type      = 'Pvt Ltd'
subset_comp_type  = train_set[train_set.company_type == company_type].enrollee_id
print('subset_comp_type : ', len(subset_comp_type))

from matplotlib_venn import venn3
fig, ax0 = plt.subplots(1,1, figsize = (7, 7))
venn3([set(subset_gender), set(subset_discipline), set(subset_comp_type)], 
      set_labels = ('Male', 'STEM', 'Pvt Ltd'))
plt.title('Intersection')
plt.show()

In [None]:
city              = 'city_103'
subset_city_103   = train_set[train_set.city == city].enrollee_id
print('subset_city_103 : ', len(subset_city_103))

education         = 'Graduate'
subset_education  = train_set[train_set.education_level == education].enrollee_id
print('subset_education : ', len(subset_education))

university        = 'no_enrollment'
subset_univ       = train_set[train_set.enrolled_university == university].enrollee_id
print('subset_univ : ', len(subset_univ))

fig, ax0 = plt.subplots(1,1, figsize = (7, 7))
venn3([set(subset_city_103), set(subset_education), set(subset_univ)], 
      set_labels = ('city_103', 'Graduate', 'no_enrollment'))
plt.title('Intersection')
plt.show()

#### Correlations

To examine correlations between the attributes in the dataset, we will construct a correlation heat map.
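
Pearson correlation is defined only for numeric columns, while most attributes here are categorical. One common alternative for a pair of categorical columns is Cramér's V. The sketch below is an assumption-laden illustration of that measure (no bias correction, hypothetical function name, toy data), not the method used in the notebook.

```python
import numpy as np
import pandas as pd

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V computed from the contingency table of two categorical columns."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Perfectly associated toy columns versus independent toy columns
left = pd.Series(["x", "x", "y", "y"])
v_dependent = cramers_v(left, pd.Series(["p", "p", "q", "q"]))
v_independent = cramers_v(left, pd.Series(["p", "q", "p", "q"]))
print(v_dependent, v_independent)
```

Values near 0 suggest no association and values near 1 a strong one; unlike Pearson's coefficient, the measure carries no sign.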

In [None]:
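# Correlation heat map of the dataset.
# Pearson correlation is defined only for numeric columns, so the categorical
# columns that are still stored as strings at this point are excluded explicitly.
corr_cols = ["city","city_development_index","gender","relevent_experience","enrolled_university","education_level",
             "major_discipline","experience","company_size","company_type","last_new_job","training_hours","target"]
matrix = train_set[corr_cols].select_dtypes(include=[np.number]).corr(method ='pearson')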

sns.set(font_scale=1.10)
plt.figure(figsize=(15, 10))
sns.heatmap(matrix,  linewidths=0.05,
            square=True,annot=True,cmap='Purples_r',linecolor="white")
plt.title('Correlation');

---

## Data quality check
Here we analyse how much data is not suitable for further work.

The things which we check:

1. Absence of data (null, nan).

2. Atypical data (outliers).

3. Duplicates.

4. Inconsistent data (the same data presented in different formats or letter cases).
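
Several of the checks listed above can be sketched in one small report. The function name `quality_report`, its output keys, and the toy frame are hypothetical; outlier detection is left out here because it depends on the rule chosen for each numeric column.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Count null cells, exact duplicate rows, and labels differing only by case or spaces."""
    report = {
        "null_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "inconsistent_labels": {},
    }
    for col in df.select_dtypes(exclude="number").columns:
        raw = df[col].dropna().astype(str)
        normalized = raw.str.strip().str.lower()
        extra = raw.nunique() - normalized.nunique()
        if extra > 0:
            report["inconsistent_labels"][col] = int(extra)
    return report

# Toy frame: one null, one duplicated row, one label written in two cases
toy = pd.DataFrame({
    "company_type": ["Pvt Ltd", "pvt ltd", "NGO", "NGO", None],
    "training_hours": [10, 12, 8, 8, 40],
})
report = quality_report(toy)
print(report)
```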

---

# Model and training

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

In [None]:
train_set.info()

In [None]:
print(train_set.relevent_experience.unique())
fit = preprocessing.LabelEncoder()
fit.fit(['Has relevent experience', 'No relevent experience'])
train_set['relevent_experience'] = fit.transform(train_set.relevent_experience)
print(train_set.relevent_experience.unique())

In [None]:
print(train_set.enrolled_university.unique())
fit = preprocessing.LabelEncoder()
fit.fit([ 'no_enrollment', 'Full time course', 'Undetermined', 'Part time course'])
train_set['enrolled_university'] = fit.transform(train_set.enrolled_university)
print(train_set.enrolled_university.unique())

In [None]:
print(train_set.education_level.unique())
fit = preprocessing.LabelEncoder()
fit.fit(['Graduate', 'Others'])
train_set['education_level'] = fit.transform(train_set.education_level)
print(train_set.education_level.unique())

In [None]:
print(train_set.major_discipline.unique())
fit = preprocessing.LabelEncoder()
fit.fit(['STEM', 'Others'])
train_set['major_discipline'] = fit.transform(train_set.major_discipline)
print(train_set.major_discipline.unique())

In [None]:
print(train_set.experience.unique())

# Map the open-ended labels to numbers, then cast the column to int.
# Missing values were filled with '-1' earlier and stay -1.
train_set['experience'] = (train_set['experience']
                           .replace({'>20': '21', '<1': '0'})
                           .astype(int))
print(train_set['experience'].dtypes)
print(train_set.experience.unique())

In [None]:
print(train_set.last_new_job.unique())

# Map '>4' to 5 and 'never' to 0, then cast the column to int.
# Missing values were filled with '-1' earlier and stay -1.
train_set['last_new_job'] = (train_set['last_new_job']
                             .replace({'>4': '5', 'never': '0'})
                             .astype(int))
print(train_set['last_new_job'].dtypes)
print(train_set.last_new_job.unique())

In [None]:
# Drop the 'city_' prefix and keep the numeric city code.
# str.strip('city_') would remove a set of characters rather than the prefix.
train_set['city'] = (train_set['city']
                     .str.replace('city_', '', regex=False)
                     .astype(int))
print(train_set.city.unique())

In [None]:
n = len(train_set.gender.unique())
uniq_array = train_set.gender.unique()
fit = preprocessing.LabelEncoder()
fit.fit(['Male', 'No gender', 'Female', 'Other'])
train_set['gender'] = fit.transform(train_set.gender)
print(train_set.gender.unique())

In [None]:
n = len(train_set.company_size.unique())
uniq_array = train_set.company_size.unique()
print(uniq_array)
fit = preprocessing.LabelEncoder()
fit.fit(['-1', '50-99', '<10', '10000+', '5000-9999', '1000-4999', '10/49', '100-500', '500-999'])
train_set['company_size'] = fit.transform(train_set.company_size)
print(train_set.company_size.unique())
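
`LabelEncoder` assigns codes in sorted (lexicographic) order of the labels, so the codes for company size do not follow the actual size of the company. If an ordered encoding is wanted, an explicit mapping is one option. The sketch below is a suggestion rather than the notebook's approach, and the small-to-large order of the labels (including reading '10/49' as 10 to 49) is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed small-to-large order of the labels; '-1' marks a missing value
sizes = ['-1', '<10', '10/49', '50-99', '100-500', '500-999',
         '1000-4999', '5000-9999', '10000+']

# LabelEncoder sorts the labels as strings before assigning codes
encoder = LabelEncoder().fit(sizes)
alphabetical = list(encoder.classes_)

# Explicit mapping that preserves the assumed size order instead
size_rank = {label: rank for rank, label in enumerate(sizes)}
toy = pd.Series(['<10', '10000+', '50-99', '-1'])
encoded = toy.map(size_rank)

print(alphabetical)
print(encoded.tolist())
```

Tree-based models can split on either encoding, but an ordered code tends to need fewer splits to separate small companies from large ones.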

In [None]:
n = len(train_set.company_type.unique())
uniq_array = train_set.company_type.unique()
print(uniq_array)
fit = preprocessing.LabelEncoder()
fit.fit(['Undetermined', 'Pvt Ltd', 'Funded Startup', 'Early Stage Startup', 'Other', 'Public Sector', 'NGO'])
train_set['company_type'] = fit.transform(train_set.company_type)
print(train_set.company_type.unique())

In [None]:
train_set.info()

# Models

### Preparation data

In [None]:
# Importing packages for SMOTE
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Separate features and target
X_dataset = train_set.drop(columns=['target']).values
# 1-D target array, as expected by scikit-learn and imbalanced-learn
Y_dataset = train_set['target'].values

# oversampling

# 1 
oversampling = SMOTE()
X_dataset, Y_dataset = oversampling.fit_resample(X_dataset, Y_dataset)

# 2
# oversampling = RandomOverSampler(random_state=42)
# oversampling.fit(X_dataset, Y_dataset)
# X_dataset1, Y_dataset1 = oversampling.fit_resample(X_dataset, Y_dataset)

# 3 Undersampling
# rus = RandomUnderSampler(random_state=0)
# rus.fit(X_dataset, Y_dataset)
# X_dataset1, Y_dataset1 = rus.fit_resample(X_dataset, Y_dataset)

# Division into the train and test parts
X_trainset, X_testset, Y_trainset, Y_testset = train_test_split(X_dataset, Y_dataset, test_size=0.3, random_state=3)

# Check that the number of samples is the same
if (X_testset.shape[0] != Y_testset.shape[0]):
    print ('The arrays don\'t match!')
    print (X_testset.shape)
    print (Y_testset.shape)
else: 
    print('The arrays match.')
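
In the cell above the data is oversampled before it is split, so synthetic points derived from test rows can end up in the training part and the reported scores may be optimistic. A common alternative is to split first and resample only the training part. The sketch below shows that order on synthetic data; it swaps SMOTE for plain random oversampling via `sklearn.utils.resample` to stay self-contained, and all array names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic features
y = np.array([0] * 160 + [1] * 40)     # synthetic imbalanced target

# 1. Hold out the test part first, keeping the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=3)

# 2. Oversample the minority class inside the training part only
minority = y_tr == 1
n_major = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_major, random_state=3)

X_bal = np.vstack([X_tr[~minority], X_min_up])
y_bal = np.concatenate([y_tr[~minority], y_min_up])

print(np.bincount(y_bal), np.bincount(y_te))
```

The test part keeps the original class ratio, so metrics computed on it reflect the imbalance the model would face on new data.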

### Decision Tree

In [None]:
# Learning
model_tree = DecisionTreeClassifier(criterion="entropy", max_depth = 8)
model_tree.fit(X_trainset,Y_trainset)

# Prediction
predTree = model_tree.predict(X_testset)

# Metrics and accuracy
print(classification_report(Y_testset, predTree))

### Random Forest

In [None]:
# Learning
forest = RandomForestClassifier( max_depth = 9)
forest.fit(X_trainset,Y_trainset)

# Prediction
predForest = forest.predict(X_testset)

# Metrics and accuracy
print(classification_report(Y_testset, predForest))
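
The stated goal asks for a probability and for an interpretation of which features matter, while `predict` returns only class labels. A minimal sketch of both steps with a random forest follows. It runs on synthetic data with hypothetical feature names, so the ranking says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))          # synthetic features
y = (X[:, 0] > 0).astype(int)          # synthetic target driven by feature 0 only
feature_names = ["f0", "f1", "f2"]     # hypothetical names

model = RandomForestClassifier(n_estimators=50, max_depth=9, random_state=0)
model.fit(X, y)

# Probability of class 1 for each row
proba = model.predict_proba(X)[:, 1]

# Impurity-based importances, largest first
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

For the notebook's model the analogous calls would presumably be `forest.predict_proba(X_testset)` and `forest.feature_importances_`, paired with the column names of `train_set` without `target`.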