## Prediction of data analyst job application probability

The application probability (the 'Easy Apply' flag) is predicted from the company's size, sector, type of ownership, revenue and rating.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib 
from matplotlib import pyplot as plt

In [None]:
df = pd.read_csv('/kaggle/input/data-analyst-jobs/DataAnalyst.csv')
df.head()

In [None]:
df.columns

## Data Wrangling

Most columns contain missing records, but they are encoded as '-1' rather than NaN. These records are converted to NaN to ease the cleaning process.

In [None]:
# Map the '-1' sentinel in 'Easy Apply' to 'False' before the global replacement
df['Easy Apply'] = df['Easy Apply'].replace('-1', 'False')
# Convert the remaining sentinels to NaN so that dropna() can remove those rows
df = df.replace('-1', np.nan)
df['Revenue'] = df['Revenue'].replace('Unknown / Non-Applicable', np.nan)
df['Founded'] = df['Founded'].replace(-1, np.nan)
df = df.dropna()
df = df.drop(columns='Unnamed: 0')
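
The string sentinel '-1' and the numeric sentinel -1 do not match each other, which is why 'Founded' needs its own replacement. A minimal sketch on a made-up frame (the `toy` data below is an assumption for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with both kinds of sentinel
toy = pd.DataFrame({
    "Revenue": ["-1", "$1 to $5 million (USD)", "Unknown / Non-Applicable"],
    "Founded": [-1, 1999, 2010],
})

# Replacing the string '-1' leaves the integer -1 in 'Founded' untouched
step1 = toy.replace("-1", np.nan)
# Listing every sentinel handles all columns in one call
step2 = toy.replace(["-1", -1, "Unknown / Non-Applicable"], np.nan)

print(step1.isna().sum())
print(step2.isna().sum())
```

Checking `isna().sum()` after each step is a quick way to confirm that no sentinel slipped through before calling `dropna()`.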


In [None]:
df = df.reset_index(drop=True)

In [None]:
df

A few of the columns, such as Size, Type of ownership, Sector and Revenue, hold categorical values. Hence, label encoding is used to turn them into numeric features that can be used by the classification model.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
col = df[['Size','Type of ownership','Sector','Revenue','Easy Apply']]
df1 = col.apply(lambda x: le.fit_transform(x))
df1['Rating'] = df['Rating']
df1
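
To see what the encoding does, the sketch below fits one `LabelEncoder` per column on a made-up frame and keeps each encoder so the integer codes can be mapped back to their labels. The `toy` values and the `encoders` dictionary are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame
toy = pd.DataFrame({
    "Size": ["51 to 200 employees", "1 to 50 employees", "51 to 200 employees"],
    "Easy Apply": ["False", "True", "False"],
})

encoders = {}
encoded = toy.copy()
for name in toy.columns:
    enc = LabelEncoder()
    # Classes are sorted alphabetically and numbered from 0
    encoded[name] = enc.fit_transform(toy[name])
    encoders[name] = enc

print(encoded)
print(list(encoders["Size"].classes_))
```

Keeping one encoder per column makes `inverse_transform` available later, for example to report predictions with the original labels. Note that the integer codes follow alphabetical order, not any natural ordering of the categories.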

Upon conversion, there is an imbalance in the samples categorised under 'Easy Apply' column as shown below.

In [None]:
import seaborn as sns
sns.countplot(x=df1['Easy Apply'])
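
The imbalance can also be expressed as numbers. A small sketch, assuming a made-up target column (the values in `easy_apply` are not from the dataset):

```python
import pandas as pd

# Hypothetical encoded target: 8 negatives, 2 positives
easy_apply = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Easy Apply")

counts = easy_apply.value_counts()               # absolute class counts
share = easy_apply.value_counts(normalize=True)  # class proportions
imbalance_ratio = counts.max() / counts.min()    # majority / minority

print(counts)
print(share)
print("Imbalance ratio:", imbalance_ratio)
```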

### Oversampling the minority class via SMOTE

In [None]:
import imblearn
from imblearn.over_sampling import SMOTE

x = df1[['Size','Type of ownership','Sector','Revenue','Rating']]
y = df1['Easy Apply']
samples = SMOTE()
X,Y = samples.fit_resample(x,y)

In [None]:
df2 = pd.DataFrame(X)
df2['Easy Apply'] = Y
df2

Samples are now balanced.

In [None]:
sns.countplot(x=df2['Easy Apply'])

## Model Development

In [None]:
from sklearn.model_selection import train_test_split
x = df2[['Size','Type of ownership','Sector','Revenue','Rating']]
y = df2['Easy Apply']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=5)
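
One alternative ordering worth considering is to split first and oversample only the training part, so that the test set contains no synthetic or duplicated rows. The sketch below illustrates this on made-up data. It uses plain random oversampling via `sklearn.utils.resample` as a stand-in, not SMOTE, and the `toy` frame is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 10 positives, 90 negatives
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Rating": rng.uniform(1, 5, 100),
    "Easy Apply": [1] * 10 + [0] * 90,
})

# Split first, keeping the class ratio in both parts
train, test = train_test_split(
    toy, test_size=0.3, random_state=5, stratify=toy["Easy Apply"])

# Oversample the minority class inside the training part only
majority = train[train["Easy Apply"] == 0]
minority = train[train["Easy Apply"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=5)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Easy Apply"].value_counts())
print(test["Easy Apply"].value_counts())
```

With this ordering the test scores reflect the original class distribution, which may give a more conservative estimate than evaluating on resampled data.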

### Cross validation - accuracy and Learning curve

In [None]:
from sklearn.model_selection import learning_curve, KFold, cross_val_score
fold = KFold(shuffle=True)

def cv_accuracy(estimator, train_size=np.linspace(0.1, 1.0, 5)):
    # Mean cross-validated accuracy on the training split
    accuracy1 = cross_val_score(estimator, x_train, y_train, cv=fold)
    print(accuracy1.mean())
    # Learning curve: scores for increasing training-set sizes
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, x_train, y_train, train_sizes=train_size, cv=fold)
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training samples')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label='Cross-validation samples')
    plt.xlabel('Training sizes')
    plt.ylabel('Accuracy')  # learning_curve returns scores, not errors
    plt.title('Learning curve')
    plt.legend()
    plt.show()

### Logistic regression

Since the target is a binary outcome whose probability is of interest, logistic regression is the natural first choice because it models class probabilities directly.

In [None]:
from sklearn.linear_model import LogisticRegression
cv_accuracy(LogisticRegression())

From the cross-validation results, the accuracy is around 80%. However, the learning curve indicates high bias, meaning the model underfits the data. Hence, the logistic regression model is not considered.
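
A learning curve can also be read numerically. As a rough rule of thumb, training and validation scores that are both low and close together point to high bias, while a high training score with a clearly lower validation score points to high variance. The sketch below computes both scores at the largest training size on synthetic data (the data from `make_classification` is a stand-in, not the job-postings dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_toy, y_toy,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Mean scores across folds at the largest training size
train_final = train_scores.mean(axis=1)[-1]
val_final = val_scores.mean(axis=1)[-1]
gap = train_final - val_final

print(f"train={train_final:.3f}  validation={val_final:.3f}  gap={gap:.3f}")
```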

### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
cv_accuracy(DecisionTreeClassifier())

The accuracy is 93%, and the learning curve shows a noticeable gap between the training scores and the cross-validation scores.

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
cv_accuracy(RandomForestClassifier())

Among the three models tested with cross-validation, the random forest classifier provides the highest accuracy, 94%. The learning curve shows low bias and low variance.

## Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
rc = RandomForestClassifier()
parm = {'n_estimators':[100,200,300,500],'max_depth':[1,3,5,7]}
grid = GridSearchCV(estimator=rc,param_grid=parm,cv=fold)
    

In [None]:
grid.fit(x_train,y_train)


In [None]:
grid.best_params_

In [None]:
accuracy1 = cross_val_score(RandomForestClassifier(max_depth=7, n_estimators=300),x_train,y_train,cv=fold)
print(accuracy1.mean())

Based on the grid search result, the best parameters are a maximum depth of 7 and 300 estimators. Hence, a model with these parameters is fitted on the training split and applied to the test samples.

In [None]:
rc = RandomForestClassifier(max_depth=7,n_estimators=300)
rc = rc.fit(x_train,y_train)
yhat = rc.predict(x_test)
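
Instead of copying the best parameters by hand, the fitted search object can supply the refitted model directly through `best_estimator_`. A minimal sketch on synthetic data with a deliberately small grid (the data and the `small_grid` values are assumptions chosen to keep the example fast):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=5)

small_grid = {'n_estimators': [10, 20], 'max_depth': [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained on X_tr
best_model = search.best_estimator_
yhat_toy = best_model.predict(X_te)

print(search.best_params_)
```

This keeps the final model in sync with the search result if the grid or the data changes later.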

## Model Evaluation

In [None]:
from sklearn.metrics import classification_report, accuracy_score, ConfusionMatrixDisplay
print('Classification Report')
print(classification_report(y_test, yhat))
print('Accuracy', accuracy_score(y_test, yhat))
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(rc, x_test, y_test, cmap='Blues')

Based on the overall evaluation metrics, the model achieves about 96% accuracy on the test samples.

In [None]:
# Copy x_test so that adding the prediction column does not modify the test features
df_output = x_test.copy()
df_output['Easy Apply'] = yhat
df_output
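
Since the stated goal is a probability, `predict_proba` may be more informative than the hard labels returned by `predict`. The sketch below shows the idea on synthetic data. The feature names `f0` to `f4` and the output column name are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy = pd.DataFrame(X_toy, columns=[f"f{i}" for i in range(5)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=5)

model = RandomForestClassifier(
    max_depth=7, n_estimators=300, random_state=0).fit(X_tr, y_tr)

# One column of probabilities per class, ordered as in model.classes_
proba = model.predict_proba(X_te)
positive_col = list(model.classes_).index(1)

out = X_te.copy()
out["Easy Apply probability"] = proba[:, positive_col]
print(out.head())
```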