### Problem Statement

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems your client faces is identifying the right people for promotion (only for manager positions and below) and preparing them in time. Currently, the process they follow is:

1. They first identify a set of employees based on recommendations and past performance.
2. Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills each vertical requires.
3. At the end of the program, an employee is promoted based on various factors such as training performance and KPI completion (only employees with KPI completion greater than 60% are considered).

Under this process, the final promotions are announced only after the evaluation, which delays the transition to new roles. Hence, the company needs your help in identifying the eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

They have provided multiple attributes covering each employee's past and current performance along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.


#### Dataset Description

| Variable | Definition |
| --- | --- |
| employee_id | Unique ID for employee |
| department | Department of employee |
| region | Region of employment (unordered) |
| education | Education level |
| gender | Gender of employee |
| recruitment_channel | Channel of recruitment for employee |
| no_of_trainings | Number of other trainings completed in the previous year on soft skills, technical skills, etc. |
| age | Age of employee |
| previous_year_rating | Employee rating for the previous year |
| length_of_service | Length of service in years |
| KPIs_met >80% | 1 if the percentage of KPIs (Key Performance Indicators) met is above 80%, else 0 |
| awards_won? | 1 if awards were won during the previous year, else 0 |
| avg_training_score | Average score in current training evaluations |
| is_promoted | (Target) Recommended for promotion |


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
train = pd.read_csv("train_LZdllcl.csv")

In [3]:
from datacleaner import autoclean
from tpot import TPOTClassifier

In [4]:
train.isna().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [5]:
autoclean(train)

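| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 65438 | 7 | 31 | 2 | 0 | 2 | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
| 1 | 65141 | 4 | 14 | 0 | 1 | 0 | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
| 2 | 7513 | 7 | 10 | 0 | 1 | 2 | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
| 3 | 2542 | 7 | 15 | 0 | 1 | 0 | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
| 4 | 48945 | 8 | 18 | 0 | 1 | 0 | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
| 5 | 58896 | 0 | 11 | 0 | 1 | 2 | 2 | 31 | 3.0 | 7 | 0 | 0 | 85 | 0 |
| 6 | 20379 | 4 | 12 | 0 | 0 | 0 | 1 | 31 | 3.0 | 5 | 0 | 0 | 59 | 0 |
| 7 | 16290 | 4 | 27 | 2 | 1 | 2 | 1 | 33 | 3.0 | 6 | 0 | 0 | 63 | 0 |
| 8 | 73202 | 0 | 12 | 0 | 1 | 0 | 1 | 28 | 4.0 | 5 | 0 | 0 | 83 | 0 |
| 9 | 28911 | 7 | 0 | 2 | 1 | 2 | 1 | 32 | 5.0 | 5 | 1 | 0 | 54 | 0 |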


In [6]:
train.isna().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64
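The missing counts for `education` and `previous_year_rating` drop to zero because `autoclean` imputes and encodes the frame in place. The sketch below is a rough pandas-only approximation of that kind of cleaning, under the assumption (not verified against this run) that numeric gaps are filled with the column median, categorical gaps with the mode, and non-numeric columns are then replaced by integer codes. The helper `impute_and_encode` and the toy frame are hypothetical and are not part of datacleaner.

```python
import numpy as np
import pandas as pd


def impute_and_encode(df):
    """Return a copy with median/mode imputation and integer-coded categoricals."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric column: fill gaps with the median
            out[col] = out[col].fillna(out[col].median())
        else:
            # Non-numeric column: fill gaps with the most frequent value,
            # then replace each category by an integer code
            out[col] = out[col].fillna(out[col].mode().iloc[0])
            out[col] = out[col].astype("category").cat.codes
    return out


# Toy data with the same two gappy columns as the training set
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, np.nan, 3.0, 4.0],
})
cleaned = impute_and_encode(toy)
print(cleaned.isna().sum())
```

Writing the imputation out like this makes it easy to see which values were invented for the 2409 and 4124 originally missing entries, which is hidden when a single call does everything.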

In [7]:
target = train["is_promoted"]

In [8]:
from Dora import Dora

In [11]:
train_dora_scaled=Dora(data=train,output="is_promoted")

In [12]:
train_dora_scaled.scale_input_values()

In [19]:
final_train = train_dora_scaled.data.drop(columns=["is_promoted","employee_id"])

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
X_train, X_test, y_train, y_test = train_test_split(final_train,target,train_size=0.75, test_size=0.25)

In [22]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((41106, 12), (13702, 12), (41106,), (13702,))
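The target is imbalanced (the classification report further down shows 1,202 positives among 13,702 test rows, roughly 9%), so an unstratified random split can shift the positive rate between the two parts. A minimal sketch on synthetic data, assuming a stratified, seeded split is acceptable here; the arrays `X` and `y` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 9% positives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 90 + [0] * 910)

# stratify keeps the class ratio (almost) identical in both parts;
# random_state makes the split reproducible across notebook runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(y_tr.mean(), y_te.mean())
```

In the notebook the same two arguments would be passed as `stratify=target` and a fixed `random_state`, which also makes reruns of the later cells comparable.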

In [39]:
from sklearn.metrics import f1_score, make_scorer

In [40]:
class_scorer = make_scorer(f1_score)

In [42]:
tpot = TPOTClassifier(generations=5,population_size=50,scoring= class_scorer)

In [43]:
tpot.fit(X_train,y_train)

TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=5,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=50,
        random_state=None, scoring=make_scorer(f1_score), subsample=1.0,
        use_dask=False, verbosity=0, warm_start=False)

In [44]:
tpot.score(X_test,y_test)*100

52.1584861028977

In [45]:
tpot.fitted_pipeline_

Pipeline(memory=None,
     steps=[('gradientboostingclassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=8,
              max_features=0.9000000000000001, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
 ...auto', random_state=None,
              subsample=0.9500000000000001, verbose=0, warm_start=False))])

In [53]:
tpot.default_config_dict

{'sklearn.naive_bayes.GaussianNB': {},
 'sklearn.naive_bayes.BernoulliNB': {'alpha': [0.001,
   0.01,
   0.1,
   1.0,
   10.0,
   100.0],
  'fit_prior': [True, False]},
 'sklearn.naive_bayes.MultinomialNB': {'alpha': [0.001,
   0.01,
   0.1,
   1.0,
   10.0,
   100.0],
  'fit_prior': [True, False]},
 'sklearn.tree.DecisionTreeClassifier': {'criterion': ['gini', 'entropy'],
  'max_depth': range(1, 11),
  'min_samples_split': range(2, 21),
  'min_samples_leaf': range(1, 21)},
 'sklearn.ensemble.ExtraTreesClassifier': {'n_estimators': [100],
  'criterion': ['gini', 'entropy'],
  'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
         0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),
  'min_samples_split': range(2, 21),
  'min_samples_leaf': range(1, 21),
  'bootstrap': [True, False]},
 'sklearn.ensemble.RandomForestClassifier': {'n_estimators': [100],
  'criterion': ['gini', 'entropy'],
  'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 

In [46]:
tpot.evaluated_individuals_

{'RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=gini, RandomForestClassifier__max_features=0.25, RandomForestClassifier__min_samples_leaf=12, RandomForestClassifier__min_samples_split=11, RandomForestClassifier__n_estimators=100)': {'generation': 0,
  'mutation_count': 0,
  'crossover_count': 0,
  'predecessor': ('ROOT',),
  'operator_count': 1,
  'internal_cv_score': 0.2687779181741231},
 'GaussianNB(GaussianNB(input_matrix))': {'generation': 0,
  'mutation_count': 0,
  'crossover_count': 0,
  'predecessor': ('ROOT',),
  'operator_count': 2,
  'internal_cv_score': 0.2408873698795298},
 'ExtraTreesClassifier(input_matrix, ExtraTreesClassifier__bootstrap=True, ExtraTreesClassifier__criterion=entropy, ExtraTreesClassifier__max_features=0.6000000000000001, ExtraTreesClassifier__min_samples_leaf=17, ExtraTreesClassifier__min_samples_split=14, ExtraTreesClassifier__n_estimators=100)': {'generation': 0,
  'mutation_count': 0,
 

In [31]:
tpot.export('tpot_exported_pipeline.py')

In [47]:
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

In [48]:
print(classification_report(y_test,tpot.predict(X_test)))
f1_score(y_test,tpot.predict(X_test))

             precision    recall  f1-score   support

          0       0.94      1.00      0.97     12500
          1       0.90      0.37      0.52      1202

avg / total       0.94      0.94      0.93     13702



0.5215848610289769
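The class-1 row of the report explains the low score: precision is high (0.90) but recall is only 0.37, and F1 is the harmonic mean of the two, so it is pulled towards the smaller value. A small worked check using the rounded figures from the report; the helper name `f1_from_precision_recall` is illustrative only.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded class-1 precision and recall taken from the report above
f1_class1 = f1_from_precision_recall(0.90, 0.37)
print(round(f1_class1, 2))  # 0.52
```

The result agrees with the 0.52 in the report up to rounding of the inputs, which suggests that raising recall on the promoted class is where most of the remaining F1 is to be found.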

In [38]:
print(classification_report(y_test,tpot.predict(X_test)))
f1_score(y_test,tpot.predict(X_test))

             precision    recall  f1-score   support

          0       0.94      1.00      0.97     12500
          1       0.93      0.36      0.51      1202

avg / total       0.94      0.94      0.93     13702



0.5135297654840649

In [54]:
from catboost import CatBoostClassifier

In [55]:
catb = CatBoostClassifier()

In [56]:
catb.fit(X_train,y_train)
cat_pred=catb.predict(X_test)
print(classification_report(y_test,cat_pred))

Learning rate set to 0.047958
0:	learn: 0.6227400	total: 175ms	remaining: 2m 55s
1:	learn: 0.5683293	total: 278ms	remaining: 2m 18s
2:	learn: 0.5315573	total: 420ms	remaining: 2m 19s
3:	learn: 0.5002840	total: 504ms	remaining: 2m 5s
4:	learn: 0.4697987	total: 572ms	remaining: 1m 53s
5:	learn: 0.4443590	total: 652ms	remaining: 1m 47s
6:	learn: 0.4065083	total: 862ms	remaining: 2m 2s
7:	learn: 0.3884196	total: 1.14s	remaining: 2m 21s
8:	learn: 0.3717290	total: 1.36s	remaining: 2m 30s
9:	learn: 0.3561776	total: 1.65s	remaining: 2m 43s
10:	learn: 0.3425937	total: 1.74s	remaining: 2m 36s
11:	learn: 0.3295060	total: 1.8s	remaining: 2m 28s
12:	learn: 0.3179489	total: 1.97s	remaining: 2m 29s
13:	learn: 0.3060173	total: 2.03s	remaining: 2m 23s
14:	learn: 0.2921130	total: 2.09s	remaining: 2m 17s
15:	learn: 0.2852658	total: 2.17s	remaining: 2m 13s
16:	learn: 0.2796572	total: 2.44s	remaining: 2m 21s
17:	learn: 0.2737370	total: 2.52s	remaining: 2m 17s
18:	learn: 0.2668748	total: 2.59s	remaining: 2m

158:	learn: 0.1685335	total: 19.2s	remaining: 1m 41s
159:	learn: 0.1683103	total: 19.2s	remaining: 1m 40s
160:	learn: 0.1683038	total: 19.3s	remaining: 1m 40s
161:	learn: 0.1681678	total: 19.3s	remaining: 1m 40s
162:	learn: 0.1681001	total: 19.4s	remaining: 1m 39s
163:	learn: 0.1680834	total: 19.5s	remaining: 1m 39s
164:	learn: 0.1679980	total: 19.6s	remaining: 1m 39s
165:	learn: 0.1679744	total: 19.7s	remaining: 1m 38s
166:	learn: 0.1676184	total: 19.7s	remaining: 1m 38s
167:	learn: 0.1671709	total: 19.8s	remaining: 1m 38s
168:	learn: 0.1671469	total: 19.9s	remaining: 1m 37s
169:	learn: 0.1671341	total: 20s	remaining: 1m 37s
170:	learn: 0.1671213	total: 20s	remaining: 1m 36s
171:	learn: 0.1671040	total: 20.1s	remaining: 1m 36s
172:	learn: 0.1670033	total: 20.1s	remaining: 1m 36s
173:	learn: 0.1668383	total: 20.2s	remaining: 1m 35s
174:	learn: 0.1668077	total: 20.2s	remaining: 1m 35s
175:	learn: 0.1666759	total: 20.3s	remaining: 1m 35s
176:	learn: 0.1666573	total: 20.4s	remaining: 1m 3

314:	learn: 0.1595380	total: 29.9s	remaining: 1m 5s
315:	learn: 0.1594984	total: 30s	remaining: 1m 4s
316:	learn: 0.1594636	total: 30.1s	remaining: 1m 4s
317:	learn: 0.1594445	total: 30.1s	remaining: 1m 4s
318:	learn: 0.1594432	total: 30.2s	remaining: 1m 4s
319:	learn: 0.1594401	total: 30.2s	remaining: 1m 4s
320:	learn: 0.1593853	total: 30.3s	remaining: 1m 4s
321:	learn: 0.1593428	total: 30.4s	remaining: 1m 3s
322:	learn: 0.1593206	total: 30.5s	remaining: 1m 3s
323:	learn: 0.1592041	total: 30.6s	remaining: 1m 3s
324:	learn: 0.1591903	total: 30.6s	remaining: 1m 3s
325:	learn: 0.1591727	total: 30.7s	remaining: 1m 3s
326:	learn: 0.1591566	total: 30.8s	remaining: 1m 3s
327:	learn: 0.1591395	total: 30.8s	remaining: 1m 3s
328:	learn: 0.1591145	total: 30.9s	remaining: 1m 2s
329:	learn: 0.1590984	total: 30.9s	remaining: 1m 2s
330:	learn: 0.1590762	total: 31s	remaining: 1m 2s
331:	learn: 0.1590752	total: 31.1s	remaining: 1m 2s
332:	learn: 0.1590569	total: 31.1s	remaining: 1m 2s
333:	learn: 0.15

475:	learn: 0.1564794	total: 42.4s	remaining: 46.7s
476:	learn: 0.1564786	total: 42.4s	remaining: 46.5s
477:	learn: 0.1564312	total: 42.5s	remaining: 46.4s
478:	learn: 0.1563739	total: 42.6s	remaining: 46.3s
479:	learn: 0.1563616	total: 42.6s	remaining: 46.2s
480:	learn: 0.1563554	total: 42.7s	remaining: 46.1s
481:	learn: 0.1563543	total: 42.8s	remaining: 46s
482:	learn: 0.1563496	total: 42.8s	remaining: 45.8s
483:	learn: 0.1563272	total: 42.9s	remaining: 45.7s
484:	learn: 0.1563229	total: 43s	remaining: 45.6s
485:	learn: 0.1563052	total: 43s	remaining: 45.5s
486:	learn: 0.1563027	total: 43.1s	remaining: 45.4s
487:	learn: 0.1562785	total: 43.2s	remaining: 45.3s
488:	learn: 0.1562374	total: 43.2s	remaining: 45.2s
489:	learn: 0.1562264	total: 43.3s	remaining: 45s
490:	learn: 0.1562125	total: 43.3s	remaining: 44.9s
491:	learn: 0.1562108	total: 43.4s	remaining: 44.8s
492:	learn: 0.1561835	total: 43.5s	remaining: 44.7s
493:	learn: 0.1561812	total: 43.5s	remaining: 44.6s
494:	learn: 0.156161

634:	learn: 0.1544021	total: 53.6s	remaining: 30.8s
635:	learn: 0.1543969	total: 53.7s	remaining: 30.7s
636:	learn: 0.1543617	total: 53.8s	remaining: 30.6s
637:	learn: 0.1543616	total: 53.8s	remaining: 30.5s
638:	learn: 0.1543576	total: 53.9s	remaining: 30.4s
639:	learn: 0.1543562	total: 54s	remaining: 30.3s
640:	learn: 0.1543556	total: 54s	remaining: 30.3s
641:	learn: 0.1543334	total: 54.1s	remaining: 30.1s
642:	learn: 0.1543331	total: 54.1s	remaining: 30.1s
643:	learn: 0.1543138	total: 54.2s	remaining: 30s
644:	learn: 0.1543035	total: 54.3s	remaining: 29.9s
645:	learn: 0.1543009	total: 54.3s	remaining: 29.8s
646:	learn: 0.1542974	total: 54.4s	remaining: 29.7s
647:	learn: 0.1542963	total: 54.5s	remaining: 29.6s
648:	learn: 0.1542954	total: 54.5s	remaining: 29.5s
649:	learn: 0.1542835	total: 54.6s	remaining: 29.4s
650:	learn: 0.1542822	total: 54.7s	remaining: 29.3s
651:	learn: 0.1542679	total: 54.7s	remaining: 29.2s
652:	learn: 0.1542677	total: 54.8s	remaining: 29.1s
653:	learn: 0.1542

797:	learn: 0.1528435	total: 1m 5s	remaining: 16.6s
798:	learn: 0.1528246	total: 1m 5s	remaining: 16.5s
799:	learn: 0.1528007	total: 1m 5s	remaining: 16.4s
800:	learn: 0.1527992	total: 1m 5s	remaining: 16.4s
801:	learn: 0.1527971	total: 1m 5s	remaining: 16.3s
802:	learn: 0.1527733	total: 1m 5s	remaining: 16.2s
803:	learn: 0.1527713	total: 1m 6s	remaining: 16.1s
804:	learn: 0.1527710	total: 1m 6s	remaining: 16s
805:	learn: 0.1527677	total: 1m 6s	remaining: 15.9s
806:	learn: 0.1527676	total: 1m 6s	remaining: 15.8s
807:	learn: 0.1527675	total: 1m 6s	remaining: 15.7s
808:	learn: 0.1527665	total: 1m 6s	remaining: 15.7s
809:	learn: 0.1527656	total: 1m 6s	remaining: 15.6s
810:	learn: 0.1527479	total: 1m 6s	remaining: 15.5s
811:	learn: 0.1527467	total: 1m 6s	remaining: 15.4s
812:	learn: 0.1527450	total: 1m 6s	remaining: 15.3s
813:	learn: 0.1527446	total: 1m 6s	remaining: 15.2s
814:	learn: 0.1527440	total: 1m 6s	remaining: 15.1s
815:	learn: 0.1527395	total: 1m 6s	remaining: 15s
816:	learn: 0.15

955:	learn: 0.1514882	total: 1m 15s	remaining: 3.46s
956:	learn: 0.1514821	total: 1m 15s	remaining: 3.38s
957:	learn: 0.1514773	total: 1m 15s	remaining: 3.3s
958:	learn: 0.1514719	total: 1m 15s	remaining: 3.22s
959:	learn: 0.1514716	total: 1m 15s	remaining: 3.14s
960:	learn: 0.1514712	total: 1m 15s	remaining: 3.06s
961:	learn: 0.1514672	total: 1m 15s	remaining: 2.98s
962:	learn: 0.1514622	total: 1m 15s	remaining: 2.9s
963:	learn: 0.1514620	total: 1m 15s	remaining: 2.83s
964:	learn: 0.1514608	total: 1m 15s	remaining: 2.75s
965:	learn: 0.1514585	total: 1m 15s	remaining: 2.67s
966:	learn: 0.1514444	total: 1m 15s	remaining: 2.59s
967:	learn: 0.1514403	total: 1m 16s	remaining: 2.51s
968:	learn: 0.1514391	total: 1m 16s	remaining: 2.43s
969:	learn: 0.1514370	total: 1m 16s	remaining: 2.35s
970:	learn: 0.1514369	total: 1m 16s	remaining: 2.27s
971:	learn: 0.1514369	total: 1m 16s	remaining: 2.2s
972:	learn: 0.1514369	total: 1m 16s	remaining: 2.12s
973:	learn: 0.1514229	total: 1m 16s	remaining: 2.

In [57]:
import joblib

In [58]:
filename= "cat_model.pkl"

In [59]:
joblib.dump(catb,filename)

['cat_model.pkl']

In [60]:
cat_load= joblib.load(filename)
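A quick way to gain confidence in a saved model is to check that the reloaded object predicts exactly what the in-memory one does. The sketch below shows that round trip on a small scikit-learn model and synthetic data; the model, data and file name are stand-ins chosen so the example runs without CatBoost, not the objects used in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Dump to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

The same comparison could be run here as `np.array_equal(catb.predict(X_test), cat_load.predict(X_test))`.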

In [66]:
test  = pd.read_csv("test_2umaH9m.csv")

autoclean(test)
test["is_promoted"] = np.nan  # dummy target column, needed only because Dora expects an output column
test_dora_scaled=Dora(data=test,output="is_promoted")
test_dora_scaled.scale_input_values()
test_dora_scaled=test_dora_scaled.data.drop(columns=["is_promoted"])

In [69]:
test_dora_scaled=test_dora_scaled.drop(columns=["employee_id"])
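Because `scale_input_values()` is called on the test frame by itself, the test features are presumably scaled with the test set's own statistics rather than those of the training data the model saw. A minimal sketch of the usual alternative, assuming plain standardisation is what is wanted: fit the scaler on the training frame only and reuse it for the test frame. `StandardScaler` and the toy frames are stand-ins and are not what Dora uses internally.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the train and test feature frames
train_toy = pd.DataFrame({"age": [25, 35, 45, 55], "avg_training_score": [50, 60, 70, 80]})
test_toy = pd.DataFrame({"age": [30, 60], "avg_training_score": [55, 90]})

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train_toy)

train_scaled = pd.DataFrame(
    scaler.transform(train_toy), columns=train_toy.columns, index=train_toy.index
)
# Apply the *training* statistics to the test data
test_scaled = pd.DataFrame(
    scaler.transform(test_toy), columns=test_toy.columns, index=test_toy.index
)

print(test_scaled)
```

With this arrangement a given raw value maps to the same scaled value in both frames, which is what a model fitted on the scaled training data implicitly assumes.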

In [72]:
submission = pd.DataFrame()
submission["is_promoted"]=cat_load.predict(test_dora_scaled)
submission["employee_id"]=test.employee_id

In [78]:
submission["is_promoted"]=submission.is_promoted.astype(int)

In [79]:
submission.to_csv("final_1.csv",index=False)

In [80]:
submission.shape

(23490, 2)

In [104]:
from catboost import Pool

# Feature importances of the fitted CatBoost model, paired with their column names
importances = catb.get_feature_importance(Pool(final_train, label=target))
feature_score = pd.DataFrame({"Feature": final_train.columns, "Score": importances})

In [106]:
feature_score.sort_values(by="Score")

| | Feature | Score |
| --- | --- | --- |
| 4 | recruitment_channel | 0.101103 |
| 3 | gender | 0.121759 |
| 2 | education | 0.253902 |
| 5 | no_of_trainings | 0.45109 |
| 10 | awards_won? | 0.852878 |
| 6 | age | 1.580729 |
| 8 | length_of_service | 1.686837 |
| 1 | region | 2.17746 |
| 7 | previous_year_rating | 5.440794 |
| 0 | department | 17.414341 |


In [49]:
import autosklearn.classification

ModuleNotFoundError: No module named 'autosklearn'

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.describe()

In [None]:
train = train.dropna()

In [None]:
train.isna().sum()

In [None]:
target = train.is_promoted

In [None]:
train.columns

In [None]:
# train_treated
# numerical = train.select_dtypes(include = np.number)
# nonnumerical = train.select_dtypes(exclude = np.number)

In [None]:
# numerical.head()

#### Feature Exploration

In [None]:
train.is_promoted.value_counts().sort_index()

In [None]:
# plt.figure(figsize=(12,12))
sns.boxplot(x=train.is_promoted)

In [None]:
train.no_of_trainings.value_counts().sort_index()

In [None]:
sns.boxplot(x=train.no_of_trainings)

In [None]:
train.age.value_counts().sort_index() 

In [None]:
sns.boxplot(x=train.age)

In [None]:
# numerical.columns

In [None]:
train.previous_year_rating.value_counts().sort_index()

In [None]:
train.length_of_service.value_counts().sort_index()

In [None]:
train["KPIs_met >80%"].value_counts().sort_index()

In [None]:
train["awards_won?"].value_counts().sort_index()

In [None]:
train["avg_training_score"].value_counts().sort_index()

In [None]:
# numerical.columns

Planned treatment of the numerical columns:

- `employee_id`: drop
- `no_of_trainings`: bin
- `age`: bin
- `previous_year_rating`: contains nulls (drop null rows)
- `length_of_service`: possible binning
- `KPIs_met >80%`
- `awards_won?`
- `avg_training_score`
- `is_promoted`: imbalanced, needs resampling

### Working on Numerical Features

In [None]:
# numerical.isna().sum()

In [None]:
# numerical_treated = numerical.copy(deep=True)

In [None]:
# numerical_treated = numerical.dropna()

In [None]:
# numerical_treated.isna().sum()

In [None]:
# numerical_treated = numerical_treated.drop(columns=["employee_id"])
train_treated = train.drop(columns=["employee_id"])

In [None]:
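def age(df):
    """Bin `age` in place into ordinal groups 1-5 (<=28, 29-36, 37-44, 45-52, 53-60)."""
    # Run once per frame: the codes 1-5 are themselves <= 28,
    # so a second call would map every row to 1.
    df.loc[df['age'] <= 28, 'age'] = 1
    df.loc[(df['age'] > 28) & (df['age'] <= 36), 'age'] = 2
    df.loc[(df['age'] > 36) & (df['age'] <= 44), 'age'] = 3
    df.loc[(df['age'] > 44) & (df['age'] <= 52), 'age'] = 4
    df.loc[(df['age'] > 52) & (df['age'] <= 60), 'age'] = 5

    return df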

In [None]:
age(train_treated)

In [None]:
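def no_of_trainings(df):
    """Bin `no_of_trainings` in place into groups 1-3 (<=1, 2-5, 6-10)."""
    # Run once per frame: on a second call the code 3 falls in (1, 5] and is remapped to 2.
    df.loc[df['no_of_trainings'] <= 1, 'no_of_trainings'] = 1
    df.loc[(df['no_of_trainings'] > 1) & (df['no_of_trainings'] <= 5), 'no_of_trainings'] = 2
    df.loc[(df['no_of_trainings'] > 5) & (df['no_of_trainings'] <= 10), 'no_of_trainings'] = 3

    return df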

In [None]:
no_of_trainings(train_treated)
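The two binning helpers above overwrite the column step by step with `.loc`, so rerunning a cell bins the already binned codes again. The same right-closed intervals can be written with `pd.cut`, which returns a new Series and leaves the input untouched. A minimal sketch on toy values; the bin edges are copied from the functions above, while the lower edge of 0 is an assumption that no value is zero or negative. Note one difference: `pd.cut` yields NaN for values outside the edges, whereas the helpers leave such values unchanged.

```python
import pandas as pd

# Toy ages sitting on and just past each bin boundary
ages = pd.Series([20, 28, 29, 36, 37, 44, 45, 52, 53, 60])
age_bin = pd.cut(
    ages, bins=[0, 28, 36, 44, 52, 60], labels=[1, 2, 3, 4, 5]
).astype(int)

# Same idea for the number of trainings
trainings = pd.Series([1, 2, 5, 6, 10])
trainings_bin = pd.cut(trainings, bins=[0, 1, 5, 10], labels=[1, 2, 3]).astype(int)

print(age_bin.tolist())        # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(trainings_bin.tolist())  # [1, 2, 2, 3, 3]
```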

#### Store Target

In [None]:
# target = train.is_promoted

###  Nonnumerical Analyses

In [None]:
# nonnumerical.columns

In [None]:
# nonnumerical.isna().sum()

In [None]:
# nonnumerical_treated = nonnumerical.dropna()

In [None]:
train_treated["department"].value_counts(normalize=True).plot.bar()
plt.show()

In [None]:
train_treated["department"].value_counts()

In [None]:
train_treated["region"].value_counts(normalize=True).plot.bar()
plt.show()

In [None]:
train_treated["region"].value_counts().sort_index()

In [None]:
train_treated["education"].value_counts()

In [None]:
train_treated["gender"].value_counts(normalize=True).plot.bar()
plt.show()

In [None]:
train_treated["gender"].value_counts()

In [None]:
train_treated["recruitment_channel"].value_counts(normalize=True).plot.bar()
plt.show()

In [None]:
train_treated["recruitment_channel"].value_counts()

### Scaling Numericals

In [None]:
# train_treated
numerical_treated = train_treated.select_dtypes(include = np.number)

In [None]:
cols = numerical_treated.drop(columns=["age","no_of_trainings","is_promoted"]).columns

In [None]:
# cols = numerical_treated.columns 

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler(with_scaling=True)

## df_r stores the robust scaled data; the original index is kept so that
## later column-wise concatenation lines up row by row
df_r = scaler.fit_transform(numerical_treated[cols])

df_r = pd.DataFrame(df_r, columns=cols, index=numerical_treated.index)

df_r.head()

In [None]:
df_r= pd.concat([df_r,train_treated[["age","no_of_trainings"]]],axis=1)
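`pd.concat(..., axis=1)` aligns rows on index labels, not on position. A frame rebuilt from a scaler's NumPy output gets a fresh 0..n-1 index, while a frame that has been through `dropna()` keeps its original labels with gaps, so concatenating the two can silently create extra rows full of NaN. A minimal sketch on toy data showing the effect and the fix of passing `index=`; all names here are illustrative.

```python
import pandas as pd

# Frame with a gap in its index, as after dropna()
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10, 20, 30, 40]}).drop(index=1)

# Stand-in for scaler output: a plain NumPy array without any index
arr = df[["a"]].to_numpy() * 2

# Fresh 0..n-1 index: labels no longer match df, rows misalign
bad = pd.concat([pd.DataFrame(arr, columns=["a2"]), df[["b"]]], axis=1)

# Reusing the original index keeps every row paired with its own values
good = pd.concat(
    [pd.DataFrame(arr, columns=["a2"], index=df.index), df[["b"]]], axis=1
)

print(bad.shape, int(bad.isna().sum().sum()))    # (4, 2) 2
print(good.shape, int(good.isna().sum().sum()))  # (3, 2) 0
```

A `dropna()` applied afterwards hides the symptom but also discards rows whose values were perfectly valid, so checking `shape` right after each concat is a cheap safeguard.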

In [None]:
# binned_num = numerical_treated[["age","no_of_trainings"]]

In [None]:
# df_r = pd.concat([df_r,binned_num],axis=1)

### Encoding Nonnumericals

In [None]:
nonnumerical_treated = train_treated.select_dtypes(exclude = np.number)

In [None]:
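from sklearn.preprocessing import LabelEncoder

# Work on a copy so the assignments below do not touch a view of train_treated
nonnumerical_treated = nonnumerical_treated.copy()

## LabelEncoding: one encoder per column, kept so the mapping can be reused later

encoders = {}
for x in nonnumerical_treated.columns:
    encoders[x] = LabelEncoder()
    nonnumerical_treated[x] = encoders[x].fit_transform(nonnumerical_treated[x])

## Encoded categoricals

nonnumerical_treated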
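Whatever mapping is learned on the training categoricals has to be applied unchanged to the test file, including a defined behaviour for categories that never appeared in training. One way to get both, sketched here under the assumption that scikit-learn 0.24 or newer is available, is `OrdinalEncoder` with `handle_unknown="use_encoded_value"`. This is a swapped-in alternative to the `LabelEncoder` loop, and the toy frames are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns; "Legal" appears only in the test frame
train_cat = pd.DataFrame({"department": ["Sales", "HR", "Sales"], "gender": ["m", "f", "f"]})
test_cat = pd.DataFrame({"department": ["HR", "Legal"], "gender": ["f", "m"]})

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = pd.DataFrame(
    encoder.fit_transform(train_cat), columns=train_cat.columns, index=train_cat.index
)
# transform (not fit_transform) reuses the mapping learned on the training data
test_enc = pd.DataFrame(
    encoder.transform(test_cat), columns=test_cat.columns, index=test_cat.index
)

print(test_enc)
```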

#### Concat

In [None]:
# stop

In [None]:
df_treated = pd.concat([df_r, nonnumerical_treated], axis=1)

In [None]:
# df_treated

In [None]:
df_treated.shape

In [None]:
df_treated_dropped_na = df_treated.dropna()

In [None]:
df_treated.isna().sum()

In [None]:
df_treated_dropped_na.isna().sum()

In [None]:
df_treated_dropped_na.shape

In [None]:
target.shape

In [None]:
df_final = pd.concat([df_treated,target],axis=1)

In [None]:
# df_treated.shape

In [None]:
# target.dropna().shape

#### Majority Minority Split

In [None]:
df_minority = df_final[df_final["is_promoted"]==1]
df_majority = df_final[df_final["is_promoted"]==0]

In [None]:
df_majority.shape,df_minority.shape

#### Resampling

In [None]:
from sklearn.utils import resample

df_minority_upsampled = resample(df_minority, replace=True,n_samples= int(4232*2), random_state=123) 

## new minority shape

df_minority_upsampled.shape

In [None]:
## concatenating new minority and majority

df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [None]:
df_upsampled=df_upsampled.dropna()

In [None]:
# target

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score


X_train, X_test, y_train, y_test = train_test_split(df_upsampled.drop("is_promoted", axis = 1), df_upsampled.is_promoted, test_size = 0.2, random_state = 42)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
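One caveat with this order of operations: the minority class was upsampled with replacement before the split, so copies of the same promoted employee can land in both `X_train` and `X_test`, which tends to inflate the scores reported below. A minimal sketch of the alternative order (split first, then upsample only the training part) on synthetic data; the frame `toy` and its column `x` are invented, and the factor of 2 mirrors the resampling cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic frame: 10% positives
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=1000), "is_promoted": [1] * 100 + [0] * 900})

# 1) Hold out the test part first, keeping the class ratio
train_part, test_part = train_test_split(
    toy, test_size=0.2, stratify=toy["is_promoted"], random_state=42
)

# 2) Upsample the minority class inside the training part only
minority = train_part[train_part["is_promoted"] == 1]
majority = train_part[train_part["is_promoted"] == 0]
minority_up = resample(
    minority, replace=True, n_samples=2 * len(minority), random_state=123
)
train_up = pd.concat([majority, minority_up])

# No original row may appear on both sides of the split
overlap = set(train_up.index) & set(test_part.index)
print(len(train_up), len(test_part), len(overlap))
```

The test part keeps the natural class balance this way, so metrics computed on it are closer to what the unseen competition test file would give.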

## LogReg

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log= LogisticRegression(class_weight="balanced",random_state=42)

In [None]:
log.fit(X_train,y_train)
ypred_log=log.predict(X_test)
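With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`, so errors on the rarer class cost more during fitting. A small check of that formula against `compute_class_weight` on a made-up label vector with 10% positives:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 900 negatives, 100 positives
y = np.array([0] * 900 + [1] * 100)

# Weights scikit-learn derives for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same quantity computed by hand
manual = len(y) / (2 * np.bincount(y))

print(weights)  # approximately [0.556, 5.0]
```

Since the training data here has already been upsampled, the two mechanisms stack: the minority class is both duplicated and weighted, which may be more correction than intended.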

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print(classification_report(y_test, ypred_log))
print(accuracy_score(y_test, ypred_log))

In [None]:
## F1, the harmonic mean of precision and recall, is the preferred metric for this imbalanced target

from sklearn.metrics import f1_score

f1_score(y_test,ypred_log)

## DTree

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(class_weight="balanced")
dtc.fit(X_train, y_train)
y_pred_dtc = dtc.predict(X_test)
accuracy_score(y_test, y_pred_dtc)

In [None]:
dtc_cv = (cross_val_score(dtc, X_train, y_train, cv=k_fold, n_jobs=2, scoring = 'accuracy').mean())

dtc_cv
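The cross-validation above reports accuracy, although F1 is named as the preferred metric. `cross_val_score` accepts `scoring="f1"` directly, and `StratifiedKFold` keeps the class ratio stable across folds. A minimal sketch on synthetic imbalanced data; the dataset comes from `make_classification` and the five-fold setting is an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with roughly 10% positives
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)

# Same folds, two different metrics
f1_scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
acc_scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(f1_scores.mean(), acc_scores.mean())
```

On imbalanced data the accuracy column usually looks far healthier than the F1 column, which is the gap the accuracy-only cross-validation hides.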

In [None]:
f1_score(y_test,y_pred_dtc)

In [None]:
print(classification_report(y_test, y_pred_dtc))
print(accuracy_score(y_test, y_pred_dtc))


In [None]:
coef = pd.Series(dtc.feature_importances_,df_upsampled.drop('is_promoted', axis = 1).columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importances')

## RF

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 200, n_jobs=2, random_state = 12)
rfc.fit(X_train, y_train)
y_pred_rf = rfc.predict(X_test)

In [None]:

## 10-fold cross-validation with rfc: per-fold accuracies first, then their mean

rfc_scores = cross_val_score(rfc, X_train, y_train, cv=k_fold, n_jobs=2, scoring = 'accuracy')

rfc_cv = rfc_scores.mean()

rfc_cv

In [None]:
print(classification_report(y_test, y_pred_rf))
print(accuracy_score(y_test, y_pred_rf))

f1_score(y_test,y_pred_rf)

In [None]:
coef = pd.Series(rfc.feature_importances_,df_upsampled.drop('is_promoted', axis = 1).columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importances')

In [None]:
# #
# X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(df_upsampled.drop(columns=["is_promoted","awards_won?"], axis = 1), df_upsampled.is_promoted, test_size = 0.2, random_state = 42)
# k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

In [None]:
# rfc_new = RandomForestClassifier(n_estimators = 200, n_jobs=2, random_state = 12)
# rfc_new.fit(X_train_new, y_train_new)
# y_pred_rf_new = rfc_new.predict(X_test_new)

# cross_val_score(rfc_new, X_train_new, y_train_new, cv=k_fold, n_jobs=2, scoring = 'accuracy')

In [None]:
# f1_score(y_test_new,y_pred_rf_new)

In [None]:
import pickle

filename = "final.pkl"

# The context manager closes the file even if pickling fails
with open(filename, 'wb') as f:
    pickle.dump(rfc, f)