# Problem Statement
Your client is a financial distribution company. Over the last 10 years, they have created an offline distribution channel across the country. They sell financial products to consumers by hiring agents in their network. These agents are freelancers and get a commission when they make a product sale.

###### Overview of your client's onboarding process

The managers at your client are primarily responsible for recruiting agents. Once a manager has identified a potential applicant, the manager explains the business opportunity to the applicant. Once the applicant gives consent, an application is made to your client to become an agent. Within the next 3 months, this potential agent has to undergo 7 days of training at your client's branch (covering sales processes and various products) and clear a subsequent examination in order to become an agent.

###### The problem - who are the best agents?

As is evident from the above process, your client makes a significant investment in identifying, training, and recruiting these agents. However, some agents do not bring in the expected business. Your client is looking to data scientists like you to provide insights from their past recruitment data. They want to predict the target variable (`Business_Sourced`) for each potential agent, which would help them identify the right agents to hire.

###### Key Points: The evaluation metric to be used is ROC-AUC.
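
As a quick illustration of the metric, ROC-AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses made-up labels and scores, not the client's data, and `y_true` / `y_score` are illustrative names.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Positive/negative pairs: (0.35 vs 0.1) correct, (0.35 vs 0.4) wrong,
# (0.8 vs 0.1) correct, (0.8 vs 0.4) correct -> 3 of 4 pairs ranked correctly
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric is computed from predicted probabilities rather than hard class labels.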

## Solving the problem with a Random Forest classifier

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Importing the data

In [2]:
data = pd.read_csv('data.csv')

In [3]:
data.head()

Unnamed: 0,ID,Office_PIN,Applicant_City_PIN,Applicant_Gender,Applicant_Marital_Status,Applicant_Occupation,Applicant_Qualification,Manager_Joining_Designation,Manager_Current_Designation,Manager_Grade,Manager_Status,Manager_Gender,Manager_Num_Application,Manager_Num_Coded,Manager_Business,Manager_Num_Products,Manager_Business2,Manager_Num_Products2,Business_Sourced
0,FIN1000001,842001,844120,M,M,Others,Graduate,Level 1,Level 2,3.0,Confirmation,M,2.0,1.0,335249.0,28.0,335249.0,28.0,0
1,FIN1000002,842001,844111,M,S,Others,Class XII,Level 1,Level 2,3.0,Confirmation,M,2.0,1.0,335249.0,28.0,335249.0,28.0,1
2,FIN1000003,800001,844101,M,M,Business,Class XII,Level 1,Level 1,2.0,Confirmation,M,0.0,0.0,357184.0,24.0,357184.0,24.0,0
3,FIN1000004,814112,814112,M,S,Salaried,Class XII,Level 1,Level 3,4.0,Confirmation,F,0.0,0.0,318356.0,22.0,318356.0,22.0,0
4,FIN1000005,814112,815351,M,M,Others,Class XII,Level 1,Level 1,2.0,Confirmation,M,2.0,1.0,230402.0,17.0,230402.0,17.0,0


In [4]:
data.shape

(8844, 19)

In [5]:
data.isnull().sum()

ID                                0
Office_PIN                        0
Applicant_City_PIN                0
Applicant_Gender                 53
Applicant_Marital_Status         59
Applicant_Occupation           1090
Applicant_Qualification          71
Manager_Joining_Designation       0
Manager_Current_Designation       0
Manager_Grade                     0
Manager_Status                    0
Manager_Gender                    0
Manager_Num_Application           0
Manager_Num_Coded                 0
Manager_Business                  0
Manager_Num_Products              0
Manager_Business2                 0
Manager_Num_Products2             0
Business_Sourced                  0
dtype: int64

In [6]:
data.dtypes

ID                              object
Office_PIN                       int64
Applicant_City_PIN               int64
Applicant_Gender                object
Applicant_Marital_Status        object
Applicant_Occupation            object
Applicant_Qualification         object
Manager_Joining_Designation     object
Manager_Current_Designation     object
Manager_Grade                  float64
Manager_Status                  object
Manager_Gender                  object
Manager_Num_Application        float64
Manager_Num_Coded              float64
Manager_Business               float64
Manager_Num_Products           float64
Manager_Business2              float64
Manager_Num_Products2          float64
Business_Sourced                 int64
dtype: object

### Exploring the data and cleaning it

In [7]:
cols=['Applicant_Gender','Applicant_Marital_Status','Applicant_Occupation','Applicant_Qualification']
for i in cols:
    print("******"+i+"**********")
    print(data[i].value_counts())

******Applicant_Gender**********
M    6656
F    2135
Name: Applicant_Gender, dtype: int64
******Applicant_Marital_Status**********
M    5733
S    3042
W       6
D       4
Name: Applicant_Marital_Status, dtype: int64
******Applicant_Occupation**********
Salaried         3546
Business         2157
Others           1809
Self Employed     146
Student            96
Name: Applicant_Occupation, dtype: int64
******Applicant_Qualification**********
Class XII                                                           5426
Graduate                                                            2958
Class X                                                              195
Others                                                               116
Masters of Business Administration                                    71
Associate / Fellow of Institute of Chartered Accountans of India       3
Professional Qualification in Marketing                                1
Associate/Fellow of Institute of Company Secr

In [8]:
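# Fill missing values with the most frequent category (assignment avoids chained in-place modification)
data['Applicant_Gender'] = data['Applicant_Gender'].fillna(data['Applicant_Gender'].mode()[0])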

In [9]:
# Fill missing values with the most frequent category
data['Applicant_Marital_Status'] = data['Applicant_Marital_Status'].fillna(data['Applicant_Marital_Status'].mode()[0])

In [10]:
# Fill missing values with the most frequent category
data['Applicant_Occupation'] = data['Applicant_Occupation'].fillna(data['Applicant_Occupation'].mode()[0])

In [11]:
# Fill missing values with the most frequent category
data['Applicant_Qualification'] = data['Applicant_Qualification'].fillna(data['Applicant_Qualification'].mode()[0])

In [12]:
data.isnull().sum()

ID                             0
Office_PIN                     0
Applicant_City_PIN             0
Applicant_Gender               0
Applicant_Marital_Status       0
Applicant_Occupation           0
Applicant_Qualification        0
Manager_Joining_Designation    0
Manager_Current_Designation    0
Manager_Grade                  0
Manager_Status                 0
Manager_Gender                 0
Manager_Num_Application        0
Manager_Num_Coded              0
Manager_Business               0
Manager_Num_Products           0
Manager_Business2              0
Manager_Num_Products2          0
Business_Sourced               0
dtype: int64
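
The four imputation cells above repeat the same pattern, so they could also be written as one loop. This is a minimal sketch on a toy frame; `toy` and `categorical_cols` are hypothetical names and the values are illustrative, not taken from `data.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two of the applicant columns (illustrative values)
toy = pd.DataFrame({
    "Applicant_Gender": ["M", "F", np.nan, "M"],
    "Applicant_Occupation": ["Salaried", np.nan, "Salaried", "Business"],
})

categorical_cols = ["Applicant_Gender", "Applicant_Occupation"]
for col in categorical_cols:
    # mode() returns a Series of the most frequent values; take the first one
    most_frequent = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy.isnull().sum())
```

Mode imputation is simple, but for a column such as `Applicant_Occupation`, where over 1,000 values are missing, it shifts a noticeable share of rows into the majority category; a separate "Missing" category is one alternative worth comparing.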

In [13]:
data.nunique()

ID                             8844
Office_PIN                       98
Applicant_City_PIN             2858
Applicant_Gender                  2
Applicant_Marital_Status          4
Applicant_Occupation              5
Applicant_Qualification          10
Manager_Joining_Designation       8
Manager_Current_Designation       5
Manager_Grade                    10
Manager_Status                    2
Manager_Gender                    2
Manager_Num_Application          17
Manager_Num_Coded                10
Manager_Business               3747
Manager_Num_Products             57
Manager_Business2              3743
Manager_Num_Products2            57
Business_Sourced                  2
dtype: int64

Categorical columns: 'Applicant_Gender', 'Applicant_Marital_Status', 'Applicant_Occupation', 'Applicant_Qualification', 'Manager_Joining_Designation', 'Manager_Current_Designation', 'Manager_Status', 'Manager_Gender'

Identifier and numerical columns: 'ID', 'Office_PIN', 'Applicant_City_PIN', 'Manager_Num_Application', 'Manager_Num_Coded', 'Manager_Business', 'Manager_Num_Products', 'Manager_Business2', 'Manager_Num_Products2'

In [14]:
data = pd.get_dummies(data)

In [15]:
data.head()

Unnamed: 0,Office_PIN,Applicant_City_PIN,Manager_Grade,Manager_Num_Application,Manager_Num_Coded,Manager_Business,Manager_Num_Products,Manager_Business2,Manager_Num_Products2,Business_Sourced,...,Manager_Joining_Designation_Other,Manager_Current_Designation_Level 1,Manager_Current_Designation_Level 2,Manager_Current_Designation_Level 3,Manager_Current_Designation_Level 4,Manager_Current_Designation_Level 5,Manager_Status_Confirmation,Manager_Status_Probation,Manager_Gender_F,Manager_Gender_M
0,842001,844120,3.0,2.0,1.0,335249.0,28.0,335249.0,28.0,0,...,0,0,1,0,0,0,1,0,0,1
1,842001,844111,3.0,2.0,1.0,335249.0,28.0,335249.0,28.0,1,...,0,0,1,0,0,0,1,0,0,1
2,800001,844101,2.0,0.0,0.0,357184.0,24.0,357184.0,24.0,0,...,0,1,0,0,0,0,1,0,0,1
3,814112,814112,4.0,0.0,0.0,318356.0,22.0,318356.0,22.0,0,...,0,0,0,1,0,0,1,0,1,0
4,814112,815351,2.0,2.0,1.0,230402.0,17.0,230402.0,17.0,0,...,0,1,0,0,0,0,1,0,0,1
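
One thing to watch: `pd.get_dummies` encodes every object-typed column, and `data.dtypes` lists `ID` as `object`. Applied to the full frame, that would presumably turn each unique ID into its own indicator column (hidden behind the `...` in the preview). The sketch below shows the effect on a toy frame and one possible way to avoid it by dropping the identifier first; the toy values and variable names are assumptions for illustration, and doing this on the real data would change the scores reported later.

```python
import pandas as pd

# Toy frame with an identifier, one categorical and one numeric column
toy = pd.DataFrame({
    "ID": ["FIN1", "FIN2", "FIN3"],
    "Applicant_Gender": ["M", "F", "M"],
    "Manager_Grade": [3.0, 2.0, 4.0],
})

# Encoding everything: each distinct ID becomes its own column
encoded_with_id = pd.get_dummies(toy)

# Dropping the identifier first keeps only meaningful features
encoded_without_id = pd.get_dummies(toy.drop(columns=["ID"]))

print(encoded_with_id.shape[1])     # 6
print(encoded_without_id.shape[1])  # 3
```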


### Separating independent and dependent variables

In [16]:
x = data.drop(['Business_Sourced'],axis=1)
y = data['Business_Sourced']

### Creating the train and test dataset

In [17]:
#import the train-test split
from sklearn.model_selection import train_test_split

In [18]:
#divide into train and test sets
train_x,test_x,train_y,test_y = train_test_split(x,y, random_state = 101, stratify=y)
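
To make the effect of the split concrete: with no `test_size` given, `train_test_split` holds out 25% of the rows, and `stratify=y` keeps the class proportions roughly equal in both parts. A small sketch on a synthetic target (the 70/30 class ratio is made up, not the real ratio in `Business_Sourced`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 70 zeros and 30 ones (illustrative only)
y_toy = np.array([0] * 70 + [1] * 30)
x_toy = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, random_state=101, stratify=y_toy
)

print(len(y_tr), len(y_te))      # 75 25
print(y_tr.mean(), y_te.mean())  # both close to the overall positive rate of 0.3
```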

### Building the Random Forest model

In [19]:
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier

In [20]:
rf = RandomForestClassifier(n_jobs=-1,max_depth=80,n_estimators=400,criterion='entropy')


rf.fit(train_x,train_y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=80, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=400,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [21]:
pred_train = rf.predict_proba(train_x)

In [22]:
pred_test = rf.predict_proba(test_x)

In [23]:
train_score = roc_auc_score(train_y,pred_train[:,1])
print("Train set score : ", train_score )

Train set score :  0.9999881940772612


In [24]:
test_score = roc_auc_score(test_y,pred_test[:,1])
print("Test set score : ", test_score )

Test set score :  0.6127791286157109
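
The near-perfect AUC on the training data against roughly 0.61 on the held-out data suggests the forest is memorising the training rows, which is plausible with `max_depth=80` and `min_samples_leaf=1`. One possible next step, not part of the original analysis, is to constrain the trees and estimate AUC with cross-validation. The sketch below runs on synthetic data from `make_classification`; the parameter values and names such as `rf_small` are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded feature matrix and target
x_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)

# Shallower trees with larger leaves are less able to memorise single rows
rf_small = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_leaf=5, random_state=0
)

# 5-fold cross-validated ROC-AUC; each fold is scored on data it was not fit on
cv_auc = cross_val_score(rf_small, x_syn, y_syn, cv=5, scoring="roc_auc")
print(cv_auc.mean(), cv_auc.std())
```

Comparing the cross-validated mean across a few settings of `max_depth` and `min_samples_leaf` gives a more reliable guide than the training score when choosing hyperparameters.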
