# Problem Statement
Your client is a financial distribution company. Over the last 10 years, they have created an offline distribution channel across the country. They sell financial products to consumers by hiring agents in their network. These agents are freelancers and get a commission when they make a product sale.

##### Overview of your client's onboarding process

The managers at your client are primarily responsible for recruiting agents. Once a manager has identified a potential applicant, the manager explains the business opportunity to the applicant. Once the applicant consents, an application is made to your client to become an agent. Within the next 3 months, the potential agent has to undergo 7 days of training at your client's branch (covering sales processes and the various products) and pass a subsequent examination in order to become an agent.

##### The problem - who are the best agents?

As the process above makes clear, your client invests significantly in identifying, training, and recruiting these agents. However, some agents do not bring in the expected business. Your client is looking to data scientists like you to provide insights from their past recruitment data. They want to predict the target variable for each potential agent, which would help them identify the right agents to hire.

(Predict "Business_Sourced")

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Importing data

In [2]:
data=pd.read_csv('data.csv')
data.head()

Unnamed: 0,ID,Office_PIN,Applicant_City_PIN,Applicant_Gender,Applicant_Marital_Status,Applicant_Occupation,Applicant_Qualification,Manager_Joining_Designation,Manager_Current_Designation,Manager_Grade,Manager_Status,Manager_Gender,Manager_Num_Application,Manager_Num_Coded,Manager_Business,Manager_Num_Products,Manager_Business2,Manager_Num_Products2,Business_Sourced
0,FIN1000001,842001,844120,M,M,Others,Graduate,Level 1,Level 2,3.0,Confirmation,M,2.0,1.0,335249.0,28.0,335249.0,28.0,0
1,FIN1000002,842001,844111,M,S,Others,Class XII,Level 1,Level 2,3.0,Confirmation,M,2.0,1.0,335249.0,28.0,335249.0,28.0,1
2,FIN1000003,800001,844101,M,M,Business,Class XII,Level 1,Level 1,2.0,Confirmation,M,0.0,0.0,357184.0,24.0,357184.0,24.0,0
3,FIN1000004,814112,814112,M,S,Salaried,Class XII,Level 1,Level 3,4.0,Confirmation,F,0.0,0.0,318356.0,22.0,318356.0,22.0,0
4,FIN1000005,814112,815351,M,M,Others,Class XII,Level 1,Level 1,2.0,Confirmation,M,2.0,1.0,230402.0,17.0,230402.0,17.0,0


### Data Preprocessing

In [3]:
data.describe(include='all')

Unnamed: 0,ID,Office_PIN,Applicant_City_PIN,Applicant_Gender,Applicant_Marital_Status,Applicant_Occupation,Applicant_Qualification,Manager_Joining_Designation,Manager_Current_Designation,Manager_Grade,Manager_Status,Manager_Gender,Manager_Num_Application,Manager_Num_Coded,Manager_Business,Manager_Num_Products,Manager_Business2,Manager_Num_Products2,Business_Sourced
count,8844,8844.0,8844.0,8791,8785,7754,8773,8844,8844,8844.0,8844,8844,8844.0,8844.0,8844.0,8844.0,8844.0,8844.0,8844.0
unique,8844,,,2,4,5,10,8,5,,2,2,,,,,,,
top,FIN1006978,,,M,M,Salaried,Class XII,Level 1,Level 2,,Confirmation,M,,,,,,,
freq,1,,,6656,5733,3546,5426,4632,3208,,5277,7627,,,,,,,
mean,,450714.378562,452638.591022,,,,,,,3.264134,,,1.939733,0.758933,184371.0,7.152307,182926.3,7.131275,0.342718
std,,234079.460837,238045.727919,,,,,,,1.137449,,,2.150529,1.188644,274716.3,8.439351,271802.1,8.423597,0.474645
min,,110005.0,110001.0,,,,,,,1.0,,,0.0,0.0,-265289.0,0.0,-265289.0,0.0,0.0
25%,,226001.0,226002.0,,,,,,,2.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,,416001.0,422001.0,,,,,,,3.0,,,1.0,0.0,102178.0,5.0,101714.0,5.0,0.0
75%,,695014.0,695009.0,,,,,,,4.0,,,3.0,1.0,247116.5,11.0,246461.2,11.0,1.0


In [4]:
data.dtypes

ID                              object
Office_PIN                       int64
Applicant_City_PIN               int64
Applicant_Gender                object
Applicant_Marital_Status        object
Applicant_Occupation            object
Applicant_Qualification         object
Manager_Joining_Designation     object
Manager_Current_Designation     object
Manager_Grade                  float64
Manager_Status                  object
Manager_Gender                  object
Manager_Num_Application        float64
Manager_Num_Coded              float64
Manager_Business               float64
Manager_Num_Products           float64
Manager_Business2              float64
Manager_Num_Products2          float64
Business_Sourced                 int64
dtype: object

In [5]:
data.nunique()

ID                             8844
Office_PIN                       98
Applicant_City_PIN             2858
Applicant_Gender                  2
Applicant_Marital_Status          4
Applicant_Occupation              5
Applicant_Qualification          10
Manager_Joining_Designation       8
Manager_Current_Designation       5
Manager_Grade                    10
Manager_Status                    2
Manager_Gender                    2
Manager_Num_Application          17
Manager_Num_Coded                10
Manager_Business               3747
Manager_Num_Products             57
Manager_Business2              3743
Manager_Num_Products2            57
Business_Sourced                  2
dtype: int64

In [6]:
data.isnull().sum()

ID                                0
Office_PIN                        0
Applicant_City_PIN                0
Applicant_Gender                 53
Applicant_Marital_Status         59
Applicant_Occupation           1090
Applicant_Qualification          71
Manager_Joining_Designation       0
Manager_Current_Designation       0
Manager_Grade                     0
Manager_Status                    0
Manager_Gender                    0
Manager_Num_Application           0
Manager_Num_Coded                 0
Manager_Business                  0
Manager_Num_Products              0
Manager_Business2                 0
Manager_Num_Products2             0
Business_Sourced                  0
dtype: int64

In [7]:
data.shape

(8844, 19)

In [8]:
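# Note: drop_duplicates() returns a new DataFrame and the result is not assigned,
# so `data` itself is unchanged. Every ID is unique, so no rows are dropped anyway.
data.drop_duplicates()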

Unnamed: 0,ID,Office_PIN,Applicant_City_PIN,Applicant_Gender,Applicant_Marital_Status,Applicant_Occupation,Applicant_Qualification,Manager_Joining_Designation,Manager_Current_Designation,Manager_Grade,Manager_Status,Manager_Gender,Manager_Num_Application,Manager_Num_Coded,Manager_Business,Manager_Num_Products,Manager_Business2,Manager_Num_Products2,Business_Sourced
0,FIN1000001,842001,844120,M,M,Others,Graduate,Level 1,Level 2,3.0,Confirmation,M,2.0,1.0,335249.0,28.0,335249.0,28.0,0
1,FIN1000002,842001,844111,M,S,Others,Class XII,Level 1,Level 2,3.0,Confirmation,M,2.0,1.0,335249.0,28.0,335249.0,28.0,1
2,FIN1000003,800001,844101,M,M,Business,Class XII,Level 1,Level 1,2.0,Confirmation,M,0.0,0.0,357184.0,24.0,357184.0,24.0,0
3,FIN1000004,814112,814112,M,S,Salaried,Class XII,Level 1,Level 3,4.0,Confirmation,F,0.0,0.0,318356.0,22.0,318356.0,22.0,0
4,FIN1000005,814112,815351,M,M,Others,Class XII,Level 1,Level 1,2.0,Confirmation,M,2.0,1.0,230402.0,17.0,230402.0,17.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8839,FIN1009520,250001,250004,F,M,,Graduate,Level 1,Level 2,3.0,Confirmation,M,1.0,1.0,55000.0,2.0,55000.0,2.0,0
8840,FIN1009522,814112,816118,M,M,,Class XII,Level 1,Level 1,2.0,Confirmation,M,4.0,2.0,418339.0,13.0,418339.0,13.0,0
8841,FIN1009523,160017,160032,M,M,Salaried,Graduate,Level 2,Level 2,3.0,Probation,M,0.0,0.0,0.0,0.0,0.0,0.0,0
8842,FIN1009525,753012,753014,F,M,Salaried,Graduate,Level 2,Level 2,3.0,Confirmation,M,0.0,0.0,316126.0,9.0,305775.0,8.0,0
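A minimal sketch of how `drop_duplicates` behaves, using a tiny made-up frame (the values and the `toy` name are illustrative assumptions, not the client's data). It shows that the method returns a new frame and that `duplicated()` can be used to count repeated rows first:

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values, not the client's data).
toy = pd.DataFrame({"ID": ["A1", "A1", "B2"], "Business_Sourced": [0, 0, 1]})

# drop_duplicates returns a new DataFrame; `toy` itself is left untouched.
deduped = toy.drop_duplicates()

# duplicated() flags every repeat of an earlier row, so its sum is the
# number of rows that drop_duplicates would remove.
n_duplicates = int(toy.duplicated().sum())

print(len(toy), len(deduped), n_duplicates)
```

To make the removal stick, the result has to be assigned back, for example `toy = toy.drop_duplicates()`.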


In [9]:
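# Fill missing values with the most frequent category.
# Assigning the result back avoids chained in-place modification of a column.
data['Applicant_Gender'] = data['Applicant_Gender'].fillna(data['Applicant_Gender'].mode()[0])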

In [10]:
data['Applicant_Marital_Status'].value_counts()

M    5733
S    3042
W       6
D       4
Name: Applicant_Marital_Status, dtype: int64

In [11]:
# Fill missing values with the most frequent category ('M').
data['Applicant_Marital_Status'] = data['Applicant_Marital_Status'].fillna(data['Applicant_Marital_Status'].mode()[0])

In [12]:
data['Applicant_Occupation'].value_counts()

Salaried         3546
Business         2157
Others           1809
Self Employed     146
Student            96
Name: Applicant_Occupation, dtype: int64

In [13]:
# Fill missing values with the most frequent category ('Salaried').
data['Applicant_Occupation'] = data['Applicant_Occupation'].fillna(data['Applicant_Occupation'].mode()[0])

In [14]:
data['Applicant_Qualification'].value_counts()

Class XII                                                           5426
Graduate                                                            2958
Class X                                                              195
Others                                                               116
Masters of Business Administration                                    71
Associate / Fellow of Institute of Chartered Accountans of India       3
Associate/Fellow of Insurance Institute of India                       1
Professional Qualification in Marketing                                1
Associate/Fellow of Institute of Company Secretories of India          1
Associate/Fellow of Acturial Society of India                          1
Name: Applicant_Qualification, dtype: int64

In [15]:
# Fill missing values with the most frequent category ('Class XII').
data['Applicant_Qualification'] = data['Applicant_Qualification'].fillna(data['Applicant_Qualification'].mode()[0])

In [16]:
data['Manager_Joining_Designation'].value_counts()

Level 1    4632
Level 2    2787
Level 3    1146
Level 4     200
Other        58
Level 6      18
Level 7       2
Level 5       1
Name: Manager_Joining_Designation, dtype: int64

In [17]:
data['Manager_Current_Designation'].value_counts()

Level 2    3208
Level 1    2479
Level 3    2033
Level 4    1031
Level 5      93
Name: Manager_Current_Designation, dtype: int64

In [18]:
data['Manager_Status'].value_counts()

Confirmation    5277
Probation       3567
Name: Manager_Status, dtype: int64

In [19]:
data.isnull().sum()

ID                             0
Office_PIN                     0
Applicant_City_PIN             0
Applicant_Gender               0
Applicant_Marital_Status       0
Applicant_Occupation           0
Applicant_Qualification        0
Manager_Joining_Designation    0
Manager_Current_Designation    0
Manager_Grade                  0
Manager_Status                 0
Manager_Gender                 0
Manager_Num_Application        0
Manager_Num_Coded              0
Manager_Business               0
Manager_Num_Products           0
Manager_Business2              0
Manager_Num_Products2          0
Business_Sourced               0
dtype: int64
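The four mode imputations above follow the same pattern, so they could be written as one step. The sketch below is one possible way to do that on a small made-up frame; the `toy` frame and the explicit `cat_cols` list are assumptions for illustration only:

```python
import pandas as pd

# Small illustrative frame with gaps in two categorical columns.
toy = pd.DataFrame({
    "Applicant_Gender": ["M", "F", None, "M"],
    "Applicant_Occupation": ["Salaried", None, "Salaried", "Business"],
    "Manager_Grade": [3.0, 2.0, 4.0, 3.0],
})

# Columns to impute, listed explicitly for clarity.
cat_cols = ["Applicant_Gender", "Applicant_Occupation"]

# mode() returns one row per tied mode; the first row holds the most
# frequent value of each column.
modes = toy[cat_cols].mode().iloc[0]

# fillna with a Series fills each column with the value under its own name.
toy[cat_cols] = toy[cat_cols].fillna(modes)

print(toy.isnull().sum())
```

Mode imputation is simple, but for a column with many gaps (such as `Applicant_Occupation` here) it inflates the most frequent category, so a separate "missing" category may be worth comparing.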

In [20]:
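# Make the data usable by scikit-learn: drop the ID column (a unique identifier
# with no predictive value) and one-hot encode the categorical columns.

data=pd.get_dummies(data.drop(['ID'],axis=1))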

In [21]:
data.head()

Unnamed: 0,Office_PIN,Applicant_City_PIN,Manager_Grade,Manager_Num_Application,Manager_Num_Coded,Manager_Business,Manager_Num_Products,Manager_Business2,Manager_Num_Products2,Business_Sourced,...,Manager_Joining_Designation_Other,Manager_Current_Designation_Level 1,Manager_Current_Designation_Level 2,Manager_Current_Designation_Level 3,Manager_Current_Designation_Level 4,Manager_Current_Designation_Level 5,Manager_Status_Confirmation,Manager_Status_Probation,Manager_Gender_F,Manager_Gender_M
0,842001,844120,3.0,2.0,1.0,335249.0,28.0,335249.0,28.0,0,...,0,0,1,0,0,0,1,0,0,1
1,842001,844111,3.0,2.0,1.0,335249.0,28.0,335249.0,28.0,1,...,0,0,1,0,0,0,1,0,0,1
2,800001,844101,2.0,0.0,0.0,357184.0,24.0,357184.0,24.0,0,...,0,1,0,0,0,0,1,0,0,1
3,814112,814112,4.0,0.0,0.0,318356.0,22.0,318356.0,22.0,0,...,0,0,0,1,0,0,1,0,1,0
4,814112,815351,2.0,2.0,1.0,230402.0,17.0,230402.0,17.0,0,...,0,1,0,0,0,0,1,0,0,1


In [22]:
data.dtypes

Office_PIN                                                                                    int64
Applicant_City_PIN                                                                            int64
Manager_Grade                                                                               float64
Manager_Num_Application                                                                     float64
Manager_Num_Coded                                                                           float64
Manager_Business                                                                            float64
Manager_Num_Products                                                                        float64
Manager_Business2                                                                           float64
Manager_Num_Products2                                                                       float64
Business_Sourced                                                                              int64
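As a small illustration of what `pd.get_dummies` does, the sketch below encodes a made-up three-row frame (the `toy` frame is an assumption for illustration). Text columns become one indicator column per category, while numeric columns such as the PIN codes pass through unchanged and are therefore treated as ordinary numbers by the model:

```python
import pandas as pd

# Illustrative frame: one numeric column and one categorical column.
toy = pd.DataFrame({
    "Office_PIN": [842001, 800001, 814112],
    "Manager_Status": ["Confirmation", "Probation", "Confirmation"],
})

# dtype=int requests 0/1 integers rather than booleans for the indicators.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded["Manager_Status_Probation"].tolist())
```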


In [23]:
# index=False keeps the row index from being written as an extra unnamed column
data.to_csv('data_cleaned_sales_target.csv', index=False)

### Segregating dependent and independent data

In [24]:
x = data.drop(['Business_Sourced'],axis=1)
y = data['Business_Sourced']
x.shape, y.shape

((8844, 47), (8844,))

### Splitting the data into train set and the test set

In [25]:
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x,y, random_state = 56)
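The split above uses the default 75/25 proportions. Because only about 34% of the applicants have `Business_Sourced` equal to 1, a stratified split may be worth considering so that both sets keep the same class balance. A minimal sketch on synthetic data (the arrays `X` and `y` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (roughly 34% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.34).astype(int)

# stratify=y keeps the share of positives nearly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=56
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```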

### Normalising using MinMaxScaler

In [26]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [27]:
cols = train_x.columns
cols

Index(['Office_PIN', 'Applicant_City_PIN', 'Manager_Grade',
       'Manager_Num_Application', 'Manager_Num_Coded', 'Manager_Business',
       'Manager_Num_Products', 'Manager_Business2', 'Manager_Num_Products2',
       'Applicant_Gender_F', 'Applicant_Gender_M',
       'Applicant_Marital_Status_D', 'Applicant_Marital_Status_M',
       'Applicant_Marital_Status_S', 'Applicant_Marital_Status_W',
       'Applicant_Occupation_Business', 'Applicant_Occupation_Others',
       'Applicant_Occupation_Salaried', 'Applicant_Occupation_Self Employed',
       'Applicant_Occupation_Student',
       'Applicant_Qualification_Associate / Fellow of Institute of Chartered Accountans of India',
       'Applicant_Qualification_Associate/Fellow of Acturial Society of India',
       'Applicant_Qualification_Associate/Fellow of Institute of Company Secretories of India',
       'Applicant_Qualification_Associate/Fellow of Insurance Institute of India',
       'Applicant_Qualification_Class X', 'Applicant_Qual

In [28]:
train_x_scaled = scaler.fit_transform(train_x)
train_x_scaled = pd.DataFrame(train_x_scaled, columns=cols)
train_x_scaled.head()

Unnamed: 0,Office_PIN,Applicant_City_PIN,Manager_Grade,Manager_Num_Application,Manager_Num_Coded,Manager_Business,Manager_Num_Products,Manager_Business2,Manager_Num_Products2,Applicant_Gender_F,...,Manager_Joining_Designation_Other,Manager_Current_Designation_Level 1,Manager_Current_Designation_Level 2,Manager_Current_Designation_Level 3,Manager_Current_Designation_Level 4,Manager_Current_Designation_Level 5,Manager_Status_Confirmation,Manager_Status_Probation,Manager_Gender_F,Manager_Gender_M
0,0.391407,0.328124,0.222222,0.125,0.111111,0.201466,0.118812,0.201466,0.118812,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.410201,0.103878,0.222222,0.0,0.0,0.116373,0.049505,0.116373,0.049505,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.450678,0.377691,0.111111,0.4375,0.333333,0.111745,0.049505,0.111745,0.049505,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,0.219939,0.184057,0.111111,0.25,0.222222,0.086761,0.0,0.086761,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,0.451488,0.378034,0.333333,0.1875,0.0,0.228123,0.237624,0.228123,0.237624,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


In [29]:
test_x_scaled = scaler.transform(test_x)
test_x_scaled = pd.DataFrame(test_x_scaled, columns=cols)
test_x_scaled.head()

Unnamed: 0,Office_PIN,Applicant_City_PIN,Manager_Grade,Manager_Num_Application,Manager_Num_Coded,Manager_Business,Manager_Num_Products,Manager_Business2,Manager_Num_Products2,Applicant_Gender_F,...,Manager_Joining_Designation_Other,Manager_Current_Designation_Level 1,Manager_Current_Designation_Level 2,Manager_Current_Designation_Level 3,Manager_Current_Designation_Level 4,Manager_Current_Designation_Level 5,Manager_Status_Confirmation,Manager_Status_Probation,Manager_Gender_F,Manager_Gender_M
0,0.814062,0.681193,0.111111,0.0625,0.0,0.086761,0.0,0.086761,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.153984,0.128855,0.111111,0.0,0.0,0.133207,0.128713,0.133207,0.128713,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.47497,0.397469,0.333333,0.0,0.0,0.086761,0.0,0.086761,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
3,0.451488,0.377802,0.333333,0.375,0.111111,0.10144,0.029703,0.10144,0.029703,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
4,0.976926,0.817473,0.222222,0.1875,0.111111,0.126928,0.059406,0.126928,0.059406,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
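The scaler is fitted on the training set only and then reused on the test set, which keeps test information out of the preprocessing. One consequence, sketched below with made-up numbers, is that test values outside the training range map outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative one-feature data: training range is 0 to 10.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[12.0], [-2.0]])

# Learn min and max from the training data only.
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)  # stays within [0, 1]
test_scaled = scaler.transform(test)    # may fall outside [0, 1]

print(train_scaled.ravel())
print(test_scaled.ravel())
```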


### Implementing Logistic Regression

In [30]:
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.metrics import roc_auc_score

In [31]:
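# Creating an instance of Logistic Regression
logreg = LogReg()

# Fitting the model
# Note: the model is fitted on the unscaled train_x, so the scaled frames
# (train_x_scaled, test_x_scaled) created above are not used from here on.
logreg.fit(train_x,train_y)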

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
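The cell above passes `train_x` rather than `train_x_scaled` to `fit`. One way to make sure the scaling step is always applied is to chain the scaler and the model in a single pipeline. The sketch below shows the idea on synthetic data; the arrays and the `max_iter=1000` setting are assumptions for illustration, and it makes no claim about the score this would give on the client's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: the first feature has a very large scale (like a PIN code
# or a business amount) and carries no signal; the second drives the label.
rng = np.random.default_rng(56)
X = rng.normal(size=(400, 2)) * np.array([1e5, 1.0])
y = (X[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(int)

# The pipeline scales the features and then fits the classifier, so the
# same scaling is applied automatically at prediction time.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

proba = model.predict_proba(X)[:, 1]  # probability of class 1
accuracy = model.score(X, y)
print(round(accuracy, 3))
```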

### Making predictions using *predict_proba* function

In [32]:
# Predicting the probabilities of 0 and 1 respectively for the dependent variable, i.e. Business_Sourced

train_pred = logreg.predict_proba(train_x)
train_pred

array([[0.60582573, 0.39417427],
       [0.59530645, 0.40469355],
       [0.6160142 , 0.3839858 ],
       ...,
       [0.62024883, 0.37975117],
       [0.70595361, 0.29404639],
       [0.61598441, 0.38401559]])

In [33]:
# Keeping only the second column, i.e. the probability that the dependent variable is 1

train_pred = train_pred[:,1]
train_pred

array([0.39417427, 0.40469355, 0.3839858 , ..., 0.37975117, 0.29404639,
       0.38401559])
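The column order of `predict_proba` follows the classifier's `classes_` attribute, so index 1 corresponds to class 1 only because the labels are 0 and 1. A small sketch on made-up data that looks the column up explicitly instead of assuming its position:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny illustrative data set with labels 0 and 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# classes_ lists the labels in the same order as the probability columns.
pos_col = list(clf.classes_).index(1)
p1 = proba[:, pos_col]

print(clf.classes_, pos_col)
```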

In [34]:
test_pred = logreg.predict_proba(test_x)
test_pred

array([[0.68080083, 0.31919917],
       [0.55970072, 0.44029928],
       [0.62024883, 0.37975117],
       ...,
       [0.62377485, 0.37622515],
       [0.57465751, 0.42534249],
       [0.67310436, 0.32689564]])

In [35]:
test_pred = test_pred[:,1]
test_pred

array([0.31919917, 0.44029928, 0.37975117, ..., 0.37622515, 0.42534249,
       0.32689564])

### Evaluating the model using the AUC-ROC metric

In [36]:
# roc_auc_score(y_true, y_scores)
print('Training score : ', roc_auc_score(train_y, train_pred))

Training score :  0.4711249741538865


In [37]:
print('Testing score : ',roc_auc_score(test_y, test_pred))

Testing score :  0.4791932580592374
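Both scores are slightly below 0.5, which is the AUC of random guessing, so the fitted model does not rank successful agents above unsuccessful ones. The sketch below uses random scores (not the notebook's predictions) to illustrate two properties of the metric: uninformative scores land near 0.5, and reversing the ranking turns an AUC of a into 1 - a.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Random labels and random scores: the scores carry no information.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = rng.random(500)

auc = roc_auc_score(y_true, scores)

# Flipping the scores reverses the ranking, which mirrors the AUC around 0.5.
auc_flipped = roc_auc_score(y_true, 1.0 - scores)

print(round(auc, 3), round(auc_flipped, 3))
```

An AUC this close to 0.5 is better read as "no signal learned" than as a reason to invert the predictions.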
