# Problem Statement
Your client is a financial distribution company. Over the last 10 years, they have created an offline distribution channel across the country. They sell financial products to consumers by hiring agents in their network. These agents are freelancers and get a commission when they make a product sale.

##### Overview of your client's onboarding process

The managers at your client are primarily responsible for recruiting agents. Once a manager has identified a potential applicant, the manager explains the business opportunity to the applicant. Once the applicant consents, an application is made to your client to become an agent. Over the next 3 months, this potential agent has to undergo 7 days of training at your client's branch (covering sales processes and various products) and clear a subsequent examination in order to become an agent.

##### The problem - who are the best agents?

As is evident from the above process, your client makes a significant investment in identifying, training, and recruiting these agents. However, there is a set of agents who do not bring in the expected business. Your client is looking to data scientists like you to provide insights from their past recruitment data. They want to predict the target variable for each potential agent, which would help them identify the right agents to hire.

(Predict "Business_Sourced")

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Importing data

In [None]:
data=pd.read_csv('data.csv')
data.head()

### Data Preprocessing

In [None]:
data.describe(include='all')

In [None]:
data.dtypes

In [None]:
data.nunique()

In [None]:
data.isnull().sum()

In [None]:
data.shape

In [None]:
# drop_duplicates returns a new DataFrame, so assign the result back
data = data.drop_duplicates()

In [None]:
data['Applicant_Gender'] = data['Applicant_Gender'].fillna(data['Applicant_Gender'].mode()[0])

In [None]:
data['Applicant_Marital_Status'].value_counts()

In [None]:
data['Applicant_Marital_Status'] = data['Applicant_Marital_Status'].fillna(data['Applicant_Marital_Status'].mode()[0])

In [None]:
data['Applicant_Occupation'].value_counts()

In [None]:
data['Applicant_Occupation'] = data['Applicant_Occupation'].fillna(data['Applicant_Occupation'].mode()[0])

In [None]:
data['Applicant_Qualification'].value_counts()

In [None]:
data['Applicant_Qualification'] = data['Applicant_Qualification'].fillna(data['Applicant_Qualification'].mode()[0])

In [None]:
data['Manager_Joining_Designation'].value_counts()

In [None]:
data['Manager_Current_Designation'].value_counts()

In [None]:
data['Manager_Status'].value_counts()

In [None]:
data.isnull().sum()
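
The mode imputation applied in the cells above can be sketched on a toy frame. The values here are made up for illustration; only the column names mirror the notebook. Assigning the result of `fillna` back to the column avoids relying on in-place modification of a column selection, which newer pandas versions may not propagate to the parent frame.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Applicant_Gender": ["M", "F", "M", None],
    "Applicant_Occupation": ["Salaried", None, "Business", "Salaried"],
})

# Fill each categorical column with its most frequent value (the mode).
for col in ["Applicant_Gender", "Applicant_Occupation"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Total number of missing values remaining after imputation.
print(toy.isnull().sum().sum())
```

Mode imputation is simple, but it inflates the share of the most common category; whether that is acceptable depends on how many values are missing in each column.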

In [None]:
# One-hot encode the categorical columns (after dropping ID) so scikit-learn can use the data

data=pd.get_dummies(data.drop(['ID'],axis=1))
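
As a rough illustration of what `pd.get_dummies` does in the cell above, the sketch below encodes a three-row toy frame. The rows are invented; the point is that numeric columns pass through unchanged while each categorical column becomes one indicator column per category.

```python
import pandas as pd

# Hypothetical rows; ID is an identifier, not a feature.
toy = pd.DataFrame({
    "ID": ["A1", "A2", "A3"],
    "Applicant_Gender": ["M", "F", "M"],
    "Business_Sourced": [0, 1, 0],
})

# Drop the identifier first, otherwise it would become one column per unique ID.
encoded = pd.get_dummies(toy.drop(["ID"], axis=1))

print(list(encoded.columns))
print(encoded.shape)
```

Columns with many distinct values (for example city codes or dates stored as strings) can produce a very wide frame this way, so it may be worth checking `data.nunique()` before encoding.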

In [None]:
data.head()

In [None]:
data.dtypes

In [None]:
# index=False avoids writing the row index as an extra unnamed column
data.to_csv('data_cleaned_sales_target.csv', index=False)

### Segregating dependent and independent data

In [None]:
x = data.drop(['Business_Sourced'],axis=1)
y = data['Business_Sourced']
x.shape, y.shape

### Splitting the data into train set and the test set

In [None]:
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x,y, random_state = 56)
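
With no `test_size` given, `train_test_split` holds out 25% of the rows. The sketch below shows this on synthetic data, and also passes `stratify=y`, which is an assumption not used in the notebook; it keeps the class ratio the same in both splits, which can help when the target is imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% positive class (made-up numbers).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Default split is 75% train / 25% test; stratify=y is an optional addition.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=56, stratify=y)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```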

### Normalising using MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
cols = train_x.columns
cols

In [None]:
train_x_scaled = scaler.fit_transform(train_x)
train_x_scaled = pd.DataFrame(train_x_scaled, columns=cols)
train_x_scaled.head()

In [None]:
test_x_scaled = scaler.transform(test_x)
test_x_scaled = pd.DataFrame(test_x_scaled, columns=cols)
test_x_scaled.head()
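
The cells above fit the scaler on the training set only and reuse it on the test set. A small sketch with invented numbers shows the consequence: test values outside the training range map outside [0, 1], which is expected and avoids leaking test-set information into the scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```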

### Implementing Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.metrics import roc_auc_score

In [None]:
# Creating an instance of Logistic Regression
logreg = LogReg()

# Fitting the model on the scaled training features
logreg.fit(train_x_scaled,train_y)

### Making predictions using *predict_proba* function

In [None]:
# predicting the probability of 0 and 1 respectively for the dependent variable, i.e. Business_Sourced here

train_pred = logreg.predict_proba(train_x_scaled)
train_pred

In [None]:
# separating the probability of 1 in the dependent variable

train_pred = train_pred[:,1]
train_pred

In [None]:
test_pred = logreg.predict_proba(test_x_scaled)
test_pred

In [None]:
test_pred = test_pred[:,1]
test_pred
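
Indexing column 1 of `predict_proba` assumes the positive class sits in the second column. The column order follows the fitted model's `classes_` attribute, so a slightly more defensive sketch looks the column up explicitly. The data and the name `pos_col` below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up binary problem.
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)  # one column per class, rows sum to 1

# Find which column corresponds to class 1 instead of hard-coding it.
pos_col = list(model.classes_).index(1)
p1 = proba[:, pos_col]

print(model.classes_)
print(proba.shape)
```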

### Evaluating the model using the AUC-ROC metric

In [None]:
# roc_auc_score(y_true, y_scores)
print('Training score : ', roc_auc_score(train_y, train_pred))

In [None]:
print('Testing score : ',roc_auc_score(test_y, test_pred))
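
To make the score easier to interpret, the sketch below computes ROC AUC on four made-up predictions in two ways: with `roc_auc_score`, and by counting the fraction of (positive, negative) pairs in which the positive example receives the higher score. The two agree when there are no tied scores, which is why AUC is often read as a ranking quality measure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# Pairwise view: share of positive/negative pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)
```

A training score far above the testing score would suggest overfitting; similar values suggest the model generalises about as well as it fits.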