## <font color="blue">Problem Statement</font>

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on its website and browse for courses.

The company markets its courses on several websites and search engines such as Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if it acquires 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If it successfully identifies this set of leads, the lead conversion rate should go up, because the sales team will focus on communicating with the promising leads rather than making calls to everyone. A typical lead conversion process can be represented using the following funnel:

[Figure: Lead Conversion Process, demonstrated as a funnel]

As the funnel shows, a lot of leads are generated in the initial stage (top), but only a few of them come out as paying customers at the bottom. In the middle stage, the potential leads need to be nurtured well (i.e. educating the leads about the product, communicating with them regularly, etc.) in order to get a higher lead conversion.
 
X Education has appointed you to help select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that leads with a higher lead score have a higher chance of converting and leads with a lower lead score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
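
To make the arithmetic behind the 30% and 80% figures concrete, here is a minimal sketch. The numbers for the "hot lead" model (35 flagged leads, 28 of which convert) are made up for illustration and are not results from this notebook. The conversion rate the sales team sees is the share of contacted leads that convert, which is the model's precision.

```python
# Hypothetical numbers for illustration only; they are not results from this notebook.
total_leads = 100
total_converters = 30

# Baseline: the sales team contacts every lead.
baseline_rate = total_converters / total_leads

# Hypothetical model: flags 35 leads as "hot", 28 of which actually convert.
flagged = 35
flagged_converters = 28

targeted_rate = flagged_converters / flagged      # conversion rate among contacted leads (precision)
captured = flagged_converters / total_converters  # share of all converters that were reached (recall)

print(f"baseline conversion rate: {baseline_rate:.0%}")
print(f"targeted conversion rate: {targeted_rate:.0%}")
print(f"converters captured:      {captured:.0%}")
```

In this made-up case the team contacts about a third of the leads, converts 80% of the ones it contacts, and still reaches most of the converters.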

<br>
<br>

## <font color="blue">Import Packages</font>


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
from sklearn import metrics

In [None]:
### setting max display column length
pd.set_option("display.max_columns",999)

In [None]:
warnings.filterwarnings("ignore")

<br>
<br>

## <font color="blue">Reading & Understanding Data </font>

In [None]:
leads_df = pd.read_csv("../input/leadscore/Leads.csv")
leads_df.head()

In [None]:
### checking shape of DF
leads_df.shape

In [None]:
### checking stats of DF
leads_df.describe()

<br>
<br>

## <font color="blue">Data Cleaning </font>

In [None]:
### creating a copy of original DF
leads_df_original = leads_df.copy()

In [None]:
### replacing every "Select" placeholder (no option chosen in the form) with NaN
leads_df = leads_df.replace("Select" , np.nan)

In [None]:
### checking % null values
col_null_check = round((leads_df.isnull().sum() * 100 / leads_df.shape[0]),2).sort_values(ascending=False)
col_null_check

In [None]:
#Check for duplicate values
leads_df.duplicated().sum()

In [None]:
### dropping columns with 30% or more null values

cols_del = col_null_check[col_null_check >= 30].index

leads_df.drop(columns = cols_del , inplace=True)
leads_df.head()

In [None]:
### dropping columns that are not required, as they are only identifier (index) columns

leads_df.drop(columns = ["Prospect ID" , "Lead Number"] , inplace=True)
leads_df.head()

<br>
<br>

## <font color="blue">Understanding & imputing other null columns </font>

In [None]:
### columns to consider for imputing
col_null_check = round((leads_df.isnull().sum() * 100 / leads_df.shape[0]),2).sort_values(ascending=False)
col_null_check[col_null_check > 0]

In [None]:
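### replacing missing Lead Source values with the mode, as it is a categorical column
### (assigning back instead of inplace=True, which is unreliable on a selected column in recent pandas)
leads_df["Lead Source"] = leads_df["Lead Source"].fillna(leads_df["Lead Source"].mode()[0])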

In [None]:
### checking % of values so that we can merge low ones into one.
leads_df["Lead Source"].value_counts(normalize=True,dropna=False)* 100

In [None]:
### converting google to Google in Lead Source column so that both spellings become one category
leads_df["Lead Source"] = leads_df["Lead Source"].replace("google","Google")

In [None]:
### lead source column to merge
to_merge = leads_df["Lead Source"].value_counts(normalize=True,dropna=False)* 100
to_merge[to_merge < 10]

In [None]:
### merging categories below 10% into Others
leads_df["Lead Source"] = leads_df["Lead Source"].apply(lambda x: 'Others' if x in to_merge[to_merge < 10].index else x)
leads_df["Lead Source"].value_counts(normalize=True,dropna=False)* 100
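
The same "merge rare categories" pattern is repeated for several columns in this notebook. A minimal sketch of a reusable helper is shown below. The function name `merge_rare_categories`, its parameters and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def merge_rare_categories(series, min_pct=10.0, other_label="Others"):
    """Replace categories whose share is below min_pct percent with other_label."""
    pct = series.value_counts(normalize=True, dropna=False) * 100
    rare = pct[pct < min_pct].index
    # keep the value where it is NOT rare, otherwise use other_label
    return series.where(~series.isin(rare), other_label)

# Toy data, not taken from Leads.csv: Google 60%, Direct Traffic 30%, Facebook 10%
toy = pd.Series(["Google"] * 6 + ["Direct Traffic"] * 3 + ["Facebook"])
merged = merge_rare_categories(toy, min_pct=20.0)
print(merged.value_counts())
```

With a 20% cutoff only "Facebook" falls below the limit, so it is the only value replaced by "Others".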

In [None]:
### replacing missing Last Activity values with the mode, as it is a categorical column

leads_df["Last Activity"] = leads_df["Last Activity"].fillna(leads_df["Last Activity"].mode()[0])

In [None]:
### checking % of values so that we can merge low ones into one.
leads_df["Last Activity"].value_counts(normalize=True,dropna=False)* 100

In [None]:
### Last Activity column to merge
to_merge = leads_df["Last Activity"].value_counts(normalize=True,dropna=False)* 100

### merging categories below 10% into Others
leads_df["Last Activity"] = leads_df["Last Activity"].apply(lambda x: 'Others' if x in to_merge[to_merge < 10].index else x)
leads_df["Last Activity"].value_counts(normalize=True,dropna=False)* 100

In [None]:
#### checking outliers in TotalVisits column

sns.boxplot(x=leads_df["TotalVisits"])

In [None]:
### replacing missing TotalVisits values with the median, as the column has outliers

leads_df["TotalVisits"] = leads_df["TotalVisits"].fillna(leads_df["TotalVisits"].median())

In [None]:
#### checking outliers in Page Views Per Visit column

sns.boxplot(x=leads_df["Page Views Per Visit"])

In [None]:
### replacing missing Page Views Per Visit values with the median, as the column has outliers

leads_df["Page Views Per Visit"] = leads_df["Page Views Per Visit"].fillna(leads_df["Page Views Per Visit"].median())

In [None]:
### checking values in What is your current occupation
leads_df["What is your current occupation"].value_counts(normalize=True,dropna=False)* 100

In [None]:
### converting null to "Not Specified"
leads_df["What is your current occupation"] = leads_df["What is your current occupation"].fillna("Not Specified")
leads_df["What is your current occupation"].value_counts(dropna=False , normalize=True)

In [None]:
### merging categories below 5% into Other
to_merge = ["Student" ,"Housewife", "Businessman"]
leads_df["What is your current occupation"] = leads_df["What is your current occupation"].apply(lambda x: 'Other' if x in to_merge else x)
leads_df["What is your current occupation"].value_counts(dropna=False , normalize=True)

In [None]:
### checking values in What matters most to you in choosing a course
leads_df["What matters most to you in choosing a course"].value_counts(normalize=True,dropna=False)* 100

In [None]:
### converting null to "Not Specified"
leads_df["What matters most to you in choosing a course"] = leads_df["What matters most to you in choosing a course"].fillna("Not Specified")
leads_df["What matters most to you in choosing a course"].value_counts(dropna=False , normalize=True) * 100

In [None]:
### merging the category below 1% into Other
to_merge = ["Flexibility & Convenience"]
leads_df["What matters most to you in choosing a course"] = leads_df["What matters most to you in choosing a course"].apply(lambda x: 'Other' if x in to_merge else x)
leads_df["What matters most to you in choosing a course"].value_counts(dropna=False , normalize=True) * 100

In [None]:
### checking % values in Country
leads_df["Country"].value_counts(normalize=True,dropna=False)* 100

In [None]:
### converting null to "Not Specified"
leads_df["Country"] = leads_df["Country"].fillna("Not Specified")
leads_df["Country"].value_counts(dropna=False , normalize=True) * 100

In [None]:
### keeping "India" and "Not Specified", and merging every other country into Other
to_keep = ["India" , "Not Specified"]
leads_df["Country"] = leads_df["Country"].apply(lambda x: 'Other' if x not in to_keep else x)
leads_df["Country"].value_counts(dropna=False , normalize=True) * 100

In [None]:
### final check for any null values

col_null_check = round((leads_df.isnull().sum() * 100 / leads_df.shape[0]),2).sort_values(ascending=False)
col_null_check[col_null_check > 0]

In [None]:
### checking datatype of the columns
leads_df.info()

In [None]:
### checking distinct values in columns

for cols in leads_df.columns:
    print("Distinct Values in Column :" , cols)
    print("\n",leads_df[cols].value_counts())
    print("************************************\n\n")
    

In [None]:
### removing columns that contain only "No" values or very few "Yes" values, as these won't impact the model
cols = ["Search","Through Recommendations","Digital Advertisement","Do Not Call" ,"Magazine","Newspaper Article","X Education Forums","Newspaper", "Receive More Updates About Our Courses","Update me on Supply Chain Content","Get updates on DM Content","I agree to pay the amount through cheque"]

leads_df.drop(columns=cols,inplace=True)

In [None]:
### checking % values in Last Notable Activity
leads_df["Last Notable Activity"].value_counts(dropna=False , normalize=True) * 100

In [None]:
### Last Notable Activity column to merge
to_merge = leads_df["Last Notable Activity"].value_counts(normalize=True,dropna=False)* 100

### merging categories below 10% into Other
leads_df["Last Notable Activity"] = leads_df["Last Notable Activity"].apply(lambda x: 'Other' if x in to_merge[to_merge < 10].index else x)
leads_df["Last Notable Activity"].value_counts(normalize=True,dropna=False)* 100

In [None]:
### checking final list of columns
leads_df.head()

<br>
<br>

## <font color="blue">Converting categorical column data for ML Model preparation </font>

In [None]:
### converting all "Yes" to 1 and "No" to 0

cols = ["Do Not Email","A free copy of Mastering The Interview"]

leads_df[cols] = leads_df[cols].replace({"Yes":1,"No":0,"yes":1,"no":0})

### converting binary columns to int from object
leads_df[cols] = leads_df[cols].astype("int")

leads_df.head()


In [None]:
leads_df.info()

In [None]:
### creating dummy variables for the categorical columns
cols = ["What matters most to you in choosing a course","Country","What is your current occupation","Lead Origin","Lead Source","Last Activity","Last Notable Activity"]

### dtype=int keeps the dummies numeric (recent pandas versions default to bool)
dummy = pd.get_dummies(leads_df[cols] ,drop_first=True , dtype=int)
leads_df = pd.concat([leads_df , dummy] , axis=1)
leads_df.drop(columns=cols , inplace=True)

leads_df.head()

In [None]:
### getting final shape
leads_df.shape

<br>
<br>

## <font color="blue"> Checking Conversion Ratio in the Data</font>

In [None]:
round(leads_df.Converted.value_counts(normalize=True)* 100,2)

In [None]:
sns.countplot(x=leads_df["Converted"])

<br>
<br>

## <font color="blue">Train Test Split </font>


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
leads_df_train , leads_df_test = train_test_split(leads_df , train_size=0.7 , random_state=100)

print("leads_df_train : " , leads_df_train.shape)
print("leads_df_test : " , leads_df_test.shape)


<br>
<br>

## <font color="blue">Scaling the Train & Test Data </font>

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
### scaling only the non-binary columns of the Train DF (the scaler is fitted on Train only)

cols = ["TotalVisits","Total Time Spent on Website","Page Views Per Visit"]

scaler = StandardScaler()

leads_df_train[cols] = scaler.fit_transform(leads_df_train[cols])

leads_df_train[cols].head()

In [None]:
### scaling only the non-binary columns of the Test DF, reusing the scaler fitted on Train

cols = ["TotalVisits","Total Time Spent on Website","Page Views Per Visit"]

leads_df_test[cols] = scaler.transform(leads_df_test[cols])

leads_df_test[cols].head()
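
The fit-on-train, transform-on-test pattern used above can be sketched with plain NumPy. This is only an illustration of the idea on made-up numbers, not a replacement for `StandardScaler`: the mean and standard deviation are learned from the training rows alone and then reused unchanged on the test rows.

```python
import numpy as np

# Toy feature matrices (rows = leads, columns = numeric features); values are made up.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [10.0, 100.0]])

# "fit": learn the statistics from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), the same convention StandardScaler uses

# "transform": apply the same training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(test_scaled)
```

The test rows are not centred on zero after scaling, which is expected: they are measured against the training distribution, so no information from the test set leaks into the model.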

<br>
<br>

## <font color="blue">Creating X & Y Train and Test </font>

In [None]:
Y_Train = leads_df_train.pop("Converted")

In [None]:
X_Train = leads_df_train

In [None]:
Y_Test = leads_df_test.pop("Converted")

In [None]:
X_Test = leads_df_test

<br>
<br>

## <font color="blue">Logistic Regression Model Creation </font>

#### <font color="green">Step - 1:  RFE </font>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

In [None]:
logreg = LogisticRegression()
### taking best 15 features (n_features_to_select must be passed by keyword in current scikit-learn)
rfe = RFE(logreg , n_features_to_select=15)
rfe = rfe.fit(X_Train,Y_Train)

In [None]:
### listing the RFE columns
rfe_df = pd.DataFrame(zip(X_Train.columns , rfe.support_ , rfe.ranking_) , columns=["Features" ,"Support","Ranking"])
rfe_df[rfe_df.Support == True]

<br>

#### <font color="green">Step - 2:  Build Model </font>

In [None]:
import statsmodels.api as sm

In [None]:
cols = rfe_df[rfe_df["Support"] == True].iloc[:,0]
cols

<br>

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
### function to create logistic model and check p-value and VIF

def create_Logistic_model(y_train, x_train , cols):
    ## create log model
    x_train_sm  = sm.add_constant(x_train[cols])
    logsm = sm.GLM(y_train , x_train_sm , family=sm.families.Binomial())
    res = logsm.fit()
    print(res.summary())
    
    ## create VIF list
    vif = pd.DataFrame()
    vif["Features"] = x_train_sm.columns
    vif["VIF"] = [variance_inflation_factor(x_train_sm.values , i) for i in range(x_train_sm.shape[1])]
    vif["VIF"] = round(vif["VIF"],2)
    vif = vif.sort_values(by="VIF" , ascending=False)
    print("\n" , vif)
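
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other columns. The sketch below computes it by hand with NumPy on synthetic data. The helper name `vif_manual` and the toy columns `a`, `b`, `c` are assumptions for illustration only, and the helper is applied to the non-constant columns only.

```python
import numpy as np

def vif_manual(X, i):
    """VIF of column i: regress it on the remaining columns and return 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)                       # independent of a
c = a + 0.1 * rng.normal(size=n)             # nearly a copy of a
X = np.column_stack([np.ones(n), a, b, c])   # first column is the constant

for name, idx in [("a", 1), ("b", 2), ("c", 3)]:
    print(name, round(vif_manual(X, idx), 2))
```

The independent column `b` gets a VIF close to 1, while `a` and `c` explain each other almost completely and get very large values. This is the situation the VIF > 5 rule in the next cells is meant to catch.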

<br>

In [None]:
### model 1

create_Logistic_model(Y_Train,X_Train,cols)

In [None]:
### model 2

## removing column: Lead Source_Others, as its p-value is above 0.05
cols = cols[cols != "Lead Source_Others"]

create_Logistic_model(Y_Train,X_Train,cols)

In [None]:
### model 3

## removing column: Lead Origin_Lead Import, as its p-value is above 0.05
cols = cols[cols != "Lead Origin_Lead Import"]

create_Logistic_model(Y_Train,X_Train,cols)

In [None]:
### model 4

## removing column: Country_Not Specified, as its p-value is above 0.05
cols = cols[cols != "Country_Not Specified"]

create_Logistic_model(Y_Train,X_Train,cols)

In [None]:
### model 5

## removing column: Last Notable Activity_SMS Sent, as its VIF is above 5
cols = cols[cols != "Last Notable Activity_SMS Sent"]

create_Logistic_model(Y_Train,X_Train,cols)

In [None]:
### model 6

## removing column: Last Notable Activity_Other, as its p-value is above 0.05
cols = cols[cols != "Last Notable Activity_Other"]

create_Logistic_model(Y_Train,X_Train,cols)

#### <font color="green"> This is the final model: every feature has a p-value below 0.05 and a VIF below 5 </font>

<br>

#### <font color="green">Step - 3:  Predict From Model </font>

In [None]:
## create final logistic model
X_Train_sm  = sm.add_constant(X_Train[cols])
logsm = sm.GLM(Y_Train , X_Train_sm , family=sm.families.Binomial())
res = logsm.fit()

In [None]:
### predict Y and round % to 2 decimal value
Y_Train_Predict = res.predict(X_Train_sm)
Y_Train_Predict = round(Y_Train_Predict , 2)
Y_Train_Predict.head()

In [None]:
### making the dataframe to check model strength

Y_Train_Predict_Final = pd.DataFrame({"Churn_Original": Y_Train.values , "Churn_Probability": Y_Train_Predict})
Y_Train_Predict_Final.head()

In [None]:
Y_Train_Predict_Final["Customer_ID"] = Y_Train.index
Y_Train_Predict_Final = Y_Train_Predict_Final[["Customer_ID" , "Churn_Original" , "Churn_Probability"]]

Y_Train_Predict_Final.head()

<br>

#### <font color="green">Step - 4:  Checking the Metrics Table for Different Thresholds </font>

In [None]:
### calculates and returns the metrics for each probability cutoff from 0.00 to 0.99

def model_strength(predict_df):
    metric_cols = ["Probability_Threshold","Sensivity","Specificity","Accuracy","FPR","TPR","Precision","Recall","F1_score"]
    rows = []
    prob_list = [float(x)/100 for x in range(100)]
    for prob in prob_list:
        predict_df["Churn_Predicted"] = predict_df["Churn_Probability"].map(lambda x: 1 if x > prob else 0)
        confusion = metrics.confusion_matrix(predict_df["Churn_Original"] ,predict_df["Churn_Predicted"])
        
        ## confusion layout: [0,0]=TN , [0,1]=FP , [1,0]=FN , [1,1]=TP
        sensivity = round((confusion[1,1]  / (confusion[1,0] + confusion[1,1])) , 2)
        specificity = round((confusion[0,0]  / (confusion[0,0] + confusion[0,1])) , 2)
        ## accuracy = (TN + TP) / (TN + FP + FN + TP)
        accuracy = round(((confusion[0,0] + confusion[1,1]) / confusion.sum()) , 2)
        FPR = 1 - specificity
        TPR = sensivity
        
        ## precision is NaN at cutoffs where no lead is predicted as converted
        precision = round((confusion[1,1] / (confusion[0,1] + confusion[1,1])) , 2)
        recall = round((confusion[1,1] / (confusion[1,0] + confusion[1,1])) , 2)
        
        F1_score = round((2 * ((precision * recall) / (precision + recall))),2)
        
        ## DataFrame.append was removed in pandas 2.0, so the rows are collected in a list
        rows.append({"Probability_Threshold":prob, "Sensivity":sensivity, "Specificity": specificity ,"Accuracy":accuracy ,"FPR": FPR,"TPR":TPR ,"Precision":precision,"Recall":recall ,"F1_score":F1_score})
        
    return pd.DataFrame(rows , columns=metric_cols)
    

In [None]:
### checking the model metrics now
pd.set_option('expand_frame_repr', False)
### setting max display row length
pd.set_option("display.max_rows",101)
model_strength_df = model_strength(Y_Train_Predict_Final)
print(model_strength_df)
pd.set_option("display.max_rows",11)
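
To make the metric definitions easier to check by hand, here is a small sketch for a single cutoff on made-up labels and probabilities. The helper name `threshold_metrics` and the toy arrays are assumptions for illustration; the counts follow the same convention as scikit-learn's confusion matrix (rows are the true class, columns are the predicted class).

```python
import numpy as np

def threshold_metrics(y_true, y_prob, cutoff):
    """Sensitivity, specificity, accuracy and precision at one probability cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > cutoff).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "sensitivity": tp / (tp + fn),   # also called recall or TPR
        "specificity": tn / (tn + fp),   # 1 - FPR
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
    }

# Made-up labels and probabilities, not model output
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.7, 0.8]
m = threshold_metrics(y_true, y_prob, cutoff=0.35)
print(m)
```

At this cutoff the toy data gives 3 true positives, 1 false negative, 2 false positives and 2 true negatives, so every value can be verified with pencil and paper.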

<br>

#### <font color="green">Step - 5:  Create Sensitivity , Specificity & Accuracy Curve</font>

In [None]:
fig = plt.figure(figsize=(12,8))
fig.suptitle('Sensitivity , Specificity , Accuracy vs Probability_Threshold')
plt.plot(model_strength_df["Probability_Threshold"] , model_strength_df["Sensivity"] , "g-" ,  label='Sensitivity')
plt.plot(model_strength_df["Probability_Threshold"] , model_strength_df["Specificity"] , "r-" ,  label='Specificity')
plt.plot(model_strength_df["Probability_Threshold"] , model_strength_df["Accuracy"] ,"b-",  label='Accuracy')
plt.legend()
plt.show()

#### <font color="green">Step - 6:  Precision vs Recall curve</font>

In [None]:
fig = plt.figure(figsize=(12,8))
fig.suptitle('Precision Recall vs Probability_Threshold')
plt.plot(model_strength_df["Probability_Threshold"] , model_strength_df["Precision"] , "g-" ,  label='Precision')
plt.plot(model_strength_df["Probability_Threshold"] , model_strength_df["Recall"] , "r-" ,  label='Recall')
plt.legend()
plt.show()

<br>

#### <font color="green">Step - 7:  ROC Curve</font>

In [None]:
def draw_ROC(actual,predicted):
    fpr,tpr,threshold = metrics.roc_curve(actual,predicted, drop_intermediate=False)
    auc_score = metrics.roc_auc_score(actual,predicted)
    
    plt.figure(figsize=(12,8))
    plt.plot(fpr,tpr,label="ROC_Curve (area = %0.2f)"%auc_score)
    plt.plot([0,1],[0,1],"k--")
    plt.xlim([0.0,1.0])
    plt.ylim([0.0,1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend(loc="lower right")
    plt.show()

In [None]:
draw_ROC(Y_Train_Predict_Final["Churn_Original"] ,Y_Train_Predict_Final["Churn_Probability"])

### <font color="blue">The ROC curve above has <font color="green">89%</font> of the area under it, which indicates that the model is of good strength</font>
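
One way to read the area under the ROC curve: it equals the probability that a randomly chosen converted lead receives a higher predicted probability than a randomly chosen non-converted lead, with ties counted as half. The sketch below checks this on made-up scores; the helper name `auc_pairwise` and the toy data are assumptions for illustration.

```python
import numpy as np

def auc_pairwise(y_true, y_score):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

# Made-up labels and scores, not model output
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.7, 0.8]
auc = auc_pairwise(y_true, y_score)
print(auc)
```

Here 13 of the 16 positive/negative pairs are ordered correctly, so the toy AUC is 13/16.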

### <font color="blue">From the three metric plots above, the best probability threshold for conversion is </font> <font color="green">0.35</font>  <font color="blue">, where Recall stays above 80% as per the business requirement</font> 
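
The threshold here was read off the plots. One possible way to make such a choice reproducible is to take the highest cutoff that still keeps recall at or above a target value. The sketch below does this on made-up data; the helper name `highest_cutoff_with_recall` and the toy arrays are assumptions for illustration, and this is not necessarily how 0.35 was selected.

```python
import numpy as np

def highest_cutoff_with_recall(y_true, y_prob, min_recall=0.80):
    """Return the largest cutoff in 0.00..0.99 whose recall is at least min_recall."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n_pos = np.sum(y_true == 1)
    best = None
    for pct in range(100):
        cutoff = pct / 100
        y_pred = y_prob > cutoff
        recall = np.sum(y_pred & (y_true == 1)) / n_pos
        # recall can only fall as the cutoff rises, so the last hit is the highest one
        if recall >= min_recall:
            best = cutoff
    return best

# Made-up labels and probabilities, not model output
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.8, 0.65, 0.5, 0.2, 0.6, 0.3, 0.25, 0.1, 0.05]
best_cutoff = highest_cutoff_with_recall(y_true, y_prob, min_recall=0.80)
print(best_cutoff)
```

A higher cutoff means fewer leads to call, so picking the highest cutoff that meets the recall target keeps the sales workload as small as that target allows.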

<br>

#### <font color="green">Step - 8:  Checking how this threshold works in Test set</font>

In [None]:
### checking the final columns to use in the test set to predict
X_Test[cols].head()

In [None]:
### Predict in Test Set
X_Test_sm  = sm.add_constant(X_Test[cols])
Y_Test_Predict = res.predict(X_Test_sm)
Y_Test_Predict = round(Y_Test_Predict , 2)
Y_Test_Predict.head()


In [None]:
### making the dataframe to check model strength at the 35% cutoff

Y_Test_Predict_Final = pd.DataFrame({"Churn_Original": Y_Test.values , "Churn_Probability": Y_Test_Predict})
Y_Test_Predict_Final["Churn_Predicted"] = Y_Test_Predict_Final["Churn_Probability"].map(lambda x: 1 if x > 0.35 else 0)
Y_Test_Predict_Final.head()

In [None]:
### checking the accuracy score

metrics.accuracy_score(Y_Test_Predict_Final["Churn_Original"] ,Y_Test_Predict_Final["Churn_Predicted"] )

<br>

### <font color="green">  The model also works well on the test set at the 0.35 threshold, with a prediction accuracy of 81%</font>

<br>
<br>

## <font color="blue">Creating the Final List with Lead Score</font>

In [None]:
scale_cols = ["TotalVisits","Total Time Spent on Website","Page Views Per Visit"]
leads_df[scale_cols] = scaler.transform(leads_df[scale_cols])

leads_df_sm  = sm.add_constant(leads_df[cols])
leads_df_Predict = res.predict(leads_df_sm)
leads_df_Predict = round(leads_df_Predict * 100)

In [None]:
### add the lead score column to the original data frame
leads_df_original["Lead_Score"] = leads_df_Predict.astype("int")

In [None]:
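### moving Lead_Score next to the ID columns, then sorting the DF with the highest lead score on top
### (2:-1 keeps every remaining original column; the earlier 3:-2 slice dropped one column at each end)
leads_df_original_final = pd.concat([leads_df_original.iloc[:,0:2] , leads_df_original.iloc[:,-1] , leads_df_original.iloc[:,2:-1]], axis=1)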

leads_df_original_final.sort_values(by="Lead_Score",ascending=False).head()

<br>

### <font color="green">  Leads with a Lead Score above <font color="blue">35</font> have a high chance of converting as per the model</font>
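
The lead score is the predicted conversion probability expressed on a 0 to 100 scale. A minimal sketch on made-up probabilities is shown below; the array values and the names `lead_score` and `is_hot` are assumptions for illustration.

```python
import numpy as np

# Made-up conversion probabilities, not model output
probabilities = np.array([0.92, 0.351, 0.35, 0.07])

# Lead score = probability on a 0-100 scale, rounded to a whole number
lead_score = np.round(probabilities * 100).astype(int)

# A lead is treated as "hot" when its score is above the chosen cutoff of 35
is_hot = lead_score > 35

print(lead_score)
print(is_hot)
```

Because the score is rounded, a probability of 0.351 becomes a score of 35 and is not flagged, even though it is above 0.35 as a raw probability. If that edge case matters, the comparison can be done on the unrounded probability instead.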

<br>

In [None]:
### checking distribution of data based on Lead_Score
plt.figure(figsize=(8,6))
### histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(leads_df_original_final.Lead_Score , kde=True)
plt.show()

In [None]:
### probable converted to not-converted ratio, using the same cutoff as above (Lead_Score > 35)
not_converted = (leads_df_original_final[leads_df_original_final.Lead_Score <= 35].shape[0] * 100)/leads_df_original_final.shape[0]
converted = (leads_df_original_final[leads_df_original_final.Lead_Score > 35].shape[0] * 100)/leads_df_original_final.shape[0]

print("Probable_Converted VS Probable_Not_Converted ratio is ",round(converted,2),":",round(not_converted,2) )
