Problem Statement
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines such as Google. Once people land on the website, they might browse the courses, fill out a form for a course, or watch some videos.

When people fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals.

Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if the company acquires 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the leads with the most potential, also known as ‘Hot Leads’.

If the company successfully identifies this set of leads, the lead conversion rate should go up, as the sales team will focus on communicating with the potential leads rather than making calls to everyone. A typical lead conversion process can be represented as a funnel.

[Figure: Lead Conversion Process, demonstrated as a funnel]

As you can see, many leads are generated in the initial stage (top), but only a few of them come out as paying customers at the bottom.

In the middle stage, you need to nurture the potential leads well (i.e. educate the leads about the product, communicate constantly, etc.) in order to get a higher lead conversion.

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers.
The company requires you to build a model that assigns a lead score to each lead such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance.

The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
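One way to read the 80% target: if the sales team contacts only leads whose predicted conversion probability is at or above some cutoff, the conversion rate among the contacted leads is the precision at that cutoff. The sketch below uses made-up probabilities and outcomes (not taken from the leads dataset), and the helper names `precision_at` and `lowest_cutoff_for_precision` are hypothetical.

```python
def precision_at(cutoff, probs, converted):
    """Conversion rate among leads with predicted probability >= cutoff."""
    picked = [c for p, c in zip(probs, converted) if p >= cutoff]
    return sum(picked) / len(picked) if picked else None


def lowest_cutoff_for_precision(probs, converted, target):
    """Smallest observed probability whose precision meets the target, or None."""
    for cutoff in sorted(set(probs)):
        prec = precision_at(cutoff, probs, converted)
        if prec is not None and prec >= target:
            return cutoff
    return None


# Illustrative values only: 10 leads, 3 of which converted (30% base rate).
probs = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05]
converted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

base_rate = sum(converted) / len(converted)
cutoff = lowest_cutoff_for_precision(probs, converted, target=0.8)
print(base_rate, cutoff, precision_at(cutoff, probs, converted))
```

Raising the cutoff usually trades recall for precision: fewer leads are contacted, but a larger share of them convert.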

Data
You have been provided with a leads dataset from the past with around 9000 data points. This dataset consists of various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc., which may or may not be useful in ultimately deciding whether a lead will be converted. The target variable is the column ‘Converted’, which tells whether a past lead was converted: 1 means it was converted and 0 means it wasn’t.

You also need to check the levels present in the categorical variables.

Many of the categorical variables have a level called 'Select' which needs to be handled because it is as good as a null value.

Goal
Build a logistic regression model to assign a lead score between 0 and 100 to each lead, which the company can use to target potential leads. A higher score means that the lead is hot, i.e. most likely to convert, whereas a lower score means that the lead is cold and will mostly not get converted.
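A lead score on a 0 to 100 scale can be obtained by rescaling a predicted conversion probability. The sketch below assumes the probability comes from a classifier's `predict_proba` output for class 1; the function name `lead_score` is hypothetical and the input values are illustrative.

```python
def lead_score(probability):
    """Map a conversion probability in [0, 1] to an integer score in [0, 100]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return int(round(probability * 100))


# Illustrative probabilities, not model output.
scores = [lead_score(p) for p in (0.0, 0.034, 0.5, 0.876, 1.0)]
print(scores)
```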

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Visualisation
from matplotlib.pyplot import xticks
%matplotlib inline

# Data display customization
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

# read_csv already returns a DataFrame, so no extra wrapping is needed.
data = pd.read_csv('../input/leads-dataset/Leads.csv')
data.head(5)


In [None]:
# Many columns contain the value 'Select'.
# It appears when the customer did not pick any option from the list,
# so 'Select' values are as good as NULL.

# Converting 'Select' values to NaN.
data = data.replace('Select', np.nan)
# Percentage of missing values per column
round(100*(data.isnull().sum()/len(data.index)), 2)
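To see what the 'Select' replacement does to the missing-value percentages, here is a small self-contained sketch on a toy frame. The rows are invented for illustration; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: 'Select' stands for "no option chosen".
toy = pd.DataFrame({
    "City": ["Mumbai", "Select", "Select", "Thane"],
    "Specialization": ["Select", "Finance", "Select", "Select"],
    "Converted": [1, 0, 0, 1],
})

toy = toy.replace("Select", np.nan)

# isnull().mean() is the fraction of missing rows per column,
# the same quantity as isnull().sum() / len(toy).
missing_pct = (100 * toy.isnull().mean()).round(2)
print(missing_pct)
```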

In [None]:
# Drop columns with more than 70% missing values.
missing_pct = round(100*(data.isnull().sum()/len(data.index)), 2)
data = data.drop(columns=missing_pct[missing_pct > 70].index)

In [None]:
# Lead Quality is based on the intuition of the employee, so blanks can safely be imputed as 'Not Sure'.
data['Lead Quality'] = data['Lead Quality'].replace(np.nan, 'Not Sure')
sns.countplot(x = data['Lead Quality'])
plt.show()

In [None]:
fig, axs = plt.subplots(2,2, figsize = (10,7.5))
# Pass the series as x= explicitly; recent seaborn versions no longer accept it positionally.
plt1 = sns.countplot(x = data['Asymmetrique Activity Index'], ax = axs[0,0])
plt2 = sns.boxplot(x = data['Asymmetrique Activity Score'], ax = axs[0,1])
plt3 = sns.countplot(x = data['Asymmetrique Profile Index'], ax = axs[1,0])
plt4 = sns.boxplot(x = data['Asymmetrique Profile Score'], ax = axs[1,1])
plt.tight_layout()

In [None]:
data = data.drop(columns=['Asymmetrique Activity Index','Asymmetrique Activity Score','Asymmetrique Profile Index','Asymmetrique Profile Score'])
round(100*(data.isnull().sum()/len(data.index)), 2)
sns.countplot(x = data.City)
xticks(rotation = 90)

In [None]:
# Around 60% of the data is Mumbai so we can impute Mumbai in the missing values.
data['City'] = data['City'].replace(np.nan, 'Mumbai')

In [None]:
sns.countplot(x = data.Specialization)
xticks(rotation = 90)

In [None]:
# A lead may have left the specialization blank because the right option is not available in the list,
# because the lead has no specialization, or because the lead is a student.
# Hence we can make a category "Others" for missing values.
data['Specialization'] = data['Specialization'].replace(np.nan, 'Others')
round(100*(data.isnull().sum()/len(data.index)), 2)


In [None]:
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = data.Tags)
xticks(rotation = 90)


In [None]:
# Blanks in the tag column may be imputed by 'Will revert after reading the email'.
data['Tags'] = data['Tags'].replace(np.nan, 'Will revert after reading the email')
data['What matters most to you in choosing a course'] = data['What matters most to you in choosing a course'].replace(np.nan, 'Better Career Prospects')
data['What is your current occupation'] = data['What is your current occupation'].replace(np.nan, 'Unemployed')
# Country is India for most values so let's impute the same in missing values.
data['Country'] = data['Country'].replace(np.nan, 'India')
# The remaining missing values are under 2%, so we can drop these rows.
data.dropna(inplace = True)
round(100*(data.isnull().sum()/len(data.index)), 2)

In [None]:
sns.countplot(x = "Lead Origin", hue = "Converted", data = data)
xticks(rotation = 90)

Inference
API and Landing Page Submission have a 30-35% conversion rate, and the number of leads originating from them is considerable.
Lead Add Form has a conversion rate above 90%, but its lead count is not very high.
Lead Import accounts for very few leads.
To improve the overall lead conversion rate, focus on improving the conversion of API and Landing Page Submission leads and on generating more leads from Lead Add Form.

In [None]:
sns.boxplot(y = 'Total Time Spent on Website', x = 'Converted', data = data)

Inference
Leads who spend more time on the website are more likely to convert.
The website should be made more engaging so that leads spend more time on it.

In [None]:
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = "Lead Source", hue = "Converted", data = data)
xticks(rotation = 90)

In [None]:
data['Lead Source'] = data['Lead Source'].replace(['google'], 'Google')
data['Lead Source'] = data['Lead Source'].replace(['Click2call', 'Live Chat', 'NC_EDM', 'Pay per Click Ads', 'Press_Release',
  'Social Media', 'WeLearn', 'bing', 'blog', 'testone', 'welearnblog_Home', 'youtubechannel'], 'Others')

In [None]:
sns.countplot(x = "Lead Source", hue = "Converted", data = data)
xticks(rotation = 90)

Inference
Google and Direct Traffic generate the largest number of leads.
The conversion rate of Reference leads and of leads from the Welingak Website is high.
To improve the overall lead conversion rate, focus on improving the conversion of Olark Chat, Organic Search, Direct Traffic, and Google leads, and on generating more leads from Reference and the Welingak Website.

In [None]:
fig, axs = plt.subplots(figsize = (10,5))
sns.countplot(x = "What is your current occupation", hue = "Converted", data = data)
xticks(rotation = 90)

In [None]:
data = data.drop(columns=['Lead Number','What matters most to you in choosing a course','Search','Magazine','Newspaper Article','X Education Forums','Newspaper',
           'Digital Advertisement','Through Recommendations','Receive More Updates About Our Courses','Update me on Supply Chain Content',
           'Get updates on DM Content','I agree to pay the amount through cheque','A free copy of Mastering The Interview'])

In [None]:
varlist =  ['Do Not Email', 'Do Not Call']

# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

# Applying the function to the binary Yes/No columns
data[varlist] = data[varlist].apply(binary_map)

In [None]:
# Creating a dummy variable for some of the categorical variables and dropping the first one.
dummy = pd.get_dummies(data[['Lead Origin', 'Lead Source', 'Last Activity', 'Specialization','What is your current occupation',
                              'Tags','Lead Quality','City','Last Notable Activity']], drop_first=True)
# Adding the results to the master dataframe
data = pd.concat([data, dummy], axis=1)
data.head()

In [None]:
data = data.drop(['Country','Lead Origin', 'Lead Source', 'Last Activity', 'Specialization','What is your current occupation','Tags','Lead Quality','City','Last Notable Activity'], axis = 1)

In [None]:
# pandas and warnings are already imported in the first cell.
from sklearn.metrics import f1_score, confusion_matrix, explained_variance_score, roc_curve, auc
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_selection import RFE, f_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
                              VotingClassifier, GradientBoostingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import (svm, tree, linear_model, neighbors, naive_bayes, ensemble,
                     discriminant_analysis, gaussian_process)
from sklearn import feature_selection, model_selection, metrics
from xgboost import XGBClassifier
from matplotlib.legend_handler import HandlerLine2D

In [None]:
def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
    """Pretty-print a confusion matrix (rows are true labels, columns are predictions)."""
    columnwidth = max([len(str(x)) for x in labels] + [5])  # 5 is value length
    empty_cell = " " * columnwidth
    
    # Begin CHANGES
    fst_empty_cell = (columnwidth-3)//2 * " " + "t/p" + (columnwidth-3)//2 * " "
    
    if len(fst_empty_cell) < len(empty_cell):
        fst_empty_cell = " " * (len(empty_cell) - len(fst_empty_cell)) + fst_empty_cell
    # Print header
    print("    " + fst_empty_cell, end=" ")
    # End CHANGES
    
    for label in labels:
        print("%{0}s".format(columnwidth) % label, end=" ")
        
    print()
    # Print rows
    for i, label1 in enumerate(labels):
        print("    %{0}s".format(columnwidth) % label1, end=" ")
        for j in range(len(labels)):
            cell = "%{0}.1f".format(columnwidth) % cm[i, j]
            if hide_zeroes:
                cell = cell if float(cm[i, j]) != 0 else empty_cell
            if hide_diagonal:
                cell = cell if i != j else empty_cell
            if hide_threshold:
                cell = cell if cm[i, j] > hide_threshold else empty_cell
            print(cell, end=" ")
        print()

In [None]:
features=[f for f in data.columns if f not in ['Prospect ID','Converted']]

In [None]:
print('Total Features',len(features))

In [None]:
# Baseline random forest on all features, used to rank feature importance
clf=RandomForestClassifier()

features_with_customer_id=features.copy()
features_with_customer_id.append('Prospect ID')
x_train,x_test,y_train,y_test = train_test_split(data[features_with_customer_id],data['Converted'],test_size=0.3,random_state=42)
x_train_backup=x_train
x_test_backup=x_test
x_train=x_train[features]
x_test=x_test[features]

clf.fit(x_train, y_train)
predictions_train = clf.predict(x_train)
predictions_test = clf.predict(x_test)

def accuracy_report(labels,predictions):
   precision, recall, fscore, support = score(labels, predictions)
   avg_fscore=f1_score(labels, predictions, average='macro')
   cm = confusion_matrix(labels, predictions, labels=[0,1])
   print('precision of 1: ',precision[1],', recall of 1:',recall[1])
   print_cm(cm, [0,1])

print('Confusion matrix for Train')
accuracy_report(y_train,predictions_train)
print('Confusion matrix for Test')
accuracy_report(y_test,predictions_test)

# Feature selection
def feature_selection(features,clf,threshold):
    """Return the top `threshold` features ranked by clf.feature_importances_.

    Note: `threshold` is a count of features to keep, not an importance cutoff.
    """
    # One row per (feature, importance) pair
    feature_score = pd.DataFrame(list(zip(features, clf.feature_importances_)),
                                 columns=['feature','importance_score'])

    feature_importance=feature_score.sort_values(by=['importance_score'],ascending=False).reset_index().head(threshold)
    important_features=feature_importance['feature'].to_list()
    return important_features,feature_importance,feature_score

important_features,feature_importance,feature_score = feature_selection(x_train.columns,clf,20)
feature_importance
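The precision and recall printed by `accuracy_report` can be read straight off the confusion matrix. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that scikit-learn's `confusion_matrix` produces for `labels=[0, 1]`; the counts are invented for illustration and the helper name `precision_recall` is hypothetical.

```python
def precision_recall(cm):
    """cm is [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted 1s that are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of real 1s that were found
    return precision, recall


# Illustrative counts only.
cm = [[50, 10],
      [5, 35]]
precision, recall = precision_recall(cm)
print(round(precision, 3), round(recall, 3))
```

In lead-scoring terms, precision is the conversion rate among the leads the model flags, and recall is the share of eventual converters that the model flags.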

In [None]:
def ModelSelection(test_data,features,label,dummy_variables_list):
    MLA = [
    
    ensemble.BaggingClassifier(),
    ensemble.GradientBoostingClassifier(),
    RandomForestClassifier(),
           
    XGBClassifier(),
        
    linear_model.LogisticRegressionCV(),
    linear_model.SGDClassifier(),
            
    svm.SVC(probability=True),
        
    tree.DecisionTreeClassifier(),
                
    ]
    
    MLA_columns = ['MLA Name', 'MLA Parameters','MLA Score']
    MLA_compare = pd.DataFrame(columns = MLA_columns)
    features_with_customer_id=features.copy()
    features_with_customer_id.append('Prospect ID')
    x_train,x_test,y_train,y_test = train_test_split (test_data[features_with_customer_id],test_data[label],test_size=0.3,random_state=0)
    print('features used: ',features)
    x_train_backup=x_train
    x_test_backup=x_test
    x_train=x_train[features]
    x_test=x_test[features]
    #x_train=pd.get_dummies(x_train, columns=dummy_variables_list)
    #x_test=pd.get_dummies(x_test, columns=dummy_variables_list)
    row_index = 0
    MLA_predict = {}  # test-set predictions per algorithm, keyed by class name
    for alg in MLA:

        MLA_name = alg.__class__.__name__
        MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
        MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
        alg.fit(x_train, y_train)
        MLA_predict[MLA_name] = alg.predict(x_test)
        MLA_compare.loc[row_index, 'MLA Score']=alg.score(x_test,y_test)
        row_index+=1

    
    MLA_compare.sort_values(by = ['MLA Score'], ascending = False, inplace = True)
    return MLA_compare,x_train,x_test,y_train,y_test,x_train_backup,x_test_backup

In [None]:
MLA_compare,x_train,x_test,y_train,y_test,x_train_backup,x_test_backup=ModelSelection(data,important_features,'Converted','')
print(MLA_compare[['MLA Name','MLA Score']])

In [None]:
from sklearn.metrics import roc_auc_score

# Instantiate the classifiers and make a list
classifiers = [RandomForestClassifier(),
               XGBClassifier(),
               ]

# Collect one row of results per classifier
results = []

# Train the models and record the results
for cls in classifiers:
    model = cls.fit(x_train, y_train)
    yproba = model.predict_proba(x_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, yproba)
    # Named auc_score so the imported sklearn.metrics.auc function is not overwritten
    auc_score = roc_auc_score(y_test, yproba)

    results.append({'classifiers': cls.__class__.__name__,
                    'fpr': fpr,
                    'tpr': tpr,
                    'auc': auc_score})

# DataFrame.append was removed in pandas 2.0, so build the table from the list
result_table = pd.DataFrame(results, columns=['classifiers', 'fpr','tpr','auc'])

# Set name of the classifiers as index labels
result_table.set_index('classifiers', inplace=True)

fig = plt.figure(figsize=(8,6))

for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'], 
             result_table.loc[i]['tpr'], 
             label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)

plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)

plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')

plt.show()
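The area under the ROC curve equals the probability that a randomly chosen converted lead gets a higher score than a randomly chosen non-converted lead, with ties counted as one half. The sketch below computes it by brute force over all pairs (the function name `pairwise_auc` is hypothetical and the scores are invented) and shows that thresholding the scores into hard 0/1 labels first discards ranking information.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative labels and predicted probabilities.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

auc_from_scores = pairwise_auc(labels, scores)

# Same predictions collapsed to hard labels at a 0.5 cutoff.
hard = [1 if s > 0.5 else 0 for s in scores]
auc_from_labels = pairwise_auc(labels, hard)

print(auc_from_scores, auc_from_labels)
```

This is why ROC curves are normally built from `predict_proba` output rather than from `predict`.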

In [None]:
# Random forest retrained on the selected important features only
clf=RandomForestClassifier()

features_with_customer_id=important_features.copy()
features_with_customer_id.append('Prospect ID')
x_train,x_test,y_train,y_test = train_test_split(data[features_with_customer_id],data['Converted'],test_size=0.3,random_state=42)
x_train_backup=x_train
x_test_backup=x_test
x_train=x_train[important_features]
x_test=x_test[important_features]

clf.fit(x_train, y_train)
predictions_train = clf.predict(x_train)
predictions_test = clf.predict(x_test)

print('Confusion matrix for Train')
accuracy_report(y_train,predictions_train)
print('Confusion matrix for Test')
accuracy_report(y_test,predictions_test)

In [None]:
# XGBoost trained on the selected important features only
clf=XGBClassifier()

features_with_customer_id=important_features.copy()
features_with_customer_id.append('Prospect ID')
x_train,x_test,y_train,y_test = train_test_split(data[features_with_customer_id],data['Converted'],test_size=0.3,random_state=42)
x_train_backup=x_train
x_test_backup=x_test
x_train=x_train[important_features]
x_test=x_test[important_features]

clf.fit(x_train, y_train)
predictions_train = clf.predict(x_train)
predictions_test = clf.predict(x_test)

print('Confusion matrix for Train')
accuracy_report(y_train,predictions_train)
print('Confusion matrix for Test')
accuracy_report(y_test,predictions_test)

In [None]:
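# Probability threshold
def proba_threhsold(probas,probas_threshold):
    """Predict 1 when class 1 is the likelier class or when its probability exceeds probas_threshold."""
    new_predictions=[]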
    for p in probas:
      if (p[1]>p[0]):
        new_predictions.append(1)
      elif ((p[1]>probas_threshold)):
        new_predictions.append(1)
      else:
        new_predictions.append(0)
    return new_predictions

# XGBoost with default parameters, evaluated with a custom probability threshold
clf=XGBClassifier()

clf.fit(x_train[important_features], y_train)
predictions_train_prob = clf.predict_proba(x_train[important_features])
predictions_test_prob = clf.predict_proba(x_test[important_features])

probas_threshold=0.05

train_predicted=proba_threhsold(predictions_train_prob,probas_threshold)
test_predicted=proba_threhsold(predictions_test_prob,probas_threshold)

print('Confusion matrix for Train')
accuracy_report(y_train,train_predicted)
print('Confusion matrix for Test')
accuracy_report(y_test,test_predicted)
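For a binary classifier whose two probabilities sum to 1, the two-branch rule in `proba_threhsold` reduces to a single comparison: predict 1 when the class-1 probability exceeds the smaller of the threshold and 0.5. The sketch below checks this on a grid of values; the names `proba_threshold_rule` and `simplified_rule` are hypothetical.

```python
def proba_threshold_rule(p0, p1, threshold):
    """Same branching as proba_threhsold, applied to one (p0, p1) pair."""
    if p1 > p0:
        return 1
    elif p1 > threshold:
        return 1
    return 0


def simplified_rule(p1, threshold):
    """Equivalent rule when p0 = 1 - p1: compare p1 with min(threshold, 0.5)."""
    return 1 if p1 > min(threshold, 0.5) else 0


p_values = [i / 100 for i in range(101)]
thresholds = [0.05, 0.10, 0.5, 0.7]

agree = all(proba_threshold_rule(1 - p, p, t) == simplified_rule(p, t)
            for p in p_values for t in thresholds)

# With a 0.05 threshold nearly every grid point is labelled 1.
flagged = sum(simplified_rule(p, 0.05) for p in p_values)
print(agree, flagged, len(p_values))
```

A threshold this low pushes recall toward 1 at the cost of precision, so a threshold above 0.5 has no effect under this rule.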

In [None]:
# Hyperparameter tuning
# warnings, numpy, pandas and XGBClassifier are already imported in earlier cells.
from datetime import datetime
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
import pickle

def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))
     
    
count_0=y_train.tolist().count(0)
count_1=y_train.tolist().count(1)

params = {
        'min_child_weight': [1,3], # ,5 default 1 higher values reduces overfitting, prevents model from learning relationship which are specific to particular sample 
        'gamma': [0.2, 0.5], #default 0, higher value reduces overfitting, min gain for loss function on possible split. node is split only when the resulting split gives a positive reduction in the loss function.
        'subsample': [0.8, 1.0], #default 1, lower values prevents overfitting, 0.5 to 1. number of samples to consider for a tree. 1 means all samples.
        'n_estimators':[100,200], #number of trees
        'eval_metric':['auc'], #'error',
        'max_depth':[3,4,6],
        'reg_lambda': [0,1],  #ridge regularization (shrinks coeff)
        #'colsample_bytree':[0.8,1], # default is 1 number of features to consider for single tree
        #'reg_alpha': [0,1], #lasso regularization (makes coeff 0)
        'max_delta_step':[0,1], #default is 0. used when class imbalanced. absolute capping on features weight
        'scale_pos_weight':[1,count_0/count_1], #default is 1, count(0)/count(1) in training. #greater than 1 for class imbalance
        }
# `silent` and `nthread` are deprecated in recent XGBoost releases; `verbosity` and `n_jobs` replace them.
xgb = XGBClassifier(learning_rate=0.05, objective='binary:logistic', verbosity=0, n_jobs=-1)


folds = 3
param_comb = 500

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(xgb, param_distributions=params,n_iter=param_comb,  scoring='roc_auc', n_jobs=4, cv=skf.split(x_train,y_train), verbose=3, random_state=1001 )
start_time = timer(None)
random_search.fit(x_train, y_train)
timer(start_time)  # report the elapsed search time
print(random_search.best_params_)
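It helps to count how many distinct combinations the parameter grid holds before choosing `n_iter`. The sketch below mirrors the grid from the cell above, with a string placeholder standing in for the class ratio. To my understanding, when every parameter is given as a list, scikit-learn samples without replacement and caps the number of iterations at the grid size with a warning; treat that as an assumption to verify against your installed version.

```python
from math import prod

# Same keys and list lengths as `params` above; the ratio is a placeholder string.
grid = {
    'min_child_weight': [1, 3],
    'gamma': [0.2, 0.5],
    'subsample': [0.8, 1.0],
    'n_estimators': [100, 200],
    'eval_metric': ['auc'],
    'max_depth': [3, 4, 6],
    'reg_lambda': [0, 1],
    'max_delta_step': [0, 1],
    'scale_pos_weight': [1, 'count_0/count_1'],
}

n_combinations = prod(len(values) for values in grid.values())
folds = 3
total_fits = n_combinations * folds  # one fit per combination per fold

print(n_combinations, total_fits)
```

Since the grid holds fewer combinations than the requested 500 iterations, the search is effectively exhaustive over this grid.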

In [None]:
# XGBoost with fixed hyperparameters (each value appears in the search grid above).
# proba_threhsold is reused from the earlier cell.
clf=XGBClassifier( subsample= 0.8, 
                   scale_pos_weight= 1,
                   reg_lambda= 0,
                   n_estimators= 100,
                   min_child_weight= 1,
                   max_depth= 4,
                   max_delta_step= 0,
                   gamma= 0.2,
                   eval_metric= 'auc')

clf.fit(x_train[important_features], y_train)
predictions_train_prob = clf.predict_proba(x_train[important_features])
predictions_test_prob = clf.predict_proba(x_test[important_features])

probas_threshold=0.10

train_predicted=proba_threhsold(predictions_train_prob,probas_threshold)
test_predicted=proba_threhsold(predictions_test_prob,probas_threshold)

print('Confusion matrix for Train')
accuracy_report(y_train,train_predicted)
print('Confusion matrix for Test')
accuracy_report(y_test,test_predicted)

In [None]:
'''
x_train_backup['Probability of not defaulter'] = predictions_train_prob[:,0]
x_train_backup['Probability of being defaulter'] = predictions_train_prob[:,1]
x_test_backup['Probability of not defaulter'] = predictions_test_prob[:,0]
x_test_backup['Probability of being defaulter'] = predictions_test_prob[:,1]
x_train_backup['type'] = 'train'
x_test_backup['type'] = 'test'
final_predictions = pd.concat([x_train_backup,x_test_backup])
final_predictions=final_predictions[['customer_id','Probability of not defaulter','Probability of being defaulter','type']]

result = pd.merge(data,
                 final_predictions,
                 on='customer_id')
result.to_csv('predictions.csv')
'''

**PCA**

In [None]:
clf=XGBClassifier()

features_with_customer_id=features.copy()
features_with_customer_id.append('Prospect ID')
x_train,x_test,y_train,y_test = train_test_split(data[features_with_customer_id],data['Converted'],test_size=0.3,random_state=42)
x_train_backup=x_train
x_test_backup=x_test
x_train=x_train[features]
x_test=x_test[features]

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

from sklearn.decomposition import PCA

pca = PCA(n_components=50)
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)
#explained_variance = pca.explained_variance_ratio_

clf.fit(x_train, y_train)
predictions_train = clf.predict(x_train)
predictions_test = clf.predict(x_test)

print('Confusion matrix for Train')
accuracy_report(y_train,predictions_train)
print('Confusion matrix for Test')
accuracy_report(y_test,predictions_test)

#feature selection
def feature_selection(features,clf,threshold):
    important_features=[]
    feature_importance=[]
    feature_score = pd.DataFrame(columns=['feature','importance_score'])
    
    for feature in zip(features, clf.feature_importances_):      
      feature_score.loc[len(feature_score.index)] = [feature[0],feature[1]]
    
    feature_importance=feature_score.sort_values(by=['importance_score'],ascending=False).reset_index().head(threshold)
    important_features=feature_importance['feature'].to_list()
    return important_features,feature_importance,feature_score

# PCA produced 50 components, so label them 1..50 (range(1, 50) would silently drop the last one).
important_features,feature_importance,feature_score = feature_selection(range(1, 51),clf,50)
feature_importance
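The number of principal components is often chosen from the cumulative explained variance rather than fixed in advance. The sketch below is NumPy-only and runs on synthetic data (not the leads dataset): it standardizes the columns, takes the singular values, and counts how many components are needed to reach 95% of the variance. The squared singular values divided by their sum correspond to what scikit-learn exposes as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 2 underlying signals, plus 3 columns that are noisy mixes of them.
base = rng.normal(size=(200, 2))
mixed = base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(200, 3))
X = np.hstack([base, mixed])

# Standardize each column (zero mean, unit variance), as StandardScaler does.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Singular values of the centered data give the variance along each component.
_, singular_values, _ = np.linalg.svd(X, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# First position where the cumulative share reaches 95%, converted to a count.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained_ratio.round(3), n_components_95)
```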

In [None]:
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 200]
train_results = []
test_results = []
for estimator in n_estimators:
    model = XGBClassifier(n_estimators=estimator)
    model.fit(x_train, y_train)
    # Score with predicted probabilities; hard 0/1 labels collapse the ROC curve to a single point
    train_proba = model.predict_proba(x_train)[:, 1]
    train_results.append(roc_auc_score(y_train, train_proba))
    test_proba = model.predict_proba(x_test)[:, 1]
    test_results.append(roc_auc_score(y_test, test_proba))
line1, = plt.plot(n_estimators, train_results, 'b', label='Train AUC')
line2, = plt.plot(n_estimators, test_results, 'r', label='Test AUC')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('AUC score')
plt.xlabel('n_estimators')
plt.show()