# 2. Exploration of Dataset


The dataset consists of 10 numerical and 8 categorical attributes, of which 9 relevant attributes have been selected (5 numerical and 4 categorical).

The 'Revenue' attribute is the class label.

Description of the dataset (taken from the UCI repository; the selected attributes are in bold):

"**Product Related**" and "**Product Related Duration**" represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another. 

The "**Bounce Rate**" and "**Exit Rate**" features represent the metrics measured by "Google Analytics" for each page in the e-commerce site. The value of the "Bounce Rate" feature for a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. The value of the "Exit Rate" feature for a specific web page is the percentage of all pageviews of that page that were the last in the session.
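To make the two definitions concrete, the sketch below computes both rates from a small, made-up set of sessions. The session data, the page names and the helper `bounce_and_exit_rates` are hypothetical. A bounce is approximated here as a single-page session, which is a simplification of the "no further request to the analytics server" definition.

```python
from collections import Counter

def bounce_and_exit_rates(sessions):
    """Return {page: (bounce_rate, exit_rate)} for a list of sessions.

    Each session is the ordered list of pages viewed in one visit.
    """
    entrances = Counter()   # sessions that started on the page
    bounces = Counter()     # single-page sessions that started on the page
    pageviews = Counter()   # all views of the page
    exits = Counter()       # views of the page that ended a session

    for pages in sessions:
        if not pages:
            continue
        entrances[pages[0]] += 1
        if len(pages) == 1:
            bounces[pages[0]] += 1
        pageviews.update(pages)
        exits[pages[-1]] += 1

    rates = {}
    for page, views in pageviews.items():
        # pages that were never an entry point get a bounce rate of 0
        bounce_rate = bounces[page] / entrances[page] if entrances[page] else 0.0
        exit_rate = exits[page] / views
        rates[page] = (bounce_rate, exit_rate)
    return rates

# Hypothetical toy sessions (not taken from the dataset)
sessions = [
    ["home"],
    ["home", "product", "cart"],
    ["product", "home"],
    ["home", "product"],
]
rates = bounce_and_exit_rates(sessions)
for page, (bounce, exit_) in sorted(rates.items()):
    print(f"{page}: bounce rate={bounce:.2f}, exit rate={exit_:.2f}")
```

In this toy data, "home" is the entry page of three sessions, one of which is a bounce, and two of its four pageviews end a session.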

The "**Special Day**" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) in which sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day, this value is nonzero between February 2 and February 12, zero before and after this period unless it is close to another special day, and reaches its maximum value of 1 on February 8.

The dataset also includes **region**, **visitor type** (returning or new visitor), a Boolean value indicating whether the date of the visit is a **weekend**, and the **month** of the year.
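As a rough sketch, the nine selected attributes and the class label could be pulled out of the full table with pandas as shown here. The column names are an assumption based on the CSV distributed by the UCI repository and should be checked against the file actually loaded. The two-row `df` is made-up stand-in data.

```python
import pandas as pd

# Assumed column names; verify them against the loaded file.
numerical = ["ProductRelated", "ProductRelated_Duration",
             "BounceRates", "ExitRates", "SpecialDay"]
categorical = ["Region", "VisitorType", "Weekend", "Month"]
target = "Revenue"

# Made-up stand-in for the full dataset, with one unselected column included
df = pd.DataFrame({
    "Administrative": [0, 2],
    "ProductRelated": [1, 10],
    "ProductRelated_Duration": [0.0, 627.5],
    "BounceRates": [0.2, 0.02],
    "ExitRates": [0.2, 0.05],
    "SpecialDay": [0.0, 0.4],
    "Region": [1, 3],
    "VisitorType": ["Returning_Visitor", "New_Visitor"],
    "Weekend": [False, True],
    "Month": ["Feb", "May"],
    "Revenue": [False, True],
})

# keep only the selected attributes plus the class label
selected = df[numerical + categorical + [target]]

# store the categorical attributes with pandas' category dtype
selected = selected.astype({col: "category" for col in categorical})
print(selected.dtypes)
```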

In [None]:
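# imports used in this cell
from collections import Counter

import matplotlib
import matplotlib.pyplot as plt
from numpy import argmax, sqrt
from sklearn.metrics import roc_curve, roc_auc_score

# retrieve just the probabilities for the positive class
y_prob_categorical = classifier.predict_proba(x_test)[:,1]

# summarize the distribution of the predicted class labels
print(Counter(y_pred_test))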

# Function to obtain the false positive rates, true positive rates, ROC AUC score
# and the index of the threshold with the largest G-mean
def roc_curve_and_score(y_test, pred_proba):
    fpr, tpr, thresholds = roc_curve(y_test.ravel(), pred_proba.ravel())
    roc_auc = roc_auc_score(y_test.ravel(), pred_proba.ravel(), average = 'macro')
    print('Roc Auc score:', roc_auc)
    
    # calculate the g-mean for each threshold
    gmeans = sqrt(tpr * (1-fpr))
    
    # locate the index of the largest g-mean
    l_g_mean = argmax(gmeans)
    print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[l_g_mean], gmeans[l_g_mean]))
    
    return fpr, tpr, roc_auc, l_g_mean

# Function to plot the ROC curve and mark the threshold with the largest G-mean
def plot_ROC(fpr, tpr, roc_auc, l_g_mean):
    plt.figure(figsize=(8, 6))
    matplotlib.rcParams.update({'font.size': 14})
    plt.grid()
    plt.plot(fpr, tpr, color='darkorange', lw=2,
             label='Categorical: ROC AUC={0:.3f}'.format(roc_auc), zorder=1)

    plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--', label = 'No Skill')
    plt.scatter(fpr[l_g_mean], tpr[l_g_mean], marker='o', color='black', label='Best Threshold', zorder=2)
    plt.legend(loc="lower right")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive rate (Sensitivity)')
    plt.show()
    
fpr, tpr, roc_auc, l_g_mean = roc_curve_and_score(y_test, y_prob_categorical)
plot_ROC(fpr, tpr, roc_auc, l_g_mean)
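
The G-mean used above is the geometric mean of sensitivity and specificity, sqrt(TPR * (1 - FPR)), so maximising it favours thresholds that balance both classes. As a dependency-free sketch, the following computes it by hand for a few made-up scores and then applies the chosen threshold to obtain labels. The labels, scores and the helper `gmean_at` are hypothetical and are not taken from the notebook's data.

```python
from math import sqrt

def gmean_at(threshold, y_true, scores):
    """G-mean of the rule 'predict 1 when score >= threshold'."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    tpr = tp / positives   # sensitivity
    fpr = fp / negatives   # 1 - specificity
    return sqrt(tpr * (1 - fpr))

# Made-up labels and positive-class scores
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

# every distinct score is a candidate threshold
candidates = sorted(set(scores), reverse=True)
best_threshold = max(candidates, key=lambda t: gmean_at(t, y_true, scores))
best_gmean = gmean_at(best_threshold, y_true, scores)

# apply the chosen threshold instead of the default 0.5
y_pred = [int(s >= best_threshold) for s in scores]
print(f"Best threshold={best_threshold:.2f}, G-mean={best_gmean:.3f}")
print(y_pred)
```

With the notebook's own variables, the same step would use the threshold at index `l_g_mean`. Note that `roc_curve_and_score` returns only that index, so the `thresholds` array would also need to be returned for this to work.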