<img src="rupixen-Q59HmzK38eQ-unsplash.jpg" alt="Someone is trying to purchase a product online" width="500"/>

Online shopping decisions rely on how consumers engage with online store content. You work for a new startup company that has just launched a new online shopping website. The marketing team asks you, a new data scientist, to review a dataset of online shoppers' purchasing intentions gathered over the last year. Specifically, the team wants you to generate some insights into customer browsing behaviors in November and December, the busiest months for shoppers. You have decided to identify two groups of customers: those with a low purchase rate and returning customers. After identifying these groups, you want to determine the probability that any of these customers will make a purchase in a new marketing campaign to help gauge potential success for next year's sales.

### Data description:

You are given `online_shopping_session_data.csv`, which contains several columns describing each shopping session. Each shopping session corresponds to a single user.

|Column|Description|
|--------|-----------|
|`SessionID`|unique session ID|
|`Administrative`|number of pages visited related to the customer account|
|`Administrative_Duration`|total amount of time spent (in seconds) on administrative pages|
|`Informational`|number of pages visited related to the website and the company|
|`Informational_Duration`|total amount of time spent (in seconds) on informational pages|
|`ProductRelated`|number of pages visited related to available products|
|`ProductRelated_Duration`|total amount of time spent (in seconds) on product-related pages|
|`BounceRates`|average bounce rate of pages visited by the customer|
|`ExitRates`|average exit rate of pages visited by the customer|
|`PageValues`|average page value of pages visited by the customer|
|`SpecialDay`|closeness of the site visiting time to a specific special day|
|`Weekend`|indicator whether the session is on a weekend|
|`Month`|month of the session date|
|`CustomerType`|customer type|
|`Purchase`|class label indicating whether the customer made a purchase|

In [173]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import binom

# Load and view your data
shopping_data = pd.read_csv("online_shopping_session_data.csv")
shopping_data.sample(n=10, random_state=42)

Unnamed: 0,SessionID,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Weekend,Month,CustomerType,Purchase
11142,11143,5,55.5,2,132.5,61,2190.429367,0.013846,0.038779,0.0,0.0,False,Nov,Returning_Customer,0.0
2340,2341,0,0.0,1,257.0,31,1906.8,0.01875,0.048125,22.629896,0.4,False,May,Returning_Customer,1.0
3635,3636,1,13.0,1,53.0,16,388.744444,0.0,0.006667,0.0,0.0,False,May,Returning_Customer,0.0
4228,4229,0,0.0,0,0.0,11,547.5,0.018182,0.024747,0.0,1.0,True,May,Returning_Customer,0.0
7631,7632,2,65.4,0,0.0,79,2426.006667,0.001266,0.017722,0.0,0.0,False,Sep,Returning_Customer,0.0
2884,2885,4,91.0,2,39.0,16,1280.166667,0.0,0.004762,39.818857,0.0,False,May,New_Customer,1.0
1213,1214,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,False,Mar,Returning_Customer,0.0
10525,10526,4,22.916667,0,0.0,22,221.25,0.014815,0.035185,0.0,0.0,False,Nov,Returning_Customer,0.0
4447,4448,0,0.0,0,0.0,8,459.0,0.0,0.028571,26.98,0.0,False,May,Returning_Customer,1.0
2144,2145,4,92.0,0,0.0,19,1148.838095,0.012121,0.025289,32.065091,0.0,False,May,Returning_Customer,0.0


In [174]:
# Keep only sessions from November and December
sd_holiday = shopping_data[shopping_data['Month'].isin(['Nov','Dec'])]

# Purchase rate per customer type (the mean of the 0/1 Purchase column)
purchase_rates = (
    sd_holiday.groupby('CustomerType')['Purchase']
    .mean()
    .to_dict()
)

print(purchase_rates)


{'New_Customer': 0.2733516483516483, 'Returning_Customer': 0.1955937667920473}
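The rates above work because `Purchase` is a 0/1 label, so its group mean is the share of sessions that ended in a purchase. A minimal sketch on a hypothetical toy frame (the rows and the `toy_*` names are illustrative assumptions, not taken from the dataset) shows the same filter-then-group pattern and how the group with the lowest rate could be picked out:

```python
import pandas as pd

# Hypothetical toy sessions, invented for illustration only
toy = pd.DataFrame({
    "CustomerType": ["New_Customer", "New_Customer", "Returning_Customer",
                     "Returning_Customer", "Returning_Customer", "Returning_Customer"],
    "Month": ["Nov", "Dec", "Nov", "Dec", "May", "Nov"],
    "Purchase": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})

# Keep Nov/Dec rows, then average the 0/1 label within each customer type
toy_holiday = toy[toy["Month"].isin(["Nov", "Dec"])]
toy_rates = toy_holiday.groupby("CustomerType")["Purchase"].mean().to_dict()

# The customer type with the smallest purchase rate
toy_lowest = min(toy_rates, key=toy_rates.get)
print(toy_rates, toy_lowest)
```

Note that the May row is dropped before grouping, so it does not affect either rate.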


In [175]:
# Returning customers with a session in Nov or Dec
returning_customers = shopping_data.loc[
    (shopping_data["CustomerType"].eq("Returning_Customer")) &
    (shopping_data['Month'].isin(['Nov', 'Dec']))
]
# Of those, the sessions that ended in a purchase
# (the month filter is already applied above, so it is not repeated)
returning_purchasers = returning_customers.loc[returning_customers['Purchase'].eq(1)]

# Count both groups so we can take their ratio
total_returning_customers = returning_customers['Purchase'].count()
total_purchase = returning_purchasers['Purchase'].count()

# Base purchase probability for returning customers
base_probability = total_purchase / total_returning_customers

print("base probability is:", base_probability)

# The promotion is stated to boost sales by 15%, so scale the base probability
boosted_prob = base_probability * 1.15
print("probability of sales after promotion: ", boosted_prob)

# Binomial model: n sessions, each ending in a purchase with probability p
n = 500
p = boosted_prob

# 1 - cdf(100) is P(X > 100), i.e. 101 or more sales.
# For an inclusive "at least 100", stats.binom.sf(99, n, p) would be used instead.
pp = stats.binom.cdf(k=100, n=n, p=p)
prob_more_than_100_sales = 1 - pp

print("Prob. of more than 100 sales after promotion is: ", prob_more_than_100_sales)


base probability is: 0.1955937667920473
probability of sales after promotion:  0.22493283181085436
Prob. of more than 100 sales after promotion is:  0.9012221339037267
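
The tail probability printed above comes from `1 - binom.cdf(100, n, p)`, which is P(X > 100) and therefore excludes exactly 100 sales. The sketch below compares it with the inclusive P(X ≥ 100), assuming the boosted probability copied from the output above; the variable names are illustrative and no expected values are claimed here:

```python
from scipy.stats import binom

n = 500
p = 0.22493283181085436  # boosted probability copied from the printed output

# cdf(100) = P(X <= 100), so its complement is P(X > 100) = P(X >= 101)
more_than_100 = 1 - binom.cdf(100, n, p)

# sf(k) = P(X > k), so sf(99) = P(X >= 100), the inclusive "at least 100"
at_least_100 = binom.sf(99, n, p)

# The two tails differ by the probability of exactly 100 sales
exactly_100 = binom.pmf(100, n, p)
print(more_than_100, at_least_100, exactly_100)
```

Which of the two is appropriate depends on whether a session count of exactly 100 should count as success.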


In [176]:
# Keep only Nov/Dec sessions from returning customers
mask_month = shopping_data["Month"].isin(["Nov", "Dec"])
mask_cust = shopping_data["CustomerType"].eq("Returning_Customer")

# Combine both filters
mask = mask_month & mask_cust

# Work on a copy that contains only the rows we want
sd_corr = shopping_data.loc[mask].copy()

# Select every column whose name ends with _Duration
duration_cols = [c for c in sd_corr.columns if c.endswith("_Duration")]
# Pairwise correlation between the time spent on each page type
corr = sd_corr[duration_cols].corr()

# Blank out the diagonal (each column correlates 1.0 with itself),
# then flatten the matrix into a Series indexed by (col1, col2)
corr_unstacked = corr.where(~np.eye(len(corr), dtype=bool)).unstack().dropna()

# Find the pair with the largest absolute correlation
max_pair = corr_unstacked.abs().idxmax()   # (col1, col2)
max_value = corr_unstacked.loc[max_pair]

top_correlation = {
    "pair": (max_pair[0], max_pair[1]),
    "correlation": round(float(max_value), 3)
}
print(top_correlation)

{'pair': ('Administrative_Duration', 'ProductRelated_Duration'), 'correlation': 0.417}
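
A correlation matrix always has 1.0 on its diagonal, so the strongest pair has to be searched for off the diagonal. The following is a minimal sketch on synthetic data (the `A_Duration`, `B_Duration`, and `C_Duration` columns are made up for illustration) of one way to mask the diagonal before taking the maximum:

```python
import numpy as np
import pandas as pd

# Synthetic columns: B is built from A, C is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "A_Duration": a,
    "B_Duration": a + rng.normal(scale=0.1, size=200),
    "C_Duration": rng.normal(size=200),
})

toy_corr = toy.corr()

# Replace the diagonal with NaN, flatten to a (col1, col2)-indexed Series, drop the NaNs
off_diag = toy_corr.where(~np.eye(len(toy_corr), dtype=bool)).unstack().dropna()

# Strongest relationship by absolute value, keeping the signed coefficient
toy_pair = off_diag.abs().idxmax()
toy_value = off_diag.loc[toy_pair]
print(toy_pair, round(float(toy_value), 3))
```

Because the matrix is symmetric, each pair appears twice in the flattened Series; `idxmax` returns the first occurrence.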
