# Correlation Analysis

QUESTION: Are the numerical variables (X's) below correlated with the transaction revenue* (Y), and if so, in which direction?

*: 'transaction revenue' specifies the total revenue or grand total associated with the transaction (e.g. 11.99). This value may include shipping, tax costs, or other adjustments to total revenue that you want to include as part of your revenue calculations.

The numerical variables (X's) are:
- visitNumber: The session number for this user. If this is the first session, then this is set to 1.
- totals.hits: The total number of hits within the session. Common hit types include page tracking hits, event tracking hits, and ecommerce hits.
- totals.pageviews: Pageviews is a metric defined as the total number of pages viewed.
- totals.timeOnSite: The total time of the session, in seconds (see https://support.google.com/analytics/answer/1006253?hl=en).
- totals.newVisits: New users are users who have never been to your website, according to Google’s tracking snippet; returning users have visited your site before. If the cookie is not present, Google creates one and considers this a ‘new’ user. 
- totals.transactions: 'transactions' represent unique orders on your online store.
- totals.totalTransactionRevenue: aggregate data of 'totals.transactionRevenue'.
- totals.bounces: A bounce is a single-page session on your site. In Analytics, a bounce is calculated specifically as a session that triggers only a single request to the Analytics server, such as when a user opens a single page on your site and then exits without triggering any other requests to the Analytics server during that session. 

*totals: columns prefixed with "totals." mostly hold high-level aggregate data.

METHOD:
When looking for correlation between variables, Pearson's correlation analysis is the usual first choice. Pearson's r has two drawbacks here:
1) the standard significance test for Pearson's r assumes that the (x, y) pairs come from a bivariate normal distribution;
2) Pearson's r can be strongly influenced by outliers.
Since the (x, y) pairs here are not normally distributed, it is better to use a non-parametric approach, i.e. Spearman's rank correlation, which measures monotonic association using only the ranks of the data.
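
To make the difference concrete, here is a small sketch on synthetic data (the arrays `x` and `y` are made up for illustration and are not the Kaggle data). The relationship is perfectly monotonic but strongly non-linear with a heavy right tail, so Spearman's coefficient should be 1 while Pearson's r comes out noticeably lower.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative data: y grows exponentially with x, so the
# relationship is perfectly monotonic but far from linear.
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 2.0)

# Pearson measures linear association on the raw values.
pearson_r, _ = stats.pearsonr(x, y)
# Spearman measures monotonic association on the ranks.
spearman_r, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman r_s: {spearman_r:.3f}")
```

Because Spearman only looks at ranks, the few very large y values do not dominate the coefficient the way they do for Pearson.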


HYPOTHESIS TESTING:
- H_0: ρ = 0 (no association between X_i and y)
- H_1: ρ != 0 (an association between X_i and y)

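- test statistic, where r_sp is the Spearman correlation between X_i and y and n is the number of observations:
t_obs_spear = r_sp * math.sqrt((n - 2) / (1 - r_sp**2))
- significance level: 5%
- decision rule: we reject H_0 if and only if |t_obs_spear| >= t_{alpha/2, n-2}, the upper alpha/2 critical value of the t distribution with n - 2 degrees of freedom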
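
The test above can be sketched as a small helper. The function name `spearman_t_test` and the toy lists are hypothetical and only serve as an illustration; the guard for |r_sp| = 1 is my own addition, since the statistic is undefined there.

```python
import math
from scipy import stats

def spearman_t_test(x, y, alpha=0.05):
    """Two-sided t test of H_0: rho = 0 based on Spearman's rank correlation."""
    r_s, _ = stats.spearmanr(x, y)
    n = len(x)
    if abs(r_s) >= 1.0:
        # Perfect monotonic association: the statistic diverges.
        t_obs = math.copysign(math.inf, r_s)
    else:
        t_obs = r_s * math.sqrt((n - 2) / (1 - r_s**2))
    # Critical value t_{alpha/2, n-2}.
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    return float(r_s), float(t_obs), float(t_crit), bool(abs(t_obs) >= t_crit)

# Toy data (made up): neighbouring values swapped, so the ranks almost agree.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_s, t_obs, t_crit, reject = spearman_t_test(x, y)
print(f"r_s={r_s:.4f}, t_obs={t_obs:.2f}, t_crit={t_crit:.3f}, reject={reject}")
```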

CONCLUSION:

- We reject H_0 and claim that the visitNumber is positively associated with the transaction revenue at the 5% level.
- We reject H_0 and claim that the totals.hits is positively associated with the transaction revenue at the 5% level.
- We reject H_0 and claim that the totals.pageviews is positively associated with the transaction revenue at the 5% level.
- We reject H_0 and claim that the totals.timeOnSite is positively associated with the transaction revenue at the 5% level.
- We reject H_0 and claim that the totals.newVisits is negatively associated with the transaction revenue at the 5% level.
- We reject H_0 and claim that the totals.transactions is positively associated with the transaction revenue at the 5% level.
- We reject H_0 and claim that the totals.totalTransactionRevenue is positively associated with the transaction revenue at the 5% level.
- We reject H_0 and claim that the totals.bounces is negatively associated with the transaction revenue at the 5% level.



In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

import math


In [2]:
X = pd.read_csv('/Users/mercuryliu/Documents/Kaggle/ga-customer-revenue-prediction/X_v2.csv', \
                low_memory=False).drop(['visitHour', 'Unnamed: 0'], axis=1)
y = pd.read_csv('/Users/mercuryliu/Documents/Kaggle/ga-customer-revenue-prediction/y_v2.csv', \
                low_memory=False).drop('Unnamed: 0', axis=1)

In [3]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119264 entries, 0 to 119263
Data columns (total 26 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   visitNumber                               119264 non-null  int64  
 1   totals.visits                             119264 non-null  int64  
 2   totals.hits                               119264 non-null  int64  
 3   totals.pageviews                          119264 non-null  int64  
 4   totals.timeOnSite                         119264 non-null  int64  
 5   totals.newVisits                          119264 non-null  int64  
 6   totals.transactions                       119264 non-null  float64
 7   totals.totalTransactionRevenue            119264 non-null  float64
 8   totals.bounces                            119264 non-null  int64  
 9   channelGrouping                           119264 non-null  int64  
 10  socialEngagementType

# I am only analyzing the correlation between the numerical variables and the response variable.
# I will not investigate the categorical variables here;
# those are left to ANOVA.

In [4]:
num_name = ['visitNumber', 'totals.hits', 'totals.pageviews', 'totals.timeOnSite',\
                'totals.newVisits','totals.transactions', 'totals.totalTransactionRevenue',\
               'totals.bounces']

In [5]:
X_analy = X[num_name]

In [6]:
r_sp = []
p_val =[]
for i in range(len(X_analy.columns)):
    a,b = stats.spearmanr(X_analy.iloc[:,i],y)
    r_sp.append(a)
    p_val.append(b)



In [7]:
r_sp_all = pd.DataFrame(r_sp, columns=['Spearman_s_rank_correlation'])

In [8]:
r_sp_all['p_value'] = p_val

In [9]:
# critical value t_{alpha/2, n-2} for a two-sided test at the 5% significance level
t = stats.t.ppf(1-0.05/2, len(X_analy)-2)
t_obs = []
for i in range(len(r_sp)):
    t_test =  r_sp[i] * math.sqrt((len(X_analy) - 2) / (1 - r_sp[i]**2))
    if abs(t_test) >= t:
        t_obs.append('reject')
    else: 
        t_obs.append('fail to reject')

In [10]:
#hypothesis testing results
r_sp_all['Hypothesis Testing results'] = t_obs

In [11]:
r_sp_all['X_names'] = X_analy.columns

In [12]:
#we can claim correlation based on the results below
r_sp_all.sort_values(by=['Spearman_s_rank_correlation'], ascending=False)

   Spearman_s_rank_correlation  p_value        Hypothesis Testing results  X_names
5  0.999999                     0.0            reject                      totals.transactions
6  0.999999                     0.0            reject                      totals.totalTransactionRevenue
2  0.185551                     0.0            reject                      totals.pageviews
1  0.183775                     0.0            reject                      totals.hits
3  0.175349                     0.0            reject                      totals.timeOnSite
0  0.095544                     7.706733e-240  reject                      visitNumber
4  -0.092972                    3.718542e-227  reject                      totals.newVisits
7  -0.107912                    9.646401e-306  reject                      totals.bounces
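
The p_value column and the reject decision are two views of the same test: to my understanding, `stats.spearmanr` derives its default two-sided p-value from the same t approximation with n - 2 degrees of freedom, so p < 0.05 should coincide with |t_obs| >= t_crit. A quick check on synthetic data (the arrays and feature names below are made up, not taken from the Kaggle files):

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
y_demo = rng.normal(size=n)  # stand-in for the response
features = {
    "related": y_demo + rng.normal(size=n),   # built to correlate with y_demo
    "unrelated": rng.normal(size=n),          # independent of y_demo
}

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

results = {}
for name, values in features.items():
    r_s, p_scipy = stats.spearmanr(values, y_demo)
    t_obs = r_s * math.sqrt((n - 2) / (1 - r_s**2))
    # Two-sided p-value recomputed by hand from the t distribution.
    p_manual = 2 * stats.t.sf(abs(t_obs), n - 2)
    results[name] = {
        "p_scipy": float(p_scipy),
        "p_manual": float(p_manual),
        "reject_by_t": bool(abs(t_obs) >= t_crit),
        "reject_by_p": bool(p_scipy < alpha),
    }
    print(name, results[name])
```

If the two decision columns agree, either rule can be used; the p-value has the advantage of also showing how far a result is from the threshold.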


In [13]:
for _, row in r_sp_all.iterrows():
    # only claim an association when the test actually rejected H_0
    if row['Hypothesis Testing results'] != 'reject':
        print(f'We fail to reject H_0 for {row["X_names"]} at the 5% level.')
        continue
    direction = 'positively' if row['Spearman_s_rank_correlation'] > 0 else 'negatively'
    print(f'We reject H_0 and claim that the {row["X_names"]} is {direction} associated with the transaction revenue at the 5% level.')

We reject H_0 and claim that the visitNumber is positively associated with the transaction revenue at the 5% level.
We reject H_0 and claim that the totals.hits is positively associated with the transaction revenue at the 5% level.
We reject H_0 and claim that the totals.pageviews is positively associated with the transaction revenue at the 5% level.
We reject H_0 and claim that the totals.timeOnSite is positively associated with the transaction revenue at the 5% level.
We reject H_0 and claim that the totals.newVisits is negatively associated with the transaction revenue at the 5% level.
We reject H_0 and claim that the totals.transactions is positively associated with the transaction revenue at the 5% level.
We reject H_0 and claim that the totals.totalTransactionRevenue is positively associated with the transaction revenue at the 5% level.
We reject H_0 and claim that the totals.bounces is negatively associated with the transaction revenue at the 5% level.
