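In this file, I performed the two-sample Kolmogorov-Smirnov (KS) test between different pairs of variables. Note that the KS test compares the distributions of two samples; it does not measure correlation between them.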

The KS test, short for Kolmogorov-Smirnov test, is a statistical test used to determine whether two samples or distributions differ significantly from each other. It compares the cumulative distribution functions (CDFs) of the two samples to assess the similarity or dissimilarity between them.

The KS test is a nonparametric test, meaning it does not assume any specific distribution for the data. It is commonly used when the distribution of the data is unknown or when the data is not normally distributed.

The KS test calculates a test statistic called the KS statistic, which is the maximum vertical distance between the two empirical CDFs. It also calculates a p-value: the probability of obtaining a test statistic at least as extreme as the one observed if the two samples are drawn from the same distribution.



KS Statistic:

The KS statistic, or Kolmogorov-Smirnov statistic, is the maximum absolute difference between the empirical cumulative distribution functions (CDFs) of two datasets.
In the context of the KS test, the KS statistic quantifies the level of dissimilarity between two samples or datasets.
The KS statistic ranges from 0 to 1. A value of 0 means the two empirical CDFs coincide, and a value of 1 means the two samples do not overlap at all (every value in one sample lies below every value in the other).
In the output for 'Month' and 'Invoice ID' further down, a KS statistic of 1.0 is the largest possible value, indicating that the two columns share no common range of values.
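To make the definition concrete, here is a minimal sketch that computes the KS statistic by hand as the largest gap between two empirical CDFs. The helper names `ecdf` and `ks_statistic` are illustrative and are not part of SciPy; the first pair of samples is the one used in the example cell of this notebook, so the result should agree with the 0.6 reported there.

```python
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Fraction of observations in sorted_sample that are <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both ECDFs are step functions that only change at observed values,
    # so checking the pooled observations is enough to find the maximum.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


print(ks_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # partly overlapping samples
print(ks_statistic([1, 2, 3], [10, 20, 30]))            # non-overlapping samples
print(ks_statistic([1, 2, 3], [1, 2, 3]))               # identical samples
```

The non-overlapping pair gives the maximum value of 1 and the identical pair gives 0, which matches the two ends of the range described above.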


P-Value:

The p-value is a probability value that measures the evidence against the null hypothesis in a statistical test.
In the context of the KS test, the p-value indicates the likelihood of observing the given level of dissimilarity (or a more extreme level) between the two datasets if they were actually drawn from the same underlying distribution.
A p-value less than a chosen significance level (e.g., 0.05) is taken as evidence to reject the null hypothesis that both datasets come from the same distribution, and to conclude that their distributions differ significantly.
In the given output, a p-value of 0.0 indicates that the observed level of dissimilarity (KS statistic of 1.0) would be extremely unlikely if the two datasets came from the same distribution. The true p-value is presumably not exactly zero; it is too small to be represented as a floating-point number and is printed as 0.0.
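One way to see what the p-value means is to compute it by brute force on a tiny example. The sketch below is a permutation approach, not necessarily how SciPy computes its p-value: it pools two small samples, tries every way of splitting the pool into two groups of the original sizes, and counts how often the KS statistic is at least as large as the observed one. The function names and the sample values are hypothetical.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import combinations


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs (exact fractions)."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(Fraction(bisect_right(a, x), len(a)) - Fraction(bisect_right(b, x), len(b)))
        for x in a + b
    )


def permutation_p_value(sample_a, sample_b):
    """Share of all relabelings whose KS statistic is >= the observed one."""
    pooled = list(sample_a) + list(sample_b)
    observed = ks_statistic(sample_a, sample_b)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(sample_a)):
        chosen = set(idx)
        left = [pooled[i] for i in idx]
        right = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if ks_statistic(left, right) >= observed:
            hits += 1
    return Fraction(hits, total)


# Two non-overlapping samples of five values each (hypothetical data)
p = permutation_p_value([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(p, float(p))
```

Of the 252 possible splits, only two separate the values completely, so the p-value is 2/252 (about 0.008). With thousands of rows, the same kind of calculation produces numbers so small that they are displayed as 0.0.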

Example

In [1]:
from scipy import stats

# Generate two sample arrays
sample1 = [1, 2, 3, 4, 5]
sample2 = [2, 4, 6, 8, 10]

# Perform the KS test
ks_statistic, p_value = stats.ks_2samp(sample1, sample2)

# Print the KS test statistics and p-value
print("KS statistic:", ks_statistic)
print("P-Value:", p_value)



KS statistic: 0.6
P-Value: 0.35714285714285715


In [2]:
import pandas as pd

In [3]:
data = pd.read_excel('data.xlsx')

In [4]:
invoice_id = data['Invoice ID']
customer_id = data['Customer ID']

# Perform the KS test
ks_statistic, p_value = stats.ks_2samp(invoice_id, customer_id)

# Print the KS test statistics and p-value
print("KS statistic:", ks_statistic)
print("P-Value:", p_value)

KS statistic: 0.8716984234354425
P-Value: 0.0


In [5]:
total = data['Total']
quantity = data['Quantity']

# Perform the KS test
ks_statistic, p_value = stats.ks_2samp(total, quantity)

# Print the KS test statistics and p-value
print("KS statistic:", ks_statistic)
print("P-Value:", p_value)

KS statistic: 0.9988086704560368
P-Value: 0.0


In [6]:
item_total = data['Item Total']
item_price = data['Item Price']

# Perform the KS test
ks_statistic, p_value = stats.ks_2samp(item_total, item_price)

# Print the KS test statistics and p-value
print("KS statistic:", ks_statistic)
print("P-Value:", p_value)

KS statistic: 0.23582700048893546
P-Value: 0.0


In [8]:
total = data['Total']
item_total = data['Item Total']

# Perform the KS test
ks_statistic, p_value = stats.ks_2samp(total, item_total)

# Print the KS test statistics and p-value
print("KS statistic:", ks_statistic)
print("P-Value:", p_value)

KS statistic: 0.9697249413349526
P-Value: 0.0


In [9]:
quantity = data['Quantity']
item_price = data['Item Price']

# Perform the KS test
ks_statistic, p_value = stats.ks_2samp(quantity, item_price)

# Print the KS test statistics and p-value
print("KS statistic:", ks_statistic)
print("P-Value:", p_value)

KS statistic: 0.9407046728675418
P-Value: 0.0


In [10]:
month = data['Month']
invoice_id = data['Invoice ID']

# Perform the KS test
ks_statistic, p_value = stats.ks_2samp(month, invoice_id)

# Print the KS test statistics and p-value
print("KS statistic:", ks_statistic)
print("P-Value:", p_value)

KS statistic: 1.0
P-Value: 0.0


One Tailed Test

In [12]:
invoice_id = data['Invoice ID']
month = data['Month']

# Perform the KS test (the first return value is the KS statistic, not a correlation coefficient)
ks_statistic, p_value = stats.ks_2samp(invoice_id, month)

# Set the significance level (alpha)
alpha = 0.05

# Approximate a one-tailed test by halving the two-sided p-value.
# This is only a rough heuristic: ks_2samp's `alternative` argument
# ('less' or 'greater') runs a proper one-sided test.
if p_value / 2 < alpha and ks_statistic > 0:
    print("The distribution of 'Invoice ID' differs significantly from that of 'Month'.")
else:
    print("The distributions of 'Invoice ID' and 'Month' do not differ significantly.")

The distribution of 'Invoice ID' differs significantly from that of 'Month'.
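The cell above halves the two-sided p-value, which is not how a one-sided KS test is normally run. A hedged sketch of the built-in route follows: recent SciPy releases accept an `alternative` argument in `stats.ks_2samp`. The two lists are small hypothetical stand-ins for a column of small values and a column of large values; they are not taken from data.xlsx.

```python
from scipy import stats

# Hypothetical stand-ins: one sample of small values, one of large values
small_values = [1, 2, 3, 4, 5, 6]
large_values = [101, 102, 103, 104, 105, 106]

alpha = 0.05

# With the larger sample passed first, its empirical CDF lies below the
# other one, which is the direction tested by alternative='less'.
for alternative in ("two-sided", "less", "greater"):
    result = stats.ks_2samp(large_values, small_values, alternative=alternative)
    verdict = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    print(alternative, result.statistic, result.pvalue, verdict)
```

Only the direction that matches the data ('less' here) and the two-sided test should reject the null hypothesis; the opposite direction ('greater') should not.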


Two Tailed Test

In [15]:
invoice_id = data['Invoice ID']
customer_id = data['Customer ID']

# Perform the KS test (the first return value is the KS statistic, not a correlation coefficient)
ks_statistic, p_value = stats.ks_2samp(invoice_id, customer_id)

# Set the significance level (alpha)
alpha = 0.05

# Perform two-tailed test (alternative hypothesis: the two distributions differ)
if p_value < alpha:
    print("The distributions of 'Invoice ID' and 'Customer ID' differ significantly.")
else:
    print("The distributions of 'Invoice ID' and 'Customer ID' do not differ significantly.")

The distributions of 'Invoice ID' and 'Customer ID' differ significantly.


In [16]:
invoice_id = data['Invoice ID']
month = data['Month']

# Perform the KS test (the first return value is the KS statistic, not a correlation coefficient)
ks_statistic, p_value = stats.ks_2samp(invoice_id, month)

# Set the significance level (alpha)
alpha = 0.05

# Perform two-tailed test (alternative hypothesis: the two distributions differ)
if p_value < alpha:
    print("The distributions of 'Invoice ID' and 'Month' differ significantly.")
else:
    print("The distributions of 'Invoice ID' and 'Month' do not differ significantly.")

The distributions of 'Invoice ID' and 'Month' differ significantly.


Total and Quantity: The correlation coefficient is positive (0.323371), suggesting a weak to moderate positive relationship.

Total and Item Total: The correlation coefficient is positive (0.422263), indicating a moderate positive relationship.

Quantity and Item Total: The correlation coefficient is positive (0.748336), indicating a strong positive relationship.

Item Total and Item Price: The correlation coefficient is positive but close to zero (0.039536), suggesting little or no linear relationship.

Item Price and Month: The correlation coefficient is positive (0.100411), indicating a weak positive relationship.

For each pair, you can perform a two-tailed test to determine if the correlation coefficient is significantly different from zero. This calls for a correlation test rather than the KS test, which compares distributions and does not measure how two variables move together.
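As a minimal sketch of a two-tailed test on a correlation coefficient, the example below uses `scipy.stats.pearsonr`, which returns the coefficient together with a two-sided p-value. The notes above do not say which kind of correlation coefficient was computed, so Pearson is an assumption here, and the `x` and `y` lists are hypothetical values rather than columns from data.xlsx.

```python
from scipy import stats

# Hypothetical paired observations (not taken from data.xlsx)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis that the true correlation is zero.
correlation, p_value = stats.pearsonr(x, y)

alpha = 0.05
print("Correlation coefficient:", correlation)
print("P-Value:", p_value)

if p_value < alpha:
    print("The correlation is significantly different from zero.")
else:
    print("The correlation is not significantly different from zero.")
```

With the real data, the same pattern would apply to any two numeric columns of equal length, for example `stats.pearsonr(data['Total'], data['Quantity'])`, assuming neither column contains missing values.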