# What is a Statistical Significance Test?

In statistics, a result is statistically significant when it is unlikely to have been produced randomly or by chance, which suggests that there is a reason behind it.


SciPy provides us with a module called scipy.stats, which has functions for performing statistical significance tests.

#Hypothesis in Statistics
A hypothesis is an assumption about a parameter of a population.

#Null Hypothesis
It assumes that the observation is not statistically significant, that is, any apparent effect is due to chance.

#Alternate Hypothesis
It assumes that the observations are due to some real cause rather than chance.

It is the alternative to the null hypothesis.

**Example:**

For an assessment of a student we would take:

"student is worse than average" - as a null hypothesis, and:

"student is better than average" - as an alternate hypothesis.

#One tailed test
When our hypothesis is testing for one side of the value only, it is called a "one-tailed test".

Example:

For the null hypothesis:

"the mean is equal to k", we can have alternate hypothesis:

"the mean is less than k", or:

"the mean is greater than k"

#Two tailed test
When our hypothesis is testing for both sides of the value, it is called a "two-tailed test".

Example:

For the null hypothesis:

"the mean is equal to k", we can have alternate hypothesis:

"the mean is not equal to k"

In this case the mean may be either less than or greater than k, so both sides have to be checked.
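As a rough illustration of the difference between the two kinds of test, the sketch below runs a one-sample t-test against a value k using scipy.stats.ttest_1samp() with each setting of its alternative parameter. The sample, the seed, and k = 0.5 are assumptions made up for this example, and the alternative parameter needs a reasonably recent SciPy (1.6 or later).

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical data: 50 draws from a normal distribution with mean 0.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=50)
k = 0.5  # hypothetical value for the null hypothesis "the mean is equal to k"

# One-tailed tests: each one looks at a single side only.
p_less = ttest_1samp(sample, k, alternative='less').pvalue
p_greater = ttest_1samp(sample, k, alternative='greater').pvalue

# Two-tailed test: checks both sides at once.
p_two_sided = ttest_1samp(sample, k, alternative='two-sided').pvalue

print(p_less, p_greater, p_two_sided)
```

For the t-test, the two one-tailed p values add up to 1, and the two-tailed p value is twice the smaller of them.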

#Alpha value
The alpha value is the level of significance.

It specifies how close to the extremes the data must be for the null hypothesis to be rejected.

It is usually taken as 0.01, 0.05, or 0.1.



#P value
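The p value tells how close to the extreme the data actually is: it is the probability of seeing data at least as extreme as what was observed, assuming the null hypothesis is true.

The p value and the alpha value are compared to establish statistical significance.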

If p value <= alpha, we reject the null hypothesis and say that the data is statistically significant. Otherwise we fail to reject the null hypothesis.
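The comparison above can be sketched as a small helper. The function name is_significant and the p values passed to it are hypothetical, chosen only to show the rule.

```python
def is_significant(p_value, alpha=0.05):
    """Return True when the null hypothesis is rejected at level alpha."""
    return p_value <= alpha


# Hypothetical p values checked against two common alpha levels.
print(is_significant(0.03))              # True: 0.03 <= 0.05
print(is_significant(0.03, alpha=0.01))  # False: 0.03 > 0.01
print(is_significant(0.34))              # False: 0.34 > 0.05
```

The same p value can be significant at one alpha level and not at another, which is why alpha should be chosen before looking at the result.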


#T-Test
T-tests are used to determine if there is a significant difference between the means of two variables, which lets us know if they could belong to the same distribution.

By default it is a two-tailed test.

The function ttest_ind() takes two independent samples (they do not have to be the same size) and returns a result object containing the t-statistic and the p-value.



In [None]:
import numpy as np
from scipy.stats import ttest_ind


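# Two independent samples drawn from the same standard normal distribution
v1 = np.random.normal(size=100)
v2 = np.random.normal(size=100)

# Two-sample t-test on the means of v1 and v2
res = ttest_ind(v1, v2)
print(res)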

TtestResult(statistic=-0.9617417069530521, pvalue=0.337352393200281, df=198.0)


statistic (-0.9617): This is the t-statistic, which measures how far apart the two sample means are in units of the standard error. The null hypothesis is that there is no difference between the two means, so a value closer to 0 indicates less evidence against it.

pvalue (0.3373): This is the p-value, which is compared with alpha to decide whether the result is significant. A common threshold is 0.05. Since the p-value here (0.3373) is much higher than 0.05, the difference between the two means is not statistically significant, so there is no strong evidence to reject the null hypothesis.

df (198.0): This is the degrees of freedom, which depends on the sample sizes and affects the shape of the t-distribution. Here it is 198 because each sample has 100 observations, and for the default two-sample t-test df = n1 + n2 - 2 = 100 + 100 - 2.

**Since the p-value is greater than 0.05, there is not sufficient evidence to say there is a significant difference between the two groups' means.**
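To make the meaning of the t-statistic more concrete, the sketch below recomputes it by hand with the pooled-variance formula and compares it with the value from ttest_ind(). The seed, the sample sizes (80 and 120), and the shifted mean of the second sample are assumptions chosen for this example only.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
a = rng.normal(loc=0.0, size=80)    # hypothetical sample, mean near 0
b = rng.normal(loc=5.0, size=120)   # hypothetical sample, mean near 5

res = ttest_ind(a, b)  # default settings: equal variances assumed, two-tailed

# Pooled-variance t-statistic computed by hand.
n1, n2 = len(a), len(b)
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_manual = (a.mean() - b.mean()) / std_err

print(res.statistic, t_manual)   # the two values should agree
print(res.pvalue <= 0.05)        # the means differ, so this should be significant
```

Because the two means are far apart relative to the standard error, the t-statistic is large in magnitude and the p-value falls well below 0.05, the opposite of the result shown above.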

#KS-Test
The KS (Kolmogorov-Smirnov) test is used to check whether given values follow a given distribution.

It can be used as a one tailed or two tailed test.

By default it is two tailed. We can pass the parameter alternative as one of the strings 'two-sided', 'less', or 'greater'.





In [None]:
from scipy.stats import kstest

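# Sample drawn from a standard normal distribution
v = np.random.normal(size=100)

# Compare the sample against the standard normal distribution
res = kstest(v, 'norm')

print(res)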

KstestResult(statistic=0.06680891352534046, pvalue=0.7378439441392766, statistic_location=0.7830158115692747, statistic_sign=1)


This result comes from a Kolmogorov-Smirnov (K-S) test, which is used to determine whether a sample comes from a particular distribution (here, the standard normal distribution). The output fields mean the following:

statistic (0.0668): This is the K-S test statistic, which measures the maximum distance between the empirical cumulative distribution of the data and the cumulative distribution of the specified theoretical distribution. A lower value indicates that the distributions are more similar.

pvalue (0.7378): The p-value is the probability of seeing a test statistic at least this large if the sample really does come from the theoretical distribution. If the p-value is large (typically greater than 0.05), we fail to reject the null hypothesis, meaning the sample is consistent with the theoretical distribution. Here, the p-value of 0.7378 is large, so there is no significant difference between the sample data and the theoretical distribution.

statistic_location (0.7830): This is the point in the data where the maximum difference between the empirical and theoretical cumulative distributions occurs.

statistic_sign (1): This indicates the direction of the difference. A value of 1 means that the empirical cumulative distribution is above the theoretical one at the point of maximum difference, and -1 means that it is below.
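The K-S statistic described above can also be computed by hand, which may help show what the alternative parameter changes. The sketch below is an illustration only: the seed and the sample are assumptions, and the variable names d_plus and d_minus are made up for this example.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(2)
v = np.sort(rng.normal(size=100))   # hypothetical sample, sorted
n = len(v)
cdf = norm.cdf(v)                   # theoretical CDF at each sample point

# Largest gap with the empirical CDF above the theoretical CDF ...
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
# ... and with the empirical CDF below the theoretical CDF.
d_minus = np.max(cdf - np.arange(0, n) / n)
# The two-tailed statistic is the larger of the two gaps.
d = max(d_plus, d_minus)

print(d, kstest(v, 'norm').statistic)
print(d_plus, kstest(v, 'norm', alternative='greater').statistic)
print(d_minus, kstest(v, 'norm', alternative='less').statistic)
```

Each pair of printed values should match: the two-tailed test uses the larger gap, while 'greater' and 'less' each look at one direction only.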