# **Customer Churn Prediction and Analysis**
**Task: Hypothesis Testing - Statistical Analysis**

## **Hypothesis Testing**:
A statistical procedure that uses sample data to evaluate an assumption about a population parameter.

**Steps for conducting a hypothesis test:**

*   State the null hypothesis and the alternative hypothesis.
*   Choose a significance level.
*   Find the p-value.
    *   The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed.
*   Reject or fail to reject the null hypothesis.
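
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original analysis: the revenue figures are synthetic, and the target mean of 100 and the variable names are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): 40 daily revenue figures.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=103.0, scale=10.0, size=40)

# Step 1: state the hypotheses. H0: mean == 100, Ha: mean != 100.
hypothesized_mean = 100.0

# Step 2: choose a significance level.
alpha = 0.05

# Step 3: find the p-value (two-sided one-sample t-test).
result = stats.ttest_1samp(sample, popmean=hypothesized_mean)

# Step 4: reject or fail to reject the null hypothesis.
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f} -> {decision}")
```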

### **Null hypothesis vs Alternative hypothesis**

**The null hypothesis has the following characteristics:**

*   Null hypothesis is often abbreviated as H sub zero ($H_0$).
*   When written in mathematical terms, the null hypothesis always includes an equality symbol (usually =, but sometimes ≤ or ≥).
*   Null hypotheses often include phrases such as “no effect,” “no difference,” “no relationship,” or “no change.”
* **Claim**: There is no effect in the population.

**The alternative hypothesis has the following characteristics:**

*   Alternative hypothesis is often abbreviated as H sub a ($H_a$).
*   When written in mathematical terms, the alternative hypothesis always includes an inequality symbol (usually ≠, but sometimes < or >).
*   Alternative hypotheses often include phrases such as “an effect,” “a difference,” “a relationship,” or “a change.”
* **Claim**: There is an effect in the population.

### **Type I and type II errors**

\begin{array}{|c|c|c|} \hline
Action \backslash H_0 & TRUE & FALSE \\ \hline
Reject & Type \space I \space (FP) & Correct \space (TP) \\ \hline
Fail \space to \space Reject & Correct \space (TN) & Type \space II \space (FN) \\ \hline
\end{array}


* Reject the null hypothesis when it’s actually true (**Type I error**)

* Reject the null hypothesis when it’s actually false (Correct)

* Fail to reject the null hypothesis when it’s actually true (Correct)

* Fail to reject the null hypothesis when it’s actually false (**Type II error**)

**Type I error:**

* A Type I error, also known as a false positive, occurs when you reject a null hypothesis that is actually true. In other words, you conclude that your result is statistically significant when in fact it occurred by chance.
* The probability of making a Type I error is called alpha (α), which is also your significance level. Typically, the significance level is set at 0.05, or 5%, meaning you accept a 5% chance of rejecting the null hypothesis when it is actually true.
* To reduce your chance of making a Type I error, choose a lower significance level.

**Type II error:**

* A Type II error occurs when you fail to reject a null hypothesis which is actually false. In other words, you conclude your result occurred by chance, when in fact it didn’t.
* The probability of making a Type II error is called beta (β), and beta is related to the power of a hypothesis test (power = 1- β). Power refers to the likelihood that a test can correctly detect a real effect when there is one.
* You can reduce your risk of making a Type II error by ensuring your test has enough power. In data work, the target power is usually set at 0.80, or 80%. The higher the statistical power, the lower the probability of making a Type II error. To increase power, you can increase your sample size or raise your significance level, although a higher significance level also raises the risk of a Type I error.
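
A small simulation can make alpha and power concrete. The sketch below is an illustration under assumed settings (normal data, a sample size of 30, and a true effect of 0.5 standard deviations for the power estimate); none of these numbers come from the dataset analyzed later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
alpha = 0.05
n_trials = 2000
sample_size = 30

# Case 1: H0 is true (population mean really is 0).
# Every rejection here is a Type I error.
null_samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, sample_size))
null_p_values = stats.ttest_1samp(null_samples, popmean=0.0, axis=1).pvalue
type_i_rate = np.mean(null_p_values < alpha)

# Case 2: H0 is false (population mean is 0.5, an assumed effect size).
# Every rejection here is a correct detection, so the rate estimates power.
effect_samples = rng.normal(loc=0.5, scale=1.0, size=(n_trials, sample_size))
effect_p_values = stats.ttest_1samp(effect_samples, popmean=0.0, axis=1).pvalue
power = np.mean(effect_p_values < alpha)

print(f"Estimated Type I error rate: {type_i_rate:.3f}")
print(f"Estimated power: {power:.3f}, estimated beta: {1 - power:.3f}")
```

The estimated Type I error rate should land close to the chosen alpha, while the power estimate depends on the assumed effect size and sample size.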



## **Hypothesis Tests**:

### **One-Sample tests:**

*  A one-sample test determines whether or not a population parameter, like a mean or proportion, is equal to a specific value.
* Example problem: determine whether a company's average sales revenue is equal to a target value.
* Most commonly used tests: z-test (test statistic: z-score), t-test (test statistic: t-score)
  * The test statistic is a value that shows how closely your observed data matches the distribution expected under the null hypothesis.

**One-sample z-test assumptions (mean):**

* Sample size > 30
* The data is a random sample of a **normally-distributed** population
* The population **standard deviation is known**.
* Test statistic z-score:

\begin{align}
z = \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}
\end{align}

* For a left-tailed test, the p-value is the probability of observing a test statistic as low as or lower than the observed z-score. For a right-tailed test, it is the probability of a test statistic as high as or higher than the observed z-score. For a two-tailed test, it is the probability of a test statistic at least as extreme in either direction.
* If p-value < significance level, reject the null hypothesis; otherwise, fail to reject it. A common significance level is 5% (0.05).
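
As a worked example of the z-score formula and the three kinds of p-value, the sketch below uses made-up summary numbers (a sample mean of 52, a hypothesized mean of 50, a known population standard deviation of 8, and n = 64). All values are assumptions chosen so the arithmetic is easy to follow.

```python
import math
from scipy import stats

# Hypothetical summary statistics (assumptions for illustration).
sample_mean = 52.0   # observed sample mean
mu_0 = 50.0          # mean under the null hypothesis
sigma = 8.0          # known population standard deviation
n = 64               # sample size

# z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# p-values from the standard normal distribution.
p_left = stats.norm.cdf(z)                 # left-tailed: P(Z <= z)
p_right = stats.norm.sf(z)                 # right-tailed: P(Z >= z)
p_two_sided = 2 * stats.norm.sf(abs(z))    # two-tailed: both directions

alpha = 0.05
reject_two_sided = p_two_sided < alpha
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}, reject H0: {reject_two_sided}")
```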

**One-sample t-test assumptions:**

* Sample size <= 30
* The data is a random sample of a **normally-distributed** population
* The population **standard deviation is unknown**.
* Test statistic t-score:

\begin{align}
t = \frac{\bar{X}-\mu}{\frac{S}{\sqrt{n}}}
\end{align}
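
The t-score formula can be checked against `scipy.stats.ttest_1samp`. The eight measurements and the hypothesized mean of 10 below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample (an assumption for illustration).
sample = np.array([9.8, 10.4, 10.1, 9.6, 10.9, 10.3, 9.9, 10.6])
mu_0 = 10.0  # mean under the null hypothesis

n = sample.size
s = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t = (sample mean - hypothesized mean) / (s / sqrt(n))
t_manual = (sample.mean() - mu_0) / (s / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test using SciPy's built-in function.
result = stats.ttest_1samp(sample, popmean=mu_0)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```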

**One-sample z-test assumptions (proportion):**

* The data are a simple random sample from the population.
* The population follows a binomial distribution.
* Both np and n(1-p) are >= 10, so the binomial distribution can be approximated by the normal distribution.

* Test statistic z-score:

\begin{align}
z = \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}
\end{align}
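
A short sketch of the one-sample proportion z-test, using invented counts (92 successes out of 400 trials, tested against a hypothesized proportion of 0.20):

```python
import math
from scipy import stats

# Hypothetical counts (assumptions for illustration).
n = 400            # sample size
successes = 92     # observed number of successes
p_0 = 0.20         # proportion under the null hypothesis

p_hat = successes / n  # observed sample proportion

# Check the normal-approximation condition: np >= 10 and n(1 - p) >= 10.
approximation_ok = (n * p_0 >= 10) and (n * (1 - p_0) >= 10)

# z = (p_hat - p) / sqrt(p(1 - p) / n)
z = (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / n)

# Two-sided p-value from the standard normal distribution.
p_value = 2 * stats.norm.sf(abs(z))
print(f"approximation ok: {approximation_ok}, z = {z:.2f}, p = {p_value:.4f}")
```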

### **Two-Sample tests:**

* A two-sample test determines whether or not two population parameters, such as two means or two proportions, are equal to each other.

* $H_0$: There is no difference between the means of both populations.
* $H_a$: There is a difference between the means of both populations.

**Two-sample z-test**:
* The two samples are independent of each other.
* For each sample, the data is drawn randomly from a normally distributed population.
* The population standard deviation is **known**.
* Sample size > 30.
* Test statistic z-score:

\begin{align}
z = \frac{\bar{X_1} - \bar{X_2}}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}
\end{align}

**Two-sample t-test**:
* The two samples are independent of each other.
* For each sample, the data is drawn randomly from a normally distributed population.
* The population standard deviation is **unknown**.
* Sample size <= 30.
* Test statistic t-score:

\begin{align}
t = \frac{\bar{X_1} - \bar{X_2}}{\sqrt{\frac{{s_1}^2}{n_1} + \frac{{s_2}^2}{n_2}}}
\end{align}
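
The two-sample t-score above uses a separate variance for each sample, which corresponds to `equal_var=False` (Welch's t-test) in `scipy.stats.ttest_ind`. The sketch below compares the manual formula with SciPy on synthetic data; the group means, spreads, and sizes are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumptions for illustration).
rng = np.random.default_rng(seed=1)
group_1 = rng.normal(loc=68.0, scale=12.0, size=25)
group_2 = rng.normal(loc=66.0, scale=15.0, size=28)

# Sample variances (ddof=1) and sample sizes.
var_1, var_2 = group_1.var(ddof=1), group_2.var(ddof=1)
n_1, n_2 = group_1.size, group_2.size

# t = (mean_1 - mean_2) / sqrt(s_1^2 / n_1 + s_2^2 / n_2)
standard_error = np.sqrt(var_1 / n_1 + var_2 / n_2)
t_manual = (group_1.mean() - group_2.mean()) / standard_error

# Welch's t-test: does not assume equal population variances.
result = stats.ttest_ind(group_1, group_2, equal_var=False)
print(f"manual t = {t_manual:.4f}, scipy t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```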



### **A/B test:**

* An A/B test (also known as split testing or bucket testing) is a controlled experiment used in marketing, product development, and other fields to compare two or more versions of a product or service. The primary goal of an A/B test is to determine which version performs better in terms of a specific metric or key performance indicator (KPI).

* The A/B test involves dividing the users or participants into two (or more) groups: the control group (A) and the experimental group (B). The control group is exposed to the existing version of the product or service, while the experimental group is exposed to a modified version (the variation). The variations can involve changes to the user interface, design, content, pricing, or any other aspect of the product or service being tested.

* A typical A/B test has at least three main features:

 * Test design

 * Sampling

 * Hypothesis testing

### **Chi-square test:**

\begin{array}{|c|c|} \hline
t \space tests & \chi^2 \\ \hline
Null \space \& \space Alternative \space Hypothesis & Null \space \& \space Alternative \space Hypothesis \\ \hline
Continuous \space Data & Categorical \space Data \\ \hline
\end{array}

**Chi-Squared Goodness of fit test:**
* The chi-squared goodness of fit test determines whether an observed categorical variable follows an expected distribution (counts per category).

* $H_0$: The variable follows the expected distribution.
* $H_a$: The variable does NOT follow the expected distribution.

* Test statistic $\chi^2$:

\begin{align}
\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}
\end{align}
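
As an illustration of the goodness of fit statistic, the sketch below tests whether 100 hypothetical observations are spread evenly across five categories. The counts are invented for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts per category (assumption for illustration).
observed = np.array([18, 22, 20, 25, 15])

# Expected counts under H0: all five categories are equally likely.
expected = np.full(observed.size, observed.sum() / observed.size)

# chi^2 = sum((Observed - Expected)^2 / Expected)
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# The same test with SciPy (degrees of freedom = categories - 1).
result = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"manual chi2 = {chi2_manual:.2f}, scipy chi2 = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```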

**Chi-Squared test of Independence:**
* The chi-squared test for independence determines whether or not two categorical variables are associated with each other.

* $H_0$: The variables are independent and are not associated with each other.
* $H_a$: The variables are NOT independent and are associated with each other.

\begin{array}{|c|c|c|c|} \hline
 & Var \space A \space Type \space I & Var \space A \space Type \space II & Total \\ \hline
Var \space B \space Type \space I & Count_{11} & Count_{12} & R_1 \\ \hline
Var \space B \space Type \space II  & Count_{21} & Count_{22} & R_2\\ \hline
Total & C_1 & C_2 & T \\ \hline
\end{array}

* $Expected_{ij} = E_{ij} = \frac{R_iC_j}{T}$
* $Observed_{ij} = Count_{ij}$

* Test statistic $\chi^2$:

\begin{align}
\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}
\end{align}
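
The expected counts and the test statistic for a 2x2 contingency table can be sketched as follows. The counts are invented, and `correction=False` is passed so that SciPy's result matches the plain formula above (by default, SciPy applies Yates' continuity correction to 2x2 tables).

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of observed counts (assumption for illustration).
# Rows: categories of variable B; columns: categories of variable A.
observed = np.array([[30, 20],
                     [20, 30]])

row_totals = observed.sum(axis=1, keepdims=True)   # R_i
col_totals = observed.sum(axis=0, keepdims=True)   # C_j
total = observed.sum()                             # T

# E_ij = R_i * C_j / T
expected = row_totals * col_totals / total

# chi^2 = sum((Observed - Expected)^2 / Expected)
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# The same test with SciPy, without the continuity correction.
chi2, p_value, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```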


### **Analysis of Variance (ANOVA) test:**

*  Analysis of Variance, commonly called ANOVA, is a group of statistical techniques that test the difference of means between three or more groups.

**One-way ANOVA test:**

One-way ANOVA testing compares the means of one continuous dependent variable in three or more groups.

* $H_0: \mu_1 = \mu_2 = \space ... \space = \mu_N$ i.e., the means of each group are equal.

* $H_1: NOT \space \mu_1 = \mu_2 = \space ... \space = \mu_N$ i.e., the means of each group are not all equal.
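
A one-way ANOVA can be run with `scipy.stats.f_oneway`. The three groups below are synthetic, with one group deliberately given a higher mean; the group names and parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic groups (assumptions for illustration): groups A and B share a
# mean of 50, while group C has a mean of 58.
rng = np.random.default_rng(seed=7)
group_a = rng.normal(loc=50.0, scale=5.0, size=30)
group_b = rng.normal(loc=50.0, scale=5.0, size=30)
group_c = rng.normal(loc=58.0, scale=5.0, size=30)

# H0: all group means are equal. H1: the group means are not all equal.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"F = {f_stat:.2f}, p = {p_value:.6f} -> {decision}")
```

A significant result only indicates that the means are not all equal; it does not say which groups differ.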

**Two-way ANOVA test:**

* Two-way ANOVA testing compares the means of one continuous dependent variable across the groups defined by two categorical independent variables.

* Two-way ANOVA Hypotheses:

\begin{array}{|c|c|c|} \hline
 & Null \space Hypothesis \space (H_0) & Alternative \space Hypothesis \space (H_1)\\ \hline
 Var \space A & Equal \space mean \space X & Not \space all \space equal \space mean \space X  \\ \hline
 Var \space B & Equal \space mean \space X & Not \space all \space equal \space mean \space X \\ \hline
 Var \space A \space \& \space Var \space B  & The \space effect \space of \space A \space on \space X  \space is & Interaction \space effect  \\
 Interaction \space Effect & independent \space of \space the \space effect \space of \space B \space and \space vice \space versa.  & between \space A \space and \space B \space on \space X  \\ \hline
\end{array}

* An ANOVA post-hoc test performs pairwise comparisons between all available groups while controlling for the error rate.
* This control matters because the probability of making at least one Type I error increases rapidly with the number of tests performed.
* One of the most common ANOVA post-hoc tests is Tukey's HSD (honestly significant difference) test.
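
The sketch below first shows how quickly the chance of at least one false positive grows across several tests (assuming, for simplicity, that the tests are independent), and then runs Tukey's HSD on synthetic groups. It assumes SciPy 1.8 or later, where `scipy.stats.tukey_hsd` is available; the group data are invented for illustration.

```python
import numpy as np
from scipy import stats

# Chance of at least one Type I error across m independent tests at level alpha.
alpha = 0.05
n_groups = 3
n_pairs = n_groups * (n_groups - 1) // 2           # number of pairwise comparisons
family_wise_error = 1 - (1 - alpha) ** n_pairs
print(f"{n_pairs} comparisons -> family-wise error rate = {family_wise_error:.4f}")

# Synthetic groups (assumptions for illustration): only group C differs.
rng = np.random.default_rng(seed=7)
group_a = rng.normal(loc=50.0, scale=5.0, size=30)
group_b = rng.normal(loc=50.0, scale=5.0, size=30)
group_c = rng.normal(loc=58.0, scale=5.0, size=30)

# Tukey's HSD compares every pair of groups while controlling the
# family-wise error rate. result.pvalue[i, j] is the p-value for group i vs j.
result = stats.tukey_hsd(group_a, group_b, group_c)
print(result.pvalue)
```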




### **ANCOVA, MANOVA, and MANCOVA:**
**Analysis of covariance or ANCOVA:**
*  ANCOVA is a statistical technique that compares means between three or more groups while accounting for the effects of one or more continuous covariates.
* Null hypothesis or $H_0$: The means of the dependent variable (Y) are equal for all categories of the independent variable (A), after controlling for the covariate (Z).
* Alternative hypothesis or $H_1$: The means of the dependent variable (Y) are not all equal across the categories of the independent variable (A), after controlling for the covariate (Z).
* Unlike linear regression, ANCOVA focuses on categorical variables and controls for covariates' influence on the dependent variable.

**Multivariate analysis of variance or MANOVA:**

*  MANOVA is an extension of ANOVA that compares how two or more continuous outcome variables vary based on categorical independent variables.
* The two common versions are one-way and two-way MANOVA, where the independent variable must be categorical, and outcome variables must be continuous.
* One-way MANOVA:
  * $H_0:$ The continuous outcome variables (Y and Z) have the same means for each category of the independent variable (A).
  * $H_1:$ The means of the continuous outcome variables (Y and Z) are not the same for each category of the independent variable (A).

**Multivariate analysis of covariance or MANCOVA:**
* MANCOVA extends ANCOVA and MANOVA by comparing how two or more continuous outcome variables vary based on categorical independent variables while controlling for covariates.
* $H_0:$ The continuous outcome variables (Y and Z) have the same means for all categories of the independent variable (A), after controlling for another continuous variable (X).
* $H_1:$ The means of the continuous outcome variables (Y and Z) are not the same for all categories of the independent variable (A), after controlling for another continuous variable (X).

**TO DO: Statistical analysis of ride data based on device type.**

  * **QUESTION**: Is there a statistically significant difference in the mean number of drives between iPhone users and Android users? That is, "Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?"
  * **SOLUTION**: Conduct a two-sample hypothesis test (t-test) to analyze the difference in the mean number of drives between iPhone users and Android users.


In [None]:
import pandas as pd
from scipy import stats

In [None]:
df = pd.read_csv('waze_dataset.csv')

In [None]:
# Encode the device labels as integers: iPhone -> 1, Android -> 2
map_dictionary = {'Android': 2, 'iPhone': 1}

# Create a new `device_type` column holding the encoded values
df['device_type'] = df['device'].map(map_dictionary)

df['device_type'].head()

0    2
1    1
2    2
3    1
4    2
Name: device_type, dtype: int64

In [None]:
df.groupby('device_type')['drives'].mean()

device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64

### Observation:

From the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, we conduct a hypothesis test.


### **Hypothesis testing**

**Note:** This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).

**Hypotheses:**

$H_0$: There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

$H_A$: There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

Significance level = 5%

In [None]:
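# Isolate the `drives` column for each device type
iPhone = df[df['device_type'] == 1]['drives']
Android = df[df['device_type'] == 2]['drives']

# Two-sample t-test; equal_var=False does not assume equal population variances (Welch's t-test)
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)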

Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)

**Observations:**

Since the p-value is larger than the chosen significance level (5%), we fail to reject the null hypothesis. We conclude that there is **not** a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Androids.

**Conclusion:**

* The key business insight is that drivers who use iPhone devices on average have a similar number of drives as those who use Androids.

* One potential next step is to explore what other factors influence the variation in the number of drives, and run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or user interface for the Waze app may provide more data to investigate churn.