<a href="https://colab.research.google.com/github/sulaimonao/sulaimonao/blob/main/MasterSchool_GloBox_AB_TEST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Conduct a hypothesis test to see whether there is a difference in the conversion rate between the two groups. What are the resulting p-value and conclusion?**

Data Exploration: Quick check of the data to understand its structure and contents.

In [None]:
import pandas as pd

# Load the dataset
file_path = '/content/cleaned_data.csv'  # Replace with your file path
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(data.head())


   user_id country gender device_type test_group converted  total_spent
0  1000000     CAN      M           I          B        No          0.0
1  1000001     BRA      M           A          A        No          0.0
2  1000002     FRA      M           A          A        No          0.0
3  1000003     BRA      M           I          B        No          0.0
4  1000004     DEU      F           A          A        No          0.0


Data Preparation:

Convert the 'converted' column to a binary format (e.g., Yes = 1, No = 0) for ease of analysis.

Ensure that the 'total_spent' column is numeric; any value that cannot be parsed becomes NaN.

In [None]:
# Data Preparation
data['converted'] = data['converted'].map({'Yes': 1, 'No': 0})  # Convert to binary

# Ensure 'total_spent' is numeric
data['total_spent'] = pd.to_numeric(data['total_spent'], errors='coerce')

# Display the updated dataframe
print(data.head())

# Save the modified dataset to a new CSV file
new_file_path = '/content/modified_dataset.csv'
data.to_csv(new_file_path, index=False)

print("Modified dataset saved as 'modified_dataset.csv'")


   user_id country gender device_type test_group  converted  total_spent
0  1000000     CAN      M           I          B          0          0.0
1  1000001     BRA      M           A          A          0          0.0
2  1000002     FRA      M           A          A          0          0.0
3  1000003     BRA      M           I          B          0          0.0
4  1000004     DEU      F           A          A          0          0.0
Modified dataset saved as 'modified_dataset.csv'


H_0: There is no difference in conversion rates between the treatment and control groups.

H_1: There is a difference in conversion rates between the treatment and control groups.


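Since the conversion outcome is categorical (converted vs. not converted), a Chi-square test of independence is appropriate. It tests whether conversion is independent of group assignment, that is, whether the two groups' conversion rates differ by more than chance alone would explain. Note that `chi2_contingency` applies Yates' continuity correction by default for a 2x2 table.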

Creating a contingency table from data to perform the Chi-square test.

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency

# Load the data
data = pd.read_csv('/content/cleaned_data.csv')

# Convert 'converted' to binary format for analysis
data['converted'] = data['converted'].map({'Yes': 1, 'No': 0})

# Creating a contingency table
contingency_table = pd.crosstab(data['test_group'], data['converted'])

# Performing the Chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output the results
print("Chi-square statistic:", chi2)
print("p-value:", p)


Chi-square statistic: 14.309633731848777
p-value: 0.00015506924632928225


Since the p-value (0.000155) is less than 0.05, we reject the null hypothesis.

This suggests that there is a statistically significant difference in conversion rates between the treatment and control groups.

Conclusion:

The analysis indicates that the treatment group's conversion rate significantly differs from that of the control group. This finding supports the alternative hypothesis (H1) that there is a difference in conversion rates between the two groups.
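As a cross-check on the Chi-square result, the same question can be posed as a pooled two-proportion z-test. The sketch below is illustrative only: the function name `two_proportion_z_test` and the counts passed to it are hypothetical placeholders, not GloBox figures. For a 2x2 table without continuity correction, the Chi-square statistic equals z squared; because `chi2_contingency` applies Yates' correction by default, its output will be close to, but not exactly, that value.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for H0: p_a == p_b. Returns (z, p_value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both groups share one rate, so pool the counts for the standard error
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts for illustration only (not the GloBox figures)
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

With the real data, the counts would come from the contingency table built in the cell above.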

**What is the 95% confidence interval for the difference in the conversion rate between the treatment and control (treatment-control)?**

Calculate Conversion Rates: First, find the conversion rates for both the treatment and control groups.

Compute the Standard Error: Calculate the standard error for the difference in conversion rates.

Calculate the Margin of Error: Use the standard error to compute the margin of error at a 95% confidence level.

Determine the Confidence Interval: The confidence interval is the difference in conversion rates ± the margin of error.
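Written out, the four steps above amount to the interval below. This is a sketch in standard notation; the symbols $\hat{p}_T$, $\hat{p}_C$ (sample conversion rates of treatment and control) and $n_T$, $n_C$ (group sizes) are introduced here for illustration and are not names used in the notebook.

```latex
\mathrm{SE} = \sqrt{\frac{\hat{p}_T(1-\hat{p}_T)}{n_T} + \frac{\hat{p}_C(1-\hat{p}_C)}{n_C}},
\qquad
\mathrm{CI}_{95\%} = (\hat{p}_T - \hat{p}_C) \pm z_{0.975}\,\mathrm{SE},
\qquad z_{0.975} \approx 1.96
```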

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import norm

# Load the data
data = pd.read_csv('/content/cleaned_data.csv')

# Convert 'converted' to binary format
data['converted'] = data['converted'].map({'Yes': 1, 'No': 0})

# Calculate conversion rates
conv_rate_control = data[data['test_group'] == 'A']['converted'].mean()
conv_rate_treatment = data[data['test_group'] == 'B']['converted'].mean()

# Calculate the standard error
n_control = len(data[data['test_group'] == 'A'])
n_treatment = len(data[data['test_group'] == 'B'])
se = np.sqrt((conv_rate_control * (1 - conv_rate_control) / n_control) +
             (conv_rate_treatment * (1 - conv_rate_treatment) / n_treatment))

# Calculate the margin of error for a 95% confidence interval
z_score = norm.ppf(0.975)  # 97.5% percentile of the normal distribution for two-tailed
margin_of_error = z_score * se

# Calculate the confidence interval
ci_lower = (conv_rate_treatment - conv_rate_control) - margin_of_error
ci_upper = (conv_rate_treatment - conv_rate_control) + margin_of_error

# Output the results
print("95% Confidence Interval for the difference in conversion rate: [{}, {}]".format(ci_lower, ci_upper))


95% Confidence Interval for the difference in conversion rate: [0.0033908772937509446, 0.010585996381550482]


**Confidence Interval: [0.003391, 0.010586]**

This means that with 95% confidence, the true difference in conversion rates (treatment minus control) lies between 0.003391 and 0.010586, i.e. roughly 0.34 to 1.06 percentage points. To interpret this further:

**Positive Interval:** Since both lower and upper bounds of the interval are positive, it suggests that the treatment group likely has a higher conversion rate than the control group.

**Statistical Significance:** This interval does not contain 0, which supports the earlier finding that the difference in conversion rates is statistically significant.
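The link between "the interval excludes 0" and "the difference is significant at the 5% level" can be sketched with a small helper. The function name `diff_in_proportions_ci` and the counts below are hypothetical and chosen only for illustration; the calculation mirrors the unpooled standard error used in the cell above, using only the standard library.

```python
from math import sqrt
from statistics import NormalDist

def diff_in_proportions_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Unpooled normal-approximation CI for (treatment rate - control rate)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Unpooled standard error: each group contributes its own variance
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Hypothetical counts for illustration only (not the GloBox figures)
lower, upper = diff_in_proportions_ci(conv_c=100, n_c=1000, conv_t=130, n_t=1000)
excludes_zero = lower > 0 or upper < 0
print(f"CI = [{lower:.4f}, {upper:.4f}], excludes zero: {excludes_zero}")
```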

**Conduct a hypothesis test to see whether there is a difference in the average amount spent per user between the two groups. What are the resulting p-value and conclusion?**

Use the t distribution and a 5% significance level. Assume unequal variance.
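For reference, the statistic behind a Welch (unequal-variance) t-test can be written as below. The notation is an assumption introduced here: $\bar{x}_T$, $\bar{x}_C$ are the sample means, $s_T^2$, $s_C^2$ the sample variances, and $n_T$, $n_C$ the group sizes. The sign of $t$ depends only on which group is subtracted from which.

```latex
t = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_C^2}{n_C}}}
```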

In [None]:
import pandas as pd
from scipy.stats import ttest_ind

# Load the data
data = pd.read_csv('/content/cleaned_data.csv')

# Isolate the total spent by each group
total_spent_control = data[data['test_group'] == 'A']['total_spent']
total_spent_treatment = data[data['test_group'] == 'B']['total_spent']

# Perform Welch's t-test
t_stat, p_value = ttest_ind(total_spent_control, total_spent_treatment, equal_var=False)

# Output the results
print("T-Statistic:", t_stat)
print("p-value:", p_value)


T-Statistic: -0.1524754422215445
p-value: 0.8788125906384138


**T-Statistic**: -0.1525 - This value gives the direction and size of the difference between the group means relative to its standard error. Because the control sample is passed first, the negative sign means the control mean is slightly lower than the treatment mean.

**p-value:** 0.8788 - This is far above the alpha level of 0.05 (5% significance level).

**High p-value:** Because the p-value (0.8788) is much higher than 0.05, we fail to reject the null hypothesis.

**No Statistical Significance:** This suggests that there is no statistically significant difference in the average amount spent per user between the treatment and control groups.

**Conclusion:**

The analysis indicates that, in terms of the average amount spent per user, the treatment does not significantly differ from the control. This means that whatever changes or treatments were applied to the treatment group did not have a significant impact on how much users spent, compared to the control group.

**What is the 95% confidence interval for the difference in the average amount spent per user between the treatment and the control (treatment-control)?**

Use the t distribution and assume unequal variance.
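The interval uses the Welch-Satterthwaite approximation for the degrees of freedom. In the notation assumed here ($s_T^2$, $s_C^2$ sample variances; $n_T$, $n_C$ group sizes; $\bar{x}_T$, $\bar{x}_C$ sample means), the quantities computed in the following cell are:

```latex
\nu = \frac{\left(\dfrac{s_T^2}{n_T} + \dfrac{s_C^2}{n_C}\right)^{2}}
           {\dfrac{s_T^4}{n_T^2\,(n_T - 1)} + \dfrac{s_C^4}{n_C^2\,(n_C - 1)}},
\qquad
\mathrm{CI}_{95\%} = (\bar{x}_T - \bar{x}_C) \pm t_{0.975,\,\nu}\,\sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}
```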

In [None]:
import pandas as pd
from scipy.stats import t
from numpy import sqrt

# Load the data
data = pd.read_csv('/content/cleaned_data.csv')

# Calculate mean and standard deviation for each group
mean_control = data[data['test_group'] == 'A']['total_spent'].mean()
std_control = data[data['test_group'] == 'A']['total_spent'].std()
n_control = len(data[data['test_group'] == 'A'])

mean_treatment = data[data['test_group'] == 'B']['total_spent'].mean()
std_treatment = data[data['test_group'] == 'B']['total_spent'].std()
n_treatment = len(data[data['test_group'] == 'B'])

# Calculate the standard error of the difference in means
se_diff = sqrt((std_control**2 / n_control) + (std_treatment**2 / n_treatment))

# Degrees of freedom for Welch's t-test
df = (std_control**2 / n_control + std_treatment**2 / n_treatment)**2 / ((std_control**4 / (n_control**2 * (n_control - 1))) + (std_treatment**4 / (n_treatment**2 * (n_treatment - 1))))

# Calculate the margin of error
t_critical = t.ppf(0.975, df)  # 97.5th percentile of the t-distribution
margin_of_error = t_critical * se_diff

# Calculate the confidence interval
ci_lower = (mean_treatment - mean_control) - margin_of_error
ci_upper = (mean_treatment - mean_control) + margin_of_error

# Output the results
print("95% Confidence Interval for the difference in average amount spent: [{}, {}]".format(ci_lower, ci_upper))


95% Confidence Interval for the difference in average amount spent: [-0.4169734959233784, 0.48732138605483577]


**Interpretation of the Confidence Interval:**

**Interval Range**: The interval ranges from approximately -0.41697 to 0.48732.

**Contains Zero**: Notably, this confidence interval includes zero. This inclusion suggests that the difference in average spending between the two groups could be zero, indicating no effect.

**Implications:**

**No Significant Difference**: The fact that zero is within this interval aligns with the earlier hypothesis test result, which did not find a statistically significant difference in average spending between the groups.

**Practical Consideration**: The point estimate of the difference is about 0.035 (treatment minus control), but the interval spans both a decrease and an increase, so the data do not establish even the direction of any effect.

**Conclusion:**

Based on this confidence interval, it's prudent to conclude that the treatment did not have a statistically significant impact on the average amount spent per user when compared to the control group. In practical terms, this means that whatever intervention or changes were implemented in the treatment group did not significantly alter the spending behavior compared to the control group.
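The pieces of the Welch calculation can be collected into one small helper, which avoids repeating the formulas in each cell. This is a minimal sketch using only the standard library; the function name `welch_summary` and the sample values are hypothetical and not taken from the dataset. The critical value for the interval would still come from `scipy.stats.t.ppf(0.975, df)`, as in the cell above.

```python
from math import sqrt
from statistics import mean, variance

def welch_summary(control, treatment):
    """Return (difference, standard error, t statistic, Welch df) for treatment - control."""
    n_c, n_t = len(control), len(treatment)
    # statistics.variance is the sample variance (n - 1 denominator), like pandas .std() ** 2
    v_c = variance(control) / n_c
    v_t = variance(treatment) / n_t
    diff = mean(treatment) - mean(control)
    se = sqrt(v_c + v_t)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v_c + v_t) ** 2 / (v_c ** 2 / (n_c - 1) + v_t ** 2 / (n_t - 1))
    return diff, se, diff / se, df

# Hypothetical spend values for illustration only (not GloBox data)
control = [0.0, 0.0, 10.0, 20.0]
treatment = [0.0, 5.0, 15.0, 40.0]
diff, se, t_stat, df = welch_summary(control, treatment)
print(f"diff = {diff:.2f}, SE = {se:.3f}, t = {t_stat:.3f}, df = {df:.2f}")
```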

**ADVANCED TASKS**

**Calculate a 95% confidence interval for the difference in the conversion rate between the treatment and control (treatment-control).**

Use the normal distribution and unpooled proportions for the standard error.

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

# Load the data
data = pd.read_csv('/content/cleaned_data.csv')


# Replace non-numeric values in the 'converted' column
data['converted'] = data['converted'].replace({'No': 0, 'Yes': 1})

# Separate treatment and control groups based on 'test_group' column
treatment = data[data['test_group'] == 'B']  # "B" is the treatment group, as in the earlier cells
control = data[data['test_group'] == 'A']  # "A" is the control group, as in the earlier cells

# Calculate conversion rates for treatment and control groups
conversion_rate_treatment = treatment['converted'].mean()
conversion_rate_control = control['converted'].mean()

# Calculate the number of observations (sample sizes) in each group
n_treatment = len(treatment)
n_control = len(control)

# Calculate the standard error for treatment and control groups
se_treatment = np.sqrt(conversion_rate_treatment * (1 - conversion_rate_treatment) / n_treatment)
se_control = np.sqrt(conversion_rate_control * (1 - conversion_rate_control) / n_control)

# Calculate the standard error for the difference in conversion rates (unpooled)
se_diff = np.sqrt(se_treatment**2 + se_control**2)

# Calculate the margin of error using the normal distribution (z-score for 95% confidence)
z_score = 1.96  # Rounded critical value for a 95% confidence interval
margin_of_error = z_score * se_diff

# Calculate the confidence interval
ci_lower = (conversion_rate_treatment - conversion_rate_control) - margin_of_error
ci_upper = (conversion_rate_treatment - conversion_rate_control) + margin_of_error

# Output the results
print("95% Confidence Interval for the difference in conversion rates (treatment - control): [{}, {}]".format(ci_lower, ci_upper))


95% Confidence Interval for the difference in conversion rates (treatment - control): [0.0033908111865353636, 0.010586062488766063]


**Conduct a hypothesis test to see whether there is a difference in the average amount spent per user between the two groups. What would be the null and alternative hypotheses?**
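In symbols, with $\mu_T$ and $\mu_C$ denoting the population mean amount spent per user in the treatment and control groups (notation assumed here for illustration), the two-sided hypotheses are:

```latex
H_0:\ \mu_T = \mu_C
\qquad\text{versus}\qquad
H_1:\ \mu_T \neq \mu_C
```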

In [None]:
import pandas as pd
import scipy.stats as stats

# Load the data
data = pd.read_csv('/content/cleaned_data.csv')

# Separate treatment and control groups based on 'test_group' column
treatment = data[data['test_group'] == 'B']  # "B" is the treatment group, as in the earlier cells
control = data[data['test_group'] == 'A']  # "A" is the control group, as in the earlier cells

# Define the two samples
sample_treatment = treatment['total_spent']
sample_control = control['total_spent']

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(sample_treatment, sample_control, equal_var=False)  # Welch's t-test (unequal variances)

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print("Null Hypothesis (H0): The average amount spent per user is the same in the two groups.")
print("Alternative Hypothesis (H1): The average amount spent per user differs between the two groups.")
print(f"Test Statistic: {t_statistic}")
print(f"P-value: {p_value}")

if p_value < alpha:
    print("Result: Reject the null hypothesis. The difference is statistically significant.")
else:
    print("Result: Fail to reject the null hypothesis. The difference is not statistically significant.")


Null Hypothesis (H0): The average amount spent per user is the same in the two groups.
Alternative Hypothesis (H1): The average amount spent per user differs between the two groups.
Test Statistic: 0.1524754422215445
P-value: 0.8788125906384138
Result: Fail to reject the null hypothesis. The difference is not statistically significant.


**Conduct a hypothesis test to see whether there is a difference in the average amount spent per user between the two groups. What are the resulting p-value and conclusion?**

Use the t distribution and a 5% significance level. Assume unequal variance.

In [None]:
import pandas as pd
import scipy.stats as stats

# Load the data
data = pd.read_csv('/content/cleaned_data.csv')

# Separate treatment and control groups based on 'test_group' column
treatment = data[data['test_group'] == 'B']  # "B" is the treatment group, as in the earlier cells
control = data[data['test_group'] == 'A']  # "A" is the control group, as in the earlier cells

# Define the two samples
sample_treatment = treatment['total_spent']
sample_control = control['total_spent']

# Perform a two-sample t-test with unequal variance (Welch's t-test)
t_statistic, p_value = stats.ttest_ind(sample_treatment, sample_control, equal_var=False)

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print("Null Hypothesis (H0): The average amount spent per user is the same in the two groups.")
print("Alternative Hypothesis (H1): The average amount spent per user differs between the two groups.")
print(f"Test Statistic: {t_statistic}")
print(f"P-value: {p_value}")

if p_value < alpha:
    print("Result: Reject the null hypothesis. The difference in the average amount spent is statistically significant.")
else:
    print("Result: Fail to reject the null hypothesis. The difference in the average amount spent is not statistically significant.")


Null Hypothesis (H0): The average amount spent per user is the same in the two groups.
Alternative Hypothesis (H1): The average amount spent per user differs between the two groups.
Test Statistic: 0.1524754422215445
P-value: 0.8788125906384138
Result: Fail to reject the null hypothesis. The difference in the average amount spent is not statistically significant.


**Calculate a 95% confidence interval for the difference in the average amount spent per user between the treatment and the control (treatment-control).**

Use the t distribution and assume unequal variance.

In [None]:
import pandas as pd
import scipy.stats as stats

# Load the data
data = pd.read_csv('/content/cleaned_data.csv')

# Separate treatment and control groups based on 'test_group' column
treatment = data[data['test_group'] == 'B']  # "B" is the treatment group, as in the earlier cells
control = data[data['test_group'] == 'A']  # "A" is the control group, as in the earlier cells

# Define the two samples
sample_treatment = treatment['total_spent']
sample_control = control['total_spent']

# Calculate the mean and standard deviation for each group
mean_treatment = sample_treatment.mean()
mean_control = sample_control.mean()
std_treatment = sample_treatment.std()
std_control = sample_control.std()

# Calculate the sample sizes
n_treatment = len(sample_treatment)
n_control = len(sample_control)

# Calculate the standard error for the difference in means
se_diff = ((std_treatment**2 / n_treatment) + (std_control**2 / n_control))**0.5

# Calculate the degrees of freedom (Welch-Satterthwaite approximation)
df = (std_treatment**2 / n_treatment + std_control**2 / n_control)**2 / ((std_treatment**4 / (n_treatment**2 * (n_treatment - 1))) + (std_control**4 / (n_control**2 * (n_control - 1))))

# Calculate the margin of error
t_critical = abs(stats.t.ppf(0.025, df))  # Absolute value of the 2.5th percentile, i.e. the two-sided 95% critical value
margin_of_error = t_critical * se_diff

# Calculate the confidence interval
ci_lower = (mean_treatment - mean_control) - margin_of_error
ci_upper = (mean_treatment - mean_control) + margin_of_error

# Output the results
print("95% Confidence Interval for the difference in average amount spent (treatment - control):")
print(f"({ci_lower:.4f}, {ci_upper:.4f})")


95% Confidence Interval for the difference in average amount spent (treatment - control):
(-0.4170, 0.4873)
