# Instructions

1. Use Python libraries to answer the questions.
2. Write your conclusions in plain English.
3. Perform a basic analysis for each question (click and bounce frequency counts).

In [421]:
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from scipy.stats import chi2, norm, ttest_ind
from statsmodels.stats.proportion import proportions_ztest


# INFERENCE FOR CATEGORICAL VALUES

## Question 1

An advertising firm has decided to ask 92 customers at each of three local shopping malls if they are willing to take part in a market research survey. According to previous studies, 38% of Americans refuse to take part in such surveys. The results are shown here. At a 0.01 significance level, test the claim that the proportions of those who are willing to participate are equal.

In [422]:
data = {
    "Participation Status": ["Will Participate", "Will Not Participate"],
    "Mall A": [52, 40],
    "Mall B": [45, 47],
    "Mall C": [36, 56]
}

df = pd.DataFrame(data)

In [423]:

observed = np.array([[52, 45, 36], [40, 47, 56]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
total = observed.sum()

expected = np.outer(row_totals, col_totals) / total

chi2_statistic = np.sum((observed - expected)**2 / expected)

degrees_of_freedom = (observed.shape[0] - 1) * (observed.shape[1] - 1)

p_value = 1 - chi2.cdf(chi2_statistic, degrees_of_freedom)

print("Chi-square Statistic:", chi2_statistic)
print("P-value:", p_value)


Chi-square Statistic: 5.601556338398444
P-value: 0.06076276051439011


In [424]:
significance_level = 0.01

# Null hypothesis: The proportions of customers willing to participate are equal across the three malls.
# Alternative hypothesis: The proportions of customers willing to participate are not equal across the three malls.

In [456]:
if p_value >= significance_level:
    print("Fail to reject the null hypothesis. There is not enough evidence to conclude that the proportions of participants are different across the malls.")
else:
    print("Reject the null hypothesis. There is evidence to suggest that the proportions of participants are not equal across the malls.")

Fail to reject the null hypothesis. There is not enough evidence to conclude that the proportions of participants are different across the malls.
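
The hand-computed statistic can be cross-checked with SciPy's built-in test of independence. The following is a minimal sketch that assumes the same observed counts as above; the `_check` variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = will / will not participate, columns = Mall A, B, C
observed = np.array([[52, 45, 36], [40, 47, 56]])

# chi2_contingency derives the expected counts and degrees of freedom itself.
# Its continuity correction is only applied when dof == 1, so it does not
# affect this 2x3 table.
chi2_check, p_check, dof_check, expected_check = chi2_contingency(observed)

print("Chi-square statistic:", chi2_check)
print("P-value:", p_check)
print("Degrees of freedom:", dof_check)
```

If the manual calculation is right, the statistic and p-value printed here should agree with the values computed by hand.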


## Question 2

Nationwide the shares of carbon emissions for the year 2000 are transportation, 33%; industry, 30%; residential, 20%; and commercial, 17%. A state hazardous materials official wants to see if her state is the same.

Her study of 300 emissions sources finds transportation, 36%; industry, 31%; residential, 17%; and commercial, 16%. At a 0.05 significance level, can she claim the percentages are the same?

In [426]:
data = {
    "Sector": ["Transportation", "Industry", "Residential", "Commercial"],
    "Percentage_nationwide": [33, 30, 20, 17],
    "Percentage_state": [36, 31, 17, 16]
}

df = pd.DataFrame(data)

In [427]:
observed_percentages = np.array([36, 31, 17, 16])
expected_percentages = np.array([33, 30, 20, 17])
sample_size = 300

# Convert percentages to counts for the sample of 300 emission sources
expected_freq = expected_percentages / 100 * sample_size
observed_freq = observed_percentages / 100 * sample_size

chi_sq_stat = np.sum((observed_freq - expected_freq)**2 / expected_freq)

# Degrees of freedom = number of categories - 1
# (named dof so it does not overwrite the DataFrame df)
dof = len(observed_freq) - 1

p_value = 1 - stats.chi2.cdf(chi_sq_stat, dof)

print(f"Chi-square statistic: {chi_sq_stat}")
print(f"P-value: {p_value}")


Chi-square statistic: 2.4446524064171107
P-value: 0.4853767186730823


In [428]:
# Null hypothesis H0: the state's sector shares equal the national shares (π1 = 0.33, π2 = 0.30, π3 = 0.20, π4 = 0.17)
# Alternative hypothesis HA: at least one πi differs from its national share

significance_level = 0.05
if p_value >= significance_level:
    print("Fail to reject the null hypothesis (H0). There is not enough evidence to conclude that the proportions of carbon emissions in the state are different from the national proportions at a significance level of 0.05.")
else:
    print("Reject the null hypothesis (H0). There is evidence to suggest that the proportions of carbon emissions in the state are different from the national proportions at a significance level of 0.05.")

Fail to reject the null hypothesis (H0). There is not enough evidence to conclude that the proportions of carbon emissions in the state are different from the national proportions at a significance level of 0.05.
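
The goodness-of-fit calculation can also be checked with `scipy.stats.chisquare`. This is a minimal sketch under the same assumptions as the cell above (300 sources, percentages converted to counts); the `gof_` names are illustrative.

```python
import numpy as np
from scipy.stats import chisquare

sample_size = 300

# Convert the stated percentages into counts out of 300 sources
observed_counts = np.array([36, 31, 17, 16]) * sample_size / 100
expected_counts = np.array([33, 30, 20, 17]) * sample_size / 100

# chisquare uses (number of categories - 1) degrees of freedom by default
gof_stat, gof_p = chisquare(f_obs=observed_counts, f_exp=expected_counts)

print("Chi-square statistic:", gof_stat)
print("P-value:", gof_p)
```

Note that `chisquare` expects the observed and expected counts to have the same total, which holds here because both sets of percentages sum to 100.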


# A/B Testing in Business

## Question 3

### Problem Statement

You work as a Data Scientist at Udacity. Udacity recently launched a new feature that asks users how much time they can commit to the course they are enrolling in. If a student enters less than the required time, the platform shows a pop-up message saying that they can enroll only if they commit at least the minimum time required for that course.

Udacity wants to determine how this new feature affects the platform's bounce rate. Let's run an A/B test to find out.

### Example

A course requires an effort of at least 10 hours a week, but a student says they can put in only 7 hours a week. The platform then prompts the user that they must commit at least 10 hours a week to enroll in the course.





In [429]:
# Preparing the Dataset

df = pd.read_csv('https://raw.githubusercontent.com/usmanabbas7/karachi.ai/main/udacity.csv')

In [430]:
# Sample Data

df

Unnamed: 0,user_id,is_bounce,version
0,305,1,default version
1,568,1,new version
2,895,0,new version
3,665,1,new version
4,738,0,new version
...,...,...,...
495,324,1,default version
496,77,1,default version
497,890,0,new version
498,60,1,default version


In [431]:
df = df.drop_duplicates().reset_index(drop=True)

In [432]:
df.isna().sum()

user_id      0
is_bounce    0
version      0
dtype: int64

In [433]:
df['version'].value_counts(normalize=True)

version
default version    0.51
new version        0.49
Name: proportion, dtype: float64

In [434]:
df['is_bounce'].value_counts(normalize=True)

is_bounce
1    0.696
0    0.304
Name: proportion, dtype: float64

Let's use a significance level (alpha) of 0.05.

In [435]:
data = pd.crosstab(df['version'], df['is_bounce'])

In [436]:
data['Total'] = df.groupby('version')['is_bounce'].count()

In [437]:
data

version,is_bounce=0,is_bounce=1,Total
default version,0,255,255
new version,152,93,245


In [438]:
## Testing the Hypothesis

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

In [439]:
data[1]

version
default version    255
new version         93
Name: 1, dtype: int64

In [440]:
converted_results = data[1]
total = data['Total']

In [441]:
converted_results

version
default version    255
new version         93
Name: 1, dtype: int64

In [442]:
total

version
default version    255
new version        245
Name: Total, dtype: int64

In [443]:
z_stat, p_val = proportions_ztest(count=converted_results, nobs=total, alternative="two-sided")
print(f'z statistic = {z_stat}')

print("P-value:", p_val)

z statistic = 15.076628104103516
P-value: 2.30756566793485e-51


In [444]:
alpha=0.05

if p_val >= alpha:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest that the bounce rates of the two groups are different.")
else:
    print("Reject the null hypothesis. There is enough evidence to suggest that the bounce rates of the two groups are different.")

Reject the null hypothesis. There is enough evidence to suggest that the bounce rates of the two groups are different.


In [445]:
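# Sanity check: two-sided p-value computed directly from the z statistic
# (norm.sf avoids the floating-point underflow of 1 - cdf for a large z)
stats.norm.sf(abs(z_stat)) * 2

# Critical z value for a two-sided test at alpha = 0.05
stats.norm.ppf(0.95 + (0.05/2))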

1.959963984540054

In [446]:
# Alternative: compute the counts directly from the DataFrame

control_results = df[df['version'] == 'default version']['is_bounce']
treatment_results = df[df['version'] == 'new version']['is_bounce']


n_con = control_results.count()
n_treat = treatment_results.count()
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pvalue = proportions_ztest(successes, nobs=nobs)

print(f'z statistic: {z_stat}')

print("P-value:", pvalue)

alpha=0.05

z statistic: 15.076628104103516
P-value: 2.30756566793485e-51


In [447]:
if pvalue >= alpha:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest that the bounce rates of the two groups are different.")
else:
    print("Reject the null hypothesis. There is enough evidence to suggest that the bounce rates of the two groups are different.")

Reject the null hypothesis. There is enough evidence to suggest that the bounce rates of the two groups are different.
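
To see where the z statistic comes from, the pooled two-proportion z-test can be written out by hand. The sketch below assumes the bounce counts and group sizes shown in the crosstab above (255 of 255 for the default version, 93 of 245 for the new version); the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Bounce counts and group sizes: [default version, new version]
bounces = np.array([255, 93])
visitors = np.array([255, 245])

# Observed bounce rate in each group
rates = bounces / visitors

# Under H0 both groups share one bounce rate, estimated by pooling the groups
pooled_rate = bounces.sum() / visitors.sum()

# Standard error of the difference in rates under H0
std_err = np.sqrt(pooled_rate * (1 - pooled_rate) * (1 / visitors).sum())

z_manual = (rates[0] - rates[1]) / std_err
p_manual = 2 * norm.sf(abs(z_manual))  # two-sided p-value

print("z statistic:", z_manual)
print("P-value:", p_manual)
```

The default-version group bounced 100% of the time in this sample, so the large z value reflects a very wide gap between the two observed rates rather than a subtle effect.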


## Question 4

Suppose that you work on the analytics team at Upwork. The Data Science team deployed a new ranking strategy in the "Job Search" module. You need to perform an A/B test to check whether there is a significant difference in the customers' average page stay time between the old and new ranking strategies.

Hint: Use a two-sample t-test for means.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

Write the conclusion of your test in plain English.

In [448]:
df = pd.read_csv('https://raw.githubusercontent.com/usmanabbas7/karachi.ai/main/ranking_ab_test.csv')

In [449]:
df

Unnamed: 0,user_id,page_stay_time,version
0,0,8,default raking algo
1,1,5,default raking algo
2,2,22,default raking algo
3,3,19,default raking algo
4,4,1,default raking algo
...,...,...,...
995,995,29,new ranking algo
996,996,11,new ranking algo
997,997,3,new ranking algo
998,998,2,new ranking algo


In [450]:
df.loc[df['version'] == 'default raking algo', 'version'] = 'default ranking algo'

In [451]:
df = df.drop_duplicates().reset_index(drop=True)

In [452]:
df.replace(to_replace=['?', 'na','',' ','none','NA','N/A','n/a'], value=pd.NA, inplace=True)

In [453]:
df.isna().sum()

user_id           0
page_stay_time    0
version           0
dtype: int64

In [454]:
df['version'].value_counts(normalize=True)

df['page_stay_time'].value_counts(normalize=True)

old_strategy = df[df['version'] == 'default ranking algo']['page_stay_time']

new_strategy = df[df['version'] == 'new ranking algo']['page_stay_time']

t_statistic, p_value = ttest_ind(old_strategy, new_strategy, equal_var=False, alternative='two-sided')

print("T-statistic:", t_statistic)

print("P-value:", p_value)

alpha = 0.05

if p_value >= alpha:
    print("Fail to reject the null hypothesis: There is no significant difference in the average page stay time between the old and new ranking strategies.")
else:
    print("Reject the null hypothesis: There is a significant difference in the average page stay time between the old and new ranking strategies.")

T-statistic: -0.588271685867135
P-value: 0.5564835484911033
Fail to reject the null hypothesis: There is no significant difference in the average page stay time between the old and new ranking strategies.
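
Passing `equal_var=False` makes `ttest_ind` run Welch's t-test, which does not assume the two groups have equal variances. The sketch below writes that calculation out by hand and compares it with SciPy. It uses synthetic exponential samples as stand-ins (an assumption for illustration only, not the Upwork data), so its numbers say nothing about the real experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for page stay times; NOT the actual ranking_ab_test.csv data
old_sample = rng.exponential(scale=10.0, size=500)
new_sample = rng.exponential(scale=10.5, size=500)

# Squared standard error of each group's mean
var_old = old_sample.var(ddof=1) / old_sample.size
var_new = new_sample.var(ddof=1) / new_sample.size

# Welch's t statistic
t_manual = (old_sample.mean() - new_sample.mean()) / np.sqrt(var_old + var_new)

# Welch-Satterthwaite approximation of the degrees of freedom
welch_dof = (var_old + var_new) ** 2 / (
    var_old ** 2 / (old_sample.size - 1) + var_new ** 2 / (new_sample.size - 1)
)

p_manual = 2 * stats.t.sf(abs(t_manual), welch_dof)  # two-sided p-value

# SciPy's implementation for comparison
t_scipy, p_scipy = stats.ttest_ind(old_sample, new_sample, equal_var=False)

print("Manual:", t_manual, p_manual)
print("SciPy: ", t_scipy, p_scipy)
```

The manual and SciPy results should match to floating-point precision, which confirms what the library call computes.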


