
# 2.1.3 Activity: Applying hypothesis testing to organisational scenarios

## Scenario
You are a self-employed data scientist, with a portfolio of clients. Across five use cases, you will apply hypothesis testing to make recommendations to each organisation.


## Objective
The goal is to apply hypothesis testing to a selection of real-world scenarios to validate assumptions about business challenges and to guide strategic decision-making.


## Assessment criteria
By completing this activity, you will be able to provide evidence that you can:
- demonstrate proficiency in hypothesis testing using Python for data-driven decision-making
- interpret results and deduce meaningful conclusions
- provide a brief, but informative summary to the stakeholder(s).


## Activity guide
1. Study the example before you start with **Scenario 1**.
2. Complete the text entry for each scenario, restating the problem statement as a hypothesis.
3. Add your hypotheses to the hypothesis variables in the code, and then run the code cell.
4. Interpret the results.
5. Write a sentence to explain your results to the organisation.
6. For Scenario 5, you will select which test to use, write the code, and then apply the above steps to complete the activity.


## Activity example

We want to determine whether a cosmetics salesperson’s annual salary is high enough to meet repayments on a loan. The base salary is not very high, but this income is supplemented by commission on sales. We need to explore whether the applicant’s salary plus the commission will meet affordability (the ability to repay the loan) across various affordability testing aspects.
Full affordability tests are recorded on applicants' credit files, so we will first run a quick statistical test, which leaves no credit footprint, to work out whether the commission makes enough of a difference.

We want to know what the average monthly commission is, as this will have a significant impact on the applicant's ability to make loan repayments. We have established that an average of £501 per month would meet the requirements. The applicant has indicated that, on average, they make approximately £550 in commission a month, which would be sufficient. However, sales could fluctuate, which would reduce the commission.

We want to test how fluctuations in commission will affect affordability. A one-tailed t-test will be performed on a year's worth of actual data (12 months of commission), plus some additional randomised figures to represent potential future fluctuations. The test will indicate whether the mean commission remains significantly above the required level despite these fluctuations.


### Activity steps
1. State the hypotheses:
  - $H_0$: The mean commission for a cosmetics salesperson is less than or equal to £500 per month.
  - $H_a$: The mean commission for a cosmetics salesperson is greater than £500 per month.
2. Add your hypotheses to the hypothesis variables in the code, then run the code cell.
3. Provide short feedback to the stakeholder(s).


In [1]:
import scipy.stats as stats

# Sample data for monthly commission (in GBP): a year of actual figures
# plus randomised figures representing potential future fluctuations.
monthly_commission = [
    480, 520, 540, 490, 510, 525, 515, 505, 500, 480, 515, 530,
    520, 530, 540, 525, 550, 460, 570, 490, 545, 535, 540, 545,
    550, 555, 560, 565, 470, 575, 580, 485, 590, 595, 500, 505
]

# Perform a one-tailed t-test.
t_statistic, p_value = stats.ttest_1samp(monthly_commission, popmean=500, alternative='greater')

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Define the null and alternative hypotheses.
null_hypothesis = "The mean commission is less than or equal to £500 per month."
alternative_hypothesis = "The mean commission is greater than £500 per month."

# Print the hypotheses, test results and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")



### Reporting results

The initial suitability test performed on your average commission for last year, along with potential future earnings from commission, shows that you are likely to be accepted for a loan after the full affordability tests are completed. As a reminder, the full credit check will leave a record on your credit file. Would you like to proceed with the application?

## Scenario 1: Product price comparison

- **Problem statement:** A retail company wants to determine whether there is a significant difference in the average price of a product between two different stores (Store A – the business; and Store B – the competitor).
- **Objective:** Test whether the average price of the product differs significantly between Store A and Store B.
- **Statistical test:** Independent two-sample t-test.
- **Reason for test selection:** An independent two-sample t-test is used to assess whether there is a significant difference between the means of a continuous variable (price) in two independent groups (Store A and Store B). The assumptions for an independent t-test are met, including normality and homoscedasticity (equal variances).
- **Value to the organisation:** This information can influence pricing strategies and competitive analysis, leading to informed decisions about product pricing and placement.
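
The normality and equal-variance assumptions mentioned above can be checked rather than taken on trust. A minimal sketch, assuming the Shapiro-Wilk and Levene tests from SciPy are acceptable checks for this purpose; the choice of these two tests and the variable names are illustrative and not part of the activity.

```python
import scipy.stats as stats

# Same sample data as the Scenario 1 code cell.
store_A_prices = [50, 55, 60, 45, 48, 52, 57, 59, 53, 50, 58, 54, 51, 56, 55]
store_B_prices = [55, 52, 58, 54, 50, 56, 53, 59, 55, 57, 60, 58, 53, 55, 57]

alpha = 0.05

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
_, p_normal_A = stats.shapiro(store_A_prices)
_, p_normal_B = stats.shapiro(store_B_prices)

# Levene: the null hypothesis is that both groups have equal variances.
_, p_equal_var = stats.levene(store_A_prices, store_B_prices)

checks = {
    "Normality (Store A)": p_normal_A,
    "Normality (Store B)": p_normal_B,
    "Equal variances": p_equal_var,
}
for name, p in checks.items():
    verdict = "no evidence against" if p >= alpha else "evidence against"
    print(f"{name}: p = {p:.3f} ({verdict} the assumption)")
```

If Levene's test did suggest unequal variances, one option would be Welch's t-test, which SciPy exposes through `stats.ttest_ind(store_A_prices, store_B_prices, equal_var=False)`.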

### Activity steps
1. State the hypotheses.
2. Create Python code to test the hypotheses.
3. Provide short feedback to the stakeholder(s).

> State your hypotheses here. Select the pen from the toolbar to add your entry.

In [None]:
# Start your coding here.
# Import the necessary library.
import scipy.stats as stats

# Sample data for prices at Store A and Store B.
store_A_prices = [50, 55, 60, 45, 48, 52, 57, 59, 53, 50, 58, 54, 51, 56, 55]
store_B_prices = [55, 52, 58, 54, 50, 56, 53, 59, 55, 57, 60, 58, 53, 55, 57]

# Perform the independent two-sample t-test.
t_statistic, p_value = stats.ttest_ind(store_A_prices, store_B_prices)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Add the null and alternative hypotheses between the "".
null_hypothesis = ""
alternative_hypothesis = ""

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")



### Reporting results
> Select the pen in the cell menu to type into this text cell.

Add a short report for the client, with a recommendation based on your findings. *This should be 2–3 sentences, and should not include the statistical test numbers*.



## Scenario 2: Employee productivity

- **Problem statement:** The HR department of a company wants to determine whether there is a significant difference in the average productivity levels of employees across three different departments (sales, marketing, and finance).
- **Objective:** Test whether the average productivity levels vary significantly across the three departments.
- **Statistical test:** One-way ANOVA to determine if there is a significant difference in average productivity levels across three different departments: sales, marketing, and finance.
- **Reason for test selection:** One-way ANOVA compares the mean productivity levels across more than two groups (departments). ANOVA determines whether there is a significant difference in means, which is important for identifying whether any department(s) may require specific attention or improvements. The assumptions for ANOVA (independent groups in this case) are met, including normality and homoscedasticity.
- **Value to the organisation:** Identifying which department might need targeted improvements should lead to better resource allocation and increased efficiency.
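
A significant ANOVA result indicates that at least one department mean differs, but not which one. One possible follow-up, sketched here as an illustration rather than as part of the activity, is to run pairwise t-tests against a Bonferroni-adjusted significance level. The dictionary `departments` and the name `adjusted_alpha` are hypothetical.

```python
from itertools import combinations

import scipy.stats as stats

# Same sample data as the Scenario 2 code cell, keyed by department.
departments = {
    "sales": [100, 110, 105, 120, 115, 108, 112, 107, 118, 103, 105, 115, 110],
    "marketing": [90, 95, 88, 92, 85, 87, 93, 89, 94, 91, 86, 92, 91],
    "finance": [80, 75, 82, 78, 85, 86, 81, 79, 83, 87, 84, 79, 82],
}

alpha = 0.05

# Every unordered pair of departments (three pairs for three departments).
pairs = list(combinations(departments, 2))

# Bonferroni correction: divide alpha by the number of comparisons.
adjusted_alpha = alpha / len(pairs)

pairwise_p_values = {}
for first, second in pairs:
    _, p_value = stats.ttest_ind(departments[first], departments[second])
    pairwise_p_values[(first, second)] = p_value
    differs = p_value < adjusted_alpha
    print(f"{first} vs {second}: p = {p_value:.2e}, differs at adjusted alpha: {differs}")
```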

### Activity steps
1. State the hypotheses.
2. Create Python code to test the hypotheses.
3. Provide short feedback to the stakeholder(s).

> State your hypotheses here. Select the pen from the toolbar to add your entry.

In [None]:
import scipy.stats as stats

# Sample data for productivity in three departments.
sales_productivity = [100, 110, 105, 120, 115, 108, 112, 107, 118, 103, 105, 115, 110]
marketing_productivity = [90, 95, 88, 92, 85, 87, 93, 89, 94, 91, 86, 92, 91]
finance_productivity = [80, 75, 82, 78, 85, 86, 81, 79, 83, 87, 84, 79, 82]

# Perform one-way ANOVA.
f_statistic, p_value = stats.f_oneway(sales_productivity, marketing_productivity, finance_productivity)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Add the null and alternative hypotheses between the "".
null_hypothesis = ""
alternative_hypothesis = ""

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")



### Reporting results
> Select the pen in the cell menu to type into this text cell.

Add a short report for the client, with a recommendation based on your findings. *This should be 2–3 sentences, and should not include the statistical test numbers*.



## Scenario 3: Market research

- **Problem statement:** A marketing research firm is investigating whether there is a relationship between customers’ age groups (18–25, 26–35, and 36–45) and their preferred social media platforms (Facebook, Twitter, and Instagram).
- **Objective:** Test whether the age group and the choice of social media platform are independent of each other.
- **Statistical test:** Chi-square test for independence.
- **Reason for test selection:** The chi-square test for independence is suitable for categorical data (age groups and social media platforms) and determines whether the two variables are independent or related. It assesses whether there is an association between the two variables without requiring specific assumptions about the data distribution.
- **Value to the organisation:** Identifying whether age groups and preferred social media platforms are independent or related is valuable for targeted marketing campaigns tailored to specific age groups and platforms.
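
A common rule of thumb is that the chi-square approximation is more reliable when the expected cell counts are around five or more. The sketch below inspects the expected counts that `chi2_contingency` returns. The array `observed` is assumed to match what `pd.crosstab` produces for the five-row sample data used in this scenario (rows: 18-25, 26-35, 36-45; columns: Facebook, Instagram, Twitter).

```python
import numpy as np
import scipy.stats as stats

# Assumed contingency table for the five-row sample data.
# Rows: 18-25, 26-35, 36-45. Columns: Facebook, Instagram, Twitter.
observed = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

# The fourth return value holds the counts expected under independence.
chi2, p, dof, expected = stats.chi2_contingency(observed)

# Fraction of cells whose expected count falls below the rule-of-thumb threshold.
share_below_five = (expected < 5).mean()

print("Degrees of freedom:", dof)
print("Expected counts:\n", expected.round(2))
print(f"Share of cells with expected count below 5: {share_below_five:.0%}")
```

With a sample this small, every expected count is well below five, so the test result should be read with caution until more responses are collected.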


### Activity steps
1. State the hypotheses.
2. Create Python code to test the hypotheses.
3. Provide short feedback to the stakeholder(s).

> State your hypotheses here. Select the pen from the toolbar to add your entry.

In [None]:
import scipy.stats as stats
import pandas as pd

# Sample data as a Pandas DataFrame.
data = pd.DataFrame({
    'Age Group': ['18-25', '26-35', '36-45', '18-25', '26-35'],
    'Social Media Platform': ['Facebook', 'Twitter', 'Instagram', 'Instagram', 'Facebook']
})

# Create a contingency table.
contingency_table = pd.crosstab(data['Age Group'], data['Social Media Platform'])

# Perform chi-square test for independence.
chi2, p, _, _ = stats.chi2_contingency(contingency_table)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Add the null and alternative hypotheses between the "".
null_hypothesis = ""
alternative_hypothesis = ""

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)

if p < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")



### Reporting results
> Select the pen in the cell menu to type into this text cell.

Add a short report for the client, with a recommendation based on your findings. *This should be 2–3 sentences, and should not include the statistical test numbers*.



## Scenario 4: Product quality control

- **Problem statement:** A manufacturing company is evaluating whether there is a significant difference in product quality (measured as 'defective' or 'non-defective') among three production lines (Line A, Line B, Line C).
- **Objective:** Test whether the proportion of defective products differs across the three production lines.
- **Statistical test:** Chi-square test for proportions.
- **Reason for test selection:** This scenario involves comparing proportions (defective versus non-defective) among production lines, making the chi-square test for proportions appropriate. It will test whether the proportion of defective products significantly differs among the production lines, regardless of data distribution.
- **Value to the organisation:** Identifying variations in product quality aids quality control and process improvement within the manufacturing process.
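
The proportions being compared can be computed directly from the sample counts used in this scenario's code cell. A short sketch, with the dictionary `line_counts` introduced only for illustration:

```python
# (defective, non-defective) counts per production line, as in the Scenario 4 code cell.
line_counts = {
    "Line A": (20, 180),
    "Line B": (30, 170),
    "Line C": (10, 190),
}

# Defect rate = defective units / all units inspected on that line.
defect_rates = {}
for line, (defective, non_defective) in line_counts.items():
    total = defective + non_defective
    defect_rates[line] = defective / total
    print(f"{line}: {defective} of {total} defective ({defect_rates[line]:.1%})")
```

These rates describe the samples only; the chi-square test is what indicates whether differences of this size are likely to reflect a real difference between lines.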

### Activity steps
1. State the hypotheses.
2. Create Python code to test the hypotheses.
3. Provide short feedback to the stakeholder(s).

> State your hypotheses here. Select the pen from the toolbar to add your entry.

In [None]:
import scipy.stats as stats

# Sample data for three production lines.
line_A_defective = 20
line_A_non_defective = 180
line_B_defective = 30
line_B_non_defective = 170
line_C_defective = 10
line_C_non_defective = 190

# Create a 2x3 contingency table.
contingency_table = [[line_A_defective, line_B_defective, line_C_defective],
                     [line_A_non_defective, line_B_non_defective, line_C_non_defective]]

# Perform chi-square test for proportions.
chi2, p, _, _ = stats.chi2_contingency(contingency_table)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Add the null and alternative hypotheses between the "".
null_hypothesis = ""
alternative_hypothesis = ""

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)

if p < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")



### Reporting results
> Select the pen in the cell menu to type into this text cell.

Add a short report for the client, with a recommendation based on your findings. *This should be 2–3 sentences, and should not include the statistical test numbers*.



## Scenario 5: Determining product lines

For this scenario, you will select the appropriate test and write the code for that test.

- **Problem statement:** An online bookstore wants to determine which books to add to its product line. It needs to determine whether to spend the budget on fiction or non-fiction books, based on which category will likely generate the most revenue.
- **Objective:** Select an appropriate test to determine how to spend the budget.
- **Statistical test:** Which test is correct for this scenario? Select one of the tests used for the previous scenarios, and apply it to the data provided.
- **Reason for test selection:** Justify why you have selected this test.
- **Value to the organisation:** Why would the information from this test add value to the organisation?
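
Before choosing a test, it can help to summarise the two samples. The sketch below uses only the standard library and deliberately does not select or run a hypothesis test; the helper `summarise` is a hypothetical name introduced for illustration.

```python
import statistics

# Same sample data as the Scenario 5 code cell.
fiction_revenue = [500, 550, 600, 520, 480, 530, 560,
                   540, 570, 590, 545, 525, 510, 525,
                   515, 550, 570, 580, 535, 520, 510,
                   540, 560, 575, 590]
non_fiction_revenue = [600, 620, 580, 590, 610, 630,
                       595, 605, 615, 625, 635, 590,
                       625, 630, 640, 610, 620, 600,
                       615, 630, 625, 635, 610, 590, 580]


def summarise(values):
    """Return the sample size, mean and sample standard deviation."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


fiction_summary = summarise(fiction_revenue)
non_fiction_summary = summarise(non_fiction_revenue)

print("Fiction:", fiction_summary)
print("Non-fiction:", non_fiction_summary)
```

The number of groups, the type of variable and the sample sizes shown here are the details to weigh when deciding which of the earlier tests fits.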


### Activity steps
1. State the hypotheses.
2. Create Python code to test the hypotheses.
3. Provide short feedback to the stakeholder(s).

> State your hypotheses here. Select the pen from the toolbar to add your entry.

In [None]:
# Import relevant libraries.


# Sample data for revenue of fiction and non-fiction books.
fiction_revenue = [500, 550, 600, 520, 480, 530, 560,
                   540, 570, 590, 545, 525, 510, 525,
                   515, 550, 570, 580, 535, 520, 510,
                   540, 560, 575, 590]
non_fiction_revenue = [600, 620, 580, 590, 610, 630,
                       595, 605, 615, 625, 635, 590,
                       625, 630, 640, 610, 620, 600,
                       615, 630, 625, 635, 610, 590, 580]

# Perform your chosen test.



# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Add the null and alternative hypotheses between the "".
null_hypothesis = ""
alternative_hypothesis = ""



# Print the test results.


# Interpret the results.



### Reporting results
> Select the pen in the cell menu to type into this text cell.

Add a short report for the client, with a recommendation based on your findings. *This should be 2–3 sentences, and should not include the statistical test numbers*.



# Reflect

Write a brief paragraph highlighting your process and the rationale to showcase critical thinking and problem-solving.

> Select the pen from the toolbar to add your entry. When you have completed the activity, remember to update the link on your contents page to point to your completed Notebook.