<a href="https://clarusway.com/contact-us/"><img align="center" src="https://i.ibb.co/B43qn24/officially-licensed-logo.png" alt="Open in Clarusway LMS" width="200" height="200" title="This notebook is licensed by Clarusway IT training school. Please contact the authorized persons about the conditions under which you can use or share."></a>

<p style="text-align: center;"><img src="https://i.ibb.co/Rpz9L36/clarusway-logo-black.png" width="600" height="150" class="img-fluid" alt="CLRSWY_LOGO"></p>
<p style="text-align: center;"><img src="https://i.ibb.co/XS0bxSH/best-bootcamps.png" width="400" height="130" class="img-fluid" alt="CLRSWY_LOGO"></p>
<p style="background-color:#E51A59; font-family:newtimeroman; color:#FDFEFE; font-size:130%; text-align:center; border-radius:10px 10px;">WAY TO REINVENT YOURSELF</p>

# Central Limit Theorem (CLT):

The Central Limit Theorem is a fundamental concept in statistics that describes the behavior of the distribution of sample means for a large enough sample size, regardless of the original distribution of the population.

**Key Points:**

**1. Large Enough Sample:**

The Central Limit Theorem applies when you have a sufficiently large sample size. Though there isn't a strict rule, a common guideline is a sample size of at least 30. However, for some distributions, you might need a larger sample size.

**2. Regardless of Population Distribution:**

It doesn't matter what the shape of the population distribution is initially. The Central Limit Theorem asserts that as long as the sample size is large enough, the distribution of the sample means will be approximately normally distributed.

**3. Normal Distribution of Sample Means:**

The theorem states that, for a large sample size, the distribution of the sample means will be approximately normal, even if the population distribution is not normal.

**4. Mean and Standard Deviation:**

The mean of the sample means will be equal to the population mean, and the standard deviation of the sample means (standard error) will be equal to the population standard deviation divided by the square root of the sample size.

**Why is it Important:**

- **Statistical Inference:**

The Central Limit Theorem is crucial for statistical inference. It allows us to make inferences about population parameters using sample statistics, particularly when we're dealing with the mean.

- **Normal Distribution Simplification:**

The normal distribution is mathematically convenient and well understood. The CLT allows us to treat the distribution of sample means as normal, making statistical analysis more straightforward.

**Example:**

Imagine you're measuring the heights of people in a population. The distribution of individual heights might not be normal, but if you take many random samples of, let's say, 30 people each and calculate the mean height for each sample, the distribution of those sample means will be approximately normal, according to the Central Limit Theorem.

In summary, the Central Limit Theorem is a powerful tool in statistics that allows us to make certain assumptions and draw conclusions about populations based on the distribution of sample means, making statistical analysis more practical and applicable in real-world situations.
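
The behavior described above can be checked with a short simulation. This is only an illustrative sketch: the exponential population, its scale of 2.0, the seed, and the number of repetitions are assumptions chosen for the demo, not values from the material.

```python
import numpy as np

# Assumed setup for illustration: a skewed (exponential) population
rng = np.random.default_rng(0)
population_mean = 2.0   # mean of an exponential distribution with scale=2.0
population_std = 2.0    # its standard deviation equals the scale
sample_size = 30
n_samples = 10_000

# Draw many samples and compute the mean of each one
samples = rng.exponential(scale=2.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

# Compare the simulation with what the CLT predicts
predicted_se = population_std / np.sqrt(sample_size)
print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std(ddof=1))
print("predicted std error: ", predicted_se)
```

A histogram of `sample_means` should look roughly bell-shaped even though the population itself is strongly skewed.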

## Question
    
You are given a population dataset with the heights of individuals. The mean height is 65 inches, and the standard deviation is 4 inches. A random sample of 100 individuals is taken from this population. Calculate the probability that the mean height of this sample is less than 64 inches.

In [1]:
import numpy as np
import scipy.stats as stats

# Given data
population_mean = 65
population_std_dev = 4
sample_size = 100
sample_mean = population_mean  # Under the CLT, the sampling distribution of the mean is centered on the population mean

In [3]:
# Calculate the standard error of the mean (the standard deviation of the sample means)
sem = population_std_dev / np.sqrt(sample_size)
sem

0.4

In [5]:
# Calculate the z-score
z = (64 - population_mean) / sem
z

-2.5

In [9]:
# Use the cumulative distribution function (CDF) to find the probability
stats.norm.cdf(z)


0.006209665325776132

# Confidence intervals

A confidence interval is a statistical tool used to estimate the range in which we are reasonably confident that a population parameter, such as the mean, is likely to lie. It provides a range of values rather than a single point estimate, offering a measure of the uncertainty associated with our estimation.

**Key Points:**

**1. Point Estimate:**

A point estimate is a single value (like the sample mean) that serves as the best guess for the population parameter. However, it doesn't convey the entire picture of uncertainty.

**2. Margin of Error:**

The confidence interval consists of a point estimate and a margin of error. The margin of error is influenced by the variability in the data and the desired level of confidence.

**3. Level of Confidence:**

The level of confidence describes how often the interval-building procedure captures the true population parameter over repeated sampling. Common levels are 90%, 95%, or 99%. A 95% confidence interval, for example, implies that if we were to take many samples and construct intervals in the same way, we expect about 95% of those intervals to contain the true parameter.

**4. Calculation:**

The formula for a confidence interval is:

Confidence Interval = Point Estimate ± Margin of Error

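The margin of error is typically determined using critical values from a standard normal distribution or a t-distribution. The t-distribution is the usual choice when the population standard deviation is unknown and estimated from the sample, which matters most for small samples.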

**Why are Confidence Intervals Important:**

- **Uncertainty:**

They convey the uncertainty inherent in statistical estimates. Instead of providing a single, potentially misleading number, confidence intervals give a range that is likely to contain the true parameter.

- **Comparisons:**

They allow for meaningful comparisons between different groups or conditions. Knowing that two confidence intervals don't overlap, for example, might suggest a significant difference between the groups.

- **Decision Making:**

They aid in decision-making by providing a sense of the precision of our estimates. Wider intervals suggest greater uncertainty, while narrower intervals suggest more precise estimates.

**Example:**

Suppose you're estimating the average height of a population. You calculate a 95% confidence interval of (65 inches, 67 inches). This means you are 95% confident that the true average height of the population falls within this range. The point estimate is the midpoint (66 inches), and the margin of error is (67 − 65) / 2 = 1 inch.

In essence, confidence intervals provide a more nuanced and informative way to report estimates, acknowledging the inherent variability in data and the level of confidence we have in our estimates.
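
The "many samples, many intervals" reading of the confidence level can be illustrated with a small simulation. The population parameters, sample size, seed, and number of trials below are hypothetical values picked for the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, true_std = 66.0, 4.0   # hypothetical population parameters
n, n_trials = 40, 2_000

covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, true_std, n)
    sem = sample.std(ddof=1) / np.sqrt(n)
    # 95% t-based interval around this sample's mean
    low, high = stats.t.interval(0.95, df=n - 1, loc=sample.mean(), scale=sem)
    covered += int(low <= true_mean <= high)

coverage = covered / n_trials
print("fraction of intervals containing the true mean:", coverage)
```

The printed fraction should land close to 0.95, although it will vary a little with the seed and the number of trials.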

## Question

You are conducting a study on the scores of students in a standardized test. From a sample of 50 students, the mean score is found to be 72, and the standard deviation is 5. Calculate a 95% confidence interval for the true mean score of all students.

In [10]:
import numpy as np
import scipy.stats as stats

# Given data
sample_size = 50
sample_mean = 72
sample_std_dev = 5
confidence_level = 0.95

In [11]:
# Calculate the standard error of the mean (SEM)
sem = sample_std_dev / np.sqrt(sample_size)
sem

0.7071067811865475

In [12]:
# Calculate the margin of error (MOE); 1.96 is the two-sided 95% critical value of the standard normal
moe = 1.96 * sem
moe

1.385929291125633

In [13]:
# Calculate the confidence interval
ci = stats.norm.interval(0.95, sample_mean, sem)
ci

(70.61409617565032, 73.38590382434968)
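
Because the standard deviation here comes from the sample rather than the population, a t-based interval is a common alternative to the normal-based one. The sketch below compares the two under that assumption, using the same summary statistics as the question.

```python
import numpy as np
from scipy import stats

sample_size, sample_mean, sample_std_dev = 50, 72, 5
sem = sample_std_dev / np.sqrt(sample_size)

# Normal-based interval (as computed above) and t-based interval with n - 1 degrees of freedom
ci_z = stats.norm.interval(0.95, loc=sample_mean, scale=sem)
ci_t = stats.t.interval(0.95, df=sample_size - 1, loc=sample_mean, scale=sem)

print("z interval:", ci_z)
print("t interval:", ci_t)
```

With 49 degrees of freedom the t interval is only slightly wider, which is why the two approaches give similar answers for samples of this size.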

# One-Sample t-test

The one-sample t-test is a statistical test used to determine whether the mean of a single sample is significantly different from a known or hypothesized population mean. Here's a brief explanation:

**Objective:**

Assess whether the sample mean is significantly different from a hypothesized population mean.

**Steps:**

- **1. State Hypotheses:**

    - Null Hypothesis (Ho): The population mean is equal to a specified value (μ=hypothesized value)
    - Alternative Hypothesis (Ha): The population mean is not equal to the specified value.

- **2. Collect Data:**

    - Gather a sample from the population of interest.

- **3. Calculate the t-Statistic:**

    - Use the formula  t = (x_bar - mu)/(s/sqrt(n))
 
    - where x_bar is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.

- **4. Determine Degrees of Freedom:**

    - Degrees of freedom (df) is equal to the sample size minus one (df=n−1).

- **5. Choose Significance Level (α):**

    - Commonly chosen values are 0.05 or 0.01.

- **6. Calculate Critical t-Values:**

    - Use a t-table or statistical software to find the critical t-values based on the chosen significance level and degrees of freedom.

- **7. Make Decision:**

    - If the absolute value of the calculated t-statistic is greater than the critical t-value, reject the null hypothesis.
    - If the p-value is less than the chosen significance level, reject the null hypothesis.

- **8. Draw Conclusion:**

    - Conclude whether there is enough evidence to reject the null hypothesis and accept the alternative hypothesis.

The one-sample t-test is commonly used in various fields to compare a sample mean to a known or hypothesized population mean, providing a way to assess whether any observed differences are statistically significant.
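
The steps above can be sketched as a small function that works from summary statistics. The helper name `one_sample_t_from_stats` and the example numbers are hypothetical, and the sketch assumes a two-sided alternative.

```python
import numpy as np
from scipy import stats

def one_sample_t_from_stats(x_bar, mu, s, n, alpha=0.05):
    """Hypothetical helper: two-sided one-sample t-test from summary statistics."""
    t_stat = (x_bar - mu) / (s / np.sqrt(n))      # step 3
    df = n - 1                                    # step 4
    t_crit = stats.t.ppf(1 - alpha / 2, df)       # step 6 (two-sided critical value)
    p_value = 2 * stats.t.sf(abs(t_stat), df)     # two-sided p-value
    reject = abs(t_stat) > t_crit                 # step 7
    return t_stat, df, t_crit, p_value, reject

# Made-up numbers purely for illustration
t_stat, df, t_crit, p_value, reject = one_sample_t_from_stats(52.0, 50.0, 6.0, 36)
print(t_stat, df, t_crit, p_value, reject)
```

The critical-value rule and the p-value rule in step 7 always lead to the same decision for the same significance level.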

## Question-1

Suppose you have a sample of 30 students, and you want to test whether their average score is significantly different from the population mean of 75. Perform a one-sample t-test at a significance level of 0.05.

In [14]:
import numpy as np
import scipy.stats as stats

# Given data
sample_size = 30
sample_mean = 72  # Assume the sample mean is 72 for this example
population_mean = 75
significance_level = 0.05

In [16]:
# Calculate the t-statistic (the sample standard deviation s = 5 is assumed for this example)
s = 5
t = (sample_mean - population_mean) / (s / np.sqrt(sample_size))
t

-3.2863353450309964

In [7]:
# Check if the result is statistically significant
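
One possible way to complete this check is sketched below. It assumes a two-sided alternative and reuses the values from the cells above, including the assumed sample standard deviation of 5.

```python
import numpy as np
from scipy import stats

sample_size, sample_mean, population_mean = 30, 72, 75
s = 5                      # assumed sample standard deviation, as above
significance_level = 0.05

t = (sample_mean - population_mean) / (s / np.sqrt(sample_size))
# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t), df=sample_size - 1)

if p_value < significance_level:
    print(f"p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: fail to reject the null hypothesis")
```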



## Question-2

Let's say you work for a company that produces light bulbs, and they claim that the average lifespan of their bulbs is 1000 hours. You decide to test this claim by taking a sample of 25 bulbs and recording their lifespans. The sample has a mean lifespan of 980 hours with a standard deviation of 50 hours.


Is there enough evidence to support the company's claim that the average lifespan of their light bulbs is 1000 hours?

In [8]:
# State the Hypothesis:



In [18]:
# Collect, Analyze the Data
# Set the Significance Level (α) as 0.05

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Given data
sample_mean = 980
population_mean = 1000
sample_std = 50
sample_size = 25
significance_level = 0.05 # alpha

In [19]:
# Simulate a sample of lifespans (illustrative only; the test below uses the given summary statistics, not x)
x =  np.random.normal(loc=population_mean, scale=sample_std, size=sample_size)
x

array([ 930.57419976,  896.49871143, 1081.72512304,  966.06243845,
        904.917997  ,  943.48843819,  972.08663952,  957.64164541,
        988.41384435, 1020.17199294,  955.89222729,  962.89920515,
       1005.06712972, 1000.26622884,  946.47975382, 1002.28669659,
       1012.1925282 , 1015.9133843 , 1020.77015077, 1044.79725289,
       1047.64127419,  983.05067039,  961.41438452,  930.87868698,
       1011.80537821])

In [20]:
# Perform one-sample t-test
t = (sample_mean-population_mean) / (sample_std / np.sqrt(sample_size))
t

-2.0

In [21]:
# Lower-tail (one-sided) p-value; compare it with the significance level
p_value = stats.t.cdf(t, sample_size-1)
p_value

0.028469924968295833
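
The p-value above is one-tailed. If the alternative hypothesis is taken to be two-sided (mean ≠ 1000), which is an assumption about how the question is framed, the decision can be sketched as follows.

```python
import numpy as np
from scipy import stats

sample_mean, population_mean = 980, 1000
sample_std, sample_size = 50, 25
significance_level = 0.05
df = sample_size - 1

t = (sample_mean - population_mean) / (sample_std / np.sqrt(sample_size))

# Two-sided p-value and two-sided critical value
p_two_sided = 2 * stats.t.sf(abs(t), df)
t_critical = stats.t.ppf(1 - significance_level / 2, df)
reject_null = abs(t) > t_critical

print("two-sided p-value:", p_two_sided)
print("critical t:", t_critical)
print("reject H0:", reject_null)
```

Doubling the one-tailed value of about 0.0285 gives roughly 0.057, so the two-sided test does not reject the claim at the 0.05 level while the one-tailed test does. The choice of alternative hypothesis should be made before looking at the data.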

# Independent t-test:


The independent samples t-test is a statistical test used to compare the means of two independent groups to determine if they are significantly different from each other. Here's a brief explanation:

**Objective:**

Assess whether the means of two independent groups are significantly different from each other.

**Assumptions:**

- The data in each group are independent and identically distributed.
- The populations from which the samples are drawn are approximately normally distributed.
- The variances of the two populations are approximately equal (homogeneity of variances).

**Steps:**

- **1. State Hypotheses:**

    - Null Hypothesis (Ho) : The means of the two groups are equal (μ1 = μ2)
    - Alternative Hypothesis (Ha): The means of the two groups are not equal.

- **2. Collect Data:**

    - Gather two independent samples from the two groups.

- **3. Calculate the t-Statistic:**

  Use the formula:

![image.png](attachment:image.png)
 
where x1_bar and x2_bar are the sample means, s1 and s2 are the sample standard deviations, and n1 and n2 are the sample sizes for the two groups.

- **4. Determine Degrees of Freedom:**

    - Degrees of freedom (df) depend on the sample sizes and variances of the two groups.

- **5. Choose Significance Level (α):**

    - Commonly chosen values are 0.05 or 0.01.

- **6. Calculate Critical t-Values:**

    - Use a t-table or statistical software to find the critical t-values based on the chosen significance level and degrees of freedom.

- **7. Make Decision:**

    - If the absolute value of the calculated t-statistic is greater than the critical t-value, reject the null hypothesis.
    - If the p-value is less than the chosen significance level, reject the null hypothesis.

- **8. Draw Conclusion:**

    - Conclude whether there is enough evidence to reject the null hypothesis and accept the alternative hypothesis.

The independent samples t-test is widely used to compare means of two groups, such as comparing the test scores of students in two different teaching methods or comparing the blood pressure levels between two different treatments.
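
As a sketch of step 3, the code below computes the t-statistic by hand using the pooled-variance form, which matches the equal-variance assumption listed above, and compares it with `scipy.stats.ttest_ind`. Whether the formula image shows this form or the unequal-variance (Welch) form is not known here, and the data are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_1 = rng.normal(70, 5, 25)   # hypothetical data
group_2 = rng.normal(74, 5, 30)

n1, n2 = len(group_1), len(group_2)
s1, s2 = group_1.std(ddof=1), group_2.std(ddof=1)

# Pooled standard deviation (assumes equal population variances)
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t_manual = (group_1.mean() - group_2.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# SciPy's implementation of the same test
t_scipy, p_scipy = stats.ttest_ind(group_1, group_2, equal_var=True)
print(t_manual, t_scipy, p_manual, p_scipy)
```

Passing `equal_var=False` to `ttest_ind` switches to Welch's t-test, which does not assume equal variances.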

## Question-1

Compare the average scores of two independent groups of students, Group A and Group B, to see if there is a significant difference between them.

In [18]:
# Example data

group_a_scores = np.random.normal(72, 5, 30)
group_b_scores = np.random.normal(75, 5, 30)

In [10]:
# Calculate the t-statistic and p_value



In [11]:
# Check if the result is statistically significant
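
A possible solution sketch for this question is shown below. The seed is an assumption added so the example is repeatable; the original data cell is unseeded, so its numbers will differ.

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # assumption: seed added only for repeatability
group_a_scores = np.random.normal(72, 5, 30)
group_b_scores = np.random.normal(75, 5, 30)

# Independent two-sample t-test (two-sided, equal variances assumed by default)
t_stat, p_value = stats.ttest_ind(group_a_scores, group_b_scores)

significance_level = 0.05
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < significance_level:
    print("Reject H0: the group means differ significantly.")
else:
    print("Fail to reject H0: no significant difference detected.")
```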



## Question-2

Consider a scenario where we want to compare the average scores of two groups of students who attended different tutoring programs for exam preparation to see if there is a significant difference between them.

Hypotheses:

Null Hypothesis (Ho) : The mean exam scores of students who attended Program A are equal to the mean exam scores of students who attended Program B (μA = μB)

Alternative Hypothesis (Ha) : The mean exam scores of students who attended Program A are not equal to the mean exam scores of students who attended Program B (μA ≠ μB)

In [60]:
# Data Generation:

# We generate simulated data for the exam scores of students in each program using a normal distribution.

import numpy as np
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Generate data for students who attended Tutoring Program A
tutoring_A_scores = np.random.normal(loc=85, scale=10, size=40)
print(f"tutoring_A_scores is: ({tutoring_A_scores})")

print("\n")

# Generate data for students who attended Tutoring Program B
tutoring_B_scores = np.random.normal(loc=90, scale=8, size=40)
print(f"tutoring_B_scores is: ({tutoring_B_scores})")

tutoring_A_scores is: ([ 89.96714153  83.61735699  91.47688538 100.23029856  82.65846625
  82.65863043 100.79212816  92.67434729  80.30525614  90.42560044
  80.36582307  80.34270246  87.41962272  65.86719755  67.75082167
  79.37712471  74.8716888   88.14247333  75.91975924  70.87696299
  99.65648769  82.742237    85.67528205  70.75251814  79.55617275
  86.1092259   73.49006423  88.75698018  78.9936131   82.0830625
  78.98293388 103.52278185  84.86502775  74.42289071  93.22544912
  72.7915635   87.08863595  65.40329876  71.71813951  86.96861236])


tutoring_B_scores is: ([ 95.90773264  91.37094625  89.07481374  87.59117044  78.17182408
  84.24124633  86.31488983  98.45697781  92.74894632  75.89567876
  92.59267176  86.91934176  84.584624    94.89341031  98.24799618
  97.45024095  83.28625981  87.52630099  92.65010745  97.80436102
  86.1666061   88.51472819  81.14932021  80.43034701  96.50020658
 100.84992023  89.42391903  98.02826318  92.8930882   84.83904196
  92.89116484 102.30429253 

In [12]:
# Perform Independent Samples t-Test:



In [13]:
# Check if the result is statistically significant
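
One way to fill in these two cells is sketched below. It regenerates the data exactly as in the data-generation cell and runs a two-sided test. Using the default equal-variance form is an assumption of this sketch.

```python
import numpy as np
from scipy import stats

# Regenerate the data as in the data-generation cell
np.random.seed(42)
tutoring_A_scores = np.random.normal(loc=85, scale=10, size=40)
tutoring_B_scores = np.random.normal(loc=90, scale=8, size=40)

# Independent samples t-test (two-sided)
t_stat, p_value = stats.ttest_ind(tutoring_A_scores, tutoring_B_scores)

significance_level = 0.05
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < significance_level:
    print("Reject H0: mean scores differ between the programs.")
else:
    print("Fail to reject H0.")
```

Since the two groups were simulated with different spreads (10 and 8), `equal_var=False` (Welch's test) is a reasonable variant to compare against.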



# Dependent t-test

The dependent samples t-test, also known as a paired samples t-test, is a statistical test used to compare the means of two related groups. Here's a brief explanation:

**Objective:**

Assess whether there is a significant difference between the means of two related groups.

**Key Points:**

- **1. Dependent Groups:**

    - The groups being compared are not independent; they are related or matched pairs. Each individual in one group is directly paired with an individual in the other group.

- **2. Hypotheses:**

    - Null Hypothesis (Ho): The mean difference between the paired observations is equal to zero (μd = 0)
    - Alternative Hypothesis (Ha): The mean difference between the paired observations is not equal to zero (μd ≠ 0).

- **3. Data Requirement:**

    - Data is collected on the same subjects or items under different conditions, treatments, or time points.

- **4. Calculation of Differences:**

    - The test is based on the differences between the paired observations (e.g., before and after treatment, pre-test and post-test scores).

- **5. Assumption:**

    - The differences between pairs are assumed to be approximately normally distributed.

- **6. Performing the Test:**

    - The t-test is performed on the differences, and the test statistic and p-value are calculated.

- **7. Decision Making:**

    - If the p-value is less than the chosen significance level (e.g., 0.05), the null hypothesis is rejected.

- **8. Interpretation:**

    - Rejecting the null hypothesis indicates that there is a significant difference between the means of the related groups.
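
Points 4 to 6 say the test operates on the pairwise differences. The sketch below, using made-up data, shows that a paired t-test gives the same result as a one-sample t-test of the differences against zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(50, 8, 20)            # hypothetical paired measurements
after = before + rng.normal(2, 3, 20)

differences = after - before

# Paired t-test on the two related samples
t_rel, p_rel = stats.ttest_rel(after, before)
# One-sample t-test on the differences, testing mean difference = 0
t_one, p_one = stats.ttest_1samp(differences, popmean=0.0)

print(t_rel, t_one)
print(p_rel, p_one)
```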

## Question-1

Compare the scores of the same group of students before and after a training program to see if there is a significant improvement.

In [64]:
# Example data

before_scores = np.random.normal(72, 5, 30)
after_scores = before_scores + np.random.normal(5, 2, 30)

In [14]:
# Calculate the t-statistic and p_value



In [15]:
# Check if the result is statistically significant
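
A possible solution sketch follows. Because the question asks about an improvement, it assumes a one-sided alternative (after > before), and it adds a seed for repeatability. Both choices are assumptions. The `alternative` argument requires SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # assumption: seed added only for repeatability
before_scores = np.random.normal(72, 5, 30)
after_scores = before_scores + np.random.normal(5, 2, 30)

# Paired t-test, one-sided: is the mean of (after - before) greater than zero?
t_stat, p_value = stats.ttest_rel(after_scores, before_scores, alternative="greater")

significance_level = 0.05
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < significance_level:
    print("Reject H0: scores improved significantly after the training.")
else:
    print("Fail to reject H0.")
```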



## Question-2

A group of individuals participates in a lifestyle intervention program to improve their health, with a focus on reducing stress and improving overall well-being. We want to investigate if there is a statistically significant change in their blood pressure before and after the intervention.

Hypotheses:

Null Hypothesis (Ho) : The mean difference in blood pressure before and after the intervention is zero (μdiff = 0)

Alternative Hypothesis (Ha) : The mean difference in blood pressure before and after the intervention is not equal to zero (μdiff ≠ 0)

In [70]:
# Data Generation:

import numpy as np
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Generate data for blood pressure before the intervention
bp_before = np.random.normal(loc=120, scale=10, size=25)
print(f"bp_before is: ({bp_before})")

print("\n")

# Generate data for blood pressure after the intervention
bp_after = bp_before + np.random.normal(loc=-5, scale=8, size=25)
print(f"bp_after is: ({bp_after})")

bp_before is: ([124.96714153 118.61735699 126.47688538 135.23029856 117.65846625
 117.65863043 135.79212816 127.67434729 115.30525614 125.42560044
 115.36582307 115.34270246 122.41962272 100.86719755 102.75082167
 114.37712471 109.8716888  123.14247333 110.91975924 105.87696299
 134.65648769 117.742237   120.67528205 105.75251814 114.55617275])


bp_after is: ([120.85452225 104.40940837 124.48246953 125.42518904 110.32491625
 107.84497753 145.61035363 122.56636949 101.84356871 127.00595973
 100.59907387 112.01361122 101.74226172  85.24170916  99.32571156
 115.28485735 106.24263505 117.21728707 103.51092968  89.04878706
 123.89773402 109.05712683 124.13225986 103.50146445  95.45185151])


In this example, bp_before represents the blood pressure of individuals before the lifestyle intervention, and bp_after represents the blood pressure after the intervention. The values in bp_after are generated by adding normally distributed noise with a mean of -5 and a standard deviation of 8 to bp_before, simulating a potential reduction in blood pressure.

In [None]:
# Perform dependent samples t-test



In [16]:
# Check if the result is statistically significant
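
One way to complete these cells is sketched below. It regenerates the data as in the data-generation cell and runs a two-sided paired test, matching the hypotheses stated for this question.

```python
import numpy as np
from scipy import stats

# Regenerate the data as in the data-generation cell
np.random.seed(42)
bp_before = np.random.normal(loc=120, scale=10, size=25)
bp_after = bp_before + np.random.normal(loc=-5, scale=8, size=25)

# Dependent (paired) samples t-test, two-sided
t_stat, p_value = stats.ttest_rel(bp_before, bp_after)

significance_level = 0.05
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < significance_level:
    print("Reject H0: blood pressure changed significantly.")
else:
    print("Fail to reject H0.")
```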



# One-way ANOVA

One-way Analysis of Variance (ANOVA) is a statistical test used to determine if there are any statistically significant differences between the means of three or more independent (unrelated) groups. Here's a brief explanation:

**Objective:**

Assess whether there are any statistically significant differences in the means of three or more groups.

**Key Points:**

- **1. Groups:**

    - The data is divided into three or more groups or categories.

- **2. Hypotheses:**

    - Null Hypothesis (Ho): The means of all groups are equal.
    - Alternative Hypothesis (Ha) : At least one group mean is different from the others.

- **3. Assumptions:**

    - The data within each group is approximately normally distributed.
    - Homogeneity of variances: The variances within each group are approximately equal.

- **4. Variability:**

    - ANOVA assesses the variability between group means compared to the variability within each group.

- **5. Sum of Squares:**

    - ANOVA calculates the sum of squares between groups and within groups.

- **6. F-Statistic:**

    - The F-statistic is calculated as the ratio of the variance between groups to the variance within groups.

- **7. P-Value:**

    - The p-value associated with the F-statistic is used to determine whether to reject the null hypothesis.

- **8. Post Hoc Tests:**

    - If ANOVA indicates significant differences, post hoc tests (e.g., Tukey's HSD) may be used to identify which groups differ from each other.
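
Points 4 to 7 can be made concrete by computing the F-statistic from the sums of squares and comparing it with `scipy.stats.f_oneway`. The three groups below are made-up data for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = [rng.normal(mean, 5, 20) for mean in (70, 73, 76)]  # hypothetical data

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (between-group variance) / (within-group variance)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_manual, p_scipy)
```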

## Question-1

Test whether there is a significant difference in the average scores among three or more groups.

In [24]:
# Example data

group_a_scores = np.random.normal(72, 5, 30)
group_b_scores = np.random.normal(75, 5, 30)
group_c_scores = np.random.normal(78, 5, 30)

In [17]:
# Calculate the F-statistic and p_value



In [18]:
# Check if the result is statistically significant
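
A possible solution sketch for this question is shown below. The seed is an assumption added for repeatability; the original data cell is unseeded.

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # assumption: seed added only for repeatability
group_a_scores = np.random.normal(72, 5, 30)
group_b_scores = np.random.normal(75, 5, 30)
group_c_scores = np.random.normal(78, 5, 30)

# One-way ANOVA across the three groups
f_stat, p_value = stats.f_oneway(group_a_scores, group_b_scores, group_c_scores)

significance_level = 0.05
print("F-statistic:", f_stat)
print("p-value:", p_value)
if p_value < significance_level:
    print("Reject H0: at least one group mean differs.")
else:
    print("Fail to reject H0.")
```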



## Question-2

A psychological experiment is conducted to measure the reaction times of participants exposed to different types of stimuli. The goal is to investigate if there are significant differences in reaction times across the three stimulus conditions.

**Hypotheses:**

- Null Hypothesis (Ho) : The mean reaction times are equal across all stimulus conditions.
- Alternative Hypothesis (Ha) : At least one group has a different mean reaction time.

In [82]:
# Data Generation:

import numpy as np
import pandas as pd
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Generate data for three groups (stimulus conditions)
stimuli_types = ['Visual', 'Auditory', 'Tactile']

# Generate simulated reaction time data for each group
reaction_times = {
    'Visual': np.random.normal(loc=200, scale=20, size=30),
    'Auditory': np.random.normal(loc=210, scale=20, size=30),
    'Tactile': np.random.normal(loc=220, scale=20, size=30)
}

# Combine data into a DataFrame
data = pd.DataFrame({stimulus: rt for stimulus, rt in reaction_times.items()})
data

| | Visual | Auditory | Tactile |
|---|---|---|---|
| 0 | 209.934283 | 197.965868 | 210.416515 |
| 1 | 197.234714 | 247.045564 | 216.28682 |
| 2 | 212.953771 | 209.730056 | 197.873301 |
| 3 | 230.460597 | 188.845781 | 196.075868 |
| 4 | 195.316933 | 226.450898 | 236.250516 |
| 5 | 195.317261 | 185.583127 | 247.124801 |
| 6 | 231.584256 | 214.177272 | 218.559798 |
| 7 | 215.348695 | 170.806598 | 240.070658 |
| 8 | 190.610512 | 183.436279 | 227.232721 |
| 9 | 210.851201 | 213.937225 | 207.097605 |


In [83]:
# Perform one-way ANOVA


In [19]:
# Check if the p-value is less than the significance level (e.g., 0.05)
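
One way to complete these cells is sketched below. It rebuilds the DataFrame as in the data-generation cell and passes each column to `scipy.stats.f_oneway`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Regenerate the data as in the data-generation cell
np.random.seed(42)
reaction_times = {
    'Visual': np.random.normal(loc=200, scale=20, size=30),
    'Auditory': np.random.normal(loc=210, scale=20, size=30),
    'Tactile': np.random.normal(loc=220, scale=20, size=30)
}
data = pd.DataFrame(reaction_times)

# One-way ANOVA across the three stimulus conditions
f_stat, p_value = stats.f_oneway(data['Visual'], data['Auditory'], data['Tactile'])

significance_level = 0.05
print("F-statistic:", f_stat)
print("p-value:", p_value)
if p_value < significance_level:
    print("Reject H0: at least one condition has a different mean reaction time.")
else:
    print("Fail to reject H0.")
```

If the null hypothesis is rejected, a post hoc test such as Tukey's HSD would be the next step to see which conditions differ.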



# Chi-square Test for Categorical Variables:

The chi-square test is a statistical test used to determine if there's a significant association between two categorical variables.

**Key Points:**

- **1. Null Hypothesis (Ho):**
    - Assumes no association between the variables.

- **2. Alternative Hypothesis (Ha):**
    - Assumes an association between the variables.

- **3. Test Statistic:**
    - The chi-square test calculates a test statistic based on the differences between the observed and expected frequencies in a contingency table.

- **4. Contingency Table:**
    - A table that shows the distribution of frequencies across categories for each variable. It helps compare observed and expected values.

- **5. Degrees of Freedom:**
    - The degrees of freedom depend on the number of categories in the variables being analyzed: for a contingency table, df = (rows − 1) × (columns − 1).

- **6. P-Value:**
    - The chi-square test produces a p-value. If it's below a chosen significance level (e.g., 0.05), you reject the null hypothesis, suggesting evidence of an association.

In essence, the chi-square test helps you determine if the observed distribution of categorical data differs from what you would expect by chance, providing insights into the relationship between variables.
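
The test statistic in point 3 can be sketched by hand: expected counts come from the row and column totals, and the statistic sums the squared differences scaled by the expected counts. The 2×3 table below is hypothetical.

```python
import numpy as np
from scipy import stats

observed = np.array([[20, 15, 15],
                     [10, 25, 15]])   # hypothetical contingency table

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# SciPy's implementation for comparison
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
print(chi2_manual, chi2_scipy, p_manual, p_scipy)
```

For 2×2 tables `chi2_contingency` applies Yates' continuity correction by default, so a manual calculation like this one would match only with `correction=False`.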


## Question

Test whether there is a significant association between gender (male/female) and the preference for a particular subject (Math, English, Science).

In [27]:
# Example data

observed_data = np.array([[30, 20, 10], [25, 25, 20]])  # Example contingency table

In [20]:
# Calculate the chi2-statistic and p_value



In [21]:
# Check if the p-value is less than the significance level (e.g., 0.05)
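
A possible solution sketch for this question, using the contingency table from the example-data cell (rows are assumed to be the two genders, columns the three subjects):

```python
import numpy as np
from scipy import stats

observed_data = np.array([[30, 20, 10], [25, 25, 20]])  # Example contingency table

# Chi-square test of independence
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed_data)

significance_level = 0.05
print("chi2-statistic:", chi2_stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
if p_value < significance_level:
    print("Reject H0: gender and subject preference appear to be associated.")
else:
    print("Fail to reject H0: no significant association detected.")
```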



___

<a href="https://lms.clarusway.com/mod/lesson/view.php?id=8511&pageid=8142&startlastseen=no"><img align="left" src="https://i.ibb.co/6Z5pQxD/lmss.png" alt="Open in Clarusway LMS" width="70" height="200" title="Open Clarusway Learning Management System"></a>

<a href=""><img align="right" src="https://i.ibb.co/n3HWyQX/github-logo.png" alt="Open in Clarusway GitHub" width="100" height="150" title="Open and Execute in Clarusway GitHub Repository"></a>

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" 
alt="CLRSWY"></p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:100%; text-align:center; border-radius:10px 10px;">WAY TO REINVENT YOURSELF</p>