Suppose Walmart's Quality Control department wants to know what proportion of the company's products in its warehouses are defective. Instead of inspecting every product in the warehouse (which would be impractical), the team can select a `small sample of 1000 products`. It can then find the defect rate (i.e., the proportion of defective products) in the sample and use it to infer the defect rate for all the products in the warehouses.

This process of deriving insights or drawing inferences from sample data is called `inferential statistics`. Situations like the one above arise all the time in big companies like Amazon and Flipkart, among others.
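
The warehouse scenario can be sketched as a small simulation. This is only an illustration: the warehouse size, the 2% true defect rate, and all variable names are assumptions made for the example, not figures from any real company.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical warehouse: 1,000,000 products with an assumed true defect rate of 2%
true_defect_rate = 0.02
warehouse = rng.random(1_000_000) < true_defect_rate  # True marks a defective product

# Inspect a simple random sample of 1000 products (drawn without replacement)
sample = rng.choice(warehouse, size=1000, replace=False)
sample_defect_rate = sample.mean()

print(f"Sample defect rate:    {sample_defect_rate:.3f}")
print(f"Warehouse defect rate: {warehouse.mean():.3f}")
```

The sample rate will usually land close to the warehouse rate without matching it exactly. Inferential statistics is about quantifying that gap.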

`Inferential Statistics is used in the industry in multiple ways like:`

**1.Healthcare: Clinical Trials**

Inferential statistics are used in clinical trials to analyze the effectiveness of new drugs or treatments by drawing conclusions about the entire patient population based on a sample.

**2.Finance: Risk Assessment**

Financial institutions use inferential statistics to assess and manage risks. This includes predicting market trends, estimating the likelihood of default on loans, and analyzing investment portfolios.

**3.Marketing: Consumer Surveys**

In marketing, inferential statistics are employed to make inferences about the preferences and behaviors of a target market based on survey data, helping businesses make informed decisions about product development and advertising strategies.

**4.Manufacturing: Quality Control**

Inferential statistics are used in quality control processes to make inferences about the quality of products based on a sample of items, helping manufacturers maintain consistent product quality.

**5.Education: Standardized Testing**

In education, inferential statistics are used to draw conclusions about the performance of a larger population of students based on the results of standardized tests taken by a representative sample.

**6.Environmental Science: Pollution Monitoring**

Inferential statistics help environmental scientists estimate the level of pollution in a region by analyzing samples of air, water, or soil, allowing for inferences about the overall environmental health.

**7.Human Resources: Employee Satisfaction**

HR professionals use inferential statistics to make inferences about the overall job satisfaction of employees based on survey data, helping organizations identify areas for improvement.

**8.Retail: Demand Forecasting**

In retail, inferential statistics are applied to analyze past sales data and make predictions about future demand for products, optimizing inventory management and supply chain logistics.

**9.Telecommunications: Network Performance**

Telecom companies use inferential statistics to assess network performance by analyzing data from a sample of users, helping them make inferences about the quality and reliability of their services for the entire user base.

**10.Government: Census Data Analysis**

Governments use inferential statistics when analyzing census data. By studying a sample of the population, they can make inferences about demographic trends, socioeconomic indicators, and other important factors that inform public policy decisions.

### Probability

Probability is a measure of how certain (or uncertain) it is that an event or outcome will occur under a stochastic (random) process. It is represented numerically as a number between zero and one. The two endpoints both represent certainty: zero means the event cannot occur, and one means it is sure to occur.

**Probability measures the likelihood that an event will occur. Probability values have two properties:**

`They always lie in the range of 0 to 1.` The value is 0 when an event is impossible (for example, the probability of you being in India and America at the same time) and 1 when an event is sure to occur (for example, the probability of the sun rising in the East tomorrow).

`The sum of the probabilities of all outcomes of an experiment is always 1.` For instance, in a coin toss, there can be two outcomes: heads or tails. Each outcome has a probability of 0.5. Hence, the sum of the probabilities is 0.5 + 0.5 = 1.

### Random Variables

A random variable assigns a number to each outcome of a random experiment. For example, a coin toss has just two possible outcomes, Heads and Tails. If we map Heads to the number 1 and Tails to 0, the toss becomes a random variable, and eight recorded coin flips could look something like (1,1,0,1,0,0,1,0). The value of a random variable can change the next time it is recorded, but it always comes from a specific set of possible values.

A random variable is denoted with a capital letter (typically, X, Y, Z, etc.), and specific values are denoted with lowercase letters (e.g., X = x or X ≤ x).

Ex: Tossing two coins together, with X = the number of heads

- X=0 if neither toss results in heads. `P(X=0) = 1/4`
- X=1 if exactly one of the tosses results in heads. `P(X=1) = 2/4 = 1/2`
- X=2 if both tosses result in heads. `P(X=2) = 1/4`
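
The probabilities above can be checked by listing every outcome of the two tosses. The following is a minimal sketch using only the standard library. It assumes both coins are fair, so that all four outcomes are equally likely.

```python
from itertools import product
from fractions import Fraction
from collections import Counter

# All equally likely outcomes of tossing two fair coins: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# X = number of heads in each outcome
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# P(X = x) = (number of outcomes with x heads) / (total number of outcomes)
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(head_counts.items())}

for x, prob in pmf.items():
    print(f"P(X={x}) = {prob}")
```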

**Random Variables are of two types:**

1. **Discrete RV:** They take values from a countable set of possible outcomes. Each outcome has an associated probability. `Ex: Number of heads in two tosses, The creditworthiness of a loan applicant, Marital Status, Gender etc.`

2. **Continuous RV:** They can take any value within a range. `Ex: Age of a person, Income of a person, Time spent on a platform like Netflix, Disney, Hotstar etc.`

![77a6e6a6-a75e-4e72-9d88-22294ecfa42f-PL08.png](attachment:77a6e6a6-a75e-4e72-9d88-22294ecfa42f-PL08.png)

### Probability Mass Functions

A probability mass function (PMF) is a type of probability distribution that defines the probability of observing a particular value of a discrete random variable. For example, a PMF can be used to calculate the probability of rolling a three on a fair six-sided die.

There are certain kinds of random variables (and associated probability distributions) that are relevant for many different kinds of problems. These commonly used probability distributions have names and parameters that make them adaptable for different situations.

For example, suppose that we flip a fair coin some number of times and count the number of heads. The probability mass function that describes the likelihood of each possible outcome (e.g., 0 heads, 1 head, 2 heads, etc.) is called the binomial distribution. The parameters for the binomial distribution are:

- `n` for the number of trials (e.g., n=10 if we flip a coin 10 times)
- `p` for the probability of success in each trial, i.e., the probability of observing the outcome we are counting (in this example, p=0.5 because the probability of observing heads on a fair coin flip is 0.5)

If we flip a fair coin 10 times, we say that the number of observed heads follows a Binomial(n=10, p=0.5) distribution. The graph below shows the probability mass function for this experiment. The heights of the bars represent the probability of observing each possible outcome as calculated by the PMF.

![Binomial Graph](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/binom_pmf_10_5.svg)

The binom.pmf() method from the scipy.stats library can be used to calculate the PMF of the binomial distribution at any value. This method takes 3 values:

- `x:` the value of interest
- `n:` the number of trials
- `p:` the probability of success

For example, suppose we flip a fair coin 10 times and count the number of heads. We can use the binom.pmf() function to calculate the probability of observing 6 heads as follows:

In [1]:
import scipy.stats as stats

#stats.binom.pmf(x, n, p)
print(stats.binom.pmf(6, 10, 0.5))

0.20507812500000022


**Using the Probability Mass Function Over a Range**
We have seen that we can calculate the probability of observing a specific value using a probability mass function. What if we want to find the probability of observing a range of values for a discrete random variable? One way we could do this is by adding up the probability of each value.

For example, let’s say we flip a fair coin 5 times, and want to know the probability of getting between 1 and 3 heads. We can visualize this scenario with the probability mass function:

![image](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/Binomial-Distribution-PMF-Probability-over-a-Range.gif)

- P(1 to 3 heads) = P(1 <= X <= 3)
- P(1 to 3 heads) = P(X=1) + P(X=2) + P(X=3)
- P(1 to 3 heads) = 0.15625 + 0.3125 + 0.3125
- P(1 to 3 heads) = 0.78125

To experiment further with the examples above, visit [this page](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/binomial-range_v2/index.html).
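
The five-flip calculation above can be reproduced in code. As a small sketch, `binom.pmf()` accepts an array of values, so the three probabilities can be computed in one call and then summed. The variable names are just illustrative.

```python
import numpy as np
import scipy.stats as stats

# X ~ Binomial(n=5, p=0.5): number of heads in 5 fair coin flips
values = np.arange(1, 4)  # the values 1, 2, 3
probs = stats.binom.pmf(values, n=5, p=0.5)

prob_1_to_3 = probs.sum()
print(probs)
print(prob_1_to_3)
```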

In [5]:
import scipy.stats as stats

# calculating P(2-4 heads) = P(2 heads) + P(3 heads) + P(4 heads) for flipping a coin 10 times
print(stats.binom.pmf(2, n=10, p=.5) + 
      stats.binom.pmf(3, n=10, p=.5) + stats.binom.pmf(4, n=10, p=.5))

0.36621093750000033


In [6]:
import scipy.stats as stats

#probability of observing 8 or fewer heads from 10 coin flips
print(stats.binom.pmf(0, n = 10, p = 0.5) + 
stats.binom.pmf(1, n = 10, p = 0.5) + 
stats.binom.pmf(2, n = 10, p = 0.5) + 
stats.binom.pmf(3, n = 10, p = 0.5) + 
stats.binom.pmf(4, n = 10, p = 0.5) + 
stats.binom.pmf(5, n = 10, p = 0.5) + 
stats.binom.pmf(6, n = 10, p = 0.5) + 
stats.binom.pmf(7, n = 10, p = 0.5) + 
stats.binom.pmf(8, n = 10, p = 0.5))

0.9892578125000009


### Cumulative Distribution Function

The cumulative distribution function for a discrete random variable can be derived from the probability mass function. However, instead of the probability of observing a specific value, the cumulative distribution function gives the probability of observing a specific value OR LESS.

As previously discussed, the probabilities for all possible values in a given probability distribution add up to 1. The value of a cumulative distribution function at a given value is the sum of the probabilities of all values less than or equal to it, so it reaches 1 at the largest possible value.

Cumulative distribution functions never decrease, so for two different numbers, the value of the function for the larger number is at least as large as for the smaller one (and strictly larger if the larger number has a nonzero probability). Mathematically, this is represented as:
- `If x1 < x2 : CDF(x1) <= CDF(x2)`

The probability mass function can be used to calculate the probability of observing fewer than 3 heads out of 10 coin flips by adding up the probabilities of observing 0, 1, and 2 heads. The cumulative distribution function produces the same answer by evaluating CDF(2). In this case, `using the CDF is simpler than the PMF` because it requires one calculation rather than three.

Ex : `P(3 <= X <= 6) = P(X <= 6) - P(X < 3)`
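
Since X only takes whole-number values, `P(X < 3)` is the same as `P(X <= 2)`, i.e. `CDF(2)`. The sketch below checks the identity for 10 flips of a fair coin by comparing the CDF difference with a direct sum of the PMF. The choice of n=10 and p=0.5 is just for illustration.

```python
import scipy.stats as stats

n, p = 10, 0.5  # 10 flips of a fair coin

# P(3 <= X <= 6) via the CDF: P(X <= 6) - P(X <= 2)
via_cdf = stats.binom.cdf(6, n, p) - stats.binom.cdf(2, n, p)

# The same probability by summing the PMF over 3, 4, 5, 6
via_pmf = sum(stats.binom.pmf(k, n, p) for k in range(3, 7))

print(via_cdf, via_pmf)
```

Both values agree up to floating-point rounding.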

In [7]:
import scipy.stats as stats

# P(6 or fewer heads) = P(0 to 6 heads)
print(stats.binom.cdf(6, 10, 0.5))

0.828125


In [8]:
import scipy.stats as stats

#P(4 to 8 heads) = P(0 to 8 heads) - P(0 to 3 heads)
print(stats.binom.cdf(8, 10, 0.5) - stats.binom.cdf(3, 10, 0.5))

0.8173828125


In [9]:
print(stats.binom.cdf(3, 10, 0.5))

print('vs')

print(stats.binom.pmf(0, n=10, p=.5) + 
      stats.binom.pmf(1, n=10, p=.5) + 
      stats.binom.pmf(2, n=10, p=.5) + stats.binom.pmf(3, n=10, p=.5))

0.17187499999999994
vs
0.17187500000000014


### Probability Density Functions

Similar to how discrete random variables relate to probability mass functions, continuous random variables relate to probability density functions. They define the probability distributions of continuous random variables and span across all possible values that the given random variable can take on.

When graphed, a probability density function is a curve across all possible values the random variable can take on, and the total area under this curve adds up to 1.

The following image shows a probability density function. The highlighted area represents the probability of observing a value within the highlighted range.

![pdf](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/Adding-Area.gif)

For a continuous random variable, the probability of observing any single exact value is zero. This is because the area under the curve above a single point is always zero. The gif below showcases this.

![pdf](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/Normal-Distribution-Area-to-Zero.gif)

As we can see from the visual above, as the interval becomes smaller, the width of the area under the curve becomes smaller as well. When trying to evaluate the area under the curve at a specific point, the width of that area becomes 0, and therefore the probability equals 0.
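
The same shrinking-interval idea can be shown numerically. The sketch below assumes a normal distribution with mean 167.64 and standard deviation 8, and a point of interest x = 170. All three numbers are chosen only for illustration.

```python
import scipy.stats as stats

# Assumed distribution for illustration: Normal(mean=167.64, std=8)
mean, std = 167.64, 8
x = 170

# Probability of landing within +/- half_width of x, for shrinking intervals
half_widths = [5, 1, 0.1, 0.001]
areas = [
    stats.norm.cdf(x + h, mean, std) - stats.norm.cdf(x - h, mean, std)
    for h in half_widths
]

for h, area in zip(half_widths, areas):
    print(f"P({x - h} <= X <= {x + h}) = {area:.6f}")
```

Each narrower interval has a smaller area, and the area heads toward 0 as the interval collapses to the single point.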

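Let’s say we want to know the probability that a randomly chosen woman is less than 158 cm tall. We can use the cumulative distribution function to calculate the area under the probability density function curve to the left of 158 to find that probability. In the code below, heights are modeled as normally distributed with a mean of 167.64 cm and a standard deviation of 8 cm.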

![pdf](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/norm_pdf_167_8_filled.svg)

In [11]:
import scipy.stats as stats
#x : value of interest
#loc: mean of the distribution
#scale : std dev of the distribution

# stats.norm.cdf(x, loc, scale)
print(stats.norm.cdf(158, 167.64, 8))

0.11410165094812996


**Demo: Some examples on the Normal Distribution and Z-Score calculations**
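
As one possible example for this demo, the sketch below reuses the height parameters from the cell above (mean 167.64, standard deviation 8). A z-score expresses a value as a number of standard deviations from the mean, `z = (x - mean) / std`. Evaluating the standard normal CDF at `z` should give the same probability as evaluating the original distribution's CDF at `x`.

```python
import scipy.stats as stats

mean, std = 167.64, 8  # same parameters as the height example
x = 158

# z-score: how many standard deviations x lies from the mean
z = (x - mean) / std

# The standard normal CDF at z matches the CDF of Normal(mean, std) at x
p_from_z = stats.norm.cdf(z)
p_direct = stats.norm.cdf(x, mean, std)

print(f"z = {z:.3f}")
print(p_from_z, p_direct)
```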

### Poisson Distribution

The Poisson distribution is another common distribution, and it is used to describe the number of times a certain event occurs within a fixed time or space interval. Examples below:

**Telecommunications:** In telecommunication networks, the Poisson distribution can model the arrival of calls, messages, or data packets within a certain time interval. This helps in optimizing network resources and capacity planning.

**Insurance:** Poisson distribution is used in insurance risk modeling to predict the number of insurance claims within a given period. It helps insurance companies assess risks and determine appropriate premium rates.

**Manufacturing:** In manufacturing processes, defects or errors in production can often be modeled using the Poisson distribution. This helps in quality control and process improvement efforts.

**Healthcare:** In healthcare, the Poisson distribution is used to model the arrival of patients at a hospital's emergency department or the occurrence of rare diseases within a population.

**Finance:** In finance, the Poisson distribution is used in modeling the arrival of financial transactions, such as stock trades or loan defaults. It helps in risk management and portfolio optimization.

The `Poisson Distribution` is a discrete probability distribution, so it can be described by a PMF and a CDF. It has a single parameter: the expected number of events in the interval.
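
For reference, the Poisson PMF has a simple closed form: `P(X = k) = λ^k · e^(−λ) / k!`, where `λ` is the expected number of events. The sketch below evaluates that formula by hand and compares it with `stats.poisson.pmf()`. The values `lam = 10` and `k = 12` are chosen only for illustration.

```python
import math
import scipy.stats as stats

lam = 10  # expected number of events in the interval
k = 12    # number of events we want the probability of

# Poisson PMF: P(X = k) = lam**k * exp(-lam) / k!
manual = lam**k * math.exp(-lam) / math.factorial(k)

print(manual)
print(stats.poisson.pmf(k, lam))
```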

In [1]:
import scipy.stats as stats
# expected value = 10 calls between 1-2PM, probability of observing 12-14 calls
stats.poisson.pmf(12, 10) + stats.poisson.pmf(13, 10) + stats.poisson.pmf(14, 10)

0.21976538076223123

In [2]:
import scipy.stats as stats
# expected value = 10, probability of observing 6 or less
stats.poisson.cdf(6, 10)

0.130141420882483

In [3]:
import scipy.stats as stats
# expected value = 10, probability of observing 12 or more
1 - stats.poisson.cdf(11, 10)

0.30322385369689386

In [4]:
import scipy.stats as stats
# expected value = 10, probability of observing between 12 and 18
stats.poisson.cdf(18, 10) - stats.poisson.cdf(11, 10)

0.29603734909303947

### Population vs Sample

**`Population`**
The population in statistics refers to the entire group that is the subject of the study. It includes all the individuals or items that meet a particular set of criteria. The population is the complete set of observations or elements that share a common characteristic and is of interest to the researcher.

**Example:**
If you were studying the average income of households in a city, the population would be all the households in that city. Every single household, regardless of size or income level, is part of the population.

**`Sample:`**
A sample is a subset of the population that is selected for the actual study. It is not always feasible or practical to collect data from an entire population, so researchers choose a representative sample to draw conclusions about the population. The goal is to ensure that the sample is representative enough that findings from the sample can be generalized to the entire population.

**Example:**
In the household income study mentioned earlier, it might be impractical to survey every single household in the city. Instead, a researcher might select a random sample of, say, 500 households to study. The 500 households form the sample, and the researcher uses the data collected from this sample to make inferences about the income of all households in the city.

### Sampling Techniques

A sample must represent the population well for conclusions drawn from it to hold for the population. So let's take a look at the various sampling techniques that are available:

1. Simple random sampling
2. Stratified sampling 
3. Systematic sampling
4. Cluster sampling 
5. Judgment sampling
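
The first four techniques can be sketched with pandas. This is a rough illustration under stated assumptions: the population of 1000 households, the `region` column, and the sample sizes are all made up for the example. Judgment sampling is left out because it relies on the researcher's own choice of items rather than on a random procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical population: 1000 households across three regions
population = pd.DataFrame({
    "household_id": np.arange(1000),
    "region": rng.choice(["north", "south", "east"], size=1000, p=[0.5, 0.3, 0.2]),
})

# 1. Simple random sampling: every household has the same chance of selection
simple_random = population.sample(n=100, random_state=0)

# 2. Stratified sampling: draw 10% from each region so regions keep their proportions
stratified = population.groupby("region").sample(frac=0.10, random_state=0)

# 3. Systematic sampling: pick a random start, then take every k-th household
k = 10
start = rng.integers(0, k)
systematic = population.iloc[start::k]

# 4. Cluster sampling: randomly pick whole regions and keep every household in them
chosen_regions = rng.choice(population["region"].unique(), size=2, replace=False)
cluster = population[population["region"].isin(chosen_regions)]

print(len(simple_random), len(stratified), len(systematic), len(cluster))
```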

### Central Limit Theorem

The Central Limit Theorem (CLT) is a fundamental concept in statistics and plays a crucial role in data science. It states that, regardless of the shape of the original population distribution (provided it has a finite variance), the sampling distribution of the sample mean will be approximately normally distributed for sufficiently `large sample sizes (a common rule of thumb is n > 30)`. This theorem is particularly important in inferential statistics, where we make inferences about a population based on a sample.

**CLT in a Nutshell:**
If you have a population with any shape of distribution and you repeatedly draw random samples of a certain size from that population, the distribution of the sample means will be approximately normal, regardless of the shape of the original population distribution.
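
A quick simulation can make this concrete. The sketch below assumes a strongly right-skewed population, modeled as an exponential distribution with mean 2, and draws 10,000 samples of size 50. These numbers are arbitrary choices for illustration. If the CLT holds, the sample means should center on the population mean with a spread close to `population_std / sqrt(sample_size)`.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical, strongly right-skewed population: exponential with mean 2 (std is also 2)
population_mean, population_std = 2.0, 2.0
sample_size = 50
num_samples = 10_000

# Draw many samples of size 50 and record each sample's mean
samples = rng.exponential(scale=population_mean, size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

# CLT: the sample means cluster around the population mean,
# with spread close to population_std / sqrt(sample_size)
print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std(ddof=1))
print("predicted by CLT:    ", population_std / np.sqrt(sample_size))
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population itself is heavily skewed.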

**Real World Industry Use Cases in Data Science:**

**1.Quality Control in Manufacturing:**

`Scenario:` A manufacturing plant produces a large number of products each day, and the quality control team is interested in the average weight of the products.

`Use of CLT:` By collecting random samples of product weights and calculating the sample means, the quality control team can apply the CLT to assume that the distribution of sample means is approximately normal. This allows them to make statistical inferences about the average weight of all products.

**2.Financial Modeling and Risk Assessment:**

`Scenario:` A financial analyst wants to assess the average return on investment (ROI) of a portfolio of stocks.

`Use of CLT:` By taking multiple random samples of historical ROI data and calculating sample means, the analyst can apply the CLT. This enables them to make more reliable predictions about the average ROI of the entire portfolio.

**3.Marketing and A/B Testing:**

`Scenario:` A marketing team is running an A/B test to compare the effectiveness of two different versions of an advertisement.

`Use of CLT:` By collecting random samples of user responses to each version and calculating sample means, the marketing team can use the CLT to make statistical inferences about the average effectiveness of each advertisement version for the entire target audience.

**4.Healthcare and Clinical Trials:**

`Scenario:` In a clinical trial for a new drug, researchers want to estimate the average reduction in symptoms.

`Use of CLT:` By repeatedly collecting random samples of patient data and calculating sample means, researchers can apply the CLT. This allows them to make inferences about the average impact of the drug on the entire population of interest.

**5.E-commerce and Customer Behavior:**

`Scenario:` An e-commerce platform wants to understand the average time spent by customers on their website.

`Use of CLT:` By taking random samples of user engagement data and calculating sample means, the data science team can leverage the CLT to make statistically valid predictions about the average time spent on the website for all users.

## Hypothesis Testing

Hypothesis testing is a statistical method used in various industries to make decisions or draw conclusions about a population based on sample data. Here are several industry-based examples of hypothesis testing:

**1.Pharmaceuticals: Drug Efficacy**

**Scenario:** A pharmaceutical company develops a new drug to treat a specific medical condition and wants to determine if the drug is more effective than the existing treatment.
Hypothesis:
- **Null Hypothesis (H0):** The new drug is equally effective as the existing treatment.
- **Alternative Hypothesis (H1):** The new drug is more effective than the existing treatment.

**Test:** A clinical trial is conducted, and statistical tests are performed to analyze the data and determine if there is enough evidence to reject the null hypothesis in favor of the alternative.

**2.Finance: Investment Strategy**

**Scenario:** A financial analyst proposes a new investment strategy that claims to outperform the current market average.
Hypothesis:
- **Null Hypothesis (H0):** The new investment strategy does not outperform the market average.
- **Alternative Hypothesis (H1):** The new investment strategy outperforms the market average.

**Test:** Historical data is collected and analyzed using statistical tests to assess whether the returns from the proposed strategy are significantly higher than the market average.

**3.Manufacturing: Production Process Improvement**

**Scenario:** A manufacturing plant implements changes to its production process with the goal of reducing defects in the final product.
Hypothesis:
- **Null Hypothesis (H0):** The changes to the production process do not reduce defects.
- **Alternative Hypothesis (H1):** The changes to the production process reduce defects.

**Test:** Data on defect rates before and after the changes are collected, and statistical tests are performed to determine if there is a significant improvement.

**4. Retail: Sales Promotion Effectiveness**

**Scenario:** A retail store runs a promotion to increase sales and wants to assess whether the promotion has a significant impact.
Hypothesis:
- **Null Hypothesis (H0):** The promotion does not increase sales.
- **Alternative Hypothesis (H1):** The promotion increases sales.

**Test:** Sales data from the promotion period and a comparable non-promotion period are collected and analyzed using statistical tests to determine if there is a significant difference.

**5. Education: Teaching Method Evaluation**

**Scenario:** A school district introduces a new teaching method and wants to evaluate its impact on students' academic performance.
Hypothesis:
- **Null Hypothesis (H0):** The new teaching method does not improve academic performance.
- **Alternative Hypothesis (H1):** The new teaching method improves academic performance.

**Test:** Student performance data is collected before and after the implementation of the new teaching method, and statistical tests are conducted to assess if there is a significant improvement.

**6.Technology: Software Performance**

**Scenario:** A software development team introduces a new algorithm claiming to improve the performance of a computer program.
Hypothesis:
- **Null Hypothesis (H0):** The new algorithm does not improve software performance.
- **Alternative Hypothesis (H1):** The new algorithm improves software performance.

**Test:** Performance metrics are collected for the old and new algorithms, and statistical tests are applied to determine if there is a significant difference.

In each of these examples, hypothesis testing provides a structured approach to assess claims or changes within different industries, helping decision-makers make informed choices based on statistical evidence.

In [1]:
import numpy as np
from scipy.stats import chi2_contingency

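In [2]:
# observed counts in a 2x2 contingency table
# (rows are the categories of one variable, columns the categories of the other)
data = np.array([[30, 20], [40, 110]])

In [3]:
# returns the test statistic, p-value, degrees of freedom, and the counts expected under independence
# (for a 2x2 table, SciPy applies Yates' continuity correction by default)
test_stat, p, dof, expvalue = chi2_contingency(data)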

In [10]:
expvalue

array([[17.5, 32.5],
       [52.5, 97.5]])

In [4]:
print(test_stat)

16.879120879120876


In [5]:
print(dof)

1


In [8]:
print('p-value: '+ str(p))

p-value: 3.983738939937843e-05


In [7]:
significance_level = 0.05

In [9]:
if p < significance_level:
    print('Reject Null Hypothesis')
else:
    print('Failed to reject Null Hypothesis')

Reject Null Hypothesis


#### How to choose the Hypothesis Test

![hypothesis_test](https://static-assets.codecademy.com/Courses/Hypothesis-Testing/article_graphic.png)

#### one-sample t-test

In [2]:
from scipy.stats import ttest_1samp

global_average_score = 35
sample_scores = [12, 42, 37, 18, 23, 39, 45 , 52]

t_stat, p_value = ttest_1samp(sample_scores, global_average_score)

In [3]:
p_value

0.7730466649998495
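
A one-sample t-test compares the mean of a sample with a hypothesized population mean. To show what `ttest_1samp` computes, the sketch below rebuilds the test statistic by hand as `t = (sample mean - hypothesized mean) / (sample std / sqrt(n))` and compares it with SciPy's result. It reuses the scores from the cell above. Names such as `t_manual` are only illustrative.

```python
import numpy as np
from scipy.stats import ttest_1samp

global_average_score = 35
sample_scores = np.array([12, 42, 37, 18, 23, 39, 45, 52])

# t = (sample mean - hypothesized mean) / (sample std / sqrt(n))
n = len(sample_scores)
standard_error = sample_scores.std(ddof=1) / np.sqrt(n)
t_manual = (sample_scores.mean() - global_average_score) / standard_error

t_stat, p_value = ttest_1samp(sample_scores, global_average_score)
print(t_manual, t_stat, p_value)
```

With a p-value far above 0.05, this sample gives no evidence that the group's mean score differs from the global average of 35.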

#### Binomial test

If we instead have a sample of binary data and want to compare a sample proportion/frequency to an underlying probability (population value), a binomial test is appropriate. The classic example of a binomial test is tossing a coin to determine if it’s fair (fair means that the probability of either heads or tails is exactly 50%).

For example, suppose that you collect sample data from a coin by tossing it 100 times, and find that 45 flips result in heads. Is this result consistent with a fair coin (one where, if you flipped it infinitely many times, exactly half those flips would be heads)? The binomial test answers this with a p-value: the probability of seeing a result at least as extreme as 45 heads out of 100 if the coin really were fair.

In [None]:
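from scipy.stats import binomtest

# binom_test has been removed from recent SciPy releases;
# binomtest replaces it and returns a result object with a .pvalue attribute
result = binomtest(45, 100, p = 0.50)
p_value = result.pvalue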

The alternative hypothesis for this test is that the probability is different than p = 0.50, and the null is that it is equal to 0.50.

### Testing for an association between two variables at the population level

When we have a sample of data with two variables, and want to know if there is an association between those variables at the population level, we’ll need a different set of hypothesis tests. 
Ex: subscription rates for 2 versions of a web page among all site visitors.

#### two sample t-test

A two-sample t-test is used to investigate an association between a quantitative variable and a binary categorical variable.

In [None]:
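from scipy.stats import ttest_ind

# hypothetical data for illustration: minutes spent on the site for two page versions
version_a = [12.1, 9.8, 11.4, 10.2, 13.0]
version_b = [14.3, 12.9, 15.1, 13.7, 12.5]

#run the t-test here:
tstat, pval = ttest_ind(version_a, version_b)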

Other examples of two-sample t-tests include studies like drug trials or psychology studies with a control and experimental group or A/B Testing with quantitative data like “time spent on a website”.

#### ANOVA and Tukey’s range test

When the categorical variable has three or more categories, an ANOVA can be used to see if there is a significant difference between any of the groups. Then, if at least one pair of groups are significantly different, Tukey’s range test can be used to determine which groups are different. This is better than running multiple two-sample t-tests because it leads to a lower probability of making a type I error.

For example, if we want to compare the heights of three different tree species, in order to test the hypothesis that average tree heights vary by species, we can use an ANOVA. Then, if the p-value from the ANOVA is below our significance threshold, we can run Tukey’s range test to determine which tree species have significantly different heights.
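
The type I error concern can be made concrete with a little arithmetic. The sketch below makes the simplifying assumption that the three pairwise tests are independent (in practice they share data, so this is only an approximation). It computes the chance of at least one false positive when each test uses a 0.05 significance level.

```python
alpha = 0.05         # significance level for each individual test
num_comparisons = 3  # three species give three pairwise comparisons

# Probability of at least one false positive across all comparisons,
# under the simplifying assumption that the tests are independent
familywise_error = 1 - (1 - alpha) ** num_comparisons
print(round(familywise_error, 4))
```

Under that assumption the overall false-positive risk is roughly 14% rather than 5%, and it grows as more groups are compared.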

In [None]:
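# ANOVA Test (hypothetical tree heights in metres, for illustration only)
from scipy.stats import f_oneway
oak, pine, birch = [21, 23, 22, 25], [30, 28, 31, 29], [18, 17, 20, 19]
fstat, pval = f_oneway(oak, pine, birch)

# Tukey’s Range Test: takes all values plus a matching list of group labels
from statsmodels.stats.multicomp import pairwise_tukeyhsd
heights = oak + pine + birch
species = ['oak'] * 4 + ['pine'] * 4 + ['birch'] * 4
tukey_results = pairwise_tukeyhsd(heights, species, alpha=0.05)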

### Chi-Square Test

When looking at the relationship between two categorical variables, we can run a Chi-Square Test to see if there is a significant association between the variables. Both variables can have any number of categories. 
Ex: 
- Website version and whether or not someone subscribed
- Education level and tax income bracket (multiple categories such as “under 40k”, “40k-60k”, “60k-80k”, etc.)

In [None]:
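import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical data for illustration (far too small for a real test):
# page version shown to each visitor and whether that visitor subscribed
df = pd.DataFrame({'version': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'subscribed': ['yes', 'no', 'no', 'yes', 'yes', 'no']})

# create contingency table
ab_contingency = pd.crosstab(df['version'], df['subscribed'])

# run a Chi-Square test
chi2, pval, dof, expected = chi2_contingency(ab_contingency)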

Finally, a Chi-Square test evaluates whether the observed contingency table is significantly different from the table that would be expected if there were no association between the variables.

Beyond choosing a hypothesis test, it is important to understand whether the data you have meets the assumptions of the test you want to run. Each hypothesis test has a unique set of assumptions. However, there is one assumption that all hypothesis tests share: the data was randomly sampled from the population of interest.

This is important because random sampling ensures that the sample is representative of the population in terms of observed (and unobserved) characteristics. Unfortunately, there may be situations where random sampling is impossible, but it is important to understand how this can bias results of a test.

For example, consider a yogurt company called “The Dairy Culture”. Let’s say the company had multiple factories, but the quality assurance team only collected yogurts from one specific factory. The data is thus not randomly sampled from the entire population that we care about (all factories), and could be biased if the quality of yogurt differs at each one.

There can also be ethical issues that arise when a sample is not representative of a population. When developing and testing a vaccine, for example, researchers must make sure to find volunteers from an appropriate proportion of genders, races, age ranges, pre-existing conditions, and so on to test efficacy for the entire population that the vaccine will be used on. If the vaccine manufacturers test on a sample that doesn’t include sufficient data for one race, there is a risk that the vaccine has reduced efficacy for that group (if the group was also underrepresented during the initial research phase) or that its efficacy for that group is simply unknown.

It can often be challenging to find a representative sample or even to recognize when there is biased data, but it is essential to think about when designing an experiment.