
Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. **Mean**: The average of a set of numbers, calculated by adding all the values together and then dividing by the number of values.

2. **Median**: The middle value in a set of numbers when they are arranged in order. If there’s an even number of values, the median is the average of the two middle values.

3. **Mode**: The value that occurs most frequently in a data set. If no value repeats, there is no mode.

Each of these measures provides different insights into the "center" or typical value of a data set.

Q2. What is the difference between the mean, median, and mode? How are they used to measure the
central tendency of a dataset?

The **mean**, **median**, and **mode** are all measures of central tendency, but they each describe the "center" of a dataset in different ways. Here's how they differ:

1. **Mean**:
   - **Definition**: The mean is the average of all the values in a dataset.
   - **Calculation**: Add all the numbers together and divide by the total number of values.
   - **Use**: The mean is sensitive to extreme values (outliers), so it might not represent the center well if the dataset has large outliers.
   - **Example**: For the dataset 3, 5, 7, 10, the mean is (3 + 5 + 7 + 10) / 4 = 6.25.

2. **Median**:
   - **Definition**: The median is the middle value in an ordered dataset. If there’s an even number of values, the median is the average of the two middle values.
   - **Calculation**: Arrange the data in ascending or descending order, then pick the middle value. If there is an even number of values, calculate the average of the two middle values.
   - **Use**: The median is far less affected by outliers or skewed data than the mean, so it is often a better measure of central tendency when the data has extreme values.
   - **Example**: For the dataset 3, 5, 7, 10, the median is (5 + 7) / 2 = 6.

3. **Mode**:
   - **Definition**: The mode is the value that occurs most frequently in a dataset.
   - **Calculation**: Identify the value that appears most often.
   - **Use**: The mode is useful for categorical or nominal data and helps identify the most common value. It can have multiple modes (bimodal, multimodal) or no mode if all values are unique.
   - **Example**: For the dataset 3, 5, 7, 5, the mode is 5 because it appears twice.

### How They Measure Central Tendency:
- **Mean**: Provides the overall "average" and works well for symmetric datasets with no outliers.
- **Median**: Represents the "middle" value and is ideal when the data is skewed or contains outliers, as it’s less influenced by extreme values.
- **Mode**: Identifies the most frequent value and is especially useful for categorical data or when identifying the most common value is important.

In summary, each measure gives a different perspective on the central point of a dataset. The **mean** gives an overall average, the **median** shows the middle value, and the **mode** highlights the most frequent value.
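
As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The two small datasets are the ones used in the examples above; the variable names are only illustrative.

```python
import statistics

values = [3, 5, 7, 10]      # dataset from the mean and median examples
repeated = [3, 5, 7, 5]     # dataset from the mode example

mean_value = statistics.mean(values)       # (3 + 5 + 7 + 10) / 4
median_value = statistics.median(values)   # average of the two middle values, 5 and 7
modes = statistics.multimode(repeated)     # list of every most-frequent value

print(mean_value, median_value, modes)
```

`multimode` is used here rather than `mode` because it returns every tied value, which covers the bimodal and multimodal cases mentioned above.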

Q3. Measure the three measures of central tendency for the given height data:

 [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [None]:
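import statistics

import numpy as np

data = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

mean = np.mean(data)
median = np.median(data)
# np.bincount only accepts non-negative integers, so it fails on these float heights.
# statistics.multimode returns every value tied for the highest frequency.
mode = statistics.multimode(data)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")

Mean: 177.01875
Median: 177.0
Mode: [178, 177]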


Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.

Measures of dispersion, such as **range**, **variance**, and **standard deviation**, are used to describe the spread or variability of a dataset. These measures help us understand how much the data points differ from the central tendency (mean, median, or mode) and from each other.

Here’s how each measure is used:

### 1. **Range**:
   - **Definition**: The range is the difference between the largest and smallest values in the dataset.
   - **Calculation**: Subtract the minimum value from the maximum value.
   - **Use**: The range provides a simple measure of the spread, but it is sensitive to outliers. If there are extreme values in the dataset, the range will be large.
   - **Example**:
     - Dataset: 3, 5, 7, 10
     - Range = 10 - 3 = 7
     - The range indicates that the data spread across a distance of 7 units.

### 2. **Variance**:
   - **Definition**: Variance is the average of the squared deviations of the data points from the mean. The larger it is, the more the data points differ from the mean.
   - **Calculation**: Find the mean of the dataset, subtract the mean from each data point, square the result, and then average those squared differences.
   - **Formula**:  
     \[
     \text{Variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
     \]
     where \( x_i \) is each data point, \( \mu \) is the mean, and \( N \) is the total number of data points. This is the population variance; the sample variance divides by \( N - 1 \) instead of \( N \).
   - **Use**: Variance gives a precise measure of dispersion, but because it squares the differences, it is expressed in squared units, which makes it hard to interpret directly.
   - **Example**:
     - Dataset: 3, 5, 7, 10
     - Mean = (3 + 5 + 7 + 10) / 4 = 6.25
     - Variance calculation involves subtracting the mean from each data point, squaring the differences, and averaging them.

### 3. **Standard Deviation**:
   - **Definition**: The standard deviation is the square root of the variance. It measures the average amount of variation or dispersion in a dataset.
   - **Calculation**: It is simply the square root of the variance, so it brings the unit of measurement back to the same scale as the original data.
   - **Use**: The standard deviation is widely used because it is expressed in the same unit as the data, making it easier to understand compared to variance.
   - **Example**:
     - The variance of the dataset 3, 5, 7, 10 is 6.6875, so the standard deviation is its square root, approximately **2.59**.
     - This means that the data points typically deviate from the mean by roughly 2.59 units.

### Example: Dataset = 3, 5, 7, 10

1. **Range**:  
   - Max = 10, Min = 3  
   - Range = 10 - 3 = 7
  
2. **Variance**:  
   - Mean = 6.25  
   - Differences from the mean: (3 - 6.25), (5 - 6.25), (7 - 6.25), (10 - 6.25) = -3.25, -1.25, 0.75, 3.75  
   - Squared differences: 10.5625, 1.5625, 0.5625, 14.0625  
   - Average squared difference = (10.5625 + 1.5625 + 0.5625 + 14.0625) / 4 = 26.75 / 4 = 6.6875  
   - Variance = 6.6875

3. **Standard Deviation**:  
   - Standard Deviation = √6.6875 ≈ **2.59**

### Summary of How These Measures Describe Spread:
- **Range**: Tells us the overall spread between the smallest and largest values but can be influenced by outliers.
- **Variance**: Quantifies the spread of the data points in terms of squared deviations from the mean, giving us a more nuanced view of variability.
- **Standard Deviation**: Provides a more interpretable measure of spread, expressed in the same units as the data, and shows how much individual data points typically deviate from the mean.

These measures help assess the **consistency** or **variability** of data, which is crucial in many real-world applications, such as predicting outcomes, risk management, or understanding the stability of a process or system.
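
The same quantities can be checked numerically. The sketch below assumes NumPy is available and uses the dataset 3, 5, 7, 10 from the example; `ddof=0` (the default) matches the population formula given earlier, while `ddof=1` gives the sample variance.

```python
import numpy as np

values = np.array([3, 5, 7, 10])

data_range = values.max() - values.min()   # largest value minus smallest value
pop_variance = values.var()                # divides by N (ddof=0)
sample_variance = values.var(ddof=1)       # divides by N - 1
pop_std = values.std()                     # square root of the population variance

print(data_range, pop_variance, sample_variance, round(pop_std, 2))
```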

Q6. What is a Venn diagram?

A **Venn diagram** is a visual representation used to show the relationships between different sets or groups. It uses circles (or other shapes) to represent each set, and the overlapping regions of the circles show the relationships or common elements between those sets.

### Key Features of a Venn Diagram:
- **Circles or Shapes**: Each circle represents a set of items or elements.
- **Overlapping Areas**: The overlapping part of two or more circles shows the items that are common to the sets.
- **Non-overlapping Areas**: The parts of the circles that do not overlap show the items that belong exclusively to that set.
- **Universal Set**: Sometimes, a rectangle or larger box surrounds the circles, representing the "universe" or the total collection of all items being considered.

### Example:
Imagine two sets:
- Set A: {1, 2, 3, 4}
- Set B: {3, 4, 5, 6}

A Venn diagram would show two circles, one representing Set A and the other representing Set B. The numbers 3 and 4 would appear in the overlapping area, as they are common to both sets. The numbers 1 and 2 would be only in the circle for Set A, and 5 and 6 would be only in the circle for Set B.

### Uses:
- **Set Theory**: To show relationships like intersections (common elements), unions (all elements from both sets), and differences (elements only in one set).
- **Logic**: To visualize logical relationships between sets, like "AND" or "OR" conditions.
- **Problem Solving**: Used in math, statistics, and probability to analyze relationships between groups.
  
Venn diagrams are a simple but powerful tool for understanding and analyzing the relationships between different categories or groups of data.
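
The regions of a two-circle Venn diagram correspond directly to set operations. As a small sketch using the example sets A and B from above (the variable names are illustrative):

```python
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}

only_a = set_a - set_b   # left region: in A but not in B
both = set_a & set_b     # overlapping region: in A and in B
only_b = set_b - set_a   # right region: in B but not in A
either = set_a | set_b   # everything inside the two circles

print(sorted(only_a), sorted(both), sorted(only_b), sorted(either))
```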

Q7. For the two given sets A = {2,3,4,5,6,7} and B = {0,2,6,8,10}, find:

(i) A ∩ B

(ii) A ∪ B

In [None]:

A = {2,3,4,5,6,7}
B = {0,2,6,8,10}
a = A.intersection(B)
b = A.union(B)
print(a)
print(b)

{2, 6}
{0, 2, 3, 4, 5, 6, 7, 8, 10}


Q8. What do you understand about skewness in data?

**Skewness** in data refers to the degree of asymmetry or departure from symmetry in the distribution of a dataset. In other words, it describes whether the data is stretched or pulled to one side (left or right) relative to its mean, or if it is relatively symmetrical.

### Types of Skewness:

1. **Positive Skew (Right Skew)**:
   - **Definition**: In a positively skewed distribution, the right tail (larger values) is longer than the left tail (smaller values). This means that the majority of data points are concentrated on the left side of the mean.
   - **Characteristics**:
     - Typically Mean > Median > Mode (a common pattern, not a strict rule)
     - There are a few large values (outliers) pulling the distribution to the right.
   - **Example**: Income distribution (most people earn moderate incomes, but a few people earn extremely high incomes, causing a right skew).

2. **Negative Skew (Left Skew)**:
   - **Definition**: In a negatively skewed distribution, the left tail (smaller values) is longer than the right tail (larger values). This means that most of the data points are concentrated on the right side of the mean.
   - **Characteristics**:
     - Typically Mean < Median < Mode (a common pattern, not a strict rule)
     - There are a few very small values (outliers) pulling the distribution to the left.
   - **Example**: Age at retirement (most people retire later in life, but a few retire very early, causing a left skew).

3. **No Skew (Symmetrical Distribution)**:
   - **Definition**: When the data is perfectly symmetrical, the tails on both sides of the mean are of equal length. This is often seen in a normal distribution (bell curve).
   - **Characteristics**:
     - Mean = Median = Mode
     - There is no skewness, and the data is evenly spread around the center.
   - **Example**: Heights of adult humans (with no extreme outliers, the distribution is roughly symmetrical).

### How to Measure Skewness:
Skewness can be quantified numerically using formulas or statistical software. A common formula for skewness is:

\[
\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3
\]

Where:
- \(x_i\) are the data points
- \(\bar{x}\) is the mean
- \(s\) is the sample standard deviation
- \(n\) is the number of data points

- **Positive skew**: If skewness is greater than 0, the distribution is positively skewed.
- **Negative skew**: If skewness is less than 0, the distribution is negatively skewed.
- **Zero skew**: If skewness is close to 0, the distribution is symmetrical.

### Why is Skewness Important?
- **Data Interpretation**: Skewness provides insight into how data is distributed. Understanding the direction and extent of skewness can help in selecting appropriate statistical methods and models.
- **Choosing the Right Analysis**: For skewed data, measures like the **median** and **mode** may be more representative of the central tendency than the mean, which can be influenced by extreme values.
- **Assumptions for Models**: Many statistical methods rely on normality assumptions (inference in linear regression, for example, assumes normally distributed residuals). If data is highly skewed, transformations or non-parametric methods may be required.

### Example of Skewness:
Imagine a dataset of test scores: {45, 50, 51, 60, 65, 90, 98, 100, 150, 200}

- The distribution is **positively skewed** (right skew) because there are a few very high test scores (150, 200) pulling the tail to the right. Most scores are closer to the lower end of the range.

In summary, **skewness** tells us how data is distributed around its mean and whether the distribution is symmetrical or stretched toward one side. Recognizing skewness is important in choosing the right statistical approach for analyzing data.
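
As a rough sketch, the skewness formula above can be applied to the example test scores in plain Python. It assumes \(s\) is the sample standard deviation (dividing by \(n - 1\)); a positive result indicates a right skew.

```python
import math

scores = [45, 50, 51, 60, 65, 90, 98, 100, 150, 200]

n = len(scores)
mean = sum(scores) / n
# Sample standard deviation (divides by n - 1).
s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
# Adjusted skewness coefficient: n / ((n - 1)(n - 2)) * sum of cubed z-scores.
skewness = n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in scores)

print(f"skewness = {skewness:.3f}")
```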

Q9. If the data is right-skewed, what will be the position of the median with respect to the mean?

If the data is **right-skewed** (positively skewed), the **median** will typically be smaller than the **mean**; that is, the median lies to the left of the mean.

### Explanation:
- In a **right-skewed** distribution, the right tail (larger values) is longer, and there are a few **large outliers** that pull the mean to the right.
- The **mean** is more sensitive to these extreme values, causing it to be higher than the **median**.
- The **median**, which represents the middle value of the dataset, is less affected by outliers and thus remains closer to the center of the data.

### General Relationship in a Right-Skewed Distribution:
- **Mean > Median > Mode** (the typical ordering for unimodal data)

So, in a right-skewed dataset, the **mean** will be positioned to the right of the **median**.
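
A quick check on a small right-skewed sample illustrates this. The data below is hypothetical, chosen so that one large value stretches the right tail.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # hypothetical data with one large value

mean_value = statistics.mean(right_skewed)
median_value = statistics.median(right_skewed)

# The large value pulls the mean upward but leaves the median in place.
print(mean_value, median_value, mean_value > median_value)
```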

Q10. Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?

**Covariance** and **correlation** are both statistical measures that describe the relationship between two variables, but they differ in their scale, interpretation, and the type of information they provide. Here's an explanation of the key differences between them:

### 1. **Covariance**:
   - **Definition**: Covariance measures the **direction** of the linear relationship between two variables. It indicates whether two variables tend to increase or decrease together (i.e., whether they have a positive or negative relationship).
   - **Formula**:
     \[
     \text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
     \]
     Where:
     - \(X_i\) and \(Y_i\) are the individual data points of variables \(X\) and \(Y\),
     - \(\bar{X}\) and \(\bar{Y}\) are the means of \(X\) and \(Y\),
     - \(n\) is the number of data points.

   - **Range**: Covariance can take any value from negative infinity to positive infinity. The scale of covariance depends on the units of the variables, which can make it difficult to compare across different datasets.
     - **Positive Covariance**: If \(X\) and \(Y\) increase together (positive relationship).
     - **Negative Covariance**: If one variable increases while the other decreases (negative relationship).
     - **Zero Covariance**: No linear relationship between the variables.

   - **Limitations**: The magnitude of covariance is not standardized, so it's difficult to compare the strength of relationships between datasets with different units or scales.

### 2. **Correlation**:
   - **Definition**: Correlation measures both the **strength** and **direction** of the linear relationship between two variables. It standardizes the covariance, making it easier to interpret and compare relationships across different datasets.
   - **Formula**:
     \[
     \text{Correlation (r)} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
     \]
     Where:
     - \(\text{Cov}(X, Y)\) is the covariance between \(X\) and \(Y\),
     - \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\), respectively (computed with the same \(n-1\) denominator as the covariance).

   - **Range**: Correlation values range from **-1 to 1**:
     - **1** indicates a perfect positive linear relationship.
     - **-1** indicates a perfect negative linear relationship.
     - **0** indicates no linear relationship.
   
   - **Interpretation**:
     - **Positive correlation**: When \(r\) is between 0 and 1, the variables have a positive relationship.
     - **Negative correlation**: When \(r\) is between -1 and 0, the variables have a negative relationship.
     - **Zero correlation**: \(r = 0\) indicates no linear relationship.

   - **Advantages**: Correlation is unit-free, meaning it's not affected by the units of measurement of the variables. This allows for easier comparison across datasets.

### Key Differences:

| **Aspect**         | **Covariance**                             | **Correlation**                            |
|--------------------|--------------------------------------------|--------------------------------------------|
| **Definition**      | Measures the direction of the linear relationship between two variables. | Measures both the strength and direction of the linear relationship. |
| **Range**           | Unrestricted; can be any value from \(-\infty\) to \(+\infty\). | Ranges from -1 to +1. |
| **Interpretation**  | Tells whether variables move in the same direction (positive) or opposite direction (negative). | Tells the strength and direction of the relationship, with values between -1 and 1. |
| **Units**           | Dependent on the units of the variables, which can make it difficult to compare across datasets. | Unit-free; easy to compare across datasets. |
| **Use**             | Used to understand the direction of the relationship. | Used to understand both the strength and direction of the relationship. |

### How These Measures Are Used in Statistical Analysis:
- **Covariance**:
  - **Understanding the Relationship**: Covariance gives insight into whether two variables increase or decrease together, but without giving a standardized measure of strength. It’s mainly used in the context of portfolio theory (finance) or regression analysis.
  - **Limitations**: Since the magnitude of covariance depends on the units of the variables, it’s hard to compare covariance values from different datasets or variables.

- **Correlation**:
  - **Measuring Strength and Direction**: Correlation is widely used in statistics because it provides both the **strength** (how closely the variables move together) and the **direction** (positive or negative) of the relationship, while being standardized and easier to interpret.
  - **Applications**: Common in **data analysis, regression analysis, finance (e.g., stock price movements), and market research**, among others. It helps in decision-making processes, such as determining how strongly two variables are related, and is especially useful in predictive modeling.

### Example:
- Suppose you have a dataset of hours studied and exam scores.
  - If **covariance** is positive, it indicates that as the number of hours studied increases, exam scores also tend to increase.
  - If the **correlation** is 0.85, this suggests that there is a strong positive linear relationship between hours studied and exam scores, and the relationship is standardized, making it easy to compare to other datasets.

### In Summary:
- **Covariance** gives you the direction of the relationship between two variables, but its magnitude depends on the scale of the variables, making it harder to compare across datasets.
- **Correlation** standardizes the relationship, making it easier to interpret the strength and direction of the relationship and compare across different datasets.
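
The hours-studied example can be sketched with NumPy. The numbers below are invented purely for illustration; `np.cov` uses the \(n-1\) denominator by default, which matches the covariance formula above.

```python
import numpy as np

# Hypothetical data, invented for illustration only.
hours = np.array([1, 2, 3, 4, 5])
scores = np.array([52, 58, 65, 70, 80])

cov_xy = np.cov(hours, scores)[0, 1]    # sample covariance (n - 1 in the denominator)
r = np.corrcoef(hours, scores)[0, 1]    # Pearson correlation coefficient

# The correlation is the covariance divided by the two sample standard deviations.
r_manual = cov_xy / (hours.std(ddof=1) * scores.std(ddof=1))

print(cov_xy, r, r_manual)
```

The covariance only tells us that the two variables rise together; the correlation, being bounded by 1, shows how close the relationship is to a perfect straight line.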

Q11. What is the formula for calculating the sample mean? Provide an example calculation for a
dataset.

### Formula for Calculating the Sample Mean:

The **sample mean** is the average of a set of data points. It is calculated by summing all the data points and then dividing by the number of data points in the sample.

The formula for the **sample mean** (\(\bar{x}\)) is:

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]

Where:
- \(\bar{x}\) = sample mean
- \(n\) = number of data points in the sample
- \(x_i\) = each individual data point
- \(\sum_{i=1}^{n} x_i\) = sum of all the data points

### Example Calculation:

Let’s say we have the following dataset representing the scores of 5 students on a test:

\[
\text{Dataset: } 80, 85, 90, 95, 100
\]

#### Step-by-step calculation:

1. **Sum of all data points**:

\[
80 + 85 + 90 + 95 + 100 = 450
\]

2. **Number of data points (n)**:

\[
n = 5
\]

3. **Sample mean**:

\[
\bar{x} = \frac{450}{5} = 90
\]

So, the **sample mean** of the dataset is **90**.

### Summary:
- The **sample mean** is simply the average value of the dataset, and it provides a measure of central tendency for the data.
- In this example, the average score of the students is **90**.
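
The step-by-step calculation translates directly into a few lines of Python (the variable names are illustrative):

```python
scores = [80, 85, 90, 95, 100]

total = sum(scores)        # step 1: sum of all data points
n = len(scores)            # step 2: number of data points
sample_mean = total / n    # step 3: divide the sum by n

print(total, n, sample_mean)
```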

Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a **normal distribution**, the measures of central tendency—**mean**, **median**, and **mode**—are all **equal** and lie at the **center** of the distribution.

### Relationship Between the Measures of Central Tendency in a Normal Distribution:
- **Mean = Median = Mode**

### Explanation:
In a normal distribution, the data is symmetrically distributed around the center. This means:
1. **Mean**: The average of all the data points. Since the distribution is symmetric, the mean is at the exact center of the distribution.
2. **Median**: The middle value when the data is ordered. Because of symmetry, the median is also at the center, where half of the data points fall below it and half fall above it.
3. **Mode**: The value that appears most frequently. In a normal distribution, the highest point (the peak) of the curve occurs at the center, so the mode is also at this center.

### Visual Representation:
- In a **normal distribution curve** (also known as a bell curve), the peak of the curve represents the mean, median, and mode, all of which coincide at the same point.

### Why Does This Happen?
- The **symmetry** of a normal distribution ensures that the data is evenly spread on both sides of the central point.
- As a result, the **mean** (which balances the dataset), the **median** (which splits the dataset in half), and the **mode** (the most frequent value) all align at the same location.

### Summary:
For a **normal distribution**, the mean, median, and mode are all **equal** and located at the **center** of the distribution. This follows from the distribution being symmetric with a single peak; other symmetric, unimodal distributions share the property.
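
This can be checked with `statistics.NormalDist` from the Python standard library. The parameters below (mean 170, standard deviation 10) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=170, sigma=10)   # hypothetical parameters

center = dist.mean
half_below = dist.cdf(center)         # 0.5: half the probability lies below the mean
# The density at the mean is higher than at nearby points on either side.
peak_at_center = dist.pdf(center) > max(dist.pdf(center - 1), dist.pdf(center + 1))

print(center, dist.median, dist.mode, half_below, peak_at_center)
```

A cumulative probability of 0.5 at the mean shows that the mean is also the median, and the density peaking there shows that it is also the mode.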

Q13. How is covariance different from correlation?

Covariance and correlation both measure the relationship between two variables, but they differ in a few key ways:

1. **Scale of Measurement:**
   - **Covariance** measures the direction of the linear relationship between two variables. It can take any value from negative to positive infinity, depending on how the variables move together. However, its value is affected by the scale of the variables (i.e., the units of measurement), which can make it hard to interpret across different datasets.
   - **Correlation**, on the other hand, standardizes the covariance by dividing it by the product of the standard deviations of the two variables. This results in a value between -1 and 1. A correlation of 1 means a perfect positive linear relationship, -1 means a perfect negative linear relationship, and 0 means no linear relationship.

2. **Interpretability:**
   - **Covariance** values are not easy to interpret because they are not standardized, and the magnitude depends on the units of the variables.
   - **Correlation** values are easier to interpret because they are scaled to a fixed range of -1 to 1, making it straightforward to assess the strength and direction of the relationship.

3. **Units:**
   - **Covariance** has units that are the product of the units of the two variables being compared. For example, if one variable is measured in meters and another in seconds, the covariance will be in meter-seconds.
   - **Correlation** is dimensionless because it is a standardized value.

In short, while both covariance and correlation describe relationships between variables, **correlation is a normalized and more interpretable version of covariance** that is independent of the scale of the variables.
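
The effect of units can be seen by converting one variable to a different scale. In the sketch below (hypothetical height and weight values), switching heights from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical measurements, invented for illustration.
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 70.0, 78.0, 90.0])
height_cm = height_m * 100                        # same heights, different units

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]       # scales with the units
r_m = np.corrcoef(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]    # unit-free, so unchanged

print(cov_m, cov_cm, r_m, r_cm)
```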

Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can have a significant impact on both **measures of central tendency** (mean, median, mode) and **measures of dispersion** (range, variance, standard deviation). Here's how they affect these measures:

### 1. **Impact on Measures of Central Tendency:**
   - **Mean:** Outliers can **skew the mean** because the mean is sensitive to extreme values. If an outlier is very large or very small, it can pull the mean in that direction, making it unrepresentative of the majority of the data.
     - *Example:* Consider the dataset: **[2, 3, 3, 4, 5, 100]**. The mean is calculated as:
       \[
       \text{Mean} = \frac{(2 + 3 + 3 + 4 + 5 + 100)}{6} = \frac{117}{6} = 19.5
       \]
       This is far higher than five of the six values in the dataset, due to the outlier (100).

   - **Median:** The median, being the middle value when the data is ordered, is **less affected by outliers**. Even if an extreme value is present, the median tends to remain stable because it depends only on the middle values.
     - *Example:* In the same dataset **[2, 3, 3, 4, 5, 100]**, the median would be the average of the two middle numbers (3 and 4):
       \[
       \text{Median} = \frac{3 + 4}{2} = 3.5
       \]
       This is a better representation of the central tendency compared to the mean.

   - **Mode:** The mode, which is the most frequent value in the dataset, is **typically not affected by outliers**, unless the outlier itself occurs frequently.

### 2. **Impact on Measures of Dispersion:**
   - **Range:** The range is highly sensitive to outliers because it is simply the difference between the largest and smallest values in the dataset. If an outlier is present, it can significantly increase the range.
     - *Example:* In the dataset **[2, 3, 3, 4, 5, 100]**, the range is:
       \[
       \text{Range} = 100 - 2 = 98
       \]
       The range is inflated due to the outlier.

   - **Variance and Standard Deviation:** Both variance and standard deviation are calculated based on squared differences from the mean. Since the mean is affected by outliers, the squared differences will be large for outliers, increasing both variance and standard deviation.
     - *Example:* In the same dataset, the outlier (100) will cause the squared difference between the data points and the mean to be much larger than it would be without the outlier. This will increase the overall variance and standard deviation.

### Conclusion:
- **Mean, range, variance, and standard deviation** are strongly affected by outliers, often leading to a distorted view of the data.
- **Median, mode, and interquartile range** (IQR) are more robust to outliers and provide a better summary of the data when outliers are present.

In practice, it's important to consider the presence of outliers and decide whether to adjust for them or use more robust measures when analyzing data.
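
To make the comparison concrete, the sketch below computes the same measures for the example dataset with and without the outlier, using the standard `statistics` module. `pstdev` is the population standard deviation; the choice of population rather than sample is an assumption made for illustration.

```python
import statistics

with_outlier = [2, 3, 3, 4, 5, 100]
without_outlier = [2, 3, 3, 4, 5]

for label, data in [("with outlier", with_outlier), ("without outlier", without_outlier)]:
    print(
        label,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "range:", max(data) - min(data),
        "std:", round(statistics.pstdev(data), 2),   # population standard deviation
    )
```

The mean, range, and standard deviation change sharply when the value 100 is removed, while the median barely moves.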