In [1]:
# Q1. What are the three measures of central tendency?

### Solution 1-

<span style = 'font-size:0.8em;'>
    
The three measures of central tendency are:

1. **Mean**: The arithmetic average of a set of values, calculated by adding all the values together and then dividing by the number of values.

2. **Median**: The middle value in a data set when the values are arranged in ascending or descending order. If the data set has an even number of values, the median is the average of the two middle values.

3. **Mode**: The value that appears most frequently in a data set. A set of data may have one mode, more than one mode, or no mode at all if no number repeats.
</span>

In [2]:
# Q2. What is the difference between the mean, median, and mode? How are they used to measure the
# central tendency of a dataset?

### Solution 2-

<span style = 'font-size:0.8em;'>

The mean, median, and mode are all measures of central tendency, but they describe different aspects of a dataset's distribution and can yield different insights:

1. **Mean**:
   - **Definition**: The arithmetic average of all the values in a dataset.
   - **Calculation**: Sum of all values divided by the number of values.
   - **Usage**: The mean is useful for datasets without extreme outliers because it incorporates all data points. It provides a general idea of the dataset's overall level.
   - **Example**: For the dataset {2, 3, 5, 7, 11}, the mean is (2 + 3 + 5 + 7 + 11) / 5 = 5.6.

2. **Median**:
   - **Definition**: The middle value in a dataset when the values are arranged in ascending or descending order.
   - **Calculation**: If the dataset has an odd number of values, the median is the middle one. If even, it is the average of the two middle values.
   - **Usage**: The median is useful for skewed distributions or datasets with outliers, as it is largely unaffected by extreme values. It represents the central position in the dataset.
   - **Example**: For the dataset {2, 3, 5, 7, 11}, the median is 5. For {2, 3, 5, 7, 11, 13}, the median is (5 + 7) / 2 = 6.

3. **Mode**:
   - **Definition**: The value that occurs most frequently in a dataset.
   - **Calculation**: Identified by finding the value(s) with the highest frequency.
   - **Usage**: The mode is useful for categorical data to identify the most common category. It can also indicate the most frequent occurrence in numerical data.
   - **Example**: For the dataset {2, 3, 5, 5, 7, 11}, the mode is 5. In {2, 3, 3, 5, 5, 7, 11}, both 3 and 5 are modes (bimodal).

**Differences**:
- **Mean** considers all values and can be skewed by outliers.
- **Median** provides the middle point and is largely unaffected by outliers.
- **Mode** identifies the most frequent value and can highlight the most common category or value in the dataset.

Each measure is used based on the context and nature of the data:
- **Mean** for symmetric distributions without outliers.
- **Median** for skewed distributions or when outliers are present.
- **Mode** for categorical data or to identify the most common value in numerical data.
</span>
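
The worked examples above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and `multimode` requires Python 3.8 or later.

```python
import statistics as stats

odd_data = [2, 3, 5, 7, 11]
even_data = [2, 3, 5, 7, 11, 13]

mean_value = stats.mean(odd_data)            # 5.6
median_odd = stats.median(odd_data)          # 5
median_even = stats.median(even_data)        # 6.0

# multimode returns every value that shares the highest frequency
single_mode = stats.multimode([2, 3, 5, 5, 7, 11])       # [5]
two_modes = stats.multimode([2, 3, 3, 5, 5, 7, 11])      # [3, 5]

print(mean_value, median_odd, median_even, single_mode, two_modes)
```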

In [3]:
# Q3. Measure the three measures of central tendency for the given height data:
# [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

### Solution 3-

In [4]:
import numpy as np
import statistics as stats

In [5]:
heights = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [6]:
np.mean(heights)

177.01875

In [7]:
np.median(heights)

177.0

In [8]:
stats.mode(heights)

178
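
Note that 177 also occurs three times in this data, so the heights are bimodal. On Python 3.8 and later, `statistics.mode` returns only the first mode it encounters. A small sketch using `statistics.multimode` lists all of them:

```python
import statistics as stats

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode keeps the order in which the modes first appear in the data
all_modes = stats.multimode(heights)
print(all_modes)  # [178, 177]
```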

In [9]:
# Q4. Find the standard deviation for the given data:
# [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

### Solution 4-

In [10]:
heights = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [11]:
np.std(heights)

1.7885814036548633
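
`np.std` divides by `n` by default, so the value above is the population standard deviation. If the heights are treated as a sample, the `n - 1` form applies (`np.std(heights, ddof=1)` in NumPy). The sketch below shows both forms using only the standard library:

```python
import statistics as stats

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

population_sd = stats.pstdev(heights)  # divides by n, like np.std(heights)
sample_sd = stats.stdev(heights)       # divides by n - 1, like np.std(heights, ddof=1)

print(population_sd, sample_sd)
```

The sample value is always slightly larger, because the sum of squared deviations is divided by a smaller number.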

In [12]:
# Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
# the spread of a dataset? Provide an example.

### Solution 5-

<span style = 'font-size:0.8em;'>

Measures of dispersion such as range, variance, and standard deviation provide insights into the spread or variability of a dataset.

1. **Range**:
   - **Definition**: The difference between the highest and lowest values in a dataset.
   - **Calculation**: Range = Maximum value - Minimum value.
   - **Usage**: The range gives a quick sense of the spread of the data but is sensitive to outliers.
   - **Example**: For the dataset {2, 3, 5, 7, 11}, the range is 11 - 2 = 9.

2. **Variance**:
   - **Definition**: The average of the squared differences between each data point and the mean of the dataset.
   - **Calculation**: 
     - For a population: \(\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}\)
     - For a sample: \(s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}\)
     - Where \(x_i\) is each value, \(\mu\) is the population mean, \(\bar{x}\) is the sample mean, \(N\) is the population size, and \(n\) is the sample size.
   - **Usage**: Variance provides a measure of how spread out the values are around the mean. It is useful in statistical and financial analysis but can be hard to interpret because it is in squared units.
   - **Example**: For the dataset {2, 3, 5, 7, 11}, the mean is 5.6. The variance (sample) is \(\frac{(2-5.6)^2 + (3-5.6)^2 + (5-5.6)^2 + (7-5.6)^2 + (11-5.6)^2}{4} = \frac{51.2}{4} = 12.8\).

3. **Standard Deviation**:
   - **Definition**: The square root of the variance, representing the typical distance of a data point from the mean.
   - **Calculation**: 
     - For a population: \(\sigma = \sqrt{\sigma^2}\)
     - For a sample: \(s = \sqrt{s^2}\)
   - **Usage**: Standard deviation is widely used because it is in the same units as the data, making it easier to interpret. It provides a measure of the typical deviation from the mean.
   - **Example**: For the dataset {2, 3, 5, 7, 11}, using the sample variance calculated earlier (12.8), the standard deviation is \(\sqrt{12.8} \approx 3.58\).

**Example in context**:
Consider the dataset {2, 3, 5, 7, 11}.

- **Range**: 
  - Calculation: 11 - 2 = 9
  - Interpretation: The data spans 9 units from the smallest to the largest value.

- **Variance (sample)**: 
  - Calculation: \(\frac{(2-5.6)^2 + (3-5.6)^2 + (5-5.6)^2 + (7-5.6)^2 + (11-5.6)^2}{4} = \frac{51.2}{4} = 12.8\)
  - Interpretation: The squared deviations from the mean, summed and divided by \(n - 1 = 4\), come to 12.8 squared units.

- **Standard Deviation (sample)**:
  - Calculation: \(\sqrt{12.8} \approx 3.58\)
  - Interpretation: The values typically deviate from the mean by about 3.58 units.

These measures help to understand the variability in the dataset. While the mean provides a central value, the range, variance, and standard deviation reveal how much the data points spread out around the mean.
</span>
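
The three dispersion measures for the example dataset can be recomputed with a short sketch like the one below. It assumes the sample (`n - 1`) formulas; `statistics.pvariance` and `statistics.pstdev` give the population versions.

```python
import statistics as stats

data = [2, 3, 5, 7, 11]

data_range = max(data) - min(data)       # largest value minus smallest value
sample_variance = stats.variance(data)   # sum of squared deviations / (n - 1)
sample_sd = stats.stdev(data)            # square root of the sample variance

print(data_range, sample_variance, sample_sd)
```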

In [13]:
# Q6. What is a Venn diagram?

### Solution 6-

<span style = 'font-size:0.8em;'>

A Venn diagram is a visual tool used to illustrate the relationships between different sets. It consists of overlapping circles, each representing a set, and the areas where the circles overlap represent the common elements between those sets. Venn diagrams are particularly useful for showing all possible logical relationships between a finite collection of sets.

**Components of a Venn Diagram**:
- **Circles or Ellipses**: Each circle or ellipse represents a set.
- **Overlapping Areas**: The areas where circles overlap represent the intersection of the sets, showing elements that are common to multiple sets.
- **Non-Overlapping Areas**: Parts of the circles that do not overlap show elements that are unique to each set.
- **Universal Set**: Sometimes, a rectangle is used to represent the universal set, which contains all possible elements under consideration.

**Usage**:
- **Set Operations**: Venn diagrams visually represent operations like union (all elements in any of the sets), intersection (elements common to all sets), and difference (elements in one set but not another).
- **Problem Solving**: They help in solving problems related to probability, logic, statistics, and computer science.
- **Logical Relationships**: They illustrate logical relationships in various fields such as mathematics, linguistics, and business.

**Example**:
Consider two sets, A and B:
- Set A = {1, 2, 3}
- Set B = {3, 4, 5}

A Venn diagram for these sets would have two overlapping circles:
- The circle for Set A includes the numbers 1, 2, and 3.
- The circle for Set B includes the numbers 3, 4, and 5.
- The overlapping area (intersection) includes the number 3, which is common to both sets.

</span>
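
The regions of the two-circle diagram described above map directly onto Python's built-in set operators. A minimal sketch, with illustrative variable names:

```python
a = {1, 2, 3}
b = {3, 4, 5}

only_a = sorted(a - b)     # left part of circle A: [1, 2]
overlap = sorted(a & b)    # overlapping region: [3]
only_b = sorted(b - a)     # right part of circle B: [4, 5]

print(only_a, overlap, only_b)
```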

In [14]:
# Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
# (i) A ∩ B
# (ii) A ⋃ B

### Solution 7-

<span style = 'font-size:0.8em;'>

A = {2, 3, 4, 5, 6, 7}

B = {0, 2, 6, 8, 10}

(i) A ∩ B = {2, 6}

(ii) A ∪ B = {0, 2, 3, 4, 5, 6, 7, 8, 10}
</span>
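
As a quick check, the same result can be obtained with Python sets (the lowercase names below are just illustrative):

```python
set_a = {2, 3, 4, 5, 6, 7}
set_b = {0, 2, 6, 8, 10}

intersection = sorted(set_a & set_b)   # elements in both sets
union = sorted(set_a | set_b)          # elements in either set

print(intersection)  # [2, 6]
print(union)         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```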

In [15]:
# Q8. What do you understand about skewness in data?

### Solution 8-

<span style = 'font-size:0.8em;'>


Skewness measures the asymmetry of a dataset's distribution about its center. A symmetric distribution, such as the normal bell curve, has zero skewness; the sign and size of the skewness indicate the direction and relative magnitude of the departure from that symmetry.

### Types of Skewness:

1. **Positive Skewness (Right Skewness)**:
   - **Definition**: When the tail on the right side of the distribution is longer or fatter than the left side.
   - **Characteristics**: The mean is typically greater than the median, and the mode is typically less than the median. Most data points are concentrated on the left, with a few larger values stretching to the right.
   - **Example**: Income distribution in many populations, where most people earn lower incomes and fewer people earn very high incomes.

2. **Negative Skewness (Left Skewness)**:
   - **Definition**: When the tail on the left side of the distribution is longer or fatter than the right side.
   - **Characteristics**: The mean is typically less than the median, and the mode is typically greater than the median. Most data points are concentrated on the right, with a few smaller values stretching to the left.
   - **Example**: Age at retirement, where most people retire at older ages, with fewer retiring very early.

3. **No Skewness (Symmetrical Distribution)**:
   - **Definition**: When the distribution is symmetrical, with no skewness.
   - **Characteristics**: For a symmetrical distribution with a single peak, the mean, median, and mode are all equal, and the distribution looks like a mirror image on both sides of the center.
   - **Example**: Heights of adult men in a specific population, assuming a normal distribution.

### Measuring Skewness:
- **Skewness Coefficient**: A statistical measure that quantifies the degree of skewness in a dataset. It is typically calculated using the third standardized moment about the mean.
  - A skewness value of 0 indicates a perfectly symmetrical distribution.
  - A positive skewness value indicates right skewness.
  - A negative skewness value indicates left skewness.

### Interpretation:
- **Impact on Central Tendency**: Skewness affects the relationship between the mean, median, and mode. In positively skewed distributions, the mean is pulled to the right of the median. In negatively skewed distributions, the mean is pulled to the left of the median.
- **Real-World Implications**: Understanding skewness is crucial in fields like finance, quality control, and any area involving data analysis, as it affects the interpretation of the data and the application of statistical methods.

### Example:
Consider two datasets:

- **Positively Skewed**: \[3, 4, 5, 6, 7, 100\]
  - The mean is significantly higher than the median due to the outlier (100).

- **Negatively Skewed**: \[-100, 1, 2, 3, 4, 5\]
  - The mean is significantly lower than the median due to the outlier (-100).

</span>
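
The skewness coefficient for the two example datasets can be computed from its definition as the third standardized moment. The sketch below uses the population form \(m_3 / m_2^{3/2}\); the helper name `skewness` is hypothetical, and library routines may apply a bias correction that gives slightly different values.

```python
import statistics as stats

def skewness(values):
    """Third standardized moment (population form): m3 / m2 ** 1.5."""
    mean = stats.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [3, 4, 5, 6, 7, 100]
left_skewed = [-100, 1, 2, 3, 4, 5]

skew_right = skewness(right_skewed)   # positive: long right tail
skew_left = skewness(left_skewed)     # negative: long left tail

print(skew_right, skew_left)
```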

In [17]:
# Q9. If a data is right skewed then what will be the position of median with respect to mean?

### Solution 9-

<span style = 'font-size:0.8em;'>

If a dataset is right skewed (positively skewed), the distribution has a longer tail on the right side. In this case, the position of the median with respect to the mean is typically as follows:

- The **mean** will be **greater than** the **median**.

This happens because the mean is affected by the extreme values (outliers) in the long right tail, pulling it to the right. The median, being the middle value, is less affected by these extreme values and remains closer to the bulk of the data.

In a right-skewed distribution:
```
       |------|---|------|
       Mode  Median  Mean
```
The mode is the highest peak of the distribution, the median is the middle value, and the mean is pulled towards the right tail.

### Example:
Consider the dataset: \[2, 3, 5, 8, 50\]

- **Mean**: \(\frac{2 + 3 + 5 + 8 + 50}{5} = 13.6\)
- **Median**: The middle value when the data is sorted: \[2, 3, 5, 8, 50\] is 5.

In this example, the mean (13.6) is greater than the median (5), illustrating the effect of right skewness.
</span>
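
The example can be reproduced with a few lines of standard-library Python (a minimal sketch):

```python
import statistics as stats

data = [2, 3, 5, 8, 50]

mean_value = stats.mean(data)      # 13.6, pulled up by the value 50
median_value = stats.median(data)  # 5, the middle of the sorted data

print(mean_value > median_value)   # True
```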

In [18]:
# Q10. Explain the difference between covariance and correlation. How are these measures used in
# statistical analysis?

### Solution 10-

<span style = 'font-size:0.8em;'>

Covariance and correlation are both measures used to assess the relationship between two variables in statistical analysis. However, they differ in terms of scale, interpretation, and how they quantify the relationship.

### Covariance:
1. **Definition**:
   - Covariance measures the directional relationship between two variables. It indicates whether the variables tend to increase or decrease together.
   
2. **Calculation**:
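   - For a sample of \(n\) pairs: \(\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}\). For a population, divide by \(N\) instead of \(n - 1\).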

3. **Interpretation**:
   - Positive Covariance: Indicates that the variables tend to move in the same direction.
   - Negative Covariance: Indicates that the variables tend to move in opposite directions.
   - Magnitude: The magnitude of covariance is difficult to interpret because it depends on the units of the variables.

### Correlation:
1. **Definition**:
   - Correlation measures both the strength and direction of the linear relationship between two variables. It is a standardized measure, providing a dimensionless quantity.

2. **Calculation**:
   - \(r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}\), where \(s_X\) and \(s_Y\) are the standard deviations of \(X\) and \(Y\).

3. **Interpretation**:
   - Value Range: Correlation values range from -1 to 1.
     - +1: Perfect positive linear relationship.
     - -1: Perfect negative linear relationship.
     - 0: No linear relationship.
   - Magnitude: The closer the value is to ±1, the stronger the linear relationship.

### Key Differences:
1. **Scale**:
   - Covariance is not standardized and depends on the units of the variables.
   - Correlation is standardized and unitless, allowing for comparison across different datasets.

2. **Interpretation**:
   - Covariance only indicates the direction of the relationship.
   - Correlation indicates both the direction and strength of the relationship.

### Usage in Statistical Analysis:
- **Covariance**:
  - Used in portfolio theory in finance to determine how two stocks might move together.
  - Used to construct covariance matrices, which are fundamental in multivariate statistical analyses like Principal Component Analysis (PCA).

- **Correlation**:
  - Used to identify and quantify the strength and direction of linear relationships between variables.
  - Commonly used in regression analysis, time series analysis, and feature selection in machine learning to identify relevant predictors.

### Example:
Consider two datasets:
- \(X = [1, 2, 3, 4, 5]\)
- \(Y = [2, 4, 6, 8, 10]\)

**Covariance**:
- The sample covariance is \(\frac{8 + 2 + 0 + 2 + 8}{4} = 5\), a positive value, indicating that \(X\) and \(Y\) tend to increase together.

**Correlation**:
- Calculation would yield +1, indicating a perfect positive linear relationship.
</span>
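
The example can be worked through in code. The sketch below implements the sample covariance (dividing by \(n - 1\)) and the Pearson correlation by hand; the function names are hypothetical helpers, not library functions.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

def pearson_correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it is unitless."""
    var_x = sample_covariance(xs, xs)
    var_y = sample_covariance(ys, ys)
    return sample_covariance(xs, ys) / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

cov_xy = sample_covariance(x, y)      # 5.0
corr_xy = pearson_correlation(x, y)   # 1.0

print(cov_xy, corr_xy)
```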

In [19]:
# Q11. What is the formula for calculating the sample mean? Provide an example calculation for a
# dataset.

### Solution 11-

<span style = 'font-size:0.8em;'>

The formula for calculating the sample mean is:

\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

Here:
- \(\bar{x}\) is the sample mean.
- \(\sum_{i=1}^{n} x_i\) is the sum of all the sample observations.
- \(n\) is the sample size, i.e. the total number of observations.

### Example Calculation

Consider the dataset: [4, 8, 6, 5, 3, 7]

**Step-by-Step Calculation**:

1. **Sum the Data Points**:
   
   sum = 4 + 8 + 6 + 5 + 3 + 7 = 33
   

2. **Count the Number of Data Points**:
   
   n = 6
   

3. **Calculate the Sample Mean**:
   
   \(\bar{x} = \frac{33}{6} = 5.5\)
  

### Summary

For the dataset \[4, 8, 6, 5, 3, 7\]:
- The sum of the data points is 33.
- The number of data points is 6.
- The sample mean (x̄) is 5.5.

Therefore, the sample mean of the dataset \[4, 8, 6, 5, 3, 7\] is 5.5.
</span>
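
The same three steps translate directly into Python (a minimal sketch):

```python
data = [4, 8, 6, 5, 3, 7]

total = sum(data)          # step 1: sum the data points
n = len(data)              # step 2: count the data points
sample_mean = total / n    # step 3: divide the sum by the count

print(total, n, sample_mean)  # 33 6 5.5
```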

In [20]:
# Q12. For a normal distribution data what is the relationship between its measure of central tendency?

### Solution 12-

<span style = 'font-size:0.8em;'>

For a normal distribution, the measures of central tendency—mean, median, and mode—are all equal. This relationship arises from the symmetrical, bell-shaped nature of the normal distribution.

### Characteristics of a Normal Distribution:
1. **Symmetry**: The normal distribution is perfectly symmetrical around its center.
2. **Peak**: The highest point of the curve, where the frequency of observations is the highest, corresponds to the mode.
3. **Equal Measures of Central Tendency**:
   - **Mean**: The average of all data points.
   - **Median**: The middle value when the data points are ordered.
   - **Mode**: The most frequently occurring value in the dataset.

### Relationship:
- In a normal distribution:
  \[
  \text{Mean} = \text{Median} = \text{Mode}
  \]

This equality holds because the symmetry of the distribution ensures that half of the data points lie on either side of the mean, making the median equal to the mean. Additionally, the mode, being the peak point of the distribution, is located at the same center point as the mean and median.

### Visual Representation:
A normal distribution curve looks like this:

```
             *
          *     *
        *         *
      *             *
    *                 *
   *                   *
 *                       *
*                         *
```

In this bell-shaped curve:
- The mean is at the center.
- The median is also at the center.
- The mode is at the peak of the curve.

### Example:
Consider a dataset that follows a normal distribution with a mean of 50 and a standard deviation of 10. In this case, in theory:
- The mean is 50.
- The median is also 50.
- The mode is also 50.

In summary, for a normally distributed dataset, the mean, median, and mode are identical and located at the center of the distribution. This characteristic is a key feature of the normal distribution, distinguishing it from other types of distributions where these measures of central tendency may differ.
</span>
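
For a finite sample, the three measures agree only approximately. The sketch below draws simulated values from a normal distribution with mean 50 and standard deviation 10; the seed, the sample size, and the bin width of 5 used to approximate the mode are all arbitrary choices made for illustration.

```python
import random
import statistics as stats
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(50, 10) for _ in range(100_000)]

sample_mean = stats.fmean(sample)
sample_median = stats.median(sample)

# Continuous values rarely repeat, so approximate the mode by
# rounding each value to the nearest multiple of 5 and taking the fullest bin.
bin_counts = Counter(5 * round(value / 5) for value in sample)
approx_mode = bin_counts.most_common(1)[0][0]

print(sample_mean, sample_median, approx_mode)
```

All three values should land close to 50, with the gaps shrinking as the sample grows.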

In [21]:
# Q13. How is covariance different from correlation?

### Solution 13-

<span style = 'font-size:0.8em;'>

Covariance and correlation are both measures used to assess the relationship between two variables in statistical analysis. However, they differ in terms of scale, interpretation, and how they quantify the relationship.

### Covariance:
1. **Definition**:
   - Covariance measures the directional relationship between two variables. It indicates whether the variables tend to increase or decrease together.
   
2. **Calculation**:
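   - For a sample of \(n\) pairs: \(\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}\). For a population, divide by \(N\) instead of \(n - 1\).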

3. **Interpretation**:
   - Positive Covariance: Indicates that the variables tend to move in the same direction.
   - Negative Covariance: Indicates that the variables tend to move in opposite directions.
   - Magnitude: The magnitude of covariance is difficult to interpret because it depends on the units of the variables.

### Correlation:
1. **Definition**:
   - Correlation measures both the strength and direction of the linear relationship between two variables. It is a standardized measure, providing a dimensionless quantity.

2. **Calculation**:
   - \(r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}\), where \(s_X\) and \(s_Y\) are the standard deviations of \(X\) and \(Y\).

3. **Interpretation**:
   - Value Range: Correlation values range from -1 to 1.
     - +1: Perfect positive linear relationship.
     - -1: Perfect negative linear relationship.
     - 0: No linear relationship.
   - Magnitude: The closer the value is to ±1, the stronger the linear relationship.

### Key Differences:
1. **Scale**:
   - Covariance is not standardized and depends on the units of the variables.
   - Correlation is standardized and unitless, allowing for comparison across different datasets.

2. **Interpretation**:
   - Covariance only indicates the direction of the relationship.
   - Correlation indicates both the direction and strength of the relationship.

### Usage in Statistical Analysis:
- **Covariance**:
  - Used in portfolio theory in finance to determine how two stocks might move together.
  - Used to construct covariance matrices, which are fundamental in multivariate statistical analyses like Principal Component Analysis (PCA).

- **Correlation**:
  - Used to identify and quantify the strength and direction of linear relationships between variables.
  - Commonly used in regression analysis, time series analysis, and feature selection in machine learning to identify relevant predictors.

### Example:
Consider two datasets:
- \(X = [1, 2, 3, 4, 5]\)
- \(Y = [2, 4, 6, 8, 10]\)

**Covariance**:
- The sample covariance is \(\frac{8 + 2 + 0 + 2 + 8}{4} = 5\), a positive value, indicating that \(X\) and \(Y\) tend to increase together.

**Correlation**:
- Calculation would yield +1, indicating a perfect positive linear relationship.
</span>
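
The scale difference can be seen by changing the units of one variable. In the sketch below, the height and weight values are made up for illustration and the helper functions are hypothetical. Converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / math.sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )

heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]          # made-up heights in metres
weights_kg = [50, 58, 63, 72, 80]              # made-up weights in kilograms
heights_cm = [100 * h for h in heights_m]      # same heights in centimetres

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)
corr_m = pearson_correlation(heights_m, weights_kg)
corr_cm = pearson_correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)     # covariance changes with the units
print(corr_m, corr_cm)   # correlation does not
```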

In [22]:
# Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

### Solution 14-

<span style = 'font-size:0.8em;'>


### How Outliers Affect Measures of Central Tendency and Dispersion

Outliers can significantly impact measures of central tendency and dispersion, often distorting the interpretation of the dataset. Here's how they affect each measure along with an example.

### Measures of Central Tendency

1. **Mean**:
   - The mean is sensitive to outliers because it includes all data points in the calculation.
   - An outlier can significantly increase or decrease the mean, making it less representative of the central location of the bulk of the data.

2. **Median**:
   - The median is more robust to outliers. It is the middle value when the data is sorted and is not affected by the actual values of the extreme points.
   - An outlier has little to no effect on the median, making it a more reliable measure of central tendency when outliers are present.

3. **Mode**:
   - The mode is the value that appears most frequently in the dataset.
   - Outliers typically do not affect the mode unless the outlier itself is the most frequent value.

### Measures of Dispersion

1. **Range**:
   - The range is the difference between the maximum and minimum values in the dataset.
   - It is highly sensitive to outliers because an outlier can significantly increase the range.

2. **Variance and Standard Deviation**:
   - Both variance and standard deviation are sensitive to outliers. These measures involve squaring the differences between each data point and the mean.
   - An outlier increases these differences dramatically, leading to a much higher variance and standard deviation, suggesting greater variability in the data.

### Example

Consider two datasets:

- Dataset A: \([10, 12, 14, 16, 18]\)
- Dataset B: \([10, 12, 14, 16, 100]\) (with an outlier of 100)

#### Measures of Central Tendency

- **Mean**:
  - Dataset A:
    \[
    \text{Mean} = \frac{10 + 12 + 14 + 16 + 18}{5} = 14
    \]
  - Dataset B:
    \[
    \text{Mean} = \frac{10 + 12 + 14 + 16 + 100}{5} = 30.4
    \]
  The mean of Dataset B is significantly higher due to the outlier.

- **Median**:
  - Dataset A:
    \[
    \text{Median} = 14
    \]
  - Dataset B:
    \[
    \text{Median} = 14
    \]
  The median remains the same, showing its robustness to the outlier.

- **Mode**:
  - Neither dataset has a mode (no repeated values).

#### Measures of Dispersion

- **Range**:
  - Dataset A:
    \[
    \text{Range} = 18 - 10 = 8
    \]
  - Dataset B:
    \[
    \text{Range} = 100 - 10 = 90
    \]
  The range of Dataset B is much larger due to the outlier.

- **Variance and Standard Deviation** (sample formulas, dividing by \(n - 1 = 4\)):
  - Dataset A:
    \[
    \text{Variance} = \frac{(10-14)^2 + (12-14)^2 + (14-14)^2 + (16-14)^2 + (18-14)^2}{4} = \frac{16 + 4 + 0 + 4 + 16}{4} = 10
    \]
    \[
    \text{Standard Deviation} = \sqrt{10} \approx 3.16
    \]
  - Dataset B:
    \[
    \text{Variance} = \frac{(10-30.4)^2 + (12-30.4)^2 + (14-30.4)^2 + (16-30.4)^2 + (100-30.4)^2}{4} = \frac{416.16 + 338.56 + 268.96 + 207.36 + 4844.16}{4} = 1518.8
    \]
    \[
    \text{Standard Deviation} = \sqrt{1518.8} \approx 38.97
    \]
  The variance and standard deviation of Dataset B are much higher due to the outlier, indicating a higher spread.

### Summary

- **Mean**: Greatly affected by outliers, causing it to misrepresent the central tendency.
- **Median and Mode**: Less affected by outliers, providing a more accurate central tendency when outliers are present.
- **Range, Variance, and Standard Deviation**: Greatly affected by outliers, indicating increased dispersion and variability in the presence of outliers.

By understanding how outliers impact these statistical measures, you can better interpret the data and consider using more robust measures (like the median) when outliers are present.
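
The comparison between the two datasets can be recomputed with the standard `statistics` module. This is a minimal sketch using the sample (`n - 1`) variance and standard deviation; the `summarize` helper is hypothetical.

```python
import statistics as stats

dataset_a = [10, 12, 14, 16, 18]
dataset_b = [10, 12, 14, 16, 100]   # same data with 18 replaced by the outlier 100

def summarize(data):
    """Return the central tendency and dispersion measures discussed above."""
    return {
        "mean": stats.mean(data),
        "median": stats.median(data),
        "range": max(data) - min(data),
        "variance": stats.variance(data),   # sample variance
        "std_dev": stats.stdev(data),       # sample standard deviation
    }

summary_a = summarize(dataset_a)
summary_b = summarize(dataset_b)

for name in summary_a:
    print(f"{name:>8}: A = {summary_a[name]:.2f}, B = {summary_b[name]:.2f}")
```

Only the median is identical for both datasets; every other measure shifts once the outlier is introduced.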

---

</span>