Q 1. Three measures of central tendency are the mean, median, and mode.

Mean: The mean is commonly referred to as the average and is calculated by summing up all the values in a dataset and dividing the sum by the total number of values. It is sensitive to extreme values and can be influenced by outliers.

Median: The median is the middle value in a dataset when it is arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values. The median is less affected by extreme values or outliers and provides a measure of the central value.

Mode: The mode is the value that occurs most frequently in a dataset. It represents the most common or popular value in the dataset. Unlike the mean and median, the mode can be applied to both numerical and categorical data. It is useful for identifying the central value in categorical or discrete datasets.

Q 2. The mean, median, and mode are three different measures of central tendency that provide information about the typical or central value of a dataset. Here's a breakdown of their differences and how they are used:

Mean:
The mean is calculated by summing up all the values in a dataset and dividing the sum by the total number of values. It is commonly referred to as the average. The mean is sensitive to extreme values or outliers because it takes into account all the values in the dataset. It provides a measure of the central tendency by considering the balance between higher and lower values in the dataset.
The mean is often used when the data is numerical and follows a roughly symmetric distribution. For example, it can be used to calculate the average score of a class, the average income in a population, or the average temperature over a period of time.

Median:
The median is the middle value in a dataset when it is arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values. The median is less affected by extreme values or outliers because it focuses solely on the positional value within the dataset.
The median is particularly useful when dealing with skewed distributions or when extreme values are present. It provides a measure of the central tendency that is robust against outliers. For example, the median can be used to represent the typical salary in a company where a few extremely high salaries could skew the mean.

Mode:
The mode is the value that occurs most frequently in a dataset. It represents the most common or popular value. Unlike the mean and median, the mode can be applied to both numerical and categorical data.
The mode is commonly used to describe the central tendency of categorical or discrete datasets. For example, in a survey where respondents select their favorite color from a list of options, the mode would represent the most frequently chosen color.
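
The salary and favorite-color examples above can be sketched in a few lines. The numbers and survey responses below are hypothetical, chosen only to show how one extreme value moves the mean but not the median, and that the mode also works on categorical data.

```python
from statistics import mean, median, mode

# Hypothetical salaries (in thousands); the last value is an extreme outlier.
salaries = [42, 45, 47, 50, 52, 55, 400]

salary_mean = mean(salaries)      # pulled upward by the single 400
salary_median = median(salaries)  # middle value of the sorted list

# Hypothetical survey responses: the mode applies to categorical data too.
colors = ["blue", "red", "blue", "green", "blue", "red"]
favorite_color = mode(colors)

print("Mean salary:", salary_mean)
print("Median salary:", salary_median)
print("Most common color:", favorite_color)
```

Here the median stays at a typical salary while the mean lands well above most of the values, which is why the median is usually the better summary for data like this.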

Q 3. Mean, median, and mode for the given height data:

In [1]:
import numpy as np
from statistics import mean

In [2]:
heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]


In [3]:
mean_height = np.mean(heights)
print("Mean height:", mean_height)


Mean height: 177.01875


In [5]:
median_height = np.median(heights)
print("Median height:", median_height)


Median height: 177.0


In [16]:
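from statistics import StatisticsError, mode

try:
    # 177 and 178 both occur three times in this dataset.
    # On Python 3.8+, mode() returns the first of several equally common
    # values; older versions raise StatisticsError when there is a tie.
    mode_height = mode(heights)
    print("Mode height:", mode_height)
except StatisticsError:
    print("No unique mode found.")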

Mode height: 178


Q 4. Find the standard deviation for the given data:

In [23]:
import numpy as np

In [24]:
heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

std_height = np.std(heights)
print("Standard deviation:", std_height)


Standard deviation: 1.7885814036548633
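
Note that `np.std` divides by n by default, which gives the population standard deviation. If the heights are treated as a sample from a larger population, the sample standard deviation (dividing by n - 1) may be the intended quantity. A minimal sketch of the difference, using NumPy's `ddof` parameter:

```python
import numpy as np

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# ddof=0 (the default) divides by n: population standard deviation.
population_std = np.std(heights)

# ddof=1 divides by n - 1: sample standard deviation.
sample_std = np.std(heights, ddof=1)

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

The sample value is always slightly larger, and the gap shrinks as the number of observations grows.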


Q 5. Measures of dispersion, such as range, variance, and standard deviation, are used to describe the spread or variability of a dataset. They provide information about how the values in the dataset are spread out from the central tendency measures (such as mean or median).

Example:

In [25]:
import numpy as np

data = [12, 14, 15, 17, 20, 21, 22, 23, 24, 26]

# Range
data_range = np.ptp(data)
print("Range:", data_range)


Range: 14


In [27]:
# Variance
data_variance = np.var(data)
print("Variance:", data_variance)


Variance: 19.639999999999997


In [28]:
# Standard deviation
data_std = np.std(data)
print("Standard deviation:", data_std)


Standard deviation: 4.431703961232067
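
To make the definitions behind these NumPy calls explicit, the same three measures can be computed by hand. This is a sketch using only the standard library; it uses the population formulas (dividing by n), which is what `np.var` and `np.std` do by default.

```python
import math

data = [12, 14, 15, 17, 20, 21, 22, 23, 24, 26]

n = len(data)
data_mean = sum(data) / n

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Population variance: average squared deviation from the mean.
squared_deviations = [(x - data_mean) ** 2 for x in data]
population_variance = sum(squared_deviations) / n

# Standard deviation: square root of the variance.
population_std = math.sqrt(population_variance)

print("Range:", data_range)
print("Variance:", population_variance)
print("Standard deviation:", population_std)
```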


Q 6. A Venn diagram is a graphical representation that uses overlapping circles or other shapes to illustrate the logical relationships between different sets of items. 

The primary purpose of a Venn diagram is to visually illustrate set relationships, such as:

Intersection: The overlapping region(s) of the circles represent elements that are common to both sets. It shows the items that belong to both sets simultaneously.

Union: The entire area within the circles represents the combination of all elements from all sets. It shows all the items that belong to at least one of the sets.

Disjoint sets: If the circles do not overlap, it indicates that the sets have no elements in common. They are completely separate or disjoint.
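
The three relationships a Venn diagram shows map directly onto Python's built-in set operations. The sets below are hypothetical and chosen only for illustration.

```python
# Hypothetical sets, chosen only for illustration.
evens = {2, 4, 6, 8}
primes = {2, 3, 5, 7}
odds = {1, 3, 5, 7}

# Intersection: elements in the overlapping region of the two circles.
common = evens & primes

# Union: every element inside at least one circle.
combined = evens | primes

# Disjoint: the circles do not overlap at all.
no_overlap = evens.isdisjoint(odds)

print("Intersection:", common)
print("Union:", combined)
print("Evens and odds are disjoint:", no_overlap)
```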

Q 7. For the two given sets A = {2, 3, 4, 5, 6, 7} and B = {0, 2, 6, 8, 10}, find the union and the intersection:

In [30]:
# Sets A and B
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Union of A and B
union = A.union(B)
print("Union (A ⋃ B):", union)

Union (A ⋃ B): {0, 2, 3, 4, 5, 6, 7, 8, 10}


In [31]:
# Sets A and B
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Intersection of A and B
intersection = A.intersection(B)
print("Intersection (A ∩ B):", intersection)

Intersection (A ∩ B): {2, 6}


Q 8. Skewness is a statistical measure that describes the asymmetry or lack of symmetry in a dataset's distribution. It indicates the degree to which the data is skewed or shifted towards one tail of the distribution.

In a symmetrical distribution, the data is evenly distributed around the mean, and the skewness is close to zero. However, when the distribution is skewed, it indicates that the data is not evenly distributed and is biased towards one side.

Skewness can be classified into three types:

Positive Skewness (Right Skew): The tail of the distribution extends towards the right, or positive, side. The majority of the data is concentrated on the left side of the distribution, and the mean is typically greater than the median.

Negative Skewness (Left Skew): The tail of the distribution extends towards the left, or negative, side. The majority of the data is concentrated on the right side of the distribution, and the mean is typically less than the median.

Zero Skewness: Zero skewness indicates that the dataset is symmetrically distributed. The data is evenly distributed around the mean, and there is no skewness present.
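
One common way to put a number on skewness is the moment coefficient, the third central moment divided by the second central moment raised to the power 1.5. The sketch below assumes that definition (other definitions exist) and uses three small hypothetical datasets; the helper name `moment_skewness` is made up for this example.

```python
import numpy as np

def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical datasets, one for each type of skewness.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]
left_skewed = [-15, -4, -3, -3, -3, -2, -2, -1]
symmetric = [1, 2, 3, 4, 5]

print("Right-skewed:", moment_skewness(right_skewed))
print("Left-skewed:", moment_skewness(left_skewed))
print("Symmetric:", moment_skewness(symmetric))
```

A positive result indicates a right skew, a negative result a left skew, and a result near zero a roughly symmetric dataset.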

Q 9. If a dataset is right-skewed, the median will typically be less than the mean.

In a right-skewed distribution, the tail extends towards the right, so a small number of unusually large values pull the mean in that direction. The median depends only on the middle position of the sorted data, so it moves far less. As a result, the mean tends to be greater than the median.

Q 10. Covariance and correlation are two statistical measures that provide information about the relationship between two variables. While they both measure the extent of the relationship, there are some key differences between them.

Covariance:
Covariance measures how two variables vary together. It quantifies the degree to which changes in one variable are associated with changes in another variable. It can take positive, negative, or zero values.

The formula for covariance between two variables X and Y, based on a dataset of n observations, is:

cov(X, Y) = Σ((Xᵢ - X̄)(Yᵢ - Ȳ)) / (n - 1)

where Xᵢ and Yᵢ are individual data points, X̄ and Ȳ are the means of X and Y, respectively, and Σ represents the summation of all observations.

Covariance is not normalized and depends on the scale of the variables. Therefore, it can be challenging to interpret the magnitude of covariance alone. A positive covariance indicates a positive relationship (as one variable increases, the other tends to increase), a negative covariance indicates a negative relationship (as one variable increases, the other tends to decrease), and a covariance close to zero indicates a weak or no linear relationship.

Correlation:
Correlation is a standardized version of covariance that provides a more interpretable measure of the relationship between variables. It measures both the strength and direction of the linear relationship between two variables. Correlation values range from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.

The formula for correlation between two variables X and Y, based on a dataset of n observations, is:

corr(X, Y) = cov(X, Y) / (σX * σY)

where cov(X, Y) is the covariance between X and Y, and σX and σY are the standard deviations of X and Y, respectively (computed with the same n - 1 denominator as the covariance).

Correlation normalizes the covariance by dividing it by the product of the standard deviations. This normalization ensures that correlation is not affected by the scale of the variables and provides a standardized measure of the relationship.

In statistical analysis, covariance and correlation are used to understand the relationship between variables, detect patterns, and make predictions. They are widely used in various fields such as finance, economics, social sciences, and data analysis. 
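
The two formulas above can be checked against NumPy's built-in functions. The paired observations below are hypothetical; the sketch uses the n - 1 denominator throughout so that the manual results match `np.cov` and `np.corrcoef`.

```python
import numpy as np

# Hypothetical paired observations.
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_score = np.array([52.0, 55.0, 61.0, 70.0, 72.0])

n = len(hours_studied)
x_dev = hours_studied - hours_studied.mean()
y_dev = exam_score - exam_score.mean()

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_xy = np.sum(x_dev * y_dev) / (n - 1)

# Correlation: covariance divided by the product of the sample
# standard deviations (ddof=1 keeps the denominator consistent).
corr_xy = cov_xy / (np.std(hours_studied, ddof=1) * np.std(exam_score, ddof=1))

# NumPy equivalents: both return 2x2 matrices; [0, 1] is the X-Y entry.
cov_np = np.cov(hours_studied, exam_score)[0, 1]
corr_np = np.corrcoef(hours_studied, exam_score)[0, 1]

print("Covariance (manual, NumPy):", cov_xy, cov_np)
print("Correlation (manual, NumPy):", corr_xy, corr_np)
```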

Q 11. The formula for calculating the sample mean is as follows:

Sample Mean = (Sum of all data points) / (Number of data points)

To calculate the sample mean for a dataset, you sum up all the values in the dataset and divide the sum by the total number of data points.

Example calculation of the sample mean for a dataset:
Dataset: [12, 14, 15, 17, 20, 21, 22, 23, 24, 26]

Step 1: Add up all the values in the dataset:
12 + 14 + 15 + 17 + 20 + 21 + 22 + 23 + 24 + 26 = 194

Step 2: Determine the number of data points in the dataset:
The dataset contains 10 data points.

Step 3: Calculate the sample mean:
Sample Mean = 194 / 10 = 19.4

Therefore, the sample mean for the given dataset is 19.4.
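
The three steps above translate directly into Python, which gives a quick way to confirm the arithmetic:

```python
data = [12, 14, 15, 17, 20, 21, 22, 23, 24, 26]

total = sum(data)            # Step 1: add up all the values
count = len(data)            # Step 2: count the data points
sample_mean = total / count  # Step 3: divide the sum by the count

print("Sum:", total)
print("Count:", count)
print("Sample mean:", sample_mean)
```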

Q 12. For a normal distribution, the three measures of central tendency (mean, median, and mode) are all equal. Because the distribution is perfectly symmetric and unimodal, all three are located at the same point, the center of the bell curve. In a finite sample drawn from a normal distribution, they are close to each other but rarely identical.

Here are the key relationships between the measures of central tendency in a normal distribution:

Mean: The mean is the arithmetic average of all the data points in a normal distribution. It is located at the center of the distribution. In a perfectly normal distribution, the mean is exactly equal to the median and mode.

Median: The median is the middle value in a dataset when it is arranged in ascending or descending order. In a normal distribution, where symmetry exists, the median is also equal to the mean and mode. It divides the distribution into two equal halves.

Mode: The mode represents the most frequently occurring value in a dataset. In a normal distribution, the density curve has a single peak at its center, so the mode is equal to the mean and median. Values become steadily less likely the farther they lie from that peak, which is what gives the distribution its bell shape.
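
This relationship can be checked with simulated data. The sketch below draws samples from a normal distribution with hypothetical parameters (mean 170, standard deviation 10) and estimates the mode as the midpoint of the tallest histogram bin, which is only a coarse approximation for continuous data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# Hypothetical parameters: mean 170, standard deviation 10.
samples = rng.normal(loc=170.0, scale=10.0, size=100_000)

sample_mean = samples.mean()
sample_median = np.median(samples)

# Estimate the mode as the midpoint of the tallest histogram bin.
counts, edges = np.histogram(samples, bins=30)
peak = np.argmax(counts)
mode_estimate = (edges[peak] + edges[peak + 1]) / 2

print("Mean:", sample_mean)
print("Median:", sample_median)
print("Mode (histogram estimate):", mode_estimate)
```

All three values should land close to 170, with the mode estimate being the least precise because it depends on the bin width.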

Q 13. Covariance and correlation are both measures used to assess the relationship between two variables. However, there are significant differences between the two:

1. Scale Independence:
Covariance: Covariance is not scale independent, meaning it depends on the units of the variables being measured. Consequently, the covariance value can be influenced by the scale of the variables, making it difficult to compare covariances across different datasets.
Correlation: Correlation is scale independent, which means it is not affected by the units of measurement. Correlation allows for the comparison of relationships between variables from different datasets, as it is standardized and falls within a fixed range (-1 to 1).

2. Interpretation:
Covariance: Covariance indicates the direction (positive or negative) of the linear association between the variables, but its magnitude depends on the units of measurement, so it does not by itself show how strong the relationship is.
Correlation: Correlation provides a standardized measure that allows for easy interpretation. A correlation value of -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship. Additionally, the magnitude of the correlation value reflects the strength of the relationship.

3. Range:
Covariance: Covariance can take any value, positive, negative, or zero, depending on the nature of the relationship between the variables.
Correlation: Correlation ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.

In statistical analysis, correlation is often preferred over covariance due to its standardized interpretation, scale independence, and ease of comparison. It provides a more readily interpretable measure of the linear relationship between variables than covariance does.
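
The scale-independence point can be demonstrated by changing the unit of one variable. The heights and weights below are hypothetical; converting height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical heights (metres) and weights (kilograms).
height_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80])
weight_kg = np.array([55.0, 60.0, 66.0, 72.0, 80.0])

height_cm = height_m * 100  # same measurements, different unit

# Covariance changes with the unit of measurement.
cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

# Correlation does not.
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print("Covariance (m, cm):", cov_m, cov_cm)
print("Correlation (m, cm):", corr_m, corr_cm)
```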

Q 14. Outliers can significantly affect measures of central tendency and dispersion, potentially distorting their values. Here's an example in Python to demonstrate how outliers can impact these measures: