## Q1. What are the three measures of central tendency?

The three measures of central tendency are:

`Mean:` The mean is the most commonly used measure of central tendency. It is calculated by adding up all the values in a dataset and dividing the sum by the number of values in the dataset.

`Median:` The median is the middle value in a dataset when the values are arranged in order. If there is an even number of values, the median is the average of the two middle values.

`Mode:` The mode is the value that occurs most frequently in a dataset. A dataset can have one mode, more than one mode (if two or more values occur with equal frequency), or no mode at all.
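The three definitions can be sketched with Python's standard `statistics` module. The small lists below are made-up values chosen only for illustration, and `statistics.multimode` (available from Python 3.8) is used here because it returns every most-frequent value rather than a single one.

```python
import statistics

# Hypothetical values chosen only to illustrate the definitions
odd_count = [3, 1, 4, 1, 5]
even_count = [3, 1, 4, 1, 5, 9]

# Mean: sum of the values divided by how many there are
mean_value = statistics.mean(odd_count)          # 14 / 5 = 2.8

# Median: middle value for an odd count, average of the two middle values for an even count
median_odd = statistics.median(odd_count)        # sorted: 1, 1, 3, 4, 5 -> 3
median_even = statistics.median(even_count)      # sorted: 1, 1, 3, 4, 5, 9 -> (3 + 4) / 2 = 3.5

# Mode: multimode returns every value tied for the highest frequency
one_mode = statistics.multimode([1, 1, 2, 3])       # [1]
two_modes = statistics.multimode([1, 1, 2, 2, 3])   # [1, 2]
all_tied = statistics.multimode([1, 2, 3])          # [1, 2, 3]

print(mean_value, median_odd, median_even)
print(one_mode, two_modes, all_tied)
```

When every value occurs exactly once, `multimode` returns all of them; by convention such a dataset is usually described as having no mode.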

## Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are all measures of central tendency used to describe the typical value or the center of a dataset. However, they differ in how they are calculated and in what they tell us about the dataset.

The mean is calculated by adding up all the values in a dataset and dividing the sum by the number of values in the dataset. It represents the arithmetic average of the dataset and is sensitive to extreme values or outliers. The mean is useful for symmetric datasets that are not skewed.

The median is the middle value in a dataset when the values are arranged in order. It divides the dataset into two equal halves, with half of the values below the median and half above it. The median is less sensitive to extreme values or outliers compared to the mean, making it a better measure of central tendency for skewed datasets.

The mode is the value that occurs most frequently in a dataset. It is useful for categorical or discrete data and for distributions with one or more clear peaks. The mode may not be unique, since several values can share the highest frequency, and by convention a dataset in which no value repeats is said to have no mode.

In general, the choice of which measure of central tendency to use depends on the nature of the dataset and the research question. The mean is the most commonly used measure of central tendency, but it may not be appropriate for datasets with extreme values or outliers. The median is a better choice for skewed datasets, and the mode is useful for identifying the most common value in a dataset.

## Q3. Measure the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

To find the three measures of central tendency for the given height data, we will calculate the mean, median, and mode.

Mean: The mean is calculated by adding up all the values in the dataset and dividing by the total number of values.
178 + 177 + 176 + 177 + 178.2 + 178 + 175 + 179 + 180 + 175 + 178.9 + 176.2 + 177 + 172.5 + 178 + 176.5 = 2832.3

Mean = 2832.3 / 16 = 177.01875 ≈ 177.02

The mean height for the given data is approximately 177.02.

Median: The median is the middle value when the data is arranged in order.
Arrange the data in ascending order: 172.5, 175, 175, 176, 176.2, 176.5, 177, 177, 177, 178, 178, 178, 178.2, 178.9, 179, 180.

There are 16 values in the dataset, so the median is the average of the eighth and ninth values, both of which are 177.

Median = (177 + 177) / 2 = 177

The median height for the given data is 177.

Mode: The mode is the value that appears most frequently in the dataset.
The values 177 and 178 each appear three times, more often than any other value, so the dataset is bimodal.

The modes of the given data are 177 and 178. In Python 3.8 and later, `statistics.mode` reports only the first mode it encounters in the list, which is 178 here.

Therefore, the three measures of central tendency for the given height data are:

Mean: 177.02
Median: 177
Mode: 177 and 178

In [1]:
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Calculate the mean
mean = statistics.mean(data)
print("Mean: ", mean)

# Calculate the median
median = statistics.median(data)
print("Median: ", median)

# Calculate the mode
mode = statistics.mode(data)
print("Mode: ", mode)

Mean:  177.01875
Median:  177.0
Mode:  178
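Because 177 and 178 are tied in this dataset, a single reported mode hides part of the picture. The following sketch uses `statistics.multimode` (Python 3.8+) and `collections.Counter` from the standard library to list every mode and the counts behind the tie; the variable names are illustrative.

```python
import statistics
from collections import Counter

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode lists every value tied for the highest count, in order of first appearance
modes = statistics.multimode(data)
print("Modes:", modes)

# Counter exposes the frequencies that produce the tie
counts = Counter(data)
print("Count of 177:", counts[177])
print("Count of 178:", counts[178])
```

Both values occur three times, so reporting the data as bimodal is more faithful than reporting 178 alone.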


## Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [3]:
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Calculate the sample standard deviation (statistics.stdev divides by n - 1)
stdev = statistics.stdev(data)
print("Standard deviation: ", stdev)

Standard deviation:  1.8472389305844188
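The value above is the sample standard deviation. If the 16 heights are treated as the whole population rather than a sample, the population formula applies instead. A minimal sketch comparing the two with the standard library's `statistics.stdev` and `statistics.pstdev`:

```python
import math
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]
n = len(data)

sample_sd = statistics.stdev(data)        # divides the sum of squared deviations by n - 1
population_sd = statistics.pstdev(data)   # divides the sum of squared deviations by n

print("Sample standard deviation:    ", sample_sd)
print("Population standard deviation:", population_sd)

# The two differ only by the factor sqrt((n - 1) / n)
print("Rescaled sample value:        ", sample_sd * math.sqrt((n - 1) / n))
```

The population value is always slightly smaller than the sample value, and the gap shrinks as the number of observations grows.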


## Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion such as range, variance, and standard deviation are used to describe the spread of a dataset by quantifying how much the individual data points deviate from the central tendency of the dataset.

- Range: The range is the difference between the maximum and minimum values in a dataset. It provides a quick way to get an idea of how spread out the data is. However, it is sensitive to outliers and may not provide a reliable measure of dispersion in datasets with extreme values.

- Variance: The variance is the average of the squared differences between each data point and the mean of the dataset. For a whole population the sum of squared differences is divided by the number of values n; for a sample it is usually divided by n - 1. It measures how much the data is spread out from the mean. A higher variance indicates that the data is more spread out, while a lower variance indicates that the data is more tightly clustered around the mean.

- Standard deviation: The standard deviation is the square root of the variance. It provides a more intuitive measure of the spread of the data, as it is in the same units as the data. A larger standard deviation indicates that the data is more spread out, while a smaller standard deviation indicates that the data is more tightly clustered around the mean.

For example:

In [4]:
import statistics

data = [5, 10, 15, 20, 25]

# Calculate the range
range_data = max(data) - min(data)
print("Range: ", range_data)

# Calculate the population variance (divide by n)
mean = statistics.mean(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
print("Variance: ", variance)

# Calculate the standard deviation
stdev = variance ** 0.5
print("Standard deviation: ", stdev)

Range:  20
Variance:  50.0
Standard deviation:  7.0710678118654755


## Q6. What is a Venn diagram?

A Venn diagram is a graphical representation of the relationships between sets. It consists of overlapping circles or other closed shapes, with each shape representing a set and the overlapping region representing the intersection of those sets. The Venn diagram is named after the British logician John Venn, who introduced the concept in 1880.

Venn diagrams are commonly used in mathematics, statistics, logic, and other fields to visualize the relationships between sets and to illustrate set operations such as union, intersection, difference, and complement. They are particularly useful when several sets overlap, because each region of the diagram corresponds to one combination of set memberships.

Here's an example of a Venn diagram with three sets:

![image.png](attachment:a27e2721-66f5-41be-8ba8-b0aa1c485b5c.png)

## Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:

### (i) A ∩ B

### (ii) A ∪ B

In [10]:
# Define the sets A and B
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Find the intersection of sets A and B
intersection = A.intersection(B)
print("The intersection of sets A and B is:", intersection)

# Find the union of sets A and B
union = A.union(B)
print("The union of sets A and B is:", union)


The intersection of sets A and B is: {2, 6}
The union of sets A and B is: {0, 2, 3, 4, 5, 6, 7, 8, 10}
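The same results can be obtained with Python's set operators, which also cover the other regions of a two-set Venn diagram. This is a brief sketch using the same sets A and B; the result variable names are illustrative.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

both = A & B          # intersection: elements in both A and B
either = A | B        # union: elements in A or B (or both)
only_a = A - B        # difference: elements in A but not in B
exactly_one = A ^ B   # symmetric difference: elements in exactly one of the sets

print("A ∩ B:", both)
print("A ∪ B:", either)
print("A - B:", only_a)
print("A Δ B:", exactly_one)
```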


## Q8. What do you understand about skewness in data?

Skewness is a measure of the asymmetry of a probability distribution or a dataset. A symmetric dataset has zero skewness, while an asymmetric dataset has positive or negative skewness.

- Positive skewness indicates that the tail of the distribution is longer on the right side than on the left side. Most values are concentrated on the left, and a small number of large values stretch the tail to the right. In a positively skewed dataset, the mean is typically higher than the median.

- Negative skewness indicates that the tail of the distribution is longer on the left side than on the right side. Most values are concentrated on the right, and a small number of small values stretch the tail to the left. In a negatively skewed dataset, the mean is typically lower than the median.

Skewness is an important concept in statistics because it affects the interpretation of the central tendency and the spread of the data.
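One common way to put a number on skewness is the moment-based (Fisher-Pearson) coefficient: the third central moment divided by the second central moment raised to the power 1.5. The sketch below is one possible implementation using only the standard library; the `skewness` helper and the sample lists are hypothetical and exist only for illustration.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment (population variance)
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # long right tail
left_skewed = [-x for x in right_skewed]    # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name,
          "skewness:", round(skewness(values), 3),
          "mean:", statistics.fmean(values),
          "median:", statistics.median(values))
```

For the right-skewed list the coefficient is positive and the mean exceeds the median; for the mirrored list both relationships are reversed.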

## Q9. If a dataset is right-skewed, what is the position of the median with respect to the mean?

If a dataset is right-skewed, the median will typically be less than the mean.

This is because in a right-skewed distribution, the tail of the distribution is longer on the right side, which means there are some extreme values on the right side that pull the mean in that direction. The median, on the other hand, is not affected by extreme values in the same way as the mean, so it remains closer to the center of the distribution. Therefore, the median is typically less than the mean in a right-skewed distribution.

## Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are two statistical measures that describe the relationship between two variables.

> Covariance is a measure of how two variables vary together. A positive covariance means that when one variable goes up, the other tends to go up as well; a negative covariance means that when one variable goes up, the other tends to go down. However, the magnitude of the covariance depends on the units of the two variables, so on its own it does not indicate how strong the relationship is.


> Correlation, on the other hand, is a standardized measure of the relationship between two variables. Unlike covariance, correlation takes values between -1 and 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship. Correlation measures not only the direction of the relationship but also its strength.

In statistical analysis, covariance and correlation are used to analyze the relationship between two variables. Covariance is useful for understanding the direction of the relationship, but correlation is often preferred because it standardizes the measure and allows for comparisons between different datasets. Correlation is particularly useful in situations where we want to compare the relationship between two variables with different units of measurement or scales. For example, we can use correlation to measure the strength of the relationship between a person's height and weight, even though height and weight are measured in different units.

## Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The formula for calculating the sample mean is:

sample mean = (sum of all values in the sample) / (number of values in the sample)

For example, suppose we have the following dataset of test scores for a class of 10 students:

85, 76, 92, 88, 79, 91, 83, 80, 87, 90

To calculate the sample mean, we first add up all the values in the sample:

85 + 76 + 92 + 88 + 79 + 91 + 83 + 80 + 87 + 90 = 851

Next, we divide the sum by the number of values in the sample, which in this case is 10:

sample mean = 851 / 10 = 85.1

Therefore, the sample mean for this dataset is 85.1.

In [17]:
import numpy as np
# Define the dataset
data = [85, 76, 92, 88, 79, 91, 83, 80, 87, 90]

# Calculate the sample mean
sample_mean = sum(data) / len(data)
mean = np.mean(data)

# Print the sample mean
print("Sample mean =", sample_mean)
print("Sample mean =", mean)


Sample mean = 85.1
Sample mean = 85.1


## Q12. For a normal distribution data what is the relationship between its measure of central tendency?

For normally distributed data, the mean, median, and mode are all equal. A normal distribution is symmetric about its mean, so half of the probability lies on each side of the mean, which makes the mean the median as well. It is also unimodal, with its single peak at the center of symmetry, so the mode coincides with the mean too.

Therefore, for normally distributed data, the measures of central tendency (mean, median, and mode) are all equal and located at the center of the distribution. In a finite sample drawn from a normal distribution, the three values are only approximately equal.
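This relationship can be checked approximately by simulation. The sketch below draws a large sample with `random.gauss` from the standard library; the parameters `mu = 50` and `sigma = 10`, the sample size, and the seed are arbitrary choices for illustration. Because continuous values almost never repeat, the mode is estimated from the values rounded to whole numbers.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated sample from a normal distribution (mu and sigma are arbitrary)
mu, sigma = 50, 10
sample = [random.gauss(mu, sigma) for _ in range(100_000)]

mean = statistics.fmean(sample)
median = statistics.median(sample)

# Estimate the mode from rounded values, since exact repeats are rare in continuous data
mode_estimate = statistics.mode([round(x) for x in sample])

print("Mean:  ", round(mean, 2))
print("Median:", round(median, 2))
print("Mode (rounded values):", mode_estimate)
```

All three results should land close to 50, with small differences that come from sampling variation and from the rounding used for the mode estimate.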

## Q13. How is covariance different from correlation?

Covariance and correlation are both measures of the relationship between two variables. However, they differ in their interpretation and scale.

> Covariance measures how two variables change together, i.e., if one variable increases, does the other variable tend to increase, decrease, or show no consistent change. Its sign gives the direction of the linear relationship: a positive covariance indicates that both variables tend to increase or decrease together, and a negative covariance indicates that one variable tends to increase as the other decreases. Its magnitude depends on the units of the variables, so it is not a direct measure of strength. The sample covariance can be calculated using the formula:
<b>cov(X, Y) = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / (n - 1)</b>
where Xᵢ and Yᵢ are the paired observations, X̄ and Ȳ are their respective sample means, and n is the number of observations. (For a full population, the means are written μX and μY and the sum is divided by n.)

> The correlation coefficient, on the other hand, measures the strength and direction of the linear relationship between two variables, and it is scaled between -1 and 1, making it easier to interpret. A correlation of +1 indicates a perfect positive linear relationship, a correlation of -1 indicates a perfect negative linear relationship, and a correlation of 0 indicates no linear relationship (the variables may still be related in a nonlinear way). Correlation can be calculated using the formula:
<b>r = cov(X, Y) / (sX * sY)</b>
where r is the correlation coefficient, cov(X, Y) is the covariance between X and Y, and sX and sY are the standard deviations of X and Y, respectively.

Therefore, the main difference between covariance and correlation is that covariance is not scaled, and its value depends on the units of measurement of the two variables. In contrast, correlation is scaled and ranges between -1 and 1, making it easier to interpret and compare across different datasets.
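The two formulas above can be sketched directly in Python to show the effect of units. The height and weight values below are hypothetical, and the helper names `sample_covariance` and `correlation` are illustrative rather than taken from any library.

```python
import statistics

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical paired measurements
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]
heights_m = [h / 100 for h in heights_cm]   # same heights, different unit

cov_cm = sample_covariance(heights_cm, weights_kg)
cov_m = sample_covariance(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)

print("Covariance (cm, kg):", cov_cm)
print("Covariance (m, kg): ", cov_m)
print("Correlation (cm, kg):", corr_cm)
print("Correlation (m, kg): ", corr_m)
```

Converting heights from centimeters to meters shrinks the covariance by a factor of 100, while the correlation stays the same apart from floating-point rounding.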

## Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can significantly affect the measures of central tendency and dispersion in a dataset. An outlier is an observation that falls significantly outside the range of the other observations in the dataset. Outliers can be caused by measurement errors, data entry errors, or simply by chance.

The presence of outliers can distort measures of central tendency, especially the mean, because every value contributes to it and an extreme value pulls it in its own direction. For example, in a dataset of salaries where most of the salaries are in the range of Rs.30,000 to Rs.80,000 per year, an outlier salary of Rs.500,000 per year will significantly increase the mean salary value, making it higher than the typical salary in the dataset. In contrast, the median salary will be much less affected by the outlier, as it depends only on the middle of the ordered data.

Outliers also inflate measures of dispersion such as the range, variance, and standard deviation. For example, if we consider the same salary dataset as above, an outlier salary of Rs.500,000 per year will increase the range, variance, and standard deviation, making the spread appear much larger than it is for the bulk of the data.

In summary, outliers can have a significant impact on the measures of central tendency and dispersion in a dataset. It is important to identify and handle outliers appropriately to avoid drawing incorrect conclusions from the data.

Here's an example in Python demonstrating how outliers can affect measures of central tendency and dispersion:

In [19]:
import numpy as np

# Create a dataset with outliers
data = np.array([10, 20, 30, 40, 50, 500])

# Calculate measures of central tendency
mean = np.mean(data)
median = np.median(data)

# Calculate measures of dispersion
# (np.var and np.std default to the population formulas, ddof=0)
data_range = np.max(data) - np.min(data)  # named data_range to avoid shadowing the built-in range()
variance = np.var(data)
std_deviation = np.std(data)

# Print results
print("Data:", data)
print("Mean:", mean)
print("Median:", median)
print("Range:", data_range)
print("Variance:", variance)
print("Standard deviation:", std_deviation)


Data: [ 10  20  30  40  50 500]
Mean: 108.33333333333333
Median: 35.0
Range: 490
Variance: 30847.222222222223
Standard deviation: 175.6337730114064


In this example, the dataset contains an outlier value of 500, which is much larger than the other values in the dataset.

As we can see, the outlier has a significant impact on the mean, which is much higher than the typical values in the dataset. The median, on the other hand, is less affected by the outlier and stays close to the center of the remaining values. The range, variance, and standard deviation are also much larger than they would be if the outlier were not present, indicating a larger spread.
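To make the comparison explicit, the same measures can be computed with and without the outlier. This sketch uses the standard `statistics` module with the population formulas (`pstdev`), matching NumPy's defaults in the cell above; the `summarize` helper is a hypothetical name introduced for illustration.

```python
import statistics

with_outlier = [10, 20, 30, 40, 50, 500]
without_outlier = [x for x in with_outlier if x != 500]

def summarize(values):
    """Return mean, median, range, and population standard deviation."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "std": statistics.pstdev(values),
    }

summary_with = summarize(with_outlier)
summary_without = summarize(without_outlier)

for label, summary in [("With outlier", summary_with), ("Without outlier", summary_without)]:
    print(label)
    for name, value in summary.items():
        print(f"  {name}: {value}")
```

Removing the single value 500 moves the mean from about 108.3 to 30 and the range from 490 to 40, while the median only shifts from 35 to 30.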