**Q1.** What are the three measures of central tendency?

**Mean:** The average value of a dataset, calculated by summing all values and dividing by the total number of values.

**Median:** The middle value of a dataset when it's ordered. If there's an even number of values, the median is the average of the two middle values.

**Mode:** The value that appears most frequently in a dataset. There can be one mode, more than one mode (multimodal), or no mode if all values are unique.
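The three cases for the mode described above (one mode, several modes, all values unique) can be checked with Python's standard `statistics` module. This is a minimal sketch: the three small lists are hypothetical data chosen for illustration, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Hypothetical example data, chosen only to illustrate the three cases
one_mode = [2, 3, 3, 5, 7]
two_modes = [1, 1, 2, 2, 3]
all_unique = [4, 8, 15]

print("Mean:", mean(one_mode))
print("Median:", median(one_mode))
print("Mode(s):", multimode(one_mode))

# multimode returns every value tied for the highest count
print("Two modes:", multimode(two_modes))

# When every value appears once, all values are returned
print("All unique:", multimode(all_unique))
```

Because `multimode` returns every value when all counts are equal, a result as long as the number of distinct values can be read as "no mode".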

**Q2.** What is the difference between the mean, median, and mode? How are they used to measure the
central tendency of a dataset?

**Mean:**

Definition: The sum of all values divided by the total number of values.

Sensitivity to Outliers: Sensitive to extreme values (outliers).

Use: Summarizes the data with a single value that uses every observation, but **can be influenced by outliers.**

**Median:**

Definition: The middle value of an ordered dataset. If the number of values is even, the average of the two middle values.

Sensitivity to Outliers: Less sensitive to outliers; robust measure.

Use: Represents a central value, **suitable for skewed distributions and data with outliers.**

**Mode:**

Definition: The value that appears most frequently in a dataset.

Sensitivity to Outliers: Not sensitive to outliers; unaffected by extreme values.

Use: Identifies the most common value(s), especially in categorical or discrete data.

**Q3.** Calculate the three measures of central tendency for the given height data:

[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:

import numpy as np
from scipy import stats

height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Mean
mean_height = np.mean(height_data)
print("Mean:", mean_height)

# Median
median_height = np.median(height_data)
print("Median:", median_height)

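# Mode (when several values tie for most frequent, SciPy returns the smallest one)
mode_result = stats.mode(height_data)
# np.atleast_1d handles both an array result (older SciPy) and a scalar result (newer SciPy)
print("Mode:", np.atleast_1d(mode_result.mode)[0])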


Mean: 177.01875
Median: 177.0
Mode: 177.0




**Q4.** Find the standard deviation for the given data:

[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [5]:
import numpy as np

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

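# Population standard deviation: square root of the mean squared deviation (divides by n)
mean = np.mean(data)
squared_diff = [(x - mean) ** 2 for x in data]
mean_squared_diff = np.mean(squared_diff)  # population variance
std_deviation = np.sqrt(mean_squared_diff)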

print("Standard Deviation:", std_deviation)


Standard Deviation: 1.7885814036548633


In [6]:
std_deviation = np.std(data)

print("Standard Deviation:", std_deviation)

Standard Deviation: 1.7885814036548633
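Both calculations above divide by the number of values n, which gives the population standard deviation. If the heights are treated as a sample from a larger group, dividing by n - 1 is the usual choice. The sketch below compares the two using the standard `statistics` module; in NumPy the same switch is made with `np.std(data, ddof=1)`.

```python
import math
import statistics

data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
        178.9, 176.2, 177, 172.5, 178, 176.5]

population_sd = statistics.pstdev(data)  # divides by n
sample_sd = statistics.stdev(data)       # divides by n - 1

print("Population standard deviation:", population_sd)
print("Sample standard deviation:", sample_sd)

# The two differ only by the factor sqrt(n / (n - 1))
n = len(data)
print("Ratio:", sample_sd / population_sd, "expected:", math.sqrt(n / (n - 1)))
```

The sample value is always slightly larger, and the gap shrinks as n grows.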


**Q5.** How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.

**Range:**

Definition: The difference between the maximum and minimum values in a dataset.

Use: Provides a simple measure of the data's spread but is sensitive to outliers.

Example: If the range of test scores in a class is 40 (from 60 to 100), it indicates that scores are spread across a wide range.

**Variance:**

Definition: The average of squared differences between each data point and the mean.

Use: Quantifies the dispersion of data points around the mean.

Example: In a dataset of exam scores, a higher variance indicates that scores are more spread out from the average, reflecting greater variability.

**Standard Deviation:**

Definition: Square root of the variance; represents the typical deviation from the mean.

Use: Offers a more interpretable measure of spread than variance.

Example: For a set of company profits, a higher standard deviation implies more fluctuation in earnings, showing greater business risk.

**Example:**
Consider two datasets of exam scores:

Dataset A: [85, 88, 90, 87, 89],
Dataset B: [60, 75, 85, 95, 110]

**Range:**

Range of Dataset A: 90 - 85 = 5

Range of Dataset B: 110 - 60 = 50

**Interpretation:** Dataset B has a wider spread than Dataset A.

**Variance and Standard Deviation:**

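Dataset A: variance = 2.96, standard deviation ≈ 1.72 (population formulas, dividing by n). Both are small because the scores cluster tightly around the mean of 87.8.

Dataset B: variance = 290, standard deviation ≈ 17.03. Both are much larger because the scores lie far from the mean of 85.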

**Interpretation:** Dataset B has more dispersed scores compared to Dataset A.
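The comparison of the two datasets can be reproduced with a short sketch. It uses the population formulas (dividing by n) from the standard `statistics` module; the helper name `describe_spread` is introduced here only for illustration.

```python
import statistics

dataset_a = [85, 88, 90, 87, 89]
dataset_b = [60, 75, 85, 95, 110]

def describe_spread(values):
    """Return range, population variance, and population standard deviation."""
    value_range = max(values) - min(values)
    variance = statistics.pvariance(values)
    std_dev = statistics.pstdev(values)
    return value_range, variance, std_dev

for name, values in (("A", dataset_a), ("B", dataset_b)):
    value_range, variance, std_dev = describe_spread(values)
    print(f"Dataset {name}: range={value_range}, "
          f"variance={variance:.2f}, std={std_dev:.2f}")
```

Swapping in `statistics.variance` and `statistics.stdev` would give the sample versions (dividing by n - 1) without changing the conclusion that Dataset B is more dispersed.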

**Q6.** What is a Venn diagram?

A Venn diagram is a visual tool that uses overlapping circles or shapes to show relationships and commonalities between different sets or categories. It helps illustrate intersections and differences among groups of items or concepts.

**Q7.** For the two given sets A = {2, 3, 4, 5, 6, 7} and B = {0, 2, 6, 8, 10}, find:

(i) A ∩ B

(ii) A ∪ B

**(i) A ∩ B (Intersection of A and B):**

A ∩ B represents the elements that are common to both sets A and B.

In this case, the common elements are 2 and 6.

A ∩ B = {2,6}

**(ii) A ∪ B (Union of A and B):**

A ∪ B contains every element that belongs to A, to B, or to both, with each element listed once.

A ∪ B = {0, 2, 3, 4, 5, 6, 7, 8, 10}
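Both results can be verified with Python's built-in `set` type, where `&` is intersection and `|` is union. A minimal sketch:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # same as A.intersection(B)
union = A | B         # same as A.union(B)

# Sets are unordered, so sort before printing for a stable display
print("Intersection:", sorted(intersection))
print("Union:", sorted(union))
```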

**Q8.** What do you understand about skewness in data?

**Skewness in Data:**

Measures the asymmetry of data distribution.

Two types: positive skew (right skew) and negative skew (left skew).

**Positive skew:** Longer tail on the right; typically mean > median.

**Negative skew:** Longer tail on the left; typically mean < median.

Impacts interpretation of central tendency and distribution shape.

Visualized through histograms or density plots.

**Q9.** If a dataset is right-skewed, where does the median lie relative to the mean?

In a right-skewed distribution:

The tail of the distribution is elongated on the right side.

Larger values (or outliers) on the right side pull the mean in that direction.

The median, being less sensitive to outliers, tends to stay closer to the center of the distribution.

As a result, the median typically lies to the left of the mean: mean > median.
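The relationship can be checked on a small right-skewed sample. In the sketch below the data are hypothetical, and `skewness` is a helper written for this illustration that computes the moment-based coefficient m3 / m2^1.5 from population moments; libraries such as SciPy offer their own skewness functions, which may apply different bias corrections.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is much larger
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

def skewness(data):
    """Moment-based skewness: third central moment over second to the power 1.5."""
    n = len(data)
    center = sum(data) / n
    m2 = sum((x - center) ** 2 for x in data) / n
    m3 = sum((x - center) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

mean_value = statistics.mean(values)
median_value = statistics.median(values)

print("Mean:", mean_value)
print("Median:", median_value)
print("Skewness:", skewness(values))
```

A positive coefficient indicates a right skew. Mirroring the data (negating every value) flips the sign and puts the mean below the median.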

**Q10.** Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?

**Covariance:**

Measures how two variables change together.

Indicates direction of linear relationship.

Not standardized, affected by variable scales.

**Correlation:**

Measures strength and direction of linear relationship.

Ranges from -1 to 1.

Standardized, not affected by variable scales.

**Usage in Statistical Analysis:**

Covariance: Assess direction of relationship.

Correlation: Assess strength and direction of relationship, useful for comparing relationships across datasets.
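The effect of variable scale can be seen by changing the units of one variable. The sketch below uses hypothetical data (hours studied and a quiz score) and two helper functions written for this illustration: converting hours to minutes multiplies the covariance by 60 but leaves the correlation unchanged.

```python
import math

def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

# Hypothetical data for illustration
hours = [1, 2, 3, 4, 5]
quiz_scores = [2, 4, 5, 4, 5]
minutes = [h * 60 for h in hours]  # same information, different unit

print("Covariance (hours):", sample_covariance(hours, quiz_scores))
print("Covariance (minutes):", sample_covariance(minutes, quiz_scores))
print("Correlation (hours):", pearson_correlation(hours, quiz_scores))
print("Correlation (minutes):", pearson_correlation(minutes, quiz_scores))
```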

**Q11.** What is the formula for calculating the sample mean? Provide an example calculation for a
dataset.

**Sample Mean (x̄) = (Sum of all values) / (Number of values)**

Example:

Suppose we have a dataset of test scores: [85, 92, 78, 89, 95, 88]

Add up all the values: 85 + 92 + 78 + 89 + 95 + 88 = 527

Count the number of values: 6

Calculate the sample mean: x̄ = 527 / 6 ≈ 87.83

So the sample mean of the test scores is approximately 87.83.
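A point worth checking in code: the sample mean divides the sum by the number of values n, while the n - 1 divisor (Bessel's correction) belongs to the sample variance and standard deviation. A minimal sketch with the standard `statistics` module:

```python
import statistics

scores = [85, 92, 78, 89, 95, 88]

# Sample mean: sum of the values divided by n
sample_mean = sum(scores) / len(scores)

# Sample variance: squared deviations from the mean divided by n - 1
sample_variance = statistics.variance(scores)

print("Sum:", sum(scores))
print("n:", len(scores))
print("Sample mean:", round(sample_mean, 2))
print("Sample variance:", round(sample_variance, 2))
```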

**Q12.** For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, the relationship between measures of central tendency (mean, median, mode) is:

Mean = Median = Mode

They are all equal.

All three lie at the center because the distribution is symmetric with a single peak.
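A simulation can illustrate this, with the caveat that a finite sample only approximates the equality. The sketch below draws values with `random.gauss`; the mean of 50, standard deviation of 10, sample size, and bin width are all arbitrary choices made for this illustration. Because continuous values rarely repeat exactly, the mode is estimated as the center of the most populated bin.

```python
import math
import random
import statistics
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Simulated sample from a normal distribution (parameters are arbitrary)
sample = [random.gauss(50, 10) for _ in range(100_000)]

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

# Estimate the mode as the center of the bin holding the most values
bin_width = 2
bin_counts = Counter(math.floor(x / bin_width) for x in sample)
modal_bin = bin_counts.most_common(1)[0][0]
estimated_mode = (modal_bin + 0.5) * bin_width

print("Mean:", round(sample_mean, 2))
print("Median:", round(sample_median, 2))
print("Estimated mode:", estimated_mode)
```

The mean and median should land very close to 50, while the binned mode estimate is coarser and can sit a bin or two away from the center.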

**Q13.** How is covariance different from correlation?

**Covariance:** Covariance is a statistical measure that indicates how two variables change together. It quantifies the direction of the linear relationship between them.

Measures linear relationship direction.

Positive or negative values.

Not standardized.

Sensitive to variable scales.

No fixed range.

**Correlation:** Correlation is a statistical measure that quantifies both the strength and direction of the linear relationship between two variables.

Measures linear relationship strength and direction.

Ranges from -1 to 1.

Standardized.

Not affected by variable scales.

Fixed range for easy comparison.

**Q14.** How do outliers affect measures of central tendency and dispersion? Provide an example.

**Measures of Central Tendency:**

Outliers affect the mean strongly and the median only slightly:

Mean: Outliers can distort the mean towards their extreme values. If there are outliers with very high or very low values, the mean may no longer represent the typical value of the majority of data points.

Median: The median is far less affected. It depends only on the rank order of the values, so an extreme value in the tail shifts it slightly or not at all.

**Measures of Dispersion:**

Outliers can also impact measures of dispersion, such as the range, variance, and standard deviation:

Range: Because the range depends only on the maximum and minimum, a single outlier can expand it greatly, overstating the spread of the bulk of the data.

Variance and Standard Deviation: Outliers can lead to larger values of variance and standard deviation, indicating higher variability even if the majority of data points are close together.

In [7]:
import numpy as np

# Dataset with and without an outlier
data_with_outlier = np.array([25000, 30000, 32000, 28000, 29000, 27000, 50000])
data_without_outlier = np.array([25000, 30000, 32000, 28000, 29000, 27000])

# Calculate mean, median, range, variance, and standard deviation
def calculate_stats(data):
    mean = np.mean(data)
    median = np.median(data)
    range_val = np.max(data) - np.min(data)
    variance = np.var(data, ddof=1)  # Bessel's correction
    std_deviation = np.std(data, ddof=1)
    return mean, median, range_val, variance, std_deviation

# Calculate statistics for both datasets
mean_with_outlier, median_with_outlier, range_with_outlier, variance_with_outlier, std_deviation_with_outlier = calculate_stats(data_with_outlier)

mean_without_outlier, median_without_outlier, range_without_outlier, variance_without_outlier, std_deviation_without_outlier = calculate_stats(data_without_outlier)


print("With Outlier:")
print("Mean:", mean_with_outlier)
print("Median:", median_with_outlier)
print("Range:", range_with_outlier)
print("Variance:", variance_with_outlier)
print("Standard Deviation:", std_deviation_with_outlier)

print("\nWithout Outlier:")
print("Mean:", mean_without_outlier)
print("Median:", median_without_outlier)
print("Range:", range_without_outlier)
print("Variance:", variance_without_outlier)
print("Standard Deviation:", std_deviation_without_outlier)


With Outlier:
Mean: 31571.428571428572
Median: 29000.0
Range: 25000
Variance: 70952380.95238096
Standard Deviation: 8423.323628614833

Without Outlier:
Mean: 28500.0
Median: 28500.0
Range: 7000
Variance: 5900000.0
Standard Deviation: 2428.9915602982237
