# Q1. What are the three measures of central tendency?
Mean: The mean, also known as the average, is calculated by summing up all the values in a data set and then dividing the sum by the total number of values. It is the most commonly used measure of central tendency.

Median: The median is the middle value in a data set when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values. The median is less sensitive to extreme outliers than the mean.

Mode: The mode is the value that occurs most frequently in a data set. A data set can have no mode (if all values are unique), one mode (unimodal), or multiple modes (multimodal) if two or more values have the same highest frequency.

# Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?
Mean:

- Calculation: The mean is calculated by summing all the values in a dataset and dividing the sum by the total number of values.
- Sensitivity to Outliers: The mean is sensitive to outliers (very large or very small values) because every value in the dataset contributes to it.
- Use: The mean is commonly used when the data is relatively symmetrical and not heavily skewed. It provides a balanced measure of central tendency.

Median:

- Calculation: The median is the middle value when the values in a dataset are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.
- Sensitivity to Outliers: The median is less sensitive to outliers than the mean because it depends only on the middle value(s), not on the extreme values at the ends of the dataset.
- Use: The median is often used when the data is skewed or has outliers. It provides a better representation of the central value in such cases.

Mode:

- Calculation: The mode is the value that occurs most frequently in a dataset. A dataset can have no mode (if all values are unique), one mode (unimodal), or multiple modes (multimodal) if two or more values have the same highest frequency.
- Sensitivity to Outliers: The mode is not influenced by outliers because it is based solely on the frequency of values.
- Use: The mode is useful for categorical or nominal data, and it can also be used with continuous data. It helps identify the most frequently occurring category or value in a dataset.
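
As a rough illustration of these differences, the sketch below uses Python's standard `statistics` module on made-up values (the numbers and colour labels are hypothetical, not taken from any dataset in this notebook). It shows the mean shifting when one extreme value is added while the median barely moves, and the mode applied to categorical labels.

```python
import statistics

# Hypothetical values, chosen only for illustration
values = [10, 12, 13, 14, 15]
values_with_outlier = values + [100]

mean_before = statistics.mean(values)
mean_after = statistics.mean(values_with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(values_with_outlier)

print("Mean:", mean_before, "->", mean_after)
print("Median:", median_before, "->", median_after)

# The mode also works on categorical (non-numeric) data
colours = ["red", "blue", "red", "green"]
colour_mode = statistics.mode(colours)
print("Mode:", colour_mode)
```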

# Q3. Measure the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
#from scratch

# Given height data
height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Calculate the mean
mean = sum(height_data) / len(height_data)

# Calculate the median
sorted_data = sorted(height_data)
n = len(sorted_data)
if n % 2 == 0:
    median = (sorted_data[n // 2 - 1] + sorted_data[n // 2]) / 2
else:
    median = sorted_data[n // 2]

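# Calculate the mode
# Note: 177 and 178 both occur three times, so this data is bimodal.
# most_common(1) returns only the first tied value it encountered (178 here).
from collections import Counter
height_counts = Counter(height_data)
mode = height_counts.most_common(1)[0][0]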

# Print the results
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)

Mean: 177.01875
Median: 177.0
Mode: 178


In [4]:
#Using Libraries (NumPy and SciPy):

import numpy as np
from scipy import stats

# Given height data
height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Calculate the mean using NumPy
mean = np.mean(height_data)

# Calculate the median using NumPy
median = np.median(height_data)

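# Calculate the mode using SciPy
# stats.mode reports only the smallest of tied modes (177 here; 178 is equally frequent).
# np.atleast_1d handles both older SciPy (array result) and newer SciPy (scalar result).
mode = np.atleast_1d(stats.mode(height_data).mode)[0]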

# Print the results
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)


Mean: 177.01875
Median: 177.0
Mode: 177.0
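
The two cells above report different modes because 177 and 178 each occur three times, so the data has two modes rather than one. One way to list every mode, sketched here with `statistics.multimode` from the standard library (available from Python 3.8), is:

```python
from statistics import multimode

# Given height data
height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value that shares the highest frequency
modes = multimode(height_data)
print("Modes:", sorted(modes))
```

Reporting all tied values avoids the result depending on which library, or which tie-breaking rule, happens to be used.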




# Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [5]:
#from scratch
# Given height data
height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Calculate the mean
mean = sum(height_data) / len(height_data)

# Calculate the squared differences from the mean
squared_diff = [(x - mean) ** 2 for x in height_data]

# Calculate the mean of squared differences
mean_squared_diff = sum(squared_diff) / len(height_data)

# Calculate the standard deviation by taking the square root of mean_squared_diff
std_deviation = (mean_squared_diff ** 0.5)

# Print the result
print("Standard Deviation:", std_deviation)

Standard Deviation: 1.7885814036548633


In [6]:
#NumPy library (np.std defaults to ddof=0, i.e. the population standard deviation)
std_deviation = np.std(height_data)
print("Standard Deviation:", std_deviation)

Standard Deviation: 1.7885814036548633
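
Both results above divide by n, which gives the population standard deviation. If the heights are treated as a sample from a larger group, the usual choice is to divide by n - 1 instead. A minimal sketch of the comparison, using `statistics.pstdev` and `statistics.stdev` from the standard library (the variable names are only illustrative):

```python
import statistics

# Given height data
height_data = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175, 178.9, 176.2, 177, 172.5, 178, 176.5]

# Population standard deviation: divides by n
population_std = statistics.pstdev(height_data)

# Sample standard deviation: divides by n - 1
sample_std = statistics.stdev(height_data)

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

The sample value is always slightly larger; with NumPy the same result should come from `np.std(height_data, ddof=1)`.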


# Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Range:

- The range is the simplest measure of dispersion: the difference between the maximum and minimum values in a dataset.
- It gives a quick sense of spread, but it depends only on the two extreme values, so a single outlier can change it dramatically.
- Example: Consider a dataset of exam scores for a class: [60, 70, 80, 90, 100]. The range is 100 - 60 = 40, so the scores span 40 points.

Variance:

- Variance is the average squared difference between each data point and the mean of the dataset. Squaring gives more weight to larger deviations from the mean.
- Variance is often used in statistical analysis and is the basis for the standard deviation, but it is expressed in squared units.
- Example: For the same exam scores, the mean is 80 and the squared deviations are 400, 100, 0, 100, and 400. The population variance is 1000 / 5 = 200 (the sample variance, dividing by n - 1 = 4, is 250).

Standard Deviation:

- Standard deviation is the square root of the variance. Because it is in the same units as the data, it is easier to interpret than the variance.
- It is commonly used to describe the typical distance of values from the mean. Strictly, it is a root-mean-square deviation, not the mean absolute deviation.
- Example: With a population variance of 200 for the exam scores, the standard deviation is √200 ≈ 14.14, so the scores typically lie about 14 points from the mean.
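
These three measures can be computed for the exam scores [60, 70, 80, 90, 100] in a few lines. The sketch below follows the from-scratch style used in Q4 and divides by n (population variance); the sample variance, which divides by n - 1, is included for comparison.

```python
# Exam scores from the example
scores = [60, 70, 80, 90, 100]
n = len(scores)

# Range: maximum minus minimum
data_range = max(scores) - min(scores)

# Variance: average squared deviation from the mean
mean = sum(scores) / n
squared_diff = [(x - mean) ** 2 for x in scores]
population_variance = sum(squared_diff) / n
sample_variance = sum(squared_diff) / (n - 1)

# Standard deviation: square root of the variance
population_std = population_variance ** 0.5

print("Range:", data_range)
print("Population variance:", population_variance)
print("Sample variance:", sample_variance)
print("Population standard deviation:", population_std)
```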

# Q6. What is a Venn diagram?
A Venn diagram is a graphical representation used to depict the relationship between sets or groups of objects. It consists of overlapping circles, each representing a set, and the intersections between these circles represent the elements that belong to multiple sets simultaneously. Venn diagrams are commonly used to visualize and understand set theory, logic, and the relationships between different categories or groups of items.

(i) A ∩ B (Intersection): The intersection of two sets consists of the elements that are common to both sets. In this case, the elements that appear in both sets A and B are {2, 6}.

So, A ∩ B = {2, 6}

(ii) A ∪ B (Union): The union of two sets consists of all unique elements from both sets, without duplicates. In this case, combine all the elements from sets A and B.

So, A ∪ B = {0, 2, 3, 4, 5, 6, 7, 8, 10}
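
Python's built-in `set` type supports both operations directly. The sets A and B are not listed in this notebook, so the values below are an assumption, chosen only because they are consistent with the intersection and union stated above; the original sets may differ.

```python
# ASSUMED sets: not given in the notebook, picked to match the stated results
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Intersection: elements present in both sets
intersection = A & B

# Union: every distinct element from either set
union = A | B

print("A ∩ B =", sorted(intersection))
print("A ∪ B =", sorted(union))
```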


# Q8. What do you understand about skewness in data?

Skewness is a statistical measure of the asymmetry of a data distribution.
Types: There are three main cases: positive (right skew), negative (left skew), and no skew (symmetrical).

Positive Skew (Right Skew):
- The tail on the right side is longer.
- The majority of the data is concentrated on the left side.
- Typically, Mode < Median < Mean.

Negative Skew (Left Skew):
- The tail on the left side is longer.
- The majority of the data is concentrated on the right side.
- Typically, Mode > Median > Mean.

No Skew (Symmetrical):
- The data is evenly distributed around the mean.
- For a symmetric, single-peaked distribution, Mode = Median = Mean.
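
Skewness can also be computed as a number. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment), which is one common definition among several; the two datasets are hypothetical and chosen only to show a positive and a zero result.

```python
import statistics

def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data: one large value creates a long right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# Hypothetical data: perfectly symmetric around 3
symmetric = [1, 2, 3, 4, 5]

skew_right = skewness(right_skewed)
skew_symmetric = skewness(symmetric)

print("Right-skewed:", skew_right)
print("  median:", statistics.median(right_skewed), "mean:", statistics.mean(right_skewed))
print("Symmetric:", skew_symmetric)
```

A positive value indicates a right skew, a negative value a left skew, and a value near zero an approximately symmetric distribution.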


# Q9. If a data is right skewed then what will be the position of median with respect to mean?

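In a right-skewed distribution, the median is typically less than the mean (Median < Mean), so in a plot of the distribution the median lies to the left of the mean. The long right tail pulls the mean toward the larger values.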

# Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance measures the direction of the linear relationship between two variables, but its magnitude depends on the variables' units, so on its own it says little about strength. Correlation is the covariance divided by the product of the two standard deviations. It quantifies both direction and strength, is unitless, and ranges from -1 to 1, which makes it more interpretable and widely used in statistical analysis, finance, and research.

In statistical analysis, these measures are used for exploratory data analysis (EDA) and feature selection, and more specifically for:

- Relationship Analysis: Understanding how two variables relate.
- Data Preprocessing: Identifying multicollinearity.
- Visualization: Displaying variable relationships.
- Data Selection: Choosing significant predictor variables.
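
To make the unit dependence concrete, here is a small from-scratch sketch. The height and weight values are hypothetical, and the helper names `covariance` and `correlation` are just illustrative. Converting heights from centimetres to metres changes the covariance by a factor of 100 but leaves the correlation unchanged.

```python
def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    std_x = covariance(x, x) ** 0.5
    std_y = covariance(y, y) ** 0.5
    return covariance(x, y) / (std_x * std_y)

# Hypothetical measurements
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)

print("Covariance (cm):", cov_cm, " Covariance (m):", cov_m)
print("Correlation (cm):", corr_cm, " Correlation (m):", corr_m)
```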






# Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.


The formula for calculating the sample mean (average) is as follows:

Sample Mean (x̄) = (Sum of all data values in sample) / (sample size)

Here's an example calculation for a dataset:

Suppose you have the following dataset of exam scores:

[85, 92, 88, 78, 90]

To calculate the sample mean:

Add up all the values in the dataset: 85 + 92 + 88 + 78 + 90 = 433

Count the number of data values, which is 5 in this case.

Use the formula to calculate the sample mean:

Sample Mean (x̄) = 433 / 5 = 86.6

So, the sample mean (average) for this dataset is 86.6.
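
The same calculation in Python, as a minimal sketch of the formula above:

```python
# Exam scores from the example
scores = [85, 92, 88, 78, 90]

# Sample mean = sum of values / number of values
total = sum(scores)
sample_size = len(scores)
sample_mean = total / sample_size

print("Sum:", total)
print("Sample size:", sample_size)
print("Sample mean:", sample_mean)
```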






# Q12. For a normal distribution data what is the relationship between its measure of central tendency?

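For a normal distribution, the mean, median, and mode are all equal (mean = median = mode) because the distribution is symmetric with a single peak. In sample data drawn from a normal distribution, they are approximately equal.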

# Q13. How is covariance different from correlation?
Covariance measures only the direction of the linear relationship between two variables, and its value depends on their units. Correlation quantifies both direction and strength, is unitless, and ranges from -1 to 1.








# Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

1. **Measures of Central Tendency (Mean, Median, Mode):**
   - **Mean:** Outliers, especially extreme values, can pull the mean in their direction, making it an inaccurate representation of the typical value.
   - **Median:** The median is less affected by outliers because it is not sensitive to extreme values, so it may provide a more robust estimate of central tendency.
   - **Mode:** Outliers rarely affect the mode because it represents the most frequently occurring value, which may not be influenced by isolated extreme values.

2. **Measures of Dispersion (Range, Variance, Standard Deviation):**
   - **Range:** Outliers can significantly increase the range, making it a poor measure of data spread if extreme values are present.
   - **Variance and Standard Deviation:** Outliers increase the variability in the data, leading to larger variance and standard deviation values, which can exaggerate the perceived spread of the data.


Example:
Consider a dataset of salaries for a company:

[40000, 42000, 45000, 41000, 44000, 250000]

- **Mean:** Without the outlier (250,000), the mean salary is $42,400. With the outlier, it becomes $77,000, pulled far above every typical salary by the single extreme value.
- **Median:** The median salary is $42,000 without the outlier and $43,000 with it, so it barely moves, making it a robust measure.
- **Standard Deviation:** Without the outlier, the population standard deviation is approximately $1,854.72. With the outlier, it increases to approximately $77,386.48, with nearly all of that variability due to the outlier.

In this example, the outlier (250,000) has a substantial impact on the mean and standard deviation, while the median remains relatively unaffected, highlighting the importance of considering the influence of outliers when interpreting central tendency and dispersion measures.
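
The figures for this example can be checked with a few lines of Python. This sketch uses the standard `statistics` module and assumes the population standard deviation (`pstdev`), matching the divide-by-n convention used in Q4; a sample standard deviation would come out somewhat larger. The helper name `summarize` is only illustrative.

```python
import statistics

# Salaries from the example; the last value is the outlier
salaries = [40000, 42000, 45000, 41000, 44000, 250000]

def summarize(data):
    """Return the mean, median, and population standard deviation."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "std": statistics.pstdev(data),
    }

summary_without = summarize(salaries[:-1])  # outlier removed
summary_with = summarize(salaries)          # outlier included

for label, summary in [("Without outlier", summary_without), ("With outlier", summary_with)]:
    print(label)
    print("  Mean:", summary["mean"])
    print("  Median:", summary["median"])
    print("  Std (population):", round(summary["std"], 2))
```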