Q1. What are the three measures of central tendency?

The three measures of central tendency are the mean, the median, and the mode.

Q2. What is the difference between the mean, median, and mode? How are they used to measure the
central tendency of a dataset?

1. Mean: The mean is the arithmetic average of all the values in a dataset. It is calculated by summing up all the values and then dividing by the total number of values. The mean is sensitive to extreme values (outliers) in the dataset because it takes into account every value.

2. Median: The median is the middle value of a dataset when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values. The median is less affected by extreme values compared to the mean, making it a robust measure of central tendency, especially in datasets with outliers.

3. Mode: The mode is the value that occurs most frequently in a dataset. A dataset may have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal). Unlike the mean and median, the mode is not affected by extreme values because it only depends on the frequency of values.

These measures are used to summarize and describe the central tendency of a dataset:

1. Mean: It provides a measure of the "average" value in the dataset and is often used when the distribution of data is approximately symmetric and not heavily influenced by outliers.

2. Median: It gives insight into the "middle" value of the dataset and is particularly useful when the data is skewed or contains outliers since it is less affected by extreme values.

3. Mode: It identifies the most frequently occurring value in the dataset and is useful for categorical or nominal data, where the concept of an average may not be meaningful. The mode can also provide insights into the central tendency of continuous data when the distribution exhibits clear peaks.
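As a rough illustration of the three measures on one small dataset, the sketch below uses Python's standard `statistics` module. The `scores` and `colors` lists are made-up values chosen only for this example.

```python
import statistics

# Hypothetical exam scores; 95 is an extreme value relative to the rest.
scores = [50, 52, 52, 55, 56, 95]

mean_score = statistics.mean(scores)      # uses every value, so 95 pulls it up
median_score = statistics.median(scores)  # average of the two middle values
mode_score = statistics.mode(scores)      # most frequently occurring value

print(mean_score, median_score, mode_score)

# The mode also works for categorical data, where a mean is not defined.
colors = ["red", "blue", "red", "green"]
color_mode = statistics.mode(colors)
print(color_mode)
```

The mean lands above five of the six scores, while the median and mode stay close to where most of the values sit.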

Q3. Compute the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [13]:
heights = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [3]:
import numpy as np

In [4]:
np.mean(heights)

177.01875

In [15]:
np.median(heights)

177.0

In [17]:
from scipy import stats
stats.mode(heights)


ModeResult(mode=array([177.]), count=array([3]))

In [6]:
height = {'heights' : [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]}

In [7]:
import pandas as pd

In [8]:
df = pd.DataFrame(height)

In [9]:
df

Unnamed: 0,heights
0,178.0
1,177.0
2,176.0
3,177.0
4,178.2
5,178.0
6,175.0
7,179.0
8,180.0
9,175.0


In [10]:
df.mode()

Unnamed: 0,heights
0,177.0
1,178.0


In [11]:
df.median()

heights    177.0
dtype: float64

Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [18]:
df.std()

heights    1.847239
dtype: float64

In [19]:
np.std(heights)

1.7885814036548633
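The two results differ because `df.std()` in pandas divides by n - 1 by default (sample standard deviation, `ddof=1`), while `np.std` divides by n by default (population standard deviation, `ddof=0`). A minimal sketch reproducing both values with the standard `statistics` module, so the difference does not depend on either library's defaults:

```python
import statistics

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# Population standard deviation: sum of squared deviations divided by n.
population_sd = statistics.pstdev(heights)

# Sample standard deviation: sum of squared deviations divided by n - 1.
sample_sd = statistics.stdev(heights)

print(round(population_sd, 6), round(sample_sd, 6))
```

Passing `ddof=1` to `np.std` would give the same value as the pandas result.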

Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe
the spread of a dataset? Provide an example.

Measures of dispersion, such as range, variance, and standard deviation, provide information about how spread out the values in a dataset are from the central tendency (mean, median, or mode). They help quantify the extent of variability or dispersion within the data.

1. Range: The range is the difference between the maximum and minimum values in a dataset. It gives a simple measure of the spread of the data but is sensitive to outliers.

2. Variance: The variance measures the average squared deviation of each data point from the mean of the dataset. It provides a measure of the overall variability of the data. A higher variance indicates greater spread or dispersion of the data points from the mean.

3. Standard Deviation: The standard deviation is the square root of the variance. It expresses the typical distance of the data points from the mean, in the same units as the data. Like variance, a higher standard deviation indicates greater variability in the dataset.

Here's an example to illustrate how these measures describe the spread of a dataset:

Suppose we have two datasets representing the daily temperatures (in degrees Celsius) for two cities, City A and City B, over the past week:

City A: 22, 24, 23, 21, 25, 23, 22
City B: 18, 30, 20, 22, 19, 21, 25

1. Range:
For City A: Range = Maximum value - Minimum value = 25 - 21 = 4°C
For City B: Range = Maximum value - Minimum value = 30 - 18 = 12°C

2. Variance:
For City A: Mean = 160 / 7 ≈ 22.86°C
Variance = (Sum of squared deviations from the mean) / (Number of observations - 1)
= [(22-22.86)^2 + (24-22.86)^2 + (23-22.86)^2 + (21-22.86)^2 + (25-22.86)^2 + (23-22.86)^2 + (22-22.86)^2] / 6
≈ 10.86 / 6 ≈ 1.81°C^2
For City B: Mean = 155 / 7 ≈ 22.14°C
Variance = [(18-22.14)^2 + (30-22.14)^2 + (20-22.14)^2 + (22-22.14)^2 + (19-22.14)^2 + (21-22.14)^2 + (25-22.14)^2] / 6
≈ 102.86 / 6 ≈ 17.14°C^2

3. Standard Deviation:
For City A: Standard Deviation = Square Root of Variance ≈ √1.81 ≈ 1.35°C
For City B: Standard Deviation = Square Root of Variance ≈ √17.14 ≈ 4.14°C

In this example, City B has a higher range, variance, and standard deviation compared to City A, indicating that City B's temperatures are more spread out or variable over the week.
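The same comparison can be sketched in code. The helper name `describe_spread` is an assumption introduced for this example; it computes the range, the sample variance (n - 1 denominator), and the sample standard deviation directly from the two temperature lists.

```python
import statistics

city_a = [22, 24, 23, 21, 25, 23, 22]
city_b = [18, 30, 20, 22, 19, 21, 25]

def describe_spread(values):
    """Return (range, sample variance, sample standard deviation)."""
    value_range = max(values) - min(values)
    variance = statistics.variance(values)  # divides by n - 1
    std_dev = statistics.stdev(values)      # square root of the sample variance
    return value_range, variance, std_dev

for name, temps in [("City A", city_a), ("City B", city_b)]:
    value_range, variance, std_dev = describe_spread(temps)
    print(f"{name}: range={value_range}, variance={variance:.2f}, std={std_dev:.2f}")
```

All three measures come out larger for City B, which matches the conclusion that its temperatures vary more over the week.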

Q6. What is a Venn diagram?


A Venn diagram is a graphical representation used to show all possible logical relations between a finite collection of different sets. It is composed of overlapping circles (or other shapes) that represent the sets, and the overlapping regions represent the intersections between these sets.
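The regions of a two-circle Venn diagram map directly onto set operations. A small sketch, using hypothetical sets of names invented for illustration:

```python
# Hypothetical sets, used only to illustrate the regions of a two-circle Venn diagram.
python_users = {"Asha", "Ben", "Chen", "Dara"}
sql_users = {"Chen", "Dara", "Eli"}

only_python = python_users - sql_users  # left circle without the overlap
both = python_users & sql_users         # overlapping region
only_sql = sql_users - python_users     # right circle without the overlap

print(sorted(only_python), sorted(both), sorted(only_sql))
```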

Q7. For the two given sets A = {2,3,4,5,6,7} and B = {0,2,6,8,10}, find:
(i) A ∩ B
(ii) A ∪ B

In [21]:
# Define the sets A and B as built-in Python sets
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

# Intersection: elements that appear in both A and B
# (sorted() returns a list in ascending order, since sets are unordered)
intersection = sorted(A & B)
print("Intersection of A and B:", intersection)

# Union: elements that appear in A, in B, or in both
union = sorted(A | B)
print("Union of A and B:", union)


Intersection of A and B: [2, 6]
Union of A and B: [0, 2, 3, 4, 5, 6, 7, 8, 10]


Q8. What do you understand about skewness in data?


Skewness in data refers to the lack of symmetry in the distribution of values. It indicates the degree to which a dataset deviates from a symmetrical distribution around its mean. A symmetrical distribution has equal amounts of data on both sides of the mean, resulting in a balanced shape, while a skewed distribution has a longer tail on one side compared to the other.

There are two main types of skewness:

1. Positive Skewness (Right Skewness): In a positively skewed distribution, the tail of the distribution extends to the right, indicating that the extreme values lie on the right side of the distribution. The mean is typically greater than the median, and the mode is typically less than the median.

2. Negative Skewness (Left Skewness): In a negatively skewed distribution, the tail of the distribution extends to the left, indicating that the extreme values lie on the left side of the distribution. The mean is typically less than the median, and the mode is typically greater than the median.

Skewness is an essential concept in statistics and data analysis because it provides insights into the shape and characteristics of a dataset. Understanding skewness helps analysts identify patterns, outliers, and potential issues in the data. Additionally, skewness plays a crucial role in selecting appropriate statistical methods and interpreting results accurately.
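One common way to put a number on skewness is the moment coefficient, the third central moment divided by the 1.5 power of the second central moment. The sketch below implements that particular definition by hand; the function name `moment_skewness` and the three datasets are assumptions made for illustration, and other skewness definitions exist.

```python
import statistics

def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 15]        # long tail to the right
left_skewed = [-15, -3, -3, -3, -2, -2, -1]  # long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, round(moment_skewness(data), 3))
```

A value near zero suggests symmetry, a positive value a right tail, and a negative value a left tail.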

Q9. If the data are right-skewed, what is the position of the median with respect to the mean?

If a dataset is right-skewed, the position of the median with respect to the mean will typically be as follows:

The mean will be greater than the median.
Since a right-skewed distribution has a longer tail on the right side, indicating that there are more extreme values on the right, the mean will be pulled towards the higher values, leading to a higher mean value. On the other hand, the median represents the middle value of the dataset, which is less influenced by extreme values. Therefore, in a right-skewed distribution, the mean tends to be greater than the median.
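A quick numerical check of this relationship, using a hypothetical list of monthly incomes in which a few large values form the right tail:

```python
import statistics

# Hypothetical monthly incomes: most values are modest, two are very large.
incomes = [2000, 2200, 2500, 2700, 3000, 3200, 9000, 15000]

mean_income = statistics.mean(incomes)      # pulled toward the large values
median_income = statistics.median(incomes)  # average of the 4th and 5th values

print(mean_income, median_income)
```

Here the mean exceeds the median, as expected for right-skewed data.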







Q10. Explain the difference between covariance and correlation. How are these measures used in
statistical analysis?

1. Covariance:

Covariance measures the degree to which two variables change together. It indicates the direction of the linear relationship between the variables.
The formula for covariance between two variables X and Y is:

$$\text{cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n}$$

Where:
$X_i$ and $Y_i$ are individual data points of variables X and Y.
$\bar{X}$ and $\bar{Y}$ are the means of variables X and Y, respectively.
$n$ is the number of data points.
Covariance can take any value, positive or negative, depending on the direction of the relationship. A positive covariance indicates that as one variable increases, the other variable tends to increase as well, while a negative covariance indicates an inverse relationship.
The magnitude of the covariance is not standardized and depends on the scale of the variables. Therefore, it can be challenging to interpret and compare covariances across different datasets.
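A minimal sketch of the covariance formula with the n denominator used above. The function name `population_covariance` and the study-hours data are assumptions made for illustration.

```python
def population_covariance(xs, ys):
    """Covariance of paired data using the n denominator."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average of the products of paired deviations from the means.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

cov_positive = population_covariance(hours, scores)        # variables rise together
cov_negative = population_covariance(hours, scores[::-1])  # one rises as the other falls

print(cov_positive, cov_negative)
```

The sign flips when the pairing is reversed, which reflects the direction of the relationship. Library routines often default to the n - 1 denominator instead, so their results can differ slightly from this sketch.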


2. Correlation:

Correlation is a standardized measure that quantifies the strength and direction of the linear relationship between two variables. It is a dimensionless quantity, ranging from -1 to +1.
The most commonly used measure of correlation is the Pearson correlation coefficient, denoted by ρ (rho), which is calculated as:

$$\rho_{X,Y} = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y}$$

Where:
$\text{cov}(X, Y)$ is the covariance between variables X and Y.
$\sigma_X$ and $\sigma_Y$ are the standard deviations of variables X and Y, respectively.
Correlation coefficients close to +1 indicate a strong positive linear relationship, coefficients close to -1 indicate a strong negative linear relationship, and coefficients close to 0 indicate little to no linear relationship.
Correlation is useful because it is standardized, making it easier to interpret and compare relationships across different datasets with different scales.
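The Pearson coefficient can be sketched directly from its definition: the covariance divided by the product of the two standard deviations. The function name `pearson_r` and the sample lists are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # perfectly increasing line
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # perfectly decreasing line
r_weak = pearson_r(x, [3, 1, 4, 1, 5])       # no clear linear pattern

print(round(r_positive, 3), round(r_negative, 3), round(r_weak, 3))
```

The n terms cancel in the ratio, so the result is the same whether the n or the n - 1 denominator is used, as long as it is used consistently.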

Q11. What is the formula for calculating the sample mean? Provide an example calculation for a
dataset.


The formula for calculating the sample mean (also known as the average) of a dataset is:

$$\text{Sample Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Where:

$x_i$ represents each individual value in the dataset.
$n$ is the total number of values in the dataset.

To calculate the sample mean:

1. Add up all the values in the dataset.
2. Divide the sum by the total number of values in the dataset.

Here's an example calculation of the sample mean for a dataset:

Dataset: {10, 15, 20, 25, 30}

Sum of the values: $10 + 15 + 20 + 25 + 30 = 100$
Total number of values: $n = 5$

$$\text{Sample Mean} = \frac{100}{5} = 20$$

So, the sample mean of the dataset {10, 15, 20, 25, 30} is 20.

Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, the mean, median, and mode are all equal and located at the center of the distribution, because the distribution is symmetric and has a single peak. Real-world data are only approximately normal, so the sample mean, median, and mode may differ slightly from one another.

Q13. How is covariance different from correlation?

Covariance and correlation are both measures used to quantify the relationship between two variables, but they differ in several aspects:

1. Definition:

Covariance measures the degree to which two variables change together. It indicates the direction of the linear relationship between the variables.
Correlation measures the strength and direction of the linear relationship between two variables. It is a standardized measure that ranges from -1 to +1, where:
A correlation coefficient close to +1 indicates a strong positive linear relationship.
A correlation coefficient close to -1 indicates a strong negative linear relationship.
A correlation coefficient close to 0 indicates little to no linear relationship.


2. Scale:

Covariance is not standardized and depends on the scale of the variables. Therefore, the magnitude of covariance can vary widely, making it difficult to interpret and compare across different datasets.
Correlation is a dimensionless quantity, ranging from -1 to +1. It is standardized, making it easier to interpret and compare relationships across different datasets with different scales.


3. Interpretation:

Covariance does not provide clear guidelines for interpretation due to its dependence on the scale of the variables. A positive covariance indicates that as one variable increases, the other tends to increase as well, while a negative covariance indicates an inverse relationship. However, the magnitude of covariance does not indicate the strength of the relationship.
Correlation provides a standardized measure of the strength and direction of the linear relationship between two variables. The correlation coefficient indicates the degree to which the variables move together in a linear fashion, regardless of their scale.


4. Range:

Covariance can take any value, positive or negative, depending on the direction of the relationship between variables.
Correlation coefficients range from -1 to +1, where extreme values indicate stronger linear relationships and values closer to 0 indicate weaker or no linear relationship.

In summary, covariance and correlation both measure the relationship between two variables. Correlation is standardized, which makes it easier to interpret and compare across different datasets, whereas covariance is not standardized and depends on the scale of the variables.
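The effect of scale can be seen by expressing the same measurements in different units. In the sketch below, the helper `cov_and_corr` and the height and weight values are assumptions made for illustration; heights are given once in centimetres and once in metres.

```python
import math

def cov_and_corr(xs, ys):
    """Return (population covariance, Pearson correlation) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov, cov / (sd_x * sd_y)

# Hypothetical heights and weights for five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 66, 75, 85]
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit

cov_cm, corr_cm = cov_and_corr(heights_cm, weights_kg)
cov_m, corr_m = cov_and_corr(heights_m, weights_kg)

print(cov_cm, cov_m)    # covariance shrinks by a factor of 100
print(corr_cm, corr_m)  # correlation is essentially unchanged
```

The relationship between the variables is identical in both cases, yet only the correlation reports it as such.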

Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.


Outliers can significantly impact measures of central tendency and dispersion in a dataset. Their presence can distort the calculated values and lead to inaccurate summaries of the data. Here's how outliers affect different measures:

1. Measures of Central Tendency:

a. Mean: Outliers can heavily influence the mean because it takes into account every value in the dataset. If an outlier is significantly higher or lower than the other values, it can pull the mean towards it, causing it to be either inflated or deflated.

b. Median: The median is much less affected by outliers than the mean. Since the median is the middle value when the data are sorted, it depends on the order of the values rather than their magnitudes, so a single extreme value can shift it only slightly.

c. Mode: Outliers typically do not affect the mode because it represents the most frequently occurring value in the dataset. An outlier changes the mode only if it is itself among the most frequent values.


2. Measures of Dispersion:

a. Range: Outliers strongly affect the range because it is calculated as the difference between the maximum and minimum values in the dataset. An outlier becomes the new maximum or minimum, so it widens the range directly.


b. Variance and Standard Deviation: Outliers can inflate the variance and standard deviation because they measure the spread of data points from the mean. Since outliers are far from the mean, they contribute to larger deviations, resulting in higher variance and standard deviation.

Example:
Consider a dataset representing the salaries of employees in a company:

{20000, 25000, 30000, 35000, 40000, 50000, 80000}

If we add an outlier to this dataset:

{20000, 25000, 30000, 35000, 40000, 50000, 80000, 150000}

Adding the outlier raises the mean salary from 40,000 to 53,750, which is more than all but two employees earn. The range grows from 60,000 to 130,000, and the standard deviation also increases sharply because the outlier lies far from the mean. The median is far less affected: it moves only from 35,000 to 37,500. Every salary in this dataset occurs exactly once, so there is no meaningful mode either before or after the outlier is added.
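The salary example can be checked with a short script. This is a sketch using the standard `statistics` module; the helper name `summarize` is an assumption introduced here, and the standard deviation shown is the sample version (n - 1 denominator).

```python
import statistics

salaries = [20000, 25000, 30000, 35000, 40000, 50000, 80000]
with_outlier = salaries + [150000]

def summarize(data):
    """Return mean, median, range, and sample standard deviation."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "std": statistics.stdev(data),
    }

before = summarize(salaries)
after = summarize(with_outlier)

for key in before:
    print(f"{key}: {before[key]:.0f} -> {after[key]:.0f}")
```

The mean, range, and standard deviation all jump once the 150,000 salary is included, while the median moves by a comparatively small amount.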





