**Theoretical Questions and Answers**

1. What is statistics, and why is it important?
Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data.
Importance: It helps in decision-making, understanding trends, predicting outcomes, and validating hypotheses in areas like business, healthcare, politics, and social sciences.

2. What are the two main types of statistics?
Descriptive Statistics: Summarizes or describes features of a dataset.

Inferential Statistics: Makes predictions or generalizations about a population based on a sample.

3. What are descriptive statistics?
These are techniques for summarizing or describing data using:

Measures of central tendency (mean, median, mode)

Measures of variability (range, variance, standard deviation)

Graphs and charts (bar graphs, histograms)

4. What is inferential statistics?
Inferential statistics uses sample data to:

Make predictions or inferences about a population

Test hypotheses

Estimate parameters

5. What is sampling in statistics?
Sampling involves selecting a subset of individuals from a population to analyze and draw conclusions about the whole population.

6. What are the different types of sampling methods?
Random (Probability) Sampling: Simple random, stratified, cluster, systematic

Non-Random Sampling: Convenience, judgmental, quota, snowball

7. What is the difference between random and non-random sampling?
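Random Sampling: Each member has a known, non-zero chance of selection (an equal chance in simple random sampling), which reduces selection bias.

Non-Random Sampling: Selection depends on convenience or judgment rather than chance, so selection probabilities are unknown and bias is more likely.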
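
To see why the distinction matters, here is a minimal sketch using a made-up population of 100 values stored in ascending order (the data and the fixed seed are assumptions for illustration only). A convenience sample that takes the first ten records badly underestimates the mean, whereas a simple random sample has no built-in tendency to miss in one direction.

```python
import random
import statistics

# Hypothetical population: the values 1..100, stored in ascending order
population = list(range(1, 101))
population_mean = statistics.mean(population)

# Convenience (non-random) sample: simply take the first 10 records
convenience_sample = population[:10]

# Simple random sample: every member has the same chance of selection
rng = random.Random(0)  # fixed seed only so the run is repeatable
random_sample = rng.sample(population, 10)

print("Population mean:", population_mean)
print("Convenience sample mean:", statistics.mean(convenience_sample))
print("Random sample mean:", statistics.mean(random_sample))
```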

8. Define and give examples of qualitative and quantitative data
Qualitative (Categorical): Non-numerical (e.g., gender, color, brand)

Quantitative (Numerical): Measurable data (e.g., height, weight, income)

9. What are the different types of data in statistics?
Qualitative Data: Nominal and Ordinal

Quantitative Data: Interval and Ratio (values may also be discrete or continuous)

10. Explain nominal, ordinal, interval, and ratio levels of measurement
Nominal: Categories with no order (e.g., blood type)

Ordinal: Categories with a ranked order (e.g., survey satisfaction levels)

Interval: Numerical, equal intervals, no true zero (e.g., temperature in Celsius)

Ratio: Like interval, but with a true zero (e.g., height, age, weight)
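
The practical consequence of each level is which operations make sense. The sketch below is illustrative only: the satisfaction scale, temperatures, and heights are hypothetical values chosen to show that ordering is valid for ordinal data, differences for interval data, and ratios only for ratio data.

```python
# Ordinal: categories have an order, so sorting by rank is meaningful
satisfaction_rank = {"Low": 1, "Medium": 2, "High": 3}  # hypothetical scale
responses = ["High", "Low", "Medium", "Low"]
sorted_responses = sorted(responses, key=satisfaction_rank.get)
print(sorted_responses)

# Interval: differences are meaningful, but ratios are not,
# because 0 °C does not mean "no temperature"
celsius_a, celsius_b = 10.0, 20.0
print("Difference in °C:", celsius_b - celsius_a)

# Converting to kelvin (a ratio scale) shows 20 °C is not twice as hot as 10 °C
kelvin_ratio = (celsius_b + 273.15) / (celsius_a + 273.15)
print("Ratio in kelvin:", round(kelvin_ratio, 3))

# Ratio: a true zero makes ratios meaningful (180 cm is 1.5 times 120 cm)
height_ratio = 180 / 120
print("Height ratio:", height_ratio)
```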

11. What is the measure of central tendency?
It identifies the center or typical value in a dataset: mean, median, and mode.

12. Define mean, median, and mode
Mean: Average of all values

Median: Middle value when sorted

Mode: Most frequent value

13. What is the significance of the measure of central tendency?
It provides a summary of a dataset, allowing comparisons and understanding of general patterns.

14. What is variance, and how is it calculated?
Variance measures how data points spread out from the mean.
Formula (population):

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

15. What is skewness in a dataset?
Skewness measures the asymmetry of data distribution:

Positive skew: Tail on the right

Negative skew: Tail on the left

16. What is standard deviation, and why is it important?
Standard deviation is the square root of the variance. It shows how far values typically lie from the mean, in the same units as the data, which makes it useful for judging consistency or variability.

17. Define and explain the term range in statistics
Range = Maximum value – Minimum value
It shows the spread of values in a dataset.

18. What is the difference between variance and standard deviation?
Variance: Average squared deviation from the mean

Standard deviation: Square root of variance (same units as original data)
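
A short sketch with Python's built-in `statistics` module makes the unit difference concrete. The data and the choice of centimetres are assumptions for illustration. The example also shows the sample versions, which divide by N - 1 instead of N.

```python
import statistics

data = [10, 20, 30, 40, 50]  # hypothetical measurements in centimetres

pop_var = statistics.pvariance(data)    # population variance: divides by N (cm squared)
pop_std = statistics.pstdev(data)       # square root of pop_var (cm)
sample_var = statistics.variance(data)  # sample variance: divides by N - 1
sample_std = statistics.stdev(data)     # square root of sample_var

print("Population variance:", pop_var)
print("Population std dev:", pop_std)
print("Sample variance:", sample_var)
print("Sample std dev:", sample_std)
```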

19. What does it mean if a dataset is positively or negatively skewed?
Positively skewed: Long right tail; the mean is typically greater than the median

Negatively skewed: Long left tail; the mean is typically less than the median
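
The relationship between mean and median can be checked on a small example. This is a sketch with made-up numbers: one large value drags the mean toward the tail, while the median barely moves.

```python
import statistics

right_skewed = [1, 2, 2, 3, 4, 5, 6, 7, 8, 20]  # one large value creates a right tail
left_skewed = [-v for v in right_skewed]         # mirror image: long left tail

for name, values in [("Right-skewed", right_skewed), ("Left-skewed", left_skewed)]:
    print(name, "mean:", statistics.mean(values), "median:", statistics.median(values))
```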

20. Define and explain kurtosis
Kurtosis describes the "tailedness" of a distribution:

High kurtosis: More outliers, heavy tails

Low kurtosis: Fewer outliers, light tails
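
Both skewness and kurtosis can be computed from central moments. The sketch below uses the simple moment-based definitions, which, to my understanding, are what `scipy.stats.skew` and `scipy.stats.kurtosis` return with their default settings. The helper names are hypothetical, and kurtosis is reported as excess kurtosis, so a normal distribution scores 0.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 5, 6, 100]

print("Symmetric:", skewness(symmetric), excess_kurtosis(symmetric))
print("With outlier:", skewness(with_outlier), excess_kurtosis(with_outlier))
```

The single extreme value in the second dataset pushes both measures up, which matches the "heavy tail" reading of high kurtosis.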

21. What is the purpose of covariance?
Covariance measures how two variables change together:

Positive: They increase together

Negative: One increases, the other decreases

22. What does correlation measure in statistics?
Correlation measures the strength and direction of a linear relationship between two variables (ranges from -1 to +1).

23. What is the difference between covariance and correlation?
Covariance: Direction of relationship; unstandardized

Correlation: Strength + direction; standardized (dimensionless)
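
One way to see what "standardized" means is to change the units of one variable. In this sketch (hypothetical data, hand-written helper functions using population formulas), converting metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) ** 0.5 * covariance(y, y) ** 0.5)

x_m = [2, 4, 6, 8]              # hypothetical lengths in metres
y = [1, 3, 2, 5]
x_cm = [v * 100 for v in x_m]   # the same lengths in centimetres

print("Covariance (m):", covariance(x_m, y))
print("Covariance (cm):", covariance(x_cm, y))
print("Correlation (m):", correlation(x_m, y))
print("Correlation (cm):", correlation(x_cm, y))
```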

24. What are some real-world applications of statistics?
Healthcare: Clinical trials, disease tracking

Business: Market research, quality control

Sports: Player performance analysis

Government: Policy-making, census analysis

Education: Test score evaluation, program assessment

**Practical Questions and Answers**

---

### ✅ **1. Calculate Mean, Median, and Mode of a Dataset**


import statistics as stats

data = [10, 15, 20, 25, 30, 35, 40]

mean = stats.mean(data)
median = stats.median(data)
mode = stats.mode(data)  # no value repeats here, so Python 3.8+ returns the first value (10)

print(f"Mean: {mean}, Median: {median}, Mode: {mode}")


---

### ✅ **2. Compute Variance and Standard Deviation**


def compute_variance_stddev(data):
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    stddev = variance ** 0.5
    return variance, stddev

data = [10, 20, 30, 40, 50]
var, std = compute_variance_stddev(data)
print(f"Variance: {var}, Standard Deviation: {std}")


---

### ✅ **3. Create and Classify Dataset**


dataset = {
    'Nominal': ['Red', 'Blue', 'Green'],
    'Ordinal': ['Low', 'Medium', 'High'],
    'Interval': [20, 25, 30],  # Temperature in Celsius
    'Ratio': [160, 170, 180]   # Height in cm
}


---

### ✅ **4. Implement Random and Stratified Sampling**


import pandas as pd

# Sample dataset: 5 males and 5 females
df = pd.DataFrame({
    'Gender': ['M', 'F'] * 5,
    'Score': [60, 70, 65, 80, 75, 90, 85, 70, 60, 88]
})

# Simple random sampling: draw 4 rows, ignoring group membership
random_sample = df.sample(n=4, random_state=42)

# Stratified sampling: draw 50% of the rows within each gender (needs pandas >= 1.1)
stratified_sample = df.groupby('Gender').sample(frac=0.5, random_state=42)

print("Random Sample:\n", random_sample)
print("Stratified Sample:\n", stratified_sample)


---

### ✅ **5. Python Function to Calculate Range**


def data_range(data):
    return max(data) - min(data)

data = [3, 10, 7, 5, 9]
print("Range:", data_range(data))


---

### ✅ **6. Create Dataset and Plot Histogram to Visualize Skewness**


import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 20]  # Right-skewed
plt.hist(data, bins=10, edgecolor='black')
plt.title("Histogram - Skewness")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()


---

### ✅ **7. Calculate Skewness and Kurtosis Using Python**


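from scipy.stats import skew, kurtosis

data = [1, 2, 3, 4, 5, 6, 100]  # the outlier 100 creates a long right tail
print("Skewness:", skew(data))
# kurtosis() returns excess (Fisher) kurtosis by default: a normal distribution scores 0
print("Kurtosis:", kurtosis(data))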


---

### ✅ **8. Generate Dataset to Demonstrate Skewness**


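import numpy as np
import matplotlib.pyplot as plt

# Positive skew: exponential samples have a long right tail
positive_skew = np.random.exponential(scale=2, size=1000)

# Negative skew: negating the samples mirrors the tail to the left
negative_skew = -np.random.exponential(scale=2, size=1000)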

plt.hist(positive_skew, bins=30)
plt.title("Positive Skew")
plt.show()

plt.hist(negative_skew, bins=30)
plt.title("Negative Skew")
plt.show()


---

### ✅ **9. Calculate Covariance Between Two Datasets**


import numpy as np

x = [2, 4, 6, 8]
y = [1, 3, 2, 5]

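# bias=True divides by N (population covariance); the default divides by N - 1
cov_matrix = np.cov(x, y, bias=True)
# The off-diagonal entry [0][1] is the covariance between x and y
print("Covariance:", cov_matrix[0][1])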


---

### ✅ **10. Calculate Correlation Coefficient**


# Reuses np, x, and y from the covariance example above
correlation = np.corrcoef(x, y)[0][1]
print("Correlation Coefficient:", correlation)


---

### ✅ **11. Create a Scatter Plot**


plt.scatter(x, y)
plt.title("Scatter Plot")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()


---

### ✅ **12. Compare Simple Random and Systematic Sampling**


import numpy as np

data = np.arange(1, 21)

# Simple random sample
simple_random = np.random.choice(data, size=5, replace=False)

# Systematic sampling: pick a random start, then take every k-th element
k = len(data) // 5
start = np.random.randint(k)
systematic_sample = data[start::k]

print("Simple Random Sample:", simple_random)
print("Systematic Sample:", systematic_sample)


---

### ✅ **13. Mean, Median, Mode of Grouped Data**


import pandas as pd

# Class intervals and frequencies
classes = ['0-10', '10-20', '20-30']
midpoints = [5, 15, 25]
frequencies = [5, 10, 5]

df = pd.DataFrame({'Class': classes, 'Midpoint': midpoints, 'Frequency': frequencies})

mean = sum(df['Midpoint'] * df['Frequency']) / sum(df['Frequency'])

print("Grouped Data Mean:", mean)
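
The heading also asks for the median and mode, which the snippet above does not compute. A minimal sketch follows, using the standard interpolation formulas for grouped data and assuming equal class widths. The variable names are illustrative, and the class data are restated so the block runs on its own.

```python
# Same grouped data as above: class lower bounds, common width, frequencies
lower_bounds = [0, 10, 20]
width = 10
frequencies = [5, 10, 5]
n = sum(frequencies)

# Median class: the first class whose cumulative frequency reaches n / 2
cumulative = 0
for i, f in enumerate(frequencies):
    if cumulative + f >= n / 2:
        median_idx, cf_before = i, cumulative
        break
    cumulative += f

# Median = L + ((n/2 - CF) / f) * h
grouped_median = (lower_bounds[median_idx]
                  + (n / 2 - cf_before) / frequencies[median_idx] * width)

# Modal class: the class with the highest frequency
modal_idx = frequencies.index(max(frequencies))
f1 = frequencies[modal_idx]
f0 = frequencies[modal_idx - 1] if modal_idx > 0 else 0
f2 = frequencies[modal_idx + 1] if modal_idx < len(frequencies) - 1 else 0

# Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h
grouped_mode = lower_bounds[modal_idx] + (f1 - f0) / (2 * f1 - f0 - f2) * width

print("Grouped Data Median:", grouped_median)
print("Grouped Data Mode:", grouped_mode)
```

Because this frequency table is symmetric around the 10-20 class, the mean, median, and mode all come out at 15.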


---

### ✅ **14. Simulate Data and Calculate Central Tendency & Dispersion**


import numpy as np
import scipy.stats as stats

data = np.random.normal(loc=50, scale=10, size=1000)

print("Mean:", np.mean(data))
print("Median:", np.median(data))
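# Continuous values almost never repeat, so round to integers before taking the mode
print("Mode (rounded):", stats.mode(np.round(data), keepdims=True)[0][0])
# np.std and np.var default to ddof=0 (population formulas); pass ddof=1 for sample estimates
print("Std Dev:", np.std(data))
print("Variance:", np.var(data))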


---



### ✅ **15. Summarize a Dataset’s Descriptive Statistics (pandas)**


import pandas as pd

# Sample dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 60],
    'Salary': [3000, 4000, 5000, 6000, 6500, 7000, 8000]
}
df = pd.DataFrame(data)

print(df.describe())


---

### ✅ **16. Plot a Boxplot to Understand Spread and Identify Outliers**


import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(data=df, y='Salary')
plt.title("Boxplot of Salary")
plt.show()


---

### ✅ **17. Calculate the Interquartile Range (IQR)**


Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

print("Interquartile Range (IQR):", IQR)
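
The IQR is commonly used to flag outliers with the 1.5 × IQR rule, which is also how boxplot whiskers are usually drawn. The sketch below applies that rule with the standard library. The extra salary of 20000 is a hypothetical value added only so there is something to flag, and `method="inclusive"` is chosen because it interpolates linearly between data points, as pandas' default `quantile` does.

```python
import statistics

# Salaries from the example above plus one hypothetical extreme value
salaries = [3000, 4000, 5000, 6000, 6500, 7000, 8000, 20000]

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in salaries if s < lower_fence or s > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```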


---

### ✅ **18. Implement Z-score Normalization**


from scipy.stats import zscore

# zscore() uses the population standard deviation (ddof=0) by default
df['Salary_Z'] = zscore(df['Salary'])

print(df[['Salary', 'Salary_Z']])


**Significance**: Z-score normalization rescales data to have a mean of 0 and a standard deviation of 1, which makes variables measured on different scales directly comparable.
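
The transformation itself is just `(x - mean) / std`. Here is a hand-rolled sketch on the same salary values, using the population standard deviation, which is the `ddof=0` default that `scipy.stats.zscore` also uses. It confirms that the rescaled values have mean 0 and standard deviation 1.

```python
import statistics

salaries = [3000, 4000, 5000, 6000, 6500, 7000, 8000]

mu = statistics.mean(salaries)
sigma = statistics.pstdev(salaries)  # population standard deviation (ddof=0)

# Each z-score is the distance from the mean, measured in standard deviations
z_scores = [(s - mu) / sigma for s in salaries]

for s, z in zip(salaries, z_scores):
    print(s, round(z, 3))

print("Mean of z-scores:", round(statistics.mean(z_scores), 6))
print("Std dev of z-scores:", round(statistics.pstdev(z_scores), 6))
```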

---

### ✅ **19. Compare Two Datasets Using Standard Deviations**


import numpy as np

group1 = np.random.normal(100, 10, 100)
group2 = np.random.normal(100, 20, 100)

std1 = np.std(group1)
std2 = np.std(group2)

print(f"Group 1 Std Dev: {std1}")
print(f"Group 2 Std Dev: {std2}")


A larger standard deviation means the values are more spread out around the mean; here group 2 should show roughly twice the spread of group 1.

---

### ✅ **20. Visualize Covariance Using a Heatmap**


cov_matrix = df[['Age', 'Salary']].cov()

sns.heatmap(cov_matrix, annot=True, cmap='coolwarm')
plt.title("Covariance Matrix Heatmap")
plt.show()


---

### ✅ **21. Create a Correlation Matrix Using Seaborn**


corr_matrix = df[['Age', 'Salary']].corr()

sns.heatmap(corr_matrix, annot=True, cmap='Blues')
plt.title("Correlation Matrix")
plt.show()


---

### ✅ **22. Generate Dataset and Compute Variance & Std Dev**


data = np.random.normal(loc=50, scale=15, size=1000)

variance = np.var(data)
std_dev = np.std(data)

print(f"Variance: {variance}, Standard Deviation: {std_dev}")


---

### ✅ **23. Visualize Skewness and Kurtosis**


from scipy.stats import skew, kurtosis

sns.histplot(data, bins=30, kde=True)
plt.title("Histogram with KDE - Skewness & Kurtosis")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

print("Skewness:", skew(data))
print("Kurtosis:", kurtosis(data))


---

### ✅ **24. Implement Pearson and Spearman Correlation**


from scipy.stats import pearsonr, spearmanr

# x and y are generated independently, so both coefficients should be close to 0
x = np.random.randint(10, 100, 50)
y = np.random.randint(10, 100, 50)

pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)

print(f"Pearson Correlation: {pearson_corr}")
print(f"Spearman Correlation: {spearman_corr}")


* **Pearson** measures linear correlation.
* **Spearman** measures monotonic relationships (ranks).
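
The difference shows up most clearly on data that is monotonic but not linear. In this sketch (hypothetical data, hand-written helpers; the rank function assumes no ties), `y = x³` always increases with `x`, so Spearman is 1 while Pearson falls short of 1.

```python
def pearson(x, y):
    """Pearson correlation computed from sums of deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """Ranks from 1 to n; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y = [v ** 3 for v in x]  # monotonic but strongly non-linear

print("Pearson:", pearson(x, y))
print("Spearman:", spearman(x, y))
```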

---

