# Assignment: Statistics Basics

## Theory Questions

### Q1. What is statistics, and why is it important?

**Answer:**
Statistics is the science of collecting, organizing, analyzing, and interpreting data to support informed decisions. It matters because it lets us describe patterns, test hypotheses, and make predictions under uncertainty in almost every field, including business, healthcare, government, engineering, and the social sciences.

### Q2. What are the two main types of statistics?

**Answer:**
The two main branches are **descriptive statistics** (summarizing data) and **inferential statistics** (drawing conclusions about a population from a sample).

### Q3. What are descriptive statistics?

**Answer:**
Descriptive statistics are methods for summarizing raw data into meaningful forms—such as tables, charts, averages, and variability measures—so that the main features of a dataset can be quickly understood.

### Q4. What is inferential statistics?

**Answer:**
Inferential statistics use sample data to make generalizations or predictions about a larger population, often through estimation (confidence intervals) and hypothesis testing.
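
As a small illustration of estimation, the sketch below computes a t-based 95% confidence interval for a population mean from one sample. The sample values and the helper name `mean_confidence_interval` are made up for this example, and the approach assumes the data are roughly normal.

```python
import math
import statistics
from scipy import stats

def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) t-based confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample SD (n - 1 in the denominator)
    sem = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value from the t distribution with n - 1 degrees of freedom
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean - t_crit * sem, mean + t_crit * sem

sample = [12, 15, 14, 10, 12, 13, 18]  # illustrative values only
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

The interval is centered on the sample mean and widens as the sample gets smaller or more variable.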

### Q5. What is sampling in statistics?

**Answer:**
Sampling is the process of selecting a subset (sample) from a larger group (population) to estimate characteristics of the whole population.

### Q6. What are the different types of sampling methods?

**Answer:**
Common sampling methods include **simple random sampling, systematic sampling, stratified sampling, cluster sampling, convenience sampling,** and **snowball sampling**.

### Q7. What is the difference between random and non‑random sampling?

**Answer:**
Random sampling selects units so that each has a known, non‑zero chance of selection, reducing bias; non‑random sampling does not rely on chance and may introduce selection bias.
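
The effect of selection bias can be sketched with a hypothetical population whose units are stored in ascending order. Taking the first 50 units (a convenience sample) gives a badly biased estimate of the mean, while a random sample of the same size usually lands near the true value. The population and sample sizes here are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical population stored in ascending order (e.g. a list sorted by income)
population = list(range(1, 1001))

# Random sample: every unit has the same chance of being selected
random_sample = random.sample(population, 50)

# Convenience (non-random) sample: whichever 50 units come first
convenience_sample = population[:50]

print("Population mean: ", statistics.mean(population))
print("Random sample:   ", statistics.mean(random_sample))
print("Convenience mean:", statistics.mean(convenience_sample))
```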

### Q8. Define and give examples of qualitative and quantitative data.

**Answer:**
**Qualitative (categorical) data** describe qualities or categories (e.g., gender, eye color); **quantitative (numerical) data** measure quantities (e.g., height in cm, weight in kg).

### Q9. What are the different types of data in statistics?

**Answer:**
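Data are first divided into **qualitative (categorical)** and **quantitative (numerical)** types. They are further classified by level of measurement: **nominal, ordinal, interval,** and **ratio**.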

### Q10. Explain nominal, ordinal, interval, and ratio levels of measurement.

**Answer:**
**Nominal**: categories with no order (blood type). **Ordinal**: ordered categories (Likert scale). **Interval**: ordered, equal intervals, no true zero (temperature in °C). **Ratio**: ordered, equal intervals, true zero (weight, length).

### Q11. What is the measure of central tendency?

**Answer:**
Measures of central tendency identify the center of a distribution and include the mean, median, and mode.

### Q12. Define mean, median, and mode.

**Answer:**
- **Mean**: arithmetic average.
- **Median**: middle value when data are ordered.
- **Mode**: most frequently occurring value.
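
The three measures react differently to extreme values. The sketch below uses Python's standard `statistics` module on a small made-up dataset, then adds a single outlier to show that the mean shifts a lot while the median and mode barely move.

```python
import statistics

values = [10, 12, 12, 13, 14, 15, 18]   # illustrative data
with_outlier = values + [200]           # same data plus one extreme value

for label, data in (("original", values), ("with outlier", with_outlier)):
    print(
        label,
        "mean =", round(statistics.mean(data), 2),
        "median =", statistics.median(data),
        "mode =", statistics.mode(data),
    )
```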

### Q13. What is the significance of the measure of central tendency?

**Answer:**
Measures of central tendency provide a single representative value that summarizes a dataset, which makes it easier to compare groups and to simplify complex data.

### Q14. What is variance, and how is it calculated?

**Answer:**
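Variance measures the average squared deviation from the mean: \(\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}\) for a population of size \(N\) with mean \(\mu\), or \(s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}\) for a sample of size \(n\) with mean \(\bar{x}\). Dividing by \(n-1\) rather than \(n\) corrects the downward bias that comes from estimating the mean from the same sample.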
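
Both formulas can be checked step by step. The sketch below computes the squared deviations by hand for a small illustrative dataset and compares the results with `np.var`, where `ddof=0` gives the population formula and `ddof=1` gives the sample formula.

```python
import numpy as np

data = [12, 15, 14, 10, 12, 13, 18]  # illustrative values
n = len(data)
mean = sum(data) / n

# Squared deviation of each observation from the mean
squared_deviations = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_deviations) / n        # divide by N
sample_variance = sum(squared_deviations) / (n - 1)      # divide by n - 1

print("Population variance:", population_variance, "NumPy:", np.var(data))
print("Sample variance:    ", sample_variance, "NumPy:", np.var(data, ddof=1))
```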

### Q15. What is standard deviation, and why is it important?

**Answer:**
Standard deviation is the square root of variance; it re‑expresses variability in the original units, making dispersion easier to interpret.

### Q16. Define and explain the term range in statistics.

**Answer:**
Range = Maximum − Minimum; it measures the total spread of the data.

### Q17. What is the difference between variance and standard deviation?

**Answer:**
Variance is the mean squared deviation and is expressed in squared units; standard deviation is its square root and is expressed in the same units as the data, which usually makes it easier to interpret.

### Q18. What is skewness in a dataset?

**Answer:**
Skewness quantifies the asymmetry of a distribution around its mean.

### Q19. What does it mean if a dataset is positively or negatively skewed?

**Answer:**
Positive skew: long tail to the right; negative skew: long tail to the left.
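
A quick numerical check, using a small made-up dataset: mirroring a right-tailed sample flips the sign of its skewness, and in the right-tailed version the mean is pulled above the median. That mean-versus-median pattern is typical of skewed data but not guaranteed for every distribution.

```python
import statistics
from scipy.stats import skew

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10, 25]   # illustrative data with a long right tail
left_tailed = [-x for x in right_tailed]       # mirror image: long left tail

print("Right-tailed skewness:", skew(right_tailed))
print("Left-tailed skewness: ", skew(left_tailed))
print("Right-tailed mean vs median:",
      statistics.mean(right_tailed), statistics.median(right_tailed))
```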

### Q20. Define and explain kurtosis.

**Answer:**
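Kurtosis describes how heavy the tails of a distribution are relative to the normal distribution (it is often, less accurately, described as peakedness): **leptokurtic** (heavier tails, excess kurtosis > 0), **platykurtic** (lighter tails, excess kurtosis < 0), **mesokurtic** (tails like the normal, excess kurtosis ≈ 0).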
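
The three categories can be illustrated by simulation. The sketch below draws large samples from a uniform, a normal, and a Laplace distribution (chosen here as convenient examples of each category) and computes excess kurtosis with `scipy.stats.kurtosis`, which subtracts 3 by default so that the normal distribution scores about 0.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 100_000

uniform_k = kurtosis(rng.uniform(-1, 1, n))   # light tails
normal_k = kurtosis(rng.normal(0, 1, n))      # reference distribution
laplace_k = kurtosis(rng.laplace(0, 1, n))    # heavy tails

print("Uniform (platykurtic):", round(uniform_k, 2))
print("Normal (mesokurtic):  ", round(normal_k, 2))
print("Laplace (leptokurtic):", round(laplace_k, 2))
```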

### Q21. What is the purpose of covariance?

**Answer:**
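Covariance indicates the direction of the linear relationship between two variables: positive when they tend to move together, negative when they move in opposite directions, and near zero when there is no linear relationship. Its magnitude depends on the units of the variables, so it does not by itself show how strong the relationship is.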

### Q22. What does correlation measure in statistics?

**Answer:**
Correlation measures both the strength **and** direction of a linear relationship on a standardized scale (−1 to +1).

### Q23. What is the difference between covariance and correlation?

**Answer:**
Covariance is unstandardized and scale‑dependent; correlation standardizes covariance, making comparisons across datasets possible.
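
Scale dependence is easy to demonstrate. In the sketch below (simulated, hypothetical height and weight data), converting height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged. It also checks that correlation equals covariance divided by the product of the two standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: heights in metres and loosely related weights in kg
height_m = rng.normal(1.70, 0.10, 200)
weight_kg = 60 + 50 * (height_m - 1.70) + rng.normal(0, 5, 200)
height_cm = height_m * 100  # same heights, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

# Correlation is covariance divided by the product of the sample SDs
corr_from_cov = cov_m / (np.std(height_m, ddof=1) * np.std(weight_kg, ddof=1))

print("Covariance (m, cm): ", cov_m, cov_cm)
print("Correlation (m, cm):", corr_m, corr_cm)
print("Correlation from covariance:", corr_from_cov)
```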

### Q24. What are some real‑world applications of statistics?

**Answer:**
Statistics underpin quality control, clinical trials, market research, risk management, sports analytics, and public‑policy decision‑making.

## Practical Questions

### Practical Q1. How do you calculate the mean, median, and mode of a dataset?

In [None]:
import numpy as np
import scipy.stats as stats
data = [12, 15, 14, 10, 12, 13, 18]
mean = np.mean(data)
median = np.median(data)
# keepdims requires SciPy 1.9 or later; .mode is the most frequent value (the smallest one on ties)
mode = stats.mode(data, keepdims=False).mode
print(mean, median, mode)

### Practical Q2. Write a Python program to compute the variance and standard deviation of a dataset.

In [None]:
import numpy as np
data = np.array([12, 15, 14, 10, 12, 13, 18])
variance = np.var(data, ddof=1)
std = np.std(data, ddof=1)
print(variance, std)

### Practical Q3. Create a dataset and classify it into nominal, ordinal, interval, and ratio types.

In [None]:
# Example classifications
dataset = {
    'blood_type': ['A', 'B', 'O'],  # nominal
    'satisfaction': [1, 3, 5],        # ordinal
    'temperature': [36.5, 37.0, 38.2],# interval
    'weight': [60, 72, 80]            # ratio
}
print(dataset)

### Practical Q4. Implement sampling techniques like random sampling and stratified sampling.

In [None]:
import numpy as np, pandas as pd
pop = np.arange(1000)
# Simple random sample
sample_random = np.random.choice(pop, size=50, replace=False)
# Stratified sampling: draw 5 units at random from each of the 10 strata
df = pd.DataFrame({'value': pop, 'group': pop % 10})
strata = df.groupby('group').sample(n=5)
print(sample_random[:5])
print(strata.head())

### Practical Q5. Write a Python function to calculate the range of a dataset.

In [None]:
def data_range(arr):
    return max(arr) - min(arr)
print(data_range([4, 8, 15, 16, 23, 42]))

### Practical Q6. Create a dataset and plot its histogram to visualize skewness.

In [None]:
import matplotlib.pyplot as plt, numpy as np
np.random.seed(0)
data = np.random.exponential(scale=2, size=1000)
plt.hist(data, bins=30)
plt.title('Histogram')
plt.show()

### Practical Q7. Calculate skewness and kurtosis of a dataset using Python libraries.

In [None]:
from scipy.stats import skew, kurtosis
import numpy as np
np.random.seed(0)
data = np.random.normal(size=1000)
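print('Skew:', skew(data))  # close to 0 for a normal sample
# kurtosis() returns excess kurtosis by default (fisher=True), so a normal sample gives about 0
print('Excess kurtosis:', kurtosis(data))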

### Practical Q8. Generate a dataset and demonstrate positive and negative skewness.

In [None]:
import numpy as np, matplotlib.pyplot as plt
neg_skew = np.random.beta(5, 1, 1000)
pos_skew = np.random.beta(1, 5, 1000)
plt.hist(neg_skew, bins=30, alpha=0.7, label='Negative')
plt.hist(pos_skew, bins=30, alpha=0.7, label='Positive')
plt.legend()
plt.show()

### Practical Q9. Write a Python script to calculate covariance between two datasets.

In [None]:
import numpy as np
x = np.random.randn(100)
y = np.random.randn(100)
cov = np.cov(x, y)[0,1]
print('Covariance:', cov)

### Practical Q10. Write a Python script to calculate the correlation coefficient between two datasets.

In [None]:
import numpy as np
x = np.random.randn(100)
y = np.random.randn(100)
coef = np.corrcoef(x, y)[0,1]
print('Correlation:', coef)

### Practical Q11. Create a scatter plot to visualize the relationship between two variables.

In [None]:
import matplotlib.pyplot as plt, numpy as np
x = np.random.randn(100)
y = 2*x + np.random.randn(100)
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

### Practical Q12. Implement and compare simple random sampling and systematic sampling.

In [None]:
import numpy as np
population = np.arange(1, 101)
# Simple Random Sampling
srs = np.random.choice(population, size=10, replace=False)
# Systematic Sampling: pick a random start, then take every k-th unit
k = len(population)//10
start = np.random.randint(0, k)
systematic = population[start::k][:10]
print('SRS:', srs)
print('Systematic:', systematic)

### Practical Q13. Calculate the mean, median, and mode of grouped data.

In [None]:
import pandas as pd
bins = [0, 10, 20, 30]
labels = ['0-10', '10-20', '20-30']
data = [5, 12, 17, 8, 24, 29, 15, 2]
cut = pd.cut(data, bins=bins, labels=labels)
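# observed=True keeps only the class intervals that actually contain data
grouped = pd.DataFrame({'data': data, 'group': cut}).groupby('group', observed=True)['data']
# Mean, median, and mode of the raw values that fall in each class interval
print(grouped.mean())
print(grouped.median())
print(grouped.apply(lambda x: x.mode().iat[0]))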
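
The cell above summarizes the raw values inside each class interval. Another common reading of "grouped data" is a frequency table in which only the class intervals and their counts are known. A minimal sketch of the usual textbook formulas for that case follows; the class boundaries and frequencies are made up for illustration, and equal class widths are assumed.

```python
# Hypothetical frequency table with equal-width classes 0-10, 10-20, 20-30, 30-40
lower = [0, 10, 20, 30]   # lower boundary of each class
freq = [5, 8, 12, 5]      # number of observations in each class
width = 10

n = sum(freq)

# Mean: weight each class midpoint by its frequency
midpoints = [lo + width / 2 for lo in lower]
grouped_mean = sum(f * m for f, m in zip(freq, midpoints)) / n

# Median: L + ((n/2 - CF) / f) * h, where CF is the cumulative frequency
# before the median class (the first class whose cumulative count reaches n/2)
cum = 0
for i, f in enumerate(freq):
    if cum + f >= n / 2:
        break
    cum += f
grouped_median = lower[i] + (n / 2 - cum) / freq[i] * width

# Mode: L + (f1 - f0) / (2*f1 - f0 - f2) * h, using the modal class and its neighbours
j = freq.index(max(freq))
f_prev = freq[j - 1] if j > 0 else 0
f_next = freq[j + 1] if j < len(freq) - 1 else 0
grouped_mode = lower[j] + (freq[j] - f_prev) / (2 * freq[j] - f_prev - f_next) * width

print(grouped_mean, grouped_median, grouped_mode)
```

These are estimates: they assume observations are spread evenly within each class, so they will generally differ slightly from the statistics of the raw data.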

### Practical Q14. Simulate data using Python and calculate its central tendency and dispersion.

In [None]:
import numpy as np
sim = np.random.normal(loc=50, scale=10, size=1000)
print('Mean:', np.mean(sim), 'Std:', np.std(sim, ddof=1))

### Practical Q15. Use NumPy or pandas to summarize a dataset’s descriptive statistics.

In [None]:
import seaborn as sns
# load_dataset returns a pandas DataFrame (it downloads the data on first use)
df = sns.load_dataset('iris')
print(df.describe())

### Practical Q16. Plot a boxplot to understand the spread and identify outliers.

In [None]:
import seaborn as sns, matplotlib.pyplot as plt
df = sns.load_dataset('iris')
sns.boxplot(x='species', y='sepal_length', data=df)
plt.show()

### Practical Q17. Calculate the interquartile range (IQR) of a dataset.

In [None]:
import numpy as np
q75, q25 = np.percentile(np.random.randn(100), [75 ,25])
print('IQR:', q75 - q25)
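
The IQR is also the basis of the common 1.5 × IQR rule for flagging outliers, which is the default whisker rule in matplotlib and seaborn boxplots. A minimal sketch on a small made-up dataset, where the value 120 is deliberately extreme:

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42, 120])  # illustrative values

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```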

### Practical Q18. Implement Z‑score normalization and explain its significance.

In [None]:
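import numpy as np
from scipy import stats
x = np.random.randn(100)
# z = (x - mean) / SD; stats.zscore uses the population SD (ddof=0) by default
z = stats.zscore(x)
print('First 5 Z-scores:', z[:5])
# Significance: z-scores have mean 0 and SD 1, so variables measured in different
# units become directly comparable, and large |z| values flag unusual observations
print('Mean of z:', np.mean(z), 'SD of z:', np.std(z))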
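
The normalization itself is just \(z = (x - \mu)/\sigma\). The sketch below applies the formula by hand to a small set of hypothetical exam scores and confirms that it matches `stats.zscore`.

```python
import numpy as np
from scipy import stats

scores = np.array([55.0, 60.0, 65.0, 70.0, 100.0])  # hypothetical exam scores

# Subtract the mean and divide by the population SD (ddof=0, scipy's default)
z_manual = (scores - scores.mean()) / scores.std()
z_scipy = stats.zscore(scores)

print("Manual:", z_manual)
print("SciPy: ", z_scipy)
```

The last score stands out with the largest z-score, showing how the transformed values express distance from the mean in units of standard deviation.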

### Practical Q19. Compare two datasets using their standard deviations.

In [None]:
import numpy as np
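# Two datasets with the same mean but different spread
set1 = np.random.normal(loc=0, scale=1, size=100)
set2 = np.random.normal(loc=0, scale=3, size=100)
sd1, sd2 = np.std(set1, ddof=1), np.std(set2, ddof=1)
print('SD1:', sd1, 'SD2:', sd2)
print('More variable dataset:', 'set1' if sd1 > sd2 else 'set2')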

### Practical Q20. Write a Python program to visualize covariance using a heatmap.

In [None]:
import seaborn as sns, numpy as np, matplotlib.pyplot as plt
data = np.random.randn(100, 2)
ax = sns.heatmap(np.cov(data, rowvar=False), annot=True)
plt.show()

### Practical Q21. Use seaborn to create a correlation matrix for a dataset.

In [None]:
import seaborn as sns, matplotlib.pyplot as plt
df = sns.load_dataset('penguins').dropna()
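# Keep only numeric columns; corr() raises an error on string columns in pandas 2.0+
sns.heatmap(df.select_dtypes('number').corr(), annot=True)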
plt.show()

### Practical Q22. Generate a dataset and implement both variance and standard deviation computations.

In [None]:
import numpy as np
x = np.random.randn(100)
print('Variance:', np.var(x, ddof=1), 'SD:', np.std(x, ddof=1))

### Practical Q23. Visualize skewness and kurtosis using Python libraries like matplotlib or seaborn.

In [None]:
import seaborn as sns, matplotlib.pyplot as plt, numpy as np
data = np.random.exponential(size=1000)
sns.histplot(data, kde=True)
plt.show()

### Practical Q24. Implement the Pearson and Spearman correlation coefficients for a dataset.

In [None]:
import scipy.stats as stats, numpy as np
x = np.random.randn(50)
y = np.random.randn(50)
print('Pearson:', stats.pearsonr(x, y)[0])
print('Spearman:', stats.spearmanr(x, y)[0])