# Statistics Assignment

---


## Question 1
**What is the difference between descriptive statistics and inferential statistics? Explain with examples.**

### Short definitions
- **Descriptive statistics:** methods for summarizing and describing the features of a dataset you have in hand. Typical summaries: mean, median, mode, standard deviation, variance, frequency tables, histograms, boxplots.
- **Inferential statistics:** methods that use a sample of data to make estimates, predictions, or decisions about a larger population. They explicitly account for uncertainty (e.g., using confidence intervals, hypothesis tests, p-values).

### Key differences (pointwise)
1. **Goal:** Descriptive = *summarize data*; Inferential = *generalize to population / test hypotheses*.
2. **Data used:** Descriptive uses the full dataset or sample; Inferential treats the sample as a subset and uses probability to infer properties of the population.
3. **Measures vs. Uncertainty:** Descriptive reports statistics; Inferential reports estimates **with** uncertainty (e.g., CI) and p-values.

### Examples (exam-style)
- *Descriptive:* "The average score of 50 students in the class is **74.2** with a standard deviation of **6.1**." This simply summarizes the observed class.
- *Inferential:* "From a random sample of 200 voters, 52% support candidate A. The 95% confidence interval is (45%, 59%) — therefore we estimate the population support lies in this range with 95% confidence." This uses the sample to infer about the whole population.
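
The interval quoted in the inferential example can be reproduced with a short calculation. This is a minimal sketch that assumes the usual normal-approximation (Wald) interval for a proportion and the rounded critical value `z = 1.96`; neither choice is stated in the example itself.

```python
import math

n = 200        # sample size from the voter example
p_hat = 0.52   # observed sample proportion
z = 1.96       # assumed: approximate standard-normal critical value for 95%

# Standard error of a sample proportion under the normal approximation
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * se
ci_low, ci_high = p_hat - margin, p_hat + margin

print(f"Point estimate: {p_hat:.1%}")
print(f"Approximate 95% CI: ({ci_low:.1%}, {ci_high:.1%})")
```

Rounded to whole percentages, the bounds agree with the (45%, 59%) interval in the example.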

### Short tip to score marks
- Always state the **goal**, give **one numeric example** for each, and mention **uncertainty** when describing inferential methods.

---


## Question 2
**What is sampling in statistics? Explain the differences between random and stratified sampling.**

### Sampling (definition)
Sampling is selecting a subset (sample) from a larger population to observe and measure, with the aim of learning about the population while saving time or resources.

### Random sampling (simple random sample)
- **Definition:** every member of the population has an equal chance of being selected, and every possible sample of the chosen size is equally likely.
- **Advantages:** unbiased selection (if done correctly), simple to analyze.
- **Disadvantages:** may underrepresent small subgroups by chance.
- **Example:** choose 100 students by randomly selecting student IDs from the full class list using a random number generator.

### Stratified sampling
- **Definition:** divide the population into homogeneous subgroups (strata) based on an important characteristic (e.g., age, gender), then take a random sample from each stratum (often proportional to stratum size).
- **Advantages:** ensures representation of key subgroups and can reduce sampling variance.
- **Disadvantages:** requires knowledge of strata and can be more complex to implement.
- **Example:** to survey 1,000 people in a city where 30% are aged 18–29, 40% are 30–49 and 30% are 50+, draw samples from each age group proportional to these percentages.

### When to use which
- Use **random sampling** when the population is relatively homogeneous or when simplicity is critical.
- Use **stratified sampling** when there are known subgroups that should be represented (to improve accuracy and fairness).
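
The contrast between the two designs can be sketched in a few lines. The population below is hypothetical (10,000 synthetic records tagged with the 30% / 40% / 30% age split from the stratified example), and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: (age_group, person_id) pairs in a 30/40/30 split
population = (
    [("18-29", i) for i in range(3000)]
    + [("30-49", i) for i in range(4000)]
    + [("50+", i) for i in range(3000)]
)
sample_size = 1000

# Simple random sample: group counts vary from draw to draw
srs = random.sample(population, sample_size)
srs_counts = Counter(group for group, _ in srs)

# Proportional stratified sample: sample randomly within each stratum
strata = {}
for group, person_id in population:
    strata.setdefault(group, []).append((group, person_id))

stratified = []
for group, members in strata.items():
    k = round(sample_size * len(members) / len(population))
    stratified.extend(random.sample(members, k))
strat_counts = Counter(group for group, _ in stratified)

print("Simple random:", dict(srs_counts))
print("Stratified:   ", dict(strat_counts))
```

The stratified counts match the population proportions by construction, while the simple random counts only match them on average.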

---


## Question 3
**Define mean, median, and mode. Explain why these measures of central tendency are important.**

### Definitions
- **Mean (arithmetic mean):** \(\bar{x} = \dfrac{1}{n} \sum_{i=1}^n x_i\). The sum of all values divided by count.
- **Median:** the middle value after sorting the data. If n is even, the median is the average of the two middle values.
- **Mode:** the value(s) that appear most frequently in the dataset.

### When to use which (exam-worthy bullets)
- **Mean** — best for symmetric distributions and when you want the mathematical average; sensitive to outliers.
- **Median** — robust to outliers and skewed data; preferred for income, property prices, or highly skewed distributions.
- **Mode** — useful for categorical data or to identify the most common value(s).

### Short example
Given data: [2, 3, 3, 4, 100]
- Mean = 22.4 (pulled up by outlier 100)
- Median = 3 (better central measure here)
- Mode = 3

Mentioning the **sensitivity to outliers** and picking the appropriate measure will help maximize marks.
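
The effect of the outlier in the short example can be checked with Python's standard `statistics` module. This is a small illustrative sketch; the "without 100" list is introduced here only for comparison.

```python
import statistics

data = [2, 3, 3, 4, 100]
without_outlier = [x for x in data if x != 100]  # comparison list, not in the question

for label, values in [("with 100", data), ("without 100", without_outlier)]:
    print(
        f"{label}:",
        "mean =", statistics.mean(values),
        "| median =", statistics.median(values),
        "| mode =", statistics.mode(values),
    )
```

Removing the single extreme value moves the mean a long way, while the median and mode stay at 3.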

---


## Question 4
**Explain skewness and kurtosis. What does a positive skew imply about the data?**

### Skewness
- **Definition:** measure of asymmetry of a probability distribution.
  - **Positive skew (right skew):** tail extends to the right; typically mean > median. Common in income and house price distributions.
  - **Negative skew (left skew):** tail extends to the left; typically mean < median.
- **Interpretation tip:** look at histogram and compare mean vs median.

### Kurtosis
- **Definition:** measure of "tailedness" (how heavy the tails are compared to a normal distribution).
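  - **High kurtosis (leptokurtic):** heavy tails, so extreme values are more likely than under a normal distribution; often (but not necessarily) accompanied by a sharper peak.
  - **Low kurtosis (platykurtic):** light tails, so extreme values are less likely; often accompanied by a flatter peak.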
- **Note:** many software packages report *excess kurtosis* (kurtosis minus 3) so that a normal distribution has excess kurtosis 0.

### What positive skew implies
- Most observations are concentrated on the left (lower values) with a few large values on the right.
- Mean is pulled to the right of the median.
- Example practical implication: when data are positively skewed, report the median for a representative central value.
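
Both measures can be computed directly from central moments. The sketch below assumes the simple population (biased) moment formulas, which may differ slightly from the bias-corrected versions some packages report; the helper name and both datasets are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (population/biased form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3  # subtract 3 so a normal distribution gives 0
    return skew, excess_kurt

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]  # hypothetical: one large value on the right
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # hypothetical: evenly spread values

for name, values in [("right_skewed", right_skewed), ("symmetric", symmetric)]:
    skew, kurt = skewness_and_excess_kurtosis(values)
    print(f"{name}: skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}")
```

The right-skewed list gives a positive skewness, and the evenly spread list gives zero skewness with negative excess kurtosis (lighter tails than a normal distribution).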

---


## Question 5
**Implement a Python program to compute the mean, median, and mode of a given list of numbers.**

**Given:**
```python
numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]
```

### Steps to show in answer (marks-maximizing):
1. Show the formula for the mean and compute it by summation.
2. Explain how to find the median (sort and pick middle).
3. Explain mode — show frequency table and pick most frequent value(s).

Below is clear, commented Python code that computes and prints each value and displays a small frequency table.


```python
# Question 5 - Detailed computation (code)
from collections import Counter

numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]

# 1) Mean by formula: sum of values divided by count
total = sum(numbers)
n = len(numbers)
mean_manual = total / n

# 2) Median (sort and take the middle value, or average the two middle values)
sorted_nums = sorted(numbers)
if n % 2 == 1:
    median_manual = sorted_nums[n // 2]
else:
    median_manual = (sorted_nums[n // 2 - 1] + sorted_nums[n // 2]) / 2

# 3) Mode (frequency table; keep every value that reaches the top count)
freq = Counter(numbers)
max_freq = max(freq.values())
modes = sorted(val for val, count in freq.items() if count == max_freq)

# Print results clearly
print('Numbers:', numbers)
print('Sorted:', sorted_nums)
print('\nMean (manual) = {:.4f}'.format(mean_manual))
print('Median (manual) =', median_manual)
print('Mode(s) =', modes)
print('\nFrequency table:')
for val, count in sorted(freq.items()):
    print(f'  Value {val}: {count} time(s)')
```


## Question 6
**Compute the covariance and correlation coefficient between the following two datasets:**

```python
list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]
```

### Marks-maximizing explanation to include:
- Show formula for sample covariance: \(s_{xy} = \dfrac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\).
- Show formula for Pearson correlation: \(r = \dfrac{s_{xy}}{s_x s_y}\) where \(s_x\) and \(s_y\) are sample standard deviations.
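
As a worked check of the two formulas on the given lists (hand arithmetic, rounded to four decimal places):

\[
\bar{x} = \frac{150}{5} = 30, \qquad \bar{y} = \frac{180}{5} = 36
\]

\[
\sum_{i=1}^{5} (x_i - \bar{x})(y_i - \bar{y}) = (-20)(-21) + (-10)(-11) + (0)(-1) + (10)(9) + (20)(24) = 1100
\]

\[
s_{xy} = \frac{1100}{5 - 1} = 275
\]

\[
s_x = \sqrt{\frac{1000}{4}} = \sqrt{250} \approx 15.8114, \qquad s_y = \sqrt{\frac{1220}{4}} = \sqrt{305} \approx 17.4642
\]

\[
r = \frac{s_{xy}}{s_x s_y} = \frac{275}{\sqrt{250 \cdot 305}} \approx 0.9959
\]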

Below is commented code that computes both (manual calculation and NumPy check).


```python
# Question 6 - covariance and correlation (code)
from statistics import mean, stdev

import numpy as np

list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]

n = len(list_x)
mean_x = mean(list_x)
mean_y = mean(list_y)

# Manual sample covariance (divide by n - 1)
cov_manual = sum((x - mean_x) * (y - mean_y) for x, y in zip(list_x, list_y)) / (n - 1)

# Manual Pearson correlation using sample standard deviations
sx = stdev(list_x)
sy = stdev(list_y)
cor_manual = cov_manual / (sx * sy)

# NumPy checks
cov_matrix = np.cov(list_x, list_y)  # normalizes by n - 1 by default
cov_np = cov_matrix[0, 1]
corr_np = np.corrcoef(list_x, list_y)[0, 1]

print('list_x =', list_x)
print('list_y =', list_y)
print('\nMean_x =', mean_x, 'Mean_y =', mean_y)
print('Sample covariance (manual) =', cov_manual)
print('Sample covariance (numpy)   =', cov_np)
print('Sample standard dev x =', sx, 'y =', sy)
print('Pearson r (manual) =', cor_manual)
print('Pearson r (numpy)  =', corr_np)
```


## Question 7
**Write a Python script to draw a boxplot for the following list and identify its outliers. Explain the result.**

```python
data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]
```

### Steps to show for full marks:
1. Compute Q1 (25th percentile) and Q3 (75th percentile) and IQR = Q3 - Q1.
2. Outlier rule (common): values < Q1 - 1.5*IQR or > Q3 + 1.5*IQR are outliers.
3. Plot a boxplot (label axes, add title).

The code below prints the bounds and shows the boxplot with outliers.


```python
# Question 7 - boxplot and outliers
import numpy as np
import matplotlib.pyplot as plt

data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

# Quartiles and the 1.5 * IQR outlier bounds
arr = np.array(data)
q1 = np.percentile(arr, 25)
q3 = np.percentile(arr, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = arr[(arr < lower_bound) | (arr > upper_bound)].tolist()

print('Data:', data)
print('Q1 =', q1, 'Q3 =', q3, 'IQR =', iqr)
print('Lower bound =', lower_bound, 'Upper bound =', upper_bound)
print('Outliers detected =', outliers)

# Plot boxplot with labels (vertical orientation is the default)
plt.figure(figsize=(6, 4))
plt.boxplot(data, patch_artist=True)
plt.title('Boxplot of data with detected outliers')
plt.ylabel('Value')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
```
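
When explaining the result, it may help to note that the outlier verdict depends on how the quartiles are computed. The sketch below uses the standard library's `statistics.quantiles` with its two built-in methods. As far as I know, the `inclusive` method corresponds to NumPy's default linear interpolation, so treat that equivalence as an assumption. The helper `iqr_outliers` is a hypothetical name introduced for illustration.

```python
import statistics

data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

def iqr_outliers(values, method):
    """Apply the 1.5 * IQR rule using the given statistics.quantiles method."""
    q1, _, q3 = statistics.quantiles(values, n=4, method=method)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [x for x in values if x < lower or x > upper]
    return q1, q3, lower, upper, flagged

for method in ("inclusive", "exclusive"):
    q1, q3, lower, upper, flagged = iqr_outliers(data, method)
    print(f"{method}: Q1={q1}, Q3={q3}, bounds=({lower}, {upper}), outliers={flagged}")
```

With the inclusive method the value 35 lies above the upper bound and is flagged, while the exclusive method widens the IQR enough that nothing is flagged. Stating which quartile convention was used therefore makes the explanation more defensible.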


## Question 8
**You are a data analyst. The marketing team wants to know if there is a relationship between advertising spend and daily sales.**

### (a) How to use covariance and correlation
- **Covariance:** indicates the **direction** of linear relationship. Positive covariance → both variables increase together. However, covariance depends on units and scale, so its magnitude is hard to interpret.
- **Correlation (Pearson):** the covariance standardized by dividing it by the product of the two standard deviations, so it always lies between −1 and +1. It shows both **strength** and direction. Values near +1 indicate a strong positive linear relationship.
- **Caveat:** correlation does **not imply causation** — confounders or time-lagged effects may exist. If you want to test causation, consider experiments or causal methods.
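
The point about units can be illustrated with the lists from part (b). This sketch rescales advertising spend by a factor of 100 (an assumed change of units, e.g. to a smaller currency unit) and compares the two measures; the helper functions are hypothetical names introduced here.

```python
from statistics import mean, stdev

def sample_cov(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Pearson correlation from sample covariance and sample std devs."""
    return sample_cov(xs, ys) / (stdev(xs) * stdev(ys))

advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

# Assumed unit change: express the same spend in units 100 times smaller
spend_rescaled = [100 * s for s in advertising_spend]

cov_original = sample_cov(advertising_spend, daily_sales)
cov_rescaled = sample_cov(spend_rescaled, daily_sales)
r_original = pearson_r(advertising_spend, daily_sales)
r_rescaled = pearson_r(spend_rescaled, daily_sales)

print("Covariance:", cov_original, "->", cov_rescaled)
print("Correlation:", round(r_original, 4), "->", round(r_rescaled, 4))
```

The covariance grows by the same factor of 100, while the correlation is unchanged, which is why correlation is the measure to report for strength.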

### (b) Compute correlation using Python
Below is commented code that computes covariance and Pearson correlation for the provided lists and shows interpretation.


```python
# Question 8 - covariance and correlation for advertising vs sales
from statistics import mean, stdev

advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

n = len(advertising_spend)

mean_ad = mean(advertising_spend)
mean_sales = mean(daily_sales)

# Sample covariance (divide by n - 1)
cov = sum(
    (ad - mean_ad) * (sale - mean_sales)
    for ad, sale in zip(advertising_spend, daily_sales)
) / (n - 1)
# Pearson correlation
corr = cov / (stdev(advertising_spend) * stdev(daily_sales))

print('Advertising spend:', advertising_spend)
print('Daily sales:', daily_sales)
print('\nMean ad =', mean_ad, 'Mean sales =', mean_sales)
print('Sample covariance =', cov)
print('Pearson correlation (r) =', corr)

# Interpretation (the 0.8 and 0.5 cut-offs are rule-of-thumb thresholds)
if abs(corr) > 0.8:
    interp = 'strong'
elif abs(corr) > 0.5:
    interp = 'moderate'
else:
    interp = 'weak'
print('\nInterpretation: r = {:.4f} indicates a {} linear relationship.'.format(corr, interp))
print('Note: correlation does not imply causation; consider experiments or controlled studies to test causality.')
```


## Question 9
**Customer satisfaction survey (scale 1–10). Explain which summary statistics and visualizations you'd use, and plot a histogram for:**

```python
survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]
```

### Summary statistics & visualizations to include (for full marks):
- **Mean & Median:** central tendency.
- **Mode:** most common response (often important for discrete survey scales).
- **Standard deviation / variance:** dispersion of responses.
- **Percentiles (25th, 75th) and IQR:** to describe spread robustly.
- **Histogram:** visualize the distribution across 1–10 scale.
- **Boxplot:** check for outliers and skewness.

Below is code with computations and a clear histogram using Matplotlib.


```python
# Question 9 - survey stats and histogram
import statistics as stats

import matplotlib.pyplot as plt
import numpy as np  # needed for np.percentile below

survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

mean_s = stats.mean(survey_scores)
median_s = stats.median(survey_scores)
modes_s = stats.multimode(survey_scores)
stdev_s = stats.stdev(survey_scores)
percentiles = np.percentile(survey_scores, [25, 50, 75])

print('Survey scores:', survey_scores)
print('\nMean =', mean_s)
print('Median =', median_s)
print('Mode(s) =', modes_s)
print('Std dev =', round(stdev_s, 4))
print('25th, 50th, 75th percentiles =', percentiles)

# Histogram: one bin per integer score from 1 to 10
plt.figure(figsize=(7, 4))
plt.hist(survey_scores, bins=range(1, 12), edgecolor='black')
plt.title('Histogram of Customer Satisfaction Scores (1-10)')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.xticks(range(1, 11))
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

# Boxplot to complement
plt.figure(figsize=(4, 3))
plt.boxplot(survey_scores, vert=False)
plt.title('Boxplot of Survey Scores')
plt.xlabel('Score')
plt.show()
```


