# Statistics Basics — Assignment (Completed)

**Student:** Khadija

**Assignment Code:** DS-AG-005

This notebook contains detailed answers to Questions 1–9 and runnable Python code for Questions 5–9.


## Question 1 — Difference between Descriptive and Inferential Statistics

**Descriptive Statistics** summarize and describe the features of a dataset. They report what was observed in the data at hand without generalizing beyond it. Common descriptive measures include:
- Measures of central tendency: mean, median, mode.
- Measures of dispersion: range, variance, standard deviation, IQR (interquartile range).
- Visualization: histograms, boxplots, bar charts, scatterplots.

**Example:** If you have exam scores of 100 students, descriptive statistics tell you the average score, the spread (standard deviation), and show the distribution via a histogram.

**Inferential Statistics** use sample data to make inferences about a larger population. They quantify uncertainty using probability and draw conclusions through techniques such as hypothesis testing, confidence intervals, and regression analysis.

**Example:** From the scores of a randomly chosen sample of 100 students, you estimate the mean score of all 5000 students and provide a confidence interval. You might test whether a new teaching method changed the population mean using a t-test.

**Key difference:** Descriptive = *describe* the observed data. Inferential = *generalize* from sample to population with uncertainty estimates.
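
The contrast can be illustrated with a minimal sketch. The population below is simulated (hypothetical data, not real exam scores), and the interval uses a normal approximation rather than an exact t-interval, so treat the numbers as illustrative only.

```python
import random
import statistics as stats

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 5000 simulated exam scores (illustrative data only)
population = [random.gauss(70, 10) for _ in range(5000)]

# Descriptive: summarize the 100 scores that were actually observed
sample = random.sample(population, 100)
sample_mean = stats.mean(sample)
sample_sd = stats.stdev(sample)  # sample standard deviation (divides by n - 1)

# Inferential: approximate 95% confidence interval for the population mean,
# using the normal approximation (a t-interval would be slightly wider)
z = stats.NormalDist().inv_cdf(0.975)  # roughly 1.96
margin = z * sample_sd / len(sample) ** 0.5
ci_low, ci_high = sample_mean - margin, sample_mean + margin

print(f"Sample mean = {sample_mean:.2f}, sample SD = {sample_sd:.2f}")
print(f"Approx. 95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first two printed values are descriptive (they only describe the 100 sampled scores); the interval is inferential because it makes a statement about all 5000 scores.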



## Question 2 — What is Sampling? Random vs Stratified Sampling

**Sampling** is the process of selecting a subset (sample) from a larger population to estimate characteristics of the whole population. Good sampling methods aim to produce samples that are representative so that inferences are valid.

### Random Sampling (Simple Random Sampling)
- Every member of the population has an equal chance of being selected.
- Typically implemented by indexing the population and selecting indices with uniform probability, or using random number generators.
- Pros: Simple to understand, unbiased if implemented correctly.
- Cons: May produce unbalanced samples for small subgroups (rare categories might be under-represented).

**Example:** Randomly pick 200 students from the university student list using a random number generator.

### Stratified Sampling
- Population is divided into homogeneous subgroups (strata) based on known characteristics (e.g., gender, age group, region).
- Samples are drawn from each stratum (often proportionally to the stratum size).
- Pros: Ensures representation across strata, reduces variance of estimates when strata are internally homogeneous.
- Cons: Requires knowing each member's stratum in advance and is more complex to implement.

**Example:** To estimate average income, split the population by region and sample from each region in proportion to its size, so that every region is represented.
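
The two sampling schemes can be sketched with the standard library. The population, region names, and stratum sizes below are hypothetical, chosen only so that the proportions are easy to read.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1000 students tagged by region (illustrative only)
population = (
    [("North", i) for i in range(600)]
    + [("South", i) for i in range(300)]
    + [("East", i) for i in range(100)]
)
sample_size = 100

# Simple random sampling: every student has the same chance of selection
srs = random.sample(population, sample_size)

# Proportional stratified sampling: group by region, then sample each
# region in proportion to its share of the population
strata = {}
for region, student_id in population:
    strata.setdefault(region, []).append((region, student_id))

stratified = []
for region, members in strata.items():
    k = round(sample_size * len(members) / len(population))
    stratified.extend(random.sample(members, k))

srs_counts = Counter(region for region, _ in srs)
stratified_counts = Counter(region for region, _ in stratified)
print("Simple random:", dict(srs_counts))
print("Stratified:   ", dict(stratified_counts))
```

The stratified counts are fixed at 60/30/10 by construction, while the simple random counts vary from run to run. With stratum sizes that do not divide evenly, the rounded per-stratum counts may not add up exactly to `sample_size`, so a real implementation would need to handle the remainder.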



## Question 3 — Mean, Median, Mode (Definitions & Importance)

- **Mean (Arithmetic mean):** Sum of all observations divided by the number of observations.  
  Formula: \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\).

- **Median:** The middle value when the data are ordered. If `n` is even, median is the average of the two middle values.
  Median is a robust measure of center for skewed distributions or when outliers are present.

- **Mode:** The most frequently occurring value(s) in the dataset. A distribution can be unimodal, bimodal, or multimodal.

**Importance:** These measures describe the 'central tendency' of data:
- Mean is useful for further mathematical operations (e.g., variance) but sensitive to outliers.
- Median gives a better central location when distribution is skewed or has outliers.
- Mode is meaningful for categorical data or to identify the most common value in a dataset.
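
A small sketch of the outlier sensitivity described above, using hypothetical salary figures (in thousands) that are not part of the assignment data:

```python
import statistics as stats

# Hypothetical salaries in thousands (illustrative data only)
salaries = [30, 32, 35, 36, 38]
with_outlier = salaries + [500]  # one extreme value added

mean_before, median_before = stats.mean(salaries), stats.median(salaries)
mean_after, median_after = stats.mean(with_outlier), stats.median(with_outlier)

print(f"Without outlier: mean={mean_before}, median={median_before}")
print(f"With outlier:    mean={mean_after:.2f}, median={median_after}")
```

A single extreme value moves the mean from 34.2 to about 111.83, while the median only shifts from 35 to 35.5.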



## Question 4 — Skewness and Kurtosis

- **Skewness** measures the asymmetry of the probability distribution of a real-valued random variable.
  - **Positive skew (right skew):** longer tail on the right side. Most values are concentrated on the left; mean > median typically.
  - **Negative skew (left skew):** longer tail on the left side; mean < median typically.

- **Kurtosis** measures the 'tailedness' of the distribution — how heavy or light the tails are compared to a normal distribution.
  - **High kurtosis (leptokurtic):** heavier tails, more extreme outliers.
  - **Low kurtosis (platykurtic):** lighter tails, fewer extreme outliers.
  - **Mesokurtic:** kurtosis close to that of a normal distribution (kurtosis of 3, i.e. excess kurtosis of 0).

**What a positive skew implies:** The data have a long right tail (some large values). The mean is pulled toward larger values compared to the median. For example, income data frequently show positive skew due to a small number of high earners.
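
Both measures can be computed from central moments. The sketch below uses the plain moment-based (population) definitions, which is one of several conventions, and the income figures are hypothetical.

```python
import statistics as stats

def central_moment(values, k):
    """k-th central moment, dividing by n (population convention)."""
    mean = stats.fmean(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
# Hypothetical incomes in thousands with a few high earners (illustrative only)
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 90, 150]

print(f"Symmetric data: skew={skewness(symmetric):.3f}, "
      f"excess kurtosis={excess_kurtosis(symmetric):.3f}")
print(f"Income data:    skew={skewness(incomes):.3f}, "
      f"mean={stats.fmean(incomes):.1f}, median={stats.median(incomes)}")
```

The symmetric list has zero skew and negative excess kurtosis (light tails), while the income list has positive skew and a mean (47.1) well above its median (31.0), matching the description above.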


## Question 5 — Compute mean, median, and mode

**Task:** Implement a Python program to compute the mean, median, and mode of the list.

**List:** `numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]`

Below is the code cell (runnable).

In [None]:

# Question 5 — Compute mean, median, and mode for the list
numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]

import statistics as stats

mean_val = stats.mean(numbers)
median_val = stats.median(numbers)
# stats.mode returns only one value when several are tied (and raised StatisticsError
# before Python 3.8); multimode (Python 3.8+) returns every mode as a list
modes = stats.multimode(numbers)

mean_val, median_val, modes


## Question 6 — Covariance and Correlation

Compute covariance and Pearson correlation coefficient between:

`list_x = [10,20,30,40,50]`

`list_y = [15,25,35,45,60]`

Below is the runnable code cell.

In [None]:

# Question 6 — Compute covariance and correlation coefficient between two lists
import numpy as np

list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]

# Convert to numpy arrays
x = np.array(list_x, dtype=float)
y = np.array(list_y, dtype=float)

# Sample covariance divides by n - 1 (ddof=1, NumPy's default); population covariance divides by n (ddof=0).
# The sample version is used here.
cov_matrix = np.cov(x, y, ddof=1)
cov_xy = cov_matrix[0,1]

# Pearson correlation coefficient
corr_matrix = np.corrcoef(x, y)
corr_xy = corr_matrix[0,1]

cov_xy, corr_xy, cov_matrix, corr_matrix


## Question 7 — Boxplot and Outliers

`data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]`

This cell computes quartiles, IQR, identifies outliers, and draws a boxplot.

In [None]:

# Question 7 — Boxplot and outliers identification
import matplotlib.pyplot as plt
import numpy as np

data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

# Compute quartiles and IQR to identify outliers
arr = np.array(sorted(data))
q1 = np.percentile(arr, 25)
q3 = np.percentile(arr, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Convert to plain Python ints so the printed list is easy to read
outliers = [int(x) for x in arr if (x < lower_bound) or (x > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Lower bound for outliers: {lower_bound}")
print(f"Upper bound for outliers: {upper_bound}")
print("Outliers:", outliers)

# Plot boxplot
plt.figure(figsize=(6,3))
plt.boxplot(arr, vert=False)
plt.title('Boxplot — Question 7 Data')
plt.xlabel('Value')
plt.show()



## Question 8 — Relationship between Advertising Spend and Daily Sales

**How to use covariance and correlation:**

- **Covariance** gives the direction of the linear relationship between two variables. A positive covariance means that when one variable increases, the other tends to increase; a negative covariance means they move in opposite directions. However, covariance is scale-dependent (its magnitude depends on units) so it's hard to interpret alone.
- **Correlation (Pearson's r)** standardizes covariance to a dimensionless value between -1 and 1. It indicates both direction and strength of linear association: values near +1 or -1 imply strong linear association, values near 0 imply weak/no linear association.
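
The scale-dependence of covariance and the scale-invariance of Pearson's r can be checked directly. This is a from-scratch sketch: the helper names are hypothetical, and the numbers reuse the Question 6 lists, with the "dollars" and "cents" units assumed purely for illustration.

```python
import math
import statistics as stats

def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    mx, my = stats.fmean(x), stats.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the sample SDs."""
    return sample_covariance(x, y) / (stats.stdev(x) * stats.stdev(y))

# Same quantity expressed in two units (illustrative assumption)
spend_dollars = [10, 20, 30, 40, 50]
spend_cents = [100 * v for v in spend_dollars]
sales = [15, 25, 35, 45, 60]

cov_dollars = sample_covariance(spend_dollars, sales)
cov_cents = sample_covariance(spend_cents, sales)
r_dollars = pearson_r(spend_dollars, sales)
r_cents = pearson_r(spend_cents, sales)

print(f"Covariance:  dollars={cov_dollars}, cents={cov_cents}")
print(f"Correlation: dollars={r_dollars:.4f}, cents={r_cents:.4f}")
```

Changing the unit multiplies the covariance by 100 (275 becomes 27500) but leaves the correlation at roughly 0.996, which is why correlation is the easier of the two to interpret.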

**In practice for the e-commerce example:**
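- Compute covariance to check direction (a positive value is expected if higher advertising spend goes with higher sales).
- Compute Pearson correlation to measure strength and test statistical significance (e.g., compute a p-value using `scipy.stats.pearsonr`). With only five observations the p-value is a rough guide at best, and a strong correlation does not by itself show that advertising causes the sales increase.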

Below is code to compute the covariance and correlation for the provided lists.


In [None]:

# Question 8 — Compute correlation for advertising_spend and daily_sales
import numpy as np
from scipy import stats

advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

# Convert to arrays
a = np.array(advertising_spend, dtype=float)
s = np.array(daily_sales, dtype=float)

# Covariance (sample)
cov = np.cov(a, s, ddof=1)[0,1]

# Pearson correlation coefficient and p-value
pearson_r, p_value = stats.pearsonr(a, s)

cov, pearson_r, p_value



## Question 9 — Customer Satisfaction Distribution & Visualization

**Which summary statistics and visualizations to use:**

1. **Summary statistics**
   - Mean: average satisfaction.
   - Median: middle score (robust to skew/outliers).
   - Mode: most common score(s).
   - Standard deviation: measure of spread.
   - Minimum / Maximum and IQR: to see range and spread.

2. **Visualizations**
   - **Histogram**: shows the distribution and frequency of scores (bins). Good for spotting skewness and modes.
   - **Boxplot**: highlights median, IQR, and outliers.
   - **Bar chart of counts**: for discrete scales (1–10), show frequency of each score.

Below is code to create a histogram using Matplotlib for the provided survey scores.


In [None]:

# Question 9 — Histogram for survey scores
import matplotlib.pyplot as plt
import statistics as stats

survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

mean_score = stats.mean(survey_scores)
median_score = stats.median(survey_scores)
mode_scores = stats.multimode(survey_scores)
pop_stdev = stats.pstdev(survey_scores)    # population SD (divides by n)
sample_stdev = stats.stdev(survey_scores)  # sample SD (divides by n - 1)

print(f"Mean={mean_score:.3f}, Median={median_score}, Mode(s)={mode_scores}, "
      f"Sample SD={sample_stdev:.3f}, Population SD={pop_stdev:.3f}")

plt.figure(figsize=(6,4))
# Bin edges at 0.5, 1.5, ..., 10.5 so each bar is centered on its integer score
plt.hist(survey_scores, bins=[b - 0.5 for b in range(1, 12)], edgecolor='black')
plt.title('Histogram — Customer Satisfaction Scores (1-10)')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.xticks(range(1,11))
plt.grid(axis='y', alpha=0.2)
plt.show()
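
The remaining summaries listed above (minimum/maximum, IQR, and counts per score) can be sketched with the standard library. The `inclusive` quantile method is chosen here as an assumption so that the quartiles line up with NumPy's default linear interpolation used in Question 7; other methods give slightly different values.

```python
import statistics as stats
from collections import Counter

survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

# Frequency of each score on the 1-10 scale (0 for scores nobody gave)
counts = Counter(survey_scores)
freq_table = {score: counts.get(score, 0) for score in range(1, 11)}

# Quartiles with the 'inclusive' method (linear interpolation over the data range)
q1, q2, q3 = stats.quantiles(survey_scores, n=4, method="inclusive")
iqr = q3 - q1

print(f"Min={min(survey_scores)}, Max={max(survey_scores)}")
print(f"Q1={q1}, Median={q2}, Q3={q3}, IQR={iqr}")

# Text version of the bar chart of counts
for score, freq in freq_table.items():
    print(f"{score:>2}: {'#' * freq}")
```

For these scores the quartiles are 6.5, 7.0, and 8.5, giving an IQR of 2.0, and 7 is the most common score with four responses.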
