# Statistics Basics — Assignment Solutions

## Question 1: Descriptive vs. Inferential Statistics

Descriptive statistics summarize and organize data you already have; they describe the sample. Typical tools include measures of central tendency (mean, median, mode), dispersion (range, variance, standard deviation), and visuals (histograms, boxplots). For example, reporting the average daily sales last month and their standard deviation is descriptive.

Inferential statistics use a sample to make generalizations about a broader population, accounting for sampling variability. Common tools include confidence intervals, hypothesis tests, and regression modeling. For example, using a random sample of customers to test whether a new checkout flow increases conversion for all customers is inferential.
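
The contrast can be sketched in a few lines. The sales figures below are hypothetical, invented for illustration, and the interval uses the normal critical value 1.96 as a simplifying assumption; with a sample this small, a t critical value would give a somewhat wider interval.

```python
import math
from statistics import mean, stdev

# Hypothetical daily sales figures (illustrative only, not assignment data)
daily_sales = [220, 245, 275, 320, 400, 310, 290, 260]

# Descriptive: summarize the sample itself
sample_mean = mean(daily_sales)
sample_sd = stdev(daily_sales)  # sample standard deviation (n - 1 denominator)

# Inferential: approximate 95% confidence interval for the population mean
# (assumes a normal approximation; 1.96 is the normal critical value)
n = len(daily_sales)
standard_error = sample_sd / math.sqrt(n)
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print("mean:", sample_mean, "sd:", round(sample_sd, 2))
print("approx. 95% CI:", (round(ci_low, 2), round(ci_high, 2)))
```

The first two numbers describe only these eight days; the interval is a statement about the unobserved long-run mean, which is what makes it inferential.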

## Question 2: Sampling; Random vs. Stratified

Sampling is the process of selecting a subset of individuals or observations from a larger population to draw conclusions when measuring the entire population is impractical.

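Random sampling: in a simple random sample, every unit in the population has the same chance of selection and every possible sample of a given size is equally likely. More generally, a probability sample gives each unit a known, non‑zero chance of selection. Random selection guards against selection bias and supports valid inference.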

Stratified sampling: the population is divided into non‑overlapping subgroups (strata) that are internally homogeneous, such as region or age group, and a random sample is drawn from each stratum, often in proportion to its size. This guarantees representation of key subgroups and, when the strata differ on the variable of interest, yields more precise estimates (lower variance) than a simple random sample of the same size.
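
A minimal sketch of the two schemes, using only the standard library. The population, the region labels, and the helper `stratified_sample` are all hypothetical, made up for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 60 customers in "north", 40 in "south"
population = [("north", i) for i in range(60)] + [("south", i) for i in range(40)]

# Simple random sample: every unit has the same chance of selection
srs = random.sample(population, 10)

def stratified_sample(units, key, total):
    """Proportional stratified sample: draw randomly within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        # Allocate draws in proportion to stratum size
        k = round(total * len(members) / len(units))
        sample.extend(random.sample(members, k))
    return sample

strat = stratified_sample(population, key=lambda u: u[0], total=10)
north_count = sum(1 for region, _ in strat if region == "north")
print("north units in stratified sample:", north_count)
```

The stratified sample always contains 6 north and 4 south units, whereas the regional split of the simple random sample varies from draw to draw. With stratum sizes that do not divide evenly, the rounded allocations may not sum exactly to `total`, so a real implementation would need a rule for the remainder.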

## Question 3: Mean, Median, Mode and Importance

Mean: the arithmetic average of the values. Median: the middle value when the data are sorted (or the average of the two middle values when the count is even). Mode: the most frequent value or values; a dataset can have more than one.

Why they matter: these measures summarize the ‘center’ of a distribution, each with different robustness. The mean uses all values but is sensitive to outliers; the median resists outliers/skew; the mode highlights the most typical category or value, useful for discrete data. Using them together provides a fuller picture of central tendency.
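
The difference in robustness is easy to see by adding a single extreme value. The salary figures below are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean, median, multimode

# Hypothetical salaries in thousands (illustrative values)
salaries = [30, 32, 32, 35, 38]
with_outlier = salaries + [250]  # one extreme value added

for label, values in (("original", salaries), ("with outlier", with_outlier)):
    print(label, mean(values), median(values), multimode(values))
```

The outlier roughly doubles the mean, moves the median only slightly, and leaves the mode unchanged.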

## Question 4: Skewness and Kurtosis; Positive Skew

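Skewness measures the asymmetry of a distribution. Positive (right) skew means a longer right tail, and the mean typically exceeds the median. Negative (left) skew means a longer left tail, and the mean typically falls below the median. Kurtosis measures tail heaviness relative to a normal distribution, which has kurtosis 3 (excess kurtosis 0); it is often loosely described as peakedness. High kurtosis indicates heavier tails and a greater propensity for outliers; low kurtosis indicates lighter tails.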

A positive skew implies that most observations are concentrated at lower values with a few large values stretching the right tail.
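
One common way to compute both quantities is as standardized moments: the average of the z-scores raised to the third power (skewness) or fourth power (kurtosis). The sketch below uses the population standard deviation, which gives the simple uncorrected estimators; the helper name `standardized_moment` and both datasets are assumptions made for illustration.

```python
from statistics import fmean, median, pstdev

def standardized_moment(values, order):
    """Mean of z-scores raised to `order` (population standard deviation)."""
    mu = fmean(values)
    sigma = pstdev(values)
    return fmean([((v - mu) / sigma) ** order for v in values])

def skewness(values):
    return standardized_moment(values, 3)

def excess_kurtosis(values):
    # Subtract 3 so that a normal distribution scores 0
    return standardized_moment(values, 4) - 3.0

# Illustrative data: a few large values stretch the right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]
symmetric = [1, 2, 3, 4, 5]

print("right-skewed:", skewness(right_skewed), fmean(right_skewed), median(right_skewed))
print("symmetric:", skewness(symmetric), excess_kurtosis(symmetric))
```

For the right-skewed data the skewness is positive and the mean (4.7) sits above the median (3), matching the description above. The symmetric data has skewness of essentially zero and negative excess kurtosis, reflecting tails lighter than a normal distribution.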

## Question 5: Mean, Median, Mode

In [None]:
from statistics import mean, median, multimode

numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]
mean_value = mean(numbers)
median_value = median(numbers)
modes = multimode(numbers)

mean_value, median_value, modes


## Question 6: Covariance and Correlation

In [None]:
import numpy as np

list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]

x = np.array(list_x, dtype=float)
y = np.array(list_y, dtype=float)

# Sample covariance
cov_xy = np.cov(x, y, ddof=1)[0, 1]
# Pearson correlation
corr_xy = np.corrcoef(x, y)[0, 1]

cov_xy, corr_xy


## Question 7: Boxplot and Outlier Detection

In [None]:
import numpy as np
import matplotlib.pyplot as plt

data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]
d = np.array(data, dtype=float)

q1 = np.percentile(d, 25)
q2 = np.percentile(d, 50)
q3 = np.percentile(d, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in d if v < lower_fence or v > upper_fence]

plt.figure()
plt.boxplot(d, vert=True, showmeans=True)
plt.title("Boxplot of Data")
plt.ylabel("Value")
plt.show()

q1, q2, q3, iqr, lower_fence, upper_fence, outliers


## Question 8: Exploring Ad Spend vs Sales

To explore the relationship between advertising spend and daily sales, first compute the covariance to see whether the variables move together (positive for same‑direction movement, negative for opposite). Then compute the Pearson correlation, which normalizes covariance by the standard deviations to give a scale‑free measure in [−1, 1]. Inspect a scatter plot and, if appropriate, fit a simple regression to quantify the expected sales change per unit of spend.

In [None]:
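import numpy as np

advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

ads = np.array(advertising_spend, dtype=float)
sales = np.array(daily_sales, dtype=float)

# Sample covariance: positive when spend and sales rise together
cov = np.cov(ads, sales, ddof=1)[0, 1]
# Pearson correlation: covariance rescaled to the range [-1, 1]
corr = np.corrcoef(ads, sales)[0, 1]

cov, corr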
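
To make the normalization and the regression step concrete, the same quantities can be written out in plain Python. This is a sketch of the textbook formulas for ordinary least squares with one predictor, not a replacement for a statistics library; the variable names are chosen for illustration.

```python
from math import sqrt

advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

n = len(advertising_spend)
mean_x = sum(advertising_spend) / n
mean_y = sum(daily_sales) / n

# Sums of squared deviations and of cross-products
sxx = sum((x - mean_x) ** 2 for x in advertising_spend)
syy = sum((y - mean_y) ** 2 for y in daily_sales)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(advertising_spend, daily_sales))

# Sample covariance and standard deviations (n - 1 denominator)
cov_xy = sxy / (n - 1)
sd_x = sqrt(sxx / (n - 1))
sd_y = sqrt(syy / (n - 1))

# Pearson correlation: covariance divided by both standard deviations
corr_xy = cov_xy / (sd_x * sd_y)

# Least-squares line: sales is approximately intercept + slope * spend
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print("covariance:", cov_xy)
print("correlation:", round(corr_xy, 4))
print("slope:", round(slope, 3), "intercept:", round(intercept, 1))
```

The slope estimates the expected change in daily sales per additional unit of advertising spend. With only five observations, both the correlation and the fitted line should be treated as exploratory.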


## Question 9: Survey Scores — Summary & Histogram

Start with summary statistics: mean and median for central tendency; standard deviation and interquartile range for spread; minimum/maximum to see range; and possibly skewness/kurtosis. Visualize with a histogram to view the shape and with a boxplot to spot outliers and compare groups. These together reveal clustering, spread, asymmetry, and unusual values before modeling or decisions.

In [None]:
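import numpy as np
import matplotlib.pyplot as plt

survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]
scores = np.array(survey_scores, dtype=float)

# Central tendency
mean_score = np.mean(scores)
median_score = np.median(scores)

# Spread: sample standard deviation, interquartile range, and extremes
std_score = np.std(scores, ddof=1)
q1_score, q3_score = np.percentile(scores, [25, 75])
iqr_score = q3_score - q1_score
min_score = np.min(scores)
max_score = np.max(scores)

# Histogram to view the shape of the distribution
plt.figure()
plt.hist(scores, bins='auto', edgecolor='black')
plt.title("Histogram of Survey Scores")
plt.xlabel("Score (1-10)")
plt.ylabel("Frequency")
plt.show()

mean_score, median_score, std_score, iqr_score, min_score, max_score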
