# Statistics Basics — Assignment (Module 01)

**Student:** Suraj Vishwakarma

This notebook contains the 9 questions from the assignment with answers and runnable Python code where applicable. It is Colab-ready.

### Question 1: What is the difference between descriptive statistics and inferential statistics? Explain with examples.

**Answer:**

Descriptive and inferential statistics are two main branches of statistical analysis, each serving a different purpose in data analysis.

**Descriptive Statistics:**

Descriptive statistics summarize and describe the essential features of a data set. They present raw data in a meaningful way through numerical calculations, graphs, and tables. These statistics help us understand patterns within the data without making generalizations or predictions beyond the data at hand.

**Common tools:** Measures of central tendency (mean, median, mode), measures of dispersion (range, variance, standard deviation), data visualizations (bar charts, histograms, pie charts).

**Example:** A teacher calculates the average score of her 30 students in a math test. Mean = 75; Highest = 98; Lowest = 45 — this is a descriptive summary.

**Inferential Statistics:**

Inferential statistics are used to make predictions or generalizations about a population based on data collected from a sample. They rely on probability theory to draw conclusions and test hypotheses.

**Common tools:** Hypothesis testing, confidence intervals, regression analysis, chi-square tests, ANOVA.

**Example:** A researcher surveys 200 people to estimate the average income in a city of 500,000 people. Based on this sample the researcher estimates the average income is ₹25,000 per month — this is an inference to the population.

**Comparison (brief):**
- Purpose: descriptive = describe observed data; inferential = infer about a larger population.
- Data scope: descriptive uses the dataset available; inferential uses a sample to generalize.

**Conclusion:** Descriptive statistics summarize what we have; inferential statistics help make educated guesses about what we do not have (the full population).
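
The contrast can be sketched in a few lines of Python. The income values below are hypothetical (they are not assignment data), and the interval uses a normal approximation; for a sample this small a t-based interval would usually be preferred, so treat this as an illustration only.

```python
import math
import statistics

# Hypothetical sample of monthly incomes in rupees (illustrative, not real data)
sample_incomes = [18000, 22000, 25000, 21000, 30000,
                  27000, 24000, 26000, 23000, 34000]

# Descriptive: summarize the sample itself
sample_mean = statistics.mean(sample_incomes)
sample_sd = statistics.stdev(sample_incomes)
print("Sample mean:", sample_mean)
print("Min:", min(sample_incomes), "Max:", max(sample_incomes))

# Inferential: approximate 95% confidence interval for the population mean
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
margin = z * sample_sd / math.sqrt(len(sample_incomes))
ci_low, ci_high = sample_mean - margin, sample_mean + margin
print(f"Approx. 95% CI for the population mean: ({ci_low:.0f}, {ci_high:.0f})")
```

The first two prints only describe the ten observed values; the interval is a statement about the unobserved population and carries uncertainty.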

### Question 2: What is sampling in statistics? Explain the differences between random and stratified sampling.

**Answer:**

**Sampling** is the process of selecting a subset (sample) from a larger group (population) to make statistical inferences. It helps make data collection manageable and cost-effective.

**Random (Simple Random) Sampling:**
- Each member of the population has an equal chance of being selected.
- Reduces selection bias because no member is favoured; works best when the population is fairly homogeneous.
- Example: picking 50 students at random from a college list using random numbers.

**Stratified Sampling:**
- The population is divided into subgroups (strata) based on shared characteristics (e.g., gender, age). A sample is taken from each stratum (proportionally or equally).
- Ensures representation of subgroups and often gives more precise estimates when the population is heterogeneous.
- Example: If a company has 60 men and 40 women and we want a sample of 20, pick 12 men and 8 women proportionally.

**Comparison:**
- Random sampling is simpler; stratified helps ensure subgroup representation and can reduce variance when groups differ.

**Conclusion:** Choose random sampling when the population is relatively uniform; choose stratified sampling when important subgroups must be represented.
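
The company example above (60 men, 40 women, sample of 20) can be sketched as follows. The `population` list, the group labels `"M"`/`"W"`, and the helper `count_by_group` are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical employee list matching the example: 60 men and 40 women
population = [("M", i) for i in range(60)] + [("W", i) for i in range(40)]
sample_size = 20

# Simple random sampling: every employee has the same chance of selection
simple_sample = random.sample(population, sample_size)

# Proportional stratified sampling: sample each stratum in proportion to its size
strata = {
    "M": [p for p in population if p[0] == "M"],
    "W": [p for p in population if p[0] == "W"],
}
stratified_sample = []
for members in strata.values():
    k = round(sample_size * len(members) / len(population))
    stratified_sample.extend(random.sample(members, k))

def count_by_group(sample):
    """Count how many sampled members fall in each group."""
    return {g: sum(1 for p in sample if p[0] == g) for g in ("M", "W")}

print("Simple random:", count_by_group(simple_sample))
print("Stratified:   ", count_by_group(stratified_sample))
```

The stratified sample always contains 12 men and 8 women, while the simple random sample's split varies from draw to draw. With other stratum sizes, rounding `k` may not add up exactly to `sample_size`, so a real implementation would need to handle the remainder.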

### Question 3: Define mean, median, and mode. Explain why these measures of central tendency are important.

**Answer:**

**Mean (Average):** Sum of all values divided by the number of observations.
- Example: For marks 70,75,80,85,90 → mean = 80.

**Median:** The middle value when the data are sorted. If there is an even number of observations, the median is the average of the two middle values.
- Example: 15,20,25,30,35 → median = 25. For 10,20,30,40 → median = (20+30)/2 = 25.

**Mode:** The value that appears most frequently. A dataset can be unimodal, bimodal, multimodal, or have no mode.
- Example: 5,8,8,10,12 → mode = 8.

**Importance:**
- Mean gives a general average used widely (income, temperature, marks).
- Median is robust to outliers and skewed distributions (useful for income data).
- Mode identifies the most common value (useful for popular product size, category).

Using these three measures together helps summarize different aspects of a dataset's central tendency depending on data shape and presence of outliers.
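
As a quick check, the worked examples above can be reproduced with Python's `statistics` module. The final list, where 90 is replaced by 500, is a hypothetical outlier added only to show why the median is called robust.

```python
import statistics

mean_marks = statistics.mean([70, 75, 80, 85, 90])
median_odd = statistics.median([15, 20, 25, 30, 35])
median_even = statistics.median([10, 20, 30, 40])
mode_value = statistics.mode([5, 8, 8, 10, 12])

print("Mean:", mean_marks)
print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
print("Mode:", mode_value)

# Hypothetical outlier: replace 90 with 500 and compare mean vs. median
with_outlier = [70, 75, 80, 85, 500]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)
print("With outlier -> mean:", mean_outlier, "median:", median_outlier)
```

The single extreme value moves the mean from 80 to 162, while the median stays at 80.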

### Question 4: Explain skewness and kurtosis. What does a positive skew imply about the data?

**Answer:**

**Skewness:** Measures asymmetry of a distribution.
- Symmetric (skewness ≈ 0): for a unimodal distribution, mean ≈ median ≈ mode.
- Positive skew (right-skewed): long right tail, typically mean > median > mode; indicates some unusually large values.
- Negative skew (left-skewed): long left tail, typically mean < median < mode.

**Positive skew implication:** Most of the data are concentrated on the left, and a few large values pull the mean to the right. Example: income distributions commonly show positive skew.

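**Kurtosis:** Measures the 'tailedness' of a distribution, i.e. how much weight lies in the tails compared with a normal distribution. It is often loosely described as peakedness, but it mainly reflects tail behaviour.
- Mesokurtic: similar to normal (reference; kurtosis = 3, excess kurtosis = 0).
- Leptokurtic: fat tails, often with a sharper peak (more extreme values/outliers; excess kurtosis > 0).
- Platykurtic: thinner tails, often with a flatter peak (fewer extremes; excess kurtosis < 0).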

**Use:** Skewness tells about symmetry; kurtosis indicates likelihood of outliers. Both are useful in risk assessment and understanding distributional shape.
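
A minimal sketch of both measures, computed from central moments (the population, or "biased", form). The helper names and the two small datasets are assumptions for illustration; statistical libraries may apply small-sample corrections and report slightly different values.

```python
import statistics

def central_moment(values, k):
    """k-th central moment: average of (value - mean) ** k."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical data: one large value creates a long right tail
right_skewed = [20, 22, 23, 25, 25, 27, 30, 32, 40, 95]
symmetric = [1, 2, 3, 4, 5]

print("Right-skewed: skew =", round(skewness(right_skewed), 3),
      "excess kurtosis =", round(excess_kurtosis(right_skewed), 3))
print("Symmetric:    skew =", round(skewness(symmetric), 3),
      "excess kurtosis =", round(excess_kurtosis(symmetric), 3))
print("Mean vs. median (right-skewed):",
      statistics.mean(right_skewed), statistics.median(right_skewed))
```

For the right-skewed list the skewness is positive and the mean exceeds the median, in line with the rule of thumb above; the symmetric list has zero skewness and negative excess kurtosis (platykurtic).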

### Question 5: Implement a Python program to compute the mean, median, and mode of a given list of numbers.

**numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]**

Below is a runnable Python cell that computes mean, median, and mode and prints results.

In [None]:
# Question 5: compute mean, median, and mode
import statistics

numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]

mean_value = statistics.mean(numbers)
median_value = statistics.median(numbers)
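# In Python 3.8+, statistics.mode returns the first mode encountered when several
# values tie (earlier versions raised StatisticsError). Here 12, 19 and 24 each
# appear three times, so the result is 12.
mode_value = statistics.mode(numbers)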

print("Numbers:", numbers)
print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
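
Because this list has more than one most-frequent value, a single mode hides part of the picture. One possible way to report every mode is `statistics.multimode` (available from Python 3.8), sketched here together with a `Counter` to show the tied frequencies.

```python
import statistics
from collections import Counter

numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]

# multimode returns all most-frequent values, in order of first appearance
modes = statistics.multimode(numbers)
counts = Counter(numbers)

print("All modes:", modes)
print("Frequency of each mode:", {m: counts[m] for m in modes})
```

The data are therefore multimodal, with 12, 19 and 24 each occurring three times.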

### Question 6: Compute the covariance and correlation coefficient between two datasets.

**list_x = [10, 20, 30, 40, 50]**
**list_y = [15, 25, 35, 45, 60]**

The cell below computes covariance (sample) and Pearson correlation.

In [None]:
import numpy as np

list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]

x = np.array(list_x)
y = np.array(list_y)

# Sample covariance: bias=False (the default) normalizes by N - 1
cov_matrix = np.cov(x, y, bias=False)
covariance = cov_matrix[0,1]

# Pearson correlation coefficient
correlation = np.corrcoef(x, y)[0,1]

print("X:", list_x)
print("Y:", list_y)
print("Covariance (sample):", covariance)
print("Correlation coefficient (Pearson r):", correlation)
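
To make the definitions explicit, the same quantities can be computed by hand from deviations about the means. This is a sketch of the textbook formulas in plain Python, not a replacement for the NumPy calls; the intermediate names (`sum_xy`, `sum_xx`, `sum_yy`) are illustrative.

```python
import math

list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]
n = len(list_x)

mean_x = sum(list_x) / n
mean_y = sum(list_y) / n

# Sums of products and squares of deviations from the means
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(list_x, list_y))
sum_xx = sum((x - mean_x) ** 2 for x in list_x)
sum_yy = sum((y - mean_y) ** 2 for y in list_y)

covariance = sum_xy / (n - 1)                      # sample covariance
correlation = sum_xy / math.sqrt(sum_xx * sum_yy)  # Pearson r

print("Covariance (sample):", covariance)
print("Correlation coefficient (Pearson r):", round(correlation, 4))
```

The sample covariance is 275.0 and r is about 0.9959, a strong positive linear relationship.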

### Question 7: Draw a boxplot for the numeric list and identify outliers.

**data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]**

The cell will draw a boxplot and compute IQR-based outliers.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

# Horizontal boxplot; by default matplotlib draws points beyond 1.5 * IQR from the box as fliers
plt.figure(figsize=(8,2.5))
plt.boxplot(data, vert=False)
plt.title("Boxplot of data")
plt.xlabel("Values")
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.show()

# Compute IQR and outliers (np.percentile interpolates linearly by default)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(f"Q1: {q1}")
print(f"Q3: {q3}")
print(f"IQR: {iqr}")
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
print('Outliers:', outliers)
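
One caveat worth illustrating: quartiles can be defined in several ways, and the choice can change which points the 1.5 × IQR rule flags. The sketch below uses `statistics.quantiles` from the standard library; its `"inclusive"` method gives the same quartiles as NumPy's default linear interpolation for this data, while `"exclusive"` is a different convention. The helper `iqr_outliers` is a hypothetical name.

```python
import statistics

data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

def iqr_outliers(values, method):
    """Return (Q1, Q3, outliers) using the 1.5 * IQR rule for a quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4, method=method)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return q1, q3, [v for v in values if v < lower or v > upper]

for method in ("inclusive", "exclusive"):
    q1, q3, outliers = iqr_outliers(data, method)
    print(f"{method:>9}: Q1={q1}, Q3={q3}, outliers={outliers}")
```

With the inclusive method (Q1 = 17.25, Q3 = 23.25) the value 35 is flagged; with the exclusive method (Q1 = 15.75, Q3 = 23.75) the upper bound becomes 35.75 and nothing is flagged. Borderline points like 35 should therefore be reported together with the quartile method used.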

### Question 8: Relationship between advertising spend and daily sales. Compute covariance and correlation.

**advertising_spend = [200, 250, 300, 400, 500]**
**daily_sales = [2200, 2450, 2750, 3200, 4000]**

In [None]:
import numpy as np

advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

x = np.array(advertising_spend)
y = np.array(daily_sales)

cov_matrix = np.cov(x, y, bias=False)
covariance = cov_matrix[0,1]
correlation = np.corrcoef(x, y)[0,1]

print("Advertising spend:", advertising_spend)
print("Daily sales:", daily_sales)
print("Covariance (sample):", covariance)
print("Correlation coefficient (Pearson r):", correlation)

### Question 9: Customer satisfaction survey distribution. Show summary statistics and histogram.

**survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import statistics as stats

survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

# Summary statistics
mean_val = stats.mean(survey_scores)
median_val = stats.median(survey_scores)
mode_val = stats.mode(survey_scores)
stdev_val = stats.stdev(survey_scores)
min_val = min(survey_scores)
max_val = max(survey_scores)

print("Survey scores:", survey_scores)
print("Mean:", mean_val)
print("Median:", median_val)
print("Mode:", mode_val)
print("Standard deviation:", round(stdev_val, 3))
print("Min:", min_val, "Max:", max_val)

# Histogram with one bin per integer score: bin edges at 3.5, 4.5, ..., 10.5
# so each bar is centred on a score from 4 to 10
plt.figure(figsize=(6,4))
plt.hist(survey_scores, bins=np.arange(3.5, 11.5, 1), edgecolor='black')
plt.title("Customer Satisfaction Survey Scores")
plt.xlabel("Satisfaction Score (1 to 10)")
plt.ylabel("Number of Customers")
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
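
Since the scores are integers, the histogram is effectively a frequency table. A small text-only sketch with `collections.Counter` gives the same information without plotting; the `#` bar format is just an illustrative choice.

```python
from collections import Counter

survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

counts = Counter(survey_scores)

# One row per score between the observed minimum and maximum
for score in range(min(survey_scores), max(survey_scores) + 1):
    print(f"{score:>2}: {'#' * counts[score]} ({counts[score]})")

most_common_score, frequency = counts.most_common(1)[0]
print("Most common score:", most_common_score, "with", frequency, "responses")
```

The table shows that 7 is the most frequent score (4 of 15 responses) and that most responses fall between 6 and 9.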

---

**Notes:**
- This notebook uses standard Python libraries available in Google Colab (`numpy`, `matplotlib`, `statistics`).
- Run the notebook in Google Colab. Every code cell is executable and produces its printed output or plot when run.
