# Statistics Basics Assignment



---
## Question 1: Descriptive vs. Inferential Statistics
What is the difference between descriptive statistics and inferential statistics? Explain with examples.

### Answer:
The primary difference lies in their purpose:

1.  **Descriptive Statistics**: Summarizes and describes the characteristics of the dataset at hand. It answers **what the data shows**.
2.  **Inferential Statistics**: Uses a sample of data to make predictions, draw conclusions, or generalize about a larger **population**, with a stated level of uncertainty. It answers **what the data implies beyond the sample**.

### Explanation and Examples

| Feature | Descriptive Statistics | Inferential Statistics |
| :--- | :--- | :--- |
| **Purpose** | To organize, summarize, and present data. | To make generalizations, predictions, or inferences about a population from a sample. |
| **Tools** | Mean, Median, Mode, Standard Deviation, Histograms, Box Plots. | Hypothesis Testing, Confidence Intervals, Regression Analysis. |
| **Example** | Calculating the **average score** $(\bar{x})$ of all students in a single class on a final exam. | Surveying a **random sample of 1,000 voters** and using the results to **predict** the winner of a national presidential election. |
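
The contrast in the table can be sketched in a few lines of Python. The exam scores and the poll result (520 of 1,000 respondents favoring one candidate) are made-up values for illustration, and the interval uses the normal approximation for a proportion, which is one common choice rather than the only one.

```python
import math
import statistics

# Descriptive: summarize the data we actually have (hypothetical class scores).
class_scores = [72, 85, 90, 66, 78, 88, 95, 70]
class_mean = statistics.mean(class_scores)

# Inferential: use a sample to say something about a larger population.
# Hypothetical poll: 520 of 1,000 sampled voters favor candidate A.
n = 1000
p_hat = 520 / n
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * standard_error  # approximate 95% confidence level
ci_low, ci_high = p_hat - margin, p_hat + margin

print(f"Class mean (descriptive): {class_mean}")
print(f"Estimated support (inferential): {p_hat:.3f} "
      f"(95% CI {ci_low:.3f} to {ci_high:.3f})")
```

The class mean is a plain summary of known data, while the poll estimate comes with an interval because it speaks about voters who were never surveyed. In this hypothetical case the interval includes 0.5, so the poll alone would not settle who is ahead.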

---
## Question 2: Sampling and Sampling Types
What is sampling in statistics? Explain the differences between random and stratified sampling.

### Answer:
### What is Sampling?
**Sampling** in statistics is the process of selecting a smaller, manageable subset (**sample**) of individuals or items from a larger group (**population**) in order to estimate the characteristics of the whole population. It's used because studying every single member of a large population is often too expensive or time-consuming.

### Random vs. Stratified Sampling

| Feature | Random Sampling (Simple Random Sampling) | Stratified Sampling |
| :--- | :--- | :--- |
| **Process** | Every individual in the population has an equal chance of being selected, and every possible sample of a given size is equally likely. | The population is first divided into non-overlapping subgroups (called **strata**) based on shared characteristics (e.g., age, gender). A random sample is then drawn *from each* stratum. |
| **Goal** | To obtain a sample that is unbiased and, on average, representative of the whole population. | To guarantee that important subgroups are adequately represented in the final sample. With proportionate allocation, each stratum's share of the sample matches its share of the population. |
| **Best Used When** | The population is relatively homogeneous. | The population is heterogeneous (diverse), and representation from key subgroups is mandatory. |
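
As a rough sketch of the two procedures, the following draws a simple random sample and a proportionate stratified sample from the same population. The population, its age groups, and the sample size are hypothetical values chosen for illustration.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 people labeled by age group (the stratum).
population = (
    [("18-34", i) for i in range(500)]
    + [("35-54", i) for i in range(300)]
    + [("55+", i) for i in range(200)]
)
sample_size = 100

# Simple random sampling: draw directly from the whole population.
simple_sample = random.sample(population, sample_size)

# Proportionate stratified sampling: split into strata, then draw from each
# stratum in proportion to its share of the population.
strata = {}
for group, person_id in population:
    strata.setdefault(group, []).append((group, person_id))

stratified_sample = []
for group, members in strata.items():
    k = round(sample_size * len(members) / len(population))
    stratified_sample.extend(random.sample(members, k))

simple_counts = Counter(group for group, _ in simple_sample)
stratified_counts = Counter(group for group, _ in stratified_sample)
print("Simple random:", dict(simple_counts))
print("Stratified:   ", dict(stratified_counts))
```

The stratified sample always contains 50, 30, and 20 people from the three groups, while the simple random sample's group counts vary from draw to draw. With other stratum sizes, rounding `k` may need adjusting so the totals still add up to the intended sample size.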

---
## Question 3: Mean, Median, and Mode
Define mean, median, and mode. Explain why these measures of central tendency are important.

### Answer:
### Definitions of Mean, Median, and Mode
These are the primary **measures of central tendency**, used to find the "typical" or center value of a dataset.

* **Mean** $(\bar{x})$: The arithmetic **average**. Calculated by summing all values and dividing by the count of values ($n$): $$\bar{x} = \frac{\sum x}{n}$$
* **Median**: The **middle value** in a dataset when it is ordered from least to greatest. If the dataset has an even number of values, the median is the average of the two middle numbers.
* **Mode**: The value that appears **most frequently** in a dataset. A dataset can have one mode, multiple modes, or no mode at all.

### Importance of Central Tendency
These measures are crucial in data analysis for several reasons:

* **Data Summarization**: They provide a single, easy-to-understand value that summarizes the entire dataset.
* **Comparison**: They allow for quick comparison between different groups or time periods (e.g., comparing the mean productivity of two teams).
* **Detecting Skewness**: Comparing the mean and median helps reveal the distribution's shape and asymmetry (skewness).
* **Resilience to Outliers (Median)**: The median is **robust** to extreme outliers, making it a better measure of the "typical" value than the mean in skewed datasets (like housing prices or income).
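
The robustness of the median can be seen in a small example. The income figures below are hypothetical, in thousands of dollars.

```python
import statistics

# Hypothetical annual incomes in thousands of dollars.
incomes = [32, 35, 38, 40, 41, 43, 45]
incomes_with_outlier = incomes + [400]  # add one extreme earner

for label, values in [("Without outlier", incomes),
                      ("With outlier", incomes_with_outlier)]:
    print(f"{label}: mean = {statistics.mean(values):.2f}, "
          f"median = {statistics.median(values):.2f}")
```

Adding the single extreme value moves the mean from about 39.14 to 84.25, while the median only shifts from 40 to 40.5.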

---
## Question 4: Skewness and Kurtosis
Explain skewness and kurtosis. What does a positive skew imply about the data?

### Answer:
### Skewness
**Skewness** measures the **asymmetry** of a distribution around its mean, that is, whether one tail is longer or heavier than the other. A symmetric distribution such as the normal (bell-shaped) curve has zero skewness.

* **Zero Skew**: The two tails balance. For a symmetric, unimodal distribution, Mean $\approx$ Median $\approx$ Mode.
* **Positive Skew (Right-Skew)**: The distribution has a long tail extending to the right (higher values).
* **Negative Skew (Left-Skew)**: The distribution has a long tail extending to the left (lower values).

### Kurtosis
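**Kurtosis** measures the **"tailedness"** of a distribution, that is, how much of its variability comes from extreme values (**outliers**). It is often described as peakedness, but it mainly reflects the weight of the tails. The normal distribution (kurtosis of 3, or excess kurtosis of 0) is the usual baseline.

* **High Kurtosis (Leptokurtic)**: Heavier tails than the normal distribution, suggesting a higher probability of extreme values (outliers). Often accompanied by a sharp peak.
* **Low Kurtosis (Platykurtic)**: Lighter tails than the normal distribution, with fewer extreme values. Often accompanied by a flatter peak.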

### Implication of a Positive Skew
A **positive skew** implies that the majority of the data is clustered toward the lower values, but the distribution is stretched toward the high values by a few extreme observations (outliers).

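* **Order of Measures**: Typically $\text{Mode} < \text{Median} < \text{Mean}$ for a unimodal distribution, although this ordering is a rule of thumb rather than a guarantee. The mean is pulled to the right by the high outliers.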
* **Real-World Example**: Personal income. Most people earn a moderate salary (cluster on the left), but a few top earners make extremely high amounts (the long right tail, pulling the mean up).
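
To make these ideas concrete, here is a minimal sketch that computes skewness and excess kurtosis from central moments. The income values are hypothetical, and the formulas are the population (biased) moment versions; library routines such as `scipy.stats.skew` and `scipy.stats.kurtosis` could be used instead.

```python
import statistics

def moment_skew_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis

# Hypothetical incomes (in thousands): most are moderate, a few are very high.
incomes = [28, 30, 31, 33, 35, 36, 38, 40, 42, 45, 90, 150]
skewness, excess_kurtosis = moment_skew_kurtosis(incomes)

print(f"Mean:   {statistics.mean(incomes):.2f}")
print(f"Median: {statistics.median(incomes):.2f}")
print(f"Skewness:        {skewness:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")
```

For this sample the skewness is positive and the mean sits well above the median, which is the pattern described above for right-skewed data.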

---
## Question 5: Mean, Median, and Mode Python Implementation
Implement a Python program to compute the mean, median, and mode of a given list of numbers.

```python
numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]
```

```python
# Notebook-only: uncomment to install the required packages if they are missing
# %pip install numpy scipy -q

import numpy as np
from scipy import stats

# The given list of numbers
numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]

# 1. Compute the mean
data_mean = np.mean(numbers)

# 2. Compute the median
data_median = np.median(numbers)

# 3. Compute the mode
# stats.mode returns the mode and its count. When several values tie for the
# highest count (here 12, 19, and 24 each appear three times), it reports a
# single value, which is 12 for this list.
mode_result = stats.mode(numbers, keepdims=True)
data_mode = mode_result.mode[0]

# Print the results
print(f"Given Numbers: {numbers}")
print("-" * 30)
print(f"Mean:   {data_mean}")
print(f"Median: {data_median}")
print(f"Mode:   {data_mode}")
```

```
Given Numbers: [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]
------------------------------
Mean:   19.6
Median: 19.0
Mode:   12
```
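
The list is actually multimodal: 12, 19, and 24 each appear three times, and the result above shows only one of them. As a small supplementary sketch, the standard library's `statistics.multimode` lists every value that ties for the highest count.

```python
import statistics
from collections import Counter

numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]

# multimode returns all most-frequent values, in order of first appearance
all_modes = statistics.multimode(numbers)
counts = Counter(numbers)

print(f"All modes: {all_modes}")                              # [12, 19, 24]
print(f"Occurrences: {[counts[m] for m in all_modes]}")       # [3, 3, 3]
```

Reporting all three modes avoids giving the impression that 12 is the single most typical value.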


---
## Question 6: Covariance and Correlation Python Implementation
Compute the covariance and correlation coefficient between the following two datasets provided as lists in Python:

```python
list_x=[10,20,30,40,50]
list_y=[15,25,35,45,60]
```

```python
import numpy as np

# The given datasets
list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]

# Convert lists to NumPy arrays for calculation
x = np.array(list_x)
y = np.array(list_y)

# 1. Compute Covariance
# np.cov returns the covariance matrix. The covariance between x and y is the off-diagonal element (0, 1).
covariance_matrix = np.cov(x, y)
covariance = covariance_matrix[0, 1]

# 2. Compute Correlation Coefficient (Pearson's r)
# np.corrcoef returns the correlation coefficient matrix. The correlation between x and y is the off-diagonal element (0, 1).
correlation_matrix = np.corrcoef(x, y)
correlation_coefficient = correlation_matrix[0, 1]

# Print the results
print(f"list_x: {list_x}")
print(f"list_y: {list_y}")
print("-" * 30)
print(f"Covariance:             {covariance:.2f}")
print(f"Correlation Coefficient: {correlation_coefficient:.4f}")
```

```
list_x: [10, 20, 30, 40, 50]
list_y: [15, 25, 35, 45, 60]
------------------------------
Covariance:             275.00
Correlation Coefficient: 0.9959
```
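
As a cross-check, the same quantities can be computed directly from their definitions without NumPy. This is a minimal sketch using the sample (n - 1) divisor, which is also the default in `np.cov`.

```python
import math

list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]
n = len(list_x)

mean_x = sum(list_x) / n  # 30.0
mean_y = sum(list_y) / n  # 36.0

# Sum of products of deviations, and sums of squared deviations
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(list_x, list_y))
sum_xx = sum((x - mean_x) ** 2 for x in list_x)
sum_yy = sum((y - mean_y) ** 2 for y in list_y)

covariance = sum_xy / (n - 1)                        # sample covariance
correlation = sum_xy / math.sqrt(sum_xx * sum_yy)    # Pearson's r

print(f"Covariance:  {covariance:.2f}")
print(f"Correlation: {correlation:.4f}")
```

For these lists the deviation products sum to 1100, so the sample covariance is 1100 / 4 = 275, and Pearson's r is 1100 / sqrt(1000 × 1220), about 0.9959.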


---
## Question 7: Boxplot and Outlier Identification
Write a Python script to draw a boxplot for the following numeric list and identify its outliers. Explain the result.

```python
data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]
```

```python
import matplotlib.pyplot as plt
import numpy as np

# The given dataset
data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

# Create a boxplot (Visualization)
plt.figure(figsize=(6, 4))
plt.boxplot(data)
plt.title('Boxplot of Numeric Data')
plt.ylabel('Values')
plt.xticks([1], ['Data'])
plt.grid(axis='y', linestyle='--')
plt.show()

# Outlier Identification (using the 1.5 * IQR rule)
# Calculate Quartiles and IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Calculate Fence Boundaries
lower_fence = Q1 - (1.5 * IQR)
upper_fence = Q3 + (1.5 * IQR)

# Identify Outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Data: {data}")
print("-" * 30)
print(f"Q1 (25th percentile): {Q1}")
print(f"Q3 (75th percentile): {Q3}")
print(f"Interquartile Range (IQR): {IQR}")
print(f"Lower Fence: {lower_fence}")
print(f"Upper Fence: {upper_fence}")
print("-" * 30)
print(f"Identified Outliers: {outliers}")
```

*[Figure: boxplot of the data, titled "Boxplot of Numeric Data", with "Values" on the y-axis]*

```
Data: [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]
------------------------------
Q1 (25th percentile): 17.25
Q3 (75th percentile): 23.25
Interquartile Range (IQR): 6.0
Lower Fence: 8.25
Upper Fence: 32.25
------------------------------
Identified Outliers: [35]
```

### Explanation of the Result
The calculation uses the standard **$1.5 \times \text{IQR}$ rule** to identify outliers:

* **Quartiles and IQR**: With the default linear interpolation of `np.percentile`, the first quartile ($Q1$) is 17.25 and the third quartile ($Q3$) is 23.25. The **Interquartile Range (IQR)** is $Q3 - Q1 = 6.0$. The box in the plot spans this range. Other quartile conventions give slightly different values (the median-of-halves method gives 16.5 and 23.5), but the conclusion below is the same.
* **Fences**: The fences define the limits for non-outlier data points.
    * **Lower Fence**: $Q1 - (1.5 \times \text{IQR}) = 17.25 - 9.0 = 8.25$.
    * **Upper Fence**: $Q3 + (1.5 \times \text{IQR}) = 23.25 + 9.0 = 32.25$.
* **Outlier Identification**: Any data point outside the range $[8.25, 32.25]$ is an outlier. The value **35** is greater than the upper fence of 32.25, making it the single outlier in the dataset. The next-largest value, 29, lies inside the fences.
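
Quartile values depend on the convention used to compute them, so it can be worth checking that the outlier verdict does not hinge on that choice. The sketch below compares linear interpolation (the `np.percentile` default) with the median-of-halves convention. The helper `iqr_outliers` is a hypothetical name introduced here for illustration.

```python
import numpy as np

data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

def iqr_outliers(q1, q3, values):
    """Return (lower_fence, upper_fence, outliers) under the 1.5 * IQR rule."""
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]

# Convention 1: linear interpolation between order statistics (NumPy default).
q1_linear, q3_linear = np.percentile(data, [25, 75])

# Convention 2: medians of the lower and upper halves of the sorted data.
ordered = sorted(data)
half = len(ordered) // 2
q1_halves = float(np.median(ordered[:half]))
q3_halves = float(np.median(ordered[-half:]))

results = {}
for name, q1, q3 in [("linear", q1_linear, q3_linear),
                     ("halves", q1_halves, q3_halves)]:
    lower, upper, outliers = iqr_outliers(q1, q3, data)
    results[name] = (float(lower), float(upper), outliers)
    print(f"{name}: Q1={q1}, Q3={q3}, fences=({lower}, {upper}), outliers={outliers}")
```

Linear interpolation gives quartiles of 17.25 and 23.25, while the median-of-halves convention gives 16.5 and 23.5. Under both, 35 is the only value flagged as an outlier.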

---
## Question 8: Covariance and Correlation for E-commerce Analysis
You are working as a data analyst in an e-commerce company. The marketing team wants to know if there is a relationship between advertising spend and daily sales. Explain how you would use covariance and correlation to explore this relationship. Write Python code to compute the correlation between the two lists:

```python
advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]
```

### Answer: Using Covariance and Correlation
To explore the relationship between advertising spend and daily sales, you'd use **covariance** to find the direction of the relationship and **correlation** to determine its strength.

#### Covariance
* **Use**: Measures the **direction** of the linear relationship.
* **Interpretation**: A **positive covariance** means that as advertising spend increases, sales tend to increase (a direct relationship).
* **Limitation**: The magnitude of covariance is difficult to interpret because it depends on the units and scale of the variables (here, dollars × dollars), so on its own it does not indicate how strong the relationship is.

#### Correlation (Pearson's $r$)
* **Use**: Measures both the **direction** and the **strength** of the linear relationship. This is the more crucial metric for the marketing team.
* **Interpretation**: The value is standardized between $-1$ and $+1$:
    * A value close to **$+1$** (e.g., 0.99) indicates an **extremely strong positive linear relationship** (days with higher spend consistently show higher sales).
    * A value close to **$0$** indicates a very weak or no linear relationship.
* **Advantage**: Since it's unitless, it provides a clear, comparable measure of the relationship's strength.
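
For reference, these are the standard sample formulas behind the two measures, writing $x_i$ for advertising spend, $y_i$ for daily sales, $n$ for the number of observations, and $s_x$, $s_y$ for the sample standard deviations:

$$\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \qquad r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}$$

Dividing by $s_x \, s_y$ cancels the units, which is why $r$ always falls between $-1$ and $+1$ regardless of how the variables are measured.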

```python
import numpy as np

# The given lists
advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

# Convert lists to NumPy arrays
spend_array = np.array(advertising_spend)
sales_array = np.array(daily_sales)

# Compute the correlation coefficient (Pearson's r)
# np.corrcoef returns the correlation coefficient matrix. We extract the off-diagonal element (0, 1).
correlation_matrix = np.corrcoef(spend_array, sales_array)
correlation_coefficient = correlation_matrix[0, 1]

# Print the result
print(f"Advertising Spend: {advertising_spend}")
print(f"Daily Sales: {daily_sales}")
print("-" * 30)
print(f"Correlation Coefficient (r): {correlation_coefficient:.4f}")
```

```
Advertising Spend: [200, 250, 300, 400, 500]
Daily Sales: [2200, 2450, 2750, 3200, 4000]
------------------------------
Correlation Coefficient (r): 0.9936
```

**Conclusion**: A correlation coefficient of about **$0.9936$** indicates an extremely strong positive linear relationship between advertising spend and daily sales in this sample. Because the data covers only five days and correlation does not establish causation, this suggests, but does not prove, that advertising spend drives sales.
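
The unit dependence of covariance noted earlier can be demonstrated with the same data. In this sketch, `sample_cov` and `pearson_r` are small helper functions written for illustration, and expressing spend in cents is an arbitrary change of units.

```python
import math

advertising_spend = [200, 250, 300, 400, 500]   # dollars
daily_sales = [2200, 2450, 2750, 3200, 4000]    # dollars

def sample_cov(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson's r as covariance divided by the product of standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Same spend, expressed in cents instead of dollars
spend_in_cents = [100 * s for s in advertising_spend]

cov_dollars = sample_cov(advertising_spend, daily_sales)
cov_cents = sample_cov(spend_in_cents, daily_sales)
r_dollars = pearson_r(advertising_spend, daily_sales)
r_cents = pearson_r(spend_in_cents, daily_sales)

print(f"Covariance (spend in dollars): {cov_dollars:.1f}")
print(f"Covariance (spend in cents):   {cov_cents:.1f}")
print(f"r (dollars): {r_dollars:.4f}   r (cents): {r_cents:.4f}")
```

The covariance grows by a factor of 100 (from 84,875 to 8,487,500) when the units change, while the correlation coefficient stays the same. This is why correlation is the better figure to report to the marketing team.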

---
## Question 9: Analyzing Survey Data Distribution
Your team has collected customer satisfaction survey data on a scale of 1-10 and wants to understand its distribution before launching a new product. Explain which summary statistics and visualizations (e.g. mean, standard deviation, histogram) you'd use. Write Python code to create a histogram using Matplotlib for the survey data:

```python
survey_scores=[7,8,5,9,6,7,8,9,10,4,7,6,9,8,7]
```

### Answer: Summary Statistics and Visualizations
To thoroughly understand the customer satisfaction data, we would use a combination of descriptive statistics to quantify the center and spread, and a histogram for a visual assessment of the distribution's shape.

#### Summary Statistics
* **Mean and Median**: To determine the **average (typical)** satisfaction score. Comparing them helps check for any skew in customer responses.
* **Standard Deviation $(\sigma)$**: The key measure of **spread**. A low $\sigma$ means customers largely agree on satisfaction, while a high $\sigma$ suggests polarized opinions (many low and many high scores).

#### Visualizations
* **Histogram**: The most effective visual tool for understanding distribution. It shows the **frequency** of each score, immediately revealing the overall shape: is it generally high-scoring, skewed low, or perhaps bimodal (two peaks, indicating two distinct customer groups)?
* **Box Plot**: Used to quickly summarize the five-number summary and visually identify any unusually low or high scores (outliers) in satisfaction.
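
The five-number summary that a box plot displays can also be computed directly. The following is a minimal sketch using `np.percentile` with its default interpolation, together with the common 1.5 × IQR rule for flagging unusual scores.

```python
import numpy as np

survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

# Five-number summary: minimum, Q1, median, Q3, maximum
minimum, q1, median, q3, maximum = np.percentile(survey_scores, [0, 25, 50, 75, 100])

# 1.5 * IQR fences for flagging unusually low or high scores
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in survey_scores if s < lower_fence or s > upper_fence]

print(f"Min: {minimum}, Q1: {q1}, Median: {median}, Q3: {q3}, Max: {maximum}")
print(f"IQR: {iqr}, Fences: ({lower_fence}, {upper_fence})")
print(f"Outliers: {outliers}")
```

For these scores the summary is 4, 6.5, 7, 8.5, 10, and no score falls outside the fences of 3.5 and 11.5, so a box plot of this data would show no outlier points.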

```python
import matplotlib.pyplot as plt
import numpy as np

# The given survey scores
survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

# Calculate mean and standard deviation for reporting
data_mean = np.mean(survey_scores)
# ddof=1 gives the sample standard deviation, since the survey is a sample
# of customers (np.std defaults to ddof=0, the population formula)
data_std = np.std(survey_scores, ddof=1)

# Create a histogram
plt.figure(figsize=(8, 5))
# Define bins to ensure each whole score (4-10) has its own clear bar
bins = np.arange(min(survey_scores) - 0.5, max(survey_scores) + 1.5, 1)

plt.hist(survey_scores, bins=bins, edgecolor='black', alpha=0.7)
plt.title('Distribution of Customer Satisfaction Survey Scores (1-10)')
plt.xlabel('Satisfaction Score')
plt.ylabel('Frequency')
plt.xticks(np.arange(4, 11, 1)) # Label x-axis for each score value
plt.grid(axis='y', linestyle='--')
plt.tight_layout()
plt.show()

print(f"Survey Scores: {survey_scores}")
print("-" * 30)
print(f"Mean Score: {data_mean:.2f}")
print(f"Standard Deviation: {data_std:.2f}")
```

*[Figure: histogram titled "Distribution of Customer Satisfaction Survey Scores (1-10)", with "Satisfaction Score" on the x-axis and "Frequency" on the y-axis]*

```
Survey Scores: [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]
------------------------------
Mean Score: 7.33
Standard Deviation: 1.63
```
