### Question 1: What is the difference between descriptive statistics and inferential statistics? Explain with examples.

**Answer:**

Statistics is the science of collecting, analyzing, and interpreting data. It is mainly divided into two branches: Descriptive Statistics and Inferential Statistics. Both are important, but they serve different purposes in data analysis.

**1. Descriptive Statistics**  
Descriptive statistics are methods used to summarize and describe the main features of a dataset. They do not go beyond the available data; instead, they present information in a clear and simplified form using numerical measures and visualizations. Common tools include the mean, median, mode, standard deviation, tables, graphs, and charts.

*Example:* A teacher calculates the average marks of 50 students in her class. She finds that the mean is 65, the highest is 95, and the lowest is 30. This is descriptive statistics because it only explains the data of those 50 students.

**2. Inferential Statistics**  
Inferential statistics are techniques that allow us to draw conclusions, make predictions, or generalize about a large population based on a sample. They rely on probability theory and include tools such as hypothesis testing, confidence intervals, and regression analysis.

*Example:* A researcher wants to know the average marks of 1000 students in a school. Instead of checking all students, he takes a sample of 100 students and finds their mean to be 70. He then estimates that the average for the whole school is approximately 70, subject to sampling error. This is inferential statistics because a conclusion about the population is drawn from a sample.
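
The sample-to-population step in the example can be sketched in code. This is a minimal illustration under stated assumptions: the population of 1000 marks is simulated (made-up data), and the interval uses the normal critical value 1.96 as an approximation for a 95% confidence interval.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated marks of 1000 students (made-up data)
population = [random.gauss(70, 10) for _ in range(1000)]

# Inferential step: observe only a random sample of 100 students
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

# Approximate 95% confidence interval using the normal critical value 1.96
standard_error = statistics.stdev(sample) / len(sample) ** 0.5
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"Approximate 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Population mean (normally unknown): {statistics.mean(population):.2f}")
```

The interval expresses the uncertainty that comes from looking at 100 students instead of all 1000; a larger sample would usually give a narrower interval.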

### Question 2: What is sampling in statistics? Explain the differences between random and stratified sampling.

**Answer:**

In statistics, sampling is the process of selecting a subset of individuals or observations from a larger group called the population. Since studying the entire population is often costly, time-consuming, or impossible, researchers collect data from a smaller group (sample) and use it to draw conclusions about the population.

**1. Random Sampling**  
- **Definition:** Random sampling is a method where each member of the population has an equal chance of being selected.  
- **Process:** Names or numbers are selected using the lottery method, random number tables, or computer-generated randomization.  
- **Advantage:** Reduces selection bias, because no member of the population is favored over another.  
- **Example:** Choosing 100 students randomly from a list of 1,000 students using a random number generator.  

**2. Stratified Sampling**  
- **Definition:** Stratified sampling divides the population into subgroups (called strata) based on common characteristics like age, gender, or income, and then samples are taken from each stratum.  
- **Process:** Population → divided into homogeneous groups → random samples taken proportionally.  
- **Advantage:** Ensures every category is represented and typically reduces sampling error when the strata are internally homogeneous.  
- **Example:** A school has 60% boys and 40% girls. To select 100 students, the researcher randomly selects 60 boys and 40 girls instead of drawing 100 students at random without regard to gender.
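
The proportional allocation in the school example can be sketched as follows. The roster and the helper name `stratified_sample` are hypothetical, introduced only for illustration; the sketch assumes the stratum shares multiply out to whole numbers, as they do here.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical school roster: 600 boys and 400 girls (60% / 40%)
population = [("boy", i) for i in range(600)] + [("girl", i) for i in range(400)]

def stratified_sample(population, sample_size):
    """Draw a proportional simple random sample from each stratum."""
    # Group members by stratum label
    strata = {}
    for label, student_id in population:
        strata.setdefault(label, []).append((label, student_id))

    sample = []
    for label, members in strata.items():
        # Allocate seats in proportion to the stratum's share of the population
        n = round(sample_size * len(members) / len(population))
        sample.extend(random.sample(members, n))
    return sample

sample = stratified_sample(population, 100)
boys = sum(1 for label, _ in sample if label == "boy")
girls = sum(1 for label, _ in sample if label == "girl")
print("Boys:", boys, "Girls:", girls)  # Boys: 60 Girls: 40
```

When the shares do not divide evenly, rounding can make the total differ slightly from the requested sample size, so a real implementation would need a rule for distributing the remainder.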

### Question 3: Define mean, median, and mode. Explain why these measures of central tendency are important.

**Answer:**

**1. Mean**  
The mean (also called average) is obtained by adding all the values in a dataset and dividing by the total number of values.  

*Formula:*  
Mean = (Sum of all observations) ÷ (Number of observations)  

*Example:* Marks of 5 students = 10, 20, 30, 40, 50.  
Mean = (10 + 20 + 30 + 40 + 50) ÷ 5 = 30.  

**2. Median**  
The median is the middle value of the dataset when arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.  

*Example:* Data = 15, 20, 25, 30, 35. Median = 25 (middle value).  
If data = 10, 20, 30, 40 → Median = (20+30)/2 = 25.  

**3. Mode**  
The mode is the value that occurs most frequently in a dataset. A dataset can have one mode, more than one mode, or no mode at all.  

*Example:* Data = 5, 7, 7, 8, 10, 7, 12 → Mode = 7 (most repeated value).  
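
The cases of one mode and several modes can be checked with `statistics.multimode` from the standard library (Python 3.8+). A short sketch, where the second list is made up for illustration:

```python
import statistics

# One mode: 7 occurs three times, more than any other value
single = statistics.multimode([5, 7, 7, 8, 10, 7, 12])

# Two modes: 1 and 2 both occur twice (hypothetical data)
double = statistics.multimode([1, 1, 2, 2, 3])

print("Single mode:", single)  # [7]
print("Two modes:", double)    # [1, 2]
```

When every value occurs equally often, `multimode` returns all of them, which is the situation usually described as having no mode.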

**4. Importance of Mean, Median, and Mode**  
- These are called measures of central tendency, because they represent the "center" or average behavior of the data.  
- Mean gives an overall average, useful in comparing performances.  
- Median is helpful when data has extreme values (outliers), as it is far less affected by them than the mean.  
- Mode is useful in categorical data analysis (e.g., the most popular product, most common marks).  
- Together, they help in understanding, summarizing, and interpreting large datasets in research, economics, business, and everyday life.
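
The different sensitivity of the mean and the median to outliers can be seen in a small sketch. It reuses the marks from the mean example above and adds one extreme value (500), which is made up purely for illustration.

```python
import statistics

marks = [10, 20, 30, 40, 50]
marks_with_outlier = marks + [500]  # hypothetical extreme value

mean_before = statistics.mean(marks)
median_before = statistics.median(marks)
mean_after = statistics.mean(marks_with_outlier)
median_after = statistics.median(marks_with_outlier)

print("Mean before / after outlier:  ", mean_before, "/", round(mean_after, 2))
print("Median before / after outlier:", median_before, "/", median_after)
```

A single extreme value moves the mean from 30 to about 108.33, while the median only moves from 30 to 35.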

### Question 4: Explain skewness and kurtosis. What does a positive skew imply about the data?

**Answer:**

**1. Skewness**  
Skewness refers to the degree of asymmetry in the distribution of data values around the mean.  
- If data is symmetrical, skewness = 0.  
- If data is positively skewed, the tail of the distribution is longer on the right side.  
- If data is negatively skewed, the tail is longer on the left side.  

*Example:*  
- Heights of people often follow a near-symmetric distribution (low skew).  
- Income distribution is usually positively skewed, because a few very rich people pull the average upward.  

**2. Kurtosis**  
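Kurtosis is traditionally described as the degree of peakedness or flatness of a data distribution compared to a normal distribution. More precisely, it reflects how heavy the tails are, that is, how prone the data is to extreme values.  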
- **Mesokurtic:** Normal bell-shaped distribution (kurtosis ≈ 3).  
- **Leptokurtic:** More peaked than normal, with heavy tails (kurtosis > 3).  
- **Platykurtic:** Flatter than normal, with light tails (kurtosis < 3).  

*Example:*  
- Exam scores that are tightly clustered around the mean but include a few extreme scores → leptokurtic.  
- Uniformly spread marks → platykurtic.  

**3. Positive Skew and Its Implication**  
A positive skew means the distribution has a longer right tail. Most of the data values lie to the left (lower values), while a few very high values pull the mean to the right.  

- In a positively skewed distribution, typically Mean > Median > Mode (a common rule of thumb, not a strict guarantee).  
- Implication: Most individuals have relatively low values, but a small number of individuals have very high values.  

*Example:* Income distribution in a country: most people earn low-to-average salaries, but a few billionaires create a long right tail.
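
Skewness and kurtosis can be computed from central moments. The sketch below uses the population (moment-based) definitions, so a normal distribution has kurtosis 3, matching the convention above. The helper names and both datasets are hypothetical, chosen only to illustrate a long right tail and a flat distribution.

```python
import numpy as np

def central_moment(values, k):
    """k-th central moment: mean of (x - mean) ** k."""
    x = np.asarray(values, dtype=float)
    return np.mean((x - x.mean()) ** k)

def skewness(values):
    # Third central moment standardized by variance ** 1.5
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    # Fourth central moment standardized by variance ** 2 (normal = 3)
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical incomes (in thousands) with one very high earner
incomes = [20, 22, 25, 25, 28, 30, 32, 35, 40, 200]
# Hypothetical evenly spread marks from 1 to 100
uniform_marks = list(range(1, 101))

print("Income skewness:", skewness(incomes))
print("Income mean vs median:", np.mean(incomes), "vs", np.median(incomes))
print("Uniform marks kurtosis:", kurtosis(uniform_marks))
```

The income data gives a positive skewness with the mean (45.7) above the median (29), and the evenly spread marks give a kurtosis below 3, consistent with a platykurtic shape.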

### Question 5: Implement a Python program to compute the mean, median, and mode of a given list of numbers.

In [None]:
# Python program to compute Mean, Median and Mode

import statistics

# Given list of numbers
numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]

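# Calculating Mean, Median, and Mode
mean_value = statistics.mean(numbers)
median_value = statistics.median(numbers)
# 12, 19, and 24 each occur three times, so this data is multimodal.
# statistics.mode() would report only the first of them (12) on Python 3.8+;
# statistics.multimode() (Python 3.8+) returns every mode.
mode_values = statistics.multimode(numbers)

# Printing the results
print("Numbers:", numbers)
print("Mean:", mean_value)
print("Median:", median_value)
print("Mode(s):", mode_values)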

### Question 6: Compute the covariance and correlation coefficient between the following two datasets provided as lists in Python.

In [None]:
# Python program to compute Covariance and Correlation

import numpy as np

# Given datasets
list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]

# Convert to numpy arrays
x = np.array(list_x)
y = np.array(list_y)

# Compute Covariance matrix
cov_matrix = np.cov(x, y, bias=True)  # bias=True gives population covariance
cov_xy = cov_matrix[0][1]

# Compute Correlation coefficient
corr_xy = np.corrcoef(x, y)[0][1]

# Print results
print("List X:", list_x)
print("List Y:", list_y)
print("Covariance:", cov_xy)
print("Correlation Coefficient:", corr_xy)

### Question 7: Write a Python script to draw a boxplot for the following numeric list and identify its outliers. Explain the result.

In [None]:
# Python program to draw a boxplot and identify outliers

import matplotlib.pyplot as plt
import numpy as np  # needed for np.percentile below

# Given dataset
data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

# Create a boxplot
plt.boxplot(data, vert=True, patch_artist=True)
plt.title("Boxplot of Data")
plt.ylabel("Values")
plt.show()

# Identifying outliers using IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = [x for x in data if x < lower_bound or x > upper_bound]

print("Data:", data)
print("Q1 (25th percentile):", Q1)
print("Q3 (75th percentile):", Q3)
print("IQR:", IQR)
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
print("Outliers:", outliers)

**Explanation:** With NumPy's default linear interpolation, Q1 = 17.25 and Q3 = 23.25, so IQR = 6. The outlier bounds are 17.25 - 9 = 8.25 and 23.25 + 9 = 32.25. Only 35 lies outside this range, so it is the single outlier and appears as an isolated point above the upper whisker in the boxplot. The remaining values (12 to 29) fall within the whiskers.

### Question 8: You are working as a data analyst in an e-commerce company. The marketing team wants to know if there is a relationship between advertising spend and daily sales.

**Answer:**

1. **Using Covariance and Correlation**  
- Covariance measures how two variables move together.  
  - Positive covariance → both variables increase together.  
  - Negative covariance → one increases while the other decreases.  
  - Limitation: The covariance value depends on the units of the variables, so its magnitude is hard to interpret.  

- The correlation coefficient measures the strength and direction of the linear relationship between two variables.  
  - Value ranges from -1 to +1.  
  - +1 → perfect positive correlation, 0 → no linear correlation, -1 → perfect negative correlation.  

**Application:** For the marketing team, we can compute the covariance and correlation between advertising_spend and daily_sales to see whether higher ad spend is associated with higher sales, and how strong that relationship is. A strong correlation alone does not prove that ad spend causes the increase in sales.
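
The scale dependence of covariance can be shown with a short sketch. It uses the same sample figures as this notebook's advertising data and re-expresses spend in cents; the variable names are introduced here only for illustration.

```python
import numpy as np

spend_dollars = np.array([200, 250, 300, 400, 500])
sales = np.array([2200, 2450, 2750, 3200, 4000])
spend_cents = spend_dollars * 100  # same spend, different unit

# Covariance changes with the unit of measurement
cov_dollars = np.cov(spend_dollars, sales)[0, 1]
cov_cents = np.cov(spend_cents, sales)[0, 1]

# Correlation is unit-free, so it stays the same
corr_dollars = np.corrcoef(spend_dollars, sales)[0, 1]
corr_cents = np.corrcoef(spend_cents, sales)[0, 1]

print("Covariance (dollars):", cov_dollars)
print("Covariance (cents):  ", cov_cents)
print("Correlation (dollars):", corr_dollars)
print("Correlation (cents):  ", corr_cents)
```

Changing the unit multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is the easier number to report to the marketing team.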

In [None]:
# Python program to compute covariance and correlation

import numpy as np

# Given datasets
advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

# Convert to numpy arrays
x = np.array(advertising_spend)
y = np.array(daily_sales)

# Compute covariance
cov_xy = np.cov(x, y, bias=True)[0][1]

# Compute correlation coefficient
corr_xy = np.corrcoef(x, y)[0][1]

# Print results
print("Advertising Spend:", advertising_spend)
print("Daily Sales:", daily_sales)
print("Covariance:", cov_xy)
print("Correlation Coefficient:", corr_xy)

### Question 9: Your team has collected customer satisfaction survey data on a scale of 1-10 and wants to understand its distribution before launching a new product.

**Answer:**

1. **Summary Statistics to Use**  
- **Mean (Average):** Shows overall average score.  
- **Median:** Middle value when the scores are arranged in order. Useful if there are outliers.  
- **Mode:** Most frequent score.  
- **Standard Deviation:** Measures spread of scores around mean.  
- **Range, Minimum, Maximum:** Spread of scores from lowest to highest.  

2. **Visualization to Use**  
- **Histogram:** Plots frequency of each score, helps identify patterns, clusters, skewness.  
- **Boxplot (optional):** Visualizes median, quartiles, and possible outliers.

In [None]:
import matplotlib.pyplot as plt
import statistics

# Given survey scores
survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

# Summary statistics
mean_score = statistics.mean(survey_scores)
median_score = statistics.median(survey_scores)
mode_score = statistics.mode(survey_scores)
std_dev = statistics.stdev(survey_scores)

print("Mean:", mean_score)
print("Median:", median_score)
print("Mode:", mode_score)
print("Standard Deviation:", std_dev)

# Create histogram with one bin per integer score (1-10), centered on the score
plt.hist(survey_scores, bins=range(1, 12), align='left', edgecolor='black', color='skyblue')
plt.title("Histogram of Customer Satisfaction Scores")
plt.xlabel("Survey Scores")
plt.ylabel("Frequency")
plt.xticks(range(1, 11))
plt.show()