# **THEORY**
---
### **1. What is statistics, and why is it important?**
Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It is important because it allows us to make informed decisions, draw conclusions from data, and make predictions. In various fields such as healthcare, economics, business, and social sciences, statistics provides tools to make evidence-based decisions and understand trends.

---

### **2. What are the two main types of statistics?**
The two main types of statistics are:
- **Descriptive Statistics**: These statistics summarize and describe the features of a data set, such as measures of central tendency (mean, median, mode), measures of variability (range, variance, standard deviation), and graphical representations (histograms, bar charts).
- **Inferential Statistics**: These methods use data from a sample to make generalizations or predictions about a population. They include hypothesis testing, confidence intervals, and regression analysis.

---

### **3. What are descriptive statistics?**
Descriptive statistics summarize and present data in a meaningful way. They are used to describe the basic features of a data set and provide simple summaries about the sample and the measures. Examples include:
- Measures of central tendency (mean, median, mode)
- Measures of spread (range, variance, standard deviation)
- Graphical representations like bar charts, histograms, and pie charts

---

### **4. What is inferential statistics?**
Inferential statistics is the process of using data from a sample to make inferences or predictions about a population. It involves estimating population parameters, testing hypotheses, and making predictions or generalizations. It typically uses probability theory to make these inferences.

---

### **5. What is sampling in statistics?**
Sampling is the process of selecting a subset (sample) from a larger population to estimate the characteristics of the entire population. It is used when it's impractical or impossible to collect data from every individual in a population.
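
As a rough illustration of the idea, the sketch below draws a sample from a simulated population and compares the two means. The population (heights in cm), its parameters, and the sample size are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(7)  # seeded so the sketch is reproducible

# Hypothetical population: 100,000 heights in cm (parameters are arbitrary)
population = rng.normal(loc=170, scale=10, size=100_000)

# Draw 500 individuals without replacement
sample = rng.choice(population, size=500, replace=False)

print(f"Population mean: {population.mean():.2f}")
print(f"Sample mean:     {sample.mean():.2f}")
```

The sample mean should land close to the population mean even though only a small fraction of the population was measured.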

---

### **6. What are the different types of sampling methods?**
There are several types of sampling methods:
- **Random Sampling**: Every individual in the population has an equal chance of being selected.
- **Systematic Sampling**: Individuals are selected at regular intervals from a list.
- **Stratified Sampling**: The population is divided into subgroups (strata) based on a characteristic, and samples are taken from each stratum.
- **Cluster Sampling**: The population is divided into clusters, a random selection of clusters is chosen, and the individuals within the chosen clusters are studied.
- **Convenience Sampling**: Samples are taken based on what is easiest or most convenient, though it may not be representative.
- **Judgmental Sampling**: The researcher selects samples based on their judgment.
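
Cluster sampling is the method in this list that differs most from picking individuals one at a time, so here is a minimal sketch of it using only the standard library. The cluster and person names are hypothetical placeholders.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 5 clusters (e.g. schools) of 4 individuals each
clusters = {
    f"cluster_{c}": [f"person_{c}_{i}" for i in range(4)]
    for c in range(5)
}

# Cluster sampling: randomly choose whole clusters, then keep every member
chosen = random.sample(sorted(clusters), k=2)
cluster_sample = [person for name in chosen for person in clusters[name]]

print(chosen)
print(cluster_sample)
```

Note that the random choice is made among clusters, not among individuals.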

---

### **7. What is the difference between random and non-random sampling?**
- **Random Sampling**: Every individual has an equal chance of being selected, making it more representative of the population.
- **Non-random Sampling**: The selection of individuals is based on some criteria or convenience, which can lead to bias and may not represent the population accurately.

---

### **8. Define and give examples of qualitative and quantitative data.**
- **Qualitative Data**: Also called categorical data, it refers to data that can be categorized based on attributes or characteristics. Examples include gender, color, type of car, and brand of product.
- **Quantitative Data**: This refers to data that can be measured and expressed numerically. Examples include age, height, weight, temperature, and income.

---

### **9. What are the different types of data in statistics?**
There are two main types of data:
- **Qualitative (Categorical) Data**: Data that represents categories or groups (e.g., color, brand).
- **Quantitative (Numerical) Data**: Data that represents quantities and can be measured (e.g., height, weight, temperature).

---

### **10. Explain nominal, ordinal, interval, and ratio levels of measurement.**
- **Nominal**: Data that is categorized into distinct groups without any specific order (e.g., colors, gender, religion).
- **Ordinal**: Data that can be ordered or ranked but does not have a fixed difference between values (e.g., ranking in a race, education level).
- **Interval**: Data with ordered values, and the differences between values are meaningful, but there is no true zero point (e.g., temperature in Celsius).
- **Ratio**: Data with ordered values, meaningful differences between values, and a true zero point (e.g., weight, height, age).

---

### **11. What is the measure of central tendency?**
A measure of central tendency is a statistic that describes the center of a data set. The most common measures are:
- **Mean**: The average of all the values.
- **Median**: The middle value when the data is sorted in order.
- **Mode**: The value that appears most frequently.

---

### **12. Define mean, median, and mode.**
- **Mean**: The sum of all data values divided by the number of values. It is the arithmetic average.
- **Median**: The middle value of the data when sorted in ascending or descending order.
- **Mode**: The value that appears most frequently in the data set.

---

### **13. What is the significance of the measure of central tendency?**
The measure of central tendency helps summarize a data set by identifying the "center" or most typical value. It is useful for understanding the general trend of the data and making comparisons across different data sets.

---

### **14. What is variance, and how is it calculated?**
Variance is a measure of how much the values in a data set differ from the mean. It is calculated as the average of the squared differences from the mean.
Formula:  
\[ \text{Variance} = \frac{1}{N} \sum (x_i - \mu)^2 \]
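where \(x_i\) is each data point, \(\mu\) is the mean, and \(N\) is the number of data points. This is the population variance; when the data is a sample, the sum is usually divided by \(N - 1\) instead of \(N\).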
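
To check the formula on a small case, the following sketch computes the variance by hand and compares it with `np.var`. The data values are illustrative only.

```python
import numpy as np

# Illustrative values (not from any real dataset)
data = [10, 20, 30, 40, 50]

n = len(data)
mu = sum(data) / n                             # mean
squared_diffs = [(x - mu) ** 2 for x in data]  # squared differences from the mean
variance_manual = sum(squared_diffs) / n       # divide by N

variance_numpy = np.var(data)                  # ddof=0 by default, i.e. divide by N
variance_sample = np.var(data, ddof=1)         # divide by N - 1 instead

print(variance_manual, variance_numpy, variance_sample)
```

The manual result and `np.var(data)` agree because NumPy divides by N by default; passing `ddof=1` switches to the N - 1 divisor.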

---

### **15. What is skewness in a dataset?**
Skewness measures the asymmetry of a dataset's distribution. A dataset can be:
- **Positively Skewed**: The right tail (larger values) is longer or fatter.
- **Negatively Skewed**: The left tail (smaller values) is longer or fatter.

---

### **16. What is standard deviation, and why is it important?**
Standard deviation measures the amount of variation or dispersion in a data set. A low standard deviation means the data points are close to the mean, while a high standard deviation indicates that the data points are spread out. It is important because it provides insights into the consistency or variability of data.

---

### **17. Define and explain the term range in statistics.**
The range is the difference between the maximum and minimum values in a data set. It provides a measure of how spread out the values are.
Formula:  
\[ \text{Range} = \text{Max Value} - \text{Min Value} \]

---

### **18. What is the difference between variance and standard deviation?**
- **Variance** measures the average of the squared differences from the mean, and is in squared units.
- **Standard Deviation** is the square root of the variance and is in the same units as the data, making it easier to interpret.
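
In symbols, using the same notation as the variance formula (the symbol \(\sigma\) for the standard deviation is introduced here for illustration):

\[ \sigma = \sqrt{\text{Variance}} = \sqrt{\frac{1}{N} \sum (x_i - \mu)^2} \]

For example, the values 10, 20, 30, 40, 50 have a variance of 200 and a standard deviation of about 14.14.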

---

### **19. What does it mean if a dataset is positively or negatively skewed?**
- **Positively Skewed**: Most values are concentrated on the left, with a long right tail (larger values).
- **Negatively Skewed**: Most values are concentrated on the right, with a long left tail (smaller values).
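
One practical consequence is that the mean is typically pulled toward the long tail while the median stays closer to the bulk of the data. The sketch below illustrates this with simulated exponential data; the distribution and sample size are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

right_skewed = rng.exponential(scale=1.0, size=10_000)  # long right tail
left_skewed = -right_skewed                             # mirror image: long left tail

for name, x in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: skew={skew(x):.2f}, mean={x.mean():.2f}, median={np.median(x):.2f}")
```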

---

### **20. Define and explain kurtosis.**
Kurtosis measures the "tailedness" of the data distribution. It indicates how prone the distribution is to producing extreme values (outliers) compared with a normal distribution.
- **Leptokurtic**: High kurtosis, with more extreme values (heavy tails).
- **Mesokurtic**: Kurtosis similar to that of a normal distribution (kurtosis of 3, or excess kurtosis of 0).
- **Platykurtic**: Low kurtosis, with fewer extreme values (light tails).
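
The three cases can be sketched by simulating one distribution of each kind and computing its kurtosis with SciPy. The choice of distributions (normal, Laplace, uniform) and the sample size are assumptions made for this example.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible
n = 100_000

samples = {
    "normal (mesokurtic)": rng.normal(size=n),
    "Laplace (leptokurtic)": rng.laplace(size=n),
    "uniform (platykurtic)": rng.uniform(size=n),
}

# scipy's default is excess (Fisher) kurtosis: a normal distribution scores about 0
excess = {name: kurtosis(x) for name, x in samples.items()}
for name, k in excess.items():
    print(f"{name}: {k:.2f}")
```

With excess kurtosis, positive values point to heavier tails than the normal distribution and negative values to lighter tails.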

---

### **21. What is the purpose of covariance?**
Covariance measures the degree to which two random variables change together. It shows whether an increase in one variable tends to be accompanied by an increase or a decrease in the other; it does not by itself show that one causes the other. If the covariance is positive, the variables tend to move in the same direction; if negative, they tend to move in opposite directions.

---

### **22. What does correlation measure in statistics?**
Correlation measures the strength and direction of the linear relationship between two variables. The correlation coefficient ranges from -1 to 1, where:
- **+1** indicates a perfect positive relationship,
- **-1** indicates a perfect negative relationship,
- **0** indicates no linear relationship.

---

### **23. What is the difference between covariance and correlation?**
- **Covariance** indicates the direction of the relationship, but its magnitude is hard to interpret as strength because it depends on the units of the variables.
- **Correlation** standardizes the covariance, making it a unitless measure that ranges between -1 and 1, indicating both the direction and strength of the relationship.
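
The difference can be seen by changing the unit of one variable. In the sketch below, the height and weight values are made up for illustration; the same heights are expressed in metres and in centimetres.

```python
import numpy as np

# Illustrative values: heights in metres and weights in kilograms
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([50.0, 58.0, 65.0, 77.0, 85.0])
height_cm = height_m * 100  # same data, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

# Correlation = covariance / (std_x * std_y), using the same ddof as np.cov
corr_manual = cov_m / (height_m.std(ddof=1) * weight_kg.std(ddof=1))
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(cov_m, cov_cm)    # covariance changes with the unit
print(corr_m, corr_cm)  # correlation does not
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.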

---

### **24. What are some real-world applications of statistics?**
- **Healthcare**: Analyzing clinical trials and patient data to make medical decisions.
- **Business**: Market research, sales forecasting, and customer segmentation.
- **Sports**: Performance analysis, player statistics, and game predictions.
- **Economics**: Understanding economic trends, inflation rates, and unemployment data.
- **Social Sciences**: Conducting surveys and analyzing demographic trends.

---

# **PRACTICAL**

---

### **1. How do you calculate the mean, median, and mode of a dataset?**
The **mean** is the sum of all values divided by the number of values. The **median** is the middle value when the data is sorted in order. The **mode** is the value that appears most frequently.

#### Python Example:
```python
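import numpy as np
import scipy.stats as stats

# 20 is repeated so that the dataset has a meaningful mode
data = [10, 20, 20, 30, 40, 50, 60, 70, 80, 90]

mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data)

# Newer SciPy returns scalars here, older versions return length-1 arrays;
# np.atleast_1d handles both
mode_value = np.atleast_1d(mode.mode)[0]
mode_count = np.atleast_1d(mode.count)[0]

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode_value} (Count: {mode_count})")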
```

---

### **2. Write a Python program to compute the variance and standard deviation of a dataset.**
The **variance** measures how far data points are from the mean, and the **standard deviation** is the square root of the variance.

#### Python Example:
```python
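import numpy as np

data = [10, 20, 30, 40, 50]

# np.var and np.std default to the population formulas (divide by N);
# pass ddof=1 for the sample versions (divide by N - 1)
variance = np.var(data)
std_dev = np.std(data)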

print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
```

---

### **3. Create a dataset and classify it into nominal, ordinal, interval, and ratio types.**
- **Nominal**: Categories without order (e.g., colors).
- **Ordinal**: Categories with order but no fixed difference (e.g., education levels).
- **Interval**: Numeric values with equal intervals, but no true zero (e.g., temperature in Celsius).
- **Ratio**: Numeric values with a true zero (e.g., weight, height).

#### Python Example:
```python
nominal_data = ['Red', 'Blue', 'Green']  # Nominal
ordinal_data = ['Low', 'Medium', 'High']  # Ordinal
interval_data = [30, 40, 50, 60]  # Interval (temperature)
ratio_data = [50, 60, 70, 80]  # Ratio (e.g., weight)
```
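
Plain lists do not record the level of measurement. One way to make the nominal/ordinal distinction explicit, sketched here with pandas categoricals, is to declare whether the categories are ordered. The category values are the illustrative ones used above.

```python
import pandas as pd

# Nominal: categories with no order
colors = pd.Categorical(['Red', 'Blue', 'Green', 'Blue'])

# Ordinal: categories with an explicit order
levels = pd.Categorical(
    ['Low', 'High', 'Medium', 'Low'],
    categories=['Low', 'Medium', 'High'],
    ordered=True,
)

print(colors.ordered)              # False
print(levels.ordered)              # True
print(levels.min(), levels.max())  # order-aware: Low High
```

Order-based operations such as `min()` and `max()` are only meaningful for the ordered categorical.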

---

### **4. Implement sampling techniques like random sampling and stratified sampling.**
- **Random Sampling**: Randomly selecting data from the dataset.
- **Stratified Sampling**: Dividing the population into subgroups (strata) and then sampling from each subgroup.

#### Python Example:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

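# Create a sample dataframe; 'Group' is an illustrative stratum label
data = {'Age': [23, 45, 23, 56, 89, 45, 34, 56, 23, 12],
        'Income': [1000, 2000, 1500, 3000, 3500, 2500, 2200, 2800, 1500, 1300],
        'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']}

df = pd.DataFrame(data)

# Random Sampling: every row has the same chance of being picked
random_sample = df.sample(n=3, random_state=42)
print("Random Sample:")
print(random_sample)

# Stratified Sampling: the split keeps the A/B proportions of 'Group'.
# Stratifying on 'Age' would fail because some ages occur only once.
X_train, X_test = train_test_split(
    df, test_size=0.3, stratify=df['Group'], random_state=42)
print("Stratified Sample (Train):")
print(X_train)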
```

---

### **5. Write a Python function to calculate the range of a dataset.**
The **range** is the difference between the maximum and minimum values of a dataset.

#### Python Example:
```python
def calculate_range(data):
    return max(data) - min(data)

data = [10, 20, 30, 40, 50]
range_value = calculate_range(data)
print(f"Range: {range_value}")
```

---

### **6. Create a dataset and plot its histogram to visualize skewness.**
A **histogram** can be used to visualize the distribution of a dataset and the skewness.

#### Python Example:
```python
import matplotlib.pyplot as plt

# Sample data with skewness
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 150, 200]

plt.hist(data, bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram of Dataset')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
```

---

### **7. Calculate skewness and kurtosis of a dataset using Python libraries.**
- **Skewness** measures the asymmetry of the data.
- **Kurtosis** measures the "tailedness" of the data distribution.

#### Python Example:
```python
from scipy.stats import skew, kurtosis

data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

data_skewness = skew(data)
data_kurtosis = kurtosis(data)  # excess (Fisher) kurtosis: about 0 for a normal distribution

print(f"Skewness: {data_skewness}")
print(f"Kurtosis: {data_kurtosis}")
```

---

### **8. Generate a dataset and demonstrate positive and negative skewness.**
- **Positive skew**: The right tail of the distribution is longer.
- **Negative skew**: The left tail of the distribution is longer.

#### Python Example:
```python
import numpy as np
import matplotlib.pyplot as plt

# Positive Skew
pos_data = np.random.exponential(scale=1, size=1000)

# Negative Skew
neg_data = -np.random.exponential(scale=1, size=1000)

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(pos_data, bins=30, color='blue', edgecolor='black')
plt.title('Positive Skew')

plt.subplot(1, 2, 2)
plt.hist(neg_data, bins=30, color='red', edgecolor='black')
plt.title('Negative Skew')

plt.show()
```

---

### **9. Write a Python script to calculate covariance between two datasets.**
**Covariance** measures how two variables change together.

#### Python Example:
```python
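import numpy as np

data1 = [10, 20, 30, 40, 50]
data2 = [15, 25, 35, 45, 55]

# np.cov returns the 2x2 covariance matrix (sample covariance, divides by N - 1);
# the off-diagonal entry is the covariance between data1 and data2
covariance = np.cov(data1, data2)[0][1]
print(f"Covariance: {covariance}")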
```

---

### **10. Write a Python script to calculate the correlation coefficient between two datasets.**
**Correlation** measures the strength and direction of the linear relationship between two variables.

#### Python Example:
```python
# Assumes numpy is imported as np and data1/data2 are defined as in the covariance example
correlation = np.corrcoef(data1, data2)[0][1]
print(f"Correlation Coefficient: {correlation}")
```

---

### **11. Create a scatter plot to visualize the relationship between two variables.**
A **scatter plot** is used to visualize the relationship between two variables.

#### Python Example:
```python
import matplotlib.pyplot as plt

# Assumes data1 and data2 are defined as in the covariance example
plt.scatter(data1, data2)
plt.title('Scatter Plot between Data1 and Data2')
plt.xlabel('Data1')
plt.ylabel('Data2')
plt.show()
```

---

### **12. Implement and compare simple random sampling and systematic sampling.**
- **Simple Random Sampling**: Randomly select data points.
- **Systematic Sampling**: Select every nth data point from the list.

#### Python Example:
```python
# Simple Random Sampling (reuses df from the sampling example in Q4)
simple_random_sample = df.sample(n=3)
print("Simple Random Sample:")
print(simple_random_sample)

# Systematic Sampling (every 2nd row)
systematic_sample = df.iloc[::2]
print("Systematic Sample:")
print(systematic_sample)
```

---

### **13. Calculate the mean, median, and mode of grouped data.**
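Grouped data is data that has been sorted into intervals or groups, each with a frequency. The mean, median, and mode must therefore take the frequencies into account rather than treating each group value as a single observation.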

#### Python Example:
```python
import pandas as pd

data = {'Age': [20, 22, 24, 26, 28, 30, 32, 34, 36, 38],
        'Frequency': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]}
df = pd.DataFrame(data)

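# Mean: weight each value by its frequency
mean_grouped = (df['Age'] * df['Frequency']).sum() / df['Frequency'].sum()

# Median: first value whose cumulative frequency reaches half the total count
cumulative = df['Frequency'].cumsum()
median_grouped = df.loc[cumulative >= df['Frequency'].sum() / 2, 'Age'].iloc[0]

# Mode: value with the highest frequency
mode_grouped = df.loc[df['Frequency'].idxmax(), 'Age']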

print(f"Grouped Data Mean: {mean_grouped}")
print(f"Grouped Data Median: {median_grouped}")
print(f"Grouped Data Mode: {mode_grouped}")
```

---

### **14. Simulate data using Python and calculate its central tendency and dispersion.**
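This example simulates normally distributed data and computes its central tendency (mean, median, mode) and dispersion (variance, standard deviation). Because continuous values rarely repeat, the mode is approximated from a histogram.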

#### Python Example:
```python
import numpy as np

simulated_data = np.random.normal(0, 1, 1000)

mean_sim = np.mean(simulated_data)
median_sim = np.median(simulated_data)

# Continuous values rarely repeat, so approximate the mode by the
# midpoint of the fullest histogram bin
counts, edges = np.histogram(simulated_data, bins=30)
peak = np.argmax(counts)
mode_sim = (edges[peak] + edges[peak + 1]) / 2

variance_sim = np.var(simulated_data)
std_dev_sim = np.std(simulated_data)

print(f"Simulated Data Mean: {mean_sim}")
print(f"Simulated Data Median: {median_sim}")
print(f"Simulated Data Mode (approx.): {mode_sim}")
print(f"Simulated Data Variance: {variance_sim}")
print(f"Simulated Data Standard Deviation: {std_dev_sim}")
```

---

### **15. Use NumPy or pandas to summarize a dataset’s descriptive statistics.**
Descriptive statistics like mean, standard deviation, min, max, etc., can be summarized using pandas' `.describe()`.

#### Python Example:
```python
import pandas as pd

# Sample dataset
data = {'Age': [23, 45, 23, 56, 89, 45, 34, 56, 23, 12],
        'Income': [1000, 2000, 1500, 3000, 3500, 2500, 2200, 2800, 1500, 1300]}

df = pd.DataFrame(data)

# Summary statistics
summary_stats = df.describe()
print(summary_stats)
```

---

### **16. Plot a boxplot to understand the spread and identify outliers.**
A **boxplot** helps in understanding the spread, central tendency, and outliers in the dataset.

#### Python Example:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# df is the Age/Income DataFrame from the previous example

sns.boxplot(x=df['Age'])
plt.title('Boxplot of Age')
plt.show()
```

---

### **17. Calculate the interquartile range (IQR) of a dataset.**
The **IQR** is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

#### Python Example:
```python
Q1 = np.percentile(df['Age'], 25)
Q3 = np.percentile(df['Age'], 75)
IQR = Q3 - Q1

print(f"IQR: {IQR}")
```
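
The IQR is also the basis of a common rule of thumb for flagging outliers: points more than 1.5 × IQR below Q1 or above Q3. A minimal sketch of that rule follows; the 1.5 multiplier is a convention rather than a fixed law, and the `ages` array simply repeats the Age values used in the sample DataFrame.

```python
import numpy as np

# Same Age values as the sample DataFrame
ages = np.array([23, 45, 23, 56, 89, 45, 34, 56, 23, 12])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Rule of thumb: flag points beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}")
```

With these values nothing falls outside the fences, so no age is flagged as an outlier.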

---

### **18. Implement Z-score normalization and explain its significance.**
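**Z-score normalization** standardizes the data by subtracting the mean and dividing by the standard deviation. The result has a mean of 0 and a standard deviation of 1, which puts variables measured on different scales on a common footing and makes unusually large or small values easy to spot.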

#### Python Example:
```python
from scipy.stats import zscore

z_scores = zscore(df['Age'])
print(f"Z-scores: {z_scores}")
```
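
The same result can be reproduced by hand, which makes the definition concrete. This sketch assumes the population standard deviation (`ddof=0`), which is also SciPy's default for `zscore`; the `ages` array repeats the Age values from the sample DataFrame.

```python
import numpy as np
from scipy.stats import zscore

ages = np.array([23, 45, 23, 56, 89, 45, 34, 56, 23, 12])

# Manual z-score: (x - mean) / standard deviation (population, ddof=0)
z_manual = (ages - ages.mean()) / ages.std()
z_scipy = zscore(ages)  # scipy also uses ddof=0 by default

print(np.round(z_manual, 2))
print(f"mean of z: {z_manual.mean():.2f}, std of z: {z_manual.std():.2f}")
```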

---

### **19. Compare two datasets using their standard deviations.**
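Comparing standard deviations shows which dataset is more spread out. When the datasets use different units or scales, as Age and Income do here, the raw values are not directly comparable; the coefficient of variation (standard deviation divided by the mean) is a fairer basis for comparison.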

#### Python Example:
```python
std_dev_age = np.std(df['Age'])
std_dev_income = np.std(df['Income'])

print(f"Standard Deviation of Age: {std_dev_age}")
print(f"Standard Deviation of Income: {std_dev_income}")
```

---

### **20. Write a Python program to visualize covariance using a heatmap.**
#### Python Example:
```python
import seaborn as sns

cov_matrix = df.cov()
sns.heatmap(cov_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Covariance Heatmap')
plt.show()
```

---

### **21. Use seaborn to create a correlation matrix for a dataset.**
#### Python Example:
```python
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
```

---

### **22. Generate a dataset and implement both variance and standard deviation computations.**
Refer to the variance and standard deviation example in Practical question 2; the same `np.var` and `np.std` calls apply to any generated dataset.
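
A minimal sketch for this exercise, assuming an arbitrary generated dataset (the distribution and its parameters are made up), might look like this. It also shows that NumPy and pandas use different default divisors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded so the sketch is reproducible
values = rng.normal(loc=50, scale=10, size=500)  # generated dataset (arbitrary parameters)

pop_var = np.var(values)   # population variance (divide by N)
pop_std = np.std(values)   # population standard deviation

series = pd.Series(values)
sample_var = series.var()  # pandas defaults to the sample variance (divide by N - 1)
sample_std = series.std()

print(f"Population variance: {pop_var:.2f}, std: {pop_std:.2f}")
print(f"Sample variance:     {sample_var:.2f}, std: {sample_std:.2f}")
```

The standard deviation is the square root of the corresponding variance in both cases.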

---

### **23. Visualize skewness and kurtosis using Python libraries like matplotlib or seaborn.**
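A histogram with a kernel density estimate (KDE) overlay makes both properties visible: skewness appears as a longer tail on one side, and kurtosis as how heavy the tails are relative to the peak.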

#### Python Example:
```python
sns.histplot(df['Age'], kde=True)
plt.title('Skewness and Kurtosis Visualization')
plt.show()
```

---

### **24. Implement the Pearson and Spearman correlation coefficients for a dataset.**
#### Python Example:
```python
pearson_corr = df['Age'].corr(df['Income'], method='pearson')
spearman_corr = df['Age'].corr(df['Income'], method='spearman')

print(f"Pearson Correlation: {pearson_corr}")
print(f"Spearman Correlation: {spearman_corr}")
```

---
