### What is statistics, and why is it important?

Statistics is the branch of mathematics that deals with collecting, organizing, analyzing, interpreting, and presenting data. It is important because it supports data-driven decisions, reveals trends, and lets us draw conclusions from evidence in fields such as business, healthcare, and research.

### What are the two main types of statistics?

- **Descriptive Statistics**: Summarizes and describes features of a dataset (e.g., mean, median, mode).
- **Inferential Statistics**: Makes predictions or inferences about a population based on a sample (e.g., hypothesis testing, confidence intervals).

### What are descriptive statistics?

Descriptive statistics summarize and present data in a meaningful way. Common measures include central tendency (mean, median, mode) and dispersion (variance, standard deviation, range).

### What is inferential statistics?

Inferential statistics uses sample data to make predictions or generalizations about a larger population. It includes hypothesis testing, confidence intervals, and regression analysis.
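
As a rough illustration of inference from a sample, the sketch below builds an approximate 95% confidence interval for a population mean. The height values are hypothetical, and the interval uses the normal critical value (about 1.96) as a simplifying assumption; for a sample this small, a t critical value would give a slightly wider interval.

```python
import math
import statistics

# Hypothetical sample: 10 measured heights in cm (illustrative values only)
sample = [168, 172, 165, 170, 174, 169, 171, 166, 173, 172]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
standard_error = sample_sd / math.sqrt(n)   # estimated spread of the sample mean

# Normal critical value for a two-sided 95% interval (assumption: normal approximation)
z = statistics.NormalDist().inv_cdf(0.975)
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"Approximate 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is a statement about the unknown population mean, not about individual heights: it widens when the data are more variable and narrows as the sample grows.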

### What is sampling in statistics?

Sampling is the process of selecting a subset of individuals from a population to represent the entire group.

### What are the different types of sampling methods?

- **Random Sampling**: Every individual has an equal chance of being selected.
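- **Stratified Sampling**: The population is divided into subgroups (strata) that share a characteristic, and a random sample is drawn from each.
- **Systematic Sampling**: Every nth individual is selected from an ordered list, usually after a random starting point.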
- **Cluster Sampling**: Population is divided into clusters, and entire clusters are randomly selected.
- **Convenience Sampling**: Selecting subjects that are easiest to reach.
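
The probability-based methods in this list can be sketched with the standard library alone. The population, the `region` field, and the sample sizes below are hypothetical choices made only for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 20 people in 4 regions of 5 people each
population = [{"id": i, "region": f"R{i // 5}"} for i in range(20)]

# Simple random sampling: every individual has the same chance of selection
simple_random = random.sample(population, k=5)

# Stratified sampling: draw the same number at random from every region
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)
stratified = [p for group in strata.values() for p in random.sample(group, k=2)]

# Systematic sampling: random start, then every 4th individual
step = 4
start = random.randrange(step)
systematic = population[start::step]

# Cluster sampling: pick whole regions at random and keep everyone in them
chosen_regions = random.sample(sorted(strata), k=2)
cluster = [p for p in population if p["region"] in chosen_regions]

for name, sample in [("simple random", simple_random), ("stratified", stratified),
                     ("systematic", systematic), ("cluster", cluster)]:
    print(f"{name}: {[p['id'] for p in sample]}")
```

Stratified sampling guarantees that every region is represented, whereas cluster sampling deliberately leaves some regions out in exchange for cheaper data collection.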

### What is the difference between random and non-random sampling?

- **Random Sampling**: Every member has an equal chance of being selected.
- **Non-Random Sampling**: Selection is based on convenience, judgment, or other non-random criteria.

### Define and give examples of qualitative and quantitative data.

- **Qualitative Data**: Descriptive data (e.g., colors, names, categories). Example: 'Blue', 'Male', 'Low'.
- **Quantitative Data**: Numerical data that can be measured. Example: Height = 170 cm, Age = 25 years.

### What are the different types of data in statistics?

- **Nominal**: Categories without order (e.g., colors, gender).
- **Ordinal**: Categories with order but no fixed differences (e.g., ratings: poor, average, good).
- **Interval**: Numeric data with meaningful differences but no true zero (e.g., temperature in Celsius).
- **Ratio**: Numeric data with a true zero, allowing for meaningful ratios (e.g., weight, height, age).

### Explain nominal, ordinal, interval, and ratio levels of measurement.

- **Nominal**: Categories (e.g., colors, nationality).
- **Ordinal**: Ordered categories (e.g., rankings, satisfaction levels).
- **Interval**: Ordered with meaningful differences but no true zero (e.g., temperature in Celsius).
- **Ratio**: Ordered with meaningful differences and a true zero (e.g., weight, distance).
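
One way to make these levels concrete is to encode them in pandas. This is a minimal sketch: the column names and values are hypothetical, and pandas itself only distinguishes unordered from ordered categoricals, so the interval/ratio distinction is shown through which arithmetic is meaningful.

```python
import pandas as pd

# Hypothetical survey records; column names are illustrative
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green"],        # nominal
    "rating": ["good", "poor", "average"],    # ordinal
    "temp_c": [20.0, 10.0, 30.0],             # interval
    "weight_kg": [60.0, 30.0, 90.0],          # ratio
})

# Nominal: unordered categories, so only equality checks and counts make sense
df["color"] = pd.Categorical(df["color"])

# Ordinal: ordered categories, so comparisons and min/max are meaningful
df["rating"] = pd.Categorical(
    df["rating"], categories=["poor", "average", "good"], ordered=True
)
print("Lowest rating:", df["rating"].min())
print("Highest rating:", df["rating"].max())

# Interval: differences are meaningful, but 30 °C is not "three times as hot" as 10 °C
temp_difference = df["temp_c"].iloc[2] - df["temp_c"].iloc[1]
print("Temperature difference:", temp_difference)

# Ratio: a true zero makes ratios meaningful, so 90 kg is three times 30 kg
weight_ratio = df["weight_kg"].iloc[2] / df["weight_kg"].iloc[1]
print("Weight ratio:", weight_ratio)
```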

### What is the measure of central tendency?

It describes the center or typical value of a dataset using mean, median, and mode.

### Define mean, median, and mode.

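- **Mean**: The arithmetic average: the sum of all values divided by the number of values.
- **Median**: The middle value of the sorted data; with an even number of values, the average of the two middle values.
- **Mode**: The most frequently occurring value. A dataset can have more than one mode.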
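
A small worked example may help. The values below are hypothetical, and the sketch uses Python's built-in `statistics` module rather than NumPy or SciPy.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]  # hypothetical values, already sorted

mean_score = statistics.mean(scores)         # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median_score = statistics.median(scores)     # average of the middle pair 3 and 5 = 4
mode_scores = statistics.multimode(scores)   # every most-frequent value: [3]

print("Mean:", mean_score)
print("Median:", median_score)
print("Mode(s):", mode_scores)
```

The single large value (10) pulls the mean above the median, which is why the median is often preferred for skewed data.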

### What is the significance of the measure of central tendency?

It provides a summary of the data's central value and helps in comparison and decision-making.

### What is variance, and how is it calculated?

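Variance measures how far the values in a dataset spread out from their mean. The population variance is the average squared deviation from the mean:

Variance = Σ(x_i - mean)^2 / n

When the data are a sample rather than the whole population, the sum is divided by n - 1 instead of n (Bessel's correction) to give an unbiased estimate.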
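
The calculation can be traced step by step. The sketch below, using an illustrative five-value dataset, computes the variance by hand with divisor n and with divisor n - 1, then checks both against `np.var`, whose `ddof` argument sets the divisor to n - ddof.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])  # mean is 30

# Squared deviations from the mean: 400, 100, 0, 100, 400 (sum = 1000)
squared_deviations = (data - data.mean()) ** 2

# Population variance: divide by n
population_variance = squared_deviations.sum() / len(data)

# Sample variance: divide by n - 1
sample_variance = squared_deviations.sum() / (len(data) - 1)

# np.var divides by n - ddof; ddof=0 is the default
numpy_population = np.var(data)
numpy_sample = np.var(data, ddof=1)

print("Population variance:", population_variance)  # 200.0
print("Sample variance:", sample_variance)          # 250.0
```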

### What is standard deviation, and why is it important?

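Standard deviation is the square root of the variance, so it measures spread in the same units as the data. A small standard deviation means the values cluster tightly around the mean; a large one means they are widely scattered.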

### Define and explain the term range in statistics.

Range is the difference between the maximum and minimum values in a dataset.

### What is the difference between variance and standard deviation?

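Variance is the average squared deviation from the mean, so it is expressed in squared units. Standard deviation is the square root of the variance and is expressed in the original units of the data, which makes it easier to interpret.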

### What is skewness in a dataset?

Skewness measures the asymmetry of data distribution.

### What does it mean if a dataset is positively or negatively skewed?

- **Positively skewed**: The right tail is longer. Most values cluster on the left, and the mean is typically greater than the median.
- **Negatively skewed**: The left tail is longer. Most values cluster on the right, and the mean is typically less than the median.
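
The sign of the skewness can be checked on a tiny example. The sketch below uses a hand-written helper, `sample_skewness` (a hypothetical name), that computes the moment-based coefficient: the third central moment divided by the second central moment raised to the power 1.5. The two datasets are made up, and the second is a mirror image of the first.

```python
import numpy as np

def sample_skewness(values):
    """Third central moment divided by the second central moment ** 1.5."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)
    m3 = np.mean(deviations ** 3)
    return m3 / m2 ** 1.5

# One large value stretches the right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]
# Mirror image: one small value stretches the left tail
left_skewed = [12 - v for v in right_skewed]

for name, values in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name,
          "| mean:", np.mean(values),
          "| median:", np.median(values),
          "| skewness:", round(sample_skewness(values), 3))
```

In the right-skewed data the mean (3.5) sits above the median (3.0); in the mirrored data the mean (8.5) sits below the median (9.0).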

### Define and explain kurtosis.

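Kurtosis measures the 'tailedness' of a distribution, usually relative to a normal distribution.
- **High kurtosis**: Heavy tails, so extreme values are more likely than under a normal distribution.
- **Low kurtosis**: Light tails, so extreme values are less likely.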
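
To see the contrast numerically, the sketch below compares excess kurtosis (kurtosis minus 3, so that a normal distribution scores about 0) for three simulated distributions. The helper name `excess_kurtosis`, the seed, and the sample size are assumptions made for illustration. `scipy.stats.kurtosis` reports this same excess form by default.

```python
import numpy as np

def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)
    m4 = np.mean(deviations ** 4)
    return m4 / m2 ** 2 - 3.0

rng = np.random.default_rng(seed=0)  # seeded for reproducibility
n = 100_000

k_uniform = excess_kurtosis(rng.uniform(-1, 1, n))  # light tails (theoretical value -1.2)
k_normal = excess_kurtosis(rng.normal(0, 1, n))     # reference case (theoretical value 0)
k_laplace = excess_kurtosis(rng.laplace(0, 1, n))   # heavy tails (theoretical value 3)

print("Uniform:", round(k_uniform, 2))
print("Normal: ", round(k_normal, 2))
print("Laplace:", round(k_laplace, 2))
```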

### What is the purpose of covariance?

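Covariance measures how two variables vary together. It is positive when they tend to increase together and negative when one tends to increase as the other decreases. Its magnitude depends on the units of the variables, so it indicates direction rather than strength.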

### What does correlation measure in statistics?

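Correlation measures the strength and direction of the relationship between two variables on a unit-free scale. The Pearson coefficient captures linear relationships, while the Spearman coefficient captures monotonic ones based on ranks.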

### What is the difference between covariance and correlation?

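- **Covariance**: Measures how two variables move together. It has no fixed range, and its value changes with the units of measurement.
- **Correlation**: Covariance divided by the product of the two standard deviations, which gives a unit-free measure of relationship strength between -1 and 1.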
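
The unit dependence is easy to demonstrate. In the sketch below, which uses hypothetical height and weight measurements, converting heights from metres to centimetres multiplies the covariance by 100 but leaves the Pearson correlation unchanged.

```python
import numpy as np

# Hypothetical paired measurements
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 69.0, 75.0, 86.0])

# np.cov and np.corrcoef return 2x2 matrices; [0, 1] is the cross term
cov_m = np.cov(height_m, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]

# Same heights expressed in centimetres
height_cm = height_m * 100
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print("Covariance (m, kg): ", round(cov_m, 3))
print("Covariance (cm, kg):", round(cov_cm, 3))
print("Correlation (m, kg): ", round(corr_m, 4))
print("Correlation (cm, kg):", round(corr_cm, 4))
```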

### What are some real-world applications of statistics?

- Business forecasting
- Healthcare analytics
- Financial risk assessment
- Quality control in manufacturing
- Sports analytics

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore, skew, kurtosis, pearsonr, spearmanr, mode
from sklearn.model_selection import train_test_split

# 1. Calculate the mean, median, and mode of a dataset
data_vals = [10, 20, 30, 40, 50]
mean_val = np.mean(data_vals)
median_val = np.median(data_vals)
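# Newer SciPy releases return a scalar mode and older ones a length-1 array; np.atleast_1d handles both.
# All five values are unique here, so no value is really "most frequent"; SciPy still returns one of them.
mode_val = np.atleast_1d(mode(data_vals).mode)[0]
print("Mean:", mean_val)
print("Median:", median_val)
print("Mode:", mode_val)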

# 2. Compute the variance and standard deviation of a dataset
variance = np.var(data_vals, ddof=1)
std_dev = np.std(data_vals, ddof=1)
print("Variance:", variance)
print("Standard Deviation:", std_dev)

# 3. Create a dataset and classify it into nominal, ordinal, interval, and ratio types
dataset_types = {
    'Nominal': ['Red', 'Blue', 'Green', 'Yellow', 'Black'],
    'Ordinal': ['Low', 'Medium', 'High', 'Very High'],
    'Interval': [10, 20, 30, 40, 50],
    'Ratio': [1, 2, 4, 8, 16]
}
# Wrapping each list in a Series lets pandas pad the shorter Ordinal column with NaN
df_types = pd.DataFrame({k: pd.Series(v) for k, v in dataset_types.items()})
print("Dataset Classification:")
print(df_types)

# 4. Implement sampling techniques like random sampling and stratified sampling
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [40000, 45000, 50000, 60000, 70000, 75000, 80000, 85000, 90000, 95000]
}
df = pd.DataFrame(data)
random_sample = df.sample(n=5)
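# Every Age value is unique, so stratifying on Age directly raises an error; stratify on age bands instead
age_band = np.where(df['Age'] <= 45, '45 or under', 'over 45')
strata_sample, _ = train_test_split(df, test_size=0.5, stratify=age_band)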
print("Random Sampling:")
print(random_sample)
print("Stratified Sampling:")
print(strata_sample)

# 5. Write a Python function to calculate the range of a dataset
def calculate_range(dataset):
    return max(dataset) - min(dataset)

range_val = calculate_range(data_vals)
print("Range:", range_val)

# 6. Create a dataset and plot its histogram to visualize skewness
data_skew = np.random.normal(50, 10, 1000)
sns.histplot(data_skew, kde=True)
plt.title("Histogram for Skewness")
plt.show()

# 7. Calculate skewness and kurtosis of a dataset
print("Skewness:", skew(data_skew))
print("Kurtosis:", kurtosis(data_skew))

# 8. Generate a dataset and demonstrate positive and negative skewness
pos_skew_data = np.random.exponential(scale=2, size=1000)
# Squaring a normal sample would skew it to the right; Beta(5, 1) piles up near 1 with a long left tail
neg_skew_data = np.random.beta(a=5, b=1, size=1000)
sns.histplot(pos_skew_data, kde=True)
plt.title("Positively Skewed Data")
plt.show()
sns.histplot(neg_skew_data, kde=True)
plt.title("Negatively Skewed Data")
plt.show()

# 9. Write a Python script to calculate covariance between two datasets
x = np.random.rand(10)
y = np.random.rand(10)
covariance = np.cov(x, y)[0, 1]
print("Covariance:", covariance)

# 10. Write a Python script to calculate the correlation coefficient between two datasets
pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)
print("Pearson Correlation:", pearson_corr)
print("Spearman Correlation:", spearman_corr)

# 11. Create a scatter plot to visualize the relationship between two variables
plt.scatter(x, y)
plt.xlabel("Variable X")
plt.ylabel("Variable Y")
plt.title("Scatter Plot of Two Variables")
plt.show()

# 12. Implement and compare simple random sampling and systematic sampling
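k = 2                         # sampling interval: keep every 2nd row
start = np.random.randint(k)  # random starting row within the first interval
systematic_sample = df.iloc[start::k]
print("Simple Random Sampling:")
print(df.sample(n=len(systematic_sample)))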
print("Systematic Sampling:")
print(systematic_sample)

# 13. Calculate the mean, median, and mode of grouped data
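# Band the ages first: every Age is unique, so grouping on raw Age gives one row per group
age_bands = pd.cut(df['Age'], bins=[20, 40, 60, 80])
grouped_data = df.groupby(age_bands, observed=True)['Salary'].agg(
    mean='mean', median='median', mode=lambda s: s.mode().iloc[0]  # first mode if there are ties
)
print("Grouped Data Mean, Median, and Mode:")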
print(grouped_data)

# 14. Simulate data using Python and calculate its central tendency and dispersion
simulated_data = np.random.normal(100, 15, 1000)
print("Simulated Data Mean:", np.mean(simulated_data))
# ddof=1 gives the sample standard deviation, consistent with the earlier steps
print("Simulated Data Standard Deviation:", np.std(simulated_data, ddof=1))

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore, skew, kurtosis, pearsonr, spearmanr, mode

# 15. Use NumPy or pandas to summarize a dataset’s descriptive statistics
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [40000, 45000, 50000, 60000, 70000, 75000, 80000, 85000, 90000, 95000]
}
df = pd.DataFrame(data)
print(df.describe())

# 16. Plot a boxplot to understand the spread and identify outliers
sns.boxplot(y=df['Salary'])
plt.title("Boxplot of Salary Distribution")
plt.show()

# 17. Calculate the interquartile range (IQR) of a dataset
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
print("Interquartile Range (IQR):", IQR)

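# 18. Implement Z-score normalization and explain its significance
# A z-score is (value - mean) / standard deviation. It rescales a variable to mean 0 and
# standard deviation 1, so variables measured in different units become directly comparable.
# Note: scipy's zscore uses the population standard deviation (ddof=0) by default.
df['Salary_Zscore'] = zscore(df['Salary'])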
print(df[['Salary', 'Salary_Zscore']])

# 19. Compare two datasets using their standard deviations
dataset1 = np.array([10, 20, 30, 40, 50])
dataset2 = np.array([5, 15, 25, 35, 45])
std1 = np.std(dataset1, ddof=1)
std2 = np.std(dataset2, ddof=1)
print("Standard Deviation of Dataset 1:", std1)
print("Standard Deviation of Dataset 2:", std2)

# 20. Visualize covariance using a heatmap
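# Use only the original columns; the derived Salary_Zscore column would duplicate Salary
cov_matrix = df[['Age', 'Salary']].cov()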
sns.heatmap(cov_matrix, annot=True, cmap="coolwarm")
plt.title("Covariance Matrix Heatmap")
plt.show()

# 21. Use seaborn to create a correlation matrix for a dataset
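# Use only the original columns; Salary_Zscore is perfectly correlated with Salary by construction
corr_matrix = df[['Age', 'Salary']].corr()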
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()

# 22. Generate a dataset and implement both variance and standard deviation computations
data_vals = [10, 20, 30, 40, 50]
variance = np.var(data_vals, ddof=1)
std_dev = np.std(data_vals, ddof=1)
print("Variance:", variance)
print("Standard Deviation:", std_dev)

# 23. Visualize skewness and kurtosis using Python libraries like matplotlib or seaborn
data_skew = np.random.normal(50, 10, 1000)
sns.histplot(data_skew, kde=True)
plt.title("Histogram for Skewness and Kurtosis")
plt.show()
print("Skewness:", skew(data_skew))
print("Kurtosis:", kurtosis(data_skew))

# 24. Implement the Pearson and Spearman correlation coefficients for a dataset
x = np.random.rand(10)
y = np.random.rand(10)
pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)
print("Pearson Correlation:", pearson_corr)
print("Spearman Correlation:", spearman_corr)
