# Statistics Basics

Q1 What is statistics, and why is it important?

Statistics is a branch of mathematics focused on collecting, analyzing, interpreting, presenting, and organizing data. Essentially, it's the science of making sense of data by identifying patterns, trends, and relationships.
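Statistics is important because it turns raw data into evidence for decisions: it lets us summarize large datasets, generalize from a sample to a population, and quantify uncertainty in fields such as medicine, business, and public policy.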

Q2 What are the two main types of statistics?

The two main types of statistics are:

1. **Descriptive Statistics**:  
   - These summarize and describe the main features of a dataset.  
   - Common methods include measures like mean, median, mode, range, variance, and standard deviation.  
   - Tools like graphs, charts, and tables fall under this type to visually represent data.

2. **Inferential Statistics**:  
   - These are used to make predictions or generalizations about a population based on a sample.  
   - Techniques include hypothesis testing, confidence intervals, and regression analysis.  
   - It involves drawing conclusions and making decisions under uncertainty.

Q3 What are descriptive statistics?

Descriptive statistics involve methods for summarizing and organizing data so that it can be easily understood and interpreted. These statistics provide a clear snapshot of the main characteristics of a dataset without making predictions or inferences.

1. **Central Tendency Measures**:
   - Mean (average)
   - Median (middle value)
   - Mode (most frequent value)

2. **Dispersion Measures**:
   - Range (difference between highest and lowest values)
   - Variance (spread of the data)
   - Standard Deviation (how much data deviates from the mean)

3. **Visualization Tools**:
   - Graphs (e.g., bar graphs, pie charts)
   - Tables (e.g., frequency distribution)
   - Charts (e.g., histograms)

Q4 What is inferential statistics?

Inferential statistics is a branch of statistics that allows us to make predictions, generalizations, or conclusions about a population based on a smaller sample of data. It goes beyond merely describing the data and helps us infer patterns or make decisions under conditions of uncertainty.

Key Aspects of Inferential Statistics:

1. **Sampling**: Instead of studying an entire population (which is often impractical), inferential statistics relies on analyzing a representative sample.

2. **Hypothesis Testing**:
   - Formulating and testing assumptions (hypotheses) about the population.
   - Determining whether observed patterns in data are statistically significant.

3. **Confidence Intervals**: Providing an estimated range of values within which the true population parameter is likely to fall, with a certain level of confidence (e.g., 95%).

4. **Prediction Models**: Techniques like regression analysis to predict future trends or relationships between variables.
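
As an illustration of the confidence-interval idea above, here is a minimal sketch that estimates a 95% interval for a population mean. It assumes a normal approximation (a z-value of about 1.96) rather than a t-distribution, and the `sample` values and the helper name `mean_confidence_interval` are made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) for the population mean using a normal approximation."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    center = mean(sample)
    # Standard error of the mean: sample standard deviation divided by sqrt(n)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value; roughly 1.96 when confidence is 0.95
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]  # hypothetical measurements
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

For small samples a t-based interval would be somewhat wider; the normal approximation is used here only to keep the sketch within the standard library.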

Q5 What is sampling in statistics?

Sampling in statistics refers to the process of selecting a subset (sample) of individuals, items, or data points from a larger population to analyze and draw conclusions about the entire population. Since studying an entire population is often impractical or impossible, sampling makes it feasible to study and understand patterns, trends, and characteristics efficiently.

Types of Sampling:

1. **Random Sampling**: Every individual in the population has an equal chance of being selected.
2. **Stratified Sampling**: The population is divided into groups (strata) based on shared characteristics, and samples are taken from each group.
3. **Systematic Sampling**: Items are selected at regular intervals from an ordered population list.
4. **Cluster Sampling**: The population is divided into clusters, and a sample of clusters is chosen for analysis.
5. **Convenience Sampling**: Samples are selected based on ease of access, though this can lead to biased results.

Q6 What are the different types of sampling methods?

Sampling methods are strategies used to select a portion of a population for analysis. Different methods ensure that the sample accurately represents the population or meets specific study requirements. Here are the key types:

1. **Probability Sampling** (Every individual has a known, non-zero chance of being selected):
   - **Simple Random Sampling**: Every member of the population has an equal chance of selection, often using random number generators.
   - **Stratified Sampling**: The population is divided into distinct groups (strata), and samples are taken proportionally from each.
   - **Systematic Sampling**: Members are selected at regular intervals from an ordered population list.
   - **Cluster Sampling**: The population is divided into clusters, and specific clusters are randomly selected for study.

2. **Non-Probability Sampling** (Some individuals have no chance, or an unknown chance, of being selected):
   - **Convenience Sampling**: Selection based on ease of access (e.g., surveying people nearby).
   - **Judgmental/Purposive Sampling**: Individuals are chosen based on the researcher’s judgment about their suitability.
   - **Quota Sampling**: Samples are chosen to meet predefined quotas for specific groups.
   - **Snowball Sampling**: Current participants recruit others, often used for hard-to-reach populations.
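
Two of the probability methods listed above, systematic and cluster sampling, can be sketched as follows. This is only an illustration: the function names, the population of 20 numbered items, and the four named clusters are hypothetical.

```python
import random

def systematic_sample(population, step, start=0):
    """Take every `step`-th item from an ordered list, beginning at index `start`."""
    return population[start::step]

def cluster_sample(clusters, k, rng):
    """Randomly pick k whole clusters and return all of their members."""
    chosen = rng.sample(sorted(clusters), k)
    return {name: clusters[name] for name in chosen}

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Systematic sampling: random starting point, then every 5th item
population = list(range(1, 21))
print("Systematic:", systematic_sample(population, step=5, start=rng.randrange(5)))

# Cluster sampling: select 2 of the 4 clusters and keep every member of each
clusters = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8], "west": [9, 10]}
print("Cluster:", cluster_sample(clusters, k=2, rng=rng))
```

Note the contrast with stratified sampling: a stratified design draws some members from every group, while a cluster design keeps all members of only a few groups.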

Q7 What is the difference between random and non-random sampling?

The primary difference between random and non-random sampling lies in how samples are selected and the level of bias involved in the process.

**Random Sampling**:

- **Definition**: Every individual in the population has a known, non-zero chance of being selected (an equal chance in simple random sampling). This selection is made through random processes, such as lottery draws or random number generators.
- **Features**:
  - Minimizes bias.
  - Results are more likely to represent the population accurately.
  - Examples: Simple Random Sampling, Stratified Sampling, Cluster Sampling.
- **Use Case**: Ideal for studies where accuracy and generalizability to the population are critical.

 **Non-Random Sampling**:

- **Definition**: Samples are selected based on specific criteria or convenience rather than randomization, meaning not all individuals have a chance of being included.
- **Features**:
  - Can introduce bias, affecting the reliability of the findings.
  - Often faster and easier to implement but may lack population representativeness.
  - Examples: Convenience Sampling, Purposive Sampling, Snowball Sampling.
- **Use Case**: Suitable for exploratory studies or when resources and time are limited.


Q8 Define and give examples of qualitative and quantitative data.

**Qualitative Data**:

- **Definition**: Non-numerical data that describes qualities, characteristics, or categories. It provides descriptive insights and is often subjective.

- **Examples**:

  - Hair colors of individuals (e.g., black, brown, blonde).
  - Types of cuisine (e.g., Indian, Italian, Chinese).
  - Customer feedback (e.g., "excellent," "average," "poor").
  - Marital status (e.g., single, married, divorced).

**Quantitative Data**:

- **Definition**: Numerical data that represents quantities, amounts, or measurements. It can be analyzed mathematically.
- **Examples**:

  - Heights of people (e.g., 150 cm, 165 cm, 180 cm).
  - Test scores (e.g., 85, 90, 72).
  - Daily temperature readings (e.g., 25°C, 30°C, 35°C).
  - Number of employees in a company (e.g., 50, 200, 1,000).

Q9 What are the different types of data in statistics?

In statistics, data is broadly categorized into **qualitative** (categorical) and **quantitative** (numerical) types. These are further divided into subtypes:

1. **Qualitative Data** (Categorical):
   - **Nominal Data**: Labels or categories without a specific order.  
     *Examples*: Gender (male/female), Blood type (A, B, AB, O), Eye color (blue, green, brown).  
   - **Ordinal Data**: Categories with a meaningful order but no fixed interval between values.  
     *Examples*: Customer satisfaction levels (poor, fair, excellent), Education level (high school, undergraduate, postgraduate).

2. **Quantitative Data** (Numerical):
   - **Discrete Data**: Whole numbers, often counts or items that can’t be divided.  
     *Examples*: Number of students in a class, Cars in a parking lot, Goals scored in a match.  
   - **Continuous Data**: Measurable quantities that can take any value within a range.  
     *Examples*: Height (5.8 ft), Weight (68.5 kg), Temperature (36.7°C).

Q10 Explain nominal, ordinal, interval, and ratio levels of measurement.

The levels of measurement are classifications used in statistics to describe the nature of data. These levels help determine the appropriate statistical techniques for analysis. Here’s an explanation of each:

1. **Nominal Level**:
   - **Definition**: Data that consists of labels or categories without a specific order. It’s used to classify or group data.
   - **Characteristics**:
     - No meaningful rank or order.
     - Cannot perform mathematical operations.
   - **Examples**:
     - Gender (male, female, non-binary).
     - Eye colors (blue, green, brown).
     - Types of fruits (apple, orange, banana).

2. **Ordinal Level**:
   - **Definition**: Data with categories that have a meaningful order or ranking, but the intervals between ranks are not equal or measurable.
   - **Characteristics**:
     - Represents relative positioning.
     - Differences between ranks are undefined.
   - **Examples**:
     - Customer satisfaction levels (poor, fair, good, excellent).
     - Education levels (high school, undergraduate, postgraduate).
     - Race standings (1st, 2nd, 3rd).

3. **Interval Level**:
   - **Definition**: Data with ordered values where the intervals between values are meaningful, but there is no true zero point.
   - **Characteristics**:
     - Allows addition and subtraction.
     - Ratios (e.g., twice or half) are not meaningful.
   - **Examples**:
     - Temperature in Celsius or Fahrenheit (e.g., 20°C, 30°C).
     - Time of day on a 12-hour clock.
     - IQ scores.

4. **Ratio Level**:
   - **Definition**: Data with ordered values, equal intervals, and a true zero point, allowing for meaningful ratios.
   - **Characteristics**:
     - All arithmetic operations (e.g., addition, subtraction, multiplication, division) are valid.
     - Has a clear “absence” of the measured attribute at zero.
   - **Examples**:
     - Height (e.g., 150 cm, 170 cm).
     - Weight (e.g., 60 kg, 80 kg).
     - Income (e.g., $0, $50,000).
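
The difference between the interval and ratio levels can be made concrete with temperature. The sketch below is an illustration with made-up readings: differences in Celsius are meaningful, but the ratio only becomes meaningful after converting to Kelvin, which has a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, a scale with a true zero."""
    return celsius + 273.15

a_c, b_c = 20.0, 40.0  # hypothetical readings in degrees Celsius

# Interval scale: the difference is meaningful
print("Difference:", b_c - a_c, "degrees")

# The naive Celsius ratio suggests "twice as hot", which is not physically meaningful
print("Celsius ratio:", b_c / a_c)

# Ratio scale: on the Kelvin scale the ratio is meaningful
kelvin_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)
print("Kelvin ratio:", round(kelvin_ratio, 3))
```

The Kelvin ratio is only slightly above 1, which shows why statements like "twice as warm" should be avoided for interval data.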

Q11 What is the measure of central tendency?

The measure of central tendency refers to statistical metrics used to identify the central or typical value in a dataset. These measures help summarize the dataset and provide insights into its central characteristics.

Key Measures of Central Tendency:

1. **Mean** (Average):
   - Calculated by summing all data values and dividing by the number of values.
   - *Example*: The mean of 10, 20, 30 is \((10+20+30)/3 = 20\).

2. **Median**:
   - The middle value when data is arranged in ascending or descending order.
   - *Example*: For the dataset \[10, 20, 30\], the median is 20. If there’s an even number of data points, it’s the average of the two middle values.

3. **Mode**:
   - The value that appears most frequently in a dataset.
   - *Example*: In \[10, 20, 20, 30\], the mode is 20.

Q12 Define mean, median, and mode.

Each measure is defined as follows:

1. **Mean**:
   - The mean, often called the average, is calculated by adding all the values in a dataset and dividing by the total number of values.
   - **Example**: For the dataset \[5, 10, 15\], the mean is \((5 + 10 + 15) / 3 = 10\).

2. **Median**:
   - The median is the middle value in an ordered dataset. If the dataset has an odd number of values, it’s the one in the center. If even, it’s the average of the two middle values.
   - **Example**: For \[10, 20, 30\], the median is 20. For \[10, 20, 30, 40\], it’s \((20 + 30) / 2 = 25\).

3. **Mode**:
   - The mode is the value that occurs most frequently in a dataset.
   - **Example**: In \[10, 20, 20, 30\], the mode is 20.


Q13 What is the significance of the measure of central tendency?

The measure of central tendency plays a critical role in understanding and interpreting datasets. Its significance lies in the way it provides a single value that summarizes the entire dataset, reflecting the "center" or typical value. Here's why it matters:

 **Significance:**
1. **Simplification of Data**:
   - In large datasets, measures like mean, median, and mode make it easier to grasp overall trends without analyzing every individual value.

2. **Comparison**:
   - Central tendency allows for the comparison of different datasets or groups. For example, comparing average income across regions or countries.

3. **Decision-Making**:
   - Businesses and policymakers rely on measures like mean sales or median incomes to make informed decisions.

4. **Identification of Trends**:
   - It helps identify where data values cluster, highlighting dominant trends or recurring patterns.

5. **Selection of Statistical Methods**:
   - Choosing a measure of central tendency guides further analysis, such as determining dispersion (variance, standard deviation).

Q14 What is variance, and how is it calculated?

**Variance**:

Variance is a statistical measure that represents the degree to which data points in a dataset spread out from the mean (average). It quantifies variability or dispersion, showing how far individual values deviate from the mean.

**How is Variance Calculated?**

To calculate variance, follow these steps:
1. **Find the Mean**:
   - Calculate the average of all the data points.

2. **Compute Deviations**:
   - Subtract the mean from each data point to determine the deviation.

3. **Square the Deviations**:
   - Square each deviation (to eliminate negative values).

4. **Find the Average of Squared Deviations**:
   - Add up all the squared deviations and divide by the total number of values (for a population) or by one less than the total (for a sample).

**Formula**:
- **Population Variance (\( \sigma^2 \))**:
  $$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $$
  - \( x_i \): Each data point
  - \( \mu \): Mean of the population
  - \( N \): Total number of data points

- **Sample Variance (\( s^2 \))**:
  $$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $$
  - \( \bar{x} \): Mean of the sample
  - \( n \): Total number of sample points
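
The four steps and both formulas above can be sketched directly in Python. The helper name `variance_by_steps` is made up for this example; the results are cross-checked against `pvariance` and `variance` from the standard `statistics` module.

```python
from statistics import pvariance, variance

def variance_by_steps(data, sample=False):
    """Compute variance step by step; divide by n - 1 when `sample` is True."""
    n = len(data)
    mean_value = sum(data) / n                                  # Step 1: mean
    squared_deviations = [(x - mean_value) ** 2 for x in data]  # Steps 2 and 3
    divisor = n - 1 if sample else n                            # Step 4: choose divisor
    return sum(squared_deviations) / divisor

data = [2, 4, 6, 8, 10]
print("Population variance:", variance_by_steps(data))           # 8.0
print("Sample variance:", variance_by_steps(data, sample=True))  # 10.0

# The standard library gives the same values
print(pvariance(data), variance(data))
```

Here the mean is 6 and the squared deviations sum to 40, so dividing by 5 gives 8 and dividing by 4 gives 10.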

Q15 What is standard deviation, and why is it important?

 **Standard Deviation**:

Standard deviation measures the amount of variation or dispersion in a dataset. It indicates how much individual data points deviate from the mean (average) of the dataset. A small standard deviation means the data points are close to the mean, while a large standard deviation indicates they are spread out over a wider range.

**Formula**:

The standard deviation is the square root of variance:
$$ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} $$
- **For Population Standard Deviation (\( \sigma \))**: Use the total population size \( N \).
- **For Sample Standard Deviation (\( s \))**: Use \( n - 1 \) instead of \( N \) (degrees of freedom).

 **Steps to Calculate**:

1. Find the mean (\( \mu \)).
2. Compute the deviation of each data point from the mean.
3. Square each deviation.
4. Find the average of squared deviations (variance).
5. Take the square root of the variance.

 **Importance of Standard Deviation**:

1. **Understanding Data Spread**:
   - Provides insight into the consistency of data. For example, in test scores, a small standard deviation suggests most students performed similarly, while a large one shows wide variations in scores.

2. **Comparison**:
   - Helps compare datasets. For example, two products' reliability can be compared using their standard deviations in performance tests.

3. **Risk Assessment**:
   - In finance, standard deviation is used to measure investment risk (volatility).

4. **Normal Distribution**:
   - A crucial feature of bell curves: about 68% of data lies within one standard deviation of the mean, and about 95% lies within two.

5. **Decision-Making**:
   - Offers actionable insights by quantifying variability, useful in quality control, research, and predictive modeling.
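
The 68% and 95% figures for the normal distribution can be checked empirically. The following sketch simulates normally distributed values with a fixed seed; the mean of 50, standard deviation of 10, and sample size are arbitrary choices for illustration.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)  # fixed seed for reproducibility
data = [rng.gauss(50, 10) for _ in range(20_000)]  # simulated bell-shaped data

mu = fmean(data)
sigma = pstdev(data)

def share_within(values, center, spread, k):
    """Fraction of values lying within k standard deviations of the center."""
    return sum(abs(x - center) <= k * spread for x in values) / len(values)

print(f"Within 1 SD: {share_within(data, mu, sigma, 1):.1%}")
print(f"Within 2 SD: {share_within(data, mu, sigma, 2):.1%}")
```

The printed shares should land close to 68% and 95%; they will not match exactly because the data is a finite random sample.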

Q16 Define and explain the term range in statistics.

 **Range in Statistics**:

The range is a measure of dispersion that represents the difference between the highest and the lowest values in a dataset. It provides a simple way to describe the spread of data by focusing on its extreme points.

 **Formula**:

$$ \text{Range} = \text{Maximum Value} - \text{Minimum Value} $$

**Significance**:

1. **Understanding Data Spread**:
   - The range gives a quick estimate of how spread out the data is. A larger range indicates a wider spread, while a smaller range means the data is more clustered.

2. **Ease of Calculation**:
   - It’s one of the simplest measures of variability, making it useful for a quick overview of the data.

3. **Limitations**:
   - **Sensitivity to Outliers**: The range only considers the extreme values, ignoring the rest of the data, so outliers can distort the result.
   - **Doesn't Show Distribution**: It doesn’t provide information about the distribution or central tendency of the data.

Applications:

The range is commonly used in exploratory data analysis, quality control, and comparing datasets to identify variability or consistency.
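
The sensitivity to outliers mentioned under the limitations is easy to demonstrate. In this small sketch, with made-up test scores, a single extreme value changes the range dramatically even though the other values are untouched.

```python
def value_range(data):
    """Range = maximum value minus minimum value."""
    return max(data) - min(data)

scores = [70, 72, 75, 78, 80]  # hypothetical test scores
with_outlier = scores + [5]    # one unusually low score added

print("Range without outlier:", value_range(scores))     # 10
print("Range with outlier:", value_range(with_outlier))  # 75
```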

Q17 What is the difference between variance and standard deviation?

Variance and standard deviation are both measures of data dispersion, but they differ in how they represent that variability.

 **Variance**:

- **Definition**: It measures the average squared deviations of each data point from the mean. It's expressed in squared units of the original data.
- **Formula**:  
  $$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $$ for a population,  
  $$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $$ for a sample.
- **Purpose**: Useful for understanding the variability of data but can be harder to interpret because it’s in squared units.
- **Example**: If the data is in kilograms, variance will be in **kg²**.



**Standard Deviation**:

- **Definition**: It’s the square root of variance, providing a measure of dispersion in the same units as the original data.
- **Formula**:  
  $$ \sigma = \sqrt{\sigma^2} $$ (population),  
  $$ s = \sqrt{s^2} $$ (sample).
- **Purpose**: Easier to interpret and compare since it reflects the average deviation in original units.
- **Example**: If the data is in kilograms, standard deviation will also be in **kg**.


 **Key Differences**:

| **Aspect**           | **Variance**                   | **Standard Deviation**           |
|-----------------------|--------------------------------|-----------------------------------|
| **Units**            | Squared units (e.g., kg²)     | Same units as the data (e.g., kg) |
| **Ease of Interpretation** | Harder to interpret       | Easier to understand             |
| **Mathematical Relation** | Basis for standard deviation | Square root of variance          |


Q18 What is skewness in a dataset?

 **Skewness in a Dataset**:

Skewness is a statistical measure that describes the symmetry or asymmetry of a dataset's distribution relative to its mean. It helps identify whether the data is evenly distributed or if it leans more toward one side.

 **Types of Skewness**:

1. **Positive Skew (Right-Skewed)**:
   - The tail on the right side of the distribution is longer or more stretched than the left side.
   - Most values are concentrated on the lower end, and the mean is greater than the median.
   - *Example*: Income distribution in many societies, where a few high-income individuals pull the mean higher.

2. **Negative Skew (Left-Skewed)**:

   - The tail on the left side is longer or more stretched than the right side.
   - Most values are concentrated on the higher end, and the mean is less than the median.
   - *Example*: Scores on an easy test, where most students score very high but a few score low.

3. **No Skew (Symmetrical Distribution)**:

   - The distribution is perfectly symmetrical, meaning the left and right tails are mirror images.
   - The mean, median, and mode are equal.
   - *Example*: Idealized normal distribution (bell curve).

 **Significance of Skewness**:

- **Data Interpretation**: Helps understand the underlying nature of the dataset and whether assumptions of symmetry hold.
- **Choosing Statistical Methods**: Guides whether parametric or non-parametric tests are suitable for analysis.
- **Risk Assessment**: In finance, skewness helps evaluate potential risks or returns in investment portfolios.


Q19 What does it mean if a dataset is positively or negatively skewed?

When a dataset is **positively or negatively skewed**, its values are distributed asymmetrically around the center, which shifts the mean relative to the median and mode:

**Positively Skewed (Right-Skewed)**:

- **What It Means**:
  - The tail on the right side (higher values) is longer or more stretched than the left side.
  - Most data points are concentrated at the lower end, but a few extreme high values pull the mean to the right.
- **Key Indicators**:
  - Typically, Mean > Median > Mode.
  - Skewness value is **positive**.
- **Example**:
  - Income distribution: Most people earn a moderate amount, but a small number of individuals earn significantly more, stretching the right tail.

**Negatively Skewed (Left-Skewed)**:

- **What It Means**:
  - The tail on the left side (lower values) is longer or more stretched than the right side.
  - Most data points are concentrated at the higher end, but a few extreme low values pull the mean to the left.
- **Key Indicators**:
  - Typically, Mean < Median < Mode.
  - Skewness value is **negative**.
- **Example**:
  - Exam scores on an easy test: Most students score very high, but a few score very low, stretching the left tail.
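
The sign of the skewness value and the position of the mean relative to the median can be checked with a short sketch. It assumes the common moment-based definition of skewness (third central moment divided by the variance raised to the power 1.5), which is not stated in the text above, and uses small made-up datasets.

```python
from statistics import fmean, median

def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (population form, an assumption here)."""
    n = len(data)
    mu = fmean(data)
    m2 = sum((x - mu) ** 2 for x in data) / n  # second central moment (variance)
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]     # mirror image stretches the left tail

for name, values in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "skewness:", round(skewness(values), 2),
          "mean:", fmean(values), "median:", median(values))
```

In the right-skewed list the mean exceeds the median and the skewness is positive; mirroring the data flips both relationships.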


Q20 Define and explain kurtosis.

**Kurtosis**:

Kurtosis is a statistical measure that describes the shape of a dataset's distribution, specifically the "tailedness" or the presence of outliers. It quantifies whether the data points have heavier or lighter tails compared to a normal distribution.

 **Types of Kurtosis**:

1. **Mesokurtic**:
   - Distributions with kurtosis similar to that of a normal distribution (kurtosis = 3, excess kurtosis = 0).
   - Tails are of moderate thickness, with no significant outliers.
   - *Example*: Normal distribution.

2. **Leptokurtic**:
   - Distributions with kurtosis greater than 3 (excess kurtosis > 0).
   - Tails are thicker, indicating the presence of more extreme outliers.
   - *Example*: Income data with many extremely high earners.

3. **Platykurtic**:
   - Distributions with kurtosis less than 3 (excess kurtosis < 0).
   - Tails are thinner, meaning fewer extreme outliers.
   - *Example*: Uniform distribution.

**Formula for Kurtosis**:

Kurtosis is computed using:
$$ K = \frac{\sum (x_i - \mu)^4}{N \cdot \sigma^4} $$
Where:
- \( x_i \): Each data point
- \( \mu \): Mean of the data
- \( \sigma \): Standard deviation of the data, so \( \sigma^4 \) is the variance squared
- \( N \): Number of data points
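
The formula above translates directly into code. This sketch implements it with the standard library and compares simulated normal data (expected kurtosis near 3) with simulated uniform data (expected kurtosis near 1.8); the seed and sample sizes are arbitrary choices for illustration.

```python
import random
from statistics import fmean

def kurtosis(data):
    """Kurtosis K = sum((x - mu)^4) / (N * sigma^4), using the population variance."""
    n = len(data)
    mu = fmean(data)
    variance = sum((x - mu) ** 2 for x in data) / n
    fourth_moment = sum((x - mu) ** 4 for x in data) / n
    return fourth_moment / variance ** 2  # sigma^4 equals the variance squared

rng = random.Random(1)  # fixed seed for reproducibility
normal_like = [rng.gauss(0, 1) for _ in range(50_000)]
uniform_like = [rng.uniform(0, 1) for _ in range(50_000)]

print("Normal data kurtosis:", round(kurtosis(normal_like), 2))    # close to 3 (mesokurtic)
print("Uniform data kurtosis:", round(kurtosis(uniform_like), 2))  # close to 1.8 (platykurtic)
```

Subtracting 3 from either result gives the excess kurtosis described in the list of types.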

Q21 What is the purpose of covariance?

 **Purpose of Covariance**:

Covariance is a statistical measure that quantifies the relationship between two variables. Specifically, it shows whether an increase in one variable corresponds to an increase or decrease in another. The primary purpose of covariance is to understand the directional relationship between variables.

 **Key Objectives**:

1. **Identify Relationships**:

   - Determines whether two variables move together (positive covariance) or in opposite directions (negative covariance).

2. **Foundation for Further Analysis**:

   - Covariance is a building block for calculating the **correlation coefficient**, which standardizes the strength and direction of relationships.

3. **Model Development**:

   - In fields like finance, covariance helps assess how assets interact, aiding portfolio management and diversification strategies.

Q22 What does correlation measure in statistics?

**Correlation in Statistics**:

Correlation measures the strength and direction of a linear relationship between two variables. It quantifies how closely the changes in one variable are associated with changes in another.

 **Key Points**:

1. **Direction**:
   - **Positive Correlation**: As one variable increases, the other also increases. *Example*: Height and weight.
   - **Negative Correlation**: As one variable increases, the other decreases. *Example*: Speed and travel time.
   - **No Correlation**: No consistent relationship between the variables. *Example*: Shoe size and IQ.

2. **Strength**:
   - Measured using the **correlation coefficient** (\(r\)), which ranges from **-1 to +1**:
     - **+1**: Perfect positive correlation.
     - **-1**: Perfect negative correlation.
     - **0**: No linear correlation.

3. **Formula**:
   $$ r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} $$
   - Where \( \text{Cov}(X, Y) \) is the covariance, and \( \sigma_X \) and \( \sigma_Y \) are the standard deviations of \( X \) and \( Y \).
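
The formula for \(r\) can be sketched in plain Python as covariance divided by the product of the standard deviations. The function name `pearson_r` and the study-hours data are hypothetical; population forms (dividing by N) are used throughout so the terms match the formula above.

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient r = Cov(X, Y) / (sigma_X * sigma_Y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]         # hypothetical hours studied
scores = [52, 58, 61, 70, 74]   # hypothetical test scores

print("r (hours vs scores):", round(pearson_r(hours, scores), 3))
print("Perfect positive:", round(pearson_r([1, 2, 3], [2, 4, 6]), 3))
print("Perfect negative:", round(pearson_r([1, 2, 3], [6, 4, 2]), 3))
```

The function does not handle a constant input, where the standard deviation is zero and the correlation is undefined.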

Q23 What is the difference between covariance and correlation?

Covariance and correlation are both measures used to assess the relationship between two variables, but they differ in interpretation, scale, and purpose.

 **Key Differences**:

| **Aspect**               | **Covariance**                           | **Correlation**                     |
|--------------------------|------------------------------------------|--------------------------------------|
| **Definition**           | Quantifies how two variables change together. | Measures the strength and direction of a linear relationship. |
| **Scale**                | Values are not standardized; depend on the units of the variables. | Standardized between **-1** and **+1**. |
| **Interpretation**        | Positive covariance: Variables increase/decrease together.<br>Negative covariance: One variable increases while the other decreases. | Positive correlation: Variables tend to move in the same direction.<br>Negative correlation: Variables tend to move in opposite directions.<br>Zero correlation: No linear relationship.<br>The closer the absolute value of r is to 1, the stronger the linear relationship. |
| **Units**                | Based on the product of the units of both variables. | Unitless; easy to compare across datasets. |
| **Formula**              | $$ \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N} $$ | $$ r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} $$ |
| **Purpose**              | Highlights the directional relationship between variables. | Quantifies both the strength and direction of the relationship. |
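
The scale and units rows of the table can be illustrated with a short sketch: converting one variable from centimeters to meters changes the covariance by a factor of 100 but leaves the correlation unchanged. The height and weight values and the helper names are made up for this example.

```python
from math import sqrt

def covariance(x, y):
    """Population covariance: average product of deviations from the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

def correlation(x, y):
    """Covariance standardized by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical heights
weights_kg = [52.0, 58.0, 67.0, 75.0, 88.0]       # hypothetical weights
heights_m = [h / 100 for h in heights_cm]         # same heights, different unit

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)

print("Covariance (cm, kg):", round(cov_cm, 3))
print("Covariance (m, kg):", round(cov_m, 3))
print("Correlation in both cases:", round(r_cm, 4), round(r_m, 4))
```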

Q24 What are some real-world applications of statistics?

Statistics is a versatile and essential tool in countless real-world applications. Here are some examples where statistics plays a key role:

 **1. Healthcare and Medicine**:
   - **Clinical Trials**: Used to test the safety and effectiveness of new treatments or medications.
   - **Epidemiology**: Tracks and predicts the spread of diseases, such as during COVID-19.
   - **Patient Care**: Hospitals analyze patient data to improve diagnoses and treatments.

**2. Business and Economics**:
   - **Market Research**: Companies analyze consumer preferences and market trends to develop strategies.
   - **Quality Control**: Ensures product consistency through statistical sampling methods.
   - **Risk Analysis**: Evaluates financial risks, aiding in decision-making for investments and loans.

**3. Education**:
   - **Standardized Testing**: Measures student performance and compares results across populations.
   - **Curriculum Development**: Uses data on student learning outcomes to optimize teaching methods.

 **4. Sports and Entertainment**:
   - **Performance Analysis**: Tracks player statistics to improve strategies and training.
   - **Audience Analytics**: Assesses viewer preferences and trends for content optimization.

 **5. Environment and Ecology**:
   - **Climate Studies**: Analyzes weather patterns and climate changes.
   - **Wildlife Conservation**: Tracks animal populations to guide conservation efforts.

 **6. Government and Policy-Making**:
   - **Census Data**: Shapes policies by understanding population demographics.
   - **Crime Statistics**: Helps allocate resources for law enforcement and safety programs.

**7. Technology and Artificial Intelligence**:
   - **Machine Learning**: Relies on statistical models to train algorithms.
   - **Data Science**: Extracts insights from large datasets for innovation and efficiency.



**Practical Questions**

#Q1  How do you calculate the mean, median, and mode of a dataset ?

from statistics import StatisticsError, mean, median, mode

# Example dataset
data = [1, 2, 3, 3, 4, 5, 5, 5, 6]

# Calculate mean, median, and mode
mean_value = mean(data)
median_value = median(data)

# mode() raises StatisticsError for an empty dataset; on Python versions before
# 3.8 it also raises when there is no single most common value
try:
    mode_value = mode(data)
except StatisticsError:
    mode_value = "No mode"

# Print results
print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
...
#Q2 Write a Python program to compute the variance and standard deviation of a dataset ?

from statistics import variance, stdev

# Example dataset
data = [2, 4, 6, 8, 10]

# Calculate variance and standard deviation
variance_value = variance(data)
standard_deviation_value = stdev(data)

# Display the results
print("Variance:", variance_value)
print("Standard Deviation:", standard_deviation_value)
...
#Q3 Create a dataset and classify it into nominal, ordinal, interval, and ratio types ?

# Define the dataset as a dictionary
dataset = {
    "Name": ["Alice", "Bob", "Charlie", "Daisy"],
    "Category": ["Engineer", "Teacher", "Doctor", "Nurse"],  # Nominal
    "Rating": ["Excellent", "Good", "Fair", "Poor"],         # Ordinal
    "Temperature (°C)": [25, 30, 28, 24],                    # Interval
    "Income (USD)": [50000, 40000, 60000, 35000]             # Ratio
}

# Classify the variables
classification = {
    "Nominal": ["Category"],  # Represents labels without order
    "Ordinal": ["Rating"],    # Represents ordered categories
    "Interval": ["Temperature (°C)"],  # Ordered with no true zero
    "Ratio": ["Income (USD)"]          # Ordered with true zero
}

# Display the dataset and classifications
print("Dataset:")
for key, values in dataset.items():
    print(f"{key}: {values}")

print("\nClassification:")
for key, values in classification.items():
    print(f"{key}: {values}")
...
#Q4 Implement sampling techniques like random sampling and stratified sampling ?

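import random

# Dataset
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Simple random sampling: select 5 values without replacement
random_samples = random.sample(data, 5)
print("Random Samples:", random_samples)

# Stratified sampling: split the data into strata (here, odd and even values,
# chosen only for illustration) and draw 2 values from each stratum
strata = {
    "odd": [x for x in data if x % 2 == 1],
    "even": [x for x in data if x % 2 == 0],
}
stratified_samples = {name: random.sample(group, 2) for name, group in strata.items()}
print("Stratified Samples:", stratified_samples)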
...
#Q5  Write a Python function to calculate the range of a dataset ?

def calculate_range(data):
    """
    Function to calculate the range of a dataset.

    Parameters:
        data (list): A list of numerical values.

    Returns:
        int/float: The range of the dataset.
    """
    if not data:
        # Raise instead of returning a string so callers always get a number back
        raise ValueError("Dataset is empty. Please provide valid data.")

    # Calculate the range
    data_range = max(data) - min(data)
    return data_range

# Example usage
dataset = [10, 20, 30, 40, 50]
result = calculate_range(dataset)
print("Range of the dataset:", result)
...
#Q6 Create a dataset and plot its histogram to visualize skewness ?

import matplotlib.pyplot as plt

# Create a dataset
# Positively skewed dataset (e.g., income data with a few high outliers)
data = [10, 15, 15, 20, 25, 25, 25, 30, 30, 35, 40, 45, 50, 100, 200]

# Plot a histogram
plt.hist(data, bins=10, color='blue', edgecolor='black', alpha=0.7)
plt.title('Histogram of Dataset')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Add grid for better readability
plt.grid(axis='y', alpha=0.75)

# Display the plot
plt.show()
...
#Q7 Calculate skewness and kurtosis of a dataset using Python libraries ?

from scipy.stats import skew, kurtosis

# Create a dataset
data = [2, 4, 6, 8, 10, 20, 30, 40, 50, 100]  # Example dataset

# Calculate skewness and kurtosis
# Note: scipy's kurtosis() returns excess kurtosis by default (normal = 0);
# pass fisher=False to get the plain kurtosis (normal = 3)
data_skewness = skew(data)
data_kurtosis = kurtosis(data)

# Display results
print("Skewness of the dataset:", data_skewness)
print("Kurtosis of the dataset:", data_kurtosis)
...
#Q8 Generate a dataset and demonstrate positive and negative skewness ?

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew

# Generate positively skewed data (e.g., exponential distribution)
positive_skew_data = np.random.exponential(scale=2, size=1000)

# Generate negatively skewed data (by negating samples from an exponential distribution)
negative_skew_data = -1 * np.random.exponential(scale=2, size=1000)

# Calculate skewness
positive_skewness = skew(positive_skew_data)
negative_skewness = skew(negative_skew_data)

# Plot histograms
plt.figure(figsize=(12, 6))

# Positive skew
plt.subplot(1, 2, 1)
plt.hist(positive_skew_data, bins=20, color='blue', edgecolor='black', alpha=0.7)
plt.title(f'Positively Skewed Data (Skewness: {positive_skewness:.2f})')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Negative skew
plt.subplot(1, 2, 2)
plt.hist(negative_skew_data, bins=20, color='green', edgecolor='black', alpha=0.7)
plt.title(f'Negatively Skewed Data (Skewness: {negative_skewness:.2f})')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Display plots
plt.tight_layout()
plt.show()
...
#Q9 Write a Python script to calculate covariance between two datasets ?

import numpy as np

def calculate_covariance(data1, data2):
    """
    Function to calculate the covariance between two datasets.

    Parameters:
        data1 (list or array): First dataset
        data2 (list or array): Second dataset

    Returns:
        float: Covariance between the two datasets
    """
    if len(data1) != len(data2):
        raise ValueError("Datasets must have the same length.")

    # Convert to NumPy arrays for easy calculation
    data1 = np.array(data1)
    data2 = np.array(data2)

    # Calculate the mean of each dataset
    mean1 = np.mean(data1)
    mean2 = np.mean(data2)

    # Compute the population covariance (divides by n, not n - 1)
    covariance = np.mean((data1 - mean1) * (data2 - mean2))
    return covariance

# Example datasets
dataset1 = [1, 2, 3, 4, 5]
dataset2 = [5, 10, 15, 20, 25]

# Calculate covariance
covariance_result = calculate_covariance(dataset1, dataset2)

# Print the result
print("Covariance between the two datasets:", covariance_result)
...
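The function above divides by n, so it returns the population covariance. As a hedged cross-check, the sketch below uses `np.cov` on the same example datasets to show both conventions side by side; `bias=True` divides by n, while the default divides by n - 1 (the sample covariance).

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [5, 10, 15, 20, 25]

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(x, y)
population_cov = np.cov(x, y, bias=True)[0, 1]  # divides by n
sample_cov = np.cov(x, y)[0, 1]                 # divides by n - 1 (default)

print("Population covariance:", population_cov)
print("Sample covariance:", sample_cov)
```

For these five points the two values are 10.0 and 12.5; the gap shrinks as the dataset grows.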
#Q10 Write a Python script to calculate the correlation coefficient between two datasets ?

import numpy as np

def calculate_correlation(data1, data2):
    """
    Function to calculate the correlation coefficient between two datasets.

    Parameters:
        data1 (list or array): First dataset
        data2 (list or array): Second dataset

    Returns:
        float: Correlation coefficient
    """
    if len(data1) != len(data2):
        raise ValueError("Datasets must have the same length.")

    # Convert to NumPy arrays
    data1 = np.array(data1)
    data2 = np.array(data2)

    # Compute correlation coefficient
    correlation_matrix = np.corrcoef(data1, data2)
    correlation_coefficient = correlation_matrix[0, 1]  # Extract the correlation coefficient
    return correlation_coefficient

# Example datasets
dataset1 = [1, 2, 3, 4, 5]
dataset2 = [5, 10, 15, 20, 25]

# Calculate correlation coefficient
correlation_result = calculate_correlation(dataset1, dataset2)

# Print the result
print("Correlation Coefficient:", correlation_result)
...
#Q11 Create a scatter plot to visualize the relationship between two variables ?

import matplotlib.pyplot as plt

# Example datasets
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # Independent variable
y = [2, 4, 5, 7, 10, 12, 14, 15, 18, 20]  # Dependent variable

# Create a scatter plot
plt.scatter(x, y, color='blue', edgecolor='black', alpha=0.8)

# Customize the plot
plt.title('Scatter Plot of Two Variables')
plt.xlabel('X (Independent Variable)')
plt.ylabel('Y (Dependent Variable)')
plt.grid(True)

# Show the plot
plt.show()
...
#Q12 Implement and compare simple random sampling and systematic sampling ?

import random

# Example dataset
data = list(range(1, 101))  # Dataset with 100 elements (1 to 100)

# Simple Random Sampling: every element has an equal chance of selection
def simple_random_sampling(data, sample_size):
    return random.sample(data, sample_size)

# Systematic Sampling: pick every k-th element after a random start
def systematic_sampling(data, sample_size):
    step = len(data) // sample_size  # Sampling interval k
    start = random.randrange(step)   # Random start within the first interval
    return data[start::step][:sample_size]

# Define sample size
sample_size = 10

# Perform sampling
simple_random_sample = simple_random_sampling(data, sample_size)
systematic_sample = systematic_sampling(data, sample_size)

# Display results
print("Original Dataset:", data)
print("Simple Random Sample:", simple_random_sample)
print("Systematic Sample:", systematic_sample)
...
#Q13 Calculate the mean, median, and mode of grouped data ?

# Function to calculate Mean, Median, and Mode for grouped data
def calculate_grouped_data_metrics(frequencies, midpoints):
    """
    Calculates the mean, median, and mode of grouped data.

    Parameters:
        frequencies (list): Frequency of each group.
        midpoints (list): Midpoints of each group (representing group intervals).

    Returns:
        dict: Mean, Median, and Mode as a dictionary.
    """
    # Total frequency
    total_frequency = sum(frequencies)

    # Mean calculation
    mean = sum([f * m for f, m in zip(frequencies, midpoints)]) / total_frequency

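    # Median calculation: find the median class (the first class whose
    # cumulative frequency reaches N/2) and use its midpoint as an approximation
    cumulative_frequencies = [sum(frequencies[:i + 1]) for i in range(len(frequencies))]
    median_index = next(i for i, cf in enumerate(cumulative_frequencies) if cf >= total_frequency / 2)
    median_class_midpoint = midpoints[median_index]

    # Mode calculation: midpoint of the modal class (the class with the highest frequency)
    mode_index = frequencies.index(max(frequencies))
    mode = midpoints[mode_index]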

    # Results
    return {"Mean": mean, "Median": median_class_midpoint, "Mode": mode}


# Grouped data example (Midpoints and Frequencies of intervals)
grouped_data_midpoints = [5, 15, 25, 35, 45]  # Midpoints of intervals
grouped_data_frequencies = [4, 8, 10, 5, 3]   # Frequencies of intervals

# Calculate metrics
results = calculate_grouped_data_metrics(grouped_data_frequencies, grouped_data_midpoints)

# Print the results
print("Grouped Data Metrics:")
for key, value in results.items():
    print(f"{key}: {value}")


...
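The function above reports the midpoint of the median and modal classes, which is only a coarse approximation. A common textbook refinement interpolates inside the class. The sketch below assumes equal-width class intervals 0-10, 10-20, ..., 40-50 (consistent with midpoints 5, 15, ..., 45, but not stated in the original); the `grouped_median_mode` helper is hypothetical.

```python
def grouped_median_mode(lower_bounds, width, frequencies):
    """Interpolated median and mode for equal-width classes (illustrative helper)."""
    n = sum(frequencies)

    # Median class: first class whose cumulative frequency reaches n / 2
    cumulative = 0  # cumulative frequency of the classes before the median class
    for i, f in enumerate(frequencies):
        if cumulative + f >= n / 2:
            break
        cumulative += f
    median = lower_bounds[i] + (n / 2 - cumulative) / frequencies[i] * width

    # Modal class: class with the highest frequency
    m = frequencies.index(max(frequencies))
    f1 = frequencies[m]
    f0 = frequencies[m - 1] if m > 0 else 0
    f2 = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    mode = lower_bounds[m] + (f1 - f0) / (2 * f1 - f0 - f2) * width

    return median, mode

# Assumed class intervals 0-10, 10-20, ..., 40-50 (midpoints 5, 15, ..., 45)
lower_bounds = [0, 10, 20, 30, 40]
frequencies = [4, 8, 10, 5, 3]
median, mode = grouped_median_mode(lower_bounds, 10, frequencies)
print(f"Interpolated median: {median:.2f}")
print(f"Interpolated mode: {mode:.2f}")
```

With these frequencies the median class is 20-30, giving a median of 20 + (15 - 12) / 10 × 10 = 23 instead of the midpoint 25. The mode formula is undefined when the modal class and both neighbours have equal frequencies, so a production version would need to guard that case.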
#Q14 Simulate data using Python and calculate its central tendency and dispersion ?

import numpy as np

# Simulate data: Generate 100 random numbers from a normal distribution
np.random.seed(42)  # For reproducibility
data = np.random.normal(loc=50, scale=10, size=100)  # Mean=50, StdDev=10

# Calculate central tendency
mean_value = np.mean(data)  # Mean
median_value = np.median(data)  # Median
# Continuous values are almost all unique, so the mode is taken over
# values rounded to the nearest integer
values, counts = np.unique(np.round(data), return_counts=True)
mode_value = values[np.argmax(counts)]  # Mode (of rounded values)

# Calculate dispersion (ddof=1 gives the sample variance and standard deviation)
variance_value = np.var(data, ddof=1)  # Variance
std_dev_value = np.std(data, ddof=1)  # Standard Deviation
range_value = np.max(data) - np.min(data)  # Range

# Display results
print("Simulated Data:", data)
print("\n--- Central Tendency ---")
print(f"Mean: {mean_value:.2f}")
print(f"Median: {median_value:.2f}")
print(f"Mode: {mode_value:.2f}")

print("\n--- Dispersion ---")
print(f"Variance: {variance_value:.2f}")
print(f"Standard Deviation: {std_dev_value:.2f}")
print(f"Range: {range_value:.2f}")
...
#Q15  Use NumPy or pandas to summarize a dataset’s descriptive statistics ?

import pandas as pd

# Create a sample dataset
data = {
    "Age": [22, 25, 29, 35, 42, 55, 63, 36, 47, 50],
    "Income": [25000, 30000, 40000, 45000, 50000, 65000, 70000, 48000, 52000, 60000],
    "Score": [85, 90, 88, 75, 92, 77, 85, 84, 79, 91]
}

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(data)

# Generate descriptive statistics
summary = df.describe()

# Display the summary
print("Descriptive Statistics:\n", summary)

...
#Q16 Plot a boxplot to understand the spread and identify outliers ?

import matplotlib.pyplot as plt
import numpy as np

# Generate example data
np.random.seed(42)  # For reproducibility
data = np.random.normal(loc=50, scale=10, size=200)  # Dataset with mean=50, stddev=10

# Add a few outliers to the data
data = np.append(data, [100, 105, 110])

# Create the boxplot
plt.boxplot(data, vert=False, patch_artist=True, boxprops=dict(facecolor="lightblue"))

# Customize the plot
plt.title("Boxplot to Understand Spread and Identify Outliers")
plt.xlabel("Values")
plt.grid(axis="x", alpha=0.5)

# Display the plot
plt.show()

...
#Q17  Calculate the interquartile range (IQR) of a dataset ?

import numpy as np

def calculate_iqr(data):
    """
    Function to calculate the Interquartile Range (IQR) of a dataset.

    Parameters:
        data (list or array): A list of numerical values.

    Returns:
        float: The IQR of the dataset.
    """
    # Convert the dataset to a NumPy array for easy calculation
    data = np.array(data)

    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)

    # Calculate IQR
    IQR = Q3 - Q1
    return IQR

# Example dataset
dataset = [10, 15, 14, 20, 25, 22, 28, 30, 35, 40, 45, 50]

# Calculate IQR
iqr_result = calculate_iqr(dataset)

# Print the result
print("Interquartile Range (IQR):", iqr_result)
...
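The IQR is also what a boxplot uses to flag outliers: by default, matplotlib draws whiskers up to 1.5 × IQR beyond the quartiles and plots anything further out as an individual point. The sketch below applies that rule numerically; the dataset is the Q17 example with one deliberately extreme value (120) appended for illustration.

```python
import numpy as np

# Q17 example dataset plus one deliberately extreme value (120)
data = np.array([10, 15, 14, 20, 25, 22, 28, 30, 35, 40, 45, 50, 120])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

Here Q1 = 20 and Q3 = 40, so the fences sit at -10 and 70 and only 120 is flagged.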
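#Q18 Implement Z-score normalization and explain its significance ?

# Significance: a Z-score expresses each value as the number of standard
# deviations it lies from the mean. The rescaled data has mean 0 and standard
# deviation 1, which puts features measured in different units on a common
# scale and makes unusually large or small values easy to spot.

import numpy as np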

def z_score_normalization(data):
    """
    Function to apply Z-score normalization to a dataset.

    Parameters:
        data (list or array): A list of numerical values.

    Returns:
        numpy.ndarray: Z-scores of the dataset.
    """
    # Convert to NumPy array
    data = np.array(data)

    # Calculate the mean and standard deviation
    mean = np.mean(data)
    std_dev = np.std(data)

    # Apply Z-score normalization
    z_scores = (data - mean) / std_dev
    return z_scores

# Example dataset
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Perform Z-score normalization
normalized_data = z_score_normalization(data)

# Display results
print("Original Data:", data)
print("Z-score Normalized Data:", normalized_data)
...
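To illustrate why Z-scores matter, the sketch below standardizes two hypothetical features measured in very different units. The feature names and values are made up for illustration; the population standard deviation (NumPy's default, `ddof=0`) is assumed.

```python
import numpy as np

# Two hypothetical features on very different scales
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
incomes_usd = np.array([20000.0, 35000.0, 50000.0, 65000.0, 80000.0])

def z_scores(values):
    # Population standard deviation (ddof=0), matching np.std's default
    return (values - values.mean()) / values.std()

z_heights = z_scores(heights_cm)
z_incomes = z_scores(incomes_usd)

print("Height z-scores:", np.round(z_heights, 2))
print("Income z-scores:", np.round(z_incomes, 2))
```

Both features are evenly spaced around their means, so after scaling they produce the same Z-scores even though the raw values differ by orders of magnitude. Each standardized feature has mean 0 and standard deviation 1.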
#Q19 Compare two datasets using their standard deviations ?

import numpy as np

def compare_standard_deviations(data1, data2):
    """
    Function to calculate and compare the standard deviations of two datasets.

    Parameters:
        data1 (list or array): First dataset
        data2 (list or array): Second dataset

    Returns:
        dict: Standard deviations and their comparison
    """
    # Convert datasets to NumPy arrays
    data1 = np.array(data1)
    data2 = np.array(data2)

    # Calculate standard deviations
    std_dev1 = np.std(data1)
    std_dev2 = np.std(data2)

    # Compare standard deviations
    if std_dev1 > std_dev2:
        comparison = "Dataset 1 has higher variability"
    elif std_dev2 > std_dev1:
        comparison = "Dataset 2 has higher variability"
    else:
        comparison = "Both datasets have the same variability"

    # Return results
    return {"Standard Deviation of Dataset 1": std_dev1,
            "Standard Deviation of Dataset 2": std_dev2,
            "Comparison": comparison}

# Example datasets
dataset1 = [10, 20, 30, 40, 50]
dataset2 = [15, 25, 35, 45, 55]

# Compare standard deviations
results = compare_standard_deviations(dataset1, dataset2)

# Print results
print("Comparison Results:")
for key, value in results.items():
    print(f"{key}: {value}")
...
#Q20 Write a Python program to visualize covariance using a heatmap ?

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a sample dataset
data = {
    "Variable A": [10, 20, 30, 40, 50],
    "Variable B": [15, 25, 35, 45, 55],
    "Variable C": [50, 40, 30, 20, 10]
}

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(data)

# Calculate the covariance matrix (pandas returns the sample covariance, dividing by n - 1)
cov_matrix = df.cov()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cov_matrix, annot=True, cmap='coolwarm', fmt=".2f", cbar=True)

# Customize the plot
plt.title('Covariance Matrix Heatmap')
plt.show()
...
#Q21 Use seaborn to create a correlation matrix for a dataset ?

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a sample dataset
data = {
    "Age": [22, 25, 29, 35, 42, 55, 63, 36, 47, 50],
    "Income": [25000, 30000, 40000, 45000, 50000, 65000, 70000, 48000, 52000, 60000],
    "Score": [85, 90, 88, 75, 92, 77, 85, 84, 79, 91]
}

# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data)

# Calculate the correlation matrix
correlation_matrix = df.corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", cbar=True)

# Customize the plot
plt.title("Correlation Matrix Heatmap")
plt.show()
...

#Q22 Generate a dataset and implement both variance and standard deviation computations ?

import numpy as np

# Generate a dataset: Random integers between 1 and 100
np.random.seed(42)
data = np.random.randint(1, 101, size=20)

# Calculate sample variance and standard deviation (ddof=1 divides by n - 1)
variance_value = np.var(data, ddof=1)
std_dev_value = np.std(data, ddof=1)
# Display results
print("Generated Dataset:", data)
print("\n--- Computations ---")
print(f"Variance: {variance_value:.2f}")
print(f"Standard Deviation: {std_dev_value:.2f}")
...
#Q23 Visualize skewness and kurtosis using Python libraries like matplotlib or seaborn ?

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis

# Generate example data
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=500)

# Add some skewness (positively skewed)
skewed_data = np.append(data, [100, 110, 115])

# Calculate skewness and kurtosis
data_skewness = skew(skewed_data)
data_kurtosis = kurtosis(skewed_data)

# Visualization using histograms
plt.figure(figsize=(12, 6))

# Histogram with Matplotlib
plt.subplot(1, 2, 1)
plt.hist(skewed_data, bins=20, color='blue', edgecolor='black', alpha=0.7)
plt.title(f"Histogram (Skewness: {data_skewness:.2f}, Kurtosis: {data_kurtosis:.2f})")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(axis='y')

# Histogram with Seaborn
plt.subplot(1, 2, 2)
sns.histplot(skewed_data, kde=True, color='green')
plt.title(f"Seaborn Histogram (Skewness: {data_skewness:.2f}, Kurtosis: {data_kurtosis:.2f})")
plt.xlabel("Value")
plt.ylabel("Frequency")

# Display the plots
plt.tight_layout()
plt.show()

...
#Q24 Implement the Pearson and Spearman correlation coefficients for a dataset ?

from scipy.stats import pearsonr, spearmanr

# Create example datasets
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 8, 10, 12, 14, 15, 18, 20]

# Calculate Pearson correlation coefficient
pearson_corr, pearson_p_value = pearsonr(x, y)

# Calculate Spearman correlation coefficient
spearman_corr, spearman_p_value = spearmanr(x, y)

# Display results
print("--- Correlation Coefficients ---")
print(f"Pearson Correlation: {pearson_corr:.2f} (p-value: {pearson_p_value:.2e})")
print(f"Spearman Correlation: {spearman_corr:.2f} (p-value: {spearman_p_value:.2e})")
...
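On the nearly linear data above, the two coefficients come out almost identical. The difference shows up on a relationship that is monotonic but not linear. The sketch below uses an illustrative cubic relationship (not from the original notebook): Pearson measures linear association, while Spearman works on ranks.

```python
from scipy.stats import pearsonr, spearmanr

x = list(range(1, 11))
y = [value ** 3 for value in x]  # monotonic but not linear

pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)

print(f"Pearson:  {pearson_corr:.3f}")
print(f"Spearman: {spearman_corr:.3f}")
```

Because `y` always increases with `x`, the ranks agree perfectly and Spearman is 1, whereas Pearson stays below 1 because the points do not lie on a straight line.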