
# Week 5: Statistics Basics – Descriptive Statistics
- Mean, Median, Mode – Concepts & Formulas
- Variance & Standard Deviation – Why do we need them?
- Range, Quartiles, IQR
- Understanding Data Distribution (Normal vs Skewed)
- Hands-on with Python: mean(), median(), std(), describe() in NumPy & Pandas

## Descriptive Statistics

Descriptive statistics are used to summarize and describe the main features of a dataset. They provide simple summaries about the sample and about the observations that have been made.

### Measures of Central Tendency

These measures describe the center point of a dataset.

*   **Mean:** The average of all values.
    *   **Formula:** Sum of all values divided by the number of values ($\bar{x} = \frac{\sum x_i}{n}$)
*   **Median:** The middle value when the data is ordered. If there's an even number of values, it's the average of the two middle values.
*   **Mode:** The value that appears most frequently in the dataset.
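
As a quick illustration of the three definitions above, here is a minimal sketch using Python's standard `statistics` module. The list `values` is made up for this example and is not part of the assignment data.

```python
import statistics

# Hypothetical sample data, chosen so that one value repeats
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # even count, so the two middle values are averaged
mode = statistics.mode(values)      # the most frequent value

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
```

Because the list has an even number of values, the median falls between the two middle values (3 and 5) rather than on an actual data point.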

### Measures of Variability (Spread)

These measures describe how spread out the data is.

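*   **Variance:** The average of the squared deviations from the mean. It measures how far, on average, the values lie from the mean. The sample version divides by $n-1$ instead of $n$ to correct for estimating the mean from the same data.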
    *   **Formula (Sample Variance):** $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
    *   **Formula (Population Variance):** $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
*   **Standard Deviation:** The square root of the variance. It provides a measure of spread in the same units as the original data.
    *   **Formula (Sample Standard Deviation):** $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
    *   **Formula (Population Standard Deviation):** $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
    *   **Why do we need them?** Variance and standard deviation tell us about the dispersion of data points around the mean. A higher variance/standard deviation indicates that the data points are more spread out, while a lower value indicates they are clustered closer to the mean.

*   **Range:** The difference between the highest and lowest values in a dataset.
*   **Quartiles:** Values that divide the data into four equal parts.
    *   **Q1 (First Quartile):** 25th percentile
    *   **Q2 (Second Quartile):** 50th percentile (which is also the median)
    *   **Q3 (Third Quartile):** 75th percentile
*   **IQR (Interquartile Range):** The difference between the third quartile (Q3) and the first quartile (Q1). It represents the range of the middle 50% of the data.
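
The formulas above can be worked through step by step. The following is a minimal sketch with NumPy, using a small made-up array `x`. The quartiles come from `np.percentile`, which uses linear interpolation by default, so other quartile conventions may give slightly different values.

```python
import numpy as np

# Hypothetical sample data for illustration
x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

n = x.size
mean = x.sum() / n
squared_diffs = (x - mean) ** 2

sample_variance = squared_diffs.sum() / (n - 1)   # s^2: divide by n - 1
population_variance = squared_diffs.sum() / n     # sigma^2: divide by N
sample_std = np.sqrt(sample_variance)
population_std = np.sqrt(population_variance)

data_range = x.max() - x.min()
q1, q2, q3 = np.percentile(x, [25, 50, 75])       # default linear interpolation
iqr = q3 - q1

print("Sample variance:", sample_variance, "| Sample std:", sample_std)
print("Population variance:", population_variance, "| Population std:", population_std)
print("Range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3, "| IQR:", iqr)
```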

### Understanding Data Distribution

*   **Normal Distribution:** A symmetrical distribution where the mean, median, and mode are all equal and located at the center. It is often described as a bell curve.
*   **Skewed Distribution:** A distribution that is not symmetrical.
    *   **Positively (Right) Skewed:** The tail of the distribution extends to the right. The mean is typically greater than the median.
    *   **Negatively (Left) Skewed:** The tail of the distribution extends to the left. The mean is typically less than the median.
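
One way to see the mean/median relationship described above is to simulate both kinds of data. This is a sketch under assumptions: the distributions, their parameters, the sample size, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Roughly bell-shaped sample and a sample with a long right tail
symmetric = rng.normal(loc=50, scale=10, size=10_000)
right_skewed = rng.exponential(scale=10, size=10_000)

for name, sample in [("Normal", symmetric), ("Right-skewed", right_skewed)]:
    print(f"{name}: mean={sample.mean():.2f}, median={np.median(sample):.2f}")
```

For the normal sample the mean and median should come out nearly equal, while for the exponential sample the long right tail pulls the mean above the median.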

### Hands-on with Python

You can calculate these descriptive statistics using libraries like NumPy and Pandas.

**Using Pandas:**


In [1]:
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

mean = data.mean()
median = data.median()
std_dev = data.std() # Calculates sample standard deviation by default
variance = data.var() # Calculates sample variance by default
description = data.describe()

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev}")
print(f"Variance: {variance}")
print("\nDescription:\n", description)

# For population standard deviation and variance, you can use ddof=0
population_std_dev = data.std(ddof=0)
population_variance = data.var(ddof=0)
print(f"Population Standard Deviation: {population_std_dev}")
print(f"Population Variance: {population_variance}")

Mean: 5.5
Median: 5.5
Standard Deviation: 3.0276503540974917
Variance: 9.166666666666666

Description:
 count    10.00000
mean      5.50000
std       3.02765
min       1.00000
25%       3.25000
50%       5.50000
75%       7.75000
max      10.00000
dtype: float64
Population Standard Deviation: 2.8722813232690143
Population Variance: 8.25
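
For comparison, the same statistics can be computed directly with NumPy. This sketch reuses the values 1 to 10 from the cell above. Note that `np.std` and `np.var` default to `ddof=0` (population formulas), the opposite of the Pandas default, so `ddof=1` is passed to get the sample versions.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print("Mean:", np.mean(data))
print("Median:", np.median(data))

# ddof=1 gives the sample statistics, matching Series.std() and Series.var()
sample_std = np.std(data, ddof=1)
sample_var = np.var(data, ddof=1)

# NumPy's defaults (ddof=0) give the population statistics
population_std = np.std(data)
population_var = np.var(data)

# Quartiles, as reported by describe() under 25% and 75%
q1, q3 = np.percentile(data, [25, 75])

print("Sample std / var:", sample_std, sample_var)
print("Population std / var:", population_std, population_var)
print("Q1, Q3:", q1, q3)
```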


### Q1. Create a NumPy array of 10 random integers between 1 and 100. Find its mean.

In [3]:
import numpy as np
# creating an array of 10 random integers from 1 to 100
arr = np.random.randint(1, 101, 10)

# finding the mean of the array
mean = np.mean(arr)

# display the mean
print("Mean:",mean)

Mean: 53.8
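
Because the array is random, the mean changes on every run. If repeatable output is wanted, one option is a seeded generator. This is a sketch using NumPy's `default_rng`, and the seed value 42 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # arbitrary seed; any fixed integer works
arr = rng.integers(1, 101, size=10)    # upper bound is exclusive, so values are 1..100

print("Array:", arr)
print("Mean:", arr.mean())
```

Running the cell again with the same seed produces the same array and therefore the same mean.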


### Q2. Calculate the median of the array: [10, 5, 8, 12, 3, 7].

In [5]:
import numpy as np

# creating an array
arr = np.array([10, 5, 8, 12, 3, 7])

# finding the median
median = np.median(arr)

# displaying the median
print("Median:",median)

Median: 7.5
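
To see where this value comes from, the median can also be computed by hand: sort the values and, since the count is even, average the two middle ones. A minimal sketch in plain Python:

```python
values = [10, 5, 8, 12, 3, 7]

ordered = sorted(values)   # [3, 5, 7, 8, 10, 12]
n = len(ordered)
mid = n // 2

if n % 2 == 0:
    # even count: average the two middle values (7 and 8 here)
    median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    # odd count: take the single middle value
    median = ordered[mid]

print("Median:", median)
```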


### Q3. Generate an array of 15 random numbers between 1 and 50. Find its standard deviation.

In [7]:
import numpy as np

# creating an array
arr = np.random.randint(1,51,15)

# finding the standard deviation (np.std defaults to ddof=0, the population formula)
std = np.std(arr)

# displaying the result
print("Standard Deviation:",std)

Standard Deviation: 12.606171328185077


### Q4. Create a 3x3 array with values from 1 to 9. Find the row-wise mean.

In [9]:
import numpy as np

# create a 3x3 array with values from 1 to 9
arr = np.arange(1,10).reshape(3,3)

# calculate the row-wise mean (axis=1 averages across the columns of each row)
row_mean = np.mean(arr, axis=1)

# display the mean
print("Array\n",arr)
print("Row-wise Mean:",row_mean)

Array
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Row-wise Mean: [2. 5. 8.]
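
The `axis` argument decides which dimension is collapsed. As a short sketch on the same 3x3 array, `axis=0` gives one mean per column and `axis=1` gives one mean per row:

```python
import numpy as np

arr = np.arange(1, 10).reshape(3, 3)

col_mean = np.mean(arr, axis=0)   # collapses the rows: one mean per column
row_mean = np.mean(arr, axis=1)   # collapses the columns: one mean per row

print("Column-wise Mean:", col_mean)
print("Row-wise Mean:", row_mean)
```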


### Q5. For the array [4, 6, 8, 10, 12], calculate mean, median, std.

In [10]:
import numpy as np

# creating an array
arr = np.array([4, 6, 8, 10, 12])

# finding the mean, median, and standard deviation
mean = np.mean(arr)
median = np.median(arr)
std = np.std(arr)

# displaying the result
print("Mean:",mean)
print("Median:",median)
print("Standard Deviation:",std)

Mean: 8.0
Median: 8.0
Standard Deviation: 2.8284271247461903
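
The standard deviation printed above can be checked against the formula. The squared deviations from the mean of 8 are 16, 4, 0, 4, 16, which sum to 40. Dividing by N = 5 gives 8, and the square root of 8 is the population value that `np.std` returns by default. The sketch below also shows the sample version for contrast.

```python
import numpy as np

arr = np.array([4, 6, 8, 10, 12])

# Population formula written out: sqrt(mean of squared deviations)
manual_population_std = np.sqrt(np.mean((arr - arr.mean()) ** 2))

population_std = np.std(arr)        # default ddof=0: divides by N
sample_std = np.std(arr, ddof=1)    # ddof=1: divides by n - 1

print("Manual population std:", manual_population_std)
print("np.std (ddof=0):", population_std)
print("np.std (ddof=1):", sample_std)
```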


### Q6. Generate a 5x5 NumPy array of random integers (0 to 100). Find the overall mean.

In [13]:
import numpy as np

# creating an array
arr = np.random.randint(0,101,(5,5))

# calculate the overall mean
overall_mean = np.mean(arr)

# displaying the results
print("Array:\n",arr)
print("Overall_Mean:",overall_mean)

Array:
 [[96 96 71  0 45]
 [23 22 23 37 61]
 [76 60 25 12 83]
 [51 13 38 90 75]
 [51 30 52 98 38]]
Overall_Mean: 50.64


### Q7. Create an array of 50 random integers and calculate the standard deviation.

In [14]:
import numpy as np

# creating an array
arr = np.random.randint(0,101,50)

# calculate the standard deviation
std = np.std(arr)

# displaying the result
print("Array:",arr)
print("Standard Deviation:",std)

Array: [71  3 15  7  3 29 69 52 72 19  9 19 37 17 46 32 52 49 10 77 89 38 48 97
 39 90 79 22 12 65 45 93 81 78  4 32 95 69 85 59 93 84 60 21 17  1 55 64
 76 45]
Standard Deviation: 29.5447051093762


### Q8. Find indices of non-zero elements from [1,2,0,0,4,0].


In [15]:
import numpy as np

# Given array
arr = np.array([1,2,0,0,4,0])

# finding indices of non-zero elements
indices = np.nonzero(arr)

# displaying the results
print("Indices of non-zero elements:",indices)

Indices of non-zero elements: (array([0, 1, 4]),)
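
`np.nonzero` returns a tuple with one index array per dimension, which is why the output is wrapped in parentheses. For a 1-D array, a plain index array can be obtained as in this sketch:

```python
import numpy as np

arr = np.array([1, 2, 0, 0, 4, 0])

indices = np.nonzero(arr)[0]    # take the first (and only) array from the tuple
flat = np.flatnonzero(arr)      # equivalent shortcut for flattened input
values = arr[indices]           # the non-zero values themselves

print("Indices:", indices)
print("Values:", values)
```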


### Q9. Given a 2D array, print the array elements [[7,4]]:

In [17]:
import numpy as np

arr = np.array([[2,3,4],
                [5,7,4],
                [8,9,0]])

print("Element [7,4]:", arr[1,1], arr[1,2])   # row 1 → [5,7,4]
print("As array:", arr[1,1:3])                # slice → [7,4]

Element [7,4]: 7 4
As array: [7 4]


### Q10. Create two 2-dimensional arrays and perform matrix multiplication.

In [18]:
import numpy as np

A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])

# Matrix Multiplication
C = np.dot(A,B)
print("Result:\n",C)

Result:
 [[19 22]
 [43 50]]
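
For 2-D arrays, the `@` operator and `np.matmul` give the same result as `np.dot`. The sketch below also shows `*` for contrast, since it multiplies element by element and is easy to confuse with matrix multiplication.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C_at = A @ B                 # matrix product using the @ operator
C_matmul = np.matmul(A, B)   # same product as a function call
elementwise = A * B          # NOT a matrix product: entry-by-entry multiplication

print("A @ B:\n", C_at)
print("A * B (element-wise):\n", elementwise)
```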
