# 📊 Week 3 – Task 4: Statistics Practice

## 1. Variance & Standard Deviation Refresher
- Variance measures the spread of the data: the average squared distance from the mean.
- Standard deviation is the square root of the variance. It is in the same units as the data, so it is easier to interpret.
- Why it matters: in ML, both are used for normalization and scaling, and for understanding how much each feature varies.

In [1]:
import numpy as np
import pandas as pd

## 2. Example with Simple Dataset
Let's calculate mean, variance, and standard deviation on a small dataset.

In [2]:
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Mean:", np.mean(data))
print("Variance:", np.var(data))  # population variance (ddof=0 is NumPy's default)
print("Standard Deviation:", np.std(data))  # square root of the population variance

Mean: 5.0
Variance: 4.0
Standard Deviation: 2.0
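
The same figures can be reproduced straight from the definition, which makes the "average squared distance from the mean" wording concrete. This is a minimal sketch; the names `deviations`, `variance_manual`, and `std_manual` are illustrative and do not appear in the original notebook.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population variance: the mean of the squared deviations from the mean.
mean = data.sum() / data.size
deviations = data - mean
variance_manual = (deviations ** 2).sum() / data.size
std_manual = np.sqrt(variance_manual)

print("Manual variance:", variance_manual)  # 4.0
print("Manual std:", std_manual)            # 2.0
```

Dividing by `data.size` (rather than `data.size - 1`) is what makes this the population variance, matching the default of `np.var`.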


## 3. Simulate Dice Rolls
- Roll a 6-sided die 1,000 times.
- Calculate mean, variance, std with NumPy.
- Expected results (theoretical values for a fair die, so the simulation should land close to them):
  - Mean ≈ 3.5
  - Variance ≈ 2.92 (exactly 35/12)
  - Std Dev ≈ 1.71

In [3]:
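# The upper bound is exclusive, so this draws integers from 1 to 6.
# No seed is set, so the numbers change on every run.
dice = np.random.randint(1, 7, size=1000)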
print("Mean of dice rolls:", np.mean(dice))
print("Variance of dice rolls:", np.var(dice))
print("Standard Deviation of dice rolls:", np.std(dice))

Mean of dice rolls: 3.543
Variance of dice rolls: 2.846151
Standard Deviation of dice rolls: 1.6870539410463437
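
The expected values listed for this section follow from the uniform distribution over the faces 1 to 6. The sketch below computes them from the probabilities rather than from a simulation; the names `faces`, `probs`, and `expected_*` are illustrative assumptions.

```python
import numpy as np

faces = np.arange(1, 7)      # outcomes 1..6
probs = np.full(6, 1 / 6)    # a fair die: each face equally likely

# E[X] = sum of x * p(x); Var[X] = sum of p(x) * (x - E[X])^2
expected_mean = np.sum(faces * probs)
expected_var = np.sum(probs * (faces - expected_mean) ** 2)
expected_std = np.sqrt(expected_var)

print(f"Theoretical mean: {expected_mean:.4f}")      # about 3.5
print(f"Theoretical variance: {expected_var:.4f}")   # about 2.9167 (35/12)
print(f"Theoretical std: {expected_std:.4f}")        # about 1.7078
```

With 1,000 rolls the simulated statistics should fall near these values but will rarely match them exactly.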


## 4. Verify with Pandas
Convert the dice rolls into a Pandas Series and use:
- `.mean()`
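- `.var()` (note: Pandas computes the sample variance, `ddof=1`, by default, whereas `np.var` defaults to the population variance, `ddof=0`)
- `.std()` (same default, `ddof=1`, so it also differs slightly from `np.std`)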

In [4]:
dice_series = pd.Series(dice)
print("Pandas Mean of dice rolls:", dice_series.mean())
print("Pandas Variance of dice rolls:", dice_series.var())
print("Pandas Standard Deviation of dice rolls:", dice_series.std())

Pandas Mean of dice rolls: 3.543
Pandas Variance of dice rolls: 2.8489999999999998
Pandas Standard Deviation of dice rolls: 1.6878981011897607
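
The small gap between the NumPy and Pandas results comes from the divisor: `n` for the population variance versus `n - 1` for the sample variance. In the outputs above, 2.846151 × 1000 / 999 ≈ 2.849. The sketch below checks this relationship on a fresh set of rolls; the seeded generator and the names `rolls`, `pop_var`, and `sample_var` are assumptions added for reproducibility and are not in the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1000)
series = pd.Series(rolls)

n = rolls.size
pop_var = np.var(rolls)              # ddof=0: divide by n
sample_var = np.var(rolls, ddof=1)   # ddof=1: divide by n - 1

# Pandas' default matches NumPy with ddof=1.
print(np.isclose(sample_var, series.var()))
# Pandas can also be told to use the population formula.
print(np.isclose(pop_var, series.var(ddof=0)))
# The two variances differ only by the factor n / (n - 1).
print(np.isclose(sample_var, pop_var * n / (n - 1)))
```

Passing `ddof` explicitly in both libraries is a simple way to avoid the mismatch when comparing results.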


## 5. Why Does This Matter in ML?
Write a short explanation:
- Variance and std measure how spread out the data is.
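- Features with high variance can dominate distance-based and gradient-based learning algorithms.
- Many ML algorithms (e.g., k-NN, PCA, regularized linear regression) are sensitive to feature scale and work best on normalized data.
- Scaling often uses the standard deviation: z-score normalization subtracts the mean and divides by the standard deviation.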
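
As a rough illustration of z-score normalization, the sketch below standardizes each column of a small feature matrix. The data values and the names `X`, `mu`, `sigma`, and `X_scaled` are hypothetical, chosen only to show two features on very different scales.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
# on very different scales (think age in years and income in dollars).
X = np.array([[25, 40_000],
              [32, 52_000],
              [47, 95_000],
              [51, 61_000]], dtype=float)

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature population std (ddof=0)
X_scaled = (X - mu) / sigma

# After scaling, every column has mean close to 0 and std close to 1.
print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column stds after scaling:", X_scaled.std(axis=0))
```

Before scaling, the second column's variance is many orders of magnitude larger than the first, so it would dominate any Euclidean distance computed on `X`.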

### ✅ Summary
- Learned to calculate variance & std with NumPy and Pandas.
- Simulated dice rolls to understand expected values.
- Connected variance/std to their importance in ML.

## 6. Array Creation
- Create arrays with `np.array`, `np.zeros`, `np.ones`, `np.arange`, `np.linspace`.
- Inspect `shape`, `size`, and `dtype`.


In [None]:
arr = np.arange(10)  # Integers from 0 to 9 (the stop value is excluded)
arr2 = np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1 (both endpoints included)
arr3 = np.zeros((3, 4))  # 3x4 array of zeros

print("Array using arange:", arr)
print("Array using linspace:", arr2)
print("Array of zeros:\n", arr3)

Array using arange: [0 1 2 3 4 5 6 7 8 9]
Array using linspace: [0.   0.25 0.5  0.75 1.  ]
Array of zeros:
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
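
The cell above creates the arrays but does not inspect them. A minimal sketch of the inspection step follows; `arr4` is an extra array added here only to show `np.ones` with an explicit `dtype`.

```python
import numpy as np

arr = np.arange(10)
arr2 = np.linspace(0, 1, 5)
arr3 = np.zeros((3, 4))
arr4 = np.ones((2, 3), dtype=np.int32)  # np.ones with an explicit dtype

# shape: size along each dimension; size: total element count; dtype: element type
for name, a in [("arange", arr), ("linspace", arr2), ("zeros", arr3), ("ones", arr4)]:
    print(f"{name}: shape={a.shape}, size={a.size}, dtype={a.dtype}")
```

The default integer `dtype` of `np.arange(10)` depends on the platform and NumPy version, while `np.linspace` and `np.zeros` default to `float64`.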


## 7. Indexing & Slicing
- Access single elements, slices, and use boolean indexing.
- Practice on both 1D and 2D arrays.


In [None]:
arr_1d = np.arange(10)
print("Original 1D array:", arr_1d)
print("Sliced 1D array (indices 2 to 5):", arr_1d[2:6])
print("Boolean indexing (elements > 5):", arr_1d[arr_1d > 5])

arr_2d = np.arange(20).reshape(4, 5)
print("Original 2D array:\n", arr_2d)
print("Sliced 2D array (rows 1 to 2, cols 2 to 4):\n", arr_2d[1:3, 2:5])
print("Boolean indexing (elements > 10):\n", arr_2d[arr_2d > 10])

Original 1D array: [0 1 2 3 4 5 6 7 8 9]
Sliced 1D array (indices 2 to 5): [2 3 4 5]
Boolean indexing (elements > 5): [6 7 8 9]
Original 2D array:
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
Sliced 2D array (rows 1 to 2, cols 2 to 4):
 [[ 7  8  9]
 [12 13 14]]
Boolean indexing (elements > 10):
 [11 12 13 14 15 16 17 18 19]
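
One practical difference between the two indexing styles is worth seeing in code: a basic slice returns a view onto the original data, while boolean indexing returns a copy. The sketch below demonstrates this; the names `view` and `copy` are illustrative.

```python
import numpy as np

arr_1d = np.arange(10)

view = arr_1d[2:6]           # basic slice: a view onto the same memory
copy = arr_1d[arr_1d > 5]    # boolean mask: a new, independent array

view[0] = 100                # writes through to arr_1d[2]
copy[0] = -1                 # leaves arr_1d untouched

print(arr_1d[2])  # 100
print(arr_1d[6])  # 6
print(np.shares_memory(arr_1d, view))  # True
print(np.shares_memory(arr_1d, copy))  # False
```

Calling `.copy()` on a slice is the usual way to get an independent array when the original must not change.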


## 8. Reshaping
- Use `.reshape()`, `.flatten()`, `.ravel()`.
- Stack and concatenate arrays.


In [None]:
arr = np.arange(10).reshape(5, 2)
print("Reshaped array (5x2):\n", arr)
arr_flat = arr.flatten()
print("Flattened array:", arr_flat)
arr_ravel = arr.ravel()
print("Raveled array:", arr_ravel)
arr2 = np.arange(10, 20).reshape(5, 2)
print("Second array (5x2):\n", arr2)
stacked = np.vstack((arr, arr2))
print("Vertically stacked arrays:\n", stacked)
concatenated = np.concatenate((arr, arr2), axis=0)
print("Concatenated arrays along rows:\n", concatenated)

Reshaped array (5x2):
 [[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
Flattened array: [0 1 2 3 4 5 6 7 8 9]
Raveled array: [0 1 2 3 4 5 6 7 8 9]
Second array (5x2):
 [[10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]
Vertically stacked arrays:
 [[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]
Concatenated arrays along rows:
 [[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]
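
`.flatten()` and `.ravel()` print the same values above, but they differ in memory behavior: `.flatten()` always returns a copy, while `.ravel()` returns a view whenever the memory layout allows it. A minimal sketch, with `flat` and `rav` as illustrative names:

```python
import numpy as np

arr = np.arange(10).reshape(5, 2)

flat = arr.flatten()   # always a copy
rav = arr.ravel()      # a view here, because arr is contiguous in memory

print(np.shares_memory(arr, flat))  # False
print(np.shares_memory(arr, rav))   # True

rav[0] = 99
print(arr[0, 0])   # 99: the change shows through to arr
print(flat[0])     # 0: the copy is unaffected
```

For large arrays `.ravel()` avoids an extra allocation, at the cost that edits to the result can modify the original.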


## 9. Broadcasting
- Add a scalar to an array.
- Add arrays of different shapes (row vector + column vector).


In [13]:
arr = np.arange(10).reshape(2, 5)
print("Original array:\n", arr)
scalar_added = arr + 5
print("Array after adding scalar 5:\n", scalar_added)
row_vector = np.array([[1, 2, 3, 4, 5]])
column_vector = np.array([[1], [2]])
print("Row vector:\n", row_vector)
print("Column vector:\n", column_vector)
broadcasted_sum = row_vector + column_vector
print("Broadcasted sum of row and column vectors:\n", broadcasted_sum)


Original array:
 [[0 1 2 3 4]
 [5 6 7 8 9]]
Array after adding scalar 5:
 [[ 5  6  7  8  9]
 [10 11 12 13 14]]
Row vector:
 [[1 2 3 4 5]]
Column vector:
 [[1]
 [2]]
Broadcasted sum of row and column vectors:
 [[2 3 4 5 6]
 [3 4 5 6 7]]
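
The result shape can be predicted before doing any arithmetic: shapes are compared from the trailing dimension, and two sizes are compatible when they are equal or one of them is 1. The sketch below checks this with `np.broadcast_shapes` (assumed available, which requires NumPy 1.20 or later) and shows a case that fails.

```python
import numpy as np

row_vector = np.array([[1, 2, 3, 4, 5]])   # shape (1, 5)
column_vector = np.array([[1], [2]])       # shape (2, 1)

# Each dimension of size 1 is stretched to match the other operand.
print(np.broadcast_shapes(row_vector.shape, column_vector.shape))  # (2, 5)

# Incompatible shapes raise ValueError: the last dimensions are 5 and 3,
# and neither of them is 1.
try:
    row_vector + np.array([1, 2, 3])
except ValueError as exc:
    print("Broadcast failed:", exc)
```

Adding a scalar is the simplest case of the same rule: the scalar is treated as an array that stretches along every dimension.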
