# Practical 1 - Testing Pandas and NumPy
---

**Subject Name**: Statistical Foundation of Data Sciences

**Subject Code**: CSU1658 Practical

**Due Date**: 18th Sep 2025

**Submitted By**:

- *Ajay Sonkar*
- *GF202343160*

---
## Instructions

- **Please use either Google Colab or Jupyter Notebook.**
- **You are required to create a synthetic dataset that also contains some NaN values. Set the random seed to 42 at the start; it can be changed later as required.**

---
## Prerequisite

**STEP**: Import required libraries

In [1]:
import numpy as np
import pandas as pd

**STEP**: Freeze all imports to `requirements.txt`

In [2]:
import sys
print("Python Version:", sys.version)
IMPORT_CELL = In[1].lower()


def freeze_imports():
    import importlib.metadata as lib_meta
    lines = []
    for module in sys.modules:
        if module.startswith('_'):
            continue
        if module.lower() in IMPORT_CELL:
            try:
                dist = lib_meta.distribution(module)
                lines.append(f"{dist.name}=={dist.version}\n")
            except Exception as e:
                print("General Exception:", e)
    with open("./requirements.txt", "w") as f:
        f.writelines(lines)
    with open("./runtime.txt", "w") as f:
        f.write(f"python-{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}\n")

# Reviewers may not want to run this line
# freeze_imports()

Python Version: 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0]


**STEP**: Seed random functions

In [3]:
SEED = 42
_SEED_FUNCS = (
    np.random.seed,
)
for func in _SEED_FUNCS:
    print(f"Seeding {func.__module__}.{func.__name__} with {SEED}")
    func(SEED)

Seeding numpy.random.seed with 42
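
As a side note, `np.random.seed` sets NumPy's legacy global state. A minimal sketch of the newer `Generator` interface is shown below, assuming the same seed value of 42; the names `rng_a` and `rng_b` are illustrative and are not used elsewhere in this notebook. The draws will generally not match those produced by the legacy functions above, so this is an alternative rather than a drop-in replacement.

```python
import numpy as np

SEED = 42  # same seed value the notebook uses

# Two independent generators built from the same seed produce identical draws.
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)

# The upper bound is exclusive, as with np.random.randint.
draws_a = rng_a.integers(18, 60, size=5)
draws_b = rng_b.integers(18, 60, size=5)

print(draws_a)
print(np.array_equal(draws_a, draws_b))  # True
```

Passing a generator object into functions explicitly avoids hidden dependence on global state, which can make reruns of individual cells easier to reproduce.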


**STEP**: Generate Synthetic Data with NaN values

In [4]:
def generate_synthetic_data(*, num_samples: int = 100, nan_ratio: float = 0.1, include_outliers: bool = False) -> pd.DataFrame:
    """Generate synthetic data with NaN values.
    
    Fields:
        - age: Integer, age of the individual.
        - income: Float, annual income.

    Returns:
        pd.DataFrame: A DataFrame with synthetic data.
    """
    ages = np.random.randint(18, 60, size=num_samples)
    # array of num_samples values, each representing a simulated income,
    # centered around 50,000 with variability determined by 15,000.
    incomes = np.random.normal(5E4, 15000, size=num_samples)
    data = pd.DataFrame({
        'age': ages,
        'income': incomes
    })
    # Introduce NaN values
    num_nans = int(nan_ratio * num_samples)
    nan_indices = np.random.choice(num_samples, num_nans, replace=False)
    data.loc[nan_indices, 'income'] = np.nan
    # Optionally introduce outliers
    if include_outliers:
        outlier_indices = np.random.choice(num_samples, max(1, num_samples // 20), replace=False)
        data.loc[outlier_indices, 'income'] *= 10  # Inflate income to create outliers
    return data


DATASET = generate_synthetic_data(num_samples=1000, nan_ratio=0.1, include_outliers=True)

In [5]:
DATASET.head()

| | age | income |
|---|---|---|
| 0 | 56 | 11406.031379 |
| 1 | 46 | 44216.9264 |
| 2 | 32 | NaN |
| 3 | 25 | 79989.758094 |
| 4 | 38 | 60771.478224 |


In [6]:
DATASET.describe()

| | age | income |
|---|---|---|
| count | 1000.0 | 900.0 |
| mean | 38.745 | 73405.838107 |
| std | 12.186734 | 104858.690962 |
| min | 18.0 | -3523.07949 |
| 25% | 28.0 | 40861.044471 |
| 50% | 40.0 | 51652.576142 |
| 75% | 50.0 | 62491.047403 |
| max | 59.0 | 855283.869197 |


In [7]:
print("NaN values in each column:")
print(DATASET.isna().sum())

NaN values in each column:
age         0
income    100
dtype: int64


---
## Problem 1

> Compute (a) mean, (b) median, and (c) age-weighted mean of income. Ignore NaNs where appropriate. Explain when a weighted mean is preferable.


**Methodology**: Remove rows with NaN in required columns only, preserving valid data. Compute arithmetic mean and median of income on the cleaned subset (they ignore NaNs). For the age‑weighted mean, treat age as reliability/importance weights and apply the formulae. Compare mean vs median to note skew/outlier influence; weighted mean highlights contributions of older (heavier weight) individuals. All computations are transparent and reproducible with simple pandas Series methods.

**STEP 1**: Identify NaN values.

In [8]:
DATASET.isna().sum()

age         0
income    100
dtype: int64

**STEP 2**: Filter out `NaN` values from columns required for calculations.

In [9]:
DATASET_NAN_CLEAN = DATASET.dropna(subset=['age', 'income'])
DATASET_NAN_CLEAN.isna().sum()

age       0
income    0
dtype: int64

In [10]:
DATASET_NAN_CLEAN.head()

| | age | income |
|---|---|---|
| 0 | 56 | 11406.031379 |
| 1 | 46 | 44216.9264 |
| 3 | 25 | 79989.758094 |
| 4 | 38 | 60771.478224 |
| 6 | 36 | 32333.216878 |


In [11]:
print("Dropped NaN values. New dataset shape:", DATASET_NAN_CLEAN.shape)
print("Dropped Rows:", DATASET.shape[0] - DATASET_NAN_CLEAN.shape[0])

Dropped NaN values. New dataset shape: (900, 2)
Dropped Rows: 100
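
The effect of the `subset` argument can be seen on a small toy frame. This is only a sketch: the `note` column is a hypothetical extra column added to show that gaps outside the listed columns do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [25, 40, 31],
    "income": [50_000.0, np.nan, 62_000.0],
    "note": [None, "ok", None],  # hypothetical column with its own gaps
})

# Only 'age' and 'income' are checked; missing values in 'note' are tolerated.
subset_clean = toy.dropna(subset=["age", "income"])

# Without `subset`, a missing value in any column drops the row.
all_clean = toy.dropna()

print(len(subset_clean))  # 2
print(len(all_clean))     # 0
```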


---
### A) Calculate Mean of income

**STEP 3**: Use `.mean()` method to calculate mean of income.

In [12]:
print(f"Mean Income: {DATASET_NAN_CLEAN['income'].mean():,.2f}")

Mean Income: 73,405.84


---
### B) Calculate Median of income

**STEP 4**: Use `.median()` method to calculate median of income.

In [13]:
print("Median Income:", f"{DATASET_NAN_CLEAN['income'].median():,.2f}")

Median Income: 51,652.58
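
The gap between the mean and the median above is what a few very large values tend to produce. A small sketch with made-up numbers (not taken from `DATASET`) shows the same pattern, and also that both methods skip NaN by default:

```python
import numpy as np
import pandas as pd

# Illustrative values only: three typical incomes, one NaN, one extreme value.
incomes = pd.Series([40_000.0, 50_000.0, 60_000.0, np.nan, 500_000.0])

mean_income = incomes.mean()      # NaN is skipped by default (skipna=True)
median_income = incomes.median()  # NaN is skipped here as well

print(f"{mean_income:,.2f}")    # 162,500.00
print(f"{median_income:,.2f}")  # 55,000.00
```

The single extreme value pulls the mean far above the typical incomes, while the median barely moves.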


---
### C) Calculate Age Weighted Mean of income

**STEP 5**: By using the following formula:

$$
\text{Weighted Mean} = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i}
$$

where in our case, `x` is income and `w` is age.

*Ref: < https://mathemerize.com/weighted-mean-formula-and-examples/ >*

In [14]:
def weighted_mean(x: pd.Series, w: pd.Series) -> float:
    """Calculate the weighted mean of a dataset.
    
    Args:
        x (pd.Series): Data values.
        w (pd.Series): Weights.
    
    Returns:
        float: Weighted mean.
    """
    return (x * w).sum() / w.sum()

print("Age Weighted Mean Income:", f"{weighted_mean(DATASET_NAN_CLEAN['income'], DATASET_NAN_CLEAN['age']):,.2f}")

Age Weighted Mean Income: 74,104.71
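
As a cross-check on the formula, the same quantity can be computed with `np.average` and its `weights` argument. The sketch below uses a three-row toy frame rather than `DATASET`. Note that `np.average` does not skip NaN, so missing values would need to be removed first, as was done in STEP 2.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [20, 30, 50],
    "income": [30_000.0, 40_000.0, 80_000.0],
})

# Manual formula: sum(w * x) / sum(w)
manual = (toy["income"] * toy["age"]).sum() / toy["age"].sum()

# Same computation through NumPy
via_numpy = np.average(toy["income"], weights=toy["age"])

print(manual)     # 58000.0
print(via_numpy)  # 58000.0
```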


---
### Explain when a weighted mean is preferable.

A weighted mean is used when different observations in a dataset have varying levels of importance, reliability, or influence. It provides a more accurate reflection of the central tendency when certain data points or groups need to be emphasized based on their significance. Below are key contexts where a weighted mean is preferable:

- **Varying Importance or Reliability of Observations**:

    >  *When different observations are not equally reliable, the weighted mean allows for adjustments. For instance, when analyzing income data, older individuals might have more stable income patterns, and their data could be given greater weight. This is because older individuals may have had longer careers or more consistent income streams, making their data more representative of long-term trends. In such cases, the weighted mean helps ensure that more reliable data points are reflected more prominently in the calculation of the average.*

- **Varying Sample Sizes Across Groups**:

    > *If data is collected from different groups of unequal sizes, the weighted mean helps balance their influence on the overall average. For instance, if income data is gathered from various age groups, and there are significantly more younger individuals than older ones, the larger group could disproportionately affect the mean. By applying weights based on group size or importance, the weighted mean ensures that each group is properly represented in the overall calculation. This is especially useful when data is not evenly distributed across groups, and the arithmetic mean would otherwise bias the result.*

- **Handling Heteroscedasticity (Unequal Variance Across Groups)**:

    > *Where there is heteroscedasticity, meaning the variance in the data differs across groups, applying weights can correct for it. For instance, income variability might be greater among younger individuals due to career instability or a wider income range, while older individuals might exhibit more stable income patterns. By assigning more weight to data points with lower variance (e.g., older individuals with stable incomes), the weighted mean helps produce a more accurate reflection of the average income across groups, accounting for the differing reliability or consistency of the data.*

- **Population Representation Adjustments**:

    > *Weighted means are particularly useful when the sample does not accurately reflect the entire population. In survey sampling, differences in demographics between the sample and the population can skew the results. For example, if a survey disproportionately includes data from urban areas while the overall population is mostly rural, applying weights corrects for this imbalance so that the weighted mean reflects the true population distribution. This adjustment helps avoid biases that result from over- or under-representing certain segments of the population.*

Using an age-weighted mean gives more influence to older individuals, who typically have more stable incomes and may provide more reliable data. This yields a different perspective on the "average" income than the simple arithmetic mean, which treats all individuals equally. In this synthetic dataset, age and income are generated independently, so the age-weighted mean (74,104.71) stays close to the arithmetic mean (73,405.84); with real data, where income often rises with age, the gap would typically be larger.
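
The population-representation case can be made concrete with a short sketch. All numbers here are hypothetical and chosen only for illustration: a sample that is 80% urban is reweighted to an assumed population that is 70% rural.

```python
import numpy as np

# Hypothetical group means from a survey sample (illustrative numbers only).
group_means = np.array([60_000.0, 40_000.0])  # urban, rural
sample_share = np.array([0.8, 0.2])           # share of respondents
population_share = np.array([0.3, 0.7])       # assumed true population share

# Weighting by sample share reproduces the plain sample mean.
unadjusted = np.average(group_means, weights=sample_share)

# Weighting by population share corrects for the over-sampled urban group.
adjusted = np.average(group_means, weights=population_share)

print(unadjusted)  # 56000.0
print(adjusted)    # 46000.0
```

The unadjusted figure overstates the average because the higher-income group is over-represented in the sample.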

---
## Problem 2

> Standardize income (z-score). Report how many incomes are outliers using rule |z| > 3. Handle NaNs correctly (do not drop entire rows unnecessarily).

**Methodology**: Compute income mean and standard deviation with skipna to ignore missing values. Derive each z-score as (income − mean)/std, propagating NaN where income is NaN (no row drop). Flag outliers where absolute z-score exceeds 3 (|z| > 3). Summarize count of such extreme values to assess tail heaviness introduced by synthetic outliers. Retain original data structure while appending a z-score column for downstream analysis.

---
**STEP 1**: Calculate Mean and Standard Deviation of Income

In [15]:
income_mean = DATASET["income"].mean(skipna=True)
income_std = DATASET["income"].std(skipna=True)
print(f"Income Mean: {income_mean:,.2f}")
print(f"Income Std Dev: {income_std:,.2f}")

Income Mean: 73,405.84
Income Std Dev: 104,858.69


---
**STEP 2**: Calculate Z-Scores for Each Income

The z-score for a data point $x_i$ is calculated as:

$$
z_i = \frac{x_i - \mu}{\sigma}
$$

Where:
- $x_i$ is the individual data point (in this case, income).
- $\mu$ is the mean of the data.
- $\sigma$ is the standard deviation of the data.

In [16]:
P2_DF = DATASET.copy()
P2_DF["z_score"] = (P2_DF["income"] - income_mean) / income_std
P2_DF.head()

| | age | income | z_score |
|---|---|---|---|
| 0 | 56 | 11406.031379 | -0.59127 |
| 1 | 46 | 44216.9264 | -0.278364 |
| 2 | 32 | NaN | NaN |
| 3 | 25 | 79989.758094 | 0.062789 |
| 4 | 38 | 60771.478224 | -0.120489 |


---
**STEP 3**: Identify outliers by checking if the absolute value of the z-score is greater than 3.

In [17]:
OUTLIERS = P2_DF[P2_DF["z_score"].abs() > 3]
print(f"Number of outliers (|z| > 3): {OUTLIERS.shape[0]}")

Number of outliers (|z| > 3): 40
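
A small sketch on toy values shows why no rows need to be dropped for this step: a NaN income produces a NaN z-score, and the comparison `> 3` evaluates to `False` for NaN, so missing values are never counted as outliers. Note also that `Series.std` uses the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
import pandas as pd

# Toy values only; one entry is missing.
income = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0])

# mean() and std() skip NaN; the subtraction propagates NaN row-wise.
z = (income - income.mean()) / income.std()

# NaN compared with 3 is False, so missing rows are not flagged.
is_outlier = z.abs() > 3

print(len(z) == len(income))   # True: no rows were dropped
print(int(z.isna().sum()))     # 1
print(int(is_outlier.sum()))   # 0
```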


In [18]:
OUTLIERS.head()

| | age | income | z_score |
|---|---|---|---|
| 26 | 44 | 709574.664132 | 6.066916 |
| 33 | 24 | 462894.837371 | 3.714418 |
| 69 | 40 | 792459.88934 | 6.857362 |
| 141 | 30 | 403068.595727 | 3.143876 |
| 148 | 23 | 451510.206606 | 3.605847 |


---
## Problem 3

> Create age bins: [18-25), [25-35), [35-45), [45-60) and compute for each bin:
>
>   ● count of observations,
>
>   ● mean income,
>
>   ● median score.

**Methodology**: Define explicit left‑inclusive, right‑exclusive age bins matching the specification using `pd.cut`. Copy prior z-score DataFrame to preserve calculations, assign each record to a bin, then `groupby` age_bin computing: observation count (non‑NaN ages), mean income (ignoring NaNs), and median z-score (robust central tendency). Reset index and sort bins to maintain natural age order. This summarizes distributional shifts across life stages.

---
**STEP 1**: Create bin with specified sizes and compute:

- count of observations
- mean income
- median score (assuming it refers to z-scores)

**STEP 2**: Reset index and sort age_bin values.

In [19]:
bin_sizes = [18, 25, 35, 45, 60]
# Labels use ")" on the right to match the right-exclusive bins
labels = [f"[{bin_sizes[i]}-{bin_sizes[i+1]})" for i in range(len(bin_sizes)-1)]
P3_DF = P2_DF.copy()
# left-inclusive, right-exclusive intervals
P3_DF["age_bin"] = pd.cut(P3_DF["age"], bins=bin_sizes, labels=labels, right=False)
BIN_DF = P3_DF.groupby("age_bin", observed=True).agg(
    ObservationCount=("age", "count"),
    MeanIncome=("income", "mean"),
    MedianScore=("z_score", "median"),
).reset_index().sort_values("age_bin")
BIN_DF

| | age_bin | ObservationCount | MeanIncome | MedianScore |
|---|---|---|---|---|
| 0 | [18-25) | 169 | 65252.55943 | -0.202893 |
| 1 | [25-35) | 225 | 80966.663776 | -0.208763 |
| 2 | [35-45) | 232 | 70224.476301 | -0.219122 |
| 3 | [45-60) | 374 | 74483.694348 | -0.20064 |
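
How `right=False` treats values that sit exactly on a bin edge can be checked with a few hand-picked ages. This is a sketch on toy values, using the same bin edges as above but without custom labels so that pandas shows the interval notation itself.

```python
import pandas as pd

edges = [18, 25, 35, 45, 60]
ages = pd.Series([18, 24, 25, 44, 45, 59, 60])

# right=False makes each bin closed on the left and open on the right.
binned = pd.cut(ages, bins=edges, right=False)

print(binned.iloc[2])              # [25, 35): age 25 goes to the second bin
print(int(binned.isna().sum()))    # 1: age 60 falls outside [45, 60)
print(binned.value_counts(sort=False))
```

An age equal to the final edge (60) is left unbinned, which is harmless here because the generated ages stop at 59.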


---
## Problem 4

> Create an array that is not one-dimensional, then showcase the following operations:
>
> ● Shape and Resize → shape, size, Transpose, Flatten
>
> ● Showcasing negative indexing and display error while doing slicing
>
> ● Arithmetic Operations → Broadcasting, Dot Product
>
> ● Linear Algebra → Determinant, Inverse

**Methodology**: Construct higher‑dimensional random arrays to illustrate structural properties. Inspect shape/size/ndim, then reshape to 2D, transpose, and flatten to show memory/view transformations. Demonstrate negative indexing for tail access and intentionally trigger an index error via invalid slice. Showcase broadcasting with scalar and shape‑compatible addition, then compute dot product for matrix multiplication. Generate a square matrix to obtain determinant and (if non‑singular) its inverse, linking numeric linear algebra concepts.

---
**STEP 1**: Create an `N`-dimensional array.

In [20]:
def print_array_info(arr: np.ndarray) -> None:
    """Print information about a numpy array.
    
    Args:
        arr (np.ndarray): Input array.
    """
    print(f"Array Shape: {arr.shape}")
    print(f"Array Dimensions: {arr.ndim}")
    print(f"Array Data Type: {arr.dtype}")
    print(f"Array Size (Total Elements): {arr.size}")
    print(f"Array Memory Size (Bytes): {arr.nbytes}")


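def create_n_dim_array(n: int) -> np.ndarray:
    """Create an N-dimensional array filled with random floats.
    
    Args:
        n (int): Number of dimensions; also used as the length of every axis.
    
    Returns:
        np.ndarray: Array of shape (n,) * n with values drawn uniformly from [0, 1).
    """
    if n <= 1:
        raise ValueError("n must be greater than 1 to create a multi-dimensional array.")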
    return np.random.rand(*tuple(n for _ in range(n)))


print("3-D Array Info:")
print_array_info(
    create_n_dim_array(3)
)
print("\n5-D Array Info:")
print_array_info(
    create_n_dim_array(5)
)
print("\n7-D Array Info:")
print_array_info(
    create_n_dim_array(7)
)

3-D Array Info:
Array Shape: (3, 3, 3)
Array Dimensions: 3
Array Data Type: float64
Array Size (Total Elements): 27
Array Memory Size (Bytes): 216

5-D Array Info:
Array Shape: (5, 5, 5, 5, 5)
Array Dimensions: 5
Array Data Type: float64
Array Size (Total Elements): 3125
Array Memory Size (Bytes): 25000

7-D Array Info:
Array Shape: (7, 7, 7, 7, 7, 7, 7)
Array Dimensions: 7
Array Data Type: float64
Array Size (Total Elements): 823543
Array Memory Size (Bytes): 6588344


---
**STEP 2**: Demonstrate Shape and Size Operations

In [21]:
array_3d = create_n_dim_array(3)
array_3d

array([[[0.26495676, 0.6008659 , 0.68069653],
        [0.50405233, 0.98663321, 0.43908986],
        [0.94850013, 0.03028413, 0.22594992]],

       [[0.98868852, 0.60982342, 0.99634182],
        [0.88165012, 0.13310598, 0.74304789],
        [0.88640822, 0.66417612, 0.8714741 ]],

       [[0.02509876, 0.66338936, 0.64168665],
        [0.08199673, 0.90475077, 0.77217061],
        [0.15284964, 0.04746079, 0.10255991]]])

---
**STEP 2.1**: Array properties inspection

In [22]:
print("array_3d.shape:", array_3d.shape)
print("array_3d.ndim:", array_3d.ndim)
print("array_3d.dtype:", array_3d.dtype)
print("array_3d.size:", array_3d.size)
print("array_3d.nbytes:", array_3d.nbytes)

array_3d.shape: (3, 3, 3)
array_3d.ndim: 3
array_3d.dtype: float64
array_3d.size: 27
array_3d.nbytes: 216


---
**STEP 2.2**: Array Resize

In [23]:
# Convert to 2D array
array_2d = array_3d.reshape(3, -1)
print(array_2d)
print_array_info(array_2d)

[[0.26495676 0.6008659  0.68069653 0.50405233 0.98663321 0.43908986
  0.94850013 0.03028413 0.22594992]
 [0.98868852 0.60982342 0.99634182 0.88165012 0.13310598 0.74304789
  0.88640822 0.66417612 0.8714741 ]
 [0.02509876 0.66338936 0.64168665 0.08199673 0.90475077 0.77217061
  0.15284964 0.04746079 0.10255991]]
Array Shape: (3, 9)
Array Dimensions: 2
Array Data Type: float64
Array Size (Total Elements): 27
Array Memory Size (Bytes): 216


---
**STEP 2.3**: Transpose Array

In [24]:
array_2d_transposed = array_2d.T
print_array_info(array_2d_transposed)
array_2d_transposed

Array Shape: (9, 3)
Array Dimensions: 2
Array Data Type: float64
Array Size (Total Elements): 27
Array Memory Size (Bytes): 216


array([[0.26495676, 0.98868852, 0.02509876],
       [0.6008659 , 0.60982342, 0.66338936],
       [0.68069653, 0.99634182, 0.64168665],
       [0.50405233, 0.88165012, 0.08199673],
       [0.98663321, 0.13310598, 0.90475077],
       [0.43908986, 0.74304789, 0.77217061],
       [0.94850013, 0.88640822, 0.15284964],
       [0.03028413, 0.66417612, 0.04746079],
       [0.22594992, 0.8714741 , 0.10255991]])

---
**STEP 2.4**: Flatten Array

In [25]:
flatten_array = array_2d.flatten()
print_array_info(flatten_array)
flatten_array

Array Shape: (27,)
Array Dimensions: 1
Array Data Type: float64
Array Size (Total Elements): 27
Array Memory Size (Bytes): 216


array([0.26495676, 0.6008659 , 0.68069653, 0.50405233, 0.98663321,
       0.43908986, 0.94850013, 0.03028413, 0.22594992, 0.98868852,
       0.60982342, 0.99634182, 0.88165012, 0.13310598, 0.74304789,
       0.88640822, 0.66417612, 0.8714741 , 0.02509876, 0.66338936,
       0.64168665, 0.08199673, 0.90475077, 0.77217061, 0.15284964,
       0.04746079, 0.10255991])
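
The difference between a view and a copy in these operations can be checked with `np.shares_memory`. The sketch below uses a small integer array rather than `array_2d`; `ravel` is included only for contrast with `flatten`.

```python
import numpy as np

base = np.arange(6).reshape(2, 3)

reshaped = base.reshape(3, 2)  # a view for a contiguous array
flat_copy = base.flatten()     # always returns a copy
flat_view = base.ravel()       # returns a view when possible

print(np.shares_memory(base, reshaped))   # True
print(np.shares_memory(base, flat_copy))  # False
print(np.shares_memory(base, flat_view))  # True

flat_copy[0] = 99  # does not touch base
flat_view[0] = -1  # writes through to base
print(base[0, 0])  # -1
```

This is why modifying the result of `flatten()` is safe, while modifying a reshaped or raveled array can silently change the original.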

---
**STEP 3.1**: Negative index access

In [26]:
array_2d[-1, -3]

np.float64(0.15284964186680028)
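
A negative index counts back from the end of an axis, so `-k` refers to position `size - k`. A minimal sketch on a small integer grid (not `array_2d`) makes the equivalence explicit:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
rows, cols = grid.shape

# Last row, third column from the end.
from_negative = grid[-1, -3]

# The same element addressed with non-negative indices.
from_positive = grid[rows - 1, cols - 3]

print(from_negative)  # 9
print(from_positive)  # 9
```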

---
**STEP 3.2**: Slicing Error

In [27]:
# The out-of-range slice 3:7 is clipped silently; the integer index 10 on axis 1 is what raises IndexError
array_2d[3:7, 10]

IndexError: index 10 is out of bounds for axis 1 with size 9
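
Slices and integer indices behave differently at the array boundary, which the following sketch separates. It uses a toy array with the same shape as `array_2d` and catches the exception so the cell keeps running.

```python
import numpy as np

grid = np.arange(27).reshape(3, 9)

# A slice past the end is clipped: no error, just an empty result.
clipped = grid[3:7, :]
print(clipped.shape)  # (0, 9)

# An integer index past the end raises IndexError.
try:
    grid[0, 10]
except IndexError as exc:
    message = str(exc)
    print("IndexError:", message)
```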

---
**STEP 4**: Arithmetic Operations

---
**STEP 4.1**: Broadcasting

In [28]:
print("Original Array:")
print(array_2d)
# The scalar 10 is broadcast across every element of array_2d
scaled = array_2d * 10
print("\nArray after scalar multiplication:")
print(scaled)
other_2d_array = np.random.rand(3, array_2d.shape[1])
print("\nOther 2D Array:")
print(other_2d_array)
# Both operands have shape (3, 9), so this addition is plain element-wise
barr = array_2d + other_2d_array
print("\nArray after addition with another array:")
print(barr)

Original Array:
[[0.26495676 0.6008659  0.68069653 0.50405233 0.98663321 0.43908986
  0.94850013 0.03028413 0.22594992]
 [0.98868852 0.60982342 0.99634182 0.88165012 0.13310598 0.74304789
  0.88640822 0.66417612 0.8714741 ]
 [0.02509876 0.66338936 0.64168665 0.08199673 0.90475077 0.77217061
  0.15284964 0.04746079 0.10255991]]

Array after scalar multiplication:
[[2.64956761 6.00865899 6.80696533 5.04052329 9.86633212 4.39089859
  9.48500126 0.30284127 2.25949925]
 [9.88688517 6.09823418 9.96341821 8.81650118 1.33105979 7.43047888
  8.86408223 6.64176117 8.71474104]
 [0.2509876  6.63389361 6.41686648 0.81996733 9.04750767 7.72170607
  1.52849642 0.47460792 1.02559908]]

Other 2D Array:
[[0.93087131 0.82070433 0.66516217 0.67133104 0.47092199 0.46341132
  0.85883464 0.67802858 0.87453228]
 [0.90303619 0.77093316 0.75665569 0.99488809 0.69818595 0.08658933
  0.98951816 0.19632836 0.69779292]
 [0.2224241  0.82821669 0.84690075 0.89590999 0.35776735 0.46061789
  0.35189445 0.38439899 0.731
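
In the cell above, the two arrays being added share the shape (3, 9), so no stretching takes place. A sketch of broadcasting between arrays of different shapes, using small illustrative arrays, could look like this:

```python
import numpy as np

matrix = np.arange(6, dtype=float).reshape(2, 3)  # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])                # shape (3,)
col = np.array([[100.0], [200.0]])                # shape (2, 1)

plus_row = matrix + row  # row is stretched across both rows
plus_col = matrix + col  # col is stretched across all three columns

print(plus_row)  # [[10. 21. 32.] [13. 24. 35.]]
print(plus_col)  # [[100. 101. 102.] [203. 204. 205.]]

# Shapes (2, 3) and (2,) do not align on the trailing axis, so this fails.
try:
    matrix + np.ones(2)
except ValueError as exc:
    broadcast_error = str(exc)
    print("ValueError:", broadcast_error)
```

Shapes are compared from the trailing axis backwards; two axes are compatible when they are equal or when one of them is 1.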

---
**STEP 4.2**: Dot Product

In [29]:
np.dot(array_2d, other_2d_array.T)

array([[3.2317787 , 3.54806549, 2.65054001],
       [5.05619452, 4.76688036, 3.95309344],
       [2.08672679, 2.03188436, 1.99831992]])
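
The transpose is needed because matrix multiplication requires the inner dimensions to match: (3, 9) times (9, 3) gives (3, 3). The same rule on a smaller hand-checkable example, using the `@` operator as an equivalent of `np.dot` for 2-D arrays:

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # shape (2, 3)
b = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])  # shape (2, 3)

# (2, 3) @ (3, 2) -> (2, 2); each entry is a row of a dotted with a row of b.
product = a @ b.T

print(product)  # [[ 4.  2.] [10.  5.]]
print(np.allclose(product, np.dot(a, b.T)))  # True
```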

---
**STEP 5.1**: Linear Algebra - Determinant

In [30]:
sq_mat = create_n_dim_array(2)
print_array_info(sq_mat)
print(sq_mat)
det = np.linalg.det(sq_mat)
print(f"Determinant of the matrix: {det:.4f}")

Array Shape: (2, 2)
Array Dimensions: 2
Array Data Type: float64
Array Size (Total Elements): 4
Array Memory Size (Bytes): 32
[[0.34352374 0.49082524]
 [0.16181028 0.51820558]]
Determinant of the matrix: 0.0986


---
**STEP 5.2**: Linear Algebra - Inverse

In [31]:
# Inverse exists only if the determinant is non-zero (a negative determinant is still invertible)
if not np.isclose(det, 0):
    inv_mat = np.linalg.inv(sq_mat)
    print("\nInverse of the matrix:")
    print(inv_mat)


Inverse of the matrix:
[[ 5.25588258 -4.97817835]
 [-1.64115529  3.48417789]]
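
A computed inverse can be verified by multiplying it back against the original matrix, which should give the identity up to floating-point error. The sketch below uses hand-picked matrices rather than `sq_mat`, and also shows what NumPy does when the matrix is singular.

```python
import numpy as np

mat = np.array([[2.0, 1.0],
                [1.0, 3.0]])
det = np.linalg.det(mat)  # 2*3 - 1*1 = 5
inv = np.linalg.inv(mat)

# A @ A^-1 should equal the identity matrix within rounding error.
identity_ok = np.allclose(mat @ inv, np.eye(2))
print(identity_ok)  # True

# The second row is twice the first, so the determinant is zero.
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    singular_raised = False
except np.linalg.LinAlgError:
    singular_raised = True
print(singular_raised)  # True
```

Catching `np.linalg.LinAlgError` is an alternative to testing the determinant beforehand, since a determinant that is tiny but non-zero can still indicate a numerically unreliable inverse.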


---
**Licensed under [MIT License](https://opensource.org/licenses/MIT).**