***Missing values handling (mean / median / mode)***<br>

Missing value handling means dealing with incomplete data, i.e., a dataset in which some values are missing or empty.<br>
In real-world data, this happens often because of collection errors, human mistakes, or system issues.<br>
Handling missing values matters because they can distort analysis and predictions and reduce model performance.<br>
Missing values can appear as NaN, None, Null, or just empty cells.<br>

1. Using Mean (Average)<br>
-Mean is the average of all numbers in a column.<br>
-If a value is missing, you replace it with the average of the column.<br>
-When to use: For numeric columns whose values are spread fairly symmetrically, with no large outliers.<br>
-Example: Ages: 20, 22, 24, missing → fill with the mean (22)<br>

2. Using Median (Middle value)<br>
-Median is the middle value when numbers are arranged in order.<br>
-Replace missing values with the middle number.<br>
-When to use: Preferred when the column has outliers (very high or very low values), because the median is barely affected by them.<br>
-Example: Salary: 20k, 22k, 1000k, missing → fill with the median (22k); the mean would be about 347k.<br>

3. Using Mode (Most frequent value)<br>
-Mode is the number or value that appears most often.<br>
-Replace missing values with the most common value.<br>
-When to use: Best for categorical data (like color, city, gender).<br>
-Example: Color: Red, Blue, Red, missing → fill with Red<br>
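
The three examples above can be checked with Python's standard `statistics` module. This is only a minimal sketch: the variable names are illustrative and are not used by the notebook cells further down.

```python
import statistics

# Values from the three examples above, with the missing entry left out
ages = [20, 22, 24]
salaries_k = [20, 22, 1000]   # salaries in thousands
colors = ["Red", "Blue", "Red"]

age_fill = statistics.mean(ages)              # 22
salary_fill = statistics.median(salaries_k)   # 22, unaffected by the 1000k outlier
salary_mean = statistics.mean(salaries_k)     # about 347.33, pulled up by the outlier
color_fill = statistics.mode(colors)          # "Red"

print(age_fill, salary_fill, round(salary_mean, 2), color_fill)
```

Comparing `salary_fill` with `salary_mean` shows why the median is the safer fill value when a column contains an extreme value.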

***Outlier detection***<br>
-An outlier is a value in your data that is very different from the other values.<br>
-It’s like a “stranger” in your data.<br>
-Outliers can affect your analysis and skew results.<br>

Example:<br>
A class has students’ ages:<br>
10, 11, 10, 12, 11, 50<br>
Here, 50 is an outlier because all other ages are around 10–12.<br>

Some simple ways to find outliers:<br>
a) Using Mean and Standard Deviation<br>
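-If a value lies too far from the mean, it might be an outlier.<br>
-Rule of thumb: A value more than 3 standard deviations away from the mean is usually treated as an outlier.<br>
-This rule works best on larger samples, because in a small sample a single extreme value inflates the standard deviation.<br>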

b) Using Interquartile Range (IQR)<br>
-IQR = Q3 − Q1, where Q1 and Q3 are the first and third quartiles; it measures the spread of the middle 50% of the data.<br>
-Anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is flagged as an outlier.<br>
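
As a rough illustration, both rules can be applied to the class ages from the example above. The function names `zscore_outliers` and `iqr_outliers` are made up for this sketch. The quartiles come from `statistics.quantiles`, which interpolates, so they can differ slightly from the position-based quartiles used in the notebook cells below.

```python
import math
import statistics

ages = [10, 11, 10, 12, 11, 50]  # class ages from the example above

def zscore_outliers(values, threshold=3.0):
    """Return values whose |z-score| exceeds threshold (population std dev)."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / len(values))
    if sd == 0:
        return []  # all values identical, so nothing stands out
    return [x for x in values if abs(x - m) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

print(zscore_outliers(ages))       # [] because the z-score of 50 is only about 2.23
print(zscore_outliers(ages, 2.0))  # [50]
print(iqr_outliers(ages))          # [50]
```

With only six values, the 3-standard-deviation rule misses 50, because that single value inflates the standard deviation. The IQR rule still flags it.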


In [19]:
# Task: create a fill_missing(data, strategy) function that fills missing values
# using the mean, median, or mode, depending on the selected strategy.
# The version below also takes the name of the column to fill.

import csv
from collections import Counter

# Calculate mean
def calculate_mean(numbers):
    return sum(numbers) / len(numbers)

# Calculate median
def calculate_median(numbers):
    numbers = sorted(numbers)
    n = len(numbers)
    mid = n // 2
    if n % 2 == 0:
        return (numbers[mid-1] + numbers[mid]) / 2
    else:
        return numbers[mid]

# Calculate mode
def calculate_mode(values):
    return Counter(values).most_common(1)[0][0]
    
# Function to fill missing values
def fill_missing(data, column, strategy):
    """Fill empty strings in `column` using the given strategy.

    Note: the rows of `data` are modified in place, and a mean or median
    fill value is a float while the existing entries stay strings.
    """
    # Step 1: Extract all non-missing values
    non_missing = [row[column] for row in data if row[column] != '']
    if not non_missing:
        raise ValueError(f"Column {column!r} has no non-missing values")

    # Convert to float for the numeric strategies
    if strategy in ['mean', 'median']:
        non_missing = list(map(float, non_missing))

    # Step 2: Calculate replacement value
    if strategy == 'mean':
        value_to_fill = calculate_mean(non_missing)
    elif strategy == 'median':
        value_to_fill = calculate_median(non_missing)
    elif strategy == 'mode':
        value_to_fill = calculate_mode(non_missing)
    else:
        raise ValueError("Strategy must be 'mean', 'median', or 'mode'")

    # Step 3: Replace missing values
    for row in data:
        if row[column] == '':
            row[column] = value_to_fill

    return data

# Load Titanic dataset
filename = "C:/Users/shres/OneDrive/Documents/titanic.csv"
with open(filename, 'r', newline='') as f:
    reader = csv.DictReader(f)
    titanic_data = [row for row in reader]

# Fill missing 'Age' with mean
titanic_data = fill_missing(titanic_data, 'Age', 'mean')

# Fill missing 'Embarked' with mode
titanic_data = fill_missing(titanic_data, 'Embarked', 'mode')

# Print first 5 rows to see result
for row in titanic_data[:5]:
    print(row)

{'Passengerid': '1', 'Age': '22', 'Fare': '7.25', 'Sex': '0', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '0'}
{'Passengerid': '2', 'Age': '38', 'Fare': '', 'Sex': '1', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '1', 'Embarked': '0', '2urvived': '1'}
{'Passengerid': '3', 'Age': 29.50701606732976, 'Fare': '7.925', 'Sex': '1', 'sibsp': '0', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '1'}
{'Passengerid': '4', 'Age': '35', 'Fare': '53.1', 'Sex': '1', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '1', 'Embarked': '2', '2urvived': '1'}
{'Passengerid': '5', 'Age': '35', 'Fare': '8.05', 'Sex': '0', 'sibsp': '0', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '0'}


In [18]:
#Create a drop_missing(data) function that removes records that contain missing values.

import csv

def drop_missing(data):
    clean_data = []   # to store rows without missing values

    for row in data:          # go through each row
        if '' not in row.values():   # check if no value is empty
            clean_data.append(row)   # keep the row

    return clean_data


# ---- Load Titanic Dataset ----
filename = "C:/Users/shres/OneDrive/Documents/titanic.csv"
with open(filename, 'r', newline='') as f:
    reader = csv.DictReader(f)
    titanic_data = [row for row in reader]
    
# Before dropping
print("Total rows before:", len(titanic_data))

# Drop rows with missing values
clean_titanic = drop_missing(titanic_data)

# After dropping
print("Total rows after:", len(clean_titanic))

# Show first 3 rows
for row in clean_titanic[:3]:
    print(row)


Total rows before: 1309
Total rows after: 1304
{'Passengerid': '1', 'Age': '22', 'Fare': '7.25', 'Sex': '0', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '0'}
{'Passengerid': '4', 'Age': '35', 'Fare': '53.1', 'Sex': '1', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '1', 'Embarked': '2', '2urvived': '1'}
{'Passengerid': '5', 'Age': '35', 'Fare': '8.05', 'Sex': '0', 'sibsp': '0', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '0'}
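
The `drop_missing` function above only treats the empty string as missing, while the notes at the top mention that missing values can also show up as NaN, None, or Null. A possible extension is sketched below. The `MISSING_MARKERS` set and the helper names are assumptions for illustration, and the markers would need to be adjusted to the actual data source.

```python
import math

# Hypothetical set of text markers treated as missing; adjust to the data source
MISSING_MARKERS = {"", "na", "n/a", "nan", "null", "none"}

def is_missing(value):
    """Return True if value looks like a missing entry."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return value.strip().lower() in MISSING_MARKERS
    return False

def drop_missing_any(data):
    """Return a new list with only the rows that have no missing entry."""
    return [row for row in data if not any(is_missing(v) for v in row.values())]

rows = [
    {"Age": "22", "Fare": "7.25"},
    {"Age": "NaN", "Fare": "8.05"},
    {"Age": "35", "Fare": None},
    {"Age": " ", "Fare": "53.1"},
]
print(drop_missing_any(rows))  # only the first row survives
```

The input list is left untouched, so the number of dropped rows can still be computed as `len(rows) - len(drop_missing_any(rows))`.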


In [20]:
#Implement a detect_outliers(data, method) function that identifies outliers using:
#• IQR method
#• Z-score method
import csv
import math

#  Helper functions 
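def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    # Population standard deviation (divides by n, not n - 1)
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

# Outlier detection function
def detect_outliers(data, method):
    """Return the outliers in `data` as a new list; `data` itself is not changed."""
    outliers = []

    if method == "iqr":
        data = sorted(data)  # sorted copy; the caller's list keeps its order

        # Approximate quartiles by position in the sorted list (no interpolation)
        n = len(data)
        q1 = data[n // 4]
        q3 = data[(3 * n) // 4]
        iqr = q3 - q1

        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr

        for x in data:
            if x < lower or x > upper:
                outliers.append(x)

    elif method == "zscore":
        m = mean(data)
        sd = std_dev(data)

        for x in data:
            # sd is 0 when all values are identical; nothing is an outlier then
            if sd > 0 and abs((x - m) / sd) > 3:
                outliers.append(x)

    else:
        raise ValueError("Method must be 'iqr' or 'zscore'")

    return outliers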


#  Load Titanic dataset 
filename = "C:/Users/shres/OneDrive/Documents/titanic.csv"
with open(filename, 'r', newline='') as f:
    reader = csv.DictReader(f)
    titanic_data = [row for row in reader]

# Extract Age column (ignore missing values)
ages = [float(row["Age"]) for row in titanic_data if row["Age"] != ""]

# Detect outliers
iqr_outliers = detect_outliers(ages, "iqr")
z_outliers = detect_outliers(ages, "zscore")

print("IQR Outliers:", iqr_outliers)
print("Z-score Outliers:", z_outliers)


IQR Outliers: [0.17, 0.33, 0.42, 0.67, 0.75, 0.75, 0.75, 0.83, 0.83, 0.83, 0.92, 0.92, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 55.0, 55.0, 55.0, 55.0, 55.0, 55.0, 55.0, 55.0, 55.5, 56.0, 56.0, 56.0, 56.0, 57.0, 57.0, 57.0, 57.0, 57.0, 58.0, 58.0, 58.0, 58.0, 58.0, 58.0, 59.0, 59.0, 59.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.5, 61.0, 61.0, 61.0, 61.0, 61.0, 62.0, 62.0, 62.0, 62.0, 62.0, 63.0, 63.0, 63.0, 63.0, 64.0, 64.0, 64.0, 64.0, 64.0, 65.0, 65.0, 65.0, 66.0, 67.0, 70.0, 70.0, 70.5, 71.0, 71.0, 74.0, 76.0, 80.0]
Z-score Outliers: [71.0, 70.5, 71.0, 80.0, 70.0, 70.0, 74.0, 76.0]
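
The `std_dev` helper in the cell above divides by n, which gives the population standard deviation. The sketch below compares it with the standard library's `statistics.pstdev` (population) and `statistics.stdev` (sample, divides by n - 1). It uses the small class-ages list from the notes rather than the Titanic data.

```python
import math
import statistics

ages = [10, 11, 10, 12, 11, 50]  # class ages from the example earlier in the notes

def std_dev(values):
    # Same formula as the notebook helper: population standard deviation
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

population_sd = statistics.pstdev(ages)  # divides by n, like std_dev above
sample_sd = statistics.stdev(ages)       # divides by n - 1, so it is slightly larger

print(round(std_dev(ages), 3), round(population_sd, 3), round(sample_sd, 3))
```

On a dataset as large as the Titanic ages the two versions differ very little, so the choice mainly matters for small samples.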


In [9]:
#Return the detected outliers as a separate list without modifying the original dataset.
import csv
import math

#  Helper functions 
def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

# Outlier detection function 
def detect_outliers(data, method):
    outliers = []

    if method == "iqr":
        data = sorted(data)

        n = len(data)
        q1 = data[n // 4]
        q3 = data[(3 * n) // 4]
        iqr = q3 - q1

        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr

        for x in data:
            if x < lower or x > upper:
                outliers.append(x)

    elif method == "zscore":
        m = mean(data)
        sd = std_dev(data)

        for x in data:
            # Guard against sd == 0 (all values identical)
            if sd > 0 and abs((x - m) / sd) > 3:
                outliers.append(x)

    else:
        raise ValueError("Method must be 'iqr' or 'zscore'")

    return outliers


#  Load Titanic dataset 
filename = "C:/Users/shres/OneDrive/Documents/titanic.csv"
with open(filename, 'r', newline='') as f:
    reader = csv.DictReader(f)
    titanic_data = [row for row in reader]

# Extract Age column (ignore missing values)
ages = [float(row["Age"]) for row in titanic_data if row["Age"] != ""]

# Detect outliers
iqr_outliers = detect_outliers(ages, "iqr")
z_outliers = detect_outliers(ages, "zscore")

print("Original Data:",ages)
print("IQR Outliers:", iqr_outliers)
print("Z-score Outliers:", z_outliers)

Original Data: [22.0, 38.0, 26.0, 35.0, 35.0, 28.0, 54.0, 2.0, 27.0, 14.0, 4.0, 58.0, 20.0, 39.0, 14.0, 55.0, 2.0, 28.0, 31.0, 28.0, 35.0, 34.0, 15.0, 28.0, 8.0, 38.0, 28.0, 19.0, 28.0, 28.0, 40.0, 28.0, 28.0, 66.0, 28.0, 42.0, 28.0, 21.0, 18.0, 14.0, 40.0, 27.0, 28.0, 3.0, 19.0, 28.0, 28.0, 28.0, 28.0, 18.0, 7.0, 21.0, 49.0, 29.0, 65.0, 28.0, 21.0, 28.5, 5.0, 11.0, 22.0, 38.0, 45.0, 4.0, 28.0, 28.0, 29.0, 19.0, 17.0, 26.0, 32.0, 16.0, 21.0, 26.0, 32.0, 25.0, 28.0, 28.0, 0.83, 30.0, 22.0, 29.0, 28.0, 28.0, 17.0, 33.0, 16.0, 28.0, 23.0, 24.0, 29.0, 20.0, 46.0, 26.0, 59.0, 28.0, 71.0, 23.0, 34.0, 34.0, 28.0, 28.0, 21.0, 33.0, 37.0, 28.0, 21.0, 28.0, 38.0, 28.0, 47.0, 14.5, 22.0, 20.0, 17.0, 21.0, 70.5, 29.0, 24.0, 2.0, 21.0, 28.0, 32.5, 32.5, 54.0, 12.0, 28.0, 24.0, 28.0, 45.0, 33.0, 20.0, 47.0, 29.0, 25.0, 23.0, 19.0, 37.0, 16.0, 24.0, 28.0, 22.0, 24.0, 19.0, 18.0, 19.0, 27.0, 9.0, 36.5, 42.0, 51.0, 22.0, 55.5, 40.5, 28.0, 51.0, 16.0, 30.0, 28.0, 28.0, 44.0, 40.0, 26.0, 17.0, 1.0, 9.0, 

In [21]:
#Test each function using a sample numerical dataset containing missing values and outliers.
import csv
import math
from collections import Counter

#  Helper functions 
def calculate_mean(numbers):
    return sum(numbers) / len(numbers)

def calculate_median(numbers):
    numbers = sorted(numbers)
    n = len(numbers)
    mid = n // 2
    if n % 2 == 0:
        return (numbers[mid-1] + numbers[mid]) / 2
    else:
        return numbers[mid]

def calculate_mode(values):
    return Counter(values).most_common(1)[0][0]

def fill_missing(data, column, strategy):
    non_missing = [row[column] for row in data if row[column] != '']
    if strategy in ['mean', 'median']:
        non_missing = list(map(float, non_missing))

    if strategy == 'mean':
        value_to_fill = calculate_mean(non_missing)
    elif strategy == 'median':
        value_to_fill = calculate_median(non_missing)
    elif strategy == 'mode':
        value_to_fill = calculate_mode(non_missing)
    else:
        raise ValueError("Strategy must be 'mean', 'median', or 'mode'")

    for row in data:
        if row[column] == '':
            row[column] = value_to_fill

    return data

def drop_missing(data):
    clean_data = []
    for row in data:
        if '' not in row.values():
            clean_data.append(row)
    return clean_data

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

def detect_outliers(data, method):
    outliers = []
    if method == "iqr":
        sorted_data = sorted(data)
        n = len(sorted_data)
        q1 = sorted_data[n // 4]
        q3 = sorted_data[(3 * n) // 4]
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        for x in data:
            if x < lower or x > upper:
                outliers.append(x)
    elif method == "zscore":
        m = mean(data)
        sd = std_dev(data)
        for x in data:
            # Guard against sd == 0 (all values identical)
            if sd > 0 and abs((x - m) / sd) > 3:
                outliers.append(x)
    else:
        raise ValueError("Method must be 'iqr' or 'zscore'")
    return outliers

#  Load Titanic dataset 
filename = "C:/Users/shres/OneDrive/Documents/titanic.csv"
with open(filename, 'r', newline='') as f:
    reader = csv.DictReader(f)
    titanic_data = [row for row in reader]


# Sample numeric data 
# Take first 10 rows for fill/drop testing
sample_data = titanic_data[:10]

# Extract numeric columns for outlier detection
ages = [float(row["Age"]) for row in titanic_data if row["Age"] != '']
fares = [float(row["Fare"]) for row in titanic_data if row["Fare"] != '']

#  1. Fill missing 
filled_data = fill_missing(sample_data, "Age", "mean")
print("Filled Age (first 10 rows):")
for row in filled_data:
    print(row["Age"])

# 2. Drop missing
clean_data = drop_missing(sample_data)
print("\nRows after dropping missing (first 5 rows):")
for row in clean_data[:5]:
    print(row)

#  3. Detect outliers 
age_iqr_outliers = detect_outliers(ages, "iqr")
age_zscore_outliers = detect_outliers(ages, "zscore")
fare_iqr_outliers = detect_outliers(fares, "iqr")
fare_zscore_outliers = detect_outliers(fares, "zscore")

print("\nAge IQR Outliers:", age_iqr_outliers)
print("Age Z-score Outliers:", age_zscore_outliers)
print("\nFare IQR Outliers:", fare_iqr_outliers)
print("Fare Z-score Outliers:", fare_zscore_outliers)

# 4. Check that outlier detection left the ages list unchanged
#    (fill_missing, by contrast, modified the sample_data rows in place)
print("\nFirst 5 ages from original dataset:", ages[:5])


Filled Age (first 10 rows):
22
38
28.375
35
35
28.375
54
2
27
14

Rows after dropping missing (first 5 rows):
{'Passengerid': '1', 'Age': '22', 'Fare': '7.25', 'Sex': '0', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '0'}
{'Passengerid': '3', 'Age': 28.375, 'Fare': '7.925', 'Sex': '1', 'sibsp': '0', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '1'}
{'Passengerid': '4', 'Age': '35', 'Fare': '53.1', 'Sex': '1', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '1', 'Embarked': '2', '2urvived': '1'}
{'Passengerid': '5', 'Age': '35', 'Fare': '8.05', 'Sex': '0', 'sibsp': '0', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '0'}
{'Passengerid': '7', 'Age': '54', 'Fare': '51.8625', 'Sex': '0', 'sibsp': '0', 'zero': '0', 'Parch': '0', 'Pclass': '1', 'Embarked': '2', '2urvived': '0'}

Age IQR Outliers: [2.0, 58.0, 55.0, 2.0, 66.0, 65.0, 0.83, 59.0, 71.0, 70.5, 2.0, 55.5, 1.0, 61.0, 1.0, 56.0, 1.0, 58.0, 2.

In [12]:
#Print the number of missing values handled and the number of outliers detected for verification. 
import csv
import math
from collections import Counter

#  Helper functions 
def calculate_mean(numbers):
    return sum(numbers) / len(numbers)

def calculate_median(numbers):
    numbers = sorted(numbers)
    n = len(numbers)
    mid = n // 2
    if n % 2 == 0:
        return (numbers[mid-1] + numbers[mid]) / 2
    else:
        return numbers[mid]

def calculate_mode(values):
    return Counter(values).most_common(1)[0][0]

# Fill missing
def fill_missing(data, column, strategy):
    non_missing = [row[column] for row in data if row[column] != '']
    if strategy in ['mean', 'median']:
        non_missing = list(map(float, non_missing))

    if strategy == 'mean':
        value_to_fill = calculate_mean(non_missing)
    elif strategy == 'median':
        value_to_fill = calculate_median(non_missing)
    elif strategy == 'mode':
        value_to_fill = calculate_mode(non_missing)
    else:
        raise ValueError("Strategy must be 'mean', 'median', or 'mode'")

    count_filled = 0
    for row in data:
        if row[column] == '':
            row[column] = value_to_fill
            count_filled += 1

    return data, count_filled  # return number of values handled

# Drop missing
def drop_missing(data):
    clean_data = []
    for row in data:
        if '' not in row.values():
            clean_data.append(row)
    return clean_data

# Detect outliers
def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

def detect_outliers(data, method):
    outliers = []
    if method == "iqr":
        sorted_data = sorted(data)
        n = len(sorted_data)
        q1 = sorted_data[n // 4]
        q3 = sorted_data[(3 * n) // 4]
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        for x in data:
            if x < lower or x > upper:
                outliers.append(x)
    elif method == "zscore":
        m = mean(data)
        sd = std_dev(data)
        for x in data:
            # Guard against sd == 0 (all values identical)
            if sd > 0 and abs((x - m) / sd) > 3:
                outliers.append(x)
    else:
        raise ValueError("Method must be 'iqr' or 'zscore'")
    return outliers, len(outliers)  # return outliers and their count

# Load Titanic dataset 
filename = "C:/Users/shres/OneDrive/Documents/titanic.csv"
with open(filename, 'r', newline='') as f:
    reader = csv.DictReader(f)
    titanic_data = [row for row in reader]

#Sample numeric data
sample_data = titanic_data[:10]  # first 10 rows for fill/drop testing

ages = [float(row["Age"]) for row in titanic_data if row["Age"] != '']
fares = [float(row["Fare"]) for row in titanic_data if row["Fare"] != '']

# 1. Fill missing 
filled_data, missing_count = fill_missing(sample_data, "Age", "mean")
print("Filled Age (first 10 rows):")
for row in filled_data:
    print(row["Age"])
print("Number of missing Age values handled:", missing_count)

# 2. Drop missing
clean_data = drop_missing(sample_data)
print("\nRows after dropping missing (first 5 rows):")
for row in clean_data[:5]:
    print(row)
print("Number of rows dropped:", len(sample_data) - len(clean_data))

# 3. Detect outliers
age_iqr_outliers, age_iqr_count = detect_outliers(ages, "iqr")
age_zscore_outliers, age_zscore_count = detect_outliers(ages, "zscore")
fare_iqr_outliers, fare_iqr_count = detect_outliers(fares, "iqr")
fare_zscore_outliers, fare_zscore_count = detect_outliers(fares, "zscore")

print("\nAge IQR Outliers:", age_iqr_outliers)
print("Number of Age IQR outliers detected:", age_iqr_count)
print("Age Z-score Outliers:", age_zscore_outliers)
print("Number of Age Z-score outliers detected:", age_zscore_count)

print("\nFare IQR Outliers:", fare_iqr_outliers)
print("Number of Fare IQR outliers detected:", fare_iqr_count)
print("Fare Z-score Outliers:", fare_zscore_outliers)
print("Number of Fare Z-score outliers detected:", fare_zscore_count)


Filled Age (first 10 rows):
22
38
26
35
35
28
54
2
27
14
Number of missing Age values handled: 0

Rows after dropping missing (first 5 rows):
{'Passengerid': '1', 'Age': '22', 'Fare': '7.25', 'Sex': '0', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '0'}
{'Passengerid': '2', 'Age': '38', 'Fare': '71.2833', 'Sex': '1', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '1', 'Embarked': '0', '2urvived': '1'}
{'Passengerid': '3', 'Age': '26', 'Fare': '7.925', 'Sex': '1', 'sibsp': '0', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '1'}
{'Passengerid': '4', 'Age': '35', 'Fare': '53.1', 'Sex': '1', 'sibsp': '1', 'zero': '0', 'Parch': '0', 'Pclass': '1', 'Embarked': '2', '2urvived': '1'}
{'Passengerid': '5', 'Age': '35', 'Fare': '8.05', 'Sex': '0', 'sibsp': '0', 'zero': '0', 'Parch': '0', 'Pclass': '3', 'Embarked': '2', '2urvived': '0'}
Number of rows dropped: 0

Age IQR Outliers: [2.0, 58.0, 55.0, 2.0, 66.0, 65.0, 0.83, 59.0, 71

In [22]:
import csv
from collections import Counter

#  Helper functions 
def calculate_mean(numbers):
    return sum(numbers) / len(numbers)

def calculate_median(numbers):
    numbers = sorted(numbers)
    n = len(numbers)
    mid = n // 2
    if n % 2 == 0:
        return (numbers[mid-1] + numbers[mid]) / 2
    else:
        return numbers[mid]

def calculate_mode(values):
    return Counter(values).most_common(1)[0][0]

def fill_missing(data, column, strategy):
    non_missing = [row[column] for row in data if row[column] != '']
    if strategy in ['mean', 'median']:
        non_missing = list(map(float, non_missing))

    if strategy == 'mean':
        value_to_fill = calculate_mean(non_missing)
    elif strategy == 'median':
        value_to_fill = calculate_median(non_missing)
    elif strategy == 'mode':
        value_to_fill = calculate_mode(non_missing)
    else:
        raise ValueError("Strategy must be 'mean', 'median', or 'mode'")

    # Fill missing values and return
    filled_data = []
    for row in data:
        new_row = row.copy()  # make copy to avoid changing original
        if new_row[column] == '':
            new_row[column] = value_to_fill
        filled_data.append(new_row)
    return filled_data
    
# Load Titanic dataset 
filename = "C:/Users/shres/OneDrive/Documents/titanic.csv"
with open(filename, 'r', newline='') as f:
    reader = csv.DictReader(f)
    titanic_data = [row for row in reader]

# Take first 10 rows for testing
sample_data = titanic_data[:10]

# Apply fill_missing with different strategies 
filled_mean = fill_missing(sample_data, "Age", "mean")
filled_median = fill_missing(sample_data, "Age", "median")
filled_mode = fill_missing(sample_data, "Age", "mode")

# Print comparison 
print("Original Age column (first 10 rows):")
print([row["Age"] for row in sample_data])

print("\nFilled with Mean:")
print([row["Age"] for row in filled_mean])

print("\nFilled with Median:")
print([row["Age"] for row in filled_median])

print("\nFilled with Mode:")
print([row["Age"] for row in filled_mode])

# Short Comparison Summary 
print("\nSummary:")
print("Mean → replaces missing with average value, smooths data.")
print("Median → replaces missing with middle value, robust to outliers.")
print("Mode → replaces missing with most frequent value, keeps common patterns.")


Original Age column (first 10 rows):
['22', '38', '', '35', '35', '', '54', '2', '27', '14']

Filled with Mean:
['22', '38', 28.375, '35', '35', 28.375, '54', '2', '27', '14']

Filled with Median:
['22', '38', 31.0, '35', '35', 31.0, '54', '2', '27', '14']

Filled with Mode:
['22', '38', '35', '35', '35', '35', '54', '2', '27', '14']

Summary:
Mean → replaces missing with average value, smooths data.
Median → replaces missing with middle value, robust to outliers.
Mode → replaces missing with most frequent value, keeps common patterns.
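
The filled columns printed above mix strings such as `'22'` with floats such as `28.375`, which makes later arithmetic on the column awkward. One possible way around this, sketched under the assumption that the column is purely numeric, is to convert every entry to float while filling. The name `fill_missing_numeric` is hypothetical and not part of the notebook.

```python
import statistics

def fill_missing_numeric(data, column, strategy="mean"):
    """Return copies of the rows with `column` converted to float and gaps filled."""
    known = [float(row[column]) for row in data if row[column] != ""]
    if not known:
        raise ValueError(f"No non-missing values in column {column!r}")
    if strategy == "mean":
        fill = statistics.mean(known)
    elif strategy == "median":
        fill = statistics.median(known)
    else:
        raise ValueError("Strategy must be 'mean' or 'median'")
    filled = []
    for row in data:
        new_row = dict(row)  # copy so the input rows stay untouched
        new_row[column] = fill if row[column] == "" else float(row[column])
        filled.append(new_row)
    return filled

# Same ten Age values as in the output above
sample = [{"Age": a} for a in ["22", "38", "", "35", "35", "", "54", "2", "27", "14"]]
mean_filled = fill_missing_numeric(sample, "Age", "mean")
median_filled = fill_missing_numeric(sample, "Age", "median")
print([row["Age"] for row in mean_filled])
print([row["Age"] for row in median_filled])
```

The fill values match the output above (28.375 for the mean, 31.0 for the median), but every entry in the returned column is now a float.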


In [23]:
#After detecting outliers, create a new dataset where outliers are removed and compare the mean and median of the column before and after outlier removal.
import csv
import math

#  Helper functions
def mean(values):
    return sum(values) / len(values)

def median(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    if n % 2 == 0:
        return (values[mid-1] + values[mid]) / 2
    else:
        return values[mid]

def std_dev(values):
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

def detect_outliers(data, method):
    outliers = []
    if method == "iqr":
        sorted_data = sorted(data)
        n = len(sorted_data)
        q1 = sorted_data[n // 4]
        q3 = sorted_data[(3 * n) // 4]
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        for x in data:
            if x < lower or x > upper:
                outliers.append(x)
    elif method == "zscore":
        m = mean(data)
        sd = std_dev(data)
        for x in data:
            # Guard against sd == 0 (all values identical)
            if sd > 0 and abs((x - m) / sd) > 3:
                outliers.append(x)
    else:
        raise ValueError("Method must be 'iqr' or 'zscore'")
    return outliers

#  Load Titanic dataset 
filename = "C:/Users/shres/OneDrive/Documents/titanic.csv"
with open(filename, "r", newline="") as f:
    reader = csv.DictReader(f)
    titanic_data = [row for row in reader]

# Extract Age column as float, skip missing
ages = [float(row["Age"]) for row in titanic_data if row["Age"] != '']

# 1. Detect outliers 
age_outliers = detect_outliers(ages, "iqr")
print("Detected Age Outliers:", age_outliers)

#  2. Remove outliers to create new dataset
outlier_values = set(age_outliers)  # set gives fast membership checks
ages_no_outliers = [x for x in ages if x not in outlier_values]

# 3. Compare mean and median 
original_mean = mean(ages)
original_median = median(ages)

cleaned_mean = mean(ages_no_outliers)
cleaned_median = median(ages_no_outliers)

print("\nMean and Median Comparison (Age):")
print(f"Original Mean: {original_mean:.2f}, Original Median: {original_median}")
print(f"Cleaned Mean: {cleaned_mean:.2f}, Cleaned Median: {cleaned_median}")
print(f"Number of outliers removed: {len(age_outliers)}")


Detected Age Outliers: [2.0, 58.0, 55.0, 2.0, 66.0, 65.0, 0.83, 59.0, 71.0, 70.5, 2.0, 55.5, 1.0, 61.0, 1.0, 56.0, 1.0, 58.0, 2.0, 59.0, 62.0, 58.0, 63.0, 65.0, 2.0, 0.92, 61.0, 2.0, 60.0, 1.0, 1.0, 64.0, 65.0, 56.0, 0.75, 2.0, 63.0, 58.0, 55.0, 71.0, 2.0, 64.0, 62.0, 62.0, 60.0, 61.0, 57.0, 80.0, 2.0, 0.75, 56.0, 58.0, 70.0, 60.0, 60.0, 70.0, 0.67, 57.0, 1.0, 0.42, 2.0, 1.0, 62.0, 0.83, 74.0, 56.0, 62.0, 63.0, 55.0, 60.0, 60.0, 55.0, 67.0, 2.0, 76.0, 63.0, 1.0, 61.0, 60.5, 64.0, 61.0, 0.33, 60.0, 57.0, 64.0, 55.0, 0.92, 1.0, 0.75, 2.0, 1.0, 64.0, 0.83, 55.0, 55.0, 57.0, 58.0, 0.17, 59.0, 55.0, 57.0]

Mean and Median Comparison (Age):
Original Mean: 29.51, Original Median: 28.0
Cleaned Mean: 28.54, Cleaned Median: 28.0
Number of outliers removed: 101
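
An alternative to filtering with `x not in age_outliers` is to compute the IQR bounds once and keep every value that falls inside them. The sketch below uses the same position-based quartiles as `detect_outliers`. The helper names `iqr_bounds` and `remove_outliers` and the small `ages` list are illustrative only.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits using position-based quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered[n // 4]
    q3 = ordered[(3 * n) // 4]
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Return (kept, removed) as two new lists; the input is not modified."""
    lower, upper = iqr_bounds(values, k)
    kept = [x for x in values if lower <= x <= upper]
    removed = [x for x in values if x < lower or x > upper]
    return kept, removed

# Small illustrative sample, not the full Titanic column
ages = [22.0, 38.0, 26.0, 35.0, 35.0, 28.0, 54.0, 2.0, 27.0, 14.0, 80.0]
kept, removed = remove_outliers(ages)

print("Removed:", removed)
print("Mean before/after:", round(statistics.mean(ages), 2), round(statistics.mean(kept), 2))
print("Median before/after:", statistics.median(ages), statistics.median(kept))
```

Returning the kept and removed values together keeps the bookkeeping simple: their lengths always add up to the length of the original list.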
