This script simulates missing data.

Our datasets are complete, which is rarely the case in practice, so we deliberately introduce missing values and save the results as new CSV files.

This lets us work with data that resembles real-world conditions.

In [8]:
import pandas as pd

# Define file paths
file1 = 'Health_Sleep_Statistics.csv'
file2 = 'ss.csv'

# Load the datasets
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

For the first dataset, we set about 3% of all cells to missing, chosen uniformly at random, to simulate data that are missing completely at random (MCAR).

In [21]:
import numpy as np

def introduce_missing_values(df, missing_percent):
    # missing_percent is a fraction between 0 and 1 (e.g. 0.03 for 3%)

    # Calculate the total number of cells in the dataset
    total_values = df.size

    # Calculate the number of missing values to introduce
    num_missing = int(missing_percent * total_values)

    # Pick distinct cell positions so that exactly num_missing cells are affected
    # (drawing row and column indices independently can hit the same cell twice)
    flat_idx = np.random.choice(total_values, size=num_missing, replace=False)
    mask = np.zeros(df.shape, dtype=bool)
    mask[np.unravel_index(flat_idx, df.shape)] = True

    # mask() returns a new dataframe with NaN where the mask is True,
    # leaving the original untouched and upcasting integer columns as needed
    df_missing = df.mask(mask)

    return df_missing

# Apply the function to dataset
df1_missing = introduce_missing_values(df1, missing_percent=0.03)

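# Summary statistics; the 'count' row shows how many non-missing values remain per numeric column
print(df1_missing.describe())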

# Save to a new csv file.
df1_missing.to_csv('dataset1_with_missing_values.csv', index=False)
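
As a quick sanity check on the MCAR step, the share of missing cells can be inspected per column with `isna()`. The sketch below is only an illustration: it uses a small made-up dataframe (`toy`, with illustrative column names and values) instead of the real dataset, and `missing_report` is a hypothetical helper that is not part of the original script.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return per-column missing counts and fractions (hypothetical helper)."""
    return pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })

# Toy data standing in for the real dataset (values are illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(20, 70, size=200),
    "Sleep Quality": rng.integers(1, 10, size=200),
})

# Blank out 3% of all cells at distinct random positions
n_cells = toy.size
n_missing = int(0.03 * n_cells)
flat_idx = rng.choice(n_cells, size=n_missing, replace=False)
mask = np.zeros(toy.shape, dtype=bool)
mask[np.unravel_index(flat_idx, toy.shape)] = True
toy_missing = toy.mask(mask)

report = missing_report(toy_missing)
print(report)
print("overall fraction missing:", toy_missing.isna().to_numpy().mean())
```

Under MCAR the per-column fractions should scatter around the overall rate, with no column systematically more affected than the others.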

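For the second dataset, we create missing values with the following rule: for any person older than 50, the daily steps value has a 90% chance of being missing.
We intended this to simulate MNAR, on the assumption that exercising becomes less convenient as people get older. Strictly speaking, because the missingness here depends only on the observed Age column, the mechanism as coded is missing at random (MAR); it resembles MNAR only to the extent that age is related to the step count itself.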

In [23]:
def apply_age_based_missing_values(df, age_column, age_threshold, target_column, missing_probability):
    # Work on a copy so the original dataframe is not modified
    df_missing = df.copy()

    # Rows where age is greater than the threshold are eligible
    eligible = df_missing[age_column] > age_threshold

    # Each eligible row loses its target value with probability missing_probability
    drop = eligible & (np.random.rand(len(df_missing)) < missing_probability)

    # mask() sets NaN where drop is True and upcasts an integer column if needed
    df_missing[target_column] = df_missing[target_column].mask(drop)

    return df_missing

# Apply the function to dataset
df2_missing = apply_age_based_missing_values(df2, age_column='Age', age_threshold=50, target_column='Daily Steps', missing_probability=0.9)

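# Summary statistics; the 'count' for 'Daily Steps' should now be lower than for the other columns
print(df2_missing.describe())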

# Save to a new csv file.
df2_missing.to_csv('dataset2_with_missing_values.csv', index=False)
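
To confirm that the age-based rule behaves as intended, the missing rate of the target column can be compared between the two age groups. The following is a minimal sketch on made-up data: the dataframe `toy`, its values, and the group labels are assumptions for illustration, and only the column names `Age` and `Daily Steps` come from the script above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Age": rng.integers(20, 80, size=1000),
    "Daily Steps": rng.integers(2000, 12000, size=1000).astype(float),
})

# Same rule as above: rows with Age > 50 lose 'Daily Steps' with probability 0.9
over_threshold = toy["Age"] > 50
drop = over_threshold & (rng.random(len(toy)) < 0.9)
toy_missing = toy.copy()
toy_missing.loc[drop, "Daily Steps"] = np.nan

# Fraction of missing 'Daily Steps' within each age group
group_label = over_threshold.map({True: "Age > 50", False: "Age <= 50"})
rate_by_group = toy_missing["Daily Steps"].isna().groupby(group_label).mean()
print(rate_by_group)
```

The younger group should show no missing values, while the rate in the older group should be close to the chosen probability.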

We now have two datasets containing missing values.

These two datasets are used in the subsequent analysis.
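
When the saved files are loaded again, the missing values should come back as NaN: by default, `to_csv` writes NaN as an empty field and `read_csv` parses empty fields as NaN. The sketch below illustrates this round trip with a tiny made-up dataframe and a temporary file; the file name and values are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny illustrative dataframe with one missing value
toy = pd.DataFrame({
    "Age": [34, 61, 58],
    "Daily Steps": [8000.0, np.nan, 4200.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_with_missing_values.csv")
    toy.to_csv(path, index=False)   # NaN is written as an empty field by default
    reloaded = pd.read_csv(path)    # empty fields are parsed back as NaN

# Per-column count of missing values after reloading
print(reloaded.isna().sum())
```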