### load_data function:
The `load_data` function performs the essential data cleaning and analysis needed to make informed decisions about further preprocessing steps. It begins by loading the CSV file of sensor data. It then corrects the data type of the timestamp column and removes columns based on the `general_check` function. Afterward, the function analyzes missing values and prints a summary report to help choose a missing-value handling strategy.

In [1]:
import pandas as pd

def load_data(csv_file_path, report_title="Report of actions taken to prepare the dataset for analysis:"):
    print(report_title)
    print()
    
    df = pd.read_csv(csv_file_path)

    # Change the data type of 'timestamp_tz' to datetime
    if 'timestamp_tz' in df.columns:
        df['timestamp_tz'] = pd.to_datetime(df['timestamp_tz'], errors='coerce')

    # List of columns to drop
    columns_to_drop = ['timestamp', 'coil_reversed', 'device', 'channel', 'hz', 'firmware', 'hour', 'Unnamed: 0.1', 'Unnamed: 0', 'event_id']

    # Collect the listed columns that are actually present, then drop them in one call
    dropped_columns = [col for col in columns_to_drop if col in df.columns]
    df = df.drop(columns=dropped_columns)

    # Print a message to indicate which columns were dropped
    if dropped_columns:
        print(f'Dropped columns: {", ".join(dropped_columns)}')
        print('The timestamp column was dropped because there are two columns')
        print()
        
    # Check for missing values
    if df.isna().values.any():
        data_shape = df.shape
        print(f'File shape: {data_shape[0]} rows and {data_shape[1]} columns')
        print()
        print('Missing values report:')
        missing_values = df.isna().sum()
        percentage_missing = (missing_values / len(df)) * 100
        for column, total_missing, percentage in zip(missing_values.index, missing_values, percentage_missing):
            if total_missing > 0:
                print(f'{column} - {total_missing} - {percentage:.2f}%')

    return df


In [2]:
csv_file_path = 'sensor_7.csv'
df = load_data(csv_file_path)

Report of actions taken to prepare the dataset for analysis:

Dropped columns: timestamp, coil_reversed, device, channel, hz, firmware, hour, Unnamed: 0.1, Unnamed: 0, event_id
The timestamp column was dropped because there are two columns

File shape: 1040 rows and 42 columns

Missing values report:
active_power_delta - 52 - 5.00%
apparent_power - 59 - 5.67%
current - 106 - 10.19%
energy - 53 - 5.10%
voltage - 48 - 4.62%
peak_1 - 49 - 4.71%
peak_2 - 5 - 0.48%
peak_3 - 8 - 0.77%
peak_4 - 42 - 4.04%
peak_5 - 53 - 5.10%
peak_6 - 6 - 0.58%
peak_7 - 24 - 2.31%
peak_8 - 52 - 5.00%
peak_9 - 6 - 0.58%
peak_10 - 45 - 4.33%
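
The same counts and percentages can also be collected into a table, which is easier to sort and filter when choosing a strategy per column. The following is a minimal sketch under stated assumptions: `missing_report` is a hypothetical helper that is not part of this notebook, and the small frame is made-up data rather than the sensor file.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return one row per column that has missing values (hypothetical helper)."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data for illustration only, not the sensor file
toy = pd.DataFrame({
    "voltage": [230.0, np.nan, 231.0, 229.0],
    "current": [np.nan, np.nan, 1.2, 1.1],
    "energy": [10, 11, 12, 13],
})
report = missing_report(toy)
print(report)
```

Columns without gaps (here `energy`) are left out, so the table lists only the columns that need a decision.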


In [3]:
df.columns

Index(['timestamp_tz', 'active_power', 'active_power_delta', 'apparent_power',
       'complete', 'current', 'energy', 'energy_delta', 'evt_type',
       'phase_shift', 'reactive_power', 'reactive_power_delta', 'voltage',
       'wifi_strength', 'peak_1', 'peak_2', 'peak_3', 'peak_4', 'peak_5',
       'peak_6', 'peak_7', 'peak_8', 'peak_9', 'peak_10', 'harmonic_img_1',
       'harmonic_img_2', 'harmonic_img_3', 'harmonic_img_4', 'harmonic_img_5',
       'harmonic_img_6', 'harmonic_img_7', 'harmonic_img_8', 'harmonic_img_9',
       'harmonic_real_1', 'harmonic_real_2', 'harmonic_real_3',
       'harmonic_real_4', 'harmonic_real_5', 'harmonic_real_6',
       'harmonic_real_7', 'harmonic_real_8', 'harmonic_real_9'],
      dtype='object')

In [132]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9404 entries, 0 to 9403
Data columns (total 42 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp_tz          9404 non-null   datetime64[ns]
 1   active_power          9404 non-null   int64         
 2   active_power_delta    8903 non-null   float64       
 3   apparent_power        8944 non-null   float64       
 4   complete              9404 non-null   bool          
 5   current               8506 non-null   float64       
 6   energy                8970 non-null   float64       
 7   energy_delta          9404 non-null   int64         
 8   evt_type              9404 non-null   int64         
 9   phase_shift           9404 non-null   float64       
 10  reactive_power        9404 non-null   int64         
 11  reactive_power_delta  9404 non-null   int64         
 12  voltage               8952 non-null   float64       
 13  wifi_strength     

## Summary:
##### For this analysis, I left the missing values in place, because removing them would not significantly affect the insights. Instead, I explored the mathematical relationships between the other parameters. It is possible to iterate through the data frame and fill the missing values, but that is a fairly lengthy process. For the binary classification task, I filled the missing values using the mean and interpolation.
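
To illustrate the two filling strategies mentioned above, here is a minimal sketch on made-up data. Which columns were filled with which method is not stated in this notebook, so the frame, the column choice, and the variable names below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only, not the sensor data
toy = pd.DataFrame({
    "voltage": [230.0, np.nan, 232.0, 234.0],
    "current": [1.0, 2.0, np.nan, 4.0],
})

# Mean imputation: replace each gap with the mean of its column
mean_filled = toy.fillna(toy.mean())

# Linear interpolation: estimate each gap from its neighbours in row order
interpolated = toy.interpolate(method="linear")

print(mean_filled)
print(interpolated)
```

Mean imputation ignores row order, while interpolation relies on it, so interpolation is usually the better fit when the rows are sorted by `timestamp_tz` and the signal changes gradually.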