## General Check Before Processing

I first verified that all CSV files share the same columns. Next, I assessed the shape of each file and its percentage of missing values, which helps identify potential data corruption or unusual features that need attention before further processing.
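The missing-value check can also be broken down per column, which makes it easier to see where gaps concentrate. The following is a minimal sketch under stated assumptions: `missing_report` is a hypothetical helper that is not part of the original notebook, and the toy frame stands in for a real sensor file.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing-value counts and percentages.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    missing_count = df.isnull().sum()
    missing_pct = (df.isnull().mean() * 100).round(2)
    return pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct})


# Toy data standing in for one sensor file (values are made up)
sample = pd.DataFrame({
    "current": [1.0, None, 3.0, 4.0],
    "active_power": [10.0, 20.0, 30.0, 40.0],
})

report = missing_report(sample)
print(f"Shape: {sample.shape}")
print(report)
```

A per-column view complements the single overall percentage: one badly affected column can hide behind a low file-wide figure.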

During the analysis, I noticed two timestamp columns and found that they were almost identical, differing only by a few milliseconds. To streamline the dataset, I decided to drop one of them. I also checked the start and end dates for each sensor and confirmed that the data was collected over a four-day period, 15 to 18 June 2020. Most sensors span the full four days, while sensor_7 covers a shorter window (15 June 08:04 to 18 June 11:20).
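One way to confirm that two timestamp columns are near-duplicates before dropping one is to measure the largest gap between them. This is a sketch only: the column name `timestamp` is an assumption (the notebook names only `timestamp_tz`), `max_timestamp_gap` is a hypothetical helper, and the one-second threshold is an arbitrary choice for illustration.

```python
import pandas as pd


def max_timestamp_gap(df: pd.DataFrame, col_a: str, col_b: str) -> pd.Timedelta:
    """Return the largest absolute difference between two timestamp columns.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    a = pd.to_datetime(df[col_a])
    b = pd.to_datetime(df[col_b])
    return (a - b).abs().max()


# Toy data: 'timestamp' is an assumed column name, values are made up
sample = pd.DataFrame({
    "timestamp": ["2020-06-15 00:00:11.004", "2020-06-15 00:00:21.002"],
    "timestamp_tz": ["2020-06-15 00:00:11.000", "2020-06-15 00:00:21.000"],
})

gap = max_timestamp_gap(sample, "timestamp", "timestamp_tz")
print(f"Largest gap between the two columns: {gap}")

# Drop the redundant column only if the gap is negligible (threshold is an assumption)
if gap < pd.Timedelta(seconds=1):
    sample = sample.drop(columns=["timestamp"])
```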

#### appliance_guess
To get a sense of the dataset's contents, I computed the average current, active power, and energy for each sensor, aiming to determine whether the sensors represent a wide range of appliances. All findings are printed so that they can be read together as a final report.
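As an alternative to printing each sensor's averages separately, the same figures can be collected into one table and sorted, which makes differences between sensors easier to compare. This is a minimal sketch, not the notebook's method: `summarize_sensors` is a hypothetical helper, and it takes already-loaded DataFrames keyed by a sensor name rather than reading from a folder.

```python
import pandas as pd


def summarize_sensors(frames: dict, columns: list, sort_by: str) -> pd.DataFrame:
    """Build one row of column means per sensor, sorted by `sort_by` (descending).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rows = {name: df[columns].mean() for name, df in frames.items()}
    # Each dict entry becomes a column, so transpose to get one row per sensor
    return pd.DataFrame(rows).T.sort_values(sort_by, ascending=False)


# Toy data standing in for two sensor files (names and values are made up)
frames = {
    "sensor_a": pd.DataFrame({"current": [1.0, 3.0], "active_power": [100.0, 300.0]}),
    "sensor_b": pd.DataFrame({"current": [0.5, 0.5], "active_power": [900.0, 1100.0]}),
}

summary = summarize_sensors(frames, ["current", "active_power"], sort_by="active_power")
print(summary)
```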

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data_set_1/sensor_3.csv')

In [3]:
import os
import pandas as pd

def general_check(folder_path):
    """Report column consistency, length, missing values, and date range for each CSV in a folder."""
    # One report per file
    file_reports = []

    # Columns of the first file read, used as the reference for all others
    first_file_columns = None

    # Flag tracking whether every file matches the reference columns
    all_files_have_same_columns = True

    # Loop through the files in the folder (os.listdir returns them in arbitrary order)
    for filename in os.listdir(folder_path):
        if not filename.endswith(".csv"):
            continue

        file_path = os.path.join(folder_path, filename)

        # Read the CSV file and get its column titles (headers)
        df = pd.read_csv(file_path)
        current_file_columns = list(df.columns)

        # The first file read defines the reference columns
        if first_file_columns is None:
            first_file_columns = current_file_columns

        columns_match = current_file_columns == first_file_columns
        if not columns_match:
            all_files_have_same_columns = False

        # Total missing cells and their share of all cells (guard against empty files)
        missing_count = df.isnull().sum().sum()
        missing_pct = missing_count / df.size * 100 if df.size else 0.0

        # Report every file, including the first one (previously it was skipped)
        file_reports.append({
            "Filename": filename,
            "Columns Match First File": columns_match,
            "File Length": len(df),
            "Missing Values": f"{missing_count} - {missing_pct:.2f}%",
        })

        # Print start and end dates if the 'timestamp_tz' column exists.
        # min()/max() compare the raw strings, which works for ISO-formatted timestamps.
        if 'timestamp_tz' in df.columns:
            start_date = df['timestamp_tz'].min()
            end_date = df['timestamp_tz'].max()
            print(f"In {filename}:")
            print(f"Start Date: {start_date}")
            print(f"End Date: {end_date}")
            print("\n")

    # Summarize whether all files have the same columns, then print the reports
    if all_files_have_same_columns:
        print("////////////////////////////////////////////////////////////.")
        print("All files have the same columns.")
        print("////////////////////////////////////////////////////////////.")
    else:
        print("Not all files have the same columns.")

    print("\n")

    for report in file_reports:
        print(report)
        print("\n")


In [14]:
import os
import pandas as pd

def appliance_guess(folder_path, columns_to_average):
    for filename in os.listdir(folder_path):
        if filename.endswith(".csv"):
            file_path = os.path.join(folder_path, filename)
            
            # Read the CSV file
            df = pd.read_csv(file_path)
            
            # Calculate the average of selected columns
            column_averages = df[columns_to_average].mean()
            
            print(f"In {filename}:")
            for column in columns_to_average:
                average_value = column_averages[column]
                print(f"Average {column}: {average_value:.2f}")
            print()



In [15]:
folder_path = 'data_set_1'

In [16]:
print("General Data Set Report Prior to processing:")
print()
general_check(folder_path)

General Data Set Report Prior to processing:

In sensor_4.csv:
Start Date: 2020-06-15 00:00:11.000000
End Date: 2020-06-18 23:59:48.000000


In sensor_5.csv:
Start Date: 2020-06-15 00:00:07.000000
End Date: 2020-06-18 23:59:58.000000


In sensor_7.csv:
Start Date: 2020-06-15 08:04:35.000000
End Date: 2020-06-18 11:20:05.000000


In sensor_6.csv:
Start Date: 2020-06-15 00:00:37.000000
End Date: 2020-06-18 23:59:48.000000


In sensor_2.csv:
Start Date: 2020-06-15 00:00:25
End Date: 2020-06-18 23:59:43


In sensor_3.csv:
Start Date: 2020-06-15 00:00:12.000000
End Date: 2020-06-18 23:59:56.000000


In sensor_1.csv:
Start Date: 2020-06-15 00:00:45.000000
End Date: 2020-06-18 23:59:09.000000


In sensor_0.csv:
Start Date: 2020-06-15 00:00:27.000000
End Date: 2020-06-18 23:59:44.000000


In sensor_10.csv:
Start Date: 2020-06-15 00:00:19.000000
End Date: 2020-06-18 23:59:56.000000


In sensor_12.csv:
Start Date: 2020-06-15 00:00:14.000000
End Date: 2020-06-18 23:59:29.000000


In sensor_8.csv:

In [8]:
print("Calculating the average of significant features to gain insights into the types of appliances:")
columns_to_average = ['current', 'active_power', 'energy']
appliance_guess(folder_path, columns_to_average)

Calculating the average of significant features to gain insights into the types of appliances:
In sensor_4.csv:
Average current: 2.40
Average active_power: 868.86
Average energy: 2752281.01

In sensor_5.csv:
Average current: 1.14
Average active_power: 194.65
Average energy: 439140.46

In sensor_7.csv:
Average current: 3.69
Average active_power: 535.07
Average energy: 1320282.80

In sensor_6.csv:
Average current: 0.39
Average active_power: 69.13
Average energy: 6450.38

In sensor_2.csv:
Average current: 1.60
Average active_power: 886.86
Average energy: 4359449.17

In sensor_3.csv:
Average current: 3.29
Average active_power: 634.43
Average energy: 5940548.12

In sensor_1.csv:
Average current: 2.35
Average active_power: 469.02
Average energy: 900488.11

In sensor_0.csv:
Average current: 1.23
Average active_power: 842.46
Average energy: 3006766.12

In sensor_10.csv:
Average current: 1.73
Average active_power: 913.58
Average energy: 3039397.76

In sensor_12.csv:
Average current: 1.65
Averag