# Data Cleaning Notebook

## Introduction

This notebook outlines the process of cleaning a dataset. The purpose is to prepare the data for further analysis by handling missing values, correcting inconsistencies, and normalizing data.

### Goals:

- **Understand** the structure of the dataset.
- **Identify** and handle missing or inconsistent data.
- **Prepare** the dataset for use in a graph.

## Imports and utility functions

In [1]:
import pandas as pd
import uuid
import requests
import json

In [2]:
def filter_columns_by_non_null_count(df, non_null_threshold=10):
    """
    Filter columns in a DataFrame based on a threshold of non-null values.

    Parameters:
    df (pd.DataFrame): The DataFrame to be filtered.
    non_null_threshold (int): A column is kept only if it has more than this many non-null values.

    Returns:
    pd.DataFrame: A new DataFrame containing only the columns whose non-null count exceeds the threshold.
    """
    filtered_columns = [
        col for col in df.columns if df[col].count() > non_null_threshold]
    return df[filtered_columns]
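
Because the comparison is strict, a column whose non-null count equals the threshold is dropped. The sketch below applies the same rule to a toy DataFrame; the column names and the threshold of 2 are invented for illustration and are not taken from the kit data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'length' has two non-null values, 'sparse' has one.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'length': [10.0, np.nan, 12.5],
    'sparse': [np.nan, np.nan, 'x'],
})

threshold = 2  # illustrative value, not the one used in this notebook

# Same rule as filter_columns_by_non_null_count: keep a column only if
# its non-null count is strictly greater than the threshold.
kept = [col for col in toy.columns if toy[col].count() > threshold]
print(kept)  # ['name']
```

Here `length` is removed even though it has exactly 2 non-null values, which is worth keeping in mind when choosing the threshold.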

## Data Loading

Here we load the raw data from CSV files: the Estes and Loc kit details, the manufacturers, and the rocket motors. The two kit files use a single quote as the quote character.


In [3]:
df_e = pd.read_csv('./data_raw/estes_kits_details.csv', quotechar="'")
df_l = pd.read_csv('./data_raw/loc_kits_details.csv', quotechar="'")

df_manufacturers = pd.read_csv('./data_raw/manufacturers.csv')
df_motors = pd.read_csv('./data_raw/rocket_motors.csv')

## Data Inspection

In this section, we inspect the dataset for inconsistencies, missing values, and anomalies.


In [None]:
df_motors.head()

In [None]:
df_manufacturers.head(10)

In [None]:
df_e.head()

In [None]:
df_l.head()
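
`head()` only shows the first rows. A minimal sketch of a few further checks that could complement it is shown below, using a made-up stand-in table; the column names and values are assumptions, not the real kit data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the kit tables.
sample = pd.DataFrame({
    'name': ['Alpha', 'Beta', 'Beta', None],
    'price': [19.99, 5.0, 5.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print(duplicate_rows)
print(sample.dtypes)  # column types, useful for spotting numbers stored as text
```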

## Data Cleaning

Here we drop sparse columns: any column with 10 or fewer non-null values is removed from each kit table.


In [4]:
# Remove cols with little data
filtered_df_e = filter_columns_by_non_null_count(df_e, 10).copy()
filtered_df_l = filter_columns_by_non_null_count(df_l, 10).copy()

## Data Transformation

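In this section, we rename columns to shared names, tag each kit with its manufacturer, merge the two kit tables, and assign every row a unique ID. We also align the `mfgID` column name across the manufacturers and motors tables.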


In [5]:
filtered_df_l.rename(columns={
    'Image URL': 'image_src',
    'Product URL': 'url'},
    inplace=True)
filtered_df_e['mfgID'] = 'Estes'
filtered_df_l['mfgID'] = 'Loc'

# Concatenate the dataframes
merged_df = pd.concat([filtered_df_e, filtered_df_l], ignore_index=True)

merged_df['UniqueID'] = [uuid.uuid4() for _ in range(len(merged_df))]
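
`pd.concat` aligns columns by name, so a column present in only one input is filled with `NaN` for the other input's rows. That is presumably why the Loc columns are renamed before merging. The sketch below shows the effect on two invented one-row tables; the column values and URLs are placeholders.

```python
import pandas as pd

# Hypothetical miniature versions of the two kit tables.
estes = pd.DataFrame({'name': ['Kit A'], 'url': ['https://example.com/a']})
loc = pd.DataFrame({'name': ['Kit B'], 'Product URL': ['https://example.com/b']})

# Without renaming, 'url' and 'Product URL' stay separate columns.
unaligned = pd.concat([estes, loc], ignore_index=True)

# After renaming, both sources share a single 'url' column.
aligned = pd.concat(
    [estes, loc.rename(columns={'Product URL': 'url'})],
    ignore_index=True,
)

print(list(unaligned.columns))
print(list(aligned.columns))
```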

In [6]:
new_rows = [{'name': 'Loc Precision', 'abbrev': 'Loc'}]

new_rows_df = pd.DataFrame(new_rows)
df_manufacturers = pd.concat([df_manufacturers, new_rows_df], ignore_index=True)

df_manufacturers.rename(columns={'abbrev': 'mfgID'}, inplace=True)

In [7]:
df_motors.rename(columns={'manufacturerAbbrev': 'mfgID'}, inplace=True)
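
With the key column named `mfgID` in every table, one possible follow-up check is whether each `mfgID` actually appears in the manufacturers table. The sketch below uses invented miniature tables; the manufacturer names, the `designation` column, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature tables; the real ones are loaded earlier.
manufacturers = pd.DataFrame({
    'name': ['Estes Industries', 'Loc Precision'],
    'mfgID': ['Estes', 'Loc'],
})
motors = pd.DataFrame({
    'designation': ['A8', 'B6', 'X1'],
    'mfgID': ['Estes', 'Estes', 'Unknown'],
})

# Rows whose mfgID has no match in the manufacturers table.
orphans = motors[~motors['mfgID'].isin(manufacturers['mfgID'])]
print(orphans)
```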

Consolidate `thrust_samples.csv` so that each motor has a single row holding all of its samples.

In [8]:
# Load the original thrust_samples.csv
df = pd.read_csv('./data_raw/thrust_samples.csv')

# Group by motorId and aggregate the samples into a list of dictionaries.
# Selecting the sample columns before apply keeps the grouping column out of each group.
aggregated = (
    df.groupby('motorId')[['time', 'thrust']]
    .apply(lambda x: x.to_dict('records'))
    .reset_index(name='samples')
)

# Serialize the samples into a JSON string
aggregated['samples'] = aggregated['samples'].apply(json.dumps)

# Save to a new CSV file
aggregated.to_csv('./data_clean/thrust_samples.csv', index=False)
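
Because `samples` is stored as a JSON string, anything reading the file back has to decode it. A minimal round-trip sketch follows, using an in-memory buffer and invented values rather than the real file.

```python
import io
import json

import pandas as pd

# Hypothetical consolidated table with one JSON string per motor.
consolidated = pd.DataFrame({
    'motorId': ['id1'],
    'samples': [json.dumps([
        {'time': 0.1, 'thrust': 100.0},
        {'time': 0.2, 'thrust': 200.0},
    ])],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
consolidated.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Decode the JSON strings back into lists of dictionaries.
restored['samples'] = restored['samples'].apply(json.loads)
print(restored.loc[0, 'samples'])
```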




In [None]:
import pandas as pd

# Toy DataFrame standing in for the thrust samples
df = pd.DataFrame({
    'motorId': ['id1', 'id1', 'id2', 'id2'],
    'time': [0.1, 0.2, 0.1, 0.2],
    'thrust': [100, 200, 150, 250]
})

# Select the sample columns before apply so the result is a Series indexed
# by motorId; reset_index then yields exactly two columns.
aggregated = (
    df.groupby('motorId')[['time', 'thrust']]
    .apply(lambda x: x.to_dict('records'))
    .reset_index(name='samples')
)

# 'aggregated' now contains 'motorId' and 'samples' as columns,
# where 'samples' is a list of dictionaries for each 'motorId'
print(aggregated)


## Data Quality Check

We re-inspect the dataset to ensure that all cleaning steps have been successful.


In [None]:
# Example: Re-checking the dataset
merged_df.info()
df_manufacturers.info()
df_motors.info()
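
`info()` output has to be read by eye. As a complement, a few explicit assertions can make the checks fail loudly. The sketch below runs on an invented stand-in for `merged_df`; the rows and the set of allowed `mfgID` values are assumptions.

```python
import uuid

import pandas as pd

# Hypothetical stand-in for merged_df.
kits = pd.DataFrame({
    'name': ['Kit A', 'Kit B'],
    'mfgID': ['Estes', 'Loc'],
})
kits['UniqueID'] = [str(uuid.uuid4()) for _ in range(len(kits))]

# Each check raises AssertionError with a message if it fails.
assert kits['UniqueID'].is_unique, 'duplicate UniqueID values'
assert kits['mfgID'].notna().all(), 'rows without a manufacturer'
assert set(kits['mfgID']) <= {'Estes', 'Loc'}, 'unexpected mfgID value'

checks_passed = True
print(checks_passed)
```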

## Exporting Cleaned Data

Save the cleaned data for future use.


In [None]:
merged_df.to_csv('./data_clean/kits.csv', index=False)
df_manufacturers.to_csv('./data_clean/manufacturers.csv', index=False)
df_motors.to_csv('./data_clean/motors.csv', index=False)
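
`DataFrame.to_csv` does not create missing directories, so the export fails if `./data_clean` does not exist. One way to guard against that is sketched below; a temporary directory stands in for the project root, and the one-row table is invented.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as root:
    out_dir = Path(root) / 'data_clean'
    # Create the folder (and any parents) if missing; no error if it exists.
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / 'kits.csv'
    pd.DataFrame({'name': ['Kit A']}).to_csv(out_path, index=False)
    written = out_path.exists()

print(written)
```

In the notebook itself, the same `mkdir` call on `Path('./data_clean')` before the export cell would be enough.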

## Conclusion

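This notebook dropped sparse columns from the Estes and Loc kit tables, merged them into one kits table with an `mfgID` and a `UniqueID` per row, aligned the `mfgID` column name across the manufacturers and motors tables, and consolidated the thrust samples into one JSON-encoded row per motor. Two limitations are worth noting: the non-null threshold of 10 is a fixed cutoff, and no values were imputed or normalized. A possible next step is to confirm that every `mfgID` in the kits and motors tables matches a row in the manufacturers table.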
