# Data Cleaning Notebook

## Introduction

This notebook outlines the process of cleaning a dataset. The purpose is to prepare the data for further analysis by handling missing values, correcting inconsistencies, and normalizing data.

### Goals:

- **Understand** the structure of the dataset.
- **Identify** and handle missing or inconsistent data.
- **Prepare** the dataset for analysis.

> Note: This is a template and should be modified to suit the specific needs of your data.


## Imports and utility functions

In [1]:
import pandas as pd
import numpy as np

In [2]:
def filter_columns_by_non_null_count(df, non_null_threshold=10):
    """
    Keep only the columns that have more than a given number of non-null values.

    Parameters:
    df (pd.DataFrame): The DataFrame to be filtered.
    non_null_threshold (int): A column is kept only if its non-null count is
        strictly greater than this value.

    Returns:
    pd.DataFrame: A new DataFrame containing only the columns whose non-null
        count exceeds the threshold.
    """
    filtered_columns = [
        col for col in df.columns if df[col].count() > non_null_threshold]
    return df[filtered_columns]
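
As a quick illustration of the strict `>` comparison in the helper above, the sketch below runs a condensed copy of it on a small made-up DataFrame. The column names `dense`, `sparse`, and `exact` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd


def filter_columns_by_non_null_count(df, non_null_threshold=10):
    """Keep only columns with more than `non_null_threshold` non-null values."""
    kept = [col for col in df.columns if df[col].count() > non_null_threshold]
    return df[kept]


# Hypothetical toy data: 'dense' has 4 non-null values, 'exact' has 2, 'sparse' has 1.
toy = pd.DataFrame({
    "dense": [1.0, 2.0, 3.0, 4.0],
    "sparse": [np.nan, np.nan, 5.0, np.nan],
    "exact": [1.0, 2.0, np.nan, np.nan],
})

# With a threshold of 2, a column with exactly 2 non-null values is dropped,
# because the comparison is strictly greater-than.
result = filter_columns_by_non_null_count(toy, non_null_threshold=2)
print(list(result.columns))
```

If a column that sits exactly at the threshold should be kept, the comparison would need to be `>=` instead.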

## Data Loading

Here we load the kit details from two CSV files, one per manufacturer (Estes and LOC). Both files are read with a single quote (`'`) as the quote character.


In [3]:
df_e = pd.read_csv('estes_kits_details.csv', quotechar="'")
df_l = pd.read_csv('loc_kits_details.csv', quotechar="'")
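
To show what `quotechar="'"` changes, here is a minimal sketch that parses an in-memory sample instead of the real files. The two sample rows are made up for illustration; the actual layout of `estes_kits_details.csv` and `loc_kits_details.csv` is not reproduced here.

```python
import io

import pandas as pd

# Hypothetical sample: fields are wrapped in single quotes, so a double quote
# (used as an inch mark) and a comma can both appear inside a field.
sample = "Name,Height\n'LOC-IV 4\"',47\n'Kit, with comma',40.75\n"

df_sample = pd.read_csv(io.StringIO(sample), quotechar="'")

# The single quotes are stripped; the inch mark and the embedded comma survive.
print(df_sample["Name"].tolist())
print(df_sample.shape)
```

With the default `quotechar='"'`, the same sample would be split differently, because the embedded comma would no longer be protected.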

## Data Inspection

In this section, we inspect the dataset for inconsistencies, missing values, and anomalies.


In [None]:
df_e.head()

In [4]:
df_l.head()

| Unnamed: 0 | Name | Image URL | Complexity | Diameter | Height | Weight | Motor Mount | Parachute Size | Shock Cord Type | Shock Cord Mount | ... | Vinyl Decals | Tec features | Decals | Fire Blanket | Vinyl Decal | Parachute | Fin Array | Rocksim | Parachutes | Additional Decals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LOC-IV 4" | https://locprecision.com/cdn/shop/products/78_... | Intermediate-Advanced | 4″ | 47″ | 2lbs. | 38mm with |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | LOC 4" Goblin | https://locprecision.com/cdn/shop/products/62_... | Advanced | 4″ | 40.75" | 2.5lbs. | 54mm |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | Yank IRIS 4" | https://locprecision.com/cdn/shop/products/YIR... | Intermediate-Advanced | 4″ | 77" | 5.5lbs. | 54mm with |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | EZI-65 | https://locprecision.com/cdn/shop/products/5_e... | Entry-Intermediate | 4″ | 57.75 | 3.2lbs | 54mm with |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | IRIS 3.10" | https://locprecision.com/cdn/shop/products/46_... | Entry/Intermediate | 3.1″ | 59.5 | 2.3lbs. | 38mm |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
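
Beyond `head()`, a per-column summary makes sparse columns easy to spot. The sketch below is one possible way to build it; `summarize_columns` is a hypothetical helper, and the toy rows only imitate the shape of the LOC data.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, non-null count, and share of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.count(),
        "missing_share": df.isna().mean(),
    })


# Hypothetical toy frame with one partially filled and one entirely empty column.
toy = pd.DataFrame({
    "Name": ["EZI-65", 'IRIS 3.10"', 'LOC-IV 4"'],
    "Height": ["57.75", "59.5", None],
    "Parachute Size": [None, None, None],
})

summary = summarize_columns(toy)
print(summary)
```

Sorting the result by `missing_share` would list the emptiest columns first, which can help when choosing a threshold for the column filter.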


## Data Cleaning

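Here we drop sparse columns: any column with 10 or fewer non-null values is removed from each DataFrame. Handling the remaining missing values and normalizing the data is left as a template step.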


In [5]:
# Remove cols with little data
filtered_df_e = filter_columns_by_non_null_count(df_e, 10).copy()
filtered_df_l = filter_columns_by_non_null_count(df_l, 10).copy()
# Example: Handling missing values with a forward fill
# (fillna(method='ffill') is deprecated in recent pandas versions)
# df = df.ffill()
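
The LOC rows shown earlier mix formats such as `47″`, `40.75"`, `57.75`, `2lbs.`, and `3.2lbs`. A minimal sketch of one way to normalize them is shown below. It assumes every height is in inches and every weight is in pounds, which the sample rows suggest but do not guarantee; `leading_number` and the `height_in` and `weight_lb` column names are hypothetical.

```python
import pandas as pd


def leading_number(series):
    """Extract the first decimal number from each string as a float (NaN if none)."""
    extracted = series.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")


# Values copied from the formats seen in the LOC sample rows, plus one missing entry.
sample = pd.DataFrame({
    "Height": ["47″", '40.75"', "57.75", None],
    "Weight": ["2lbs.", "2.5lbs.", "3.2lbs", None],
})

# Assumption: heights are inches and weights are pounds; units are not checked.
sample["height_in"] = leading_number(sample["Height"])
sample["weight_lb"] = leading_number(sample["Weight"])
print(sample)
```

This approach ignores the unit text entirely, so a column that mixes units (for example millimetres and inches) would need an explicit unit check before the numbers are comparable.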

## Data Transformation

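In this section, we rename the LOC columns `Image URL` and `Product URL` to `image_src` and `url`, then concatenate the two filtered DataFrames into one. Columns that exist in only one source are filled with missing values for the rows of the other.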


In [6]:
filtered_df_l.rename(columns={
    'Image URL': 'image_src',
    'Product URL': 'url'},
    inplace=True)

# Concatenate the dataframes
merged_df = pd.concat([filtered_df_e, filtered_df_l], ignore_index=True)
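
The sketch below illustrates how `pd.concat` behaves when the two frames share only some columns. The frames `left` and `right` and their values are made up; only the renamed column names come from the cell above.

```python
import pandas as pd

# Hypothetical stand-ins for the two filtered frames.
left = pd.DataFrame({
    "url": ["https://example.com/a"],
    "image_src": ["a.png"],
    "Length": ["12 in"],
})
right = pd.DataFrame({
    "Product URL": ["https://example.com/b"],
    "Image URL": ["b.png"],
    "Height": ['47"'],
})

# Align the shared columns first, then stack the rows.
right = right.rename(columns={"Image URL": "image_src", "Product URL": "url"})
combined = pd.concat([left, right], ignore_index=True)

# Shared columns line up; columns present in only one frame are filled with NaN.
print(combined)
```

Without the rename, `url` and `Product URL` would end up as two separate, half-empty columns.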

## Data Quality Check

We re-inspect the dataset to ensure that all cleaning steps have been successful.


In [7]:
# Example: Re-checking the dataset
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 34 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   url                      194 non-null    object 
 1   image_src                195 non-null    object 
 2   Recommended Engines      73 non-null     object 
 3   Projected Max Altitude   73 non-null     object 
 4   Recovery System          72 non-null     object 
 5   Length                   73 non-null     object 
 6   Diameter                 137 non-null    object 
 7   Estimated Weight         73 non-null     object 
 8   Estimated Assembly Time  68 non-null     object 
 9   Fin Materials            70 non-null     object 
 10  Decal Type               64 non-null     object 
 11  Launch System            69 non-null     object 
 12  Launch Rod Size          68 non-null     object 
 13  instructions             76 non-null     object 
 14  Age Recommendation       5
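
The `info()` output above shows 194 non-null `url` values for 195 rows, so one row lacks a URL. A few explicit checks can surface this kind of issue automatically. The sketch below is one possible shape for such a check; `quality_report` is a hypothetical helper, and treating `url` as the key column is an assumption.

```python
import pandas as pd


def quality_report(df, key="url"):
    """Collect a few simple health metrics for a cleaned frame."""
    return {
        "rows": len(df),
        "missing_keys": int(df[key].isna().sum()),
        "duplicate_keys": int(df[key].dropna().duplicated().sum()),
        "empty_columns": [c for c in df.columns if df[c].notna().sum() == 0],
    }


# Hypothetical toy frame: one repeated URL, one missing URL, one empty column.
toy = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/a", None],
    "Diameter": ["4", None, "3.1"],
    "Unused": [None, None, None],
})

report = quality_report(toy)
print(report)
```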

## Exporting Cleaned Data

Save the cleaned data for future use.


In [9]:
# Final check of the merged DataFrame, then save the cleaned dataset
merged_df.info()
merged_df.to_csv('kits.csv', index=False)

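
The `Unnamed: 0` column in the LOC preview is what pandas typically produces when a CSV was written with its index and then read back. The sketch below shows a round trip with `index=False` that avoids this. It writes to an in-memory buffer rather than `kits.csv`, and the toy frame is made up.

```python
import io

import pandas as pd

# Hypothetical stand-in for the merged frame.
toy = pd.DataFrame({
    "url": ["https://example.com/a", None],
    "Diameter": ["4", "3.1"],
})

# Write without the index, then read everything back as strings.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype=str)

# No extra index column appears, and the missing URL is still missing.
print(list(reloaded.columns))
print(reloaded)
```

Reading with `dtype=str` keeps values such as `4` and `3.1` as text. Without it, pandas would infer a numeric type for that column.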

## Conclusion

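This notebook loaded the Estes and LOC kit details, dropped columns with 10 or fewer non-null values, renamed the LOC `Image URL` and `Product URL` columns to `image_src` and `url`, and concatenated both sets into a single table of 195 rows and 34 columns, saved as `kits.csv`.

The merged data is still sparse, and the columns listed in the quality check are stored as `object`: `Length`, for example, has 73 non-null values out of 195. Reasonable next steps are parsing dimensions and weights into numeric columns with consistent units and handling the remaining missing values before analysis.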
