# Data Cleaning Notebook

## Introduction

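This notebook cleans a model rocketry dataset made up of kit details from Estes and LOC, a manufacturers list, and a rocket motors table. The purpose is to prepare the data for further analysis by dropping sparsely populated columns, aligning column names across sources, and giving every table a shared manufacturer key.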

### Goals:

- **Understand** the structure of the dataset.
- **Identify** and handle missing or inconsistent data.
- **Prepare** the dataset for use in a graph.

## Imports and utility functions

In [1]:
import pandas as pd
import numpy as np

In [2]:
def filter_columns_by_non_null_count(df, non_null_threshold=10):
    """
    Filter columns in a DataFrame based on a threshold of non-null values.

    Parameters:
    df (pd.DataFrame): The DataFrame to be filtered.
    non_null_threshold (int): Columns are kept only if they have more than this many non-null values.

    Returns:
    pd.DataFrame: A new DataFrame containing only the columns whose non-null count exceeds the threshold.
    """
    filtered_columns = [
        col for col in df.columns if df[col].count() > non_null_threshold]
    return df[filtered_columns]

## Data Loading

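Here we load the raw data from four CSV files: kit details for Estes and LOC, the manufacturers list, and the rocket motors table. The two kit files are read with a single quote as the quote character.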


In [3]:
df_e = pd.read_csv('./data_raw/estes_kits_details.csv', quotechar="'")
df_l = pd.read_csv('./data_raw/loc_kits_details.csv', quotechar="'")

df_manufacturers = pd.read_csv('./data_raw/manufacturers.csv')
df_motors = pd.read_csv('./data_raw/rocket_motors.csv')
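
The `quotechar="'"` argument tells pandas that fields in the kit files are wrapped in single quotes instead of the default double quote. A minimal sketch of the effect, using a made-up CSV string (the sample text and the name `demo` are illustrative, not the actual file contents):

```python
import io

import pandas as pd

# Hypothetical sample: one field contains a comma and is wrapped in single quotes.
sample = "name,engines\n'Demo Kit','B4-2, B6-4'\n"

# With quotechar="'" the comma inside the quotes stays within one field.
demo = pd.read_csv(io.StringIO(sample), quotechar="'")
print(demo.loc[0, "engines"])
```

Without the argument, the same line would be split at the inner comma and the quote characters would be kept as part of the values.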

## Data Inspection

In this section, we preview the first rows of each table to spot inconsistencies, missing values, and anomalies.


In [4]:
df_motors.head()

Unnamed: 0,commonName,delays,diameter,infoUrl,totImpulseNs,manufacturer,burnTimeS,propInfo,length,avgThrustN,...,propWeightG,certOrg,motorId,availability,maxThrustN,totalWeightG,designation,updatedOn,type,manufacturerAbbrev
0,1/8A0.3,1,6.0,http://www.nar.org/SandT/pdf/Quest/MicroMaxx.pdf,0.135,Quest Aerospace,0.827,black powder,26.0,0.163,...,0.4,National Association of Rocketry,5f4294d20002310000000002,OOP,1.898,1.07,MICRO_MAXX,2019-04-17,SU,Quest
1,1/8A0.3,1,6.0,,0.218,Quest Aerospace,0.772,black powder,26.0,0.284,...,0.5,National Association of Rocketry,5f4294d20002310000000001,regular,2.108,1.0,MICRO_MAXX_II,2019-04-17,SU,Quest
2,1/4A2,24,10.5,http://nar.org/SandT/pdf/Apogee/qtrA2.pdf,0.57,Apogee Components,0.23,,38.0,2.48,...,0.75,National Association of Rocketry,5f4294d20002310000000003,OOP,5.4,,1/4A2,2019-04-17,SU,Apogee
3,1/4A3,3,13.0,http://nar.org/SandT/pdf/Estes/qtrA3.pdf,0.59,Estes Industries,0.25,black powder,45.0,2.36,...,0.83,National Association of Rocketry,5f4294d20002310000000004,regular,4.95,6.1,1/4A3,2019-10-05,SU,Estes
4,1/2A3,24,13.0,http://nar.org/SandT/pdf/Estes/halfA3T.pdf,1.09,Estes Industries,0.36,black powder,45.0,3.03,...,2.0,National Association of Rocketry,5f4294d20002310000000006,regular,7.62,6.6,1/2A3,2019-10-05,SU,Estes


In [5]:
df_manufacturers.head(10)

Unnamed: 0,name,abbrev
0,AeroTech,AeroTech
1,Alpha Hybrids,Alpha
2,Animal Motor Works,AMW
3,Apogee Components,Apogee
4,Cesaroni Technology,Cesaroni
5,Contrail Rockets,Contrail
6,Ellis Mountain,Ellis
7,Estes Industries,Estes
8,Gorilla Rocket Motors,Gorilla
9,Hypertek,Hypertek


In [6]:
df_e.head()

Unnamed: 0,url,description,image_src,Recommended Engines,Projected Max Altitude,Recovery System,Length,Diameter,Estimated Weight,Estimated Assembly Time,...,Launch System,Launch Rod Size,instructions,Construction,Wingspan,Age Recommendation,Launch Rod System,Recovery,Fin Material,Estimated Assembly Weight
0,https://estesrockets.com/product/000651-der-re...,,/hc/article_attachments/8767953356173/000651_m...,"B4-2, B4-4, B6-2, B6-4, C6-5",600 ft. (183 m),18 in. (45.7 cm) Parachute,16.3 in. (41.4 cm),1.64 in. (42 mm),2.4 oz. (68 g),1-3 Hours (painting and glue drying time not i...,...,,,/hc/en-us/article_attachments/9445564217485,,,,,,,
1,https://estesrockets.com/product/000804-firehawk/,,/hc/article_attachments/8772192791053/Firehawk...,"1/4A3-3T, 1/2A3-2T, A3-4T, A10-3T",550 ft. (168 m),6 in. (15.2 cm) Parachute,11.2 in. (28.4 cm),0.74 in. (19 mm),0.65 oz. (18.4 g),1 Hour (painting and glue drying time not incl...,...,Porta-Pad II and Electron Beam Launch Controller,"1/8""",/hc/en-us/article_attachments/8772248934029,,,,,,,
2,https://estesrockets.com/product/000810-220-sw...,,/hc/article_attachments/8773146657933/220_Swif...,"1/4A3-3T, 1/2A3-2T, 1/2A3-4T, A3-4T, A10-3T",850 ft. (259 m),Featherweight,4.5 in. (11.4 cm),0.54 in. (14 mm),0.09 oz. (2.5 g),1 Hour (painting and glue drying time not incl...,...,Porta-Pad II and Electron Beam Launch Controller,"1/8""",/hc/en-us/article_attachments/8773195953165,,,,,,,
3,https://estesrockets.com/product/luna-bug/,,/hc/article_attachments/8773293557261/0816-Lun...,"1/4A3-3T, 1/2A3-2T, 1/2A3-4T, A10-3T, A3-4T",800 ft. (244 m),Featherweight,3.6 in. (9.1 cm),0.54 in. (14 mm),0.8 oz. (22.7 g),,...,Porta-Pad II and Electron Beam Launch Controller,"1/8""",/hc/en-us/article_attachments/8773295812109,,,,,,,
4,https://estesrockets.com/product/000865-mini-m...,,/hc/article_attachments/8773395002637/min_mean...,"A3-4T, A10-3T",225 ft. (69 m),9 in. (22.9 cm) Parachute,39 in. (99.1 cm),0.74 in. (19 mm),1.2 oz. (34 g),1 Hour (painting and glue drying time not incl...,...,Porta-Pad II and Electron Beam Launch Controller,"1/8""",/hc/en-us/article_attachments/8773397655565,High Quality Plastic Nose Cone,,,,,,


In [7]:
df_l.head()

Unnamed: 0,Name,Image URL,Complexity,Diameter,Height,Weight,Motor Mount,Parachute Size,Shock Cord Type,Shock Cord Mount,...,Vinyl Decals,Tec features,Decals,Fire Blanket,Vinyl Decal,Parachute,Fin Array,Rocksim,Parachutes,Additional Decals
0,"LOC-IV 4""",https://locprecision.com/cdn/shop/products/78_...,Intermediate-Advanced,4″,47″,2lbs.,38mm with,,,,...,,,,,,,,,,
1,"LOC 4"" Goblin",https://locprecision.com/cdn/shop/products/62_...,Advanced,4″,"40.75""",2.5lbs.,54mm,,,,...,,,,,,,,,,
2,"Yank IRIS 4""",https://locprecision.com/cdn/shop/products/YIR...,Intermediate-Advanced,4″,"77""",5.5lbs.,54mm with,,,,...,,,,,,,,,,
3,EZI-65,https://locprecision.com/cdn/shop/products/5_e...,Entry-Intermediate,4″,57.75,3.2lbs,54mm with,,,,...,,,,,,,,,,
4,"IRIS 3.10""",https://locprecision.com/cdn/shop/products/46_...,Entry/Intermediate,3.1″,59.5,2.3lbs.,38mm,,,,...,,,,,,,,,,
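
The previews above show many empty cells, but `head()` only covers the first rows. A minimal sketch of counting missing values per column, on a small made-up frame (column names are borrowed from the LOC table, the values and the names `toy`, `missing`, and `missing_share` are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Kit A", "Kit B", "Kit C"],
    "Diameter": ["4″", None, "3.1″"],
    "Fire Blanket": [np.nan, np.nan, np.nan],
})

# Number of missing values per column, most sparse column first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = toy.isna().mean()
```

Running the same two expressions on `df_e` and `df_l` would give a full picture of sparsity before any column is dropped.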


## Data Cleaning

Here we drop sparsely populated columns: any kit column with 10 or fewer non-null values is removed.


In [8]:
# Drop columns with 10 or fewer non-null values
filtered_df_e = filter_columns_by_non_null_count(df_e, 10).copy()
filtered_df_l = filter_columns_by_non_null_count(df_l, 10).copy()
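
The same filter can be expressed with `DataFrame.count()`, which also makes it easy to list the columns that were removed. This is a sketch on a made-up frame, assuming the same strict comparison as `filter_columns_by_non_null_count`. The names `toy`, `threshold`, `keep`, and `dropped` are illustrative, and listing dropped columns is an addition that the original cell does not perform.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "url": ["a", "b", "c"],
    "Wingspan": [np.nan, "5 in.", np.nan],
    "Rocksim": [np.nan, np.nan, np.nan],
})
threshold = 1  # illustrative value; the notebook uses 10

# count() gives the non-null count per column. The comparison is strict,
# so a column with exactly `threshold` non-null values is dropped.
keep = toy.count() > threshold
filtered = toy.loc[:, keep]

# Columns that did not pass the filter.
dropped = toy.columns[~keep.to_numpy()].tolist()
print(dropped)
```

Reviewing the dropped list is a cheap way to confirm that no column of interest falls just under the threshold.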

## Data Transformation

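In this section, we align the LOC column names with the Estes ones, tag each kit with a manufacturer ID (`mfgID`), and concatenate the two kit tables. We also add Loc Precision to the manufacturers table and rename the abbreviation columns to `mfgID` so that all three tables share one key.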


In [9]:
filtered_df_l.rename(columns={
    'Image URL': 'image_src',
    'Product URL': 'url'},
    inplace=True)
filtered_df_e['mfgID'] = 'Estes'
filtered_df_l['mfgID'] = 'Loc'

# Concatenate the dataframes
merged_df = pd.concat([filtered_df_e, filtered_df_l], ignore_index=True)
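
`pd.concat` matches columns by name, so any column that exists in only one of the two kit tables ends up partly empty in the merged result. A minimal sketch with two made-up single-row frames (the frames `estes` and `loc` and their values are illustrative):

```python
import pandas as pd

estes = pd.DataFrame({"url": ["u1"], "Length": ["16.3 in."], "mfgID": ["Estes"]})
loc = pd.DataFrame({"url": ["u2"], "Height": ["47″"], "mfgID": ["Loc"]})

# Columns are aligned by name. A column present in only one frame is
# filled with NaN for the rows that came from the other frame.
combined = pd.concat([estes, loc], ignore_index=True)
print(combined)
```

This is why renaming `Image URL` and `Product URL` before concatenating matters: columns with different names are treated as different columns and are not merged.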

In [10]:
new_rows = [{'name': 'Loc Precision', 'abbrev': 'Loc'}]

new_rows_df = pd.DataFrame(new_rows)
df_manufacturers = pd.concat([df_manufacturers, new_rows_df], ignore_index=True)

df_manufacturers.rename(columns={'abbrev': 'mfgID'}, inplace=True)

In [11]:
df_motors.rename(columns={'manufacturerAbbrev': 'mfgID'}, inplace=True)
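
With `mfgID` present in all three tables, it becomes possible to check that every ID used by a kit or motor also exists in the manufacturers table. The notebook does not run such a check. The following is a sketch on made-up frames, with `manufacturers`, `kits`, and `orphans` as illustrative names:

```python
import pandas as pd

manufacturers = pd.DataFrame({
    "name": ["Estes Industries", "Loc Precision"],
    "mfgID": ["Estes", "Loc"],
})
kits = pd.DataFrame({
    "url": ["u1", "u2", "u3"],
    "mfgID": ["Estes", "Loc", "Quest"],
})

# Rows whose mfgID has no match in the manufacturers table.
orphans = kits[~kits["mfgID"].isin(manufacturers["mfgID"])]
print(orphans)
```

An empty `orphans` frame would indicate that every reference resolves. Any rows that remain point to a manufacturer that still needs to be added or renamed.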

## Data Quality Check

We re-inspect the tables with `info()` to confirm the row counts, column counts, and non-null counts after cleaning.


In [12]:
# Re-check row counts, non-null counts, and dtypes after cleaning
merged_df.info()
df_manufacturers.info()
df_motors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 35 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   url                      194 non-null    object 
 1   image_src                195 non-null    object 
 2   Recommended Engines      73 non-null     object 
 3   Projected Max Altitude   73 non-null     object 
 4   Recovery System          72 non-null     object 
 5   Length                   73 non-null     object 
 6   Diameter                 137 non-null    object 
 7   Estimated Weight         73 non-null     object 
 8   Estimated Assembly Time  68 non-null     object 
 9   Fin Materials            70 non-null     object 
 10  Decal Type               64 non-null     object 
 11  Launch System            69 non-null     object 
 12  Launch Rod Size          68 non-null     object 
 13  instructions             76 non-null     object 
 14  Age Recommendation       5

## Exporting Cleaned Data

Save the cleaned data for future use.


In [13]:
merged_df.to_csv('./data_clean/kits.csv', index=False)
df_manufacturers.to_csv('./data_clean/manufacturers.csv', index=False)
df_motors.to_csv('./data_clean/motors.csv', index=False)
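
`to_csv` does not create missing directories, so the export cell fails if `./data_clean/` does not exist yet. A minimal sketch of creating the directory first and verifying the written file, using a temporary directory and a made-up frame so that nothing on disk is touched (`toy`, `out_dir`, and `round_trip_ok` are illustrative names):

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"url": ["u1", "u2"], "mfgID": ["Estes", "Loc"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data_clean"
    # Create the output directory before writing; no error if it already exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "kits.csv"
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip preserved the data.
    reloaded = pd.read_csv(out_path)

round_trip_ok = toy.equals(reloaded)
print(round_trip_ok)
```

In the notebook itself, the equivalent would be a single `Path('./data_clean').mkdir(parents=True, exist_ok=True)` call before the three `to_csv` lines.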

## Conclusion

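This notebook loaded four raw CSV files, dropped kit columns with 10 or fewer non-null values, aligned the LOC column names with the Estes ones, tagged every kit with an `mfgID`, and concatenated the two kit tables into a single table of 195 rows and 35 columns. The manufacturers and motors tables now use the same `mfgID` key, and all three tables are exported to `./data_clean/`.

Limitations: the kit columns listed in the quality check are still stored as text (`object` dtype), measurements mix units and formats (for example `16.3 in. (41.4 cm)` versus `47″`), and many columns remain sparsely populated (`Recommended Engines` has 73 non-null values out of 195 rows). Next steps could include parsing measurements into numeric columns and validating `mfgID` values against the manufacturers table.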
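
Kit measurements are stored as text with embedded units, such as `16.3 in. (41.4 cm)` in the Estes `Length` column. One possible follow-up step, which this notebook does not perform, is to extract the metric value into a numeric column. This sketch assumes the `(<number> cm)` pattern seen in the Estes preview, with `lengths` and `length_cm` as illustrative names:

```python
import pandas as pd

lengths = pd.Series(["16.3 in. (41.4 cm)", "11.2 in. (28.4 cm)", None])

# Capture the number inside "(... cm)". Rows without a match become NaN.
length_cm = pd.to_numeric(
    lengths.str.extract(r"\(([\d.]+)\s*cm\)", expand=False),
    errors="coerce",
)
print(length_cm)
```

Other columns would need their own patterns: `Diameter` reports millimetres in the Estes data, and the LOC values use inch marks without a metric equivalent.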
