# Data Preprocessing / Exploratory Data Analysis

Objective: Clean and visualize the data to prepare it for ML modeling.

## Data Cleaning
1. Handling of NaN values and duplicates
2. Dropping unnecessary columns
3. Dropping rows where OBJECT_TYPE is unknown (TBA)

In [7]:
import pandas as pd

In [8]:
df = pd.read_csv("data/space_debris_kaggle.csv")
df.head()

Unnamed: 0,CCSDS_OMM_VERS,COMMENT,CREATION_DATE,ORIGINATOR,OBJECT_NAME,OBJECT_ID,CENTER_NAME,REF_FRAME,TIME_SYSTEM,MEAN_ELEMENT_THEORY,...,RCS_SIZE,COUNTRY_CODE,LAUNCH_DATE,SITE,DECAY_DATE,FILE,GP_ID,TLE_LINE0,TLE_LINE1,TLE_LINE2
0,2,GENERATED VIA SPACE-TRACK.ORG API,2021-11-01T06:46:11,18 SPCS,ARIANE 42P+ DEB,1992-072J,EARTH,TEME,UTC,SGP4,...,MEDIUM,FR,1992.0,FRGUI,,3195178,188614016,0 ARIANE 42P+ DEB,1 26741U 92072J 21304.94919376 .00000883 0...,2 26741 7.7156 90.2410 6528926 243.1216 38...
1,2,GENERATED VIA SPACE-TRACK.ORG API,2021-11-01T04:58:37,18 SPCS,SL-8 DEB,1979-028C,EARTH,TEME,UTC,SGP4,...,SMALL,CIS,1979.0,PKMTR,,3194950,188593285,0 SL-8 DEB,1 26743U 79028C 21304.68908982 .00000079 0...,2 26743 82.9193 299.1120 0030720 158.9093 201...
2,2,GENERATED VIA SPACE-TRACK.ORG API,2021-11-01T06:26:11,18 SPCS,GSAT 1,2001-015A,EARTH,TEME,UTC,SGP4,...,LARGE,IND,2001.0,SRI,,3195026,188609573,0 GSAT 1,1 26745U 01015A 21305.22411368 -.00000165 0...,2 26745 12.1717 16.5368 0237386 250.1248 146...
3,2,GENERATED VIA SPACE-TRACK.ORG API,2021-10-31T18:07:15,18 SPCS,CZ-4 DEB,1999-057MB,EARTH,TEME,UTC,SGP4,...,SMALL,PRC,1999.0,TSC,,3194431,188556894,0 CZ-4 DEB,1 26754U 99057MB 21304.46625230 .00002265 0...,2 26754 98.4781 8.7205 0060618 37.3771 323...
4,2,GENERATED VIA SPACE-TRACK.ORG API,2021-11-01T04:58:37,18 SPCS,CZ-4 DEB,1999-057MC,EARTH,TEME,UTC,SGP4,...,SMALL,PRC,1999.0,TSC,,3194950,188592541,0 CZ-4 DEB,1 26755U 99057MC 21304.74081807 .00002610 0...,2 26755 98.4232 122.0724 0062255 345.1605 27...


In [9]:
df.columns

Index(['CCSDS_OMM_VERS', 'COMMENT', 'CREATION_DATE', 'ORIGINATOR',
       'OBJECT_NAME', 'OBJECT_ID', 'CENTER_NAME', 'REF_FRAME', 'TIME_SYSTEM',
       'MEAN_ELEMENT_THEORY', 'EPOCH', 'MEAN_MOTION', 'ECCENTRICITY',
       'INCLINATION', 'RA_OF_ASC_NODE', 'ARG_OF_PERICENTER', 'MEAN_ANOMALY',
       'EPHEMERIS_TYPE', 'CLASSIFICATION_TYPE', 'NORAD_CAT_ID',
       'ELEMENT_SET_NO', 'REV_AT_EPOCH', 'BSTAR', 'MEAN_MOTION_DOT',
       'MEAN_MOTION_DDOT', 'SEMIMAJOR_AXIS', 'PERIOD', 'APOAPSIS', 'PERIAPSIS',
       'OBJECT_TYPE', 'RCS_SIZE', 'COUNTRY_CODE', 'LAUNCH_DATE', 'SITE',
       'DECAY_DATE', 'FILE', 'GP_ID', 'TLE_LINE0', 'TLE_LINE1', 'TLE_LINE2'],
      dtype='object')

In [None]:
# Find columns with only one unique value; they carry no information
cols_to_drop = [col for col in df.columns if df[col].nunique() == 1]

# Drop these columns from the DataFrame
df.drop(columns=cols_to_drop, inplace=True)

# Report which columns were dropped
print("Dropped columns:", cols_to_drop)

Dropped columns: ['CCSDS_OMM_VERS', 'COMMENT', 'ORIGINATOR', 'CENTER_NAME', 'REF_FRAME', 'TIME_SYSTEM', 'MEAN_ELEMENT_THEORY', 'EPHEMERIS_TYPE', 'CLASSIFICATION_TYPE', 'ELEMENT_SET_NO']
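One subtlety in the filter above: `Series.nunique()` skips NaN by default, so a column that is entirely NaN reports 0 unique values and is not matched by `== 1`. That is consistent with `DECAY_DATE` needing its own drop. The sketch below uses a toy frame with hypothetical column names to show how `<= 1` would catch both cases in one pass; it is an illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are illustrative and not taken from the dataset.
toy = pd.DataFrame({
    "constant": ["a", "a", "a"],
    "all_nan": [np.nan, np.nan, np.nan],
    "varied": [1, 2, 3],
})

# nunique() ignores NaN by default, so the all-NaN column has 0 unique values.
unique_counts = toy.nunique()

# "<= 1" flags constant columns and all-NaN columns together.
uninformative = [col for col in toy.columns if unique_counts[col] <= 1]
print(uninformative)
```

Note that a column holding a single value plus some NaN entries also has `nunique() == 1`, so it would be dropped by either filter.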


In [None]:
# Drop DECAY_DATE because all of its values are NaN
df.drop(columns=['DECAY_DATE'], inplace=True)

In [11]:
#Check which columns have NaN values
print(df.isnull().sum())

CREATION_DATE            0
OBJECT_NAME              0
OBJECT_ID               39
EPOCH                    0
MEAN_MOTION              0
ECCENTRICITY             0
INCLINATION              0
RA_OF_ASC_NODE           0
ARG_OF_PERICENTER        0
MEAN_ANOMALY             0
NORAD_CAT_ID             0
REV_AT_EPOCH             0
BSTAR                    0
MEAN_MOTION_DOT          0
MEAN_MOTION_DDOT         0
SEMIMAJOR_AXIS           0
PERIOD                   0
APOAPSIS                 0
PERIAPSIS                0
OBJECT_TYPE              0
RCS_SIZE               198
COUNTRY_CODE            39
LAUNCH_DATE             39
SITE                    39
DECAY_DATE           14372
FILE                     0
GP_ID                    0
TLE_LINE0                0
TLE_LINE1                0
TLE_LINE2                0
dtype: int64
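Raw counts are easier to judge as shares of the 14372 rows: `DECAY_DATE` is missing everywhere, `RCS_SIZE` in roughly 1.4% of rows (198), and `OBJECT_ID`, `COUNTRY_CODE`, `LAUNCH_DATE`, and `SITE` in roughly 0.3% each (39). A minimal sketch of computing such shares directly, on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative.
toy = pd.DataFrame({
    "OBJECT_ID": ["1992-072J", None, "2001-015A", "1999-057MB"],
    "RCS_SIZE": ["MEDIUM", None, None, "SMALL"],
    "PERIOD": [100.0, 95.5, 1436.1, 101.2],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Show only the columns that actually have gaps.
print(missing_share[missing_share > 0])
```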


In [12]:
print(df.duplicated().sum())

0
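`df.duplicated()` with no arguments only flags rows that match on every column. If one column is expected to identify an object uniquely (assumed here to be `NORAD_CAT_ID`, which the source does not confirm), a check restricted to that column can surface repeats that the full-row check misses. A small sketch on toy data:

```python
import pandas as pd

# Toy rows: the same catalog ID appears twice with different epochs.
toy = pd.DataFrame({
    "NORAD_CAT_ID": [26741, 26743, 26743],
    "EPOCH": ["2021-10-31", "2021-10-31", "2021-11-01"],
})

# Full-row comparison: no two rows are identical.
full_row_dupes = int(toy.duplicated().sum())

# Key-only comparison: the second 26743 row is flagged.
id_dupes = int(toy.duplicated(subset=["NORAD_CAT_ID"]).sum())

print(full_row_dupes, id_dupes)
```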


In [None]:
#Checking values in target for classification
df['OBJECT_TYPE'].value_counts()

OBJECT_TYPE
DEBRIS         8431
PAYLOAD        4950
ROCKET BODY     744
TBA             247
Name: count, dtype: int64
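The counts above describe an imbalanced target: roughly 59% DEBRIS, 34% PAYLOAD, 5% ROCKET BODY, and under 2% TBA. The sketch below recomputes those shares from the printed counts; on the real DataFrame, `df['OBJECT_TYPE'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series(
    {"DEBRIS": 8431, "PAYLOAD": 4950, "ROCKET BODY": 744, "TBA": 247},
    name="count",
)

# Share of each class in the full dataset, rounded for readability.
shares = (counts / counts.sum()).round(3)
print(shares)
```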

In [15]:
# Fill missing OBJECT_ID values with a placeholder
df['OBJECT_ID'] = df['OBJECT_ID'].fillna('Unknown')


In [16]:
df['RCS_SIZE'].value_counts()

RCS_SIZE
SMALL     8346
LARGE     4189
MEDIUM    1639
Name: count, dtype: int64

In [17]:
# Remove rows where OBJECT_TYPE is TBA; copy so later assignments do not act on a view
df = df[df['OBJECT_TYPE'] != 'TBA'].copy()

In [None]:
# Fill remaining NaN values with each column's most frequent value (mode)
df['COUNTRY_CODE'] = df['COUNTRY_CODE'].fillna(df['COUNTRY_CODE'].mode()[0])
df['LAUNCH_DATE'] = df['LAUNCH_DATE'].fillna(df['LAUNCH_DATE'].mode()[0])
df['SITE'] = df['SITE'].fillna(df['SITE'].mode()[0])

In [19]:
# Replace missing RCS_SIZE values with 'UNKNOWN'
df['RCS_SIZE'] = df['RCS_SIZE'].fillna('UNKNOWN')


In [20]:
print(df.isnull().sum())

CREATION_DATE        0
OBJECT_NAME          0
OBJECT_ID            0
EPOCH                0
MEAN_MOTION          0
ECCENTRICITY         0
INCLINATION          0
RA_OF_ASC_NODE       0
ARG_OF_PERICENTER    0
MEAN_ANOMALY         0
NORAD_CAT_ID         0
REV_AT_EPOCH         0
BSTAR                0
MEAN_MOTION_DOT      0
MEAN_MOTION_DDOT     0
SEMIMAJOR_AXIS       0
PERIOD               0
APOAPSIS             0
PERIAPSIS            0
OBJECT_TYPE          0
RCS_SIZE             0
COUNTRY_CODE         0
LAUNCH_DATE          0
SITE                 0
FILE                 0
GP_ID                0
TLE_LINE0            0
TLE_LINE1            0
TLE_LINE2            0
dtype: int64
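The imputation steps that lead to this all-zero result can also be written as a single `DataFrame.fillna` call with a column-to-value mapping, which avoids chained `inplace=True` calls on a single column. A minimal sketch on toy data (the values are made up; the mode and `'UNKNOWN'` fill choices mirror the cells above):

```python
import pandas as pd

# Toy frame with gaps in the same kinds of columns as the dataset.
toy = pd.DataFrame({
    "COUNTRY_CODE": ["PRC", "PRC", None, "FR"],
    "SITE": ["TSC", None, "TSC", "FRGUI"],
    "RCS_SIZE": ["SMALL", None, "LARGE", "SMALL"],
})

# One mapping of column -> fill value: mode for some, a sentinel for others.
fill_values = {
    "COUNTRY_CODE": toy["COUNTRY_CODE"].mode()[0],
    "SITE": toy["SITE"].mode()[0],
    "RCS_SIZE": "UNKNOWN",
}

# fillna returns a new DataFrame; the original is left untouched.
filled = toy.fillna(fill_values)
print(filled.isna().sum())
```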


## Splitting of Features and Target
1. Label Encoding Target (OBJECT_TYPE)
2. Splitting into X and Y

In [21]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Encode OBJECT_TYPE
df['OBJECT_TYPE_ENCODED'] = label_encoder.fit_transform(df['OBJECT_TYPE'])

# Split into X and Y
X = df.drop(columns=['OBJECT_TYPE', 'OBJECT_TYPE_ENCODED'])  # Features
Y = df['OBJECT_TYPE_ENCODED']  # Encoded target
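`LabelEncoder` assigns integer codes by sorted class order, so with TBA removed the three remaining classes should map to DEBRIS = 0, PAYLOAD = 1, ROCKET BODY = 2. The sketch below shows how to read that mapping from `classes_` and how to turn codes back into labels; `toy_encoder` and the sample labels are illustrative names, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# A few sample labels using the three classes kept after cleaning.
labels = ["DEBRIS", "PAYLOAD", "ROCKET BODY", "PAYLOAD", "DEBRIS"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(labels)

# classes_ is sorted, so each label's code is its position in that array.
mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back into readable labels.
decoded = list(toy_encoder.inverse_transform([2, 0]))
print(decoded)
```

Keeping the fitted encoder around makes it possible to report model predictions with the original class names.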

## Data Preprocessing
1. Encoding categorical features
2. Scaling numerical features 
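As a rough illustration of the scaling step, the sketch below standardizes the numeric columns of a toy frame with `StandardScaler`. The choice of scaler and the toy values are assumptions; the source does not say which scaler or which columns are used.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features: two numeric columns and one text column (values are made up).
toy_X = pd.DataFrame({
    "MEAN_MOTION": [2.0, 14.0, 1.0, 15.0],
    "INCLINATION": [7.7, 82.9, 12.2, 98.5],
    "OBJECT_NAME": ["A", "B", "C", "D"],
})

# Select only numeric columns; text columns are left as they are.
numeric_cols = toy_X.select_dtypes(include="number").columns

# Standardize each numeric column to zero mean and unit variance.
scaler = StandardScaler()
scaled = toy_X.copy()
scaled[numeric_cols] = scaler.fit_transform(toy_X[numeric_cols])

print(scaled)
```

When a train/test split is involved, the scaler is usually fitted on the training portion only and then applied to the test portion with `transform`.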

In [22]:
df['RCS_SIZE'].value_counts()

RCS_SIZE
SMALL      8256
LARGE      4170
MEDIUM     1540
UNKNOWN     159
Name: count, dtype: int64

In [15]:
# One-hot encode RCS_SIZE
from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense array (requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
rcs_size_encoded = encoder.fit_transform(X[['RCS_SIZE']])  # 2D input: one column
rcs_size_encoded_df = pd.DataFrame(
    rcs_size_encoded,
    columns=encoder.get_feature_names_out(['RCS_SIZE']),
    index=X.index  # keep row alignment with X
)

# Replace the original column with its one-hot columns
X = pd.concat([X, rcs_size_encoded_df], axis=1)
X.drop(columns=['RCS_SIZE'], inplace=True)
