## Data Cleaning
- Data cleaning is the process of making sure your data is in the correct format, is consistent, and has its errors identified and dealt with appropriately.

- The actions below lead to a cleaner dataset:
    - Remove duplicate values
    - Remove irrelevant observations (observations need to be specific to the problem you are solving)
    - Address missing values (e.g. imputation techniques, or dropping features/observations)
    - Reformat data types (e.g. boolean, numeric, datetime)
    - Filter unwanted outliers (if you have a legitimate reason)
    - Reformat strings (e.g. remove white space, fix mislabeled/misspelt categories)
    - Validate (does the data make sense? Does it adhere to the defined business rules?)
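
To make the checklist above concrete, here is a minimal sketch that chains several of these steps on a tiny, made-up DataFrame. The column names (`city`, `temp_c`, `is_rainy`) and the temperature range used for validation are assumptions for illustration only; they are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
raw = pd.DataFrame({
    "city": [" Paris", "paris ", "Berlin", "Berlin", None],
    "temp_c": ["21.5", "21.5", "19.0", "19.0", "18.2"],
    "is_rainy": [True, True, False, False, True],
})

clean = (
    raw
    .dropna(subset=["city"])  # address missing values
    .assign(
        city=lambda d: d["city"].str.strip().str.title(),  # reformat strings
        temp_c=lambda d: d["temp_c"].astype(float),        # reformat data types
    )
    .drop_duplicates()        # remove duplicate rows (after normalising strings)
    .reset_index(drop=True)
)

# Validate against an assumed business rule: temperatures must be plausible.
assert clean["temp_c"].between(-90, 60).all()
print(clean)
```

Note that the duplicates only become visible after the strings are normalised, which is why the order of the steps can matter.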


## 1. Imports

In [1]:
import pandas as pd

## 2. Loading Dataset

In [2]:
df = pd.read_csv('../data/data_from_the_year_1910_to_2024.csv')

In [3]:
df.head()

|   | neo_id | name | absolute_magnitude | estimated_diameter_min | estimated_diameter_max | orbiting_body | relative_velocity | miss_distance | is_hazardous |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2162117 | 162117 (1998 SD15) | 19.14 | 0.394962 | 0.883161 | Earth | 71745.401048 | 58143620.0 | False |
| 1 | 2349507 | 349507 (2008 QY) | 18.5 | 0.530341 | 1.185878 | Earth | 109949.757148 | 55801050.0 | True |
| 2 | 2455415 | 455415 (2003 GA) | 21.45 | 0.136319 | 0.304818 | Earth | 24865.506798 | 67206890.0 | False |
| 3 | 3132126 | (2002 PB) | 20.63 | 0.198863 | 0.444672 | Earth | 78890.076805 | 30396440.0 | False |
| 4 | 3557844 | (2011 DW) | 22.7 | 0.076658 | 0.171412 | Earth | 56036.519484 | 63118630.0 | False |


## 3. Remove duplicate rows (if any)

In [4]:
duplicates = df.duplicated().any()
print(duplicates)

False


There are no duplicate rows, so nothing needs to be deleted.
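
For reference, if the check had returned `True`, the duplicates could be counted and removed roughly as follows. This is only a sketch on a small made-up frame (`toy` is a hypothetical name); it is not run against the real dataset.

```python
import pandas as pd

# Hypothetical toy frame in which the third row repeats the first.
toy = pd.DataFrame({
    "absolute_magnitude": [19.14, 18.50, 19.14],
    "is_hazardous": [False, True, False],
})

# duplicated() marks every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```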

## 4. Removing Irrelevant Columns

Here, `neo_id` and `name` are not relevant for this application: they only identify an object and add no predictive value to the model, so we need to delete them from our dataset.

Let's check the distinct values for all columns.

In [5]:
# Print distinct values for each column
for column in df.columns:
    print(f"Distinct values in column {column}: {df[column].unique()}")

Distinct values in column neo_id: [ 2162117  2349507  2455415 ... 54460107 54459238 54456245]
Distinct values in column name: ['162117 (1998 SD15)' '349507 (2008 QY)' '455415 (2003 GA)' ...
 '(2024 ND3)' '(2024 NQ2)' '(2024 NE)']
Distinct values in column absolute_magnitude: [19.14  18.5   21.45  ... 23.781 23.218 23.887]
Distinct values in column estimated_diameter_min: [0.39496169 0.53034072 0.13631856 ... 0.04659668 0.0603886  0.0443767 ]
Distinct values in column estimated_diameter_max: [0.8831612  1.18587791 0.30481756 ... 0.10419334 0.13503302 0.09922931]
Distinct values in column orbiting_body: ['Earth']
Distinct values in column relative_velocity: [ 71745.40104768 109949.75714849  24865.50679812 ...  11832.04103088
  56198.38273287  42060.35782998]
Distinct values in column miss_distance: [58143623.3191697  55801047.8181994  67206887.7225446  ...
 53460784.4719883   5184742.39379309  7126682.4570791 ]
Distinct values in column is_hazardous: [False  True]


As we can see, the column `orbiting_body` contains only the value `Earth` for every observation, so it carries no information and we remove it as well.

In [6]:
new_df = df.drop(columns=['neo_id', 'name', 'orbiting_body'])
new_df.head()

|   | absolute_magnitude | estimated_diameter_min | estimated_diameter_max | relative_velocity | miss_distance | is_hazardous |
|---|---|---|---|---|---|---|
| 0 | 19.14 | 0.394962 | 0.883161 | 71745.401048 | 58143620.0 | False |
| 1 | 18.5 | 0.530341 | 1.185878 | 109949.757148 | 55801050.0 | True |
| 2 | 21.45 | 0.136319 | 0.304818 | 24865.506798 | 67206890.0 | False |
| 3 | 20.63 | 0.198863 | 0.444672 | 78890.076805 | 30396440.0 | False |
| 4 | 22.7 | 0.076658 | 0.171412 | 56036.519484 | 63118630.0 | False |
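
Printing every distinct value works, but on wide datasets it can be easier to let pandas find single-valued columns directly. The sketch below uses `DataFrame.nunique()` on a small made-up frame; `toy` and `constant_columns` are hypothetical names, and the values are shortened for illustration.

```python
import pandas as pd

# Hypothetical toy frame mirroring three of the dataset's columns.
toy = pd.DataFrame({
    "orbiting_body": ["Earth", "Earth", "Earth"],
    "relative_velocity": [71745.4, 109949.8, 24865.5],
    "is_hazardous": [False, True, False],
})

# nunique() counts distinct non-missing values per column;
# a count of 1 (or 0) means the column cannot help separate observations.
n_unique = toy.nunique()
constant_columns = n_unique[n_unique <= 1].index.tolist()
print(constant_columns)

reduced = toy.drop(columns=constant_columns)
```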


## 5. Addressing Missing Values

In [7]:
# Check if there are any missing values
has_missing = new_df.isnull().values.any()
print(has_missing)

True


There are missing values in our dataset. Let's check which columns have the missing values.

In [8]:
# Count missing values in each column
missing_per_column = new_df.isnull().sum()
print(missing_per_column)


absolute_magnitude        28
estimated_diameter_min    28
estimated_diameter_max    28
relative_velocity          0
miss_distance              0
is_hazardous               0
dtype: int64


As we can see, `absolute_magnitude`, `estimated_diameter_min`, and `estimated_diameter_max` each have 28 missing values. We can address this either by imputing the values or by dropping the affected features/observations.
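
To illustrate the two options, here is a minimal sketch on a made-up frame. Median imputation is just one possible imputation technique and is chosen here as an assumption; this notebook does not prescribe a specific method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing magnitude.
toy = pd.DataFrame({
    "absolute_magnitude": [19.14, np.nan, 21.45, 20.63],
    "relative_velocity": [71745.4, 109949.8, 24865.5, 78890.1],
})

# Option 1: impute each numeric column with its own median.
medians = toy.median(numeric_only=True)
imputed = toy.fillna(medians)

# Option 2: drop every observation that has at least one missing value.
dropped = toy.dropna()

print(imputed)
print(len(dropped))
```

Imputation keeps every row but introduces estimated values, while dropping keeps only observed values at the cost of losing rows.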

`Percentage of Missing Values per Column`

In [9]:
# Percentage of missing values per column
missing_percentage = (new_df.isnull().sum() / len(new_df)) * 100
print(missing_percentage)


absolute_magnitude        0.008279
estimated_diameter_min    0.008279
estimated_diameter_max    0.008279
relative_velocity         0.000000
miss_distance             0.000000
is_hazardous              0.000000
dtype: float64


In [10]:
new_df['is_hazardous'].value_counts()

False    295037
True      43162
Name: is_hazardous, dtype: int64

Since the missing values make up a very small share of each column (about 0.008%, i.e. close to 0%), we can safely drop the observations that contain them.

In [11]:
new_df = new_df.dropna()

In [12]:
# Check if there are any missing values
has_missing = new_df.isnull().values.any()
print(has_missing)

False
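
Because `is_hazardous` is imbalanced (43162 `True` against 295037 `False` in the counts above), it can be worth checking which class the dropped rows belong to. The sketch below shows one way to do that on a made-up frame; `toy`, `rows_with_missing`, and the other names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two incomplete observations.
toy = pd.DataFrame({
    "absolute_magnitude": [19.14, np.nan, 21.45, 20.63, np.nan],
    "is_hazardous": [False, False, True, False, False],
})

# Boolean mask: True for every row with at least one missing value.
rows_with_missing = toy.isnull().any(axis=1)

# Share of rows that dropna() would remove, in percent.
share_lost = rows_with_missing.mean() * 100

# Class breakdown of the rows that would be removed.
lost_per_class = toy.loc[rows_with_missing, "is_hazardous"].value_counts()

kept = toy.dropna()
print(share_lost)
print(lost_per_class)
print(len(kept))
```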


## 6. Reformat Data Types

In [13]:
new_df.dtypes

absolute_magnitude        float64
estimated_diameter_min    float64
estimated_diameter_max    float64
relative_velocity         float64
miss_distance             float64
is_hazardous                 bool
dtype: object

We have to change the data type of the `is_hazardous` column from `bool` to integer.

In [14]:
new_df['is_hazardous']

0         False
1          True
2         False
3         False
4         False
          ...  
338194    False
338195    False
338196    False
338197    False
338198    False
Name: is_hazardous, Length: 338171, dtype: bool

In [15]:
new_df['is_hazardous'] = new_df['is_hazardous'].astype(int)

In [16]:
new_df

|   | absolute_magnitude | estimated_diameter_min | estimated_diameter_max | relative_velocity | miss_distance | is_hazardous |
|---|---|---|---|---|---|---|
| 0 | 19.140 | 0.394962 | 0.883161 | 71745.401048 | 5.814362e+07 | 0 |
| 1 | 18.500 | 0.530341 | 1.185878 | 109949.757148 | 5.580105e+07 | 1 |
| 2 | 21.450 | 0.136319 | 0.304818 | 24865.506798 | 6.720689e+07 | 0 |
| 3 | 20.630 | 0.198863 | 0.444672 | 78890.076805 | 3.039644e+07 | 0 |
| 4 | 22.700 | 0.076658 | 0.171412 | 56036.519484 | 6.311863e+07 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 338194 | 28.580 | 0.005112 | 0.011430 | 56646.985988 | 6.406548e+07 | 0 |
| 338195 | 28.690 | 0.004859 | 0.010865 | 21130.768947 | 2.948883e+07 | 0 |
| 338196 | 21.919 | 0.109839 | 0.245607 | 11832.041031 | 5.346078e+07 | 0 |
| 338197 | 23.887 | 0.044377 | 0.099229 | 56198.382733 | 5.184742e+06 | 0 |


In [17]:
new_df['is_hazardous'].value_counts()

0    295009
1     43162
Name: is_hazardous, dtype: int64

After the conversion, `False` is encoded as 0 and `True` as 1.
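
The mapping can be confirmed on a small example. This is a minimal sketch on a made-up Series (`flags` is a hypothetical name); it relies only on `Series.astype(int)`, the same call used above.

```python
import pandas as pd

# Hypothetical boolean labels.
flags = pd.Series([False, True, False], name="is_hazardous")

# astype(int) maps False -> 0 and True -> 1.
as_int = flags.astype(int)
print(as_int.tolist())

# The number of positive labels is unchanged by the conversion.
same_positive_count = int(as_int.sum()) == int(flags.sum())
print(same_positive_count)
```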

## 7. Saving the dataset

In [20]:
new_df.to_csv('../data/cleaned_data.csv', index=False)
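
A quick round-trip check can confirm that a saved file reloads with the expected columns. The sketch below writes a made-up frame to a temporary directory rather than to `../data/`, so the file name `cleaned_toy.csv` and the frame `toy` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "absolute_magnitude": [19.14, 18.5],
    "is_hazardous": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "cleaned_toy.csv"
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read back.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)
```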

## 8. Next Step
Outlier Detection