# Exploratory Data Analysis
As the name suggests, exploratory data analysis (EDA) is the step where you build a deeper understanding of the data. During this step, you examine the data's statistical characteristics, create visualisations, and test hypotheses.

There are four main types of EDA:

1. `Univariate non-graphical`: Summarise the sample distribution of a single variable with descriptive statistics. (e.g. measures of central tendency, measures of spread)

2. `Univariate graphical`: Graphical analysis of a single variable. (e.g. histograms, boxplots, stem-and-leaf plots)

3. `Multivariate non-graphical`: Techniques which show the relationship between two or more variables. (e.g. covariance, correlations)

4. `Multivariate graphical`: Graphically show the relationship between two or more variables. (e.g. bar plots, scatterplots)
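
The four types above can be sketched on a small synthetic dataset. This is a minimal illustration rather than part of the notebook's analysis: the `toy` DataFrame and its columns `x` and `y` are made-up names, and the graphical steps are only indicated in comments because they assume matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the dataset used in this notebook)
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=200)
toy = pd.DataFrame({"x": x, "y": 3.0 * x + rng.normal(scale=1.0, size=200)})

# 1. Univariate non-graphical: central tendency and spread of one variable
x_summary = toy["x"].describe()

# 2. Univariate graphical: the bin counts that a histogram of x would draw
#    (toy["x"].plot.hist(bins=10) would render it, assuming matplotlib)
counts, bin_edges = np.histogram(toy["x"], bins=10)

# 3. Multivariate non-graphical: pairwise Pearson correlation
corr = toy.corr()

# 4. Multivariate graphical: toy.plot.scatter(x="x", y="y") would draw the
#    relationship that the correlation summarises (assuming matplotlib)

print(x_summary)
print(counts)
print(corr)
```

Each graphical technique is a picture of a quantity its non-graphical counterpart already computes, so the two are usually read together.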

## 1. Imports

In [3]:
import pandas as pd
import numpy as np

from scipy import stats

## 2. Load the dataset

In [2]:
data = pd.read_csv("../data/cleaned_data.csv")
data.head()

| Unnamed: 0 | absolute_magnitude | estimated_diameter_min | estimated_diameter_max | relative_velocity | miss_distance | is_hazardous |
|---|---|---|---|---|---|---|
| 0 | 19.14 | 0.394962 | 0.883161 | 71745.401048 | 58143620.0 | 0 |
| 1 | 18.5 | 0.530341 | 1.185878 | 109949.757148 | 55801050.0 | 1 |
| 2 | 21.45 | 0.136319 | 0.304818 | 24865.506798 | 67206890.0 | 0 |
| 3 | 20.63 | 0.198863 | 0.444672 | 78890.076805 | 30396440.0 | 0 |
| 4 | 22.7 | 0.076658 | 0.171412 | 56036.519484 | 63118630.0 | 0 |
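
The `Unnamed: 0` column in the output usually means the CSV was written with its index included, although that is an assumption about how `cleaned_data.csv` was produced. The sketch below rebuilds the five rows shown above as an in-memory CSV (the empty first header field is part of that assumption), reads the first column as the index, and runs a few quick sanity checks. The names `csv_text` and `sample` are illustrative only.

```python
import io
import pandas as pd

# Rows copied from the data.head() output; the empty first header field is an
# assumption about how the original CSV was written
csv_text = """,absolute_magnitude,estimated_diameter_min,estimated_diameter_max,relative_velocity,miss_distance,is_hazardous
0,19.14,0.394962,0.883161,71745.401048,58143620.0,0
1,18.5,0.530341,1.185878,109949.757148,55801050.0,1
2,21.45,0.136319,0.304818,24865.506798,67206890.0,0
3,20.63,0.198863,0.444672,78890.076805,30396440.0,0
4,22.7,0.076658,0.171412,56036.519484,63118630.0,0
"""

# index_col=0 reads the unnamed first column as the index, not as a feature
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # every column should be numeric
print(sample.isna().sum())  # missing values per column
```

If the same holds for the real file, passing `index_col=0` to `pd.read_csv` would keep the stray column out of later analysis.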


## 3. Univariate Non-graphical EDA

In [4]:
features = ["absolute_magnitude", "estimated_diameter_min", "estimated_diameter_max", 
            "relative_velocity", "miss_distance"]
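
As a compact cross-check on the per-feature statistics computed in this section, most of them can also be obtained in a single call to `DataFrame.agg`. The sketch below uses a synthetic stand-in (`demo`, `demo_features`, and the columns `a` and `b` are hypothetical names) so that it runs on its own. With the real data, `data[features]` would take the place of `demo[demo_features]`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: one right-skewed column and one roughly
# symmetric column
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "a": rng.lognormal(mean=0.0, sigma=1.0, size=500),
    "b": rng.normal(loc=5.0, scale=1.5, size=500),
})
demo_features = ["a", "b"]

# Built-in reductions: one row per statistic, one column per feature
summary = demo[demo_features].agg(["mean", "median", "std", "var", "skew", "kurt"])

# Statistics without a built-in name are added as extra rows
summary.loc["range"] = demo[demo_features].max() - demo[demo_features].min()
summary.loc["iqr"] = (
    demo[demo_features].quantile(0.75) - demo[demo_features].quantile(0.25)
)

print(summary.T)  # transposed: one row per feature
```

The mode is left out here because `mode()` can return several values per column, so it does not reduce cleanly to a single row.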

In [5]:
# Initialize a dictionary to store the results
summary_stats = {}

In [6]:
# Loop through each feature and compute the required statistics
for feature in features:
    feature_data = data[feature]
    
    # Central Tendency
    mean = feature_data.mean()
    median = feature_data.median()
    mode = feature_data.mode()[0]  # mode() returns all modes sorted ascending; keep the first (smallest)
    
    # Spread
    std = feature_data.std()
    var = feature_data.var()
    range_value = feature_data.max() - feature_data.min()
    q1 = feature_data.quantile(0.25)
    q3 = feature_data.quantile(0.75)
    iqr = q3 - q1
    
    # Skewness and excess kurtosis (pandas uses Fisher's definition: a normal distribution has kurtosis 0)
    skew = feature_data.skew()
    kurt = feature_data.kurtosis()
    
    # Store the results in the dictionary
    summary_stats[feature] = {
        'Mean': mean,
        'Median': median,
        'Mode': mode,
        'Std Dev': std,
        'Variance': var,
        'Range': range_value,
        'IQR': iqr,
        'Skewness': skew,
        'Kurtosis': kurt
    }
    
    # Display the statistics for the feature
    print(f"Feature: {feature}")
    for name, value in summary_stats[feature].items():
        print(f"  {name}: {value}")
    print("-" * 50)

Feature: absolute_magnitude
  Mean: 22.932524959266164
  Median: 22.8
  Mode: 24.4
  Std Dev: 2.911216390292147
  Variance: 8.47518087110564
  Range: 24.33
  IQR: 4.360000000000003
  Skewness: 0.08402052704253546
  Kurtosis: -0.47557705847894116
--------------------------------------------------
Feature: estimated_diameter_min
  Mean: 0.15781204666055487
  Median: 0.0732073989
  Mode: 0.0350392641
  Std Dev: 0.3138851378797346
  Variance: 0.09852387978178001
  Range: 37.5447367783
  IQR: 0.163656849
  Skewness: 30.963263588764914
  Kurtosis: 2664.6036246371696
--------------------------------------------------
Feature: estimated_diameter_max
  Mean: 0.352878464000549
  Median: 0.1636967205
  Mode: 0.0783501764
  Std Dev: 0.7018685054244151
  Variance: 0.4926193989067022
  Range: 83.9525836336
  IQR: 0.36594783929999997
  Skewness: 30.963263588892026
  Kurtosis: 2664.6036246477115
--------------------------------------------------
Feature: relative_velocity
  Mean: 51060.01799447809
  M