1. Libraries and Their Purpose

pandas will help you load and manipulate your dataset.

numpy will be useful for numerical operations, particularly for working with arrays.

plotly.graph_objects is for creating more customizable and complex interactive visualizations.

plotly.express is for creating interactive plots with a simpler syntax.

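matplotlib is used for creating static visualizations and is a more traditional plotting library.

seaborn builds on matplotlib and is used here for the missing-data heatmap.

In [9]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns  # required by sns.heatmap in the missing-data cell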

In [10]:
# Load the dataset 
dataset = pd.read_csv('/Users/rehas./Documents/BIA /PROJECT/PRACTISE/star_classification_bia.csv')

--- Basic Analysis ---

In [None]:
# 1. Overview of the Dataset
print("\n--- Dataset Overview ---")
print(f"Shape of the dataset: {dataset.shape}")
print(f"Columns: {list(dataset.columns)}")
print(dataset.info())

In [None]:
# 2. Summary Statistics
print("\n--- Summary Statistics ---")
print(dataset.describe())

In [None]:
# 3. Missing Data
missing_data = dataset.isnull().sum()
print("\n--- Missing Data ---")
print(missing_data[missing_data > 0])

The next cell draws a heatmap of dataset.isnull(). Each cell is True where a value is missing and False otherwise, so any gaps show up as bright marks against a dark background.

In [None]:
# Visualize missing data (if any)
plt.figure(figsize=(10, 6))
plt.title("Missing Data Heatmap", fontsize=16)
sns.heatmap(dataset.isnull(), cbar=False, cmap="viridis")
plt.show()

if dataset.isnull().sum().sum() > 0:
    print("There are missing values in the dataset.")
else:
    print("There are no missing values in the dataset.")
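A heatmap shows where values are missing, but not how much of each column is affected. The sketch below is one way to get that number. It uses a small made-up frame (the name toy and its values are illustrative, not taken from the dataset) and relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "u": [23.9, np.nan, 25.3, 22.1],
    "g": [22.3, 22.8, np.nan, np.nan],
    "redshift": [0.63, 0.78, 0.64, 0.93],
})

# isnull() gives a boolean frame; the column mean is the fraction missing.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The same expression should work on dataset unchanged, assuming it is loaded as in the cells above.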


--- Intermediate Analysis ---

This creates an interactive heatmap of the pairwise Pearson correlations between the numeric columns. Each cell displays its correlation value, and color encodes that value on the Viridis scale: yellow cells are close to +1, while dark purple cells hold the lowest (most negative) values.

In [None]:
# 1. Correlation Heatmap
numeric_cols = dataset.select_dtypes(include=[np.number]).columns
correlation_matrix = dataset[numeric_cols].corr()

fig = px.imshow(correlation_matrix, 
                text_auto=True, 
                title="Correlation Heatmap",
                color_continuous_scale='Viridis')
fig.show()
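With many numeric columns, the heatmap can be hard to scan for the strongest relationships. A possible complement is to rank the column pairs by absolute correlation. The sketch below does this on synthetic data; the frame toy and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "g": base,
    "r": base + rng.normal(scale=0.1, size=200),  # nearly identical to g
    "redshift": rng.normal(size=200),             # unrelated to g and r
})

corr = toy.corr()
# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (column, column) -> correlation and rank by absolute value.
pairs = upper.stack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```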

The bar chart shows one bar per class; the height of each bar is the number of instances of that class in the dataset.

In [None]:
# 2. Class Distribution
class_counts = dataset['class'].value_counts()
fig = px.bar(class_counts, x=class_counts.index, y=class_counts.values, 
             labels={'x': 'Class', 'y': 'Count'}, 
             title="Object Class Distribution",
             color=class_counts.index)
fig.show()
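Raw counts are easier to judge as proportions when checking for class imbalance. A minimal sketch, assuming class labels such as GALAXY, STAR, and QSO (the labels and counts here are made up):

```python
import pandas as pd

# Made-up labels: 6 galaxies, 3 stars, 1 quasar.
labels = pd.Series(["GALAXY"] * 6 + ["STAR"] * 3 + ["QSO"] * 1, name="class")

# normalize=True returns each class's share of the rows instead of a count.
share = labels.value_counts(normalize=True)
print(share.round(2))

# Ratio between the most and least frequent class.
imbalance_ratio = share.max() / share.min()
print(round(imbalance_ratio, 1))
```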

The scatter plots below can help identify patterns or trends in the data that may be useful for classification tasks or further analysis.

In [None]:
# 3. Photometric Data Analysis
# Plotting scatter plots for each combination of photometric bands
photometric_cols = ['u', 'g', 'r', 'i', 'z']
for i in range(len(photometric_cols)):
    for j in range(i + 1, len(photometric_cols)):
        fig = px.scatter(dataset, x=photometric_cols[i], y=photometric_cols[j], 
                         color='class', 
                         title=f"{photometric_cols[i]} vs {photometric_cols[j]}", 
                         hover_data=['redshift'], 
                         labels={'class': 'Class'})
        fig.show()
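The nested index loops above visit every unordered pair of bands exactly once. The same set of pairs can be produced with itertools.combinations, which may read more directly. A small sketch that only builds the pairs, without plotting:

```python
from itertools import combinations

photometric_cols = ["u", "g", "r", "i", "z"]

# Every unordered pair of bands, in the same order as the nested loops.
band_pairs = list(combinations(photometric_cols, 2))
print(len(band_pairs))
print(band_pairs[:3])
```

Five bands give ten pairs, so the cell above produces ten separate figures.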


--- Advanced Analysis ---

By analyzing the density contours and marginal histograms, we can gain a deeper understanding of the data's spatial properties, such as the concentration of different object classes and their positions in celestial coordinates.

In [None]:
# 1. Spatial Distribution
# Density plot for alpha vs. delta
fig = px.density_contour(dataset, x='alpha', y='delta', color='class',
                         title="Spatial Distribution (Alpha vs. Delta)",
                         labels={'alpha': 'Right Ascension', 'delta': 'Declination'},
                         marginal_x="histogram", marginal_y="histogram")
fig.show()

By comparing the distributions across classes, we can see how different types of objects (stars, galaxies, or other celestial bodies) are distributed in redshift, which serves as an indicator of distance. This analysis is important for understanding the nature and properties of the objects.

In [None]:
# 2. Redshift Analysis
# Violin plot for redshift distribution by class
fig = px.violin(dataset, y='redshift', x='class', color='class',
                title="Redshift Distribution by Class",
                labels={'redshift': 'Redshift', 'class': 'Class'})
fig.show()
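A violin plot is easier to interpret next to a few numbers. The sketch below computes a per-class summary of redshift with groupby. The frame toy, its class labels, and its redshift values are all made up for illustration.

```python
import pandas as pd

# Made-up rows; class labels and redshift values are illustrative only.
toy = pd.DataFrame({
    "class": ["STAR", "STAR", "GALAXY", "GALAXY", "QSO", "QSO"],
    "redshift": [0.0001, -0.0002, 0.12, 0.45, 1.8, 2.4],
})

# One row per class with the minimum, median, and maximum redshift.
summary = toy.groupby("class")["redshift"].agg(["min", "median", "max"])
print(summary)
```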

In [20]:
# 3. Clustering (K-Means)
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

--- Custom Analysis: Magnitudes Comparison and Redshift vs Class ---

1. Compare Magnitudes in Different Filters

The code below plots a line chart comparing the median magnitudes in each photometric filter (u, g, r, i, and z) across object classes.

In [None]:

# Line Chart for Magnitudes Comparison
magnitude_median = dataset.groupby('class')[['u', 'g', 'r', 'i', 'z']].median().reset_index()

fig = go.Figure()
for col in ['u', 'g', 'r', 'i', 'z']:
    fig.add_trace(go.Scatter(x=magnitude_median['class'], 
                             y=magnitude_median[col],
                             mode='lines+markers',
                             name=col))
fig.update_layout(title="Median Magnitudes in Different Filters by Class",
                  xaxis_title="Class",
                  yaxis_title="Magnitude",
                  template="plotly_dark",
                  legend_title="Filters")
fig.show()




2. Redshift vs. Object Type

The box plot for redshift distribution by object class helps in visualizing the spread and central tendency of redshift values for different types of objects in the dataset. It allows for easy comparison of redshift distributions across classes and highlights any potential outliers or unusual trends. This type of visualization is useful in astrophysical studies where the redshift of celestial objects is key to understanding their distance, velocity, and other properties.





In [None]:

fig = px.box(dataset, x='class', y='redshift', color='class',
             title="Redshift Distribution by Object Class",
             labels={'class': 'Object Class', 'redshift': 'Redshift'})
fig.update_layout(template="plotly_dark")
fig.show()

--- Objective: Redshift Distribution by Object Classification ---

The violin plot for redshift distribution by object class provides a comprehensive view of the distribution and density of redshift values across different object classes. It combines the benefits of both box plots and kernel density plots, allowing for a detailed understanding of data spread, central tendency, and variability. This visualization is valuable for astrophysical analysis, as it helps compare redshift distributions across object classes and reveals any interesting patterns, trends, or outliers in the dataset.

In [None]:
 
fig = px.violin(dataset, x='class', y='redshift', color='class',
                box=True, points="all",
                title="Redshift Distribution by Object Class",
                labels={'class': 'Object Class', 'redshift': 'Redshift'},
                template="plotly_dark")
fig.show()




--- Objective: Magnitude vs. Fiber ID ---


The scatter plot of r-band magnitude vs. fiber ID shows how the brightness of objects relates to the fiber ID, which identifies the fiber that collected the object's light in each observation. The plot is color-coded by object class, which makes it easy to compare different types of astronomical objects.

In [None]:
fig = px.scatter(dataset, x='fiber_ID', y='r', color='class',
                 title="Magnitude (r) vs. Fiber ID",
                 labels={'fiber_ID': 'Fiber ID', 'r': 'Magnitude (r)', 'class': 'Class'},
                 hover_data=['u', 'g', 'i', 'z', 'redshift'],
                 template="plotly_dark")
fig.update_traces(marker=dict(size=7, opacity=0.7))
fig.show()

--- Objective: Magnitudes Across Different Fibers ---

The box plot of r-band magnitude distribution across fibers provides valuable insight into how the brightness (in the r-band) of different objects is distributed across various fiber IDs. It helps to compare the distributions of different object classes (e.g., galaxies, stars) and understand how the magnitude of objects varies depending on the fiber used for observation.

In [None]:

fig = px.box(dataset, x='fiber_ID', y='r', color='class',
             title="Magnitude (r) Distribution Across Fibers",
             labels={'fiber_ID': 'Fiber ID', 'r': 'Magnitude (r)', 'class': 'Class'},
             template="plotly_dark")
fig.update_traces(marker=dict(opacity=0.7))
fig.show()






The 3D scatter plot shows how K-Means groups the dataset into three clusters. The clustering itself uses six standardized features (u, g, r, i, z, and redshift); the plot places only u, g, and redshift on its axes, with r, i, and z available on hover. Color-coding the points by cluster label makes it easy to see which objects are grouped together and to compare the clusters with the known object classes.

In [None]:
# Select numeric features for clustering
clustering_features = dataset[['u', 'g', 'r', 'i', 'z', 'redshift']].dropna()
scaler = StandardScaler()
scaled_features = scaler.fit_transform(clustering_features)

# Perform K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(scaled_features)

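# Add clusters to the dataset
# Cluster labels are categories, not quantities; casting to str makes
# Plotly use discrete colors instead of a continuous color scale.
clustering_features['Cluster'] = clusters.astype(str)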
fig = px.scatter_3d(clustering_features, x='u', y='g', z='redshift', 
                    color='Cluster', 
                    title="K-Means Clustering Results",
                    labels={'Cluster': 'Cluster'}, 
                    hover_data=['r', 'i', 'z'])
fig.update_traces(marker=dict(size=5))
fig.show()
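One way to check how clusters relate to known classes is a cross-tabulation of cluster label against class label. The sketch below runs on synthetic, well-separated blobs, so the match is nearly perfect by construction; on real data the table would be far less clean. The class names A, B, and C are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Three synthetic groups of 50 points each, centered far apart.
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
true_class = np.repeat(["A", "B", "C"], 50)

scaled = StandardScaler().fit_transform(points)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)

# Rows are clusters, columns are classes, cells are object counts.
table = pd.crosstab(pd.Series(labels, name="cluster"),
                    pd.Series(true_class, name="class"))
print(table)
```

Cluster numbers are arbitrary, so only the pattern of counts matters, not which cluster index lines up with which class.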

--- Dimensionality Reduction ---

In [22]:
from sklearn.decomposition import PCA

In [None]:
# PCA on scaled features
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_features)

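# Create a DataFrame for PCA results
pca_df = pd.DataFrame(pca_result, columns=['PC1', 'PC2'])
# Align labels by index: dropna() may have removed rows, so taking the
# first len(pca_df) labels could pair points with the wrong class.
pca_df['Class'] = dataset.loc[clustering_features.index, 'class'].to_numpy()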

fig = px.scatter(pca_df, x='PC1', y='PC2', 
                 color='Class', 
                 title="2D PCA Visualization",
                 labels={'Class': 'Class'})
fig.update_traces(marker=dict(size=5))
fig.show()
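A 2D PCA plot is only as informative as the variance its two components retain. The sketch below reads that from explained_variance_ratio_ on synthetic data, where three columns share a common signal and one is independent (all values are made up).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
brightness = rng.normal(size=300)
# Three strongly correlated columns plus one independent column (made up).
X = np.column_stack([
    brightness + rng.normal(scale=0.1, size=300),
    brightness + rng.normal(scale=0.1, size=300),
    brightness + rng.normal(scale=0.1, size=300),
    rng.normal(size=300),
])

pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(X))

# Fraction of the total variance captured by each retained component.
ratio = pca.explained_variance_ratio_
print(ratio, ratio.sum())
```

If the two ratios sum to well below 1, the 2D scatter hides a substantial part of the structure and should be read with that in mind.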



--- Objective: Objects Observed in Each Run and Rerun ---

In [None]:


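# Count objects per (run_ID, rerun_ID) pair
run_rerun_counts = (dataset.groupby(['run_ID', 'rerun_ID'])
                    .size().reset_index(name='count'))

# Line Chart
fig = px.line(run_rerun_counts, x='run_ID', y='count', color='rerun_ID',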
              title="Objects Observed Over Runs (Line Chart)",
              labels={'run_ID': 'Run ID', 'count': 'Count', 'rerun_ID': 'Rerun ID'},
              template="plotly_dark")
fig.show()


--- Advanced Analysis: Time Series Analysis of Object Classifications ---

In [None]:

# Assume MJD represents the Modified Julian Date for the time series
time_series_data = dataset.groupby(['MJD', 'class']).size().reset_index(name='count')
fig = px.line(time_series_data, x='MJD', y='count', color='class',
              title="Time Series Analysis of Object Classifications",
              labels={'MJD': 'Modified Julian Date', 'count': 'Count', 'class': 'Object Class'},
              template="plotly_dark")
fig.show()
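Modified Julian Dates count days from 1858-11-17 00:00 UTC, so they can be converted to calendar dates for a more readable time axis. A minimal sketch with made-up MJD values, using the origin and unit arguments of pd.to_datetime:

```python
import pandas as pd

# Made-up MJD values for illustration.
mjd = pd.Series([51544, 51545, 56000], name="MJD")

# MJD 0 corresponds to 1858-11-17; unit="D" treats the values as day counts.
dates = pd.to_datetime(mjd, unit="D", origin=pd.Timestamp("1858-11-17"))
print(dates)
```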

Spatial Distribution of Objects (Alpha, Delta Coordinates)

In [None]:
# --- Spatial Distribution of Objects ---
fig = px.scatter(dataset, x='alpha', y='delta', color='class',
                 title="Spatial Distribution of Objects (Alpha vs Delta)",
                 labels={'alpha': 'Right Ascension (Alpha)', 'delta': 'Declination (Delta)', 'class': 'Object Type'},
                 template="plotly_dark")
fig.update_traces(marker=dict(size=6, opacity=0.6))
fig.show()


Redshift and Distance Calculation

In [None]:
# --- Redshift and Distance Calculation ---
# Hubble's law, d = c * z / H0, is a low-redshift approximation (z << 1);
# at higher redshift it only gives a rough distance scale.
# Assumed constants: H0 = 70 km/s/Mpc and c rounded to 3e5 km/s.
H0 = 70  # Hubble constant in km/s/Mpc
c = 3e5  # Speed of light in km/s (approximate)

# Calculate distance in megaparsecs
dataset['distance_Mpc'] = (dataset['redshift'] * c) / H0

# Scatter Plot with Distance as Color Gradient
fig = px.scatter(dataset, x='alpha', y='delta', color='distance_Mpc',
                 title="Redshift and Distance Distribution",
                 labels={'alpha': 'Right Ascension (Alpha)', 'delta': 'Declination (Delta)', 'distance_Mpc': 'Distance (Mpc)'},
                 color_continuous_scale='Jet',
                 template="plotly_dark")
fig.update_traces(marker=dict(size=6, opacity=0.7))
fig.show()
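To make the distance formula concrete, the sketch below evaluates d = c * z / H0 for two small redshifts, using the same assumed constants as the cell above. The function name hubble_distance_mpc is illustrative.

```python
H0 = 70.0      # Hubble constant in km/s/Mpc (assumed value)
C_KM_S = 3e5   # speed of light in km/s (rounded)

def hubble_distance_mpc(z):
    """Low-redshift approximation d = c * z / H0, in megaparsecs."""
    return C_KM_S * z / H0

for z in (0.01, 0.1):
    print(z, round(hubble_distance_mpc(z), 1))
```

The relation is linear in z, so the approximation degrades as z approaches or exceeds 1.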


Outlier Detection in Magnitudes and Redshifts

In [None]:
# --- Outlier Detection in Magnitudes and Redshifts ---
# Z-Score for outlier detection
from scipy.stats import zscore

dataset['zscore_magnitude'] = zscore(dataset[['u', 'g', 'r', 'i', 'z']].mean(axis=1))
dataset['zscore_redshift'] = zscore(dataset['redshift'])

# Highlight outliers (absolute Z-score > 3)
outliers = dataset[(dataset['zscore_magnitude'].abs() > 3) | (dataset['zscore_redshift'].abs() > 3)]

# Scatter Plot with Outliers Highlighted
fig = px.scatter(dataset, x='redshift', y='r', color='class', size=dataset['zscore_redshift'].abs(),
                 title="Outlier Detection in Magnitudes and Redshifts",
                 labels={'redshift': 'Redshift', 'r': 'Magnitude (r)', 'class': 'Object Type'},
                 hover_data=['zscore_magnitude', 'zscore_redshift'],
                 template="plotly_dark")
fig.add_scatter(x=outliers['redshift'], y=outliers['r'], mode='markers',
                marker=dict(color='red', size=8, symbol='x'), name='Outliers')
fig.show()
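The z-score used above is each value's distance from the mean, measured in standard deviations. The sketch below computes it by hand with NumPy on made-up redshifts; np.std defaults to the population standard deviation (ddof=0), which matches the default of scipy.stats.zscore.

```python
import numpy as np

# Nineteen ordinary values and one extreme value (all made up).
redshift = np.array([0.1] * 19 + [5.0])

# z-score: distance from the mean in units of standard deviation.
z = (redshift - redshift.mean()) / redshift.std()

# Flag values more than 3 standard deviations from the mean.
outlier_mask = np.abs(z) > 3
print(np.flatnonzero(outlier_mask))
```

In a very small sample a single extreme value can never exceed a z-score of 3, because it inflates the standard deviation itself, so the threshold is only meaningful with enough rows.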


Fiber Usage Efficiency by Plate

In [None]:
# --- Fiber Usage Efficiency by Plate ---
fiber_efficiency = dataset.groupby(['plate', 'fiber_ID']).size().reset_index(name='count')

# Heat Map of Fiber Usage
heatmap_efficiency = fiber_efficiency.pivot_table(index='plate', columns='fiber_ID', values='count', fill_value=0)
fig = px.imshow(heatmap_efficiency, text_auto=True, color_continuous_scale='Blues',
                title="Fiber Usage Efficiency by Plate",
                labels={'x': 'Fiber ID', 'y': 'Plate', 'color': 'Count'})
fig.show()

# Bar Chart of Fiber Usage
fiber_efficiency_summary = fiber_efficiency.groupby('plate')['count'].sum().reset_index()
fig = px.bar(fiber_efficiency_summary, x='plate', y='count',
             title="Total Fiber Usage by Plate",
             labels={'plate': 'Plate', 'count': 'Total Observations'},
             template="plotly_dark")
fig.show()


In [39]:
# --- Save Processed Data ---
# Save the processed dataset for further use
dataset.to_csv('EDA_processed_dataset.csv', index=False)