<a href="https://colab.research.google.com/github/ubsuny/PHY386/blob/main/2025/HW/HW5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HW5: Star Classification Using Machine Learning

How to use an autoencoder and clustering to classify stars based on brightness.

## Learning Outcomes
By the end of this homework, you will:
- Understand how to enable **GPU acceleration** in Google Colab for deep learning tasks.
- Learn the fundamentals of **Machine Learning (ML)** and its applications in astrophysics.
- Use **autoencoders** to extract compressed features from star images.
- Apply **KMeans clustering** to classify stars based on brightness and size.
- Visualize results with **matplotlib**.

## Step 0: Enabling GPU Acceleration & Introduction to Machine Learning

### How to Enable GPU Acceleration in Google Colab
To train deep learning models efficiently, we need GPU acceleration:
1. Go to **Runtime** in the top menu.
2. Click **Change runtime type**.
3. Set **Hardware Accelerator** to **GPU**.
4. Click **Save**.

To verify GPU availability, run the following command:

In [None]:
import tensorflow as tf
print("GPU available:", tf.config.list_physical_devices('GPU'))

### Introduction to Machine Learning (ML)
Machine Learning is a field of Artificial Intelligence (AI) that enables computers to learn patterns from data without being explicitly programmed. It is widely used in astronomy to classify celestial objects, detect anomalies, and analyze vast datasets.

#### Types of Machine Learning:
1. **Supervised Learning**: The model learns from labeled data (e.g., star classification with known brightness categories).
2. **Unsupervised Learning**: The model finds patterns without predefined labels (e.g., clustering stars based on observed properties).
3. **Reinforcement Learning**: The model learns by interacting with an environment and receiving rewards.

In this notebook, we will use **unsupervised learning** with an **autoencoder** and **KMeans clustering**.

### What is an Autoencoder?
An autoencoder is a type of neural network used for unsupervised learning. It consists of:
- **Encoder**: Compresses input data into a lower-dimensional representation.
- **Bottleneck Layer**: The smallest representation of the data.
- **Decoder**: Reconstructs the original input from the compressed representation.

Autoencoders help in **dimensionality reduction** and **feature extraction** by learning compact representations of complex data.


## 1. Install Required Libraries

In [None]:
!pip install astropy scikit-learn tensorflow matplotlib numpy photutils auto-stretch

## 2. Load and Stretch the RGB FITS Image
We first load astronomical images ([FITS file format](https://en.wikipedia.org/wiki/FITS?wprov=sfti1#)) and apply a **stretching function** (logarithmic/asinh) to enhance visibility. The problem is that astronomical images are generally stored as 32-bit integers, while your display can only show an 8-bit integer range per color. So we have to tell the computer how to map the wider range of values onto the colors the display can show.
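
To make the idea of stretching concrete, here is a minimal sketch of an asinh stretch that maps a wide range of pixel values onto 0..255. The function name `asinh_stretch_to_uint8`, the `softening` parameter, and the pixel values are illustrative assumptions; this is not how the `auto_stretch` package is implemented.

```python
import numpy as np

def asinh_stretch_to_uint8(image, softening=0.05):
    """Map an image of arbitrary range onto 0..255 with an asinh curve.

    Hypothetical helper for illustration only.
    """
    image = np.asarray(image, dtype=np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # A constant image carries no contrast to stretch
        return np.zeros(image.shape, dtype=np.uint8)
    scaled = (image - lo) / (hi - lo)  # linear rescale to [0, 1]
    # asinh is nearly linear for faint pixels and logarithmic for bright ones
    stretched = np.arcsinh(scaled / softening) / np.arcsinh(1.0 / softening)
    return np.round(stretched * 255).astype(np.uint8)

# Synthetic pixel values: faint background plus one very bright star
pixels = np.array([[100, 120, 150], [200, 1000, 60000]])
display = asinh_stretch_to_uint8(pixels)
print(display)
```

A purely linear rescale would push every pixel except the brightest one close to zero; the asinh curve keeps faint structure visible while still fitting the bright star into the 8-bit range.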

**ToDo**: Load and plot your assigned FITS file (4 points)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from astropy.io import fits
from astropy.stats import sigma_clipped_stats
from photutils.detection import DAOStarFinder
from auto_stretch import apply_stretch
from sklearn.cluster import KMeans
import requests
from io import BytesIO

# Replace this URL with the raw URL of your FITS file on GitHub
fits_url = ""

# Fetch the FITS file from the GitHub repository
response = requests.get(fits_url)
response.raise_for_status()  # Check for request errors

# Load the FITS file into an HDUList using BytesIO
hdul = fits.open(BytesIO(response.content))

# Assume the primary HDU contains an RGB image in (3, Height, Width) format
rgb_data = np.transpose(hdul[0].data, (1, 2, 0))  # Reorder to (Height, Width, 3) for imshow
hdul.close()  # Close the file once the data is in memory

# Display the image
fig = plt.figure()
plt.imshow(apply_stretch(rgb_data))

## 3. Count Stars Using Astropy
We use **DAOStarFinder** to detect and count stars.

**ToDo**: Extract the RGB channels separately (4 points) and find an algorithm that makes the number of detected stars the same in each channel (4 points). Plot the combined image and the three RGB channels in a 2x2 grid, highlighting the detected stars (4 points).

Depending on your FITS file, you might have to select a part of the image for star detection.
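
The detection threshold in `detect_stars` is expressed in units of the background standard deviation, which `sigma_clipped_stats` estimates while ignoring bright outliers such as the stars themselves. The sketch below shows the idea of iterative sigma clipping in plain NumPy. The function `sigma_clip_stats` and the synthetic background are assumptions for illustration; it is a simplified stand-in, not astropy's exact implementation.

```python
import numpy as np

def sigma_clip_stats(data, sigma=3.0, max_iters=5):
    """Return (mean, median, std) after iteratively rejecting outliers.

    Simplified illustration of sigma clipping (hypothetical helper).
    """
    values = np.asarray(data, dtype=np.float64).ravel()
    for _ in range(max_iters):
        median, std = np.median(values), np.std(values)
        keep = np.abs(values - median) <= sigma * std
        if keep.all():
            break  # nothing left to reject
        values = values[keep]
    return values.mean(), np.median(values), values.std()

# Synthetic background noise with a few very bright "star" pixels
rng = np.random.default_rng(0)
background = rng.normal(loc=100.0, scale=5.0, size=10_000)
background[:20] = 5000.0

mean, median, std = sigma_clip_stats(background)
threshold = 5.0 * std  # same form as threshold=5.0*std in detect_stars
print(f"raw std: {np.std(background):.1f}, clipped std: {std:.1f}")
print(f"detection threshold above background: {threshold:.1f}")
```

Because each channel has its own noise level, the clipped standard deviation (and therefore the threshold) differs between R, G, and B, which is one reason the number of detected stars can vary from channel to channel.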

In [None]:
# calculate the mean of the RGB channels
avg_channel = np.mean(rgb_data[:, :, :], axis=2)

def detect_stars(channel_data):
    mean, median, std = sigma_clipped_stats(channel_data, sigma=3.0)
    finder = DAOStarFinder(fwhm=3.0, threshold=5.0*std)
    return finder(channel_data - median)

# Detect stars (DAOStarFinder returns None when no sources are found)
sources = detect_stars(avg_channel)
print(f"Number of detected stars: {0 if sources is None else len(sources)}")

In [None]:
# Plot detected stars
plt.imshow(avg_channel, cmap='gray', origin='lower')
plt.scatter(sources['xcentroid'], sources['ycentroid'], s=30, edgecolor='red', facecolors='none')
plt.show()

### Step 3: Feature Extraction

**Stellar Colors and Surface Temperatures**  
Stars exhibit a variety of colors primarily because of their differing **surface temperatures**. The color of a star is directly related to its temperature: hotter stars emit more blue and ultraviolet light, while cooler stars emit more red and infrared light. This temperature dependence is explained by the concept of **blackbody radiation** and is described by [Planck's law](https://en.wikipedia.org/wiki/Planck%27s_law). The [spectral classification](https://en.wikipedia.org/wiki/Stellar_classification) system categorizes stars into types (O, B, A, F, G, K, M) based on these temperatures, where O-type stars are extremely hot and blue, and M-type stars are cool and red. Understanding these differences helps astronomers not only determine the physical properties of stars but also track their evolutionary stages.
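
As a rough illustration of how temperature sets color, the sketch below evaluates Planck's law at one blue and one red wavelength for three blackbody temperatures. The wavelengths (450 nm and 650 nm) and the temperatures assigned to each spectral type are representative assumptions, and real stars are only approximately blackbodies.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K

def planck_radiance(wavelength_m, temperature_k):
    """Blackbody spectral radiance B_lambda(T) from Planck's law."""
    exponent = H * C / (wavelength_m * K_B * temperature_k)
    return (2.0 * H * C**2 / wavelength_m**5) / math.expm1(exponent)

BLUE, RED = 450e-9, 650e-9  # representative wavelengths in meters (assumed)

# Approximate, illustrative temperatures for three spectral types
ratios = {}
for name, temp in [("M-type", 3000.0), ("G-type", 5800.0), ("B-type", 20000.0)]:
    ratios[name] = planck_radiance(BLUE, temp) / planck_radiance(RED, temp)
    print(f"{name} ({temp:.0f} K): blue/red radiance ratio = {ratios[name]:.2f}")
```

The blue-to-red ratio grows with temperature, which is the same trend the B/G and R/G color ratios are meant to capture in the image data.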

**Luminosity, Brightness, and the Hertzsprung–Russell Diagram**  
The apparent brightness of a star as seen from Earth is influenced by its intrinsic **luminosity** and its distance from the observer. Luminosity, the total energy output of a star per unit time, varies dramatically depending on the star’s mass, age, and evolutionary stage. This relationship is visualized in the [Hertzsprung–Russell diagram](https://en.wikipedia.org/wiki/Hertzsprung%E2%80%93Russell_diagram), which plots stars according to their luminosity and surface temperature. In this diagram, stars with higher luminosities are found in the upper regions regardless of their color, indicating that a star's brightness is not solely a function of its temperature but also of its size and the stage of its [stellar evolution](https://en.wikipedia.org/wiki/Stellar_evolution). Such insights allow astronomers to predict the future behavior of stars and understand the underlying physics governing their life cycles.
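
The dependence of brightness on size, temperature, and distance can be sketched with two textbook relations: the Stefan-Boltzmann law for a spherical blackbody, L = 4πR²σT⁴, and the inverse-square law for the flux received at distance d, F = L / (4πd²). The red giant parameters below are hypothetical round numbers chosen only for illustration.

```python
import math

SIGMA_SB = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4
R_SUN = 6.957e8            # nominal solar radius, m
AU = 1.495978707e11        # astronomical unit, m

def luminosity(radius_m, temperature_k):
    """Luminosity of a spherical blackbody in watts."""
    return 4.0 * math.pi * radius_m**2 * SIGMA_SB * temperature_k**4

def apparent_flux(luminosity_w, distance_m):
    """Flux received at a given distance (inverse-square law), W/m^2."""
    return luminosity_w / (4.0 * math.pi * distance_m**2)

sun = luminosity(R_SUN, 5772.0)
red_giant = luminosity(100 * R_SUN, 3500.0)  # hypothetical cool, large star

print(f"Sun: {sun:.3e} W")
print(f"Red giant / Sun luminosity: {red_giant / sun:.0f}")
print(f"Solar flux at 1 AU: {apparent_flux(sun, AU):.0f} W/m^2")
print(f"Solar flux at 2 AU: {apparent_flux(sun, 2 * AU):.0f} W/m^2")
```

The cooler star is far more luminous because of its radius, and the same star appears four times fainter at twice the distance, so flux measured in an image cannot be read as temperature or luminosity on its own.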

**Composition, Metallicity, and Evolutionary Effects**  
Beyond temperature and mass, a star's **composition** plays a critical role in determining both its color and brightness. The abundance of elements heavier than helium—referred to as [metallicity](https://en.wikipedia.org/wiki/Metallicity)—can affect the opacity of a star's outer layers, influencing how energy is transported to the surface and thus its color and luminosity. Additionally, stars evolve over time; for example, stars in the later stages of [stellar evolution](https://en.wikipedia.org/wiki/Stellar_evolution) can swell into red giants, drastically changing their brightness and color profiles. This evolutionary process, combined with variations in initial mass and metallicity, leads to a rich diversity in the observed properties of stars. Together, these factors help astronomers piece together the history of our galaxy and the lifecycle of its stellar populations.


We extract:
- **Brightness (flux)** from each RGB channel.
- **Color Ratios**: R/G and B/G ratios to capture color differences.
- **Size Proxy** from sharpness values.

We normalize the extracted features for efficient learning.
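
The normalization and ratio steps can be sketched on made-up numbers as follows. The arrays `flux_r`, `flux_g`, `flux_b`, the helper `min_max_normalize`, and the variable `features_demo` are hypothetical; with real data the per-channel fluxes would first have to be measured for the same set of stars.

```python
import numpy as np

def min_max_normalize(values):
    """Rescale values to [0, 1]; returns zeros if all values are equal."""
    values = np.asarray(values, dtype=np.float64)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

# Hypothetical per-star fluxes in each channel (same star order in all three)
flux_r = np.array([120.0, 80.0, 300.0, 50.0])
flux_g = np.array([100.0, 100.0, 250.0, 0.0])  # last star has no green flux
flux_b = np.array([60.0, 140.0, 200.0, 40.0])

# Skip stars with zero green flux to avoid dividing by zero
valid = flux_g > 0
rg = flux_r[valid] / flux_g[valid]
bg = flux_b[valid] / flux_g[valid]

# One row per star, one column per feature
features_demo = np.column_stack([min_max_normalize(rg), min_max_normalize(bg)])
print(features_demo)
```

Dividing by the green flux removes most of the overall brightness, so the ratios describe color; the min-max step then puts every feature on the same 0 to 1 scale before training.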

**ToDo**: Calculate the color ratios (4 points) and the normalized flux ratios (4 points) for each color channel (R/G and B/G) for further analysis, and generate a new `features` variable (4 points).

If you have very bright stars in your FITS file, you might have to use the `remove_extreme_brightness` function.

In [None]:
# Extract features (brightness and size)
flux = sources['flux']

# Compute size proxy
size = sources['sharpness']

# Normalize features
flux_norm = (flux - np.min(flux)) / (np.max(flux) - np.min(flux))
size_norm = (size - np.min(size)) / (np.max(size) - np.min(size))

features = np.vstack([flux_norm, size_norm]).T

In [None]:
def remove_extreme_brightness(star_features, brightness_threshold=3.0):
    """
    Filters out stars with exceptionally high brightness from a dataset of star features.

    This function calculates the overall brightness of each star by computing the Euclidean norm
    of its feature vector. It then determines the mean and standard deviation of these brightness
    values. Stars whose brightness exceeds the mean by more than a specified number of standard
    deviations (defined by `brightness_threshold`) are considered outliers and removed from the dataset.

    Parameters
    ----------
    star_features : ndarray
        A 2D NumPy array of shape (n_stars, n_features), where each row represents the feature vector
        of a star. It is assumed that the features are numerical and relevant to the brightness calculation.
    brightness_threshold : float, optional
        The number of standard deviations above the mean brightness to use as the cutoff for identifying
        extreme brightness values. The default is 3.0, which corresponds to the common statistical practice
        of removing data points that lie more than three standard deviations from the mean.

    Returns
    -------
    filtered_star_features : ndarray
        A 2D NumPy array containing the feature vectors of stars that are not considered extreme in brightness.
    filtered_indices : ndarray
        A 1D boolean NumPy array indicating which stars were retained (True) and which were filtered out (False).

    Notes
    -----
    - The function assumes that the Euclidean norm of the feature vectors is an appropriate measure of
      brightness. Ensure that the input features are scaled or selected accordingly.
    - This method uses a statistical approach to identify outliers based on the assumption of a normal
      distribution of brightness values. If the brightness distribution is significantly non-normal,
      consider using alternative outlier detection methods.
    - The function utilizes NumPy's `linalg.norm` to compute the Euclidean norm and `mean` and `std`
      functions to calculate statistical measures.

    Examples
    --------
    >>> import numpy as np
    >>> star_features = np.array([[1.0, 2.0], [2.0, 2.0], [10.0, 10.0]])
    >>> filtered_features, filtered_indices = remove_extreme_brightness(star_features, brightness_threshold=1.0)
    >>> filtered_features
    array([[1., 2.],
           [2., 2.]])
    >>> filtered_indices
    array([ True,  True, False])
    """

    brightness = np.linalg.norm(star_features, axis=1)  # Compute overall brightness
    mean_brightness = np.mean(brightness)
    std_brightness = np.std(brightness)
    filtered_indices = brightness < (mean_brightness + brightness_threshold * std_brightness)
    return star_features[filtered_indices], filtered_indices

## Step 4: Autoencoder for Feature Compression
An **autoencoder** is a neural network used for unsupervised learning. It learns a compact representation of input data.

### Network Architecture:
- **Input Layer**: Takes in two features (brightness and size).
- **Encoder**:
  - A hidden layer with 8 neurons extracts patterns.
  - A bottleneck layer with 2 neurons compresses the data.
- **Decoder**:
  - Expands data back to 8 neurons.
  - Outputs the reconstructed 2-feature data.
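
To see how the architecture above scales with the number of input features, the following sketch counts the trainable parameters (weights plus biases) of the fully connected layers. The helper names `dense_param_count` and `autoencoder_layer_sizes` are hypothetical, and the five-feature case is only an example of a larger input.

```python
def dense_param_count(layer_sizes):
    """Total weights and biases in a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def autoencoder_layer_sizes(n_features, hidden=8, bottleneck=2):
    """Layer widths: input -> hidden -> bottleneck -> hidden -> output."""
    return [n_features, hidden, bottleneck, hidden, n_features]

for n_features in (2, 5):
    sizes = autoencoder_layer_sizes(n_features)
    print(sizes, "->", dense_param_count(sizes), "trainable parameters")
```

Only the first and last layers depend on the number of features, so changing the feature set means changing the input shape and the width of the final layer while the hidden and bottleneck layers can stay as they are.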

**ToDo**: Adapt the input and output shape to the new feature generated above (4 points) and use the GPU for acceleration (2 points).

In [None]:
# Define an autoencoder model
input_layer = Input(shape=(2,))
encoded = Dense(8, activation='relu')(input_layer)
encoded = Dense(2, activation='relu')(encoded)

decoded = Dense(8, activation='relu')(encoded)
decoded = Dense(2, activation='sigmoid')(decoded)

autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# Train autoencoder
autoencoder.fit(features, features, epochs=50, batch_size=16, verbose=1)

## Step 5: Clustering with KMeans
After extracting compressed features, we use **KMeans clustering** to classify the stars.
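
For intuition about what KMeans does, here is a minimal NumPy sketch of the standard assign-and-update iteration (Lloyd's algorithm) on two synthetic blobs. The function `kmeans_lloyd` and the demo data are illustrative assumptions; scikit-learn's `KMeans` uses a more careful initialization and other refinements.

```python
import numpy as np

def kmeans_lloyd(points, k, n_iters=100, seed=0):
    """Basic k-means: assign points to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=np.float64)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every center, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # New center = mean of assigned points (keep old center if cluster is empty)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
blob_a = rng.normal([0.0, 0.0], 0.1, size=(50, 2))
blob_b = rng.normal([3.0, 3.0], 0.1, size=(50, 2))
demo_points = np.vstack([blob_a, blob_b])

demo_labels, demo_centers = kmeans_lloyd(demo_points, k=2)
print(demo_centers)
```

The cluster numbers themselves are arbitrary: rerunning with a different seed can swap which blob is called 0 and which is called 1, so labels should be interpreted by inspecting the stars in each cluster.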

In [None]:
encoder = Model(input_layer, encoded)
encoded_features = encoder.predict(features)

num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
predicted_labels = kmeans.fit_predict(encoded_features)

## Step 6a: Generating Synthetic Star Data
To demonstrate clustering in astrophysics, we generate synthetic stars with controlled properties.
This helps visualize how clustering can be applied to real astrophysical problems, such as distinguishing star populations.


In [None]:
import random

def generate_synthetic_stars(num_stars=300):
    """Generates synthetic star data with predefined color and brightness properties."""
    categories = {
        0: {'rg': 0.9, 'bg': 0.5, 'scatter': 0.1},  # Red stars
        1: {'rg': 0.5, 'bg': 0.9, 'scatter': 0.1},  # Blue stars
        2: {'rg': 0.7, 'bg': 0.7, 'scatter': 0.05}, # White stars
        3: {'rg': 0.2, 'bg': 0.3, 'scatter': 0.1},  # Dim stars
    }

    stars = []
    labels = []

    for _ in range(num_stars):
        category = random.choice(list(categories.keys()))
        base = categories[category]
        rg = max(0, min(1, np.random.normal(base['rg'], base['scatter'])))
        bg = max(0, min(1, np.random.normal(base['bg'], base['scatter'])))
        stars.append([rg, bg])
        labels.append(category)

    return np.array(stars), np.array(labels)

# Generate and plot synthetic stars
synthetic_stars, synthetic_labels = generate_synthetic_stars()
plt.figure(figsize=(8,6))
plt.scatter(synthetic_stars[:,0], synthetic_stars[:,1], c=synthetic_labels, cmap='coolwarm', alpha=0.7, edgecolors='k')
plt.xlabel('Synthetic R/G Ratio')
plt.ylabel('Synthetic B/G Ratio')
plt.colorbar(label='Simulated Star Category')
plt.title('Synthetic Star Classification Example')
plt.grid(True)
plt.show()

## Step 6b: Visualizing Real Data
We plot stars with different colors representing their assigned clusters.

Clusters represent groups of stars with similar size and brightness. If clusters overlap too much, it might indicate the need for better feature separation.

**ToDo**: Generate new plots from the new clustering for the color ratios and for size/brightness (8 points)
- **X-axis: Normalized R/G Ratio** – Represents how red the star is relative to green.
- **Y-axis: Normalized B/G Ratio** – Represents how blue the star is relative to green.
- **Point Color: Cluster Label** – Assigned cluster based on autoencoder features and KMeans.

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(flux_norm, size_norm, c=predicted_labels, cmap='viridis', alpha=0.7, edgecolors='k')
plt.xlabel('Normalized flux')
plt.ylabel('Normalized size')
plt.colorbar(label='Cluster')
plt.title('Star Classification by Size & Brightness')
plt.grid(True)
plt.show()

## Step 7: Refining the Clustering
To improve clustering, we:
1. **Increase Number of Clusters** – To separate stars into more detailed groups.
2. **Use Gaussian Mixture Model (GMM)** – A probabilistic alternative to KMeans for soft clustering.
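
The difference between hard and soft clustering can be sketched in one dimension: instead of a single label, each point receives a probability of belonging to each Gaussian component. The mixture parameters below are hypothetical fixed values; a fitted `GaussianMixture` estimates them from the data and exposes the same kind of probabilities through `predict_proba`.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Probability density of a 1D normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def responsibilities(x, weights, means, stds):
    """Probability that each point belongs to each mixture component."""
    x = np.asarray(x, dtype=np.float64)[:, None]  # shape (n_points, 1)
    weighted = np.asarray(weights) * gaussian_pdf(x, np.asarray(means), np.asarray(stds))
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hypothetical two-component mixture with equal weights
resp = responsibilities(
    [0.0, 0.5, 1.0],
    weights=[0.5, 0.5],
    means=[0.0, 1.0],
    stds=[0.3, 0.3],
)
print(np.round(resp, 3))
```

The point halfway between the two means is split evenly between the components, which is information a hard KMeans label cannot express and which can help flag stars that sit between populations.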

In [None]:
from sklearn.mixture import GaussianMixture

# Try different numbers of clusters
num_clusters_gmm = 6
gmm = GaussianMixture(n_components=num_clusters_gmm, random_state=42)
predicted_labels_gmm = gmm.fit_predict(encoded_features)

plt.figure(figsize=(8,6))
plt.scatter(flux_norm, size_norm, c=predicted_labels_gmm, cmap='plasma', alpha=0.7, edgecolors='k')
plt.xlabel('Normalized brightness')
plt.ylabel('Normalized size')
plt.colorbar(label='GMM Cluster')
plt.title('Refined Star Classification with Gaussian Mixture Model')
plt.grid(True)
plt.show()