# Astrolibrary Tutorial: Exploring the Cosmos with Python :)

Welcome to the Astrolibrary tutorial! In this notebook, we'll dive into the fascinating world of astronomical research using the Astrolibrary, a Python library designed to assist researchers in classifying astronomical objects and working with spectral data.


# What is Astrolibrary?

Astrolibrary is a versatile library tailored for astronomers and researchers, providing a seamless interface to the Sloan Digital Sky Survey (SDSS) services. It simplifies the process of querying databases, preprocessing spectral data, visualizing astronomical spectra, and more.

## Getting Started

### Installation

To get started, make sure you have Astrolibrary installed. If not, you can install it using the following command:

In [None]:
!python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ astrolibrary==1.0.1

### Importing the Library

Let's begin by importing the Astrolibrary modules and classes we'll use throughout this tutorial:


In [None]:
from astrolibrary import QueryHandler, DataPreprocessing, get_spectra_data, cross_match, MetaDataExtractor, plot, MachineLearning

## Module: QueryHandler
Let's start by exploring the QueryHandler module. It is designed to simplify querying astronomical datasets, specifically by interfacing with the Sloan Digital Sky Survey (SDSS) services.

The QueryHandler class allows you to:

- Instantiate a query handler for a given dataset, such as SDSS.
- Run queries using ADQL (Astronomical Data Query Language) strings.
- Check the status of queries.
- Retrieve the results of queries.

Now, we'll demonstrate how to use this module.

### Step 1: Create a QueryHandler Instance

In [None]:
query_handler = QueryHandler(dataset_name="SDSS")

### Step 2: Generate a Query

In [None]:
def sample_n_objects(class_name, n):
    """Returns a sample of n objects of a given class."""
    query = f"""
    SELECT TOP {n}
        p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
        p.run, p.rerun, p.camcol, p.field,
        s.specobjid, s.class, s.z AS redshift, s.plate, s.mjd, s.fiberid
    FROM PhotoObj AS p
    JOIN SpecObj AS s ON s.bestobjid = p.objid
    WHERE
        p.u BETWEEN 0 AND 19.6
        AND p.g BETWEEN 0 AND 20
        AND s.class = '{class_name}'
    """
    return query
query = sample_n_objects(class_name="Galaxy", n=10)
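
Because `sample_n_objects` places its arguments directly into the query text, it can help to check them before building the query. The sketch below shows one possible guard. It is not part of Astrolibrary: the helper name `validate_sample_args` and the set of allowed class names (SDSS labels spectra as GALAXY, STAR, or QSO) are assumptions made for illustration.

```python
# Hypothetical helper (not part of Astrolibrary): checks the arguments
# before they are interpolated into the query string.
ALLOWED_CLASSES = {"GALAXY", "STAR", "QSO"}  # assumed SDSS spectral classes


def validate_sample_args(class_name, n):
    """Return a cleaned (class_name, n) pair or raise ValueError."""
    cleaned = str(class_name).strip().upper()
    if cleaned not in ALLOWED_CLASSES:
        raise ValueError(f"Unknown class: {class_name!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(n, bool) or not isinstance(n, int) or n <= 0:
        raise ValueError(f"n must be a positive integer, got {n!r}")
    return cleaned, n


print(validate_sample_args("Galaxy", 10))
```

The cleaned values could then be passed on to `sample_n_objects`.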

### Step 3: Run the Query

In [None]:
query_id = query_handler.run_query(query)

### Step 4: Check Query Status

In [None]:
status = query_handler.check_status(query_id)
# This is not necessary when actually using the library, but for tutorial purposes let's see what we get!
print(status)
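
For longer queries, you may want to wait until the job has finished before asking for results. The sketch below polls a status function until it reports completion. It is only an illustration under stated assumptions: the helper `wait_for_query` is hypothetical, and the status strings ("COMPLETED", "ERROR", "ABORTED") are guesses, so check what `check_status` actually returns before relying on them.

```python
import time


def wait_for_query(check_status, query_id, done_states=("COMPLETED",),
                   failed_states=("ERROR", "ABORTED"),
                   interval=1.0, timeout=60.0):
    """Poll check_status(query_id) until the query finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = str(check_status(query_id)).upper()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"Query {query_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Query {query_id} still {status} after {timeout} s")
        time.sleep(interval)


# Stand-in for query_handler.check_status so the sketch runs offline.
_fake_statuses = iter(["QUEUED", "EXECUTING", "COMPLETED"])
print(wait_for_query(lambda qid: next(_fake_statuses), "demo", interval=0.0))
```

In the notebook, the first argument would be `query_handler.check_status` and the second the `query_id` from Step 3.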

### Step 5: Get and Display the Results

In [None]:
results_table = query_handler.get_results(query_id)

results_table

In [None]:
# Select a single column by name
a = results_table["ra"]

In [None]:
# objid = 1237645879578460271
# qid = query_handler.run_query(f"SELECT plate, mjd, fiberid FROM Specobj WHERE objid = {objid}")

# plate, mjd, fiberid = query_handler.get_results(qid)[0]

results_table




In [None]:
results_table[0]["g"]

## Module: Cross Matching

The `cross_match` module facilitates the cross-referencing of astronomical objects from the SDSS and Gaia catalogs, prioritizing match purity. This function is particularly useful for astronomers and researchers who need to identify celestial objects observed by both SDSS and Gaia, enabling comprehensive analyses.



#### Function Parameters

1. **`spec_objid_list` (list):**
   - List of SDSS spectroscopic object identifiers.
   - These identifiers uniquely identify celestial objects observed by the Sloan Digital Sky Survey (SDSS).

2. **`angular_distance_max` (float, optional):**
   - Maximum angular distance between Gaia and SDSS sources, measured in arcseconds.
   - Specifies the degree of separation between celestial objects observed by both catalogs.
   - Default value is 2.0 arcseconds.
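
To make the role of `angular_distance_max` concrete, here is a small sketch of how an angular separation can be computed and compared against a matching radius. It uses the haversine formula on plain coordinates. The function names are hypothetical, and this is not necessarily how `cross_match` works internally.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation of two sky positions (degrees in, arcseconds out)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def within_radius(pos_a, pos_b, angular_distance_max=2.0):
    """True if two (ra, dec) positions lie within the matching radius."""
    return angular_separation_arcsec(*pos_a, *pos_b) <= angular_distance_max


# Two positions 1.5 arcseconds apart in declination.
print(within_radius((150.0, 2.0), (150.0, 2.0 + 1.5 / 3600)))
```

A larger radius admits more candidate pairs, at the cost of more chance alignments.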

#### Output

The function returns an Astropy Table containing cross-match results. The table includes the following columns:

- `source_id`: Unique identifier for the Gaia source.
- `clean_sdssdr13_oid`: Cleaned SDSS DR13 object identifier.
- `original_ext_source_id`: Original external source identifier.
- `angular_distance`: Angular distance between Gaia and SDSS sources in arcseconds.
- `number_of_neighbours`: Number of neighboring sources.
- `number_of_mates`: Number of matching sources.
- `xm_flag`: Cross-match flag.
- `arcsec`: Angular distance in arcseconds.


### Now, let's explore this function!

### Step 1: List of SDSS spectroscopic object identifiers

In [None]:
spec_objid_list = [1237645879551066262,1237645879578460255, 1237645941291614227, 1237645941824356443]

### Step 2: Perform cross-matching with a maximum angular distance of 3.0 arcseconds

In [None]:
table = cross_match(spec_objid_list, angular_distance_max=3.0)

### Step 3: Display the cross-match results

In [None]:
print("Cross-match results:")
print(table)

### The table contains rows, so we found matches! How exciting!

## Module: Data Retrieval


Now, let's introduce the Data Retrieval module, an essential component of the Astrolibrary. This module provides functionality to retrieve spectral data from the Sloan Digital Sky Survey (SDSS) using specific identifiers.

Here are the key components included in this module:

- **Functionality:** The `get_spectra_data` function is designed to retrieve spectral data based on Plate, MJD (Modified Julian Date), and Fiber ID. Users can specify additional parameters such as the output format (either 'fits' or 'csv') and survey details.

- **Parameters:**
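  - `plateid`, `mjd`, `fiberid`: Plate, MJD, and fiber identifiers of the SDSS spectrum.
  - `survey`, `run2d`, `dr_number`: Survey name, reduction version, and data release number.
  - `output_format`: Output format ('fits' or 'csv').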


- **Returns:** The function returns the retrieved data in the specified format.

Now, let's explore how to use this module step by step.

### Step 1: Define your Identifiers

In [None]:
PLATE4=7644

MJD=57327

FIBERID4=528

dr_number=18
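
The file names used in this tutorial (for example `spec-7644-57327-0528.fits`) combine these three identifiers, with the plate and fiber numbers zero-padded to four digits. The sketch below rebuilds such a name. `spectrum_filename` is a hypothetical helper written for illustration, not an Astrolibrary function.

```python
def spectrum_filename(plate, mjd, fiberid, extension="fits"):
    """Build a name like spec-7644-57327-0528.fits from the three identifiers."""
    # Plate and fiber ID are zero-padded to four digits.
    return f"spec-{plate:04d}-{mjd}-{fiberid:04d}.{extension}"


print(spectrum_filename(7644, 57327, 528))
print(spectrum_filename(7644, 57327, 528, extension="csv"))
```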

### Step 2: Retrieve the data in your chosen format

In [None]:
# Example Usage for CSV Format
file_path_to_csv = get_spectra_data(
    survey="eboss",
    run2d="v5_13_2",
    plateid=PLATE4,
    mjd=MJD,
    fiberid=FIBERID4,
    dr_number=dr_number,
    output_format='csv'
)

# Display the retrieved CSV filename
print(f"Here's the file path for the generated CSV file: {file_path_to_csv}")

In [None]:
# Example Usage for FITS Format
file_path_to_fits = get_spectra_data(
    survey="eboss",
    run2d="v5_13_2",
    plateid=PLATE4,
    mjd=MJD,
    fiberid=FIBERID4,
    dr_number=17
)

# Display the retrieved FITS filename
print(f"Here's the file path for the generated FITS file: {file_path_to_fits}")

### Let's check the file we generated!

In [None]:
from astropy.io import fits

file_path = 'spec-7644-57327-0528.fits'
hdul = fits.open(file_path)
print(hdul)
hdul[1].data

### Wonderful! Our Data Retrieval worked and we now have a file to work with!

## Module: DataPreprocessing

Next, let's explore how our library preprocesses data! The `DataPreprocessing` module prepares spectral datasets, such as those obtained from the Sloan Digital Sky Survey (SDSS), for analysis and visualization by handling the essential preprocessing steps.

### Key Features

#### 1. Data Loading and Conversion

The module provides functionality for reading data from FITS or CSV files. It seamlessly handles the conversion of FITS data to a Pandas DataFrame, ensuring compatibility with popular data analysis tools.

#### 2. Outlier Removal

The `remove_outliers_column` method employs the IQR (Interquartile Range) method to identify and remove outliers from specified columns. This step contributes to the robustness of subsequent analyses by mitigating the impact of extreme values.
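
As a rough illustration of the IQR rule, the sketch below keeps only values inside `[Q1 - k*IQR, Q3 + k*IQR]`. The multiplier `k = 1.5` is the conventional choice and an assumption here, and `remove_outliers_iqr` is a hypothetical stand-alone function. The library's own implementation may differ in detail.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


flux = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
print(remove_outliers_iqr(flux))  # the extreme value 100.0 is dropped
```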

#### 3. Redshift Correction

The `correct_redshift` method corrects spectra for redshift. It uses the redshift value provided during initialization to shift the observed wavelengths to the rest frame. The correction is performed using the Astropy library.
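
The standard relation behind a redshift correction is `rest = observed / (1 + z)`. The sketch below applies it to a list of wavelengths. It covers only the wavelength axis, and `to_rest_frame` is a hypothetical helper, not the library's `correct_redshift` method.

```python
def to_rest_frame(observed_wavelengths, redshift):
    """Shift observed wavelengths to the rest frame: rest = observed / (1 + z)."""
    if redshift <= -1:
        raise ValueError("redshift must be greater than -1")
    return [w / (1.0 + redshift) for w in observed_wavelengths]


h_alpha_rest = 6562.8                    # H-alpha rest wavelength in angstroms
observed = [h_alpha_rest * (1 + 0.1)]    # the same line observed at z = 0.1
print(to_rest_frame(observed, 0.1))      # recovers a value close to 6562.8
```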

#### 4. Wavelength Alignment

The `wave_align` method aligns spectra wavelengths within a predefined range. This is particularly useful for ensuring consistency across different spectra datasets. Users can customize the target wavelength range to suit their specific analysis requirements.
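
One simple way to bring spectra onto a common range is to discard points outside it. The sketch below does exactly that for `(wavelength, flux)` pairs. It assumes a plain range filter, and `clip_to_wavelength_range` is a hypothetical helper; `wave_align` itself may do more, such as resampling.

```python
def clip_to_wavelength_range(rows, min_wavelength=3000, max_wavelength=8000):
    """Keep (wavelength, flux) pairs whose wavelength lies inside the target range."""
    return [(w, f) for w, f in rows if min_wavelength <= w <= max_wavelength]


spectrum = [(2900.0, 1.2), (3500.0, 1.5), (7999.0, 0.9), (8200.0, 0.7)]
print(clip_to_wavelength_range(spectrum))
```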

#### 5. Normalization

The `normalize_column` method facilitates the normalization of specified columns, enhancing the comparability of data across different scales. The normalization process ensures that each column's values have a consistent scale, preventing certain features from dominating the analysis due to differences in magnitude.
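
A common way to put a column on a consistent scale is z-score standardization: subtract the mean and divide by the standard deviation. The sketch below assumes that is the kind of normalization meant here, which is an assumption about `normalize_column`, and `zscore` is a hypothetical helper.

```python
import statistics


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, like numpy's default np.std
    if std == 0:
        raise ValueError("cannot normalize a constant column")
    return [(v - mean) / std for v in values]


normalized = zscore([2.0, 4.0, 6.0, 8.0])
print(statistics.fmean(normalized), statistics.pstdev(normalized))
```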

#### 6. Metadata Extraction

The `MetaDataExtractor` class allows users to extract essential metadata from FITS or CSV files. This includes identifiers, coordinates, and redshifts, providing a comprehensive overview of the astronomical data. The extracted metadata can be crucial for understanding and categorizing astronomical observations.

### To show how our preprocessing module works, we can preprocess the file we received from data retrieval.

### Step 1: Load and Read data

In [None]:
file_path = './spec-7644-57327-0528.csv'

In [None]:
data_processor = DataPreprocessing(file_path=file_path, min_target_wavelength=3000, max_target_wavelength=8000)

In [None]:
# Call the read_data method to read and load the data
data_processor.read_data(file_path)

# Now, user can access the loaded data in the DataFrame attribute (df)
loaded_data = data_processor.df
# Let's print to see our output!
print(loaded_data)


### Step 2: Remove Outliers

In [None]:
data_processor.remove_outliers_column('Flux')

In [None]:
# Let's print out our results to check that values outside our bounds have been removed
print(data_processor.df)

#### All the values we see for flux are within the bounds so we have successfully removed outliers!

### Step 3: Wavelength Alignment


In [None]:
# We set our range for Wavelength from 3000 to 8000 so everything outside those bounds should be removed
# Let's call wave_align in order to do this
data_processor.wave_align()

In [None]:
# Let's print our dataframe to check that this call was successful
print(data_processor.df)

### All Wavelength values fall between 3000 and 8000, so we have successfully aligned wavelengths within our predefined range!

### Step 4: Normalization

In [None]:
import numpy as np

# Let's normalize the flux column
data_processor.normalize_column('Flux')

# Let's see if this works by calculating the new mean and standard deviation

# Calculate the new mean and standard deviation after normalization
new_mean = np.mean(data_processor.df['Flux'])
new_std = np.std(data_processor.df['Flux'])

# Display the new mean and standard deviation
print(f"New Mean: {new_mean}")
print(f"New Standard Deviation: {new_std}")

### As we can see, our mean is very close to 0 and our standard deviation is exactly 1, so we have successfully normalized the Flux!

### Step 5: Metadata Extraction

In [None]:
# Let's work with a fits file we know has metadata
FILE_PATH = 'dr18webexample.fits'

# Call MetaDataExtractor
extractor = MetaDataExtractor(FILE_PATH)

# Extract metadata: For example coordinates 
extractor.get_coordinates()

In [None]:
extractor.get_identifiers()

## Module: Data Visualization

Next, we'll focus on the `plot` function, a tool for visualizing spectral data. It plots a spectrum together with its inferred continuum so that the main features of the spectrum are easy to see.

#### Key Features:

1. **Spectral Visualization:**
   - Create a visually appealing plot that displays both the spectrum and its inferred continuum.

2. **Smoothing:**
   - Control the level of smoothing applied to the inferred continuum with the `smoothing_window_size` parameter.

3. **Customization:**
   - Easily customize the appearance of the plot to suit your preferences and presentation needs.

#### Parameters:

- `data`: Pandas DataFrame containing spectral data with 'Wavelength' and 'flux' columns.
- `window_size`: Optional parameter to control the smoothing of the inferred continuum.
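
To give a feel for what a smoothing window does, here is a minimal centered moving average over a flux list. This is only one possible way to estimate a smooth continuum. The function `moving_average` is hypothetical, and the method actually used by the plotting function is not specified in this tutorial.

```python
def moving_average(flux, window_size=5):
    """Centered moving average; the window shrinks at the edges of the spectrum."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    half = window_size // 2  # intended for odd window sizes
    smoothed = []
    for i in range(len(flux)):
        window = flux[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


flux = [1.0, 1.0, 10.0, 1.0, 1.0]
print(moving_average(flux, window_size=3))
```

A wider window gives a smoother curve that follows narrow emission or absorption lines less closely.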

Now, let's explore the power of visualization with the `plot` function in the next steps of this tutorial.

In [None]:
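# Let's use DataPreprocessing to get a pandas DataFrame
data_processor = DataPreprocessing(file_path='spec-7644-57327-0528.csv', min_target_wavelength=3000, max_target_wavelength=8000)
# Load the file, as in the preprocessing section, so that data_processor.df is populated
data_processor.read_data('spec-7644-57327-0528.csv')
# Assign that DataFrame to data
data = data_processor.df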
data.rename(columns={'Flux': 'flux'}, inplace=True)
# Plot
plot(data)

## What a visual, Beautiful!!

## Module: Machine Learning

We will close with our Machine Learning module, which is capable of distinguishing between Stars, Galaxies, and QSOs.

In [None]:
import numpy as np
query_handler = QueryHandler(dataset_name = 'SDSS')

# Both queries use the same ORDER BY so that feature rows and class labels stay aligned
query_x = """
SELECT TOP 10
    p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
    p.run, p.rerun, p.camcol, p.field,
    s.specobjid, s.z AS redshift,
    s.plate, s.mjd, s.fiberid
FROM
    PhotoObj AS p
JOIN
    SpecObj AS s ON s.bestobjid = p.objid
WHERE
    p.u BETWEEN 0 AND 19.6
    AND p.g BETWEEN 0 AND 20
ORDER BY
    p.objid
"""
query_y = """
SELECT TOP 10
    s.class
FROM
    PhotoObj AS p
JOIN
    SpecObj AS s ON s.bestobjid = p.objid
WHERE
    p.u BETWEEN 0 AND 19.6
    AND p.g BETWEEN 0 AND 20
ORDER BY
    p.objid
"""
# Run the query using QueryHandler
query_id_x = query_handler.run_query(query_x)
# Get the results table
results_table_x = query_handler.get_results(query_id_x)



In [None]:
# Run the query using QueryHandler
query_id_y = query_handler.run_query(query_y)
# Get the results table
results_table_y = query_handler.get_results(query_id_y)

### fit: trains the model

In [None]:
col_A = [results_table_x[col] for col in results_table_x.columns]
matrix_A = np.column_stack(col_A)

col_b = [results_table_y[col] for col in results_table_y.columns]
matrix_b = np.column_stack(col_b)


In [None]:
model = MachineLearning()
fitting = model.fit(matrix_A, matrix_b)

### predict: predicts the class of astronomical objects

In [None]:
prediction = model.predict(matrix_A)
prediction

### predict_proba: predicts class probabilities

In [None]:
model.predict_proba(matrix_A)


### report_confusion_matrix: evaluates the classifier
This method summarizes how well the classifier performed by counting, for each true class, how many objects were assigned to each predicted class. Correct classifications and misclassifications can be read off directly, which helps identify the classes the model confuses and where its accuracy can be improved.

In [None]:
model.report_confusion_matrix(matrix_b, prediction)
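
To see what a confusion matrix contains, the sketch below builds one by hand for three classes. The labels and predictions are made-up example values, and `confusion_matrix` is a hypothetical stand-alone function, not the library's implementation.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["GALAXY", "QSO", "STAR"]
y_true = ["GALAXY", "GALAXY", "QSO", "STAR", "STAR"]   # made-up example labels
y_pred = ["GALAXY", "QSO", "QSO", "STAR", "GALAXY"]    # made-up predictions
for label, row in zip(labels, confusion_matrix(y_true, y_pred, labels)):
    print(f"{label:>6}: {row}")
```

Entries on the diagonal count correct classifications, and everything off the diagonal is a misclassification.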

# Enjoy the Library! ^-^