In [None]:
# Run this cell once when starting the notebook.
# On Google Colab, wait for the Google Drive permission prompts before proceeding
import matplotlib.pyplot as plt
%matplotlib inline
import os
import sys
try:
    %load_ext jupyter_ai_magics
except:
    print("%%ai cells will not work in this notebook")
    print("Please use Gemini for AI queries instead")
from datascience import *
import numpy as np
import math

DATA_FILENAME="data/PSCompPars.csv"
try:
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)
  !mkdir -p /content/data
  !gdown --fuzzy https://drive.google.com/file/d/1oiOoEkLxWiTb8jnighU2PQ6I-IWGV-Ch/view?usp=drive_link -O /content/data/PSCompPars.csv
  DATA_FILENAME = "/content/data/PSCompPars.csv"  # path of the file downloaded by gdown above
  !ls -l /content/data
except:
  print("Google Drive not mounted; this is normal on Jupyter Hub")


# CS5A S25 Midterm Project: Exoplanets

* Please refer to the general instructions in [this document](https://docs.google.com/document/d/1gYMuXukOTtJkEthxUeHmWuA1Qn9BtIYkSJa37Gb524E/edit?usp=sharing) before starting.
* You may work on either JupyterHub or Google Colab
* The Google Colab version is in [this folder](https://drive.google.com/drive/folders/1r6v3k0fO5rGgLqPiDkfA71TB_hWkjKpK?usp=drive_link); you should make a copy of the notebook in your [group folder for the midterm team project](https://docs.google.com/spreadsheets/d/1oLPWP0j1jvx7eLq7kcQc15PiSDUwjlnZAolUVSHWBOQ/edit?usp=sharing) before starting to make edits.






## Names

Please list all students who were members of this team.

1. Student Name 1
2. Student Name 2
3. Student Name 3
4. Student Name 4

## Member Responsibilities


*Write your team member responsibility distribution here* (See [instructions](https://docs.google.com/document/d/1gYMuXukOTtJkEthxUeHmWuA1Qn9BtIYkSJa37Gb524E/edit?tab=t.0#bookmark=id.igbdergm85kj))

## NASA Exoplanet Science Institute's Planetary Systems Composite Planet Dataset

The Planetary Systems Composite Parameters Planet Data table is an extensive collection of data regarding known [Confirmed Exoplanets](https://exoplanetarchive.ipac.caltech.edu/docs/exoplanet_criteria.html). It includes a range of parameters related to the planetary systems, stars, and the planets themselves. The primary goal of this table is to offer a comprehensive statistical perspective on the population of known exoplanets and their respective hosting environments.

Managed by the NASA Exoplanet Archive, this resource compiles a variety of parameters that have been documented in peer-reviewed scientific literature. The Composite Parameters table presents this data in a unified format, where each exoplanet's information is consolidated into a single row. This row holds a complete set of parameters covering the planet, its star, and the overall system, and the values in one row may be drawn from several different references.

The NASA Exoplanet Archive has adopted a policy of including and classifying all objects as planetary that meet the following criteria:

- The mass (or minimum mass) is equal to or less than 30 Jupiter masses.
- The planet is not free floating.
- Sufficient follow-up observations and validation have been undertaken to deem the possibility of the object being a false positive as unlikely.
- The above information, along with further orbital and/or physical properties, is available in peer-reviewed publications.
- The results must be peer reviewed and be accepted for publication in the astrophysical literature.


In [None]:
# Read the dataset
planets_uncleaned = Table.read_table(DATA_FILENAME)

# Display the first few rows of the table
planets_uncleaned.show(15)

**Question 1.** Real-world datasets do not always have complete information! If you look at the rows above, you will find some `nan` (Not a Number) entries. `nan` values are **NOT** strings, and they are not `None` either: `nan` is a special floating-point value (of type `float`). These missing values can cause problems if we use their rows in any kind of arithmetic. Let's do some data cleaning first and then understand what each of the columns means.

There are many ways to work with missing or skewed data. For now, we will simply delete any rows that have missing information. Give the final table a meaningful name, such as `planets`. This will be the table we use to answer all the other questions.

Hint: One way you can do this is by writing a function and then using the apply() function.

Hint: In Python, `nan` is equal to *nothing*, not even itself. So, to check if a variable `x` is `nan`, you can test `x != x`.
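The self-inequality trick from the hint can be tried on plain Python values before applying it to a table. The sketch below is only an illustration: the helper name `is_missing` and the sample list are made up for this example and are not part of the dataset or of the `datascience` library.

```python
import math

def is_missing(x):
    """Return True when x is nan, relying on the fact that nan != nan."""
    return x != x

# Made-up sample values: a number, a nan, a string, and zero.
values = [1.5, math.nan, "Transit", 0.0]

# Only the nan entry compares unequal to itself.
flags = [is_missing(v) for v in values]
print(flags)  # [False, True, False, False]
```

Note that ordinary strings and the number `0.0` are not flagged, because each of them is equal to itself.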

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

Check that all the rows that had `nan` are now deleted.

We now have the final dataset that we will work with.

The column names are not very descriptive right now. We should relabel them before we start working with the data so that the analysis is easier to follow for us and for anyone reading it. The list below shows what each existing column name corresponds to:

- pl_name: Planet Name
- sy_snum: Number of Stars
- sy_pnum: Number of Planets
- discoverymethod: Discovery Method
- disc_year: Discovery Year
- disc_facility: Discovery Facility
- pl_orbper: Orbital Period (days)
- pl_orbsmax: Orbit Semi-Major Axis
- pl_rade: Planet Radius (Earth Radius)
- pl_bmasse: Planet Mass (Earth Mass)
- st_age: Stellar Age (gigayear)
- sy_dist: Distance (parsec)

You should also read up on what each of these columns means and in what units it is measured: https://exoplanetarchive.ipac.caltech.edu/docs/API_PS_columns.html

**UNCOMMENT AND COMPLETE THE CODE WRITTEN BELOW. ADD YOUR TABLE NAME IN PLACE OF THE <>.**

In [None]:
#<TABLE_NAME> = <TABLE_NAME>.relabeled('pl_name', 'Planet Name').relabeled('sy_snum', 'Number of Stars').relabeled('sy_pnum', 'Number of Planets').relabeled('discoverymethod', 'Discovery Method').relabeled('disc_year', 'Discovery Year').relabeled('disc_facility', 'Discovery Facility').relabeled('pl_orbper', 'Orbital Period (days)').relabeled('pl_orbsmax', 'Orbit Semi-Major Axis').relabeled('pl_rade', 'Planet Radius (Earth Radius)').relabeled('pl_bmasse', 'Planet Mass (Earth Mass)').relabeled('st_age', 'Stellar Age (gigayear)').relabeled('sy_dist', 'Distance (parsec)')
#<TABLE_NAME>

We are now ready to perform some Exploratory Data Analysis (EDA)!

All the questions are open-ended, and you are free to present as much information as you think will support your arguments. You are also free to add columns if that helps in presenting your results. Be as clear as possible about any correlations you identify. Use as many markdown or code cells as you need to explain your analysis in depth.

**Question 2.** Explore the relationship between the Planet Mass and its Orbital Semi-Major Axis. Is there a correlation between the distance of a planet from its star and its mass? Think about what kinds of plots can show a correlation between two columns. What do you infer from your plot?
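As a rough illustration of how a correlation between two columns can be quantified, the sketch below computes the Pearson correlation coefficient with NumPy. The arrays are made-up numbers, not values from the dataset, and the helper name `correlation` is hypothetical. Comparing the raw values with their base-10 logarithms is an assumption of this sketch, motivated by the fact that planet masses and orbital distances span several orders of magnitude; it is not a required approach.

```python
import numpy as np

# Made-up values for illustration only; these are NOT taken from the dataset.
axis_au = np.array([0.05, 0.1, 1.0, 5.0, 30.0])
mass_earth = np.array([300.0, 8.0, 1.0, 320.0, 17.0])

def correlation(x, y):
    """Pearson correlation coefficient r between two equal-length arrays."""
    return np.corrcoef(x, y)[0, 1]

# r on the raw values and on log-scaled values can differ noticeably.
r_raw = correlation(axis_au, mass_earth)
r_log = correlation(np.log10(axis_au), np.log10(mass_earth))
print(r_raw, r_log)
```

A value of r near 0 suggests little linear association, while values near 1 or -1 suggest a strong one; a scatter plot remains the best way to check what the number is actually summarizing.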

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 3.** How has the discovery of planets evolved over time? Are there trends in the number of planets discovered each year, the methods used, or the types of planets discovered (e.g., comparing the radius or mass)?

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 4.** Analyze the distribution of the ages of stars in the dataset. Are most of the stars young, old, or is there a uniform distribution across different ages?

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 5.** Plot a line graph that shows how the average or median distance of discovered planets from Earth (sy_dist) has changed over the years (disc_year). Are we finding more distant planets as time goes on?
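The idea of summarizing one column per value of another (here, one median per year) can be sketched on toy data as follows. The numbers are invented for illustration and the variable names are hypothetical; this is a minimal NumPy sketch of the grouping idea, not the expected solution.

```python
import numpy as np

# Invented toy data: a discovery year and a distance (parsec) per planet.
years = np.array([2010, 2010, 2011, 2011, 2011, 2012])
dist_pc = np.array([10.0, 30.0, 50.0, 70.0, 90.0, 200.0])

# One median per distinct year, computed from the rows that share that year.
unique_years = np.unique(years)
median_by_year = np.array(
    [np.median(dist_pc[years == y]) for y in unique_years]
)

for y, m in zip(unique_years, median_by_year):
    print(y, m)
```

The median is less sensitive than the mean to a few very distant planets, which is one reason to consider both summaries.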

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 6.** Investigate the occurrence rate of multi-planet systems (planetary systems with more than one planet) in the context of stellar age. Are younger or older stars more likely to host multiple planets?

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 7.** Compare the discovery methods used for the planets in the dataset. Which methods have been the most successful in terms of number of planets discovered? Are certain types of planets (e.g., Planet Radius or Planet Mass) more likely to be discovered by one method over another?

Hint: To answer this question, consider placing the values of your chosen column (for example, Planet Radius) into buckets/bins, each covering a given range. How would you make these buckets? One option is to take the minimum and maximum values of the column and divide that range into equal-width buckets. Once you have these buckets, what type of visualization would be appropriate? How many visualizations would you need to make your case? As many as the number of unique values in 'Discovery Method'.

You are free to use anything apart from Planet Radius as well, if you think there could be some correlation between that column and Discovery Method.
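The equal-width bucketing described in the hint can be sketched with NumPy as shown below. The radius values and the choice of four bins are made up for this example; the right number of bins for the real data is a judgment call.

```python
import numpy as np

# Made-up planet radii (Earth radii) for illustration only.
radii = np.array([0.8, 1.1, 2.4, 3.9, 11.2, 13.0])
num_bins = 4  # arbitrary choice for this sketch

# Equal-width bin edges from the minimum to the maximum value.
edges = np.linspace(radii.min(), radii.max(), num_bins + 1)

# Count how many values fall into each bin; the last bin includes its right edge.
counts, _ = np.histogram(radii, bins=edges)
print(edges)
print(counts)
```

The same array of edges can be passed as the `bins` argument when drawing a histogram, so that every plot uses identical buckets and the plots are directly comparable.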

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 8.** Examine if there is any geographical pattern to where planets are discovered based on the location of the discovery facility. Is there a predominance of discoveries in certain parts of the world, and if so, what might be contributing factors (e.g., technological advancement, number of observatories, geographical reasons)? Note: You will need to look up the locations of the top observatories.

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

## Novel analysis
For the last section of your midterm project, you are expected to think like a data scientist. Formulate **TWO** unique questions/problems/insights that you can solve or obtain from the dataset given to you. **You need to come up with the problem and present the solution in the notebook**. The problems you come up with will be graded on novelty and the work necessary to obtain them. **DO NOT** reuse problems that are already covered in the midterm or those that could be solved or visualized with a single line of code.

**Question 9. *Type your question here***

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 10. *Type your question here***

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

# Deadline and Submission

Refer to these links:

* [Deadlines](https://docs.google.com/document/d/1gYMuXukOTtJkEthxUeHmWuA1Qn9BtIYkSJa37Gb524E/edit?tab=t.0#bookmark=id.fqop8bpqcnvf)
* [Submission Instructions](https://docs.google.com/document/d/1gYMuXukOTtJkEthxUeHmWuA1Qn9BtIYkSJa37Gb524E/edit?tab=t.0#bookmark=id.vsf44v7st0t)
