In [None]:
# Run this cell
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext jupyter_ai_magics
from datascience import *
import numpy as np
import math

# CMPSC 5A Midterm — Exoplanets

In this project, you will apply the concepts you have learned in the class so far. This includes table manipulation (all table functions), iteration (for loops), conditionals (if statements), data cleaning, etc.

## Names
Please list all students who worked on this project.
1. Student Name 1
2. Student Name 2
3. Student Name 3

## Member Responsibilities
To make sure every member of your team contributes fairly, you should split up responsibilities in writing before starting. This should be a couple of sentences. Here are two examples of what that could look like for a hypothetical team of 2 (Alice and Bob):
- “Alice will work on Q1-5 and Bob will do Q6-10. Each of us will make the slides that correspond to the problems we did.” 
- “We will meet in person to do the whole assignment as a team, switching up who is typing. Whoever is not currently typing will still help by checking documentation pages for the typist and advising the typist as they write code. Alice will make the first half of the slides, Bob will make the second half.” 

As these examples show, everyone should contribute to the notebook and presentation roughly equally.

After presentations are complete, we will send out a survey where you say what you ended up doing and whether your teammate(s) held up their end of the bargain. Your self-evaluation and your teammates' evaluations of you are not directly part of the rubric.

This does not mean you can't help each other!!! It’s a group project. Please please please reach out to your teammate(s) if you’re struggling with questions that are assigned to you!! The survey evaluation is not a competition. We do not care if someone thinks they did 55% of the work and their partner did 45%. We care a lot if someone is not doing their work, not communicating with their team, and generally not contributing in good faith.

*Write your team member responsibility distribution here*

## Logistics

**Deadline:** The midterm project notebook is due Thursday, February 13th, 2025 at 11:59pm PT. The midterm project presentation slides are due by 10:00am PT on Friday, February 14th. Unlike labs, **no late submissions are allowed**.

**Submission:** For full credit, you must complete all the questions and submit to Gradescope. You may still change your answers before the project deadline - only your final submission will be graded for correctness. Only one partner needs to submit the notebook to Gradescope, and they will need to add the other two group members as members on Gradescope. See [How to Add Group Member in Gradescope](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members). **After they submit, all group members should open their Gradescope accounts and see that a submission has been processed. Be sure that the final notebook has all the ChatGPT prompts used by all the team members**.

**Presentation:** Your group will need to create a presentation slide deck and give a 6-8 minute oral presentation during your assigned time slot. Presentations must use slides. Your group is not allowed to scroll through your notebook during the presentation. All slides must be uploaded to https://drive.google.com/drive/folders/19zNx7JWtwSde3cDNeCT3gLOdBJ7Kuset?usp=drive_link.

**IMPORTANT NOTES:** 
- You are not limited to just one solution code cell, one prompt cell and one workflow cell for each question. Use as many of each as you like to ensure your notebook is presented well, easy-to-read, and has all the required plots and intermediate tables visible to show how you deduced that answer.
- None of the questions are created in a way that will allow you to just give a one line answer. Remember, if your answer is just a one line answer, you are probably missing something.
- Every group's answers may be different based on the approach you take for data analysis. Others may have presented their results with a graph that is different from yours, or filtered the table in a different way. That does not mean yours is wrong. We are looking for diversity in how information is displayed, and there is more than one correct answer for each question.

**Partners:** You will work with two other partners (total three in a group); your partners can be from any lab section. 

**Rules:** Don't share your code with anybody but your partners. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem. Since the problems are open-ended, they can have various different answers. What is important is the approach you take to solve your task.

**Support:** You are not alone! Come to office hours, post on Ed, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Ed post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, email your TA or ULA.

**Advice:** Develop your answers incrementally. To perform a complicated table manipulation, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. Make sure that you are using distinct and meaningful variable names throughout the notebook. Along that line, **DO NOT** reuse the variable names.

You **never** have to use just one line in this project or any others. Use intermediate variables and multiple lines as much as you would like!
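
The incremental style described above can be sketched on a pair of small arrays. This is only an illustration: the values are made up (not taken from the dataset), and names such as `is_large` and `large_radii` are hypothetical.

```python
import numpy as np

# Made-up values for illustration only (not taken from the dataset)
radii = np.array([1.0, 11.2, 2.4, 0.8, 13.5])      # planet radii, in Earth radii
years = np.array([2009, 2011, 2014, 2016, 2016])   # discovery years

# Step 1: build the condition on its own line and inspect it
is_large = radii > 4
print(is_large)

# Step 2: filter with the condition, giving each result a new name
large_radii = radii[is_large]
large_years = years[is_large]
print(large_radii, large_years)

# Step 3: summarize the filtered values
mean_large_radius = np.mean(large_radii)
print(mean_large_radius)
```

Printing after each step makes it easy to see which step went wrong if the final number looks off. The same habit carries over to table operations: one operation per line, one new name per result.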

## NASA Exoplanet Science Institute's Planetary Systems Composite Planet Dataset

The Planetary Systems Composite Parameters Planet Data table is an extensive collection of data regarding known [Confirmed Exoplanets](https://exoplanetarchive.ipac.caltech.edu/docs/exoplanet_criteria.html). It includes a range of parameters related to the planetary systems, stars, and the planets themselves. The primary goal of this table is to offer a comprehensive statistical perspective on the population of known exoplanets and their respective hosting environments.

Managed by the NASA Exoplanet Archive, this resource compiles a variety of parameters that have been documented in peer-reviewed scientific literature. The Planetary Systems Composite Parameters table presents this data in a unified format, where each exoplanet's information is consolidated into a single row. This row encompasses a more complete set of parameters, incorporating details about the planet, its star, and the overall system, and the values within one row may be drawn from different published references.

The NASA Exoplanet Archive has adopted a policy of including and classifying all objects as planetary that meet the following criteria:

- The mass (or minimum mass) is equal to or less than 30 Jupiter masses.
- The planet is not free floating.
- Sufficient follow-up observations and validation have been undertaken to deem the possibility of the object being a false positive as unlikely.
- The above information, along with further orbital and/or physical properties, are available in peer-reviewed publications.
- The results must be peer reviewed and be accepted for publication in the astrophysical literature.


In [None]:
# Read the dataset
planets_uncleaned = Table.read_table("data/PSCompPars.csv")

# Display the first few rows of the table
planets_uncleaned.show(15)

**Question 1.** Real-world datasets do not always have complete information! If you look at some of the rows above, you will find `nan` (Not a Number) entries. `nan` values are **NOT** strings, and they are not `None` either: `nan` is a special floating-point value (type `float`). These missing values can cause issues if we try to use their rows in any kind of arithmetic. Let's do some data cleaning first and then understand what each of the columns means.

There are many ways to work with missing or skewed data. For now, we will just choose to delete any rows that have missing information. Name the final table something meaningful, like `planets`. This will be the table we use to answer all the other questions.

Hint: One way you can do this is by writing a function and then using the apply() function.

Hint: In Python, `nan` is equal to *nothing*, not even itself. So, to check if a variable x is `nan`, you can test x != x.
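
The self-comparison trick from the hint can be checked on plain values before you use it on a table. The sketch below is only an illustration: `is_missing` and the sample values are hypothetical and are not part of the dataset or the `datascience` library.

```python
import math

def is_missing(x):
    """Return True when x is nan, using the fact that nan != nan."""
    return x != x

# Made-up values for illustration only
sample_values = [1.5, float("nan"), 0.0, "Transit"]
missing_flags = [is_missing(v) for v in sample_values]
print(missing_flags)  # [False, True, False, False]

# math.isnan agrees for floats, but it raises a TypeError on strings,
# so the x != x test is the safer choice for columns of mixed types
print(math.isnan(float("nan")))  # True
```

Note that the test works on strings as well as numbers, which matters because a missing entry can appear in a text column too.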

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

Check that all the rows that had `nan` are now deleted.

We now have the final dataset that we will work with.

The column names don't make too much sense right now. We should relabel them before we start working on the data for it to make more sense for us and anyone looking through our data analysis. Given below is a list of what the existing column names correspond to:

- pl_name: Planet Name
- sy_snum: Number of Stars
- sy_pnum: Number of Planets
- discoverymethod: Discovery Method
- disc_year: Discovery Year
- disc_facility: Discovery Facility
- pl_orbper: Orbital Period (days)
- pl_orbsmax: Orbit Semi-Major Axis
- pl_rade: Planet Radius (Earth Radius)
- pl_bmasse: Planet Mass (Earth Mass)
- st_age: Stellar Age (gigayear)
- sy_dist: Distance (parsec)

You should also read up on what each of these columns means and in what units it is measured: https://exoplanetarchive.ipac.caltech.edu/docs/API_PS_columns.html

**UNCOMMENT AND COMPLETE THE CODE WRITTEN BELOW. ADD YOUR TABLE NAME IN PLACE OF THE <>.**

In [None]:
#<TABLE_NAME> = <TABLE_NAME>.relabeled('pl_name', 'Planet Name').relabeled('sy_snum', 'Number of Stars').relabeled('sy_pnum', 'Number of Planets').relabeled('discoverymethod', 'Discovery Method').relabeled('disc_year', 'Discovery Year').relabeled('disc_facility', 'Discovery Facility').relabeled('pl_orbper', 'Orbital Period (days)').relabeled('pl_orbsmax', 'Orbit Semi-Major Axis').relabeled('pl_rade', 'Planet Radius (Earth Radius)').relabeled('pl_bmasse', 'Planet Mass (Earth Mass)').relabeled('st_age', 'Stellar Age (gigayear)').relabeled('sy_dist', 'Distance (parsec)')
#<TABLE_NAME>

We are now ready to perform some Exploratory Data Analysis (EDA)!

All the questions are open-ended and you have the freedom to present as much information as you think will make a case for your arguments. You are also free to add columns if it will aid in presenting your results. Try to be as clear as possible about any correlations you find. Use as many markdown or code cells as you need to explain your analysis in depth.

**Question 2.** Explore the relationship between the Planet Mass and the Orbit Semi-Major Axis. Is there a correlation between the distance of a planet from its star and its mass? Think about what kind of plot you can make to show a correlation between two columns. What do you infer from this plot?

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 3.** How has the discovery of planets evolved over time? Are there trends in the number of planets discovered each year, the methods used, or the types of planets discovered (e.g., comparing the radius or mass)?

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 4.** Analyze the distribution of the ages of stars in the dataset. Are most of the stars young, old, or is there a uniform distribution across different ages?

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 5.** Plot a line graph that shows how the average or median distance of discovered planets from Earth (`sy_dist`, relabeled `Distance (parsec)`) has changed over the years (`disc_year`, relabeled `Discovery Year`). Are we finding more distant planets as time goes on?

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 6.** Investigate the occurrence rate of multi-planet systems (planetary systems with more than one planet) in the context of stellar age. Are younger or older stars more likely to host multiple planets?

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 7.** Compare the discovery methods used for the planets in the dataset. Which methods have been the most successful in terms of number of planets discovered? Are certain types of planets (e.g., Planet Radius or Planet Mass) more likely to be discovered by one method over another?

Hint: To answer this question, consider placing the values of your choice (for example: Planet Radius) into buckets/bins with a given range. How would you make these buckets? Take the minimum and maximum values of the column and divide that range into equal-width buckets. Now that you have these buckets, what type of visualization would be appropriate? How many visualizations would be appropriate to make your case? As many as the number of unique values in 'Discovery Method'.

You are free to use anything apart from Planet Radius as well, if you think there could be some correlation between that column and Discovery Method.
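
The bucketing idea in the hint can be sketched with NumPy. This is a minimal illustration under stated assumptions: the radii are made up (not taken from the dataset), and the choice of 5 buckets is arbitrary.

```python
import numpy as np

# Made-up radii (in Earth radii) for illustration only
radii = np.array([0.5, 1.1, 1.9, 2.3, 3.7, 4.0, 8.8, 10.5])

num_bins = 5  # arbitrary choice; try several values
# num_bins equal-width buckets need num_bins + 1 edges between min and max
bin_edges = np.linspace(radii.min(), radii.max(), num_bins + 1)

# Count how many values fall into each bucket
counts, edges = np.histogram(radii, bins=bin_edges)
print(bin_edges)
print(counts)
```

The same array of edges can then be reused for every discovery method, so that all of your histograms share identical buckets and can be compared side by side. Real planet radii are likely to be much more spread out than this toy array, so you may need to experiment with the number of buckets or the range you cover.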

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 8.** Examine if there is any geographical pattern to where planets are discovered based on the location of the discovery facility. Is there a predominance of discoveries in certain parts of the world, and if so, what might be contributing factors (e.g., technological advancement, number of observatories, geographical reasons)? Note: You will need to look up the locations of the top observatories.

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

## Novel analysis
For the last section of your midterm project, you are expected to think like a data scientist. Formulate **TWO** unique questions/problems/insights that you can solve or obtain from the dataset given to you. **You need to come up with the problem and present the solution in the notebook**. The problems you come up with will be graded on novelty and the work necessary to obtain them. **DO NOT** reuse problems that are already covered in the midterm or those that could be solved or visualized with a single line of code.

**Question 9. *Type your question here***

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

**Question 10. *Type your question here***

In [None]:
# SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your workflow below:

*type your answer here*

### Full Submission (due Thursday, February 13 at 11:59pm PT)

To submit:
- **save the notebook** first (press the icon on the top left)
- go up to the `Kernel` menu and select `Restart & Clear Output` (make sure the notebook is saved first, because otherwise, you will lose all your work!). 
- go to `Cell -> Run All`. Carefully look through your notebook and verify that all computations execute correctly. You should see **no errors**; if there are any errors, make sure to correct them before you submit the notebook.
- go to `File -> Download as -> Notebook` and download the notebook to your own computer. ([Please verify](https://ucsb-ds.github.io/ds1-f20/troubleshooting/#i-downloaded-the-notebook-file-but-it-saves-as-the-ipynbjson-extension-so-whenever-i-upload-it-to-gradescope-it-fails) that it got saved as an .ipynb file.)
- Upload the notebook to [Gradescope](https://www.gradescope.com/). You can drag and drop the file too.
- One submission per group. After you submit your notebook, you can Add Group Member. [How to Add Group Member in Gradescope](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members)