# Project Proposal

# *insert title*

## Introduction

### Background Information and Question
When a supergiant star (a highly luminous type of star) dies and collapses in an event called a supernova, the protons and electrons in its core merge to form neutrons, ultimately producing a neutron star. Pulsars, a particular type of neutron star, emit two beams of electromagnetic radiation from their two magnetic poles and rotate quickly, so that the beams sweep past Earth and their radio emission patterns can be detected.

Pulsars are important tools for astronomers. Because a pulsar's radio emissions are consistent in time, astronomers can estimate the distances of various cosmic objects using the distance of the pulsar and the time it takes the radio waves to reach Earth. Additionally, since pulsars are a type of neutron star, astronomers study their interiors, especially the obscure state of matter that every neutron star contains.

Every pulsar produces a unique radio emission pattern that changes slightly with each rotation. Possible pulsar signals, known as "candidates", are therefore averaged over many rotations in order to determine whether the signal comes from a real pulsar. Unfortunately, many false signals are picked up from unwanted radio frequency interference and other noise, which makes pulsars difficult to identify.




### Dataset used
The dataset that will be used is the Pulsar Star dataset, which can be found [here](https://archive.ics.uci.edu/ml/datasets/HTRU2#). This dataset describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey, an all-sky survey for pulsars and short-duration radio transients.

A candidate is a potential signal detection that may describe a real pulsar, and each row of this dataset is a candidate observation. Each column is a variable, and there are 9 variables. The first 8 variables are the mean, standard deviation, excess kurtosis, and skewness of the integrated pulse profile and of the DM-SNR curve.
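As a rough illustration of these four statistics, the sketch below computes them for an arbitrary numeric vector in base R. The function name `profile_stats` and the toy vector are hypothetical, and the moment-based formulas for skewness and excess kurtosis are one common convention. Whether the dataset was built with exactly these estimators is an assumption.

```r
# Hypothetical helper: summarise a numeric curve with the four statistics
# used as predictors (mean, standard deviation, excess kurtosis, skewness).
profile_stats <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)  # second central moment (population variance)
  m3 <- mean((x - m)^3)  # third central moment
  m4 <- mean((x - m)^4)  # fourth central moment
  c(
    mean     = m,
    std      = sd(x),          # sample standard deviation
    kurtosis = m4 / m2^2 - 3,  # excess kurtosis: 0 for a normal distribution
    skewness = m3 / m2^1.5     # 0 for a symmetric distribution
  )
}

toy_curve <- c(1, 2, 3, 4, 5)  # made-up values, not taken from the dataset
profile_stats(toy_curve)
```

Applied once to a candidate's integrated pulse profile and once to its DM-SNR curve, a helper like this would yield eight numbers, matching the eight predictor columns.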

Since pulsars are weak radio sources, many individual pulses must be summed to produce a signal that is distinguishable from noise, so that the pulsar can be detected. The signal resulting from this sum of pulses is called the integrated pulse profile, and it is similar to a pulsar's "fingerprint".
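The effect of combining many pulses can be sketched with a small simulation in base R. Everything here is assumed for illustration: the pulse shape, the number of phase bins, the number of rotations, and the noise level are made up and do not come from the survey. The sketch averages rather than sums the pulses, which only rescales the profile.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable

n_bins   <- 64   # phase bins per rotation (assumed)
n_pulses <- 500  # number of rotations to combine (assumed)

# Hypothetical pulse shape: a narrow Gaussian bump centred in the rotation.
phase      <- seq(0, 1, length.out = n_bins)
true_pulse <- 0.5 * exp(-((phase - 0.5)^2) / (2 * 0.02^2))

# Each row is one rotation: the weak pulse buried in unit-variance noise.
single_pulses <- matrix(
  rnorm(n_pulses * n_bins, mean = 0, sd = 1),
  nrow = n_pulses, ncol = n_bins
)
single_pulses <- sweep(single_pulses, 2, true_pulse, "+")

# Averaging across rotations gives the integrated pulse profile.
integrated_profile <- colMeans(single_pulses)

# The noise shrinks roughly as 1 / sqrt(n_pulses), so the peak stands out.
which.max(integrated_profile)  # expected to fall near the centre bin
```

In a single simulated rotation the pulse height (0.5) is below the noise level (1), while in the averaged profile the noise level drops to about 0.045, which is why the pulse becomes visible.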

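The DM-SNR curve relates two quantities. The signal-to-noise ratio (SNR) compares the level of a desired signal with the level of background noise. The dispersion measure (DM) quantifies the frequency-dependent delay that free electrons along the line of sight impose on the radio signal. The curve shows how a candidate's SNR varies with the trial DM.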

The last variable is the class label. It is 1 if the candidate is a real pulsar and 0 otherwise.

## Preliminary Data Analysis

In [15]:
set.seed(2000)
library(tidyverse)
library(tidymodels)
library(repr)
options(repr.matrix.max.rows = 6)


In [2]:
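# Download the zipped dataset to a temporary file and read the CSV inside it.
temp <- tempfile()
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip", temp)
pulsar_file <- unz(temp, "HTRU_2.csv")
# The file has no header row, so columns are named X1..X9 until renamed.
pulsar <- read_csv(pulsar_file, col_names = FALSE)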


Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double()
)



In [3]:
colnames(pulsar) <- c("mean_ip", "std_ip", "kurt_ip", "skew_ip", "mean_dm_snr", "std_dm_snr", "kurt_dm_snr", "skew_dm_snr", "class")

pulsar_mutate <- pulsar %>%
    mutate(class = as_factor(class))
pulsar_mutate

| mean_ip | std_ip | kurt_ip | skew_ip | mean_dm_snr | std_dm_snr | kurt_dm_snr | skew_dm_snr | class |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 140.56250 | 55.68378 | -0.234571412 | -0.69964840 | 3.1998328 | 19.110426 | 7.975532 | 74.24222 | 0 |
| 102.50781 | 58.88243 | 0.465318154 | -0.51508791 | 1.6772575 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 103.01562 | 39.34165 | 0.323328365 | 1.05116443 | 3.1212375 | 21.744669 | 7.735822 | 63.17191 | 0 |
| 136.75000 | 57.17845 | -0.068414638 | -0.63623837 | 3.6429766 | 20.959280 | 6.896499 | 53.59366 | 0 |
| 88.72656 | 40.67223 | 0.600866079 | 1.12349169 | 1.1789298 | 11.468720 | 14.269573 | 252.56731 | 0 |
| 93.57031 | 46.69811 | 0.531904850 | 0.41672112 | 1.6362876 | 14.545074 | 10.621748 | 131.39400 | 0 |
| 119.48438 | 48.76506 | 0.031460220 | -0.11216757 | 0.9991639 | 9.279612 | 19.206230 | 479.75657 | 0 |
| 130.38281 | 39.84406 | -0.158322759 | 0.38954045 | 1.2207358 | 14.378941 | 13.539456 | 198.23646 | 0 |
| 107.25000 | 52.62708 | 0.452688025 | 0.17034738 | 2.3319398 | 14.486853 | 9.001004 | 107.97251 | 0 |
| 107.25781 | 39.49649 | 0.465881961 | 1.16287712 | 4.0794314 | 24.980418 | 7.397080 | 57.78474 | 0 |


In [8]:
pulsar_split <- initial_split(pulsar_mutate, prop = 0.75, strata = class)
pulsar_train <- training(pulsar_split) 
pulsar_test <- testing(pulsar_split)

In [10]:
sum(is.na(pulsar_train)) #checking for missing values in training data 

In [9]:
#pulsar observation counts with 0's and 1's
count_train_pulsar <- pulsar_train %>%
    group_by(class) %>%
    summarize(n = n())
count_train_pulsar

`summarise()` ungrouping output (override with `.groups` argument)



| class | n |
|---|---|
| `<fct>` | `<int>` |
| 0 | 12195 |
| 1 | 1229 |
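The counts above suggest that the classes are heavily imbalanced. A minimal sketch in base R, using only the two counts shown in the training-set summary, computes the class shares and the accuracy of a hypothetical baseline that always predicts "not a pulsar". The variable names are illustrative.

```r
# Class counts copied from the training-set summary above.
class_counts <- c(non_pulsar = 12195, pulsar = 1229)

# Share of each class in the training data.
class_props <- class_counts / sum(class_counts)
round(class_props, 3)

# Accuracy of a hypothetical baseline that always predicts "not a pulsar".
baseline_accuracy <- max(class_props)
round(baseline_accuracy, 3)
```

Roughly 9% of the training candidates are real pulsars, so a classifier would need to exceed about 91% accuracy to outperform this trivial baseline.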


## Methods

## Expected Outcomes and Significance

#### What do you expect to find?
- We expect to find a relationship between the statistics of the integrated profile and whether a candidate is a real pulsar, as well as a relationship between the statistics of the DM-SNR curve and whether a candidate is a real pulsar.
- The properties (i.e. mean, standard deviation, skewness, etc.) of the integrated profile and DM-SNR curve generated for each detected emission pattern will likely be able to predict whether the emission comes from a pulsar star or not.

*Note: be more specific, i.e. given this set of characteristics, this is what we expect to find.*

#### What impact could such findings have? 

- The findings from this project can help with the quick identification of pulsar star candidates, which are important in the scientific study of extreme states of matter, the exploration of planets beyond the solar system, the measurement of distances in space, and potentially even the study of black holes.
- By using certain characteristics to classify a detected emission as a potential pulsar star with high accuracy, scientists and astronomers will be able to identify pulsar star candidates quickly and with relative confidence. This saves them from carrying out in-depth, lengthy classification procedures for every single emission and leaves more time for the actual study of, for example, the solar system. The model that we create can act as a vetting process for pulsar star candidates, allowing researchers to work more efficiently.

#### What future questions could this lead to?


- By creating a model and exploring the relationship between the characteristics of a radio emission and its likelihood of coming from a pulsar star, astronomers and researchers will have an easier time identifying pulsar stars and can dedicate the time saved to their topics of interest (e.g. black holes or cosmic distances).
- Our findings and project could also lead to questions such as these:
    - Why do pulsar stars have those predictor variable characteristics?
    - Are the common characteristics between pulsar stars genuine, or are they just a coincidence?
    - Since the predictor variables explored are continuous numeric values, are there thresholds for each variable that determine whether a candidate is potentially a pulsar? Are there specific characteristics that must be fulfilled for a candidate to be classified as a pulsar?