# Project Proposal

# *insert title*

## Introduction

### Background Information and Question
When a supergiant star (a highly luminous type of star) dies and collapses in an event called a supernova, the protons and electrons in its core merge to form neutrons, and the collapsed core becomes a neutron star. Pulsars are a particular type of neutron star: they emit two beams of electromagnetic radiation from their magnetic poles and rotate so quickly that the beams sweep past Earth as a regular, detectable pattern of radio pulses.

Pulsars are important tools for astronomers. Because a pulsar's radio emissions are so regular in time, astronomers can estimate the distances of various cosmic objects from the distance of the pulsar and the time its radio waves take to reach Earth. Additionally, since pulsars are a type of neutron star, astronomers study them to learn about their interiors, especially the poorly understood state of matter that every neutron star contains.

Every pulsar produces a unique radio emission pattern that changes slightly with each rotation. Possible pulsar signals, known as "candidates", are therefore averaged over many rotations to determine whether the signal comes from a real pulsar. Unfortunately, many false signals are picked up from radio frequency interference and other noise, which makes real pulsars difficult to identify.

The question we are asking is: given the integrated profile and DM-SNR curve that quantify a detected radio signal, is the signal a real pulsar or not?




### Dataset used
The dataset that will be used is the Pulsar Star dataset, which can be found [here](https://archive.ics.uci.edu/ml/datasets/HTRU2#). This dataset describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey, an all-sky survey for pulsars and short-duration radio transients.

A candidate is a potential signal detection that may describe a real pulsar, and each row of this dataset is one candidate observation. Each column is a variable, and there are 9 variables. The first 8 variables are the mean, standard deviation, excess kurtosis, and skewness of the integrated pulse profile and of the DM-SNR curve.
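To make the four statistics concrete, the sketch below computes them for a single numeric vector. The helper name `profile_stats` and the toy input are hypothetical, and the divide-by-n moment formulas are an assumption; the dataset's authors may have used a slightly different estimator.

```r
# Hypothetical helper: summarise a curve (e.g. an integrated profile) by its
# mean, standard deviation, excess kurtosis, and skewness.
profile_stats <- function(x) {
  centred <- x - mean(x)
  m2 <- mean(centred^2)  # second central moment (population variance)
  m3 <- mean(centred^3)  # third central moment
  m4 <- mean(centred^4)  # fourth central moment
  c(mean = mean(x),
    std  = sqrt(m2),
    kurt = m4 / m2^2 - 3,  # excess kurtosis: 0 for a normal distribution
    skew = m3 / m2^1.5)
}

# Toy input standing in for one curve.
profile_stats(c(1, 2, 3, 4, 5))
```

Applying such a function once to the integrated profile and once to the DM-SNR curve of a candidate would give eight numbers, matching the eight predictor columns.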

Since pulsars are weak radio sources, many individual pulses must be summed to produce a signal that is distinguishable from noise before a detection can be made. The signal that results from this sum of pulses is called the integrated pulse profile, and it acts like a pulsar's "fingerprint".
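The benefit of combining many weak pulses can be illustrated with a small simulation. This is only a sketch: the Gaussian pulse shape, the noise level, the number of bins, and the number of rotations are all assumed values chosen for illustration, not properties of the HTRU2 data, and the pulses are averaged rather than summed (which changes the scale but not the shape).

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

n_bins   <- 64    # assumed number of phase bins per rotation
n_pulses <- 1000  # assumed number of rotations to combine
phase    <- seq(0, 1, length.out = n_bins)

# Assumed pulse shape: a narrow Gaussian bump centred at phase 0.5.
true_pulse <- exp(-(phase - 0.5)^2 / (2 * 0.03^2))

# Each row is one rotation: the same weak pulse buried in much stronger noise.
pulses <- matrix(rep(true_pulse, each = n_pulses), nrow = n_pulses) +
  matrix(rnorm(n_pulses * n_bins, sd = 5), nrow = n_pulses)

# Averaging over rotations suppresses the noise but keeps the pulse.
integrated_profile <- colMeans(pulses)

cor(pulses[1, ], true_pulse)         # a single rotation barely resembles the pulse
cor(integrated_profile, true_pulse)  # the averaged profile resembles it far more
```

Because independent noise averages towards zero while the pulse does not, the noise in the averaged profile shrinks roughly with the square root of the number of rotations.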

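The dispersion measure (DM) quantifies how much a radio signal is delayed, as a function of frequency, by free electrons along the line of sight. The DM-SNR curve records the signal-to-noise ratio (SNR) of a candidate, that is, the level of the desired signal compared with the level of background noise, at each trial DM value.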

The last variable is the class label. It is 1 if the candidate is a real pulsar and 0 otherwise.

## Preliminary Data Analysis

In [1]:
set.seed(2000)
library(tidyverse)
library(tidymodels)
library(repr)
options(repr.matrix.max.rows = 6)


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

In [2]:
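# Download the zipped dataset to a temporary file
temp <- tempfile()
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip", temp)
# Read the CSV straight from the zip archive; the file has no header row
pulsar_file <- unz(temp, "HTRU_2.csv")
pulsar <- read_csv(pulsar_file, col_names = FALSE)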


Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double()
)



In [3]:
colnames(pulsar) <- c("mean_ip", "std_ip", "kurt_ip", "skew_ip", "mean_dm_snr", "std_dm_snr", "kurt_dm_snr", "skew_dm_snr", "class")

pulsar_mutate <- pulsar %>%
                mutate(class = as_factor(class))
pulsar_mutate

| mean_ip | std_ip | kurt_ip | skew_ip | mean_dm_snr | std_dm_snr | kurt_dm_snr | skew_dm_snr | class |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 140.5625 | 55.68378 | -0.2345714 | -0.6996484 | 3.199833 | 19.11043 | 7.975532 | 74.24222 | 0 |
| 102.5078 | 58.88243 | 0.4653182 | -0.5150879 | 1.677258 | 14.86015 | 10.576487 | 127.39358 | 0 |
| 103.0156 | 39.34165 | 0.3233284 | 1.0511644 | 3.121237 | 21.74467 | 7.735822 | 63.17191 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 119.3359 | 59.93594 | 0.1593631 | -0.74302540 | 21.430602 | 58.87200 | 2.499517 | 4.595173 | 0 |
| 114.5078 | 53.90240 | 0.2011614 | -0.02478884 | 1.946488 | 13.38173 | 10.007967 | 134.238910 | 0 |
| 57.0625 | 85.79734 | 1.4063910 | 0.08951971 | 188.306020 | 64.71256 | -1.597527 | 1.429475 | 0 |


In [4]:
# Hold out 25% of the candidates for testing, keeping the class balance
# the same in both splits
pulsar_split <- initial_split(pulsar_mutate, prop = 0.75, strata = class)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split)

In [11]:
# Standardize each predictor in the test split for exploration.
# Note: scale() centres and scales with the test split's own mean and standard
# deviation and returns a one-column matrix, hence the <dbl[,1]> column types.
pulsar_test_scaled <- pulsar_test %>%
  mutate(scaled_mean_ip = scale(mean_ip, center = TRUE),
         scaled_std_ip = scale(std_ip, center = TRUE),
         scaled_kurt_ip = scale(kurt_ip, center = TRUE),
         scaled_skew_ip = scale(skew_ip, center = TRUE),
         scaled_mean_dm_snr = scale(mean_dm_snr, center = TRUE),
         scaled_std_dm_snr = scale(std_dm_snr, center = TRUE),
         scaled_kurt_dm_snr = scale(kurt_dm_snr, center = TRUE),
         scaled_skew_dm_snr = scale(skew_dm_snr, center = TRUE))
pulsar_test_scaled

| mean_ip | std_ip | kurt_ip | skew_ip | mean_dm_snr | std_dm_snr | kurt_dm_snr | skew_dm_snr | class | scaled_mean_ip | scaled_std_ip | scaled_kurt_ip | scaled_skew_ip | scaled_mean_dm_snr | scaled_std_dm_snr | scaled_kurt_dm_snr | scaled_skew_dm_snr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` |
| 102.5078 | 58.88243 | 0.46531815 | -0.5150879 | 1.6772575 | 14.860146 | 10.576487 | 127.39358 | 0 | -0.3496855 | 1.7987006 | 0.001971987 | -0.3590524 | -0.3631005 | -0.5727080 | 0.4725744 | 0.1819392 |
| 119.4844 | 48.76506 | 0.03146022 | -0.1121676 | 0.9991639 | 9.279612 | 19.206230 | 479.75657 | 0 | 0.3295950 | 0.3075809 | -0.417837447 | -0.2925451 | -0.3863266 | -0.8623251 | 2.3786190 | 3.4221149 |
| 142.0781 | 45.28807 | -0.32032843 | 0.2839525 | 5.3762542 | 29.009897 | 6.076266 | 37.83139 | 0 | 1.2336350 | -0.2048648 | -0.758235030 | -0.2271602 | -0.2364021 | 0.1616322 | -0.5213858 | -0.6416354 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 126.6250 | 55.72183 | 0.002946216 | -0.30321814 | 0.5342809 | 8.588882 | 23.913761 | 660.1970 | 0 | 0.6153116 | 1.3328840 | -0.44542816 | -0.3240805 | -0.4022498 | -0.8981724 | 3.418368 | 5.08136638 |
| 96.0000 | 44.19311 | 0.388673964 | 0.28134362 | 1.8712375 | 15.833746 | 9.634927 | 104.8216 | 0 | -0.6100816 | -0.3662423 | -0.07219043 | -0.2275908 | -0.3564563 | -0.5221803 | 0.264613 | -0.02562259 |
| 114.5078 | 53.90240 | 0.201161383 | -0.02478884 | 1.9464883 | 13.381731 | 10.007967 | 134.2389 | 0 | 0.1304686 | 1.0647331 | -0.25363128 | -0.2781220 | -0.3538788 | -0.6494344 | 0.347006 | 0.24488589 |
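The scaled columns above use the test split's own mean and standard deviation. When the test split is later used to evaluate a model, a common alternative is to standardize it with values learned from the training split only. The sketch below shows that idea in base R; the helper name `scale_with_reference` and the toy numbers are hypothetical and are not taken from the dataset.

```r
# Hypothetical helper: standardize `x` with the mean and standard deviation
# of a reference vector (e.g. a training column) rather than its own.
scale_with_reference <- function(x, reference) {
  (x - mean(reference)) / sd(reference)
}

# Toy values standing in for one training column and one test column.
train_values <- c(100, 110, 120, 130, 140)
test_values  <- c(105, 150)

scale_with_reference(test_values, train_values)
```

With the real data this would be called as, for example, `scale_with_reference(pulsar_test$mean_ip, pulsar_train$mean_ip)`.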


In [5]:
sum(is.na(pulsar_train)) # check for missing values in the training data

In [6]:
# Count the training observations in each class (0 = not a pulsar, 1 = pulsar)
count_train_pulsar <- pulsar_train %>%
    group_by(class) %>%
    summarize(n = n())
count_train_pulsar

`summarise()` ungrouping output (override with `.groups` argument)



| class | n |
|---|---|
| `<fct>` | `<int>` |
| 0 | 12169 |
| 1 | 1255 |
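These counts imply a strongly imbalanced training set. As a quick check, the sketch below turns them into proportions; it re-enters the two counts by hand instead of reading them from `pulsar_train`, and the name `class_counts` is hypothetical.

```r
library(dplyr)

# Counts copied from the table above.
class_counts <- tibble(class = factor(c(0, 1)), n = c(12169, 1255))

# Share of the training set that falls in each class.
class_counts %>%
  mutate(proportion = n / sum(n))
```

Roughly 9% of the training candidates are pulsars, so a classifier that always predicts class 0 would already be about 91% accurate. Accuracy values for any model should be read against that baseline.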


## Methods

We first loaded the dataset, added column names, and converted class from a double (dbl) to a factor (fct), since it is the binary variable that we are trying to predict. We will likely use all of the variables, since we only have 8 continuous variables to start with. To explore whether they all contribute equally to whether a detected signal is classified as a pulsar, we will visualize the relationships between these variables using scatterplots. The points will be coloured by class, allowing us to see which variables (e.g. the integrated profile mean) may be indicators or defining characteristics of pulsars. We will use K-nearest neighbors classification and choose k by cross-validation: we intend to use 5 folds to tune the model and maximize its accuracy. This process will be done with all eight continuous variables outlined below.
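The planned scatterplots could look something like the sketch below. The helper name `plot_predictor_pair`, the point transparency, and the choice of which two columns to plot are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical helper: scatterplot of two predictors, coloured by class.
# `x` and `y` are column names given as strings.
plot_predictor_pair <- function(data, x, y) {
  ggplot(data, aes(x = .data[[x]], y = .data[[y]], colour = class)) +
    geom_point(alpha = 0.4) +  # semi-transparent points reduce overplotting
    labs(x = x, y = y, colour = "Class")
}

# Example call, assuming pulsar_train from the cells above:
# plot_predictor_pair(pulsar_train, "mean_ip", "kurt_ip")
```

Predictor pairs in which the two colours separate cleanly would be the most promising indicators.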

The eight continuous variables that we will be using are the mean, standard deviation, excess kurtosis, and skewness of the integrated pulse profile and of the DM-SNR curve. These correspond to the column names mean_ip, std_ip, kurt_ip, skew_ip, mean_dm_snr, std_dm_snr, kurt_dm_snr, and skew_dm_snr; the variable we are trying to predict is class.
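The tuning procedure described above could be set up roughly as follows with tidymodels. This is a sketch under assumptions, not a final implementation: the helper name `tune_knn`, the grid of k values, the unweighted ("rectangular") neighbour voting, and the use of the kknn engine (which requires the kknn package to be installed) are all choices made here for illustration.

```r
library(tidymodels)

# Hypothetical helper: tune k for a K-nearest neighbors classifier using
# v-fold cross-validation. `train` must contain the predictors and `class`.
tune_knn <- function(train, k_values = seq(1, 21, by = 4), folds = 5) {
  # Centre and scale all predictors so that no variable dominates the distance.
  knn_recipe <- recipe(class ~ ., data = train) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors())

  # Leave the number of neighbours open so that it can be tuned.
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")

  # Stratify the folds so each one keeps the class balance of the data.
  cv_folds <- vfold_cv(train, v = folds, strata = class)

  workflow() %>%
    add_recipe(knn_recipe) %>%
    add_model(knn_spec) %>%
    tune_grid(resamples = cv_folds, grid = tibble(neighbors = k_values)) %>%
    collect_metrics() %>%
    filter(.metric == "accuracy") %>%
    arrange(desc(mean))  # best cross-validated accuracy first
}

# Example call, assuming pulsar_train from the cells above:
# knn_results <- tune_knn(pulsar_train)
```

The first row of the result gives the k with the highest cross-validated accuracy. Putting the centring and scaling inside the recipe means they are re-estimated within each fold from that fold's training portion.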

## Expected Outcomes and Significance

#### What do you expect to find?
- We expect to find a relationship between the integrated profile and whether a candidate is a pulsar, as well as a relationship between the DM-SNR curve and whether a candidate is a pulsar.
- The properties (e.g. mean, standard deviation, skewness) of the integrated profile and DM-SNR curve generated for each detected emission pattern will likely be able to predict whether the emission comes from a pulsar or not.

** Be more specific: given this set of characteristics, this is what we expect to find.

#### What impact could such findings have? 

- The findings from this project can help with the quick identification of pulsars, which are important in the scientific study of extreme states of matter, the exploration of planets beyond the solar system, the measurement of distances in space, and potentially even the study of black holes.
- If certain characteristics can be used to classify a detected emission as a likely pulsar with high accuracy, scientists and astronomers will be able to identify pulsar candidates quickly and with relative confidence. This saves them from carrying out in-depth, lengthy classification procedures for every single emission, leaving more time for the actual science, for example the study of the solar system. The model that we create can act as a sort of vetting process for pulsar candidates, allowing researchers to work more efficiently.

#### What future questions could this lead to?


- By creating a model and exploring the relationship between the characteristics of a radio emission and its potential to be from a pulsar, we hope to make pulsars easier to identify, so that astronomers and researchers can dedicate the time saved to learning more about their topic of interest (e.g. black holes or cosmic distances).
- Our findings and project could also lead to questions such as these:
    - Why do pulsars have those predictor variable characteristics?
    - Are the common characteristics between pulsars genuine, or were they just a lucky coincidence?
    - Since the predictor variables explored are continuous numeric values, are there thresholds for each variable that determine whether a signal potentially comes from a pulsar? Are there specific characteristics that have to be fulfilled for a signal to be classified as a pulsar?