# DSCI 100 009 - Group 165 Proposal - Pulsar Stars

## Introduction

### 1) Background Information

A pulsar is a type of neutron star that emits radio waves detectable from Earth. Because it spins at high speed, scientists can use large telescopes to detect the pattern of its radio emission. Since this emission is largely periodic, one widely known use of pulsars is timekeeping in space.

The radio wave emitted can vary slightly from one rotation to the next. Each sample in the dataset is therefore an average of the radio waves produced over multiple rotations.

Not all detected radio waves are produced by pulsars. Most of the time, telescopes pick up radio frequency interference (RFI) or noise rather than a real pulsar signal. As a result, the dataset consists mostly of negative-class samples (Class 0), which are detections of RFI or noise.

### 2) Question of the Project

Using the    and    variables from the pulsar dataset, we want to predict whether a candidate signal actually comes from a pulsar or is just radio frequency interference (RFI) or noise.

### 3) Dataset Description

The dataset we are using, HTRU2, was downloaded from the UCI Machine Learning Repository. It describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South).

HTRU2 consists of 9 columns in total: the first 8 are continuous variables, and the last is a class variable that can be used for binary classification.

The class variable here is represented by 0 (negative) and 1 (positive). The negative class includes samples caused by RFI/noise, whereas the positive class refers to real pulsar samples. There are 17,898 samples (rows) in total, with 1,639 positive samples and 16,259 negative samples.

Of the 8 continuous columns, the first four are the mean, standard deviation, excess kurtosis, and skewness of the integrated pulse profile, and the last four are the same four statistics for the DM-SNR (Dispersion Measure - Signal-to-Noise Ratio) curve.
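
To make the four summary statistics concrete, here is a minimal base-R sketch that computes them for a numeric vector. The helper `profile_stats()` and the `toy_profile` values are hypothetical and only for illustration. The sketch assumes moment-based (population) formulas for skewness and excess kurtosis; the exact estimators used to build HTRU2 are not stated here and may differ.

```r
# Hypothetical helper (not part of HTRU2 or any package): summarise a numeric
# vector with the same four statistics used as features in the dataset.
profile_stats <- function(x) {
  m  <- mean(x)
  d  <- x - m
  m2 <- mean(d^2)                      # second central moment (population variance)
  c(
    mean     = m,
    sd       = sd(x),                  # sample standard deviation
    kurtosis = mean(d^4) / m2^2 - 3,   # excess kurtosis (0 for a normal distribution)
    skewness = mean(d^3) / m2^1.5      # 0 for a symmetric distribution
  )
}

# Made-up values standing in for an integrated pulse profile
toy_profile <- c(1, 2, 3, 4, 5)
stats <- profile_stats(toy_profile)
print(stats)
```

Because `toy_profile` is symmetric around its mean, its skewness is zero, and its flat shape gives a negative excess kurtosis.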

## Preliminary Exploratory Data Analysis

In [20]:
library(repr)
library(tidyverse)
library(tidymodels)
library(dplyr)

In [21]:
# The raw CSV has no header row, so read it without column names and assign them manually
pulsar <- read_csv("https://raw.githubusercontent.com/splashhhhhh/dsci100-grp165/main/HTRU_2.csv", col_names = FALSE)
colnames(pulsar) <- c("mean_ip", "sd_ip", "kurt_ip", "skew_ip", "mean_dmsnr", "sd_dmsnr", "kurt_dmsnr", "skew_dmsnr", "class")
pulsar

Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): X1, X2, X3, X4, X5, X6, X7, X8, X9

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| mean_ip `<dbl>` | sd_ip `<dbl>` | kurt_ip `<dbl>` | skew_ip `<dbl>` | mean_dmsnr `<dbl>` | sd_dmsnr `<dbl>` | kurt_dmsnr `<dbl>` | skew_dmsnr `<dbl>` | class `<dbl>` |
|---|---|---|---|---|---|---|---|---|
| 140.56250 | 55.68378 | -0.234571412 | -0.69964840 | 3.1998328 | 19.110426 | 7.975532 | 74.24222 | 0 |
| 102.50781 | 58.88243 | 0.465318154 | -0.51508791 | 1.6772575 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 103.01562 | 39.34165 | 0.323328365 | 1.05116443 | 3.1212375 | 21.744669 | 7.735822 | 63.17191 | 0 |
| 136.75000 | 57.17845 | -0.068414638 | -0.63623837 | 3.6429766 | 20.959280 | 6.896499 | 53.59366 | 0 |
| 88.72656 | 40.67223 | 0.600866079 | 1.12349169 | 1.1789298 | 11.468720 | 14.269573 | 252.56731 | 0 |
| 93.57031 | 46.69811 | 0.531904850 | 0.41672112 | 1.6362876 | 14.545074 | 10.621748 | 131.39400 | 0 |
| 119.48438 | 48.76506 | 0.031460220 | -0.11216757 | 0.9991639 | 9.279612 | 19.206230 | 479.75657 | 0 |
| 130.38281 | 39.84406 | -0.158322759 | 0.38954045 | 1.2207358 | 14.378941 | 13.539456 | 198.23646 | 0 |
| 107.25000 | 52.62708 | 0.452688025 | 0.17034738 | 2.3319398 | 14.486853 | 9.001004 | 107.97251 | 0 |
| 107.25781 | 39.49649 | 0.465881961 | 1.16287712 | 4.0794314 | 24.980418 | 7.397080 | 57.78474 | 0 |


In [22]:
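# Convert the class label to a factor so it is treated as a categorical variable
pulsar <- pulsar |>
  mutate(class = as_factor(class))

# Confirm the factor levels
pulsar |>
  pull(class) |>
  levels()

# Count and percentage of observations in each class
num_obs <- nrow(pulsar)
pulsar |>
  group_by(class) |>
  summarize(
    count = n(),
    percentage = n() / num_obs * 100
  )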

| class `<fct>` | count `<int>` | percentage `<dbl>` |
|---|---|---|
| 0 | 16259 | 90.842552 |
| 1 | 1639 | 9.157448 |


In [23]:
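set.seed(1)  # make the random split reproducible (the seed value itself is arbitrary)
# Hold out 25% of the rows for testing, stratifying by class to preserve the class imbalance
pulsar_split <- initial_split(pulsar, prop = 0.75, strata = class)
pulsar_training <- training(pulsar_split)
pulsar_testing <- testing(pulsar_split)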
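
The split above stratifies on `class` so that the roughly 9% share of positive samples carries over to both subsets. The base-R sketch below illustrates that idea only; it is not how `initial_split()` is implemented. The label vector is synthetic, built from the class counts reported above (16,259 negative and 1,639 positive), and the seed is an arbitrary assumption.

```r
# Illustrative only: a base-R sketch of stratified sampling on synthetic labels.
set.seed(2024)  # arbitrary seed (assumption) so the sketch is reproducible

# Class labels with the same counts as reported for HTRU2
class <- factor(c(rep(0, 16259), rep(1, 1639)))

# Draw 75% of the row indices within each class separately
train_idx <- unlist(
  lapply(split(seq_along(class), class), function(idx) {
    sample(idx, size = floor(0.75 * length(idx)))
  }),
  use.names = FALSE
)

train_class <- class[train_idx]
test_class  <- class[-train_idx]

# Share of positive samples overall and in each subset
prop_all   <- mean(class == "1")
prop_train <- mean(train_class == "1")
prop_test  <- mean(test_class == "1")
print(c(all = prop_all, train = prop_train, test = prop_test))
```

Because sampling happens within each class, the three proportions are nearly identical, which a purely random split does not guarantee.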

## Methods

## Expected Outcomes and Significance