Title

Introduction: 

Pulsars are rapidly rotating neutron stars whose radio emission can be detected on Earth. Each pulsar rotates slightly differently, and radio frequency interference (RFI) and noise contaminate the observations, so genuine signals can be difficult to detect. Researchers therefore need to separate legitimate signals from spurious ones by analyzing the candidate data. The shared dataset contains 16,259 spurious examples caused by RFI and noise, and 1,639 real pulsar examples that have been checked by human annotators.

We want to build a binary classifier that predicts whether each candidate is a "pulsar" or a "non-pulsar". The data come from the HTRU_2 dataset.
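
Because the two classes are so unbalanced, it may help to work out what a trivial classifier would score before building a real one. The short calculation below uses only the two counts quoted in the introduction; the variable names are illustrative and not part of the original analysis.

```python
# Class counts as quoted in the introduction.
n_non_pulsar = 16_259
n_pulsar = 1_639
n_total = n_non_pulsar + n_pulsar

# Share of candidates that are real pulsars.
pulsar_fraction = n_pulsar / n_total

# Accuracy of a classifier that always answers "non-pulsar".
majority_baseline_accuracy = n_non_pulsar / n_total

print(f"total candidates: {n_total}")
print(f"pulsar fraction: {pulsar_fraction:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Roughly 9% of the candidates are pulsars, so a model that never predicts "pulsar" would still be right about 91% of the time. Accuracy alone is therefore a weak yardstick for this problem.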

Preliminary Data Analysis: 

In [1]:
import pandas as pd
import altair as alt
import numpy as np

In [2]:
# HTRU_2.csv has no header row, so the column names are supplied explicitly.
pulsar_data = pd.read_csv(
    "/home/jovyan/group_project/HTRU_2.csv",
    header=None,
    names=[
        "Mean",
        "Sd",
        "Excess_kurtosis",
        "Skewness",
        "Mean_DM-SNR_curve",
        "Sd_DM-SNR_curve",
        "Excess_kurtosis_DM-SNR_curve",
        "Skewness_DM-SNR_curve",
        "Class",
    ],
)

# Map the numeric class labels to readable strings.
pulsar_data["Class"] = pulsar_data["Class"].replace({
    0: "Non_pulsar",
    1: "Pulsar",
})

pulsar_data


Unnamed: 0,Mean,Sd,Excess_kurtosis,Skewness,Mean_DM-SNR_curve,Sd_DM-SNR_curve,Excess_kurtosis_DM-SNR_curve,Skewness_DM-SNR_curve,Class
0,140.562500,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,Non_pulsar
1,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,Non_pulsar
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,Non_pulsar
3,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,Non_pulsar
4,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,Non_pulsar
...,...,...,...,...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,Non_pulsar
17894,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,Non_pulsar
17895,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,Non_pulsar
17896,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,Non_pulsar
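
The rows printed above suggest that the features sit on very different scales (for example, `Mean` is near 100 while `Excess_kurtosis` is below 1). For a distance-based method such as the K-Nearest Neighbors classifier named in the Methods section, that usually matters. The sketch below is an illustration only: it copies three columns from the first five displayed rows and standardizes them with scikit-learn's `StandardScaler`. Fitting a scaler on five rows is purely for demonstration; in a real analysis it would presumably be fit on the training split.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of three feature columns, copied from the table shown above.
sample = pd.DataFrame({
    "Mean": [140.562500, 102.507812, 103.015625, 136.750000, 88.726562],
    "Excess_kurtosis": [-0.234571, 0.465318, 0.323328, -0.068415, 0.600866],
    "Skewness_DM-SNR_curve": [74.242225, 127.393580, 63.171909, 53.593661, 252.567306],
})

# The raw spreads differ widely, so unscaled Euclidean distances
# would be driven mostly by the widest-ranging columns.
raw_std = sample.std(ddof=0)

# StandardScaler rescales each column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(sample), columns=sample.columns
)

print(raw_std)
print(scaled.round(3))
```

After scaling, every column contributes on a comparable footing to the distance between two candidates.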


In [3]:
from sklearn.model_selection import train_test_split

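# Hold out 25% of the candidates for testing; stratifying on "Class"
# keeps the pulsar/non-pulsar ratio approximately equal in both splits.
pulsar_train, pulsar_test = train_test_split(
    pulsar_data, train_size=0.75, stratify=pulsar_data["Class"]
)
pulsar_train.info()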

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13423 entries, 10018 to 5743
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Mean                          13423 non-null  float64
 1   Sd                            13423 non-null  float64
 2   Excess_kurtosis               13423 non-null  float64
 3   Skewness                      13423 non-null  float64
 4   Mean_DM-SNR_curve             13423 non-null  float64
 5   Sd_DM-SNR_curve               13423 non-null  float64
 6   Excess_kurtosis_DM-SNR_curve  13423 non-null  float64
 7   Skewness_DM-SNR_curve         13423 non-null  float64
 8   Class                         13423 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.0+ MB
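
To confirm that a stratified split keeps the class balance, one option is to compare class proportions before and after splitting. The sketch below does not read the real file: it builds a synthetic label column with the class counts quoted in the introduction, and it adds `random_state=0` as an assumption so the result is repeatable (the split above does not set one).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class counts quoted in the introduction;
# only the label column matters for this check.
labels = pd.DataFrame({"Class": ["Non_pulsar"] * 16_259 + ["Pulsar"] * 1_639})

# random_state is an assumption added here so the split is repeatable.
train, test = train_test_split(
    labels, train_size=0.75, stratify=labels["Class"], random_state=0
)

# Class proportions in the full data and in each split.
full_share = labels["Class"].value_counts(normalize=True)
train_share = train["Class"].value_counts(normalize=True)
test_share = test["Class"].value_counts(normalize=True)

print(len(train), len(test))
print(
    pd.DataFrame(
        {"full": full_share, "train": train_share, "test": test_share}
    ).round(4)
)
```

The training split here has 13,423 rows, matching the `info()` output above, and the pulsar share should stay close to 9% in both splits.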


Methods:

We will use train-test split and the K-Nearest Neighbors Classification to 