Title:

Introduction:

Pulsars generate radio emissions that are detectable on Earth by rotating at high speeds. Each pulsar rotates slightly differently, and radio frequency interference (RFI) and noise produce spurious signals, so genuine pulsar signals can be difficult to detect. Researchers therefore need to separate the legitimate signals from the spurious ones by analyzing the candidate data. The HTRU2 dataset contains 17,898 candidates: 16,259 false pulsar candidates caused by RFI and noise, and 1,639 real pulsar candidates that human annotators have checked. We want to build a binary classifier that predicts whether a candidate is a "pulsar" or "not pulsar" using data from the HTRU2 dataset.

Preliminary Data Analysis:

In [2]:
import pandas as pd
import altair as alt
import numpy as np

# Load the HTRU2 candidates; the CSV has no header row, so column names are supplied here.
pulsar_data = pd.read_csv(
    "/home/jovyan/group_project/HTRU_2.csv",
    header=None,
    names=[
        "integrated_mean",
        "integrated_sd",
        "integrated_xs_kurtosis",
        "integrated_skewness",
        "dmsnr_mean",
        "dmsnr_sd",
        "dmsnr_xs_kurtosis",
        "dmsnr_skewness",
        "class",
    ],
)

# Replace the numeric class codes with readable labels.
pulsar_data["class"] = pulsar_data["class"].replace({
    0: "not pulsar",
    1: "pulsar"
})

pulsar_data

,integrated_mean,integrated_sd,integrated_xs_kurtosis,integrated_skewness,dmsnr_mean,dmsnr_sd,dmsnr_xs_kurtosis,dmsnr_skewness,class
0,140.562500,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,not pulsar
1,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,not pulsar
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,not pulsar
3,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,not pulsar
4,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,not pulsar
...,...,...,...,...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,not pulsar
17894,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,not pulsar
17895,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,not pulsar
17896,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,not pulsar


In [3]:
from sklearn.model_selection import train_test_split

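# Hold out 25% of the candidates for testing; stratify so both splits keep the same class ratio.
# No random_state is set, so the exact rows in each split change between runs.
pulsar_train, pulsar_test = train_test_split(
    pulsar_data, train_size=0.75, stratify=pulsar_data["class"]
)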
pulsar_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13423 entries, 12049 to 10200
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   integrated_mean         13423 non-null  float64
 1   integrated_sd           13423 non-null  float64
 2   integrated_xs_kurtosis  13423 non-null  float64
 3   integrated_skewness     13423 non-null  float64
 4   dmsnr_mean              13423 non-null  float64
 5   dmsnr_sd                13423 non-null  float64
 6   dmsnr_xs_kurtosis       13423 non-null  float64
 7   dmsnr_skewness          13423 non-null  float64
 8   class                   13423 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.0+ MB
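
The split above is stratified but not seeded, so rerunning the cell selects different rows each time. A minimal sketch of a reproducible variant follows, assuming a small synthetic table in place of `pulsar_data`. The table `toy`, the helper `split`, and the seed value 2023 are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for pulsar_data: 200 rows, 10% labeled "pulsar".
# The feature values are made up purely for illustration.
toy = pd.DataFrame({
    "integrated_mean": range(200),
    "class": ["pulsar"] * 20 + ["not pulsar"] * 180,
})


def split(seed):
    # random_state fixes the shuffle, so the same rows land in each split.
    return train_test_split(
        toy, train_size=0.75, stratify=toy["class"], random_state=seed
    )


train_a, test_a = split(seed=2023)
train_b, test_b = split(seed=2023)

# Same seed gives identical training rows on every run.
print(train_a.index.equals(train_b.index))

# Stratification keeps the class proportions of the full table in the training split.
print(train_a["class"].value_counts(normalize=True))
```

Fixing the seed makes downstream summaries and plots comparable between runs; any integer works, as long as it stays the same.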


In [4]:
pulsar_train.describe()

,integrated_mean,integrated_sd,integrated_xs_kurtosis,integrated_skewness,dmsnr_mean,dmsnr_sd,dmsnr_xs_kurtosis,dmsnr_skewness
count,13423.0,13423.0,13423.0,13423.0,13423.0,13423.0,13423.0,13423.0
mean,111.103475,46.577193,0.480621,1.794628,12.617913,26.335384,8.301058,104.893072
std,25.739782,6.887556,1.075087,6.262304,29.434046,19.456816,4.510443,106.985065
min,5.8125,24.791612,-1.738021,-1.791886,0.213211,7.370432,-2.812353,-1.964998
25%,100.992188,42.401759,0.026126,-0.191099,1.925167,14.432672,5.773167,34.736563
50%,115.179688,46.960495,0.220351,0.191636,2.808528,18.472159,8.431977,82.994641
75%,127.136719,51.0831,0.471203,0.915395,5.496237,28.518111,10.695692,139.316061
max,190.421875,98.778911,8.069522,68.101622,223.392141,109.712649,34.539844,1191.000837


In [5]:
pulsar_train["class"].value_counts(normalize=True)

not pulsar    0.908441
pulsar        0.091559
Name: class, dtype: float64
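
With roughly 91% of candidates labeled "not pulsar", a classifier that always predicts the majority class would already reach about 91% accuracy, which is worth keeping in mind when judging any model trained on this data. The sketch below computes that baseline from the class counts quoted in the introduction. `majority_baseline_accuracy` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Class counts quoted in the introduction for the full HTRU2 dataset.
labels = ["not pulsar"] * 16259 + ["pulsar"] * 1639

baseline = majority_baseline_accuracy(labels)
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

Because this baseline is so high, accuracy alone says little here; a useful classifier needs to beat it while still recovering the rare "pulsar" class.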

In [17]:
# Altair limits charts to 5,000 rows by default; lift the limit for the 13,423-row training set.
alt.data_transformers.disable_max_rows()

# Features to compare pairwise (the two *_sd columns are left out).
pair_features = [
    "integrated_mean",
    "integrated_xs_kurtosis",
    "integrated_skewness",
    "dmsnr_mean",
    "dmsnr_xs_kurtosis",
    "dmsnr_skewness",
]

# Scatter plot matrix: one panel per pair of features, colored by class.
alt.Chart(pulsar_train).mark_point().encode(
    alt.X(alt.repeat("row"), type="quantitative"),
    alt.Y(alt.repeat("column"), type="quantitative"),
    color="class:N"
).properties(
    width=150,
    height=150
).repeat(
    row=pair_features,
    column=pair_features
)
