**Methods/Results**

In [5]:
import pandas as pd
import altair as alt
import numpy as np

url= "https://docs.google.com/spreadsheets/d/e/2PACX-1vROC4kgO6ctTkCjDooBh4Gc_VW7fsUeIgSiPTtcHV0FjFumQclEF8b3ThtxYAJQPyDmRN61OpR4gnpr/pub?output=csv"
pulsar_data = pd.read_csv(url, header=None, names=[
    "integrated_mean",
    "integrated_sd",
    "integrated_xs_kurtosis",
    "integrated_skewness",
    "dmsnr_mean",
    "dmsnr_sd",
    "dmsnr_xs_kurtosis",
    "dmsnr_skewness",
    "class"
])

pulsar_data["class"] = pulsar_data["class"].replace({
    0: "not pulsar",
    1: "pulsar"
})
pulsar_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   integrated_mean         17898 non-null  float64
 1   integrated_sd           17898 non-null  float64
 2   integrated_xs_kurtosis  17898 non-null  float64
 3   integrated_skewness     17898 non-null  float64
 4   dmsnr_mean              17898 non-null  float64
 5   dmsnr_sd                17898 non-null  float64
 6   dmsnr_xs_kurtosis       17898 non-null  float64
 7   dmsnr_skewness          17898 non-null  float64
 8   class                   17898 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.2+ MB


*Fig 1: Summary of Original Data from Pulsar Database*

In [6]:
from sklearn.model_selection import train_test_split
np.random.seed(1)

pulsar_train, pulsar_test = train_test_split(
    pulsar_data, train_size=0.75, stratify=pulsar_data["class"]
)
pulsar_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13423 entries, 2020 to 12740
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   integrated_mean         13423 non-null  float64
 1   integrated_sd           13423 non-null  float64
 2   integrated_xs_kurtosis  13423 non-null  float64
 3   integrated_skewness     13423 non-null  float64
 4   dmsnr_mean              13423 non-null  float64
 5   dmsnr_sd                13423 non-null  float64
 6   dmsnr_xs_kurtosis       13423 non-null  float64
 7   dmsnr_skewness          13423 non-null  float64
 8   class                   13423 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.0+ MB


*Fig 2: Summary of Pulsar Training Set*

We split the pulsar data into a training set and a test set. 75% of the observations were randomly selected, stratified by class, as the training set used to build our classifier. The remaining 25% form the test set, on which we will evaluate the accuracy of the binary classifier.
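To illustrate what the `stratify` argument of `train_test_split` does, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: the function name `stratified_split` and the toy labels are hypothetical, and the point is only that each class is split 75/25 separately, so both sets keep the same class proportions.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction=0.75, seed=1):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random selection within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical, not the pulsar data): 90% "not pulsar", 10% "pulsar".
labels = ["not pulsar"] * 900 + ["pulsar"] * 100
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Without stratification, a rare class such as "pulsar" could end up over- or under-represented in the test set purely by chance.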

*Fig 3: Visualization of Integrated Data for Preliminary Analysis*

In [8]:
alt.Chart(pulsar_train).mark_point(opacity=0.5).encode(
    alt.X(alt.repeat("row"), type="quantitative"),
    alt.Y(alt.repeat("column"), type="quantitative"),
    color="class:N"
).properties(
    width=150,
    height=150
).repeat(
    row=["dmsnr_mean", "dmsnr_sd", "dmsnr_xs_kurtosis", "dmsnr_skewness"],
    column=["dmsnr_mean", "dmsnr_sd", "dmsnr_xs_kurtosis", "dmsnr_skewness"]
)

*Fig 4: Visualization of DMSNR Data for Preliminary Analysis*

We chose to generate charts of the integrated data and the DMSNR data separately because the two sets of variables are collected by different methods. Based on the charts above, the two classes, "pulsar" and "not pulsar", can only be distinguished visually in the integrated data. As a result, we chose to use only the integrated variables to create a binary classifier, expecting the clearer separation to give a more accurate classifier.

In [19]:
pulsar_data = pulsar_data.drop(columns=["dmsnr_mean", "dmsnr_sd", "dmsnr_xs_kurtosis", "dmsnr_skewness"])
pulsar_data.head(10)

| | integrated_mean | integrated_sd | integrated_xs_kurtosis | integrated_skewness | class |
|---|---|---|---|---|---|
| 0 | 140.5625 | 55.683782 | -0.234571 | -0.699648 | not pulsar |
| 1 | 102.507812 | 58.88243 | 0.465318 | -0.515088 | not pulsar |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | not pulsar |
| 3 | 136.75 | 57.178449 | -0.068415 | -0.636238 | not pulsar |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | not pulsar |
| 5 | 93.570312 | 46.698114 | 0.531905 | 0.416721 | not pulsar |
| 6 | 119.484375 | 48.765059 | 0.03146 | -0.112168 | not pulsar |
| 7 | 130.382812 | 39.844056 | -0.158323 | 0.38954 | not pulsar |
| 8 | 107.25 | 52.627078 | 0.452688 | 0.170347 | not pulsar |
| 9 | 107.257812 | 39.496488 | 0.465882 | 1.162877 | not pulsar |


*Fig 5: First 10 Rows of Tidy Data Set for Exploration*

**Building the Binary Classifier**

To generate a binary classifier with high accuracy, we need to tune K, the number of nearest neighbors. Since the KNN method is sensitive to the scale of the variables, we first have to standardize our data, which we did by creating a preprocessor. We also set our seed to 1 to ensure that our results are repeatable.
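As a rough illustration of what standardization does, the sketch below rescales two columns by hand using only the standard library. The helper `standardize` is hypothetical, and the values are the first five rows shown in Fig 5. It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD (divides by n)
    return [(v - mu) / sigma for v in values]


# First five rows of two columns from Fig 5.
integrated_mean = [140.5625, 102.507812, 103.015625, 136.75, 88.726562]
integrated_xs_kurtosis = [-0.234571, 0.465318, 0.323328, -0.068415, 0.600866]

z_mean = standardize(integrated_mean)
z_kurtosis = standardize(integrated_xs_kurtosis)

# Before scaling, integrated_mean spans tens of units while the kurtosis
# spans less than one, so raw distances would be dominated by integrated_mean.
print(max(integrated_mean) - min(integrated_mean))
print(max(integrated_xs_kurtosis) - min(integrated_xs_kurtosis))
print(z_mean)
print(z_kurtosis)
```

After scaling, both columns are on the same footing, so each contributes comparably to the distances that KNN relies on.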

In [19]:
import matplotlib.pyplot as plt
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [20]:
np.random.seed(1)

pulsar_preprocessor = make_column_transformer(
    (StandardScaler(), ["integrated_mean", "integrated_sd", "integrated_xs_kurtosis", "integrated_skewness"]),
    verbose_feature_names_out=False
)
pulsar_preprocessor

The next step is to combine our preprocessor and a KNN classifier into a pipeline and use the fit method to train it on the training data.

In [21]:
knn = KNeighborsClassifier()

X = pulsar_train[["integrated_mean", "integrated_sd", "integrated_xs_kurtosis", "integrated_skewness"]]
y = pulsar_train["class"]

pulsar_fit = make_pipeline(pulsar_preprocessor, knn).fit(X,y)
pulsar_fit
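For intuition about what the fitted classifier does at prediction time, here is a minimal plain-Python sketch of the K-nearest-neighbors rule: find the K closest training points and take a majority vote. The function `knn_predict` and the toy points are hypothetical and are not the pulsar data. `KNeighborsClassifier()` with default settings likewise uses 5 neighbors and Euclidean distance, but with a far more efficient implementation.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest points."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, already-standardized points (hypothetical values).
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
train_labels = ["not pulsar", "not pulsar", "not pulsar", "pulsar", "pulsar", "pulsar"]

print(knn_predict(train_points, train_labels, (0.1, 0.1), k=3))
print(knn_predict(train_points, train_labels, (3.0, 3.1), k=3))
```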

We performed 5-fold cross-validation, which fits the pipeline on different splits of the training data. For each candidate K, we can then look at the mean and standard deviation of the accuracy across the five folds to see how accurate the classifier is. The parameter grid covers K from 1 to 19.
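The bookkeeping behind 5-fold cross-validation can be sketched in a few lines of plain Python. The helper `k_fold_indices` is hypothetical, and it deliberately uses plain, unshuffled folds; for a classifier, `GridSearchCV` with `cv=5` uses stratified folds instead, so this shows only how each observation is used for validation exactly once.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Return a list of (training_indices, validation_indices) pairs."""
    # Earlier folds get one extra sample when n_samples is not divisible by n_folds.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    folds, start = [], 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((training, validation))
        start += size
    return folds


# 13423 is the size of the training set reported in Fig 2.
folds = k_fold_indices(13423, n_folds=5)
for training, validation in folds:
    print(len(training), len(validation))
```

Each candidate K is fitted five times, once per pair, and scored on the held-out fold; the five scores are what appear as `split0_test_score` to `split4_test_score` in the results.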

In [52]:
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1,20),
}
pulsar_tune_pipe = make_pipeline(pulsar_preprocessor, KNeighborsClassifier())

In [53]:
knn_tune_grid = GridSearchCV(
    estimator = pulsar_tune_pipe,
    param_grid=param_grid,
    cv=5
)
knn_tune_grid

In [54]:
knn_model_grid = knn_tune_grid.fit(X,y)
accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)
accuracies_grid.head(10)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kneighborsclassifier__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.026811 | 0.003142 | 0.130101 | 0.009483 | 1 | {'kneighborsclassifier__n_neighbors': 1} | 0.96648 | 0.959404 | 0.964246 | 0.967958 | 0.966095 | 0.964837 | 0.002963 | 19 |
| 1 | 0.029138 | 0.011787 | 0.184757 | 0.056216 | 2 | {'kneighborsclassifier__n_neighbors': 2} | 0.974674 | 0.970205 | 0.976164 | 0.97541 | 0.971684 | 0.973627 | 0.002289 | 18 |
| 2 | 0.02513 | 0.001495 | 0.131095 | 0.007983 | 3 | {'kneighborsclassifier__n_neighbors': 3} | 0.975047 | 0.974302 | 0.977281 | 0.97839 | 0.972802 | 0.975564 | 0.002021 | 17 |
| 3 | 0.02618 | 0.003521 | 0.128028 | 0.001473 | 4 | {'kneighborsclassifier__n_neighbors': 4} | 0.976536 | 0.973929 | 0.978399 | 0.977273 | 0.973547 | 0.975937 | 0.001895 | 16 |
| 4 | 0.022665 | 0.00075 | 0.130898 | 0.006768 | 5 | {'kneighborsclassifier__n_neighbors': 5} | 0.976909 | 0.976536 | 0.978771 | 0.978763 | 0.974292 | 0.977054 | 0.00166 | 15 |
| 5 | 0.033161 | 0.016623 | 0.196183 | 0.036287 | 6 | {'kneighborsclassifier__n_neighbors': 6} | 0.976536 | 0.975791 | 0.979143 | 0.979881 | 0.975037 | 0.977278 | 0.001899 | 14 |
| 6 | 0.028016 | 0.010539 | 0.140572 | 0.009357 | 7 | {'kneighborsclassifier__n_neighbors': 7} | 0.977281 | 0.976536 | 0.979516 | 0.980626 | 0.97541 | 0.977874 | 0.001923 | 9 |
| 7 | 0.02326 | 0.00152 | 0.135074 | 0.001593 | 8 | {'kneighborsclassifier__n_neighbors': 8} | 0.976164 | 0.976164 | 0.979516 | 0.979881 | 0.97541 | 0.977427 | 0.001879 | 13 |
| 8 | 0.022758 | 0.000169 | 0.140094 | 0.004236 | 9 | {'kneighborsclassifier__n_neighbors': 9} | 0.976164 | 0.975791 | 0.980633 | 0.979508 | 0.975782 | 0.977576 | 0.002072 | 11 |
| 9 | 0.023319 | 0.001504 | 0.137736 | 0.005003 | 10 | {'kneighborsclassifier__n_neighbors': 10} | 0.976536 | 0.976164 | 0.980261 | 0.980253 | 0.974665 | 0.977576 | 0.002277 | 12 |


*Fig 6: Accuracy Grid for 5-Fold Cross Validation*

In [51]:
accuracy_versus_k_grid = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors", title="K"),
    y=alt.Y("mean_test_score", scale=alt.Scale(domain=(0.964, 0.9795)), title="Accuracy estimate")
)
accuracy_versus_k_grid

*Fig 7: Accuracy Estimate vs K Neighbors Plot*

Based on the plot generated from the accuracy grid, we can see that the estimated accuracy is highest at K=17, at approximately 98%. We therefore take K=17 as the best value and rebuild the classifier with it. To do this, we repeat the steps used above but set n_neighbors to 17.
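The selection rule itself, picking the K with the highest mean cross-validation accuracy, can also be applied to the numbers directly instead of reading them off the plot. The sketch below uses only the ten rows displayed in Fig 6 (K = 1 to 10), so it illustrates the rule rather than reproducing the K=17 result, which depends on rows not shown. On the fitted search object, scikit-learn exposes the same information through the `best_params_` and `best_score_` attributes.

```python
# mean_test_score for K = 1..10, copied from the rows shown in Fig 6.
mean_test_score = {
    1: 0.964837,
    2: 0.973627,
    3: 0.975564,
    4: 0.975937,
    5: 0.977054,
    6: 0.977278,
    7: 0.977874,
    8: 0.977427,
    9: 0.977576,
    10: 0.977576,
}

# Best K among the displayed rows only (the full grid runs to K = 19).
best_k = max(mean_test_score, key=mean_test_score.get)
print(best_k, mean_test_score[best_k])
```

The differences between neighboring values of K are small compared with the fold-to-fold standard deviation of about 0.002, so several values of K perform almost equally well.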

In [55]:
knn2 = KNeighborsClassifier(n_neighbors=17)
X = pulsar_train[["integrated_mean", "integrated_sd", "integrated_xs_kurtosis", "integrated_skewness"]]
y = pulsar_train["class"]

pulsar_fit2 = make_pipeline(pulsar_preprocessor, knn2).fit(X,y)
pulsar_fit2

In [62]:
pulsar_test_predictions = pulsar_test.assign(
    predicted=pulsar_fit2.predict(pulsar_test[["integrated_mean", "integrated_sd", "integrated_xs_kurtosis", "integrated_skewness"]])
)
pulsar_test_predictions[['predicted','class']].head(10)

| | predicted | class |
|---|---|---|
| 5642 | not pulsar | not pulsar |
| 8983 | not pulsar | not pulsar |
| 2409 | not pulsar | not pulsar |
| 8402 | not pulsar | not pulsar |
| 796 | not pulsar | not pulsar |
| 3756 | not pulsar | not pulsar |
| 249 | not pulsar | not pulsar |
| 2541 | not pulsar | not pulsar |
| 8144 | not pulsar | not pulsar |
| 7852 | not pulsar | not pulsar |


*Fig 8: First 10 results of Pulsar Test Predictions and Actual Data for Class*

From the dataframe above we can see that the predictions generated by our binary classifier match the actual class in all of the first ten test observations. The output of the score method below shows that the estimated accuracy of our binary classifier on the full test set is approximately 98%.

In [67]:
pulsar_acc_1 = pulsar_fit2.score(
    pulsar_test[["integrated_mean", "integrated_sd", "integrated_xs_kurtosis", "integrated_skewness"]],
    pulsar_test["class"]
)
pulsar_acc_1

0.9794413407821229

To see how many observations of each class were predicted correctly and incorrectly, we created a confusion matrix using the crosstab function.

In [66]:
pd.crosstab(
    pulsar_test_predictions["class"],
    pulsar_test_predictions["predicted"]
)

| class (rows) / predicted (columns) | not pulsar | pulsar |
|---|---|---|
| not pulsar | 4047 | 18 |
| pulsar | 74 | 336 |


We can see that our binary classifier correctly identified 4047 observations as not pulsar, and 336 as pulsar. Only 18 observations were incorrectly classified as pulsar, while 74 pulsars were incorrectly classified as not pulsar.
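The four counts in the confusion matrix are enough to recompute the accuracy and to derive precision and recall. The short sketch below does this in plain Python, treating "pulsar" as the positive class (an assumption made here for illustration). It also computes the accuracy of always predicting "not pulsar" as a point of comparison.

```python
# Counts from the confusion matrix above; "pulsar" is taken as the positive class.
tn, fp = 4047, 18   # true class "not pulsar": predicted not pulsar / pulsar
fn, tp = 74, 336    # true class "pulsar":     predicted not pulsar / pulsar

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of all test observations predicted correctly
precision = tp / (tp + fp)        # share of predicted pulsars that are real pulsars
recall = tp / (tp + fn)           # share of real pulsars that were found
baseline = (tn + fp) / total      # accuracy of always predicting "not pulsar"

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"baseline  = {baseline:.4f}")
```

Because pulsars are rare in the test set, accuracy alone hides the fact that recall is noticeably lower than precision: the classifier misses more pulsars than it falsely reports.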

In [65]:
new_observation = pd.DataFrame({
    "integrated_mean": [108.92833],
    "integrated_sd": [38.44983],
    "integrated_xs_kurtosis": [0.42293],
    "integrated_skewness": [1.98473],
    "class": ["unknown"]
})

In [66]:
prediction = pulsar_fit2.predict(new_observation)
prediction

array(['not pulsar'], dtype=object)

**Discussion**

Summarize what you found?
- Created a classifier
- Found that best K = 17
- Based on the test data, we found that the accuracy of our tuned classifier is approximately 98%
- Tested the classifier again on a new observation with randomly chosen values

Discuss whether this is what you expected to find?
- We expected the classifier to achieve a high accuracy

Discuss what impact could such findings have?
- Time efficiency 
- The tuned classifier could be used to quickly and efficiently predict the classification of candidates without the need for human annotators. This would save time and be useful for research projects with a limited budget.

Discuss what future questions could this lead to?
- How can the classifier be improved in terms of recall and/or precision, in addition to accuracy?
- Will the accuracy of the classification model be different if all variables are used?