## About

Classifying stars and quasars from SDSS colors with a Gaussian Naive Bayes classifier.

### Links

* [Python Scripts For Data](https://github.com/astroML/sklearn_tutorial/tree/master/doc/data/sdss_colors)
* Original Articles
    - [Notebook](http://nbviewer.ipython.org/url/astroml.github.com/sklearn_tutorial/_downloads/07_classification_example.ipynb)
    - [Detailed Description](http://www.astroml.org/sklearn_tutorial/classification.html)

### Dependencies

```
pip install numpy scikit-learn
```

### Downloading Data

In [7]:
import os
import urllib2
import numpy as np

DTYPE_TRAIN = [('u-g', np.float32),
               ('g-r', np.float32),
               ('r-i', np.float32),
               ('i-z', np.float32),
               ('redshift', np.float32)]

DTYPE_TEST = [('u-g', np.float32),
              ('g-r', np.float32),
              ('r-i', np.float32),
              ('i-z', np.float32),
              ('label', np.int32)]

SDSS_COLORS_URL = "http://www.astro.washington.edu/users/vanderplas/pydata/"
TRAIN_FILE = 'sdssdr6_colors_class_train.dat'
TEST_FILE = 'sdssdr6_colors_class.200000.dat'
FOLDER = 'data/'

# data directory is password protected so the public can't access it    
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, SDSS_COLORS_URL, 'pydata', 'astroML')
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)

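# make sure the local data folder exists before saving into it
if not os.path.exists(FOLDER):
    os.makedirs(FOLDER)

# download training data and cache it as a .npy file
# (swap the '.dat' extension for '.npy')
destination = FOLDER + os.path.splitext(TRAIN_FILE)[0] + '.npy'
if not os.path.exists(destination):
    url = SDSS_COLORS_URL + TRAIN_FILE
    print "downloading data from", url
    fhandle = opener.open(url)  # open the URL once and parse that handle
    np.save(destination, np.loadtxt(fhandle, dtype=DTYPE_TRAIN))

# download test data and cache it as a .npy file
destination = FOLDER + os.path.splitext(TEST_FILE)[0] + '.npy'
if not os.path.exists(destination):
    url = SDSS_COLORS_URL + TEST_FILE
    print "downloading data from", url
    fhandle = opener.open(url)  # open the URL once and parse that handle
    np.save(destination, np.loadtxt(fhandle, dtype=DTYPE_TEST))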

downloading data from http://www.astro.washington.edu/users/vanderplas/pydata/sdssdr6_colors_class_train.dat
downloading data from http://www.astro.washington.edu/users/vanderplas/pydata/sdssdr6_colors_class.200000.dat
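
The download cell relies on `urllib2`, which exists only in Python 2. As a rough sketch, the same kind of authenticated opener could be built under Python 3 with `urllib.request`. The helper names `build_opener_with_auth` and `npy_path` are hypothetical and introduced only for illustration, and nothing is downloaded here.

```python
import os
import urllib.request  # Python 3 counterpart of urllib2

SDSS_COLORS_URL = "http://www.astro.washington.edu/users/vanderplas/pydata/"


def build_opener_with_auth(base_url, user, password):
    """Return an opener that can answer HTTP basic-auth challenges for base_url."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, base_url, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def npy_path(folder, filename):
    """Map 'name.dat' to 'folder/name.npy' by swapping the extension."""
    stem, _ext = os.path.splitext(filename)
    return os.path.join(folder, stem + ".npy")


# hypothetical usage: build the opener and derive the cache file name
opener = build_opener_with_auth(SDSS_COLORS_URL, "pydata", "astroML")
print(npy_path("data", "sdssdr6_colors_class_train.dat"))
```

Calling `opener.open(url)` would then return a file-like object that can be passed to `np.loadtxt`, as in the cell above.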


### Importing Data

In [8]:
import numpy as np

train_data = np.load('./data/sdssdr6_colors_class_train.npy')
test_data = np.load('./data/sdssdr6_colors_class.200000.npy')

In [10]:
print train_data.dtype.names
print train_data['u-g'].shape

('u-g', 'g-r', 'r-i', 'i-z', 'redshift')
(505290L,)


### Splitting Data

In [11]:
X_train = np.vstack([train_data['u-g'],
                     train_data['g-r'],
                     train_data['r-i'],
                     train_data['i-z']]).T
# training file: objects with a positive redshift are labelled 1 (quasars), the rest 0 (stars)
y_train = (train_data['redshift'] > 0).astype(int)

X_test = np.vstack([test_data['u-g'],
                    test_data['g-r'],
                    test_data['r-i'],
                    test_data['i-z']]).T
# test file: label 0 is mapped to 1 (quasars), every other label to 0 (stars)
y_test = (test_data['label'] == 0).astype(int)

print "training data:", X_train.shape
print "test data:    ", X_test.shape

training data: (505290L, 4L)
test data:     (200000L, 4L)


### Fitting Data
Note that quasars have y = 1 and stars have y = 0. We set up a Gaussian Naive Bayes classifier, which fits a four-dimensional Gaussian with uncorrelated dimensions to each class and then uses these Gaussians to quickly predict the label of a test point.

In [13]:
from sklearn import naive_bayes
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
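
To make the idea of "one uncorrelated Gaussian per class" concrete, here is a minimal NumPy sketch of such a classifier. It is not scikit-learn's implementation: the function names, the small variance floor, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # small constant keeps variances strictly positive (an assumption of this sketch)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances


def predict_gaussian_nb(model, X):
    """Pick the class with the largest log prior plus summed per-feature log density."""
    classes, priors, means, variances = model
    # broadcasting gives shape (n_samples, n_classes, n_features)
    diff = X[:, np.newaxis, :] - means[np.newaxis, :, :]
    log_density = -0.5 * (np.log(2.0 * np.pi * variances) + diff ** 2 / variances)
    # features are treated as independent, so their log densities simply add up
    log_posterior = np.log(priors) + log_density.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]


# synthetic, well-separated 4-D data (not the SDSS colors)
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(200, 4)),
                    rng.normal(5.0, 1.0, size=(200, 4))])
y_demo = np.array([0] * 200 + [1] * 200)

model = fit_gaussian_nb(X_demo, y_demo)
y_demo_pred = predict_gaussian_nb(model, X_demo)
print("accuracy on synthetic data: {0}".format(np.mean(y_demo_pred == y_demo)))
```

Because only a mean and a variance per feature and class are estimated, fitting is a single pass over the data, which is why the classifier is fast even on hundreds of thousands of objects.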

### Calculating Accuracy

A simple accuracy metric: the fraction of test points that are classified correctly.

In [25]:
accuracy = float(np.sum(y_test == y_pred)) / len(y_test)
print 'Accuracy: {0}'.format(accuracy)
print 'Stars:    {0}'.format(np.sum(y_test == 0))
print 'Quasars:  {0}'.format(np.sum(y_test == 1))

Accuracy: 0.617245
Stars:    186721
Quasars:  13279
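
Because the test set is heavily imbalanced, it may help to compare this accuracy against a trivial baseline. The short calculation below uses only the numbers printed above and shows what a classifier that labels every object a star would score. The variable names are illustrative.

```python
n_stars = 186721              # from the output above
n_quasars = 13279             # from the output above
reported_accuracy = 0.617245  # from the output above

# accuracy of always predicting "star" (y = 0)
baseline_accuracy = float(n_stars) / (n_stars + n_quasars)

print("Baseline accuracy: {0:.6f}".format(baseline_accuracy))
print("Naive Bayes below baseline: {0}".format(reported_accuracy < baseline_accuracy))
```

The always-star baseline reaches about 0.93, well above the reported 0.62, which suggests that accuracy alone is a poor summary for this data set and motivates looking at precision and recall.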


Calculating [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) for the quasar class (y = 1).

In [24]:
TP = np.sum((y_pred == 1) & (y_test == 1))  # true positives
FP = np.sum((y_pred == 1) & (y_test == 0))  # false positives
FN = np.sum((y_pred == 0) & (y_test == 1))  # false negatives

print "Precision: {0}".format(TP / float(TP + FP))
print "Recall:    {0}".format(TP / float(TP + FN))

Precision: 0.142337086782
Recall:    0.948113562768


Combining precision and recall into the [F1 score](https://en.wikipedia.org/wiki/F1_score).

In [26]:
from sklearn import metrics

print "Precision: {0}".format(metrics.precision_score(y_test, y_pred))
print "Recall:    {0}".format(metrics.recall_score(y_test, y_pred))
print "F1 score:  {0}".format(metrics.f1_score(y_test, y_pred))

Precision: 0.142337086782
Recall:    0.948113562768
F1 score:  0.247515506581
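
The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the sketch below recomputes it from the two values printed above. The variable names are illustrative.

```python
precision = 0.142337086782  # from the output above
recall = 0.948113562768     # from the output above

# harmonic mean of precision and recall
f1 = 2.0 * precision * recall / (precision + recall)
print("F1 score: {0:.6f}".format(f1))
```

The harmonic mean is dominated by the smaller of the two values, so the low precision pulls the F1 score down even though recall is close to 0.95.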


Printing a table that summarizes all of these metrics per class.

In [31]:
print metrics.classification_report(y_test, y_pred, target_names=['Stars', 'Quasars (QSOs)'])

                precision    recall  f1-score   support

         Stars       0.99      0.59      0.74    186721
Quasars (QSOs)       0.14      0.95      0.25     13279

   avg / total       0.94      0.62      0.71    200000

