## About

Estimating photometric redshifts of SDSS galaxies with a decision tree regressor (scikit-learn).

### Links

* [Python Scripts For Data](https://github.com/astroML/sklearn_tutorial/tree/master/doc/data/sdss_photoz)
* Original Articles
    - [Notebook](http://nbviewer.ipython.org/url/astroml.github.com/sklearn_tutorial/_downloads/08_regression_example.ipynb)
    - [Detailed Description](http://www.astroml.org/sklearn_tutorial/regression.html)

### Dependencies

```
pip install scikit-learn numpy
```

### Downloading Data

```python
"""
This file fetches photometric observations associated with SDSS galaxy
spectra which have spectroscopically confirmed redshifts.  This directly
queries the SDSS database for the information, and thus can take a few
minutes to run.
"""

import os
import urllib.parse
import urllib.request
import numpy as np

# Here's how the data can be downloaded directly from the SDSS server.
# This route is limited to N = 50000, so we've done this separately
def fetch_data_sql(N=50000):
    URL = 'http://cas.sdss.org/public/en/tools/search/x_sql.asp'
    archive_file = 'sdss_galaxy_colors.npy'

    dtype = [('mags', '5float32'),
             ('specClass', 'int8'),
             ('z', 'float32'),
             ('zerr', 'float32')]

    def sql_query(sql_str, url=URL, format='csv'):
        """Execute SQL query"""
        # remove comments from string
        sql_str = ' \n'.join(line.split('--')[0]
                             for line in sql_str.split('\n'))
        params = urllib.parse.urlencode(dict(cmd=sql_str, format=format))
        return urllib.request.urlopen(url + '?%s' % params)

    query_text = ('\n'.join(
            ("SELECT TOP %i" % N,
             "   modelMag_u, modelMag_g, modelMag_r, modelMag_i, modelMag_z, specClass, z, zErr",
             "FROM SpecPhoto",
             "WHERE ",
             "   modelMag_u BETWEEN 0 AND 19.6",
             "   AND modelMag_g BETWEEN 0 AND 20",
             "   AND zerr BETWEEN 0 and 0.03",
             "   AND specClass > 1 -- not UNKNOWN or STAR",
             "   AND specClass <> 5 -- not SKY",
             "   AND specClass <> 6 -- not STAR_LATE")))

    if not os.path.exists(archive_file):
        print("querying for %i objects" % N)
        print(query_text)
        output = sql_query(query_text)
        print("finished.  Processing & saving data")
        try:
            data = np.loadtxt(output, delimiter=',', skiprows=1, dtype=dtype)
        except Exception:
            # surface the server's response if the result cannot be parsed
            raise ValueError(output.read())
        np.save(archive_file, data)
    else:
        print("data already on disk")


DATA_URL = ('http://www.astro.washington.edu/users/'
            'vanderplas/pydata/sdss_photoz.npy')
LOCAL_FILE = 'sdss_photoz.npy'
FOLDER = 'data/'

# data directory is password protected so the public can't access it
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, DATA_URL, 'pydata', 'astroML')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(handler)

# download training data
if not os.path.exists(FOLDER + LOCAL_FILE):
    print("downloading data from", DATA_URL)
    os.makedirs(FOLDER, exist_ok=True)  # make sure the target folder exists
    fhandle = opener.open(DATA_URL)
    with open(FOLDER + LOCAL_FILE, 'wb') as f:
        f.write(fhandle.read())
```

```
downloading data from http://www.astro.washington.edu/users/vanderplas/pydata/sdss_photoz.npy
```


### Importing Data

```python
import numpy as np
data = np.load('data/sdss_photoz.npy')

N = len(data)
X = np.zeros((N, 4))
X[:, 0] = data['u'] - data['g']
X[:, 1] = data['g'] - data['r']
X[:, 2] = data['r'] - data['i']
X[:, 3] = data['i'] - data['z']
z = data['redshift']
```
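
Each feature above is a color, that is, the difference between magnitudes in two adjacent bands. A minimal sketch of the same construction on a tiny synthetic structured array, written for Python 3. The magnitude values are made up for illustration; only the field names `u`, `g`, `r`, `i`, `z`, and `redshift` follow the data file used above.

```python
import numpy as np

# Hypothetical stand-in for the downloaded file: two objects, made-up values.
fields = [('u', 'f4'), ('g', 'f4'), ('r', 'f4'),
          ('i', 'f4'), ('z', 'f4'), ('redshift', 'f4')]
demo = np.array([(19.0, 18.0, 17.5, 17.25, 17.0, 0.08),
                 (18.5, 17.0, 16.25, 16.0, 15.75, 0.02)], dtype=fields)

bands = ['u', 'g', 'r', 'i', 'z']
# Stack the differences of adjacent bands as columns: u-g, g-r, r-i, i-z.
colors = np.column_stack(
    [demo[a] - demo[b] for a, b in zip(bands[:-1], bands[1:])])

print(colors.shape)
print(colors)
```

Building the columns from a list of band names avoids repeating one assignment per color and gives the same matrix layout as `X`.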

### Splitting Data

```python
Ntrain = 3 * N // 4  # integer division so Ntrain is a valid slice index
Xtrain = X[:Ntrain]
ztrain = z[:Ntrain]
Xtest = X[Ntrain:]
ztest = z[Ntrain:]

print(Xtrain)
print(ztrain)
```

```
[[ 1.13839722  0.61042595  0.27867508  0.34679604]
 [ 1.51262856  0.64428329  0.27687073  0.15073967]
 [ 1.64493179  0.83510208  0.46151543  0.41690731]
 ..., 
 [ 1.77320099  0.74706268  0.30396843  0.24538612]
 [ 1.48406982  0.72762299  0.44620895  0.2953701 ]
 [ 1.72066689  0.81639671  0.40833282  0.32231903]]
[ 0.0800357  0.0215853  0.0366892 ...,  0.121951   0.0607102  0.0433055]
```
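
The split above takes the first three quarters of the rows in file order. If the rows happen to be sorted in some way (an assumption; the source does not say either way), shuffling before splitting avoids a skewed test set. A minimal sketch using scikit-learn's `train_test_split` on placeholder arrays; the array contents and the `random_state` value are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the color matrix X and redshifts z.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
z_demo = rng.uniform(0.0, 0.3, size=100)

# test_size=0.25 keeps the same 3:1 ratio as the slice-based split.
X_tr, X_te, z_tr, z_te = train_test_split(
    X_demo, z_demo, test_size=0.25, shuffle=True, random_state=0)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffled split reproducible between runs.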


### Fitting Data

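scikit-learn's `DecisionTreeRegressor` is used to train a model on the training set and predict redshifts for the test set. The intended model is a 20-level decision tree, which corresponds to passing `max_depth=20`; when `max_depth` is not set, scikit-learn places no limit on the tree depth.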

```python
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
clf.fit(Xtrain, ztrain)
zpred = clf.predict(Xtest)
```
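
To make the depth limit explicit, `DecisionTreeRegressor` accepts a `max_depth` argument. A minimal sketch on synthetic data; the data, the train/test sizes, and the `random_state` value are placeholders, not the SDSS sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the color features and redshifts.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
z_demo = 0.1 * np.abs(X_demo[:, 0]) + 0.01 * rng.normal(size=500)

# Cap the tree at 20 levels; without max_depth there is no depth limit.
tree = DecisionTreeRegressor(max_depth=20, random_state=0)
tree.fit(X_demo[:375], z_demo[:375])
z_hat = tree.predict(X_demo[375:])

print(tree.get_depth(), z_hat.shape)
```

A depth cap is one way to limit overfitting; the value that works best depends on the data and is worth checking on a held-out set.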

### Calculating Accuracy

One metric for measuring the accuracy of the predictions is the [root-mean-square error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation).


```python
rmse = np.sqrt(np.mean((ztest - zpred) ** 2))
print('RMSE: {0}'.format(rmse))
```

```
RMSE: 0.235104463352
```
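
RMSE is the square root of the mean squared difference between true and predicted values. A small helper with a hand-checkable example; the function name `root_mean_square_error` is introduced here for illustration and is not part of the original notebook.

```python
import numpy as np

def root_mean_square_error(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Errors are 0, 0 and 2, so the mean squared error is 4/3.
value = root_mean_square_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(value)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.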
