This notebook is a very quick overview of the dataset for the PLAsTiCC Astronomical Classification competition, and intends to be a first exploration notebook to get used to the dataset.

# Table of contents
___
- [Data overview and target identification](#overview)
- [DDF areas](#ddfs)
- [Distmod and redshift](#distred)
- [Galactical versus Extragalactical sources](#galvsextra)
- [Passband](#passband)
- [Flux time plots](#fluxplots)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
#read csv
training = pd.read_csv('../input/training_set.csv')

# Data overview and target identification <a name="overview"></a>
___
First of all, let's take a look at the dataset. 

In [None]:
training.sample(5)

So there are 6 fields in total. Besides `object_id`, which identifies the object each observation belongs to, these are:
- mjd: the time in Modified Julian Date (MJD) of the observation. Can be read as days since November 17, 1858. Can be converted to Unix epoch time with the formula unix_time = (MJD−40587)×86400. 
- passband: The specific LSST passband integer, such that u, g, r, i, z, Y = 0, 1, 2, 3, 4, 5 in which it was viewed. 
- flux: the measured flux (brightness) in the passband of observation as listed in the passband column. These values have already been corrected for dust extinction (mwebv), though heavily extincted objects will have larger uncertainties (flux_err) in spite of the correction. 
- flux_err: the uncertainty on the measurement of the flux listed above. 
- detected: If 1, the object's brightness is significantly different at the 3-sigma level relative to the reference template. Only objects with at least 2 detections are included in the dataset. 
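
The MJD to Unix time formula from the list above can be sketched as a small helper. This is only a minimal illustration: the function names and the example MJD value are made up here, and leap seconds are ignored.

```python
from datetime import datetime, timezone

MJD_UNIX_EPOCH = 40587  # MJD of 1970-01-01 00:00:00 UTC


def mjd_to_unix(mjd):
    """Convert a Modified Julian Date (days) to Unix time (seconds)."""
    return (mjd - MJD_UNIX_EPOCH) * 86400.0


def mjd_to_datetime(mjd):
    """Convert a Modified Julian Date to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(mjd_to_unix(mjd), tz=timezone.utc)


example_mjd = 59580.0  # hypothetical value, not taken from the dataset
print(mjd_to_datetime(example_mjd))
```

With pandas, the same helper could be applied to the whole `mjd` column at once, since the arithmetic is element-wise.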

Besides those columns, we have another dataset, `training_set_metadata.csv`, in which we can find the following fields:

In [None]:
meta_training = pd.read_csv("../input/training_set_metadata.csv")
meta_training.sample(5)

Where each column represents:

- object_id: unique object identifier.
- ra: right ascension, sky coordinate: co-longitude in degrees. 
- decl: declination, sky coordinate: co-latitude in degrees. 
- gal_l: galactic longitude in degrees. 
- gal_b: galactic latitude in degrees. 
- ddf: A flag to identify the object as coming from the DDF survey area (with value DDF = 1 for the DDF, DDF = 0 for the WFD survey). Note that while the DDF fields are contained within the full WFD survey area, the DDF fluxes have significantly smaller uncertainties. Boolean
- hostgal_specz: the spectroscopic redshift of the source. This is an extremely accurate measure of redshift, available for the training set and a small fraction of the test set. 
- hostgal_photoz: The photometric redshift of the host galaxy of the astronomical source. While this is meant to be a proxy for hostgal_specz, there can be large differences between the two and should be regarded as a far less accurate version of hostgal_specz. 
- hostgal_photoz_err: The uncertainty on the hostgal_photoz based on LSST survey projections. 
- distmod: The distance to the source calculated from hostgal_photoz and using general relativity.
- mwebv: MW E(B-V). This ‘extinction’ of light is a property of the Milky Way (MW) dust along the line of sight to the astronomical source, and is thus a function of the sky coordinates of the source ra, decl. This is used to determine a passband dependent dimming and reddening of light from astronomical sources as described in subsection 2.1, and based on the Schlafly et al. (2011) and Schlegel et al. (1998) dust models. 
- target: The class of the astronomical source. This is provided in the training data. Correctly determining the target (correctly assigning classification probabilities to the objects) is the ‘goal’ of the classification challenge for the test data. Note that there is one class in the test set that does not occur in the training set: class_99 serves as an "other" class for objects that don't belong in any of the 14 classes in the training set. 

Let's explore a bit the target variable first. 

In [None]:
unique_targets = meta_training.target.unique()
print ("There are {} unique targets.".format(len(unique_targets)))
print (unique_targets)

Our goal, then, is to classify our objects into one of those 14 classes. Just as a reminder, there is one more class, 99, that is present in the test set but not in the training set. As the challenge description mentions, it serves as a placeholder for objects that don't belong to any of the 14 classes in the training set.

In [None]:
objects_per_target = pd.DataFrame(meta_training.groupby("target", as_index = False)["object_id"].count())
objects_per_target = objects_per_target.rename(columns = {"object_id": "num_of_objects"})
fig = plt.figure(figsize=(10,8))
sb.barplot(x =objects_per_target.target, y = objects_per_target.num_of_objects);

Most objects are from class 90, followed by classes 42, 16 and 15, while just a few are from class 53. 
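
To put numbers on the imbalance instead of reading it off the bars, the share of each class can be computed directly. The sketch below uses a made-up toy frame standing in for `meta_training`; the labels and counts are invented.

```python
import pandas as pd

# Toy stand-in for meta_training; labels and counts are made up.
toy_meta = pd.DataFrame({
    "object_id": range(10),
    "target": [90, 90, 90, 90, 42, 42, 42, 16, 15, 53],
})

# Share of objects per class, sorted from most to least frequent.
class_share = toy_meta["target"].value_counts(normalize=True)
print(class_share)
```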

Now we'll get some intuition on how those classes are spread in space.

In [None]:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
for class_target in unique_targets:
    class_used = meta_training[meta_training.target == class_target]
    ax.scatter(x = class_used.gal_l, y = class_used.gal_b, alpha = 1)
plt.xlabel("Galactical Longitude(°)", fontsize = 15)
plt.ylabel("Galactical Latitude(°)", fontsize = 15);

Apparently, there is nothing very specific in this graph. All kinds of objects can be found pretty much everywhere in the observed sky. Its strange shape is probably due to where the detectors are placed and their sky coverage. 
By comparing with the map below, I think the survey can see most of the sky above the Americas (considering the increase in angle going from left to right). Keep in mind, though, that galactic coordinates are not geographic ones, so this comparison is only a rough one.
![image.png](https://www.mapsofindia.com/worldmap/world-map-with-latitude-and-longitude.jpg)

However, it might be useful to lower alpha parameter to see if we can find spots that have a higher density of objects.

In [None]:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
for class_target in unique_targets:
    class_used = meta_training[meta_training.target == class_target]
    ax.scatter(class_used.gal_l,class_used.gal_b, alpha = .1)
ax.set_xlabel("Galactical Longitude (°)", fontsize = 15)
ax.set_ylabel("Galactical Latitude (°)", fontsize = 15);

# DDF areas <a name="ddfs"></a>
___
Now we found something peculiar. There are five spots in the sky with a high density of observed objects. These particular spots may lead to nice insights if worked on properly. They seem to lie around these intervals, given as [latitude, longitude]:
- [-60º:-57º, 165º:180º]
- [-56º:-53º, 220º:225º]
- [40º:43º, 228º:235º]
- [-53º:-51º, 315º:323º]
- [-75º:-72º, 322º:332º]

One thing worth pointing out is that four of those five clusters lie below a latitude of -40º, one of them even below -66.5º, which on Earth would correspond to the Antarctic circle (these are galactic latitudes, though, not geographic ones). I'm not sure this will be relevant at some point, but insights are insights, right? 
Now, it is worth looking at those intervals to see if we can find some kind of pattern. 

In [None]:
condition = (round(meta_training["gal_b"]).isin(range(38,48)) & round(meta_training["gal_l"]).isin(range(226,238)))\
            |(round(meta_training["gal_b"]).isin(range(-56,-52)) & round(meta_training["gal_l"]).isin(range(220,226)))\
            |(round(meta_training["gal_b"]).isin(range(-66,-56)) & round(meta_training["gal_l"]).isin(range(165,178)))\
            |(round(meta_training["gal_b"]).isin(range(-55,-45)) & round(meta_training["gal_l"]).isin(range(315,323)))\
            |(round(meta_training["gal_b"]).isin(range(-77,-65)) & round(meta_training["gal_l"]).isin(range(322,332)))
five_point = meta_training.loc[condition]

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
for class_target in unique_targets:
    class_used = five_point[five_point.target == class_target]
    ax.scatter(class_used.gal_l,class_used.gal_b, alpha = .1)
ax.set_xlabel("Galactical Longitude (°)", fontsize = 15)
ax.set_ylabel("Galactical Latitude (°)", fontsize = 15);

Great! Now we got our points. After some trial and error, I ended up not using the exact intervals I proposed before: some of them were too small and I had to tweak them a bit in order to get the whole cluster. Now let's check the class distribution to see if it's somehow different.
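
As a side note on the selection above, the same kind of rectangular cut could be written without rounding, using `Series.between`. This is only a sketch: `in_box`, the two boxes and the toy coordinates are hypothetical, not the exact intervals used in this notebook.

```python
import pandas as pd


def in_box(df, b_range, l_range):
    """Mask of rows whose (gal_b, gal_l) fall inside a rectangle (inclusive bounds)."""
    return df["gal_b"].between(*b_range) & df["gal_l"].between(*l_range)


# Hypothetical boxes as ((b_min, b_max), (l_min, l_max)), in degrees.
boxes = [((38, 47), (226, 237)), ((-56, -53), (220, 225))]

# Toy coordinates: two points inside a box, one far away from both.
toy = pd.DataFrame({"gal_b": [42.0, -54.5, 10.0], "gal_l": [230.0, 222.0, 100.0]})

mask = pd.Series(False, index=toy.index)
for b_range, l_range in boxes:
    mask |= in_box(toy, b_range, l_range)
selected = toy[mask]
print(selected)
```

Keeping the boxes in a list makes the trial-and-error tweaking a matter of editing one number at a time.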

In [None]:
objects_per_target_five = pd.DataFrame(five_point.groupby("target", as_index = False)["object_id"].count())
objects_per_target_five = objects_per_target_five.rename(columns = {"object_id": "num_of_objects"})
fig = plt.figure(figsize=(10,8))
sb.barplot(x =objects_per_target_five.target, y = objects_per_target_five.num_of_objects);
plt.xlabel("Class", fontsize = 15)
plt.ylabel("Number of sources", fontsize = 15);

Seems to me that nothing has changed much. There are proportionally fewer class 15 objects but I'm not sure if this is just some random fluctuation or an important feature. I tend to think it is the former, but anyway, just exploring the possibilities :) Further tests are needed to confirm or reject this hypothesis. 
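
One possible way to run such a test is a two-proportion z-test comparing the share of class 15 inside the clusters with its share outside them. The sketch below uses only the standard library, and the counts are invented for illustration; they are not the real numbers from this dataset.

```python
import math


def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided z-test for equality of the proportions k1/n1 and k2/n2."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Made-up counts: class 15 objects inside the clusters vs. outside them.
z, p_value = two_proportion_z_test(k1=60, n1=2000, k2=430, n2=5800)
print(z, p_value)
```

The test assumes two independent samples, which is why the comparison is inside versus outside the clusters rather than clusters versus the full dataset.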

One thing that just caught my attention while exploring, looking for possible features where patterns could be found, is the `ddf` column.  If 1, it means the object was observed from within a specific area (DDF survey area). So, maybe... Yeah, let's try that! First, the counts of ddf objects in the original meta_training dataset.

In [None]:
ddf_counts = pd.DataFrame(meta_training.groupby("ddf", as_index = False)["object_id"].count())
ddf_counts

So the proportion is about 2 ddf objects to each 5 objects detected. Now,  the clusters...

In [None]:
ddf_counts_five = pd.DataFrame(five_point.groupby("ddf", as_index = False)["object_id"].count())
ddf_counts_five

Woah! I guess the DDF survey areas were found :P  Each of the five clusters is one such distinct area. The 26 objects are there probably due to my latitude/longitude choice, and that's the same reason the other 107 ddf objects aren't there as well. Maybe there are other features that could be seen only in those regions?  The only thing we can take for granted, according to the information provided by the organizers, is that DDF fluxes have smaller uncertainties compared to the others. In my opinion, these won't turn out to be really special places except for the fact that these are the places where the DDFs are located. In [slide 13](https://project.lsst.org/groups/sac/sites/lsst.org.groups.sac/files/Brandt_DDF.pdf) it is possible to see where they are. I couldn't find one of them on that image, though. 
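
The organizers' statement about smaller DDF uncertainties could be checked by attaching the `ddf` flag to each observation and comparing a typical `flux_err` per group. A minimal sketch on invented toy data (the real check would use `training` and `meta_training`):

```python
import pandas as pd

# Toy light-curve rows and metadata; all values are invented.
toy_obs = pd.DataFrame({
    "object_id": [1, 1, 2, 2, 3, 3],
    "flux_err": [1.0, 2.0, 10.0, 14.0, 1.5, 2.5],
})
toy_meta = pd.DataFrame({"object_id": [1, 2, 3], "ddf": [1, 0, 1]})

# Attach the ddf flag to every observation, then compare median uncertainties.
median_err = (
    toy_obs.merge(toy_meta, on="object_id")
    .groupby("ddf")["flux_err"]
    .median()
)
print(median_err)
```

The median is used here instead of the mean so that a few very noisy observations do not dominate the comparison.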

While we are at the galactic coordinates of the objects, we can check whether the distances vary with them, i.e. whether objects seen in one direction are, on average, as far away as objects seen in any other direction.

In [None]:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax.scatter(meta_training.gal_l,meta_training.gal_b, c = meta_training.distmod, s = 7, cmap = 'Reds');
ax.set_xlabel("Galactical Longitude (°)", fontsize = 15)
ax.set_ylabel("Galactical Latitude (°)", fontsize = 15);

The distances seem to be pretty random, which seems to agree with the isotropy of the universe (it looks the same in all directions). But although they are the same in about every direction, perhaps there is a difference between the counts of close and far sources. Are there more sources close to us or far from us?

# Distmod and redshift <a name="distred"></a>
___

In [None]:
fig = plt.figure(figsize=(10,8))
sb.distplot(meta_training[~np.isnan(meta_training.distmod)].distmod);
# plt.xlabel("Distance to the source")
plt.ylabel("Frequency", fontsize = 15);
plt.xlabel("Distmod", fontsize = 15);

So the distribution of source distances is (close to) bimodal, with peaks around 41 and 47. Now, is there any relationship between the observed redshift and the distance of the sources?
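
To get a feel for what these distmod values mean, they can be turned into distances. The sketch below assumes that `distmod` is a standard distance modulus, i.e. distmod = 5 log10(d / 10 pc); that interpretation and the helper name are assumptions, not something stated in the column description above.

```python
def distmod_to_mpc(distmod):
    """Distance in megaparsecs for a distance modulus.

    Assumes distmod = 5 * log10(d / 10 pc), so d = 10 ** (distmod / 5 - 5) Mpc.
    """
    return 10 ** (distmod / 5 - 5)


# The two approximate peaks of the distribution above.
for peak in (41, 47):
    print(peak, distmod_to_mpc(peak))
```

Under that assumption the two peaks would sit at roughly 1.6 and 25 gigaparsecs.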

In [None]:
meta_training.distmod.corr(meta_training.hostgal_photoz)

Well, that's a fairly high correlation. Let's plot to visually confirm what's going on here:

In [None]:
fig = plt.figure(figsize=(10,8))
plt.scatter(meta_training.distmod, meta_training.hostgal_photoz, s = 1);
plt.xlabel("Distmod", fontsize = 15);
plt.ylabel("Host Galaxy Photometric Redshift", fontsize = 15);


Yeah, there is definitely a strong, non-linear relationship here, which is expected since, according to the data description, distmod is calculated from hostgal_photoz. The farther a source (and thus its host galaxy) is, the larger its [redshift](https://en.wikipedia.org/wiki/Redshift). This indicates that the farther the sources (and host galaxies) are, the faster they are moving away from us. I decided to go with the photoz variable rather than specz because specz is not available for most of the test set; although photoz is far less accurate according to the data description, it seems to be more useful.  Moreover, we should also see how the error in hostgal_photoz behaves.

In [None]:
fig = plt.figure(figsize=(10,8))
plt.scatter(meta_training.distmod, meta_training.hostgal_photoz_err, s = 1);
plt.xlabel("Distmod", fontsize = 15);
plt.ylabel("Host Galaxy Photometric Redshift Error", fontsize = 15);

Seems like most errors are small, but there certainly are some big ones, of the same order of magnitude as the measurements themselves. Although not perfect, that's what we've got to work with.

Just to be sure, we can also plot specz as a function of the distance to the source.

In [None]:
fig = plt.figure(figsize=(10,8))
plt.scatter(meta_training.distmod, meta_training.hostgal_specz, s = 1);
plt.xlabel("Distmod", fontsize = 15);
plt.ylabel("Host Galaxy Spectroscopic Redshift", fontsize = 15);

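Indeed, although there is some resemblance to photoz, there are a lot of points spread out and also a cluster on the bottom edge of the scatterplot showing objects with a very low spectroscopic redshift despite a large distmod. Since distmod is calculated from hostgal_photoz, the most likely explanation is that the photometric redshift of these objects is far off from the (much more accurate) spectroscopic one, rather than the sources moving slower than their host galaxies. 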
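
One way to quantify how often photoz and specz disagree is to flag objects whose scaled difference |photoz - specz| / (1 + specz) exceeds some threshold. The sketch below runs on invented values, and the 0.15 cut is an assumption chosen for illustration, not something defined by the challenge.

```python
import pandas as pd

# Toy redshifts; all values are invented.
toy = pd.DataFrame({
    "hostgal_specz": [0.10, 0.50, 0.05, 1.00],
    "hostgal_photoz": [0.12, 0.48, 1.90, 1.05],
})

# Difference between the two estimates, scaled by (1 + specz).
scaled_diff = (toy.hostgal_photoz - toy.hostgal_specz).abs() / (1 + toy.hostgal_specz)

THRESHOLD = 0.15  # assumed cut, not part of the dataset description
toy["photoz_outlier"] = scaled_diff > THRESHOLD
print(toy)
```

On the real metadata, the mean of the `photoz_outlier` column would give the fraction of badly estimated photometric redshifts in the training set.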

# Galactical versus Extragalactical  Sources <a name="galvsextra"></a>
___
It is mentioned in the [starter kit](https://www.kaggle.com/michaelapers/the-plasticc-astronomy-starter-kit) and also used in [Kyle's notebook](https://www.kaggle.com/kyleboone/naive-benchmark-galactic-vs-extragalactic) that galactic and extragalactic sources are easily separated, and that the organizers simulated this feature by setting the redshift of galactic sources to exactly `0`. I think it is worth checking whether these two source types differ regarding the other aspects we've seen so far.

In [None]:
#first we should split the dataset between galactic and extragalactic sources
galactic = meta_training[meta_training.hostgal_photoz == 0]
galactic.sample(5)

Two key points call for attention here:
- Even though photoz is just a proxy for specz, they are both zero for galactic sources;
- distmod is `NaN` for all of them.

Now, let's define and check the extragalactic subset

In [None]:
extragalactic = meta_training[meta_training.hostgal_photoz != 0]
extragalactic.sample(5)

It is also possible to check that there are no null values for extragalactic `distmod`. Hence the nulls in this variable are neither simulation errors nor missing at random. This can be confirmed in the starter kit as well, a few paragraphs below figure 27, where the authors say "For galactic objects with `hostgal_photoz = 0` , the distmod is reported as `NaN` (the distance would be 0, and taking the logarithm of 0 is a bad idea)." I missed that part when I first read their notebook (it's a loooong notebook :P) and only checked it after finding this here, so I decided to keep this analysis for people who might miss this information just like I did. 
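
The claim that `distmod` is missing exactly for the galactic sources can be verified in one comparison. A minimal sketch on an invented toy frame (on the real data, `toy_meta` would be `meta_training`):

```python
import numpy as np
import pandas as pd

# Toy metadata; values are invented.
toy_meta = pd.DataFrame({
    "hostgal_photoz": [0.0, 0.0, 0.35, 1.2],
    "distmod": [np.nan, np.nan, 41.3, 44.8],
})

is_galactic = toy_meta["hostgal_photoz"] == 0
distmod_missing = toy_meta["distmod"].isna()

# True only if distmod is NaN for every galactic row and for no other row.
pattern_holds = bool((is_galactic == distmod_missing).all())
print(pattern_holds)
```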

We've just got our galactic and extragalactic datasets, so let's check their differences!

In [None]:
#for convenience while making the plots, I'll create a variable to reference the galactic/extragalactic feature. 1 for extragalactic and 0 for galactic
meta_training['extra'] =  0
meta_training.loc[meta_training.hostgal_photoz != 0, 'extra'] = 1

# plt.hist(meta_training[meta_training.extra == 0].target, label = 'Galactic')

meta_training['target'] = meta_training['target'].astype('category',copy=False)

grid = sb.FacetGrid(data = meta_training, hue = 'extra', height =7)
grid.map(sb.countplot, 'target')
for ax in grid.axes.ravel():
    ax.legend()
plt.xlabel("Class", fontsize = 15)
plt.ylabel("Counts", fontsize = 15);

Excellent! When we are only dealing with galactic sources, we'll effectively be labeling five classes, and 9 classes when dealing with extragalactic sources. Note that class overlaps between different source types do not occur.
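
The absence of overlap can also be checked numerically with a cross-tabulation instead of reading it off the plot. The frame below is a made-up stand-in for `meta_training` with its `extra` column.

```python
import pandas as pd

# Toy stand-in for meta_training; rows are invented.
toy_meta = pd.DataFrame({
    "target": [92, 92, 65, 16, 90, 90, 42, 52],
    "extra": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Rows are classes, columns are galactic (0) / extragalactic (1) counts.
counts = pd.crosstab(toy_meta["target"], toy_meta["extra"])

# Classes that appear in both groups; an empty list means no overlap.
overlapping = counts[(counts > 0).all(axis=1)].index.tolist()
print(counts)
print(overlapping)
```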

In [None]:
fig = plt.figure(figsize=(10,8))
plt.scatter(x = galactic.gal_l, y = galactic.gal_b, alpha = 0.5, label = 'galactic')
plt.scatter(x = extragalactic.gal_l, y = extragalactic.gal_b, alpha = 0.1, label = 'extragalactic');
plt.xlabel("Galactic longitude (°)", fontsize = 15)
plt.ylabel("Galactic latitude (º)", fontsize = 15)
plt.legend();

I set different alpha parameters so we can superpose both sets of points and visualize the result. As expected, both types of objects can be seen in all directions (isotropic universe). Although it does not look like it in this plot, galactic sources are also detected in the DDF regions; the only reason extragalactic points stand out there is that the orange points are plotted right on top of the blue ones. If the scatterplots are drawn in different subplots, it is possible to see both blue and orange points there. 

Now that we've talked about the meta training data, let's see the behaviour of the training data. Before going further on this analysis, I think it will be helpful to merge both datasets so we can see some of the training fields from the perspective of a few meta training fields in case we want to. 

In [None]:
merged = training.merge(meta_training, on = "object_id")
merged.sample(5)
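
Since every observation should match exactly one metadata row, the merge can be made to fail loudly if that assumption is ever violated. A hedged sketch on toy frames, using the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Toy observations and metadata; values are invented.
toy_obs = pd.DataFrame({"object_id": [1, 1, 2], "flux": [5.0, 6.0, -1.0]})
toy_meta = pd.DataFrame({"object_id": [1, 2], "target": [92, 52]})

# validate="many_to_one" raises if object_id is duplicated in the metadata.
toy_merged = toy_obs.merge(
    toy_meta, on="object_id", how="left", validate="many_to_one"
)

# A left merge with unique metadata keys must keep the number of observations.
assert len(toy_merged) == len(toy_obs)
print(toy_merged)
```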

# Passband <a name="passband"></a>
___ 
Before going into the flux plots, which I think are one of the most interesting and important parts of the dataset, I'd like to make a short analysis of the passband filters, since they will be seen in the flux time plots as well.

First, let's see the use of passband filters. Is any one of them used more often than the others?

In [None]:
grid = sb.FacetGrid(data = merged, height =7)
grid.map(sb.countplot, 'passband')
#no hue is used here, so there is no legend to draw
plt.xlabel("Passband", fontsize = 15)
plt.ylabel("Counts", fontsize = 15);

Let's remember the filters used:  {u→0, g→1, r→2, i→3, z→4, y→5} where "u  covers the ultraviolet,  g  covers what your eye perceives as blue/green,  r  covers red,  i  covers the infrared". The z and y filters don't have any particular reason to be named that way, and further information on this can be found [here](http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?2008JAVSO..36..110M&data_type=PDF_HIGH&whole_paper=YES&type=PRINTER&filetype=.pdf).  

Regarding the plot, we can see that filters z and y are actually the most used, while the green/blue filter is the least used one.  So in the following plots, we'll see these two filters being used more often.
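
To make plots and counts easier to read, the passband integers can be mapped to their filter letters. The mapping follows the encoding given in the field description; the `band` column name and the toy rows are just for illustration.

```python
import pandas as pd

# Encoding from the field description: u, g, r, i, z, y = 0, 1, 2, 3, 4, 5.
PASSBAND_NAMES = {0: "u", 1: "g", 2: "r", 3: "i", 4: "z", 5: "y"}

# Toy observations; values are invented.
toy = pd.DataFrame({"passband": [0, 5, 2, 2, 4]})
toy["band"] = toy["passband"].map(PASSBAND_NAMES)

band_counts = toy["band"].value_counts()
print(band_counts)
```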

# Flux time plots <a name="fluxplots"></a>
___
I'll use only one object_id to get a feel of the data.

In [None]:
unique_sources = merged.object_id.unique()
obj_data = merged[merged.object_id == unique_sources[0]]
obj_data.head(5)

The chosen object (615) is a galactic source in the DDF area and belongs to class 92. Let's see how the flux in each passband varies with time (here measured  in mjd). 

In [None]:
unique_passbands = merged.passband.unique()

fig = plt.figure(figsize=(10,8))
for passband in unique_passbands:
    specific_pb = obj_data[obj_data.passband == passband]
    plt.scatter(specific_pb.mjd, specific_pb.flux, label = passband)
plt.xlabel("MJD (in days from November 17, 1858)", fontsize = 15)
plt.ylabel("Flux", fontsize = 15)
plt.legend();

This doesn't seem very helpful. We can see that the data is taken in three time windows. Maybe we should focus on one of those windows to get a better understanding here. 
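
Rather than reading the window limits off the plot, they could be found programmatically by splitting the observation times wherever there is a long gap. A minimal sketch: the helper name, the 50-day gap and the toy MJD values are all assumptions.

```python
import numpy as np


def split_into_windows(mjd, min_gap=50.0):
    """Split observation times into windows separated by gaps longer than min_gap days."""
    mjd = np.sort(np.asarray(mjd, dtype=float))
    # Positions where the next observation comes after a long gap.
    breaks = np.where(np.diff(mjd) > min_gap)[0] + 1
    return np.split(mjd, breaks)


# Toy observation times; values are invented.
toy_mjd = [59750, 59752, 59760, 60100, 60110, 60120, 60500, 60501]
windows = split_into_windows(toy_mjd)
bounds = [(w[0], w[-1]) for w in windows]
print(bounds)
```

Each pair in `bounds` could then replace the hard-coded limits in a filter such as `(obj_data.mjd > start) & (obj_data.mjd < end)`.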

In [None]:
window_objdata = obj_data[(obj_data.mjd > 60100) & (obj_data.mjd<60300)]
fig = plt.figure(figsize=(15,8))
for passband in unique_passbands:
    specific_pb = window_objdata[window_objdata.passband == passband]
    plt.plot(specific_pb.mjd, specific_pb.flux, label = passband)
plt.xlabel("MJD (in days from November 17, 1858)", fontsize = 15)
plt.ylabel("Flux", fontsize = 15)
plt.legend();

The peaks here do seem to occur at the same time, especially after the third one, but before that everything looks messy. The starter kit's authors mentioned that different sources show different behaviours in flux, so let's confirm that by looking at another class 92 object and, after that, two objects from a different class.

In [None]:
#get objects from class 92
class_92_objs = merged[merged.target == 92]
#unique objects from class 92
unique_sources_92 = class_92_objs.object_id.unique()
#get data from one specific object from class 92 (element 0 is the same as before)
obj_data = merged[merged.object_id == unique_sources_92[4]]
#get one time window to observe and plot
window_objdata = obj_data[(obj_data.mjd > 60100) & (obj_data.mjd<60300)]
fig = plt.figure(figsize=(15,8))
for passband in unique_passbands:
    specific_pb = window_objdata[window_objdata.passband == passband]
    plt.plot(specific_pb.mjd, specific_pb.flux, label = passband)
plt.xlabel("MJD (in days from November 17, 1858)", fontsize = 15)
plt.ylabel("Flux", fontsize = 15)
plt.legend();

Now let's do the same for another class, say 52.

In [None]:
#get objects from class 52
class_52_objs = merged[merged.target == 52]
#unique objects from class 52
unique_sources_52 = class_52_objs.object_id.unique()
#get data from one specific object from class 52
obj_data = merged[merged.object_id == unique_sources_52[0]]
#get one time window to observe and plot
window_objdata = obj_data[(obj_data.mjd > 60100) & (obj_data.mjd<60300)]
fig = plt.figure(figsize=(15,8))
for passband in unique_passbands:
    specific_pb = window_objdata[window_objdata.passband == passband]
    plt.plot(specific_pb.mjd, specific_pb.flux, label = passband)
plt.xlabel("MJD (in days from November 17, 1858)", fontsize = 15)
plt.ylabel("Flux", fontsize = 15)
plt.legend();

In [None]:
#get data from one specific object from class 52
obj_data = merged[merged.object_id == unique_sources_52[2]]
#get one time window to observe and plot
window_objdata = obj_data[(obj_data.mjd > 60100) & (obj_data.mjd<60300)]
fig = plt.figure(figsize=(15,8))
for passband in unique_passbands:
    specific_pb = window_objdata[window_objdata.passband == passband]
    plt.plot(specific_pb.mjd, specific_pb.flux, label = passband)
plt.xlabel("MJD (in days from November 17, 1858)", fontsize = 15)
plt.ylabel("Flux", fontsize = 15)
plt.legend();

So class 52 seems to have a very irregular flux pattern. I tested a few others from this class just to be sure, and the plots are very similar. Note that class 52 is extragalactic and class 92 is galactic. So maybe there is some kind of pattern? Perhaps galactic objects have a more regular structure that can be seen in a simple time plot of their flux, while extragalactic objects are more irregular, and to find a characteristic time scale we would have to use more advanced methods, such as the Lomb-Scargle periodogram suggested in the starter kit. 
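
To give an idea of what such a periodogram does, here is a minimal NumPy sketch of the classical Lomb-Scargle power for irregularly sampled data. It is not the starter kit's code: the function name, the synthetic sinusoid and the period grid are assumptions, and a real analysis would more likely rely on an existing library implementation.

```python
import numpy as np


def lomb_scargle_power(t, y, periods):
    """Classical (unnormalized) Lomb-Scargle power for each trial period."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()  # the classical formula assumes zero-mean data
    power = np.empty(len(periods))
    for k, period in enumerate(periods):
        omega = 2.0 * np.pi / period
        # Time offset that makes the sine and cosine terms orthogonal.
        tau = np.arctan2(np.sin(2 * omega * t).sum(),
                         np.cos(2 * omega * t).sum()) / (2 * omega)
        c = np.cos(omega * (t - tau))
        s = np.sin(omega * (t - tau))
        power[k] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power


# Synthetic, irregularly sampled signal with a 12.5-day period plus noise.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 200, 150))
y = 3.0 * np.sin(2 * np.pi * t / 12.5) + rng.normal(0, 0.3, t.size)

periods = np.linspace(5, 30, 501)
power = lomb_scargle_power(t, y, periods)
best_period = periods[np.argmax(power)]
print(best_period)
```

For a real light curve, `t` and `y` would be the `mjd` and `flux` values of one object in one passband, and the period grid would need to be chosen with the sampling of the survey in mind.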