In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
DATA_PATH='../input/'

In [None]:
df_train = pd.read_csv(DATA_PATH + 'training_set.csv')
df_train_meta = pd.read_csv(DATA_PATH + 'training_set_metadata.csv')

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
df_train.info()

In [None]:
df_train_meta.head()

In [None]:
df_train_meta.describe()

In [None]:
df_train_meta.info()

# EDA

### Competition Details
We have astronomical time series data. These simulated time series, or ‘light curves’, are measurements of an object’s brightness as a function of time, obtained by measuring the photon flux in six different astronomical filters (commonly referred to as passbands).

**We have to classify each object into one of 15 classes: 14 appear in the training set, while the 15th is the *otherwise* category for objects that belong to none of them.**

An effect that shows up throughout the observations is *redshift*. Light from a source that is moving away from us is stretched to longer (redder) wavelengths, either through the Doppler effect or through the expansion of the Universe, and distant sources also appear fainter. Dust along the line of sight dims and reddens light as well, but that is a separate effect (extinction), captured by `mwebv` in the metadata. It is necessary to take redshift into consideration during classification.

This competition uses a **weighted log loss metric**, which means one cannot afford to make wrong predictions with high confidence. That makes the problem more interesting.
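
To see why confident mistakes are expensive, here is a minimal sketch of a class-weighted multi-class log loss. It is not the official scoring code: the function name, the `class_weights` argument and the choice to skip classes that are absent from `y_true` are all assumptions made for illustration.

```python
import numpy as np

def weighted_log_loss(y_true, y_prob, class_weights, eps=1e-15):
    """Class-weighted multi-class log loss (illustrative sketch).

    y_true        : (n,) integer class indices in 0..K-1
    y_prob        : (n, K) predicted class probabilities
    class_weights : (K,) weights, assumed to be supplied by the caller
    """
    y_true = np.asarray(y_true)
    # Clip so log(0) never occurs, then renormalise each row to sum to 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    weights = np.asarray(class_weights, dtype=float)

    n_classes = y_prob.shape[1]
    onehot = np.eye(n_classes)[y_true]
    n_per_class = onehot.sum(axis=0)

    # Mean log-probability assigned to the true class, computed per class.
    per_class = (onehot * np.log(y_prob)).sum(axis=0) / np.maximum(n_per_class, 1)

    # Assumption: classes with no samples do not contribute to the score.
    present = n_per_class > 0
    return -(weights[present] * per_class[present]).sum() / weights[present].sum()

# A confident wrong answer costs far more than a confident right one.
right = weighted_log_loss([0, 1], [[0.99, 0.01], [0.01, 0.99]], [1, 1])
wrong = weighted_log_loss([0, 1], [[0.01, 0.99], [0.99, 0.01]], [1, 1])
print(right, wrong)
```

Averaging within each class before weighting keeps rare classes from being drowned out by frequent ones, which is the usual motivation for this kind of metric.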





### Dataset Introduction

**Training data**
* `object_id` -> the Object ID, unique identifier (given as int32 numbers).
* `mjd` -> The time in Modified Julian Date (MJD) of the observation.  The MJD is a unit of time introduced by the Smithsonian Astrophysical Observatory in 1957 to record the orbit of Sputnik.
* `passband` -> The specific LSST passband integer, such that u,g,r,i,z,y = 0, 1, 2, 3, 4, 5 in which it was viewed. 
* `flux` -> the measured flux (brightness) in the passband of observation as listed in the passband column. The flux is corrected for MWEBV, but for large dust extinctions the uncertainty will be much larger in spite of the correction.
* `flux_err` -> the uncertainty on the measurement of the flux listed above, given as float32 numbers.
* `detected` -> If detected = 1, the object’s brightness is significantly different at the 3σ level relative to the reference template. This is given as a Boolean flag.
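
The `mjd` and `passband` columns are easier to read once converted. The sketch below assumes the standard MJD definition (day zero at 1858-11-17 00:00) and uses the passband mapping from the list above; the helper name and the added column names `obs_time` and `passband_name` are hypothetical.

```python
import pandas as pd

# MJD counts days from 1858-11-17 00:00 (standard definition).
MJD_EPOCH = pd.Timestamp("1858-11-17")
# Passband integers as listed in the column description.
PASSBAND_NAMES = {0: "u", 1: "g", 2: "r", 3: "i", 4: "z", 5: "y"}

def add_readable_columns(df):
    """Return a copy with a calendar timestamp and a filter letter per row."""
    out = df.copy()
    out["obs_time"] = pd.to_datetime(out["mjd"], unit="D", origin=MJD_EPOCH)
    out["passband_name"] = out["passband"].map(PASSBAND_NAMES)
    return out

# Toy rows, not taken from the competition data.
toy = pd.DataFrame({"mjd": [51544.0, 51544.5], "passband": [0, 5]})
readable = add_readable_columns(toy)
print(readable)
```

On the real data this could be called as `add_readable_columns(df_train)`.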

**Training metadata**
* `object_id` -> the Object ID, unique identifier
* `ra` ->  right ascension, sky coordinate:  longitude, units are degrees (given as float32 numbers).
* `decl` -> declination, sky coordinate: latitude, units are degrees 
* `gal_l` -> Galactic longitude, units are degrees (given as float32 numbers).
* `gal_b` -> Galactic latitude, units are degrees (given as float32 numbers).
* `ddf` -> A Boolean flag to identify the object as coming from the DDF survey area (with value ddf = 1 for the DDF). Note that while the DDF fields are contained within the full WFD survey area, the DDF fields have significantly smaller uncertainties, given that the data are provided as additions of all observations in a given night.
* `hostgal_specz` -> The spectroscopic redshift of the source. This is an extremely accurate measure of redshift, provided for the training set and a small fraction of the test set.
* `hostgal_photoz` ->  The photometric redshift of the host galaxy of the astronomical source.  While this is meant to be a proxy for `hostgal_specz`, there can be large differences between the two and `hostgal_photoz` should be regarded as a far less accurate version of `hostgal_specz`.  The `hostgal_photoz` is given as float32 numbers.
* `hostgal_photoz_err` -> The uncertainty on the `hostgal_photoz` based on LSST survey projections, given as float32 numbers.
* `distmod` -> The distance (modulus) calculated from the `hostgal_photoz`, since this redshift is given for all objects (given as float32 numbers). Computing the distance modulus requires knowledge of General Relativity, and assumed values of the dark energy and dark matter content of the Universe, as mentioned in the introduction section.
* `mwebv` -> equivalent to MW E(B-V). This ‘extinction’ of light is a property of the Milky Way (MW) dust along the line of sight to the astronomical source, and is thus a function of the sky coordinates of the source `ra`, `decl`. This is used to determine a passband dependent dimming and reddening of light from astronomical sources as described in subsection 2.1, and is given as float32 numbers.
* `target` -> The class of the astronomical source. This is provided in the training data. Correctly determining the target (correctly assigning classification probabilities to the objects) is the goal of the classification challenge for the test data. The target is given as int8 numbers.
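
For intuition about `distmod`: a distance modulus μ is defined as μ = 5 log10(d / 10 pc), where d is the luminosity distance. The sketch below only inverts that definition; it does not reproduce the cosmological calculation used to derive `distmod` from `hostgal_photoz`, and the function name is hypothetical.

```python
import numpy as np

def distmod_to_parsec(distmod):
    """Invert mu = 5 * log10(d / 10 pc) to get the luminosity distance in parsec."""
    return 10.0 ** (np.asarray(distmod, dtype=float) / 5.0 + 1.0)

# Each step of 5 in distance modulus is a factor of 10 in distance.
print(distmod_to_parsec([0.0, 5.0, 10.0]))
```

Missing values pass through unchanged, so `distmod_to_parsec(df_train_meta.distmod)` keeps `NaN` where `distmod` is not defined.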

**Caveats**
* Galactic vs extragalactic
* Data Gaps
* Negative flux

### Questions and Intuitions
* `flux` and `flux_err` distribution with respect to passbands
* How many passbands have `detected` set to True?
* Is `detected` True for every passband of an object?
* Find transient and variable targets.
* `flux` - `mjd` distribution
* `flux` and `flux_err` distribution with respect to `mjd` and passbands
* Correlation between `ra`, `decl` and `gal_l`, `gal_b`
* Effect of `ddf` on `hostgal_photoz_err` and `flux_err`
* Correlation between `hostgal_specz`, `mwebv` and `hostgal_photoz`
* Check the frequency of negative flux, the distribution of negative flux and the corresponding target frequencies.
* Find the distribution of galactic and extragalactic targets
* Detected-target distribution
### Let the EDA begin

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train['flux'], ax=ax)
plt.show()

In [None]:
df_train.flux.describe()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train['flux_err'], ax=ax)
plt.show()

In [None]:
df_train.flux_err.describe()

In [None]:
df_train[(df_train.flux> 1000) | (df_train.flux < -1000)].describe()

In [None]:
sns.distplot(df_train[(df_train.flux < 1000) & (df_train.flux > -1000)].flux)

In [None]:
sns.countplot(x=df_train[(df_train.flux < 1000) & (df_train.flux > -1000)].detected)

In [None]:
sns.countplot(x=df_train[(df_train.flux > 250) | (df_train.flux < -250)].detected)

In [None]:
sns.distplot(df_train[(df_train.flux < 100) & (df_train.flux > -100)].flux)

In [None]:
# Observations with a small flux error (the original condition matched every row)
sns.countplot(x=df_train[df_train.flux_err <= 100].detected)

In [None]:
# Observations with a large flux error
sns.countplot(x=df_train[df_train.flux_err > 100].detected)

In [None]:
sns.heatmap(df_train[['flux', 'flux_err']].corr(), annot=True)

1. Most flux values lie between -250 and 250.
2. Most observations with a flux magnitude above 250 have `detected` = 1, while most with a smaller flux have `detected` = 0. Hence the detected-target distribution should be interesting.
3. Within that range, most flux values lie between -100 and 100.
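
The observations above are read off the plots. A small table can put numbers on them; the sketch below uses toy data, and the helper name and output column names are hypothetical.

```python
import pandas as pd

def flux_range_summary(df, thresholds=(100, 250, 1000)):
    """Share of rows with |flux| below each threshold, and detection rates."""
    abs_flux = df["flux"].abs()
    rows = []
    for t in thresholds:
        inside = abs_flux < t
        rows.append({
            "threshold": t,
            "share_inside": inside.mean(),
            "detected_rate_inside": df.loc[inside, "detected"].mean(),
            "detected_rate_outside": df.loc[~inside, "detected"].mean(),
        })
    return pd.DataFrame(rows).set_index("threshold")

# Toy rows, not taken from the competition files.
toy = pd.DataFrame({"flux": [-5.0, 20.0, 300.0, -1200.0], "detected": [0, 0, 1, 1]})
summary = flux_range_summary(toy)
print(summary)
```

Running `flux_range_summary(df_train)` would show how much of the training data each threshold actually covers.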

In [None]:
passbands = [0, 1,2,3,4,5]
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(18,10))
i = 0
for row in ax:
    for col in row:
        sns.distplot(df_train[(df_train.passband == passbands[i]) & (df_train.flux < 250) & (df_train.flux > -250)]['flux'], ax=col, axlabel='flux distribution of passband ' + str(i))
        i += 1
plt.show()
        

In [None]:
sns.countplot(x='detected', data=df_train, hue='passband')
plt.show()

In [None]:
df_train.groupby(['passband']).count()

1. Passbands 4 and 5 have comparatively wider distributions.
2. Passband 2 has the most concentrated distribution.
3. Passbands 1, 3 and 4 look very similar.
4. The number of observations differs between passbands, which suggests that the passbands are sampled unevenly in the training set.
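
Width and sample size per passband can also be compared numerically. This is a sketch on toy data; the helper names and the choice of standard deviation and interquartile range as spread measures are assumptions.

```python
import pandas as pd

def iqr(series):
    """Interquartile range, a spread measure that is robust to outliers."""
    return series.quantile(0.75) - series.quantile(0.25)

def passband_summary(df):
    """Observation count and flux spread for each passband."""
    return df.groupby("passband")["flux"].agg(n_obs="count", flux_std="std", flux_iqr=iqr)

# Toy rows, not taken from the competition files.
toy = pd.DataFrame({
    "passband": [0, 0, 0, 1, 1, 1, 1],
    "flux": [1.0, 1.0, 1.0, -10.0, 0.0, 10.0, 20.0],
})
summary = passband_summary(toy)
print(summary)
```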


In [None]:
ts_lens = df_train.groupby(['object_id', 'passband']).size()
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(ts_lens, ax=ax)
ax.set_title('distribution of time series lengths')
plt.show()

In [None]:
passbands = [0, 1,2,3,4,5]
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(18,10))
i = 0
for row in ax:
    for col in row:
        sns.distplot(df_train[df_train.passband == i].groupby(['object_id']).size(), ax=col, axlabel='timeseries distribution of passband ' + str(i))
        i += 1
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train['mjd'], ax=ax, bins=200)
ax.set_title('number of observations made at each time point')
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train[df_train['object_id'] == 713]['mjd'], ax=ax, bins=200)
ax.set_title('number of observations made at each time point for object 713')
plt.show()

In [None]:
passbands = [0, 1,2,3,4,5]
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(22,10))
i = 0
for row in ax:
    for col in row:
        # Filter on the passband of the current panel (was hard-coded to 3)
        sns.distplot(df_train[(df_train['object_id'] == 713) & (df_train['passband'] == i)]['mjd'], ax=col, bins=200)
        col.set_title('number of observations made at each time point by passband ' +  str(i))
        i += 1
plt.show()

In [None]:

f, ax = plt.subplots(figsize=(12, 9))
ax.scatter(x='mjd', y='flux', data=df_train.groupby(['mjd']).mean().reset_index())
ax.scatter(x='mjd', y='flux_err', data=df_train.groupby(['mjd']).mean().reset_index())
ax.legend(['flux', 'flux error'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 9))
ax.scatter(x='mjd', y='flux', data=df_train[df_train.object_id==713].groupby(['mjd']).mean().reset_index())
ax.scatter(x='mjd', y='flux_err', data=df_train[df_train.object_id==713].groupby(['mjd']).mean().reset_index())
ax.legend(['flux', 'flux error'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 9))
ax.scatter(x='mjd', y='flux', data=df_train[df_train.object_id==615].groupby(['mjd']).mean().reset_index())
ax.scatter(x='mjd', y='flux_err', data=df_train[df_train.object_id==615].groupby(['mjd']).mean().reset_index())
ax.legend(['flux', 'flux error'])
plt.show()

In [None]:
objects = df_train.object_id.unique()
random_id = np.random.randint(0, len(objects), 12)
fig, ax = plt.subplots(nrows=4, ncols=3, figsize=(18,10))
i = 0
for row in ax:
    for col in row:
        col.scatter(x='mjd', y='flux', data=df_train[df_train.object_id==objects[random_id[i]]].groupby(['mjd']).mean().reset_index())
        col.scatter(x='mjd', y='flux_err', data=df_train[df_train.object_id==objects[random_id[i]]].groupby(['mjd']).mean().reset_index())
        col.legend(['flux', 'flux error'])
        i += 1
plt.show()

1. The time series are unevenly sampled.
2. There are significant gaps in the data.
3. The average flux of an object is either tightly clustered or widely scattered along the vertical axis.
4. Flux errors are usually clustered, and the distributions of `flux` and `flux_err` might be useful for prediction later.
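
The gaps mentioned above can be measured directly as the time between consecutive observations of the same object. This is a sketch on toy data; the helper name and the `gap_days` column are hypothetical.

```python
import pandas as pd

def observation_gaps(df):
    """Days between consecutive distinct observation times of each object."""
    times = (df[["object_id", "mjd"]]
             .drop_duplicates()
             .sort_values(["object_id", "mjd"]))
    times["gap_days"] = times.groupby("object_id")["mjd"].diff()
    # The first observation of each object has no predecessor, so drop it.
    return times.dropna(subset=["gap_days"])

# Toy rows, not taken from the competition files.
toy = pd.DataFrame({
    "object_id": [1, 1, 1, 2, 2],
    "mjd": [100.0, 101.0, 250.0, 100.0, 400.0],
})
gaps = observation_gaps(toy)
print(gaps.groupby("object_id")["gap_days"].max())
```

The largest gap per object is a quick way to spot long seasonal breaks in a light curve.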

In [None]:
sns.heatmap(df_train[['mjd', 'detected']].corr(), annot=True)
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(df_train_meta[['ra', 'decl', 'gal_l', 'gal_b']].corr(), annot=True, ax=ax)
plt.show()

All four columns are sky coordinates, so some correlation is expected: `gal_l` appears to be negatively correlated with `decl`, and `ra` positively correlated with `gal_b`.

In [None]:
f, ax = plt.subplots(figsize=(21, 9))
ax.scatter(df_train_meta.object_id, df_train_meta.hostgal_specz)
ax.scatter(df_train_meta.object_id, df_train_meta.hostgal_photoz)
ax.legend(['specz redshift', 'photoz redshift'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(21, 9))
ax.scatter(df_train_meta.object_id, df_train_meta.hostgal_photoz_err)
ax.scatter(df_train_meta.object_id, df_train_meta.hostgal_photoz)
ax.legend(['photoz redshift error', 'photoz redshift'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 9))
sns.distplot(df_train_meta.hostgal_photoz)
sns.distplot(df_train_meta.hostgal_specz)
ax.legend(['photoz redshift', 'specz redshift'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(21, 9))
sns.distplot(df_train_meta.hostgal_photoz)
sns.distplot(df_train_meta.hostgal_photoz_err)
ax.legend(['photoz redshift',  'photoz redshift error'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(df_train_meta[['hostgal_specz', 'hostgal_photoz', 'hostgal_photoz_err']].corr(), annot=True, ax=ax)
plt.show()

In [None]:
sns.countplot(x=df_train_meta.target)
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
sns.countplot(x=df_train_meta[df_train_meta['hostgal_photoz'] == 0].target, ax=ax[0])
sns.countplot(x=df_train_meta[df_train_meta['hostgal_photoz'] != 0].target, ax=ax[1])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 9))
sns.countplot(x='target', hue='ddf', data=df_train_meta, ax=ax)
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train_meta[df_train_meta.ddf==1].hostgal_photoz)
sns.distplot(df_train_meta[df_train_meta.ddf==0].hostgal_photoz)
ax.legend(['Redshift on DDF Survey Area',  'Redshift outside DDF survey Area'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train_meta[df_train_meta.ddf==1].hostgal_photoz_err)
sns.distplot(df_train_meta[df_train_meta.ddf==0].hostgal_photoz_err)
ax.legend(['Redshift error on DDF Survey Area',  'Redshift error outside DDF survey Area'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train_meta.hostgal_photoz)
sns.distplot(df_train_meta.mwebv)
ax.legend(['photoz redshift',  'MWEBV'])
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train_meta[df_train_meta.hostgal_photoz==0].mwebv)
sns.distplot(df_train_meta[df_train_meta.hostgal_photoz!=0].mwebv)
ax.legend(['Galactic MWEBV',  'Extragalactic MWEBV'])
plt.show()

In [None]:
sns.distplot(df_train_meta.distmod.dropna())
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df_train_meta[df_train_meta.hostgal_photoz == 0].dropna().distmod)
sns.distplot(df_train_meta[df_train_meta.hostgal_photoz != 0].dropna().distmod)
ax.legend(['Galactic distmod',  'Extragalactic distmod'])
plt.show()

In [None]:
print(df_train_meta[df_train_meta.hostgal_photoz != 0].info())
print(df_train_meta[df_train_meta.hostgal_photoz == 0].info())

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(y='distmod', x='target', data=df_train_meta, ax=ax)
plt.show()

1. Photoz redshift is more scattered than specz redshift.
2. The error distribution is fairly concentrated.
3. There are two types of targets: galactic (``hostgal_photoz = 0``) and extragalactic.
4. Redshifts outside the DDF area are more scattered.
5. `distmod` is ``NA`` for galactic objects.
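
Some of these points can be checked with a grouped summary instead of a plot. The sketch below again assumes that `hostgal_photoz == 0` marks a galactic object; the helper names, the output columns and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def share_missing(series):
    """Fraction of missing values in a column."""
    return series.isna().mean()

def redshift_checks(meta):
    """Per-group object count, missing distmod share and photoz residual spread."""
    out = meta.copy()
    # Assumption: photoz == 0 marks a galactic object.
    out["group"] = np.where(out["hostgal_photoz"] == 0, "galactic", "extragalactic")
    out["photoz_residual"] = out["hostgal_photoz"] - out["hostgal_specz"]
    return out.groupby("group").agg(
        n_objects=("object_id", "size"),
        distmod_missing=("distmod", share_missing),
        residual_std=("photoz_residual", "std"),
    )

# Toy metadata, not taken from the competition files.
toy_meta = pd.DataFrame({
    "object_id": [1, 2, 3, 4, 5],
    "hostgal_specz": [0.0, 0.0, 0.28, 0.55, 0.80],
    "hostgal_photoz": [0.0, 0.0, 0.30, 0.50, 0.80],
    "distmod": [np.nan, np.nan, 40.0, 42.0, 43.0],
})
summary = redshift_checks(toy_meta)
print(summary)
```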
