# Load the full catalog of AGN and create the full data for learning


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

The catalog contains all the original features for the different AGN. Some of the sources do not have a measured redshift.

In [2]:
catalog = pd.read_csv('New_CombinedData.csv')

In [3]:
catalog.shape

(3511, 17)

A full EDA report, generated with `pandas_profiling`, is saved in [eda_report.html](./eda_report.html).

# Remove all AGN that do not belong to the `BLL` or `FSRQ` class

In [23]:
catalog.Label.unique()

array(['bcu', 'bll', 'fsrq', 'rdg', 'agn', 'FSRQ', 'BLL', 'RDG', 'nlsy1',
       'css', 'ssrq', 'NLSY1', 'BCU', 'sey'], dtype=object)

In [11]:
data_fsrq_bll = catalog.query("Label == 'FSRQ' or Label == 'BLL' or Label == 'fsrq' or Label == 'bll'")

In [13]:
data_fsrq_bll.shape

(1923, 17)

Selecting only these two AGN classes (in either upper or lower case) reduces the catalog from 3511 to 1923 data points.
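Because the labels appear in both upper and lower case, the same selection can be written as a case-insensitive filter. The sketch below uses a small made-up frame (`toy`, not the real catalog) and also normalises the kept labels to upper case, which the original cell does not do; treat that extra step as an optional assumption.

```python
import pandas as pd

# Hypothetical stand-in for the catalog; only the Label column matters here.
toy = pd.DataFrame({
    "Label": ["bcu", "bll", "FSRQ", "rdg", "BLL", "fsrq"],
    "z": [0.1, 0.2, 1.3, 0.05, None, 0.9],
})

# Compare labels case-insensitively and keep the two classes of interest.
keep = toy["Label"].str.upper().isin(["BLL", "FSRQ"])
toy_fsrq_bll = toy[keep].copy()

# Optional: store a single spelling per class for later modelling.
toy_fsrq_bll["Label"] = toy_fsrq_bll["Label"].str.upper()

print(toy_fsrq_bll["Label"].value_counts())
```

Taking a `.copy()` of the filtered frame avoids pandas chained-assignment warnings when the label column is rewritten afterwards.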

# Remove all missing values

For this analysis, we want to avoid imputing missing values, so we remove every row that contains a `NaN`.

In [15]:
data_fsrq_bll.isna().sum()

Index                  0
X.name                 0
z                    461
Flux1.100m             0
Energy_Flux100         0
Significance           0
Variability_Index      0
Frac_Variability       0
Highest_Energy       616
nu                   245
nufnu                245
PL_Index               0
Pivot_Energy           0
LP_Index               0
LP_beta                0
Gaia_G_Magnitude     356
Label                  0
dtype: int64

The column with the most missing values is `Highest_Energy` (616 of 1923 rows, about 32%), followed by the redshift `z` (461) and `Gaia_G_Magnitude` (356).
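To rank columns by missingness, it can help to put the counts and fractions side by side. This is a minimal sketch on a made-up frame (`toy`, with invented values); the column names are borrowed from the catalog purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same kind of gaps as the catalog.
toy = pd.DataFrame({
    "z": [0.1, np.nan, 1.3, np.nan],
    "Highest_Energy": [np.nan, np.nan, 12.0, np.nan],
    "PL_Index": [2.1, 2.3, 1.9, 2.0],
})

# isna().sum() counts missing entries; isna().mean() gives the fraction.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "fraction": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(missing)
```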

In [16]:
data_remove_na = data_fsrq_bll.dropna()

In [18]:
data_remove_na.shape

(741, 17)

Alternatively, we can drop only the rows that are missing the target value, the redshift `z`. This keeps rows that still have gaps in other features.

In [19]:
data_with_redshift = data_fsrq_bll.dropna(subset=['z'])

In [20]:
data_with_redshift.shape

(1462, 17)
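The two strategies above keep very different numbers of rows (741 versus 1462 in this notebook). The sketch below reproduces the difference on a small invented frame (`toy`), so the effect of `subset=['z']` is easy to see; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: some lack the target z, some lack only a feature.
toy = pd.DataFrame({
    "z": [0.1, np.nan, 1.3, 0.7, np.nan],
    "Highest_Energy": [5.0, 8.0, np.nan, 12.0, np.nan],
    "PL_Index": [2.1, 2.3, 1.9, 2.0, 2.4],
})

# Strategy 1: keep only rows with no missing value in any column.
complete_rows = toy.dropna()

# Strategy 2: keep every row that has a redshift, even if features are missing.
with_target = toy.dropna(subset=["z"])

print(len(toy), len(complete_rows), len(with_target))
# Feature gaps that remain after strategy 2 and would still need handling.
print(with_target.isna().sum())
```

Rows kept by the second strategy can still contain `NaN` in feature columns, so a model trained on them needs either imputation or an estimator that tolerates missing inputs.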