# Preprocessing the StarDataset

In this notebook, I will try to demonstrate how to preprocess this messy Star Dataset.

https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs

# Data Cleaning

Here is some useful background you might want to read before playing with this notebook:

* [Data Cleaning course on Kaggle Learn](https://www.kaggle.com/alexisbcook/handling-missing-values)

### Take a first look at the data
The first things we'll need to do are:
* Import Libraries
* Check the files we have
* Load the raw dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Use pd.read_csv to read file
path = "../input/star-categorization-giants-and-dwarfs/Star99999_raw.csv"
raw_data = pd.read_csv(path)

raw_data

^We can see the dataset includes a duplicated index column (`Unnamed: 0`), so we need to remove it later.
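
One way to avoid carrying the extra column at all might be to tell `read_csv` that the first column is the index. The sketch below uses a tiny inline CSV with made-up values (not the real file), and `csv_text` is just an illustrative name.

```python
import io

import pandas as pd

# Made-up CSV text whose first column is an unnamed, saved index.
csv_text = ",Vmag,Plx\n0,9.10,3.54\n1,9.27,21.90\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))

# Treat the first column as the index instead of as data.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(without_extra.columns))
```

This notebook keeps the default read and drops the column later, which works just as well.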

Let's read some statistics of the dataset first.

In [None]:
raw_data.columns

* `Vmag` - Visual apparent magnitude of the star
* `Plx` - Parallax of the star, a measure of the distance between the star and the Earth
* `e_Plx` - Standard error of `Plx` (drop the row if you find the `e_Plx` is too high!)
* `B-V` - B-V color index. (A hot star has a B-V color index close to 0 or negative, while a cool star has a B-V color index close to 2.0. Other stars are somewhere in between.)
* `SpType` - [Stellar classification.](https://en.wikipedia.org/wiki/Stellar_classification) (In this notebook, Roman numerals IV, V and VI are labelled as giants; the others are labelled as dwarfs.)


In [None]:
# read some statistics of the dataset
raw_data.describe()

^Why is the `describe` function not returning a summary of all columns?

It is because this dataframe has mixed column types. By default, the pandas `describe` function only summarizes the numerical columns.
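
To see this behavior in isolation, here is a minimal sketch on a toy dataframe (the values and the `Count` column are made up for illustration). Numbers stored as strings are not treated as numeric, so they are left out of the default summary.

```python
import pandas as pd

# Toy frame: `Vmag` holds numbers stored as strings, `Count` is a real integer column.
toy = pd.DataFrame({"Vmag": ["9.10", "9.27", "6.61"], "Count": [1, 2, 3]})

# Default: only numeric columns are summarized.
numeric_only = toy.describe()
print(list(numeric_only.columns))

# include='all' also summarizes the non-numeric columns.
everything = toy.describe(include="all")
print(list(everything.columns))
```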

https://stackoverflow.com/questions/24524104/pandas-describe-is-not-returning-summary-of-all-columns

In [None]:
# Check the DataType of our dataset
raw_data.info()

^As we can see, `Vmag`, `Plx`, `e_Plx` and `B-V` are all marked as object, but they are supposed to be float values.

So we should convert these columns to numeric.

If you run this code:
```python
# Convert Columns data type to float values
raw_data["Vmag"] = pd.to_numeric(raw_data["Vmag"], downcast="float")
raw_data["Plx"] = pd.to_numeric(raw_data["Plx"], downcast="float")
raw_data["e_Plx"] = pd.to_numeric(raw_data["e_Plx"], downcast="float")
raw_data["B-V"] = pd.to_numeric(raw_data["B-V"], downcast="float")
```
An error occurs: `ValueError: Unable to parse string "     " at position 25189`

From this error, we can see that some cells contain only whitespace, so they cannot be parsed as numbers.

We can add the parameter `errors='coerce'` to make the function convert unparseable values to NaN.
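
As a small illustration of what `errors='coerce'` does, the sketch below converts a toy dataframe (made-up values) and then lists the rows that failed to parse. Converting several columns at once with `apply` is just one possible style, not the approach this notebook uses.

```python
import pandas as pd

# Toy data: two cells contain only whitespace, like the real file.
raw = pd.DataFrame({"Vmag": ["9.10", "     ", "6.61"],
                    "Plx": ["3.54", "21.90", " "]})
numeric_cols = ["Vmag", "Plx"]

# Unparseable strings become NaN instead of raising ValueError.
converted = raw[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)

# Rows where at least one value could not be parsed.
bad_rows = raw[converted.isna().any(axis=1)]
print(bad_rows)
```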

https://stackoverflow.com/questions/40790031/pandas-to-numeric-find-out-which-string-it-was-unable-to-parse

In [None]:
# Convert Columns data type to float values
raw_data["Vmag"] = pd.to_numeric(raw_data["Vmag"], downcast="float", errors='coerce')
raw_data["Plx"] = pd.to_numeric(raw_data["Plx"], downcast="float", errors='coerce')
raw_data["e_Plx"] = pd.to_numeric(raw_data["e_Plx"], downcast="float", errors='coerce')
raw_data["B-V"] = pd.to_numeric(raw_data["B-V"], downcast="float", errors='coerce')

Now let's check the info again.

In [None]:
# Check the DataType of our dataset
raw_data.info()

In [None]:
# If you want to summarize all the columns, you can add the parameter `include='all'`.
raw_data.describe(include='all')

## Checking Missing data
It is very common for a dataset to have some missing values.

In [None]:
# get the number of missing data points per column
missing_values_count = raw_data.isnull().sum()

missing_values_count

In [None]:
# how many total missing values do we have?
total_cells = np.prod(raw_data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percentage of data that is missing
percent_missing = (total_missing/total_cells)
print("Percentage Missing:", "{:.2%}".format(percent_missing))

^Since the percentage of missing data is so small (only 0.7%), we can just drop the affected rows.
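
If you also want to know which columns the missing values sit in, a per-column fraction can help. This is a minimal sketch on a toy dataframe with made-up values; taking the mean of a boolean mask gives the share of `True` entries.

```python
import pandas as pd

# Toy frame with one missing value in each column.
toy = pd.DataFrame({"Vmag": [9.1, None, 6.6, 8.1],
                    "SpType": ["F5", "K3V", None, "G0"]})

# Fraction of missing values per column.
per_column = toy.isnull().mean()
print(per_column)

# Fraction of missing cells over the whole frame.
overall = toy.isnull().to_numpy().mean()
print("Percentage Missing:", "{:.2%}".format(overall))
```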

Let's see what will happen if we remove all the rows that contain a missing value.

## Dropping Missing Data

In [None]:
# remove all the rows that contain a missing value
# better to store it into a new variable to avoid confusion
raw_data_na_dropped = raw_data.dropna() 

raw_data_na_dropped

In [None]:
# just how many rows did we drop?
dropped_rows_count = raw_data.shape[0]-raw_data_na_dropped.shape[0]
print("Rows we dropped from original dataset: %d \n" % dropped_rows_count)

# Percentage we dropped
percent_dropped = dropped_rows_count/raw_data.shape[0]
print("Percentage Loss:", "{:.2%}".format(percent_dropped))

Lastly, read the statistics and info again.

In [None]:
raw_data_na_dropped.describe()

^Oh, almost forgot: we need to drop the first column of the dataset.

## Dropping Unwanted Column

This Stack Overflow link explains how to drop a column from a pandas dataframe.
https://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe

In [None]:
#The best way to do this in pandas is to use drop:
raw_data_na_dropped = raw_data_na_dropped.drop('Unnamed: 0', axis=1)

In [None]:
raw_data_na_dropped.describe()

In [None]:
raw_data_na_dropped.info()

^Notice that the index has 96742 entries, but its labels still run from 0 to 99998, because `dropna` removes rows without renumbering the ones that remain.

Therefore we need to reindex our dataframe.
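
The effect is easy to reproduce on a toy dataframe (made-up values), sketched here: after `dropna` the index keeps gaps, and `reset_index(drop=True)` renumbers it from 0.

```python
import pandas as pd

toy = pd.DataFrame({"Plx": [3.5, None, 2.8, None, 7.7]})

# dropna keeps the original labels, so the index has gaps.
dropped = toy.dropna()
print(list(dropped.index))

# drop=True discards the old labels instead of adding them as a column.
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))
```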

https://stackoverflow.com/questions/40755680/how-to-reset-index-pandas-dataframe-after-dropna-pandas-dataframe

In [None]:
raw_data_na_dropped_reindex = raw_data_na_dropped.reset_index(drop=True)

In [None]:
raw_data_na_dropped_reindex.info()

Looks like we have finally cleaned out all the missing values!

Finally, we can save our progress to a CSV file.
* Remember to use `index=False` if you don't want to create a separate index column again!
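
To see what `index=False` changes, this small sketch writes a toy dataframe (made-up values) to in-memory buffers instead of files and compares the header lines.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Vmag": [9.1, 6.6]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
header_with_index = with_index.getvalue().splitlines()[0]

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
header_without_index = without_index.getvalue().splitlines()[0]

print(repr(header_with_index), repr(header_without_index))
```

The leading comma in the first header is what later turns into the `Unnamed: 0` column when the file is read back.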

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
https://stackoverflow.com/questions/20845213/how-to-avoid-python-pandas-creating-an-index-in-a-saved-csv

In [None]:
#Optional - Save our progress
#raw_data_na_dropped_reindex.to_csv("Star99999_na_dropped.csv", index=False)

# Creating New Column for Amag

## Finding Absolute Magnitude

The absolute magnitude of each star is computed with the equation:
![](https://i.imgur.com/tt1h8bu.png)
Where $M$ represents the absolute magnitude `Amag`, 
$m$ represents the visual apparent magnitude `Vmag`
and $p$ represents stellar parallax `Plx`.
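
For readers who cannot load the image, the relation that the code in this notebook implements (`Vmag + 5 * (log10(|Plx|) + 1)`) can be written as:

$$M = m + 5\left(\log_{10} |p| + 1\right)$$

This form of the distance modulus assumes $p$ is expressed in arcseconds. The unit of the `Plx` column is not stated in this notebook, so treat that as an assumption worth checking against the dataset description.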


In this section, we will create a new column `Amag` to store $M$.

Things to be aware of:
* Taking the log of 0 gives negative infinity, which we don't want to see
  * To fix this: drop rows with `Plx` = 0
* Taking the log of a negative number is undefined for real numbers (NumPy returns NaN), which we don't want to see either
  * To fix this: take the absolute value of `Plx`
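
Here is a minimal sketch of both pitfalls and of the two fixes, using a toy dataframe with made-up values. The warnings NumPy would normally print are silenced with `np.errstate` only to keep the output clean.

```python
import numpy as np
import pandas as pd

# What NumPy returns for the two problem cases.
with np.errstate(divide="ignore", invalid="ignore"):
    log_of_zero = np.log10(0.0)
    log_of_negative = np.log10(-2.0)
print(log_of_zero, log_of_negative)

# Toy rows: one normal, one with Plx = 0, one with negative Plx.
toy = pd.DataFrame({"Vmag": [10.0, 8.0, 7.0], "Plx": [100.0, 0.0, -10.0]})

# Fix 1: drop rows where Plx is exactly 0.
toy = toy[toy["Plx"] != 0].reset_index(drop=True)

# Fix 2: take the absolute value before the log.
toy["Amag"] = toy["Vmag"] + 5 * (np.log10(toy["Plx"].abs()) + 1)
print(toy)
```

For the first row this works out to 10 + 5 × (2 + 1) = 25, and for the negative-parallax row to 7 + 5 × (1 + 1) = 17.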


In [None]:
#Save a copy under a shorter name, so it is easier to refer to
df = raw_data_na_dropped_reindex.copy()
df

In [None]:
#Drop rows where `Plx` = 0
df = df[df.Plx != 0]

#Reindex the dataframe
df = df.reset_index(drop=True)

df

Looks like we successfully dropped all the rows where `Plx` = 0.

In [None]:
#Implement the equation
df["Amag"] = df["Vmag"] + 5* (np.log10(abs(df["Plx"]))+1)

df

In [None]:
df.info()

In [None]:
df.describe()

# Column Mapping

In this section, we will create a new column `TargetClass` to store whether each star is a Giant or a Dwarf.

## Convert SpType into Giants and Dwarf

* In this notebook, Roman numerals IV, V and VI are labelled as giants; the others are labelled as dwarfs



In [None]:
# Take a look at our SpType column
df['SpType']

In [None]:
#Copy the SpType column to a new column called TargetClass
df['TargetClass'] = df['SpType']

df

We can see the SpType contains the Roman Numerals we need.

> Best Practice is to use Regex (Regular Expression) to extract the pattern.

But I will try to use an intuitive approach first.
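
For reference, the regex route could look roughly like the sketch below. The spectral type strings are made up for illustration, and the pattern is only one possible choice: longer numerals are listed before shorter ones so that, for example, `VII` is not cut short to `V`.

```python
import pandas as pd

# Made-up spectral types; "DA" has no Roman numeral at all.
sp_type = pd.Series(["K0III", "G5V", "F2IV", "M1VII", "DA"])

# Alternatives are tried left to right, so list the longer numerals first.
pattern = r"(VII|VI|IV|V|III|II|I)"

# expand=False returns a Series; rows without a match become NaN.
luminosity_class = sp_type.str.extract(pattern, expand=False)
print(luminosity_class)
```

Note that a string containing two classes, such as a hypothetical `G8III/IV`, would only yield the first one with this pattern.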

Looking through the dataset, the Roman numerals that appear in the strings are I, II, III, IV, V, VI and VII.
Therefore:

* Dwarfs (I, II, III, VII)
* Giants (IV, V, VI)
* Other Special Stars (None)

^Edit0821: Sorry I made a huge mistake here... Now fixed!

We don't need to worry about the first character of the string being I or V, because the leading spectral-class letter (for example O, B, A, F, G, K or M) is never one of those.
![](https://cdn.britannica.com/17/143617-050-6042AB2A/diagram-Hertzsprung-Russell-Annie-Jump-Cannon-type-order.jpg)

In [None]:
#The intuitive approach (Could take a long time if you have a huge dataset)
for i in range(len(df['TargetClass'])):
    if "V" in df.loc[i,'TargetClass']: 
        if "VII" in df.loc[i,'TargetClass']: 
            df.loc[i,'TargetClass'] = 0 # VII is Dwarf
        else:
            df.loc[i,'TargetClass'] = 1 # IV, V, VI are Giants
    elif "I" in df.loc[i,'TargetClass']: 
        df.loc[i,'TargetClass'] = 0 # I, II, III are Dwarfs
    else: 
        df.loc[i,'TargetClass'] = 9 # None
        
df['TargetClass']

^When we use the data for analysis, it is better to have the label as numeric values; otherwise we might need to map them later.
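
The row-by-row loop can also be written without explicit iteration. The following is a sketch of one possible vectorized version using `np.select`, run on made-up spectral types that hit every branch; it follows the same labelling rules as the loop.

```python
import numpy as np
import pandas as pd

# Made-up spectral types, one per branch of the loop.
toy = pd.DataFrame({"SpType": ["K0III", "G5V", "F2IV", "M1VII", "DA"]})

# Plain substring tests (regex=False), as in the loop's `in` checks.
has_vii = toy["SpType"].str.contains("VII", regex=False).to_numpy(dtype=bool)
has_v = toy["SpType"].str.contains("V", regex=False).to_numpy(dtype=bool)
has_i = toy["SpType"].str.contains("I", regex=False).to_numpy(dtype=bool)

# np.select uses the first condition that is True, mirroring the if/elif order:
# VII -> 0, any other V -> 1, any remaining I -> 0, nothing -> 9.
toy["TargetClass"] = np.select([has_vii, has_v, has_i], [0, 1, 0], default=9)
print(toy)
```

On a dataset with tens of thousands of rows this kind of column-wise operation is usually much faster than calling `df.loc` once per row.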

In [None]:
df.describe(include='all')

In [None]:
#Save our progress
#df.to_csv("Star99999_preprocessed0821.csv", index=False)

# Balancing Data
Almost forgot: we need to balance the data.

This post explains why we need to balance the data:
<br>
https://elitedatascience.com/imbalanced-classes

In [None]:
df['TargetClass'].value_counts()

In [None]:
import matplotlib.pyplot as plt # plot graphs
import seaborn as sns # plot graphs

sns.countplot(x=df['TargetClass'])  # pass the column by keyword so it works across seaborn versions

We only need the Dwarf and Giant records.

In [None]:
#Drop rows where `TargetClass` = 9
df = df[df.TargetClass != 9]

#Reindex the dataframe
df = df.reset_index(drop=True)

df

Since we have so many records, we will just downsample the majority class.

In [None]:
# Separate the labels
df_giants = df[df.TargetClass == 1]
df_dwarfs = df[df.TargetClass == 0]

In [None]:
# Numbers of rows of Giants and Dwarfs
num_of_giant = df_giants.shape[0]
num_of_dwarf = df_dwarfs.shape[0]
print("Giants(1):",num_of_giant)
print("Dwarfs(0):",num_of_dwarf)

To downsample the class, we could loop through the records ourselves, but there is a much better approach.

Let's import `resample` from `sklearn`.

In [None]:
from sklearn.utils import resample

In [None]:
# Downsample majority class
df_giants_downsampled = resample(df_giants, 
                                 replace=False,    # sample without replacement
                                 n_samples=num_of_dwarf,     # to match minority class
                                 random_state=1) # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_giants_downsampled, df_dwarfs])

In [None]:
df_downsampled['TargetClass'].value_counts()

In [None]:
sns.countplot(x=df_downsampled['TargetClass'])  # pass the column by keyword so it works across seaborn versions

Our Dataset is finally balanced!
![](https://i.imgflip.com/303krn.jpg)
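
The same balancing can also be sketched with pandas alone, without `sklearn`. This is an alternative on a toy dataframe with made-up values, not the method used in this notebook, and it assumes a pandas version that provides `groupby(...).sample` (1.1 or later).

```python
import pandas as pd

# Toy data: six rows of class 1 and two rows of class 0.
toy = pd.DataFrame({"TargetClass": [1] * 6 + [0] * 2,
                    "Vmag": [float(i) for i in range(8)]})

# Size of the smallest class.
n_minority = toy["TargetClass"].value_counts().min()

# Draw that many rows from every class, without replacement.
balanced = (toy.groupby("TargetClass")
               .sample(n=n_minority, random_state=1)
               .reset_index(drop=True))
print(balanced["TargetClass"].value_counts())
```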

Last but not least, we need to check our dataset to see whether there are still any problems.

In [None]:
df_downsampled.describe(include='all')

In [None]:
df_downsampled.info()

Yeah, we need to reindex again.

In [None]:
df_balanced = df_downsampled.reset_index(drop=True)

df_balanced.info()

In [None]:
df_balanced

Did you notice? Because we concatenated the two dataframes, the rows are grouped by class, so we need to shuffle the data before feeding it to a model.

> Pandas has a shuffle method called `sample`.
> You can also use sklearn to shuffle if you want to.

https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
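
If you want the shuffle to be repeatable between runs, `sample` accepts a `random_state`. The sketch below shows this on a toy dataframe (made-up values); the seed value 1 is an arbitrary choice.

```python
import pandas as pd

# Toy frame with the classes grouped together, as after pd.concat.
toy = pd.DataFrame({"TargetClass": [1, 1, 1, 0, 0, 0]})

# frac=1 samples every row, i.e. returns all rows in random order.
shuffled = toy.sample(frac=1, random_state=1).reset_index(drop=True)

# The same seed gives the same order again.
shuffled_again = toy.sample(frac=1, random_state=1).reset_index(drop=True)
print(shuffled.equals(shuffled_again))
```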

In [None]:
df_balanced = df_balanced.sample(frac=1).reset_index(drop=True)

df_balanced

Finally done!

In [None]:
#Save our dataset, we can finally play with it!!!
df_balanced.to_csv("Star39552_balanced.csv", index=False)

You can use this link to check more notebooks about this dataset:

https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs/notebooks
