In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler 

Load the dataset. The `class` column is the label and the remaining columns are the features. The label is binary ("g" for a gamma particle, "h" for a hadron), so we convert it to an integer for easier handling: 1 is gamma, 0 is hadron.

In [None]:
columns = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv("magic.data", names=columns)
df["class"] = (df["class"]=="g").astype(int)
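As a quick illustration of that conversion, the sketch below applies the same expression to a tiny hand-made frame. The name `toy` and its values are made up for illustration and are not rows from `magic.data`. Comparing against "g" gives a boolean column, and `astype(int)` turns `True`/`False` into 1/0.

```python
import pandas as pd

# Hypothetical toy labels, not real rows from the dataset
toy = pd.DataFrame({"class": ["g", "h", "g", "g"]})

# Same expression as above: True where the label is "g", then cast to 0/1
toy["class"] = (toy["class"] == "g").astype(int)

print(toy["class"].tolist())  # [1, 0, 1, 1]
```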

Plot a histogram of every feature, split by class, to see how well each feature separates gamma from hadron events

In [None]:
for feature in columns[:-1]:
    # alpha < 1 keeps both histograms visible where they overlap
    plt.hist(df[df["class"]==1][feature], label='gamma', alpha=0.7, density=True)
    plt.hist(df[df["class"]==0][feature], label='hadron', alpha=0.7, density=True)
    plt.title(feature)
    plt.ylabel("Probability density")
    plt.xlabel(feature)
    plt.legend()
    plt.show()

Create training, validation and test data.
- `train_df` - used for training the model
- `valid_df` - used to compute metrics for the model during training. Based on the validation results, we tune hyper-parameters such as the number of neighbors in kNN
- `test_df` - used to test the effectiveness of our fully trained model.


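`df.sample(frac=1)` returns a random sample of the rows. With `frac=1` it returns every row in random order, i.e. it shuffles the entire DataFrame. The shuffled frame is then cut at 60% and 80% of its length, which gives a 60/20/20 split.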

In [None]:
train_df, valid_df, test_df = np.split(df.sample(frac=1), [int(0.6*len(df)), int(0.8*len(df))])
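The shuffle-then-split step can be checked on a small example. The sketch below uses a made-up 10-row frame (`toy` is hypothetical) and plain `iloc` slicing at the same 60% and 80% cut points, which is another way to cut the frame at those indices. The fixed `random_state` is an assumption added only so the sketch is repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 10 rows standing in for the real dataset
toy = pd.DataFrame({"x": np.arange(10), "class": [0, 1] * 5})

# Shuffle all rows; random_state is fixed only to make the sketch repeatable
shuffled = toy.sample(frac=1, random_state=0)

# Cut points at 60% and 80% of the length -> 60/20/20 split
i_train = int(0.6 * len(shuffled))
i_valid = int(0.8 * len(shuffled))

train_part = shuffled.iloc[:i_train]
valid_part = shuffled.iloc[i_train:i_valid]
test_part = shuffled.iloc[i_valid:]

print(len(train_part), len(valid_part), len(test_part))  # 6 2 2
```

Every row ends up in exactly one of the three parts, so the test data is never seen during training or tuning.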

### Issues
- The columns have very different magnitudes. We need to standardize each feature (zero mean, unit variance) so that the features are on a comparable scale.
- The dataset contains more gamma events than hadron events. We need to over-sample the training dataset so that both classes are equally represented and the model is not biased towards the gamma class.
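To make the scaling issue concrete, here is a minimal sketch of standardization done by hand with NumPy on made-up numbers (`X_toy` is hypothetical). It subtracts each column's mean and divides by its population standard deviation, which is, as far as I know, the formula `StandardScaler` applies with its default settings.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, the two columns hold identical values even though they originally differed by a factor of 1000.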

### What is over-sampling
A technique that adds more samples of the minority class to the training dataset, for example by randomly duplicating existing ones, so that the model trains on an equal number of samples for each label
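The idea can be sketched by hand with NumPy. This is not the `imblearn` implementation, only an illustration of random over-sampling on made-up data: rows of the minority class are drawn with replacement until both classes have the same count. All names here (`X_toy`, `y_toy`, `rng`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

# Hypothetical imbalanced data: 6 samples of class 1, 2 samples of class 0
X_toy = np.arange(16, dtype=float).reshape(8, 2)
y_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0])

# Find the minority class and how many extra samples it needs
classes, counts = np.unique(y_toy, return_counts=True)
minority = classes[np.argmin(counts)]
n_extra = counts.max() - counts.min()

# Draw minority rows with replacement and append the duplicates
minority_idx = np.flatnonzero(y_toy == minority)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_bal = np.vstack([X_toy, X_toy[extra_idx]])
y_bal = np.concatenate([y_toy, y_toy[extra_idx]])

print(np.bincount(y_bal))  # [6 6]
```

Only the training split should be balanced this way. The validation and test splits keep their natural class ratio so that the metrics reflect real data.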

In [None]:
def scale_dataset(df, over_sample):
    X = df[df.columns[:-1]].values
    y = df[df.columns[-1]].values

    # standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    if over_sample:
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)

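    # stack features and labels side by side; y is reshaped into a column,
    # and np.hstack takes the arrays as a single tuple
    data = np.hstack((X, np.reshape(y, (-1, 1))))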

    return data, X, y
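The final stacking step in `scale_dataset` can be checked in isolation. In this small sketch on made-up arrays (`X_toy`, `y_toy` are hypothetical), the 1-D label array is reshaped into a single column so that it can be placed next to the 2-D feature matrix.

```python
import numpy as np

X_toy = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # 3 samples, 2 features
y_toy = np.array([1, 0, 1])                              # shape (3,)

y_col = np.reshape(y_toy, (-1, 1))    # shape (3, 1): one label per row
data_toy = np.hstack((X_toy, y_col))  # np.hstack takes one sequence of arrays

print(data_toy.shape)  # (3, 3)
```

The last column of the result holds the labels, matching the layout of the original DataFrame where `class` is the final column.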