# CS 421: Data Mining - Lab 3

## Objectives
1. Apply previously learned preprocessing techniques and observe their effect on classification accuracy.
2. Explore different classification models and tune their parameters.
3. Explore different techniques for evaluating classification models.
4. Learn how to analyze the observed results and explain them in a detailed report.

## Problem Statement
The MAGIC gamma telescope dataset is available at  
https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The dataset was generated to simulate the
registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using
the imaging technique. It consists of two classes: gammas (signal) and hadrons (background). There
are 12332 gamma events and 6688 hadron events. You are required to apply preprocessing techniques to this
dataset and use the preprocessed dataset to construct different classification models such as **Decision Trees,
Naïve Bayes Classifier, Random Forests, AdaBoost, K-Nearest Neighbor (K-NN) and Support Vector
Machines (SVM)**. You are also required to tune the parameters of these models, compare the performance of
the learned models before and after preprocessing, and compare the models with each other.

----
## Data Loading and Exploration

In [1]:
import pandas as pd
import numpy as np

In [2]:
attrNames = ['fLength','fWidth','fSize','fConc','fConc1','fAsym','fM3Long','fM3Trans','fAlpha','fDist','class']
raw_data = pd.read_csv('data/magic04.data', header=None, names=attrNames)
raw_data.head(10)

| | fLength | fWidth | fSize | fConc | fConc1 | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | g |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | g |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | g |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | g |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | g |
| 5 | 51.624 | 21.1502 | 2.9085 | 0.242 | 0.134 | 50.8761 | 43.1887 | 9.8145 | 3.613 | 238.098 | g |
| 6 | 48.2468 | 17.3565 | 3.0332 | 0.2529 | 0.1515 | 8.573 | 38.0957 | 10.5868 | 4.792 | 219.087 | g |
| 7 | 26.7897 | 13.7595 | 2.5521 | 0.4236 | 0.2174 | 29.6339 | 20.456 | -2.9292 | 0.812 | 237.134 | g |
| 8 | 96.2327 | 46.5165 | 4.154 | 0.0779 | 0.039 | 110.355 | 85.0486 | 43.1844 | 4.854 | 248.226 | g |
| 9 | 46.7619 | 15.1993 | 2.5786 | 0.3377 | 0.1913 | 24.7548 | 43.8771 | -6.6812 | 7.875 | 102.251 | g |


In [3]:
raw_data.describe()

| | fLength | fWidth | fSize | fConc | fConc1 | fAsym | fM3Long | fM3Trans | fAlpha | fDist |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 |
| mean | 53.250154 | 22.180966 | 2.825017 | 0.380327 | 0.214657 | -4.331745 | 10.545545 | 0.249726 | 27.645707 | 193.818026 |
| std | 42.364855 | 18.346056 | 0.472599 | 0.182813 | 0.110511 | 59.206062 | 51.000118 | 20.827439 | 26.103621 | 74.731787 |
| min | 4.2835 | 0.0 | 1.9413 | 0.0131 | 0.0003 | -457.9161 | -331.78 | -205.8947 | 0.0 | 1.2826 |
| 25% | 24.336 | 11.8638 | 2.4771 | 0.2358 | 0.128475 | -20.58655 | -12.842775 | -10.849375 | 5.547925 | 142.49225 |
| 50% | 37.1477 | 17.1399 | 2.7396 | 0.35415 | 0.1965 | 4.01305 | 15.3141 | 0.6662 | 17.6795 | 191.85145 |
| 75% | 70.122175 | 24.739475 | 3.1016 | 0.5037 | 0.285225 | 24.0637 | 35.8378 | 10.946425 | 45.88355 | 240.563825 |
| max | 334.177 | 256.382 | 5.3233 | 0.893 | 0.6752 | 575.2407 | 238.321 | 179.851 | 90.0 | 495.561 |


In [4]:
gSize = len(raw_data[raw_data["class"] == "g"])
hSize = len(raw_data[raw_data["class"] == "h"])
print("'g' class size: " + str(gSize) + " - 'h' class size: " + str(hSize))

'g' class size: 12332 - 'h' class size: 6688


---
## Data preprocessing
### 1. Balancing data
Note that the dataset is class-imbalanced. To balance it, randomly set aside the extra samples of
the gamma “g” class so that both classes are equal in size.  
**Note:** There are several other methods, such as repeating (oversampling) the "h" class samples or giving that class a larger weight.

In [5]:
np.random.seed(10)  # fix the random seed so the results are reproducible across runs
remove_n = gSize - hSize  # number of 'g' rows to remove
drop_indices = np.random.choice(raw_data[raw_data["class"] == "g"].index, remove_n, replace=False)
balanced_data = raw_data.drop(drop_indices)

In [6]:
gSize = len(balanced_data[balanced_data["class"] == "g"])
hSize = len(balanced_data[balanced_data["class"] == "h"])
print("in balanced data, 'g' class size: " + str(gSize) + " - 'h' class size: " + str(hSize))

in balanced data, 'g' class size: 6688 - 'h' class size: 6688
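
The note above mentions two alternatives to dropping "g" rows. A minimal sketch of both follows, run on a small hypothetical frame (`toy`) rather than the real dataset. The names `toy`, `oversampled` and `class_weights` are illustrative assumptions. The weight formula `n_samples / (n_classes * class_count)` is the heuristic scikit-learn documents for `class_weight="balanced"`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_data: 6 'g' rows and 3 'h' rows.
toy = pd.DataFrame({
    "fLength": np.arange(9, dtype=float),
    "class": ["g"] * 6 + ["h"] * 3,
})

counts = toy["class"].value_counts()
majority, minority = counts.idxmax(), counts.idxmin()

# Option 1: oversample the minority class by drawing its rows with replacement.
n_extra = int(counts[majority] - counts[minority])
extra = toy[toy["class"] == minority].sample(n=n_extra, replace=True, random_state=10)
oversampled = pd.concat([toy, extra], ignore_index=True)

# Option 2: keep every row and weight each class inversely to its frequency.
class_weights = (len(toy) / (len(counts) * counts)).to_dict()

print(oversampled["class"].value_counts().to_dict())
print(class_weights)
```

Oversampling keeps all of the original information but repeats minority rows, so it should be applied only to the training part of a split. Class weights avoid duplicating rows but require a model that accepts them.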


### 2. Visualizing Data 


### 3. Data splitting
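
One possible way to split the balanced data is a stratified hold-out split. The sketch below assumes scikit-learn's `train_test_split` and a 70/30 ratio; the ratio, the seed and the synthetic arrays `X` and `y` are assumptions made for illustration, not values taken from the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the balanced data: 100 rows, 10 features, two equal classes.
rng = np.random.default_rng(10)
X = rng.normal(size=(100, 10))
y = np.array(["g"] * 50 + ["h"] * 50)

# test_size=0.3 is an assumed ratio; stratify=y keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=10
)

print(X_train.shape, X_test.shape)
```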

### 4. Features Processing
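
The `describe()` output shows features on very different scales (for example, `fLength` has a standard deviation of about 42 while `fConc` has about 0.18), which can matter for distance-based models such as K-NN and SVM. A minimal sketch of one common remedy, standardization with scikit-learn's `StandardScaler`, is shown here on hypothetical arrays. Choosing standardization over other transforms is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test matrices with two features on very different scales.
rng = np.random.default_rng(10)
X_train = rng.normal(loc=[50.0, 0.4], scale=[40.0, 0.2], size=(200, 2))
X_test = rng.normal(loc=[50.0, 0.4], scale=[40.0, 0.2], size=(50, 2))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from the training rows only
X_test_std = scaler.transform(X_test)        # reuse the training statistics on the test rows

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

Fitting the scaler on the training rows only keeps information about the test rows out of the preprocessing step.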

---
## Classification

### 1. Creating models
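
The six model families named in the problem statement could be instantiated with scikit-learn roughly as follows. This is a sketch with default hyperparameters. The choice of `GaussianNB` for Naïve Bayes (because the features are continuous), the dictionary name `models` and the seed value are assumptions.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One untrained estimator per model family, keyed by a display name.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=10),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=10),
    "AdaBoost": AdaBoostClassifier(random_state=10),
    "K-NN": KNeighborsClassifier(),
    "SVM": SVC(),
}

for name, model in models.items():
    print(name, "->", type(model).__name__)
```

Keeping the estimators in a dictionary makes it easy to loop over them and train each one on both the raw and the preprocessed data.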

### 2. Training models on raw data
#### 2.1. Train models

#### 2.2. Performance measure
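
A minimal sketch of common evaluation measures computed with `sklearn.metrics`, using six hand-written labels instead of real model predictions. Treating "g" (signal) as the positive class is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions for six events.
y_true = ["g", "g", "g", "h", "h", "h"]
y_pred = ["g", "g", "h", "h", "h", "g"]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label="g")
recall = recall_score(y_true, y_pred, pos_label="g")
f1 = f1_score(y_true, y_pred, pos_label="g")

# Rows are true classes and columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=["g", "h"])

print(accuracy, precision, recall, f1)
print(cm)
```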

### 3. Training models on preprocessed data
#### 3.1. Train models

#### 3.2. Performance measure

---
## Models Tuning

### 1. Tune models
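
One way to tune a model is a cross-validated grid search. The sketch below tunes only the number of neighbors of a K-NN classifier with scikit-learn's `GridSearchCV` on synthetic data from `make_classification`. The parameter grid, the 5-fold setting and the data are assumptions made for illustration, and the same pattern would apply to the other models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=10
)

# Candidate values are assumed; each one is scored by 5-fold cross-validation on the training part.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# The held-out test part is used once, after the best parameters have been chosen.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```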

### 2. Test models