# Feature scaling analysis

One problem with the extracted features is that they can contain -1 values or outliers. Scaling the features can mitigate this.
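Three different scalers are implemented:
- Standard scaling -> centers each feature to zero mean and scales it to unit variance; works best when the feature is roughly normally distributed
- MinMax scaling -> rescales each feature to the (0, 1) range using its minimum and maximum
- Robust scaling -> same idea, but based on outlier-robust statistics (typically the median and the interquartile range)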
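To make the difference between the three scalers concrete, here is a minimal NumPy sketch of the usual formulas behind them. The toy column `x` and the helper names are made up for illustration. The `classification.Classifiers` methods used in this notebook are not shown here, so this is an assumption about what they compute, not a copy of them.

```python
import numpy as np

# Hypothetical feature column: a -1 placeholder, typical values, one outlier.
x = np.array([-1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

def standard_scale(v):
    # Zero mean, unit variance (population standard deviation).
    return (v - v.mean()) / v.std()

def minmax_scale(v):
    # Map the smallest value to 0 and the largest to 1.
    return (v - v.min()) / (v.max() - v.min())

def robust_scale(v):
    # Center on the median, scale by the interquartile range (IQR).
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    return (v - median) / (q3 - q1)

for name, scaler in [("standard", standard_scale),
                     ("minmax", minmax_scale),
                     ("robust", robust_scale)]:
    print(name, np.round(scaler(x), 3))
```

With this toy column, MinMax scaling squeezes every non-outlier value below 0.06 because the single outlier defines the range. Robust scaling keeps the typical values spread around 0 and leaves the outlier far away at 38.6.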

The mutual information gain is compared across the scalers. The hypothesis is that the gains do NOT change: each scaler is a monotonic rescaling of a feature, so it does not alter the information the feature carries about the label. Classification errors could still change due to scaling, however.
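The hypothesis can be checked on synthetic data. The sketch below is an illustration, not the notebook's `information_gain` method (its implementation is not shown here). It uses a simple plug-in estimate of mutual information on equal-frequency bins, and the data, bin count, and helper names are assumptions. The bins depend only on the rank of each value, so any order-preserving scaling gives the same estimate.

```python
import numpy as np

def rank_bins(v, n_bins=10):
    # Equal-frequency bins computed from ranks only.
    ranks = np.argsort(np.argsort(v))
    return ranks * n_bins // len(v)

def mutual_information(bins, y, n_bins=10):
    # Plug-in mutual information (in nats) between bin index and integer label.
    joint = np.zeros((n_bins, int(y.max()) + 1))
    np.add.at(joint, (bins, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over bins
    py = joint.sum(axis=0, keepdims=True)   # marginal over labels
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))

# Synthetic binary labels and one feature whose mean depends on the label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
x = rng.normal(loc=y, scale=1.0)

standard = (x - x.mean()) / x.std()
minmax = (x - x.min()) / (x.max() - x.min())

mi_raw = mutual_information(rank_bins(x), y)
mi_standard = mutual_information(rank_bins(standard), y)
mi_minmax = mutual_information(rank_bins(minmax), y)
print(mi_raw, mi_standard, mi_minmax)
```

Estimators that work on distances instead of ranks, such as nearest-neighbour estimators, may show small numerical differences between scalers, so identical values to the last digit are not guaranteed in general.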

In [1]:
import classification
import numpy as np
import pandas as pd
from os import path

data_path = '../features'
feature_path = '1_paper_features.pkl'
label_path = '1_paper_labels.pkl'

features = pd.read_pickle(path.join(data_path, feature_path))
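# np.load only reads pickled files when allow_pickle=True (off by default in recent NumPy versions)
labels = np.load(path.join(data_path, label_path), allow_pickle=True)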
# TODO load features and labels from .json
classi = classification.Classifiers(features, labels, classifiers=[])


In [2]:
standard_scaled = classi.standard_scaling()
minmax_scaled = classi.minmax_scaling()
robust_scaled = classi.robust_scaling()
print("No scaling")
classi.information_gain()
# classi.chi2_stats()

print("Standard scaling")
classi.information_gain(standard_scaled)
# classi.chi2_stats(standard_scaled)

print("MinMax scaling")
classi.information_gain(minmax_scaled)
# classi.chi2_stats(minmax_scaled)
    
print("Robust scaling")
classi.information_gain(robust_scaled)

No scaling
Information gain of whole dataset
                               Feature Name  Info Gain
1     ratioChars_ArticleParagraphsPostTitle   0.040017
2                  numFormalWords_PostTitle   0.024087
3           ratioChars_ArticleDescPostTitle   0.021760
4                        numWords_PostTitle   0.017788
5   ratioChars_ArticleParagraphsArticleDesc   0.014178
6          ratioChars_ArticleTitlePostTitle   0.010832
7                        numChars_PostTitle   0.008457
8        diffWords_PostTitleArticleKeywords   0.005295
9        diffChars_PostTitleArticleKeywords   0.004994
10         ratioWords_ArticleTitlePostTitle   0.004707
11             diffChars_PostTitlePostImage   0.003067
12                numQuestionmarksPostTitle   0.001995
13            ratioWords_PostImagePostTitle   0.000725
14            ratioChars_PostImagePostTitle   0.000000
15          ratioWords_ArticleDescPostTitle   0.000000
Standard scaling
Information gain of whole dataset
                               Feature Name  Info Gain
1     ratioChars_ArticleParagraphsPostTitle   0.040017
2                  numFormalWords_PostTitle   0.024087
3           ratioChars_ArticleDescPostTitle   0.021760
4                        numWords_PostTitle   0.017788
5   ratioChars_ArticleParagraphsArticleDesc   0.014178
6          ratioChars_ArticleTitlePostTitle   0.010832
7                        numChars_PostTitle   0.008457
8        diffWords_PostTitleArticleKeywords   0.005295
9        diffChars_PostTitleArticleKeywords   0.004994
10         ratioWords_ArticleTitlePostTitle   0.004707
