# Session 3
## Experiment 2
### Lab

In this lab we will use the Wine dataset.

### Data Source
https://archive.ics.uci.edu/ml/datasets/wine

### Objective
To understand feature scaling (min-max) and standardization (z-score).

#### Dataset Information:
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. 

### Data Attributes

1. Alcohol 
2. Malic acid 
3. Ash 
4. Alcalinity of ash 
5. Magnesium 
6. Total phenols 
7. Flavanoids 
8. Nonflavanoid phenols 
9. Proanthocyanins 
10. Color Intensity
11. Hue
12. OD280/OD315 of diluted wines 
13. Proline 



### Predicted attribute
The first field in the data is the class label, which takes the values 1, 2, or 3.
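
The column layout described above can be captured in a small lookup table, so that later code can refer to columns by name rather than by number. This is a minimal sketch: `COLUMN_NAMES` and `COLUMN_INDEX` are illustrative names that the rest of this lab does not use, and the ordering assumes the CSV keeps the UCI layout (class label first, then the 13 attributes in the listed order).

```python
# Column layout assumed from the attribute list above: class label first,
# followed by the 13 constituents in the order they are listed.
COLUMN_NAMES = [
    "Class",
    "Alcohol",
    "Malic acid",
    "Ash",
    "Alcalinity of ash",
    "Magnesium",
    "Total phenols",
    "Flavanoids",
    "Nonflavanoid phenols",
    "Proanthocyanins",
    "Color Intensity",
    "Hue",
    "OD280/OD315 of diluted wines",
    "Proline",
]

# Map each name to its integer column position (hypothetical helper).
COLUMN_INDEX = {name: i for i, name in enumerate(COLUMN_NAMES)}

print(COLUMN_INDEX["Alcohol"], COLUMN_INDEX["Malic acid"])
```

With a DataFrame loaded using `header=None`, `data[[COLUMN_INDEX["Alcohol"], COLUMN_INDEX["Malic acid"]]]` would select the same columns as `data[[1, 2]]`.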

In [None]:
import pandas as pd
data = pd.read_csv("wine_data.csv", header=None)
#print(data.shape)
print(data)

### Extract the class label and the features Alcohol (percent/volume) and Malic acid (g/l).

In [None]:
## Code here
# Columns: 0 = class label, 1 = Alcohol, 2 = Malic acid
features_extracted = data[[0,1,2]].values
#print(features_extracted)

### Plot Alcohol (percent/volume) against Malic acid (g/l).
#### Can you see any sparsity or asymmetry in the data?

In [None]:
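## Your code here
import matplotlib.pyplot as plt
plt.figure(1, figsize=(20,10))
# x = Alcohol, y = Malic acid, color = class label
plt.scatter(features_extracted[:,1], features_extracted[:,2], c=features_extracted[:,0],s=60)
plt.xlabel("Alcohol (percent/volume)")
plt.ylabel("Malic acid (g/l)")
plt.show()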


## Min-Max Scaling
Min-max scaling maps each feature to the range [0, 1]. The formula for min-max scaling is:

\begin{equation*}
    x_{norm}=\frac{x-x_{min}}{x_{max} - x_{min}}
\end{equation*}

Let us scale the Alcohol and Malic Acid columns. Plot the graph using these scaled values. Do you see any difference?
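
To make the formula concrete, here is a small worked example on made-up numbers (not taken from the Wine data). It is a sketch that applies the formula column by column with NumPy; the array `X` and the variable names are illustrative only.

```python
import numpy as np

# Made-up matrix: 3 samples, 2 features on different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 40.0]])

x_min = X.min(axis=0)  # per-column minimum: [1., 10.]
x_max = X.max(axis=0)  # per-column maximum: [3., 40.]

# Broadcasting applies (x - x_min) / (x_max - x_min) to every column.
X_scaled = (X - x_min) / (x_max - x_min)
print(X_scaled)
```

Each column now has minimum 0 and maximum 1. Note that a column whose values are all identical would give `x_max - x_min = 0` and therefore a division by zero, so such columns need separate handling.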

In [None]:
import numpy as np

## Min-max normalization
def xnorm(x, xmin, xmax):
    return (x - xmin)/(xmax - xmin) 
def minmax(dataSet):
    xmin = np.min(dataSet,0)
    xmax = np.max(dataSet,0)
    return xnorm(dataSet,xmin,xmax)


scaled_features = minmax(features_extracted[:,1:])
print(scaled_features.shape)

### Only Scaled Plot ### 
plt.figure(1, figsize=(20,10))
plt.scatter(scaled_features[:,0], scaled_features[:,1], c=features_extracted[:,0],s=60)
plt.show()

### Raw data (blue) vs. scaled data (red): observe that the scaled values all lie in [0, 1]
plt.figure(1, figsize=(20,10))
plt.scatter(features_extracted[:,1], features_extracted[:,2], c='b',s=60)
plt.scatter(scaled_features[:,0], scaled_features[:,1], c='r',s=60)
plt.show()



## Standardization Method
Given the original data $x$, the mean $\mu$ of a particular feature, and its standard deviation $\sigma$, standardization rescales the feature to zero mean and unit standard deviation. The formula is:

\begin{equation*}
    x_{norm}=\frac{x-\mu}{\sigma}
\end{equation*}

**Exercise 4** :: Plot Alcohol (percent/volume) against Malic acid (g/l) using the standardized values. Do you see any difference between this plot, the min-max plot, and the raw-data plot?
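
As a quick check of the formula, the sketch below standardizes a small made-up matrix (not Wine data) and verifies that each column ends up with mean 0 and standard deviation 1. The names `X`, `mu`, and `sigma` are illustrative; `np.std` is used with its default `ddof=0`, the population standard deviation.

```python
import numpy as np

# Made-up matrix: 3 samples, 2 features on different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 40.0]])

mu = X.mean(axis=0)    # per-column mean
sigma = X.std(axis=0)  # per-column population standard deviation (ddof=0)

X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # close to 0 for every column
print(X_std.std(axis=0))   # close to 1 for every column
```

Unlike min-max scaling, the result is not confined to [0, 1]: values below the mean become negative and values above it become positive.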


In [None]:
import numpy as np
def xZscore(x, mu, sigma):
    return (x - mu)/sigma 
def zScore(dataSet):
    avg = np.mean(dataSet,0)
    std = np.std(dataSet,0)
    return xZscore(dataSet, avg, std) 


std_features = zScore(features_extracted[:,1:])
print(std_features.shape)

plt.figure(1, figsize=(20,10))
### Plot1: Plot Standardised features
plt.scatter(std_features[:,0], std_features[:,1], c=features_extracted[:,0],s=60)
plt.show()

### Plot2: Standardized data (red) vs. raw data (blue)
### Observe that the standardized values are centered on 0
plt.figure(1, figsize=(20,10))
plt.scatter(features_extracted[:,1], features_extracted[:,2], c='b',s=60)
plt.scatter(std_features[:,0], std_features[:,1], c='r',s=40)
plt.show()

### Plot3: Scaled data (red) vs. standardized data (yellow) vs. raw data (blue)
### Observe that both transformed sets span a much smaller range than the raw data
plt.figure(1, figsize=(20,10))
plt.scatter(scaled_features[:,0], scaled_features[:,1], c='r',s=60)
plt.scatter(features_extracted[:,1], features_extracted[:,2], c='b',s=60)
plt.scatter(std_features[:,0], std_features[:,1], c='y',s=60)
plt.show()



In [None]:
### Intuition behind the difference between standardization and scaling ###

x = range(10)
y = range(10)

plt.figure(1, figsize=(20,10))
plt.plot(x, y, "ko", ms = 10)
plt.plot(x, minmax(y), "ro", ms = 10)
plt.plot(x, zScore(y), "bo", ms = 10)
plt.show()

## Exercises
Try the above with the following y values
  * y = range(10, 20)
  * y = range(20, 10, -1)
  * y = range(50, 100, 5)
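
One way to check what you observe in these exercises numerically is sketched below. It uses two hypothetical 1-D helpers, `minmax_1d` and `zscore_1d`, which mirror the lab's `minmax` and `zScore` functions but are defined here so the snippet runs on its own, and it compares each suggested `y` with the original `range(10)`.

```python
import numpy as np

def minmax_1d(y):
    # Min-max scaling of a 1-D sequence (illustrative helper).
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / (y.max() - y.min())

def zscore_1d(y):
    # Standardization of a 1-D sequence (illustrative helper).
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()

base = range(10)

# Shifting or stretching y (with a positive factor) leaves both results unchanged.
shifted_same = (np.allclose(minmax_1d(range(10, 20)), minmax_1d(base))
                and np.allclose(zscore_1d(range(10, 20)), zscore_1d(base)))
stretched_same = (np.allclose(minmax_1d(range(50, 100, 5)), minmax_1d(base))
                  and np.allclose(zscore_1d(range(50, 100, 5)), zscore_1d(base)))

# A decreasing sequence gives the same values in reverse order.
reversed_mirror = (np.allclose(minmax_1d(range(20, 10, -1)), minmax_1d(base)[::-1])
                   and np.allclose(zscore_1d(range(20, 10, -1)), zscore_1d(base)[::-1]))

print(shifted_same, stretched_same, reversed_mirror)
```

All three sequences are equally spaced with 10 points, so after either transformation they differ from `range(10)` only in direction, not in shape.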

In [None]:
y = range(10, 20)

In [None]:
y = range(20, 10, -1)

In [None]:
y = range(50, 100, 5)

## Classification
Compare the classification results obtained with raw, scaled, and standardized features. You can use any classifier, such as KNN, a linear classifier, or Naive Bayes.
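
Why scaling can matter for a distance-based classifier such as KNN is sketched below on four made-up points (not Wine data). One feature is measured in units and the other in hundreds, loosely in the spirit of Alcohol versus Proline. The helper `nearest_other` is hypothetical and simply returns the index of the closest other point by Euclidean distance.

```python
import numpy as np

# Made-up points: column 0 varies by a few units, column 1 by hundreds.
points = np.array([[13.0, 1000.0],   # query point (index 0)
                   [13.1,  900.0],
                   [11.0, 1010.0],
                   [14.0,  300.0]])

def nearest_other(X, i=0):
    # Index of the point closest to X[i], excluding X[i] itself.
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return int(np.argmin(d))

# Raw features: the large-scale column dominates the distance.
raw_nn = nearest_other(points)

# Min-max scaled features: both columns contribute on a comparable scale.
scaled = (points - points.min(axis=0)) / (points.max(axis=0) - points.min(axis=0))
scaled_nn = nearest_other(scaled)

print(raw_nn, scaled_nn)
```

On the raw values the nearest neighbour of the query is point 2, because a difference of 10 in the second column outweighs a difference of 2 in the first. After scaling, the nearest neighbour becomes point 1. Whether this effect helps or hurts accuracy depends on the data, which is what the comparison in this section measures.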

In [None]:
### Let's compare raw, scaled, and standardized features on a classification task
### We will use the wine data set in a KNN classifier and see the accuracies.

#import packages
import math
import collections
import random


# ------------------------------------------------ #
# We are assuming that the label is the last field #
# If not, munge the data to make it so!            #
# ------------------------------------------------ #

def dist(a, b):
    sqSum = 0
    for i in range(len(a)):
        sqSum += (a[i] - b[i]) ** 2
    return math.sqrt(sqSum)

def kNN(k, train, given):
    distances = []
    for t in train:
        distances.append((dist(t[:-1], given[:-1]), t[-1]))
    distances.sort()
    return distances[:k]

def kNN_classify(k, train, given):
    tally = collections.Counter()
    for nn in kNN(k, train, given):
        # Count the whole label; Counter.update(str) would count each character separately
        tally[str(int(nn[-1]))] += 1
    return tally.most_common(1)[0]

picker = list(range(data.shape[0]))
random.shuffle(picker)       

FEATURE_COLUMNS = list(range(1, 14))
ALL_COLUMNS = FEATURE_COLUMNS + [0]

TRAIN_TEST_RATIO = 0.8

## Raw Data ###
# Move the class label (column 0) to the last position, as dist() and kNN() expect
data = data.reindex(columns = ALL_COLUMNS)
trainMax = int(len(picker) * TRAIN_TEST_RATIO)
train = []
test = []
for pick in picker[:trainMax]:
    train.append(list(data.values[pick]))         ### select 80% of data to be used as training set
for pick in picker[trainMax:]:
    test.append(list(data.values[pick])) 

acc = []
for t in test:
    acc.append(str(int(t[-1])) == kNN_classify(5, train, t)[0])

print("Accuracy without any normalization: ", sum(acc)/(len(test)*1.0))

## Scaled data ###
scaled_feats = minmax(data[FEATURE_COLUMNS].values)
scaled_data = np.append(scaled_feats, data[0].values.reshape(data.shape[0],1),1)

train = []
test = []

for pick in picker[:trainMax]:
    train.append(list(scaled_data[pick]))         ### select 80% of data to be used as training set
for pick in picker[trainMax:]:
    test.append(list(scaled_data[pick])) 

acc = []
for t in test:
    acc.append(str(int(t[-1])) == kNN_classify(5, train, t)[0])

print("Accuracy with scaling: ", sum(acc)/(len(test)*1.0))

### Standardized Data ###

std_feats = zScore(data[FEATURE_COLUMNS].values)
std_data = np.append(std_feats, data[0].values.reshape(data.shape[0],1),1)

train = []
test = []
for pick in picker[:trainMax]:
    train.append(list(std_data[pick]))         ### select 80% of data to be used as training set
for pick in picker[trainMax:]:
    test.append(list(std_data[pick])) 

acc = []
for t in test:
    acc.append(str(int(t[-1])) == kNN_classify(5, train, t)[0])

print("Accuracy with standardization:", sum(acc)/(len(test)*1.0))