# Balancing datasets

Some datasets are imbalanced: one class has many more instances than another. A model trained on such data can become biased toward the majority class, so it often helps to even out the class counts before training. This can be done by randomly undersampling the majority class or by randomly oversampling the minority class.

Run the code below.

In [59]:
import pandas as pd

# Toy dataset: y is the class label, x1 and x2 are features
data = pd.DataFrame(columns=["y", "x1", "x2"], data=[[1,5,6],[1,4,4],[1,4,5],
                                                     [2,7,8],[2,8,7]])

data

Unnamed: 0,y,x1,x2
0,1,5,6
1,1,4,4
2,1,4,5
3,2,7,8
4,2,8,7


As you can see, there are more instances of class 1 than of class 2. Although the imbalance is easy to spot in this small dataset, it usually will not be in real-life cases. Count and save the class occurrences.

In [71]:
count_class_1, count_class_2 = # CODE HERE

print('Class 1:', count_class_1)

print('Class 2:', count_class_2)

Class 1: 3
Class 2: 2


Divide the dataset according to classes

In [72]:
data_class_1 = # CODE HERE

data_class_2 = # CODE HERE
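
One possible way to approach the counting and splitting steps is sketched below. It is only an illustration on the same toy data, assuming `value_counts()` for the counts and boolean masks for the split; the name `class_counts` is introduced here and is not part of the exercise.

```python
import pandas as pd

data = pd.DataFrame(columns=["y", "x1", "x2"],
                    data=[[1, 5, 6], [1, 4, 4], [1, 4, 5],
                          [2, 7, 8], [2, 8, 7]])

# value_counts() returns a Series indexed by class label
class_counts = data["y"].value_counts()
count_class_1 = class_counts.loc[1]
count_class_2 = class_counts.loc[2]

# Boolean masks select the rows that belong to each class
data_class_1 = data[data["y"] == 1]
data_class_2 = data[data["y"] == 2]

print("Class 1:", count_class_1)
print("Class 2:", count_class_2)
```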

## Undersampling 

Undersampling downsizes the majority class to the number of occurrences in the minority class by random selection.

Use pandas' sample() method to draw from class 1 the same number of rows as there are in class 2. Then, concatenate the undersampled class 1 data with the original class 2 data.

In [69]:
df_class_1_under = # CODE HERE

df_balanced_under = # CODE HERE

df_balanced_under

Unnamed: 0,y,x1,x2
1,1,4,4
2,1,4,5
3,2,7,8
4,2,8,7
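
A minimal sketch of random undersampling on the toy data follows. The `random_state=0` argument is an assumption added here for reproducibility; which class 1 rows are kept depends on the seed, so the rows may differ from the output shown above.

```python
import pandas as pd

data = pd.DataFrame(columns=["y", "x1", "x2"],
                    data=[[1, 5, 6], [1, 4, 4], [1, 4, 5],
                          [2, 7, 8], [2, 8, 7]])
data_class_1 = data[data["y"] == 1]
data_class_2 = data[data["y"] == 2]

# Draw as many class 1 rows as there are class 2 rows (without replacement)
df_class_1_under = data_class_1.sample(n=len(data_class_2), random_state=0)

# Stack the reduced majority class on top of the untouched minority class
df_balanced_under = pd.concat([df_class_1_under, data_class_2], axis=0)

print(df_balanced_under["y"].value_counts())
```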


## Oversampling 

Oversampling upsizes the minority class to the number of occurrences in the majority class by random duplication.

Go ahead, oversample and concatenate! Because you will draw more rows than the minority class contains, the sample() method requires a specific parameter.

In [75]:
df_class_2_over = # CODE HERE

df_balanced_over = # CODE HERE

df_balanced_over

Unnamed: 0,y,x1,x2
0,1,5,6
1,1,4,4
2,1,4,5
4,2,8,7
3,2,7,8
4,2,8,7
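
Random oversampling could be sketched as follows. Drawing more rows than a DataFrame contains requires sampling with replacement, hence `replace=True`; `random_state=0` is an assumption added here so the run is repeatable, and the duplicated rows may differ from the output above.

```python
import pandas as pd

data = pd.DataFrame(columns=["y", "x1", "x2"],
                    data=[[1, 5, 6], [1, 4, 4], [1, 4, 5],
                          [2, 7, 8], [2, 8, 7]])
data_class_1 = data[data["y"] == 1]
data_class_2 = data[data["y"] == 2]

# Sample class 2 up to the size of class 1; rows are duplicated at random
df_class_2_over = data_class_2.sample(n=len(data_class_1), replace=True,
                                      random_state=0)

# Combine the untouched majority class with the enlarged minority class
df_balanced_over = pd.concat([data_class_1, df_class_2_over], axis=0)

print(df_balanced_over["y"].value_counts())
```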


# Standardization

Machine learning algorithms such as SVMs often assume that all features are roughly normally distributed, centered around zero, and of similar variance. That is rarely the case in real-world datasets. A feature with a significantly larger variance can dominate the others and prevent the model from learning from them.

Standardization transforms each feature by subtracting its mean (u) and dividing by its standard deviation (s). The result is centered at zero and has unit variance.

z = (x - u) / s
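
As a quick worked example, the formula can be applied by hand with NumPy to a single feature with values 0, 1 and 2. This sketch uses the population standard deviation (`ddof=0`), which is the convention StandardScaler follows.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])

u = x.mean()   # mean of the feature: 1.0
s = x.std()    # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation
z = (x - u) / s

print(z)  # approximately [-1.2247, 0.0, 1.2247]
```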

Standardization can be done using the StandardScaler class from Sklearn's "preprocessing" module.

- Initiate default Scaler
- Fit data
- Transform data
- Print scaled data

In [20]:
from sklearn.preprocessing import StandardScaler

data = [[0,0],
        [1,1],
        [2,2]]

# CODE HERE

array([[-1.22474487, -1.22474487],
       [ 0.        ,  0.        ],
       [ 1.22474487,  1.22474487]])

The fitted scaler keeps the mean and standard deviation it learned, so it can apply the same transformation to new data. Transform the new data to verify that it does so.

In [14]:
new_data = [[1,1]]

# CODE HERE

array([[0., 0.]])
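
The fit and transform steps, followed by reuse of the fitted scaler on new data, might look like the sketch below. Variable names such as `scaled` and `new_scaled` are illustrative choices, not part of the exercise.

```python
from sklearn.preprocessing import StandardScaler

data = [[0, 0],
        [1, 1],
        [2, 2]]

scaler = StandardScaler()        # default scaler: center and scale to unit variance
scaler.fit(data)                 # learn the per-feature mean and standard deviation
scaled = scaler.transform(data)  # apply the learned transformation
print(scaled)

# The fitted scaler reuses the statistics learned from `data`
new_data = [[1, 1]]
new_scaled = scaler.transform(new_data)
print(new_scaled)
```

Because [1, 1] equals the mean of the training data, it maps to zero in both features.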

# Scaling to range

Another transformation option is to scale to a range. The advantage of this method is its robustness to very small standard deviations. It can also preserve zero entries in sparse datasets. There are two ways to scale to a range in Sklearn:

- MinMaxScaler transforms each feature to a given range, [0,1] by default
- MaxAbsScaler divides each feature by its maximum absolute value, so the result lies in [-1,1]

Below, use MinMaxScaler to transform the data.

In [18]:
from sklearn.preprocessing import MinMaxScaler

data = [[0,0],
        [1,1],
        [2,2]]

# CODE HERE

array([[0. , 0. ],
       [0.5, 0.5],
       [1. , 1. ]])
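
A minimal sketch with MinMaxScaler, using its default range of [0, 1] and the combined `fit_transform()` call:

```python
from sklearn.preprocessing import MinMaxScaler

data = [[0, 0],
        [1, 1],
        [2, 2]]

# Per feature: (x - min) / (max - min), which maps min to 0 and max to 1
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)

print(scaled)
```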

MaxAbsScaler works in a similar way but transforms the data to the range [-1,1]. Because it only divides by the maximum absolute value and does not shift the data, it is better suited to data that is already centered at zero (standardized).

Below, standardize the data before scaling it to the range [-1,1].

In [22]:
from sklearn.preprocessing import MaxAbsScaler

data = [[-1,-1],
        [1,1],
        [3,3]]

# CODE HERE

array([[-1., -1.],
       [ 0.,  0.],
       [ 1.,  1.]])
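
Chaining the two steps could be sketched like this. The intermediate name `standardized` is illustrative only.

```python
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

data = [[-1, -1],
        [1, 1],
        [3, 3]]

# Step 1: center each feature at zero with unit variance
standardized = StandardScaler().fit_transform(data)

# Step 2: divide each feature by its largest absolute value
scaled = MaxAbsScaler().fit_transform(standardized)

print(scaled)
```

Since the standardized values are symmetric around zero, the extremes land exactly on -1 and 1.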

# Dealing with outliers

In the presence of outliers, standard scaling works poorly: the mean and standard deviation are themselves pulled toward the extreme values, so the bulk of the data ends up squeezed into a narrow range. Sklearn's RobustScaler offers a more robust solution for such datasets.

Instead of removing the mean, which would be affected by the outliers, it removes the median. It then scales the data according to the interquartile range (IQR).

If the sorted data were split into four quarters, the IQR would span the 2nd and 3rd quarters, that is, the distance between the 25th and 75th percentiles. By excluding the outermost quarters (1st and 4th), the algorithm keeps the outliers from influencing the scale.

RobustScaler uses the IQR by default, but the range can be set manually with the quantile_range parameter.

Use RobustScaler to transform the data.

In [3]:
from sklearn.preprocessing import RobustScaler

data = [[1,1],
        [2,2],
        [3,999]]

# CODE HERE

array([[-1.        , -0.00200401],
       [ 0.        ,  0.        ],
       [ 1.        ,  1.99799599]])
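
The sketch below applies RobustScaler with its defaults and then recomputes the second column by hand from the median and the 25th and 75th percentiles, to show where the numbers come from. The manual check with `np.percentile` is an illustration added here, not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

data = [[1, 1],
        [2, 2],
        [3, 999]]

# Default behavior: subtract the median, divide by the IQR (25th to 75th percentile)
scaler = RobustScaler()
scaled = scaler.fit_transform(data)
print(scaled)

# Manual check on the second column, which contains the outlier
col = np.array([1.0, 2.0, 999.0])
q25, q75 = np.percentile(col, [25, 75])
manual = (col - np.median(col)) / (q75 - q25)
print(manual)
```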

Do the same thing, but set the range manually so that the lowest and highest fifths of the data are excluded. What is the transformed value of the outlier?

In [31]:
# CODE HERE

array([[-0.83333333, -0.00167001],
       [ 0.        ,  0.        ],
       [ 0.83333333,  1.66499666]])
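
Excluding the extreme fifths corresponds to the 20th to 80th percentile range. A sketch, assuming the `quantile_range` parameter of RobustScaler is used for this:

```python
from sklearn.preprocessing import RobustScaler

data = [[1, 1],
        [2, 2],
        [3, 999]]

# Scale by the distance between the 20th and 80th percentiles
scaler = RobustScaler(quantile_range=(20.0, 80.0))
scaled = scaler.fit_transform(data)

print(scaled)
print("Outlier after scaling:", scaled[2][1])
```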

# Encoding Categorical Features

Run the code below.

In [2]:
import pandas as pd

df = pd.DataFrame(columns=["target", "features"], 
                  data=[["a","u"],["b","s"],["a","s"],["c","r"]])
df

Unnamed: 0,target,features
0,a,u
1,b,s
2,a,s
3,c,r


In certain cases, targets and/or features will be categorical values such as letters. Before training a machine learning algorithm, you will need to convert them to numbers.

When dealing with the targets, you can use Sklearn's LabelEncoder. Do so below.

In [50]:
from sklearn.preprocessing import LabelEncoder

# CODE HERE

array([0, 1, 0, 2])
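
One way to encode the target column is sketched below. The names `encoder` and `target_encoded` are illustrative. LabelEncoder assigns integers to the sorted unique labels, so here a, b and c map to 0, 1 and 2.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(columns=["target", "features"],
                  data=[["a", "u"], ["b", "s"], ["a", "s"], ["c", "r"]])

encoder = LabelEncoder()
# fit_transform learns the label-to-integer mapping and applies it in one step
target_encoded = encoder.fit_transform(df["target"])

print(target_encoded)
print(encoder.classes_)  # original labels, in the order of their integer codes
```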

The same encoding could be applied to the features, but it may hurt the accuracy of the classifier.

Taking the example of our dataframe, [u,s,s,r] could be transformed to [1,2,2,3]. A machine learning algorithm could then wrongly conclude that 1 and 3 are further apart than 1 and 2, although the categories have no natural order, and this would distort the model.

To avoid this, it is better to create one binary feature per category. This can be done with pandas' get_dummies. Use it to transform the features.

In [53]:
feature_encoded = # CODE HERE

feature_encoded

Unnamed: 0,r,s,u
0,0,0,1
1,0,1,0
2,0,1,0
3,1,0,0
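
A sketch of one-hot encoding the feature column with `get_dummies` follows. Passing `dtype=int` is an assumption made here so that the result contains 0/1 integers as in the output above; recent pandas versions return booleans by default.

```python
import pandas as pd

df = pd.DataFrame(columns=["target", "features"],
                  data=[["a", "u"], ["b", "s"], ["a", "s"], ["c", "r"]])

# One binary column per distinct category, columns sorted alphabetically
feature_encoded = pd.get_dummies(df["features"], dtype=int)

print(feature_encoded)
```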


# Binning

Binning is the process of turning continuous data into discrete data by grouping values into intervals (bins). For example, in a dataset consisting of people's ages, you may want to consider age groups.

Run the code below.

In [86]:
ages = pd.DataFrame(columns=["age"],
                    data=[[13],[15],[18],[19],[20],[22],[23],[23]])

ages

Unnamed: 0,age
0,13
1,15
2,18
3,19
4,20
5,22
6,23
7,23


To partition the data, you can use pandas' "cut". Use it to cut the following age data into two bins, "teen" and "adult".

In [88]:
ages['group'] = # CODE HERE

Unnamed: 0,age,group
0,13,teen
1,15,teen
2,18,teen
3,19,teen
4,20,teen
5,22,adult
6,23,adult
7,23,adult
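
The sketch below reproduces the grouping shown above with explicit bin edges. The edges `[0, 20, 120]` are an assumption chosen to match that output, in which age 20 still counts as "teen"; by default `pd.cut` intervals include their right edge, so the bins are (0, 20] and (20, 120].

```python
import pandas as pd

ages = pd.DataFrame(columns=["age"],
                    data=[[13], [15], [18], [19], [20], [22], [23], [23]])

# Hypothetical bin edges: (0, 20] is labeled "teen", (20, 120] is labeled "adult"
ages["group"] = pd.cut(ages["age"], bins=[0, 20, 120], labels=["teen", "adult"])

print(ages)
```

Passing an integer such as `bins=2` instead would split the observed range into equal-width intervals, which places the boundary elsewhere.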


# Model Building

Time to check out the difference in performance between raw and standardized data. You will be using Sklearn's breast cancer dataset. Run the code below to load it.

In [136]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data
y = data.target

Train an SVM on the raw data, cross-validate it, and check the mean accuracy.

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn import svm

model = # CODE HERE

scores = # CODE HERE

scores.mean()

0.4101532567049808
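
A possible baseline on the raw features is sketched below. It assumes an `SVC` with default hyperparameters and 5-fold cross-validation (`cv=5`); neither choice is specified by the exercise. The score you get depends on the scikit-learn version, because the default `gamma` of `SVC` changed from `'auto'` to `'scale'` in version 0.22, so it may not match the value shown above.

```python
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X = data.data
y = data.target

# Support vector classifier with default hyperparameters, on unscaled features
model = svm.SVC()

# One accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)

print(scores.mean())
```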

Now, use StandardScaler to standardize the data before training an SVM. Cross-validate and check the mean accuracy.

In [140]:
# CODE HERE

0.9736377981992016
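
One way to combine standardization with cross-validation is to put both steps in a pipeline, as sketched here. Using `make_pipeline` is a design choice made for this example rather than something the exercise prescribes: it refits the scaler on the training part of each fold, so no statistics leak from the held-out part. `cv=5` is likewise an assumption, and the exact score may differ from the value above.

```python
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
y = data.target

# Standardize the features, then fit the SVM, as a single estimator
model = make_pipeline(StandardScaler(), svm.SVC())

scores = cross_val_score(model, X, y, cv=5)

print(scores.mean())
```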

# Challenge

Train a model that reaches at least 90% accuracy on the data below. The data has a number of irregularities that need to be dealt with before training.

In [431]:
data = pd.read_csv("data1.csv")

data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,target
0,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,0
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050,0
2,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185,0
3,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480,0
4,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735,0


In [1]:
# CODE HERE