In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Pre-Processing the data

#### Data preprocessing in Machine Learning is a crucial step that helps enhance the quality of data to promote the extraction of meaningful insights from the data.

#### Data preprocessing in Machine Learning refers to the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training Machine Learning models.

#### In simple words, data preprocessing in Machine Learning is a data mining technique that transforms raw data into an understandable and readable format.

# STEPS

### Getting the dataset
### Importing libraries
### Importing datasets
### Finding Missing Data
### Encoding Categorical Data
### Splitting dataset into training and test set
### Feature scaling

##### Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.

##### Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis.



# Need of Data Preprocessing

##### To get good results from a Machine Learning model, the data has to be in a suitable format. Some models require data in a specific form. For example, many Random Forest implementations do not accept null values, so nulls have to be handled in the raw data set before the algorithm can be run.
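
As a minimal sketch of managing null values before training, the snippet below counts the missing entries in each column and fills them with the column median. The small DataFrame and its column names are made up for illustration, and median imputation is only one possible strategy, not one prescribed by the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries (not one of the notebook's datasets)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Fill each missing value with the median of its own column
filled = toy.fillna(toy.median())
print(filled)
```

After this step the frame contains no nulls, so it can be passed to estimators that reject missing values.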

##### Another aspect is that the data set should be formatted so that more than one Machine Learning or Deep Learning algorithm can be run on the same data set and the best of them chosen.

##### Feature scaling is applied for a further reason: gradient descent converges much faster with feature scaling than without it.


# Preprocessing Techniques



### Binarize Data  
We can transform our data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.


### Feature Scaling:
It puts all our features on the same scale. You don’t have to apply feature scaling to the dummy variables. Two techniques:

(i) Standardization
(ii) Normalization

### Standardize Data
Standardization of datasets is a common requirement for many machine learning estimators

### Normalization
Normalization involves adjusting the values in the feature vector so as to measure them on a common scale. Here, the values of each feature vector (row) are adjusted so that their absolute values sum to 1 (L1 normalization).

In [2]:
import sklearn.preprocessing

# Binarize Data (Make Binary)

• We can transform our data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

• This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

• We can create new binary attributes in Python using scikit-learn with the Binarizer class.

In [3]:
from sklearn.preprocessing import Binarizer

In [5]:
import pandas as pd
path='sample_data/pima.csv'
# Only sample data is uploaded in Canvas. Original source: https://datahub.io/machine-learning/diabetes/r/diabetes.csv
features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age','class']
data=pd.read_csv(path,names=features)
data.head()



Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,tested_positive
1,1,85,66,29,0,26.6,0.351,31,tested_negative
2,8,183,64,0,0,23.3,0.672,32,tested_positive
3,1,89,66,23,94,28.1,0.167,21,tested_negative
4,0,137,40,35,168,43.1,2.288,33,tested_positive


In [7]:
array = data.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

In [8]:
binary=Binarizer(threshold=0.25).fit(X)
binaryX=binary.transform(X)
print(binaryX)

[[1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 ...
 [1. 1. 1. ... 1. 0. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]]
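
To illustrate the "probabilities to crisp values" use mentioned above, here is a small sketch. The probabilities are made up and the cut-off of 0.5 is an assumption chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities, one column
probs = np.array([[0.10], [0.50], [0.51], [0.93]])

# Values strictly above the threshold become 1; values equal to or below it become 0
crisp = Binarizer(threshold=0.5).fit_transform(probs)
print(crisp.ravel())
```

Note that 0.50 maps to 0 because only values strictly above the threshold are marked 1.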


# Normalize or Standardize?

Normalization is a good choice when you know that your data does not follow a Gaussian distribution. It can be useful with algorithms that do not assume any particular distribution of the data, such as K-Nearest Neighbors and Neural Networks.

Standardization, on the other hand, can be helpful when the data follows a Gaussian distribution, although this is not a strict requirement. Unlike normalization, standardization does not map values into a bounded range, so outliers are not squeezed toward the ends of a fixed interval. They do, however, still influence the mean and standard deviation used for scaling.

# When Feature Scaling matters

Some machine learning models are fundamentally based on a distance measure, for example K-Nearest-Neighbours and SVM. Neural Networks, although not distance-based, are also sensitive to the scale of their inputs. Feature scaling is essential for these models, especially when the ranges of the features are very different. Otherwise, features with a large range will have a large influence on the computed distance.
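
The effect of a large-range feature on distances can be sketched as follows. The three samples (age, income) are invented for illustration, and Euclidean distance is assumed as the distance measure.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: [age in years, income in currency units]
samples = np.array([[25.0, 50000.0],
                    [60.0, 50500.0],
                    [26.0, 58000.0]])

def dist(a, b):
    # Euclidean distance between two feature vectors
    return float(np.linalg.norm(a - b))

# Raw features: income dominates, so the 35-year age gap barely counts
raw_01 = dist(samples[0], samples[1])
raw_02 = dist(samples[0], samples[2])
print(raw_01, raw_02)

# After standardization both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(samples)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])
print(scaled_01, scaled_02)
```

On the raw data the second sample looks much closer to the first than the third does, purely because income has the larger range. After scaling, the two distances are of similar size.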


Min-Max Normalisation lets us transform data with varying scales so that no single dimension dominates the statistics, and it does not require a strong assumption about the distribution of the data, which suits algorithms such as k-nearest neighbours and artificial neural networks. However, Normalisation does not handle outliers very well. Standardisation, by contrast, copes better with outliers and helps some computational algorithms, such as gradient descent, to converge. For that reason standardisation is often preferred over Min-Max Normalisation.

# Scaling
Our dataset most likely comprises attributes with varying scales. Many ML algorithms do not work well with such data, so it needs rescaling. Data rescaling makes sure that the attributes are on the same scale. Generally, attributes are rescaled into the range 0 to 1.

### 1) Decimal Scaling

### 2) Simple Feature Scaling

### 3) Min-Max Normalization

### 4) z-Score Normalization (zero-mean Normalization)

### Decimal Scaling Method For Normalization –

It normalizes by moving the decimal point of the data values. To normalize the data with this technique, we divide each value by 10^j, where j is the smallest integer such that the largest absolute value in the data is less than 1 after the division. The data value v_i is normalized to v_i' using the formula below:

### $v_i' = \dfrac{v_i}{10^j}$

Let the input data be: -10, 201, 301, -401, 501, 601, 701

To normalize the above data,

Step 1: Maximum absolute value in given data (m): 701

Step 2: Divide the given data by 1000 (i.e. j = 3, because 1000 is the smallest power of 10 greater than 701)

Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
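
The two steps above can be sketched with NumPy as follows. The logarithm-based way of finding j is one possible choice and assumes the data contains at least one non-zero value.

```python
import numpy as np

values = np.array([-10, 201, 301, -401, 501, 601, 701], dtype=float)

# Step 1: largest absolute value in the data
max_abs = np.abs(values).max()

# Step 2: smallest integer j such that max_abs / 10**j < 1
j = int(np.floor(np.log10(max_abs))) + 1

normalized = values / 10**j
print(j)
print(normalized)
```

Because the maximum is taken over absolute values, negative entries such as -401 are handled the same way as positive ones.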


In [9]:
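# Decimal scaling method for normalization
def Dec_scale(df):
    # The largest absolute value decides how far the decimal point moves
    p = max(abs(x) for x in df)
    # Number of digits in the integer part of p (assumes p >= 1)
    q = len(str(int(p)))
    print(p,q)
    l=[]
    for x in df:
        l.append(x/10**q)
    print(l)
data=[18,12,89,121,900,45]
Dec_scale(data)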

900 3
[0.018, 0.012, 0.089, 0.121, 0.9, 0.045]


In [11]:
#Simple Feature Scaling
path='sample_data/iris.csv'
# Only sample data is uploaded in Canvas. Original source: https://datahub.io/machine-learning/iris/r/iris.csv
data=pd.read_csv(path)
#data.head()
data['sepal.length']=data['sepal.length']/data['sepal.length'].max()
print(data)

     sepal.length  sepal.width  petal.length  petal.width           class
0        0.645570          3.5           1.4          0.2     Iris-setosa
1        0.620253          3.0           1.4          0.2     Iris-setosa
2        0.594937          3.2           1.3          0.2     Iris-setosa
3        0.582278          3.1           1.5          0.2     Iris-setosa
4        0.632911          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145      0.848101          3.0           5.2          2.3  Iris-virginica
146      0.797468          2.5           5.0          1.9  Iris-virginica
147      0.822785          3.0           5.2          2.0  Iris-virginica
148      0.784810          3.4           5.4          2.3  Iris-virginica
149      0.746835          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


In [14]:
#Simple Feature Scaling
from sklearn.preprocessing import MaxAbsScaler
dataWithNumericFeatures = data.loc[:, data.columns != "class"]
simpleScale=MaxAbsScaler()
rescaled=simpleScale.fit_transform(dataWithNumericFeatures)
print(rescaled)

[[0.64556962 0.79545455 0.20289855 0.08      ]
 [0.62025316 0.68181818 0.20289855 0.08      ]
 [0.59493671 0.72727273 0.1884058  0.08      ]
 [0.58227848 0.70454545 0.2173913  0.08      ]
 [0.63291139 0.81818182 0.20289855 0.08      ]
 [0.6835443  0.88636364 0.24637681 0.16      ]
 [0.58227848 0.77272727 0.20289855 0.12      ]
 [0.63291139 0.77272727 0.2173913  0.08      ]
 [0.55696203 0.65909091 0.20289855 0.08      ]
 [0.62025316 0.70454545 0.2173913  0.04      ]
 [0.6835443  0.84090909 0.2173913  0.08      ]
 [0.60759494 0.77272727 0.23188406 0.08      ]
 [0.60759494 0.68181818 0.20289855 0.04      ]
 [0.5443038  0.68181818 0.15942029 0.04      ]
 [0.73417722 0.90909091 0.17391304 0.08      ]
 [0.72151899 1.         0.2173913  0.16      ]
 [0.64556962 0.79545455 0.20289855 0.12      ]
 [0.6835443  0.88636364 0.1884058  0.16      ]
 [0.72151899 0.86363636 0.24637681 0.12      ]
 [0.64556962 0.86363636 0.2173913  0.12      ]
 [0.6835443  0.77272727 0.24637681 0.08      ]
 ...

### Min-Max Normalization

Min-max normalization is one of the most common ways to normalize data.

MinMaxScaler rescales each feature to a given range, which is [0, 1] by default. A different range, such as [-1, 1], can be requested through the feature_range parameter. When the data contains outliers, this scaling can compress all the inliers into a narrow part of the range.


For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.

##### $F = \dfrac{\text{Value} - \text{Min}}{\text{Max} - \text{Min}}$

Min-max normalization has one fairly significant downside: it does not handle outliers very well.
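
The outlier problem can be seen on a tiny made-up feature: five typical values plus one extreme value. This is only an illustrative sketch; the numbers are not taken from any dataset in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature: five typical values plus one extreme outlier
feature = np.array([[10.0], [12.0], [11.0], [13.0], [14.0], [1000.0]])

scaled = MinMaxScaler().fit_transform(feature).ravel()
print(scaled)

# Largest scaled value among the five typical points
inlier_max = scaled[:5].max()
print(inlier_max)
```

The outlier is mapped to 1, and the five typical points end up crowded together close to 0, so their differences are almost lost.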

Let (X1, X2) be the min and max boundaries of an attribute and (Y1, Y2) be the new scale to which we are normalizing. Then, for a value Vi of the attribute, the normalized value Ui is given as

##### $U_i = \dfrac{V_i - X_1}{X_2 - X_1}\,(Y_2 - Y_1) + Y_1$

##### Example: Vi = 300,000; X1 = 125,000; X2 = 925,000; Y1 = 0; Y2 = 1 gives Ui = 175,000 / 800,000 = 0.21875
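
The general formula can be checked with a few lines of Python. The helper name `rescale` is hypothetical and introduced only for this sketch.

```python
def rescale(v, x1, x2, y1=0.0, y2=1.0):
    # Map v from the original range [x1, x2] onto the new range [y1, y2]
    return (v - x1) / (x2 - x1) * (y2 - y1) + y1

# The example values from the text, mapped onto [0, 1]
u = rescale(300_000, 125_000, 925_000, 0, 1)
print(u)

# The same value mapped onto [-1, 1] instead
u_sym = rescale(300_000, 125_000, 925_000, -1, 1)
print(u_sym)
```

Changing only (Y1, Y2) moves the result to a different target interval without touching the original boundaries.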


In [15]:
# Min-max scaling: the parentheses make the subtraction happen before the division
data['sepal.width']=(data['sepal.width']-data['sepal.width'].min())/(data['sepal.width'].max()-data['sepal.width'].min())
print(data)

     sepal.length  sepal.width  petal.length  petal.width           class
0        0.645570     0.625000           1.4          0.2     Iris-setosa
1        0.620253     0.416667           1.4          0.2     Iris-setosa
2        0.594937     0.500000           1.3          0.2     Iris-setosa
3        0.582278     0.458333           1.5          0.2     Iris-setosa
4        0.632911     0.666667           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145      0.848101     0.416667           5.2          2.3  Iris-virginica
146      0.797468     0.208333           5.0          1.9  Iris-virginica
147      0.822785     0.416667           5.2          2.0  Iris-virginica
148      0.784810     0.583333           5.4          2.3  Iris-virginica
149      0.746835     0.416667           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


In [16]:
#min-max Scaling
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

print("After MinMax Scaling")
scaler = MinMaxScaler()
print(scaler.fit(data))
print(scaler.data_max_)
print(scaler.transform(data))
print(scaler.transform([[2, 2]]))


After MinMax Scaling
MinMaxScaler()
[ 1. 18.]
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
[[1.5 0. ]]


In [None]:
from sklearn.preprocessing import MinMaxScaler
path='d://MLDataSet//pima.csv'
features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age','class']
data=pd.read_csv(path,names=features)
data.head()
db=data.values

# separate array into input and output components
X = db[:,0:8]
Y = db[:,8]
scaler = MinMaxScaler(feature_range=(-1, 1))
rescaledX = scaler.fit_transform(X)
#rescaledY = scaler.fit_transform(Y)
print(rescaledX)

[[-0.29411765  0.48743719  0.18032787 ...  0.00149031 -0.53116994
  -0.03333333]
 [-0.88235294 -0.14572864  0.08196721 ... -0.2071535  -0.76686593
  -0.66666667]
 [-0.05882353  0.83919598  0.04918033 ... -0.30551416 -0.49274125
  -0.63333333]
 ...
 [-0.41176471  0.2160804   0.18032787 ... -0.21907601 -0.85738685
  -0.7       ]
 [-0.88235294  0.26633166 -0.01639344 ... -0.10283159 -0.76857387
  -0.13333333]
 [-0.88235294 -0.06532663  0.14754098 ... -0.09388972 -0.79760888
  -0.93333333]]


In [25]:
X[1]

array([1, 85, 66, 29, 0, 26.6, 0.351, 31], dtype=object)

### Z-Score Normalization

Z-scores are linearly transformed data values having a mean of zero and a standard deviation of 1.

If we draw a scatterplot of scores versus z-scores, all dots will lie exactly on a straight line.

Z-scores are also known as standardized scores; they are scores (or data values) that have been given a common standard.

#### Z-Score helps in the normalization of data!
A positive z-score says the data point is above average.
A negative z-score says the data point is below average.

#### $z = \dfrac{\text{data point} - \text{mean}}{\text{S.D.}}$


# Standardization (or Z-score normalization)

#### What is Standardization?

Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

The result of standardization (or Z-score normalization) is that the features are rescaled so that their mean and standard deviation are 0 and 1, respectively.

Rescaling features to zero mean and unit standard deviation is useful for optimization algorithms, such as gradient descent, that are used within machine learning algorithms that weight inputs (e.g., regression and neural networks). Rescaling is also used for algorithms that use distance measurements, for example, K-Nearest-Neighbours (KNN).

Standardization results in the rescaling of features, which in turn represents the properties of a standard normal distribution:

mean = 0
sd = 1
![Stand_eq.webp](attachment:Stand_eq.webp)

In [None]:
#Z-Score
# X holds mixed Python objects, so convert to float before doing arithmetic
Xf = X.astype(float)

# Standardize one column (index 4, 'test'): subtract its mean, then divide by its std
col_z = (Xf[:, 4] - Xf[:, 4].mean()) / Xf[:, 4].std()
print(np.round(col_z[0:5], 2))

# Standardize every column at once (axis=0 gives per-column statistics)
X_z = (Xf - Xf.mean(axis=0)) / Xf.std(axis=0)
print(np.round(X_z[0:5], 2))

[-0.69 -0.69 -0.69  0.12  0.77]
[[ 0.64  0.85  0.15  0.91 -0.69  0.2   0.47  1.43]
 [-0.84 -1.12 -0.16  0.53 -0.69 -0.68 -0.37 -0.19]
 [ 1.23  1.94 -0.26 -1.29 -0.69 -1.1   0.6  -0.11]
 [-0.84 -1.   -0.16  0.15  0.12 -0.49 -0.92 -1.04]
 [-1.14  0.5  -1.5   0.91  0.77  1.41  5.48 -0.02]]


In [26]:
import numpy as np

from sklearn import preprocessing
input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

In [27]:
print("Mean before standardization: ",input_data.mean(axis=0))
print("Standard deviation before standardization: ",input_data.std(axis=0))

Mean before standardization:  [ 1.33  1.27 -0.4  -2.2 ]
Standard deviation before standardization:  [1.25 1.98 2.49 4.54]


In [28]:
standardData=preprocessing.scale(input_data)
print(standardData)

[[ 1.34 -1.4   1.36 -0.93]
 [-1.07  0.88 -0.36  1.39]
 [-0.27  0.52 -1.   -0.46]]


In [29]:
print("Mean standardized data: ",standardData.mean(axis=0))
print("Standard Deviation standardized data: ",standardData.std(axis=0))

Mean standardized data:  [ 5.55e-17 -3.70e-17  0.00e+00 -1.85e-17]
Standard Deviation standardized data:  [1. 1. 1. 1.]


The preprocessing.scale() function standardizes a dataset along any axis. This method centers the data on the mean and resizes the components in order to have a unit variance.
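
The "along any axis" behaviour can be sketched with a small made-up matrix: `axis=0` (the default) standardizes each column, and `axis=1` standardizes each row.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical 3x3 matrix whose columns live on different scales
m = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0]])

# axis=0 (the default) standardizes each column (feature)
by_column = preprocessing.scale(m, axis=0)
print(by_column.mean(axis=0), by_column.std(axis=0))

# axis=1 standardizes each row (sample) instead
by_row = preprocessing.scale(m, axis=1)
print(by_row.mean(axis=1), by_row.std(axis=1))
```

For feature scaling the column-wise default is normally the one you want, since each column is one feature.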

Standardize Data
• Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

• We can standardize data using scikit-learn with the StandardScaler class.

In [30]:
from sklearn.preprocessing import StandardScaler
import numpy as np

In [31]:
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
print(rescaledX[0:5,:])

[[ 0.64  0.85  0.15  0.91 -0.69  0.2   0.47  1.43]
 [-0.84 -1.12 -0.16  0.53 -0.69 -0.68 -0.37 -0.19]
 [ 1.23  1.94 -0.26 -1.29 -0.69 -1.1   0.6  -0.11]
 [-0.84 -1.   -0.16  0.15  0.12 -0.49 -0.92 -1.04]
 [-1.14  0.5  -1.5   0.91  0.77  1.41  5.48 -0.02]]


In [32]:
SS=StandardScaler()
Sd=SS.fit_transform(X)
print(Sd[0:5,:])

[[ 0.64  0.85  0.15  0.91 -0.69  0.2   0.47  1.43]
 [-0.84 -1.12 -0.16  0.53 -0.69 -0.68 -0.37 -0.19]
 [ 1.23  1.94 -0.26 -1.29 -0.69 -1.1   0.6  -0.11]
 [-0.84 -1.   -0.16  0.15  0.12 -0.49 -0.92 -1.04]
 [-1.14  0.5  -1.5   0.91  0.77  1.41  5.48 -0.02]]
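
The step list at the top of this notebook includes splitting the data into training and test sets before feature scaling. A common way to combine the two, sketched here on randomly generated data, is to fit the scaler on the training split only and reuse its statistics for the test split. The data, split ratio and seed below are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 20 samples, 3 features on very different scales
rng = np.random.default_rng(0)
features = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 200.0], size=(20, 3))
labels = np.arange(20) % 2

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

# Learn mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_train)

# Apply the same training statistics to both splits
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0))
print(X_test_std.mean(axis=0))
```

The training split ends up with mean 0 and standard deviation 1 per feature. The test split is only approximately standardized, which is expected because its own statistics were never used.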


#### StandardScaler removes the mean and scales the data to unit variance.

#### However, outliers influence the empirical mean and standard deviation, which shrinks the range of the remaining feature values.

#### StandardScaler cannot guarantee balanced feature scales in the presence of outliers.


#### MinMaxScaler rescales the data set such that all feature values are in the range [0, 1]. However, in the presence of outliers this scaling compresses all inliers into a narrow part of that range.

# Normalization Methods

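The term normalization is used for two different operations. The first, also known as Min-Max scaling, shifts and rescales the values of each feature so that they end up ranging between 0 and 1. The second, covered in this section, rescales each sample (row) so that it has unit norm.

Row-wise normalization is mainly useful for sparse datasets that contain lots of zeros. We can rescale the data this way with the help of the Normalizer class of the scikit-learn Python library.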

# Types of Normalization
In machine learning, there are two types of normalization preprocessing techniques, as follows:

### L1 Normalization
It may be defined as the normalization technique that modifies the dataset values so that in each row the sum of the absolute values is always 1. It takes its name from the L1 norm, which is also the quantity minimized in Least Absolute Deviations regression.

In [20]:
import pandas as pd
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path='sample_data/pima.csv'
# Only sample data is uploaded in Canvas. Original source: https://datahub.io/machine-learning/diabetes/r/diabetes.csv
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pd.read_csv (path, names=names)
print(dataframe.head())

dataExceptClass = dataframe.loc[:, dataframe.columns != "class"]
array = dataExceptClass.values

Data_normalizer = Normalizer(norm='l1').fit(array)
Data_normalized = Data_normalizer.transform(array)

set_printoptions(precision=2)
print ("\nNormalized data:\n", Data_normalized [0:3])

   preg  plas  pres  skin  test  mass   pedi  age            class
0     6   148    72    35     0  33.6  0.627   50  tested_positive
1     1    85    66    29     0  26.6  0.351   31  tested_negative
2     8   183    64     0     0  23.3  0.672   32  tested_positive
3     1    89    66    23    94  28.1  0.167   21  tested_negative
4     0   137    40    35   168  43.1  2.288   33  tested_positive

Normalized data:
 [[0.02 0.43 0.21 0.1  0.   0.1  0.   0.14]
 [0.   0.36 0.28 0.12 0.   0.11 0.   0.13]
 [0.03 0.59 0.21 0.   0.   0.07 0.   0.1 ]]


### L2 Normalization
It may be defined as the normalization technique that modifies the dataset values so that in each row the sum of the squares is always 1. It takes its name from the L2 (Euclidean) norm, which is also the quantity minimized in least squares.

In [21]:
Data_normalizer = Normalizer(norm='l2').fit(array)
Data_normalized = Data_normalizer.transform(array)

In [22]:
set_printoptions(precision=2)
print ("\nNormalized data:\n", Data_normalized [0:3])


Normalized data:
 [[0.03 0.83 0.4  0.2  0.   0.19 0.   0.28]
 [0.01 0.72 0.56 0.24 0.   0.22 0.   0.26]
 [0.04 0.92 0.32 0.   0.   0.12 0.   0.16]]
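
The two definitions can be verified directly. The sketch below uses the first pima row shown earlier plus one made-up row, and checks that the absolute values (L1) or the squares (L2) of each normalized row sum to 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# First pima row from the output above, plus a made-up second row
rows = np.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50],
                 [1, 2, 3, 4, 5, 6, 7, 8]], dtype=float)

l1 = Normalizer(norm='l1').fit_transform(rows)
l2 = Normalizer(norm='l2').fit_transform(rows)

# L1: absolute values in each row sum to 1
print(np.abs(l1).sum(axis=1))

# L2: squares in each row sum to 1
print((l2 ** 2).sum(axis=1))
```

Both normalizers work row by row, so adding or removing other rows does not change the result for a given row.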


# What algorithms need feature scaling

Note: Feature scaling matters little for algorithms that are neither distance-based nor trained by gradient descent. Examples include Naive Bayes, Linear Discriminant Analysis, and tree-based models (gradient boosting, random forest, etc.).



In [23]:
#putting together
import numpy as np
from sklearn import preprocessing
data=np.array([[10,-9,8,7,15,-32],[19,11,90,-8,-5,33],[44,-23,-5,87,56,33]])
print(data)

print("-------------Data Scaling-----------")
std_data=preprocessing.scale(data)
print(std_data)

print("-------------Min-Max Scaler-----------")
data_scalar=preprocessing.MinMaxScaler(feature_range=(1,5))
data_scaled=data_scalar.fit_transform(data)
print(data_scaled)

print("-------------Normalization-----------")
normal_data=preprocessing.normalize(data, norm = 'l1')
print(normal_data)

print("-------------Binarization-----------")
binarydata=preprocessing.Binarizer(threshold=1.4).transform(data)
print(binarydata)

[[ 10  -9   8   7  15 -32]
 [ 19  11  90  -8  -5  33]
 [ 44 -23  -5  87  56  33]]
-------------Data Scaling-----------
[[-1.   -0.14 -0.55 -0.52 -0.28 -1.41]
 [-0.37  1.29  1.4  -0.88 -1.06  0.71]
 [ 1.37 -1.15 -0.86  1.4   1.34  0.71]]
-------------Min-Max Scaler-----------
[[1.   2.65 1.55 1.63 2.31 1.  ]
 [2.06 5.   5.   1.   1.   5.  ]
 [5.   1.   1.   5.   5.   5.  ]]
-------------Normalization-----------
[[ 0.12 -0.11  0.1   0.09  0.19 -0.4 ]
 [ 0.11  0.07  0.54 -0.05 -0.03  0.2 ]
 [ 0.18 -0.09 -0.02  0.35  0.23  0.13]]
-------------Binarization-----------
[[1 0 1 1 1 0]
 [1 1 1 0 0 1]
 [1 0 0 1 1 1]]


# Encoding categorical data

Sometimes our data is qualitative, that is, some of its values are text that names a category. Text is harder for a model to process than numbers, since models are based on mathematical equations and calculations. Therefore, we have to encode the categorical data.

# Nominal and Ordinal Variables

Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.

Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

# Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:

### Ordinal Encoding (label encoding)

### One-Hot Encoding

### Dummy Variable Encoding

# Label Encoding (ordinal)


In ordinal encoding, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

In label encoding, we map each category to a number or a label. The numbers chosen for the categories carry no relationship between them, so categories that are related or close to each other lose that information after encoding.

Limitation of label encoding:

-Label encoding converts the data into machine-readable form by assigning a unique number (starting from 0) to each class of data.

-This may introduce a priority issue when training on the data set.


-A label with a high value may be treated as having higher priority than a label with a lower value.
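
When a variable really is ordinal, the priority implied by the integers can be made to match the real ranking. The sketch below uses scikit-learn's OrdinalEncoder with an explicit category order; the feature values and their order (low < medium < high) are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with a natural ranking: low < medium < high
sizes = np.array([['medium'], ['low'], ['high'], ['low']])

# Passing the categories explicitly fixes the integer order instead of
# relying on alphabetical sorting
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())
```

Without the explicit list the categories would be sorted alphabetically (high, low, medium), and the integers would no longer reflect the ranking.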



In [24]:
#Label Encoding
from sklearn import preprocessing
encode=preprocessing.LabelEncoder()
data=['AB','CD','PK','DX','MN']
encode.fit(data)
for i,item in enumerate(encode.classes_):
    print(item,'==>',i)
myinput=['CD','MN','PK','AB','CD','AB','PK','DX','MN']

lbl=encode.transform(myinput)
print(list(lbl))

AB ==> 0
CD ==> 1
DX ==> 2
MN ==> 3
PK ==> 4
[1, 3, 4, 0, 1, 0, 4, 2, 3]


In [34]:
# Label Encoding example
import numpy as np
import pandas as pd

path='sample_data/iris.csv'
# Only sample data is uploaded in Canvas. Original source: https://datahub.io/machine-learning/iris/r/iris.csv

df = pd.read_csv(path)
df.head()
df['class'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [36]:
#df['variety'].unique()

# Import label encoder
from sklearn import preprocessing
# LabelEncoder maps each distinct string label to an integer.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'class'.
df['class']= label_encoder.fit_transform(df['class'])
#f=df['variety'].unique()
#print(f)
for i,item in enumerate(label_encoder.classes_):
    print(item,'==>',i)

Iris-setosa ==> 0
Iris-versicolor ==> 1
Iris-virginica ==> 2


In [37]:
df

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


# OneHotEncoding

Sometimes a dataset contains columns whose numbers have no specific order of preference. The values in such a column denote a category, which is also the case after a column has been label encoded. A model may misread these numbers as magnitudes. To avoid this, the column should be one-hot encoded.

One-hot encoding splits a column of numerically coded categories into as many columns as there are categories in it. Each new column contains “1” in the rows that belong to its category and “0” elsewhere.


One-hot encoding is the most widespread approach, and it works very well unless your categorical variable takes on a large number of values (as a rule of thumb, you generally won't use it for variables taking more than 15 different values).

In [38]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

In [39]:
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue'],['pink'],['black']])
print(data)
# define one hot encoding (scikit-learn >= 1.2 uses sparse_output; older releases used sparse)
encoder = OneHotEncoder(sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']
 ['pink']
 ['black']]
[[0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]]




In [40]:
iris=load_iris()
iris.feature_names
features=pd.DataFrame(iris.feature_names)
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [41]:
X=pd.DataFrame(iris.data)
Y=pd.DataFrame(iris.target)

In [42]:
# scikit-learn >= 1.2 uses sparse_output; older releases used sparse
encoder = OneHotEncoder(sparse_output=False)

In [43]:
ohe=encoder.fit_transform(Y)
print(ohe)

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 ...



In [44]:
ohe=encoder.fit_transform(X)
print(ohe)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]




# Dummy Variable Encoding

-The one-hot encoding creates one binary variable for each category.

-The problem is that this representation includes redundancy.

-For example, if [1, 0, 0] represents “blue”, [0, 1, 0] represents “green” and [0, 0, 1] represents “red”, the third binary variable is redundant: we can keep only the first two, so that “blue” is [1, 0], “green” is [0, 1] and “red” is [0, 0].

-This is called a dummy variable encoding, and always represents C categories with C-1 binary variables.

pandas.get_dummies() converts categorical data into dummy or indicator variables.

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Parameters:

data: The data whose values are to be converted (array-like, Series, or DataFrame).

prefix: String to prefix the generated column names with. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Default value is None.

prefix_sep: Separator/delimiter to use when a prefix is added. Default is '_'.

dummy_na: Adds a column to indicate NaN values. Default value is False; if False, NaNs are ignored.

columns: Column names in the DataFrame that need to be encoded. Default value is None; if columns is None, all columns with object or category dtype are converted.

sparse: Specifies whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False). Default value is False.

drop_first: Remove the first level to get k-1 dummies out of k categorical levels.

dtype: Data type for the new columns. Only a single dtype is allowed. The default is np.uint8 in pandas versions before 2.0 and bool from 2.0 onward.

In [45]:
import pandas as pd
import numpy as np


# list
li = ['s', 'a', 't', np.nan]
print(pd.get_dummies(li))

   a  s  t
0  0  1  0
1  1  0  0
2  0  0  1
3  0  0  0


In [46]:
import pandas as pd
import numpy as np


# list
li = ['s', 'a', 't', np.nan]
print(pd.get_dummies(li, dummy_na=True))

   a  s  t  NaN
0  0  1  0    0
1  1  0  0    0
2  0  0  1    0
3  0  0  0    1
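
The C-1 representation described at the start of this section is what `drop_first=True` produces. A minimal sketch on a made-up colour column follows; `dtype=int` is passed only so that the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green'], name='color')

# One-hot: one column per category (C columns)
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)

# Dummy encoding: drop the first level, leaving C-1 columns
dummy = pd.get_dummies(colors, drop_first=True, dtype=int)
print(dummy)
```

The dropped level ("blue", the first in sorted order) is represented by a row of all zeros in the dummy-encoded frame.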


### Drawbacks of  One-Hot and Dummy Encoding
One-hot and dummy encoding are two powerful and effective encoding schemes, and they are very popular among data scientists. However, they may not be as effective when:

-A large number of levels is present in the data. If a feature variable has many categories, we need a similar number of dummy variables to encode it. For example, a column with 30 different values will require 30 new variables for coding.

-The dataset has multiple categorical features, for example 10 or more categorical columns. The same situation occurs for each of them, and we end up with many binary features representing the categorical features and their categories.

In both of the above cases, these two encoding schemes introduce sparsity in the dataset, i.e. many columns holding 0s and only a few holding 1s. In other words, they create multiple dummy features in the dataset without adding much information.

Also, they might lead to the dummy variable trap, a phenomenon where features are highly correlated. That means that using the other variables, we can easily predict the value of a variable.

The massive increase in the size of the dataset slows down the learning of the model and can degrade its overall performance, which ultimately makes the model computationally expensive. Further, these encodings are not an optimal choice for tree-based models.
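
The dummy variable trap can be made concrete with a rank check. In this sketch (made-up colour data, with an intercept column assumed as in a linear model) the full one-hot columns add up to the intercept, so the design matrix loses one rank; dropping a level restores full column rank.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green', 'red'])

# Full one-hot columns always sum to 1 in every row
one_hot = pd.get_dummies(colors, dtype=float).to_numpy()
row_sums = one_hot.sum(axis=1)
print(row_sums)

# With an intercept column of ones, the design matrix is rank deficient
intercept = np.ones((len(colors), 1))
full = np.hstack([intercept, one_hot])
rank_full = np.linalg.matrix_rank(full)
print(full.shape[1], rank_full)

# Dropping one level removes the redundancy
reduced = np.hstack([intercept, one_hot[:, 1:]])
rank_reduced = np.linalg.matrix_rank(reduced)
print(reduced.shape[1], rank_reduced)
```

This is the sense in which one variable can be predicted from the others: any one-hot column equals the intercept minus the remaining columns.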

# Effect encoding

This encoding technique is also known as Deviation Encoding or Sum Encoding. Effect encoding is very similar to dummy encoding, with one difference: dummy coding uses only 0 and 1 to represent the data, whereas effect encoding uses three values: 1, 0, and -1.

The row that contains only 0s in dummy encoding is encoded as -1s in effect encoding. In the dummy encoding example, the city Bangalore at index 4 was encoded as 0 0 0 0, whereas in effect encoding it is represented by -1 -1 -1 -1.
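
The relationship between dummy and effect encoding can be sketched with plain pandas. This is an illustration under stated assumptions, not how `category_encoders` is implemented: the reference level is taken to be the last city in order of first appearance (Bangalore), and no intercept column is produced.

```python
import pandas as pd

cities = pd.Series(
    ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad'],
    name='City',
)

# Order the levels by first appearance; the last one is the reference level
order = list(cities.unique())
reference = order[-1]

# Dummy encoding: one indicator column per level, reference column removed
effect = pd.get_dummies(cities, dtype=float)[order].drop(columns=reference)

# Effect encoding: rows of the reference level become -1 instead of 0
effect.loc[cities == reference, :] = -1.0
print(effect)
```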

In [50]:
pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.3


In [51]:
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})
encoder=ce.sum_coding.SumEncoder(cols='City',verbose=False)

#Original Data
data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad


# Hash Encoder
To understand hash encoding, it helps to know about hashing first. Hashing transforms an input of arbitrary size into a fixed-size value. A hashing algorithm (hash function) generates the hash value of an input. Hashing is also a one-way process: the original input cannot be recovered from its hash representation.

Hashing has several applications, such as data retrieval, detecting data corruption, and cryptography. Many hash functions are available, for example the Message Digest family (MD2, MD4, MD5) and the Secure Hash Algorithm family (SHA-0, SHA-1, SHA-2).

Just like one-hot encoding, the hash encoder represents categorical features using new dimensions. Here, however, the user can fix the number of dimensions after transformation with the n_components argument. A feature with 5 categories can be represented using N new features, and a feature with 100 categories can also be transformed using N new features. Doesn’t this sound amazing? The price is that different categories can collide, i.e. be mapped to the same output column, which loses some information.

By default, the hashing encoder uses the md5 hashing algorithm, but the user can choose a different algorithm through the hash_method argument.
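
The idea can be sketched with the standard library alone. This is a simplified illustration, not the library's code: `hash_bucket` and `hash_encode` are hypothetical helper names, and the real encoder may assign columns differently.

```python
import hashlib
import pandas as pd

def hash_bucket(value, n_components=3, hash_method="md5"):
    """Map a category to a column index in [0, n_components) via its hash."""
    digest = hashlib.new(hash_method, str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=3, hash_method="md5"):
    """Build one row per value with a single 1 in the hashed column."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components, hash_method)] = 1
        rows.append(row)
    columns = [f"col_{i}" for i in range(n_components)]
    return pd.DataFrame(rows, columns=columns)

months = ['January', 'April', 'March', 'April', 'June']
encoded = hash_encode(months)
print(encoded)
```

The same category always lands in the same column, and the number of columns stays at `n_components` no matter how many distinct categories appear. With fewer columns than categories, collisions are unavoidable.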

In [52]:
import category_encoders as ce
import pandas as pd

In [53]:
data=pd.DataFrame({'Month':['January','April','March','April','Februay','June','July','June','September']})
en=ce.HashingEncoder(cols='Month',n_components=3)
en.fit_transform(data)

Unnamed: 0,col_0,col_1,col_2
0,0,1,0
1,1,0,0
2,0,1,0
3,1,0,0
4,1,0,0
5,0,1,0
6,1,0,0
7,0,1,0
8,0,1,0


# Target Encoding

In target encoding, we calculate the mean of the target variable for each category and replace the category with that mean value.
For a categorical target variable, each category is replaced by the posterior probability of the target given that category.
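
The basic computation can be sketched with a pandas `groupby`. This sketch uses plain per-category means on hypothetical class/marks data. `category_encoders.TargetEncoder` additionally smooths each category mean toward the overall mean, so its numbers will generally differ from the ones produced here.

```python
import pandas as pd

data = pd.DataFrame({
    'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
    'Marks': [50, 30, 70, 80, 45, 97, 80, 68],
})

# Mean of the target (Marks) for each category
class_means = data.groupby('class')['Marks'].mean()
print(class_means)

# Replace each category with the mean of its group
data['class_encoded'] = data['class'].map(class_means)
print(data)
```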

In [54]:
#import the libraries
import pandas as pd
import category_encoders as ce

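#Create the Dataframe
#Note: the first label is 'A,' (with a trailing comma), so it is treated as a separate category from 'A'
data=pd.DataFrame({'class':['A,','B','C','B','C','A','A','A'],'Marks':[50,30,70,80,45,97,80,68]})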

#Create target encoding object
encoder=ce.TargetEncoder(cols='class')
encoder.fit_transform(data['class'],data['Marks'])

Unnamed: 0,class
0,63.048373
1,63.581489
2,63.936117
3,63.581489
4,63.936117
5,67.574421
6,67.574421
7,67.574421


# Run-length encoding
Run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run.

For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text.

A hypothetical scan line, with B representing a black pixel and W representing white, might read as follows:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

With a run-length encoding (RLE) data compression algorithm applied to the above hypothetical scan line, it can be rendered as follows:

12W1B12W3B24W1B14W

This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, and so on.

In [55]:
def encode(message):
    """Run-length encode a string, e.g. 'AAB' -> '2A1B'."""
    encoded_message = ""
    i = 0

    while i < len(message):
        # Count how many times message[i] repeats consecutively
        count = 1
        ch = message[i]
        j = i
        while j < len(message) - 1 and message[j] == message[j + 1]:
            count += 1
            j += 1
        encoded_message += str(count) + ch
        # Jump to the first character of the next run
        i = j + 1
    return encoded_message


encode("ACCCTCCAAGCCTTCGGG")

'1A3C1T2C2A1G2C2T1C3G'
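
Because RLE is lossless, the original string can be rebuilt from the encoded one. The following sketch pairs a compact encoder with a decoder and checks the round trip on the scan line from the text. `rle_encode` and `rle_decode` are names chosen here, and the decoder assumes the original message contains no digits.

```python
import re
from itertools import groupby

def rle_encode(message):
    # groupby yields each run of identical consecutive characters
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(message))

def rle_decode(encoded):
    # Each token is a count followed by a single non-digit character
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(ch * int(count) for count, ch in pairs)

scan_line = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14
encoded = rle_encode(scan_line)
print(encoded)
print(rle_decode(encoded) == scan_line)
```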

# One Hot Encoding


In [56]:

import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':[
'Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Bangalore','Delhi'
]})

#Create object for one-hot encoding
encoder=ce.OneHotEncoder(cols='City',use_cat_names=True)
encoder.fit_transform(data)

Unnamed: 0,City_Delhi,City_Mumbai,City_Hyderabad,City_Chennai,City_Bangalore
0,1,0,0,0,0
1,0,1,0,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,0,0,0,1
5,1,0,0,0,0
6,0,0,1,0,0
7,0,0,0,0,1
8,1,0,0,0,0


# Dummy Encoding

In one-hot encoding, a variable with N categories is represented by N binary variables. Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories.


In [None]:

data_encoded=pd.get_dummies(data=data,drop_first=True)
data_encoded

# Effect Encoding:

In [57]:
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})
encoder=ce.sum_coding.SumEncoder(cols='City')

In [58]:
encoder.fit_transform(data)



Unnamed: 0,intercept,City_0,City_1,City_2,City_3
0,1,1.0,0.0,0.0,0.0
1,1,0.0,1.0,0.0,0.0
2,1,0.0,0.0,1.0,0.0
3,1,0.0,0.0,0.0,1.0
4,1,-1.0,-1.0,-1.0,-1.0
5,1,1.0,0.0,0.0,0.0
6,1,0.0,0.0,1.0,0.0


# Data Wrangling
Data Wrangling is the process of converting data from the initial format to a format that may be better for analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [59]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"

In [60]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

In [68]:
df = pd.read_csv(filename, names = headers)

In [69]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


Steps for working with missing data:
- identify missing data
- deal with missing data
- correct data format
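
The first step, identifying missing data, can be sketched on a small frame before applying it to the full dataset. The `sample` frame below is a made-up miniature of the auto data, with '?' marking missing entries as in the original file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the auto data
sample = pd.DataFrame({
    'make': ['audi', 'bmw', 'audi', 'honda'],
    'bore': ['3.19', '?', '3.19', '2.91'],
    'price': ['13950', '16430', '?', '?'],
})

# Turn the '?' placeholder into a real missing value
sample = sample.replace('?', np.nan)

# isnull() gives a boolean frame; summing counts the True values per column
missing_counts = sample.isnull().sum()
print(missing_counts[missing_counts > 0])
```

Summing the boolean frame gives one count per column, which is more compact than printing `value_counts()` column by column.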

In [70]:
df.replace('?',np.nan,inplace=True)

In [71]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [72]:
missing_data = df.isnull()
missing_data.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [73]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")

symboling
False    205
Name: symboling, dtype: int64

normalized-losses
False    164
True      41
Name: normalized-losses, dtype: int64

make
False    205
Name: make, dtype: int64

fuel-type
False    205
Name: fuel-type, dtype: int64

aspiration
False    205
Name: aspiration, dtype: int64

num-of-doors
False    203
True       2
Name: num-of-doors, dtype: int64

body-style
False    205
Name: body-style, dtype: int64

drive-wheels
False    205
Name: drive-wheels, dtype: int64

engine-location
False    205
Name: engine-location, dtype: int64

wheel-base
False    205
Name: wheel-base, dtype: int64

length
False    205
Name: length, dtype: int64

width
False    205
Name: width, dtype: int64

height
False    205
Name: height, dtype: int64

curb-weight
False    205
Name: curb-weight, dtype: int64

engine-type
False    205
Name: engine-type, dtype: int64

num-of-cylinders
False    205
Name: num-of-cylinders, dtype: int64

engine-size
False    205
Name: engine-size, dtype: int64

fuel-system
Fa

Based on the summary above, each column has 205 rows of data, and seven columns contain missing data:

"normalized-losses": 41 missing values
"num-of-doors": 2 missing values
"bore": 4 missing values
"stroke": 4 missing values
"horsepower": 2 missing values
"peak-rpm": 2 missing values
"price": 4 missing values



Deal with missing data
How to deal with missing data?
1. Drop data
   a. drop the whole row
   b. drop the whole column
2. Replace data
   a. replace it by mean
   b. replace it by frequency
   c. replace it based on other functions
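
Each of these options maps onto a short pandas call. The sketch below uses a small made-up frame (`sample`) rather than the auto dataset. Which columns to drop or fill is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'num-of-doors': ['four', 'two', np.nan, 'four', 'four'],
    'bore': [3.19, np.nan, 3.47, 2.91, np.nan],
    'price': [13950.0, 16430.0, np.nan, 10295.0, 12945.0],
})

# Drop the whole row: remove rows where 'price' is missing
dropped_rows = sample.dropna(subset=['price'])

# Drop the whole column: remove 'bore' entirely
dropped_col = sample.drop(columns=['bore'])

filled = sample.copy()

# Replace by mean: suitable for numeric columns
filled['bore'] = filled['bore'].fillna(filled['bore'].mean())

# Replace by frequency: use the most common value of a categorical column
most_common = filled['num-of-doors'].value_counts().idxmax()
filled['num-of-doors'] = filled['num-of-doors'].fillna(most_common)

print(filled)
```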

In [74]:
# Calculate the average of the column
loss=df['normalized-losses'].astype("float").mean(axis=0)
print(loss)

122.0


In [75]:
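#Replace NaN by the mean value in the "normalized-losses" column
#Assign the result back; inplace=True on a selected column may not modify df in recent pandas versions
df['normalized-losses']=df['normalized-losses'].replace(np.nan,loss)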
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,122.0,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [76]:
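#Calculate the mean value for the 'bore' column
avg_bore=df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)
#Assign the result back; inplace=True on a selected column may not modify df in recent pandas versions
df["bore"]=df["bore"].replace(np.nan, avg_bore)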

Average of bore: 3.3297512437810943


Exercise for the students

In [None]:
Following the example above, replace NaN in the "stroke" column with the mean.

In [None]:
Calculate the mean value for the 'horsepower' column. Replace NaN with the mean value:

In [None]:
Calculate the mean value for 'peak-rpm' column.