In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/')

# Pre-Processing the data

#### Data preprocessing in Machine Learning is a crucial step that helps enhance the quality of data to promote the extraction of meaningful insights from the data.

#### Data preprocessing in Machine Learning refers to the technique of preparing (cleaning and organizing) the raw data to make it suitable for building and training Machine Learning models.

#### In simple words, data preprocessing in Machine Learning is a data mining technique that transforms raw data into an understandable and readable format.

# STEPS

### Getting the dataset
### Importing libraries
### Importing datasets
### Finding Missing Data
### Encoding Categorical Data
### Splitting dataset into training and test set
### Feature scaling

##### Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.

##### Data Preprocessing is a technique used to convert raw data into a clean data set. In other words, data gathered from different sources arrives in a raw format that is not suitable for analysis.



# Need for Data Preprocessing

##### To get good results from a model in a Machine Learning project, the data has to be in a suitable format. Some Machine Learning models need information in a specific format. For example, ***many Random Forest implementations do not support null values,*** so null values have to be handled in the raw data set before the algorithm is run.
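
A minimal sketch of handling null values before fitting such a model. The DataFrame, its column names, and the choice of median imputation are illustrative assumptions, not part of the Pima data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries (NaN)
df_missing = pd.DataFrame({
    "glucose": [148.0, 85.0, np.nan, 89.0],
    "bmi": [33.6, np.nan, 23.3, 28.1],
})

# Count the nulls in each column
print(df_missing.isna().sum())

# Replace each null with the median of its own column
df_filled = df_missing.fillna(df_missing.median())
print(df_filled)
```

Median imputation is only one option; dropping rows with `dropna()` is another, at the cost of losing data.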

##### Another aspect is that ***the data set should be formatted so that more than one Machine Learning or Deep Learning algorithm can be run on the same data set, and the best of them chosen.***

##### Another reason to preprocess, and to apply feature scaling in particular, is that ***gradient descent usually converges much faster with feature scaling than without it.***
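
The effect of scaling on gradient descent can be sketched with a small experiment. Everything here is an assumption made for illustration: the synthetic data, the helper name `gd_loss`, the step count, and the learning rate (set to the reciprocal of the largest Hessian eigenvalue so that both runs are stable).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 1, n)       # small-range feature
x2 = rng.uniform(0, 1000, n)    # large-range feature
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + 0.002 * x2 + rng.normal(0, 0.01, n)

def gd_loss(X, y, steps=200):
    """Run plain gradient descent on least squares; return the final MSE."""
    Xb = np.column_stack([np.ones(len(X)), X])      # add an intercept column
    H = Xb.T @ Xb / len(y)                          # Hessian of the (half) MSE
    lr = 1.0 / np.linalg.eigvalsh(H).max()          # largest stable step size
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = Xb.T @ (Xb @ w - y) / len(y)
        w -= lr * grad
    return np.mean((Xb @ w - y) ** 2)

# Standardize each column: zero mean, unit standard deviation
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

loss_raw = gd_loss(X_demo, y_demo)
loss_std = gd_loss(X_std, y_demo)
print("MSE after 200 steps, raw features:   ", loss_raw)
print("MSE after 200 steps, scaled features:", loss_std)
```

With the same number of steps, the run on standardized features should end at a much lower error, because the large-range feature no longer forces a tiny learning rate on all the other weights.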


# Preprocessing Techniques



### Binarize Data  
We can transform our data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.


### Feature Scaling:
It puts all our features on the same scale. You don’t have to apply feature scaling to the dummy variables. Two techniques:

(i) Standardization
(ii) Normalization

### Standardize Data
Standardization of datasets is a common requirement for many machine learning estimators

### Normalization
Normalization adjusts the values in a feature vector so that they are measured on a common scale. In the variant described here (L1 normalization), the values of each feature vector are rescaled so that their absolute values sum to 1.
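
A short sketch of this kind of normalization, assuming scikit-learn's `normalize` function with `norm="l1"`; the two example vectors are made up.

```python
import numpy as np
from sklearn.preprocessing import normalize

vectors = np.array([[1.0, 3.0, 4.0],
                    [2.0, 2.0, 6.0]])

# norm="l1" divides each row by the sum of its absolute values
l1_normalized = normalize(vectors, norm="l1")
print(l1_normalized)
print(l1_normalized.sum(axis=1))  # each row now sums to 1
```

The rows sum to exactly 1 here because all values are non-negative; with mixed signs it is the absolute values that sum to 1.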

In [6]:
import sklearn.preprocessing

# Binarize Data (Make Binary)

• We can transform our data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

• This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

• We can create new binary attributes in Python using scikit-learn with the Binarizer class.
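
As a small illustration of turning probabilities into crisp values, the sketch below applies `Binarizer` with a threshold of 0.5 to a made-up array of probabilities (the values are assumptions chosen to show the boundary case).

```python
import numpy as np
from sklearn.preprocessing import Binarizer

probabilities = np.array([[0.10, 0.50, 0.51],
                          [0.90, 0.49, 0.75]])

# Values strictly above the threshold become 1, everything else becomes 0
crisp = Binarizer(threshold=0.5).fit_transform(probabilities)
print(crisp)
```

Note that 0.50 maps to 0, because only values strictly greater than the threshold are marked 1.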

In [7]:
from sklearn.preprocessing import Binarizer

In [8]:
import pandas as pd

features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age','class']
data=pd.read_csv( 'pima.csv',names=features)
data.head()



   preg  plas  pres  skin  test  mass   pedi  age            class
0     6   148    72    35     0  33.6  0.627   50  tested_positive
1     1    85    66    29     0  26.6  0.351   31  tested_negative
2     8   183    64     0     0  23.3  0.672   32  tested_positive
3     1    89    66    23    94  28.1  0.167   21  tested_negative
4     0   137    40    35   168  43.1  2.288   33  tested_positive


In [9]:
array = data.values

# separate array into input and output components
X = array[:,0:8] # column indices 0 to 7 (the eight input features)
Y = array[:,8] # column index 8, i.e. the ninth column (the class label)

In [10]:
binary=Binarizer(threshold=0.25).fit(X)
binaryX=binary.transform(X)
print(binaryX)

[[1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 ...
 [1. 1. 1. ... 1. 0. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]]


# Normalize or Standardize?

***Normalization*** is good to use when you know that the distribution of your ***data does not follow a Gaussian distribution.*** This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbors and Neural Networks.

***Standardization,*** on the other hand, can be helpful in cases where the ***data follows a Gaussian distribution.*** However, this does not have to be true. Also, unlike normalization, standardization has no bounding range, so outliers are not squeezed to a fixed endpoint. They do still influence the mean and standard deviation that are used for the scaling.

# When Feature Scaling matters

Some machine learning models are fundamentally based on a distance measure between samples (distance-based models), for example K-Nearest-Neighbours and SVM. Neural networks, which are trained with gradient descent, are also sensitive to feature scale. Feature scaling is essential for these models, especially when the ranges of the features are very different. Otherwise, features with a large range will have a large influence when the distance is computed.
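
The sketch below shows how a large-range feature can dominate a Euclidean distance. The four points (age in years, income in currency units) are hypothetical, and min-max scaling is used as one possible rescaling.

```python
import numpy as np

# Hypothetical samples: (age, income)
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_100.0])   # very different age, almost the same income
c = np.array([26.0, 51_000.0])   # almost the same age, slightly higher income
d = np.array([45.0, 90_000.0])   # extra sample that widens the income range

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
print("raw distances:   ", raw_ab, raw_ac)        # income dominates: b looks closer to a

# Min-max rescale every feature to [0, 1] using all four samples
points = np.vstack([a, b, c, d])
scaled = (points - points.min(axis=0)) / (points.max(axis=0) - points.min(axis=0))

scaled_ab, scaled_ac = euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2])
print("scaled distances:", scaled_ab, scaled_ac)  # now c is the closer neighbour
```

Before scaling, a nearest-neighbour search from `a` would pick `b`; after scaling it picks `c`, because age is no longer drowned out by income.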


***Min-Max Normalisation*** transforms features with varying scales so that no single dimension dominates the statistics, and it does not require a strong assumption about the distribution of the data, which suits algorithms such as k-nearest neighbours and artificial neural networks. ***However, normalisation does not treat outliers very well.*** In contrast, ***standardisation copes better with outliers and helps convergence for some computational algorithms such as gradient descent.*** Therefore, standardisation is usually preferred over Min-Max Normalisation.
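
To see how an outlier affects both methods, here is a minimal comparison on a made-up feature with nine ordinary values and one extreme value (the data is an assumption for illustration only).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature: nine ordinary values and one extreme outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

print("min-max :", np.round(minmax, 4))
print("z-score :", np.round(standard, 4))
```

Min-max scaling pins the outlier at 1 and squeezes the ordinary values into a tiny interval near 0. Standardization leaves the outlier unbounded, but the outlier still inflates the standard deviation, so the ordinary values end up close together as well.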

# Scaling
Most probably our dataset comprises attributes with varying scales. Many ML algorithms do not work well with such data, so it needs rescaling. Data rescaling makes sure that the attributes are on the same scale. Generally, attributes are rescaled into the range 0 to 1.

### 1) Decimal Scaling

### 2) Simple Feature Scaling

### 3) Min-Max Normalization

### 4) z-Score Normalization (zero-mean Normalization)


### Decimal Scaling Method For Normalization

It normalizes by moving the decimal point of the values. Each value is divided by a power of 10, 10^j, where j is the smallest integer such that the largest absolute value after scaling is less than 1. A data value v_i is normalized to v_i' using the formula below:

### v_i' = v_i / 10^j
***
***Let the input data be: -10, 201, 301, -401, 501, 601, 701***

To normalize the above data,

Step 1: Maximum absolute value in given data (m): 701

Step 2: Divide the given data by 1000 (i.e. j = 3, since 701 has three digits)

Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
***

In [11]:
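#Decimal Scaling Method For Normalization
def Dec_scale(df):
    # largest absolute value, so that negative numbers are handled too
    p = max(abs(x) for x in df)
    # number of digits before the decimal point (assumes p >= 1)
    q = len(str(int(p)))
    print(p,q)
    # divide every value by 10**q
    l=[x/10**q for x in df]
    print(l)
data=[18,12,89,121,900,45]
Dec_scale(data)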

900 3
[0.018, 0.012, 0.089, 0.121, 0.9, 0.045]


In [12]:
#Simple Feature Scaling

# Original Source : https://datahub.io/machine-learning/iris/r/iris.csv
data=pd.read_csv('iris.csv')
print(data.head(),'\n')
print("max sepal length value : ",data['sepal.length'].max(),'\n')
data['sepal.length']=data['sepal.length']/data['sepal.length'].max()
print(data.head())

   sepal.length  sepal.width  petal.length  petal.width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa 

max sepal length value :  7.9 

   sepal.length  sepal.width  petal.length  petal.width        class
0      0.645570          3.5           1.4          0.2  Iris-setosa
1      0.620253          3.0           1.4          0.2  Iris-setosa
2      0.594937          3.2           1.3          0.2  Iris-setosa
3      0.582278          3.1           1.5          0.2  Iris-setosa
4      0.632911          3.6           1.4          0.2  Iris-setosa


In [13]:
#Simple Feature Scaling
from sklearn.preprocessing import MaxAbsScaler
dataWithNumericFeatures = data.loc[:, data.columns != "class"]
simpleScale=MaxAbsScaler()
rescaled=simpleScale.fit_transform(dataWithNumericFeatures)
print(rescaled)

[[0.64556962 0.79545455 0.20289855 0.08      ]
 [0.62025316 0.68181818 0.20289855 0.08      ]
 [0.59493671 0.72727273 0.1884058  0.08      ]
 [0.58227848 0.70454545 0.2173913  0.08      ]
 [0.63291139 0.81818182 0.20289855 0.08      ]
 [0.6835443  0.88636364 0.24637681 0.16      ]
 [0.58227848 0.77272727 0.20289855 0.12      ]
 [0.63291139 0.77272727 0.2173913  0.08      ]
 [0.55696203 0.65909091 0.20289855 0.08      ]
 [0.62025316 0.70454545 0.2173913  0.04      ]
 [0.6835443  0.84090909 0.2173913  0.08      ]
 [0.60759494 0.77272727 0.23188406 0.08      ]
 [0.60759494 0.68181818 0.20289855 0.04      ]
 [0.5443038  0.68181818 0.15942029 0.04      ]
 [0.73417722 0.90909091 0.17391304 0.08      ]
 [0.72151899 1.         0.2173913  0.16      ]
 [0.64556962 0.79545455 0.20289855 0.12      ]
 [0.6835443  0.88636364 0.1884058  0.16      ]
 [0.72151899 0.86363636 0.24637681 0.12      ]
 [0.64556962 0.86363636 0.2173913  0.12      ]
 [0.6835443  0.77272727 0.24637681 0.08      ]
 [0.64556962 

### Min-Max Normalization

Min-max normalization is one of the most common ways to normalize data.

**MinMaxScaler scales each feature to a given range, [0, 1] by default; another range such as [-1, 1] can be chosen with the `feature_range` parameter.** When a feature contains large outliers, this scaling compresses all the inliers into a very narrow part of that range.


For every feature, the minimum value of that feature gets transformed into a 0,
the maximum value gets transformed into a 1,

and every other value gets transformed into a decimal between 0 and 1.

##### F = (Value - Min) / (Max - Min)

Min-max normalization has one fairly significant downside: **it does not handle outliers very well.**

Let (X1, X2) be the min and max boundary of an attribute and (Y1, Y2) be the new scale to which we are normalizing. Then, for a value Vi of the attribute, the normalized value Ui is given by

##### Ui = [(Vi-X1)/(X2-X1)] * (Y2-Y1) + Y1

##### Example: Vi=300,000; X1= 125,000; X2= 925,000; Y1= 0; Y2= 1 gives Ui = (175,000/800,000) * 1 + 0 = 0.21875
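
The general min-max formula can be sketched as a small function. The name `rescale` and its default arguments are illustrative choices; the numbers are the ones from the example in this section.

```python
def rescale(v, x1, x2, y1=0.0, y2=1.0):
    """Map v from the original range [x1, x2] to the new range [y1, y2]."""
    return (v - x1) / (x2 - x1) * (y2 - y1) + y1

u = rescale(300_000, 125_000, 925_000, 0, 1)
print(u)  # 0.21875

# The same value mapped to the range [-1, 1] instead
u_sym = rescale(300_000, 125_000, 925_000, -1, 1)
print(u_sym)  # -0.5625
```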


In [14]:
# parentheses are needed: subtract the minimum first, then divide by the range
data['sepal.width']=(data['sepal.width']-data['sepal.width'].min())/(data['sepal.width'].max()-data['sepal.width'].min())
print(data)

     sepal.length  sepal.width  petal.length  petal.width           class
0        0.645570     0.625000           1.4          0.2     Iris-setosa
1        0.620253     0.416667           1.4          0.2     Iris-setosa
2        0.594937     0.500000           1.3          0.2     Iris-setosa
3        0.582278     0.458333           1.5          0.2     Iris-setosa
4        0.632911     0.666667           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145      0.848101     0.416667           5.2          2.3  Iris-virginica
146      0.797468     0.208333           5.0          1.9  Iris-virginica
147      0.822785     0.416667           5.2          2.0  Iris-virginica
148      0.784810     0.583333           5.4          2.3  Iris-virginica
149      0.746835     0.416667           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


In [15]:
#min-max Scaling
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

print("After MinMax Scaling")
scaler = MinMaxScaler()
print(scaler.fit(data),'\n')
print(scaler.data_max_,'\n')
print(scaler.transform(data),'\n')
print(scaler.transform([[2, 2]]))

#changing to -1 to 1
scaler1=MinMaxScaler(feature_range=(-1,1))
rescaled=scaler1.fit_transform(data)
print("\n with -1 to 1 range \n")
print(rescaled)


After MinMax Scaling
MinMaxScaler() 

[ 1. 18.] 

[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]] 

[[1.5 0. ]]

 with -1 to 1 range 

[[-1.  -1. ]
 [-0.5 -0.5]
 [ 0.   0. ]
 [ 1.   1. ]]


In [16]:
from sklearn.preprocessing import MinMaxScaler

features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age','class']
data=pd.read_csv('pima.csv',names=features)
print(data.head())
db=data.values

# separate array into input and output components
X = db[:,0:8]
Y = db[:,8]

print(X)


   preg  plas  pres  skin  test  mass   pedi  age            class
0     6   148    72    35     0  33.6  0.627   50  tested_positive
1     1    85    66    29     0  26.6  0.351   31  tested_negative
2     8   183    64     0     0  23.3  0.672   32  tested_positive
3     1    89    66    23    94  28.1  0.167   21  tested_negative
4     0   137    40    35   168  43.1  2.288   33  tested_positive
[[6 148 72 ... 33.6 0.627 50]
 [1 85 66 ... 26.6 0.351 31]
 [8 183 64 ... 23.3 0.672 32]
 ...
 [5 121 72 ... 26.2 0.245 30]
 [1 126 60 ... 30.1 0.349 47]
 [1 93 70 ... 30.4 0.315 23]]


In [17]:
scaler = MinMaxScaler(feature_range=(-1, 1))
rescaledX = scaler.fit_transform(X)
#rescaledY = scaler.fit_transform(Y)
print(rescaledX)

[[-0.29411765  0.48743719  0.18032787 ...  0.00149031 -0.53116994
  -0.03333333]
 [-0.88235294 -0.14572864  0.08196721 ... -0.2071535  -0.76686593
  -0.66666667]
 [-0.05882353  0.83919598  0.04918033 ... -0.30551416 -0.49274125
  -0.63333333]
 ...
 [-0.41176471  0.2160804   0.18032787 ... -0.21907601 -0.85738685
  -0.7       ]
 [-0.88235294  0.26633166 -0.01639344 ... -0.10283159 -0.76857387
  -0.13333333]
 [-0.88235294 -0.06532663  0.14754098 ... -0.09388972 -0.79760888
  -0.93333333]]


In [18]:
X[1]

array([1, 85, 66, 29, 0, 26.6, 0.351, 31], dtype=object)

### Z-Score Normalization

Z-scores are linearly transformed data values having a mean of zero and a standard deviation of 1.

If we draw a scatterplot of scores versus z-scores, all dots will lie exactly on a straight line.

Z-scores are also known as standardized scores; they are scores (or data values) that have been given a common standard.

####      Z-Score helps in the normalization of data!
***A positive z-score says the data point is above average.
A negative z-score says the data point is below average.***

####      Z-score = (data_point - mean) / S.D


# Standardization (or Z-score normalization)

#### What is Standardization?

**Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.**

The result of standardization (or Z-score normalization) is that the features are rescaled so that their mean and standard deviation are 0 and 1, respectively.

Rescaling features to zero mean and unit standard deviation is useful for optimization algorithms, such as gradient descent, that are used within machine learning algorithms that weight inputs (e.g., regression and neural networks). Rescaling is also used for algorithms that use distance measurements, for example K-Nearest-Neighbours (KNN).

After standardization, each feature has the mean and standard deviation of a standard normal distribution (the shape of its distribution is otherwise unchanged):

mean = 0
sd = 1
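
A minimal numeric check of these two properties, using a made-up array of scores and NumPy's default (population) standard deviation:

```python
import numpy as np

scores = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

# Parentheses matter: subtract the mean first, then divide by the SD
z = (scores - scores.mean()) / scores.std()

print(z)                   # negative below the average, positive above it
print(z.mean(), z.std())   # approximately 0 and 1
```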


In [19]:
X[:,4]

array([0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 0, 230,
       83, 96, 235, 0, 0, 0, 146, 115, 0, 140, 110, 0, 0, 245, 54, 0, 0,
       192, 0, 0, 0, 207, 70, 0, 0, 240, 0, 0, 0, 0, 0, 0, 82, 36, 23,
       300, 342, 0, 304, 110, 0, 142, 0, 0, 0, 128, 0, 0, 0, 0, 38, 100,
       90, 140, 0, 270, 0, 0, 0, 0, 0, 0, 0, 0, 71, 0, 0, 125, 0, 71, 110,
       0, 0, 176, 48, 0, 64, 228, 0, 76, 64, 220, 0, 0, 0, 40, 0, 152, 0,
       140, 18, 36, 135, 495, 37, 0, 175, 0, 0, 0, 0, 51, 100, 0, 100, 0,
       0, 99, 135, 94, 145, 0, 168, 0, 225, 0, 49, 140, 50, 92, 0, 325, 0,
       0, 63, 0, 284, 0, 0, 119, 0, 0, 204, 0, 155, 485, 0, 0, 94, 135,
       53, 114, 0, 105, 285, 0, 0, 156, 0, 0, 0, 78, 0, 130, 0, 48, 55,
       130, 0, 130, 0, 0, 0, 92, 23, 0, 0, 0, 495, 58, 114, 160, 0, 94, 0,
       0, 0, 210, 0, 48, 99, 318, 0, 0, 0, 44, 190, 0, 280, 0, 87, 0, 0,
       0, 0, 130, 175, 271, 129, 120, 0, 0, 478, 0, 0, 190, 56, 32, 0, 0,
       744, 53, 0, 370, 37, 0, 45, 0, 192, 0, 

In [20]:
#Z-Score

# data.values produced an object array; convert it to float before doing arithmetic
X = X.astype(float)

# z = (x - mean) / std: the parentheses are required
z_test = (X[:,4] - X[:,4].mean()) / X[:,4].std()
print(z_test[:5])

# the same for every column at once (axis=0 gives per-column statistics)
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_z[:5,4])

[-0.69289057 -0.69289057 -0.69289057  0.12330164  0.76583594]
[-0.69289057 -0.69289057 -0.69289057  0.12330164  0.76583594]

In [21]:
import numpy as np

from sklearn import preprocessing
input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])
print(input_data)

[[ 3.  -1.5  3.  -6.4]
 [ 0.   3.  -1.3  4.1]
 [ 1.   2.3 -2.9 -4.3]]


In [22]:
print("Mean before standardization: ",input_data.mean(axis=0)) #axis=0 gives one value per column
print("Standard deviation before standardization: ",input_data.std(axis=0))

Mean before standardization:  [ 1.33333333  1.26666667 -0.4        -2.2       ]
Standard deviation before standardization:  [1.24721913 1.97709102 2.49131826 4.53651849]


***preprocessing.scale()***

The preprocessing.scale() function standardizes a dataset along any axis: it centers the data on the mean and scales each component to unit variance.

This puts values of very different magnitudes, for example X = [1, 4, 400, 10000, 100000], on one scale.

scale(X) is a simpler function for basic scaling. Unlike StandardScaler, it does not offer the transformer functionality (fit on one data set, then transform another).

In [23]:
standardData=preprocessing.scale(input_data) #default is axis=0
print(standardData)

[[ 1.33630621 -1.39936232  1.36473933 -0.9258201 ]
 [-1.06904497  0.87670892 -0.36125453  1.38873015]
 [-0.26726124  0.5226534  -1.0034848  -0.46291005]]


In [24]:
print("Mean standardized data: ",standardData.mean(axis=0))
print("Standard Deviation standardized data: ",standardData.std(axis=0))

Mean standardized data:  [ 5.55111512e-17 -3.70074342e-17  0.00000000e+00 -1.85037171e-17]
Standard Deviation standardized data:  [1. 1. 1. 1.]


**Standardize Data - StandardScaler()**

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

 We can standardize data using scikit-learn with the StandardScaler class.


 z = (x - u) / s, where u is the mean and s is the standard deviation of the training samples
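
As a sketch of what this formula means in practice, the fitted `StandardScaler` stores the mean in `mean_` and the standard deviation in `scale_`, and applies them to any data passed to `transform`. The small arrays below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
new_rows = np.array([[4.0, 30.0]])

demo_scaler = StandardScaler().fit(train)

# mean_ and scale_ hold the u and s learned from the training data
manual = (new_rows - demo_scaler.mean_) / demo_scaler.scale_

print(demo_scaler.mean_, demo_scaler.scale_)
print(demo_scaler.transform(new_rows), manual)
```

Because the statistics come from the data given to `fit`, new rows are scaled consistently with the training set rather than with their own mean and standard deviation.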

In [25]:
from sklearn.preprocessing import StandardScaler
import numpy as np

In [26]:
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
print(rescaledX[0:5,:])

[[ 0.63994726  0.84832379  0.14964075  0.90726993 -0.69289057  0.20401277
   0.46849198  1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575  0.53090156 -0.69289057 -0.68442195
  -0.36506078 -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 -1.28821221 -0.69289057 -1.10325546
   0.60439732 -0.10558415]
 [-0.84488505 -0.99820778 -0.16054575  0.15453319  0.12330164 -0.49404308
  -0.92076261 -1.04154944]
 [-1.14185152  0.5040552  -1.50468724  0.90726993  0.76583594  1.4097456
   5.4849091  -0.0204964 ]]


In [27]:
SS=StandardScaler()
Sd=SS.fit_transform(X)
print(Sd[0:5,:])

[[ 0.63994726  0.84832379  0.14964075  0.90726993 -0.69289057  0.20401277
   0.46849198  1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575  0.53090156 -0.69289057 -0.68442195
  -0.36506078 -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 -1.28821221 -0.69289057 -1.10325546
   0.60439732 -0.10558415]
 [-0.84488505 -0.99820778 -0.16054575  0.15453319  0.12330164 -0.49404308
  -0.92076261 -1.04154944]
 [-1.14185152  0.5040552  -1.50468724  0.90726993  0.76583594  1.4097456
   5.4849091  -0.0204964 ]]


#### StandardScaler removes the mean and scales the data to unit variance.

#### However, outliers influence the empirical mean and standard deviation, which shrinks the range of the remaining feature values.

#### StandardScaler cannot guarantee balanced feature scales in the presence of outliers.



# Encoding categorical data

Sometimes our data is in qualitative form, that is we have texts as our data. We can find categories in text form. Now it gets complicated for machines to understand texts and process them, rather than numbers, since the models are based on mathematical equations and calculations. Therefore, we have to encode the categorical data.

# Nominal and Ordinal Variables

Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.

Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

# Encoding Categorical Data

Two common approaches for converting ordinal and categorical variables to numerical values are covered here:

### Ordinal Encoding (label encoding)

### One-Hot Encoding



# Label Encoding (ordinal)


In ordinal encoding, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

In label encoding, we map each category to a number or a label. The labels chosen for the categories have no relationship. So categories that have some ties or are close to each other lose such information after encoding.

Limitations of label encoding:

-Label encoding converts the data into machine-readable form by assigning a unique number (starting from 0) to each class of data.

-This may introduce a priority issue when the model is trained.


-A label with a high value may be treated as having higher priority than a label with a lower value.
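
When the categories do have a natural order, the integer codes can be made to follow it. The sketch below assumes scikit-learn's `OrdinalEncoder` with an explicit `categories` list; the "low/medium/high" values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([["medium"], ["low"], ["high"], ["low"]])

# Passing the categories explicitly fixes the integer order to low < medium < high
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
encoded = ordinal_encoder.fit_transform(sizes)
print(encoded.ravel())  # [1. 0. 2. 0.]
```

Without the explicit list the categories would be sorted alphabetically (high, low, medium), which would not match the intended ranking.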



In [28]:
#Label Encoding
from sklearn import preprocessing
encode=preprocessing.LabelEncoder()
data=['AB','CD','PK','DX','MN']
encode.fit(data)
for i,item in enumerate(encode.classes_):
    print(item,'==>',i)
myinput=['CD','MN','PK','AB','CD','AB','PK','DX','MN']

lbl=encode.transform(myinput)
print(list(lbl))

AB ==> 0
CD ==> 1
DX ==> 2
MN ==> 3
PK ==> 4
[1, 3, 4, 0, 1, 0, 4, 2, 3]


In [29]:
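# Label Encoding example
import numpy as np
import pandas as pd

# df was never defined; load the iris file used earlier (assumed to be in the working directory)
df = pd.read_csv('iris.csv')
print(df.head())
print(df['class'].unique())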

In [None]:
#df['variety'].unique()

# Import label encoder
from sklearn import preprocessing
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'class'.
df['class']= label_encoder.fit_transform(df['class'])
#f=df['variety'].unique()
#print(f)
for i,item in enumerate(label_encoder.classes_):
    print(item,'==>',i)

In [None]:
df

# OneHotEncoding

**Sometimes a column contains numbers that have no specific order of preference. The numbers merely denote categories, as is the case after a column has been label encoded. A machine learning model may wrongly read an order into such numbers; to avoid this, the column should be One Hot encoded.**

One-hot encoding splits a column of categorical data into several columns, one per category present in that column. Each new column contains “1” in the rows that belong to its category and “0” elsewhere.


One hot encoding is the most widespread approach, and it works very well unless your categorical variable takes on a large number of values (i.e. you generally won't use it for variables taking more than 15 different values).
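
For a quick look at what one-hot encoding produces, here is a minimal sketch using pandas `get_dummies` on a made-up `color` column (the column name and values are assumptions for illustration).

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new 0/1 column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column that matches its original category.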

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

In [None]:

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 1, 0],
                   'C': [0, 2, 2], 'D': [0, 1, 1]})
print(df)
df_onehot = OneHotEncoder()
data = df_onehot.fit_transform(df)
#get_feature_names_out() : Get output feature names for transformation.
df1 = pd.DataFrame(data.toarray(), columns=df_onehot.get_feature_names_out(), dtype=int)
df1

In [None]:
iris=load_iris()
iris.feature_names
features=pd.DataFrame(iris.feature_names)
iris.target_names

In [None]:
X=pd.DataFrame(iris.data)
Y=pd.DataFrame(iris.target)
Y

In [None]:
encoder = OneHotEncoder(sparse_output=False)

In [None]:
ohe=encoder.fit_transform(Y)
print(ohe)

In [None]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

pipe.fit(X_train, y_train)  # fit the scaler and the model on the training data only

pipe.score(X_test, y_test)  # scale the test data with the training statistics, so no test information leaks into training


In [None]:
X.shape