## Data Preparation & Preprocessing
1. Feature selection (columns)
2. Data sample selection (rows)
3. Class balancing
4. Cleaning data (noise reduction, missing-data imputation, normalisation, outlier detection, etc.)
5. Feature engineering
6. Data augmentation
7. Data standardization
8. Merging and aggregating the data in preparation for the final dataset, then pushing it to DVC with the `dataset` tag
9. Splitting the data (train, val, test) and pushing the indices to DVC
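
The last step in the list above (splitting into train, val and test and keeping the indices) can be sketched as follows. This is a minimal illustration, not the course's actual pipeline: the dataset size `n_samples` and the 60/20/20 ratio are assumptions, and the DVC push itself is not shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 100  # hypothetical dataset size, chosen only for illustration
indices = np.arange(n_samples)

# First hold out 20% as the test set, then split the rest into train and val.
train_val_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)
train_idx, val_idx = train_test_split(train_val_idx, test_size=0.25, random_state=42)

# Splitting indices (rather than the data itself) keeps the split small
# and easy to version next to the dataset.
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing `random_state` makes the split reproducible, which matters if the index arrays are later stored and versioned.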

In [1]:
# !pip install -U scikit-learn

Collecting scikit-learn
  Using cached scikit_learn-1.2.2-cp310-cp310-macosx_12_0_arm64.whl (8.5 MB)
Collecting joblib>=1.1.1
  Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.2.0 scikit-learn-1.2.2 threadpoolctl-3.1.0


We will use examples from the scikit-learn [preprocessing documentation](https://scikit-learn.org/stable/modules/preprocessing.html).

### Data Cleaning

Imputation

In [1]:
import numpy as np
X = np.array([[1, 2], [np.nan, 3], [7, 6]], dtype=float)
col_mean = np.nanmean(X, axis=0)
print(col_mean)
inds = np.where(np.isnan(X))
print(inds)
X[inds] = np.take(col_mean, inds[1])
print(X)

[4.         3.66666667]
(array([1]), array([0]))
[[1. 2.]
 [4. 3.]
 [7. 6.]]


In [5]:
X = np.array([[1, 2], [np.nan, 3], [7, 6]])  # plain ndarray; np.matrix is discouraged in new code
print(X)

[[ 1.  2.]
 [nan  3.]
 [ 7.  6.]]


In [7]:
import numpy as np
X = [[1, 2], [np.nan, 3], [7, 6]]

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# print(imp.fit_transform(X))
imp.fit(X)
print(imp.transform(X))

# X_new = [[np.nan, 2], [6, np.nan], [7, 6]]
# print(imp.transform(X_new))

[[1. 2.]
 [4. 3.]
 [7. 6.]]
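
The imputer learns its fill values in `fit` and reuses them in `transform`, so new data is filled with the means of the data it was fitted on. The sketch below makes that explicit using the `X_new` array from the commented-out lines above; the variable names `X_fit` and `X_new_filled` are only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_fit = [[1, 2], [np.nan, 3], [7, 6]]
X_new = [[np.nan, 2], [6, np.nan], [7, 6]]

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(X_fit)

# Column means learned from X_fit: 4.0 and 11/3.
print(imp.statistics_)

# Missing entries in X_new are replaced by the means of X_fit, not of X_new.
X_new_filled = imp.transform(X_new)
print(X_new_filled)
```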


In [3]:
import numpy as np
from sklearn.impute import KNNImputer
nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
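
To see where the values 4 and 5.5 come from, the imputation can be reproduced by hand. The sketch below is a simplified re-implementation for this small example only (uniform weights, no handling of edge cases); the helper `impute_cell` is hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]], dtype=float)

# Pairwise distances that skip missing coordinates and rescale the rest.
dist = nan_euclidean_distances(X)

def impute_cell(row, col, k=2):
    """Average the k nearest rows that have a value in this column."""
    donors = [r for r in range(len(X)) if r != row and not np.isnan(X[r, col])]
    nearest = sorted(donors, key=lambda r: dist[row, r])[:k]
    return X[nearest, col].mean()

manual = X.copy()
for row, col in zip(*np.where(np.isnan(X))):
    manual[row, col] = impute_cell(row, col)

print(manual)
print(KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X))
```

For the first row the two nearest donors are rows 1 and 2 (values 3 and 5, mean 4); for the third row they are rows 1 and 3 (values 3 and 8, mean 5.5).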

### Scaling

In [10]:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
                    
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

In [11]:
print(scaler.mean_, scaler.scale_)

[1.         0.         0.33333333] [0.81649658 0.81649658 1.24721913]


In [5]:
X_scaled = scaler.transform(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
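
`StandardScaler` subtracts the per-column mean and divides by the per-column standard deviation, both learned from the training data. The sketch below checks this against a manual NumPy computation and then applies the fitted scaler to a new row; `X_test` is an assumed example input, not part of the cells above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
scaler = StandardScaler().fit(X_train)

# Manual z-score: population standard deviation (ddof=0), column-wise.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_manual = (X_train - mean) / std

# New data is scaled with the training statistics, not its own.
X_test = np.array([[-1., 1., 0.]])
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```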

### Encoding

In [16]:
# `categories` takes one list of categories per feature, hence the nested list.
enc = preprocessing.OrdinalEncoder(categories=[['young', 'middle-aged', 'old']])
X = [['old'], ['middle-aged'], ['young']]
enc.fit_transform(X)

array([[2.],
       [1.],
       [0.]])

In [13]:
# LabelEncoder expects a 1-D array of target labels, not a 2-D feature matrix.
enc = preprocessing.LabelEncoder()
y = ['male', 'female', np.nan, 'female']
enc.fit_transform(y)  # the missing value ends up as its own class (2)

array([1, 0, 2, 0])

In [19]:
enc = preprocessing.OneHotEncoder()
X = [['male'], ['female'], ['female']]
enc.fit_transform(X).toarray()

array([[0., 1.],
       [1., 0.],
       [1., 0.]])

In [20]:
enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Safari']]).toarray()


array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])

In [8]:
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]
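
By default the encoder raises an error when `transform` meets a category it did not see during `fit`. A hedged sketch of one way to handle this is the `handle_unknown='ignore'` option, which encodes unseen categories as all zeros for that feature; the input row with `'from Asia'` and `'uses Chrome'` is an assumed example.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

# Unseen categories are encoded as all-zero blocks instead of raising an error.
enc_ignore = OneHotEncoder(handle_unknown='ignore').fit(X)

row = enc_ignore.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
print(row)  # only the 'female' column is set
```

Whether silently ignoring unknown categories is acceptable depends on the application; in some cases failing loudly is the safer choice.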

### [Text Data](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer

In [22]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [23]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

In [24]:
vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [25]:
vectorizer.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])
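
The count matrix and vocabulary above can be reproduced with a few lines of plain Python, which shows what the vectorizer does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary alphabetically, and count occurrences. This is a simplified sketch of that behaviour, not scikit-learn's implementation.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Same pattern as CountVectorizer's default token_pattern.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in corpus]

# Alphabetically sorted vocabulary mapped to column indices.
terms = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {term: i for i, term in enumerate(terms)}

# One row of term counts per document.
counts = []
for doc in tokenized:
    c = Counter(doc)
    counts.append([c.get(term, 0) for term in terms])

print(vocabulary)
print(counts)
```

A document made only of unseen words, such as `'Something completely new.'`, contributes no counts because none of its tokens are in the vocabulary.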

## Sample Application

In [26]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)  # fit the scaler and the classifier on the training data only

# Scale the test data with the training statistics, then score; no test data leaks into the fit.
pipe.score(X_test, y_test)


0.96
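
To make clear what the pipeline does internally, the sketch below performs the same two steps by hand and compares the scores. The variable names (`scaler`, `clf`, `manual_score`, `pipe_score`) are illustrative, and the default `LogisticRegression` solver is assumed to be deterministic for this data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pipeline version: scaling and classification in one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)

# Manual version: fit the scaler on the training data only,
# then reuse its statistics for the test data.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_score = clf.score(scaler.transform(X_test), y_test)

print(pipe_score, manual_score)
```

The pipeline is less error-prone than the manual version because it is impossible to forget the transform step or to fit the scaler on the test data by accident.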

In [27]:
X

array([[-2.02514259,  0.0291022 , -0.47494531, ..., -0.33450124,
         0.86575519, -1.20029641],
       [ 1.61371127,  0.65992405, -0.15005559, ...,  1.37570681,
         0.70117274, -0.2975635 ],
       [ 0.16645221,  0.95057302,  1.42050425, ...,  1.18901653,
        -0.55547712, -0.63738713],
       ...,
       [-0.03955515, -1.60499282,  0.22213377, ..., -0.30917212,
        -0.46227529, -0.43449623],
       [ 1.08589557,  1.2031659 , -0.6095122 , ..., -0.3052247 ,
        -1.31183623, -1.06511366],
       [-0.00607091,  1.30857636, -0.17495976, ...,  0.99204235,
         0.32169781, -0.66809045]])

In [28]:
y

array([0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0])

In [33]:
pipe.predict(X)

array([1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0])