# 1. Starting with Scikit-learn
## 1. Loading the dataset
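- Scikit-learn ships with several small datasets for experimentation. Here we load the Iris dataset, whose class labels are already encoded as integers.
- Only columns 2 and 3 of the feature matrix (petal length and petal width) are used as features.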

In [1]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

print('Class labels:', np.unique(y))

Class labels: [0 1 2]
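
To make the column selection and the integer labels easier to read, the names stored on the loaded dataset can be inspected. This is a minimal sketch; the variable names `selected_features` and `label_names` are illustrative and not part of the original notebook.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Columns 2 and 3 of iris.data hold the petal measurements.
selected_features = [iris.feature_names[i] for i in (2, 3)]
print('Selected features:', selected_features)

# Map each integer class label back to its species name.
label_names = {int(label): str(name)
               for label, name in enumerate(iris.target_names)}
print('Label names:', label_names)
```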


## 2. Splitting data into 70% training and 30% test data
- The "stratify" option preserves the class proportions of the original dataset in both the training and test sets (1:1:1 for the three classes in this case)

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))

Labels counts in y: [50 50 50]
Labels counts in y_train: [35 35 35]
Labels counts in y_test: [15 15 15]
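
To see what stratification changes, the same split can be repeated without the "stratify" argument and the test-set label counts compared. This is only a sketch: the names `y_test_strat` and `y_test_plain` are illustrative, and the non-stratified counts depend on the random seed, so they are not guaranteed to be balanced.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# Same test size and seed, once with and once without stratification.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=1)

# minlength=3 keeps one slot per class even if a class were missing.
print('Stratified test counts:    ', np.bincount(y_test_strat, minlength=3))
print('Non-stratified test counts:', np.bincount(y_test_plain, minlength=3))
```

With stratification, each class contributes exactly 15 test samples; without it, the counts are whatever the random shuffle happens to produce.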


## 3. Standardizing the features
- The fit() method lets the StandardScaler object estimate the sample mean and standard deviation of each feature from the training set
- The scaling parameters estimated on the training set must be applied to both the training and test sets. This keeps the two sets comparable and gives a more realistic result for unseen samples (the test set)

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
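
As a sanity check, the fitted parameters can be inspected and the transformation reproduced by hand. The sketch below repeats the earlier steps so it runs on its own; `X_test_manual` is an illustrative name, and the manual formula assumes the default StandardScaler settings (centering and scaling by the population standard deviation).

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# fit() stores the per-feature statistics of the training set.
print('Training mean:', sc.mean_)
print('Training std: ', sc.scale_)

# Reproduce transform() manually using the training-set statistics only.
X_test_manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print('Manual result matches:', np.allclose(X_test_std, X_test_manual))

# The training set ends up with zero mean and unit variance by construction.
# The test set is only approximately so, since it reuses the training parameters.
print('Train mean after scaling:', X_train_std.mean(axis=0).round(6))
print('Test mean after scaling: ', X_test_std.mean(axis=0).round(6))
```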