In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Import data

In [28]:
dataset = pd.read_csv('Data.csv')

Create the feature matrix X (independent variables) from all rows and every column except the last one.
Create the vector y (dependent variable) from the last column, which sits at index 3 in this dataset.

In [29]:
X = dataset.iloc[:,:-1].values 
y = dataset.iloc[:,3].values
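
The slicing above can be tried on a small stand-in frame. This is only a sketch: the column names and values below are hypothetical and are not taken from Data.csv.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns, then the target.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_toy = toy.iloc[:, :-1].values   # every row, every column except the last
y_toy = toy.iloc[:, -1].values    # every row, last column only

print(X_toy.shape)  # (3, 3)
print(y_toy.shape)  # (3,)
```

Using -1 rather than a fixed index such as 3 keeps the selection correct if feature columns are added later, assuming the target stays in the last column.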

Taking care of missing data.
The imputer replaces each missing value (NaN) with the mean of the non-missing values in the same column.

In [30]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
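
To see what the mean strategy does, here is a minimal sketch on a made-up array (the numbers are illustrative, not from the dataset). After fitting, the learned column means are available in the statistics_ attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: one missing value in each column.
toy = np.array([[30.0, 50000.0],
                [np.nan, 60000.0],
                [50.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)  # fit and transform in one call

print(imputer.statistics_)  # column means computed from the non-missing entries
print(filled)               # NaNs replaced by the mean of their column
```

Here the first column's mean is (30 + 50) / 2 = 40 and the second's is (50000 + 60000) / 2 = 55000, so those are the values written into the gaps.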

Encoding categorical data.
- Problem: these values have no natural order (France is neither greater nor smaller than Spain), so integer codes alone would suggest a ranking that does not exist
- Solution: dummy encoding (one-hot), which gives each category its own 0/1 column
*OBS: With remainder='passthrough', every column not listed in transformers is passed through unchanged instead of being dropped.

In [31]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Map the categories in column 0 to integer codes
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# One-hot encode column 0; remainder="passthrough" keeps the other columns
onehotencoder = ColumnTransformer([("encoder", OneHotEncoder(), [0])], remainder="passthrough")
X = onehotencoder.fit_transform(X)
# Encode the target labels as integers
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
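
A small sketch of the one-hot step on made-up rows may help show where the dummy columns end up. The data is hypothetical, and sparse_threshold=0 is an addition here, used only to force a dense array so the result prints directly.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: a country name followed by an age.
toy = np.array([["France", 44.0],
                ["Spain", 27.0],
                ["Germany", 30.0],
                ["France", 35.0]], dtype=object)

ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",   # keep the age column
    sparse_threshold=0,        # assumption for this sketch: always return a dense array
)
encoded = ct.fit_transform(toy)

# Categories are sorted, so the dummy columns follow this order.
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(encoded)
```

The encoded columns come first and the passed-through columns follow, which is why the country dummies occupy the leftmost columns of X in this notebook.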

Splitting the dataset into a training set and a test set.
A common choice is test_size=0.2, which holds out 20% of the rows for testing.

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

array([[0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 37.0, 67000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)
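
As a rough illustration of the split, the sketch below uses a made-up array of 10 rows. With test_size=0.2 it yields 8 training rows and 2 test rows, and a fixed random_state makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: row i of X_toy is [2*i, 2*i + 1], and y_toy[i] is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
# Rows are shuffled, but each feature row stays paired with its own label.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))  # True
```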

Feature scaling
- Problem: each column has a different range, and many ML models rely on Euclidean distance, so the column with the largest range would dominate
- Solution:
 - Standardisation: Xstand = (x - mean(x)) / std(x)
 - Normalisation: Xnorm = (x - min(x)) / (max(x) - min(x))
*OBS: For X_test we only call transform. The sc_X object is already fitted to the training set, and refitting it on the test set would scale the two sets differently.
QUESTION: Do we need to scale the dummy variables? It depends on the context: scaling puts every column on the same scale, but the 0/1 values lose their direct interpretation.
QUESTION: Do we need to feature scale y? No, because this is a classification problem and y holds category labels.
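
The two formulas above can be checked by hand against scikit-learn. This is a sketch on made-up values; MinMaxScaler is brought in only for the comparison and is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative single-column data.
x = np.array([[27.0], [35.0], [44.0], [48.0]])

# Standardisation: subtract the mean, divide by the standard deviation.
# np.std uses the population standard deviation (ddof=0), as StandardScaler does.
x_stand = (x - x.mean()) / x.std()

# Normalisation: rescale to the [0, 1] interval.
x_norm = (x - x.min()) / (x.max() - x.min())

print(np.allclose(x_stand, StandardScaler().fit_transform(x)))  # True
print(np.allclose(x_norm, MinMaxScaler().fit_transform(x)))     # True
```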

In [39]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

array([[-1.        ,  2.64575131, -0.77459667,  0.26306757,  0.12381479],
       [ 1.        , -0.37796447, -0.77459667, -0.25350148,  0.46175632],
       [-1.        , -0.37796447,  1.29099445, -1.97539832, -1.53093341],
       [-1.        , -0.37796447,  1.29099445,  0.05261351, -1.11141978],
       [ 1.        , -0.37796447, -0.77459667,  1.64058505,  1.7202972 ],
       [-1.        , -0.37796447,  1.29099445, -0.0813118 , -0.16751412],
       [ 1.        , -0.37796447, -0.77459667,  0.95182631,  0.98614835],
       [ 1.        , -0.37796447, -0.77459667, -0.59788085, -0.48214934]])
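
To make the fit/transform distinction concrete, here is a minimal sketch with made-up numbers. The scaler learns its mean and standard deviation from the training values only, and the test value is then scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: three training values and one unseen test value.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = sc.transform(test)        # reuses the training statistics

print(sc.mean_, sc.scale_)  # training mean and standard deviation
print(test_scaled)          # (40 - training mean) / training std
```

Because the test value lies outside the training range, its scaled value is larger than any scaled training value, which is the expected behaviour rather than a sign of a mistake.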