# Data Preprocessing Workbook w/Notes

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


## Importing the dataset

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # iloc[rows, columns] selects by integer position: ':' means all rows, ':-1' means every column except the last
y = dataset.iloc[:, -1].values   # Assumes the dependent variable is the last column (index -1), which is often the case
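
To make the slicing concrete, here is a minimal sketch on a small in-memory frame. The frame and its column names (in particular `Purchased`) are hypothetical stand-ins, since the contents of `Data.csv` are only visible through the printed arrays.

```python
import pandas as pd

# Hypothetical three-row frame standing in for Data.csv (column names are assumed)
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, last column only

print(X_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
```

The shapes show the difference between the two: `X_demo` is two-dimensional, while `y_demo` is a flat one-dimensional array.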

In [3]:
print('X: ', X)
print('y: ', y)

X:  [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
y:  ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


#### X is a 2-D matrix of features (the independent variables) and y is a 1-D vector of labels (the dependent variable).
#### You want to avoid missing (`nan`) values in your dataset. To handle them:
- Replace them using scikit-learn's `SimpleImputer` class.
- Pass every numeric column to the imputer's `fit` and `transform` methods. (To fill missing string values, pick a strategy that supports strings, such as `most_frequent` or `constant`.)

In [4]:
from sklearn.impute import SimpleImputer
missing_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # Replace each nan with the mean of the values present in its column
missing_imputer.fit(X[:, 1:3]) # Learn the means from the numeric columns of X (2nd and 3rd columns; the slice 1:3 covers indexes 1 and 2)
X[:, 1:3] = missing_imputer.transform(X[:, 1:3]) # Overwrite those columns of X with the imputed version, where each nan is now the column mean
print('X: ', X)

X:  [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
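
The two filled-in values can be checked against what the imputer learned. The sketch below rebuilds only the age and salary columns from the printed `X` (so it does not need `Data.csv`) and reads the fitted means from the `statistics_` attribute; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and salary columns copied from the printed X, including the two gaps
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

print(imputer.statistics_)   # per-column means learned during fit
print(filled[4], filled[6])  # the two rows that had a gap
```

The means in `statistics_` are the same numbers that appear in the Germany and Spain rows of the output above.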


#### Encoding Categorical (String) Data
##### One Hot Encoding
- The simplest method is to take your column of string values and count the number of distinct values (categories), then replace that one column with that many new columns. So a column with 3 countries becomes 3 new columns, and each country maps to one vector, i.e. $|1,0,0|, |0,1,0|, |0,0,1|$
- Below you'll see how we implement this. In our example we have 3 country values for the country feature (column). _One Hot Encoding_ simply creates $n$ columns for the $n$ categories you want to encode. So instead of 1 column with 3 country values, we end up with 3 columns that each hold a 1 or 0 flagging the row's country.
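
Before using the scikit-learn encoder, it may help to see the idea built by hand. This is only a sketch of the concept with NumPy, not what `OneHotEncoder` does internally; the four sample values are taken from the first rows of the dataset.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted categories and, for each row, the index of its category
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(one_hot)
```

The categories come out in sorted order, which matches the column order France, Germany, Spain seen in the encoded `X` further down.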

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') # One-hot encode the first column (index 0) of X; remainder='passthrough' keeps the other columns unchanged
X = np.array(ct.fit_transform(X))
print('X: ', X)

X:  [[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encode Dependent Variable (Labels)
- In our example we have yes/no labels. 
- Convert them to binary 1/0 using the `LabelEncoder` class from scikit-learn

In [6]:
# Convert the string labels in y (the dependent variable) to integers; with only two classes the result is binary (No -> 0, Yes -> 1)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print('y: ', y)

y:  [0 1 0 0 1 1 0 1 0 1]
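
To see which integer belongs to which label, the fitted encoder can be inspected. A small sketch on the first five labels (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                    # position in this array is the integer code
print(encoded)
print(encoder.inverse_transform(encoded))  # back to the original strings
```

`classes_` is sorted, so `No` becomes 0 and `Yes` becomes 1, and `inverse_transform` recovers the original strings when predictions need to be reported as labels.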


## Splitting the dataset into the Training set and Test set

#### Apply feature scaling _after_ splitting training and test sets
- The reason to apply feature scaling after splitting is that the mean and standard deviation used for scaling would otherwise be computed from all the data points, including the test data. That would leak information from the test set into the training set, and the two are _supposed to be independent_ of each other.
- Scikit-learn provides the __train_test_split__ function for the split, implemented below:
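
The leakage argument can be illustrated with a short sketch. The feature values below are randomly generated for illustration and are not from `Data.csv`; the point is only that a scaler fitted on everything carries statistics that depend on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(10, 1))  # made-up values
targets = np.arange(10) % 2

f_train, f_test, t_train, t_test = train_test_split(
    features, targets, test_size=0.2, random_state=1
)

leaky = StandardScaler().fit(features)  # statistics include the test rows
clean = StandardScaler().fit(f_train)   # statistics from the training rows only

print(leaky.mean_, clean.mean_)
```

Only `clean` respects the separation: its mean and standard deviation are computed without ever seeing `f_test`.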

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)  # 'test_size' is the fraction of the dataset held out for testing (0.2 = 20%)
                                                 # 'random_state' seeds the random number generator; passing the same integer reproduces the same split on every run. https://scikit-learn.org/stable/glossary.html#term-random_state 
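
A quick way to convince yourself of the `random_state` behaviour is to split the same data twice with the same seed. The arrays here are placeholders, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10).reshape(10, 1)  # placeholder features
labels = np.arange(10) % 2           # placeholder labels

a_train, a_test, _, _ = train_test_split(data, labels, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(data, labels, test_size=0.2, random_state=1)

# Same seed, same rows in the test set
assert np.array_equal(a_test, b_test)
print(len(a_train), len(a_test))  # 8 2
```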

In [8]:
print('X_train: ', X_train)

X_train:  [[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [9]:
print('X_test: ', X_test)

X_test:  [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [10]:
print('y_train: ', y_train)

y_train:  [0 1 0 0 1 1 0 1]


In [11]:
print('y_test: ', y_test)

y_test:  [0 1]


### Feature Scaling
- To recap, we scale features to prevent any feature in the initial dataset from dominating the others. When a model compares features by magnitude, values on very different scales (here, age in tens versus salary in tens of thousands) skew the result. _(This isn't always necessary.)_
- [Formula for the basic 2 approaches](../../../../notes/2-DataPreprocessing.md/#feature-scaling) we've covered: *Standardisation* and *Normalisation*
- Our instructor posits that _Standardisation_ works well for any type of distribution, whereas _Normalisation_ typically only works well when the features follow a _normal distribution_. For that reason we'll be focusing on _Standardisation_.
- Also note that when you encode categorical data as we did for the country names (creating new columns holding a 1 or 0), you _do not_ need to feature scale those dummy variables, since they already lie between 0 and 1.
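
As a sketch of the two approaches, assuming the usual definitions (standardisation subtracts the mean and divides by the standard deviation; normalisation subtracts the minimum and divides by the range), both can be computed by hand and compared with scikit-learn's `StandardScaler` and `MinMaxScaler`. The age values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [35.0], [44.0], [50.0]])  # illustrative values

# Standardisation: (x - mean) / standard deviation
standardised = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# Normalisation: (x - min) / (max - min), which lands in the range [0, 1]
normalised = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

assert np.allclose(standardised, StandardScaler().fit_transform(ages))
assert np.allclose(normalised, MinMaxScaler().fit_transform(ages))
print(standardised.ravel())
print(normalised.ravel())
```

Note that `StandardScaler` uses the population standard deviation, which is also NumPy's default for `std`.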

In [12]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()


X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])   # Scale all rows (the first ':') of every column from the 4th (index 3) onward,
                                                    # leaving the three one-hot country columns (indexes 0-2) untouched
print('X_train: ', X_train)

X_train:  [[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


#### [Scikit-learn's StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) class provides methods for _standardizing_ features. The [fit_transform() method](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.fit_transform) first *fits* the scaler to the matrix of features, computing each column's _mean_ and _standard deviation_, and then transforms the data into the *standardized* form.

#### The sc object stores internal state: the mean and standard deviation learned from the training set when fit_transform ran. For this reason we *reuse* it on the test set, calling transform only and never refitting. The test data is then scaled with exactly the same parameters as the training data, and nothing from the test set influences those parameters.
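
The stored state can be inspected directly through the `mean_` and `scale_` attributes. The following sketch uses small illustrative arrays rather than the notebook's `X_train` and `X_test`, and checks that `transform` applies the training statistics to new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[27.0], [38.0], [44.0], [50.0]])  # illustrative values
test = np.array([[30.0], [37.0]])

scaler = StandardScaler().fit(train)
print(scaler.mean_, scaler.scale_)  # statistics learned from the training rows only

# transform() reuses those statistics; it does not recompute them from `test`
manual = (test - scaler.mean_) / scaler.scale_
assert np.allclose(scaler.transform(test), manual)
```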

In [17]:

X_test[:, 3:] = sc.transform(X_test[:, 3:]) # transform only (no fit): scale all rows of the columns from index 3 onward using the training-set statistics.
                                            # This overwrites X_test in place, so running the cell again rescales already-scaled values
print('X_test: ', X_test)

X_test:  [[0.0 1.0 0.0 -6.6987360602674375 -5.528988873522207]
 [1.0 0.0 0.0 -6.677304402554669 -5.528988865372772]]


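#### Feature scaling is now complete and the data is ready for model training. Because the cell above writes its result back into X_test, it should be run only once: the X_test values printed above (about -6.7 and -5.5) are what three successive passes of sc.transform produce, which fits the cell counter jumping from 12 to 17. A single pass gives roughly -1.47 and -0.45 for age and -0.91 and 0.21 for salary.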
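
Because the test cell writes the scaled values back into `X_test`, executing it more than once scales numbers that were already scaled. One way to keep the step re-runnable, sketched here on small illustrative arrays, is to leave the raw array untouched and store the result under a new name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[27.0], [38.0], [44.0], [50.0]])  # illustrative values
raw_test = np.array([[30.0], [37.0]])
scaler = StandardScaler().fit(train)

# Re-runnable: the raw array is never overwritten
scaled_test = scaler.transform(raw_test)
scaled_again = scaler.transform(raw_test)
assert np.allclose(scaled_test, scaled_again)

# Not re-runnable: the second pass scales values that were already scaled
in_place = raw_test.copy()
in_place[:] = scaler.transform(in_place)
in_place[:] = scaler.transform(in_place)
assert not np.allclose(in_place, scaled_test)
```

In a notebook, the same effect can be had by re-running all cells from the top, so that `X_test` is rebuilt before it is scaled.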