# Data Processing

Data processing frequently relies on libraries such as pandas (for loading and manipulating datasets) and NumPy (for numerical computation).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

After importing the libraries, we load the dataset into a pandas DataFrame, which is easier to work with.
A dataset for supervised ML typically has:
- Features (generally the first few columns)
    - Features are the independent variables, whose values are used to make the prediction
- Labels (the last column)
    - The label is the dependent variable, the value we want to predict

After loading the dataset, we separate the data into two arrays: a matrix of features and a vector of labels.
Selection by position is done with the iloc indexer (it takes square brackets and is not a function call): we give it the row and column positions we want and it returns the matching values.
x will hold all the rows and every column except the last, meaning it has the features.
y will hold all the rows of the last column only, meaning it has the labels.

In [2]:
# Load the dataset file; the dataset variable is a pandas DataFrame
dataset = pd.read_csv('Data.csv')
# The first 3 columns of our dataset are the features (stored in x); the last column holds the labels (stored in y)
# iloc selects by position: rows go before the comma, columns after it
# ':' is a range; with nothing before or after it, everything is selected
# ':-1' selects all the columns except the last one (the labels)
x = dataset.iloc[:, :-1].values
# '-1' selects all rows of the last column only
y = dataset.iloc[:, -1].values
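
For readers who do not have Data.csv at hand, the same slicing can be tried on a tiny in-memory DataFrame. This is only a sketch: the frame `toy` and its column names (including `Purchased`) are made up for illustration and are not taken from the original file.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns and one label column
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1].values   # all rows, every column except the last
labels = toy.iloc[:, -1].values      # all rows, only the last column

print(features.shape)  # (3, 3)
print(labels.shape)    # (3,)
```

The features come back as a 2-D array and the labels as a 1-D array, which is the layout scikit-learn expects.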

In [3]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


scikit-learn is used widely in machine learning, so in the next cell we use the SimpleImputer class from its sklearn.impute module.

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. 

The imputer lets us repair such data by replacing each missing value with a statistic computed from the rest of the column, such as the mean, the median or the most frequent value.

In [5]:
# Missing data: replace each missing age and salary with the mean of its column
# using scikit-learn
from sklearn.impute import SimpleImputer
# missing_values specifies which value counts as missing (here NaN)
# strategy specifies what to replace it with (here the column mean; other strategies exist)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit computes the mean of each column; transform replaces the missing values with it
# Only the numeric columns are passed (1:3 selects columns 1 and 2); column 0 contains strings
imputer.fit(x[:,1:3])
# transform returns the two columns with the gaps filled, which overwrite the original values in x
x[:,1:3]=imputer.transform(x[:,1:3])

In [6]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
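
As a sanity check, the fill values can be compared with a plain NumPy mean that ignores NaN. The sketch below retypes the Age and Salary values from the first printed matrix; the names `numeric`, `check_imputer` and `filled` are introduced here only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary values from the printed matrix, with the two gaps as NaN
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

check_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = check_imputer.fit_transform(numeric)

# statistics_ holds the per-column fill values learned by fit
print(check_imputer.statistics_)
# The same numbers computed directly, ignoring NaN
print(np.nanmean(numeric, axis=0))
```

Both lines should print the same two means, which are the values that appear in the Germany and Spain rows of the output.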


#### Why encode categorical data?

A machine learning model works on numbers, so it cannot directly interpret text values such as country names.

Machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model.

Categorical data are variables that contain label values rather than numeric values. 
   - For example: Country is a categorical variable with values such as Germany, France and England.
   
Many machine learning algorithms cannot operate on label data directly. This means that categorical data must be converted to a numerical form.

There are two ways to convert:
1. **Integer Encoding**: Labelling each category with an integer, like 1 for red, 2 for green, 3 for blue. This implies an order (red < green < blue) that the categories may not actually have, hence we use the next method in such cases.
2. **One Hot Encoding**: Here each category gets its own binary column and exactly one column is 1 in every row, like 100 for red, 010 for green, 001 for blue.
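
The two encodings can be sketched with NumPy alone. This is an illustration rather than the method used in the rest of the notebook; note that `np.unique` sorts the categories and numbers them from 0, so the integer codes differ from the 1, 2, 3 used in the list above.

```python
import numpy as np

colours = np.array(["red", "green", "blue", "green"])

# Integer encoding: each distinct category gets an integer code (categories are sorted)
categories, codes = np.unique(colours, return_inverse=True)
print(categories)  # ['blue' 'green' 'red']
print(codes)       # [2 1 0 1]

# One-hot encoding: one column per category, with a single 1 in each row
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

Every row of `one_hot` sums to 1, so no category is treated as larger or smaller than another.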

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
# Each transformer is given as (name, the encoder to apply, indexes of the columns to encode)
# remainder='passthrough' keeps the columns that are not transformed
# fit_transform learns the categories and encodes the column in one step
x = np.array(ct.fit_transform(x))
# np.array ensures the result is stored as a NumPy array

In [9]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In the output above you can see what the OneHotEncoder does: the country column has been replaced by three binary columns, one per country (France, Germany, Spain).
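
To see which column stands for which country, the fitted encoder can be inspected. The sketch below uses a separate, made-up `countries` array and a standalone `demo_encoder` rather than the ColumnTransformer above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D input, hence the reshape to a single column
countries = np.array(["France", "Spain", "Germany", "Spain"]).reshape(-1, 1)

demo_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for printing
encoded = demo_encoder.fit_transform(countries).toarray()

print(demo_encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(encoded)
```

The categories are stored in sorted order, and the encoded columns follow that same order.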

---
Now we will encode the dependent variable, meaning we will convert each No and Yes to 0 and 1. With only two classes, it is natural to use Integer Encoding **(LabelEncoder)**, discussed earlier.

In [10]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# LabelEncoder takes no arguments because the whole label vector is encoded
# 'No' is encoded as 0 and 'Yes' as 1
y = le.fit_transform(y)

In [11]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
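
The mapping between labels and integers can be read back from the encoder. A small sketch with a made-up `answers` array and a separate `demo_le` encoder:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

answers = np.array(["No", "Yes", "No", "Yes"])

demo_le = LabelEncoder()
encoded_answers = demo_le.fit_transform(answers)

# classes_ lists the labels in the order of their integer codes
print(demo_le.classes_)   # ['No' 'Yes'], so 'No' -> 0 and 'Yes' -> 1
print(encoded_answers)    # [0 1 0 1]

# inverse_transform maps the integers back to the original labels
print(demo_le.inverse_transform(encoded_answers))
```

Keeping the fitted encoder around is useful later, when predictions need to be turned back into readable labels.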


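#### Why should we apply feature scaling after splitting the data set?
Feature scaling puts all the features on the same scale, so that one feature doesn't dominate the others while the model is learning.

The test set stands in for unseen data, so no information from it should influence the scaler. If we scaled before splitting, the mean and standard deviation would be computed using the test rows too.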

In [18]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)
# train_test_split splits the features and labels, that is x and y
# test_size=0.2 divides the data set 80-20 between training and test
# random_state makes sure the split is the same every time the code is run
print("x_train",x_train)
print("x_test",x_test)
print("y_train",y_train)
print("y_test",y_test)

x_train [[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
x_test [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
y_train [0 1 0 0 1 1 0 1]
y_test [0 1]
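
The effect of random_state can be checked by splitting the same data twice. The arrays `demo_x` and `demo_y` below are made up for the purpose; only the sizes match the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo_x = np.arange(20).reshape(10, 2)   # 10 rows, 2 made-up features
demo_y = np.arange(10)                  # made-up labels, one per row

first = train_test_split(demo_x, demo_y, test_size=0.2, random_state=1)
second = train_test_split(demo_x, demo_y, test_size=0.2, random_state=1)

x_tr, x_te, y_tr, y_te = first
print(x_tr.shape, x_te.shape)  # (8, 2) (2, 2)

# With the same random_state, both calls produce the same split
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

Leaving random_state unset would give a different shuffle on each run, which makes results harder to reproduce.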


### Feature Scaling
- **Standardisation**: Subtract the mean of the feature from each value, then divide by the standard deviation of the feature.
    - Works well in most situations
- **Normalization**: Subtract the minimum value of the feature from each value, then divide by the difference between the maximum and minimum values of the feature.
    - Specific situations when certain features follow a normal distribution
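
Both formulas can be written out in NumPy and compared with scikit-learn's StandardScaler and MinMaxScaler. The `ages` values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [35.0], [44.0], [50.0]])  # made-up values

# Standardisation: (x - mean) / standard deviation
standardised = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# Normalization: (x - min) / (max - min)
normalised = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

print(np.allclose(standardised, StandardScaler().fit_transform(ages)))  # True
print(np.allclose(normalised, MinMaxScaler().fit_transform(ages)))      # True
```

Standardised values are centred on 0, while normalized values always fall between 0 and 1.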

In [19]:
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()
# Dummy variables like country are left unscaled: they are already 0 or 1, and scaling would obscure which country each column represents
x_train[:,3:] = sc.fit_transform(x_train[:,3:])
x_test[:,3:] = sc.transform(x_test[:,3:]) 

**fit_transform()**

The **fit** method calculates the mean and variance of each feature in our data.

The **transform** method rescales each feature using its respective mean and variance.

---

Using the **transform()** method we can use the same mean and variance from our training data to transform our test data. 

Thus, the parameters learned by the scaler from the training data are reused to transform our test data.

---

**Why not fit transform on test data?**

Because we don't want our model to learn about our test data.

If we use the fit method on our test data too, we will compute a new mean and variance, that is, a new scale for each feature, and our model will learn about our test data too.

Thus, what we want to keep as a surprise is no longer unknown to our model and we will not get a good estimate of how our model is performing on the test (unseen) data.
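
A small check makes the point concrete: after fitting on the training rows only, the test rows are scaled with the training mean and standard deviation. The `train` and `test` arrays and the `demo_sc` scaler below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[20.0], [30.0], [40.0], [50.0]])  # made-up training feature
test = np.array([[25.0], [60.0]])                   # made-up test feature

demo_sc = StandardScaler()
demo_sc.fit(train)                      # learns mean and scale from the training rows only
scaled_test = demo_sc.transform(test)   # reuses those training statistics

print(demo_sc.mean_)  # [35.]

# The same scaling done by hand with the training statistics
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(scaled_test, manual))
```

The test mean never enters the calculation, so the test rows stay unseen as far as the scaler is concerned.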

In [20]:
print(x_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [21]:
print(x_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
