# NumPy exercise

Suppose you have a dataset that contains temperature measurements from different parts of a moving car: the engine, the exhaust pipe, and the brakes (the `Engine`, `Exhaust pipe`, and `Breaks` columns):
```
       Engine  Exhaust pipe      Breaks
0   87.454359    484.136701  159.063172
1  125.000000           NaN  143.346484
2   81.895101    483.893817         NaN
3   82.731238    481.494196  139.393452
4   86.087258    592.480628  177.006120
5  112.400000    495.917422  190.304417
6   86.986207    450.986284  148.849352
7   77.435774    499.880549  161.293086
8   81.179830    488.378706  157.190646
9   85.506349           inf  139.760698
```

Each row of the matrix contains a separate measurement. 

As you can see, some of the values are faulty. In this exercise you will **clean** and **normalize** the data and **split it into a train and a validation set**.


In [2]:
# import some libraries
import numpy as np
import pandas as pd


np.random.seed(13)

# name of the dataset file
ds_file_name = 'ds_1.npy'

In [3]:
# print the data nicely
def print_dataset(dataset, cols=["Engine", "Exhaust pipe", "Breaks"]):
    df = pd.DataFrame(dataset, columns=cols)
    print(df)

In [4]:
dataset = np.load(ds_file_name)

### Exercise 1: Clean the data (3 points)

**1.1** 
Engine temperature should not exceed 100 °C. Clip the values above 100.0 °C to 100.0 °C.

In [11]:
### Your code here ###
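
As a hint, one possible approach is sketched below on a small made-up array rather than on the exercise data. It assumes the engine readings sit in column 0 and uses `np.clip` with only an upper bound; the names `toy` and `clipped` are illustrative.

```python
import numpy as np

# Toy data, made up for illustration: column 0 plays the role of "Engine".
toy = np.array([[ 87.0, 480.0, 150.0],
                [125.0, 490.0, 140.0],
                [112.4, 495.0, 190.0]])

# Work on a copy so the original array stays untouched.
clipped = toy.copy()

# Cap only the first column at 100.0 (no lower bound); other columns are unchanged.
clipped[:, 0] = np.clip(clipped[:, 0], None, 100.0)

print(clipped)
```

Clipping a single column slice keeps the exhaust and brake readings, which are legitimately above 100, intact.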

**1.2** Remove the rows that contain `inf` values.

In [13]:
### Your code here ###
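
A minimal sketch of row filtering with a boolean mask, again on made-up data: `np.isinf` marks infinite entries, and `.any(axis=1)` reduces that to one flag per row. The variable names are hypothetical.

```python
import numpy as np

# Toy data, made up for illustration: the second row contains an inf.
toy_inf = np.array([[85.0, 480.0, 150.0],
                    [86.0, np.inf, 140.0],
                    [81.0, np.nan, 160.0]])

# True for every row that holds at least one +inf or -inf entry.
rows_with_inf = np.isinf(toy_inf).any(axis=1)

# Keep only the rows without inf; NaN entries are not affected by this step.
no_inf = toy_inf[~rows_with_inf]

print(no_inf)
```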

**1.3** Replace `NaN` values with the mean of the corresponding column (computed over the non-`NaN` values).

In [15]:
### Your code here ###
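
One way to do this, sketched on made-up data, is to compute per-column means with `np.nanmean` (which skips `NaN` entries) and then write each mean into the `NaN` positions of its column. This is only one of several possible approaches, and all names are illustrative.

```python
import numpy as np

# Toy data, made up for illustration: one NaN in column 1 and one in column 2.
toy_nan = np.array([[ 80.0, np.nan, 150.0],
                    [ 90.0,  480.0, np.nan],
                    [100.0,  500.0, 170.0]])

# Column means that ignore NaN entries.
col_means = np.nanmean(toy_nan, axis=0)

# Row and column indices of every NaN entry.
nan_rows, nan_cols = np.where(np.isnan(toy_nan))

# Fill each NaN with the mean of the column it belongs to.
filled = toy_nan.copy()
filled[nan_rows, nan_cols] = col_means[nan_cols]

print(filled)
```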

After cleaning the dataset, you should have something like this:

```
       Engine  Exhaust pipe      Breaks
0   87.454359    484.136701  159.063172
1  100.000000    497.146038  143.346484
2   81.895101    483.893817  159.555841
3   82.731238    481.494196  139.393452
4   86.087258    592.480628  177.006120
5  100.000000    495.917422  190.304417
6   86.986207    450.986284  148.849352
7   77.435774    499.880549  161.293086
8   81.179830    488.378706  157.190646
```

### Exercise 2: Append new data (0.5 point)

**2.1** Suppose you get another 10 measurements. Append them to the previous dataset.

In [8]:
# load the new dataset
ds2_file_name = 'ds_2.npy'
dataset_new = np.load(ds2_file_name)

In [22]:
### Your code here ###
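
As a hint, stacking two arrays with the same number of columns can be sketched with `np.concatenate` along axis 0 (`np.vstack` would work as well). The arrays below are made up and are not the contents of `ds_1.npy` or `ds_2.npy`.

```python
import numpy as np

# Two made-up arrays with the same number of columns.
old_rows = np.arange(6, dtype=float).reshape(2, 3)
new_rows = np.arange(6, 15, dtype=float).reshape(3, 3)

# Stack the new rows below the old ones; a new array is returned.
combined = np.concatenate([old_rows, new_rows], axis=0)

print(combined.shape)
```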

### Exercise 3: Broadcasting (1 point)

**3.1 Standardize the data**
Standardization of datasets is a common requirement for many machine learning estimators; they might behave badly if the individual features (columns in this case) do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation. (Source: [sklearn-preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling))

Calculate the mean and standard deviation of each column, then subtract the mean from each column and divide the result by the column's standard deviation.

In [None]:
### Your code here ###
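
The broadcasting idea can be sketched on a tiny made-up array: the per-column statistics have shape `(2,)`, and NumPy stretches them across all rows of the `(3, 2)` data. The sketch assumes no column is constant (a zero standard deviation would cause a division by zero) and uses NumPy's default population standard deviation (`ddof=0`).

```python
import numpy as np

# Toy data, made up for illustration: 3 measurements, 2 features.
toy_std = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

# Per-column statistics, each of shape (2,).
mean = toy_std.mean(axis=0)
std = toy_std.std(axis=0)  # population standard deviation (ddof=0)

# Broadcasting: the (2,) vectors are applied to every row of the (3, 2) array.
standardized = (toy_std - mean) / std

print(standardized.mean(axis=0), standardized.std(axis=0))
```

After this transformation every column should have a mean of approximately 0 and a standard deviation of approximately 1.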

**3.2** Sometimes we have to insert new columns into the data (e.g., for linear regression we usually append a column containing only 1s). Append a column to the dataset that contains only 1s.

In [25]:
### Your code here ###
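
A minimal sketch of appending a column, on made-up data: the column of ones needs shape `(n_rows, 1)` so that `np.hstack` can place it next to the existing columns. The names are illustrative.

```python
import numpy as np

# Toy data, made up for illustration.
toy_feat = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])

# A column vector of ones with one entry per row of the data.
ones = np.ones((toy_feat.shape[0], 1))

# Place the ones column to the right of the existing columns.
with_ones = np.hstack([toy_feat, ones])

print(with_ones)
```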

### Exercise 4: Row/Column shuffle (0.5 points)

Reorder the columns in the dataset: move the column containing ones to the 0-th position and the `Engine` column to the last position.

In [64]:
### Your code here ###

### Exercise 5: Train/val split (1 point)

When you have only one dataset, a common practice is to split it into a train set and a validation set, train your model on the train set, and evaluate it on the validation set.

Split your data into a train and a validation set such that 30% of the data remains for validation. Note: split the data randomly, without modifying the original dataset (use random indices to select part of the data).

In [None]:
### Your code here ###
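
The random-index approach described above can be sketched as follows, on a made-up array of 10 rows. A shuffled index array is cut into a validation part (30%) and a train part; indexing with an integer array returns copies, so the original data is not modified. The rounding rule for the validation size is an assumption, and all names are illustrative.

```python
import numpy as np

np.random.seed(13)

# Toy data, made up for illustration: 10 measurements, 3 features.
toy_split = np.arange(30, dtype=float).reshape(10, 3)

# Number of validation rows: 30% of the data, rounded to the nearest integer.
n_val = int(round(0.3 * len(toy_split)))

# A random ordering of the row indices 0..9.
perm = np.random.permutation(len(toy_split))
val_idx, train_idx = perm[:n_val], perm[n_val:]

# Integer-array indexing returns new arrays; toy_split itself is unchanged.
train_set = toy_split[train_idx]
val_set = toy_split[val_idx]

print(train_set.shape, val_set.shape)
```

Because the two index sets come from one permutation, every row lands in exactly one of the two sets.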

You should get something similar to this:

**Train set:**
```
         Exhaust pipe      Breaks      Engine
0   1.0    497.146038  143.346484  100.000000
1   1.0    483.893817  159.555841   81.895101
2   1.0    592.480628  177.006120   86.087258
3   1.0    499.880549  161.293086   77.435774
4   1.0    488.378706  157.190646   81.179830
5   1.0    453.583239  166.299909   82.384179
6   1.0    461.266998  126.291624   82.593143
7   1.0    478.142717  134.627460   75.379203
8   1.0    527.395915  164.583309   81.212734
9   1.0    523.282899  154.156001   89.611721
10  1.0    463.950016  106.257695   83.963252
11  1.0    528.113085  164.439301   83.120400
12  1.0    484.257096  156.975012   88.199420
13  1.0    478.505146  143.227635   95.268466
```

**Val set:**
```
        Exhaust pipe      Breaks      Engine
0  1.0    484.136701  159.063172   87.454359
1  1.0    481.494196  139.393452   82.731238
2  1.0    495.917422  190.304417  100.000000
3  1.0    450.986284  148.849352   86.986207
4  1.0    457.292852  165.318221   87.267405
```