# 02.A: Working with Datasets
It's important to have a clear and sensible way of representing the datasets that learning algorithms train on.

A dataset consists of $n$ examples. Each example consists of $m$ features. The number of features per example, $m$, is also the number of dimensions of the dataset. In supervised learning, the dataset is a matrix like this:


$\boldsymbol{D} =\left[\begin{array}{cccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)} & y^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)} & y^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)} & y^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)} & y^{(n)}
\end{array}\right]$

Each row of this matrix consists of the $m$ features plus the target label as the last element in the row. In other words, $\boldsymbol{D}$ consists of both the input matrix $\boldsymbol{X}$ and the target vector $\boldsymbol{y}$, where:

$\boldsymbol{X} =\left[\begin{array}{ccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)}
\end{array}\right]$

and

$\boldsymbol{y} =\left[\begin{array}{c} 
  y^{(1)}\\ 
  y^{(2)}\\
  y^{(3)}\\
  \vdots \\
  y^{(n)}
\end{array}\right]$
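
As a small illustration of this notation, the sketch below builds a toy $\boldsymbol{D}$ with NumPy and recovers $\boldsymbol{X}$ and $\boldsymbol{y}$ by slicing off the last column. The values and variable names are made up for the example.

```python
import numpy as np

# Toy supervised dataset: n = 4 examples, m = 2 features, target in the last column.
D = np.array([
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 2.9, 1],
    [5.9, 3.0, 1],
])

X = D[:, :-1]   # input matrix, shape (n, m)
y = D[:, -1]    # target vector, shape (n,)

n, m = X.shape
print(n, m)     # 4 2
print(y)        # [0. 0. 1. 1.]

# Stacking X and y column-wise gives D back.
assert np.array_equal(np.column_stack([X, y]), D)
```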

For unsupervised learning there is no target, so $\boldsymbol{D}$ is the same as $\boldsymbol{X}$.

Here is a class named `DataSet` to represent datasets. It stores the examples in a pandas `DataFrame`, so each feature also has a name (its column label).

In [1]:
import numpy as np
import pandas as pd

In [2]:
class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An array of length n containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column
        If y is None or False, No target is available
        Else y is an array to be added as the last column of the examples  dataframe
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
            
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
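        # The last column is dropped only when it holds the target 'y'.
        columns = self.__examples.columns
        return columns[:-1].values if 'y' in columns else columns.values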
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
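        # Without a target column, every column is an input (D is the same as X).
        examples = self.__examples
        return examples.iloc[:, :-1].values if 'y' in examples.columns else examples.values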
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def __repr__(self):
        return repr(self.examples)

Let's test this class by creating a $27 \times 3$ input matrix and a separate $\boldsymbol{y}$ column.

In [3]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")

ds

     x1   x2         x1  y
0   4.0  3.0  11.581802  1
1   4.0  8.0  11.598140  0
2   2.0  7.0   9.681518  0
3   4.0  2.0   4.365004  0
4   6.0  7.0   9.646520  1
5   5.0  5.0  11.865055  1
6   2.0  2.0   9.511194  1
7   8.0  6.0  10.970335  1
8   7.0  2.0   7.167357  0
9   4.0  3.0   9.665209  0
10  5.0  4.0   9.943866  0
11  3.0  8.0   9.496939  0
12  3.0  2.0   8.867048  0
13  5.0  6.0  10.969696  1
14  3.0  3.0  10.311644  1
15  8.0  7.0   7.573834  0
16  3.0  5.0   9.382061  0
17  5.0  6.0   9.319424  0
18  6.0  7.0   4.745994  0
19  8.0  1.0   7.898870  0
20  8.0  2.0   9.101864  1
21  4.0  6.0  11.781172  1
22  4.0  5.0  10.186214  0
23  7.0  4.0  12.673963  1
24  2.0  6.0   7.948580  1
25  5.0  4.0   9.792443  0
26  8.0  1.0   9.938071  1

In [4]:
ds.examples

    x1   x2         x1  y
0  4.0  3.0  11.581802  1
1  4.0  8.0  11.598140  0
2  2.0  7.0   9.681518  0
3  4.0  2.0   4.365004  0
4  6.0  7.0   9.646520  1
5  5.0  5.0  11.865055  1
6  2.0  2.0   9.511194  1
7  8.0  6.0  10.970335  1
8  7.0  2.0   7.167357  0
9  4.0  3.0   9.665209  0


In [5]:
ds.features

array(['x1', 'x2', 'x1'], dtype=object)

In [6]:
ds.target 

array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1])

In [7]:
ds.y 

array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1])

In [8]:
ds.inputs

array([[ 4.        ,  3.        , 11.58180249],
       [ 4.        ,  8.        , 11.59813992],
       [ 2.        ,  7.        ,  9.68151804],
       [ 4.        ,  2.        ,  4.36500352],
       [ 6.        ,  7.        ,  9.64652044],
       [ 5.        ,  5.        , 11.86505537],
       [ 2.        ,  2.        ,  9.51119375],
       [ 8.        ,  6.        , 10.9703348 ],
       [ 7.        ,  2.        ,  7.16735693],
       [ 4.        ,  3.        ,  9.66520864],
       [ 5.        ,  4.        ,  9.94386563],
       [ 3.        ,  8.        ,  9.49693945],
       [ 3.        ,  2.        ,  8.86704782],
       [ 5.        ,  6.        , 10.96969584],
       [ 3.        ,  3.        , 10.3116435 ],
       [ 8.        ,  7.        ,  7.57383415],
       [ 3.        ,  5.        ,  9.38206093],
       [ 5.        ,  6.        ,  9.31942396],
       [ 6.        ,  7.        ,  4.74599352],
       [ 8.        ,  1.        ,  7.89887006],
       [ 8.        ,  2.        ,  9.101

In [9]:
ds.X

array([[ 4.        ,  3.        , 11.58180249],
       [ 4.        ,  8.        , 11.59813992],
       [ 2.        ,  7.        ,  9.68151804],
       [ 4.        ,  2.        ,  4.36500352],
       [ 6.        ,  7.        ,  9.64652044],
       [ 5.        ,  5.        , 11.86505537],
       [ 2.        ,  2.        ,  9.51119375],
       [ 8.        ,  6.        , 10.9703348 ],
       [ 7.        ,  2.        ,  7.16735693],
       [ 4.        ,  3.        ,  9.66520864],
       [ 5.        ,  4.        ,  9.94386563],
       [ 3.        ,  8.        ,  9.49693945],
       [ 3.        ,  2.        ,  8.86704782],
       [ 5.        ,  6.        , 10.96969584],
       [ 3.        ,  3.        , 10.3116435 ],
       [ 8.        ,  7.        ,  7.57383415],
       [ 3.        ,  5.        ,  9.38206093],
       [ 5.        ,  6.        ,  9.31942396],
       [ 6.        ,  7.        ,  4.74599352],
       [ 8.        ,  1.        ,  7.89887006],
       [ 8.        ,  2.        ,  9.101

In [10]:
ds.name

'Sample Data'

In [11]:
ds.N

27

In [12]:
ds.M

3
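
The test above supplies $\boldsymbol{y}$ as a separate array, but the constructor's docstring describes three cases for the `y` argument. The sketch below mirrors that branching in a standalone function so the three cases can be compared side by side. `build_examples` is a hypothetical helper, not part of the class, and unlike the class it copies an incoming DataFrame so that the caller's data is left untouched.

```python
import numpy as np
import pandas as pd

def build_examples(data, features=None, y=None):
    """Hypothetical helper mirroring the three cases handled by DataSet.__init__."""
    if isinstance(data, pd.DataFrame):
        df = data.copy()  # assumption of this sketch: do not modify the caller's frame
    else:
        df = pd.DataFrame(data, columns=features)
    if y is True:
        # Target is already the last column: just rename it to 'y'.
        df.columns = [*df.columns[:-1], 'y']
    elif y is not False and y is not None:
        # Target supplied separately: append it as a new last column.
        df['y'] = y
    return df

raw = np.array([[1.0, 2.0, 0.0],
                [3.0, 4.0, 1.0]])

a = build_examples(raw, features=['x1', 'x2', 't'], y=True)         # target inside the data
b = build_examples(raw[:, :2], features=['x1', 'x2'], y=raw[:, 2])  # target as a separate array
c = build_examples(raw[:, :2], features=['x1', 'x2'])               # no target (unsupervised)

print(list(a.columns))  # ['x1', 'x2', 'y']
print(list(b.columns))  # ['x1', 'x2', 'y']
print(list(c.columns))  # ['x1', 'x2']
```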

## Shuffling
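We can supplement this class with a few useful methods. One of them is shuffling: the `shuffled` method below returns a new `DataSet` with the same examples in a random order, and `random_state` can be set to make that order reproducible. Here is the above class with this method added.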

In [13]:
class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An array of length n containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column
        If y is None or False, No target is available
        Else y is an array to be added as the last column of the examples  dataframe
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
            
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
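        # The last column is dropped only when it holds the target 'y'.
        columns = self.__examples.columns
        return columns[:-1].values if 'y' in columns else columns.values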
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
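        # Without a target column, every column is an input (D is the same as X).
        examples = self.__examples
        return examples.iloc[:, :-1].values if 'y' in examples.columns else examples.values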
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        # Keep the dataset name so the shuffled copy is still identifiable.
        return DataSet(self.__examples.iloc[indexes], name=self.__name)
    
    def __repr__(self):
        return repr(self.examples)

In [14]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ds.shuffled()

     x1   x2         x1  y
17  3.0  8.0  12.826599  0
23  3.0  1.0  13.325159  0
11  8.0  2.0   7.277527  0
10  4.0  7.0  10.580254  0
4   5.0  3.0   7.895073  0
14  2.0  6.0   9.955000  0
8   2.0  1.0  13.399584  1
5   5.0  6.0  14.881474  0
6   8.0  6.0   7.865086  0
26  8.0  8.0   8.940566  1
7   2.0  2.0   8.515647  1
25  7.0  8.0  13.275272  0
3   2.0  6.0  11.771253  0
21  3.0  8.0   8.984955  1
0   5.0  2.0  12.169006  1
12  5.0  3.0  13.444070  1
19  4.0  2.0   9.791523  0
2   8.0  2.0  11.815462  0
13  5.0  4.0   7.814657  0
1   7.0  8.0   7.047098  0
9   5.0  7.0  10.897876  1
18  2.0  7.0   8.325312  0
20  6.0  8.0   6.509000  0
16  8.0  4.0   9.361270  1
24  2.0  3.0   9.078248  1
15  5.0  7.0   7.548980  1
22  5.0  6.0  11.680674  1

## Splitting a dataset into training and test datasets
Another useful method splits the dataset into a training set and a test set. Here again is the above class, now with a `train_test_split` method.

In [15]:
class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An array of length n containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column
        If y is None or False, No target is available
        Else y is an array to be added as the last column of the examples  dataframe
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
            
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
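        # The last column is dropped only when it holds the target 'y'.
        columns = self.__examples.columns
        return columns[:-1].values if 'y' in columns else columns.values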
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
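        # Without a target column, every column is an input (D is the same as X).
        examples = self.__examples
        return examples.iloc[:, :-1].values if 'y' in examples.columns else examples.values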
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        # Keep the dataset name so the shuffled copy is still identifiable.
        return DataSet(self.__examples.iloc[indexes], name=self.__name)
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set.
        If test_portion is specified, return that portion of the dataset as test
        and the rest as training.
        Otherwise, return the examples between start and end as test and the
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
    
    def __repr__(self):
        return repr(self.examples)

In [16]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print(ta)
print(te)

     x1   x2         x1  y
0   6.0  4.0   9.051200  0
1   5.0  3.0  11.761972  0
2   4.0  1.0  12.923208  1
3   4.0  8.0   9.807616  1
4   8.0  1.0  10.569088  0
5   7.0  7.0  10.132053  0
6   5.0  5.0  11.037717  0
7   7.0  8.0   9.605973  1
8   2.0  3.0  14.113913  0
9   7.0  4.0  12.717248  1
10  7.0  3.0   8.698224  1
11  8.0  3.0   8.939570  1
12  3.0  8.0  13.398067  0
13  4.0  1.0  10.117232  0
14  3.0  4.0   9.660368  0
15  7.0  2.0  10.461589  0
16  5.0  4.0  15.835143  0
17  8.0  3.0  10.367747  1
18  2.0  4.0  11.421002  1
19  8.0  8.0  12.131536  0
20  3.0  1.0   9.833527  0
     x1   x2         x1  y
21  6.0  6.0   9.069129  1
22  7.0  7.0   5.544432  1
23  7.0  1.0   9.520105  0
24  7.0  6.0  11.615281  1
25  7.0  3.0  10.689759  1
26  5.0  6.0   7.979117  0


## CHALLENGE
Provide an implementation for the `train_validation_test_split` method in `DataSet` below. This is the same class as above with a placeholder for this method. This method should split the data into three sets: training, validation, and test. You may use the `train_test_split` method. Make sure to include a comment describing how your implementation of the method works. Test your method on the `ds` dataset above and show that it works.

In [17]:
import numpy as np
import pandas as pd

class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An array of length n containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column
        If y is None or False, No target is available
        Else y is an array to be added as the last column of the examples  dataframe
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
            
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
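        # The last column is dropped only when it holds the target 'y'.
        columns = self.__examples.columns
        return columns[:-1].values if 'y' in columns else columns.values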
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
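        # Without a target column, every column is an input (D is the same as X).
        examples = self.__examples
        return examples.iloc[:, :-1].values if 'y' in examples.columns else examples.values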
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        # Keep the dataset name so the shuffled copy is still identifiable.
        return DataSet(self.__examples.iloc[indexes], name=self.__name)
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set.
        If test_portion is specified, return that portion of the dataset as test
        and the rest as training.
        Otherwise, return the examples between start and end as test and the
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
    
    def train_validation_test_split(self, validation_portion=.25, test_portion=.25, shuffle=False, random_state=None):
        """
        Splits the dataset into a validation set, a test set, and a training set,
        returned in that order.
        After optional shuffling, the first validation_portion of the examples is the
        validation set, the next test_portion is the test set, and the rest is the
        training set. A portion of None is treated as 0.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            test_portion = 0.0
        elif not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
            raise TypeError("Only fractions between ]0,1[ are allowed")

        if validation_portion is None:
            validation_portion = 0.0
        elif not isinstance(validation_portion, float) or validation_portion < 0 or validation_portion > 1:
            raise TypeError("Only fractions between ]0,1[ are allowed")

        if validation_portion + test_portion > 1:
            raise ValueError("validation_portion + test_portion must not exceed 1")

        # How it works: put the examples in the (possibly shuffled) order, then cut
        # that order at two boundaries. Rows [0, val_end) are the validation set,
        # rows [val_end, test_end) are the test set, and rows [test_end, N) are the
        # training set. The order is not re-sorted, so a shuffle is preserved.
        ordered = self.examples.iloc[indexes]
        val_end = int(self.N * validation_portion)
        test_end = int(self.N * validation_portion + self.N * test_portion)

        validate = DataSet(ordered.iloc[:val_end])
        test = DataSet(ordered.iloc[val_end:test_end])
        train = DataSet(ordered.iloc[test_end:])
        return validate, test, train
    
    def __repr__(self):
        return repr(self.examples)
    
    
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")
    
validate, test, train = ds.train_validation_test_split(test_portion=.25, validation_portion=.25, shuffle=False, random_state=17)
print('validate')
print(validate)
print('test')
print(test)
print('train')
print(train)

validate
    x1   x2         x1  y
0  7.0  6.0  10.641831  1
1  4.0  5.0  10.256585  0
2  3.0  5.0   9.505945  1
3  5.0  7.0   8.589945  1
4  2.0  3.0   8.376716  0
5  7.0  5.0  12.267459  1
test
     x1   x2         x1  y
6   6.0  1.0   7.405223  0
7   7.0  6.0  12.008082  0
8   2.0  3.0  11.986985  0
9   2.0  6.0  10.930613  1
10  2.0  8.0   4.801871  0
11  3.0  1.0   9.650191  1
12  4.0  3.0   9.121706  1
train
     x1   x2         x1  y
13  7.0  5.0   7.119966  1
14  8.0  5.0  11.936961  0
15  2.0  4.0  10.632773  1
16  6.0  4.0  11.388965  1
17  4.0  8.0   8.718038  0
18  6.0  3.0  13.314125  1
19  6.0  3.0  10.311986  1
20  5.0  7.0  10.927974  1
21  4.0  4.0   9.055488  1
22  3.0  8.0  10.823194  0
23  3.0  8.0  11.201155  1
24  5.0  7.0  10.611636  1
25  2.0  3.0   9.665562  0
26  4.0  4.0   7.292008  0
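
The challenge also allows building the three-way split out of `train_test_split`. One way this could look is sketched below on a plain DataFrame: first carve off the test set, then carve the validation set off what is left. `two_way_split` and `three_way_split` are hypothetical helpers written for this sketch, and the rescaling of the validation fraction is a design choice of the sketch, not something the class above does.

```python
import numpy as np
import pandas as pd

def two_way_split(frame, test_portion):
    """Hypothetical stand-in for train_test_split: the last rows become the test part."""
    cut = len(frame) - int(len(frame) * test_portion)
    return frame.iloc[:cut], frame.iloc[cut:]

def three_way_split(frame, validation_portion=0.25, test_portion=0.25):
    # Step 1: carve the test set off the end of the full dataset.
    rest, test = two_way_split(frame, test_portion)
    # Step 2: carve the validation set off what is left. The fraction is
    # rescaled so that it still refers to the size of the full dataset.
    rescaled = validation_portion / (1 - test_portion)
    train, validation = two_way_split(rest, rescaled)
    return train, validation, test

# Made-up data with the same number of examples as ds.
df = pd.DataFrame({'x1': np.arange(27), 'y': np.arange(27) % 2})
train, validation, test = three_way_split(df)

# Every example lands in exactly one of the three sets.
all_rows = train.index.tolist() + validation.index.tolist() + test.index.tolist()
assert sorted(all_rows) == list(range(27))
print(len(train), len(validation), len(test))
```

Without the rescaling in step 2, a `validation_portion` of .25 would mean a quarter of the remaining rows rather than a quarter of the whole dataset. Note that this sketch returns the sets as train, validation, test, which differs from the return order of the method above.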
