# 02.A: Working with Datasets
It's important to have a clear and sensible way of representing the datasets that learning algorithms train on.

A dataset consists of $n$ examples. Each example consists of $m$ features. The number of features per example, $m$, is also the number of dimensions of the dataset. In supervised learning, the dataset is a matrix like this:


$\boldsymbol{D} =\left[\begin{array}{cccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)} & y^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)} & y^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)} & y^{(3)}\\
  \vdots    & \vdots    & \vdots    & \cdots & \vdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)} & y^{(n)}
\end{array}\right]$

Each row of this matrix consists of the $m$ features plus the target label as the last element in the row. In other words, $\boldsymbol{D}$ consists of both the input matrix $\boldsymbol{X}$ and the target vector $\boldsymbol{y}$, where:

$\boldsymbol{X} =\left[\begin{array}{ccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)}\\
  \vdots    & \vdots    & \vdots    & \cdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)}
\end{array}\right]$

and

$\boldsymbol{y} =\left[\begin{array}{c} 
  y^{(1)}\\ 
  y^{(2)}\\
  y^{(3)}\\
  \vdots \\
  y^{(n)}
\end{array}\right]$
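
As a minimal sketch of this decomposition, the NumPy snippet below slices a small made-up matrix `D` into `X` and `y`. The numbers are hypothetical and only illustrate the layout; they do not come from any dataset used in this notebook.

```python
import numpy as np

# Hypothetical 4 x 3 dataset D: two feature columns followed by the target column
D = np.array([
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 2.9, 1],
    [5.9, 3.0, 1],
])

X = D[:, :-1]   # every column except the last: the n x m input matrix
y = D[:, -1]    # the last column: the target vector of length n

n, m = X.shape
print(n, m)     # 4 2
print(y)        # [0. 0. 1. 1.]
```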

For unsupervised learning, $\boldsymbol{D}$ is the same as $\boldsymbol{X}$.

Here is a class named `DataSet` to represent datasets. It stores the examples in a pandas `DataFrame`, so each feature also has a name (its column label).

In [1]:
import numpy as np
import pandas as pd

In [2]:
class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An array of length n containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column.
        If y is None or False, no target is available.
        Otherwise, y is an array to be added as the last column of the examples DataFrame.
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
            
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
        # Drop the last column name only when it is the target 'y'
        cols = self.__examples.columns
        return (cols[:-1] if 'y' in cols else cols).values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
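        # Without a target (unsupervised case), every column is an input
        if 'y' in self.__examples.columns:
            return self.__examples.iloc[:, :-1].values
        return self.__examples.values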
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def __repr__(self):
        return repr(self.examples)
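
The `y` argument of `__init__` can be easy to misread, so here is a rough pandas-only sketch of what its two supervised modes do to the underlying DataFrame. The array `raw` and the column names `a`, `b`, and `label` are made up for illustration.

```python
import numpy as np
import pandas as pd

raw = np.array([[1.0, 2.0, 0.0],
                [3.0, 4.0, 1.0]])

# Case 1 (y is True): the target is already the last column, so it is only renamed to 'y'
df1 = pd.DataFrame(raw, columns=['a', 'b', 'label'])
df1.columns = [*df1.columns[:-1], 'y']

# Case 2 (y is an array): the array is appended as a new last column named 'y'
df2 = pd.DataFrame(raw[:, :2], columns=['a', 'b'])
df2['y'] = np.array([0, 1])

print(list(df1.columns))  # ['a', 'b', 'y']
print(list(df2.columns))  # ['a', 'b', 'y']
```

In both cases the target ends up as the last column under the name `y`, which is what the `target` property looks for.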

Let's test this class by creating a $27 \times 3$ input matrix and a separate $y$ column.

In [3]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")

ds

     x1   x2         x1  y
0   4.0  8.0  12.352446  1
1   4.0  1.0  12.309170  0
2   4.0  1.0  10.614694  0
3   5.0  3.0   8.855804  0
4   5.0  4.0   5.691326  0
5   2.0  8.0  11.025048  0
6   8.0  5.0  11.749372  0
7   4.0  1.0  11.722638  0
8   6.0  8.0   9.098254  0
9   2.0  5.0   8.285164  0
10  5.0  6.0   8.658627  0
11  5.0  5.0  11.083964  1
12  8.0  6.0   7.671870  1
13  6.0  1.0  11.711185  0
14  6.0  4.0  10.667305  0
15  3.0  8.0  12.487910  1
16  5.0  6.0  10.253060  0
17  4.0  2.0  11.269276  0
18  2.0  1.0  10.500761  1
19  2.0  6.0  12.025454  0
20  4.0  4.0  13.106159  0
21  6.0  8.0   7.401697  0
22  5.0  3.0   8.343828  1
23  7.0  7.0   9.916274  1
24  4.0  8.0  10.753101  0
25  2.0  3.0   9.926761  0
26  6.0  6.0  10.223572  1

In [4]:
ds.examples

    x1   x2         x1  y
0  4.0  8.0  12.352446  1
1  4.0  1.0  12.309170  0
2  4.0  1.0  10.614694  0
3  5.0  3.0   8.855804  0
4  5.0  4.0   5.691326  0
5  2.0  8.0  11.025048  0
6  8.0  5.0  11.749372  0
7  4.0  1.0  11.722638  0
8  6.0  8.0   9.098254  0
9  2.0  5.0   8.285164  0


In [5]:
ds.features

array(['x1', 'x2', 'x1'], dtype=object)

In [6]:
ds.target 

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 1])

In [7]:
ds.y 

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 1])

In [8]:
ds.inputs

array([[ 4.        ,  8.        , 12.35244552],
       [ 4.        ,  1.        , 12.30917024],
       [ 4.        ,  1.        , 10.6146942 ],
       [ 5.        ,  3.        ,  8.85580388],
       [ 5.        ,  4.        ,  5.69132573],
       [ 2.        ,  8.        , 11.02504842],
       [ 8.        ,  5.        , 11.74937215],
       [ 4.        ,  1.        , 11.72263821],
       [ 6.        ,  8.        ,  9.09825425],
       [ 2.        ,  5.        ,  8.28516394],
       [ 5.        ,  6.        ,  8.65862722],
       [ 5.        ,  5.        , 11.08396353],
       [ 8.        ,  6.        ,  7.6718704 ],
       [ 6.        ,  1.        , 11.71118519],
       [ 6.        ,  4.        , 10.66730467],
       [ 3.        ,  8.        , 12.48791031],
       [ 5.        ,  6.        , 10.25305965],
       [ 4.        ,  2.        , 11.26927608],
       [ 2.        ,  1.        , 10.50076096],
       [ 2.        ,  6.        , 12.02545375],
       [ 4.        ,  4.        , 13.106

In [9]:
ds.X

array([[ 4.        ,  8.        , 12.35244552],
       [ 4.        ,  1.        , 12.30917024],
       [ 4.        ,  1.        , 10.6146942 ],
       [ 5.        ,  3.        ,  8.85580388],
       [ 5.        ,  4.        ,  5.69132573],
       [ 2.        ,  8.        , 11.02504842],
       [ 8.        ,  5.        , 11.74937215],
       [ 4.        ,  1.        , 11.72263821],
       [ 6.        ,  8.        ,  9.09825425],
       [ 2.        ,  5.        ,  8.28516394],
       [ 5.        ,  6.        ,  8.65862722],
       [ 5.        ,  5.        , 11.08396353],
       [ 8.        ,  6.        ,  7.6718704 ],
       [ 6.        ,  1.        , 11.71118519],
       [ 6.        ,  4.        , 10.66730467],
       [ 3.        ,  8.        , 12.48791031],
       [ 5.        ,  6.        , 10.25305965],
       [ 4.        ,  2.        , 11.26927608],
       [ 2.        ,  1.        , 10.50076096],
       [ 2.        ,  6.        , 12.02545375],
       [ 4.        ,  4.        , 13.106

In [10]:
ds.name

'Sample Data'

In [11]:
ds.N

27

In [12]:
ds.M

3

## Shuffling
We can also supplement this class with a few useful methods. One of them shuffles the examples. Here is the above class with a `shuffled` method added.

In [13]:
class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An array of length n containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column.
        If y is None or False, no target is available.
        Otherwise, y is an array to be added as the last column of the examples DataFrame.
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
            
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
        # Drop the last column name only when it is the target 'y'
        cols = self.__examples.columns
        return (cols[:-1] if 'y' in cols else cols).values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
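        # Without a target (unsupervised case), every column is an input
        if 'y' in self.__examples.columns:
            return self.__examples.iloc[:, :-1].values
        return self.__examples.values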
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        # Keep the dataset name on the shuffled copy
        return DataSet(self.__examples.iloc[indexes], name=self.__name)
    
    def __repr__(self):
        return repr(self.examples)

In [14]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ds.shuffled()

     x1   x2         x1  y
14  5.0  2.0   6.671162  1
17  8.0  3.0   9.416767  0
4   2.0  1.0   9.264957  0
11  3.0  7.0   9.935940  1
25  5.0  4.0  10.954481  0
24  4.0  7.0  11.097825  0
5   5.0  6.0  10.540607  0
16  7.0  1.0   8.253564  1
9   5.0  1.0   9.432017  1
10  3.0  5.0  10.136335  1
2   6.0  6.0  13.396345  1
20  6.0  5.0  12.908151  0
0   7.0  7.0  11.190077  0
21  7.0  8.0   7.461856  1
23  4.0  6.0   8.489234  0
19  3.0  4.0   9.871504  1
8   6.0  8.0   9.801235  1
13  6.0  7.0  13.144102  1
15  7.0  7.0  10.124411  0
22  3.0  8.0   9.593044  1
26  5.0  7.0  11.263505  0
3   5.0  6.0   6.861755  0
18  2.0  3.0  13.715266  1
12  8.0  8.0  11.909981  1
1   4.0  8.0  12.186159  1
7   3.0  4.0  11.902669  0
6   3.0  4.0   9.534241  1
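
The sketch below isolates the shuffling step on a plain DataFrame to show two things: the rows are permuted by position, so the original index labels travel with them (as in the output above), and passing the same `random_state` reproduces the same order. The helper name `shuffled_rows` and the toy data are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [10, 20, 30, 40, 50], 'y': [0, 1, 0, 1, 1]})

def shuffled_rows(frame, random_state=None):
    """Return the rows of frame in a random order (hypothetical helper)."""
    rgen = np.random.RandomState(random_state)
    positions = np.arange(len(frame))
    rgen.shuffle(positions)          # in-place permutation of the row positions
    return frame.iloc[positions]     # rows keep their original index labels

a = shuffled_rows(df, random_state=17)
b = shuffled_rows(df, random_state=17)

print(a.equals(b))                        # True: same seed, same order
print(sorted(a.index) == list(df.index))  # True: no row lost or duplicated
```

If consecutive index labels are preferred after shuffling, `reset_index(drop=True)` can be applied to the result.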

## Splitting a dataset into training and test datasets
Another useful method splits the dataset into a training set and a test set. Here again is the above class, now with a `train_test_split` method.

In [15]:
class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An array of length n containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column.
        If y is None or False, no target is available.
        Otherwise, y is an array to be added as the last column of the examples DataFrame.
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
            
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
        # Drop the last column name only when it is the target 'y'
        cols = self.__examples.columns
        return (cols[:-1] if 'y' in cols else cols).values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
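        # Without a target (unsupervised case), every column is an input
        if 'y' in self.__examples.columns:
            return self.__examples.iloc[:, :-1].values
        return self.__examples.values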
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        # Keep the dataset name on the shuffled copy
        return DataSet(self.__examples.iloc[indexes], name=self.__name)
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set.
        If test_portion is specified, return that portion of the dataset as test
        and the rest as training.
        Otherwise, return the examples between start and end as test and the
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            if end is None:
                end = self.N
        else:
            if not isinstance(test_portion, float):
                raise TypeError("test_portion must be a float")
            if test_portion <= 0 or test_portion >= 1:
                raise ValueError("Only fractions in the open interval ]0,1[ are allowed")

            # The last int(N * test_portion) examples form the test set
            start = self.N - int(self.N * test_portion)
            end = self.N

        # Positions start..end-1 go to test; everything before and after goes to train
        test = DataSet(self.examples.iloc[indexes[start:end]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[:start]],
                                   self.examples.iloc[indexes[end:]]], axis=0))
        return train, test
    
    def __repr__(self):
        return repr(self.examples)

In [16]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print(ta)
print(te)

     x1   x2         x1  y
0   3.0  4.0   9.762379  1
1   3.0  8.0   8.862119  1
2   8.0  7.0  11.760551  1
3   3.0  5.0   9.024141  0
4   4.0  1.0   9.671713  0
5   4.0  5.0   8.800041  0
6   5.0  5.0  10.776904  0
7   7.0  3.0   7.894650  0
8   6.0  5.0   8.767446  1
9   7.0  8.0   7.909626  1
10  5.0  5.0   6.118385  0
11  7.0  7.0   9.596694  1
12  2.0  8.0  11.063304  1
13  6.0  2.0   8.759668  1
14  5.0  3.0  10.051894  0
15  2.0  2.0  10.624847  0
16  3.0  7.0  11.633629  1
17  5.0  1.0   8.177801  0
18  7.0  2.0   8.972573  0
19  2.0  3.0  11.197250  0
20  7.0  4.0  11.771616  0
     x1   x2         x1  y
21  3.0  7.0   9.165290  1
22  7.0  5.0  10.981933  0
23  5.0  5.0   9.061923  0
24  4.0  1.0  14.327290  1
25  2.0  7.0   8.627002  1
26  6.0  7.0  11.400157  1
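
To make the two modes of the split easier to follow, here is a small sketch that works on row positions only, without the `DataSet` class. The function name `split_positions` is hypothetical; it mirrors the logic described in the docstring rather than replacing the method.

```python
import numpy as np

def split_positions(n, start=0, end=None, test_portion=None):
    """Return (train_positions, test_positions) for n examples (hypothetical helper)."""
    if test_portion is not None:
        # The last int(n * test_portion) positions become the test set
        start, end = n - int(n * test_portion), n
    elif end is None:
        end = n
    positions = np.arange(n)
    test = positions[start:end]
    train = np.concatenate([positions[:start], positions[end:]])
    return train, test

# Portion mode: int(27 * 0.25) = 6 test examples, so 21 remain for training
train, test = split_positions(27, test_portion=0.25)
print(len(train), len(test))   # 21 6

# Range mode: positions 3..5 are held out, the rest are used for training
train2, test2 = split_positions(10, start=3, end=6)
print(test2.tolist())          # [3, 4, 5]
print(train2.tolist())         # [0, 1, 2, 6, 7, 8, 9]
```

Because `int` truncates, the test set can be slightly smaller than the requested fraction: here 6 of 27 examples is about 22%, not 25%.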


## CHALLENGE
Provide an implementation for the `train_validation_test_split` method in `DataSet` below. This is the same class as above with a placeholder for this method. The method should split the data into three sets: training, validation, and test. You may use the `train_test_split` method. Make sure to include a comment describing how your implementation works. Test your method on the `ds` dataset above and show that it works.

## Challenge Accepted

I have completed the method as described and shown its usage. I chose to raise an error if the validation and test portions together equal or exceed 1. They cannot exceed 1 because that would request more than 100% of the data we have, and I disallowed a sum of exactly 1 because that would leave no data at all for the training set.

In [69]:
import numpy as np
import pandas as pd

class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An array of length n containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column.
        If y is None or False, no target is available.
        Otherwise, y is an array to be added as the last column of the examples DataFrame.
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
            
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
        # Drop the last column name only when it is the target 'y'
        cols = self.__examples.columns
        return (cols[:-1] if 'y' in cols else cols).values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
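        # Without a target (unsupervised case), every column is an input
        if 'y' in self.__examples.columns:
            return self.__examples.iloc[:, :-1].values
        return self.__examples.values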
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        # Keep the dataset name on the shuffled copy
        return DataSet(self.__examples.iloc[indexes], name=self.__name)
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set.
        If test_portion is specified, return that portion of the dataset as test
        and the rest as training.
        Otherwise, return the examples between start and end as test and the
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            if end is None:
                end = self.N
        else:
            if not isinstance(test_portion, float):
                raise TypeError("test_portion must be a float")
            if test_portion <= 0 or test_portion >= 1:
                raise ValueError("Only fractions in the open interval ]0,1[ are allowed")

            # The last int(N * test_portion) examples form the test set
            start = self.N - int(self.N * test_portion)
            end = self.N

        # Positions start..end-1 go to test; everything before and after goes to train
        test = DataSet(self.examples.iloc[indexes[start:end]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[:start]],
                                   self.examples.iloc[indexes[end:]]], axis=0))
        return train, test
    
    def train_validation_test_split(self, validation_portion=.25, test_portion=.25, shuffle=False, random_state=None):
        '''
        Splits the dataset into training, validation, and testing sets.
        The default ratios are 50% training, 25% validation, and 25% testing.
        '''
        indexes = np.arange(self.N)
        
        if shuffle:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)
        
        # Check the type first so that non-numeric values fail with a clear message
        if not all(isinstance(p, float) and 0 < p < 1 for p in [validation_portion, test_portion]):
            raise TypeError("Only fractions between ]0,1[ are allowed")
        
        if validation_portion + test_portion >= 1:
            # Together they cannot exceed 1, and they cannot equal 1 either,
            # or there would be no training data left
            raise ValueError("validation_portion and test_portion must sum to less than 1.")
        
        # Row positions where the validation set and the test set begin:
        # train is [0, valStart), validation is [valStart, testStart), test is [testStart, N)
        valStart, testStart = int(self.N * (1 - (validation_portion + test_portion))), int(self.N * (1 - test_portion))
        
        train = DataSet(self.examples.iloc[indexes[:valStart]])
        validation = DataSet(self.examples.iloc[indexes[valStart:testStart]])
        test = DataSet(self.examples.iloc[indexes[testStart:]])
        
        return train, validation, test
    
    def __repr__(self):
        return repr(self.examples)


In [70]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")

# 60% train, 20% validation, 20% test
tr, v, ts = ds.train_validation_test_split(.2, .2)
print(tr)
print(v)
print(ts)

     x1   x2         x1  y
0   7.0  3.0  12.247597  1
1   8.0  8.0   7.918500  1
2   8.0  1.0  11.492382  0
3   7.0  8.0  11.507328  0
4   3.0  1.0   7.517986  0
5   8.0  2.0  13.189524  1
6   7.0  6.0   8.943110  1
7   7.0  1.0  12.330330  0
8   3.0  4.0   8.199169  1
9   7.0  7.0   7.636846  0
10  2.0  8.0   8.847000  0
11  7.0  8.0  11.754850  1
12  4.0  3.0  10.294272  0
13  2.0  5.0  10.966667  0
14  8.0  5.0   9.289279  0
15  6.0  4.0  11.089967  1
     x1   x2         x1  y
16  6.0  2.0   8.058878  0
17  7.0  2.0   8.679714  0
18  5.0  3.0  11.093606  1
19  8.0  5.0   9.345824  1
20  4.0  2.0  10.068094  0
     x1   x2         x1  y
21  5.0  7.0  10.024956  1
22  8.0  5.0   7.079959  0
23  8.0  5.0   8.531127  1
24  2.0  7.0  10.843367  1
25  6.0  5.0   8.557874  0
26  7.0  3.0   9.720791  0
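
The sizes printed above (16, 5, and 6 rows) follow directly from the two boundary positions computed in the method. The sketch below reproduces only that arithmetic; the function name `three_way_sizes` is made up for illustration.

```python
def three_way_sizes(n, validation_portion, test_portion):
    """Return (train, validation, test) sizes using the same truncation as the method."""
    val_start = int(n * (1 - (validation_portion + test_portion)))
    test_start = int(n * (1 - test_portion))
    return val_start, test_start - val_start, n - test_start

# int(27 * 0.6) = 16 and int(27 * 0.8) = 21, so the sets hold rows 0-15, 16-20, and 21-26
print(three_way_sizes(27, 0.2, 0.2))    # (16, 5, 6)

# Default portions: int(27 * 0.5) = 13 and int(27 * 0.75) = 20
print(three_way_sizes(27, 0.25, 0.25))  # (13, 7, 7)
```

Since both boundaries are truncated with `int`, the requested fractions are only approximated when `n` is not a multiple of the portions.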
