
# Week 2 - Appendix: Further on data organisation

## Stepping stone

For example, _"Can I use `numpy.arange` and `numpy.linspace` functions to initialise multi-dimensional arrays?"_

After a Web search that showcased a number of obscure solutions, I ended up back at the official documentation:

In [7]:
import numpy as np

# You can use np.mgrid for this (arange-like: start, exclusive stop, step)
X, Y = np.mgrid[-5:5.1:0.5, -5:5.1:0.5]
print(X)
print()

# For linspace-like functionality, replace the step above (i.e. 0.5) with a
# complex number whose magnitude specifies the number of points you want in
# the series; the stop value is then inclusive.
X, Y = np.mgrid[-5:5:21j, -5:5:21j]
print(X)
print()

# You can then create your (x, y) pairs as:
xy = np.vstack((X.flatten(), Y.flatten())).T
print(xy)


[[-5.  -5.  -5.  -5.  -5.  -5.  -5.  -5.  -5.  -5.  -5.  -5.  -5.  -5.
  -5.  -5.  -5.  -5.  -5.  -5.  -5. ]
 [-4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5
  -4.5 -4.5 -4.5 -4.5 -4.5 -4.5 -4.5]
 [-4.  -4.  -4.  -4.  -4.  -4.  -4.  -4.  -4.  -4.  -4.  -4.  -4.  -4.
  -4.  -4.  -4.  -4.  -4.  -4.  -4. ]
 [-3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5
  -3.5 -3.5 -3.5 -3.5 -3.5 -3.5 -3.5]
 [-3.  -3.  -3.  -3.  -3.  -3.  -3.  -3.  -3.  -3.  -3.  -3.  -3.  -3.
  -3.  -3.  -3.  -3.  -3.  -3.  -3. ]
 [-2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5
  -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5]
 [-2.  -2.  -2.  -2.  -2.  -2.  -2.  -2.  -2.  -2.  -2.  -2.  -2.  -2.
  -2.  -2.  -2.  -2.  -2.  -2.  -2. ]
 [-1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5
  -1.5 -1.5 -1.5 -1.5 -1.5 -1.5 -1.5]
 [-1.  -1.  -1.  -1.  -1.  -1.  -1.  -1.  -1.  -1.  -1.  -1.  -1.  -1.
  -1.  -1.  -1.  -1.  -1.  -1.  -1. ]
 [-0.5 -0.5 -0.
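
To come back to the original question: a minimal sketch, assuming we want exactly the grids produced by `np.mgrid` above, is to build the one-dimensional axes with `numpy.linspace` (or `numpy.arange`) and expand them with `numpy.meshgrid`. The names `X_ref`, `Xa` and friends are only illustrative.

```python
import numpy as np

# One-dimensional axes built with linspace (stop value included).
x = np.linspace(-5, 5, 21)
y = np.linspace(-5, 5, 21)

# indexing="ij" makes X vary along rows and Y along columns, like np.mgrid.
X, Y = np.meshgrid(x, y, indexing="ij")

# Reference grids from np.mgrid, for comparison.
X_ref, Y_ref = np.mgrid[-5:5:21j, -5:5:21j]
print(np.allclose(X, X_ref), np.allclose(Y, Y_ref))

# The arange-based variant (stop value excluded, hence 5.1).
a = np.arange(-5, 5.1, 0.5)
Xa, Ya = np.meshgrid(a, a, indexing="ij")
print(Xa.shape)

# All (x, y) pairs, one per row.
xy = np.column_stack((X.ravel(), Y.ravel()))
print(xy.shape)
```

`np.mgrid` is more compact, but the `meshgrid` form keeps the one-dimensional axes around, which is often handy for plotting.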

## On data representations

When dealing with data, we require:

* A high-level language to describe data naturally - a "__data scientist view__"
* A low-level machine representation that delivers high performance.

For example:

$\mathsf{A} = \left(
\begin{array}{ccc}
1 & 2 & 3 \\
4 & 5 & 6 \\
\end{array}
\right)$ 
and 
$\mathsf{B} = \left(
\begin{array}{cc}
2 & 3 \\
5 & 6 \\
\end{array}
\right)$

In [9]:
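A = np.array([[1, 2, 3], [4, 5, 6]])
B = A[:, 1:]        # slicing returns a view of A, not a copy
print(B.base is A)  # print explicitly: a bare expression mid-cell shows nothing
print(B)

True
[[2 3]
 [5 6]]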


In [10]:
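C = A.T             # the transpose is also a view of A
print(C.base is A)  # print explicitly: a bare expression mid-cell shows nothing
print(C)

True
[[1 4]
 [2 5]
 [3 6]]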
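
The two cells above can be pushed one step further. The sketch below, which fixes the element type to `np.int64` as an assumption so that the byte counts are predictable, shows that `A`, `B` and `C` share a single memory buffer and differ only in how they step through it (their strides).

```python
import numpy as np

# Fix the dtype so that every element occupies exactly 8 bytes.
A = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)
B = A[:, 1:]  # view: same rows, skipping the first column
C = A.T       # view: rows and columns swapped

# Strides: bytes to jump to reach the next row / the next column.
print(A.strides)  # (24, 8)
print(B.strides)  # (24, 8)
print(C.strides)  # (8, 24)

# All three arrays are backed by the same buffer.
print(np.shares_memory(A, B), np.shares_memory(A, C))

# Writing through one view is therefore visible through the others.
B[0, 0] = 20
print(A[0, 1])  # 20
print(C[1, 0])  # 20
```

This is the low-level side of the requirement: the "data scientist view" changes, but no data is copied. Call `.copy()` when an independent array is needed.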


## On data organisation

With `pandas`, we have the opportunity to build a relational view of our data:

* All rows are distinct
* The labelling of columns matters
* One or more columns _uniquely identify_ each row

But sometimes, we may need to transform or even split our dataset for better data analysis.

Let's see this with an example. The _Netflix Movies and TV Shows_ dataset contains approx. 6,000 movies and TV shows available on the Netflix platform:

In [16]:
import pandas as pd

shows = pd.read_csv("shows.csv")

# Let's look at the first four rows
shows.head(4)

| Unnamed: 0 | id | type | title | director | cast | country | added | released | rating | duration | categories | description |
|:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |:-- |
| 0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
| 1 | 80117401 | Movie | Jandino: Whatever it Takes | | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
| 2 | 70234439 | TV Show | Transformers Prime | | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
| 3 | 80058654 | TV Show | Transformers: Robots in Disguise | | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |


Obviously, the person who created this dataset considered the column __id__ as the column that uniquely identifies each row.
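
Rather than taking this for granted, we can check it. The sketch below uses a small hand-built stand-in for `shows.csv` (an assumption, so that the example runs on its own); on the real data frame the same calls apply unchanged.

```python
import pandas as pd

# Hypothetical stand-in for shows.csv: three columns of the four rows shown above.
shows = pd.DataFrame({
    "id": [81145628, 80117401, 70234439, 80058654],
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "title": [
        "Norm of the North: King Sized Adventure",
        "Jandino: Whatever it Takes",
        "Transformers Prime",
        "Transformers: Robots in Disguise",
    ],
})

# A key column must have no missing values and no repeated values.
id_is_key = bool(shows["id"].notna().all() and shows["id"].is_unique)
print(id_is_key)

# Number of rows whose id already appeared earlier in the frame.
n_duplicates = int(shows["id"].duplicated().sum())
print(n_duplicates)

# Using id as the index; verify_integrity=True raises if ids repeat.
indexed = shows.set_index("id", verify_integrity=True)
print(indexed.loc[81145628, "title"])
```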

Now, consider __a query__ of the form: "_How many movies on Netflix have been (co)directed by Richard Finn?_"
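
On the data frame as it stands, one way to answer this (a sketch, again on a hand-built stand-in for `shows.csv`) is a substring search in the __director__ column:

```python
import pandas as pd

# Hypothetical stand-in for shows.csv: only the columns the query needs.
shows = pd.DataFrame({
    "id": [81145628, 80117401, 70234439, 80058654],
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "director": ["Richard Finn, Tim Maltby", None, None, None],
})

# Substring search; na=False treats a missing director as "no match".
by_finn = shows["director"].str.contains("Richard Finn", na=False, regex=False)
is_movie = shows["type"] == "Movie"

count = int((by_finn & is_movie).sum())
print(count)  # 1 on this stand-in
```

This works, but it is fragile: a substring search would also match any longer name that merely contains "Richard Finn", and the query has to know how names are separated inside the string.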

__Data normalisation__ is a popular concept and method in databases whose purpose is to:

* Eliminate redundant (or useless) data
* Ensure data is logically organised

For example, at a basic level, we can say that a table is __normalised__ when:

1. The order of rows does not matter.
2. Columns have unique names.
3. Values in a column should be from the same domain (e.g., strings or integers but not both).
4. __Every value must be atomic__.

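In our dataset, the __director__, __cast__, __country__ and __categories__ columns hold comma-separated lists of values, which violates rule 4. Giving each value its own row results in a data frame like so: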

| id       | type  | title               | director     | ... | Category                 | ... |
|:-------- |:----- |:------------------- |:------------ | --- |:------------------------ | --- |
| 81145628 | Movie | "Norm of the North" | Richard Finn | ... | Children & Family Movies | ... |
| 81145628 | Movie | "Norm of the North" | Richard Finn | ... | Comedies                 | ... |
| 81145628 | Movie | "Norm of the North" | Tim Maltby   | ... | Children & Family Movies | ... |
| 81145628 | Movie | "Norm of the North" | Tim Maltby   | ... | Comedies                 | ... |
| ...      |
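
A view like the one above could be produced with `str.split` followed by `DataFrame.explode`. This is a minimal sketch on a two-row stand-in for `shows.csv` (an assumption); it also assumes that values are always separated by a comma and a space, and it keeps the original column name `categories`.

```python
import pandas as pd

# Hypothetical stand-in for shows.csv: the first two rows, selected columns.
shows = pd.DataFrame({
    "id": [81145628, 80117401],
    "type": ["Movie", "Movie"],
    "title": ["Norm of the North: King Sized Adventure",
              "Jandino: Whatever it Takes"],
    "director": ["Richard Finn, Tim Maltby", None],
    "categories": ["Children & Family Movies, Comedies", "Stand-Up Comedy"],
})

# Turn each comma-separated string into a list of atomic values.
flat = shows.assign(
    director=shows["director"].str.split(", "),
    categories=shows["categories"].str.split(", "),
)

# One row per list element; a missing director stays as a single row.
flat = flat.explode("director").reset_index(drop=True)
flat = flat.explode("categories").reset_index(drop=True)
print(flat)

# The earlier query becomes an exact comparison; count each movie once.
matches = (flat["director"] == "Richard Finn") & (flat["type"] == "Movie")
n_finn = int(flat.loc[matches, "id"].nunique())
print(n_finn)
```

Note how every show now appears once per (director, category) combination, so the rest of its columns are repeated on each of those rows.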

__Question A.__ What are the columns that uniquely identify each row in this view?

__Question B.__ Is there a column that depends on only __part of the key__? 

Our next step would be to try to eliminate any __partial dependency__ in our dataset by splitting the data frame into two or more frames.

__1)__ A data frame for Netflix TV shows and movies, as before:

| id       | type  | title               | ... |
|:-------- |:----- |:------------------- | --- |
| 81145628 | Movie | "Norm of the North" | ... |
| ...      |

__2)__ A data frame that contains information about who directed a movie or TV show:

| id       | director     |
|:-------- |:------------ |
| 81145628 | Richard Finn |
| 81145628 | Tim Maltby   |
| ...      |

__3)__ A data frame that contains information about the category of a movie or TV show:

| id       | Category                 |
|:-------- |:------------------------ |
| 81145628 | Children & Family Movies |
| 81145628 | Comedies                 |
| ...      |
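
The split into three frames can be sketched as follows. The helper `split_column` and the two-row stand-in for `shows.csv` are assumptions made for illustration; the sketch also assumes values are separated by a comma and a space, and keeps the original column name `categories`.

```python
import pandas as pd

# Hypothetical stand-in for shows.csv: the first two rows, selected columns.
shows = pd.DataFrame({
    "id": [81145628, 80117401],
    "type": ["Movie", "Movie"],
    "title": ["Norm of the North: King Sized Adventure",
              "Jandino: Whatever it Takes"],
    "director": ["Richard Finn, Tim Maltby", None],
    "categories": ["Children & Family Movies, Comedies", "Stand-Up Comedy"],
})


def split_column(frame, column, sep=", "):
    """Return an (id, column) frame with one atomic value per row."""
    out = frame[["id", column]].dropna(subset=[column]).copy()
    out[column] = out[column].str.split(sep)
    return out.explode(column).reset_index(drop=True)


# 1) One row per show; 2) one row per (show, director);
# 3) one row per (show, category).
titles = shows[["id", "type", "title"]]
directors = split_column(shows, "director")
categories = split_column(shows, "categories")

# The query: look up the ids in `directors`, then filter `titles`.
finn_ids = directors.loc[directors["director"] == "Richard Finn", "id"]
finn_movies = titles[titles["id"].isin(finn_ids) & (titles["type"] == "Movie")]
print(len(finn_movies))
```

The three frames can be recombined on demand with `titles.merge(directors, on="id")`, so nothing is lost by storing them separately.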