# A Quick Introduction to Numerical Data Manipulation with Python and NumPy


## What is NumPy?

[NumPy](https://docs.scipy.org/doc/numpy/index.html) stands for numerical Python. It's the backbone of all kinds of scientific and numerical computing in Python.

And since machine learning is all about turning data into numbers and then figuring out the patterns, NumPy often comes into play.

![ml101-6-step-ml-framework-tools.png](attachment:ml101-6-step-ml-framework-tools.png)

## Why NumPy?

You can do numerical calculations using pure Python. At first, Python might seem fast enough, but once your data gets large you'll start to notice slowdowns.

One of the main reasons to use NumPy is that it's fast. Behind the scenes, the code has been optimized to run in C, another programming language that can perform these calculations much faster than Python.

The benefit of this being behind the scenes is you don't need to know any C to take advantage of it. You can write your numerical computations in Python using NumPy and get the added speed benefits.

If you're curious about where this speed benefit comes from, it's a process called vectorization. [Vectorization](https://en.wikipedia.org/wiki/Vectorization) performs a calculation on a whole array at once instead of looping over its elements in Python, since Python loops can become bottlenecks.

NumPy extends vectorized operations to arrays of different shapes through a mechanism called [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html#module-numpy.doc.broadcasting).
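As a rough illustration of vectorization and broadcasting, the sketch below doubles an array with a Python loop and with a single array expression, then adds a 1-D array to a 2-D array. The variable names and values are made up for this example.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: Python handles one element at a time
doubled_loop = np.array([v * 2 for v in values])

# Vectorized version: one expression, the looping happens inside NumPy
doubled_vectorized = values * 2

# Broadcasting: the 1-D row is applied to each row of the 2-D array
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])
broadcast_sum = matrix + row

print(np.array_equal(doubled_loop, doubled_vectorized))  # True
print(broadcast_sum.tolist())  # [[11, 22, 33], [14, 25, 36]]
```

Both versions give the same result. The vectorized one is shorter and avoids the per-element Python overhead.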



## Where can I get help?
If you get stuck or think of something you'd like to do which this notebook doesn't cover, don't fear!

The recommended steps you take are:
1. **Try it** - Since NumPy is very friendly, your first step should be to use what you know and try to figure out the answer to your own question (getting it wrong is part of the process). If in doubt, run your code.
2. **Search for it** - If trying it on your own doesn't work, search for your problem, since someone else has probably tried to do something similar. You'll likely end up in 1 of 2 places:
    * [NumPy documentation](https://docs.scipy.org/doc/numpy/index.html) - the ground truth for everything NumPy, this resource covers all of the NumPy functionality.
    * [Stack Overflow](https://stackoverflow.com/) - this is the developer's Q&A hub. It's full of questions and answers on different problems across a wide range of software development topics, and chances are there's one related to your problem.
    
An example of searching for a NumPy function might be:

> "how to find unique elements in a numpy array"

Searching this on Google leads to the NumPy documentation for the `np.unique()` function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html

The next steps here are to read through the documentation, check the examples and see if they line up to the problem you're trying to solve. If they do, **rewrite the code** to suit your needs, run it, and see what the outcomes are.
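For instance, adapting the documented `np.unique()` function to your own data might look like the sketch below. The `labels` array is hypothetical example data, not something from this notebook.

```python
import numpy as np

# Hypothetical example data with repeated values
labels = np.array([3, 1, 2, 3, 1, 1])

# np.unique() returns the sorted unique elements;
# return_counts=True also returns how often each one appears
unique_labels, counts = np.unique(labels, return_counts=True)

print(unique_labels.tolist())  # [1, 2, 3]
print(counts.tolist())         # [3, 1, 2]
```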

3. **Ask for help** - If you've been through the above 2 steps and you're still stuck, you might want to ask your question on [Stack Overflow](https://www.stackoverflow.com). Be as specific as possible and provide details on what you've tried.

Remember, you don't have to learn all of the functions off by heart to begin with. 

What's most important is continually asking yourself, "what am I trying to do with the data?".

Start by answering that question and then practice finding the code which does it.

Let's get started.

## 0. Importing NumPy

To get started using NumPy, the first step is to import it. 



In [11]:
import numpy as np

## 1. DataTypes and attributes

**NOTE:** It's important to remember that the main type in NumPy is `ndarray`. Even seemingly different kinds of arrays are still `ndarray`s, which means an operation you do on one array will work on another.

In [24]:
# 1-dimensional array, also referred to as a vector
a1 = np.array([1, 2, 3])

# 2-dimensional array, also referred to as matrix
a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])

# 3-dimensional array, also referred to as a matrix
a3 = np.array([[[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]],
               [[10, 11, 12],
                [13, 14, 15],
                [16, 17, 18]]])

In [124]:
a1.shape, a1.ndim, a1.dtype, a1.size, type(a1)

((3,), 1, dtype('int32'), 3, numpy.ndarray)

In [125]:
a2.shape, a2.ndim, a2.dtype, a2.size, type(a2)

((2, 3), 2, dtype('float64'), 6, numpy.ndarray)

In [126]:
a3.shape, a3.ndim, a3.dtype, a3.size, type(a3)

((2, 3, 3), 3, dtype('int32'), 18, numpy.ndarray)

In [127]:
a1

array([1, 2, 3])

In [128]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [129]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

### Anatomy of an array

![numpy-anatomy-of-an-array-updated.png](attachment:numpy-anatomy-of-an-array-updated.png)

Key terms:
* **Array** - A list of numbers, can be multi-dimensional.
* **Scalar** - A single number (e.g. `7`).
* **Vector** - A list of numbers with 1 dimension (e.g. `np.array([1, 2, 3])`).
* **Matrix** - A (usually) multi-dimensional list of numbers (e.g. `np.array([[1, 2, 3], [4, 5, 6]])`).
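One way to connect these terms to code is to check the number of dimensions of each kind of array. This is a small sketch, and the variable names are chosen only for illustration.

```python
import numpy as np

scalar = np.array(7)                        # a single number, 0 dimensions
vector = np.array([1, 2, 3])                # 1 dimension
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2 dimensions

for name, arr in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    # .ndim is the number of dimensions, .shape is the size of each one
    print(name, arr.ndim, arr.shape)
```

All three are still `ndarray` objects. Only the number of dimensions differs.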

## 2. Creating arrays

* `np.array()`
* `np.ones()`
* `np.zeros()`
* `np.random.rand(5, 3)`
* `np.random.seed()` - pseudo random numbers

In [130]:
# Create a simple array
simple_array = np.array([1, 2, 3])
simple_array

array([1, 2, 3])

In [131]:
simple_array = np.array((1, 2, 3))
simple_array, simple_array.dtype

(array([1, 2, 3]), dtype('int32'))

In [132]:
# Create an array of ones
ones = np.ones((10, 2))
ones

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [133]:
# The default datatype is 'float64'
ones.dtype

dtype('float64')

In [134]:
# You can change the datatype with .astype()
ones.astype(int)

array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]])

In [135]:
# Create an array of zeros
zeros = np.zeros((5, 3, 3))
zeros

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

In [136]:
zeros.dtype

dtype('float64')

In [137]:
# Create an array within a range of values
range_array = np.arange(0, 10, 2)
range_array

array([0, 2, 4, 6, 8])

In [15]:
# Random array of floats (between 0 & 1)
np.random.random((5, 3))

array([[0.16130952, 0.65310833, 0.2532916 ],
       [0.46631077, 0.24442559, 0.15896958],
       [0.11037514, 0.65632959, 0.13818295],
       [0.19658236, 0.36872517, 0.82099323],
       [0.09710128, 0.83794491, 0.09609841]])

In [16]:
np.random.random((5, 3))

array([[0.97645947, 0.4686512 , 0.97676109],
       [0.60484552, 0.73926358, 0.03918779],
       [0.28280696, 0.12019656, 0.2961402 ],
       [0.11872772, 0.31798318, 0.41426299],
       [0.0641475 , 0.69247212, 0.56660145]])

In [140]:
np.random.rand(5, 3)

array([[0.29641567, 0.33406014, 0.50907489],
       [0.74504766, 0.59307464, 0.34360118],
       [0.30266013, 0.27967976, 0.92107048],
       [0.87532113, 0.84135429, 0.87040443],
       [0.8213548 , 0.97830797, 0.79855969]])

NumPy uses pseudo-random numbers, which means the numbers look random but aren't really. They're predetermined.

For consistency, you might want to keep the random numbers you generate the same throughout your experiments.

To do this, you can use [`np.random.seed()`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.seed.html).

What this does is tell NumPy, "I want to generate random numbers, but keep them tied to the same seed."

How does that work?


In [18]:
# Set random seed to 0
np.random.seed(0)
# Make 'random' numbers
np.random.randint(10, size=(5, 3))

array([[5, 0, 3],
       [3, 7, 9],
       [3, 5, 2],
       [4, 7, 6],
       [8, 8, 1]])

With `np.random.seed()` set, every time you run the cell above, the same random numbers will be generated.

What if `np.random.seed()` wasn't set?

Every time you run the cell below, a new set of numbers will appear.

In [19]:
# Make more random numbers
np.random.randint(10, size=(5, 3))

array([[6, 7, 7],
       [8, 1, 5],
       [9, 8, 9],
       [4, 3, 0],
       [3, 5, 0]])

Let's take a look.

In [22]:
# Set random seed to same number as above
np.random.seed(0)

# The same random numbers come out
np.random.randint(10, size=(5, 3))

array([[5, 0, 3],
       [3, 7, 9],
       [3, 5, 2],
       [4, 7, 6],
       [8, 8, 1]])

Because `np.random.seed()` is set to 0 again, the random numbers match those from the earlier cell that also used a seed of 0.

Setting `np.random.seed()` is not 100% necessary but it's helpful to keep numbers the same throughout your experiments.
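As a side note, newer NumPy releases (1.17 and later, which is an assumption about your installed version) also offer `np.random.default_rng()`, which the NumPy documentation recommends for new code. The sketch below shows the same repeatability idea with that API. The numbers it produces differ from those generated with `np.random.seed(0)` above.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

first = rng_a.integers(10, size=(5, 3))
second = rng_b.integers(10, size=(5, 3))

print(np.array_equal(first, second))  # True
```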

For example, say you wanted to split your data randomly into training and test sets.

Every time you randomly split, you might get different rows in each set.

If you shared your work with someone else, they'd get different rows in each set too.

Setting `np.random.seed()` keeps the randomness but makes it repeatable, hence the term 'pseudo-random' numbers.
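To make the train/test example concrete, here is a minimal sketch of a repeatable random split. The `split_rows()` helper, its parameters and the toy `data` array are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical dataset: 10 rows, 2 columns
data = np.arange(20).reshape(10, 2)

def split_rows(data, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then split into train/test rows."""
    np.random.seed(seed)
    shuffled = np.random.permutation(len(data))
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return data[train_idx], data[test_idx]

train_1, test_1 = split_rows(data)
train_2, test_2 = split_rows(data)

# Same seed, same rows in each set
print(np.array_equal(test_1, test_2))  # True
print(train_1.shape, test_1.shape)     # (8, 2) (2, 2)
```

Anyone who runs this with the same seed gets the same rows in each set.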

In [10]:
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 5 | 0 | 3 |
| 1 | 3 | 7 | 9 |
| 2 | 3 | 5 | 2 |
| 3 | 4 | 7 | 6 |
| 4 | 8 | 8 | 1 |


## 3. Viewing arrays and matrices (indexing)

Remember, because arrays and matrices are both `ndarray`s, they can be viewed in similar ways.

Let's check out our 3 arrays again.

In [145]:
a1

array([1, 2, 3])

In [146]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [147]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

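Array shapes are listed from the outermost dimension to the innermost. A 2-D array has shape `(rows, columns)`, and each extra dimension is added at the front, so `a3` has shape `(2, 3, 3)`: 2 blocks, each with 3 rows and 3 columns.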
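The sketch below unpacks the shape of `a3` and uses the same order to index a single element. The names `blocks`, `rows` and `columns` are illustrative labels, not NumPy terms.

```python
import numpy as np

a3 = np.array([[[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]],
               [[10, 11, 12],
                [13, 14, 15],
                [16, 17, 18]]])

# The shape reads from the outermost dimension to the innermost
blocks, rows, columns = a3.shape
print(blocks, rows, columns)  # 2 3 3

# Indices follow the same order as the shape: block 1, row 2, column 0
print(a3[1, 2, 0])  # 16
```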

In [148]:
a1[0]

1

In [149]:
a2[0]

array([1. , 2. , 3.3])

In [150]:
a3[0]

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [151]:
# Get 2nd row (index 1) of a2
a2[1]

array([4. , 5. , 6.5])

In [26]:
# Get the first 2 values of the first 2 rows of both 2-D blocks in a3
a3[:2, :2, :2]

array([[[ 1,  2],
        [ 4,  5]],

       [[10, 11],
        [13, 14]]])
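The same indexing and slicing syntax also works on the 2-D array. A short sketch using the values of `a2` from earlier:

```python
import numpy as np

a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])

single_value = a2[1, 2]     # row 1, column 2
first_column = a2[:, 0]     # every row, column 0
rest_of_row_0 = a2[0, 1:]   # row 0, from column 1 onwards

print(single_value)           # 6.5
print(first_column.tolist())  # [1.0, 4.0]
print(rest_of_row_0.tolist()) # [2.0, 3.3]
```

A single index per dimension returns one value, while a slice (`:`) keeps that dimension in the result.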

## 4. Manipulating and comparing arrays

* Aggregation
    * `np.sum()` - much faster than Python's built-in `sum()` on NumPy arrays (timed in the demo below)
    * `np.mean()`
    * `np.std()`
    * `np.var()`
    * `np.min()`
    * `np.max()`

### Aggregation

Aggregation means bringing things together: applying the same operation across many values to produce a summary, such as a sum or a mean.
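Aggregations can summarize a whole array or work along one dimension at a time through the `axis` parameter. The sketch below uses the values of `a2` from earlier, and the result names are chosen for illustration.

```python
import numpy as np

a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])

total = np.sum(a2)                # one number for the whole array
column_sums = np.sum(a2, axis=0)  # collapse the rows: one sum per column
row_sums = np.sum(a2, axis=1)     # collapse the columns: one sum per row

print(total)
print(column_sums.shape, row_sums.shape)  # (3,) (2,)
```

The same `axis` argument is accepted by `np.mean()`, `np.std()`, `np.var()`, `np.min()` and `np.max()`.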

In [153]:
a1

array([1, 2, 3])

In [154]:
sum(a1)

6

In [155]:
np.sum(a1)

6

Use NumPy's `np.sum()` on NumPy arrays and Python's `sum()` on Python lists.

In [27]:
massive_array = np.random.random(100000)
massive_array.size

100000

In [28]:
%timeit sum(massive_array) # Python sum()
%timeit np.sum(massive_array) # NumPy np.sum()

8.34 ms ± 390 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
40.5 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
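`%timeit` is a Jupyter/IPython magic command, so it won't run in a plain Python script. A rough equivalent using the standard library's `timeit` module might look like the sketch below. The repeat count of 10 is an arbitrary choice, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np

massive_array = np.random.random(100000)
runs = 10  # arbitrary number of repetitions for this sketch

# Total time for `runs` calls of each function
python_time = timeit.timeit(lambda: sum(massive_array), number=runs)
numpy_time = timeit.timeit(lambda: np.sum(massive_array), number=runs)

print(f"Python sum():   {python_time / runs:.6f} s per call")
print(f"NumPy np.sum(): {numpy_time / runs:.6f} s per call")
```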


In [158]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [159]:
# Find the mean
np.mean(a2)

3.6333333333333333

In [160]:
# Find the max
np.max(a2)

6.5

In [161]:
# Find the min
np.min(a2)

1.0

In [162]:
# Find the standard deviation (a measure of how dispersed the data is in relation to the mean)
np.std(a2)

1.8226964152656422

In [163]:
# Find the variance
np.var(a2)

3.3222222222222224

In [164]:
# The standard deviation is the square root of the variance
np.sqrt(np.var(a2))

1.8226964152656422

**What's the mean?**

The mean is the same as the average. You can find the average of a set of numbers by adding them up and dividing the total by how many there are.

**What's standard deviation?**

[Standard deviation](https://www.mathsisfun.com/data/standard-deviation.html) is a measure of how spread out numbers are.

**What's variance?**

The [variance](https://www.mathsisfun.com/data/standard-deviation.html) is the average of the squared differences from the mean.

To work it out, you:
1. Work out the mean
2. For each number, subtract the mean and square the result
3. Find the average of the squared differences
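The three steps above can be sketched by hand and compared against NumPy's built-in functions. This uses the values of `a2` from earlier. Note that `np.var()` and `np.std()` divide by the number of values by default (population variance), which is what the steps describe.

```python
import numpy as np

a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])

# 1. Work out the mean
mean = a2.sum() / a2.size

# 2. For each number, subtract the mean and square the result
squared_diffs = (a2 - mean) ** 2

# 3. Find the average of the squared differences
variance = squared_diffs.sum() / squared_diffs.size

# The standard deviation is the square root of the variance
std = np.sqrt(variance)

print(np.isclose(mean, np.mean(a2)))     # True
print(np.isclose(variance, np.var(a2)))  # True
print(np.isclose(std, np.std(a2)))       # True
```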