# Week 1 - Part 1
### Numpy Intro

Key points:
* NumPy offers very efficient multi-dimensional arrays
* The main class is `ndarray`
* Arrays hold a single homogeneous type, unlike Python's dynamically typed lists, which makes them more efficient
* Array dimensions are called axes
  - axis 0 runs down the rows, so reducing along it gives one result per column
  - axis 1 runs across the columns, so reducing along it gives one result per row
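
A minimal sketch of the axis idea on a tiny array. The `demo` array and the variable names are made up for illustration and are not part of the course notebook.

```python
import numpy as np

# Hypothetical demo data: 2 rows, 3 columns
demo = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Reducing along axis 0 collapses the rows: one result per column
col_sums = demo.sum(axis=0)   # array([5, 7, 9])

# Reducing along axis 1 collapses the columns: one result per row
row_sums = demo.sum(axis=1)   # array([ 6, 15])
```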

In [1]:
import numpy as np

### Init Data

Seeded random results
- This guarantees we can always get the same random results (good for demo purposes)

In [2]:
# init random engine
np.random.seed(1)
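
A quick sketch of what seeding buys us: re-seeding with the same value replays the same random draws. The variable names here are illustrative only.

```python
import numpy as np

np.random.seed(1)
first = np.random.rand(3)

np.random.seed(1)              # re-seeding resets the global generator state
second = np.random.rand(3)

same = np.array_equal(first, second)   # True: identical draws
```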

Create a 2D array

In [3]:
arr = np.random.multivariate_normal(mean = [1, 0.5], 
                                    cov = [[1,0],[0,1]], 
                                    size = 10000)
arr

array([[ 2.62434536, -0.11175641],
       [ 0.47182825, -0.57296862],
       [ 1.86540763, -1.8015387 ],
       ...,
       [ 2.10957003, -0.44320839],
       [ 1.78221575,  2.9084338 ],
       [ 1.88278555,  0.40040369]])

Scalar = 0D = a single value
- ex) 5

Vector = 1D = like a list
- ex) [1,2,3]

Matrix = 2D = like a table w/rows + cols
- ex) [[1,2], [3,4]]

Tensor = 3D (or more) = like a cube
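
The `ndim` attribute reports how many axes an array has, which is an easy way to check the dimensionality. A small sketch, with made-up example values:

```python
import numpy as np

scalar = np.array(5)                  # a single value: 0 axes
vector = np.array([1, 2, 3])          # 1 axis
matrix = np.array([[1, 2], [3, 4]])   # 2 axes: rows and cols
cube = np.zeros((3, 2, 2))            # 3 axes

dims = [x.ndim for x in (scalar, vector, matrix, cube)]   # [0, 1, 2, 3]
```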

`shape` returns the size of each axis; for a 2D array that is (rows, cols)

In [4]:
arr.shape

(10000, 2)

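Passing just the dims to `np.random.rand` creates an array of that shape (here 3 x 2 x 2) filled with uniform random values in [0, 1)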

In [5]:
np.random.rand(3,2,2)  

array([[[0.48382042, 0.30519443],
        [0.39997738, 0.09613724]],

       [[0.70862236, 0.72072961],
        [0.28462039, 0.05752638]],

       [[0.79270584, 0.34496008],
        [0.68896627, 0.26391377]]])

# Indexing
* Specifies locations in arrays
* Can make use of regular Python slices

```
          FORMAT:      arr[start:stop:step]
          EX:          a[0:3:1]
          TRANSLATION: [starts at 0: stops at 3 (not including): steps by 1]
```
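
The start:stop:step format is easiest to see on a small 1D array. A minimal sketch; the array `a` here is just `np.arange(10)`, not the notebook's data:

```python
import numpy as np

a = np.arange(10)            # [0, 1, 2, ..., 9]

first_three = a[0:3:1]       # starts at 0, stops before 3, steps by 1 -> [0, 1, 2]
every_other = a[::2]         # start and stop omitted = full range -> [0, 2, 4, 6, 8]
last_two = a[-2:]            # negative indices count from the end -> [8, 9]
```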

Get top left element

In [6]:
arr[0,0]

2.6243453636632417

Get bottom right element

In [7]:
arr[-1,-1]

0.4004036898978954

Get data type

In [8]:
arr.dtype

dtype('float64')

Change data type to integer

In [9]:
# returns a copy by default
arr.astype(int)

array([[ 2,  0],
       [ 0,  0],
       [ 1, -1],
       ...,
       [ 2,  0],
       [ 1,  2],
       [ 1,  0]])
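
Two things worth noticing in the output above: `astype(int)` truncates toward zero rather than rounding, and the original array is left untouched. A small sketch with made-up values; rounding first with `np.rint` is shown only as one possible alternative:

```python
import numpy as np

floats = np.array([2.62, -0.11, -1.80])

truncated = floats.astype(int)          # drops the fractional part -> [2, 0, -1]
rounded = np.rint(floats).astype(int)   # rounds to nearest first   -> [3, 0, -2]

# astype returned a new array, so the original keeps its dtype
original_dtype = floats.dtype           # float64
```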

# Subset
### Extracting subsets of arrays

```
    FORMAT: arr[row, col]
    EX:     arr[0,:]
```

The example above selects the entire 1st row
- colon (:) = regular Python slice syntax
    - When used w/o numbers, it selects the full range of that row or col

Just a reminder of what our array looks like

In [10]:
arr

array([[ 2.62434536, -0.11175641],
       [ 0.47182825, -0.57296862],
       [ 1.86540763, -1.8015387 ],
       ...,
       [ 2.10957003, -0.44320839],
       [ 1.78221575,  2.9084338 ],
       [ 1.88278555,  0.40040369]])

Select entire 1st row

In [11]:
arr[0,:]

array([ 2.62434536, -0.11175641])

Same idea -- but this time, index into just the 1st col

In [12]:
arr[:,0]

array([2.62434536, 0.47182825, 1.86540763, ..., 2.10957003, 1.78221575,
       1.88278555])

Easier to just look at the shape of this one: 10k elements in this column

In [13]:
arr[:,0].shape

(10000,)

Select even rows - again this just uses regular Python slices

- FORMAT: start:stop:step
- :: (double colon) = extended slicing = the stop argument is dropped, so the slice runs to the end
- TRANSLATION: Start at 0, (Stop at the end), Step by 2
    - This will end up w/all even-indexed rows

NOTE: Prof. likes to work in Python tuples like this

In [14]:
arr[0::2,:].shape

(5000, 2)

- TRANSLATION: Start at 1, (stop at the end), step by 2
    - This will end up w/all odd-indexed rows instead

In [15]:
arr[1::2,:].shape

(5000, 2)

Reverse the row order (last row becomes 1st)
- The step is negative, so the rows are selected bkwds
- TRANSLATION: (Start at the end), (Stop at the beginning), Step by -1

In [16]:
arr[::-1,:]

array([[ 1.88278555,  0.40040369],
       [ 1.78221575,  2.9084338 ],
       [ 2.10957003, -0.44320839],
       ...,
       [ 1.86540763, -1.8015387 ],
       [ 0.47182825, -0.57296862],
       [ 2.62434536, -0.11175641]])

# Filtering
### Taking conditionals of arrays 

NumPy comparison operators are applied to each element

This gives us a boolean (T/F) array

In [17]:
arr > 0

array([[ True, False],
       [ True, False],
       [ True, False],
       ...,
       [ True, False],
       [ True,  True],
       [ True,  True]])

Apply our previous boolean (T/F) array to the original array

This will give us all the values that match the condition > 0 in the original array

In [18]:
arr[arr > 0]

array([2.62434536, 0.47182825, 1.86540763, ..., 2.9084338 , 1.88278555,
       0.40040369])
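
Note that indexing with a 2D boolean mask returns a flat 1D array, since the matching elements need not form a rectangle. A minimal sketch on a made-up 2 x 2 array:

```python
import numpy as np

demo = np.array([[ 1.5, -0.2],
                 [-0.7,  0.3]])

mask = demo > 0          # same shape as demo, True where the condition holds
positives = demo[mask]   # 1D array of the matching values -> [1.5, 0.3]
```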

We can specify explicit indices to extract

In [19]:
wanted_rows=[0,1,5,99]
wanted_cols=[0]

Applying our lists of wanted rows and wanted col's to the original array gives the values at those (row, col) positions; the single col index is broadcast across all the wanted rows

In [20]:
arr[wanted_rows, wanted_cols]

array([2.62434536, 0.47182825, 2.46210794, 1.81095167])
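
Index lists are paired up element by element (with broadcasting), so they pick individual positions rather than a rectangular block. A sketch on a made-up array; `np.ix_` is shown as one way to get the full row/col cross product, which the notebook itself does not use:

```python
import numpy as np

demo = np.arange(12).reshape(4, 3)    # rows: [0,1,2], [3,4,5], [6,7,8], [9,10,11]

# One col index is broadcast over the rows: positions (0,0), (2,0), (3,0)
one_col = demo[[0, 2, 3], [0]]        # -> [0, 6, 9]

# Equal-length lists are paired: positions (0,1) and (2,2)
paired = demo[[0, 2], [1, 2]]         # -> [1, 8]

# np.ix_ builds the cross product: rows 0 and 2, cols 1 and 2
block = demo[np.ix_([0, 2], [1, 2])]  # -> [[1, 2], [7, 8]]
```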

----------------------
AXIS:
- 0 = row
- 1 = col
----------------------

Here we want all rows w/at least 1 positive value

`any` with axis=1 checks ACROSS the columns of each row
- Returns a boolean (T/F) array w/one entry per row
- True for each row that has at least 1 val > 0
  - Makes for a good row filter

In [21]:
(arr>0).any(axis=1)

array([ True,  True,  True, ...,  True,  True,  True])

Get subset that matches this >0 condition by applying the boolean (T/F) array to original array

In [22]:
arr[(arr>0).any(axis=1),:]

array([[ 2.62434536, -0.11175641],
       [ 0.47182825, -0.57296862],
       [ 1.86540763, -1.8015387 ],
       ...,
       [ 2.10957003, -0.44320839],
       [ 1.78221575,  2.9084338 ],
       [ 1.88278555,  0.40040369]])

### Comparing `any` VS `all` 

These are the Python built-ins, but same idea

Any = like OR = as long as there's a match somewhere

In [23]:
any([True, False])

True

All = like AND = must all be a match

In [24]:
all([True,False])

False

Changed the `any` to `all`, so the filter now requires ALL entries in a row to meet the test

In [25]:
arr[(arr>0).all(axis=1),:]

array([[1.3190391 , 0.25062962],
       [0.6775828 , 0.11594565],
       [1.04221375, 1.08281521],
       ...,
       [1.41951494, 0.62272589],
       [1.78221575, 2.9084338 ],
       [1.88278555, 0.40040369]])
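
The difference between the two row filters is easier to see on a tiny array. A minimal sketch with made-up values:

```python
import numpy as np

demo = np.array([[ 1, -1],
                 [-2, -3],
                 [ 4,  5]])

positive = demo > 0

any_rows = positive.any(axis=1)   # at least one positive per row -> [True, False, True]
all_rows = positive.all(axis=1)   # every entry positive per row  -> [False, False, True]

kept_any = demo[any_rows, :]      # -> [[1, -1], [4, 5]]
kept_all = demo[all_rows, :]      # -> [[4, 5]]
```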

`np.where` returns the indices WHERE the array meets the condition, as a tuple of 2 arrays
- 1st array = row index of each match
- 2nd array = col index of each match

In [26]:
np.where(arr>0)

(array([   0,    1,    2, ..., 9998, 9999, 9999], dtype=int64),
 array([0, 0, 0, ..., 1, 0, 1], dtype=int64))
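
A small sketch of how the two index arrays line up, using a made-up 2 x 2 array. The three-argument form of `np.where` is included as a related use that the notebook does not cover:

```python
import numpy as np

demo = np.array([[ 1, -1],
                 [-2,  3]])

row_idx, col_idx = np.where(demo > 0)    # rows [0, 1], cols [0, 1]

# Pair them up to get (row, col) coordinates of each match
coords = list(zip(row_idx.tolist(), col_idx.tolist()))   # [(0, 0), (1, 1)]

# With three arguments, np.where picks element-wise between two sources
clipped = np.where(demo > 0, demo, 0)    # -> [[1, 0], [0, 3]]
```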

Applies multiple conditions
- Via the bitwise AND operator (&)
    - Recall this is applied element-wise

TRANSLATION:
- array val greater than 0
- AND
- array val less than 2
- for ALL
- col's in the row (axis = 1)

In [27]:
arr[((arr>0) & (arr<2)).all(axis=1),:]

array([[1.3190391 , 0.25062962],
       [0.6775828 , 0.11594565],
       [1.04221375, 1.08281521],
       ...,
       [1.78974846, 0.28892712],
       [1.41951494, 0.62272589],
       [1.88278555, 0.40040369]])
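
The parentheses around each condition matter because `&` binds tighter than `>` and `<`, and Python's plain `and` does not work element-wise on arrays. A minimal sketch with made-up values:

```python
import numpy as np

demo = np.array([[ 0.5, 1.5],
                 [ 2.5, 0.1],
                 [-1.0, 1.0]])

in_range = (demo > 0) & (demo < 2)        # element-wise AND of two boolean arrays
kept = demo[in_range.all(axis=1), :]      # only row 0 passes -> [[0.5, 1.5]]

# Plain `and` tries to turn a whole array into a single True/False and fails
try:
    (demo > 0) and (demo < 2)
    and_failed = False
except ValueError:
    and_failed = True
```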

# Sorting

Reminder of our original array

In [28]:
arr

array([[ 2.62434536, -0.11175641],
       [ 0.47182825, -0.57296862],
       [ 1.86540763, -1.8015387 ],
       ...,
       [ 2.10957003, -0.44320839],
       [ 1.78221575,  2.9084338 ],
       [ 1.88278555,  0.40040369]])

`argsort` returns the indices that would sort the elements (not the sorted values themselves)

Sorting on 1st col

In [29]:
np.argsort(arr[:,0])

array([4775, 7605, 3704, ..., 4363, 7472, 7463], dtype=int64)

Sorting on 2nd column

In [30]:
np.argsort(arr[:,1])

array([9867, 5851, 6452, ..., 1197,  282, 3877], dtype=int64)

Sort entire array by col 0

In [31]:
arr[np.argsort(arr[:,0]),:]

array([[-2.6564401 , -0.74300894],
       [-2.43592581,  1.08961331],
       [-2.29485841,  0.59083978],
       ...,
       [ 4.61327701,  1.39894868],
       [ 4.83438102,  1.53704785],
       [ 5.16811768, -0.61761377]])

Same idea in reverse order, this time sorting on col 1
- NOTE: the reversal is applied as a second step [::-1]

Sorts in desc order

In [32]:
arr[np.argsort(arr[:,1])[::-1],:]

array([[ 0.83966508,  4.52684904],
       [-0.63744959,  4.4586027 ],
       [ 0.51572248,  4.2402489 ],
       ...,
       [ 0.69710098, -2.7803276 ],
       [ 2.20353388, -2.81084256],
       [-0.97038056, -2.95140291]])
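
The argsort-then-index pattern, step by step, on a made-up 3 x 2 array so the ordering can be checked by eye:

```python
import numpy as np

demo = np.array([[3, 10],
                 [1, 30],
                 [2, 20]])

order = np.argsort(demo[:, 0])    # row indices that sort col 0 -> [1, 2, 0]

ascending = demo[order, :]        # -> [[1, 30], [2, 20], [3, 10]]
descending = demo[order[::-1], :] # reverse the index order -> [[3, 10], [2, 20], [1, 30]]
```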

# Apply UDFs  (user defined functions)

### NumPy ver of pandas.DataFrame.apply() command

Just get the quartiles (25%, 50%, 75%)

NOTE: this uses a lambda func
- It's defined in-line and assigned to the name `func`

In [33]:
func = lambda x: np.percentile(x, q=[25,50,75])

Apply the function along axis 0 (i.e., down each col)
- This computes our func along each col (we have 2 col's)
- Each col then has 3 results -- the quartiles

In [34]:
np.apply_along_axis(func1d = func, axis=0, arr=arr)

array([[ 0.33302073, -0.16586014],
       [ 1.03000632,  0.49529121],
       [ 1.6951908 ,  1.16643536]])

Calling the lambda directly on a 1D array (a single col) also works, since `np.percentile` accepts any array-like input

- NOTE: this is another way to do the previous

In [35]:
func(arr[:,0]) , func(arr[:,1])

(array([0.33302073, 1.03000632, 1.6951908 ]),
 array([-0.16586014,  0.49529121,  1.16643536]))
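
For this particular function there is a third route: `np.percentile` takes an `axis` argument itself, so it can compute the quartiles per col without `apply_along_axis`. A sketch on made-up data (the `data` array and `quartiles` name are illustrative), showing that both routes agree:

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0],
                 [5.0, 50.0]])

quartiles = lambda x: np.percentile(x, q=[25, 50, 75])

# Route 1: apply the 1D function down each col
via_apply = np.apply_along_axis(quartiles, 0, data)

# Route 2: let np.percentile handle the axis directly
direct = np.percentile(data, q=[25, 50, 75], axis=0)

# Both give a (3, 2) array: one row per quartile, one col per input col
match = np.allclose(via_apply, direct)   # True
```

`apply_along_axis` is still the general tool when the function has no `axis` parameter of its own.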