# Linear Algebra with NumPy
### NumPy

As we've seen in lecture, linear algebra is the branch of mathematics that studies vectors, vector spaces, and the linear maps between them. It is a core concept because many data science techniques rely on it (e.g. Linear Regression, Support Vector Machines, Recommender Systems, Neural Networks).

NumPy is a package for scientific computing, built around creating and manipulating N-dimensional array objects.

* `numpy` stands for numerical python
* the starring role is played by the `ndarray`, an N-dimensional array (a vector is the 1-D case, a matrix the 2-D case)
* `pandas` is built on `numpy`
* More documentation available at: http://www.numpy.org
* Good quick tutorial at: http://cs231n.github.io/python-numpy-tutorial/

In [2]:
import numpy
import numpy as np # Same as above, but numpy is aliased to np.

In [None]:
print(numpy.absolute(-10))
print(np.absolute(-10)) # Both ways work

### Arrays
We can create the vectors and matrices we learned about in class with `np.array`.

In [None]:
vector = np.array([1, 2, 1])

In [None]:
matrix = np.array([[1, 2, 3],[2, 4, 9],[4,6,4]]) # Create a 3 x 3 matrix

In [None]:
matrix.shape # Gives us the dimensions

In [None]:
matrix[0]  # first row

In [None]:
matrix[0,:] #first row, but more clear

In [None]:
matrix[ : , 1]  # all rows, second column

In [None]:
matrix[:2,1:3] #Slice the 0 and 1 rows, and the 1 and 2 columns

In [None]:
matrix[:2,1:3] = 0 # Can directly modify the elements 
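
One detail worth knowing about slice assignment: a basic slice is a view onto the original array, not a copy, so writing to the slice writes to the array. A minimal sketch (the names `view` and `independent` are only for illustration):

```python
import numpy as np

matrix = np.array([[1, 2, 3], [2, 4, 9], [4, 6, 4]])

# A basic slice is a view: it shares memory with `matrix`.
view = matrix[:2, 1:3]
view[:] = 0                      # writes through to `matrix`

# An explicit copy is independent of the original.
independent = matrix[2:, :].copy()
independent[:] = -1              # `matrix` is left untouched

print(matrix)                    # top-right 2 x 2 block is now zero
print(np.shares_memory(view, matrix))
```

Calling `.copy()` is the usual way to keep the original safe when you plan to modify a slice.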

#### Integer array indexing
You can also index into NumPy arrays with lists or arrays of integers. Each position pairs one row index with one column index.

In [3]:
a = np.array([ [1,2],[3,4],[5,6]])

In [4]:
a[ [0,1,2], [0,1,0] ]

array([1, 4, 5])

In [5]:
np.array([ a[0,0], a[1,1], a[2,0]])

array([1, 4, 5])
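
A common use of this pattern is picking one element from each row. The sketch below assumes a hypothetical `cols` array holding the wanted column for each row, and shows that the same index pairs can also be assigned to:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

rows = np.arange(a.shape[0])     # one row index per row: 0, 1, 2
cols = np.array([0, 1, 0])       # hypothetical choice of column per row

picked = a[rows, cols]           # elements (0,0), (1,1), (2,0); this is a copy
a[rows, cols] += 10              # the same positions, updated in place

print(picked)
print(a)
```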

#### Boolean indexing
You can also index into NumPy arrays with a boolean mask, typically built from a comparison.

In [6]:
a = np.array([[1,2],[3,4],[5,100000000]])
a

array([[        1,         2],
       [        3,         4],
       [        5, 100000000]])

In [7]:
bad_vals = a > 10
bad_vals

array([[False, False],
       [False, False],
       [False,  True]], dtype=bool)

In [8]:
a[bad_vals] = 5
a

array([[1, 2],
       [3, 4],
       [5, 5]])
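
Masks can also be combined and counted. This is a small sketch rather than part of the original walkthrough; the thresholds are arbitrary:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 100000000]])

# Combine conditions with & (and) or | (or); each comparison needs parentheses.
mask = (a > 1) & (a < 10)

n_selected = mask.sum()          # True counts as 1, False as 0
selected = a[mask]               # 1-D array of the values where mask is True

print(n_selected)
print(selected)
```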

### Datatypes
All elements in a NumPy array share the same data type. NumPy infers a suitable dtype from the values you pass in, and arrays can be converted to other types. You can also set the type explicitly with the `dtype` argument.

In [None]:
x = np.array([1,2])
x.dtype

In [None]:
x = np.array([1.0,2.3])
x.dtype

In [None]:
x = np.array([1,2],dtype='float64')
x.dtype
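
Because the dtype is fixed per array, values are converted to fit it. The sketch below illustrates two consequences, assuming default NumPy behavior: a float written into an integer array loses its fractional part, and `astype` returns a converted copy.

```python
import numpy as np

ints = np.array([1, 2, 3])       # integer dtype (exact width is platform dependent)
ints[0] = 7.9                    # stored as 7: the fractional part is dropped

floats = ints.astype('float64')  # astype returns a new array, `ints` is unchanged
mixed = np.array([1, 2.5])       # ints and floats together are upcast to float64

print(ints, ints.dtype)
print(floats, floats.dtype)
print(mixed.dtype)
```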

### Matrix Operations

We can perform matrix operations such as:

In [None]:
data = np.array( [ [1,2,3], [4,5,6], [7,8,9] ])

In [None]:
data + data # Matrix Addition

In [None]:
np.add(data,data) # Same as above

In [None]:
np.subtract(data,data)

In [None]:
data * 2 # Multiplication by a scalar

In [None]:
vector = np.array([1, 2, 1])
vector

In [None]:
np.dot(data, vector) # Matrix - Vector Multiplication

`np.dot` and `*` do different things on 2-D arrays: `np.dot` performs matrix multiplication, while `*` multiplies element by element.

In [None]:
sqmat = np.array([ [1,2,3], [4,5,6], [7,8,9] ])

In [None]:
np.dot( sqmat, sqmat ) # sqmat (dot) sqmat, or, sqmat^2, this is matrix mult

In [None]:
sqmat * sqmat # Careful, this is very different from matrix multiplication!
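
To make the contrast concrete, here is a short side-by-side sketch. It also uses the `@` operator, which assumes Python 3.5 or later and is equivalent to `np.dot` for 2-D arrays:

```python
import numpy as np

sqmat = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

elementwise = sqmat * sqmat      # each entry multiplied by itself
matmul = np.dot(sqmat, sqmat)    # rows of the first times columns of the second

# The @ operator performs the same matrix multiplication as np.dot here.
same = np.array_equal(matmul, sqmat @ sqmat)

print(elementwise[0])            # first row: 1, 4, 9
print(matmul[0])                 # first row: 30, 36, 42
print(same)
```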

We can transpose an array

In [None]:
print(sqmat)
print(sqmat.T)
print(np.transpose(sqmat))

In [None]:
np.linalg.inv(sqmat) # sqmat is singular (rank 2): expect huge, meaningless values or a LinAlgError
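
One way to check whether an inverse is trustworthy is to look at the rank first. This is a sketch; the 2 x 2 matrix `invertible` is a made-up example, not one from the lecture:

```python
import numpy as np

sqmat = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rank = np.linalg.matrix_rank(sqmat)      # less than 3, so sqmat has no true inverse

# A made-up full-rank matrix for comparison.
invertible = np.array([[2.0, 1.0], [1.0, 3.0]])
inverse = np.linalg.inv(invertible)

# A valid inverse multiplies back to the identity (up to rounding error).
is_identity = np.allclose(np.dot(invertible, inverse), np.eye(2))

print(rank)
print(is_identity)
```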

### Matrix Descriptions
Helper functions exist to help reshape arrays as well as determine the number of dimensions and size of the dimensions

In [None]:
a = np.array( [1,2,3,4,5,6,7,8,9,10,11,12]) # Create a 1-D array of 12 elements, shape (12,)

In [None]:
print(np.shape(a)) # Get the dimensions of the array
print(np.ndim(a))  # Get how many dimensions there are

In [None]:
matA = a.reshape(3,4) #Reshape into a 3 x 4 matrix. Use elements row-wise

In [None]:
print(matA.shape)
print(matA.ndim)

In [None]:
np.reshape(a,(3,4)) # Same as above, the shape must be passed as a tuple
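
Two further details of `reshape`, sketched below: one dimension may be given as `-1` so NumPy infers it, and a shape whose total size does not match raises a `ValueError`. The name `message` is only for illustration.

```python
import numpy as np

a = np.arange(1, 13)             # 12 elements: 1 through 12

rows3 = a.reshape(3, -1)         # -1 is inferred as 4
cols2 = a.reshape(-1, 2)         # -1 is inferred as 6

try:
    a.reshape(5, 3)              # 15 slots cannot hold 12 elements
except ValueError as err:
    message = str(err)

print(rows3.shape, cols2.shape)
print(message)
```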

### Helper Functions to create Arrays

There are a number of helper functions to create arrays from scratch

In [None]:
a = np.zeros(4)
a

In [None]:
a = np.zeros( (4,4) )
a

In [None]:
a = np.ones( (4,4) )
a

In [None]:
a = np.arange(5**2).reshape(5,5)
a

In [None]:
a = np.linspace(0,1,11) #Note that the boundaries ARE included in linspace
a 
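
As a quick contrast, here is a small sketch of how `np.arange` and `np.linspace` treat the upper boundary. The ranges are illustrative only.

```python
import numpy as np

stepped = np.arange(0, 10, 2)    # start, stop, step: stop is excluded
spaced = np.linspace(0, 10, 6)   # start, stop, count: stop is included

print(stepped)                   # five values, 10 is not among them
print(spaced)                    # six values, the last one is 10
```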

In [None]:
np.random.random() # Gives a single number in the half-open interval [0.0, 1.0): 1.0 itself is excluded

In [None]:
np.random.random(10) # Creates a vector of 10 elements

In [None]:
np.random.random( (3,4) ) #Creates a 3 x 4 matrix
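
Random draws change on every run. If you need repeatable results, one option is to fix the seed first. The sketch below uses the legacy `np.random.seed` interface to match the calls above; the bounds `low` and `high` are arbitrary illustration values.

```python
import numpy as np

np.random.seed(0)
first = np.random.random(5)

np.random.seed(0)                # same seed, same sequence
second = np.random.random(5)

# Rescale draws from [0, 1) to [low, high).
low, high = 5.0, 8.0
scaled = low + (high - low) * np.random.random((3, 4))

print(np.array_equal(first, second))
print(scaled.min() >= low, scaled.max() < high)
```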

### Descriptive Stats
Through helper functions, you can obtain stats of arrays along multiple dimensions, as well as modify the data to your benefit

In [None]:
a = np.arange(5,10)

In [None]:
np.square(a)

In [None]:
np.sqrt(a)

In [None]:
print(np.mean(a), np.median(a), np.min(a), np.max(a), np.std(a))

In [None]:
np.cumsum(a)

In [None]:
a = np.arange(25).reshape(5,5)
a

In [None]:
print(a.mean(axis=0)) # average per column
print(a.mean(axis=1)) # average per row
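
The `axis` argument names the dimension that gets collapsed, which is easy to mix up. A small non-square sketch makes the resulting shapes visible:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # 2 rows, 3 columns: [[0, 1, 2], [3, 4, 5]]

col_sums = a.sum(axis=0)         # collapses the rows: one value per column
row_sums = a.sum(axis=1)         # collapses the columns: one value per row
total = a.sum()                  # no axis: a single value for the whole array

print(col_sums.shape, row_sums.shape)
print(col_sums, row_sums, total)
```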

In [None]:
a = np.random.random(10)
print(np.sort(a)) # np.sort returns a sorted copy
print(a)          # a itself is unchanged

In [None]:
a.sort() # Be careful, this will sort a in place
print(a)

In [None]:
np.random.shuffle(a) # Alternatively, you can shuffle your elements too (also in place)
print(a)

### Plotting with matplotlib
The `pyplot` submodule is located in the `matplotlib` package.
Remember to include `%matplotlib inline` if you want the graphs to render inside the IPython notebook (otherwise they may open in a separate window or not appear at all).

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
a = np.random.random(10000)
f = plt.hist(a,bins=20)

In [None]:
a = np.random.randn(10000) # Standard normal distribution (mean 0, variance 1)
f = plt.hist(a, bins=20)

In [None]:
print(a.mean(), a.var()) # Should be close to 0 and 1
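
`np.random.randn` always draws from the standard normal. To get a different mean and spread, one common approach is to scale and shift the draws. The values of `mu` and `sigma` below are arbitrary, and the seed is fixed only to make the sketch repeatable.

```python
import numpy as np

np.random.seed(1)

mu, sigma = 5.0, 2.0             # target mean and standard deviation (illustrative)
samples = mu + sigma * np.random.randn(100000)

# The sample statistics should land close to mu and sigma squared.
print(samples.mean(), samples.var())
```

Note that the scale factor is the standard deviation, so the variance of the result is `sigma ** 2`.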

#### Stacking arrays
Often, you may need to append columns or add rows to an existing matrix.

`np.hstack` and `np.vstack` are functions to horizontally append columns and vertically add rows respectively

In [None]:
a = np.array([1,2,3])
b = np.array([4,5,6])
print(a)
print(b)

In [None]:
np.hstack( (a,b) ) #Define all the elements to stack in a tuple

In [None]:
np.vstack( (a,b) )

In [None]:
a = np.array([ [1],[2],[3] ])
b = np.array([ [4],[5],[6] ])

In [None]:
np.hstack( (a,b) )

In [None]:
np.vstack( (a,b) )
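
The four calls above differ only in the input shapes. The sketch below summarizes the resulting shapes and shows what happens when the shapes are incompatible; the arrays `wide` and `square` are made up for illustration.

```python
import numpy as np

row = np.array([1, 2, 3])            # shape (3,)
col = np.array([[1], [2], [3]])      # shape (3, 1)

shapes = {
    'hstack 1-D': np.hstack((row, row)).shape,   # (6,): 1-D inputs are concatenated
    'vstack 1-D': np.vstack((row, row)).shape,   # (2, 3): each input becomes a row
    'hstack 2-D': np.hstack((col, col)).shape,   # (3, 2): columns side by side
    'vstack 2-D': np.vstack((col, col)).shape,   # (6, 1): rows on top of each other
}
print(shapes)

# hstack needs matching row counts, vstack needs matching column counts.
wide = np.zeros((2, 3))
square = np.zeros((3, 3))
try:
    np.hstack((wide, square))        # 2 rows vs 3 rows: not allowed
    hstack_ok = True
except ValueError:
    hstack_ok = False
stacked = np.vstack((wide, square))  # both have 3 columns: allowed
print(hstack_ok, stacked.shape)
```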

Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html

# Exercise

1a) Create a 4x5 array of integers between 2 and 21. Save it as `matA`

1b) Create another 5x10 array of random numbers in the interval [0,1). Save it as `matB`

1c) Obtain the dot product (matrix multiplication) of the two matrices. Save this as `matC`.

1d) For some reason, analysts believe that `matB` and `matC` can be better used when they are combined into one matrix. Stack the two matrices and save as `matD`. (Hint: `matB` is 5x10 and `matC` is 4x10, so only one of `np.hstack` and `np.vstack` will accept them.)

2a) Create a 6x6 matrix with ones on the main diagonal, i.e. at (1,1), (2,2), ..., (6,6), and zeros everywhere else. Save as `matA`

2b) Create another 6x6 matrix where all of the values are uniformly random values between 1 and 10. Save as `matB`

2c) Add `matA + matB`, save the result as `matC`, and determine the columnwise means, rowwise means, and overall mean of `matC`

3a) Create a 500x5 array drawn from a normal distribution with a mean of 20 and variance of 100. Save it to a variable called `biggie`

3b) Plot a histogram of this distribution using 20 break points.

3c) Determine the standard deviation of `biggie`. Now, for any value in `biggie` greater than `mean + 2 * std`, set it equal to `mean + 2 * std`. Any value less than `mean - 2 * std`, set it equal to `mean - 2 * std`.

3d) Replot a histogram of this distribution using 30 break points. What could a technique like 3c be used for in the real world?