# Linear Algebra

#### Importing Packages
Python has an extensive collection of third-party libraries, or ***packages***, with additional functions, data structures, etc.  Most packages of interest are hosted on the Python Package Index (***PyPI***) and can be installed into your environment using ***pip***.  The Anaconda distribution in the pre-work includes a number of packages that are useful in data science, so you should have most of them installed already.

ref:  https://pypi.python.org/pypi

A note on namespaces when importing - there are a few different ways to import:

    1. import numpy

    - everything numpy exports (functions, classes, submodules) is accessible through the numpy namespace
    - e.g. numpy.array([1,2,3])


    2. import numpy as np

    - same as 1 except an alias 'np' is created for the namespace instead
    - e.g. np.array([1,2,3])


    3. from numpy import *

    - adds every public name from numpy to the global namespace
    - e.g. array([1,2,3])
    - Note: This can be dangerous: if different modules export the same name, then whatever is imported last will overwrite what came before it, i.e. naming collision -> overwriting!
    
    4. from numpy import array
    
    - imports only the indicated names into the global namespace
    - e.g. array([1,2,3])
    - Note: can be ok since you are being explicit
    
    We will generally use 2 and 4 (sparingly)
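
The naming collision mentioned under option 3 can be sketched with two explicit imports of the same name. This is a minimal illustration, not part of the original material: the helper names `math_sqrt`, `name_was_rebound`, `elementwise` and `math_accepts_list` are made up for the example.

```python
# Sketch of a naming collision: two imports bind the same global name.
from math import sqrt        # `sqrt` now refers to math.sqrt (scalars only)
math_sqrt = sqrt             # keep a handle on the first binding

from numpy import sqrt       # silently rebinds `sqrt` to numpy.sqrt

name_was_rebound = sqrt is not math_sqrt
elementwise = sqrt([1, 4, 9])          # numpy.sqrt accepts a list

try:
    math_sqrt([1, 4, 9])               # math.sqrt raises TypeError on a list
    math_accepts_list = True
except TypeError:
    math_accepts_list = False

print(name_was_rebound)
print(elementwise)
print(math_accepts_list)
```

No warning is raised when the second import replaces the first, which is why the aliased form `import numpy as np` is the safer default.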

### NumPy

As we've seen in lecture, linear algebra is the branch of mathematics concerned with vector spaces and the linear maps between them. This core concept is very important because many data science techniques rely upon it (e.g. Linear Regression, Support Vector Machines, Recommender Systems, Neural Networks).

NumPy is a package designed to be used in scientific computing, and specifically around building and manipulating N-dimensional array objects.

In [None]:
import numpy

import numpy as np

from numpy import absolute

# The next one is dangerous to do, and not recommended 
# except in cases where you know why you're using it
# from numpy import *

Now we can do the same thing three ways

In [None]:
print(numpy.absolute(-10))
print(np.absolute(-10))
print(absolute(-10))

We can create the Linear Algebra Objects we saw in lecture

In [None]:
vector = np.array([1, 2, 1])

In [None]:
data = np.array([[1, 2, 3],[2, 4, 9]])
data

In [None]:
data[0]  # first row

In [None]:
data[ : , 1]  # all rows, second column

Numpy lets us perform matrix operations

In [None]:
np.dot(data, vector)
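
To make the shapes in this product explicit, here is a small self-contained sketch that recomputes it row by row. The values match the cells above; the names `product` and `by_hand` are introduced only for illustration.

```python
import numpy as np

data = np.array([[1, 2, 3], [2, 4, 9]])   # shape (2, 3)
vector = np.array([1, 2, 1])              # shape (3,)

# Matrix-vector product: one dot product per row of `data`
product = np.dot(data, vector)            # shape (2,)

# The same numbers computed by hand, row by row
by_hand = np.array([sum(row * vector) for row in data])

print(product)   # [ 8 19]
print(by_hand)   # [ 8 19]
```

The inner dimensions must agree: a (2, 3) array times a length-3 vector gives a length-2 result.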

We can transpose an array

In [None]:
print(data)
print(data.T)

#### Creating a square matrix array

In [None]:
a = np.arange(25).reshape(5,5)
# arange(n) is a function that creates a 1-D array of the integers 0 through n-1
# reshape(M,N) is a method that reshapes that array into an M x N array
a

In [None]:
biga = a*10
biga

In [None]:
print(biga.mean())
print(biga.mean(0))  # average per column
biga.mean(1)  # average per row
# type(biga.mean(1))
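
The axis argument is easier to follow on a small array. The sketch below uses a made-up 2x3 array called `small` (not from the lecture) to show which dimension each call collapses.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])      # 2 rows, 3 columns

overall = small.mean()             # mean of all six values
per_column = small.mean(axis=0)    # collapses the rows: one value per column
per_row = small.mean(axis=1)       # collapses the columns: one value per row

print(overall)      # 3.5
print(per_column)   # [2.5 3.5 4.5]
print(per_row)      # [2. 5.]
```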

Creating a matrix with numpy

In [None]:
bigm = np.matrix(biga-20)
bigm

Creating the Inverse of a Matrix

In [None]:
# Caution: biga-20 is singular (rank 2), so expect a LinAlgError or huge, meaningless values
np.linalg.inv(biga-20)
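
An inverse exists only for a square matrix of full rank. As a hedged sketch, the snippet below first checks the rank of the array used above and then inverts a matrix that does have an inverse. The 2x2 matrix `invertible` is made up for illustration and is not from the lecture.

```python
import numpy as np

# Same values as biga - 20 in the cell above
singular = np.arange(25).reshape(5, 5) * 10 - 20

# Every entry has the form 50*i + 10*j - 20, so the rank is far below 5
rank = np.linalg.matrix_rank(singular)

# A small matrix with a non-zero determinant (2*3 - 1*1 = 5)
invertible = np.array([[2.0, 1.0],
                       [1.0, 3.0]])
inverse = np.linalg.inv(invertible)

# A matrix times its inverse should give the identity (up to rounding)
identity_check = np.allclose(np.dot(invertible, inverse), np.eye(2))

print(rank)
print(inverse)
print(identity_check)
```

Checking the rank (or the determinant) before calling `np.linalg.inv` avoids trusting an answer that is only numerical noise.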

#### Slices

In [None]:
bigm = np.array(bigm)
print(bigm)
bigm[0]

In [None]:
#Same thing, but demonstrating the full slice with a colon
print(biga)
biga[0,:]

#### Describing your Arrays

In [None]:
compa = np.arange(30).reshape(5,3,2)
compa

In [None]:
# lets describe it
print(compa.shape)
print(compa.ndim)
print(compa.dtype)

In [None]:
compa[3,:,1]

In [None]:
# We can assign values using list-like indexing
# But be careful with types: compa holds integers, so 5.9 is truncated to 5
compa[0,0,0] = 5.9
compa[0,0,0]

We can change the datatype when needed

In [None]:
compa = compa.astype(float)
compa[0,0,0] = 5.75
compa[0,0,0]
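
The effect of the array's dtype on assignment can be checked side by side. This is a small sketch with made-up arrays `ints` and `floats`, separate from `compa` above.

```python
import numpy as np

ints = np.arange(6).reshape(2, 3)   # integer dtype
ints[0, 0] = 5.9                    # the fractional part is dropped on assignment

floats = ints.astype(float)         # astype returns a new array; ints is unchanged
floats[0, 0] = 5.75                 # a float array keeps the fractional part

print(ints.dtype)
print(ints[0, 0])      # 5
print(floats.dtype)
print(floats[0, 0])    # 5.75
```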

#### Stacking arrays

Arrays being stacked must have the same size along every dimension except the one they are stacked along

In [None]:
a = np.array((1,2,3))
b = np.array((2,3,4))
print('H Stack')
print(np.hstack((a,b)))
print('V Stack')
print(np.vstack((a,b)))

In [None]:
a = np.array([[1],[2],[3]])
b = np.array([[2],[3],[4]])
print('H Stack')
print(np.hstack((a,b)))
print('V Stack')
print(np.vstack((a,b)))
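
Comparing the result shapes makes the difference between the two cells above easier to see. The sketch reuses the same values; the names `col_a`, `col_b`, `flat_h` and so on are introduced only for this example.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

# 1-D inputs: hstack joins end to end, vstack makes each input a row
flat_h = np.hstack((a, b)).shape      # (6,)
flat_v = np.vstack((a, b)).shape      # (2, 3)

# Column vectors of shape (3, 1)
col_a = a.reshape(3, 1)
col_b = b.reshape(3, 1)
col_h = np.hstack((col_a, col_b)).shape   # (3, 2)
col_v = np.vstack((col_a, col_b)).shape   # (6, 1)

# Mismatched lengths cannot be stacked vertically
try:
    np.vstack((a, np.array([1, 2])))
    mismatch_allowed = True
except ValueError:
    mismatch_allowed = False

print(flat_h)
print(flat_v)
print(col_h)
print(col_v)
print(mismatch_allowed)
```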

### Using Random Numbers

Random numbers are very helpful and are necessary at times for testing data pipelines and running statistical analyses. Functions for creating random values are under numpy.random.

In [None]:
#Create a randomized array
rm = np.random.rand(5,5)
rm

In [None]:
rm.shape

You can shuffle values randomly as well

In [None]:
# This will shuffle along the first index of a multi-dimensional array
np.random.shuffle(rm)
rm

In [None]:
print(rm.mean())
print(rm.mean(0))  # average per column
print(rm.mean(1))  # average per row

In [None]:
# rand() draws from a uniform distribution on [0, 1)
# For a normal distribution, use np.random.normal(mean, stdev, shape)
rm = np.random.normal(5,9,(30,30))
rm

In [None]:
print("{} which is hopefully close to the input mean".format(rm.mean()))
print("{} which is the variance (stdev squared)".format(rm.var()))
print(np.median(rm))
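
The second argument to `np.random.normal` is the standard deviation, not the variance. The sketch below checks this on a larger sample; the seed value and the sample size are arbitrary choices made only so the run is repeatable and the estimates are tight.

```python
import numpy as np

np.random.seed(0)                          # arbitrary seed for repeatability
sample = np.random.normal(5, 9, 100000)    # mean 5, standard deviation 9

sample_mean = sample.mean()    # should land close to 5
sample_std = sample.std()      # should land close to 9
sample_var = sample.var()      # should land close to 9**2 = 81

print(sample_mean)
print(sample_std)
print(sample_var)
```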

Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html

### Exercise 1

1) Create a 4x5 array of integers between 0 and 19

2) Create a 50x500 array with a mean of 20 and variance of 100. Save it to a variable called  biggie

3) Change the mean of the array to a value within 1 of 0 and the variance within 1 of 25. Think about what the mean and the variance represent and try using various mathematical operations.

## Pandas (python package)

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

Pandas is great for tabular/indexed data

In [None]:
# NOTE: you should normally put all your imports at the top of the file
import pandas as pd

In [None]:
data = pd.read_csv('../data/nytimes.csv')

In [None]:
# Note here we're calling the head method on the dataframe to return the 'head' of the 
# dataframe, in this case the first 4 lines
# head() actually creates a new copy of the data, this is important later in the course!
data.head(4)

In [None]:
data[0:4]

In [None]:
# Each DataFrame has an index
# Sometimes you will need to reindex
data.index

In [None]:
# This is a Series
# A DataFrame is made up of several Series with the same index
data.Age

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
# .values exposes the column's data as a numpy array
type(data.Age.values)

#### Just like in numpy, we can use mean, var, and other functions on the data

In [None]:
print(data.Age.mean())
print(data.Age.var())
print(data.Age.max())
print(data.Age.min())
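
One detail worth knowing when comparing the two libraries: pandas computes the sample variance by default (`ddof=1`), while numpy computes the population variance (`ddof=0`). The sketch below uses a made-up Series called `ages`, not the NYTimes data.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20, 30, 40, 50])    # illustrative values, mean 35

pandas_var = ages.var()                      # divides by n - 1 (ddof=1)
numpy_var = ages.values.var()                # divides by n (ddof=0)
numpy_sample_var = ages.values.var(ddof=1)   # matches the pandas default

print(ages.mean())         # 35.0
print(pandas_var)
print(numpy_var)           # 125.0
print(numpy_sample_var)
```

The means agree exactly; the variances agree only once both calls use the same `ddof`.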

In [None]:
# Function that groups users by age.
def map_age_category(x):
    if x < 18:
        return '1'
    elif x < 25:
        return '2'
    elif x < 32:
        return '3'
    elif x < 45:
        return '4'
    else:
        return '5'

data['age_categories'] = data['Age'].apply(map_age_category)
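
As an alternative sketch, the same binning can be expressed with `pd.cut`, which avoids calling a Python function once per row. The sample `ages` below are made up, and the `bins` and `labels` lists are my reading of the thresholds in the function above rather than anything from the original notebook.

```python
import numpy as np
import pandas as pd

def map_age_category(x):
    # Same thresholds as the function in the cell above
    if x < 18:
        return '1'
    elif x < 25:
        return '2'
    elif x < 32:
        return '3'
    elif x < 45:
        return '4'
    else:
        return '5'

ages = pd.Series([10, 18, 24, 25, 31, 32, 44, 45, 70])   # made-up ages

# right=False makes each bin include its left edge, e.g. [18, 25)
bins = [-np.inf, 18, 25, 32, 45, np.inf]
labels = ['1', '2', '3', '4', '5']
by_cut = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str)

by_apply = ages.apply(map_age_category)

print(by_cut.tolist())
print(by_apply.tolist())
```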

In [None]:
data.head()

#### Sorting data

In [None]:
data.sort_index(axis=1, ascending=False).head()

In [None]:
data.sort_values('Signed_In').head()

In [None]:
ran_data = [
    ['a', 1]
    , ['b', 2]
    , ['c', 3]
]
df = pd.DataFrame(ran_data, columns=['col_a', 'numeric'])
df

#### Indexing functions

Pandas DataFrames support various methods for indexing:

- .iloc
- .loc
- .ix (deprecated in pandas 0.20 and removed in 1.0)

In [None]:
df

In [None]:
# iloc accesses a row by its row number
df.iloc[0]

In [None]:
df.set_index('col_a', inplace = True)
df

In [None]:
# loc accesses a dataframe row by its index label (or column label)
df.loc['a'] = 5
df.loc['b'] = 3

In [None]:
df

In [None]:
# This can be used to add new columns
df.loc[:,'C'] = df.loc[:,'numeric']

df

In [None]:
# is equivalent to:
df['D'] = df.loc[:,'numeric']

df

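.ix was the generic form of indexers: it looked up by label first and fell back to position. It was removed in pandas 1.0, so the cell below uses .iloc and .loc instead.

Values can be set by index and index + column

In [None]:
df.iloc[2] = 2
df.loc[df.index[2], 'C'] = 3

df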
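
The difference between label-based and position-based access can be sketched on a fresh frame. The DataFrame `frame` below is made up for the example so it does not depend on the state of `df`.

```python
import pandas as pd

frame = pd.DataFrame({'numeric': [1, 2, 3]}, index=['a', 'b', 'c'])

by_label = frame.loc['c', 'numeric']     # row labelled 'c'
by_position = frame.iloc[2, 0]           # third row, first column: same cell

frame.loc['c', 'numeric'] = 30           # set by label
frame.iloc[0, 0] = 10                    # set by position

print(by_label)      # 3
print(by_position)   # 3
print(frame['numeric'].tolist())   # [10, 2, 30]
```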

### Combining DataFrames
#### Appending
We can append dataframes together

In [None]:
# On pandas >= 2.0, where DataFrame.append was removed, use pd.concat([df, df]) instead
df_combine = df.append(df)
df_combine

In [None]:
# When DataFrames are appended together, we often need to create a new index
df_combine.reset_index()

#### Join lets us join together dataframes using their index

In [None]:
df_2 = pd.DataFrame([1, 2, 3], columns=['col'])
df_2

In [None]:
# The default is a left join on the index, so null values (NaN) are placed
# where values are missing
df.join(df_2)


In [None]:
# reset the index
df_1 = df.reset_index()
df_1

In [None]:
# try joining again:
df_1.join(df_2)
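
The two joins above behave differently because `join` aligns rows on the index. A minimal sketch, using made-up frames `left` and `right` rather than `df` and `df_2`:

```python
import pandas as pd

left = pd.DataFrame({'numeric': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'col': [1, 2, 3]})     # default index 0, 1, 2

# No index labels in common, so every value of `col` is missing
no_overlap = left.join(right)

# After reset_index both frames are indexed 0, 1, 2 and the rows line up
aligned = left.reset_index().join(right)

print(no_overlap)
print(aligned)
```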


#### Merge allows us to join on any fields

In [None]:
# Merge defaults to an inner join,
# so rows without a match are omitted
df.merge(df_2, left_on='numeric', right_on='col')
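
The `how` parameter controls which unmatched rows survive a merge. The sketch below contrasts the default inner join with a left join, using made-up frames `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'numeric': [1, 2, 5]})
right = pd.DataFrame({'col': [1, 2, 3], 'label': ['x', 'y', 'z']})

# Inner join (the default): only rows where numeric == col are kept
inner = left.merge(right, left_on='numeric', right_on='col')

# Left join: every row of `left` is kept, unmatched rows get NaN
left_joined = left.merge(right, how='left', left_on='numeric', right_on='col')

print(inner)
print(left_joined)
```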

### Concat combines a list of DataFrames together

In [None]:
# It can be used like append
pd.concat([df, df])

In [None]:
# But concat will create a sparse DataFrame when columns don't match
# This can create huge dataframes when mismatches occur
pd.concat([df, df_2])
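
How the result grows when columns do not match can be seen from the shapes. This sketch uses two small made-up frames, `first` and `second`.

```python
import pandas as pd

first = pd.DataFrame({'numeric': [1, 2]})
second = pd.DataFrame({'col': [3, 4]})

same_columns = pd.concat([first, first])      # rows are stacked, index repeats
mixed_columns = pd.concat([first, second])    # union of columns, gaps are NaN
renumbered = pd.concat([first, first], ignore_index=True)   # fresh 0..n-1 index

print(same_columns.shape)     # (4, 1)
print(mixed_columns.shape)    # (4, 2)
print(mixed_columns)
print(renumbered.index.tolist())
```

Half of the cells in `mixed_columns` are missing, which is how mismatched columns inflate the result.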

## Exercise 2
### Combining numpy and pandas

1) Create 2 arrays of integers

One should be created using np.random

2) Turn those arrays into pandas DataFrames

The columns can be named numerically

3) Use some of the summary functions on the dataframes and arrays

Show how mean gives the same response in pandas and numpy, and how var matches once both use the same ddof (pandas defaults to ddof=1, numpy to ddof=0)


4) Add an extra row (a new index label) using .loc

5) Using merge or join, create a single DataFrame from the two

6) Try testing out the groupby functions

df.groupby(column).agg (agg can be an aggregate function, try sum, max, min...)

Resources can be found here: http://pandas.pydata.org/pandas-docs/stable/10min.html#grouping
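
As a starting point for step 6, here is a minimal sketch of the groupby pattern. The frame `visits` and its column names are invented for the example.

```python
import pandas as pd

visits = pd.DataFrame({
    'group': ['x', 'y', 'x', 'y', 'x'],
    'clicks': [1, 2, 3, 4, 5],
})

# One aggregate per group
totals = visits.groupby('group')['clicks'].sum()

# Several aggregates at once: one column per function
summary = visits.groupby('group')['clicks'].agg(['sum', 'max', 'min'])

print(totals)
print(summary)
```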


## Plotting!

In [None]:
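import datetime
import matplotlib.pyplot as plt
# pandas.io.data was removed from pandas (0.19) and now lives in the pandas-datareader package
from pandas_datareader import data as web

%matplotlib inline

mu, sigma = 0, 0.1
normal_dist = np.random.normal(mu, sigma, 1000)
# Note: despite the variable name, this fetches Facebook ('FB') prices
aapl = web.get_data_yahoo('FB',
                          start=datetime.datetime(2015, 4, 1),
                          end=datetime.datetime(2015, 4, 28))
aapl.head()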

## Matplotlib

Matplotlib is the standard low-level library for building visualizations. Although tried and true, it can be cumbersome compared to higher-level packages such as Seaborn or Bokeh. Note that many Python visualization packages, including Seaborn and pandas' own .plot(), are built on matplotlib.

In [None]:
fig = plt.figure(figsize=(20,16))

ax = fig.add_subplot(2,2,1)
ax.plot(aapl.index, aapl['Close'])
ax.set_title('Line plots', size=24)

ax = fig.add_subplot(2,2,2)
ax.plot(aapl['Close'], 'o')
ax.set_title('Scatter plots', size=24)

ax = fig.add_subplot(2,2,3)
ax.hist(normal_dist, bins=50)
ax.set_title('Histograms', size=24)
ax.set_ylabel('count', size=16)

ax = fig.add_subplot(2,2,4)
ax.boxplot(normal_dist)
ax.set_title('Boxplots', size=24)

### Bokeh
To install Bokeh, go to a terminal and type:

`conda install bokeh` 

Bokeh is built by the same people who created Anaconda (Continuum Analytics) and is designed out of the box for web display, making it nice for creating presentation-ready, interactive visuals quickly. Labs in this course will be shown in Bokeh. Check out http://bokeh.pydata.org/en/latest/docs/quickstart.html#concepts to see some of the range of capabilities.

In [None]:
from bokeh.plotting import figure, output_notebook, show
output_notebook()

In [None]:
# prepare some data
x = aapl.Low
y = aapl['High']

# create a new plot with a title and axis labels
p = figure(title="Stock High vs. Low", x_axis_label='Low', y_axis_label='High')

# add a circle (scatter) renderer with a legend entry and outline thickness
p.circle(x, y, legend="High vs. Low", line_width=2)

# show the results
show(p)

In [None]:
x = aapl.index
y = aapl.Close
p = figure(title="Stock Open & Close over time", x_axis_label='Date', y_axis_label='Price', x_axis_type="datetime")
# Note that x_axis_type is declared so the dates on the x axis render as dates
p.square(x, y, legend="Close")
p.circle(x,aapl.Open,legend='Open',color='red')
# show the results
show(p)

## Pandas Plotting!

The plot method is a great, quick way to visualize your dataframes. Select the columns you care to view and call .plot() on the resulting dataframe; the default is a line chart plotted against the index.

We will be revisiting this so just take a second to appreciate what can be done with one line of code.

In [None]:
aapl[['Open','Close']].plot()

In [None]:
aapl[['High','Low','Open','Close']].plot(kind='box')