## Connect Intensive - Machine Learning Nanodegree

### Introduction to `numpy`

NumPy (https://www.numpy.org/) is the fundamental package for scientific computing with Python. It contains, among other things:
 - a powerful N-dimensional array object
 - sophisticated (broadcasting) functions
 - tools for integrating C/C++ and Fortran code
 - useful linear algebra, Fourier transform, and random number capabilities
 - and more ..
 


The following code is to help you play with NumPy, a library 
that provides functions that are especially useful when you have to
work with large arrays and matrices of numeric data, for example when 
doing matrix-matrix multiplication. NumPy is also battle-tested and 
optimized, so it typically runs much faster than equivalent code that
works with Python lists directly.
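
To get a feel for the speed claim above, here is a minimal timing sketch. The array size, the repeat count, and the function names are arbitrary choices made for this illustration, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np

n = 100000  # arbitrary size chosen for this illustration
values_list = list(range(n))
values_array = np.arange(n, dtype=float)


def squares_with_list():
    # Pure-Python loop: one interpreter step per element
    return [v * v for v in values_list]


def squares_with_numpy():
    # Vectorized: the loop over elements runs in compiled code inside NumPy
    return values_array * values_array


list_seconds = timeit.timeit(squares_with_list, number=20)
numpy_seconds = timeit.timeit(squares_with_numpy, number=20)
print("list:  {:.4f} s".format(list_seconds))
print("numpy: {:.4f} s".format(numpy_seconds))
```

Both functions compute the same values; only the place where the loop runs changes.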



In [15]:
import numpy as np


'''
The array object class is the foundation of Numpy. Numpy arrays are like
lists in Python, except that everything inside an array must be of the
same type, like int or float.
'''
# Change False to True to see Numpy arrays in action
if True:
    array1 = np.array([1, 4, 5, 8], float)
    print("Our array is called a 1-dimensional array as it contains values along 1 axis -- a row\n{}".format(array1))
    array2 = np.array([[1, 2, 3], [4, 5, 6]], float)  # a 2D array/Matrix   
    print("\nThe next example is called a 2-dimensional array because it has numbers in more than 1 row as")
    print("well as in columns\n{}".format(array2))

    array3 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], float)  # a 2D array/Matrix
    print("\nThe next example is still a 2-dimensional array because it has numbers in rows and in columns\n{}".format(array3))
    

Our array is called a 1-dimensional array as it contains values along 1 axis -- a row
[ 1.  4.  5.  8.]

The next example is called a 2-dimensional array because it has numbers in more than 1 row as
well as in columns
[[ 1.  2.  3.]
 [ 4.  5.  6.]]

The next example is still a 2-dimensional array because it has numbers in rows and in columns
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]]


What would a 3-dimensional array look like? We can answer that intuitively by looking closely at the one- and 
two-dimensional arrays. In the examples above, we see that the two-dimensional array looks like a number of rows one after the
other, i.e., rows stacked together. So we might expect that a 3-dimensional array would look like a number of _smaller_ units 
(2-dimensional arrays maybe?) stacked together.

In [16]:
array4 = np.array([[[1, 2], [3,4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]], float)
print("A 3-dimensional array:\n{}".format(array4))

A 3-dimensional array:
[[[  1.   2.]
  [  3.   4.]]

 [[  5.   6.]
  [  7.   8.]]

 [[  9.  10.]
  [ 11.  12.]]]


We can continue stacking numpy arrays of lower dimensions to build up even higher dimensional arrays. 
This naturally raises the question: if someone (or some snippet of code) gave us an array, how could we
find out its dimensionality? Numpy arrays provide a convenient attribute, `shape`, for this. You can read off the
dimensionality from the number of elements in the tuple that `shape` returns.

In [26]:
print("The shape of the first array we built is {}\t\t and it is {}-dimensional.".format(array1.shape, len(array1.shape)))
print("The shape of the second array we built is {}\t and it is {}-dimensional.".format(array2.shape, len(array2.shape)))
print("The shape of the third array we built is {}\t and it is {}-dimensional.".format(array3.shape, len(array3.shape)))
print("The shape of the last array we built is {}\t and it is {}-dimensional.".format(array4.shape, len(array4.shape)))

The shape of the first array we built is (4L,)		 and it is 1-dimensional.
The shape of the second array we built is (2L, 3L)	 and it is 2-dimensional.
The shape of the third array we built is (3L, 3L)	 and it is 2-dimensional.
The shape of the last array we built is (3L, 2L, 2L)	 and it is 3-dimensional.
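
As a cross-check on counting the entries of `shape` by hand, arrays also carry the attributes `ndim` (the number of axes) and `size` (the total number of elements). The sketch below rebuilds the 3-dimensional array from above so that it runs on its own.

```python
import numpy as np

array4 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]], float)

print(array4.shape)  # one entry per axis
print(array4.ndim)   # number of axes, same as len(array4.shape)
print(array4.size)   # total number of elements, the product of the shape entries
```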


In [None]:
'''
You can index, slice, and manipulate a Numpy array much like you would
a Python list.
'''
# Change False to True to see array indexing and slicing in action
if False:
    array = np.array([1, 4, 5, 8], float)
    print(array)
    print("")
    print(array[1])
    print("")
    print(array[:2])
    print("")
    array[1] = 5.0
    print(array[1])

# Change False to True to see Matrix indexing and slicing in action
if False:
    two_D_array = np.array([[1, 2, 3], [4, 5, 6]], float)
    print(two_D_array)
    print("")
    print(two_D_array[1][1])
    print("")
    print(two_D_array[1, :])
    print("")
    print(two_D_array[:, 2])

'''
Here are some arithmetic operations that you can do with Numpy arrays
'''
# Change False to True to see Array arithmetics in action
if False:
    array_1 = np.array([1, 2, 3], float)
    array_2 = np.array([5, 2, 6], float)
    print(array_1 + array_2)
    print("")
    print(array_1 - array_2)
    print("")
    print(array_1 * array_2)

# Change False to True to see Matrix arithmetics in action
if False:
    array_1 = np.array([[1, 2], [3, 4]], float)
    array_2 = np.array([[5, 6], [7, 8]], float)
    print(array_1 + array_2)
    print("")
    print(array_1 - array_2)
    print("")
    print(array_1 * array_2)
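
One place where the comparison with Python lists breaks down is slicing: a list slice is a copy, while a basic Numpy slice is a view onto the same data. The sketch below shows the difference; the variable names are made up for this illustration.

```python
import numpy as np

values_list = [1.0, 4.0, 5.0, 8.0]
values_array = np.array(values_list)

list_slice = values_list[:2]    # a new list (a copy)
array_slice = values_array[:2]  # a view onto the same underlying data

list_slice[0] = 100.0
array_slice[0] = 100.0

print(values_list)   # the original list is unchanged
print(values_array)  # the original array now starts with 100.0

# Use .copy() when an independent array is needed
independent = values_array[:2].copy()
independent[0] = -1.0
print(values_array)  # unaffected by the change to the copy
```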

In addition to the standard arithmetic operations, Numpy also has a range of
other mathematical operations that you can apply to Numpy arrays, such as
mean and dot product.

In [29]:
array_1 = np.array([1, 2, 3], float)
array_2 = np.array([[6], [7], [8]], float)
print(np.mean(array_1))
print(np.mean(array_2))
print("")
print(np.dot(array_1, array_2))

2.0
7.0

[ 44.]


Multiplication of numpy arrays can give you unexpected results if you don't know the rules well. Let's see what happens when
you do a multiplication of the two arrays above. Can you reason through why you got a 3x3 2-dimensional array and where the 
numbers in it came from?

In [31]:
print(array_1*array_2)

[[  6.  12.  18.]
 [  7.  14.  21.]
 [  8.  16.  24.]]


Looking back a few cells, you'll see that the `*` operator performs element-wise multiplication. The result is unambiguous when the two arrays have the same shape, for example when we multiply `np.array([1, 2, 3])` by `np.array([6, 7, 8])`.
We get `[6, 14, 24]` because for every element of the first array, there is a corresponding element in the second one.

In our example the shapes differ: `array_1` has shape `(3,)`, a single row of three elements, while `array_2` has shape `(3, 1)`, three rows of one element each. Numpy handles this with _broadcasting_: each one-element row of `array_2` is treated as a single number and multiplied with the whole of `array_1`. The first row of `array_2`, `[6]`, produces the row `[6, 12, 18]`; the second row, `[7]`, produces `[7, 14, 21]`; and the third, `[8]`, produces `[8, 16, 24]`.
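
To make the stretching step visible, the sketch below uses `np.broadcast_to` to expand both operands to the common shape by hand and then checks that this matches what `*` does on its own. The names `stretched_1` and `stretched_2` are made up for this illustration.

```python
import numpy as np

array_1 = np.array([1, 2, 3], float)        # shape (3,)
array_2 = np.array([[6], [7], [8]], float)  # shape (3, 1)

# Expand both operands to the common shape (3, 3) explicitly
stretched_1 = np.broadcast_to(array_1, (3, 3))  # every row is [1, 2, 3]
stretched_2 = np.broadcast_to(array_2, (3, 3))  # every column is [6, 7, 8]

product = array_1 * array_2
print(product.shape)
# Element-wise product of the expanded arrays matches the broadcast product
print(np.array_equal(product, stretched_1 * stretched_2))
```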


#### Quizzes
What is the difference between `np.array([[6, 7, 8]])` and `np.array([[6], [7], [8]])`?

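**Answer:** Both are 2-dimensional arrays, but their shapes differ. `np.array([[6, 7, 8]])` has shape `(1, 3)`: one row of 3 elements. `np.array([[6], [7], [8]])` has shape `(3, 1)`: three rows of 1 element each. A 1-dimensional array of 3 elements would be written with a single pair of brackets, `np.array([6, 7, 8])`.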
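
One way to check an answer to this question is to print `shape` and `ndim` for each array. The sketch below also includes `np.array([6, 7, 8])`, written with a single pair of brackets, for comparison; the variable names are made up for this illustration.

```python
import numpy as np

row = np.array([[6, 7, 8]])         # one row, three columns
column = np.array([[6], [7], [8]])  # three rows, one column
flat = np.array([6, 7, 8])          # single pair of brackets

for name, arr in [("row", row), ("column", column), ("flat", flat)]:
    print("{}: shape={}, ndim={}".format(name, arr.shape, arr.ndim))

# Transposing the row array gives the column array
print(np.array_equal(row.transpose(), column))
```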

How would you quickly create a 10x10 matrix of the numbers 5, 6, ..., 104 where the first _column_ is [5, 6, 7, ..]?

**Answer:** This introduces three functions in `numpy`. 
1. Python has a function `range` that can produce a sequence of numbers very easily. Numpy has a similar function `numpy.arange`. `np.arange(5, 105)` will return a sequence (in this case a 1-dimensional `np.array`) of 100 elements from 5 through 104.
2. `numpy` has a function `reshape` (also available as an array method) that takes a sequence and produces an N-dimensional array based on its argument, which should be a tuple giving the number of elements in each dimension. So `np.arange(5, 105).reshape((10, 10))` takes the 100-element 1-dimensional array and returns a $ 10 \times 10 $ 2-dimensional array.
3. A common manipulation of matrices is to flip them around the diagonal, so that rows become columns. This is called matrix transposition. `np.transpose` will take the 2-dimensional matrix and return another one with its rows being the columns of the original.
Finally, you can chain these functions together in compact notation:

In [34]:
array10 = np.arange(5, 105).reshape((10, 10)).transpose()
print(array10)

[[  5  15  25  35  45  55  65  75  85  95]
 [  6  16  26  36  46  56  66  76  86  96]
 [  7  17  27  37  47  57  67  77  87  97]
 [  8  18  28  38  48  58  68  78  88  98]
 [  9  19  29  39  49  59  69  79  89  99]
 [ 10  20  30  40  50  60  70  80  90 100]
 [ 11  21  31  41  51  61  71  81  91 101]
 [ 12  22  32  42  52  62  72  82  92 102]
 [ 13  23  33  43  53  63  73  83  93 103]
 [ 14  24  34  44  54  64  74  84  94 104]]


3. Calculate the average of the numbers in each column

**Answer:** This one has a few solutions. The first solution that may come to mind if you aren't familiar with matrices
is to just write two nested loops in pure python.

In [36]:
# Using brute force looping
result0 = np.zeros(10)  # np.zeros produces an array of zeros of the given length
for i in range(10):  # loop over column indices, as we want the sum over each column
    for j in range(10):  # loop over row indices
        result0[i] += array10[j][i]
print(result0/10.0)

[  9.5  19.5  29.5  39.5  49.5  59.5  69.5  79.5  89.5  99.5]


In [39]:
# Mixing python and numpy.
# We learned earlier that we can find the mean of an np.array using np.mean. Let's try that on the columns.
# Let's first use transpose so that the columns become rows, and then we can loop over the rows and use mean()
result1 = np.array([row.mean() for row in array10.transpose()])
print(result1)

[  9.5  19.5  29.5  39.5  49.5  59.5  69.5  79.5  89.5  99.5]


In [45]:
# Turns out numpy has the easiest way of doing this
# We have to specify the axis if we want to apply the mean function over one of the dimensions
# axis=0 gives the column means (averaging over rows); axis=1 gives the row means (averaging over columns)
print(array10.mean(axis=0))

[  9.5  19.5  29.5  39.5  49.5  59.5  69.5  79.5  89.5  99.5]
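
The `axis` argument is easier to see on an array that is not square. The sketch below uses a small 2x3 array, made up for this illustration, so that the column means and the row means have different lengths.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]], float)  # 2 rows, 3 columns

column_means = small.mean(axis=0)  # collapses the rows: one value per column
row_means = small.mean(axis=1)     # collapses the columns: one value per row

print(column_means)  # three values: 2.5, 3.5, 4.5
print(row_means)     # two values: 2.0, 5.0
```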


There is one more way: using matrix multiplication. I'll leave it for you to try entirely on your own if you are interested.

### Pandas

[pandas.pydata.org/](http://pandas.pydata.org/)

[Another tutorial](https://bitbucket.org/hrojas/learn-pandas)

 - A fast and efficient DataFrame object for data manipulation with integrated indexing
 - Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format
 - Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form
 - Flexible reshaping and pivoting of data sets
 - Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
 - Columns can be inserted and deleted from data structures for size mutability
 - Aggregating or transforming data with a powerful `group by` engine allowing `split-apply-combine` operations on data sets
 - High performance merging and joining of data sets
 - Highly optimized for performance, with critical code paths written in Cython or C.
 - and more ..
 
 
Spreadsheets and relational databases are other tools that can do very impressive things with tabular data. Pandas borrows some of the terminology and concepts of working with relational tables. Thus there are functions like `groupby`, `merge`, `drop` and others. These functions do not use the same syntax as their spreadsheet or SQL counterparts, but they aim to have similar effects.
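
As a rough illustration of that correspondence, the sketch below applies `drop`, `groupby`, and `merge` to two tiny tables. The tables `orders` and `cities` and their contents are hypothetical, invented for this example, and the SQL phrases in the comments are loose analogies rather than exact equivalents.

```python
import pandas as pd

# Hypothetical tables, invented for this illustration
orders = pd.DataFrame({'customer': ['ann', 'bob', 'ann', 'cat'],
                       'amount': [10.0, 5.0, 7.5, 2.0],
                       'note': ['a', 'b', 'c', 'd']})
cities = pd.DataFrame({'customer': ['ann', 'bob', 'cat'],
                       'city': ['Oslo', 'Lima', 'Pune']})

# drop: remove a column, like leaving it out of a SELECT list
trimmed = orders.drop('note', axis=1)

# groupby: like SELECT customer, SUM(amount) ... GROUP BY customer
totals = trimmed.groupby('customer')['amount'].sum()

# merge: like an inner JOIN ... ON customer
joined = trimmed.merge(cities, on='customer')

print(totals)
print(joined)
```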

In [23]:
import pandas as pd

'''
The following code is to help you play with the concept of Series in Pandas.

You can think of a Series as a one-dimensional object that is similar to
an array, list, or column in a database. By default, it will assign an
index label to each item in the Series ranging from 0 to N, where N is
the number of items in the Series minus one.

Please feel free to play around with the concept of Series and see what it does

*This playground is inspired by Greg Reda's post on Intro to Pandas Data Structures:
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
'''
# Change False to True to create a Series object
if False:
    series = pd.Series(['Dave', 'Cheng-Han', 'Udacity', 42, -1789710578])
    print(series)

'''
You can also manually assign indices to the items in the Series when
creating the series
'''

# Change False to True to see custom index in action
if False:
    series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                       index=['Instructor', 'Curriculum Manager',
                              'Course Number', 'Power Level'])
    print(series)

'''
You can use index to select specific items from the Series
'''
# Change False to True to see Series indexing in action
if False:
    series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                       index=['Instructor', 'Curriculum Manager',
                              'Course Number', 'Power Level'])
    print(series['Instructor'])
    print("")
    print(series[['Instructor', 'Curriculum Manager', 'Course Number']])

'''
You can also use boolean operators to select specific items from the Series
'''
# Change False to True to see boolean indexing in action
if False:
    cuteness = pd.Series([1, 2, 3, 4, 5], index=['Cockroach', 'Fish', 'Mini Pig',
                                                 'Puppy', 'Kitten'])
    print(cuteness > 3)
    print("")
    print(cuteness[cuteness > 3])


In [24]:
'''
The following code is to help you play with the concept of Dataframe in Pandas.

You can think of a Dataframe as something with rows and columns. It is
similar to a spreadsheet, a database table, or R's data.frame object.

*This playground is inspired by Greg Reda's post on Intro to Pandas Data Structures:
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
'''

'''
To create a dataframe, you can pass a dictionary of lists to the Dataframe
constructor:
1) The key of the dictionary will be the column name
2) The associated list will be the values within that column.
'''
# Change False to True to see Dataframes in action
if False:
    data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
            'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                     'Lions', 'Lions'],
            'wins': [11, 8, 10, 15, 11, 6, 10, 4],
            'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
    football = pd.DataFrame(data)
    print(football)

'''
Pandas also has various functions that will help you understand some basic
information about your data frame. Some of these functions are:
1) dtypes: to get the datatype for each column
2) describe: useful for seeing basic statistics of the dataframe's numerical
   columns
3) head: displays the first five rows of the dataset
4) tail: displays the last five rows of the dataset
'''
# Change False to True to see these functions in action
if True:
    data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
            'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                     'Lions', 'Lions'],
            'wins': [11, 8, 10, 15, 11, 6, 10, 4],
            'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
    football = pd.DataFrame(data)
    print(football.dtypes)
    print("")
    print(football.describe())
    print("")
    print(football.head())
    print("")
    print(football.tail())


losses     int64
team      object
wins       int64
year       int64
dtype: object

          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012

   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012


#### Questions
How many different types of data does `pandas` support?

When you work with tabular (2-dimensional) data, a very common chore is selecting a subset of the columns, or a set of rows that match some criteria. `pandas` provides very flexible methods for this. From the `football` DataFrame above, select the records for the 'Bears' (the result should be another DataFrame, but with just the rows that have 'Bears' in the 'team' column). Also, create another DataFrame that includes all the rows but doesn't include `year`.

In [31]:
football[football.team == 'Bears'].drop('year', axis=1)

   losses   team  wins
0       5  Bears    11
1       8  Bears     8
2       6  Bears    10


From your result above, calculate the average (mean) number of wins and losses for the Bears.

In [34]:
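# numeric_only=True skips the text column 'team', which newer pandas versions would otherwise reject
football[football.team == 'Bears'].drop('year', axis=1).mean(numeric_only=True)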

losses    6.333333
wins      9.666667
dtype: float64

From the original `football` DataFrame above, calculate the average number of wins and losses for each of the three teams. Do not include the average of the year in which they were playing.

In [37]:
football.drop('year', axis=1).groupby('team').mean()

           losses       wins
team
Bears    6.333333   9.666667
Lions    9.333333   6.666667
Packers  3.000000  13.000000


In `football`, should `year` be considered a numeric variable or a categorical variable?

Add a column to `football` with the header "win_rate" and populate it with the fraction of games won (the total number of 
games played by each team in a given year is 16).

In [38]:
football['win_rate'] = football.wins/16.0

In [39]:
football

   losses     team  wins  year  win_rate
0       5    Bears    11  2010    0.6875
1       8    Bears     8  2011    0.5000
2       6    Bears    10  2012    0.6250
3       1  Packers    15  2011    0.9375
4       5  Packers    11  2012    0.6875
5      10    Lions     6  2010    0.3750
6       6    Lions    10  2011    0.6250
7      12    Lions     4  2012    0.2500


Suppose you had additional information on `salary` for these teams and years. Add these values to the `football` table.


In [41]:
salary = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
          'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                   'Lions', 'Lions'],
          'salary': [112.1, 125.2, 120.8, 115.4, 118.4, 96.2, 100.5, 104.1]}
#TODO - Add in code to combine this dictionary with football. [Hint: look at the pandas docs for DataFrame merge]

In [43]:
football.merge(pd.DataFrame(salary), on=['team', 'year'])

   losses     team  wins  year  win_rate  salary
0       5    Bears    11  2010    0.6875   112.1
1       8    Bears     8  2011    0.5000   125.2
2       6    Bears    10  2012    0.6250   120.8
3       1  Packers    15  2011    0.9375   115.4
4       5  Packers    11  2012    0.6875   118.4
5      10    Lions     6  2010    0.3750    96.2
6       6    Lions    10  2011    0.6250   100.5
7      12    Lions     4  2012    0.2500   104.1
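
Note that `merge` returns a new DataFrame and leaves `football` itself unchanged. One possible way to keep the result, sketched below, is to assign it back to the same name. The `how='left'` and `validate='one_to_one'` arguments are optional safety choices added for this illustration, not part of the original solution, and the data is repeated so that the block runs on its own.

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
             'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4],
    'losses': [5, 8, 6, 1, 5, 10, 6, 12]})
salary = pd.DataFrame({
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
             'Lions', 'Lions'],
    'salary': [112.1, 125.2, 120.8, 115.4, 118.4, 96.2, 100.5, 104.1]})

# how='left' keeps every football row even if a salary were missing;
# validate='one_to_one' raises an error if a (team, year) pair is
# duplicated on either side
football = football.merge(salary, on=['team', 'year'], how='left',
                          validate='one_to_one')
print(football)
```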
