In [1]:
from IPython.display import Image
Image(url='http://datascience.uci.edu/wp-content/uploads/sites/2/2014/09/data_science_logo_with_image1.png')

## Preliminaries

1.  If you haven't do so already, download and **install the [Anaconda Scientific Python Distribution](https://store.continuum.io/cshop/anaconda/)**.  If it offers to make itself your default Python distribution, allow it.
1. Whether you've just installed Anaconda, or you have done so previously, you should now **update Anaconda** to the latest version of the distribution.  It changes a lot so do this today even if you did recently.
 1. Open a terminal or command prompt.
 1. Type ```conda update conda``` and press enter or return.  Confirm that you'd like it to make any changes that it offers.
 1. Type ```conda update anaconda``` and press enter or return.  Confirm that you'd like it to make any changes that it offers.
1. **Download the code repository**.  
 1. Go to [bit.ly/uci_predictive](http://bit.ly/uci_predictive) and click the "download zip" button on the right to download a zip file containing this entire repository.
 1. Unzip that file into a directory you know how to find; you'll need it several times throughout the day.  
1. **Start an ipython notebook server**.
 1. Open a terminal and type ```ipython notebook```.  Navigate to the directory where you unzipped this repository.
 1. Open "Test Notebook.ipynb".
 1. Click "Cell" at the top of the opened notebook, followed by "Run All" and ensure that 1) there are no errors and that 2) the output from the first cell is the same as that in the second.  If it doesn't match, raise your hand.
 1. If everything looks good, close the browser tab containing the test notebook but keep open the tab listing all the other notebooks.

### Schedule for Today

|Start Time | Session |
|-----------|---------|
|8:30am     | Check In|
|9:00am     | **The IPython Notebook and Pandas** |
|10:30am    | Coffee & Bagels|
|10:45am    | **Linear Regression and Predictive Modeling** |
|12:30pm    | Lunch|
|1:00pm     | **Out of Sample Prediction** |
|2:45pm     | Afternoon break|
|3:00pm     | **Logistic Regression** |

## Predictive Modeling with Python - _IPython Notebooks and Viewing Data in Python_
#### Author: Kevin Bache

## Outline
1. IPython and IPython Notebooks
1. Numpy
1. Pandas

## Python and IPython
* `python` is a programming language and also the name of the program that runs scripts written in that language.
* If you're running scripts from the command line you can use either `ipython` with something like `ipython my_script.py` or `python` with something like `python my_script.py`
* If you're using the command line interpreter interactively to load and explore data, try out a new package, etc. always use `ipython` over `python`.  This is because `ipython` has a bunch of features like tab completion, inline help, and easy access to shell commands which are just plain great (more on these in a bit).

## IPython Notebook
* IPython notebook is an interactive front-end to ipython which lets you combine snippets of python code with explanations, images, videos, whatever.  
* It's also really convenient for conveying experimental results.
* http://nbviewer.ipython.org

### <span style="color:red">IPython Notebook Exercise #1</span>
1. Start a terminal window and cd to the directory where you stored the course files
1. Start the IPython Notebook server with the command `ipython notebook`.  The IPython notebook server runs your python code behind the scenes and renders the output into the notebook
1. Create a new notebook by clicking New (top right) >> Python 2 Notebook

### Notebook Concepts
* **Cells** -- That grey box is called a cell.  An IPython notebook is nothing but a series of cells.  
* **Selecting** -- You can tell if you have a cell selected because it will have a thin, black box around it.
* **Running a Cell** -- Running a cell displays its output.  You can run a cell by pressing **`shift + enter`** while it's selected (or click the play button toward the top of the screen). 
* **Modes** -- There are two different ways of having a cell selected:
  * **Command Mode** -- Lets you delete a cell and change its type (more on this in a second).
  * **Edit Mode** -- Lets you change the contents of a cell.

### Aside: Keyboard Shortcuts That I Use A Lot
* (When describing keyboard shortcuts, `+` means 'press at the same time', `,` means 'press after'
* **`Enter`** -- Run this cell and make a new one after it
* **`Esc`** -- Stop editing this cell
* **`Option + Enter`** -- Run this cell and make a new cell after it (Note: this is OSX specific.  Check help >> keyboard shortcuts to find your operating system's version)
* **`Shift + Enter`** -- Run this cell and don't make a new one after it
* **`Up Arrow`** and **`Down Arrow`**  -- Navigate between cells (must be in command mode)
* **`Esc, m, Enter`** -- Convert the current cell to markdown and start editing it again
* **`Esc, y, Enter`** -- Convert the current cell to a code cell and start editing it again
* **`Esc, d, d`** -- Delete the current cell
* **`Esc, a`** -- Create a new cell above the current one
* **`Esc, b`** -- Create a new cell below the current one
* **`Command + /`** -- Toggle comments in Python code (OSX)
* **`Ctrl + /`** -- Toggle comments in Python code (Linux / Windows)

### <span style="color:red">IPython Notebook Exercise #2</span>
1. Click Help >> User Interface Tour and take the tour
1. Click Help >> Keyboard Shortcuts.  Mice are for suckers.

### More Notebook Concepts
* **Cell Types** -- There are 3 types of cells: python, markdown, and raw.
  * **Python Cells** -- Contain python code. Running a python cell displays its output. Press **`y`** in command mode convert any selected cell into a python cell. All cells start their lives as python cells.
  * **Markdown Cells** -- Contain formatted text, lists, links, etc. Press **`m`** in command mode to convert the selected cell into a markdown cell.
  * **Raw Cells** -- Useful for a few advanced corner cases.  We won't deal with these at all today.
  
### <span style="color:red">IPython Notebook Exercise #3</span>:
1. Partner up with someone next to you
  1. Create a code cell
    1. IPython notebooks offer tab completion.  Try typing: ```a<tab>``` 
    1. IPython notebooks offer inline help.  Try typing: ```?abs``` and then running the cell
  1. Create and render a markdown cell which contains 
    1. Bold text
    1. A nested numbered list
    1. A working link to UCI's website
    1. An image
    1. Some rendered LaTex. 
  1. **Hint**: The following websites might be useful:
    1. Check out this [Markdown Cheat Sheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).  Markdown is a set of simple commands for formatting text to make it pretty.  It isn't specific to IPython Notebooks; it's used all over (for example, a lot of blogging platforms let you write your content in markdown because it's easy and HTML is a pain in the butt).
    1. Check out this [stackoverflow post](http://stackoverflow.com/questions/13208286/how-to-write-latex-in-ipython-notebook) about using LaTex in IPython Notebooks.

## Numpy
Numpy is the main package that you'll use for doing scientific computing in Python.  Numpy provides a multidimensional array datatype called `ndarray` which can do things like vector and matrix computations.

### Resources:
* [Official Numpy Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)
* [Numpy, R, Matlab Cheat Sheet](http://mathesaurus.sourceforge.net/matlab-python-xref.pdf)
* [Another Numpy, R, Matlab Cheat Sheet](http://sebastianraschka.com/Articles/2014_matrix_cheatsheet_table.html)

In [1]:
# you don't have to rename numpy to np but it's customary to do so
import numpy as np

# you can create a 1-d array with a list of numbers
a = np.array([1, 4, 6])
print 'a:'
print a
print 'a.shape:', a.shape
print 

# you can create a 2-d array with a list of lists of numbers
b = np.array([[6, 7], [3, 1], [4, 0]])
print 'b:'
print b
print 'b.shape:', b.shape
print


a:
[1 4 6]
a.shape: (3,)

b:
[[6 7]
 [3 1]
 [4 0]]
b.shape: (3, 2)



In [2]:
# you can create an array of ones
print 'np.ones(3, 4):'
print np.ones((3, 4))
print

# you can create an array of zeros
print 'np.zeros(2, 5):'
print np.zeros((2, 5))
print

# you can create an array which of a range of numbers and reshape it
print 'np.arange(6):'
print np.arange(6)
print 
print 'np.arange(6).reshape(2, 3):'
print np.arange(6).reshape(2, 3)
print

# you can take the transpose of a matrix with .transpose or .T
print 'b and b.T:'
print b
print 
print b.T
print 

np.ones(3, 4):
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]

np.zeros(2, 5):
[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]

np.arange(6):
[0 1 2 3 4 5]

np.arange(6).reshape(2, 3):
[[0 1 2]
 [3 4 5]]

b and b.T:
[[6 7]
 [3 1]
 [4 0]]

[[6 3 4]
 [7 1 0]]



In [3]:
# you can iterate over rows
i = 0
for this_row in b:
    print 'row', i, ': ', this_row
    i += 1 
print 
    
# you can access sections of an array with slices
print 'first two rows of the first column of b:'
print b[:2, 0]
print

row 0 :  [6 7]
row 1 :  [3 1]
row 2 :  [4 0]

first two rows of the first column of b:
[6 3]



np.hstack([b, b]):
[[6 7 6 7]
 [3 1 3 1]
 [4 0 4 0]]

np.vstack([b, b]):
[[6 7]
 [3 1]
 [4 0]
 [6 7]
 [3 1]
 [4 0]]



In [5]:
# note that you get an error if you pass in print 'np.hstack(b, b):'
print np.hstack([b, b])
print


[[6 7 6 7]
 [3 1 3 1]
 [4 0 4 0]]



In [39]:
# you can perform matrix multiplication with np.dot()
a = np.array([1, 4, 6])
b = np.array([[6, 7], [3, 1], [4, 0]])
print (a)
print (b)
c = np.dot(a, b)
print ('c = np.dot(a, b):')
print (c)
print

# you can perform element-wise multiplication with * 
d = b * b
print ('d = b * b:')
print (d)
print


[1 4 6]
[[6 7]
 [3 1]
 [4 0]]
c = np.dot(a, b):
[42 11]
d = b * b:
[[36 49]
 [ 9  1]
 [16  0]]


<function print>

### Arrays and Matrices
In addition to arrays which can have any number of dimensions, Numpy also has a `matrix` data type which always has exactly 2.  **DO NOT USE `matrix`**.  

The original intention behind this data type was to make Numpy feel a bit more like Matlab, mainly by making the `*` operator perform matrix multiplication so you don't have to use `np.dot`.  But `matrix` isn't as well developed by the Numpy people as `array` is.  `matrix` is slower and using it will sometimes throw errors in other people's code because everyone expects you to use `array`.

In [7]:
# you can convert a 1-d array to a 2-d array with np.newaxis
print 'a:'
print a
print 'a.shape:', a.shape
print 
print 'a[np.newaxis] is a 2-d row vector:'
print a[np.newaxis]
print 'a[np.newaxis].shape:', a[np.newaxis].shape
print

print 'a[np.newaxis].T: is a 2-d column vector:'
print a[np.newaxis].T
print 'a[np.newaxis].T.shape:', a[np.newaxis].T.shape
print


a:
[1 4 6]
a.shape: (3,)

a[np.newaxis] is a 2-d row vector:
[[1 4 6]]
a[np.newaxis].shape: (1, 3)

a[np.newaxis].T: is a 2-d column vector:
[[1]
 [4]
 [6]]
a[np.newaxis].T.shape: (3, 1)



In [8]:
# numpy provides a ton of other functions for working with matrices
m = np.array([[1, 2],[3, 4]])
m_inverse = np.linalg.inv(m)
print 'inverse of [[1, 2],[3, 4]]:'
print m_inverse
print

print 'm.dot(m_inverse):'
print m.dot(m_inverse)

inverse of [[1, 2],[3, 4]]:
[[-2.   1. ]
 [ 1.5 -0.5]]

m.dot(m_inverse):
[[  1.00000000e+00   1.11022302e-16]
 [  0.00000000e+00   1.00000000e+00]]


In [9]:
# and for doing all kinds of sciency type stuff.  like generating random numbers:
np.random.seed(5678)
n = np.random.randn(3, 4)
print 'a matrix with random entries drawn from a Normal(0, 1) distribution:'
print n

a matrix with random entries drawn from a Normal(0, 1) distribution:
[[-0.70978938 -0.01719118  0.31941137 -2.26533107]
 [-1.37745366  1.94998073 -0.56381007 -0.84373759]
 [ 0.22453858 -0.39137772  0.60550347 -0.68615034]]


### <span style="color:red">Numpy Exercise</span>
1. In the cell below, add a column of ones to the matrix `X_no_constant`.  This is a common task in linear regression and general linear modeling and something that you'll have to be able to do later today.  
1. Multiply your new matrix by the `betas` vector below to make a vector called `y`
1. You'll know you've got it when the cell prints '\*\*\*\*\*\* Tests passed! \*\*\*\*\*\*' at the bottom.

Specificically, given a matrix:

\begin{equation*}
\qquad
\mathbf{X_{NoConstant}} = 
\left( \begin{array}{ccc}
x_{1,1} & x_{1,2} & \dots & x_{1,D} \\
x_{2,1} & x_{2,2} & \dots & x_{2,D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{i,1} & x_{i,2} & \dots & x_{i,D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N,1} & x_{N,2} & \dots & x_{N,D} \\
\end{array} \right)
\qquad
\end{equation*}

We want to convert it to:
\begin{equation*}
\qquad
\mathbf{X} = 
\left( \begin{array}{ccc}
1 & x_{1,1} & x_{1,2} & \dots & x_{1,D} \\
1 & x_{2,1} & x_{2,2} & \dots & x_{2,D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{i,1} & x_{i,2} & \dots & x_{i,D} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{N,1} & x_{N,2} & \dots & x_{N,D} \\
\end{array} \right)
\qquad
\end{equation*}

So that if we have a vector of regression coefficients like this:

\begin{equation*}
\qquad
\beta = \left( \begin{array}{ccc}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_j \\
\vdots \\
\beta_D
\end{array} \right)
\end{equation*}

We can do this:

\begin{equation*}
\mathbf{y} \equiv \mathbf{X}^T \mathbf{\beta} 
\end{equation*}


In [54]:
np.random.seed(3333)
n_data = 10 # number of data points. i.e. N
n_dim = 5   # number of dimensions of each datapoint.  i.e. D

betas = np.random.randn(n_dim + 1)

print (betas)

X_no_constant = np.random.randn(n_data, n_dim)
print ('X_no_constant:')
print (X_no_constant)
print 

print (X_no_constant.shape)

# INSERT YOUR CODE HERE!

onesv=np.ones((10, 1))
print (onesv)

#[[ 1.  1.  1.  1.]
# [ 1.  1.  1.  1.]
# [ 1.  1.  1.  1.]]

baseX=np.hstack([X_no_constant, onesv])

#Transpose matrix
baseX.T
print (baseX.T)

baseX.T.shape
betas.shape


#matrix multiplication
# you can perform matrix multiplication with np.dot()
y = np.dot(betas.reshape(1,6),baseX.T)

# Tests:
y_expected = np.array([-0.41518357, -9.34696153, 5.08980544, 
                       -0.26983873, -1.47667864, 1.96580794, 
                       6.87009791, -2.07784135, -0.7726816, 
                       -2.74954984])
np.testing.assert_allclose(y, y_expected)
print ('****** Tests passed! ******')

[-0.45481164  0.29282121 -0.04605458 -3.78610463 -1.19921111  0.71335536]
X_no_constant:
[[-0.92232935  0.27352359 -0.86339625  1.43766044 -1.71379871]
 [ 0.179322   -0.89138595  2.13005603  0.51898975 -0.41875106]
 [ 0.34010119 -1.07736609 -1.02314142 -1.02518535  0.40972072]
 [ 1.18883814  1.01044759  0.3108216  -1.17868611 -0.49526331]
 [-1.50248369 -0.196458    0.34752922 -0.79200465 -0.31534705]
 [ 1.73245191 -1.42793626 -0.94376587  0.86823495 -0.95946769]
 [-1.07074604 -0.06555247 -2.17689578  1.58538804  1.81492637]
 [-0.73706088  0.77546031  0.42653908 -0.51853723 -0.53045538]
 [ 1.09620536 -0.69557321  0.03080082  0.25219596 -0.35304303]
 [-0.93971165  0.04448078  0.04273069  0.4961477  -1.7673568 ]]
(10, 5)
[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]
[[-0.92232935  0.179322    0.34010119  1.18883814 -1.50248369  1.73245191
  -1.07074604 -0.73706088  1.09620536 -0.93971165]
 [ 0.27352359 -0.89138595 -1.07736609  1.01044759 -0.196458   -1.42793626
  

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

(shapes (1, 10), (10,) mismatch)
 x: array([[-2.135228, -1.190096,  3.680435,  5.510778,  4.69995 , -2.585873,
        -6.897521,  3.855371, -0.521776,  1.392771]])
 y: array([-0.415184, -9.346962,  5.089805, -0.269839, -1.476679,  1.965808,
        6.870098, -2.077841, -0.772682, -2.74955 ])

In [88]:
np.random.seed(3333)
n_data = 10 # number of data points. i.e. N
n_dim = 5   # number of dimensions of each datapoint.  i.e. D

betas = np.random.randn(n_dim + 1)

print (betas)

[-0.45481164  0.29282121 -0.04605458 -3.78610463 -1.19921111  0.71335536]


In [89]:
betas.shape

(6,)

In [90]:
X_no_constant = np.random.randn(n_data, n_dim)
print ('X_no_constant:')
print (X_no_constant)
print 

print (X_no_constant.shape)



X_no_constant:
[[-0.92232935  0.27352359 -0.86339625  1.43766044 -1.71379871]
 [ 0.179322   -0.89138595  2.13005603  0.51898975 -0.41875106]
 [ 0.34010119 -1.07736609 -1.02314142 -1.02518535  0.40972072]
 [ 1.18883814  1.01044759  0.3108216  -1.17868611 -0.49526331]
 [-1.50248369 -0.196458    0.34752922 -0.79200465 -0.31534705]
 [ 1.73245191 -1.42793626 -0.94376587  0.86823495 -0.95946769]
 [-1.07074604 -0.06555247 -2.17689578  1.58538804  1.81492637]
 [-0.73706088  0.77546031  0.42653908 -0.51853723 -0.53045538]
 [ 1.09620536 -0.69557321  0.03080082  0.25219596 -0.35304303]
 [-0.93971165  0.04448078  0.04273069  0.4961477  -1.7673568 ]]
(10, 5)


In [91]:
# INSERT YOUR CODE HERE!

onesv=np.ones((10, 1))
print (onesv)

#[[ 1.  1.  1.  1.]
# [ 1.  1.  1.  1.]
# [ 1.  1.  1.  1.]]



[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]


In [93]:
onesv.shape

(10, 1)

In [94]:
baseX=np.hstack([X_no_constant, onesv])



In [95]:
baseX
baseX.shape

(10, 6)

In [None]:
#Transpose matrix
baseX.T
print (baseX.T)

baseX.T.shape
betas.shape


In [96]:


#matrix multiplication
# you can perform matrix multiplication with np.dot()
y = np.dot(baseX,betas)

# Tests:


In [97]:
print (y)

[-2.13522804 -1.19009649  3.68043545  5.51077786  4.6999502  -2.58587328
 -6.89752086  3.85537055 -0.52177587  1.39277092]


In [98]:
y.shape

(10,)

In [None]:
y_expected = np.array([-0.41518357, -9.34696153, 5.08980544, 
                       -0.26983873, -1.47667864, 1.96580794, 
                       6.87009791, -2.07784135, -0.7726816, 
                       -2.74954984])


In [None]:
np.testing.assert_allclose(y, y_expected)
print ('****** Tests passed! ******')

In [30]:
baseX.T.shape


(6, 10)

In [31]:
n_data = 10 # number of data points. i.e. N
n_dim = 5   # number of dimensions of each datapoint.  i.e. D

betas = np.random.randn(n_dim + 1)

In [52]:
betas.reshape(1,6)

array([[-0.45481164,  0.29282121, -0.04605458, -3.78610463, -1.19921111,
         0.71335536]])

In [48]:
baseX.T.shape



(6, 10)

In [None]:
#a-columns = b-rows

In [28]:
betas.shape

(6,)

## Pandas
Pandas is a python package which adds some useful data analysis features to numpy arrays.  Most importantly, it contains a `DataFrame` data type like the r `dataframe`: a set of named columns organized into something like a 2d array.  Pandas is great.

### Resources:
* [10 Minutes to Pandas](http://pandas.pydata.org/pandas-docs/dev/10min.html)
* [Pandas Data Structures Tutorial](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
* [Merge, Join, Concatenate Tutorial](http://pandas.pydata.org/pandas-docs/dev/merging.html)
* [Another Numpy/Pandas Tutorial](https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii)

In [5]:
# like with numpy, you don't have to rename pandas to pd, but it's customary to do so
import numpy as np
import pandas as pd

b = np.array([[6, 7], [3, 1], [4, 0]])
df = pd.DataFrame(data=b,  columns=['Weight', 'Height'])
print ('b:')
print (b)
print 
print ('DataFame version of b:')
print (df)
print

b:
[[6 7]
 [3 1]
 [4 0]]
DataFame version of b:
   Weight  Height
0       6       7
1       3       1
2       4       0


<function print>

In [6]:
# Pandas can save and load CSV files.  
# Python can do this too, but with Pandas, you get a DataFrame 
# at the end which understands things like column headings
baseball = pd.read_csv('data/baseball.dat.txt')

# A Dataframe's .head() method shows its first 5 rows
baseball.head()

Unnamed: 0,Salary,AVG,OBP,Runs,Hits,Doubles,Triples,HR,RBI,Walks,SO,SB,Errs,free agency eligibility,free agent in 1991/2,arbitration eligibility,arbitration in 1991/2,Name
0,3300,0.272,0.302,69,153,21,4,31,104,22,80,4,3,1,0,0,0,Andre Dawson
1,2600,0.269,0.335,58,111,17,2,18,66,39,69,0,3,1,1,0,0,Steve Buchele
2,2500,0.249,0.337,54,115,15,1,17,73,63,116,6,5,1,0,0,0,Kal Daniels
3,2475,0.26,0.292,59,128,22,7,12,50,23,64,21,21,0,0,1,0,Shawon Dunston
4,2313,0.273,0.346,87,169,28,5,8,58,70,53,3,8,0,0,1,0,Mark Grace


In [10]:
# you can see all the column names
print ('baseball.keys():')
baseball.keys()
print

print ('baseball.Salary:')
#print baseball.Salary
print 
#print ("baseball['Salary']:")
print (baseball['Salary'])

baseball.keys():
baseball.Salary:
0      3300
1      2600
2      2500
3      2475
4      2313
5      2175
6       600
7       460
8       240
9       200
10      177
11      140
12      117
13      115
14     2600
15     1907
16     1190
17      990
18      925
19      365
20      302
21      300
22      129
23      111
24     6100
25     4125
26     3213
27     2319
28     2000
29     1600
       ... 
307     205
308     185
309     115
310    3750
311    2188
312    2167
313    2167
314    2050
315    1445
316     875
317     560
318     360
319     162
320     133
321     109
322    4300
323    3850
324    2387
325     950
326     675
327     600
328     287
329     230
330     215
331     183
332     170
333     160
334     142
335     140
336     109
Name: Salary, dtype: int64


In [11]:
baseball.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 337 entries, 0 to 336
Data columns (total 18 columns):
Salary                     337 non-null int64
AVG                        337 non-null float64
OBP                        337 non-null float64
Runs                       337 non-null int64
Hits                       337 non-null int64
Doubles                    337 non-null int64
Triples                    337 non-null int64
HR                         337 non-null int64
RBI                        337 non-null int64
Walks                      337 non-null int64
SO                         337 non-null int64
SB                         337 non-null int64
Errs                       337 non-null int64
free agency eligibility    337 non-null int64
free agent in 1991/2       337 non-null int64
arbitration eligibility    337 non-null int64
arbitration in 1991/2      337 non-null int64
Name                       337 non-null object
dtypes: float64(2), int64(15), object(1)
memory usage: 50.0+ K

In [12]:
baseball.describe()

Unnamed: 0,Salary,AVG,OBP,Runs,Hits,Doubles,Triples,HR,RBI,Walks,SO,SB,Errs,free agency eligibility,free agent in 1991/2,arbitration eligibility,arbitration in 1991/2
count,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0
mean,1248.52819,0.257825,0.323973,46.697329,92.833828,16.673591,2.338279,9.097923,44.020772,35.017804,56.706231,8.246291,6.771513,0.397626,0.115727,0.192878,0.029674
std,1240.013309,0.039546,0.047132,29.020166,51.896322,10.452001,2.543336,9.289934,29.559406,24.842474,33.828784,11.664782,5.92749,0.490135,0.320373,0.395145,0.169938
min,109.0,0.063,0.063,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,230.0,0.238,0.297,22.0,51.0,9.0,0.0,2.0,21.0,15.0,31.0,1.0,3.0,0.0,0.0,0.0,0.0
50%,740.0,0.26,0.323,41.0,91.0,15.0,2.0,6.0,39.0,30.0,49.0,4.0,5.0,0.0,0.0,0.0,0.0
75%,2150.0,0.281,0.354,69.0,136.0,23.0,3.0,15.0,66.0,49.0,78.0,11.0,9.0,1.0,0.0,0.0,0.0
max,6100.0,0.457,0.486,133.0,216.0,49.0,15.0,44.0,133.0,138.0,175.0,76.0,31.0,1.0,1.0,1.0,1.0


In [13]:
baseball

Unnamed: 0,Salary,AVG,OBP,Runs,Hits,Doubles,Triples,HR,RBI,Walks,SO,SB,Errs,free agency eligibility,free agent in 1991/2,arbitration eligibility,arbitration in 1991/2,Name
0,3300,0.272,0.302,69,153,21,4,31,104,22,80,4,3,1,0,0,0,Andre Dawson
1,2600,0.269,0.335,58,111,17,2,18,66,39,69,0,3,1,1,0,0,Steve Buchele
2,2500,0.249,0.337,54,115,15,1,17,73,63,116,6,5,1,0,0,0,Kal Daniels
3,2475,0.260,0.292,59,128,22,7,12,50,23,64,21,21,0,0,1,0,Shawon Dunston
4,2313,0.273,0.346,87,169,28,5,8,58,70,53,3,8,0,0,1,0,Mark Grace
5,2175,0.291,0.379,104,170,32,2,26,100,87,89,22,4,1,0,0,0,Ryne Sandberg
6,600,0.258,0.370,34,86,14,1,14,38,15,45,0,10,1,0,0,0,Luis Salazar
7,460,0.228,0.279,16,38,7,2,3,21,11,32,2,3,0,0,0,0,Dwight Smith
8,240,0.250,0.327,40,61,11,0,1,18,24,26,14,2,0,0,0,0,Doug Dascenzo
9,200,0.203,0.240,39,64,10,1,10,33,14,96,13,6,0,0,0,0,Sammy Sosa


In [15]:
# You can perform queries on your data frame.  
# This statement gives you a True/False vector telling you 
# whether the player in each row has a salary over $1 Million
millionaire_indices = baseball['Salary'] > 1000
print (millionaire_indices)

0       True
1       True
2       True
3       True
4       True
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14      True
15      True
16      True
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24      True
25      True
26      True
27      True
28      True
29      True
       ...  
307    False
308    False
309    False
310     True
311     True
312     True
313     True
314     True
315     True
316    False
317    False
318    False
319    False
320    False
321    False
322     True
323     True
324     True
325    False
326    False
327    False
328    False
329    False
330    False
331    False
332    False
333    False
334    False
335    False
336    False
Name: Salary, dtype: bool


In [17]:
# you can use the query indices to look at a subset of your original dataframe
print ('baseball.shape:', baseball.shape)
print ("baseball[millionaire_indices].shape:", baseball[millionaire_indices].shape)

baseball.shape: (337, 18)
baseball[millionaire_indices].shape: (139, 18)


In [18]:
# you can look at a subset of rows and columns at the same time
print ("baseball[millionaire_indices][['Salary', 'AVG', 'Runs', 'Name']]:")
baseball[millionaire_indices][['Salary', 'AVG', 'Runs', 'Name']]

baseball[millionaire_indices][['Salary', 'AVG', 'Runs', 'Name']]:


Unnamed: 0,Salary,AVG,Runs,Name
0,3300,0.272,69,Andre Dawson
1,2600,0.269,58,Steve Buchele
2,2500,0.249,54,Kal Daniels
3,2475,0.260,59,Shawon Dunston
4,2313,0.273,87,Mark Grace
5,2175,0.291,104,Ryne Sandberg
14,2600,0.300,69,Ivan Calderon
15,1907,0.225,60,Tim Wallach
16,1190,0.255,39,Spike Owen
24,6100,0.302,102,Bobby Bonilla


## Pandas Joins
The real magic with a Pandas DataFrame comes from the merge method which can match up the rows and columns from two DataFrames and combine their data.  Let's load another file which has shoesize for just a few players

In [19]:
# load shoe size data
shoe_size_df = pd.read_csv('data/baseball2.dat.txt')
shoe_size_df

Unnamed: 0,Shoe Size,Name
0,11,Andre Dawson
1,13,Mark Grace
2,12,Sammy Sosa


In [20]:
merged = pd.merge(baseball, shoe_size_df, on=['Name'])
merged

Unnamed: 0,Salary,AVG,OBP,Runs,Hits,Doubles,Triples,HR,RBI,Walks,SO,SB,Errs,free agency eligibility,free agent in 1991/2,arbitration eligibility,arbitration in 1991/2,Name,Shoe Size
0,3300,0.272,0.302,69,153,21,4,31,104,22,80,4,3,1,0,0,0,Andre Dawson,11
1,2313,0.273,0.346,87,169,28,5,8,58,70,53,3,8,0,0,1,0,Mark Grace,13
2,200,0.203,0.24,39,64,10,1,10,33,14,96,13,6,0,0,0,0,Sammy Sosa,12


In [None]:
merged_outer = pd.merge(baseball, shoe_sizes, on=['Name'], how='outer')
merged_outer.head()

### <span style="color:red">Pandas Exercise</span>
1. Partner up with someone next to you.  Then, on one of your computers:
  1. Prepend a column of ones to the dataframe `X_df` below.  Name the new column 'const'.
  1. Again, matrix multiply `X_df` by the `betas` vector and assign the result to an new variable: `y_new`
  1. You'll know you've got it when the cell prints '\*\*\*\*\*\* Tests passed! \*\*\*\*\*\*' at the bottom.

  **Hint**: This stackoverflow post may be useful: http://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns

In [59]:
np.random.seed(3333)
n_data = 10 # number of data points. i.e. N
n_dim = 5   # number of dimensions of each datapoint.  i.e. D

betas = np.random.randn(n_dim + 1)

X_df = pd.DataFrame(data=np.random.randn(n_data, n_dim))

# INSERT YOUR CODE HERE!
const=np.ones((5, 1))
baseX=np.hstack([X_df, const])

y_new=np.dot(baseX,betas)

# Tests:
assert 'const' in X_df.keys(), 'The new column must be called "const"'
assert np.all(X_df.shape == (n_data, n_dim+1))
assert len(y_new == n_data)
print ('****** Tests passed! ******')


ValueError: all the input array dimensions except for the concatenation axis must match exactly

In [82]:
const=np.ones((1, 5))

In [64]:
print (const)

[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]


In [65]:
print (X_df)

          0         1         2         3         4
0 -0.922329  0.273524 -0.863396  1.437660 -1.713799
1  0.179322 -0.891386  2.130056  0.518990 -0.418751
2  0.340101 -1.077366 -1.023141 -1.025185  0.409721
3  1.188838  1.010448  0.310822 -1.178686 -0.495263
4 -1.502484 -0.196458  0.347529 -0.792005 -0.315347
5  1.732452 -1.427936 -0.943766  0.868235 -0.959468
6 -1.070746 -0.065552 -2.176896  1.585388  1.814926
7 -0.737061  0.775460  0.426539 -0.518537 -0.530455
8  1.096205 -0.695573  0.030801  0.252196 -0.353043
9 -0.939712  0.044481  0.042731  0.496148 -1.767357


In [77]:
X_df.T.shape


(5, 10)

In [86]:
const.shape

(1, 5)

In [87]:
baseX=np.hstack([X_df.T, const])

ValueError: all the input array dimensions except for the concatenation axis must match exactly

In [84]:
y_new=np.dot(baseX,betas)

ValueError: shapes (11,5) and (6,) not aligned: 5 (dim 1) != 6 (dim 0)

In [None]:
print (baseX)

In [75]:
X_no_constant.shape


(10, 5)

In [74]:
onesv.shape

(10, 1)

In [101]:
np.random.seed(3333)
n_data = 10 # number of data points. i.e. N
n_dim = 5   # number of dimensions of each datapoint.  i.e. D

betas = np.random.randn(n_dim + 1)

print (betas)

X_no_constant = np.random.randn(n_data, n_dim)
print ('X_no_constant:')
print (X_no_constant)
print 

print (X_no_constant.shape)

# INSERT YOUR CODE HERE!

onesv=np.ones((10, 1))
print (onesv)

#[[ 1.  1.  1.  1.]
# [ 1.  1.  1.  1.]
# [ 1.  1.  1.  1.]]

baseX=np.hstack([X_no_constant, onesv])

#Transpose matrix
baseX.T
print (baseX.T)

baseX.T.shape
betas.shape


#matrix multiplication
# you can perform matrix multiplication with np.dot()
y = np.dot(baseX,betas)

print (y)

# Tests:
#y_expected = np.array([-0.41518357, -9.34696153, 5.08980544, 
                      # -0.26983873, -1.47667864, 1.96580794, 
                      # 6.87009791, -2.07784135, -0.7726816, 
                      # -2.74954984])
#np.testing.assert_allclose(y, y_expected)
#print ('****** Tests passed! ******')

[-0.45481164  0.29282121 -0.04605458 -3.78610463 -1.19921111  0.71335536]
X_no_constant:
[[-0.92232935  0.27352359 -0.86339625  1.43766044 -1.71379871]
 [ 0.179322   -0.89138595  2.13005603  0.51898975 -0.41875106]
 [ 0.34010119 -1.07736609 -1.02314142 -1.02518535  0.40972072]
 [ 1.18883814  1.01044759  0.3108216  -1.17868611 -0.49526331]
 [-1.50248369 -0.196458    0.34752922 -0.79200465 -0.31534705]
 [ 1.73245191 -1.42793626 -0.94376587  0.86823495 -0.95946769]
 [-1.07074604 -0.06555247 -2.17689578  1.58538804  1.81492637]
 [-0.73706088  0.77546031  0.42653908 -0.51853723 -0.53045538]
 [ 1.09620536 -0.69557321  0.03080082  0.25219596 -0.35304303]
 [-0.93971165  0.04448078  0.04273069  0.4961477  -1.7673568 ]]
(10, 5)
[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]
[[-0.92232935  0.179322    0.34010119  1.18883814 -1.50248369  1.73245191
  -1.07074604 -0.73706088  1.09620536 -0.93971165]
 [ 0.27352359 -0.89138595 -1.07736609  1.01044759 -0.196458   -1.42793626
  