[Introduction to NumPy and Pandas (NumPy)](#Introduction-to-NumPy-and-Pandas-(NumPy))   
1. [NumPy](#NumPy)   
     a. [Arrays](#Arrays)    
     b. [Element-wise and Aggregator functions](#Element-wise-and-Aggregator-functions)     
2. [Pandas](#Pandas)    
     a. [The Data Frame](#The-Data-Frame)   
     b. [Adding More Data](#Adding-More-Data)     
3. [Pandas - Selecting and Grouping](#Pandas-Selecting-and-Grouping)  
     a. [Basic Selects with `.loc` and `.iloc`](#Basic-Selects-with-`.loc`-and-`.iloc`)    
     b. [Conditional Selection](#Conditional-Selection)   
     c. [Groups](#Groups)         
4. [Working with Files](#Working-with-Files)        
     a. [Opening CSV files with Pandas](#Opening-CSV-files-with-Pandas)   



# Introduction to NumPy and Pandas (NumPy)

## NumPy

[_NumPy_](https://docs.scipy.org/doc/numpy/reference/) is the basic package for doing slightly more advanced math and storing data in an analytics-friendly form. You should have installed NumPy in your local environment in the previous Unit. If you don't yet have NumPy installed, install it now with:

<div class="alert alert-success">pip install numpy</div>

Once NumPy is installed we have to import the package into our current environment to actually use it. We do this with an 
<span style="background-color: #D8D8D8">import</span> statement.

The shorthand for <span style="background-color: #D8D8D8">numpy</span> is simply <span style="background-color: #D8D8D8">np</span>. You can import and set the abbreviation like this:

In [2]:
import numpy as np

Import statements will work at any point in a script or any cell in a notebook. However, [Python style](https://www.python.org/dev/peps/pep-0008/#imports) recommends that they always appear at the beginning of the script or in the first cell (most notebooks use the first cell just for this purpose).


### Arrays

NumPy is the fundamental package for storing and manipulating mathematical data in Python. NumPy primarily accomplishes this with a new data structure: the _array_.
Arrays use bracket notation <span style="background-color: #D8D8D8">[ ]</span> to access items by index, just like lists and strings. We can create an array by calling <span style="background-color: #D8D8D8">np.array()</span> and passing in a sequence. A list, for example:

In [3]:
x = np.array([1, 4, 9, 25, 36, 49, 64, 81, 100])
x

array([  1,   4,   9,  25,  36,  49,  64,  81, 100])
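As a small sketch of the bracket notation described above, the same array can be indexed and sliced the way a list can. The variable names `first`, `last`, and `middle` are only illustrative.

```python
import numpy as np

x = np.array([1, 4, 9, 25, 36, 49, 64, 81, 100])

first = x[0]       # indexing starts at zero
last = x[-1]       # negative indices count back from the end
middle = x[2:5]    # a slice returns an array holding elements 2, 3, and 4

print(first, last, middle)
```

Unlike a list, the slice is itself a NumPy array, so every array function in this section works on it as well.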

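You can add multiple dimensions to your array either by manually creating an array of arrays or by generating values with <span style="background-color: #D8D8D8">np.arange()</span> and reshaping them with <span style="background-color: #D8D8D8">.reshape()</span>. Here are two ways to build an array with two rows and four columns (the values differ, but the shape is the same):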

In [4]:
w = np.array([[1, 4, 25, 9],[25, 36, 49, 64]])
w

array([[ 1,  4, 25,  9],
       [25, 36, 49, 64]])

In [5]:
y = np.arange(8).reshape(2, 4)
y

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
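To check what a two-dimensional array looks like and to pull values out of it, you can inspect its `shape` attribute and index it with a row and a column. This is a minimal sketch that reuses the `y` array from the cell above.

```python
import numpy as np

y = np.arange(8).reshape(2, 4)

shape = y.shape        # (number of rows, number of columns)
value = y[1, 2]        # row 1, column 2 (both counted from zero)
first_row = y[0]       # a single index returns the whole row

print(shape)
print(value)
print(first_row)
```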

### Element-wise and Aggregator functions

Now that you've seen the basic data structure of NumPy we can introduce a few other functionalities of this package. NumPy's primary value is the ability to do more sophisticated arithmetic than basic Python can do out of the box. NumPy allows you to do these computations in two ways: with <span style="background-color: #D8D8D8">element-wise</span> functions that process array elements one at a time and then return a new array, and with <span style="background-color: #D8D8D8">aggregator</span> functions that process the array into a single value the function returns.   

Element-wise functions that return a new array:

In [6]:
x = np.array([1, 4, 9, 25, 36])
# Square each value.
print(np.square(x))

# Square root of each value.
print(np.sqrt(x))

# Cosine of each value.
print(np.cos(x))

[   1   16   81  625 1296]
[1. 2. 3. 5. 6.]
[ 0.54030231 -0.65364362 -0.91113026  0.99120281 -0.12796369]


Aggregator functions that aggregate the elements of an array and return a single value:

In [7]:
x = np.array([1, 4, 9, 25, 36])

# Find the maximum value.
print(np.max(x))

# Find the minimum value.
print(np.min(x))

# Find the mean of the input array.
print(np.mean(x))

# Find the standard deviation of the input array.
print(np.std(x))

36
1
15.0
13.371611720357423


Note that these aggregator functions return single values, rather than arrays. That is what an aggregator does. It takes a set of multiple data values and condenses them (or aggregates them) into a single value according to some rule. So <span style="background-color: #D8D8D8">np.min()</span> returns the minimum value of all the data given to it, <span style="background-color: #D8D8D8">np.mean()</span> the mean, and so on. These are some of the basic functions of NumPy, but there are many more. If you'd like to know more now you can look through the basic [NumPy documentation](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html).
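One way to see the difference between the two kinds of function is to compare the shape of what each returns. The sketch below uses `np.sum()`, which is not shown above but is another NumPy aggregator.

```python
import numpy as np

x = np.array([1, 4, 9, 25, 36])

roots = np.sqrt(x)   # element-wise: one output value per input value
total = np.sum(x)    # aggregator: one output value for the whole array

print(roots.shape)     # same shape as x
print(np.ndim(total))  # zero dimensions, i.e. a single value
```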


## Pandas 

Pandas is one of the most heavily used Python packages among data scientists. Built on top of NumPy, it is essential to data manipulation, organization, and modeling. We'll introduce some of its core functionalities as well as its primary data structure: the data frame. First we have to import the package, which typically gets the abbreviation <span style="background-color: #D8D8D8">pd</span>. 

In [8]:
import pandas as pd

### The Data Frame

The <span style="background-color: #D8D8D8">data frame</span> is like a NumPy array, with a few additional features like column names and row indexing.  You can create a data frame in many different ways: from CSV files, by querying databases, or explicitly. For your first data frame, let's create a 2-dimensional NumPy array. Then, to create a data frame, use the <span style="background-color: #D8D8D8">pd.DataFrame()</span> function and pass in the NumPy array:


In [9]:
my_array = np.array([[87,127,101, 94, 97],
                     [1035, 1034, 1173, 1347,1029]])
df = pd.DataFrame(my_array)
df

      0     1     2     3     4
0    87   127   101    94    97
1  1035  1034  1173  1347  1029


Data frames are organized into rows and columns that are nameable. Columns are labeled with column names, rows with an index number (starting with zero by default). You can set both column names and indexes explicitly during the creation of the data frame or after the fact. Let's set both for <span style="background-color: #D8D8D8">df</span> from above.

In [10]:
df.columns = ['John', 'Emily','Adam', 'Caroline','Tim']
df.index = ['IQ Score', 'Brain Volume (cm^3)']
df

                     John  Emily  Adam  Caroline   Tim
IQ Score               87    127   101        94    97
Brain Volume (cm^3)  1035   1034  1173      1347  1029


You can also set column and index names through the <span style="background-color: #D8D8D8">columns=</span> and <span style="background-color: #D8D8D8">index=</span> keyword arguments when you call the <span style="background-color: #D8D8D8">pd.DataFrame()</span> function to initially construct the data frame.

In [11]:
df2 = pd.DataFrame(
    my_array,
    columns=['John', 'Emily','Adam', 'Caroline','Tim'],
    index=['IQ Score', 'Brain Volume (cm^3)'])
df2

                     John  Emily  Adam  Caroline   Tim
IQ Score               87    127   101        94    97
Brain Volume (cm^3)  1035   1034  1173      1347  1029


Whichever method you use, we now have a data frame with labeled rows and columns. This will be useful for working with data frames, because it makes these elements easily callable and makes your code more natural to write and simpler to read.
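To illustrate what "easily callable" means in practice, here is a short sketch that rebuilds the same data frame and then reads values back by label rather than by position. The variable names `emily` and `emily_iq` are only illustrative.

```python
import numpy as np
import pandas as pd

my_array = np.array([[87, 127, 101, 94, 97],
                     [1035, 1034, 1173, 1347, 1029]])
df = pd.DataFrame(
    my_array,
    columns=['John', 'Emily', 'Adam', 'Caroline', 'Tim'],
    index=['IQ Score', 'Brain Volume (cm^3)'])

# Selecting a column by name returns its values, labeled by the row index.
emily = df['Emily']

# Those row labels can then be used to pick out a single value.
emily_iq = emily['IQ Score']

print(emily)
print(emily_iq)
```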

### Adding More Data

To show what data frames can really do we're going to need to make something a little bit bigger. Let's assemble a data frame with named columns via lists. You can create an empty data frame by calling the <span style="background-color: #D8D8D8">pd.DataFrame()</span> function and passing in the indexes you'd like to use for row names, then add columns using <span style="background-color: #D8D8D8">df['COLUMN_NAME'] = [LIST_OF_VALUES]</span>. For example:

In [12]:
# This list will become our row names.
names = ['George',
         'John',
         'Thomas',
         'James',
         'Andrew',
         'Martin',
         'William',
         'Zachary',
         'Millard',
         'Franklin']

# Create an empty data frame with named rows.
purchases = pd.DataFrame(index=names)

# Add our columns to the data frame one at a time.
purchases['country'] = ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US']
purchases['ad_views'] = [16, 42, 32, 13, 63, 19, 65, 23, 16, 77]
purchases['items_purchased'] = [2, 1, 0, 8, 0, 5, 7, 3, 0, 5]
purchases 

          country  ad_views  items_purchased
George         US        16                2
John          CAN        42                1
Thomas        CAN        32                0
James          US        13                8
Andrew        CAN        63                0
Martin         US        19                5
William        US        65                7
Zachary        US        23                3
Millard       CAN        16                0
Franklin       US        77                5


Let's say this is the purchase and browsing history for several users of an ecommerce website for a given year. The <span style="background-color: #D8D8D8">ad_views</span> column is the number of ads each user viewed on the site and <span style="background-color: #D8D8D8">items_purchased</span> is the number of items they bought that year.

Now we have a data frame we can do something with. First note that you can call out a column as a series using *either* dot notation *or* bracket notation: <span style="background-color: #D8D8D8">df.column_name</span> or <span style="background-color: #D8D8D8">df['column_name']</span> both work. So <span style="background-color: #D8D8D8">purchases['country']</span> returns the country of each user who visited the ecommerce website, as does <span style="background-color: #D8D8D8">purchases.country</span>. Bracket notation is generally preferred and we'll use bracket notation here.
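The sketch below, on a cut-down version of the data, shows that the two notations return the same series. It also hints at one reason bracket notation is usually preferred: a column whose name is not a valid Python identifier can only be reached with brackets. The column name `'items purchased'` (with a space) is a hypothetical example and is not part of the data above.

```python
import pandas as pd

purchases = pd.DataFrame(index=['George', 'John', 'Thomas'])
purchases['country'] = ['US', 'CAN', 'CAN']
purchases['ad_views'] = [16, 42, 32]

by_bracket = purchases['ad_views']
by_dot = purchases.ad_views
print(by_bracket.equals(by_dot))

# Hypothetical column name containing a space: dot notation cannot express it,
# but bracket notation handles it without trouble.
purchases['items purchased'] = [2, 1, 0]
with_space = purchases['items purchased']
print(with_space)
```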

Pandas also makes it very easy to create a new column out of our previous data. Let's say we want to create a column of the average items purchased per ad view, and call the column <span style="background-color: #D8D8D8">items_purch_per_ad</span>. We can do that with this one-liner:

In [13]:
purchases['items_purch_per_ad'] = purchases['items_purchased'] / purchases['ad_views']
purchases

          country  ad_views  items_purchased  items_purch_per_ad
George         US        16                2            0.125000
John          CAN        42                1            0.023810
Thomas        CAN        32                0            0.000000
James          US        13                8            0.615385
Andrew        CAN        63                0            0.000000
Martin         US        19                5            0.263158
William        US        65                7            0.107692
Zachary        US        23                3            0.130435
Millard       CAN        16                0            0.000000
Franklin       US        77                5            0.064935


If we just want to _see_ those values and don't need to _store_ them as a new column in our data frame, we can run that expression without assigning it to <span style="background-color: #D8D8D8">purchases['items_purch_per_ad']</span>. It returns a labeled series giving the name and the purchases per ad for each user.

In [14]:
purchases['items_purchased'] / purchases['ad_views']

George      0.125000
John        0.023810
Thomas      0.000000
James       0.615385
Andrew      0.000000
Martin      0.263158
William     0.107692
Zachary     0.130435
Millard     0.000000
Franklin    0.064935
dtype: float64

## Pandas - Selecting and Grouping

The most basic form of indexing, or "selection", is the bracketed selection of column names. Recall what we did before on our purchases data:

In [15]:
purchases['country']

George       US
John        CAN
Thomas      CAN
James        US
Andrew      CAN
Martin       US
William      US
Zachary      US
Millard     CAN
Franklin     US
Name: country, dtype: object

This returns the data from the column named "country". However, for more sophisticated selection we will need to use the Pandas data frame indexing methods.

### Basic Selects with `.loc` and `.iloc`

[`.loc`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) is a selector that indexes over rows and columns by label. It selects over the row index first, then the column name (if included). [`.iloc`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) does the same thing but with integer positions instead of labels. For example, to select the row for `'George'` in our purchases data frame, we just pass the string `'George'` in to `purchases.loc` with bracket notation: 

In [16]:
purchases.loc['George']

country                  US
ad_views                 16
items_purchased           2
items_purch_per_ad    0.125
Name: George, dtype: object

To select the column 'country' we would use:

In [17]:
purchases.loc[:, 'country']

George       US
John        CAN
Thomas      CAN
James        US
Andrew      CAN
Martin       US
William      US
Zachary      US
Millard     CAN
Franklin     US
Name: country, dtype: object

The <span style="background-color: #D8D8D8">:</span> above works just like it would when slicing a list or string, and selects all rows from start to finish of the data frame.

Lastly to select George's country, we'd combine the two like this:

In [18]:
purchases.loc['George', 'country']

'US'
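The examples above all use `.loc`. As a sketch of the positional counterpart, here are the equivalent `.iloc` selections on a cut-down version of the data frame, using integer positions where `.loc` used labels.

```python
import pandas as pd

purchases = pd.DataFrame(index=['George', 'John', 'Thomas'])
purchases['country'] = ['US', 'CAN', 'CAN']
purchases['ad_views'] = [16, 42, 32]

# Row 0 is George's row, so this matches purchases.loc['George'].
first_row = purchases.iloc[0]

# Row 0, column 0 matches purchases.loc['George', 'country'].
first_cell = purchases.iloc[0, 0]

# Slices work on positions too: the first two rows, all columns.
first_two = purchases.iloc[:2, :]

print(first_row)
print(first_cell)
print(first_two)
```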

### Conditional Selection

You can also use <span style="background-color: #D8D8D8">.loc</span> for conditional selection, or selecting all the entries that meet a given criterion. This will use __lambda__, which is a construct for defining small anonymous functions inline. We use the lambda function to create a condition on the row or column.  

Let's say we want all the columns for individuals who made more than one purchase. That ends up being a relatively simple line of code.

In [19]:
purchases.loc[lambda df: df['items_purchased'] > 1, :]

          country  ad_views  items_purchased  items_purch_per_ad
George         US        16                2            0.125000
James          US        13                8            0.615385
Martin         US        19                5            0.263158
William        US        65                7            0.107692
Zachary        US        23                3            0.130435
Franklin       US        77                5            0.064935


We are selecting rows, so the lambda is the first item in the brackets. The lambda takes one argument, which we call <span style="background-color: #D8D8D8">df</span> because <span style="background-color: #D8D8D8">.loc</span> passes it the data frame being indexed. The body of the lambda is the condition, which is evaluated for each row and keeps the rows where it is true. The <span style="background-color: #D8D8D8">, :</span> is the same slicing syntax as above and means we want all columns.

There is a simpler way to do this, using boolean logic, and it is also quite common.

In [20]:
purchases[purchases['items_purchased'] > 1]

          country  ad_views  items_purchased  items_purch_per_ad
George         US        16                2            0.125000
James          US        13                8            0.615385
Martin         US        19                5            0.263158
William        US        65                7            0.107692
Zachary        US        23                3            0.130435
Franklin       US        77                5            0.064935


This uses similar logic, but it leaves the row and column selection implicit. The first example with <span style="background-color: #D8D8D8">.loc</span> spells out both, which makes it more robust as selections get more complicated, while this latter [boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) may be more common and is easily readable. Choose your tradeoffs wisely.
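To make the relationship between the two approaches concrete, the sketch below builds the condition once as a boolean series (named `mask` here purely for illustration) and uses it in three ways on a cut-down version of the data. All three return the same rows.

```python
import pandas as pd

purchases = pd.DataFrame(index=['George', 'John', 'Thomas', 'James'])
purchases['country'] = ['US', 'CAN', 'CAN', 'US']
purchases['items_purchased'] = [2, 1, 0, 8]

# The condition evaluates to one True/False value per row.
mask = purchases['items_purchased'] > 1

with_mask = purchases[mask]                 # plain boolean indexing
with_loc = purchases.loc[mask, :]           # explicit rows and columns
with_lambda = purchases.loc[lambda df: df['items_purchased'] > 1, :]

print(mask)
print(with_mask)
```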

### Groups

There's one last thing we'll introduce here, and that is grouping and aggregation. You can create groups in your data frame using the [`.groupby()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method and passing in the column name. Let's try it.

If we wanted to group by the country of the site user, all we'd have to do is:

In [21]:
purchases.groupby('country')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff0ae3f24c0>

When you run the line above, it doesn't return your data any more. It returns a <span style="background-color: #D8D8D8">DataFrameGroupBy</span> object, and the notebook displays a reference to that object rather than its contents.

To get data back we have to apply an operation to those groups. There are several methods that you can use here. Some are built in, like <span style="background-color: #D8D8D8">.sum()</span> or <span style="background-color: #D8D8D8">.count()</span>. For even greater possibilities you can use <span style="background-color: #D8D8D8">.aggregate(numpy_function)</span>. Let's use this to find out which group has more ad views and purchases.

In [22]:
purchases.groupby('country').aggregate(np.mean)

# Don't want to take the mean of all columns? Try this:
# purchases.groupby('country')['column_name'].mean()

         ad_views  items_purchased  items_purch_per_ad
country
CAN         38.25             0.25            0.005952
US          35.50             5.00            0.217767


Now we can see the mean of each column. Seems like Canadian visitors view slightly more ads per person but purchase far fewer items...
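As a sketch of the built-in group methods mentioned above, the example below groups a cut-down version of the data by country and then applies `.size()`, `.sum()`, and `.mean()`. Selecting a column on the grouped object first limits the calculation to that column.

```python
import pandas as pd

purchases = pd.DataFrame(index=['George', 'John', 'Thomas', 'James'])
purchases['country'] = ['US', 'CAN', 'CAN', 'US']
purchases['ad_views'] = [16, 42, 32, 13]
purchases['items_purchased'] = [2, 1, 0, 8]

grouped = purchases.groupby('country')

users_per_country = grouped.size()                     # rows in each group
items_per_country = grouped['items_purchased'].sum()   # total items per group
mean_views = grouped['ad_views'].mean()                # average ad views per group

print(users_per_country)
print(items_per_country)
print(mean_views)
```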

These are the fundamentals of selecting data inside a data frame. For a deep dive, see the [pandas documentation on Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing).


## Working with Files

So far we built Pandas data frames by manually typing or pasting in text. Doing that with actual data sets would be tedious or impossible, so you'll almost always be loading your data into Pandas from files (like CSV files), from the web using APIs, or from databases and other large data stores.

Here, we'll show you how to load CSV files.

### Opening CSV files with Pandas

Here's a sample CSV file called [addresses](https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv). Save a copy of this file by right-clicking in your browser and choosing "Save As". Name it <span style="background-color: #D8D8D8">addresses.csv</span> and place it in the same directory as your Python files.

The basics of loading a CSV into Pandas are simple. You can load the <span style="background-color: #D8D8D8">addresses.csv</span> file you just saved with this one-liner:

In [23]:
df = pd.read_csv('addresses.csv')
print(df.head())

                    John       Doe                 120 jefferson st.  \
0                   Jack  McGinnis                      220 hobo Av.   
1          John "Da Man"    Repici                 120 Jefferson St.   
2                Stephen     Tyler  7452 Terrace "At the Plaza" road   
3                    NaN  Blankman                               NaN   
4  Joan "the bone", Anne       Jet               9th, at Terrace plc   

     Riverside   NJ   08075  
0        Phila   PA    9119  
1    Riverside   NJ    8075  
2     SomeTown   SD   91234  
3     SomeTown   SD     298  
4  Desert City   CO     123  


The Pandas <span style="background-color: #D8D8D8">read_csv()</span> function takes a string representing the path to the file you want to read and returns a data frame object. In the example above the CSV file we're working with is in the same directory as the script we're running. If you saved your file elsewhere or with a different name you'll have a different path than <span style="background-color: #D8D8D8">addresses.csv</span>. Note that this file has no header row, so by default <span style="background-color: #D8D8D8">read_csv()</span> treated the first record (John Doe) as the column names in the output above.

Try loading <span style="background-color: #D8D8D8">addresses.csv</span> in your own Jupyter notebook now.

The only required argument to <span style="background-color: #D8D8D8">read_csv()</span> is the file path, but there are dozens of optional keyword arguments available that are useful in different contexts. You can read about those in the [documentation](http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
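As one example of those keyword arguments, here is a hedged sketch for a file that has no header row, like the one above. It reads two records from an in-memory string so that it runs without the downloaded file. The column names are hypothetical (the file itself does not define any), and the records are simplified copies of the first two lines.

```python
import io
import pandas as pd

# Stand-in for the first two records of a headerless CSV file.
csv_text = (
    "John,Doe,120 jefferson st.,Riverside, NJ, 08075\n"
    "Jack,McGinnis,220 hobo Av.,Phila, PA,09119\n"
)

# Hypothetical column names chosen for illustration.
columns = ['first_name', 'last_name', 'street', 'city', 'state', 'zip']

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,             # the first line is data, not column names
    names=columns,           # supply our own column names
    dtype={'zip': str},      # keep zip codes as text so leading zeros survive
    skipinitialspace=True)   # drop the space that follows some commas

print(df)
```

With the real file you would pass `'addresses.csv'` in place of the `io.StringIO(...)` object and keep the other arguments the same.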