# Intro to Pandas 

We're going to look at the Pandas library (created by Wes McKinney). When we refer to working in Pandas, we're typically talking about working with a DataFrame, which will be our focus. Pandas DataFrames are objects that hold data, allowing us to interact with it, manipulate it, and eventually feed it into machine learning algorithms. 

Since a Pandas DataFrame is an **object**, this means that we're going to interact with it in much the same way that we interact with all of our other objects in python. Before we get to actually interacting with DataFrames, though, we'll have to get one, and get one with data in it! There's one quick step that we have to do before that... 

## Pandas Import 

```python
import pandas as pd # Standard import. 
```

Here I've shown how we get access to everything in the Pandas library - we import it! Also note the python comment `"# Standard import"` out to the right of our import. This was to note that this is the standard way to import the Pandas library. We should always be sure that if we are importing the entire pandas library, we follow this syntax. It's common practice to use `pd` as the alias, and we tend to follow common practice whenever possible. This makes it easier for others to read our code.

## Getting a DataFrame Object

There are two basic ways that we can get a Pandas DataFrame object to work with. The first is by using data that is already in our Python program, in conjunction with the `DataFrame` constructor. The second is by reading in external data through the pandas module (which we've imported and made accessible via `pd`). For reference, here are the [docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for Pandas DataFrames. 

#### Using data already in our Python program

If we are using data that is already in our Python program, then we are going to be passing that data to the `DataFrame` constructor. We typically do this in one of two ways. The first involves passing in a list of dictionaries, whereas the second involves passing in two lists. Let's dive into the first...

In [None]:
import pandas as pd # I haven't actually done this in code yet. 
data_lst = [{'a': 1, 'b': 2, 'c':3}, {'a': 4, 'b':5, 'c':6, 'd':7}]
df = pd.DataFrame(data_lst)
df

What's going on here? How do I read that DataFrame output right above, and how did that list of dictionaries translate to that DataFrame? 

Each and every one of our Pandas DataFrames will consist of **rows** and **columns**, where the columns will be denoted and accessed via their names, and the rows will be denoted and accessed via the indices of the DataFrame. Above, we can look at our columns and see that their names are `a`, `b`, `c`, and `d`. We can similarly look at our rows and see that they are indexed by `0` and `1`. These column names and indices are how we will access this data later. How did the `DataFrame` constructor take our list of dictionaries and put it into the DataFrame in that format, though?

When the Pandas DataFrame constructor encounters a list of dictionaries like we gave it, it interprets each dictionary to be a row in the DataFrame. The keys are read as the column names and the values as the values for each column. By default, the DataFrame constructor will assign a column for **every** key that it sees in **any** dictionary in the list of dictionaries. If a particular dictionary in that list doesn't have a value for that key, then it assigns a `NaN` (stands for not a number) value for that index-column pair. Therefore, when the Pandas DataFrame above got the list of dictionaries, it saw `a`, `b`, `c`, and `d` keys, and thus created those columns. It then filled in the values associated with those keys, filling in a `NaN` if it didn't find that key (like it didn't find `d` in the first dictionary in our list). 
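
As a quick check of the rule described above, here is a minimal sketch that builds the same kind of DataFrame and then asks Pandas where the missing values ended up. The variable names (`rows`, `example_df`, `missing`) are just illustrative choices.

```python
import pandas as pd

# Two rows; only the second dictionary has a 'd' key.
rows = [{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6, 'd': 7}]
example_df = pd.DataFrame(rows)

# One column is created for every key seen in any dictionary.
column_names = list(example_df.columns)

# isna() returns a DataFrame of booleans marking the missing values.
missing = example_df.isna()

print(column_names)
print(missing)
```

Only the cell in row `0`, column `d` should be flagged as missing, since that is the one key the first dictionary did not have.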

In [None]:
data_lst = [{'a': 1}, {'b':5}, {'c': 4}]
df = pd.DataFrame(data_lst)

**What do you expect our DataFrame to hold now?**

In [None]:
df

The second way of creating a dataframe from data that is already in our Python program is to pass in a list of lists as the `data` argument, and a list of strings as the `columns` argument. The `pd.DataFrame()` constructor will assume that each individual list in the `data` argument is one row (i.e. if you pass in a list of 5 lists, your dataframe will have 5 rows). Below, we're passing in a list of 2 lists to the `data` parameter, which means that our DataFrame will have two rows.

In [None]:
data_vals = [[1, 2, 3], [4, 5, 6]]
data_cols = ['a', 'b', 'c']
df = pd.DataFrame(data=data_vals, columns=data_cols)
df

It's important to note that this method is not quite as flexible as using a list of dictionaries. When passing in a list of lists via the `data` argument, we have to make sure that the greatest number of elements in any single list matches the number of column names we are passing in via the `columns` argument (no more, no less). For example, the following raises an error, because the second list has three elements but we only supply two column names:

In [None]:
data_vals = [[1, 2], [4, 5, 6]]
data_cols = ['a', 'b']
df = pd.DataFrame(data=data_vals, columns=data_cols)
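
If we'd like to see the failure without stopping the notebook, one option is to wrap the call in a `try`/`except`. This is only a sketch: the exact exception type and message can differ between Pandas versions, so it catches both `ValueError` and `AssertionError` as an assumption.

```python
import pandas as pd

data_vals = [[1, 2], [4, 5, 6]]  # the longest row has three values
data_cols = ['a', 'b']           # but we only supply two column names

try:
    pd.DataFrame(data=data_vals, columns=data_cols)
    error_message = None
except (ValueError, AssertionError) as err:
    # Keep the message so we can inspect it instead of crashing.
    error_message = str(err)

print(error_message)
```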

We do, however, have the flexibility of passing in some rows with 'missing' data. The key to note here, though, is that values are matched to columns by position, from left to right, so it is always the trailing column(s) that get filled with a `NaN` when a list is short (it is not based on a key name like in our list of dictionaries). 

In [None]:
data_vals = [[1, 2], [4, 5, 6]]
data_cols = ['a', 'b', 'c']
df = pd.DataFrame(data=data_vals, columns=data_cols)
df

#### Reading External Data

There are many ways that we can read external data into a Pandas DataFrame, and they will be called as a function that is available via the `pandas` module. As a result of importing the `pandas` module and making it accessible via `pd`, this means that we will call all of these functions via `pd`. Each one of these functions will **return** back to us a Pandas DataFrame object, populated with the external data that we read in. 

The [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) will show you all of the ways that you could load external data into a DataFrame. Basically, there is a way to load in data stored in most common formats (CSV, JSON, SQL, Excel, HTML, and others). All of these take some form of a `read_{data_type}` function, which means that we will call them as `pd.read_{data_type}`.

So, if we wanted to load data in from a CSV, we would simply use:

```python
df = pd.read_csv('my_data.csv')
```

Note: This assumes that we have the column names in the first row of our .csv.

If we don't have the column names in the first row of our .csv, we could read in the .csv with the following:

```python
df = pd.read_csv('my_data.csv', header=None)
```

Note: This by default assigns numbers as the column names (starting with 0).

If we wanted to assign the column names as we read it in, we can pass in an additional `names` argument, where this `names` argument holds a list of the names we want to assign to the columns.

```python
df = pd.read_csv('my_data.csv', header=None, names=['col1', 'col2', ...., 'col12'])
```

We have looked at a couple of the parameters that we can pass arguments to when reading in data. Just note that the functions that are available to read in data from other sources will likely also have optional parameters that we can specify. 
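
To try the `header` and `names` parameters without needing a file on disk, here is a small sketch that reads CSV text from an in-memory buffer. The CSV contents and the column names `a`, `b`, `c` are made up for illustration; `io.StringIO` simply stands in for a real file path.

```python
import io
import pandas as pd

# Stand-in for a headerless CSV file (contents are made up).
raw_csv = "1,2,3\n4,5,6\n"

# header=None alone: the columns are labelled 0, 1, 2.
numbered = pd.read_csv(io.StringIO(raw_csv), header=None)

# header=None plus names: we supply the column labels ourselves.
named = pd.read_csv(io.StringIO(raw_csv), header=None, names=['a', 'b', 'c'])

print(list(numbered.columns))
print(list(named.columns))
```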

## Looking at Your Data

The data we'll look at comes from [here](http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/), in the UCI repository of open data sets.  

Given that our `DataFrame` is an object, we can imagine that it will have associated attributes and methods. There are a couple of each available to us to get a general sense of our data. We have two attributes that we will frequently use on our DataFrame, `shape` and `columns`, which allow us to look at the shape of our data and the column names. We have four methods that are available on our DataFrame for getting a general sense of our data: `info()`, `describe()`, `head()`, and `tail()`. Let's take a look at what these do. 

In [None]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', delimiter=';')

In [None]:
df.shape # Gives you the number of rows and number of columns

In [None]:
df.columns # Gives you back an Index object holding all of the column names. 

In [None]:
df.info() # Allows you to look at the data type for each column, and the number of non-null values.

In [None]:
df.head() # Shows you the first n rows (by default n = 5)

In [None]:
df.tail() # Shows you the last n rows (by default n = 5). 

In [None]:
df.describe() # Gives you summary statistics for all of your numeric columns. 
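
The same attributes and methods can be tried on a tiny, made-up DataFrame where the results are easy to verify by eye. This is a sketch only; `small_df` and its columns `x` and `y` are hypothetical and not part of the wine data.

```python
import pandas as pd

# A small made-up frame with 10 rows and 2 columns.
small_df = pd.DataFrame({'x': range(10), 'y': [v * 2.0 for v in range(10)]})

n_rows, n_cols = small_df.shape       # shape is a (rows, columns) tuple
first_three = small_df.head(3)        # the first 3 rows
last_two = small_df.tail(2)           # the last 2 rows
col_list = small_df.columns.tolist()  # convert the Index to a plain list

print(n_rows, n_cols)
print(col_list)
```

Passing a number to `head()` or `tail()` overrides the default of 5 rows.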

In [None]:
# Another external data example
col_names = ['sex', 'length', 'diameter', 'height', 'whole_weight',
             'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']
df_abalone = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
                         delimiter=',', header=None, names=col_names)

In [None]:
df_abalone.head()

In [None]:
df_abalone.describe(include='all')

## Grabbing your data 

#### The Basics

We now know how to look at our data. What if we wanted to grab certain parts to look at, or certain parts to play around with/transform? Say we wanted to grab an entire row, or an entire column... how do we do that? Let's dive in by starting off with some indexing. 

The format we use to index into our dataframe and grab data will depend on exactly what subset of the data we want to grab. If we want to grab entire rows or columns, then we can use bracket notation to do that (just like we use bracket notation to index into lists). If we want an entire column, then we're going to place the **column name** in brackets (and multiple column names in a list inside those brackets). We can also sometimes access a column via dot notation on the dataframe, which we'll show in a second. If we want entire rows, then we have to place a **slice** inside the brackets, such as `[:3]` or `[2:5]` (it won't work to just place a single index in the brackets, because a single value is interpreted as a column name). 

In [None]:
# Let's take a quick look at the DataFrame that we're using to remind ourselves what it 
# looks like. 
df.head()

In [None]:
df['chlorides'] # Grabs the 'chlorides' column. 
#df.chlorides # Also grabs the 'chlorides' column. 

In [None]:
df['volatile acidity']
#df.volatile acidity # Dot notation only works if the column name has no spaces. 

We can, however, alter the column headers to remove the spaces, at which point dot notation would work.  
The following code shows how we can quickly eliminate spaces from the column names using a list comprehension:

In [None]:
df2 = df.copy()
cols = df2.columns.tolist()
cols = [col.replace(' ', '_') for col in cols]
df2.columns = cols
df2.volatile_acidity

In [None]:
# We can access multiple columns at once by passing in a list of column names. 
df[['chlorides', 'volatile acidity']]

In [None]:
df[:3] # This will grab from the beginning up to but not including the row at index 3. 

In [None]:
# This will grab up to but not including the row at index 1 (i.e. it'll grab the row  at index 0). 
df[:1]

In [None]:
# This will not work: a single value in brackets is looked up as a column name, 
# and there is no column named 0. To get rows, we need a slice (e.g. df[:1]).
df[0]

In [None]:
# This won't work because we are trying to access a subset of rows 
# **and** columns at the same time. 
df[:1, 'volatile acidity'] 
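
To summarize the bracket rules from this section, here is a small sketch on a made-up DataFrame (`small_df` is hypothetical) that records what type each form returns and confirms that a bare integer fails when no column has that name.

```python
import pandas as pd

small_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

one_column = small_df['a']          # a single name returns a Series
two_columns = small_df[['a', 'b']]  # a list of names returns a DataFrame
first_rows = small_df[:2]           # a slice returns rows 0 and 1

# A bare integer is looked up as a column label, so it fails here.
try:
    small_df[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(type(one_column).__name__, type(two_columns).__name__, len(first_rows))
```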

What if we want to grab certain rows **and** certain columns, rather than just entire rows or entire columns?

If we want to grab only certain rows and columns, there are two indexers that we can use on a Pandas DataFrame: `loc[]` and `iloc[]`. Note that we access these via dot notation on our `DataFrame` object, but they take square brackets rather than parentheses, since they are indexers rather than ordinary methods. The difference between them has to do with how we use them. `loc[]` is a purely label-based indexer, and `iloc[]` is a purely integer-position-based indexer. Older code may also use `ix[]`, a primarily label-based indexer that fell back to integer positions; it was deprecated in pandas 0.20 and removed in pandas 1.0, so we'll stick to `loc[]` and `iloc[]`. 

In [None]:
# Let's look at our data real quickly again. 
df.head()

In [None]:
# Loc is label based. All of these will work, because they are recognized as labels on the 
# rows (index labels) or columns (column name labels). 
df.loc[0, 'fixed acidity'] # 0 is one of the index labels, and 'fixed acidity' is a column label.

In [None]:
# Slices on our index labels work too. Unlike ordinary Python slices, .loc slices
# include the end label, so this returns the rows labeled 0 through 10.
df.loc[0:10, 'fixed acidity']

In [None]:
df.loc[10:15, ['chlorides', 'fixed acidity']]

In [None]:
# These will all fail, because they attempt to access the columns by position integers, 
# and loc only takes labels. 
df.loc[0, 0]
df.loc[0:10, 0]
df.loc[10:15, [0, 4]]

In [None]:
# The above will all work with .iloc, though, since it takes integers (and not labels)
df.iloc[0, 0]
df.iloc[0:10, 0]
df.iloc[10:15, [0, 4]]

In [None]:
# Using labels, though, like we did with .loc, will NOT work. These will all fail
df.iloc[0, 'fixed acidity']
df.iloc[0:10, 'fixed acidity'] 
df.iloc[10:15, ['chlorides', 'fixed acidity']]
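
The wine data has index labels that happen to equal the row positions, which can hide the difference between `loc` and `iloc`. Here is a minimal sketch on a made-up DataFrame whose labels start at 100 (an assumption chosen purely to make the two behave differently).

```python
import pandas as pd

# Index labels deliberately differ from row positions.
small_df = pd.DataFrame({'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4]},
                        index=[100, 101, 102, 103])

# Label-based: rows labeled 100, 101, and 102 (the end label is included).
by_label = small_df.loc[100:102, 'a']

# Position-based: rows at positions 0 and 1 (the end position is excluded).
by_position = small_df.iloc[0:2, 0]

print(list(by_label))
print(list(by_position))
```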

#### A little bit more

We know how to grab certain rows or columns from dataframes, as well as a subset of rows and columns, and anything in between. But, this looks like it typically requires that the exact location of the data we want is known. What if we don't know that location? Is there a way to grab desired data by simply specifying some query parameters? Yes! 

There are a couple of ways that we can do this. The first way we'll look at is just through masking, whereas the second actually uses the `query()` method available on the Pandas DataFrame.

In [None]:
# Reminder of what our data looks like. 
df.head()

In [None]:
df['chlorides'] <= 0.08 # This just gives us a mask - tells us True or False whether each row 
                        # fits the condition.

In [None]:
# To use a mask, we actually have to use it to index into the DataFrame (using square brackets). 
df[df['chlorides'] <= 0.08]

Notice how only the indices that were found to be True from the condition show up in this subset of the dataframe. We've "masked" off the rest of the indices that were found to be False (hence the name **masking**). 

In [None]:
# Okay, this is cool. What if I wanted a slightly more complicated query...
df[(df['chlorides'] >= 0.04) & (df['chlorides'] < 0.08)]

In [None]:
# So I could write an arbitrarily complicated query using that syntax... 
df[(df['chlorides'] >= 0.04) & (df['chlorides'] < 0.08) & (df['pH'] > 3.5) & (df['pH'] < 4.00)]

In [None]:
# Or I could use the query() method that is available on our dataframe object. 
df.query('chlorides >= 0.04 and chlorides < 0.08 and pH > 3.5 and pH < 4.00')

Personally, I think this looks much better, and is a little bit easier to write! It doesn't use loads of sets of brackets (`[]`) and parentheses (`()`), but rather just one set of parentheses. It also tends to follow Python syntax a little more closely than the masks that we looked at above, using `and` instead of `&` to combine the different conditions in our queries. 

For conditions with several parts, the `query()` method is usually the more readable choice. 
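
Here is a small sketch confirming that a mask and `query()` select the same rows, using a made-up DataFrame with two of the wine column names. It also shows one detail the examples above did not need: a column name containing a space can be wrapped in backticks inside the query string (this assumes pandas 0.25 or later).

```python
import pandas as pd

# Made-up values; only the column names are borrowed from the wine data.
small_df = pd.DataFrame({
    'chlorides': [0.03, 0.05, 0.07, 0.09],
    'volatile acidity': [0.7, 0.3, 0.6, 0.2],
})

# The same condition written as a mask and as a query string.
masked = small_df[(small_df['chlorides'] >= 0.04) & (small_df['chlorides'] < 0.08)]
queried = small_df.query('chlorides >= 0.04 and chlorides < 0.08')

# Backticks let query() refer to a column whose name contains a space.
low_acidity = small_df.query('`volatile acidity` < 0.5')

print(masked.equals(queried))
print(list(low_acidity.index))
```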

## A Deeper Dive

What else can I do with my data? Anything! There are loads of things that we can do with our data once we've gotten it into a Pandas DataFrame. We're going to look at a few more things tonight, but to view all available attributes and methods of DataFrames, we can check out the [Pandas Docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). For practical examples of how DataFrames are used, I might suggest getting a copy of [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) (it's written by Wes McKinney, the creator of Pandas).

For those coming from R, know that the Pandas DataFrame was based off the R DataFrame, and most anything we can do with an R DataFrame, we can do with a Pandas DataFrame. For anybody coming from a SQL background, the methods available on DataFrames give us much, if not all, of the functionality that we have available in a SQL environment. 


#### Groupby

Let's start with `groupby`s...

In [None]:
# Remind ourselves of the data. 
df.head()

In [None]:
# Quality looks like something we might want to group by. Let's check it out in a little 
# more detail first, though. 
df['quality'].unique()

In [None]:
df.groupby('quality') # Note that this returns back to us a groupby object. It doesn't actually 
                      # return to us anything useful until we perform some aggregation on it. 

In [None]:
# We have tons of aggregation metrics we can get from a groupby object. Note here that we 
# store the results of a groupby below to then perform all kinds of operations on it (this is 
# actually the preferred method if we're going to perform more than one calculation
# on it). We have tons of operations we can perform on it. 
groupby_obj = df.groupby('quality')
groupby_obj.mean()
groupby_obj.max()
groupby_obj.count()

In [None]:
# The previous aggregation metrics gave us back a DataFrame with all of the columns minus what 
# we grouped on. Notice that what we grouped on becomes the index. What if I wanted only one 
# column back (especially with something like count, where it is the same
# for every column)? Well, we can do anything with this DataFrame that we did before...
df.groupby('quality').count()['fixed acidity']

In [None]:
# Note we can also group by multiple columns by passing them in a list. It will group by 
# the first column passed in first, and then the second after that (i.e. it will group by 
# the second within the group by of the first). 
df.groupby(['pH', 'quality']).count()['chlorides']

Check out the [Group By documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html) to look at what all you can do with the Pandas .groupby().
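
As one more illustration of aggregating a groupby object, here is a sketch on made-up data that selects a single column before aggregating and asks for two statistics at once via `agg()`. The values are hypothetical; only the column names mirror the wine data.

```python
import pandas as pd

small_df = pd.DataFrame({
    'quality': [5, 5, 6, 6, 6],
    'alcohol': [9.0, 10.0, 10.0, 11.0, 12.0],
})

# Select the column first, then compute several statistics per group.
summary = small_df.groupby('quality')['alcohol'].agg(['mean', 'count'])

print(summary)
```

Selecting the column before aggregating avoids computing the statistic for every column only to throw most of the results away.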

#### Sorting 

Sorting works much the same way as grouping. It is available via a method that we call on the dataframe, `.sort_values()`, and we pass it a column or columns to sort by. 

In [None]:
df.sort_values('quality') # Note: this is ascending by default.
#df.sort_values('quality', ascending=False)

In [None]:
# Note that we can sort by multiple columns by placing them in a list inside of the sort_values()
# method. It will sort by the first column passed in first, and then the second within the 
# sort of the first. 
df.sort_values(['quality', 'alcohol'], ascending=False) # ascending=False will apply to both columns. 
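
If we want a different direction for each column, `ascending` also accepts a list with one boolean per sort column. A minimal sketch on made-up data:

```python
import pandas as pd

small_df = pd.DataFrame({
    'quality': [5, 6, 5, 6],
    'alcohol': [10.0, 9.0, 9.5, 11.0],
})

# Highest quality first; within each quality, lowest alcohol first.
mixed = small_df.sort_values(['quality', 'alcohol'], ascending=[False, True])

print(mixed)
```

Note that the original index labels travel with their rows, so the index of the result shows how the rows were reordered.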

#### Creating and Dropping Columns

Creating columns is done in one of two ways: 
1. Using bracket notation
2. Using the `eval()` method on the Pandas DataFrame. 

Dropping columns is done using the `df.drop()` method on the Pandas DataFrame. When dropping columns, we have to be careful to either tell the DataFrame to drop them in place, or assign the DataFrame with dropped columns to a new variable. We also need to make sure to tell the `drop()` method what axis the thing we're trying to drop is on (rows are `axis=0`, and columns are `axis=1`).

In [None]:
# View our DataFrame to remember what it looks like.
df.head()

In [None]:
# How would I create a column, using bracket notation, that is equal to the amount of 
# non-free sulfur dioxide?


In [None]:
# Typically, we don't name columns with spaces in them because they are a little tricky 
# to work with. 
df.rename(columns={'total sulfur dioxide': 'total_sulfur_dioxide', 
                   'free sulfur dioxide': 'free_sulfur_dioxide' }, inplace=True)

In [None]:
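# inplace=True adds the column to df itself; without it, eval() returns a new DataFrame.
df.eval('non_free_sulfur2 = total_sulfur_dioxide - free_sulfur_dioxide', inplace=True)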
df.columns

In [None]:
df.drop('non_free_sulfur2', axis=1)

In [None]:
df.columns # Wait, the non_free_sulfur2 column is still there... why? 
           # It's because we didn't tell it to drop in place. 

In [None]:
df.drop('non_free_sulfur2', inplace=True, axis=1) # Note the axis=1 argument telling drop() that
                                                  # we're dropping a column. 

In [None]:
df.columns
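
For completeness, here is one possible sketch of the bracket-notation approach on a small made-up DataFrame, followed by a drop that is assigned to a new variable rather than done in place. The column name `non_free_sulfur` and the values are hypothetical; `drop(columns=...)` is an alternative spelling of `axis=1` and assumes pandas 0.21 or later.

```python
import pandas as pd

small_df = pd.DataFrame({
    'total_sulfur_dioxide': [34.0, 67.0],
    'free_sulfur_dioxide': [11.0, 25.0],
})

# Bracket notation: assign a new column computed from existing ones.
small_df['non_free_sulfur'] = (
    small_df['total_sulfur_dioxide'] - small_df['free_sulfur_dioxide']
)

# drop() returns a new DataFrame; small_df itself is left unchanged.
without_col = small_df.drop(columns='non_free_sulfur')

print(list(small_df.columns))
print(list(without_col.columns))
```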

#### Dealing with Nulls

Pandas has methods both for filling nulls (or N/As) with whatever value we want and for dropping nulls altogether. To fill nulls, we use the `.fillna()` method on the DataFrame, and to drop nulls, we call the `.dropna()` method on the DataFrame. With `.fillna()`, we can give it a default value to fill in, or use one of a number of other strategies (padding/forward filling, back filling). You can read about dealing with missing data in the docs [here](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#cleaning-filling-missing-data). We're not going to go into too much depth here, but want you to know that this functionality exists. 

In [None]:
df.fillna(-1, inplace=True)
#df.dropna(inplace=True) # Notice the addition of the inplace argument here. 

In [None]:
#df.fillna(-1, inplace=True)
df.dropna(inplace=True) # Notice the addition of the inplace argument here. 

In [None]:
df.info()
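
The wine data doesn't actually contain any nulls, so the cells above leave it unchanged. To see the methods do something, here is a sketch on a made-up DataFrame with two missing values. It uses the non-in-place forms, so each call returns a new DataFrame and `small_df` stays as it was.

```python
import numpy as np
import pandas as pd

small_df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

filled = small_df.fillna(-1)  # replace every NaN with -1
forward = small_df.ffill()    # carry the previous row's value forward
dropped = small_df.dropna()   # keep only the rows that have no NaN

print(filled)
print(forward)
print(dropped)
```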

## Merging two dataframes

In [None]:
phys_df = pd.read_csv('data/specialty.txt', delimiter=',')

In [None]:
procs_df = pd.read_csv('data/procs.txt', delimiter=',')

In [None]:
phys_df.head()

In [None]:
procs_df.head()

In [None]:
drs_procs = pd.merge(phys_df, procs_df, how = 'inner', on = 'id')

In [None]:
drs_procs.head()
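
Since the two data files aren't shown here, the following sketch uses small made-up DataFrames to illustrate what `pd.merge` does with the `how` and `on` arguments. The frames `phys` and `procs`, their non-`id` column names, and all of their values are hypothetical stand-ins, not the contents of the actual files.

```python
import pandas as pd

# Hypothetical stand-ins for the two files; they share an 'id' column.
phys = pd.DataFrame({'id': [1, 2, 3],
                     'specialty': ['cardiology', 'oncology', 'urology']})
procs = pd.DataFrame({'id': [1, 1, 3, 4],
                      'procedure': ['ecg', 'stent', 'biopsy', 'xray']})

# Inner join: only ids that appear in both DataFrames survive.
inner = pd.merge(phys, procs, how='inner', on='id')

# Left join: every row of phys is kept; unmatched rows get NaN.
left = pd.merge(phys, procs, how='left', on='id')

print(inner)
print(left)
```

With `how='inner'`, id `2` (no procedures) and id `4` (no physician) both disappear; with `how='left'`, id `2` is kept with a `NaN` procedure.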

## A quick aside on Pandas Series

You might have noticed that in a couple of places, when we asked for certain rows/columns of the data, we got back a 1-D array that had an index attached. These are examples of what Pandas calls `Series`. In the documentation for [Pandas Series](http://pandas.pydata.org/pandas-docs/version/0.15.2/dsintro.html#series), you can get an idea of what they can do. For the most part, we can kind of treat them like a mini DataFrame, as they have a lot of the same methods. However, there are some slight differences. Since we work with DataFrames the majority of the time, we're not going to go into any real depth on Series. 

Here are some examples of things that returned series: 

In [None]:
df['chlorides'] <= 0.08

In [None]:
type(df['chlorides'] <= 0.08)

In [None]:
df.groupby('quality').count()['fixed acidity']
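
A short sketch of the same idea on a made-up DataFrame: selecting one column gives a `Series`, comparing it gives a boolean `Series`, and `to_frame()` turns a `Series` back into a one-column DataFrame.

```python
import pandas as pd

small_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

column = small_df['a']        # single brackets return a Series
mask = small_df['a'] > 1      # a comparison returns a boolean Series
as_frame = column.to_frame()  # back to a one-column DataFrame

print(type(column).__name__, type(mask).__name__, type(as_frame).__name__)
```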

## A quick look at visualization in Pandas

There are numerous libraries available in Python for creating visualizations. Oftentimes, we will probably be using [Matplotlib](http://matplotlib.org/) and/or [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) for anything that is general purpose, and then other libraries if we need something more specialized ([Plotly](https://plot.ly/) for dashboards, for example). All of these libraries allow us to build great looking visualizations that can be used in a production setting. If we want something quick and dirty to visualize our data very easily, there is also some plotting functionality built into Pandas. 

If we look at the [docs](http://pandas.pydata.org/pandas-docs/stable/visualization.html), we can see that the plotting available with Pandas is called via the `plot()` method on a DataFrame object. From there, we pass in arguments to the `plot()` method to specify exactly how to build the plot. The most important of those arguments is the `kind` keyword argument, which tells the `plot()` method what kind of visualization we would like (bar plot, histogram, scatter plot, etc.). Tonight we'll look at one or two examples - since most of the time we'll be doing our visualization in Matplotlib or Seaborn, we'll just get a taste of what Pandas can do so that we know it's there. We'll look at Seaborn and Matplotlib in a little bit more depth in a later class. 

In [None]:
# Code to get ours to function correctly. Pandas plotting is built on top of matplotlib, so 
# we have to import it. 
import matplotlib.pyplot as plt
# This just tells the IPython notebook to plot it inline (i.e. in the browser).
%matplotlib inline
# This will change the style that matplotlib uses (i.e. makes the plots look nicer than the default)
import matplotlib
matplotlib.style.use('ggplot')

In [None]:
# Revisit the data to see what it looks like. 
df.head()

In [None]:
# Let's try a histogram first. 
df.plot(kind='hist') 

In [None]:
# That looked pretty bad. Since we didn't specify a column name to plot, Pandas just plotted
# all of them on top of one another. That's not what we want! Let's try selecting a column and doing that again. 
df['quality'].plot(kind='hist')

In [None]:
# This will not work. Any guesses why?
df.plot(kind='scatter')

In [None]:
# As you might guess from the error, we have to specify X and Y columns for Pandas to plot. 
df.plot(kind='scatter', x='free_sulfur_dioxide', y='total_sulfur_dioxide')

In [None]:
df.plot(kind='box')

In [None]:
# We don't need to specify a column necessarily, but because of the different scales this 
# doesn't look too great. Let's specify three columns and see how that looks...
df[['fixed acidity', 'pH', 'alcohol']].plot(kind='box')

In [None]:
# This still doesn't look great - it's hard to really examine these three columns since pH is 
# so different from the other two. Let's drop pH and try one more time...
df[['fixed acidity', 'alcohol']].plot(kind='box')