# 06 Pandas Data Structures
File(s) needed: stats_exam_3_scores.csv, Applewood_2011.csv

The two primary data structures in pandas are the **Series** and **DataFrame**. We've already seen the DataFrame in action but we will spend more time getting to know these two structures because what we will do later depends on our understanding them.

- Pandas DataFrame
    - A rectangular dataset, like a table
    - Has columns and rows of data
    - Can hold different data types in different columns
    - Within a column, all values share one data type (a column of mixed values is stored with the generic `object` type)
- Pandas Series
    - A "column" of data
    - DataFrames are composed of Series objects

In [None]:
# Make sure pandas library is loaded
import pandas as pd


### The pandas Series
The pandas series is a one-dimensional array of indexed data. It can be created from a list as follows:

In [None]:
# We can create it directly or in parts.
# Here we create the list first.
my_list = [2.25, 2.5, 2.75, 3.0]
my_list

In [None]:
# Create the series from the list.


We can see both a sequence of the values we used to create our series and a sequence of indices (the row numbers on the left). We can call the `.values` and/or `.index` attributes if we need access to either of them individually.
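As a minimal sketch of those two attributes (the name `demo_prices` is made up for this illustration; the values are the same ones used in the list above):

```python
import pandas as pd

# demo_prices is a hypothetical name used only for this illustration.
demo_prices = pd.Series([2.25, 2.5, 2.75, 3.0])

values_only = demo_prices.values   # NumPy array holding just the data
index_only = demo_prices.index     # default integer index: 0, 1, 2, 3

print(values_only)
print(index_only)
```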

In [None]:
# display the data values


In [None]:
# display the index values


We can use the square bracket notation to access individual values using the indices.

In [None]:
# access the value at index 1


These examples may make the pandas Series look like a Python list or a simple NumPy array. But there is a big difference.
- Implicitly defined index - list or array objects use implicitly defined integer indices to access values.
- Explicitly defined index - the pandas Series object has an explicitly defined index we use to access the values.
    - That may not seem like much of a difference, but the result is that **_the index can be values of any type we want._**
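Here is a small sketch of an explicitly defined index. The names `demo_lettered` and `demo_scattered` and the index values are assumptions chosen for illustration, not the ones used in class.

```python
import pandas as pd

# Letter labels instead of the default 0, 1, 2, 3.
demo_lettered = pd.Series([2.25, 2.5, 2.75, 3.0], index=['A', 'B', 'C', 'D'])
value_at_c = demo_lettered['C']        # looked up by the label 'C'

# Labels do not have to be sequential, or even sorted.
demo_scattered = pd.Series([2.25, 2.5, 2.75, 3.0], index=[4, 19, 37, 2])
value_at_37 = demo_scattered[37]       # looked up by the label 37, not position 37

print(value_at_c, value_at_37)
```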

In [None]:
# Add letter index values and retrieve value at index C


In [None]:
# We can even use nonsequential values for the indices.


We can use `loc` and `iloc` to select from or slice a series. Remember that we use `loc` with the index value (the label) and `iloc` with the integer position value. (Plain square brackets look up by index value, so use `iloc` whenever you mean a position.)
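The difference is easiest to see when the labels are integers that do not match the positions. A minimal sketch, assuming a made-up series named `demo_data`:

```python
import pandas as pd

# Hypothetical series whose integer labels differ from the row positions.
demo_data = pd.Series([2.25, 2.5, 2.75, 3.0], index=[4, 19, 37, 2])

by_label = demo_data.loc[37]      # the row whose index label is 37
by_position = demo_data.iloc[3]   # the fourth row, counting from 0

print(by_label)
print(by_position)
```

Here `loc[37]` and `iloc[3]` return different rows, because the label 37 sits at position 2.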

In [None]:
# Display what is in the data object as a reminder


In [None]:
# Show the value at index 37


In [None]:
# Show the value at position 3


In [None]:
# Show just the values


There are many methods built into the pandas series we can use to perform common calculations. We will talk more about these as we need them, but here is an example of calculating the mean.

See https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#descriptive-statistics for more info.
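A short sketch of a few of these built-in calculations, using a made-up series named `demo_values`:

```python
import pandas as pd

demo_values = pd.Series([2.25, 2.5, 2.75, 3.0])

average = demo_values.mean()       # arithmetic mean of the values
smallest = demo_values.min()
largest = demo_values.max()
summary = demo_values.describe()   # count, mean, std, min, quartiles, max

print(average)
print(summary)
```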

In [None]:
# Calculating the mean of the data series


### Boolean subsetting in a series
We have already worked with ways to get a subset of our data based upon index values or positions. We will seldom (if ever) know the exact index or position values we want. Instead, we typically look for rows that meet some condition, as with a SQL query or an Excel formula. We can use a simple conditional statement that evaluates to `True` or `False` to get the rows of interest.

We will use the square brackets format to select our subset, so the general form of this statement is
```
series_name[conditional statement]
```

We need a little bigger data set for this example. Load "stats_exam_3_scores.csv" and convert the results to a series to get started.

In [None]:
# First, let's load a bigger set of data into a series.
# Read the file "stats_exam_3_scores.csv" and get it into a Series named "scores"
# pd.read_csv saves data in a data frame so we will extract the scores column to a Series.



In [None]:
# Extract the data to a series.


In [None]:
# Display the scores series to see the data we loaded.

# Use the describe method to see what the data looks like.


Now, subset the data in the series using a Boolean statement.

In this example, we want to get all the rows with scores of 145 or more.

In [None]:
# Subset all the rows with scores of 145 or more


#### What is actually happening in this statement? Let's run just the conditional part to see the results.

In [None]:
# Use print() to see just the conditional results


So it looks like a Boolean series is generated and subsequently applied to subset our data.

When we wrote the statement `data['C']` earlier, we told pandas to return the row where the index value was a capital C. This time, we effectively told pandas to return the rows where the value was greater than or equal to 145.

Specifically, the statement `scores[scores >= 145]` tells pandas to do the following:
- look at each value in `scores`
- create a parallel Boolean series that contains the value `True` if the value is greater than or equal to 145 or `False` if it is not.
- subset all the rows that have a `True` value in the corresponding position in the parallel Boolean series.
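The steps above can be sketched with a handful of made-up scores. The series `demo_scores` is an assumption for illustration only; the real data come from the CSV file.

```python
import pandas as pd

# Hypothetical exam scores, not the contents of stats_exam_3_scores.csv.
demo_scores = pd.Series([150, 88, 145, 132, 97, 146])

# Step 1 and 2: the comparison builds a parallel Boolean series.
mask = demo_scores >= 145

# Step 3: the Boolean series keeps only the rows marked True.
top_scores = demo_scores[mask]

print(mask)
print(top_scores)
```

Notice that the rows that survive keep their original index values.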

In [None]:
# Try another one: inspect all the failing scores (below 60 percent)
#    That is a score of 90 out of the 150 total points on this exam.


### Wait a minute .... where's the loop?
If we were doing this same task in base Python (or any other typical programming language), we would loop through the data one element at a time. We would test each value against our condition, include it in the results if the test was true, and omit it if the test was false.

Many of the methods we will use for the pandas series and data frame are **_vectorized_**. That means they work on the entire data structure at once. Think of it this way - it's like the loop is built into the method.
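To make the comparison concrete, here is a sketch that does the same filtering both ways. The scores are made up, and the cutoff of 90 is the failing threshold mentioned earlier.

```python
import pandas as pd

# Hypothetical exam scores used only for this comparison.
demo_scores = pd.Series([150, 88, 145, 132, 97, 146])

# The explicit loop: test one value at a time.
failing_loop = []
for value in demo_scores:
    if value < 90:
        failing_loop.append(value)

# The vectorized version: one statement, no visible loop.
failing_vectorized = demo_scores[demo_scores < 90]

print(failing_loop)
print(failing_vectorized.tolist())
```

Both approaches select the same values; the vectorized form is shorter and is usually faster on large data sets.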

In [None]:
# One more: save all the values above 80% (score of 120) to a new series called "above_80"
# Start with the general form of this type of statement.
# TIP: 120 is 80% of the 150 possible points. If you want the 80th percentile of the class instead,
#      use the quantile() method of the pandas series in the conditional statement


## The pandas DataFrame
As we saw previously, the DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a Python dictionary of "aligned" Series objects. "Aligned" means they share the same index. 

It is generally the most commonly used pandas object. You will soon see why.

Along with the data, you can optionally pass _index_ (row labels) and _columns_ (column labels) arguments. If you pass an index and/or columns, you are specifying the index and/or column names of the resulting DataFrame.

If row and column labels are not passed, they will be constructed by pandas from the input data based on common sense rules.

In [None]:
# Example: create a DataFrame using dictionaries
# Python dictionaries are made up of a key and a value. In this case, the values are themselves lists and
# the key is used as the column name.

# You are very unlikely to use this method in practice, so don't get bogged down by the details.
pythons = pd.DataFrame({
    'Name': ['Michael', 'John', 'Graham', 'Eric'],
    'Office': ['COB 305M', 'COB 305J', 'COB 305G', 'COB 305E'],
    'Age': [59, 66, 70, 72],
    'Favorite': ['Finland', 'Ministry of Silly Walks', "What's All This Then?", 'The Galaxy Song']})
pythons

### What part of this table is the actual data?

---

By default we get integer row index values. What if we wanted to use the values in our 'Name' column as the row labels? Move those values to the index parameter. We still maintain the integer position values for each row but now can specify a particular index name (or label) for each row.

Kind of confusing, so let's look at an example.

In [None]:
# Specify row names using the index parameter.
# Add the columns parameter to specify a particular column order when we create the dataframe.
pythons = pd.DataFrame(
    data={'Office': ['COB 305M', 'COB 305J', 'COB 305G', 'COB 305E'],
          'Age': [59, 66, 70, 72],
          'Favorite': ['Cheese Shop', 'Ministry of Silly Walks', "What's All This Then?", 'Wink wink nudge nudge']},
    index=['Michael', 'John', 'Graham', 'Eric'],
    columns=['Age', 'Favorite', 'Office'])
pythons

### What part of _this_ table is the actual data? What has changed? 

---

In [None]:
# What are the index values?


In [None]:
# Get a row by index value (label)


In [None]:
# Get a row by index position


We typically have our data table constructed so each row is an instance of an entity and each column is an attribute of the entity. In this case, our data is modeling Monty Python members (the entity), with each row holding one member's data and each column holding a descriptive attribute of the members.

The data frame is actually built from the columns, not the rows, however. Each column is a Series whose values all share one data type.

To see this, look at individual columns and how they all have the same row names.
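A minimal sketch of that idea, using a two-row frame built from values in the table above (the name `demo_team` is made up):

```python
import pandas as pd

demo_team = pd.DataFrame(
    {'Age': [59, 66], 'Office': ['COB 305M', 'COB 305J']},
    index=['Michael', 'John'])

ages = demo_team['Age']          # pulling out one column gives a Series
offices = demo_team['Office']

print(type(ages))
print(ages.index.equals(offices.index))   # both columns carry the same row labels
print(demo_team.dtypes)                   # one data type per column
```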

In [None]:
# Subset the Age column


In [None]:
# Subset the Favorite column


In [None]:
# What data type is pythons_fav?


Just like the Series object, the pandas DataFrame object has an `index` attribute we can use to get the index labels. It also has a `columns` attribute for the column labels. We can also use the `values` attribute to see the underlying data.
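As a quick sketch of these three attributes on a small, made-up frame named `demo_team`:

```python
import pandas as pd

demo_team = pd.DataFrame(
    {'Age': [59, 66], 'Office': ['COB 305M', 'COB 305J']},
    index=['Michael', 'John'])

row_labels = demo_team.index       # the row labels
column_labels = demo_team.columns  # the column labels
raw_data = demo_team.values        # 2-D NumPy array with no labels attached

print(row_labels)
print(column_labels)
print(raw_data)
```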


In [None]:
# What are the index values?


In [None]:
# Display the column names


In [None]:
# Display the data values in the data frame


The DataFrame maps a column name to the Series object that contains the column's data. For our purposes, we will look at it the way we look at tables of data.

If we reference a particular column, we can see the contents of that column.

We can also access a single row by indexing the array returned by the `values` attribute with the row's integer position.

In [None]:
# use values attribute to get the row at position 2


In [None]:
# remember using the column name?


If we want to refer to rows by the key with plain square brackets, we have to use **_slicing_**, because a single label inside the brackets is read as a column name. With slicing, we are trying to select a subset of the rows in the data.

Note that when we use key values, the operation is right INclusive.
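The sketch below contrasts the two kinds of slice. It uses `loc` and `iloc` to make the intent explicit, which is an assumption about style rather than the only way to slice; the frame `demo_team` is made up from the ages shown earlier.

```python
import pandas as pd

demo_team = pd.DataFrame(
    {'Age': [59, 66, 70, 72]},
    index=['Michael', 'John', 'Graham', 'Eric'])

# Slicing by label: the end label 'Graham' IS included.
by_label = demo_team.loc['Michael':'Graham']

# Slicing by position: the end position 2 is NOT included.
by_position = demo_team.iloc[0:2]

print(by_label)
print(by_position)
```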

In [None]:
# Example: slicing


In [None]:
# We can also use row numbers in a slice
# With the row numbers it is right EXclusive


Just like we did with the series, we can also use Boolean statements to select rows that meet a condition. We will talk about this more as we need it, but think of it as being able to run a simple query on the DataFrame.
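A minimal sketch of such a query on a made-up frame named `demo_team`. The column can be referenced with brackets or with dot notation; both give the same Boolean series.

```python
import pandas as pd

demo_team = pd.DataFrame(
    {'Age': [59, 66, 70, 72],
     'Office': ['COB 305M', 'COB 305J', 'COB 305G', 'COB 305E']},
    index=['Michael', 'John', 'Graham', 'Eric'])

is_older = demo_team['Age'] > 67            # Boolean series, one entry per row
older = demo_team[is_older]                 # keep only the rows marked True
same_rows = demo_team[demo_team.Age > 67]   # dot notation, same result

print(older)
```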

In [None]:
# Use a conditional to select rows where Age > 67.
# Reference the column with dot notation (pythons.Age) or with brackets (pythons['Age']).


Let's do another example with a bigger data set. Load the "Applewood_2011.csv" data into a data frame called "cars."


In [None]:
# Load the "Applewood_2011.csv" data into "cars" object and make sure it is what we expect


In [None]:
# Get rows where age > 67 and save as sub_cars


In [None]:
# Get rows where the location is Olean


In [None]:
# Get rows for customers who made at least 3 previous purchases


### Adding rows
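It is usually easier to add rows in the original data set and reimport the file. However, we can add a row to an existing data frame by first creating the new row as a data frame with the same structure as our original data frame. Then we add the new row to the original data frame. Older versions of pandas did this with the `append()` method of the original data frame object; that method was deprecated in pandas 1.4 and removed in pandas 2.0, where `pd.concat()` does the same job.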

Let's do an example so you can see why you don't want to do it this way.
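On pandas 2.0 and later, where `append()` is no longer available, the same step can be sketched with `pd.concat()`. The two-row frame `demo_team` is a made-up stand-in for the full table.

```python
import pandas as pd

demo_team = pd.DataFrame(
    {'Age': [59, 66],
     'Favorite': ['Cheese Shop', 'Ministry of Silly Walks'],
     'Office': ['COB 305M', 'COB 305J']},
    index=['Michael', 'John'])

# The new row is a one-row data frame with the same column names.
demo_new_row = pd.DataFrame(
    data=[{'Favorite': 'Spam', 'Age': 63, 'Office': 'COB 305T'}],
    index=['Terry'])

# concat stacks the frames and matches columns by name, not by order.
demo_team = pd.concat([demo_team, demo_new_row])

print(demo_team)
```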

In [None]:
# Add a row for Terry Jones: 63 years old, favorite is "Spam", and office is "COB 305T"
# First, create the new data as a data frame using a dictionary.
new_row = pd.DataFrame(
    data=[{'Favorite': 'Spam', 'Age': 63, 'Office': 'COB 305T'}],
    index=['Terry'])
new_row

In [None]:
# Add the new row (a one-row data frame) to the pythons data frame.
# append() only works before pandas 2.0; use pd.concat() on newer versions.


In [None]:
# show the index values


### Adding columns
There are many reasons we might add a column. When we add a column, pandas matches index key values to make sure the data stays aligned. Let's add a column to our pythons data frame to see how this works.

First we need a column (i.e., another series) to add.

In [None]:
# Create another series of data related to the pythons


In [None]:
# Remember what the pythons data frame looks like?


In [None]:
# Add the new column by specifying the new column name


What happens if an index value doesn't match up? Go back to the cell creating the new column and change a value, then rerun these cells to see the effect.
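Here is one way that experiment might look. The shoe sizes are invented, and the new series is deliberately out of order and contains a label ('Terry') that is not in the data frame.

```python
import pandas as pd

demo_team = pd.DataFrame(
    {'Age': [59, 66, 70]},
    index=['Michael', 'John', 'Graham'])

# Hypothetical shoe sizes: different row order, and 'Terry' instead of 'Graham'.
demo_shoes = pd.Series([9.5, 11.0, 10.0], index=['John', 'Michael', 'Terry'])

# pandas lines the values up by index label, not by position.
demo_team['Shoe_size'] = demo_shoes

print(demo_team)
```

Michael and John get their own values even though the order differed. Graham has no match, so his value is missing (NaN), and the unmatched 'Terry' value is not added.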

### Dropping columns
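We may decide we don't need one or more columns in our data set any more. If that is the case, we can use the `drop()` method to remove the column. Because `drop()` looks at row labels by default, we have to tell it the label refers to a column, using either the `columns` parameter or `axis=1`.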

In [None]:
# Drop the Shoe_size column from pythons and see the change.


In [None]:
# Display the data frame. What's the deal?


As a failsafe, the results of the `drop()` method are not saved to the original data frame by default. There are two options:
1. save the results to a new data frame object
2. use the parameter `inplace=True` to commit the change to the original data frame
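Both options can be sketched on a small, made-up frame named `demo_team`:

```python
import pandas as pd

demo_team = pd.DataFrame(
    {'Age': [59, 66],
     'Office': ['COB 305M', 'COB 305J'],
     'Shoe_size': [11.0, 9.5]},
    index=['Michael', 'John'])

# Option 1: drop() returns a new frame and leaves the original alone.
demo_trimmed = demo_team.drop(columns='Shoe_size')
columns_before = demo_team.columns.tolist()   # Shoe_size is still here

# Option 2: inplace=True changes the original. A list drops several columns at once.
demo_team.drop(columns=['Office', 'Shoe_size'], inplace=True)
columns_after = demo_team.columns.tolist()

print(columns_before)
print(demo_trimmed.columns.tolist())
print(columns_after)
```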


In [None]:
# Add inplace=True to save the changes


To drop multiple columns at once, pass the column names as a list.

In [None]:
# Drop Location and Previous from the Applewood data frame


# Once you like the results, save to another object or add inplace parameter


### More about data file access
NOTE: This is the situation where you can benefit from having a folder in your project structure called **working** (or something similar) to hold modified data files. That allows you to keep original data and modified data separate.

---
#### Pickling
There are times when you may want to save your modified data frame to disk. One way to do that is by **_pickling_**. Pickling is how Python saves a data container as a binary file.

This may come in handy if you need to save your data results from one program for use in another. However, the data can only be accessed with Python. Other programs will not be able to open it.
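A minimal round-trip sketch using `to_pickle()` and `pd.read_pickle()`. It writes to a temporary folder so nothing is left behind; in a real project you would more likely point the path at your working folder. The file name `demo_team.pickle` is made up.

```python
import os
import tempfile

import pandas as pd

demo_team = pd.DataFrame(
    {'Age': [59, 66], 'Office': ['COB 305M', 'COB 305J']},
    index=['Michael', 'John'])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo_team.pickle')
    demo_team.to_pickle(path)          # save the data frame as a binary file
    restored = pd.read_pickle(path)    # load it back into a new object

print(restored.equals(demo_team))      # the index, columns, and types all survive
```

Only unpickle files you created yourself or otherwise trust, because loading a pickle can run code.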


In [None]:
# Save a data frame to a pickled file - specify a path and file name
# The filename extension doesn't matter, but "pickle" is self-documenting.


In [None]:
# Read a pickled file and save it in a named object


#### CSV files
Comma-separated value (CSV) files are a very flexible type to use for data storage. They are simply basic text files with commas separating the values in each row, and any program can open them. We have already used the basic version of `pd.read_csv` to load some of our data. The DataFrame and Series objects have a `to_csv()` method built in that allows you to save data to a CSV file.
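To see what ends up in the file, this sketch calls `to_csv()` without a file name, which returns the CSV text as a string instead of writing to disk. Passing a path such as `'working/demo_team.csv'` (a made-up name) would write the same text to a file.

```python
import io

import pandas as pd

demo_team = pd.DataFrame({'Age': [59, 66]}, index=['Michael', 'John'])

with_index = demo_team.to_csv()                # row labels become the first column
without_index = demo_team.to_csv(index=False)  # row labels are thrown away

print(with_index)
print(without_index)

# Reading the text back: index_col=0 turns the first column into the index again.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored)
```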

In [None]:
# Save a data frame as a csv file and open it in Excel or Notepad to see what is there


In [None]:
# Save the data frame without row indexes
# CAUTION: the row indexes might be important so only use this if they are not


#### Excel files
<p style="font-size:125%;padding:35px;text-align:center;background-color:dodgerblue;color:white">You should stick with CSV files as much as possible. If you have data in an Excel spreadsheet, you will usually be better off if you use Excel to save it in CSV format and then work with the CSV file.</p>

However, pandas can read Excel files directly. The following is provided for your reference.

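The `read_excel()` function in pandas can read Excel 2003 (.xls) and Excel 2007+ (.xlsx) files. Recent versions of pandas read .xls files with the `xlrd` Python module and .xlsx files with the `openpyxl` module, so the matching module must be installed. The `to_excel()` instance method is used for saving a DataFrame to Excel. Generally the semantics are similar to working with csv data.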

In the most basic use-case, `read_excel` takes a path to an Excel file and the `sheet_name` indicating which sheet to parse. You can also specify a column to use as the index values and how to handle missing data.

```
# Returns a DataFrame
pd.read_excel('path_to_file.xls', sheet_name='Sheet1')
```

Using more of the available arguments:
```
# Using the sheet name, specifying the index, and missing values as NA:
pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

# Using the sheet index, specifying the index, and missing values as NA:
pd.read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])

# Using all default values:
pd.read_excel('path_to_file.xls')

# Using None to get all sheets:
pd.read_excel('path_to_file.xls', sheet_name=None)

# Using a list to get multiple sheets:
# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
pd.read_excel('path_to_file.xls', sheet_name=['Sheet1', 3])
```

As you see in the last two entries, `read_excel` can read more than one sheet at a time by setting sheet_name to either a list of sheet names, a list of sheet positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an integer or string, respectively.

#### Database files
pandas also has built-in functions for reading from databases, such as `pd.read_sql()`. Many of the best known databases rely on SQL and pandas leverages SQL for data access. pandas relies on the library `sqlalchemy` to connect to SQL databases like SQLite, MySQL, SQL Server, and MS Access (through the ODBC standard). We will not have time to cover these in class but you should be aware of them for future use.
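For future reference, here is a small sketch of the idea. It is an assumption-laden illustration, not the approach for every database: it uses Python's built-in `sqlite3` module with a throwaway in-memory database, which pandas accepts directly without `sqlalchemy`. The table name `pythons` and the query are made up for the example.

```python
import sqlite3

import pandas as pd

demo_team = pd.DataFrame(
    {'Name': ['Michael', 'John', 'Graham', 'Eric'],
     'Age': [59, 66, 70, 72]})

# An in-memory SQLite database that disappears when the connection closes.
conn = sqlite3.connect(':memory:')

# Write the data frame to a table, then query it back with SQL.
demo_team.to_sql('pythons', conn, index=False)
older = pd.read_sql_query(
    'SELECT Name, Age FROM pythons WHERE Age > 67 ORDER BY Age', conn)

conn.close()
print(older)
```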