The DataFrame data structure is the heart of the pandas library. It's the primary object that you'll be working with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional series object, where there's an index and multiple columns of content, with each column having a label. In fact, the distinction between a column and a row is only a conceptual distinction, and you can think of the DataFrame itself as simply a two-axis labeled array.

In [1]:
# Let's start by importing our pandas library:
import pandas as pd

In [2]:
# I'm going to jump in with an example. Let's create three school records for students and their class grades. I'll create each
# as a series which has a student name, the class name, and the score:
record1 = pd.Series({'Name': 'Mark',
                    'Class': 'Physics',
                    'Score': 90})
record2 = pd.Series({'Name': 'Elon',
                    'Class': 'Math',
                    'Score': 92})
record3 = pd.Series({'Name': 'Tim',
                    'Class': 'Chemistry',
                    'Score': 85})

In [6]:
# So like a Series, the DataFrame object is indexed. I'll use a group of series, where each series represents a row of data.
# Just like the Series function, we can pass in our individual items in an array and we can pass in our index values as the
# second argument:
# So df, a common name for a DataFrame, equals pd.DataFrame, and we'll pass in record1, record2, record3 as our array, and
# we'll just say our index is school1, school2 and school3.
df = pd.DataFrame([record1, record2, record3],
                 index=['school1','school2','school3'])

#Just like the Series, we can use the head() function to see the first several rows of the DataFrame, including indices from
#both axes, and we can use this to verify the columns and the rows:
df.head()

Unnamed: 0,Name,Class,Score
school1,Mark,Physics,90
school2,Elon,Math,92
school3,Tim,Chemistry,85


In [None]:
# You'll notice here that Jupyter creates a nice bit of HTML to render the results of the DataFrame. So we have the index, which
# is the leftmost column and is the school name, and then we have the rows of data, where each row has a column header which was
# given in our initial record dictionaries.
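As a small sketch of the same idea outside the HTML rendering, the labels on both axes can be read straight off the DataFrame. The two records below are a shortened version of the ones above, rebuilt here only so that the snippet runs on its own:

```python
import pandas as pd

# Two of the records from above, rebuilt so this snippet is self-contained.
record1 = pd.Series({'Name': 'Mark', 'Class': 'Physics', 'Score': 90})
record2 = pd.Series({'Name': 'Elon', 'Class': 'Math', 'Score': 92})
df = pd.DataFrame([record1, record2], index=['school1', 'school2'])

row_labels = list(df.index)        # labels along the row axis
column_labels = list(df.columns)   # labels along the column axis, taken from the dictionary keys
print(row_labels)
print(column_labels)
print(df.shape)                    # (number of rows, number of columns)
```

The column labels come from the keys of the record dictionaries, and the row labels from the `index` argument.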

In [13]:
# An alternative method is that you could use a list of dictionaries, where each dictionary represents a row of data.
students = [{'Name': 'Mark',
                    'Class': 'Physics',
                    'Score': 90},
            {'Name': 'Elon',
                    'Class': 'Math',
                    'Score': 92},
            {'Name': 'Tim',
                    'Class': 'Chemistry',
                    'Score': 85}]
# Then we pass this list of dictionaries into the DataFrame function:
df = pd.DataFrame(students, index=['school1','school2', 'school1'])
# And let's print the head again:
df.head()

Unnamed: 0,Name,Class,Score
school1,Mark,Physics,90
school2,Elon,Math,92
school1,Tim,Chemistry,85


In [14]:
# So similar to the Series, we can extract data using the .iloc and .loc attributes. Because the DataFrame is two-dimensional,
# passing a single value to the .loc indexing operator will return a Series if there's only one row to return.

# For instance, if we wanted to select data associated with school2, we would just query the .loc attribute with one parameter:
df.loc['school2']


Name     Elon
Class    Math
Score      92
Name: school2, dtype: object

In [15]:
# You'll note that the name of the Series is the row's index value, while the column names appear as the labels in the output.

# We can check the data type of the return value using the Python type function:
type(df.loc['school2'])

pandas.core.series.Series

In [16]:
# It's important to remember that the indices and column names along either axis, horizontal or vertical, could be non-unique.
# In this example, we see two records for school1 as different rows. If we use a single value with the DataFrame .loc attribute,
# multiple rows of the DataFrame will be returned, not as a new Series, but as a new DataFrame.

#So let's query for school1 records
df.loc['school1']

Unnamed: 0,Name,Class,Score
school1,Mark,Physics,90
school1,Tim,Chemistry,85


In [17]:
# Here we can see the type of this is actually different too:
type(df.loc['school1'])

pandas.core.frame.DataFrame
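Because the return type depends on how many rows match, it can help to ask for a DataFrame explicitly. A minimal sketch, using the same student data as above: passing a list of labels to `.loc` returns a DataFrame even when only one row matches.

```python
import pandas as pd

students = [{'Name': 'Mark', 'Class': 'Physics', 'Score': 90},
            {'Name': 'Elon', 'Class': 'Math', 'Score': 92},
            {'Name': 'Tim', 'Class': 'Chemistry', 'Score': 85}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

single = df.loc['school2']      # scalar label, one matching row: a Series
wrapped = df.loc[['school2']]   # list of labels: a DataFrame, even for one row
repeated = df.loc['school1']    # scalar label, two matching rows: a DataFrame
print(type(single).__name__, type(wrapped).__name__, type(repeated).__name__)
```

The list form is handy when later code expects a DataFrame regardless of how many rows share a label.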

In [18]:
# One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes. For instance, if you
# wanted to just list the student names for school1, you can supply two parameters to .loc, one being the row index and the
# other being the column name.

# So for instance, if we're only interested in school1 student names, we can say df.loc with 'school1' as the first parameter,
# and 'Name' as the second parameter:
df.loc['school1','Name']

school1    Mark
school1     Tim
Name: Name, dtype: object

In [19]:
# Remember, just like the Series, the pandas developers have implemented this using indexing operators and not parameters to a
# function.

# So what would we do if we wanted to select a single column though? Well, there are a few mechanisms. First, we could transpose
# the matrix. This pivots all of the rows into columns and all of the columns into rows, and it's done with the T attribute:
df.T

Unnamed: 0,school1,school2,school1
Name,Mark,Elon,Tim
Class,Physics,Math,Chemistry
Score,90,92,85


In [21]:
# Then we can call .loc on the transpose to get the student names only:
df.T.loc['Name']

school1    Mark
school2    Elon
school1     Tim
Name: Name, dtype: object

In [25]:
#However, since iloc and loc are used for row selection, pandas reserves the indexing operator directly on the DataFrame for
#column selection. In a pandas DataFrame, columns always have a name. So this selection is always label based, and it's not as
#confusing as it was when using the square bracket operator on the Series objects. For those familiar with relational databases,
#this operator is analogous to column projection.
df['Name']

school1    Mark
school2    Elon
school1     Tim
Name: Name, dtype: object

In [27]:
#In practice, this works really well since you're often trying to add or drop columns. However, this means that you can get
#a KeyError when you try to use .loc with a column name.
df.loc['Name']

KeyError: 'Name'

In [28]:
#Note too that the result of a single column projection is a Series object.
type(df['Name'])

pandas.core.series.Series
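Returning to the KeyError for a moment, here is a minimal sketch of what goes wrong and one way around it, using the same student data. With a single value, `.loc` looks for a row label, so a column name is not found; putting a full-row slice first makes the second value a column label.

```python
import pandas as pd

students = [{'Name': 'Mark', 'Class': 'Physics', 'Score': 90},
            {'Name': 'Elon', 'Class': 'Math', 'Score': 92},
            {'Name': 'Tim', 'Class': 'Chemistry', 'Score': 85}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

try:
    df.loc['Name']            # 'Name' is a column label, not a row label
    lookup_failed = False
except KeyError:
    lookup_failed = True

names = df.loc[:, 'Name']     # all rows, then the 'Name' column
print(lookup_failed)
print(names.equals(df['Name']))
```

Both `df['Name']` and `df.loc[:, 'Name']` give back the same Series here.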

In [30]:
#Since the result of using the indexing operator is either a DataFrame or Series, you can chain operations together. For
#instance, we can select all of the rows which relate to school1 using .loc, then project the name column for just those rows:
df.loc['school1']['Name']

school1    Mark
school1     Tim
Name: Name, dtype: object

In [32]:
#If you get confused, you can use type to check the responses from resulting operations:
print(type(df.loc['school1'])) #Should be a DataFrame
print(type(df.loc['school1']['Name'])) #should be a Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


Chaining, by indexing on the return type of another index, can come with some cost and is best avoided if you can use another approach. In particular, chaining tends to cause pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data though, it's an important distinction, because this can be a source of error.
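A minimal sketch of two ways to stay out of trouble when changing data, using the same student data. Exactly when pandas hands back a copy rather than a view depends on the pandas version and on the data, so this sketch does not rely on either: it makes the copy explicit in one case and updates the original with a single `.loc` call in the other. The replacement scores are made up for illustration.

```python
import pandas as pd

students = [{'Name': 'Mark', 'Class': 'Physics', 'Score': 90},
            {'Name': 'Elon', 'Class': 'Math', 'Score': 92},
            {'Name': 'Tim', 'Class': 'Chemistry', 'Score': 85}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# Option 1: work on an explicit copy, so changes clearly stay out of the original.
subset = df.loc['school1'].copy()
subset['Score'] = 0

# Option 2: change the original in one step, giving the row and column labels together.
df.loc['school2', 'Score'] = 95

print(list(subset['Score']))
print(list(df['Score']))
```

The single `.loc[row, column]` form leaves no intermediate object whose copy-or-view status you need to reason about.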

In [33]:
#Here's another approach. As we saw, .loc does row selection, and it can take two parameters, the row index and the list of
#column names. The .loc attribute also supports slicing.

#If we wanted to select all rows, we can use a colon to indicate a full slice from beginning to end. This is just like slicing
#elements of a list in Python. Then we can add the column name as the second parameter as a string. If we wanted to include
#multiple columns, we could do so in a list, and pandas will bring back only the columns that we've asked for.

#Here's an example, where we ask for all of the names and scores for all schools using the .loc operator.
#So df.loc, and I want all schools, so I'm just going to put a colon as the first parameter for the row index selection, and
#then as the second parameter, I want to project the name and the score as columns.
df.loc[:,['Name','Score']]

Unnamed: 0,Name,Score
school1,Mark,90
school2,Elon,92
school1,Tim,85


In [34]:
#Take a look at that again: the colon means that we want to get all of the rows, and the list in the second argument
#position is the list of the columns that we want to get back.
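Two small variations on the same pattern, sketched with the same student data: a label slice in the column position, and a list of row labels combined with a list of column labels. Note that label slices in `.loc` include both endpoints, unlike positional slices of a Python list.

```python
import pandas as pd

students = [{'Name': 'Mark', 'Class': 'Physics', 'Score': 90},
            {'Name': 'Elon', 'Class': 'Math', 'Score': 92},
            {'Name': 'Tim', 'Class': 'Chemistry', 'Score': 85}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# All rows, columns from 'Name' through 'Class' (both endpoints included).
names_and_classes = df.loc[:, 'Name':'Class']

# Only the school2 row, and only the Name and Score columns.
school2_scores = df.loc[['school2'], ['Name', 'Score']]

print(list(names_and_classes.columns))
print(school2_scores)
```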

In [35]:
#That's selecting and projecting data from a DataFrame based on row and column labels. The key concepts to remember are that
#rows and columns are really just for our benefit. Underneath this is just a two-axis labeled array, and transposing the
#columns is easy. Also consider the issue of chaining carefully and try to avoid it, as it can cause some unpredictable results,
#where your intent was to obtain a different view of the data, but instead pandas returns to you a copy.
#Before we leave the discussion of accessing data in DataFrames, let's talk about dropping data. It's easy to delete data in
#Series and DataFrames, and we can use the drop function to do so. This function takes a single parameter, which is the index or
#row label to drop. This is another tricky place for new users. The drop function doesn't actually change the DataFrame by
#default; instead the drop function returns to you a copy of the DataFrame with the given rows removed.
df.drop('school1') 

Unnamed: 0,Name,Class,Score
school2,Elon,Math,92


In [36]:
#But if we look at our original DataFrame, we see the data is actually still intact:
df


Unnamed: 0,Name,Class,Score
school1,Mark,Physics,90
school2,Elon,Math,92
school1,Tim,Chemistry,85


In [37]:
#Drop has two interesting optional parameters. The first is called inplace, and if it's set to True, the DataFrame will be
#updated in place instead of a copy being returned. The second parameter is axis, which says which axis should be dropped from.
#By default this value is 0, indicating the row axis. But you can change it to 1 if you wanted to drop a column.

#For example, let's make a copy of the DataFrame using the .copy() function. 
copy_df = df.copy()
#Now let's drop the name column in this copy
copy_df.drop('Name', inplace=True, axis=1)
copy_df


Unnamed: 0,Class,Score
school1,Physics,90
school2,Math,92
school1,Chemistry,85
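As a sketch of an alternative spelling, `drop` also accepts `index=` and `columns=` keyword arguments, which say which axis is meant without an `axis` number. The example reuses the same student data and keeps the results in new variables rather than using `inplace`:

```python
import pandas as pd

students = [{'Name': 'Mark', 'Class': 'Physics', 'Score': 90},
            {'Name': 'Elon', 'Class': 'Math', 'Score': 92},
            {'Name': 'Tim', 'Class': 'Chemistry', 'Score': 85}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

without_name = df.drop(columns=['Name'])   # same effect as df.drop('Name', axis=1)
only_school2 = df.drop(index='school1')    # removes every row labeled school1

print(list(without_name.columns))
print(list(only_school2.index))
print(list(df.columns))                    # the original DataFrame is unchanged
```

Assigning the result to a name makes it obvious which object holds the reduced data.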


In [38]:
#There is a second way to drop a column, and that's directly through the use of the indexing operator, using the del keyword.
#This way of dropping data, however, takes immediate effect on the DataFrame and does not return a new DataFrame.
del copy_df['Class']
copy_df


Unnamed: 0,Score
school1,90
school2,92
school1,85


In [39]:
#Finally, adding a new column to the DataFrame is as easy as assigning it to some value using the indexing operator. For
#instance, if we wanted to add a class ranking column with a default value of None, we could do so by using the assignment
#operator after the square brackets. This broadcasts the default value to the new column immediately:
df['ClassRanking'] = None
df

Unnamed: 0,Name,Class,Score,ClassRanking
school1,Mark,Physics,90,
school2,Elon,Math,92,
school1,Tim,Chemistry,85,
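The same assignment syntax also accepts one value per row. A minimal sketch with the same student data: the scalar None is broadcast to every row, while a list must have exactly as many entries as there are rows. The ranking values below are made up for illustration.

```python
import pandas as pd

students = [{'Name': 'Mark', 'Class': 'Physics', 'Score': 90},
            {'Name': 'Elon', 'Class': 'Math', 'Score': 92},
            {'Name': 'Tim', 'Class': 'Chemistry', 'Score': 85}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

df['ClassRanking'] = None        # one scalar, broadcast to all three rows
all_missing = bool(df['ClassRanking'].isna().all())

df['ClassRanking'] = [2, 1, 3]   # hypothetical rankings, one per row, in row order
print(all_missing)
print(df)
```

Assigning to an existing column name, as in the second assignment, overwrites that column.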


#In this lecture, you've learned about the data structure you'll use the most in pandas, the DataFrame. The DataFrame is indexed both by row and column, and you can easily select individual rows and project the columns you're interested in, using the familiar indexing methods from the Series class. You'll be gaining a lot more experience with the DataFrame in the content to come.