The DataFrame data structure is the heart of the pandas library. It is the primary object that you will be working with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional Series object, where there is an index and multiple columns of content, with each column having a label.
In fact, the distinction between a column and a row is really only a conceptual one, and you can think of the DataFrame itself as simply a two-axis labeled array.

In [1]:
# Let's import our pandas library
import pandas as pd


In [4]:
# Let's create three school records for students and their class grades.
# I'll create each as a Series which has a student name, the class name and the score.

record1 = pd.Series({'name': 'alice',
                    'class': 'physics',
                    'score': 85})
record2 = pd.Series({'name': 'jack',
                    'class': 'chemistry',
                    'score': 82})
record3 = pd.Series({'name': 'hellen',
                     'class': 'biology',
                     'score': 90})

In [5]:
# Like a Series, the DataFrame object is indexed. Here I'll use a group of Series, where each Series represents a row of data.
# Just like the Series function, we can pass in our individual items in a list, and we can pass in our index values as a second argument.

df = pd.DataFrame([record1, record2, record3], index = ['school1', 'school2', 'school3'])

# And just like with a Series, we can use the head() function to view the first five rows of our DataFrame, including the labels from both axes, and we can use this to verify the columns and the rows.
df.head()

Unnamed: 0,name,class,score
school1,alice,physics,85
school2,jack,chemistry,82
school3,hellen,biology,90


In [6]:
# Notice that Jupyter creates a nice bit of HTML to render the results of the DataFrame.
# So we have the index, which is the leftmost column and is the school name, and then we have the rows of data, where each value
# sits under a column header taken from the keys of our initial record dictionaries.
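
The two sets of labels described above can also be inspected directly. The following is a minimal sketch that rebuilds a small frame from the same three records; the name example_df is only an illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

# Illustrative data mirroring the three records above.
records = [
    {'name': 'alice', 'class': 'physics', 'score': 85},
    {'name': 'jack', 'class': 'chemistry', 'score': 82},
    {'name': 'hellen', 'class': 'biology', 'score': 90},
]
example_df = pd.DataFrame(records, index=['school1', 'school2', 'school3'])

row_labels = list(example_df.index)       # labels along axis 0 (rows)
column_labels = list(example_df.columns)  # labels along axis 1 (columns)

print(row_labels)
print(column_labels)
print(example_df.shape)  # (number of rows, number of columns)
```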

In [7]:
# An alternative method is to use a list of dictionaries, where each dictionary represents a row of data.

students = [{'name': 'alice',
            'class': 'biology',
            'score': 87},
           {'name': 'max',
           'class': 'physics',
           'score': 78},
           {'name': 'tonny',
           'class': 'calculus',
           'score': 10}]

# Then we pass the list of dictionaries into the DataFrame function. Note that the label school1 is used twice in the index.
df = pd.DataFrame(students, index = ['school1','school2', 'school1'])
df

Unnamed: 0,name,class,score
school1,alice,biology,87
school2,max,physics,78
school1,tonny,calculus,10


In [8]:
# Similar to a Series, we can extract data using the iloc and loc attributes.
# Because the DataFrame is two-dimensional, passing a single value to the loc indexing operator returns a Series if there is only one row to return.

# For instance, if we wanted to select data from school2, we would just query the loc attribute with one parameter.
df.loc['school2']

name         max
class    physics
score         78
Name: school2, dtype: object

In [10]:
# You will notice that the name of the returned Series is the row index value, while the column names are included in the output as its index.

# We can check the data type of the return value using the Python type function.
type(df.loc['school2'])

pandas.core.series.Series

In [11]:
# Let's query for the school1 records.
df.loc['school1']

Unnamed: 0,name,class,score
school1,alice,biology,87
school1,tonny,calculus,10


In [12]:
# Notice that the presentation of school1 and school2 is different.
# Let's check the type of the school1 result.
type(df.loc['school1'])

pandas.core.frame.DataFrame
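
Because the return type depends on how many rows match the label, it can help to ask for a DataFrame explicitly. A minimal sketch, assuming the same student data as above (example_df is an illustrative name): passing a list of labels to loc returns a DataFrame even when only one row matches.

```python
import pandas as pd

# Illustrative copy of the student data used in this notebook.
students = [
    {'name': 'alice', 'class': 'biology', 'score': 87},
    {'name': 'max', 'class': 'physics', 'score': 78},
    {'name': 'tonny', 'class': 'calculus', 'score': 10},
]
example_df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

single = example_df.loc['school2']      # scalar label with one match: a Series
as_frame = example_df.loc[['school2']]  # list of labels: always a DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
```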

In [13]:
# If we wanted to get only the names of students in school1, we would pass two arguments to the loc attribute.

df.loc['school1', 'name']

school1    alice
school1    tonny
Name: name, dtype: object

In [15]:
# Remember, just like with a Series, the pandas developers have implemented this using the indexing operator and not as parameters to a function.

# What would you do if you just wanted to select a single column, though?
# Well, there are a few ways in which you can do that.
# Firstly, we can transpose the matrix. This changes all the columns into rows and all the rows into columns, and is done with the T attribute.
df.T

Unnamed: 0,school1,school2,school1
name,alice,max,tonny
class,biology,physics,calculus
score,87,78,10


In [16]:
# Then we can call loc on the transpose to get the student names only.
df.T.loc['name']

school1    alice
school2      max
school1    tonny
Name: name, dtype: object

In [18]:
# However, since iloc and loc are used for row selection, pandas reserves the indexing operator
# directly on the DataFrame for column selection. In a pandas DataFrame, columns always have a name.
# So this selection is always label based, and is not as confusing as it was when using the square
# bracket operator on the Series objects. For those familiar with relational databases, this operator
# is analogous to column projection.
df['name']

school1    alice
school2      max
school1    tonny
Name: name, dtype: object
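
The indexing operator also accepts a list of column names. A minimal sketch, assuming the same student data (example_df is an illustrative name): a single label returns a Series, while a list of labels returns a DataFrame containing just those columns.

```python
import pandas as pd

# Illustrative copy of the student data used in this notebook.
students = [
    {'name': 'alice', 'class': 'biology', 'score': 87},
    {'name': 'max', 'class': 'physics', 'score': 78},
    {'name': 'tonny', 'class': 'calculus', 'score': 10},
]
example_df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

names = example_df['name']                      # one label: a Series
names_scores = example_df[['name', 'score']]    # list of labels: a DataFrame

print(type(names).__name__)
print(type(names_scores).__name__)
```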

In [19]:
df

Unnamed: 0,name,class,score
school1,alice,biology,87
school2,max,physics,78
school1,tonny,calculus,10


In [22]:
# loc looks up row labels, so asking it for a column label raises a KeyError
df.loc['name']

KeyError: 'name'
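
The error arises because 'name' is a column label rather than a row label. A minimal sketch of how one might check which axis a label belongs to before selecting, assuming the same student data (example_df and raised are illustrative names):

```python
import pandas as pd

# Illustrative copy of the student data used in this notebook.
students = [
    {'name': 'alice', 'class': 'biology', 'score': 87},
    {'name': 'max', 'class': 'physics', 'score': 78},
    {'name': 'tonny', 'class': 'calculus', 'score': 10},
]
example_df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

print('name' in example_df.index)    # is it a row label?
print('name' in example_df.columns)  # is it a column label?

# Row selection with a column label fails with a KeyError.
raised = False
try:
    example_df.loc['name']
except KeyError:
    raised = True
print(raised)
```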

In [23]:
df.loc['school1']

Unnamed: 0,name,class,score
school1,alice,biology,87
school1,tonny,calculus,10


In [24]:
df.loc['school1', 'name']

school1    alice
school1    tonny
Name: name, dtype: object

In [26]:
# If you get confused, use type to check the responses from resulting operations
print(type(df.loc['school1'])) #should be a DataFrame
print(type(df.loc['school1']['name'])) #should be a Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [27]:
# Chaining, by indexing on the return type of another index, can come with some costs and is
# best avoided if you can use another approach. In particular, chaining tends to cause Pandas 
# to return a copy of the DataFrame instead of a view on the DataFrame. 
# For selecting data, this is not a big deal, though it might be slower than necessary. 
# If you are changing data though this is an important distinction and can be a source of error.
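
As a minimal sketch of the alternative to chaining when changing data, the assignment below goes through a single .loc call with both the row and the column label. It assumes a frame with unique row labels (school3 is an illustrative label, not the index used above), and the new score of 80 is an arbitrary example value.

```python
import pandas as pd

# Illustrative data with unique row labels.
students = [
    {'name': 'alice', 'class': 'biology', 'score': 87},
    {'name': 'max', 'class': 'physics', 'score': 78},
    {'name': 'tonny', 'class': 'calculus', 'score': 10},
]
example_df = pd.DataFrame(students, index=['school1', 'school2', 'school3'])

# Chained form, best avoided: example_df['score']['school2'] = 80
# Single indexing call on the original DataFrame:
example_df.loc['school2', 'score'] = 80

print(example_df.loc['school2', 'score'])
```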

In [29]:
# Here's another approach. As we saw, .loc does row selection, and it can take two parameters:
# the row index and a list of column names. The .loc attribute also supports slicing.

# If we wanted to select all rows, we can use a colon to indicate a full slice from beginning to end.
# This is just like slicing a list in Python. Then we can add the column name as the
# second parameter as a string. If we wanted to include multiple columns, we could do so in a list,
# and pandas will bring back only the columns we have asked for.

# Here's an example, where we ask for all the names and scores for all schools using the .loc operator.
df.loc[:,['name', 'score']]

Unnamed: 0,name,score
school1,alice,87
school2,max,78
school1,tonny,10


In [30]:
# Take a look at that again. The colon means that we want to get all of the rows, and the list
# in the second argument position is the list of columns we want to get back
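
Slices with .loc do not have to cover every row. A minimal sketch, assuming a frame with unique row labels (school3 is an illustrative label): unlike ordinary Python list slicing, a label slice in .loc includes both endpoints.

```python
import pandas as pd

# Illustrative data with unique row labels.
students = [
    {'name': 'alice', 'class': 'biology', 'score': 87},
    {'name': 'max', 'class': 'physics', 'score': 78},
    {'name': 'tonny', 'class': 'calculus', 'score': 10},
]
example_df = pd.DataFrame(students, index=['school1', 'school2', 'school3'])

# Rows school1 through school2 (inclusive), restricted to two columns.
subset = example_df.loc['school1':'school2', ['name', 'score']]
print(subset)
```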

In [31]:
# That's selecting and projecting data from a DataFrame based on row and column labels. The key
# concepts to remember are that the rows and columns are really just for our benefit. Underneath,
# this is just a two-axis labeled array, and transposing it is easy. Also, consider the
# issue of chaining carefully, and try to avoid it, as it can cause unpredictable results, where
# your intent was to obtain a view of the data, but instead pandas returns a copy to you.

In [32]:
# Before we leave the discussion of accessing data in DataFrames, let's talk about dropping data.
# It's easy to delete data in Series and DataFrames, and we can use the drop function to do so.
# In its simplest form, this function takes a single parameter, which is the index or row label to drop.
# This is another tricky place for new users: the drop function doesn't change the DataFrame by default!
# Instead, the drop function returns to you a copy of the DataFrame with the given rows removed.

df.drop('school1')

Unnamed: 0,name,class,score
school2,max,physics,78


In [33]:
# But if we look at our original DataFrame we see the data is still intact.
df

Unnamed: 0,name,class,score
school1,alice,biology,87
school2,max,physics,78
school1,tonny,calculus,10


In [34]:
# Drop has two interesting optional parameters. The first is called inplace, and if it's
# set to True, the DataFrame will be updated in place, instead of a copy being returned.
# The second parameter is axis, which says which axis the label should be dropped from. By default,
# this value is 0, indicating the row axis. But you could change it to 1 if you want to drop a column.

# For example, let's make a copy of a DataFrame using .copy()
copy_df = df.copy()
# Now let's drop the name column in this copy
copy_df.drop("name", inplace=True, axis=1)
copy_df

Unnamed: 0,class,score
school1,biology,87
school2,physics,78
school1,calculus,10
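
The drop function also accepts index and columns keyword arguments, which some readers find clearer than an axis number. A minimal sketch, assuming the same student data (example_df, without_name and without_school2 are illustrative names); both calls return new DataFrames and leave the original untouched.

```python
import pandas as pd

# Illustrative copy of the student data used in this notebook.
students = [
    {'name': 'alice', 'class': 'biology', 'score': 87},
    {'name': 'max', 'class': 'physics', 'score': 78},
    {'name': 'tonny', 'class': 'calculus', 'score': 10},
]
example_df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

without_name = example_df.drop(columns=['name'])      # same effect as drop('name', axis=1)
without_school2 = example_df.drop(index=['school2'])  # same effect as drop('school2')

print(list(without_name.columns))
print(list(without_school2.index))
```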


In [35]:
# There is a second way to drop a column, and that's directly through the use of the indexing
# operator, using the del keyword. This way of dropping data, however, takes immediate effect
# on the DataFrame and does not return a copy.
del copy_df['class']
copy_df

Unnamed: 0,score
school1,87
school2,78
school1,10


In [36]:
# Finally, adding a new column to the DataFrame is as easy as assigning it to some value using
# the indexing operator. For instance, if we wanted to add a class ranking column with default 
# value of None, we could do so by using the assignment operator after the square brackets.
# This broadcasts the default value to the new column immediately.

df['ClassRanking'] = None
df

Unnamed: 0,name,class,score,ClassRanking
school1,alice,biology,87,
school2,max,physics,78,
school1,tonny,calculus,10,
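
A new column does not have to hold a single broadcast value. As a minimal sketch, the same assignment syntax can take values computed from existing columns. The frame below assumes unique row labels, and both the passing threshold of 50 and the column name passed are illustrative choices, not part of the data above.

```python
import pandas as pd

# Illustrative data with unique row labels.
students = [
    {'name': 'alice', 'class': 'biology', 'score': 87},
    {'name': 'max', 'class': 'physics', 'score': 78},
    {'name': 'tonny', 'class': 'calculus', 'score': 10},
]
example_df = pd.DataFrame(students, index=['school1', 'school2', 'school3'])

# Boolean column computed element-wise from the score column.
example_df['passed'] = example_df['score'] >= 50

# Rank by score, highest score first, stored as integers.
example_df['ClassRanking'] = example_df['score'].rank(ascending=False).astype(int)

print(example_df)
```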
