### The three components of a DataFrame
A DataFrame is composed of three components: the index, the columns, and the data. The data is also known as the values.

The index is the sequence of values on the far left-hand side of the DataFrame. All the values in the index are in bold font. Each individual value of the index is called a label, and the index as a whole is sometimes referred to as the row labels. Often the row labels are not very interesting and are just the integers from 0 up to n-1, where n is the number of rows in the table. Pandas gives a DataFrame this simple index by default.
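
As a small sketch of the default index (the column names and values below are made up for illustration), a DataFrame built without an `index` argument gets the integer labels 0 to n-1:

```python
import pandas as pd

# Hypothetical two-column table; no index is passed, so pandas supplies one
default_df = pd.DataFrame({"state": ["Illinois", "Florida", "New York"],
                           "age": [79, 92, 90]})

# The default row labels are the integers 0, 1, ..., n-1
default_labels = list(default_df.index)
print(default_labels)
```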

The columns are the sequence of values at the very top of the DataFrame. They are also in bold font. Each individual value of the columns is called a column, but can also be referred to as a column name or column label.

Everything else not in bold font is the data, or values. You will sometimes hear DataFrames referred to as tabular data. This is just another name for a rectangular table of data with rows and columns.

#### Axis and axes
It is also common terminology to refer to the rows or the columns as an axis. Collectively, we call them axes. So, the rows are one axis and the columns are another axis.

The word axis appears as a parameter in many DataFrame methods. Pandas allows you to choose the direction of how the method will work with this parameter. This has nothing to do with subset selection so you can just ignore it for now.
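
Although the axis parameter can be set aside for now, a minimal sketch may make the idea of "direction" concrete. The table `scores` and its values are hypothetical; `sum` is used only because it is a familiar method that accepts `axis`:

```python
import pandas as pd

# Hypothetical numeric table, used only to illustrate the axis parameter
scores = pd.DataFrame({"math": [1, 2, 3], "art": [10, 20, 30]},
                      index=["a", "b", "c"])

# axis=0 works down the rows, giving one result per column
per_column = scores.sum(axis=0)

# axis=1 works across the columns, giving one result per row
per_row = scores.sum(axis=1)

print(per_column)
print(per_row)
```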



#### Each row has a label and each column has a label
The main takeaway from this look at the DataFrame is that each row has a label and each column has a label. These labels are used to refer to specific rows or columns in the DataFrame. It’s the same as how humans use names to refer to specific people.

In [1]:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    data=np.array([["Hawai", "blue", "champagne", 80],
                   ["Illinois", "green", "steak", 79],
                   ["New York", "red", "pizza", 90],
                   ["Florida", "orange_blue", "delirium", 92]]),
    index=['Rohit', 'Mike', 'Kyla', 'Xintian'],
    columns=["state", "color", "food_drink", "age"])

In [2]:
df.head()

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Mike,Illinois,green,steak,79
Kyla,New York,red,pizza,90
Xintian,Florida,orange_blue,delirium,92


###  [ ], .loc, and .iloc

Collectively, they are called the indexers. These are by far the most common ways to select data. 

df[ ], a.k.a. the indexing operator

df.loc[ ]

df.iloc[ ]

In [3]:
index = df.index
columns = df.columns
values = df.values

In [4]:
df.index[0]

'Rohit'

In [4]:
index

Index(['Rohit', 'Mike', 'Kyla', 'Xintian'], dtype='object')

In [5]:
columns

Index(['state', 'color', 'food_drink', 'age'], dtype='object')

In [6]:
values

array([['Hawai', 'blue', 'champagne', '80'],
       ['Illinois', 'green', 'steak', '79'],
       ['New York', 'red', 'pizza', '90'],
       ['Florida', 'orange_blue', 'delirium', '92']], dtype=object)

In [7]:
type(index)

pandas.core.indexes.base.Index

In [8]:
type(columns)

pandas.core.indexes.base.Index

In [9]:
type(values)

numpy.ndarray

Both the index and the columns are the same type. They are both a pandas Index object. This object is quite powerful in itself, but for now you can just think of it as a sequence of labels for either the rows or the columns.

The values are a NumPy ndarray, which stands for n-dimensional array, and is the primary container of data in the NumPy library. Pandas is built directly on top of NumPy and it's this array that is responsible for the bulk of the workload.
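
One detail worth noticing in the output above is that the ages appear in quotes ('80', '79', ...). A NumPy array holds a single data type, so building it from rows that mix strings and integers turns every element into a string. The sketch below illustrates this and shows one possible alternative, building the DataFrame from a dictionary of columns so that each column keeps its own type. The names `mixed` and `typed_df` are only for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype, so mixing text and integers yields all strings
mixed = np.array([["Illinois", 79], ["Florida", 92]])
print(mixed.dtype.kind)   # 'U' is NumPy's code for unicode strings

# Building from a dict of columns lets each column keep its own type
typed_df = pd.DataFrame({"state": ["Illinois", "Florida"], "age": [79, 92]},
                        index=["Mike", "Xintian"])
print(typed_df.dtypes)
```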

#### Selecting a single column as a Series

In [10]:
df['food_drink']

Rohit      champagne
Mike           steak
Kyla           pizza
Xintian     delirium
Name: food_drink, dtype: object

Selecting a single column of data returns the other pandas data container, the Series. A Series is a one-dimensional sequence of labeled data. There are two main components of a Series: the index and the data (or values). There are NO columns in a Series.

The visual display of a Series is just plain text, as opposed to the nicely styled table for DataFrames. The sequence of person names on the left is the index. The sequence of food items on the right is the values.

You will also notice two extra pieces of data at the bottom of the Series. The name of the Series is the name of the column it was selected from. You will also see the data type, or dtype, of the Series. You can ignore both of these items for now.
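
As a rough sketch, the components of a Series can be inspected directly. The small frame `people` is a stand-in with illustrative values, not the DataFrame used elsewhere in this notebook:

```python
import pandas as pd

# Small stand-in frame (values are illustrative)
people = pd.DataFrame({"food_drink": ["champagne", "steak"], "age": [80, 79]},
                      index=["Rohit", "Mike"])

food_col = people["food_drink"]

print(list(food_col.index))   # row labels carried over from the DataFrame
print(list(food_col.values))  # the data
print(food_col.name)          # name of the column it was selected from
```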

#### Selecting multiple columns with just the indexing operator

In [13]:
df[['color', 'food_drink', 'age']]

Unnamed: 0,color,food_drink,age
Rohit,blue,champagne,80
Mike,green,steak,79
Kyla,red,pizza,90
Xintian,orange_blue,delirium,92


#### Selecting multiple columns returns a DataFrame

In [14]:
df[['food_drink']] #You can actually select a single column as a DataFrame with a one-item list:

Unnamed: 0,food_drink
Rohit,champagne
Mike,steak
Kyla,pizza
Xintian,delirium


df['colour']

KeyError: 'colour'

df['color', 'age'] # should be:  df[['color', 'age']]

KeyError: ('color', 'age')

#### Indexing Operator Recap
Its primary purpose is to select columns by their column names.

Select a single column as a Series by passing the column name directly to it: df['col_name']

Select multiple columns as a DataFrame by passing a list to it: df[['col_name1', 'col_name2']]

You can actually select rows with it as well, but this is confusing and not used often, so it is covered separately later.
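
The recap can be checked with a short sketch. The frame `people` and its values are illustrative; the point is only the type of object each form returns, and that several names without the inner list raise a KeyError:

```python
import pandas as pd

people = pd.DataFrame({"color": ["blue", "green"], "age": [80, 79]},
                      index=["Rohit", "Mike"])

one_col = people["color"]            # string -> Series
as_frame = people[["color"]]         # one-item list -> DataFrame
two_cols = people[["color", "age"]]  # list -> DataFrame

# Several names without the inner list are treated as one (missing) key
try:
    people["color", "age"]
    raised = False
except KeyError:
    raised = True

print(type(one_col).__name__, type(as_frame).__name__, raised)
```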

### .loc
The .loc indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns. Most importantly, it only selects data by the LABEL of the rows and columns.

#### Select a single row as a Series with .loc
The .loc indexer will return a single row as a Series when given a single row label. Let's select the row for Mike:

In [15]:
df.loc['Mike']

state         Illinois
color            green
food_drink       steak
age                 79
Name: Mike, dtype: object

#### Select Multiple Rows


In [16]:
df.loc[['Xintian', 'Kyla']]

Unnamed: 0,state,color,food_drink,age
Xintian,Florida,orange_blue,delirium,92
Kyla,New York,red,pizza,90


#### Use slice notation to select a range of rows with .loc
It is possible to ‘slice’ the rows of a DataFrame with .loc by using slice notation. Slice notation uses a colon to separate start, stop and step values. For instance, we can select all the rows from Rohit through Xintian like this:

In [17]:
df.loc['Rohit':'Xintian']

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Mike,Illinois,green,steak,79
Kyla,New York,red,pizza,90
Xintian,Florida,orange_blue,delirium,92


#### Important note: .loc includes the last value with slice notation

In [18]:
df.loc[:'Mike']

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Mike,Illinois,green,steak,79


In [19]:
df.loc['Rohit':'Kyla':2] # Slice from Rohit to Kyla stepping by 2

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Kyla,New York,red,pizza,90


In [20]:
df.loc['Rohit':] #Slice from Rohit to the end

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Mike,Illinois,green,steak,79
Kyla,New York,red,pizza,90
Xintian,Florida,orange_blue,delirium,92


#### Selecting rows and columns simultaneously with .loc
Unlike just the indexing operator, it is possible to select rows and columns simultaneously with .loc. You do it by separating your row and column selections by a comma. It will look something like this:

df.loc[row_selection, column_selection]

In [21]:
df.loc[['Mike', 'Kyla'], ['food_drink', 'state','age']]

Unnamed: 0,food_drink,state,age
Mike,steak,Illinois,79
Kyla,pizza,New York,90


#### Use any combination of selections for either row or columns for .loc

Row or column selections can be any of the following as we have already seen:

A single label

A list of labels

A slice with labels

We can use any of these three for either row or column selections with .loc.

In [22]:
df.loc[['Rohit', 'Xintian'], 'food_drink']

Rohit      champagne
Xintian     delirium
Name: food_drink, dtype: object

In [23]:
#Slice of rows and a list of columns
df.loc['Mike':'Xintian', ['state', 'color']]

Unnamed: 0,state,color
Mike,Illinois,green
Kyla,New York,red
Xintian,Florida,orange_blue


In [24]:
df.loc['Kyla', 'age']

'90'

In [25]:
df.loc[:'Kyla', 'age':]

Unnamed: 0,age
Rohit,80
Mike,79
Kyla,90


#### Selecting all of the rows and some columns
It is possible to select all of the rows by using a single colon. You can then select columns as normal:

In [26]:
df.loc[:, ['food_drink', 'color']]

Unnamed: 0,food_drink,color
Rohit,champagne,blue
Mike,steak,green
Kyla,pizza,red
Xintian,delirium,orange_blue


In [27]:
#select all columns
df.loc[['Mike','Rohit'], :]

Unnamed: 0,state,color,food_drink,age
Mike,Illinois,green,steak,79
Rohit,Hawai,blue,champagne,80


In [28]:
df.loc[['Mike','Rohit']] #same as above

Unnamed: 0,state,color,food_drink,age
Mike,Illinois,green,steak,79
Rohit,Hawai,blue,champagne,80


In [29]:
rows = ['Mike','Kyla','Xintian','Rohit']
cols = ['state', 'age']
df.loc[rows, cols]

Unnamed: 0,state,age
Mike,Illinois,79
Kyla,New York,90
Xintian,Florida,92
Rohit,Hawai,80


#### Summary of .loc
Only uses labels

Can select rows and columns simultaneously

Selection can be a single label, a list of labels or a slice of labels

Put a comma between row and column selections
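
The points in this summary can be put together in one sketch. The frame `people` below is a cut-down stand-in with illustrative values; the comments note what each form of selection returns:

```python
import pandas as pd

people = pd.DataFrame(
    {"state": ["Hawaii", "Illinois", "New York"],
     "food_drink": ["champagne", "steak", "pizza"],
     "age": [80, 79, 90]},
    index=["Rohit", "Mike", "Kyla"],
)

single_value = people.loc["Mike", "age"]    # label, label -> a single value
one_row = people.loc["Mike"]                # single label -> Series
some_rows = people.loc[["Kyla", "Rohit"]]   # list of labels -> DataFrame

# Label slices include the stop label, for rows and for columns
inclusive = people.loc["Rohit":"Mike", "state":"food_drink"]

print(single_value)
print(inclusive.shape)
```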

### .iloc
The .iloc indexer is very similar to .loc but only uses integer locations to make its selections. The word .iloc itself stands for integer location, which should help you remember what it does.

#### Selecting a single row with .iloc
By passing a single integer to .iloc, it will select one row as a Series:

In [30]:
df.iloc[3]

state             Florida
color         orange_blue
food_drink       delirium
age                    92
Name: Xintian, dtype: object

In [31]:
df.iloc[[3, 2, 0]]       # remember, don't do df.iloc[3, 2, 0]

Unnamed: 0,state,color,food_drink,age
Xintian,Florida,orange_blue,delirium,92
Kyla,New York,red,pizza,90
Rohit,Hawai,blue,champagne,80


In [32]:
df.iloc[1:3] #Slice notation works just like a list in this instance and is exclusive of the last element

Unnamed: 0,state,color,food_drink,age
Mike,Illinois,green,steak,79
Kyla,New York,red,pizza,90


In [33]:
df.iloc[3:]

Unnamed: 0,state,color,food_drink,age
Xintian,Florida,orange_blue,delirium,92


In [34]:
#Select from the first position to the end, stepping by 2
df.iloc[0::2]

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Kyla,New York,red,pizza,90


Just like with .loc, any combination of a single integer, a list of integers, or a slice can be used to select rows and columns simultaneously. Just remember to separate the selections with a comma.

In [35]:
df.iloc[[2,3], [0, 1]]

Unnamed: 0,state,color
Kyla,New York,red
Xintian,Florida,orange_blue


In [6]:
df.iloc[0:4, [3]]

Unnamed: 0,age
Rohit,80
Mike,79
Kyla,90
Xintian,92


In [37]:
df.iloc[2:4, 0:4]

Unnamed: 0,state,color,food_drink,age
Kyla,New York,red,pizza,90
Xintian,Florida,orange_blue,delirium,92


In [38]:
df.iloc[0, 2]

'champagne'

In [39]:
df.iloc[:, 3]

Rohit      80
Mike       79
Kyla       90
Xintian    92
Name: age, dtype: object
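
To relate the two indexers, the sketch below makes the same selection once by position and once by label, and then uses `Index.get_loc` to turn a label into its integer position. The frame `people` is an illustrative stand-in, and `get_loc` returns a single integer only when the labels are unique:

```python
import pandas as pd

people = pd.DataFrame(
    {"state": ["Hawaii", "Illinois", "New York"],
     "color": ["blue", "green", "red"],
     "age": [80, 79, 90]},
    index=["Rohit", "Mike", "Kyla"],
)

by_position = people.iloc[1:3, 0:2]                    # stop positions excluded
by_label = people.loc["Mike":"Kyla", "state":"color"]  # stop labels included

# Index.get_loc turns a label into its integer position (labels must be unique)
row_pos = people.index.get_loc("Kyla")
col_pos = people.columns.get_loc("age")
value = people.iloc[row_pos, col_pos]

print(by_position.equals(by_label), row_pos, col_pos, value)
```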

#### Selecting subsets of Series
We can also do subset selection with a Series. Earlier I recommended using just the indexing operator for column selection on a DataFrame. Since Series do not have columns, I suggest using only .loc and .iloc. You can use just the indexing operator, but it's ambiguous, as it can take both labels and integers.
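
A minimal sketch of why being explicit helps: in the hypothetical Series below, the labels are themselves integers in a shuffled order, so "0" could mean a label or a position. With .loc and .iloc there is no doubt which one is meant:

```python
import pandas as pd

# Hypothetical Series whose labels are integers in a shuffled order
letters = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = letters.loc[0]      # the value whose label is 0
by_position = letters.iloc[0]  # the value in the first position

print(by_label, by_position)
```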

Typically, you will create a Series by selecting a single column from a DataFrame. Let’s select the food_drink column:

In [40]:
food = df['food_drink']

In [41]:
food.loc['Mike']

'steak'

In [42]:
#Select three different values. This returns a Series:
food.loc[['Mike', 'Kyla', 'Xintian']]

Mike          steak
Kyla          pizza
Xintian    delirium
Name: food_drink, dtype: object

In [43]:
#Slice from Rohit to Xintian; .loc is inclusive of the last label
food.loc['Rohit':'Xintian']

Rohit      champagne
Mike           steak
Kyla           pizza
Xintian     delirium
Name: food_drink, dtype: object

In [44]:
food.loc['Rohit':]

Rohit      champagne
Mike           steak
Kyla           pizza
Xintian     delirium
Name: food_drink, dtype: object

In [45]:
#Select a single value in a list which returns a Series
food.loc[['Mike']]

Mike    steak
Name: food_drink, dtype: object

In [46]:
food.iloc[0]

'champagne'

In [47]:
food.iloc[[2, 1, 3]]

Kyla          pizza
Mike          steak
Xintian    delirium
Name: food_drink, dtype: object

In [48]:
food.iloc[1:3]#last one is exclusive

Mike    steak
Kyla    pizza
Name: food_drink, dtype: object

#### Comparison to Python lists and dictionaries
It may be helpful to compare pandas' ability to make selections by label and integer location to that of Python lists and dictionaries.

Python lists allow for selection of data only through integer location. You can use a single integer or slice notation to make the selection but NOT a list of integers.

In [53]:
some_list = ['a', 'two', 10, 4, 0, 'asdf', 'mgmt', 434, 99]

some_list[5]

some_list[-1]

some_list[:4]

some_list[3:]

some_list[2:6:3]


[10, 'asdf']

In [54]:
d = {'a':1, 'b':2, 't':20, 'z':26, 'A':27}
d['a']
d['A']

27
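
The limits of the two built-in containers can be sketched as follows. The variable names are illustrative, and the exact exception raised when slicing a dictionary may differ between Python versions, so the sketch accepts either:

```python
some_list = ["a", "two", 10, 4]
d = {"a": 1, "b": 2}

# A list rejects a list of integers; a comprehension does the same job
try:
    some_list[[0, 2]]
    list_error = None
except TypeError as exc:
    list_error = type(exc).__name__
picked = [some_list[i] for i in [0, 2]]

# A dictionary selects by key only; slicing is not supported
try:
    d["a":"b"]
    dict_error = None
except (TypeError, KeyError) as exc:
    dict_error = type(exc).__name__

print(list_error, picked, dict_error)
```

In other words, a list behaves a little like .iloc restricted to single integers and slices, and a dictionary behaves a little like .loc restricted to single labels.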

### Using just the indexing operator to select rows from a DataFrame — Confusing!

So far we used just the indexing operator to select a column or columns from a DataFrame. But, it can also be used to select rows using a slice. This behavior is very confusing to me. The entire operation changes completely when a slice is passed.

In [55]:
df[0:2]

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Mike,Illinois,green,steak,79


In [57]:
df['Mike':'Xintian']

Unnamed: 0,state,color,food_drink,age
Mike,Illinois,green,steak,79
Kyla,New York,red,pizza,90
Xintian,Florida,orange_blue,delirium,92


This feature is not deprecated, and it is completely up to you whether you use it. But I highly prefer not to select rows in this manner, as it can be ambiguous, especially if you have integers in your index.

Using .iloc and .loc is explicit and clearly tells the person reading the code what is going to happen. Let's rewrite the above using .iloc and .loc.

In [58]:
df.iloc[0:2]      # More explicit than df[0:2]

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Mike,Illinois,green,steak,79


In [60]:
df.loc['Mike':'Kyla']

Unnamed: 0,state,color,food_drink,age
Mike,Illinois,green,steak,79
Kyla,New York,red,pizza,90
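
To see how integers in the index can cause confusion, consider this sketch with a hypothetical frame whose row labels are the integers 2, 3 and 4. The same slice `0:2` selects different rows depending on whether it is read as positions or as labels:

```python
import pandas as pd

# Hypothetical frame whose row labels are integers that do not start at 0
nums = pd.DataFrame({"value": ["x", "y", "z"]}, index=[2, 3, 4])

by_position = nums.iloc[0:2]  # first two rows: labels 2 and 3
by_label = nums.loc[0:2]      # labels from 0 through 2: only label 2

print(list(by_position.index))
print(list(by_label.index))
```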


In [61]:
#Cannot simultaneously select rows and columns with []
df[3:6, 'Mike':'Kyla']

TypeError: '(slice(3, 6, None), slice('Mike', 'Kyla', None))' is an invalid key

In [62]:
#Using just the indexing operator to select rows from a Series — Confusing!
#You can also use just the indexing operator with a Series. 
#Again, this is confusing because it can accept integers or labels.
food

Rohit      champagne
Mike           steak
Kyla           pizza
Xintian     delirium
Name: food_drink, dtype: object

In [63]:
food[1:3]

Mike    steak
Kyla    pizza
Name: food_drink, dtype: object

In [64]:
food['Rohit':'Xintian']

Rohit      champagne
Mike           steak
Kyla           pizza
Xintian     delirium
Name: food_drink, dtype: object

Since Series don’t have columns, you can use a single label or a list of labels to make selections as well:

In [65]:
food[['Mike', 'Rohit', 'Xintian']]

Mike           steak
Rohit      champagne
Xintian     delirium
Name: food_drink, dtype: object

In [66]:
df2_idx = df.set_index('age')

In [67]:
df2_idx

age,state,color,food_drink
80,Hawai,blue,champagne
79,Illinois,green,steak
90,New York,red,pizza
92,Florida,orange_blue,delirium


#### DataFrame column selection with dot notation
Pandas allows you to select a single column as a Series by using dot notation. This is also referred to as attribute access. You simply place the name of the column without quotes following a dot and the DataFrame like this:

In [68]:
df.age

Rohit      80
Mike       79
Kyla       90
Xintian    92
Name: age, dtype: object

In [69]:
df.state

Rohit         Hawai
Mike       Illinois
Kyla       New York
Xintian     Florida
Name: state, dtype: object
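
Dot notation does not work for every column name. As a sketch with made-up column names, a name containing a space cannot be written after a dot, and a name that matches an existing DataFrame method (here `count`) returns the method rather than the column. The indexing operator works in both cases:

```python
import pandas as pd

# Hypothetical columns: one name contains a space, one collides with a method
tricky = pd.DataFrame({"food drink": ["steak", "pizza"], "count": [1, 2]},
                      index=["Mike", "Kyla"])

# tricky.food drink would be a syntax error, so brackets are required
with_space = tricky["food drink"]

# tricky.count is the DataFrame.count method, not the column
attr_is_method = callable(tricky.count)
count_col = tricky["count"]

print(attr_is_method, list(count_col))
```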

In [70]:
df  # In Jupyter, pressing Shift+Tab twice after a name shows its documentation

Unnamed: 0,state,color,food_drink,age
Rohit,Hawai,blue,champagne,80
Mike,Illinois,green,steak,79
Kyla,New York,red,pizza,90
Xintian,Florida,orange_blue,delirium,92


In [71]:
df[['age', 'age', 'age']]

Unnamed: 0,age,age,age
Rohit,80,80,80
Mike,79,79,79
Kyla,90,90,90
Xintian,92,92,92
