# Simplify with slicing

When working with large data sets, you're often interested in only a small subset of the data, and that's why slicing is so important. You'll learn how to select columns in pandas, and I'll also show you how to use slicing operations. I'm working with a car loans dataset loaded into the DataFrame df, and I'm looking at its first five rows. Let's say you only want to look at a few columns of the dataset, so let's go over how to use brackets to select them.


In [2]:
import pandas as pd

filename = 'car_financing.xlsx'
df = pd.read_excel(filename)

**Select columns using brackets**

Here we use double square brackets to output just one column of the data set. As you can see, I've pulled out only the car_type column.

In [3]:
# Select one column using double brackets
df[['car_type']].head()

,car_type
0,Toyota Sienna
1,Toyota Sienna
2,Toyota Sienna
3,Toyota Sienna
4,Toyota Sienna


We can also select multiple columns using double brackets. Notice that there is a list inside the outer brackets, naming the car_type column and the Principal Paid column. When I run this, I get a DataFrame with just those two columns.

In [5]:
df.head()

,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
0,1,34689.96,687.23,202.93,484.3,34205.66,60,0.0702,Toyota Sienna
1,2,34205.66,687.23,200.1,487.13,33718.53,60,0.0702,Toyota Sienna
2,3,33718.53,687.23,197.25,489.98,33228.55,60,0.0702,Toyota Sienna
3,4,33228.55,687.23,194.38,492.85,32735.7,60,0.0702,Toyota Sienna
4,5,32735.7,687.23,191.5,495.73,32239.97,60,0.0702,Toyota Sienna


In [6]:
# Select multiple columns using double brackets
df[['car_type', 'Principal Paid']].head()

,car_type,Principal Paid
0,Toyota Sienna,484.3
1,Toyota Sienna,487.13
2,Toyota Sienna,489.98
3,Toyota Sienna,492.85
4,Toyota Sienna,495.73


And notice that when I use the built-in type function, this is still a pandas DataFrame.

In [7]:
# This is a Pandas DataFrame
type(df[['car_type']].head())

pandas.core.frame.DataFrame

One thing beginners often have difficulty with in pandas is that with single brackets, they end up with something that looks like this. This is called a pandas Series, which is a one-dimensional labeled array. In this case, the labels are 0, 1, 2, 3, and 4. Together, these labels make up the index.

In [8]:
# Select one column using single brackets
# This produces a pandas Series, which is a one-dimensional labeled array

df['car_type'].head()

0    Toyota Sienna
1    Toyota Sienna
2    Toyota Sienna
3    Toyota Sienna
4    Toyota Sienna
Name: car_type, dtype: object

And notice that when I use the built-in type function, I have a pandas Series. Keep in mind that you cannot select multiple columns with single brackets. pandas treats the comma-separated names as a single tuple key, finds no column with that name, and raises a KeyError, as you can see here. This is a really common error that beginners run into when they want to select multiple columns. The simple solution is to pass a list of column names, in other words, use double brackets, which gives you a pandas DataFrame.

In [9]:
# This is a pandas series
type(df['car_type'].head())

pandas.core.series.Series

In [10]:
# Keep in mind that you can't select multiple columns using single brackets
# This will result in a KeyError
df['car_type', 'Principal Paid'].head()

KeyError: ('car_type', 'Principal Paid')

In [11]:
# Solution is to use double brackets
df[['car_type', 'Principal Paid']].head()

,car_type,Principal Paid
0,Toyota Sienna,484.3
1,Toyota Sienna,487.13
2,Toyota Sienna,489.98
3,Toyota Sienna,492.85
4,Toyota Sienna,495.73
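
The difference between single and double brackets can be checked on a tiny frame. This is only a sketch: the DataFrame below is a hypothetical three-row stand-in built by hand from the values shown above, not the real car_financing.xlsx file, and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the car loans data (values copied from the rows shown above)
df = pd.DataFrame({
    "car_type": ["Toyota Sienna"] * 3,
    "Principal Paid": [484.30, 487.13, 489.98],
})

single = df["car_type"]                     # single brackets -> Series
double = df[["car_type"]]                   # double brackets -> one-column DataFrame
both = df[["car_type", "Principal Paid"]]   # list of names -> two-column DataFrame

print(type(single).__name__)   # Series
print(type(double).__name__)   # DataFrame
print(both.shape)              # (3, 2)

# Two names without the inner list are read as one tuple key, which is not a column
try:
    df["car_type", "Principal Paid"]
    raised = False
except KeyError:
    raised = True
print(raised)                  # True
```

The inner brackets are just an ordinary Python list, so a list stored in a variable works the same way.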


# Pandas Slicing

So I have my DataFrame, and I'm selecting my car_type and Principal Paid columns. One reason you might pull out a pandas Series is that you can then select rows from that column using slicing: you write the Series, then the *start index* and the *end index* of what you want to select. Keep in mind that the end index is not inclusive, which is the same behavior as Python lists. So I have a pandas Series here, the car_type column, and this is the entire column.

In [12]:
df['car_type']

0      Toyota Sienna
1      Toyota Sienna
2      Toyota Sienna
3      Toyota Sienna
4      Toyota Sienna
           ...      
403        VW Golf R
404        VW Golf R
405        VW Golf R
406        VW Golf R
407        VW Golf R
Name: car_type, Length: 408, dtype: object

Say I'm only interested in index 0 up to but not including index 10. I can use a slicing operation. Here I have my car_type column, which is a pandas Series, followed by my slice. It selects from index 0 up to but not including index 10, so rows 0 through 9.

In [13]:
df['car_type'][0:10]

0    Toyota Sienna
1    Toyota Sienna
2    Toyota Sienna
3    Toyota Sienna
4    Toyota Sienna
5    Toyota Sienna
6    Toyota Sienna
7    Toyota Sienna
8    Toyota Sienna
9    Toyota Sienna
Name: car_type, dtype: object
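
To see the exclusive end index next to a plain Python list, here is a minimal sketch. The Series `principal` is a hypothetical stand-in typed in by hand, and the comparison with `iloc` is added only to make explicit that the bracket slice is positional.

```python
import pandas as pd

# Hypothetical Series standing in for df['Principal Paid']
principal = pd.Series([484.30, 487.13, 489.98, 492.85, 495.73], name="Principal Paid")

first_three = principal[0:3]         # positions 0, 1, 2; position 3 is excluded
as_list = principal.tolist()[0:3]    # the same slice applied to a plain Python list
explicit = principal.iloc[0:3]       # iloc spells out that the slice is positional

print(len(first_three))                   # 3
print(first_three.tolist() == as_list)    # True
print(first_three.equals(explicit))       # True
```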

You can also select columns using dot notation. However, this is not the recommended syntax.

In [14]:
# Select column using dot notation. 
# This is not recommended.
df.car_type.head()

0    Toyota Sienna
1    Toyota Sienna
2    Toyota Sienna
3    Toyota Sienna
4    Toyota Sienna
Name: car_type, dtype: object

As you'll see in this cell, dot notation results in an error when there is a space in the column name. It also fails to return your column if the column name is the same as one of the DataFrame's attributes or methods, because the attribute or method wins. So a safer syntax is just to use single brackets.

In [None]:
"""
This won't work: the space in the column name makes it a SyntaxError.
Dot notation also fails if your column has the same name
as one of the DataFrame's attributes or methods.
"""
df.Principal Paid

In [15]:
df['Principal Paid'][0:10]

0    484.30
1    487.13
2    489.98
3    492.85
4    495.73
5    498.63
6    501.55
7    504.48
8    507.43
9    510.40
Name: Principal Paid, dtype: float64
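
The name clash with DataFrame methods is easy to miss because it raises no error. As a small sketch, the hypothetical frame below has a column named `count` (not part of the car loans data), which collides with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another shadows a DataFrame method
df = pd.DataFrame({
    "Principal Paid": [484.30, 487.13],
    "count": [1, 2],
})

# df.Principal Paid cannot even be parsed, so brackets are the way to reach this column
principal = df["Principal Paid"]

# df.count is the DataFrame.count method, not the 'count' column
dot_result = df.count
bracket_result = df["count"]

print(callable(dot_result))             # True
print(type(bracket_result).__name__)    # Series
print(bracket_result.tolist())          # [1, 2]
```

Bracket access looks only at column labels, so it behaves the same no matter what the column is called.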

# Selecting columns using loc

Lastly, I want to show you the preferred syntax for selecting columns, which is the **loc attribute**. It lets you select rows by index label and columns by name, as well as slice your data. Here the colon selects all the rows of my pandas DataFrame, the list says I want just the car_type column, and head() then shows the first five rows.

In [16]:
# pandas dataframe
df.loc[:, ['car_type']].head()

,car_type
0,Toyota Sienna
1,Toyota Sienna
2,Toyota Sienna
3,Toyota Sienna
4,Toyota Sienna


Similarly, if you just want a pandas Series, take out the square brackets around your column name. So that's it. If in the future you're presented with a big data set and you want to look at a subset of it, consider slicing.

In [17]:
# pandas series
df.loc[:, 'car_type'].head()

0    Toyota Sienna
1    Toyota Sienna
2    Toyota Sienna
3    Toyota Sienna
4    Toyota Sienna
Name: car_type, dtype: object
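
As a closing sketch, loc can combine row and column selection in one step. One detail worth knowing is that a loc row slice is label-based and includes the end label, unlike the bracket slices used earlier. The frame below is again a hypothetical five-row stand-in, not the real file.

```python
import pandas as pd

# Hypothetical stand-in for the car loans data
df = pd.DataFrame({
    "car_type": ["Toyota Sienna"] * 5,
    "Principal Paid": [484.30, 487.13, 489.98, 492.85, 495.73],
})

as_frame = df.loc[:, ["car_type"]]    # list of names -> DataFrame
as_series = df.loc[:, "car_type"]     # bare name -> Series

# Label slice 0:2 includes the row labeled 2, so three rows come back
rows_0_to_2 = df.loc[0:2, ["car_type", "Principal Paid"]]

# The positional bracket slice 0:2 stops before position 2, so two rows come back
bracket_slice = df["car_type"][0:2]

print(type(as_frame).__name__)    # DataFrame
print(type(as_series).__name__)   # Series
print(len(rows_0_to_2))           # 3
print(len(bracket_slice))         # 2
```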