# Intro to Pandas
by Ryan Orsinger

## Module 2: Introducing DataFrames

### Pandas DataFrames Basics - Part 2
- Helpful Jupyter features
- Importing a CSV dataset
- Selecting multiple columns
- Filtering rows with Boolean masks
- Using AND and OR operators
- Why and when to use .copy
- Sorting a dataframe by one or more columns

In [1]:
import pandas as pd

In [2]:
# pd.read_csv reads a comma-separated value file and returns a dataframe
mpg = pd.read_csv("mpg.csv")
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
229,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
230,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
231,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
232,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize
233,volkswagen,passat,3.6,2008,6,auto(s6),f,17,26,p,midsize
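For readers without `mpg.csv` on hand, here is a minimal sketch of the same call against an in-memory stand-in. The three rows are copied from the output above; `csv_text` and `sample` are illustrative names that do not appear in the original notebook.

```python
import io

import pandas as pd

# A tiny stand-in for mpg.csv (rows copied from the output above).
csv_text = """manufacturer,model,displ,year,cyl
audi,a4,1.8,1999,4
audi,a4,2.0,2008,4
audi,a4,2.8,1999,6
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types are inferred from the values
```

Wrapping the text in `io.StringIO` is only a convenience for the sketch; with a real file, passing the path as a string works the same way.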


In [3]:
# Tab completion: listing an object's attributes and methods
# Type the name of a variable that has already been evaluated in memory
# (this works with imported modules, too)
# Type a period (.)
# Then press the <Tab> key


In [4]:
# Pulling up the docstring for a method
mpg.head?
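The `?` suffix is an IPython/Jupyter feature. As a rough sketch of the equivalent in a plain Python script or REPL, the same docstring can be read from the `__doc__` attribute (or printed with the built-in `help`). The variable name `doc` is illustrative.

```python
import pandas as pd

# Outside Jupyter, help(pd.DataFrame.head) prints the full docstring.
# The raw text is also available as a string on the method itself.
doc = pd.DataFrame.head.__doc__

# Print just the one-line summary at the top of the docstring.
print(doc.strip().splitlines()[0])
```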

In [5]:
# .tail looks at the last records of a pandas object (series or dataframe)
mpg.tail(3)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
231,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
232,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize
233,volkswagen,passat,3.6,2008,6,auto(s6),f,17,26,p,midsize


In [6]:
# The square brackets on the dataframe can accept a list of column names 
columns_of_interest = ["displ", "cty", "hwy"]
mpg[columns_of_interest].head()

Unnamed: 0,displ,cty,hwy
0,1.8,18,29
1,1.8,21,29
2,2.0,20,31
3,2.0,21,30
4,2.8,16,26


In [7]:
# The list can also be written inline, which is why you'll often see double square brackets
mpg[["cty", "hwy"]].head()

Unnamed: 0,cty,hwy
0,18,29
1,21,29
2,20,31
3,21,30
4,16,26
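The difference between single and double brackets is easy to miss, so here is a small sketch. It uses a hypothetical three-row `cars` dataframe rather than the full `mpg` dataset: a string inside the brackets returns a Series, while a list of names returns a DataFrame, even when the list has only one item.

```python
import pandas as pd

# A small illustrative frame; the values mirror the first rows of mpg.
cars = pd.DataFrame({"displ": [1.8, 2.0, 2.8], "cty": [18, 20, 16], "hwy": [29, 31, 26]})

one_column = cars["cty"]             # a string returns a Series
two_columns = cars[["cty", "hwy"]]   # a list returns a DataFrame
still_a_frame = cars[["cty"]]        # a one-item list still returns a DataFrame

print(type(one_column).__name__)     # Series
print(type(two_columns).__name__)    # DataFrame
print(type(still_a_frame).__name__)  # DataFrame
```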


In [8]:
# Aggregations like .mean, .sum, .median, .min, and .max evaluate each column by default
# Their default argument axis=0 aggregates down the rows, giving one result per column
mpg[["cty", "hwy"]].mean()

cty    16.858974
hwy    23.440171
dtype: float64

In [9]:
mpg[["cty", "hwy"]].head()

Unnamed: 0,cty,hwy
0,18,29
1,21,29
2,20,31
3,21,30
4,16,26


In [10]:
# axis=1 runs .mean across the columns of each row (row-wise execution)
mpg[["cty", "hwy"]].mean(axis=1).head()

0    23.5
1    25.0
2    25.5
3    25.5
4    21.0
dtype: float64
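To see both directions side by side, here is a minimal sketch on a hypothetical three-row `cars` dataframe (the values match the first three rows of `mpg`). The names `column_means` and `row_means` are illustrative.

```python
import pandas as pd

cars = pd.DataFrame({"cty": [18, 21, 20], "hwy": [29, 29, 31]})

column_means = cars.mean()      # axis=0 is the default: one value per column
row_means = cars.mean(axis=1)   # axis=1: one value per row

print(column_means)
print(row_means)
```

The result of `axis=0` is indexed by column name, while the result of `axis=1` keeps the dataframe's row index, which is what lets it be assigned back as a new column.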

In [11]:
# axis=1 calculates the row-wise average
mpg["average_mileage"] = mpg[["cty", "hwy"]].mean(axis=1)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,average_mileage
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,23.5
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,25.0
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,25.5
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,25.5
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,21.0


In [12]:
# value_counts shows each unique value and how many times it occurs
mpg.cyl.value_counts()

4    81
6    79
8    70
5     4
Name: cyl, dtype: int64
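When proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. The sketch below uses a made-up six-value Series rather than the real `cyl` column, so the numbers differ from the output above.

```python
import pandas as pd

# Illustrative data only, not the real mpg column.
cyl = pd.Series([4, 4, 6, 8, 4, 6], name="cyl")

counts = cyl.value_counts()                 # raw counts, most frequent first
shares = cyl.value_counts(normalize=True)   # proportions that sum to 1

print(counts)
print(shares)
```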

In [13]:
# Using Boolean Series to answer questions
# How many Audis are in the dataset?
# True counts as 1 and False counts as 0
(mpg.manufacturer == "audi").sum()

18

In [14]:
# What proportion of engines are 4 cylinders?
(mpg.cyl == 4).mean()

0.34615384615384615
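The two cells above rely on the same idea: a comparison produces a Boolean Series, `.sum()` counts the `True` values, and `.mean()` gives the fraction that are `True`. A minimal sketch with a made-up four-value Series, small enough to check by hand:

```python
import pandas as pd

cyl = pd.Series([4, 4, 6, 8])

is_four = cyl == 4       # Boolean Series: True, True, False, False

print(is_four.sum())     # 2   (count of True values)
print(is_four.mean())    # 0.5 (fraction of True values)
```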

In [15]:
# Using Boolean Series to filter a dataframe
mpg[mpg.cyl == 5]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,average_mileage
217,volkswagen,jetta,2.5,2008,5,auto(s6),f,21,29,r,compact,25.0
218,volkswagen,jetta,2.5,2008,5,manual(m5),f,21,29,r,compact,25.0
225,volkswagen,new beetle,2.5,2008,5,manual(m5),f,20,28,r,subcompact,24.0
226,volkswagen,new beetle,2.5,2008,5,auto(s6),f,20,29,r,subcompact,24.5


In [16]:
# OR keeps rows where at least one predicate is true
# The predicates can come from different columns; only the resulting Boolean values matter
mpg[(mpg.model == "maxima") | (mpg.cyl == 5)]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,average_mileage
147,nissan,maxima,3.0,1999,6,auto(l4),f,18,26,r,midsize,22.0
148,nissan,maxima,3.0,1999,6,manual(m5),f,19,25,r,midsize,22.0
149,nissan,maxima,3.5,2008,6,auto(av),f,19,25,p,midsize,22.0
217,volkswagen,jetta,2.5,2008,5,auto(s6),f,21,29,r,compact,25.0
218,volkswagen,jetta,2.5,2008,5,manual(m5),f,21,29,r,compact,25.0
225,volkswagen,new beetle,2.5,2008,5,manual(m5),f,20,28,r,subcompact,24.0
226,volkswagen,new beetle,2.5,2008,5,auto(s6),f,20,29,r,subcompact,24.5


In [17]:
# AND limits options. Both predicates must be true
mpg[(mpg["class"] == "midsize") & (mpg["displ"] < 2)]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,average_mileage
227,volkswagen,passat,1.8,1999,4,manual(m5),f,21,29,p,midsize,25.0
228,volkswagen,passat,1.8,1999,4,auto(l5),f,18,29,p,midsize,23.5
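Two details of the `&` and `|` examples are worth spelling out: each comparison needs its own parentheses, and the Python keywords `and` / `or` do not work on a Series. The sketch below demonstrates both on a hypothetical four-row `cars` dataframe.

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["maxima", "jetta", "passat", "a4"],
    "cyl": [6, 5, 4, 4],
    "displ": [3.0, 2.5, 1.8, 1.8],
})

# Wrap each comparison in parentheses: & and | bind tighter than == and <.
either = cars[(cars.model == "maxima") | (cars.cyl == 5)]
both = cars[(cars.cyl == 4) & (cars.displ < 2)]

# The keywords `and` / `or` need a single True or False, so they fail on a Series.
try:
    cars[(cars.cyl == 4) and (cars.displ < 2)]
except ValueError as error:
    print("ValueError:", error)

print(either.model.tolist())   # ['maxima', 'jetta']
print(both.model.tolist())     # ['passat', 'a4']
```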


In [35]:
original_df = pd.DataFrame({"x": [1, 2, 3]})

In [36]:
# Be careful: plain assignment does not copy. new_df is a second name for the same dataframe
# Lists, dictionaries, and other mutable Python objects behave the same way
new_df = original_df

In [37]:
original_df["y"] = original_df.x * 100
original_df

Unnamed: 0,x,y
0,1,100
1,2,200
2,3,300


In [38]:
new_df

Unnamed: 0,x,y
0,1,100
1,2,200
2,3,300


In [39]:
# Use .copy to make a proper copy of a dataframe
new_df = original_df.copy()
new_df

Unnamed: 0,x,y
0,1,100
1,2,200
2,3,300


In [42]:
original_df["z"] = 5000
original_df

Unnamed: 0,x,y,z
0,1,100,5000
1,2,200,5000
2,3,300,5000


In [43]:
# new_df is an independent copy, so later changes to original_df do not affect it
new_df

Unnamed: 0,x,y
0,1,100
1,2,200
2,3,300
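The aliasing behavior is not specific to pandas. As a compact sketch, the block below shows the same effect with a plain list and then uses `is` to check whether two names refer to one dataframe. The names `numbers`, `alias`, `same_df`, and `copied_df` are illustrative.

```python
import pandas as pd

# Plain assignment gives the same list a second name.
numbers = [1, 2, 3]
alias = numbers
alias.append(4)
print(numbers)                     # [1, 2, 3, 4]

# The same holds for dataframes; `is` checks whether two names share one object.
original_df = pd.DataFrame({"x": [1, 2, 3]})
same_df = original_df
copied_df = original_df.copy()

original_df["y"] = original_df.x * 100

print(same_df is original_df)      # True
print(copied_df is original_df)    # False
print(list(copied_df.columns))     # ['x']
```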


In [56]:
# Sorting by a single column
mpg.sort_values(by=["average_mileage"], ascending=False, ignore_index=True).head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,average_mileage
0,volkswagen,new beetle,1.9,1999,4,manual(m5),f,35,44,d,subcompact,39.5
1,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,38.5
2,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,35.0
3,toyota,corolla,1.8,2008,4,manual(m5),f,28,37,r,compact,32.5
4,toyota,corolla,1.8,1999,4,manual(m5),f,26,35,r,compact,30.5


In [59]:
# Sorting by multiple columns
mpg.sort_values(by=["hwy", "cty"], ascending=False, ignore_index=True).head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,average_mileage
0,volkswagen,new beetle,1.9,1999,4,manual(m5),f,35,44,d,subcompact,39.5
1,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,38.5
2,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,35.0
3,toyota,corolla,1.8,2008,4,manual(m5),f,28,37,r,compact,32.5
4,honda,civic,1.8,2008,4,auto(l5),f,25,36,r,subcompact,30.5
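With several sort columns, later columns only break ties in earlier ones, and `ascending` can also take a list with one flag per column. A minimal sketch on a hypothetical four-row `cars` dataframe (the model names are placeholders):

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["a", "b", "c", "d"],
    "hwy": [44, 44, 41, 37],
    "cty": [35, 33, 29, 28],
})

# One flag per column: highest hwy first, then lowest cty among ties.
ranked = cars.sort_values(by=["hwy", "cty"], ascending=[False, True], ignore_index=True)

print(ranked)
```

Models `a` and `b` tie on `hwy`, so `cty` decides their order; because that column is sorted ascending here, `b` comes first.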


In [63]:
# Updating the dataframe to store the sorted values
# Use inplace=True or reassign the dataframe variable (not both): with inplace=True the method returns None
mpg.sort_values(by="displ", ascending=False, ignore_index=True, inplace=True)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,average_mileage
0,chevrolet,corvette,7.0,2008,8,manual(m6),r,15,24,p,2seater,19.5
1,chevrolet,k1500 tahoe 4wd,6.5,1999,8,auto(l4),4,14,17,d,suv,15.5
2,chevrolet,corvette,6.2,2008,8,manual(m6),r,16,26,p,2seater,21.0
3,chevrolet,corvette,6.2,2008,8,auto(s6),r,15,25,p,2seater,20.0
4,jeep,grand cherokee 4wd,6.1,2008,8,auto(l5),4,11,14,p,suv,12.5
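The "not both" warning matters because a call with `inplace=True` returns `None`, so reassigning its result would overwrite the dataframe with `None`. The sketch below compares the two options on a hypothetical three-row `cars` dataframe; `by_reassignment`, `in_place`, and `result` are illustrative names.

```python
import pandas as pd

cars = pd.DataFrame({"model": ["a4", "corvette", "jetta"], "displ": [1.8, 7.0, 2.5]})

# Option 1: reassign the sorted result to a variable.
by_reassignment = cars.sort_values(by="displ", ascending=False, ignore_index=True)

# Option 2: sort in place. The method call itself returns None.
in_place = cars.copy()
result = in_place.sort_values(by="displ", ascending=False, ignore_index=True, inplace=True)

print(result)                              # None
print(by_reassignment.equals(in_place))    # True
```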


## Additional Resources
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Pandas Comparison to Spreadsheets](https://pandas.pydata.org/docs/dev/getting_started/comparison/comparison_with_spreadsheets.html)
- [Pandas Comparison to SQL](https://pandas.pydata.org/docs/dev/getting_started/comparison/comparison_with_sql.html)

## Exercises
- Create a new column named `is_automatic` that holds a Boolean indicating whether the vehicle has an automatic transmission.
- Use the `is_automatic` column to sum up the number of automatic vehicles in this dataset.
- Write the pandas code to determine what percentage of the vehicles are subcompacts.
- Combined fuel economy is a weighted average: 55% of the city value plus 45% of the highway value. Use arithmetic operators to add a new column named `fuel_economy` to the `mpg` dataframe.
- Use Boolean masking to find all of the vehicles with a `fuel_economy` above the median `fuel_economy`.
