#  Exploratory Data Analysis With Pandas

### Overview

In this lesson, students will begin using Pandas for exploratory data analysis. This will include filtering and sorting data to generate insights.

### Learning Objectives

* Use Pandas to read in a data set
* Use DataFrame attributes and methods to investigate a data set's integrity
* Apply filters and sorting to DataFrames

## Meet Pandas
 

#### What is Pandas?
First, let's install the library. You only need to run the next cell once.

In [1]:
conda install pandas -y

It's also useful to install another library, matplotlib. Again, this only happens once.

In [None]:
conda install matplotlib -y

In [2]:
# import the library
import pandas as pd

In [3]:
# what version am I using?
print(pd.__version__)

1.4.2


In [4]:
# what's the name of this package?
print(pd.__package__)


pandas


In [5]:
# show me some documentation!
print(pd.__doc__)


pandas - a powerful data analysis and manipulation library for Python

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and

###  Reading a Data Set

In [6]:
# where is our dataset?
file_address = 'data/chipotle.tsv'

# you can also get this file online:
file_address = 'https://raw.githubusercontent.com/python-machine-learning-apps/intro-to-pandas/main/data/chipotle.tsv'

In [7]:
delimiter_character = '\t'

In [8]:
# read in a file
data_frame = pd.read_csv(file_address,sep=delimiter_character)

In [9]:
# show the first few rows
data_frame[:5]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
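
If the file is not at hand, the same `read_csv` call can be tried on an in-memory string. This is a minimal sketch: `tsv_text` and `sample` are made-up names, and the three rows are a hypothetical excerpt in the same tab-separated layout, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like a tab-separated file
tsv_text = (
    "order_id\tquantity\titem_name\titem_price\n"
    "1\t1\tIzze\t$3.39\n"
    "1\t1\tNantucket Nectar\t$3.39\n"
    "2\t2\tChicken Bowl\t$16.98\n"
)

# sep tells read_csv which character separates the fields
sample = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # the first two rows
```

Swapping `sep="\t"` for the default comma would leave each line in a single column, which is why the delimiter matters for `.tsv` files.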


In [10]:
# you can also get the same file in another format
file_location = '../data/chipotle.csv'

In [11]:
# notice that now it doesn't need the delimiter!
comparison_data_frame = pd.read_csv(file_location)

In [12]:
# here's a different way to look at the first 5 rows
comparison_data_frame.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


###  Series vs. DataFrames

In [13]:
# Double brackets return a one-column DataFrame; data_frame['item_name'] (single brackets) returns a Series
data_frame[['item_name']]

Unnamed: 0,item_name
0,Chips and Fresh Tomato Salsa
1,Izze
2,Nantucket Nectar
3,Chips and Tomatillo-Green Chili Salsa
4,Chicken Bowl
...,...
4617,Steak Burrito
4618,Steak Burrito
4619,Chicken Salad Bowl
4620,Chicken Salad Bowl


In [14]:
# A dataframe
data_frame[['item_name','item_price']]

Unnamed: 0,item_name,item_price
0,Chips and Fresh Tomato Salsa,$2.39
1,Izze,$3.39
2,Nantucket Nectar,$3.39
3,Chips and Tomatillo-Green Chili Salsa,$2.39
4,Chicken Bowl,$16.98
...,...,...
4617,Steak Burrito,$11.75
4618,Steak Burrito,$11.75
4619,Chicken Salad Bowl,$11.25
4620,Chicken Salad Bowl,$8.75
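
The difference between the two bracket styles is easiest to see by checking the type of each result. The sketch below uses a small made-up frame (`menu`) rather than the full data set.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the Chipotle data
menu = pd.DataFrame({
    "item_name": ["Izze", "Chicken Bowl"],
    "item_price": ["$3.39", "$16.98"],
})

as_series = menu["item_name"]    # single brackets: one column as a Series
as_frame = menu[["item_name"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```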


###  Accessing and Modifying the Index

In [15]:
data_frame.index

RangeIndex(start=0, stop=4622, step=1)

In [16]:
# show the first few rows
data_frame[:5]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [17]:
column_name = 'order_id'
data_frame.set_index(column_name, inplace=True)
# data_frame[:5]
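
Without `inplace=True`, `set_index` returns a new DataFrame and leaves the original alone, and `reset_index` moves the index back into a regular column. A minimal sketch on a hypothetical `orders` frame:

```python
import pandas as pd

# Hypothetical frame; the names orders, indexed and restored are illustrative
orders = pd.DataFrame({
    "order_id": [1, 1, 2],
    "item_name": ["Izze", "Nantucket Nectar", "Chicken Bowl"],
})

indexed = orders.set_index("order_id")  # new frame; orders itself is unchanged
print(indexed.index.name)
print(list(indexed.columns))

restored = indexed.reset_index()        # order_id becomes a column again
print(list(restored.columns))
```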

###  Columns and Data Types

In [18]:
# Prints all the column names
data_frame.columns 

Index(['quantity', 'item_name', 'choice_description', 'item_price'], dtype='object')

In [19]:
# Prints all the data types, but is hard to read!
data_frame.dtypes 

quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

In [20]:
# Easy-to-read DataFrame of the data types: 
pd.DataFrame(data_frame.dtypes, columns=['DataType'])

Unnamed: 0,DataType
quantity,int64
item_name,object
choice_description,object
item_price,object
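
Notice that `item_price` is not numeric: the dollar sign makes pandas read it as text. One possible way to get a numeric column is sketched below on a made-up frame; the `price_usd` column name is an assumption, not part of the data set.

```python
import pandas as pd

# Hypothetical prices in the same "$x.xx" text format
prices = pd.DataFrame({"item_price": ["$2.39", "$3.39", "$16.98"]})

# Remove the literal dollar sign, then convert the text to floats
prices["price_usd"] = (
    prices["item_price"].str.replace("$", "", regex=False).astype(float)
)

print(prices["price_usd"].dtype)
total = prices["price_usd"].sum()
print(total)
```

Passing `regex=False` treats `$` as a plain character rather than a regular-expression anchor.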


###  Renaming Columns
Okay, we're going to try a new file.

In [21]:
# set the file location
file_address = 'https://raw.githubusercontent.com/python-machine-learning-apps/intro-to-pandas/main/data/mtcars.csv'

In [22]:

# read in the file
df = pd.read_csv(file_address)

In [23]:
# the "head" of a file

df.head(5)

Unnamed: 0,name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,Blue
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,Black
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,Red
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,Red
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,Silver


In [24]:
# renaming
df.rename(columns={'name': 'model'}, inplace=True)
df.head(3)

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,Blue
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,Black
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,Red
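
`rename` can also change several columns at once and, without `inplace=True`, hands back a new DataFrame. A small sketch on a hypothetical `cars` frame (the new name `miles_per_gallon` is only for illustration):

```python
import pandas as pd

# Hypothetical two-row frame
cars = pd.DataFrame({"name": ["Mazda RX4", "Datsun 710"], "mpg": [21.0, 22.8]})

# Map old names to new names; columns not listed keep their names
renamed = cars.rename(columns={"name": "model", "mpg": "miles_per_gallon"})

print(list(cars.columns))     # the original is untouched
print(list(renamed.columns))
```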


###  Common Column Operations

In [25]:
# describe a column
df['mpg'].describe()

count    32.000000
mean     20.090625
std       6.026948
min      10.400000
25%      15.425000
50%      19.200000
75%      22.800000
max      33.900000
Name: mpg, dtype: float64

In [26]:
# show the frequency of values within a column
df['cyl'].value_counts()

8    14
4    11
6     7
Name: cyl, dtype: int64

In [27]:
# show the unique values of a column
df['model'].unique()

array(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive',
       'Hornet Sportabout', 'Valiant', 'Duster 360', 'Merc 240D',
       'Merc 230', 'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL',
       'Merc 450SLC', 'Cadillac Fleetwood', 'Lincoln Continental',
       'Chrysler Imperial', 'Fiat 128', 'Honda Civic', 'Toyota Corolla',
       'Toyota Corona', 'Dodge Challenger', 'AMC Javelin', 'Camaro Z28',
       'Pontiac Firebird', 'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa',
       'Ford Pantera L', 'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'],
      dtype=object)

In [28]:
# how many unique values are there?
df['model'].nunique()

32
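
`value_counts` can report shares instead of raw counts through its `normalize` parameter. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical cylinder values
cyl = pd.Series([6, 6, 4, 8, 8, 8], name="cyl")

counts = cyl.value_counts()                # how many times each value appears
shares = cyl.value_counts(normalize=True)  # the same, as fractions of the total

print(counts)
print(shares)
```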

## Filtering and Sorting Data
 

###  The Boolean Mask

In [29]:
# show the first 5 rows of a dataframe
df.head()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,Blue
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,Black
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,Red
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,Red
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,Silver


In [30]:
#  establish a filter (we call this a "Boolean Mask")
df['cyl'] == 6

0      True
1      True
2     False
3      True
4     False
5      True
6     False
7     False
8     False
9      True
10     True
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29     True
30    False
31    False
Name: cyl, dtype: bool

In [31]:
# Put the Boolean mask inside the brackets to keep only the matching rows, then show the last 3
df[df['cyl'] == 6].tail(3)

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
9,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4,Black
10,Merc 280C,17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4,Red
29,Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6,Silver
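
A mask is just a Series of `True`/`False` values, so it can be saved in a variable, counted, and reused. A minimal sketch on a hypothetical four-row frame (`is_six` and `six_cyl` are made-up names):

```python
import pandas as pd

# Hypothetical subset of cars
cars = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant"],
    "cyl": [6, 4, 6, 6],
})

is_six = cars["cyl"] == 6      # Boolean Series aligned with the frame's index
n_matches = int(is_six.sum())  # True counts as 1, so the sum is the match count
six_cyl = cars[is_six]         # rows where the mask is True

print(is_six.tolist())
print(n_matches)
print(six_cyl)
```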


###  DataFrame Syntax Chaining

In [32]:
# set a new filter
df['color'] == 'Blue'

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24     True
25     True
26    False
27    False
28    False
29    False
30    False
31     True
Name: color, dtype: bool

In [33]:
# Access the mpg column of all results passing the filter:
df[df['color'] == 'Blue']['mpg']

0     21.0
12    17.3
24    19.2
25    27.3
31    21.4
Name: mpg, dtype: float64

In [34]:
# describe that column
df['mpg'].describe()

count    32.000000
mean     20.090625
std       6.026948
min      10.400000
25%      15.425000
50%      19.200000
75%      22.800000
max      33.900000
Name: mpg, dtype: float64

In [35]:
# Gives numerical summaries based only on results passing the filter
df[df['color'] == 'Blue']['mpg'].describe()

count     5.000000
mean     21.240000
std       3.758058
min      17.300000
25%      19.200000
50%      21.000000
75%      21.400000
max      27.300000
Name: mpg, dtype: float64
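
The filter-then-column pattern can also be written as a single `.loc` call that takes the row mask and the column label together. The sketch below, on a made-up frame, checks that both forms give the same result when reading data.

```python
import pandas as pd

# Hypothetical three-row frame
cars = pd.DataFrame({
    "model": ["Mazda RX4", "Merc 450SL", "Volvo 142E"],
    "mpg": [21.0, 17.3, 21.4],
    "color": ["Blue", "Blue", "Red"],
})

blue = cars["color"] == "Blue"

chained = cars[blue]["mpg"]      # two steps: filter rows, then pick the column
single = cars.loc[blue, "mpg"]   # one step: rows and column in the same call

print(chained.equals(single))
print(single.mean())
```

For reading values either form works; `.loc` is the form generally preferred when assigning new values to the selection.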

###  Filtering by Multiple Conditions

In [36]:
# set a filter, then show the first 5 rows
df[df['mpg'] > 20].head()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,Blue
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,Black
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,Red
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,Red
7,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2,Black


In [37]:
# Adding more conditions uses the same syntax, even if it looks more complicated
df[(df['color'] == 'Silver') & (df['mpg'] > 20)]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
18,Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2,Silver
20,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1,Silver
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2,Silver


In [38]:
# The parentheses here are very important: & and | bind more tightly than == and >.
# Leaving them out will usually trigger an error.
df[(df['color'] == 'Silver') | (df['mpg'] > 32)]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,Silver
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4,Silver
14,Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4,Silver
17,Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1,Black
18,Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2,Silver
19,Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1,Black
20,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1,Silver
21,Dodge Challenger,15.5,8,318.0,150,2.76,3.52,16.87,0,0,3,2,Silver
23,Camaro Z28,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4,Silver
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2,Silver
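
One way to keep multi-condition filters readable is to name each mask first and then combine the names with `&` (and), `|` (or) and `~` (not). A minimal sketch on four rows copied from the outputs above; the variable names are made up.

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Honda Civic", "Duster 360", "Fiat 128", "Merc 280C"],
    "mpg": [30.4, 14.3, 32.4, 17.8],
    "color": ["Silver", "Silver", "Black", "Red"],
})

silver = cars["color"] == "Silver"
efficient = cars["mpg"] > 20

both = cars[silver & efficient]          # silver AND over 20 mpg
either = cars[silver | efficient]        # silver OR over 20 mpg
neither = cars[~(silver | efficient)]    # NOT (silver or over 20 mpg)

print(both["model"].tolist())
print(either["model"].tolist())
print(neither["model"].tolist())
```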


###  Sorting

In [39]:
# For a Series object, no need to specify column: There's only one!
df['mpg'].sort_values().head()

15    10.4
14    10.4
23    13.3
6     14.3
16    14.7
Name: mpg, dtype: float64

In [40]:
# For a DataFrame, sort_values needs a column name passed via `by`; use sort_index() to sort by index.
df.sort_values(by='mpg', ascending=True).head()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
15,Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4,Red
14,Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4,Silver
23,Camaro Z28,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4,Silver
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4,Silver
16,Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4,Red
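
`sort_values` also accepts a list of columns, with one direction per column, which helps break ties like the two cars at 10.4 mpg. A minimal sketch using four rows from the output above (`ranked` is a made-up name):

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Lincoln Continental", "Cadillac Fleetwood", "Camaro Z28", "Duster 360"],
    "mpg": [10.4, 10.4, 13.3, 14.3],
    "hp": [215, 205, 245, 245],
})

# Highest mpg first; within equal mpg, lowest horsepower first
ranked = cars.sort_values(by=["mpg", "hp"], ascending=[False, True])
print(ranked)

# sort_index() puts the rows back in their original index order
print(ranked.sort_index()["model"].tolist())
```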


###  Accessing an Individual Row

In [41]:
# iloc selects rows by integer position; passing a list of positions returns a DataFrame
df.sort_values(by="mpg").iloc[[0]]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,color
15,Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4,Red


In [42]:
# With a single position (one pair of brackets), iloc returns the row as a Series
df.sort_values(by="mpg").iloc[0]

model    Lincoln Continental
mpg                     10.4
cyl                        8
disp                   460.0
hp                       215
drat                     3.0
wt                     5.424
qsec                   17.82
vs                         0
am                         0
gear                       3
carb                       4
color                    Red
Name: 15, dtype: object
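
After sorting, position and label no longer agree, which is where `iloc` (position) and `loc` (index label) differ. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Mazda RX4", "Lincoln Continental", "Fiat 128"],
    "mpg": [21.0, 10.4, 32.4],
})

by_mpg = cars.sort_values(by="mpg")  # index labels are now ordered 1, 0, 2

first_by_position = by_mpg.iloc[0]   # first row after sorting
row_labeled_zero = by_mpg.loc[0]     # the row whose index label is 0

print(first_by_position["model"])    # Lincoln Continental
print(row_labeled_zero["model"])     # Mazda RX4
```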