Working with Constants

We can add a new calculated column by combining an existing column with a constant value. Here we create a column that converts body weight from pounds to kilograms by multiplying by 0.45359237, the number of kilograms in one pound.

In [None]:
import numpy as np
import pandas as pd

animals = pd.read_csv('./data/animals.csv')

animals['bodywtkg'] = animals['bodywt'] * 0.45359237
animals.head()
   brainwt   bodywt      animal    bodywtkg
0    3.385   44.500  Arctic_fox   20.184860
1    0.480   15.499  Owl_monkey    7.030228
2    1.350    8.100      Beaver    3.674098
3  464.983  423.012         Cow  191.875016
4   36.328  119.498   Gray_wolf   54.203381
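The same pattern works for any scalar operation, because pandas applies the constant to every row of the column. Here is a minimal sketch on a small made-up DataFrame; the values and the `LB_TO_KG` name are illustrative and are not taken from animals.csv.

```python
import pandas as pd

# Hypothetical toy data; not taken from animals.csv
toy = pd.DataFrame({"animal": ["A", "B", "C"], "bodywt": [10.0, 2.5, 100.0]})

LB_TO_KG = 0.45359237  # kilograms per pound

# The scalar is broadcast: every element of the column is multiplied by it
toy["bodywtkg"] = toy["bodywt"] * LB_TO_KG
print(toy)
```

Naming the constant keeps the conversion factor in one place if it is reused in other calculations.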


Combining Two (or More) Columns

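We can also calculate a new column from two or more existing columns. Here we compute the ratio of body weight to brain weight, first directly and then with a check that returns 0 wherever the brain weight is 0, so that we never keep the result of a division by zero.

In [None]:
animals['wtratio'] = animals['bodywt'] / animals['brainwt']
# Same ratio, but 0 where brainwt is 0, to avoid division by zero
animals['wtratiozerocheck'] = np.where(animals['brainwt'] != 0, animals['bodywt'] / animals['brainwt'], 0)
animals.head()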
   brainwt   bodywt      animal    bodywtkg    wtratio  wtratiozerocheck
0    3.385   44.500  Arctic_fox   20.184860  13.146233         13.146233
1    0.480   15.499  Owl_monkey    7.030228  32.289583         32.289583
2    1.350    8.100      Beaver    3.674098   6.000000          6.000000
3  464.983  423.012         Cow  191.875016   0.909736          0.909736
4   36.328  119.498   Gray_wolf   54.203381   3.289419          3.289419
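In the first five rows the zero check makes no visible difference, because none of them has a brain weight of 0. The sketch below uses a small made-up DataFrame that does contain a zero in the denominator column. Note that `np.where` evaluates both branches, so the raw division still runs for every row (giving `inf` for the zero row) before that value is replaced with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a zero denominator in the second row
toy = pd.DataFrame({"brainwt": [2.0, 0.0, 4.0], "bodywt": [10.0, 5.0, 2.0]})

# Plain ratio: dividing a non-zero value by zero gives inf
toy["wtratio"] = toy["bodywt"] / toy["brainwt"]

# Guarded ratio: keep the quotient where brainwt is non-zero, otherwise use 0
toy["wtratiozerocheck"] = np.where(
    toy["brainwt"] != 0, toy["bodywt"] / toy["brainwt"], 0
)
print(toy)
```

Whether 0 is the right replacement depends on the analysis; a missing value (`np.nan`) is another common choice when a ratio is undefined.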


Calculations Using Functions

Let's say we want to add up all numeric columns in the animals DataFrame for each row. We can do this with the sum function, passing axis=1 as an argument so that the values are summed across the columns of each row.

In [None]:
# numeric_only=True skips the non-numeric 'animal' column
animals['sum'] = animals.sum(axis=1, numeric_only=True)
animals['sum']
0        94.362327
1        87.588395
2        25.124098
3      1081.689489
4       216.608218
          ...     
57      407.773558
58       10.457118
59       32.265027
60       51.814904
61      101.297708
Name: sum, Length: 62, dtype: float64
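The axis argument controls the direction of the calculation: axis=0 (the default) collapses the rows and returns one value per column, while axis=1 collapses the columns and returns one value per row. A small sketch with made-up data, where the `toy` frame and its column names are placeholders:

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({"name": ["a", "b"], "x": [1.0, 2.0], "y": [10.0, 20.0]})

col_totals = toy.sum(axis=0, numeric_only=True)  # one value per column
row_totals = toy.sum(axis=1, numeric_only=True)  # one value per row

print(col_totals)
print(row_totals)
```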

Data Aggregation & Summarization

Grouping

Applying the groupby function to a DataFrame returns a DataFrameGroupBy object. We pass the column or columns that we intend to group on as the argument.

In [None]:
import numpy as np
import pandas as pd

vehicles = pd.read_csv('vehicles.csv')
vehicles.groupby(['Transmission'])
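The grouped object does not display any data by itself, but it can be inspected before an aggregation is applied. A minimal sketch on made-up data (the rows below are illustrative and are not from vehicles.csv):

```python
import pandas as pd

# Hypothetical toy data; not taken from vehicles.csv
toy = pd.DataFrame({
    "Transmission": ["Manual", "Auto", "Manual", "Auto", "Auto"],
    "City MPG": [20, 18, 25, 17, 22],
})

grouped = toy.groupby("Transmission")

print(type(grouped).__name__)  # name of the class returned by groupby
print(grouped.ngroups)         # number of distinct groups
print(grouped.size())          # number of rows in each group
```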

Aggregations

We can apply different aggregation functions to our grouped data. We can use the standard functions, or define our own and apply them to the grouped data using the agg function.

Some standard aggregation functions are: mean, sum, count, median, min, max, std.

We can also use the agg function to apply multiple aggregations at once to all columns specified.

After grouping, we can select a subset of columns so that the aggregation is applied only to the columns that we choose.

Here are some examples of standard aggregation functions:

In [None]:
# Here we compute the mean of three columns for each transmission value.
# The columns are selected with a list (double brackets); recent pandas versions require this.
vehicles.groupby(['Transmission'])[['Highway MPG', 'City MPG', 'Combined MPG']].mean()
              Highway MPG   City MPG  Combined MPG
Transmission
Auto (AV)       40.000000  35.000000     37.000000
Auto (AV-S6)    25.000000  22.000000     23.000000
Auto (AV-S8)    22.000000  20.000000     21.000000
Auto(A1)        37.000000  41.000000     39.000000
Auto(AM-S6)     32.978261  24.315217     27.554348
...                   ...        ...           ...
Manual 5 spd    14.000000  14.000000     14.000000
Manual 5-spd    25.664312  19.242327     21.634391
Manual 6-spd    26.202229  18.306232     21.153941
Manual 7-spd    26.205882  18.220588     21.117647
Manual(M7)      22.333333  14.000000     17.000000

45 rows × 3 columns

# In this example we group by two columns and compute the median CO2 emission for every combination of fuel type and cylinders
vehicles.groupby(['Fuel Type', 'Cylinders'])['CO2 Emission Grams/Mile'].median()
Fuel Type                    Cylinders
CNG                          4.0          253.197321
                             6.0          417.030882
                             8.0          568.070913
Diesel                       4.0          308.484848
                             5.0          391.538462
                                             ...    
Regular                      8.0          634.785714
                             10.0         776.500000
                             12.0         683.615385
Regular Gas and Electricity  4.0          129.000000
Regular Gas or Electricity   4.0           51.000000
Name: CO2 Emission Grams/Mile, Length: 48, dtype: float64

# Here we produce the mean, median and standard deviation of combined MPG, grouped by fuel type
vehicles.groupby(['Fuel Type'])['Combined MPG'].agg(['mean', 'median', 'std'])
                                  mean  median       std
Fuel Type
CNG                          18.133333    14.5  7.436663
Diesel                       23.488474    21.0  7.054702
Gasoline or E85              17.572385    17.0  3.822538
Gasoline or natural gas      15.350000    12.0  5.343712
Gasoline or propane          13.500000    13.5  1.603567
...                                ...     ...       ...
Premium and Electricity      26.300000    25.5  5.141165
Premium or E85               20.090909    20.0  3.676502
Regular                      20.144698    20.0  5.317500
Regular Gas and Electricity  41.937500    38.5  5.246824
Regular Gas or Electricity   42.000000    42.0  0.000000

13 rows × 3 columns
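The agg function can also apply a different function to each column. The sketch below contrasts the list form used above with pandas named aggregation, on made-up data; the output names `city_mean` and `highway_max` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; not taken from vehicles.csv
toy = pd.DataFrame({
    "Fuel Type": ["Regular", "Regular", "Diesel", "Diesel"],
    "City MPG": [20, 24, 30, 34],
    "Highway MPG": [28, 32, 40, 44],
})

# A list of functions is applied to every selected column
summary = toy.groupby("Fuel Type")[["City MPG", "Highway MPG"]].agg(["mean", "max"])

# Named aggregation: one function per column, with explicit output names
named = toy.groupby("Fuel Type").agg(
    city_mean=("City MPG", "mean"),
    highway_max=("Highway MPG", "max"),
)

print(summary)
print(named)
```

The list form produces a two-level column index (column, function), while named aggregation returns flat column names, which can be easier to work with afterwards.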


Custom Aggregation Functions

We do not have to be limited by the range of standard aggregation functions. If the need arises, we can write our own aggregation function.

For example, in our vehicle dataset, we might want to find out for each level of transmission, what is the most common vehicle class. In other words, we would like to find the mode.

We could write our own implementation of the mode, but it is simpler to reuse the one in SciPy, a Python package for scientific computing.

Let us first define our custom function using the SciPy mode function. We wrap it in a custom function because mode returns a result with two parts: the mode and the number of times it occurs. We are only interested in the first part.

In [None]:
from scipy import stats

def agg_mode(x):
    # stats.mode returns (mode, count); keep only the mode
    return stats.mode(x)[0]

In [None]:
vehicles.groupby("Transmission")["Vehicle Class"].agg(agg_mode)
Transmission
Auto (AV)           Compact Cars
Auto (AV-S6)        Compact Cars
Auto (AV-S8)        Midsize Cars
Auto(A1)         Subcompact Cars
Auto(AM-S6)         Compact Cars
                      ...       
Manual 5 spd                Vans
Manual 5-spd        Compact Cars
Manual 6-spd        Compact Cars
Manual 7-spd    Minicompact Cars
Manual(M7)           Two Seaters
Name: Vehicle Class, Length: 45, dtype: object
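Depending on the installed SciPy version, stats.mode may not accept text columns such as Vehicle Class, since recent releases restrict it to numeric input. As a hedged alternative, and not the approach used above, the same idea can be sketched with the pandas Series.mode method. It returns all tied modes in sorted order, so taking the first one is an arbitrary tie-breaking choice. The `first_mode` helper and the toy data are hypothetical.

```python
import pandas as pd

def first_mode(x):
    """Return the most frequent value in x (first in sorted order on ties)."""
    return x.mode().iloc[0]

# Hypothetical toy data; not taken from vehicles.csv
toy = pd.DataFrame({
    "Transmission": ["Auto", "Auto", "Auto", "Manual", "Manual"],
    "Vehicle Class": ["Compact Cars", "Vans", "Compact Cars",
                      "Two Seaters", "Two Seaters"],
})

result = toy.groupby("Transmission")["Vehicle Class"].agg(first_mode)
print(result)
```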

Plotting Multiple Data Series

There are many cases when a more elaborate visualization can help us understand our data better. Therefore, in this lesson we will focus on generating such visualizations.



Multiple Line Plots

We want to compare how city MPG and highway MPG relate to CO2 emissions.

To do this, we can use the DataFrame .plot function in pandas. With this function, we can specify which variable goes on the x axis and which variables go on the y axis. We will put CO2 emissions on the x axis and the two MPG variables on the y axis.

To get a meaningful visualization, we should sort our DataFrame by these variables first. pandas does not sort the data before plotting; it simply draws a line between each pair of consecutive rows. If the rows are not in order along the x axis, the line jumps back and forth and the chart becomes very hard to read.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
vehicles = pd.read_csv('vehicles.csv')

vehicles.sort_values(by=["CO2 Emission Grams/Mile", "City MPG", "Highway MPG"], inplace=True)
vehicles.plot(x="CO2 Emission Grams/Mile", y=["City MPG", "Highway MPG"])
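To illustrate why sorting matters, the sketch below builds a tiny made-up DataFrame whose rows are out of order. Because a line plot connects points in row order, the unsorted frame zigzags while the sorted copy is drawn from left to right. The column names `x` and `y` are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data with rows deliberately out of order along x
toy = pd.DataFrame({"x": [3, 1, 4, 2], "y": [30, 10, 40, 20]})

# sort_values returns a sorted copy unless inplace=True is passed
toy_sorted = toy.sort_values(by="x")

# Draw both versions side by side for comparison
fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
toy.plot(x="x", y="y", ax=left, title="Unsorted")
toy_sorted.plot(x="x", y="y", ax=right, title="Sorted")
```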

Multiple Bar Plots


When plotting categorical data, it is useful to plot two or more groups side by side so that we can compare them. There are a few ways of creating such a plot.

Side By Side Bar Plots

If we include multiple columns in our bar plot, they will show up side by side in different colors.

In the example below we aggregate both highway and city MPG by drivetrain. Since a bar plot will plot one value per group, we will aggregate and compute the mean.

In [None]:
vehicles_mean = vehicles[["Highway MPG", "City MPG", "Drivetrain"]].groupby(["Drivetrain"]).agg("mean")
vehicles_mean.plot.bar()

Side By Side Horizontal Bar Plots


We can use the .barh function to produce horizontal bars.

In [None]:
vehicles_mean.plot.barh()
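The two bar plot variants can be tried on a small made-up table to see how the grouped means map to bars. In this sketch the drivetrain labels and MPG values are invented; each column of the aggregated frame becomes one colored series, and each index value becomes one group of bars.

```python
import pandas as pd

# Hypothetical toy data; not taken from vehicles.csv
toy = pd.DataFrame({
    "Drivetrain": ["FWD", "FWD", "RWD", "RWD"],
    "Highway MPG": [30, 34, 24, 26],
    "City MPG": [22, 26, 16, 18],
})

# One mean per drivetrain and column: 2 groups x 2 columns = 4 bars
toy_mean = toy.groupby("Drivetrain")[["Highway MPG", "City MPG"]].mean()

ax = toy_mean.plot.bar()     # vertical bars, one color per column
ax_h = toy_mean.plot.barh()  # the same data drawn horizontally
```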

Scatter Matrices

A scatter matrix is a useful tool, particularly in exploratory data analysis. It lets us look at the pairwise relationships between multiple variables at the same time. Typically we look for linear relationships between pairs of variables, since this information can help us later when modeling the data. We can also detect nonlinear relationships, such as a logarithmic or exponential relationship between two variables. In that case, we can apply a transformation to the variables to produce a linear relationship.

We will be using the scatter_matrix function. This function creates a scatter plot for every pair of numeric variables in our data.

By default, the scatter matrix displays a histogram of each variable along the diagonal. We can show a kernel density estimate along the diagonal instead by passing diagonal='kde'.

In [None]:
pd.plotting.scatter_matrix(vehicles)
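With many columns the matrix gets crowded, so it can help to try the function on a few numeric columns first. The sketch below uses random made-up data with placeholder column names and switches the diagonal to a kernel density estimate, which relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 50 rows of random numbers in three columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# diagonal="kde" replaces the default histograms with density estimates
axes = pd.plotting.scatter_matrix(toy, diagonal="kde", figsize=(6, 6))
```

The function returns a grid of matplotlib axes with one row and one column per variable, which can be used to adjust individual panels.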

This visualization may seem a bit cluttered, but it tells us quite a bit about our data. The main takeaway is that there is a linear relationship between combined MPG, city MPG and highway MPG. There is a nonlinear relationship between MPG and CO2 emissions, and between MPG and fuel cost per year. These pairs of variables could benefit from a transformation that makes the relationships linear.
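As an illustration of such a transformation, consider a made-up pair of variables where one is inversely proportional to the other. This is an assumption chosen for the sketch, not a claim about vehicles.csv, and the constant 9000 is arbitrary. Taking the logarithm of both variables turns the curve into a straight line, which shows up as a correlation very close to -1.

```python
import numpy as np
import pandas as pd

# Synthetic data: co2 is inversely proportional to mpg (illustrative only)
mpg = np.linspace(10, 50, 41)
toy = pd.DataFrame({"mpg": mpg, "co2": 9000.0 / mpg})

# Correlation on the raw scale: strong, but the relationship is curved
raw_corr = toy["mpg"].corr(toy["co2"])

# Correlation after a log transform of both variables: the relationship is linear
log_corr = np.log(toy["mpg"]).corr(np.log(toy["co2"]))

print(raw_corr, log_corr)
```

Real data is noisier, so the transformed correlation will not be exact, but a scatter plot of the transformed variables should look much closer to a straight line.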

