In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

# Inspiration and background

You are a data scientist working at an American Automotive Company, and for some reason, you are tasked with looking at various data for cars between the years of 1970 and 1982. Not wanting to stir the pot, you acquiesce. You're not sure what you're looking for, so you try to come up with the most logical combinations possible to plot to make your boss happy.

## Importing previously established datasets with Pandas

You've heard tell of a Python package called Pandas, a tool for reading in and working with data organized in rows and columns. Because it's built on NumPy, it can run very fast calculations on very large datasets. Fun fact: it was originally developed for quantitative analysis of financial data and is written in a combination of Python, Cython, and C.

In [None]:
# Read in a toy dataset that comes with Seaborn
df = sns.load_dataset('mpg')

# To load any other .csv or .tsv file as a Pandas DataFrame, import Pandas (import pandas as pd)
# and use the pd.read_csv() function; pass sep='\t' for tab-separated files.

# Print the dataframe to look at how the structure is organized
print(df)
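
If your data lives in your own file rather than in Seaborn, the loading step might look like the sketch below. The CSV text and the names `csv_text` and `cars` are made up for illustration; an in-memory string stands in for a file on disk so the cell runs anywhere.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk (values are invented).
csv_text = """name,origin,mpg
car_a,usa,18.0
car_b,japan,31.5
car_c,europe,26.0
"""

# pd.read_csv accepts a file path or any file-like object.
# For a tab-separated file, pass sep="\t".
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)          # (rows, columns)
print(list(cars.columns))  # column names taken from the header row
```

With a real file you would pass its path instead, for example `pd.read_csv("my_cars.csv")`, where the file name is again only a placeholder.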

Next, we want to see the values stored in the column "origin".

In [None]:
# Pull out the column "origin"
car_origin = df.origin

# Print the column
print(car_origin)

If you want to pull out multiple columns, the dot method won't work. Instead, index the DataFrame with a list of strings, each corresponding to a "key" in the table (a column name).

In [None]:
# Pull out the column "origin" AND the column "mpg"
car_origin_mpg = df[['origin','mpg']]

# Print these columns
print(car_origin_mpg)
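
The two access styles return different kinds of objects, which matters once you start passing them to other functions. Here is a small sketch on a made-up three-row table (the `toy` DataFrame and its values are placeholders, not rows from the mpg dataset):

```python
import pandas as pd

# Invented toy table with the same column names as the mpg dataset.
toy = pd.DataFrame({
    "origin": ["usa", "japan", "europe"],
    "mpg": [18.0, 31.5, 26.0],
    "cylinders": [8, 4, 4],
})

# A single string (or the dot method) returns a one-dimensional Series.
one_col = toy["origin"]

# A list of strings returns a DataFrame, even if the list has one entry.
two_cols = toy[["origin", "mpg"]]

print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame
```

Bracket indexing also works for column names that contain spaces or that clash with DataFrame methods, where the dot method would fail.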

The "origin" column is quite unwieldy to read in full. Maybe there's a way we can find the *unique* values in this column.

In [None]:
# Pull out the column "origin" and call its .unique() method to get a NumPy array of the unique values
car_origin = df.origin.unique()

# Print the unique values
print(car_origin)
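
If you also want to know how often each value occurs, Pandas offers a couple of closely related methods. The sketch below uses an invented five-row column (`toy` is a placeholder), so the counts say nothing about the real dataset:

```python
import pandas as pd

# Invented toy column; the real "origin" column has many more rows.
toy = pd.DataFrame({"origin": ["usa", "usa", "japan", "europe", "usa"]})

# .unique() returns a NumPy array in order of first appearance.
labels = toy["origin"].unique()
print(labels.tolist())  # ['usa', 'japan', 'europe']

# .nunique() gives the number of distinct values.
print(toy["origin"].nunique())  # 3

# .value_counts() gives the number of rows per distinct value.
counts = toy["origin"].value_counts()
print(counts["usa"])  # 3
```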

That's all fine and dandy, but what's _actually_ stored in this data set? Is there a way we can access all of the columns that we can work with? There sure is!

In [None]:
# Print the names of the columns using the .columns attribute. Note that the result is an Index object rather than a list.

print(df.columns)
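
If you need a plain Python list of names, or only the numeric columns that make sense on a y-axis, something like the following should work. The `toy` table is a made-up stand-in for `df`:

```python
import pandas as pd

# Invented toy table standing in for the mpg DataFrame.
toy = pd.DataFrame({
    "mpg": [18.0, 31.5],
    "cylinders": [8, 4],
    "origin": ["usa", "japan"],
})

# Convert the Index object to an ordinary list.
column_names = list(toy.columns)
print(column_names)

# Keep only the columns that hold numbers.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# .dtypes shows the data type stored in each column.
print(toy.dtypes)
```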

# Working with Seaborn
What if we wanted to see the difference in fuel efficiency between cars made in Japan, Europe, or the USA? We can use a graph! But what kind of graph?

Seaborn is a package based on the Matplotlib framework. You don't have any idea what that means, but you trust the process. It's very good at displaying data and is deeply integrated with Pandas DataFrames, so you decide to give it a shot.

In [None]:
# A scatterplot may help you to identify trends between the data. Use the scatterplot function in Seaborn
sns.scatterplot(data = df, x='origin',y='mpg')

# Run the following command to print the resulting graph
plt.show()

Okay, it's not the prettiest graph we've ever made, but we can clearly see the trends in fuel efficiency vs. origin. Is there another way we can look at categorical data?

# Using Categorical graphs

Categorical graphs are slightly different from the scatter plots we might be used to. Instead of treating the x variable as continuous, we treat it as a set of discrete values and group the data by those values. The syntax stays the same, but Seaborn does a lot of the work for us under the hood, pardon the pun.

In [None]:
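# Draw one box per origin to summarize the distribution of mpg
sns.boxplot(data=df, x='origin',y='mpg')
# Overlay the individual observations on top of the boxes
sns.stripplot(data=df, x='origin',y='mpg')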
plt.show()
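
A box plot is a picture of a few summary numbers per category: the box spans the lower and upper quartiles, and the line inside it marks the median. As a rough sketch, you can compute those same numbers with a Pandas groupby. The `toy` values below are invented so the arithmetic is easy to follow; they are not taken from the mpg dataset.

```python
import pandas as pd

# Invented toy values chosen to make the quartiles easy to check by hand.
toy = pd.DataFrame({
    "origin": ["usa"] * 4 + ["japan"] * 4,
    "mpg": [15.0, 17.0, 19.0, 21.0, 28.0, 30.0, 32.0, 34.0],
})

# .describe() reports count, mean, std, min, quartiles, and max per group.
# Keep only the three values that define the box in a box plot.
summary = toy.groupby("origin")["mpg"].describe()[["25%", "50%", "75%"]]
print(summary)
```

Running the same line on `df` instead of `toy` gives the numbers behind the boxes in the plot above.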

Let's add a title and pretty up the graph a little more.

In [None]:
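# Set context and resolution in a single call. Calling sns.set() after
# sns.set_context() would reset the context to its default.
sns.set(context='paper', rc={"figure.dpi":300, 'savefig.dpi':300})

# Plot the data, fixing the category order so the tick labels below match
origin_order = ['usa', 'japan', 'europe']
ax = sns.boxplot(data=df, x='origin', y='mpg', order=origin_order)
sns.stripplot(data=df, x='origin', y='mpg', order=origin_order, palette='dark')

# Add figure titles, labels, and tick labels
plt.title('Fuel Efficiency vs. Origin of Automobile')
plt.xlabel('Country of Origin')
plt.ylabel('Fuel Efficiency (miles per gallon)')
ax.set_xticklabels(['USA','Japan','Europe'])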

plt.show()

That's interesting! It looks like Japan has the highest fuel efficiency, followed closely by Europe, then the USA in dead last. Pushing any speculation as to why out of your mind, you wonder how these results compare to how fast these cars accelerate to 60 mph. Note that the "acceleration" column records the time in seconds needed to reach 60 mph, so a lower value means a quicker car. We can plot it by simply __copying__ the code above, changing the variables we want plotted, and running the program!

In [None]:
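# Set context and resolution in a single call. Calling sns.set() after
# sns.set_context() would reset the context to its default.
sns.set(context='paper', rc={"figure.dpi":300, 'savefig.dpi':300})

# Plot the data, fixing the category order so the tick labels below match
origin_order = ['usa', 'japan', 'europe']
ax = sns.boxplot(data=df, x='origin', y='acceleration', order=origin_order)
sns.stripplot(data=df, x='origin', y='acceleration', order=origin_order, palette='dark')

# Add figure titles, labels, and tick labels
plt.title('Acceleration vs. Origin of Automobile')
plt.xlabel('Country of Origin')
plt.ylabel('Time to reach 60 mph (seconds)')
ax.set_xticklabels(['USA','Japan','Europe'])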

plt.show()
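
Copy-pasting a cell makes it easy to leave a stale title or axis label behind. One alternative is to wrap the plotting steps in a small function. The sketch below is only one way to do it: the name `plot_by_origin`, its parameters, the plain black points, and the `toy` values are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Category order on the x-axis and the matching display labels.
ORIGIN_ORDER = ["usa", "japan", "europe"]
ORIGIN_LABELS = ["USA", "Japan", "Europe"]


def plot_by_origin(data, y, ylabel, title):
    """Hypothetical helper: box plot plus strip plot of column `y` per origin."""
    fig, ax = plt.subplots()
    sns.boxplot(data=data, x="origin", y=y, order=ORIGIN_ORDER, ax=ax)
    sns.stripplot(data=data, x="origin", y=y, order=ORIGIN_ORDER,
                  color="black", ax=ax)
    ax.set_title(title)
    ax.set_xlabel("Country of Origin")
    ax.set_ylabel(ylabel)
    # Pin the tick positions before renaming them.
    ax.set_xticks([0, 1, 2])
    ax.set_xticklabels(ORIGIN_LABELS)
    return ax


# Invented toy values; pass the real DataFrame (df) in the notebook instead.
toy = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe", "europe"],
    "acceleration": [12.0, 13.5, 16.0, 17.5, 15.0, 16.5],
})

ax = plot_by_origin(toy, y="acceleration",
                    ylabel="Time to reach 60 mph (seconds)",
                    title="Acceleration vs. Origin of Automobile")
```

Follow the call with `plt.show()` to display the figure. Each new variable then needs only one call with its own labels.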

Now, you might be interested to see the effect of the number of cylinders on the fuel efficiency or performance of the car. This is easily accomplished by taking advantage of the "hue" parameter, which splits each category into one box per cylinder count!


In [None]:
# Set context and resolution in a single call. Calling sns.set() after
# sns.set_context() would reset the context to its default.
sns.set(context='paper', rc={"figure.dpi":300, 'savefig.dpi':300})

# Plot the data, fixing the category order so the tick labels below match
origin_order = ['usa', 'japan', 'europe']
ax = sns.boxplot(data=df, x='origin', y='mpg', hue='cylinders', order=origin_order)
sns.stripplot(data=df, x='origin', y='mpg', hue='cylinders', order=origin_order, palette='dark', dodge=True)

# Add figure titles, labels, and tick labels
plt.title('Fuel Efficiency vs. Origin of Automobile')
plt.xlabel('Country of Origin')
plt.ylabel('Fuel Efficiency (miles per gallon)')
ax.set_xticklabels(['USA','Japan','Europe'])

# Put legend outside for readability.
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)

plt.show()

Let's look at acceleration now...

In [None]:
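# Set context and resolution in a single call. Calling sns.set() after
# sns.set_context() would reset the context to its default.
sns.set(context='paper', rc={"figure.dpi":300, 'savefig.dpi':300})

# Plot the data, fixing the category order so the tick labels below match
origin_order = ['usa', 'japan', 'europe']
ax = sns.boxplot(data=df, x='origin', y='acceleration', hue='cylinders', order=origin_order)
sns.stripplot(data=df, x='origin', y='acceleration', hue='cylinders', order=origin_order, palette='dark', dodge=True)

# Add figure titles, labels, and tick labels
plt.title('Acceleration vs. Origin of Automobile')
plt.xlabel('Country of Origin')
plt.ylabel('Time to reach 60 mph (seconds)')
ax.set_xticklabels(['USA','Japan','Europe'])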

# Put legend outside for readability.
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)

plt.show()

Aha! Now we're getting somewhere. It looks like the number of cylinders is negatively correlated with both the fuel efficiency of the car and the time it needs to reach 60 mph: cars with more cylinders tend to burn more fuel but accelerate faster. Maybe we can see this relationship directly by plotting acceleration against mpg!

In [None]:
sns.lmplot(data=df, x='acceleration', y='mpg', hue='origin')
plt.show()
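
A regression plot suggests a trend, and a correlation coefficient puts a number on it. As a hedged sketch, the `.corr()` method computes pairwise Pearson correlations between numeric columns. The `toy` values below are invented and only mimic the direction of the trends discussed above; run the same call on `df` to get real numbers.

```python
import pandas as pd

# Invented toy values, NOT taken from the mpg dataset.
toy = pd.DataFrame({
    "cylinders": [4, 4, 6, 6, 8, 8],
    "mpg": [32.0, 29.0, 22.0, 20.0, 15.0, 13.0],
    "acceleration": [17.0, 16.0, 15.5, 15.0, 12.0, 11.5],  # seconds to 60 mph
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()

# Correlation of cylinder count with the other two variables.
print(corr.loc["cylinders", ["mpg", "acceleration"]])
```

A value near -1 or +1 indicates a strong linear relationship, and a value near 0 indicates a weak one.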

In [None]:
# Select the data we want by using "logical indexing": keep only the rows of the DataFrame where the origin is "usa"

usa_data = df[df.origin=='usa']

print(usa_data)

sns.lmplot(data=usa_data, x='acceleration',y='mpg',hue='cylinders', col='cylinders')
plt.ylim((0,50))

plt.show()
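
Logical indexing works in two steps: the comparison produces a column of True/False values, and the DataFrame keeps the rows where that value is True. The sketch below spells this out on an invented four-row table (`toy` and its values are placeholders) and shows how two conditions might be combined.

```python
import pandas as pd

# Invented toy table with the same column names as the mpg dataset.
toy = pd.DataFrame({
    "origin": ["usa", "japan", "usa", "europe"],
    "cylinders": [8, 4, 4, 4],
    "mpg": [15.0, 31.0, 26.0, 27.0],
})

# Step 1: the comparison yields one boolean per row.
is_usa = toy["origin"] == "usa"
print(is_usa.tolist())  # [True, False, True, False]

# Step 2: indexing with the boolean Series keeps only the True rows.
usa_rows = toy[is_usa]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
usa_four_cyl = toy[is_usa & (toy["cylinders"] == 4)]
print(usa_four_cyl)
```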

Well doesn't that look dandy! Four- and six-cylinder cars show a negative correlation between acceleration and miles per gallon, but eight-cylinder cars show less of a correlation. I wonder if this trend holds across all the regions. Since the cylinder counts on offer differ from region to region, I want to plot each region separately.

In [None]:
sns.lmplot(data=df[df.origin=='japan'], x='acceleration',y='mpg',hue='cylinders', col='cylinders')
plt.ylim((0,50))

plt.show()

In [None]:
sns.lmplot(data=df[df.origin=='europe'], x='acceleration',y='mpg',hue='cylinders', col='cylinders')
plt.ylim((0,50))
plt.show()

One thing to keep in mind during exploratory analysis is that every additional split leaves fewer observations in each group, and the fewer observations a group has, the less confidence you can place in any trend it shows. Once some groups contain only a handful of observations, you've probably gone as deep into comparisons as you can without losing statistical power.

For instance, in this data set we have already divided the data across four variables (mpg, acceleration, cylinders, and origin). Keep it at four or fewer "filters".
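
One way to check whether you have sliced too far is to count the observations in each group before plotting. The sketch below uses an invented ten-row table, and the threshold `MIN_OBS` is an arbitrary assumption rather than a statistical rule.

```python
import pandas as pd

# Invented toy table; use the real DataFrame (df) in the notebook.
toy = pd.DataFrame({
    "origin": ["usa"] * 5 + ["japan"] * 3 + ["europe"] * 2,
    "cylinders": [8, 8, 8, 8, 4, 4, 4, 3, 4, 5],
})

# Hypothetical cutoff below which a group is considered too small to trust.
MIN_OBS = 3

# Number of rows for every (origin, cylinders) combination that occurs.
group_sizes = toy.groupby(["origin", "cylinders"]).size()
print(group_sizes)

# Groups with too few observations to support a trend line.
small = group_sizes[group_sizes < MIN_OBS]
print(small)
```

Any combination that lands in `small` deserves caution when reading its regression line.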

# What did we learn today?

You feel satisfied with your work in identifying the trends between fuel efficiency, acceleration, country of origin, and number of cylinders. You have also put together a compelling Jupyter notebook to record these observations for reproducibility.