
<p style="text-align: left">
  <img  src="../Media/Seaborn-Logo.jpg" width="200" alt="Seaborn Logo">
</p>

### Table of Contents <a class="anchor" id="DS104L5_toc"></a>

* [Table of Contents](#DS104L5_toc)
    * [Page 1 - Introduction](#DS104L5_page_1)
    * [Page 2 - A Look at Hybrid Car Data](#DS104L5_page_2)
    * [Page 3 - A Look at Flight Data from USDOT](#DS104L5_page_3)
    * [Page 4 - Barplots with Hybrid Car Data](#DS104L5_page_4)
    * [Page 5 - Insurance Analysis](#DS104L5_page_5)
    * [Page 6 - Video Games and Beauty in Swarmplots](#DS104L5_page_6)
    * [Page 7 - Penguins](#DS104L5_page_7)
    * [Page 8 - Other Built-in Datasets](#DS104L5_page_8)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Page 1 - Introduction <a class="anchor" id="DS104L5_page_1"></a>

[Back to Top](#DS104L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Visualization Examples and Importing Packages

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Beyond basic charts such as bar graphs, Seaborn offers rich color palettes and polished default styles.
For tutorials and more on Seaborn, visit the official Seaborn site __[here](https://seaborn.pydata.org/)__.

Import the necessary packages:

In [None]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Page 2 - A Look at Hybrid Car Data <a class="anchor" id="DS104L5_page_2"></a>

[Back to Top](#DS104L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

### Loading Hybrid Car data from the companion notebooks __[here](https://github.com/woz-u/DS-Student-Resources/blob/main/DS105-Intermediate-Statistics/DS105-L1-Python-Basic-Stats.ipynb)__.

In [None]:
hybrid = pd.read_excel("C:/Users/Marcy/Desktop/DATA/hybrid2013.xlsx")

### Taking a look at column names, and a quick glance at what the values look like

In [None]:
hybrid.head(10)

We see this dataset has a Car ID, vehicle name, year, MSRP, acceleration rate, MPG and car class info.

In [None]:
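import warnings
# Silence warnings so they don't clutter the notebook output
warnings.filterwarnings('ignore')
# This test warning is raised but, with the filter above, not displayed
warnings.warn('Do not show this message')
print("No Warning Shown")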

Quick line plot of the data

In [None]:
plt.figure(figsize = (16, 6))
plt.title("2013 Hybrid Vehicles")
sns.lineplot(data=hybrid)

Now we'll narrow the plot down to a single column, the vehicle names. First, here's the list of columns.

In [None]:
list(hybrid.columns)

In [None]:
sns.lineplot(data=hybrid['vehicle'], label="Vehicle")

#### That's not very pretty. Let's put the vehicle names on one axis and MPG on the other, make the figure larger for readability, then run a lineplot again.

In [None]:
x = hybrid['vehicle']
y = hybrid['mpg']

In [None]:
plt.figure(figsize=(16, 10))
# Pass x and y by keyword; seaborn 0.12 and later no longer accept them positionally
sns.lineplot(x=x, y=y, data=hybrid)

#### That still doesn't look right. Let's switch x and y, keep the larger size, and try again.

In [None]:
plt.figure(figsize=(16, 10))
# MPG now goes on the horizontal axis and the vehicle names on the vertical axis
sns.lineplot(x=y, y=x, data=hybrid)

#### That's better. Now let's remove the confidence interval from the graph and add markers for the data points.

In [None]:
plt.figure(figsize=(16, 10))
# ci=None hides the confidence band (seaborn 0.12 and later spell this errorbar=None)
sns.lineplot(x=y, y=x, ci=None, marker='o', data=hybrid)

#### Change the color of the lineplot.

In [None]:
plt.figure(figsize=(16, 10))
sns.lineplot(x=y, y=x, data=hybrid, color='purple', linewidth=2.5, ci=None)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Page 3 - A Look at Flight Data from USDOT <a class="anchor" id="DS104L5_page_3"></a>

[Back to Top](#DS104L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

#### Next, let's take a look at flight data collected by the US Department of Transportation. You can find the Kaggle link for it __[here](https://www.kaggle.com/code/alexisbcook/scatter-plots/data?select=flight_delays.csv)__. 

In [None]:
flights = pd.read_csv(r"C:\Users\Marcy\Desktop\DATA\flight_delays.csv", index_col = "Month")

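#### Note that if your file path contains backslashes and you don't want to flip or double each one, you can put a lowercase r in front of the first quotation mark. This makes it a raw string, so Python does not treat the backslashes as escape characters.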
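To illustrate the raw-string tip, the sketch below compares three spellings of the same path. The path is hypothetical and nothing is read from disk; it only shows what Python stores in each case.

```python
# Hypothetical relative path, used only for illustration; nothing is read from disk.
raw_path = r"DATA\new_flights.csv"        # raw string: backslash kept as typed
escaped_path = "DATA\\new_flights.csv"    # doubled backslash: same result
plain_path = "DATA\new_flights.csv"       # here "\n" is read as a newline character

# The two safe spellings produce identical strings.
print(raw_path == escaped_path)

# The plain string silently contains a newline instead of a backslash and an "n".
print("\n" in plain_path)
print(len(raw_path), len(plain_path))
```

Forward slashes are another option, as in the read_excel call used for the hybrid car data.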

In [None]:
flights.head()

### Column Names

In [None]:
list(flights.columns)

#### For a full list of definitions for the airline codes, see this __[link](https://www.bts.gov/topics/airlines-and-airports/airline-codes)__ at the Bureau of Transportation Statistics.

In [None]:
flights

#### Here's a bar chart looking at Delta flight information showing average arrival delays. 


In [None]:
sns.barplot(x=flights.index, y=flights['DL'])
plt.ylabel("Arrival Delay (in minutes)")

#### Take a look at Spirit Airlines, with code NK

In [None]:
sns.barplot(x=flights.index, y=flights['NK'])
plt.ylabel("Arrival Delay (in minutes)")

#### And American Airlines

#### Note that Month was selected as the index when importing the data. Using .index for the x axis selects the labels that index the rows. You can't use flights['Month'] - it will raise a KeyError, since Month became the index rather than a regular column when we imported the data. This is a small dataset.

In [None]:
sns.barplot(x=flights.index, y=flights['AA'])
plt.ylabel("Arrival Delay (in minutes)")
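The index behavior described above can be reproduced on a tiny table. This is a minimal sketch with made-up numbers; the name toy and its values are hypothetical and only mimic the shape of the flights data.

```python
import pandas as pd

# Hypothetical miniature of the flights table; the values are made up.
toy = pd.DataFrame({"Month": [1, 2, 3], "DL": [4.5, -1.0, 2.25]})
toy = toy.set_index("Month")   # comparable to index_col="Month" in read_csv

print(list(toy.columns))       # Month is no longer listed among the columns
print(list(toy.index))         # it now labels the rows instead

try:
    toy["Month"]
except KeyError:
    print("KeyError: Month is the index, not a column")

# reset_index() turns the index back into an ordinary column when needed.
restored = toy.reset_index()
print(list(restored.columns))
```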

## Heatmaps

In [None]:
plt.figure(figsize=(14,7))
# Add title
plt.title("Average Arrival Delay for Each Airline, by Month")
# Heatmap showing average arrival delay for each airline by month
sns.heatmap(data=flights, annot=True)

# Add label for horizontal axis
plt.xlabel("Airline")

#### Variation

In [None]:
# Heatmap showing average arrival delay for each airline by month
sns.heatmap(data=flights, annot=True)

This code has three main components:

sns.heatmap -> This tells the notebook that we want to create a heatmap.

data=flights -> This tells the notebook to use all of the entries in flights to create the heatmap.

annot=True -> This ensures that the values for each cell appear on the chart. (Leaving this out removes the numbers from each of the cells.)
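To see the effect of annot in isolation, here is a small sketch that draws the same made-up table twice, once without and once with annotations. The delay values are invented for illustration and are not taken from the flight data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up delays (in minutes) for two carrier codes over three months.
toy_delays = pd.DataFrame(
    {"AA": [7.0, 1.5, -2.0], "DL": [3.0, 0.5, -4.0]},
    index=pd.Index([1, 2, 3], name="Month"),
)

fig, (ax_plain, ax_annot) = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(data=toy_delays, ax=ax_plain)               # colors only
sns.heatmap(data=toy_delays, annot=True, ax=ax_annot)   # colors plus cell values

# Each annotated cell adds one text label to its axes.
print(len(ax_plain.texts), len(ax_annot.texts))
```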

### Analysis

With the default color map, darker cells mark smaller values, so the darker colors in the later months of the year indicate that the airlines are, on average, better at arriving on time.

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Page 4 - Barplots with Hybrid Car Data <a class="anchor" id="DS104L5_page_4"></a>

[Back to Top](#DS104L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

#### For a barplot, let's go back to the hybrid cars data we looked at earlier.

In [None]:
hybrid.head()

In [None]:
sns.barplot(x=hybrid['msrp'], y=hybrid['vehicle'])

#### That's not very readable - let's adjust the size. The first number inside the parentheses is the width; the second number is the height (both in inches).

In [None]:
plt.figure(figsize=(14, 7))
sns.barplot(x=hybrid['msrp'], y=hybrid['vehicle'])
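As a quick check of the width-then-height order, this sketch creates two empty figures and reads their sizes back. The second size, (7, 14), is a hypothetical value chosen only to show the opposite orientation.

```python
import matplotlib.pyplot as plt

wide = plt.figure(figsize=(14, 7))   # 14 inches wide, 7 inches tall
tall = plt.figure(figsize=(7, 14))   # hypothetical: 7 inches wide, 14 inches tall

# get_size_inches() reports the size as (width, height).
print(wide.get_size_inches())
print(tall.get_size_inches())

plt.close("all")  # discard the empty figures
```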

#### We can do better. 

In [None]:
plt.figure(figsize=(14, 12))
sns.barplot(x=hybrid['msrp'], y=hybrid['vehicle'])

### That's better. 

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Page 5 - Insurance Analysis <a class="anchor" id="DS104L5_page_5"></a>

[Back to Top](#DS104L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

Now we'll take a look at __[insurance](https://www.kaggle.com/code/alexisbcook/scatter-plots/data?select=insurance.csv)__ data, which you can download to follow along.

In [None]:
insurance = pd.read_csv(r"C:\Users\Marcy\Desktop\DATA\insurance.csv")

In [None]:
insurance.head()

### Scatter plots
To create a simple scatter plot, we use the sns.scatterplot command and specify the values for:

* the horizontal x-axis (x=insurance['bmi']), and
* the vertical y-axis (y=insurance['charges']).

In [None]:
sns.scatterplot(x=insurance['bmi'], y=insurance['charges'])

#### The scatterplot above suggests that body mass index (BMI) and insurance charges are positively correlated: customers with higher BMI tend to pay more in insurance costs. (This pattern makes sense, since high BMI is typically associated with higher risk of chronic disease.)

To double-check the strength of this relationship, you might like to add a regression line, or the line that best fits the data. We do this by changing the command to sns.regplot.

In [None]:
sns.regplot(x=insurance['bmi'], y=insurance['charges'])
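A regression line shows the direction of a relationship; a correlation coefficient puts a single number on its strength. The sketch below uses made-up values standing in for the bmi and charges columns and relies on the pandas Series.corr method, which computes the Pearson correlation by default.

```python
import pandas as pd

# Made-up values standing in for the bmi and charges columns.
toy = pd.DataFrame(
    {"bmi": [20.0, 25.0, 30.0, 35.0, 40.0],
     "charges": [3000.0, 4200.0, 3900.0, 6100.0, 5800.0]}
)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear trend,
# and -1 is a perfect decreasing line.
r = toy["bmi"].corr(toy["charges"])
print(round(r, 2))
```

On the real data, the analogous call would be insurance['bmi'].corr(insurance['charges']).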

### Color-coded scatter plots
We can use scatter plots to display the relationships between not just two, but three variables. One way of doing this is by color coding the points.

For instance, to understand how smoking affects the relationship between BMI and insurance costs, we can color code the points by 'smoker', and plot the other two columns on the x and y axes.

In [None]:
sns.scatterplot(x=insurance['bmi'], y=insurance['charges'], hue=insurance['smoker'])

This scatter plot shows that while nonsmokers might pay slightly more with increasing body mass, smokers pay much more. 

### A trendline really highlights this difference. 

In [None]:
sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance)

### You've seen scatter plots - they're normally used for continuous (numerical) data. But a swarmplot can be used for categorical data, as seen here. 

In [None]:
sns.swarmplot(x=insurance['smoker'],
              y=insurance['charges'])

#### What does a swarmplot look like with different categorical data?

In [None]:
sns.swarmplot(x=insurance['smoker'],
              y=insurance['region'])

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Page 6 - Video Games and Beauty in Swarmplots <a class="anchor" id="DS104L5_page_6"></a>

[Back to Top](#DS104L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

#### Not that great. What about a different dataset?

In [None]:
Vgames = pd.read_csv(r"C:\Users\Marcy\Desktop\DATA\video_games_global.csv")

In [None]:
Vgames.head()

In [None]:
# sns.swarmplot(x=Vgames['Genre'],
#               y=Vgames['Publisher'])

# Commented out because there are too many distinct categories for it to run, so swarmplot isn't the best choice in this case.
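One way to anticipate this problem is to count the distinct values in each column before plotting. The sketch below does that on a few made-up rows; the column names echo the video game data, but the rows and the threshold of two categories are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the video game data.
toy = pd.DataFrame(
    {"Genre": ["Sports", "Racing", "Sports", "Puzzle", "Racing", "Action"],
     "Publisher": ["A", "B", "C", "D", "E", "F"],
     "Platform": ["X", "Y", "X", "Y", "X", "Y"]}
)

# nunique() counts the distinct values in each column.
distinct = toy.nunique()
print(distinct)

# Columns with only a couple of categories are the easiest to read in a swarmplot.
binary_like = distinct[distinct <= 2].index.tolist()
print(binary_like)
```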

####  The graphs don't turn out so well with just any kind of categorical data. I find it looks best when one of the categories is binary. For instance, a yes/no, or male/female  as one of the inputs. Let's try a beauty dataset, found __[here](https://www.kaggle.com/datasets/aungpyaeap/beauty)__.

In [None]:
beauty = pd.read_csv(r"C:\Users\Marcy\Desktop\DATA\beauty.csv")

In [None]:
beauty.head()

#### I'd like to recode the zeros and ones in at least the female column; I'm not a fan of those for delivering meaning in a graph label. 

In [None]:
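def gender(value):
    # Translate the 0/1 indicator into a readable label
    if value == 1:
        return "Female"
    elif value == 0:
        return "Male"

# apply() calls gender() once for each value in the column
beauty['FemaleR'] = beauty['female'].apply(gender)

beauty.head()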
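The same recoding can also be written as a dictionary lookup with the pandas Series.map method. This is an alternative sketch rather than the approach used above, shown on a made-up column so it runs without the beauty file.

```python
import pandas as pd

# Made-up stand-in for the female column of the beauty data.
toy = pd.DataFrame({"female": [1, 0, 0, 1]})

# Each value is looked up in the dictionary; values that are not
# present as keys would become NaN.
labels = {1: "Female", 0: "Male"}
toy["FemaleR"] = toy["female"].map(labels)

print(toy["FemaleR"].tolist())
```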

In [None]:
sns.swarmplot(x=beauty['FemaleR'],
              y=beauty['wage'])

#### Rename the axes so they make more sense

In [None]:
sns.swarmplot(x=beauty['FemaleR'],
              y=beauty['wage'])
plt.xlabel('Gender')
plt.ylabel('Income')

#### Now, taking a look at the warning message, it's suggesting to use stripplot. What's that? Let's find out. 

A strip plot can be drawn entirely on its own. It is also a nice complement to a boxplot or violinplot when you want to show every observation together with some representation of the underlying distribution. In effect, it is a scatter plot in which one of the variables is categorical.

The Tips dataset is one of the sample datasets that accompany the seaborn package, and it is used throughout the seaborn documentation. It can be loaded with seaborn's load_dataset function, according to __[askpython.com](https://www.askpython.com/python-modules/seaborn-stripplot-method#:~:text=%20Using%20the%20Seaborn%20stripplot%20%28%29%20method%20in,the%20data%20points.%20The%20width%20of...%20More%20)__. 

In [None]:
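import seaborn
import matplotlib.pyplot as plt
tips = seaborn.load_dataset("tips")
# Apply seaborn's default theme; the "seaborn" matplotlib style name is deprecated in newer matplotlib releases
seaborn.set_theme()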

In [None]:
plt.figure(figsize=(10,10))
seaborn.stripplot(x="sex", y="total_bill", data=tips)
plt.show()

#### We can make these points larger.

In [None]:
plt.figure(figsize=(10,10))
seaborn.stripplot(x="sex", y="total_bill", data=tips, linewidth=2, size=10)
plt.show()

#### A third variable can even be added, with a key/legend.

In [None]:
plt.figure(figsize=(10,10))
seaborn.stripplot(x="sex", y="total_bill", hue="day", data=tips, size=10)
plt.show()
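The earlier remark that a strip plot complements a boxplot can be sketched by drawing both on the same axes. The data here are made up so the example does not depend on the tips download, and the colors and sizes are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up measurements for two groups; not taken from any dataset used here.
toy = pd.DataFrame(
    {"group": ["a"] * 5 + ["b"] * 5,
     "value": [1.0, 2.0, 2.5, 3.0, 6.0, 2.0, 3.5, 4.0, 4.5, 5.0]}
)

fig, ax = plt.subplots(figsize=(6, 4))
# The boxplot summarizes each group's distribution...
sns.boxplot(data=toy, x="group", y="value", color="lightgray", ax=ax)
# ...and the strip plot draws every individual observation on top of it.
sns.stripplot(data=toy, x="group", y="value", color="black", size=5, ax=ax)
ax.set_title("Strip plot layered over a boxplot")
```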

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Page 7 - Penguins <a class="anchor" id="DS104L5_page_7"></a>

[Back to Top](#DS104L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

#### Seaborn has another built-in dataset, called penguins.

In [None]:
penguins = sns.load_dataset('penguins')

In [None]:
penguins.head()

#### Swarmplot for penguins

In [None]:
sns.swarmplot(x=penguins['sex'],
              y=penguins['body_mass_g'])

In [None]:
sns.swarmplot(x=penguins['sex'],
              y=penguins['flipper_length_mm'])

#### Here's another plot of the data: a displot of bill length, with one panel per island and species as the color.

In [None]:
penguinGraph = sns.displot(penguins, x="bill_length_mm", hue="species", col="island", col_wrap=2, height=3)
plt.show()

#### Same plot, looking at gender

In [None]:
penguinGraph = sns.displot(penguins, x="bill_length_mm", hue="sex", col="island", col_wrap=2, height=3)
plt.show()

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

# Page 8 - Other Built-in Datasets, including Pluto and the Planets <a class="anchor" id="DS104L5_page_8"></a>

[Back to Top](#DS104L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:darkorchid">

#### Curious what other built-in datasets Seaborn offers? Run this line of code to find out:

In [None]:
sns.get_dataset_names()

#### Load them in this manner:

In [None]:
planets = sns.load_dataset('planets')

In [None]:
planets.head()

#### I wonder how many planets this dataset has, and if they're still counting Pluto. 

In [None]:
planets.number.value_counts()
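Keep in mind that value_counts() does not count rows; it reports how many rows share each distinct value of a column. The sketch below shows the difference on made-up numbers, where the column name number is borrowed only to mirror the call above.

```python
import pandas as pd

# Made-up rows; the values are invented for illustration.
toy = pd.DataFrame({"number": [1, 1, 2, 1, 3, 2]})

# How often each distinct value occurs, most frequent first.
counts = toy["number"].value_counts()
print(counts)

# The number of rows comes from len() (or from .shape).
print(len(toy))
```

By the same logic, the row count of the planets table would come from len(planets).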

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x=planets['year'], y=planets['number'])

#### And finally, __[Kaggle](https://www.kaggle.com/)__ has many datasets. They're tailored for different kinds of analyses - some are better for machine learning, some for NLP or regression. Take a look around. You may find one you'll use for a final project.  