In [None]:
import pandas as pd
import numpy as np

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt


In [None]:
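# Matplotlib 3.6+ renamed the 'seaborn' style to 'seaborn-v0_8';
# on older versions, use 'seaborn' instead.
plt.style.use('seaborn-v0_8')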
%matplotlib inline

# Matplotlib Scatter Plots

In this section we are going to learn how to create scatter plots in Matplotlib with the **`axes.scatter()`** method.

<div class="alert alert-block alert-info">
<p>You _can_ actually use the `plot()` method we covered in the previous tutorial to generate scatter plots as well.</p>
<p>
But you should generally use the method we are covering in this tutorial unless you are having significant performance problems.
</p>
<p>The reason for this is that the `axes.scatter()` method provides more control over the visualization. That control comes at the cost of performance, but the difference should only matter for plots with a very large number of data points.</p>
</div> 
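
As a rough illustration of the note above, the sketch below draws the same points with both methods. The data is synthetic (random numbers generated just for this example) and the variable names are placeholders, not part of the course datasets.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, made up purely for this illustration
rng = np.random.default_rng(seed=0)
x = np.arange(50)
y = rng.normal(loc=50, scale=10, size=50)

figure, (plot_axes, scatter_axes) = plt.subplots(1, 2, figsize=(10, 4))

# plot() with a marker and no connecting line: every marker looks the same
plot_axes.plot(x, y, marker='o', linestyle='')
plot_axes.set_title('axes.plot()')

# scatter(): each marker can get its own color, here driven by its y value
scatter_axes.scatter(x, y, c=y, cmap='coolwarm')
scatter_axes.set_title('axes.scatter()')
```

The left panel is a single line object with uniform markers, while the right panel is a collection whose markers are colored individually.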

### Load Necessary Data Set(s)

In [None]:
seattle_weather_2015_2016 = pd.read_csv(
    './data/seattle_weather_2015_2016.csv')
seattle_weather_2015_2016.head()

## Scatter Plots
Let's start exploring Matplotlib's **`axes.scatter()`** method's functionality by charting the low temperature records from our dataframe.

### A Simple Scatter Plot

In [None]:
figure, axes = plt.subplots()


axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['low_temp'])

#### Adjusting Marker Color(s)
With scatter plots, you can assign a single color to each plot, just like you can with line plots. The only difference here is that you use the **`c`** parameter rather than **`color`**.

In [None]:
figure, axes = plt.subplots()

# We'll plot high and low temps with different colors.
axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['low_temp'], 
             c='blue')

axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['high_temp'], 
             c='orange')

<div class="alert alert-block alert-info">
<p>Technically, you can use the `color` argument, but if you do, it won't work with the next technique.</p>
</div> 

But you can also assign a unique color to each marker in your plot - which can lead to some very cool and powerful visualizations.

The key here is that you have to use two arguments in combination: 
* **`c`**: Previously, we passed this a single value, but now you pass it a list/array that contains one numeric element for each data point in your plot. The number of elements must exactly match the number of plotted points or you will get an error.
* **`cmap`**: A valid colormap name. Matplotlib automatically rescales the numeric values in **`c`** to floats between 0 and 1 (the smallest value maps to 0 and the largest to 1) and then picks the corresponding color out of the given colormap for each marker. There are many different colormaps available. You can see many of the available options in Matplotlib's <a href="https://matplotlib.org/users/colormaps.html" target="_blank">online documentation</a>.
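
To make the rescaling step more concrete, here is a small sketch of roughly what happens to the values in **`c`**. The three readings are hypothetical, and `Normalize` and `plt.get_cmap` are used here only to mimic the default behavior by hand; you don't need to call them yourself when using `axes.scatter()`.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Hypothetical temperature readings
values = np.array([30, 45, 60])

# Step 1: rescale the values so the minimum maps to 0 and the maximum to 1
norm = Normalize(vmin=values.min(), vmax=values.max())
normalized = norm(values)

# Step 2: look up an RGBA color for each rescaled value
cmap = plt.get_cmap('coolwarm')
colors = cmap(normalized)

print(normalized)  # the rescaled values, from 0 to 1
print(colors)      # one RGBA row per data point
```

The middle reading lands halfway along the colormap because it sits halfway between the smallest and largest values.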

Now let's provide a couple of examples:

In [None]:
figure, axes = plt.subplots()

# We'll color each low-temp marker by its own value.
# Pairing the low temp readings with the colormap means
# low temps get mapped to one end of the colormap and
# higher temps to the other.
axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['low_temp'], 
             c=seattle_weather_2015_2016['low_temp'], 
             cmap='coolwarm')

In [None]:
figure, axes = plt.subplots()

# We'll color each low-temp marker by that day's
# avg wind speed, so that low wind speeds get mapped
# to one end of the colormap and higher wind speeds
# to the other.
axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['low_temp'], 
             c=seattle_weather_2015_2016['avgwindspeed'], 
             cmap='coolwarm')

#### Adding a Colorbar

<div class="alert alert-block alert-warning">
<p>Adding a colorbar requires interacting with the `figure` object directly. This is the first time that we've done this.</p>
<p>Make sure when you are working on your assignments not to try adding colorbars to the `axes` object or you'll have all sorts of problems.</p>
</div> 

When you have a plot of one or more continuous variables where color signifies the values of a variable, a colorbar is a great addition to your plot.

To add a colorbar to your plot, you use the **`figure.colorbar()`** method.

The method requires a single argument: an image which has been "painted" onto an *`axes`* object.

It just so happens that when you call the *`axes.scatter`* method, it not only plots such an image, but it also returns a reference to that image.

Up until now, we've just ignored it, but now let's capture it so that we can pass it along to **`figure.colorbar`**:

In [None]:
figure, axes = plt.subplots()

scatter_image = axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['low_temp'], 
             c=seattle_weather_2015_2016['low_temp'], 
             cmap='coolwarm')

figure.colorbar(
    scatter_image,
    label='Temperature' # You can also add a label
)
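
If you are curious about what `axes.scatter` actually hands back, the sketch below inspects it. It uses made-up readings rather than the Seattle data, and the printed values are only meant for exploration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import PathCollection

# Synthetic readings, made up for this illustration
x = np.arange(20)
temps = np.linspace(30, 60, 20)

figure, axes = plt.subplots()
scatter_image = axes.scatter(x, temps, c=temps, cmap='coolwarm')

# scatter() returns a PathCollection, which remembers its data and colormap
print(type(scatter_image))

# The colorbar is drawn in a new axes that gets added to the figure
colorbar = figure.colorbar(scatter_image, label='Temperature')
print(len(figure.axes))  # the plot axes plus the colorbar axes
```

Because the colorbar lives in its own axes inside the figure, it makes sense that the method belongs to the `figure` object rather than to `axes`.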

In [None]:
figure, axes = plt.subplots()

scatter_image = axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['low_temp'], 
             c=seattle_weather_2015_2016['avgwindspeed'], 
             cmap='coolwarm')

figure.colorbar(
    scatter_image,
    label='Avg. Wind Speed' # You can also add a label
)

You can also add multiple colorbars:

In [None]:
# Plotting high & low temps with different colormaps
# along with separate colorbars
figure, axes = plt.subplots()

low_temps = axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['low_temp'], 
             c=seattle_weather_2015_2016['low_temp'], 
             cmap='PuBu')

high_temps = axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['high_temp'],
             c=seattle_weather_2015_2016['high_temp'], 
             cmap='YlOrRd')

figure.colorbar(low_temps, label="Low Temperature", orientation='horizontal')
figure.colorbar(high_temps, label="High Temperature", orientation='horizontal')

# Activity

### Scatter plot

* Plot a scatter plot for 'avgwindspeed' in the seattle dataset. 
    * Adjust the following parameters for color map based on windspeed
        * c
        * cmap
    * Show a colorbar indicating the mapping of the speed of wind to color. 

### Setting a different style for plots

In [None]:
# Select a style; it will affect all subsequent plots.
# Here we switch to a different style, 'fivethirtyeight'.
plt.style.use('fivethirtyeight')
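
If you would rather not change the style of every later plot, one option is sketched below: `plt.style.available` lists the style names your Matplotlib version ships with, and `plt.style.context` applies a style only inside a `with` block. The plotted numbers are placeholders.

```python
import matplotlib.pyplot as plt

# List the style names installed with your Matplotlib version
print(plt.style.available)

# Apply a style to a single figure only, leaving later plots untouched
with plt.style.context('fivethirtyeight'):
    figure, axes = plt.subplots()
    axes.plot([1, 2, 3], [2, 4, 8])
    axes.set_title('Styled with fivethirtyeight')
```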

# Bar Plots in Matplotlib

Simple line plots and histograms are very useful when you have quantitative (continuous) data. Bar plots are useful for plotting **categorical** data. 

In [None]:
flights = pd.read_csv("./data/flight_sample.csv")
flights.head()

Let us say you want to find how many flights in the dataset correspond to each airline. 

You can do this by calling `groupby()` and then `size()` to find the number of flights in each group.

In [None]:
flights_by_Airline = flights.groupby(["AIRLINE"])
num_flights_airline = flights_by_Airline.size()
num_flights_airline
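
To see what `groupby()` followed by `size()` produces without loading the full file, here is a minimal sketch on a tiny, made-up DataFrame (`flights_demo` is a hypothetical stand-in, not the course data).

```python
import pandas as pd

# A tiny, made-up stand-in for the flights data
flights_demo = pd.DataFrame(
    {'AIRLINE': ['AA', 'DL', 'AA', 'AS', 'DL', 'AA']})

# size() counts the rows that fall into each group
counts = flights_demo.groupby('AIRLINE').size()
print(counts)
```

The result is a Series indexed by airline code, with one row count per airline, which is exactly the shape a bar plot needs.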

In [None]:
figure, axes = plt.subplots()

axes.bar(range(len(num_flights_airline.index)), num_flights_airline)

# Place a tick at each bar position
axes.set_xticks(range(len(num_flights_airline.index)))

# Relabel the ticks from numbers to the airline codes
axes.set_xticklabels(num_flights_airline.index)

axes.set_xlabel("Airline")
axes.set_ylabel("Number of flights")
axes.set_title("The distribution of flights across airlines")

### Tight integration with pandas

You don't have to write that much code, though. pandas is tightly integrated with Matplotlib, so you can produce the same chart with much simpler code, as below:

In [None]:
figure, axes = plt.subplots()

num_flights_airline.plot(ax=axes, kind='bar', color='orange')

**Wow** That was easy! 

Plotting bar plots has a lot more features. Look [here](https://matplotlib.org/examples/api/barchart_demo.html) for more examples. 
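
The pandas shortcut does not lock you out of the usual Matplotlib methods. As a small sketch (using made-up counts in a hypothetical `demo_counts` Series), the axes you pass in is the same one pandas draws on, so you can keep customizing it afterwards.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up flight counts standing in for num_flights_airline
demo_counts = pd.Series({'AA': 3, 'AS': 1, 'DL': 2})

figure, axes = plt.subplots()

# pandas draws the bars on the axes we hand it and returns that same axes
returned_axes = demo_counts.plot(ax=axes, kind='bar', color='orange')

# The familiar axes methods still work after pandas has drawn the bars
axes.set_xlabel('Airline')
axes.set_ylabel('Number of flights')
axes.set_title('Demo: flights per airline')
```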

# Activity

* Group the flights by 'MONTH' and find the number of flights in each month. 

* Plot a *bar* plot with each month on the x-axis and the height of the bar indicating the number of flights
    * Make sure you rename the ticks to be 'Jan', 'Feb', 'Mar', 'Apr',..., 'Dec' using axes.set_xticklabels() method. 

## Bar plots with multiple bars

In [None]:
# Compute the median distance travelled by each airline in each month using a pivot table.
# The string 'median' is used because newer pandas versions warn when given np.median.

med_dist_month_airline = flights.pivot_table('DISTANCE', index='MONTH', columns='AIRLINE', aggfunc='median')

In [None]:
med_dist_month_airline
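
If the shape of a pivot table is hard to picture, the following sketch builds one from a handful of made-up rows (`flights_demo` and its numbers are hypothetical). Each cell holds the median `DISTANCE` for one month and airline combination.

```python
import pandas as pd

# A few made-up flights: month, airline code, and distance flown
flights_demo = pd.DataFrame({
    'MONTH':    [1, 1, 1, 1, 2, 2, 2],
    'AIRLINE':  ['AA', 'AA', 'AA', 'DL', 'AA', 'DL', 'DL'],
    'DISTANCE': [100, 300, 500, 200, 400, 600, 800],
})

# Rows become months, columns become airlines, cells hold the median distance
pivot_demo = flights_demo.pivot_table(
    'DISTANCE', index='MONTH', columns='AIRLINE', aggfunc='median')
print(pivot_demo)
```

For month 1, airline AA has distances 100, 300, and 500, so its cell holds the middle value, 300.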

In [None]:
# Selecting only three airlines to visualize
med_dist_only_3 = med_dist_month_airline[['AA','AS','DL']]

In [None]:
figure, axes = plt.subplots()

med_dist_only_3.plot(ax = axes, kind='bar')

In [None]:
figure, axes = plt.subplots()

med_dist_only_3.plot(ax = axes, kind='bar')

# Setting the ticks appropriately
axes.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sept','Oct','Nov','Dec'], rotation=45)
axes.set_title("Median Distance travelled in each month for AA, AS, DL airlines")
axes.set_ylabel("Median Distance")

# Activity

* Pivot table based on 'DAY_OF_WEEK' as index and 'AIRLINE' as column to find the **average taxi in (TAXI_IN)** time.  

* Select the three columns corresponding to Southwest (WN), JetBlue (B6), and Hawaiian (HA)

* Plot multiple bars for average taxi in time for the three airlines for each of the seven days. 
    * Make sure you rename the ticks to be 'Mon', 'Tue', 'Wed', 'Thu', 'Fri','Sat','Sun' using axes.set_xticklabels() method. 

In [None]:
# Pivot table


In [None]:
# Select the three airlines


In [None]:
# Visualize


# Making Interactive Plots

If you use **`%matplotlib notebook`** instead of `%matplotlib inline`, you can interact with your plots. This makes visualization even more **fun**!

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt

# Add this after your imports to configure Jupyter to 
# display your plots and INTERACT with them
%matplotlib notebook

In [None]:
# Plotting high & low temps with different colormaps
# along with separate colorbars
figure, axes = plt.subplots()

low_temps = axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['low_temp'], 
             c=seattle_weather_2015_2016['low_temp'], 
             cmap='PuBu')

high_temps = axes.scatter(seattle_weather_2015_2016.index, 
             seattle_weather_2015_2016['high_temp'],
             c=seattle_weather_2015_2016['high_temp'], 
             cmap='YlOrRd')

figure.colorbar(low_temps, label="Low Temperature")
figure.colorbar(high_temps, label = "High Temperature")

# Matplotlib Histograms

In this section, we are going to learn how to create histograms, which are a great way of summarizing data sets.

The basic idea of a histogram is to create "buckets" into which your data points fall and to display those buckets rather than individually displaying all the data points.

In [None]:
nd_football_roster = pd.read_csv('./data/nd-football-2021-roster.csv')
seattle_weather_2015_2016 = pd.read_csv(
    './data/seattle_weather_2015_2016.csv')

## One Dimensional Histograms
Simple, one dimensional histograms are created with the **`axes.hist()`** method.

### A Basic Histogram
To create a basic histogram, you simply have to pass a NumPy array or Pandas series object to it. 

By default, it will create a set of buckets from the values in your array/series along the x-axis and then display how many elements fall in each bucket via the y-axis.

In [None]:
# Create a histogram of ND Football Player Heights
# You can quickly see the most common height bucket
# on the team.
figure, axes = plt.subplots()
axes.hist(nd_football_roster['Height'])
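
To look at the buckets themselves, you can capture what `axes.hist()` returns. The sketch below uses a short list of made-up heights rather than the roster file, so the printed numbers are for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up heights in inches, standing in for the roster data
heights = np.array([68, 70, 70, 71, 72, 72, 72, 73, 74, 76, 78])

figure, axes = plt.subplots()

# hist() returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, bars = axes.hist(heights)

print(counts)     # how many heights fell into each bin
print(bin_edges)  # one more edge than there are bins
```

The counts always add up to the number of data points, since every value lands in exactly one bin.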

As always, you can adjust the title, label, and legend properties of the **`axes`** object.

In [None]:
# Adding title, labels, and legend
figure, axes = plt.subplots()
# The legend needs something to display, so give the histogram a label
axes.hist(nd_football_roster['Height'], label='Player Height')
axes.legend()
axes.set_title('ND Football Player Height Distribution')
axes.set_xlabel('Height (in)')
axes.set_ylabel('Count of Players')

### Customizing Histograms

#### Changing the Number of Bins
By default, Matplotlib splits your data into 10 equal-width bins, which is often a reasonable starting point. Nonetheless, you may want to increase/decrease the default to adjust the granularity of your plot.

You can do so with the `bins` parameter.

In [None]:
figure, axes = plt.subplots()

# Adjust the number of bins to 15
# The increased granularity will
# expose some heights that aren't represented
# on the team.
axes.hist(nd_football_roster['Height'], bins=15)

#### Changing the Range for Bins
We know that we can use the `axes.set_xlim` and `axes.set_ylim` methods to focus/zoom into one area of a plot.

The **`range`** parameter is somewhat similar functionally to this in that it allows you to specify the range of input values to plot.

The key difference is that when you use the **`range`** parameter, Matplotlib does all the binning/grouping within the range specified.

With `set_xlim`/`set_ylim` you would simply zoom into a smaller area of a histogram that was already computed from all of your data. The **`range`** parameter instead calculates the histogram from only the subset of your data that falls inside the range.

Let's demonstrate using our Seattle Weather dataset.

In [None]:
# Simply plot the precipitation records
# with default parameters.
figure, axes = plt.subplots()
axes.hist(seattle_weather_2015_2016['precipitation'])

Because there are so many records with no rain, the chart is skewed. In fact, the distortion is so high that we can't visually make out any records above 2 inches (though we know they must be there because of the bins Matplotlib created).

Let's focus our histogram between .5 and 2.5 inches:

In [None]:
# Plot the precipitation records with default
# parameters, then zoom the x-axis to 0.5-2.5.
figure, axes = plt.subplots()
axes.hist(seattle_weather_2015_2016['precipitation'])
axes.set_xlim(0.5,2.5)

As you can see above, it did zoom into the range we wanted between 0.5 and 2.5, but the "binning" is still too coarse. **This is why you should use the `range` parameter rather than `set_xlim`.**

In [None]:
# This will give us a plot of how many
# moderately, but not extremely, rainy days
# Seattle experienced in 2015-2016
figure, axes = plt.subplots()
axes.hist(
    seattle_weather_2015_2016['precipitation'], 
    range=(.5, 2.5), 
    bins=20)
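
The difference between the two approaches also shows up in the bin edges. Here is a side-by-side sketch on synthetic rainfall amounts (randomly generated for this example, not the Seattle data) that compares where the bins start and end in each case.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, mostly-dry rainfall amounts (inches), made up for this sketch
rng = np.random.default_rng(seed=1)
rain = rng.exponential(scale=0.3, size=500)

figure, (zoom_axes, range_axes) = plt.subplots(1, 2, figsize=(10, 4))

# set_xlim: bins are computed over ALL the data, then the view is cropped
_, zoom_edges, _ = zoom_axes.hist(rain)
zoom_axes.set_xlim(0.5, 2.5)

# range: only values inside 0.5-2.5 are binned, using 20 bins in that span
range_counts, range_edges, _ = range_axes.hist(
    rain, range=(0.5, 2.5), bins=20)

print(zoom_edges[0], zoom_edges[-1])    # spans the full data
print(range_edges[0], range_edges[-1])  # spans only the requested range
```

In the `range` version, values outside 0.5 to 2.5 are left out of the counts entirely, so all 20 bins are spent on the region of interest.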

## Activity

### Histograms

* Drop the rows with missing values

* Plot the histogram of the distance ('DISTANCE') travelled by all the flights in the dataset. 
    * Adjust the parameters
        * bins
        * range
        * color

In [None]:
flights = pd.read_csv("./data/flight_sample.csv")

In [None]:
# Drop the missing values first, since hist() doesn't know how to handle missing data


### Overlaying Multiple Histograms
Just like we are able to plot multiple lines on a single axes, we can plot multiple histograms. 

In [None]:
figure, axes = plt.subplots()
axes.hist(
    seattle_weather_2015_2016['high_temp'], 
    bins=20)
axes.hist(
    seattle_weather_2015_2016['low_temp'], 
    bins=20)

That is pretty cool, but there are a number of things that are not ideal with our plot:
1. The low temperatures are in red. 
2. The low temperatures are covering up the high temperatures.
3. There is no legend to clearly delineate which is which.

Let's combine what we've learned so far about Matplotlib to address these issues.

In [None]:
figure, axes = plt.subplots()
axes.hist(
    seattle_weather_2015_2016['high_temp'], 
    bins=20, 
    label='High Temps', # Use label to specify legend name
    color='orange', # Override the default color
    alpha=.75 # Adjust alpha to allow both datasets to appear
)
axes.hist(
    seattle_weather_2015_2016['low_temp'], 
    bins=20, 
    label='Low Temps', # Use label to specify legend name
    alpha=.75 # Adjust alpha to allow both datasets to appear
)

# Enable the legend
axes.legend()
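
One optional refinement, which goes beyond the example above: passing `bins=20` to each call computes the bin edges separately for each dataset, so the two sets of bars may not line up. A possible way around this, sketched here on synthetic temperatures (randomly generated, not the Seattle data), is to compute one shared set of edges with `np.histogram_bin_edges` and pass it to both calls.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic high and low temperatures, made up for this sketch
rng = np.random.default_rng(seed=2)
high = rng.normal(loc=62, scale=12, size=365)
low = rng.normal(loc=47, scale=9, size=365)

# Compute one set of 20 bins that spans both datasets
shared_edges = np.histogram_bin_edges(np.concatenate([high, low]), bins=20)

figure, axes = plt.subplots()
_, high_edges, _ = axes.hist(
    high, bins=shared_edges, label='High Temps', color='orange', alpha=.75)
_, low_edges, _ = axes.hist(
    low, bins=shared_edges, label='Low Temps', alpha=.75)
axes.legend()
```

With shared edges, bars from both histograms cover the same intervals, which makes their heights directly comparable.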

## Activity

### Two Histograms

* Create two DataFrames by extracting flight details of 'AS' (Alaska Airlines) and 'HA' (Hawaiian Airlines)
    * Use two separate masks for AS and HA airlines



* Plot the histograms of the distance ('DISTANCE') travelled by each of 'AS' and 'HA' airlines
    * Adjust the parameters for each histogram
        * bins
        * range
        * color
        * alpha (opacity)
        * label
    * Provide a legend in the upper right corner to indicate which airline is which

In [None]:
# Select rows based on the AIRLINE
