# Matplotlib Basic Plots

In [1]:
import numpy as np
import pandas as pd
# Recent xlrd releases read only .xls files, so use openpyxl for this .xlsx workbook
df = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',
                       sheet_name='Canada by Citizenship',
                       skiprows=range(20),
                       skipfooter=2,
                       engine='openpyxl')
df.head()

In [None]:
df.columns

In [None]:
df.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df.head()

In [None]:
df.rename(
    columns={
        'OdName': 'Country',
        'AreaName': 'Continent',
        'RegName': 'Region'
    }, inplace=True)
df.columns

Add a ‘Total’ column that holds the total number of immigrants who came to Canada from each country between 1980 and 2013:

In [None]:
# Sum only the numeric (year) columns; the text columns are skipped
df['Total'] = df.sum(axis=1, numeric_only=True)
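
As a hedged aside, summing across rows that mix text and numbers can behave differently between pandas versions, so it may be safer to select the year columns explicitly. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature of the immigration table:
# two text columns followed by three year columns
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'Continent': ['X', 'Y'],
    1980: [10, 1],
    1981: [20, 2],
    1982: [30, 3],
})

year_cols = [1980, 1981, 1982]
# Sum only the year columns for each row
toy['Total'] = toy[year_cols].sum(axis=1)
print(toy[['Country', 'Total']])
```

Selecting the columns by name makes it explicit which values go into the total.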

In [None]:
df.isnull().sum()

In [None]:
df = df.set_index('Country')

Let's do the plotting part now:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl

Choose a style for the plot. Available styles are as follows:

In [None]:
plt.style.available

In [None]:
# Selecting 'ggplot' style
mpl.style.use(['ggplot'])

### Line Plot

In [None]:
years = list(range(1980, 2014))

In [None]:
# Picking Switzerland as an example
df.loc['Switzerland', years]

In [None]:
df.loc['Switzerland', years].plot()
plt.title('Immigration from Switzerland')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')
plt.show()

Comparing three different countries:

In [None]:
ind_pak_ban = df.loc[['India', 'Pakistan', 'Bangladesh'], years]
ind_pak_ban.head()

In [None]:
ind_pak_ban.T

In [None]:
ind_pak_ban.T.plot();

### Pie Plot

To demonstrate the pie plot we will plot the total number of immigrants for each continent.

In [None]:
# Sum the numeric columns for each continent
cont = df.groupby('Continent').sum(numeric_only=True)
cont

In [None]:
cont['Total'].plot(kind='pie', figsize=(7,7),
                   autopct='%1.1f%%', shadow=True)
plt.title('Immigration by Continent')
plt.axis('equal')
plt.show()

This pie chart is understandable. But we can improve it with a little effort. This time I want to choose my own colors and a start angle.

In [None]:
colors = ['lightgreen', 'lightblue', 'pink', 'purple', 'grey', 'gold']
explode = [0.1, 0, 0, 0, 0.1, 0.1]
cont['Total'].plot(kind='pie', figsize=(17,10),
                  autopct = '%1.1f%%', startangle=90,
                  shadow=True, labels=None,
                  pctdistance=1.12, colors=colors, explode=explode)
plt.axis('equal')
plt.legend(labels=cont.index, loc='upper right', fontsize=14)
plt.show()

### Box plot

First, we will make a box plot of the number of immigrants from China.

In [None]:
china = df.loc[["China"], years].T

In [None]:
china.plot(kind='box', figsize=(8, 6))
plt.title('Box plot of Chinese Immigrants')
plt.ylabel('Number of Immigrants')
plt.show()

In [None]:
ind_pak_ban.T.plot(kind='box', figsize=(8, 7))
plt.title('Box plots of Indian, Pakistani and Bangladeshi Immigrants')
plt.ylabel('Number of Immigrants')
plt.show()

### Scatter Plot

For this exercise, I will make a new DataFrame that will contain the years as an index and the total number of immigrants each year.

In [None]:
totalPerYear = pd.DataFrame(df[years].sum(axis=0))
totalPerYear.head()

We need to convert the years to integers. I want to polish the DataFrame a bit just to make it presentable.

In [None]:
totalPerYear.index = list(map(int, totalPerYear.index))
totalPerYear.reset_index(inplace=True)
totalPerYear.rename(columns={
        'index': 'year',
        0: 'total'
    }, inplace=True)
totalPerYear.head()

In [None]:
totalPerYear.plot(kind='scatter', x = 'year', y='total', figsize=(10, 6), color='darkred')
plt.title('Total Immigration from 1980 - 2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.show()

### Area Plot

The area plot shows the area covered under a line plot. For this plot, I want to make a DataFrame including the information of India, China, Pakistan, and France.

In [None]:
top = df.loc[['India', 'China', 'Pakistan', 'France'], years]
top = top.T
top.head()

In [None]:
colors = ['black', 'green', 'blue', 'red']
top.plot(kind='area', stacked=False,
        figsize=(20, 10), color=colors)
plt.title('Immigration trend from India, China, Pakistan, and France')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
plt.show()

Remember to set the ‘stacked’ parameter to False, as above, if you want to see each country’s individual area plot. By default the area plot is stacked.
When it is stacked, it does not show each variable’s individual area, because each area is stacked on top of the previous one.
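
To make the difference concrete, here is a minimal sketch with made-up numbers (the `demo` frame is hypothetical, not the immigration data). For positive values, the upper edge of each band in a stacked area plot is the running total across the columns, which `cumsum(axis=1)` computes:

```python
import pandas as pd

# Made-up values for two series over three years
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]},
                    index=[2000, 2001, 2002])

# Unstacked: each area is drawn from zero up to the column's own values
unstacked_tops = demo

# Stacked: each area sits on top of the previous one, so the upper
# edges are the cumulative sums across the columns
stacked_tops = demo.cumsum(axis=1)
print(stacked_tops)
```

In the stacked version, the height of the top band reflects the total of all series rather than the last series alone.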


In [None]:
colors = ['black', 'green', 'blue', 'red']
top.plot(kind='area', stacked=True,
        figsize=(20, 10), color=colors)
plt.title('Immigration trend from India, China, Pakistan, and France')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
plt.show()

### Histogram

The histogram shows the distribution of a variable. Here is an example:



In [None]:
df[2005].plot(kind='hist', figsize=(8,5))
plt.title('Histogram of Immigration from 195 Countries in 2005') # add a title to the histogram
plt.ylabel('Number of Countries') # add y-label
plt.xlabel('Number of Immigrants') # add x-label
plt.show()

Let’s use the ‘top’ DataFrame from the area plot example and plot each country’s distribution of the number of immigrants in the same plot.

In [None]:
top.plot.hist()
plt.title('Histogram of Immigration from Some Populous Countries')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')
plt.show()

In this plot, we do not see the bin edges clearly. Let’s improve this plot.

###### Specify the number of bins and find out the bin edges

I will use 15 bins. I am introducing a new parameter here called ‘alpha’. The alpha value determines the transparency of the colors. For these types of overlapping plots, transparency is important to see the shape of each distribution.

In [None]:
count, bin_edges = np.histogram(top, 15)
top.plot(kind = 'hist', figsize=(14, 6), bins=15, alpha=0.6, 
        xticks=bin_edges, color=colors);
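
As a small sketch of what `np.histogram` returns, the example below uses made-up values (the `values` array is hypothetical). With 15 bins, the function returns 15 counts and 16 bin edges that span the minimum to the maximum of the data:

```python
import numpy as np

# Made-up data ranging from 0 to 30
values = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 30])

# Ask for 15 equal-width bins
count, bin_edges = np.histogram(values, 15)

# 15 bins are delimited by 16 edges
print(len(count), len(bin_edges))
# The first and last edges are the data minimum and maximum
print(bin_edges[0], bin_edges[-1])
```

Passing these edges as `xticks` places the tick marks at the bin boundaries.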

Like the area plot, you can make a stacked plot of the histogram as well.

In [None]:
top.plot(kind='hist',
          figsize=(12, 6), 
          bins=15,
          xticks=bin_edges,
          color=colors,
          stacked=True,
         )
plt.title('Histogram of Immigration from Some Populous Countries')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')
plt.show()

### Bar Plot

For the bar plot, I will use the number of immigrants from France per year.

In [None]:
france = df.loc['France', years]
france.plot(kind='bar', figsize = (10, 6))
plt.xlabel('Year') 
plt.ylabel('Number of immigrants') 
plt.title('Immigrants From France')
plt.show()

You can add extra information to the bar plot. This plot shows an increasing trend from 1997 onward that lasts for over a decade, which is worth pointing out. This can be done with the annotate function.

In [None]:
france.plot(kind='bar', figsize = (10, 6))
plt.xlabel('Year') 
plt.ylabel('Number of immigrants') 
plt.title('Immigrants From France')
plt.annotate('Increasing Trend',
            xy = (19, 4500),
            rotation= 23,
            va = 'bottom',
            ha = 'left')
plt.annotate('',
            xy=(29, 5500),
            xytext=(17, 3800),
            xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='black', lw=1.5))
plt.show()

Sometimes, showing the bars horizontally makes it more understandable. Showing a label on the bars can be even better. Let’s do it.

In [None]:
france.plot(kind='barh', figsize=(12, 16), color='steelblue')
plt.xlabel('Number of immigrants') # add x-label to the plot (bars are horizontal)
plt.ylabel('Year') # add y-label to the plot
plt.title('Immigrants From France') # add title to the plot
for index, value in enumerate(france):
    label = format(int(value), ',')
    plt.annotate(label, xy=(value-300, index-0.15), color='white')
    
plt.show()
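
The bar labels above rely on Python's built-in `format` function with the `','` format specifier, which inserts thousands separators. A quick sketch with made-up values (the `values` list is hypothetical):

```python
# Made-up counts to format as bar labels
values = [950, 4500, 1234567]

# The ',' format specifier groups digits in thousands
labels = [format(int(v), ',') for v in values]
print(labels)
```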

# Advanced Matplotlib and Seaborn Plots

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

d = pd.read_csv("data/USA_cars_datasets.csv")
d.head()

### Diverging bars with texts

This plot will show the diverging bars and the value of each bar. We will plot the mean price for each brand. First, find the mean price for each brand using the pandas groupby function:

In [None]:
import numpy as np
# Use the string name 'mean'; passing np.mean to agg is deprecated in recent pandas
d1 = d.groupby('brand')['price'].agg(['mean'])
d1.columns = ['mean_price']
d1.head()

The data frame d1 contains the mean price for each brand. A diverging plot requires normalized values, so we will normalize the mean price (subtract the mean and divide by the standard deviation) and put it in a new column named ‘price_z’ in the d1 data frame:

In [None]:
x = d1.loc[:, ['mean_price']]
d1['price_z'] = (x - x.mean()) / x.std()
d1.head()
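
As a sanity check on the normalization, here is a minimal sketch with made-up prices (the `prices` Series is hypothetical, not taken from the car dataset). After subtracting the mean and dividing by the standard deviation, values below the average become negative and values above it become positive, which is what makes the bars diverge around zero:

```python
import pandas as pd

# Hypothetical mean prices for three brands
prices = pd.Series([10000.0, 20000.0, 30000.0], index=['a', 'b', 'c'])

# z-score: distance from the mean in units of the (sample) standard deviation
z = (prices - prices.mean()) / prices.std()
print(z)
```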

In [None]:
d1.sort_values('price_z', axis=0, ascending=True, inplace=True)

To draw the text plot, we need x and y values as usual, plus one extra parameter: the text to be displayed at each point.

In [None]:
plt.figure(figsize=(14, 18), dpi=80)

for x, y, tex in zip(d1.price_z, d1.index, d1.price_z):
    t = plt.text(x, y, round(tex, 2), 
                 horizontalalignment='right' if x < 0 else 'left', 
                 verticalalignment='center', 
                 fontdict={'color': 'red' if x < 0 else 'darkblue', 'size': 14})

    plt.hlines(y, xmin=0, xmax=tex, color='red' if tex < 0 else 'darkblue')
    
plt.yticks(d1.index, fontsize=12)
plt.title("Diverging text bars of car price by brand", fontdict={"size": 20})
plt.grid(linestyle = '--', alpha=0.5)
plt.show()

### Improved Bar Plot

In [None]:
d2 = d1.copy()

plt.figure(figsize=(20, 10))

plt.bar(d2.index, d2['mean_price'], width=0.3)
for i, val in enumerate(d2['mean_price'].values):
    plt.text(i, val, round(float(val)), horizontalalignment='center', 
             verticalalignment='bottom', fontdict={'fontweight':500, 'size': 10})
    
plt.xticks(rotation=60, fontsize=12)  # rotate the brand labels on the x-axis
plt.title("Mean Price for Each Brand", fontsize=22)
plt.ylabel("Mean Price", fontsize=16)
plt.show()

Another Method:

In [None]:
fig, ax = plt.subplots(figsize=(28, 10))
ax.vlines(x=d1.index, ymin=0, ymax=d1.mean_price, color= 'coral', alpha=0.7, linewidth=2)
ax.scatter(x=d1.index, y=d1.mean_price, s = 75, color='firebrick', alpha = 0.7 )

ax.set_title("Bar Chart for Average Car Price by Brand")

ax.set_ylabel("Mean Car Price by Brand", fontsize=16)
ax.set_xticks(d1.index)
ax.set_xticklabels(d1.index.str.upper(), rotation=60, fontdict={'horizontalalignment': 'right', 'size':14})

for row in d1.itertuples():
    ax.text(row.Index, row.mean_price+700, s=round(row.mean_price), horizontalalignment = 'center', verticalalignment='bottom', fontsize=14)
plt.show()

###### Dealing with a big dataset

In [None]:
d = pd.read_csv('data/nhanes_2015_2016.csv')

In [None]:
d.columns

The column ‘DMDEDUC2’ shows the education level of the population and ‘RIDRETH1’ shows the ethnic origin of the population. Both are categorical variables. The next plot shows the count of each ethnic origin for each education level.

In [None]:
sns.catplot(x="RIDRETH1", col="DMDEDUC2", col_wrap=4,
               data=d[d.DMDEDUC2.notnull()],
               kind="count", height=3.5, aspect=.8,
               palette='tab20')
plt.show()

###### What if both variables are not categorical?

In that case, a segregated violin plot will be more appropriate. We will show how to use violin plots for different numbers of variables. First, let’s plot the distribution of age for each education level.

In [None]:
plt.figure(figsize=(12, 4))
a = sns.violinplot(x=d.DMDEDUC2, y=d.RIDAGEYR)

It shows the distribution of age for each education level. For example, in education level 1, we find more people above 60. In education level 5, you will find more people around 30.

It will be even more informative to see the distribution of age for males and females separately.

In [None]:
d['RIAGENDRx'] = d.RIAGENDR.replace({1: "Male", 2: "Female"})

plt.figure(figsize=(12, 4))
a = sns.violinplot(x=d.DMDEDUC2, y=d.RIDAGEYR, hue=d.RIAGENDRx, split=True)

You have the distribution of age for males and females of each education level.

Let’s add one more variable. What if I want the same information as the previous plot, but for each ethnic group?

In [None]:
sns.catplot(x='RIDAGEYR', y="DMDEDUC2", hue='RIAGENDR', col="RIDRETH1", split=True,
           data=d[d.DMDEDUC2.notnull()], col_wrap=3,
           orient="h", height=5, aspect=1, palette='tab10',
           kind='violin', dodge=True, cut=0, bw=.2);