This document will cover sections 6 - 9 of the "Plotting and Programming in Python" workshop material. 

Here we will learn about libraries, reading tabular data into dataframes with Pandas, and plotting.

Access the data by downloading the gapminder dataset from
  * swcarpentry.github.io/python-novice-gapminder/setup.html
  * Place on your desktop for easy access
  * Path: ~/Desktop/data/

### An Overview of Python Libraries

* A library is a collection of files (modules) that contain functions for use by other programs
  * Generally related functions
  * May contain other things like data values
  * Also referred to as packages
* Python contains an extensive standard library
* Additional libraries can be found in the Python Package Index (PyPI)


* To use a library, it must be imported into Python
  * `import numpy`
    * `numpy.array([1, 2, 3])`
* Make a nickname for the imported package
  * `import numpy as np`
    * `np.array([1, 2, 3])`
  

In [1]:
# `pi` is a constant that holds the value of the number pi
# If you type `pi` without importing it from the "math" library,
# you will get an error
pi

NameError: name 'pi' is not defined

In [2]:
# Import the "math" library to use the constant `pi`
import math

In [3]:
# Access a function or value from an imported library using dot (.) notation
# library_name.function
print(math.pi)

3.141592653589793


In [4]:
# Learn more about a library by using the function `help`
# help(math)

In [5]:
# Import specific items from a library instead of the entire library
# These no longer require that you reference the library and can be used by
# only their name
from math import pi, cos
print(pi)
print(cos(pi))
print("cos(pi) is", cos(pi))

3.141592653589793
-1.0
cos(pi) is -1.0


In [6]:
# Make a nickname for an imported library
import math as m
print(m.pi)

3.141592653589793
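
The three import styles shown above all reach the same objects. The short check below is only an illustrative sketch; the variable names `same_module` and `same_constant` are made up for this example.

```python
import math
import math as m
from math import pi, cos

# The alias `m` is just another name for the same module object.
same_module = m is math

# `pi` imported by name is the same float that `math.pi` refers to.
# It is a constant, so it is used without parentheses.
same_constant = pi is math.pi

print(same_module, same_constant)

# `cos` is a function, so it is called with parentheses.
print(cos(pi))
```

Whichever style you pick, only the name you type changes; the library code that runs is the same.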


## Exploring a new library

To exercise our minds and fingers, let's use the library `random` to print a random character from a string.

In [7]:
import random

# Write a string
string = 'STACEY'

# Check the number of characters in your string
print(len(string))

# Save a random index chosen from the valid positions in your string
random_index = random.randrange(len(string))
print(random_index)
                                
# Print the character at the randomly selected index
print(string[random_index])

6
5
Y


In [8]:
# Use `random.choice` to select a random character from your string
import random

print(random.choice(string))

E
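
The two approaches above can be compared side by side. This is a small sketch rather than part of the workshop material; the names `word`, `indexes`, and `letters` are chosen only for this example, and because the draws are random the printed values will differ from run to run.

```python
import random

word = 'STACEY'

# randrange(n) returns an integer from 0 up to, but not including, n,
# so every value it returns is a valid index into the string.
indexes = [random.randrange(len(word)) for _ in range(1000)]
print(min(indexes), max(indexes))

# random.choice does the indexing for you and returns the character itself.
letters = [random.choice(word) for _ in range(1000)]
print(sorted(set(letters)))
```

`random.choice` is the shorter option when you only need the character; `random.randrange` is useful when you also need to know the position.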


## Reading Tabular Data into Dataframes using Pandas

* Pandas is a Python library for statistics and tabular data
* Similar to dataframes in R
  * 2-dimensional
  * Columns have names of the observed variables
  * Rows contain the observation values
  * Able to contain different data types
* A dataframe is a collection of series
  * A dataframe is a table
  * A series is the data-structure Pandas uses to represent a column
* Pandas is built on the NumPy library
  * Many NumPy functions also work on dataframes/series in Pandas
* Common nickname/alias for Pandas is `pd`
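
To make the points above concrete, here is a minimal sketch that builds a tiny dataframe by hand instead of reading a file. The numbers are copied from the Oceania table printed later in this lesson; the variable names `df` and `column` are made up for this example.

```python
import numpy as np
import pandas as pd

# Two columns of the Oceania data, typed in by hand for illustration.
df = pd.DataFrame(
    {
        'gdpPercap_1952': [10039.59564, 10556.57566],
        'gdpPercap_2007': [34435.36744, 25185.00911],
    },
    index=['Australia', 'New Zealand'],
)

# Selecting one column of the table gives a series.
column = df['gdpPercap_2007']
print(type(column))

# NumPy functions accept a series or a whole dataframe.
print(np.max(column))
print(np.log10(df))
```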


In [9]:
# Import Pandas using the alias `pd`
import pandas as pd

# Use `pd.read_csv()` to read a comma separated values (csv) data table
data = pd.read_csv('~/Desktop/data/gapminder_gdp_oceania.csv')
print(data)

# Pandas ends a line with `\` when the output is too wide to fit the screen
# and continues the remaining columns below

       country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Australia     10039.59564     10949.64959     12217.22686   
1  New Zealand     10556.57566     12247.39532     13175.67800   

   gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
0     14526.12465     16788.62948     18334.19751     19477.00928   
1     14463.91893     16046.03728     16233.71770     17632.41040   

   gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
0     21888.88903     23424.76683     26997.93657     30687.75473   
1     19007.19129     18363.32494     21050.41377     23189.80135   

   gdpPercap_2007  
0     34435.36744  
1     25185.00911  


In [10]:
# Current row names are 0 and 1
# Change the row names to the values in the column `country`

data = pd.read_csv('~/Desktop/data/gapminder_gdp_oceania.csv', index_col = 'country')
print(data)

             gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
country                                                                       
Australia       10039.59564     10949.64959     12217.22686     14526.12465   
New Zealand     10556.57566     12247.39532     13175.67800     14463.91893   

             gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
country                                                                       
Australia       16788.62948     18334.19751     19477.00928     21888.88903   
New Zealand     16046.03728     16233.71770     17632.41040     19007.19129   

             gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                      
Australia       23424.76683     26997.93657     30687.75473     34435.36744  
New Zealand     18363.32494     21050.41377     23189.80135     25185.00911  
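
The effect of `index_col` can also be checked without the file on disk. The sketch below assumes a two-column excerpt of the Oceania data typed in as a string; `csv_text`, `default_index`, and `country_index` are names made up for this example.

```python
import io
import pandas as pd

# A small excerpt of the Oceania data held in memory, so the example
# does not depend on the file being in ~/Desktop/data/.
csv_text = """country,gdpPercap_1952,gdpPercap_2007
Australia,10039.59564,34435.36744
New Zealand,10556.57566,25185.00911
"""

# Without index_col, pandas numbers the rows and keeps `country` as a column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the `country` values become the row labels.
country_index = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(list(default_index.index), default_index.shape)
print(list(country_index.index), country_index.shape)
```

Note that the second dataframe has one fewer column, because `country` now labels the rows instead of holding data.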


In [11]:
# find out more information about the dataframe
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gdpPercap_1952  2 non-null      float64
 1   gdpPercap_1957  2 non-null      float64
 2   gdpPercap_1962  2 non-null      float64
 3   gdpPercap_1967  2 non-null      float64
 4   gdpPercap_1972  2 non-null      float64
 5   gdpPercap_1977  2 non-null      float64
 6   gdpPercap_1982  2 non-null      float64
 7   gdpPercap_1987  2 non-null      float64
 8   gdpPercap_1992  2 non-null      float64
 9   gdpPercap_1997  2 non-null      float64
 10  gdpPercap_2002  2 non-null      float64
 11  gdpPercap_2007  2 non-null      float64
dtypes: float64(12)
memory usage: 208.0+ bytes


In [12]:
# We can explore the data using member variables (attributes), which do not require `()`

# Get column names
print(data.columns)

# Transpose a dataframe
print(data.T)

Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')
country           Australia  New Zealand
gdpPercap_1952  10039.59564  10556.57566
gdpPercap_1957  10949.64959  12247.39532
gdpPercap_1962  12217.22686  13175.67800
gdpPercap_1967  14526.12465  14463.91893
gdpPercap_1972  16788.62948  16046.03728
gdpPercap_1977  18334.19751  16233.71770
gdpPercap_1982  19477.00928  17632.41040
gdpPercap_1987  21888.88903  19007.19129
gdpPercap_1992  23424.76683  18363.32494
gdpPercap_1997  26997.93657  21050.41377
gdpPercap_2002  30687.75473  23189.80135
gdpPercap_2007  34435.36744  25185.00911
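
As a quick illustration of the difference between member variables and methods, the sketch below uses a hand-typed two-column excerpt of the Oceania data. The names `df` and `flipped` are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'gdpPercap_1952': [10039.59564, 10556.57566],
        'gdpPercap_2007': [34435.36744, 25185.00911],
    },
    index=['Australia', 'New Zealand'],
)

# Member variables are read without parentheses.
print(df.shape)
print(list(df.columns))

# `.T` swaps the row labels and the column labels.
flipped = df.T
print(list(flipped.index))
print(list(flipped.columns))

# Methods such as `head` are called with parentheses.
print(df.head(n=1))
```

Transposing returns a new dataframe; `df` itself is unchanged unless you assign the result back to it.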


In [13]:
# Import new dataset from the same directory and 
# label the rows with the column `country`

americas = pd.read_csv('~/Desktop/data/gapminder_gdp_americas.csv', 
                       index_col = 'country')

# Print out the dimensions of the dataframe
# 25 rows, 13 columns
print(americas.shape)

(25, 13)


In [14]:
# Use `head` to print the top three rows of the dataframe
print(americas.head(n=3))

          continent  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                               
Argentina  Americas     5911.315053     6856.856212     7133.166023   
Bolivia    Americas     2677.326347     2127.686326     2180.972546   
Brazil     Americas     2108.944355     2487.365989     3336.585802   

           gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
country                                                                     
Argentina     8052.953021     9443.038526    10079.026740     8997.897412   
Bolivia       2586.886053     2980.331339     3548.097832     3156.510452   
Brazil        3429.864357     4985.711467     6660.118654     7030.835878   

           gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
country                                                                     
Argentina     9139.671389     9308.418710    10967.281950     8797.640716   
Bolivia       2753.691490  

In [15]:
# Use `tail` to print the bottom two rows of the dataframe
print(americas.tail(n=2))

          continent  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                               
Uruguay    Americas     5716.766744     6150.772969     5603.357717   
Venezuela  Americas     7689.799761     9802.466526     8422.974165   

           gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
country                                                                     
Uruguay       5444.619620     5703.408898     6504.339663     6920.223051   
Venezuela     9541.474188    10505.259660    13143.950950    11152.410110   

           gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
country                                                                     
Uruguay       7452.398969     8137.004775     9230.240708     7727.002004   
Venezuela     9883.584648    10733.926310    10165.495180     8605.047831   

           gdpPercap_2007  
country                    
Uruguay       10611.46299  
Venezuela    

In [16]:
# Get summary statistics for columns with numerical data
print(data.describe())

       gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
count        2.000000        2.000000        2.000000        2.000000   
mean     10298.085650    11598.522455    12696.452430    14495.021790   
std        365.560078      917.644806      677.727301       43.986086   
min      10039.595640    10949.649590    12217.226860    14463.918930   
25%      10168.840645    11274.086022    12456.839645    14479.470360   
50%      10298.085650    11598.522455    12696.452430    14495.021790   
75%      10427.330655    11922.958888    12936.065215    14510.573220   
max      10556.575660    12247.395320    13175.678000    14526.124650   

       gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
count         2.00000        2.000000        2.000000        2.000000   
mean      16417.33338    17283.957605    18554.709840    20448.040160   
std         525.09198     1485.263517     1304.328377     2037.668013   
min       16046.03728    16233.717700    17632.410

In [17]:
# Save the transposed data into a new variable
americas_flip = americas.T
print(americas_flip)

country           Argentina      Bolivia       Brazil       Canada  \
continent          Americas     Americas     Americas     Americas   
gdpPercap_1952  5911.315053  2677.326347  2108.944355  11367.16112   
gdpPercap_1957  6856.856212  2127.686326  2487.365989  12489.95006   
gdpPercap_1962  7133.166023  2180.972546  3336.585802  13462.48555   
gdpPercap_1967  8052.953021  2586.886053  3429.864357  16076.58803   
gdpPercap_1972  9443.038526  2980.331339  4985.711467  18970.57086   
gdpPercap_1977  10079.02674  3548.097832  6660.118654  22090.88306   
gdpPercap_1982  8997.897412  3156.510452  7030.835878  22898.79214   
gdpPercap_1987  9139.671389   2753.69149  7807.095818  26626.51503   
gdpPercap_1992   9308.41871  2961.699694  6950.283021  26342.88426   
gdpPercap_1997  10967.28195  3326.143191  7957.980824  28954.92589   
gdpPercap_2002  8797.640716   3413.26269  8131.212843  33328.96507   
gdpPercap_2007  12779.37964  3822.137084  9065.800825  36319.23501   

country            

In [18]:
# Print the last two rows of the dataframe and transpose for easier reading
# Note: this is not saved as an object, it is only printed
print(americas_flip.tail(n=2).T)

                    gdpPercap_2002 gdpPercap_2007
country                                          
Argentina              8797.640716    12779.37964
Bolivia                 3413.26269    3822.137084
Brazil                 8131.212843    9065.800825
Canada                 33328.96507    36319.23501
Chile                  10778.78385    13171.63885
Colombia               5755.259962    7006.580419
Costa Rica             7723.447195     9645.06142
Cuba                   6340.646683    8948.102923
Dominican Republic     4563.808154    6025.374752
Ecuador                5773.044512    6873.262326
El Salvador            5351.568666    5728.353514
Guatemala              4858.347495    5186.050003
Haiti                  1270.364932    1201.637154
Honduras                3099.72866    3548.330846
Jamaica                6994.774861    7320.880262
Mexico                 10742.44053    11977.57496
Nicaragua              2474.548819    2749.320965
Panama                 7356.031934    9809.185636


In [19]:
# Save the last two rows, transposed, as a new dataframe
americas2 = americas_flip.tail(n=2).T

In [20]:
# Save modified dataframe as a new file
americas2.to_csv('~/Desktop/data/americas2_processed.csv')

In [21]:
# Read the new data file into a new variable
my_df = pd.read_csv('~/Desktop/data/americas2_processed.csv')
print(my_df)

                country  gdpPercap_2002  gdpPercap_2007
0             Argentina     8797.640716    12779.379640
1               Bolivia     3413.262690     3822.137084
2                Brazil     8131.212843     9065.800825
3                Canada    33328.965070    36319.235010
4                 Chile    10778.783850    13171.638850
5              Colombia     5755.259962     7006.580419
6            Costa Rica     7723.447195     9645.061420
7                  Cuba     6340.646683     8948.102923
8    Dominican Republic     4563.808154     6025.374752
9               Ecuador     5773.044512     6873.262326
10          El Salvador     5351.568666     5728.353514
11            Guatemala     4858.347495     5186.050003
12                Haiti     1270.364932     1201.637154
13             Honduras     3099.728660     3548.330846
14              Jamaica     6994.774861     7320.880262
15               Mexico    10742.440530    11977.574960
16            Nicaragua     2474.548819     2749
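
In the output above, `country` comes back as an ordinary column and the rows are numbered, because `read_csv` was called without `index_col`. The sketch below shows the round trip both ways. It writes to a temporary directory, which is an assumption made so the example does not touch `~/Desktop/data/`; the file name `example_processed.csv` and the variable names are also made up for this example.

```python
import os
import tempfile
import pandas as pd

# Two rows copied from the americas2 output, with `country` as the index.
df = pd.DataFrame(
    {
        'gdpPercap_2002': [8797.640716, 3413.26269],
        'gdpPercap_2007': [12779.37964, 3822.137084],
    },
    index=pd.Index(['Argentina', 'Bolivia'], name='country'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_processed.csv')
    df.to_csv(path)  # the index is written as the first column

    # Without index_col, `country` is read back as a regular column.
    plain = pd.read_csv(path)

    # With index_col, `country` becomes the row labels again.
    restored = pd.read_csv(path, index_col='country')

print(list(plain.columns))
print(list(restored.columns))
print(list(restored.index))
```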

## Subsetting a Dataframe

* Square brackets
  * [] - results in a Pandas series
  * [[]] - results in a Pandas dataframe
* Slicing
  * [] - includes <b>start</b>, up to and excluding <b>end</b> index
    * results in a dataframe with only the rows specified
    * only row indices allowed
* loc
  * label-based, specify rows and columns based on row and column labels
    * [] - results are a Pandas series
    * [[]] - results are a dataframe
    * <b>Note</b>: contrary to usual Python slices, both the start and the stop label are included
* iloc
  * integer-based, like slicing with square brackets; specify rows and columns by integer position
    * [] - results are a Pandas series
    * [[]] - results are a dataframe
In [23]:
# Use `loc`, row name, and single brackets to return series
print(americas.loc["Cuba"])

continent            Americas
gdpPercap_1952     5586.53878
gdpPercap_1957    6092.174359
gdpPercap_1962     5180.75591
gdpPercap_1967    5690.268015
gdpPercap_1972    5305.445256
gdpPercap_1977    6380.494966
gdpPercap_1982    7316.918107
gdpPercap_1987    7532.924763
gdpPercap_1992    5592.843963
gdpPercap_1997    5431.990415
gdpPercap_2002    6340.646683
gdpPercap_2007    8948.102923
Name: Cuba, dtype: object


In [24]:
# Return a specific value by providing the specific column and row labels
print(americas.loc["Cuba", "gdpPercap_1952"])

5586.53878


In [27]:
americas.head(n=10)

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
Argentina,Americas,5911.315053,6856.856212,7133.166023,8052.953021,9443.038526,10079.02674,8997.897412,9139.671389,9308.41871,10967.28195,8797.640716,12779.37964
Bolivia,Americas,2677.326347,2127.686326,2180.972546,2586.886053,2980.331339,3548.097832,3156.510452,2753.69149,2961.699694,3326.143191,3413.26269,3822.137084
Brazil,Americas,2108.944355,2487.365989,3336.585802,3429.864357,4985.711467,6660.118654,7030.835878,7807.095818,6950.283021,7957.980824,8131.212843,9065.800825
Canada,Americas,11367.16112,12489.95006,13462.48555,16076.58803,18970.57086,22090.88306,22898.79214,26626.51503,26342.88426,28954.92589,33328.96507,36319.23501
Chile,Americas,3939.978789,4315.622723,4519.094331,5106.654313,5494.024437,4756.763836,5095.665738,5547.063754,7596.125964,10118.05318,10778.78385,13171.63885
Colombia,Americas,2144.115096,2323.805581,2492.351109,2678.729839,3264.660041,3815.80787,4397.575659,4903.2191,5444.648617,6117.361746,5755.259962,7006.580419
Costa Rica,Americas,2627.009471,2990.010802,3460.937025,4161.727834,5118.146939,5926.876967,5262.734751,5629.915318,6160.416317,6677.045314,7723.447195,9645.06142
Cuba,Americas,5586.53878,6092.174359,5180.75591,5690.268015,5305.445256,6380.494966,7316.918107,7532.924763,5592.843963,5431.990415,6340.646683,8948.102923
Dominican Republic,Americas,1397.717137,1544.402995,1662.137359,1653.723003,2189.874499,2681.9889,2861.092386,2899.842175,3044.214214,3614.101285,4563.808154,6025.374752
Ecuador,Americas,3522.110717,3780.546651,4086.114078,4579.074215,5280.99471,6679.62326,7213.791267,6481.776993,7103.702595,7429.455877,5773.044512,6873.262326


In [29]:
# Slice rows by label from 'Chile' through 'Cuba' (both included); `:` keeps all columns
print(americas.loc['Chile':'Cuba', :])

           continent  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                                
Chile       Americas     3939.978789     4315.622723     4519.094331   
Colombia    Americas     2144.115096     2323.805581     2492.351109   
Costa Rica  Americas     2627.009471     2990.010802     3460.937025   
Cuba        Americas     5586.538780     6092.174359     5180.755910   

            gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
country                                                                      
Chile          5106.654313     5494.024437     4756.763836     5095.665738   
Colombia       2678.729839     3264.660041     3815.807870     4397.575659   
Costa Rica     4161.727834     5118.146939     5926.876967     5262.734751   
Cuba           5690.268015     5305.445256     6380.494966     7316.918107   

            gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
country             

In [32]:
# Slice both rows and columns by label; the end labels are included
print(americas.loc['Chile':'Cuba', 'gdpPercap_1967':'gdpPercap_1977'])

            gdpPercap_1967  gdpPercap_1972  gdpPercap_1977
country                                                   
Chile          5106.654313     5494.024437     4756.763836
Colombia       2678.729839     3264.660041     3815.807870
Costa Rica     4161.727834     5118.146939     5926.876967
Cuba           5690.268015     5305.445256     6380.494966
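
The output above includes both `Cuba` and `gdpPercap_1977`, the end labels of the two slices. The sketch below contrasts that with `iloc`, whose slices stop before the end position. It rebuilds the four rows printed above by hand; `df`, `by_label`, and `by_position` are names made up for this example.

```python
import pandas as pd

# The four rows and three columns shown in the output above.
df = pd.DataFrame(
    {
        'gdpPercap_1967': [5106.654313, 2678.729839, 4161.727834, 5690.268015],
        'gdpPercap_1972': [5494.024437, 3264.660041, 5118.146939, 5305.445256],
        'gdpPercap_1977': [4756.763836, 3815.807870, 5926.876967, 6380.494966],
    },
    index=['Chile', 'Colombia', 'Costa Rica', 'Cuba'],
)

# loc slices by label and includes the end labels.
by_label = df.loc['Chile':'Costa Rica', 'gdpPercap_1967':'gdpPercap_1972']

# iloc slices by position and excludes the end positions,
# so 0:3 and 0:2 are needed to select the same block.
by_position = df.iloc[0:3, 0:2]

print(by_label.shape)
print(by_position.shape)
print(by_label.equals(by_position))
```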
