# Day One: Introducing Pandas

## Setup

For this workshop, I require **pandas version 2.0 or later** so that everyone gets the same results from the same code. You can check the version of Pandas installed in your current conda environment as follows:

In [28]:
import pandas as pd
print(pd.__version__) # 2.0 or above

2.2.2


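Adjust how Jupyter displays columns in DataFrames. Setting `display.max_columns` to `None` tells Pandas to show every column instead of truncating wide DataFrames.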

In [79]:
pd.set_option('display.max_columns', None)

## Datasets
* [CORGIS: Coffee Cupping Dataset](https://corgis-edu.github.io/corgis/csv/coffee/) - A *.csv* (comma-separated values) file of professionally rated coffee varieties.

## Pandas

The core functionality of Pandas is provided through the [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and [**Series**](https://pandas.pydata.org/docs/reference/series.html) objects. I would recommend referring to the documentation linked throughout.

## Reading CSV Data from a File
Today, we will review downloading and importing .csv files from local folders. You will do this by specifying an input **filepath** relative to the directory in which your Jupyter Notebook (**.ipynb**) file is stored.

### Filesystem Navigation in Python
The Python [os module](https://docs.python.org/3/library/os.html) can help you manage files and folders through Python commands. If you run into a *FileNotFoundError*, you might be looking in the wrong directory.

To get the current working directory we can use the **os.getcwd()** function.

In [2]:
import os
print(os.getcwd())

C:\Users\Erin\Desktop\DataWranglingWorkshop


To see the **files and folders** in your working directory, use the **os.listdir()** function.

In [16]:
os.listdir() # If you see .ipynb_checkpoints, that's a hidden folder created by Jupyter

['.ipynb_checkpoints',
 'data',
 'day-one.ipynb',
 'pandas-workshop-day-1-notes.ipynb']
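Before reading a file, it can help to confirm that the relative path points where you think it does. Here is a minimal sketch using `os.path`; the path `data/coffee.csv` is an assumption based on the `data` folder listed above, so adjust it to match your own folder layout.

```python
import os

# Assumed relative path: a "data" folder next to the notebook
csv_path = os.path.join("data", "coffee.csv")

# os.path.abspath shows the full location Python will actually look in
print(os.path.abspath(csv_path))

# os.path.exists returns True or False instead of raising an error
if os.path.exists(csv_path):
    print("Found", csv_path)
else:
    print("Could not find", csv_path, "starting from", os.getcwd())
```

Using `os.path.join` rather than typing slashes by hand keeps the path working on both Windows and macOS/Linux.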

## Exploring DataFrames

To read .csv files into Pandas DataFrames, we can use the **pd.read_csv()** function. This file doesn't have a unique identifier column that we could use as a row index, and it is separated by commas rather than tabs or spaces, so the **default** arguments are appropriate.

In [80]:
coffee = pd.read_csv("data/coffee.csv")
coffee.head(5) # Gives the first five rows of a DataFrame

Unnamed: 0,Location.Country,Location.Region,Location.Altitude.Min,Location.Altitude.Max,Location.Altitude.Average,Year,Data.Owner,Data.Type.Species,Data.Type.Variety,Data.Type.Processing method,Data.Production.Number of bags,Data.Production.Bag weight,Data.Scores.Aroma,Data.Scores.Flavor,Data.Scores.Aftertaste,Data.Scores.Acidity,Data.Scores.Body,Data.Scores.Balance,Data.Scores.Uniformity,Data.Scores.Sweetness,Data.Scores.Moisture,Data.Scores.Total,Data.Color
0,United States,kona,0,0,0,2010,kona pacific farmers cooperative,Arabica,,,25,45.3592,8.25,8.42,8.08,7.75,7.67,7.83,10.0,10.0,0.0,86.25,Unknown
1,Brazil,sul de minas - carmo de minas,12,12,12,2010,jacques pereira carneiro,Arabica,Yellow Bourbon,,300,60.0,8.17,7.92,7.92,7.75,8.33,8.0,10.0,10.0,0.08,86.17,Unknown
2,Brazil,sul de minas - carmo de minas,12,12,12,2010,jacques pereira carneiro,Arabica,Yellow Bourbon,,300,60.0,8.42,7.92,8.0,7.75,7.92,8.0,10.0,10.0,0.01,86.17,Unknown
3,Ethiopia,sidamo,0,0,0,2010,ethiopia commodity exchange,Arabica,,,360,6.0,7.67,8.0,7.83,8.0,7.92,7.83,10.0,10.0,0.0,85.08,Unknown
4,Ethiopia,sidamo,0,0,0,2010,ethiopia commodity exchange,Arabica,,,300,6.0,7.58,7.83,7.58,8.0,7.83,7.5,10.0,10.0,0.1,83.83,Unknown


`coffee` is a DataFrame object.

* If we look at the **.shape** attribute (a tuple in `(row, column)` form) we can see it has 989 rows and 23 columns.
* We can look at the row names in the **.index** attribute and the column names in the **.columns** attribute--we'll learn how to change these later!

In [5]:
print(type(coffee)) # Print data type
print(coffee.shape)

<class 'pandas.core.frame.DataFrame'>
(989, 23)


If we do not specify an index column using the *index_col=* parameter, the row names will default to a **RangeIndex** that starts at 0 and ends at `num_rows - 1`. The `stop` value shown below is exclusive, so `stop=989` means the last row label is 988.

In [6]:
print(coffee.index)

RangeIndex(start=0, stop=989, step=1)
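For comparison, here is a small sketch of what happens when the defaults are *not* appropriate. The data below is made up for illustration (it is not the coffee file): it is tab-separated and has an `id` column that could serve as a row index. I read it from a string with `io.StringIO` so the example runs without any file on disk.

```python
import io
import pandas as pd

# Made-up tab-separated data with an "id" column (illustration only)
raw = "id\tcountry\tscore\nc1\tBrazil\t86.17\nc2\tEthiopia\t85.08\n"

# The default separator is a comma, so each line lands in a single column
default_read = pd.read_csv(io.StringIO(raw))
print(default_read.shape)

# sep="\t" splits on tabs; index_col="id" uses that column as the row labels
toy = pd.read_csv(io.StringIO(raw), sep="\t", index_col="id")
print(toy.shape)
print(toy.index)
```

If your DataFrame comes back with a single strange column, the separator is the first thing I would check.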


In [7]:
print(coffee.columns)

Index(['Location.Country', 'Location.Region', 'Location.Altitude.Min',
       'Location.Altitude.Max', 'Location.Altitude.Average', 'Year',
       'Data.Owner', 'Data.Type.Species', 'Data.Type.Variety',
       'Data.Type.Processing method', 'Data.Production.Number of bags',
       'Data.Production.Bag weight', 'Data.Scores.Aroma', 'Data.Scores.Flavor',
       'Data.Scores.Aftertaste', 'Data.Scores.Acidity', 'Data.Scores.Body',
       'Data.Scores.Balance', 'Data.Scores.Uniformity',
       'Data.Scores.Sweetness', 'Data.Scores.Moisture', 'Data.Scores.Total',
       'Data.Color'],
      dtype='object')


Every **column** in a **DataFrame** is a **Series**, and every **row** can also be pulled out as a **Series**. However, each column has its own assigned **Dtype**, while a row mixes values from several columns and so has no single natural Dtype. This makes sense because rows typically represent *observations* while columns represent *variables*.

You can see the Dtypes for each column in your DataFrame using **.dtypes**. (**object** is the default Dtype for categorical columns stored as text.)

In [41]:
coffee.dtypes

Location.Country                   object
Location.Region                    object
Location.Altitude.Min               int64
Location.Altitude.Max               int64
Location.Altitude.Average           int64
Year                                int64
Data.Owner                         object
Data.Type.Species                  object
Data.Type.Variety                  object
Data.Type.Processing method        object
Data.Production.Number of bags      int64
Data.Production.Bag weight        float64
Data.Scores.Aroma                 float64
Data.Scores.Flavor                float64
Data.Scores.Aftertaste            float64
Data.Scores.Acidity               float64
Data.Scores.Body                  float64
Data.Scores.Balance               float64
Data.Scores.Uniformity            float64
Data.Scores.Sweetness             float64
Data.Scores.Moisture              float64
Data.Scores.Total                 float64
Data.Color                         object
dtype: object
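To make the column-versus-row distinction concrete, here is a minimal sketch on a made-up two-column DataFrame (the names `toy` and `row` are just for illustration). The numeric column keeps its numeric Dtype, while a row that mixes text and numbers falls back to the generic `object` Dtype.

```python
import pandas as pd

# Toy data for illustration, not the coffee dataset
toy = pd.DataFrame({"country": ["Brazil", "Ethiopia"], "score": [86.17, 85.08]})

print(toy["score"].dtype)  # the column keeps its numeric Dtype

row = toy.loc[0]           # one row, returned as a Series
print(row.dtype)           # text and numbers together become object
```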

We can get descriptive statistics for **numeric** Series or for several numeric columns in a DataFrame at once using **describe()**.

In [23]:
coffee["Data.Scores.Aroma"].describe() # On a Series

count    989.000000
mean       7.572831
std        0.396796
min        0.000000
25%        7.420000
50%        7.580000
75%        7.750000
max        8.750000
Name: Data.Scores.Aroma, dtype: float64

In [24]:
coffee.describe() # On all numeric columns in a DataFrame

Unnamed: 0,Location.Altitude.Min,Location.Altitude.Max,Location.Altitude.Average,Year,Data.Production.Number of bags,Data.Production.Bag weight,Data.Scores.Aroma,Data.Scores.Flavor,Data.Scores.Aftertaste,Data.Scores.Acidity,Data.Scores.Body,Data.Scores.Balance,Data.Scores.Uniformity,Data.Scores.Sweetness,Data.Scores.Moisture,Data.Scores.Total
count,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0,989.0
mean,1640.076845,1675.929221,1657.998989,2013.549039,151.761375,210.491937,7.572831,7.51541,7.387472,7.539697,7.506309,7.500344,9.823893,9.830313,0.093903,81.972133
std,9192.519762,9191.957731,9192.058989,1.658883,125.66549,1666.707294,0.396796,0.420677,0.425284,0.39937,0.391481,0.425055,0.593888,0.691316,0.044666,3.859562
min,0.0,0.0,0.0,2010.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,905.0,950.0,950.0,2012.0,15.0,1.0,7.42,7.33,7.25,7.33,7.33,7.33,10.0,10.0,0.1,81.08
50%,1300.0,1310.0,1300.0,2013.0,170.0,60.0,7.58,7.5,7.42,7.58,7.5,7.5,10.0,10.0,0.11,82.5
75%,1550.0,1600.0,1600.0,2015.0,275.0,69.0,7.75,7.75,7.58,7.75,7.67,7.75,10.0,10.0,0.12,83.58
max,190164.0,190164.0,190164.0,2018.0,600.0,19200.0,8.75,8.83,8.67,8.75,8.5,8.58,10.0,10.0,0.28,90.58
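**describe()** also works on a text column, but it reports different statistics: the count, the number of unique values, the most common value, and how often it appears. A minimal sketch on made-up data (the `species` values are for illustration only):

```python
import pandas as pd

# Toy text column for illustration
species = pd.Series(["Arabica", "Arabica", "Robusta"], name="species")

summary = species.describe()
print(summary)  # count, unique, top, freq instead of mean/std/quartiles
```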


## Indexing

We can access values in a DataFrame using [one of two selection methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html):
* .loc[*row*, *column*] to get values by index and column labels (slices are inclusive on both ends)
* .iloc[*row*, *column*] to get values by integer position (slices are exclusive on the right)
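The inclusive/exclusive difference is easy to miss, so here is a minimal sketch on a made-up DataFrame with the default index. The same slice `0:2` returns a different number of rows depending on which method you use.

```python
import pandas as pd

# Toy DataFrame with the default RangeIndex (labels 0, 1, 2, 3)
toy = pd.DataFrame({"x": [10, 20, 30, 40]})

by_label = toy.loc[0:2]      # labels 0, 1 AND 2
by_position = toy.iloc[0:2]  # positions 0 and 1 only

print(len(by_label))
print(len(by_position))
```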

### Selecting from a DataFrame with the default (numeric) index
*Q: How would you select the second item in the second row using loc? How would you select it with iloc?*

In [17]:
coffee.head(3)

Unnamed: 0,Location.Country,Location.Region,Location.Altitude.Min,Location.Altitude.Max,Location.Altitude.Average,Year,Data.Owner,Data.Type.Species,Data.Type.Variety,Data.Type.Processing method,...,Data.Scores.Flavor,Data.Scores.Aftertaste,Data.Scores.Acidity,Data.Scores.Body,Data.Scores.Balance,Data.Scores.Uniformity,Data.Scores.Sweetness,Data.Scores.Moisture,Data.Scores.Total,Data.Color
0,United States,kona,0,0,0,2010,kona pacific farmers cooperative,Arabica,,,...,8.42,8.08,7.75,7.67,7.83,10.0,10.0,0.0,86.25,Unknown
1,Brazil,sul de minas - carmo de minas,12,12,12,2010,jacques pereira carneiro,Arabica,Yellow Bourbon,,...,7.92,7.92,7.75,8.33,8.0,10.0,10.0,0.08,86.17,Unknown
2,Brazil,sul de minas - carmo de minas,12,12,12,2010,jacques pereira carneiro,Arabica,Yellow Bourbon,,...,7.92,8.0,7.75,7.92,8.0,10.0,10.0,0.01,86.17,Unknown


In [18]:
print(coffee.loc[1, "Location.Region"]) # Our index in this case is a number
print(coffee.iloc[1, 1]) # The second row, the second column

sul de minas - carmo de minas
sul de minas - carmo de minas


We can also use .loc and .iloc to select ranges of cells in the form of rows and columns. 

In [11]:
# Jupyter only displays the result of the last expression in a cell
coffee.loc[:, "Location.Region"] # Returns all the rows with the Location.Region column
coffee.iloc[0:2, :] # Because iloc is not inclusive at the right end, first two rows
coffee.loc[:, "Location.Country":"Location.Altitude.Min"] # loc is inclusive

Unnamed: 0,Location.Country,Location.Region,Location.Altitude.Min
0,United States,kona,0
1,Brazil,sul de minas - carmo de minas,12
2,Brazil,sul de minas - carmo de minas,12
3,Ethiopia,sidamo,0
4,Ethiopia,sidamo,0
...,...,...,...
984,Guatemala,san marcos,1700
985,Honduras,comayagua,1400
986,India,chikmagalur karnataka indua,3170
987,India,chikmagalur karnataka india,3140


## Two Key Python Data Structures

Many Pandas functions use **lists** or **dictionaries** as arguments.

### Lists
Lists are one-dimensional, ordered sequences. Lists can include items of heterogeneous data types.

In [19]:
my_list = ["a", "b", 3, 5]
print(my_list[0]) # Gets the first item
print(my_list[:2]) # Gets the first two items
my_list.append(["dog", "cat"])
print(my_list)

a
['a', 'b']
['a', 'b', 3, 5, ['dog', 'cat']]
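Notice that **.append()** added the whole list `["dog", "cat"]` as a single item. If you want to add each item separately, the list method **.extend()** does that instead. A short sketch (the variable names are just for illustration):

```python
nested = ["a", "b"]
nested.append(["c", "d"])  # adds ONE item: the list itself
print(nested)              # ['a', 'b', ['c', 'd']]

flat = ["a", "b"]
flat.extend(["c", "d"])    # adds each item on its own
print(flat)                # ['a', 'b', 'c', 'd']
```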


### Dictionaries
Dictionaries are collections of **key-value** pairs, written with {} in the form `{k1: v1, k2: v2, ...}`.

* Dictionary keys must be unique.
* Dictionary values do not need to be.
* If you know your key, dictionaries give you the corresponding value quickly.

Let's create a toy dictionary with patient names as *keys* and their diagnoses as *values*. 

In [13]:
patients = {}
patients["John"] = "kidney stones"
patients["David"] = "fever"
patients["Steven"] = "fever"
print(patients)

{'John': 'kidney stones', 'David': 'fever', 'Steven': 'fever'}


There are two ways to get values out of a dictionary by their keys: using the [] syntax or using a method called .get().

In [14]:
patients["John"]

'kidney stones'

The `[]` lookup syntax will fail with a **KeyError** if you try to look up a key that is not in the dictionary.

In [15]:
patients["Emily"] # This will error out

KeyError: 'Emily'

To prevent your code from crashing with a **KeyError**, use the **.get()** method instead, which will return `None` if the specified key is not in the dictionary.

In [22]:
if patients.get("Emily") is None: # this will NOT crash
    print("Emily is not in the dictionary")
print(patients.get("David"))

Emily is not in the dictionary
fever
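**.get()** also accepts an optional second argument, a default value to return when the key is missing. Here is a minimal sketch using a smaller copy of the toy dictionary; the `"no record"` string is just a placeholder I chose.

```python
patients = {"John": "kidney stones", "David": "fever"}

# The second argument replaces None as the fallback value
emily = patients.get("Emily", "no record")
print(emily)

# The "in" keyword checks whether a key exists without looking it up
print("Emily" in patients)
print("David" in patients)
```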


Dictionary *keys* must be hashable, which in practice means immutable types such as strings, numbers, and tuples. Dictionary *values* can be just about anything.

**Hint**: The *mutable* objects you've encountered so far include lists and dictionaries.

In [87]:
not_a_dict = {["a","b"]:3} # This will fail

TypeError: unhashable type: 'list'
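If you do need a key made of several parts, a **tuple** works because tuples are immutable. A minimal sketch (the names are for illustration only):

```python
# A tuple can be a dictionary key; a list cannot
valid = {("a", "b"): 3}
print(valid[("a", "b")])

# Catching the error shows the list version fails without stopping the cell
try:
    invalid = {["a", "b"]: 3}
    list_key_worked = True
except TypeError:
    list_key_worked = False
print(list_key_worked)
```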

You *can* use lists as values in dictionaries, and in fact, it's convenient to do so.

In [52]:
hospital = {"name": ["adam", "yao", "sam"], "ages":[15, 25, 30]}
hospital

{'name': ['adam', 'yao', 'sam'], 'ages': [15, 25, 30]}

### Wait, What Does This Have to Do with Pandas?

We can construct toy **Series** from lists by passing in a list of values and, optionally, an index of the same length. (Remember, a **Series** is a one-dimensional array with a labeled index, which can hold numbers or strings.)

In [46]:
s = pd.Series(["cat", "dog", "antelope"]) # no index provided, defaults to a RangeIndex
s

0         cat
1         dog
2    antelope
dtype: object

In [47]:
z = pd.Series(data=["cat", "dog", "antelope"], index=["c", "d", "a"])
z

c         cat
d         dog
a    antelope
dtype: object
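A dictionary can also be passed to **pd.Series()** directly, in which case the keys become the index and the values become the data. Here is a minimal sketch that rebuilds the toy `patients` dictionary from earlier; the name `diagnoses` is mine.

```python
import pandas as pd

patients = {"John": "kidney stones", "David": "fever", "Steven": "fever"}

# Keys become the index, values become the data
diagnoses = pd.Series(patients)
print(diagnoses)

print(diagnoses.loc["David"])  # look up by label, like patients["David"]
print(diagnoses.iloc[0])       # look up by position
```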

We can construct **DataFrames** from dictionaries of columns: each key becomes a column name and each list becomes that column's values. Let's take the `hospital` dictionary from above. Note that the $n$th item in each list corresponds to the $n$th row in the resulting DataFrame.

In [53]:
hospital_frame = pd.DataFrame(data=hospital, index=[101, 102, 103])
hospital_frame

Unnamed: 0,name,ages
101,adam,15
102,yao,25
103,sam,30
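You may also run into data organized the other way around, as a *list* of dictionaries where each dictionary is one row. **pd.DataFrame()** accepts that shape too. A minimal sketch with the same toy values (the names `records` and `by_rows` are just for illustration):

```python
import pandas as pd

# Each dictionary describes one row; the keys are the column names
records = [
    {"name": "adam", "ages": 15},
    {"name": "yao", "ages": 25},
    {"name": "sam", "ages": 30},
]

by_rows = pd.DataFrame(records, index=[101, 102, 103])
print(by_rows)
```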


In this example, patients are indexed by a fake patient_id number. We can use **.loc[]** to retrieve rows by that index.

In [54]:
hospital_frame.loc[101]

name    adam
ages      15
Name: 101, dtype: object
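Because this index holds integers that do not start at 0, it is a good place to see that **.loc[]** uses labels while **.iloc[]** uses positions. Here is a minimal sketch that rebuilds the same toy DataFrame under the name `frame`:

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["adam", "yao", "sam"], "ages": [15, 25, 30]},
    index=[101, 102, 103],
)

print(frame.loc[101, "name"])  # the row LABELLED 101
print(frame.iloc[0, 0])        # the row at POSITION 0 (same row)

# There is no row labelled 0, so .loc[0] raises a KeyError
try:
    frame.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False
print(label_zero_found)
```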

There are many functions in Pandas that take in dictionaries or lists as **arguments**, so make sure you're comfortable with these data structures. 

### Selecting from a DataFrame with an alphanumeric index
Here is a DataFrame that is indexed by a string rather than an integer starting at 0. In this case, .loc and .iloc will differ when indexing into rows.

In [9]:
names = pd.DataFrame({"First":["John", "Steve"], "Last":["Jones", "Smith"]}, index=["a", "b"])
names.head()

Unnamed: 0,First,Last
a,John,Jones
b,Steve,Smith


*Q: How would you select "John" using loc? How would you select "Jones" using iloc?*

In [10]:
print(names.loc["a", "First"])
print(names.iloc[0, 1])

John
Jones


To look at the first `k` and the last `k` rows of a DataFrame respectively, we can use **.head(k)** and **.tail(k)**.

*Q: How would you find the last three rows of the coffee DataFrame?*

In [49]:
coffee.tail(3)

Unnamed: 0,Location.Country,Location.Region,Location.Altitude.Min,Location.Altitude.Max,Location.Altitude.Average,Year,Data.Owner,Data.Type.Species,Data.Type.Variety,Data.Type.Processing method,...,Data.Scores.Flavor,Data.Scores.Aftertaste,Data.Scores.Acidity,Data.Scores.Body,Data.Scores.Balance,Data.Scores.Uniformity,Data.Scores.Sweetness,Data.Scores.Moisture,Data.Scores.Total,Data.Color
986,India,chikmagalur karnataka indua,3170,3170,3170,2017,nishant gurjer,Robusta,,Washed / Wet,...,7.75,7.92,8.0,7.92,7.92,10.0,8.0,0.0,83.5,Unknown
987,India,chikmagalur karnataka india,3140,3140,3140,2017,nishant gurjer,Robusta,,Washed / Wet,...,7.75,7.83,7.67,7.92,7.83,10.0,7.92,0.1,82.5,Bluish-Green
988,Honduras,western region,1200,1200,1200,2018,mdh,Arabica,Catimor,Washed / Wet,...,7.42,7.08,7.42,7.5,7.33,10.0,10.0,0.11,81.58,Blue-Green


Each column in a Pandas DataFrame is a [Series](https://pandas.pydata.org/docs/reference/series.html#series), so we can apply all of the methods available for Series to Pandas columns.

*Q: How would you find the oldest and most recent coffees in the DataFrame? Their difference?*

In [56]:
newest = coffee["Year"].max()
oldest = coffee["Year"].min()
print("The range of age of coffee vintages is", newest - oldest, "years")

The range of age of coffee vintages is 8 years
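A few other Series methods come up often. The sketch below uses a made-up four-row DataFrame (not the coffee data) to show **.value_counts()**, **.nunique()**, and **.idxmax()**, which returns the index label of the largest value so you can pass it to .loc[].

```python
import pandas as pd

# Toy data for illustration
toy = pd.DataFrame({
    "Year": [2010, 2012, 2012, 2018],
    "Total": [86.25, 83.5, 85.08, 81.58],
})

counts = toy["Year"].value_counts()   # how many rows per year
print(counts)

n_years = toy["Year"].nunique()       # number of distinct years
print(n_years)

best = toy["Total"].idxmax()          # index label of the highest score
print(toy.loc[best])
```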


We can use lists to select several columns at once and in any order by passing a list of column names to .loc[]. Remember, `:` means select all columns or all rows.

In [81]:
coffee.loc[:, ["Data.Color", "Data.Scores.Total", "Location.Region"]]

Unnamed: 0,Data.Color,Data.Scores.Total,Location.Region
0,Unknown,86.25,kona
1,Unknown,86.17,sul de minas - carmo de minas
2,Unknown,86.17,sul de minas - carmo de minas
3,Unknown,85.08,sidamo
4,Unknown,83.83,sidamo
...,...,...,...
984,Green,79.08,san marcos
985,Green,0.00,comayagua
986,Unknown,83.50,chikmagalur karnataka indua
987,Bluish-Green,82.50,chikmagalur karnataka india


You can make the list a separate variable for readability purposes.

In [82]:
longer_list = ["Data.Color", "Data.Scores.Total", "Location.Region", "Year"]
coffee.loc[:, longer_list]

Unnamed: 0,Data.Color,Data.Scores.Total,Location.Region,Year
0,Unknown,86.25,kona,2010
1,Unknown,86.17,sul de minas - carmo de minas,2010
2,Unknown,86.17,sul de minas - carmo de minas,2010
3,Unknown,85.08,sidamo,2010
4,Unknown,83.83,sidamo,2010
...,...,...,...,...
984,Green,79.08,san marcos,2017
985,Green,0.00,comayagua,2017
986,Unknown,83.50,chikmagalur karnataka indua,2017
987,Bluish-Green,82.50,chikmagalur karnataka india,2017
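When you want every row, a list of column names can also go straight inside square brackets, without .loc[]. Here is a minimal sketch on a made-up DataFrame that borrows a few of the coffee column names; both forms should give the same result.

```python
import pandas as pd

# Toy data for illustration, reusing a few column names from above
toy = pd.DataFrame({
    "Year": [2010, 2017],
    "Data.Color": ["Unknown", "Green"],
    "Data.Scores.Total": [86.25, 79.08],
})

cols = ["Data.Color", "Year"]

with_loc = toy.loc[:, cols]  # explicit: all rows, these columns
with_brackets = toy[cols]    # shorthand for the same selection

print(with_brackets)
print(with_loc.equals(with_brackets))
```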
