# Introduction to Python

Welcome to UIC ACER's Introduction to Python! 
This course introduces you to Python by working through common tasks in data science: 
importing, manipulating, and exporting data.

Before proceeding with these training materials, 
please ensure you have access to [Open OnDemand](https://ood.acer.uic.edu) or have locally installed Python and Jupyter notebooks via Anaconda as described 
[here](http://www.fredhutch.io/software/#python-jupyter-notebooks).

This tutorial is adapted from the [Fred Hutch Introduction to Python](https://github.com/fredhutchio/python_intro/) and the [Data Carpentry Python for Ecologists](https://datacarpentry.org/python-ecology-lesson/) materials.

## Learning Objectives

By the end of this tutorial, you should be able to:

- work in a Jupyter notebook to run and record Python code
- understand basic Python syntax to use functions and assign variables
- create lists
- use for-loops
- define functions
- load packages and spreadsheet-style data using Python
- extract columns, rows, and portions thereof from datasets
- calculate summary statistics
- subset data by identifying rows that meet particular conditions
- group data to summarize by category
- accommodate missing data
- export data using pandas

## A brief orientation to Python and Jupyter notebooks

Python is a commonly used programming language among researchers,
and has a large community and set of tools available to support its use.
As a result, there are many different ways to interact with Python,
the choice of which depends on your specific need for coding.
In this class, we'll be using [Jupyter notebooks](https://jupyter.readthedocs.io/en/latest/) 
to write, run, and maintain a record of our work.

A Jupyter notebook is an interface operated in a web browser 
that allows inclusion of code, output (including graphics) and explanatory text
all in the same document. 
In fact, these lesson materials are written in a Jupyter notebook.
Jupyter notebooks can also be used as a method of communicating research methods.

### Creating a new Jupyter notebook

You can access Jupyter notebooks through Anaconda,
the software you used to install Python.
Anaconda is a distribution that includes conda, a package manager that helps you install and update software.

Open the Anaconda Navigator software on your computer, 
then click on Jupyter Notebook 
(note that this is different than Jupyter Lab!).

You'll see your default web browser open a new tab.
On a Mac, you may also see a Terminal window open;
this window needs to stay open for Python to run,
but we recommend you minimize it so it stays out of the way.

In the Jupyter notebook window in your web browser,
note the URL at the top: 
it should start with something like `http://localhost:8888/tree`.
In the browser window, you should see folders like "Documents" and "Desktop."
This window represents a different way to interact with the files on your computer.
Although you're viewing these files in a web browser,
you're not necessarily working with files online.

We're going to create a project directory for the purposes of this course. 
You can think of a project as a discrete unit of work, 
such as a chapter of a thesis/dissertation, analysis for a manuscript, or a monthly report. 
We recommend organizing your code, data, and other associated files as projects, 
which allows you to keep all parts of an analysis together for easier access.

Create a new project for this class using the Jupyter notebook file browser:

- Navigate to the location in your computer where you'd like to save files for this class (we recommend Desktop or Documents).
- Click "New" in the upper right hand corner of the screen, then "Folder". This will create a new folder named "Untitled Folder".
- Click the box next to "Untitled Folder", then select "Rename" near the upper left corner of the screen. Name the new directory "intro_python"; we'll now refer to this as your project directory.
- Click on the new folder to view its contents (it should be empty).
- Click "New" in the upper right hand corner of the window, then select "Python 3". This creates a new notebook file and opens it in a new tab. Click on the title of the notebook to rename the file "intro". Alternatively, switch to the browser tab showing the file browser and rename the file there, as you did for the folder. Note that the filename has the suffix `.ipynb`.

### Executing code in a Jupyter notebook

Now that we have a new project and an empty notebook set up, we can begin orienting ourselves to how notebooks work to hold our text, code, and output.

The pale gray box you see at the top of your screen with `In [ ]` to the left is a cell.
By default, each cell is created as a code cell.
Because our notebook is Python 3, 
our code cells are able to execute Python code.
We can test this out by entering `3 + 4` into the cell,
then holding down the Shift key and pressing Enter/Return.
This executes (runs) the code in the cell and prints the output below,
prefaced by `Out[ ]`.
Executing the code this way also creates a new cell below the one you executed.
If a new cell doesn't appear,
you can add one using the `+` button in the toolbar at the top of the screen. 

Cells can also be used to enter text using [Markdown formatting](https://www.markdownguide.org/basic-syntax/).
Change the type of your new cell by going to the dropdown box in the tool bar at the top of the window and changing "Code" to "Markdown." 
Add a subtitle in this cell by entering `## Operators, functions, and data types`,
then using Shift + Enter to execute the cell,
which formats the text as large and bold.
The link above includes more information about Markdown formatting,
but we'll generally use only plain text and subtitles for this course.

Jupyter notebooks include many other features, 
which you can explore in the toolbar and dropdown menus at the top of the screen. 
Additional keyboard shortcuts are also available under Help -> Keyboard Shortcuts.

# Functions and variables

## Operators, functions, and data types

Now that we have a notebook created, 
as well as a basic understanding of how to write and execute code, 
we can begin learning more about Python syntax:
the rules that dictate how combinations of words and symbols are interpreted in a language.

In [1]:
# addition
3 + 2

5

The first line in the example above is a code comment. 
It is not interpreted by Python, but is a human-readable explanation of the code that follows. 
In Python, anything to the right of one or more `#` symbols represents a comment.

> Syntax differs among languages.
> So far in this lesson, 
> we've learned that Markdown interprets `#` as a way of formatting titles and subtitles, 
> while in Python the same symbol represents a code comment.

As we proceed through these lessons, 
we recommend trying to type the example code so it appears as similar as possible to what is presented here.
From the example above,
you may now be wondering if the spaces on either side of the `+` are required.
We can test this for ourselves:

In [2]:
# addition without spaces, same result
3+2

5

The code above indicates that the spaces are not required, 
but are convention. 
Code convention and style don't make or break the ability of your code to run, 
but it does affect whether other people can easily understand your code. 
We'll try to model appropriate code convention for this course,
and you can read more about Python formatting recommendations [here](https://www.python.org/dev/peps/pep-0008/#id26).

Here are other arithmetic operators in Python:

In [3]:
# subtraction
3 - 2

1

In [4]:
# multiplication
3 * 2

6

In [5]:
# division
3 / 2

1.5

In [6]:
# exponentiation
3 ** 2

9

In [7]:
# modulus (remainder)
3 % 2

1

We can also use comparison operators to evaluate whether a given statement is true or false:

In [8]:
# greater than
3 > 4 

False

In [9]:
# less than
3 < 4

True

In [10]:
# equal to
3 == 4

False

In [11]:
# less than or equal to
3 <= 4

True
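
The remaining comparison operators follow the same pattern. The short sketch below uses arbitrary values, and relies on `print` only so that several results can be shown from one cell. Each comparison produces a Boolean value, either `True` or `False`:

```python
# greater than or equal to
print(3 >= 4)          # False

# not equal to
print(3 != 4)          # True

# the result of a comparison has the Boolean (bool) data type
print(type(3 < 4))     # <class 'bool'>
```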

We can also store values in variables. As in math, 
a variable is a name used to represent data, which can be a single value or a more complex collection:

In [12]:
# storing values in variables
pi = 3.1415

We can then use the variable like any other number:

In [13]:
pi * 2

6.283

In [14]:
# storing multiple variables
# convert pounds to kilograms (1 kg is about 2.2 lb)
weight_lb = 22
weight_kg = weight_lb / 2.2
weight_kg

10.0

We defined `weight_kg` using `weight_lb` above. Note however that if we update `weight_lb`, the value of `weight_kg` will not automatically change. `weight_kg` depends only on the value of `weight_lb` *when `weight_kg` was defined*:

In [15]:
# updating weight_lb does not automatically update weight_kg
weight_lb = 38.5
weight_kg

10.0

We need to redefine `weight_kg` if we want to update it:

In [16]:
# redefine weight_kg
weight_kg = weight_lb / 2.2
weight_kg

17.5

Python also includes functions for other types of math. Functions are called by writing the function name and then the argument surrounded by parentheses:

In [17]:
# round pi to a whole number
round(pi)

3

Functions can take more than one argument, in which case the arguments are separated by commas: 

In [18]:
# pass a second argument for the number of digits to round to
round(pi, 1)

3.1

Arguments can also be named, which may increase clarity if the purpose of the argument isn't obvious, or be necessary if there are many arguments:

In [19]:
# use a named argument for the number of digits
round(pi, ndigits=1)

3.1

If you would like to find help on a function, 
there's a function for that:

In [20]:
# find help on a function
help(round)

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.
    
    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.
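
The help text above makes two points that are easy to check for ourselves: the type of the result depends on whether `ndigits` is supplied, and `ndigits` may be negative. A small sketch, using an arbitrary number for the last case:

```python
pi = 3.1415

# with ndigits omitted, the result is an integer
print(type(round(pi)))        # <class 'int'>

# with ndigits given, the result keeps the type of the input (float)
print(type(round(pi, 2)))     # <class 'float'>

# a negative ndigits rounds to the left of the decimal point
hundreds = round(1234.5678, ndigits=-2)
print(hundreds)               # 1200.0
```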



Python possesses several built-in data types:

In [21]:
# data types in python
integer = 42 # integer
real = 3.1415 # float
text = "UIC ACER" # string
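
If you are ever unsure which data type a variable holds, the built-in `type` function will report it. The sketch below repeats the three variables from the cell above and uses `print` to display each result on screen:

```python
integer = 42          # integer
real = 3.1415         # float
text = "UIC ACER"     # string

print(type(integer))  # <class 'int'>
print(type(real))     # <class 'float'>
print(type(text))     # <class 'str'>
```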

Notebooks give you a handy shortcut to view the contents of a variable by executing only the variable name:

In [22]:
text

'UIC ACER'

This approach will work well enough for us in this class,
since we'll be using notebooks the whole time.
If you are using code written by other people, 
or begin writing code in scripts (outside notebooks),
you'll often see the `print` function used:

In [23]:
# print output to screen
print(text)

UIC ACER


The data output are the same for each of the two previous code cells,
though they look slightly different.
It's useful to note that the notebook will only print the result of the last command executed in a code cell,
so if there is other output in the cell you'd like to see, 
you may need to use the `print` function then as well.
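
The difference in appearance comes from two ways Python can turn a value into text. Roughly speaking, the notebook shows the last expression in its "code" form (which keeps the quotation marks around a string), while `print` shows the plain text. The built-in functions `repr` and `str` produce these two forms, so we can sketch the distinction directly:

```python
text = "UIC ACER"

# the form a notebook shows for a string keeps the quotation marks
shown_by_notebook = repr(text)

# the form print() shows is the text itself
shown_by_print = str(text)

print(shown_by_notebook)  # 'UIC ACER'
print(shown_by_print)     # UIC ACER
```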

## Sequences

So far we've been working with variables containing a single value.
It's often the case that we would like to use a variable to reference collections of values.
Sequences are data structures that hold collections of elements.
Lists are one type of sequence,
and are defined in Python using square brackets:

In [24]:
# assign a list to a variable
numbers = [1, 2, 3, 4, 5, 6]
numbers

[1, 2, 3, 4, 5, 6]

Now that we've created a list,
we can access different portions of it:

In [25]:
# access first element in list
numbers[0] 

1

The number in the square brackets above indicates the position, or index, of the element we are accessing.
Python begins indexing (counting) at 0,
so the index positions in `numbers` are `0`, `1`, `2`, `3`, `4`, and `5`.

If you need to find information about your variable,
you can run `?numbers` in a code cell and a help window will pop up containing information about things like the variable's type and length.

Similar (but more extensive) information appears in an output cell if you run `help(numbers)`;
this additional detail may be useful to you as your programming skills develop. 

We can also use negative numbers to index from the end of the list:

In [26]:
# access last element in list
numbers[-1]

6

In [27]:
# access second-to-last element in list
numbers[-2]

5

We can use `:` to access a range of items in a list (note that the first index is included but the last is not):

In [28]:
# access a range
numbers[1:4]

[2, 3, 4]

We can also leave one side of the range empty to go to the edge of the list:

In [29]:
# from the start up to but not including index 3
numbers[:3]

[1, 2, 3]

In [30]:
# from index 3 to the end
numbers[3:]

[4, 5, 6]

In [31]:
# the last three elements
numbers[-3:]

[4, 5, 6]
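
The rule that the start index is included and the stop index is excluded has two convenient consequences, sketched below with the original six-element list. The example also uses `+`, which joins two lists end to end:

```python
numbers = [1, 2, 3, 4, 5, 6]

# a negative index counts back from the end of the list
print(numbers[-1] == numbers[len(numbers) - 1])   # True

# the stop index is excluded, so a slice from 1 to 4 has 4 - 1 = 3 elements
middle = numbers[1:4]
print(len(middle))                                # 3

# slices that share a boundary split the list with no overlap and no gap
first_part = numbers[:3]
second_part = numbers[3:]
print(first_part + second_part == numbers)        # True
```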

We can modify lists after they are created by adding elements or modifying elements:

In [32]:
# add element (number) to end of list
numbers.append(7)

Note that nothing is printed as output unless we specifically ask for it:

In [33]:
print(numbers)

[1, 2, 3, 4, 5, 6, 7]


`append()` is a method, or function associated with a particular type of object. 
In this case, it is a method associated with lists that allows us to modify a list directly.
You can learn more about this method by typing `?numbers.append` in a new code cell,
which presents a help window with the following information:

```
Docstring: L.append(object) -> None -- append object to end
Type:      builtin_function_or_method
```

You can view other methods available for lists by typing `numbers.` 
in a new code cell and hitting the `tab` key.
This provides a drop-down list that shows all methods available for the variable.
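
Because `append` changes the list directly, it does not hand back a new list. The `-> None` in the help text above says exactly this. The sketch below uses a separate example list (`values`, a name chosen for illustration) to show what happens if you store the result of `append` in a variable:

```python
values = [1, 2, 3]          # a separate example list

# append changes the list in place and returns None
returned = values.append(4)

print(returned)             # None
print(values)               # [1, 2, 3, 4]
```

In practice this means you should write `values.append(4)` on its own line, rather than `values = values.append(4)`, which would replace your list with `None`.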

In [34]:
# modify existing element
numbers[1] = 17
print(numbers)

[1, 17, 3, 4, 5, 6, 7]


Python also provides a function `len()` to determine how long a list is:

In [35]:
# calculate length
len(numbers)

7

> #### Exercise: remove
How do you remove items from a list?

# Loops and Statements to control flow

## "For" loops

We've been printing the contents of lists so far to the screen,
but we often would like to access each element in a structure one at a time.
We can accomplish this using a programming structure called a for loop.
For loops exist in many programming languages, 
and can be used to repeat actions across a set of things.
Here, we'll access elements in `organs` one at a time:

In [36]:
# lists of string data
organs = ["lung", "kidney", "heart"]

# for loop to access elements in list one at a time
for organ in organs:
    print(organ)

lung
kidney
heart


In the code above, `organ` represents a variable used inside the for loop.
For loops follow a predictable syntax: the first line includes `for`, `in`, and a closing `:`, and the lines to repeat are indented beneath it. You can read about for loop structures in Python [here](https://wiki.python.org/moin/ForLoop).
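
Indentation determines which lines belong to the loop. As a small sketch, the loop below adds up the number of letters in each organ name (`len` works on strings as well as lists). The variable name `total_letters` is chosen just for this example:

```python
organs = ["lung", "kidney", "heart"]

total_letters = 0
for organ in organs:
    # the indented line runs once for each element in the list
    total_letters = total_letters + len(organ)

# this line is not indented, so it runs once, after the loop has finished
print(total_letters)   # 15
```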

## "While" loops

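A while loop repeats its indented block for as long as its condition remains true:

In [None]:
# keep adding 1 to x until x is no longer less than 10
x = 0
while x < 10:
    x = x + 1
    print(x)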


## If/Else statements

In [None]:
# label each number by comparing it to 5
for n in numbers:
    if n > 5:
        print("large number")
    else:
        print("small number")

> #### Exercise: elif
What do you think "elif" means? (Hint: it's a combination of the two statements we used above.) Try using "elif" to print whether a number is "small" (less than 4), "medium" (4 to 7), or "large" (greater than 7).

## Functions

In this last section, we'll briefly overview how to write our own custom functions:

In [37]:
# define a chunk of code as function
def plus_ten(a):
    result = a + 10
    return result

The first line of code defines the function with the name `plus_ten()` 
that accepts one item as input (`a`).
The second line performs the action,
and the last line determines what is output.

We can test the function by evaluating its use on data with an easily predictable outcome:

In [38]:
# apply the function
z = plus_ten(21)
print(z)

31
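
Custom functions can accept more than one argument, and arguments can be given default values, just like `ndigits` in `round`. The function below, `plus_amount`, is a hypothetical generalization of `plus_ten` written for illustration only:

```python
# hypothetical generalization of plus_ten: the amount to add is a second argument
def plus_amount(a, amount=10):
    result = a + amount
    return result

print(plus_amount(21))             # 31, uses the default amount of 10
print(plus_amount(21, 5))          # 26
print(plus_amount(21, amount=5))   # 26, the same call with a named argument
```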


Let's combine a for loop with our `plus_ten` function:

In [39]:
for num in numbers:
    new_num = plus_ten(num)
    print(new_num)

11
27
13
14
15
16
17


> #### Exercise: function 
Define a new function called `times_ten` that multiplies a number by 10.

# Starting with data

## Using packages

We'll first need to load additional packages
(collections of related functions)
so the functions we'll need are available for use:

In [40]:
# make packages available to use in this notebook
import os
import urllib.request
import pandas as pd

The packages we're using today include:

- [`os`](https://docs.python.org/3/library/os.html): to create a `data` directory
- [`urllib`](https://docs.python.org/3/library/urllib.html): for downloading files
- [`pandas`](https://pandas.pydata.org): for data manipulation and analysis

For the last package,
`pd` is being defined as an alias, or shortcut, 
that we can type instead of the full package name.
For the rest of this lesson, we'll preface each function with the name (or alias) of the package from which it was loaded.
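
Aliases work the same way for any package. The sketch below uses `statistics`, a package included with Python that is not otherwise part of this lesson, and gives it the alias `st` (a name chosen for this example):

```python
import statistics as st   # st is an alias chosen for this example

values = [2.0, 4.0, 9.0]

# the function is reached through the alias, followed by a dot
average = st.mean(values)
print(average)            # 5.0
```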

## Importing data

Before we can download our data,
we should create a new directory to contain it:

In [41]:
# create data directory
os.mkdir("data")
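
Running the cell above a second time raises a `FileExistsError`, because `os.mkdir` refuses to create a directory that already exists. One possible alternative is `os.makedirs` with `exist_ok=True`. The sketch below demonstrates it in a temporary location (created with the built-in `tempfile` package) so that it doesn't touch your project directory:

```python
import os
import tempfile

# use a throwaway location so the example leaves your project untouched
base = tempfile.mkdtemp()
target = os.path.join(base, "data")

# exist_ok=True means "do nothing if the directory is already there"
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)   # running it again raises no error

print(os.path.isdir(target))         # True
```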

Then we can use a function from the `urllib` package to download the data file:

In [42]:
# download dataset
urllib.request.urlretrieve("https://raw.githubusercontent.com/uicacer/workshops/main/r_intro/data/animals.csv", "data/animals.csv")

('data/animals.csv', <http.client.HTTPMessage at 0x7fe325c38410>)

The first argument (string inside quotation marks)
represents the URL from which the data is being downloaded.
The second argument ("data/animals.csv") indicates where the data will be saved.

Notice that the URL above ends in "animals.csv", 
which is also the name we used to save the file on our computers.
If you click on the URL and view it in a web browser, the format isn’t particularly easy for us to understand. 
The data we’ve downloaded are in csv format, which stands for “comma separated values.” This means the data are organized into rows and columns, with columns separated by commas.
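
To see what this structure looks like to a program, the sketch below parses a few lines of made-up csv text using Python's built-in `csv` package. This is only an illustration of the format; the lesson itself uses `pandas` to read the file. Note that every value starts out as text, and a missing value appears as an empty string:

```python
import csv
import io

# made-up data in csv format: a header line, then one line per row
raw_text = "year,sex,weight\n1983,F,28.0\n1991,not reported,\n"

rows = list(csv.reader(io.StringIO(raw_text)))

print(rows[0])   # ['year', 'sex', 'weight']
print(rows[1])   # ['1983', 'F', '28.0']
print(rows[2])   # ['1991', 'not reported', '']
```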

These data are arranged in a tidy format, meaning each row represents an observation, and each column represents a variable (piece of data for each observation). Moreover, only one piece of data is entered in each cell.

We are investigating the animal species diversity and weights found within plots at our study site. Each row holds information for a single animal, and the columns represent:

| Column           | Description                        |
| ---------------- | ---------------------------------- |
| year             | year of observation                |
| sex              | sex of animal (“M”, “F”)           |
| hindfoot\_length | length of the hindfoot in mm       |
| weight           | weight of the animal in grams      |
| genus            | genus of animal                    |
| species          | species of animal                  |
| taxa             | e.g. Rodent, Reptile, Bird, Rabbit |
| plot\_type       | type of plot                       |


We can import these data and assign them to a variable:

In [43]:
# assign data to variable
animal_df = pd.read_csv("data/animals.csv")

The command executed successfully, 
but we still need to ensure the data have been imported correctly.

There are a few ways we can inspect the data.
First, we can preview the data:

In [44]:
# preview first few rows of the data
animal_df.head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
0,1983,F,19.0,28.0,Onychomys,torridus,Rodent,Long-term Krat Exclosure
1,1991,not reported,,,Amphispiza,bilineata,Bird,Control
2,1987,F,32.0,162.0,Neotoma,albigula,Rodent,Control
3,1995,M,36.0,44.0,Dipodomys,merriami,Rodent,Control
4,2002,F,23.0,15.0,Chaetodipus,penicillatus,Rodent,Spectab exclosure


The `head` method by default shows the column headers,
along with the first five rows of data.
You can specify a different number of rows by placing that number inside the parentheses, 
demonstrated below using `tail`, 
which shows the last few rows:

In [45]:
# print last eight rows of data to screen
animal_df.tail(8) # pass argument for number of rows

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
34778,2000,F,25.0,32.0,Chaetodipus,baileyi,Rodent,Control
34779,1979,F,49.0,73.0,Dipodomys,spectabilis,Rodent,Control
34780,1980,F,19.0,29.0,Onychomys,torridus,Rodent,Rodent Exclosure
34781,1990,M,36.0,47.0,Dipodomys,merriami,Rodent,Spectab exclosure
34782,1978,M,55.0,168.0,Dipodomys,spectabilis,Rodent,Spectab exclosure
34783,1988,F,37.0,50.0,Dipodomys,ordii,Rodent,Control
34784,2001,M,23.0,17.0,Chaetodipus,penicillatus,Rodent,Short-term Krat Exclosure
34785,1995,F,17.0,18.0,Reithrodontomys,megalotis,Rodent,Short-term Krat Exclosure


We can use `len()` to determine how many rows and columns are in our data:

In [46]:
# number of rows
len(animal_df)

34786

In [47]:
# number of columns
len(animal_df.columns)

8
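
Data frames also have a `shape` attribute that reports both numbers at once, as (rows, columns). The sketch below uses a small made-up data frame, `example_df`, rather than the full dataset:

```python
import pandas as pd

# small made-up data frame: 3 rows, 2 columns
example_df = pd.DataFrame({
    "genus": ["Onychomys", "Neotoma", "Dipodomys"],
    "weight": [28.0, 162.0, 44.0],
})

print(len(example_df))           # 3
print(len(example_df.columns))   # 2
print(example_df.shape)          # (3, 2)
```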

Now that we have data imported and available, 
we can print a summary of all column names, number of entries, data types, and non-null values:

In [48]:
# print summary
animal_df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34786 entries, 0 to 34785
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             34786 non-null  int64  
 1   sex              34786 non-null  object 
 2   hindfoot_length  31438 non-null  float64
 3   weight           32283 non-null  float64
 4   genus            34786 non-null  object 
 5   species          34786 non-null  object 
 6   taxa             34786 non-null  object 
 7   plot_type        34786 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 2.1+ MB


The output above highlights one of the key features of `pandas`:
it interprets data in ways that make it easier to analyze.

The description at the top of this output,
`pandas.core.frame.DataFrame`,
describes the data structure as a data frame,
which is how `pandas` interprets spreadsheet style data.
Directly below that line,
we see a note that there are 34786 observations (rows; in our case, individual animals),
as well as 8 columns.
A summary of the data type for each column is below.

Earlier in this lesson,
we discussed data types built into Python.
`pandas` has its own data types,
some of which appear in our data:

- `object` data in `pandas` represents string (character) data in native Python
- `float64` is still float data (the `64` means each value is stored using 64 bits)
- `int64` in `pandas` refers to integer data 
- `datetime64` from `pandas` isn't shown here, but refers to a specific format to make working with date and time data easier. 

> To create a list in a markdown cell, 
use an asterisk (`*`) or dash (`-`) followed by a space;
these will be rendered as bullet points when you execute the cell.
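
The data types and non-null counts reported by `info()` are connected: a numeric column that contains missing values is stored as `float64`. The sketch below shows this with a small made-up data frame (`example_df`), where `None` marks a missing measurement:

```python
import pandas as pd

example_df = pd.DataFrame({
    "year": [1983, 1991, 1987],
    "weight": [28.0, None, 162.0],   # None marks a missing measurement
})

print(example_df["year"].dtype)      # int64
print(example_df["weight"].dtype)    # float64

# count() reports only the non-missing values, like "Non-Null Count" in info()
print(example_df["weight"].count())  # 2
```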

## Accessing columns and rows

A common task in data analysis is to extract particular columns or rows,
often referred to as subsetting.
This section explores a few different ways to access these parts of our spreadsheet.

First, we can subset a single column using its name (column header):

In [49]:
# show only the first few rows of one column
animal_df["taxa"].head()

0    Rodent
1      Bird
2    Rodent
3    Rodent
4    Rodent
Name: taxa, dtype: object

The square brackets above are a common subsetting syntax in Python.
The quotation marks around the column name are necessary for Python to interpret it as a column,
rather than a variable name.
We've added `.head()` to the end so we only preview the first few rows, rather than the entire column.

Similarly, we can assess the data type of a specific column:

In [50]:
# show data type for a column
animal_df["taxa"].dtype 

dtype('O')

The output, `O`, 
indicates these data are object (character) type.

One of the shortcuts afforded by `pandas` is the ability to treat the column names as attributes,
which means you can access them using the `.` syntax:

In [51]:
# access columns by name using dot syntax
animal_df.taxa.head()

0    Rodent
1      Bird
2    Rodent
3    Rodent
4    Rodent
Name: taxa, dtype: object

Here, we've also used `.head()` to minimize the amount of data printed to the screen.
If you were assigning data to a new variable name, 
you would likely be using the whole column instead.

If you need to extract multiple columns, 
you'll need to adjust the syntax slightly:

In [52]:
# select two columns at once
animal_df[["taxa", "year"]].head()

Unnamed: 0,taxa,year
0,Rodent,1983
1,Bird,1991
2,Rodent,1987
3,Rodent,1995
4,Rodent,2002


In this case, we can't use the dot syntax, because it accesses only one column at a time.
The double square brackets are not special syntax:
the outer pair is the usual subsetting syntax, and the inner pair creates a list of column names.
In general, the dot syntax means you are accessing a part of the thing (generally a variable)
that comes before the dot.
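
The kind of object you get back depends on what goes inside the square brackets. The sketch below uses a small made-up data frame (`example_df`) and a hypothetical variable `wanted` to hold the list of column names:

```python
import pandas as pd

example_df = pd.DataFrame({
    "taxa": ["Rodent", "Bird", "Rodent"],
    "year": [1983, 1991, 1987],
})

# a single column name returns a Series (one column of values)
print(type(example_df["taxa"]).__name__)      # Series

# a list of column names returns a DataFrame, even if the list has one name
print(type(example_df[["taxa"]]).__name__)    # DataFrame

# the list can also be stored in a variable first
wanted = ["taxa", "year"]
print(list(example_df[wanted].columns))       # ['taxa', 'year']
```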

> #### Exercise: typo
What happens if you misspell the name of a column?

> #### Exercise: order
Does the order of the columns you list matter?

We can also extract rows from a data frame:

In [53]:
# access three rows 
animal_df[1:4]

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
1,1991,not reported,,,Amphispiza,bilineata,Bird,Control
2,1987,F,32.0,162.0,Neotoma,albigula,Rodent,Control
3,1995,M,36.0,44.0,Dipodomys,merriami,Rodent,Control


This is the same type of range notation we used for lists above, but now it gives us three rows (those at positions 1, 2, and 3) instead of three elements.

As with lists, we can leave one side of the range empty:

In [54]:
# access the first three rows
animal_df[:3]

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
0,1983,F,19.0,28.0,Onychomys,torridus,Rodent,Long-term Krat Exclosure
1,1991,not reported,,,Amphispiza,bilineata,Bird,Control
2,1987,F,32.0,162.0,Neotoma,albigula,Rodent,Control


In [55]:
# access the last five rows
animal_df[34781:]

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
34781,1990,M,36.0,47.0,Dipodomys,merriami,Rodent,Spectab exclosure
34782,1978,M,55.0,168.0,Dipodomys,spectabilis,Rodent,Spectab exclosure
34783,1988,F,37.0,50.0,Dipodomys,ordii,Rodent,Control
34784,2001,M,23.0,17.0,Chaetodipus,penicillatus,Rodent,Short-term Krat Exclosure
34785,1995,F,17.0,18.0,Reithrodontomys,megalotis,Rodent,Short-term Krat Exclosure


We can perform a similar operation to `tail` using a negative index with the range:

In [56]:
# access the last row in the data frame
animal_df[-1:] 

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
34785,1995,F,17.0,18.0,Reithrodontomys,megalotis,Rodent,Short-term Krat Exclosure


> #### Exercise: last
How would you extract the last 10 rows of the dataset?

> #### Exercise: middle
Use `len()` to extract the middle row in the data set  
Note: be careful, indices must be whole numbers. Can you think of a function we used earlier to help with this? Alternatively, see if you can figure out what the arithmetic operator `//` does.

## Slicing subsets of rows and columns

Now that we have a basic understanding of accessing whole rows and columns, 
we are ready to discuss slicing
(extracting portions of rows and columns).

There are multiple ways to slice a data frame. 
We'll begin by exploring `iloc`, 
which uses integer indexing.
This means we'll reference rows and columns by their index position:

In [57]:
# access one data element from a single cell
animal_df.iloc[2, 1]

'F'

We can check one of our previews of the data above to see that this does represent the data in that cell.

As with subsetting described in the previous section,
we can also extract ranges of cells:

In [58]:
# select range of data
animal_df.iloc[0:3, 1:4]

Unnamed: 0,sex,hindfoot_length,weight
0,F,19.0,28.0
1,not reported,,
2,F,32.0,162.0


As described earlier with subsetting using ranges of index values,
we can see in the output above that the start bound of each range is included but the stop bound is not.

We can also include an empty start or stop bound to indicate the beginning or end of the data frame, respectively:

In [59]:
# empty stop boundary to indicate end of data
animal_df.iloc[:2, 3:]

Unnamed: 0,weight,genus,species,taxa,plot_type
0,28.0,Onychomys,torridus,Rodent,Long-term Krat Exclosure
1,,Amphispiza,bilineata,Bird,Control


Now we'll move on and explore the second method for extracting slices,
using `loc`, which uses label-based indexing.
The tricky part with our data is that the row labels are the same as the integer positions.
This means that when we extract a range of rows,
we still reference them by number, even though `loc` interprets those numbers as labels:

In [60]:
# slicing using loc
animal_df.loc[1:4]

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
1,1991,not reported,,,Amphispiza,bilineata,Bird,Control
2,1987,F,32.0,162.0,Neotoma,albigula,Rodent,Control
3,1995,M,36.0,44.0,Dipodomys,merriami,Rodent,Control
4,2002,F,23.0,15.0,Chaetodipus,penicillatus,Rodent,Spectab exclosure


Here you can note one of the major differences between `iloc` and `loc`:
the latter has inclusive start and stop bounds.
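
The distinction between positions and labels is easier to see when the two differ. The sketch below builds a small made-up data frame (`example_df`) whose row labels are 10, 20, 30, and 40, so they no longer match the positions 0 through 3:

```python
import pandas as pd

# made-up data whose row labels (10, 20, 30, 40) differ from positions (0, 1, 2, 3)
example_df = pd.DataFrame(
    {"weight": [28.0, 162.0, 44.0, 15.0]},
    index=[10, 20, 30, 40],
)

# iloc uses positions; the stop position 2 is excluded
by_position = example_df.iloc[0:2]
print(list(by_position.index))   # [10, 20]

# loc uses labels; the stop label 30 is included
by_label = example_df.loc[10:30]
print(list(by_label.index))      # [10, 20, 30]
```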

We can still use empty bounds:

In [61]:
# empty stop boundary to indicate end of data
animal_df.loc[34781: ]

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
34781,1990,M,36.0,47.0,Dipodomys,merriami,Rodent,Spectab exclosure
34782,1978,M,55.0,168.0,Dipodomys,spectabilis,Rodent,Spectab exclosure
34783,1988,F,37.0,50.0,Dipodomys,ordii,Rodent,Control
34784,2001,M,23.0,17.0,Chaetodipus,penicillatus,Rodent,Short-term Krat Exclosure
34785,1995,F,17.0,18.0,Reithrodontomys,megalotis,Rodent,Short-term Krat Exclosure


We can also select all columns for a specific set of rows by adding the row labels as a list:

In [62]:
# Select all columns for rows of index values specified
animal_df.loc[[0, 10, 6831], ]

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type
0,1983,F,19.0,28.0,Onychomys,torridus,Rodent,Long-term Krat Exclosure
10,1982,F,33.0,120.0,Neotoma,albigula,Rodent,Control
6831,1978,F,52.0,108.0,Dipodomys,spectabilis,Rodent,Control


Finally, we can use the column labels for extraction:

In [63]:
# select first row for specified columns
animal_df.loc[0, ["year", "weight", "genus"]]

year           1983
weight           28
genus     Onychomys
Name: 0, dtype: object

In [64]:
# select rows labeled 0 through 5 (six rows) for specified columns
animal_df.loc[0:5, ["year", "weight", "genus"]]

Unnamed: 0,year,weight,genus
0,1983,28.0,Onychomys
1,1991,,Amphispiza
2,1987,162.0,Neotoma
3,1995,44.0,Dipodomys
4,2002,15.0,Chaetodipus
5,2002,18.0,Chaetodipus


> #### Exercise: location
Why doesn't the following code work? 
>
> `animal_df.loc[2, 6]`

> #### Exercise: 100
How would you extract the last 100 rows for only year and taxa?

> #### Exercise: column
Extract the column `"genus"` in at least four different ways (e.g. using `[]`, `loc`, `iloc`...)

So far, we've been printing the output from our subsetting and slicing to the screen 
(often using `.head()`).
Remember that if you'd like to use these data for another purpose,
it's possible you may want to assign these data to a new variable to further manipulate.

## Calculating summary statistics

Once you've extracted your data of interest, 
you will likely want to be able to assess basic statistical features of the data.

Data frames allow you to assess these features:

In [65]:
# calculate basic stats for a single column
animal_df.weight.describe()

count    32283.000000
mean        42.672428
std         36.631259
min          4.000000
25%         20.000000
50%         37.000000
75%         48.000000
max        280.000000
Name: weight, dtype: float64

In this case, 
we've assessed a collection of summary statistics for the column "weight" 
using the `.describe()` method.

You can access the statistics listed above individually as well:

In [66]:
# calculate only the minimum for weight
animal_df.weight.min()

4.0
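
These summary methods skip missing values by default, which is why the count reported for weight above is smaller than the number of rows in the data frame. A sketch with a small made-up data frame (`example_df`) containing one missing weight:

```python
import pandas as pd

# made-up values; None marks a missing measurement
example_df = pd.DataFrame({"weight": [28.0, None, 162.0, 44.0]})

print(len(example_df))             # 4 rows in total
print(example_df.weight.count())   # 3, the missing value is not counted
print(example_df.weight.max())     # 162.0
print(example_df.weight.mean())    # 78.0, the average of the three recorded values
```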

We can also use our ability to access columns to perform mathematical operations,
such as unit conversion:

In [67]:
# convert weight column from grams to ounces
animal_df.weight.head() / 28.35

0    0.987654
1         NaN
2    5.714286
3    1.552028
4    0.529101
Name: weight, dtype: float64

We can also perform a conversion on a summary statistic:

In [68]:
# convert maximum weight to ounces
animal_df.weight.max() / 28.35

9.876543209876543

We can add a column to our data frame to store the converted values using the square bracket notation:

In [69]:
# add converted column
animal_df['weight_oz'] = animal_df.weight / 28.35
animal_df.head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
0,1983,F,19.0,28.0,Onychomys,torridus,Rodent,Long-term Krat Exclosure,0.987654
1,1991,not reported,,,Amphispiza,bilineata,Bird,Control,
2,1987,F,32.0,162.0,Neotoma,albigula,Rodent,Control,5.714286
3,1995,M,36.0,44.0,Dipodomys,merriami,Rodent,Control,1.552028
4,2002,F,23.0,15.0,Chaetodipus,penicillatus,Rodent,Spectab exclosure,0.529101


In [70]:
# the max of the converted column equals the max of the original column converted to oz
animal_df.weight_oz.max()

9.876543209876543

> #### Exercise: object
What type of summary statistics do you get for object data?

> #### Exercise: deviation
How would you extract only the standard deviation for weight?

> #### Exercise: new column
Add a column `hindfoot_in` that contains hindfoot length in inches (it is currently in millimeters)

In [71]:
# summary for string data
animal_df.genus.describe()

count         34786
unique           26
top       Dipodomys
freq          16167
Name: genus, dtype: object

# Data manipulation

## Conditional subsetting

Now that we're set up with data and tools,
we're going to explore conditional subsetting.
This means extracting particular rows based on a criterion.

For example, 
we may want to find all data collected in 1998.
We may be tempted to try something like `animal_df.year == 1998`,
but the output we get isn't quite what we wanted: the results are logical data (true/false).

> We used double equal signs (`==`) to test for equality,
and to differentiate from variable assignment and specifying values for arguments
(like in the last lesson with `sep=` for loading data).

In [72]:
# test equality
animal_df.year == 1998

0        False
1        False
2        False
3        False
4        False
         ...  
34781    False
34782    False
34783    False
34784    False
34785    False
Name: year, Length: 34786, dtype: bool

However, we can combine this information with what we know about data subsetting:

In [73]:
# conditionally subset all samples collected in 1998
animal_df[animal_df.year == 1998].head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
17,1998,M,36.0,49.0,Dipodomys,merriami,Rodent,Control,1.728395
46,1998,M,34.0,39.0,Dipodomys,merriami,Rodent,Control,1.375661
118,1998,F,36.0,47.0,Dipodomys,merriami,Rodent,Control,1.657848
121,1998,F,25.0,34.0,Chaetodipus,baileyi,Rodent,Short-term Krat Exclosure,1.199295
126,1998,F,36.0,46.0,Dipodomys,merriami,Rodent,Control,1.622575
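
The Boolean series from the comparison acts as a mask: rows where it is `True` are kept. The sketch below illustrates this on a hypothetical `toy_df` (not the real data), and also shows that summing the mask counts the matching rows, since `True` is treated as 1.

```python
import pandas as pd

# Hypothetical toy data standing in for animal_df
toy_df = pd.DataFrame({"year": [1983, 1998, 1987, 1998, 2002]})

# the comparison yields one True/False value per row
is_1998 = toy_df["year"] == 1998

# passing the mask inside [] keeps only the rows marked True
subset = toy_df[is_1998]

# summing a Boolean mask counts the True values
n_matches = int(is_1998.sum())
print(n_matches, len(subset))  # 2 2
```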


We can invert the selection (e.g., identify samples *not* collected in 1998)
using `~`:

In [74]:
# conditionally subset all samples NOT collected in 1998
animal_df[~(animal_df.year == 1998)].head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
0,1983,F,19.0,28.0,Onychomys,torridus,Rodent,Long-term Krat Exclosure,0.987654
1,1991,not reported,,,Amphispiza,bilineata,Bird,Control,
2,1987,F,32.0,162.0,Neotoma,albigula,Rodent,Control,5.714286
3,1995,M,36.0,44.0,Dipodomys,merriami,Rodent,Control,1.552028
4,2002,F,23.0,15.0,Chaetodipus,penicillatus,Rodent,Spectab exclosure,0.529101


Typically we would instead use the shorter notation `!=` for "not equal":

In [75]:
# shorter notation for not equal
animal_df[animal_df.year != 1998].head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
0,1983,F,19.0,28.0,Onychomys,torridus,Rodent,Long-term Krat Exclosure,0.987654
1,1991,not reported,,,Amphispiza,bilineata,Bird,Control,
2,1987,F,32.0,162.0,Neotoma,albigula,Rodent,Control,5.714286
3,1995,M,36.0,44.0,Dipodomys,merriami,Rodent,Control,1.552028
4,2002,F,23.0,15.0,Chaetodipus,penicillatus,Rodent,Spectab exclosure,0.529101
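
As a quick check that inverting a mask and using `!=` select the same rows, here is a minimal sketch on a hypothetical `toy_df`. It uses `~`, the operator pandas documents for negating a Boolean mask.

```python
import pandas as pd

# Hypothetical toy data standing in for animal_df
toy_df = pd.DataFrame({"year": [1983, 1998, 1987, 1998, 2002]})

is_1998 = toy_df["year"] == 1998

# invert the mask with ~
inverted = toy_df[~is_1998]

# or test for inequality directly
not_equal = toy_df[toy_df["year"] != 1998]

print(inverted.equals(not_equal))  # True
```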


We can combine multiple criteria into the same filter:

In [76]:
# extract all samples collected between 1998 and 2000
animal_df[(animal_df.year >= 1998) & (animal_df.year <= 2000)].head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
11,2000,M,36.0,46.0,Dipodomys,merriami,Rodent,Control,1.622575
17,1998,M,36.0,49.0,Dipodomys,merriami,Rodent,Control,1.728395
29,2000,F,35.0,37.0,Dipodomys,merriami,Rodent,Control,1.305115
39,2000,M,22.0,16.0,Chaetodipus,penicillatus,Rodent,Control,0.564374
46,1998,M,34.0,39.0,Dipodomys,merriami,Rodent,Control,1.375661


In the example above, the ampersand (`&`) represents *AND*, 
meaning a row must meet *both* conditions to be kept.
We can also use a vertical pipe (`|`), representing *OR*,
meaning a row must meet *at least one* condition.

In [77]:
# extract all data for samples collected in 1998 or 1999
animal_df[(animal_df.year == 1998) | (animal_df.year == 1999)].head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
17,1998,M,36.0,49.0,Dipodomys,merriami,Rodent,Control,1.728395
46,1998,M,34.0,39.0,Dipodomys,merriami,Rodent,Control,1.375661
61,1999,not reported,,,Ammospermophilus,harrisi,Rodent,Control,
69,1999,F,22.0,12.0,Chaetodipus,penicillatus,Rodent,Control,0.42328
73,1999,M,36.0,50.0,Dipodomys,merriami,Rodent,Spectab exclosure,1.763668
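
Each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparisons such as `>=` and `==`. As a sketch of some alternatives, pandas also offers `Series.between` for inclusive ranges and `Series.isin` for matching any of several values. The example uses a hypothetical `toy_df`, and these shortcuts are shown as options rather than as the approach this lesson relies on.

```python
import pandas as pd

# Hypothetical toy data standing in for animal_df
toy_df = pd.DataFrame({"year": [1983, 1998, 1999, 2000, 2002]})

# AND: both conditions must hold (parentheses are required)
in_range = toy_df[(toy_df["year"] >= 1998) & (toy_df["year"] <= 2000)]
# same rows using between(), which includes both endpoints by default
in_range_between = toy_df[toy_df["year"].between(1998, 2000)]

# OR: at least one condition must hold
either = toy_df[(toy_df["year"] == 1998) | (toy_df["year"] == 1999)]
# same rows using isin() with a list of accepted values
either_isin = toy_df[toy_df["year"].isin([1998, 1999])]

print(in_range.equals(in_range_between))  # True
print(either.equals(either_isin))         # True
```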


> #### Exercise: conditional
Print to the screen all data from `animal_df` for only kangaroo rats (column `genus` equals `'Dipodomys'`)

> #### Exercise: combining conditions
Print to the screen all data from `animal_df` for only kangaroo rats that weigh more than 170 grams (`weight`)

> #### Exercise: negative conditions
Print to the screen all data from `animal_df` *except for* kangaroo rats that weigh more than 170 grams

## Grouping data

Another useful feature of `pandas` is the ability to group data by categories,
so you can then summarize other quantitative variables in the dataset.

First, we can explore what categories exist for our `sex` column:

In [78]:
# identify unique elements in a column
pd.unique(animal_df["sex"])

array(['F', 'not reported', 'M'], dtype=object)

Remember that the code above is equivalent to `pd.unique(animal_df.sex)`, 
which uses a different syntax to access the column.

Now that we know the categories in `sex`,
we may be interested in summarizing quantitative variables based on these categories.
We can assign our grouped data to a new variable:

In [79]:
# group data by sex 
grouped_data = animal_df.groupby("sex")

`grouped_data` represents the same data, but with the data oriented according to categories
(though it isn't interpretable by humans, so it's not useful to print it to the screen).

> In the command above, if we wanted to specify sex using `.` we would have to write `animal_df.groupby(animal_df.sex)` because of the syntax required by `groupby`.
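
Although printing a grouped object isn't informative, you can still inspect it. The sketch below, which assumes a hypothetical `toy_df`, shows a few ways to peek inside: `ngroups` for the number of categories, iteration to visit each group, and `get_group` to pull out one category as an ordinary data frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for animal_df
toy_df = pd.DataFrame({
    "sex": ["F", "M", "F", "not reported", "M"],
    "weight": [28.0, 44.0, 15.0, np.nan, 49.0],
})

grouped = toy_df.groupby("sex")

# number of distinct categories
print(grouped.ngroups)  # 3

# iterating yields (category name, sub data frame) pairs
group_sizes = {name: len(group) for name, group in grouped}
print(group_sizes)

# extract the rows belonging to one category
females = grouped.get_group("F")
print(females)
```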

We can then calculate summary statistics for our grouped data:

In [80]:
# summary stats for all columns by sex
grouped_data.describe()

,year,year,year,year,year,year,year,year,hindfoot_length,hindfoot_length,...,weight,weight,weight_oz,weight_oz,weight_oz,weight_oz,weight_oz,weight_oz,weight_oz,weight_oz
sex,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
F,15690.0,1990.644997,7.598725,1977.0,1984.0,1990.0,1997.0,2002.0,14894.0,28.83678,...,46.0,274.0,15303.0,1.487498,1.299752,0.141093,0.705467,1.199295,1.622575,9.664903
M,17348.0,1990.480401,7.403655,1977.0,1984.0,1990.0,1997.0,2002.0,16476.0,29.709578,...,49.0,280.0,16879.0,1.516592,1.276366,0.141093,0.705467,1.375661,1.728395,9.876543
not reported,1748.0,1989.310069,6.80087,1977.0,1984.0,1989.0,1995.0,2002.0,68.0,25.941176,...,117.0,243.0,101.0,2.283689,2.19399,0.141093,0.599647,1.234568,4.126984,8.571429


This output only summarizes quantitative variables 
(the names listed in the first line of the output, e.g., `year` and `weight`).
The subtitles in the output (e.g., `count`, `mean`, etc.)
represent the summary statistics for each variable, 
grouped by sex (in the row labels).

Only a subset of the output is included above;
the ellipsis (`...`) in the middle of the output indicates it's been truncated.
You can extract only the data for a single column:

In [81]:
# summary stats by sex for only one column (weight)
grouped_data.weight.describe()

sex,count,mean,std,min,25%,50%,75%,max
F,15303.0,42.170555,36.847958,4.0,20.0,34.0,46.0,274.0
M,16879.0,42.995379,36.184981,4.0,20.0,39.0,49.0,280.0
not reported,101.0,64.742574,62.199623,4.0,17.0,35.0,117.0,243.0


In addition to these basic summary statistics for quantitative variables,
we can also assess how missing data affects the data available for each category.
We obtain this information by counting the number of data points for each category from our grouped data:

In [82]:
# show the number of samples of each sex available for each column 
grouped_data.count()

sex,year,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
F,15690,14894,15303,15690,15690,15690,15690,15303
M,17348,16476,16879,17348,17348,17348,17348,16879
not reported,1748,68,101,1748,1748,1748,1748,101


From the output above,
we can also extract the data for only one column.
This lets us know how many cases in each sex category have a recorded weight:

In [83]:
# counts for only weight
grouped_data.weight.count()

sex
F               15303
M               16879
not reported      101
Name: weight, dtype: int64

It's useful to compare the output from a different column, 
which allows us to understand how missing data affects our ability to
perform statistical tests:

In [84]:
# count the number of each sex for which hindfoot length is available
grouped_data.hindfoot_length.count()

sex
F               14894
M               16476
not reported       68
Name: hindfoot_length, dtype: int64

In this case, we see that missing data for hindfoot length only slightly decreases the number of observations available in each category.
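
To make the effect of missing data explicit, it can help to compare `count()`, which skips missing values, with `size()`, which counts every row in a group. The following is a minimal sketch using a hypothetical `toy_df`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for animal_df
toy_df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "hindfoot_length": [19.0, np.nan, 36.0, 34.0, np.nan],
})

grouped = toy_df.groupby("sex")

# size() counts all rows per group, including those with missing values
n_rows = grouped.size()

# count() counts only the non-missing values in the chosen column
n_measured = grouped["hindfoot_length"].count()

# the difference is the number of missing measurements per group
n_missing = n_rows - n_measured
print(n_missing)
```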

Moreover, we can extract a single sex category from the output above:

In [85]:
# only display one sex (M), from hindfoot_length grouped by sex
grouped_data.hindfoot_length.count().M

16476

This is an example of how you can use various pieces of Python syntax
to ask increasingly specific questions of the data.

It's useful to remember that the command above is synonymous with:

In [86]:
# another way: only display one sex (M), from hindfoot_length grouped by sex
animal_df.groupby("sex")["hindfoot_length"].count()["M"]

16476

This command differs from the previous example because 
it begins by grouping the original data object (`animal_df`),
and applies alternative syntax for identifying columns.

Once you've identified the specific reformatting of data that you need,
you may want to assign the output to a new object:

In [87]:
# save output to object for later use
sex_counts = grouped_data.hindfoot_length.count()
print(sex_counts)

sex
F               14894
M               16476
not reported       68
Name: hindfoot_length, dtype: int64


> #### Exercise: group
Find the min and max hindfoot length for each species in the data set

> #### Exercise: subgroup
Find the number of kangaroo rats in this data set (`genus` equals `'Dipodomys'`). Try to do this once using `len()` and once using `count()`.

> #### Exercise: double group
Write code that will display the number of kangaroo rats in this data set split by sex. Try to do this once using subsetting and once by grouping by two variables.

> #### Exercise: group max
Find the heaviest animal observed in each year  
(Hint: you probably want to use `idxmax` at some point)

## Missing data

The final section in this lesson will help you compare a few ways to handle missing data across a data frame.

We can use `isnull` to show us where missing data are found in the dataset:

In [88]:
# test if value is missing
pd.isnull(animal_df).head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
0,False,False,False,False,False,False,False,False,False
1,False,False,True,True,False,False,False,False,True
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False


Any field showing `True` represents a cell with missing data
(missing data in the original data frame are shown as `NaN`).

The effects of these missing data include multiple complications later in analysis,
including code errors and misinterpreting statistical output.
One basic example of this occurs when attempting to change the data type of the `weight` column 
(from float to integer):

`animal_df.weight = animal_df.weight.astype("int64")`

That code would result in a `ValueError`:
    
`ValueError: Cannot convert non-finite values (NA or inf) to integer`

This means `NA` values are prohibiting data conversion for this column.
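
The sketch below reproduces this error on a small hypothetical series (`weights` is an assumed name, not part of the lesson data). It also shows one possible workaround that this lesson does not otherwise use: pandas' nullable integer type `"Int64"` (note the capital `I`), which can hold missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value
weights = pd.Series([28.0, np.nan, 162.0])

# converting to the standard integer type fails because of the NaN
try:
    weights.astype("int64")
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print(err)

# the nullable integer type keeps the missing value as <NA>
nullable = weights.astype("Int64")
print(nullable)
```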

We can use `dropna` to remove all rows with missing data,
and then use `len` to count how many rows remain:

In [89]:
# extract all rows WITHOUT missing data
len(animal_df.dropna())

30738

In [90]:
# length of original data
len(animal_df)

34786

This leaves us with 30738 of our original 34786 samples.

We'll explore two methods for handling missing data that allow us to retain the specific data we need.

### Replacing data in copied data frame

The first approach to handling missing data is to copy the existing data frame and then replace missing data.

Unlike the case we saw above with `weight_kg` and `weight_lb`, if you assign a dataframe to a new variable, it isn't copied by default. You need to apply the `copy()` method to create another data frame.

In [91]:
# create new copy of data frame
animal_copy = animal_df.copy()
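
To see why `copy()` matters, here is a short sketch on a hypothetical `toy_df`. Plain assignment creates a second name for the same data frame, so changes made through either name show up in both, while a copy can be changed independently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for animal_df
toy_df = pd.DataFrame({"weight": [28.0, np.nan, 162.0]})

alias = toy_df               # a second name for the same object
independent = toy_df.copy()  # a separate data frame

# adding a column through the alias also changes toy_df
alias["flag"] = True
print("flag" in toy_df.columns)  # True

# filling missing values in the copy leaves toy_df untouched
independent["weight"] = independent["weight"].fillna(0)
print(toy_df["weight"].isna().sum())  # 1
```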

Then we can use `pandas` functions to replace `NaN` with a different value. Replacing missing values is known as "imputation", and there are many approaches. A simple one is to use the average value (mean or median):

In [92]:
# replace missing values with mean
mean_weight = animal_df.weight.mean()
animal_copy.weight = animal_copy.weight.fillna(mean_weight)

This approach allows us to retain all observations in the original data frame,
and is one method sometimes used (for example, in machine learning)
to retain as much data as possible.

However, this approach does alter the statistical properties of the remaining data. For example, the variance of the data will be deflated misleadingly.
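
A small worked example can make the variance effect concrete. This sketch uses a hypothetical five-value series rather than the real data: filling the gaps with the mean adds values that sit exactly at the center, so the mean stays the same while the variance shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with two missing values
toy_weight = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan])

# replace missing values with the mean of the observed values
imputed = toy_weight.fillna(toy_weight.mean())

# variance before imputation uses only the three observed values
var_before = toy_weight.var()
# variance after imputation uses all five values
var_after = imputed.var()

print(toy_weight.mean(), imputed.mean())  # 20.0 20.0
print(var_before, var_after)              # 100.0 50.0
```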

More complex approaches exist, including trying to predict the missing data using the available data, and adding noise to better reflect the variance. A full treatment is beyond the scope of this tutorial, but feel free to swing by ACER Data Science Office Hours if you want to discuss more.

### Masking missing data

The second approach to handling missing data is masking,
where we exclude observations where measurements are missing.
We'll explore a few approaches to masking.

The first dataset we'll create has missing data removed for weight.

> Because masking doesn't involve modifying the original data,
only filtering out whole rows and columns,
we don't need to copy a new object.

We can use `dropna` with the `subset` argument to exclude 
missing values for only one column:

In [93]:
# exclude missing data in only weight
weight_complete = animal_df.dropna(subset = ["weight"])

We assigned this to a new variable so we can continue with our filters.

It's also possible to use `isnull` to subset data,
though we don't do so here:

`animal_df[~pd.isnull(animal_df.weight)]`

Once we are satisfied with our filtering,
we may want to retain our data for later.
This could be so we can load it into another program,
or share with another collaborator,
both of which require us to access the data outside our notebook.

We can save our filtered data frame as spreadsheet-style data to a csv file:

In [94]:
# save filtered data to file
weight_complete.to_csv("data/weight_complete.csv", index=False)

The `index=False` argument prevents the row's index values from being printed before the first column.
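
To illustrate the effect of `index=False` without writing a file, the sketch below calls `to_csv` with no path, in which case pandas returns the CSV text as a string. `toy_df` is a hypothetical two-row data frame.

```python
import pandas as pd

# Hypothetical toy data standing in for weight_complete
toy_df = pd.DataFrame({"year": [1983, 1987], "weight": [28.0, 162.0]})

# with no path, to_csv returns the CSV content as a string
with_index = toy_df.to_csv()
without_index = toy_df.to_csv(index=False)

# default: an unnamed index column comes first
print(with_index.splitlines()[0])     # ,year,weight
# index=False: only the real columns are written
print(without_index.splitlines()[0])  # year,weight
```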

You can see the new file in your project directory (in the data subfolder).

Now, we'll move on and use our masking skills to create a different dataframe.

> #### Exercise: filter
Mask (filter out) missing data from `animal_df` for `sex`, `hindfoot_length`, and `weight`. (Hint: check to see what categories exist for the column `sex`).

There are several ways to approach the exercise above.
First, you might try to use `dropna` to remove rows with missing values in multiple columns:

In [95]:
# Drop NaN
animals_reduced = animal_df.dropna(subset = ["sex", "hindfoot_length", "weight"])

While this executes without a problem, 
you can check to see what categories are present in `sex`:

In [96]:
# show categories
pd.unique(animals_reduced.sex)

array(['F', 'M', 'not reported'], dtype=object)

This reveals there are some missing data encoded in a category called `not reported`.
This means we need to add an additional filter to remove that category:

In [97]:
# remove missing values that aren't NaN
animals_reduced = animals_reduced[animals_reduced.sex != "not reported"]

We can check to see that it worked:

In [98]:
pd.unique(animals_reduced.sex)

array(['F', 'M'], dtype=object)
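
An alternative, sketched here under the assumption that `"not reported"` is the only placeholder in the column, is to convert the placeholder to a true missing value first so that a single `dropna` handles everything. The data frame `toy_df` and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for animal_df
toy_df = pd.DataFrame({
    "sex": ["F", "not reported", "M", "F"],
    "weight": [28.0, 40.0, np.nan, 15.0],
})

# work on a copy so the original data are left unchanged
cleaned = toy_df.copy()

# where() keeps values that satisfy the condition and sets the rest to missing
cleaned["sex"] = cleaned["sex"].where(cleaned["sex"] != "not reported")

# one dropna call now removes both kinds of missing data
complete = cleaned.dropna(subset=["sex", "weight"])
print(complete)
```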

There is one remaining manipulation we'll cover with our `animals_reduced` dataset.

We have many different species in this dataset, 
but we would like to retain only those species with a large number of samples collected.
We can apply a filter on this criterion,
but will need to start by counting how many samples for each species exist in our dataset:

In [99]:
# count number of samples for each species
species_counts = animals_reduced.groupby("species").species.count()
species_counts

species
albigula        1045
baileyi         2803
eremicus        1198
flavus          1469
fulvescens        73
fulviventer       38
hispidus         159
intermedius        7
leucogaster      905
leucopus          35
maniculatus      835
megalotis       2417
merriami        9727
montanus           8
ochrognathus      40
ordii           2790
penicillatus    2969
sp.                9
spectabilis     2023
taylori           45
torridus        2081
Name: species, dtype: int64

Because we've used `groupby`,
we'll need to reset the index values so our `species_counts` object is properly formatted:

In [100]:
# reset index to default
species_counts = species_counts.reset_index(name="counts")
species_counts

Unnamed: 0,species,counts
0,albigula,1045
1,baileyi,2803
2,eremicus,1198
3,flavus,1469
4,fulvescens,73
5,fulviventer,38
6,hispidus,159
7,intermedius,7
8,leucogaster,905
9,leucopus,35


Next, we can filter the new object so it only includes species with many cases:

In [101]:
# keep only species with many observations
frequent_species = species_counts[species_counts.counts > 500]
frequent_species

Unnamed: 0,species,counts
0,albigula,1045
1,baileyi,2803
2,eremicus,1198
3,flavus,1469
8,leucogaster,905
10,maniculatus,835
11,megalotis,2417
12,merriami,9727
15,ordii,2790
16,penicillatus,2969


Now that we have a collection of species to keep,
we can apply another filter to keep only species appearing in `frequent_species`:

In [102]:
# extract values for frequently occurring species
animals_reduced = animals_reduced[animals_reduced["species"].isin(frequent_species.species)]
animals_reduced.head()

Unnamed: 0,year,sex,hindfoot_length,weight,genus,species,taxa,plot_type,weight_oz
0,1983,F,19.0,28.0,Onychomys,torridus,Rodent,Long-term Krat Exclosure,0.987654
2,1987,F,32.0,162.0,Neotoma,albigula,Rodent,Control,5.714286
3,1995,M,36.0,44.0,Dipodomys,merriami,Rodent,Control,1.552028
4,2002,F,23.0,15.0,Chaetodipus,penicillatus,Rodent,Spectab exclosure,0.529101
5,2002,F,22.0,18.0,Chaetodipus,penicillatus,Rodent,Spectab exclosure,0.634921


This compares the species values in `animals_reduced` to those in `frequent_species`,
keeping only the types that occur in the latter.
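
The count, filter, and `isin` steps can also be collapsed into one. The sketch below, on a hypothetical `toy_df` with an illustrative threshold of 2, uses `transform("count")` to attach each species' count to every row, then filters on it directly. It is one possible shortcut, not the approach used in this lesson.

```python
import pandas as pd

# Hypothetical toy data standing in for animals_reduced
toy_df = pd.DataFrame({
    "species": ["merriami", "merriami", "merriami", "sp.", "baileyi", "baileyi"],
})

# two-step approach: count per species, then keep rows whose species is frequent
counts = toy_df.groupby("species")["species"].count()
frequent = counts[counts >= 2].index
two_step = toy_df[toy_df["species"].isin(frequent)]

# one-step approach: transform returns one count per original row
per_row_count = toy_df.groupby("species")["species"].transform("count")
one_step = toy_df[per_row_count >= 2]

print(two_step.equals(one_step))  # True
```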

Finally, we can also write these data to a file:

In [103]:
# write data to csv
animals_reduced.to_csv("data/animals_reduced.csv", index=False)

> #### Exercise: tally, filter, and save
Extract only rows for genera (genuses) that are observed at least 1000 times, and save them to a file called "genera_reduced.csv"

# Plotting Data

In [None]:
import matplotlib.pyplot as plt

In [None]:
# when plotting we probably want to do some transformation to get the plot we want

# Drop NaN
animals_reduced = animal_df.dropna(subset = ["sex", "hindfoot_length", "weight"])

# remove missing values that aren't NaN
animals_reduced = animals_reduced[animals_reduced.sex != "not reported"]

# group by species and count
species_counts = animals_reduced.groupby("species").species.count()

# rename count index
species_counts = species_counts.reset_index(name="counts")


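# line plot of the number of observations for each species
plt.plot(species_counts.species, species_counts.counts, label="observations")
plt.xlabel("Species")
plt.ylabel("Number of observations")
plt.title("Observations per species")
plt.legend();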



In [None]:
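# improvements: a bar chart of only the frequently observed species

# keep only species with many observations
frequent_species = species_counts[species_counts.counts > 500]

plt.bar(frequent_species.species, frequent_species.counts, label="observations")
plt.xlabel("Species")
plt.ylabel("Number of observations")
plt.title("Observations per frequently observed species")
plt.legend()
# rotate the tick labels so the species names don't overlap
plt.xticks(rotation=45, ha="right");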