# August 21

Today, we will be orienting ourselves to the Jupyter Notebook, learning about Python, and looking at our first data sets.

## The Jupyter Notebook

The interface in front of you is called a Jupyter Notebook. This cell that you're reading now is a _Markdown cell_. These are used to hold _text_ information. I will use them to communicate with you throughout the course. We can even embed images in the Markdown. If you double-click in this cell, you will see that this is plain text. The plain text characters are transformed into attractive text by the Jupyter Notebook.


In [None]:
# This is a code cell.

import random

# It holds code. 

random_number = random.randint(1, 10)

# It can also hold text, in the form of comments. Comments are helpful hints to yourself. Jupyter will not execute
# the comments. Leave as many as you want!

print(random_number)


When a cell is "run" by pressing the run button, it is executed in a manner appropriate for its type. If it is a code cell, the code will be run. If it is a Markdown cell, the text will be rendered.

# What is Python?

Python is an open-source, free-to-use programming language. *Open source* is desirable because we can look at the source code of any function in Python and understand how it works. *Free*: we all understand why that is good.

Python is actively maintained by the Python Software Foundation, and is rapidly becoming one of the world's most commonly used languages.

![Python Popularity](img/pythondominance.png)

You find Python in virtually all fields, and all career paths.

Python is also easy to read. Without knowing any Python, look at the below cell. See if you can figure out what it will do, then run it to see if you're right.

In [None]:
num_list = [1, 2, 3, 4]
new_list = []

for entry in num_list:
    new_list.append(entry*2)

print(new_list)

Were you able to guess correctly? Python is written in such a way that it mimics human speech and writing. 

Python also has an active user community whose members contribute packages and workflows to the language. For example, I use the Python library DendroPy almost daily in my work. It is for working with phylogenetic data in Python.

# Ask for help when you need it
# I'm not joking around
# This class is a little different than others, in that we don't have many throwaway moments when you learn a fact, use it on a test, then maybe never use it again
# If you don't get it now, it might be a problem later, and we'll work on it. Now.

Seriously, y'all, just ask. We'll get it worked out.

# Data types in Python

In the first couple weeks, we will be working with datasets in an interactive way. But first, we should learn a little bit about how Python works. One of the most common operations to do programmatically is save data to a variable. A _variable_ is a little bit of space we clear in the computer's memory. We can fill it with information, and give it a handle to recall it later. See below.

In [None]:
my_text = "This is a string variable"
# Strings are variables that are meant to be read literally as they are seen above. Often, they are text.
# You know a string because it will be encased in quotation marks
# Enter the name of the variable to view it.
my_text

In [None]:
my_number = 64
# This is an integer value
my_decimal = 1.64
# This number has a decimal

The kind of variable you create dictates some of the things you can do with it. Do you think `my_number` and `my_decimal` are the same kind of variable? Run the below code to find out.

In [None]:
type(my_decimal)

In [None]:
type(my_number)

"Float" - what does that mean? Floats are stored differently in the computer's memory than integers are, and saving whole numbers as integers can mean programs take less memory to run.
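As a small illustration of that difference, here is a sketch using made-up values (`whole` and `with_decimal` are names invented for this example, not variables from the cells above). It shows that a whole number and its decimal twin compare as equal but have different types, and that floats can carry tiny rounding errors because of how they are stored:

```python
whole = 3            # an integer
with_decimal = 3.0   # a float holding the same value

print(type(whole))
print(type(with_decimal))
print(whole == with_decimal)   # the values match even though the types differ

# Floats are stored in binary, so some decimal values cannot be stored exactly.
total = 0.1 + 0.2
print(total)
print(total == 0.3)
```

Run it and look closely at the last two lines of output. This is one reason to keep whole numbers as integers when you can.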

The "type" refers to the kind of variable something is. This can influence what operations you can do with that variable. For example:

In [None]:
round(my_decimal)

What does round() do? What does it do if you call it on `my_number`? 

`round()` is a function. So is `type()`. We can think of functions like organs in our body - they are sets of code that work together to accomplish some task. You can recognize that you are calling a function by the presence of the open and close parentheses. Functions have help available via the help function.

In [None]:
help(round)

There are more data types out there, but we'll start with these. Most of the data we will work with in this course will be of these three types - integers, floats and strings.

# Operators in Python

Python uses what are likely to be familiar operators: `+, -, /, *, %`

Try using each of these operators to combine `my_number` and `my_decimal`. What behaviors make sense? Which are hard to understand? To try using operators, first make a new code cell by clicking the `+` button above. Then, enter the expression you would like to evaluate.
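If you want a pattern to copy, here is a minimal sketch with two made-up integers (`a` and `b` are just for illustration). Your results with `my_number` and `my_decimal` will look different, so still try those yourself:

```python
a = 7
b = 2

print(a + b)   # addition
print(a - b)   # subtraction
print(a * b)   # multiplication
print(a / b)   # division: the result is a float, even for two integers
print(a % b)   # modulo: the remainder left over after dividing a by b
```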



We can also use what are called comparison operators. These operators, `<, >, ==, !=, <=, >=`, evaluate objects relative to one another. Once again, create a new cell and try each operator to compare `my_number` and `my_decimal`. What is each one doing?
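Each of these operators hands back a Boolean value, either `True` or `False`. The sketch below uses two made-up numbers (`small` and `large` are illustrative names only) and shows that the answer can be saved to a variable like any other value:

```python
small = 2
large = 7

print(small < large)    # is small less than large?
print(small == large)   # are the two values equal?
print(small != large)   # are the two values different?

# The answer to a comparison is itself a value we can store.
is_bigger = large >= small
print(is_bigger)
print(type(is_bigger))
```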

# Groups of Objects

## Lists

How often do you want to sit down and hand-enter data? Basically never. For the purpose of storing more massive sets of objects, we have lists. Lists are _ordered_, meaning that they are stored in the same order in the computer's memory as when you enter them.

In [None]:
my_number_list = [1, 2, 3, 4, 5]

my_number_list[2]

Did you note something odd, there? What happens if you try to access the first element of the `my_number_list`? 

Lists can also be added to:

In [None]:
my_number_list.append(6)

What if you want to add something at a certain position in the list? Use the help function to view the help file for append(). 

In [None]:
help(append)

What has gone wrong? What if you wanted to view the help file for `my_number_list`? Try to view it. As you type out `my_number_list`, hit the tab key.
 
When you have figured out how to view the help function, see if you can find a way to insert a new entry at some point in the list. Flag me down when you think you have it.
 
A loop can be used to access the data `iteratively`. A common type of loop is the `for loop`, which does some operation to every item in a list:
 
 

In [None]:
for item in my_number_list:
    print(item)

 
Try removing the indentation in the loop. Does this code run?
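A `for` loop can also build up a result as it goes. The sketch below uses a made-up list (`weights` is hypothetical, not course data) to keep a running total. The indented lines run once per item, and the unindented line at the end runs only once, after the loop finishes:

```python
weights = [10, 20, 30]   # hypothetical values for illustration
total = 0

for weight in weights:
    # Everything indented under the for line is repeated for each item.
    total = total + weight
    print("running total:", total)

# This line is not indented, so it runs after the loop is done.
print("final total:", total)
```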
 
## Dictionaries

A dictionary is a container that holds pairs of objects - keys and values.

In [None]:
translation = {'one': 1, 'two': 2}
translation['one']


The first item in each pair is the "key", which we use to access data. The second is the value. Unlike lists, dictionaries are _not_ indexed by position. Try to index one in the way you indexed a list.


This happens because Python expects you to use the keys to access data. We can, however, loop over our dictionary.

In [None]:
for key, value in translation.items():
    print(key, " unlocks ", value)

What we have done is just loop over _multiple_ values. 
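You do not have to loop over both parts of each pair at once. As a sketch, with a made-up dictionary (`colors` is hypothetical), we can loop over only the keys or only the values, and we can ask whether a key exists before using it:

```python
colors = {'red': 1, 'blue': 2}   # hypothetical dictionary for illustration

# Looping over a dictionary directly gives the keys.
for key in colors:
    print(key)

# .values() gives only the values.
for value in colors.values():
    print(value)

# The "in" keyword checks whether a key is present.
print('red' in colors)
print('green' in colors)
```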

Now, let's try reassigning a value in the dictionary. See if you can figure out how to reassign the value of "one". Re-run the loop above to see if it has successfully been reassigned. 

Most data structures in Python support some reassignment. See below, as we assign value to a variable, then change it up.

In [None]:
a = 5
print(a)
a = "five"
print(a)

## Functions

So far, we have used functions. But Python also allows us to make functions. We often do this for three reasons:
- Organization: Code that is in functions is easier to read. Imagine opening a book with no paragraphs. How hard to read would that be?

- Modularity: If we have functions that take specified inputs and return specified outputs, we can test that each input produces the output we expect.

- Reusability: If we perform the same analysis frequently, functions can be packaged to make them easier to reuse and disseminate. 

You can see the structure of a function below:

In [None]:
def do_multiplication(a, b):
    product = a * b
    return product

do_multiplication(my_decimal, my_number)

Try:

- Explaining this function in words to your neighbor.
- Removing the return statement. What does this do?
- Saving the result of the function to a variable.
- Bonus: What happens if, instead of `my_decimal` and `my_number`, you enter a string? How could you guard against this outcome?
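The Modularity point from the list of reasons can be sketched with a small check. `average_of_two` is a hypothetical function made up for this example. Because it has clear inputs and a clear output, we can use `assert` to confirm that known inputs give the answers we expect. An `assert` stays silent when the check passes and stops with an error when it fails:

```python
def average_of_two(a, b):
    """Return the average of two numbers."""
    average = (a + b) / 2
    return average

# Known inputs with answers we can work out by hand.
assert average_of_two(2, 4) == 3.0
assert average_of_two(1, 2) == 1.5

result = average_of_two(2, 4)
print("All checks passed; the average is", result)
```

This is only one way to test a function, but it is a handy habit when you start writing your own.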


# Potential Data Sets

We will all work on one dataset in the classroom. The dataset we will work on together is called the [Portal Mammal dataset](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/15-2115.1), and it was collected on the northern edge of the Chihuahuan Desert over the past 40 years. We will use these data for classroom activities.

There are two other "project" datasets that you can use for the projects in this class. One is the Ant Morphology database. This is some of my research data. The other is a dataset from a [long-term](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3461117/) evolution study of _E. coli_ bacteria, looking at the evolution of citrate utilization mutations. 

For today, though, we will all be looking at the same data. First, let's look at the data in Google Drive.


In [None]:
import pandas as pd

pd.read_csv("../data/surveys.csv")

OK, what just happened? 

A *library* is a set of Python functions that are packaged for reuse. We loaded the library pandas. pandas is a set of functions to visualize, clean and load datasets.

In this case, we used the function `read_csv()` to read in the datafile "../data/surveys.csv". This is the exact same datafile we just looked at in Google Drive. 

We called a _function_ (`read_csv`) from a _library_ (pandas), and told it what _file_ ("../data/surveys.csv") we want to read.
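To see the same steps on a scale you can read in full, here is a sketch that reads a tiny, made-up CSV. The three rows and the species codes are invented for illustration and are not from the Portal data; `io.StringIO` simply lets pandas treat a block of text as if it were a file:

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = """record_id,species_id,weight
1,AA,40
2,BB,48
3,AA,44
"""

# read_csv turns the comma-separated text into a table.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df)
print(type(toy_df))
```

The first line of the text became the column names, and each later line became one row.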

Can we use these data? Save these data into the computer's memory as `surveys_df`.

## Dataframes

Dataframes are data structures that are arranged in row by column format. The data can be any of the types we discussed. How many rows are in the dataframe? How many columns? Look above to see if you can find the information, and then try to run the below code to get that same information.

In [None]:
surveys_df.shape

Here are a few exploratory attributes and methods:

- `surveys_df.dtypes`
- `surveys_df.columns`
- `surveys_df.head()`
- `surveys_df.tail()`

What does each of these do? What information can you obtain? 

From the list of columns, let's try to get the number of unique species that were seen:


In [None]:
pd.unique(surveys_df['species_id'])

Save this to a variable. Now, use the len() function to find out how many unique species the data collectors saw.

## Simple stats

We can get some quick and dirty stats out of pandas, as well. 

In [None]:
surveys_df['weight'].describe()

In [None]:
# We can access individual summary stats, too
surveys_df['weight'].min()
surveys_df['weight'].max()
surveys_df['weight'].mean()
surveys_df['weight'].std()
surveys_df['weight'].count()
# Why did we only see one?

We often want to do more sophisticated groupings of data. For example, we may believe that sex is an important phenotypic variable. Perhaps we believe male rodents are bigger than female, for example. 

We can, then, create a grouped object.

In [None]:
# Group data by sex
grouped_data = surveys_df.groupby('sex')

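grouped_data.describe()
# numeric_only=True skips text columns such as species_id, which recent pandas versions refuse to average
grouped_data.mean(numeric_only=True)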


How does this differ from our prior output? Try grouping on two columns.
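If grouping still feels abstract, here is a sketch on a table small enough to check by hand. The four rows are made up for illustration and are not from the survey data. pandas splits the rows by the value in `sex`, then computes the mean weight within each group:

```python
import pandas as pd

# Hypothetical four-row table, small enough to verify by hand.
toy_df = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M'],
    'weight': [40, 50, 44, 54],
})

# Split the rows by sex, then average the weight column within each group.
mean_by_sex = toy_df.groupby('sex')['weight'].mean()
print(mean_by_sex)
```

Work out the two averages on paper first, then run the cell to compare.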

In [None]:
# What does this code do? See if you can verbally explain it. Then, run the cell to check.

species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)

In [None]:
# what about this code?

surveys_df['weight']*2

# How will you check what it does?

In [None]:
# Make sure figures appear inline in the Jupyter Notebook
%matplotlib inline
# Create a quick bar chart
species_counts.plot(kind='bar');


Let's end here for today. We:

- Learned the basic types of objects in Python
- Learned about functions
- Hopefully learned not to be afraid of programming
- Made statistics! And plots!

For Wednesday, please:

- Take a look at [this](https://peerj.com/preprints/3183/) paper. You do not need to do the Excel exercises. Just become familiar with some basic sources of error in spreadsheets.