# September 11

We're going to change up the schedule a little bit. This week, we'll cover some concepts that I saw in your datasets that weren't initially on the schedule:

- Splitting a column
- Replacing text
- Scaling values
- Grouping data

# Splitting a column

A challenge I saw in some of your datasets was that more than one piece of information had been put into a single column: for example, species and sex in one column. This is often a choice made for convenience at some point, but it limits the types of analysis we can perform with the data. Load in the species.csv file from the `../data` directory.

In [None]:
import pandas as pd
species = pd.read_csv("../data/species.csv")

In [None]:
species

Print the dataset to the screen. It looks generally OK. But if you were to try to look at some facet of the data (say, how many species of each genus exist in the dataset), you couldn't! One of the points in the [data in spreadsheets](https://peerj.com/preprints/3183/) reading was that we should keep one piece of information in each column for this exact reason.

Because of this, we might want to think about splitting some of these columns up into two different columns. We are going to iteratively build up the column splitting command. We'll start simple, with the "genus-species" column. 

In [None]:
species['genus'] = species['genus-species'].str.split(' ')


*Before* you view the results, talk with a partner about what you think the command did. Then view it. Is this what we want? 

How could we add these as columns to the dataframe? Take a moment and try a couple things.

In [None]:
# Put your answer here
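If you get stuck, here is one possible approach, sketched on a small made-up dataframe rather than on `species` itself. The rows in `toy` are invented for illustration; the key idea is the `expand=True` argument, which asks `str.split` to return one column per piece instead of a list in each cell.

```python
import pandas as pd

# Invented example rows; the real species.csv may look different.
toy = pd.DataFrame({"genus-species": ['Dipodomys "merriami"', 'Neotoma "albigula"']})

# expand=True returns a dataframe with one column per piece of the split.
parts = toy["genus-species"].str.split(" ", expand=True)

# Attach each piece to the original dataframe under a descriptive name.
toy["genus"] = parts[0]
toy["species"] = parts[1]

print(toy)
```

Without `expand=True`, each cell would hold a Python list such as `['Dipodomys', '"merriami"']`, which is harder to work with.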

This looks sort of like what we want, yes? We have two new columns, genus and species. They contain roughly the data that we think they ought to. But we still have some problems: the species names contain extra characters, and we still have a "genus-species" column. First, let's get rid of the excess column.

In [None]:
# The below will require you to fill in the column to drop and 
# axis for it to work

species = species.drop("", axis=1)

Good. Now we don't have extra data in our dataframe. Next, we're going to clean out the extra characters.

In [None]:
# .str.replace works inside each cell; plain .replace only matches whole cell values
species['species'] = species['species'].str.replace('"', '')

Take a look at the results. Has this done what we want? Pretty much. But some of you have more complex replacement issues. For example, the Piller students have a dataset in which all the measurements are in millimeters. Except a few cells, which are in CM, and are denoted as being in CM. In this case, when we encounter the "problem" condition, we want to do something special with the value.
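Before moving on, a note on the quote-stripping step: pandas has two similar-looking methods, `Series.replace` and `Series.str.replace`, and they behave differently. The sketch below uses an invented Series called `names` to show the difference; it is an illustration, not part of the species data.

```python
import pandas as pd

# Invented values: one quoted name, one cell that is only a quote, one clean name.
names = pd.Series(['"merriami"', '"', 'albigula'])

# Series.replace swaps cells whose ENTIRE value equals the target.
whole_cell = names.replace('"', '')

# Series.str.replace removes the target wherever it occurs inside each cell.
within_cell = names.str.replace('"', '', regex=False)

print(whole_cell.tolist())
print(within_cell.tolist())
```

If the goal is to strip characters out of longer strings, the `.str` version is usually the one to reach for.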

## Using loops and replacements together

As I said above, sometimes we want to do more than a simple replace; we want to do some other work with the value. For example, when the Piller students encounter a "cm", they want to remove it. That's easy to do with replace. But they also want to convert the value to mm so the whole column is in the same units. This isn't the easiest thing, but it isn't the hardest, either. Below, I've made four bullet points. Each corresponds to a step that we need to do to accomplish the task.

- Locate the "problem values"
- Remove the "cm" characters
- Convert the remaining text to a number
- Multiply by 10 to get mm.

But first, without a computer, describe to your partner how to get to my office. Or yours. Someone's office. Where will they turn? How will they know when to turn?

With this in mind, what are the four steps to make this task happen?

These data are a little different than the ones we've been working with. They are in Excel format, which means we need to use a different function to read them. What are these arguments, skiprows and header, doing?

In [None]:
import pandas as pd

messy = pd.read_excel("../data/messy data set.xlsx", skiprows=0, header=1)
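One way to explore the `skiprows` and `header` question is with a tiny in-memory file. The sketch below uses `pd.read_csv`, which accepts the same two arguments, on invented text with a title line above the real column names. The contents of `raw` are made up and are not from the Piller dataset.

```python
import io
import pandas as pd

# Invented file contents: a title line, then the real header, then data.
raw = "\n".join([
    "Measurements collected in the field",
    "id,length",
    "1,10",
    "2,12",
])

# Skip nothing, and treat line 1 (counting from zero) as the column names.
a = pd.read_csv(io.StringIO(raw), skiprows=0, header=1)

# Alternative spelling: skip the title line, then use the first remaining line.
b = pd.read_csv(io.StringIO(raw), skiprows=1, header=0)

print(a)
```

In this sketch both calls give the same dataframe, because `header` is counted from the lines that remain after `skiprows` has been applied.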

They also have some extra stuff at the end - let's just grab the rows that correspond to actual specimens.

In [None]:
messy = messy[0:30]

Isolating individual problem values can be tricky. First, we will isolate the rows containing problem values. The `str.contains()` method flags the cells of a column whose text contains the problem value anywhere in it. The argument `na=False` tells pandas to treat missing data as "no match" so that we can simply ignore it.

In [None]:
sample = messy[messy['dorsal fin origin to pectoral origin'].str.contains("cm", na=False)]
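To see what the mask inside those square brackets looks like, here is a small sketch on an invented Series called `lengths` (the values are made up, not taken from the messy dataset).

```python
import numpy as np
import pandas as pd

# Invented measurements stored as text, with one missing value.
lengths = pd.Series(["12", "1.3cm", np.nan, "15"])

# True where the text contains "cm"; missing values count as False.
mask = lengths.str.contains("cm", na=False)

# Using the mask as an index keeps only the flagged entries.
problem = lengths[mask]

print(mask.tolist())
print(problem)
```

The mask is a column of True/False values, and indexing with it keeps only the rows marked True.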


Next, for that row, we locate the individual cells that contain the problem value, and for each cell, replace it with the problem value removed. At the same time, we convert to numeric and multiply by 10. 

In [None]:
col = 'dorsal fin origin to pectoral origin'
if sample[col].str.contains("cm", na=False).any():
    sample = sample.copy()  # work on a copy so pandas does not warn about editing a slice
    sample[col] = sample[col].str.replace("cm", "").str.strip()  # remove the unit label
    sample[col] = pd.to_numeric(sample[col]) * 10  # text -> number, then cm -> mm
    

Then, we reinsert the value in the original dataframe.

In [None]:
messy.loc[12] = sample.loc[12]

This was really tricky. Exceptionally tricky. But when we break it down into smaller pieces, it's not so bad.
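For comparison, the four steps can also be sketched without pulling rows out and putting them back. The example below works on an invented Series called `lengths`, not on the messy dataset, and is one possible approach rather than the only one.

```python
import pandas as pd

# Invented measurements: most in mm, two labelled as cm, one missing.
lengths = pd.Series(["12", "1.3cm", None, "15", "2cm"])

# Step 1: locate the problem values.
is_cm = lengths.str.contains("cm", na=False)

# Step 2: remove the "cm" characters.
cleaned = lengths.str.replace("cm", "", regex=False)

# Step 3: convert the text to numbers (missing values become NaN).
numeric = pd.to_numeric(cleaned)

# Step 4: multiply only the flagged values by 10 to get mm.
in_mm = numeric.where(~is_cm, numeric * 10)

print(in_mm)
```

Here `where` keeps the original number wherever the condition `~is_cm` is True and substitutes the converted value everywhere else.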

# Grouping

We might have some _a priori_ idea about what substructure exists in our data. For example, in the species dataset:

In [None]:
species

we have the taxa column, which tells us to which higher-order taxon the individuals we trapped might belong. We might be interested in getting a quick picture or count of these individuals. To look at this issue, we can use the groupby function.

In [None]:
species.groupby('taxa')

Groupby has some nice properties for getting speedy peeks at data. How large is the membership to each group?

In [None]:
species.groupby('taxa').size()

In [None]:
number_of = species.groupby('taxa').size()
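As a small illustration of what `size()` gives back, here is a sketch on an invented dataframe called `toy`. The taxa and genera in it are made up for the example and are not a summary of species.csv.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "taxa": ["Rodent", "Rodent", "Bird", "Rodent", "Bird", "Reptile"],
    "genus": ["Dipodomys", "Dipodomys", "Amphispiza",
              "Neotoma", "Zonotrichia", "Sceloporus"],
})

# groupby on its own only describes the grouping; nothing is computed yet.
grouped = toy.groupby("taxa")

# size() counts the rows in each group and returns a Series indexed by group.
sizes = grouped.size()

print(sizes)
```

The result is indexed by group name, so `sizes["Rodent"]` looks up a single count.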

Or, we could take a look at what unique genera are represented in each taxon type. To do this, when we group on the 'taxa' column, we only select the genus column.

In [None]:
uniq_genera = species.groupby('taxa')['genus'].unique()

In [None]:
for genus_set in uniq_genera:
    print(genus_set)

We could combine these two operations to find out, for each taxon, how many rows repeat a genus that has already been listed (the group size minus the number of unique genera). We can do this using the `zip` function, which allows us to co-iterate over two lists.

In [None]:
for genus_set, numb in zip(uniq_genera, number_of):
    print(numb - len(genus_set))
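The same comparison can be sketched on an invented dataframe so the arithmetic is easy to check by hand. The `toy` data below is made up; the last line shows a possible alternative that uses `nunique()` instead of a loop.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "taxa": ["Rodent", "Rodent", "Bird", "Rodent", "Bird", "Reptile"],
    "genus": ["Dipodomys", "Dipodomys", "Amphispiza",
              "Neotoma", "Zonotrichia", "Sceloporus"],
})

uniq = toy.groupby("taxa")["genus"].unique()   # array of unique genera per taxon
counts = toy.groupby("taxa").size()            # number of rows per taxon

# zip walks through the group names, the genus arrays, and the counts together.
repeats = {taxon: int(n - len(genera))
           for taxon, genera, n in zip(uniq.index, uniq, counts)}
print(repeats)

# Alternative without a loop: subtract the number of unique genera directly.
repeats_alt = counts - toy.groupby("taxa")["genus"].nunique()
```

Both results line up by taxon because `groupby` returns the groups in the same sorted order each time.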

# Challenge

Load in the surveys dataframe. In the surveys dataframe, group based on the sex of the animal. 

- Can you get a list of unique genera for both males and females? 
- Are these lists the same (i.e., are there any genera represented in the males but not females)? 
- BONUS CHALLENGE: Have a look at automating that second challenge. Try:

```
sorted(grouped[0])
```

Can you use this command to spot the outlier a little easier? What if you combined sorting and Boolean comparisons?