Merged
8 changes: 4 additions & 4 deletions _episodes/00-short-introduction-to-Python.md
@@ -180,7 +180,7 @@ another_tuple = ('blue','green','red')
a_list = [1,2,3]
```

> ### Challenge
> ## Challenge - Tuples
> 1. What happens when you type `a_tuple[2]=5` vs `a_list[1]=5` ?
> 2. Type `type(a_tuple)` into python - what is the object type?
>
@@ -242,9 +242,7 @@ or
>>>
```

> ## Challenge
>
> Can you do reassignment in a dictionary? Give it a try.
> ## Challenge - Can you do reassignment in a dictionary?
>
> 1. First check what `rev` is right now (remember `rev` is the name of our dictionary).
>
@@ -264,6 +262,8 @@ sequence of their items (i.e. the order in which key:value pairs were added to
the dictionary). Because of this, the order in which items are returned from loops
over dictionaries might appear random and can even change with time.
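
The paragraph above describes loop order over dictionaries as unpredictable. As a hedged aside: from Python 3.7 onward the language guarantees that dictionaries preserve insertion order, so on a recent interpreter a loop returns keys in the order they were added (which is still not sorted order). A minimal sketch, using a made-up dictionary and not the lesson's actual `rev` contents:

```python
# Hypothetical dictionary for illustration only
rev = {}
rev[3] = 'three'
rev[1] = 'one'
rev[2] = 'two'

# On Python 3.7+ iteration follows insertion order, not sorted key order
keys_in_loop_order = [key for key in rev]
print(keys_in_loop_order)

# Reassigning an existing key changes its value but keeps its position
rev[1] = 'uno'
print(list(rev.items()))
```

On older interpreters (before 3.6) the order may differ, so code that needs a specific order should sort the keys explicitly, for example with `sorted(rev)`.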



## Functions

Defining part of a program in Python as a function is done using the `def`
327 changes: 167 additions & 160 deletions _episodes/01-starting-with-data.md
@@ -223,18 +223,20 @@ information in the parens to control behaviour.

Let's look at the data using these.

## Challenges
> ## Challenge - DataFrames
>
> Using our DataFrame `surveys_df`, try out the attributes & methods below to see
> what they return.
>
> 1. `surveys_df.columns`
> 2. `surveys_df.shape` Take note of the output of `shape` - what format does it
> return the shape of the DataFrame in?
>
> HINT: [More on tuples, here](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences).
> 3. `surveys_df.head()` Also, what does `surveys_df.head(15)` do?
> 4. `surveys_df.tail()`
{: .challenge}

Using our DataFrame `surveys_df`, try out the attributes & methods below to see
what they return.

1. `surveys_df.columns`
2. `surveys_df.shape` Take note of the output of `shape` - what format does it
return the shape of the DataFrame in?

HINT: [More on tuples, here](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences).
3. `surveys_df.head()` Also, what does `surveys_df.head(15)` do?
4. `surveys_df.tail()`

## Calculating Statistics From Data In A Pandas DataFrame

@@ -275,13 +277,14 @@ array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)
```
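
The array above lists each distinct value in a column once. The sketch below reproduces that on a small made-up table; `toy_df` and its values are assumptions standing in for `surveys_df`, so the output will not match the real survey data.

```python
import pandas as pd

# Made-up stand-in for surveys_df; values are illustrative only
toy_df = pd.DataFrame({
    'species_id': ['NL', 'DM', 'DM', 'PF', 'NL'],
    'plot_id': [2, 3, 2, 7, 3],
})

# pd.unique returns a NumPy array of distinct values in order of first appearance
unique_species = pd.unique(toy_df['species_id'])
print(unique_species)

# Series.nunique returns the number of distinct values as an integer
n_species = toy_df['species_id'].nunique()
print(n_species)
```

Note that `pd.unique` hands back an array, while `nunique` is a method on a pandas Series and returns a single count.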

## Challenges

1. Create a list of unique plot IDs found in the surveys data. Call it
`plot_names`. How many unique plots are there in the data? How many unique
species are in the data?

2. What is the difference between `len(plot_names)` and `plot_names.nunique()`?
> ## Challenge - Statistics
>
> 1. Create a list of unique plot IDs found in the surveys data. Call it
> `plot_names`. How many unique plots are there in the data? How many unique
> species are in the data?
>
> 2. What is the difference between `len(plot_names)` and `plot_names.nunique()`?
{: .challenge}

# Groups in Pandas

@@ -358,32 +361,35 @@ M 29.709578 42.995379
The `groupby` command is powerful in that it allows us to quickly generate
summary stats.
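
As a rough illustration of how `groupby` produces per-group summaries, here is a self-contained sketch. The table `toy_df` and its numbers are invented for the example and are not taken from the survey data.

```python
import pandas as pd

# Invented data: five animals with a sex and a weight
toy_df = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M', 'F'],
    'weight': [40.0, 50.0, 44.0, 46.0, 42.0],
})

# Split the rows into one group per value of 'sex'
by_sex = toy_df.groupby('sex')

# Summarise a single column within each group
mean_weight = by_sex['weight'].mean()
counts = by_sex['weight'].count()

print(mean_weight)
print(counts)
```

Selecting one column before calling `mean()` or `count()` keeps the result to a single Series indexed by group, which is usually easier to read than a summary of every column.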

# Challenge

1. How many recorded individuals are female `F` and how many male `M`?
2. What happens when you group by two columns using the following syntax and
then grab mean values:
- `sorted_data2 = surveys_df.groupby(['plot_id','sex'])`
- `sorted_data2.mean()`
3. Summarize weight values for each plot in your data. HINT: you can use the
following syntax to only create summary statistics for one column in your data
`by_plot['weight'].describe()`


Did you get #3 right? **A Snippet of the Output from challenge 3 looks like:**

```
plot
1 count 1903.000000
mean 51.822911
std 38.176670
min 4.000000
25% 30.000000
50% 44.000000
75% 53.000000
max 231.000000
...
```
> ## Challenge - Summary Data
>
> 1. How many recorded individuals are female `F` and how many male `M`?
> 2. What happens when you group by two columns using the following syntax and
> then grab mean values:
> - `sorted_data2 = surveys_df.groupby(['plot_id','sex'])`
> - `sorted_data2.mean()`
> 3. Summarize weight values for each plot in your data. HINT: you can use the
> following syntax to only create summary statistics for one column in your data
> `by_plot['weight'].describe()`
>
>
>> ## Did you get #3 right?
>> **A Snippet of the Output from challenge 3 looks like:**
>>
>> ```
>> plot
>> 1 count 1903.000000
>> mean 51.822911
>> std 38.176670
>> min 4.000000
>> 25% 30.000000
>> 50% 44.000000
>> 75% 53.000000
>> max 231.000000
>> ...
>> ```
> {: .solution}
{: .challenge}

## Quickly Creating Summary Counts in Pandas

@@ -413,13 +419,12 @@ calculated from our data.
surveys_df['weight']*2
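
Arithmetic on a column is applied element by element. The sketch below shows this on a short, made-up Series (an assumption, not the real `weight` column), including how a missing value behaves.

```python
import pandas as pd

# Made-up weights; the second entry is missing on purpose
weights = pd.Series([40.0, None, 52.0], name='weight')

# Multiplication is applied to every element; missing values stay missing
doubled = weights * 2
print(doubled)
```

The original Series is left unchanged; assign the result to a new name or a new column if it needs to be kept.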


## Another Challenge

1. What's another way to create a list of species and associated `count` of the
records in the data? Hint: you can perform `count`, `min`, etc functions on
groupby DataFrames in the same way you can perform them on regular
DataFrames.

> ## Challenge - Make a list
>
> What's another way to create a list of species and associated `count` of the
> records in the data? Hint: you can perform `count`, `min`, etc functions on
> groupby DataFrames in the same way you can perform them on regular DataFrames.
{: .challenge}

# Quick & Easy Plotting Data Using Pandas

@@ -441,112 +446,114 @@ total_count = surveys_df['record_id'].groupby(surveys_df['plot_id']).nunique()
total_count.plot(kind='bar');
```

# Challenge Activities

1. Create a plot of average weight across all species per plot.
2. Create a plot of total males versus total females for the entire dataset.


# Summary Plotting Challenge

Create a stacked bar plot, with weight on the Y axis, and the stacked variable
being sex. The plot should show total weight by sex for each plot. Some
tips are below to help you solve this challenge:

* [For more on Pandas plots, visit this link.](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.core.groupby.DataFrameGroupBy.plot.html)
* You can use the code that follows to create a stacked bar plot but the data to stack
need to be in individual columns. Here's a simple example with some data where
'a', 'b', and 'c' are the groups, and 'one' and 'two' are the subgroups.

```
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
```

shows the following data

```
one two
a 1 1
b 2 2
c 3 3
d NaN 4
```

We can plot the above with

```
# plot stacked data so columns 'one' and 'two' are stacked
my_df = pd.DataFrame(d)
my_df.plot(kind='bar',stacked=True,title="The title of my graph")
```

![Stacked Bar Plot](../fig/stackedBar1.png)

* You can use the `.unstack()` method to transform grouped data into columns
for plotting. Try running `.unstack()` on some DataFrames above and see
what it yields.

Start by transforming the grouped data (by plot and sex) into an unstacked layout, then create
a stacked plot.


## Solution to Summary Challenge

First we group data by plot and by sex, and then calculate a total for each plot.

```python
by_plot_sex = surveys_df.groupby(['plot_id','sex'])
plot_sex_count = by_plot_sex['weight'].sum()
```

This calculates the sums of weights for each sex within each plot as a table

```
plot sex
plot_id sex
1 F 38253
M 59979
2 F 50144
M 57250
3 F 27251
M 28253
4 F 39796
M 49377
<other plots removed for brevity>
```

Below we'll use `.unstack()` on our grouped data to figure out the total weight that each sex contributed to each plot.

```python
by_plot_sex = surveys_df.groupby(['plot_id','sex'])
plot_sex_count = by_plot_sex['weight'].sum()
plot_sex_count.unstack()
```

The `unstack` function above will display the following output:

```
sex F M
plot_id
1 38253 59979
2 50144 57250
3 27251 28253
4 39796 49377
<other plots removed for brevity>
```

Now, create a stacked bar plot with that data where the weights for each sex are stacked by plot.

Rather than display it as a table, we can plot the above data by stacking the values of each sex as follows:

```python
by_plot_sex = surveys_df.groupby(['plot_id','sex'])
plot_sex_count = by_plot_sex['weight'].sum()
spc = plot_sex_count.unstack()
s_plot = spc.plot(kind='bar',stacked=True,title="Total weight by plot and sex")
s_plot.set_ylabel("Weight")
s_plot.set_xlabel("Plot")
```

![Stacked Bar Plot](../fig/stackedBar.png)
> ## Challenge - Plots
>
> 1. Create a plot of average weight across all species per plot.
> 2. Create a plot of total males versus total females for the entire dataset.
{: .challenge}

> ## Summary Plotting Challenge
>
> Create a stacked bar plot, with weight on the Y axis, and the stacked variable
> being sex. The plot should show total weight by sex for each plot. Some
> tips are below to help you solve this challenge:
>
> * [For more on Pandas plots, visit this link.](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.core.groupby.DataFrameGroupBy.plot.html)
> * You can use the code that follows to create a stacked bar plot but the data to stack
> need to be in individual columns. Here's a simple example with some data where
> 'a', 'b', and 'c' are the groups, and 'one' and 'two' are the subgroups.
>
> ```
> d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
> pd.DataFrame(d)
> ```
>
> shows the following data
>
> ```
> one two
> a 1 1
> b 2 2
> c 3 3
> d NaN 4
> ```
>
> We can plot the above with
>
> ```
> # plot stacked data so columns 'one' and 'two' are stacked
> my_df = pd.DataFrame(d)
> my_df.plot(kind='bar',stacked=True,title="The title of my graph")
> ```
>
> ![Stacked Bar Plot](../fig/stackedBar1.png)
>
> * You can use the `.unstack()` method to transform grouped data into columns
> for plotting. Try running `.unstack()` on some DataFrames above and see
> what it yields.
>
> Start by transforming the grouped data (by plot and sex) into an unstacked layout, then create
> a stacked plot.
>
>
>> ## Solution to Summary Challenge
>>
>> First we group data by plot and by sex, and then calculate a total for each plot.
>>
>> ```python
>> by_plot_sex = surveys_df.groupby(['plot_id','sex'])
>> plot_sex_count = by_plot_sex['weight'].sum()
>> ```
>>
>> This calculates the sums of weights for each sex within each plot as a table
>>
>> ```
>> plot sex
>> plot_id sex
>> 1 F 38253
>> M 59979
>> 2 F 50144
>> M 57250
>> 3 F 27251
>> M 28253
>> 4 F 39796
>> M 49377
>> <other plots removed for brevity>
>> ```
>>
>> Below we'll use `.unstack()` on our grouped data to figure out the total weight that each sex contributed to each plot.
>>
>> ```python
>> by_plot_sex = surveys_df.groupby(['plot_id','sex'])
>> plot_sex_count = by_plot_sex['weight'].sum()
>> plot_sex_count.unstack()
>> ```
>>
>> The `unstack` function above will display the following output:
>>
>> ```
>> sex F M
>> plot_id
>> 1 38253 59979
>> 2 50144 57250
>> 3 27251 28253
>> 4 39796 49377
>> <other plots removed for brevity>
>> ```
>>
>> Now, create a stacked bar plot with that data where the weights for each sex are stacked by plot.
>>
>> Rather than display it as a table, we can plot the above data by stacking the values of each sex as follows:
>>
>> ```python
>> by_plot_sex = surveys_df.groupby(['plot_id','sex'])
>> plot_sex_count = by_plot_sex['weight'].sum()
>> spc = plot_sex_count.unstack()
>> s_plot = spc.plot(kind='bar',stacked=True,title="Total weight by plot and sex")
>> s_plot.set_ylabel("Weight")
>> s_plot.set_xlabel("Plot")
>> ```
>>
>> ![Stacked Bar Plot](../fig/stackedBar.png)
> {: .solution}
{: .challenge}
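
To see what `.unstack()` does without loading the survey file, the following sketch runs the same group-sum-unstack steps on a tiny invented table. The name `toy_df` and all numbers are assumptions made for illustration.

```python
import pandas as pd

# Invented data: two plots, two sexes
toy_df = pd.DataFrame({
    'plot_id': [1, 1, 1, 2, 2],
    'sex': ['F', 'M', 'F', 'F', 'M'],
    'weight': [10, 20, 30, 40, 50],
})

# Long form: one row per (plot_id, sex) pair, held in a two-level index
totals = toy_df.groupby(['plot_id', 'sex'])['weight'].sum()
print(totals)

# Wide form: the inner index level ('sex') becomes the columns
wide = totals.unstack()
print(wide)
```

The wide table has one column per sex, which is the layout a stacked bar plot expects; with matplotlib installed, `wide.plot(kind='bar', stacked=True)` would draw one bar per plot.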