# Assignment 4: Pandas Part 2


## Learning Objectives
This lesson meets the following learning objectives:

- The ability to use Python data structures provided in Pandas

## Instructions
Read through all of the text on this page. This assignment provides step-by-step training divided into numbered sections. The sections often contain embedded executable code for demonstration.  Section headers with icons have special meanings:  

- <i class="fas fa-puzzle-piece"></i> The puzzle icon indicates that the section provides a practice exercise that must be completed.  Follow the instructions for the exercise and do what it asks.  Exercises must be turned in for credit.
- <i class="fa fa-cogs"></i> The cogs icon indicates that the section provides a task to perform.  Follow the instructions to complete the task.  Tasks are not turned in for credit but must be completed to continue progress.

Review the list of items in the **Expected Outcomes** section to check that you feel comfortable with the material you just learned. If you do not, take some time to review that material again. If you are still not comfortable or confident with the material after reviewing it, please ask questions on Slack.

Follow the instructions in the **What to turn in** section to turn in the exercises of the assignment for course credit.

## <i class="fa fa-cogs"></i>  Read the Tidy Data Paper
Before proceeding, read the official Tidy Data paper:

Wickham, H. (2014). [Tidy Data](https://www.jstatsoft.org/article/view/v059i10). Journal of Statistical Software, 59(10), 1–23.


## <i class="fa fa-cogs"></i>  Notebook Setup

First, we need to import the pandas library (and the NumPy library too).  All packages are imported at the top of the notebook. Execute the code in the following cell to get started with this notebook (type Ctrl+Enter in the cell below).


In [1]:
# Numpy and pandas usage are often intertwined.
# These abbreviations are used ubiquitously.
import pandas as pd
import numpy as np

For this tutorial we will reuse some of the data objects created in the previous Pandas Part 1 tutorial. Specifically, we need to recreate these two objects:
+ `df`:  a generic data frame containing two columns named "alpha" and "beta".
+ `iris_df`:  a data frame containing the imported iris dataset.

First, let's create the `df` object:

In [44]:
df = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']})

Now let's read in the iris data.  It should be in a `data` directory inside the same directory as this notebook:

In [3]:
iris_df = pd.read_csv('data/iris.csv')

As a reminder, execute the following to view the `df` data frame:

In [45]:
df

Unnamed: 0,alpha,beta
0,0,a
1,1,b
2,2,c
3,3,d
4,4,e


And let's review the `iris_df` data frame:

In [5]:
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## 1. Setting values in a DataFrame object

We often want to change or assign data at specific rows, columns, indexes or slices.   

### 1.1. Inserting a New Column

A new column can be added by using a new label as an index into an existing DataFrame object and assigning a new Series object to it as the values. Let's add a new column to the `df` object. The column will be named `gamma` and consist of a Series containing 5 numbers. 

In [46]:
# Add the new column
new_series = pd.Series([4, 3, 2, 1, 0])
df['gamma'] = new_series

# Now print the data frame
df

Unnamed: 0,alpha,beta,gamma
0,0,a,4
1,1,b,3
2,2,c,2
3,3,d,1
4,4,e,0


Alternatively, you can use a NumPy array instead of a Series:

In [47]:
df['gamma'] = np.array([4, 3, 2, 1, 0])
df

Unnamed: 0,alpha,beta,gamma
0,0,a,4
1,1,b,3
2,2,c,2
3,3,d,1
4,4,e,0


If you use NumPy, the array provided must have the same number of values as there are rows in the data frame. Observe the effect if the array is too short:

In [18]:
df['gamma'] = np.array([4, 3, 2, 1])

ValueError: Length of values (4) does not match length of index (5)

However, if a Series is used, NaN's are added to indicate missing values.  Let's add a new "epsilon" column that is **shorter** than the others:

In [48]:
df['epsilon'] = pd.Series([1, 2, 3])
df

Unnamed: 0,alpha,beta,gamma,epsilon
0,0,a,4,1.0
1,1,b,3,2.0
2,2,c,2,3.0
3,3,d,1,
4,4,e,0,


Now, observe the effect if a `pd.Series` object is **longer** than the other columns:

In [49]:
df['theta'] = pd.Series([0, 1, 3, 4, 5, 6])
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,4,1.0,0
1,1,b,3,2.0,1
2,2,c,2,3.0,3
3,3,d,1,,4
4,4,e,0,,5


The values in the Series that were beyond the length of the data frame were excluded.

It is also possible to overwrite existing columns using the same approach. Here we'll replace gamma with a series of 1's.

In [50]:
df['gamma'] = pd.Series([1, 1, 1, 1, 1])
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,1,1.0,0
1,1,b,1,2.0,1
2,2,c,1,3.0,3
3,3,d,1,,4
4,4,e,1,,5


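Just like DataFrames, Series objects have indexes. When we assign a Series as a column, Pandas aligns its values with the rows of the data frame by index label. By default these are integer indexes, which is why a shorter Series leaves NaN in the unmatched rows and the extra values of a longer Series are discarded.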
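
To make the alignment rule concrete, here is a minimal sketch that uses string index labels instead of the default integers. The names `frame`, `values`, and `aligned` are made up for this illustration and are not part of the assignment data:

```python
import pandas as pd

# A small frame with string row labels (hypothetical example data).
frame = pd.DataFrame({'alpha': [0, 1, 2]}, index=['x', 'y', 'z'])

# The Series lists its labels in a different order and has no entry for 'y'.
values = pd.Series([30, 10], index=['z', 'x'])

# Assignment matches on index labels, not on position.
frame['aligned'] = values
print(frame)
```

Row `x` receives 10 and row `z` receives 30 even though the Series lists them in the opposite order, and row `y` is left as NaN because the Series has no matching label.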

### 1.2. <i class="fas fa-puzzle-piece"></i> Practice

In the notebook cell below, perform the following.

+ Create a copy of the `df` dataframe.
+ Add a new column named "delta" to the copy that consists of random numbers.

In [51]:
df_copy = df.copy()
df_copy['delta'] = np.random.random(5)
df_copy

Unnamed: 0,alpha,beta,gamma,epsilon,theta,delta
0,0,a,1,1.0,0,0.51249
1,1,b,1,2.0,1,0.16764
2,2,c,1,3.0,3,0.477174
3,3,d,1,,4,0.35832
4,4,e,1,,5,0.913536


## 2. Missing Data

As shown in the previous section, missing values are represented as 'NaN' in the table display. You can test for missing values or add missing values using the Numpy `np.nan` value: 

> pandas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations. See the [Missing Data section](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data).
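
As a small sketch of how to test for missing values, the example below uses the `isna` function on a made-up data frame (`demo` and its columns are illustrative assumptions, not part of the assignment data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

# isna() returns a boolean frame that is True wherever a value is missing.
missing_mask = demo.isna()

# Summing the mask counts the missing values in each column.
missing_per_column = missing_mask.sum()
print(missing_per_column)

# NaN never compares equal to anything, including itself,
# so an equality test against np.nan finds nothing.
equality_hits = (demo['a'] == np.nan).sum()
print(equality_hits)
```

This is why `isna` (or its counterpart `notna`) should be used for the test rather than `== np.nan`.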

The choice of what to do with missing data is specific to the analysis you are performing. Two common approaches are to drop rows (or columns) with missing data, or to impute (fill) the empty cells. To explore this, we will need a dataframe with missing values.  We introduced missing values into the `df` data frame when we added a new column using a Series object that was too short. As a reminder, let's look again at the `df` data frame:

In [52]:
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,1,1.0,0
1,1,b,1,2.0,1
2,2,c,1,3.0,3
3,3,d,1,,4
4,4,e,1,,5


Notice the missing values in the "epsilon" column.  

### 2.1 Dropping Rows with Missing Values
We can remove all rows with at least one missing value using the `dropna` function.  This is a member function of a DataFrame object, so we can call it in the following way:

In [53]:
df.dropna()

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,1,1.0,0
1,1,b,1,2.0,1
2,2,c,1,3.0,3


**Note**: By default, `dropna()` (like a fair number of other functions) does not modify the data frame 'in place'; rather, it returns a modified copy.  Because we did not store this modified data frame in another variable in the code above, Python simply prints it.  Observe that if we print the `df` object it remains intact.

In [25]:
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,1,1.0,0
1,1,b,1,2.0,1
2,2,c,1,3.0,3
3,3,d,1,,4
4,4,e,1,,5


If you would like the data frame to change "in place", you have two choices:
1. Use the `dropna` function's `inplace=True` argument.
2. Assign the result, using the same name.
    `df = df.dropna()`

Let's save the dataframe with rows removed into a new data frame:

In [26]:
# Drop the rows with missing values.
df_nomissing = df.dropna()
# Print the contents of our new data frame.
df_nomissing

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,1,1.0,0
1,1,b,1,2.0,1
2,2,c,1,3.0,3
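
The text above mentions that columns can be dropped as well as rows. The following sketch, on a made-up frame named `demo`, shows two `dropna` arguments that control this: `axis` and `subset`. The variable names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where only the 'epsilon' column has missing values.
demo = pd.DataFrame({
    'alpha': [0, 1, 2, 3],
    'epsilon': [1.0, 2.0, np.nan, np.nan],
    'theta': [5, 6, 7, 8],
})

# axis=1 drops every column that contains at least one missing value.
complete_columns = demo.dropna(axis=1)

# subset restricts the check to the listed columns only.
rows_with_epsilon = demo.dropna(subset=['epsilon'])

print(complete_columns.columns.tolist())
print(rows_with_epsilon.shape)
```

Neither call changes `demo` itself, which matches the note above about functions returning a modified copy.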


### 2.2 Filling Missing Values

In some cases, setting missing values to some other value (such as 0) may be appropriate.  This can be accomplished using the `df.fillna()` function and passing in the desired value. Below we'll fill missing values with 0:

In [27]:
df.fillna(0)

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,1,1.0,0
1,1,b,1,2.0,1
2,2,c,1,3.0,3
3,3,d,1,0.0,4
4,4,e,1,0.0,5


Note that the `fillna` function also does not replace the values "in place". If we want to keep the data frame with missing values replaced by 0's, we must save the result to a variable or use the `inplace` argument.

In [28]:
# Show that without the inplace argument the original data frame is unchanged.
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,1,1.0,0
1,1,b,1,2.0,1
2,2,c,1,3.0,3
3,3,d,1,,4
4,4,e,1,,5
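
A constant such as 0 is not always the best replacement. As one possible imputation sketch (an illustration, not a recommendation for any particular analysis), the example below fills the gaps in a made-up `epsilon` column with that column's mean:

```python
import numpy as np
import pandas as pd

# Hypothetical column shaped like 'epsilon' in the tutorial's df.
demo = pd.DataFrame({'epsilon': [1.0, 2.0, 3.0, np.nan, np.nan]})

# mean() skips missing values by default, so this is the mean of 1, 2 and 3.
column_mean = demo['epsilon'].mean()

# Fill only the 'epsilon' column and assign the result back to it.
demo['epsilon'] = demo['epsilon'].fillna(column_mean)
print(demo)
```

Assigning the filled column back to `demo['epsilon']` is the "assign the result" alternative to the `inplace` argument.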


### 2.3. <i class="fas fa-puzzle-piece"></i> Practice

In the notebook cell below, perform the following.


+ Create two new copies of the `df` dataframe.
+ Add a new column to both that has missing values.
+ In one copy, replace missing values with a value of your choice.
+ In the other copy, drop rows with `NaN` values.
+ Print both data frames to confirm.

In [54]:
df1 = df.copy()
df2 = df.copy()

In [55]:
df1['new'] = pd.Series([1,2,3])
df2['new'] = pd.Series([1,2,3])
df1.dropna(inplace=True)
print(df1)
df2 = df2.fillna(3)
print(df2)

   alpha beta  gamma  epsilon  theta  new
0      0    a      1      1.0      0  1.0
1      1    b      1      2.0      1  2.0
2      2    c      1      3.0      3  3.0
   alpha beta  gamma  epsilon  theta  new
0      0    a      1      1.0      0  1.0
1      1    b      1      2.0      1  2.0
2      2    c      1      3.0      3  3.0
3      3    d      1      3.0      4  3.0
4      4    e      1      3.0      5  3.0


## 3. Operations

Pandas provides functions that "operate" or act on rows or columns. Some of these include calculating the mean, covariance, correlation, percent change, etc.

### 3.1 Mathematical Operations

To explore these operations, let's review the `iris_df` data frame:

In [56]:
iris_df.head(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


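To demonstrate the use of these operations, let's calculate the mean of each column.  The `df.mean()` function provides this for each numerical column of the data frame. Because `iris_df` also contains the non-numeric `species` column, we pass `numeric_only=True` so that only the numeric columns are used:

In [57]:
iris_df.mean(numeric_only=True)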

sepal_length    5.843333
sepal_width     3.054000
petal_length    3.758667
petal_width     1.198667
dtype: float64

Recall that we can call `help()` on a method of the class (`pd.DataFrame.mean`) or on the same method of an existing object (`iris_df.mean`).

```python
help(pd.DataFrame.mean)
# Should be mostly the same as:
help(iris_df.mean)
```

For example:

In [58]:
help(pd.DataFrame.mean)

Help on function mean in module pandas.core.generic:

mean(self, axis: 'int | None | lib.NoDefault' = <no_default>, skipna=True, level=None, numeric_only=None, **kwargs)
    Return the mean of the values over the requested axis.
    
    Parameters
    ----------
    axis : {index (0), columns (1)}
        Axis for the function to be applied on.
    skipna : bool, default True
        Exclude NA/null values when computing the result.
    level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a Series.
    numeric_only : bool, default None
        Include only float, int, boolean columns. If None, will attempt to use
        everything, then use only numeric data. Not implemented for Series.
    **kwargs
        Additional keyword arguments to be passed to the function.
    
    Returns
    -------
    Series or DataFrame (if level specified)



According to the help text above, we can calculate the mean of each row by providing the `axis` argument. With `axis=0` (the default) the calculation runs down the rows and returns one value per column; with `axis=1` it runs across the columns and returns one value per row.

In [59]:
# Calculate the mean of each row, and use the head function to only show the
# top 5 rows.
iris_df.mean(axis=1, numeric_only=True).head()

0    2.550
1    2.375
2    2.350
3    2.350
4    2.550
dtype: float64
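
Because the direction of `axis` is easy to mix up, here is a small sketch on a made-up two-column frame (`demo` is an illustrative name) where the results can be checked by hand:

```python
import pandas as pd

# Hypothetical frame small enough to verify the arithmetic mentally.
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# axis=0 (the default) collapses the rows: one result per column.
column_means = demo.mean(axis=0)

# axis=1 collapses the columns: one result per row.
row_means = demo.mean(axis=1)

print(column_means)
print(row_means)
```

The column means are 2.0 and 5.0, while the row means are 2.5, 3.5 and 4.5.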

You can find a complete listing of the mathematical functions available with DataFrame objects here:
+ [Binary Operator Functions](https://pandas.pydata.org/docs/reference/frame.html#binary-operator-functions)
+ [Computations and Descriptive Statistics](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats)

### 3.2. <i class="fas fa-puzzle-piece"></i> Practice

In the notebook cell below, perform the following.

View the [Computational tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html) and [statistical methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#method-summary) documentation.
From the functions listed there, choose five to use with the iris data frame.


In [87]:
mode = iris_df.mode(numeric_only = True)
print("Mode\n",mode)
median = iris_df.median(numeric_only = True)
print("Median all columns\n",median)
# Named `total` so the built-in sum function is not shadowed.
total = iris_df.sum(numeric_only=True)
print('Sum all\n',total)
STD = iris_df.std(numeric_only=True)
print('Standard deviation all\n',STD)
vari = iris_df.var(numeric_only=True)
print('Variance\n',vari)

Mode
    sepal_length  sepal_width  petal_length  petal_width
0           5.0          3.0           1.5          0.2
Median all columns
 sepal_length    5.80
sepal_width     3.00
petal_length    4.35
petal_width     1.30
dtype: float64
Sum all
 sepal_length    876.5
sepal_width     458.1
petal_length    563.8
petal_width     179.8
dtype: float64
Standard deviation all
 sepal_length    0.828066
sepal_width     0.433594
petal_length    1.764420
petal_width     0.763161
dtype: float64
Variance
 sepal_length    0.685694
sepal_width     0.188004
petal_length    3.113179
petal_width     0.582414
dtype: float64


### 3.3 Apply

Another common operation is to "apply" a function to a set of rows or columns (or to a subset created by slicing). To apply functions in this way, use the `df.apply()` function.  Its arguments are shown below:

```python
apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
```

You can view the full help documentation for this function by executing the following line:

In [88]:
help(df.apply)

Help on method apply in module pandas.core.frame:

apply(func: 'AggFuncType', axis: 'Axis' = 0, raw: 'bool' = False, result_type=None, args=(), **kwargs) method of pandas.core.frame.DataFrame instance
    Apply a function along an axis of the DataFrame.
    
    Objects passed to the function are Series objects whose index is
    either the DataFrame's index (``axis=0``) or the DataFrame's columns
    (``axis=1``). By default (``result_type=None``), the final return type
    is inferred from the return type of the applied function. Otherwise,
    it depends on the `result_type` argument.
    
    Parameters
    ----------
    func : function
        Function to apply to each column or row.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Axis along which the function is applied:
    
        * 0 or 'index': apply function to each column.
        * 1 or 'columns': apply function to each row.
    
    raw : bool, default False
        Determines if row or column is passed as 

Observe that the first argument passed to `apply` is the name of the function that should be "applied" to the data frame. As a simple example, we can provide the `print` function:

In [89]:
df.apply(print)

0    0
1    1
2    2
3    3
4    4
Name: alpha, dtype: int64
0    a
1    b
2    c
3    d
4    e
Name: beta, dtype: object
0    1
1    1
2    1
3    1
4    1
Name: gamma, dtype: int64
0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
Name: epsilon, dtype: float64
0    0
1    1
2    3
3    4
4    5
Name: theta, dtype: int64


alpha      None
beta       None
gamma      None
epsilon    None
theta      None
dtype: object

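Or, we can try to use `apply` with the built-in `type` function to inspect each column. Note that in the Pandas version used to run this notebook the call fails with the error shown below; the `df.dtypes` attribute is the reliable way to get the data type of each column: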

In [90]:
df.apply(type)

TypeError: copy() missing 1 required positional argument: 'self'

For demonstration purposes, we can use `apply` to calculate the sum of each column using the NumPy sum function (although there is a `sum` function built into DataFrames):

In [91]:
df.apply(np.sum)

alpha         10
beta       abcde
gamma          5
epsilon      6.0
theta         13
dtype: object

**Note**: by default `apply` performs the supplied function across columns.  To apply the function across rows, use the `axis` argument (see the help for `apply` above).  
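
The sketch below contrasts the two directions on a made-up frame. The names `demo`, `low`, and `high` are assumptions for illustration, and the lambda functions stand in for any function you might supply:

```python
import pandas as pd

# Hypothetical frame with two numeric columns.
demo = pd.DataFrame({'low': [1, 4, 2], 'high': [5, 9, 2]})

# Default axis=0: the function receives each column as a Series.
column_ranges = demo.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_spans = demo.apply(lambda row: row['high'] - row['low'], axis=1)

print(column_ranges)
print(row_spans)
```

With `axis=1` the result is indexed by row label, so it can be assigned straight back to the data frame as a new column.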

### 3.4. <i class="fas fa-puzzle-piece"></i> Practice

In the notebook cell below, perform the following.

Practice using `apply` on either the `df` or `iris_df` data frames using any two functions of your choice other than `print`, `type`, and `np.sum`.

In [99]:
print(iris_df.apply(min))
print(iris_df.apply(max))


sepal_length       4.3
sepal_width        2.0
petal_length       1.0
petal_width        0.1
species         setosa
dtype: object
sepal_length          7.9
sepal_width           4.4
petal_length          6.9
petal_width           2.5
species         virginica
dtype: object


### 3.5 Occurrences

You can calculate the number of occurrences of each value in a `pd.Series` object using the `value_counts` function.  These counts are similar to what you would see in a histogram.  As an example, let's create a new series containing 10 random integers from 0 to 6 (the upper bound passed to `np.random.randint` is exclusive). Then we will use `value_counts` to observe how many occurrences there are of each integer.  First let's create the series:

In [100]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

0    6
1    0
2    5
3    2
4    1
5    1
6    4
7    5
8    1
9    0
dtype: int32

Now we'll use `value_counts` to create our "histogram" of the occurrences of each integer:

In [101]:
s.value_counts()

1    3
0    2
5    2
6    1
2    1
4    1
dtype: int64
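
By default the result is ordered from the most to the least frequent value. As a short sketch with fixed data (the values below are made up so the result is reproducible), `sort_index` and the `normalize` argument give two other common views of the same counts:

```python
import pandas as pd

# Fixed values instead of random ones so the counts are predictable.
s = pd.Series([1, 0, 5, 1, 5, 1])

# Default: counts sorted from most to least frequent.
counts = s.value_counts()

# sort_index() orders the result by value instead, like histogram bins.
counts_by_value = s.value_counts().sort_index()

# normalize=True returns proportions that sum to 1.
proportions = s.value_counts(normalize=True)

print(counts_by_value)
print(proportions)
```

Here the value 1 occurs three times out of six, so its proportion is 0.5.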

### 3.6. <i class="fas fa-puzzle-piece"></i> Practice

In the notebook cell below, perform the following.

Identify the number of occurrences of each species (virginica, versicolor, setosa) in the `iris_df` object.  *Hint*: call `value_counts` on the `species` column, which is a `pd.Series` object, rather than on the full data frame.

In [102]:
species = iris_df['species']
species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

## 4. Working with Strings in the DataFrame

Many times, data sets will have columns with string values that must be modified in some way, such as being split from one column into two, or joined from two columns into one. To demonstrate string methods, let's adjust the `df` object to add some strings.  As a reminder, let's look at the `df` object:

In [103]:
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta
0,0,a,1,1.0,0
1,1,b,1,2.0,1
2,2,c,1,3.0,3
3,3,d,1,,4
4,4,e,1,,5


Let's also check the dimensions of this data frame:

In [104]:
df.shape

(5, 5)

Now, let's create some lists of strings to add to our data frame.

In [105]:
labels = ['alpha', 'beta', 'gamma', 'delta']
more_labels = ['foxtrot', 'whiskey', 'omega','epsilon','startrek']

Next, we'll use the NumPy `np.random.choice` function to create two new columns for the `df` data frame using randomized values from the two lists of strings we just created.  Take some time to make sure you understand the following code, then execute it.

In [106]:
df['iota'] = np.random.choice(labels, df.shape[0])
df['kappa'] = np.random.choice(more_labels, df.shape[0])
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta,iota,kappa
0,0,a,1,1.0,0,alpha,whiskey
1,1,b,1,2.0,1,alpha,whiskey
2,2,c,1,3.0,3,delta,epsilon
3,3,d,1,,4,beta,startrek
4,4,e,1,,5,gamma,foxtrot


**Reminder:** Creating new columns using NumPy arrays requires that they have the same length as the existing data frame. 

### 4.1 The `.str` Object

Before we learn about working with Strings, let's first learn about the `str` object of Series objects.  If the Series object contains strings then you will have access to the `str` object. Remember that Pandas columns are Series objects.  So for any column that stores string values you can access the `str` object in the following way:

In [107]:
df['kappa'].str

<pandas.core.strings.accessor.StringMethods at 0x22a01952e80>

Observe that the `str` object is of type `pandas.core.strings.accessor.StringMethods`. This object is an "accessor". Similar to the `loc` and `iloc` accessors of Pandas, the `str` accessor can be used to access characters of the strings in the series. For example, we can extract the first three characters of each string:

In [109]:
df['kappa'].str[0:3]

0    whi
1    whi
2    eps
3    sta
4    fox
Name: kappa, dtype: object

The `str` object also provides a variety of string-related functions such as `split` to split strings, `upper` to convert strings to uppercase, `lower` to convert to lowercase, and `len` to get the length of each string. See the [String Methods documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods) for examples of usage, and explore the [String Handling](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling) documentation for a list of the functions that are part of the `str` object. 

As an example, we can get the length of all strings in the series:

In [110]:
df['kappa'].str.len()

0    7
1    7
2    7
3    8
4    7
Name: kappa, dtype: int64
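
A few more of these functions are sketched below on a small made-up Series (`words` is an illustrative name). Each call works element by element and returns a new Series:

```python
import pandas as pd

# Hypothetical Series of strings.
words = pd.Series(['foxtrot', 'whiskey', 'startrek'])

# Convert every string to uppercase.
upper_words = words.str.upper()

# Test whether each string contains the letter 't'.
has_t = words.str.contains('t')

# Index into each string to get its first character.
first_letters = words.str[0]

print(upper_words)
print(has_t)
print(first_letters)
```

The boolean result of `contains` can be used directly as a row filter, for example `words[has_t]`.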

### 4.2 Combining Columns into One

Combining strings from two different columns is easy!  For example, we can create a new column named `lambda` which contains the values of the `iota` and `kappa` columns by simply using the `+` operator.  We can include an underscore to separate the words:

In [111]:
df['lambda'] = df['iota'] + '_' + df['kappa']
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta,iota,kappa,lambda
0,0,a,1,1.0,0,alpha,whiskey,alpha_whiskey
1,1,b,1,2.0,1,alpha,whiskey,alpha_whiskey
2,2,c,1,3.0,3,delta,epsilon,delta_epsilon
3,3,d,1,,4,beta,startrek,beta_startrek
4,4,e,1,,5,gamma,foxtrot,gamma_foxtrot
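
The `+` operator only joins columns that both hold strings. As a sketch of what to do when one column is numeric (the frame and column names here are made up for illustration), convert it with `astype(str)` first:

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
demo = pd.DataFrame({'count': [1, 2, 3], 'label': ['a', 'b', 'c']})

# Adding a string to the integer column directly would fail,
# so convert the integers to strings before concatenating.
demo['combined'] = demo['count'].astype(str) + '-' + demo['label']
print(demo)
```

The original `count` column keeps its integer type; only the temporary converted copy is used in the concatenation.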


### 4.3 Splitting Columns to Multiple
To split a column into two or more strings we can use the `str` object's `split` function.  We just added a new column with two words separated by an underscore. Let's use the `split` function to separate the strings with that delimiter:


In [112]:
df['lambda'].str.split('_')

0    [alpha, whiskey]
1    [alpha, whiskey]
2    [delta, epsilon]
3    [beta, startrek]
4    [gamma, foxtrot]
Name: lambda, dtype: object

Observe that each row of the `lambda` column was split and a new list was created.  The split worked, but if our goal is to split a string and return it back to the data frame as two new columns we have a bit more work to do. 

Remember, in order to add new columns to a dataframe, we need two Series objects: one for each new column.  Before we show how to do this, let's look at the datatype that is returned when the split is performed. We can do this with the core `type` function:

In [113]:
type(df['lambda'].str.split('_'))

pandas.core.series.Series

It turns out that the `split` function returns a Series object in which each element is a list. Fortunately, the `str` object of a Series can also index into lists, one element at a time!  Observe:

In [114]:
splits = df['lambda'].str.split('_')
split_1 = splits.str[0]
split_2 = splits.str[1]
split_1

0    alpha
1    alpha
2    delta
3     beta
4    gamma
Name: lambda, dtype: object

Notice in the code above that indexing the `str` object of the Series returned by the split extracts the elements into two new Series:  `split_1` and `split_2`. The `split_1` Series contains the first piece of each string and `split_2` contains the second. Unpacking the `str` object directly, as in `a, b = series.str`, is not supported in current Pandas releases, so we index each position explicitly.

To add two new columns to our dataframe we can assign the pieces directly to new columns rather than to new variables:

In [115]:
# Split the lambda column of the df into two new columns.
# n=1 limits the split to the first underscore.
splits = df['lambda'].str.split('_', n=1)
df['split_1'] = splits.str[0]
df['split_2'] = splits.str[1]

# Print the data frame.
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta,iota,kappa,lambda,split_1,split_2
0,0,a,1,1.0,0,alpha,whiskey,alpha_whiskey,alpha,whiskey
1,1,b,1,2.0,1,alpha,whiskey,alpha_whiskey,alpha,whiskey
2,2,c,1,3.0,3,delta,epsilon,delta_epsilon,delta,epsilon
3,3,d,1,,4,beta,startrek,beta_startrek,beta,startrek
4,4,e,1,,5,gamma,foxtrot,gamma_foxtrot,gamma,foxtrot
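
Another way to get separate columns from a split, sketched here on a made-up frame, is the `expand=True` argument of `split`. It returns a DataFrame with one column per piece instead of a Series of lists:

```python
import pandas as pd

# Hypothetical frame with a column shaped like 'lambda' in the tutorial.
demo = pd.DataFrame({'lambda': ['alpha_whiskey', 'beta_startrek']})

# expand=True returns a DataFrame whose columns are numbered 0, 1, ...
parts = demo['lambda'].str.split('_', expand=True)

# Assign each piece to its own named column.
demo['split_1'] = parts[0]
demo['split_2'] = parts[1]
print(demo)
```

This form is convenient when a string splits into many pieces, because every piece arrives as a ready-made column.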


### 4.4. <i class="fas fa-puzzle-piece"></i> Practice

In the notebook cell below, perform the following.

+ Create a list of five strings that represent dates in the form YYYY-MM-DD (e.g. 2020-02-20 for Feb 20th, 2020).
+ Add this list of dates as a new column in the `df` dataframe.
+ Now split the date into 3 new columns with one column representing the year, another the month and another the day.
+ Combine the values from columns `alpha` and `beta` into a new column where the values are separated with a colon.


In [137]:
dates = pd.Series(['2022-02-25','2022-02-23','2022-02-20','2022-02-15','2022-02-12'])
df['dates'] = dates
# Split each date on '-' and pull out the pieces by position.
date_parts = df['dates'].str.split('-')
df['Year'] = date_parts.str[0]
df['Month'] = date_parts.str[1]
df['Day'] = date_parts.str[2]
# alpha holds integers, so convert it to strings before joining with a colon.
df['new_col'] = df['alpha'].astype(str) + ":" + df['beta']
df

Unnamed: 0,alpha,beta,gamma,epsilon,theta,iota,kappa,lambda,split_1,split_2,dates,Year,Month,Day,new_col
0,0,a,1,1.0,0,alpha,whiskey,alpha_whiskey,alpha,whiskey,2022-02-25,2022,02,25,0:a
1,1,b,1,2.0,1,alpha,whiskey,alpha_whiskey,alpha,whiskey,2022-02-23,2022,02,23,1:b
2,2,c,1,3.0,3,delta,epsilon,delta_epsilon,delta,epsilon,2022-02-20,2022,02,20,2:c
3,3,d,1,,4,beta,startrek,beta_startrek,beta,startrek,2022-02-15,2022,02,15,3:d
4,4,e,1,,5,gamma,foxtrot,gamma_foxtrot,gamma,foxtrot,2022-02-12,2022,02,12,4:e


## 5. Combining DataFrames

Often it is useful to combine two or more dataframes, whether by merging, joining, concatenating or grouping.  We will explore each of these operations. 

***Note***: For a much more exhaustive description of merging, joining and concatenation, and helpful diagrams, see the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).

To demonstrate concatenation, let's use the iris dataframe.  But first let's break it up into separate species-specific data frames, i.e. one dataframe per species.  We can use the Fancy Indexing that we learned in the Pandas Part 1 tutorial to do this:

In [138]:
setosa_df = iris_df[iris_df['species'] == 'setosa']
versicolor_df = iris_df[iris_df['species'] == 'versicolor']
virginica_df = iris_df[iris_df['species'] == 'virginica']

Let's briefly peek at each dataframe to make sure we split them correctly:

In [139]:
setosa_df.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [140]:
setosa_df.shape

(50, 5)

In [141]:
versicolor_df.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor


In [142]:
versicolor_df.shape

(50, 5)

In [143]:
virginica_df.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
100,6.3,3.3,6.0,2.5,virginica
101,5.8,2.7,5.1,1.9,virginica
102,7.1,3.0,5.9,2.1,virginica


In [144]:
virginica_df.shape

(50, 5)

Observe that each data frame is exactly 50 rows long and each one contains only the species-specific data. Also notice that despite existing in new dataframes, the row indexes did not change.  The indexes of the `virginica_df` start with 100 rather than 0.  Therefore, as a reminder from the Pandas Part 1 lesson, the following uses of `loc` and `iloc` retrieve the same rows:

In [145]:
virginica_df.loc[100:104]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
100,6.3,3.3,6.0,2.5,virginica
101,5.8,2.7,5.1,1.9,virginica
102,7.1,3.0,5.9,2.1,virginica
103,6.3,2.9,5.6,1.8,virginica
104,6.5,3.0,5.8,2.2,virginica


In [146]:
virginica_df.iloc[0:5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
100,6.3,3.3,6.0,2.5,virginica
101,5.8,2.7,5.1,1.9,virginica
102,7.1,3.0,5.9,2.1,virginica
103,6.3,2.9,5.6,1.8,virginica
104,6.5,3.0,5.8,2.2,virginica
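
The difference between the two accessors can be sketched on a tiny made-up frame whose index, like that of `virginica_df`, does not start at 0. The name `subset` and its values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical frame whose row labels start at 100.
subset = pd.DataFrame({'value': [10, 20, 30]}, index=[100, 101, 102])

# loc selects by label, and the end label is included.
by_label = subset.loc[100:101]

# iloc selects by position, and the end position is excluded.
by_position = subset.iloc[0:2]

print(by_label.equals(by_position))
```

Both selections return the rows labeled 100 and 101, which is why `loc[100:104]` and `iloc[0:5]` above produce the same five rows.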


Now, for the rest of this section, let's assume that we imported these iris species data as separate data frames. So, let's reset the indexes to start at 0 for each dataframe.

In [147]:
# Reset the indexes to start at 0 and do this inplace.
# The drop argument prevents the function from adding the old
# index as a new column of the dataframe
virginica_df.reset_index(drop=True, inplace=True)
versicolor_df.reset_index(drop=True, inplace=True)
setosa_df.reset_index(drop=True, inplace=True)

# Print the virginica_df to show the indexes are reset
virginica_df.head(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,6.3,3.3,6.0,2.5,virginica
1,5.8,2.7,5.1,1.9,virginica
2,7.1,3.0,5.9,2.1,virginica
3,6.3,2.9,5.6,1.8,virginica
4,6.5,3.0,5.8,2.2,virginica


### 5.1 Merging with `pd.concat`
#### 5.1.1 Concatenating by rows

Suppose now that we imported the species-specific iris data frames and we now want to merge them.  We can do this using the `pd.concat` function.  By default, the `pd.concat` function tries to merge data frames by combining the rows of each data frame one after the other.  This image from the [Pandas Merge, join and concatenate documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) shows how three dataframes are combined (concatenated) into a single larger dataframe:

![rows concat](media/A04-merging_concat_basic.png)

The first argument that the `pd.concat` function accepts is a list of dataframes that should be merged. Therefore, we must first create a list of the individual data frames.

In [148]:
subsets = [setosa_df, versicolor_df, virginica_df]

We can now call `pd.concat` to merge the three data frames:

In [149]:
# Merge the dataframes into a larger one
new_iris_df = pd.concat(subsets)
# Print the top few rows of the new data frame.
new_iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [150]:
# Show that all the rows are there.
new_iris_df.shape

(150, 5)

Observe that the `new_iris_df` data frame is now back to 150 rows! We can also  test this by randomly sampling to see if we get a variety of species:

In [151]:
new_iris_df.sample(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
44,5.1,3.8,1.9,0.4,setosa
45,4.8,3.0,1.4,0.3,setosa
13,5.7,2.5,5.0,2.0,virginica
46,5.7,2.9,4.2,1.3,versicolor
5,5.4,3.9,1.7,0.4,setosa
28,6.0,2.9,4.5,1.5,versicolor
31,5.4,3.4,1.5,0.4,setosa
6,4.9,2.5,4.5,1.7,virginica
1,6.4,3.2,4.5,1.5,versicolor
15,6.4,3.2,5.3,2.3,virginica


Our species-specific dataframes are now reunited!  Although, notice that the row indexes have duplicates.  

In [153]:
new_iris_df.loc[4]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
4,5.0,3.6,1.4,0.2,setosa
4,6.5,2.8,4.6,1.5,versicolor
4,6.5,3.0,5.8,2.2,virginica


With that indexing, we get 3 entries, one from each of the three species-specific data frames. When dataframes are combined, Pandas does not reset the indexes, so these rows kept their original row names.  This could be confusing, especially since the row names are numeric, so let's reset them:

In [None]:
# Reindex the new dataframe
new_iris_df.reset_index(drop=True, inplace=True)

# Get the row with row name '4'. Now it should only be a single Series.
new_iris_df.loc[4]

**Important**: The concatenation shown above works because all three dataframes have the same columns with the same data type in each column.
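
The separate reset step can also be folded into the concatenation itself. As a sketch on two small made-up frames (`first` and `second` are illustrative names), the `ignore_index=True` argument of `pd.concat` renumbers the rows of the result:

```python
import pandas as pd

# Two hypothetical frames that both use the default labels 0 and 1.
first = pd.DataFrame({'value': [1, 2]})
second = pd.DataFrame({'value': [3, 4]})

# Without ignore_index the original labels 0, 1 appear twice.
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Use this only when the original row labels carry no meaning, since they are discarded.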

#### 5.1.2. <i class="fas fa-puzzle-piece"></i> Practice

In the cell below, perform the following.

+ Create the following dataframe
```Python
df1 = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']}, index = ['I1', 'I2' ,'I3', 'I4', 'I5'])
```
+ Create a new dataframe named `df2` with column names "delta" and "gamma" that contains 5 rows, with some index names that overlap with the `df1` dataframe and some that do not.
+ Concatenate the two dataframes by rows and print the result.
+ You should see the two have combined one after the other, but there should also be missing values added. 
+ Explain why there are missing values.

In [166]:
df1 = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']}, index = ['I1', 'I2' ,'I3', 'I4', 'I5'])

df2 = pd.DataFrame(
    {'delta': [0, 1, 2, 3, 4],
     'gamma': ['a', 'b', 'c', 'd', 'e']}, index = ['AA', 'CC' ,'I3', 'DD', 'I5'])
df3 = pd.concat([df1,df2])
print(df3)
print('There are NA values because df1 and df2 share no column names')

    alpha beta  delta gamma
I1    0.0    a    NaN   NaN
I2    1.0    b    NaN   NaN
I3    2.0    c    NaN   NaN
I4    3.0    d    NaN   NaN
I5    4.0    e    NaN   NaN
AA    NaN  NaN    0.0     a
CC    NaN  NaN    1.0     b
I3    NaN  NaN    2.0     c
DD    NaN  NaN    3.0     d
I5    NaN  NaN    4.0     e
There are NA values because df1 and df2 share no column names


#### 5.1.3. Concatenating by Columns
What if we wanted to combine two or more dataframes by columns?  To explore this, consider the following dataframes from the Pandas documentation:

In [162]:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])
df4

Unnamed: 0,B,D,F
2,B2,D2,F2
3,B3,D3,F3
6,B6,D6,F6
7,B7,D7,F7


In [163]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


Suppose we wanted to merge these two dataframes. Unlike the previous example, we want to merge by concatenating columns rather than rows. To concatenate by columns we can pass the `axis` argument. A value of `1` tells the function to concatenate by columns:

```python
  pd.concat([df1, df4], axis=1)
```

However, notice that each data frame has some column names in common, and some of the row names are in common, but both have some column and row names that are not shared. How will `pd.concat` handle this? 

In [164]:
pd.concat([df1, df4], axis=1)

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7


Observe that the columns of the two tables are set side-by-side, even though some of the columns have the same name, and rows with the same index in the two data frames are merged. Any row whose index is missing from one data frame has missing values added for that data frame's columns.

If we only want to include rows whose index appears in both data frames (and which therefore have no missing values added), we can use the argument `join="inner"` to automatically remove the others:

In [165]:
pd.concat([df1, df4], axis=1, join='inner')

Unnamed: 0,A,B,C,D,B.1,D.1,F
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


Notice that an "inner" join removes rows whose index name is not present in both dataframes.
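
To make the difference between the default "outer" join and the "inner" join concrete, here is a small sketch. The dataframes `left` and `right` are made up for illustration. It compares the row indexes that each join keeps with the intersection of the two indexes:

```python
import pandas as pd

# Hypothetical dataframes whose row indexes only partly overlap.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
right = pd.DataFrame({'F': ['F1', 'F2', 'F5']}, index=[1, 2, 5])

# Outer join (the default): the union of both indexes is kept.
outer = pd.concat([left, right], axis=1)

# Inner join: only index labels found in both dataframes are kept.
inner = pd.concat([left, right], axis=1, join='inner')

print(outer.index.tolist())
print(inner.index.tolist())

# The inner-join rows match the intersection of the two indexes.
print(left.index.intersection(right.index).tolist())
```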

#### 5.1.4. <i class="fas fa-puzzle-piece"></i> Practice

In the cell below, perform the following.

Using the same dataframes, df1 and df2, from the previous practice section:
+ Concatenate the two by columns
+ Add a "delta" column to `df1` and concatenate by columns such that there are 5 columns in the merged dataframe.
+ Respond in writing to this question (add a new 'raw' cell to contain your answer): What would happen if you had performed an inner join while concatenating?
+ Try the concatenation with the inner join to see if you are correct.

In [180]:
df1['delta'] = df1['beta']
df5 = pd.concat([df1,df2], axis = 1)
print(df5)
print("An inner join would remove the rows that have no data for one of the columns")
df6 = pd.concat([df1, df2], axis=1, join="inner")
df6

    alpha beta  delta  delta gamma
I1    0.0    a      a    NaN   NaN
I2    1.0    b      b    NaN   NaN
I3    2.0    c      c    2.0     c
I4    3.0    d      d    NaN   NaN
I5    4.0    e      e    4.0     e
AA    NaN  NaN    NaN    0.0     a
CC    NaN  NaN    NaN    1.0     b
DD    NaN  NaN    NaN    3.0     d
An inner join would remove the rows that have no data for one of the columns


Unnamed: 0,alpha,beta,delta,delta.1,gamma
I3,2,c,c,2,c
I5,4,e,e,4,e


#### 5.1.5. Concatenating by Columns and Rows
If you want to merge two dataframes that share some, but not all, column names and you want to keep all of the rows from both, you can use `pd.concat` as if you were combining by rows:

In [181]:
# Merge by rows, for dataframes that share some column names.
# sort=False keeps the columns in their original order rather than sorting them.
pd.concat([df1, df4], sort=False)

Unnamed: 0,alpha,beta,delta,B,D,F
I1,0.0,a,a,,,
I2,1.0,b,b,,,
I3,2.0,c,c,,,
I4,3.0,d,d,,,
I5,4.0,e,e,,,
2,,,,B2,D2,F2
3,,,,B3,D3,F3
6,,,,B6,D6,F6
7,,,,B7,D7,F7


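Notice that shared column names are merged into a single column. When `df1` is the dataframe defined in section 5.1.3 (columns `A` to `D`), the result has only columns `A`, `B`, `C`, `D` and `F`, and the shared columns `B` and `D` each appear once. The output above was produced after `df1` was redefined in the practice exercise, so it shows `alpha`, `beta` and `delta` instead, none of which are shared with `df4`. In both cases we have all of the rows, and missing values were inserted wherever a dataframe lacks a column.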

**Important**: If you want to combine two dataframes by rows, they have shared column names, and you do not want those columns merged, you must rename the columns in one of the dataframes so that all column names are unique.
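
The following sketch illustrates that renaming step. The dataframes `first` and `second` and the new column name `D_second` are hypothetical and chosen only for this example:

```python
import pandas as pd

# Hypothetical dataframes that both have a column named 'D'.
first = pd.DataFrame({'A': ['A0', 'A1'], 'D': ['D0', 'D1']}, index=[0, 1])
second = pd.DataFrame({'D': ['D2', 'D3'], 'F': ['F2', 'F3']}, index=[2, 3])

# Without renaming, the shared column 'D' is merged into one column.
merged = pd.concat([first, second], sort=False)
print(merged.columns.tolist())

# Renaming 'D' in the second dataframe keeps the two columns apart.
second_renamed = second.rename(columns={'D': 'D_second'})
separate = pd.concat([first, second_renamed], sort=False)
print(separate.columns.tolist())
```

Note that `rename` returns a new dataframe, so `second` itself is left unchanged.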

### 5.2 Append

Adding rows to a dataframe can be done using the `append` function. To demonstrate this, let's create a 4x6 data frame containing a collection of random numbers:

In [182]:
rand_df = pd.DataFrame(np.random.random((4, 6)))
rand_df

Unnamed: 0,0,1,2,3,4,5
0,0.971042,0.167924,0.693881,0.653188,0.238466,0.909949
1,0.021489,0.846725,0.041266,0.419324,0.379623,0.307432
2,0.983625,0.945262,0.072376,0.792379,0.896184,0.841736
3,0.950326,0.665227,0.051867,0.851322,0.464342,0.479533


Now suppose we want to append a new row to that data frame.  Let's create a new `pd.Series` object containing the same number of random numbers as there are columns in the data frame:

In [183]:
rand_row = pd.Series(np.random.random(6) * 100)
rand_row

0    70.927385
1    99.624113
2    27.370794
3    92.783101
4    68.498944
5    11.393762
dtype: float64

Now we can call the `append` function. Unlike the `pd.concat` function, the `append` function belongs to the DataFrame object. The `append` function takes the new series as its first argument. We also pass the `ignore_index` argument set to `True`. This forces the append to re-number the row indexes.

In [185]:
rand_df.append(rand_row, ignore_index=True)


Unnamed: 0,0,1,2,3,4,5
0,0.971042,0.167924,0.693881,0.653188,0.238466,0.909949
1,0.021489,0.846725,0.041266,0.419324,0.379623,0.307432
2,0.983625,0.945262,0.072376,0.792379,0.896184,0.841736
3,0.950326,0.665227,0.051867,0.851322,0.464342,0.479533
4,70.927385,99.624113,27.370794,92.783101,68.498944,11.393762


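***Note***: the `append` function does not alter the `rand_df` data frame 'in place'. Instead, it returns a new data frame with the row appended. Also note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; in newer versions, use `pd.concat` to add rows instead.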
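
Because `DataFrame.append` is not available in pandas 2.0 and later, the same result can be obtained with `pd.concat`. The sketch below is one possible way to do it, not the only one. It converts the Series to a one-row dataframe first; the names `row_df` and `extended` are introduced here for illustration:

```python
import numpy as np
import pandas as pd

rand_df = pd.DataFrame(np.random.random((4, 6)))
rand_row = pd.Series(np.random.random(6) * 100)

# Turn the Series into a one-row dataframe: to_frame() makes a single
# column, and .T transposes it so the values lie along one row.
row_df = rand_row.to_frame().T

# Concatenate by rows and renumber the row indexes.
extended = pd.concat([rand_df, row_df], ignore_index=True)
print(extended.shape)
```

As with `append`, `rand_df` itself is not modified; `pd.concat` returns a new data frame.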


### 5.3. <i class="fas fa-puzzle-piece"></i> Practice

In the cell below, perform the following.

+ Create a new 5x5 dataframe full of random numbers.
+ Create a new 5x10 dataframe full of 1's.
+ Append one to the other and print it.
+ Append a single Series of zeros to the end of the appended dataframe.


In [212]:
data1 =  pd.DataFrame(np.random.randint(0,100,size=(5, 5)), columns=list('ABCDE'))
data2 =  pd.DataFrame(np.ones((5, 10)))
data3 = pd.concat([data1,data2], axis = 1)
rand_row = pd.Series(np.zeros(10))
data3.append(rand_row, ignore_index=True)


Unnamed: 0,A,B,C,D,E,0,1,2,3,4,5,6,7,8,9
0,46.0,16.0,32.0,61.0,84.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,8.0,75.0,54.0,27.0,42.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,89.0,39.0,24.0,86.0,88.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,78.0,36.0,85.0,13.0,76.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,68.0,24.0,11.0,51.0,93.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 5.4 Grouping

#### 5.4.1 Grouping by a Single Column
Grouping using the `groupby` function of a DataFrame object is a powerful sorting and subsetting tool of Pandas. The grouping performs three operations:  

- Splitting the data frame
- Applying a function
- Combining the results


This is best demonstrated with an example. Let's remind ourselves of the iris data frame:

In [213]:
iris_df.sample(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
36,5.5,3.5,1.3,0.2,setosa
16,5.4,3.9,1.3,0.4,setosa
53,5.5,2.3,4.0,1.3,versicolor
13,4.3,3.0,1.1,0.1,setosa
128,6.4,2.8,5.6,2.1,virginica


Suppose we want to know the mean width and length of each tissue (sepal and petal) for each species. How would you do this? One might be inclined to write a Python function that iterates through the rows of the dataframe and calculates the stats for each species.
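
A sketch of that manual approach might look like the following. It uses a small made-up table, `toy_df`, as a stand-in for the iris data and handles only one measurement column:

```python
import pandas as pd

# A small, made-up stand-in for the iris data frame.
toy_df = pd.DataFrame({
    'sepal_length': [5.0, 5.2, 6.0, 6.4],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})

# Manual approach: collect the values for each species, then average them.
values_by_species = {}
for _, row in toy_df.iterrows():
    values_by_species.setdefault(row['species'], []).append(row['sepal_length'])

manual_means = {
    species: sum(values) / len(values)
    for species, values in values_by_species.items()
}
print(manual_means)
```

This works, but the loop has to be repeated or extended for every column and every statistic.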

There is an easier way to accomplish this using the Pandas `groupby` function. The `groupby` function allows us to "collapse" the rows of data into groups by one or more columns. Once the rows are collapsed into groups, we can then apply functions to the groups. First, start by grouping the iris dataset by species:

In [214]:
iris_df.groupby('species')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022A02D42EB0>

Notice that the `groupby` function returns a new object of type `pandas.core.groupby.generic.DataFrameGroupBy`, or a `DataFrameGroupBy` object for short. Let's re-run that command and save that object in a variable:

In [215]:
groups = iris_df.groupby('species')

Let's explore the `DataFrameGroupBy` object.  This object allows us to iterate over its "groups".  We can do so using a `for` loop:

In [217]:
# Iterate through each group, print its type and contents.
for group in groups:
    print(type(group))
    print(group)

    # Let's stop the for loop after one iteration to save space in the Notebook.
    break

<class 'tuple'>
('setosa',     sepal_length  sepal_width  petal_length  petal_width species
0            5.1          3.5           1.4          0.2  setosa
1            4.9          3.0           1.4          0.2  setosa
2            4.7          3.2           1.3          0.2  setosa
3            4.6          3.1           1.5          0.2  setosa
4            5.0          3.6           1.4          0.2  setosa
5            5.4          3.9           1.7          0.4  setosa
6            4.6          3.4           1.4          0.3  setosa
7            5.0          3.4           1.5          0.2  setosa
8            4.4          2.9           1.4          0.2  setosa
9            4.9          3.1           1.5          0.1  setosa
10           5.4          3.7           1.5          0.2  setosa
11           4.8          3.4           1.6          0.2  setosa
12           4.8          3.0           1.4          0.1  setosa
13           4.3          3.0           1.1          0.1  setos

Observe that the groups are tuples. Remember that a tuple is like a list, but unlike a list, once it is created you cannot change it. Tuples are represented using parentheses, `(` and `)`, rather than square brackets. The first element of the tuple is the key. In the example printed above the key is `setosa`. The second element of the tuple is a dataframe containing the rows that belong to that group. We learn from this that iterating over the `DataFrameGroupBy` object yields a sequence of tuples, where each tuple contains a group key and the rows of the dataframe that belong to that group.
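
Since each item is a two-element tuple, it can be unpacked directly in the `for` statement. The sketch below uses a small made-up table, `toy_df`, in place of the iris data. It also shows `get_group`, which retrieves the rows for a single key:

```python
import pandas as pd

# A small, made-up stand-in for the iris data frame.
toy_df = pd.DataFrame({
    'petal_length': [1.4, 1.3, 4.7, 4.5],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})
groups = toy_df.groupby('species')

# Unpack each (key, rows) tuple directly in the for statement.
group_sizes = {}
for key, group_df in groups:
    group_sizes[key] = len(group_df)
    print(key, group_df.shape)

# get_group returns the rows for a single key without looping.
setosa_rows = groups.get_group('setosa')
print(setosa_rows)
```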

Now that we have our groups we can apply a function.  To get the mean we simply call `mean` on the groups object:

In [216]:
groups.mean()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,5.006,3.418,1.464,0.244
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026


Notice that the result is a new data frame. 

There is a large variety of functions supported by the `DataFrameGroupBy` object. These include `mean`, `var`, `std`, `min` and many more. You can find the list of all available functions on the [GroupBy API documentation page](https://pandas.pydata.org/pandas-docs/version/0.23.1/api.html#groupby).
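
As a brief illustration, the sketch below calls two of these functions and then uses `agg` to apply several at once. The table `toy_df` is made up for this example and stands in for the iris data:

```python
import pandas as pd

# A small, made-up stand-in for the iris data frame.
toy_df = pd.DataFrame({
    'petal_length': [1.4, 1.6, 4.0, 5.0],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})
groups = toy_df.groupby('species')

# Individual summary functions, called the same way as mean().
print(groups.min())
print(groups.max())

# agg applies several functions at once; the result has one column
# per (original column, function) pair.
summary = groups.agg(['mean', 'min', 'max'])
print(summary)
```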


#### 5.4.3 Grouping by Multiple Columns

It is possible to have `groupby` use more than one column to group. To demonstrate, let's suppose we took measurements of the iris sepals and petals at two different developmental stages: early and late flowering. We want to calculate the mean for those two different periods. In this case we would need to group by the `species` column and by a new developmental stage column. Let's create one by randomly assigning the developmental stage (of course, in reality this data would be provided):

In [218]:
iris_df['dev_stage'] = np.random.choice(['early', 'late'], iris_df.shape[0])
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,dev_stage
0,5.1,3.5,1.4,0.2,setosa,late
1,4.9,3.0,1.4,0.2,setosa,early
2,4.7,3.2,1.3,0.2,setosa,early
3,4.6,3.1,1.5,0.2,setosa,late
4,5.0,3.6,1.4,0.2,setosa,early


Now let's group by both `species` and `dev_stage` and calculate the mean:

In [219]:
# Group by species and developmental stage
groups = iris_df.groupby(['species', 'dev_stage'])
# Calculate the mean
groups.mean()

species,dev_stage,sepal_length,sepal_width,petal_length,petal_width
setosa,early,4.96,3.388,1.412,0.236
setosa,late,5.052,3.448,1.516,0.252
versicolor,early,5.945455,2.836364,4.313636,1.377273
versicolor,late,5.928571,2.717857,4.217857,1.285714
virginica,early,6.559259,2.914815,5.525926,2.037037
virginica,late,6.621739,3.043478,5.582609,2.013043


We now have the mean organized by species and developmental stage!

Notice that in this dataframe there are two row index names, corresponding to the two group-by columns. This is called a `MultiIndex`. We will not explore `MultiIndex` in depth; however, you can learn more about it on the [MultiIndex/advanced indexing page](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).
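
For a quick look at what this means in practice, the sketch below selects one row by its pair of labels and then converts the index levels back into ordinary columns with `reset_index`. It uses a small made-up table, `toy_df`, rather than the iris data:

```python
import pandas as pd

# A small, made-up stand-in for the iris data frame.
toy_df = pd.DataFrame({
    'petal_length': [1.4, 1.6, 4.0, 5.0],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'dev_stage': ['early', 'late', 'early', 'early'],
})

means = toy_df.groupby(['species', 'dev_stage']).mean()

# Each row label is a (species, dev_stage) pair.
print(means.index.tolist())

# Select one row by passing the pair as a tuple to .loc.
print(means.loc[('versicolor', 'early')])

# reset_index turns the index levels back into regular columns.
flat = means.reset_index()
print(flat)
```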

#### 5.4.4. <i class="fas fa-puzzle-piece"></i> Practice

In the cell below, perform the following.

Demonstrate a `groupby`.

+ Create a new column with the label "region" in the iris data frame. This column will indicate the geographic regions of the US where measurements were taken. Values should include: 'Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'. Assign these randomly.
+ Use `groupby` to get a new data frame of means for each species in each region.
+ Add a `dev_stage` column by randomly selecting from the values "early" and "late".
+ Use `groupby` to get a new data frame of means for each species, in each region and each developmental stage.
+ Use the `count` function (just like you used the `mean` function) to identify how many rows in the table belong to each combination of species + region + developmental stage.

In [223]:
iris_df['region'] = np.random.choice(['Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'], iris_df.shape[0])
g1 = iris_df.groupby(['species','region'])
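# numeric_only=True skips the non-numeric dev_stage column (required in pandas 2.0 and later).
print(g1.mean(numeric_only=True))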
g2 = iris_df.groupby(['species','region','dev_stage'])
print(g2.mean())
print(g2.count())

                      sepal_length  sepal_width  petal_length  petal_width
species    region                                                         
setosa     Midwest        5.216667     3.650000      1.416667     0.166667
           Northeast      5.000000     3.411111      1.488889     0.266667
           Northwest      5.000000     3.453333      1.426667     0.240000
           Southeast      5.018182     3.418182      1.518182     0.300000
           Southwest      4.866667     3.211111      1.466667     0.211111
versicolor Midwest        6.063636     2.663636      4.345455     1.318182
           Northeast      6.014286     2.828571      4.328571     1.364286
           Northwest      5.784615     2.776923      4.238462     1.315385
           Southeast      6.083333     2.783333      4.250000     1.283333
           Southwest      5.700000     2.800000      4.000000     1.316667
virginica  Midwest        6.325000     2.950000      5.500000     2.012500
           Northeast     

## Expected Outcomes
At this point, you should feel comfortable with the following:
- Inserting new columns in a Pandas dataframe.
- Working with missing data.
- Performing mathematics with dataframes.
- Using the apply function in Pandas.
- Working with strings in Pandas.
- Concatenating columns and rows.
- Appending data.
- Grouping data.

## What to Turn in?
Be sure to **commit** and **push** your changes to this notebook.  All practice exercises should be completed.  Once completed, send a **Slack message** to the instructor indicating you have completed this assignment. The instructor will verify all work is completed. 