# DSE Course 1, Session 3: Functions and Pandas Introduction

**Instructor**: Wesley Beckner<br>

**Contact**: wesleybeckner@gmail.com

<br>

---

<br>

Today, we will discuss **_functions_** in more depth.  We've seen them previously and used them, for example the `.append()` **_function_** for lists, or the even more general `print()` function.  Here, we'll dig into how you can make your own functions to encapsulate code that you will reuse over and over.  

Then we'll jump into the **Pandas** package.  Packages are collections of related functions; these are the things we `import`. Pandas provides the *data frame*, a two-dimensional data structure like a spreadsheet in Excel. In fact, we will import our first dataset and view it with Pandas!

<br>

---

## 3.1 Review from Session on Data Structures and Flow Control

In our last session, we discussed **_lists_**, **_dictionaries_**, and **_flow control_**.

**_Lists_** are **_ordered collections_** of data that can be used to hold multiple pieces of information while preserving their order.  We use `[` and `]` to access elements by their indices which start with `0`.  All things that operate on **_lists_** like slices use the concept of an inclusive lower bound and an exclusive upper bound.  So, the following gets elements from the **_list_** `my_list` with index values of `0`, `1`, and `2`, but **not** `3`!

```
my_list[0:3]
```

> What other way is there of writing the same statement using **_slicing_**?  Hint, think about leaving out one of the numbers in the slice!

**_Dictionaries_** are **_named_** **_collections_** of data that can be used to hold multiple pieces of information as **_values_** that are addressed by **_keys_**, resulting in a **_key_** to **_value_** data structure.  They are accessed with `[` and `]` but initialized with `{` and `}`.  E.g.

```
my_dict = { 'cake' : 'Tasty!', 'toenails' : 'Gross!' }
my_dict['cake']
```

Finally, we talked about **_flow control_** and using the concept of **_conditional execution_** to decide which code statements were executed.  Remember this figure?

<img src="https://docs.oracle.com/cd/B19306_01/appdev.102/b14261/lnpls008.gif">Flow control figure</img>

> What are the **_if_** statements? <br> 
Where do **_for_** loops fit in? <br>
What was the overarching concept of a **_function_**?

## 3.2 Functions

For loops let you repeat some code for every item in a list.  Functions are similar in that they run the same lines of code and, frequently, for new values of some variable (we call these **_parameters_**).  They are different in that functions are not limited to looping over items.

Functions are a critical part of writing easy to read, reusable code.

Create a function like:
```
def function_name (parameters):
    """
    optional docstring
    """
    function expressions
    return [variable]
```
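As a concrete instance of the template above, here is a minimal sketch of a function that takes two parameters, has a docstring, and returns a value. The name `rectangle_area` and its parameters are made up for illustration.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle."""
    area = width * height
    return area


result = rectangle_area(3, 4)
print(result)                  # the value handed back by return
print(rectangle_area.__doc__)  # the docstring is stored on the function
```

Because the function returns its result instead of printing it, the caller can store the value in a variable and reuse it later.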

Here is a simple example.  It prints a string that was passed in and returns nothing.

```
def print_string(string):
    """This prints out a string passed as the parameter."""
    print(string)
    return
```

In [None]:
def print_string(string):
    """This prints out a string passed as the parameter."""
    print(string)
    return

To call the function, use:
```
print_string("GIX is awesome!")
```

_Note:_ The function has to be defined before you can call it!

In [None]:
print_string("GIX is awesome!")

GIX is awesome!


If you don't provide a parameter or provide more than one parameter, you get an error.

In [None]:
# print_string()

TypeError: ignored
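The notebook output above hides the details of the error. As a rough sketch, you can catch the `TypeError` with `try`/`except` and print its message to see what Python reports for too few and too many arguments. The `errors` list is only a helper for this illustration.

```python
def print_string(string):
    """Print the string passed as the parameter."""
    print(string)


errors = []

try:
    print_string()            # missing the required argument
except TypeError as error:
    errors.append(str(error))

try:
    print_string("a", "b")    # one argument too many
except TypeError as error:
    errors.append(str(error))

for message in errors:
    print(message)
```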

### 3.2.1 Function Parameters

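Arguments in Python are passed as references to objects (this is sometimes called *pass by object reference*).  This means that if you modify a mutable argument, such as a list, in place inside the function, the change is visible outside of the function.  Reassigning the parameter name inside the function, however, does not affect the caller's variable.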

See the following example:

```
def change_list(my_list):
    """Append an item to the list that was passed in."""
    my_list.append('four')
    print('list inside the function: ', my_list)
    return

my_list = [1, 2, 3]
print('list before the function: ', my_list)
change_list(my_list)
print('list after the function: ', my_list)
```

In [None]:
def change_list(my_list):
    """Append an item to the list that was passed in."""
    my_list.append('four')
    print('list inside the function: ', my_list)
    return

my_list = [1, 2, 3]
print('list before the function: ', my_list)
change_list(my_list)
print('list after the function: ', my_list)

list before the function:  [1, 2, 3]
list inside the function:  [1, 2, 3, 'four']
list after the function:  [1, 2, 3, 'four']
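It can help to contrast changing an object in place with reassigning the parameter name. The sketch below uses two made-up functions, `append_item` and `rebind_items`, to show that only the in-place change is visible to the caller.

```python
def append_item(items):
    """Mutate the list object that the caller passed in."""
    items.append('four')


def rebind_items(items):
    """Point the local name at a new list; the caller's list is untouched."""
    items = ['a', 'brand', 'new', 'list']
    return items


original = [1, 2, 3]
append_item(original)
print(original)        # the appended item is visible outside the function

untouched = [1, 2, 3]
rebind_items(untouched)
print(untouched)       # still the original three items
```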


### 3.2.2 For advanced folks...

Variables have scope: **_global_** and **_local_**

In a function, new variables that you create are not saved when the function returns - these are **_local_** variables.  Variables defined outside of the function are **_global_** variables: inside a function they can be read, but assigning to the same name creates a new **_local_** variable instead of changing the global one.  _Note_: the **_global_** keyword lets a function reassign a global variable.  Generally, the use of **_global_** variables is not encouraged; use parameters instead.

```
my_global_1 = 'bad idea'
my_global_2 = 'another bad one'
my_global_3 = 'better idea'

def my_function():
    print(my_global_1)
    my_global_2 = 'broke your global, man!'
    global my_global_3
    my_global_3 = 'still a better idea'
    return
    
my_function()
print(my_global_2)
print(my_global_3)
```

In [None]:
my_global_1 = 'bad idea'
my_global_2 = 'another bad one'
my_global_3 = 'better idea'

def my_function():
    print(my_global_1)
    my_global_2 = 'broke your global, man!'
    global my_global_3
    my_global_3 = 'still a better idea'
    return
    
my_function()
print(my_global_2)
print(my_global_3)

bad idea
another bad one
still a better idea


In general, you want to use parameters to provide data to a function and return a result with the `return` statement. E.g.

```
# named add_numbers so it does not shadow Python's built-in sum()
def add_numbers(x, y):
    my_sum = x + y
    return my_sum
```

#### Exercise 1: My first function

Write a function that takes one parameter and returns any data structure

> If you are going to return multiple objects, what data structure that we talked about can be used?  Give an example below.

In [None]:
# Cell for exercise 1

### 3.2.3 Parameters have three different types:

| type | behavior |
|------|----------|
| required | positional, must be present or error, e.g. `my_func(first_name, last_name)` |
| keyword | position independent, e.g. `my_func(first_name, last_name)` can be called `my_func(first_name='Wesley', last_name='Beckner')` or `my_func(last_name='Beckner', first_name='Wesley')` |
| default | keyword params that default to a value if not provided |


```
def print_name(first, last='Beckner'):
    print(f'Your name is {first} {last}')
    return
```

In [None]:
def print_name(first, last='Beckner'):
    print(f'Your name is {first} {last}')
    return

Play around with the above function.

In [None]:
print_name('Wesley', last='the DSE Instructor')

Your name is Wesley the DSE Instructor
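To compare the three ways of supplying parameters side by side, here is a small sketch. It uses a hypothetical variant called `format_name` that returns the string instead of printing it, so the results can be stored and compared.

```python
def format_name(first, last='Beckner'):
    """Return the greeting instead of printing it."""
    return f'Your name is {first} {last}'


positional = format_name('Wesley', 'Smith')           # both passed by position
keyword = format_name(last='Smith', first='Wesley')   # keywords, in any order
default = format_name('Wesley')                       # last falls back to 'Beckner'

print(positional)
print(keyword)
print(default)
```

The positional and keyword calls produce the same string; only the default call differs, because `last` was not provided.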


**DETOUR** into string formatting.  In the function `print_name` you may have noticed the `f` in front of the quotation mark at the beginning of the string. This marks a special kind of string called an **f-string**.  Details on f-strings, including how they are used and how they differ from other ways of embedding the values of variables in a new string, are [here](https://realpython.com/python-f-strings/).  They are a relatively new language addition (Python >= 3.6), but they are the bee's knees.

Put simply, when you put an `f` in front of the first quotation mark of a string (it can be `"` or `'`), the curly braces `{` and `}` can be used to enclose a variable name.  The value of that variable name when the **f-string** is evaluated will be substituted into the location of the curly braces.  Neat, right!

Some more examples:

```
my_favorite_number = 42
print(f"My favorite number is {my_favorite_number}.")
print(f"My favorite number times 2 is {my_favorite_number*2}.")

my_favorite_animal = "liger"
print(f"My favorite animal is a {my_favorite_animal}.")
print(f"What is a {my_favorite_animal} you ask?  It's pretty much my favorite animal. It's like a lion and a tiger mixed... bred for its skills in magic.")
```

In [None]:
my_favorite_number = 42
print(f"My favorite number is {my_favorite_number}.")
print(f"My favorite number times 2 is {my_favorite_number*2}.")

my_favorite_animal = "liger"
print(f"My favorite animal is a {my_favorite_animal}.")
print(f"What is a {my_favorite_animal} you ask?  It's pretty much my favorite animal. It's like a lion and a tiger mixed... bred for its skills in magic.")

My favorite number is 42.
My favorite number times 2 is 84.
My favorite animal is a liger.
What is a liger you ask?  It's pretty much my favorite animal. It's like a lion and a tiger mixed... bred for its skills in magic.
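f-strings also accept an optional format specification after a colon inside the braces, which is handy for numeric data. A minimal sketch, with made-up variable names and values:

```python
price = 3.14159
count = 42

two_decimals = f"{price:.2f}"   # fixed-point with two decimal places
padded = f"{count:>6}"          # right-aligned in a field six characters wide
percent = f"{0.256:.1%}"        # scaled by 100 with a percent sign appended

print(two_decimals)
print(padded)
print(percent)
```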


**END DETOUR!**

Functions can contain any code that you put anywhere else including:
* if...elif...else
* for...else
* while
* other function calls

```
def print_name_age(first, last, age):
    print_name(first, last)
    print('Your age is %d' % (age))
    if age > 25 and age < 40:
        print('You are a millennial!')
    return
```


In [None]:
def print_name_age(first, last, age):
    print_name(first, last)
    print('Your age is %d' % (age))
    if age > 25 and age < 40:
        print('You are a millennial!')
    return

```
print_name_age(age=29, last='Beckner', first='Wesley')
```

In [None]:
print_name_age(age=29, last='Beckner', first='Wesley')

Your name is Wesley Beckner
Your age is 29
You are a millennial!


## 3.3 Pandas and the Scientific Python Toolkit

In addition to Python's built-in modules like the ``math`` module, there are also many often-used third-party modules that are core tools for doing data science with Python.
Some of the most important ones are:

#### [``numpy``](http://numpy.org/): Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data.
If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.

#### [``scipy``](http://scipy.org/): Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more.
We will not look closely at Scipy today, but we will use its functionality later in the course.

#### [``pandas``](http://pandas.pydata.org/): Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a *Data Frame*.
If you've used the [R](http://rstats.org) statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

#### [``matplotlib``](http://matplotlib.org): Visualization in Python

Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).

#### [``scikit-learn``](https://scikit-learn.org/stable/): Machine Learning in Python

Scikit-learn is a machine learning library.

It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN.

The library is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

As a side note, every year Stack Overflow conducts a developer survey. Pandas is the fourth most popular miscellaneous framework among developers, out of all common coding languages!

<img src="https://raw.githubusercontent.com/wesleybeckner/ds_for_engineers/main/assets/other_frameworks_description.JPG" width=400px></img>

<img src="https://raw.githubusercontent.com/wesleybeckner/ds_for_engineers/main/assets/other_frameworks_results.JPG" width=400px></img>

<small>source: insights.stackoverflow.com/survey/2020</small>


### 3.3.1 Pandas and Scikit-Learn `datasets`

We begin by loading the Pandas package.  Packages are collections of functions that share a common utility.  We've seen `import` before.  Let's use it to import Pandas and all the richness that it has.

We'll also use a very useful feature of the scikit-learn toolkit, the `sklearn.datasets` module, which provides functions for loading small example datasets. We will do some very rudimentary tasks with one of these datasets, just to demonstrate the utility of `sklearn.datasets`, then we will switch over to a more relevant dataset for our purposes.

```
import pandas
from sklearn.datasets import load_wine
```

In [None]:
import pandas
from sklearn.datasets import load_wine

We import a function `load_wine` that loads a simple data set we can play with called the Wine recognition dataset from the 1980s.

You can read more about that dataset [here](https://archive.ics.uci.edu/ml/datasets/Wine)

```
dataset = load_wine()
print(dataset.DESCR)
```

In [None]:
dataset = load_wine()
print(dataset.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

```
df = pandas.DataFrame()
```

In [None]:
df = pandas.DataFrame()

Because we'll use it so much, we often import under a shortened name using the ``import ... as ...`` pattern:

```
import pandas as pd
```

In [None]:
import pandas as pd

Let's create an empty _data frame_ and put the result into a variable called `df`.  This is a popular choice for a _data frame_ variable name.

```
df = pd.DataFrame()
```

In [None]:
df = pd.DataFrame()

Let's open the Wine dataset as a pandas data frame.  Notice we change the value of the `df` variable to point to a new data frame.

```
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
```

In [None]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
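The same constructor pattern works for any table-like data. As a small sketch, the block below builds a data frame from a list of rows and a list of column names; `rows` and `names` are hypothetical stand-ins for `dataset.data` and `dataset.feature_names`, with a few values copied from the first rows of the wine data.

```python
import pandas as pd

# Hypothetical stand-ins for dataset.data and dataset.feature_names
rows = [[14.23, 1.71], [13.20, 1.78], [13.16, 2.36]]
names = ['alcohol', 'malic_acid']

small_df = pd.DataFrame(rows, columns=names)
print(small_df)
print(small_df.shape)   # (number of rows, number of columns)
```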

*Note: strings in Python can be defined either with double quotes or single quotes*

### 3.3.2 Viewing Pandas Dataframes

The ``head()`` and ``tail()`` methods show us the first and last rows of the data.

```
df.head()
df.tail()
```

In [None]:
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [None]:
df.tail()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.7,0.64,1.74,740.0
174,13.4,3.91,2.48,23.0,102.0,1.8,0.75,0.43,1.41,7.3,0.7,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.2,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.3,0.6,1.62,840.0
177,14.13,4.1,2.74,24.5,96.0,2.05,0.76,0.56,1.35,9.2,0.61,1.6,560.0


The ``shape`` attribute shows us the number of rows and columns:

```
df.shape
```

Note it doesn't have the `()` because it isn't a **_function_** - it is an **_attribute_** or variable attached to the `df` object.

In [None]:
df.shape

(178, 13)

The ``columns`` attribute gives us the column names

```
df.columns
```


In [None]:
df.columns

Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue',
       'od280/od315_of_diluted_wines', 'proline'],
      dtype='object')

The ``index`` attribute gives us the row labels (the index):

```
df.index
```

In [None]:
df.index

RangeIndex(start=0, stop=178, step=1)

The ``dtypes`` attribute gives the data type of each column.  Remember the **_floating point_** data type?

```
df.dtypes
```

In [None]:
df.dtypes

alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity                 float64
hue                             float64
od280/od315_of_diluted_wines    float64
proline                         float64
dtype: object

### 3.3.3 Manipulating data with ``pandas``

Here we'll cover some key features of manipulating data with pandas.

Access columns by name using square-bracket indexing:

```
df['ash']
```

In [None]:
df['ash']

0      2.43
1      2.14
2      2.67
3      2.50
4      2.87
       ... 
173    2.45
174    2.48
175    2.26
176    2.37
177    2.74
Name: ash, Length: 178, dtype: float64

Mathematical operations on columns happen *element-wise*:

```
df['ash'] * .01
```

In [None]:
df['ash'] * .01

0      0.0243
1      0.0214
2      0.0267
3      0.0250
4      0.0287
        ...  
173    0.0245
174    0.0248
175    0.0226
176    0.0237
177    0.0274
Name: ash, Length: 178, dtype: float64

Columns can be created (or overwritten) with the assignment operator.
Let's create an alcohol fraction column by scaling the `alcohol` column by 0.01.

```
df['alcohol fraction'] = df['alcohol'] * .01
```

In [None]:
df['alcohol fraction'] = df['alcohol'] * .01

Let's use the `.head()` **_function_** to see our new data!

```
df.head()
```

In [None]:
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,alcohol fraction
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0.1423
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0.132
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0.1316
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0.1437
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0.1324
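The same ideas can be tried on a tiny data frame where the arithmetic is easy to check by hand. This is only a sketch; the `toy` frame and its values are made up.

```python
import pandas as pd

toy = pd.DataFrame({'alcohol': [14.0, 13.0, 12.5]})

# Scalar arithmetic is applied to every element of the column
toy['alcohol fraction'] = toy['alcohol'] * .01

# Arithmetic between two columns pairs the values up row by row
toy['double'] = toy['alcohol'] + toy['alcohol']

print(toy)
```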


#### 3.3.3.1 Loading data from csv and url

We can also load our datasets in a variety of formats. For the rest of this session we will be using another winemaker dataset that is a bit more extensive and offers more usability for our purposes!

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/wesleybeckner/ds_for_engineers/main/data/wine_quality/winequalityN.csv")
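`pd.read_csv` accepts a URL, a file path, or a file-like object. The sketch below reads from an in-memory string so that it runs without a network connection; the CSV text is made up for illustration and is not the course dataset.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file or URL
csv_text = """type,alcohol,quality
white,8.8,6
red,9.4,5
"""

wine = pd.read_csv(io.StringIO(csv_text))
print(wine)
print(wine.dtypes)   # pandas infers a data type for each column
```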

In [None]:
df.head()

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [None]:
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,6487.0,6489.0,6494.0,6495.0,6495.0,6497.0,6497.0,6497.0,6488.0,6493.0,6497.0,6497.0
mean,7.216579,0.339691,0.318722,5.444326,0.056042,30.525319,115.744574,0.994697,3.218395,0.531215,10.491801,5.818378
std,1.29675,0.164649,0.145265,4.758125,0.035036,17.7494,56.521855,0.002999,0.160748,0.148814,1.192712,0.873255
min,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0
25%,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99234,3.11,0.43,9.5,5.0
50%,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3,6.0
75%,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.99699,3.32,0.6,11.3,6.0
max,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9,9.0


In [None]:
df.dtypes

type                     object
fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

### 3.3.4 Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at *value counts* and the basics of *group-by* operations.

In preparation for grouping the data, let's bin the instances by their density (we could have chosen any numerical column). For that, we'll use ``pd.cut``, which bins values into discrete intervals.  The documentation for ``pd.cut`` can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html).  This is like a histogram, where the range of data values is divided into *bins* and each value is assigned to the bin it falls in.  In our example, we'll use 10 bins and let Pandas divide the range into bins of equal width.  Let's see it in action.

```
df['density_group'] = pd.cut(df['density'], 10)
df.head()
df.dtypes
```

In [None]:
df['density_group'] = pd.cut(df['density'], 10)

In [None]:
df.head()

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,density_group
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,"(0.997, 1.003]"
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,"(0.992, 0.997]"
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,"(0.992, 0.997]"
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,"(0.992, 0.997]"
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,"(0.992, 0.997]"


In [None]:
df.dtypes

type                      object
fixed acidity            float64
volatile acidity         float64
citric acid              float64
residual sugar           float64
chlorides                float64
free sulfur dioxide      float64
total sulfur dioxide     float64
density                  float64
pH                       float64
sulphates                float64
alcohol                  float64
quality                    int64
density_group           category
dtype: object
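To see what `pd.cut` does on data small enough to check by eye, here is a minimal sketch. The values, the bin edges, and the labels `'low'` and `'high'` are all assumptions chosen for illustration.

```python
import pandas as pd

values = pd.Series([0.0, 2.5, 5.0, 7.5, 10.0])

# Passing an integer asks pandas for that many equal-width bins
auto_bins = pd.cut(values, 2)

# Passing explicit edges (and optional labels) gives full control over the bins
labeled = pd.cut(values, bins=[0, 5, 10], labels=['low', 'high'],
                 include_lowest=True)

print(auto_bins)
print(labeled)
```

Each interval is closed on the right by default, so a value that sits exactly on an edge, like `5.0` here, lands in the lower bin.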

#### 3.3.4.1 Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data.
We'll take a look at two of these here.

The ``pandas.value_counts`` function returns the number of times each unique value occurs in a column.

We can use it, for example, to break down the wines by their density group that we just created:

```
pd.value_counts(df['density_group'])
```

In [None]:
pd.value_counts(df['density_group'])

(0.992, 0.997]    3645
(0.987, 0.992]    1599
(0.997, 1.003]    1241
(1.003, 1.008]       9
(1.008, 1.013]       2
(1.034, 1.039]       1
(1.029, 1.034]       0
(1.023, 1.029]       0
(1.018, 1.023]       0
(1.013, 1.018]       0
Name: density_group, dtype: int64

What happens if we try this on a continuous valued variable?

```
pd.value_counts(df['density'])
```

In [None]:
pd.value_counts(df['density'])

0.99760    69
0.99720    69
0.99800    64
0.99200    64
0.99280    63
           ..
0.99483     1
0.98947     1
0.99837     1
0.99511     1
0.98923     1
Name: density, Length: 998, dtype: int64
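The same counts are available as methods on a column itself, alongside `unique()` and `nunique()`. Newer pandas releases deprecate the top-level `pd.value_counts` function in favor of the method form, so this sketch uses the methods. The `colors` data is made up.

```python
import pandas as pd

colors = pd.Series(['white', 'red', 'white', 'white', 'red', 'rose'])

counts = colors.value_counts()   # count per distinct value, most common first
distinct = colors.unique()       # the distinct values, in order of appearance
n_distinct = colors.nunique()    # how many distinct values there are

print(counts)
print(distinct)
print(n_distinct)
```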

#### Exercise 2: `value_counts, unique, nunique`

We can do a little data exploration with this by seeing how common different values are. Play around with these pandas methods:

* `value_counts()`
* `unique()`
* `nunique()`

Do so with 3 different columns in the dataframe

#### 3.3.4.2 Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations.
You can visualize the group-by like this (image borrowed from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do))

![image](https://swcarpentry.github.io/r-novice-gapminder/fig/12-plyr-fig1.png)

Let's take this in smaller steps.
Recall our ``density_group`` column.

```
pd.value_counts(df['density_group'])
```

In [None]:
pd.value_counts(df['density_group'])

(0.992, 0.997]    3645
(0.987, 0.992]    1599
(0.997, 1.003]    1241
(1.003, 1.008]       9
(1.008, 1.013]       2
(1.034, 1.039]       1
(1.029, 1.034]       0
(1.023, 1.029]       0
(1.018, 1.023]       0
(1.013, 1.018]       0
Name: density_group, dtype: int64

`groupby` allows us to split the rows of a data frame into *groups* based on the values in one or more columns and then compute a summary for each group.  The `groupby` documentation is [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).  Let's count how many records, or rows, in our data set fall into each bin of our density data.  Note that `count()` reports the number of non-missing values in each column, so the counts can differ slightly from column to column.

```
df.groupby(['density_group']).count()
```

In [None]:
df.groupby(['density_group']).count()

density_group,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
"(0.987, 0.992]",1599,1597,1598,1599,1599,1599,1599,1599,1599,1597,1599,1599,1599
"(0.992, 0.997]",3645,3638,3638,3643,3643,3643,3645,3645,3645,3640,3641,3645,3645
"(0.997, 1.003]",1241,1240,1241,1240,1241,1241,1241,1241,1241,1239,1241,1241,1241
"(1.003, 1.008]",9,9,9,9,9,9,9,9,9,9,9,9,9
"(1.008, 1.013]",2,2,2,2,2,2,2,2,2,2,2,2,2
"(1.013, 1.018]",0,0,0,0,0,0,0,0,0,0,0,0,0
"(1.018, 1.023]",0,0,0,0,0,0,0,0,0,0,0,0,0
"(1.023, 1.029]",0,0,0,0,0,0,0,0,0,0,0,0,0
"(1.029, 1.034]",0,0,0,0,0,0,0,0,0,0,0,0,0
"(1.034, 1.039]",1,1,1,1,1,1,1,1,1,1,1,1,1


Now, let's find the mean of each of the columns for each ``density_group``.  *Notice* what happens to the non-numeric columns.

```
df.groupby(['density_group']).mean()
```

In [None]:
df.groupby(['density_group']).mean()

density_group,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
"(0.987, 0.992]",6.58062,0.280219,0.318555,2.612477,0.037036,30.714509,112.679174,0.990814,3.202461,0.487473,11.817209,6.304565
"(0.992, 0.997]",7.153079,0.353889,0.29947,5.100796,0.060011,29.661728,114.43594,0.994989,3.236239,0.535018,10.243895,5.679012
"(0.997, 1.003]",8.186532,0.372192,0.374524,9.964988,0.068479,32.870669,123.724819,0.998717,3.18774,0.575584,9.515082,5.601934
"(1.003, 1.008]",11.877778,0.611111,0.43,11.755556,0.109667,24.777778,71.666667,1.003202,3.045556,0.662222,10.333333,5.666667
"(1.008, 1.013]",7.9,0.33,0.28,31.6,0.053,35.0,176.0,1.0103,3.15,0.38,8.8,6.0
"(1.013, 1.018]",,,,,,,,,,,,
"(1.018, 1.023]",,,,,,,,,,,,
"(1.023, 1.029]",,,,,,,,,,,,
"(1.029, 1.034]",,,,,,,,,,,,
"(1.034, 1.039]",7.8,0.965,0.6,65.8,0.074,8.0,160.0,1.03898,3.39,0.69,11.7,6.0

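The output above appears to come from a pandas version that silently drops non-numeric columns such as `type` when averaging. In newer releases (2.0 and later, as far as the changelog indicates) the same call raises an error instead, unless you pass `numeric_only=True`. A minimal sketch on a made-up `toy` frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'type': ['white', 'red', 'white'],   # non-numeric column
    'alcohol': [10.0, 12.0, 9.0],
})

# numeric_only=True averages only the numeric columns in each group
means = toy.groupby('group').mean(numeric_only=True)
print(means)
```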

You can specify a groupby using the names of table columns and compute other functions, such as the ``sum``, ``count``, ``std``, and ``describe``.

```
df.groupby(['density_group'])['residual sugar'].describe()
```

In [None]:
df.groupby(['density_group'])['residual sugar'].describe()

density_group,count,mean,std,min,25%,50%,75%,max
"(0.987, 0.992]",1599.0,2.612477,1.822662,0.7,1.3,1.8,3.5,11.25
"(0.992, 0.997]",3643.0,5.100796,3.769753,0.6,1.9,3.6,7.9,22.6
"(0.997, 1.003]",1241.0,9.964988,6.042544,1.3,2.8,12.0,14.8,23.5
"(1.003, 1.008]",9.0,11.755556,9.340485,3.7,4.2,6.6,15.4,26.05
"(1.008, 1.013]",2.0,31.6,0.0,31.6,31.6,31.6,31.6,31.6
"(1.013, 1.018]",0.0,,,,,,,
"(1.018, 1.023]",0.0,,,,,,,
"(1.023, 1.029]",0.0,,,,,,,
"(1.029, 1.034]",0.0,,,,,,,
"(1.034, 1.039]",1.0,65.8,,65.8,65.8,65.8,65.8,65.8


The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

```
<data object>.groupby(<grouping values>).<aggregate>()
```

You can even group by multiple values by passing several column names in the list.  The cell below groups by a single column, ``residual sugar``, and summarizes ``quality`` for each distinct sugar value.

In [None]:
df.groupby(['residual sugar'])['quality'].describe()

residual sugar,count,mean,std,min,25%,50%,75%,max
0.60,2.0,5.000000,0.000000,5.0,5.0,5.0,5.0,5.0
0.70,7.0,4.857143,1.069045,3.0,4.5,5.0,5.5,6.0
0.80,25.0,5.240000,0.925563,4.0,5.0,5.0,6.0,8.0
0.90,41.0,5.634146,0.829340,4.0,5.0,6.0,6.0,7.0
0.95,4.0,5.000000,0.000000,5.0,5.0,5.0,5.0,5.0
...,...,...,...,...,...,...,...,...
22.60,1.0,5.000000,,5.0,5.0,5.0,5.0,5.0
23.50,1.0,5.000000,,5.0,5.0,5.0,5.0,5.0
26.05,2.0,6.000000,0.000000,6.0,6.0,6.0,6.0,6.0
31.60,2.0,6.000000,0.000000,6.0,6.0,6.0,6.0,6.0
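Grouping by more than one column works the same way: pass several column names and pandas forms one group per combination of values. This is a sketch on a made-up `toy` frame; the `style` column is hypothetical and not part of the wine dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'type': ['white', 'white', 'red', 'red', 'red'],
    'style': ['sweet', 'dry', 'dry', 'dry', 'sweet'],   # hypothetical column
    'quality': [6, 5, 5, 7, 6],
})

# One group per (type, style) combination, then the mean quality of each
by_two = toy.groupby(['type', 'style'])['quality'].mean()
print(by_two)
```

The result is indexed by both grouping columns, so a single group can be looked up with a tuple such as `by_two.loc[('red', 'dry')]`.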


#### Exercise 3: Group-by

<ol>
<li>use <code>pd.cut</code> to bin one or more of the dataframe columns</li>
<li>use <code>groupby</code> to group by that column (or those columns)</li>
<li>perform three different statistical summaries of the groups in three separate instances</li>
</ol>



In [None]:
# Cell for exercise 3

## 3.5 Breakout for Functions and Pandas

Write a function that outputs the results of your exercise 3 code into a new dataframe. 