----
# Pandas Idioms
----

Python programmers will often suggest that there are many ways Python can be used to solve a particular problem. However, some are more appropriate than others. The best solutions are celebrated as "**Idiomatic Python**" and there are lots of great examples of this on Stack Overflow and other websites. Striving to write idiomatic Python is one of the quickest ways to increase maintainability and reduce 'simple' bugs.

Pandas has its own set of idioms, akin to a sub-language within Python. We've alluded to some of these already, such as using vectorization whenever possible, and not using iterative loops if you don't need to.

Several developers and users within the pandas community have used the term **pandorable** for these idioms. I think it's a great term, so I want to share with you a couple of key features of how you can make your code "pandorable".

Let's start by calling in a dataset we can use to explore pandas' different idioms.

In [1]:
# Mount the drive. 
from google.colab import drive
drive.mount('/content/drive')
#!ls /content/drive/My\ Drive/Applied\ Data\ Science\ in\ Python/datasets/  

Mounted at /content/drive


In [2]:
# Let's bring in our data processing libraries
import pandas as pd
import numpy as np

# And look at some census data from the US
df = pd.read_csv('/content/drive/My Drive/Applied Data Science in Python/datasets/census.csv')
df.head()


## Method Chaining

The first of the pandas idioms I would like to talk about is **method chaining**. The general idea behind method chaining is that every method on an object returns an object, in pandas usually a new DataFrame or Series, so the next method can be called directly on the result. The beauty of this is that you can condense many different operations on a DataFrame into one line or at least one statement of code. Used with care, this improves the readability of the code, and it is very common in pandas code.

For example, let's assume that I want to:
1. extract only county-level data, i.e. data which has a summary level of 50
2. set the state and county names as a MultiIndex,
3. rename a column too, just to make it a bit more readable

Here's the "pandorable" way to write code with method chaining...


In [None]:
(df[df['SUMLEV']==50]
  .set_index(['STNAME','CTYNAME'])
  .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

Let's walk through this...

First, we use the indexing operator to pass in a boolean mask which will only return the rows where the SUMLEV is equal to 50. This indicates in our source data that the data is summarized at the county level. 

We take the resulting dataframe returned, and set its index to the state name followed by the county name. Finally, we can rename a column to make it more readable. 

Note that instead of writing this all on one line, as I could have done, I began the statement with a parenthesis, which tells Python that the statement continues over multiple lines until the matching closing parenthesis. This is purely for readability.

Below is the more traditional, non-pandorable way of writing this. There's nothing wrong with this code in the functional sense, and as a newcomer to the language you might even find it easier to understand. It's just not as pandorable as the first example.

In [None]:
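# First create a new dataframe from the original (a copy, so the original df stays untouched)
county_df = df[df['SUMLEV']==50].copy()
# Update the dataframe to have a new index, we use inplace=True to do this in place
county_df.set_index(['STNAME','CTYNAME'], inplace=True)
# Set the column names; rename() returns a new dataframe, so assign the result back
county_df = county_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})
county_df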

You'll see lots of examples on Stack Overflow and in documentation of people using method chaining in their pandas code. As such, I think being able to read and understand the syntax is really worth your time.
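
As a minimal, self-contained sketch of the same idea, the toy frame below (made-up values, with column names borrowed from the census data) checks that a chained statement and a step-by-step version give the same result. The names `toy`, `chained`, and `step` are only for illustration.

```python
import pandas as pd

# Toy stand-in for the census data; the values are made up for illustration
toy = pd.DataFrame({
    'SUMLEV': [40, 50, 50],
    'STNAME': ['Alabama', 'Alabama', 'Alabama'],
    'CTYNAME': ['Alabama', 'Autauga County', 'Baldwin County'],
    'ESTIMATESBASE2010': [100, 40, 60],
})

# Chained version: each method returns a new DataFrame that the next method acts on
chained = (toy[toy['SUMLEV'] == 50]
           .set_index(['STNAME', 'CTYNAME'])
           .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

# Step-by-step version with an intermediate variable that is reassigned each time
step = toy[toy['SUMLEV'] == 50]
step = step.set_index(['STNAME', 'CTYNAME'])
step = step.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})

# Both approaches end up with identical data
print(chained.equals(step))
```

Neither version modifies `toy`, because each method here returns a new object rather than changing the one it was called on.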

## Apply Function

Here's another pandas idiom. As we've learned in a previous lecture, Python has a wonderful function called `map()`, which is sort of a basis for functional programming in the language. When you want to use `map()` in Python, you pass it some function you want called, and some iterable, like a list, that you want the function to be applied to. The function is called on each item in the list, and the result is an iterator over all of the evaluations of that function, which you can turn into a list with `list()`.
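
As a quick reminder of how `map()` behaves, here is a small sketch. The function `double` is just a made-up example.

```python
def double(n):
    """Return twice the input value."""
    return n * 2

# map() returns a lazy iterator in Python 3, so wrap it in list() to see the values
doubled = list(map(double, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```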

Pandas has a similar function called [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), that I use often when wanting to map across all of the rows in a DataFrame. 



As an example, let's take a look at our census DataFrame. In this DataFrame, we have six columns for population estimates, each corresponding to a different year. Let's assume we want to create some new columns where we look at the minimum or maximum population estimates over the years for each county / state. The `apply()` function provides an easy way to do this.

First, we need to write a function which *takes a row* from our DataFrame, finds the minimum and maximum values, and returns a new row of data.  We'll call this function `min_max`. We can do this by taking a small slice of the row that projects the population columns, then using the NumPy `min` and `max` functions, and creating a new Series whose labels name the new values we want to return.


In [None]:
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']] 
    # The series here is the newly generated row
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

Now, we just need to call `apply()` on the DataFrame.

[`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) takes the function and the axis on which to operate as parameters. Here we have to be a bit careful. We've talked about axis zero being the rows of the DataFrame in the past, but this parameter names *the index* that each call to the function receives. To apply the function to every row, each row has to be handed over as a Series labeled by the column names, so we pass `axis` equal to `'columns'` (equivalently, `axis=1`).

In [None]:
df.apply(min_max, axis='columns')
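
If the `axis` parameter still feels backwards, a tiny made-up frame can help. This sketch only illustrates which slice the function receives for each setting, and the names `toy`, `per_row`, and `per_col` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['r1', 'r2'])

# axis='columns' (same as axis=1): the function receives one row at a time,
# as a Series whose labels are the column names
per_row = toy.apply(lambda row: row['a'] + row['b'], axis='columns')

# axis='index' (same as axis=0, the default): the function receives one column at a time
per_col = toy.apply(lambda col: col.sum(), axis='index')

print(per_row)  # one value per row label: r1, r2
print(per_col)  # one value per column label: a, b
```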

Of course there's no need to limit yourself to returning a new series object. If you're doing this as part of data cleaning, you're likely to find yourself wanting to add the new data to the existing DataFrame. In that case, you can just take the row values and add them as new columns indicating the max and min scores. This is a regular part of the data processing, inferring new data and building descriptive statistics, and is often used heavily with the merging of DataFrames.

Let's adjust the code above, and produce a revised version of the function `min_max` that returns the original row with two new entries labeled `max` and `min`, so that `apply()` gives back the original dataframe with two new columns.


In [None]:
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    # Create a new column for max
    row['max'] = np.max(data)
    # Create a new column for min
    row['min'] = np.min(data)
    #return the adjusted row
    return row
# Now just apply the function across the dataframe
df.apply(min_max, axis='columns')

`apply()` is an extremely important tool in your toolkit. However, you rarely use `apply()` with large function definitions, like we did. Instead, you typically see it used with `lambda`.

Let's see how we can convert the above code to one line of code using method chaining, as well as `lambda` in the `apply()` function. 

In [None]:
df[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 
    'POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']].apply(lambda x: np.max(x), axis=1).head()

Remember, `lambda` is basically an unnamed function. In this case it takes a single parameter, `x`, and returns a single value: the maximum over all columns associated with row `x`.

How would we adjust the above line of code to make it return a new column in the same dataframe, labeled "max"?

In [None]:
df['max']= df[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
               'POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']].apply(lambda x: np.max(x), axis=1)

df
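
For a simple reduction like this one, pandas also offers a built-in vectorized method. The sketch below, on a made-up two-column frame, suggests that `DataFrame.max(axis=1)` gives the same values as the `apply()` with `lambda` version, without calling a Python function for every row.

```python
import pandas as pd
import numpy as np

# Made-up population estimates for two counties
toy = pd.DataFrame({'POPESTIMATE2010': [5, 9],
                    'POPESTIMATE2011': [7, 8]})

# Row-wise maximum via apply(), one Python function call per row
via_apply = toy.apply(lambda x: np.max(x), axis=1)

# Built-in vectorized equivalent
via_max = toy.max(axis=1)

print(via_apply.tolist())
print(via_max.tolist())
```

`apply()` remains the more flexible tool when the per-row logic cannot be expressed with a built-in method.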

The beauty of the `apply()` function is that it allows flexibility in doing whatever manipulation you desire, as the function you pass into `apply()` can be customized however you want. 

Let's say we want to divide the states into four categories: Northeast, Midwest, South, and West. We can write a customized function that returns the region based on the state.

In [None]:
def get_state_region(x):
    northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 
                 'Rhode Island','Vermont','New York','New Jersey','Pennsylvania']
    midwest = ['Illinois','Indiana','Michigan','Ohio','Wisconsin','Iowa',
               'Kansas','Minnesota','Missouri','Nebraska','North Dakota',
               'South Dakota']
    south = ['Delaware','Florida','Georgia','Maryland','North Carolina',
             'South Carolina','Virginia','District of Columbia','West Virginia',
             'Alabama','Kentucky','Mississippi','Tennessee','Arkansas',
             'Louisiana','Oklahoma','Texas']
    west = ['Arizona','Colorado','Idaho','Montana','Nevada','New Mexico','Utah',
            'Wyoming','Alaska','California','Hawaii','Oregon','Washington']
    
    if x in northeast:
        return "Northeast"
    elif x in midwest:
        return "Midwest"
    elif x in south:
        return "South"
    else:
        return "West"

Now that we have the customized function, let's say we want to create a new column called `state_region`, showing the state's region. The customized function works on a single state name, so we call `apply()` on the state name column `STNAME` and pass the customized function into it.

In [None]:
df['state_region'] = df['STNAME'].apply(lambda x: get_state_region(x))

In [None]:
# Now let's see the results
df[['STNAME','state_region']]
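
When the custom function is really just a lookup, one alternative is `Series.map()` with a dictionary. This is only a sketch: the dictionary `region_of` is hypothetical and deliberately abbreviated to three states, and the frame `toy` is made up.

```python
import pandas as pd

# Hypothetical, abbreviated state-to-region lookup (only three states shown)
region_of = {'Maine': 'Northeast', 'Ohio': 'Midwest', 'Texas': 'South'}

toy = pd.DataFrame({'STNAME': ['Maine', 'Ohio', 'Texas', 'Utah']})

# Series.map() looks each value up in the dictionary; missing keys become NaN,
# so fillna() supplies the same "West" default as the else branch of the function
toy['state_region'] = toy['STNAME'].map(region_of).fillna('West')
print(toy)
```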

## GroupBy Function

Sometimes we want to select data based on groups and understand *aggregated* data on a group level. We have seen that even though Pandas allows us to iterate over every row in a dataframe, it is generally very slow to do so. 

Fortunately pandas has a `groupby()` function that speeds up such tasks. The idea behind the `groupby()` function is that it takes some dataframe, splits it into chunks based on some key values, applies computation on those chunks, then combines the results back together into another dataframe. In pandas, this is referred to as the ***split-apply-combine*** **pattern.**

![split-apply-combine](https://drive.google.com/uc?id=1IAqOC1pco14exirexw9YjgXxFe3vAPvq)
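
To make the three steps concrete, here is a minimal sketch on made-up data. It performs the split, apply, and combine steps by hand, and then lets `groupby()` do the same work in one call. The column names `key` and `value` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                    'value': [1, 2, 3, 4]})

# Split: one sub-frame per key. Apply: sum each sub-frame. Combine: collect into a Series.
manual = pd.Series({key: toy.loc[toy['key'] == key, 'value'].sum()
                    for key in toy['key'].unique()})

# groupby() performs the same three steps in a single call
grouped = toy.groupby('key')['value'].sum()

print(manual.to_dict())
print(grouped.to_dict())
```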

### Splitting

In the first example for `groupby()` I will re-use the census data.

In [None]:
# Let's look at some US census data
df = pd.read_csv('/content/drive/My Drive/Applied Data Science in Python/datasets/census.csv')
# And exclude state level summarizations, which have sum level value of 40
df = df[df['SUMLEV']==50]
df.head()

Let's assume we want to calculate the average population of the counties within each state using the `CENSUS2010POP` column. 

A naive way to do that is to generate a list of the unique states, then iterate over all the states, and for each state produce a dataframe and calculate the average.

To see how well this method performs, let's run the code 3 times and time it. For this we'll use the cell magic function we used before, `%%timeit`.

In [None]:
%%timeit -n 3

for state in df['STNAME'].unique():
    # We'll just calculate the average using numpy for this particular state
    avg = np.average(df[df['STNAME']==state]['CENSUS2010POP'])
    # And we'll print it to the screen
    print('Counties in state ' + state + 
          ' have an average population of ' + str(avg))

If you scroll down to the bottom of that output you can see it takes a fair bit of time to finish.

Now let's try doing the same thing using [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).

The `groupby()` **returns a GroupBy object**. When iterating over a GroupBy object, it returns a tuple for every "group":
1. the value of the key we were trying to group by, in this case a specific state name,
2. the projected dataframe for that group

In [None]:
%%timeit -n 3
# For this method, we start by telling pandas we're interested in grouping by state name, this is the "split"
for group, frame in df.groupby('STNAME'):
    # Now we include our logic in the "apply" step, which is to calculate an average of the census2010pop
    avg = np.average(frame['CENSUS2010POP'])
    # And print the results
    print('Counties in state ' + group + 
          ' have an average population of ' + str(avg))

Look at the difference in speed! GroupBy improved the performance by roughly a factor of two!
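
The loop above still computes each average in Python. As a hedged sketch on made-up numbers, the aggregation can also be left entirely to the GroupBy object by selecting the column and calling `mean()`. The frame `toy` is a hypothetical stand-in for the county-level rows.

```python
import pandas as pd

# Toy stand-in for the county-level census rows; populations are made up
toy = pd.DataFrame({'STNAME': ['Alabama', 'Alabama', 'Alaska'],
                    'CENSUS2010POP': [100, 300, 50]})

# Select the column on the GroupBy object and aggregate; no explicit loop is needed
avg_by_state = toy.groupby('STNAME')['CENSUS2010POP'].mean()
print(avg_by_state)
```

The result is a Series indexed by state name, which is usually more convenient to work with than printed strings.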

Now, most of the time, you'll use `groupby()` on one or more columns. However, it is good to know that you can also provide a function to `groupby()` and use that to segment your data.

Let's look at an example. Say you have a big batch job with lots of processing and you want to work on only a third or so of the states at a given time. As a result, you want to create some function which allocates each row into one of 3 groups based on the first letter of the state name. Then we can tell `groupby()` to use this function to "split" up our data frame. It's important to note that in order to do this you need to set the index of the dataframe to be the column that you want to group by first (as stated in the [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) : "If `by` is a function, it’s called on each value of the object’s index.")

We'll create a new function called `set_group`. If the first letter of the parameter is from A to L we'll return "A to L". If it's from M to P we'll return "M to P", otherwise we'll return "Q to Z". Then we'll pass this function to `groupby()`.

In [None]:
df = df.set_index('STNAME')

def set_group(index):
    if index[0]<'M':
        return "A to L"
    if index[0]<'Q':
        return "M to P"
    return "Q to Z"

# The dataframe is supposed to be grouped by according to the batch
# We will then loop through each batch group
for group, frame in df.groupby(set_group):
    print('There are ' + str(len(frame)) + ' records in group ' + group + ' for processing.')

Let's take one more look at an example of how we might group data. In this example, we use a dataset of housing listings from Airbnb. In this dataset, there are two columns of interest: one is the `cancellation_policy` and the other is the `review_scores_value`.

In [None]:
import pandas as pd
import numpy as np

df=pd.read_csv("/content/drive/My Drive/Applied Data Science in Python/datasets/listings.csv")
df.head()

So, how would I group by both of these columns? One approach might be to promote them to a multiindex and just call `groupby()`.

In [None]:
df=df.set_index(["cancellation_policy","review_scores_value"])

# When we have a multiindex we need to pass in the levels we are interested in grouping by
for group, frame in df.groupby(level=(0,1)):
    print(group)

This seems to work ok. But what if we wanted to group by the cancellation policy and review scores, but separate out all the 10's from those under ten? In this case, we could use a function to manage the groupings.

In [None]:
def grouping_fun(index_):
    # Check the "review_scores_value" portion of the index.
    # index_ is in the format of (cancellation_policy,review_scores_value)
    if index_[1] == 10.0:
        return (index_[0],"10.0")
    else:
        return (index_[0],"not 10.0")

for group, frame in df.groupby(by=grouping_fun):
    print(group)

### Applying

To this point we have applied very simple processing to our data after splitting, which is simply outputting some print statements to demonstrate how the splitting works. The pandas developers describe three broad categories of data processing that can happen during the apply step:
1. Aggregation of group data,
2. Transformation of group data
3. Filtration of group data

#### Aggregation

The most straightforward "apply" step is the aggregation of data, using the method [`agg()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) on the groupby object. Thus far we have only iterated through the GroupBy object, unpacking it into a label (the group name) and a dataframe. But with `agg()` we can pass in a dictionary of the columns we are interested in aggregating along with the function we are looking to apply to aggregate.

In [None]:
# Let's reset the index for our airbnb data
# which puts "cancellation_policy" and "review_scores_value" back as columns
df=df.reset_index()

# Now lets group by the cancellation policy and find the average review_scores_value by group
df.groupby("cancellation_policy").agg({"review_scores_value":np.average})

So that didn't work, and returned a bunch of `NaN`s. The issue is in the function that we sent to aggregate. `np.average()` does not ignore `NaN`s! However, there is a function we can use for this: `np.nanmean()`.

In [None]:
df.groupby("cancellation_policy").agg({"review_scores_value":np.nanmean})
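
The difference between the two NumPy functions is easy to see on a small made-up array with one missing value:

```python
import numpy as np

# Three made-up review scores, one of them missing
scores = np.array([8.0, 10.0, np.nan])

print(np.average(scores))   # nan, because the missing value propagates
print(np.nanmean(scores))   # 9.0, because the missing value is skipped
```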

We can also extend this dictionary to aggregate multiple functions and/or multiple columns.


In [None]:
df.groupby("cancellation_policy").agg({"review_scores_value":(np.nanmean,np.nanstd),
                                      "reviews_per_month":np.nanmean})

Take a moment to make sure you understand the previous cell, since it's somewhat complex. First we're doing a `groupby()` on the dataframe object by the column `"cancellation_policy"`. This creates a new GroupBy object. Then we are invoking the `agg()` function on that object. The `agg()` function is going to apply the one or more functions we specify to the group dataframes and *return a single row per dataframe/group*. When we called this function we sent it two dictionary entries, each with the key indicating which column we wanted functions applied to. 

For the first column we actually supplied a tuple of two functions. Note that these are not function invocations, like `np.nanmean()`, or function names, like `"nanmean"`; they are references to functions which will return single values. The groupby object will recognize the tuple and call each function in order on the same column. The results will be in a hierarchical index, but since they are columns they don't show as an index per se. Finally, we indicated another column and a single function we wanted to run.
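
To see what the hierarchical columns look like, here is a small sketch on made-up listings. As an assumption for brevity it uses pandas' built-in aggregation names (`'mean'` and `'max'`), which `agg()` also accepts, rather than the NumPy function references from the cell above.

```python
import pandas as pd

# Made-up listings; only the column names are borrowed from the Airbnb data
toy = pd.DataFrame({'cancellation_policy': ['flexible', 'flexible', 'strict'],
                    'review_scores_value': [9.0, 10.0, 8.0],
                    'reviews_per_month': [1.0, 3.0, 2.0]})

result = toy.groupby('cancellation_policy').agg(
    {'review_scores_value': ['mean', 'max'],
     'reviews_per_month': 'mean'})

# The columns are (column name, function name) pairs
print(result.columns.tolist())

# Select a single aggregated column with a tuple naming both levels
print(result[('review_scores_value', 'mean')])
```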

#### Transformation

Transformation is different from aggregation. **The `agg()` function returns a single value per column, and one row per group.** On the other hand, the `transform()` function returns an object that is the same size as the group. Essentially, **`transform()` broadcasts the function you supply over the grouped dataframe, returning a new dataframe**. This makes combining data later easy.

For instance, suppose we want to include the average rating values in a given group by cancellation policy, but preserve the dataframe shape so that we could generate a difference between an individual observation and the group mean.


In [None]:
# First, lets define just some subset of columns we are interested in
cols = ['cancellation_policy','review_scores_value']

# Now lets transform it, and store it in its own dataframe
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.head()

As you can see, the index here is actually the same as in the original dataframe. So to add this new column to the original dataframe, we can just merge it in. However, before we do that, let's rename the column in the transformed version.

In [None]:
transform_df.rename({'review_scores_value':'mean_review_scores'}, axis='columns', inplace=True)
df=pd.merge(df,transform_df, left_index=True, right_index=True)
df.head()

Great, we can see that our new column is in place, the `mean_review_scores`. So now we could create, for instance, the difference between a given row and its group mean (the group being the cancellation policy).

In [None]:
df['mean_diff']=df['review_scores_value']-df['mean_review_scores']
df['mean_diff'].head()
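
Because the transformed result shares its index with the original dataframe, a shorter route is to assign it straight to a new column. The sketch below shows this on made-up listings, and it assumes the built-in `'mean'` aggregation name, which skips `NaN`s by default.

```python
import pandas as pd

# Made-up listings; only the column names are borrowed from the Airbnb data
toy = pd.DataFrame({'cancellation_policy': ['flexible', 'flexible', 'strict'],
                    'review_scores_value': [9.0, 10.0, 8.0]})

# transform() returns a Series aligned with toy's index, so it can be assigned directly
toy['mean_review_scores'] = (toy.groupby('cancellation_policy')['review_scores_value']
                                .transform('mean'))

# Difference between each listing and the mean of its cancellation policy group
toy['mean_diff'] = toy['review_scores_value'] - toy['mean_review_scores']
print(toy)
```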

#### Filtering

The GroupBy object has built-in support for filtering groups as well. Often you'll want to group by some feature, then make some transformation to the groups, then drop certain groups as part of your cleaning routines. The **`filter()` function takes in a function which it applies to each group dataframe and returns either a `True` or a `False`**, depending upon whether that group should be included in the results.

In [None]:
#Let's first look at the average 
print (df[['cancellation_policy','review_scores_value']].groupby('cancellation_policy').mean())
# For instance, if we only want those groups which have a mean rating above 9.2 included in our results
df[['cancellation_policy','review_scores_value']].groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value'])>9.2)

Notice that the results are still indexed, but that any of the results which were in a group with a mean review score of less than or equal to 9.2 (i.e. `strict` and `super_strict_30`) were not copied over.
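
The same behaviour can be checked on a tiny made-up frame, where it is easy to compute the group means by hand. Here `flexible` has a mean of 9.5 and `strict` has a mean of 8.5, so only the `flexible` rows should survive.

```python
import pandas as pd

# Made-up listings; only the column names are borrowed from the Airbnb data
toy = pd.DataFrame({'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
                    'review_scores_value': [9.0, 10.0, 8.0, 9.0]})

# Keep every row that belongs to a group whose mean score is above 9.2
kept = toy.groupby('cancellation_policy').filter(
    lambda g: g['review_scores_value'].mean() > 9.2)
print(kept)
```

Note that `filter()` keeps or drops whole groups, and the surviving rows retain their original index labels.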

#### Applying

By far the most common operation Python programmers invoke on groupby objects is the [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html) function. This allows you to apply an arbitrary function to each group, and stitch the results back into a single dataframe while the index is preserved.

In [None]:
# Lets look at an example using our airbnb data, I'm going to get a clean copy of the dataframe
df=pd.read_csv("/content/drive/My Drive/Applied Data Science in Python/datasets/listings.csv")
# And lets just include some of the columns we were interested in previously
df=df[['cancellation_policy','review_scores_value']]
df.head()

In previous work, we wanted to find the average review score of a listing and its deviation from the group mean. This was a two step process, first we used `transform()` on the groupby object and then we had to broadcast to create a new column. With `apply()` we could wrap this logic in one place.

In [None]:
def calc_mean_review_scores(group):
    # group is a dataframe just of whatever we have grouped by, e.g. cancellation policy, so we can treat
    # this as the complete dataframe
    avg=np.nanmean(group["review_scores_value"])
    # now broadcast our formula and create a new column, which holds each listing's
    # absolute deviation from the group mean (despite the column's name)
    group["review_scores_mean"]=np.abs(avg-group["review_scores_value"])
    return group

# Now just apply this to the groups
df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()

Using `apply()` can be slower than using some of the specialized functions, especially `agg()`. However, if your dataframes are not huge, it's a solid general purpose approach.

Finally,  pandas developers have provided the GroupBy object with [built-in functions for the most commonly used computations](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html), and it is worth investing some time to look at these functions and read them.

Groupby is a powerful and commonly used tool for data cleaning and data analysis. Once you have grouped the data by some category you have a dataframe of just those values and you can conduct aggregated analysis on the segments that you are interested in. The `groupby()` function follows a split-apply-combine approach - first the data is split into subgroups, then you can apply some transformation, filtering, or aggregation, then the results are combined automatically by pandas for us.