# Aggregation with groupby

In [1]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd
pd.set_option('display.precision', 2) # show only two decimal digits

Load the survey data

In [2]:
df = pd.read_csv('cleaned_survey.csv', index_col=0)

In [3]:
df.head()

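| Unnamed: 0 | Job | Program | ProgSkills | C | CPP | CS | Java | Python | JS | R | ... | Tableau | Regression | Classification | Clustering | Bach_0to1 | Bach_1to3 | Bach_3to5 | Bach_5Plus | Languages | Expert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | MSIS | 4 | 1 | 1 | 0.0 | 1 | 1.0 | 1.0 | 0.0 | ... | 0 | 1.0 | 4 | 4 | 0 | 1 | 0 | 0 | 6.0 | 1 |
| 1 | 0.5 | MSIS | 3 | 1 | 1 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | ... | 0 | 0.0 | 2 | 2 | 0 | 0 | 0 | 1 | 4.0 | 1 |
| 2 | 0.0 | MSIS | 3 | 0 | 0 | 0.0 | 1 | 1.0 | 0.0 | 0.0 | ... | 0 | 1.0 | 3 | 3 | 0 | 0 | 1 | 0 | 3.0 | 1 |
| 3 | 0.0 | MSIS | 3 | 1 | 0 | 0.0 | 1 | 1.0 | 0.0 | 1.0 | ... | 0 | 1.0 | 2 | 3 | 0 | 0 | 0 | 1 | 5.0 | 1 |
| 4 | 0.0 | MSIS | 3 | 1 | 0 | 0.0 | 1 | 1.0 | 0.0 | 0.0 | ... | 0 | 0.0 | 1 | 1 | 0 | 0 | 1 | 0 | 4.0 | 1 |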


## groupby

The method <i>groupby</i> splits the rows into groups according to the value of a field <i>f</i>. We can then aggregate the other columns separately for each value of <i>f</i>.

<b>Example</b>: We already know how to show the average value of each column. With groupby, we can show the average value of each column separately for each value of "Program".
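A minimal sketch of this idea on made-up data. The `toy` DataFrame and its values are hypothetical; they only borrow a few column names from the survey.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Program': ['MSIS', 'MSIS', 'MBA', 'MBA'],
    'Job':     [0.0, 1.0, 0.5, 0.5],
    'Python':  [1, 0, 1, 1],
})

# Average of each numeric column over all rows
overall = toy[['Job', 'Python']].mean()

# Average of each column, computed separately for each Program
by_program = toy.groupby('Program').mean()
print(by_program)
```

The result has one row per Program, and the Program values become the index.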

#### Example

Display the mean of all columns, grouping by the Job situation
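A possible sketch on made-up data (the `toy` DataFrame is hypothetical). It assumes a recent pandas version, where text columns such as Program have to be excluded explicitly with `numeric_only=True`.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Job':       [0.0, 0.0, 0.5, 1.0],
    'Program':   ['MSIS', 'MBA', 'MSIS', 'MBA'],
    'SQL':       [1, 0, 1, 1],
    'Languages': [3, 5, 4, 6],
})

# Mean of every numeric column for each Job situation;
# the text column Program is left out of the result
by_job = toy.groupby('Job').mean(numeric_only=True)
print(by_job)
```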

## Aggregate only some columns

Often, we don't want to aggregate all columns. For example, we may want only the average of Job grouped by Program.
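A minimal sketch on made-up data (the `toy` DataFrame is hypothetical): select the column right after the `groupby`, then aggregate.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Program': ['MSIS', 'MSIS', 'MBA', 'MBA'],
    'Job':     [0.0, 1.0, 0.5, 1.0],
    'Python':  [1, 1, 0, 1],
})

# Select the single column Job before aggregating; the result is a Series
job_by_program = toy.groupby('Program')['Job'].mean()
print(job_by_program)
```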

We can also select several columns. For example, we may want the average of Job, C, and R, grouped by Program.
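A minimal sketch on made-up data (the `toy` DataFrame is hypothetical): pass a list of column names, which requires double square brackets.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Program': ['MSIS', 'MSIS', 'MBA', 'MBA'],
    'Job':     [0.0, 1.0, 0.5, 1.0],
    'C':       [1, 0, 1, 1],
    'R':       [0, 0, 1, 0],
    'Python':  [1, 1, 0, 1],
})

# A list of columns gives a DataFrame with one column per selected field
some_by_program = toy.groupby('Program')[['Job', 'C', 'R']].mean()
print(some_by_program)
```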

## Problems

For each Job situation (0=no job, 0.5=part time, 1=full time), find the proportion of students that know SQL

For each program, count how many students know SQL.

Considering only the students who know SQL, find for each Program the proportion of students who know Java

Which one is faster? Why?
<ol>
<li>df.groupby(by='Program')['SQL'].mean()
<li>df.groupby(by='Program').mean()['SQL']
</ol>

For each Classification skill level, how many MBA students are there? Your result should have 5 rows (one for each classification skill level: 1, 2, 3, 4, and 5)

Here is the wrong way to do it, because classification levels with no MBA students disappear from the result:
<ol>
<li>Keep only the rows of MBA students
<li>Perform group by
</ol>

<p>Here is the correct way to do it:</p>
<ol>
<li>Create a dummy variable 'MBA' that has a 1 (or True) if the student is an MBA student and 0 (or False) otherwise
<li>For each classification level, compute the sum of the 'MBA' column. Note that the sum of boolean values counts the True values.
<li>Remove the dummy variable 'MBA'
</ol>
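The two approaches above can be compared on made-up data. This is only a sketch: the `toy` DataFrame is hypothetical and has three classification levels instead of five, with no MBA student at level 2.

```python
import pandas as pd

# Hypothetical toy data: no MBA student has Classification == 2
toy = pd.DataFrame({
    'Program':        ['MBA', 'MBA', 'MSIS', 'MSIS', 'MBA'],
    'Classification': [1, 1, 2, 3, 3],
})

# Wrong way: filtering first removes level 2 from the result
wrong = toy[toy['Program'] == 'MBA'].groupby('Classification').size()

# Correct way: dummy variable, then sum (True counts as 1)
toy['MBA'] = toy['Program'] == 'MBA'
right = toy.groupby('Classification')['MBA'].sum()
toy = toy.drop(columns='MBA')  # remove the dummy variable

print(wrong)
print(right)
```

In this sketch, `wrong` has no row for level 2, while `right` reports it with a count of 0.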

## Apply multiple functions (<i>agg</i>)

For each Job situation (0=no job, 0.5=part time, 1=full time), find (1) the number of students and (2) the proportion of students that know SQL.
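A minimal sketch on made-up data (the `toy` DataFrame is hypothetical): <i>agg</i> accepts a list of function names and returns one column per function.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Job': [0.0, 0.0, 0.5, 1.0, 1.0, 1.0],
    'SQL': [1, 0, 1, 1, 1, 0],
})

# 'count' gives the number of (non-missing) values, 'mean' the proportion of 1s
stats = toy.groupby('Job')['SQL'].agg(['count', 'mean'])
print(stats)
```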

#### Renaming resulting columns

You need to rename the resulting columns manually.
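One way to do the renaming is with <i>rename</i>, sketched here on made-up data. The `toy` DataFrame and the new column names `nStudents` and `SQLProportion` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Job': [0.0, 0.0, 0.5, 1.0],
    'SQL': [1, 0, 1, 1],
})

stats = toy.groupby('Job')['SQL'].agg(['count', 'mean'])

# By default the columns are named after the functions; rename them by hand
stats = stats.rename(columns={'count': 'nStudents', 'mean': 'SQLProportion'})
print(stats)
```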

## Apply multiple arbitrary functions to multiple columns and give them names (agg)

For each Job situation (0=no job, 0.5=part time, 1=full time), compute the average knowledge of SQL, the maximum knowledge of Classification, and the gap between the maximum and the minimum Classification score.
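A possible sketch on made-up data: <i>agg</i> also accepts a dictionary that maps each column to one function or a list of functions, including your own. The `toy` DataFrame and the helper `gap` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Job':            [0.0, 0.0, 0.5, 1.0, 1.0],
    'SQL':            [1, 0, 1, 1, 1],
    'Classification': [2, 4, 3, 1, 5],
})

def gap(s):
    """Difference between the largest and the smallest value of a column."""
    return s.max() - s.min()

# One function for SQL, two functions for Classification
result = toy.groupby('Job').agg({'SQL': 'mean', 'Classification': ['max', gap]})
print(result)
```

Note that the resulting columns have two levels, for example `('Classification', 'gap')`.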

We can also give a name to every column created.
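A minimal sketch on made-up data using named aggregation, where each keyword is the new column name and each value is a (column, function) pair. The `toy` DataFrame and the names `SQLMean`, `ClassMax`, and `ClassGap` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Job':            [0.0, 0.0, 0.5, 1.0, 1.0],
    'SQL':            [1, 0, 1, 1, 1],
    'Classification': [2, 4, 3, 1, 5],
})

# new_column_name=(source_column, function)
named = toy.groupby('Job').agg(
    SQLMean=('SQL', 'mean'),
    ClassMax=('Classification', 'max'),
    ClassGap=('Classification', lambda s: s.max() - s.min()),
)
print(named)
```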

## Group by multiple fields

You can also group by multiple fields. For example, find the mean of all columns grouped by Program and Job situation.
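A minimal sketch on made-up data (the `toy` DataFrame is hypothetical): pass a list of field names to <i>groupby</i>.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Program': ['MSIS', 'MSIS', 'MSIS', 'MBA', 'MBA'],
    'Job':     [0.0, 0.0, 1.0, 0.5, 0.5],
    'Python':  [1, 0, 1, 1, 1],
})

# One row for each combination of Program and Job that exists in the data
by_both = toy.groupby(['Program', 'Job']).mean()
print(by_both)
```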

It returns a DataFrame with a <b>Hierarchical Index</b> (similar to a composite key in a database). In this case, the index is (Program, Job). DataFrames with hierarchical indexes are outside the scope of this course because they tend to be hard to deal with; you can avoid them here by passing <i>as_index=False</i> to <i>groupby</i>. Note: this does not work in all cases.
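A minimal sketch of <i>as_index=False</i> on made-up data (the `toy` DataFrame is hypothetical). The grouping fields stay as ordinary columns instead of becoming the index.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Program': ['MSIS', 'MSIS', 'MSIS', 'MBA', 'MBA'],
    'Job':     [0.0, 0.0, 1.0, 0.5, 0.5],
    'Python':  [1, 0, 1, 1, 1],
})

# Program and Job are regular columns; the index is simply 0, 1, 2, ...
flat = toy.groupby(['Program', 'Job'], as_index=False).mean()
print(flat)
```

When <i>as_index=False</i> does not have the desired effect, calling `reset_index()` on the grouped result is a common alternative.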

## Problems

Find the maximum, minimum, and average number of Languages known by students in each Program

For each existing combination of programming skills level and Program, report the number of students (call it <i>nStudents</i>) and the proportion that know Python (call it <i>PythonProportion</i>)

HARD. For each Program, report:
<ul>
<li>the number of students who know both Python and C (call it <i>C_Python_Students</i>, and note that it can be equal to 0)
<li>the gap between max and mean Clustering knowledge (call it <i>CluGap</i>)
</ul>

## Advanced: retrieve unaggregated rows (<i>apply</i>)

Sometimes, for each group-by value, we want to retrieve one or more rows. For example, for each program, report the student who knows the most languages (report more than one student in case of ties).
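A possible sketch on made-up data (the `toy` DataFrame is hypothetical, and each row label stands for one student). The lambda receives the rows of one group and returns only those with the maximum number of Languages.

```python
import pandas as pd

# Hypothetical toy data: the row labels 0..4 identify the students
toy = pd.DataFrame({
    'Program':   ['MSIS', 'MSIS', 'MSIS', 'MBA', 'MBA'],
    'Languages': [6, 4, 6, 5, 3],
})

# For each Program, keep the rows whose Languages equals the group maximum
top = toy.groupby('Program')[['Languages']].apply(
    lambda g: g[g['Languages'] == g['Languages'].max()]
)
print(top)
```

In this sketch the result is indexed by (Program, original row label), so the two tied MSIS students both appear.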

### Warning! Do not use DataFrameGroupBy.apply unless you actually need it

The method <i>DataFrameGroupBy.apply</i> is slow compared with the built-in aggregation methods. It is implemented as a for loop that invokes the lambda function once for each group.

The fast way to compute the average ProgSkills for each program:
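A minimal sketch on made-up data (the `toy` DataFrame is hypothetical), using the built-in <i>mean</i>:

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Program':    ['MSIS', 'MSIS', 'MBA', 'MBA'],
    'ProgSkills': [4, 3, 2, 3],
})

# Built-in aggregation: no Python function is called for each group
fast = toy.groupby('Program')['ProgSkills'].mean()
print(fast)
```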

The <b>slow</b> way to compute the average ProgSkills for each program:
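A minimal sketch on made-up data (the `toy` DataFrame is hypothetical), using <i>apply</i> with a lambda. It gives the same numbers as the built-in <i>mean</i>, but the lambda is called once for each group.

```python
import pandas as pd

# Hypothetical toy data, not the real survey
toy = pd.DataFrame({
    'Program':    ['MSIS', 'MSIS', 'MBA', 'MBA'],
    'ProgSkills': [4, 3, 2, 3],
})

# apply calls the lambda once per Program, passing that group's rows
slow = toy.groupby('Program')[['ProgSkills']].apply(
    lambda g: g['ProgSkills'].mean()
)
print(slow)
```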

## Problems

For each ProgSkills level, find the student (or students in case of ties) with the highest Classification skills and show their knowledge of C and Java

For each ProgSkills level, find the Program with the most students that have that ProgSkills level