# Aggregation with groupby

In [1]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd
pd.set_option('display.precision', 2) # show only two decimal digits

Load the survey data

In [2]:
df = pd.read_csv('cleaned_survey.csv', index_col=0)

In [3]:
df.head()

,Job,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,...,Tableau,Regression,Classification,Clustering,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Languages,Expert
0,0.0,MSIS,4,1,1,0.0,1,1.0,1.0,0.0,...,0,1.0,4,4,0,1,0,0,6.0,1
1,0.5,MSIS,3,1,1,0.0,1,0.0,0.0,0.0,...,0,0.0,2,2,0,0,0,1,4.0,1
2,0.0,MSIS,3,0,0,0.0,1,1.0,0.0,0.0,...,0,1.0,3,3,0,0,1,0,3.0,1
3,0.0,MSIS,3,1,0,0.0,1,1.0,0.0,1.0,...,0,1.0,2,3,0,0,0,1,5.0,1
4,0.0,MSIS,3,1,0,0.0,1,1.0,0.0,0.0,...,0,0.0,1,1,0,0,1,0,4.0,1


## groupby

The method <i>groupby</i> splits the rows of the data into groups, one for each distinct value of a field <i>f</i>. We can then aggregate the other columns separately for each value of <i>f</i>.
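
To make the split-then-aggregate idea concrete, here is a minimal sketch on a small made-up DataFrame. The variable `toy` and its values are hypothetical and are not taken from the survey file. It computes a per-group mean once by hand and once with groupby, so the two can be compared.

```python
import pandas as pd

# Hypothetical toy data, not the survey file used in this notebook
toy = pd.DataFrame({
    'Program': ['MBA', 'MBA', 'MSIS', 'MSIS', 'MSIS'],
    'Python':  [0, 1, 1, 1, 0],
})

# By hand: filter the rows of each Program value, then average Python
manual = {}
for name in toy['Program'].unique():
    manual[name] = toy.loc[toy['Program'] == name, 'Python'].mean()

# With groupby: the same split and aggregation in a single expression
by_program = toy.groupby('Program')['Python'].mean()

print(manual)
print(by_program)
```

Both approaches should give the same numbers; groupby simply does the splitting and the combining for us.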

<b>Example</b>: We know how to show the average value of each column. But groupby allows us to show the average value of each column broken down by "Program".

In [4]:
# we already know how to compute the average of each column
df.mean(numeric_only=True) # numeric_only skips text columns such as Program

Job               0.35
ProgSkills        2.85
C                 0.59
CPP               0.44
CS                0.08
Java              0.74
Python            0.43
JS                0.38
R                 0.20
SQL               0.85
SAS               0.10
Excel             0.95
Tableau           0.56
Regression        0.52
Classification    1.87
Clustering        1.84
Bach_0to1         0.03
Bach_1to3         0.25
Bach_3to5         0.26
Bach_5Plus        0.46
Languages         3.79
Expert            0.82
dtype: float64

In [5]:
# to do it by Program
gb = df.groupby(by='Program')

In [6]:
type(gb)

pandas.core.groupby.DataFrameGroupBy

In [7]:
gb.mean()

Program,Job,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,...,Tableau,Regression,Classification,Clustering,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Languages,Expert
Business Man,1.0,1.0,0.0,0.0,,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,2.0,3.0,0.0,0.0,0.0,1.0,2.0,0.0
Faculty!,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,5.0,5.0,0.0,0.0,0.0,1.0,3.0,1.0
MBA,0.66,2.5,0.31,0.38,0.12,0.38,0.38,0.33,0.44,0.62,...,0.25,0.88,1.94,1.94,0.0,0.12,0.12,0.75,3.06,0.62
MSIS,0.2,3.08,0.7,0.47,0.07,0.97,0.49,0.4,0.1,0.95,...,0.75,0.31,1.77,1.75,0.03,0.3,0.35,0.33,4.2,0.93
Master of Finance,0.0,4.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,5.0,1.0
Supply Chain Mgmt & Analytics,0.5,1.5,0.5,0.5,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,2.0,1.0,0.5,0.5,0.0,0.0,2.0,0.5


#### Example

Display the mean of all columns, grouping by the Job situation

In [8]:
df.groupby(by='Job').mean(numeric_only=True) # numeric_only skips the text column Program

Job,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,...,Tableau,Regression,Classification,Clustering,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Languages,Expert
0.0,2.91,0.62,0.41,0.09,0.78,0.45,0.41,0.16,0.87,0.06,...,0.66,0.38,1.84,1.84,0.06,0.25,0.41,0.28,3.81,0.88
0.5,3.07,0.67,0.6,0.07,0.93,0.4,0.2,0.14,0.93,0.2,...,0.67,0.5,1.73,1.6,0.0,0.27,0.2,0.53,4.13,0.87
1.0,2.5,0.43,0.36,0.08,0.43,0.43,0.54,0.36,0.71,0.08,...,0.21,0.86,2.07,2.07,0.0,0.21,0.0,0.79,3.36,0.64


## Aggregate only some columns

Oftentimes, we don't want to aggregate all columns. For example, we want to find the average of Job grouped by Program.

In [9]:
df.groupby('Program')['Job'].mean()

Program
Business Man                     1.00
Faculty!                         1.00
MBA                              0.66
MSIS                             0.20
Master of Finance                0.00
Supply Chain Mgmt & Analytics    0.50
Name: Job, dtype: float64

Or more columns. For example, we want to find the average of Job, C, and R, grouped by Program.

In [10]:
df.groupby('Program')[['Job','C','R']].mean()

Program,Job,C,R
Business Man,1.0,0.0,1.0
Faculty!,1.0,1.0,0.0
MBA,0.66,0.31,0.44
MSIS,0.2,0.7,0.1
Master of Finance,0.0,1.0,0.0
Supply Chain Mgmt & Analytics,0.5,0.5,0.0


## Problems

For each Job situation (0=no job, 0.5=part time, 1=full time), find the proportion of students that know SQL

In [11]:
df.groupby('Job')['SQL'].mean()

Job
0.0    0.87
0.5    0.93
1.0    0.71
Name: SQL, dtype: float64

For each Job situation, count how many students know SQL.

In [12]:
df.groupby('Job')['SQL'].sum()

Job
0.0    27.0
0.5    14.0
1.0    10.0
Name: SQL, dtype: float64

Considering only the students who know SQL, find for each Program the proportion of students who know Java

In [13]:
df.loc[df.SQL == 1,:].groupby('Program')['Java'].mean()

Program
Faculty!                         0.00
MBA                              0.60
MSIS                             0.97
Master of Finance                0.00
Supply Chain Mgmt & Analytics    0.00
Name: Java, dtype: float64

Which one is faster? Why?
<ol>
<li>df.groupby(by='Program')['SQL'].mean()
<li>df.groupby(by='Program').mean()['SQL']
</ol>

In [14]:
%timeit df.groupby(by='Program')['SQL'].mean()

1000 loops, best of 3: 514 µs per loop


In [15]:
%timeit df.groupby(by='Program').mean()['SQL']

100 loops, best of 3: 1.91 ms per loop
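
A likely explanation for the timings above: the first expression selects SQL before aggregating, so only one column is averaged, while the second averages every column and then throws away all but one. The sketch below uses hypothetical toy data (`toy` is not the survey file) to check that the two expressions return the same values, so the only difference is the amount of work done.

```python
import pandas as pd

# Hypothetical toy data with several numeric columns
toy = pd.DataFrame({
    'Program': ['MBA', 'MBA', 'MSIS', 'MSIS'],
    'SQL':     [1, 0, 1, 1],
    'Java':    [0, 1, 1, 1],
    'Python':  [0, 0, 1, 0],
})

# Select first: only the SQL column is aggregated
select_first = toy.groupby('Program')['SQL'].mean()

# Aggregate first: every column is averaged, then all but SQL are discarded
aggregate_first = toy.groupby('Program').mean()['SQL']

# Same values either way
print(select_first.equals(aggregate_first))
```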


For each Classification skill level, how many MBA students are there? Your result should have 5 rows (one for each classification skill level: 1, 2, 3, 4, and 5)

Here is the wrong way to do it:
<ol>
<li>Keep only the rows of MBA students
<li>Perform group by
</ol>

In [16]:
df.loc[df.Program == 'MBA'].groupby('Classification').size() #wrong: it does not return one row for each value of Classification

Classification
1    7
2    5
3    2
4    2
dtype: int64

<p>Here is the correct way to do it:</p>
<ol>
<li>Create a dummy variable 'MBA' that has a 1 (or True) if the student is an MBA student and 0 (or False) otherwise
<li>For each classification level, compute the sum of the 'MBA' column. Note that the sum of boolean values counts the True values.
<li>Remove the dummy variable 'MBA'
</ol>

In [17]:
# create a dummy variable that indicates whether the student is an MBA student
df['MBA'] = df.Program == 'MBA'

In [18]:
# for each Classification level, sum the values of the variable MBA
df.groupby('Classification')['MBA'].sum()

Classification
1    7.0
2    5.0
3    2.0
4    2.0
5    0.0
Name: MBA, dtype: float64

In [19]:
# remove dummy variable
df.drop('MBA', axis=1, inplace=True)
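
As a sketch of two alternatives that avoid adding and then dropping a column, the example below uses hypothetical toy data (`toy`, `is_mba`, and `levels` are illustrative names). Option A groups a boolean Series directly. Option B filters first and then uses `reindex` with `fill_value=0` to restore the levels that the filter removed.

```python
import pandas as pd

# Hypothetical toy data: no MBA student has Classification level 3
toy = pd.DataFrame({
    'Program':        ['MBA', 'MSIS', 'MBA', 'MSIS', 'MSIS'],
    'Classification': [1, 1, 2, 3, 3],
})

# Option A: group a boolean Series by Classification; the sum counts the True values
is_mba = toy['Program'] == 'MBA'
counts_a = is_mba.groupby(toy['Classification']).sum()

# Option B: filter first, then put back the missing levels with a count of 0
levels = sorted(toy['Classification'].unique())
counts_b = (toy.loc[is_mba]
               .groupby('Classification')
               .size()
               .reindex(levels, fill_value=0))

print(counts_a)
print(counts_b)
```

Both options keep one row per Classification level found in the full data, including levels with no MBA students.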

## Apply multiple functions (<i>agg</i>)

For each Job situation (0=no job, 0.5=part time, 1=full time), find (1) their number and (2) the proportion of students that know SQL.

In [20]:
gb = df.groupby('Job')['SQL']

In [21]:
gb.agg(['mean','size'])

Job,mean,size
0.0,0.87,32
0.5,0.93,15
1.0,0.71,14


#### Renaming resulting columns

The resulting columns are named after the aggregation functions, so you need to rename them manually with <i>rename</i>.

In [29]:
gb.agg(['mean','size']).rename(columns={'mean':'SQL_prop','size':'n_students'})

Job,SQL_prop,n_students
0.0,0.87,32
0.5,0.93,15
1.0,0.71,14
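
In pandas 0.25 and later, <i>agg</i> also accepts keyword arguments ("named aggregation"), which names the output columns in one step. This is an alternative to the rename call above, not the approach used in this notebook. A minimal sketch on hypothetical toy data (`toy` is not the survey file):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Job': [0.0, 0.0, 0.5, 1.0, 1.0],
    'SQL': [1, 0, 1, 1, 1],
})

# Each keyword becomes an output column; its value is the function to apply
summary = toy.groupby('Job')['SQL'].agg(SQL_prop='mean', n_students='size')

print(summary)
```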


## Apply multiple arbitrary functions to multiple columns and give them names (agg)

For each Job situation (0=no job, 0.5=part time, 1=full time), compute the average knowledge of SQL, the maximum knowledge of Classification, and the gap between the max and the min Classification score for each Job level

In [30]:
gb = df.groupby('Job')

In [31]:
gb.agg({'SQL' : 'mean', 
       'Classification' : ['max', lambda x: x.max() - x.min()]})

,Classification,Classification,SQL
Job,max,<lambda>,mean
0.0,4,3,0.87
0.5,3,2,0.93
1.0,5,4,0.71


We can also give a name to all columns created

In [35]:
gb.agg({'SQL' : 'mean',
        'Classification' : ['max', lambda x: x.max() - x.min()]}
      ).rename(columns={'max' : 'maxClassif',
                        '<lambda>' : 'spreadClassif',
                        'mean' : 'SQLmean'})

,Classification,Classification,SQL
Job,maxClassif,spreadClassif,SQLmean
0.0,4,3,0.87
0.5,3,2,0.93
1.0,5,4,0.71
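
With pandas 0.25 or later, the same kind of result can be sketched with named aggregation, where each keyword is given a `(column, function)` pair. This is an alternative to the dictionary-plus-rename approach above and yields flat column names instead of a two-level header. The data below is hypothetical:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Job':            [0.0, 0.0, 0.5, 1.0, 1.0],
    'SQL':            [1, 0, 1, 1, 1],
    'Classification': [1, 4, 2, 5, 3],
})

summary = toy.groupby('Job').agg(
    SQLmean=('SQL', 'mean'),
    maxClassif=('Classification', 'max'),
    # gap between the largest and smallest Classification score in the group
    spreadClassif=('Classification', lambda x: x.max() - x.min()),
)

print(summary)
```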


## Group by multiple fields

You can also group by multiple fields. For example, find the mean of all columns grouped by Program and Job situation.

In [36]:
df.groupby(['Program', 'Job']).mean()

Program,Job,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,...,Tableau,Regression,Classification,Clustering,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Languages,Expert
Business Man,1.0,1.0,0.0,0.0,,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,1.0,2.0,3.0,0.0,0.0,0.0,1.0,2.0,0.0
Faculty!,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,1.0,5.0,5.0,0.0,0.0,0.0,1.0,3.0,1.0
MBA,0.0,1.75,0.25,0.25,0.0,0.0,0.25,0.0,0.75,0.5,0.25,...,0.5,1.0,2.0,2.25,0.0,0.0,0.5,0.5,2.25,0.5
MBA,0.5,3.0,0.33,0.33,0.33,0.67,0.33,0.33,0.67,0.67,0.33,...,0.33,1.0,2.33,2.33,0.0,0.67,0.0,0.33,4.0,0.67
MBA,1.0,2.67,0.33,0.44,0.11,0.44,0.44,0.5,0.22,0.67,0.0,...,0.11,0.78,1.78,1.67,0.0,0.0,0.0,1.0,3.11,0.67
MSIS,0.0,3.08,0.65,0.38,0.12,0.96,0.48,0.46,0.08,0.92,0.04,...,0.73,0.23,1.81,1.85,0.04,0.31,0.42,0.23,4.04,0.92
MSIS,0.5,3.08,0.75,0.67,0.0,1.0,0.42,0.17,0.0,1.0,0.17,...,0.75,0.36,1.58,1.42,0.0,0.17,0.25,0.58,4.17,0.92
MSIS,1.0,3.0,1.0,0.5,0.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,2.5,2.5,0.0,1.0,0.0,0.0,6.5,1.0
Master of Finance,0.0,4.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,...,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,5.0,1.0
Supply Chain Mgmt & Analytics,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,1.0,1.0,0.0,0.0,0.0,3.0,1.0


It returns a DataFrame with a <b>Hierarchical Index</b> (similar to a composite key in a database). In this case, the index is (Program, Job). DataFrames with Hierarchical Indexes are outside the scope of this course because they tend to be hard to deal with; you can avoid them here by using <i>as_index=False</i> inside the <i>groupby</i>. Note: it does not work in all cases.

In [37]:
df.groupby(['Program', 'Job'],as_index=False).mean()

,Program,Job,ProgSkills,C,CPP,CS,Java,Python,JS,R,...,Tableau,Regression,Classification,Clustering,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Languages,Expert
0,Business Man,1.0,1.0,0.0,0.0,,0.0,0.0,1.0,1.0,...,0.0,1.0,2.0,3.0,0.0,0.0,0.0,1.0,2.0,0.0
1,Faculty!,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,5.0,5.0,0.0,0.0,0.0,1.0,3.0,1.0
2,MBA,0.0,1.75,0.25,0.25,0.0,0.0,0.25,0.0,0.75,...,0.5,1.0,2.0,2.25,0.0,0.0,0.5,0.5,2.25,0.5
3,MBA,0.5,3.0,0.33,0.33,0.33,0.67,0.33,0.33,0.67,...,0.33,1.0,2.33,2.33,0.0,0.67,0.0,0.33,4.0,0.67
4,MBA,1.0,2.67,0.33,0.44,0.11,0.44,0.44,0.5,0.22,...,0.11,0.78,1.78,1.67,0.0,0.0,0.0,1.0,3.11,0.67
5,MSIS,0.0,3.08,0.65,0.38,0.12,0.96,0.48,0.46,0.08,...,0.73,0.23,1.81,1.85,0.04,0.31,0.42,0.23,4.04,0.92
6,MSIS,0.5,3.08,0.75,0.67,0.0,1.0,0.42,0.17,0.0,...,0.75,0.36,1.58,1.42,0.0,0.17,0.25,0.58,4.17,0.92
7,MSIS,1.0,3.0,1.0,0.5,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,2.5,2.5,0.0,1.0,0.0,0.0,6.5,1.0
8,Master of Finance,0.0,4.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,5.0,1.0
9,Supply Chain Mgmt & Analytics,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,3.0,1.0,1.0,0.0,0.0,0.0,3.0,1.0
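
Another way to get rid of a Hierarchical Index, shown here as a sketch on hypothetical toy data, is to call <i>reset_index</i> on the aggregated result. It turns the index levels back into ordinary columns and can be applied after the aggregation.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Program': ['MBA', 'MBA', 'MSIS', 'MSIS'],
    'Job':     [0.0, 1.0, 0.0, 0.0],
    'Python':  [0, 1, 1, 0],
})

# Default: the grouping fields become a two-level (hierarchical) index
hier = toy.groupby(['Program', 'Job']).mean()

# Flat alternative 1: ask groupby not to build the index
flat_a = toy.groupby(['Program', 'Job'], as_index=False).mean()

# Flat alternative 2: move the index levels back into columns afterwards
flat_b = hier.reset_index()

print(hier.index.nlevels)
print(flat_a)
print(flat_b)
```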


## Problems

Find the maximum, minimum, and average number of Languages known by students in each Program

In [38]:
df.groupby('Program')['Languages'].agg(['max','min','mean'])

Program,max,min,mean
Business Man,2.0,2.0,2.0
Faculty!,3.0,3.0,3.0
MBA,7.0,0.0,3.06
MSIS,7.0,1.0,4.2
Master of Finance,5.0,5.0,5.0
Supply Chain Mgmt & Analytics,3.0,1.0,2.0


For each existing combination of programming skills level and Program, report the number of students (call it <i>nStudents</i>) and the proportion that know Python (call it <i>PythonProportion</i>)

In [39]:
df.columns

Index([u'Job', u'Program', u'ProgSkills', u'C', u'CPP', u'CS', u'Java',
       u'Python', u'JS', u'R', u'SQL', u'SAS', u'Excel', u'Tableau',
       u'Regression', u'Classification', u'Clustering', u'Bach_0to1',
       u'Bach_1to3', u'Bach_3to5', u'Bach_5Plus', u'Languages', u'Expert'],
      dtype='object')

In [42]:
df.groupby(['ProgSkills','Program']).agg({
    'Job' : 'size',
    'Python' : 'mean'}
    ).rename(columns={'Job':'nStudents','Python':'PythonProportion'})

ProgSkills,Program,PythonProportion,nStudents
1,Business Man,0.0,1
1,MBA,0.0,4
1,MSIS,0.0,1
1,Supply Chain Mgmt & Analytics,0.0,1
2,MBA,0.25,4
2,MSIS,0.33,6
2,Supply Chain Mgmt & Analytics,0.0,1
3,Faculty!,0.0,1
3,MBA,0.8,5
3,MSIS,0.5,23


HARD. For each Program, report:
<ul>
<li>the number of students who know both Python and C (call it <i>C_Python_Students</i>, and note that it can be equal to 0)
<li>the gap between max and mean Clustering knowledge (call it <i>CluGap</i>)
</ul>

In [43]:
df['PythonAndC'] = (df.Python == 1) & (df.C == 1)

In [48]:
df.groupby('Program').agg({
        'PythonAndC' : 'sum', 
        'Clustering' : lambda x : x.max() - x.mean()}
    ).rename(columns={'PythonAndC':'C_Python_Students',
                     'Clustering':'CluGap'})

Program,CluGap,C_Python_Students
Business Man,0.0,0.0
Faculty!,0.0,0.0
MBA,2.06,2.0
MSIS,2.25,13.0
Master of Finance,0.0,1.0
Supply Chain Mgmt & Analytics,0.0,0.0


In [49]:
df.drop('PythonAndC', axis = 1 , inplace=True)

## Advanced: retrieve unaggregated rows (<i>apply</i>)

Sometimes, for each group-by value we want to retrieve one or more rows. For example, for each program report the student who knows the most languages (report more than one student in case of ties).

In [52]:
df.groupby('Program').apply(lambda d : d.loc[d.Languages == d.Languages.max(),:])

Program,,Job,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,...,Tableau,Regression,Classification,Clustering,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Languages,Expert
Business Man,31,1.0,Business Man,1,0,0,,0,0.0,1.0,1.0,...,0,1.0,2,3,0,0,0,1,2.0,0
Faculty!,16,1.0,Faculty!,3,1,0,0.0,0,0.0,0.0,0.0,...,0,1.0,5,5,0,0,0,1,3.0,1
MBA,60,1.0,MBA,3,1,1,1.0,1,1.0,1.0,0.0,...,0,1.0,1,1,0,0,0,1,7.0,1
MSIS,45,1.0,MSIS,3,1,1,0.0,1,1.0,1.0,1.0,...,1,1.0,2,2,0,1,0,0,7.0,1
MSIS,46,0.0,MSIS,4,1,1,1.0,1,1.0,1.0,0.0,...,1,1.0,2,2,0,1,0,0,7.0,1
Master of Finance,35,0.0,Master of Finance,4,1,1,0.0,0,1.0,1.0,0.0,...,0,1.0,1,1,0,0,0,1,5.0,1
Supply Chain Mgmt & Analytics,13,0.0,Supply Chain Mgmt & Analytics,2,1,1,0.0,0,0.0,0.0,0.0,...,0,1.0,3,1,1,0,0,0,3.0,1


### Warning! Do not use DataFrameGroupBy.apply unless you actually need it

The method <i>DataFrameGroupBy.apply</i> is slow: it loops over the groups and calls the supplied Python function (here, a lambda) once per group, whereas built-in aggregations such as <i>mean</i> run in optimized code.

The correct way to compute the average ProgSkills for each program:

In [59]:
%timeit df.groupby('Program').ProgSkills.mean()

1000 loops, best of 3: 752 µs per loop


The <b>wrong</b> way to compute the average ProgSkills for each program:

In [60]:
%timeit df.groupby('Program').apply(lambda d : d.ProgSkills.mean())

100 loops, best of 3: 2.92 ms per loop
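
Even the row-retrieval task from the start of this section can often be written without <i>apply</i>. One possible approach, sketched below on hypothetical toy data, uses <i>transform</i> to give every row the maximum of its group and then filters with an ordinary boolean mask. The result keeps the original flat index rather than a Hierarchical Index.

```python
import pandas as pd

# Hypothetical toy data: two MSIS students tie for the most languages
toy = pd.DataFrame({
    'Program':   ['MBA', 'MBA', 'MSIS', 'MSIS', 'MSIS'],
    'Languages': [3, 7, 7, 2, 7],
})

# transform('max') returns one value per row: the maximum within that row's Program
group_max = toy.groupby('Program')['Languages'].transform('max')

# Keep the rows that reach their group's maximum (ties are all kept)
top = toy.loc[toy['Languages'] == group_max]

print(top)
```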


## Problems

For each ProgSkills level, find the student (or students in case of ties) with the highest Classification skills and show their knowledge of C and Java

In [62]:
maxClass = df.groupby('ProgSkills').Classification.max()
maxClass

ProgSkills
1    2
2    3
3    5
4    4
5    2
Name: Classification, dtype: int64

In [73]:
df.groupby('ProgSkills').apply(lambda d : d.loc[d.Classification == d.Classification.max(),['C','Java']])

ProgSkills,,C,Java
1,31,0,0
2,13,1,0
2,20,0,1
3,16,1,0
4,0,1,1
4,17,0,0
5,14,1,1


For each ProgSkills level, find the Program with the most students at that ProgSkills level

In [74]:
df.groupby('ProgSkills').apply(lambda d : d.groupby('Program').size().nlargest(1))

ProgSkills  Program
1           MBA         4
2           MSIS        6
3           MSIS       23
4           MSIS        9
5           MBA         1
dtype: int64
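
As a sketch of how this last problem could be solved without <i>apply</i>, the example below counts students per (ProgSkills, Program) pair, flattens the result, and then uses <i>idxmax</i> to pick the row with the largest count within each ProgSkills level. The data and the column name `n` are hypothetical. Note that <i>idxmax</i> keeps only the first row in case of ties.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'ProgSkills': [1, 1, 1, 2, 2, 2],
    'Program':    ['MBA', 'MBA', 'MSIS', 'MSIS', 'MSIS', 'MBA'],
})

# One row per (ProgSkills, Program) pair, with the number of students in column n
counts = (toy.groupby(['ProgSkills', 'Program'])
             .size()
             .reset_index(name='n'))

# For each ProgSkills level, the row label of the largest count
best_rows = counts.groupby('ProgSkills')['n'].idxmax()
top = counts.loc[best_rows]

print(top)
```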