# Doing things the Pandas Way!!!

## Problem at hand.

Consider a set of courses running in a department of an institute.
Each course requires a certain number of teaching assistants (TAs).
The students are asked which courses they would prefer to be a TA for. How do we allot courses to the students optimally?

### Scenario

Each course states how many TAs it requires. Each student, in turn, gives 3 choices from the courses that will be offered in the coming semester.

Constraints:
- Each student can be allotted to only one course
- A course can only be allotted a fixed number of TAs.
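
The two constraints above can be made concrete with a small sketch. This is not the allotment method of this notebook (which only explores the data). It is a simple greedy pass in plain Python, with hypothetical student names, course codes and capacities, and it makes no claim to being optimal.

```python
def greedy_allot(preferences, capacity):
    """Give each student the first of their choices that still has a free TA slot."""
    remaining = dict(capacity)  # copy, so the caller's dict is left untouched
    allotment = {}
    for student, choices in preferences.items():
        for course in choices:
            if remaining.get(course, 0) > 0:
                allotment[student] = course  # constraint 1: one course per student
                remaining[course] -= 1       # constraint 2: fixed number of TAs per course
                break
    return allotment


# Hypothetical data: every student lists EE301 first, but it needs only one TA.
preferences = {
    "Asha": ["EE301", "EE601", "EE302"],
    "Bala": ["EE301", "EE302", "EE601"],
    "Chitra": ["EE301", "EE601", "EE302"],
}
capacity = {"EE301": 1, "EE302": 1, "EE601": 1}

allotment = greedy_allot(preferences, capacity)
print(allotment)
```

Students are served in the order they appear, so whoever comes first gets the contested course. A student whose choices are all full is simply left out of the result.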

### Data available

We have 2 files available:
- The first contains a list of students along with their choices of courses they would like to TA for.
- The second contains a list of courses and the number of TAs required for each course.

### Let's Play!
Let's read in the data available.

Pandas makes it very easy to read CSV files into objects called 'DataFrames'.

In [None]:
# Import Pandas
import pandas as pd
# Reading in the students data.
studentsDF = pd.read_csv('taPrefs.csv')

THAT'S IT!

Remember the function we wrote while dealing with lists? Duh!

#### Viewing the data

In [None]:
# Taking a peek at the data
studentsDF.head()

#### Operating on the data

Now we have the data as DataFrames, which can be easily indexed to retrieve data.

In [None]:
# Let's slice out the row with index label 3 (the 4th student, since the default index starts at 0)
studentsDF.loc[3]
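
A quick sketch of label-based versus position-based indexing, since the two are easy to mix up. `read_csv` gives the rows the labels 0, 1, 2, ... by default, so `.loc[3]` is the fourth row, while `.iloc[2]` is the third. The small DataFrame below is hypothetical stand-in data, not the contents of `taPrefs.csv`.

```python
import pandas as pd

# Hypothetical stand-in data with the default 0-based index
df = pd.DataFrame({
    "Name": ["Asha", "Bala", "Chitra", "Dev"],
    "Specialization": ["EE1", "EE2", "EE1", "EE3"],
})

by_label = df.loc[3]      # the row whose index label is 3 (the fourth row here)
by_position = df.iloc[2]  # the third row, counting positions from 0

print(by_label["Name"], by_position["Name"])
```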

In [None]:
# Counting the number of entries in the DataFrame.
studentCount = len(studentsDF.index)
studentCount

In [None]:
# Extracting the Specialization of each of the students
specializationArray = studentsDF['Specialization']
specializationArray.tail()

In [None]:
# Another way to do the above
studentsDF.Specialization.tail()

NOTE : No need to remember the index of the column containing the 'Specialization' information, unlike with lists/arrays.
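
For contrast, here is roughly what the same lookup involves with plain lists. This is a sketch using the standard `csv` module on hypothetical in-memory data. It is not the list-based function mentioned earlier, which is not reproduced in this notebook, and only the column names `Specialization`, `Choice1`, `Choice2` and `Choice3` are taken from it.

```python
import csv
import io

# Hypothetical stand-in for the students file
raw = io.StringIO(
    "Name,Specialization,Choice1,Choice2,Choice3\n"
    "Asha,EE1,EE301,EE601,EE302\n"
    "Bala,EE2,EE302,EE301,EE601\n"
)

rows = list(csv.reader(raw))
header, records = rows[0], rows[1:]

# With plain lists we first have to find where the column lives ...
spec_index = header.index("Specialization")
# ... and then pull that position out of every row by hand.
specializations = [row[spec_index] for row in records]

print(specializations)
```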

In [None]:
# Setting up inline plotting in the notebook, and importing NumPy
%matplotlib inline
import numpy as np

In [None]:
# Specialization-based grouping of TAs.
studentsDF.Specialization.value_counts().plot(kind='pie')

In [None]:
# Quick stats on the Choice1 column of the data
studentsDF.Choice1.describe()

In [None]:
# Bar chart of Choice1, which gives a rough measure of each course's popularity.
studentsDF.Choice1.value_counts().plot(figsize=(20,5), kind='bar')
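
To see what the bar chart is built from, it may help to look at `value_counts()` on its own. A minimal sketch on a hypothetical Series of first choices (the course codes are made up for illustration):

```python
import pandas as pd

# Hypothetical first choices of five students
choice1 = pd.Series(["EE301", "EE601", "EE301", "EE302", "EE301"])

# One row per distinct course, sorted from most to least frequent
counts = choice1.value_counts()

print(counts)
```

Each bar in the plot is one entry of this result: the course code is the label and the count is the height.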

In [None]:
# Using Choice2
studentsDF.Choice2.value_counts().plot(figsize=(20,5), kind='bar')

In [None]:
# Using Choice3
studentsDF.Choice3.value_counts().plot(figsize=(20,5), kind='bar')

In [None]:
# Making specialization-wise DataFrames.
ee1StudentDF = studentsDF[studentsDF['Specialization'] == "EE1"]
# ee2StudentDF = studentsDF[studentsDF['Specialization'] == "EE2"]
# ee3StudentDF = studentsDF[studentsDF['Specialization'] == "EE3"]
# ee4StudentDF = studentsDF[studentsDF['Specialization'] == "EE4"]
# ee5StudentDF = studentsDF[studentsDF['Specialization'] == "EE5"]

In [None]:
# Viewing the new EE1 dataframe
ee1StudentDF.head()

In [None]:
# Plotting the course popularity for choice1 among EE1 students.
ee1StudentDF.Choice1.value_counts().plot(figsize=(10,5), kind='bar', alpha=0.5)

In [None]:
# Grouping the EE1 students by Choice1 and counting the non-null entries in each remaining column.
ee1StudentDF.groupby('Choice1').count()
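
A small sketch of what `groupby(...).count()` returns, next to `groupby(...).size()`, which is often what is wanted when only the number of students per course matters. The data is hypothetical, with one missing `Choice2` to show the difference.

```python
import pandas as pd

# Hypothetical stand-in data; Bala left Choice2 blank
df = pd.DataFrame({
    "Name": ["Asha", "Bala", "Chitra"],
    "Choice1": ["EE301", "EE301", "EE601"],
    "Choice2": ["EE302", None, "EE302"],
})

per_column = df.groupby("Choice1").count()  # non-null count in each remaining column
per_group = df.groupby("Choice1").size()    # number of rows in each group

print(per_column)
print(per_group)
```

`count()` skips missing values column by column, so its numbers can differ between columns. `size()` gives a single number per group.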

In [None]:
# Saving a file.
ee1StudentDF.to_csv('ee1StudentPrefsLast.csv')

NOTE : One single command, that's it!!!

In [None]:
# Students who gave a 3rd-year course for all 3 choices (a 3xx course is a 3rd-year UG course).
ee1StudentDF[ee1StudentDF.Choice1.str.startswith('EE3') & ee1StudentDF.Choice2.str.startswith('EE3') & ee1StudentDF.Choice3.str.startswith('EE3')]

In [None]:
# Getting the number of students preferring a 3rd year course as 1st Choice.
(ee1StudentDF.Choice1.str.startswith('EE3')).sum()

In [None]:
# Getting the number of students preferring a PG course as 1st Choice.
(ee1StudentDF.Choice1.str.startswith('EE6')).sum()
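
The counting above works because `str.startswith` returns a boolean Series, and summing it counts the `True` entries. Here is a sketch on hypothetical data. It also passes `na=False`, which treats a blank choice as a non-match. Whether the real file contains blank choices is an assumption, not something this notebook shows.

```python
import pandas as pd

# Hypothetical first choices, including one student who left it blank
choice1 = pd.Series(["EE301", "EE605", None, "EE330"])

# True where the course code starts with 'EE3'; missing values count as False
mask = choice1.str.startswith("EE3", na=False)

print(mask.tolist())
print(mask.sum())
```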

## Conclusion :
* Large tabular datasets can be loaded, filtered, grouped and plotted with a line or two of code each.
* Pandas gives us a very simple and intuitive interface to deal with data.
* Pandas runs column-wise operations in optimized, vectorized code, so it is typically much faster than looping over lists or other built-in Python constructs.

# THANK YOU!