# More Pandas

## Introduction

We have introduced various ways to *slice* a Pandas DataFrame, i.e. to extract a subset of the data. This notebook goes through other methods of extracting subsets of data. Specifically, we focus on extracting data based on the values of variables using the *groupby* method in Pandas.

## Motivating Example

Recall our *MovieFinances.csv* dataset. What if we wish to analyse the movie data on a year-by-year basis? We can group the data based on the "Release Year" variable.

Let's start by importing the relevant modules and the data first. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [12,9]

In [None]:
mfin = pd.read_csv('../input/moviefinance/MovieFinances.csv', header=0)
mfin.info()

The dataframe has been created successfully. We can now create an object that organises the data based on "Release Year". The key method is called *groupby*.

In [None]:
mfin_year = mfin.groupby('Release Year')
type(mfin_year), type(mfin)

So the function *type* above allows one to find out the object type. In the example above, *mfin_year* is a _**DataFrameGroupBy**_ object whereas *mfin* is a _**DataFrame**_ object.
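
To get a feel for what a *DataFrameGroupBy* object holds, here is a minimal sketch on made-up toy data. The dataframe `toy` and its values are assumptions for illustration only and are not taken from the movie dataset.

```python
import pandas as pd

# Toy data standing in for the movie dataset (values are made up).
toy = pd.DataFrame({
    "Release Year": [2014, 2014, 2015, 2015, 2015],
    "Movie": ["A", "B", "C", "D", "E"],
    "Domestic Gross($M)": [10.0, 30.0, 20.0, 40.0, 60.0],
})

toy_year = toy.groupby("Release Year")

# The groupby object records which rows belong to which group.
print(type(toy_year).__name__)
print(toy_year.ngroups)  # number of distinct years

# Iterating yields (key, sub-DataFrame) pairs, one per group.
for year, rows in toy_year:
    print(year, list(rows["Movie"]))
```

Nothing is aggregated until a method such as *mean* or *size* is called on the groupby object.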

But what can we do with the *DataFrameGroupBy* object? Say we want to calculate the mean Domestic Gross and Worldwide Gross for each year. We can combine our knowledge of *slicing* with the *groupby* object.

In [None]:
var = ['Domestic Gross($M)', 'Worldwide Gross($M)']
grossbyYear = mfin_year[var].mean() # This has two parts, the first part extracts the columns and the second part calculates the means. 
grossbyYear

Note that *grossbyYear* is just another dataframe, which means we can treat it as another dataset and use the plotting methods that come with every dataframe.
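
The point that an aggregated result is an ordinary dataframe can be checked on toy data. This is a sketch under assumptions: the names `toy`, `cols` and `toy_mean` and all values are hypothetical.

```python
import pandas as pd

# Made-up values; only the column names mirror the movie dataset.
toy = pd.DataFrame({
    "Release Year": [2014, 2014, 2015, 2015],
    "Domestic Gross($M)": [10.0, 30.0, 20.0, 40.0],
    "Worldwide Gross($M)": [25.0, 75.0, 50.0, 150.0],
})
cols = ["Domestic Gross($M)", "Worldwide Gross($M)"]

# Select the columns from the groupby object, then aggregate.
toy_mean = toy.groupby("Release Year")[cols].mean()

print(type(toy_mean).__name__)  # a plain DataFrame, indexed by year
print(toy_mean.loc[2015])       # ordinary label-based row selection works
print(toy_mean.max())           # as do the usual column-wise summaries
```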

We will discuss *data visualisation* in a different module, but for now we can simply use the *plot* method of the *DataFrame* object.

The example below also demonstrates a few arguments that can be used within the *plot* method. For more information about the different arguments, see [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).

In [None]:
# Raw string so that \$ reaches matplotlib as an escaped (literal) dollar sign.
grossbyYear.plot(color=[(0,0,1), (1,0,0)], lw=3, figsize=[16,9], ylabel=r"US Dollars in \$M", rot=60)

In fact, by utilising the method *describe* we can get even more basic descriptive statistics from the dataframe. 

In [None]:
grossbyYear.describe()

## Groupby more than one variable

*Groupby* is not limited to one variable. For example, we can group by release year and release month, which creates a *multi-index* dataframe. To do this, we put the variables that determine the groups into a list.

In [None]:
groupbylist = ['Release Year', 'Month']
mfin_yearMonth = mfin.groupby(groupbylist)[var]
grossbyYearMonth = mfin_yearMonth.mean()
grossbyYearMonth

Note that we have combined multiple tasks in one line: we created a groupby object restricted to the columns specified by the variable *var* (defined above).

The result shows that movies were not released in every month of every year: only the year/month combinations present in the data appear.
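
The same behaviour can be reproduced on a small made-up dataframe. In this sketch the name `toy` and its values are assumptions for illustration, not part of the movie data.

```python
import pandas as pd

# Made-up rows: note that no movie is released in Feb 2014.
toy = pd.DataFrame({
    "Release Year": [2014, 2014, 2014, 2015, 2015],
    "Month": ["Apr", "Apr", "Dec", "Apr", "Feb"],
    "Domestic Gross($M)": [10.0, 30.0, 50.0, 20.0, 40.0],
})

toy_mean = toy.groupby(["Release Year", "Month"])[["Domestic Gross($M)"]].mean()
print(toy_mean)

# The row index now has two levels, named after the grouping columns.
print(list(toy_mean.index.names))

# Only (year, month) pairs present in the data appear in the index.
print((2014, "Feb") in toy_mean.index)
```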

Note that the data above is sorted by 'Release Year' first, then by 'Month'. We may also want to sort it the other way round: first by month, then by year.

There are two ways to do this: create another *groupby* object by changing the order in which the variables appear in the list, OR use the *reorder_levels* method.

**Exercise:** Create another groupby object that sorts by *Month* first, then by *Release Year*.

While the first approach is left as an exercise, we demonstrate the second approach below. Note that *reorder_levels* only swaps the index levels; the rows keep their original order unless *sort_index* is applied afterwards.

In [None]:
grossbyMonthYear = grossbyYearMonth.reorder_levels([1,0])
grossbyMonthYear
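
The difference between swapping the levels and actually re-sorting the rows can be seen in a small sketch. The index `idx`, the dataframe `toy` and its values are made up for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2014, "Apr"), (2014, "Dec"), (2015, "Apr"), (2015, "Dec")],
    names=["Release Year", "Month"],
)
toy = pd.DataFrame({"Domestic Gross($M)": [10.0, 20.0, 30.0, 40.0]}, index=idx)

# Levels can be referenced by position or by name.
swapped = toy.reorder_levels(["Month", "Release Year"])
print(list(swapped.index))   # levels swapped, row order unchanged

by_month = swapped.sort_index()
print(list(by_month.index))  # rows now ordered by month, then by year
```

Chaining *sort_index* after *reorder_levels* should give the same ordering as grouping by month first and year second.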

So far we have calculated the mean of the two columns for every month in every year. What if we want to know how many movies were released in each month of each year? We can use *size* and *count*. There are two differences between the two methods, one more subtle than the other. We first demonstrate what each method returns, then explain.

In [None]:
mfin_yearMonth.size()

In [None]:
mfin_yearMonth.count()

The two differences are as follows. *size* returns a single number for each month of each year: the number of cases, _**including missing values**_. *count* returns one number per variable for each month of each year: the number of cases _**excluding missing values**_.

We can also easily find out which month in which year had the most movies released.
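
A minimal sketch with deliberately missing values makes both differences visible. The dataframe `toy` and its numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Release Year": [2014, 2014, 2015, 2015, 2015],
    "Domestic Gross($M)": [10.0, np.nan, 20.0, 40.0, np.nan],
    "Worldwide Gross($M)": [25.0, 60.0, np.nan, 90.0, 70.0],
})
grouped = toy.groupby("Release Year")

sizes = grouped.size()    # Series: rows per group, missing values included
counts = grouped.count()  # DataFrame: non-missing values per column per group

print(sizes)
print(counts)
```

Here 2015 has three rows, so its size is 3, but each column has only two non-missing values in that year, so both counts are 2.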

In [None]:
mfin_yearMonth.size().idxmax(), mfin_yearMonth.size().max()

**Exercise:** See if you can find out which month in which year had the fewest movies released (strictly greater than 0).

**GOTCHA:** The *idxmax* and *idxmin* methods return only the first occurrence in the case of a **tie**. What if we want to find all of them?

One way to do this is via conditional indexing as follows: 

In [None]:
sizes = mfin_yearMonth.size()  # compute the group sizes once
maxM = sizes.max()
sizes[sizes == maxM]  # keep every year/month whose size equals the maximum

In this case there is only one occurrence of the maximum.

**Exercise:** See if you can find out how many times the minimum number of movies occurs.

One can also examine the data at a single index level. Note that in such cases we work with the *DataFrame* object (here *grossbyYearMonth*) rather than the *DataFrameGroupBy* object.
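
When a tie does occur, the difference between *idxmax* and conditional indexing shows up clearly. The sketch below uses a hypothetical Series of release counts named `sizes`; the numbers are made up.

```python
import pandas as pd

# Hypothetical release counts with a tie for the maximum.
sizes = pd.Series(
    [3, 5, 5, 1],
    index=pd.MultiIndex.from_tuples(
        [(2014, "Apr"), (2014, "Dec"), (2015, "Dec"), (2015, "Feb")],
        names=["Release Year", "Month"],
    ),
)

print(sizes.idxmax())                # only the first label attaining the maximum

ties = sizes[sizes == sizes.max()]   # every label attaining the maximum
print(ties)
print(len(ties))                     # number of tied groups
```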

In [None]:
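# DataFrame.count(level=...) was removed in pandas 2.0; grouping by the index level is equivalent.
grossbyYearMonth.groupby(level='Release Year').count()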
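
Counting per index level can also be sketched on toy data with *groupby(level=...)*, which works on both older and newer pandas releases. The dataframe `toy` and its values are made up. Note that the count is taken over the rows of the aggregated dataframe, so it gives the number of months with a non-missing mean in each year, not the number of movies.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2014, "Apr"), (2014, "Dec"), (2015, "Apr"), (2015, "Dec"), (2015, "Feb")],
    names=["Release Year", "Month"],
)
# Stand-in for a table of monthly means, with one missing entry.
toy = pd.DataFrame({"Domestic Gross($M)": [10.0, 20.0, np.nan, 40.0, 50.0]}, index=idx)

# Number of non-missing monthly means available in each year.
per_year = toy.groupby(level="Release Year").count()
print(per_year)
```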

**Exercise:** Count the number of movies released in each month over all the years.

What if we want to know the titles of all the movies released in December 2015, the year and month with the most releases? One way to do this is the following:

In [None]:
# Recall groupbylist is ['Release Year', 'Month']
# Recall var = ['Domestic Gross($M)', 'Worldwide Gross($M)']
var01 = ['Movie']+ var # We are adding Movie into the list of columns that we want. 
maxgroup = mfin_yearMonth.size().idxmax() #This gives us the year/month we need. 
mfin.groupby(groupbylist)[var01].get_group(maxgroup)

**IMPORTANT NOTE:** The '+' operator between lists joins the two lists together. This is one of the reasons why understanding data types is important: the operator '+' is defined differently depending on the types to which it is applied. For example, 2+3 gives 5 because '+' is defined as addition for numbers. However, if we "add" two lists together, Python simply joins them.
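
The type-dependent behaviour of '+' can be checked directly in plain Python. This short sketch reuses the list names from the cell above.

```python
var = ["Domestic Gross($M)", "Worldwide Gross($M)"]
var01 = ["Movie"] + var   # list + list: concatenation, returns a new list

print(2 + 3)              # int + int: arithmetic sum
print(var01)
print(var)                # the original list is left unchanged

# Mixing a list with a string is an error rather than a silent conversion.
try:
    ["Movie"] + "Month"
except TypeError as err:
    print("TypeError:", err)
```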

## Dealing with Multi-index

The example above uses a multi-index, and we are going to explore this a little further. Specifically, we are going to see how we can slice data with a multi-index.

The key idea is to utilise [*tuples*](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences), following an $(index_0, index_1)$ convention for rows. For example, if we want the monthly means of both domestic and worldwide gross for all the movies released between 2012 and 2016 in April and December, we can do the following:

In [None]:
grossbyYearMonth.loc[(slice(2012,2016), ['Apr', 'Dec']), :]

We use the *loc* indexer to extract the relevant part of the dataframe. We use the tuple (YearRange, MonthRange) to specify the year range and month range. Note the use of the function *slice* to indicate the year range from 2012 to 2016; with *loc*, both endpoints are included. The range of months is given by a list containing all the months we want. If we want all the months, i.e. select everything, we can use *slice(None)* as follows:


In [None]:
grossbyYearMonth.loc[(slice(2012,2016), slice(None)), :]
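
The tuple convention can be tried out on a small made-up multi-index dataframe. In this sketch the name `toy` and its values are assumptions, and *pd.IndexSlice* is shown only as an alternative spelling of the same selection.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[2011, 2012, 2013], ["Apr", "Dec", "Feb"]],
    names=["Release Year", "Month"],
)
toy = pd.DataFrame({"Domestic Gross($M)": range(9)}, index=idx)

# Years 2012 to 2013 (both endpoints included), April and December only.
subset = toy.loc[(slice(2012, 2013), ["Apr", "Dec"]), :]
print(subset)

# pd.IndexSlice lets the same selection be written with the familiar colon syntax.
same = toy.loc[pd.IndexSlice[2012:2013, ["Apr", "Dec"]], :]
print(subset.equals(same))
```

Slicing a level with *slice* relies on the index being sorted, which is the case for the output of *groupby* with its default settings.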