# Transforming a Pandas DataFrame

This notebook demonstrates how to transform data within a Pandas dataframe. It shows the use of [agg](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) and [transform](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transform.html) as well as creating new columns by combining existing ones. 

We will also demonstrate how to construct a [pivot table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) using Pandas. 

## Aggregate

Often we may wish to apply functions that involve *aggregating* the data, such as sum, mean and standard deviation. We have seen some of this when we discussed *Groupby*. Here we use the *agg* method to show how much more we can do. 

First we import the data as well as the usual modules. We will introduce a new module here called [numpy](https://numpy.org/). It is one of the most important modules in Python for numerical computation as it defines *multi-dimensional* arrays and their associated operations. It also contains some of the most common mathematical functions. 

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

In [None]:
mfin = pd.read_csv('../input/moviefinance/MovieFinances.csv', header=0)
mfin.info()

Let us say we want to examine the mean, the standard deviation, the max and the min for both Domestic Gross and Worldwide Gross. 

In [None]:
var = ['Domestic Gross($M)', 'Worldwide Gross($M)']
mfin[var].agg([max, min, np.mean, np.std])

In [None]:
mfin[var].describe()
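
The statistics can also be requested by name, using strings instead of function objects. The sketch below uses a small made-up dataframe (the column names and numbers are illustrative assumptions, not values from MovieFinances.csv) and compares the result with *describe*.

```python
import pandas as pd

# Hypothetical toy data; not taken from MovieFinances.csv.
toy = pd.DataFrame({'Domestic': [10.0, 20.0, 30.0],
                    'Worldwide': [15.0, 45.0, 90.0]})

# Each string selects the Pandas method of the same name.
summary = toy.agg(['max', 'min', 'mean', 'std'])
print(summary)

# describe() reports the same statistics, plus the count and the quartiles.
print(toy.describe())
```

Both calls use the sample standard deviation, so the *std* rows should agree.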

We do not have to compute every summary statistic for every variable. We can be selective by using a [*dictionary*](https://docs.python.org/3/tutorial/datastructures.html#dictionaries). For example, if we want max, min, mean and standard deviation for Domestic Gross but only max and standard deviation for Worldwide Gross, we can do the following:

In [None]:
mfin[var].agg({'Domestic Gross($M)':[max, min, np.mean, np.std], 'Worldwide Gross($M)':[max, np.std]})

A naive way to think about a *dictionary* is that it is a list (or an array) with a *text* index. So far we have used something like *array[2]* to extract the third element from the list *array*. A dictionary uses *text* (called a key) as its index. For instance, one way to define a dictionary is 

In [None]:
dictExample = {'FirstChoice':34, 'SecondChoice':"hello world"}

In this case, *dictExample* is a dictionary with two elements, namely *FirstChoice* and *SecondChoice*, and their corresponding values can be extracted by

In [None]:
dictExample['FirstChoice'], dictExample['SecondChoice']
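
A few other everyday dictionary operations are sketched below. The dictionary *example* and its keys are made up purely for illustration.

```python
# Hypothetical dictionary used only for illustration.
example = {'FirstChoice': 34, 'SecondChoice': 'hello world'}

# Add a new key; the value can be any object, including a list.
example['ThirdChoice'] = [1, 2, 3]
print(example['ThirdChoice'])

# Keys keep their insertion order (Python 3.7 and later).
print(list(example.keys()))

# Testing for a key first avoids a KeyError on lookup.
print('FourthChoice' in example)
```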

In the *agg* example above, we use a dictionary *{'Domestic Gross($M)':[max, min, np.mean, np.std], 'Worldwide Gross($M)':[max, np.std]}* to specify which statistics should be applied to which variable. In this case, the keys of the dictionary are the variable names and the corresponding values are lists that contain the required statistics. In fact, we can make this more readable by

In [None]:
statdict = {'Domestic Gross($M)':[max, min, np.mean, np.std], 'Worldwide Gross($M)':[max, np.std]}
mfin[var].agg(statdict)

One thing to note is the "NaN" (Not-a-Number, used here for missing values) under Worldwide Gross for min and mean. This reflects the fact that we did not apply these statistics to Worldwide Gross. 
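
The same behaviour can be reproduced on a tiny made-up dataframe, where it is easy to see which cells are left empty. The columns *a* and *b* below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 6.0, 8.0]})

# Ask for max and min of 'a' but only max of 'b'.
partial = toy.agg({'a': ['max', 'min'], 'b': ['max']})
print(partial)

# isna() marks the statistics that were never requested.
print(partial.isna())
```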

Combining *agg* with *groupby* gives us a powerful way to analyse data. For example, if we want to calculate the means and standard deviations of both Domestic Gross and Worldwide Gross for each year, we can try 

In [None]:
mfinYear = mfin.groupby('Release Year')
mfinYear[var].agg([np.mean, np.std])

The first line creates a *GroupBy* object and the second line applies *agg* directly to that *GroupBy* object. 
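
To see the shape of the output, here is a minimal sketch on made-up data (the *year* and *gross* columns are assumptions for illustration). The result has one row per group and one column per (variable, statistic) pair.

```python
import pandas as pd

# Hypothetical toy data: two movies in each of two years.
toy = pd.DataFrame({'year': [2000, 2000, 2001, 2001],
                    'gross': [10.0, 30.0, 20.0, 60.0]})

byYear = toy.groupby('year')
stats = byYear[['gross']].agg(['mean', 'std'])
print(stats)

# A single statistic is selected with a (variable, statistic) tuple.
print(stats[('gross', 'mean')])
```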

**Exercise** 
1. Write a piece of code that would return the min, max, mean and standard deviation of both Domestic Gross and Worldwide Gross for each month (averaging across every year in the sample). 
2. Write a piece of code that would return the min, max, mean and standard deviation of both Domestic Gross and Worldwide Gross for every month in every year. 

## Transform

*Aggregate* summarises specific parts of the data based on certain criteria. There are cases where one may wish to *transform* the data before the analysis. For example, one may wish to express Domestic Gross in dollars rather than in millions of dollars. This is where the *transform* method can be useful. 

In [None]:
mfin[var].transform(lambda x: 1e6*x)

The key is the first argument, where the keyword *lambda* starts an inline (anonymous) function definition. The expression `lambda x: 1e6*x` creates a function that takes *x* as its input and returns `1e6*x` as its output. 

This function is then applied to each column of the selected dataframe. 

This creates a new dataframe with the *transformed* data. The original dataframe remains unchanged. We can check this by

In [None]:
mfin[var]
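
A *lambda* is simply a short way of writing a function. The sketch below, on a made-up one-column dataframe, shows that a named function defined with *def* gives the same result, and that the original data is left untouched. The names *toy* and *toDollars* are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, in millions.
toy = pd.DataFrame({'gross': [1.5, 2.0]})

def toDollars(x):
    """Convert millions to units; same body as the lambda below."""
    return 1e6 * x

viaLambda = toy.transform(lambda x: 1e6 * x)
viaDef = toy.transform(toDollars)

print(viaLambda.equals(viaDef))  # the two results are identical
print(toy)                       # the original dataframe is unchanged
```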

Another application is to centre the data by subtracting the mean of the variable from each of the observations. In other words, we want to apply 

$$ x_i \leftarrow x_i - \bar{x} $$

where $\bar{x}$ denotes the mean of $x$. 

In [None]:
mfin[var].transform(lambda x: x-np.mean(x))
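
A quick way to check the centring is to confirm that every transformed column has a mean of (numerically) zero. The following is a minimal sketch on made-up numbers; the columns *a* and *b* are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'a': [1.0, 2.0, 6.0], 'b': [10.0, 20.0, 60.0]})

# Subtract each column's own mean from its observations.
centred = toy.transform(lambda x: x - np.mean(x))
print(centred)

# After centring, the mean of every column should be zero.
print(centred.mean())
```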

**Exercise** Standardise the relevant variables in the dataframe by following 

$$ z_i = \frac{x_i - \bar{x}}{\sigma_x} $$

where $\bar{x}$ denotes the mean of $x$ and $\sigma_x$ denotes the standard deviation of $x$. 

## Creating a New Column

Creating a new column in a Pandas dataframe is relatively straightforward. To demonstrate, we continue with the familiar *MovieFinance* dataset (*mfin*) and the modules imported above. 

We can create a column called *temp* filled with 0 (or whatever value we like) by 

In [None]:
mfin['temp'] = 0
mfin.head()

**Exercise** See if you can create a new column called *temp2* with the value -3. 

Suppose we would like to analyse the ratio between Domestic Gross and Worldwide Gross. We can do this by 

In [None]:
mfin['grossRatio'] = mfin['Domestic Gross($M)']/mfin['Worldwide Gross($M)']
mfin.head()
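
When building a ratio it is worth knowing what happens if the denominator is zero. The sketch below uses made-up numbers (the column names are hypothetical) to show that Pandas does not raise an error for floating-point columns; it returns *inf* or *NaN* instead.

```python
import pandas as pd

# Hypothetical toy data with zeros in the denominator.
toy = pd.DataFrame({'domestic': [50.0, 10.0, 0.0],
                    'worldwide': [100.0, 0.0, 0.0]})

# The division is applied row by row.
toy['ratio'] = toy['domestic'] / toy['worldwide']
print(toy)  # ratio column: 0.5, inf, NaN
```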

**Exercise** Construct a new variable called cost2Revenue which represents the ratio between budget and Worldwide Gross. 

## Pivot Table

What is a Pivot Table? You can find out more [here](https://en.wikipedia.org/wiki/Pivot_table). 

The construction of a pivot table is often treated as a more advanced feature of Excel. In Pandas, it is a relatively straightforward object to construct. It involves just one line of code. 

In this section, we are going to put a pivot table together. Because it is so simple, we are going to make the task more challenging by using some of the functions introduced above during the data preparation stage. 

IMDB.csv contains information about the rating, genre and budget of each movie in the dataset. Suppose we wish to get some insight into the relation between movie ratings, genres and budgets. We can construct a *pivot table* where each row represents a genre, each column represents a rating and each data point is the mean budget of the movies with that genre and that rating. 

First, we import the data in the usual fashion. 

In [None]:
imdb = pd.read_csv('../input/imdbcsv/IMDB.csv', header=0)
imdb.info()

In [None]:
imdb.head(5)

Bummer! Immediately we see a couple of challenges.

1. Rating is a float. Ideally we want an integer from 1 to 10 rather than a real number (float). 
2. All the genres are dummy variables and some of the movies do not belong to any of the genres. We need to combine these into one variable. 

The first task involves two actions. One is to obtain the integer version of the ratings. The other is to create a new variable that stores it. Both can be done by 

In [None]:
imdb['roundRating'] = imdb['rating'].transform(int)

In [None]:
imdb.head()

The left-hand side of the "equation" creates a new variable called *roundRating*. The right-hand side of this "equation" sets the values of *roundRating* to the outcome of the *transform* method, which converts each rating to an integer. Note that *int* drops the decimal part rather than rounding to the nearest integer, so a rating of 7.9 becomes 7. 
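
The difference between dropping the decimal part and rounding to the nearest integer is sketched below on made-up ratings. This sketch uses *astype(int)* and *round*, which are swapped in here for illustration and are not the notebook's own method.

```python
import pandas as pd

# Hypothetical ratings.
ratings = pd.Series([7.9, 6.4, 8.2])

# Converting to int drops the decimal part.
truncated = ratings.astype(int)

# Rounding first gives the nearest integer instead.
rounded = ratings.round().astype(int)

print(truncated.tolist())  # [7, 6, 8]
print(rounded.tolist())    # [8, 6, 8]
```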

Now the second task is a little trickier. We need to handle two issues. The first is to identify, for each row, which genre dummy equals 1 and then set the genre variable for that row to the name of that dummy. The basic idea is to use *idxmax*: since these are dummy variables, the maximum value in a row is 1, so *idxmax* returns the name of the first column among the genre dummies whose value equals 1. 

The second issue is that for some rows all the dummies are zero. In that case, we need to set the genre variable to *Others*, and we use *conditional slicing* to identify these rows. When all the dummies are 0, the maximum value is 0 and its first occurrence is the "Action" dummy, so all we really need to do is identify the cases where the *genre* variable is Action but the *Action* dummy is 0. 

<a style="color:red">IMPORTANT NOTE:</a> An important assumption here is that each movie has a unique genre. This is not necessarily the case, though it can be handled with slightly more code, which we will skip for now. 

In [None]:
imdb['genre'] = imdb[imdb.columns[18:25]].idxmax(axis=1) # The argument axis=1 is to inform Pandas we want the column index rather than the row index.
changeIndex = (imdb['genre']=='Action')&(imdb['Action']==0) # In the case when all the dummies are 0, the max value is 0 with the first occurrence in "Action". So this allows us to identify "Others"
imdb.loc[changeIndex, 'genre'] = 'Others'
imdb.head(5)

Let us check that *Others* has been assigned correctly. 

In [None]:
imdb.loc[imdb['genre']=='Others']

Check that these steps give the desired outcome. 
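
The assignment rule can also be tried on a small made-up table where the expected answer is obvious. In the sketch below the movies and genre columns are hypothetical, and the rows without a genre are found by summing the dummies, which is an alternative to the *Action*-based check used above.

```python
import pandas as pd

# Hypothetical genre dummies for four movies.
toy = pd.DataFrame({'Action': [1, 0, 0, 0],
                    'Comedy': [0, 1, 0, 1],
                    'Drama':  [0, 0, 0, 1]},
                   index=['m1', 'm2', 'm3', 'm4'])
genreCols = ['Action', 'Comedy', 'Drama']

# idxmax returns the first column holding the row maximum.
toy['genre'] = toy[genreCols].idxmax(axis=1)

# A row whose dummies sum to 0 belongs to no genre.
noGenre = toy[genreCols].sum(axis=1) == 0
toy.loc[noGenre, 'genre'] = 'Others'

print(toy)
```

Here *m3* has no genre and becomes *Others*, while *m4* belongs to both Comedy and Drama and is labelled with the first of the two.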

Note that our approach to creating *genre* is not necessarily perfect, because some movies span two (or more) genres. Since *idxmax* returns the first occurrence of the maximum, our approach will set the genre to be the first genre with value 1. 

It is possible to create a more sophisticated assignment rule. We will leave that to you to decide. 
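
As one possible rule among many, the sketch below joins the names of all genres a movie belongs to into a single label. The data and the helper *joinGenres* are hypothetical, and this is only an illustration rather than a recommended rule.

```python
import pandas as pd

# Hypothetical genre dummies for three movies.
toy = pd.DataFrame({'Action': [1, 0, 1],
                    'Comedy': [0, 0, 1],
                    'Drama':  [0, 0, 0]})
genreCols = ['Action', 'Comedy', 'Drama']

def joinGenres(row):
    """Join the names of all dummies equal to 1; use 'Others' if none."""
    names = row[row == 1].index
    return '|'.join(names) if len(names) > 0 else 'Others'

# axis=1 applies the helper to each row in turn.
toy['genre'] = toy[genreCols].apply(joinGenres, axis=1)
print(toy)
```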

Now, assuming we are happy with our assignment rule above, we can construct our pivot table. 

In [None]:
pd.pivot_table(imdb, values='budget', index='genre', columns='roundRating', aggfunc=np.mean)
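
To see how the table is laid out, here is a minimal sketch on made-up data (the genres, ratings and budgets below are not taken from IMDB.csv). Combinations of genre and rating that never occur in the data appear as NaN.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'genre': ['Action', 'Action', 'Drama', 'Drama'],
                    'roundRating': [6, 6, 7, 8],
                    'budget': [100.0, 200.0, 30.0, 50.0]})

# Rows are genres, columns are ratings, cells are mean budgets.
table = pd.pivot_table(toy, values='budget', index='genre',
                       columns='roundRating', aggfunc='mean')
print(table)
```

The two Action movies rated 6 are averaged into a single cell, while Action has no movie rated 7 or 8, so those cells are empty.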