# Pandas

## Introduction 

This notebook introduces [Pandas][pandas], one of the most popular Python modules for data analysis. Pandas builds on [numpy][numpy], which provides the foundation for manipulating multi-dimensional arrays, and uses that foundation for data handling, manipulation and analysis.

It also uses [matplotlib][matplotlib] to create various data visualizations, including standard plots such as box plots, histograms, line plots and scatter plots.


[numpy]: http://numpy.org 
[pandas]: http://pandas.pydata.org
[matplotlib]: https://matplotlib.org

## Importing Pandas

To import the module, we write:

In [None]:
import pandas as pd

In this case we give pandas the alias *pd*. We can access anything defined within the pandas module by using the convention *pd.name*. For example, *pd.DataFrame* refers to *DataFrame* as defined within the pandas module. The alias you give pandas is arbitrary, although *pd* is the common convention. For example,

import pandas as pand

means *pand.DataFrame* will provide the same functionality as *pd.DataFrame*.

To show plots inside the Jupyter notebook, we need to invoke a [jupyter magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html), one of a set of special Jupyter commands that make life easier. The one we need is

In [None]:
%matplotlib inline

## Importing Data

Pandas provides a range of functions to import data, depending on the file format. In this example, we would like to import data from a *comma-separated values* (csv) file called *MovieFinances.csv*.

In [None]:
mfin = pd.read_csv("../input/moviefinance/MovieFinances.csv", header=0)

The function above imports the data from the file into the variable *mfin*, which is a **dataframe**. The *header=0* option tells pandas that the first row contains the column names.
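As a minimal sketch of what *header=0* does, the example below reads CSV text from memory instead of from a file. The text, its column names and its values are invented for illustration and are not taken from *MovieFinances.csv*.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """Movie,Release Year,Budget($M)
Film A,2014,120.0
Film B,2015,250.0
Film C,2015,35.5
"""

# header=0 tells pandas that row 0 of the input holds the column names.
demo = pd.read_csv(io.StringIO(csv_text), header=0)

print(type(demo).__name__)  # the result is a DataFrame
print(list(demo.columns))   # names taken from the first line
print(len(demo))            # the header line is not counted as data
```

Reading from a real file works the same way, with the path string passed in place of the in-memory buffer.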

**Note:** Python is 0-indexed. This means indexing starts from 0 rather than 1.

One way to check whether the data has been imported correctly is to use the *head* and *tail* functions. *head(n)* lists the first $n$ rows of data whereas *tail(n)* shows the last $n$ rows. These functions belong to the dataframe, so to use them we write

*mfin.head(5)*

to show the first 5 rows of data. 

In [None]:
mfin.head(15)

Similarly, *mfin.tail(3)* shows the last 3 rows of data. 

In [None]:
mfin.tail(5)
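The behaviour of *head* and *tail* can be checked on a small dataframe where the expected rows are obvious. This is only a sketch on made-up data; the name `demo` is hypothetical.

```python
import pandas as pd

# Hypothetical 10-row dataframe whose single column counts from 0 to 9.
demo = pd.DataFrame({"n": range(10)})

first_three = demo.head(3)   # rows 0, 1 and 2
last_two = demo.tail(2)      # rows 8 and 9
default_head = demo.head()   # with no argument, pandas returns 5 rows

print(first_three)
print(last_two)
print(len(default_head))
```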

## Getting Meta-Data

We can start inspecting our dataset by gathering some of the *meta-data* such as the number of observations (number of rows) and number of variables (number of columns). 

The dataframe attribute *shape* gives the size of the data in the form of rows $\times$ columns. Because it is an attribute rather than a function, it is written without parentheses.

In [None]:
mfin.shape

*shape* returns a Python [*tuple*](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences) in which the first number is the number of rows and the second number is the number of columns in the dataframe *mfin*. In this case, *mfin* has 5222 rows (observations) and 7 columns (variables).
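Since *shape* is an ordinary tuple, it can be unpacked into two variables. The sketch below uses a small invented dataframe, so its numbers differ from those of *mfin*.

```python
import pandas as pd

# Hypothetical data: 3 observations and 2 variables.
demo = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C"],
    "Release Year": [2014, 2015, 2015],
})

# Tuple unpacking: the first entry is rows, the second is columns.
n_rows, n_cols = demo.shape

print(f"{n_rows} rows (observations), {n_cols} columns (variables)")
```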

We can also get the list of column names by using the attribute *columns*

In [None]:
mfin.columns

You can also use the *info* function to get all the information above. 

In [None]:
mfin.info()
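*info* prints its summary to the screen. When the column names or types are needed as values, for instance to loop over them, the attributes *columns* and *dtypes* can be used instead. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical data with one text, one integer and one float column.
demo = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C"],
    "Release Year": [2014, 2015, 2015],
    "Budget($M)": [120.0, 250.0, 35.5],
})

column_names = list(demo.columns)  # plain Python list of names
column_types = demo.dtypes         # one dtype per column, indexed by name

print(column_names)
print(column_types)
```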

**Exercise:** See if you can obtain similar meta-data for *IMDB.csv*. 

1. How many movies and variables are there in the IMDB dataset?
2. Identify the variable type for each column. 

To get more basic info about the dataframe, one can utilise the module [*pandas_profiling*](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/)

In [None]:
import pandas_profiling as pdf
profile = pdf.ProfileReport(mfin, title='Movie Finance')

To show the profile in the notebook

In [None]:
profile.to_widgets()

## Slicing

Often we would like to access only part (a subset) of the data. To do this, we first consider the concept of *slicing*. The idea is that each entry in a dataframe has a coordinate, given by its row and column numbers. So we can access each entry by providing its row and column numbers to the *iloc* indexer, which stands for *integer location*. For example, to access the entry in the *second* row and *sixth* column we can use

In [None]:
mfin.iloc[1,5]

**Note:** Recall that Python is 0-indexed, i.e. it counts from 0 rather than 1.

We can also obtain multiple rows and columns at the same time. For example, if we want data from the third to fifth rows and second to fourth columns we can use

In [None]:
mfin.iloc[2:5, 1:4]

**IMPORTANT NOTE:** Unlike in some other languages, the range $2:5$ is right-open. Mathematically, it covers all the integers from 2 up to but not including 5, i.e. $2:5 = [2,5)$. Similarly, $1:4 = [1,4)$. Here a square bracket indicates that the bound is included, while a round bracket indicates that it is excluded.
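The right-open rule can be verified on a small grid of numbers. The sketch below builds a hypothetical 6 by 5 dataframe with columns named `a` to `e` and applies the same slice as above.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x5 dataframe filled with the integers 0..29.
demo = pd.DataFrame(np.arange(30).reshape(6, 5), columns=list("abcde"))

# Rows at positions 2, 3, 4 and columns at positions 1, 2, 3.
# Position 5 (rows) and position 4 (columns) are excluded.
block = demo.iloc[2:5, 1:4]

print(block)
print(block.shape)  # number of rows and columns actually selected
```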

It is also possible to slice by individual indexes. For example, if we want the first and third rows as well as the second and fifth columns, we can write

In [None]:
mfin.iloc[[0,2], [1,4]]

If we want to select all rows or all columns, we can use the $:$ notation. The example below extracts all the data for the variable "Release Year".

In [None]:
mfin.iloc[:, 3]

It is also possible to extract columns based on column names using the *loc* indexer. For example, we can reproduce the example above with

In [None]:
mfin.loc[:, 'Release Year']
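To see that selection by position and selection by name can pick out the same column, here is a minimal sketch on invented data. The column order is chosen for this example only and need not match *mfin*.

```python
import pandas as pd

# Hypothetical data; "Release Year" sits at column position 1 here.
demo = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C"],
    "Release Year": [2014, 2015, 2015],
})

by_position = demo.iloc[:, 1]          # second column, by position
by_name = demo.loc[:, "Release Year"]  # same column, by name

print(type(by_name).__name__)          # a single column comes back as a Series
print(by_position.equals(by_name))     # both selections hold the same data
```

Selecting by name is usually more robust, since it keeps working if the column order of the file changes.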

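Similarly, we can slice rows while selecting columns by name. Note that, unlike *iloc*, a *loc* slice is based on labels and includes both endpoints, so with the default integer index *0:8* covers the rows labelled 0 to 8.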

In [None]:
mfin.loc[0:8, ['Movie','Release Year']]
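The endpoint behaviour of the two indexers can be compared side by side. This sketch assumes a dataframe with the default integer index; the data is made up.

```python
import pandas as pd

# Hypothetical 10-row dataframe with the default index 0..9.
demo = pd.DataFrame({
    "Movie": [f"Film {i}" for i in range(10)],
    "Release Year": [2010 + i for i in range(10)],
})

by_position = demo.iloc[0:8]                      # positions 0..7, end excluded
by_label = demo.loc[0:8, ["Movie", "Release Year"]]  # labels 0..8, end included

print(len(by_position))
print(len(by_label))
```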

**Exercise** 

1. Extract rows 10-15 for all the variables in MovieFinances.csv.
2. Extract rows 10-15, 400 and 450-900 for all the variables in IMDB.csv. 
3. Extract the *ratings* and *budget* variables. Do you think budget and rating are related? 

## Conditional Slicing

One of the most powerful features of Pandas, as well as of other tools such as R and Julia, is the ability to slice based on a conditional statement. For example, if we want to examine only the movies released in 2015, we can write

In [None]:
mfin['Release Year']==2015

In [None]:
mfin.loc[mfin['Release Year']==2015, :]

In this case, we are telling Pandas that we only want rows that correspond to the case when *Release Year* is equal to 2015. 
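The two cells above work together: the comparison produces a column of True/False values (often called a boolean mask), and *loc* keeps the rows where the mask is True. A minimal sketch on invented data, with the hypothetical names `mask` and `released_2015`:

```python
import pandas as pd

# Hypothetical data with two movies released in 2015.
demo = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C", "Film D"],
    "Release Year": [2014, 2015, 2015, 2016],
})

# Step 1: the comparison is applied row by row and yields True or False.
mask = demo["Release Year"] == 2015

# Step 2: loc keeps only the rows where the mask is True.
released_2015 = demo.loc[mask, :]

print(mask)
print(released_2015)
print(mask.sum())  # True counts as 1, so this is the number of matches
```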

Another example is to consider only movies with a budget greater than \$200M. In this case, we can write

In [None]:
mfin.loc[mfin['Budget($M)']>200, :]

Note that each of these commands returns a sub-dataframe, so we can get the meta-data of these subsets by using all the functions we have introduced so far. For example

In [None]:
mfin.loc[mfin['Budget($M)']>200, :].info()
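As a sketch of the same idea on invented data, the filtered result can be stored in a variable and inspected like any other dataframe. The name `big_budget` is hypothetical.

```python
import pandas as pd

# Hypothetical budgets in millions of dollars.
demo = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C", "Film D"],
    "Budget($M)": [120.0, 250.0, 35.5, 310.0],
})

big_budget = demo.loc[demo["Budget($M)"] > 200, :]

print(type(big_budget).__name__)  # still a DataFrame
print(big_budget.shape)           # rows and columns of the subset
print(list(big_budget.index))     # original row labels are kept
```

Because the subset keeps the original row labels, its index is generally not a consecutive run starting at 0.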

**Exercise**

1. In the IMDB data, extract movies that have a rating higher than 8.
2. Can you extract movies that have a rating lower than 8 but higher than 5?
3. Extract movies that were released in December since 2008. 