# 3.3 Getting Started with Pandas

>This notebook accompanies the module __MC_DAP01_3.3_Getting Started with Pandas__.
><br>All the code snippets used in the section are available in this notebook for reference.


## Introduction to Pandas

Pandas is one of the most popular Python libraries, perhaps because it is the workhorse for data analysts using Python!

### Pandas = Panel + Data 
    
Etymologically, Pandas is a portmanteau of ‘Panel’ and ‘Data’. Panel data is a term commonly used for data sets that contain observations of the same subjects or individuals over a period of time.
    
### Pandas functionalities

Pandas offers many functionalities, including:

- loading your data from a file to start your analysis
- previewing your data to understand the data
- manipulating your data for gathering better insights 
- transforming your data from one format to another


As with any Python module, you first need to import Pandas before you can use it in your program.

### Importing Pandas
    
Similar to NumPy, to start using Pandas and all of the functions available, you are required to import the package. This can be easily done using the following code:


In [None]:
##Importing Pandas package
import pandas as pd

There is a widely followed convention in the Python world of using ‘pd’ as the reference name when importing Pandas. Technically, any other name can be used, but ‘pd’ is what you will see in most code and documentation.

### How is it different from NumPy?

NumPy is powerful, but it lacks some high-level functions and abstractions that are needed for solving everyday problems for a data analyst using structured data tables. 

For example:
- Labelled columns: Labelled data is especially useful in explicit data alignment while loading data to Pandas.
- Reading from files: Datasets stored in CSV and XLSX (among other formats) can be read easily using Pandas.

Pandas is a fast, powerful, flexible, and easy to use open source data analysis and data manipulation tool for Python. Pandas is built on top of NumPy and makes NumPy-style data easier to work with in data analytics applications.

That being said, it is very important and useful to learn NumPy, as a lot of the core features of Pandas and other packages are based on NumPy functionality.
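To make the difference concrete, here is a minimal sketch that holds the same numbers first in a NumPy Array and then in a Pandas DataFrame. The values and the labels ('pop', 'GDP', 'WA', 'VIC') are illustrative assumptions, not real figures.

```python
import numpy as np
import pandas as pd

# A plain NumPy Array: columns can only be addressed by position
arr = np.array([[1.0, 16.0], [2.5, 20.0]])
pop_from_array = arr[:, 0]

# The same values in a DataFrame: rows and columns carry labels
df = pd.DataFrame(arr, columns=['pop', 'GDP'], index=['WA', 'VIC'])
pop_from_frame = df['pop']

print(pop_from_array)
print(pop_from_frame)
```

With the Array, you have to remember that column 0 holds the population; with the DataFrame, the label travels with the data.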

## Pandas data structures
There are two fundamental data structures that Pandas provides:
- Pandas Series, and
- Pandas DataFrame

Together, these two data structures provide a solid basis for most of the data analysis applications that you would build in Python.

__Pandas Series__ is a one-dimensional Array-like object consisting of an Array of data and an associated Array of labels, known as its index.

__DataFrame__ is a two-dimensional, tabular, spreadsheet-like data structure with an ordered collection of columns. It has an index for both rows and columns.

Let’s explore these data structures in some detail in the subsequent sections.


## Pandas Series
A Pandas Series:
- is a one-dimensional Array that can hold data of any type
- contains an Array of data
- contains an associated Array of labels, which is also known as its index.


### Creating Pandas Series

The simplest way to create a Series is to pass an Array of data to the Series function, as shown in the code snippet below:

In [None]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

In [None]:
ser1 = Series([1,2,3,4,5,6,7])
ser1

Looking at the output of the code above, we can infer that:

- the second column contains the Array data
- the first column contains the default labels (ie, default index assigned to the data).

### Accessing data and index of Series

As described earlier, a Series contains both an Array of data and an Array of labels.

Two attributes, values and index, return the Array representation of the data and the index object of the Series respectively.

The following code snippet highlights this functionality:  

In [None]:
ser1.values

In [None]:
ser1.index

### Creating Series with explicit labelling

Often we need to create a Series with a labelled index for each data point. This is achieved by passing the index labels to the Series function through the index parameter.

This is highlighted in the code snippet below:

In [None]:
ser2 = Series([1,2,3,4,5,6,7], index=['a','b','c','d','e','f','g'])
ser2

In [None]:
ser2.values

In [None]:
ser2.index

### Accessing values using indexes (default index or explicit index)

Values in the Series can be accessed using the index notation. We can either pass a single index or multiple indexes (Default index or Explicit index) to access the values. 

The code snippet below highlights the behaviour:

In [None]:
ser2

In [None]:
ser2['a'], ser2['f']

In [None]:
ser2[['a','b','f']]
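Square brackets accept both kinds of index, which can be ambiguous to read. As a minimal sketch (the Series `ser` below is a small stand-in created only for this example), the `.loc` and `.iloc` indexers make the choice between label and position explicit:

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

by_label = ser.loc['b']          # label-based lookup
by_position = ser.iloc[1]        # position-based lookup (second element)
several = ser.loc[['a', 'c']]    # a list of labels returns a Series

print(by_label, by_position)
print(several)
```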

### Creating Series from Python Dictionaries


A Pandas Series can also be created by passing a dictionary object to the Series function. In this case, the keys of the dictionary become the labels of the Series, and the values of the dictionary become the data of the Series.

The code snippet below highlights the behaviour.

- We first create a dictionary of key:value pairs, where:
    <br>Key = country code
    <br>Value = description of currency.

- We then create the Pandas Series by passing the dictionary to the Series() function: 
    <br>The values from the dictionary = the data for the Series
    <br>The keys from the dictionary = the index for the Series.


In [None]:
dict_1 = {'AU':'Australian Dollar',
         'US': 'US Dollar',
         'IN': 'Indian Rupees',
         'DK': 'Danish Kroner',
         'CH': 'Swiss Francs'}
dict_1

In [None]:
ser1 = Series(dict_1)
ser1

In [None]:
ser1.index
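A dictionary can also be combined with an explicit index. The sketch below uses a shortened, illustrative dictionary; the 'NZ' label is deliberately absent from it to show what happens to labels that have no matching key.

```python
import pandas as pd

currencies = {'AU': 'Australian Dollar', 'US': 'US Dollar'}

# The index picks and orders the keys; 'NZ' has no entry in the dictionary
ser = pd.Series(currencies, index=['US', 'AU', 'NZ'])
print(ser)
```

The labels follow the order of the index that was passed in, and the label without a matching key receives a missing value (NaN).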

### Name Attribute for Series and index

Both the Pandas Series and its index have the ‘name’ attribute, and this is used at various places during programming with Pandas. We will see its usage in some exercises during the course.

For now, it’s good to remember that such an attribute exists and its value can be accessed programmatically.

The following code snippet highlights the ‘name’ attribute. You can see that when we display: 
- the Series, the name of the Series is displayed along with the data
- the index, the name of the index is displayed along with the index labels.



In [None]:
ser1.name = "Currency"
ser1.index.name="Country"

In [None]:
ser1

In [None]:
ser1.index

## Pandas DataFrame

A Pandas DataFrame:
- is a tabular, spreadsheet-like data structure
- contains an ordered list of columns, and each can have a different data type
- has indexes or labels for both rows and columns

Let’s explore it through examples.


### Creating a Pandas DataFrame

There are various ways to create a DataFrame using the DataFrame() function, but one of the most common is to pass a dictionary of equal-length lists or NumPy Arrays. In this case, the dictionary keys become the column labels.

The following code snippet will highlight this behaviour:

In [None]:
data = {
    'state':['WA', 'SA', 'VIC', 'NSW', 'ACT', 'QLD', 'NT'],
    'pop': [1,1,2.5,2.7,0.5,1.5,0.4],
    'TZ': ['GMT+8', 'GMT+9.30','GMT+10', 'GMT+10', 'GMT+10', 'GMT+10', 'GMT+9.30']
}

In [None]:
df_states = DataFrame(data)
df_states

Later in the course, we will see how to populate DataFrames using CSV files and databases.
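As a small preview, and only as a sketch, the snippet below reads CSV-formatted text with `pd.read_csv`. To keep it self-contained, the CSV content is an illustrative string wrapped in `io.StringIO` rather than a real file on disk.

```python
import io
import pandas as pd

# Illustrative CSV content; in practice this would be a file path
csv_text = "state,pop,TZ\nWA,1.0,GMT+8\nVIC,2.5,GMT+10\n"

df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

The first line of the text is used as the column labels, and the rows receive the default positional index.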


### Specifying the sequence of columns

We can also specify the sequence or order of the columns for DataFrame, as a result of which data will be arranged in the column order specified.

See the code snippet below for a demonstration:

In [None]:
df_states=DataFrame(data, columns=['state','TZ','pop'])
df_states

If a column name is specified but no data is supplied for that column, the DataFrame fills that column with missing values (NaN).

In [None]:
#If a Column Name is specified, but no data is supplied
#then column will contain Null Values
df_states1=DataFrame(data, columns=['state','TZ','pop','GDP'])
df_states1

### Specifying the row labels

As we have seen in the case of Series, we can specify explicit indexes for the rows of a DataFrame. This is done by passing the index parameter to the DataFrame function. The index parameter takes a list (or other Array-like collection) of labels, one per row.

Note that the index parameter does not accept the name of a column in the supplied data. To use an existing column as the index, call the set_index() method on the DataFrame after creating it.

See the code example below for the demonstration:

In [None]:
#"Specifying the Row Labels, or Indexes"
data = {
    'state':['Western Australia', 'South Australia', 'Victoria', 'New South Wales', 'Australian Capital Territory', 'Queensland', 'Northern Territory'],
    'pop': [1,1,2.5,2.7,0.5,1.5,0.4],
    'TZ': ['GMT+8', 'GMT+9.30','GMT+10', 'GMT+10', 'GMT+10', 'GMT+10', 'GMT+9.30']
}

row_labels = ['WA', 'SA','VIC','NSW','ACT','QLD','NT']
df_states_lbl = DataFrame(data, columns=['state','TZ','pop'], index=row_labels)
df_states_lbl
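When the row labels already exist as a column in the data, the `set_index()` method can promote that column to the index. This is a minimal sketch with an illustrative two-row DataFrame; the column name 'code' is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'code': ['WA', 'SA'],
    'pop': [1.0, 1.0],
    'TZ': ['GMT+8', 'GMT+9.30'],
})

# set_index returns a new DataFrame; df itself is left unchanged
df_by_code = df.set_index('code')
print(df_by_code)
```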

### Retrieving Columns from DataFrame

Individual columns in DataFrame are stored as Pandas Series. We can access the data in the individual columns by either:
- using index Notation and passing the column name as the index
- using the df.Attribute notation, where the column name itself is the name of the attribute.

While retrieving the individual columns, the index of DataFrame is retained in the retrieved Series.

The following code snippet showcases this behaviour:



In [None]:
df_states_lbl

In [None]:
series_states = df_states_lbl['state']
series_states

In [None]:
series_states = df_states_lbl.state
series_states
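The attribute notation has a catch worth knowing: it only works when the column name does not clash with an existing DataFrame attribute or method. The column name 'pop' used in this notebook is such a case, because DataFrame has a `pop()` method. The sketch below uses a small illustrative DataFrame to show the difference.

```python
import pandas as pd

df = pd.DataFrame({'state': ['WA', 'VIC'], 'pop': [1.0, 2.5]},
                  index=['WA', 'VIC'])

col_by_key = df['pop']        # index notation: always the column
attr = df.pop                 # attribute notation: the DataFrame.pop method
is_method = callable(attr)

print(type(col_by_key))
print(is_method)
```

For this reason, index notation is the safer choice whenever a column name might collide with a method name or contains spaces.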

### Retrieving Rows from DataFrame

Individual rows can also be accessed from the DataFrame, by using the loc indexer and passing the row label in square brackets.


In [None]:
df_states_lbl

In [None]:
wa_data = df_states_lbl.loc['WA']
wa_data

In case we don’t want to use the explicit label or row index, we can also use the default positional index for the rows in the DataFrame. For such scenarios, there is the iloc indexer, which is also used with square brackets and accesses a row by its position.

In [None]:
wa_data1 = df_states_lbl.iloc[0]
wa_data1

In [None]:
sa_data1 = df_states_lbl.iloc[1]
sa_data1
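Both indexers also accept lists, slices, and a second entry for the column. The following sketch uses a small illustrative DataFrame (three rows, made up for this example) to show a few common combinations.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Western Australia', 'South Australia', 'Victoria'],
                   'pop': [1.0, 1.0, 2.5]},
                  index=['WA', 'SA', 'VIC'])

two_rows = df.loc[['WA', 'VIC']]   # several rows by label
one_cell = df.loc['VIC', 'pop']    # row label and column label
first_two = df.iloc[0:2]           # rows by position, endpoint excluded
same_cell = df.iloc[2, 1]          # row position and column position

print(two_rows)
print(one_cell, same_cell)
```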

### Changing data in a particular column

The values in a particular column can be changed by assigning either a scalar value or a list of values (whose length equals the number of rows in the DataFrame).

The following code snippet shows this behaviour:


In [None]:
data = {
    'state':['Western Australia', 'South Australia', 'Victoria', 'New South Wales', 'Australian Capital Territory', 'Queensland', 'Northern Territory'],
    'pop': [1,1,2.5,2.7,0.5,1.5,0.4],
    'TZ': ['GMT+8', 'GMT+9.30','GMT+10', 'GMT+10', 'GMT+10', 'GMT+10', 'GMT+9.30']
}

row_labels = ['WA', 'SA','VIC','NSW','ACT','QLD','NT']

df_states = DataFrame(data, columns=['state','TZ','pop','GDP'], index=row_labels)
df_states

In [None]:
#Access GDP column and assign a constant value
df_states['GDP'] = 16
df_states

In the example below we are passing the list of values instead of some constant value:

In [None]:
gdp_data = [11,8,20,22,15,18,8]
df_states

In [None]:
df_states['GDP'] = gdp_data
df_states

### Adding a column to DataFrame

Using the same convention, if we specify a column name that doesn’t exist in DataFrame and pass the list of values or a scalar, it results in the addition of a new column to the DataFrame.

See the below example for a demonstration:



In [None]:
df_states

In [None]:
df_states['area'] = "TBD"
df_states
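A new column can also be calculated from existing columns, since arithmetic between columns works row by row. This is a minimal sketch with illustrative numbers; the column name 'GDP_per_pop' is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.0, 2.5], 'GDP': [11, 20]}, index=['WA', 'VIC'])

# Element-wise division of two columns, assigned to a new column
df['GDP_per_pop'] = df['GDP'] / df['pop']
print(df)
```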

So far, we have seen some of the basic operations related to the Series and DataFrame. In the subsequent sections, let’s explore some of the essential functionalities for Pandas, as these will form the foundation of the data wrangling activities for any data analytics projects that you would be involved in.

## Pandas: Essential operations

Similar to NumPy, Pandas also has some essential operations that you will need when interacting with the data stored in Series and DataFrame.

### Reindexing

An important operation on Pandas data structures is reindexing, which creates a new object with the data rearranged to conform to a new index. 

If the new index contains labels that are not present in the original data, missing values (NaN) are added for those labels.



In [None]:
a = Series(np.random.randn(10), index=['a','b','c','d','e','f','g','h','i','j'])
a

In [None]:
new_index = ['a','A1','b','B1','c','C1','d','e','f','g','h','i','j']

In [None]:
a_new = a.reindex(new_index)
a_new

Imagine a situation where you are processing employee records. However, many of the employees have supplied incomplete information. You need a way to handle these cases and highlight the gaps to follow up with them. Perhaps you could insert ‘Unknown’ into all the empty fields, to make the missing values easy to identify.

There are various ways the missing values can be handled during reindexing. 

We can:
- either specify a particular value to be filled, by adding the parameter <code>fill_value=value</code> to the reindex method

- or choose one of the pre-defined filling strategies by passing the parameter <code>method=name</code>, for example <code>'ffill'</code> (forward fill), <code>'bfill'</code> (backward fill) or <code>'nearest'</code>. This is especially useful for ordered data such as time series, and it requires the original index to be sorted.

These two methods are demonstrated by the code snippet below:


__Specifying the 'fill_value' parameter__

In [None]:
a_fillvalue = a.reindex(new_index, fill_value=0)
a_fillvalue
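The fill value does not have to be a number. As a sketch of the employee-records scenario described above (the employee IDs and departments are hypothetical), the snippet below reindexes against the full list of IDs and marks the gaps with 'Unknown'.

```python
import pandas as pd

# Departments supplied by only some of the employees (hypothetical data)
supplied = pd.Series({'E01': 'Finance', 'E03': 'Sales'})
all_ids = ['E01', 'E02', 'E03', 'E04']

# Labels missing from the original Series receive the fill value
departments = supplied.reindex(all_ids, fill_value='Unknown')
print(departments)
```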

__Specifying 'method' parameter__


In [None]:
a = Series(np.random.randn(10), index=[0,2,4,6,8,10,12,14,16,18])
a

In [None]:
## Reindex so that indexes 1,3,5... are introduced in the series
a_new = a.reindex(range(20))
a_new

In [None]:
## Perform the same reindex, but forward fill the missing values
a_ffill = a.reindex(range(20), method='ffill')
a_ffill

Observe index 1, 3, and 5: values have been populated from the previous index.
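For comparison, here is a minimal sketch of forward fill and backward fill side by side, using a short Series with fixed values so that the result is easy to follow.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

forward = s.reindex(range(6), method='ffill')    # copy the previous value
backward = s.reindex(range(6), method='bfill')   # copy the next value

print(forward)
print(backward)
```

With backward fill, the last new index (5) has no later value to copy from, so it stays as NaN.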

For the complete list of parameters of the reindex method, refer to the documentation available at the following links:<br>
Read: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html<br>
Read: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html


### Dropping entries from axis

Often, we need to delete data from a Pandas Series or DataFrame. You can do this using the drop method, which is available for both Series and DataFrame. This method accepts the index label, or a list of index labels, to be dropped.

This method creates a new object containing only the remaining values. Note that, by default, the drop is not performed in place (ie, the original Pandas Series or DataFrame is preserved and still available after the drop operation). In practical terms, the method creates a selective copy of the data. 


__Drop single index__

In [None]:
b = Series(np.arange(10), index=['a','b','c','d','e','f','g','h','i','j'])
b

In [None]:
#Dropping index b
new_series = b.drop('b')
new_series

__Drop multiple indexes__

In [None]:
# Dropping multiple indexes,
# e.g., a, g, j
new_series_1 = b.drop(['a','g','j'])
new_series_1

In the case of DataFrame, we can drop entries from either axis: row labels (by using the index parameter, or the default axis=0) and column names (by using the columns parameter, or axis=1). 

__Removing a row from DataFrame__<br>The following code snippets demonstrate this behaviour:

In [None]:
#Check DataFrame
df_states

In [None]:
# Let's drop NT row, and check the newly created DataFrame
df_states_noNT = df_states.drop('NT')
df_states_noNT

In [None]:
#Check the original DataFrame: the row still exists in it
df_states

__Removing multiple columns from DataFrame, by passing a sequence of column names and axis = 1__
<br>See the code example below for how to drop columns from a Pandas DataFrame:

In [None]:
#Check DataFrame
df_states

In [None]:
#Remove columns 'state' and 'area'
df1 = df_states.drop(['state','area'], axis=1)
df1

Observe how the original DataFrame is preserved: whenever we use the drop() method as shown above, a new DataFrame is created.

In [None]:
#Check Original DataFrame
df_states
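Rows and columns can also be dropped in a single call by using the `index` and `columns` keyword parameters instead of `axis`. The sketch below uses a small illustrative DataFrame with placeholder values.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Western Australia', 'Victoria', 'Northern Territory'],
                   'pop': [1.0, 2.5, 0.4],
                   'area': ['TBD', 'TBD', 'TBD']},
                  index=['WA', 'VIC', 'NT'])

# Drop one row label and one column in the same call
trimmed = df.drop(index='NT', columns=['area'])

print(trimmed)
print(df.shape)   # the original DataFrame is unchanged
```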

### Indexing, Selection and Filtering

We have already seen various examples of indexing being used. Let’s explore a little more about indexing features available to Pandas.

#### Pandas Series

Indexing for Pandas Series works similarly to NumPy ndarrays, with one additional feature: along with the implicit positional index, we can also use the labelled index of the Series.

The following are some examples of this behaviour:



In [None]:
#Create a new Series
ob1 = Series(np.random.randn(10), index=['a','b','c','d','e','f','g','h','i','j'])
ob1

In [None]:
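#Access using the positional index: position 0 picks the value
#corresponding to index label a (.iloc makes the positional lookup explicit,
#as plain ob1[0] on a labelled Series is deprecated in recent Pandas versions)
ob1.iloc[0], ob1['a']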

Slicing with labels works a little differently from normal slicing: with labels, both endpoints are inclusive, whereas in normal positional slicing the endpoint is not included.

In [None]:
#Create a new Series
ob1 = Series(np.random.randn(10), index=['a','b','c','d','e','f','g','h','i','j'])
ob1

In [None]:
#Check that position 5 and index label f return the same value
ob1.iloc[5], ob1['f']

In [None]:
#Using slicing with the default index, the end point is not included,
#whereas slicing with explicit labels includes the end point
ob1[0:5]

In [None]:
ob1['a':'f']
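The difference between the two kinds of slicing is easier to see with fixed values. This is a minimal sketch that uses `.iloc` for the positional slice and `.loc` for the label slice; the Series `s` is created only for this example.

```python
import pandas as pd

s = pd.Series(range(10), index=list('abcdefghij'))

positional = s.iloc[0:5]    # positions 0 to 4, endpoint excluded
labelled = s.loc['a':'f']   # labels a to f, endpoint included

print(len(positional), len(labelled))
```

The positional slice stops before position 5, while the label slice includes 'f'.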

#### Pandas DataFrame

As we have already seen, we use indexing to retrieve a particular subset of data along the row and column axes of a DataFrame, by passing either a single label or a sequence of labels.

The following examples demonstrate these features again:


In [None]:
#Check DataFrame
df_states

In [None]:
#Select state column
df_states['state']

In [None]:
#Select multiple columns state and TZ
df_states[['state', 'TZ']]

This mechanism of supplying indexes lets us select data in a variety of ways:

__By passing row slices:__


In [None]:
df_states[:2]

__By passing a Boolean Array (filter array)__:

In [None]:
df_states[df_states['GDP']>8]
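Several conditions can be combined into one filter with `&` (and) or `|` (or), with each condition wrapped in parentheses. The sketch below rebuilds a small DataFrame using the same 'pop' and 'GDP' placeholder values as earlier in this notebook, and uses `.loc` to pick columns at the same time.

```python
import pandas as pd

df = pd.DataFrame({'pop': [1, 1, 2.5, 2.7, 0.5, 1.5, 0.4],
                   'GDP': [11, 8, 20, 22, 15, 18, 8]},
                  index=['WA', 'SA', 'VIC', 'NSW', 'ACT', 'QLD', 'NT'])

# Rows where GDP is above 8 AND pop is below 2
mask = (df['GDP'] > 8) & (df['pop'] < 2)
selected = df.loc[mask, ['GDP']]
print(selected)
```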

### Data alignment

One of the interesting features of Pandas operations is data alignment. We have already seen some analogous behaviours. 

For example: 
- The creation of DataFrame (ie, if data is not present for one of the specified columns, then missing values are filled in automatically as NaN).

- Reindexing, if the data doesn’t exist for the supplied index. By default, NaN is filled for those indexes, and we additionally have the option to pass a fill value or a fill method to the reindex method.

Along the same lines, if we do mathematical operations between Pandas objects with different indexes, Pandas aligns the data on the index labels when building the resulting Pandas object.



In [None]:
#Create 3X3 dataframe with the numbers 0 to 8,
#Columns = a,b,c
#Index = 'SA', 'VIC', 'NSW'
df1 = DataFrame(np.arange(9).reshape(3,3), columns=['a','b','c'], index=['SA', 'VIC', 'NSW'])
df1

In [None]:
#Create 4X3 dataframe with the numbers 0 to 11,
#Columns = a,b,e
#Index = 'SA', 'VIC', 'NSW', 'ACT'
df2 = DataFrame(np.arange(12).reshape(4,3), columns=['a','b','e'], index=['SA', 'VIC', 'NSW', 'ACT'])
df2

In the case of addition, if the indexes of the two objects are not the same, the resultant Pandas object will have an index that is the union of both original indexes, and values with no match in the other object will be filled as NaN.

In [None]:
#Observe data alignment when we add the two dataframes:
#missing values are filled as NaN 
#-> Column c and column e are NaN, as each is missing from one of the dataframes
#-> For the ACT row, columns a and b are also NaN (as df1 has no ACT row)
#-> The row and column labels of the result are arranged in sorted order
df1+df2

The arithmetic methods, such as .add(), also accept a fill_value parameter. It is substituted for a value that is missing from one of the two objects during alignment, so that the operation can still produce a result.

In [None]:
#Adding with the .add() method and passing the fill_value parameter
df1.add(df2, fill_value=0)
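One detail is easy to miss: the fill value only helps where a value is missing from one of the two objects. Where it is missing from both, the result is still NaN. The sketch below rebuilds the same two DataFrames so that it can run on its own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'],
                   index=['SA', 'VIC', 'NSW'])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'e'],
                   index=['SA', 'VIC', 'NSW', 'ACT'])

plain = df1 + df2                     # unmatched positions become NaN
filled = df1.add(df2, fill_value=0)   # a value missing on one side is treated as 0

print(filled)
```

In `filled`, the cell for row ACT and column c remains NaN, because df1 has no ACT row and df2 has no column c.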

### Mapping

Often, we want to change or manipulate the values in a particular row or column by applying a function to the selected values. For example, think of a data set that captures information about a large collection of products (represented as columns in the data set). These products update to a new version every year. You need a way to update all the version numbers quickly and easily.

This process is known as mapping, and we do this by using the .apply() method, which takes the following parameters:
- a function (often a lambda function), to specify what kind of transformation needs to be applied
- for DataFrame.apply(), an axis parameter, which defaults to 0 so that the function is applied to each column; pass axis=1 to apply it to each row. Series.apply(), used below, applies the function to each value and has no axis parameter.

The following code snippets demonstrate this behaviour:


In [None]:
#Check DataFrame
df_states

In [None]:
#Create a Lambda function to convert string to upper case
f = lambda x:x.upper()

In [None]:
#Apply Function Mapping to state column and update DataFrame
df_states['state'] = df_states['state'].apply(f)

In [None]:
df_states

Note that there are other methods available for doing column-wise transformations, and we will cover some of those in detail during the data wrangling sections of this course.
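To illustrate the axis parameter of DataFrame.apply(), here is a minimal sketch based on the product-version scenario described above. The product names, region labels and version numbers are hypothetical.

```python
import pandas as pd

# Version numbers per product (columns) and region (rows); hypothetical data
versions = pd.DataFrame({'editor': [3, 4], 'viewer': [7, 8]},
                        index=['EU', 'US'])

# axis=0 (default): the function receives each column as a Series
bumped = versions.apply(lambda col: col + 1)

# axis=1: the function receives each row as a Series
highest_per_row = versions.apply(lambda row: row.max(), axis=1)

print(bumped)
print(highest_per_row)
```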

### Sorting

Many kinds of data need to be sorted to be meaningful and useful. Think of a video on-demand streaming service that needs to know which TV series in its catalogue are the most popular, so that it can decide which of them to renew for another season. The series titles need to be sorted by the most watched. Sorting is also one of the important operations that we perform on the data in Pandas. 

#### Sorting the indexes / labels

To sort lexicographically (ie, the dictionary order) by row or column index, we use the sort_index() method. See below a demonstration of sorting the indexes.

It should be noted that this method returns a new object, which is sorted based on the criteria specified:

Original DataFrame

In [None]:
df_states

DataFrame sorted by row index (lexicographically)

In [None]:
df_states.sort_index()

DataFrame sorted by columns (lexicographically)

In [None]:
df_states.sort_index(axis=1)

#### Sorting by values

Instead of indexes and labels, we can also sort the data by the actual values in the columns.

For this purpose, there is another method, sort_values(), that can be used. This method sorts on the basis of values instead of labels.

See the code snippet below for an example, where we arrange the rows by the values in the GDP column:


In [None]:
df_states

In [None]:
#Sort Data by GDP Column
df_states.sort_values('GDP')
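By default, sort_values() sorts in ascending order. As a sketch of the streaming-service scenario mentioned earlier (the titles and view counts are hypothetical), passing `ascending=False` puts the most watched series first.

```python
import pandas as pd

shows = pd.DataFrame({'title': ['Show A', 'Show B', 'Show C'],
                      'views': [120, 450, 300]})

# Largest number of views first; a new, sorted DataFrame is returned
ranked = shows.sort_values('views', ascending=False)
print(ranked)
```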

## Pandas: Mathematical and Statistical Methods

Think back to Module 1 and the basic statistical methods that you learnt about. As you can imagine, it would be very laborious to calculate statistics on large datasets by hand. Fortunately, there are various mathematical and statistical methods in Pandas to automate some of the calculations. Available methods include summation, finding maximum and minimum values, and generating descriptive statistics.  

Let’s explore these methods with some code examples:


### sum() Method

Summation is a common step in calculating statistics such as the mean and standard deviation. By default, this method calculates the summation on a per-column basis, and returns the result as a Pandas Series.

In [None]:
new_df = DataFrame(np.random.rand(24).reshape((4,6)), 
                   index=['r1','r2','r3','r4'], 
                   columns=['c1','c2','c3','c4','c5','c6'])
new_df

In [None]:
#Get summation for all the columns
new_df.sum()

In [None]:
#Get summation for all the rows, along column axis
new_df.sum(axis=1)
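To connect summation with the mean, here is a minimal sketch that divides the column sums by the column counts and compares the result with the built-in `mean()` method. The DataFrame uses small fixed numbers, chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1.0, 2.0, 3.0, 6.0],
                   'c2': [2.0, 4.0, 6.0, 8.0]})

# Mean per column = sum of the column / number of values in the column
manual_mean = df.sum() / df.count()

print(manual_mean)
print(df.mean())
```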

### min(), max() methods

As the names suggest, these methods return the minimum and maximum values respectively, by default for every column. These methods form the basis for other statistics such as finding the range of a set of values.

In [None]:
##Check the DataFrame
new_df

In [None]:
## Minimum for every column, across the rows axis
new_df.min()

In [None]:
##Minimum across the columns, for every row
new_df.min(axis=1)
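The range mentioned above can be obtained by subtracting the two results, because both are Series with the same labels. This is a minimal sketch with small fixed numbers, chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1.0, 2.0, 3.0, 6.0],
                   'c2': [2.0, 4.0, 6.0, 8.0]})

# Range per column = maximum - minimum
value_range = df.max() - df.min()
print(value_range)
```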

### describe() Method

The describe method provides summary statistics, such as the count, mean, standard deviation, minimum, quartiles and maximum, for the numeric data columns available in the DataFrame.

See the code snippet below for an example:

In [None]:
##Check the original DataFrame
new_df

In [None]:
##Check the summary statistics of the DataFrame
new_df.describe()

For an exhaustive list of methods, refer to the Pandas API Documentation.<br>
Refer to: https://pandas.pydata.org/pandas-docs/stable/reference/index.html 

We will also explore some additional methods by way of demonstration throughout the course.