# Data Manipulation and Analysis with Pandas
By Eli Yi-Liang Tung 

Department of Analytics and Operations, Business School, NUS

### Learning Objectives
1. What is the Pandas package?
2. Import data into Python
3. Basic components of DataFrame and Series
4. Basic data manipulation
    * Access data
    * Filter/subset data

## Import your data into Python 
The first step in a data analytics project is importing your raw data into Python. Before doing that, let's find the current working directory using the `os` package.

In [None]:
import os
cwd = os.getcwd()
print(cwd) 

After finding your working directory, **move your raw data file into it so that Python can locate the file by name alone; otherwise you must pass the full path to the file.**
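As a quick check before importing, the sketch below builds the full path to a data file and tests whether it exists. The file name is the one used in this lesson; whether it is present on your machine is an assumption, so the code handles both cases.

```python
import os

# File name taken from this lesson; it may or may not exist on your machine
file_name = "ChallengerSales.csv"

# Join the working directory and the file name into one absolute path
file_path = os.path.join(os.getcwd(), file_name)

if os.path.exists(file_path):
    print(f"Found {file_name} in the working directory.")
else:
    print(f"{file_name} is not in {os.getcwd()}; move it there or use a full path.")
```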

## Brief introduction to `Pandas` package
- Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- Developed by Wes McKinney since 2008.
- The package’s name derives from <i>`panel data`</i>, a common term for multidimensional data sets encountered in statistics and econometrics.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

Pandas provides many functions for importing external data sources into Python. Here we only need to work with a `.csv` file, so let's use the `read_csv()` function from Pandas.

Let's import `ChallengerSales.csv` into Python and do basic data manipulation to understand the sales data.

In [None]:
df = pd.read_csv("ChallengerSales.csv")
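If the data file is not at hand, `read_csv()` can be tried on a small in-memory text instead. The sketch below uses made-up values and only three of the lesson's column names; it is an illustration, not the real sales data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file (values are illustrative only)
csv_text = """Region,Gender,TotalCost
West,Female,120.5
East,Male,80.0
West,Male,200.0
"""

# read_csv accepts a file path or any file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)
print(toy.head(2))
```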

In [None]:
df

In [None]:
df.head(5)    # head() method can show several rows of the data imported for checking purposes. 

In [None]:
df.tail(5)    # tail() method can show last several rows of the data imported 

To check all the data types in the df DataFrame, you can use **dtypes**, an attribute of df.

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values    #Result is a 2-D Numpy array

In [None]:
type(df)

Thus, in the df DataFrame, there are 400 rows and 10 columns.
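The row and column counts can also be read programmatically by unpacking `shape`. A minimal sketch on a toy DataFrame (made-up values, column names borrowed from the lesson):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Region": ["West", "East", "West"],
    "ItemsOrdered": [2, 1, 4],
    "TotalCost": [120.5, 80.0, 200.0],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(f"{n_rows} rows and {n_cols} columns")
print(list(toy.columns))   # column names
print(toy.index)           # row index labels
```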

## Main Data Structures in Pandas
There are two main data structures designed by Pandas. 
- **Series**
    * 1-D array of labelled data 
    * A Series can be viewed as a 1-D Numpy ndarray combined with row index labels 
    * There is no columns attribute for a Series.

- **DataFrame**
    * A labelled 2-D array of data
    * Each column is a Series sharing common row labels
    * Two key components to label each data point: column name and index label

Let's discuss DataFrame first.

<img src="DataFrame.jpg" alt="Pandas Data Frame">

- The term index refers to all the index labels of the DataFrame
- The term columns refers to all the column names of the DataFrame  
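To make the relationship between the two structures concrete, the sketch below assembles a DataFrame from two Series that share the same row labels. The labels and values are made up for illustration.

```python
import pandas as pd

labels = ["a", "b", "c"]  # made-up row index labels
region = pd.Series(["West", "East", "West"], index=labels, name="Region")
cost = pd.Series([120.5, 80.0, 200.0], index=labels, name="TotalCost")

# Each Series becomes one column; the Series names become the column names
toy = pd.concat([region, cost], axis=1)
print(toy)

# Selecting a column gives back a Series that shares the DataFrame's index
print(type(toy["Region"]))
print(toy["Region"].index.equals(toy.index))
```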

## Main Data Types in Pandas

The table below is from Pandas Cookbook by Theodore Petrou.

<img src="DataTypes.jpg" alt="Pandas Data Types">

Notes on data types:
- Each DataFrame column must hold exactly one data type. 
- By default, Pandas stores numeric data (integers and floats) in 64 bits.
- A column of the object data type usually holds strings, but it can hold any Python object, so it may also contain mixed types. 
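A small sketch of these notes, using a toy DataFrame with made-up column names. It shows the 64-bit numeric defaults and that an object column may hold values of different Python types.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "ints": [1, 2, 3],
    "floats": [1.0, 2.5, 3.0],
    "mixed": ["text", 42, 3.14],  # str, int and float in one column
})

print(toy.dtypes)

# Inspect the Python type of each element in the object column
print(toy["mixed"].map(type).tolist())
```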

Second, we discuss Series.

In [None]:
# Create a Series by selecting a column from df DataFrame 
total_cost = df['TotalCost']
type(total_cost)

In [None]:
total_cost

In [None]:
total_cost.columns    # Raises AttributeError: a Series has no .columns attribute 

In [None]:
total_cost.index

In [None]:
total_cost.values    # Result is a 1-D Numpy array

## Basic data manipulation using Pandas
To learn Pandas, the first task is to access the data stored in a DataFrame. As with other Python containers, we can apply indexing and slicing to get the data we need. However, remember that Pandas Series and DataFrames hold labelled data, so Pandas provides specific methods that use those labels for data manipulation.
### Get data elements using `Indexing Operator`
Our focus is on working with DataFrame, a labelled 2-D array of data. For DataFrame, each column, in fact, is a Pandas Series and all columns share common row index labels.
#### Access the whole column: 

In [None]:
# You can apply indexing. However the first set of square brackets refers to column name 
# Recall that you can view column name as a dictionary's key

region = df['Region']    # Access the whole Region column of the df DataFrame
region

In [None]:
type(region)

In [None]:
df[['Region','Day']]

In [None]:
type(df[['Region','Day']])

#### Access a specific element:
Similar to a Dictionary object, you need to identify column name (key) first and use the second set of square brackets to find out the location of the element.

In [None]:
# The transaction record in row 6 and "Region" column
# Recall that Python uses zero-based indexing
print(df['Region'][5])

#### Access multiple columns at the same time

In [None]:
# Create a string list including the names of the columns needed 
cols_need = ['Gender', 'Member', 'Region', 'TotalCost']
df_short = df[cols_need]
df_short.head(4)

#### Mixed with slicing operator

In [None]:
# Combine slicing to keep every second row, starting from the first (row index 0, 2, 4, ...)
df_short2 = df[cols_need][::2]   # start:end:step_size, note that "end" is exclusive
df_short2.shape

In [None]:
df_short2.head(5)
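To see exactly which rows a step of 2 keeps, the sketch below applies the same pattern to a five-row toy DataFrame (made-up values) and prints the surviving row index labels.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Region": ["West", "East", "South", "Central", "West"],
    "TotalCost": [10, 20, 30, 40, 50],
})

# Select two columns, then keep every second row starting at position 0
every_other = toy[["Region", "TotalCost"]][::2]
print(every_other.index.tolist())
print(every_other.shape)
```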

### Get data elements using `Indexer Attribute`
As mentioned, DataFrame is a labelled data container. To deal with labels, Pandas also provides indexers, one of the attributes of a DataFrame. There are two types of indexers defined in Pandas: `.loc` and `.iloc` indexers. 
* *.loc indexer*, which uses row index labels and column names to identify data elements.

1. To use the `.loc` indexer, you must input the `exact` row and column labels. Otherwise, you will encounter errors. 
2. Only one set of square brackets [] is needed.
3. When you apply the `.loc` indexer, please do remember that the first argument is `the row index`, followed by `the column name`, and you use a comma to separate them. 

In [None]:
# Be careful that we need to specify row index first followed by column name
print(df.loc[5,'Region'])

In [None]:
# If you do not input the column name, the whole row will be selected by default
row_6 = df.loc[5]    # Get row 6 (the corresponding row index is 5)
print(row_6)

In [None]:
# To get the whole "Region" column
# Shorthand for selecting "all" is a colon mark. Here we want to select all rows
df_demo1= df.loc[:,'Region']
print(df_demo1.head())

In [None]:
cols_need = ['Gender', 'Member', 'Region', 'TotalCost']
df_demo2 = df.loc[::2, cols_need]
print(df_demo2.shape)

In [None]:
cols_need = ['Gender', 'Member', 'Region', 'TotalCost']
rows_need = [0, 5, 8, 100]
df_demo3 = df.loc[rows_need, cols_need]
print(df_demo3)

In [None]:
# Input a wrong column name. You must provide the “exact” row index and column name
print(df.loc[5,'region'])
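The failure above is a `KeyError`. A hedged sketch on toy data (made-up values) shows how to catch it and how to check a label before using it:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Region": ["West", "East"], "TotalCost": [120.5, 80.0]})

print(toy.loc[1, "Region"])      # exact label: works

try:
    toy.loc[1, "region"]         # lowercase "r": no such column label
    lookup_failed = False
except KeyError as err:
    lookup_failed = True
    print(f"KeyError: {err}")

# Labels are case-sensitive, so membership can be checked beforehand
print("Region" in toy.columns, "region" in toy.columns)
```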

* *.iloc indexer*, which uses positional indices to identify elements.

When you apply the `.iloc` indexer, please do remember that the first argument is the row position, followed by the column position.

In [None]:
# Again do not forget zero-based indexing
print(df.iloc[5, 3])   # row 6 and column 4

In [None]:
len(df.columns)

In [None]:
# Only 10 columns in the df DataFrame, so valid column indices run from 0 to 9 and index 10 raises an IndexError
print(df.iloc[5, 10])

In [None]:
# You cannot mix input styles
# With the .iloc indexer, all inputs must be positional indices, so the column names below raise an error
cols_need = ['Gender', 'Member', 'Region', 'TotalCost']
df_redu = df.iloc[:, cols_need]

In [None]:
cols_need = ['Gender', 'Member', 'Region', 'TotalCost']
cols_index = [3, -2, -1, -4]
df_redu = df.iloc[:, cols_index]
print(df_redu.head())
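Rather than counting column positions by hand, the positions can be looked up from the names. The sketch below uses `Index.get_indexer()` on a toy DataFrame; the values are made up and the approach is one possible option, not the lesson's own method.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Member": ["Yes", "No"],
    "Region": ["West", "East"],
    "TotalCost": [120.5, 80.0],
})

cols_need = ["Region", "TotalCost"]

# Translate column names into positional indices
cols_pos = toy.columns.get_indexer(cols_need)
print(cols_pos)

# The positions can now be passed to .iloc
subset = toy.iloc[:, cols_pos]
print(subset)
```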

### A Key Difference between .loc and .iloc when using slicing operators

When using `.loc` indexer, the slicing operator behaves differently from the standard slicing operator. **The ending label is inclusive.** 

In [None]:
df.iloc[0:5, 2:6]

In [None]:
df.loc[0:5,'Time':'TotalCost']
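The difference is easy to verify by comparing shapes. In this sketch the toy DataFrame has made-up column names `A`, `B`, `C` and the default integer row labels:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"A": range(6), "B": range(10, 16), "C": range(20, 26)})

# .iloc: positions 0, 1, 2 (end is exclusive) and columns at positions 0, 1
by_position = toy.iloc[0:3, 0:2]

# .loc: labels 0, 1, 2, 3 (end is inclusive) and columns "A" through "B"
by_label = toy.loc[0:3, "A":"B"]

print(by_position.shape)
print(by_label.shape)
```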

### Modify columns in the DataFrame
You can add new columns into an existing DataFrame.

#### Add a new column
The syntax is very similar to the way we add a key to a dictionary. 

Moreover, the Pandas package is built on top of the well-known Numpy package. Thus, Pandas also supports element-wise operations. 

In [None]:
df['AvecostItem'] = 0    # Add a new column with column name "AvecostItem" and initialize values to be zero
print(df.head(5))

In [None]:
# Overwrite 'AvecostItem' with the total cost of a transaction divided by the number of items ordered in that transaction
df['AvecostItem'] = df['TotalCost']/df['ItemsOrdered']

In [None]:
df.head(5)
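The same element-wise division can be checked by hand on a tiny example. The numbers below are made up so that the per-item cost is easy to verify:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "TotalCost": [100.0, 90.0, 50.0],
    "ItemsOrdered": [4, 3, 1],
})

# Division is applied row by row: 100/4, 90/3, 50/1
toy["AvecostItem"] = toy["TotalCost"] / toy["ItemsOrdered"]
print(toy)
```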

## Filtering via Boolean Indexing
- The most useful data filtering technique in Pandas is Boolean indexing.
- Boolean indexing refers to selecting rows of a DataFrame by providing a Boolean value for each row.
- These Boolean values are stored in a Series or a Numpy’s ndarray and are created by applying logical/comparison operators.   

#### <i>Question:</i> How many transactions come from the West region?

- <i>Step 1:</i> Create a Boolean Series to satisfy the `filtering criterion/condition`. We call this Boolean Series `a Filter`.
- <i>Step 2:</i> Use the filter and apply indexing operator or indexers to subset the original data set.

In [None]:
# Step 1: Create the filter, Boolean Series
west_filter = df['Region'] == 'West'  # We call this Boolean Series as a filter
print(west_filter.head(10),"\n")
print(df['Region'].head(10))

In [None]:
# Step 2(Method 1): Subset your data by using indexing operator
west_df_1 = df[west_filter]
print(west_df_1.shape, "\n")
print(west_df_1.head())

In [None]:
# Step 2(Method 2): Subset your data by using .loc indexer
west_df_2 = df.loc[west_filter, :]
print(west_df_2.shape, "\n")
print(west_df_2.head())

In [None]:
# Step 2(Method 3): Try the .iloc indexer. This raises an error because .iloc does not accept a Boolean Series
west_df_3 = df.iloc[west_filter, :]
print(west_df_3.shape, "\n")
print(west_df_3.head())

In [None]:
west_filter.values

In [None]:
west_df_3 = df.iloc[west_filter.values, :]
print(west_df_3.head())
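The three subsetting routes can be compared side by side on toy data. This sketch uses made-up values and `to_numpy()`, which returns the same underlying Boolean array as `.values` here:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Region": ["West", "East", "West", "South"],
    "TotalCost": [120.5, 80.0, 200.0, 60.0],
})

west_filter = toy["Region"] == "West"   # Boolean Series, one value per row

via_brackets = toy[west_filter]                  # indexing operator
via_loc = toy.loc[west_filter, :]                # .loc accepts a Boolean Series
via_iloc = toy.iloc[west_filter.to_numpy(), :]   # .iloc needs a plain Boolean array

# True counts as 1, so summing the filter counts the matching rows
print(int(west_filter.sum()))
```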

In [None]:
west_df_3.describe()

In [None]:
# value_counts() counts how often each unique value appears in the Series
west_df_3["Gender"].value_counts()

In [None]:
west_df_3["Gender"].value_counts(normalize = True)
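To see what `normalize = True` changes, compare raw counts with proportions on a four-element toy Series (made-up values):

```python
import pandas as pd

# Toy data for illustration only
gender = pd.Series(["Female", "Male", "Female", "Female"], name="Gender")

counts = gender.value_counts()                 # raw counts per category
shares = gender.value_counts(normalize=True)   # counts divided by the total

print(counts)
print(shares)
```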

#### <i>Question:</i> Among transactions with a total cost larger than 150, what is the gender composition by region in your data? 

We can consider more than one filtering condition at the same time.

In [None]:
criter1 = df["TotalCost"] > 150     # Create the first filter, the result is a Boolean Series
criter2 = df["Region"] == "West"    # Create the second filter, only transactions from West will be True

print(df[["TotalCost", "Region"]].tail(5), "\n") # From df, we just select two columns of interest and show last 5 rows 
print(criter1.tail(5), "\n")                     # Check first filter and again just show last 5 rows 
print(criter2.tail(5))                           # Check second filter and show last 5 rows

In [None]:
criter1 and criter2     # Raises ValueError: "and" works only on single Boolean values, not on a Series

### Bitwise Operators (&, |,  ~)
- When combining Boolean Series in Pandas, we cannot use <i>**and**, **or** and **not**</i>. We must use bitwise operators instead, which work element by element.
- Each filtering condition must be enclosed in round brackets (), because the bitwise operators take precedence over comparison operators such as > and ==.
- There are three bitwise operators you will use in Pandas:
    1. **&**: element-wise and
    2. **|**: element-wise or
    3. **~**: element-wise not
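A minimal sketch of the three operators on toy data (made-up values), including the `ValueError` that the keyword `and` produces on two Series:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Region": ["West", "East", "West", "South"],
    "TotalCost": [120, 200, 180, 60],
})

high = toy["TotalCost"] > 150
west = toy["Region"] == "West"

try:
    high and west            # Python asks for one True/False for a whole Series
    raised = False
except ValueError:
    raised = True

both = high & west           # True where both conditions hold
either = high | west         # True where at least one condition holds
not_west = ~west             # flips every Boolean value

print(both.tolist())
print(either.tolist())
print(not_west.tolist())
```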

In [None]:
# inters_c1_c2 = criter1 & criter2 
inters_c1_c2 = (df["TotalCost"] > 150) & (df["Region"] == "West")
print(f"The number of transactions that meet the two filtering conditions is {inters_c1_c2.sum()}.", "\n")

print(df[["TotalCost", "Region"]].tail(5), "\n") # From df, we just select two columns of interest and show last 5 rows 
inters_c1_c2.tail(5)                             # check the intersection of both filters 

In [None]:
dft = df.loc[inters_c1_c2,:]                 # Subset data to include only transactions with total amount > 150, from West  
dft["Gender"].value_counts(normalize = True) # Find out gender proportions in the filtered data

# Using method chaining 
df.loc[inters_c1_c2,:]["Gender"].value_counts(normalize = True)

In [None]:
# Now, we need to find out gender proportions across different regions
print(df["Region"].value_counts())
print("\n")

# Find out unique values in the Region column
uni_region = df["Region"].unique()
print(uni_region)

In [None]:
# Method 1: Create a dictionary with 4 keys whose values are all empty lists

output = {'West': list(),
          'East': list(),
          'South': list(),
          'Central': list()}
output

In [None]:
# Method 2: Build the same dictionary of empty lists with a for loop
uni_region = df["Region"].unique()     # Find out unique values in the Region column
output = {}
for reg in uni_region:
    output[str(reg)] = []              # One empty list per region

In [None]:
criter1 = df["TotalCost"] > 150
for i in range(len(uni_region)):                # Use a for loop to loop over all possible regions 
    criter2 = (df["Region"] == uni_region[i])   # For each region, we create a region-specific filter
    joint_criter = criter1 & criter2 
    dfsubset = df.loc[joint_criter, :].copy()          # Subset data from each region
    dfs_prop = dfsubset["Gender"].value_counts(normalize = True)
    output[uni_region[i]] = dfs_prop            # Assign the computation result to a key in the output dictionary
    
result = pd.DataFrame(output)   # Make a DataFrame for presentation
result
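As an alternative to the loop, the same kind of table can be produced in one call with `pd.crosstab()`, a function this lesson does not otherwise use. The sketch below is a swapped-in technique on made-up data, not the lesson's own method.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Region": ["West", "West", "East", "East", "East", "West"],
    "Gender": ["Female", "Male", "Female", "Female", "Male", "Female"],
    "TotalCost": [200, 180, 160, 300, 170, 90],
})

# Keep only transactions with a total cost above 150
big = toy.loc[toy["TotalCost"] > 150, :]

# Rows: gender, columns: region; normalize="columns" makes each column sum to 1
result = pd.crosstab(big["Gender"], big["Region"], normalize="columns")
print(result)
```

Each column of `result` holds the gender proportions within one region, which matches the layout built by the loop.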

In [None]:
# Without brackets around each condition, & is evaluated before > and <=, so this line raises an error
criter1 = df["TotalCost"] > 100 & df["TotalCost"] <= 300
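The sketch below reproduces the failure on a toy Series (made-up values) and shows the bracketed form that works. The exact exception can depend on the column's data type, so both likely types are caught.

```python
import pandas as pd

# Toy data for illustration only
cost = pd.Series([80, 150, 300, 450], name="TotalCost")

try:
    # Parsed as: cost > (100 & cost) <= 300, which is not what we meant
    bad = cost > 100 & cost <= 300
    failed = False
except (TypeError, ValueError):
    failed = True

# With brackets, each comparison is evaluated first, then combined
good = (cost > 100) & (cost <= 300)
print(good.tolist())
```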

#### <i>Question:</i> 
Let’s define a regular transaction as one with a total cost above 50 and at most 400. We exclude single-item transactions and focus on female customers only. Do these customers prefer a particular shopping time period during the day?

In [None]:
criter1 = (df["TotalCost"] > 50) & (df["TotalCost"] <= 400) # filter 1: 50 < TotalCost <= 400
criter2 = df["ItemsOrdered"] > 1                            # filter 2: the number of items > 1
criter3 = df["Gender"] == "Female"                          # filter 3: transactions made by female customers

joint_crit = criter1 & criter2 & criter3                    # Consider 3 filtering conditions at the same time
dfQ3 = df.iloc[joint_crit.values,:]
dfQ3["Time"].value_counts(normalize = True)