# Manipulating DataFrames

As always, we first import our .csv file and assign it to df_master. We then make a copy of it, df, so that df_master keeps the original values:

In [None]:
import pandas as pd
df_master = pd.read_csv('../input/orders.csv')
df= df_master.copy()

In [None]:
df.head(10)

## Filtering DataFrames

We can select specific rows based on some criteria, using boolean operators.

For example, let's say we want to select all the rows (observations) whose order_hour_of_day is 8.
The condition would then be:

In [None]:
statement = df.order_hour_of_day == 8

In [None]:
df8= df[statement]
df8.head()

Alternatively, we could put the condition directly inside the brackets:

In [None]:
df8= df[df.order_hour_of_day == 8]
df8.head()
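
Several conditions can be combined in the same way. The sketch below uses a small made-up table (not the real orders file) and the hypothetical names `toy`, `dow1_at_8` and `eight_or_nine`; it only illustrates the `&`, `|` and `.isin()` patterns.

```python
import pandas as pd

# Made-up miniature of the orders table (values are illustrative only)
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "order_dow": [0, 1, 1, 2, 1],
    "order_hour_of_day": [8, 8, 14, 8, 9],
})

# Each condition needs its own parentheses; & means AND, | means OR
dow1_at_8 = toy[(toy.order_hour_of_day == 8) & (toy.order_dow == 1)]
eight_or_nine = toy[(toy.order_hour_of_day == 8) | (toy.order_hour_of_day == 9)]

# .isin() is a compact alternative to chaining several | conditions
eight_or_nine_isin = toy[toy.order_hour_of_day.isin([8, 9])]
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used here.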

## Change values based on criteria

We can also change values in a DataFrame based on a criterion.
In the following example we want to change the cells of order_hour_of_day that have the value 8 to 9:

In [None]:
eight_rows = df.order_hour_of_day == 8
eight_rows.head(10)

First we created a boolean Series that stores whether the condition is True or False for each row.
Then we alter the values in the column wherever the condition is True:

In [None]:
df.loc[eight_rows, 'order_hour_of_day'] = 9

In this case we used .loc to select from the DataFrame the rows where the condition is True and the column 'order_hour_of_day', and we assigned them the value 9.
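
The same pattern can be tried on a tiny made-up table. This is only a sketch; `toy`, `mask` and `changed` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_hour_of_day": [8, 9, 8, 14],
})

# Boolean Series: True where the hour is 8
mask = toy.order_hour_of_day == 8

# Rows selected by the mask, column selected by name, new value assigned
toy.loc[mask, "order_hour_of_day"] = 9

# The mask was computed before the assignment, so it still marks the changed rows
changed = int(mask.sum())
```

Only the selected cells change; the other columns and rows stay as they were.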

In [None]:
df.head()

## Drop rows with NaN

In this case we want to remove all the rows that contain a NaN in **ANY** column.
To achieve this:

In [None]:
df_clean = df.dropna(how='any')
df_clean.head()

We can also remove only the observations where **ALL** columns have NaN values, by using the argument how='all'.

In general, removing rows where all columns are NaN is a safe practice. However, removing rows where at least one (**ANY**) column has a NaN value has to be done with caution. Can you imagine a reason?
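
One possible reason can be sketched on made-up data. The example assumes that a NaN in days_since_prior_order marks a customer's first order, which mirrors how the column is used later in this notebook; the table `toy` and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: row 0 is a first order (no prior order), row 3 is completely empty
toy = pd.DataFrame({
    "order_id": [1, 2, 3, np.nan],
    "days_since_prior_order": [np.nan, 7.0, 30.0, np.nan],
})

any_dropped = toy.dropna(how="any")   # drops rows 0 and 3
all_dropped = toy.dropna(how="all")   # drops only the empty row 3

# subset= restricts the NaN check to the listed columns
subset_dropped = toy.dropna(subset=["order_id"])
```

With how='any' the first order in row 0 disappears together with the empty row, even though its NaN carries information.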

## The .apply() method

The .apply() method is very handy, as it can be used to apply a specific function to every cell of a DataFrame column (a Series).

In our example we will transform all values of 'days_since_prior_order' to integers. To do this we will use Python's built-in function int():

In [None]:
df_clean['days_since_prior_order_int'] = df_clean.days_since_prior_order.apply(int)
# do not worry if pandas shows a warning message here, for the time being

In [None]:
df_clean.head()
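
As a sketch on made-up data, the same conversion can be written with .apply(int) or with the vectorised .astype(int). The explicit .copy() is an assumption about how one might avoid the warning mentioned above; `toy` and `toy_clean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({"days_since_prior_order": [np.nan, 7.0, 30.0]})

# NaN rows must go first: int(float("nan")) raises a ValueError
# .copy() makes toy_clean an independent DataFrame before we add columns to it
toy_clean = toy.dropna(how="any").copy()

# Element-wise call of the built-in int() on every cell
toy_clean["via_apply"] = toy_clean.days_since_prior_order.apply(int)

# Vectorised alternative that converts the whole column at once
toy_clean["via_astype"] = toy_clean.days_since_prior_order.astype(int)
```

Both columns hold the same integers; .astype() is usually the faster choice for a plain type conversion, while .apply() accepts any function.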

You can always remove a column with del, in order to keep only the new one:

In [None]:
del df_clean['days_since_prior_order']

In [None]:
df_clean.head()

## Grouping Data

### Group based on one attribute

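Now we go back to df, our dataset imported from Instacart. Note that df still contains the change we made earlier (hour 8 replaced with 9); the untouched original values remain in df_master: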

In [None]:
df.head()

Let's say that we want to know how many orders have been made in every hour of the day.
The column "order_hour_of_day" indicates the hour at which an order was made.

In this case we could count how many orders (rows) have the value 1 in "order_hour_of_day" to find the number of orders at hour 1.

This could be done as follows:

In [None]:
df[df.order_hour_of_day == 1].order_hour_of_day.count()

However, if we had to inspect the orders for each hour, we would end up with 24 lines of code like the previous one.

To avoid this, the .groupby() method offers an alternative approach, which consists of two parts:

1. Split our DataFrame into groups:

The groups are created based on the different values that can be found on a specific column (in our case different values of hour in the column "order_hour_of_day"). 

Note that these columns should preferably hold categorical data.
> In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. Examples of values that might be represented in a categorical variable:
>
> * The blood type of a person: A, B, AB or O.
> * The state that a person lives in.
> * The political party that a voter might vote for.
>
> <p style='text-align: center;'> *"Categorical Variables" retrieved from Wikipedia.* </p>

In our case "order_hour_of_day" can indeed be treated as categorical data: it takes one of 24 values, from 0 to 23.

2. Apply aggregation functions on them:

Aggregation functions are all those functions that turn a group into a single value.

Some aggregation functions are:

* sum: adds (+) all the values of a group
* count: counts the non-NaN observations (values) in a group
* mean: computes the mean value (the sum divided by the count)
* min/max: finds the minimum or maximum value in a group
* first/last: returns the first or last (non-NaN) value of the group, in row order
* std: calculates the standard deviation of the group (pandas computes the sample standard deviation, ddof=1, by default)
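
The functions listed above can be tried together on a small made-up table. This is a sketch only; `toy` and `summary` are hypothetical names, and the numbers are chosen so the results are easy to check by hand.

```python
import pandas as pd

# Made-up data: three rows in group 1, two rows in group 2
toy = pd.DataFrame({
    "order_hour_of_day": [1, 1, 1, 2, 2],
    "days_since_prior_order": [2.0, 4.0, 6.0, 10.0, 20.0],
})

# .agg() with a list applies every named aggregation to each group
summary = toy.groupby("order_hour_of_day")["days_since_prior_order"].agg(
    ["sum", "count", "mean", "min", "max", "first", "last", "std"]
)
```

For group 1 the sum is 12, the count is 3 and the mean is 4. The std is 2, which is the sample standard deviation (dividing by count - 1); the population version would give about 1.63.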

In the following example we will make groups of different hours of the day and we will count how many observations there are in these groups:

In [None]:
#create an empty DataFrame (optional, as the next cell overwrites df_groups)
df_groups = pd.DataFrame()

In [None]:
#groupby "order_hour_of_day" & count the non-NaN values of every column in each group
df_groups = df.groupby("order_hour_of_day").count()

In [None]:
df_groups.head()

The produced table shows, for each group, the number of non-NaN values found in each column.

So in the group of hour 1 (all orders with order_hour_of_day=1) there are 12398 orders.
This comes from the first column, which shows that 12398 order_ids were created at hour 1.
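
The difference between counting non-NaN values and counting rows can be seen on a made-up table. In this sketch `toy`, `counts` and `sizes` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up data: two rows have no days_since_prior_order
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_hour_of_day": [1, 1, 1, 2],
    "days_since_prior_order": [np.nan, 7.0, 3.0, np.nan],
})

grouped = toy.groupby("order_hour_of_day")

# count(): non-NaN values per column, so columns can disagree
counts = grouped.count()

# size(): number of rows per group, NaN or not
sizes = grouped.size()
```

For hour 1, count() reports 3 order_ids but only 2 values of days_since_prior_order, while size() reports 3 rows.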

Can you figure out a way to find the hour of the day at which new customers place the most orders?



In [None]:
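# count() ignores NaN, and days_since_prior_order is NaN for a customer's first order,
# so the difference below is the number of first orders per hour
hour_new_customers = df_groups.user_id - df_groups.days_since_prior_order
# sort_values(ascending=False) sorts the values in descending order
hour_new_customers.sort_values(ascending=False)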


As you can see, hour 9 is the most popular time for new customers.
But as you may notice, this metric is not normalised, since there is a greater volume of orders in general between hours 9 and 20.
A percentage of the total orders in each hour may be a more informative metric.
In this case:

In [None]:
# all orders = 100%
# new orders = ??
pct_hour_new_customers= (hour_new_customers*100)/df_groups.user_id
pct_hour_new_customers.sort_values(ascending=False)

This shows that at hour 2, 7% of the orders are placed by new customers.



By subtracting the number of orders that have a prior order (the non-NaN values of days_since_prior_order) from the total number of orders, we get the number of first orders, which tells us how many customers there are:

In [None]:
df_groups.user_id.sum() - df_groups.days_since_prior_order.sum() 

This equals the number of unique user_id values in the main DataFrame:

In [None]:
df.user_id.nunique()
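
The whole reasoning can be checked on a made-up table where the answer is known in advance. The sketch assumes, as above, that a NaN in days_since_prior_order marks a first order; `toy`, `first_orders`, `first_orders_alt` and `pct_first` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up data: 3 customers, each with exactly one first order (NaN)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "order_hour_of_day": [9, 9, 14, 9, 14, 14],
    "days_since_prior_order": [np.nan, 5.0, 7.0, np.nan, 3.0, np.nan],
})

counts = toy.groupby("order_hour_of_day").count()

# All orders minus orders that have a prior order = first orders per hour
first_orders = counts.user_id - counts.days_since_prior_order

# Same result, counting the NaN values directly
first_orders_alt = toy.days_since_prior_order.isna().groupby(toy.order_hour_of_day).sum()

# Share of first orders among all orders of each hour
pct_first = first_orders * 100 / counts.user_id
```

Here hour 9 has 2 first orders out of 3 and hour 14 has 1 out of 3; the first orders add up to 3, the number of distinct customers.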

### Group based on two or more attributes

Looking at the initial dataset, we see that apart from the column "days_since_prior_order" we also have the column "order_dow" (order day of the week).

Order day of the week takes 7 distinct values (0, 1, ..., 6), one for each day of the week.
In the coming example we will further subgroup our main DataFrame on two levels: day of the week & hour of the day.

This way we get the total number of orders for each hour of each day.

To do this, we simply pass to the .groupby() method a list of the two columns by which we want to create our subgroups.

In [None]:
df_groups_2 = df.groupby(["order_dow", "order_hour_of_day"]).agg("count")
df_groups_2

As you can see, the returned table has two index levels (the grey row names): one for "order_dow" and one for "order_hour_of_day". Here "order_dow" is level 0 and "order_hour_of_day" is level 1. Such a combination of several index levels is called hierarchical indexing.

Remember that a table is returned only when we apply an aggregation method (count,sum,mean etc.) on a grouping.
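
Values in a hierarchically indexed table can be looked up by level. The sketch below uses made-up data and the hypothetical names `toy`, `counts`, `one_cell`, `one_day` and `one_hour`.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "order_dow": [0, 0, 0, 1, 1],
    "order_hour_of_day": [8, 8, 9, 8, 9],
})

counts = toy.groupby(["order_dow", "order_hour_of_day"]).agg("count")

# A tuple addresses both levels at once: day 0, hour 8
one_cell = counts.loc[(0, 8), "order_id"]

# A single label addresses level 0: all hours of day 0
one_day = counts.loc[0]

# .xs() takes a cross-section on an inner level: hour 9 on every day
one_hour = counts.xs(9, level="order_hour_of_day")
```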

Now, with the grouping in df_groups_2, we can identify the day and hour with the most orders:



In [None]:
#even if it is not mandatory, we chain here .to_frame() to convert the series (single column) to a DataFrame

day_hour_order = df_groups_2.order_id.sort_values(ascending=False).to_frame()
day_hour_order

As you can see, the most popular combination is the day with index 1 at the hour with index 9.

If you look further down the ordering, you will find that in general hour 9 is the most popular hour for orders on each day.

This means that managers can expect a great volume of orders in this time block and can manage their logistics accordingly.
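
Instead of reading the top of a sorted table, the busiest combination can also be looked up with .idxmax(). This is a sketch on made-up data; `toy`, `counts`, `busiest_day`, `busiest_hour` and `busiest_hour_per_day` are hypothetical names.

```python
import pandas as pd

# Made-up data: day 0 is busiest at hour 9, and so is day 1
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 7],
    "order_dow": [0, 0, 0, 0, 1, 1, 1],
    "order_hour_of_day": [8, 9, 9, 9, 9, 9, 14],
})

# Series of order counts, indexed by (order_dow, order_hour_of_day)
counts = toy.groupby(["order_dow", "order_hour_of_day"]).order_id.count()

# idxmax() returns the index label of the largest value, here a (day, hour) tuple
busiest_day, busiest_hour = counts.idxmax()

# Grouping again by day gives the busiest (day, hour) label for every day
busiest_hour_per_day = counts.groupby(level="order_dow").idxmax()
```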

## Reshape by pivoting

With the .pivot() method we will transform our table so that the rows are the different days and the columns are the hours.

### Resetting the index of DataFrames with a hierarchical index

As you can see in the previous df_groups_2 DataFrame, the order_dow & order_hour_of_day values are held in the hierarchical index of each row.
This means we cannot use them directly as columns in further operations on the DataFrame.
For this reason we reset the index of the DataFrame, which turns this information back into columns.

In [None]:
df_groups_2_reset = df_groups_2.reset_index()
df_groups_2_reset
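
A sketch of the same step on made-up data, together with the `as_index=False` option of .groupby(), which is one way to keep the grouping keys as columns from the start. `toy`, `flat` and `flat_direct` are hypothetical names.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "order_dow": [0, 0, 0, 1, 1],
    "order_hour_of_day": [8, 8, 9, 8, 9],
})

# Group first, then move the two index levels back into columns
counts = toy.groupby(["order_dow", "order_hour_of_day"]).agg("count")
flat = counts.reset_index()

# as_index=False keeps the grouping keys as ordinary columns
flat_direct = toy.groupby(["order_dow", "order_hour_of_day"], as_index=False).agg("count")
```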

### Removing unnecessary columns

Examining the df_groups_2_reset DataFrame, we see that the columns user_id, eval_set, order_number and days_since_prior_order are unnecessary (they carry essentially the same count information as order_id), so we will remove them.
To do this we will create a new DataFrame with only the important columns.

In [None]:
df_groups_2_reset = df_groups_2_reset.loc[:,['order_dow', 'order_hour_of_day', 'order_id']]
df_groups_2_reset.head()

At this stage we transform our DataFrame so that each day of the week becomes a row.
In the .pivot() method, the index argument names the column whose values become the rows, the columns argument names the column whose values become the columns, and the values argument names the column that fills the cells.

In [None]:
df_groups2_pivot = df_groups_2_reset.pivot(index='order_dow', columns='order_hour_of_day', values='order_id')
df_groups2_pivot

We could just as easily have each hour of the day as a row, by swapping order_hour_of_day & order_dow:

In [None]:
df_groups2_pivot = df_groups_2_reset.pivot(index='order_hour_of_day', columns='order_dow', values= 'order_id')
df_groups2_pivot

## The .pivot_table( ) method and its connection with groupby( )

Note that the .pivot() method works only when every combination of index & columns values appears at most once in the data.

If **two or more rows** share the same combination, we should use the .pivot_table() method instead.
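
A small sketch of what happens otherwise, on made-up data where two rows share the same day and hour. `toy` and `pivot_failed` are hypothetical names.

```python
import pandas as pd

# Made-up data: the first two rows share order_dow=2 and order_hour_of_day=9
toy = pd.DataFrame({
    "order_dow": [2, 2, 3],
    "order_hour_of_day": [9, 9, 9],
    "order_id": [10, 11, 12],
})

try:
    toy.pivot(index="order_dow", columns="order_hour_of_day", values="order_id")
    pivot_failed = False
except ValueError:
    # pandas cannot decide which of the two order_ids belongs in the cell
    pivot_failed = True
```

.pivot() only rearranges values; it has no rule for merging two values into one cell, which is exactly what an aggregation function provides.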

Let's have a look again in our initial DataFrame:

In [None]:
df.head(20)

As you may notice, the rows with index 0 & 16 both have order_dow=2 & order_hour_of_day=9.

This means the same combination of order_dow & order_hour_of_day occurs in multiple rows, something we already found out while grouping the DataFrame by these two columns.

So by using the .pivot_table() method we can identify **how many** rows there are for each unique combination of order_dow & order_hour_of_day.

"How many" in the previous sentence means that we want to count the rows that share each combination in our DataFrame.

In this case the .pivot_table() call would be:

In [None]:
#the aggregation function in our example is the count
df_pivot_table = df.pivot_table(index='order_hour_of_day', columns='order_dow', values= 'order_id' , aggfunc='count')
df_pivot_table

This is the same result as the reshaped DataFrame with grouped data (df_groups2_pivot).

So the .pivot_table() method combines all the previous steps: it groups the data by specific categorical columns, aggregates them (count in our case) and returns the result for each combination.
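
The equivalence can be checked on a made-up table. In this sketch the groupby result is reshaped with .unstack() rather than with reset_index() and .pivot(); that is a swapped-in shortcut, not the route taken above. `toy`, `via_pivot_table` and `via_groupby` are hypothetical names.

```python
import pandas as pd

# Made-up data in which every (hour, day) combination occurs at least once
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "order_dow": [0, 0, 0, 1, 1, 1],
    "order_hour_of_day": [8, 8, 9, 8, 9, 9],
})

# One step: group, count and reshape
via_pivot_table = toy.pivot_table(
    index="order_hour_of_day", columns="order_dow", values="order_id", aggfunc="count"
)

# Separate steps: group and count, then move order_dow from the index to the columns
via_groupby = (
    toy.groupby(["order_hour_of_day", "order_dow"]).order_id.count().unstack("order_dow")
)
```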

### Using margins in pivot tables

Now we will combine all the previous results in a single line.
We add the argument margins=True to the previous pivot_table call, so that we also get the total count for each day & each hour.



In [None]:
df_pivot_table = df.pivot_table(index='order_hour_of_day', columns='order_dow', values= 'order_id' , aggfunc='count', margins=True)
df_pivot_table

As you can see in the bottom right corner, 3,421,083 orders have been made across all days and hours.
This is correct, as it equals the number of rows in the original DataFrame:

In [None]:
df.shape
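
The same check can be sketched on made-up data, where the totals are small enough to verify by hand. `toy`, `table` and `grand_total` are hypothetical names; the label "All" is the default name pandas gives to the margins.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "order_dow": [0, 0, 0, 1, 1, 1],
    "order_hour_of_day": [8, 8, 9, 8, 9, 9],
})

table = toy.pivot_table(
    index="order_hour_of_day", columns="order_dow", values="order_id",
    aggfunc="count", margins=True,
)

# The extra row and column are both labelled "All"
grand_total = table.loc["All", "All"]   # every order
day0_total = table.loc["All", 0]        # all orders of day 0
hour8_total = table.loc[8, "All"]       # all orders of hour 8
```

The bottom right cell equals the number of rows of the toy table, just as it does for the full dataset.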