# Pandas Operations

Before exploring the capabilities of the Pandas package further, we first need to import the .csv file containing the Instacart orders. We use the dedicated pandas function read_csv() to store the table in a DataFrame:

In [None]:
import pandas as pd
df = pd.read_csv('../input/orders.csv', sep=',')
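
As a minimal sketch of what read_csv() does, the snippet below parses a small, made-up CSV from an in-memory string instead of a file. The column names mirror the orders table, but the values and the `csv_text` and `sample` names are purely illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for orders.csv (values are made up).
# The first row has an empty last field, which pandas reads as a missing value.
csv_text = """order_id,user_id,order_dow,days_since_prior_order
101,1,2,
102,1,3,15.0
103,2,1,7.0
"""

# read_csv accepts a file path or any file-like object; sep=',' is the default.
sample = pd.read_csv(io.StringIO(csv_text), sep=',')

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # column labels taken from the header line
```

The first line of the text becomes the column labels, and each following line becomes one row.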

## Explore DataFrames

With the .head() method we can retrieve the first 5 rows of a table. 

In [None]:
df.head()

This returns the information stored for the first five orders (the rows are not sorted by default).
Every order (row or observation) has its own **index** (shown in bold on the left-hand side) and a value for each measurement (column). Every column has its own **column label** (shown in bold at the top).

If we want a different number of rows, we pass that number as an argument.

In [None]:
df.head(7)

And if we want to retrieve the last 7 rows, we use the .tail() method:

In [None]:
df.tail(7)
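
The same two methods can be tried on a tiny, made-up table to see exactly which rows come back. The `toy` DataFrame here is a hypothetical stand-in for the orders table.

```python
import pandas as pd

# Ten made-up rows with the default index 0..9
toy = pd.DataFrame({'order_id': range(100, 110)})

first_three = toy.head(3)   # rows from the top
last_three = toy.tail(3)    # rows from the bottom

print(first_three.index.tolist())  # [0, 1, 2]
print(last_three.index.tolist())   # [7, 8, 9]
```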

Now, with the .info() method we get summary information (metadata) about the DataFrame:

In [None]:
df.info()

From the above we understand that:

- The df is indeed a DataFrame ( <class 'pandas.core.frame.DataFrame'> )
- The df has 300 entries with index 0 to 299
- It has 8 columns, of which 6 are of integer type, 1 is float and 1 is object (the generic dtype, which in practice usually holds text strings)
- The column 'days_since_prior_order' has 275 non-null entries, which means that 25 rows do not contain information for this measurement
- The memory usage of the DataFrame is 18.8 KB
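
The individual pieces of this summary can also be queried one at a time. The sketch below uses a small, made-up table (the `toy` name and its values are assumptions) to show the shape, the column types and the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the orders table, with two missing values
toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'eval_set': ['prior', 'prior', 'train', 'prior'],
    'days_since_prior_order': [np.nan, 15.0, 7.0, np.nan],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # data type of every column

# Missing values per column: rows minus non-null entries
missing = toy.isna().sum()
print(missing)
```

For the orders table, the same subtraction gives 300 - 275 = 25 missing values in 'days_since_prior_order'.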

In addition, every column of a DataFrame is an object of type "Series". A Series is a one-dimensional labeled array.

In [None]:
type(df.eval_set)

## Count unique values

With the .value_counts() method we can get the unique values of a column and how many times each of them appears in it.

In the following example we will use the order_dow attribute, which indicates on which day of the week an order was placed.

We will find how many orders were placed on each day *(note that each day is represented by a number \[0 ... 6 \] rather than by its actual name)*.

In [None]:
df.order_dow.value_counts()

This shows that day 1 has the most orders.
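
To see the counting on data small enough to check by hand, here is a sketch with a made-up Series of day numbers. The `dow` values are assumptions and not taken from the orders file.

```python
import pandas as pd

# Six made-up orders: day 1 appears three times, day 0 twice, day 3 once
dow = pd.Series([1, 0, 1, 3, 1, 0], name='order_dow')

counts = dow.value_counts()   # index = unique values, values = occurrences
print(counts)

# The index label with the highest count is the busiest day
busiest_day = counts.idxmax()
print(busiest_day)
```

The result is itself a Series, sorted from the most frequent value to the least frequent one.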

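Note that the .value_counts() method used above belongs to Series objects (the columns of a DataFrame are Series objects).
Whether the same method is available directly on a DataFrame depends on the pandas version: before version 1.1 the call below raises an AttributeError, while newer versions instead count unique rows: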

In [None]:
df.value_counts()

**This means that every kind of object in Python and Pandas comes with its own methods.**

## Sort values by index

In the previous example the method returned a Series sorted by the number of occurrences of each value.
But what if we wanted it to be sorted by day?

In [None]:
df.order_dow.value_counts().sort_index()

This works because the day numbers form the index of this Series.
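
The difference between the two orderings is easy to verify on a small, made-up Series (the `dow` values are illustrative assumptions):

```python
import pandas as pd

dow = pd.Series([1, 0, 1, 3, 1, 0], name='order_dow')

by_count = dow.value_counts()    # sorted by number of occurrences
by_day = by_count.sort_index()   # same counts, sorted by day number

print(by_count.tolist())         # [3, 2, 1]
print(by_day.index.tolist())     # [0, 1, 3]
print(by_day.tolist())           # [2, 3, 1]
```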

## Select columns

To select a column with pandas we use the name of the DataFrame followed by the name of the desired column in brackets:

In [None]:
s1 = df['order_id']
type(s1)

The object that has been created is a Series, a one-dimensional labeled array:

In [None]:
s1.head()

If we want to create a DataFrame with one column, we need to nest the name of the column in double brackets:

In [None]:
dcol1 = df[['order_id']]
type(dcol1)
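
The effect of single versus double brackets can be checked side by side. This is a sketch on a made-up table; the `toy` DataFrame and its values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'order_id': [10, 11, 12],
    'order_dow': [0, 3, 5],
})

as_series = toy['order_id']     # single brackets: a Series
as_frame = toy[['order_id']]    # double brackets: a one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```

The inner brackets are simply a Python list of column names, which is why the result keeps its two-dimensional shape.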

Another, simpler way to retrieve a **single** column is to access it as an attribute of the DataFrame:

In [None]:
df.order_id

In [None]:
type(df.order_id)

Again, we can use head() to make the column easier to explore:

In [None]:
df.order_id.head()
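
One caveat worth knowing: attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The sketch below uses two hypothetical column names, 'count' and 'order dow', to show where bracket notation is needed.

```python
import pandas as pd

toy = pd.DataFrame({
    'order_id': [1, 2, 3],
    'count': [5, 6, 7],       # clashes with the DataFrame.count method
    'order dow': [0, 1, 2],   # contains a space, so not a valid identifier
})

# For a plain name, both notations return the same Series
same = toy.order_id.equals(toy['order_id'])
print(same)

# toy.count is the method, not the column
print(callable(toy.count))

# Bracket notation always reaches the column
print(toy['count'].tolist())
print(toy['order dow'].tolist())
```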

Now, if we want to create a DataFrame with a combination of columns, we pass them as a list:

In [None]:
dcol2 = df[['order_id', 'order_dow', 'order_hour_of_day']]
dcol2.head()

Could we create a Series with these three columns?

## Select rows

To select rows in a DataFrame we proceed in the same way. This time, however, we indicate the starting index and the ending index of the desired rows.

So in order to get the 7th to the 13th row we enter:

In [None]:
drow1 = df[6:13]
drow1

Don't forget that Python uses zero-based indexing!

- The 7th row has index 6.
- The 13th row has index 12.
- And as the closing index of a slice is exclusive, we need to use index 13.

So in practice we subtract one from the starting position and keep the ending position as it is.

Note that this kind of row selection uses integer positions and not row labels (if there were any). In our example the positions match the row labels.
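
Both points can be checked on a small, made-up table: which rows the slice 6:13 returns, and that the slice goes by position even when the rows carry other labels. The `toy` and `relabeled` names, and the labels 'r0' to 'r19', are assumptions for illustration.

```python
import pandas as pd

# Twenty made-up rows with the default index 0..19
toy = pd.DataFrame({'order_id': range(1000, 1020)})

rows_7_to_13 = toy[6:13]
print(rows_7_to_13.index.tolist())  # [6, 7, 8, 9, 10, 11, 12]
print(len(rows_7_to_13))            # 7

# Same data, but with string labels 'r0'..'r19' as the index
relabeled = toy.copy()
relabeled.index = [f'r{i}' for i in range(20)]

# The integer slice still selects by position, not by label
print(relabeled[6:13].index.tolist())
```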

## Multi-axis indexing & Selection

To select specific columns and rows in a DataFrame we use .loc\[ \] and .iloc\[ \].

With .loc\[ \] you can specify the rows and columns you want by their **label**.

For example, if you want to select the 5th to the 9th row for the columns order_id & order_hour_of_day, that would be:

In [None]:
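dfloc = df.loc[4:8, ['order_id', 'order_hour_of_day']]
dfloc

Note that, unlike ordinary Python slicing, .loc\[ \] includes both endpoints of a slice, so the labels 4:8 cover the 5th to the 9th row. If our rows had string labels as their index, then instead of "4:8" we could have used \[ 'string5', 'string6', ..., 'string9'\].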

Both ways would return the same result.
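
Here is a minimal sketch of label-based selection with string labels. The labels 'string1' to 'string9' and the table values are made up for illustration.

```python
import pandas as pd

# Nine made-up rows labeled 'string1'..'string9'
labels = [f'string{i}' for i in range(1, 10)]
toy = pd.DataFrame(
    {'order_id': range(100, 109), 'order_hour_of_day': range(8, 17)},
    index=labels,
)

cols = ['order_id', 'order_hour_of_day']

# A label slice includes both endpoints
by_slice = toy.loc['string5':'string9', cols]

# The same rows, listed one by one
by_list = toy.loc[['string5', 'string6', 'string7', 'string8', 'string9'], cols]

print(by_slice.equals(by_list))
```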

Now with .iloc\[ \] you can retrieve rows and columns based on their integer **position**. Here the end of a slice is exclusive, as in ordinary Python slicing.

For the previous example that would be:

In [None]:
dfiloc = df.iloc[4:9, [1, 6]]
dfiloc
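
The key difference between the two indexers is how they treat the end of a slice: .loc includes it, .iloc excludes it. The sketch below shows this on a made-up table whose column positions (0, 1, 2) are assumptions and differ from those of the orders table.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': range(20, 32),            # position 0
    'order_id': range(100, 112),         # position 1
    'order_hour_of_day': range(8, 20),   # position 2
})

# Labels 4..8, both endpoints included
by_label = toy.loc[4:8, ['order_id', 'order_hour_of_day']]

# Positions 4..8, because the end position 9 is excluded
by_position = toy.iloc[4:9, [1, 2]]

print(by_label.equals(by_position))  # True
print(len(toy.loc[4:9]))             # 6 rows: labels 4 through 9
print(len(toy.iloc[4:9]))            # 5 rows: positions 4 through 8
```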

From the previous examples we see that a sequence of values is represented with "**:**".

To select individual values, on the other hand, we enter them in brackets separated by commas "**\[ , , \]**".

So if we wanted the 5th and 7th row for the columns from order_id to order_hour_of_day (positions 1 to 6, so the slice has to end at 7), that would be:

In [None]:
dfiloc2 = df.iloc[[4, 6], 1:7]
dfiloc2

Remember! With .loc and .iloc, rows come first and then the columns.
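
Combining a list of individual rows with a slice of columns can be sketched as follows. The `toy` table is made up, and its column positions are assumptions that differ from those of the orders table.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': range(10),                              # position 0
    'order_id': range(100, 110),                       # position 1
    'order_dow': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2],       # position 2
    'order_hour_of_day': range(8, 18),                 # position 3
})

# Rows first (a list of positions), then columns (a slice of positions).
# The column slice 1:3 covers positions 1 and 2; position 3 is excluded.
picked = toy.iloc[[4, 6], 1:3]
print(picked)

# The label-based equivalent: a label slice includes its end column
same = toy.loc[[4, 6], 'order_id':'order_dow']
print(picked.equals(same))
```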

Can you figure out a way to retrieve the 10th to the 13th element of the columns 'order_id' and 'order_hour_of_day'?