# Data Manipulation in Pandas

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

The fundamental Pandas data structures:

* **Series**: a "one-dimensional array" with flexible indices
* **DataFrame**: a "two-dimensional array" with both flexible row indices and flexible column names
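
As a minimal sketch of the two structures, here is a small example with made-up labels and values (they are illustrative and not taken from any dataset used later):

```python
import pandas as pd

# A Series: one-dimensional values with an explicit index of labels
delays = pd.Series([2.0, 4.0, -1.0], index=["UA1545", "UA1714", "B6725"])

# A DataFrame: two-dimensional, with row labels and named columns
df = pd.DataFrame(
    {"carrier": ["UA", "UA", "B6"], "dep_delay": [2.0, 4.0, -1.0]},
    index=["UA1545", "UA1714", "B6725"],
)

print(delays["UA1714"])            # look up a value by its label
print(df.loc["B6725", "carrier"])  # look up by row label and column name
```

Each column of a DataFrame is itself a Series that shares the DataFrame's row index.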

# Introduction

When you get a dataset to analyze, it is rarely clean or in exactly the form you need. Often you’ll need to do some preprocessing or wrangling first, e.g., creating new variables or summaries, filtering out rows based on certain criteria, renaming variables, or reordering the observations by some column.

In this notebook, you will learn how to perform a variety of data preprocessing tasks. Here, we will use a dataset on flights departing New York City in 2013. 

In [1]:
import pandas as pd
import numpy as np 

In [2]:
# Install the package 'nycflights13' before you can run this
from nycflights13 import flights
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z


In [3]:
flights.shape

(336776, 19)

In [4]:
list(flights.columns) 

['year',
 'month',
 'day',
 'dep_time',
 'sched_dep_time',
 'dep_delay',
 'arr_time',
 'sched_arr_time',
 'arr_delay',
 'carrier',
 'flight',
 'tailnum',
 'origin',
 'dest',
 'air_time',
 'distance',
 'hour',
 'minute',
 'time_hour']

## Data frame with columns

- year,month,day
        Date of departure    
- dep_time,arr_time
        Actual departure and arrival times (format HHMM or HMM), local tz.
- sched_dep_time,sched_arr_time
        Scheduled departure and arrival times (format HHMM or HMM), local tz.    
- dep_delay,arr_delay
        Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- hour,minute
        Time of scheduled departure broken into hour and minutes.
- carrier
        Two-letter carrier abbreviation. See the airlines table in nycflights13 to get the full name.
- tailnum
        Plane tail number
- flight
        Flight number
- origin,dest
        Origin and destination. See the airports table in nycflights13 for additional metadata.
- air_time
        Amount of time spent in the air, in minutes
- distance
        Distance between airports, in miles
- time_hour
        Scheduled date and hour of the flight as a date. Along with origin, can be used to join flights data to weather data.
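
The `time_hour` values are stored as ISO 8601 strings in UTC, so they usually need parsing before date-based work or joins. A minimal sketch, using two values copied from the first rows shown earlier; the choice of `America/New_York` as the local time zone is an assumption based on the flights departing New York City:

```python
import pandas as pd

# time_hour values copied from the first rows of the flights table
toy = pd.DataFrame({"time_hour": ["2013-01-01T10:00:00Z", "2013-01-01T11:00:00Z"]})

# Parse the ISO 8601 strings; the trailing Z marks them as UTC
toy["time_hour"] = pd.to_datetime(toy["time_hour"], utc=True)

# Convert to New York local time (assumed zone) to recover the local hour
local = toy["time_hour"].dt.tz_convert("America/New_York")
print(local.dt.hour.tolist())
```

The local hours obtained this way line up with the `hour` column for the same rows.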

In [5]:
flights.dtypes

year                int64
month               int64
day                 int64
dep_time          float64
sched_dep_time      int64
dep_delay         float64
arr_time          float64
sched_arr_time      int64
arr_delay         float64
carrier            object
flight              int64
tailnum            object
origin             object
dest               object
air_time          float64
distance            int64
hour                int64
minute              int64
time_hour          object
dtype: object

In [6]:
flights.describe(include='all')

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
count,336776.0,336776.0,336776.0,328521.0,336776.0,328521.0,328063.0,336776.0,327346.0,336776,336776.0,334264,336776,336776,327346.0,336776.0,336776.0,336776.0,336776
unique,,,,,,,,,,16,,4043,3,105,,,,,6936
top,,,,,,,,,,UA,,N725MQ,EWR,ORD,,,,,2013-09-13T12:00:00Z
freq,,,,,,,,,,58665,,575,120835,17283,,,,,94
mean,2013.0,6.54851,15.710787,1349.109947,1344.25484,12.63907,1502.054999,1536.38022,6.895377,,1971.92362,,,,150.68646,1039.912604,13.180247,26.2301,
std,0.0,3.414457,8.768607,488.281791,467.335756,40.210061,533.264132,497.457142,44.633292,,1632.471938,,,,93.688305,733.233033,4.661316,19.300846,
min,2013.0,1.0,1.0,1.0,106.0,-43.0,1.0,1.0,-86.0,,1.0,,,,20.0,17.0,1.0,0.0,
25%,2013.0,4.0,8.0,907.0,906.0,-5.0,1104.0,1124.0,-17.0,,553.0,,,,82.0,502.0,9.0,8.0,
50%,2013.0,7.0,16.0,1401.0,1359.0,-2.0,1535.0,1556.0,-5.0,,1496.0,,,,129.0,872.0,13.0,29.0,
75%,2013.0,10.0,23.0,1744.0,1729.0,11.0,1940.0,1945.0,14.0,,3465.0,,,,192.0,1389.0,17.0,44.0,


## Basic Operations of Data Manipulation

You will learn the five key operations that allow you to solve the vast majority of your data manipulation challenges:

* Pick observations by their values.
* Reorder the rows.
* Pick variables by their names.
* Create new variables with functions of existing variables.
* Collapse many values down to a single summary.

These can all be used in conjunction with ``groupby()``, which changes the scope of each operation from the entire dataset to group-by-group. Together with ``groupby()``, these five operations provide the verbs for a language of data manipulation.
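
To illustrate how grouping changes the scope of an operation, here is a small sketch on a toy table (the values are made up and only stand in for the flights data):

```python
import pandas as pd

# Toy data standing in for the flights table (values are made up)
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "dep_delay": [2.0, 4.0, -1.0, 10.0, 3.0],
})

# Whole-table summary: a single number
overall_mean = toy["dep_delay"].mean()

# The same summary computed group-by-group: one number per month
mean_by_month = toy.groupby("month")["dep_delay"].mean()

print(overall_mean)
print(mean_by_month)
```

The result of the grouped call is a Series indexed by the grouping column, here `month`.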

## Select Rows

In [None]:
# Filter rows 
# Select all flights in January: 
flights.loc[flights['month']==1]
# flights.loc[flights.month==1] 

In [None]:
flights[flights['month']==1]
#flights[flights.month==1]

In [None]:
# Select all flights on January 1st: 
flights[(flights.month==1) & (flights.day==1)]
#flights[(flights['month']==1) & (flights['day']==1)]

In [None]:
# Save the subset to a new dataframe
flights_0101 = flights[(flights.month==1) & (flights.day==1)]
flights_0101

In [None]:
# Select all flights scheduled to depart at or before 6:00 am.
flights[flights.sched_dep_time<=600]

In [None]:
# The same filter written with the query() method
flights.query('sched_dep_time<=600')
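
``query()`` can also refer to ordinary Python variables by prefixing them with ``@``. A minimal sketch on toy data; the `cutoff` variable and the toy values are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({"sched_dep_time": [515, 600, 745], "month": [1, 1, 2]})
cutoff = 600  # hypothetical threshold defined outside the DataFrame

# @cutoff refers to the Python variable, not to a column
early = toy.query("sched_dep_time <= @cutoff")

# Conditions can be combined with and / or / not inside the query string
early_jan = toy.query("sched_dep_time <= @cutoff and month == 1")

# Equivalent boolean-mask form
same = toy[toy.sched_dep_time <= cutoff]
```

Both forms select the same rows; which one to use is largely a matter of readability.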

## Logical operators

As shown above, multiple filtering conditions are combined with “&”: every condition must be true in order for a row to be included in the output. 

For other types of combinations, you’ll need to use Boolean operators yourself: ``&`` is “and”, ``|`` is “or”, and ``~`` is “not”. 
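
One detail worth illustrating: in Python, ``&`` and ``|`` bind more tightly than comparisons such as ``==``, which is why each condition is wrapped in parentheses. A small sketch on toy data (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"month": [1, 1, 2, 3], "day": [1, 2, 1, 1]})

# Each comparison is parenthesized before combining with &, | or ~
jan_first = toy[(toy.month == 1) & (toy.day == 1)]
jan_or_feb = toy[(toy.month == 1) | (toy.month == 2)]
not_jan = toy[~(toy.month == 1)]

# Without parentheses, toy.month == 1 & toy.day == 1 is parsed as the
# chained comparison toy.month == (1 & toy.day) == 1, which raises a
# ValueError because the truth value of a Series is ambiguous.
```

The Python keywords ``and``, ``or`` and ``not`` do not work element-wise on a Series, so the symbolic operators are needed outside of ``query()``.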

In [None]:
# Select flights in either January or February
flights[(flights.month==1) | (flights.month==2)]

In [None]:
# Select flights in the second quarter
flights[flights.month.isin([4,5,6])]

In [None]:
# Select flights that are not in January
flights[flights.month!=1]

In [None]:
# Select flights that are not in January, February, or March
flights[(flights.month!=1) & (flights.month!=2) & (flights.month!=3)]
#flights[~flights.month.isin([1,2,3])]

In [None]:
# Use query() to select flights in the first quarter (January to March)
flights.query('month>=1 and month<=3')

## Missing values

It is quite common to have missing values (NaNs) in data frames. NaN represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown. 

In pandas, if you want to determine whether a value is missing, use ``.isnull()`` (or its alias ``.isna()``):

In [None]:
flights[flights.arr_time.isnull()]
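
The "contagious" behaviour, and the main exception to it, can be seen in a short sketch on a toy Series (the values are made up). Note that pandas reductions such as ``sum()`` skip NaN by default, which is one reason the rule above says "almost":

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(np.nan + 1)           # arithmetic with NaN gives NaN
print(np.nan == np.nan)     # NaN is not equal to anything, even itself
print(s.isnull().sum())     # number of missing entries
print(s.sum())              # reductions skip NaN by default (skipna=True)
print(s.sum(skipna=False))  # propagate NaN instead of skipping it

cleaned = s.dropna()        # drop the missing entries
```

Because ``NaN == NaN`` is False, a comparison such as ``s == np.nan`` never finds missing values; ``.isnull()`` is the reliable test.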

## Sorting

Given a data frame, we often want to sort the rows by a column name, or a set of column names, or more complicated expressions. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. 

In [None]:
# Order rows by month
flights.sort_values('month')

In [None]:
# Order rows by month in descending order
flights.sort_values('month', ascending=False)

In [None]:
# Order rows by year, month, day
flights.sort_values(by=['year','month','day'])
# Or simply: 
#flights.sort_values(['year','month','day'])

In [None]:
# You can specify different ascending arguments for different column names
flights.sort_values(['month', 'day'], ascending=[True, False])

In [None]:
# By default, missing values (NaN) are sorted at the end, in both ascending and descending order.
flights.sort_values('dep_delay')
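
The placement of missing values can be changed with the ``na_position`` argument of ``sort_values()``. A minimal sketch on toy data (the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"dep_delay": [5.0, np.nan, -3.0]})

last = toy.sort_values("dep_delay")                        # NaN row at the end
first = toy.sort_values("dep_delay", na_position="first")  # NaN row at the start
desc = toy.sort_values("dep_delay", ascending=False)       # NaN still at the end

print(first)
```

``sort_values()`` returns a new data frame each time; `toy` itself keeps its original row order.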

## Select Columns

When you work with a dataset with hundreds or even thousands of variables, which is not uncommon, the first challenge is often narrowing in on the variables you’re actually interested in. 

In [None]:
# Select one column
#flights['carrier']
flights.carrier

In [None]:
# Select multiple columns
flights[['year','month','day']]

Select columns whose name matches regular expression regex.

``df.filter(regex='regex')``

In [None]:
# Select all columns containing a '_' in the name.
flights.filter(regex='_')

In [None]:
# Select all columns beginning with the word 'dep'
flights.filter(regex='^dep')

In [None]:
# Select all columns ending with the word 'time'
flights.filter(regex='time$')

In [None]:
# Select all columns beginning with 'a', ending with 'e', with any string in between.
flights.filter(regex='^a.*e$')

In [None]:
# Select all columns between 'carrier' and 'dest' (inclusive).
flights.loc[:,'carrier':'dest']

In [None]:
# Select by column indexes: 
# Select columns in positions 1, 2 and 5 (first column is 0).
flights.iloc[:,[1,2,5]]

In [None]:
# Select rows meeting logical condition, and only the specific columns.
# Select all flights in January, display the day, carrier, and flight: 
flights.loc[flights['month']==1, ['day','carrier', 'flight']]

## Add new variables

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. 

In [None]:
# First, let's create a small dataframe to work with
flights_sml = flights.filter(['year','month','day','dep_delay','arr_delay','distance','air_time'])
flights_sml

In [None]:
# Create two new variables one at a time
flights_sml['gain'] = flights_sml.dep_delay - flights_sml.arr_delay
flights_sml['speed'] = flights_sml.distance / flights_sml.air_time * 60
flights_sml.head()

In [None]:
# Remove existing columns; drop() returns a new dataframe and leaves flights_sml unchanged
flights_sml.drop(columns=['gain','speed'])

In [None]:
# Create multiple new columns 
flights_sml.assign(
    gain = lambda x: x.dep_delay - x.arr_delay,
    hours = lambda x: x.air_time / 60,
    gain_per_hour = lambda x: x.gain / x.hours # Note that you can refer to columns that you’ve just created
)

## Useful creation functions

There are many functions for creating new variables:
- Arithmetic operators: +, -, *, /, and ** (exponentiation). Note that ^ is bitwise XOR in Python, not a power operator.
- Modular arithmetic: // (floor division) and % (remainder), where x == y * (x // y) + (x % y). 
- Logs: np.log(), np.log2(), np.log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. 
- Logical comparisons: <, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations, it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.
- Cumulative and rolling aggregates: pandas provides methods for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(). Windowed aggregates are available through rolling().
- Ranking: get the rankings of rows using the rank() method.

In [None]:
flights_sml['air_time_hours'] = flights_sml.air_time // 60
flights_sml['log2_dist'] = np.log2(flights_sml.distance)
flights_sml['gain_pos'] = flights_sml.gain > 0
flights_sml['gain_cumsum'] = flights_sml.gain.cumsum()
flights_sml['dist_rank'] = flights_sml['distance'].rank(method='min',ascending=True)

flights_sml.head()
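
Modular arithmetic is handy for times stored as HHMM integers. The sketch below splits a few scheduled departure times (copied from the first rows of the flights table) into hours and minutes; the name `minutes_since_midnight` is introduced here only for illustration:

```python
import pandas as pd

# Scheduled departure times in HHMM format, from the first rows shown earlier
sched = pd.Series([515, 529, 540, 545, 600])

hour = sched // 100   # floor division keeps the leading digits
minute = sched % 100  # remainder keeps the last two digits
minutes_since_midnight = hour * 60 + minute

# The identity x == y * (x // y) + (x % y) holds element-wise
assert (sched == 100 * hour + minute).all()

print(pd.DataFrame({"hour": hour, "minute": minute}))
```

Converting to minutes since midnight makes differences between two clock times meaningful, which plain subtraction of HHMM values does not.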