# Getting Started with pandas Part 2
## Book: Python for Data Analysis

In [2]:
import pandas as pd 
import numpy as np

## 1) Reindexing

In [3]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index= ['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Use **reindex** to create a new object with the data rearranged to match a new index. Labels that were not in the original index get missing values (NaN).

In [6]:
#Reindex rows:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [7]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
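
If NaN is not the placeholder you want, `reindex` also accepts a `fill_value` argument. The following is a minimal sketch that reuses the Series from above; the variable names `with_nan` and `with_zero` are illustrative only.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Labels absent from the original index get NaN by default.
with_nan = obj.reindex(['a', 'b', 'c', 'd', 'e'])

# fill_value substitutes a constant for labels that were not present.
with_zero = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

print(with_nan['e'])   # nan
print(with_zero['e'])  # 0.0
```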

In [9]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index= [0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

For ordered data like time series, you may want to interpolate or fill values when reindexing. The `method` option does this; `'ffill'` forward-fills the values.

In [10]:
# Forward-fill values for the new row labels
obj3.reindex(range(6), method= 'ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
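
For comparison, here is a small sketch of forward fill next to backward fill (`method='bfill'`) on the same Series. The names `forward` and `backward` are illustrative; both methods assume the original index is sorted.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the closest earlier value.
forward = obj3.reindex(range(6), method='ffill')

# Backward fill: each new label takes the closest later value.
# Label 5 has nothing after it, so it stays missing.
backward = obj3.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```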

In [17]:
frame = pd.DataFrame(np.arange(9).reshape(3,3), index= ['a', 'c', 'd'], columns= ['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [19]:
#Reindex rows
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [20]:
#Reindex columns 
states = ['Texas', 'Utah', 'California']
frame.reindex(columns= states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


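You can reindex more compactly by label-indexing with **loc**, but only on old pandas versions. Passing a list with missing labels to `loc` was deprecated (see the warning below) and raises `KeyError` in pandas 1.0 and later, so prefer `reindex`.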

In [25]:
# Reindexing rows and columns with 'loc' (deprecated; 'b' and 'Utah' are missing labels)

frame.loc[['a', 'b', 'c', 'd'], states]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike


Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0
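
On pandas 1.0 and later, the `loc` call above raises `KeyError` instead of emitting a warning. The sketch below shows that behavior and the `reindex` equivalent that handles rows and columns in one call. The flag `loc_raised` and the name `result` are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# In pandas 1.0+, loc rejects lists containing labels missing from the axis.
try:
    frame.loc[['a', 'b', 'c', 'd'], states]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex accepts missing labels on both axes and fills them with NaN.
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(result)
```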


## 2) Dropping Entries from an Axis

In [3]:
obj = pd.Series(np.arange(5.), index= ['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [4]:
# Dropping entries by row label:

new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [6]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With a DataFrame, labels can be dropped from either axis.

In [7]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), index= ['Ohio', 'Colorado', 'Utah', 'New York'],
                   columns= ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [8]:
# Drop rows by label (axis 0 is the default)

data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [42]:
# Drop columns by passing axis=1 or axis='columns'

data.drop('two', axis= 1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [15]:
data.drop(['one', 'two'], axis=1)

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
Utah,10,11
New York,14,15


In [17]:
data.drop(['one', 'two'], axis='columns')

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
Utah,10,11
New York,14,15


In [16]:
data.drop(['Ohio', 'Colorado'], axis=0)

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [18]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [21]:
# With inplace=True, drop modifies obj itself and returns None

obj.drop('c', inplace= True)

In [22]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
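
The difference between the two modes is easy to miss, so here is a short sketch contrasting them. The names `dropped` and `returned` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: drop returns a new Series and leaves obj unchanged.
dropped = obj.drop('c')
print('c' in obj)      # True
print('c' in dropped)  # False

# inplace=True mutates obj and returns None, so do not assign the result.
returned = obj.drop('c', inplace=True)
print(returned)        # None
print('c' in obj)      # False
```

Because the in-place form destroys the dropped data, the default copy-returning form is usually the safer choice.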

## 3) Indexing, Selection and Filtering 

In [25]:
obj = pd.Series(np.arange(4.), index= ['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [29]:
obj['a']

0.0

In [28]:
obj[0]

0.0

In [30]:
obj.shape

(4,)

In [31]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [32]:
obj[2:]

c    2.0
d    3.0
dtype: float64

In [34]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [35]:
obj[[1,3]]

b    1.0
d    3.0
dtype: float64
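
Integer keys such as `obj[0]` and `obj[[1,3]]` work here only because the index holds strings, so pandas falls back to positions. Recent pandas releases warn about this fallback, so a more explicit way to write the same selections is `iloc`. This is a sketch; the names `first` and `picked` are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# iloc always selects by integer position, whatever the index contains.
first = obj.iloc[0]         # same element as obj['a']
picked = obj.iloc[[1, 3]]   # elements at positions 1 and 3

print(first)
print(picked)
```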

In [38]:
# Filtering with a boolean condition:

obj[obj <= 2]

a    0.0
b    1.0
c    2.0
dtype: float64

In [41]:
# Slicing with labels differs from normal Python slicing: the endpoint is inclusive

obj['b': 'c']

b    1.0
c    2.0
dtype: float64

In [43]:
obj['b': 'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
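
To make the endpoint rule concrete, the sketch below takes two slices of a fresh copy of the Series, one by label and one by position. The names `by_label` and `by_position` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints are included.
by_label = obj.loc['b':'c']

# Positional slice: the stop position is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(list(by_label.index))     # ['b', 'c']
print(list(by_position.index))  # ['b']
```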

Indexing into a DataFrame

In [45]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), index = ['Ohio', 'Colorado', 'Utah', 'New York'],
                   columns= ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [46]:
# DataFrame column selection with a list of labels

data[['one', 'four']]

Unnamed: 0,one,four
Ohio,0,3
Colorado,4,7
Utah,8,11
New York,12,15


In [55]:
#Dataframe row slicing 

data[2:]

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [58]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [53]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [48]:
# Selecting rows with a boolean condition

data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [52]:
# Combining two boolean conditions (each one needs its own parentheses)

data[(data['three'] > 5) & (data['four'] >= 10)]

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15
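
Conditions are combined with the element-wise operators `&` (and), `|` (or) and `~` (not) rather than Python's `and`, `or` and `not`. The parentheses matter because these operators bind more tightly than comparisons. A brief sketch on the same data follows; the names `both`, `either` and `negated` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# AND: rows where both conditions hold.
both = data[(data['three'] > 5) & (data['four'] >= 10)]

# OR: rows where at least one condition holds.
either = data[(data['one'] == 0) | (data['four'] >= 15)]

# NOT: rows where the condition does not hold.
negated = data[~(data['three'] > 5)]

print(list(both.index))     # ['Utah', 'New York']
print(list(either.index))   # ['Ohio', 'New York']
print(list(negated.index))  # ['Ohio']
```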


In [59]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [60]:
data[data < 5] = 0

In [62]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### 3.2) Selection with loc and iloc

For label-indexing on the rows of a DataFrame, use the special indexing attributes **loc** (axis labels) and **iloc** (integer positions). They let you select a subset of the rows and columns with NumPy-like notation.

In [63]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [64]:
data.loc['Colorado']

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [65]:
data.loc['Colorado', ['one', 'two']]

one    0
two    5
Name: Colorado, dtype: int64

In [66]:
data.iloc[1]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [67]:
data.iloc[1, [0,1]]

one    0
two    5
Name: Colorado, dtype: int64

In [69]:
data.iloc[1, range(2)]

one    0
two    5
Name: Colorado, dtype: int64

In [70]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [73]:
data.iloc[[3,0], [0,3,1]]

Unnamed: 0,one,four,two
New York,12,15,13
Ohio,0,0,0


In [74]:
data.loc[:'Utah']

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11


In [75]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [76]:
data.iloc[:,]

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [77]:
data.iloc[:, :3]

Unnamed: 0,one,two,three
Ohio,0,0,0
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


In [80]:
data.iloc[:, :3][data['three'] > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14
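
The last cell chains two indexing steps. As a sketch, the same selection can be written as one `loc` call that takes the row mask and the column labels together, which avoids the intermediate frame. The names `mask`, `chained` and `single` are illustrative, and the data is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data < 5] = 0  # same modification applied earlier in the notebook

mask = data['three'] > 5

# Two steps: take the first three columns, then filter the rows.
chained = data.iloc[:, :3][mask]

# One step: boolean mask for the rows, labels for the columns.
single = data.loc[mask, data.columns[:3]]

print(chained.equals(single))  # True
```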


## 4) Integer Indexes

There are some differences between indexing pandas objects and indexing built-in Python data structures like lists and tuples.

In [82]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

We wouldn't expect an error in the code below:

In [83]:
ser[-1]

KeyError: -1

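The problem is the integer index: pandas treats `-1` as a label rather than a position, and the index only contains the labels 0, 1 and 2.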

In [85]:
ser2 = pd.Series(np.arange(3.), index= ['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

With a non-integer index, there's no potential for ambiguity

In [88]:
ser2[-1]

2.0

In [89]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

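To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use **loc** (for labels) or **iloc** (for integer positions).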

In [94]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64
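
A short sketch of the difference on the same integer-indexed Series follows. The names `by_label`, `by_position` and `last` are illustrative only.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# loc slices by label, so the endpoint 1 is included.
by_label = ser.loc[:1]

# iloc slices by position, so the stop position 1 is excluded.
by_position = ser.iloc[:1]

# Negative positions work with iloc, unlike ser[-1] above.
last = ser.iloc[-1]

print(len(by_label))     # 2
print(len(by_position))  # 1
print(last)              # 2.0
```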