# Pandas

In [49]:
import pandas as pd
import numpy as np

## Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [28]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [9]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [10]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [16]:
data2 = pd.Series(5, index=[100, 200, 300])  # a scalar is repeated to fill the given index
data2

100    5
200    5
300    5
dtype: int64

In [17]:
data2[200]

5
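
As a small sketch of how values and index fit together, the following builds a Series from a NumPy array with an explicit string index, then reads both parts back. The names `values`, `labeled`, and `constant` are illustrative and do not come from the cells above.

```python
import numpy as np
import pandas as pd

# Illustrative names; not part of the original notebook.
values = np.array([0.25, 0.5, 0.75, 1.0])
labeled = pd.Series(values, index=['a', 'b', 'c', 'd'])

# .values gives the underlying NumPy array, .index gives the labels.
print(labeled.values)
print(list(labeled.index))

# Items are looked up by label.
print(labeled['b'])

# A scalar value is repeated once per index entry.
constant = pd.Series(5, index=[100, 200, 300])
print(len(constant), constant.loc[200])
```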

In [11]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [13]:
population.values

array([38332521, 26448193, 19651127, 19552860, 12882135])

In [14]:
population.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [15]:
population['Florida']

19552860
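
A Series built from a dictionary behaves much like one, with some array-style extras. The sketch below reuses the population figures above to show membership tests and label slicing. The names `has_texas`, `has_ohio`, and `subset` are illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Membership tests look at the index labels, like dictionary keys.
has_texas = 'Texas' in population
has_ohio = 'Ohio' in population
print(has_texas, has_ohio)

# Unlike a dictionary, a Series supports slicing by label;
# the end label is included in the result.
subset = population['California':'New York']
print(subset)
```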

## Pandas DataFrame Object

A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

In [19]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [20]:
states.values

array([[38332521,   423967],
       [26448193,   695662],
       [19651127,   141297],
       [19552860,   170312],
       [12882135,   149995]])

In [21]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [22]:
states.columns

Index(['population', 'area'], dtype='object')
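
To illustrate what "aligned" means in practice, here is a minimal sketch in which the two Series do not share exactly the same labels. The reduced data and the name `partial` are assumptions made for the example. Pandas aligns on the union of the labels and marks the gaps as missing.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative subset).
area = pd.Series({'California': 423967, 'Texas': 695662})
population = pd.Series({'Texas': 26448193, 'New York': 19651127})

# The DataFrame index is the union of both indexes; entries that
# exist in only one Series become NaN in the other column.
partial = pd.DataFrame({'population': population, 'area': area})
print(partial)
```

Only Texas has both values here, so the other two rows each contain one NaN.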

Note: square-bracket selection on a DataFrame selects columns, whereas on a Series it selects elements by index label.

In [24]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
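
The difference between column and row access can be sketched as follows, using a two-state subset of the data above. The names `column` and `row` are illustrative; rows are fetched here with `loc`.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

# Square brackets select a column: a Series indexed by state.
column = states['area']

# A row is selected through loc: a Series indexed by column name.
row = states.loc['Texas']

# The `in` operator on a DataFrame tests column names, not row labels.
print('area' in states, 'Texas' in states)
print(row['area'])
```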

In [27]:
number = pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
number

        foo       bar
a  0.008694  0.258342
b  0.543602  0.374614
c  0.226685  0.665655


## Data Indexing and Selection

### Series

In [30]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [32]:
data['a':'c'] # slicing by explicit index

a    0.25
b    0.50
c    0.75
dtype: float64

In [33]:
data[0:2] # slicing by implicit integer index

a    0.25
b    0.50
dtype: float64

In [34]:
data[(data > 0.3) & (data < 0.8)] # masking with a boolean condition

b    0.50
c    0.75
dtype: float64

In [36]:
data[['a', 'd']] # fancy indexing with a list of labels

a    0.25
d    1.00
dtype: float64
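
One detail in the slices above is easy to miss: a slice by label includes its end point, while a slice by position excludes it. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

by_label = data['a':'c']   # label slice: the end label 'c' is included
by_position = data[0:2]    # positional slice: position 2 is excluded

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```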

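However, when a Series has an integer index, the explicit index (the labels) and the implicit index (the positions) are easy to confuse: indexing with a single integer uses the explicit index, while slicing uses the implicit one. Pandas provides special indexer attributes that explicitly expose each indexing scheme.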

In [37]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [38]:
data[1]

'a'

In [39]:
data[1:3]

3    b
5    c
dtype: object

First, the loc attribute allows indexing and slicing that always references the explicit index:

In [41]:
data.loc[1]

'a'

In [42]:
data.loc[1:3]

1    a
3    b
dtype: object

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:

In [44]:
data.iloc[1]

'b'

In [45]:
data.iloc[1:3]

3    b
5    c
dtype: object
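
The two indexers can be compared side by side on the same integer-indexed Series. This is only a sketch and the variable names are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = data.loc[3]          # label 3      -> 'b'
position_item = data.iloc[2]      # position 2   -> 'c'

label_slice = data.loc[1:3]       # labels 1 through 3, end label included
position_slice = data.iloc[1:3]   # positions 1 and 2, end position excluded

print(label_item, position_item)
print(list(label_slice), list(position_slice))
```

Using `loc` or `iloc` makes the intent explicit, which avoids the ambiguity that plain square brackets have with integer indexes.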

### DataFrame

In [50]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [51]:
data.iloc[:3, :2] # first three rows and first two columns, by position

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [52]:
data.loc[:'Illinois', :'pop'] # label slices include both end labels

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
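
The `loc` indexer also accepts a boolean mask for the rows together with a list of column labels. The following sketch rebuilds the same table and picks out the area of the states with more than 20 million inhabitants. That threshold and the names `large` and `first_cell` are arbitrary choices for the example.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'],
)

# Rows chosen by a boolean mask, columns by a list of labels.
large = data.loc[data['pop'] > 20_000_000, ['area']]
print(large)

# A single cell by position: first row, second column.
first_cell = data.iloc[0, 1]
print(first_cell)
```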


## Handling Missing Data

In [56]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull() # Detect null values; notnull() is the opposite and detects non-null values

0    False
1     True
2    False
3     True
dtype: bool

In [57]:
data.dropna() # Remove null values

0        1
2    hello
dtype: object
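
The boolean result of `isnull()` or `notnull()` can be used directly as a mask or summed to count missing entries. A short sketch with illustrative variable names:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

mask = data.notnull()                  # True where a value is present
kept = data[mask]                      # same rows that dropna() keeps
n_missing = int(data.isnull().sum())   # True counts as 1 when summed

print(list(mask))
print(n_missing)
```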

In [58]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df.dropna() # In DataFrame, dropna() will drop all rows in which any null value is present

     0    1  2
1  2.0  3.0  5


In [61]:
df.sum() # Column sums; null values are skipped by default

0     3.0
1     7.0
2    13.0
dtype: float64

In [60]:
df.dropna(axis=1) # axis=1 drops all columns containing a null value

   2
0  2
1  5
2  6


In [62]:
df.dropna(axis='columns', how='all') # how='all' drops only rows/columns in which every value is null; no column qualifies here

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
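
Since no column of `df` is entirely null, `how='all'` leaves it unchanged. The sketch below adds a hypothetical all-null column (labelled `3`, not part of the original data) so that the effect becomes visible, and also shows the `thresh` parameter, which keeps only rows or columns with at least that many non-null values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan  # hypothetical extra column that is entirely null

# Only column 3 is all null, so only it is dropped.
dropped_empty = df.dropna(axis='columns', how='all')
print(list(dropped_empty.columns))  # [0, 1, 2]

# Keep rows with at least three non-null values: only row 1 qualifies.
at_least_three = df.dropna(axis='index', thresh=3)
print(list(at_least_three.index))   # [1]
```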


In [65]:
data.fillna(0) # We can fill NA entries with a single value, such as zero

0        1
1        0
2    hello
3        0
dtype: object

In [67]:
data.ffill() # Forward-fill propagates the previous valid value forward (same result as the older fillna(method='ffill'))

0        1
1        1
2    hello
3    hello
dtype: object

In [68]:
df.bfill(axis=1) # Back-fill along an axis; here each gap takes the next valid value in its row

     0    1    2
0  1.0  2.0  2.0
1  2.0  3.0  5.0
2  4.0  4.0  6.0
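
Directional fills cannot fill a gap that has no valid neighbour on the relevant side. The sketch below uses a made-up Series `s` to show this, along with the `limit` parameter and a combined fill. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up example data with a leading, an inner, and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # the leading NaN stays: nothing precedes it
backward = s.bfill()         # the trailing NaN stays: nothing follows it
limited = s.ffill(limit=1)   # fill at most one consecutive NaN per gap
both = s.ffill().bfill()     # forward fill, then back-fill what remains

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
print(both.tolist())
```

Chaining the two fills is one simple way to remove every gap, at the cost of copying values in both directions.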
