# Pandas
## DataFrames - Initialization / Manipulation

In [18]:
import numpy as np
import pandas as pd

**A DataFrame is an analog of a two-dimensional NumPy array with both flexible
row indices and flexible column names.
It can also be viewed as a dictionary of Series objects that share the same index.**


**Initialize a DataFrame**  
Specify the data, index, and columns

In [19]:
data = pd.DataFrame(data=np.random.rand(3, 2), 
                    index=['a', 'b', 'c'],
                    columns=['foo', 'bar']
                   )

data

Unnamed: 0,foo,bar
a,0.221859,0.04239
b,0.001061,0.400966
c,0.916522,0.250875


**Initialize a DataFrame from a list of dictionaries**  
Keys become columns; the index is generated automatically unless it is set explicitly.  
If a key is missing from some of the dictionaries, pandas fills the missing entries with NaN.


In [20]:
data = pd.DataFrame([{'a': 1, 'b': 2}, 
                     {'b': 3, 'c': 4}], index=['ind1', 'ind2'])
data

Unnamed: 0,a,b,c
ind1,1.0,2,
ind2,,3,4.0
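
The NaN fill described above also affects column dtypes, which explains why `a` and `c` print as floats while `b` stays an integer. A minimal sketch of that behaviour follows; the names `records` and `frame` are illustrative and not part of the notebook.

```python
import pandas as pd

# Two records with partially overlapping keys (same shape as the cell above).
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
frame = pd.DataFrame(records, index=['ind1', 'ind2'])

# Columns that received a NaN are stored as float64, because NaN is a float.
# Column 'b' has a value in every row and stays an integer column.
print(frame.dtypes)

# Count the missing entries per column.
print(frame.isna().sum())
```

`isna()` returns a boolean frame of the same shape, so summing it gives one count per column.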


**Initialize a DataFrame from Series objects**

In [21]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')

df = pd.DataFrame({'area': area, 'population': population })
df.head()

Unnamed: 0,area,population
Alaska,1723337.0,
California,423967.0,38332521.0
New York,,19651127.0
Texas,695662.0,26448193.0


**Select a column**  
returns a Series

In [22]:
df['population']

Alaska               NaN
California    38332521.0
New York      19651127.0
Texas         26448193.0
Name: population, dtype: float64

**Select multiple columns from the DataFrame**

In [23]:
df[['area','population']]

Unnamed: 0,area,population
Alaska,1723337.0,
California,423967.0,38332521.0
New York,,19651127.0
Texas,695662.0,26448193.0


**Add a new column**

In [24]:
df['density']=df['population']/df['area']
df

Unnamed: 0,area,population,density
Alaska,1723337.0,,
California,423967.0,38332521.0,90.413926
New York,,19651127.0,
Texas,695662.0,26448193.0,38.01874


**Remove column**  
drop() removes rows (index labels) by default; specify axis=1 to remove a column.  
It returns a new DataFrame. Set **inplace=True** to modify the DataFrame itself.


In [25]:
df.drop('density', axis=1)

Unnamed: 0,area,population
Alaska,1723337.0,
California,423967.0,38332521.0
New York,,19651127.0
Texas,695662.0,26448193.0
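
Because drop() returns a new object, the original DataFrame keeps the column unless the result is reassigned or inplace=True is used. The sketch below illustrates this on a small made-up frame; the values and the names `frame`, `without_density`, and `same_result` are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({'area': [10.0, 20.0], 'population': [100.0, 400.0]},
                     index=['X', 'Y'])
frame['density'] = frame['population'] / frame['area']

# drop() returns a new DataFrame; the original still has the column.
without_density = frame.drop('density', axis=1)
print('density' in frame.columns)
print('density' in without_density.columns)

# The columns= keyword is an equivalent spelling of axis=1.
same_result = frame.drop(columns='density')

# Reassigning the result is an alternative to inplace=True.
frame = frame.drop(columns=['density'])
```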


**Select a row**

By label-based index with loc[] - returns a Series

In [26]:
df.loc['Alaska']

area          1723337.0
population          NaN
density             NaN
Name: Alaska, dtype: float64

By position-based index with iloc[] - also returns a Series

In [27]:
df.iloc[0]

area          1723337.0
population          NaN
density             NaN
Name: Alaska, dtype: float64

**Selecting rows and columns**

In [28]:
df.loc[['Alaska','New York'],['area','population']]

Unnamed: 0,area,population
Alaska,1723337.0,
New York,,19651127.0


In [29]:
df.iloc[[0,2],[0,1]]

Unnamed: 0,area,population
Alaska,1723337.0,
New York,,19651127.0


**Conditional selection**

In [30]:
df[df['population']>19751127.0]

Unnamed: 0,area,population,density
California,423967.0,38332521.0,90.413926
Texas,695662.0,26448193.0,38.01874


**Selecting with multiple conditions**  
Use & (and) and | (or), wrapping each condition in parentheses

In [31]:
df[(df['population']>19751127.0) & (df['density']<40)]

Unnamed: 0,area,population,density
Texas,695662.0,26448193.0,38.01874
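
The parentheses matter because & and | bind more tightly than comparison operators such as > and <. One way to keep expressions readable is to name each mask first, as in this sketch; the frame, its values, and the mask names are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'population': [38.3, 26.4, 19.7, 12.9],
                      'density': [90.4, 38.0, 139.1, 85.9]},
                     index=['CA', 'TX', 'NY', 'IL'])

# Each comparison yields a boolean Series aligned with the frame's index.
big = frame['population'] > 20
dense = frame['density'] > 100

either = frame[big | dense]       # rows matching at least one condition
neither = frame[~(big | dense)]   # ~ negates a boolean mask
```

Python's `and`, `or`, and `not` keywords do not work element-wise on Series, which is why the operator forms are used.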


**Reset Index**  
Old index becomes 'index' column.  
Use inplace=True to make the change permanent.

In [32]:
df.reset_index(inplace=True)
df

Unnamed: 0,index,area,population,density
0,Alaska,1723337.0,,
1,California,423967.0,38332521.0,90.413926
2,New York,,19651127.0,
3,Texas,695662.0,26448193.0,38.01874


**New Index**

In [33]:
newind = 'CA NY WY OR'.split()
newind

['CA', 'NY', 'WY', 'OR']

In [34]:
df['state']=newind
df

Unnamed: 0,index,area,population,density,state
0,Alaska,1723337.0,,,CA
1,California,423967.0,38332521.0,90.413926,NY
2,New York,,19651127.0,,WY
3,Texas,695662.0,26448193.0,38.01874,OR


Set the index from an existing column. The column moves into the index and the previous index is discarded.

In [35]:
df.set_index('state', inplace=True)
df

state,index,area,population,density
CA,Alaska,1723337.0,,
NY,California,423967.0,38332521.0,90.413926
WY,New York,,19651127.0,
OR,Texas,695662.0,26448193.0,38.01874
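
The sketch below shows two related options on a small stand-alone frame: the `drop` parameter of set_index, and reset_index as the inverse operation. The frame and the names `moved`, `kept`, and `restored` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['CA', 'NY'], 'area': [423967, 141297]})

# Default (drop=True): 'state' moves into the index and leaves the columns.
moved = frame.set_index('state')

# drop=False keeps 'state' as a regular column as well.
kept = frame.set_index('state', drop=False)

# reset_index() turns the index back into a column.
restored = moved.reset_index()
```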


**Index objects support many of the operations of Python’s built-in set data structure, 
so unions, intersections, and differences can be computed in a familiar way**

In [36]:
indA = pd.Index([1, 3, 5, 7, 9]) 
indB = pd.Index([2, 3, 5, 7, 11])

**Intersection**

In [37]:
indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')

**Union**

In [38]:
indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

**Symmetric difference**

In [39]:
indA.symmetric_difference(indB)

Int64Index([1, 2, 9, 11], dtype='int64')
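
The cells above do not show a one-sided difference. The sketch below collects the method forms of all four set operations; depending on the pandas version, the operator forms (&, |, ^) may be treated as element-wise logical operations instead of set operations, so the methods are the less ambiguous spelling. The variable names are illustrative.

```python
import pandas as pd

ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

common = ind_a.intersection(ind_b)          # labels present in both
combined = ind_a.union(ind_b)               # labels present in either
only_a = ind_a.difference(ind_b)            # labels in ind_a but not in ind_b
in_one = ind_a.symmetric_difference(ind_b)  # labels in exactly one of the two
```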

In [40]:
data.values

array([[ 1.,  2., nan],
       [nan,  3.,  4.]])

In [41]:
data.columns

Index(['a', 'b', 'c'], dtype='object')

**List data items**

In [42]:
list(data.items())

[('a',
  ind1    1.0
  ind2    NaN
  Name: a, dtype: float64),
 ('b',
  ind1    2
  ind2    3
  Name: b, dtype: int64),
 ('c',
  ind1    NaN
  ind2    4.0
  Name: c, dtype: float64)]

**Slicing**

**Notice that when you slice with an explicit index (e.g., data['a':'c']), 
the final index is included in the slice, while when you slice with an implicit 
index (e.g., data[0:2]), the final index is excluded from the slice.**


In [43]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [44]:
data['California':'New York']

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
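
The implicit (positional) case from the note above can be sketched next to the label case using loc and iloc, which make the intent explicit. The small frame here is a stand-alone copy of a few of the rows above, and `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'area': [423967, 695662, 141297, 170312]},
                     index=['California', 'Texas', 'New York', 'Florida'])

# Label slice: the end label 'New York' is included (3 rows).
by_label = frame.loc['California':'New York']

# Positional slice: the end position 2 is excluded (2 rows).
by_position = frame.iloc[0:2]
```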


**Transpose**

In [45]:
data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area,423967.0,695662.0,141297.0,170312.0,149995.0
pop,38332520.0,26448190.0,19651130.0,19552860.0,12882140.0
density,90.41393,38.01874,139.0767,114.8061,85.88376


**With loc, select rows (a boolean mask also works), then columns by name**

In [46]:
data.loc[data.density > 100, ['pop', 'density']]

Unnamed: 0,pop,density
New York,19651127,139.076746
Florida,19552860,114.806121


In [47]:
data[data['density']>100]['pop']

New York    19651127
Florida     19552860
Name: pop, dtype: int64

In [48]:
df = pd.DataFrame(np.random.rand(5,4),['A','B','C','D','E'],['W','X','Y','Z'])

In [49]:
df

Unnamed: 0,W,X,Y,Z
A,0.146352,0.039255,0.96342,0.978799
B,0.117589,0.37997,0.747851,0.293672
C,0.874467,0.471558,0.294971,0.340522
D,0.26866,0.912969,0.592781,0.964275
E,0.705044,0.152124,0.353609,0.685908


In [50]:
df[df['W']>0.2]

Unnamed: 0,W,X,Y,Z
C,0.874467,0.471558,0.294971,0.340522
D,0.26866,0.912969,0.592781,0.964275
E,0.705044,0.152124,0.353609,0.685908


**Apply function**  

Define a function

In [52]:
def times2(x):
    return x*2

Apply it to a column

In [55]:
df['W'].apply(times2)

A    0.292704
B    0.235179
C    1.748934
D    0.537320
E    1.410087
Name: W, dtype: float64

In [57]:
# Or use a lambda function instead
df['W'].apply(lambda x: x*2)

A    0.292704
B    0.235179
C    1.748934
D    0.537320
E    1.410087
Name: W, dtype: float64
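
For simple arithmetic like this, a vectorized expression gives the same values without calling a Python function per element, and apply can also work row by row with axis=1. The sketch below uses a small made-up frame; `doubled_apply`, `doubled_vector`, and `row_sum` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'W': [0.5, 1.5, 2.5], 'X': [1.0, 2.0, 3.0]},
                     index=['A', 'B', 'C'])

# Element-wise function applied to one column.
doubled_apply = frame['W'].apply(lambda x: x * 2)

# The vectorized form produces the same values.
doubled_vector = frame['W'] * 2

# axis=1 passes each row (as a Series) to the function.
row_sum = frame.apply(lambda row: row['W'] + row['X'], axis=1)
```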

**Sort Values**

In [60]:
df.sort_values(by=['W','Y'], ascending=False)

Unnamed: 0,W,X,Y,Z
C,0.874467,0.471558,0.294971,0.340522
E,0.705044,0.152124,0.353609,0.685908
D,0.26866,0.912969,0.592781,0.964275
A,0.146352,0.039255,0.96342,0.978799
B,0.117589,0.37997,0.747851,0.293672
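
When sorting by several columns, `ascending` can also be a list with one flag per column. The random values above contain no ties, so the second key never comes into play; the sketch below uses made-up data with ties to show it.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 1, 2], 'Y': [10, 20, 30, 40]},
                     index=['A', 'B', 'C', 'D'])

# Sort by W descending; break ties in W by Y ascending.
ordered = frame.sort_values(by=['W', 'Y'], ascending=[False, True])
print(ordered)
```

sort_values returns a new DataFrame, so `frame` itself keeps its original row order.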


**Pivot Table**

In [61]:
data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['x', 'y', 'x', 'y', 'x', 'y'],
        'D': [1, 3, 2, 5, 4, 1]}

In [64]:
df = pd.DataFrame(data, [0,1,2,3,4,5])
df

Unnamed: 0,A,B,C,D
0,foo,one,x,1
1,foo,one,y,3
2,foo,two,x,2
3,bar,two,y,5
4,bar,one,x,4
5,bar,one,y,1


In [67]:
df.pivot_table(index=['A','B'],columns=['C'], values='D')

A,B,x,y
bar,one,4.0,1.0
bar,two,,5.0
foo,one,1.0,3.0
foo,two,2.0,
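
pivot_table aggregates with the mean by default, which is why the integer values of D appear as floats; here every (A, B, C) combination occurs at most once, so the mean equals the single value. The sketch below rebuilds the same data and shows the `aggfunc` and `fill_value` parameters, plus a groupby/unstack route to the same layout. The names `by_mean`, `by_sum`, and `via_groupby` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                      'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                      'C': ['x', 'y', 'x', 'y', 'x', 'y'],
                      'D': [1, 3, 2, 5, 4, 1]})

# Default aggregation: mean of D per (A, B, C) combination.
by_mean = frame.pivot_table(index=['A', 'B'], columns='C', values='D')

# aggfunc switches the aggregation; fill_value replaces empty cells.
by_sum = frame.pivot_table(index=['A', 'B'], columns='C', values='D',
                           aggfunc='sum', fill_value=0)

# A similar table built with groupby + unstack.
via_groupby = frame.groupby(['A', 'B', 'C'])['D'].mean().unstack('C')
```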
