Learning Pandas
---

This file is a notebook for learning pandas.


![Pandas Basic](images/pandas_basics-1.png)
![Pandas](images/pandas-1.png)


# Pandas Basics
The Pandas library is built on NumPy and provides easy-to-use
data structures and data analysis tools for the Python
programming language.
## Import Convention

In [3]:
import pandas as pd

## Asking for Help

In [None]:
help(pd.Series.loc)

## Data Structures

### Series 
A one-dimensional labeled array
capable of holding any data type

In [49]:
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

s

a    3
b   -5
c    7
d    4
dtype: int64

### DataFrame
A two-dimensional labeled data structure with columns of potentially different types

In [12]:
data = {'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'Brasília'],
'Population': [11190846, 1303171035, 207847528]}

df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
#print(df)
df

   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasília   207847528
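
As a small illustration of "columns of potentially different types", the sketch below rebuilds the same sample frame and inspects the per-column dtypes. The variable name `population` is introduced here only for the example.

```python
import pandas as pd

# Same sample data as above; each column gets its own dtype.
df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528],
})

# One dtype per column: two text columns and one integer column.
print(df.dtypes)

# Selecting a single column returns a Series that shares the row index.
population = df['Population']
print(type(population).__name__)
```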


## Selection
### Getting

In [14]:
# get one element
s['b']

-5

In [11]:
# get subset of DataFrame
df[1:]

   Country    Capital  Population
1    India  New Delhi  1303171035
2   Brazil   Brasília   207847528


### Selecting, Boolean Indexing & Setting

#### By Position

In [13]:
df.iloc[[0],[0]]

   Country
0  Belgium


In [19]:
help(pd.DataFrame.iat)

Help on property:

    Access a single value for a row/column pair by integer position.
    
    Similar to ``iloc``, in that both provide integer-based lookups. Use
    ``iat`` if you only need to get or set a single value in a DataFrame
    or Series.
    
    Raises
    ------
    IndexError
        When integer position is out of bounds
    
    See Also
    --------
    DataFrame.at : Access a single value for a row/column label pair.
    DataFrame.loc : Access a group of rows and columns by label(s).
    DataFrame.iloc : Access a group of rows and columns by integer position(s).
    
    Examples
    --------
    >>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
    ...                   columns=['A', 'B', 'C'])
    >>> df
        A   B   C
    0   0   2   3
    1   0   4   1
    2  10  20  30
    
    Get value at specified row/column pair
    
    >>> df.iat[1, 2]
    1
    
    Set value at specified row/column pair
    
    >>> df.iat[1, 2] = 10
    >>> df.iat[1, 2]

In [20]:
df.iat[0,0]

'Belgium'

#### By Label

In [21]:
df.loc[[0],['Country']]

   Country
0  Belgium


In [25]:
#help(df.at)
df.at[0,'Country']

'Belgium'
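
With the default 0..n-1 index, position and label lookups return the same rows, which hides the difference between them. The sketch below uses a frame indexed by country name (an assumption made for this example, not the `df` above) to show where `.iloc` and `.loc` diverge.

```python
import pandas as pd

df_named = pd.DataFrame(
    {'Capital': ['Brussels', 'New Delhi', 'Brasília']},
    index=['Belgium', 'India', 'Brazil'],   # string labels instead of 0..2
)

by_position = df_named.iloc[1, 0]             # second row, first column
by_label = df_named.loc['India', 'Capital']   # row labelled 'India'
print(by_position, by_label)

# Label slices include the end point; positional slices do not.
label_slice = df_named.loc['Belgium':'India']   # 2 rows
position_slice = df_named.iloc[0:1]             # 1 row
print(len(label_slice), len(position_slice))
```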

#### By Label/Position

In [26]:
# .ix was removed in pandas 1.0; select the row at position 2 with .iloc
df.iloc[2]


Country          Brazil
Capital        Brasília
Population    207847528
Name: 2, dtype: object

In [28]:
# Select one column by label (.loc replaces the removed .ix)
df.loc[:,'Capital']


0     Brussels
1    New Delhi
2     Brasília
Name: Capital, dtype: object

In [29]:
# Select a single value by row label and column label (.loc replaces the removed .ix)
df.loc[1,'Capital']


'New Delhi'

#### Boolean Indexing

In [31]:
# Series s where value is not > 1
s[~(s>1)]

b   -5
dtype: int64

In [33]:
# s where value is <-1 or >2 (each comparison needs its own parentheses)
s[(s<-1) | (s>2)]

a    3
b   -5
c    7
d    4
dtype: int64

In [34]:
# Keep only the rows where Population exceeds 1.2 billion
df[df['Population']>1200000000]

   Country    Capital  Population
1    India  New Delhi  1303171035
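
Combining conditions is a common stumbling block because `|` and `&` bind more tightly than `<` and `>`. The following sketch rebuilds the original Series and shows the parenthesized form; the names `either` and `both` are chosen only for this example.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

# Parenthesize each comparison before combining them with | or &.
either = s[(s < -1) | (s > 2)]   # every value qualifies in this Series
both = s[(s > 0) & (s < 5)]      # only a and d lie strictly between 0 and 5
print(either)
print(both)
```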


#### Setting

In [37]:
# Set index a of Series s to 6
s['a'] = 6
s

a    6
b   -5
c    7
d    4
dtype: int64

## I/O

### Read and Write to CSV

In [None]:
pd.read_csv('file.csv', header=None, nrows=5)
df.to_csv('myDataFrame.csv')
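
A CSV round trip can be tried without touching the file system. This is a minimal sketch that uses an in-memory `io.StringIO` buffer in place of a real file (an assumption for the example). Passing `index=False` keeps the row index out of the file, so no extra unnamed column appears when it is read back.

```python
import io
import pandas as pd

df_csv = pd.DataFrame({'Country': ['Belgium', 'India'],
                       'Population': [11190846, 1303171035]})

buffer = io.StringIO()
df_csv.to_csv(buffer, index=False)   # index=False: do not write the row index
buffer.seek(0)                       # rewind so read_csv starts at the top

restored = pd.read_csv(buffer)
print(restored)
```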

### Read and Write to Excel

In [None]:
pd.read_excel('file.xlsx')
df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')

#### Read multiple sheets from the same file

In [None]:
xlsx = pd.ExcelFile('file.xls')
df = pd.read_excel(xlsx, 'Sheet1')

### Read and Write to SQL Query or Database Table

In [None]:
# read_sql() is a convenience wrapper around read_sql_table() and read_sql_query()
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
pd.read_sql("SELECT * FROM my_table;", engine)
pd.read_sql_table('my_table', engine)
pd.read_sql_query("SELECT * FROM my_table;", engine)

In [None]:
df.to_sql('myDf', engine)
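
The SQL calls above can be tried end to end without SQLAlchemy, since pandas also accepts a standard-library `sqlite3` connection. The sketch below is one possible round trip; the table name `my_table` and the sample rows are assumptions for illustration.

```python
import sqlite3
import pandas as pd

df_sql = pd.DataFrame({'Country': ['Belgium', 'India'],
                       'Population': [11190846, 1303171035]})

conn = sqlite3.connect(':memory:')              # throwaway in-memory database
df_sql.to_sql('my_table', conn, index=False)    # write the frame as a table

# Read back only the rows that match a SQL condition.
result = pd.read_sql_query(
    'SELECT * FROM my_table WHERE Population > 100000000', conn)
conn.close()
print(result)
```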

## Dropping

In [None]:
# Drop values from rows (axis=0)
s.drop(['a', 'c'])

# Drop values from columns(axis=1)
df.drop('Country', axis=1)

## Sort & Rank

In [None]:
# Sort by labels along an axis
df.sort_index()

# Sort by the values along an axis
df.sort_values(by='Country') 

# Assign ranks to entries
df.rank()
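
To make the three operations concrete, here is a small sketch on a Series with a tie. The data and the name `scores` are made up for the example. By default `rank()` gives tied entries the average of the ranks they span.

```python
import pandas as pd

scores = pd.Series([10, 30, 10, 20], index=['d', 'a', 'c', 'b'])

by_label = scores.sort_index()                    # ordered a, b, c, d
by_value = scores.sort_values(ascending=False)    # largest value first
ranks = scores.rank()                             # the two 10s share rank 1.5
print(by_label)
print(by_value)
print(ranks)
```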

## Retrieving Series/DataFrame Information

### Basic Information

In [None]:
# (rows,columns)
df.shape

# Describe index
df.index

# Describe DataFrame columns
df.columns

# Info on DataFrame
df.info()

# Number of non-NA values
df.count()


### Summary

In [None]:
# Sum of values
df.sum()

# Cumulative sum of values
df.cumsum()

# Minimum/maximum values
df.min()
df.max()

# Minimum/maximum index value
df.idxmin()
df.idxmax()

# Summary statistics
df.describe()

# Mean of values (numeric columns only)
df.mean(numeric_only=True)

# Median of values (numeric columns only)
df.median(numeric_only=True)

## Applying Functions

In [None]:
f = lambda x: x*2
# Apply function
df.apply(f)

# Apply function element-wise (applymap is deprecated since pandas 2.1 in favor of df.map)
df.applymap(f)
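
The difference between the two calls is easier to see with a function that reduces a column to a single value. In this sketch (sample data assumed), `apply` hands each column to the function as a Series, so the result can be either one value per column or a frame of the same shape.

```python
import pandas as pd

df_num = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# Reducing function: one value per column.
col_range = df_num.apply(lambda col: col.max() - col.min())

# Shape-preserving function: each column is doubled as a whole.
doubled = df_num.apply(lambda col: col * 2)
print(col_range)
print(doubled)
```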


## Data Alignment
### Internal Data Alignment
NA values are introduced in the indices that don’t overlap:

In [40]:
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
s + s3

a    13.0
b     NaN
c     5.0
d     7.0
dtype: float64

### Arithmetic Operations with Fill Methods
You can also do the internal data alignment yourself with
the help of the fill methods:

In [41]:
s.add(s3, fill_value=0)

a    13.0
b    -5.0
c     5.0
d     7.0
dtype: float64

In [42]:
s.sub(s3, fill_value=2)

a   -1.0
b   -7.0
c    9.0
d    1.0
dtype: float64

In [43]:
s.div(s3, fill_value=4)

a    0.857143
b   -1.250000
c   -3.500000
d    1.333333
dtype: float64

In [44]:
s.mul(s3, fill_value=3)

a    42.0
b   -15.0
c   -14.0
d    12.0
dtype: float64
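
One detail worth checking is what `fill_value` does when a value is missing on both sides. The sketch below uses two small Series invented for the example: the fill is applied when only one operand is missing, and the result stays NaN when both are.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, 2.0, np.nan], index=['a', 'b', 'c'])
right = pd.Series([10.0, np.nan], index=['a', 'c'])

# 'b' is absent from right, so 0 is used there; 'c' is missing in both.
total = left.add(right, fill_value=0)
print(total)
```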

# Pandas Advanced
## Reshaping Data
### Pivot

In [54]:
data2 = {
    'Date': ['2016-03-01', '2016-03-02', '2016-03-01',
             '2016-03-03', '2016-03-02', '2016-03-03',],
    'Type': ['a', 'b', 'c',
             'a', 'a', 'c',],
    'Value': [11.432, 13.031,20.784,
              99.906,1.303,20.784 ]
       }
#s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(data2, columns=['Date', 'Type', 'Value'])

# Spread rows into columns
df3= df2.pivot(index='Date',
                columns='Type',
                values='Value')


In [57]:
df2

         Date  Type   Value
0  2016-03-01     a  11.432
1  2016-03-02     b  13.031
2  2016-03-01     c  20.784
3  2016-03-03     a  99.906
4  2016-03-02     a   1.303
5  2016-03-03     c  20.784


In [56]:
df3

Type             a       b       c
Date
2016-03-01  11.432     NaN  20.784
2016-03-02   1.303  13.031     NaN
2016-03-03  99.906     NaN  20.784


### Pivot Table

In [58]:
# Spread rows into columns
df4 = pd.pivot_table(df2,
                     values='Value',
                     index='Date',
                     columns='Type')
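
The practical difference between `pivot` and `pivot_table` shows up when an index/column pair occurs more than once. This sketch uses a small frame with one duplicated pair (sample data assumed): `pivot` refuses to reshape it, while `pivot_table` aggregates the duplicates.

```python
import pandas as pd

sales = pd.DataFrame({
    'Date': ['2016-03-01', '2016-03-01', '2016-03-02'],
    'Type': ['a', 'a', 'b'],          # ('2016-03-01', 'a') appears twice
    'Value': [10.0, 20.0, 5.0],
})

try:
    sales.pivot(index='Date', columns='Type', values='Value')
    pivot_failed = False
except ValueError:
    pivot_failed = True               # duplicate entries cannot be reshaped

# pivot_table combines the duplicates with the chosen aggregation.
table = pd.pivot_table(sales, values='Value', index='Date',
                       columns='Type', aggfunc='mean')
print(pivot_failed)
print(table)
```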

### Stack/Unstack

In [None]:
# Pivot a level of column labels
stacked = df5.stack()
# Pivot a level of index labels
stacked.unstack()
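
A round trip on a tiny frame shows what the two calls do to the labels. The frame `wide` is made up for this sketch: `stack()` moves the column labels into an inner index level, and `unstack()` moves that level back out to columns.

```python
import pandas as pd

wide = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

stacked_wide = wide.stack()           # Series indexed by (row, column) pairs
restored_wide = stacked_wide.unstack()  # back to the wide layout
print(stacked_wide)
print(restored_wide)
```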

### Melt

In [47]:
# Gather columns into rows
pd.melt(df2, 
        id_vars=["Date"],
        value_vars=["Type", "Value"],
        value_name="Observations")


## Advanced Indexing

#### Selecting

In [48]:
# Select cols with any vals >1
df3.loc[:,(df3>1).any()]

# Select cols where all vals > 1
df3.loc[:,(df3>1).all()]

# Select cols with NaN
df3.loc[:,df3.isnull().any()]

# Select cols without NaN
df3.loc[:,df3.notnull().all()]


#### Indexing With isin

In [None]:
# Find same elements
df[(df.Country.isin(df2.Type))]

# Filter on values
df3.filter(items=["a","b"])

# Select specific elements (DataFrame.select was removed in pandas 1.0;
# a boolean mask on the index does the same job)
df[df.index % 5 == 0]

#### Where

In [None]:
# Subset the data
s.where(s > 0)
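
Unlike boolean indexing, `where` keeps the original shape. A short sketch of the difference, using the Series values from earlier in this notebook (the variable names are chosen for the example):

```python
import pandas as pd

s_demo = pd.Series([6, -5, 7, 4], index=['a', 'b', 'c', 'd'])

masked = s_demo.where(s_demo > 0)        # same length; failing entries become NaN
filtered = s_demo[s_demo > 0]            # shorter; failing entries are dropped
replaced = s_demo.where(s_demo > 0, 0)   # second argument supplies the replacement
print(masked)
print(filtered)
print(replaced)
```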

#### Query

In [None]:
# Query DataFrame
df6.query('second > first')
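
The expression above refers to `first` and `second`, which are index level names rather than columns. The following self-contained sketch, with a small assumed frame, shows that `query` can reference named index levels in the same way as column names.

```python
import pandas as pd

levels = pd.MultiIndex.from_tuples([(1, 5), (2, 4), (3, 3)],
                                   names=['first', 'second'])
frame = pd.DataFrame({'value': [0.1, 0.2, 0.3]}, index=levels)

# Named index levels can be used in the expression like columns.
subset = frame.query('second > first')
print(subset)
```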

### Setting/Resetting Index

In [None]:
# Set the index
df.set_index('Country')

# Reset the index
df4 = df.reset_index()

# Rename DataFrame
df = df.rename(index=str, columns= {"Country":"cntry","Population":"ppltn"})

### Reindexing

In [None]:
s2 = s.reindex(['a','c','d','e','b'])

#### Forward Filling

In [None]:
df.reindex(range(4),method='ffill')

#### Backward Filling

In [None]:
s3 = s.reindex(range(5),method='bfill')
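
The two fill directions are easiest to compare on a Series with gaps in an integer index. In this sketch (data assumed for the example), forward filling copies the last earlier value into each new label, and backward filling copies the next later one, leaving NaN where none exists.

```python
import pandas as pd

sparse = pd.Series([10, 20, 30], index=[0, 2, 4])

forward = sparse.reindex(range(6), method='ffill')    # carry earlier values forward
backward = sparse.reindex(range(6), method='bfill')   # pull later values backward
print(forward)
print(backward)
```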

### MultiIndexing

In [None]:
import numpy as np

arrays = [np.array([1,2,3]),
          np.array([5,4,3])]

df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples,
                                  names=['first', 'second'])
df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
df2.set_index(["Date", "Type"])

## Duplicate Data

In [None]:
# Return unique values
s3.unique()

# Check duplicates
df2.duplicated('Type')

# Drop duplicates
df2.drop_duplicates('Type', keep='last')

# Check index duplicate
df.index.duplicated()


## Grouping Data

#### Aggregation

In [None]:
df2.groupby(by=['Date','Type']).mean()
df4.groupby(level=0).sum()
df4.groupby(level=0).agg({'a':lambda x:sum(x)/len(x),
                          'b': 'sum'})

#### Transformation

In [None]:
customSum = lambda x: (x+x%2)
df4.groupby(level=0).transform(customSum)
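
Aggregation returns one row per group, while transformation returns a result aligned with the original rows. The sketch below uses a small frame invented for the example and combines the two to compute each row's share of its group total.

```python
import pandas as pd

df_groups = pd.DataFrame({'Type': ['a', 'b', 'a', 'b'],
                          'Value': [1.0, 2.0, 3.0, 6.0]})

grouped = df_groups.groupby('Type')['Value']
totals = grouped.sum()                                  # one value per group
share = df_groups['Value'] / grouped.transform('sum')   # one value per original row
print(totals)
print(share)
```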

## Missing Data

In [None]:
# Drop NaN values
df.dropna()

# Fill NaN values with a predetermined value
df3.fillna(df3.mean())

# Replace values with others
df2.replace("a", "f")


## Combining Data

In [None]:
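# Two small example frames that share the key column X1 (illustrative values)
data1 = pd.DataFrame({'X1': ['a', 'b', 'c'], 'X2': [11.432, 1.303, 99.906]})
data2 = pd.DataFrame({'X1': ['a', 'b', 'd'], 'X3': [20.784, 13.031, 20.784]})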

### Merge

In [None]:
# left
pd.merge(data1,data2,how='left',on='X1')

In [None]:
# right
pd.merge(data1,data2,how='right',on='X1')

In [None]:
# inner
pd.merge(data1,data2,how='inner',on='X1')

In [None]:
# outer
pd.merge(data1,data2,how='outer',on='X1')
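
To compare the four `how` options side by side, this sketch merges two small frames whose keys only partly overlap. The frames `left` and `right` and their values are assumptions for the example. It collects which `X1` keys survive each kind of merge.

```python
import pandas as pd

left = pd.DataFrame({'X1': ['a', 'b', 'c'], 'X2': [1, 2, 3]})
right = pd.DataFrame({'X1': ['a', 'b', 'd'], 'X3': [10, 20, 40]})

keys = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(left, right, how=how, on='X1')
    keys[how] = merged['X1'].tolist()   # keys kept by this merge type
    print(how, keys[how])
```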

### Join

In [None]:
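# join aligns on the index, so use the shared key column X1 as the index first
data1.set_index('X1').join(data2.set_index('X1'), how='right')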

### Concatenate

#### Vertical

In [None]:
# Series.append was removed in pandas 2.0; pd.concat stacks the Series vertically
pd.concat([s, s2])

#### Horizontal/Vertical

In [None]:
pd.concat([s,s2],axis=1, keys=['One','Two'])
pd.concat([data1, data2], axis=1, join='inner')

## Dates

In [None]:
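from datetime import datetime

df2['Date']= pd.to_datetime(df2['Date'])
# Month-end dates; pandas 2.2+ prefers the alias 'ME' over 'M'
df2['Date']= pd.date_range('2000-1-1',
                           periods=6,
                           freq='M')
dates = [datetime(2012,5,1), datetime(2012,5,2)]
index = pd.DatetimeIndex(dates)
end = datetime(2012,12,31)  # example end date (assumed)
# Business month-end dates; pandas 2.2+ prefers 'BME' over 'BM'
index = pd.date_range(datetime(2012,2,1), end, freq='BM')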
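
Once a column holds real datetimes, its parts are available through the `.dt` accessor. A minimal sketch, reusing the date strings from the pivot example (the variable names are chosen here):

```python
import pandas as pd

date_strings = pd.Series(['2016-03-01', '2016-03-02', '2016-03-03'])
parsed = pd.to_datetime(date_strings)        # strings become datetime values

weekdays = parsed.dt.day_name().tolist()     # weekday name for each date
daily = pd.date_range('2016-03-01', periods=3, freq='D')   # same dates as an index
print(weekdays)
print(daily)
```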

## Visualization
__Also see Matplotlib__

In [None]:
import matplotlib.pyplot as plt
s.plot()
plt.show()
df2.plot()
plt.show()