# How to use data in Pandas
[Pandas](http://pandas.pydata.org/pandas-docs/stable/) uses the dataframe as its main data container, often written as **pd.DataFrame**. A *DataFrame is not a plain Python array*, and for a very good reason: it carries metadata along with the data, which helps the magical calls down the road stay magical and easy to use for a data scientist.

This page demonstrates different ways to access data in a dataframe.
We need a toy dataframe for the demonstration. Let's create one that consists of random numbers arranged in 6 rows and 4 columns.

In [1]:
import pandas as pd
import numpy as np

np.random.seed(42)
df1 = pd.DataFrame(
        np.random.randn(6,4),       # content
        index=[f'idx{i}' for i in range(0,12,2)],  # row's index
        columns=['a','b','c','d']   # column names
        )

print(df1)

              a         b         c         d
idx0   0.496714 -0.138264  0.647689  1.523030
idx2  -0.234153 -0.234137  1.579213  0.767435
idx4  -0.469474  0.542560 -0.463418 -0.465730
idx6   0.241962 -1.913280 -1.724918 -0.562288
idx8  -1.012831  0.314247 -0.908024 -1.412304
idx10  1.465649 -0.225776  0.067528 -1.424748
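
The metadata mentioned at the top of the page can be inspected directly. Here is a minimal sketch that rebuilds the same toy frame (the name `toy` is only for illustration) and prints its shape, row labels, column names and dtypes:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
toy = pd.DataFrame(
    np.random.randn(6, 4),
    index=[f'idx{i}' for i in range(0, 12, 2)],
    columns=['a', 'b', 'c', 'd'],
)

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.index))    # row labels
print(list(toy.columns))  # column names
print(toy.dtypes)         # one dtype per column
```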


## Access = select.
Let's start by selecting a row in a few different ways. We will select the third row, 'idx4'.

### By position
Specifying a single number with `iloc` selects a row as a Series. This should be close to your intuition about plain Python arrays. A single row of data is wrapped in **pd.Series**; the whole table of data is wrapped in **pd.DataFrame**. Getting a series out of a dataframe this way is called 'a cross section' in pandas terms.

However, the most common way to make a selection is to specify both start and end positions. Note that such a selection returns a dataframe with one row, not just a row of data as above. Please be careful with \[\] versus \(\): `iloc` is an indexer and takes square brackets, not a function call.

In [2]:
pos = 2

#  a cross section
s = df1.iloc[pos]
print(s,'\n', type(s),'\n')

# one-row select
df = df1[pos:pos+1]
print(df,'\n', type(df),'\n')


a   -0.469474
b    0.542560
c   -0.463418
d   -0.465730
Name: idx4, dtype: float64 
 <class 'pandas.core.series.Series'> 

             a        b         c        d
idx4 -0.469474  0.54256 -0.463418 -0.46573 
 <class 'pandas.core.frame.DataFrame'> 
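
Both selections can also be written with `iloc` alone, which makes the positional intent explicit. A small sketch on a stand-in frame (the name `toy` and its integer values are made up for illustration): an integer gives a Series, while a one-element list or a one-row slice gives a DataFrame.

```python
import numpy as np
import pandas as pd

# Stand-in frame with easy-to-read integer values (illustrative only)
toy = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=[f'idx{i}' for i in range(0, 12, 2)],
    columns=['a', 'b', 'c', 'd'],
)
pos = 2

row_as_series = toy.iloc[pos]         # integer -> Series (cross section)
row_as_frame = toy.iloc[[pos]]        # list with one position -> DataFrame
row_as_slice = toy.iloc[pos:pos + 1]  # slice, end excluded -> DataFrame

print(type(row_as_series).__name__, row_as_series.name)
print(type(row_as_frame).__name__, row_as_frame.shape)
print(row_as_frame.equals(row_as_slice))
```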



### By index
In the pandas documentation and code, an **index** is also referred to as a **label** or **key**. The terms key and index originate in SQL databases. Note the use of '**:**' for indexing: a slice selects rows by label, and both endpoints are included.


In [3]:
df = df1['idx4' : 'idx4']
print(df,'\n', type(df),'\n')

# a call with just one string is interpreted as a select by COLUMN
# don't put a row index label here
df = df1['a']
print(df,'\n', type(df),'\n')

             a        b         c        d
idx4 -0.469474  0.54256 -0.463418 -0.46573 
 <class 'pandas.core.frame.DataFrame'> 

idx0     0.496714
idx2    -0.234153
idx4    -0.469474
idx6     0.241962
idx8    -1.012831
idx10    1.465649
Name: a, dtype: float64 
 <class 'pandas.core.series.Series'> 
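
To select rows by label without the slice trick, `.loc` can be used. The sketch below uses a stand-in frame (the name `toy` and its values are illustrative assumptions) and also contrasts label slices, which include the end label, with positional slices, which exclude the end position.

```python
import numpy as np
import pandas as pd

# Stand-in frame with easy-to-read integer values (illustrative only)
toy = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=[f'idx{i}' for i in range(0, 12, 2)],
    columns=['a', 'b', 'c', 'd'],
)

one_row_series = toy.loc['idx4']        # single label -> Series
one_row_frame = toy.loc[['idx4']]       # list of labels -> DataFrame

label_slice = toy.loc['idx4':'idx8']    # end label 'idx8' is included
position_slice = toy.iloc[2:4]          # end position 4 is excluded

print(type(one_row_series).__name__, type(one_row_frame).__name__)
print(len(label_slice), len(position_slice))
```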



### Subset
By far the most common transformation is getting a subset of a dataframe. Let's get columns b, c and rows idx4 to idx8. Remember that in `.loc` the `slice of row indexes` (written with '**:**') comes first, then a '**,**', then the `list of columns`.

In [17]:
df = df1.loc['idx4':'idx8', ['b','c']]
print(df,'\n', type(df),'\n')

df = df1[['b','c']]['idx4':'idx8']
print(df,'\n', type(df),'\n')

df = df1['idx4':'idx8'][['b','c']]
print(df,'\n', type(df),'\n')

             b         c
idx4  0.542560 -0.463418
idx6 -1.913280 -1.724918
idx8  0.314247 -0.908024 
 <class 'pandas.core.frame.DataFrame'> 

             b         c
idx4  0.542560 -0.463418
idx6 -1.913280 -1.724918
idx8  0.314247 -0.908024 
 <class 'pandas.core.frame.DataFrame'> 

             b         c
idx4  0.542560 -0.463418
idx6 -1.913280 -1.724918
idx8  0.314247 -0.908024 
 <class 'pandas.core.frame.DataFrame'> 
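
The same subset can be taken by position as well. This is a sketch on a stand-in frame (the name `toy` and its values are illustrative assumptions) that looks up the positions with `Index.get_loc` and `Index.get_indexer`, then checks that `.iloc` and `.loc` agree.

```python
import numpy as np
import pandas as pd

# Stand-in frame with easy-to-read integer values (illustrative only)
toy = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=[f'idx{i}' for i in range(0, 12, 2)],
    columns=['a', 'b', 'c', 'd'],
)

by_label = toy.loc['idx4':'idx8', ['b', 'c']]

# Translate labels into integer positions
first = toy.index.get_loc('idx4')
last = toy.index.get_loc('idx8')
col_positions = toy.columns.get_indexer(['b', 'c'])

# Positional slices exclude the end, hence last + 1
by_position = toy.iloc[first:last + 1, col_positions]

print(by_label.equals(by_position))
```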



Learn more in the pandas documentation on [indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics). Whitening the data deserves its own how-to, coming later.

## Save the data before analysis
It often takes a few iterations to get an analysis right. After all your hard work on slicing and cleaning the data, don't lose it. You can save the dataframe to a CSV file and load it later to redo your calculations as many times as you need. :) Note that `to_csv` saves the row index and column names along with the data; pass `index_col=0` to `read_csv` to restore the index.

In [22]:
df1.to_csv("random.csv")

df2 = pd.read_csv("random.csv", index_col=0)
df2

              a         b         c         d
idx0   0.496714 -0.138264  0.647689  1.523030
idx2  -0.234153 -0.234137  1.579213  0.767435
idx4  -0.469474  0.542560 -0.463418 -0.465730
idx6   0.241962 -1.913280 -1.724918 -0.562288
idx8  -1.012831  0.314247 -0.908024 -1.412304
idx10  1.465649 -0.225776  0.067528 -1.424748


In [24]:
df2.dtypes

a    float64
b    float64
c    float64
d    float64
dtype: object

In [25]:
df2.index

Index(['idx0', 'idx2', 'idx4', 'idx6', 'idx8', 'idx10'], dtype='object')
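
A quick way to confirm that nothing was lost is to compare the reloaded frame with the original. The sketch below is a hedged variant of the cells above: it writes to an in-memory `io.StringIO` buffer instead of `random.csv` so it leaves no file behind, and the names `original` and `restored` are illustrative. Values are compared with `np.allclose` rather than exact equality, since the floats pass through text.

```python
import io

import numpy as np
import pandas as pd

np.random.seed(42)
original = pd.DataFrame(
    np.random.randn(6, 4),
    index=[f'idx{i}' for i in range(0, 12, 2)],
    columns=['a', 'b', 'c', 'd'],
)

# Write CSV text to memory, then read it back
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

same_labels = list(restored.index) == list(original.index)
same_columns = list(restored.columns) == list(original.columns)
same_values = np.allclose(restored.to_numpy(), original.to_numpy())

print(same_labels, same_columns, same_values)
```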