<center><img src="img/pandas.png" alt="drawing" width="150"/></center>

# Pandas

Pandas is a free software library for the Python programming language, built for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. It is also a play on the phrase "Python data analysis".

In [None]:
import pandas as pd

## Pandas Series

A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.

In [57]:
s1 = pd.Series([34,'john','doe'])
s1

0      34
1    john
2     doe
dtype: object

In [59]:
# or including labels
s2 = pd.Series({
    'age':34, 
    'name':'john',
    'surname':'doe'
}) 
s2

age          34
name       john
surname     doe
dtype: object
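
As a minimal sketch of how such a labelled Series can be read back, the example below looks up values by label and by position. The variable names (`person`, `numbers`) and the data are made up for illustration.

```python
import pandas as pd

# Illustrative data only, similar in shape to the labelled Series above
person = pd.Series({'age': 34, 'name': 'john', 'surname': 'doe'})

# Label-based access uses the index labels
age_by_label = person['age']

# Position-based access uses .iloc, counting from 0
name_by_position = person.iloc[1]

# The labels are stored in the index
labels = list(person.index)

# When all values share one numeric type, the dtype is numeric rather than object
numbers = pd.Series([1, 2, 3])
print(age_by_label, name_by_position, labels, numbers.dtype)
```

Mixing strings and numbers, as in `person`, is what leads to the generic `object` dtype shown in the outputs above.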

## Dataframes

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns. In essence, a dataframe is a collection of Series that share the same index.

In [60]:
# Dataframe: every dictionary is a new row, and each dictionary key is a column
df = pd.DataFrame([
    {'age':34, 'name':'john', 'surname':'doe'},         
    {'age':43, 'name':'alice', 'surname':'cooper'}
])                        
df

   age   name surname
0   34   john     doe
1   43  alice  cooper
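
To illustrate the idea that a dataframe is a collection of Series, here is a small sketch that builds one column by column and then pulls a column back out. The names `ages`, `names` and `people` are assumptions made for this example.

```python
import pandas as pd

# Two Series with the same default index (illustrative values)
ages = pd.Series([34, 43])
names = pd.Series(['john', 'alice'])

# Each dictionary key becomes a column name, each Series becomes a column
people = pd.DataFrame({'age': ages, 'name': names})

# Selecting a single column gives back a Series
age_column = people['age']
print(type(age_column).__name__)
print(people.shape)
```

Building from a dictionary of columns is the mirror image of the list-of-rows construction used above; both should produce the same kind of table.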


More often than not, one loads a dataframe instead of creating it.

In [None]:
# load data as a DataFrame
df = pd.read_table('path_to_csv',
                   sep = ',', # defines the separator. Default of read_table: tabs
                   header = None, # the file has no header row (header = 0 means the first row is the header)
                   usecols = [0,4], # specific columns to read. Also works with the names of the columns
                   names = ['col1','col2'], # names of the columns. If the file has a header row, use header = 0 to replace it
                   index_col = 'col1', # which column is the index. If None, a default integer index is created
                   skiprows = 12, # number of lines to skip at the start (a list skips those 0-indexed line numbers)
                   nrows = 12) # read only the first 12 rows

# save a dataframe as a csv
df.to_csv('name')
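
Since the cell above needs a real file to run, the following sketch does a similar load and save entirely in memory. The CSV text, the column names and the use of `io.StringIO` in place of a path are assumptions for illustration; `pd.read_csv` is used here because it defaults to a comma separator.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = "country,population,capital\nGreece,10.4,Athens\nItaly,58.9,Rome\n"

# Read only two of the three columns and use 'country' as the index
countries = pd.read_csv(io.StringIO(csv_text),
                        usecols=['country', 'population'],
                        index_col='country')

# With no path given, to_csv returns the CSV text as a string
round_trip = countries.to_csv()
print(round_trip)
```

Passing a file path instead of the `StringIO` object, and a file name to `to_csv`, gives the on-disk version of the same workflow.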

Dataframes carry some basic methods and attributes. Here are just some of the most useful ones.

In [None]:
# show the 10 first rows. Default value = 5
df.head(10) 

# Show the 10 last rows. Default value = 5
df.tail(10)

# basic statistics per column
df.describe() 

# tuple: (rows, columns)
df.shape

# type of the values in each column
df.dtypes   

# information about the index
df.index 

# use column col1 as the index
df.set_index('col1')

# replace the index with a default integer index; the old index becomes a column unless drop=True
df.reset_index(drop=True)    

# the names of the columns (an Index object). Assign a list of the same length to rename them all.
df.columns

# drops col1 from df (axis=0 -> rows, axis=1 -> columns)
df.drop('col1', axis=1)  

# renames columns (without columns=, rename targets the index labels instead)
df.rename(columns={'old_name':'new_name'}) 

# selects column named 'col1' from Dataframe df, in either of two ways. This object is a pandas Series.
df.col1
df['col1']

# unique values of column col1
df.col1.unique()

# number of unique values of column col1.
df.col1.nunique()

# count occurrences of each unique value (normalize = True gives proportions instead of raw counts)
df.col1.value_counts(normalize = True)   

# conditional selection: keep the rows where col1 equals 'hey'
df[df.col1 == 'hey']

# access a group of rows and columns by label(s)
df.loc['Greece', 'col1']     

# purely integer-location based indexing for selection by position
df.iloc[12,1]     

# sort values of dataframe (default ascending = True)
df.sort_values('col1', ascending = False)  

# group rows by col1 and compute several statistics per group for every other column
df.groupby('col1').agg(['min', 'max', 'mean'])
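
The calls above refer to a generic `df`, so here is a short sketch that runs a few of them on a tiny table. The `sales` data and all column names are invented for this example, and the aggregation is limited to the numeric `amount` column.

```python
import pandas as pd

# Small made-up table; names and values are illustrative only
sales = pd.DataFrame({
    'country': ['Greece', 'Greece', 'Italy'],
    'city': ['Athens', 'Patras', 'Rome'],
    'amount': [10, 20, 30],
})

# set_index returns a new dataframe, so the result is stored
by_city = sales.set_index('city')

# Conditional selection with a boolean mask
greek_sales = sales[sales['country'] == 'Greece']

# Label-based and position-based lookup of the same cell
rome_by_label = by_city.loc['Rome', 'amount']
rome_by_position = by_city.iloc[2, 1]

# One row per country, one column per statistic
stats = sales.groupby('country')['amount'].agg(['min', 'max', 'mean'])
print(stats)
```

Note that methods such as `set_index`, `drop`, `rename` and `sort_values` return a new object by default and leave the original unchanged, which is why the results are assigned to new names here.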