# Development Essentials course

## Python for Data Analysis - part 2

### Pandas

[Pandas](https://pandas.pydata.org/) aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

In [None]:
import pandas as pd

pd.__version__

The two primary data structures of pandas, `Series` (1-dimensional) and `DataFrame` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. 

#### Series

In [None]:
# Series can be instantiated from dicts

d = {1: 'ab', 3: 'bc', 4: 'cd'}
s = pd.Series(d)
print(s)

In [None]:
type(s)

In [None]:
# Every series has an index

s.index

In [None]:
# Series elements can be addressed via index labels

s[1]
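
Because the keys in this example are integers, `s[1]` looks up the label `1`, not the second position. A minimal sketch of the two explicit accessors, `.loc` (by label) and `.iloc` (by position), on the same data; the names `s_int`, `by_label` and `by_position` are introduced here for illustration only:

```python
import pandas as pd

# Same data as above: integer labels 1, 3, 4
s_int = pd.Series({1: 'ab', 3: 'bc', 4: 'cd'})

by_label = s_int.loc[1]      # element whose label is 1
by_position = s_int.iloc[1]  # element at position 1 (the second element)

print(by_label)
print(by_position)
```

Using `.loc` or `.iloc` explicitly makes the intent clear when labels are integers.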

In [None]:
# Index labels do not have to be integers, strings work too

d = {'b': 1, 'a': 0, 'c': 2}
s = pd.Series(d)
print(s)
print('element with index `b`:', s['b'])

In [None]:
# Series can be instantiated from numpy array

import numpy as np

a = np.array([7, 3, 5, 2, 5, 8, 11])
s = pd.Series(a)
print('Series from array:')
print(s)

s = pd.Series(a, index=['a', 'b', 'c', 'a', 'e', 'd', 'g'])
print('\nSeries from array with index:')
print(s)
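
The custom index above contains the label `'a'` twice, which pandas allows. A small sketch of what label lookup returns in that case; the names `s_dup`, `both_a` and `only_b` are assumptions made for this example:

```python
import numpy as np
import pandas as pd

a = np.array([7, 3, 5, 2, 5, 8, 11])
s_dup = pd.Series(a, index=['a', 'b', 'c', 'a', 'e', 'd', 'g'])

both_a = s_dup['a']  # label 'a' occurs twice, so the result is a Series
only_b = s_dup['b']  # label 'b' is unique, so the result is a scalar

print(both_a)
print(only_b)
print(s_dup.index.is_unique)  # reports whether all labels are distinct
```

Checking `index.is_unique` before relying on scalar lookups can help avoid surprises with repeated labels.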

In [None]:
# Series can be instantiated from a scalar value

s = pd.Series(5.0, index=['a', 'b', 'c', 'd', 'e'])
print(s)

### DataFrames

Each column in a `DataFrame` is a `Series`.
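
The statement above can be checked directly. This is a minimal sketch with a tiny hypothetical frame (the names `frame` and `col` are not from the course material):

```python
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]}, index=['a', 'b'])

col = frame['one']  # selecting a single column yields a Series

print(type(col))
print(col.name)                        # the Series keeps the column name
print(col.index.equals(frame.index))   # and shares the row index of the frame
```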

#### Creating

In [None]:
# DataFrame can be instantiated from dictionary of Series
d = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}
df1 = pd.DataFrame(d)
print(df1)

In [None]:
# Notebooks render a DataFrame as a formatted table,
# so there is no need to `print` it: leave it as the last expression in the cell

df1

In [None]:
# ...or use `display` for pretty view

display(df1)

In [None]:
# We can set the index when creating a dataframe

df2 = pd.DataFrame(d, index=['d', 'b', 'a'])
df2

In [None]:
# Select columns at creation: a name missing from the data (`three`) becomes an all-NaN column

df3 = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
df3
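
The `columns` argument both selects and orders columns. A short sketch, reusing the same dictionary of Series, that inspects the result: the key `'one'` is dropped, and `'three'`, which has no data behind it, is filled with `NaN`.

```python
import pandas as pd

d = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}
df3 = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

print(list(df3.columns))          # only the requested columns remain
print(df3['two'].tolist())        # values follow the requested row order d, b, a
print(df3['three'].isna().all())  # the unknown column holds only missing values
```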

In [None]:
# Create dataframe from dictionary of lists

d = {
    'one': [1.0, 2.0, 3.0, 4.0], 
    'two': [4.0, 3.0, 2.0, 1.0]
}
df = pd.DataFrame(d)
df

In [None]:
# ...with custom index

df = pd.DataFrame(d, index=['a', 'b', 'c', 'd']) 
df

#### Navigating elements

In [None]:
d = {
    'name': ['Joe', 'Nat', 'Harry', 'Sam', 'Monica'], 
    'age': [20, 21, 19, 20, 22],
    'Marks': [85.10, 77.80, 91.54, 88.78, 60.55], 
    'Grade': ['A', 'B', 'A', 'A', 'B'],
    'Hobby': ['Swimming', 'Reading', 'Music', 'Painting', 'Dancing']
}
df = pd.DataFrame(d, index=['S1', 'S2', 'S3', 'S4', 'S5'])
df.head()  # display the first five rows

In [None]:
df.loc['S4', 'Marks']  # `loc` selects an element by row and column labels

In [None]:
df.loc['S4', :]  # take a row

In [None]:
df.loc[:, 'Marks']  # take a column

In [None]:
df.iloc[3, 2]  # take an element by integer positions (row 3, column 2)

In [None]:
df.iloc[3, :]  # take a row by its integer position

In [None]:
df.iloc[:, 2]  # take a column by its integer position
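
One difference between the two accessors shows up with slices: a label slice with `loc` includes its end label, while a positional slice with `iloc` excludes its end position, as ordinary Python slices do. A minimal sketch on a reduced copy of the student table (the name `students` is an assumption for this example):

```python
import pandas as pd

students = pd.DataFrame(
    {'age': [20, 21, 19, 20, 22], 'Marks': [85.10, 77.80, 91.54, 88.78, 60.55]},
    index=['S1', 'S2', 'S3', 'S4', 'S5'],
)

by_label = students.loc['S2':'S4', 'Marks']  # label slice: 'S4' is included
by_position = students.iloc[1:3, 1]          # positional slice: position 3 is excluded

print(by_label)
print(by_position)
```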

#### Working with columns

In [None]:
d = {
    'one': [1.0, 2.0, 3.0, 4.0], 
    'two': [4.0, 3.0, 2.0, 1.0]
}
df = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
df

In [None]:
df['three'] = df['one'] * df['two']  # new column as the element-wise product of two columns
df

In [None]:
df['flag'] = df['one'] > 2  # new column as the result of logical operation
df

In [None]:
del df['two']  # simply delete a specified column
df

In [None]:
three = df.pop('three')  # remove the column from df and return it
df

In [None]:
# Here is extracted column

three

In [None]:
# ...it is Series

type(three)

In [None]:
df['foo'] = 'bar'  # new column as a constant
df

In [None]:
df.insert(1, 'bar', 42)  # insert column `bar` at position 1, filled with a constant
df
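
All the operations above (`del`, `pop`, `insert`, item assignment) modify the dataframe in place. As a possible alternative, the sketch below uses `assign` and `drop`, which return new dataframes and leave the original untouched; the names `base`, `with_three` and `without_two` are hypothetical:

```python
import pandas as pd

base = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]},
    index=['a', 'b', 'c', 'd'],
)

with_three = base.assign(three=base['one'] * base['two'])  # returns a copy with a new column
without_two = with_three.drop(columns='two')               # returns a copy without `two`

print(list(base.columns))         # the original frame is unchanged
print(list(without_two.columns))
```

Which style to use is a matter of preference; the non-mutating form can be easier to follow when cells are re-run out of order.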

#### Data alignment and operations

In [None]:
d = {
    '1-one': [1.0, 2.0, 3.0, 4.0], 
    '2-two': [4.0, 3.0, 2.0, 1.0]
}
df1 = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
df1

In [None]:
d = {
    '1-one': [3, 2, 5], 
    '2-two': [0, -2, 1], 
    '3-three': [-1, 3, 2]
}
df2 = pd.DataFrame(d, index=['a', 'b', 'c'])
df2

In [None]:
# Adding two dataframes aligns them on index and column labels;
# labels present in only one of them produce NaN in the result

df_plus = df2 + df1
df_plus
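
With `+`, any row or column label that exists in only one of the two dataframes gives `NaN`. A sketch of one way to handle this, assuming that treating a missing value as zero suits the data: the `add` method accepts `fill_value`. The names `left`, `right`, `plain` and `filled` are introduced for this example.

```python
import pandas as pd

left = pd.DataFrame(
    {'1-one': [1.0, 2.0, 3.0, 4.0], '2-two': [4.0, 3.0, 2.0, 1.0]},
    index=['a', 'b', 'c', 'd'],
)
right = pd.DataFrame(
    {'1-one': [3, 2, 5], '2-two': [0, -2, 1], '3-three': [-1, 3, 2]},
    index=['a', 'b', 'c'],
)

plain = right + left                    # unmatched labels become NaN
filled = right.add(left, fill_value=0)  # a value missing on one side counts as 0

print(plain)
print(filled)
```

A cell that is missing in both inputs (row `d`, column `3-three`) still stays `NaN` in `filled`.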

In [None]:
d = {'one': 1, 'two': 2, 'three': 3}
df1 = pd.Series(d)  # note: despite its name, `df1` is now a Series, not a DataFrame
df1

In [None]:
d = {
    'one': [3, 2, 5], 
    'two': [0, -2, 1], 
    'three': [-1, 3, 2]
}
df2 = pd.DataFrame(d, index=['a', 'b', 'c'])
df2

In [None]:
# Note the row-wise subtraction: the Series is aligned
# on column labels and subtracted from every row

df_diff = df2 - df1
df_diff
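
The `-` operator matches the Series labels against column labels. A sketch of the method form, `sub`, which makes the direction explicit through its `axis` argument; the Series `col` with per-row offsets is a hypothetical addition used only to show the other direction:

```python
import pandas as pd

row = pd.Series({'one': 1, 'two': 2, 'three': 3})
frame = pd.DataFrame(
    {'one': [3, 2, 5], 'two': [0, -2, 1], 'three': [-1, 3, 2]},
    index=['a', 'b', 'c'],
)

# Match Series labels to column labels: same result as `frame - row`
by_row = frame.sub(row, axis='columns')

# Match Series labels to row labels instead (hypothetical per-row offsets)
col = pd.Series({'a': 1, 'b': 2, 'c': 3})
by_col = frame.sub(col, axis='index')

print(by_row)
print(by_col)
```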