# Key Features of Pandas
* Fast and efficient DataFrame object with default and customized indexing.
* Tools for loading data into in-memory data objects from different file formats.
* Data alignment and integrated handling of missing data.
* Label-based slicing, indexing and subsetting of large data sets.
* Columns from a data structure can be deleted or inserted.
* Group by data for aggregation and transformations.
* High performance merging and joining of data.
* Time Series functionality.
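
A couple of these features are easier to appreciate with a quick sketch. The example below is only illustrative: the fruit and store data, and all variable names, are made up for this demonstration. It shows label alignment with missing data, followed by a simple group-by aggregation.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up data)
q1 = pd.Series({'apples': 3, 'pears': 5})
q2 = pd.Series({'pears': 2, 'plums': 7})

# Arithmetic aligns on labels; a label missing from either side gives NaN
total = q1 + q2

# fill_value=0 treats a missing label as 0 instead of producing NaN
total_filled = q1.add(q2, fill_value=0)

# Group rows by a key column and aggregate each group
sales = pd.DataFrame({'store': ['north', 'south', 'north'],
                      'units': [10, 4, 6]})
units_per_store = sales.groupby('store')['units'].sum()

print(total)
print(total_filled)
print(units_per_store)
```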

# Pandas Data Structures
Pandas has historically had 3 main data structures:
* Series: 1D homogeneous (single-typed) array
* DataFrame: Generally 2D tabular structure with potentially heterogeneous (differently typed) columns
* Panel: General 3D structure (basically multiple DataFrames). Panel has been removed from recent pandas releases.

These data structures are built on top of NumPy arrays, which makes them fast. We'll cover only the Series and DataFrame in this workshop.
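
To make the NumPy connection concrete, here is a minimal sketch assuming the default NumPy-backed dtypes. The values are arbitrary, and `values` and `doubled` are just illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5])

# to_numpy() exposes the Series data as a NumPy array
values = s.to_numpy()
print(type(values))

# Arithmetic is vectorised: it applies to every element without a Python loop
doubled = s * 2
print(doubled)
```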

## Series

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. Some key points to remember about the Series are:
* Homogeneous data (single-typed)
* Size Immutable (this is fixed once we define it)
* Data values mutable (we can change the actual contents)
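
As a rough sketch of the single-dtype and mutable-values points (the labels and numbers are made up for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# The whole Series shares one dtype (an integer dtype here)
print(s.dtype)

# Values can be changed in place through their labels
s['y'] = 25
print(s)

# Mixing types still gives a single dtype: the general-purpose object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)
```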

Let's work with some code examples:

In [1]:
import pandas as pd
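# Initialize an empty Series; we pass the dtype explicitly because the
# default dtype of an empty Series differs between pandas versions
s = pd.Series(dtype='float64')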
print(s)

Series([], dtype: float64)


In [7]:
import numpy as np
# Load some data into a series
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
print('----------')

# We can also load data from a dictionary
data = {'a' : 0., 'b' : 1., 'c' : 2.}
# Any index labels with no matching dictionary key become NaN
s = pd.Series(data,index=['b','c','d','a'])
print(s)
print('----------')

0    a
1    b
2    c
3    d
dtype: object
----------
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
----------


In [9]:
# Let's examine Series indexing
# We can index by position, much like a normal Python list, using .iloc
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# Retrieve the first element
print(s.iloc[0])
print('----------')
print(s.iloc[:3])
print('----------')

# We can also index by our own index ['a','b','c','d','e']
print(s['a'])
print('----------')
print(s[:'c'])
print('----------')

# Retrieving a label that doesn't exist raises a KeyError
print(s['f'])

1
----------
a    1
b    2
c    3
dtype: int64
----------
1
----------
a    1
b    2
c    3
dtype: int64
----------


KeyError: 'f'
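
If a label might be missing, we can check for it first or ask for a fallback value instead of letting the `KeyError` propagate. This is a small sketch using the same Series as above; the fallback value `-1` is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# The `in` operator tests membership against the index labels
has_f = 'f' in s
print(has_f)

# get() returns the default instead of raising when the label is absent
value = s.get('f', default=-1)
print(value)

# Or catch the KeyError explicitly
try:
    s['f']
    caught = False
except KeyError:
    caught = True
print(caught)
```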

## DataFrame

A DataFrame is a 2D data structure, i.e. data is aligned in a tabular fashion in rows and columns.

* Columns can be of different types
* Mutable size
* Labeled axes (rows and columns)
* Perform arithmetic operations on rows and columns
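
As a quick sketch of arithmetic along rows and columns (a tiny made-up DataFrame, with illustrative variable names):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# axis=0 sums down each column, giving one value per column
col_sums = df.sum(axis=0)

# axis=1 sums across each row, giving one value per row
row_sums = df.sum(axis=1)

# Scalar arithmetic applies to every cell
scaled = df * 10

print(col_sums)
print(row_sums)
print(scaled)
```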

A DataFrame is created with the following constructor:
```
pandas.DataFrame(data, index, columns, dtype, copy)
```

A DataFrame can be built from various inputs, such as:
* lists
* dicts
* Series
* NumPy arrays
* Another DataFrame
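
Here is a compact sketch of several of these inputs side by side. The names and ages are placeholder data, and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# From a dict of lists: keys become column names
from_dict = pd.DataFrame({'Name': ['Alex', 'Bob'], 'Age': [10, 12]})

# From a list of dicts: keys missing from a record become NaN
from_records = pd.DataFrame([{'Name': 'Alex', 'Age': 10}, {'Name': 'Bob'}])

# From a dict of Series: the Series index becomes the row labels
from_series = pd.DataFrame({'Age': pd.Series([10, 12], index=['Alex', 'Bob'])})

# From a 2D NumPy array, supplying the column names ourselves
from_array = pd.DataFrame(np.zeros((2, 3)), columns=['a', 'b', 'c'])

# From another DataFrame, asking for an independent copy
from_df = pd.DataFrame(from_dict, copy=True)

print(from_records)
print(from_series)
```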

Let's cover a few useful examples of creating DataFrames.

In [3]:
import pandas as pd
# Create an Empty DF
empty_df = pd.DataFrame()
print(empty_df)

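# Create a DataFrame from lists
data = [['Alex',10],['Bob',12],['Clarke',13]]
# Cast only the Age column to float; passing dtype=float to the constructor
# can raise an error in newer pandas versions because Name holds strings
name_df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})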
print("List DF")
print(name_df)

Empty DataFrame
Columns: []
Index: []
List DF
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0


In [12]:
import numpy as np
# Most commonly you'll be making dataframes from numpy arrays
data = np.random.uniform(size=(100, 3))
random_df = pd.DataFrame(data, columns=['a', 'b', 'c'])
# Since this is a big DataFrame, we don't print the whole thing out
# Let's just show the last 5 rows
random_df.tail()

           a         b         c
95  0.920395  0.172463  0.631018
96  0.411321  0.430093  0.168328
97  0.607133  0.514733  0.799955
98  0.213114  0.565159  0.940272
99  0.185256  0.214769  0.978575


### Column Addition

Python's square bracket notation is used to add columns in pandas.

In [13]:
print("Adding a new column by passing as a numpy array:")
random_df['d'] = np.ones((100,))
print(random_df.tail())

print("Adding a new column using the existing columns in DataFrame:")
random_df['e'] = random_df['c'] + random_df['d']
print(random_df.tail())

Adding a new column by passing as a numpy array:
           a         b         c    d
95  0.920395  0.172463  0.631018  1.0
96  0.411321  0.430093  0.168328  1.0
97  0.607133  0.514733  0.799955  1.0
98  0.213114  0.565159  0.940272  1.0
99  0.185256  0.214769  0.978575  1.0
Adding a new column using the existing columns in DataFrame:
           a         b         c    d         e
95  0.920395  0.172463  0.631018  1.0  1.631018
96  0.411321  0.430093  0.168328  1.0  1.168328
97  0.607133  0.514733  0.799955  1.0  1.799955
98  0.213114  0.565159  0.940272  1.0  1.940272
99  0.185256  0.214769  0.978575  1.0  1.978575
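
Besides bracket assignment, pandas offers `assign` and `insert` for adding columns. The sketch below uses a small made-up DataFrame; the column names `total` and `id` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# assign() returns a new DataFrame and leaves the original untouched
with_total = df.assign(total=df['a'] + df['b'])
print(with_total)

# insert() adds a column in place at a chosen position (0 = first column)
df.insert(0, 'id', [100, 101])
print(df)
```

`assign` is convenient when chaining operations, while bracket assignment and `insert` modify the DataFrame directly.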


### Column Deletion

We can delete a column with the `pop` method, which also returns the removed column.


In [14]:
random_df.pop('e')
random_df.tail()

           a         b         c    d
95  0.920395  0.172463  0.631018  1.0
96  0.411321  0.430093  0.168328  1.0
97  0.607133  0.514733  0.799955  1.0
98  0.213114  0.565159  0.940272  1.0
99  0.185256  0.214769  0.978575  1.0
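
For comparison, here is a short sketch of three ways to remove a column, using a throwaway DataFrame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# pop() removes the column in place and returns it as a Series
popped = df.pop('c')
print(popped)

# drop() returns a new DataFrame without the column; df itself is unchanged
trimmed = df.drop(columns=['b'])
print(trimmed)

# del removes the column in place and returns nothing
del df['a']
print(df)
```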


### Row Manipulation

Let's take a look at row selection, addition, and deletion.

In [17]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


We can use `loc` to select a row by its label.

In [18]:
df.loc['b']

one    2.0
two    2.0
Name: b, dtype: float64
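
Rows can also be selected by integer position with `iloc`. The sketch below rebuilds the same DataFrame so it runs on its own; the variable names are just for illustration.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# loc selects by label, iloc selects by integer position
by_label = df.loc['b']
by_position = df.iloc[1]
print(by_label.equals(by_position))

# loc also accepts a (row label, column label) pair to pick a single cell
cell = df.loc['d', 'two']
print(cell)
```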

We can select multiple rows using Python's slice (`:`) syntax.

In [21]:
df['c':]

   one  two
c  3.0    3
d  NaN    4
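
One detail worth knowing when slicing rows: label-based slices include both endpoints, while position-based slices exclude the end, as with Python lists. A minimal sketch on the same DataFrame (rebuilt here so the example is self-contained):

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Label slice with loc: both 'b' and 'c' are included
label_slice = df.loc['b':'c']
print(label_slice)

# Position slice with iloc: position 3 is excluded, so we get rows 1 and 2
position_slice = df.iloc[1:3]
print(position_slice)
```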
