## Data Analysis with pandas

pandas is a powerful data analysis and manipulation library for Python. It provides fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both easy and intuitive.

In [3]:
import pandas as pd

pd.__version__

'0.24.2'

## Pandas Objects: Series, DataFrame, and Index

In [2]:
import numpy as np

## Series:

A pandas Series is a one-dimensional array of indexed data. It can be created from a list or array, or from a dictionary.

Constructing a Series:
pd.Series(list_or_array, index=index) or pd.Series(dict)

### Constructing series from list

In [3]:
series = pd.Series([0.25,0.5,0.75,1.0]) # with default index
series

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [11]:
# Series with an explicit index
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [12]:
data['b']

0.5

In [1]:
series.values  # returns numpy array

array([0.25, 0.5 , 0.75, 1.  ])

In [5]:
type(series.values)

numpy.ndarray

In [7]:
series.index

RangeIndex(start=0, stop=4, step=1)

In [8]:
type(series.index)

pandas.core.indexes.range.RangeIndex

In [10]:
series[1:3]  # index-based element access (slice)

1    0.50
2    0.75
dtype: float64

### Constructing a Series from a Python dictionary:

In [5]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

When a Series is constructed from a dictionary, the dictionary keys become the Series index:
Dictionary keys -> Series index
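
As a small sketch of this key-to-index mapping (the dictionary `scores` and its values are made up for illustration), passing an explicit `index` together with a dictionary keeps only the listed keys, in the listed order:

```python
import pandas as pd

# Hypothetical data, used only to illustrate the key -> index mapping.
scores = {'b': 2, 'a': 1, 'c': 3}

# Every dictionary key becomes an index label.
all_scores = pd.Series(scores)

# With an explicit index, only the listed keys are kept, in that order.
subset = pd.Series(scores, index=['c', 'a'])

print(sorted(all_scores.index))
print(subset)
```

Whether `all_scores` keeps the insertion order of the keys or sorts them can depend on the pandas and Python versions in use, which is why the sketch sorts the labels before printing them.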

In [15]:
population['California']

38332521

In [16]:
population['California':'Illinois']

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
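
The slice above includes its stop label, unlike ordinary Python slicing. The following sketch contrasts the two behaviors; it rebuilds the same Series and calls `sort_index()` so the label order does not depend on how the dictionary was ordered:

```python
import pandas as pd

# Same state populations as above; sort_index() makes the label order explicit.
population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135}).sort_index()

# Label-based slice: the stop label 'Illinois' is included.
by_label = population['California':'Illinois']

# Position-based slice: the stop position 2 is excluded.
by_position = population.iloc[0:2]

print(len(by_label), len(by_position))
```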

## DataFrame 

A DataFrame is a two-dimensional, tabular data structure with a shared row index and flexible column names. It can be created from a Python dictionary, a NumPy array, or a tabular data file such as a CSV file.
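
The dictionary and array routes are shown in the cells that follow. For the file route, here is a minimal sketch using `pd.read_csv`; the CSV text is a made-up two-row sample held in memory with `io.StringIO`, so no file on disk is assumed:

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO lets read_csv treat a string like a file.
csv_text = "state,population\nCalifornia,38332521\nTexas,26448193\n"

# index_col=0 uses the first column ('state') as the row index.
from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(from_csv)
```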

In [7]:
df = pd.DataFrame(population)
df

                   0
California  38332521
Florida     19552860
Illinois    12882135
New York    19651127
Texas       26448193


In [9]:
df2 = pd.DataFrame({"Population": population})  # population is the Series defined above
df2

            Population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193


In [14]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}

In [15]:
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

In [16]:
states = pd.DataFrame({'population': population, 'area': area})
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


### Extracting index info

In [17]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

The return type is an Index.

### Extracting columns info

In [18]:
states.columns

Index(['area', 'population'], dtype='object')

In [19]:
# Use .values whenever a NumPy array is needed from a pandas object
type(states.values)

numpy.ndarray

### Extracting a column from a DataFrame

In [20]:
# DataFrame as specialized dictionary
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [21]:
# Unlike a NumPy array, a DataFrame does not treat states[0] as positional access:
# [] looks up a column label, and no column here is labeled 0.
states[0]

KeyError: 0
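
When positional access is what is wanted, `iloc` is the usual tool. A minimal sketch, using a two-row copy of the `states` table so the block runs on its own:

```python
import pandas as pd

# Two-row version of the states table (values copied from the cells above).
states = pd.DataFrame({'area': [423967, 170312],
                       'population': [38332521, 19552860]},
                      index=['California', 'Florida'])

first_column = states.iloc[:, 0]   # first column, selected by position
first_row = states.iloc[0]         # first row, selected by position

print(first_column.name)
print(first_row.name)
```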

In [26]:
# explicit column name
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193


In [28]:
# Creating a DataFrame from a list of dictionaries
data = [{'a':i, 'b':2*i} for i in range(3)]
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [30]:
pd.DataFrame(data)  # dict keys become columns, not the index (as they do for a Series)

   a  b
0  0  0
1  1  2
2  2  4


Pandas fills in values for missing keys with NaN:

In [31]:
pd.DataFrame([{'a':1, 'b': 2}, {'b':3, 'c':4}])


     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
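
The output above shows columns `a` and `c` as floats while `b` stays an integer. The sketch below, using the same two records, makes that visible through `dtypes` and counts the filled-in entries with `isna()`:

```python
import pandas as pd

# Same records as in the cell above: the first lacks 'c', the second lacks 'a'.
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
frame = pd.DataFrame(records)

# Columns with a missing entry are upcast to float64 so they can hold NaN;
# column 'b' is complete and stays an integer column.
print(frame.dtypes)

# isna() marks the filled-in positions; sum() counts them per column.
missing_per_column = frame.isna().sum()
print(missing_per_column)
```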


## From a dictionary of Series objects:

In [33]:
pd.DataFrame({'population':population, 'area':area})

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


## From a two-dimensional NumPy array:

In [35]:
pd.DataFrame(np.random.rand(3,2),
             columns=['foo','bar'],
             index=['a','b','c']) # or implicit index

        foo       bar
a  0.302420  0.681188
b  0.463586  0.408800
c  0.893844  0.950789


## From a NumPy structured array

In [36]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [37]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


## Pandas Index Object

In [38]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

## Index as immutable array

In [39]:
ind[1]

3

In [40]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [41]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [43]:
ind[1] = 0 # Can't modify, immutable array

TypeError: Index does not support mutable operations
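
Since an Index cannot be changed in place, changes are made by building a new Index. A minimal sketch, assuming the same `ind` as above; `delete()` and `insert()` each return a new object and leave the original alone:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# In-place assignment is rejected with a TypeError.
try:
    ind[1] = 0
    assignment_failed = False
except TypeError:
    assignment_failed = True

# delete() and insert() return a new Index; ind itself is unchanged.
changed = ind.delete(1).insert(1, 0)

print(assignment_failed)
print(list(ind), list(changed))
```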

## Index as ordered set

In [44]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [45]:
indA & indB # intersection

Int64Index([3, 5, 7], dtype='int64')

In [46]:
indA | indB # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [47]:
indA ^ indB # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')
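
The operator forms above reflect the pandas version used in this notebook (0.24.2). In more recent pandas releases, `&`, `|`, and `^` on an Index no longer perform set operations, so the method forms sketched below are the safer choice across versions:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # labels present in both
combined = indA.union(indB)                  # labels present in either
only_one = indA.symmetric_difference(indB)   # labels present in exactly one

print(list(common), list(combined), list(only_one))
```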

# Data Indexing and Selection

# Data Selection in Series

## Series as dictionary

In [48]:
# TODO
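
One possible sketch of the dictionary-like behavior this heading refers to, reusing the `data` Series defined earlier (the new label `'e'` and its value are made up for illustration):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

has_a = 'a' in data           # membership tests the index labels, not the values
labels = list(data.keys())    # keys() returns the index
pairs = list(data.items())    # (label, value) pairs

# Assigning to a new label extends the Series, as with a dictionary.
data['e'] = 1.25

print(has_a, labels, pairs[0], data['e'])
```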

## Series as one-dimensional array

In [49]:
# TODO: slicing, masking, and fancy indexing
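
A minimal sketch of the three array-style selections named in the cell above, again on the `data` Series from earlier. Positional slicing is written with `iloc` here to avoid any ambiguity between labels and positions:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

label_slice = data['a':'c']                  # slicing by label: stop label included
position_slice = data.iloc[0:2]              # slicing by position: stop excluded
masked = data[(data > 0.3) & (data < 0.8)]   # masking with a boolean condition
fancy = data[['a', 'd']]                     # fancy indexing with a list of labels

print(list(label_slice.index), list(position_slice.index))
print(list(masked.index), list(fancy.index))
```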

## Indexers: loc, iloc, and ix

In [50]:
# TODO: loc, iloc, and ix
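
A short sketch of the difference between `loc` and `iloc`, using a made-up Series whose integer labels deliberately differ from the positions. `ix` is left out: it was already deprecated in the pandas version shown at the top of this notebook and has been removed in later releases.

```python
import pandas as pd

# Integer labels that differ from the positions make the distinction visible.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1]             # the element labeled 1
by_position = data.iloc[1]         # the element at position 1
label_slice = data.loc[1:3]        # labels 1 through 3, stop label included
position_slice = data.iloc[1:3]    # positions 1 and 2, stop position excluded

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
```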

# Data Selection in DataFrame

## DataFrame as a dictionary

In [51]:
# TODO
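
One way to illustrate the dictionary view of a DataFrame is sketched below with a two-row copy of the `states` table. The `density` column name is an assumption introduced for the example:

```python
import pandas as pd

states = pd.DataFrame({'area': [423967, 170312],
                       'population': [38332521, 19552860]},
                      index=['California', 'Florida'])

# Reading a column works like a dictionary lookup and returns a Series.
area = states['area']

# Assigning to a new key adds a column, computed row by row from existing ones.
states['density'] = states['population'] / states['area']

print(list(states.columns))
```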

## DataFrame as two-dimensional array

In [52]:
# TODO

## Additional indexing conventions

In [53]:
# TODO

## Operating on Data in Pandas

## Ufuncs: Index Preservation

Any NumPy ufunc works on pandas Series and DataFrame objects, and the result keeps the index (and, for a DataFrame, the column labels) of the input.
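
A minimal sketch of index preservation, with small made-up inputs (`ser` and `frame` are illustrative names, not objects defined earlier in the notebook):

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])
frame = pd.DataFrame({'x': [1.0, 4.0], 'y': [9.0, 16.0]}, index=['r1', 'r2'])

exp_ser = np.exp(ser)          # a Series with the same index as ser
root_frame = np.sqrt(frame)    # a DataFrame with the same index and columns

print(list(exp_ser.index))
print(root_frame)
```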