# Pandas

1. Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame.
2. DataFrames are multidimensional arrays with attached row and column labels, often with heterogeneous types
    and/or missing data.
    
3. NumPy's limitations become clear when we need more flexibility: attaching labels to data, working
   with missing data, and performing aggregation operations that do not map well onto element-wise broadcasting
   (such as grouping).
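
As a minimal sketch of the three points above (labels, missing data, grouping), the cell below uses a small made-up table; the column names `city` and `amount` and the values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two labelled columns, one missing measurement.
sales = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'amount': [10.0, 20.0, np.nan, 40.0],
})

# Group rows by the 'city' label and sum each group.
# Missing values (NaN) are skipped by the aggregation.
totals = sales.groupby('city')['amount'].sum()
print(totals)  # Delhi -> 10.0, Mumbai -> 60.0
```

Doing the same with plain NumPy arrays would need manual masking for the missing value and manual bookkeeping for the group labels.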

## Pandas Objects

1. The three fundamental objects (data structures) in Pandas are: **Series, DataFrame, Index**
2. At a basic level, these objects can be thought of as enhanced versions of NumPy structured arrays in which the
   rows and columns are identified with labels rather than simple integer indices.

## The Pandas Series Object

- A Pandas Series is a one-dimensional array of **indexed** data

In [2]:
import numpy as np
import pandas as pd

pd.__version__

'0.25.1'

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])

In [4]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [5]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
data[1]

0.5

In [8]:
data[1:3]

1    0.50
2    0.75
dtype: float64

### Series as a generalized version of a NumPy array:



1. The essential difference is the presence of the **index**.

2. A NumPy array has an implicitly defined integer index used to access its values.

3. A Pandas Series has an explicitly defined index associated with its values, so the index need not be an integer or sequential.

In [11]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                  index = ['a', 'b','c', 'd'])

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [12]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                  index = [2, 5, 3, 7])

data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
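
When the explicit index is itself made of integers, as in the cell above, it is easy to confuse a label with a position. The sketch below shows one way to keep the two apart, using the `.loc` (label-based) and `.iloc` (position-based) accessors; the variable names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[3]      # value whose index label is 3    -> 0.75
by_position = data.iloc[3]  # value at position 3 (the fourth) -> 1.0
```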

### Series as a specialized dictionary

1. A dictionary maps arbitrary keys to a set of arbitrary values.
2. A Series maps typed keys to a set of typed values.

<br>
Differences

1. The type-specific compiled code behind a NumPy array makes it more efficient than a Python list.
   Similarly, the type information in a Pandas Series makes it more efficient than a Python dictionary for certain operations.
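
The dictionary analogy can be sketched as follows: a Series supports the familiar dictionary-style operations, and adds vectorized arithmetic on top. This is a small illustration, not an exhaustive list of the supported operations.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

has_a = 'a' in data          # membership tests the index labels (the "keys")
keys = list(data.keys())     # the index labels, like dict.keys()
items = list(data.items())   # (label, value) pairs, like dict.items()

data['e'] = 1.25             # assigning to a new label extends the Series
doubled = data * 2           # vectorized arithmetic, which a plain dict lacks
```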

### Constructing Series objects

In [13]:
#From a List
pd.Series([2,4,6])

0    2
1    4
2    6
dtype: int64

In [14]:
pd.Series(5, index = [100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [15]:
#From a Dictionary
pd.Series({2: 'a', 1: 'b', 3: 'c', 5: 'd'})

2    a
1    b
3    c
5    d
dtype: object
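
When a dictionary is passed together with an explicit `index`, only the keys named in the index are kept, in the order given. A short sketch (the name `subset` is illustrative):

```python
import pandas as pd

# Only keys 3 and 2 are selected from the dictionary, in that order.
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
# 3    c
# 2    a
```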

In [17]:
population_dict = {'Delhi': 36787623,
                   'Mumbai': 45678234,
                   'Bangalore':55678643,
                   'Chennai': 44567412,
                   'Goa': 23455412}

population = pd.Series(population_dict)
population

Delhi        36787623
Mumbai       45678234
Bangalore    55678643
Chennai      44567412
Goa          23455412
dtype: int64

In [18]:
population['Chennai']

44567412

In [19]:
population['Mumbai':'Goa']

Mumbai       45678234
Bangalore    55678643
Chennai      44567412
Goa          23455412
dtype: int64

In [22]:
population_dict['Mumbai': 'Goa']

TypeError: unhashable type: 'slice'
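
Note that the label-based slice `population['Mumbai':'Goa']` above includes the end label `'Goa'`, unlike ordinary positional slicing, which excludes the end position. The sketch below makes the contrast explicit with `.loc` and `.iloc`.

```python
import pandas as pd

population = pd.Series({'Delhi': 36787623,
                        'Mumbai': 45678234,
                        'Bangalore': 55678643,
                        'Chennai': 44567412,
                        'Goa': 23455412})

by_label = population.loc['Mumbai':'Chennai']  # end label is included: 3 rows
by_position = population.iloc[1:3]             # end position is excluded: 2 rows
```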

## The Pandas DataFrame Object

1. A DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

2. A Series is an analog of a one-dimensional array with flexible indices.

3. A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.


### DataFrame as a generalized NumPy array:

1. A two-dimensional array can be thought of as an ordered sequence of aligned one-dimensional columns
2. A DataFrame can be thought of as a sequence of aligned **Series** objects
3. Here 'aligned' means that they share the same index

In [23]:
population_dict = {'Delhi': 36787623,
                   'Mumbai': 45678234,
                   'Bangalore':55678643,
                   'Chennai': 44567412,
                   'Goa': 23455412}

population = pd.Series(population_dict)
population

Delhi        36787623
Mumbai       45678234
Bangalore    55678643
Chennai      44567412
Goa          23455412
dtype: int64

In [24]:
area_dict = {'Delhi': 456782,
             'Mumbai': 459874,
             'Bangalore':234564,
             'Chennai': 654345,
             'Goa': 887345}

area = pd.Series(area_dict)
area

Delhi        456782
Mumbai       459874
Bangalore    234564
Chennai      654345
Goa          887345
dtype: int64

In [25]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

           population    area
Delhi        36787623  456782
Mumbai       45678234  459874
Bangalore    55678643  234564
Chennai      44567412  654345
Goa          23455412  887345


In [26]:
states.index

Index(['Delhi', 'Mumbai', 'Bangalore', 'Chennai', 'Goa'], dtype='object')

In [27]:
states.columns

Index(['population', 'area'], dtype='object')
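
To illustrate what "aligned" means when the Series do not share exactly the same index, the sketch below builds a DataFrame from two deliberately mismatched Series (a reduced, hypothetical subset of the data above). Rows are matched by label, and any missing combination is filled with NaN.

```python
import pandas as pd

population = pd.Series({'Delhi': 36787623, 'Mumbai': 45678234})
area = pd.Series({'Mumbai': 459874, 'Goa': 887345})

# The resulting index is the union of both indexes;
# 'Delhi' has no area and 'Goa' has no population.
partial = pd.DataFrame({'population': population, 'area': area})
print(partial)
```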

### DataFrame as a specialized Dictionary

1. A dictionary maps a key to a value; similarly, a DataFrame maps a column name to a Series of column data.

In [28]:
states['area']

Delhi        456782
Mumbai       459874
Bangalore    234564
Chennai      654345
Goa          887345
Name: area, dtype: int64
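
One consequence of the dictionary view is worth spelling out: indexing a two-dimensional NumPy array with `[0]` returns the first row, while indexing a DataFrame with `['name']` returns a column. The sketch below uses a small made-up array with hypothetical column names `x` and `y`.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
frame = pd.DataFrame(arr, columns=['x', 'y'])

first_row = arr[0]       # NumPy: arr[0] is the first row      -> [1, 2]
column_x = frame['x']    # pandas: frame['x'] is the column 'x' -> 1, 3, 5
row_zero = frame.loc[0]  # rows are reached through .loc or .iloc
```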

### Constructing DataFrame Objects

In [29]:
#From a single Series object. 

pd.DataFrame(population, columns = ['population'])

           population
Delhi        36787623
Mumbai       45678234
Bangalore    55678643
Chennai      44567412
Goa          23455412


In [34]:
#From a list of dicts.

# Equivalent explicit loop:
# data = []
# for i in range(3):
#     data.append({'a': i, 'b': 2 * i})
               
data = [{'a': i, 'b': 2 * i} for i in range(3)]

In [35]:
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [33]:
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [36]:
# Keys missing from a dict are filled in with NaN ("Not a Number")
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
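
The output above shows columns `a` and `c` as floats (`1.0`, `4.0`) even though the input values were integers. This is because NaN is a floating-point value, so a column that contains it is upcast. A small sketch to check this, assuming the default NumPy-backed dtypes:

```python
import pandas as pd

frame = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

dtypes = frame.dtypes   # 'a' and 'c' are float, 'b' stays integer
missing = frame.isna()  # boolean mask marking the filled-in NaN cells
```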


In [37]:
#From a dictionary of Series Objects

states = pd.DataFrame({'population': population,
                       'area': area})
states

           population    area
Delhi        36787623  456782
Mumbai       45678234  459874
Bangalore    55678643  234564
Chennai      44567412  654345
Goa          23455412  887345


In [38]:
#From a two-dimensional NumPy array

pd.DataFrame(np.random.rand(3,2), 
             columns = ['A', 'B'],
             index = ['a', 'b', 'c'])

          A         B
a  0.819041  0.138010
b  0.621419  0.463577
c  0.759220  0.463089


In [39]:
# From a NumPy structured array

A = np.zeros(3, dtype = [('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [40]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


## The Pandas Index Object

1. The Index object can be thought of either as an immutable array, or as an ordered set (technically a multiset, since an Index may contain repeated values).

In [41]:
index = pd.Index([2,3,4,5,7])
index

Int64Index([2, 3, 4, 5, 7], dtype='int64')

In [42]:
index[1]

3

In [43]:
index[::2]

Int64Index([2, 4, 7], dtype='int64')

In [45]:
print(index.size, index.shape, index.ndim, index.dtype)

5 (5,) 1 int64


In [46]:
index[1] = 0

TypeError: Index does not support mutable operations
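
Because an Index cannot be modified in place, a changed index has to be built as a new object. The sketch below shows one possible way to do that by copying the values into a NumPy array first; the names `values` and `new_index` are illustrative.

```python
import pandas as pd

index = pd.Index([2, 3, 4, 5, 7])

# In-place assignment is rejected.
raised = False
try:
    index[1] = 0
except TypeError:
    raised = True

# Copy the values, edit the copy, and wrap it in a new Index.
values = index.to_numpy().copy()
values[1] = 0
new_index = pd.Index(values)
```

Immutability makes it safer to share one Index between several Series and DataFrames, since no operation can change it behind their backs.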

In [47]:
# Index as ordered set

indA = pd.Index([1,3,5,7,9])
indB = pd.Index([2,3,5,7,11])

indA & indB #intersection

Int64Index([3, 5, 7], dtype='int64')

In [48]:
indA | indB #union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
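
The `&` and `|` operators behave as set operations in the pandas version used here (0.25.1), but newer pandas releases have moved away from that meaning. The explicit method calls below are a sketch of the same operations that does not depend on the operator behaviour.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # 3, 5, 7
either = indA.union(indB)                   # 1, 2, 3, 5, 7, 9, 11
only_one = indA.symmetric_difference(indB)  # 1, 2, 9, 11
```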

In [55]:
name = 'India is a great country'

In [60]:
name[:16:-1]

'yrtnuoc'