# 1. Series object

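- A Series is a one-dimensional array of indexed data.
- Each value is paired with an index label.
- It can be created directly from a list or a NumPy array.
- The index and the values are available as attributes (`index` and `values`).
- A pandas Series is more flexible than a NumPy array: **a Series can be indexed by labels as well as by integer position, whereas a plain NumPy array supports only integer positions.**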

In [2]:
import pandas as pd

# Series Object

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
# Access series attributes
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
data[1]  # Access value using index

np.float64(0.5)

In [6]:
data[1:3]  # Access a range of values using index range

1    0.50
2    0.75
dtype: float64

## (a) As Generalized NumPy array
- In NumPy, the index is implicit: values are accessed by their integer position in the array.
- In pandas, the index is defined explicitly and each label is associated with a value.
- This explicit index is what gives pandas its extra flexibility.
- Index labels need not be integers; strings work too.

In [7]:
# Example of explicit indexing in pandas
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [8]:
# Access values
data['b']

np.float64(0.5)

We can even use non-contiguous or non-sequential index values. For example:

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index = [2, 5, 1,3])
data

2    0.25
5    0.50
1    0.75
3    1.00
dtype: float64

In [10]:
# Access a value by index label (label 1, not position 1)
data[1]

np.float64(0.75)
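
With an integer index it is easy to mix up labels and positions. A minimal sketch of the two explicit accessors, `.loc` (label-based) and `.iloc` (position-based), reusing the Series from the cell above; the names `by_label` and `by_position` are only illustrative:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 1, 3])

by_label = data.loc[1]      # the value stored under label 1
by_position = data.iloc[1]  # the second value, whatever its label is

print(by_label, by_position)
```

Using `.loc` or `.iloc` makes the intent explicit, which helps whenever the index itself consists of integers.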

## (b) As Specialized Dictionary

- A Python dict maps keys to values without any type constraint, i.e. a single dict can hold heterogeneous key and value types.

In [11]:
# Example
my_dict1 = {1: 'apple', 'two': 2, (3, 4): [5, 6]}  # different key data types
my_dict2 = {'a': 1, 'b': 'hello', 'c': [1, 2, 3]}  # different value data types

- A pandas Series is like a dictionary with typed keys and typed values: the index has a single dtype and the values have a single dtype.
- The consistent dtypes make a Series more efficient to process than a dict, especially for large datasets.
- The same holds for a NumPy array versus a Python list: a NumPy array has one dtype, so it is more efficient than a list for numeric work.


- We can make a pandas series directly from a dict:

In [12]:
population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
population  # int type series

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

- If the dict holds heterogeneous values, the Series gets the `object` dtype.
- Since the 'object' dtype is not as optimized as more specific dtypes like 'int64' or 'float64', operations on such a Series may be slower and less memory-efficient.

In [13]:
# Example of object dtype
population_dict = {'California': 'a', 'Texas': 26448193, 'New York':'d', 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
population  # object dtype

California           a
Texas         26448193
New York             d
Florida       19552860
Illinois      12882135
dtype: object
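
The memory cost of the `object` dtype mentioned above can be checked with `Series.memory_usage(deep=True)`. This is only a rough sketch: the Series of 1000 integers and the names `typed` and `untyped` are made up for illustration, and the exact byte counts depend on the platform and pandas version.

```python
import pandas as pd

typed = pd.Series(range(1000))   # integer dtype inferred from the data
untyped = typed.astype(object)   # same values, stored as Python objects

# deep=True also counts the memory of the Python objects themselves
typed_bytes = typed.memory_usage(deep=True)
untyped_bytes = untyped.memory_usage(deep=True)

print(typed.dtype, typed_bytes)
print(untyped.dtype, untyped_bytes)
```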

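- How is a Series more flexible than a dictionary? It supports array-style operations such as slicing, so a range of values can be accessed in one go. Note that a label-based slice includes the end label: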

In [14]:
population['Texas':'Florida']

Texas       26448193
New York           d
Florida     19552860
dtype: object
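
A short sketch of how label slicing differs from positional slicing, using the all-integer population data from earlier. Label slices include the end label, while positional slices (here via `.iloc`) exclude the stop position; the names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

by_label = population['Texas':'Florida']  # end label 'Florida' is included
by_position = population.iloc[1:3]        # stop position 3 is excluded

print(by_label)
print(by_position)
```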

## (c) Constructing a Series object
`pd.Series(data, index=index)`

- data can be a list, numpy array, dict, or scalar.
- index is optional.

In [15]:
# From list
pd.Series([9,2,3])

0    9
1    2
2    3
dtype: int64

In [16]:
# From scalar
pd.Series(5, index=[1,2,3,4,9])

1    5
2    5
3    5
4    5
9    5
dtype: int64
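
The list of accepted inputs above also mentions NumPy arrays. A minimal sketch of that case, with made-up values and labels:

```python
import numpy as np
import pandas as pd

arr = np.array([1.5, 2.5, 3.5])
from_array = pd.Series(arr, index=['x', 'y', 'z'])  # index is optional here too

print(from_array)
```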

In [17]:
# From Dict
pd.Series({2:'a', 1:'b', 5: 'k'})

2    a
1    b
5    k
dtype: object

- Here the dtype is `object` because the values are strings.
- In the pandas version used here, string values are stored with the general-purpose `object` dtype by default, the same dtype used for heterogeneous data.
- pandas also provides a dedicated `string` dtype, which has to be requested explicitly:

In [18]:
pd.Series({2:'a', 1:'b', 5:'k'}, dtype='string')

2    a
1    b
5    k
dtype: string

In [47]:
# When an index is passed along with a dict, only the keys listed in the index are kept;
# index labels missing from the dict (here 8) get NaN
pd.Series({2:'a', 1:'b', 5: 'k'}, index=[1,5, 8])

1      b
5      k
8    NaN
dtype: object
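
The same selection works with numeric values. A small sketch with made-up data: because NaN is a floating-point value, integer data that picks up a missing label is typically converted to a float dtype.

```python
import pandas as pd

# 'b' is dropped because it is not in the index; 'z' has no value in the dict
counts = pd.Series({'a': 1, 'b': 2, 'c': 3}, index=['a', 'c', 'z'])

print(counts)
print(counts.dtype)
```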

# 2. Dataframe object
- The DataFrame is the other fundamental structure in pandas.
- Like a Series, a DataFrame can be viewed both as a generalization of a NumPy array and as a specialization of a Python dictionary.

## (a) As Generalized NumPy array

- Generalization of an array means that a DataFrame organizes multiple arrays/Series objects together as columns.

- If a Series is the analog of a 1D array, a DataFrame is the analog of a 2D array.

- It is a 2D array with flexible row indices and column names: the row labels and column names can be customized, each column can have its own dtype, and rows/columns can easily be reordered, added, or removed.

- We can think of a DataFrame as a sequence of aligned Series objects, one per column. Aligned means they share the same index.

- Let's create a DataFrame from two Series:

In [48]:
# Population series
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})

# Area series
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})

# Create states df
states = pd.DataFrame({'population':population, 'area': area})  # {column_name : series}
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
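
To see what alignment means in practice, here is a sketch with two deliberately mismatched Series (a shortened, partly overlapping version of the data above). pandas lines the rows up by label and fills the gaps with NaN; the name `partial_states` is illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# Rows are matched by index label, not by position
partial_states = pd.DataFrame({'population': population, 'area': area})

print(partial_states)
print(partial_states.index)
print(partial_states.columns)
```

Only 'Texas' appears in both Series, so it is the only row with both values present.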


## (b) DataFrame as Specialized Dictionary

- A dict maps a key to a value.
- A DataFrame maps a column name to a Series of column data.

In [49]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

**Watch out for a common source of confusion**
- In a 2D NumPy array, `data[0]` returns the first row.
- In a DataFrame, `data['column_name']` returns that column, so plain bracket indexing selects columns, not rows.

In [50]:
states['population']  # access population column

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [51]:
states['population']['Texas']  # Access the 'Texas' row of the 'population' column

np.int64(26448193)
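
The row/column distinction can be made explicit with `.loc`, which takes a row label and a column name in one step. A minimal sketch on the same `states` data (rebuilt here so the block runs on its own); `first_row`, `texas_row` and `texas_pop` are illustrative names.

```python
import pandas as pd

states = pd.DataFrame({
    'population': {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127, 'Florida': 19552860,
                   'Illinois': 12882135},
    'area': {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995},
})

first_row = states.values[0]                   # NumPy view of the data: index 0 is a row
texas_row = states.loc['Texas']                # one row, selected by its label
texas_pop = states.loc['Texas', 'population']  # row label and column name together

print(first_row)
print(texas_row)
print(texas_pop)
```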

## (c) Constructing DataFrame object
- A DataFrame can be constructed from a single Series object, a list of dicts, a dict of Series objects, a 2D NumPy array, or a NumPy structured array.

In [52]:
# From a single column of values (a plain list here)
pd.DataFrame([3,4,5,1], index=['first', 'second', 'third', 'fourth'], columns=['numbers'])  # column content, row labels, and column name

        numbers
first         3
second        4
third         5
fourth        1


In [20]:
# From a nested list (one inner list per row)
pd.DataFrame([[1,2],[3,4]], index=['first', 'second'], columns=['first_col', 'second_col'])

        first_col  second_col
first           1           2
second          3           4


In [53]:
# List of Dicts
pd.DataFrame([{'a':i, 'b': 2*i, 'c': 3*i} for i in range(1,5)])

   a  b   c
0  1  2   3
1  2  4   6
2  3  6   9
3  4  8  12


In [54]:
# Dict of Series objects
pd.DataFrame({'population':population, 'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [22]:
# From 2D NumPy array
import numpy as np
pd.DataFrame(np.random.rand(4,2), columns=['foo', 'bar'], index= ['a', 'b', 'c', 'd'])

        foo       bar
a  0.013241  0.757791
b  0.026020  0.939068
c  0.051356  0.129692
d  0.503044  0.169080
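
The construction options listed at the start of this subsection also include NumPy structured arrays, which the cells above do not cover. A minimal sketch, assuming a made-up structured array with an integer field `A` and a float field `B`:

```python
import numpy as np
import pandas as pd

# Structured array: every element is a record with named, typed fields
records = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

from_records = pd.DataFrame(records)  # field names become column names

print(from_records)
print(from_records.dtypes)
```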


# 3. Index Object

## (a) As Immutable Array
- An Index object is immutable: its individual values cannot be changed in place. (The index of a Series/DataFrame can still be replaced as a whole by assigning a new one.)
- Like a normal array, an Index supports indexing and slicing; only modification is disallowed.

In [30]:
# Define Index object
ind = pd.Index([2,3,5,7,11])

In [31]:
# Access Index object value
print(ind[2])
print(ind[::2])

5
Index([2, 5, 11], dtype='int64')


In [32]:
# Try to modify an Index value
ind[3] = 2   # TypeError

TypeError: Index does not support mutable operations
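
Immutability applies to the Index object itself, not to which Index a Series points at. A small sketch with a made-up Series `s`: changing one label in place fails, while assigning a whole new index works.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

in_place_failed = False
try:
    s.index[0] = 'z'   # item assignment on an Index is not allowed
except TypeError as exc:
    in_place_failed = True
    print('TypeError:', exc)

s.index = ['x', 'y', 'z']  # replacing the whole index is allowed
print(s)
```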

## (b) As Ordered Set

- A pandas Index behaves like an ordered set. It can also be a multiset, i.e. it may contain repeated/duplicate values.
- The usual set operations (union, intersection, difference, symmetric difference) are available as Index methods.

In [60]:
# Multiset index
multi_index = pd.Index([1,2,3,4,2,7,11])
print(multi_index)

Index([1, 2, 3, 4, 2, 7, 11], dtype='int64')
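
Duplicate labels change what a lookup returns. A short sketch with a made-up Series `dup`: a label that occurs more than once returns every matching value, and `Index.is_unique` reports whether duplicates exist.

```python
import pandas as pd

dup = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

matches = dup['a']   # label 'a' occurs twice, so this is a Series
single = dup['b']    # label 'b' occurs once, so this is a scalar

print(dup.index.is_unique)
print(matches)
print(single)
```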


In [61]:
# Set operations on index objects
indA= pd.Index([1, 3, 5, 7, 9])
indB= pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB))  # Intersection
print(indA.union(indB))  # Union
print(indA.symmetric_difference(indB))  # Symmetric difference


Index([3, 5, 7], dtype='int64')
Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Index([1, 2, 9, 11], dtype='int64')
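
Plain set difference is also available through `Index.difference`. A brief sketch on the same two indexes; unlike the operations above it is not symmetric, so the order of the operands matters.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

only_in_A = indA.difference(indB)  # values of indA that are not in indB
only_in_B = indB.difference(indA)  # values of indB that are not in indA

print(only_in_A)
print(only_in_B)
```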
