# 1. Series object

- A 1-D array of values.
- Each value is associated with an index label.
- Can be created easily from a list or an array.
- The index and the values are both directly accessible (`.index` and `.values`).
- A Pandas Series is more flexible than a NumPy array: a Series can be indexed by labels as well as by position, whereas a NumPy array supports only positional indexing.

In [31]:
import pandas as pd

# Series Object

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [32]:
# Access series attributes
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [33]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [34]:
data[1]  # Access value using index

np.float64(0.5)

In [35]:
data[1:3]  # Access a range of values using index range

1    0.50
2    0.75
dtype: float64

## (a) As Generalized NumPy array
- In NumPy, the index is implicit: an integer position is used to access the values of an array.
- In pandas, the index is explicitly defined and associated with the values.
- This gives pandas more flexibility.
- The index can even consist of strings.

In [36]:
# Example of explicit indexing in pandas
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [37]:
# Access values
data['b']

np.float64(0.5)

We can even use non-sequential integer indexes. For example,

In [38]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index = [2, 5, 1,3])
data

2    0.25
5    0.50
1    0.75
3    1.00
dtype: float64

In [39]:
# Access a value by its index label (label 1, not position 1)
data[1]

np.float64(0.75)
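
The difference between a label and a position is easiest to see side by side. The following is a minimal sketch; the names `arr`, `s`, and the three result variables are illustrative and not part of the cells above.

```python
import numpy as np
import pandas as pd

values = [0.25, 0.5, 0.75, 1.0]
arr = np.array(values)                     # NumPy: positions 0..3 only
s = pd.Series(values, index=[2, 5, 1, 3])  # pandas: explicit integer labels

by_position_np = arr[1]   # second element of the array
by_label = s[1]           # element whose label is 1
by_position = s.iloc[1]   # second element of the Series

print(by_position_np, by_label, by_position)
```

With an integer index, plain `s[1]` is a label lookup, so it returns 0.75 here, while `s.iloc[1]` returns 0.5.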

## (b) As Specialized Dictionary

- In Python, a dict maps keys to values, and neither is typed (i.e. one dict can have heterogeneous key/value types).

In [40]:
# Example
my_dict1 = {1: 'apple', 'two': 2, (3, 4): [5, 6]}  # different key data types
my_dict2 = {'a': 1, 'b': 'hello', 'c': [1, 2, 3]}  # different value data types

- A pandas Series, by contrast, is like a dictionary with typed keys and typed values: the keys share one type and the values share one type.
- Consistent data types make a Series more efficient to process, especially for large datasets.
- The same holds for a NumPy array versus a Python list: a NumPy array has a single data type, which makes it more efficient than a Python list.


- We can make a pandas series directly from a dict:

In [41]:
population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
population  # int type series

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

- If our dict has heterogeneous values, the Series gets the 'object' dtype.
- Since the 'object' dtype is not as optimized as more specific dtypes like 'int64' or 'float64', operations on such a Series may be slower and less memory-efficient.

In [42]:
# Example of object dtype
population_dict = {'California': 'a', 'Texas': 26448193, 'New York':'d', 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
population  # object dtype

California           a
Texas         26448193
New York             d
Florida       19552860
Illinois      12882135
dtype: object
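
The cost of the 'object' dtype can be checked roughly with `memory_usage`. This is only a sketch on a tiny example: the names `typed` and `mixed` are made up for illustration, and the exact byte counts depend on the platform and Python version.

```python
import pandas as pd

typed = pd.Series({'Texas': 26448193, 'Florida': 19552860, 'Illinois': 12882135})
mixed = pd.Series({'Texas': 26448193, 'Florida': 'd', 'Illinois': 12882135})

print(typed.dtype, mixed.dtype)

# deep=True also counts the Python objects that an object-dtype Series points to
typed_bytes = typed.memory_usage(index=False, deep=True)
mixed_bytes = mixed.memory_usage(index=False, deep=True)
print(typed_bytes, mixed_bytes)
```

The typed Series stores its values in one packed int64 buffer, while the object Series stores references to separate Python objects, which is why it needs more memory.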

- How is a Series more flexible than a dictionary? It supports array-style operations such as slicing, so we can access a range of values in one go:

In [43]:
population['Texas':'Florida']

Texas       26448193
New York           d
Florida     19552860
dtype: object

## (c) Constructing a Series object
`pd.Series(data, index=index)`

- data can be a list, numpy array, dict, or scalar.
- index is optional.

In [44]:
# From list
pd.Series([9,2,3])

0    9
1    2
2    3
dtype: int64

In [45]:
# From scalar
pd.Series(5, index=[1,2,3,4,9])

1    5
2    5
3    5
4    5
9    5
dtype: int64

In [46]:
# From Dict
pd.Series({2:'a', 1:'b', 5: 'k'})

2    a
1    b
5    k
dtype: object

In [47]:
# With a dict, only the keys listed in index are kept; listed keys missing from the dict become NaN
pd.Series({2:'a', 1:'b', 5: 'k'}, index=[1,5, 8])

1      b
5      k
8    NaN
dtype: object
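
The constructor also accepts a NumPy array, which the list above mentions but the cells do not show. A small sketch with made-up values follows. Unlike the dict case, where `index` selects keys, the index here must have the same length as the data.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
s = pd.Series(arr, index=['x', 'y', 'z'])  # one label per array element
print(s)

# A length mismatch is an error, not a selection
try:
    pd.Series(arr, index=['x', 'y'])
except ValueError as exc:
    print('ValueError:', exc)
```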

# 2. DataFrame object
- This is another fundamental structure in pandas.
- Like a Series, a DataFrame can be seen both as a generalization of a NumPy array and as a specialization of a Python dictionary.

## (a) As Generalized NumPy array

- Generalization of an array means that a DataFrame organizes multiple arrays/Series objects together, as columns.

- If a Series is the analog of a 1D array, then a DataFrame is the analog of a 2D array.

- It is a 2D array with flexible 'row indices' and 'column names': the row labels and column names can be customized freely, each column can have its own data type, and rows/columns can easily be reordered, added, or removed.

- We can think of a DataFrame as a sequence of aligned Series objects. Aligned means they share the same index. (Think of each Series as one column.)

- Let's create a DataFrame from two Series:

In [48]:
# Population series
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})

# Area series
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})

# Create states df
states = pd.DataFrame({'population':population, 'area': area})  # {column_name : series}
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
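
Alignment matters most when the Series do not share exactly the same index. The sketch below uses deliberately mismatched, shortened versions of the two Series (an assumption made for illustration): pandas matches rows by label and fills the gaps with NaN.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# Rows are matched by label, not by position
states = pd.DataFrame({'population': population, 'area': area})
print(states)
```

Only 'Texas' appears in both Series, so it is the only row with both values filled in.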


## (b) DataFrame as Specialized Dictionary

- In a dict, a key maps to a value.
- In a DF, a column name maps to a Series holding that column's data.

In [49]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

**Note, to avoid confusion:**
- In a 2D NumPy array, each inner array forms a row, so `data[0]` returns the first row.
- In a DF, each Series forms a column, so `data['column_name']` returns that column, not a row.

In [50]:
states['population']  # access population column

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [51]:
states['population']['Texas']  # Access the 'Texas' row of the 'population' column

np.int64(26448193)
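
The row-versus-column difference can be sketched on a tiny example. The array `arr` and the labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['x', 'y'], index=['a', 'b', 'c'])

first_row_np = arr[0]     # NumPy: first row
column_x = df['x']        # DataFrame: the column named 'x'
first_row_df = df.iloc[0] # DataFrame: first row, by position
cell = df.loc['b', 'y']   # single value: row 'b', column 'y'

print(first_row_np, list(column_x), list(first_row_df), cell)
```

`df.loc['b', 'y']` reaches the same value as the chained `df['y']['b']`, but in a single indexing step.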

## (c) Constructing DataFrame object
- Can be constructed from a list or Series object, a list of dicts, a dict of Series objects, a 2D NumPy array, or a NumPy structured array.

In [52]:
# From a single list of values (one column)
pd.DataFrame([3,4,5,1], index=['first', 'second', 'third', 'fourth'], columns=['numbers']) # Column content, row names, and column name

        numbers
first         3
second        4
third         5
fourth        1


In [53]:
# List of Dicts
pd.DataFrame([{'a':i, 'b': 2*i, 'c': 3*i} for i in range(1,5)])

   a  b   c
0  1  2   3
1  2  4   6
2  3  6   9
3  4  8  12


In [54]:
# Dict of Series objects
pd.DataFrame({'population':population, 'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [55]:
# From 2D NumPy array
import numpy as np
pd.DataFrame(np.random.rand(4,2), columns=['foo', 'bar'], index= ['a', 'b', 'c', 'd'])

        foo       bar
a  0.118741  0.171121
b  0.174509  0.568145
c  0.404560  0.466251
d  0.725503  0.937247


In [56]:
# From NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])  # 'i8' = 64-bit integer, 'f8' = 64-bit float
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


# 3. Pandas Index Object
- Every Series and DF object has an explicit index.
- This index is the reference used to access and modify values.
- We can think of an Index object as: 1. An immutable array 2. An ordered set

## (a) As Immutable Array
- An Index object cannot be modified in place once it is created, so it is immutable. (The index of a Series/DF can still be replaced as a whole by assigning a new one.)
- Like a normal array, a pandas Index object supports indexing and slicing, but its values cannot be changed.

In [57]:
# Define Index object
ind = pd.Index([2,3,5,7,11])

In [58]:
# Access Index object value
print(ind[2])
print(ind[::2])

5
Index([2, 5, 11], dtype='int64')


In [59]:
# Try modify Index object values
ind[3] = 2   # TypeError

TypeError: Index does not support mutable operations
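
Immutability applies to the Index object itself, not to the Series that carries it. A minimal sketch with a made-up Series `s`: changing one element of the index fails, while assigning a whole new index or calling `rename` works.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# In-place modification of an Index element is rejected
try:
    s.index[0] = 'z'
except TypeError as exc:
    print('TypeError:', exc)

s.index = ['x', 'y', 'z']            # replace the index as a whole
renamed = s.rename({'x': 'first'})   # returns a new Series with one label changed
print(renamed)
```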

## (b) As Ordered Set

- A pandas Index object is like an ordered set. It can also be a multiset, i.e. it can contain repeated/duplicate values.
- Common set operations such as intersection, union, and symmetric difference can be performed on Index objects as well.

In [60]:
# Multiset index
multi_index = pd.Index([1,2,3,4,2,7,11])
print(multi_index)

Index([1, 2, 3, 4, 2, 7, 11], dtype='int64')


In [61]:
# Set operations on index objects
indA= pd.Index([1, 3, 5, 7, 9])
indB= pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB))  # Intersection
print(indA.union(indB))  # Union
print(indA.symmetric_difference(indB))  # Symmetric difference


Index([3, 5, 7], dtype='int64')
Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Index([1, 2, 9, 11], dtype='int64')
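
Two further points can be sketched with the same values: the plain (one-sided) difference, and what a duplicate label means for lookups. The Series `s` is an illustrative addition.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
only_in_a = indA.difference(indB)  # elements of indA that are not in indB
print(only_in_a)

dup = pd.Index([1, 2, 3, 4, 2, 7, 11])
print(dup.is_unique)   # False: the label 2 appears twice
print(dup.unique())    # duplicates dropped, order of first appearance kept

# Looking up a duplicated label returns every matching row
s = pd.Series(range(7), index=dup)
print(s.loc[2])
```

Because the label 2 is repeated, `s.loc[2]` returns a two-element Series rather than a single value.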


# 4. Data Indexing and Selection

## (a) Data selection in Series

  - We can access a Series data in two ways: 
  1. As a dict 
  2. As an array
  

### Access Series as Dict

In [81]:
# Access Series as Dict

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data['c'])  # Use index as keys
print('c' in data)  # Check a key
print(data.keys())  # Get all keys
print(list(data.items()))  # Get all items
print(data.values)  # Note: values is an attribute, not a method, so no parentheses
data['e'] = 1.25 # Add a new entry, as with a dict
print(data)

0.75
True
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
[0.25 0.5  0.75 1.  ]
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64


### Access Series as 1D array
- A Series extends its dict-like behaviour with array-style selection, which makes it more flexible.

In [82]:
# Access Series as 1D array
print(data['a':'c'])  # Slicing with explicit index
print(data[0:2])      # Slicing with implicit index
print(data[(data>0.3) & (data<0.8)])  # Masking: uses boolean condition to select data
print(data[['a','e']])  # Fancy Indexing: uses list of indices to select data

a    0.25
b    0.50
c    0.75
dtype: float64
a    0.25
b    0.50
dtype: float64
b    0.50
c    0.75
dtype: float64
a    0.25
e    1.25
dtype: float64


**Note: Slicing can be confusing. With the implicit (positional) index the final index is excluded, while with the explicit (label) index the final index is included.**

### Indexers: loc, iloc, and ix

- When the explicit index consists of integers, it is easy to confuse implicit (positional) and explicit (label) access.
- Therefore, pandas provides special indexer attributes, loc and iloc, that make the indexing scheme explicit.
- 'loc' - To index and slice with the EXPLICIT index (labels).
- 'iloc' - To index and slice with the IMPLICIT index (integer positions).

In [83]:
data = pd.Series([23, 24, 25, 27, 20], index=[2, 3, 4, 5, 6])
data.reset_index()

   index   0
0      2  23
1      3  24
2      4  25
3      5  27
4      6  20


In [84]:
# Explicit vs. implicit indexing
print(data.loc[3])  # Explicit index 
print(data.iloc[3]) # Implicit index

24
27
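
The same two indexers also make the slicing rule explicit. A short sketch on the same Series, with `by_label` and `by_position` as illustrative names:

```python
import pandas as pd

data = pd.Series([23, 24, 25, 27, 20], index=[2, 3, 4, 5, 6])

by_label = data.loc[3:5]      # labels 3, 4 and 5: the end label is included
by_position = data.iloc[3:5]  # positions 3 and 4: the end position is excluded

print(by_label)
print(by_position)
```

The same slice `3:5` therefore selects three rows with `loc` and two rows with `iloc`.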


- 'ix' was a hybrid of the two that accepted both labels and positions. Because this could again lead to confusion, it was deprecated and then removed in pandas 1.0.

## (b) Data Selection in DataFrame

- DF is accessed in two ways:
  1. As a dict.
  2. As a 2D array.

### Access DF as Dict
- Consider dataframe as a Dictionary of related Series objects.

In [85]:
# 1. Create a DF
import pandas as pd

# Define Series of area and population
area = pd.Series({'California': 423967, 'Texas': 695662,
                   'New York': 141297, 'Florida': 170312,
                   'Illinois': 149995})

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                  'New York': 19651127, 'Florida': 19552860,
                  'Illinois': 12882135})

# Create a DF
data = pd.DataFrame({'area':area, 'pop':pop})

print(data)

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [86]:
# 2. (a) Access a column using dict-style keys
print(data['area'])

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64


In [87]:
# (b). Access column with attribute-style
print(data.area)

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64


In [88]:
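# Attribute style and dict-style both access the same column
# (compare contents: whether an identity check with `is` passes depends on the pandas version)
assert data.area.equals(data['area'])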

**NOTE: Attribute-style column access is a convenient shorthand, but it only works for column names that are strings and valid Python identifiers. If the column name is the same as a DataFrame method or attribute, the attribute refers to the method, not the column!**

In [89]:
# Example of using attribute style when the column name matches a method name (DataFrame.pop)
assert data.pop is not data['pop']
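
To see what the attribute actually refers to in that case, the two forms can be inspected directly. A minimal sketch on a shortened, illustrative frame `df`:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

print(callable(df.pop))            # True: df.pop is the DataFrame.pop method
print(type(df['pop']).__name__)    # Series: dict-style access reaches the column
print(df.area.equals(df['area']))  # True: no name clash for 'area'
```

Dict-style access is therefore the safer default whenever a column name might collide with a method.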

  Dict-style syntax can also create a new column. The following performs element-by-element arithmetic between Series objects.

In [90]:
# (c) Creating a new column using dict-style syntax
data['density'] = data['pop']/data['area']
print(data)

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


### Access DF as a 2D array
- Various 2D array like operations can be performed on a DF.

In [91]:
# 1. Access the underlying 2D array with the values attribute
print(data.values)

[[4.23967000e+05 3.83325210e+07 9.04139261e+01]
 [6.95662000e+05 2.64481930e+07 3.80187404e+01]
 [1.41297000e+05 1.96511270e+07 1.39076746e+02]
 [1.70312000e+05 1.95528600e+07 1.14806121e+02]
 [1.49995000e+05 1.28821350e+07 8.58837628e+01]]


In [92]:
# Transpose data
print(data.T)

           California         Texas      New York       Florida      Illinois
area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01


In [93]:
# To access a row
print(data.values[0])

[4.23967000e+05 3.83325210e+07 9.04139261e+01]


- We cannot do data[0] to access a row directly, because indexing is done in dict style.
- data['col_name'] is used to access a column.

In [94]:
# To access a column
print(data['area'])

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
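
A short sketch of what happens with `data[0]`-style access, and the row access that works instead. The frame `df` is a shortened, illustrative copy of the data above.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

# df[0] is treated as a column lookup, and there is no column named 0
try:
    df[0]
except KeyError as exc:
    print('KeyError:', exc)

first_row = df.iloc[0]   # row by position
texas = df.loc['Texas']  # row by label
print(first_row)
print(texas)
```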


Therefore, as with Series objects, we use the loc and iloc indexers to access rows of a DF.


### Indexers: loc, iloc

In [95]:
# Show the implicit (positional) index next to the labels
data.reset_index()

        index    area       pop     density
0  California  423967  38332521   90.413926
1       Texas  695662  26448193   38.018740
2    New York  141297  19651127  139.076746
3     Florida  170312  19552860  114.806121
4    Illinois  149995  12882135   85.883763


In [96]:
data.iloc[0:2, : ]   # Last index is not included

              area       pop    density
California  423967  38332521  90.413926
Texas       695662  26448193  38.018740


In [97]:
data.loc['California':'Florida', :]  # Includes last index also

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121


### Masking and fancy indexing

In [109]:
# Masking combined with fancy indexing
data.loc[data.density>100, ['pop','density']]  # Boolean expression selects rows, list selects columns

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


In [108]:
# Fancy Indexing
data.iloc[[1,2], [1,2]]  # Using lists of positions to select rows and cols

               pop     density
Texas     26448193   38.018740
New York  19651127  139.076746


In [110]:
# Modify values using iloc/loc

data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [111]:
data.iloc[0, 2] = 0.0

In [112]:
data

              area       pop     density
California  423967  38332521    0.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
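
The same kind of assignment works by label with `loc`, which is often easier to read than positions. A sketch on a shortened, illustrative frame `df`, first zeroing one cell and then restoring it from the other two columns:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])
df['density'] = df['pop'] / df['area']

df.loc['California', 'density'] = 0.0   # set one cell by row and column label
zeroed = df.loc['California', 'density']

# Recompute the same cell from the other two columns
df.loc['California', 'density'] = (df.loc['California', 'pop']
                                   / df.loc['California', 'area'])
print(df)
```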


### Additional Indexing Conventions
**1. Indexing refers to columns, while slicing refers to rows.**

In [117]:
# Example
data['area']  # Indexing refers to a column

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [119]:
data['California':'New York'] # Slicing refers to rows

              area       pop     density
California  423967  38332521    0.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746

**2. Row slices can also be given as numbers (implicit index).**

In [122]:
data[0:3]

              area       pop     density
California  423967  38332521    0.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746


**3. Direct masking operations are also interpreted row-wise.**

In [127]:
data[data['density']>100] 

#or

data[data.density>100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
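
Masks can also be combined before they are applied. The sketch below rebuilds the frame under the illustrative name `df` and uses an arbitrary second threshold on area. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                   'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                  index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# Rows that satisfy both conditions
dense_and_large = df[(df['density'] > 100) & (df['area'] > 150000)]
print(dense_and_large)
```

Only Florida passes both tests here, since New York is dense enough but has an area below the threshold.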
