# Data Indexing and Selection

## (a) Data selection in Series

- We can access Series data in two ways:
  1. As a dict
  2. As an array
  

### Access Series as Dict

In [1]:
# Access Series as Dict
import pandas as pd 
import numpy as np

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data['c'])  # Use index as keys
print('c' in data)  # Check a key
print(data.keys())  # Get all keys
print(list(data.items()))  # Get all items
print(data.values)  # Note: values is an attribute, not a method, so there are no parentheses
data['e'] = 1.25  # Add a new entry, just as with a dict
print(data)

0.75
True
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
[0.25 0.5  0.75 1.  ]
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64


### Access Series as 1D array
- A Series extends its dict-like behaviour with array-like selection (slices, masks, and fancy indexing), which makes it more flexible than a plain dict.

In [2]:
# Access Series as 1D array
print(data['a':'c'])  # Slicing with explicit index
print(data[0:2])      # Slicing with implicit index
print(data[(data>0.3) & (data<0.8)])  # Masking: uses boolean condition to select data
print(data[['a','e']])  # Fancy Indexing: uses list of indices to select data

a    0.25
b    0.50
c    0.75
dtype: float64
a    0.25
b    0.50
dtype: float64
b    0.50
c    0.75
dtype: float64
a    0.25
e    1.25
dtype: float64


**Note: Slicing can be confusing. With the implicit (positional) index the final index is excluded, while with the explicit (label) index the final index is included.**

### Indexers: loc, iloc, and ix

- When the explicit index consists of integers, it is easy to confuse implicit and explicit index access.
- Therefore, pandas provides the special indexer attributes loc and iloc, which make the indexing scheme explicit.
- 'loc' - Indexes and slices with the EXPLICIT index (labels).
- 'iloc' - Indexes and slices with the IMPLICIT index (positions).

In [3]:
data = pd.Series([23, 24, 25, 27, 20], index=[2, 3, 4, 5, 6])
data.reset_index()

   index   0
0      2  23
1      3  24
2      4  25
3      5  27
4      6  20


In [4]:
# Explicit Indexing 
print(data.loc[3])  # Explicit index 
print(data.iloc[3]) # Implicit index

24
27
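
The same distinction applies to slices. A minimal sketch, assuming the integer-labelled Series from above is rebuilt under the hypothetical name `temps`:

```python
import pandas as pd

# Same values and integer labels as the Series above
temps = pd.Series([23, 24, 25, 27, 20], index=[2, 3, 4, 5, 6])

label_slice = temps.loc[2:4]      # labels 2, 3 and 4 (end label included)
position_slice = temps.iloc[2:4]  # positions 2 and 3 (end position excluded)

print(label_slice.tolist())     # [23, 24, 25]
print(position_slice.tolist())  # [25, 27]
```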


- 'ix' was a hybrid indexer that accepted both labels and positions. Because this mixing led to the same confusion, it was deprecated in pandas 0.20 and removed in pandas 1.0.
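
Code that relied on mixing labels and positions can usually be rewritten by converting one into the other first. The following is only a sketch of that idea, using a hypothetical Series `s`; it is not the only possible replacement:

```python
import pandas as pd

s = pd.Series([23, 24, 25, 27, 20], index=[2, 3, 4, 5, 6])

# Position -> label, then a label-based lookup with loc
label_at_pos_3 = s.index[3]
via_loc = s.loc[label_at_pos_3]

# Label -> position, then a position-based lookup with iloc
pos_of_label_5 = s.index.get_loc(5)
via_iloc = s.iloc[pos_of_label_5]

print(via_loc, via_iloc)  # both refer to the same element
```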

## (b) Data Selection in DataFrame

- A DataFrame (DF) can be accessed in two ways:
  1. As a dict.
  2. As a 2D array.

### Access DF as Dict
- Consider dataframe as a Dictionary of related Series objects.

In [5]:
# 1. Create a DF
import pandas as pd

# Define Series of area and population
area = pd.Series({'California': 423967, 'Texas': 695662,
                   'New York': 141297, 'Florida': 170312,
                   'Illinois': 149995})

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                  'New York': 19651127, 'Florida': 19552860,
                  'Illinois': 12882135})

# Create a DF
data = pd.DataFrame({'area':area, 'pop':pop})

print(data)

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [6]:
# 2. (a) Access a column using dict-style keys
print(data['area'])

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64


In [7]:
# (b) Access a column with attribute-style access
print(data.area)

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64


In [8]:
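# Attribute-style and dict-style access return the same column.
# equals() compares contents; an identity check with `is` depends on pandas' internal caching.
assert data.area.equals(data['area'])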

**NOTE: Attribute-style access is a useful shorthand, but it only works when the column name is a string that is a valid Python identifier. It also fails if the column name is the same as a DataFrame method or attribute.**

In [9]:
# Example: attribute style returns the DataFrame.pop method, not the 'pop' column
assert data.pop is not data['pop']
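
A short sketch of both limitations, assuming a hypothetical frame `df` with an extra column `'land area'` whose values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193],
                   'land area': [1, 2]},        # illustrative values only
                  index=['California', 'Texas'])

print(type(df.area).__name__)    # 'area' is a valid identifier, so this is the column
print(callable(df.pop))          # df.pop is the DataFrame.pop method, not the column
print(type(df['pop']).__name__)  # brackets always reach the column
print(df['land area'].tolist())  # a name containing a space needs brackets
```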

When creating a new column with dict-style syntax, the expression on the right performs element-by-element arithmetic between Series objects.

In [10]:
# (c) Creating a new column using dict-style syntax
data['density'] = data['pop']/data['area']
print(data)

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
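
Element-by-element arithmetic between Series is aligned on index labels rather than on position. A small sketch, assuming two hypothetical Series whose rows are stored in a different order:

```python
import pandas as pd

area = pd.Series({'Texas': 695662, 'California': 423967})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})

# Values are matched by label before the division is carried out
density = pop / area
print(density)
```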


### Access DF as a 2D array
- Various 2D array-like operations can be performed on a DF.

In [11]:
# 1. View the underlying data as a 2D NumPy array with the values attribute
print(data.values)

[[4.23967000e+05 3.83325210e+07 9.04139261e+01]
 [6.95662000e+05 2.64481930e+07 3.80187404e+01]
 [1.41297000e+05 1.96511270e+07 1.39076746e+02]
 [1.70312000e+05 1.95528600e+07 1.14806121e+02]
 [1.49995000e+05 1.28821350e+07 8.58837628e+01]]
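
The array is printed in floating-point notation because a NumPy array has a single dtype, so the integer columns are upcast to match the float `density` column. A minimal sketch, assuming a smaller hypothetical frame `df`:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])
int_dtype = df.values.dtype      # all columns hold integers
print(int_dtype)

df['density'] = df['pop'] / df['area']
mixed_dtype = df.values.dtype    # integers are upcast to float to fit one array
print(mixed_dtype)
```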


In [12]:
# Transpose data
print(data.T)

           California         Texas      New York       Florida      Illinois
area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01


In [13]:
# To access a row
print(data.values[0])

[4.23967000e+05 3.83325210e+07 9.04139261e+01]


- We cannot use data[0] to access a row directly, because [] indexing is dict-style: pandas looks for a column labelled 0 and raises a KeyError.
- data['col_name'] is used to access a column.
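
The first point can be checked with a short sketch, assuming a small hypothetical frame `df` with string column names:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

raised = False
try:
    df[0]                  # looks for a column labelled 0
except KeyError as exc:
    raised = True
    print('KeyError:', exc)

first_row = df.iloc[0]     # positional row access returns the row as a Series
print(first_row['area'])
```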

In [14]:
# To access a column
print(data['area'])

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64


Therefore, as with Series objects, we use the loc and iloc indexers to select rows (and columns) of a DF.


### Indexers: loc, iloc

In [15]:
# Show the implicit (positional) index next to the explicit labels
data.reset_index()

        index    area       pop     density
0  California  423967  38332521   90.413926
1       Texas  695662  26448193   38.018740
2    New York  141297  19651127  139.076746
3     Florida  170312  19552860  114.806121
4    Illinois  149995  12882135   85.883763


In [16]:
data.iloc[0:2, : ]   # Last index is not included

              area       pop    density
California  423967  38332521  90.413926
Texas       695662  26448193  38.018740


In [17]:
data.loc['California':'Florida', :]  # Includes last index also

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
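
Both indexers also accept a column slice as the second argument, with the same inclusive and exclusive rules. A minimal sketch, assuming a three-row hypothetical frame `df`:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])
df['density'] = df['pop'] / df['area']

by_label = df.loc[:'Texas', :'pop']  # up to and including 'Texas' and 'pop'
by_position = df.iloc[:2, :2]        # first two rows, first two columns

print(by_label.equals(by_position))
```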


### Masking and fancy indexing

In [29]:
# Masking: keep the rows where density > 100, then pick columns from the result
d = data[data.density > 100]
d[['pop', 'density']]

# Equivalent single-step form with .loc:
# data.loc[data.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


In [32]:
# data[data.density>100, ['pop', 'density']]

# Raises an error: plain [] accepts a single key, not a (rows, columns) pair

- Pandas indexing with the standard bracket [] notation allows only one indexing operation (either rows or columns). A comma inside [] without .loc or .iloc does not select rows and columns together: pandas receives the whole comma-separated expression as a single key, cannot resolve it, and raises an error.
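
A hedged sketch of this behaviour, assuming a small hypothetical frame `df`. The exact exception type may differ between pandas versions, so the example catches `Exception` broadly:

```python
import pandas as pd

df = pd.DataFrame({'pop': [38332521, 19651127],
                   'density': [90.413926, 139.076746]},
                  index=['California', 'New York'])
mask = df['density'] > 100

failed = False
try:
    df[mask, ['pop']]       # the pair (mask, ['pop']) is passed as one key
except Exception as exc:    # exception type is version-dependent
    failed = True
    print(type(exc).__name__)

rows_then_cols = df[mask][['pop']]     # two separate [] operations
rows_and_cols = df.loc[mask, ['pop']]  # one .loc call with row and column indexers
print(rows_then_cols.equals(rows_and_cols))
```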

In [33]:
# Masking combined with fancy indexing
data.loc[data.density>100, ['pop','density']]  # Boolean expression selects rows, list selects columns

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


In [34]:
# Fancy indexing
data.iloc[[1,2], [1,2]]  # Lists of positions select rows and columns

               pop     density
Texas     26448193   38.018740
New York  19651127  139.076746


In [35]:
# Modify values using iloc/loc

data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [37]:
data.iloc[0, 2] = 0.0

In [38]:
data

              area       pop     density
California  423967  38332521    0.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
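
The same kind of assignment works with loc and labels. A minimal sketch, assuming a two-row hypothetical frame `df`, that zeroes a cell by position and then restores it by label:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])
df['density'] = df['pop'] / df['area']

df.iloc[0, 2] = 0.0   # by position: row 0, column 2 ('density')

# By label: recompute the value for the same cell
df.loc['California', 'density'] = (
    df.loc['California', 'pop'] / df.loc['California', 'area']
)
print(df)
```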


### Additional Indexing Conventions
**1. While indexing refers to columns, slicing refers to rows.**

In [39]:
# Example
data['area']  # Indexing refers to a column

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [40]:
data['California':'New York']  # Slicing refers to rows

              area       pop     density
California  423967  38332521    0.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746


**2. Rows can also be sliced by number (implicit index).**

In [41]:
data[0:3]

              area       pop     density
California  423967  38332521    0.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746


**3. A direct masking operation is also interpreted row-wise.**

In [42]:
data[data['density']>100] 

# or, equivalently

data[data.density>100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
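
Each of these three bracket shortcuts has an explicit loc or iloc equivalent. A minimal sketch that compares them, assuming a three-row hypothetical frame `df`:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])
df['density'] = df['pop'] / df['area']

# 1. Label slice on rows
same_label_slice = df['California':'Texas'].equals(df.loc['California':'Texas'])
# 2. Positional slice on rows
same_position_slice = df[0:2].equals(df.iloc[0:2])
# 3. Boolean mask on rows
mask = df['density'] > 100
same_mask = df[mask].equals(df.loc[mask])

print(same_label_slice, same_position_slice, same_mask)
```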
