# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**

4. **Creating DataFrames**
    
    - From lists, dictionaries, and arrays
    - Reading data from CSV, Excel, and other formats
5. **Basic DataFrame Operations**
    
    - Inspecting the DataFrame
    - Indexing and selecting data
    - Descriptive statistics
6. **Data Cleaning and Handling Missing Data**
    
    - Handling missing values
    - Dropping or filling missing values
    - Removing duplicates

### **`5. Basic DataFrame Operations`**

#### **`Indexing and Selecting Data in a DataFrame`**

**Introduction:**
Indexing and selecting data in a Pandas DataFrame are fundamental operations for extracting specific subsets of information. The two main tools for this are the label-based indexer `loc[]` and the position-based indexer `iloc[]`. In this section, we'll explore both and work through examples of conditional indexing and boolean indexing.

**Using `loc[]` for Label-Based Indexing:**

1. **Selecting Rows by Label:**
   - Use `loc[]` to select rows based on their labels (index values):

In [1]:
import pandas as pd

# Read the CSV file without setting index
df = pd.read_csv('data.csv')

# Display the DataFrame to inspect its structure
print(df)

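# Alternative: use an existing column as the index
# (only applies if the file actually contains an 'ID' column)
# df = pd.read_csv('data.csv', index_col='ID')

# Select the row whose index label is 2 (the default index is 0, 1, 2, ...)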
selected_row = df.loc[2]
print(selected_row)


      Name  Age          City
0   Laxman   25          Pune
1   Rajesh   30     Hyderabad
2      Ram   22  Mahabubnagar
3    Ganga   32  Mahabubnagar
4   Jamuna   32          Pune
5  Namrata   15          Pune
6   Varsha   16     Hyderabad
7   Vamshi   22     Hyderabad
8   Ananya   14  Mahabubnagar
Name             Ram
Age               22
City    Mahabubnagar
Name: 2, dtype: object
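
If `data.csv` is not at hand, the same DataFrame can be rebuilt inline so that the cells in this section are reproducible. This is only a sketch: the values are copied from the printed output above, and building the frame from a dictionary instead of reading the file is an assumption made for convenience.

```python
import pandas as pd

# Values copied from the printed DataFrame above; only the CSV read is replaced
df = pd.DataFrame({
    "Name": ["Laxman", "Rajesh", "Ram", "Ganga", "Jamuna",
             "Namrata", "Varsha", "Vamshi", "Ananya"],
    "Age": [25, 30, 22, 32, 32, 15, 16, 22, 14],
    "City": ["Pune", "Hyderabad", "Mahabubnagar", "Mahabubnagar", "Pune",
             "Pune", "Hyderabad", "Hyderabad", "Mahabubnagar"],
})

# No index was specified, so pandas assigns the default labels 0..8
print(df.index)

# Label 2 therefore refers to the third row
print(df.loc[2])
```

Because the index labels here happen to equal the row positions, `loc[2]` and `iloc[2]` return the same row for this particular frame.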


2. **Selecting Specific Columns for a Row:**
   - Specify both row label and column label to select a specific value:

In [2]:
specific_value = df.loc[2, 'Name']  # Select 'Name' for row with index 2
print(specific_value)

Ram


3. **Slicing Rows:**
   - Use slicing with labels to select a range of rows:

In [3]:
sliced_rows = df.loc[2:5]  # Select rows with indices 2 to 5 (inclusive)
print(sliced_rows)

      Name  Age          City
2      Ram   22  Mahabubnagar
3    Ganga   32  Mahabubnagar
4   Jamuna   32          Pune
5  Namrata   15          Pune


#### Explanation for why the stop value is inclusive when slicing with loc[]

In the code snippet below:

```python
sliced_rows = df.loc[2:5]  # Select rows with indices 2 to 5 (inclusive)
print(sliced_rows)
```

- `df`: A Pandas DataFrame (here, the one loaded above).
- `loc`: The label-based indexer of the DataFrame. It is an attribute (a property that returns an indexer object), not a regular method, function, or constant. It accesses a group of rows and columns by labels or by a boolean array. Here it selects the rows with index labels from 2 to 5 (inclusive).

The general pattern is `df.loc[row_indexer, column_indexer]`, where `row_indexer` selects rows by label and the optional `column_indexer` selects columns by label.

Standard Python slicing excludes the stop value, but `loc` slices by label and includes both the start and the stop label. So `df.loc[2:5]` selects the rows with index labels 2, 3, 4, and 5.

In contrast, `iloc` slices by integer position and follows the standard Python convention, so its stop position is excluded. The label-based behavior is deliberate: labels are not necessarily integers, so there is often no obvious "next label after the stop" to pass as an exclusive bound. Including the stop label lets you name the last row you actually want.
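
The inclusive/exclusive difference described above can be seen side by side. The following is a minimal sketch; `df_demo` and its string labels `"a"` to `"d"` are hypothetical and chosen only to make labels and positions easy to tell apart.

```python
import pandas as pd

# Hypothetical frame with string labels, used only for illustration
df_demo = pd.DataFrame(
    {"Age": [25, 30, 22, 32]},
    index=["a", "b", "c", "d"],
)

# Label slice: both the start label "b" and the stop label "c" are returned
by_label = df_demo.loc["b":"c"]

# Position slice: positions 1 and 2 are returned, position 3 is excluded
by_position = df_demo.iloc[1:3]

print(by_label)
print(by_position)
```

Both slices return the rows labelled `"b"` and `"c"`. With an exclusive stop, the label slice would have to name `"d"`, a row that is not wanted in the result.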

#### Is loc[] a function, a constant, or an attribute?

`loc` is an attribute of the DataFrame, specifically a property that returns an indexer object. Accessing an attribute never needs parentheses; parentheses are only needed to call a function or method.

With `loc`, the selection itself happens through the square brackets. Writing `df.loc[...]` is ordinary Python item access on the indexer object, the same mechanism used for `my_list[0]` or `my_dict['key']`.

Here's a brief explanation:

- **Accessing an attribute**: `df.loc` (no parentheses) returns the indexer object.

- **Item access with square brackets**: `df.loc[2:5]` passes the labels to the indexer, which returns the matching rows.

In the case of Pandas' `loc`, consider the following:

- **Correct (square brackets)**:
  ```python
  selected_rows = df.loc[2:5]  # item access on the loc indexer; returns rows labelled 2 to 5
  ```

- **Incorrect (extra parentheses)**:
  ```python
  selected_rows = df.loc[2:5]()  # raises TypeError: the DataFrame returned by df.loc[2:5] is not callable
  ```

So, `df.loc[2:5]` uses the `loc` indexer to select rows based on labels, and square brackets rather than parentheses are simply how Pandas exposes label-based indexing.
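
To check what `loc` actually is, it can be inspected directly. The sketch below assumes a small hypothetical frame `df_demo`; the printed class name of the indexer is a pandas internal and may differ between versions, so it is shown for orientation only.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
df_demo = pd.DataFrame({"Age": [25, 30, 22, 32, 32, 15]})

# loc is declared as a property on the class, so df_demo.loc needs no parentheses
print(isinstance(pd.DataFrame.loc, property))

# Accessing the attribute returns an indexer object
indexer = df_demo.loc
print(type(indexer).__name__)  # internal class name; may vary by pandas version

# Square brackets are plain item access, i.e. a call to __getitem__
with_brackets = df_demo.loc[2:5]
with_getitem = indexer.__getitem__(slice(2, 5))
print(with_brackets.equals(with_getitem))

# Appending () tries to call the DataFrame that the slice returned
try:
    df_demo.loc[2:5]()
    call_error = None
except TypeError as exc:
    call_error = exc
print(type(call_error).__name__)
```

The slice `2:5` is turned into a `slice(2, 5)` object by Python itself; pandas then interprets its endpoints as labels.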

4. **Selecting Rows and Columns Simultaneously:**
   - Use `loc[]` to select specific rows and columns:

In [4]:
selected_data = df.loc[2:5, ['Name', 'Age']]
print(selected_data)

      Name  Age
2      Ram   22
3    Ganga   32
4   Jamuna   32
5  Namrata   15


In [5]:
selected_data = df.loc[2:5, ['Age', 'Name']]
print(selected_data)

# Note: columns can be listed in any order; the result follows the order given

   Age     Name
2   22      Ram
3   32    Ganga
4   32   Jamuna
5   15  Namrata


In [7]:
selected_data = df.loc[[3,2,5], ['Age', 'Name']]
print(selected_data)

# Note: rows can also be picked individually with a list of labels; the result follows the list order

   Age     Name
3   32    Ganga
2   22      Ram
5   15  Namrata


**Using `iloc[]` for Position-Based Indexing:**

1. **Selecting Rows by Position:**
   - Use `iloc[]` to select rows based on their integer positions:

In [28]:
print(df)
print("-------------")
selected_row_position = df.iloc[1]  # Select the second row (position 1)
print(selected_row_position)

      Name  Age          City
0   Laxman   25          Pune
1   Rajesh   30     Hyderabad
2      Ram   22  Mahabubnagar
3    Ganga   32  Mahabubnagar
4   Jamuna   32          Pune
5  Namrata   15          Pune
6   Varsha   16     Hyderabad
7   Vamshi   22     Hyderabad
8   Ananya   14  Mahabubnagar
-------------
Name       Rajesh
Age            30
City    Hyderabad
Name: 1, dtype: object


2. **Selecting Specific Columns for a Row by Position:**
   - Specify both row position and column position to select a specific value:

In [25]:
specific_value_position = df.iloc[1, 0]  # Select the first column for the second row
print(specific_value_position)

Rajesh


3. **Slicing Rows by Position:**
   - Use slicing with integer positions to select a range of rows:

In [26]:
sliced_rows_position = df.iloc[1:4]  # Select rows at positions 1 to 3 (stop position 4 is excluded)
print(sliced_rows_position)

     Name  Age          City
1  Rajesh   30     Hyderabad
2     Ram   22  Mahabubnagar
3   Ganga   32  Mahabubnagar


4. **Selecting Rows and Columns Simultaneously by Position:**
   - Use `iloc[]` to select specific rows and columns by position:

In [29]:
selected_data_position = df.iloc[1:4, [0, 1]]
print(selected_data_position)

     Name  Age
1  Rajesh   30
2     Ram   22
3   Ganga   32
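
In the cells above, `loc` and `iloc` appear interchangeable because the index labels 0 to 8 coincide with the row positions. The two diverge as soon as the rows are reordered or filtered. The following sketch illustrates this with a hypothetical four-row frame `df_demo` (names and ages taken from the first rows of the data above) that is sorted by age.

```python
import pandas as pd

# Hypothetical subset of the tutorial data, used only for illustration
df_demo = pd.DataFrame({
    "Name": ["Laxman", "Rajesh", "Ram", "Ganga"],
    "Age": [25, 30, 22, 32],
})

# Sorting reorders the rows but each row keeps its original index label
by_age = df_demo.sort_values("Age")
print(by_age)

# loc looks up the label 0, wherever that row now sits
name_at_label_0 = by_age.loc[0, "Name"]

# iloc looks up the first position, whatever its label is
name_at_position_0 = by_age.iloc[0, 0]

# Negative positions count from the end, as with Python lists
name_at_last_position = by_age.iloc[-1, 0]

print(name_at_label_0, name_at_position_0, name_at_last_position)
```

After sorting, label 0 still belongs to Laxman, while position 0 is now Ram, the youngest person in the frame.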


**Conditional Indexing and Boolean Indexing:**

1. **Conditional Indexing:**
   - Build a boolean Series from a condition and pass it to `loc[]` to keep the rows where it is `True`:

In [30]:
condition = df['Age'] > 30
conditionally_selected = df.loc[condition]
print(conditionally_selected)

     Name  Age          City
3   Ganga   32  Mahabubnagar
4  Jamuna   32          Pune


2. **Boolean Indexing:**
   - Pass the boolean condition directly to `df[...]`; this is shorthand for the same filter and returns the same rows:

In [31]:
boolean_selected = df[df['Age'] > 30]
print(boolean_selected)

     Name  Age          City
3   Ganga   32  Mahabubnagar
4  Jamuna   32          Pune
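
Filters are not limited to a single condition. As a sketch, the example below combines conditions with `&` (and), `|` (or), and `~` (not) on a hypothetical five-row frame `df_demo` built from names in the data above. Each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparison operators.

```python
import pandas as pd

# Hypothetical subset of the tutorial data, used only for illustration
df_demo = pd.DataFrame({
    "Name": ["Laxman", "Rajesh", "Ram", "Jamuna", "Namrata"],
    "Age": [25, 30, 22, 32, 15],
    "City": ["Pune", "Hyderabad", "Mahabubnagar", "Pune", "Pune"],
})

# AND: older than 20 and living in Pune
both = (df_demo["Age"] > 20) & (df_demo["City"] == "Pune")

# OR: younger than 18 or living in Hyderabad
either = (df_demo["Age"] < 18) | (df_demo["City"] == "Hyderabad")

# NOT: everyone who does not live in Pune
not_pune = ~(df_demo["City"] == "Pune")

# A mask can be combined with a column list inside loc[]
print(df_demo.loc[both, ["Name", "Age"]])
print(df_demo[either])
print(df_demo[not_pune])
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series and raise a `ValueError` here, so the symbolic operators are required.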


**Conclusion:**
Indexing and selecting data in a Pandas DataFrame using `loc[]` and `iloc[]` are powerful techniques. These indexers let you retrieve specific rows and columns based on labels or positions. Additionally, conditional indexing and boolean indexing enable you to filter data efficiently based on specific criteria.

#### Example:


In [33]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Laxman', 'Rajesh', 'Chanakya', 'Dravid', 'Emanuel'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)  # Set 'Name' column as the index

# Using loc[] for Label-Based Indexing
selected_row = df.loc['Rajesh']
specific_value = df.loc['Rajesh', 'Age']
sliced_rows = df.loc['Rajesh':'Dravid']
selected_data = df.loc[['Rajesh', 'Dravid'], ['Age', 'Salary']]

# Using iloc[] for Position-Based Indexing
selected_row_position = df.iloc[1]
specific_value_position = df.iloc[1, 0]
sliced_rows_position = df.iloc[1:4]
selected_data_position = df.iloc[1:4, [0, 1]]

# Conditional Indexing and Boolean Indexing
condition = df['Age'] > 30
conditionally_selected = df.loc[condition]
boolean_selected = df[df['Age'] > 30]

# Displaying the results
print("Using loc[] for Label-Based Indexing:")
print("Selected Row:\n", selected_row)
print("Specific Value:\n", specific_value)
print("Sliced Rows:\n", sliced_rows)
print("Selected Data:\n", selected_data)

print("\nUsing iloc[] for Position-Based Indexing:")
print("Selected Row by Position:\n", selected_row_position)
print("Specific Value by Position:\n", specific_value_position)
print("Sliced Rows by Position:\n", sliced_rows_position)
print("Selected Data by Position:\n", selected_data_position)

print("\nConditional Indexing and Boolean Indexing:")
print("Conditionally Selected:\n", conditionally_selected)
print("Boolean Selected:\n", boolean_selected)


Using loc[] for Label-Based Indexing:
Selected Row:
 Age              30
Salary        60000
Experience        5
Name: Rajesh, dtype: int64
Specific Value:
 30
Sliced Rows:
           Age  Salary  Experience
Name                             
Rajesh     30   60000           5
Chanakya   35   75000           8
Dravid     22   48000           2
Selected Data:
         Age  Salary
Name               
Rajesh   30   60000
Dravid   22   48000

Using iloc[] for Position-Based Indexing:
Selected Row by Position:
 Age              30
Salary        60000
Experience        5
Name: Rajesh, dtype: int64
Specific Value by Position:
 30
Sliced Rows by Position:
           Age  Salary  Experience
Name                             
Rajesh     30   60000           5
Chanakya   35   75000           8
Dravid     22   48000           2
Selected Data by Position:
           Age  Salary
Name                 
Rajesh     30   60000
Chanakya   35   75000
Dravid     22   48000

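Conditional Indexing and Boolean Indexing:
Conditionally Selected:
            Age  Salary  Experience
Name                             
Chanakya   35   75000           8
Boolean Selected:
            Age  Salary  Experience
Name                             
Chanakya   35   75000           8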

#### Real-world Scenario:
Imagine you have a dataset containing information about employees in a company, including their ID, name, department, salary, and performance ratings. You want to perform various operations to analyze and extract specific information about employees.

In [9]:
import pandas as pd

# Sample employee data
employee_data = {
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Radhika', 'Shanaya', 'Khushi', 'Bhanu', 'Swapna'],
    'Department': ['HR', 'IT', 'Sales', 'Finance', 'Marketing'],
    'Salary': [60000, 75000, 80000, 90000, 70000],
    'PerformanceRating': [8.5, 9.2, 7.8, 8.9, 9.5],
}

# Creating a DataFrame from the employee data
df_employees = pd.DataFrame(employee_data)

# Indexing and selecting data
selected_employee = df_employees.loc[df_employees['Name'] == 'Khushi']
high_performance_employees = df_employees[df_employees['PerformanceRating'] > 9.0]
selected_columns = df_employees.loc[:, ['Name', 'Department', 'Salary']]

# Displaying the selected data
print("Selected Employee (Khushi):\n", selected_employee)
print("\nHigh-Performance Employees:\n", high_performance_employees)
print("\nSelected Columns:\n", selected_columns)


Selected Employee (Khushi):
    EmployeeID    Name Department  Salary  PerformanceRating
2         103  Khushi      Sales   80000                7.8

High-Performance Employees:
    EmployeeID     Name Department  Salary  PerformanceRating
1         102  Shanaya         IT   75000                9.2
4         105   Swapna  Marketing   70000                9.5

Selected Columns:
       Name Department  Salary
0  Radhika         HR   60000
1  Shanaya         IT   75000
2   Khushi      Sales   80000
3    Bhanu    Finance   90000
4   Swapna  Marketing   70000


#### Considerations or Peculiarities:

- **Indexing Choice:** Choose an appropriate column as the index based on your analysis needs. It could be a unique identifier like employee ID or another column that is relevant to your analysis.

- **Boolean Indexing:** Understand how to use boolean indexing effectively. It allows you to filter data based on conditions, as shown in the example with high-performance employees.

- **Column Selection:** Be mindful of the columns you select. If you only need specific columns, it's more efficient to select those rather than the entire DataFrame.
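
The index-choice consideration above can be sketched with the employee data. The example assumes a shortened three-row version of `df_employees` and a hypothetical variable `by_id`; `set_index` returns a new DataFrame here and leaves the original unchanged.

```python
import pandas as pd

# Shortened version of the employee data above, used only for illustration
df_employees = pd.DataFrame({
    "EmployeeID": [101, 102, 103],
    "Name": ["Radhika", "Shanaya", "Khushi"],
    "Salary": [60000, 75000, 80000],
})

# Use the unique identifier as the index
by_id = df_employees.set_index("EmployeeID")

# A unique index guarantees that a single label returns a single row
print(by_id.index.is_unique)

# loc now looks employees up by their ID
khushi_name = by_id.loc[103, "Name"]

# iloc is unaffected by the new labels and still counts positions
first_name = by_id.iloc[0, 0]

# Select only the rows and columns that are needed
subset = by_id.loc[[101, 103], ["Name", "Salary"]]

print(khushi_name, first_name)
print(subset)
```

With this index, `by_id.loc[0]` would raise a `KeyError`, because 0 is no longer a label in the index.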

#### Common Mistakes:

- **Incorrect Syntax:** Incorrect use of square brackets, parentheses, or quotation marks in the indexing conditions can lead to errors. Always double-check syntax.

- **Using `==` for Float Comparison:** When comparing float values, be cautious due to potential precision issues. A tolerance-based function such as `np.isclose()` is recommended for float comparisons.

- **Misunderstanding Boolean Indexing:** Boolean indexing is not limited to exact matches with `==`. Any expression that produces a boolean Series works, including comparisons such as `>` or `<=` and combinations built with `&`, `|`, and `~`.
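
The float-comparison pitfall above can be reproduced in a few lines. This is a minimal sketch with a hypothetical frame `df_demo` whose `Total` column contains a value computed as `0.1 + 0.2`; `np.isclose()` is used with its default tolerances.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first total is computed, the others are typed in directly
df_demo = pd.DataFrame({
    "Item": ["A", "B", "C"],
    "Total": [0.1 + 0.2, 0.3, 0.5],
})

# Exact comparison misses the computed value because of floating-point rounding
exact = df_demo[df_demo["Total"] == 0.3]

# Tolerance-based comparison treats values within a small tolerance as equal
close_mask = np.isclose(df_demo["Total"].to_numpy(), 0.3)
close = df_demo[close_mask]

print(exact)
print(close)
```

The exact filter returns only item B, while the tolerance-based filter returns items A and B.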

Indexing and selecting data are crucial skills for extracting relevant information from a DataFrame. Adjust the example code and considerations based on the specifics of your real-world scenarios and datasets.
