# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#3: Data Manipulation with Pandas`**
7. **Data Filtering and Selection**
   - Conditional selection
   - Using boolean indexing

8. **Data Sorting and Ranking**
   - Sorting by columns
   - Ranking data

9. **Grouping and Aggregation**
   - GroupBy operations
   - Aggregation functions (sum, mean, count, etc.)

### **`7. Data Filtering and Selection`**

#### `Conditional Selection in Pandas DataFrame`

#### Filtering Data Based on Conditions:

Pandas allows for efficient conditional selection of data in a DataFrame using boolean expressions. This involves creating a boolean mask based on specific conditions and applying it to the DataFrame.

Note: Boolean masks are explained in the "Understanding Boolean Mask" section at the end of this notebook.

#### Example with Comparison Operators:

```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Laxman', 'Rajesh', 'Ganga', 'Jamuna'],
        'Age': [25, 30, 22, 35],
        'Salary': [50000, 60000, 45000, 70000]}

df = pd.DataFrame(data)

# Filtering based on Age greater than 25
age_condition = df['Age'] > 25
filtered_data = df[age_condition]

# Displaying the filtered DataFrame
print("DataFrame with Age > 25:")
print(filtered_data)
```

Output:

```
DataFrame with Age > 25:
     Name  Age  Salary
1  Rajesh   30   60000
3  Jamuna   35   70000
```


#### Example with Logical Conditions:

```python
# Filtering based on multiple conditions (Age > 25 and Salary > 50000)
combined_condition = (df['Age'] > 25) & (df['Salary'] > 50000)
filtered_data_combined = df[combined_condition]

# Displaying the DataFrame with combined conditions
print("\nDataFrame with Age > 25 and Salary > 50000:")
print(filtered_data_combined)
```

Output:

```
DataFrame with Age > 25 and Salary > 50000:
     Name  Age  Salary
1  Rajesh   30   60000
3  Jamuna   35   70000
```


#### Explanation:

- **Comparison Operators:**
  - Comparison operators (`>`, `>=`, `<`, `<=`, `==`, `!=`) compare every value in a column against the given value and return a boolean Series.

- **Logical Conditions:**
  - The element-wise operators `&` (and), `|` (or), and `~` (not) combine boolean Series. Python's `and`, `or`, and `not` keywords do not work on a Series and raise a `ValueError`.

- **Boolean Mask:**
  - The boolean condition creates a boolean mask, where `True` represents rows that satisfy the condition.

- **Conditional Selection:**
  - Applying the boolean mask to the DataFrame (`df[condition]`) extracts rows satisfying the condition.
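
The examples above only use `&`. As a minimal sketch, the same sample data can be filtered with `|` and `~` as well; the variable names `young_or_high_paid` and `not_older_than_25` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ganga', 'Jamuna'],
    'Age': [25, 30, 22, 35],
    'Salary': [50000, 60000, 45000, 70000],
})

# OR: keep rows where at least one of the two conditions holds
young_or_high_paid = df[(df['Age'] < 25) | (df['Salary'] > 60000)]

# NOT: invert a mask to keep the rows that fail the condition
not_older_than_25 = df[~(df['Age'] > 25)]

print(young_or_high_paid)
print(not_older_than_25)
```

Inverting a mask with `~` is often easier to read than rewriting the comparison by hand, especially when the original condition is already stored in a variable.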

#### Use Cases:

1. **Filtering Employees:**
   - Extract employees older than 30 with a salary greater than $60,000.

2. **Analyzing Sales Data:**
   - Select rows where both the quantity sold is greater than 10 and revenue is above $500.

3. **Data Cleaning:**
   - Identify and remove outliers by filtering data based on certain thresholds.

4. **Time Series Analysis:**
   - Filter time series data for specific time periods or events.
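
Two of these use cases, filtering a time period and dropping values beyond a threshold, might look like the sketch below. The daily sales figures, the column names, and the cutoff of 300 are all made up for illustration.

```python
import pandas as pd

# Hypothetical daily sales figures, for illustration only
ts = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=6, freq='D'),
    'Sales': [120, 95, 400, 110, 105, 130],
})

# Time-period filter: keep rows from 3 to 5 January (both ends included)
start = pd.Timestamp('2024-01-03')
end = pd.Timestamp('2024-01-05')
in_period = ts[(ts['Date'] >= start) & (ts['Date'] <= end)]

# Threshold filter: drop rows whose Sales exceed the assumed cutoff of 300
without_outliers = ts[ts['Sales'] <= 300]

print(in_period)
print(without_outliers)
```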

#### Tips:

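- **Parentheses for Clarity:**
  - Wrap each comparison in its own parentheses, as in `(df['Age'] > 25) & (df['Salary'] > 50000)`.

- **Understanding Operator Precedence:**
  - `&` and `|` bind more tightly than comparison operators such as `>` and `==`, so a condition without parentheses is parsed differently from how it reads.

- **Using `.loc[]`:**
  - `df.loc[mask, columns]` filters rows and selects columns in a single step, which avoids chained indexing such as `df[mask][columns]`. For plain row filtering it is not inherently faster than `df[mask]`.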
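
To see why the parentheses matter, the sketch below runs the same condition with and without them. The `try`/`except` wrapper and the `error_message` variable are only there to capture the error for display.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 22, 35],
                   'Salary': [50000, 60000, 45000, 70000]})

try:
    # Without parentheses Python parses this as
    # df['Age'] > (25 & df['Salary']) > 50000
    df[df['Age'] > 25 & df['Salary'] > 50000]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# With parentheses, each comparison is evaluated first and then combined
correct = df[(df['Age'] > 25) & (df['Salary'] > 50000)]
print(correct)
```

The first attempt fails because the chained comparison forces pandas to reduce a whole Series to a single `True` or `False`, which it refuses to do.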

Conditional selection is a powerful tool for extracting relevant information from a DataFrame based on specific criteria. It is widely used in data analysis, filtering, and preprocessing tasks.

#### **`Using Boolean Indexing in Pandas DataFrame`**

#### Concept of Boolean Indexing:

Boolean indexing in Pandas involves selecting subsets of data based on boolean conditions. It allows for flexible and expressive ways to filter both rows and columns of a DataFrame.

#### Application in Filtering Rows:

```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Laxman', 'Rajesh', 'Ganesh', 'Dinesh'],
        'Age': [25, 30, 22, 35],
        'Salary': [50000, 60000, 45000, 70000]}

df = pd.DataFrame(data)

# Applying boolean indexing to filter rows
filtered_rows = df[df['Age'] > 25]

# Displaying the DataFrame with filtered rows
print("DataFrame with Age > 25:")
print(filtered_rows)
```

Output:

```
DataFrame with Age > 25:
     Name  Age  Salary
1  Rajesh   30   60000
3  Dinesh   35   70000
```


#### Application in Filtering Columns:

```python
# Applying boolean indexing to filter columns
filtered_columns = df.loc[:, df.columns != 'Salary']

# Displaying the DataFrame with filtered columns
print("\nDataFrame without the 'Salary' column:")
print(filtered_columns)
```

Output:

```
DataFrame without the 'Salary' column:
     Name  Age
0  Laxman   25
1  Rajesh   30
2  Ganesh   22
3  Dinesh   35
```


#### Combining Boolean Indexing with Other Operations:

```python
# Combining boolean indexing with column selection
complex_condition = (df['Age'] > 25) & (df['Salary'] > 50000)
complex_filtered_data = df.loc[complex_condition, ['Name', 'Age']]

# The mask itself: one True/False value per row
print(complex_condition)

# Displaying the DataFrame with complex filtered data
print("\nDataFrame with Age > 25 and Salary > 50000, showing 'Name' and 'Age':")
print(complex_filtered_data)

# Note: the loc accessor is explained at the end of the notebook
```

Output:

```
0    False
1     True
2    False
3     True
dtype: bool

DataFrame with Age > 25 and Salary > 50000, showing 'Name' and 'Age':
     Name  Age
1  Rajesh   30
3  Dinesh   35
```


#### Explanation:

- **Filtering Rows:**
  - Rows are filtered based on a boolean condition applied to a specific column (`df['Age'] > 25`).

- **Filtering Columns:**
  - Columns are filtered based on a boolean condition applied to column names (`df.loc[:, df.columns != 'Salary']`).

- **Combining Conditions:**
  - Multiple conditions can be combined using logical operators (`&`, `|`, `~`) to create complex boolean conditions.

- **Selecting Specific Columns:**
  - Specific columns can be selected along with boolean conditions to extract a subset of data.

#### Use Cases:

1. **Selective Row Extraction:**
   - Extract rows of customers with purchases exceeding a certain amount.

2. **Column Exclusion:**
   - Exclude columns with sensitive information from being displayed.

3. **Filtering with Complex Conditions:**
   - Extract specific columns for rows meeting complex conditions, useful in targeted analyses.

4. **Data Cleaning:**
   - Selectively remove or replace values based on conditions.
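
The data-cleaning use case can be sketched as follows. The `records` DataFrame, the `-1` placeholder for an unknown age, and the salary cap of 100000 are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical records containing two suspicious values
records = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ganesh', 'Dinesh'],
    'Age': [25, -1, 22, 35],                   # -1 assumed to mean "unknown"
    'Salary': [50000, 60000, 45000, 700000],   # last value assumed to be a typo
})

# Replace: cap salaries above the assumed limit of 100000
records.loc[records['Salary'] > 100000, 'Salary'] = 100000

# Remove: keep only the rows with a valid (non-negative) age
cleaned = records[records['Age'] >= 0]

print(cleaned)
```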

#### Tips:

- **Boolean Indexing for Both Rows and Columns:**
  - Boolean indexing can be applied to both rows and columns simultaneously, providing flexibility.

- **Combining Conditions:**
  - Logical operators help create intricate conditions, useful for nuanced data extraction.

- **Efficiency Considerations:**
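  - Boolean indexing returns a new object containing copies of the selected rows. With large datasets, build a mask once, reuse it, and select only the columns you need so that intermediate copies stay small.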
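
As a minimal sketch of filtering both axes at once, a row mask and a column mask can be passed to `loc` in the same call. The names `row_mask`, `col_mask`, and `subset` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ganesh', 'Dinesh'],
    'Age': [25, 30, 22, 35],
    'Salary': [50000, 60000, 45000, 70000],
})

row_mask = df['Age'] > 25            # one True/False value per row
col_mask = df.columns != 'Salary'    # one True/False value per column

# Rows where Age > 25, every column except 'Salary'
subset = df.loc[row_mask, col_mask]

print(subset)
```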

Boolean indexing is a versatile technique for selectively extracting information from a DataFrame. It is widely used in data filtering, cleaning, and preparation for further analysis. Understanding how to combine boolean indexing with other operations enhances its utility in complex data extraction tasks.

#### **`Hands On Experience:`**

Conditional Selection and Boolean Indexing


#### Example 1: Employee Data Analysis

##### Scenario:
Consider a dataset of employee information, including age, department, and salary. You want to analyze employees aged 30 or younger in the 'Marketing' department.

```python
import pandas as pd

# Sample employee data
employee_data = {'Name': ['Laxman', 'Rajesh', 'Ganga', 'Jamuna'],
                 'Age': [25, 30, 22, 35],
                 'Department': ['Marketing', 'HR', 'Marketing', 'IT'],
                 'Salary': [50000, 60000, 45000, 70000]}

df_employees = pd.DataFrame(employee_data)

# Conditional selection: aged 30 or younger AND in the Marketing department
young_marketing_employees = df_employees[(df_employees['Age'] <= 30) & (df_employees['Department'] == 'Marketing')]

# Displaying the Result
print("Young Employees in Marketing Department:")
print(young_marketing_employees)
```

Output:

```
Young Employees in Marketing Department:
     Name  Age Department  Salary
0  Laxman   25  Marketing   50000
2   Ganga   22  Marketing   45000
```


#### Considerations:
- **Column Selection:**
  - Select only the relevant columns to avoid unnecessary data processing.
- **Multiple Conditions:**
  - Ensure proper use of parentheses when combining multiple conditions for clarity.

#### Example 2: Sales Data Analysis

##### Scenario:
You have a sales dataset with product information, quantity sold, and revenue. Identify products where the quantity sold is above 10 and revenue exceeds $500.

##### Solution:

```python
# Sample sales data
sales_data = {'Product': ['Laptop', 'Smartphone', 'Tablet', 'Camera'],
              'Quantity': [15, 8, 12, 5],
              'Revenue': [12000, 8000, 4500, 6000]}

df_sales = pd.DataFrame(sales_data)

# Boolean indexing: filter the rows, then keep only the 'Product' column
high_revenue_products = df_sales.loc[(df_sales['Quantity'] > 10) & (df_sales['Revenue'] > 500), 'Product']

# Displaying the Result
print("\nHigh Revenue Products with Quantity > 10:")
print(high_revenue_products)
```

Output:

```
High Revenue Products with Quantity > 10:
0    Laptop
2    Tablet
Name: Product, dtype: object
```


#### Considerations:
- **Column Selection:**
  - Use boolean indexing to extract specific columns relevant to the analysis.
- **Condition Complexity:**
  - Carefully structure complex conditions for accurate filtering.

#### Common Mistakes by Developers/Students:

1. **Overlooking Parentheses:**
   - Because `&` and `|` bind more tightly than comparison operators, omitting the parentheses around each comparison can raise an error or filter on a different condition than intended.

2. **Incorrect Column Names:**
   - Using incorrect column names in conditions may result in errors or inaccurate filtering.

3. **Not Handling Edge Cases:**
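   - Comparisons such as `>`, `<`, and `==` evaluate to `False` for missing values (`NaN`), so rows with missing data silently drop out of the result unless they are handled explicitly, for example with `isna()` or `notna()`.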

4. **Misinterpreting Logical Operators:**
   - Misunderstanding how logical operators (`&`, `|`, `~`) work can lead to incorrect boolean conditions.
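
The missing-value pitfall is easy to overlook, so here is a small sketch of it. The DataFrame with one unknown age is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Laxman', 'Rajesh', 'Ganga'],
                   'Age': [25, np.nan, 35]})

# NaN > 30 evaluates to False, so Rajesh is left out
older = df[df['Age'] > 30]

# Inverting the mask turns that False into True, so Rajesh is included
# even though his age is unknown
not_older = df[~(df['Age'] > 30)]

# Being explicit about missing values avoids the surprise
known_not_older = df[df['Age'].notna() & (df['Age'] <= 30)]

print(older)
print(not_older)
print(known_not_older)
```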


#### Interesting Facts:

1. **Query-like Syntax:**
   - The `DataFrame.query()` method accepts a condition written as a string, for example `df.query('Age > 25 and Salary > 50000')`, which can be more readable and resembles a SQL `WHERE` clause.

2. **Chaining Conditions:**
   - Chaining conditions with logical operators enables complex and nuanced filtering, enhancing data extraction capabilities.

3. **Efficient Memory Usage:**
   - Boolean indexing returns a copy of the selected rows, so filtering early and keeping only the columns you need helps limit memory use on large datasets.

4. **Combining with Other Operations:**
   - Boolean indexing seamlessly integrates with other Pandas operations, providing a powerful toolkit for data manipulation.
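
As a minimal sketch of the query-like syntax, the same filter can be written with a boolean mask and with `DataFrame.query()`. The variable names `result_mask` and `result_query` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ganga', 'Jamuna'],
    'Age': [25, 30, 22, 35],
    'Salary': [50000, 60000, 45000, 70000],
})

# Boolean-mask version
result_mask = df[(df['Age'] > 25) & (df['Salary'] > 50000)]

# Equivalent query() version: column names are used directly in the string
result_query = df.query('Age > 25 and Salary > 50000')

print(result_query)
print(result_mask.equals(result_query))
```

Inside the query string, `and`, `or`, and `not` are allowed, and the parentheses around each comparison are no longer required.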

Understanding real-world scenarios, considerations, and potential mistakes enhances the application of conditional selection and boolean indexing in practical data analysis tasks.

### **`Extra Innings`**

#### **`Understanding Boolean Mask:`**


The term "boolean mask" refers to a binary-valued mask derived from a condition or set of conditions applied to an array or DataFrame. In the context of data manipulation with libraries like NumPy or pandas, a boolean mask is essentially an array of the same shape as the original data, where each element is either `True` or `False` based on whether the corresponding element in the original array or DataFrame satisfies a specified condition.

Here's an example using NumPy:

```python
import numpy as np

# Create a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Create a boolean mask based on a condition (e.g., values greater than 3)
mask = arr > 3

# The boolean mask
print(mask)
# Output: [False False False  True  True]
```

In this example, the boolean mask `mask` is `True` for elements greater than 3 and `False` otherwise. This mask can be used to index or filter the original array, selecting only the elements that satisfy the specified condition:

```python
# Use the boolean mask to filter the original array
filtered_arr = arr[mask]

# The filtered array
print(filtered_arr)
# Output: [4 5]
```

In the context of pandas DataFrames, boolean masks are frequently used for filtering rows based on conditions. For instance:

```python
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# Create a boolean mask based on a condition (e.g., values in column 'A' greater than 3)
mask = df['A'] > 3

# Use the boolean mask to filter the DataFrame
filtered_df = df[mask]

# The filtered DataFrame
print(filtered_df)
# Output:
#    A   B
# 3  4  40
# 4  5  50
```

In summary, a boolean mask is a powerful tool for conditionally selecting or modifying elements in arrays or DataFrames based on specific criteria. It provides a convenient way to filter data based on logical conditions, facilitating efficient data manipulation and analysis.
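
Besides selecting elements, a mask can also be used to count or modify them. The following sketch reuses the NumPy array from above; `count` and `capped` are illustrative names.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
mask = arr > 3

# True counts as 1, so summing the mask counts the matching elements
count = int(mask.sum())

# Assigning through the mask changes only the matching elements
capped = arr.copy()   # work on a copy so the original array is untouched
capped[mask] = 3

print(count)
print(capped)
```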

#### **`Understanding loc accessor:`**

The `loc` accessor in pandas is used for label-based indexing and selecting data in a DataFrame. It provides a powerful way to subset or filter rows and columns based on labels (either index labels or column names). Here's a more detailed explanation of `df.loc`:

##### Basic Syntax:
```python
df.loc[row_labels, column_labels]
```

- `row_labels`: A single label, a list of labels, a label slice, or a boolean array. It specifies which rows to select by their index labels.

- `column_labels`: A single label, a list of labels, a label slice, or a boolean array. It specifies which columns to select by their column names.

##### Examples:

1. **Selecting Specific Rows and Columns by Labels:**
   ```python
   df.loc[[1, 3], ['Name', 'Age']]
   ```
   This selects the rows with index labels 1 and 3 and the columns 'Name' and 'Age'. Every label in the list must exist in the index; a missing label raises a `KeyError`.

2. **Boolean Indexing with `loc`:**
   ```python
   condition = df['Age'] > 25
   df.loc[condition, ['Name', 'Salary']]
   ```
   This selects rows where the 'Age' column is greater than 25 and includes only the 'Name' and 'Salary' columns.

3. **Slicing with `loc`:**
   ```python
   df.loc[1:3, 'Name':'Salary']
   ```
   This selects rows with index labels 1 to 3 (inclusive) and columns from 'Name' to 'Salary' (inclusive).

4. **Selecting All Rows for Specific Columns:**
   ```python
   df.loc[:, ['Name', 'Age']]
   ```
   This selects all rows for the columns 'Name' and 'Age'.
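
The selections above can be tried on a small DataFrame. This sketch assumes the sample employee data used earlier in this notebook; `df_named` is a hypothetical copy indexed by name, added to show that `loc` slices work on labels rather than positions.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ganesh', 'Dinesh'],
    'Age': [25, 30, 22, 35],
    'Salary': [50000, 60000, 45000, 70000],
})

# List of labels on both axes
by_labels = df.loc[[1, 3], ['Name', 'Age']]

# Label slices include both end points: rows 1, 2 and 3
by_slice = df.loc[1:3, 'Name':'Salary']

# All rows, selected columns
all_rows = df.loc[:, ['Name', 'Age']]

# With names as the index, the slice runs from 'Rajesh' to 'Dinesh' inclusive
df_named = df.set_index('Name')
ages = df_named.loc['Rajesh':'Dinesh', 'Age']

print(by_slice.shape)
print(ages)
```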

##### Key Points:

- `loc` is label-based: it uses the actual index labels and column names, and label slices include the end label.

- It is commonly used for more complex selection operations, such as boolean indexing, selecting by conditions, or combining both row and column selections.

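- Selecting with `loc` using a boolean mask or a list of labels returns a new object, not a view, so changes to the result do not affect the original DataFrame. A label slice may return a view in some pandas versions; call `.copy()` when you need an independent object regardless of version.

- To change values in the rows that match a condition, assign through a single `loc` call, such as `df.loc[condition, 'Salary'] = value`. Chained indexing like `df[condition]['Salary'] = value` typically modifies a temporary copy and leaves the original DataFrame unchanged.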
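
A minimal sketch of a safe pattern: take an explicit `.copy()` when you want an independent subset, and assign through `df.loc[mask, column]` when you want to change the original. The salary adjustments are made-up values for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ganesh', 'Dinesh'],
    'Age': [25, 30, 22, 35],
    'Salary': [50000, 60000, 45000, 70000],
})

mask = df['Age'] > 25

# Independent subset: edits here never reach df
subset = df.loc[mask].copy()
subset['Salary'] = 0

# Update in place: one loc call with the row mask and the column label
df.loc[mask, 'Salary'] += 5000

print(subset)
print(df)
```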

In the earlier example from this notebook:
```python
complex_condition = (df['Age'] > 25) & (df['Salary'] > 50000)
complex_filtered_data = df.loc[complex_condition, ['Name', 'Age']]
```
The `loc` is used to filter rows based on the boolean condition (`complex_condition`) and select only the 'Name' and 'Age' columns for the matching rows.