**Data Selection and Analysis**: Selecting the right rows and columns is *the first step* in analyzing any dataset. Pandas provides several powerful ways to accomplish this.

1. Selecting Columns
2. Selecting Rows
3. Selecting Specific Rows and Columns
4. Fast Single Element Access
5. Filtering with Conditions
6. Filtering with the `.query()` method *(concise and readable, handy for quick exploration)*

In [None]:
import pandas as pd
df = pd.read_csv("../assets/data/practical-Imp.csv")
df

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
0,Shah Rukh Khan,Pathaan,2023,Action,1050,7.2
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
7,Hrithik Roshan,War,2019,Action,475,6.5
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0
9,Kartik Aaryan,Bhool Bhulaiyaa 2,2022,Horror Comedy,266,5.9


**1. Selecting Columns**

You can select one or more columns from your DataFrame.

  * **Single Column**: This returns a Series.
    ```python
    df["column_name"]
    ```
  * **Multiple Columns**: This returns a new DataFrame.
    ```python
    df[["col1", "col2"]]
    ```

In [20]:
# Select a single column: 'Actor'
actors = df["Actor"]
print(actors)
type(df["Actor"]) # <class 'pandas.core.series.Series'>

print("\n\n")

# Select multiple columns: 'Actor' and 'Year'
actors_year = df[["Actor", "Year"]]
print(actors_year)

0         Shah Rukh Khan
1            Salman Khan
2             Aamir Khan
3          Ranbir Kapoor
4          Ranveer Singh
5     Ayushmann Khurrana
6          Rajkummar Rao
7         Hrithik Roshan
8           Akshay Kumar
9          Kartik Aaryan
10          Varun Dhawan
11         Vicky Kaushal
Name: Actor, dtype: object



                 Actor  Year
0       Shah Rukh Khan  2023
1          Salman Khan  2017
2           Aamir Khan  2016
3        Ranbir Kapoor  2022
4        Ranveer Singh  2018
5   Ayushmann Khurrana  2018
6        Rajkummar Rao  2018
7       Hrithik Roshan  2019
8         Akshay Kumar  2019
9        Kartik Aaryan  2022
10        Varun Dhawan  2017
11       Vicky Kaushal  2019

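The two bracket forms above differ only in what sits inside the outer brackets, and that decides the return type. A minimal sketch on a three-row stand-in for the film table (the `films` name is illustrative and not part of the notebook; the rows are copied from the dataset above):

```python
import pandas as pd

# Small stand-in for the notebook's dataset (three rows copied from it)
films = pd.DataFrame(
    {
        "Actor": ["Shah Rukh Khan", "Salman Khan", "Aamir Khan"],
        "Film": ["Pathaan", "Tiger Zinda Hai", "Dangal"],
        "Year": [2023, 2017, 2016],
    }
)

as_series = films["Actor"]   # a single label -> Series
as_frame = films[["Actor"]]  # a list holding one label -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```

The list form is useful when later code expects a DataFrame even though only one column is needed.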

**2. Selecting Rows**

Use `.loc` for label-based selection and `.iloc` for position-based (integer) selection.

  * **By Label** (e.g., the row whose index label is `0`):
    ```python
    df.loc[0]
    ```
  * **By Position** (e.g., the first row):
    ```python
    df.iloc[0]
    ```

In [34]:
# Example to illustrate the difference between loc and iloc.
# The DataFrame index is [0, 1, 2, ...] and the columns are ['Actor', 'Film', 'Year', ...].
# On their own, df.loc[0] and df.iloc[3] each return a whole row as a Series.
# Using loc (label-based)
print("Using loc (label-based):")
print(df.loc[0, "Year"])  # Accesses the value in row with label 0 and column 'Year'

# Using iloc (position-based)
print("\nUsing iloc (position-based):")
print(
    df.iloc[0, 1]
)  # Accesses the value in the first row and second column (by position)

Using loc (label-based):
2023

Using iloc (position-based):
Pathaan

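With the default index, labels and positions happen to coincide, so `.loc[0]` and `.iloc[0]` return the same row. The difference shows up once the index is something else. A small sketch, assuming a hypothetical `films` frame whose index labels are 10, 20, 30 (these labels are made up for illustration):

```python
import pandas as pd

films = pd.DataFrame(
    {"Film": ["Pathaan", "Tiger Zinda Hai", "Dangal"], "Year": [2023, 2017, 2016]},
    index=[10, 20, 30],  # labels deliberately differ from positions 0, 1, 2
)

print(films.loc[20, "Film"])  # row labelled 20 -> Tiger Zinda Hai
print(films.iloc[0, 0])       # first row, first column -> Pathaan

# No row carries the label 0 here, so .loc raises KeyError
try:
    films.loc[0]
except KeyError:
    print("no row labelled 0")
```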

**3. Selecting Specific Rows and Columns**

Combine row and column selectors to pinpoint specific data.

  * **Using Labels (`.loc`)**:
    ```python
    # Value at row label 0, column 'Name'
    df.loc[0, "Name"]

    # Rows labelled 0 through 2 (the end label is included), columns 'Name' and 'Age'
    df.loc[0:2, ["Name", "Age"]]
    ```
  * **Using Positions (`.iloc`)**:
    ```python
    # Value at first row, second column
    df.iloc[0, 1]

    # First two rows and first two columns (the end position is excluded)
    df.iloc[0:2, 0:2]
    ```

In [39]:
df.loc[0:2, ["Actor", "IMDb", "Film"]] # Accesses rows with labels 0 to 2 and columns 'Actor', 'IMDb', 'Film'

Unnamed: 0,Actor,IMDb,Film
0,Shah Rukh Khan,7.2,Pathaan
1,Salman Khan,6.0,Tiger Zinda Hai
2,Aamir Khan,8.4,Dangal


In [63]:
# Changing a value in the DataFrame
df.loc[0, "Actor"] = "Rajat"
df

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
0,Rajat,Pathaan,2023,Action,1050,7.2
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
3,Ranbir Kapoor,Brahmastra,2022,Fantasy,431,5.6
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
7,Hrithik Roshan,War,2019,Action,475,6.5
8,Akshay Kumar,Good Newwz,2019,Comedy,318,7.0
9,Kartik Aaryan,Bhool Bhulaiyaa 2,2022,Horror Comedy,266,5.9


In [40]:
df.iloc[0:2, 0:2]  # Accesses the first two rows and first two columns (by position)

Unnamed: 0,Actor,Film
0,Shah Rukh Khan,Pathaan
1,Salman Khan,Tiger Zinda Hai

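The two outputs above have different row counts for what looks like a similar slice: `.loc[0:2]` includes the end label, while `.iloc[0:2]` stops before the end position, like ordinary Python slicing. A compact sketch on a hypothetical four-row `films` frame (rows copied from the dataset) makes the contrast explicit:

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Actor": ["Shah Rukh Khan", "Salman Khan", "Aamir Khan", "Ranbir Kapoor"],
        "Film": ["Pathaan", "Tiger Zinda Hai", "Dangal", "Brahmastra"],
    }
)

by_label = films.loc[0:2, ["Actor", "Film"]]  # labels 0, 1, 2 (end included)
by_position = films.iloc[0:2, 0:2]            # positions 0, 1 (end excluded)

print(len(by_label))     # 3
print(len(by_position))  # 2
```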

**4. Fast Single Element Access (`.at` and `.iat`)**

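For reading or writing a single value, the `.at` and `.iat` accessors are faster than `.loc` and `.iloc`, because they only handle scalar lookups and skip the extra work needed to support slices and lists.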

  * **By Label (`.at`)**:
    ```python
    df.at[0, "Name"]
    ```
  * **By Position (`.iat`)**:
    ```python
    df.iat[0, 1]
    ```

In [61]:
df.at[0, "Actor"]

'Shah Rukh Khan'

In [45]:
df.iat[1, 1]

'Tiger Zinda Hai'
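
The speed difference is easiest to judge by measuring it. The following is a rough sketch using the standard-library `timeit` module on a hypothetical two-row `films` frame. The timings depend on the machine and the pandas version, so no figures are quoted here.

```python
import timeit

import pandas as pd

films = pd.DataFrame(
    {
        "Actor": ["Shah Rukh Khan", "Salman Khan"],
        "Film": ["Pathaan", "Tiger Zinda Hai"],
    }
)

# Both pairs of accessors return the same scalar
assert films.at[1, "Film"] == films.loc[1, "Film"]
assert films.iat[1, 1] == films.iloc[1, 1]

n = 10_000  # number of repeated lookups to time
t_loc = timeit.timeit(lambda: films.loc[1, "Film"], number=n)
t_at = timeit.timeit(lambda: films.at[1, "Film"], number=n)
print(f".loc: {t_loc:.4f}s   .at: {t_at:.4f}s   ({n} lookups each)")
```

For a handful of lookups the difference rarely matters. It becomes noticeable mainly inside tight loops.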

**5. Filtering with Conditions**

Boolean indexing, which passes a boolean Series inside `df[...]`, is the most common way to select rows that meet certain criteria.

  * **Simple Condition**:

    ```python
    # Select rows where Age is greater than 30
    df[df["Age"] > 30]
    ```

  * **Multiple Conditions**:

    > **Important**: Always wrap each condition in parentheses `()`. Use `&` for AND and `|` for OR.

    ```python
    # AND: Age > 25 AND City is 'Delhi'
    df[(df["Age"] > 25) & (df["City"] == "Delhi")]

    # OR: Name is 'Bob' OR Age is less than 30
    df[(df["Name"] == "Bob") | (df["Age"] < 30)]
    ```


In [48]:
df[df["IMDb"] > 8]  # Filters the DataFrame to include only rows where the 'IMDb' rating is greater than 8

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2


In [56]:
# Filters for rows where IMDb > 8 AND Year > 2018
df[(df["IMDb"] > 8) & (df["Year"] > 2018)]

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2

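The parentheses matter because Python's `&` and `|` bind more tightly than comparisons such as `>`. Without them, `df["IMDb"] > 8 & df["Year"] > 2018` is parsed as `df["IMDb"] > (8 & df["Year"]) > 2018`. The sketch below rounds out the AND example with OR, NOT (`~`), and `isin()`, using a hypothetical five-row `films` frame built from rows in the dataset:

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Film": ["Pathaan", "Dangal", "Brahmastra", "Andhadhun", "War"],
        "Year": [2023, 2016, 2022, 2018, 2019],
        "Genre": ["Action", "Biography", "Fantasy", "Thriller", "Action"],
        "IMDb": [7.2, 8.4, 5.6, 8.3, 6.5],
    }
)

# OR: rated above 8, or released after 2021
high_or_recent = films[(films["IMDb"] > 8) | (films["Year"] > 2021)]

# NOT: every genre except Action
not_action = films[~(films["Genre"] == "Action")]

# isin: shorthand for several == checks joined with |
picked = films[films["Genre"].isin(["Action", "Thriller"])]

print(high_or_recent["Film"].tolist())  # ['Pathaan', 'Dangal', 'Brahmastra', 'Andhadhun']
print(not_action["Film"].tolist())      # ['Dangal', 'Brahmastra', 'Andhadhun']
print(picked["Film"].tolist())          # ['Pathaan', 'Andhadhun', 'War']
```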

**6. Filtering with `.query()`**

The `.query()` method provides a clean, readable, and SQL-like way to filter your data by expressing the condition as a string.

  * **Basic Query**

    ```python
    df.query("Age > 25 and City == 'Delhi'")
    ```

  * **Using an External Variable**
    Use the `@` symbol to reference a Python variable inside the query string.

    ```python
    age_limit = 30
    df.query("Age > @age_limit")
    ```

  * **Handling Special Column Names**
    If a column name contains spaces (e.g., `first name`) or is a Python keyword (e.g., `class`), you must wrap it in backticks `` ` ``.

    ```python
    df.query("`first name` == 'Alice'")
    ```

  * **Chained Comparisons**
    You can chain multiple comparisons together for more intuitive filtering.

    ```python
    df.query("25 < Age <= 40")
    ```

### **Key Points for `.query()`**

  * **Logical Operators**: Use `and`, `or`, and `not`; the symbols `&`, `|`, and `~` are also accepted inside the query string.
  * **String Values**: Place string values inside single or double quotes (e.g., `"Delhi"` or `'Mumbai'`).
  * **Case-Sensitivity**: String comparisons are case-sensitive. The query `City == 'delhi'` will not match `'Delhi'`.
  * **Return Value**: `.query()` returns a **new DataFrame** containing the matching rows. The original DataFrame is not modified unless you pass `inplace=True` or reassign the result to the same name.
    ```python
    # The original df is unchanged. The filtered result is stored in a new variable.
    filtered_df = df.query("Age < 50")
    ```

In [None]:
"""
Selecting from a DataFrame can return either a view or a copy of the original data,
depending on the operation, so later changes to the result may not behave as expected.
To get an independent DataFrame, call the .copy() method on the selection.
"""

# Selecting specific columns to create a new DataFrame
df_copy = df[["Actor", "IMDb"]].copy()
df_copy

Unnamed: 0,Actor,IMDb
0,Shah Rukh Khan,7.2
1,Salman Khan,6.0
2,Aamir Khan,8.4
3,Ranbir Kapoor,5.6
4,Ranveer Singh,7.0
5,Ayushmann Khurrana,8.3
6,Rajkummar Rao,7.5
7,Hrithik Roshan,6.5
8,Akshay Kumar,7.0
9,Kartik Aaryan,5.9

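The practical effect of `.copy()` is that edits to the new DataFrame never reach the original. A minimal sketch, assuming a hypothetical two-row `films` frame and an illustrative `ratings` name:

```python
import pandas as pd

films = pd.DataFrame(
    {"Actor": ["Shah Rukh Khan", "Salman Khan"], "IMDb": [7.2, 6.0]}
)

ratings = films[["Actor", "IMDb"]].copy()  # independent copy of the selection
ratings.loc[0, "IMDb"] = 9.9               # edit the copy only

print(films.loc[0, "IMDb"])    # 7.2 (original untouched)
print(ratings.loc[0, "IMDb"])  # 9.9
```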

In [64]:
# Basic Query
df.query("Genre == 'Action' and IMDb > 7")

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
0,Rajat,Pathaan,2023,Action,1050,7.2
11,Vicky Kaushal,Uri: The Surgical Strike,2019,Action,342,8.2


In [67]:
# Chained Comparisons
df.query("2016 <= Year <= 2018")

Unnamed: 0,Actor,Film,Year,Genre,BoxOffice(INR Crore),IMDb
1,Salman Khan,Tiger Zinda Hai,2017,Action,565,6.0
2,Aamir Khan,Dangal,2016,Biography,2024,8.4
4,Ranveer Singh,Padmaavat,2018,Historical,585,7.0
5,Ayushmann Khurrana,Andhadhun,2018,Thriller,111,8.3
6,Rajkummar Rao,Stree,2018,Horror Comedy,180,7.5
10,Varun Dhawan,Badrinath Ki Dulhania,2017,Romantic Comedy,201,6.1
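
The external-variable and backtick forms described in section 6 can also be tried on this dataset's own columns, since `BoxOffice(INR Crore)` is not a valid Python identifier. The sketch below assumes pandas 1.0 or later (where backtick quoting covers characters beyond spaces) and uses a hypothetical four-row `films` frame; `min_rating` is an illustrative variable name.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Film": ["Pathaan", "Tiger Zinda Hai", "Dangal", "Andhadhun"],
        "Year": [2023, 2017, 2016, 2018],
        "BoxOffice(INR Crore)": [1050, 565, 2024, 111],
        "IMDb": [7.2, 6.0, 8.4, 8.3],
    }
)

min_rating = 7  # ordinary Python variable, referenced with @ in the query

# @min_rating pulls in the variable; backticks quote the awkward column name
hits = films.query("IMDb > @min_rating and `BoxOffice(INR Crore)` > 500")
print(hits["Film"].tolist())  # ['Pathaan', 'Dangal']
```

The equivalent boolean-indexing form would be `films[(films["IMDb"] > min_rating) & (films["BoxOffice(INR Crore)"] > 500)]`; which one reads better is largely a matter of taste.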
