# **🎯 Day 2 – Selection & Filtering in Pandas 🔍**

#### **Goal:** Master the techniques for retrieving specific rows, columns, and data subsets using both label-based and condition-based methods.

#### **Topics To Cover:** Column Selection, Row Selection with .loc and .iloc, and Advanced Data Filtering using Boolean Indexing.

----

## **Introduction: The Art of Subsetting ✂️**
When dealing with large datasets, the first crucial step is almost always to focus on a smaller, relevant part. **Selection** and **Filtering** are the methods Pandas provides to perform this "subsetting" operation: pinpointing the exact data points you need for analysis.

**Why are Selection and Filtering Important? 💡**
* **Relevance:** Isolate only the columns or rows necessary for a specific calculation, reducing memory usage and speeding up processing time.

* **Problem Solving:** They are the foundation for cleaning data (e.g., filtering out bad values) and preparing data (e.g., selecting features for a model).

* **Clarity:** They let you quickly inspect and analyze specific segments of the data, such as records matching a certain condition (e.g., all users older than 25).

----


## **2.1: Selection (Accessing Data by Label or Position) 🏷️**

**Selection** refers to retrieving a part of the DataFrame based on its **label (name)** or its **position (index number)**. Pandas provides two primary, explicit methods for this: `.loc` and `.iloc`.

### **1. Access by Label: The `.loc` Accessor**

  * **Definition:** `.loc` is **label-based**. When you use it, you refer to the column *names* and the index *labels* (it also accepts a boolean mask).
  * **Syntax:** `df.loc[row_label(s), column_label(s)]`
  * **Key Behavior:** The end boundary is **inclusive** (e.g., slicing `df.loc[0:5]` includes the index label 5). With the default integer index, labels and positions coincide, so this slice returns six rows.

### **2. Access by Position: The `.iloc` Accessor**

  * **Definition:** `.iloc` is strictly **integer-position-based**. When you use it, you refer to the column and row **numbers**, just like indexing a Python list or NumPy array.
  * **Syntax:** `df.iloc[row_index(s), column_index(s)]`
  * **Key Behavior:** The end boundary is **exclusive** (e.g., slicing `df.iloc[0:5]` includes rows 0 through 4, but excludes position 5).
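
The difference between the two accessors is easiest to see when the index labels are not `0, 1, 2, ...`. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` and its values are made up for illustration and are not part of the course dataset):

```python
import pandas as pd

# Hypothetical toy frame; the index labels are deliberately not 0..n-1.
toy = pd.DataFrame(
    {"Age": [19, 22, 20, 18], "Country": ["Bangladesh", "India", "USA", "UK"]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[10:30]     # labels 10, 20 and 30: the end label is included
by_position = toy.iloc[0:2]   # positions 0 and 1: the end position is excluded

print(by_label)
print(by_position)
```

With the default index used in the rest of this notebook, labels and positions happen to be the same numbers, which is why `df.loc[0:5]` and `df.iloc[0:5]` differ only by one row.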



In [1]:
import pandas as pd
import numpy as np

# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed
df = pd.read_csv('../data/Students Social Media Addiction.csv')

----

# Let's Begin 🚀

#### Selecting Columns

In [2]:
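# Select a single column (returns a Series)
df['Age']  # recommended way: works for any column name
df.Age     # attribute access: only for names that are valid Python identifiers and do not clash with DataFrame attributes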

# Select multiple columns
df[['Age', 'Gender', 'Country']]


Unnamed: 0,Age,Gender,Country
0,19,Female,Bangladesh
1,22,Male,India
2,20,Female,USA
3,18,Male,UK
4,21,Male,Canada
...,...,...,...
700,20,Female,Italy
701,23,Male,Russia
702,21,Female,China
703,24,Male,Japan
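
The type of the result depends on the brackets used. As a small sketch on a hypothetical two-row frame (the `toy` name and values are assumptions for illustration), single brackets give a Series while a list of names gives a DataFrame, even when the list has only one entry:

```python
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame({"Age": [19, 22], "Gender": ["Female", "Male"]})

single = toy["Age"]       # single brackets -> Series
as_frame = toy[["Age"]]   # list of names   -> DataFrame with one column

print(type(single).__name__)
print(type(as_frame).__name__)
```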


In [3]:
df.loc[0]  # first row

Student_ID                                    1
Age                                          19
Gender                                   Female
Academic_Level                    Undergraduate
Country                              Bangladesh
Avg_Daily_Usage_Hours                       5.2
Most_Used_Platform                    Instagram
Affects_Academic_Performance                Yes
Sleep_Hours_Per_Night                       6.5
Mental_Health_Score                           6
Relationship_Status             In Relationship
Conflicts_Over_Social_Media                   3
Addicted_Score                                8
Name: 0, dtype: object

In [4]:
df.loc[0:5]  # first 6 rows (inclusive of 5)

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7
5,6,19,Female,Undergraduate,Australia,7.2,Instagram,Yes,4.5,4,Complicated,5,9


In [5]:
df.loc[:, ['Age', 'Country']]  # all rows, specific columns

Unnamed: 0,Age,Country
0,19,Bangladesh
1,22,India
2,20,USA
3,18,UK
4,21,Canada
...,...,...
700,20,Italy
701,23,Russia
702,21,China
703,24,Japan


#### Using .iloc[] to select rows by position

In [6]:
df.iloc[0]  # first row

Student_ID                                    1
Age                                          19
Gender                                   Female
Academic_Level                    Undergraduate
Country                              Bangladesh
Avg_Daily_Usage_Hours                       5.2
Most_Used_Platform                    Instagram
Affects_Academic_Performance                Yes
Sleep_Hours_Per_Night                       6.5
Mental_Health_Score                           6
Relationship_Status             In Relationship
Conflicts_Over_Social_Media                   3
Addicted_Score                                8
Name: 0, dtype: object

In [7]:
df.iloc[0:5]  # first 5 rows (exclusive of 5)

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7


In [8]:
df.iloc[:, [0, 2, 4]]  # all rows, specific columns by position

Unnamed: 0,Student_ID,Gender,Country
0,1,Female,Bangladesh
1,2,Male,India
2,3,Female,USA
3,4,Male,UK
4,5,Male,Canada
...,...,...,...
700,701,Female,Italy
701,702,Male,Russia
702,703,Female,China
703,704,Male,Japan


#### Selecting both rows and columns

In [9]:
df.loc[0,'Age'] # Single Value: first row, 'Age' column

np.int64(19)

In [10]:
df.loc[0:5, ['Student_ID', 'Age', 'Gender', 'Academic_Level']] # Multiple Rows and Columns

Unnamed: 0,Student_ID,Age,Gender,Academic_Level
0,1,19,Female,Undergraduate
1,2,22,Male,Graduate
2,3,20,Female,Undergraduate
3,4,18,Male,High School
4,5,21,Male,Graduate
5,6,19,Female,Undergraduate


In [11]:
df.loc[[i for i in range(0, 10) if i % 2 == 0], ['Age', 'Country', 'Academic_Level']] # Even indexed rows and specific columns

Unnamed: 0,Age,Country,Academic_Level
0,19,Bangladesh,Undergraduate
2,20,USA,Undergraduate
4,21,Canada,Graduate
6,23,Germany,Graduate
8,18,Japan,High School


In [12]:
df.iloc[0, 2]  # Single Value: first row, 3rd column (0-based index)

'Female'

In [13]:
df.iloc[:5, 1:4] # rows at positions 0-4 and columns at positions 1-3 (end positions excluded)

Unnamed: 0,Age,Gender,Academic_Level
0,19,Female,Undergraduate
1,22,Male,Graduate
2,20,Female,Undergraduate
3,18,Male,High School
4,21,Male,Graduate


In [14]:
df.iloc[0:5, [0, 2, 3, 4]] # Multiple Rows and specific Columns using iloc

Unnamed: 0,Student_ID,Gender,Academic_Level,Country
0,1,Female,Undergraduate,Bangladesh
1,2,Male,Graduate,India
2,3,Female,Undergraduate,USA
3,4,Male,High School,UK
4,5,Male,Graduate,Canada


#### Select a single value

In [15]:
# Using .at[] for faster access to a single value than .loc[]
df.at[0, 'Age']  # first row, 'Age' column

np.int64(19)

In [16]:
# Using .iat[] for faster access to a single value than .iloc[]
df.iat[0, 2]  # first row, 3rd column (0-based index)

'Female'
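
The scalar accessors can also write a single cell. A minimal sketch on a hypothetical toy frame (names and values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame({"Age": [19, 22], "Gender": ["Female", "Male"]})

age_by_label = toy.at[0, "Age"]     # row label 0, column label 'Age'
gender_by_pos = toy.iat[0, 1]       # row position 0, column position 1

toy.at[1, "Age"] = 23               # .at can also assign one cell in place

print(age_by_label, gender_by_pos)
print(toy)
```

Unlike `.loc` and `.iloc`, `.at` and `.iat` accept only a single row and a single column, so they are best suited to one-off lookups or updates.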

----



## **2.2: Filtering (Accessing Data by Condition) 🧬**

**Filtering** (also known as **Boolean Indexing** or **Boolean Masking**) refers to retrieving rows based on whether they meet a specific logical **condition**, not based on their label or position.

### **The Mechanics: Boolean Masks**

1.  **Condition:** You define a condition on a column (e.g., `df['Age'] > 25`).
2.  **Mask Creation:** This returns a **Series of True/False values** where `True` means the row meets the condition.
3.  **Application:** You apply this mask to the DataFrame: `df[boolean_mask]`.
4.  **Result:** Pandas only returns the rows corresponding to the `True` values in the mask.
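
The four steps above can be sketched on a hypothetical three-row frame (the `toy` data is made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [19, 27, 31]})

mask = toy["Age"] > 25    # steps 1-2: the condition yields a boolean Series
older = toy[mask]         # steps 3-4: only rows where the mask is True are kept

print(mask.tolist())
print(older)
```

Storing the mask in a variable first is optional, but it makes it easy to inspect (for example with `mask.sum()` to count matching rows) before filtering.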

### **Combining Conditions 🔗**

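Complex filtering is achieved by combining multiple boolean masks. **Crucially**, you must use the element-wise operators (`&` for AND, `|` for OR, `~` for NOT) rather than Python's `and`, `or`, and `not`, which raise a `ValueError` on a Series. Wrap each condition in parentheses `()`, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.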

<table>
  <thead>
    <tr>
      <th>Operator</th>
      <th>Meaning</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><b>&amp;</b></td>
      <td><b>AND:</b> Both conditions must be True.</td>
      <td><code>(df['Score'] &gt; 80) &amp; (df['City'] == 'NY')</code></td>
    </tr>
    <tr>
      <td><b>|</b></td>
      <td><b>OR:</b> At least one condition must be True.</td>
      <td><code>(df['Score'] &gt; 80) | (df['City'] == 'LA')</code></td>
    </tr>
    <tr>
      <td><b>~</b></td>
      <td><b>NOT:</b> Negates the condition.</td>
      <td><code>~(df['Is_Active'])</code></td>
    </tr>
  </tbody>
</table>
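
As a minimal sketch of the three operators, using a hypothetical frame with the `Score` and `City` columns from the table (the values are made up for illustration), including what happens when Python's `and` is used by mistake:

```python
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame({"Score": [85, 90, 70], "City": ["NY", "LA", "NY"]})

high = toy["Score"] > 80
in_ny = toy["City"] == "NY"

both = toy[high & in_ny]          # AND: high score and in NY
either = toy[high | in_ny]        # OR: high score or in NY
neither = toy[~(high | in_ny)]    # NOT: rows matching neither condition

# Python's `and` asks for a single truth value of a whole Series,
# which pandas refuses to guess.
try:
    toy[high and in_ny]
    raised = False
except ValueError:
    raised = True

print(both)
print(len(either), neither.empty, raised)
```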



#### Basic Filtering

In [17]:
df[df['Age'] > 20].head()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7
6,7,23,Male,Graduate,Germany,1.5,LinkedIn,No,8.0,9,Single,0,2
9,10,21,Female,Graduate,South Korea,3.3,Instagram,No,7.0,7,In Relationship,1,4
12,13,22,Male,Graduate,Italy,2.8,LinkedIn,No,7.2,8,Single,1,4


In [18]:
df[(df['Age'] > 20) & (df['Country'] == 'India')].head() # multiple conditions

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
111,112,21,Female,Undergraduate,India,5.8,TikTok,Yes,5.9,6,In Relationship,3,7
123,124,22,Female,Graduate,India,5.8,Instagram,Yes,5.8,6,In Relationship,3,7
129,130,21,Female,Graduate,India,5.6,Instagram,Yes,5.6,6,In Relationship,3,7
141,142,21,Female,Graduate,India,5.2,Instagram,Yes,5.2,6,In Relationship,3,7


In [19]:
df[(df['Relationship_Status'] == 'Single') | (df['Country'] == 'India')].head() # OR condition

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4
6,7,23,Male,Graduate,Germany,1.5,LinkedIn,No,8.0,9,Single,0,2
8,9,18,Male,High School,Japan,4.0,TikTok,No,6.5,7,Single,1,5
12,13,22,Male,Graduate,Italy,2.8,LinkedIn,No,7.2,8,Single,1,4


#### Conditional Filtering

In [20]:
df[df['Country'].isin(['India', 'USA', 'China'])].head() # filter using isin()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9
15,16,20,Female,Undergraduate,China,4.2,TikTok,Yes,6.0,6,Complicated,3,7
111,112,21,Female,Undergraduate,India,5.8,TikTok,Yes,5.9,6,In Relationship,3,7
117,118,20,Female,Undergraduate,India,5.7,Instagram,Yes,6.1,6,Single,3,7


In [21]:
df[df['Most_Used_Platform'].str.contains('Instagram')].head() # filter using str.contains()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
5,6,19,Female,Undergraduate,Australia,7.2,Instagram,Yes,4.5,4,Complicated,5,9
9,10,21,Female,Graduate,South Korea,3.3,Instagram,No,7.0,7,In Relationship,1,4
13,14,18,Female,High School,Mexico,6.5,Instagram,Yes,5.5,5,Single,4,9
17,18,19,Female,High School,Norway,5.0,Instagram,Yes,5.7,5,In Relationship,3,8
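
Two details of `str.contains()` are worth knowing: the pattern is treated as a regular expression by default, and missing values produce missing results in the mask unless `na` is set. A small sketch on a hypothetical column with one missing entry (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy column containing a missing value and mixed casing.
toy = pd.DataFrame({"Platform": ["Instagram", None, "instagram", "TikTok"]})

# case=False ignores letter case; na=False treats missing values as "no match"
mask = toy["Platform"].str.contains("insta", case=False, na=False)
hits = toy[mask]

print(mask.tolist())
print(hits)
```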


In [22]:
# The .str accessor offers other methods too, such as startswith(), endswith(), and match() for regex patterns.
df[df['Country'].str.startswith('C')].head()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7
15,16,20,Female,Undergraduate,China,4.2,TikTok,Yes,6.0,6,Complicated,3,7
45,46,23,Female,Graduate,Chile,2.7,LinkedIn,No,7.1,8,Complicated,1,4
46,47,19,Male,Undergraduate,Colombia,4.8,Instagram,Yes,5.9,6,In Relationship,3,7
53,54,19,Female,High School,Costa Rica,5.7,Instagram,Yes,5.5,5,Single,4,8


#### Advanced & Null Filtering

In [23]:
# Negating a Condition: Use the tilde operator ~ to select all rows that do not meet a condition.
df[~df['Country'].isin(['USA', 'Canada'])].head()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4
5,6,19,Female,Undergraduate,Australia,7.2,Instagram,Yes,4.5,4,Complicated,5,9
6,7,23,Male,Graduate,Germany,1.5,LinkedIn,No,8.0,9,Single,0,2


In [24]:
# .query() Method: Use the .query() method for a more readable syntax when filtering with multiple conditions.
df.query('Avg_Daily_Usage_Hours > 4 and Country == "India" and Age < 25').head()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
111,112,21,Female,Undergraduate,India,5.8,TikTok,Yes,5.9,6,In Relationship,3,7
117,118,20,Female,Undergraduate,India,5.7,Instagram,Yes,6.1,6,Single,3,7
123,124,22,Female,Graduate,India,5.8,Instagram,Yes,5.8,6,In Relationship,3,7
129,130,21,Female,Graduate,India,5.6,Instagram,Yes,5.6,6,In Relationship,3,7
135,136,20,Female,Undergraduate,India,5.4,Instagram,Yes,5.4,6,In Relationship,3,7
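
Inside a `.query()` string, a Python variable can be referenced with the `@` prefix. The sketch below uses a hypothetical toy frame and a hypothetical `max_age` variable, and checks that the query gives the same rows as the equivalent boolean mask:

```python
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame(
    {
        "Country": ["India", "India", "USA"],
        "Age": [21, 26, 20],
        "Avg_Daily_Usage_Hours": [5.8, 6.0, 6.0],
    }
)

max_age = 25  # hypothetical threshold, referenced in the query as @max_age

via_query = toy.query(
    'Avg_Daily_Usage_Hours > 4 and Country == "India" and Age < @max_age'
)
via_mask = toy[
    (toy["Avg_Daily_Usage_Hours"] > 4)
    & (toy["Country"] == "India")
    & (toy["Age"] < max_age)
]

print(via_query)
print(via_query.equals(via_mask))
```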


In [25]:
# Finding Missing Data: Use .isna() (or its alias .isnull()) to find rows where a specific value is missing (NaN).
df[df['Addicted_Score'].isna()] # returns an empty DataFrame because this column has no missing values

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score


In [26]:
# Finding Non-Missing Data: Use .notna() or .notnull() to find rows where a value is not missing.
df[df['Addicted_Score'].notna()]

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
700,701,20,Female,Undergraduate,Italy,4.7,TikTok,No,7.2,7,In Relationship,2,5
701,702,23,Male,Graduate,Russia,6.8,Instagram,Yes,5.9,4,Single,5,9
702,703,21,Female,Undergraduate,China,5.6,WeChat,Yes,6.7,6,In Relationship,3,7
703,704,24,Male,Graduate,Japan,4.3,Twitter,No,7.5,8,Single,2,4
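
Because this dataset has no missing `Addicted_Score` values, the two cells above return nothing and everything, respectively. A sketch on a hypothetical frame that does contain a `NaN` shows the split more clearly (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing score.
toy = pd.DataFrame({"Student_ID": [1, 2, 3], "Addicted_Score": [8, np.nan, 3]})

missing = toy[toy["Addicted_Score"].isna()]    # rows where the score is NaN
present = toy[toy["Addicted_Score"].notna()]   # rows where the score exists

print(missing)
print(present)
```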


In [27]:
# Between Two Values: Use the .between() method to select rows where a numeric value falls within an inclusive range.
df[df['Age'].between(18, 22)].head()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7
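
`.between()` is shorthand for a pair of comparisons joined with `&`, and its `inclusive` parameter controls which endpoints count. A minimal sketch on a hypothetical `Age` column (values chosen for illustration):

```python
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame({"Age": [18, 20, 22, 24]})

closed = toy[toy["Age"].between(18, 22)]                       # default: both ends included
manual = toy[(toy["Age"] >= 18) & (toy["Age"] <= 22)]          # equivalent explicit mask
half_open = toy[toy["Age"].between(18, 22, inclusive="left")]  # 18 included, 22 excluded

print(closed["Age"].tolist())
print(half_open["Age"].tolist())
```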


In [28]:
# Filtering based on Index: Use boolean masks to filter based on the DataFrame's index values.
# df[df.index % 2 == 0]  # even indexed rows
df[df.index.isin([0, 2, 4, 6, 8])]  # specific index values

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7
6,7,23,Male,Graduate,Germany,1.5,LinkedIn,No,8.0,9,Single,0,2
8,9,18,Male,High School,Japan,4.0,TikTok,No,6.5,7,Single,1,5
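
One practical difference between `df.index.isin([...])` and `df.loc[[...]]` is how they treat labels that do not exist. The sketch below uses a hypothetical toy frame and a hypothetical label `99` that is not in the index:

```python
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame({"Age": [19, 22, 20]}, index=[0, 1, 2])
wanted = [0, 2, 99]  # 99 is not an index label

via_isin = toy[toy.index.isin(wanted)]  # silently ignores the missing label

try:
    toy.loc[wanted]                     # a list with a missing label raises KeyError
    raised = False
except KeyError:
    raised = True

print(via_isin)
print(raised)
```

The `isin` form is therefore the more forgiving choice when the list of labels comes from an external source and may contain entries that are absent from the DataFrame.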
