---
# Row and Column Selection

Selecting rows and columns of a **DataFrame** and a **Series**.

---

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display

In [2]:
# Helper for printing a horizontal rule, used to separate displayed outputs
def printhr(s: str = None, n: int = 40):
    """Print a horizontal rule made of "=" characters.

    Args:
        s (str, optional): Header message shown in the middle of the rule.
            Defaults to None.
        n (int, optional): Number of "=" characters in the rule. Defaults to 40.
    """

    if s:
        # Half of the rule on each side of the header message
        print("=" * (n // 2), s, "=" * (n // 2))
    else:
        print("=" * n)

In [3]:
# Let us create a sample DataFrame from a python dict. This will be used for examples

# Python dict
people = {
    "first name": ["Klint", "Foo", "Cat"],
    "last name": ["Labadia", "Bar", "Dog"],
    "email": ["ckl@a.a", "foobar@a.a", "catdog@a.a"],
    "age": [25, 19, 7],
}

# Convert dict to a dataframe
people_df = pd.DataFrame(people)
people_df


Unnamed: 0,first name,last name,email,age
0,Klint,Labadia,ckl@a.a,25
1,Foo,Bar,foobar@a.a,19
2,Cat,Dog,catdog@a.a,7


---
## Selecting rows and columns in a dataframe
Rows and columns are commonly selected with the `loc` and `iloc` DataFrame indexers. Both are used with square brackets rather than call parentheses.

---

.loc()
```
df.loc[<row_indexer>, <column_indexer>]
```

Where:  
 `row_indexer` can be a boolean filter, a row label (index value), a list of labels, or a label slice.  
 `column_indexer` is the column label.
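
As a minimal sketch of the "filter" case, the block below passes a boolean mask as the row indexer. The DataFrame is rebuilt here so the snippet runs on its own, and the names `is_adult` and `adults` are made up for illustration.

```python
import pandas as pd

# Same sample data as people_df, rebuilt so the snippet is self-contained
people_df = pd.DataFrame(
    {
        "first name": ["Klint", "Foo", "Cat"],
        "last name": ["Labadia", "Bar", "Dog"],
        "email": ["ckl@a.a", "foobar@a.a", "catdog@a.a"],
        "age": [25, 19, 7],
    }
)

# Boolean mask (hypothetical name): True for rows where age is at least 18
is_adult = people_df["age"] >= 18

# Rows where the mask is True, restricted to two columns by label
adults = people_df.loc[is_adult, ["first name", "email"]]
print(adults)
```

The mask is a Series of True/False values aligned with the row index, so only the rows marked True are kept.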

---

.iloc()
```
df.iloc[<row_indexer>, <column_indexer>]
```

Where:  
 `row_indexer` is the row's integer position.  
 `column_indexer` is the column's integer position.

---

---
### Viewing columns and column data types

In pandas, columns can be referred to by using brackets (indexing) or dot notation as follows:  

DataFrame["column_1"]  
or  
DataFrame.column_1

Note that dot notation cannot be used for column names that contain a space (e.g. "Column Name") or that clash with an existing DataFrame attribute or method, so square brackets are generally preferred.
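
A small sketch of where dot notation falls short. The column name `count` is a hypothetical example, chosen because it collides with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical data: "count" is chosen because it clashes with DataFrame.count
df = pd.DataFrame(
    {
        "first name": ["Klint", "Foo"],
        "count": [3, 5],
        "email": ["ckl@a.a", "foobar@a.a"],
    }
)

# Works: "email" is a valid identifier and not an existing DataFrame attribute
emails = df.email

# Dot notation returns the method here, not the column
dot_count = df.count
bracket_count = df["count"]
print(callable(dot_count))           # True
print(type(bracket_count).__name__)  # Series

# A name containing a space can only be reached with brackets
first_names = df["first name"]
```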

---

In [4]:
# Viewing all DataFrame columns
display(people_df.columns)
printhr()

# Use dtypes to see data types of columns
people_df.dtypes


Index(['first name', 'last name', 'email', 'age'], dtype='object')



first name    object
last name     object
email         object
age            int64
dtype: object

In [5]:
## Selecting Columns

# Single column by bracket (preferred)
display(people_df["email"])
printhr()

# By dot notation (can't be used for column names with a space,
# so we can't access the "first name" and "last name" columns this way)
display(people_df.email)


0       ckl@a.a
1    foobar@a.a
2    catdog@a.a
Name: email, dtype: object



0       ckl@a.a
1    foobar@a.a
2    catdog@a.a
Name: email, dtype: object

---
#### Viewing Multiple Columns by Bracket Notation

Using bracket notation, we can pass a list of column names to access multiple columns at once:

DataFrame[["col1", "col2", "col3"]]
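
One detail worth seeing in code: a single column name returns a Series, while a list of names (even a list of one) returns a DataFrame. A minimal sketch with made-up data:

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame(
    {
        "first name": ["Klint", "Foo", "Cat"],
        "email": ["ckl@a.a", "foobar@a.a", "catdog@a.a"],
        "age": [25, 19, 7],
    }
)

single = df["email"]    # a column name -> Series
double = df[["email"]]  # a list of column names -> DataFrame

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
print(double.shape)           # (3, 1)
```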

---

In [6]:
# Viewing multiple columns
people_df[["first name", "email"]]

Unnamed: 0,first name,email
0,Klint,ckl@a.a
1,Foo,foobar@a.a
2,Cat,catdog@a.a


---
## .iloc()

`iloc` selects entries based on their integer position.

df.iloc[`<row_indexer>`, `<column_indexer>`]

Where:  
 `row_indexer` is the row's integer position.  
 `column_indexer` is the column's integer position.

Slicing can be used for both the row and column indexers.  

For `column_indexer`, the index column is not counted when referring to a column (i.e. position 0 is the first column AFTER the index column).
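
A short sketch of how column positions are counted. It uses `Index.get_loc` to look up the position of a column label; the sample data is rebuilt here and the variable names are illustrative.

```python
import pandas as pd

# Same sample data as people_df, rebuilt so the snippet is self-contained
people_df = pd.DataFrame(
    {
        "first name": ["Klint", "Foo", "Cat"],
        "last name": ["Labadia", "Bar", "Dog"],
        "email": ["ckl@a.a", "foobar@a.a", "catdog@a.a"],
        "age": [25, 19, 7],
    }
)

# Position of the "email" column; the index column is not counted
email_pos = people_df.columns.get_loc("email")
print(email_pos)                       # 2
print(people_df.iloc[1, email_pos])    # foobar@a.a

# Negative positions count from the end, as with Python lists
last_age = people_df.iloc[-1, -1]
print(last_age)                        # 7

# All rows, first two columns (positions 0 and 1)
first_two_cols = people_df.iloc[:, 0:2]
print(list(first_two_cols.columns))    # ['first name', 'last name']
```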
 
---

In [7]:
## Selecting Rows using iloc (index location)
a = people_df.iloc[0]

# Multiple rows
b = people_df.iloc[[0, 2]]  # Index 0 and index 2

# Multiple rows using slicing
c = people_df.iloc[0:2]  # Index 0 through 1

display(a)
printhr()
display(b)
printhr()
display(c)


first name      Klint
last name     Labadia
email         ckl@a.a
age                25
Name: 0, dtype: object



Unnamed: 0,first name,last name,email,age
0,Klint,Labadia,ckl@a.a,25
2,Cat,Dog,catdog@a.a,7




Unnamed: 0,first name,last name,email,age
0,Klint,Labadia,ckl@a.a,25
1,Foo,Bar,foobar@a.a,19


In [8]:
## Selecting Column from Row using iloc

# First value = row position, second value = column position
# i.e. email (column position 2) of the entry at row position 1
x = people_df.iloc[1, 2]

# Example 2 - First name (0) and last name (1) of entries at index 0 to 1 (0:2)
y = people_df.iloc[0:2, [0, 1]]

display(x)
printhr()
display(y)


'foobar@a.a'



Unnamed: 0,first name,last name
0,Klint,Labadia
1,Foo,Bar


---
## .loc()

`loc` selects entries based on their index label (an int for the default index, or a str for a named index). Slicing with `loc` includes both endpoints.

df.loc[`<row_indexer>`, `<column_indexer>`]

Where:  
 `row_indexer` is the row's index label.  
 `column_indexer` is the column label.

Slicing can be used for both the row and column indexers. Slices use labels (e.g. str names), NOT integer positions; with the default integer index, the row labels happen to be integers.
 
---

In [9]:
# Change the index of people_df to str labels for the purpose of
# practicing row selection using loc.
# More on indexes in the next chapter.

people_df2 = people_df.copy()
people_df2.index = list("ABC")
people_df2

Unnamed: 0,first name,last name,email,age
A,Klint,Labadia,ckl@a.a,25
B,Foo,Bar,foobar@a.a,19
C,Cat,Dog,catdog@a.a,7


In [10]:
## Selecting Rows using loc (by label)
display(people_df2)
printhr()

x = people_df2.loc["A"]

# Multiple rows
y = people_df2.loc[["A", "C"]]  # Index "A" and index "C"

# Index "A" through index "C" ("C" included, unlike iloc)
# Note: this reassigns y, so only the slice result is displayed below
y = people_df2.loc["A":"C"]

display(x)
printhr()
display(y)

Unnamed: 0,first name,last name,email,age
A,Klint,Labadia,ckl@a.a,25
B,Foo,Bar,foobar@a.a,19
C,Cat,Dog,catdog@a.a,7




first name      Klint
last name     Labadia
email         ckl@a.a
age                25
Name: A, dtype: object



Unnamed: 0,first name,last name,email,age
A,Klint,Labadia,ckl@a.a,25
B,Foo,Bar,foobar@a.a,19
C,Cat,Dog,catdog@a.a,7


In [11]:
# Index "A" through "C", from columns "first name" to "email"
x = people_df2.loc["A":"C", "first name":"email"]
display(x)

Unnamed: 0,first name,last name,email
A,Klint,Labadia,ckl@a.a
B,Foo,Bar,foobar@a.a
C,Cat,Dog,catdog@a.a


In [12]:
## Selecting Column from Row using loc

# "A" = index (row), "email" = column
# Note that unlike iloc, the column should be the column name.

# i.e. email (column) of entry at index A (row)
x = people_df2.loc["A", "email"]

# Example 2 - Last name and email of entries index A and B
y = people_df2.loc[["A", "B"], ["last name", "email"]]

display(x)
printhr()
display(y)


'ckl@a.a'



Unnamed: 0,last name,email
A,Labadia,ckl@a.a
B,Bar,foobar@a.a


---
## Example from the Stack Overflow data set
---

In [13]:
# Load csv files as df
df = pd.read_csv("data/survey_results_public_2022.csv")
schema_df = pd.read_csv("data/survey_results_schema.csv")


In [14]:
# Configure display options
pd.set_option("display.max_rows", 80)
pd.set_option("display.max_columns", 80)


In [15]:
# Return the first 5 rows of df
df.head()


Unnamed: 0,ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,DevType,OrgSize,PurchaseInfluence,BuyNewTool,Country,Currency,CompTotal,CompFreq,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,OpSysProfessional use,OpSysPersonal use,VersionControlSystem,VCInteraction,VCHostingPersonal use,VCHostingProfessional use,OfficeStackAsyncHaveWorkedWith,OfficeStackAsyncWantToWorkWith,OfficeStackSyncHaveWorkedWith,OfficeStackSyncWantToWorkWith,Blockchain,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,TBranch,ICorPM,WorkExp,Knowledge_1,Knowledge_2,Knowledge_3,Knowledge_4,Knowledge_5,Knowledge_6,Knowledge_7,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,None of these,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,I am a developer by profession,"Employed, full-time",Fully remote,Hobby;Contribute to open-source projects,,,,,,,,,,,Canada,CAD\tCanadian dollar,,,JavaScript;TypeScript,Rust;TypeScript,,,,,,,,,,,,,macOS,Windows Subsystem for Linux (WSL),Git,,,,,,,,Very unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Daily or almost daily,Yes,Daily or almost daily,Not sure,,,,,,,,No,,,,,,,,,,,,,,,,,,,,Too long,Difficult,
2,3,"I am not primarily a developer, but I write co...","Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Friend or family member...,Technical documentation;Blogs;Programming Game...,,14.0,5.0,Data scientist or machine learning specialist;...,20 to 99 employees,I have some influence,,United Kingdom of Great Britain and Northern I...,GBP\tPound sterling,32000.0,Yearly,C#;C++;HTML/CSS;JavaScript;Python,C#;C++;HTML/CSS;JavaScript;TypeScript,Microsoft SQL Server,Microsoft SQL Server,,,Angular.js,Angular;Angular.js,Pandas,.NET,,,Notepad++;Visual Studio,Notepad++;Visual Studio,Windows,Windows,Git,Code editor,,,,,Microsoft Teams,Microsoft Teams,Very unfavorable,Collectives on Stack Overflow;Stack Overflow;S...,Multiple times per day,Yes,Multiple times per day,Neutral,25-34 years old,Man,No,Bisexual,White,None of the above,"I have a mood or emotional disorder (e.g., dep...",No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Neither easy nor difficult,40205.0
3,4,I am a developer by profession,"Employed, full-time",Fully remote,I don’t code outside of work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Books / Physical media;School (i.e., Universit...",,,20.0,17.0,"Developer, full-stack",100 to 499 employees,I have some influence,Other (please specify):,Israel,ILS\tIsraeli new shekel,60000.0,Monthly,C#;JavaScript;SQL;TypeScript,C#;SQL;TypeScript,Microsoft SQL Server,Microsoft SQL Server,,,ASP.NET;ASP.NET Core,ASP.NET;ASP.NET Core,.NET,.NET,,,Notepad++;Visual Studio;Visual Studio Code,Notepad++;Visual Studio;Visual Studio Code,Windows,Windows,Git,Code editor;Command-line;Version control hosti...,,,Jira Work Management;Trello,Jira Work Management;Trello,Slack;Zoom,Slack;Zoom,Very unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Daily or almost daily,Yes,A few times per week,"Yes, definitely",35-44 years old,Man,No,Straight / Heterosexual,White,None of the above,None of the above,No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Easy,215232.0
4,5,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Stack Overflow;O...,,8.0,3.0,"Developer, front-end;Developer, full-stack;Dev...",20 to 99 employees,I have some influence,Start a free trial;Visit developer communities...,United States of America,USD\tUnited States dollar,,,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript,C#;Elixir;F#;Go;JavaScript;Rust;TypeScript,Cloud Firestore;Elasticsearch;Microsoft SQL Se...,Cloud Firestore;Elasticsearch;Firebase Realtim...,Firebase;Microsoft Azure,Firebase;Microsoft Azure,Angular;ASP.NET;ASP.NET Core ;jQuery;Node.js,Angular;ASP.NET Core ;Blazor;Node.js,.NET,.NET;Apache Kafka,npm,Docker;Kubernetes,Notepad++;Visual Studio;Visual Studio Code;Xcode,Rider;Visual Studio;Visual Studio Code,Windows,macOS;Windows,Git;Other (please specify):,Code editor,,,,,Microsoft Teams;Zoom,,Unfavorable,Collectives on Stack Overflow;Stack Overflow f...,Multiple times per day,Yes,Daily or almost daily,"Yes, definitely",25-34 years old,,,,,,,No,,,,,,,,,,,,,,,,,,,,Too long,Easy,


In [16]:
df.columns


Index(['ResponseId', 'MainBranch', 'Employment', 'RemoteWork',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       'LearnCodeCoursesCert', 'YearsCode', 'YearsCodePro', 'DevType',
       'OrgSize', 'PurchaseInfluence', 'BuyNewTool', 'Country', 'Currency',
       'CompTotal', 'CompFreq', 'LanguageHaveWorkedWith',
       'LanguageWantToWorkWith', 'DatabaseHaveWorkedWith',
       'DatabaseWantToWorkWith', 'PlatformHaveWorkedWith',
       'PlatformWantToWorkWith', 'WebframeHaveWorkedWith',
       'WebframeWantToWorkWith', 'MiscTechHaveWorkedWith',
       'MiscTechWantToWorkWith', 'ToolsTechHaveWorkedWith',
       'ToolsTechWantToWorkWith', 'NEWCollabToolsHaveWorkedWith',
       'NEWCollabToolsWantToWorkWith', 'OpSysProfessional use',
       'OpSysPersonal use', 'VersionControlSystem', 'VCInteraction',
       'VCHostingPersonal use', 'VCHostingProfessional use',
       'OfficeStackAsyncHaveWorkedWith', 'OfficeStackAsyncWantToWorkWith',
       'OfficeStackSyncHaveWork

In [17]:
# First five rows (labels 0 through 4, both included), from column "YearsCode" to column "OrgSize"
df.loc[0:4, "YearsCode":"OrgSize"]


Unnamed: 0,YearsCode,YearsCodePro,DevType,OrgSize
0,,,,
1,,,,
2,14.0,5.0,Data scientist or machine learning specialist;...,20 to 99 employees
3,20.0,17.0,"Developer, full-stack",100 to 499 employees
4,8.0,3.0,"Developer, front-end;Developer, full-stack;Dev...",20 to 99 employees


---
#### Note

Slicing with `iloc` behaves like Python's list slicing: it returns the start position but not the stop position. `loc`, by contrast, returns both endpoints. I think this is to allow a slice to include the last element in a collection: with a str-named index there is no "one past the end" label to use as an exclusive stop, and you can't refer to a str-named index with an integer.
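
The difference is easy to check on a tiny DataFrame. The sketch below uses made-up data and compares the number of rows returned by each kind of slice.

```python
import pandas as pd

# Illustrative data with the default integer index (0, 1, 2)
df = pd.DataFrame({"age": [25, 19, 7]})

by_position = df.iloc[0:2]  # positions 0 and 1; stop is excluded
by_label = df.loc[0:2]      # labels 0, 1 and 2; stop is included
print(len(by_position))     # 2
print(len(by_label))        # 3

# Same data with str labels
named = df.copy()
named.index = list("ABC")

named_by_position = named.iloc[0:2]  # rows A and B
named_by_label = named.loc["A":"B"]  # rows A and B; "B" is included
print(list(named_by_position.index))  # ['A', 'B']
print(list(named_by_label.index))     # ['A', 'B']
```

With the default integer index the two slices look alike but return different numbers of rows, which is an easy source of off-by-one mistakes.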

---