<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/pandas/pandas_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Search Pandas Data

The data we use is a data science salary dataset from Kaggle.

We show:

* how to search using loc
* how to search using iloc
* how to search using boolean expressions
* how to search by date
* how to set an index and search by it
* how to retrieve a subset of columns
* how to retrieve a subset of rows and columns


In [None]:
from datetime import date
import pandas as pd


# Here we create a dataframe. We want a date column that we can later turn into an index
# and search on, so we set it up like this. The single value "Fred" is repeated for every row.

df = pd.DataFrame({
      "agent": "Fred",
      "sales": [100, 200, 300],
      "date": [date.fromisoformat("2024-04-11"), date.fromisoformat("2024-04-12"), date.fromisoformat("2024-04-13")]
})

df

Unnamed: 0,agent,sales,date
0,Fred,100,2024-04-11
1,Fred,200,2024-04-12
2,Fred,300,2024-04-13


In [None]:
# since we did not set an index, pandas just uses the numbers 0, 1, 2, ... as the index

df.index

RangeIndex(start=0, stop=3, step=1)

In [None]:
# make the date column the index. use inplace=True so that it changes the dataframe and does not create a new one

df.set_index("date", inplace=True)


In [None]:
# now we see the index values

df.index

Index([2024-04-11, 2024-04-12, 2024-04-13], dtype='object', name='date')
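
If you prefer not to modify the dataframe in place, `set_index` also returns a new dataframe that you can assign to a variable. The following is a minimal sketch of that alternative; the names `toy`, `indexed`, and `restored` are hypothetical and the data is a two-row stand-in for the frame above.

```python
from datetime import date
import pandas as pd

# Hypothetical two-row frame, shaped like the sales data above
toy = pd.DataFrame({
    "sales": [100, 200],
    "date": [date(2024, 4, 11), date(2024, 4, 12)],
})

# Without inplace=True, set_index returns a new dataframe and leaves toy unchanged
indexed = toy.set_index("date")

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()

print(indexed.index.name)
print(list(restored.columns))
```

Assigning the result keeps the original around, which can help when you want to try several different indexes on the same data.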

In [None]:
# now we can search by date: pass a date to loc to get the row with that index value

df.loc[date.fromisoformat("2024-04-11")]

agent    Fred
sales     100
Name: 2024-04-11, dtype: object
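
The index above stores plain Python `date` objects, so its dtype is `object`. As a sketch of one alternative, assuming you are free to convert the column with `pd.to_datetime`, the index becomes a `DatetimeIndex`, which also accepts date strings in range searches. The name `sales_df` is hypothetical and the data repeats the small frame above.

```python
import pandas as pd

# Same toy data as above, but the dates are parsed into pandas timestamps
sales_df = pd.DataFrame({
    "agent": ["Fred", "Fred", "Fred"],
    "sales": [100, 200, 300],
    "date": pd.to_datetime(["2024-04-11", "2024-04-12", "2024-04-13"]),
}).set_index("date")

# Look up a single day by its timestamp
one_day = sales_df.loc[pd.Timestamp("2024-04-12")]

# Label slices with loc include both endpoints
date_range = sales_df.loc["2024-04-11":"2024-04-12"]

print(one_day["sales"])
print(date_range)
```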

In [None]:
# Here we load a much larger dataframe to have some more interesting data.

import pandas as pd

file = "https://raw.githubusercontent.com/werowe/HypatiaAcademy/master/pandas/DataScience_salaries_2024.csv"
df = pd.read_csv(file)
df

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2021,MI,FT,Data Scientist,30400000,CLP,40038,CL,100,CL,L
1,2021,MI,FT,BI Data Analyst,11000000,HUF,36259,HU,50,US,L
2,2020,MI,FT,Data Scientist,11000000,HUF,35735,HU,50,HU,L
3,2021,MI,FT,ML Engineer,8500000,JPY,77364,JP,50,JP,S
4,2022,SE,FT,Lead Machine Learning Engineer,7500000,INR,95386,IN,50,IN,L
...,...,...,...,...,...,...,...,...,...,...,...
14833,2022,MI,FT,Business Intelligence Developer,15000,USD,15000,GH,100,GH,M
14834,2020,EX,FT,Staff Data Analyst,15000,USD,15000,NG,0,CA,M
14835,2021,EN,FT,Machine Learning Developer,15000,USD,15000,TH,100,TH,L
14836,2022,EN,FT,Data Analyst,15000,USD,15000,ID,0,ID,L


In [None]:
# Here we filter rows with a boolean condition. "EN" means entry-level worker

df[df['experience_level'] == 'EN']

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
7,2022,EN,FT,Data Scientist,6600000,HUF,17684,HU,100,HU,M
9,2022,EN,FT,Research Engineer,5500000,JPY,41809,JP,50,JP,L
11,2023,EN,FT,AI Programmer,4950806,INR,60207,IN,0,IN,S
14,2020,EN,FT,Data Engineer,4450000,JPY,41689,JP,100,JP,S
16,2023,EN,FT,Applied Machine Learning Scientist,4000000,INR,48644,IN,100,DE,L
...,...,...,...,...,...,...,...,...,...,...,...
14831,2022,EN,FT,Research Engineer,15662,EUR,16455,RU,100,RU,M
14832,2024,EN,PT,Data Science,15000,EUR,16666,DE,50,DE,M
14835,2021,EN,FT,Machine Learning Developer,15000,USD,15000,TH,100,TH,L
14836,2022,EN,FT,Data Analyst,15000,USD,15000,ID,0,ID,L
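
To see what the boolean condition does, it can help to look at the mask on its own. The sketch below uses a small hypothetical frame (`jobs`, with made-up rows shaped like the salary data) rather than the real file, and also shows `isin` and `~` as two related tools.

```python
import pandas as pd

# Hypothetical rows, shaped like the salary data
jobs = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "EN"],
    "salary_in_usd": [40000, 60000, 95000, 15000],
})

# The comparison produces a boolean Series with one value per row
mask = jobs["experience_level"] == "EN"

# Indexing with the mask keeps only the rows where it is True
entry_level = jobs[mask]

# isin() matches any of several values; ~ negates a mask
junior_or_mid = jobs[jobs["experience_level"].isin(["EN", "MI"])]
not_entry = jobs[~mask]

print(mask.tolist())
print(entry_level)
```

Note that the filtered rows keep their original index labels, which is why the output above starts at 7 rather than 0.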


In [None]:
# Here we use loc. It is an indexer, so it takes square brackets, and it accepts several forms.
# Here we use df.loc[boolean, columns]
# so it selects rows based on the boolean condition and picks the columns we want
# "FR" means France
# notice also that we use & to combine two boolean expressions, each wrapped in parentheses



df.loc[(df['company_location'] == 'FR') & (df['work_year'] == 2023), ['salary', 'job_title']]

Unnamed: 0,salary,job_title
1895,225000,Machine Learning Engineer
1901,225000,Research Engineer
3169,199000,Machine Learning Engineer
5181,168000,Research Engineer
6612,150000,AI Developer
9655,120000,AI Programmer
10518,110000,Data Engineer
11407,100000,Machine Learning Infrastructure Engineer
12125,90000,Data Engineer
13047,75000,Data Engineer
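
The parentheses around each condition matter because `&` binds more tightly than `==` in Python. Here is a minimal sketch of combining conditions with `&` (and) and `|` (or), using a hypothetical frame `jobs` with made-up rows instead of the real file.

```python
import pandas as pd

# Hypothetical rows, shaped like the salary data
jobs = pd.DataFrame({
    "company_location": ["FR", "FR", "US", "DE"],
    "work_year": [2023, 2022, 2023, 2023],
    "salary": [90000, 75000, 120000, 80000],
})

# & means "and": both conditions must hold for a row to be kept
fr_2023 = jobs.loc[(jobs["company_location"] == "FR") & (jobs["work_year"] == 2023)]

# | means "or": a row is kept if either condition holds
fr_or_de = jobs.loc[(jobs["company_location"] == "FR") | (jobs["company_location"] == "DE")]

print(fr_2023)
print(fr_or_de)
```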


In [None]:
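# Instead of passing the row condition and the column list to a single loc call:

df.loc[(df['company_location'] == 'FR') & (df['work_year'] == 2023), ['salary', 'job_title']]

# you could chain two loc calls. The first filters the rows, the second selects the columns:


df.loc[(df['company_location'] == 'FR') & (df['work_year'] == 2023)].loc[:,['salary', 'job_title']]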



Unnamed: 0,salary,job_title
1895,225000,Machine Learning Engineer
1901,225000,Research Engineer
3169,199000,Machine Learning Engineer
5181,168000,Research Engineer
6612,150000,AI Developer
9655,120000,AI Programmer
10518,110000,Data Engineer
11407,100000,Machine Learning Infrastructure Engineer
12125,90000,Data Engineer
13047,75000,Data Engineer
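
Another way to write the same search is `DataFrame.query`, which takes the condition as a string. This is only a sketch of an alternative, not something the notebook relies on; the frame `jobs` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows, shaped like the salary data
jobs = pd.DataFrame({
    "company_location": ["FR", "US", "FR"],
    "work_year": [2023, 2023, 2022],
    "salary": [90000, 120000, 75000],
    "job_title": ["Data Engineer", "Data Scientist", "Data Engineer"],
})

# The loc form used above: boolean condition, then a list of columns
with_loc = jobs.loc[
    (jobs["company_location"] == "FR") & (jobs["work_year"] == 2023),
    ["salary", "job_title"],
]

# The query form: the condition is a string, then we pick the columns
with_query = jobs.query("company_location == 'FR' and work_year == 2023")[["salary", "job_title"]]

print(with_loc.equals(with_query))
```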


In [None]:
# in this example the : selects all rows, and the list tells loc which columns to return

df.loc[:,['work_year','job_title','salary']]

Unnamed: 0,work_year,job_title,salary
0,2021,Data Scientist,30400000
1,2021,BI Data Analyst,11000000
2,2020,Data Scientist,11000000
3,2021,ML Engineer,8500000
4,2022,Lead Machine Learning Engineer,7500000
...,...,...,...
14833,2022,Business Intelligence Developer,15000
14834,2020,Staff Data Analyst,15000
14835,2021,Machine Learning Developer,15000
14836,2022,Data Analyst,15000
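
When you want every row, plain square brackets with a list of column names give the same result as `loc[:, [...]]`. The sketch below compares the two on a hypothetical frame `jobs` and shows that a single column name without the list returns a Series instead of a dataframe.

```python
import pandas as pd

# Hypothetical rows, shaped like the salary data
jobs = pd.DataFrame({
    "work_year": [2021, 2020],
    "job_title": ["Data Scientist", "ML Engineer"],
    "salary": [50000, 70000],
})

# Two equivalent ways to select a subset of columns for all rows
with_loc = jobs.loc[:, ["work_year", "salary"]]
with_brackets = jobs[["work_year", "salary"]]

# A single name gives a Series; a one-item list gives a dataframe
one_column = jobs["salary"]
one_column_frame = jobs[["salary"]]

print(with_loc.equals(with_brackets))
print(type(one_column).__name__, type(one_column_frame).__name__)
```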


In [None]:
# iloc selects by integer position instead of by label, and positions start at 0
# of course it is not always useful, as selecting columns by number mostly helps if the column names
# were something numeric, like month
# slices leave out the end position, so this gets the rows at positions 1 and 2 and the columns at positions 1 through 4


df.iloc[1:3,1:5]


Unnamed: 0,experience_level,employment_type,job_title,salary
1,MI,FT,BI Data Analyst,11000000
2,MI,FT,Data Scientist,11000000
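
One difference between the two indexers is easy to miss: an `iloc` slice leaves out its end position, while a `loc` slice includes its end label. The sketch below shows this on the first four rows of the salary data shown earlier, typed in by hand so that it runs without downloading the file; the name `jobs` is hypothetical.

```python
import pandas as pd

# First four rows of the salary data shown earlier, reduced to three columns
jobs = pd.DataFrame({
    "work_year": [2021, 2021, 2020, 2021],
    "job_title": ["Data Scientist", "BI Data Analyst", "Data Scientist", "ML Engineer"],
    "salary": [30400000, 11000000, 11000000, 8500000],
})

# iloc: positions 1 and 2 for rows, positions 0 and 1 for columns (end excluded)
by_position = jobs.iloc[1:3, 0:2]

# loc: labels 1 through 3 for rows (end included), columns by name
by_label = jobs.loc[1:3, ["work_year", "job_title"]]

print(by_position.shape)
print(by_label.shape)
```

With the default index the labels and positions happen to be the same numbers, which is why the two slices look alike but return a different number of rows.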
