<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/2_Advanced/05_Pandas_Index_Management.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pandas Index Management

Load data.

In [44]:
# Importing Libraries
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt  

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

## Index Attributes

### Notes

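* You can read properties of the index, such as its name and data type. The name can also be assigned directly; the data type is read-only and is changed by converting the index (for example with `astype()`).
* This is helpful for maintaining metadata or ensuring index compatibility in operations.
* Two commonly used attributes: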
    * `index.name` - name of the index
    * `index.dtype` - data type of the index
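
As a minimal sketch of these attributes, here they are on a small made-up DataFrame. The `toy` frame, its values, and the name `row_id` are hypothetical and not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'salary': [90000, 125000, 128000]})

# A freshly built DataFrame gets a default RangeIndex with no name
print(toy.index.name)   # None
print(toy.index.dtype)  # int64

# Assigning the name attaches metadata; the data itself is unchanged
toy.index.name = 'row_id'
print(toy.index.name)   # row_id
```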

Let's look at our DataFrame.

In [45]:
df.sample(3)

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
580368,Software Engineer,Lead Engineer,"Copenhagen, Denmark",via Jobs Trabajo.org,Full-time,False,Denmark,2023-12-25 10:49:32,False,False,Denmark,,,,Hays,,
777016,Data Analyst,Data Analyst,"Manila, Metro Manila, Philippines",via BeBee,,False,Philippines,2023-10-09 03:33:15,False,False,Philippines,,,,Dempsey,"['java', 'python', 'perl']","{'programming': ['java', 'python', 'perl']}"
150735,Business Analyst,Junior Process Analyst,"Bologna, Metropolitan City of Bologna, Italy",via Trabajo. Org,Full-time,False,Italy,2023-07-27 15:37:39,False,False,Italy,,,,Randstad Italia Spa,"['gdpr', 'sap', 'excel']","{'analyst_tools': ['sap', 'excel'], 'libraries..."


Let's inspect our index.

In [46]:
df.index

RangeIndex(start=0, stop=787686, step=1)

In [47]:
df.index.dtype

dtype('int64')

Our index is a range of integers. Now let's inspect its name.

In [48]:
df.index.name

It has no name (the cell returns `None`, so nothing is displayed), so let's name it.

In [49]:
df.index.name = 'job_index'

Inspecting it.

In [50]:
df.index.name

'job_index'

In [51]:
df.sample(3)

job_index,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
714767,Data Analyst,Data Analyst H/F,"Nanterre, France",via Jobijoba,Full-time,False,France,2023-06-16 06:57:14,True,False,France,,,,Scc,"['sql', 'python', 'scala', 'spark', 'pyspark']","{'libraries': ['spark', 'pyspark'], 'programmi..."
766358,Data Engineer,Data Engineer,"Mexico City, CDMX, Mexico",via Big Country Jobs,Full-time,False,Mexico,2023-08-27 01:47:35,False,False,Mexico,,,,Cognizant Technology Solutions,"['python', 'sql', 'gcp', 'airflow', 'terraform...","{'cloud': ['gcp'], 'libraries': ['airflow'], '..."
536934,Data Scientist,Docent data scientist,"'t Harde, Netherlands",via Werken Voor Nederland,Full-time and Part-time,False,Netherlands,2023-10-18 23:34:09,False,False,Netherlands,,,,Ministerie van Defensie,,


### Example 2

Remember the pivot table we made in the last example, where we got the median yearly salary for each job title? Let's run that code again. Then we'll get the index name and the data type of the index using `median_pivot.index.name` and `median_pivot.index.dtype`.

In [56]:
median_pivot = df.pivot_table(values='salary_year_avg', index='job_title_short', aggfunc='median')
median_pivot

job_title_short,salary_year_avg
Business Analyst,85000.0
Cloud Engineer,90000.0
Data Analyst,90000.0
Data Engineer,125000.0
Data Scientist,128000.0
Machine Learning Engineer,106000.0
Senior Data Analyst,111175.0
Senior Data Engineer,147500.0
Senior Data Scientist,155000.0
Software Engineer,99150.0


In [57]:
index_name = median_pivot.index.name  
index_name 

'job_title_short'

In [58]:
index_dtype = median_pivot.index.dtype  
index_dtype

dtype('O')

## reset_index()

### Notes 

* `reset_index()`: Resets the DataFrame’s index to the default integer index. This is particularly useful after operations that alter the index, like sorting or filtering, to simplify further data manipulation.
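
A minimal sketch of the two common ways to call `reset_index()`, using a small made-up DataFrame (the `toy` frame and its values are hypothetical). By default the old index is kept as a new column; passing `drop=True` discards it.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'country': ['US', 'FR', 'US', 'MX'],
                    'salary': [100, 80, 120, 70]})

# Filtering keeps the original row labels, leaving gaps (0 and 2 here)
us_only = toy[toy['country'] == 'US']
print(list(us_only.index))  # [0, 2]

# Default: the old index is kept as a new column named 'index'
kept = us_only.reset_index()
print(list(kept.columns))   # ['index', 'country', 'salary']

# drop=True: the old index is discarded instead
dropped = us_only.reset_index(drop=True)
print(list(dropped.index))  # [0, 1]
```

If the index has a name, the new column takes that name instead of `index`.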

### Example 1

When we create a new DataFrame by filtering, the rows keep their original index labels, so the index is no longer a continuous sequence.

In [59]:
df_usa = df[df['job_country'] == 'United States']

df_usa.head(5)

job_index,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
4,Data Analyst,Technical Data Analyst,"Fairfax, VA",via Indeed,Contractor,False,"New York, United States",2023-12-20 07:00:10,True,False,United States,,,,Info Origin Inc.,"['sql', 'python', 'jira']","{'async': ['jira'], 'programming': ['sql', 'py..."
6,Data Scientist,Research Data Scientist - Now Hiring,"Washington, DC",via Snagajob,Full-time,False,"New York, United States",2023-02-01 07:03:49,False,True,United States,,,,Booz Allen Hamilton,"['python', 'r', 'tableau', 'splunk', 'docker']","{'analyst_tools': ['tableau', 'splunk'], 'othe..."
7,Data Scientist,Diversity and Inclusion Workforce Data Scienti...,United States,via BeBee,Full-time,False,"Texas, United States",2023-07-27 07:04:35,False,False,United States,,,,CIDIS LLC,"['python', 'r', 'sql', 'cognos', 'alteryx', 't...","{'analyst_tools': ['cognos', 'alteryx', 'table..."
14,Data Scientist,Data Scientist,"Hampton, VA",via Monster,Full-time,False,Georgia,2023-04-10 07:54:26,False,True,United States,,,,Guidehouse,"['sql', 'r', 'python', 'excel', 'tableau']","{'analyst_tools': ['excel', 'tableau'], 'progr..."
16,Data Engineer,Staff Data Engineer - Now Hiring,"Plano, TX",via Snagajob,Full-time and Part-time,False,"Texas, United States",2023-12-28 07:26:45,False,False,United States,hour,,57.060001,FinThrive,"['python', 'scala', 'sql', 'bash', 'shell', 's...","{'analyst_tools': ['excel'], 'cloud': ['azure'..."


In [60]:
df_usa.index

Index([     4,      6,      7,     14,     16,     19,     26,     31,     32,
           33,
       ...
       787498, 787515, 787553, 787571, 787574, 787579, 787580, 787583, 787589,
       787623],
      dtype='int64', name='job_index', length=206145)

The index no longer increases in steps of 1, so let's reset it.

In [61]:
df_usa.reset_index(inplace=True)
df_usa.head()

Unnamed: 0,job_index,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
0,4,Data Analyst,Technical Data Analyst,"Fairfax, VA",via Indeed,Contractor,False,"New York, United States",2023-12-20 07:00:10,True,False,United States,,,,Info Origin Inc.,"['sql', 'python', 'jira']","{'async': ['jira'], 'programming': ['sql', 'py..."
1,6,Data Scientist,Research Data Scientist - Now Hiring,"Washington, DC",via Snagajob,Full-time,False,"New York, United States",2023-02-01 07:03:49,False,True,United States,,,,Booz Allen Hamilton,"['python', 'r', 'tableau', 'splunk', 'docker']","{'analyst_tools': ['tableau', 'splunk'], 'othe..."
2,7,Data Scientist,Diversity and Inclusion Workforce Data Scienti...,United States,via BeBee,Full-time,False,"Texas, United States",2023-07-27 07:04:35,False,False,United States,,,,CIDIS LLC,"['python', 'r', 'sql', 'cognos', 'alteryx', 't...","{'analyst_tools': ['cognos', 'alteryx', 'table..."
3,14,Data Scientist,Data Scientist,"Hampton, VA",via Monster,Full-time,False,Georgia,2023-04-10 07:54:26,False,True,United States,,,,Guidehouse,"['sql', 'r', 'python', 'excel', 'tableau']","{'analyst_tools': ['excel', 'tableau'], 'progr..."
4,16,Data Engineer,Staff Data Engineer - Now Hiring,"Plano, TX",via Snagajob,Full-time and Part-time,False,"Texas, United States",2023-12-28 07:26:45,False,False,United States,hour,,57.060001,FinThrive,"['python', 'scala', 'sql', 'bash', 'shell', 's...","{'analyst_tools': ['excel'], 'cloud': ['azure'..."


Technically, we could `.drop()` the `job_index` column.

However, if we want to merge with our original DataFrame in the future, `job_index` provides the unique ID to do that.
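
A rough sketch of such a merge, assuming small stand-in DataFrames rather than the real job postings (`full`, `subset`, and their values are hypothetical). The preserved `job_index` column on the left is matched against the index of the original DataFrame on the right.

```python
import pandas as pd

# Hypothetical toy data standing in for the full and filtered DataFrames
full = pd.DataFrame({'country': ['US', 'FR', 'US'],
                     'title': ['Analyst', 'Engineer', 'Scientist']})
full.index.name = 'job_index'

# Filter, then reset the index so 'job_index' becomes a regular column
subset = full[full['country'] == 'US'].reset_index()[['job_index', 'country']]

# Join back to the original rows: left column matched against right index
merged = subset.merge(full[['title']], left_on='job_index', right_index=True)
print(merged['title'].tolist())  # ['Analyst', 'Scientist']
```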

### Example 2

Back to our pivot table of median salaries. We're going to reset its index so `job_title_short` is a regular column instead of the index.

In [63]:
median_pivot.reset_index(inplace=True)
median_pivot

Unnamed: 0,job_title_short,salary_year_avg
0,Business Analyst,85000.0
1,Cloud Engineer,90000.0
2,Data Analyst,90000.0
3,Data Engineer,125000.0
4,Data Scientist,128000.0
5,Machine Learning Engineer,106000.0
6,Senior Data Analyst,111175.0
7,Senior Data Engineer,147500.0
8,Senior Data Scientist,155000.0
9,Software Engineer,99150.0


## set_index()

### Notes

* `set_index()`: Sets one or more existing columns as the index of the DataFrame. This is useful for timeseries data or when you want to index by specific attributes.
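
To illustrate the time series case, here is a minimal sketch on made-up data (the `toy` frame, its column names, and its values are hypothetical). Once a datetime column is the index, rows can be selected with a partial date string through `.loc`.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'posted': pd.to_datetime(['2023-01-15', '2023-01-20', '2023-02-03']),
    'title': ['Analyst', 'Engineer', 'Scientist'],
})

# set_index returns a new DataFrame unless inplace=True is passed
by_date = toy.set_index('posted')

# A datetime index allows label-based selection by partial date string
january = by_date.loc['2023-01']
print(january['title'].tolist())  # ['Analyst', 'Engineer']
```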

### Example 1

What if we wanted to go back to `job_index` as our main index?

In [73]:
df_usa.set_index('job_index', inplace=True)

df_usa.head()

job_index,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
4,Data Analyst,Technical Data Analyst,"Fairfax, VA",via Indeed,Contractor,False,"New York, United States",2023-12-20 07:00:10,True,False,United States,,,,Info Origin Inc.,"['sql', 'python', 'jira']","{'async': ['jira'], 'programming': ['sql', 'py..."
6,Data Scientist,Research Data Scientist - Now Hiring,"Washington, DC",via Snagajob,Full-time,False,"New York, United States",2023-02-01 07:03:49,False,True,United States,,,,Booz Allen Hamilton,"['python', 'r', 'tableau', 'splunk', 'docker']","{'analyst_tools': ['tableau', 'splunk'], 'othe..."
7,Data Scientist,Diversity and Inclusion Workforce Data Scienti...,United States,via BeBee,Full-time,False,"Texas, United States",2023-07-27 07:04:35,False,False,United States,,,,CIDIS LLC,"['python', 'r', 'sql', 'cognos', 'alteryx', 't...","{'analyst_tools': ['cognos', 'alteryx', 'table..."
14,Data Scientist,Data Scientist,"Hampton, VA",via Monster,Full-time,False,Georgia,2023-04-10 07:54:26,False,True,United States,,,,Guidehouse,"['sql', 'r', 'python', 'excel', 'tableau']","{'analyst_tools': ['excel', 'tableau'], 'progr..."
16,Data Engineer,Staff Data Engineer - Now Hiring,"Plano, TX",via Snagajob,Full-time and Part-time,False,"Texas, United States",2023-12-28 07:26:45,False,False,United States,hour,,57.060001,FinThrive,"['python', 'scala', 'sql', 'bash', 'shell', 's...","{'analyst_tools': ['excel'], 'cloud': ['azure'..."


### Example 2

Now that we've reset the index of our pivot table, we can set the index on a column again. Here we use `job_title_short`.

In [65]:
median_pivot.set_index('job_title_short', inplace=True)
median_pivot

job_title_short,salary_year_avg
Business Analyst,85000.0
Cloud Engineer,90000.0
Data Analyst,90000.0
Data Engineer,125000.0
Data Scientist,128000.0
Machine Learning Engineer,106000.0
Senior Data Analyst,111175.0
Senior Data Engineer,147500.0
Senior Data Scientist,155000.0
Software Engineer,99150.0


## sort_index()

### Notes 

* `sort_index()`: Sorts the DataFrame by the index (row labels), either ascending or descending. This helps in quickly organizing data by the index and is often used after `set_index()`.
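
A minimal sketch of both sort directions on a small made-up DataFrame (the `toy` frame and its values are hypothetical). `sort_index()` returns a new DataFrame unless `inplace=True` is passed.

```python
import pandas as pd

# Hypothetical toy data with an unsorted string index
toy = pd.DataFrame(
    {'salary': [128000, 85000, 125000]},
    index=['Data Scientist', 'Business Analyst', 'Data Engineer'],
)

# Ascending (default): alphabetical order for string labels
ascending = toy.sort_index()
print(ascending.index.tolist())
# ['Business Analyst', 'Data Engineer', 'Data Scientist']

# Descending: reverse alphabetical order
descending = toy.sort_index(ascending=False)
print(descending.index.tolist())
# ['Data Scientist', 'Data Engineer', 'Business Analyst']
```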

### Example

Back to our pivoted DataFrame: let's sort its index alphabetically. The index was already in alphabetical order, so the output below looks unchanged.

In [67]:
median_pivot.sort_index(inplace=True)
median_pivot

job_title_short,salary_year_avg
Business Analyst,85000.0
Cloud Engineer,90000.0
Data Analyst,90000.0
Data Engineer,125000.0
Data Scientist,128000.0
Machine Learning Engineer,106000.0
Senior Data Analyst,111175.0
Senior Data Engineer,147500.0
Senior Data Scientist,155000.0
Software Engineer,99150.0
