<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/2_Advanced/10_Pandas_Applying_Functions.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pandas Applying Functions

Load data.

In [25]:
# Importing Libraries
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt  

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

## Apply

In [48]:
help(df.apply)

Help on method apply in module pandas.core.frame:

apply(func: 'AggFuncType', axis: 'Axis' = 0, raw: 'bool' = False, result_type: "Literal['expand', 'reduce', 'broadcast'] | None" = None, args=(), by_row: "Literal[False, 'compat']" = 'compat', engine: "Literal['python', 'numba']" = 'python', engine_kwargs: 'dict[str, bool] | None' = None, **kwargs) method of pandas.core.frame.DataFrame instance
    Apply a function along an axis of the DataFrame.
    
    Objects passed to the function are Series objects whose index is
    either the DataFrame's index (``axis=0``) or the DataFrame's columns
    (``axis=1``). By default (``result_type=None``), the final return type
    is inferred from the return type of the applied function. Otherwise,
    it depends on the `result_type` argument.
    
    Parameters
    ----------
    func : function
        Function to apply to each column or row.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Axis along which the function is applied:
 

### Notes

* `apply()`: Applies a function along an axis of a DataFrame (each column with `axis=0`, each row with `axis=1`), or to each value of a Series.
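To make the difference between the two axes concrete, here is a minimal sketch on a toy DataFrame. The `toy` frame and its values are made up for illustration and are not part of the job postings dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not from the job postings dataset)
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Series.apply: the function receives one value at a time
doubled = toy['a'].apply(lambda value: value * 2)

# DataFrame.apply with axis=0 (the default): the function receives each column as a Series
column_totals = toy.apply(lambda column: column.sum())

# DataFrame.apply with axis=1: the function receives each row as a Series
row_totals = toy.apply(lambda row: row['a'] + row['b'], axis=1)

print(doubled.tolist())         # [2, 4, 6]
print(column_totals.to_dict())  # {'a': 6, 'b': 60}
print(row_totals.tolist())      # [11, 22, 33]
```

Examples 1 and 3 below use the Series form, while Example 2 uses the row-wise form with `axis=1`.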

### Example 1

Calculate projected salaries next year, using an assumed rate of 3.0% for all roles.

In [34]:
def inflation(salary):
    return salary * 1.03

df['salary_year_inflated'] = df['salary_year_avg'].apply(inflation)

df[pd.notna(df['salary_year_avg'])][['salary_year_avg', 'salary_year_inflated']]

Unnamed: 0,salary_year_avg,salary_year_inflated
18,56700.0,58401.00
45,100000.0,103000.00
56,120000.0,123600.00
76,111175.0,114510.25
79,125000.0,128750.00
...,...,...
787368,125000.0,128750.00
787496,115000.0,118450.00
787497,147500.0,151925.00
787505,145000.0,149350.00


We can actually simplify this with a lambda function.

In [35]:
df['salary_year_inflated'] = df['salary_year_avg'].apply(lambda salary: salary * 1.03)

df[pd.notna(df['salary_year_avg'])][['salary_year_avg', 'salary_year_inflated']]

Unnamed: 0,salary_year_avg,salary_year_inflated
18,56700.0,58401.00
45,100000.0,103000.00
56,120000.0,123600.00
76,111175.0,114510.25
79,125000.0,128750.00
...,...,...
787368,125000.0,128750.00
787496,115000.0,118450.00
787497,147500.0,151925.00
787505,145000.0,149350.00


Technically, `apply()` isn't needed here at all. Multiplying the column directly gives the same result:

In [36]:
df['salary_year_inflated'] = df['salary_year_avg'] * 1.03

df[pd.notna(df['salary_year_avg'])][['salary_year_avg', 'salary_year_inflated']]

Unnamed: 0,salary_year_avg,salary_year_inflated
18,56700.0,58401.00
45,100000.0,103000.00
56,120000.0,123600.00
76,111175.0,114510.25
79,125000.0,128750.00
...,...,...
787368,125000.0,128750.00
787496,115000.0,118450.00
787497,147500.0,151925.00
787505,145000.0,149350.00
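As a quick sanity check, the sketch below compares the `apply()` version with the direct multiplication on a small toy Series. The values are made up, and one missing value is included to mirror the gaps in `salary_year_avg`.

```python
import numpy as np
import pandas as pd

# Toy salaries (hypothetical values) with one missing entry
salaries = pd.Series([56700.0, np.nan, 100000.0])

# Element-by-element with apply()
with_apply = salaries.apply(lambda salary: salary * 1.03)

# Whole-column (vectorized) multiplication
vectorized = salaries * 1.03

# Series.equals treats NaN values in the same position as equal
same = with_apply.equals(vectorized)
print(same)  # True
```

For simple arithmetic like this, the vectorized form is usually the more idiomatic choice in pandas and tends to be faster on large columns, since `apply()` calls the Python function once per element.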


### Example 2

Calculate projected salaries next year, but:
- For senior roles (e.g., Senior Data Analysts), assume the rate is 5%
- For all other roles, assume the rate is 3%

In [42]:
def projected_salary(row):
    # Senior roles get a 5% increase, all other roles get 3%
    if 'Senior' in row['job_title_short']:
        return 1.05 * row['salary_year_avg']
    else:
        return 1.03 * row['salary_year_avg']

df['salary_year_inflated'] = df.apply(projected_salary, axis=1)

df[pd.notna(df['salary_year_avg'])][['job_title_short', 'salary_year_avg', 'salary_year_inflated']]

Unnamed: 0,job_title_short,salary_year_avg,salary_year_inflated
18,Data Scientist,56700.0,58401.00
45,Data Scientist,100000.0,103000.00
56,Data Engineer,120000.0,123600.00
76,Data Analyst,111175.0,114510.25
79,Data Engineer,125000.0,128750.00
...,...,...,...
787368,Data Engineer,125000.0,128750.00
787496,Data Analyst,115000.0,118450.00
787497,Senior Data Engineer,147500.0,154875.00
787505,Data Engineer,145000.0,149350.00


Technically you could write this with a lambda function:

In [43]:
df['salary_year_inflated'] = df.apply(lambda row: 1.05 * row['salary_year_avg'] if 'Senior' in row['job_title_short'] else 1.03 * row['salary_year_avg'], axis=1)

df[pd.notna(df['salary_year_avg'])][['job_title_short', 'salary_year_avg', 'salary_year_inflated']]

Unnamed: 0,job_title_short,salary_year_avg,salary_year_inflated
18,Data Scientist,56700.0,58401.00
45,Data Scientist,100000.0,103000.00
56,Data Engineer,120000.0,123600.00
76,Data Analyst,111175.0,114510.25
79,Data Engineer,125000.0,128750.00
...,...,...,...
787368,Data Engineer,125000.0,128750.00
787496,Data Analyst,115000.0,118450.00
787497,Senior Data Engineer,147500.0,154875.00
787505,Data Engineer,145000.0,149350.00
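The row-wise logic above can also be expressed with whole-column operations. The sketch below is one possible alternative, not the approach used in this notebook: it assumes `numpy.where` for the conditional rate and `Series.str.contains` for the title check, and runs on a made-up toy DataFrame that reuses the column names.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) reusing the column names from the dataset
toy = pd.DataFrame({
    'job_title_short': ['Data Analyst', 'Senior Data Engineer', 'Data Scientist'],
    'salary_year_avg': [100000.0, 200000.0, np.nan],
})

# Boolean mask: True where the title contains the literal text 'Senior'
is_senior = toy['job_title_short'].str.contains('Senior', regex=False)

# Pick 5% for senior roles and 3% for everything else
rate = np.where(is_senior, 1.05, 1.03)

# Missing salaries stay missing after the multiplication
toy['salary_year_inflated'] = toy['salary_year_avg'] * rate

print(toy)
```

This avoids calling a Python function once per row, which can matter on a DataFrame with hundreds of thousands of rows.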


### Example 3

Convert the `job_skills` column from a string that merely looks like a list into an actual list object (*hint*: this is very important for later). Let's try doing that with `ast.literal_eval` and then look at our new column.

A reminder of what our `job_skills` column looks like now:

In [28]:
df['job_skills']

0         ['go', 'python', 'mongodb', 'mongodb', 'css', ...
1         ['sql', 'python', 'sql server', 'oracle', 'azu...
2                                                      None
3                                                      None
4                                 ['sql', 'python', 'jira']
                                ...                        
787681    ['python', 'sql', 'azure', 'aws', 'hadoop', 's...
787682    ['vba', 'sql', 'python', 'excel', 'sap', 'shar...
787683    ['python', 'sql', 'scala', 'java', 'javascript...
787684                   ['excel', 'powerpoint', 'tableau']
787685         ['sql', 'nosql', 'azure', 'spark', 'hadoop']
Name: job_skills, Length: 787686, dtype: object

In [29]:
df['job_skills'][0]

"['go', 'python', 'mongodb', 'mongodb', 'css', 'javascript', 'mysql', 'numpy', 'pandas', 'matplotlib', 'hadoop', 'tableau']"

Inspecting it, this "list" is currently a string!

In [30]:
type(df['job_skills'][0])

str

Let's look at the `literal_eval()` function from the Python Standard Library `ast` module.

In [44]:
import ast

ast.literal_eval(df['job_skills'][0])

['go',
 'python',
 'mongodb',
 'mongodb',
 'css',
 'javascript',
 'mysql',
 'numpy',
 'pandas',
 'matplotlib',
 'hadoop',
 'tableau']

In [45]:
type(ast.literal_eval(df['job_skills'][0]))

list

It's a list!!
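One reason to reach for `ast.literal_eval()` rather than the built-in `eval()` is that it only accepts Python literals (strings, numbers, lists, dicts, and so on) and refuses anything else, such as function calls. A minimal sketch with made-up input strings:

```python
import ast

# A string containing only a list literal parses fine
parsed = ast.literal_eval("['sql', 'python', 'jira']")
print(parsed)  # ['sql', 'python', 'jira']

# A string containing a function call is not a literal, so it is rejected
try:
    ast.literal_eval("len(['sql'])")
    rejected = False
except ValueError:
    rejected = True

print(rejected)  # True
```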

🪲 **Debugging**

**This is an intentional mistake**

This is used to demonstrate debugging.

Error: This will raise an error because the `job_skills` column has missing (`None`/NaN) values.

Steps to Debug:

1. Look at the actual error. Can you tell what the problem is?
2. If not, then look it up:
   1. Use a chatbot like ChatGPT or Claude
   2. Look it up using Google

In [31]:
# Returns an error 

# Convert string representation to actual list
df['job_skills'] = df['job_skills'].apply(ast.literal_eval)

df.head()

ValueError: malformed node or string: None
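The error can be reproduced in isolation, without the DataFrame, by passing `None` straight to `ast.literal_eval()`. This small sketch catches the exception so the message can be inspected:

```python
import ast

message = None
try:
    # literal_eval expects a string (or AST node), so None is rejected
    ast.literal_eval(None)
except ValueError as error:
    message = str(error)

print(message)
```

The message names the offending value (`None`), which points at the missing entries in the column rather than at the well-formed skill strings.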

Since we have missing values, let's adjust our code to check whether each value is missing before converting it.
* If the value is not missing, `pd.notna()` returns `True` and `ast.literal_eval()` is applied to it.
* If the value is missing (`None` or NaN), `pd.notna()` returns `False` and the value is left unchanged.

In [46]:
import ast

# Convert string representation to actual list, checking for NaN values first
df['job_skills'] = df['job_skills'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)
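The same idea can be written as a named function, which is easier to read and test than a long lambda. The sketch below is a variant rather than a copy of the lambda above: the hypothetical helper `parse_skills` checks whether the value is a string instead of checking for missing values, so both missing values and already-parsed lists pass through unchanged. The toy Series is made up for illustration.

```python
import ast
import pandas as pd

def parse_skills(value):
    """Parse a stringified list; return anything that is not a string unchanged."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value

# Toy column (hypothetical values) with one missing entry
toy_skills = pd.Series(["['sql', 'python']", None, "['excel']"], dtype=object)

parsed = toy_skills.apply(parse_skills)
print(parsed[0])        # ['sql', 'python']
print(type(parsed[0]))  # <class 'list'>

# Applying the helper a second time leaves the lists as they are
reparsed = parsed.apply(parse_skills)
```

Checking for `str` makes the conversion safe to re-run, for example when a notebook cell is executed twice.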

Now let's look at our `job_skills` column. Each non-missing value is now an actual list (note that the quotes around the individual skills are gone).

In [47]:
df['job_skills']

0         [go, python, mongodb, mongodb, css, javascript...
1             [sql, python, sql server, oracle, azure, sap]
2                                                      None
3                                                      None
4                                       [sql, python, jira]
                                ...                        
787681    [python, sql, azure, aws, hadoop, spark, pytor...
787682    [vba, sql, python, excel, sap, sharepoint, tab...
787683    [python, sql, scala, java, javascript, sas, sa...
787684                         [excel, powerpoint, tableau]
787685                   [sql, nosql, azure, spark, hadoop]
Name: job_skills, Length: 787686, dtype: object