<img src="./images/banner.png" width="800">

# Applying Functions to Series and DataFrames

**Applying functions** is a fundamental skill in data manipulation with Pandas. It allows you to transform, analyze, and extract insights from your data efficiently. In this lecture, we'll explore various methods to apply functions to both Series and DataFrames in Pandas.


Pandas provides several powerful methods for applying functions:

- `.apply()`: A versatile method for applying functions to Series or DataFrame axes
- `.map()`: Used for transforming Series based on a mapping or function
- `.applymap()`: Applies a function to every element in a DataFrame (deprecated since pandas 2.1 in favor of `DataFrame.map()`)
- Vectorized operations: Efficient element-wise operations on Series and DataFrames


Throughout this lecture, we'll cover:
- How to use built-in functions and create custom functions
- Applying functions to Series and DataFrames
- Advanced techniques like vectorization and aggregation
- Practical examples and best practices for efficient data manipulation


By mastering these techniques, you'll be able to write cleaner, more efficient code for data processing tasks in Pandas.


**Table of contents**<a id='toc0_'></a>    
- [Applying Functions to Series](#toc1_)    
  - [Using Built-in Functions](#toc1_1_)    
  - [Applying Custom Functions with .apply()](#toc1_2_)    
  - [Using .map() for Series Transformation](#toc1_3_)    
- [Applying Functions to DataFrames](#toc2_)    
  - [Using DataFrame-wide Functions](#toc2_1_)    
  - [Applying Functions to Columns with .apply()](#toc2_2_)    
  - [Applying Functions to Rows with .apply(axis=1)](#toc2_3_)    
  - [Using .applymap() for Element-wise Operations](#toc2_4_)    
- [Advanced Function Application](#toc3_)    
  - [Vectorized Operations for Performance](#toc3_1_)    
  - [Using lambda Functions](#toc3_2_)    
  - [Applying Multiple Functions with .agg()](#toc3_3_)    
- [Practical Examples and Use Cases](#toc4_)    
  - [Example 1: Data Cleaning and Transformation](#toc4_1_)    
  - [Example 2: Text Data Processing](#toc4_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Applying Functions to Series](#toc0_)

Series are one-dimensional labeled arrays in Pandas. We can apply various functions to transform or analyze these Series efficiently.


### <a id='toc1_1_'></a>[Using Built-in Functions](#toc0_)


Pandas and NumPy provide many built-in functions that can be directly applied to Series. These functions are optimized for performance and are often vectorized, meaning they operate on the entire Series at once.


In [1]:
import pandas as pd

In [2]:
# Create a sample Series
s = pd.Series([1, 2, 3, 4, 5])


In [3]:
# Using built-in functions
s.sum()


15

In [4]:
s.mean()


3.0

In [5]:
s.max()


5

In [6]:
s.min()


1

In [7]:
s.abs()


0    1
1    2
2    3
3    4
4    5
dtype: int64

You can also use NumPy functions directly on Series:


In [8]:
import numpy as np

In [9]:
np.log(s)

0    0.000000
1    0.693147
2    1.098612
3    1.386294
4    1.609438
dtype: float64

In [10]:
np.exp(s)

0      2.718282
1      7.389056
2     20.085537
3     54.598150
4    148.413159
dtype: float64

In [11]:
np.sin(s)

0    0.841471
1    0.909297
2    0.141120
3   -0.756802
4   -0.958924
dtype: float64

### <a id='toc1_2_'></a>[Applying Custom Functions with .apply()](#toc0_)


The `.apply()` method allows you to apply custom functions to every element in a Series.


In [12]:
# Define a custom function
def square(x):
    return x ** 2

In [13]:
# Apply the custom function
s.apply(square)

0     1
1     4
2     9
3    16
4    25
dtype: int64

In [14]:
# Using a lambda function
s.apply(lambda x: x ** 2)

0     1
1     4
2     9
3    16
4    25
dtype: int64

In [15]:
# More complex custom function
def categorize(x):
    if x < 3:
        return 'Low'
    elif x < 5:
        return 'Medium'
    else:
        return 'High'

s.apply(categorize)

0       Low
1       Low
2    Medium
3    Medium
4      High
dtype: object
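
`.apply()` can also forward extra arguments to the function, which avoids wrapping it in a lambda. The sketch below assumes a hypothetical helper, `scale_and_shift`, that is not part of the lecture; positional extras go in `args` and keyword extras are passed by name.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

def scale_and_shift(x, factor, offset=0):
    """Hypothetical helper: multiply x by factor, then add offset."""
    return x * factor + offset

# Positional extras are supplied via `args`; keyword extras are forwarded as-is.
scaled = s.apply(scale_and_shift, args=(10,), offset=1)
print(scaled.tolist())  # [11, 21, 31, 41, 51]
```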

### <a id='toc1_3_'></a>[Using .map() for Series Transformation](#toc0_)


The `.map()` method is useful for transforming a Series based on a mapping (dictionary) or a function.


In [16]:
# Using a dictionary for mapping
mapping = {1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five'}
s.map(mapping)

0      One
1      Two
2    Three
3     Four
4     Five
dtype: object

In [17]:
# Using a function with map
s.map(lambda x: x * 10)

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [18]:
# Values absent from the mapping (here, 6) become NaN.
# Note: na_action='ignore' only skips values that are already NaN in the Series.
s_with_na = pd.Series([1, 2, 3, 4, 5, 6])
s_with_na.map(mapping, na_action='ignore')

0      One
1      Two
2    Three
3     Four
4     Five
5      NaN
dtype: object
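
To see what `na_action='ignore'` actually does, it helps to use a Series that already contains a missing value. This is a minimal sketch with made-up data (`names` is not from the lecture): the function is never called on the NaN entry, so `str.upper` does not fail on it.

```python
import numpy as np
import pandas as pd

# Illustrative data with one genuinely missing entry
names = pd.Series(['alice', np.nan, 'bob'])

# NaN is propagated to the result without calling str.upper on it
upper = names.map(str.upper, na_action='ignore')
print(upper.tolist())
```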

`.map()` is particularly useful when you want to replace values in a Series based on a predefined mapping.


In [19]:
# Example with categorical data
fruits = pd.Series(['apple', 'banana', 'cherry', 'date', 'elderberry'])
fruit_colors = {'apple': 'red', 'banana': 'yellow', 'cherry': 'red', 'date': 'brown'}

fruits.map(fruit_colors)

0       red
1    yellow
2       red
3     brown
4       NaN
dtype: object
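
When a key is missing from the dictionary, as with `'elderberry'` above, one common way to supply a default is to chain `.fillna()`. The label `'unknown'` is an arbitrary choice for this sketch.

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'cherry', 'date', 'elderberry'])
fruit_colors = {'apple': 'red', 'banana': 'yellow', 'cherry': 'red', 'date': 'brown'}

# Unmapped values come back as NaN; replace them with a placeholder label
colors = fruits.map(fruit_colors).fillna('unknown')
print(colors.tolist())  # ['red', 'yellow', 'red', 'brown', 'unknown']
```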

**Note**: `.map()` is the idiomatic choice for value lookups with a dictionary. For plain element-wise functions, `.map()` and `.apply()` perform similarly, since both call the function once per element; `.apply()` is more flexible because it can forward extra arguments to the function.


These methods provide powerful ways to transform and analyze Series data in Pandas. Choose the appropriate method based on your specific use case and performance requirements.

## <a id='toc2_'></a>[Applying Functions to DataFrames](#toc0_)

DataFrames are two-dimensional labeled data structures in Pandas. We can apply functions to entire DataFrames, specific columns, rows, or individual elements.


### <a id='toc2_1_'></a>[Using DataFrame-wide Functions](#toc0_)


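Many built-in functions in Pandas can be applied directly to DataFrames, operating column by column by default. If the DataFrame also contains non-numeric columns, pass `numeric_only=True` to reductions such as `.mean()` to restrict them to the numeric columns.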


In [20]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# Apply DataFrame-wide functions
df.sum()

A      15
B     150
C    1500
dtype: int64

In [21]:
df.mean()

A      3.0
B     30.0
C    300.0
dtype: float64

In [22]:
df.max()

A      5
B     50
C    500
dtype: int64

In [23]:
df.min()

A      1
B     10
C    100
dtype: int64

In [24]:
df.describe()

Unnamed: 0,A,B,C
count,5.0,5.0,5.0
mean,3.0,30.0,300.0
std,1.581139,15.811388,158.113883
min,1.0,10.0,100.0
25%,2.0,20.0,200.0
50%,3.0,30.0,300.0
75%,4.0,40.0,400.0
max,5.0,50.0,500.0
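
The reductions above run down each column by default. As a small sketch on made-up data (the `mixed` frame and its `label` column are illustrative), `numeric_only=True` skips text columns, and `axis=1` reduces across each row instead.

```python
import pandas as pd

mixed = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [10, 20, 30],
    'label': ['x', 'y', 'z'],  # non-numeric column
})

# Restrict the reduction to numeric columns
col_means = mixed.mean(numeric_only=True)

# Reduce across columns, one result per row
row_sums = mixed[['A', 'B']].sum(axis=1)

print(col_means.to_dict())  # {'A': 2.0, 'B': 20.0}
print(row_sums.tolist())    # [11, 22, 33]
```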


### <a id='toc2_2_'></a>[Applying Functions to Columns with .apply()](#toc0_)


You can use `.apply()` to apply a function to each column of a DataFrame.


In [25]:
# Apply a function to each column
df.apply(np.sum)

A      15
B     150
C    1500
dtype: int64

In [26]:
# Custom function for columns
def column_stats(col):
    return pd.Series({
        'min': col.min(),
        'max': col.max(),
        'mean': col.mean(),
        'median': col.median()
    })

df.apply(column_stats)

Unnamed: 0,A,B,C
min,1.0,10.0,100.0
max,5.0,50.0,500.0
mean,3.0,30.0,300.0
median,3.0,30.0,300.0


### <a id='toc2_3_'></a>[Applying Functions to Rows with .apply(axis=1)](#toc0_)


To apply a function to each row of a DataFrame, use `.apply()` with `axis=1`.


In [27]:
# Function to apply to each row
def row_sum(row):
    return row.sum()

df.apply(row_sum, axis=1)

0    111
1    222
2    333
3    444
4    555
dtype: int64

In [28]:
# More complex row operation
def categorize_row(row):
    total = row.sum()
    if total < 200:
        return 'Low'
    elif total < 400:
        return 'Medium'
    else:
        return 'High'

df.apply(categorize_row, axis=1)

0       Low
1    Medium
2    Medium
3      High
4      High
dtype: object
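
A row function can also return several values at once. The sketch below assumes a hypothetical `row_summary` helper and uses `result_type='expand'` so that each returned list becomes a row of a new DataFrame; the column names are assigned afterwards.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def row_summary(row):
    """Hypothetical helper: return the sum and product of A and B."""
    return [row['A'] + row['B'], row['A'] * row['B']]

# Each returned list is expanded into separate columns (named 0 and 1)
expanded = df.apply(row_summary, axis=1, result_type='expand')
expanded.columns = ['total', 'product']
print(expanded)
```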

### <a id='toc2_4_'></a>[Using .applymap() for Element-wise Operations](#toc0_)


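`.applymap()` applies a function to every element in the DataFrame. Since pandas 2.1 it is deprecated in favor of `DataFrame.map()`, which accepts the same function, so calling `.applymap()` on recent versions emits a `FutureWarning`.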


In [29]:
# Apply a function to every element
df.applymap(lambda x: f"{x:.2f}")



Unnamed: 0,A,B,C
0,1.00,10.00,100.00
1,2.00,20.00,200.00
2,3.00,30.00,300.00
3,4.00,40.00,400.00
4,5.00,50.00,500.00


In [30]:
# Another example: categorizing values
def categorize(x):
    if x < 50:
        return 'Low'
    elif x < 250:
        return 'Medium'
    else:
        return 'High'

df.applymap(categorize)



Unnamed: 0,A,B,C
0,Low,Low,Medium
1,Low,Low,Medium
2,Low,Low,High
3,Low,Low,High
4,Low,Medium,High


**Note**: While `.applymap()` is convenient for element-wise operations, it can be slower than vectorized operations for large DataFrames. When possible, use vectorized operations or apply functions to specific columns for better performance.


In [31]:
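# Vectorized alternative with pd.cut (faster).
# Note: it uses separate bins per column, so the labels differ from categorize() above.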
df_categorized = pd.DataFrame({
    'A': pd.cut(df['A'], bins=[-np.inf, 2, 4, np.inf], labels=['Low', 'Medium', 'High']),
    'B': pd.cut(df['B'], bins=[-np.inf, 20, 40, np.inf], labels=['Low', 'Medium', 'High']),
    'C': pd.cut(df['C'], bins=[-np.inf, 200, 400, np.inf], labels=['Low', 'Medium', 'High'])
})

df_categorized

Unnamed: 0,A,B,C
0,Low,Low,Low
1,Low,Low,Low
2,Medium,Medium,Medium
3,Medium,Medium,Medium
4,High,High,High


These methods provide powerful ways to manipulate and analyze DataFrame data in Pandas. Choose the appropriate method based on your specific use case, considering both functionality and performance.

## <a id='toc3_'></a>[Advanced Function Application](#toc0_)

As you become more proficient with Pandas, you'll encounter situations that require more advanced function application techniques. This section covers vectorized operations, lambda functions, and applying multiple functions simultaneously.


### <a id='toc3_1_'></a>[Vectorized Operations for Performance](#toc0_)


Vectorized operations in Pandas are highly optimized and perform operations on entire arrays at once, rather than element by element. This leads to significant performance improvements, especially for large datasets.


In [32]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': np.random.rand(100000),
    'B': np.random.rand(100000)
})
df

Unnamed: 0,A,B
0,0.139420,0.111679
1,0.320422,0.969468
2,0.614179,0.707831
3,0.651313,0.001723
4,0.447244,0.837443
...,...,...
99995,0.375665,0.212190
99996,0.375856,0.480424
99997,0.574096,0.801636
99998,0.629009,0.675819


In [33]:
# Vectorized operation (fast)
%timeit df['C'] = df['A'] + df['B']

The slowest run took 4.57 times longer than the fastest. This could mean that an intermediate result is being cached.
304 µs ± 189 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [34]:
# Non-vectorized operation using apply (slow)
%timeit df['D'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

624 ms ± 363 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [35]:
# Compare results
df.head()

Unnamed: 0,A,B,C,D
0,0.13942,0.111679,0.251099,0.251099
1,0.320422,0.969468,1.28989,1.28989
2,0.614179,0.707831,1.322009,1.322009
3,0.651313,0.001723,0.653036,0.653036
4,0.447244,0.837443,1.284687,1.284687


Whenever possible, use vectorized operations for better performance:


In [36]:
# More vectorized operations
df['E'] = np.sqrt(df['A'])
df['F'] = np.where(df['B'] > 0.5, 'High', 'Low')

df.head()

Unnamed: 0,A,B,C,D,E,F
0,0.13942,0.111679,0.251099,0.251099,0.37339,Low
1,0.320422,0.969468,1.28989,1.28989,0.566058,High
2,0.614179,0.707831,1.322009,1.322009,0.783695,High
3,0.651313,0.001723,0.653036,0.653036,0.807039,Low
4,0.447244,0.837443,1.284687,1.284687,0.668763,High


### <a id='toc3_2_'></a>[Using lambda Functions](#toc0_)


Lambda functions are anonymous, inline functions that are useful for simple operations. They're often used with `.apply()`, `.map()`, and other Pandas methods.


In [37]:
# Using lambda with Series
s = pd.Series([1, 2, 3, 4, 5])
s.apply(lambda x: x**2 if x % 2 == 0 else x**3)

0      1
1      4
2     27
3     16
4    125
dtype: int64

In [39]:
# Using lambda with DataFrame columns
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
})
df

Unnamed: 0,A,B
0,1,10
1,2,20
2,3,30
3,4,40
4,5,50


In [42]:
df['C'] = df['A'].apply(lambda x: 'Even' if x % 2 == 0 else 'Odd')
df['D'] = df.apply(lambda row: row['A'] * row['B'], axis=1)
df

Unnamed: 0,A,B,C,D
0,1,10,Odd,10
1,2,20,Even,40
2,3,30,Odd,90
3,4,40,Even,160
4,5,50,Odd,250


While lambda functions are convenient, they can make code less readable for complex operations. In such cases, it's often better to define a named function.
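
As a small illustration of that trade-off, the conditional lambda from the Series example can be written as a named function with a docstring. The name `square_even_cube_odd` is made up for this sketch.

```python
import pandas as pd

def square_even_cube_odd(x):
    """Square even numbers and cube odd numbers."""
    if x % 2 == 0:
        return x ** 2
    return x ** 3

s = pd.Series([1, 2, 3, 4, 5])

# Same result as the inline lambda, but the intent is spelled out
transformed = s.apply(square_even_cube_odd)
print(transformed.tolist())  # [1, 4, 27, 16, 125]
```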


### <a id='toc3_3_'></a>[Applying Multiple Functions with .agg()](#toc0_)


The `.agg()` method allows you to apply multiple functions to one or more columns simultaneously.


In [43]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})
df

Unnamed: 0,A,B,C
0,1,10,100
1,2,20,200
2,3,30,300
3,4,40,400
4,5,50,500


In [44]:
# Apply multiple functions to all columns
df.agg(['sum', 'mean', 'max', 'min'])

Unnamed: 0,A,B,C
sum,15.0,150.0,1500.0
mean,3.0,30.0,300.0
max,5.0,50.0,500.0
min,1.0,10.0,100.0


In [45]:
# Apply different functions to different columns
df.agg({
    'A': ['sum', 'mean'],
    'B': ['min', 'max'],
    'C': 'std'
})

Unnamed: 0,A,B,C
sum,15.0,,
mean,3.0,,
min,,10.0,
max,,50.0,
std,,,158.113883


In [48]:
# Using custom functions with .agg()
def range_calc(x):
    return x.max() - x.min()

df.agg(['mean', range_calc])

Unnamed: 0,A,B,C
mean,3.0,30.0,300.0
range_calc,4.0,40.0,400.0


The `.agg()` method is particularly useful when you need to compute multiple summary statistics for your data efficiently.
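
When each column gets exactly one function, `.agg()` with a dictionary returns a Series rather than a DataFrame, which avoids the NaN-padded layout seen with mixed lists. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
})

# One function per column: the result is a Series indexed by column name
summary = df.agg({'A': 'sum', 'B': 'mean'})
print(summary)
```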


These advanced techniques allow for more flexible and efficient data manipulation in Pandas. By combining vectorized operations, lambda functions, and aggregation methods, you can perform complex data transformations and analyses with concise and performant code.

## <a id='toc4_'></a>[Practical Examples and Use Cases](#toc0_)

Let's explore some practical examples and use cases for applying functions to Series and DataFrames. These examples will demonstrate how to use the techniques we've learned in real-world scenarios.


### <a id='toc4_1_'></a>[Example 1: Data Cleaning and Transformation](#toc0_)


Suppose we have a dataset of customer information with some inconsistencies:


In [69]:
# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Brown'],
    'Age': [28, 35, 42, 31],
    'Income': ['$45,000', '$60,000', '$75,000', '$55,000'],
    'City': ['NEW YORK', 'los angeles', 'Chicago', 'HOUston']
})

df

Unnamed: 0,Name,Age,Income,City
0,John Smith,28,"$45,000",NEW YORK
1,Jane Doe,35,"$60,000",los angeles
2,Bob Johnson,42,"$75,000",Chicago
3,Alice Brown,31,"$55,000",HOUston


Let's clean and transform this data:


In [70]:
# Clean up names (capitalize first letter of each word)
df['Name'] = df['Name'].apply(lambda x: x.title())
df

Unnamed: 0,Name,Age,Income,City
0,John Smith,28,"$45,000",NEW YORK
1,Jane Doe,35,"$60,000",los angeles
2,Bob Johnson,42,"$75,000",Chicago
3,Alice Brown,31,"$55,000",HOUston


In [71]:
# Convert income to numeric, removing '$' and ','
df['Income'] = df['Income'].apply(lambda x: float(x.replace('$', '').replace(',', '')))
df

Unnamed: 0,Name,Age,Income,City
0,John Smith,28,45000.0,NEW YORK
1,Jane Doe,35,60000.0,los angeles
2,Bob Johnson,42,75000.0,Chicago
3,Alice Brown,31,55000.0,HOUston
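
The same cleaning steps can be expressed with the vectorized `.str` accessor instead of `.apply()` with a lambda. The sketch below uses a small made-up frame (`raw`) so it runs on its own.

```python
import pandas as pd

raw = pd.DataFrame({
    'Name': ['john smith', 'JANE DOE'],
    'Income': ['$45,000', '$60,000'],
})

# Title-case names with the string accessor
clean_names = raw['Name'].str.title()

# Strip '$' and ',' as literal characters, then convert to float
clean_income = (
    raw['Income']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

print(clean_names.tolist())   # ['John Smith', 'Jane Doe']
print(clean_income.tolist())  # [45000.0, 60000.0]
```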


In [72]:
# Standardize city names (title-case each word, so 'NEW YORK' becomes 'New York')
df['City'] = df['City'].apply(lambda x: x.title())
df

Unnamed: 0,Name,Age,Income,City
0,John Smith,28,45000.0,New York
1,Jane Doe,35,60000.0,Los Angeles
2,Bob Johnson,42,75000.0,Chicago
3,Alice Brown,31,55000.0,Houston


In [73]:
# Add a new column for income category
df['Income_Category'] = df['Income'].apply(lambda x: 'High' if x > 60000 else 'Medium' if x > 40000 else 'Low')
df

Unnamed: 0,Name,Age,Income,City,Income_Category
0,John Smith,28,45000.0,New York,Medium
1,Jane Doe,35,60000.0,Los Angeles,Medium
2,Bob Johnson,42,75000.0,Chicago,High
3,Alice Brown,31,55000.0,Houston,Medium


### <a id='toc4_2_'></a>[Example 2: Text Data Processing](#toc0_)


Let's process a dataset containing customer reviews:


In [62]:
# Create a sample DataFrame with customer reviews
df = pd.DataFrame({
    'Review': [
        "Great product, highly recommended!",
        "Disappointing quality, wouldn't buy again.",
        "Average product, nothing special.",
        "Excellent service and fast delivery!",
        "Terrible customer support, avoid this company."
    ]
})
df

Unnamed: 0,Review
0,"Great product, highly recommended!"
1,"Disappointing quality, wouldn't buy again."
2,"Average product, nothing special."
3,Excellent service and fast delivery!
4,"Terrible customer support, avoid this company."


In [63]:
# Function to calculate review length
def review_length(text):
    return len(text.split())

In [64]:
# Function to detect sentiment (very simple approach)
def detect_sentiment(text):
    positive_words = ['great', 'excellent', 'good', 'best', 'amazing']
    negative_words = ['bad', 'terrible', 'worst', 'disappointing', 'avoid']

    text_lower = text.lower()
    if any(word in text_lower for word in positive_words):
        return 'Positive'
    elif any(word in text_lower for word in negative_words):
        return 'Negative'
    else:
        return 'Neutral'

In [65]:
# Apply functions to the DataFrame
df['Review_Length'] = df['Review'].apply(review_length)
df['Sentiment'] = df['Review'].apply(detect_sentiment)

df

Unnamed: 0,Review,Review_Length,Sentiment
0,"Great product, highly recommended!",4,Positive
1,"Disappointing quality, wouldn't buy again.",5,Negative
2,"Average product, nothing special.",4,Neutral
3,Excellent service and fast delivery!,5,Positive
4,"Terrible customer support, avoid this company.",6,Negative


These examples demonstrate how applying functions to Series and DataFrames supports real-world tasks such as data cleaning, type conversion, and simple text processing. The flexibility and power of Pandas make it an excellent tool for handling diverse data manipulation tasks.