In [1]:
import pandas as pd
import numpy as np

# <a id='toc1_'></a>[Data Transformations](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Transformations](#toc1_)    
  - [Data Normalization](#toc1_1_)    
    - [Min-Max Scaling](#toc1_1_1_)    
    - [Z-Score Normalization](#toc1_1_2_)    
    - [Using Min-Max or Z-Score normalization in pandas](#toc1_1_3_)    
  - [Standardize datetime formats](#toc1_2_)    
    - [Standardize datetime formats using Python datetime module](#toc1_2_1_)    
    - [Standardize datetime formats using pandas pd.to_datetime()](#toc1_2_2_)    
  - [Standardize column names](#toc1_3_)    
  - [Data Aggregation](#toc1_4_)    
    - [Merge on columns](#toc1_4_1_)    
      - [Inner join merge:](#toc1_4_1_1_)    
      - [Left Join merge](#toc1_4_1_2_)    
      - [Outer Join merge](#toc1_4_1_3_)    
    - [Merge on indexes](#toc1_4_2_)    
      - [Using pd.merge()](#toc1_4_2_1_)    
      - [Using df.join()](#toc1_4_2_2_)    
  - [Grouping data (Split-Apply-Combine)](#toc1_5_)    
    - [Grouping by a single Column](#toc1_5_1_)    
    - [Grouping by list of columns](#toc1_5_2_)    
    - [Group-specific aggregation.](#toc1_5_3_)    
    - [Group-specific transformations.](#toc1_5_4_)    
    - [Group-specific filtrations](#toc1_5_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Data Normalization](#toc0_)

Data normalization is a process used in data processing and analysis to adjust values measured on different scales to a common scale, often without distorting differences in the ranges of values. 

<img src="./images/Min_max_normalization.png" alt="Min_max_normalization" style="height:200px">

This technique is essential for comparative analysis, especially when dealing with variables that span several magnitudes.

Data normalization techniques commonly used in data analysis are Min-Max Scaling and Z-score normalization.

### <a id='toc1_1_1_'></a>[Min-Max Scaling](#toc0_)

This technique rescales each feature to a fixed range, typically [0, 1] or [-1, 1].

It's performed by subtracting the minimum value of the feature from each value and then dividing by the range of the feature (maximum minus minimum).

To normalize values to the range [0, 1], apply the following formula to each value:

<img src="./images/min_max_formula.webp" alt="min_max_formula" style="height:200px">
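
As a minimal sketch of the formula above, the same calculation can be written directly in pandas. The `sales` Series and the variable names are illustrative assumptions, and the rescaling to [-1, 1] is shown as a simple linear follow-up to the [0, 1] result.

```python
import pandas as pd

# Illustrative data (assumed for this sketch)
sales = pd.Series([2000, 3500, 1500, 4000, 3000], name="Sales")

# Min-max scaling to [0, 1]: (x - min) / (max - min)
value_range = sales.max() - sales.min()
scaled_01 = (sales - sales.min()) / value_range

# Linear rescaling of the [0, 1] result to [-1, 1]
scaled_pm1 = 2 * scaled_01 - 1

print(pd.DataFrame({"Sales": sales, "scaled_01": scaled_01, "scaled_pm1": scaled_pm1}))
```

Note that a constant column has a range of zero, so this sketch would produce NaN values in that case.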

### <a id='toc1_1_2_'></a>[Z-Score Normalization](#toc0_)

The Z-score, also known as a standard score, is a statistical measurement that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean.

Z-score normalization scales the values to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean of the feature from each value, and then dividing by the standard deviation.

<img src="./images/Z-score_formula.png" alt="Z-score_formula" style="height:200px">
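
The Z-score formula can also be sketched directly in pandas. The data and names below are illustrative assumptions. One detail worth noting: `Series.std()` defaults to the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation, so `ddof=0` is passed here to get the same numbers as the scaler.

```python
import pandas as pd

# Illustrative data (assumed for this sketch)
sales = pd.Series([2000, 3500, 1500, 4000, 3000], name="Sales")

# Z-score: (x - mean) / std, using the population std (ddof=0)
z_scores = (sales - sales.mean()) / sales.std(ddof=0)

print(z_scores.round(6).tolist())
print("mean:", round(z_scores.mean(), 6), "std:", round(z_scores.std(ddof=0), 6))
```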

### <a id='toc1_1_3_'></a>[Using Min-Max or Z-Score normalization in pandas](#toc0_)

Pandas doesn't have a built-in method specifically for Min-Max or Z-Score normalization. You can perform the calculations manually or use one of the scalers provided by the [sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html#) module of the [scikit-learn](https://scikit-learn.org/stable/index.html) package, which are often used together with pandas DataFrames for normalization.

To install scikit-learn, use pip in an activated virtual environment:

```
pip install scikit-learn
```

In the next example we normalize sales values using the MinMaxScaler and StandardScaler (for Z-Score) classes:

In [7]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Example data
df = pd.DataFrame({
    'Sales': [2000, 3500, 1500, 4000, 3000]
})

# Initialize a MinMaxScaler
scaler = MinMaxScaler()
# Fit and transform the data
df['Min-Max_Normalized_Sales'] = scaler.fit_transform(df[['Sales']])

# Initialize a StandardScaler
scaler = StandardScaler()
print(scaler)
# Fit and transform the data
df['Standard_Normalized_Sales'] = scaler.fit_transform(df[['Sales']])

df


StandardScaler()


Unnamed: 0,Sales,Min-Max_Normalized_Sales,Standard_Normalized_Sales
0,2000,0.2,-0.862662
1,3500,0.8,0.754829
2,1500,0.0,-1.401826
3,4000,1.0,1.293993
4,3000,0.6,0.215666


In [11]:
# Create sample data
data = {
    'Feature': [10, 20, 30, 40.22, 50]
}

# Create DataFrame
df = pd.DataFrame(data)

scaler = MinMaxScaler()
df.Feature = scaler.fit_transform(df[['Feature']])

print(df)

   Feature
0   0.0000
1   0.2500
2   0.5000
3   0.7555
4   1.0000


## <a id='toc1_2_'></a>[Standardize datetime formats](#toc0_)

Standardizing datetime formats is pivotal for merging, analyzing, and visualizing data from diverse sources.

Consider the following data, which mixes several date formats:

In [3]:
# Create a DataFrame
data = {
    'Date': ['12/05/2024', '05/15/2024', '15/05/2024', '23/06/2024', 'September 24, 2013', '2012.10.01']
}
df = pd.DataFrame(data)

### <a id='toc1_2_1_'></a>[Standardize datetime formats using Python datetime module](#toc0_)

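This approach involves trying multiple date format strings until one successfully parses the date. If none of the formats work, the function can return None or raise a ValueError, depending on how you wish to handle errors. Because the formats are tried in order, an ambiguous string such as '12/05/2024' is interpreted by the first format that matches it.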

In [4]:
from datetime import datetime

def standardize_date(date_string, input_formats, output_format='%Y-%m-%d'):
    """
        Attempts to convert a date string from various formats to a standardized format.

        Args:
            date_string (str): The date string to standardize.
            input_formats (list): A list of strings representing date formats to attempt.
            output_format (str): The target format for the standardized date string. Defaults to '%Y-%m-%d'.

        Returns:
            str or None: The standardized date string, or None if the date_string cannot be parsed.
    """
    for input_format in input_formats:
        try:
            parsed_date = datetime.strptime(date_string, input_format)
            return parsed_date.strftime(output_format)
        except ValueError:
            continue  # Try the next format

    # All formats failed; consider logging this or raising an error
    print(f"Warning: Unable to parse date string '{date_string}' with given formats.")
    return None

# Define date formats to attempt parsing with
input_formats = ["%m/%d/%Y", "%d/%m/%Y", "%Y.%m.%d", "%B %d, %Y"]

df['Standard Date 1'] = df['Date'].apply(lambda date: standardize_date(date, input_formats))
df

Unnamed: 0,Date,Standard Date 1
0,12/05/2024,2024-12-05
1,05/15/2024,2024-05-15
2,15/05/2024,2024-05-15
3,23/06/2024,2024-06-23
4,"September 24, 2013",2013-09-24
5,2012.10.01,2012-10-01
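
The effect of the format order on ambiguous dates can be shown with a short sketch that uses only the standard library. The helper `parse_first_match` is a hypothetical name introduced for illustration; it follows the same try-each-format idea as `standardize_date` above but returns the parsed `datetime` object.

```python
from datetime import datetime

def parse_first_match(date_string, input_formats):
    """Return the datetime from the first format that parses, else None."""
    for fmt in input_formats:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue  # Try the next format
    return None

ambiguous = "12/05/2024"

# Month-first format listed first: read as 5 December 2024
month_first = parse_first_match(ambiguous, ["%m/%d/%Y", "%d/%m/%Y"])
# Day-first format listed first: read as 12 May 2024
day_first = parse_first_match(ambiguous, ["%d/%m/%Y", "%m/%d/%Y"])

print(month_first.date(), day_first.date())
```

If the source of each date is known, listing only the formats that source actually uses avoids this ambiguity.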


### <a id='toc1_2_2_'></a>Standardize datetime formats using pandas [pd.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) [&#8593;](#toc0_)

Using pandas.to_datetime() is an effective way to standardize datetime formats in your DataFrame or Series. This function is versatile, allowing you to convert a variety of string formats into a standardized datetime format.

As of pandas 2.0 ([What’s new in 2.0.0 (April 3, 2023)](https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#enhancements)), we can pass `format='mixed'` to pd.to_datetime() to infer the format of each element individually. This is risky because an ambiguous string such as '12/05/2024' can be read as either day-first or month-first, so you should combine it with `dayfirst` to state which reading you intend.

In [5]:
# Convert the 'Date' column to datetime
# Since we have mixed date formats, we'll handle ambiguous cases by specifying dayfirst=True
df['Standard Date 2'] = pd.to_datetime(df['Date'], errors='coerce', format='mixed', dayfirst=True)
df

Unnamed: 0,Date,Standard Date 1,Standard Date 2
0,12/05/2024,2024-12-05,2024-05-12
1,05/15/2024,2024-05-15,2024-05-15
2,15/05/2024,2024-05-15,2024-05-15
3,23/06/2024,2024-06-23,2024-06-23
4,"September 24, 2013",2013-09-24,2013-09-24
5,2012.10.01,2012-10-01,2012-10-01
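
A small sketch of how `dayfirst` and `errors='coerce'` behave in isolation may help here. The input strings are illustrative assumptions; only the documented `dayfirst` and `errors` parameters of `pd.to_datetime()` are used.

```python
import pandas as pd

ambiguous = "12/05/2024"

# dayfirst=True reads the string as day/month/year
as_day_first = pd.to_datetime(ambiguous, dayfirst=True)
# dayfirst=False (the default) reads it as month/day/year
as_month_first = pd.to_datetime(ambiguous, dayfirst=False)
print(as_day_first.date(), as_month_first.date())

# errors='coerce' turns unparseable strings into NaT instead of raising
coerced = pd.to_datetime(pd.Series(["2024-01-31", "not a date"]), errors="coerce")
print(coerced)
```

Checking `coerced.isna()` after the conversion is a simple way to find the rows that could not be parsed.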


## <a id='toc1_3_'></a>[Standardize column names](#toc0_)

Standardizing column names is a crucial step in data preprocessing, especially when working with data from multiple sources or when aiming for consistency in data analysis and modeling.

In [26]:
import pandas as pd

# Sample DataFrame
data = {'Employee Name': [ 'Jane Doe', 'John Smith'],
        'EMPLOYEE ID': [12345, 67890],
        'Department (Dept.)': ['HR', 'IT'],
        '2023 Salary ($)': [50000, 60000]}
df = pd.DataFrame(data)

# Function to standardize column names
def standardize_col_names(df):
    df.columns = df.columns.str.lower()  # Convert to lowercase
    df.columns = df.columns.str.replace(' ', '_', regex=False)  # Replace spaces with underscores
    # regex=False treats '(', ')' and '$' as literal characters, not regex syntax
    df.columns = df.columns.str.replace('(', '', regex=False).str.replace(')', '', regex=False)  # Remove parentheses
    df.columns = df.columns.str.replace('$', 'usd', regex=False)  # Replace $ with 'usd'
    return df

# Apply the function
df = standardize_col_names(df)
df



Unnamed: 0,employee_name,employee_id,department_dept.,2023_salary_usd
0,Jane Doe,12345,HR,50000
1,John Smith,67890,IT,60000
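
As an alternative sketch (not the approach used above), the same cleanup can be done with one regular expression and `DataFrame.rename`, which returns a new DataFrame and leaves the original untouched. The helper name `to_snake_case` is hypothetical, and unlike the function above it also drops the trailing dot in `Department (Dept.)`.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Lowercase, map '$' to 'usd', collapse other symbols to single underscores."""
    name = name.lower().replace("$", "usd")
    name = re.sub(r"[^a-z0-9]+", "_", name)  # any run of non-alphanumerics -> '_'
    return name.strip("_")                   # no leading/trailing underscores

raw_df = pd.DataFrame({
    "Employee Name": ["Jane Doe", "John Smith"],
    "EMPLOYEE ID": [12345, 67890],
    "Department (Dept.)": ["HR", "IT"],
    "2023 Salary ($)": [50000, 60000],
})

# rename() with a callable applies it to every column label
clean_df = raw_df.rename(columns=to_snake_case)
print(clean_df.columns.tolist())
```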


## <a id='toc1_4_'></a>[Data Aggregation](#toc0_)

Data aggregation is the process of summarizing data from multiple sources to provide actionable insights. 

When using pandas, we first combine the data into a single DataFrame. For that step we can use the pandas merge() or join() functions.

Second, we apply the Split-Apply-Combine strategy using the pandas groupby mechanism and its aggregation functions.

### <a id='toc1_4_1_'></a>[Merge on columns](#toc0_)

The [pandas.DataFrame.merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function is a powerful tool for combining data on common columns or indices, similar to SQL joins.

It enables you to horizontally concatenate DataFrames or Series, based on one or more keys, aligning data in the process. 

This function is particularly useful for integrating related datasets to perform comprehensive analyses.

<img src="./images/df_merge_flow.png" style="height:240px">

The basic syntax of pd.merge() is:

```
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)
```

where:

- **left**: The DataFrame on the left side of the merge.
- **right**: The DataFrame on the right side of the merge.
- **how**: Specifies which keys are included in the resulting table. It can be:
  
    - 'left': similar to a SQL left outer join
    - 'right': similar to a SQL right outer join
    - 'outer': similar to a SQL full outer join
    - 'inner' (default): similar to a SQL inner join
    - 'cross': creates the Cartesian product of both frames
    
- **on**: Column or index level names to join on. Must be found in both the left and right DataFrames.
- **left_on**: Columns from the left DataFrame to use as keys.
- **right_on**: Columns from the right DataFrame to use as keys.
- **left_index/right_index**: If True, use the index from the left/right DataFrame as the join key.
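
To illustrate `left_on` and `right_on`, here is a minimal sketch in which the key column has a different name in each DataFrame. The frames `staff` and `pay` and the column name `emp_id` are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical frames whose key columns are named differently
staff = pd.DataFrame({"EmployeeID": [1, 2, 3], "Name": ["Ivan", "Maria", "Georgi"]})
pay = pd.DataFrame({"emp_id": [1, 3], "Salary": [3000, 2900]})

# Match staff.EmployeeID against pay.emp_id; both key columns are kept
merged = pd.merge(staff, pay, left_on="EmployeeID", right_on="emp_id", how="inner")
print(merged)
```

Because both key columns survive the merge, one of them can be dropped afterwards, for example with `merged.drop(columns="emp_id")`.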

The basic join types visualized:

<img src="./images/df_merge_types_of_joins.png" style="height:240px">

In the next examples we'll use 2 DataFrames:

- employees_df: Contains employee IDs and names.
- salaries_df: Contains employee IDs and their salaries.

We'll merge these DataFrames to have a complete overview of employees and their salaries.

In [6]:
# Create employees_df
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'EmployeeName': ['Ivan Ivanov', 'Maria Popova', 'Georgi Dimitrov', 'Sofia Petrova', 'Nikolay Banev']
}
employees_df = pd.DataFrame(employees_data)

employees_df

Unnamed: 0,EmployeeID,EmployeeName
0,1,Ivan Ivanov
1,2,Maria Popova
2,3,Georgi Dimitrov
3,4,Sofia Petrova
4,5,Nikolay Banev


In [7]:
# Create salaries_df.
# Note missing data for EmployeeID 2 and 5, and data for EmployeeID 6 which is not in employees_df
salaries_data = {
    'EmployeeID': [1, 3, 4, 6],
    'Salary': [3000, 2900, 3200, 10000]
}
salaries_df = pd.DataFrame(salaries_data)

salaries_df

Unnamed: 0,EmployeeID,Salary
0,1,3000
1,3,2900
2,4,3200
3,6,10000


#### <a id='toc1_4_1_1_'></a>[Inner join merge:](#toc0_)

We will merge employee and salary information on the 'EmployeeID' column.

The how='inner' parameter means the merge will include only rows with matching EmployeeID values in both DataFrames, ensuring that we get a DataFrame where every row includes both employee details and salary information.

In [8]:
merged_df = pd.merge(employees_df, salaries_df, on='EmployeeID', how='inner')
merged_df

Unnamed: 0,EmployeeID,EmployeeName,Salary
0,1,Ivan Ivanov,3000
1,3,Georgi Dimitrov,2900
2,4,Sofia Petrova,3200


#### <a id='toc1_4_1_2_'></a>[Left Join merge](#toc0_)

We can perform a left join to merge employees_df with salaries_df, ensuring all employees are included regardless of whether their salary data is present. A right join works the same way, but keeps all rows of the right DataFrame instead.

In [9]:
merged_df = pd.merge(employees_df, salaries_df, on='EmployeeID', how='left')
merged_df

Unnamed: 0,EmployeeID,EmployeeName,Salary
0,1,Ivan Ivanov,3000.0
1,2,Maria Popova,
2,3,Georgi Dimitrov,2900.0
3,4,Sofia Petrova,3200.0
4,5,Nikolay Banev,
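
For comparison, a right join on the same data might look like the sketch below. The two DataFrames are re-created so that the block runs on its own; `right_merged` is just an illustrative name.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "EmployeeName": ["Ivan Ivanov", "Maria Popova", "Georgi Dimitrov",
                     "Sofia Petrova", "Nikolay Banev"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 3, 4, 6],
    "Salary": [3000, 2900, 3200, 10000],
})

# how='right' keeps every row of salaries_df, matched or not
right_merged = pd.merge(employees_df, salaries_df, on="EmployeeID", how="right")
print(right_merged)
```

Here EmployeeID 6 is kept with a missing EmployeeName, while employees 2 and 5 are dropped because they have no salary row.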


#### <a id='toc1_4_1_3_'></a>[Outer Join merge](#toc0_)

An outer join combines all records from both DataFrames and fills the values that are missing on either side with NaN.

In [10]:
merged_df = pd.merge(employees_df, salaries_df, on='EmployeeID', how='outer')
merged_df

Unnamed: 0,EmployeeID,EmployeeName,Salary
0,1,Ivan Ivanov,3000.0
1,2,Maria Popova,
2,3,Georgi Dimitrov,2900.0
3,4,Sofia Petrova,3200.0
4,5,Nikolay Banev,
5,6,,10000.0
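
When inspecting an outer join, it can help to see which side each row came from. The sketch below uses the `indicator=True` option of `pd.merge()`, which adds a `_merge` column; the DataFrames are re-created so the block is self-contained.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "EmployeeName": ["Ivan Ivanov", "Maria Popova", "Georgi Dimitrov",
                     "Sofia Petrova", "Nikolay Banev"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 3, 4, 6],
    "Salary": [3000, 2900, 3200, 10000],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
audit_df = pd.merge(employees_df, salaries_df, on="EmployeeID",
                    how="outer", indicator=True)
origin_counts = audit_df["_merge"].value_counts()
print(audit_df)
print(origin_counts)
```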


### <a id='toc1_4_2_'></a>[Merge on indexes](#toc0_)

Merging on indexes is particularly useful in scenarios involving related datasets that are naturally indexed by a common identifier, such as orders and payments, where each order has a corresponding payment record. Merging on indexes is a common task in financial data analysis, customer relationship management (CRM), and any domain where linking records based on a shared key is necessary for analysis or reporting. 

Consider the following data:

- orders_df: Contains information about customer orders, indexed by OrderID.
- payments_df: Contains payment amounts for orders, indexed by OrderID.

In [11]:
orders_data = {
    'CustomerName': ['Ana Petrova', 'Dimitar Ivanov', 'Boris Popov', 'Elena Georgieva'],
    'Product': ['Book', 'Laptop', 'Pen', 'Notebook']
}
orders_df = pd.DataFrame(orders_data, index=[1001, 1002, 1003, 1004])
orders_df

Unnamed: 0,CustomerName,Product
1001,Ana Petrova,Book
1002,Dimitar Ivanov,Laptop
1003,Boris Popov,Pen
1004,Elena Georgieva,Notebook


In [12]:
payments_data = {
    'Amount': [15.99, 1200.00, 3.49, 7.99]
}
payments_df = pd.DataFrame(payments_data, index=[1001, 1002, 1003, 1004])
payments_df

Unnamed: 0,Amount
1001,15.99
1002,1200.0
1003,3.49
1004,7.99


#### <a id='toc1_4_2_1_'></a>[Using pd.merge()](#toc0_)

To merge orders_df and payments_df using their indexes, we can again use pd.merge(), but we must set `left_index=True` and `right_index=True`. 

These parameters indicate that the merge should be performed using the indexes of both orders_df and payments_df. Since both DataFrames are indexed by OrderID, this effectively merges the order details with the corresponding payment information for each order.

In [13]:
merged_df = pd.merge(orders_df, payments_df, left_index=True, right_index=True)
merged_df

Unnamed: 0,CustomerName,Product,Amount
1001,Ana Petrova,Book,15.99
1002,Dimitar Ivanov,Laptop,1200.0
1003,Boris Popov,Pen,3.49
1004,Elena Georgieva,Notebook,7.99


#### <a id='toc1_4_2_2_'></a>[Using df.join()](#toc0_)

[df.join()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) is specifically designed for joining on indexes, or for joining on a key column in the left DataFrame and the index of the right DataFrame. It's a convenience method suited to more straightforward join operations.

In [14]:
merged_df = orders_df.join(payments_df)
merged_df

Unnamed: 0,CustomerName,Product,Amount
1001,Ana Petrova,Book,15.99
1002,Dimitar Ivanov,Laptop,1200.0
1003,Boris Popov,Pen,3.49
1004,Elena Georgieva,Notebook,7.99
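
The second use of `df.join()`, a key column on the left matched against the index on the right, can be sketched as follows. The frames `orders` and `payments` are illustrative assumptions, with OrderID stored as a regular column in `orders`.

```python
import pandas as pd

# Left frame: OrderID is a regular column
orders = pd.DataFrame({
    "OrderID": [1001, 1002, 1003],
    "Product": ["Book", "Laptop", "Pen"],
})
# Right frame: indexed by the order IDs; order 1003 has no payment yet
payments = pd.DataFrame({"Amount": [15.99, 1200.00]}, index=[1001, 1002])

# on='OrderID' matches the left column against the right index (left join by default)
joined = orders.join(payments, on="OrderID")
print(joined)
```

Since `join()` performs a left join by default, the order without a payment is kept with a missing Amount.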


## <a id='toc1_5_'></a>[Grouping data (Split-Apply-Combine)](#toc0_)

The "Split-Apply-Combine" strategy offered by pandas through its groupby mechanism is a powerful approach for aggregating, summarizing, and transforming data. 

This strategy is especially useful when dealing with complex datasets that require integrated analysis across different segments or categories. 

The concept, originally described by Hadley Wickham, breaks down the data processing task into three distinct steps: 
- split the data into groups based on some criteria, 
- apply a function to each group independently, 
- combine the results back into a data structure.

<img src="./images/split_apply_combine.png" style="height:300px">

**Split**:

The first step involves splitting the DataFrame into groups based on some key(s). The grouping can be done based on one or more columns, and pandas will essentially create a mapping of keys to groups of data based on the unique values in the specified columns.

**Apply**:

Once the data is split into groups, you can apply a function to each group independently. This function could be:
- an aggregation (e.g., sum, mean),
- a transformation (e.g., standardizing data within a group),
- a filter operation (e.g., removing data that doesn't meet certain criteria).

This step is highly flexible and allows for complex data manipulation operations within each group.

**Combine**:

After processing each group, pandas combines the results into a new DataFrame or Series, depending on the operation performed. The combine step is handled automatically by pandas, ensuring the structure of the output is consistent and aligned with the input data.
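
To make the three steps concrete, the sketch below performs them by hand and compares the result with the one-line groupby call. The sample data and variable names are assumptions for illustration; in practice pandas does all three steps internally.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": ["Store A", "Store A", "Store B", "Store B", "Store A"],
    "Sales": [1000, 500, 1500, 800, 1200],
})

# Split: one sub-DataFrame per unique 'Store' value
pieces = {store: group for store, group in df.groupby("Store")}

# Apply: compute the total sales of each piece independently
totals = {store: group["Sales"].sum() for store, group in pieces.items()}

# Combine: assemble the per-group results into one Series
manual_result = pd.Series(totals, name="Sales")

# The same computation expressed with groupby directly
groupby_result = df.groupby("Store")["Sales"].sum()
print(manual_result)
print(groupby_result)
```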

### <a id='toc1_5_1_'></a>[Grouping by a single Column](#toc0_)

When you group by a single column, pandas groups the data based on the unique values in that column. Each group consists of all rows in the DataFrame that share the same value in the grouping column.

In [15]:
# Sample data
data = {
    'Store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store A'],
    'Department': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
    'Sales': [1000, 500, 1500, 800, 1200]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Store,Department,Sales
0,Store A,Electronics,1000
1,Store A,Clothing,500
2,Store B,Electronics,1500
3,Store B,Clothing,800
4,Store A,Electronics,1200


In [16]:
# Grouping by 'Store'
grouped = df.groupby('Store')
grouped.groups

{'Store A': [0, 1, 4], 'Store B': [2, 3]}

### <a id='toc1_5_2_'></a>[Grouping by list of columns](#toc0_)

Grouping by a list of columns allows you to create multi-level (hierarchical) groups based on unique combinations of values across the specified columns. This is useful when you want to analyze your data based on multiple criteria simultaneously.

In [17]:
multi_grouped = df.groupby(['Store', 'Department'])
multi_grouped.groups

{('Store A', 'Clothing'): [1], ('Store A', 'Electronics'): [0, 4], ('Store B', 'Clothing'): [3], ('Store B', 'Electronics'): [2]}
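
Besides `.groups`, a grouped object can be inspected with `ngroups`, `size()` and `get_group()`. The following sketch re-creates the sample data so it runs on its own; with several grouping columns, the group key is a tuple.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": ["Store A", "Store A", "Store B", "Store B", "Store A"],
    "Department": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
    "Sales": [1000, 500, 1500, 800, 1200],
})

multi_grouped = df.groupby(["Store", "Department"])

# Number of rows in each (Store, Department) combination
group_sizes = multi_grouped.size()
print(group_sizes)

# Retrieve the rows of a single group by its key tuple
store_a_electronics = multi_grouped.get_group(("Store A", "Electronics"))
print(store_a_electronics)
```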

### <a id='toc1_5_3_'></a>[Group-specific aggregation.](#toc0_)

An aggregation is a GroupBy operation that reduces the dimension of the grouping object. The result of an aggregation is a scalar value for each column in a group. 

Pandas provides several  [Built-in aggregation methods](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-aggregation-methods).

Let's consider the following scenario:

A retail company operates stores in multiple regions, selling products in different categories. The management wants to understand the average sales figures broken down by region and store to identify which areas and stores are performing well and which might need attention.
We have to group the data by Region and Store to analyze the average sales for each store in each region.

In [18]:
# Sample data
data = {
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Store': ['One', 'Two', 'One', 'Two', 'One', 'Two'],
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Home Goods', 'Home Goods'],
    'Sales': [1000, 1500, 750, 1250, 900, 1100]
}
df = pd.DataFrame(data)

# output data sorted by 'Region', 'Store'
df.sort_values(by=['Region', 'Store'])

Unnamed: 0,Region,Store,Category,Sales
0,North,One,Electronics,1000
4,North,One,Home Goods,900
1,North,Two,Clothing,1500
2,South,One,Electronics,750
3,South,Two,Clothing,1250
5,South,Two,Home Goods,1100


In [19]:
# Multi-level grouping
grouped = df.groupby(['Region', 'Store'])

# Calculating the average sales
average_sales = grouped['Sales'].mean()
average_sales

Region  Store
North   One       950.0
        Two      1500.0
South   One       750.0
        Two      1175.0
Name: Sales, dtype: float64

The output is a Series with a hierarchical index (MultiIndex).

You can use the .reset_index() method to convert the hierarchical index (MultiIndex) created by the multi-level group into regular columns, effectively turning the Series into a DataFrame.

In [20]:
average_sales.reset_index()

Unnamed: 0,Region,Store,Sales
0,North,One,950.0
1,North,Two,1500.0
2,South,One,750.0
3,South,Two,1175.0
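
If several statistics are needed at once, named aggregation with `.agg()` is one option. The sketch below re-creates the sample data; the output column names `avg_sales`, `total_sales` and `n_rows` are arbitrary choices for this example, and `as_index=False` returns the group keys as regular columns so no `reset_index()` is needed.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "North", "South"],
    "Store": ["One", "Two", "One", "Two", "One", "Two"],
    "Category": ["Electronics", "Clothing", "Electronics", "Clothing",
                 "Home Goods", "Home Goods"],
    "Sales": [1000, 1500, 750, 1250, 900, 1100],
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = df.groupby(["Region", "Store"], as_index=False).agg(
    avg_sales=("Sales", "mean"),
    total_sales=("Sales", "sum"),
    n_rows=("Sales", "count"),
)
print(summary)
```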


### <a id='toc1_5_4_'></a>[Group-specific transformations.](#toc0_)

A transformation is a GroupBy operation whose result is indexed the same as the one being grouped.

Pandas has several [built-in transformation methods](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-transformation-methods) which work on groups. 

We can define our own custom transformation using the [transform() method](https://pandas.pydata.org/docs/user_guide/groupby.html#the-transform-method).


Let's consider the next example: a dataset containing test scores from students in different classes. Some of the scores are missing, and we want to fill these missing values with the average score of each respective class, rather than using the overall average.


In [12]:
# Sample data: Student scores from different classes, with some missing values
data = {
    'Class': ['A', 'A', 'B', 'B', 'A', 'B'],
    'StudentID': [1, 2, 3, 4, 5, 6],
    'Score': [88, np.nan, 75, np.nan, 92, 85]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Class,StudentID,Score
0,A,1,88.0
1,A,2,
2,B,3,75.0
3,B,4,
4,A,5,92.0
5,B,6,85.0


In [22]:
# Group by 'Class' and fill NaN values with the mean score of each class
df['Cleaned Score'] = df.groupby('Class')['Score'].transform(lambda x: x.fillna(x.mean()))
df

Unnamed: 0,Class,StudentID,Score,Cleaned Score
0,A,1,88.0,88.0
1,A,2,,90.0
2,B,3,75.0,75.0
3,B,4,,80.0
4,A,5,92.0,92.0
5,B,6,85.0,85.0
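
An equivalent sketch without a lambda passes the string `'mean'` to `transform()`, which broadcasts each class mean back to the original rows, and then fills the gaps with `fillna()`. The variable names `class_mean` and `filled` are assumptions for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Class": ["A", "A", "B", "B", "A", "B"],
    "StudentID": [1, 2, 3, 4, 5, 6],
    "Score": [88, np.nan, 75, np.nan, 92, 85],
})

# Per-class mean, aligned row by row with the original DataFrame
class_mean = df.groupby("Class")["Score"].transform("mean")

# Replace only the missing scores with the mean of their own class
filled = df["Score"].fillna(class_mean)
print(filled.tolist())
```

Both variants give the same values; the string form avoids calling a Python function once per group.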


### <a id='toc1_5_5_'></a>[Group-specific filtrations](#toc0_)

A filtration is a GroupBy operation that subsets the original grouping object. It may either filter out entire groups, part of groups, or both.

This can be particularly useful when you want to analyze subsets of your data that meet certain criteria within each group.

Let's say we have a dataset containing sales data from various stores, each belonging to a region. Our goal is to keep only the stores in regions whose total sales exceed 2000 units and to drop all stores in the remaining regions. This will help us focus on the stores in higher-performing regions for further analysis.

In [13]:
# Sample data
data = {
    'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'Store': ['Store A', 'Store B', 'Store C', 'Store D', 'Store E', 'Store F'],
    'Sales': [950, 1100, 550, 1250, 1750, 600]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Region,Store,Sales
0,North,Store A,950
1,North,Store B,1100
2,South,Store C,550
3,South,Store D,1250
4,East,Store E,1750
5,East,Store F,600


In [24]:
# Apply group-specific filtering
filtered_df = df.groupby('Region').filter(lambda group: group['Sales'].sum() > 2000)
filtered_df

Unnamed: 0,Region,Store,Sales
0,North,Store A,950
1,North,Store B,1100
4,East,Store E,1750
5,East,Store F,600
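
The same selection can be sketched with a boolean mask built from `transform('sum')`, which is a common alternative to `filter()` with a lambda. The data is re-created so the block runs on its own, and `region_total` and `kept_df` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "East", "East"],
    "Store": ["Store A", "Store B", "Store C", "Store D", "Store E", "Store F"],
    "Sales": [950, 1100, 550, 1250, 1750, 600],
})

# Total sales of each row's region, aligned with the original rows
region_total = df.groupby("Region")["Sales"].transform("sum")

# Keep only rows whose region total exceeds 2000
kept_df = df[region_total > 2000]
print(kept_df)

# For comparison: the filter() version used above
filtered_df = df.groupby("Region").filter(lambda group: group["Sales"].sum() > 2000)
```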


In [14]:
from datetime import datetime as dt

In [18]:
date_string = "2021-10-04"

date_object = dt.strptime(date_string, '%Y-%m-%d')
date_object

date_object_formatted = date_object.strftime('%Y-%b-%d')
date_object_formatted

'2021-Oct-04'

In [19]:
data = {
    'Group': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Value': [10, 20, 15, np.nan, 30, 25]
}

df = pd.DataFrame(data)
print(df)

  Group  Value
0     A   10.0
1     A   20.0
2     B   15.0
3     B    NaN
4     A   30.0
5     B   25.0


In [20]:
def fill_with_group_mean(group):
    group_mean = group.mean()
    return group.fillna(group_mean)

In [22]:
df['Filled_Value'] = df.groupby('Group')['Value'].transform(fill_with_group_mean)
print(df)

  Group  Value  Filled_Value
0     A   10.0          10.0
1     A   20.0          20.0
2     B   15.0          15.0
3     B    NaN          20.0
4     A   30.0          30.0
5     B   25.0          25.0


In [34]:
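# Series.where keeps the values where the condition (x < 10) holds and replaces all others with 100.
# No value here is below 10 (and NaN never satisfies the condition), so every entry becomes 100.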
df['Max_Value'] = df.groupby('Group')['Value'].transform(lambda x: x.where(x < 10, 100))

print(df)

  Group  Value  Filled_Value  Max_Value
0     A   10.0          10.0      100.0
1     A   20.0          20.0      100.0
2     B   15.0          15.0      100.0
3     B    NaN          20.0      100.0
4     A   30.0          30.0      100.0
5     B   25.0          25.0      100.0


In [36]:
df.Value.sort_values()

0    10.0
2    15.0
1    20.0
5    25.0
4    30.0
3     NaN
Name: Value, dtype: float64