
## Day 14 - Data Transformation: Applying Functions

### Why is Data Transformation Important?

Data transformation is a critical step in the data analysis pipeline: it converts data into a format better suited to analysis. Whether you're cleaning data, engineering features, or normalizing values, applying transformations is how you prepare a dataset for meaningful insights.



### Example 1: Applying a Function to a Column

Let's start with a simple example where we want to convert a column of temperatures from Celsius to Fahrenheit.


In [None]:
# %pip installs into the environment of the running kernel
%pip install pandas

In [None]:

import pandas as pd

# Sample DataFrame with temperatures in Celsius
data = {
    'City': ['New York', 'Los Angeles', 'Chicago'],
    'Temperature (C)': [25, 30, 20]
}
df = pd.DataFrame(data)

# Function to convert Celsius to Fahrenheit
def celsius_to_fahrenheit(celsius):
    return (celsius * 9/5) + 32

# Applying the function to the 'Temperature (C)' column
df['Temperature (F)'] = df['Temperature (C)'].apply(celsius_to_fahrenheit)

print("DataFrame after applying the conversion:")
print(df)
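
For simple arithmetic like this conversion, pandas can also operate on the whole column at once, without `apply`. The following is a minimal sketch of that alternative, rebuilding the same sample DataFrame so the cell runs on its own; it should give the same Fahrenheit values as the function-based version above.

```python
import pandas as pd

# Same sample data as in Example 1
df = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago'],
    'Temperature (C)': [25, 30, 20]
})

# Arithmetic operators act element-wise on the whole column, so no .apply() is needed
df['Temperature (F)'] = df['Temperature (C)'] * 9 / 5 + 32

print(df)
```

`apply` remains the more general tool: it accepts any Python function, while the column-wise form only works when the logic can be written with operators or built-in pandas methods.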



### Example 2: Applying a Lambda Function to a Column

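You can also pass a lambda function to `apply` for quick, one-line operations. For example, let's double the Celsius values. The result goes into a new column so that the original readings stay consistent with the 'Temperature (F)' column, and so that re-running the cell does not double the values a second time.


In [None]:

# Doubling the temperatures in Celsius using a lambda function
df['Temperature (C) Doubled'] = df['Temperature (C)'].apply(lambda x: x * 2)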

print("DataFrame after doubling the Celsius temperatures:")
print(df)
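
A lambda is not limited to arithmetic. As a small sketch, the cell below uses a conditional expression inside a lambda to label each temperature. The threshold `WARM_THRESHOLD_C` and the 'Label' column are hypothetical names chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago'],
    'Temperature (C)': [25, 30, 20]
})

# Hypothetical cut-off, chosen only for this illustration
WARM_THRESHOLD_C = 25

# The lambda receives one value at a time and returns a label for it
df['Label'] = df['Temperature (C)'].apply(
    lambda c: 'warm' if c >= WARM_THRESHOLD_C else 'mild'
)

print(df)
```

Once the logic grows beyond a single expression, a named function as in Example 1 is usually easier to read and test.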



## Use Case: Normalizing a Dataset of Customer Reviews

For this use case, we will normalize a dataset containing customer reviews. Normalization here means min-max scaling: rescaling values to a fixed range (e.g., 0 to 1) so that features with different units or ranges can be compared directly.
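
Scaling to the range 0 to 1 is commonly written as the min-max formula below. The symbols are introduced here for illustration: $x$ is a single value, and $x_{\min}$ and $x_{\max}$ are the smallest and largest values in the column.

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

For instance, in a column whose minimum is 4 and maximum is 10, a value of 8 becomes $(8 - 4) / (10 - 4) = 4/6 \approx 0.667$.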



### Step 1: Creating a Sample Dataset

Let’s assume you have a dataset of customer reviews with ratings on a scale from 1 to 10. We will normalize these ratings to a 0–1 scale.


In [None]:

import pandas as pd
import numpy as np

# Sample dataset of customer reviews
data = {
    'Customer ID': np.arange(1, 11),
    'Review Score': [8, 6, 7, 9, 5, 10, 4, 8, 7, 6]
}
reviews_df = pd.DataFrame(data)

print("Original dataset of customer reviews:")
print(reviews_df)



### Step 2: Normalizing the Review Scores

We will now apply a normalization function to scale the review scores between 0 and 1.


In [None]:

# Compute the observed minimum and maximum once, rather than on every call
score_min = reviews_df['Review Score'].min()
score_max = reviews_df['Review Score'].max()

# Function to normalize a value to a 0-1 scale (min-max normalization)
def normalize(x):
    return (x - score_min) / (score_max - score_min)

# Applying the normalization function to the 'Review Score' column
reviews_df['Normalized Score'] = reviews_df['Review Score'].apply(normalize)

print("\nDataset after normalizing review scores:")
print(reviews_df)
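
The normalization above depends on the minimum and maximum found in the data. The sketch below wraps the same calculation in a reusable helper, here called `min_max_normalize` (a hypothetical name, not a pandas function). It optionally accepts explicit bounds, such as the 1 to 10 rating scale, and guards against a column where every value is identical. Returning zeros in that case is an assumed convention, not the only reasonable choice.

```python
import pandas as pd

def min_max_normalize(series, lower=None, upper=None):
    """Scale a numeric Series to the 0-1 range.

    lower/upper default to the observed min and max; pass explicit
    bounds (e.g. 1 and 10) to scale against the full rating scale.
    """
    lower = series.min() if lower is None else lower
    upper = series.max() if upper is None else upper
    if upper == lower:
        # Zero range: return zeros instead of dividing by zero (assumed convention)
        return pd.Series(0.0, index=series.index)
    return (series - lower) / (upper - lower)

scores = pd.Series([8, 6, 7, 9, 5, 10, 4, 8, 7, 6], name='Review Score')

observed = min_max_normalize(scores)           # observed min (4) maps to 0
full_scale = min_max_normalize(scores, 1, 10)  # scale minimum (1) maps to 0

print(pd.DataFrame({
    'Review Score': scores,
    'Observed range': observed,
    'Full 1-10 scale': full_scale,
}))
```

Which bounds to use depends on the question: the observed range spreads the sample across the full 0 to 1 interval, while the fixed scale keeps scores comparable across datasets that share the same rating system.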



### Step 3: Analyzing the Results

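After normalization, every review score lies between 0 and 1, which puts the scores on a common scale for comparison and analysis. Because `normalize` uses the minimum and maximum observed in the data, the lowest score in the sample (4) maps to 0 and the highest (10) maps to 1, even though the rating scale itself runs from 1 to 10.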


In [None]:

print("\nSummary statistics of the normalized scores:")
print(reviews_df['Normalized Score'].describe())
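
Beyond the summary statistics, a few quick checks can confirm that the normalization behaved as expected. This is a minimal sketch that recomputes the normalized scores from the same sample values so it runs on its own. It checks the range, confirms that the ordering of the scores is preserved, and reverses the formula to recover the original ratings.

```python
import pandas as pd

scores = pd.Series([8, 6, 7, 9, 5, 10, 4, 8, 7, 6], name='Review Score')
lo, hi = scores.min(), scores.max()
normalized = (scores - lo) / (hi - lo)

# The extremes of the observed data should land exactly on 0 and 1
assert normalized.min() == 0.0 and normalized.max() == 1.0

# Min-max scaling is a linear rescaling, so the ranking of the scores is unchanged
assert normalized.rank().equals(scores.rank())

# Reversing the formula recovers the original scores
restored = normalized * (hi - lo) + lo
print(restored.round().astype(int).tolist())
```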
