# Creating a Master Dataset

This notebook merges preprocessed datasets from the "pre_processed_datasets" folder into a single master dataset, which is saved in the "master_dataset" folder.
The merging steps include:
 - Loading the preprocessed datasets.
 - Merging them on the common column `Day Index`.
 - Saving the final merged dataset to the "master_dataset" folder.

# Importing the Pandas Library

I imported the **Pandas** library, which handles reading, merging, and exporting the datasets in this notebook. I also imported the standard-library `os` module to check file paths and create the output directory.


In [17]:
import pandas as pd  # Import the pandas library for data manipulation and analysis
import os # Import os for working with file paths/location

# Defining Pre Processed Dataset Paths
I used relative paths to point to the cleaned datasets. Relative paths are resolved against the current working directory, so the notebook works on any machine as long as it is run from its own folder and the `datasets` folder sits in the parent directory.

In [18]:
google_clicks_path = '../datasets/pre_processed_datasets/pre_processed_ProductA_google_clicks.xlsx'
fb_impressions_path = '../datasets/pre_processed_datasets/pre_processed_ProductA_fb_impressions.xlsx'
quantity_path = '../datasets/pre_processed_datasets/pre_processed_ProductA.xlsx'

# Verify each file
print(f"Google Clicks Path Exists: {os.path.exists(google_clicks_path)}")
print(f"Facebook Impressions Path Exists: {os.path.exists(fb_impressions_path)}")
print(f"Quantity Path Exists: {os.path.exists(quantity_path)}")

# Print the current working directory for confirmation
print(f"Current Working Directory: {os.getcwd()}")

Google Clicks Path Exists: True
Facebook Impressions Path Exists: True
Quantity Path Exists: True
Current Working Directory: c:\Users\nitin\OneDrive\Documents\Infosys Springboard Internship Files\NitinMishra-Infosys-Nov24\python files
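
The checks above only print `True` or `False`, so a missing file would surface later as an error inside `read_excel`. A minimal sketch of a stricter check follows, assuming two hypothetical helpers, `find_missing` and `require_files`, that are not part of the original notebook:

```python
from pathlib import Path


def find_missing(paths):
    """Return the paths that do not exist on disk, preserving input order."""
    return [str(p) for p in paths if not Path(p).exists()]


def require_files(paths):
    """Raise FileNotFoundError naming every missing path; do nothing otherwise."""
    missing = find_missing(paths)
    if missing:
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))
```

In this notebook it could be called as `require_files([google_clicks_path, fb_impressions_path, quantity_path])` before loading, so the run stops early with a message that lists every missing file.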


# Loading Individual Datasets

To begin the data consolidation process, I loaded three separate datasets into individual DataFrames:

1. **Google Clicks**:
   - File Path: `../datasets/pre_processed_datasets/pre_processed_ProductA_google_clicks.xlsx`
   - Contains preprocessed data on clicks generated through Google.

2. **Facebook Impressions**:
   - File Path: `../datasets/pre_processed_datasets/pre_processed_ProductA_fb_impressions.xlsx`
   - Includes preprocessed impression data from Facebook.

3. **Quantity Data**:
   - File Path: `../datasets/pre_processed_datasets/pre_processed_ProductA.xlsx`
   - Provides preprocessed sales-related data for Product A.

Each dataset was read into a Pandas DataFrame using the `read_excel` function, since the preprocessed datasets are `.xlsx` files. The data is then ready for merging and further processing.


In [19]:
# Load each Excel file into a DataFrame
google_clicks = pd.read_excel(google_clicks_path)
fb_impressions = pd.read_excel(fb_impressions_path)
quantity = pd.read_excel(quantity_path)

print("Datasets Loaded successfully !")

Datasets Loaded successfully !
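
Before merging, it can help to confirm that each loaded DataFrame actually has a usable `Day Index` column. The sketch below is only an illustration: `key_report` is a hypothetical helper, and the toy frame with its `Clicks` column and values is made up rather than taken from the real files.

```python
import pandas as pd


def key_report(df, key="Day Index"):
    """Summarise how suitable `key` is as a merge key for `df`."""
    if key not in df.columns:
        raise KeyError(f"Column {key!r} not found; available: {list(df.columns)}")
    return {
        "rows": len(df),
        "missing_keys": int(df[key].isna().sum()),         # keys that are NaN/NaT
        "duplicate_keys": int(df[key].duplicated().sum()),  # repeats after the first
    }


# Toy frame standing in for one loaded dataset (values are invented)
sample = pd.DataFrame({"Day Index": ["d1", "d2", "d2"], "Clicks": [5, 7, 7]})
report = key_report(sample)
print(report)
```

A non-zero `duplicate_keys` count matters because duplicated keys multiply rows when the frames are merged.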


# Merging Datasets

To create a unified master dataset, I merged the three individual DataFrames on their common column, **`Day Index`**. Here's what this step accomplished:

- Combined data from:
  - **Google Clicks**
  - **Facebook Impressions**
  - **Quantity Data**
- Used Pandas' `merge` method to join the datasets on the **`Day Index`** column. With its default inner join, only days present in all three datasets are kept.

This operation resulted in a comprehensive dataset that consolidates all relevant metrics, making it ready for analysis and visualization.


In [20]:
# Merge the three DataFrames on the common column "Day Index"
# This combines all data into a single master dataset
master_dataset = google_clicks.merge(fb_impressions, on="Day Index").merge(quantity, on="Day Index")

print("Datasets merged successfully")

Datasets merged successfully
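
Because `merge` performs an inner join by default, any day missing from one dataset is dropped without a warning. The sketch below shows one way to inspect this, using invented toy frames (the dates and the `Clicks` and `Impressions` columns are assumptions, not the real data). It uses the `validate` and `indicator` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy data: the third day has clicks but no impressions (values are invented)
clicks = pd.DataFrame({
    "Day Index": pd.to_datetime(["2021-12-01", "2021-12-02", "2021-12-03"]),
    "Clicks": [10, 12, 9],
})
impressions = pd.DataFrame({
    "Day Index": pd.to_datetime(["2021-12-01", "2021-12-02"]),
    "Impressions": [100, 120],
})

# Inner join (the default); validate raises MergeError if a key repeats on either side
inner = clicks.merge(impressions, on="Day Index", validate="one_to_one")

# Outer join with an indicator column to see which rows the inner join would drop
outer = clicks.merge(impressions, on="Day Index", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(f"Rows kept by inner join: {len(inner)}")
print(f"Rows without a match: {len(unmatched)}")
```

If `unmatched` is empty for the real datasets, the inner join used in this notebook loses nothing.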


# Saving the Master Dataset

After merging the datasets, I saved the resulting master dataset to an Excel file for future use. Here's what I did:

- Used the `to_excel` method to export the merged DataFrame to an Excel file.
- Saved the file at the following location:
  `../datasets/master_dataset/master_dataset.xlsx`
- Ensured that the **index was excluded** (`index=False`) for a cleaner output.

This step makes the unified dataset accessible for subsequent analyses and visualizations.


In [21]:
save_path = '../datasets/master_dataset/master_dataset.xlsx'
os.makedirs(os.path.dirname(save_path), exist_ok=True)  # Ensure that the output directory exists
master_dataset.to_excel(save_path, index=False)  # Exclude the DataFrame index from the file

print(f"Merged data saved successfully at {save_path}")

Merged data saved successfully at ../datasets/master_dataset/master_dataset.xlsx
