# Introduction

In this sales data analysis project using Spark, my first objective is customer segmentation. I will prepare the data by extracting customer information from the sales dataset: purchase history, purchase frequency, monetary value, and any available demographic data. Using Spark's data transformation and feature engineering tools, I will select the features that best differentiate customers and apply a clustering algorithm, such as k-means from Spark MLlib, to form distinct customer segments. I will then analyze the characteristics of each segment, such as buying behavior and preferences, assign each segment a descriptive label for easier interpretation, and tailor marketing strategies to it so that communication is targeted and personalized.
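
The features named above are often summarized as recency, frequency, and monetary value (RFM). As a minimal sketch of that aggregation, the plain-Python example below computes the three values per customer from a handful of made-up orders. The record layout, customer IDs, and reference date are illustrative assumptions, not fields from the actual sales dataset. In the project itself the same aggregation would be expressed as a Spark `groupBy` over the sales data before the features are passed to k-means.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (customer_id, order_date, order_amount).
orders = [
    ("C1", date(2024, 1, 5), 120.0),
    ("C1", date(2024, 3, 1), 80.0),
    ("C2", date(2023, 11, 20), 40.0),
    ("C3", date(2024, 2, 14), 300.0),
    ("C3", date(2024, 2, 28), 150.0),
    ("C3", date(2024, 3, 10), 50.0),
]


def rfm_features(orders, reference_date):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, order_date, amount in orders:
        # Track the most recent purchase date per customer.
        if customer_id not in last_purchase or order_date > last_purchase[customer_id]:
            last_purchase[customer_id] = order_date
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        cid: ((reference_date - last_purchase[cid]).days, frequency[cid], monetary[cid])
        for cid in last_purchase
    }


features = rfm_features(orders, reference_date=date(2024, 3, 31))
for customer_id, (recency, freq, total) in sorted(features.items()):
    print(customer_id, recency, freq, total)
```

Because the three features live on very different scales (days, counts, currency), they would normally be standardized before clustering so that no single feature dominates the distance calculation.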

The second objective is promotion and discount analysis. I will define clear metrics for promotion success, including sales uplift, customer acquisition, and return on investment. Building on the segmentation, I will assess how effective promotions are within each customer segment and identify the best timing and discount rates. I will run A/B comparisons of different promotion strategies, using Spark to aggregate the outcomes and statistical tests to determine whether the differences are significant. I will also develop forecasting models to predict the impact of upcoming promotions, which supports inventory management and marketing optimization. Together, the two objectives give a data-driven approach to increasing sales: promotions are tailored to specific customer segments, making marketing more personalized and more effective.
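
One way to make the success metrics concrete is sketched below. It rests on assumed definitions: uplift is the relative change in sales against a non-promotion baseline, and ROI is the incremental margin net of the promotion cost, divided by that cost. The function names, the constant margin rate, and all figures are hypothetical, and the project may settle on different definitions.

```python
def sales_uplift(promo_sales, baseline_sales):
    """Relative change in sales versus the baseline period."""
    if baseline_sales <= 0:
        raise ValueError("baseline_sales must be positive")
    return (promo_sales - baseline_sales) / baseline_sales


def promotion_roi(promo_sales, baseline_sales, margin_rate, promo_cost):
    """(incremental margin - promotion cost) / promotion cost."""
    if promo_cost <= 0:
        raise ValueError("promo_cost must be positive")
    # Assumes one constant margin rate applies to all incremental sales.
    incremental_margin = (promo_sales - baseline_sales) * margin_rate
    return (incremental_margin - promo_cost) / promo_cost


# Made-up figures for a single promotion.
uplift = sales_uplift(promo_sales=15000.0, baseline_sales=12000.0)
roi = promotion_roi(15000.0, 12000.0, margin_rate=0.5, promo_cost=1000.0)
print(f"uplift = {uplift:.1%}, ROI = {roi:.1%}")
```

Computing both metrics per customer segment, rather than over all customers at once, is what ties this objective back to the segmentation.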

## Objectives

1. **Customer Segmentation:**
   - The first phase groups customers into distinct segments based on shared characteristics and behaviors. Using data such as purchase history and demographic information, I will apply Spark's data transformation capabilities to derive the features that differentiate customers, then cluster them with an algorithm such as k-means. Each segment will then be analyzed to understand its characteristics, so that marketing strategies can be tailored to it.

2. **Promotion and Discount Optimization:**
   - The second phase analyzes the impact of promotions and discounts on sales. After defining clear success metrics, I will evaluate promotions within each customer segment, assessing timing and discount rates and running A/B comparisons of promotional strategies on results aggregated with Spark. I will also develop forecasting models to predict the impact of future promotions, so that inventory management and marketing can be adjusted ahead of time. The goal is to fine-tune promotional strategies for maximum impact on sales while maintaining profitability.
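
The A/B comparison in the second phase reduces to a small statistical calculation once the outcomes have been aggregated per promotion variant. As a minimal sketch, assuming the outcome is a conversion (a customer who received the promotion either purchased or did not), a two-proportion z-test can be written in plain Python. The function name and the counts are hypothetical, and a different test may suit other outcome types such as order value.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if not 0 < pooled < 1:
        raise ValueError("pooled conversion rate must be strictly between 0 and 1")
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal CDF, written with erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Made-up counts: variant A converts 200 of 2000 customers, variant B 260 of 2000.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The four inputs are exactly what a Spark aggregation per variant would produce (a count of exposed customers and a count of converting customers), so only those small totals need to leave the cluster.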
   

# Data importation

In this PySpark script, I start a Spark session and import the CSV files that hold the data for geographical sales prediction. The script specifies the folder path to the dataset and builds a list of the CSV files in that directory. For each file, it creates a variable holding the corresponding DataFrame, named after the file with the 'Dim' prefix removed, so that the DataFrames are easier to recognize and reference later.

Each DataFrame is read from its CSV file with PySpark's `spark.read.csv` method, with `header=True` so that the first row supplies the column names and `inferSchema=True` so that column types are inferred from the data. The DataFrame is then assigned into the dictionary returned by `globals()`, which makes it accessible as a module-level variable under its modified name. Finally, the script calls `show()` on each DataFrame, which prints its first 20 rows as a quick overview of the imported data.
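
Writing into `globals()` works in a notebook, but it makes the set of loaded tables hard to inspect and can silently overwrite an existing name. As an alternative sketch, not the approach the script takes, the tables can be kept in a dictionary keyed by the cleaned name. The helpers `clean_table_name` and `load_tables` are hypothetical, as is the `FactSales.csv` file name, and the reader is passed in as a function so that the sketch runs without a Spark session.

```python
import os
import tempfile


def clean_table_name(file_name, prefix="Dim"):
    """Drop the extension and a leading prefix from a CSV file name."""
    stem = os.path.splitext(file_name)[0]
    return stem[len(prefix):] if stem.startswith(prefix) else stem


def load_tables(folder_path, read_fn):
    """Return {clean_name: read_fn(path)} for every CSV file in folder_path."""
    tables = {}
    for file_name in sorted(os.listdir(folder_path)):
        if not file_name.lower().endswith(".csv"):
            continue
        name = clean_table_name(file_name)
        if name in tables:
            raise ValueError(f"duplicate table name after cleaning: {name}")
        tables[name] = read_fn(os.path.join(folder_path, file_name))
    return tables


# Demonstration with a throwaway folder and a stand-in reader
# that returns each file's header line instead of a DataFrame.
with tempfile.TemporaryDirectory() as folder:
    samples = [("DimAccount.csv", "AccountKey"), ("FactSales.csv", "SalesKey"), ("notes.txt", "x")]
    for file_name, header in samples:
        with open(os.path.join(folder, file_name), "w") as handle:
            handle.write(header + "\n")

    def read_header(path):
        with open(path) as handle:
            return handle.readline().strip()

    tables = load_tables(folder, read_fn=read_header)

print(sorted(tables))
```

With Spark, the reader would be something like `lambda path: spark.read.csv(path, header=True, inferSchema=True)`, and a table would then be accessed as `tables["Account"]` rather than as a bare variable.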

This step lays the groundwork for the project's objectives. Importing all of the CSV files up front gives a complete dataset for the later analysis, and the per-file variables make it easy to explore and manipulate individual DataFrames. The later tasks, customer segmentation and promotion/discount analysis, build on these DataFrames and on Spark's distributed computing capabilities.

In [1]:
from pyspark.sql import SparkSession
import os

# Initialize a Spark session
spark = SparkSession.builder.appName("Uzumaki").getOrCreate()

# Define the path to the folder containing CSV files for geographical sales prediction
folder_path = r'C:\Users\neste\OneDrive\Desktop\karanja\DataSet_final\DataSet_final'

# Get a sorted list of all CSV files in the folder (sorted for a reproducible load order)
csv_files = sorted(file for file in os.listdir(folder_path) if file.endswith('.csv'))

# Create one module-level variable per DataFrame, named after its file
for csv_file in csv_files:
    # Use the file name (without extension) as the variable name
    df_name = os.path.splitext(csv_file)[0]

    # Strip only a leading 'Dim' so names that contain 'Dim' elsewhere are left intact
    df_name_without_dim = df_name[3:] if df_name.startswith('Dim') else df_name

    # Read the CSV file into a DataFrame; inferSchema needs an extra pass over the data
    df = spark.read.csv(os.path.join(folder_path, csv_file), header=True, inferSchema=True)

    # Save the DataFrame in the environment with its modified file name
    globals()[df_name_without_dim] = df

    # Show the first 20 rows of the DataFrame
    df.show()


+----------+----------------+-----------------------+-----------------------------+--------------------+-----------+--------+-------------+---------+-------------------+
|AccountKey|ParentAccountKey|AccountCodeAlternateKey|ParentAccountCodeAlternateKey|  AccountDescription|AccountType|Operator|CustomMembers|ValueType|CustomMemberOptions|
+----------+----------------+-----------------------+-----------------------------+--------------------+-----------+--------+-------------+---------+-------------------+
|         1|            NULL|                      1|                         NULL|       Balance Sheet|       NULL|       ~|         NULL| Currency|               NULL|
|         2|               1|                     10|                            1|              Assets|     Assets|       +|         NULL| Currency|               NULL|
|         3|               2|                    110|                           10|      Current Assets|     Assets|       +|         NULL| Currency| 