# OBJECTIVES


*Background:*
The company has collected sales data across several dimensions, including products, customers, regions, and time periods. The goal is to gain insight into sales performance and make data-driven decisions that optimize revenue and strengthen overall business strategy.

*Objectives:*

1. **Sales Performance Overview:**
   - Understand the historical sales trends and patterns to inform future business strategies.

2. **Product Analysis:**
   - Identify top-performing products and product categories.
   - Explore opportunities for product portfolio enhancement and marketing strategies.

3. **Customer Segmentation:**
   - Segment customers based on their purchasing behavior and preferences.
   - Tailor marketing and engagement strategies for different customer segments.

4. **Geographical Analysis:**
   - Evaluate sales performance across different regions.
   - Identify regions with untapped potential or areas requiring special attention.

5. **Time-based Analysis:**
   - Analyze sales trends over different time dimensions (daily, monthly, yearly).
   - Identify peak sales periods and strategize inventory and marketing efforts accordingly.

6. **Promotion and Reseller Impact:**
   - Evaluate the effectiveness of promotions on sales.
   - Assess the contribution of different resellers to overall sales and optimize partnerships.

7. **Financial Insights:**
   - Calculate key financial metrics such as revenue, profit margins, and return on investment.
   - Identify opportunities for cost optimization and revenue growth.

*Outcome:*
By addressing the objectives outlined above, the company aims to enhance its understanding of the sales landscape, identify growth opportunities, and make informed decisions to improve overall business performance.

# DATA IMPORTATION
To set up the data import, I first initialized a Spark session, which serves as the entry point to Spark for data processing. I then specified the folder that holds the CSV files and compiled the list of files to import: `DimCurrency.csv`, `DimCustomer.csv`, `DimProduct.csv`, `DimGeography.csv`, `DimPromotion.csv`, `DimSalesTerritory.csv`, and `FactInternetSales.csv`.

For each CSV file in the list, I used the `spark.read.csv` method to read the contents into a Spark DataFrame. Setting `header=True` tells Spark that the first row of each file contains the column names, and `inferSchema=True` lets Spark infer each column's data type automatically. After loading each DataFrame, I assigned it to a variable (through `globals()`) named after the CSV file's base name, without the file extension.

This approach allows for convenient referencing of each DataFrame using variable names aligned with the file names. For instance, the `DimCurrency` DataFrame holds information about currencies, and `DimCustomer` contains details about customers. This naming convention facilitates subsequent analyses aligned with the objectives we established earlier, such as exploring product-centric data, customer segmentation, geographical analysis, and various time-based and financial insights. The groundwork has been laid for a comprehensive sales analysis, with the ability to delve into each aspect individually or combine dimensions for more complex analyses.
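
The `inferSchema=True` option described above costs Spark an extra pass over each file. As a hedged sketch of an alternative, an explicit schema could be supplied as a DDL-formatted string. The column names below come from the `DimCurrency` output shown later in this section; the column types and the helper `to_ddl` are assumptions introduced for illustration, not part of the notebook.

```python
# Column names are taken from the DimCurrency output; the types are assumptions.
DIM_CURRENCY_COLUMNS = [
    ("CurrencyKey", "INT"),
    ("CurrencyAlternateKey", "STRING"),
    ("CurrencyName", "STRING"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string such as 'a INT, b STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


dim_currency_schema = to_ddl(DIM_CURRENCY_COLUMNS)
print(dim_currency_schema)

# With the notebook's Spark session, the string could replace inferSchema:
# DimCurrency = spark.read.csv(
#     f"{folder_path}/DimCurrency.csv", header=True, schema=dim_currency_schema
# )
```

For small dimension tables the inferred schema is usually adequate; an explicit schema mainly helps when a file is large or when a column's type must not change between runs.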

```python
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

# Specify the folder path
folder_path = "C:/Users/neste/OneDrive/Desktop/karanja/DataSet_final/DataSet_final"

# List of CSV files to load
files_to_load = [
    "DimCurrency.csv",
    "DimCustomer.csv",
    "DimProduct.csv",
    "DimGeography.csv",
    "DimPromotion.csv",
    "DimSalesTerritory.csv",
    "FactInternetSales.csv"
]

# Load each CSV file into a Spark DataFrame and assign a name corresponding to the file name
for file in files_to_load:
    file_path = f"{folder_path}/{file}"
    dataframe_name = file.split('.')[0]  # Use the file name without extension as the DataFrame name
    globals()[dataframe_name] = spark.read.csv(file_path, header=True, inferSchema=True)

# Display the first 5 rows of each dimension DataFrame
DimCurrency.show(5)
DimCustomer.show(5)
DimProduct.show(5)
DimGeography.show(5)
DimPromotion.show(5)
DimSalesTerritory.show(5)
```


```text
+-----------+--------------------+--------------+
|CurrencyKey|CurrencyAlternateKey|  CurrencyName|
+-----------+--------------------+--------------+
|          1|                 AFA|       Afghani|
|          2|                 DZD|Algerian Dinar|
|          3|                 ARS|Argentine Peso|
|          4|                 AMD| Armenian Dram|
|          5|                 AWG|Aruban Guilder|
+-----------+--------------------+--------------+
only showing top 5 rows

+-----------+------------+--------------------+-----+---------+----------+--------+---------+---------+-------------+------+------+--------------------+------------+-------------+--------------------+----------------+----------------+---------------+-----------------+-----------------+----------------+--------------+---------------+-------------------+------------+-------------------+-----------------+---------------+
|CustomerKey|GeographyKey|CustomerAlternateKey|Title|FirstName|MiddleName|LastName|NameStyle|BirthDate|
```
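
The loading loop above registers each DataFrame through `globals()`. A possible alternative, sketched here on the assumption that a plain dictionary is acceptable for later lookups, keeps all tables in one place and avoids creating module-level names dynamically. The helper `load_tables` and its `read_csv` parameter are hypothetical names introduced for illustration; they are not part of the notebook.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def load_tables(
    read_csv: Callable[[str], T], folder_path: str, files: Iterable[str]
) -> Dict[str, T]:
    """Read every file with `read_csv` and key the result by the file's base name."""
    tables: Dict[str, T] = {}
    for file in files:
        name = PurePosixPath(file).stem  # "DimCurrency.csv" -> "DimCurrency"
        tables[name] = read_csv(f"{folder_path}/{file}")
    return tables


# Stand-in reader that just echoes the path, so the sketch runs without Spark.
demo = load_tables(lambda path: path, "data", ["DimCurrency.csv", "FactInternetSales.csv"])
print(sorted(demo))  # ['DimCurrency', 'FactInternetSales']

# With the notebook's Spark session, the call might look like:
# tables = load_tables(
#     lambda path: spark.read.csv(path, header=True, inferSchema=True),
#     folder_path,
#     files_to_load,
# )
# tables["DimCustomer"].show(5)
```

Passing the reader in as a callable keeps the naming logic separate from Spark, so the same helper can be exercised with a stub before any data is read.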

# DATA CLEANING
