# PROBLEM STATEMENT

Retail companies want to understand how their customers buy so that they can predict what those customers will buy next. My primary objective is to build a predictive model that anticipates the specific products each customer is likely to purchase. The model draws on a dataset of customer details, product information, and historical order data, and it uses customer demographics, order history, and product attributes as inputs. To improve prediction accuracy, I will explore feature engineering techniques that turn these raw fields into relevant features. The intended outcome is a set of actionable insights that give marketing and sales teams targeted product recommendations, with the aim of improving customer satisfaction and increasing sales.


**Research Questions:**

1. **RQ1: Holistic Customer Segmentation and Demographics Analysis**
   - How can I develop a comprehensive customer segmentation strategy, considering a diverse range of demographic, psychographic, and behavioral characteristics?
   - What influence do cultural and regional factors have on my customers' preferences, and how can I incorporate these nuances into my segmentation strategies?

2. **RQ2: In-depth Exploration of Customer Consumption Patterns**
   - How do my customers' consumption patterns evolve over time?
   - How can I use machine learning algorithms to identify subtle patterns, anomalies, and outliers in my large-scale historical order data?

3. **RQ3: Advancing My Predictive Modeling for Personalized Recommendations**
   - In addition to traditional predictive modeling, how can I incorporate advanced techniques such as deep learning to achieve more nuanced and personalized product recommendations?
   - What role do external factors like macroeconomic indicators and social trends play in enhancing the accuracy of my predictive models?

4. **RQ4: Feature Engineering for Multi-dimensional Insights**
   - How can I combine my customer details, product information, and order history with external data sources for comprehensive feature engineering?
   - What techniques can I employ to assess the importance and interaction of various features in driving predictive accuracy?

5. **RQ5: Actionable Insights Across My Organization**
   - Beyond marketing and sales, how can actionable insights be extended to other touchpoints such as customer support and product development?
   - What strategies can I implement to ensure seamless communication and application of insights across different departments within my organization?

**Objectives:**

1. **Objective 1: Integration of Multifaceted Customer Segmentation**
   - I aim to develop a comprehensive framework for customer segmentation that considers a diverse set of demographic, psychographic, and behavioral factors.

2. **Objective 2: Longitudinal Analysis of Customer Consumption Patterns**
   - I will conduct an in-depth longitudinal analysis of my customers' consumption patterns, exploring temporal variations and seasonality.

3. **Objective 3: Integration of Advanced Predictive Modeling Techniques**
   - I intend to explore and implement advanced predictive modeling techniques, including deep learning, to provide more personalized and accurate product recommendations.

4. **Objective 4: Comprehensive Feature Engineering Framework**
   - I am dedicated to developing a robust framework for feature engineering that incorporates not only traditional customer and product data but also external data sources for a multi-dimensional view.

5. **Objective 5: Cross-Functional Application of Actionable Insights**
   - I will establish protocols and communication channels for the dissemination and application of actionable insights across various departments, fostering a holistic organizational approach.

6. **Objective 6: Real-time Adaptation and Continuous Improvement**
   - I aim to implement mechanisms for real-time adaptation of strategies based on evolving consumer trends, ensuring continuous improvement in the accuracy and relevance of my predictive models.


# Data cleaning and preparation

# Data importation


To import the data into my Spark environment, I started a Spark session with the PySpark library. I then listed the contents of the folder that holds the CSV files and kept only the files with the '.csv' extension. For each file, I took the file name without its extension and used it as the variable name, and I read the file into its own DataFrame with Spark's `read.csv` method, treating the first row as a header and letting Spark infer the column types. Each DataFrame was appended to a list and also stored in the notebook's global namespace under its file name, so that every table (for example `DimCustomer` or `FactInternetSales`) can be accessed directly by name in later cells. Finally, I called `show()` on each DataFrame to confirm that the import succeeded.

In [2]:
from pyspark.sql import SparkSession
import os

def load_csv_files_into_dataframes(folder_path):
    """
    Load multiple CSV files into Spark DataFrames and save them in the environment with their file names.

    Parameters:
    - folder_path (str): The path to the folder containing the CSV files.

    Returns:
    - List of Spark DataFrames.
    """
    # Initialize a Spark session
    spark = SparkSession.builder.appName("CSVLoaderFunction").getOrCreate()

    # Get a list of all CSV files in the folder
    csv_files = [file for file in os.listdir(folder_path) if file.endswith('.csv')]

    # Create a list to store the DataFrames
    dataframes = []

    # Create variables for each DataFrame
    for csv_file in csv_files:
        # Use the file name (without extension) as the variable name
        df_name = os.path.splitext(csv_file)[0]
        # Read the CSV file into a DataFrame
        df = spark.read.csv(os.path.join(folder_path, csv_file), header=True, inferSchema=True)
        # Save the DataFrame in the environment with its file name
        globals()[df_name] = df
        # Append the DataFrame to the list
        dataframes.append(df)

    return dataframes

# Example usage:
folder_path = r'C:\Users\neste\OneDrive\Desktop\karanja\DataSet_final\DataSet_final'
loaded_dataframes = load_csv_files_into_dataframes(folder_path)

# Show the contents of each DataFrame
DimGeography.show()
DimAccount.show()
DimCurrency.show()
DimCustomer.show()
DimDate.show()
DimDepartmentGroup.show()
DimOrganization.show()
DimProduct.show()
DimProductCategory.show()
DimProductSubcategory.show()
DimPromotion.show()
DimReseller.show()
DimSalesReason.show()
DimSalesTerritory.show()
DimScenario.show()
FactCallCenter.show()
FactCurrencyRate.show()
FactInternetSales.show()



+------------+--------------+-----------------+-----------------+-----------------+------------------------+------------------------+-----------------------+----------+-----------------+----------------+
|GeographyKey|          City|StateProvinceCode|StateProvinceName|CountryRegionCode|EnglishCountryRegionName|SpanishCountryRegionName|FrenchCountryRegionName|PostalCode|SalesTerritoryKey|IpAddressLocator|
+------------+--------------+-----------------+-----------------+-----------------+------------------------+------------------------+-----------------------+----------+-----------------+----------------+
|           1|    Alexandria|              NSW|  New South Wales|               AU|               Australia|               Australia|              Australie|      2015|                9|    198.51.100.2|
|           2| Coffs Harbour|              NSW|  New South Wales|               AU|               Australia|               Australia|              Australie|      2450|                
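
The loader above publishes each DataFrame through `globals()`, which works in a notebook but hides where names such as `DimCustomer` come from. As a possible alternative, the minimal sketch below returns a dictionary keyed by file name instead. The function name `load_csv_files` and the injected `read_csv` callable are hypothetical and not part of the original notebook; the callable is passed in so that the file-discovery logic can be exercised without a running Spark session.

```python
import os
from typing import Any, Callable, Dict


def load_csv_files(folder_path: str, read_csv: Callable[[str], Any]) -> Dict[str, Any]:
    """Map each CSV file name (without extension) to whatever read_csv returns for it.

    read_csv is a hypothetical hook: any callable that takes a file path.
    """
    tables: Dict[str, Any] = {}
    for file_name in sorted(os.listdir(folder_path)):
        # Same filter as the original loader: keep only '.csv' files
        if not file_name.endswith(".csv"):
            continue
        table_name = os.path.splitext(file_name)[0]
        tables[table_name] = read_csv(os.path.join(folder_path, file_name))
    return tables


# Hypothetical usage with Spark (assumes pyspark is installed and configured):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("CSVLoaderFunction").getOrCreate()
# tables = load_csv_files(
#     folder_path,
#     lambda path: spark.read.csv(path, header=True, inferSchema=True),
# )
# tables["DimCustomer"].show(5)
```

With this shape, later cells would refer to `tables["DimProduct"]` rather than a bare `DimProduct`, which makes the dependency on the loading step explicit.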

# Data preparation


In [10]:
# Select the specific columns from DimProduct
selected_columns = ["ProductKey", "ProductSubcategoryKey", "EnglishProductName", "FinishedGoodsFlag",
                    "Color", "SafetyStockLevel", "ReorderPoint", "SizeRange", "DaysToManufacture"]

dim_product_selected = DimProduct.select(selected_columns)

# Perform the join operation on ProductKey
prepared_data = FactInternetSales.join(dim_product_selected, "ProductKey")


# Select the specific columns from DimCustomer, including DateFirstPurchase
customer_selected_columns = ["CustomerKey", "GeographyKey", "NameStyle", "BirthDate", "MaritalStatus", "Gender",
                              "YearlyIncome", "TotalChildren", "NumberChildrenAtHome", "EnglishEducation",
                              "EnglishOccupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance",
                              "DateFirstPurchase"]

dim_customer_selected = DimCustomer.select(customer_selected_columns)

# Perform the join operation on CustomerKey
prepared_data = prepared_data.join(dim_customer_selected, "CustomerKey")

# Select the specific columns from DimPromotion
promotion_selected_columns = ["PromotionKey", "EnglishPromotionName", "DiscountPct", "EnglishPromotionCategory", "MinQty"]

dim_promotion_selected = DimPromotion.select(promotion_selected_columns)

# Perform the join operation on PromotionKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_promotion_selected, "PromotionKey")


# Select the specific column from DimCurrency
currency_selected_columns = ["CurrencyKey", "CurrencyName"]

dim_currency_selected = DimCurrency.select(currency_selected_columns)

# Perform the join operation on CurrencyKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_currency_selected, "CurrencyKey")


# Select the specific columns from DimSalesTerritory
sales_territory_selected_columns = ["SalesTerritoryKey", "SalesTerritoryRegion", "SalesTerritoryCountry"]

dim_sales_territory_selected = DimSalesTerritory.select(sales_territory_selected_columns)

# Perform the join operation on SalesTerritoryKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_sales_territory_selected, "SalesTerritoryKey")



# Select the specific columns from DimProductSubcategory
product_subcategory_selected_columns = ["ProductSubcategoryKey", "EnglishProductSubcategoryName"]

dim_product_subcategory_selected = DimProductSubcategory.select(product_subcategory_selected_columns)

# Perform the join operation on ProductSubcategoryKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_product_subcategory_selected, "ProductSubcategoryKey")

# Select the specific columns from DimGeography
geography_selected_columns = ["GeographyKey", "City", "StateProvinceName"]

dim_geography_selected = DimGeography.select(geography_selected_columns)

# Perform the join operation on GeographyKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_geography_selected, "GeographyKey")

# Show the resulting DataFrame
prepared_data.show()

+------------+---------------------+-----------------+-----------+------------+-----------+----------+------------+----------+-----------+----------------+--------------------+--------------+-------------+---------+--------------+--------------------+--------------+-------------------+----------------+-----------+--------+-------+---------------------+----------------+---------+-------+--------+--------------------+-----------------+------+----------------+------------+---------+-----------------+---------+----------+-------------+------+------------+-------------+--------------------+-------------------+-----------------+--------------+---------------+---------------+-----------------+--------------------+-----------+------------------------+------+--------------------+--------------------+---------------------+-----------------------------+-------------+-------------------+
|GeographyKey|ProductSubcategoryKey|SalesTerritoryKey|CurrencyKey|PromotionKey|CustomerKey|ProductKey|OrderDate
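
The seven select-then-join steps in the cell above all follow the same pattern, so they can be sketched as a single loop over a join plan. This is only an illustration under stated assumptions: `JOIN_PLAN` and `join_dimensions` are hypothetical names, and the function relies only on the `select` and `join` methods that PySpark DataFrames provide. The order of the plan matters, because `ProductSubcategoryKey` only becomes available after the `DimProduct` join and `GeographyKey` only after the `DimCustomer` join.

```python
# (dimension table, join key, extra columns to keep), in the same order as the cell above
JOIN_PLAN = [
    ("DimProduct", "ProductKey",
     ["ProductSubcategoryKey", "EnglishProductName", "FinishedGoodsFlag", "Color",
      "SafetyStockLevel", "ReorderPoint", "SizeRange", "DaysToManufacture"]),
    ("DimCustomer", "CustomerKey",
     ["GeographyKey", "NameStyle", "BirthDate", "MaritalStatus", "Gender",
      "YearlyIncome", "TotalChildren", "NumberChildrenAtHome", "EnglishEducation",
      "EnglishOccupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance",
      "DateFirstPurchase"]),
    ("DimPromotion", "PromotionKey",
     ["EnglishPromotionName", "DiscountPct", "EnglishPromotionCategory", "MinQty"]),
    ("DimCurrency", "CurrencyKey", ["CurrencyName"]),
    ("DimSalesTerritory", "SalesTerritoryKey",
     ["SalesTerritoryRegion", "SalesTerritoryCountry"]),
    ("DimProductSubcategory", "ProductSubcategoryKey",
     ["EnglishProductSubcategoryName"]),
    ("DimGeography", "GeographyKey", ["City", "StateProvinceName"]),
]


def join_dimensions(fact, dimensions, plan=JOIN_PLAN, how="inner"):
    """Join the selected columns of each dimension table onto the fact table.

    fact: the fact DataFrame (FactInternetSales in this notebook).
    dimensions: dict mapping table name to DataFrame (hypothetical container).
    how: join type; "inner" matches Spark's default when no type is given.
    """
    result = fact
    for table_name, key, extra_columns in plan:
        dim = dimensions[table_name].select([key] + extra_columns)
        result = result.join(dim, on=key, how=how)
    return result


# Hypothetical usage, assuming the DataFrames loaded earlier are in scope:
# dims = {"DimProduct": DimProduct, "DimCustomer": DimCustomer, ...}
# prepared_data = join_dimensions(FactInternetSales, dims)
```

Because the joins are inner joins, any sales row whose key has no match in a dimension table (or whose key is null) would be dropped. Whether that happens in this dataset is not shown here; comparing `FactInternetSales.count()` with `prepared_data.count()` is one way to check, and passing `how="left"` would keep such rows.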

In [9]:
prepared_data.columns

['ProductSubcategoryKey',
 'SalesTerritoryKey',
 'CurrencyKey',
 'PromotionKey',
 'CustomerKey',
 'ProductKey',
 'OrderDateKey',
 'DueDateKey',
 'ShipDateKey',
 'SalesOrderNumber',
 'SalesOrderLineNumber',
 'RevisionNumber',
 'OrderQuantity',
 'UnitPrice',
 'ExtendedAmount',
 'UnitPriceDiscountPct',
 'DiscountAmount',
 'ProductStandardCost',
 'TotalProductCost',
 'SalesAmount',
 'TaxAmt',
 'Freight',
 'CarrierTrackingNumber',
 'CustomerPONumber',
 'OrderDate',
 'DueDate',
 'ShipDate',
 'EnglishProductName',
 'FinishedGoodsFlag',
 'Color',
 'SafetyStockLevel',
 'ReorderPoint',
 'SizeRange',
 'DaysToManufacture',
 'GeographyKey',
 'NameStyle',
 'BirthDate',
 'MaritalStatus',
 'Gender',
 'YearlyIncome',
 'TotalChildren',
 'NumberChildrenAtHome',
 'EnglishEducation',
 'EnglishOccupation',
 'HouseOwnerFlag',
 'NumberCarsOwned',
 'CommuteDistance',
 'DateFirstPurchase',
 'EnglishPromotionName',
 'DiscountPct',
 'EnglishPromotionCategory',
 'MinQty',
 'CurrencyName',
 'SalesTerritoryRegion',


In [11]:
len(prepared_data.columns)

58
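
The count of 58 can be cross-checked by hand. When two DataFrames are joined on a single shared column name, the key appears once in the result, so each join adds the number of selected dimension columns minus one. The sketch below does that arithmetic. The figure of 26 columns for `FactInternetSales` is an assumption read off the column listing above (the entries from `SalesTerritoryKey` through `ShipDate`), and the helper name `expected_column_count` is hypothetical.

```python
def expected_column_count(fact_column_count, selected_column_counts):
    """Columns after joining each dimension on one shared key.

    Every join keeps a single copy of the key, so it contributes
    (selected columns - 1) new columns.
    """
    return fact_column_count + sum(n - 1 for n in selected_column_counts)


# Number of columns selected from each dimension table in the preparation cell
SELECTED_COLUMN_COUNTS = {
    "DimProduct": 9,
    "DimCustomer": 15,
    "DimPromotion": 5,
    "DimCurrency": 2,
    "DimSalesTerritory": 3,
    "DimProductSubcategory": 2,
    "DimGeography": 3,
}

# 26 is an assumed column count for FactInternetSales (see the note above)
print(expected_column_count(26, SELECTED_COLUMN_COUNTS.values()))  # 58
```

If the two numbers ever disagree, a likely cause is a non-key column name that exists in more than one table, since such a column would appear twice after the join.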