# PROBLEM STATEMENT

Retail companies want to understand how their customers consume so that they can better predict future purchases. My primary objective is to develop a predictive model that anticipates the specific products each customer is likely to buy. Using a dataset of customer details, product information, and historical order data, the goal is to identify the patterns and trends that drive purchasing decisions. The model will incorporate customer demographics, order history, and product attributes, and I will explore feature engineering techniques to create features that improve prediction accuracy. The intended outcome is a set of targeted product recommendations that marketing and sales teams can act on to improve customer satisfaction and increase sales, supporting informed decision-making and sustained growth.


**Research Questions:**

1. **RQ1: Holistic Customer Segmentation and Demographics Analysis**
   - How can I develop a comprehensive customer segmentation strategy, considering a diverse range of demographic, psychographic, and behavioral characteristics?
   - What influence do cultural and regional factors have on my customers' preferences, and how can I incorporate these nuances into my segmentation strategies?

2. **RQ2: In-depth Exploration of Customer Consumption Patterns**
   - What are the longitudinal aspects of my customers' consumption patterns, and how do these evolve over time?
   - How can I integrate machine learning algorithms to identify subtle patterns, anomalies, and outliers in my large-scale historical order data?

3. **RQ3: Advancing My Predictive Modeling for Personalized Recommendations**
   - In addition to traditional predictive modeling, how can I incorporate advanced techniques such as deep learning to achieve more nuanced and personalized product recommendations?
   - What role do external factors like macroeconomic indicators and social trends play in enhancing the accuracy of my predictive models?

4. **RQ4: Feature Engineering for Multi-dimensional Insights**
   - How can I leverage not only my customer details, product information, and order history but also incorporate external data sources for comprehensive feature engineering?
   - What techniques can I employ to assess the importance and interaction of various features in driving predictive accuracy?

5. **RQ5: Actionable Insights Across My Organization**
   - Beyond marketing and sales, how can actionable insights be extended to other touchpoints such as customer support and product development?
   - What strategies can I implement to ensure seamless communication and application of insights across different departments within my organization?

**Objectives:**

1. **Objective 1: Integration of Multifaceted Customer Segmentation**
   - I aim to develop a comprehensive framework for customer segmentation that considers a diverse set of demographic, psychographic, and behavioral factors.

2. **Objective 2: Longitudinal Analysis of Customer Consumption Patterns**
   - I will conduct an in-depth longitudinal analysis of my customers' consumption patterns, exploring temporal variations and seasonality.

3. **Objective 3: Integration of Advanced Predictive Modeling Techniques**
   - I intend to explore and implement advanced predictive modeling techniques, including deep learning, to provide more personalized and accurate product recommendations.

4. **Objective 4: Comprehensive Feature Engineering Framework**
   - I am dedicated to developing a robust framework for feature engineering that incorporates not only traditional customer and product data but also external data sources for a multi-dimensional view.

5. **Objective 5: Cross-Functional Application of Actionable Insights**
   - I will establish protocols and communication channels for the dissemination and application of actionable insights across various departments, fostering a holistic organizational approach.

6. **Objective 6: Real-time Adaptation and Continuous Improvement**
   - I aim to implement mechanisms for real-time adaptation of strategies based on evolving consumer trends, ensuring continuous improvement in the accuracy and relevance of my predictive models.


# Data cleaning and preparation

# data importation


To import the data, I started a Spark session with PySpark and pointed it at the folder containing the CSV files. I listed the folder's contents and kept only the files with the `.csv` extension. For each file, I took the file name without its extension as a variable name and read the file into its own DataFrame with Spark's `read.csv`, treating the first row as a header and letting Spark infer the column types. Each DataFrame was appended to a list and also registered in the notebook's global namespace under its file name, so that tables such as `DimCustomer` or `FactInternetSales` can be referenced directly in later cells. Finally, calling `show()` on each DataFrame confirmed that every dataset had loaded and was ready for analysis.

In [1]:
from pyspark.sql import SparkSession
import os

def load_csv_files_into_dataframes(folder_path):
    """
    Load multiple CSV files into Spark DataFrames and save them in the environment with their file names.

    Parameters:
    - folder_path (str): The path to the folder containing the CSV files.

    Returns:
    - List of Spark DataFrames.
    """
    # Initialize a Spark session
    spark = SparkSession.builder.appName("CSVLoaderFunction").getOrCreate()

    # Get a list of all CSV files in the folder
    csv_files = [file for file in os.listdir(folder_path) if file.endswith('.csv')]

    # Create a list to store the DataFrames
    dataframes = []

    # Create variables for each DataFrame
    for csv_file in csv_files:
        # Use the file name (without extension) as the variable name
        df_name = os.path.splitext(csv_file)[0]
        # Read the CSV file into a DataFrame
        df = spark.read.csv(os.path.join(folder_path, csv_file), header=True, inferSchema=True)
        # Save the DataFrame in the environment with its file name
        globals()[df_name] = df
        # Append the DataFrame to the list
        dataframes.append(df)

    return dataframes

# Example usage:
folder_path = r'C:\Users\neste\OneDrive\Desktop\karanja\DataSet_final\DataSet_final'
loaded_dataframes = load_csv_files_into_dataframes(folder_path)

# Show the contents of each DataFrame
DimGeography.show()
DimAccount.show()
DimCurrency.show()
DimCustomer.show()
DimDate.show()
DimDepartmentGroup.show()
DimOrganization.show()
DimProduct.show()
DimProductCategory.show()
DimProductSubcategory.show()
DimPromotion.show()
DimReseller.show()
DimSalesReason.show()
DimSalesTerritory.show()
DimScenario.show()
FactCallCenter.show()
FactCurrencyRate.show()
FactInternetSales.show()
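
The loader above registers each DataFrame through `globals()`. A possible alternative is to keep the tables in a dictionary keyed by file name, which avoids creating variables implicitly. The following is a minimal sketch of the file discovery step only, in plain Python; the function name `discover_csv_files` and the demo file names are assumptions made for illustration, and each returned path would still need to be passed to `spark.read.csv` as in the cell above.

```python
import os
import tempfile

def discover_csv_files(folder_path):
    """Map each CSV file's stem (name without extension) to its full path."""
    tables = {}
    for file_name in sorted(os.listdir(folder_path)):
        stem, extension = os.path.splitext(file_name)
        if extension.lower() == ".csv":
            tables[stem] = os.path.join(folder_path, file_name)
    return tables

# Demonstration on a throwaway folder with hypothetical file names
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("DimCustomer.csv", "FactInternetSales.csv", "notes.txt"):
        open(os.path.join(demo_folder, name), "w").close()
    demo_tables = discover_csv_files(demo_folder)
    demo_names = list(demo_tables)

print(demo_names)  # ['DimCustomer', 'FactInternetSales']
```

With this layout, a table would be accessed as `tables["DimCustomer"]` rather than as a bare variable name.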



# data preparation

## data joining 

To build a single dataset for analysis, I joined the sales fact table to a series of dimension tables, each adding a further set of attributes. I first joined `FactInternetSales` with `DimProduct` on `ProductKey`, which brings in product details such as the subcategory key, the product name, and other key attributes.

Next, I joined `DimCustomer` on `CustomerKey` to add customer-specific information. This makes it possible to examine customer behavior alongside demographics, the date of first purchase, and other relevant details.

I then joined `DimPromotion` on `PromotionKey` to add promotional details such as promotion names, discount percentages, and categories. These attributes describe the offer attached to each sale, which enables a closer examination of how promotions affect sales patterns.

For the financial dimension, I joined `DimCurrency` on `CurrencyKey`. This attaches the currency name to each sale, so that monetary values can be interpreted in the currency in which they were recorded.

I added sales territory details by joining `DimSalesTerritory` on `SalesTerritoryKey`. This gives the sales data geographical context and supports analyses of regional trends, market performance, and customer distribution.

To describe product types more precisely, I joined `DimProductSubcategory` on `ProductSubcategoryKey`, which supplies the subcategory name for each product and supports insights into product preferences and market segments.

Finally, I joined `DimGeography` on `GeographyKey` to add city and state or province information, providing a basis for exploring regional variation and customer distribution across cities.

Each join was chosen to make the dataset more complete. By integrating products, customers, promotions, currencies, sales territories, and geography, the resulting dataset supports the analyses that follow and data-driven decision-making.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, expr

# Select the specific columns from DimProduct
selected_columns = ["ProductKey", "ProductSubcategoryKey", "EnglishProductName", "FinishedGoodsFlag",
                    "Color", "SafetyStockLevel", "ReorderPoint", "SizeRange", "DaysToManufacture"]

dim_product_selected = DimProduct.select(selected_columns)

# Perform the join operation on ProductKey
prepared_data = FactInternetSales.join(dim_product_selected, "ProductKey")


# Select the specific columns from DimCustomer, including DateFirstPurchase
customer_selected_columns = ["CustomerKey", "GeographyKey", "NameStyle", "BirthDate", "MaritalStatus", "Gender",
                              "YearlyIncome", "TotalChildren", "NumberChildrenAtHome", "EnglishEducation",
                              "EnglishOccupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance",
                              "DateFirstPurchase"]

dim_customer_selected = DimCustomer.select(customer_selected_columns)

# Perform the join operation on CustomerKey
prepared_data = prepared_data.join(dim_customer_selected, "CustomerKey")

# Select the specific columns from DimPromotion
promotion_selected_columns = ["PromotionKey", "EnglishPromotionName", "DiscountPct", "EnglishPromotionCategory", "MinQty"]

dim_promotion_selected = DimPromotion.select(promotion_selected_columns)

# Perform the join operation on PromotionKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_promotion_selected, "PromotionKey")


# Select the specific column from DimCurrency
currency_selected_columns = ["CurrencyKey", "CurrencyName"]

dim_currency_selected = DimCurrency.select(currency_selected_columns)

# Perform the join operation on CurrencyKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_currency_selected, "CurrencyKey")


# Select the specific columns from DimSalesTerritory
sales_territory_selected_columns = ["SalesTerritoryKey", "SalesTerritoryRegion", "SalesTerritoryCountry"]

dim_sales_territory_selected = DimSalesTerritory.select(sales_territory_selected_columns)

# Perform the join operation on SalesTerritoryKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_sales_territory_selected, "SalesTerritoryKey")



# Select the specific columns from DimProductSubcategory
product_subcategory_selected_columns = ["ProductSubcategoryKey", "EnglishProductSubcategoryName"]

dim_product_subcategory_selected = DimProductSubcategory.select(product_subcategory_selected_columns)

# Perform the join operation on ProductSubcategoryKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_product_subcategory_selected, "ProductSubcategoryKey")

# Select the specific columns from DimGeography
geography_selected_columns = ["GeographyKey", "City", "StateProvinceName"]

dim_geography_selected = DimGeography.select(geography_selected_columns)

# Perform the join operation on GeographyKey with the existing prepared_data DataFrame
prepared_data = prepared_data.join(dim_geography_selected, "GeographyKey")

# Convert the numeric representation to a string and then to a DateType
prepared_data = prepared_data.withColumn(
    "OrderDate",
    to_date(expr("cast(OrderDateKey as string)"), "yyyyMMdd")
)

# Show the resulting DataFrame
prepared_data.show()
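
The last step of the cell above turns `OrderDateKey`, an integer in `yyyyMMdd` form, into a date by casting it to a string and parsing it. As a rough plain-Python equivalent for a single value (a sketch only; `parse_date_key` and the sample key are hypothetical, and Python's `%Y%m%d` pattern stands in for Spark's `yyyyMMdd`):

```python
from datetime import date, datetime

def parse_date_key(date_key):
    """Convert an integer date key such as 20130115 into a datetime.date."""
    return datetime.strptime(str(date_key), "%Y%m%d").date()

example_date = parse_date_key(20130115)  # hypothetical key value
print(example_date, example_date.month, example_date.year)  # 2013-01-15 1 2013
```

The month and year read off the parsed date correspond to what Spark's `month` and `year` functions extract from a date column.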

In [None]:
from pyspark.sql import DataFrame
from pyspark.sql.functions import expr, month, to_date, year

def perform_joins(fact_df: DataFrame, *dimension_dfs: DataFrame, join_keys: list, date_column: str = None):
    """
    Perform a series of joins between a Fact DataFrame and multiple Dimension DataFrames.

    Parameters:
    - fact_df: The Fact DataFrame to start with.
    - dimension_dfs: A variable number of Dimension DataFrames to join with the Fact DataFrame.
    - join_keys: A list of join keys to be used in the order of the joins.
    - date_column: Optional. If provided, convert this numeric column to DateType.

    Returns:
    - The resulting DataFrame after all the specified joins.
    """
    prepared_data = fact_df

    for dimension_df, join_key in zip(dimension_dfs, join_keys):
        prepared_data = prepared_data.join(dimension_df, join_key)

    if date_column:
        # Convert the numeric representation to a string and then to a DateType
        prepared_data = prepared_data.withColumn(
            "OrderDate",
            to_date(expr(f"cast({date_column} as string)"), "yyyyMMdd")
        )

    return prepared_data

# Example usage:
prepared_data = perform_joins(
    FactInternetSales,
    DimProduct,
    DimCustomer,
    DimPromotion,
    DimCurrency,
    DimSalesTerritory,
    DimProductSubcategory,
    DimGeography,
    join_keys=["ProductKey", "CustomerKey", "PromotionKey", "CurrencyKey", "SalesTerritoryKey", "ProductSubcategoryKey", "GeographyKey"],
    date_column="OrderDateKey"
)

# Extract order month and order year
prepared_data = prepared_data.withColumn("OrderMonth", month("OrderDate"))
prepared_data = prepared_data.withColumn("OrderYear", year("OrderDate"))




# Show the resulting DataFrame
prepared_data.show()
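
`perform_joins` pairs each dimension table with its join key using `zip`, which stops at the shorter input without raising an error. If a key is accidentally left out of `join_keys`, the final join is therefore skipped silently. One way to guard against this, sketched here in plain Python with a hypothetical helper name and table names passed as strings for illustration, is to compare the lengths before pairing:

```python
def pair_dimensions_with_keys(dimension_names, join_keys):
    """Pair each dimension with its join key, refusing mismatched lengths."""
    if len(dimension_names) != len(join_keys):
        raise ValueError(
            f"{len(dimension_names)} dimensions but {len(join_keys)} join keys"
        )
    return list(zip(dimension_names, join_keys))

plan = pair_dimensions_with_keys(
    ["DimProduct", "DimCustomer"], ["ProductKey", "CustomerKey"]
)
print(plan)  # [('DimProduct', 'ProductKey'), ('DimCustomer', 'CustomerKey')]

try:
    pair_dimensions_with_keys(["DimProduct", "DimCustomer"], ["ProductKey"])
except ValueError as error:
    mismatch_message = str(error)
    print(mismatch_message)  # 2 dimensions but 1 join keys
```

The same length check could be placed at the top of `perform_joins` before its loop.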

# cleaned data 
To refine the dataset, I excluded columns that contribute little to the analytical goals. These include keys, identifiers, and descriptive attributes (for example, translated names and contact details) that may be useful in other contexts but are not needed for the current exploration.

The result is a more focused DataFrame named `cleaned_data`. Removing non-essential columns keeps the retained data closely aligned with the insights I want to derive and makes subsequent analysis more efficient.

In [None]:
# Columns to drop; each name is listed once
columns_to_exclude = [
    "GeographyKey", "ProductSubcategoryKey", "SalesTerritoryKey", "CurrencyKey", "PromotionKey",
    "CustomerKey", "ProductKey", "OrderDateKey", "DueDateKey", "ShipDateKey", "SalesOrderNumber",
    "SalesOrderLineNumber", "RevisionNumber", "OrderQuantity", "CarrierTrackingNumber",
    "CustomerPONumber", "OrderDate", "DueDate", "ShipDate", "ProductAlternateKey",
    "SpanishProductName", "FrenchProductName", "EnglishDescription", "StartDate", "EndDate", "Status",
    "CustomerAlternateKey", "Title", "FirstName", "MiddleName", "LastName", "BirthDate", "EmailAddress",
    "SpanishEducation", "FrenchEducation", "EnglishOccupation", "SpanishOccupation", "FrenchOccupation",
    "AddressLine1", "AddressLine2", "Phone", "DateFirstPurchase", "SpanishPromotionName",
    "FrenchPromotionName", "EnglishPromotionType", "SpanishPromotionType", "FrenchPromotionType",
    "SpanishPromotionCategory", "FrenchPromotionCategory", "MaxQty",
    "CurrencyAlternateKey", "ProductCategoryKey", "StateProvinceCode", "CountryRegionCode",
    "SpanishCountryRegionName", "FrenchCountryRegionName", "PostalCode",
    "IpAddressLocator", "SalesTerritoryAlternateKey", "SalesTerritoryRegion",
    "ProductSubcategoryAlternateKey", "Suffix"
]

# Drop the specified columns from the DataFrame and assign the result to "cleaned_data"
cleaned_data = prepared_data.drop(*columns_to_exclude)

# Show the resulting cleaned DataFrame
cleaned_data.show()
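
Spark's `drop` ignores names that are not present in the DataFrame, so a misspelled entry in a long exclusion list goes unnoticed. A small check before dropping can separate the names that exist from those that do not. The sketch below works on plain lists of column names; the helper name `split_exclusions` and the sample names (including the deliberately misspelled `Sufix`) are assumptions made for illustration.

```python
def split_exclusions(existing_columns, columns_to_exclude):
    """Return (present, missing) exclusion names, de-duplicated in first-seen order."""
    existing = set(existing_columns)
    unique_exclusions = list(dict.fromkeys(columns_to_exclude))
    present = [name for name in unique_exclusions if name in existing]
    missing = [name for name in unique_exclusions if name not in existing]
    return present, missing

# Hypothetical column names, chosen only to illustrate the check
demo_columns = ["CustomerKey", "Gender", "YearlyIncome", "StartDate"]
demo_exclusions = ["CustomerKey", "StartDate", "StartDate", "Sufix"]

present, missing = split_exclusions(demo_columns, demo_exclusions)
print(present)  # ['CustomerKey', 'StartDate']
print(missing)  # ['Sufix']
```

In the notebook, the first argument would be `prepared_data.columns`, and a non-empty `missing` list would point to names worth double-checking.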


In [None]:
len(cleaned_data.columns)

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.sql.functions import col
from pyspark.sql.types import BooleanType, StringType

# Specify the candidate categorical columns to convert
categorical_columns = ["MaritalStatus", "Gender", "CommuteDistance",
                        "EnglishEducation", "EnglishPromotionName", "CurrencyName",
                        "SalesTerritoryRegion", "SalesTerritoryCountry", "SalesTerritoryGroup",
                        "StateProvinceName", "EnglishCountryRegionName", "City"]

# Keep only the candidates still present in cleaned_data; looking up a dropped
# column (e.g. SalesTerritoryRegion) in the schema would raise a KeyError
categorical_columns = [c for c in categorical_columns if c in cleaned_data.columns]

# Identify non-categorical columns
non_categorical_columns = [c for c in cleaned_data.columns if c not in categorical_columns]

# Convert boolean columns to string. dataType is a DataType object, so it is
# checked with isinstance; comparing it to the string 'boolean' is always False
boolean_columns = [c for c in categorical_columns
                   if isinstance(cleaned_data.schema[c].dataType, BooleanType)]
for boolean_col in boolean_columns:
    cleaned_data = cleaned_data.withColumn(boolean_col, col(boolean_col).cast("string"))

# Categorical columns that are string-typed after the cast
string_columns = [c for c in categorical_columns
                  if isinstance(cleaned_data.schema[c].dataType, StringType)]

# Create StringIndexers for text columns
string_indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_index", handleInvalid="keep")
                   for c in string_columns]

# Create a pipeline to apply the StringIndexers
string_indexer_pipeline = Pipeline(stages=string_indexers)

# Fit and transform the pipeline
transformed_data = string_indexer_pipeline.fit(cleaned_data).transform(cleaned_data)

# Create OneHotEncoders for indexed columns
one_hot_encoders = [OneHotEncoder(inputCol=f"{c}_index", outputCol=f"{c}_encoded")
                    for c in string_columns]

# Create a pipeline to apply the OneHotEncoders
one_hot_encoder_pipeline = Pipeline(stages=one_hot_encoders)

# Fit and transform the pipeline
transformed_data = one_hot_encoder_pipeline.fit(transformed_data).transform(transformed_data)

# Select non-categorical columns and the encoded columns
selected_columns = non_categorical_columns + [f"{c}_encoded" for c in string_columns]
transformed_data = transformed_data.select(selected_columns)

# Show the result
transformed_data.show()
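
The cell above encodes each categorical column in two steps: labels are first mapped to integer indices, and each index is then expanded into an indicator vector. The plain-Python sketch below illustrates that idea on a handful of hypothetical values. It is not Spark's implementation: Spark's `OneHotEncoder` returns sparse vectors and by default drops the last category, and `handleInvalid="keep"` in `StringIndexer` reserves an additional index for unseen labels, none of which is modelled here.

```python
from collections import Counter

def index_labels(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given position."""
    return [1 if position == index else 0 for position in range(size)]

# Hypothetical values for a categorical column
marital_status = ["M", "S", "M", "M", "S"]
label_to_index = index_labels(marital_status)
encoded = [one_hot(label_to_index[v], len(label_to_index)) for v in marital_status]

print(label_to_index)  # {'M': 0, 'S': 1}
print(encoded[:2])     # [[1, 0], [0, 1]]
```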


In [None]:
cleaned_data.columns