# Machine Learning Pipeline for FinMark Corporation

## Project Overview
This notebook implements a comprehensive machine learning pipeline for analyzing and predicting customer behavior, sales patterns, and business performance.

## Business Objectives
1. Automate sales forecasting and analysis
2. Develop data-driven customer segmentation
3. Identify key purchasing patterns and trends
4. Enable predictive analytics for business decisions

## Data Sources
1. Customer Data
   - Demographics
   - Company information
   - Historical behavior
   
2. Product Data
   - Product details
   - Pricing information
   - Category classifications
   
3. Transaction Data
   - Purchase history
   - Temporal patterns
   - Customer-product interactions

## Expected Outcomes
- Accurate sales predictions
- Customer segmentation insights
- Actionable business recommendations
- Automated analysis pipeline

## Library Setup and Data Preparation

### Libraries Used
1. Data Processing
   - pandas: Data manipulation and analysis
   - numpy: Numerical operations
   
2. Machine Learning
   - sklearn: Model implementation and evaluation
   - statsmodels: Statistical analysis
   
3. Visualization
   - matplotlib: Basic plotting
   - seaborn: Advanced visualizations

### Why These Libraries?
- Industry standard implementations
- Robust data handling capabilities
- Efficient processing of large datasets
- Comprehensive ML toolkit

In [None]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Load datasets: transactions, products and customers
transactions = pd.read_csv('transactions_data.csv')
products = pd.read_csv('products_data.csv')
customers = pd.read_csv('customers_data.csv')

## Data Loading and Initial Processing

### Data Integration Strategy
1. Customer Data Loading
   - Company information
   - Demographics
   - Business characteristics
   
2. Product Data Processing
   - Standardizing product information
   - Price normalization
   - Category harmonization
   
3. Transaction Data Preparation
   - Date standardization
   - Missing value handling
   - Anomaly detection

### Why This Approach?
- Ensures data quality at source
- Maintains data relationships
- Enables efficient analysis
- Prepares for model input
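
A quick per-column profile can back up the quality goals listed above before any cleaning starts. The following is a minimal sketch, not part of the original notebook: `summarize_frame` and the toy `demo_customers` table are hypothetical names and values used only for illustration.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(dropna=True),
    })

# Toy stand-in for the customers table (illustrative values only).
demo_customers = pd.DataFrame({
    "Company_ID": [1.0, None, 3.0],
    "Company_Name": ["A Corp", "B Corp", "C Corp"],
    "Company_Profit": [100.0, 200.0, None],
})

customer_summary = summarize_frame(demo_customers)
print(customer_summary)
```

Running the same helper on `customers`, `products`, and `transactions` right after loading would show which columns need attention first.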

In [None]:
customers.head(20)

Unnamed: 0,Company_ID,Company_Name,Company_Profit,Address
0,1.0,Tech Enterprises 1,80701.0,"EDSA, Barangay 606, Pasig, Philippines"
1,2.0,Global Partners 2,80511.0,"Commonwealth Ave, Barangay 789, Taguig, Philip..."
2,3.0,Quantum Associates 3,110664.0,"Roxas Blvd, Barangay 505, Pasig, Philippines"
3,4.0,Prime Network 4,,"Alabang-Zapote Rd, Barangay 202, Taguig, Phili..."
4,5.0,Elite Ventures 5,69427.0,"Ayala Avenue, Barangay 101, Makati, Philippines"
5,,Elite Network 6,36967.0,"Katipunan Ave, Barangay 707, Davao City, Phili..."
6,7.0,Dynamic Solutions 7,36661.0,"Commonwealth Ave, Barangay 303, Pasig, Philipp..."
7,,Green Enterprises 8,107952.0,"Roxas blvd, barangay 404, manila, philippines"
8,9.0,Global Enterprises 9,96046.0,"Slex, barangay 123, pasig, philippines"
9,10.0,Pioneer Network 10,65200.0,"Katipunan Ave, Barangay 101, Mandaluyong, Phil..."


Assigning Sequential Company_IDs to Handle Missing Values

In [None]:
# Overwrite Company_ID with sequential numbers starting from 1.
# Note: this replaces every value, not only NaN, and assumes row order matches ID order.
customers['Company_ID'] = range(1, len(customers) + 1)
customers

Unnamed: 0,Company_ID,Company_Name,Company_Profit,Address
0,1,Tech Enterprises 1,80701.0,"EDSA, Barangay 606, Pasig, Philippines"
1,2,Global Partners 2,80511.0,"Commonwealth Ave, Barangay 789, Taguig, Philip..."
2,3,Quantum Associates 3,110664.0,"Roxas Blvd, Barangay 505, Pasig, Philippines"
3,4,Prime Network 4,,"Alabang-Zapote Rd, Barangay 202, Taguig, Phili..."
4,5,Elite Ventures 5,69427.0,"Ayala Avenue, Barangay 101, Makati, Philippines"
...,...,...,...,...
95,96,Dynamic Network 96,101428.0,"Alabang-Zapote Rd, Brgy. 456, Cebu City, Phili..."
96,97,Quantum Holdings 97,33449.0,"EDSA, Barangay 789, Manila, Philippines"
97,98,Pioneer Ventures 98,71095.0,"Roxas Blvd, Barangay 123, Taguig, Philippines"
98,99,Elite Corp 99,107929.0,"Alabang-Zapote Rd, Barangay 303, Makati, Phili..."
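
The cell above replaces the whole `Company_ID` column, which works here because the existing IDs already follow row order. If that assumption is uncertain, a more conservative option is to fill only the gaps and check for collisions. This is a sketch under that assumption; `fill_missing_ids` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def fill_missing_ids(ids: pd.Series) -> pd.Series:
    """Fill gaps with position-based IDs (row position + 1), keeping existing IDs."""
    positional = pd.Series(range(1, len(ids) + 1), index=ids.index)
    filled = ids.fillna(positional).astype(int)
    # Guard against a filled value colliding with an ID that already exists.
    if filled.duplicated().any():
        raise ValueError("Filling by position produced duplicate IDs")
    return filled

# Toy column with two gaps (illustrative values only).
demo_ids = pd.Series([1.0, 2.0, None, 4.0, None])
filled_ids = fill_missing_ids(demo_ids)
print(filled_ids.tolist())
```

The duplicate check makes the row-order assumption explicit: the function fails loudly instead of silently assigning an ID that belongs to another company.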


Cleaning 'Product_Price': Removing Symbols and Converting to Numeric Format

In [None]:
# Clean 'Product_Price': remove "Php", "?", and thousands separators, then convert to float
products['Product_Price'] = products['Product_Price'].astype(str).replace({r'\?': '', 'Php': ''}, regex=True).str.replace(',', '', regex=False).astype(float)
products

Unnamed: 0,Product_ID,Product_Name,Product_Price
0,1.0,FinPredictor Suite,140000.0
1,2.0,MarketMinder Analytics,168000.0
2,3.0,TrendWise Forecaster,100800.0
3,4.0,CustomerScope Insights,123200.0
4,5.0,SalesSync Optimizer,84000.0
5,6.0,RevenueVue Dashboard,179200.0
6,7.0,DataBridge Integration Tool,151200.0
7,,RiskRadar Monitor,151200.0
8,9.0,Product 9,112000.0
9,10.0,SegmentX Targeting,89600.0
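
The price cleaning above removes a fixed set of symbols. A slightly more general variant strips everything except digits and the decimal point, and turns unparseable entries into NaN rather than raising. This is a sketch only: `clean_price` is a hypothetical helper, the sample strings are invented, and the approach assumes prices are non-negative and that the only periods in the raw text are decimal points.

```python
import pandas as pd

def clean_price(raw: pd.Series) -> pd.Series:
    """Keep only digits and the decimal point, then convert to numeric."""
    stripped = raw.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    # errors="coerce" maps anything that still cannot be parsed to NaN.
    return pd.to_numeric(stripped, errors="coerce")

# Toy price strings (illustrative values only).
demo_prices = pd.Series(["Php 140,000", "?168,000", "100800.50", "n/a"])
cleaned_prices = clean_price(demo_prices)
print(cleaned_prices)
```

Counting the NaN values in the result gives a quick measure of how many price entries could not be recovered.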


## Data Cleaning and Preprocessing

### Cleaning Steps
1. Missing Value Treatment
   - Identification strategy
   - Imputation methods
   - Validation approach
   
2. Data Type Standardization
   - Date formatting
   - Numerical conversion
   - Categorical encoding
   
3. Quality Checks
   - Duplicate detection
   - Outlier identification
   - Consistency validation

### Business Impact
- Improved data reliability
- Better model performance
- More accurate predictions
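
The outlier identification step listed under Quality Checks is not implemented in the cells that follow. One common way to do it is the interquartile-range (IQR) rule, sketched below. The function name `iqr_outlier_mask`, the multiplier `k=1.5`, and the sample values are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy quantities with one obviously extreme entry (illustrative values only).
demo_quantity = pd.Series([8, 9, 10, 11, 12, 95])
outliers = iqr_outlier_mask(demo_quantity)
print(demo_quantity[outliers])
```

Flagging rather than dropping keeps the decision about extreme values separate from their detection.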

Assigning Sequential Product_IDs to Handle Missing Values

In [None]:
# Overwrite Product_ID with sequential numbers starting from 1.
# Note: this replaces every value, not only NaN, and assumes row order matches ID order.
products['Product_ID'] = range(1, len(products) + 1)
products

Unnamed: 0,Product_ID,Product_Name,Product_Price
0,1,FinPredictor Suite,140000.0
1,2,MarketMinder Analytics,168000.0
2,3,TrendWise Forecaster,100800.0
3,4,CustomerScope Insights,123200.0
4,5,SalesSync Optimizer,84000.0
5,6,RevenueVue Dashboard,179200.0
6,7,DataBridge Integration Tool,151200.0
7,8,RiskRadar Monitor,151200.0
8,9,Product 9,112000.0
9,10,SegmentX Targeting,89600.0


Standardizing 'Transaction_Date' to YYYY-MM-DD Format

In [None]:
# Transactions: standardize 'Transaction_Date' to YYYY-MM-DD (format='mixed' requires pandas 2.0+)
transactions['Transaction_Date'] = pd.to_datetime(transactions['Transaction_Date'], format='mixed', errors='coerce').dt.strftime('%Y-%m-%d')
transactions

Unnamed: 0.1,Unnamed: 0,Transaction_ID,Company_ID,Product_ID,Quantity,Transaction_Date,Product_Price,Total_Cost
0,0.0,1.0,88.0,6.0,,2024-03-26,194379.147964,1075200.0
1,1.0,2.0,29.0,19.0,16.0,2024-07-09,97930.993380,1428000.0
2,2.0,,28.0,18.0,6.0,2024-04-13,126095.547778,940800.0
3,3.0,4.0,85.0,12.0,12.0,2023-09-06,,1008000.0
4,4.0,5.0,47.0,3.0,8.0,2021-07-06,99575.609634,705600.0
...,...,...,...,...,...,...,...,...
9995,9995.0,,,10.0,,2022-06-05,,627200.0
9996,9996.0,9997.0,39.0,2.0,9.0,2021-05-17,159518.597391,1512000.0
9997,9997.0,,90.0,1.0,15.0,2022-07-19,128137.094759,1960000.0
9998,9998.0,9999.0,33.0,,19.0,2021-04-15,81786.119894,1680000.0
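
With `errors='coerce'`, any date that cannot be parsed silently becomes missing. It can help to keep the raw strings that failed so they can be reviewed. The sketch below parses element by element, which avoids depending on `format='mixed'`. The helper `parse_dates_with_report` and the sample strings are hypothetical.

```python
import pandas as pd

def parse_dates_with_report(raw: pd.Series):
    """Parse each entry on its own; return parsed dates and the raw values that failed."""
    per_item = raw.apply(lambda v: pd.to_datetime(v, errors="coerce"))
    parsed = pd.to_datetime(per_item)
    # Entries that were present in the input but did not parse.
    failed = raw[parsed.isna() & raw.notna()]
    return parsed, failed

# Toy date strings in mixed formats (illustrative values only).
demo_dates = pd.Series(["2024-03-26", "07/09/2024", "not a date"])
parsed_dates, failed_dates = parse_dates_with_report(demo_dates)
print(parsed_dates.dt.strftime("%Y-%m-%d"))
print(failed_dates)
```

Element-wise parsing is slower than a vectorized call, so on a large table it may be preferable to use it only for the rows that the vectorized pass could not handle.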


Filling Missing Transaction_IDs with Sequential Numbers

In [None]:
# Fill NaN values in Transaction_ID with sequential numbers starting from 1
transactions['Transaction_ID'] = transactions['Transaction_ID'].fillna(pd.Series(range(1, len(transactions) + 1)))
transactions

Unnamed: 0.1,Unnamed: 0,Transaction_ID,Company_ID,Product_ID,Quantity,Transaction_Date,Product_Price,Total_Cost
0,0.0,1.0,88.0,6.0,,2024-03-26,194379.147964,1075200.0
1,1.0,2.0,29.0,19.0,16.0,2024-07-09,97930.993380,1428000.0
2,2.0,3.0,28.0,18.0,6.0,2024-04-13,126095.547778,940800.0
3,3.0,4.0,85.0,12.0,12.0,2023-09-06,,1008000.0
4,4.0,5.0,47.0,3.0,8.0,2021-07-06,99575.609634,705600.0
...,...,...,...,...,...,...,...,...
9995,9995.0,9996.0,,10.0,,2022-06-05,,627200.0
9996,9996.0,9997.0,39.0,2.0,9.0,2021-05-17,159518.597391,1512000.0
9997,9997.0,9998.0,90.0,1.0,15.0,2022-07-19,128137.094759,1960000.0
9998,9998.0,9999.0,33.0,,19.0,2021-04-15,81786.119894,1680000.0


Handling Missing Numerical Values in Transactions Using Median Imputation

In [None]:
#Fill missing numerical values in Transactions
transactions['Quantity'] = transactions['Quantity'].fillna(transactions['Quantity'].median())
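# Note: missing prices are filled with the median of the products table, not of the transactions column
transactions['Product_Price'] = transactions['Product_Price'].fillna(products['Product_Price'].median())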
transactions['Total_Cost'] = transactions['Total_Cost'].fillna(transactions['Total_Cost'].median())
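
A single global median ignores differences between products. One alternative, shown here as a sketch and not as the notebook's method, is to impute with the median of each product's own rows and fall back to the global median only when a group has no observed values. `fill_with_group_median` and the toy table are hypothetical.

```python
import pandas as pd

def fill_with_group_median(df: pd.DataFrame, value_col: str, group_col: str) -> pd.Series:
    """Fill missing values with the per-group median, then the global median."""
    group_median = df.groupby(group_col)[value_col].transform("median")
    return df[value_col].fillna(group_median).fillna(df[value_col].median())

# Toy transactions (illustrative values only).
demo_tx = pd.DataFrame({
    "Product_ID": ["1", "1", "1", "2", "2"],
    "Quantity": [4.0, None, 6.0, 10.0, None],
})
demo_tx["Quantity"] = fill_with_group_median(demo_tx, "Quantity", "Product_ID")
print(demo_tx)
```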

Generating Descriptive Statistics for Customers, Products, and Transactions Datasets

In [None]:
#Descriptive Statistics
def describe_data(df):
    descriptives = {}
    for column in df.columns:
        if df[column].dtype == 'object':  # Categorical
            descriptives[column] = {
                'Type': 'Categorical',
                'Unique Values': df[column].nunique(),
                'Mode': df[column].mode()[0]
            }
        elif df[column].dtype in ['float64', 'int64']:  # Numerical
            descriptives[column] = {
                'Type': 'Numerical',
                'Mean': df[column].mean(),
                'Median': df[column].median(),
                'Std Dev': df[column].std(),
                'Range': df[column].max() - df[column].min()
            }
        elif pd.api.types.is_datetime64_any_dtype(df[column]):  # Date
            descriptives[column] = {
                'Type': 'Date',
                'Earliest': df[column].min(),
                'Latest': df[column].max()
            }
    return descriptives

# Applying the describe_data function to all three datasets
descriptive_stats_customers = describe_data(customers)
descriptive_stats_products = describe_data(products)
descriptive_stats_transactions = describe_data(transactions)

descriptive_stats_customers, descriptive_stats_products, descriptive_stats_transactions

({'Company_ID': {'Type': 'Numerical',
   'Mean': 50.5,
   'Median': 50.5,
   'Std Dev': 29.011491975882016,
   'Range': 99},
  'Company_Name': {'Type': 'Categorical',
   'Unique Values': 100,
   'Mode': 'Dynamic  Network  96'},
  'Company_Profit': {'Type': 'Numerical',
   'Mean': 76400.5,
   'Median': 75301.5,
   'Std Dev': 27296.169253359454,
   'Range': 87451.0},
  'Address': {'Type': 'Categorical',
   'Unique Values': 97,
   'Mode': 'Ayala Avenue, Brgy. 101, Baguio, Philippines'}},
 {'Product_ID': {'Type': 'Numerical',
   'Mean': 10.5,
   'Median': 10.5,
   'Std Dev': 5.916079783099616,
   'Range': 19},
  'Product_Name': {'Type': 'Categorical',
   'Unique Values': 20,
   'Mode': 'BudgetMaster Pro'},
  'Product_Price': {'Type': 'Numerical',
   'Mean': 134680.0,
   'Median': 131600.0,
   'Std Dev': 39408.916971189465,
   'Range': 140000.0}},
 {'Unnamed: 0': {'Type': 'Numerical',
   'Mean': 4994.049111111111,
   'Median': 4997.5,
   'Std Dev': 2885.331476239788,
   'Range': 9999.0},
  

In [None]:
#Generate descriptives for each dataset
transactions_desc = describe_data(transactions)
transactions_desc

{'Unnamed: 0': {'Type': 'Numerical',
  'Mean': 4994.049111111111,
  'Median': 4997.5,
  'Std Dev': 2885.331476239788,
  'Range': 9999.0},
 'Transaction_ID': {'Type': 'Numerical',
  'Mean': 5000.5,
  'Median': 5000.5,
  'Std Dev': 2886.8956799071675,
  'Range': 9999.0},
 'Company_ID': {'Type': 'Numerical',
  'Mean': 50.583555555555556,
  'Median': 50.0,
  'Std Dev': 28.900869703353283,
  'Range': 99.0},
 'Product_ID': {'Type': 'Numerical',
  'Mean': 10.446777777777777,
  'Median': 10.0,
  'Std Dev': 5.768340607805518,
  'Range': 19.0},
 'Quantity': {'Type': 'Numerical',
  'Mean': 10.5759,
  'Median': 11.0,
  'Std Dev': 5.5117943700502625,
  'Range': 21.0},
 'Transaction_Date': {'Type': 'Categorical',
  'Unique Values': 1493,
  'Mode': '2022-04-10'},
 'Product_Price': {'Type': 'Numerical',
  'Mean': 134347.52228314718,
  'Median': 131600.0,
  'Std Dev': 37064.88553794515,
  'Range': 170665.68741232133},
 'Total_Cost': {'Type': 'Numerical',
  'Mean': 1416221.52,
  'Median': 1344000.0,
  '

In [None]:
products_desc = describe_data(products)
products_desc

{'Product_ID': {'Type': 'Numerical',
  'Mean': 10.5,
  'Median': 10.5,
  'Std Dev': 5.916079783099616,
  'Range': 19},
 'Product_Name': {'Type': 'Categorical',
  'Unique Values': 20,
  'Mode': 'BudgetMaster Pro'},
 'Product_Price': {'Type': 'Numerical',
  'Mean': 134680.0,
  'Median': 131600.0,
  'Std Dev': 39408.916971189465,
  'Range': 140000.0}}

In [None]:
customers_desc = describe_data(customers)
customers_desc

{'Company_ID': {'Type': 'Numerical',
  'Mean': 50.5,
  'Median': 50.5,
  'Std Dev': 29.011491975882016,
  'Range': 99},
 'Company_Name': {'Type': 'Categorical',
  'Unique Values': 100,
  'Mode': 'Dynamic  Network  96'},
 'Company_Profit': {'Type': 'Numerical',
  'Mean': 76400.5,
  'Median': 75301.5,
  'Std Dev': 27296.169253359454,
  'Range': 87451.0},
 'Address': {'Type': 'Categorical',
  'Unique Values': 97,
  'Mode': 'Ayala Avenue, Brgy. 101, Baguio, Philippines'}}

In [None]:
#Print descriptives
print("Transactions Dataset Descriptives:")
print(transactions_desc)
print("\nProducts Dataset Descriptives:")
print(products_desc)
print("\nCustomers Dataset Descriptives:")
print(customers_desc)

Transactions Dataset Descriptives:
{'Unnamed: 0': {'Type': 'Numerical', 'Mean': 4994.049111111111, 'Median': 4997.5, 'Std Dev': 2885.331476239788, 'Range': 9999.0}, 'Transaction_ID': {'Type': 'Numerical', 'Mean': 5000.5, 'Median': 5000.5, 'Std Dev': 2886.8956799071675, 'Range': 9999.0}, 'Company_ID': {'Type': 'Numerical', 'Mean': 50.583555555555556, 'Median': 50.0, 'Std Dev': 28.900869703353283, 'Range': 99.0}, 'Product_ID': {'Type': 'Numerical', 'Mean': 10.446777777777777, 'Median': 10.0, 'Std Dev': 5.768340607805518, 'Range': 19.0}, 'Quantity': {'Type': 'Numerical', 'Mean': 10.5759, 'Median': 11.0, 'Std Dev': 5.5117943700502625, 'Range': 21.0}, 'Transaction_Date': {'Type': 'Categorical', 'Unique Values': 1493, 'Mode': '2022-04-10'}, 'Product_Price': {'Type': 'Numerical', 'Mean': 134347.52228314718, 'Median': 131600.0, 'Std Dev': 37064.88553794515, 'Range': 170665.68741232133}, 'Total_Cost': {'Type': 'Numerical', 'Mean': 1416221.52, 'Median': 1344000.0, 'Std Dev': 862330.9763195113, '
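
The nested dictionaries returned by `describe_data` are hard to scan when printed. Converting them to a DataFrame gives one row per column and one column per statistic. This is a small sketch; `demo_desc` holds made-up values in the same shape as the function's output, and `desc_table` is a hypothetical name.

```python
import pandas as pd

# Same structure as describe_data's return value (illustrative values only).
demo_desc = {
    "Quantity": {"Type": "Numerical", "Mean": 10.5, "Median": 11.0},
    "Order_Date": {"Type": "Categorical", "Unique Values": 3, "Mode": "2022-01-01"},
}

# orient="index" makes each described column a row; statistics that do not
# apply to a given column are left as NaN.
desc_table = pd.DataFrame.from_dict(demo_desc, orient="index")
print(desc_table)
```

In the notebook, the same call could be applied to `transactions_desc`, `products_desc`, and `customers_desc`.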

Converting 'Product_ID' to String for Consistency Across Datasets

In [None]:
#Convert Product_ID to string in both datasets
transactions['Product_ID'] = transactions['Product_ID'].astype(str)
products['Product_ID'] = products['Product_ID'].astype(str)

Converting 'Company_ID' to String for Consistency Across Datasets

In [None]:
#Convert Company_ID to string in both datasets
transactions['Company_ID'] = transactions['Company_ID'].astype(str)
customers['Company_ID'] = customers['Company_ID'].astype(str)

Removing Rows with Missing Values in Key Identifier Columns

In [None]:
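# Intended to drop rows with missing key values.
# Note: the ID columns were already cast to str above, so in the run recorded here NaN became
# the literal string 'nan' and these calls remove nothing; the affected rows show up as
# unmatched keys after the merge.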
transactions = transactions.dropna(subset=['Product_ID', 'Company_ID'])
products = products.dropna(subset=['Product_ID'])
customers = customers.dropna(subset=['Company_ID'])

Standardizing 'Product_ID' Format by Removing Decimal Points

In [None]:
# Remove a trailing '.0' from Product_ID in both datasets (anchored so only the suffix is stripped)
transactions['Product_ID'] = transactions['Product_ID'].str.replace(r'\.0$', '', regex=True)
products['Product_ID'] = products['Product_ID'].str.replace(r'\.0$', '', regex=True)

Standardizing 'Company_ID' Format by Removing Decimal Points

In [None]:
# Remove a trailing '.0' from Company_ID in both datasets (anchored so only the suffix is stripped)
transactions['Company_ID'] = transactions['Company_ID'].str.replace(r'\.0$', '', regex=True)
customers['Company_ID'] = customers['Company_ID'].str.replace(r'\.0$', '', regex=True)
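
The string conversion, `dropna`, and decimal-stripping steps above can be combined into one normalization that keeps missing IDs missing. The sketch below goes through pandas' nullable `Int64` dtype, so `88.0` becomes `'88'` and NaN stays `<NA>`, which `dropna` can then remove. `normalize_id` and the toy table are hypothetical, and the approach assumes every non-missing ID is a whole number.

```python
import pandas as pd

def normalize_id(ids: pd.Series) -> pd.Series:
    """Convert float-like IDs to strings without decimals; keep missing values missing."""
    as_int = pd.to_numeric(ids, errors="coerce").astype("Int64")
    return as_int.astype("string")

# Toy transactions with one missing key (illustrative values only).
demo_keys = pd.DataFrame({"Product_ID": [6.0, None, 18.0], "Quantity": [1, 2, 3]})
demo_keys["Product_ID"] = normalize_id(demo_keys["Product_ID"])

# Because the missing ID is still <NA> (not the text 'nan'), dropna removes it.
kept = demo_keys.dropna(subset=["Product_ID"])
print(kept)
```

Whether to drop or keep rows with missing keys is a separate decision; the point of the sketch is only that the choice remains available after normalization.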

Saving Cleaned Datasets to CSV Files

In [None]:
#Save cleaned datasets
transactions.to_csv("cleaned_transactions_data.csv", index=False)
products.to_csv("cleaned_products_data.csv", index=False)
customers.to_csv("cleaned_customers_data.csv", index=False)

## Data Integration and Merging

### Merge Strategy
1. Customer-Transaction Integration
   - Key: Company_ID
   - Validation checks
   - Relationship preservation
   
2. Product Integration
   - Key: Product_ID
   - Category alignment
   - Price standardization
   
3. Quality Assurance
   - Relationship verification
   - Data completeness checks
   - Consistency validation

### Why This Matters
- Creates comprehensive view
- Maintains data relationships
- Enables advanced analytics
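
The relationship verification and completeness checks in the strategy above can be expressed with two `pd.merge` parameters: `validate` raises an error if the key relationship is not what is expected, and `indicator` adds a `_merge` column showing which rows found a match. The toy tables below are illustrative only, with an unmatched `'nan'` key included on purpose.

```python
import pandas as pd

# Toy tables (illustrative values only).
demo_tx = pd.DataFrame({
    "Transaction_ID": [1, 2, 3],
    "Product_ID": ["6", "19", "nan"],
})
demo_products = pd.DataFrame({
    "Product_ID": ["6", "19"],
    "Product_Name": ["RevenueVue Dashboard", "EcoNomix Modeler"],
})

# validate="many_to_one" fails fast if Product_ID is duplicated in demo_products.
checked = pd.merge(
    demo_tx, demo_products,
    on="Product_ID", how="left",
    validate="many_to_one", indicator=True,
)

# Rows present only on the left side had no matching product.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

Inspecting `unmatched` before imputing anything shows whether missing product or customer fields come from bad keys rather than from gaps in the source tables.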

Merging Transactions with Products on 'Product_ID'

In [None]:
# Merge Transactions with Products on 'Product_ID'
merged_data = pd.merge(transactions, products, on='Product_ID', how='left')

Merging Transactions-Products Data with Customers on 'Company_ID'

In [None]:
# Merge the result with Customers on 'Company_ID'
merged_data = pd.merge(merged_data, customers, on='Company_ID', how='left')

Previewing the First Few Rows of the Merged Dataset

In [None]:
# Display the first few rows of the merged dataset
print("Merged Dataset:")
print(merged_data.head())

Merged Dataset:
   Unnamed: 0  Transaction_ID Company_ID Product_ID  Quantity  \
0         0.0             1.0         88          6      11.0   
1         1.0             2.0         29         19      16.0   
2         2.0             3.0         28         18       6.0   
3         3.0             4.0         85         12      12.0   
4         4.0             5.0         47          3       8.0   

  Transaction_Date  Product_Price_x  Total_Cost            Product_Name  \
0       2024-03-26    194379.147964   1075200.0    RevenueVue Dashboard   
1       2024-07-09     97930.993380   1428000.0        EcoNomix Modeler   
2       2024-04-13    126095.547778    940800.0  DashSync Analytics Hub   
3       2023-09-06    131600.000000   1008000.0        BudgetMaster Pro   
4       2021-07-06     99575.609634    705600.0    TrendWise Forecaster   

   Product_Price_y           Company_Name  Company_Profit  \
0         179200.0    Elite Consulting 88         75950.0   
1          95200.0  

Identifying and Removing Duplicate Entries in the Merged Dataset

In [None]:
# Count duplicates in the merged dataset
duplicates = merged_data.duplicated().sum()

# Drop duplicates from the merged dataset
merged_data = merged_data.drop_duplicates()

duplicates

0

Checking for Missing Values in the Merged Dataset

In [None]:
# Check for missing values
missing_summary = merged_data.isnull().sum()
print("Missing Values per Column:\n", missing_summary)

Missing Values per Column:
 Unnamed: 0          1000
Transaction_ID         0
Company_ID             0
Product_ID             0
Quantity               0
Transaction_Date       0
Product_Price_x        0
Total_Cost             0
Product_Name        1000
Product_Price_y     1000
Company_Name        1000
Company_Profit      2074
Address             1000
dtype: int64


Dropping Unnecessary Columns from the Merged Dataset

In [None]:
# Drop unnecessary columns
merged_data = merged_data.drop(columns=['Unnamed: 0'], errors='ignore')

Checking for Missing Values in the Merged Dataset

In [None]:
# Check for missing values
missing_summary = merged_data.isnull().sum()
print("Missing Values per Column:\n", missing_summary)

Missing Values per Column:
 Transaction_ID         0
Company_ID             0
Product_ID             0
Quantity               0
Transaction_Date       0
Product_Price_x        0
Total_Cost             0
Product_Name        1000
Product_Price_y     1000
Company_Name        1000
Company_Profit      2074
Address             1000
dtype: int64


Saving the Final Merged Dataset to CSV

In [None]:
# Save the merged dataset
merged_data.to_csv("merged_data.csv", index=False)

Checking and Addressing Missing Values in the Merged Dataset

In [None]:
# Address missing data in the merged dataset
# Check for missing values
missing_summary = merged_data.isnull().sum()
print("Missing Values per Column:\n", missing_summary)

Missing Values per Column:
 Transaction_ID         0
Company_ID             0
Product_ID             0
Quantity               0
Transaction_Date       0
Product_Price_x        0
Total_Cost             0
Product_Name        1000
Product_Price_y     1000
Company_Name        1000
Company_Profit      2074
Address             1000
dtype: int64


Imputing Missing Values in Numerical Columns with Median

In [None]:
# Numerical columns: continuous and count data
numerical_columns = merged_data.select_dtypes(include=['float64', 'int64']).columns
for col in numerical_columns:
    if merged_data[col].isnull().sum() > 0:
        # Use median for imputation; assign back instead of chained inplace fillna,
        # which is deprecated and unreliable in recent pandas versions
        merged_data[col] = merged_data[col].fillna(merged_data[col].median())

Imputing Missing Values in Categorical Columns with Mode

In [None]:
# Categorical columns: impute with the mode
categorical_columns = merged_data.select_dtypes(include=['object']).columns
for col in categorical_columns:
    if merged_data[col].isnull().sum() > 0:
        # Use mode for imputation; assign back instead of chained inplace fillna
        merged_data[col] = merged_data[col].fillna(merged_data[col].mode()[0])

Imputing Missing Date Values Using Forward-Fill Method

In [None]:
# Date columns: forward-fill for imputation
date_columns = [col for col in merged_data.columns if 'Date' in col or 'date' in col]
for col in date_columns:
    if merged_data[col].isnull().sum() > 0:
        # Parse to datetime, then forward-fill; .ffill() replaces the deprecated fillna(method='ffill')
        merged_data[col] = pd.to_datetime(merged_data[col], errors='coerce')
        merged_data[col] = merged_data[col].ffill()
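
Once values are imputed, the final table no longer shows which entries were originally missing. One option is to record a boolean flag per imputed column before filling, as sketched below. `impute_with_flags`, the `_was_missing` suffix, and the toy data are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def impute_with_flags(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Median-fill numeric columns and mode-fill the rest, keeping a flag per column."""
    out = df.copy()
    for col in columns:
        missing = out[col].isna()
        if not missing.any():
            continue
        # Remember which rows were imputed before overwriting them.
        out[f"{col}_was_missing"] = missing
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy merged rows (illustrative values only).
demo_merged = pd.DataFrame({
    "Company_Profit": [100.0, None, 300.0],
    "Company_Name": ["A", None, "A"],
})
imputed = impute_with_flags(demo_merged, ["Company_Profit", "Company_Name"])
print(imputed)
```

The flag columns can later be used to exclude imputed rows from an analysis or to check how sensitive a result is to the imputation.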

Verifying Resolution of Missing Data After Imputation

In [None]:
# Validate that missing data is resolved
print("Post-Imputation Missing Values:\n", merged_data.isnull().sum())

Post-Imputation Missing Values:
 Transaction_ID      0
Company_ID          0
Product_ID          0
Quantity            0
Transaction_Date    0
Product_Price_x     0
Total_Cost          0
Product_Name        0
Product_Price_y     0
Company_Name        0
Company_Profit      0
Address             0
dtype: int64


Saving the Merged and Cleaned Dataset to CSV

In [None]:
# Save the merged dataset
merged_data.to_csv("merged_data.csv", index=False)
