# Original Prompt
Could you provide me with a Python script that generates a sample dataset? I would like this dataset to be similar to a product sales report extracted from SAP or another ERP system. I would like it to have the following columns:
- ISO Week Start Date (Date):
- Country (String): possible values "United States", "Canada", "Mexico"
- Account (String): possible values "Revenue", "Labor", "Raw Materials",  "Admin", "Utilities"
- Product (String): possible values "Product A", "Product B", "Overhead"
- Value in USD (Currency)

The data should not be completely random. For example:
- There should be a similar distribution of Values over time for any given combination of Product, Account, and Country.
- Labor, Raw Materials, and Revenue should be highly correlated over time within a given combination of Product and Country.
- Revenue should be inversely correlated between Product A and Product B.
- Admin and Utilities should stay relatively consistent over time and have no correlation to other accounts within a given product.
- For a given week, Revenue should be higher than the sum of Labor, Raw Materials, and Admin.
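
The last requirement in the list can be checked mechanically once a dataset exists. The following is a minimal sketch, assuming a long-format DataFrame with the column names requested above; the `sample` frame and the helper name `weeks_violating_margin` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical two-week sample in the requested long format (values are made up).
sample = pd.DataFrame({
    "ISO Week Start Date": ["2013-12-30"] * 4 + ["2014-01-06"] * 4,
    "Country": ["Canada"] * 8,
    "Product": ["Product A"] * 8,
    "Account": ["Revenue", "Labor", "Raw Materials", "Admin"] * 2,
    "Value in USD": [1000.0, 300.0, 350.0, 40.0, 900.0, 500.0, 400.0, 50.0],
})

def weeks_violating_margin(df):
    """Return the week/country/product rows where Revenue is not above
    Labor + Raw Materials + Admin."""
    # One row per week/country/product, one column per account.
    wide = df.pivot_table(
        index=["ISO Week Start Date", "Country", "Product"],
        columns="Account",
        values="Value in USD",
        aggfunc="sum",
    )
    costs = wide[["Labor", "Raw Materials", "Admin"]].sum(axis=1)
    return wide[wide["Revenue"] <= costs].reset_index()

violations = weeks_violating_margin(sample)
print(violations[["ISO Week Start Date", "Revenue"]])
```

In this made-up sample only the second week fails the check, because its three cost accounts total 950 against revenue of 900.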

In [2]:
## Output

In [62]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

def generate_week_dates(start_date, num_weeks=520):
    return [start_date + timedelta(weeks=wk) for wk in range(num_weeks)]

def generate_seasonal_trend(week_num, amplitude=25000, period=52, phase_shift=0, linear_trend=50):
    """Generate a seasonal trend with an optional linear trend component."""
    seasonal_component = amplitude * np.sin(2 * np.pi * (week_num + phase_shift) / period)
    linear_component = linear_trend * week_num
    return seasonal_component + linear_component

# Initialize base data
start_date = datetime(2013, 12, 30)
countries = ["United States", "Canada", "Mexico"]
accounts = ["Revenue", "Labor", "Raw Materials", "Admin", "Utilities"]
products = ["Product A", "Product B"]
dates = generate_week_dates(start_date)

data_rows = []  # List to hold all rows of data

np.random.seed(42)  # For reproducibility

week_num = 0  # Initialize week number for trend calculation
for date in dates:
    for country in countries:
        for product in products:
            # Calculate base revenue
            base_revenue = np.random.randint(50000, 100000) if product == "Product A" else np.random.randint(20000, 50000)

            # Apply seasonal adjustment to revenue
            seasonal_adjustment = generate_seasonal_trend(week_num)
            revenue = base_revenue + seasonal_adjustment

            # Calculate admin and utilities as a percent of revenue with Gaussian distribution
            admin = revenue * max(0, np.random.normal(loc=0.04, scale=0.01))  # Mean at 4%, with some variation
            utilities = revenue * max(0, np.random.normal(loc=0.025, scale=0.005))  # Mean at 2.5%, with some variation

            # Calculate labor and raw materials as 25-40% of the seasonal revenue with Gaussian distribution
            labor_percentage = np.random.normal(loc=0.325, scale=0.075)  # Centered at 32.5%
            raw_materials_percentage = np.random.normal(loc=0.325, scale=0.075)  # Centered at 32.5%
            labor = revenue * max(0, labor_percentage)
            raw_materials = revenue * max(0, raw_materials_percentage)

            # Append rows for each account type
            data_rows.extend([
                {'ISO Week Start Date': date, 'Country': country, 'Account': 'Revenue', 'Product': product, 'Value in USD': revenue},
                {'ISO Week Start Date': date, 'Country': country, 'Account': 'Labor', 'Product': product, 'Value in USD': labor},
                {'ISO Week Start Date': date, 'Country': country, 'Account': 'Raw Materials', 'Product': product, 'Value in USD': raw_materials},
                {'ISO Week Start Date': date, 'Country': country, 'Account': 'Admin', 'Product': product, 'Value in USD': admin},
                {'ISO Week Start Date': date, 'Country': country, 'Account': 'Utilities', 'Product': product, 'Value in USD': utilities},
            ])
    week_num += 1  # Increment week number

# Convert list of dicts to DataFrame
data = pd.DataFrame(data_rows)

In [63]:
data.head()

| | ISO Week Start Date | Country | Account | Product | Value in USD |
|---|---|---|---|---|---|
| 0 | 2013-12-30 | United States | Revenue | Product A | 65795.0 |
| 1 | 2013-12-30 | United States | Labor | Product A | 22760.339137 |
| 2 | 2013-12-30 | United States | Raw Materials | Product A | 26369.888987 |
| 3 | 2013-12-30 | United States | Admin | Product A | 1900.238476 |
| 4 | 2013-12-30 | United States | Utilities | Product A | 1749.785846 |


In [64]:

data.to_csv('./data/dummy_data.csv', index=False)

In [65]:
# Stratify raw material costs and revenue by Country and Product
avg_costs = data[data['Account'] == 'Raw Materials'].groupby(['Country', 'Product'])['Value in USD'].mean().reset_index()
avg_revenue = data[data['Account'] == 'Revenue'].groupby(['Country', 'Product'])['Value in USD'].mean().reset_index()

# Join the stratified averages back to a unique list of combinations of ISO Week Start Date, Country, and Product
unique_combinations = data[['ISO Week Start Date', 'Country', 'Product']].drop_duplicates().reset_index(drop=True)
dummy_data = unique_combinations.merge(avg_costs, on=['Country', 'Product'], how='left')
dummy_data = dummy_data.merge(avg_revenue, on=['Country', 'Product'], how='left')
dummy_data.rename(columns={'Value in USD_x': 'Avg Raw Material Cost', 'Value in USD_y': 'Avg Revenue'}, inplace=True)

# Calculate Units Produced and Sold based on stratified averages
# Adjust the formula as needed for your specific context
dummy_data['Units Produced'] = np.round((dummy_data['Avg Raw Material Cost'] / 100) * np.random.uniform(0.8, 1.2, size=len(dummy_data)))
dummy_data['Units Sold'] = np.round((dummy_data['Avg Revenue'] / 350) * np.random.uniform(0.8, 1.2, size=len(dummy_data)))

# Cast the unit counts to integers.
# The cap that would keep Units Sold <= Units Produced is left commented out below,
# so Units Sold can exceed Units Produced in the output.
# dummy_data['Units Sold'] = dummy_data.apply(lambda row: min(row['Units Sold'], row['Units Produced']), axis=1).astype(int)
dummy_data['Units Sold'] = dummy_data['Units Sold'].astype(int)
dummy_data['Units Produced'] = dummy_data['Units Produced'].astype(int)

# Calculate costs per unit produced and revenue per unit sold for a more detailed analysis
dummy_data['Cost per Unit Produced'] = dummy_data['Avg Raw Material Cost'] / dummy_data['Units Produced']
dummy_data['Revenue per Unit Sold'] = dummy_data['Avg Revenue'] / dummy_data['Units Sold']

# Drop any 'Overhead' rows (a safeguard only: the products list above does not include 'Overhead')
dummy_data = dummy_data[dummy_data['Product'] != 'Overhead']


In [66]:
dummy_data.tail()

| | ISO Week Start Date | Country | Product | Avg Raw Material Cost | Avg Revenue | Units Produced | Units Sold | Cost per Unit Produced | Revenue per Unit Sold |
|---|---|---|---|---|---|---|---|---|---|
| 3115 | 2023-12-11 | United States | Product B | 16177.826945 | 48221.173077 | 132 | 128 | 122.559295 | 376.727915 |
| 3116 | 2023-12-11 | Canada | Product A | 28750.241084 | 86990.338462 | 279 | 236 | 103.047459 | 368.603129 |
| 3117 | 2023-12-11 | Canada | Product B | 16029.124413 | 48087.817308 | 165 | 127 | 97.146209 | 378.644231 |
| 3118 | 2023-12-11 | Mexico | Product A | 27972.359439 | 86295.586538 | 257 | 282 | 108.841866 | 306.012718 |
| 3119 | 2023-12-11 | Mexico | Product B | 15382.783203 | 47617.159615 | 156 | 140 | 98.607585 | 340.122569 |


In [67]:
dummy_data.to_csv('./data/dummy_data_counts.csv', index=False)

In [68]:
new_data_counts = dummy_data 
original_data = data

query = ''' 
WITH cost_revenue AS (
    SELECT
        o.Country,
        o.Product,
        o.`ISO Week Start Date`,
        SUM(CASE WHEN o.Account = 'Revenue' THEN o.`Value in USD` ELSE 0 END) AS Total_Revenue,
        SUM(CASE WHEN o.Account IN ('Labor', 'Raw Materials') THEN o.`Value in USD` ELSE 0 END) AS Total_Cost
    FROM original_data o
    INNER JOIN new_data_counts n ON o.Country = n.Country AND o.Product = n.Product AND o.`ISO Week Start Date` = n.`ISO Week Start Date`
    GROUP BY o.Country, o.Product, o.`ISO Week Start Date`
),
-- Note: despite its name, this CTE totals overhead per ISO week, not per month.
monthly_overhead AS (
    SELECT
        Country,
        `ISO Week Start Date`,
        SUM(`Value in USD`) AS Monthly_Overhead
    FROM original_data
    WHERE Account IN ('Admin', 'Utilities')
    GROUP BY Country, `ISO Week Start Date`
),
gross_profit_calc AS (
    SELECT
        c.Country,
        c.Product,
        c.`ISO Week Start Date`,
        c.Total_Revenue - c.Total_Cost AS Gross_Profit,
        mo.Monthly_Overhead
    FROM cost_revenue c
    JOIN monthly_overhead mo ON c.Country = mo.Country AND c.`ISO Week Start Date` = mo.`ISO Week Start Date`
)
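SELECT *,
       SUM(Gross_Profit) OVER (PARTITION BY Country, Product ORDER BY `ISO Week Start Date`) AS Cumulative_Gross_Profit,
       -- Partitioned by Country only: both product rows for a week are peers in the window,
       -- so each week's overhead is added once per product row.
       SUM(Monthly_Overhead) OVER (PARTITION BY Country ORDER BY `ISO Week Start Date`) AS Cumulative_Overhead
FROM gross_profit_calc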
'''

joined_data = pysqldf(query)

In [69]:
joined_data.head()

| | Country | Product | ISO Week Start Date | Gross_Profit | Monthly_Overhead | Cumulative_Gross_Profit | Cumulative_Overhead |
|---|---|---|---|---|---|---|---|
| 0 | Canada | Product A | 2013-12-30 00:00:00.000000 | 24923.966146 | 6160.574475 | 24923.966146 | 12321.14895 |
| 1 | Canada | Product A | 2014-01-06 00:00:00.000000 | 24528.794407 | 6284.259681 | 49452.760553 | 24889.668311 |
| 2 | Canada | Product A | 2014-01-13 00:00:00.000000 | 38367.701076 | 9549.035133 | 87820.461629 | 43987.738577 |
| 3 | Canada | Product A | 2014-01-20 00:00:00.000000 | 28351.572622 | 7340.952134 | 116172.034251 | 58669.642844 |
| 4 | Canada | Product A | 2014-01-27 00:00:00.000000 | 21689.303676 | 7478.579702 | 137861.337927 | 73626.802249 |


In [70]:
joined_data.to_csv('./data/dummy_data_joined.csv', index=False)

In [71]:
import altair as alt
# Altair visualization
chart = alt.Chart(joined_data).mark_bar(opacity=0.5).encode(
    alt.X('Gross_Profit:Q', bin=alt.Bin(maxbins=100), title='Gross Profit'),
    alt.Y('count()', stack=None, title='Number of Records'),
    color='Product:N',
    row='Country:N'
).properties(
    height=200,  # Adjust the height of each row
    width=400    # Adjust the width of the chart
)

chart

# Prompting for Prompts

## Prompt
Thank you! I would like to use this dataset to help financial analysts learn how to use ChatGPT to analyze data. Can you provide some recommended questions they could ask you?

## Response 
Integrating ChatGPT into financial analysis can significantly enhance data interpretation, trend identification, and decision-making processes. Here are some recommended questions that financial analysts could ask ChatGPT based on a dataset like the one we discussed. These questions are designed to simulate real-world scenarios, helping analysts learn how to leverage AI for financial data analysis:

1. **Summarize Key Trends**: "Can you summarize the key trends in revenue across different countries for Product A over the last quarter?"

2. **Comparative Analysis**: "How does the revenue from Product B in Canada compare to that in the United States over the past year?"

3. **Correlation Analysis**: "Is there a correlation between labor costs and revenue for Product A in Mexico across different weeks?"

4. **Predictive Insights**: "Based on past trends, what are the predicted revenue ranges for Product B in the next month?"

5. **Expense Analysis**: "What percentage of total costs do raw materials constitute for Product A across all countries?"

6. **Performance Benchmarking**: "Which country has the highest efficiency in terms of revenue to labor cost ratio for Product B?"

7. **Impact Analysis**: "How does the variation in utility costs impact the overall expenses for Overhead across each country?"

8. **Outlier Detection**: "Identify any outliers in the revenue for Product A and B across all countries. What might be the reasons for these anomalies?"

9. **Profitability Analysis**: "What is the profitability trend for each product in each country over the past six months?"

10. **Cost Optimization**: "Which accounts show the most significant potential for cost reduction without affecting the revenue for Product B?"

11. **Seasonal Trends**: "Are there any noticeable seasonal trends in the sales of Product A and B across different countries?"

12. **Product Comparison**: "Compare the financial performance of Product A vs. Product B across all accounts and countries."

13. **Expense Forecasting**: "What are the forecasted admin and utility expenses for the next quarter based on past trends?"

14. **Market Analysis**: "Which market (country) shows the most growth potential for Product A based on past revenue trends?"

15. **Strategic Recommendations**: "Based on the data, what strategic adjustments would you recommend for improving the profitability of Product B in Canada?"

Encouraging analysts to ask these types of questions can help them develop a deeper understanding of how to use AI tools like ChatGPT for financial data analysis. It allows them to explore various analytical dimensions, from basic querying to more complex predictive and strategic analysis.
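
Several of these questions reduce to short pandas aggregations that an analyst can use to cross-check a chat answer. The following is a minimal sketch for questions 5 and 6, assuming the same long-format columns as the dataset; the `sample` values are made up and the derived column names are hypothetical.

```python
import pandas as pd

# Made-up totals for two countries (illustration only, not taken from the dataset).
sample = pd.DataFrame({
    "Country": ["Canada"] * 5 + ["Mexico"] * 5,
    "Product": ["Product B"] * 10,
    "Account": ["Revenue", "Labor", "Raw Materials", "Admin", "Utilities"] * 2,
    "Value in USD": [1000.0, 250.0, 300.0, 40.0, 10.0,
                     800.0, 400.0, 200.0, 30.0, 10.0],
})

# One row per country, one column per account.
totals = sample.pivot_table(
    index="Country", columns="Account", values="Value in USD", aggfunc="sum"
)

cost_accounts = ["Labor", "Raw Materials", "Admin", "Utilities"]

# Question 5: share of total costs that comes from raw materials.
totals["Raw Materials Share"] = (
    totals["Raw Materials"] / totals[cost_accounts].sum(axis=1)
)

# Question 6: revenue earned per USD of labor cost.
totals["Revenue per Labor USD"] = totals["Revenue"] / totals["Labor"]

best_country = totals["Revenue per Labor USD"].idxmax()
print(totals[["Raw Materials Share", "Revenue per Labor USD"]])
print(best_country)
```

With these made-up numbers, Canada has the higher revenue-to-labor ratio (4.0 versus 2.0).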

In [9]:
import altair as alt
import pandas as pd

# Assuming 'data' is your DataFrame loaded from the CSV
# Filter to the 'Revenue' account, then total it by country and product
revenue_data = data[data['Account'] == 'Revenue'].groupby(['Country', 'Product'], as_index=False)['Value in USD'].sum()

# Creating a bar chart for comparative analysis
revenue_chart = alt.Chart(revenue_data).mark_bar().encode(
    x='Country:N',
    y='sum(Value in USD):Q',
    color='Country:N',
    column='Product:N',
    tooltip=['Country', 'Product', 'sum(Value in USD)']
).properties(
    title='Total Revenue by Product and Country'
)

revenue_chart

In [19]:
# Preparing data for correlation analysis
# Filter the data for relevant accounts and a specific product/country if desired
correlation_data = data[data['Account'].isin(['Labor', 'Raw Materials', 'Revenue'])]

# Pivot the data to have accounts as columns
correlation_pivot = correlation_data.pivot_table(index=['ISO Week Start Date', 'Country', 'Product'],
                                                 columns='Account', values='Value in USD').reset_index()

min_x = correlation_pivot['Labor'].min()
max_x = correlation_pivot['Labor'].max()

min_y = correlation_pivot['Revenue'].min()
max_y = correlation_pivot['Revenue'].max()


In [20]:



# Creating scatter plots to visualize correlations
# Example: Correlation between Labor and Revenue
labor_revenue_corr_chart = alt.Chart(correlation_pivot).mark_circle(size=60).encode(
    x=alt.X('Labor:Q', scale=alt.Scale(domain=[min_x, max_x])),
    y=alt.Y('Revenue:Q', scale=alt.Scale(domain=[min_y, max_y])),
    color='Country:N',
    tooltip=['Country', 'Product', 'Labor', 'Revenue']
).properties(
    title='Correlation between Labor Costs and Revenue'
)

labor_revenue_corr_chart

# Asking for Insights

## Prompt
Can you provide any insights from the data?

## Response
To provide insights from the dataset you've shared, I will analyze it focusing on the relationships and patterns among the variables, such as country performance, product sales, and the correlation between different types of costs and revenue. This analysis will involve summarizing the data, identifying trends, and exploring correlations within the dataset. Let's proceed with this analysis.

### Insights from Data Analysis

1. **Overall Revenue by Country and Product**: Identify which country and product combinations are the most and least profitable. This could indicate market preferences or the success of local strategies.

2. **Correlation between Labor and Revenue**: Explore if higher labor costs are associated with higher revenue, which might suggest that investing in labor (quality or quantity) could be beneficial.

3. **Correlation between Raw Materials and Revenue**: Determine if there's a relationship between the amount spent on raw materials and revenue, potentially indicating the impact of material quality or cost on sales.

4. **Efficiency Analysis**: Assess which countries are most efficient in turning labor and raw materials into revenue. This could highlight best practices or areas for improvement.

5. **Trend Over Time**: Look for any trends in revenue, labor costs, and raw material costs over time to identify seasonal patterns or long-term growth/decline.

Let's begin this analysis by summarizing key metrics and then examining correlations among the data.
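
One common way to separate a long-term trend from a yearly seasonal pattern is a rolling mean over a full year. The sketch below is illustrative only: it builds a hypothetical revenue series from the defaults of `generate_seasonal_trend` above (amplitude 25000, period 52, linear trend 50) on an assumed base of 75000, then smooths it with a 52-week window.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly revenue: assumed base + seasonal sine + linear trend.
week_num = np.arange(156)  # three years of weeks
weeks = pd.date_range("2013-12-30", periods=len(week_num), freq="W-MON")
revenue = pd.Series(
    75000 + 25000 * np.sin(2 * np.pi * week_num / 52) + 50 * week_num,
    index=weeks,
)

# A 52-week rolling mean averages over one full seasonal cycle,
# so the sine component cancels and only the trend remains.
trend = revenue.rolling(window=52).mean()

# Week-over-week change of the smoothed series estimates the linear growth.
weekly_growth = trend.diff().dropna()
print(round(weekly_growth.mean(), 2))
```

On this synthetic series the smoothed line grows by about 50 USD per week, which is the linear trend that was put in. Noisy data would give only an approximate estimate.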

Based on the analysis of your dataset, here are some key insights:

### Overall Revenue by Country and Product
- **Canada**: Product A generates the highest revenue, followed by Product B, and Overhead has the least.
- **Mexico**: Product A is the top revenue generator, similar to Canada, with Product B and Overhead following.
- **United States**: Product A also leads in revenue, with Product B and Overhead trailing.

This pattern suggests that Product A is consistently the most successful product across all countries, indicating its strong market fit or effective sales strategy.

### Correlation Analysis
- **Labor and Revenue Correlation**: The correlation coefficient between labor costs and revenue is approximately 0.287. This indicates a positive, albeit weak, relationship, suggesting that higher labor costs are somewhat associated with higher revenue, but other factors also play a significant role.
- **Raw Materials and Revenue Correlation**: The correlation coefficient here is slightly higher at 0.290, indicating a similar weak positive relationship. This suggests that investments in raw materials slightly correlate with revenue increases, but, like labor, the impact is not strongly pronounced.
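
The response does not show how these coefficients were computed. The following is a minimal sketch, assuming a Pearson correlation computed with pandas on a wide table shaped like `correlation_pivot`. The synthetic frame is hypothetical and will not reproduce the figures quoted above. It also computes the coefficient per product, since a coefficient pooled across products can differ from the within-product values.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per week and product (synthetic values).
rng = np.random.default_rng(0)
n_weeks = 104
frames = []
for product, base in [("Product A", 75000), ("Product B", 35000)]:
    revenue = base + rng.normal(0, 5000, n_weeks)
    # Labor as a noisy percentage of revenue, mirroring the generator's approach.
    labor = revenue * rng.normal(0.325, 0.075, n_weeks)
    frames.append(
        pd.DataFrame({"Product": product, "Revenue": revenue, "Labor": labor})
    )
wide = pd.concat(frames, ignore_index=True)

# Pooled Pearson correlation across all rows.
pooled_r = wide["Labor"].corr(wide["Revenue"])

# Pearson correlation within each product.
per_product_r = wide.groupby("Product")[["Labor", "Revenue"]].apply(
    lambda g: g["Labor"].corr(g["Revenue"])
)

print(round(pooled_r, 3))
print(per_product_r.round(3))
```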

### Interpretation and Recommendations
- **Product Focus**: Given Product A's consistent performance across all markets, focusing on its production, marketing, and sales could further enhance profitability. Investigating why it performs better than Product B and Overhead could provide actionable insights.
- **Market Strategies**: While Product A performs well, there might be room to improve Product B's market strategy, possibly by analyzing customer feedback or market trends to adjust the product or its positioning.
- **Cost Management**: The weak correlation between labor/raw materials and revenue suggests a review of cost efficiency. While investments in these areas are essential, ensuring they're optimized for revenue generation is crucial. Investigating areas where costs can be reduced without impacting product quality or sales might reveal opportunities for efficiency improvements.
- **Cross-Market Learning**: The differences in revenue across countries, even with Product A leading, suggest that market-specific strategies are essential. Learning from the strategies employed in one country and applying successful tactics elsewhere could enhance overall performance.

These insights should help guide strategic decisions, from product focus to market strategies and cost management. It's also important to dive deeper into the specific aspects that make Product A successful and apply these lessons to other products or markets.