### How the data is formatted
- Every Excel file has N sheets, one per fruit/vegetable group (for example "Manzana").
- The set of sheets is not known in advance because it changes between files (probably with the seasons and with supply and demand).
- Each sheet is split into two groups: the first holds wholesale prices, the second wholesale volumes.
- We expect (but cannot guarantee) the same number of data points in the first and second group.
- Data points use the following columns: 'Variety', 'Market Name', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Commerce Unit'.
- Commerce units in the price group correspond to those in the volume group, but the strings need normalizing before the two groups can be merged (for example '$/bin (450 kilos)' versus 'Bin (450 kilos)').
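
The unit mismatch described in the last bullet can be handled by normalizing the unit strings before merging. The following is a minimal sketch, not the notebook's actual pipeline: `normalize_unit` is a hypothetical helper, and the four rows are copied from the 'Pera' sheet output shown later in this notebook.

```python
import pandas as pd

def normalize_unit(unit):
    """Hypothetical helper: drop the leading '$/' of price units and
    lower-case the text so price and volume units compare equal."""
    unit = str(unit).strip()
    if unit.startswith('$/'):
        unit = unit[2:]
    return unit.lower()

# Two price rows and the two matching volume rows (Monday only)
price_df = pd.DataFrame({
    'Variedad': ['Bartlett de verano', "Packham's Triumph"],
    'Mercado': ['Vega Modelo de Temuco', 'Vega Modelo de Temuco'],
    'Unidad': ['$/bandeja 18 kilos granel', '$/bin (450 kilos)'],
    'Lunes': [16000, 500000],
})
volume_df = pd.DataFrame({
    'Variedad': ['Bartlett de verano', "Packham's Triumph"],
    'Mercado': ['Vega Modelo de Temuco', 'Vega Modelo de Temuco'],
    'Unidad': ['Bandeja 18 kilos granel', 'Bin (450 kilos)'],
    'Lunes': [200, 2],
})

# Normalize both sides, then merge on the shared keys
for frame in (price_df, volume_df):
    frame['Unidad'] = frame['Unidad'].map(normalize_unit)

merged = price_df.merge(
    volume_df,
    on=['Variedad', 'Mercado', 'Unidad'],
    suffixes=('_precio', '_volumen'),
)
print(merged)
```

Merging on a normalized key avoids relying on the price and volume rows being in the same order or having the same count.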


### Which fruits/vegetables have the highest/lowest sales volume and revenue?
    Identify the top-selling and least-selling products in terms of volume and revenue to understand the most and least popular items.
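
One possible way to answer this once prices and volumes are merged into a long table is sketched below. The values are made up for illustration, the `Ingreso` (revenue) column is a name introduced here, and the sketch assumes price and volume are expressed in the same commerce unit so that their product is meaningful.

```python
import pandas as pd

# Made-up long-format rows; column names follow the rest of this notebook
sales = pd.DataFrame({
    'Producto': ['Manzana', 'Manzana', 'Pera', 'Pera'],
    'Precio':   [100, 120, 200, 0],
    'Volumen':  [10, 5, 3, 0],
})

# Revenue per row, assuming Precio is the price of one unit of Volumen
sales['Ingreso'] = sales['Precio'] * sales['Volumen']

# Total volume and revenue per product, highest revenue first
totals = (
    sales.groupby('Producto')[['Volumen', 'Ingreso']]
    .sum()
    .sort_values('Ingreso', ascending=False)
)
print(totals)
```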

### What are the sales trends for specific fruits/vegetables over time?
    Analyze weekly sales data to identify seasonal trends, demand patterns, and potential growth opportunities.

### What is the average price and volume sold per unit for each fruit/vegetable?
    Calculate the average price and volume to understand the typical market conditions for different products.

### Which markets show the highest/lowest sales for specific fruits/vegetables?
    Determine the geographical regions with the highest and lowest sales to optimize distribution and marketing efforts.

### What is the overall revenue and volume for all fruits/vegetables combined?
    Provide an overview of total revenue and volume to showcase the overall performance of the business.

### Are there any price or volume fluctuations over time?
    Analyze price and volume variations to identify factors affecting sales and potential pricing strategies.

### Which fruits/vegetables have the highest profit margins?
    Calculate profit margins for each product to identify opportunities for maximizing profitability.

### What are the best-selling fruits/vegetables in each market?
    Determine the top products in each market to guide inventory management and marketing efforts.

### Can we identify any correlations between price and volume for specific products?
    Conduct a correlation analysis to explore the relationship between price and volume for individual items.
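
A rough sketch of such an analysis, using made-up numbers, is shown below. It assumes that a 0 in the source sheets means "no trade that day" and therefore drops those rows before computing the Pearson correlation; that interpretation is an assumption, not something the data confirms.

```python
import pandas as pd

# Made-up daily observations for two products
obs = pd.DataFrame({
    'Producto': ['Manzana'] * 4 + ['Pera'] * 4,
    'Precio':   [100, 110, 120, 130, 170, 180, 190, 200],
    'Volumen':  [40, 35, 30, 25, 10, 12, 14, 16],
})

# Assumption: zeros mean no trade, so they are excluded
obs = obs[(obs['Precio'] > 0) & (obs['Volumen'] > 0)]

# Pearson correlation between price and volume, one value per product
corr = pd.Series({
    name: group['Precio'].corr(group['Volumen'])
    for name, group in obs.groupby('Producto')
})
print(corr)
```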

### What is the overall market share of each fruit/vegetable?
    Calculate the market share of each product to understand its position compared to competitors.

### How do pricing strategies impact sales volume and revenue?
    Analyze the effect of pricing changes on sales performance to optimize pricing strategies.

### Which fruits/vegetables have the highest customer satisfaction or repeat purchase rates?
    Use customer feedback data to identify products with high satisfaction levels and loyal customers.

### Can we forecast future sales for specific products based on historical data?
    Utilize time series analysis and forecasting techniques to predict future sales for individual items.

### What are the most profitable markets for each fruit/vegetable?
    Analyze the profit margins across different markets to identify lucrative opportunities.

### Are there any product combinations that lead to increased sales?
    Analyze cross-selling patterns to identify potential product bundling opportunities.

In [2]:
import pandas as pd
import numpy as np
import dateparser
import re
from datetime import datetime, timedelta
import os
import math

In [3]:
def findBetween(datestr, start, end):
    # Simple find string function
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    matches = re.findall(pattern, datestr)
    return matches[0]
def extractDates(datestr):
    try:
        # Second date is full
        second_date = dateparser.parse(datestr.split('al ')[1])
        
        # First date day or day+month
        first_day = findBetween(datestr, 'Semana del ', ' al')
        first_date = None
        
        # Dates comes in two formats
        if not first_day.isnumeric():
            # Example: Semana del 27 de junio al 1 de julio 2022
            first_day, first_month = first_day.split(' de ')
            first_month = dateparser.parse(f"{first_day} de {first_month} de {second_date.year}").month
            first_date = second_date.replace(day=int(first_day), month=first_month)
            
            if (second_date - first_date) < timedelta(days=0):
                first_date = first_date.replace(year=first_date.year - 1)
        else:
            # Example: Semana del 09 al 13 de marzo de 2015
            first_day = int(first_day)
            first_date = second_date.replace(day=first_day)
        return [first_date, second_date]
    except Exception as e:
        print('Extract Dates: ', datestr, e)
def formatDateStr(datestr):
    # Dates in filenames come in two formats: YYYYMMDD or DDMMYYYY
    try:
        new_date = dateparser.parse(datestr, date_formats=['%Y%m%d'])
        if new_date is None:
            new_date = dateparser.parse(datestr, date_formats=['%d%m%Y'])
        return new_date
    except Exception as e:
        print(e, datestr)
        return None
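
For comparison, the two week-header formats handled by `extractDates` can also be parsed with the standard library alone. This is only a sketch: `parse_week_header`, `MONTHS` and `HEADER_RE` are names introduced here, and the December/January header in the usage lines is a hypothetical input built to exercise the year rollover.

```python
import re
from datetime import date

# Spanish month names mapped to month numbers
MONTHS = {
    'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4, 'mayo': 5, 'junio': 6,
    'julio': 7, 'agosto': 8, 'septiembre': 9, 'octubre': 10,
    'noviembre': 11, 'diciembre': 12,
}

# "Semana del <d> [de <mes>] al <d> de <mes> [de] <yyyy>"
HEADER_RE = re.compile(
    r'Semana del\s+(\d{1,2})(?:\s+de\s+([a-záéíóú]+))?'
    r'\s+al\s+(\d{1,2})\s+de\s+([a-záéíóú]+)(?:\s+de)?\s+(\d{4})',
    re.IGNORECASE,
)

def parse_week_header(text):
    """Return (start, end) dates for a week header, or None if it does not match."""
    m = HEADER_RE.search(text)
    if m is None:
        return None
    d1, month1, d2, month2, year = m.groups()
    end = date(int(year), MONTHS[month2.lower()], int(d2))
    # When the first date has no month, it shares the month of the second
    start_month = MONTHS[month1.lower()] if month1 else end.month
    start = date(end.year, start_month, int(d1))
    # A week that starts in December and ends in January spans two years
    if start > end:
        start = start.replace(year=end.year - 1)
    return start, end

print(parse_week_header('Semana del 27 de junio al 1 de julio 2022'))
print(parse_week_header('Semana del 09 al 13 de marzo de 2015'))
print(parse_week_header('Semana del 30 de diciembre al 3 de enero 2020'))  # hypothetical header
```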

In [3]:
# First 2 values are not products
df = pd.read_excel('./data/Boletin_Semanal_Precios_Mayoristas_20230714.xlsx', sheet_name=None)
sheet_names = list(df.keys())
sheet_names

['Portada Boletin semanal',
 'Presentación',
 'Cebolla',
 'Lechuga',
 'Limón',
 'Manzana',
 'Naranja',
 'Palta',
 'Papa',
 'Pera',
 'Tomate',
 'Uva',
 'Zanahoria']

In [4]:
excel_files = os.listdir('./data/')
excel_files[0]

'Boletin_Semanal_Precios_Mayoristas_20220128.xlsx'

In [5]:
path_data = './data/'
excel_files = os.listdir(path_data)
excel_files

['Boletin_Semanal_Precios_Mayoristas_20220128.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20230224.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20200306.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20230210.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20190412.xlsx',
 'Boletin-semanal-precios-mayoristas-06042018.xlsx',
 'Boletin-semanal-precios-mayoristas_14092018.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20201231.xlsx',
 'Boletin-semanal-precios-mayoristas-15062018.xlsx',
 'Boletin-semanal-precios-mayoristas-27072018.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20210806.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20210205.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20221007.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20210709.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20200828.xlsx',
 'Boletin_semanal_precios_mayoristas_20190705.xlsx',
 'Boletin_Semanal_Precios_Mayoristas_20201224.xlsx',
 'Boletin-precios-mayoristas-2015-03-07-a-13.xls',
 'Boletin-semanal-precios-mayoristas_07092018.xl

In [6]:
path_data = './data/'
test_file = 'Boletin-precios-mayoristas-2015-03-07-a-13.xls'

In [7]:
path_data = './data/'
excel_files = os.listdir(path_data)
selected_excels = []
target_date = datetime(2018,9,21)

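for excel_file in excel_files:
    # The date is the run of digits right before the .xlsx extension
    pattern = r'(\d+)\.xlsx'
    match = re.search(pattern, excel_file)
    if not match:
        continue
    parsed_date = formatDateStr(match.group(1))
    # formatDateStr returns None when the date cannot be parsed
    if parsed_date is not None and parsed_date > target_date:
        selected_excels.append({'date': parsed_date, 'file': excel_file})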
    
    """df = pd.read_excel(path_data + excel_file, sheet_name=None)
    sheet_names = list(df.keys())[2:] # Take out the first 2
    selected_sheet = sheet_names[-1]
    df_sheet = pd.read_excel(path_data + excel_file, sheet_name=selected_sheet)
    try:
        print(excel_file)
        dates = extractDates(str(df_sheet.iloc[1][0]))
        print(selected_sheet, dates[0])
        if dates[0] > target_date:
            selected_excels.append(excel_file)
    except Exception as e:
        print(e, sheet_names, excel_file)
    if (dates[1]+timedelta(days=1)-dates[0]) < timedelta(days=5):
        print(path_data+excel_file, sheet)"""
sorted_data = sorted(selected_excels, key=lambda x: x['date'])
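
The filename dates appear in two layouts (`YYYYMMDD` and `DDMMYYYY`). As a sketch of how the ambiguity could be resolved without `dateparser`, the hypothetical helper below tries both layouts with `datetime.strptime` and only accepts a result whose year is plausible. The 2000 to 2100 range is an assumption chosen for this dataset.

```python
import re
from datetime import datetime

def date_from_filename(name):
    """Hypothetical helper: return the date encoded in the 8 digits
    right before the extension, or None if there is no such date."""
    match = re.search(r'(\d{8})\.xlsx?$', name)
    if match is None:
        return None
    digits = match.group(1)
    for fmt in ('%Y%m%d', '%d%m%Y'):
        try:
            parsed = datetime.strptime(digits, fmt)
        except ValueError:
            continue
        # Assumed plausible range; rejects mis-parsed layouts
        if 2000 <= parsed.year <= 2100:
            return parsed
    return None

print(date_from_filename('Boletin_Semanal_Precios_Mayoristas_20220128.xlsx'))
print(date_from_filename('Boletin-semanal-precios-mayoristas-06042018.xlsx'))
print(date_from_filename('Boletin-precios-mayoristas-2015-03-07-a-13.xls'))
```

Files whose names do not end in eight digits, such as the 2015 bulletin, return `None` and would need to be dated from the week header inside the sheet instead.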

In [8]:
sorted_data

[{'date': datetime.datetime(2018, 9, 28, 0, 0),
  'file': 'Boletin-semanal-precios-mayoristas_28092018.xlsx'},
 {'date': datetime.datetime(2018, 10, 5, 0, 0),
  'file': 'Boletin-semanal-precios-mayoristas_05102018.xlsx'},
 {'date': datetime.datetime(2018, 10, 12, 0, 0),
  'file': 'Boletin-semanal-precios-mayoristas_12102018.xlsx'},
 {'date': datetime.datetime(2018, 10, 19, 0, 0),
  'file': 'Boletin-semanal-precios-mayoristas_19102018.xlsx'},
 {'date': datetime.datetime(2018, 10, 26, 0, 0),
  'file': 'Boletin-semanal-precios-mayoristas_26102018.xlsx'},
 {'date': datetime.datetime(2018, 10, 31, 0, 0),
  'file': 'Boletin-semanal-precios-mayoristas_31102018.xlsx'},
 {'date': datetime.datetime(2018, 11, 9, 0, 0),
  'file': 'Boletin_semanal_precios_mayoristas_20181109.xlsx'},
 {'date': datetime.datetime(2018, 11, 16, 0, 0),
  'file': 'Boletin_semanal_precios_mayoristas-20181116.xlsx'},
 {'date': datetime.datetime(2018, 11, 23, 0, 0),
  'file': 'Boletin_semanal_precios_mayoristas_20181123.xls

In [9]:
"""path_data = './data/'
excel_files = os.listdir(path_data)
selected_excels = []
target_date = datetime(2018,9,21)

for excel_file in excel_files:
    df = pd.read_excel(path_data + excel_file, sheet_name=None)
    sheet_names = list(df.keys())[2:] # Take out the first 2
    selected_sheet = sheet_names[-1]
    df_sheet = pd.read_excel(path_data + excel_file, sheet_name=selected_sheet)
    try:
        print(excel_file)
        dates = extractDates(str(df_sheet.iloc[1][0]))
        print(selected_sheet, dates[0])
        if dates[0] > target_date:
            selected_excels.append(excel_file)
    except Exception as e:
        print(e, sheet_names, excel_file)
    if (dates[1]+timedelta(days=1)-dates[0]) < timedelta(days=5):
        print(path_data+excel_file, sheet)
selected_excels""";

In [10]:
# Solve different sheet format
"""path_data = './data/'
for data in sorted_data:
    df_sheets = pd.read_excel(path_data + data['file'], sheet_name=None)
    sheet_names = list(df_sheets.keys())[2:] #Remove 'Portada...' and 'Presentacion'
    for sheet in sheet_names:
        df = pd.read_excel(path_data + data['file'], sheet_name=sheet)
        print(extractDates(str(df.iloc[1][0])), sheet)"""

[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Cebolla
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Lechuga
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Limón
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Manzana
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Naranja
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Palta
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Papa
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Pera
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Tomate
[datetime.datetime(2018, 9, 24, 0, 0), datetime.datetime(2018, 9, 28, 0, 0)] Zanahoria
[datetime.datetime(2018, 10, 1, 0, 0), datetime.datetime(2018, 10, 5, 0, 0)] Cebolla
[datetime.datetime(2018, 10, 1, 0, 0), datetime.datetime(2018, 10, 5, 0, 0

[datetime.datetime(2018, 11, 26, 0, 0), datetime.datetime(2018, 11, 30, 0, 0)] Palta
[datetime.datetime(2018, 11, 26, 0, 0), datetime.datetime(2018, 11, 30, 0, 0)] Papa
[datetime.datetime(2018, 11, 26, 0, 0), datetime.datetime(2018, 11, 30, 0, 0)] Pera
[datetime.datetime(2018, 11, 26, 0, 0), datetime.datetime(2018, 11, 30, 0, 0)] Tomate
[datetime.datetime(2018, 11, 26, 0, 0), datetime.datetime(2018, 11, 30, 0, 0)] Uva
[datetime.datetime(2018, 11, 26, 0, 0), datetime.datetime(2018, 11, 30, 0, 0)] Zanahoria
[datetime.datetime(2018, 12, 3, 0, 0), datetime.datetime(2018, 12, 7, 0, 0)] Cebolla
[datetime.datetime(2018, 12, 3, 0, 0), datetime.datetime(2018, 12, 7, 0, 0)] Lechuga
[datetime.datetime(2018, 12, 3, 0, 0), datetime.datetime(2018, 12, 7, 0, 0)] Limón
[datetime.datetime(2018, 12, 3, 0, 0), datetime.datetime(2018, 12, 7, 0, 0)] Manzana
[datetime.datetime(2018, 12, 3, 0, 0), datetime.datetime(2018, 12, 7, 0, 0)] Naranja
[datetime.datetime(2018, 12, 3, 0, 0), datetime.datetime(2018, 12,

[datetime.datetime(2019, 2, 4, 0, 0), datetime.datetime(2019, 2, 8, 0, 0)] Uva
[datetime.datetime(2019, 2, 4, 0, 0), datetime.datetime(2019, 2, 8, 0, 0)] Zanahoria
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Cebolla
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Lechuga
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Limón
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Manzana
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Naranja
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Palta
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Papa
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Pera
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Tomate
[datetime.datetime(2019, 2, 11, 0, 0), datetime.datetime(2019, 2, 15, 0, 0)] Uva
[

[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Lechuga
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Limón
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Manzana
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Naranja
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Palta
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Papa
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Pera
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Tomate
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Uva
[datetime.datetime(2019, 4, 15, 0, 0), datetime.datetime(2019, 4, 19, 0, 0)] Zanahoria
[datetime.datetime(2019, 4, 22, 0, 0), datetime.datetime(2019, 4, 26, 0, 0)] Cebolla
[datetime.datetime(2019, 4, 22, 0, 0), datetime.datetime(2019, 4, 26, 0, 0)] L

[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Limón
[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Manzana
[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Naranja
[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Palta
[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Papa
[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Pera
[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Tomate
[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Uva
[datetime.datetime(2019, 6, 17, 0, 0), datetime.datetime(2019, 6, 21, 0, 0)] Zanahoria
[datetime.datetime(2019, 6, 24, 0, 0), datetime.datetime(2019, 6, 28, 0, 0)] Cebolla
[datetime.datetime(2019, 6, 24, 0, 0), datetime.datetime(2019, 6, 28, 0, 0)] Lechuga
[datetime.datetime(2019, 6, 24, 0, 0), datetime.datetime(2019, 6, 28, 0, 0)] L

[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Cebolla
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Lechuga
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Limón
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Manzana
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Naranja
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Palta
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Papa
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Pera
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Tomate
[datetime.datetime(2019, 8, 26, 0, 0), datetime.datetime(2019, 8, 30, 0, 0)] Zanahoria
[datetime.datetime(2019, 9, 2, 0, 0), datetime.datetime(2019, 9, 6, 0, 0)] Cebolla
[datetime.datetime(2019, 9, 2, 0, 0), datetime.datetime(2019, 9, 6, 0, 0)] L

[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Cebolla
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Lechuga
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Limón
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Manzana
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Naranja
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Palta
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Papa
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Pera
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Tomate
[datetime.datetime(2019, 11, 4, 0, 0), datetime.datetime(2019, 11, 8, 0, 0)] Zanahoria
[datetime.datetime(2019, 11, 11, 0, 0), datetime.datetime(2019, 11, 15, 0, 0)] Cebolla
[datetime.datetime(2019, 11, 11, 0, 0), datetime.datetime(2019, 11, 15, 

[datetime.datetime(2019, 12, 30, 0, 0), datetime.datetime(2020, 1, 3, 0, 0)] Zanahoria
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Cebolla
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Lechuga
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Limón
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Manzana
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Naranja
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Palta
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Papa
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Pera
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Tomate
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Uva
[datetime.datetime(2020, 1, 6, 0, 0), datetime.datetime(2020, 1, 10, 0, 0)] Zanahoria
[d

KeyboardInterrupt: 

### Cebollas test

In [None]:
test_df = pd.DataFrame()
df = pd.read_excel('./data/Boletin_Semanal_Precios_Mayoristas_20230714.xlsx', sheet_name='Cebolla')
extractDates(str(df.iloc[1][0]))

In [None]:
test_df = pd.concat([df, test_df])

In [None]:
test_df

In [None]:
df.columns = np.array(df.iloc[4])

In [None]:
df.drop(df[df['Mercado'].isnull() == True].index, axis=0, inplace=True)

In [None]:
df.drop(df[df['Mercado'] == 'Mercado'].index, axis=0, inplace=True)

In [None]:
df.reset_index(drop=True)

### Manzanas

In [4]:
df2 = pd.read_excel('./data/Boletin_Semanal_Precios_Mayoristas_20230714.xlsx', sheet_name='Manzana', skiprows=0)
df2.shape

(59, 8)

In [None]:
df2.drop(df2[df2[df2.columns[1]].isnull() == True].index, axis=0, inplace=True)

In [None]:
df2.columns = np.array(df2.iloc[0])

In [None]:
df2.rename(columns={'Unidad de\ncomercialización ': 'Unidad'}, inplace=True)

In [None]:
df2.drop(df2[df2['Mercado'] == 'Mercado'].index, axis=0, inplace=True)

In [None]:
df2 = df2.reset_index(drop=True)
df2.head()

In [None]:
# Rows whose unit has no '$' are volume rows
df2[df2['Unidad'].str.contains(r'\$') == False].head()

## Transform data

In [None]:
# Price units contain '$' (e.g. '$/bin (450 kilos)'); volume units do not
price_df = df2[df2['Unidad'].str.contains(r'\$') == True]
volume_df = df2[df2['Unidad'].str.contains(r'\$') == False]
price_df = pd.melt(price_df, id_vars=['Variedad', 'Mercado', 'Unidad'], var_name='Dia', value_name='Precio')
volume_df = pd.melt(volume_df, id_vars=['Variedad', 'Mercado', 'Unidad'], var_name='Dia', value_name='Volumen')

In [None]:
# Monday of the week covered by the 20230714 file ('Lunes' has offset 0)
start_date = pd.to_datetime('2023-07-10')

# Create a dictionary to map weekday names to their respective date offsets
weekday_to_offset = {
    'Lunes': 0,
    'Martes': 1,
    'Miércoles': 2,
    'Jueves': 3,
    'Viernes': 4
}

# Function to calculate the date for each weekday based on the start date
def calculate_date(row):
    offset = weekday_to_offset[row['Dia']]
    return start_date + pd.DateOffset(days=offset)

# Apply the function to create the 'Fecha' column
price_df['Fecha'] = price_df.apply(calculate_date, axis=1)
volume_df['Fecha'] = volume_df.apply(calculate_date, axis=1)
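
The same weekday-to-date mapping can be written without a row-wise `apply`, by mapping the day names to offsets and adding them as timedeltas. A small sketch follows; the three-row frame is illustrative, and the start date is the Monday that appears in the 'Pera' output later in this notebook.

```python
import pandas as pd

weekday_to_offset = {'Lunes': 0, 'Martes': 1, 'Miércoles': 2, 'Jueves': 3, 'Viernes': 4}
start_date = pd.Timestamp('2023-01-23')  # a Monday

# Illustrative long-format rows, as produced by pd.melt
long_df = pd.DataFrame({
    'Dia': ['Lunes', 'Miércoles', 'Viernes'],
    'Precio': [16000, 16000, 16000],
})

# Map each day name to its offset and add it to the start date in one step
offsets = pd.to_timedelta(long_df['Dia'].map(weekday_to_offset), unit='D')
long_df['Fecha'] = start_date + offsets
print(long_df)
```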

In [None]:
price_df['Unidad'] = volume_df['Unidad']

In [None]:
price_df.head()

In [None]:
merged_df = pd.merge(price_df, volume_df, on=['Variedad', 'Mercado', 'Dia', 'Fecha', 'Unidad'])
merged_df.head()

In [None]:
merged_df['Producto'] = 'Manzana'
merged_df.head()

In [18]:
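product = 'Pera'
df = pd.read_excel('./data/Boletin_Semanal_Precios_Mayoristas_20230127.xlsx', sheet_name=product, skiprows=0)
# The week header ("Semana del ... al ...") is in the row at position 1, first column
start_date, end_date = extractDates(str(df.iloc[1, 0]))
# Drop title and blank rows: they have nothing in the second column
df.drop(df[df[df.columns[1]].isnull()].index, axis=0, inplace=True)
# The first remaining row holds the column names
df.columns = np.array(df.iloc[0])
df.rename(columns={'Unidad de\ncomercialización ': 'Unidad'}, inplace=True)
df = df.reset_index(drop=True)
# The header row repeats before every block; the middle one starts the volume group
header_rows = df[df['Mercado'] == 'Mercado'].index
split = int(header_rows[len(header_rows) // 2])
df1 = df.iloc[:split].copy()  # prices
df2 = df.iloc[split:].copy()  # volumes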

In [14]:
df

Unnamed: 0,Variedad,Mercado,Lunes,Martes,Miércoles,Jueves,Viernes,Unidad
0,Variedad,Mercado,Lunes,Martes,Miércoles,Jueves,Viernes,Unidad de\ncomercialización
1,Bartlett de verano,Vega Modelo de Temuco,16000,16000,16000,16769,16000,$/bandeja 18 kilos granel
2,Favorita de Clapp,Vega Modelo de Temuco,20000,0,0,20000,0,$/bandeja 18 kilos granel
3,Packham's Triumph,Vega Modelo de Temuco,24000,0,19111,20000,20000,$/bandeja 18 kilos granel
4,Variedad,Mercado,Lunes,Martes,Miércoles,Jueves,Viernes,Unidad de\ncomercialización
5,Bartlett de verano,Terminal La Palmera de La Serena,355000,355000,0,345000,335000,$/bin (450 kilos)
6,Bartlett de verano,Vega Modelo de Temuco,340000,0,0,0,0,$/bin (450 kilos)
7,Packham's Triumph,Vega Modelo de Temuco,500000,500000,0,500000,0,$/bin (450 kilos)
8,Variedad,Mercado,Lunes,Martes,Miércoles,Jueves,Viernes,Unidad de\ncomercialización
9,Bartlett de verano,Vega Modelo de Temuco,200,100,300,650,250,Bandeja 18 kilos granel


In [19]:
df1.drop(df1[df1['Mercado'] == 'Mercado'].index, axis=0, inplace=True)

In [20]:
df2.drop(df2[df2['Mercado'] == 'Mercado'].index, axis=0, inplace=True)

In [27]:
df1.reset_index(drop=True)

Unnamed: 0,Variedad,Mercado,Lunes,Martes,Miércoles,Jueves,Viernes,Unidad
0,Bartlett de verano,Vega Modelo de Temuco,16000,16000,16000,16769,16000,$/bandeja 18 kilos granel
1,Favorita de Clapp,Vega Modelo de Temuco,20000,0,0,20000,0,$/bandeja 18 kilos granel
2,Packham's Triumph,Vega Modelo de Temuco,24000,0,19111,20000,20000,$/bandeja 18 kilos granel
3,Bartlett de verano,Terminal La Palmera de La Serena,355000,355000,0,345000,335000,$/bin (450 kilos)
4,Bartlett de verano,Vega Modelo de Temuco,340000,0,0,0,0,$/bin (450 kilos)
5,Packham's Triumph,Vega Modelo de Temuco,500000,500000,0,500000,0,$/bin (450 kilos)


In [28]:
df2.reset_index(drop=True)

Unnamed: 0,Variedad,Mercado,Lunes,Martes,Miércoles,Jueves,Viernes,Unidad
0,Bartlett de verano,Vega Modelo de Temuco,200,100,300,650,250,Bandeja 18 kilos granel
1,Favorita de Clapp,Vega Modelo de Temuco,300,0,0,250,0,Bandeja 18 kilos granel
2,Packham's Triumph,Vega Modelo de Temuco,200,0,450,350,180,Bandeja 18 kilos granel
3,Bartlett de verano,Mercado Mayorista Lo Valledor de Santiago,24,0,10,30,30,Bin (450 kilos)
4,Bartlett de verano,Terminal La Palmera de La Serena,18,18,0,20,20,Bin (450 kilos)
5,Bartlett de verano,Vega Modelo de Temuco,4,0,0,0,0,Bin (450 kilos)
6,Packham's Triumph,Vega Modelo de Temuco,2,1,0,2,0,Bin (450 kilos)


In [87]:
product = 'Pera'
df = pd.read_excel('./data/Boletin_Semanal_Precios_Mayoristas_20220722.xlsx', sheet_name=product, skiprows=0)
df.drop(df[df[df.columns[1]].isnull() == True].index, axis=0, inplace=True)
df.columns = np.array(df.iloc[0])
df.rename(columns={'Unidad de\ncomercialización ': 'Unidad'}, inplace=True)
df = df.reset_index(drop=True)
split_index = int(df[df['Mercado'] == 'Mercado'].index[int(len(df[df['Mercado'] == 'Mercado'].index)/2)])
price_df = df.iloc[:split_index].copy()
volume_df = df.iloc[split_index:].copy()
price_df.drop(price_df[price_df['Mercado'] == 'Mercado'].index, axis=0, inplace=True)
volume_df.drop(volume_df[volume_df['Mercado'] == 'Mercado'].index, axis=0, inplace=True)
if (end_date - start_date) < timedelta(days=4):
    if 'Viernes' not in price_df.columns:
        price_df.insert(loc=len(price_df.columns), column='Viernes', value=0.0)
        volume_df.insert(loc=len(volume_df.columns), column='Viernes', value=0.0)
    end_date = end_date + timedelta(days=1)
price_df = pd.melt(price_df, id_vars=['Variedad', 'Mercado', 'Unidad'], var_name='Dia', value_name='Precio')
volume_df = pd.melt(volume_df, id_vars=['Variedad', 'Mercado', 'Unidad'], var_name='Dia', value_name='Volumen')
# Function to calculate the date for each weekday based on the start date
def calculate_date(row):
    weekday_to_offset = {
        'Lunes': 0,
        'Martes': 1,
        'Miércoles': 2,
        'Jueves': 3,
        'Viernes': 4
    }
    offset = weekday_to_offset[row['Dia']]
    return start_date + pd.DateOffset(days=offset)

# Apply the function to create the 'Fecha' column
price_df['Fecha'] = price_df.apply(calculate_date, axis=1)
volume_df['Fecha'] = volume_df.apply(calculate_date, axis=1)

price_df['Unidad'] = volume_df['Unidad']

merged_df = pd.merge(price_df, volume_df, on=['Variedad', 'Mercado', 'Unidad', 'Dia', 'Fecha'], how='outer')
merged_df.sort_values(by='Fecha')

Unnamed: 0,Variedad,Mercado,Unidad,Dia,Precio,Fecha,Volumen
0,Packham's Triumph,Mercado Mayorista Lo Valledor de Santiago,Bin (450 kilos),Lunes,140000,2023-01-23,35
7,Winter Nelis,Vega Monumental Concepción,Caja 16 kilos empedrada,Lunes,0,2023-01-23,0
6,Packham's Triumph,Vega Monumental Concepción,Caja 16 kilos empedrada,Lunes,0,2023-01-23,0
5,Abate Fetel,Vega Monumental Concepción,Caja 16 kilos empedrada,Lunes,0,2023-01-23,0
40,Packham's Triumph,Terminal Hortofrutícola Agro Chillán,Caja 16 kilos empedrada,Lunes,,2023-01-23,0
...,...,...,...,...,...,...,...
35,Winter Nelis,Mercado Mayorista Lo Valledor de Santiago,Caja 16 kilos empedrada,Viernes,0,2023-01-27,
36,Winter Nelis,Terminal La Palmera de La Serena,Bin (450 kilos),Viernes,0,2023-01-27,0
37,Abate Fetel,Vega Monumental Concepción,Bin (450 kilos),Viernes,8000,2023-01-27,
39,Winter Nelis,Vega Monumental Concepción,Bin (450 kilos),Viernes,8000,2023-01-27,


In [88]:
merged_df['Precio'].sum(), merged_df['Volumen'].sum()

(3000833, 1143)

In [72]:
merged_df['Mercado'].unique()

array(['Vega Modelo de Temuco', 'Terminal La Palmera de La Serena',
       'Mercado Mayorista Lo Valledor de Santiago'], dtype=object)