In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Table of Contents

-----

# <a id='toc1_'></a>[**Sales Forecasting Project**](#toc0_)

-----------------------------
## <a id='toc1_1_'></a>[**Project Context**](#toc0_)
-----------------------------

Fresh Analytics, a data analytics company, aims to understand and predict the demand for various items across restaurants. The primary goal of the project is to forecast item sales across different restaurants over time. In an ever-changing, competitive market, accurate forecasting is crucial for making sound decisions and plans related to sales, production, and other business aspects.

-----------------------------
## <a id='toc1_2_'></a>[**Project Objectives**](#toc0_)
-----------------------------

In ever-changing competitive market conditions, there is a need to make correct decisions and plans for future events related to business like sales, production, and many more. The effectiveness of a decision taken by business managers is influenced by the accuracy of the models used. Demand is the most important aspect of a business's ability to achieve its objectives. Many decisions in business depend on demand, like production, sales, and staff requirements. Forecasting is necessary for business at both international and domestic levels. 

-----------------------------
## <a id='toc1_3_'></a>[**Project Dataset Description**](#toc0_)
-----------------------------

1. **restaurants.csv**: Contains information about the restaurants or stores.
   - id: Unique identification of the restaurant or store
   - name: Name of the restaurant

2. **items.csv**: Provides details about the items sold.
   - id: Unique identification of the item
   - store_id: Unique identification of the store
   - name: Name of the item
   - kcal: A measure of energy nutrients (calories) in the item
   - cost: The unit price of the item

3. **sales.csv**: Contains sales data for items at different stores on various dates.
   - date: Date of purchase
   - item_id: Unique identification of the item bought
   - price: Unit price of the item
   - item_count: Total count of the items bought on that day
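
As a quick sanity check on the layout described above, a small helper can report which expected columns are absent from a loaded DataFrame. This is a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the sales column names follow how the code later in this notebook refers to them (`item_id`, `price`), which is an assumption about the actual file.

```python
import pandas as pd

# Hypothetical mapping of dataset name -> columns this notebook relies on.
# The sales columns are assumed from the merge code, not read from the file.
EXPECTED_COLUMNS = {
    "restaurants": ["id", "name"],
    "items": ["id", "store_id", "name", "kcal", "cost"],
    "sales": ["date", "item_id", "price", "item_count"],
}

def missing_columns(df, expected):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Toy frame standing in for sales.csv (illustrative values only).
toy_sales = pd.DataFrame({"date": ["2020-01-01"], "item_id": [1], "item_count": [3]})
print(missing_columns(toy_sales, EXPECTED_COLUMNS["sales"]))  # ['price']
```

Running the same check on the real DataFrames right after loading would surface a renamed or missing column before the merge step fails on it.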

-----------------------------------
## <a id='toc1_4_'></a>[**Project Analysis Steps To Perform**](#toc0_)
-----------------------------------

4.1  Preliminary analysis:

         4.1.1. Import the datasets into the Python environment

         4.1.2. Examine the dataset's shape and structure, and look out for any outliers

         4.1.3. Merge the datasets into a single dataset that includes the date, item id, price, item count, item names, kcal values, store id, and store name

4.2  Exploratory data analysis:

         4.2.1. Examine the overall date wise sales to understand the pattern
      
         4.2.2. Find out how sales fluctuate across different days of the week
      
         4.2.3. Look for any noticeable trends in the sales data for different months of the year
      
         4.2.4. Examine the sales distribution across different quarters averaged over the years. Identify any noticeable patterns.
      
         4.2.5. Compare the performances of the different restaurants. Find out which restaurant had the most sales and look at the sales for each restaurant across different years, months, and days.
      
         4.2.6. Identify the most popular items overall and the stores where they are being sold. Also, find out the most popular item at each store.
      
         4.2.7. Determine if the store with the highest sales volume is also making the most money per day
      
         4.2.8. Identify the most expensive item at each restaurant and find out its calorie count

4.3 Forecasting using machine learning algorithms

         4.3.1. Forecasting using machine learning algorithms

            4.3.1.1. Generate necessary features for the development of these models, like day of the week, quarter of the year, month, year, day of the month and so on

            4.3.1.2. Use the data from the last six months as the testing data

            4.3.1.3. Compute the root mean square error (RMSE) values for each model to compare their performances

            4.3.1.4. Use the best-performing models to make a forecast for the next year

4.4 Forecasting using deep learning algorithms

         4.4.1. Use sales amount for predictions instead of item count
      
         4.4.2. Build a long short-term memory (LSTM) model for predictions
      
            4.4.2.1. Define the train and test series
      
            4.4.2.2. Generate synthetic data for the last 12 months
      
            4.4.2.3. Build and train an LSTM model.
      
            4.4.2.4. Use the model to make predictions for the test data.
      
         4.4.3. Calculate the mean absolute percentage error (MAPE) and comment on the model's performance
      
         4.4.4. Develop another model using the entire series for training, and use it to forecast for the next three months
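
The evaluation steps listed above (holding out the last six months, then comparing models by RMSE and MAPE) can be sketched as follows. This is an illustrative outline under stated assumptions, not the notebook's final implementation: `split_last_months`, `rmse`, and `mape` are hypothetical helper names, and the toy series is made up.

```python
import numpy as np
import pandas as pd

def split_last_months(df, date_col="date", months=6):
    """Split df into (train, test), where test covers the final `months` months."""
    cutoff = df[date_col].max() - pd.DateOffset(months=months)
    train = df[df[date_col] <= cutoff]
    test = df[df[date_col] > cutoff]
    return train, test

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes y_true has no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Toy daily series covering one year (illustrative values only).
toy = pd.DataFrame({"date": pd.date_range("2021-01-01", "2021-12-31", freq="D")})
toy["sales"] = np.arange(len(toy), dtype=float)

train, test = split_last_months(toy)
print(len(train), len(test))
print(rmse([1, 2, 3], [1, 2, 5]), mape([100, 200], [110, 180]))
```

Splitting by date rather than with a random shuffle keeps future observations out of the training set, which matters for any forecasting comparison.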



### <a id='toc1_4_1_'></a>[**4.1. Preliminary analysis**](#toc0_)

#### <a id='toc1_4_1_1_'></a>[**4.1.1. Import Datasets**](#toc0_)

In [None]:
# Import necessary libraries
import os
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import LSTM, Dense
import calendar
from statsmodels.tsa.seasonal import seasonal_decompose
from scipy.fft import fft
from scipy import stats
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from pandas.plotting import parallel_coordinates
from pandas.plotting import andrews_curves
from sklearn.cluster import KMeans

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Set random seed for reproducibility
np.random.seed(1971)

load_dotenv(verbose=True, dotenv_path='.env', override=True)

DATASET_PATH = os.getenv('DATASET_PATH')

restaurants_ds_file = f'{DATASET_PATH}/resturants.csv'
items_ds_file = f'{DATASET_PATH}/items.csv'
sales_ds_file = f'{DATASET_PATH}/sales.csv'

# Read the CSV files
restaurants_df = pd.read_csv(restaurants_ds_file)
items_df = pd.read_csv(items_ds_file)
sales_df = pd.read_csv(sales_ds_file)

# Display the first few rows of each dataset
print("Restaurants dataset:")
print(restaurants_df.head())
print("\nItems dataset:")
print(items_df.head())
print("\nSales dataset:")
print(sales_df.head())

**Explanations:**

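- This code block imports the libraries used throughout the notebook (including `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`, `xgboost`, `keras`, and `statsmodels`), loads `DATASET_PATH` from a `.env` file, and reads the three CSV files into pandas DataFrames. It then displays the first few rows of each dataset to give an initial view of the data.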

**Why It Is Important:**

- Importing and examining the datasets is crucial as it allows us to understand the structure and content of our data. This step helps identify any immediate issues with data formatting or missing values and provides a foundation for all subsequent analyses.
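
Because the file paths are built from an environment variable, an unset variable would silently produce a path such as `None/items.csv`. The sketch below shows one way to fail early instead. `dataset_file` is a hypothetical helper, and the demo uses a temporary directory and a separate variable name (`DEMO_DATASET_PATH`) so it does not touch the notebook's real configuration.

```python
import os
import tempfile
from pathlib import Path

def dataset_file(filename, env_var="DATASET_PATH"):
    """Build a path under the directory named by env_var, failing early if unset or missing."""
    base = os.getenv(env_var)
    if not base:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    path = Path(base) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset file not found: {path}")
    return path

# Demonstration with a temporary directory standing in for the dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    os.environ["DEMO_DATASET_PATH"] = tmp
    (Path(tmp) / "items.csv").write_text("id,store_id,name,kcal,cost\n")
    found = dataset_file("items.csv", env_var="DEMO_DATASET_PATH")
    print(found.name)  # items.csv
```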

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_1_2_'></a>[**4.1.2. Examine the dataset's shape and structure, and look out for any outliers**](#toc0_)

In [None]:
# Check the shape of each dataset
print("Restaurants dataset shape:", restaurants_df.shape)
print("Items dataset shape:", items_df.shape)
print("Sales dataset shape:", sales_df.shape)

# Display info for each dataset
print("\nRestaurants dataset info:")
restaurants_df.info()
print("\nItems dataset info:")
items_df.info()
print("\nSales dataset info:")
sales_df.info()

# Display basic statistics for numerical columns
print("\nRestaurants Dataset Statistics:")
print(restaurants_df.describe())
print("\nItems Dataset Statistics:")
print(items_df.describe())
print("\nSales Dataset Statistics:")
print(sales_df.describe())

# Check for missing values
print("\nMissing values in Restaurants dataset:")
print(restaurants_df.isnull().sum())
print("\nMissing values in Items dataset:")
print(items_df.isnull().sum())
print("\nMissing values in Sales dataset:")
print(sales_df.isnull().sum())

# Check for duplicates
print("\nDuplicates in Restaurants dataset:", restaurants_df.duplicated().sum())
print("Duplicates in Items dataset:", items_df.duplicated().sum())
print("Duplicates in Sales dataset:", sales_df.duplicated().sum())

# Check for outliers using box plots
plt.figure(figsize=(12, 4))
plt.subplot(131)
sns.boxplot(data=items_df, y='kcal')
plt.title('Kcal Distribution')
plt.subplot(132)
sns.boxplot(data=items_df, y='cost')
plt.title('Cost Distribution')
plt.subplot(133)
sns.boxplot(data=sales_df, y='item_count')
plt.title('Item Count Distribution')
plt.tight_layout()
plt.show()

In [None]:
sales_df.info()
sales_df['date'] = pd.to_datetime(sales_df['date'])
sales_df.info()
print("Items dataset info:")
items_df.info()

In [None]:
from scipy import stats

# datasets = [restaurants_df, items_df, sales_df]
datasets = [items_df, sales_df]

restaurants_df.attrs['name'] = 'Restaurants Dataset'
items_df.attrs['name'] = 'Items Dataset'
sales_df.attrs['name'] = 'Sales Dataset'

# 1. Correlation Matrix
def plot_correlation_matrix(df):
    # Retrieve the dataset name from the DataFrame's attributes
    dataset_name = df.attrs.get('name', 'Dataset')
    corr = df.select_dtypes(include=[np.number]).corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
    plt.title(f'{dataset_name} Correlation Matrix')
    plt.show()

for df in datasets:
    plot_correlation_matrix(df)

# 2. Pairplot
def create_pairplot(df):
    # Retrieve the dataset name from the DataFrame's attributes
    dataset_name = df.attrs.get('name', 'Dataset')
    sns.pairplot(df.select_dtypes(include=[np.number]))
    plt.suptitle(f'{dataset_name} Pairplot of Numerical Variables', y=1.02)
    plt.show()

for df in datasets:
    create_pairplot(df)

# # 3. Z-score Analysis
# def identify_outliers_zscore(df, threshold=3):
#     outliers = pd.DataFrame()
#     for column in df.select_dtypes(include=[np.number]).columns:
#         z_scores = np.abs(stats.zscore(df[column]))
#         outliers[column] = df[column][(z_scores > threshold)]
#     return outliers
# 
# outliers_zscore = identify_outliers_zscore(items)
# print("Outliers (Z-score method):")
# print(outliers_zscore)
# 
# # 4. IQR Method for Outliers
# def identify_outliers_iqr(df):
#     outliers = pd.DataFrame()
#     for column in df.select_dtypes(include=[np.number]).columns:
#         Q1 = df[column].quantile(0.25)
#         Q3 = df[column].quantile(0.75)
#         IQR = Q3 - Q1
#         outliers[column] = df[column][(df[column] < (Q1 - 1.5 * IQR)) | (df[column] > (Q3 + 1.5 * IQR))]
#     return outliers
# 
# outliers_iqr = identify_outliers_iqr(items)
# print("Outliers (IQR method):")
# print(outliers_iqr)

# 5. Time Series Decomposition
def plot_time_series_decomposition(df):
    # Retrieve the dataset name from the DataFrame's attributes
    dataset_name = df.attrs.get('name', 'Dataset')
    # Aggregate to one observation per date first: the sales data can hold
    # several rows per date, and the decomposition expects a single ordered series
    daily_counts = df.groupby('date')['item_count'].sum().sort_index()
    result = seasonal_decompose(daily_counts, model='additive', period=30)  # Adjust period as needed
    result.plot()
    plt.suptitle(f'Time Series Decomposition of {dataset_name}', y=1.02)
    plt.tight_layout()
    plt.show()

plot_time_series_decomposition(sales_df)

# 6. Distribution Plots
def plot_distributions(df):
    # Retrieve the dataset name from the DataFrame's attributes
    dataset_name = df.attrs.get('name', 'Dataset')
    num_cols = df.select_dtypes(include=[np.number]).columns
    n_cols = 2
    n_rows = (len(num_cols) + 1) // 2
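    # squeeze=False keeps axes 2-D even when there is only one row of plots
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows), squeeze=False)
    for i, col in enumerate(num_cols):
        ax = axes[i//n_cols, i%n_cols]
        sns.histplot(df[col], kde=True, ax=ax)
        ax.set_title(f'Distribution of {col} in {dataset_name}')
    # Hide any unused subplot when the number of numeric columns is odd
    for ax in axes.flat[len(num_cols):]:
        ax.set_visible(False)
    plt.tight_layout()
    plt.show()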

for df in datasets:
    plot_distributions(df)

# 7. Cumulative Distribution Functions (CDF)
# def plot_cdfs(df):
#     # Retrieve the dataset name from the DataFrame's attributes
#     dataset_name = df.attrs.get('name', 'Dataset')
#     num_cols = df.select_dtypes(include=[np.number]).columns
#     n_cols = 2
#     n_rows = (len(num_cols) + 1) // 2
#     fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
#     for i, col in enumerate(num_cols):
#         ax = axes[i//n_cols, i%n_cols]
#         sorted_data = np.sort(df[col])
#         yvals = np.arange(len(sorted_data))/float(len(sorted_data)-1)
#         ax.plot(sorted_data, yvals)
#         ax.set_title(f'CDF of {col} in {dataset_name}')
#         ax.set_xlabel(col)
#         ax.set_ylabel('Cumulative Probability')
#     plt.tight_layout()
#     plt.show()
# 
# for df in datasets:
#     plot_cdfs(df)
# 
# # 8. Box Plots by Category
# def plot_boxplots_by_category(df, category_col, value_col):
#     # Retrieve the dataset name from the DataFrame's attributes
#     dataset_name = df.attrs.get('name', 'Dataset')
#     plt.figure(figsize=(12, 6))
#     sns.boxplot(x=category_col, y=value_col, data=df)
#     plt.title(f'{value_col} by {category_col} in {dataset_name}')
#     plt.xticks(rotation=45)
#     plt.show()
# 
# plot_boxplots_by_category(sales_df, 'item_id', 'price')  # Example
# 
# # 9. Missing Value Patterns
# def plot_missing_values(df):
#     plt.figure(figsize=(10, 6))
#     sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
#     plt.title('Missing Value Patterns')
#     plt.show()
# 
# plot_missing_values(sales)
# 
# 10. Value Counts for Categorical Variables
# def plot_value_counts(df, column):
#     # Retrieve the dataset name from the DataFrame's attributes
#     dataset_name = df.attrs.get('name', 'Dataset')
#     plt.figure(figsize=(10, 6))
#     df[column].value_counts().plot(kind='bar')
#     plt.title(f'Value Counts for {column} in {dataset_name}')
#     plt.xlabel(column)
#     plt.ylabel('Count')
#     plt.xticks(rotation=45)
#     plt.show()
# 
# plot_value_counts(sales_df, 'item_id')  # Example

# 11. Scatter Plot Matrix
def plot_scatter_matrix(df):
    # Retrieve the dataset name from the DataFrame's attributes
    dataset_name = df.attrs.get('name', 'Dataset')
    pd.plotting.scatter_matrix(df.select_dtypes(include=[np.number]), figsize=(15, 15), diagonal='kde')
    plt.suptitle(f'Scatter Plot Matrix of {dataset_name}', y=1.02)
    plt.tight_layout()
    plt.show()

for df in datasets:
    plot_scatter_matrix(df)


# 12. 3D Scatter Plot
def plot_3d_scatter(df, x_col, y_col, z_col):
    # Retrieve the dataset name from the DataFrame's attributes
    dataset_name = df.attrs.get('name', 'Dataset')
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(df[x_col], df[y_col], df[z_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_zlabel(z_col)
    plt.title(f'3D Scatter Plot of {dataset_name}')
    plt.show()

plot_3d_scatter(items_df, 'kcal', 'cost', 'id')  # Example

# 13. Parallel Coordinates Plot
# import matplotlib.pyplot as plt
# from pandas.plotting import parallel_coordinates
# import pandas as pd
# 
# def plot_parallel_coordinates(df):
#     # Retrieve the dataset name from the DataFrame's attributes
#     dataset_name = df.attrs.get('name', 'Dataset')
#     plt.figure(figsize=(12, 6))
#     parallel_coordinates(df.select_dtypes(include=[np.number]), 'id')
#     plt.title(f'Parallel Coordinates Plot of {dataset_name}')
#     plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
#     plt.tight_layout()
#     plt.show()
# for df in datasets:
#     plot_parallel_coordinates(df)
# plot_parallel_coordinates(items_df)

# 14. Andrews Curves
# from pandas.plotting import andrews_curves
# def plot_andrews_curves(df):
#     # Retrieve the dataset name from the DataFrame's attributes
#     dataset_name = df.attrs.get('name', 'Dataset')
#     plt.figure(figsize=(12, 6))
#     andrews_curves(df.select_dtypes(include=[np.number]), 'id')
#     plt.title(f'Andrews Curves of {dataset_name}')
#     plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
#     plt.tight_layout()
#     plt.show()
#     
# plot_andrews_curves(items_df)

# 15. Cluster Analysis
from sklearn.cluster import KMeans
def perform_cluster_analysis(df, n_clusters=3):
    # Retrieve the dataset name from the DataFrame's attributes
    dataset_name = df.attrs.get('name', 'Dataset')
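    # Drop rows with missing values, which KMeans cannot handle
    numeric_data = df.select_dtypes(include=[np.number]).dropna()
    # Set n_init explicitly so results do not depend on the scikit-learn version's default
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    clusters = kmeans.fit_predict(numeric_data)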
    
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(numeric_data.iloc[:, 0], numeric_data.iloc[:, 1], c=clusters, cmap='viridis')
    plt.title(f'K-means Clustering of Numeric Variables in {dataset_name}')
    plt.xlabel(numeric_data.columns[0])
    plt.ylabel(numeric_data.columns[1])
    plt.colorbar(scatter)
    plt.show()

for df in datasets:
    perform_cluster_analysis(df)
# perform_cluster_analysis(items)

**Explanations:**

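- This code examines the shape, structure, and quality of each dataset: the number of rows and columns, the data type of each column, missing values, and duplicate entries. It also draws box plots, correlation matrices, distribution plots, scatter matrices, and a K-means clustering of the numeric columns to help spot outliers and unusual groupings.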

**Why It Is Important:**

- Understanding the dataset's structure and quality is crucial for data preprocessing and analysis. It helps identify potential issues like missing data or duplicates that need to be addressed before proceeding with the analysis.
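
Box plots flag outliers visually. The same rule (points beyond 1.5 times the interquartile range from the quartiles) can also be applied numerically. The following is a minimal sketch on made-up values; `iqr_outliers` is a hypothetical helper and the `kcal` numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of series lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Illustrative calorie values with one obviously extreme entry.
toy_kcal = pd.Series([200, 220, 250, 260, 280, 300, 1500], name="kcal")
print(iqr_outliers(toy_kcal).tolist())  # [1500]
```

Applied to columns such as `kcal`, `cost`, or `item_count`, this would list the exact rows behind the points drawn outside the box plot whiskers.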

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_1_3_'></a>[**4.1.3. Merge the datasets into a single dataset that includes the date, item id, price, item count, item names, kcal values, store id, and store name**](#toc0_)

In [None]:
# Merge sales with items
merged_df = pd.merge(sales_df, items_df, left_on='item_id', right_on='id', how='left')

# Merge with restaurants
merged_df = pd.merge(merged_df, restaurants_df, left_on='store_id', right_on='id', how='left')

# Rename columns to avoid confusion
merged_df = merged_df.rename(columns={'name_x': 'item_name', 'name_y': 'restaurant_name'})

# Select relevant columns
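# .copy() makes final_df an independent DataFrame, so later column assignments do not target a view of merged_df
final_df = merged_df[['date', 'item_id', 'price', 'item_count', 'item_name', 'kcal', 'store_id', 'restaurant_name']].copy()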

# Convert date to datetime
final_df['date'] = pd.to_datetime(final_df['date'])

# Display the first few rows of the merged dataset
print(final_df.head())

# Display info of the merged dataset
final_df.info()

**Explanations:**

- This code merges the three datasets on their common identifiers (`item_id` and `store_id`). It then renames the two `name` columns for clarity, selects the relevant columns, and converts the date column to datetime format.

**Why It Is Important:**

- Merging the datasets is crucial for conducting comprehensive analyses that involve information from all three sources. It allows us to examine relationships between sales, item characteristics, and restaurant information in a single DataFrame.
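
A left merge keeps every sales row even when no matching item exists, so unmatched keys show up only as missing values. The sketch below, on toy data, uses two standard `pd.merge` options to make such problems visible: `validate` checks the expected key relationship and `indicator` records where each row came from. The frames and values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the sales and items tables (illustrative values only).
toy_sales = pd.DataFrame({"item_id": [1, 2, 99], "item_count": [3, 1, 4]})
toy_items = pd.DataFrame({"id": [1, 2], "name": ["Soup", "Salad"], "store_id": [10, 10]})

checked = pd.merge(
    toy_sales,
    toy_items,
    left_on="item_id",
    right_on="id",
    how="left",
    validate="many_to_one",  # raises MergeError if an item id repeats in toy_items
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Sales rows whose item_id has no match in the items table.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["item_id"].tolist())  # [99]
```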

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

### <a id='toc1_4_2_'></a>[**4.2. Exploratory data analysis**](#toc0_)

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_2_1_'></a>[**4.2.1. Examine the overall date wise sales to understand the pattern**](#toc0_)

In [None]:
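# Compute revenue per row first, then aggregate by date.
# Multiplying the summed price by the summed item count would overstate revenue.
daily_sales = (
    final_df.assign(total_sales=final_df['price'] * final_df['item_count'])
    .groupby('date')
    .agg({'price': 'sum', 'item_count': 'sum', 'total_sales': 'sum'})
)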

# Plot daily sales
plt.figure(figsize=(15, 6))
plt.plot(daily_sales.index, daily_sales['total_sales'])
plt.title('Daily Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Calculate and plot 7-day moving average
daily_sales['7_day_ma'] = daily_sales['total_sales'].rolling(window=7).mean()

plt.figure(figsize=(15, 6))
plt.plot(daily_sales.index, daily_sales['total_sales'], alpha=0.5, label='Daily Sales')
plt.plot(daily_sales.index, daily_sales['7_day_ma'], color='red', label='7-day Moving Average')
plt.title('Daily Sales and 7-day Moving Average')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:


# First, let's check the structure of your DataFrame
print(daily_sales.index)
print(daily_sales.columns)

# If the index is not already a DatetimeIndex, we'll try to convert it
if not isinstance(daily_sales.index, pd.DatetimeIndex):
    # Check if there's a column that could be a date
    date_columns = daily_sales.select_dtypes(include=[np.datetime64]).columns
    if len(date_columns) > 0:
        date_column = date_columns[0]
        daily_sales.set_index(date_column, inplace=True)
    else:
        # If no date column is found, we'll create a date range
        daily_sales.index = pd.date_range(start='2019-01-01', periods=len(daily_sales), freq='D')

# Ensure the index is sorted
daily_sales.sort_index(inplace=True)

# 1. Decomposition Plot
def plot_decomposition():
    result = seasonal_decompose(daily_sales['total_sales'], model='additive', period=365)
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 16))
    result.observed.plot(ax=ax1)
    ax1.set_title('Observed')
    result.trend.plot(ax=ax2)
    ax2.set_title('Trend')
    result.seasonal.plot(ax=ax3)
    ax3.set_title('Seasonal')
    result.resid.plot(ax=ax4)
    ax4.set_title('Residual')
    plt.tight_layout()
    plt.show()

# 2. Heatmap of Daily Sales
def plot_daily_heatmap():
    daily_data = daily_sales['total_sales'].resample('D').sum()
    heatmap_data = pd.DataFrame({
        'year': daily_data.index.year,
        'day_of_year': daily_data.index.dayofyear,
        'sales': daily_data.values
    })
    heatmap_pivot = heatmap_data.pivot(index='year', columns='day_of_year', values='sales')
    plt.figure(figsize=(16, 8))
    sns.heatmap(heatmap_pivot, cmap='YlOrRd')
    plt.title('Daily Sales Heatmap')
    plt.xlabel('Day of Year')
    plt.ylabel('Year')
    plt.show()

# 3. Year-over-Year Comparison
def plot_year_over_year():
    daily_sales['year'] = daily_sales.index.year
    daily_sales['day_of_year'] = daily_sales.index.dayofyear
    plt.figure(figsize=(12, 6))
    for year in daily_sales['year'].unique():
        year_data = daily_sales[daily_sales['year'] == year]
        plt.plot(year_data['day_of_year'], year_data['total_sales'], label=str(year))
    plt.title('Year-over-Year Sales Comparison')
    plt.xlabel('Day of Year')
    plt.ylabel('Total Sales')
    plt.legend()
    plt.show()

# 4. Box plots by Month and Day of Week
def plot_boxplots():
    daily_sales['month'] = daily_sales.index.month
    daily_sales['day_of_week'] = daily_sales.index.day_name()
    
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))
    sns.boxplot(x='month', y='total_sales', data=daily_sales, ax=ax1)
    ax1.set_title('Sales Distribution by Month')
    sns.boxplot(x='day_of_week', y='total_sales', data=daily_sales, 
                order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], ax=ax2)
    ax2.set_title('Sales Distribution by Day of Week')
    plt.tight_layout()
    plt.show()

# 5. Autocorrelation and Partial Autocorrelation Plots
def plot_acf_pacf():
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))
    plot_acf(daily_sales['total_sales'], ax=ax1, lags=50)
    plot_pacf(daily_sales['total_sales'], ax=ax2, lags=50)
    plt.tight_layout()
    plt.show()

# 6. Cumulative Sales Plot
def plot_cumulative_sales():
    cumulative_sales = daily_sales['total_sales'].cumsum()
    plt.figure(figsize=(12, 6))
    plt.plot(cumulative_sales.index, cumulative_sales)
    plt.title('Cumulative Sales Over Time')
    plt.xlabel('Date')
    plt.ylabel('Cumulative Sales')
    plt.show()

# 7. Seasonal Subseries Plot
def plot_seasonal_subseries():
    daily_sales['quarter'] = daily_sales.index.quarter
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    for i, quarter in enumerate([1, 2, 3, 4]):
        ax = axes[i // 2, i % 2]
        quarter_data = daily_sales[daily_sales['quarter'] == quarter]
        sns.lineplot(x=quarter_data.index.dayofyear, y='total_sales', hue=quarter_data.index.year, data=quarter_data, ax=ax)
        ax.set_title(f'Q{quarter} Sales')
    plt.tight_layout()
    plt.show()

# 8. Rolling Statistics
def plot_rolling_stats():
    rolling_mean = daily_sales['total_sales'].rolling(window=7).mean()
    rolling_std = daily_sales['total_sales'].rolling(window=7).std()
    rolling_median = daily_sales['total_sales'].rolling(window=7).median()
    
    plt.figure(figsize=(12, 6))
    plt.plot(daily_sales.index, daily_sales['total_sales'], label='Daily Sales')
    plt.plot(rolling_mean.index, rolling_mean, label='7-day Moving Average')
    plt.plot(rolling_std.index, rolling_std, label='7-day Moving Std')
    plt.plot(rolling_median.index, rolling_median, label='7-day Moving Median')
    plt.title('Rolling Statistics of Daily Sales')
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# 9. Fourier Transform
def plot_fourier_transform():
    sales_fft = fft(daily_sales['total_sales'].values)
    frequencies = np.fft.fftfreq(len(sales_fft))
    plt.figure(figsize=(12, 6))
    plt.plot(frequencies, np.abs(sales_fft))
    plt.title('Fourier Transform of Sales Data')
    plt.xlabel('Frequency')
    plt.ylabel('Magnitude')
    plt.xlim(0, 0.5)  # Only show positive frequencies
    plt.show()

# 10. Anomaly Detection
def detect_anomalies():
    rolling_mean = daily_sales['total_sales'].rolling(window=7).mean()
    rolling_std = daily_sales['total_sales'].rolling(window=7).std()
    anomalies = daily_sales[(daily_sales['total_sales'] > rolling_mean + 2*rolling_std) | 
                            (daily_sales['total_sales'] < rolling_mean - 2*rolling_std)]
    
    plt.figure(figsize=(12, 6))
    plt.plot(daily_sales.index, daily_sales['total_sales'], label='Daily Sales')
    plt.scatter(anomalies.index, anomalies['total_sales'], color='red', label='Anomalies')
    plt.title('Daily Sales with Anomalies')
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Run all the functions
plot_decomposition()
plot_daily_heatmap()
plot_year_over_year()
plot_boxplots()
plot_acf_pacf()
plot_cumulative_sales()
plot_seasonal_subseries()
plot_rolling_stats()
plot_fourier_transform()
detect_anomalies()

**Explanations:**

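- This code derives daily total sales from price and item count, then plots the daily series alongside a 7-day moving average that smooths out daily fluctuations and reveals underlying trends. The second cell runs further diagnostics on the same series: seasonal decomposition, a day-of-year heatmap, a year-over-year comparison, box plots by month and weekday, ACF/PACF plots, cumulative sales, rolling statistics, a Fourier transform, and a simple rolling-window anomaly check.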

**Why It Is Important:**

- Examining overall sales patterns helps identify long-term trends, seasonality, and potential anomalies in the data. This information is crucial for understanding the business's performance over time and can inform forecasting models.
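
When deriving daily revenue from row-level data, it matters whether the multiplication happens before or after aggregation. The toy example below (made-up prices and counts) contrasts summing `price * item_count` per row with multiplying the per-date sums; only the former is the actual revenue.

```python
import pandas as pd

# Two items sold on the first date, one on the second (illustrative values only).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "price": [2.0, 10.0, 5.0],
    "item_count": [3, 1, 2],
})

# Revenue per row, then summed per date.
toy["revenue"] = toy["price"] * toy["item_count"]
per_row = toy.groupby("date")["revenue"].sum()

# Product of the per-date sums, shown for comparison only.
sums = toy.groupby("date")[["price", "item_count"]].sum()
product_of_sums = sums["price"] * sums["item_count"]

print(per_row.tolist())          # [16.0, 10.0]
print(product_of_sums.tolist())  # [48.0, 10.0]
```

The two agree only on dates with a single row, so the discrepancy grows with the number of distinct items sold per day.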

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_2_2_'></a>[**4.2.2. Find out how sales fluctuate across different days of the week**](#toc0_)

In [None]:
print("The daily_sales dataset info:")
daily_sales.info()  # info() prints directly; wrapping it in print() would also print "None"

In [None]:
# Add day of week column
daily_sales['day_of_week'] = daily_sales.index.dayofweek
daily_sales['day_of_week_name'] = daily_sales.index.day_name()

# Define the weekday order once and reuse it
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Calculate average sales and total items sold by day of week
avg_sales_by_day = daily_sales.groupby('day_of_week_name')['total_sales'].mean().reindex(days_order)
total_sales_by_day = daily_sales.groupby('day_of_week_name')['item_count'].sum().reindex(days_order)

# Plot average sales by day of week
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_sales_by_day.index, y=avg_sales_by_day.values, palette='viridis')
plt.title('Average Sales by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Average Sales')

# Annotate each bar with the average sales value formatted with commas and dollar signs
for index, value in enumerate(avg_sales_by_day.values):
    plt.text(index, value, f'${value:,.2f}', ha='center', va='bottom')

plt.show()

plt.figure(figsize=(10, 6))
sns.barplot(x=total_sales_by_day.index, y=total_sales_by_day.values, palette='viridis')
plt.title('Total Item Sales by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Total Items Sold')

# Annotate each bar with the total item count formatted with commas
for index, value in enumerate(total_sales_by_day.values):
    plt.text(index, value, f'{value:,.0f}', ha='center', va='bottom')
plt.show()

# Define the order of the days
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Boxplot of sales by day of week
plt.figure(figsize=(12, 6))
sns.boxplot(x='day_of_week_name', y='total_sales', data=daily_sales, order=days_order, palette='viridis')
plt.title('Sales Distribution by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Total Sales')

# Annotate the median values
medians = daily_sales.groupby('day_of_week_name')['total_sales'].median().reindex(days_order)
for index, median in enumerate(medians):
    plt.text(index, median, f'${median:,.2f}', ha='center', va='bottom', color='black', weight='bold')

plt.show()


**Explanations:**

- This code analyzes sales fluctuations across different days of the week. It calculates average sales for each day and creates a bar plot to visualize these averages. Additionally, it generates a box plot to show the distribution of sales for each day of the week.

**Why It Is Important:**

- Understanding weekly sales patterns is crucial for inventory management, staffing decisions, and marketing strategies. It can help businesses optimize their operations based on expected demand for different days of the week.
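
Grouping by weekday name sorts the result alphabetically unless an order is imposed. One alternative to reindexing after each `groupby` is to store the weekday as an ordered categorical, as sketched below on a toy series. The data and the `day_name` column are illustrative assumptions.

```python
import pandas as pd

days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Two weeks of toy daily sales, starting on a Monday (illustrative values only).
dates = pd.date_range("2021-01-04", periods=14, freq="D")
toy = pd.DataFrame({"total_sales": range(14)}, index=dates)

# An ordered categorical keeps Monday..Sunday order in every later groupby or plot.
toy["day_name"] = pd.Categorical(toy.index.day_name(), categories=days_order, ordered=True)
avg_by_day = toy.groupby("day_name", observed=False)["total_sales"].mean()

print(list(avg_by_day.index))
print(avg_by_day["Monday"], avg_by_day["Sunday"])  # 3.5 9.5
```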

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_2_3_'></a>[**4.2.3. Look for any noticeable trends in the sales data for different months of the year**](#toc0_)

In [None]:
# Add month column
daily_sales['month'] = daily_sales.index.month
daily_sales['month_name'] = daily_sales.index.month_name()

# Calendar order for the month names (groupby alone would sort them alphabetically)
months_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Average daily sales amount and total items sold for each month
avg_sales_by_month = daily_sales.groupby('month_name')['total_sales'].mean().reindex(months_order)
total_sales_by_month = daily_sales.groupby('month_name')['item_count'].sum().reindex(months_order)

# Plot average sales by month
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_sales_by_month.index, y=avg_sales_by_month.values, palette='viridis')
plt.title('Average Sales by Month')
plt.xlabel('Month')
plt.xticks(rotation=45)
plt.ylabel('Average Sales')

# Annotate each bar with the average sales value formatted with commas and dollar signs
for index, value in enumerate(avg_sales_by_month.values):
    plt.text(index, value, f'${value:,.2f}', ha='center', va='bottom', fontsize=8)
plt.show()


# Plot total sales by month
plt.figure(figsize=(12, 6))
sns.barplot(x=total_sales_by_month.index, y=total_sales_by_month.values, palette='viridis')
plt.title('Total Item Sales by Month')
plt.xlabel('Month')
plt.ylabel('Total Items Sold')
plt.xticks(rotation=45)

# Annotate each bar with the total item count, formatted with thousands separators
for index, value in enumerate(total_sales_by_month.values):
    plt.text(index, value, f'{value:,.0f}', ha='center', va='bottom')
plt.show()

# Heatmap of average sales by month and day of week
# Pivot the data so months become rows and days of the week become columns
monthly_dow_sales = daily_sales.groupby(['month_name', 'day_of_week_name'])['total_sales'].mean().unstack()

# Define the correct order for the days of the week and months of the year
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
months_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Reindex the DataFrame to ensure the correct order
monthly_dow_sales = monthly_dow_sales.reindex(index=months_order, columns=days_order)

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(monthly_dow_sales, cmap='YlOrRd', annot=True, fmt='.0f')
plt.title('Average Sales by Month and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Month')
plt.show()

**Explanations:**

- This code analyzes sales trends across the months of the year. It plots the average daily sales amount and the total number of items sold for each month, then draws a heatmap of average sales by month and day of week.

**Why It Is Important:**

- Analyzing monthly sales trends helps identify seasonal patterns in the business. This information is crucial for long-term planning, inventory management, and marketing strategies. It can also inform the development of more accurate forecasting models by accounting for seasonality.

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_2_4_'></a>[**4.2.4. Examine the sales distribution across different quarters averaged over the years. Identify any noticeable patterns**](#toc0_)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Add quarter column
daily_sales['quarter'] = daily_sales.index.quarter

# Calculate average sales by quarter
avg_sales_by_quarter = daily_sales.groupby('quarter')['total_sales'].mean()

# Plot average sales by quarter
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=avg_sales_by_quarter.index, y=avg_sales_by_quarter.values, palette='viridis')
plt.title('Average Sales by Quarter')
plt.xlabel('Quarter')
plt.ylabel('Average Sales')

# Add value labels on the bars
for i, v in enumerate(avg_sales_by_quarter.values):
    ax.text(i, v, f'${v:,.0f}', ha='center', va='bottom')

plt.show()

# Boxplot of sales by quarter
plt.figure(figsize=(12, 6))
ax = sns.boxplot(x='quarter', y='total_sales', data=daily_sales, palette='viridis')
plt.title('Sales Distribution by Quarter')
plt.xlabel('Quarter')
plt.ylabel('Total Sales')

# Add median values on the boxplot
medians = daily_sales.groupby('quarter')['total_sales'].median()
for i, median in enumerate(medians):
    ax.text(i, median, f'${median:,.0f}', ha='center', va='bottom', color='white', fontweight='bold')

plt.show()

# Violin plot
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='quarter', y='total_sales', data=daily_sales, palette='viridis')
plt.title('Sales Distribution by Quarter (Violin Plot)')
plt.xlabel('Quarter')
plt.ylabel('Total Sales')

# Add median values on the violin plot
for i, median in enumerate(medians):
    ax.text(i, median, f'${median:,.0f}', ha='center', va='center', color='white', fontweight='bold')

plt.show()

# Box plot with swarm
plt.figure(figsize=(12, 6))
ax = sns.boxplot(x='quarter', y='total_sales', data=daily_sales, palette='viridis')
sns.swarmplot(x='quarter', y='total_sales', data=daily_sales, color=".25", size=3)
plt.title('Sales Distribution by Quarter (Box Plot with Swarm)')
plt.xlabel('Quarter')
plt.ylabel('Total Sales')

# Add median values on the boxplot
for i, median in enumerate(medians):
    ax.text(i, median, f'${median:,.0f}', ha='center', va='bottom', color='white', fontweight='bold')

plt.show()

# Year-over-Year Quarterly Sales Growth heatmap
quarterly_yoy = daily_sales.groupby([daily_sales.index.year, 'quarter'])['total_sales'].sum().unstack()
quarterly_yoy_pct = quarterly_yoy.pct_change() * 100

plt.figure(figsize=(12, 6))
ax = sns.heatmap(quarterly_yoy_pct, cmap='RdYlGn', annot=True, fmt='.1f')
plt.title('Year-over-Year Quarterly Sales Growth (%)')
plt.xlabel('Quarter')
plt.ylabel('Year')

# Rotate the tick labels
plt.yticks(rotation=0)

plt.show()

**Explanations:**

- This code examines the sales distribution across the quarters of the year. It plots average sales per quarter, shows the spread of daily sales within each quarter using a box plot, a violin plot, and a box plot overlaid with a swarm plot, and draws a heatmap of year-over-year sales growth for each quarter.
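
To illustrate how the year-over-year heatmap values are derived, here is a minimal sketch on made-up numbers (the `toy` table and its figures are illustrative assumptions, not project data). With years as rows and quarters as columns, `pct_change()` compares each row with the one above it, so each cell is the growth of a quarter over the same quarter one year earlier. The first year has no predecessor, so its row is `NaN`.

```python
import pandas as pd

# Toy quarterly totals: rows are years, columns are quarters (illustrative values only)
toy = pd.DataFrame(
    {1: [100.0, 110.0], 2: [200.0, 180.0]},
    index=pd.Index([2020, 2021], name='year'),
)
toy.columns.name = 'quarter'

# Row-wise percentage change = same quarter compared with the previous year
toy_yoy_pct = toy.pct_change() * 100
print(toy_yoy_pct)
```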

**Why It Is Important:**

- Analyzing quarterly sales patterns helps identify broader seasonal trends and potential fiscal year effects on the business. This information is valuable for strategic planning, budgeting, and setting sales targets. It can also reveal how the business performance has evolved over the years on a quarterly basis.

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_2_5_'></a>[**4.2.5. Compare the performances of the different restaurants. Find out which restaurant had the most sales and look at the sales for each restaurant across different years, months, and days**](#toc0_)

In [None]:
import calendar

# Calculate daily sales for each restaurant
restaurant_daily_sales = final_df.groupby(['date', 'restaurant_name']).agg({
    'price': lambda x: (x * final_df.loc[x.index, 'item_count']).sum()
}).reset_index().rename(columns={'price': 'daily_sales'})

# Calculate total sales by restaurant
restaurant_sales = final_df.groupby('restaurant_name').agg({
    'price': lambda x: (x * final_df.loc[x.index, 'item_count']).sum()
}).rename(columns={'price': 'total_sales'}).sort_values('total_sales', ascending=False)

# Bob's Diner is plotted on its own secondary y-axis so the other restaurants stay readable
is_bobs = restaurant_sales.index == "Bob's Diner"
x_pos = np.arange(len(restaurant_sales))

fig, ax1 = plt.subplots(figsize=(12, 6))
ax2 = ax1.twinx()

# Explicit x positions keep the two sets of bars from overlapping on the shared x-axis
bars_other = ax1.bar(x_pos[~is_bobs], restaurant_sales.loc[~is_bobs, 'total_sales'],
                     color=sns.color_palette('husl', int((~is_bobs).sum())))
bars_bobs = ax2.bar(x_pos[is_bobs], restaurant_sales.loc[is_bobs, 'total_sales'],
                    color='red', alpha=0.5)

ax1.set_xticks(x_pos)
ax1.set_xticklabels(restaurant_sales.index)
ax1.set_xlabel('Restaurant')
ax1.set_ylabel('Total Sales (Other Restaurants)')
ax2.set_ylabel("Total Sales (Bob's Diner)")

# Add values on top of the bars on both axes
for ax, bars in ((ax1, bars_other), (ax2, bars_bobs)):
    for p in bars:
        ax.annotate('${:,.2f}'.format(p.get_height()),
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center',
                    xytext=(0, 9),
                    textcoords='offset points')

plt.title('Total Sales by Restaurant (Dual Axis)')
plt.tight_layout()
plt.show()

# Assuming final_df is already created and contains the necessary data
# Calculate total sales if not already done
final_df['total_sales'] = final_df['price'] * final_df['item_count']

# Function to create individual bar plots
def plot_individual_bars(data, x, y, title, x_label, y_label):
    # Define a color palette
    palette = sns.color_palette("husl", len(data['restaurant_name'].unique()))
    
    restaurants = data['restaurant_name'].unique()
    
    for restaurant in restaurants:
        restaurant_data = data[data['restaurant_name'] == restaurant]
        
        plt.figure(figsize=(12, 6))
        ax = sns.barplot(data=restaurant_data, x=x, y=y, palette=palette)
        plt.title(f"{title} - {restaurant}")
        plt.xlabel(x_label)
        plt.ylabel(y_label)
        
        # Add values on top of bars
        if restaurant != "Bob's Diner" or title != "Average Daily Sales":
            for p in ax.patches:
                ax.annotate(f'${p.get_height():,.2f}', 
                            (p.get_x() + p.get_width() / 2., p.get_height()), 
                            ha='center', va='center', 
                            xytext=(0, 9), 
                            textcoords='offset points')
        
        plt.tight_layout()
        plt.show()

# 1. Yearly sales
yearly_sales = final_df.groupby([final_df['date'].dt.year, 'restaurant_name'])['total_sales'].sum().reset_index()
yearly_sales = yearly_sales.rename(columns={'date': 'year'})
plot_individual_bars(yearly_sales, 'year', 'total_sales', 'Yearly Sales', 'Year', 'Total Sales')

# 2. Sales by calendar month (mean of the per-row sales amounts, pooled across years)
monthly_sales = final_df.groupby([final_df['date'].dt.month, 'restaurant_name'])['total_sales'].mean().reset_index()
monthly_sales = monthly_sales.rename(columns={'date': 'month'})
monthly_sales['month_name'] = monthly_sales['month'].apply(lambda x: calendar.month_abbr[x])
plot_individual_bars(monthly_sales, 'month_name', 'total_sales', 'Average Monthly Sales', 'Month', 'Average Sales')

# 3. Sales by day of month (averaged across years and months)
# Use a separate name so the date-indexed daily_sales frame from earlier cells is not overwritten
day_of_month_sales = final_df.groupby([final_df['date'].dt.day, 'restaurant_name'])['total_sales'].mean().reset_index()
day_of_month_sales = day_of_month_sales.rename(columns={'date': 'day'})
plot_individual_bars(day_of_month_sales, 'day', 'total_sales', 'Average Daily Sales', 'Day of Month', 'Average Sales')

# 4. Day-name sales (averaged across all dates)
final_df['day_name'] = final_df['date'].dt.day_name()
day_name_sales = final_df.groupby(['day_name', 'restaurant_name'])['total_sales'].mean().reset_index()
day_name_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_name_sales['day_name'] = pd.Categorical(day_name_sales['day_name'], categories=day_name_order, ordered=True)
day_name_sales = day_name_sales.sort_values('day_name')
plot_individual_bars(day_name_sales, 'day_name', 'total_sales', 'Average Sales by Day of Week', 'Day of Week', 'Average Sales')


# Function to create individual boxplots for each restaurant
def plot_individual_boxplots(data, x, y, title_prefix, x_label, y_label, is_monthly=False):
    restaurants = data['restaurant_name'].unique()
    colors = plt.cm.rainbow(np.linspace(0, 1, len(restaurants)))
    
    for restaurant, color in zip(restaurants, colors):
        restaurant_data = data[data['restaurant_name'] == restaurant]
        
        if is_monthly:
            plt.figure(figsize=(20, 10))
            ax = sns.boxplot(data=restaurant_data, x='month', y=y, order=range(1, 13), palette='viridis')
            plt.title(f"{title_prefix} - {restaurant}")
            plt.xlabel(x_label)
            plt.ylabel(y_label)
            
            # Customize x-axis labels
            month_names = [calendar.month_abbr[i] for i in range(1, 13)]
            plt.xticks(range(12), month_names)
            
            # Add median values on the boxplot (reindexed so positions match the 12 month slots)
            medians = restaurant_data.groupby('month')[y].median().reindex(range(1, 13))
            for i, median in enumerate(medians):
                if pd.notna(median):
                    ax.text(i, median, f'${median:,.2f}', 
                            horizontalalignment='center', size='small', color='white', 
                            weight='semibold', bbox=dict(facecolor='black', edgecolor='none', alpha=0.5))
        else:
            plt.figure(figsize=(12, 6))
            ax = sns.boxplot(data=restaurant_data, x=x, y=y, color=color)
            plt.title(f"{title_prefix} - {restaurant}")
            plt.xlabel(x_label)
            plt.ylabel(y_label)
            
            # Add median values on the boxplot
            medians = restaurant_data.groupby(x)[y].median().values
            pos = range(len(medians))
            for tick, label in zip(pos, ax.get_xticklabels()):
                ax.text(pos[tick], medians[tick], f'${medians[tick]:,.2f}', 
                        horizontalalignment='center', size='small', color='black', 
                        weight='semibold', rotation=0, alpha=0.7, 
                        bbox=dict(facecolor='white', edgecolor='none', alpha=0.7))
        
        plt.tight_layout()
        plt.show()

# 1. Daily sales (named separately so the date-indexed daily_sales frame from earlier cells is not overwritten)
restaurant_daily_totals = final_df.groupby(['date', 'restaurant_name'])['total_sales'].sum().reset_index()
plot_individual_boxplots(restaurant_daily_totals, 'restaurant_name', 'total_sales', 'Daily Sales Distribution', 'Restaurant', 'Daily Total Sales')

# 2. Monthly sales
final_df['month'] = final_df['date'].dt.month
monthly_sales = final_df.groupby(['month', 'restaurant_name', final_df['date'].dt.to_period('M')])['total_sales'].sum().reset_index()
plot_individual_boxplots(monthly_sales, 'month', 'total_sales', 'Monthly Sales Distribution', 'Month', 'Monthly Total Sales', is_monthly=True)

# 3. Quarterly sales
final_df['year_quarter'] = final_df['date'].dt.to_period('Q')
quarterly_sales = final_df.groupby(['year_quarter', 'restaurant_name'])['total_sales'].sum().reset_index()
quarterly_sales['year_quarter'] = quarterly_sales['year_quarter'].astype(str)
plot_individual_boxplots(quarterly_sales, 'year_quarter', 'total_sales', 'Quarterly Sales Distribution', 'Year-Quarter', 'Quarterly Total Sales')

# 4. Yearly sales
yearly_sales = final_df.groupby([final_df['date'].dt.year, 'restaurant_name'])['total_sales'].sum().reset_index()
yearly_sales = yearly_sales.rename(columns={'date': 'year'})
plot_individual_boxplots(yearly_sales, 'year', 'total_sales', 'Yearly Sales Distribution', 'Year', 'Yearly Total Sales')

# 5. Day of week sales
final_df['day_of_week'] = final_df['date'].dt.day_name()
day_of_week_sales = final_df.groupby(['day_of_week', 'restaurant_name', 'date'])['total_sales'].sum().reset_index()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_of_week_sales['day_of_week'] = pd.Categorical(day_of_week_sales['day_of_week'], categories=day_order, ordered=True)
day_of_week_sales = day_of_week_sales.sort_values('day_of_week')
plot_individual_boxplots(day_of_week_sales, 'day_of_week', 'total_sales', 'Day of Week Sales Distribution', 'Day of Week', 'Total Sales')


**Explanations:**

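- This code compares restaurant performance. It computes total sales (price times item count) per restaurant and plots them with Bob's Diner on a secondary y-axis. It then draws, for each restaurant, bar charts of sales by year, calendar month, day of month, and day of week, followed by box plots of daily, monthly, quarterly, yearly, and day-of-week sales totals.

**Why It Is Important:**

- Comparing restaurants shows which locations contribute the most sales and whether they follow the same calendar patterns. This helps decide whether forecasts should be built per restaurant or on the aggregated series.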

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_2_6_'></a>[**4.2.6. Identify the most popular items overall and the stores where they are being sold. Also, find out the most popular item at each store**](#toc0_)

In [None]:
# Calculate total sales and quantity sold for each item
item_sales = final_df.groupby(['item_id', 'item_name']).agg({
    'price': lambda x: (x * final_df.loc[x.index, 'item_count']).sum(),
    'item_count': 'sum'
}).reset_index().rename(columns={'price': 'total_sales', 'item_count': 'quantity_sold'})

# Sort items by quantity sold and get top 10
top_10_items = item_sales.sort_values('quantity_sold', ascending=False).head(10)

# Plot top 10 items by quantity sold
plt.figure(figsize=(12, 6))
sns.barplot(x='item_name', y='quantity_sold', data=top_10_items)
plt.title('Top 10 Items by Quantity Sold')
plt.xlabel('Item')
plt.ylabel('Quantity Sold')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Identify most popular item at each store
store_popular_items = final_df.groupby(['store_id', 'restaurant_name', 'item_id', 'item_name'])['item_count'].sum().reset_index()
store_popular_items = store_popular_items.loc[store_popular_items.groupby('store_id')['item_count'].idxmax()]

print("Most popular item at each store:")
print(store_popular_items[['restaurant_name', 'item_name', 'item_count']])

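# Plot distribution of per-row sales amounts for the top 5 items by revenue
top_5_items = item_sales.sort_values('total_sales', ascending=False).head(5)
top_5_item_sales = final_df[final_df['item_id'].isin(top_5_items['item_id'])].copy()
# The unit price does not vary within an item, so plot the sales amount (price * quantity) instead
top_5_item_sales['sales_amount'] = top_5_item_sales['price'] * top_5_item_sales['item_count']

plt.figure(figsize=(12, 6))
sns.boxplot(x='item_name', y='sales_amount', data=top_5_item_sales)
plt.title('Sales Distribution for Top 5 Items')
plt.xlabel('Item')
plt.ylabel('Sales Amount')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()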

**Explanations:**

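- This code totals revenue and quantity sold for each item, plots the ten items with the highest quantity sold, and prints the best-selling item at each store. It also draws a box plot for the five items with the highest revenue.

**Why It Is Important:**

- Knowing which items sell the most, and where, supports menu, pricing, and stocking decisions, and indicates which item-level series matter most for demand forecasting.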

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_2_7_'></a>[**4.2.7. Determine if the store with the highest sales volume is also making the most money per day**](#toc0_)

In [None]:
# Calculate daily sales volume and revenue for each store
daily_store_performance = final_df.groupby(['date', 'store_id', 'restaurant_name']).agg({
    'item_count': 'sum',
    'price': lambda x: (x * final_df.loc[x.index, 'item_count']).sum()
}).reset_index().rename(columns={'item_count': 'daily_volume', 'price': 'daily_revenue'})

# Calculate average daily volume and revenue
avg_store_performance = daily_store_performance.groupby('restaurant_name').agg({
    'daily_volume': 'mean',
    'daily_revenue': 'mean'
}).reset_index()

# Plot relationship between average daily volume and revenue
plt.figure(figsize=(10, 6))
sns.scatterplot(x='daily_volume', y='daily_revenue', data=avg_store_performance, s=100)

for i, row in avg_store_performance.iterrows():
    plt.annotate(row['restaurant_name'], (row['daily_volume'], row['daily_revenue']))

plt.title('Average Daily Sales Volume vs Revenue by Restaurant')
plt.xlabel('Average Daily Sales Volume')
plt.ylabel('Average Daily Revenue')
plt.tight_layout()
plt.show()

# Calculate correlation between daily volume and revenue
correlation = daily_store_performance['daily_volume'].corr(daily_store_performance['daily_revenue'])
print(f"Correlation between daily sales volume and revenue: {correlation:.2f}")

# Identify store with highest sales volume and its revenue ranking
highest_volume_store = avg_store_performance.loc[avg_store_performance['daily_volume'].idxmax()]
# method='min' yields whole-number ranks, so the rank prints as an integer (1 = highest revenue)
revenue_rank = avg_store_performance['daily_revenue'].rank(ascending=False, method='min').astype(int)
highest_volume_store_rank = revenue_rank[highest_volume_store.name]

print(f"Store with highest sales volume: {highest_volume_store['restaurant_name']}")
print(f"Its revenue rank: {highest_volume_store_rank} out of {len(avg_store_performance)}")

**Explanations:**

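- This code aggregates item counts (volume) and revenue per store per day, averages both by restaurant, and plots average daily volume against average daily revenue. It then reports the correlation between daily volume and daily revenue, and the revenue rank of the store with the highest average daily volume.

**Why It Is Important:**

- Volume and revenue can diverge when stores sell items at different price points. Checking both shows whether the busiest store is also the one earning the most per day.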

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_2_8_'></a>[**4.2.8. Identify the most expensive item at each restaurant and find out its calorie count**](#toc0_)

In [None]:
# Get the most expensive item at each restaurant
most_expensive_items = final_df.groupby(['store_id', 'restaurant_name', 'item_id', 'item_name', 'price', 'kcal']).size().reset_index(name='count')
most_expensive_items = most_expensive_items.loc[most_expensive_items.groupby('store_id')['price'].idxmax()]

# Sort by price in descending order
most_expensive_items = most_expensive_items.sort_values('price', ascending=False)

# Display the most expensive items and their calorie counts
print("Most expensive item at each restaurant and its calorie count:")
print(most_expensive_items[['restaurant_name', 'item_name', 'price', 'kcal']])

# Visualize price vs calorie count for the most expensive items
plt.figure(figsize=(12, 6))
sns.scatterplot(x='kcal', y='price', data=most_expensive_items, s=100)

for i, row in most_expensive_items.iterrows():
    plt.annotate(row['restaurant_name'], (row['kcal'], row['price']))

plt.title('Price vs Calorie Count for Most Expensive Items')
plt.xlabel('Calories')
plt.ylabel('Price')
plt.tight_layout()
plt.show()

# Calculate correlation between price and calorie count, counting each item once
price_calorie_correlation = final_df.drop_duplicates(subset='item_id')[['price', 'kcal']].corr().iloc[0, 1]
print(f"Correlation between price and calorie count for all items: {price_calorie_correlation:.2f}")

**Explanations:**

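- This code finds the highest-priced item at each restaurant, prints it together with its calorie count, and plots price against calories for those items. It also computes the correlation between price and calorie count across items.

**Why It Is Important:**

- Relating price to calorie content indicates whether higher prices go with larger or richer items, which is useful context when interpreting item-level sales.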

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

### <a id='toc1_4_3_'></a>[**4.3. Forecasting using machine learning algorithms**](#toc0_)

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_3_1_'></a>[**4.3.1. Forecasting using machine learning algorithms**](#toc0_)

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

##### <a id='toc1_4_3_1_1_'></a>[**4.3.1.1. Generate necessary features for the development of these models, like day of the week, quarter of the year, month, year, day of the month and so on**](#toc0_)
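
As a starting point for this step, the following is a minimal sketch of calendar feature generation. It assumes a frame indexed by date, like `daily_sales` above. The helper name `add_calendar_features`, the extra `is_weekend` flag, and the toy values are illustrative assumptions rather than part of the project code.

```python
import pandas as pd

def add_calendar_features(df):
    """Return a copy of a date-indexed frame with calendar feature columns added."""
    out = df.copy()
    idx = out.index
    out['day_of_week'] = idx.dayofweek      # Monday=0 ... Sunday=6
    out['day_of_month'] = idx.day
    out['day_of_year'] = idx.dayofyear
    out['month'] = idx.month
    out['quarter'] = idx.quarter
    out['year'] = idx.year
    out['is_weekend'] = (idx.dayofweek >= 5).astype(int)
    return out

# Toy example with made-up values
toy_sales = pd.DataFrame(
    {'total_sales': [10.0, 12.0, 9.0]},
    index=pd.to_datetime(['2021-01-01', '2021-01-02', '2021-04-05']),
)
features = add_calendar_features(toy_sales)
print(features)
```

Working on a copy keeps the original frame unchanged, so the feature step can be re-run without side effects.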

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

##### <a id='toc1_4_3_1_2_'></a>[**4.3.1.2. Use the data from the last six months as the testing data**](#toc0_)
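
One way to hold out the last six months is a chronological split on the date index, sketched below. The helper name `split_last_months` and the toy frame are assumptions for illustration. The cutoff is taken as six calendar months before the latest date in the data, and rows are never shuffled, so the test set always lies after the training set in time.

```python
import pandas as pd

def split_last_months(df, months=6):
    """Split a date-indexed frame chronologically; the final `months` months form the test set."""
    df = df.sort_index()
    cutoff = df.index.max() - pd.DateOffset(months=months)
    train = df[df.index <= cutoff]
    test = df[df.index > cutoff]
    return train, test

# Toy daily series with made-up values
toy = pd.DataFrame(
    {'total_sales': range(730)},
    index=pd.date_range('2020-01-01', periods=730, freq='D'),
)
train, test = split_last_months(toy, months=6)
print(train.index.max(), test.index.min(), len(test))
```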

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

##### <a id='toc1_4_3_1_3_'></a>[**4.3.1.3. Compute the root mean square error (RMSE) values for each model to compare their performances**](#toc0_)
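
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target and penalizes large errors more heavily. The sketch below computes it with NumPy and picks the model with the lowest score. The model names and all numbers are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actuals and predictions from two hypothetical models
actual = [100.0, 120.0, 90.0]
predictions = {
    'model_a': [110.0, 115.0, 95.0],
    'model_b': [100.0, 100.0, 100.0],
}
scores = {name: rmse(actual, pred) for name, pred in predictions.items()}
best_model = min(scores, key=scores.get)
print(scores, best_model)
```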

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

##### <a id='toc1_4_3_1_4_'></a>[**4.3.1.4. Use the best-performing models to make a forecast for the next year**](#toc0_)
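
Forecasting the next year with a feature-based model requires the same calendar features for dates that are not yet in the data. The sketch below builds such a frame for the 365 days after a given date. The helper name `make_future_frame`, the chosen feature columns, and the example date are assumptions; in practice the columns would have to match whatever the fitted model was trained on.

```python
import pandas as pd

def make_future_frame(last_date, periods=365):
    """Build calendar features for the `periods` days following `last_date`."""
    start = pd.Timestamp(last_date) + pd.Timedelta(days=1)
    future_index = pd.date_range(start, periods=periods, freq='D')
    return pd.DataFrame({
        'day_of_week': future_index.dayofweek,
        'day_of_month': future_index.day,
        'month': future_index.month,
        'quarter': future_index.quarter,
        'year': future_index.year,
    }, index=future_index)

# Illustrative last observed date (not taken from the project data)
future = make_future_frame('2021-12-31')
print(future.head())
```

The resulting frame can then be passed to the `predict` method of the best-performing model.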

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

### <a id='toc1_4_4_'></a>[**4.4. Forecasting using deep learning algorithms**](#toc0_)

#### <a id='toc1_4_4_1_'></a>[**4.4.1. Use sales amount for predictions instead of item count**](#toc0_)
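
A minimal sketch of turning item counts into a daily sales-amount series is shown below. It assumes rows with `date`, `price`, and `item_count` columns, as in `final_df`. The toy values are made up, and filling missing days with zero via `asfreq` is an assumption that may not suit days when the restaurants were closed.

```python
import pandas as pd

# Toy rows mimicking the merged sales data (values are made up)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02']),
    'price': [2.5, 4.0, 2.5],
    'item_count': [2, 1, 4],
})

# Sales amount per row, then one total per calendar day
toy['sales_amount'] = toy['price'] * toy['item_count']
daily_amount = toy.groupby('date')['sales_amount'].sum().asfreq('D', fill_value=0.0)
print(daily_amount)
```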

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_4_2_'></a>[**4.4.2. Build a long short-term memory (LSTM) model for predictions**](#toc0_)

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

##### <a id='toc1_4_4_2_1_'></a>[**4.4.2.1. Define the train and test series**](#toc0_)
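
Sequence models are usually trained on sliding windows: each input is the previous `window` values and the target is the value that follows. The sketch below builds such windows with NumPy and splits the series chronologically. The helper name `make_windows`, the window length, and the split point are illustrative assumptions. The trailing axis of size 1 reflects the (samples, timesteps, features) input layout expected by Keras-style LSTM layers.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    values = np.asarray(series, dtype=float)
    X = np.stack([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]
    return X[..., np.newaxis], y

# Toy series 0..9 with made-up split settings
toy_series = np.arange(10, dtype=float)
window = 3
split = 7

# The test series starts `window` steps early so the first test target is the first held-out value
train_series = toy_series[:split]
test_series = toy_series[split - window:]

X_train, y_train = make_windows(train_series, window)
X_test, y_test = make_windows(test_series, window)
print(X_train.shape, y_train, X_test.shape, y_test)
```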

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

##### <a id='toc1_4_4_2_2_'></a>[**4.4.2.2. Generate synthetic data for the last 12 months**](#toc0_)

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

##### <a id='toc1_4_4_2_3_'></a>[**4.4.2.3. Build and train an LSTM model**](#toc0_)

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

##### <a id='toc1_4_4_2_4_'></a>[**4.4.2.4. Use the model to make predictions for the test data**](#toc0_)

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_4_3_'></a>[**4.4.3. Calculate the mean absolute percentage error (MAPE) and comment on the model's performance**](#toc0_)
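
MAPE is the mean of the absolute errors divided by the actual values, expressed as a percentage. The sketch below is one possible implementation. Skipping days whose actual value is zero is an assumption made here to avoid division by zero, and the numbers are made up.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, ignoring points whose actual value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0
    errors = np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])
    return float(np.mean(errors) * 100)

# Made-up actuals and predictions; the zero actual is excluded from the average
actual = [100.0, 200.0, 0.0]
predicted = [90.0, 220.0, 5.0]
print(f"MAPE: {mape(actual, predicted):.2f}%")
```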

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]

#### <a id='toc1_4_4_4_'></a>[**4.4.4. Develop another model using the entire series for training, and use it to forecast for the next three months**](#toc0_)
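
A model that predicts one step ahead can forecast several months by feeding each prediction back in as input. The sketch below shows this recursive loop in isolation. The helper `recursive_forecast` and the stand-in predictor `mean_of_window` are hypothetical; a trained model's prediction call would replace the stand-in, and roughly 90 steps would cover three months of daily data.

```python
import numpy as np

def recursive_forecast(history, predict_next, window, steps):
    """Forecast `steps` values by feeding each prediction back in as input."""
    buffer = list(np.asarray(history, dtype=float)[-window:])
    forecasts = []
    for _ in range(steps):
        next_value = float(predict_next(np.array(buffer)))
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction
        buffer = buffer[1:] + [next_value]
    return np.array(forecasts)

# Stand-in predictor (mean of the window); a trained model would be used here instead
def mean_of_window(window_values):
    return window_values.mean()

history = [10.0, 20.0, 30.0]
forecast = recursive_forecast(history, mean_of_window, window=3, steps=2)
print(forecast)
```

Because each step builds on earlier predictions, errors can accumulate over long horizons.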

**Explanations:**

- [Placeholder for observations after running the code]

**Why It Is Important:**

- [Placeholder for observations after running the code]

**Observations:**

- [Placeholder for observations after running the code]

**Conclusions:**

- [Placeholder for conclusions based on initial data view]

**Recommendations:**

- [Placeholder for recommendations based on initial data examination]