In [1]:

Creating a machine learning model to predict and reduce waste in the hospitality industry involves several steps. Here's a detailed workflow:

1. Data Collection and Preparation
Identify Data Sources: Gather data from various sources such as inventory records, sales data, purchase orders, and waste logs.
Data Integration: Combine data from different sources into a unified format.
Data Cleaning: Handle missing values, outliers, and inconsistencies in the data. <->
Feature Engineering: Create relevant features from the raw data, such as average daily usage, seasonal variations, and supplier reliability.
2. Exploratory Data Analysis (EDA)
Descriptive Statistics: Compute basic statistics to understand the distribution and central tendencies of the data.
Visualization: Use plots (e.g., histograms, scatter plots, heatmaps) to identify patterns, correlations, and trends.
3. Model Selection
Choose Algorithms: Select appropriate machine learning algorithms (e.g., Linear Regression, Decision Trees, Random Forest, Gradient Boosting, etc.).
Baseline Model: Develop a simple baseline model to set a performance benchmark.
4. Model Training and Validation
Split Data: Divide the data into training and testing sets (e.g., 80% training, 20% testing).
Train Models: Train multiple models using the training data.
Hyperparameter Tuning: Optimize model parameters using techniques like grid search or random search.
Cross-Validation: Use cross-validation to assess model performance and prevent overfitting.
5. Model Evaluation
Performance Metrics: Evaluate models using relevant metrics (e.g., Mean Absolute Error, Mean Squared Error, R-squared).
Compare Models: Compare the performance of different models and select the best one.
6. Model Deployment
Integration: Integrate the selected model into the inventory management system.
APIs: Develop APIs to allow the system to interact with the model and make predictions in real-time.
User Interface: Update the user interface to display predictions and recommendations for reducing waste.
7. Monitoring and Maintenance
Track Performance: Continuously monitor the model's performance and make adjustments as needed.
Regular Updates: Retrain the model periodically with new data to maintain accuracy and relevance.
Feedback Loop: Implement a feedback loop to gather user input and improve the system over time.
8. Documentation and Training
Documentation: Create comprehensive documentation for the system, including setup, usage, and troubleshooting guides.
Training: Provide training sessions for staff to ensure they can effectively use the new system.
By following this workflow, you can develop a robust machine-learning model to predict and reduce waste in the hospitality industry, leading to cost savings and more sustainable operations.

SyntaxError: unterminated string literal (detected at line 1) (552744237.py, line 1)

In [4]:
import pandas as pd
import numpy as np

# Load data from the uploaded CSV file
file_path = '/mnt/data/SalesKaggle3.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the data
print("Original Data:\n", data.head())

# Data Cleaning
# Handling missing values
data.fillna(method='ffill', inplace=True)  # Forward fill for simplicity
data.dropna(inplace=True)  # Drop any remaining NaN values

# Removing outliers (example using z-score method)
from scipy import stats
z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))
data = data[(z_scores < 3).all(axis=1)]

# Feature Engineering
# Example: Creating new features such as average daily usage, seasonal variations, and supplier reliability

# Since we don't have dates, we'll create a dummy 'date' column based on 'ReleaseYear'
data['date'] = pd.to_datetime(data['ReleaseYear'], format='%Y')

# Average Daily Usage (dummy example since actual usage data isn't available)
data['daily_usage'] = data['ItemCount'] / 365  # Assuming item count spread over a year

# Seasonal Variations (Example: Extracting month as a feature)
data['month'] = data['date'].dt.month

# Supplier Reliability (Example: Counting orders per marketing type as a proxy)
marketing_reliability = data.groupby('MarketingType')['Order'].count().reset_index()
marketing_reliability.columns = ['MarketingType', 'order_count']
data = pd.merge(data, marketing_reliability, on='MarketingType')

# Display the first few rows of the prepared data
print("Prepared Data:\n", data.head())

# Save the cleaned and prepared data to a new CSV file
output_file_path = '/mnt/data/prepared_sales_data.csv'
data.to_csv(output_file_path, index=False)
print(f"Prepared data saved to {output_file_path}")


FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/SalesKaggle3.csv'

In [6]:
import pandas as pd
import numpy as np
from scipy import stats

# Load data from the local CSV file
file_path = r'C:\Users\silve\Downloads\archive (6)\SalesKaggle3.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the data
print("Original Data:\n", data.head())

# Data Cleaning
# Handling missing values
data.fillna(method='ffill', inplace=True)  # Forward fill for simplicity
data.dropna(inplace=True)  # Drop any remaining NaN values

# Removing outliers (example using z-score method)
z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))
data = data[(z_scores < 3).all(axis=1)]

# Feature Engineering
# Example: Creating new features such as average daily usage, seasonal variations, and supplier reliability

# Since we don't have dates, we'll create a dummy 'date' column based on 'ReleaseYear'
data['date'] = pd.to_datetime(data['ReleaseYear'], format='%Y')

# Average Daily Usage (dummy example since actual usage data isn't available)
data['daily_usage'] = data['ItemCount'] / 365  # Assuming item count spread over a year

# Seasonal Variations (Example: Extracting month as a feature)
data['month'] = data['date'].dt.month

# Supplier Reliability (Example: Counting orders per marketing type as a proxy)
marketing_reliability = data.groupby('MarketingType')['Order'].count().reset_index()
marketing_reliability.columns = ['MarketingType', 'order_count']
data = pd.merge(data, marketing_reliability, on='MarketingType')

# Display the first few rows of the prepared data
print("Prepared Data:\n", data.head())

# Save the cleaned and prepared data to a new CSV file
output_file_path = r'C:\Users\silve\Downloads\archive (6)\prepared_sales_data.csv'
data.to_csv(output_file_path, index=False)
print(f"Prepared data saved to {output_file_path}")



Original Data:
    Order   File_Type  SKU_number  SoldFlag  SoldCount MarketingType  \
0      2  Historical     1737127       0.0        0.0             D   
1      3  Historical     3255963       0.0        0.0             D   
2      4  Historical      612701       0.0        0.0             D   
3      6  Historical      115883       1.0        1.0             D   
4      7  Historical      863939       1.0        1.0             D   

   ReleaseNumber  New_Release_Flag  StrengthFactor  PriceReg  ReleaseYear  \
0             15                 1        682743.0     44.99         2015   
1              7                 1       1016014.0     24.81         2005   
2              0                 0        340464.0     46.00         2013   
3              4                 1        334011.0    100.00         2006   
4              2                 1       1287938.0    121.95         2010   

   ItemCount  LowUserPrice  LowNetPrice  
0          8         28.97        31.84  
1         

  data.fillna(method='ffill', inplace=True)  # Forward fill for simplicity


Prepared Data:
    Order   File_Type  SKU_number  SoldFlag  SoldCount MarketingType  \
0      2  Historical     1737127       0.0        0.0             D   
1      3  Historical     3255963       0.0        0.0             D   
2      4  Historical      612701       0.0        0.0             D   
3      8  Historical      214948       0.0        0.0             D   
4      9  Historical      484059       0.0        0.0             D   

   ReleaseNumber  New_Release_Flag  StrengthFactor  PriceReg  ReleaseYear  \
0             15                 1        682743.0     44.99         2015   
1              7                 1       1016014.0     24.81         2005   
2              0                 0        340464.0     46.00         2013   
3              0                 0       1783153.0    132.00         2011   
4             13                 1       2314801.0     95.95         2010   

   ItemCount  LowUserPrice  LowNetPrice       date  daily_usage  month  \
0          8        

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned and prepared data
file_path = r'C:\Users\silve\Downloads\archive (6)\prepared_sales_data.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the data
print("Prepared Data:\n", data.head())

# Descriptive Statistics
print("\nDescriptive Statistics:\n", data.describe())

# Correlation Matrix
corr_matrix = data.corr()
print("\nCorrelation Matrix:\n", corr_matrix)

# Visualization
# Histograms
data.hist(figsize=(15, 10))
plt.suptitle('Histograms of Numerical Features')
plt.show()

# Scatter Plot
sns.pairplot(data.select_dtypes(include=[np.number]))
plt.suptitle('Scatter Plot of Numerical Features')
plt.show()

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Boxplots for detecting outliers in numerical columns
plt.figure(figsize=(15, 10))
data.select_dtypes(include=[np.number]).boxplot()
plt.title('Boxplot of Numerical Features')
plt.xticks(rotation=45)
plt.show()

# Distribution of categorical features
categorical_columns = data.select_dtypes(include=[object]).columns
for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(data[col])
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=45)
    plt.show()


Prepared Data:
    Order   File_Type  SKU_number  SoldFlag  SoldCount MarketingType  \
0      2  Historical     1737127       0.0        0.0             D   
1      3  Historical     3255963       0.0        0.0             D   
2      4  Historical      612701       0.0        0.0             D   
3      8  Historical      214948       0.0        0.0             D   
4      9  Historical      484059       0.0        0.0             D   

   ReleaseNumber  New_Release_Flag  StrengthFactor  PriceReg  ReleaseYear  \
0             15                 1        682743.0     44.99         2015   
1              7                 1       1016014.0     24.81         2005   
2              0                 0        340464.0     46.00         2013   
3              0                 0       1783153.0    132.00         2011   
4             13                 1       2314801.0     95.95         2010   

   ItemCount  LowUserPrice  LowNetPrice        date  daily_usage  month  \
0          8       

ValueError: could not convert string to float: 'Historical'

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned and prepared data
file_path = r'C:\Users\silve\Downloads\archive (6)\prepared_sales_data.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the data
print("Prepared Data:\n", data.head())

# Descriptive Statistics
print("\nDescriptive Statistics:\n", data.describe())

# Correlation Matrix
corr_matrix = data.corr()
print("\nCorrelation Matrix:\n", corr_matrix)

# Visualization
# Histograms
data.hist(figsize=(15, 10))
plt.suptitle('Histograms of Numerical Features')
plt.show()

# Scatter Plot
sns.pairplot(data.select_dtypes(include=[np.number]))
plt.suptitle('Scatter Plot of Numerical Features')
plt.show()

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Boxplots for detecting outliers in numerical columns
plt.figure(figsize=(15, 10))
data.select_dtypes(include=[np.number]).boxplot()
plt.title('Boxplot of Numerical Features')
plt.xticks(rotation=45)
plt.show()

# Distribution of categorical features
categorical_columns = data.select_dtypes(include=[object]).columns
for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(data[col])
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=45)
    plt.show()


Prepared Data:
    Order   File_Type  SKU_number  SoldFlag  SoldCount MarketingType  \
0      2  Historical     1737127       0.0        0.0             D   
1      3  Historical     3255963       0.0        0.0             D   
2      4  Historical      612701       0.0        0.0             D   
3      8  Historical      214948       0.0        0.0             D   
4      9  Historical      484059       0.0        0.0             D   

   ReleaseNumber  New_Release_Flag  StrengthFactor  PriceReg  ReleaseYear  \
0             15                 1        682743.0     44.99         2015   
1              7                 1       1016014.0     24.81         2005   
2              0                 0        340464.0     46.00         2013   
3              0                 0       1783153.0    132.00         2011   
4             13                 1       2314801.0     95.95         2010   

   ItemCount  LowUserPrice  LowNetPrice        date  daily_usage  month  \
0          8       

ValueError: could not convert string to float: 'Historical'

In [9]:
import pandas as pd
import numpy as np
from scipy import stats

# Load data from the local CSV file
file_path = r'C:\Users\silve\Downloads\archive (6)\SalesKaggle3.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the data
print("Original Data:\n", data.head())

# Data Cleaning
# Handling missing values
data.fillna(method='ffill', inplace=True)  # Forward fill for simplicity
data.dropna(inplace=True)  # Drop any remaining NaN values

# Removing outliers (example using z-score method)
z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))
data = data[(z_scores < 3).all(axis=1)]

# Feature Engineering
# Example: Creating new features such as average daily usage, seasonal variations, and supplier reliability

# Since we don't have dates, we'll create a dummy 'date' column based on 'ReleaseYear'
data['date'] = pd.to_datetime(data['ReleaseYear'], format='%Y')

# Average Daily Usage (dummy example since actual usage data isn't available)
data['daily_usage'] = data['ItemCount'] / 365  # Assuming item count spread over a year

# Seasonal Variations (Example: Extracting month as a feature)
data['month'] = data['date'].dt.month

# Supplier Reliability (Example: Counting orders per marketing type as a proxy)
marketing_reliability = data.groupby('MarketingType')['Order'].count().reset_index()
marketing_reliability.columns = ['MarketingType', 'order_count']
data = pd.merge(data, marketing_reliability, on='MarketingType')

# Display the first few rows of the prepared data
print("Prepared Data:\n", data.head())

# Save the cleaned and prepared data to a new CSV file
output_file_path = r'C:\Users\silve\Downloads\archive (6)\prepared_sales_data.csv'
data.to_csv(output_file_path, index=False)
print(f"Prepared data saved to {output_file_path}")


Original Data:
    Order   File_Type  SKU_number  SoldFlag  SoldCount MarketingType  \
0      2  Historical     1737127       0.0        0.0             D   
1      3  Historical     3255963       0.0        0.0             D   
2      4  Historical      612701       0.0        0.0             D   
3      6  Historical      115883       1.0        1.0             D   
4      7  Historical      863939       1.0        1.0             D   

   ReleaseNumber  New_Release_Flag  StrengthFactor  PriceReg  ReleaseYear  \
0             15                 1        682743.0     44.99         2015   
1              7                 1       1016014.0     24.81         2005   
2              0                 0        340464.0     46.00         2013   
3              4                 1        334011.0    100.00         2006   
4              2                 1       1287938.0    121.95         2010   

   ItemCount  LowUserPrice  LowNetPrice  
0          8         28.97        31.84  
1         

  data.fillna(method='ffill', inplace=True)  # Forward fill for simplicity


Prepared Data:
    Order   File_Type  SKU_number  SoldFlag  SoldCount MarketingType  \
0      2  Historical     1737127       0.0        0.0             D   
1      3  Historical     3255963       0.0        0.0             D   
2      4  Historical      612701       0.0        0.0             D   
3      8  Historical      214948       0.0        0.0             D   
4      9  Historical      484059       0.0        0.0             D   

   ReleaseNumber  New_Release_Flag  StrengthFactor  PriceReg  ReleaseYear  \
0             15                 1        682743.0     44.99         2015   
1              7                 1       1016014.0     24.81         2005   
2              0                 0        340464.0     46.00         2013   
3              0                 0       1783153.0    132.00         2011   
4             13                 1       2314801.0     95.95         2010   

   ItemCount  LowUserPrice  LowNetPrice       date  daily_usage  month  \
0          8        