# Hands-On Activity: Optimizing Sales Prediction with Random Forest

---

## Introduction:
In this hands-on activity, we'll tune a Random Forest Regressor to predict the `Sales` column of a sales dataset. By the end of this session, you'll have a deeper understanding of how to search over hyperparameters with cross-validation to improve a model's performance.

## Instructions:
### Step 1: Load and Prepare the Dataset
We want to predict the `Sales` column from the other features.

In [1]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

Take a moment to explore the dataset. What features are we including, and why are we using one-hot encoding for categorical variables?

In [2]:
# Load the dataset
filename = "sales-data.csv"
df = pd.read_csv(filename)

# Drop columns that contain only missing values
df = df.dropna(axis=1, how='all')

# Select features and target variable
X = df.drop(['Sales', 'SalesForecast','OrderID', 'OrderDate', 'ShipDate', 
             'DaystoShipScheduled', 'SalesperCustomer', 'ProductName', 'ShipMode','State',
             'Country', 'PostalCode', 'Discount','City', 'CustomerName'], axis=1)
y = df['Sales']

# Convert categorical variables to numerical using one-hot encoding
X = pd.get_dummies(X, 
                   columns=['Category', 'Region', 'Segment', 'Sub_Category','ShipStatus'], 
                   drop_first=True, dtype='int')
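
To make the one-hot encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (the `Units` column and the toy values are assumptions for illustration, not taken from `sales-data.csv`). It shows what `drop_first=True` does: one category per variable is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy frame; not the real sales data
toy = pd.DataFrame({
    "Units": [3, 5, 2],
    "Region": ["East", "West", "South"],
})

# drop_first=True drops the first category in sorted order ("East").
# A row with 0 in every Region_* column therefore means Region == "East".
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True, dtype="int")
print(encoded)
```

Dropping one level removes a redundant column. That matters most for linear models; for tree-based models such as Random Forest it mainly keeps the feature matrix a little narrower.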


### Step 2: Initialize Random Forest Regressor and Define Hyperparameter Grid

Why did we choose a Random Forest Regressor, and what do the hyperparameters n_estimators and max_depth signify?

In [3]:
# Initialize the Random Forest Regressor model.
# random_state=42 makes the results reproducible.
# n_jobs controls how many parallel jobs are used to train the model;
# a value of -1 means that all available CPU cores will be used.
model = RandomForestRegressor(random_state=42, n_jobs=-1)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [None, 10, 20, 30],
}
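
It helps to know how much work this grid implies before running the search. The sketch below uses only the standard library to enumerate the combinations and count the model fits, assuming the 5-fold cross-validation used in Step 3 and GridSearchCV's default of refitting the best model once on all the data.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [None, 10, 20, 30],
}
cv_folds = 5  # matches cv=5 in Step 3

# Every combination of the listed values is one candidate model
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_candidates = len(combinations)
# Each candidate is fit once per fold, plus one final refit of the winner
n_fits = n_candidates * cv_folds + 1

print(n_candidates, n_fits)
```

Adding one more value to either list adds a whole row or column of candidates, so the cost of the search grows quickly with the size of the grid.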



### Step 3: Implement GridSearchCV for Hyperparameter Tuning

Explore the purpose of GridSearchCV. What is it doing to our model, and how does it help us find the best hyperparameters?

In [4]:
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, 
                           param_grid=param_grid, 
                           scoring='r2',
                           cv=5, n_jobs=-1)

# Fit the model to the data
grid_search.fit(X, y)
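
To see what GridSearchCV records for each candidate, the following sketch runs a much smaller search on synthetic data and inspects `cv_results_`. The data from `make_regression`, the tiny grid, and the `demo_*` names are assumptions chosen so the example runs quickly; they are not the settings used in this activity.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the sales dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Deliberately tiny grid so the sketch finishes in seconds
demo_grid = {"n_estimators": [10, 50], "max_depth": [None, 3]}

demo_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=demo_grid,
    scoring="r2",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one row per hyperparameter combination
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(summary.sort_values("rank_test_score"))
```

Looking at the full table, rather than only the winner, shows how close the runner-up candidates are and how much the score varies across folds.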


### Step 4: Retrieve and Evaluate the Best Model

What hyperparameters did GridSearchCV identify as the best, and how well does the best model perform according to the R-squared score?

In [5]:
# Get the best model
best_model = grid_search.best_estimator_

# Print the best hyperparameters
print("\nBest Hyperparameters:")
print(grid_search.best_params_)

# Print the best model's performance.
# best_score_ is the mean R-squared across the 5 cross-validation folds.
print("\nBest Model Performance:")
print("Best r2:", grid_search.best_score_)


Best Hyperparameters:
{'max_depth': None, 'n_estimators': 500}

Best Model Performance:
Best r2: 0.8584246011547785
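
The score printed above is an average over validation folds. A complementary check is to hold out a test set that the model never sees during fitting. The sketch below illustrates the idea on synthetic data; the data, the 80/20 split, and the fixed `n_estimators=50` are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the sales dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

# Keep 20% of the rows aside; the model is never fit on them
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Compare the score on seen rows with the score on unseen rows
train_r2 = r2_score(y_train, forest.predict(X_train))
test_r2 = r2_score(y_test, forest.predict(X_test))
print(f"train R2: {train_r2:.3f}  test R2: {test_r2:.3f}")
```

A training score well above the test score is typical for Random Forests and is a reminder that only scores on unseen rows estimate real-world performance.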


### Step 5: Make Predictions on New Data using the Best Model

How can we use the best model to make predictions on new data? What insights can we gain from comparing actual and predicted values?

In [6]:
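# Use the best model to make predictions.
# Note: the first five rows of X stand in for new data here. They were part of
# the training data, so these predictions will look better than on unseen rows.
new_data = X.iloc[:5]
predictions = best_model.predict(new_data)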

# Create a DataFrame to display actual and predicted values
result_df = pd.DataFrame({'Actual': y.iloc[:5].values, 'Predicted': predictions})

# Display the actual and predicted values
result_df


| | Actual | Predicted |
|---|---|---|
| 0 | 16 | 16.416 |
| 1 | 12 | 11.958 |
| 2 | 4 | 3.618 |
| 3 | 273 | 275.106 |
| 4 | 20 | 20.182 |
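
One way to quantify how close these five predictions are is to compute R-squared by hand. The sketch below applies the standard definition, R² = 1 − SS_res / SS_tot, to the values copied from the output above; the variable names are illustrative.

```python
# Values copied from the Actual/Predicted output above
actual = [16, 12, 4, 273, 20]
predicted = [16.416, 11.958, 3.618, 275.106, 20.182]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: error left over after the model's predictions
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: error of always predicting the mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ss_res / ss_tot
print(f"R2 on these five rows: {r2:.5f}")
```

The result is very close to 1, far above the cross-validated score of about 0.858. That gap is expected: these rows were part of the data the model was fit on, so they do not measure performance on genuinely new data.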
