# Hands-On Activity: Optimizing Sales Prediction with Random Forest

---

## Introduction:
In this hands-on activity, we'll tune a Random Forest Regressor to predict the 'Sales' column of a retail orders dataset. By the end of this session, you'll have a deeper understanding of how to search over hyperparameters with cross-validation and how to check the tuned model on held-out data.

## Instructions:
### Step 1: Load and Prepare the Dataset
Let's assume we want to predict the 'Sales' column based on other features.

In [127]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score  # regression metric; classification metrics do not apply here
from sklearn.model_selection import GridSearchCV, train_test_split

Take a moment to explore the dataset. What features are we including, and why are we using one-hot encoding for categorical variables?

In [128]:
# Load the dataset
filename = "sales-data.csv"
df = pd.read_csv(filename)

# Drop columns in which every value is missing
df = df.dropna(axis=1, how="all")

In [129]:
df.head()

| | OrderDate | Category | City | Country | CustomerName | Discount | OrderID | PostalCode | ProductName | Profit | ... | State | Sub_Category | DaystoShipActual | SalesForecast | ShipStatus | DaystoShipScheduled | SalesperCustomer | ProfitRatio | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-04T00:00:00.000Z | Office Supplies | Houston | United States | Darren Powers | 0.2 | CA-2011-103800 | 77095 | Message Book, Wirebound, Four 5 1/2 X 4 Forms/... | 6 | ... | Texas | Paper | 4 | 22 | Shipped Early | 6 | 16.45 | 33.8 | 29.8941 | -95.6481 |
| 1 | 2011-01-05T00:00:00.000Z | Office Supplies | Naperville | United States | Phillina Ober | 0.2 | CA-2011-112326 | 60540 | Avery 508 | 4 | ... | Illinois | Labels | 4 | 15 | Shipped Early | 6 | 11.78 | 36.3 | 41.7662 | -88.141 |
| 2 | 2011-01-05T00:00:00.000Z | Office Supplies | Naperville | United States | Phillina Ober | 0.8 | CA-2011-112326 | 60540 | GBC Standard Plastic Binding Systems Combs | -5 | ... | Illinois | Binders | 4 | 5 | Shipped Early | 6 | 3.54 | -155.0 | 41.7662 | -88.141 |
| 3 | 2011-01-05T00:00:00.000Z | Office Supplies | Naperville | United States | Phillina Ober | 0.2 | CA-2011-112326 | 60540 | SAFCO Boltless Steel Shelving | -65 | ... | Illinois | Storage | 4 | 357 | Shipped Early | 6 | 272.74 | -23.8 | 41.7662 | -88.141 |
| 4 | 2011-01-06T00:00:00.000Z | Office Supplies | Philadelphia | United States | Mick Brown | 0.2 | CA-2011-141817 | 19143 | Avery Hi-Liter EverBold Pen Style Fluorescent ... | 5 | ... | Pennsylvania | Art | 7 | 26 | Shipped Late | 6 | 19.54 | 25.0 | 39.9448 | -75.2288 |


In [130]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   OrderDate            9994 non-null   object 
 1   Category             9994 non-null   object 
 2   City                 9994 non-null   object 
 3   Country              9994 non-null   object 
 4   CustomerName         9994 non-null   object 
 5   Discount             9994 non-null   float64
 6   OrderID              9994 non-null   object 
 7   PostalCode           9994 non-null   int64  
 8   ProductName          9994 non-null   object 
 9   Profit               9994 non-null   int64  
 10  Quantity             9994 non-null   int64  
 11  Region               9994 non-null   object 
 12  Sales                9994 non-null   int64  
 13  Segment              9994 non-null   object 
 14  ShipDate             9994 non-null   object 
 15  ShipMode             9994 non-null   o

In [131]:


# Select features and target variable
X = df.drop(['Sales', 'SalesForecast','OrderID', 'OrderDate', 'ShipDate', 
             'DaystoShipScheduled', 'SalesperCustomer', 'ProductName', 'ShipMode','State',
             'Country', 'PostalCode', 'Discount','City', 'CustomerName'], axis=1)
y = df['Sales']

# Convert categorical variables to numerical using one-hot encoding
X = pd.get_dummies(X, 
                   columns=['Category', 'Region', 'Segment', 'Sub_Category','ShipStatus'], 
                   drop_first=True, dtype='int')
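
To see what the one-hot encoding step does, here is a minimal sketch on a hypothetical four-row frame. The column names mirror the dataset, but the values are made up for illustration. It uses the same `pd.get_dummies` arguments as the cell above.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the dataset, the values do not.
toy = pd.DataFrame({
    "Quantity": [2, 5, 1, 3],
    "Region": ["East", "West", "Central", "East"],
})

# drop_first=True drops the first category in sorted order ("Central"),
# so a row with zeros in every Region_* column means Region == "Central".
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True, dtype="int")
print(encoded)
```

Each remaining category becomes its own 0/1 column, which gives the model numeric inputs without implying any ordering between regions.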


### Step 2: Initialize Random Forest Regressor and Define Hyperparameter Grid

Why did we choose a Random Forest Regressor, and what do the hyperparameters n_estimators and max_depth signify?

In [132]:
# Initialize the Random Forest Regressor model
# The n_jobs parameter in the RandomForestRegressor constructor controls the 
# number of parallel jobs that will be used to train the model.
# A value of -1 means that all available CPU cores will be used.

model = RandomForestRegressor(random_state=42, n_jobs=-1)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 500, 1000],  # Specify a list of values to try for the number of trees
    'max_depth': [None, 10, 20, 30]  # Specify the maximum depth of the trees or use None for unlimited depth
}
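
As a rough illustration of what the two hyperparameters control, the sketch below fits a small forest on synthetic data (not the sales dataset) and inspects the fitted trees. The names `X_demo`, `y_demo`, and `forest` are placeholders introduced only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not the sales dataset): 200 rows, 4 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, max_depth=3, random_state=42)
forest.fit(X_demo, y_demo)

# n_estimators sets how many trees are built and averaged.
n_trees = len(forest.estimators_)
# max_depth caps how many levels of splits each individual tree may have.
deepest = max(tree.get_depth() for tree in forest.estimators_)
print(n_trees, deepest)
```

More trees generally give a more stable average at the cost of training time, while a depth cap limits how closely each tree can fit the training rows.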



### Step 3: Implement GridSearchCV for Hyperparameter Tuning

Explore the purpose of GridSearchCV. What is it doing to our model, and how does it help us find the best hyperparameters?
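
One way to get a feel for the cost of a grid search is to count the model fits it implies. The sketch below enumerates the grid from Step 2 with plain `itertools`, assuming the 5-fold cross-validation used in this step and the default behavior of refitting the best candidate once at the end.

```python
from itertools import product

# Same grid as in Step 2.
param_grid = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
}
cv_folds = 5  # matches cv=5 in this step

# Every combination of values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)        # 4 * 4
n_cv_fits = n_candidates * cv_folds   # one fit per candidate per fold
n_total_fits = n_cv_fits + 1          # plus the final refit on all training data

print(n_candidates, n_cv_fits, n_total_fits)  # 16 80 81
```

Adding a third hyperparameter with four values would multiply the candidate count by four, which is why grids are usually kept small.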

In [133]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [134]:
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, 
                           param_grid=param_grid, 
                           scoring='r2',
                           cv=5, n_jobs=-1)

# Run the search: fit every candidate with 5-fold cross-validation on the training set
grid_search.fit(X_train, y_train)
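
After fitting, the search object keeps a per-candidate record in `cv_results_`. The following sketch shows one possible way to read it as a table. It runs on synthetic data with a deliberately tiny grid so that it finishes quickly; `X_demo`, `y_demo`, `small_grid`, and `search` are illustrative names, not part of the activity.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs without sales-data.csv.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

# A deliberately tiny grid to keep the run short.
small_grid = {"n_estimators": [10, 30], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=small_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked combination comes first.
columns = ["param_n_estimators", "param_max_depth",
           "mean_test_score", "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

The same pattern applied to `grid_search.cv_results_` would show how close the runner-up combinations are to the winner, which `best_params_` alone does not reveal.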


### Step 4: Retrieve and Evaluate the Best Model

What hyperparameters did GridSearchCV identify as the best, and how well does the best model perform according to the R-squared score?

In [138]:
# Get the best model
best_model = grid_search.best_estimator_

# Print the best hyperparameters
print("\nBest Hyperparameters:")
print(grid_search.best_params_)

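# Print the best model's cross-validated performance
# (best_score_ is the mean R-squared over the 5 validation folds for the best parameters)
print("\nBest Model Performance:")
print("Best r2:", grid_search.best_score_)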


Best Hyperparameters:
{'max_depth': 20, 'n_estimators': 100}

Best Model Performance:
Best r2: 0.8585355203823035


In [139]:
from sklearn.metrics import r2_score

# Predict on the held-out test set
y_pred = best_model.predict(X_test)

# Evaluate the model's performance on data it was not trained on
r2 = r2_score(y_test, y_pred)

# Print the evaluation metric
print(f"R2: {r2}\n")

R2: 0.9282865779639411
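
For reference, R-squared is defined as `1 - SS_res / SS_tot`, where `SS_res` is the sum of squared prediction errors and `SS_tot` is the sum of squared deviations of the actual values from their mean. The sketch below works through that arithmetic by hand on four hypothetical values chosen to keep the numbers simple.

```python
# Hypothetical values, chosen so the arithmetic is easy to follow.
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 9.0]

mean_true = sum(y_true) / len(y_true)                      # 6.0
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))  # 0.25 + 0.25 + 0 + 0 = 0.5
ss_tot = sum((t - mean_true) ** 2 for t in y_true)         # 9 + 1 + 1 + 9 = 20.0

r2_manual = 1 - ss_res / ss_tot
print(round(r2_manual, 3))  # 0.975
```

A value of 1 would mean perfect predictions, and 0 would mean the model does no better than always predicting the mean.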



### Step 5: Make Predictions on New Data using the Best Model

How can we use the best model to make predictions on new data? What insights can we gain from comparing actual and predicted values?

In [140]:
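# Use the best model to make predictions.
# Note: the first five rows of X stand in for "new" data here. Some of them may
# also be in the training split, so this is a sanity check, not an evaluation.
new_data = X.iloc[:5]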
predictions = best_model.predict(new_data)

# Create a DataFrame to display actual and predicted values
result_df = pd.DataFrame({'Actual': y.iloc[:5].values, 'Predicted':predictions})

# Display the actual and predicted values
result_df


| | Actual | Predicted |
|---|---|---|
| 0 | 16 | 16.89143 |
| 1 | 12 | 11.859917 |
| 2 | 4 | 3.62358 |
| 3 | 273 | 286.3675 |
| 4 | 20 | 19.903608 |
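
One simple way to compare actual and predicted values is to add error columns to the result table. The sketch below rebuilds the five rows shown above by hand so that it runs on its own. The column names `Error` and `PctError` are introduced here for illustration.

```python
import pandas as pd

# Values copied from the result table above.
result_df = pd.DataFrame({
    "Actual": [16, 12, 4, 273, 20],
    "Predicted": [16.89143, 11.859917, 3.62358, 286.3675, 19.903608],
})

# Signed error: positive means the model over-predicted.
result_df["Error"] = result_df["Predicted"] - result_df["Actual"]
# Error relative to the actual value, in percent.
result_df["PctError"] = 100 * result_df["Error"] / result_df["Actual"]

print(result_df.round(2))
```

On these rows the largest absolute miss is on the 273 order (about 13.4), while the largest relative miss is on the smallest order, row 2 (about -9.4%). Both views are useful, since large orders dominate absolute error and small orders dominate percentage error.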
