# Big Sales Prediction using Random Forest Regressor


-------------

## **Objective**


The objective of this program is to develop and evaluate a predictive model for forecasting sales using a Random Forest Regressor. This involves:

* Data Acquisition: Import and load a dataset containing sales and related features.
* Data Exploration and Visualization: Analyze and visualize the dataset to understand its structure, distribution, and relationships between variables.
* Data Preprocessing: Clean and preprocess the data, including handling missing values, encoding categorical variables, and scaling features if necessary.
* Feature and Target Definition: Define the target variable (sales) and feature variables (predictors) for the model.
* Model Training and Testing: Split the data into training and testing sets, train a Random Forest Regressor model on the training set, and evaluate its performance on the testing set.
* Model Evaluation: Assess the model's accuracy and effectiveness using metrics such as Mean Squared Error (MSE) and R-squared (R²) score.
* Prediction: Use the trained model to make sales predictions on new or unseen data.
* Feature Importance Analysis: Analyze the importance of different features in predicting sales to gain insights into which variables are most influential.

## **Data Source**

The data source is where the dataset comes from. Common options include:

*   Public Datasets: Websites such as Kaggle or the UCI Machine Learning Repository.
*   Company Databases: Internal databases or CRM systems, if you are working within a company.
*   APIs: Sales platforms and analytics tools that expose sales data through an API.

## **Import Library**

import pandas as pd  # For data manipulation
import numpy as np   # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For statistical data visualization
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.ensemble import RandomForestRegressor  # For the regression model
from sklearn.metrics import mean_squared_error, r2_score  # For evaluation


## **Import Data**

data = pd.read_csv('sales_data.csv')  # Load the sales dataset from a local CSV file
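
Since every later step relies on a `Sales` column, it can help to fail early when that column is absent. The sketch below is only an illustration: `load_sales` is a hypothetical helper, and the inline CSV text with its `Feature1` and `Region` columns is made-up stand-in data, not the contents of `sales_data.csv`.

```python
import io

import pandas as pd

# Made-up stand-in for sales_data.csv; the real file's columns are not shown here.
csv_text = """Feature1,Region,Sales
10.0,North,200.5
12.5,South,180.0
,North,210.0
"""


def load_sales(source, target="Sales"):
    """Read a CSV and raise a clear error if the target column is missing."""
    df = pd.read_csv(source)
    if target not in df.columns:
        raise KeyError(f"Expected a '{target}' column, found: {list(df.columns)}")
    return df


sample = load_sales(io.StringIO(csv_text))
print(sample.shape)
```

With a real file, the same helper could be called as `load_sales('sales_data.csv')`.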


## **Describe Data**

data.info()  # Prints column types and non-null counts; info() returns None, so no print() is needed

print(data.describe())
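
Two quick checks complement `info()` and `describe()` before preprocessing: counting missing values per column and listing the non-numeric columns that will need encoding. This is a minimal sketch on a made-up frame; the column names `Feature1` and `Region` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing numeric value and one missing category.
toy = pd.DataFrame({
    "Feature1": [1.0, np.nan, 3.0, 4.0],
    "Region": ["North", "South", None, "North"],
    "Sales": [10.0, 12.0, 9.0, 15.0],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Columns that are not numeric and therefore need encoding before modeling.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print(categorical_cols)
```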

## **Data Visualization**

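sns.histplot(data['Sales'])  # Distribution of the target variable
plt.show()

sns.scatterplot(x='Feature1', y='Sales', data=data)  # Replace 'Feature1' with a column from your dataset
plt.show()

sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')  # Correlations between numeric columns only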
plt.show()
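
A heatmap can be hard to read when there are many columns, so a sorted list of each feature's correlation with the target is a useful numeric companion. The sketch below uses synthetic data in which `Feature1` drives `Sales` by construction; all column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: Sales depends on Feature1, while Feature2 is pure noise.
feature1 = rng.normal(size=n)
toy = pd.DataFrame({
    "Feature1": feature1,
    "Feature2": rng.normal(size=n),
    "Region": rng.choice(["North", "South"], size=n),
    "Sales": 5.0 * feature1 + rng.normal(scale=0.5, size=n),
})

# Correlate numeric columns only, then rank features by absolute correlation with Sales.
corr_with_sales = (
    toy.corr(numeric_only=True)["Sales"]
    .drop("Sales")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_sales)
```

Correlation only captures linear relationships, so a low value here does not rule out a feature that a tree-based model could still use.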




## **Data Preprocessing**

data = data.dropna()  # Dropping rows with missing values
data = pd.get_dummies(data, drop_first=True)  # One-hot encoding categorical variables
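
To see what these two steps do, here is a small sketch on made-up data (the `Feature1` and `Region` columns are assumptions). `dropna()` removes the row with a missing value, and `get_dummies(..., drop_first=True)` replaces `Region` with indicator columns, dropping the first category to avoid a redundant column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Feature1": [1.0, np.nan, 3.0, 4.0],
    "Region": ["North", "South", "West", "South"],
    "Sales": [10.0, 12.0, 9.0, 15.0],
})

clean = toy.dropna()                                # Drops the second row (missing Feature1)
encoded = pd.get_dummies(clean, drop_first=True)    # 'North' becomes the implicit baseline

print(encoded.columns.tolist())
print(len(encoded))
```

Dropping rows is the simplest option but discards data; imputing missing values is a common alternative when many rows are affected.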


## **Define Target Variable (y) and Feature Variables (X)**

X = data.drop('Sales', axis=1)  # Features

y = data['Sales']  # Target variable



## **Train Test Split**

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
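
With `test_size=0.2`, 80% of the rows go to training and 20% are held out. The sketch below checks this on synthetic data (the column names `f1` to `f3` are made up) and confirms that features and targets stay aligned after the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.normal(size=100), name="Sales")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)          # Sizes of the training and test feature sets
print(X_tr.index.equals(y_tr.index))   # Features and targets share the same row labels
```

Fixing `random_state` makes the split reproducible between runs.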


## **Modeling**

model = RandomForestRegressor(n_estimators=100, random_state=42)

model.fit(X_train, y_train)
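
A fitted random forest predicts by averaging the predictions of its individual trees. The sketch below illustrates this on synthetic data (the data and the names `forest`, `X_demo`, and `y_demo` are assumptions for illustration) by comparing a manual average over `estimators_` with the forest's own output.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X_demo, y_demo)

# Predict the first five rows with every tree, then average across trees.
tree_preds = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

forest_pred = forest.predict(X_demo[:5])
print(np.allclose(manual_average, forest_pred))
```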


## **Model Evaluation**

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")

print(f"R^2 Score: {r2}")
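
For reference, both metrics can be computed by hand. MSE is the mean of the squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses four made-up values and checks the manual results against scikit-learn; it also derives RMSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

# MSE: mean of squared errors.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE: square root of MSE, expressed in the target's own units.
rmse = np.sqrt(mse_manual)

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```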


## **Prediction**

new_data = pd.DataFrame({
    # Supply one entry per column of X, using the same column names and encoding as the training data.
})

predictions = model.predict(new_data)

print(predictions)
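
New data must have exactly the columns the model was trained on, including the one-hot columns. One way to arrange this, sketched below on made-up data, is to encode the new rows and then `reindex` them to the training columns. All names here (`train`, `model_toy`, `new_rows`, and the `Feature1` and `Region` columns) are assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up training data, encoded the same way as in the preprocessing step.
train = pd.DataFrame({
    "Feature1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Region": ["North", "South", "West", "North", "South", "West"],
    "Sales": [10.0, 14.0, 19.0, 22.0, 27.0, 31.0],
})
train_encoded = pd.get_dummies(train, drop_first=True)
X_toy = train_encoded.drop("Sales", axis=1)
y_toy = train_encoded["Sales"]

model_toy = RandomForestRegressor(n_estimators=10, random_state=42)
model_toy.fit(X_toy, y_toy)

# New rows may not contain every category seen during training.
new_rows = pd.DataFrame({"Feature1": [2.5, 4.5], "Region": ["South", "North"]})

# Encode without drop_first (the dropped category could differ from training),
# then align to the training columns: extra columns are discarded, missing ones filled with 0.
new_encoded = pd.get_dummies(new_rows)
new_aligned = new_encoded.reindex(columns=X_toy.columns, fill_value=0)

preds = model_toy.predict(new_aligned)
print(new_aligned.columns.tolist())
print(preds.shape)
```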


## **Explanation**

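Random Forest Regressor is an ensemble learning method that averages the predictions of many decision trees, each trained on a bootstrap sample of the data. Averaging makes it less prone to overfitting than a single decision tree, though not immune to it. scikit-learn's implementation expects numeric input, so categorical columns must be encoded first, as done in the preprocessing step with `pd.get_dummies`. Key points to consider: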
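
*   Feature Importance: The fitted model exposes impurity-based importances through `feature_importances_`, which show which features contribute most to the predictions.

importances = model.feature_importances_  # One importance score per feature; scores sum to 1
features = X.columns
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
print(feature_importance_df.sort_values(by='Importance', ascending=False))  # Most important first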



*   Model Tuning: Hyperparameters like n_estimators (number of trees) and max_depth (depth of trees) can be tuned to improve performance.
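
As a rough sketch of tuning, the example below runs a small grid search over `n_estimators` and `max_depth` with 3-fold cross-validation. The synthetic data and the grid values are assumptions chosen to keep the example fast; a real search would use the training split and a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=150)

# Small illustrative grid; None lets trees grow until leaves are pure.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # Higher (closer to 0) is better
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # Cross-validated MSE of the best combination
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the best parameters.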
