# Sales Prediction Model

## Objective
This notebook builds machine learning models that predict sales revenue from historical sales data.

This notebook includes:
- Data preparation
- Feature selection
- Model training
- Model evaluation

The goal is to compare several regression models and identify the one that predicts sales revenue best.

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## 2. Load Clean Dataset

In [2]:
# Load the cleaned sales data and preview the first five rows
df = pd.read_csv("cleaned_sales_data.csv")
df.head()

|   | ORDERNUMBER | QUANTITYORDERED | PRICEEACH | ORDERLINENUMBER | SALES | ORDERDATE | STATUS | QTR_ID | MONTH_ID | YEAR_ID | ... | ADDRESSLINE1 | ADDRESSLINE2 | CITY | STATE | POSTALCODE | COUNTRY | TERRITORY | CONTACTLASTNAME | CONTACTFIRSTNAME | DEALSIZE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10107 | 30 | 95.7 | 2 | 2871.0 | 2003-02-24 | Shipped | 1 | 2 | 2003 | ... | 897 Long Airport Avenue | Not Available | NYC | NY | 10022 | USA | Unknown | Yu | Kwai | Small |
| 1 | 10121 | 34 | 81.35 | 5 | 2765.9 | 2003-05-07 | Shipped | 2 | 5 | 2003 | ... | 59 rue de l'Abbaye | Not Available | Reims | Unknown | 51100 | France | EMEA | Henriot | Paul | Small |
| 2 | 10134 | 41 | 94.74 | 2 | 3884.34 | 2003-07-01 | Shipped | 3 | 7 | 2003 | ... | 27 rue du Colonel Pierre Avia | Not Available | Paris | Unknown | 75508 | France | EMEA | Da Cunha | Daniel | Medium |
| 3 | 10145 | 45 | 83.26 | 6 | 3746.7 | 2003-08-25 | Shipped | 3 | 8 | 2003 | ... | 78934 Hillside Dr. | Not Available | Pasadena | CA | 90003 | USA | Unknown | Young | Julie | Medium |
| 4 | 10159 | 49 | 100.0 | 14 | 5205.27 | 2003-10-10 | Shipped | 4 | 10 | 2003 | ... | 7734 Strong St. | Not Available | San Francisco | CA | Unknown | USA | Unknown | Brown | Julie | Medium |


## 3. Convert Date Column

In [3]:
# Parse ORDERDATE strings into datetime64 values, then inspect column dtypes
df["ORDERDATE"] = pd.to_datetime(df["ORDERDATE"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   ORDERNUMBER       2823 non-null   int64         
 1   QUANTITYORDERED   2823 non-null   int64         
 2   PRICEEACH         2823 non-null   float64       
 3   ORDERLINENUMBER   2823 non-null   int64         
 4   SALES             2823 non-null   float64       
 5   ORDERDATE         2823 non-null   datetime64[ns]
 6   STATUS            2823 non-null   object        
 7   QTR_ID            2823 non-null   int64         
 8   MONTH_ID          2823 non-null   int64         
 9   YEAR_ID           2823 non-null   int64         
 10  PRODUCTLINE       2823 non-null   object        
 11  MSRP              2823 non-null   int64         
 12  PRODUCTCODE       2823 non-null   object        
 13  CUSTOMERNAME      2823 non-null   object        
 14  PHONE             2823 n

In [4]:
df.columns

Index(['ORDERNUMBER', 'QUANTITYORDERED', 'PRICEEACH', 'ORDERLINENUMBER',
       'SALES', 'ORDERDATE', 'STATUS', 'QTR_ID', 'MONTH_ID', 'YEAR_ID',
       'PRODUCTLINE', 'MSRP', 'PRODUCTCODE', 'CUSTOMERNAME', 'PHONE',
       'ADDRESSLINE1', 'ADDRESSLINE2', 'CITY', 'STATE', 'POSTALCODE',
       'COUNTRY', 'TERRITORY', 'CONTACTLASTNAME', 'CONTACTFIRSTNAME',
       'DEALSIZE'],
      dtype='object')
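
The calendar columns `QTR_ID`, `MONTH_ID`, and `YEAR_ID` already exist in the dataset, but once `ORDERDATE` is a datetime they can also be derived from it. The sketch below is a consistency check rather than a step this notebook performs. It uses a small hand-built frame (`sample`, a hypothetical name) holding the five order dates and calendar values shown in the preview above.

```python
import pandas as pd

# Five order dates copied from the df.head() preview; `sample` is a
# hypothetical stand-in for the full DataFrame.
sample = pd.DataFrame({
    "ORDERDATE": ["2003-02-24", "2003-05-07", "2003-07-01", "2003-08-25", "2003-10-10"],
    "QTR_ID": [1, 2, 3, 3, 4],
    "MONTH_ID": [2, 5, 7, 8, 10],
    "YEAR_ID": [2003] * 5,
})

# Parse the strings into datetime64 values, as the notebook does for df.
sample["ORDERDATE"] = pd.to_datetime(sample["ORDERDATE"])

# Derive the calendar parts from the parsed dates with the .dt accessor.
derived = pd.DataFrame({
    "QTR_ID": sample["ORDERDATE"].dt.quarter,
    "MONTH_ID": sample["ORDERDATE"].dt.month,
    "YEAR_ID": sample["ORDERDATE"].dt.year,
})

# True when the stored columns agree with the values derived from ORDERDATE.
columns_match = bool((derived == sample[["QTR_ID", "MONTH_ID", "YEAR_ID"]]).all().all())
print(columns_match)
```

Running the same comparison against the full `df` would show whether the stored calendar columns are consistent with `ORDERDATE` for every row.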

## 4. Feature Selection

In [7]:
features = [
    'QUANTITYORDERED',
    'PRICEEACH',
    'QTR_ID',
    'MONTH_ID',
    'YEAR_ID',
    'MSRP'
]

Target variable: `SALES`

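Features were selected for business relevance and their direct influence on sales revenue. The quantity and pricing variables (`QUANTITYORDERED`, `PRICEEACH`, `MSRP`) capture order size and product value, and the time-related variables (`QTR_ID`, `MONTH_ID`, `YEAR_ID`) capture seasonal sales patterns. The remaining columns, such as order identifiers and customer contact details, were excluded to reduce the risk of data leakage.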
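
The link between the selected features and the target can be checked by hand on the five preview rows. The following sketch compares `QUANTITYORDERED × PRICEEACH` with `SALES`, using values copied from the `df.head()` output (`rows` is a hypothetical name). It covers only those five rows, so it is a sanity check and not a statement about the whole dataset.

```python
import math

# (QUANTITYORDERED, PRICEEACH, SALES) for the five rows in the preview.
rows = [
    (30, 95.70, 2871.00),
    (34, 81.35, 2765.90),
    (41, 94.74, 3884.34),
    (45, 83.26, 3746.70),
    (49, 100.00, 5205.27),
]

# For each row, check whether SALES equals quantity * price to the cent.
matches = [
    math.isclose(qty * price, sales, abs_tol=0.005)
    for qty, price, sales in rows
]
print(matches)
```

In four of the five rows the product reproduces `SALES`. The exception is the row where `PRICEEACH` is 100.0, which hints that quantity and listed price alone may not fully determine revenue and that the other features could still carry information.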

In [9]:
X = df[features]
y = df['SALES']

## 5. Train Test Split

In [10]:
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The dataset is split into 80% training data and 20% testing data to evaluate model performance on unseen data.
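
To make the split sizes concrete, here is a dependency-free sketch of a shuffled holdout split. It is not scikit-learn's implementation, and the rows it selects will differ from those chosen by `train_test_split` with the same seed. The function name `holdout_split` is hypothetical, and rounding the test share up is an assumption made here.

```python
import math
import random

def holdout_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = math.ceil(n_rows * test_size)    # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

# 2823 is the row count reported by df.info().
train_idx, test_idx = holdout_split(2823)
print(len(train_idx), len(test_idx))
```

Under these assumptions the 2,823 rows divide into 2,258 training rows and 565 test rows, with no row appearing in both sets.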

<!-- Linear Regression Model -->

## 6. Train Linear Regression Model

In [11]:
# Fit an ordinary least-squares baseline on the training data
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)
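
As a rough illustration of what fitting a linear regression involves, the sketch below solves an ordinary least-squares problem directly with NumPy on toy data. The data and variable names are hypothetical, and this is not a description of how scikit-learn computes its solution internally.

```python
import numpy as np

# Toy data (hypothetical): y depends linearly on two features plus an intercept.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]

# Append a column of ones so the intercept is estimated with the slopes.
X_design = np.column_stack([X, np.ones(len(X))])

# Solve the least-squares problem min ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coefficients, intercept = beta[:2], beta[2]
print(coefficients, intercept)
```

Because the toy target is an exact linear function of the features, the recovered slopes and intercept match the values used to generate it (2.0, 0.5, and 1.0) up to floating-point error.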


## 7. Make Predictions

In [12]:
y_pred=model.predict(X_test)
y_pred

array([ 1383.72006412,  2649.6444111 ,  4748.37644103,  4878.88595789,
        3503.54991202,  6107.62971851,  3844.85039575,  1402.53976753,
        2606.29886422,  7170.55497179,  2914.90858029,  2756.8845496 ,
        -919.41002373,  4858.13011228,  5354.17298731,  5310.50844575,
        1302.11793146,  5893.10797068,  2130.17304442,   -26.11244895,
        5234.02588025,  3626.08370673,  6921.94364195,   985.7008534 ,
         861.21518228,  5435.72277561,  -718.8710673 ,  5324.82661975,
        2944.22009408,  2932.83143115,  4022.82793542,  4731.08594581,
        3503.46728291,  2621.73276146,  4663.46249946,  -354.58260749,
        1729.57240147,  2397.95579376,  3201.27845914,  5051.79017192,
        5749.16877112,  3614.07799384,  3936.59380052,  1203.33593512,
        3723.19720386,  3932.05480577,  4390.00638941,  4224.61925118,
        5057.001665  ,  2340.94960837,  -840.70809343,  4778.30101318,
        2436.96626168,  2014.52616773,  3002.82680069,  -673.31355984,
      

## 8. Evaluate Model

In [13]:
# Compare test-set predictions with the actual sales values
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("MAE:",mae)
print("RMSE:",rmse)
print("R2:",r2)

MAE: 653.5836521412994
RMSE: 1022.8573827262214
R2: 0.7603061733255345


### Model Performance

The Linear Regression model achieved an R² score of 0.76, meaning the selected features explain about 76% of the variance in sales on the test set.

The Mean Absolute Error (MAE) is approximately 654, so predictions differ from actual sales by about 654 on average, in the same currency units as `SALES`.

The Root Mean Squared Error (RMSE) is approximately 1023. Because RMSE penalizes large errors more heavily than MAE, the gap between the two indicates that some predictions miss by a wide margin. Several predicted values are also negative (for example -919.41), which is not a valid sales amount.

Overall, the model is a reasonable baseline against which to compare the other regressors.
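
The three metrics can be written out by hand to show how they relate. The sketch below is a plain-Python version for illustration only. The function name `regression_metrics` and the toy values are hypothetical, and the notebook itself relies on the scikit-learn functions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Toy values (hypothetical), not taken from the sales data.
toy_mae, toy_rmse, toy_r2 = regression_metrics([3, 5, 7, 9], [2, 5, 8, 11])
print(toy_mae, toy_rmse, toy_r2)
```

In the toy case the single error of 2 pulls RMSE (about 1.22) above MAE (1.0), the same pattern seen in the sales results, where RMSE is well above MAE.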

<!-- Ridge Regression Model -->

## 9. Train Ridge Regression Model and Make Predictions

In [14]:
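from sklearn.linear_model import Ridge

# Fit a ridge (L2-regularized) linear model with a small penalty
ridge_model = Ridge(alpha=0.1, random_state=42)
ridge_model.fit(X_train, y_train)

# Predict sales for the held-out test rows
ridge_pred = ridge_model.predict(X_test)
ridge_pred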

array([ 1383.71769457,  2649.64155171,  4748.35643835,  4878.88279503,
        3503.54894581,  6107.62636322,  3844.84517441,  1402.5612271 ,
        2606.27839651,  7170.5715082 ,  2914.88348321,  2756.88346767,
        -919.4070782 ,  4858.14864089,  5354.17265079,  5310.52964659,
        1302.0959569 ,  5893.08216978,  2130.16859752,   -26.11023622,
        5234.02604176,  3626.07878488,  6921.94371979,   985.71855971,
         861.23136905,  5435.71816963,  -718.85250544,  5324.82627396,
        2944.20070586,  2932.80630858,  4022.8267626 ,  4731.10042973,
        3503.44282952,  2621.7300982 ,  4663.45848486,  -354.56032362,
        1729.56977903,  2397.93467883,  3201.27549967,  5051.79143893,
        5749.18325246,  3614.07833801,  3936.57243987,  1203.33468629,
        3723.21613504,  3932.05564136,  4390.02733089,  4224.63447402,
        5056.99861375,  2340.92935408,  -840.68511644,  4778.29641114,
        2436.96408076,  2014.50541039,  3002.8439974 ,  -673.33119065,
      

## 10. Evaluate Model

In [15]:
ridge_mae = mean_absolute_error(y_test, ridge_pred)
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge_pred))
ridge_r2 = r2_score(y_test, ridge_pred)

print("Ridge MAE:", ridge_mae)
print("Ridge RMSE:", ridge_rmse)
print("Ridge R2:", ridge_r2)

Ridge MAE: 653.5829455688277
Ridge RMSE: 1022.8569266038895
Ridge R2: 0.7603063870986071


### Ridge Regression Comparison

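The Ridge Regression model produced nearly identical performance to Linear Regression: R² is 0.7603 for both, and MAE and RMSE differ by less than 0.001. With `alpha=0.1` the penalty is small, so this result shows that light regularization neither helps nor hurts here. It suggests, but does not prove, that multicollinearity is not significantly affecting the model.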

Both linear models show stable and consistent predictive performance.
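
The effect of the penalty strength can be illustrated with the closed-form ridge solution on synthetic data. This is a minimal sketch under stated assumptions: the data are randomly generated rather than taken from the sales dataset, `ridge_coefficients` is a hypothetical helper, and the intercept is handled by centering rather than being penalized.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge slopes on centered data (intercept left unpenalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    penalty = alpha * np.eye(X.shape[1])
    return np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ yc)

# Synthetic data (hypothetical): three features, known slopes, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

w_ols = ridge_coefficients(X, y, alpha=0.0)       # ordinary least squares
w_small = ridge_coefficients(X, y, alpha=0.1)     # same alpha as the Ridge model above
w_large = ridge_coefficients(X, y, alpha=1000.0)  # strong shrinkage

print(w_ols, w_small, w_large)
```

With a small `alpha` the slopes barely move from the least-squares solution, which is consistent with the near-identical metrics above. Only a much larger penalty shrinks them noticeably.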


<!-- Decision Tree Model -->

## 11. Train Decision Tree Model and Make Predictions

In [16]:
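from sklearn.tree import DecisionTreeRegressor

# Fit a decision tree with default (unrestricted) depth
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predict sales for the held-out test rows
dt_pred = dt_model.predict(X_test)
dt_pred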

array([1514.52, 2117.75, 5190.42, 3843.2 , 3186.48, 6896.75, 6981.  ,
       1666.35, 2112.  , 8284.  , 2616.98, 2120.14,  717.4 , 4983.14,
       5592.22, 5875.2 , 1559.04, 6144.6 , 2259.72, 1033.41, 4948.2 ,
       3091.19, 9064.89, 1502.78, 1545.64, 3560.64,  733.11, 5592.22,
       7110.8 , 2591.96, 3392.26, 4177.49, 2871.  , 2639.58, 3863.87,
        759.46, 1698.78, 2312.24, 2795.13, 3255.12, 4808.38, 3227.63,
       3580.88,  541.14, 2932.08, 3188.12, 5128.11, 3510.  , 4121.43,
       1883.93,  759.46, 5260.15, 2477.23, 1883.93, 2612.48,  902.66,
       1007.14,  977.67, 4432.7 , 2201.62, 1451.  , 4591.72, 1356.4 ,
       6231.91, 3832.64, 8935.5 , 2353.4 , 2353.2 , 3172.05, 2314.4 ,
       8344.71, 4992.61, 4983.14, 5797.44, 1448.  , 1587.08, 3068.69,
       5079.96, 4107.2 , 4302.08, 3098.7 , 2871.  , 3651.56, 7474.5 ,
       3751.  , 4666.62, 6847.  , 3236.1 , 3236.1 , 4781.7 , 3068.69,
       3510.  , 6719.54, 4441.5 , 2546.8 , 1539.72, 4307.52, 5592.22,
       3822.  , 1539

## 12. Evaluate Model

In [17]:
dt_mae = mean_absolute_error(y_test, dt_pred)
dt_rmse = np.sqrt(mean_squared_error(y_test, dt_pred))
dt_r2 = r2_score(y_test, dt_pred)

print("Decision Tree MAE:", dt_mae)
print("Decision Tree RMSE:", dt_rmse)
print("Decision Tree R2:", dt_r2)

Decision Tree MAE: 422.6530265486726
Decision Tree RMSE: 875.5698742570743
Decision Tree R2: 0.8243661176675892


### Decision Tree Model Performance

The Decision Tree Regressor improved on the linear models: R² rose from 0.76 to 0.82, which suggests that the tree captures non-linear relationships the linear models miss.

MAE fell from about 654 to about 423 and RMSE from about 1023 to about 876, so the Decision Tree's sales revenue predictions are more accurate on the test set.
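
One way to see why a tree can capture patterns that a straight line misses is to look at a single split. The sketch below searches one feature for the threshold that minimizes squared error and predicts the mean on each side. It is a simplified, dependency-free illustration on hypothetical toy data, not scikit-learn's algorithm, and `best_split` is a made-up name.

```python
def best_split(xs, ys):
    """Find the threshold on one feature that minimizes squared error."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    points = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values,
    # so both sides of every candidate split are non-empty.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]  # (threshold, left prediction, right prediction)

# Toy step-shaped data (hypothetical): a straight line fits this poorly.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, left_mean, right_mean = best_split(xs, ys)
print(threshold, left_mean, right_mean)
```

A full regression tree repeats this search recursively over all features, producing a piecewise-constant prediction surface.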


<!-- Random Forest Model -->

## 13. Train Random Forest Model and Make Predictions

In [18]:
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest with default settings
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Predict sales for the held-out test rows
rf_pred = rf_model.predict(X_test)
rf_pred

array([ 1523.7938    ,  2170.5404    ,  4782.377     ,  3754.1598    ,
        3028.3599    ,  6612.2357    ,  6110.2154    ,  1655.8588    ,
        2266.4357    ,  8810.2799    ,  2599.7979    ,  2147.499     ,
         738.0526    ,  4743.8506    ,  6170.3456    ,  5999.3242    ,
        1595.6138    ,  6054.4536    ,  2287.8645    ,   988.8026    ,
        4985.3035    ,  3085.9519    ,  8668.446     ,  1512.2554    ,
        1470.7398    ,  5273.5189    ,   830.6207    ,  6188.808     ,
        5584.4092    ,  2578.0357    ,  3387.5442    ,  4541.9664    ,
        2887.0854    ,  2571.3804    ,  3980.8831    ,   906.3553    ,
        1580.2254    ,  2363.5218    ,  3080.538     ,  3251.0641    ,
        5817.256     ,  3168.7517    ,  3576.4266    ,  1167.1034    ,
        3473.07376667,  3466.7401    ,  4553.1428    ,  3612.9633    ,
        4145.7633    ,  1896.7142    ,   798.8135    ,  4662.8906    ,
        2401.2217    ,  1887.6927    ,  2667.8152    ,   821.1196    ,
      

## 14. Evaluate Model

In [19]:
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
rf_r2 = r2_score(y_test, rf_pred)

print("Random Forest MAE:", rf_mae)
print("Random Forest RMSE:", rf_rmse)
print("Random Forest R2:", rf_r2)

Random Forest MAE: 324.6486501753053
Random Forest RMSE: 708.4134346043496
Random Forest R2: 0.8850258366272681


### Random Forest Model Performance

The Random Forest model achieved the best performance, with an R² score of 0.89. Its MAE (about 325) and RMSE (about 708) are the lowest of the four models, so its predictions carry the least error on the test set.
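
A random forest's prediction is the average of its individual trees' predictions, which helps explain why it tends to be more stable than a single decision tree. The sketch below checks this on synthetic data. The data, the choice of 20 trees, and the variable names are assumptions made for illustration and do not come from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical): a non-linear target with a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=1.0, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X, y)

# Predict with every fitted tree separately, then average across trees.
X_new = rng.uniform(0, 10, size=(5, 2))
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own prediction should agree with the manual average.
forest_pred = forest.predict(X_new)
print(np.allclose(averaged, forest_pred))
```

The spread of `per_tree` along its first axis gives a rough sense of how much the individual trees disagree on each new row.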

## 15. Model Comparison Table

In [20]:
model_comparison=pd.DataFrame({
    "Model":["Linear Regression","Ridge Regression","Decision Tree","Random Forest"],
    "MAE":[mae,ridge_mae,dt_mae,rf_mae],
    "RMSE":[rmse,ridge_rmse,dt_rmse,rf_rmse],
    "R2 score":[r2, ridge_r2, dt_r2, rf_r2]
})

model_comparison

|   | Model | MAE | RMSE | R2 score |
|---|---|---|---|---|
| 0 | Linear Regression | 653.583652 | 1022.857383 | 0.760306 |
| 1 | Ridge Regression | 653.582946 | 1022.856927 | 0.760306 |
| 2 | Decision Tree | 422.653027 | 875.569874 | 0.824366 |
| 3 | Random Forest | 324.648650 | 708.413435 | 0.885026 |
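
The comparison can also be ranked programmatically. The sketch below rebuilds the table from the rounded values displayed above (so it runs on its own), sorts by RMSE, and computes each model's error reduction relative to the Linear Regression baseline. The names `results`, `ranked`, and the two "reduction" columns are hypothetical additions.

```python
import pandas as pd

# Metrics copied from the comparison table above (rounded as displayed).
results = pd.DataFrame({
    "Model": ["Linear Regression", "Ridge Regression", "Decision Tree", "Random Forest"],
    "MAE": [653.583652, 653.582946, 422.653027, 324.648650],
    "RMSE": [1022.857383, 1022.856927, 875.569874, 708.413435],
    "R2 score": [0.760306, 0.760306, 0.824366, 0.885026],
})

# Rank models by RMSE (lower is better) and pick the top one.
ranked = results.sort_values("RMSE").reset_index(drop=True)
best_model = ranked.loc[0, "Model"]

# Relative error reduction of each model versus the linear baseline.
baseline = results.set_index("Model").loc["Linear Regression"]
ranked["MAE reduction"] = 1 - ranked["MAE"] / baseline["MAE"]
ranked["RMSE reduction"] = 1 - ranked["RMSE"] / baseline["RMSE"]

print(best_model)
print(ranked[["Model", "MAE reduction", "RMSE reduction"]])
```

On these figures Random Forest ranks first, cutting MAE by roughly half and RMSE by roughly 30% relative to Linear Regression.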


## Final Model Conclusion

Four regression models were trained and evaluated for sales prediction.

Linear and Ridge Regression models achieved an R² score of approximately 0.76, indicating moderate predictive performance.

The Decision Tree model raised the R² score to 0.82, which suggests it captures non-linear relationships in the data that the linear models miss.

The Random Forest Regressor achieved the highest performance, with an R² score of 0.89 and the lowest MAE (about 325) and RMSE (about 708). This demonstrates its superior ability to model complex sales patterns.

Based on performance metrics, Random Forest was selected as the final model for sales prediction.