# DS_ML Coding Challenge
# ROHAN SARASWAT, PRN: 21070126071

### Problem Statement:
Given a dataset containing historical records of sourcing costs for various products, the objective is to develop a predictive model that can accurately forecast future sourcing costs. The model should account for temporal patterns, seasonal variations, and other relevant factors to provide reliable predictions for strategic decision-making.

**IMPORTING LIBRARIES**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
import seaborn as sns


**Loading Dataset**

In [None]:
data= pd.read_csv("/content/drive/MyDrive/DATASETS/DS_ML Coding Challenge Dataset (1).xlsx - Training Dataset.csv")

**Basic stats of Data**

In [None]:
data.info()

In [None]:
data.describe()

**No Null Values Found**

In [None]:
data.isnull().sum()

**DATA PREPROCESSING**

In [None]:
data.head()

In [None]:
# Changing month of sourcing format to date-time:
data['Month of Sourcing'] = pd.to_datetime(data['Month of Sourcing'], format='%b-%y')


In [None]:
data.head()

In [None]:
# Sorting Data By Months:
data_sorted = data.sort_values(by='Month of Sourcing')

In [None]:
# Calculating Unique Values in Data:
unique_values_all_columns = {}
for column in data_sorted.columns:
    unique_values_all_columns[column] = data_sorted[column].unique()

# Printing unique values for each column
for column, unique_values in unique_values_all_columns.items():
    print(f"Unique values in column {column}:")
    print(unique_values)

In [None]:
# Calculating Q1 (25th percentile) and Q3 (75th percentile)
Q1 = data['Sourcing Cost'].quantile(0.25)
Q3 = data['Sourcing Cost'].quantile(0.75)

# Calculating the interquartile range (IQR)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identifying outliers
outliers = data[(data['Sourcing Cost'] < lower_bound) | (data['Sourcing Cost'] > upper_bound)]

print("Outliers:")
print(outliers)
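
To make the IQR rule used above concrete, here is a minimal sketch on a small, made-up series (the values are hypothetical and are not taken from the challenge dataset). It applies the same 1.5 × IQR fences and flags the single extreme value.

```python
import pandas as pd

# Hypothetical toy costs; 500 is deliberately far from the rest
costs = pd.Series([100, 102, 98, 101, 99, 103, 97, 500], name="Sourcing Cost")

# Quartiles and interquartile range
q1 = costs.quantile(0.25)
q3 = costs.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
toy_outliers = costs[(costs < lower_bound) | (costs > upper_bound)]
print("IQR:", iqr, "fences:", lower_bound, upper_bound)
print(toy_outliers)
```

Because the fences are derived from the quartiles rather than the mean, a single extreme value does not shift them much.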


In [None]:
# Visualizing outliers:
start_date = pd.to_datetime('2020-06-01')
end_date = pd.to_datetime('2021-05-01')
data_subset = data[(data['Month of Sourcing'] >= start_date) & (data['Month of Sourcing'] <= end_date)]

downsample_factor = 5
data_downsampled = data_subset.iloc[::downsample_factor]

plt.figure(figsize=(12, 6))
plt.plot(data_downsampled['Month of Sourcing'], data_downsampled['Sourcing Cost'], color='blue', label='Sourcing Cost')

plt.scatter(outliers['Month of Sourcing'], outliers['Sourcing Cost'], color='red', label='Outliers')

plt.title('Sourcing Cost Over Time (Last 12 Months, Downsampled)')
plt.xlabel('Month of Sourcing')
plt.ylabel('Sourcing Cost')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


window_size = 3

# Calculating the moving average
moving_avg = data_sorted['Sourcing Cost'].rolling(window=window_size, center=True).mean()

# Plotting the original (date-sorted) series and its moving average against the date
plt.figure(figsize=(12, 6))
plt.plot(data_sorted['Month of Sourcing'], data_sorted['Sourcing Cost'], color='blue', label='Original Data')
plt.plot(data_sorted['Month of Sourcing'], moving_avg, color='red', label=f'Moving Average (Window Size={window_size})')
plt.title('Time Series Data with Moving Average')
plt.xlabel('Date')
plt.ylabel('Sourcing Cost')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
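
The effect of `center=True` in the rolling mean above can be seen on a tiny example. This is a sketch on made-up values, not the dataset: a centered window of 3 averages each point with its two neighbours, while the default trailing window averages each point with the two points before it.

```python
import pandas as pd

# Hypothetical values used only to illustrate the window alignment
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Centered window: result at position i uses positions i-1, i, i+1
centered = values.rolling(window=3, center=True).mean()

# Trailing window (the default): result at position i uses positions i-2, i-1, i
trailing = values.rolling(window=3).mean()

print(pd.DataFrame({"value": values, "centered": centered, "trailing": trailing}))
```

The centered version leaves one missing value at each end of the series, whereas the trailing version leaves two at the start.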


Removing Outliers:
Outliers can skew model parameters, distort summary statistics, and lower predictive accuracy, so they are removed here before modeling. Removing them improves data quality and interpretability and tends to give more stable models whose assumptions are better met. It also reduces the risk of fitting noise and makes comparisons across the dataset more meaningful. Inspecting the removed points can additionally reveal data-quality problems and clarify the underlying patterns.

In [None]:
# Removing Outliers from data
def remove_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    data_cleaned = data[(data >= lower_bound) & (data <= upper_bound)]
    return data_cleaned

# Removing outliers from the 'Sourcing Cost' column using IQR
data_sorted['Sourcing Cost'] = remove_outliers_iqr(data_sorted['Sourcing Cost'])

plt.figure(figsize=(12, 6))
plt.plot(data_sorted['Month of Sourcing'], data_sorted['Sourcing Cost'], color='blue', label='Sourcing Cost (Cleaned)')
plt.title('Sourcing Cost Over Time (Outliers Removed)')
plt.xlabel('Month of Sourcing')
plt.ylabel('Sourcing Cost')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
data_sorted.info()

In [None]:
data_sorted.isnull().sum()

In [None]:
data_sorted = data_sorted.dropna(subset=['Sourcing Cost'])

In [None]:
data_sorted.isnull().sum()

One-hot encoding converts categorical variables into a numerical format that regression algorithms can use: each category becomes a separate binary feature, so the model does not read an artificial order into the categories. Note that the cell below applies label encoding (`LabelEncoder`) instead, which maps each category to a single integer code. This keeps the number of features small and works well with tree-based models, but it imposes an arbitrary ordering that linear and polynomial models will treat as meaningful.
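
For reference, a minimal sketch of one-hot encoding with `pandas.get_dummies` is shown here. The category values (`X1`, `X2`, `X3`) and the tiny frame are hypothetical and only illustrate the shape of the output.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    "Manufacturer": ["X1", "X2", "X1", "X3"],
    "Sourcing Cost": [100.0, 120.0, 95.0, 130.0],
})

# Each category becomes its own 0/1 column; the original column is dropped
one_hot = pd.get_dummies(toy, columns=["Manufacturer"], dtype=int)
print(one_hot)
```

Each row has exactly one `1` across the `Manufacturer_*` columns, so no ordering between manufacturers is implied.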

In [None]:
# Applying label encoding (each category is mapped to an integer code):
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

data_encoded = data_sorted.copy()
columns_to_encode = ['ProductType', 'Manufacturer', 'Area Code', 'Sourcing Channel', 'Product Size', 'Product Type']

for col in columns_to_encode:
    data_encoded[col] = label_encoder.fit_transform(data_sorted[col])

print(data_encoded.head())


In [None]:
data_encoded.head()

# PERFORMING EDA

**Plot1**

**The plot depicts the trend, seasonality, and potential irregular patterns of the 'Sourcing Cost' over time. It offers a comprehensive view of how the sourcing cost has evolved, allowing analysts to identify long-term trends, recurring patterns, and anomalies in the data. Additionally, the plot provides insights into the impact of external factors, such as changes in market conditions or business strategies, on sourcing costs over time. If time series plots for other variables like 'ProductType' or 'Manufacturer' are included, they can offer further context and reveal potential correlations or dependencies between these variables and the sourcing cost, aiding in deeper exploratory analysis and informed decision-making.**

In [None]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=data_encoded['Month of Sourcing'], y=data_encoded['Sourcing Cost'], mode='lines', name='Sourcing Cost', line=dict(color='blue')))
fig.update_layout(title='Sourcing Cost over Time', xaxis_title='Month of Sourcing', yaxis_title='Sourcing Cost')
fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=data_encoded['Month of Sourcing'], y=data_encoded['ProductType'], mode='lines', name='ProductType', line=dict(color='red')))
fig.add_trace(go.Scatter(x=data_encoded['Month of Sourcing'], y=data_encoded['Manufacturer'], mode='lines', name='Manufacturer', line=dict(color='green')))
fig.update_layout(title='Time Series for ProductType and Manufacturer', xaxis_title='Month of Sourcing', yaxis_title='Encoded Value')
fig.show()


**PLOT-2**

**This plot illustrates the decomposition of the time series into its trend, seasonal, and residual components. The "Trend" component represents the long-term movement of the data, capturing any overall increase or decrease over time. The "Seasonal" component displays repetitive patterns that occur at regular intervals, such as monthly or yearly cycles. Finally, the "Residual" component represents the remaining variation in the data after removing the trend and seasonal components, highlighting any irregularities or noise. Analyzing these components can provide insights into the underlying patterns and dynamics present in the time series data.**

In [None]:
import plotly.graph_objects as go
from statsmodels.tsa.seasonal import seasonal_decompose

# Seasonal decomposition
# Note: period=12 means 12 consecutive rows; it corresponds to 12 months only if the series has one row per month
decomposition = seasonal_decompose(data_encoded['Sourcing Cost'], model='additive', period=12)
# Original time series
fig = go.Figure()
fig.add_trace(go.Scatter(x=data_encoded['Month of Sourcing'], y=data_encoded['Sourcing Cost'], mode='lines', name='Original'))
fig.update_layout(title='Original Time Series', xaxis_title='Month of Sourcing', yaxis_title='Sourcing Cost', xaxis=dict(tickangle=45))
fig.show()

# Trend component
fig = go.Figure()
fig.add_trace(go.Scatter(x=data_encoded['Month of Sourcing'], y=decomposition.trend, mode='lines', name='Trend'))
fig.update_layout(title='Trend Component', xaxis_title='Month of Sourcing', yaxis_title='Trend', xaxis=dict(tickangle=45))
fig.show()

# Seasonal component
fig = go.Figure()
fig.add_trace(go.Scatter(x=data_encoded['Month of Sourcing'], y=decomposition.seasonal, mode='lines', name='Seasonal'))
fig.update_layout(title='Seasonal Component', xaxis_title='Month of Sourcing', yaxis_title='Seasonal', xaxis=dict(tickangle=45))
fig.show()

# Residual component
fig = go.Figure()
fig.add_trace(go.Scatter(x=data_encoded['Month of Sourcing'], y=decomposition.resid, mode='lines', name='Residual'))
fig.update_layout(title='Residual Component', xaxis_title='Month of Sourcing', yaxis_title='Residual', xaxis=dict(tickangle=45))
fig.show()


**Correlation Plot**

In [None]:
import seaborn as sns
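# Restricting to numeric columns so the datetime column is not passed to corr()
correlation_matrix = data_encoded.select_dtypes(include='number').corr()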
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()


# APPLYING ML MODELS

**In our analysis, we plan to employ various regression models, including polynomial regression, decision tree regression, random forest regression, and ensemble techniques such as bagging and boosting. Each of these algorithms serves a distinct purpose in regression analysis. Polynomial regression allows us to capture nonlinear relationships between predictors and the target variable, enhancing the model's flexibility in fitting complex data patterns. Decision tree regression partitions the feature space into regions, making it suitable for capturing nonlinear relationships and interactions between predictors. Random forest regression aggregates the predictions of multiple decision trees, providing robustness to overfitting and improving prediction accuracy. Ensemble techniques like bagging and boosting further enhance model performance by combining the strengths of multiple base learners, leading to more accurate and stable predictions. By leveraging these diverse algorithms, we aim to build comprehensive regression models that effectively capture the underlying patterns and relationships present in our dataset, ultimately enabling us to make accurate predictions and derive valuable insights.**

**APPLYING POLYNOMIAL REGRESSION**

->Splitting data into train and test sets

In [None]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from math import sqrt

train_size = int(len(data_encoded) * 0.8)  # 80% train, 20% test
train_data, test_data = data_encoded.iloc[:train_size], data_encoded.iloc[train_size:]

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np


X_train = train_data.drop(columns=['Month of Sourcing', 'Sourcing Cost'])
y_train = train_data['Sourcing Cost']

X_test = test_data.drop(columns=['Month of Sourcing', 'Sourcing Cost'])
y_test = test_data['Sourcing Cost']

degree = 2


poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

poly_reg_model = LinearRegression()
poly_reg_model.fit(X_train_poly, y_train)

y_pred = poly_reg_model.predict(X_test_poly)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)
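
The three metrics printed above are closely related. The following sketch computes them by hand on a few made-up numbers (hypothetical values, not model output) to show that RMSE is the square root of MSE and that R-squared compares the model's squared error with that of always predicting the mean.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_hat = np.array([102.0, 108.0, 123.0, 127.0])

errors = y_true - y_hat

# MSE: mean of squared errors; RMSE: its square root, in the target's units
mse_manual = np.mean(errors ** 2)
rmse_manual = np.sqrt(mse_manual)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual)
print("RMSE:", rmse_manual)
print("R2:", r2_manual)
```

An R-squared of 0 would mean the model does no better than predicting the mean, and a negative value would mean it does worse.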


**APPLYING DECISION TREE REGRESSOR**

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

dt_regressor = DecisionTreeRegressor()

dt_regressor.fit(X_train, y_train)

dt_predictions = dt_regressor.predict(X_test)

dt_mse = mean_squared_error(y_test, dt_predictions)
dt_rmse = np.sqrt(dt_mse)
dt_r2 = r2_score(y_test, dt_predictions)

print("Mean Squared Error (MSE):", dt_mse)
print("Root Mean Squared Error (RMSE):", dt_rmse)
print("R-squared (R2) Score:", dt_r2)


**APPLYING RANDOM FOREST REGRESSOR**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

rf_model = RandomForestRegressor(n_estimators=250, random_state=42)  # You can adjust n_estimators as needed

# Training the model
rf_model.fit(X_train, y_train)

rf_predictions = rf_model.predict(X_test)

rf_mse = mean_squared_error(y_test, rf_predictions)
rf_rmse = sqrt(rf_mse)

rf_r2 = r2_score(y_test, rf_predictions)

print("Mean Squared Error (MSE):", rf_mse)
print("Root Mean Squared Error (RMSE):", rf_rmse)
print("R-squared (R2) Score:", rf_r2)


**Ensemble Techniques**

**APPLYING XG-BOOST**

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = data_encoded.drop(columns=['Month of Sourcing', 'Sourcing Cost'])  # Exclude target and date columns
y = data_encoded['Sourcing Cost']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the XGBoost model
xgb_model = xgb.XGBRegressor(objective ='reg:squarederror', random_state=42)

xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # avoids the squared=False argument, which newer scikit-learn releases no longer accept
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)


**APPLYING AdaBoost**

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = data_encoded.drop(columns=['Month of Sourcing', 'Sourcing Cost'])  # Exclude target and date columns
y = data_encoded['Sourcing Cost']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

adaboost_model = AdaBoostRegressor(random_state=42)

adaboost_model.fit(X_train, y_train)

y_pred = adaboost_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # avoids the squared=False argument, which newer scikit-learn releases no longer accept
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)


**APPLYING VOTING**

In [None]:
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = data_encoded.drop(columns=['Month of Sourcing', 'Sourcing Cost'])  # Exclude target and date columns
y = data_encoded['Sourcing Cost']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear_model = LinearRegression()
random_forest_model = RandomForestRegressor(random_state=42)
decision_tree_model = DecisionTreeRegressor(random_state=42)

voting_model = VotingRegressor(estimators=[
    ('linear', linear_model),
    ('random_forest', random_forest_model),
    ('decision_tree', decision_tree_model)
])

voting_model.fit(X_train, y_train)

y_pred = voting_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # avoids the squared=False argument, which newer scikit-learn releases no longer accept
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)


**APPLYING GradientBoostingRegressor**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = data_encoded.drop(columns=['Month of Sourcing', 'Sourcing Cost'])  # Exclude target and date columns
y = data_encoded['Sourcing Cost']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbrt_model = GradientBoostingRegressor(random_state=42)
gbrt_model.fit(X_train, y_train)

y_pred = gbrt_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # avoids the squared=False argument, which newer scikit-learn releases no longer accept
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)


## RESULTS

**Based on the results obtained from applying various regression algorithms to our dataset, we can draw several conclusions:**

**Polynomial Regression**:

Mean Squared Error (MSE): 1508.63
Root Mean Squared Error (RMSE): 38.84
R-squared (R2) Score: 0.529
Polynomial regression yielded relatively high MSE and RMSE values, indicating a significant amount of error in predictions. The R-squared score of 0.529 suggests that the model explains approximately 52.9% of the variance in the target variable.

**Decision Tree Regressor**:

MSE: 516.84
RMSE: 22.73
R2 Score: 0.839
The decision tree regressor performed better than polynomial regression, with lower MSE and RMSE values. The R-squared score of 0.839 indicates that the model explains approximately 83.9% of the variance in the target variable.


**Random Forest Regressor**:

MSE: 517.19
RMSE: 22.74
R2 Score: 0.839
Random forest regression produced similar results to the decision tree regressor, with slightly higher MSE and RMSE values but comparable R-squared score.

**XG-Boost**:

MSE: 527.63
RMSE: 22.97
R2 Score: 0.831
XG-Boost performed well, with relatively low MSE and RMSE values and a high R-squared score of 0.831, indicating that approximately 83.1% of the variance in the target variable is explained by the model.

**AdaBoost**:

MSE: 1645.99
RMSE: 40.57
R2 Score: 0.472
AdaBoost yielded the highest MSE and RMSE values among all models, indicating significant prediction errors. The R-squared score of 0.472 suggests that the model explains approximately 47.2% of the variance in the target variable.

**Voting (Ensemble of Various Models)**:

MSE: 703.19
RMSE: 26.52
R2 Score: 0.774
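The voting ensemble model landed in the middle of the pack: its MSE and RMSE values are lower than those of polynomial regression and AdaBoost but higher than those of the decision tree, random forest, and XG-Boost. The R-squared score of 0.774 indicates a reasonably good fit of the ensemble model to the data.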


**Gradient Boosting Regressor**:

MSE: 712.50
RMSE: 26.69
R2 Score: 0.771
Similar to the voting ensemble, the gradient boosting regressor achieved moderate MSE and RMSE values and an R-squared score of 0.771, a reasonable fit but weaker than that of the decision tree, random forest, and XG-Boost.


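**In conclusion, among the models tested, the decision tree regressor, random forest regressor, and XG-Boost gave the lowest prediction errors (RMSE of about 22.7 to 23.0) and the highest R-squared scores (0.83 to 0.84). The voting ensemble and the gradient boosting regressor followed (R-squared of about 0.77), while polynomial regression and AdaBoost lagged well behind. Note that polynomial regression, the decision tree, and the random forest were evaluated on a chronological 80/20 split, whereas XG-Boost, AdaBoost, voting, and gradient boosting used a random 80/20 split, so the scores are not strictly comparable. With that caveat, the three best-scoring models are suitable candidates for predicting the target variable in our dataset.**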
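
One way to put several models on an equal footing is to fit and score all of them on a single shared split. The sketch below does this on synthetic data from `make_regression`; the data, the model selection, and the `scores` dictionary are illustrative assumptions, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the encoded features and target
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# One shared 80/20 split (first 80% of rows for training, last 20% for testing)
split = int(len(X) * 0.8)
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

models = {
    "linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit every model on the same training rows and score on the same test rows
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mse = mean_squared_error(y_te, pred)
    scores[name] = {"rmse": float(np.sqrt(mse)), "r2": float(r2_score(y_te, pred))}

# Print models from lowest to highest RMSE
for name, s in sorted(scores.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name}: RMSE={s['rmse']:.2f}, R2={s['r2']:.3f}")
```

For time-ordered data such as monthly sourcing costs, keeping the test rows later than the training rows is usually the safer choice, since a random split lets the model see months that follow some of its test rows.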

**APPLYING THE REGRESSION MODEL ON THE TEST DATASET**

LOADING DATASET AND PRE-PROCESSING IT

In [None]:
datatest= pd.read_csv("/content/drive/MyDrive/DATASETS/DS_ML Coding Challenge Dataset (1).xlsx - Test Dataset.csv")

In [None]:
datatest.head()

In [None]:
datatest['Month of Sourcing'] = pd.to_datetime(datatest['Month of Sourcing'], format='%b-%y')


In [None]:
data_sorted2 = datatest.sort_values(by='Month of Sourcing')


In [None]:
data_sorted2.info()

In [None]:
data_encoded2 = data_sorted2.copy()
columns_to_encode = ['ProductType', 'Manufacturer', 'Area Code', 'Sourcing Channel', 'Product Size', 'Product Type']

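# Note: fit_transform re-fits the encoder on the test data, so the integer codes match
# the training codes only if each column contains the same set of categories in both files.
for col in columns_to_encode:
    data_encoded2[col] = label_encoder.fit_transform(data_sorted2[col])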

print(data_encoded2.head())
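
Re-fitting a `LabelEncoder` on the test data can assign different integers to the same category than were used in training. A minimal sketch of the alternative, assuming one encoder per column is kept in a dictionary (the `encoders` dict, the toy frames, and the `_code` column names are hypothetical), looks like this:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy train/test frames; the test set lacks category "X2"
train_toy = pd.DataFrame({"Manufacturer": ["X1", "X2", "X3", "X1"]})
test_toy = pd.DataFrame({"Manufacturer": ["X3", "X3", "X1"]})

# Fit one encoder per column on the training data and reuse it for the test data
encoders = {}
for col in ["Manufacturer"]:
    encoders[col] = LabelEncoder().fit(train_toy[col])
    train_toy[col + "_code"] = encoders[col].transform(train_toy[col])
    test_toy[col + "_code"] = encoders[col].transform(test_toy[col])

# For comparison: re-fitting on the test data alone yields different codes
refit_codes = LabelEncoder().fit_transform(test_toy["Manufacturer"])

print(test_toy)
print("Codes after re-fitting on test only:", refit_codes.tolist())
```

With the reused encoder, `X3` keeps the code it had in training, while re-fitting on the test frame renumbers it. Note that `transform` raises a `ValueError` for categories that were not seen during fitting, so unseen test categories need separate handling.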


**PREDICTION BY RANDOM FOREST MODEL**

In [None]:
X_test = data_encoded2.drop(columns=['Sourcing Cost', 'Month of Sourcing'])

rf_test_predictions = rf_model.predict(X_test)

print("Predicted Sourcing Cost:")
print(rf_test_predictions)


In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

y_test = data_encoded2['Sourcing Cost']

rf_test_mse = mean_squared_error(y_test, rf_test_predictions)

# Calculating RMSE for the test set
rf_test_rmse = sqrt(rf_test_mse)

# Calculating R-squared score for the test set
rf_test_r2 = r2_score(y_test, rf_test_predictions)

print("Test Dataset Metrics:")
print("Mean Squared Error (MSE):", rf_test_mse)
print("Root Mean Squared Error (RMSE):", rf_test_rmse)
print("R-squared (R2) Score:", rf_test_r2)


**CONCLUSION:**

In conclusion, the trained Random Forest model was evaluated on the separate, unseen test dataset using MSE, RMSE, and R-squared (printed by the cell above), and its results are promising.

Given these metrics, Random Forest is a reasonable initial model choice for this dataset. It handles non-linearity and feature interactions, is comparatively robust to outliers, and was among the best-scoring models in the comparison above, which makes it a reliable starting point for further model refinement and exploration.