<a href="https://colab.research.google.com/github/ram30098singh/Machine_Learning-Model/blob/main/Adidas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**About Datasets**



1. **Product Information**: Details about the products sold by Adidas, such as product names, categories (shoes, apparel, accessories, etc.), and product specifications.

2. **Sales Metrics**: Information about sales performance, including quantities sold, revenue generated, and possibly even profit margins.

3. **Time Period**: Data might be organized by specific time frames, such as daily, weekly, monthly, or yearly sales figures.

4. **Geographical Information**: The dataset covers US sales, broken down by region, state, and city.

5. **Customer Insights**: Data about the types of customers purchasing Adidas products, potentially including demographic information, purchasing behavior, and trends.

6. **Promotions and Discounts**: Information about any promotional campaigns, discounts, or offers that were running during the recorded sales period.

7. **Channel Information**: Sales data could be segmented by different sales channels, such as brick-and-mortar stores, online platforms, or third-party retailers.

8. **Trends and Patterns**: The dataset could reveal sales trends and patterns over time, helping businesses understand the demand for different products throughout the year.

9. **Seasonality**: Insights into how sales vary based on different seasons or events, like holidays or back-to-school seasons.

10. **Competitive Analysis**: Analyzing sales data could also provide insights into how Adidas products are performing in comparison to competitors in the market.


**Objective for Analysis and Model Training Using Random Forest Classifier:**

The objective of this project is to perform a comprehensive analysis of a given dataset using the Random Forest classifier. By leveraging the power of the Random Forest algorithm, we aim to achieve the following goals:

1. **Data Exploration and Preprocessing:** Thoroughly explore the dataset to understand its structure, features, and potential challenges. Perform data preprocessing tasks such as handling missing values, encoding categorical variables, and scaling numerical features.

2. **Feature Importance Assessment:** Utilize the Random Forest's feature importance scores to identify key variables that significantly impact the target variable. This analysis will guide feature selection and potentially uncover hidden relationships within the dataset.

3. **Model Training and Tuning:** Implement the Random Forest classifier to build a predictive model. Employ techniques such as cross-validation to find optimal hyperparameters, ensuring the model's generalization performance on unseen data.

4. **Performance Evaluation:** Assess the model's performance using appropriate evaluation metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. Compare these metrics to establish a baseline for model effectiveness.

5. **Handling Imbalanced Data (if applicable):** If the dataset suffers from class imbalance, address this issue by employing techniques like oversampling, undersampling, or utilizing class-weighted approaches to enhance the model's ability to predict minority classes accurately.

6. **Interpretability and Explainability:** Leverage the inherent interpretability of the Random Forest algorithm to understand how individual trees contribute to predictions. This will provide valuable insights into the decision-making process of the model.

7. **Visualizing Results:** Create meaningful visualizations, including feature importance plots, confusion matrices, and ROC curves, to facilitate clear communication of the model's performance to stakeholders.

8. **Deployability Considerations:** Discuss the potential deployment of the trained model in real-world scenarios. Highlight any challenges, recommendations, or modifications needed to integrate the model into a production environment.

9. **Documentation and Reporting:** Provide a detailed report documenting the entire analysis pipeline, from data preprocessing to model evaluation. Include explanations of the decisions made, the rationale behind them, and the steps taken to ensure the model's reliability.
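
Goal 5 above mentions class-weighted approaches. As a minimal sketch, the "balanced" heuristic documented for scikit-learn's `class_weight` option weights each class by `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels below are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np

def balanced_class_weights(y):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = y.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical imbalanced labels: six of class 0, three of class 1, one of class 2
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)
weights = balanced_class_weights(y_demo)
print(weights)  # rarer classes receive larger weights
```

A dictionary like this could be passed as `class_weight` to `RandomForestClassifier`, or the string `"balanced"` can be passed to let scikit-learn compute the weights itself.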


**Objective: To find the best model for predicting the seller with high sales revenue.**

In [None]:
import pandas as pd
# 1.0.1
from sklearn.model_selection import train_test_split
from sklearn.ensemble import  RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# 1.0.2
from pathlib import Path
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# For skopt routines
! pip install scikit-optimize
# 0.1 For plotting skopt results
! pip install 'scikit-optimize[plots]'



In [None]:
# 1.0 Clear ipython memory
#%reset -f

# 1.1 Data manipulation and plotting modules
import numpy as np
import pandas as pd


# 1.2 Data pre-processing
#     z = (x-mean)/stdev
from sklearn.preprocessing import StandardScaler as ss

# 1.3 Dimensionality reduction and noise removal
from sklearn.decomposition import PCA

# 1.4 Data splitting and model parameter search
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

# 1.5 Model pipelining
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

# 1.6 Hyperparameter optimization
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer


In [None]:
# 1.9 Model evaluation metrics
from sklearn.metrics import accuracy_score, f1_score
#from sklearn.metrics import plot_roc_curve
from sklearn.metrics import confusion_matrix

# 1.10
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import plot_importance

# 1.11 Permutation feature importance
from sklearn.inspection import permutation_importance

In [None]:

# 1.12 Misc
import time
import os
import gc
import random
# 1.13 Used in Randomized parameter search
from scipy.stats import uniform
# 1.14 Plotting and statistics (pandas, numpy, seaborn and matplotlib are imported above)
from matplotlib.ticker import EngFormatter
import plotly.express as px
import statsmodels as sm
import scipy.stats as sps
import statsmodels.formula.api as smf

In [None]:
data=pd.read_csv("https://raw.githubusercontent.com/Subhamtr01/Rac/main/PPD_Adidas.csv")
data

Unnamed: 0,Retailer,Region,State_name,City,Product,Sales Method,Retailer_code,Region_code,State_code,City_code,Product_code,Sales Method_code,Price per Unit,Units Sold,Total_Sales,Operating Profit,Operating Margin,State
0,Foot Locker,Northeast,New York,New York,Men's Street Footwear,In-store,1,1,31,35,2,0,50,1200,600000,300000.00,0.50,NY
1,Foot Locker,Northeast,New York,New York,Men's Athletic Footwear,In-store,1,1,31,35,1,0,50,1000,500000,150000.00,0.30,PA
2,Foot Locker,Northeast,New York,New York,Women's Street Footwear,In-store,1,1,31,35,5,0,40,1000,400000,140000.00,0.35,NY
3,Foot Locker,Northeast,New York,New York,Women's Athletic Footwear,In-store,1,1,31,35,4,0,45,850,382500,133875.00,0.35,PA
4,Foot Locker,Northeast,New York,New York,Men's Apparel,In-store,1,1,31,35,0,0,60,900,540000,162000.00,0.30,NY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9643,Foot Locker,Northeast,New Hampshire,Manchester,Men's Apparel,Outlet,1,1,28,30,0,2,50,64,3200,896.00,0.28,PA
9644,Foot Locker,Northeast,New Hampshire,Manchester,Women's Apparel,Outlet,1,1,28,30,3,2,41,105,4305,1377.60,0.32,ME
9645,Foot Locker,Northeast,New Hampshire,Manchester,Men's Street Footwear,Outlet,1,1,28,30,2,2,41,184,7544,2791.28,0.37,PA
9646,Foot Locker,Northeast,New Hampshire,Manchester,Men's Athletic Footwear,Outlet,1,1,28,30,1,2,42,70,2940,1234.80,0.42,ME


In [None]:
data.isnull().sum()   #No null values

Retailer             0
Region               0
State_name           0
City                 0
Product              0
Sales Method         0
Retailer_code        0
Region_code          0
State_code           0
City_code            0
Product_code         0
Sales Method_code    0
Price per Unit       0
Units Sold           0
Total_Sales          0
Operating Profit     0
Operating Margin     0
State                0
dtype: int64

In [None]:
data.shape

(9648, 18)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9648 entries, 0 to 9647
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Retailer           9648 non-null   object 
 1   Region             9648 non-null   object 
 2   State_name         9648 non-null   object 
 3   City               9648 non-null   object 
 4   Product            9648 non-null   object 
 5   Sales Method       9648 non-null   object 
 6   Retailer_code      9648 non-null   int64  
 7   Region_code        9648 non-null   int64  
 8   State_code         9648 non-null   int64  
 9   City_code          9648 non-null   int64  
 10  Product_code       9648 non-null   int64  
 11  Sales Method_code  9648 non-null   int64  
 12  Price per Unit     9648 non-null   int64  
 13  Units Sold         9648 non-null   int64  
 14  Total_Sales        9648 non-null   int64  
 15  Operating Profit   9648 non-null   float64
 16  Operating Margin   9648 non-null   float64
 17  State              9648 non-null   object 

In [None]:
data.isna().sum()

Retailer             0
Region               0
State_name           0
City                 0
Product              0
Sales Method         0
Retailer_code        0
Region_code          0
State_code           0
City_code            0
Product_code         0
Sales Method_code    0
Price per Unit       0
Units Sold           0
Total_Sales          0
Operating Profit     0
Operating Margin     0
State                0
dtype: int64

In [None]:
data.describe()

Unnamed: 0,Retailer_code,Region_code,State_code,City_code,Product_code,Sales Method_code,Price per Unit,Units Sold,Total_Sales,Operating Profit,Operating Margin
count,9648.0,9648.0,9648.0,9648.0,9648.0,9648.0,9648.0,9648.0,9648.0,9648.0,9648.0
mean,2.60852,2.0,24.223881,25.768657,2.499793,1.132566,45.216625,256.930037,93273.4375,34425.244761,0.422991
std,1.726698,1.471191,14.742644,14.883855,1.707549,0.689738,14.705397,214.25203,141916.016727,54193.113713,0.097197
min,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.1
25%,1.0,1.0,10.0,12.0,1.0,1.0,35.0,106.0,4254.5,1921.7525,0.35
50%,3.0,2.0,25.0,26.0,2.0,1.0,45.0,176.0,9576.0,4371.42,0.41
75%,4.0,4.0,37.0,39.0,4.0,2.0,55.0,350.0,150000.0,52062.5,0.49
max,5.0,4.0,49.0,51.0,5.0,2.0,110.0,1275.0,825000.0,390000.0,0.8


In [None]:
# Share of units sold per product (the title now matches the plotted values)
Pie_chart= px.pie(data, title='Units Sold by Product', values='Units Sold',  color_discrete_sequence=["#0083B8"], names='Product')
Pie_chart.show()

In [None]:
fig = px.ecdf(data, x='Total_Sales',color='Retailer')
fig.show()

At low sales values, the cumulative probability for Amazon and Foot Locker is higher, while Walmart and Kohl's sit lower. As Total_Sales increases, the curves for all retailers converge.
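
For readers unfamiliar with the ECDF, the sketch below computes one by hand: sort the values, then assign each the fraction of observations at or below it. The function name `ecdf` and the five sample values are assumptions chosen for illustration; `px.ecdf` does the equivalent per retailer.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the fraction of observations <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Hypothetical Total_Sales values for one retailer
sales_demo = [3200, 4305, 7544, 2940, 600000]
x, y = ecdf(sales_demo)
for value, prob in zip(x, y):
    print(f"P(Total_Sales <= {value:,.0f}) = {prob:.1f}")
```

A curve that rises earlier means a larger share of that retailer's transactions fall at low sales values.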

In [None]:
Regional_sales =px.pie(data, title='Sale Region Wise', values='Total_Sales', color_discrete_sequence=["#0083B8"], names='Region')
Regional_sales.show()

The West region dominates total sales of Adidas products, at $269M. The Northeast follows with about $186M, then the Southeast with about $169M.

In [None]:
Store_sales =px.pie(data, title='Sale Store Wise', values='Total_Sales', color_discrete_sequence=["#0083B8"],names='Sales Method')
Store_sales.show()

By sales method, in-store sales contribute the most revenue, totalling about $356M. Outlets follow with $295M, and online sales come third with $247M.
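
The totals quoted for regions and sales methods can be read off the pie charts, but a grouped sum gives them directly. The sketch below uses a tiny hypothetical frame (the rows are illustrative, not the real data); on the notebook's `data` frame the same `groupby` call would apply to `'Region'` or `'Sales Method'`.

```python
import pandas as pd

# Hypothetical rows that reuse the notebook's column names
demo = pd.DataFrame({
    "Sales Method": ["In-store", "Outlet", "Online", "Outlet", "In-store"],
    "Total_Sales": [600000, 3200, 9576, 4305, 500000],
})

# Sum revenue per sales method, largest first
totals = (demo.groupby("Sales Method")["Total_Sales"]
              .sum()
              .sort_values(ascending=False))

# Convert each total to a percentage share of overall revenue
shares = (totals / totals.sum() * 100).round(1)
print(pd.DataFrame({"total": totals, "share_%": shares}))
```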

In [None]:
df2=data.copy()

# Encode each categorical column as integer codes
df2['Region']=pd.factorize(df2.Region)[0]
df2['State_name']=pd.factorize(df2.State_name)[0]
df2['City']=pd.factorize(df2.City)[0]
df2['Product']=pd.factorize(df2.Product)[0]
df2['Retailer']=pd.factorize(df2.Retailer)[0]

df2.rename(columns = {'Sales Method':'Method'}, inplace = True)
df2['Method']=pd.factorize(df2.Method)[0]
df2.head()

# Correlate numeric columns only; 'State' still holds text abbreviations
corr=df2.corr(numeric_only=True)
#print(corr)


fig = px.imshow(corr)
fig.show()


Unnamed: 0,Retailer,Region,State_name,City,Product,Method,Retailer_code,Region_code,State_code,City_code,Product_code,Sales Method_code,Price per Unit,Units Sold,Total_Sales,Operating Profit,Operating Margin,State
0,0,0,0,0,0,0,1,1,31,35,2,0,50,1200,600000,300000.0,0.5,NY
1,0,0,0,0,1,0,1,1,31,35,1,0,50,1000,500000,150000.0,0.3,PA
2,0,0,0,0,2,0,1,1,31,35,5,0,40,1000,400000,140000.0,0.35,NY
3,0,0,0,0,3,0,1,1,31,35,4,0,45,850,382500,133875.0,0.35,PA
4,0,0,0,0,4,0,1,1,31,35,0,0,60,900,540000,162000.0,0.3,NY








The heatmap shows that Units Sold, Total_Sales, Price per Unit, and Operating Profit are strongly correlated with one another (correlation coefficients above 0.6). Total_Sales and Operating Margin, on the other hand, show no clear relationship.
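
To list the strongly correlated pairs instead of reading them off the heatmap, the correlation matrix can be filtered at the 0.6 threshold. This is a sketch on synthetic data: the random values and the frame name `demo` are assumptions, and only the column names echo the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
units = rng.integers(50, 1000, size=200)
price = rng.integers(20, 100, size=200)

# Synthetic stand-in: Total_Sales is built from units and price,
# while Operating Margin is drawn independently of everything else
demo = pd.DataFrame({
    "Units Sold": units,
    "Price per Unit": price,
    "Total_Sales": units * price,
    "Operating Margin": rng.uniform(0.1, 0.8, size=200),
})

corr = demo.corr()  # Pearson correlation between all column pairs

# Keep each pair once (upper triangle, excluding the diagonal)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

strong = pairs[pairs.abs() > 0.6]
print(strong)
```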

In [None]:
fig = px.scatter(data, x="Price per Unit", y="Operating Margin", color="Product")
fig.show()

In [None]:
# Named sales_map so the built-in map() is not shadowed
sales_map = px.choropleth(data,
                        locations = 'State',
                        locationmode = 'USA-states',
                        scope = 'usa',
                        color = 'Total_Sales',
                        hover_name = 'State',
                        hover_data = ['Total_Sales'],
                        range_color = [0, 825000],
                        color_continuous_scale = 'blues',
                        title = 'Sales state wise')
sales_map.show()



In [None]:
# Keep the encoded and numeric columns; copy so later in-place edits leave `data` untouched
df = data[['Retailer_code', 'Region_code','Product_code', 'State_code', 'City_code', 'Units Sold',
       'Price per Unit','Total_Sales', 'Operating Profit', 'Operating Margin']].copy()

In [None]:
df

Unnamed: 0,Retailer_code,Region_code,Product_code,State_code,City_code,Units Sold,Price per Unit,Total_Sales,Operating Profit,Operating Margin
0,1,1,2,31,35,1200,50,600000,300000.00,0.50
1,1,1,1,31,35,1000,50,500000,150000.00,0.30
2,1,1,5,31,35,1000,40,400000,140000.00,0.35
3,1,1,4,31,35,850,45,382500,133875.00,0.35
4,1,1,0,31,35,900,60,540000,162000.00,0.30
...,...,...,...,...,...,...,...,...,...,...
9643,1,1,0,28,30,64,50,3200,896.00,0.28
9644,1,1,3,28,30,105,41,4305,1377.60,0.32
9645,1,1,2,28,30,184,41,7544,2791.28,0.37
9646,1,1,1,28,30,70,42,2940,1234.80,0.42


In [None]:
y = df.pop('Retailer_code')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9648 entries, 0 to 9647
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Region_code       9648 non-null   int64  
 1   Product_code      9648 non-null   int64  
 2   State_code        9648 non-null   int64  
 3   City_code         9648 non-null   int64  
 4   Units Sold        9648 non-null   int64  
 5   Price per Unit    9648 non-null   int64  
 6   Total_Sales       9648 non-null   int64  
 7   Operating Profit  9648 non-null   float64
 8   Operating Margin  9648 non-null   float64
dtypes: float64(2), int64(7)
memory usage: 678.5 KB


In [None]:
X = df.select_dtypes(exclude = ['object'])

In [None]:
# 4. Split dataset into train and validation parts
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.35,
                                                    shuffle = True,
                                                    stratify = y
                                                    )

# 4.1
X_train.shape
X_test.shape
y_train.shape
y_test.shape

(6271, 9)

(3377, 9)

(6271,)

(3377,)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score


# Create a RandomForestClassifier
rf_classifier = RandomForestClassifier(n_jobs=3)

# Define the steps for the pipeline: scale, project with PCA, then classify
steps_rf = [('sts', ss()),
            ('pca', PCA()),
            ('rf', rf_classifier)]
# Create the pipeline
pipe_rf = Pipeline(steps_rf)
# Define the parameters for GridSearchCV
parameters_rf = {'rf__n_estimators': [200, 300],  # Number of trees in the forest
                 'rf__max_depth': [4, 6],           # Maximum depth of the trees
                 'pca__n_components': [4, 9]}       # Number of PCA components
# Create the GridSearchCV object; accuracy is the only metric scored
scoring = make_scorer(accuracy_score)
clf_rf = GridSearchCV(pipe_rf,
                      parameters_rf,
                      n_jobs=2,
                      cv=2,
                      verbose=1,
                      scoring=scoring,
                      refit=True)   # Refit the highest-accuracy pipeline on the full training set

In [None]:
# 7.2. Start fitting pipeline to data
print("\n\n--Takes time...---\n")
start = time.time()
clf_rf.fit(X_train, y_train)
end = time.time()
print()
(end - start)/60



--Takes time...---

Fitting 2 folds for each of 8 candidates, totalling 16 fits





0.44485881725947063

In [None]:
# 7.3
f"Best score: {clf_rf.best_score_} "

# 7.3.1
print()
f"Best parameter set {clf_rf.best_params_}"

'Best score: 0.5335692111772938 '




"Best parameter set {'pca__n_components': 9, 'rf__max_depth': 6, 'rf__n_estimators': 300}"

In [None]:
# 7.4. Make predictions using the best returned model
y_pred = clf_rf.predict(X_test)
print("--Few predictions--\n")
y_pred[:4]

--Few predictions--



array([5, 1, 1, 5])

In [None]:
print("\n\n--Confusion Matrix--\n")
confusion_matrix( y_test,y_pred)



--Confusion Matrix--



array([[ 40, 129,   0, 145,   0,  18],
       [  2, 592,  15, 181,   0, 133],
       [  0, 119, 101,  73,   0,  68],
       [  0, 139,   2, 518,   0,  52],
       [  0,  59,   0,  80,   4,  76],
       [  0, 165,   0, 146,   2, 518]])

In [None]:
f1_score_micro = f1_score(y_test, y_pred, average="micro")
print("\n\n--F1 Score (micro average)--\n")
print(f"F1 Score (micro average): {f1_score_micro}")




--F1 Score (micro average)--

F1 Score (micro average): 0.525022209061297
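
For single-label multiclass problems, the micro-averaged F1 score equals plain accuracy, so it hides how unevenly the classes are predicted. The sketch below recomputes accuracy from the confusion matrix printed above and derives per-class recall and precision; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true class, columns = predicted)
cm = np.array([[ 40, 129,   0, 145,   0,  18],
               [  2, 592,  15, 181,   0, 133],
               [  0, 119, 101,  73,   0,  68],
               [  0, 139,   2, 518,   0,  52],
               [  0,  59,   0,  80,   4,  76],
               [  0, 165,   0, 146,   2, 518]])

accuracy = cm.trace() / cm.sum()            # correct predictions / all predictions
recall = cm.diagonal() / cm.sum(axis=1)     # per true class
precision = cm.diagonal() / cm.sum(axis=0)  # per predicted class

print(f"accuracy = {accuracy:.6f}")
for label, (r, p) in enumerate(zip(recall, precision)):
    print(f"class {label}: recall = {r:.3f}, precision = {p:.3f}")
```

Per-class recall ranges from under 2% for class 4 to roughly 73% for class 3, which a single averaged score does not reveal.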


In [None]:

print("\n\n--How many features--\n")
clf_rf.best_estimator_.named_steps["rf"].feature_importances_.shape

# 7.9.1
print("\n\n---Feature importances---\n")
clf_rf.best_estimator_.named_steps["rf"].feature_importances_



--How many features--



(9,)



---Feature importances---



array([0.04035484, 0.20624616, 0.17382185, 0.12311752, 0.11885312,
       0.17860241, 0.0833621 , 0.04692861, 0.0287134 ])

In [None]:
# The forest is fitted on the PCA output, so each importance belongs to a
# principal component (PC1..PC9), not to an original column.
imp_values = clf_rf.best_estimator_.named_steps["rf"].feature_importances_
colnames = [f"PC{i + 1}" for i in range(len(imp_values))]

df_imp = pd.DataFrame(
                      data = imp_values,
                      index = colnames,
                      columns = ["imp"]
                      ).sort_values(by = 'imp')

# 7.10.1
df_imp

Unnamed: 0,imp
PC9,0.028713
PC1,0.040355
PC8,0.046929
PC7,0.083362
PC5,0.118853
PC4,0.123118
PC3,0.173822
PC6,0.178602
PC2,0.206246


In [None]:
list(df_imp.index.values[:5])

['PC9',
 'PC1',
 'PC8',
 'PC7',
 'PC5']
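
Because the forest sits behind a PCA step in the pipeline, its `feature_importances_` describe principal components, not the original columns. One way to score the original inputs is permutation importance computed on the whole pipeline, which the notebook imports but does not call. The sketch below runs on synthetic data from `make_classification`; the data, the `_demo` names, and the parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the retailer data
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     n_classes=3, n_clusters_per_class=1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo, random_state=0)

# Same step layout as the notebook's pipeline: scale -> PCA -> forest
pipe = Pipeline([("sts", StandardScaler()),
                 ("pca", PCA()),
                 ("rf", RandomForestClassifier(n_estimators=100, random_state=0))])
pipe.fit(X_tr, y_tr)

# Shuffle one input column at a time and measure the drop in test score
result = permutation_importance(pipe, X_te, y_te, n_repeats=5, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```

In the notebook itself, the analogous call would presumably be `permutation_importance(clf_rf.best_estimator_, X_test, y_test)`, with `X_test.columns` supplying the labels.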

In [None]:
from scipy.stats import randint

parameters = {
    'rf__n_estimators': randint(50, 300),  # Number of trees in the forest
    'rf__max_depth': randint(3, 10),       # Maximum depth of individual trees
    'rf__max_features': ['sqrt', 'log2'],  # Features considered at each split ('auto' was removed in scikit-learn 1.3)
    'rf__min_samples_split': randint(2, 10),  # Minimum number of samples required to split an internal node
    'rf__min_samples_leaf': randint(1, 10),   # Minimum number of samples required to be at a leaf node
}


In [None]:
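# NOTE: the plain 'roc_auc' scorer supports binary targets only. Retailer_code
# has six classes, so this metric comes back as NaN (see the warning and the
# 'Best score: nan' output below); 'roc_auc_ovr' is the multiclass variant.
rs = RandomizedSearchCV(
                          pipe_rf,
                          param_distributions=parameters,
                          scoring= ['roc_auc', 'accuracy'],
                          n_iter=4,           # Max combination of
                                              # parameter to try. Default = 10
                          verbose = 1,
                          refit = 'roc_auc',
                          n_jobs = 2,          # Use parallel cpu threads
                          cv = 2               # No of folds.
                                              # So n_iter * cv combinations
                        )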

In [None]:
# 9.2 Run the random search (n_iter=4 candidates x cv=2 folds = 8 fits)

start = time.time()
rs.fit(X_train, y_train)
end = time.time()
print()
(end - start)/60   # Elapsed time in minutes

Fitting 2 folds for each of 4 candidates, totalling 8 fits



One or more of the test scores are non-finite: [nan nan nan nan]






0.22715401649475098

In [None]:
# 9.3 Evaluate
f"Best score: {rs.best_score_} " ;print()
f"Best parameter set: {rs.best_params_} " ; print()


# 9.4 Make predictions from the best returned model
y_pred = rs.predict(X_test)
# 9.5 Accuracy and f1_score
accuracy = accuracy_score(y_test, y_pred)
f"Accuracy: {accuracy * 100.0}"   ; print()
f1_score_micro = f1_score(y_test, y_pred, average="micro")
print("\n\n--F1 Score (micro average)--\n")
print(f"F1 Score (micro average): {f1_score_micro}")

'Best score: nan '




"Best parameter set: {'rf__max_depth': 9, 'rf__max_features': 'sqrt', 'rf__min_samples_leaf': 3, 'rf__min_samples_split': 4, 'rf__n_estimators': 267} "




'Accuracy: 64.76162274207877'




--F1 Score (micro average)--

F1 Score (micro average): 0.6476162274207877
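
The `nan` best score above most likely arises because the plain `'roc_auc'` scorer is defined for binary targets, while `Retailer_code` has six classes. scikit-learn also provides the scorer name `'roc_auc_ovr'` (one-vs-rest) for multiclass targets. The sketch below shows that scorer returning finite values on a synthetic three-class problem; the data and the `_demo` names are assumptions, not the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic three-class data standing in for the retailer target
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

# Score with a multiclass-capable AUC alongside accuracy
scores = cross_validate(RandomForestClassifier(n_estimators=50, random_state=0),
                        X_demo, y_demo, cv=3,
                        scoring=["roc_auc_ovr", "accuracy"])

auc = scores["test_roc_auc_ovr"]
acc = scores["test_accuracy"]
print("AUC (one-vs-rest) per fold:", np.round(auc, 3))
print("Accuracy per fold:", np.round(acc, 3))
```

In the randomized search, the corresponding change would be to pass `scoring=['roc_auc_ovr', 'accuracy']` and `refit='roc_auc_ovr'`.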
