*Understanding Data Labels*

* Item Identifier : Unique id for item
* Item Weight : Weight of the item
* Item Fat Content : Item fat divided into categories
* Item Visibility : Product visibility on the storefront (per sq ft)
* Item Type : Type of product divided into categories
* Item MRP : Maximum retail price (MRP) of the item
* Outlet Identifier: Unique id for outlet
* Outlet Establishment Year: Outlet established year
* Outlet Size: Size of the outlet divided into categories
* Outlet Location Type: Outlet location divided into categories
* Outlet type: Outlet type divided into categories
* Item Outlet Sales: Sales of every item

Importing datasets

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
train=pd.read_csv("/kaggle/input/bigmart-sales-data/Train.csv") 
new_data=pd.read_csv("/kaggle/input/bigmart-sales-data/Test.csv")

In [None]:
train.head()

In [None]:
new_data.head()

In [None]:
print(train.shape)
print(new_data.shape)

In [None]:
print(train.info())
print(new_data.info())

EXPLORATORY DATA ANALYSIS and DATA PREPROCESSING

* Finding number of categorical columns
* Creating new columns
* Duplicate values detection
* Imputing missing values
* Relation of every column with target variable (Analysing data)
* Plotting correlation
* Checking possibilities to reduce dimensionality
* Handling categorical variable
* Feature selection
* Scaling the data

In [None]:
categorical_train=[j for j in train if train[j].dtype == 'object']
categorical_new_data =[k for k in new_data if new_data[k].dtype == 'object']

In [None]:
for i in categorical_train:
    columns = train[i].unique()
    print(i,columns)

In [None]:
for col in categorical_new_data:
    columns2 = new_data[col].unique()
    print(col,columns2)

In [None]:
train.insert(loc=9,column='current_year',value=2021)
new_data.insert(loc=8,column='current_year',value=2021)

In [None]:
train['Outlet_age']=train['current_year']- train['Outlet_Establishment_Year']
new_data['Outlet_age']=new_data['current_year']- new_data['Outlet_Establishment_Year']

In [None]:
train=train.drop(['current_year','Outlet_Establishment_Year'],axis=1)
new_data=new_data.drop(['current_year','Outlet_Establishment_Year'],axis=1)

Missing Values

In [None]:
print(train.isnull().sum())
print('-------------------------------------')
print(new_data.isnull().sum())

In [None]:
## merging the duplicate category labels in 'Item_Fat_Content'
train['Item_Fat_Content']=train['Item_Fat_Content'].replace(['low fat','LF','reg'],['Low Fat','Low Fat','Regular'])
new_data['Item_Fat_Content']=new_data['Item_Fat_Content'].replace(['low fat','LF','reg'],['Low Fat','Low Fat','Regular'])

In [None]:
train.head()

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
sns.heatmap(new_data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
## the heatmaps above show that the missing values are not concentrated in one region of the data; they are spread fairly evenly across the rows

## We have to check whether the data is missing completely at random
For data to be missing completely at random, every observation should have the same probability of being missing, and the missingness should not be related to any other variable.
* Here we consider 'Item_Weight' and 'Outlet_Type'; subjectively, they do not appear to depend on each other
* Explaining why data is missing really needs domain expertise; in this case I relied on basic subjective judgment and assumed there is no relation
* There are many imputation methods for values missing with no such relation: mean, median, mode, random-sample imputation, KNN, etc.
* I went with the statistical imputation methods, since they worked as well as KNN
* Compared with KNN, statistical methods are preferable because they need less computation and time
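One rough way to probe the missing-completely-at-random assumption is to compare the share of missing values across the categories of another column; large differences suggest the missingness depends on that column. The sketch below uses a small made-up frame, and the helper name `missing_rate_by_group` is an illustrative assumption, not something taken from the BigMart data or from pandas.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df, target_col, group_col):
    """Share of missing values in target_col for each category of group_col."""
    return df[target_col].isnull().groupby(df[group_col]).mean()

# Hypothetical toy data: weights are missing only for one outlet type.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.9, 12.1],
    "Outlet_Type": ["Grocery", "Grocery", "Supermarket",
                    "Grocery", "Supermarket", "Supermarket"],
})

rates = missing_rate_by_group(toy, "Item_Weight", "Outlet_Type")
print(rates)
```

Calling the same helper on `train` with `'Item_Weight'` and `'Outlet_Type'` would give a quick check of the subjective judgment above.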

In [None]:
## imputing missing values for categorical variable 'Outlet_Size'
print(train.Outlet_Size.value_counts())
print(new_data.Outlet_Size.value_counts())

In [None]:
## imputing the categorical variable with its most frequent value (mode)
## the training-set mode is used for both frames, so new_data is filled with a value learned from the training data only
mode=train['Outlet_Size'].mode().values[0]
train['Outlet_Size']=train['Outlet_Size'].fillna(mode)
new_data['Outlet_Size']=new_data['Outlet_Size'].fillna(mode)

In [None]:
## correlation is only defined for numeric data, so restrict the frame to its numeric columns
corr=train.select_dtypes(include='number').corr()
sns.heatmap(corr,annot=True)

In [None]:
## checking the correlation after imputing the categorical variable, to make sure the variables are not strongly correlated with each other
## we can also clearly see that the only variable highly correlated with sales is MRP

In [None]:
median_train=train['Item_Weight'].median()
print(median_train)
median_new_data=new_data['Item_Weight'].median()
print(median_new_data)

In [None]:
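def impute_nan(df,variable,median):
    """Add two imputed copies of `variable`: one median-filled, one filled with random draws from the observed values."""
    df[variable+"_median"]=df[variable].fillna(median)
    df[variable+"_random"]=df[variable]
    ## draw one observed value for every missing entry
    random_sample=df[variable].dropna().sample(df[variable].isnull().sum(),random_state=0)
    ## give the draws the index of the missing rows so that the assignment aligns
    random_sample.index=df[df[variable].isnull()].index
    df.loc[df[variable].isnull(),variable+'_random']=random_sample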

In [None]:
impute_nan(train,'Item_Weight',median_train)
train.head()

In [None]:
## the numerical variable is imputed with both the median and random samples, in two separate columns, for comparison
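A toy comparison of the two strategies, assuming a synthetic normally distributed column with roughly 20% of its values removed at random (the data and variable names are illustrative only): median imputation shrinks the spread, while random-sample imputation roughly preserves it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weights = pd.Series(rng.normal(loc=12.0, scale=4.0, size=1000))
weights[rng.random(1000) < 0.2] = np.nan  # knock out roughly 20% of the values

# Strategy 1: fill every gap with the median of the observed values.
median_filled = weights.fillna(weights.median())

# Strategy 2: fill each gap with a randomly drawn observed value.
missing = weights.isnull()
draws = weights.dropna().sample(int(missing.sum()), replace=True, random_state=0)
random_filled = weights.copy()
random_filled[missing] = draws.to_numpy()

print("observed std:     ", weights.std())
print("median-filled std:", median_filled.std())
print("random-filled std:", random_filled.std())
```

The standard deviation is used here as a one-number summary of what the KDE plots show visually.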

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
train['Item_Weight'].plot(kind='kde', ax=ax)
train.Item_Weight_median.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

In [None]:
## the median-imputed curve deviates from the original distribution (a spike at the median), which can make genuine values look like outliers

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
train['Item_Weight'].plot(kind='kde', ax=ax)
train.Item_Weight_random.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

In [None]:
## clearly, random weight imputation is much closer to item weight distribution

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
train['Item_Weight'].plot(kind='kde', ax=ax)
train.Item_Weight_median.plot(kind='kde', ax=ax, color='red')
train.Item_Weight_random.plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

In [None]:
## 'Item_Weight' and 'Item_Weight_random' follow the same distribution, hence we can drop the median imputation

In [None]:
def impute_nan(df,variable,median):
    ## same logic as for the training data, applied to whichever frame is passed in
    df[variable+"_median"]=df[variable].fillna(median)
    df[variable+"_random"]=df[variable]
    random_sample=df[variable].dropna().sample(df[variable].isnull().sum(),random_state=0)
    random_sample.index=df[df[variable].isnull()].index
    df.loc[df[variable].isnull(),variable+'_random']=random_sample

In [None]:
impute_nan(new_data,'Item_Weight',median_new_data)
new_data.head()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
new_data['Item_Weight'].plot(kind='kde', ax=ax)
new_data.Item_Weight_median.plot(kind='kde', ax=ax, color='red')
new_data.Item_Weight_random.plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

In [None]:
## it worked the same way for the new (test) data, so 'Item_Weight_median' is dropped

In [None]:
train=train.drop(['Item_Weight_median','Item_Weight'],axis=1)
new_data=new_data.drop(['Item_Weight_median','Item_Weight'],axis=1)

Relation of every column with target variable (Analysing data)

In [None]:
# First, we check whether the data is normally distributed or left/right skewed, so that outliers can be spotted

In [None]:
sns.histplot(train['Item_Visibility'],kde=True,stat='density')

* The data is right-skewed (positively skewed)
* Although the data is skewed and reaches extreme values, removing outliers might lead to losing important information
* Since this is sales data, every piece of information that increases or decreases sales is equally important
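Skew can also be quantified rather than read off the plot. The sketch below, on a synthetic right-skewed sample (not the BigMart data), uses `Series.skew()` and shows that a `log1p` transform reduces the skew without discarding any rows, which is one alternative to outlier removal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed values standing in for a skewed column.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_before = values.skew()
skew_after = np.log1p(values).skew()

print("skew before:     ", skew_before)
print("skew after log1p:", skew_after)
```

A skew near 0 indicates a roughly symmetric distribution; a positive value indicates a longer right tail.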

In [None]:
sns.histplot(train['Item_MRP'],kde=True,stat='density')

In [None]:
sns.histplot(train['Item_Outlet_Sales'],kde=True,stat='density')

In [None]:
## sales are positively skewed and also show peakedness

In [None]:
sns.histplot(train['Outlet_age'],kde=True,stat='density')

In [None]:
## correlation is only defined for numeric data, so restrict the frame to its numeric columns
corr=train.select_dtypes(include='number').corr()
sns.heatmap(corr,annot=True)

In [None]:
## to examine the relationship with the target variable, we pick the features most correlated with sales

In [None]:
sns.regplot(x='Item_MRP',y='Item_Outlet_Sales',data=train)

In [None]:
## as the heatmap indicated, sales increase steadily as MRP increases, which shows a good positive correlation

In [None]:
sns.regplot(x='Item_Visibility',y='Item_Outlet_Sales',data=train)

In [None]:
## Item_Visibility shows a negative correlation with sales

In [None]:
sns.regplot(x='Item_Weight_random',y='Item_Outlet_Sales',data=train)

In [None]:
## very little correlation
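The visual impressions from the regression plots can be backed with numbers by computing each feature's Pearson correlation with the target. A minimal sketch on synthetic data: the column names mirror the notebook, but the values and the relationship between them are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
mrp = rng.uniform(30, 270, n)
visibility = rng.uniform(0, 0.3, n)
weight = rng.uniform(5, 20, n)
# Synthetic target: driven mostly by MRP, reduced by visibility, unrelated to weight.
sales = 15 * mrp - 4000 * visibility + rng.normal(0, 800, n)

toy = pd.DataFrame({
    "Item_MRP": mrp,
    "Item_Visibility": visibility,
    "Item_Weight_random": weight,
    "Item_Outlet_Sales": sales,
})

# Correlation of every feature with the target, strongest (by magnitude) first.
target_corr = toy.corr()["Item_Outlet_Sales"].drop("Item_Outlet_Sales")
target_corr = target_corr.sort_values(key=abs, ascending=False)
print(target_corr)
```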

In [None]:
# checking possibilities to reduce dimensionality

In [None]:
## removing unnecessary columns based on subjective knowledge
train=train.drop(['Item_Identifier','Outlet_Identifier'],axis=1)
new_data=new_data.drop(['Item_Identifier','Outlet_Identifier'],axis=1)

Handling categorical variables

In [None]:
train['Item_Type'].value_counts()

In [None]:
## Item_Type has very little correlation with sales and has too many categories; grouping them into two broader categories reduces the dimensionality
non_edible=['Household','Health and Hygiene','Others']
train['Item_Type']=np.where(train['Item_Type'].isin(non_edible),'non-edible','edible')
new_data['Item_Type']=np.where(new_data['Item_Type'].isin(non_edible),'non-edible','edible')

In [None]:
train['Item_Type'].value_counts()

In [None]:
new_data.head()

In [None]:
train.head()

In [None]:
X=train[['Item_Fat_Content','Item_Visibility','Item_Type','Item_MRP','Outlet_Size','Outlet_Location_Type','Outlet_Type','Outlet_age','Item_Weight_random']]
y=train['Item_Outlet_Sales']

In [None]:
categorical_columns = X.describe(include='object').columns.to_list()
categorical_columns

In [None]:
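## the column list must be passed by keyword: the second positional argument of get_dummies is `prefix`
X= pd.get_dummies(X,columns=categorical_columns,dtype=float)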

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
X.head()

In [None]:
categorical_columns2 = new_data.describe(include='object').columns.to_list()
categorical_columns2

In [None]:
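new_data= pd.get_dummies(new_data,columns=categorical_columns2,dtype=float)
## align with the training columns so that both frames have identical features in the same order
new_data=new_data.reindex(columns=X.columns,fill_value=0)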

In [None]:
new_data.head()
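Encoding the two frames separately only lines up if both contain exactly the same categories. A defensive sketch, assuming small hypothetical frames, reindexes the new data onto the training columns, so that categories missing from the new data become all-zero columns and categories never seen in training are dropped.

```python
import pandas as pd

train_toy = pd.DataFrame({"Outlet_Size": ["Small", "Medium", "High"],
                          "Item_MRP": [100.0, 150.0, 200.0]})
new_toy = pd.DataFrame({"Outlet_Size": ["Medium", "Small"],
                        "Item_MRP": [120.0, 90.0]})

train_enc = pd.get_dummies(train_toy, columns=["Outlet_Size"], dtype=float)
new_enc = pd.get_dummies(new_toy, columns=["Outlet_Size"], dtype=float)

# Force the new data onto the training layout; missing dummy columns are filled with 0.
new_enc = new_enc.reindex(columns=train_enc.columns, fill_value=0.0)

print(list(new_enc.columns))
```

Without the `reindex` step, `new_enc` would have one column fewer than `train_enc`, and a model trained on the latter would reject it.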

FEATURE SELECTION

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
import matplotlib.pyplot as plt
model = ExtraTreesRegressor()
model.fit(X_train,y_train)

In [None]:
rank=model.feature_importances_

In [None]:
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feat_importances.nlargest(18).plot(kind='barh')
plt.show()

In [None]:
# Features with near-zero importance could be dropped, but since even the weakest ones still contribute about 0.002% of the information, I did not want to lose even that small amount.
# Therefore no features are dropped.
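If low-importance features were to be dropped after all, a threshold on the importance series would be enough. A sketch with made-up importance values: both the numbers and the 0.01 cutoff are hypothetical, not the output of the model above.

```python
import pandas as pd

# Hypothetical importances; in the notebook these would come from feat_importances.
toy_importances = pd.Series({
    "Item_MRP": 0.45,
    "Outlet_Type_Grocery Store": 0.30,
    "Item_Visibility": 0.15,
    "Outlet_age": 0.095,
    "Item_Fat_Content_Regular": 0.005,
})

threshold = 0.01  # assumed cutoff, to be tuned
kept = toy_importances[toy_importances >= threshold].index.tolist()
print(kept)
```

Indexing the feature frame with `kept` would then select the surviving columns.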

In [None]:
X_train.shape

In [None]:
imp_fea=feat_importances.nlargest(18)
imp_fea

SCALING

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()

In [None]:
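## fit the scaler on the training split only, then apply the same scaling to the test split
feature_names=X_train.columns
X_train=pd.DataFrame(sc.fit_transform(X_train),columns=feature_names,index=X_train.index)
X_test=pd.DataFrame(sc.transform(X_test),columns=feature_names,index=X_test.index)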

In [None]:
X_train.head()

In [None]:
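## reuse the scaler fitted on the training data; refitting on new_data would scale it differently from what the models saw
new_data=sc.transform(new_data)
new_data=pd.DataFrame(new_data,columns=X_test.columns)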
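Standardization parameters should come from the training data only and then be reused for every other split. The sketch below spells out the arithmetic with NumPy on made-up arrays, mirroring what `fit_transform` followed by `transform` does (StandardScaler uses the population standard deviation, i.e. `ddof=0`).

```python
import numpy as np

train_col = np.array([10.0, 20.0, 30.0, 40.0])
new_col = np.array([20.0, 50.0])

# "fit": learn mean and standard deviation from the training data only
mu = train_col.mean()
sigma = train_col.std()  # ddof=0, as in StandardScaler

# "transform": apply the same parameters to both sets
train_scaled = (train_col - mu) / sigma
new_scaled = (new_col - mu) / sigma

print(train_scaled)
print(new_scaled)
```

Refitting on the new data instead would map 20 and 50 to -1 and 1, hiding the fact that 50 lies outside the training range.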

MODEL IMPLEMENTATION

LINEAR REGRESSION

In [None]:
from sklearn.linear_model import LinearRegression
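## the features are already standardized in the scaling step; the `normalize` argument was removed in scikit-learn 1.2
model1 = LinearRegression()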
model1.fit(X_train,y_train)

In [None]:
y_pred_train_model1 = model1.predict(X_train)
from sklearn.metrics import r2_score
R2 = r2_score(y_train,y_pred_train_model1)
print("r2 score is :",R2)

In [None]:
y_pred_test_model1 = model1.predict(X_test)
from sklearn.metrics import r2_score
R2 = r2_score(y_test,y_pred_test_model1)
print("r2 score is :",R2)

In [None]:
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_train,y_pred_train_model1))
print('MSE:', metrics.mean_squared_error(y_train, y_pred_train_model1))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train_model1)))

In [None]:
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test,y_pred_test_model1))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_test_model1))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test_model1)))
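The same four metrics are printed for every model and split, so a small helper could cut the repetition. A sketch using NumPy only; the function name `regression_report` is an assumption for illustration, not part of scikit-learn.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return R^2, MAE, MSE and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": 1.0 - np.sum(residuals ** 2) / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "mse": mse,
        "rmse": np.sqrt(mse),
    }

report = regression_report([100, 200, 300, 400], [110, 190, 310, 390])
print(report)
```

In the notebook, a call such as `regression_report(y_test, y_pred_test_model1)` would replace one R² cell and one MAE/MSE/RMSE cell.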

XGBOOST

In [None]:
from xgboost import XGBRegressor
## only the non-default settings are passed; everything else is left at the xgboost defaults
model2= XGBRegressor(base_score=1, learning_rate=0.15, max_depth=5,
             min_child_weight=2, n_estimators=80, n_jobs=4, random_state=0,
             tree_method='exact')

In [None]:
model2.fit(X_train,y_train)

In [None]:
y_pred_train_model2 = model2.predict(X_train)
from sklearn.metrics import r2_score
R2 = r2_score(y_train,y_pred_train_model2)
print("r2 score is :",R2)

In [None]:
print('MAE:', metrics.mean_absolute_error(y_train,y_pred_train_model2))
print('MSE:', metrics.mean_squared_error(y_train, y_pred_train_model2))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train_model2)))

In [None]:
y_pred_test_model2 = model2.predict(X_test)
from sklearn.metrics import r2_score
R2 = r2_score(y_test,y_pred_test_model2)
print("r2 score is :",R2)

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test,y_pred_test_model2))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_test_model2))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test_model2)))

GradientBoostingRegressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
model3= GradientBoostingRegressor()

In [None]:
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt

In [None]:
## implementing randomised search cv to get the best parameters

In [None]:
params = {'learning_rate': sp_randFloat(),
          'subsample': sp_randFloat(),
          'n_estimators': sp_randInt(100, 1000),
          'max_depth': sp_randInt(4, 10)}

In [None]:
from sklearn.model_selection import RandomizedSearchCV
randm_search = RandomizedSearchCV(estimator=model3, param_distributions = params,
                               cv = 2, n_iter = 10, n_jobs=-1)
randm_search.fit(X_train, y_train)

In [None]:
print("Best estimators",randm_search.best_estimator_)
print("Best score",randm_search.best_score_)
print("Best params",randm_search.best_params_)
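`uniform()` with no arguments samples from [0, 1], so the search can propose learning rates close to 1 or subsample fractions close to 0. `scipy.stats.uniform(loc, scale)` samples from [loc, loc + scale], which allows a narrower range. The bounds below are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
from scipy.stats import randint, uniform

# Hypothetical narrower search space for a gradient boosting model.
narrow_params = {
    "learning_rate": uniform(loc=0.01, scale=0.29),  # 0.01 to 0.30
    "subsample": uniform(loc=0.5, scale=0.5),        # 0.5 to 1.0
    "n_estimators": randint(100, 1000),              # integers 100..999
    "max_depth": randint(4, 10),                     # integers 4..9
}

# Draw a few candidates to see what a randomized search would try.
rng = np.random.default_rng(0)
samples = {name: dist.rvs(size=5, random_state=rng)
           for name, dist in narrow_params.items()}
for name, values in samples.items():
    print(name, values)
```

A dictionary like this can be passed as `param_distributions` in the same way as `params`.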

In [None]:
model3=  GradientBoostingRegressor(learning_rate=0.0154291815347819, max_depth=9,
                          n_estimators=165, subsample=0.11550214721325958)

In [None]:
model3.fit(X_train,y_train)

In [None]:
y_pred_train_model3 = model3.predict(X_train)
from sklearn.metrics import r2_score
R2 = r2_score(y_train,y_pred_train_model3)
print("r2 score is :",R2)

In [None]:
y_pred_test_model3 = model3.predict(X_test)
from sklearn.metrics import r2_score
R2 = r2_score(y_test,y_pred_test_model3)
print("r2 score is :",R2)

In [None]:
print('MAE:', metrics.mean_absolute_error(y_train,y_pred_train_model3))
print('MSE:', metrics.mean_squared_error(y_train, y_pred_train_model3))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train_model3)))

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test,y_pred_test_model3))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_test_model3))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test_model3)))

FINAL RESULTS FOR TRAINING and TESTING DATA

LINEAR REGRESSION:

training:
* r2 score is : 0.5584145136909324
* MAE: 848.1920611339497
* MSE: 1306232.051944958
* RMSE: 1142.9050931485772
 
testing:
* r2 score is : 0.5809991170997183
* MAE: 791.1141359649645
* MSE: 1138831.8576089442
* RMSE: 1067.1606522023496

XGBOOST:
training: 
* r2 score is : 0.7037219067802329
* MAE: 666.577049444037
* MSE: 876405.4835396637
* RMSE: 936.1653078060859

testing:
* r2 score is : 0.5887121486767045
* MAE: 732.0325818719224
* MSE: 1117868.068659826
* RMSE: 1057.2928017629865

GRADIENT BOOSTING REGRESSOR:
training:
* r2 score is : 0.6737769589742313
* MAE: 709.2194716729872
* MSE: 964984.1434611119
* RMSE: 982.3360644204772

testing:
* r2 score is : 0.6013689251080387
* MAE: 744.5459877561524
* MSE: 1083467.3291795994
* RMSE: 1040.8973672651878

In [None]:
## gradient boosting regressor (GBR) and xgboost give almost the same test RMSE (about 1041 vs 1057); GBR has a slightly higher test MAE (about 745 vs 732)
## comparing r2 values, XGBOOST overfits more than GBR: its r2 drops from 0.70 in training to 0.59 in testing, while GBR drops from 0.67 to 0.60
## across all three models the test r2 is only about 0.58 to 0.60, which means the features explain only about 58-60% of the variance in sales
## therefore the best model for predicting BIGMART SALES would be the GRADIENT BOOSTING REGRESSOR
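The overfitting comparison can be made explicit by taking the gap between the training and testing R² from the results listed above (values rounded to four decimals).

```python
# R^2 scores copied from the results section above, rounded to four decimals.
r2_scores = {
    "Linear Regression": {"train": 0.5584, "test": 0.5810},
    "XGBoost": {"train": 0.7037, "test": 0.5887},
    "Gradient Boosting": {"train": 0.6738, "test": 0.6014},
}

# A large positive gap means the model fits the training data much better than unseen data.
gaps = {name: round(s["train"] - s["test"], 4) for name, s in r2_scores.items()}
best_test = max(r2_scores, key=lambda name: r2_scores[name]["test"])

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:+.4f}")
print("highest test R^2:", best_test)
```

XGBoost shows the largest gap and Gradient Boosting the highest test score, which is the basis for the choice made above.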

In [None]:
import pickle
## the with-block closes the file, so the pickle is fully written before it is read back
with open("bigmartsales.pickle","wb") as output:
    pickle.dump(model3,output)

In [None]:
with open("bigmartsales.pickle","rb") as sales_pred:
    emp=pickle.load(sales_pred)

In [None]:
## predict with the model loaded back from the pickle file
pred=emp.predict(new_data)

In [None]:
pred

In [None]:
new_data['Outlet_Sales']=pred

In [None]:
new_data.head()

In [None]:
## we have predicted sales with the GRADIENT BOOSTING REGRESSOR by saving it with pickle, loading it back, and predicting on the new data
## the predictions are stored in the 'Outlet_Sales' column of new_data