<a href="https://colab.research.google.com/github/rajan-dhinoja/machine_learning_projects/blob/main/Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Step-A: Data Preprocessing:-***

## Step-1: Import Required Dependencies:-
Import essential libraries and modules for data manipulation, visualization, and preprocessing:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Step-2: Load the Dataset and Display Its Properties:-

In [3]:
dataset = pd.read_csv('train_set.csv')  # read_csv already returns a DataFrame
dataset.head()

Unnamed: 0,ProductID,Weight,FatContent,ProductVisibility,ProductType,MRP,OutletID,EstablishmentYear,OutletSize,LocationType,OutletType,OutletSales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [4]:
print("> Shape of the Dataset: \n", dataset.shape)
print("\n")
print("> Information about Dataset: \n")
print(dataset.info())
print("\n")
print("> Statistical summary of the Dataset: \n")
print(dataset.describe().round(4))  # DataFrame.applymap is deprecated in recent pandas; round() does the same job

> Shape of the Dataset: 
 (8523, 12)


> Information about Dataset: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ProductID          8523 non-null   object 
 1   Weight             7060 non-null   float64
 2   FatContent         8523 non-null   object 
 3   ProductVisibility  8523 non-null   float64
 4   ProductType        8523 non-null   object 
 5   MRP                8523 non-null   float64
 6   OutletID           8523 non-null   object 
 7   EstablishmentYear  8523 non-null   int64  
 8   OutletSize         6113 non-null   object 
 9   LocationType       8523 non-null   object 
 10  OutletType         8523 non-null   object 
 11  OutletSales        8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB
None


> Statistical summary of the Dataset: 

          Weight  ProductVisibility     

## Step-3: Separating the Dataset into the matrix of features (X) and the dependent variable/output (y):-

In [5]:
X = dataset.iloc[:, :-1].values  # :-1 selects every column except the last one
y = dataset.iloc[:, -1].values   # -1 selects only the last column (OutletSales)
print("Matrix of Features: \n", X)
print("\n")
print("Dependent Variable/Output: \n", y)

Matrix of Features: 
 [['FDA15' 9.3 'Low Fat' ... 'Medium' 'Tier 1' 'Supermarket Type1']
 ['DRC01' 5.92 'Regular' ... 'Medium' 'Tier 3' 'Supermarket Type2']
 ['FDN15' 17.5 'Low Fat' ... 'Medium' 'Tier 1' 'Supermarket Type1']
 ...
 ['NCJ29' 10.6 'Low Fat' ... 'Small' 'Tier 2' 'Supermarket Type1']
 ['FDN46' 7.21 'Regular' ... 'Medium' 'Tier 3' 'Supermarket Type2']
 ['DRG01' 14.8 'Low Fat' ... 'Small' 'Tier 1' 'Supermarket Type1']]


Dependent Variable/Output: 
 [3735.138   443.4228 2097.27   ... 1193.1136 1845.5976  765.67  ]
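
The positional slicing used above can be checked on a small frame. The following is a minimal sketch, assuming a hypothetical three-row frame (`toy`) whose column names merely mirror the real dataset; it shows that `:-1` keeps every column but the last and `-1` keeps only the last.

```python
import pandas as pd

# Hypothetical three-row frame; values are copied from the head() output above
toy = pd.DataFrame({
    "Weight": [9.3, 5.92, 17.5],
    "MRP": [249.8092, 48.2692, 141.618],
    "OutletSales": [3735.138, 443.4228, 2097.27],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last
y_toy = toy.iloc[:, -1].values   # only the last column

print(X_toy.shape)  # (3, 2)
print(y_toy.shape)  # (3,)
```

Because the split is positional, it relies on the target being the last column of the file.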


## Step-4: Checking the Dataset:-

### Step-4.1: Checking for Duplicate Data and Handling It:-

In [6]:
if dataset.duplicated().any():
    dataset.drop_duplicates(inplace=True)
else:
    print("No Duplicate Data(or Identical Rows) found...")

No Duplicate Data(or Identical Rows) found...
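
Since the real dataset has no duplicates, the drop branch above never runs. As an illustration only, this sketch uses a hypothetical frame (`toy`) with one repeated row to show what `duplicated()` and `drop_duplicates()` do.

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
toy = pd.DataFrame({
    "ProductID": ["FDA15", "DRC01", "FDA15"],
    "MRP": [249.8, 48.3, 249.8],
})

# duplicated() marks every repeat of an earlier row
print(toy.duplicated().tolist())  # [False, False, True]

# drop_duplicates() keeps the first occurrence; the index is reset afterwards
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2
```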


### Step-4.2: Checking for Missing Data and Handling It:-

In [7]:
# SimpleImputer, from scikit-learn's impute module, fills in missing values.
from sklearn.impute import SimpleImputer

if dataset.isnull().values.any():
    # Get categorical and numerical columns
    categorical_cols = dataset.select_dtypes(include=['object']).columns
    numerical_cols = dataset.select_dtypes(exclude=['object']).columns

    # Replace "Unknown" with NaN in categorical columns
    for col in categorical_cols:
        dataset[col] = dataset[col].replace('Unknown', np.nan)

    # Count NaN values plus placeholder strings that also indicate missing data
    categorical_missing_counts = dataset[categorical_cols].isnull().sum() + dataset[categorical_cols].isin(['', 'N/A', 'Unknown', 'NaN']).sum()
    numerical_missing_counts = dataset[numerical_cols].isnull().sum()

    # Check if there are any missing values (categorical or numerical)
    if categorical_missing_counts.any() or numerical_missing_counts.any():
        print("Missing Data Counts in Categorical Columns: \n", categorical_missing_counts)
        print("\n")
        print("Missing Data Counts in Numerical Columns: \n", numerical_missing_counts)

        # Create imputers: most frequent value for categorical, mean for numerical
        categorical_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        numerical_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

        # The dependent variable (last column) is excluded from imputation
        target_col = dataset.columns[-1]
        categorical_cols_for_imputation = [col for col in categorical_cols if col != target_col]
        numerical_cols_for_imputation = [col for col in numerical_cols if col != target_col]

        # Apply the imputers to the selected columns of the dataset
        if len(categorical_cols_for_imputation) > 0:
            dataset[categorical_cols_for_imputation] = categorical_imputer.fit_transform(dataset[categorical_cols_for_imputation])
        if len(numerical_cols_for_imputation) > 0:
            dataset[numerical_cols_for_imputation] = numerical_imputer.fit_transform(dataset[numerical_cols_for_imputation])

        print("\n")
        print("New Data with replaced missing values: \n", dataset)
else:
    print("No missing data found.")


Missing Data Counts in Categorical Columns: 
 ProductID          0
FatContent         0
ProductType        0
OutletID           0
OutletSize      2410
LocationType       0
OutletType         0
dtype: int64


Missing Data Counts in Numerical Columns: 
 Weight               1463
ProductVisibility       0
MRP                     0
EstablishmentYear       0
OutletSales             0
dtype: int64


New Data with replaced missing values: 
      ProductID  Weight FatContent  ProductVisibility            ProductType  \
0        FDA15   9.300    Low Fat           0.016047                  Dairy   
1        DRC01   5.920    Regular           0.019278            Soft Drinks   
2        FDN15  17.500    Low Fat           0.016760                   Meat   
3        FDX07  19.200    Regular           0.000000  Fruits and Vegetables   
4        NCD19   8.930    Low Fat           0.000000              Household   
...        ...     ...        ...                ...                    ...   
8518     

In [9]:
dataset.isnull().values.any()

False
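
The two imputation strategies can be seen in isolation on a small frame. This is a minimal sketch, assuming a hypothetical three-row frame (`toy`) with one missing weight and one missing outlet size. It also checks a side effect of the notebook's ordering: an array extracted before imputation, as X was in Step-3, still holds the original NaN entries and would need to be extracted again.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one missing value per feature column
toy = pd.DataFrame({
    "Weight": [9.3, np.nan, 17.5],
    "OutletSize": ["Medium", "Medium", np.nan],
    "OutletSales": [3735.138, 443.4228, 2097.27],
})

X_before = toy.iloc[:, :-1].values  # extracted before imputation

num_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
cat_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Mean of the observed weights replaces the missing weight
toy[["Weight"]] = num_imputer.fit_transform(toy[["Weight"]])
# The most common outlet size replaces the missing size
toy[["OutletSize"]] = cat_imputer.fit_transform(toy[["OutletSize"]])

X_after = toy.iloc[:, :-1].values   # extracted again after imputation

print(toy)
print("NaN left in X_before:", pd.isnull(X_before).any())
print("NaN left in X_after: ", pd.isnull(X_after).any())
```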

### Step-4.3: Checking for Categorical Data and Encoding It:-

In [10]:
categorical_columns_found = False  # Flag to track if any categorical columns are found

for column in dataset.columns:
    if dataset[column].dtype == 'object':
        categorical_columns_found = True  # Set the flag to True
        print(f"String Values present in Column '{column}'.")

        # Check for repeating values within the categorical column
        value_counts = dataset[column].value_counts()
        repeating_values = value_counts[value_counts > 1].index.tolist()

        if repeating_values:
            print(f"> Also, Categorical values found in column '{column}': {repeating_values}.")
        else:
            print(f"> But No Categorical values found in column '{column}'.")

if not categorical_columns_found:  # Check the flag after processing all columns
    print("No categorical values found in the whole dataset.")


String Values present in Column 'ProductID'.
> Also, Categorical values found in column 'ProductID': ['FDW13', 'FDG33', 'NCY18', 'FDD38', 'DRE49', 'FDV60', 'NCQ06', 'FDF52', 'FDX04', 'NCJ30', 'FDV38', 'NCF42', 'FDT07', 'FDW26', 'NCL31', 'FDU12', 'FDG09', 'FDQ40', 'FDX20', 'NCI54', 'FDX31', 'FDP25', 'FDW49', 'FDF56', 'FDO19', 'DRN47', 'NCB18', 'FDE11', 'NCX05', 'FDQ39', 'FDT55', 'FDO32', 'FDT40', 'FDZ20', 'FDH27', 'FDY49', 'FDS33', 'FDR04', 'FDR43', 'FDR59', 'FDJ55', 'FDT24', 'FDY55', 'FDV09', 'FDU23', 'FDY47', 'DRD25', 'FDK58', 'FDL58', 'FDX58', 'FDR44', 'FDP28', 'FDA39', 'FDH28', 'DRF27', 'FDX21', 'FDY56', 'FDF05', 'FDL20', 'FDY03', 'NCK05', 'FDS55', 'DRA59', 'FDG24', 'NCE54', 'FDZ21', 'FDA04', 'FDW24', 'FDT49', 'DRF23', 'FDD05', 'FDH10', 'FDX50', 'FDT32', 'FDK20', 'FDU13', 'FDN56', 'FDZ26', 'FDL34', 'DRF01', 'FDG38', 'NCE31', 'NCL53', 'NCB31', 'NCQ05', 'FDT21', 'NCV06', 'DRJ24', 'FDI41', 'FDO52', 'FDR23', 'FDG57', 'FDZ33', 'DRF03', 'FDA13', 'FDF22', 'DRP35', 'NCV41', 'FDF16', 'FDB17'
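
The check above flags any string column with a repeated value, which includes identifier-like columns such as `ProductID`. One possible refinement, sketched here on a hypothetical frame (`toy`) with an arbitrary, assumed cutoff (`max_levels`), is to count distinct values per column and treat only low-cardinality columns as candidates for encoding.

```python
import pandas as pd

# Hypothetical frame: an identifier-like column and a true category
toy = pd.DataFrame({
    "ProductID": ["FDA15", "DRC01", "FDN15", "FDX07", "FDA15", "NCD19"],
    "FatContent": ["Low Fat", "Regular", "Low Fat", "Regular", "Low Fat", "Low Fat"],
})

# Number of distinct values in each column
cardinality = toy.nunique()
print(cardinality)

# Assumed cutoff for illustration only; a sensible value depends on the data
max_levels = 3
low_cardinality_cols = [c for c in toy.columns if cardinality[c] <= max_levels]
print(low_cardinality_cols)  # ['FatContent']
```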

#### Step-4.3.1: Encoding Independent variables / Nominal Features (no inherent order) / Multiple Columns

In [16]:
# # `sklearn.compose` is a module in scikit-learn (a popular Python machine learning library) used for combining different data transformations.
# # `sklearn.preprocessing` contains tools for preparing data for use in machine learning models, including the `OneHotEncoder`.
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder

# ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [])], remainder='passthrough')
# X = np.array(ct.fit_transform(X))

# print("Encoded data of Matrix of Features/Independent variable (X): \n",X)

ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
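
The commented-out cell passes an empty column list (`[]`) to the encoder, so no column would be encoded. The `ValueError` shown is most likely what scikit-learn raises when the stacked output would be sparse but the passthrough columns still contain strings. The sketch below is one way around both issues, under stated assumptions: a hypothetical three-row array, with column indices `[0, 1]` chosen for this toy data only, and `sparse_threshold=0` to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: [FatContent, OutletType, MRP]
X_toy = np.array([
    ["Low Fat", "Supermarket Type1", 249.8092],
    ["Regular", "Supermarket Type2", 48.2692],
    ["Low Fat", "Grocery Store", 182.095],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0, 1])],  # string columns to encode
    remainder="passthrough",                              # MRP is kept unchanged
    sparse_threshold=0,                                   # always return a dense array
)
X_encoded = np.array(ct.fit_transform(X_toy))

# 2 FatContent levels + 3 OutletType levels + 1 passthrough column
print(X_encoded.shape)  # (3, 6)
print(X_encoded)
```

For the real dataset, the index list would have to name every string column that remains in X, and a high-cardinality column such as `ProductID` would produce one dummy column per distinct ID.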

#### Step-4.3.2: Encoding the Dependent Variable / Ordinal Feature / Single Column

In [None]:
# # preprocessing is a module within scikit-learn that provides tools for preparing your data for machine learning algorithms.
# # LabelEncoder is a specific class designed to convert categorical labels (like 'France', 'Spain', 'Germany') into numerical labels (like 0, 1, 2).
# from sklearn.preprocessing import LabelEncoder

# le = LabelEncoder()
# y = le.fit_transform(y)

# print("Encoded data of Dependent Variable (y): \n",y)

Encoded data of Dependent Variable (y): 
 [0 1 0 0 1 1 0 1 0 1 1]
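
`LabelEncoder` is meant for class labels. Here the target, `OutletSales`, is already numeric, which is presumably why the cell is commented out. For reference, a minimal sketch on hypothetical labels shows what the encoder does.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels, not taken from the sales dataset
labels = ["No", "Yes", "No", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(encoded)            # [0 1 0 0 1]
print(list(le.classes_))  # ['No', 'Yes']

# inverse_transform maps the integers back to the original labels
restored = le.inverse_transform(encoded)
```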


## Step-5: Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
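
As a quick check of what the split returns, this sketch uses hypothetical arrays of ten samples. With `test_size=0.2`, two rows go to the test set, and each feature row stays paired with its target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i of X_toy is [2*i, 2*i + 1] and its target is i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Rows are shuffled, but features and targets are shuffled together
print((X_tr[:, 0] == 2 * y_tr).all())  # True
```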

In [None]:
print("Printing Feature Sets (X):")
print("> X_train: \n", X_train)
print("> X_test: \n", X_test)
print("\n")
print("Printing Target Sets (y):")
print("> y_train: \n", y_train)
print("> y_test: \n", y_test)

Printing Feature Sets (X):
> X_train: 
 [[0.0 0.0 0.0 1.0 37.0 58500.0]
 [0.0 0.0 1.0 0.0 27.0 48000.0]
 [0.0 0.0 1.0 0.0 nan 52000.0]
 [1.0 0.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 0.0 48.0 79000.0]
 [1.0 0.0 0.0 0.0 37.0 67000.0]
 [0.0 1.0 0.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 0.0 35.0 58000.0]]
> X_test: 
 [[0.0 1.0 0.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 0.0 38.0 61000.0]
 [0.0 1.0 0.0 0.0 40.0 nan]]


Printing Target Sets (y):
> y_train: 
 [1 1 0 0 1 1 0 1]
> y_test: 
 [0 0 1]


## Step-6: Feature Scaling

In [None]:
# Feature scaling is applied only to columns of the matrix of features; the dependent variable (y) is left unscaled.
# StandardScaler, from scikit-learn, standardizes each column to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train[:, 3: ] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3: ] = sc.transform(X_test[:, 3:])

In [None]:
print("Printing Training Sets after Feature Scaling:")
print("> X_train: \n", X_train)
print("\n")
print("Printing Test Sets after Feature Scaling:")
print("> X_test: \n", X_test)

Printing Training Sets after Feature Scaling:
> X_train: 
 [[0.0 0.0 0.0 2.6457513110645903 -0.3629763419529134 -0.5225966449595044]
 [0.0 0.0 1.0 -0.3779644730092272 -1.700257601779436 -1.4094273151938148]
 [0.0 0.0 1.0 -0.3779644730092272 nan -1.0715870598664585]
 [1.0 0.0 0.0 -0.3779644730092272 0.5731205399256524 0.6176142167703234]
 [1.0 0.0 0.0 -0.3779644730092272 1.1080330438562613 1.208834663593197]
 [1.0 0.0 0.0 -0.3779644730092272 -0.3629763419529134 0.1953138976111279]
 [0.0 1.0 0.0 -0.3779644730092272 1.3754892958215659 1.5466749189205533]
 [1.0 0.0 0.0 -0.3779644730092272 -0.6304325939182179 -0.564826676875424]]


Printing Test Sets after Feature Scaling:
> X_test: 
 [[0.0 1.0 0.0 -0.3779644730092272 -1.2990732238314793 -0.9026669322027803]
 [0.0 0.0 1.0 -0.3779644730092272 -0.22924821597026115
  -0.31144648537990666]
 [0.0 1.0 0.0 -0.3779644730092272 0.03820803599504337 nan]]
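
The fit-on-train, transform-on-test pattern used above can be verified on a small example. This is a minimal sketch with hypothetical arrays: the scaler learns each column's mean and standard deviation from the training rows only, then reuses those statistics for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows with two numeric features
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te = np.array([[2.0, 400.0]])

sc = StandardScaler()
X_tr_scaled = sc.fit_transform(X_tr)  # learn mean and std from the training rows
X_te_scaled = sc.transform(X_te)      # reuse the training statistics

print(sc.mean_)                   # per-column training means
print(X_tr_scaled.mean(axis=0))   # approximately zero for each column
print(X_te_scaled)                # test row expressed in training-set units
```

Calling `fit_transform` on the test set instead would leak test-set statistics into the preprocessing, which is why only `transform` is used there.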
