
# ***Step-A: Data Preprocessing:-***

## Step-1: Importing the necessary Libraries, Modules and Classes...

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Step-2: Importing the Dataset & Printing it...

In [None]:
dataset = pd.read_csv('Data.csv')  # read_csv already returns a DataFrame, so no extra wrapping is needed.
dataset.head()
# dataset

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


## Step-3: Creating the matrix of features (X) and the dependent variable/output (y) & Printing Both

In [None]:
X = dataset.iloc[:, :-1].values  # ":" selects all rows; ":-1" selects every column except the last one (index -1 is excluded).
y = dataset.iloc[:, -1].values  # "-1" selects only the last column.
print("Matrix of Features: \n", X)
print("\n")
print("Dependent Variable/Output: \n", y)

Matrix of Features: 
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


Dependent Variable/Output: 
 ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
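As a small illustration of the `iloc` slicing used above, the sketch below applies the same two slices to a toy DataFrame. The frame `toy` and the names `features` and `target` are hypothetical; the values are copied from the first two rows of the dataset.

```python
import pandas as pd

# Toy frame with the same column layout as the dataset (first two rows only).
toy = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

features = toy.iloc[:, :-1]  # all rows, every column except the last
target = toy.iloc[:, -1]     # all rows, last column only

print(list(features.columns))                      # ['Country', 'Age', 'Salary']
print(target.name)                                 # Purchased
print(features.values.shape, target.values.shape)  # (2, 3) (2,)
```

Calling `.values` on each slice, as the notebook does, turns the selection into a NumPy array.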


## Step-4: Checking for any Missing Data and their counts & Printing them...

In [None]:
missing_counts = dataset.isnull().sum()
if missing_counts.any():
    print("Missing Data Counts: \n", missing_counts)
    print("\n")
    missing_values = dataset.isnull()
    print("Missing Data Values: \n", missing_values)
else:
    print("No missing data found.")

Missing Data Counts: 
 Country      0
Age          1
Salary       1
Purchased    0
dtype: int64


Missing Data Values: 
    Country    Age  Salary  Purchased
0    False  False   False      False
1    False  False   False      False
2    False  False   False      False
3    False  False   False      False
4    False  False    True      False
5    False  False   False      False
6    False   True   False      False
7    False  False   False      False
8    False  False   False      False
9    False  False   False      False
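The boolean table above can be hard to scan on larger datasets. One possible alternative, sketched here on a hypothetical toy frame, is to keep only the rows that contain at least one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column.
toy = pd.DataFrame({
    "Age": [44.0, 40.0, np.nan],
    "Salary": [72000.0, np.nan, 52000.0],
})

# isnull() marks each missing cell; any(axis=1) flags rows with at least one missing cell.
rows_with_missing = toy[toy.isnull().any(axis=1)]
missing_index = rows_with_missing.index.tolist()

print(rows_with_missing)
print(missing_index)  # [1, 2]
```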


### Step-4.1: Taking care of missing data: Replacing each Missing value with the Mean of its column...

In [None]:
# SimpleImputer, from scikit-learn's impute module, handles the missing values.
from sklearn.impute import SimpleImputer

# Create an object (an instance) of the SimpleImputer class called imputer.
  # missing_values=np.nan: treat entries equal to np.nan (Not a Number) as missing.
  # strategy='mean': fill each missing entry with the mean of the non-missing values in the same column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# X[:, 1:3] selects all the rows (:) and the columns with index 1 and 2 (the Age and Salary columns).
# fit(): computes the mean of the non-missing values in each selected column
  # and stores it inside the imputer object for the next step.
imputer.fit(X[:, 1:3])

# transform(): replaces every missing value (np.nan) in the selected columns with the mean that fit() computed for that column.
  # It returns a new array instead of modifying its input, so we assign the result back to X[:, 1:3].
X[:, 1:3] = imputer.transform(X[:, 1:3])


In [None]:
print("Data after replacing the missing values with the column means: \n", X)

Data after replacing the missing values with the column means: 
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
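To check where the fill values 38.78 and 63777.78 come from, the sketch below refits an imputer on the Age and Salary values printed in Step-3 and compares the learned means with a manual calculation. The names `age_salary`, `learned_means` and `manual_means` are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns as printed in Step-3 (np.nan marks the two gaps).
age_salary = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(age_salary)

# statistics_ holds the per-column fill values learned by fit().
learned_means = imputer.statistics_

# Manual check: mean of each column, ignoring the missing entries.
manual_means = np.nanmean(age_salary, axis=0)

print(learned_means)
print(manual_means)
```

Both arrays should hold 349/9 for Age and 574000/9 for Salary, which are the values that appear in the imputed rows above.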


## Step-5: Checking any Categorical Data and Encoding them:-

In [None]:
# We iterate through each column name (dataset.columns)
# For each column, we select it using dataset[column]
# Then, we apply .unique() to get the unique values and print them.
# dataset[column].dtype retrieves the data type of the current column.

categorical_columns_found = False  # Flag to track if any categorical columns are found

for column in dataset.columns:
    if dataset[column].dtype == 'object':
        categorical_columns_found = True  # Set the flag to True
        print(f"String Values present in Column '{column}'.")

        # Check for repeating values within the categorical column
        value_counts = dataset[column].value_counts()
        repeating_values = value_counts[value_counts > 1].index.tolist()

        if repeating_values:
            print(f"> Also, Categorical values found in column '{column}': {repeating_values}.")
        else:
            print(f"> But No Categorical values found in column '{column}'.")

if not categorical_columns_found:  # Check the flag after processing all columns
    print("No categorical values found in the whole dataset.")


String Values present in Column 'Country'.
> Also, Categorical values found in column 'Country': ['France', 'Spain', 'Germany'].
String Values present in Column 'Purchased'.
> Also, Categorical values found in column 'Purchased': ['No', 'Yes'].


### Step-5.1: Encoding Independent variables / Nominal Features (no inherent order) / Multiple Columns

In [None]:
# `sklearn.compose` is a module in scikit-learn (a popular Python machine learning library) used for combining different data transformations.
# `sklearn.preprocessing` contains tools for preparing data for use in machine learning models, including the `OneHotEncoder`.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer applies different transformations to different columns of your dataset.
  # transformers: This argument takes a list of transformations. In this case, there is one transformation, given as a tuple of:
    # 'encoder': A name for this specific transformation step (you can choose any name),
    # OneHotEncoder(): The transformation to apply (One-Hot Encoding),
    # [0]: The index of the column to apply the OneHotEncoder to. Column 0 of X (Country) is the categorical column we want to encode.
  # remainder='passthrough': This keeps all the other columns (Age and Salary) unchanged and appends them after the encoded columns.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# ct.fit_transform(X): This line does two things:
  # fit: It analyzes the specified column ([0] - the first column) of your dataset (X) to learn the categories present in that column.
  # transform: It then applies One-Hot Encoding, creating new columns representing each category and filling them with 0s and 1s (1 indicating the presence of that category for a particular row).
X = np.array(ct.fit_transform(X))
# X = pd.DataFrame(ct.fit_transform(X))

In [None]:
print("Encoded data of Matrix of Features/Independent variable (X): \n",X)

Encoded data of Matrix of Features/Independent variable (X): 
 [[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
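The order of the three new columns follows the sorted category names, which is why France maps to `1 0 0`, Germany to `0 1 0` and Spain to `0 0 1` in the output above. The sketch below shows one way to inspect that order on a small copy of the data. `X_demo` and `ct_demo` are hypothetical names, and `sparse_threshold=0` is added here only to force a dense result; it is not part of the notebook's code.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of X after imputation (one string column, two numeric columns).
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)
encoded = ct_demo.fit_transform(X_demo)

# categories_ lists the categories learned for each encoded column, in output order.
categories = ct_demo.named_transformers_["encoder"].categories_[0].tolist()

print(categories)   # ['France', 'Germany', 'Spain']
print(encoded[0])   # one-hot columns for 'France', followed by Age and Salary
```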


### Step-5.2: Encoding the Dependent Variable / Class Labels / Single Column

In [None]:
# preprocessing is a module within scikit-learn that provides tools for preparing your data for machine learning algorithms.
# LabelEncoder converts categorical target labels into integer labels; here 'No' and 'Yes' become 0 and 1.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
print("Encoded data of Dependent Variable (y): \n",y)

Encoded data of Dependent Variable (y): 
 [0 1 0 0 1 1 0 1 0 1]
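To see which integer stands for which label, and to map predictions back to the original strings later, the fitted encoder can be queried as in the sketch below. The names `labels` and `le_demo` are illustrative; the values are the first five entries of `y`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])

le_demo = LabelEncoder()
encoded_labels = le_demo.fit_transform(labels)

# classes_ is sorted; the position of each class is its integer code.
classes = le_demo.classes_.tolist()

# inverse_transform maps the integer codes back to the original labels.
decoded = le_demo.inverse_transform(encoded_labels).tolist()

print(classes)                  # ['No', 'Yes']
print(encoded_labels.tolist())  # [0, 1, 0, 0, 1]
print(decoded)                  # ['No', 'Yes', 'No', 'No', 'Yes']
```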


## Step-6: Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
print("Printing Feature Sets (X):")
print("> X_train: \n", X_train)
print("> X_test: \n", X_test)
print("\n")
print("Printing Target Sets (y):")
print("> y_train: \n", y_train)
print("> y_test: \n", y_test)

Printing Feature Sets (X):
> X_train: 
 [[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
> X_test: 
 [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


Printing Target Sets (y):
> y_train: 
 [0 1 0 0 1 1 0 1]
> y_test: 
 [0 1]
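With `test_size=0.2` and 10 rows, the split keeps 8 rows for training and 2 for testing, and a fixed `random_state` makes the shuffle repeatable. The sketch below illustrates both points on placeholder data; `X_demo` and `y_demo` are hypothetical stand-ins for the real arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)               # 10 rows of placeholder features
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])   # encoded labels from Step-5.2

# Two calls with the same random_state produce the same split.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

print(X_tr.shape, X_te.shape)        # (8, 2) (2, 2)
print(np.array_equal(X_te, X_te2))   # True
```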


## Step-7: Feature Scaling

In [None]:
# Feature scaling (standardization or normalization) must be applied only after splitting the dataset into the Training set and Test set.
  # Fitting the scaler on the whole dataset would let test-set statistics influence the training data (information leakage)
  # and would make the model's performance evaluation unreliable.

# StandardScaler is scikit-learn's class for standardization (zero mean, unit variance per feature).
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# The Dependent Variable (y) is not scaled, because it only holds the class labels 0 and 1.

# fit: sc calculates the mean and standard deviation of the features in the training data (X_train) from the 4th column onwards ([:, 3:]), i.e. Age and Salary.
  # The one-hot encoded columns (index 0 to 2) are left unscaled. These statistics are specific to the training data.
# transform: It then standardizes the training data by subtracting the mean and dividing by the standard deviation of each feature.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# transform (only): We only use the transform method on the test data.
  # It reuses the mean and standard deviation calculated from the training data to scale the test data.
  # This keeps the test data scaled consistently with the training data and prevents leakage.
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [None]:
print("Printing Training Sets after Feature Scaling:")
print("> X_train: \n", X_train)
print("\n")
print("Printing Test Sets after Feature Scaling:")
print("> X_test: \n", X_test)

Printing Training Sets after Feature Scaling:
> X_train: 
 [[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


Printing Test Sets after Feature Scaling:
> X_test: 
 [[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
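Standardization computes z = (x - mean) / std for each feature, where the mean and standard deviation come from the training data only. The sketch below checks this by hand on a small, hypothetical subset of the Age and Salary values; `train`, `test` and `sc_demo` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical subset of Age and Salary values, split by hand for illustration.
train = np.array([[38.0, 61000.0], [27.0, 48000.0], [48.0, 79000.0], [50.0, 83000.0]])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train)  # learn mean/std from train, then scale it
test_scaled = sc_demo.transform(test)        # reuse the training statistics

# Manual version: StandardScaler uses the population standard deviation (ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))  # True
print(train_scaled.mean(axis=0))              # approximately 0 for each feature
print(train_scaled.std(axis=0))               # approximately 1 for each feature
```

The test rows are not guaranteed to have zero mean or unit variance after scaling, since they are transformed with the training statistics.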
