# Convolutional Neural Network

Que1: What is CNN? How does it work behind the scenes?

Que2: What are Stride, Padding, Kernel Filters, and Pooling?

Que3: Why does Overfitting happen in CNN, and how can you avoid it?

Que4: Why is InceptionNet better than VGG?

Que5: What is Augmentation?

Que6: Can you explain the concept of feature maps in CNNs?

### Que 1

- CNN stands for Convolutional Neural Network, a type of artificial neural network designed mainly for image data.
- It is mainly used for facial recognition, image classification, and object detection.
- Input Layer: The first layer of the CNN, which receives the input image as a matrix of pixel values.
- Convolutional Layer (Conv Layer): Slides a series of filters over the image to detect different features. Each filter produces one feature map.
- Activation Function: Applies a non-linearity (for example, ReLU) to the feature maps produced by the convolution operation.
- Pooling Layer: Reduces the spatial dimensions of the feature maps while keeping their most important features.
- Flattening: Converts the final feature maps into a single 1D feature vector.
- Fully Connected Layer (Prediction): Predicts the final result from the 1D feature vector.
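
The layer sequence above can be sketched in plain Python on a toy image. This is only an illustration under stated assumptions: the 6x6 image, the hand-picked vertical-edge kernel, and the helper names (`conv2d`, `relu`, `max_pool`, `flatten`) are hypothetical, and a real CNN learns its kernel values during training instead of using fixed ones.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding sliding-window product (the 'convolution' used in CNN layers)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Activation: replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling (stride equals the window size)."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def flatten(fmap):
    """Turn a 2D feature map into a 1D feature vector."""
    return [v for row in fmap for v in row]


# Toy 6x6 input: dark (0) on the left, bright (1) in the last two columns.
image = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = relu(conv2d(image, kernel))  # 4x4 feature map
pooled = max_pool(feature_map)             # 2x2 after pooling
vector = flatten(pooled)                   # input to a fully connected layer

print(feature_map)
print(pooled)
print(vector)
```

The non-zero entries of the feature map sit where the edge is, which is the information the fully connected layer would then use for prediction.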

### Que 2

- Stride is the number of pixels by which the filter (kernel) moves across the input volume at each step.

- Padding adds layers of zeros around the perimeter of an input image to control the size of the resulting feature map.

- Kernels (filters) are small matrices of weights that slide over the input image to detect specific features.

- A pooling layer is a down-sampling operation that reduces the spatial dimensions (width x height) of the input volume for the next convolutional layer while retaining the important information.
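
How stride, padding, and kernel size interact can be seen from the standard output-size relation, floor((n + 2p - k) / s) + 1, applied along one spatial dimension. The sketch below assumes a square input and kernel. The function name `conv_output_size` is hypothetical, and the same relation also covers pooling windows.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output size along one dimension: floor((n + 2*padding - k) / stride) + 1."""
    if n + 2 * padding < k:
        raise ValueError("kernel is larger than the padded input")
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(32, 3))                       # 30: no padding shrinks the map
print(conv_output_size(32, 3, padding=1))            # 32: padding of 1 keeps the size
print(conv_output_size(32, 3, stride=2, padding=1))  # 16: stride 2 halves the size
print(conv_output_size(32, 2, stride=2))             # 16: 2x2 pooling with stride 2
```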

### Que 3

- A model overfits to its training data when it captures noise and details not generalizable to new, unseen data

#### Handling overfitting

- Data Augmentation: Creates new training data from the existing data by applying simple transformations such as rotation, translation, flipping, and scaling.

- Dropout: A regularization technique in which a randomly chosen fraction of the neurons in a layer is set to zero at each training step.

- Early Stopping: Monitor performance on a validation set and stop training as soon as the validation loss starts to degrade.

- L2 Regularization (Weight Decay): Adds a penalty to the loss function proportional to the sum of the squared weights.
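
As a rough illustration of early stopping, the sketch below scans a list of per-epoch validation losses and stops once the loss has failed to improve for `patience` consecutive epochs. The function name, the `patience` parameter, and the loss values are all hypothetical. Deep learning frameworks ship their own callbacks for this, which would normally be used instead.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has not improved
    for `patience` epochs in a row; the weights from best_epoch are kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Loss kept improving (or patience never ran out): train to the end.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improving for three epochs, then degrading.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```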

### Que 4

- VGG: A very simple, deep architecture that uses only small 3x3 convolution filters, but it is parameter-heavy and computationally expensive.

- InceptionNet: A lighter, more scalable architecture with far fewer parameters, higher accuracy, and the ability to process multi-scale features.

- InceptionNet (GoogLeNet) won the ILSVRC 2014 classification task, with VGG as the runner-up. Its architectural innovations, such as Inception modules (parallel convolutions of different sizes) and 1x1 convolutions for dimensionality reduction, provide substantial benefits over the simple but computationally expensive VGG models.
- These advantages make InceptionNet the more practical choice for many image classification tasks.
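
The saving from dimensionality reduction can be checked with simple arithmetic. The sketch below counts the weights of a 5x5 convolution applied directly, and again with a 1x1 convolution inserted first to shrink the channel count. The channel sizes (192 in, 16 after reduction, 32 out) are illustrative assumptions, biases are ignored, and `conv_weights` is a hypothetical helper.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights in a k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out


c_in, c_mid, c_out = 192, 16, 32  # illustrative channel counts

# 5x5 convolution applied directly to all 192 input channels.
direct = conv_weights(5, c_in, c_out)

# 1x1 convolution reduces 192 channels to 16, then the 5x5 convolution runs on those.
bottleneck = conv_weights(1, c_in, c_mid) + conv_weights(5, c_mid, c_out)

print(direct)      # 153600
print(bottleneck)  # 15872
print(round(direct / bottleneck, 1))  # about 9.7x fewer weights
```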

### Que 5
- Data augmentation is the process of increasing the amount and diversity of training data by applying transformations to existing samples.
- It is a frequently used strategy that can substantially improve the accuracy and stability of deep learning models applied to image datasets.
- It helps models generalize better by giving them more diverse examples to train on.
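
A minimal sketch of the idea, using a tiny image stored as nested lists: each transformation yields a new training sample carrying the same label as the original. The helper names are hypothetical, and image libraries provide richer versions (random crops, colour jitter, and so on) that would normally be used in practice.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(img)]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


image = [[1, 2, 3],
         [4, 5, 6]]

# One original sample becomes four training samples with the same label.
augmented = [image, hflip(image), vflip(image), rot90(image)]
for sample in augmented:
    print(sample)
```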

### Que 6

- A feature map is the output of a convolutional layer after the convolution operation and the activation function have been applied.
- Each feature map records where in the input, and how strongly, the pattern detected by its filter (kernel) occurs as the filter slides across the image.
- Stacked layers of feature maps let the network learn the spatial hierarchies of features that image classification needs, from edges and textures in early layers to parts of objects in deeper layers.

# Machine Learning Techniques

### Problem statement and Objective

#### Black Friday Project

A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from last month. They now want to build a model that predicts the purchase amount of a customer against various products, which will help them create personalized offers for customers against different products.






### Data Variable Definition
- User_ID: User ID
- Product_ID: Product ID
- Gender: Sex of user
- Age: Age in bins
- Occupation: Occupation (masked)
- City_Category: Category of the city (A, B, C)
- Stay_In_Current_City_Years: Number of years of stay in the current city
- Marital_Status: Marital status
- Product_Category_1: Product category (masked)
- Product_Category_2: Product may also belong to another category (masked)
- Product_Category_3: Product may also belong to another category (masked)
- Purchase: Purchase amount (target variable)



### Goal

Our goal is to predict the purchase amount of customers for various products after completing all the necessary preprocessing steps. Hyperparameter tuning and cross-validation are also essential, and feature selection techniques such as SelectKBest, VIF, and PCA need to be applied.

### Dataset Link


https://raw.githubusercontent.com/s4sauravv/Datasets/main/Black%20Friday.csv


Build the model with multiple algorithms, then tune the hyperparameters of whichever algorithm performs best. After tuning, also plot its best-fit line.

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv('https://raw.githubusercontent.com/s4sauravv/Datasets/main/Black%20Friday.csv')
data.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [2]:
data.shape

(550068, 12)

In [3]:
# Check for missing values
print(data.isnull().sum())

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64


In [4]:
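# Fill missing values using the mode (most frequent value) of each column.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may not update the DataFrame.
data['Product_Category_2'] = data['Product_Category_2'].fillna(data['Product_Category_2'].mode()[0])
data['Product_Category_3'] = data['Product_Category_3'].fillna(data['Product_Category_3'].mode()[0])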

# Verify that there are no missing values
print(data.isnull().sum())

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
Purchase                      0
dtype: int64


In [5]:
# Convert categorical variables using one-hot encoding
data = pd.get_dummies(data, columns=['Gender', 'Age', 'City_Category'], drop_first=True)
data.head()

Unnamed: 0,User_ID,Product_ID,Occupation,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,Gender_M,Age_18-25,Age_26-35,Age_36-45,Age_46-50,Age_51-55,Age_55+,City_Category_B,City_Category_C
0,1000001,P00069042,10,2,0,3,8.0,16.0,8370,False,False,False,False,False,False,False,False,False
1,1000001,P00248942,10,2,0,1,6.0,14.0,15200,False,False,False,False,False,False,False,False,False
2,1000001,P00087842,10,2,0,12,8.0,16.0,1422,False,False,False,False,False,False,False,False,False
3,1000001,P00085442,10,2,0,12,14.0,16.0,1057,False,False,False,False,False,False,False,False,False
4,1000002,P00285442,16,4+,0,8,8.0,16.0,7969,True,False,False,False,False,False,True,False,True


In [6]:
# Drop User_ID and Product_ID as they are identifiers and not useful for prediction
data = data.drop(['User_ID', 'Product_ID'], axis=1)
data.head()

Unnamed: 0,Occupation,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,Gender_M,Age_18-25,Age_26-35,Age_36-45,Age_46-50,Age_51-55,Age_55+,City_Category_B,City_Category_C
0,10,2,0,3,8.0,16.0,8370,False,False,False,False,False,False,False,False,False
1,10,2,0,1,6.0,14.0,15200,False,False,False,False,False,False,False,False,False
2,10,2,0,12,8.0,16.0,1422,False,False,False,False,False,False,False,False,False
3,10,2,0,12,14.0,16.0,1057,False,False,False,False,False,False,False,False,False
4,16,4+,0,8,8.0,16.0,7969,True,False,False,False,False,False,True,False,True


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 16 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Occupation                  550068 non-null  int64  
 1   Stay_In_Current_City_Years  550068 non-null  object 
 2   Marital_Status              550068 non-null  int64  
 3   Product_Category_1          550068 non-null  int64  
 4   Product_Category_2          550068 non-null  float64
 5   Product_Category_3          550068 non-null  float64
 6   Purchase                    550068 non-null  int64  
 7   Gender_M                    550068 non-null  bool   
 8   Age_18-25                   550068 non-null  bool   
 9   Age_26-35                   550068 non-null  bool   
 10  Age_36-45                   550068 non-null  bool   
 11  Age_46-50                   550068 non-null  bool   
 12  Age_51-55                   550068 non-null  bool   
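 13  Age_55+                     550068 non-null  bool   
 14  City_Category_B             550068 non-null  bool   
 15  City_Category_C             550068 non-null  bool   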

In [8]:
data['Stay_In_Current_City_Years'].value_counts()

Stay_In_Current_City_Years
1     193821
2     101838
3      95285
4+     84726
0      74398
Name: count, dtype: int64

In [34]:
# Replace "4+" with "4" in the 'Stay_In_Current_City_Years' column
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].replace('4+', '4')

# Convert the column to integer type
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].astype(int)

# Verify the replacement and conversion
print(data['Stay_In_Current_City_Years'].unique())
print(data.dtypes)


[2 4 3 1 0]
Occupation                      int64
Stay_In_Current_City_Years      int64
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                        int64
Gender_M                         bool
Age_18-25                        bool
Age_26-35                        bool
Age_36-45                        bool
Age_46-50                        bool
Age_51-55                        bool
Age_55+                          bool
City_Category_B                  bool
City_Category_C                  bool
dtype: object


In [35]:
# Define features and target variable
X = data.drop(['Purchase'], axis=1)
y = data['Purchase']

In [37]:
from sklearn.feature_selection import SelectKBest, f_regression

# Feature Selection using SelectKBest
selector = SelectKBest(score_func=f_regression, k=10)
X_new = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print(selected_features)

Index(['Occupation', 'Product_Category_1', 'Product_Category_2',
       'Product_Category_3', 'Gender_M', 'Age_18-25', 'Age_36-45', 'Age_51-55',
       'City_Category_B', 'City_Category_C'],
      dtype='object')
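
The goal also lists VIF (variance inflation factor), which flags features that are nearly linear combinations of the others. A minimal NumPy sketch follows, assuming the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The function name `vif` and the synthetic columns are hypothetical, and this is not a cell from the original notebook.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of a 2D array.

    For column j, fit a least-squares regression (with intercept) on the
    other columns and return SS_tot / SS_res, which equals 1 / (1 - R^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        factors.append(np.inf if ss_res == 0.0 else ss_tot / ss_res)
    return np.array(factors)


# Synthetic example: column c is almost a copy of column a, column b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
scores = vif(np.column_stack([a, b, c]))
print(scores)  # a and c get large values, b stays close to 1
```

On the notebook's data this could be called as `vif(X[selected_features].astype(float).to_numpy())`. A common rule of thumb treats values above roughly 5 to 10 as a sign of strong multicollinearity.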


In [44]:
from sklearn.preprocessing import StandardScaler
# Standardize the data before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[selected_features])


In [46]:
from sklearn.decomposition import PCA

# Apply PCA to retain 95% of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

In [47]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

In [48]:
# Model Training


# Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Linear Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))


Linear Regression RMSE: 4677.968589692933
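
The RMSE printed above is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as `Purchase`. A small sketch of the calculation in plain Python follows. The predicted values are made up for illustration, and the true values are the first four `Purchase` entries from the data preview.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


y_true = [8370, 15200, 1422, 1057]   # first four Purchase values from data.head()
y_pred = [9000, 14000, 2000, 1500]   # hypothetical predictions
print(rmse(y_true, y_pred))
```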


In [49]:
from sklearn.ensemble import RandomForestRegressor

# Random Forest with a small grid search: 4 parameter combinations x 3 folds = 12 fits, plus a final refit.
# On roughly 440k training rows this is slow, so n_jobs=-1 runs the fits in parallel on all available cores.
rf = RandomForestRegressor(random_state=42)
params = {'n_estimators': [100, 200], 'max_depth': [10, 20]}
grid_search = GridSearchCV(estimator=rf, param_grid=params, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)
best_rf = grid_search.best_estimator_
y_pred_rf = best_rf.predict(X_test)
print("Random Forest RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))


KeyboardInterrupt: 

In [None]:
# Cross-validation for the best model (requires best_rf, so the grid search cell must have finished)
cv_scores = cross_val_score(best_rf, X_pca, y, cv=5, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores)
print("Cross-validated RMSE for Random Forest:", cv_rmse.mean())
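
In the cells above, SelectKBest, StandardScaler, and PCA are fitted on the full dataset before the train/test split, so the held-out rows influence the preprocessing. One possible way to avoid that is to wrap the steps in a scikit-learn `Pipeline` so that each cross-validation fold fits them on its own training part only. The sketch below uses a small synthetic dataset from `make_regression` as a stand-in (an assumption, not the Black Friday data) and LinearRegression as the final model to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the real data).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=15, n_informative=8, noise=10.0, random_state=42
)

# Same steps as the notebook, but refitted inside every cross-validation fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=10)),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("model", LinearRegression()),
])

scores = cross_val_score(
    pipe, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -scores  # scikit-learn returns negated errors so that higher is better
print("Cross-validated RMSE per fold:", cv_rmse)
print("Mean cross-validated RMSE:", cv_rmse.mean())
```

The same pipeline object could be passed to `GridSearchCV` in place of the bare estimator, with parameter names prefixed by the step name (for example `model__...`).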