This repository contains a machine learning model for predicting discounts on fashion clothing items on Myntra. The project involves data cleaning, preprocessing, feature engineering, exploratory data analysis, and regression analysis. The model performance is improved by applying logarithmic scaling to the features.
- Data Cleaning and Preprocessing
- Feature Engineering
- Exploratory Data Analysis (EDA)
- Regression Analysis
- Logarithmic Scaling
- Results
In this step, the dataset is cleaned and preprocessed to handle missing values and convert data types. The "DiscountOffer" column is filled with 0 for missing values and converted to string data type. A new column called "DiscountOffer_len" is created to store the length of the strings in the "DiscountOffer" column. The data is then split into different groups based on the length of the strings, and the discount amount is segregated into separate columns for each group. The discounted price is calculated for each group and stored in a new column. Finally, all the groups are concatenated back into one dataframe.
The feature engineering process involves creating new features and merging relevant information. The dataset is filtered to separate out instances where the discount percentage is zero. The filtered data is split into training, validation, and test sets. Average rating and total reviews are calculated for each brand, creating a new column called "Brand_importance." The importance values are merged back to the datasets. The number of unique brands in each category is calculated and stored in a column called "ind_cat_popularity," which is also merged back to the datasets. Additionally, the number of products in each category is calculated and stored in a column called "cat_popularity," which is merged back to the datasets.
EDA is performed on the "model_data" dataset using the Seaborn and Matplotlib libraries. A heatmap is created to visualize the correlation between different features. The correlation between the target variable and the features is plotted as a bar chart. A pairplot is also created to visualize the relationships between variables.
Three regression models are used: Linear Regression, KNeighbors Regressor, and Random Forest Regressor. For each model, a model object is created, and it is fitted on the training data. The accuracy of the models is evaluated on the test and validation data using the r2_score function. The feature importance of the Random Forest Regressor model is also calculated and stored in a dataframe.
To improve the model performance, logarithmic scaling is applied to the features of the training, testing, and validation data. This transformation helps balance the magnitude of each feature and makes the data more normally distributed. Logarithm is applied to the feature values, and a small value is added to avoid taking the logarithm of zero.
The model performance improves after applying logarithmic scaling to the features. The accuracy of the Linear Regression model, KNeighbors Regressor model, and Random Forest Regressor model show improvement. Detailed results and analysis can be found in the project code.