# Sephora Website


## Dataset
The dataset was collected by **Raghad Alharbi** using web scraping tools such as Selenium and Beautiful Soup, and contains more than 1,000 useful records from the Sephora website.

## Goals
Predict the price of a product based on the available features.

## Objective
The objective is to analyze products across several variables, determine which variables affect the product price the most, and then build a model that can predict the price of a product.

# Data Exploration

## Import Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

print('numpy version : ',np.__version__)
print('pandas version : ',pd.__version__)
print('seaborn version : ',sns.__version__)

In [None]:
sns.set(rc={'figure.figsize':(20.7,8.27)})
sns.set_style("whitegrid")
sns.color_palette("dark")
plt.style.use("fivethirtyeight")

In [None]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 12, 4
rcParams['lines.linewidth'] = 3
rcParams['xtick.labelsize'] = 'x-large'
rcParams['ytick.labelsize'] = 'x-large'

## Load Dataset

In [None]:
df = pd.read_csv('../input/all-products-available-on-sephora-website/sephora_website_dataset.csv')
df.head()

## Description

In [None]:
df.info()

The output above shows that:
* The DataFrame has a total of 9268 rows and 21 columns.
* The regression target is the column 'value_price', with data type 'float64'.

In [None]:
df.columns

## Numeric Data

In [None]:
numeric=['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num=df.select_dtypes(include=numeric)
df_num.head(3)

## Categorical Data

In [None]:
df_cat=df.select_dtypes(include='object')
df_cat.head(3)

# Exploratory Data Analysis

## Numerical Approach

### Statistical Summary

In [None]:
describeNum = df.describe(include =['float64', 'int64', 'float', 'int'])
describeNum.T.style.background_gradient(cmap='viridis',low=0.2,high=0.1)

The table above shows that some columns have a skewed (non-normal) distribution, because their mean and median values are far apart.
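
As a rough sketch of the comparison described above, the gap between mean and median can be computed per column and read next to the skewness. The toy values and column names below are made up for illustration and are not taken from the Sephora data.

```python
import pandas as pd

# Toy data (hypothetical values): one symmetric column, one right-skewed column
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 2, 2, 3, 4, 100],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
# A mean far above the median is a sign of a long right tail
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

On the real data, the same table could be built from the numeric columns of `df`.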

In [None]:
describeNumCat = df.describe(include=["O"])
describeNumCat.T.style.background_gradient(cmap='viridis',low=0.2,high=0.1)

### Categorical Value Counting

In [None]:
cats = ['brand','category', 'name', 'size'] 
for col in cats:
    print(f'Value counts for column {col}:')
    print(df[col].value_counts())
    print()

## Graphic Approach

### Correlation heatmap

In [None]:
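# Restrict to numeric and boolean columns; newer pandas versions raise an error when text columns are passed to corr()
df.select_dtypes(include=['number', 'bool']).corr()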

In [None]:
features = ['rating', 'number_of_reviews', 'love', 'price', 'value_price', 'online_only', 'exclusive', 'limited_edition', 'limited_time_offer']

plt.figure(figsize=(30,20))
ax = sns.heatmap(data = df[features].corr(),cmap='YlGnBu',annot=True)

bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5,top - 0.5)

**Analysis Results from the Correlation Heatmap**
- The heatmap shows that the 'price' and 'value_price' features have a correlation of 0.99, so it is necessary to check whether these two features actually hold the same values under different column names.
- The 'love' and 'number_of_reviews' features also have a fairly high correlation of 0.74.
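
Reading strongly correlated pairs off a large heatmap is easy to get wrong, so a small sketch of listing them programmatically may help. The data here is synthetic, and the 0.7 cut-off is an assumption chosen for illustration, not a value used by the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()
# Keep only the upper triangle so each pair is listed once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)

threshold = 0.7  # assumed cut-off for "highly correlated"
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```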

### Scatter plot

In [None]:
fig, ax = plt.subplots()
_ = plt.scatter(x=df['price'], y=df['value_price'], edgecolors="#000000", linewidths=0.5)
_ = ax.set(xlabel="price", ylabel="value_price")

In [None]:
fig, ax = plt.subplots()
_ = plt.scatter(x=df['love'], y=df['number_of_reviews'], edgecolors="#000000", linewidths=0.5)
_ = ax.set(xlabel="love", ylabel="number_of_reviews")

### Boxplot

In [None]:
features = ['number_of_reviews', 'love', 'price', 'value_price']
plt.figure(figsize=(20, 8))
for i in range(0, len(features)):
    plt.subplot(1, len(features), i+1)
    sns.boxplot(y=df[features[i]],color='green',orient='v')

### Analysis of the Dependent Variable "value_price"

In [None]:
plt.figure(figsize=(20,8))

plt.subplot(1,2,1)
plt.title('Sale Price Distribution Plot')
sns.distplot(df.value_price)

plt.subplot(1,2,2)
plt.title('Sale Price Spread')
sns.boxplot(y=df.value_price)

plt.show()

In [None]:
print(df.value_price.describe(percentiles = [0.25,0.50,0.75,0.85,0.90,1]))

In [None]:
# GET SKEWNESS 
print(f"Skewness Co-efficient: {round(df.value_price.skew(), 3)}")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5), dpi=300)

# HISTOGRAM 
from scipy import stats
sns.distplot(df['value_price'] , fit=stats.norm, ax=ax1)
ax1.set_title('Histogram')

# PROBABILITY / QQ PLOT
stats.probplot(df['value_price'], plot=ax2)

plt.show()

The next step is to analyze the 'value_price' column. Because the target variable is numeric, we look at its histogram to check whether it is normally distributed. The 'value_price' column is positively skewed: the tail of the distribution extends to the right, and most values are concentrated at the low end. In other words, the target variable is right-skewed. Because linear models tend to work better when the target is close to normally distributed, we transform this variable to make it more symmetric. We apply a log(1+x) transformation, which brings the distribution closer to Gaussian and stays defined for 0 values (if present).
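
To illustrate the reasoning above without touching the real data, here is a minimal sketch on synthetic log-normal draws (an assumption standing in for a right-skewed price column). It compares the skewness before and after `np.log1p` and shows that the transform is defined at zero.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draws), not the Sephora data
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

skew_before = prices.skew()
skew_after = np.log1p(prices).skew()
print(f"skew before: {skew_before:.3f}, after log1p: {skew_after:.3f}")

# log1p is defined at zero, whereas a plain log would give -inf there
print(np.log1p(0.0))
```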

In [None]:
df["value_price"] = np.log1p(df["value_price"])

# GET SKEWNESS 
print(f"Skewness Co-efficient: {round(df.value_price.skew(), 3)}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5), dpi=300)

# HISTOGRAM 
from scipy import stats
sns.distplot(df['value_price'] , fit=stats.norm, ax=ax1)
ax1.set_title('Histogram')
# PROBABILITY / QQ PLOT
stats.probplot(df['value_price'], plot=ax2)

plt.show()

After the transformation, the skewness is reduced from 3.143 to 0.31. The histogram now looks close to a normal distribution, and the probability plot confirms this.
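
Because the target is now on a log scale, anything a model predicts for it is on that scale too. A minimal sketch of mapping predictions back to the original price unit with `np.expm1` follows; the prices and the prediction offsets are made-up values, not model output.

```python
import numpy as np

# Hypothetical prices in the original unit
y_true = np.array([10.0, 25.0, 60.0, 150.0])

# The model sees the log1p-transformed target ...
y_log = np.log1p(y_true)

# ... and returns predictions on that same log scale (made-up values here)
y_pred_log = y_log + np.array([0.05, -0.10, 0.02, 0.00])

# expm1 is the inverse of log1p, so it maps predictions back to prices
y_pred = np.expm1(y_pred_log)
mae_original_scale = np.mean(np.abs(y_true - y_pred))

print(np.round(y_pred, 2))
print(f"MAE in original units: {mae_original_scale:.2f}")
```

Errors reported on the log scale and on the original scale are not directly comparable, so it is worth stating which one a metric refers to.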

### Checking the Similarity of price and value_price

In [None]:
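# 'value_price' was log-transformed above, so undo the transform before
# comparing it with the untransformed 'price' column.
original_value_price = np.expm1(df['value_price'])
mismatch = ~np.isclose(df['price'], original_value_price)
print(df.loc[mismatch, ['price']].assign(value_price=original_value_price[mismatch]))
print(f"{mismatch.sum()} of {len(df)} rows differ")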

We can see that these two columns do contain many differing values, so we decide to keep both features for now.

### Which Brand Got the Highest Number of Reviews

In [None]:
# Total number of reviews per brand, largest first
bestBrandReviews = df.groupby('brand')['number_of_reviews'].sum().reset_index()
bestBrandReviews = bestBrandReviews.sort_values('number_of_reviews', ascending=False)
bestBrandReviews.head(10)

### Which Brands Are the Most Popular Based on Average Rating

In [None]:
rating_products = pd.DataFrame(round(df.groupby('brand')['rating'].mean(),2))
most_rating = rating_products.sort_values('rating', ascending=False)
most_rating.head(10)

- The brand **Four Sigmatic**, which is the most popular brand, is not the highest-rated brand. This suggests that the brand is popular but not necessarily the one buyers rate as most effective.
- In contrast, **Fable & Mane, Aether Beauty, and Montblanc** are the highest-rated brands with the maximum score, which suggests that many buyers like these brands and the quality they offer.


### Which Brand Has the Highest Total Rating

In [None]:
popular_products = pd.DataFrame(df.groupby('brand')['rating'].sum())
most_popular = popular_products.sort_values('rating', ascending=False)
most_popular.head(10)

The results show that the SEPHORA COLLECTION brand is the most popular, with the highest total rating given by consumers: 1893.5. However, this could simply be because this brand has a lot of sales.
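
The caveat above can be made concrete: a summed rating grows with the number of rated products, so it helps to read it next to the mean and the count. The sketch below uses two hypothetical brands with made-up ratings.

```python
import pandas as pd

# Toy data (hypothetical brands): "BigBrand" has many mid-rated products,
# "SmallBrand" has few but highly rated ones
toy = pd.DataFrame({
    "brand": ["BigBrand"] * 6 + ["SmallBrand"] * 2,
    "rating": [3.5, 3.0, 4.0, 3.5, 3.0, 4.0, 5.0, 4.5],
})

by_brand = toy.groupby("brand")["rating"].agg(
    total_rating="sum", mean_rating="mean", n_products="count"
)
print(by_brand.sort_values("total_rating", ascending=False))
```

Here the brand with the larger total is the one with more products, while the brand with the higher mean has the smaller total.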

### What are the Most Expensive Brands

In [None]:
price_products = pd.DataFrame(df.groupby('brand')['price'].mean())
most_price = price_products.sort_values('price', ascending=False)
most_price.head(10)

- The brands **Four Sigmatic, Montblanc, and Aether Beauty**, which are among the highest-rated brands, are not among the brands with the most expensive average product price.
- Brands such as **dyson, ReFa, and LightStim** have the most expensive average product prices. **ReFa** is also in the top 10 for rating, with a score of 4.83.


### Which Brands Get the Most Love from Customers

In [None]:
love_products = pd.DataFrame(df.groupby('brand')['love'].mean())
most_love = love_products.sort_values('love', ascending=False)
most_love.head(10)

### Which Brands Get the Most Reviews from Customers

In [None]:
reviews_products = pd.DataFrame(df.groupby('brand')['number_of_reviews'].mean())
most_reviews = reviews_products.sort_values('number_of_reviews', ascending=False)
most_reviews.head(10)

Some of the brands with the most love in the previous section also appear in the top 10 for number_of_reviews, including **Buxom, stila, NARS, Anastasia Beverly Hills, Makeup Eraser, and Urban Decay**. This is consistent with the correlation between these two variables seen earlier in the correlation analysis.

### Analysis of the Brands Buxom, stila, and NARS

In [None]:
xbrand = df[df['brand']=='Buxom']
xbrand.head(45)

In [None]:
ybrand = df[df['brand']=='stila']
ybrand.head(10)

In [None]:
zbrand = df[df['brand']=='NARS']
zbrand.head(45)

Looking at the three brands above, **'NARS'** stands out: it has strong **love and number_of_reviews** values, but these show no clear relationship with whether a product is exclusive. We therefore conclude that the **love and number_of_reviews** columns do not really affect whether an item is exclusive.
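
The conclusion above comes from inspecting rows by eye. A hedged sketch of a more systematic check is to compare group medians and the correlation of each numeric column with the 0/1 flag. The values below are invented for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) with a binary 'exclusive' flag
toy = pd.DataFrame({
    "exclusive": [0, 0, 0, 0, 1, 1, 1, 1],
    "love": [100, 5000, 300, 9000, 200, 4800, 350, 9100],
    "number_of_reviews": [10, 400, 30, 800, 15, 390, 28, 810],
})

# Compare the typical values of each group ...
group_medians = toy.groupby("exclusive")[["love", "number_of_reviews"]].median()
print(group_medians)

# ... and the correlation of each numeric column with the 0/1 flag
flag_corr = toy[["love", "number_of_reviews"]].corrwith(toy["exclusive"])
print(flag_corr)
```

Similar medians in both groups and correlations near zero would support the reading that the flag is unrelated to these columns.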

### Which Categories Are the Most Popular Based on Rating

In [None]:
rating_category = pd.DataFrame(df.groupby('category')['rating'].mean())
most_rated_category = rating_category.sort_values('rating', ascending=False)
most_rated_category.head()

### Which Category Has the Highest Income Value

In [None]:
price_sorted_category = pd.pivot_table(df,
              index=['category'],
              values=['price'],
              aggfunc=['sum']
              ).reset_index()
price_sorted_category.columns = ['category', 'price']
price_sorted_category = price_sorted_category.sort_values(['price'], ascending = False)
price_sorted_category = price_sorted_category.head(10)
price_sorted_category

In [None]:
fig, ax = plt.subplots(figsize=(15,7))
_ = sns.barplot(x="category", y="price", data=price_sorted_category,
                palette="nipy_spectral", ax=ax)
_ = ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
_ = ax.set(xlabel="Category", ylabel="Total Price")

### Which Brand Has the Highest Income Value

In [None]:
price_sorted_category = pd.pivot_table(df,
              index=['brand'],
              values=['price'],
              aggfunc=['sum']
              ).reset_index()
price_sorted_category.columns = ['brand', 'price']
price_sorted_category = price_sorted_category.sort_values(['price'], ascending = False)
price_sorted_category = price_sorted_category.head(10)
price_sorted_category

### Analysis of the Brand 'TOM FORD'

In [None]:
brandHighestPrice = df[(df["brand"] == 'TOM FORD')]
brandHighestPrice.head()

### Which Category Has the Highest Sales for the Highest Income Value Brand

In [None]:
fig, ax = plt.subplots(figsize=(15,7))
# Use estimator=sum so the bar height is the total price, matching the y-axis label
_ = sns.barplot(x="category", y="price", data=brandHighestPrice,
                estimator=sum, palette="nipy_spectral", ax=ax)
_ = ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
_ = ax.set(xlabel="Category", ylabel="Total Price")

### Which Product Has the Highest Price

In [None]:
price_sorted_category = pd.pivot_table(df,
              index=['name'],
              values=['price'],
              aggfunc=['sum']
              ).reset_index()
price_sorted_category.columns = ['name', 'price']
price_sorted_category = price_sorted_category.sort_values(['price'], ascending = False)
price_sorted_category = price_sorted_category.head(10)
price_sorted_category

### What is the Brand With the Most Sales?

In [None]:
plt.figure(figsize=(30,100),dpi=100)
plt.xticks(rotation=90)
plt.title('Brand Counts')
sns.countplot(y=df['brand'], palette="nipy_spectral");

In [None]:
brandbig10 = df.groupby(['brand'])['exclusive'].count().sort_values(ascending=False).reset_index().head(10)

plt.figure(figsize=(18,6), dpi=100)
plt.subplot(2,2,1)
sns.barplot(y=brandbig10['brand'],x=brandbig10['exclusive'], palette='nipy_spectral')
# Clear the axis labels after plotting; seaborn sets them from the column names
plt.ylabel('')
plt.xlabel('')

### What is the Category With the Most Sales?

In [None]:
plt.figure(figsize=(25,40),dpi=100)
plt.xticks(rotation=90)
plt.title('Category Counts')
sns.countplot(y=df['category'], palette="nipy_spectral");

In [None]:
categorybig10 = df.groupby(['category'])['exclusive'].count().sort_values(ascending=False).reset_index().head(10)

plt.figure(figsize=(18,6), dpi=100)
plt.subplot(2,2,1)
sns.barplot(y=categorybig10['category'],x=categorybig10['exclusive'], palette='nipy_spectral')
# Clear the axis labels after plotting; seaborn sets them from the column names
plt.ylabel('')
plt.xlabel('')

### What is the Rating With the Most Sales? 

In [None]:
# Pass the column as `x`; in recent seaborn versions the first positional argument is `data`
sns.countplot(x=df['rating'], palette='nipy_spectral')

# Data Preparation

## Outliers

In [None]:
features = ['number_of_reviews','love','price','value_price']
plt.figure(figsize=(15, 10))
for i in range(0, len(features)):
    plt.subplot(1, 4, i+1)
    sns.boxplot(y=df[features[i]],color='green',orient='v')
    plt.tight_layout()

In [None]:
df['number_of_reviews'] = np.log1p(df['number_of_reviews'])
df['love'] = np.log1p(df['love'])
df['price'] = np.log1p(df['price'])
# 'value_price' was already log-transformed during the target analysis above,
# so it is not transformed a second time here.

In [None]:
plt.figure(figsize=(15, 7))
for i in range(0, len(features)):
    plt.subplot(1, 4, i+1)
    sns.boxplot(y=df[features[i]],color='green',orient='v')
    plt.tight_layout()
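
The boxplots above draw whiskers with the usual 1.5 x IQR rule, so the effect of the log transform can also be expressed as a count of points outside the whiskers. The helper name `count_iqr_outliers` and the synthetic log-normal data are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

# Synthetic right-skewed data standing in for a column such as 'love'
rng = np.random.default_rng(1)
raw = pd.Series(rng.lognormal(mean=5.0, sigma=1.2, size=2000))

n_raw = count_iqr_outliers(raw)
n_log = count_iqr_outliers(np.log1p(raw))
print(f"outliers before log1p: {n_raw}, after: {n_log}")
```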

## Feature Engineering

In [None]:
df.info()

In [None]:
df['MarketingFlags'] = df.MarketingFlags.map({False:0, True:1})

In [None]:
# Drop the columns that are not used as model features
df = df.drop(['id', 'name', 'URL', 'options', 'details', 'how_to_use', 'ingredients', 'price'], axis=1)

In [None]:
df.head()

## Feature encoding (one hot encoding)

In [None]:
df['rating'] = df['rating'].astype(str)

In [None]:
# Get all the categorical columns
cat_cols = df.select_dtypes("object").columns

## One-Hot Encoding all the categorical variables but dropping one of the features among them.
drop_categ = []
for i in cat_cols:
    drop_categ += [ i+'_'+str(df[i].unique()[-1]) ]

## Create dummy variables (One-Hot Encoding)
df = pd.get_dummies(df, columns=cat_cols) 

## Drop the last column generated from each categorical feature
df.drop(drop_categ, axis=1, inplace=True)
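
As a possible shortcut for the drop-one-level step above, `pd.get_dummies` has a `drop_first` option. Note that it removes the first level of each column in sorted order, whereas the loop above drops the last value returned by `unique()`, so the resulting reference category may differ. The toy frame below uses made-up values.

```python
import pandas as pd

# Toy frame with hypothetical column values
toy = pd.DataFrame({
    "category": ["Perfume", "Lipstick", "Perfume", "Mascara"],
    "love": [120, 45, 300, 80],
})

# drop_first=True removes the first level of each encoded column,
# leaving k-1 indicator columns for k categories
encoded = pd.get_dummies(toy, columns=["category"], drop_first=True)
print(encoded.columns.tolist())
```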

# Modeling

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split, GridSearchCV
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn import metrics
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
from xgboost import XGBRegressor

In [None]:
X = df.drop('value_price', axis = 1) 
y = df['value_price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

## Standardization

In [None]:
# scaler = RobustScaler() #RobustScaler - StandardScaler
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)
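
If the scaling step is re-enabled, one option is to wrap the scaler and the model in a pipeline so the scaler is always fitted on the training split only. The sketch below uses synthetic data and a `Ridge` model as stand-ins; the names `X_demo`, `y_demo`, and `model` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic regression data standing in for the encoded features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * [1.0, 50.0, 1000.0]
y_demo = X_demo @ np.array([2.0, 0.1, 0.01]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training split only and reuses
# those statistics when transforming the test split
model = make_pipeline(RobustScaler(), Ridge())
model.fit(X_tr, y_tr)
r2_test = model.score(X_te, y_te)
print(f"R^2 on the held-out split: {r2_test:.3f}")
```

Tree-based models such as Random Forest and XGBoost are largely insensitive to feature scaling, so this matters mainly for the Ridge model.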

In [None]:
# Print the shapes of the train and test sets
print("Shape of the X Train :", X_train.shape)
print("Shape of the y Train :", y_train.shape)
print("Shape of the X test :", X_test.shape)
print("Shape of the y test :", y_test.shape)

## XGBoost

In [None]:
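# Use a separate name so the `xgb` module imported above is not shadowed
xgb_importance = XGBRegressor()

xgb_importance.fit(X_train, y_train)
df_imp = pd.DataFrame(xgb_importance.feature_importances_, columns=['Importance'], index=X_train.columns)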
df_imp = df_imp.sort_values(['Importance'], ascending = False)

df_imp.head()

In [None]:
XGB_model = XGBRegressor()

XGB_model.fit(X_train, y_train)
y_pred= XGB_model.predict(X_test)

# For regressors, `.score()` returns the R^2 coefficient of determination, not accuracy
print("R^2 on Training set      : ",XGB_model.score(X_train,y_train))
print("R^2 on Testing set       : ",XGB_model.score(X_test,y_test))
print("__________________________________________")
print("\t\tError Table")
print('Mean Absolute Error      : ', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error       : ', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error  : ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R Squared Score          : ', metrics.r2_score(y_test, y_pred))

## Random Forest

In [None]:
RandomForest = RandomForestRegressor()
RandomForest.fit(X_train, y_train)
y_pred= RandomForest.predict(X_test)

# For regressors, `.score()` returns the R^2 coefficient of determination, not accuracy
print("R^2 on Training set      : ",RandomForest.score(X_train,y_train))
print("R^2 on Testing set       : ",RandomForest.score(X_test,y_test))
print("__________________________________________")
print("\t\tError Table")
print('Mean Absolute Error      : ', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error       : ', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error  : ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R Squared Score          : ', metrics.r2_score(y_test, y_pred))

## Ridge Regression

In [None]:
ridge = Ridge()
ridge.fit(X_train, y_train)
y_pred= ridge.predict(X_test)

# For regressors, `.score()` returns the R^2 coefficient of determination, not accuracy
print("R^2 on Training set      : ",ridge.score(X_train,y_train))
print("R^2 on Testing set       : ",ridge.score(X_test,y_test))
print("__________________________________________")
print("\t\tError Table")
print('Mean Absolute Error      : ', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error       : ', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error  : ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R Squared Score          : ', metrics.r2_score(y_test, y_pred))
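
The same error table is printed for all three models, so the computation could be collected in one helper. The sketch below writes the four metrics out with NumPy to make their definitions explicit; `regression_report` is a hypothetical name, and the numbers passed to it are made up rather than model output.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MAE": np.mean(np.abs(residuals)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Made-up values to exercise the helper
report = regression_report([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
for name, value in report.items():
    print(f"{name:<5}: {value:.4f}")
```

In the notebook, the helper could be called as `regression_report(y_test, model.predict(X_test))` for each fitted model, keeping in mind that the errors are on the log scale of the target.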