# SSENSE Spring Summer 2023 Menswear Sale Analysis 

<hr>
<h2>by Richard Ha</h2>
    <h3>Overview</h3><hr>This project analyses the "Sale" page of the popular online luxury retailer SSENSE. It serves as a proof of concept for a more detailed analysis of multi-dimensional sales data in the online luxury goods market.<br><br>I took a sample of roughly <b>1170 listings (1006 unique)</b> from the "Sale" page on SSENSE, under the menswear section and sorted by trending. The trending order of the listings did not change during the data collection timeframe.<br><br>The main questions I wanted this analysis to answer include, but are not limited to, the following:<br>
    
    What type of items get discounted by the highest percentage?
  
    Is there a correlation between country of manufacture and the sale percent? 
    
    Is there a pattern in how sale percentages are distributed across item subcategories?
    
I also want to test common consumer beliefs about luxury fashion and see whether the Sale page data supports them, for example:<br>
    
    Does the country of origin affect the average base price of items? 
    
    Does black-coloured clothing get discounted by the least amount amongst other colours?
    
Although these beliefs are seldom examined, testing them against data is a sensible exercise and helps in understanding consumer behaviour a little better.
    <h3>Underlying Assumptions and Remarks</h3><hr>
    Because this analysis is done from the user side, there is no detailed internal sales data, nor was a platform like Salesforce available. The sample size is limited to 1170 because the website's security blocked basic web-scraping tools. <br><br> My underlying assumption when drawing conclusions is that an item on sale is stock that needs to be moved. In luxury fashion, this is primarily due to an inventory surplus, usually because the item is not selling, combined with the need to clear out inventory before the next season, in this case Fall-Winter 2023. This is certainly not the only reason an item goes on sale, but for the purposes of this project it is treated as such.

        
<h3>Methodology</h3><h5>
    
    Step 1: Extract HTML Files from SSENSE Sale listings 
    
    Step 2: Process and clean HTML data using BeautifulSoup into categories
    
    Step 3: Append values into Pandas series, which are then combined into a dataframe 
    
    Step 4: Plot data points according to the questions above
    
    Step 5: Build and validate the model
    
    Step 6: Visualise the data in a clear, concise manner 

</h5>The code for steps 1 to 3 is shown in the cell below. The next section, "Dataset Overview", follows the cell.
 

In [8]:
import pandas as pd 
import matplotlib.pyplot as plt
from pathlib import Path
from bs4 import BeautifulSoup


# creating empty dicts for series conversion 

item_brand_dict = {}
item_type_dict = {}
item_colour_dict = {} 
item_country_dict = {}
item_bp_dict = {}
item_sp_dict = {}
item_salepercent_dict = {}

# looping over every saved HTML listing and parsing its fields 

directory = Path.cwd()
html_files = Path(directory).glob('**/*.html')
for file in html_files:
    file = str(file)
    with open(file, 'r', encoding='utf-8') as html_file:  # assumes the pages were saved as UTF-8
        content = html_file.read()
        colour = []
        soup = BeautifulSoup(content, 'lxml')
        fullitemname = []
        namelist = [] 
        tags = soup.find('div', class_="pdp__redesign view")
        text = tags.text
        newline_count = 0 

        # finding the brand name 

        for letter in text:
            if letter != "\n":
                namelist.append(letter)
            if newline_count == 2:
                break
            if letter == "\n":
                newline_count += 1
        namestr = ''.join(namelist)
        brand_name = namestr.strip(' ')
        joinedtext = text.replace("\n", " ")
        joinedtext = joinedtext.replace(brand_name, "", 1)
        joinedtext = joinedtext.strip(' ')


        # finding the colour 
        for letter in joinedtext: 
            if letter == " ":
                break
            else:
                colour.append(letter)
        colour = ''.join(colour)
        if colour == 'SSENSE':
            colour = 'Multicolored'
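        # removeprefix (Python 3.9+) drops only the leading colour word;
        # str.strip(colour) would strip any of its characters from both ends
        joinedtext = joinedtext.removeprefix(colour)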
        # finding the item type  

        joinedtext = joinedtext.replace("       ", "\n", 1)
        for letter in joinedtext:
            if letter == "\n":
                break
            else:
                fullitemname.append(letter)

        fullitemname = ''.join(fullitemname)
        itemname_list = list(fullitemname.split(" "))
        lis_len = len(itemname_list)
        itemname = itemname_list[lis_len - 1]
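        # removeprefix (Python 3.9+) drops only the leading item name;
        # str.strip(fullitemname) would strip any of its characters from both ends
        joinedtext = joinedtext.removeprefix(fullitemname)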

        # finding the sale price 

        dollar_index = joinedtext.index("$")
        sale_price = [] 
        for letter in range(dollar_index + 1, len(joinedtext)): 
            if joinedtext[letter] == " ":
                break
            else:
                sale_price.append(joinedtext[letter])

        sale_price = int(''.join(sale_price))

        # determining whether or not it is on sale 
        # (reads the two characters before "% OFF"; 0 if the label is absent)
        sale_index = joinedtext.find("% OFF")
        if sale_index == -1:
            sale_amt = 0
        else:
            sale_amt = int(joinedtext[sale_index - 2:sale_index])

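        # reversing the discount; round() rather than int(), because float
        # division can land just below the true value (e.g. 29/100*100)
        og_price = round(sale_price * 100 / (100 - sale_amt))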

        # finding country of origin 
        CTY_OG = []
        origin_index = joinedtext.find("Made in ")
        skip_text = len("Made in ")
        for letter in joinedtext[origin_index + skip_text:]:
            if letter == ".":
                break
            else:
                CTY_OG.append(letter)

        CTY_OG = ''.join(CTY_OG)


        # finding unique product ID (the last space-separated token)
        productID = joinedtext.rsplit(" ", 1)[-1]
        
        item_brand_dict[productID] = brand_name 
        item_type_dict[productID] = itemname 
        item_colour_dict[productID] = colour 
        item_country_dict[productID] = CTY_OG
        item_bp_dict[productID] = og_price
        item_sp_dict[productID] = sale_price
        item_salepercent_dict[productID] = sale_amt
        

# turning all dictionaries into series then combining into final dataframe --> write to csv 
item_brands = pd.Series(item_brand_dict)
item_types = pd.Series(item_type_dict)
item_colours = pd.Series(item_colour_dict)
item_countries = pd.Series(item_country_dict)
item_baseprice = pd.Series(item_bp_dict)
item_saleprice = pd.Series(item_sp_dict)
item_salepercent = pd.Series(item_salepercent_dict)

ssensesale_df = pd.DataFrame({
    'brand': item_brands,
    'type': item_types,
    'colour': item_colours,
    'country of origin': item_countries,
    'base price': item_baseprice,
    'sale price': item_saleprice,
    'sale percent': item_salepercent,
})

ssensesale_df.to_csv('ssense_sale.csv')
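
The character-by-character loops in the cell above could also be written with regular expressions. The following is only a minimal sketch and not the code that produced the dataset. The helper `parse_listing`, the string `sample_text` and the ID `ITEM0001` are hypothetical, and the text layout (a `$` price, an optional `NN% OFF` label, a `Made in <country>.` sentence and the product ID as the last token) is assumed from the parsing logic above.

```python
import re


def parse_listing(text):
    """Extract price, discount, base price, country and product ID.

    Hypothetical helper: the layout of `text` is an assumption based on
    the parsing logic in the cell above.
    """
    # First "$" amount is treated as the sale price
    price_match = re.search(r"\$(\d+)", text)
    sale_price = int(price_match.group(1)) if price_match else None

    # No "% OFF" label means the item is treated as not discounted
    off_match = re.search(r"(\d{1,2})% OFF", text)
    sale_percent = int(off_match.group(1)) if off_match else 0

    # Reverse the discount: sale = base * (100 - percent) / 100
    base_price = None
    if sale_price is not None:
        base_price = round(sale_price * 100 / (100 - sale_percent))

    # Country is the text between "Made in " and the next full stop
    country_match = re.search(r"Made in ([^.]+)\.", text)
    country = country_match.group(1) if country_match else None

    # Product ID is assumed to be the last whitespace-separated token
    tokens = text.split()
    product_id = tokens[-1] if tokens else None

    return {
        "sale_price": sale_price,
        "sale_percent": sale_percent,
        "base_price": base_price,
        "country": country,
        "product_id": product_id,
    }


# Made-up listing text, for illustration only
sample_text = "$120 40% OFF Cotton jersey. Made in Italy. ITEM0001"
print(parse_listing(sample_text))
```

A missing match yields `None` (or a discount of 0) instead of raising an exception, which makes it easier to spot listings that do not follow the assumed layout.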


# Dataset Overview 

<h3>At a glance, here is the basic information about the collected dataset.</h3>
 
<h6>The top 10 most frequent brands on sale, descending:</h6>

In [9]:
ssensesale_df.value_counts('brand')[0:10]

brand
BAPE                         140
Jacquemus                     84
Stone Island                  54
Maison Margiela               52
Homme Plissé Issey Miyake     47
Acne Studios                  44
Versace                       42
Isabel Marant                 38
The North Face                37
Carhartt Work In Progress     33
dtype: int64
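
To put these counts in context, they can be expressed as shares of the 1006 unique listings. The sketch below simply recomputes the shares from the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the value_counts output above
top_brand_counts = {
    "BAPE": 140,
    "Jacquemus": 84,
    "Stone Island": 54,
    "Maison Margiela": 52,
    "Homme Plissé Issey Miyake": 47,
    "Acne Studios": 44,
    "Versace": 42,
    "Isabel Marant": 38,
    "The North Face": 37,
    "Carhartt Work In Progress": 33,
}
total_listings = 1006  # unique listings in the dataset

# Share of all listings held by each brand, and by the ten together
shares = {brand: count / total_listings for brand, count in top_brand_counts.items()}
top_ten_share = sum(top_brand_counts.values()) / total_listings

print(f"BAPE share: {shares['BAPE']:.1%}")
print(f"Top 10 share: {top_ten_share:.1%}")
```

BAPE alone accounts for roughly 13.9% of the listings, and the ten brands together for roughly 56.8%.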

<h6>The top 10 most frequent item types on sale, descending:</h6>

In [38]:
cpy = ssensesale_df.value_counts('type')[0:10]
print(cpy)

type
T-Shirt       155
Sneakers      104
Hoodie         63
Trousers       54
Pants          53
Sweatshirt     53
Jacket         50
Sunglasses     45
Shirt          43
Shorts         36
dtype: int64
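
Because the item type is taken from the last word of the product name, near-synonyms such as "Trousers" and "Pants" are counted separately. If they should be treated as one category, the labels can be merged before counting. The sketch below shows one way to do this on a small made-up series; the mapping `synonyms` is an assumption and not part of the original analysis.

```python
import pandas as pd

# Toy data standing in for ssensesale_df['type']
types = pd.Series(["Trousers", "Pants", "T-Shirt", "Pants", "Trousers", "Hoodie"])

# Hypothetical mapping of labels that should be merged
synonyms = {"Pants": "Trousers"}

# Replace the synonym labels, then count as before
merged_counts = types.replace(synonyms).value_counts()
print(merged_counts)
```

Applied to the counts above, merging the two labels would give 54 + 53 = 107 items, which would place the combined category second, ahead of Sneakers (104).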


<h6>The sale percentage overviews (in percent)</h6>

    Mean: 38.7
    Standard Deviation: 13.9
    

In [27]:
ssensesale_df['sale percent'].describe()


count    1006.000000
mean       38.718688
std        13.893747
min         0.000000
25%        29.000000
50%        40.000000
75%        51.000000
max        64.000000
Name: sale percent, dtype: float64
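
The same summary can be broken down by category to start answering the questions from the overview, for example which item types carry the highest average discount. The sketch below uses a small made-up dataframe with the same column names as `ssensesale_df`; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up rows using the same column names as ssensesale_df
toy_df = pd.DataFrame({
    "type": ["T-Shirt", "T-Shirt", "Sneakers", "Sneakers", "Hoodie"],
    "sale percent": [30, 50, 20, 40, 45],
})

# Mean discount and number of items per type, highest mean first
mean_discount_by_type = (
    toy_df.groupby("type")["sale percent"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(mean_discount_by_type)
```

Reporting the count next to the mean helps flag categories with too few items for the average to be meaningful.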

<h5>The next notebook, ssense_analysis.ipynb, presents plots and graphs that address the questions posed in the overview.</h5>