# Intro

## Modelo de Negócio

    Eduardo e Marcelo são dois brasileiros, amigos e sócios de empreendimento. Depois de vários
    negócio bem sucedidos, eles estão planejando entrar no mercado de moda dos USA como um
    modelo de negócio do tipo E-commerce.
    A idéia inicial é entrar no mercado com apenas um produto e para um público específico, no caso
    o produto seria calças Jenas para o público masculino. O objetivo é manter o custo de operação
    baixo e escalar a medida que forem conseguindo clientes.
    Porém, mesmo com o produto de entrada e a audiência definidos, os dois sócios não tem experiência
    nesse mercado de moda e portanto não sabem definir coisas básicas como preço, o tipo de calça e
    o material para a fabricação de cada peça.
    Assim, os dois sócios contrataram uma consultoria de Ciência de Dados para responder as seguintes
    perguntas: 1. Qual o melhor preço de venda para as calças? 2. Quantos tipos de calças e suas
    cores para o produto inicial? 3. Quais as matérias-prima necessárias para confeccionar as calças?
    As principais concorrentes da empresa Start Jeans são as americadas H&M e Macys.

### 3.0.2 2.0 O método SAPE
    1. Qual o melhor preço de venda para as calças?
    2. Quantos tipos de calças e suas cores para o produto inicial?
    3. Quais as matérias-prima necessárias para confeccionar as calças?
### Saída ( o produto final)
    1. Resposta para a pergunta
    - A mediana dos valores dos produtos do site dos concorrentes.
    2. Formato
    - Tabela ou gráfico
    3. Local de entrega
    - App no Streamlit
### Processo ( passo a passo )
    1. Passo a passo para calcular a resposta?
    - Mediana do preço por categoria e tipo.
    2. Como será o gráfico ou tabela final?
    - Simulação da tabela final
    3. Como será o local de entrega?
    - Dashboard em um app no Streamlit e publicá-lo no Heroku.
### Entradas ( fontes de dados )
    1. H&M: https://www2.hm.com/en_us/men/products/jeans.html
    2. Macys: https://www.macys.com/shop/mens-clothing/mens-jeans

# Imports

In [1]:
import re
import requests
import sqlite3
import re
import pandas     as pd
import numpy      as np


from   datetime   import datetime
from   bs4        import BeautifulSoup
from   sqlalchemy import create_engine

# Data Collection

In [2]:
# Parameters
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5),AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

# URL
url = 'https://www2.hm.com/en_us/men/products/jeans.html'

# Request to URL
page = requests.get( url, headers=headers )

# Beaultiful soup object
soup = BeautifulSoup( page.text, 'html.parser' )

#============================ Product Data =========================================
products = soup.find( 'ul', class_='products-listing small' )

product_list = products.find_all( 'article', class_='hm-product-item')

# product id
product_id = [p.get( 'data-articlecode' ) for p in product_list]

# product category
product_category = [p.get( 'data-category' ) for p in product_list]

# product name
product_list = products.find_all( 'a', class_='link' )
product_name = [p.get_text() for p in product_list]

# price
product_list = products.find_all( 'span', class_='price regular' )
product_price = [p.get_text() for p in product_list]

data = pd.DataFrame( [product_id, product_category, product_name, product_price] ).T
data.columns = ['product_id', 'product_category', 'product_name', 'product_price']

# Data Collection by Product

In [4]:
# empty dataframe
df_compositions = pd.DataFrame()

# unique columns for all products
aux = []

df_pattern = pd.DataFrame( columns=['Art. No.', 'Composition', 'Fit', 'Product safety', 'Size', 'More sustainable materials'] )
for i in range( len( data ) ):
    # API Requests
    url = 'https://www2.hm.com/en_us/productpage.' + str(data.loc[i, 'product_id']) + '.html'
    #url = 'https://www2.hm.com/en_us/productpage.' + '0985197001' + '.html'
    print('Product ID: {}'.format(url))
    page = requests.get( url, headers=headers )
    
    # Beautiful Soup object
    soup = BeautifulSoup( page.text, 'html.parser' )
    
    # ==================== color name =================================
    product_list = soup.find_all( 'a', class_='filter-option miniature active' ) + soup.find_all( 'a', class_='filter-option miniature' )
    color_name = [p.get( 'data-color' ) for p in product_list]
    
    # product id
    product_id = [p.get( 'data-articlecode' ) for p in product_list]
    df_color = pd.DataFrame( [product_id, color_name] ).T
    df_color.columns = ['product_id', 'color_name']
    
    # ==================== composition =================================
    for j in range(len(df_color)):
        # API Requests
        url = 'https://www2.hm.com/en_us/productpage.' + (df_color.loc[j, 'product_id']) + '.html'
        print('Color: {}'.format(url))
        
        page = requests.get( url, headers=headers )

        # Beautiful Soup object
        soup = BeautifulSoup( page.text, 'html.parser' )
        
        # ====================== Product Name =======================
        product_name = soup.find_all( 'h1', class_='primary product-item-headline' )
        product_name = product_name[0].get_text()
        
        # ====================== Product Price =======================
        product_price = soup.find_all( 'div', class_='primary-row product-item-price')
        product_price = re.findall( r'\d+\.?\d+', product_price[0].get_text() )[0]
        
        # ====================== Composition =========================
        product_composition_list = soup.find_all( 'div', class_='pdp-description-list-item' )
        product_composition = [list( filter( None, p.get_text().split( '\n' ) ) ) for p in product_composition_list]

        # rename dataframe
        df_composition = pd.DataFrame( product_composition ).T
        df_composition.columns = df_composition.iloc[0]

        # delete first row
        df_composition = df_composition.iloc[1:].fillna( method='ffill' )

        # remove pocket lining, shell and lining
        df_composition['Composition'] = df_composition['Composition'].str.replace( 'Pocket lining: ', '', regex=True)
        df_composition['Composition'] = df_composition['Composition'].str.replace( 'Shell: ', '', regex=True)
        df_composition['Composition'] = df_composition['Composition'].str.replace( 'Lining: ', '', regex=True)

        # garantee the same number of columns
        df_composition = pd.concat( [df_pattern, df_composition], axis=0 )

        # Rename Columns
        df_composition.columns = ['product_id', 'composition', 'fit', 'product_safety', 'size', 'more_sustainable_materials']
        df_composition['product_name'] = product_name
        df_composition['product_price'] = product_price
        
        # Keep new columns if it shows up
        aux = aux + df_composition.columns.tolist()

        # merge data color + decomposition
        df_composition = pd.merge( df_composition, df_color, how='left', on='product_id' )
        
        # all details products
        df_compositions = pd.concat( [df_compositions, df_composition], axis=0 )

# Join Showroom data + details
df_compositions['style_id'] = df_compositions['product_id'].apply( lambda x: x[:-3] )
df_compositions['color_id'] = df_compositions['product_id'].apply( lambda x: x[-3:] )

# scrapy datetime
df_compositions['scrapy_datetime'] = datetime.now().strftime( '%Y-%m-%d %H:%M:%S' )

Product ID: https://www2.hm.com/en_us/productpage.0690449022.html
Color: https://www2.hm.com/en_us/productpage.0690449022.html
Color: https://www2.hm.com/en_us/productpage.0690449001.html
Color: https://www2.hm.com/en_us/productpage.0690449002.html
Color: https://www2.hm.com/en_us/productpage.0690449006.html
Color: https://www2.hm.com/en_us/productpage.0690449007.html
Color: https://www2.hm.com/en_us/productpage.0690449009.html
Color: https://www2.hm.com/en_us/productpage.0690449011.html
Color: https://www2.hm.com/en_us/productpage.0690449013.html
Color: https://www2.hm.com/en_us/productpage.0690449021.html
Color: https://www2.hm.com/en_us/productpage.0690449024.html
Color: https://www2.hm.com/en_us/productpage.0690449028.html
Color: https://www2.hm.com/en_us/productpage.0690449035.html
Color: https://www2.hm.com/en_us/productpage.0690449036.html
Color: https://www2.hm.com/en_us/productpage.0690449040.html
Color: https://www2.hm.com/en_us/productpage.0690449043.html
Color: https://www2

Color: https://www2.hm.com/en_us/productpage.0690449028.html
Color: https://www2.hm.com/en_us/productpage.0690449035.html
Color: https://www2.hm.com/en_us/productpage.0690449036.html
Color: https://www2.hm.com/en_us/productpage.0690449040.html
Color: https://www2.hm.com/en_us/productpage.0690449046.html
Color: https://www2.hm.com/en_us/productpage.0690449051.html
Product ID: https://www2.hm.com/en_us/productpage.0985197007.html
Color: https://www2.hm.com/en_us/productpage.0985197007.html
Color: https://www2.hm.com/en_us/productpage.0985197001.html
Color: https://www2.hm.com/en_us/productpage.0985197002.html
Color: https://www2.hm.com/en_us/productpage.0985197003.html
Color: https://www2.hm.com/en_us/productpage.0985197004.html
Color: https://www2.hm.com/en_us/productpage.0985197005.html
Color: https://www2.hm.com/en_us/productpage.0985197006.html
Product ID: https://www2.hm.com/en_us/productpage.0985197003.html
Color: https://www2.hm.com/en_us/productpage.0985197003.html
Color: https:/

Product ID: https://www2.hm.com/en_us/productpage.0814631006.html
Color: https://www2.hm.com/en_us/productpage.0814631006.html
Color: https://www2.hm.com/en_us/productpage.0814631001.html
Color: https://www2.hm.com/en_us/productpage.0814631002.html
Color: https://www2.hm.com/en_us/productpage.0814631003.html
Color: https://www2.hm.com/en_us/productpage.0814631004.html
Color: https://www2.hm.com/en_us/productpage.0814631005.html
Color: https://www2.hm.com/en_us/productpage.0814631007.html
Color: https://www2.hm.com/en_us/productpage.0814631012.html
Color: https://www2.hm.com/en_us/productpage.0814631014.html
Color: https://www2.hm.com/en_us/productpage.0814631018.html
Product ID: https://www2.hm.com/en_us/productpage.1004199001.html
Color: https://www2.hm.com/en_us/productpage.1004199001.html
Color: https://www2.hm.com/en_us/productpage.1004199002.html
Color: https://www2.hm.com/en_us/productpage.1004199003.html
Product ID: https://www2.hm.com/en_us/productpage.1008549004.html
Color: ht

In [8]:
df_compositions.head()

Unnamed: 0,product_id,composition,fit,product_safety,size,more_sustainable_materials,product_name,product_price,color_name,style_id,color_id,scrapy_datetime
0,690449022,Polyester 100%,Skinny fit,,"The model is 184cm/6'0"" and wears a size 31/32",,\n\t\t\t\t\t\t\t Skinny Jeans,39.99,Black/trashed,690449,22,2021-09-11 20:40:01
1,690449022,"Cotton 98%, Elastane 2%",Skinny fit,,"The model is 184cm/6'0"" and wears a size 31/32",,\n\t\t\t\t\t\t\t Skinny Jeans,39.99,Black/trashed,690449,22,2021-09-11 20:40:01
0,690449001,"Cotton 99%, Elastane 1%",Skinny fit,,,,\n\t\t\t\t\t\t\t Skinny Jeans,16.99,Light denim blue/trashed,690449,1,2021-09-11 20:40:01
0,690449002,"Cotton 98%, Elastane 2%",Skinny fit,,,,\n\t\t\t\t\t\t\t Skinny Jeans,14.99,Denim blue,690449,2,2021-09-11 20:40:01
0,690449006,Cotton 100%,Skinny fit,,,,\n\t\t\t\t\t\t\t Skinny Jeans,7.99,Black/washed,690449,6,2021-09-11 20:40:01


# Data Cleaning

In [31]:
# product id
df_data = df_compositions.dropna( subset=['product_id'] )

# product name
df_data['product_name'] = df_data['product_name'].str.replace( '\n', '')
df_data['product_name'] = df_data['product_name'].str.replace( '\t', '')
df_data['product_name'] = df_data['product_name'].str.replace( '  ', '')
df_data['product_name'] = df_data['product_name'].str.replace( ' ', '_').str.lower()

# product price
df_data['product_price'] = df_data['product_price'].astype( float )

# color name
df_data['color_name'] = df_data['color_name'].str.replace(' ', '_').str.lower()

## fit
df_data['fit'] = df_data['fit'].apply( lambda x: x.replace( ' ', '_' ).lower() if pd.notnull( x ) else x )

# size number
df_data['size_number'] = df_data['size'].apply( lambda x: re.search( '\d{3}cm', x ).group(0) if pd.notnull( x ) else x )
df_data['size_number'] = df_data['size_number'].apply( lambda x: re.search( '\d+', x).group(0) if pd.notnull( x ) else x )

# size model
df_data['size_model'] = df_data['size'].str.extract( '(\d+/\\d+)' )

# break composition by comma
df1 = df_data['composition'].str.split( ',', expand=True ).reset_index(drop=True)

# cotton | polyester | elastano | elasterell
df_ref = pd.DataFrame( index=np.arange( len( df_data ) ), columns=['cotton','polyester', 'elastane', 'elasterell'] )

# ==================== composition ============================ #
# ---------cotton--------
df_cotton_0 = df1.loc[df1[0].str.contains( 'Cotton', na=True), 0]
df_cotton_0.name = 'cotton'

df_cotton_1 = df1.loc[df1[1].str.contains( 'Cotton', na=True), 1]
df_cotton_1.name = 'cotton'

# combine
df_cotton = df_cotton_0.combine_first( df_cotton_1 )

df_ref = pd.concat( [df_ref, df_cotton], axis=1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last')]
# ----------polyester--------------
df_polyester_0 = df1.loc[df1[0].str.contains( 'Polyester', na=True), 0]
df_polyester_0.name = 'polyester'

df_polyester_1 = df1.loc[df1[1].str.contains( 'Polyester', na=True), 1]
df_polyester_1.name = 'polyester'

# combine
df_polyester = df_polyester_0.combine_first( df_polyester_1 )

df_ref = pd.concat( [df_ref, df_polyester], axis=1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last')]

# ------------elastano------------
df_elastane_1 = df1.loc[df1[1].str.contains( 'Elastane', na=True), 0]
df_elastane_1.name = 'Elastane'

df_elastane_2 = df1.loc[df1[2].str.contains( 'Elastane', na=True), 2]
df_elastane_2.name = 'Elastane'

df_elastane_3 = df1.loc[df1[3].str.contains( 'Elastane', na=True), 3]
df_elastane_3.name = 'Elastane'

# combine
df_elastane_c2 = df_elastane_1.combine_first( df_elastane_2 )
df_elastane = df_elastane_c2.combine_first( df_elastane_3 )

df_ref = pd.concat( [df_ref, df_elastane], axis=1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last')]

# -----------elasterell-----------------
df_elasterell = df1.loc[df1[1].str.contains( 'Elasterell', na=True), 0]
df_elasterell.name = 'elasterell'

df_ref = pd.concat( [df_ref, df_elasterell], axis=1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last')]

# fjoin of combine with product_id
df_aux = pd.concat( [df_data['product_id'].reset_index(drop=True), df_ref], axis=1)

# format composition data
df_aux['cotton'] = df_aux['cotton'].apply( lambda x: int( re.search( '\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
df_aux['polyester'] = df_aux['polyester'].apply( lambda x: int( re.search( '\d+', x).group(0) ) / 100 if pd.notnull( x ) else x )
df_aux['elastane'] = df_aux['elastane'].apply( lambda x: int( re.search( '\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
df_aux['elasterell'] = df_aux['elasterell'].apply( lambda x: int( re.search('\d+',x ).group(0) ) / 100 if pd.notnull( x ) else x )

# final join
df_aux = df_aux.groupby( 'product_id').max().reset_index().fillna( 0 )
df_data = pd.merge( df_data, df_aux, on='product_id', how='left')

# Drop columns
df_data = df_data.drop( columns=['size', 'product_safety', 'composition', 'more_sustainable_materials'], axis=1 )

# Drop duplicates
df_data = df_data.drop_duplicates()
df_data.shape

  df_aux = df_aux.groupby( 'product_id').max().reset_index().fillna( 0 )


(112, 14)

In [33]:
df_data.head(20)

Unnamed: 0,product_id,fit,product_name,product_price,color_name,style_id,color_id,scrapy_datetime,size_number,size_model,elastane,cotton,polyester,elasterell
0,690449022,skinny_fit,skinny_jeans,39.99,black/trashed,690449,22,2021-09-11 20:40:01,184.0,31/32,0.0,0.98,1.0,1.0
2,690449001,skinny_fit,skinny_jeans,16.99,light_denim_blue/trashed,690449,1,2021-09-11 20:40:01,,,0.0,0.99,0.0,0.0
3,690449002,skinny_fit,skinny_jeans,14.99,denim_blue,690449,2,2021-09-11 20:40:01,,,0.0,0.98,0.0,0.0
4,690449006,skinny_fit,skinny_jeans,7.99,black/washed,690449,6,2021-09-11 20:40:01,,,0.0,1.0,0.0,1.0
6,690449007,skinny_fit,skinny_jeans,14.99,light_denim_blue,690449,7,2021-09-11 20:40:01,,,0.0,1.0,0.0,1.0
8,690449009,skinny_fit,skinny_jeans,19.99,black_washed_out,690449,9,2021-09-11 20:40:01,,,0.0,0.99,0.0,0.0
9,690449011,skinny_fit,skinny_jeans,19.99,white,690449,11,2021-09-11 20:40:01,,,0.0,0.99,0.0,0.0
10,690449013,skinny_fit,skinny_jeans,27.99,black/washed,690449,13,2021-09-11 20:40:01,,,0.0,0.98,1.0,1.0
12,690449021,skinny_fit,skinny_jeans,25.99,dark_denim_blue/trashed,690449,21,2021-09-11 20:40:01,,,0.0,0.98,1.0,1.0
14,690449024,skinny_fit,skinny_jeans,20.99,dark_blue/trashed,690449,24,2021-09-11 20:40:01,,,0.0,0.98,1.0,1.0
