## Perfume Data Preprocessing

### Concatenate amber_aromatic_chypre.csv, citrus_floral_leather.csv, and used_dataset.csv into one DataFrame

In [107]:
import pandas as pd

# Load the datasets using relative paths
amber_aromatic_chypre = pd.read_csv('../../Crawling/dataset/perfume-info-raw/amber_aromatic_chypre.csv')
citrus_floral_leather = pd.read_csv('../../Crawling/dataset/perfume-info-raw/citrus_floral_leather.csv')
used_dataset = pd.read_csv('../../Crawling/dataset/perfume-info-raw/used_dataset.csv')

# Combine the three datasets into one
df_concat = pd.concat([used_dataset, amber_aromatic_chypre, citrus_floral_leather], ignore_index=True)

In [108]:
df_concat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1081 entries, 0 to 1080
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          1081 non-null   object
 1   company       1081 non-null   object
 2   image         1081 non-null   object
 3   for_gender    1081 non-null   object
 4   main accords  1074 non-null   object
 5   top notes     905 non-null    object
 6   middle notes  905 non-null    object
 7   base notes    905 non-null    object
dtypes: object(8)
memory usage: 67.7+ KB
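
The `info()` output above shows a fresh `RangeIndex` and fewer non-null values in the accord and note columns than in `name`. A minimal sketch of both effects on toy data follows; the frames `part_a` and `part_b` and their rows are hypothetical stand-ins, not the real CSV contents.

```python
import pandas as pd

# Toy stand-ins for the source files (hypothetical rows, not the real data)
part_a = pd.DataFrame({"name": ["A", "B"], "top notes": ["['Bergamot']", None]})
part_b = pd.DataFrame({"name": ["B", "C"], "top notes": ["['Lemon']", "['Rose']"]})

# ignore_index=True discards each frame's own index and renumbers rows 0..n-1
combined = pd.concat([part_a, part_b], ignore_index=True)

# Missing values per column: the complement of the "Non-Null Count" in info()
missing_per_column = combined.isna().sum()

print(list(combined.index))
print(missing_per_column.to_dict())
```

Without `ignore_index=True`, each frame would keep its own labels and the combined index would contain repeated values.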


### Remove duplicate rows based on the 'name' column

In [109]:
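# .copy() makes df an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df = df_concat.drop_duplicates(subset='name', keep='first').copy()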
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1074 entries, 0 to 1080
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          1074 non-null   object
 1   company       1074 non-null   object
 2   image         1074 non-null   object
 3   for_gender    1074 non-null   object
 4   main accords  1067 non-null   object
 5   top notes     900 non-null    object
 6   middle notes  900 non-null    object
 7   base notes    900 non-null    object
dtypes: object(8)
memory usage: 75.5+ KB
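
After deduplication, `info()` reports `Index: 1074 entries, 0 to 1080` instead of a `RangeIndex`, because the surviving rows keep their original labels. The sketch below illustrates this on a hypothetical four-row frame (`toy` and its values are made up for illustration) and shows an optional `reset_index` step that the notebook itself does not perform.

```python
import pandas as pd

# Hypothetical rows: the name "A" appears twice with different companies
toy = pd.DataFrame({
    "name": ["A", "B", "A", "C"],
    "company": ["X", "Y", "Z", "X"],
})

# keep='first' retains the earliest row for each name and drops later repeats
deduped = toy.drop_duplicates(subset="name", keep="first")

# Surviving rows keep their original labels, so the index now has a gap
print(list(deduped.index))

# Optional: renumber the rows 0..n-1 if a contiguous index is preferred
deduped_reset = deduped.reset_index(drop=True)
print(list(deduped_reset.index))
```

Note that deduplicating on `name` alone discards a second row even when its `company` differs; whether that is desirable depends on how names are formed in the raw data.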


In [110]:
df.head()

Unnamed: 0,name,company,image,for_gender,main accords,top notes,middle notes,base notes
0,Angels' Share By Kilian,By Kilian,https://fimgs.net/mdimg/perfume/375x500.62615.jpg,for women and men,"{'woody': 100.0, 'sweet': 92.6987, 'warm spicy...",['Cognac'],"['Cinnamon', 'Tonka Bean', 'Oak']","['Praline', 'Vanilla', 'Sandalwood']"
1,My Way Giorgio Armani Giorgio Armani,Giorgio Armani,https://fimgs.net/mdimg/perfume/375x500.62036.jpg,for women,"{'white floral': 100.0, 'citrus': 60.4322, 'tu...","['Orange Blossom', 'Bergamot']","['Tuberose', 'Indian Jasmine']","['White Musk', 'Madagascar Vanilla', 'Virginia..."
2,Libre Intense Yves Saint Laurent Yves Saint La...,Yves Saint Laurent,https://fimgs.net/mdimg/perfume/375x500.62318.jpg,for women,"{'vanilla': 100.0, 'aromatic': 71.4216, 'sweet...","['Lavender', 'Mandarin Orange', 'Bergamot']","['Lavender', 'Tunisian Orange Blossom', 'Jasmi...","['Madagascar Vanilla', 'Tonka Bean', 'Ambergri..."
3,Dior Homme 2020 Christian Dior Christian Dior,Christian Dior,https://fimgs.net/mdimg/perfume/375x500.58714.jpg,for men,"{'woody': 100.0, 'musky': 72.7229, 'amber': 53...","['Bergamot', 'Pink Pepper', 'elemi']","['Cashmere Wood', 'Atlas Cedar', 'Patchouli']","['Iso E Super', 'Haitian Vetiver', 'White Musk']"
4,Acqua di Giò Profondo Giorgio Armani Giorgio A...,Giorgio Armani,https://fimgs.net/mdimg/perfume/375x500.59532.jpg,for men,"{'aromatic': 100.0, 'marine': 93.2493, 'citrus...","['Sea Notes', 'Aquozone', 'Bergamot', 'Green M...","['Rosemary', 'Cypress', 'Lavender', 'Mastic or...","['Mineral notes', 'Musk', 'Patchouli', 'Amber']"


### Combine Main Accords and Top, Middle, and Base Notes into a single text column (create a new 'notes' column)

In [111]:
import ast

# Extract the keys of the 'main accords' dict string as one comma-separated string
def extract_keys_as_string(accords_str):
    if pd.notna(accords_str):
        try:
            return ', '.join(ast.literal_eval(accords_str).keys())
        except Exception as e:
            print(f"Error parsing accords_str: {accords_str}, error: {e}")
            return ''
    else:
        return ''

# Store the extracted accord names in the temporary 'MA' column
df['MA'] = df['main accords'].apply(extract_keys_as_string)

# Parse a list-literal string; return an empty list if it cannot be parsed
def safe_eval(note):
    try:
        return ast.literal_eval(note)
    except (ValueError, SyntaxError):
        return []

# Merge the top, middle, and base notes of one row into a single string
def extract_notes(notes):
    combined = []
    for note in notes:
        if pd.notnull(note):
            combined.extend(safe_eval(note))
    return ', '.join(combined)

# Store the merged notes in the temporary 'N' column
df['N'] = df[['top notes', 'middle notes', 'base notes']].apply(extract_notes, axis=1)


In [112]:
# Combine the 'MA' and 'N' columns into a new 'notes' column
df['notes'] = df[['MA', 'N']].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)

# Drop the temporary 'MA' and 'N' columns
df.drop(columns=['MA', 'N'], inplace=True)


In [113]:
df.head()

Unnamed: 0,name,company,image,for_gender,main accords,top notes,middle notes,base notes,notes
0,Angels' Share By Kilian,By Kilian,https://fimgs.net/mdimg/perfume/375x500.62615.jpg,for women and men,"{'woody': 100.0, 'sweet': 92.6987, 'warm spicy...",['Cognac'],"['Cinnamon', 'Tonka Bean', 'Oak']","['Praline', 'Vanilla', 'Sandalwood']","woody, sweet, warm spicy, vanilla, cinnamon, a..."
1,My Way Giorgio Armani Giorgio Armani,Giorgio Armani,https://fimgs.net/mdimg/perfume/375x500.62036.jpg,for women,"{'white floral': 100.0, 'citrus': 60.4322, 'tu...","['Orange Blossom', 'Bergamot']","['Tuberose', 'Indian Jasmine']","['White Musk', 'Madagascar Vanilla', 'Virginia...","white floral, citrus, tuberose, animalic, Oran..."
2,Libre Intense Yves Saint Laurent Yves Saint La...,Yves Saint Laurent,https://fimgs.net/mdimg/perfume/375x500.62318.jpg,for women,"{'vanilla': 100.0, 'aromatic': 71.4216, 'sweet...","['Lavender', 'Mandarin Orange', 'Bergamot']","['Lavender', 'Tunisian Orange Blossom', 'Jasmi...","['Madagascar Vanilla', 'Tonka Bean', 'Ambergri...","vanilla, aromatic, sweet, white floral, lavend..."
3,Dior Homme 2020 Christian Dior Christian Dior,Christian Dior,https://fimgs.net/mdimg/perfume/375x500.58714.jpg,for men,"{'woody': 100.0, 'musky': 72.7229, 'amber': 53...","['Bergamot', 'Pink Pepper', 'elemi']","['Cashmere Wood', 'Atlas Cedar', 'Patchouli']","['Iso E Super', 'Haitian Vetiver', 'White Musk']","woody, musky, amber, aromatic, Bergamot, Pink ..."
4,Acqua di Giò Profondo Giorgio Armani Giorgio A...,Giorgio Armani,https://fimgs.net/mdimg/perfume/375x500.59532.jpg,for men,"{'aromatic': 100.0, 'marine': 93.2493, 'citrus...","['Sea Notes', 'Aquozone', 'Bergamot', 'Green M...","['Rosemary', 'Cypress', 'Lavender', 'Mastic or...","['Mineral notes', 'Musk', 'Patchouli', 'Amber']","aromatic, marine, citrus, fresh spicy, woody, ..."
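
The parsing and joining steps above can be sketched end to end on a single row. This is a minimal illustration under stated assumptions, not the notebook's exact code: the `sample` row is hypothetical, and the helpers `parse_literal` and `build_notes` are names introduced here. Unlike the cells above, this variant joins everything in one pass, so a row with no accords does not produce a leading `', '`.

```python
import ast
import pandas as pd

# Hypothetical single-row sample that mimics the raw string format shown above
sample = pd.DataFrame({
    "main accords": ["{'woody': 100.0, 'sweet': 92.6987}"],
    "top notes": ["['Cognac']"],
    "middle notes": ["['Cinnamon', 'Tonka Bean', 'Oak']"],
    "base notes": [None],  # missing value, as in rows without note data
})

def parse_literal(text, default):
    """Parse a Python-literal string; fall back to `default` for NaN or bad input."""
    if pd.isna(text):
        return default
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return default

def build_notes(row):
    """Join accord names and all notes of one row, skipping missing parts."""
    parts = list(parse_literal(row["main accords"], {}))  # dict keys only
    for column in ("top notes", "middle notes", "base notes"):
        parts.extend(parse_literal(row[column], []))
    return ", ".join(parts)

sample["notes"] = sample.apply(build_notes, axis=1)
print(sample.loc[0, "notes"])
```

`ast.literal_eval` only evaluates Python literals (dicts, lists, strings, numbers), which makes it a safer choice than `eval` for strings that come from scraped data.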


In [116]:
# Replace the empty-list placeholder '[]' in the 'top notes', 'middle notes', and 'base notes' columns with an empty string
df['top notes'] = df['top notes'].apply(lambda x: '' if x == '[]' else x)
df['middle notes'] = df['middle notes'].apply(lambda x: '' if x == '[]' else x)
df['base notes'] = df['base notes'].apply(lambda x: '' if x == '[]' else x)

# Replace the empty-dict placeholder '{}' in the 'main accords' column with an empty string
df['main accords'] = df['main accords'].apply(lambda x: '' if x == '{}' else x)

In [117]:
df.to_csv("final_perfume-info.csv", index=False)
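
The placeholder cleanup and the final save can also be sketched with a single `DataFrame.replace` call followed by an in-memory CSV round trip. The frame `toy` below is hypothetical, and `io.StringIO` stands in for the real output file; this is one possible alternative to the per-column `apply` calls, not the notebook's own method.

```python
import io
import pandas as pd

# Hypothetical rows containing the empty placeholders '{}' and '[]'
toy = pd.DataFrame({
    "name": ["A", "B"],
    "main accords": ["{'woody': 100.0}", "{}"],
    "top notes": ["[]", "['Bergamot']"],
})

# replace() matches whole cell values here, so only exact '[]' or '{}' cells change
cleaned = toy.replace(["[]", "{}"], "")

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Empty strings are written as empty fields and come back as NaN
print(reloaded.isna().sum().to_dict())
```

Because empty fields are read back as NaN by default, rows cleaned to `''` and rows that were missing from the start become indistinguishable once the CSV is reloaded.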