## 1. Introduction
Noon is a large online marketplace in the Middle East, and this dataset lists the scent notes of perfumes from different brands. Perfume is a product category with a high profit margin in this region, so examining the data to understand trends and popular scents would help when designing a new product.

## 2. Objectives
Charts make it easier to get a general picture of the market and to understand the current situation. This report therefore sets out to answer the following questions.

1. Which kinds of perfume offer the highest discounts, which may encourage customers to purchase?
2. Is there any correlation between the price and the review stars?
3. Which scent notes are commonly used in each department category (men, women, unisex)?

## 3. Method
Python programming

## 4. Prepare
* First, import the libraries that will be used, then start analyzing the dataset.

In [None]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure as fig
import seaborn as sns
from scipy.stats import linregress
from wordcloud import WordCloud, STOPWORDS

* Read the .csv file.

In [None]:
data = pd.read_csv('../input/noon-perfume/noon_perfumes_dataset.csv')

## 5. Process

* Check what the data looks like and determine whether there are any null values.

In [None]:
print(data.shape)
print(data.isna().sum())
data.head()

In [None]:
# unique() keeps each brand once, in order of first appearance
brand_list=data['brand'].unique().tolist()
print(len(brand_list))
print(brand_list)

In [None]:
# unique() keeps each scent once, in order of first appearance
scent_list=data['scents'].unique().tolist()
print(len(scent_list))
print(scent_list)

* Add columns for the price deduction, deduction range, and price per ml.

In [None]:
data['price_deduction']=data['old_price']-data['new_price']
data['%_of_deduction']=((data['price_deduction']/data['old_price']).round(4))*100
data['price_per_ml']=(data['new_price']/data['ml']).round(2)
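
* The derived columns above divide by `old_price` and `ml`, so a zero in either column would produce `inf`. The following is a minimal sketch on made-up rows (not taken from the Noon dataset) of one way to guard against that, assuming zeros can occur at all.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, not from the Noon dataset)
toy = pd.DataFrame({
    'old_price': [200.0, 150.0, 0.0],
    'new_price': [150.0, 150.0, 0.0],
    'ml': [100, 50, 0],
})

toy['price_deduction'] = toy['old_price'] - toy['new_price']
# Swap zero denominators for NaN so the ratios become NaN instead of inf
safe_old_price = toy['old_price'].replace(0, np.nan)
safe_ml = toy['ml'].replace(0, np.nan)
toy['%_of_deduction'] = (toy['price_deduction'] / safe_old_price * 100).round(2)
toy['price_per_ml'] = (toy['new_price'] / safe_ml).round(2)
print(toy)
```

Rows with a NaN ratio can then be dropped or inspected separately before plotting.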

In [None]:
data

* Split the data by the department column into 3 categories in order to investigate whether there is a pattern hidden inside.

In [None]:
men=data[data['department']=='Men']
women=data[data['department']=='Women']
unisex=data[data['department']=='Unisex']
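
* The three filters above match exact strings, so a row with any other department label would silently fall outside all three subsets. The sketch below, on made-up labels (the `'Kids'` value is hypothetical), shows one way to check for such rows.

```python
import pandas as pd

# Toy column; 'Kids' is a hypothetical label used only to trigger the check
toy = pd.DataFrame({'department': ['Men', 'Women', 'Unisex', 'Women', 'Kids']})

known = ['Men', 'Women', 'Unisex']
# How many rows carry each label
counts = toy['department'].value_counts()
# Rows that none of the three filters would pick up
uncovered = toy[~toy['department'].isin(known)]
print(counts)
print(len(uncovered), 'row(s) outside the three categories')
```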

## 6. Analyze

* Plot an overlaid bar chart (average old price behind average new price) to compare the price reduction across the department categories.

In [None]:
avg_price_men=men['new_price'].mean()
avg_discount_men=men['price_deduction'].mean()

In [None]:
# Set the plotting context before drawing so it applies to this figure
sns.set_context("talk")
avg_old_price = data.groupby('department')['old_price'].mean().reset_index()
avg_new_price = data.groupby('department')['new_price'].mean().reset_index()
# Draw the old price first, then overlay the new price on the same axes;
# the pink segment left visible above each blue bar is the average reduction
sns.barplot(x="department", y="old_price", data=avg_old_price, color='pink', label='Average Price Reduction')
sns.barplot(x="department", y="new_price", data=avg_new_price, color='lightblue', label='Average New Price')
plt.title('Price Reduction')
plt.xlabel('Category')
plt.ylabel('Price')
plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
plt.show()
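
* The chart compares absolute prices, while the first question is about discounts. As a rough sketch on made-up prices (the `pct_discount` column name is an assumption, not part of the dataset), the relative discount per department could be summarized like this.

```python
import pandas as pd

# Toy rows (hypothetical values, not from the Noon dataset)
toy = pd.DataFrame({
    'department': ['Men', 'Men', 'Women', 'Women', 'Unisex'],
    'old_price': [200.0, 100.0, 300.0, 100.0, 80.0],
    'new_price': [150.0, 90.0, 150.0, 100.0, 76.0],
})
# Discount as a percentage of the old price
toy['pct_discount'] = (toy['old_price'] - toy['new_price']) / toy['old_price'] * 100

# Mean, median and item count per department, largest mean discount first
summary = (
    toy.groupby('department')['pct_discount']
       .agg(['mean', 'median', 'count'])
       .sort_values('mean', ascending=False)
)
print(summary)
```

Showing the median next to the mean helps when a few heavily discounted items dominate a category.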

* This histogram shows that unisex perfumes are not as expensive as men's and women's ones.

In [None]:
plt.hist(men['price_per_ml'], alpha=0.4, label='Men', density=False)
plt.hist(women['price_per_ml'], alpha=0.4, label='Women', density=False)
plt.hist(unisex['price_per_ml'], alpha=0.4, label='Unisex', density=False)
plt.yscale('log')
plt.xticks(ticks=[0,2,4,6,8,10,12], labels=None)
plt.title('Comparison of the Price per ml')
plt.xlabel('Price per ml')
plt.ylabel('Number of Items')
plt.legend()
plt.show()

* The scatter charts show that more women's perfumes still get 4-5 star reviews, although the price is slightly higher.

In [None]:
rate_jitter=data['item_rating']+np.random.normal(0,0.1,size=len(data))
sns.relplot(x=data['price_per_ml'], y=rate_jitter, kind='scatter', hue=data['department'], alpha=0.2, size=data['price_per_ml'], col=data['department'], col_wrap=2)
plt.xticks(ticks=[0,2,4,6,8,10,12], labels=None)
plt.show()

In [None]:
regression=linregress(men['price_per_ml'], men['item_rating'])
print(regression)
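
* The result printed above contains more than the slope. The sketch below uses synthetic data (not the Noon dataset) to show how `rvalue` and `pvalue` could be read to judge how strong a price-rating relationship is.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data (hypothetical): ratings rise only slightly with price
rng = np.random.default_rng(0)
price = rng.uniform(0.5, 10, size=200)
rating = 4.0 + 0.02 * price + rng.normal(0, 0.5, size=200)

result = linregress(price, rating)
# r squared is the share of rating variance explained by the fitted line
r_squared = result.rvalue ** 2
print(f"slope={result.slope:.3f}, r^2={r_squared:.3f}, p={result.pvalue:.3f}")
```

A positive slope with a small r squared would mean the trend exists but explains little of the variation in ratings.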

* The chart below suggests a trend in which a higher-priced product tends to receive a better review. Only the chart for men's perfume is shown here.

In [None]:
fx=np.array([men['price_per_ml'].min(), men['price_per_ml'].max()])
fy=regression.intercept+regression.slope*fx
rate_jitter_men=men['item_rating']+np.random.normal(0,0.1,size=len(men))
sns.relplot(x=men['price_per_ml'], y=rate_jitter_men, kind='scatter', alpha=0.2, size=men['price_per_ml'])
plt.xticks(ticks=[0,2,4,6,8,10,12], labels=None)
plt.plot(fx, fy, '--', color='red')
plt.xscale("log")
plt.title('Regression between the Price and the Stars', y=1.04)
plt.xlabel('Price per ml')
plt.ylabel('Stars')
plt.show()

* This section shows the note preferences of the different department categories. A bar chart, instead of a table, displays the top 10 scent notes.

In [None]:
# Join the base and middle notes; the extra space keeps the last base note
# from fusing with the first middle note
men_base_str=men['base_note'].str.cat(sep=' ')
men_middle_str=men['middle_note'].str.cat(sep=' ')
men_str=men_base_str+' '+men_middle_str
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
# Drop punctuation character by character
men_str_nopunc = ''.join(ch for ch in men_str if ch not in punctuations)
# Replace filler words with a space so the neighbouring notes stay separate
men_str_clean_and=men_str_nopunc.replace(' and ', ' ')
men_str_clean_And=men_str_clean_and.replace(' And ', ' ')
men_str_cleaned=men_str_clean_And.replace(' Notes ', ' ')

In [None]:
# Define a function to count how often each word appears in a string
def word_count(text):
    counts = {}
    words = text.split()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

men_note_lib=word_count(men_str_cleaned)
sort_men = sorted(men_note_lib.items(), key=lambda x: x[1], reverse=True)

# Convert the dictionary to a DataFrame
men_note_df=pd.DataFrame(list(men_note_lib.items()), columns=['note', 'frequency'])
men_note_df.sort_values('frequency', ascending=False).head(10)
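
* The hand-written counter above could also be expressed with `collections.Counter` from the standard library. This is only an alternative sketch on a made-up string, not the method used in this notebook.

```python
from collections import Counter

# Made-up note string for illustration
sample = "Amber Musk Cedar Amber Vanilla Musk Amber"

# Counter tallies each word; most_common returns them sorted by frequency
counts = Counter(sample.split())
top2 = counts.most_common(2)
print(top2)  # [('Amber', 3), ('Musk', 2)]
```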

In [None]:
men_note_top10=men_note_df.sort_values('frequency', ascending=False).head(10)
sns.catplot(x='frequency', y='note', data=men_note_top10, kind='bar', height=4, aspect=16/9)
plt.title('Top 10 Scent Notes for Men')
plt.xlabel('Frequency')
plt.ylabel('Scent Notes')
plt.show()

In [None]:
# Prepare the string of women's notes; the extra space keeps the last base
# note from fusing with the first middle note
women_base_str=women['base_note'].str.cat(sep=' ')
women_middle_str=women['middle_note'].str.cat(sep=' ')
women_str=women_base_str+' '+women_middle_str
# Drop punctuation character by character
women_str_nopunc = ''.join(ch for ch in women_str if ch not in punctuations)
# Replace filler words with a space so the neighbouring notes stay separate
women_str_clean_and=women_str_nopunc.replace(' and ', ' ')
women_str_clean_And=women_str_clean_and.replace(' And ', ' ')
women_str_cleaned=women_str_clean_And.replace(' Notes ', ' ')


#display the top 10 scent notes for women
women_note_lib=word_count(women_str_cleaned)
sort_women = sorted(women_note_lib.items(), key=lambda x: x[1], reverse=True)
women_note_df=pd.DataFrame(women_note_lib.items(), columns=['note', 'frequency'])
women_note_df.sort_values('frequency', ascending=False).head(10)

In [None]:
women_note_top10=women_note_df.sort_values('frequency', ascending=False).head(10)
sns.catplot(x='frequency', y='note', data=women_note_top10, kind='bar', height=4, aspect=16/9)
plt.title('Top 10 Scent Notes for Women')
plt.xlabel('Frequency')
plt.ylabel('Scent Notes')
plt.show()

In [None]:
# Prepare the string of unisex notes; the extra space keeps the last base
# note from fusing with the first middle note
unisex_base_str=unisex['base_note'].str.cat(sep=' ')
unisex_middle_str=unisex['middle_note'].str.cat(sep=' ')
unisex_str=unisex_base_str+' '+unisex_middle_str
# Drop punctuation character by character
unisex_str_nopunc = ''.join(ch for ch in unisex_str if ch not in punctuations)
# Replace filler words with a space so the neighbouring notes stay separate
unisex_str_clean_and=unisex_str_nopunc.replace(' and ', ' ')
unisex_str_clean_And=unisex_str_clean_and.replace(' And ', ' ')
unisex_str_cleaned=unisex_str_clean_And.replace(' Notes ', ' ')

#display the top 10 scent notes for unisex
unisex_note_lib=word_count(unisex_str_cleaned)
sort_unisex = sorted(unisex_note_lib.items(), key=lambda x: x[1], reverse=True)
unisex_note_df=pd.DataFrame(unisex_note_lib.items(), columns=['note', 'frequency'])
unisex_note_df.sort_values('frequency', ascending=False).head(10)

In [None]:
unisex_note_top10=unisex_note_df.sort_values('frequency', ascending=False).head(10)
sns.catplot(x='frequency', y='note', data=unisex_note_top10, kind='bar', height=4, aspect=16/9)
plt.title('Top 10 Scent Notes for Unisex')
plt.xlabel('Frequency')
plt.ylabel('Scent Notes')
plt.show()
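
* The cleaning and counting steps are repeated three times above, once per category. A hypothetical helper (`top_notes` and `FILLER_WORDS` are names introduced here, not part of the notebook) could wrap them in one function. Note one deliberate difference: this sketch replaces punctuation with a space rather than deleting it, so hyphenated notes would split into two words.

```python
import string
import pandas as pd

# Assumption: these words are treated as noise rather than scent notes
FILLER_WORDS = {'and', 'notes'}

def top_notes(frame, n=10):
    """Return the n most frequent words in the base and middle note columns."""
    text = frame['base_note'].str.cat(sep=' ') + ' ' + frame['middle_note'].str.cat(sep=' ')
    # Swap punctuation for spaces so "Amber,Musk" still splits into two words
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    words = [w for w in text.translate(table).split() if w.lower() not in FILLER_WORDS]
    return pd.Series(words).value_counts().head(n)

# Toy rows (hypothetical values, not from the Noon dataset)
toy = pd.DataFrame({
    'base_note': ['Amber, Musk', 'Cedar and Amber'],
    'middle_note': ['Rose', 'Amber Notes'],
})
result = top_notes(toy, n=3)
print(result)
```

With such a helper, each category would need a single call, for example `top_notes(men)`.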

* Furthermore, in addition to a bar chart, a word cloud looks fancier and is better at holding the audience's attention.

In [None]:
men_wc=WordCloud(background_color='#e0edfd', width=1080, height=720).generate(men_str_cleaned)
plt.imshow(men_wc, interpolation='bilinear')
plt.axis('off')
plt.title('Scent Notes for Men',size=16, y=1.04)
plt.show()

In [None]:
women_wc=WordCloud(background_color='white', width=1080, height=720).generate(women_str_cleaned)
plt.imshow(women_wc, interpolation='bilinear')
plt.axis('off')
plt.title('Scent Notes for Women',size=16, y=1.04)
plt.show()

In [None]:
unisex_wc=WordCloud(background_color='#000000', width=1080, height=720).generate(unisex_str_cleaned)
plt.imshow(unisex_wc, interpolation='bilinear')
plt.axis('off')
plt.title('Scent Notes for Unisex',size=16, y=1.04)
plt.show()

## 7. Conclusion
In sum, the main findings are as follows.

1. Although there is no big discount for unisex perfumes, they are not as expensive as men's and women's ones.

2. More high-priced women's perfumes get 4-5 star reviews than men's perfumes.

3. The top 5 scent notes per category are listed below.
* Men - Cedar, Amber, Patchouli, Sandalwood, Musk
* Women - Jasmine, Musk, Rose, Vanilla, Sandalwood
* Unisex - Amber, Musk, Rose, Sandalwood, Vanilla

Suggested further analysis:

1. This analysis focuses only on a general view of price and scent notes; a follow-up analysis could apply the same techniques to particular brands.

2. Which combinations of scent notes are common in the different categories?

3. Are there any preferences in perfume concentration or scents across different brands?
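
* For the second suggestion, a possible starting point is to count how often two notes appear in the same product. The sketch below uses made-up note lists and assumes each product's notes have already been split into a list, which this notebook does not do.

```python
from collections import Counter
from itertools import combinations

# Hypothetical note lists, one per product (not from the Noon dataset)
products = [
    ['Amber', 'Musk', 'Rose'],
    ['Amber', 'Musk'],
    ['Rose', 'Vanilla'],
]

pair_counts = Counter()
for notes in products:
    # Sort so ('Amber', 'Musk') and ('Musk', 'Amber') count as the same pair
    pair_counts.update(combinations(sorted(set(notes)), 2))

print(pair_counts.most_common(1))  # [(('Amber', 'Musk'), 2)]
```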