# Objectives

In this notebook I'd like to review the Cartier dataset. Rumor has it that Cartier jewellery is expensive, made of white gold or platinum and full of diamonds, so only rich people can afford such a treasure. I'd like to investigate that: which prices does Cartier set? What kinds of metal do they use? Is there only jewellery with diamonds? Let's check!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

In [None]:
cartier = pd.read_csv('../input/cartier-jewelry-catalog/cartier_catalog.csv')
cartier.head()

# Data preprocessing

In [None]:
cartier.info()

In [None]:
# The reference number makes no sense from an analysis point of view, so I'm removing this column
cartier.drop('ref', axis = 1, inplace = True)

# I also won't analyse images in this notebook
cartier.drop('image', axis = 1, inplace = True)

In [None]:
# Checking unique values in categories
cartier['categorie'].unique()

There are only 4 possible values and no gaps in the 'categorie' column, so 'category' is a perfect data type for it. I'll also rename the column to fix the misspelling.

In [None]:
cartier.rename(columns = {'categorie': 'category'}, inplace = True)
cartier['category'] = cartier['category'].astype('category')
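
As a small illustration of what the conversion does, here is a sketch on a made-up column (the values below are hypothetical and not read from the dataset). A categorical column stores each distinct label once and exposes the labels through `.cat.categories`.

```python
import pandas as pd

# Hypothetical category labels, for illustration only
sample = pd.Series(['rings', 'earrings', 'rings', 'necklaces', 'bracelets', 'rings'])

# Convert the text column to the pandas 'category' dtype
as_category = sample.astype('category')

# The distinct labels are stored once, in sorted order
print(as_category.cat.categories.tolist())
# No gaps means there are no missing values
print(as_category.isna().sum())
```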

In [None]:
# Checking title column
cartier['title'].unique()

The information is quite chaotic: ring names, sizes, materials and numbers of crystals appear without any order. As far as I can see, the description column holds similar information with a better structure, so I'll use the descriptions for additional information and remove the title column.

In [None]:
cartier.drop('title', axis = 1, inplace = True)

In [None]:
# Checking price distribution
cartier['price'].describe()

The cheapest item costs 500 dollars and the most expensive one costs 370,000. The distribution is positively skewed: it has a long tail of expensive items.
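
As a quick sanity check on the direction of the skew, the sketch below uses a small made-up price sample (the values are illustrative, not taken from the dataset) and pandas' `Series.skew()`. A few very expensive items pull the mean far above the median, which shows up as a positive skewness.

```python
import pandas as pd

# Hypothetical prices: many cheaper items and a few very expensive ones
sample_prices = pd.Series([500, 1200, 2500, 4000, 7500, 15000, 60000, 370000])

# The expensive tail pulls the mean above the median
print(sample_prices.mean(), sample_prices.median())

# Positive skewness means a long right tail
print(sample_prices.skew())
```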

In [None]:
# Checking the most expensive items
cartier[cartier['price'] > 300000]

We can see that all of the most expensive items are set with diamonds.

Here are some of those pieces:
- [Juste un Clou cuff bracelet](https://www.cartier.co.uk/en-gb/collections/jewelry/collections/juste-un-clou/bracelets/h6004717-juste-un-clou-bracelet.html)
- [Reflection de Cartier necklace](https://www.cartier.co.uk/en-gb/collections/jewelry/collections/diamond-collection/necklaces/h7000130-reflection-de-cartier-necklace.html)
- [Cactus de Cartier necklace](https://www.cartier.co.uk/en-gb/collections/jewelry/collections/cactus-de-cartier/necklaces/h7000156-cactus-de-cartier-necklace.html)

In [None]:
# Checking unique tag values (regex = False treats '.' as a literal dot on every pandas version)
{tag for tags in cartier['tags'].str.replace('.', '', regex = False).str.split(', ') for tag in tags}

In [None]:
# Creating list with metals, coatings and crystals based on set of unique tags
metals = ['yellow gold', 'platinum', 'pink gold', 'white gold', 'non-rhodiumized white gold']
coatings = ['black lacquer', 'lacquer', 'black ceramic', 'ceramic']
# 'ruby' is listed as well, because 'rubies' is renamed to 'ruby' before the tags are matched
crystals = ['amazonite', 'amethyst', 'amethysts', 'aquamarines', 'aventurine', 'brown diamonds', 'carnelian', 'carnelians', 'chrysoprase', 'chrysoprases', 'citrine', 'coral', 'diamond', 'diamonds', 'emeralds', 'garnets', 'gray mother-of-pearl', 'lapis lazuli', 'malachite', 'mother-of-pearl', 'obsidians', 'onyx', 'orange diamonds', 'pearl', 'peridots', 'pink sapphire', 'pink sapphires', 'rubies', 'ruby', 'sapphire', 'sapphires', 'spessartite garnet', 'spinels', 'tsavorite garnet', 'tsavorite garnets', 'white mother-of-pearl', 'yellow diamonds']

# Defining functions to split tags into different categories.
def check_tags(group, tags):
    value = ''
    for tag in tags:
        if tag in group:
            value += tag.rstrip('s') + ', '
    if value == '':
        return 'No'
    return value.rstrip(", ")
    
def metal(tags):
    return check_tags(metals, tags)
def crystal(tags):
    return check_tags(crystals, tags)
def coating(tags):
    return check_tags(coatings, tags)

In [None]:
# Creating new columns with metals, crystals and coatings instead of tags
# regex = False makes sure '.' is removed as a literal dot and not treated as "any character"
cartier['metals'] = cartier['tags'].str.replace('non-rhodiumized white gold','white gold').str.replace('.', '', regex = False).str.split(', ').apply(metal)
cartier['crystals'] = cartier['tags'].str.replace('rubies','ruby').str.replace('.', '', regex = False).str.split(', ').apply(crystal)
cartier['coatings'] = cartier['tags'].str.replace('.', '', regex = False).str.split(', ').apply(coating)
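
To make the tag handling easier to follow, here is a minimal sketch of the same idea on a few made-up tag strings. The sample values, the `known_crystals` list and the `pick` helper are hypothetical and only mirror the shape of the real tags column.

```python
import pandas as pd

# Hypothetical tag strings in the same "a, b, c." shape as the tags column
sample_tags = pd.Series(['yellow gold, diamonds.', 'white gold, onyx, emeralds.', 'pink gold.'])

# Hypothetical shortened group list
known_crystals = ['diamonds', 'onyx', 'emeralds']

def pick(group, tags):
    # Keep tags that belong to the group, dropping a trailing plural "s"
    found = [tag.rstrip('s') for tag in tags if tag in group]
    return ', '.join(found) if found else 'No'

# Remove the final dot, then split each string into a list of tags
split_tags = sample_tags.str.replace('.', '', regex=False).str.split(', ')
result = split_tags.apply(lambda tags: pick(known_crystals, tags))
print(result.tolist())
```

Note that `rstrip('s')` only handles plain plurals, which is why irregular ones such as 'rubies' need separate treatment.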

In [None]:
# Moving tags out of the dataframe into a separate variable
tags = cartier.pop('tags')

In description we can find:
- title
- material
- category
- size (small, medium, large)
- width.

We already have material and category of jewelry in separate columns, so this information is not useful. I also don't think that size is an important variable. Let's retrieve only title and width.

In [None]:
# Checking descriptions' length
cartier['description'].str.len().sort_values(ascending = False)

There are two long descriptions:

In [None]:
cartier.loc[258, 'description']

In [None]:
cartier.loc[691, 'description']

Replacing them with the first part of the text, up to '\n\n'.

In [None]:
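# Selecting by column name is safer than by position if the column order changes
cartier.loc[258, 'description'] = 'Clash de Cartier ring, XL model, 18K yellow gold, coral. Width: 17.7mm.'
cartier.loc[691, 'description'] = 'Clash de Cartier earrings, XL model, 18K yellow gold, coral. Width: 17.7mm.'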
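
The same truncation can also be done without typing the text by hand. The sketch below runs on made-up descriptions (hypothetical values) and keeps everything before the first blank line.

```python
import pandas as pd

# Hypothetical descriptions; the second has extra text after a blank line
sample = pd.Series([
    'Example ring, 18K yellow gold. Width: 5.5mm.',
    'Example earrings, XL model, 18K yellow gold, coral. Width: 17.7mm.\n\nSome longer text.',
])

# Keep only the part before the first '\n\n'
first_part = sample.str.split('\n\n').str[0]
print(first_part.tolist())
```

Descriptions without a blank line are left unchanged, so the same line could be applied to the whole column.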

From each description I'll get the jewelry title: the text up to the first comma or dash.

Note: There are plain descriptions like *'18K yellow gold necklace set with ceramic ring'* without any punctuation. For such rows I'll use the whole description as the title.

In [None]:
cartier['title'] = cartier['description'].apply(lambda x: x.split(', ')[0].split(' - ')[0])
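
Here is how that rule behaves on three made-up descriptions (hypothetical values): one with a comma, one with a dash and one without any punctuation.

```python
import pandas as pd

# Hypothetical descriptions covering the three cases
sample = pd.Series([
    'Example ring, 18K yellow gold. Width: 5.5mm.',
    'Example bracelet - small model',
    '18K yellow gold necklace set with ceramic ring',
])

# Text up to the first ', ' and then up to the first ' - '
titles = sample.apply(lambda x: x.split(', ')[0].split(' - ')[0])
print(titles.tolist())
```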

Let's create a separate column for width: the number between the word 'Width' and the 'mm' unit. If there is no width, I'll put 'nan'.

In [None]:
cartier['width'] = cartier['description'].apply(lambda x: x.split('Width: ')[1].split('mm')[0] if len(x.split('Width: ')) > 1 else np.nan).astype('float')
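
An alternative way to pull out the width is a regular expression with `Series.str.extract`. This is only a sketch on made-up descriptions, and the pattern is my assumption about the 'Width: <number>mm' format; it is stricter than the split above because it accepts only digits and dots.

```python
import pandas as pd

# Hypothetical descriptions, with and without a width
sample = pd.Series([
    'Example ring, 18K yellow gold. Width: 5.5mm.',
    'Example necklace, 18K white gold, diamonds.',
])

# Capture the number between 'Width: ' and 'mm'; rows without a match become NaN
width = sample.str.extract(r'Width: ([\d.]+)\s*mm', expand=False).astype('float')
print(width.tolist())
```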

In [None]:
# Removing old column
description = cartier.pop('description')

# Analysis

In [None]:
cartier.info()

Price distribution:

In [None]:
plt.figure(figsize = (10, 6))
# histplot replaces distplot(kde = False), which is deprecated in recent seaborn versions
price = sns.histplot(x = cartier['price'], color = "r", bins = 50)
price.set_xlabel('Price')

In [None]:
cartier['price'].describe()

So most Cartier jewelry (about 70%) costs below 20k!
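
A share like this can be computed directly instead of being read off the summary table. The sketch below uses made-up prices (hypothetical values, so the resulting number says nothing about the real dataset); the mean of a boolean mask is the fraction of rows where it is true.

```python
import pandas as pd

# Hypothetical prices, for illustration only
sample_prices = pd.Series([500, 1200, 4000, 7500, 15000, 18000, 60000, 370000])

# Fraction of items priced below 20k
share_below_20k = (sample_prices < 20000).mean()
print(f'{share_below_20k:.0%}')
```

On the real data the same expression would be applied to the price column.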

Price distribution per jewellery category:

In [None]:
plt.figure(figsize = (10, 6))
price_category = sns.swarmplot(y = 'category', x = 'price', data = cartier, palette = 'magma')
price_category.set_xlabel('Price')
price_category.set_ylabel('Category')

Price distribution depending on metal used:

In [None]:
plt.figure(figsize = (10, 8))
price_metal = sns.swarmplot(y = 'metals', x = 'price', data = cartier, palette = 'magma_r')
price_metal.set_xlabel('Price')
price_metal.set_ylabel('Metal')

The most popular metals:

In [None]:
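plt.figure(figsize = (10, 6))
all_metals = [metal_name for rows in cartier['metals'].str.split(', ') for metal_name in rows]
# Passing the values as x and using a new variable name keeps the 'metals' list intact
metals_plot = sns.countplot(x = all_metals, palette = 'magma_r')
metals_plot.set_xlabel('Metal')
metals_plot.set_ylabel('Jewelry count')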

White gold is used more often than the other metals, and white gold jewelry also tends to be priced higher. On the other hand, platinum is not so popular in the Cartier boutique, but there are both cheap and expensive platinum pieces.
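
The counts behind such a chart can also be obtained as a table. This is a minimal sketch on a made-up column (hypothetical values): each row may list several metals, so the strings are split and then expanded to one metal per row with `explode` before counting.

```python
import pandas as pd

# Hypothetical metals column; some items combine two metals
sample_metals = pd.Series([
    'white gold',
    'yellow gold, white gold',
    'platinum',
    'pink gold, white gold',
])

# One row per metal, then count how often each metal appears
counts = sample_metals.str.split(', ').explode().value_counts()
print(counts)
```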

The most popular coatings:

In [None]:
all_coatings = [coating_name for rows in cartier[cartier['coatings'] != 'No']['coatings'].str.split(', ') for coating_name in rows]
# Passing the values as x and using a new variable name keeps the 'coatings' list intact
coatings_plot = sns.countplot(x = all_coatings, palette = 'magma_r')
coatings_plot.set_xticklabels(coatings_plot.get_xticklabels(), rotation=90)
coatings_plot.set_xlabel('Coating')
coatings_plot.set_ylabel('Jewelry count')

In [None]:
plt.figure(figsize = (10, 8))
price_metal = sns.swarmplot(y = 'coatings', x = 'price', data = cartier, palette = 'magma_r')
price_metal.set_xlabel('Price')
price_metal.set_ylabel('Coating')

Cartier uses coatings very rarely, so there are only a few items with coatings. Prices for jewelry with a coating are mostly below 50k.

The most popular crystals (or minerals):

In [None]:
plt.figure(figsize = (12, 8))
all_crystals = [crystal_name for rows in cartier[cartier['crystals'] != 'No']['crystals'].str.split(', ') for crystal_name in rows]
# A new variable name keeps the 'crystals' list intact
crystals_plot = sns.countplot(y = all_crystals, palette = 'Blues_r', order = pd.Series(all_crystals).value_counts().index)
# Rotating the tick labels without overwriting their text
crystals_plot.tick_params(axis = 'x', labelrotation = 90)
crystals_plot.set_xlabel('Jewelry count')
crystals_plot.set_ylabel('Crystals')

Cartier boutique definitely prefers diamonds!

Some of the popular words in Cartier collections and jewelry titles:

In [None]:
from wordcloud import WordCloud, STOPWORDS

# Setting stopwords
stopwords_names = set(STOPWORDS)
stopwords_names.update(['ring', 'Ring', 'bracelet', 'Bracelet', 'necklace', 'Necklace', 'earrings', 'Earrings', 'de', 'Cartier'])

# Creating words list for names
words_in_title = [word for rows in cartier['title'].str.split() for word in rows if word not in stopwords_names]
words = " ".join(words_in_title)

In [None]:
# Creating a cloud with words from names:
plt.figure(figsize = (10,6))
wordcloud = WordCloud(max_words=30, background_color="white", colormap = 'tab20b').generate(words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

### End

Thank you for reviewing!