In [2]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.1.2-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting smart-open>=1.8.1
  Using cached smart_open-5.2.1-py3-none-any.whl (58 kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.1.2 smart-open-5.2.1


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import re  # For preprocessing
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import spacy  # For preprocessing
from gensim.models import Word2Vec
import logging  # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)
from sklearn.manifold import TSNE
from numpy import dot
from numpy.linalg import norm

# Dataset Description 

The dataset I used for this tutorial comes from Kaggle (https://www.kaggle.com/CooperUnion/cardataset). It describes cars by features such as make, model, year, engine, and other properties. We will use these features to generate a word embedding for each make and model combination, and then compare the similarities between different make models. The following dataframe shows the first few rows of the dataset.

In [2]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


# Data Preprocessing

Since the purpose of this tutorial is to learn how to generate word embeddings using the gensim library, I will skip EDA and feature selection for the word2vec model for the sake of simplicity.

Gensim's Word2Vec expects its training input as a list of lists: each document is a list, and each of those lists holds the tokens of that document. So we first need to build such a list of lists for training the make model embeddings. More specifically, each row of the dataset becomes one list that contains the features of that car together with its make model name.
To achieve this, we need the following data preprocessing steps:
1. Create a new column for Make Model
2. Generate the list-of-lists format for each Make Model with the following features: Engine Fuel Type, Transmission Type, Driven_Wheels, Market Category, Vehicle Size and Vehicle Style.


1. Create a new column for Make Model

In [3]:
df['Maker_Model']= df['Make']+ " " + df['Model']

INFO - 12:54:13: NumExpr defaulting to 8 threads.


2. Generate the list-of-lists format for each Make Model

In [5]:
# Select features from original dataset to form a new dataframe 
df1 = df[['Engine Fuel Type','Transmission Type','Driven_Wheels','Market Category','Vehicle Size', 'Vehicle Style', 'Maker_Model']]
# For each row, join all selected columns into one comma-separated string
df2 = df1.apply(lambda x: ','.join(x.astype(str)), axis=1)
# Store the joined strings in a pandas dataframe
df_clean = pd.DataFrame({'clean': df2})
# Split each string on commas to get the list-of-lists corpus for gensim;
# multi-valued fields such as Market Category become separate tokens
sent = [row.split(',') for row in df_clean['clean']]
# Show the first two token lists of the corpus
sent[:2]

print(sent[0])
print(sent[1])
len(sent)

['premium unleaded (required)', 'MANUAL', 'rear wheel drive', 'Factory Tuner', 'Luxury', 'High-Performance', 'Compact', 'Coupe', 'BMW 1 Series M']
['premium unleaded (required)', 'MANUAL', 'rear wheel drive', 'Luxury', 'Performance', 'Compact', 'Convertible', 'BMW 1 Series']


11914
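
One thing to keep in mind with the approach above: `x.astype(str)` turns a missing value into the literal string `'nan'`, so rows with an empty Market Category would carry a `'nan'` token into the corpus. The following is only a sketch of one way to avoid that, written in plain Python on rows stored as dicts. The helper name `row_to_tokens` and the second example row are made up for illustration and are not part of the original notebook.

```python
import math

def row_to_tokens(row, columns):
    """Build one token list from a row (a plain dict), skipping missing values."""
    tokens = []
    for col in columns:
        value = row.get(col)
        # Skip None and float NaN instead of turning them into a 'nan' token
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        # Multi-valued fields such as Market Category are split on commas
        tokens.extend(part.strip() for part in str(value).split(','))
    return tokens

columns = ['Transmission Type', 'Market Category', 'Vehicle Size', 'Maker_Model']
rows = [
    {'Transmission Type': 'MANUAL', 'Market Category': 'Luxury,Performance',
     'Vehicle Size': 'Compact', 'Maker_Model': 'BMW 1 Series'},
    # Hypothetical row with a missing Market Category
    {'Transmission Type': 'AUTOMATIC', 'Market Category': float('nan'),
     'Vehicle Size': 'Midsize', 'Maker_Model': 'Example Make Example Model'},
]

corpus = [row_to_tokens(r, columns) for r in rows]
for tokens in corpus:
    print(tokens)
```

Whether a `'nan'` token hurts the embeddings depends on how many rows are affected, so treat this as an optional refinement.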

# Gensim word2vec Model Training

Gensim is an open-source Python library for natural language processing, developed and maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies. It lets us build word embeddings by training our own word2vec models on a custom corpus with either the CBOW or the skip-gram algorithm.

In [6]:
## Train the gensim word2vec model with our own custom corpus
model = Word2Vec(sent, min_count=1, vector_size=50, workers=1, window=3, sg=1)

INFO - 13:00:29: collecting all words and their counts
INFO - 13:00:29: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 13:00:29: PROGRESS: at sentence #10000, processed 74060 words, keeping 841 word types
INFO - 13:00:29: collected 977 word types from a corpus of 88129 raw words and 11914 sentences
INFO - 13:00:29: Creating a fresh vocabulary
INFO - 13:00:29: Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 977 unique words (100.0%% of original 977, drops 0)', 'datetime': '2022-03-23T13:00:29.234545', 'gensim': '4.1.2', 'python': '3.8.3 (default, Jul  2 2020, 16:21:59) \n[GCC 7.3.0]', 'platform': 'Linux-5.13.0-35-generic-x86_64-with-glibc2.10', 'event': 'prepare_vocab'}
INFO - 13:00:29: Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 88129 word corpus (100.0%% of original 88129, drops 0)', 'datetime': '2022-03-23T13:00:29.284419', 'gensim': '4.1.2', 'python': '3.8.3 (default, Jul  2 2020, 16:21:59) \n[GCC 7.3.0]', 'platform': 'Lin

Let's go through the hyperparameters of this model.
1. vector_size: The number of dimensions of the embeddings; the default is 100. (This parameter was called size before gensim 4.0.)
2. window: The maximum distance between a target word and the words around it. The default window is 5.
3. min_count: The minimum count of words to consider when training the model; words that occur fewer times than this are ignored. The default for min_count is 5.
4. workers: The number of worker threads used for training; the default is 3.
5. sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.
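
To get a feel for what `window` means for our corpus, here is a rough sketch of how skip-gram forms (target, context) training pairs from one token list. It is a simplification and not gensim's actual implementation: as far as I know, gensim also shrinks the window randomly and applies subsampling. The function name `skipgram_pairs` is made up for this illustration.

```python
def skipgram_pairs(tokens, window):
    """Return every (target, context) pair whose positions differ by at most `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# A shortened token list in the same layout as our corpus (make model comes last)
tokens = ['MANUAL', 'rear wheel drive', 'Luxury', 'Compact', 'Coupe', 'BMW 1 Series']
pairs = skipgram_pairs(tokens, window=3)

# Which features does the make model token get paired with?
context_of_model = [context for target, context in pairs if target == 'BMW 1 Series']
print(context_of_model)
```

Because the make model token sits at the end of each list, a window of 3 pairs it directly only with the last few features. A larger window would let it see the earlier features as well.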

In [12]:
## We can obtain the word embeddings directly from the trained model
my_w2v = model.wv
my_w2v['Toyota Camry']
# To list every token in the vocabulary:
# for k in my_w2v.index_to_key:
#     print(k)

array([ 0.01413206,  0.13312066,  0.02064241, -0.10732427, -0.07307266,
       -0.21330908,  0.0129156 ,  0.28755438, -0.09580448, -0.08252585,
        0.04816886,  0.03747622,  0.10407145, -0.02856277, -0.03150049,
        0.16970299,  0.14842999,  0.29115918, -0.1391689 , -0.2961973 ,
       -0.05255633, -0.05688905,  0.26025763,  0.06427574,  0.17634848,
        0.00669038, -0.03523087,  0.38606796, -0.04761263,  0.00278091,
        0.0215994 ,  0.02498762,  0.03595787,  0.00827457,  0.07596808,
       -0.11868112,  0.17895776, -0.01840427,  0.04355941,  0.06628232,
        0.08798219, -0.03108072, -0.2208288 ,  0.08912076,  0.3587986 ,
        0.04740353, -0.02172407, -0.14516725,  0.00390705,  0.02408277],
      dtype=float32)

# Compare Similarities 

Now we can use Word2Vec to compute the similarity between two make models in the vocabulary by calling similarity() on the trained word vectors (model.wv) and passing in the two tokens. For instance, model.wv.similarity('Porsche 718 Cayman', 'Nissan Van') gives us the cosine similarity between Porsche 718 Cayman and Nissan Van.

In [13]:
my_w2v.similarity('Porsche 718 Cayman', 'Nissan Van')

0.75154096

In [14]:
my_w2v.similarity('Porsche 718 Cayman', 'Mercedes-Benz SLK-Class')

0.88125455

From the above example, we can tell that Porsche 718 Cayman is more similar to Mercedes-Benz SLK-Class than to Nissan Van. We can also use the built-in function most_similar() to get the most similar make models for a given make model.

In [16]:
## Show the most similar vehicles for Mercedes-Benz SLK-Class (most_similar ranks by cosine similarity)
my_w2v.most_similar('Mercedes-Benz SLK-Class')[:10]

[('Mercedes-Benz SLS AMG GT', 0.9876929521560669),
 ('Maserati Coupe', 0.9859341979026794),
 ('Lamborghini Huracan', 0.9856386184692383),
 ('Audi S3', 0.9855828285217285),
 ('Porsche Cayman', 0.9854549169540405),
 ('BMW M6', 0.9852572679519653),
 ('BMW M4', 0.9850773215293884),
 ('Mitsubishi Lancer Evolution', 0.984929084777832),
 ('Nissan GT-R', 0.9844992756843567),
 ('Porsche Boxster', 0.9842399954795837)]

In [18]:
## Show the most similar vehicles for Toyota Camry (most_similar ranks by cosine similarity)
my_w2v.most_similar('Toyota Camry')[:10]

[('Nissan Sentra', 0.988138735294342),
 ('Chevrolet Malibu', 0.9875525236129761),
 ('Nissan Altima', 0.984257161617279),
 ('Mazda 6', 0.9840126037597656),
 ('Toyota Avalon', 0.9836633205413818),
 ('Buick Verano', 0.9834824204444885),
 ('Suzuki Kizashi', 0.981656551361084),
 ('Chevrolet Cruze', 0.9809757471084595),
 ('Pontiac Grand Am', 0.9808218479156494),
 ('Dodge Dart', 0.980262279510498)]

Note that gensim's similarity() and most_similar() rank by cosine similarity rather than Euclidean distance. Euclidean distance often works poorly for high-dimensional word vectors, because it depends on the magnitude of the vectors and tends to grow with the number of dimensions, regardless of whether two embeddings have related meanings. Cosine similarity instead measures the cosine of the angle between two vectors in a multi-dimensional space, so it captures the direction of the word vectors and not their magnitude. Under cosine similarity, no similarity corresponds to a 90-degree angle (a value of 0), while total similarity of 1 corresponds to a 0-degree angle. The following function shows how we can compute the most similar make models from cosine similarity by hand, restricted to a list of candidate tokens.

In [19]:
def cosine_distance(model, word, target_list, num):
    """Return the `num` items of target_list most similar to `word`.

    Despite the name, the scores are cosine similarities (higher is more similar).
    """
    cosine_dict = {}
    a = model[word]
    for item in target_list:
        if item != word:
            b = model[item]
            # Cosine similarity: dot product divided by the product of the norms
            cos_sim = dot(a, b) / (norm(a) * norm(b))
            cosine_dict[item] = cos_sim
    # Sort by similarity in descending order
    dist_sort = sorted(cosine_dict.items(), key=lambda dist: dist[1], reverse=True)
    return dist_sort[:num]

In [20]:
Maker_Model = list(df.Maker_Model.unique())  ## only keep the unique Maker_Model values
## Show the vehicles most similar to Mercedes-Benz SLK-Class by cosine similarity
cosine_distance(my_w2v, 'Mercedes-Benz SLK-Class', Maker_Model, 5)

[('Mercedes-Benz SLS AMG GT', 0.9876931),
 ('Maserati Coupe', 0.98593426),
 ('Lamborghini Huracan', 0.98563856),
 ('Audi S3', 0.98558277),
 ('Porsche Cayman', 0.9854549)]
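
To see why direction and magnitude can disagree, here is a small NumPy sketch on made-up 2-D vectors (they are not taken from the trained model). It ranks candidates by cosine similarity using normalized vectors and compares that with plain Euclidean distance. The function name `top_k_cosine` and the labels are hypothetical.

```python
import numpy as np

def top_k_cosine(vectors, labels, query, k):
    """Rank labels by cosine similarity to `query`, excluding the query itself."""
    # After normalizing each row to unit length, a dot product is the cosine
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[labels.index(query)]
    order = [i for i in np.argsort(-sims) if labels[i] != query]
    return [(labels[i], float(sims[i])) for i in order[:k]]

# Toy "embeddings" with made-up values
labels = ['car_a', 'car_b', 'car_c']
vectors = np.array([[1.0, 0.0],
                    [10.0, 1.0],   # almost the same direction as car_a, much longer
                    [0.0, 1.0]])   # orthogonal to car_a, similar length

ranking = top_k_cosine(vectors, labels, 'car_a', k=2)
euclid_b = np.linalg.norm(vectors[0] - vectors[1])
euclid_c = np.linalg.norm(vectors[0] - vectors[2])

print(ranking)             # car_b ranks first by cosine similarity
print(euclid_b, euclid_c)  # yet car_c is closer by Euclidean distance
```

In this toy case the two measures give opposite rankings, because cosine similarity ignores how long the vectors are.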

# T-SNE Plot

It's hard to visualize word embeddings directly, because they usually have more than 3 dimensions. t-SNE is a useful tool for visualizing high-dimensional data: it reduces the number of dimensions while trying to keep nearby points close together. In other words, t-SNE looks for a new data representation in which the neighborhood relations are preserved. In this tutorial, I used the TSNE class from the scikit-learn library. The following code shows how to plot the word embeddings with t-SNE.

In [21]:
def display_closestwords_tsnescatterplot(model, word, size):
    # Collect the vector of `word` followed by those of its nearest neighbours
    arr = np.empty((0, size), dtype='f')
    word_labels = [word]

    close_words = model.similar_by_word(word)  # top 10 neighbours by default

    arr = np.append(arr, np.array([model[word]]), axis=0)
    for wrd_score in close_words:
        wrd_vector = model[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)

    # Project to 2-D; perplexity is capped because recent scikit-learn versions
    # require it to be smaller than the number of points (11 here)
    tsne = TSNE(n_components=2, random_state=0, perplexity=min(30, len(arr) - 1))
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)

    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    plt.scatter(x_coords, y_coords)

    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    # Pad the axis limits slightly on both sides so no point sits on the border
    plt.xlim(x_coords.min() - 0.00005, x_coords.max() + 0.00005)
    plt.ylim(y_coords.min() - 0.00005, y_coords.max() + 0.00005)
    plt.show()


In [22]:
display_closestwords_tsnescatterplot(my_w2v, 'Porsche 718 Cayman', 50)

[Figure: t-SNE scatter plot of 'Porsche 718 Cayman' and its 10 most similar tokens]