### Finding similar products by their descriptions
Suppose you are a data analyst at a company that sells furniture. Your task is to determine which products are most similar to one another based on their descriptions. To do this, use cosine similarity together with the spaCy library.

Steps:

1.	Download the dataset with the product descriptions (source file: product_description.csv).

2.	Import the spaCy library and load the English language model en_core_web_sm.

For this task, also import the following functions from Python libraries:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
3.	Preprocess the text: remove stop words, lemmatize the words, and remove punctuation.

Use the following code for preprocessing:

def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)
data['processed_text'] = data['description'].apply(preprocess_text)
4.	Build a matrix of vectors with one row per product.

Use the following code for vectorization:

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['processed_text'])
5.	Compute the cosine similarity between every pair of products.

6.	Display the top 5 pairs of products that are most similar to each other.
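
Step 5 relies on cosine similarity, that is, the cosine of the angle between two vectors: the dot product divided by the product of their lengths. As a minimal, dependency-free sketch of that definition (the helper name `cosine` and the toy word-count vectors are illustrative assumptions, not part of the assignment), it might look like this:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical word counts over the vocabulary ["chair", "perfect", "room"]
chair = [1, 1, 1]
sofa = [0, 1, 1]
print(cosine(chair, sofa))
```

A value near 1 means the two descriptions use almost the same words in the same proportions; a value of 0 means they share no words at all.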

In [4]:
import spacy
import numpy as np
import pandas as pd
import os

In [5]:
root_path = os.getcwd()
dir_path = os.path.join(root_path, "datasets")
filename = "product_description.csv"
file_path = os.path.join(dir_path, filename)

df = pd.read_csv(file_path)
df.head(10)

Unnamed: 0,product_name,description
0,Chair,This comfortable chair is perfect for any room...
1,Sofa,This beautiful sofa is a great addition to any...
2,Table,This elegant table is perfect for your dining ...
3,Bed,This cozy bed is perfect for a good night's sl...
4,Bookshelf,This modern bookshelf is perfect for displayin...
5,Desk,This functional desk is perfect for your home ...
6,Ottoman,This stylish ottoman is perfect for your livin...
7,Cabinet,This versatile cabinet is perfect for storing ...
8,Dresser,This elegant dresser is perfect for your bedro...
9,TV Stand,This sleek TV stand is perfect for your entert...


In [2]:
# Uncomment and run once if the model is not installed yet:
# spacy.cli.download("en_core_web_sm")

In [6]:
nlp = spacy.load('en_core_web_sm')

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)
df['processed_text'] = df['description'].apply(preprocess_text)
df

Unnamed: 0,product_name,description,processed_text
0,Chair,This comfortable chair is perfect for any room...,comfortable chair perfect room home feature so...
1,Sofa,This beautiful sofa is a great addition to any...,beautiful sofa great addition living room slee...
2,Table,This elegant table is perfect for your dining ...,elegant table perfect dining room durable wood...
3,Bed,This cozy bed is perfect for a good night's sl...,cozy bed perfect good night sleep soft mattres...
4,Bookshelf,This modern bookshelf is perfect for displayin...,modern bookshelf perfect display favorite book...
5,Desk,This functional desk is perfect for your home ...,functional desk perfect home office spacious s...
6,Ottoman,This stylish ottoman is perfect for your livin...,stylish ottoman perfect living room bedroom fo...
7,Cabinet,This versatile cabinet is perfect for storing ...,versatile cabinet perfect store belonging mult...
8,Dresser,This elegant dresser is perfect for your bedro...,elegant dresser perfect bedroom spacious drawe...
9,TV Stand,This sleek TV stand is perfect for your entert...,sleek tv stand perfect entertainment center mu...


In [11]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(df['processed_text'])
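
With its default settings, `TfidfVectorizer` weights each term by a smoothed inverse document frequency and then L2-normalizes every row. The notation below is introduced here for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t), \qquad
v_d = \frac{w(\cdot, d)}{\lVert w(\cdot, d) \rVert_2}
$$

Because each row $v_d$ has unit length, the cosine similarity of two products reduces to the dot product of their rows.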

In [24]:
similarity = cosine_similarity(vectors)

In [26]:
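# Despite the name, `dists` stores cosine similarities (higher means more alike).
# The matrix is symmetric, so each unordered pair (i, j) with i < j is visited once.
dists = []

for i in range(vectors.shape[0]):
    for j in range(i + 1, vectors.shape[0]):
        dists.append((i, j, similarity[i, j]))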

In [28]:
n = 5

top_n = sorted(dists, key = lambda x: x[2], reverse = True)[:n]
top_n

[(0, 2, 0.25938762953244804),
 (6, 8, 0.25085690793255755),
 (0, 1, 0.25055127750774986),
 (0, 3, 0.2456180219805501),
 (2, 8, 0.24096224354397955)]

In [29]:
# The third element of each tuple is a cosine similarity, not a distance.
for rank, (i, j, score) in enumerate(top_n, start=1):
    print(f"top {rank}: {df.loc[i, 'product_name']} and {df.loc[j, 'product_name']} (similarity = {score})")

top 1: Chair and Table (similarity = 0.25938762953244804)
top 2: Ottoman and Dresser (similarity = 0.25085690793255755)
top 3: Chair and Sofa (similarity = 0.25055127750774986)
top 4: Chair and Bed (similarity = 0.2456180219805501)
top 5: Table and Dresser (similarity = 0.24096224354397955)
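
For larger catalogs, sorting every pair only to keep the first few does more work than necessary. As a hedged alternative sketch (the function name `top_similar_pairs` and the small hand-written matrix are assumptions for illustration; in this notebook the matrix would come from `cosine_similarity(vectors)`), `heapq.nlargest` combined with `itertools.combinations` can select the top pairs directly:

```python
import heapq
from itertools import combinations

def top_similar_pairs(similarity, n=5):
    """Return the n most similar unordered pairs as (i, j, score) with i < j."""
    size = len(similarity)
    # Lazily generate each pair above the diagonal exactly once.
    pairs = ((i, j, similarity[i][j]) for i, j in combinations(range(size), 2))
    return heapq.nlargest(n, pairs, key=lambda p: p[2])

# Hand-written symmetric matrix standing in for cosine_similarity(vectors)
toy = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
]
print(top_similar_pairs(toy, n=2))  # [(0, 2, 0.7), (1, 2, 0.4)]
```

The function indexes with `similarity[i][j]`, so it should work with nested lists as well as a dense NumPy array.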
