## I aim to find the most important features of products, so that companies can learn how to maximize customer satisfaction for any new product released to the market.

## My steps to accomplish this are as follows
1. Extract noun phrases from the product reviews. 
    * Vanilla noun phrases
    * Nouns modified with a linking verb
2. Use a Doc2Vec model to create vector representations of the extracted noun phrases. The model will be trained over all sentences in the review corpus (approximately 1.7 million reviews of Electronics products on Amazon).
3. Using Hierarchical DBSCAN (or another clustering model), find meaningful clusters of the top N noun phrases, where N is a hyperparameter to be tuned for the overall model.
4. For each cluster, find the core point closest to all other core points. Using this as the representative product feature for the cluster, propagate the concept to all other points within the cluster.
5. Using a Word2Vec model trained on the same review data, subtract the vector of the feature term found in (4) to disambiguate any modifiers that don't have a clear sentiment.
6. Build a linear model (perhaps with polynomial features) for classification (positive or negative review) or regression (review score) to determine which of the product features have the greatest effect on the overall prediction. Models can be fit per product, with a bagged ensemble of the k nearest products (by product description similarity) used to predict for a new product. Alternatively, models can be fit over entire categories of products.
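
As a rough illustration of step 4, here is a minimal sketch of picking a cluster representative and propagating it. It assumes Euclidean distance and reads "closest to all other core points" as the smallest total distance (a medoid). The helper `representative_index`, the phrases, and the 2-D vectors are all made up for illustration; real vectors would come from the Doc2Vec model in step 2, and the core points from the clustering in step 3.

```python
from math import dist  # Euclidean distance, Python 3.8+


def representative_index(core_vectors):
    """Return the index of the vector with the smallest total distance
    to all the other vectors (ties go to the earliest index)."""
    totals = [
        sum(dist(v, other) for other in core_vectors)
        for v in core_vectors
    ]
    return min(range(len(core_vectors)), key=totals.__getitem__)


# Hypothetical toy cluster: three core phrases with made-up 2-D vectors,
# plus one non-core member of the same cluster.
core_phrases = ["audio", "sound quality", "bass response"]
core_vectors = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
other_members = ["sound clarity"]

# The representative feature is the core phrase at the medoid.
rep = core_phrases[representative_index(core_vectors)]

# Propagate the representative feature to every phrase in the cluster.
feature_of = {phrase: rep for phrase in core_phrases + other_members}
```

The pairwise sum is quadratic in the number of core points, which should be fine for clusters of noun phrases but may need a vectorized version for very large clusters.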

### Step 1

In [3]:
import hdbscan
import pandas as pd
import spacy
import textacy

from collections import Counter
from gensim.models.doc2vec import Doc2Vec
from gensim.models.word2vec import Word2Vec

In [5]:
nlp = spacy.load('en_core_web_md')

In [4]:
df = pd.read_csv('./Electronics_5.csv', encoding = 'utf8')
meta_df = pd.read_csv('./meta_electronics.csv', encoding = 'utf8')

test_df = df[df.asin == 'B003ELYQGG']
test_meta = meta_df[meta_df.asin == 'B003ELYQGG']

del df
del meta_df

In [7]:
test_df.iloc[0].reviewText

"these are super cheap and mostly you get what you pay for. The sound quality is not that good and they're not as sensitive as some others, but they're ok for the price."

In [6]:
docs = nlp.pipe(test_df.reviewText.values, n_threads=16)

In [None]:
noun_phrase_counts = Counter()

for doc in docs:
    these_nps = []
    
    these_nps.extend([nc.lemma_ for nc in doc.noun_chunks if nc.lemma_ not in ['-PRON-']])
    