# BOXCOXROX 

Pseudo product.csv creator.

This notebook creates a pseudo version of main.csv for the boxcoxrox project. It creates sample CSVs for topic, sentiment, and quantiles, which are then to be joined into one master file.

The first step is to create a dataframe that includes a set of ASINs and a description for each. This data will probably come from the metadata for the pet products in pet.db.


In [1]:
#!pip install pandas
import pandas as pd
import random
import string

def get_asin(N):
    """Return a random N-character string of uppercase letters and digits."""
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

# generate 1000 random 10-character ASINs and drop any duplicates
asins = [get_asin(10) for _ in range(1000)]
asins = list(set(asins))

products_df = pd.DataFrame({'asin': asins})
products_df['description'] = ["Prod description for {}".format(x) for x in asins]
products_df.to_csv("mock_products_df.csv", index=False)
# keep head() last so the notebook displays the preview
products_df.head()

## Sample LDA Topic csv...

This code creates a CSV with the ASINs defined above, plus a random topic and topic description for each.

In [2]:
LDA_df = pd.DataFrame(asins)
LDA_df.columns = ['asin']

# assign each ASIN a random topic id from 0 to 11
topics = [random.randrange(12) for _ in range(len(LDA_df))]
LDA_df['topic'] = topics

desc = ["Topic description for {}".format(x) for x in topics]
LDA_df['topic_description'] = desc
LDA_df.to_csv("mock_LDA_df.csv", index=False)
LDA_df.head()

## Sentiment csv

This code generates a mock sentiment score CSV, with each score drawn uniformly from the range -1 to 1.

In [3]:
sentiment_df = pd.DataFrame(asins)
sentiment_df.columns = ['asin']

# draw each score uniformly from the range -1 to 1
sentiment = [random.uniform(-1, 1) for _ in range(len(sentiment_df))]
sentiment_df['sentiment'] = sentiment
sentiment_df.to_csv("mock_sentiment_df.csv", index=False)
sentiment_df.head()

## Quantiles

The following code generates a pseudo-random quantile rating for each product. Because the ASINs are random strings, sorting by ASIN and assigning a rank number based on that sort gives an effectively random ranking. Dividing the rank by the number of products scales it to the range 0 to just under 1. (This is essentially the scaled rank.)
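
To make the scaled rank concrete, here is a minimal sketch on four made-up ASINs. The names `toy_asins`, `rank_of`, and `toy_quantiles` are illustrative only and are not part of the project; the sketch uses plain Python rather than pandas to keep the arithmetic visible.

```python
# Four made-up ASINs, purely for illustration
toy_asins = ['B3', 'A1', 'D4', 'C2']

# zero-based rank of each ASIN in sorted order
rank_of = {asin: rank for rank, asin in enumerate(sorted(toy_asins))}

# scale the rank by n so the values run from 0 to (n - 1) / n
toy_quantiles = {asin: rank_of[asin] / len(toy_asins) for asin in toy_asins}

for asin in toy_asins:
    print(asin, toy_quantiles[asin])
```

With four items the largest value is 0.75; with roughly 1000 ASINs it comes out close to 0.999, which is why the checks in the cell below use thresholds rather than exact equality with 1.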


In [4]:
quantiles_df = pd.DataFrame(asins)
quantiles_df.columns = ['asin']

# preserve the original order
quantiles_df['order'] = range(len(quantiles_df))

# sort the data by the asin
quantiles_df = quantiles_df.sort_values(by='asin')

# assign a rank column based on the new order
quantiles_df['rank'] = range(len(quantiles_df))

# calculate the quantile column based on the rank
quantiles_df['quantile'] = quantiles_df['rank'] / len(quantiles_df)

# restore the original order
quantiles_df = quantiles_df.sort_values(by='order')

# keep only the asin and quantile columns
quantiles_df = quantiles_df[['asin', 'quantile']]

# sanity check: the quantiles should span 0 to just under 1 (the maximum is (n - 1) / n)
assert quantiles_df['quantile'].max() > 0.998
assert quantiles_df['quantile'].min() < 0.001

quantiles_df.to_csv("mock_quantiles_df.csv", index=False)
quantiles_df.head()
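
The introduction mentions joining the individual CSVs into one master file, but the cells above stop at writing them separately. The following is a minimal sketch of that join, assuming the frames share the `asin` column as their key and that an inner merge is what the project wants. The helper `join_on_asin`, the `demo_*` frames, and the output name `mock_main.csv` are hypothetical and not taken from the project.

```python
from functools import reduce

import pandas as pd


def join_on_asin(frames):
    """Inner-join a list of dataframes on their shared 'asin' column."""
    return reduce(lambda left, right: left.merge(right, on='asin', how='inner'), frames)


# Tiny stand-in frames with made-up values; in the notebook these would be
# products_df, LDA_df, sentiment_df and quantiles_df.
demo_products = pd.DataFrame({'asin': ['A1', 'B2'], 'description': ['desc A1', 'desc B2']})
demo_topics = pd.DataFrame({'asin': ['A1', 'B2'], 'topic': [3, 7]})
demo_sentiment = pd.DataFrame({'asin': ['A1', 'B2'], 'sentiment': [0.4, -0.2]})
demo_quantiles = pd.DataFrame({'asin': ['A1', 'B2'], 'quantile': [0.0, 0.5]})

demo_main = join_on_asin([demo_products, demo_topics, demo_sentiment, demo_quantiles])
print(demo_main)

# Hypothetical output file name; adjust to whatever the project expects.
# demo_main.to_csv("mock_main.csv", index=False)
```

In the notebook itself the call would be `join_on_asin([products_df, LDA_df, sentiment_df, quantiles_df])`. Since every mock frame is built from the same `asins` list, an inner merge should keep every row.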