In [1]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns



In [5]:
X_bus = pd.read_pickle('data/business-classification/X.pkl')
Y_bus = pd.read_pickle('data/business-classification/Y.pkl')

X_wine = pd.read_pickle('data/wine/X.pkl')
Y_wine = pd.read_pickle('data/wine/Y.pkl')

# Datasets
There are two datasets that we shall analyze today. One pertains to business classification data and the other to wine quality. Both present multi-class labels that we will attempt to predict. Let us examine each dataset in turn.

## Business Classification
The first dataset is a wide dataset in that the design matrix, $X$, has a large number of features, each of which is fairly meaningless on its own. The original data came from a Kaggle dataset where each record was a business with plaintext scraped from the company's website, along with a label giving one of thirteen possible business classifications, e.g. "Financials", "Healthcare", "Information Technology", and so on. I derived a simpler dataset that one-hot-encodes 1000 words from these natural language columns (one binary column per word, as in the sample rows shown later), and also one-hot-encodes the labels to produce a matrix target, $Y$.

I find this data interesting for two reasons. (1) I work in workers' compensation, where the classification of the work that a business performs is of utmost importance. I want to prove to myself that this classification can be performed automatically using only data found on a given company's website. If this is indeed possible, it opens up several opportunities from a customer experience and product differentiation standpoint. (2) I am fascinated by natural language data, though I do not have much experience analyzing it. I figured this "bag-of-words" style dataset would be a good exercise in working with many features that are not very predictive on their own but could be in combination. I expect that even after tuning the models to optimize performance on holdout data, a fair amount of signal will remain uncaptured by the model.

## Wine Quality
The next dataset came from the UCI ML Repository and did not require as much manipulation to prepare for analysis. The original wine quality data was composed entirely of numeric columns, though it was split into two datasets: red and white wines. I combined the two and added a binary column called `red` that records which of the original datasets each row came from. I then turned the target into a multiclass classification problem by converting the numeric quality scores into "Low", "Medium", and "High" categories.
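A minimal sketch of that preparation is below. The toy frames stand in for the red and white files, and the cut points (scores up to 5 are "Low", 6 is "Medium", 7 and above are "High") are an assumption made for illustration; the thresholds actually used to build `Y_wine` are not stated here.

```python
import pandas as pd

# Hypothetical toy frames standing in for the red and white UCI files.
red_df = pd.DataFrame({"alcohol": [9.4, 9.8], "pH": [3.51, 3.20], "quality": [5, 6]})
white_df = pd.DataFrame({"alcohol": [8.8, 12.1], "pH": [3.00, 3.19], "quality": [4, 8]})

# Tag each row with its source dataset before stacking the two.
red_df["red"] = 1
white_df["red"] = 0
wine = pd.concat([red_df, white_df], ignore_index=True)

# Assumed cut points: (-inf, 5] -> Low, (5, 6] -> Medium, (6, inf) -> High.
bins = [-float("inf"), 5, 6, float("inf")]
categories = pd.cut(wine["quality"], bins=bins, labels=["Low", "Medium", "High"])

# Features keep the new `red` flag; the target becomes three indicator columns.
X_demo = wine.drop(columns="quality")
Y_demo = pd.get_dummies(categories).astype(int)
```

Because `pd.cut` returns an ordered categorical, `pd.get_dummies` emits the columns in the order Low, Medium, High, matching the layout of the target matrix.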

This dataset offers a contrast with the first one in that individual columns have a great deal more predictive value than in the business classification data. To continue the contrast, I expect the wine quality models to capture substantially more signal than the business classifiers, though I do not believe this will be a trivial task.
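One way to make "predictive value of an individual column" concrete is to estimate the mutual information between each feature and the class label. The sketch below uses synthetic data, with an invented `informative` column and a `noise` column, purely to show the mechanics; it is not a result from either dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic three-class problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 3, size=n)

X_demo = pd.DataFrame({
    # Tracks the class closely, so it should score high.
    "informative": y + rng.normal(scale=0.3, size=n),
    # Independent of the class, so it should score near zero.
    "noise": rng.normal(size=n),
})

# Estimated mutual information between each column and the label.
mi = pd.Series(
    mutual_info_classif(X_demo, y, random_state=0),
    index=X_demo.columns,
).sort_values(ascending=False)
```

To run the same check on the notebook's data, the one-hot target would first need to be collapsed to a single label column, for instance with `Y_wine.idxmax(axis=1)`.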

In [17]:
X_bus.loc[:4, ['ability', 'able', 'access', 'world', 'would']]

Unnamed: 0,ability,able,access,world,would
0,0,0,0,0,1
1,0,1,0,0,0
2,0,0,0,0,0
3,0,0,1,0,1
4,0,0,1,0,0


In [18]:
Y_bus.loc[:4]

Unnamed: 0,Commercial Services & Supplies,Consumer Discretionary,Consumer Staples,Corporate Services,Energy & Utilities,Financials,Healthcare,Industrials,Information Technology,Materials,"Media, Marketing & Sales",Professional Services,Transportation & Logistics
0,1,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0


In [19]:
X_wine.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,red
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,1


In [20]:
Y_wine.head(5)

Unnamed: 0,Low,Medium,High
0,1,0,0
1,1,0,0
2,1,0,0
3,0,1,0
4,1,0,0
