# Dataset: Wine Quality
This dataset comprises measurements of Portuguese wines (downloaded from [here](https://archive.ics.uci.edu/ml/datasets/Wine+Quality), published [here](http://www.sciencedirect.com/science/article/pii/S0167923609001377)). Specifically, 11 attributes (e.g. acidity, sugar, sulfur, pH) were measured for several thousand red and white wines, and each wine's quality was rated from 0 to 10.

In these exercises, we will use a variety of unsupervised and supervised learning techniques to classify and predict different wine attributes.

# Section 1: Loading and Preprocessing the Data
### Load and concatenate the two wine datasets. 
**Hint:** Remember to add a new column denoting whether each wine is red or white. Also, the datasets are semicolon-separated.
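A minimal sketch of the loading step, assuming pandas. The two `StringIO` objects are tiny made-up stand-ins (three columns, a handful of rows) so the cell runs without the real files, and `load_wines` is a hypothetical helper name. With the actual download you would pass the paths to the two CSV files instead.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the two downloaded files (made-up values,
# only three of the columns) so the sketch runs without the real data.
red_csv = io.StringIO('"fixed acidity";"pH";"quality"\n7.4;3.51;5\n7.8;3.20;5\n')
white_csv = io.StringIO(
    '"fixed acidity";"pH";"quality"\n7.0;3.00;6\n6.3;3.30;6\n8.1;3.26;6\n'
)

def load_wines(red_source, white_source):
    """Read both semicolon-separated files and stack them with a color column."""
    red = pd.read_csv(red_source, sep=";").assign(color="red")
    white = pd.read_csv(white_source, sep=";").assign(color="white")
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat([red, white], ignore_index=True)

wine = load_wines(red_csv, white_csv)
print(wine)
```

Adding the `color` column before concatenating keeps the label attached to each row, so no bookkeeping about row counts is needed afterwards.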

### Visualize the distribution of each of the 12 attributes with histograms.
This can be accomplished in several ways. Two such ways are:
1. Initialize multiple plots through Matplotlib and use distplot, as shown [here](http://seaborn.pydata.org/examples/distplot_options.html). Multiple for loops will be necessary.
2. Melt the dataframe preserving only the color attribute. Then use FacetGrid as shown [here](http://seaborn.pydata.org/examples/faceted_histogram.html). You will want to turn off sharex/sharey.

#### Matplotlib + Distplot

#### Seaborn only: FacetGrid + Distplot

Regardless of the plotting method, two trends should be apparent: 
1. Certain attributes strongly differentiate red vs. white wines, including the acidity, sulfur, and pH variables.
2. Several of the variables are clearly non-normally distributed, especially citric acid and alcohol.

The next step will be to normalize the variables to some constant scale for ease of fitting various machine learning models.

### Normalize the wine attributes (not including Quality) with the RobustScaler from Scikit-Learn.
Look back at the *Data Transformations with Scikit-Learn* section of Module 3 notes for a reminder if necessary.
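As a reminder of what the scaler does, here is a small sketch on synthetic data (the column values are random, not real wine measurements). RobustScaler centers each column on its median and divides by its interquartile range, which makes it less sensitive to outliers than mean/variance scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the wine DataFrame (random values).
rng = np.random.default_rng(0)
wine = pd.DataFrame({
    "pH": rng.normal(3.2, 0.15, 100),
    "alcohol": rng.gamma(9.0, 1.2, 100),
    "quality": rng.integers(3, 9, 100),
    "color": np.repeat(["red", "white"], 50),
})

# Scale everything except the target (quality) and the label (color).
attributes = [c for c in wine.columns if c not in ("quality", "color")]
scaled = wine.copy()
scaled[attributes] = RobustScaler().fit_transform(wine[attributes])

# After scaling, each attribute has median ~0 and interquartile range ~1.
print(scaled[attributes].median())
```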

### Using the lmplot function from Seaborn, plot the 11 attributes against Quality, split by color.
The final plot should have the attributes on the x-axis and quality on the y-axis. 

**Hint:** The DataFrame will need to be melted first, preserving quality and color. See [here](https://seaborn.pydata.org/examples/anscombes_quartet.html), [here](https://seaborn.pydata.org/examples/multiple_regression.html), and the Diabetes Dataset section of the Module 3 notes for inspiration. Try adjusting col_wrap, sharex, sharey, and y_jitter for better results.

As should now be clear, several of the variables seem to show a relationship to the Quality score. For example, higher volatile acidity seems to predict lower quality scores. In contrast, higher alcohol content seems to predict higher quality scores.

### Using Statsmodels, compute and visualize the variance inflation factor of the 11 attributes. 


Though the VIF for density is somewhat high (VIF ≈ 5.55), we will leave it in. Otherwise, our variables seem to be largely non-collinear, meaning that we should have no trouble fitting linear models.

# Section 2: Unsupervised Learning
In this section, we will attempt to cluster the two wine types (red & white) without any training.

### Perform principal components analysis on the 11 wine attributes, extracting the first two principal components. Print the explained variance of the first two components.
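A minimal sketch of the PCA step with Scikit-Learn. The matrix `X` is random synthetic data with 11 columns (half the rows shifted along one column), standing in for the scaled wine attributes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 11 scaled attributes (random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))
X[:150, 0] += 4.0  # shift half the rows so one direction dominates

pca = PCA(n_components=2)
components = pca.fit_transform(X)  # shape: (n_samples, 2)

# Fraction of total variance captured by each of the two components.
print(pca.explained_variance_ratio_)
```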

### Using the scatter function from Matplotlib or the kdeplot from Seaborn, make a scatterplot of the first two principal components split by wine color.
See [here](https://seaborn.pydata.org/examples/multiple_joint_kde.html) for a tutorial using kdeplot.

In either case, you will need to index into your PCA-reduced data twice: once per wine color. For nicer plotting colors, try using the Seaborn [color palette tools](http://seaborn.pydata.org/tutorial/color_palettes.html). Try to set a legend as well!

#### Using Matplotlib and scatter
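One way to do the "index twice" step with plain Matplotlib, sketched on synthetic data. The two groups, their sizes, and the plotting colors are all illustrative choices, not values from the wine dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in: two groups that differ in the first three attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))
X[150:, :3] += 3.0
color = np.repeat(["red", "white"], 150)

pcs = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
# Index into the reduced data once per wine color, with its own label.
for name, shade in [("red", "firebrick"), ("white", "goldenrod")]:
    mask = color == name
    ax.scatter(pcs[mask, 0], pcs[mask, 1], s=10, alpha=0.6,
               color=shade, label=name)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend(title="wine color")
```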

#### Using Seaborn and kdeplot

It is pretty clear from PCA that even two components are capable of separating out the two wine types. 

### Perform k-means clustering on the 11 wine attributes. Compute the accuracy score of the k-means fit.
To do this, we will need to binarize the wine color variable.
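A sketch of the clustering and scoring step on synthetic, well-separated data. One subtlety worth showing: k-means assigns cluster labels 0 and 1 arbitrarily, so the raw accuracy may come out inverted. Taking the larger of `acc` and `1 - acc` is one simple way to handle that in the two-cluster case.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

# Synthetic stand-in: two well-separated groups with 11 attributes each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 11)), rng.normal(4, 1, (100, 11))])
color = np.array(["red"] * 100 + ["white"] * 100)

# Binarize the color label: white -> 1, red -> 0.
y = (color == "white").astype(int)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cluster ids are arbitrary, so score both possible label assignments.
acc = accuracy_score(y, labels)
acc = max(acc, 1 - acc)
print(acc)
```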

Similar to PCA, K-means clustering accurately predicts wine type.

### Make a dendrogram using the Ward agglomerative clustering method.
**NOTE:** Caution, this exercise may be a little slow. Feel free to skip if running short on time.
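A sketch using SciPy's hierarchical clustering tools on a small synthetic sample. Truncating the tree to the last 12 merged clusters (`truncate_mode="lastp"`) is an illustrative choice to keep the plot readable. Working on a random subsample of the wines is one option for speed, though that is a suggestion here rather than part of the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Small synthetic stand-in: two well-separated groups of 30 rows each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 11)), rng.normal(4, 1, (30, 11))])

# Ward linkage merges the pair of clusters that least increases
# the total within-cluster variance at each step.
Z = linkage(X, method="ward")

fig, ax = plt.subplots(figsize=(8, 4))
# Show only the last 12 merged clusters to keep the plot readable.
tree = dendrogram(Z, truncate_mode="lastp", p=12, ax=ax)
ax.set_ylabel("Ward distance")

# Cutting the tree into two flat clusters gives labels comparable to k-means.
flat = fcluster(Z, t=2, criterion="maxclust")
```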

# Section 3: Supervised Learning
In this section, we will employ supervised learning approaches both to predict wine color and to predict quality score.

### Using the Scikit-Learn logistic regression method to predict wine color, demonstrate and visualize the effect of different training set sizes on test performance.
Specifically, test training set fractions from 0.25 to 0.75 in steps of 0.10, with the test size fixed at 0.25. Use either penalty (l1 or l2).

For visualization, try out Seaborn [boxplots](https://seaborn.pydata.org/examples/horizontal_boxplot.html), [swarmplots](https://seaborn.pydata.org/examples/scatterplot_categorical.html), or [violinplots](https://seaborn.pydata.org/examples/simple_violinplots.html). In any case, it will be easiest for you to store the scores in a new DataFrame. Take inspiration from the Module 3 Notes to do this.
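A sketch of one way to collect the scores, assuming `ShuffleSplit` for the repeated splits. The number of splits (10) is an arbitrary choice, and the data is synthetic. Each row of the resulting DataFrame is one split, which is the long format the Seaborn categorical plots expect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in: two overlapping classes with 11 attributes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 11)), rng.normal(1, 1, (200, 11))])
y = np.repeat([0, 1], 200)

train_sizes = [0.25, 0.35, 0.45, 0.55, 0.65, 0.75]
rows = []
for train_size in train_sizes:
    # Test fraction stays at 0.25; only the training fraction changes.
    cv = ShuffleSplit(n_splits=10, train_size=train_size, test_size=0.25,
                      random_state=0)
    scores = cross_val_score(LogisticRegression(penalty="l2"), X, y, cv=cv)
    rows.extend({"train_size": train_size, "accuracy": s} for s in scores)

scores_df = pd.DataFrame(rows)
print(scores_df.groupby("train_size")["accuracy"].mean())
```

The DataFrame can then be passed to a boxplot, swarmplot, or violinplot with `x="train_size"` and `y="accuracy"`.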

#### Perform Logistic Regression with cross validation

#### Assemble scores into DataFrame

#### Seaborn Boxplots


#### Seaborn Swarmplot


#### Seaborn Violinplot


### Fit a logistic regression model to a 75/25 split, compute scores, and plot the coefficients sorted by magnitude.
Be sure to include the (sorted) column names as xticklabels.
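A sketch of the fit-and-sort step on synthetic data. The column names `attr_0` through `attr_4` are hypothetical placeholders, and the labels are generated so that `attr_0` and `attr_3` carry the signal. Storing the coefficients in a pandas Series keeps each name attached to its value, so the sorted names become the tick labels automatically.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with hypothetical column names.
rng = np.random.default_rng(0)
names = [f"attr_{i}" for i in range(5)]
X = rng.normal(size=(400, 5))
signal = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=400)
y = (signal > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Sort by absolute value but keep the sign of each coefficient.
coefs = pd.Series(model.coef_[0], index=names)
coefs = coefs[coefs.abs().sort_values(ascending=False).index]
ax = coefs.plot.bar()
ax.set_ylabel("coefficient")
```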

### Turning now to predicting the quality score of wines, compare the performance of OLS linear regression against ridge regression for a sampling of training set sizes.
Specifically, test training set fractions from 0.25 to 0.75 in steps of 0.10, with the test size fixed at 0.25. Visualize using lmplot, splitting between the two regression model types.

**Hint:** Try preallocating a scores array to make the post-processing of the scores into a DataFrame easier. You will need to make an empty 3-dimensional array of shape [n_train_sizes, n_splits, 3]. The 3rd dimension will track: (train_size, OLS, Ridge).
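A sketch of the preallocation idea on synthetic data. In this version the first slot of the last dimension stores the training fraction and the third stores the ridge score, following the exercise heading. The ridge penalty (`alpha=1.0`) and the number of splits are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import ShuffleSplit

# Synthetic regression problem with 11 predictors (random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 11))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.5, size=400)

train_sizes = [0.25, 0.35, 0.45, 0.55, 0.65, 0.75]
n_splits = 10
# Last axis holds (train_size, OLS R^2, Ridge R^2) for each split.
scores = np.empty((len(train_sizes), n_splits, 3))

for i, train_size in enumerate(train_sizes):
    cv = ShuffleSplit(n_splits=n_splits, train_size=train_size,
                      test_size=0.25, random_state=0)
    for j, (train, test) in enumerate(cv.split(X)):
        ols = LinearRegression().fit(X[train], y[train])
        ridge = Ridge(alpha=1.0).fit(X[train], y[train])
        scores[i, j] = (train_size,
                        ols.score(X[test], y[test]),
                        ridge.score(X[test], y[test]))

# Flatten to one row per split, then melt so model type becomes a column.
wide = pd.DataFrame(scores.reshape(-1, 3), columns=["train_size", "OLS", "Ridge"])
scores_df = wide.melt(id_vars="train_size", var_name="model", value_name="r2")
```

The melted `scores_df` has one row per (split, model) pair, so the `model` column can serve as the hue when plotting.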

See [here](https://seaborn.pydata.org/examples/multiple_regression.html) for visualization help.

Both models perform similarly: generalization improves as the training set grows, with a ceiling of roughly R2 = 0.300. Let's see if we can improve the fit with other models.

### Fit a random forest regressor model to the 11 attributes to predict quality.
Using repeated 80/20 splits, evaluate the effect of forest size (number of trees) on predictive ability. Visualize with a swarmplot.
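A sketch of the forest-size comparison on synthetic data. The tree counts (5, 20, 50) and the number of splits are illustrative values chosen to keep the run time short, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression problem with a nonlinear term (random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=300)

# Repeated 80/20 splits, reused for every forest size.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

rows = []
for n_trees in [5, 20, 50]:
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    # The default score for a regressor is R^2 on the held-out split.
    for score in cross_val_score(forest, X, y, cv=cv):
        rows.append({"n_estimators": n_trees, "r2": score})

forest_df = pd.DataFrame(rows)
print(forest_df.groupby("n_estimators")["r2"].mean())
```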

### Given the performance of the random forest method, fit an 80/20 split, compute the score, and show the most important features.
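Random forests expose feature importances rather than coefficients, so a sketch of this step might look like the following. The data is synthetic, the names `attr_0` through `attr_5` are hypothetical, and the target is built to depend on `attr_2` alone.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in with hypothetical column names.
rng = np.random.default_rng(0)
names = [f"attr_{i}" for i in range(6)]
X = rng.normal(size=(400, 6))
y = 3.0 * X[:, 2] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test R^2:", forest.score(X_test, y_test))

# Impurity-based importances, one value per input column.
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```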