## White Wine From the Vinho Verde Region of Portugal - Predicting the Quality Score Using Machine Learning

### Project Description:
Our project examines 11 quantitative features of red and white wine data sets from the Vinho Verde region of Portugal. Using each wine's physicochemical breakdown, we built a predictive machine learning model with quality score as the target variable. Our insights and modeling give wine producers, distributors, and other stakeholders a way to predict a wine's quality score from its chemical composition.

#### Project Planning/Outline:
1. Intro
2. Acquire
3. Prepare/Wrangle
4. Split
5. Exploration Highlights
6. Stats Tests?
7. Scale
8. Clusters
9. Modeling
10. Conclusion
11. Next Steps

#### Target Variable aka Y_variable = 'Quality'

Quality is the median score, on a 1-10 numerical scale, given to a wine by at least three wine experts.

---
## About the data
Our project makes use of the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which is a labeled dataset consisting of 6,497 observations. Each observation represents a red or white Vinho Verde wine from Portugal and includes the physicochemical composition of the wine as well as a labeled `quality` score indicating the wine experts' opinion of the wine on a scale of 1 to 10.

### Target variable
**`quality`** - The median of the scores given to a wine by at least three industry experts

### Data Dictionary
| **Variable Name** | **Explanation** | **Unit** | **Values** |
| :---: | :---: | :---: | :---: |
| Fixed Acidity | Acids that do not evaporate readily (tartaric acid) | g/L | Float |
| Volatile Acidity | Acids that evaporate readily (acetic acid) | g/L | Float |
| Citric Acid | Level of citric acid | g/L | Float |
| Residual Sugar | Sugar that remains after fermentation | g/L | Float |
| Chlorides | Sodium chloride content | g/L | Float |
| Free Sulfur Dioxide | Level of free, gaseous sulfur dioxide | mg/L | Float |
| Total Sulfur Dioxide | Total level of sulfur dioxide | mg/L | Float |
| Density | Density in relation to water | g/cm^3 | Float |
| pH | Acidity of the wine | 1-14 | Float |
| Sulphates | Level of potassium sulfate | g/L | Float |
| Alcohol | Alcohol by volume | ABV% | Float |
| Quality | The median value of at least 3 independent evaluations by wine experts | 1-10 | Integer |

### Data Challenges


The dataset was very clean. For the most part, it is normally distributed; i.e., most wines receive an average quality score. There were no null values, and none of the values could be identified as obviously erroneous.

However, we discovered 1,177 duplicate records, which we dropped. This left 5,320 observations after cleaning: 1,359 red wines and 3,961 white wines.
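
The duplicate counts above can be reproduced with a check along the following lines. This is only a sketch on toy data: the helper name `drop_duplicates_by_type` and the `type` column are assumptions made for illustration, not necessarily how the project's own cleaning code is organized.

```python
import pandas as pd

def drop_duplicates_by_type(df: pd.DataFrame, type_col: str = "type"):
    """Return (deduplicated frame, number of duplicate rows dropped per wine type).

    Hypothetical helper for illustration only.
    """
    # duplicated() flags every repeat of an earlier identical row. Because the
    # type column is part of the row, a red and a white wine with the same
    # chemistry are never treated as duplicates of each other.
    dup_mask = df.duplicated()
    dropped_per_type = df.loc[dup_mask, type_col].value_counts()
    return df.loc[~dup_mask].reset_index(drop=True), dropped_per_type

# Toy data: one repeated red row and two repeated white rows
toy = pd.DataFrame({
    "alcohol": [9.4, 9.4, 10.1, 10.1, 10.1],
    "quality": [5, 5, 6, 6, 6],
    "type": ["red", "red", "white", "white", "white"],
})
clean, dropped = drop_duplicates_by_type(toy)
print(dropped)
print(clean.shape)
```

Counting per type before dropping makes it easy to report how many red and how many white observations remain.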


In [1]:
# Project helper modules
import helpers
import wrangle as wr
import viz
import model
import acquire

# General imports
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

warnings.filterwarnings("ignore")

# Evaluation
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from statsmodels.formula.api import ols

# Feature engineering
import sklearn.preprocessing
from sklearn.feature_selection import SelectKBest, RFE, f_regression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Modeling methods
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor

# Acquisition

- Load the white wine data using the `wrangle_data` function from `wrangle.py`

In [19]:
white = wr.wrangle_data("white")
white.head()

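| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol | quality | type |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 | white |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 | white |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 | white |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | white |
| 6 | 6.2 | 0.32 | 0.16 | 7.0 | 0.045 | 30.0 | 136.0 | 0.9949 | 3.18 | 0.47 | 9.6 | 6 | white |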
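
The internals of `wr.wrangle_data` are not shown in this notebook. As a rough sketch of what an acquisition helper of this kind might do, the hypothetical `load_wine` function below reads one wine file, converts the headers to snake_case, and adds a `type` column. It assumes a semicolon-delimited CSV, and it reads a small in-memory sample here so that it runs without the data files.

```python
import io
import pandas as pd

def load_wine(source, wine_type: str) -> pd.DataFrame:
    """Read one semicolon-delimited wine CSV and tag it with its type.

    Hypothetical helper, not the project's wrangle_data.
    """
    df = pd.read_csv(source, sep=";")
    # snake_case the headers, e.g. "fixed acidity" -> "fixed_acidity", "pH" -> "ph"
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    df["type"] = wine_type
    return df

# In-memory stand-in for a wine CSV file (only a few columns, two rows)
sample_csv = io.StringIO(
    '"fixed acidity";"volatile acidity";"pH";"quality"\n'
    "7;0.27;3;6\n"
    "6.3;0.3;3.3;6\n"
)
white_sample = load_wine(sample_csv, "white")
print(white_sample)
```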


# Preparation 

1. Confirm the data has zero nulls
2. Rename columns
3. Handle outliers
4. Drop duplicates

In [3]:
#null check
white.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [4]:
# Replace the spaces in column names with underscores
white = white.rename(columns={
    'fixed acidity': 'fixed_acidity', 'volatile acidity': 'volatile_acidity',
    'citric acid': 'citric_acid', 'residual sugar': 'residual_sugar',
    'free sulfur dioxide': 'free_sulfur_dioxide', 'total sulfur dioxide': 'total_sulfur_dioxide',
})

white.sample()

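| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 102 | 6.0 | 0.21 | 0.24 | 12.1 | 0.05 | 55.0 | 164.0 | 0.997 | 3.34 | 0.39 | 9.4 | 5 |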


In [5]:
white.shape

#4898 observations and 12 columns to start.

(4898, 12)

In [6]:
#Handle outliers

white = model.drop_outliers(white, "white", method='manual')

In [7]:
white.shape

(4426, 12)
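
The logic behind `model.drop_outliers(..., method='manual')` is not shown here, so the sketch below should not be read as that function. It illustrates a different, common technique: an interquartile-range (IQR) fence that keeps only rows within `k` IQRs of the quartiles. The function name, the default `k=1.5`, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

def drop_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # A row survives only if it is inside the fence for this column too
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Toy column with one extreme value
toy_sugar = pd.DataFrame({"residual_sugar": [1.2, 1.6, 2.2, 2.4, 6.9, 7.0, 8.5, 65.8]})
trimmed = drop_iqr_outliers(toy_sugar, ["residual_sugar"])
print(toy_sugar.shape, "->", trimmed.shape)
```

Comparing the shape before and after, as the notebook does with `white.shape`, shows how many rows a given rule removes.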

In [8]:
#Drop those duplicates

white = white.drop_duplicates()

In [9]:
white.shape

(3588, 12)

In [None]:
# TODO: add the columns that Tim made

## Prep takeaways

- We are working with a rather clean dataset.
- We started with 4,898 observations, kept 4,426 after outlier handling, and ended up with 3,588 after dropping duplicates.
- Outlier handling was the biggest time consumer within the prep stage.

### Split the dataframe using the split function from wrangle.py
- Separate the target in preparation for modeling
- Split the data for exploration

In [15]:
# Use the helper function to split the df, stratified on the target; it returns the train df and the associated X/y splits
trainwhite, X_trainwhite, X_validatewhite, X_testwhite, y_trainwhite, y_validatewhite, y_testwhite = wr.split(white, 'quality')

In [17]:
X_trainwhite.head()

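| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 450 | 7.2 | 0.6 | 0.2 | 9.9 | 0.07 | 21.0 | 174.0 | 0.9971 | 3.03 | 0.54 | 9.1 |
| 1039 | 7.5 | 0.17 | 0.34 | 1.4 | 0.035 | 13.0 | 102.0 | 0.9918 | 3.05 | 0.74 | 11.0 |
| 2004 | 7.4 | 0.26 | 0.31 | 2.4 | 0.043 | 58.0 | 178.0 | 0.9941 | 3.42 | 0.68 | 10.6 |
| 1837 | 7.2 | 0.24 | 0.29 | 2.2 | 0.037 | 37.0 | 102.0 | 0.992 | 3.27 | 0.64 | 11.0 |
| 4344 | 6.7 | 0.27 | 0.69 | 1.2 | 0.176 | 36.0 | 106.0 | 0.99288 | 2.96 | 0.43 | 9.2 |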
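
The body of `wr.split` lives in `wrangle.py` and is not reproduced in this notebook. A minimal sketch of a stratified train/validate/test split that returns the same seven objects could look like the following. The split proportions, the `seed` default, and the toy data are assumptions, not values taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split(df: pd.DataFrame, target: str, seed: int = 123):
    """Stratified train/validate/test split (hypothetical stand-in for wr.split)."""
    # Hold out 20% for test, then 30% of the remainder for validate,
    # keeping the target's class proportions similar in every split.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed, stratify=train_validate[target]
    )
    # Separate the features (X) from the target (y) for modeling
    X_train, y_train = train.drop(columns=target), train[target]
    X_validate, y_validate = validate.drop(columns=target), validate[target]
    X_test, y_test = test.drop(columns=target), test[target]
    return train, X_train, X_validate, X_test, y_train, y_validate, y_test

# Toy data: 100 wines across three quality scores
toy_wine = pd.DataFrame({
    "alcohol": np.linspace(8, 14, 100),
    "quality": [5] * 40 + [6] * 40 + [7] * 20,
})
train, X_train, X_validate, X_test, y_train, y_validate, y_test = split(toy_wine, "quality")
print(X_train.shape, X_validate.shape, X_test.shape)
```

Returning the unsplit `train` frame alongside `X_train` and `y_train` keeps the target attached for exploration while the separated pieces are ready for modeling.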


# Let's get to exploring