# <a id="0">Wine Data Exercises</a>

In this notebook, we will review the basic steps of exploratory data analysis, following the example in the _EDA-PIPELINE.ipynb_ notebook. We will work with the wine dataset __winequality-white.csv__ provided in the __data__ folder.

__Dataset schema:__ 
   - fixed acidity
   - volatile acidity
   - citric acid
   - residual sugar
   - chlorides
   - free sulfur dioxide
   - total sulfur dioxide
   - density
   - pH
   - sulphates
   - alcohol

   Output variable (based on sensory data): 
   - quality (score between 0 and 10)


You will follow the same steps as in the _EDA-PIPELINE.ipynb_ notebook to complete the wine quality classification tasks. The section headers are provided; you will need to copy the code over and modify it where necessary.

1. <a href="#1">Overall Statistics</a>
2. <a href="#2">Select Feature Columns</a>
3. <a href="#3">Basic Plots</a>
4. <a href="#4">Impute Missing Values</a>
5. <a href="#5">Preparing Training and Test Datasets</a>
6. <a href="#6">Data Processing with Pipeline</a>



__Copy multiple cells from one notebook to another__
 - Select a cell and press Esc to enter command mode.
 - Press Shift + Up or Shift + Down to select multiple cells.
 - Copy with Ctrl/Cmd + C.
 - Paste with Ctrl/Cmd + V.

## 1. <a id="1">Overall Statistics</a>
(<a href="#0">Go to top</a>)

Let's read the dataset into a DataFrame using pandas and look at its overall statistics.
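
As a starting point, here is a minimal sketch of loading a CSV and printing summary statistics. To keep it runnable without the file, it reads a few made-up rows from an in-memory string; the rows, the column subset, and the comma delimiter are all assumptions. In the notebook you would pass the path to __winequality-white.csv__ to `pd.read_csv` instead (and `sep=";"` if your copy is semicolon-delimited).

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a few made-up rows with a subset
# of the columns. Replace `sample_csv` with the path to the CSV in the notebook.
sample_csv = io.StringIO(
    "fixed acidity,pH,alcohol,quality\n"
    "7.0,3.00,8.8,6\n"
    "6.3,3.30,9.5,6\n"
    "8.1,,10.1,5\n"
    "7.2,3.19,9.9,7\n"
)
df = pd.read_csv(sample_csv)

print("Shape:", df.shape)   # (number of rows, number of columns)
df.info()                   # column dtypes and non-null counts
print(df.describe())        # count, mean, std, min, quartiles, max per numeric column
print(df.isna().sum())      # number of missing values per column
```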

## 2. <a id="2">Select Feature Columns</a>
(<a href="#0">Go to top</a>)

Let's separate model features and model target.
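
One possible way to do this is sketched below, assuming `quality` is the target and every other column is a feature. The small DataFrame and the variable names `model_target` and `model_features` are illustrative choices, not necessarily those used in the reference notebook.

```python
import pandas as pd

# Made-up rows with a subset of the wine columns, for illustration only.
df = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 8.1, 7.2],
    "pH": [3.00, 3.30, 3.26, 3.19],
    "alcohol": [8.8, 9.5, 10.1, 9.9],
    "quality": [6, 6, 5, 7],
})

model_target = "quality"
# Every column except the target is treated as a feature.
model_features = [col for col in df.columns if col != model_target]

X = df[model_features]   # feature matrix (DataFrame)
y = df[model_target]     # target vector (Series)

print("Features:", model_features)
print("Target:", model_target)
```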

## 3. <a id="3">Basic Plots</a>
(<a href="#0">Go to top</a>)

### Bar Plots and Histograms

In this section, we examine our data with plots. Important note: These plots ignore null (missing) values. We will learn how to deal with missing values in the next section.


__Bar plots:__ These plots show counts of categorical data fields. The pandas __value_counts()__ function returns the count of each unique value, which makes it useful for categorical variables.

First, let's look at the distribution of the model target.
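
A minimal sketch of a target bar plot follows. The quality scores are made up; in the notebook you would use the `quality` column of the wine DataFrame. Plotting through pandas requires matplotlib to be installed.

```python
import pandas as pd

# Made-up quality scores standing in for the real target column.
quality = pd.Series([6, 6, 5, 7, 6, 5, 8, 6], name="quality")

# Count each unique score, then order the bars by score rather than by frequency.
counts = quality.value_counts().sort_index()
print(counts)

ax = counts.plot.bar()
ax.set_xlabel("quality")
ax.set_ylabel("count")
```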

__Histograms:__ Histograms show the distribution of numeric data. The data is divided into intervals, also known as "buckets" or "bins".
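
The sketch below plots a histogram of one numeric feature and also prints the bin edges and counts with NumPy, to make the idea of "bins" concrete. The values and the choice of 4 bins are assumptions for illustration. The missing value is skipped by the plot and is dropped explicitly before calling `np.histogram`.

```python
import numpy as np
import pandas as pd

# Made-up alcohol values, including one missing entry.
alcohol = pd.Series([8.8, 9.5, 10.1, 9.9, None, 11.2, 12.0, 9.4], name="alcohol")

# Histogram plot; missing values are left out of the plot.
ax = alcohol.plot.hist(bins=4)
ax.set_xlabel("alcohol")

# The same binning computed numerically: 4 equal-width bins over the data range.
bin_counts, bin_edges = np.histogram(alcohol.dropna(), bins=4)
print("Bin edges:", bin_edges)
print("Counts per bin:", bin_counts)
```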

### Remove Outliers 

For the wine dataset, you may select 1-2 features and remove the outliers in these features. (Note what happens if you perform this operation on all features.)
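
One common way to remove outliers is the 1.5 x IQR rule, sketched below. This is an assumption about the method; the reference notebook may use a different rule. The helper `remove_outliers_iqr` and the data are hypothetical. A row is dropped if any of the selected features falls outside its range, so selecting more features tends to remove more rows.

```python
import pandas as pd

# Made-up values; 65.8 is an obvious outlier in "residual sugar".
df = pd.DataFrame({
    "residual sugar": [1.6, 6.9, 8.5, 7.0, 65.8, 5.2, 4.7, 9.1],
    "alcohol": [8.8, 9.5, 10.1, 9.9, 11.7, 11.2, 12.0, 9.4],
})


def remove_outliers_iqr(frame, columns, k=1.5):
    """Keep rows whose values in `columns` lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        in_range = frame[col].between(q1 - k * iqr, q3 + k * iqr)
        # Missing values are kept here so they can be imputed later.
        keep &= in_range | frame[col].isna()
    return frame[keep]


cleaned = remove_outliers_iqr(df, ["residual sugar"])
print(f"Rows before: {len(df)}, rows after: {len(cleaned)}")
```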

### Scatter Plot and Correlation Matrix

We can draw a scatter plot for two selected numeric features.

We plot the correlation matrix. Correlation scores are calculated for numerical fields. 
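
A minimal sketch of both steps, using made-up values for three numeric columns. `DataFrame.corr()` computes pairwise Pearson correlations by default; the choice of `density` and `alcohol` for the scatter plot is only an example.

```python
import pandas as pd

# Made-up values for three numeric features.
df = pd.DataFrame({
    "density": [1.0010, 0.9940, 0.9951, 0.9956, 0.9920, 0.9900],
    "alcohol": [8.8, 9.5, 10.1, 9.9, 11.7, 12.4],
    "residual sugar": [20.7, 1.6, 6.9, 8.5, 4.2, 1.2],
})

# Scatter plot of two selected numeric features (requires matplotlib).
ax = df.plot.scatter(x="density", y="alcohol")

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

To render the matrix as a colored table in a notebook, `corr.style.background_gradient()` is one option.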

## 4. <a id="4">Impute Missing Values</a>
(<a href="#0">Go to top</a>)

### Impute (fill-in) missing values with .fillna()


Rather than dropping rows (data samples) and/or columns (features), another strategy for dealing with missing values is to fill them in with new values. This is called imputation.

__Imputing Numerical Values:__ The easiest way to impute numerical values is to get the __average (mean) value__ for the corresponding column and use that as the new value for each missing record in that column. 
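
Mean imputation with `.fillna()` can be sketched as follows, using a made-up DataFrame with two missing entries. `df.mean()` returns one mean per column (ignoring missing values), and `.fillna()` matches those means to columns by name.

```python
import pandas as pd

# Made-up values with one missing entry per column.
df = pd.DataFrame({
    "pH": [3.0, 3.3, None, 3.2],
    "alcohol": [8.8, None, 10.1, 9.9],
})

print(df.isna().sum())            # missing values per column before imputation

# Replace each missing value with the mean of its own column.
df_imputed = df.fillna(df.mean())

print(df_imputed)
print(df_imputed.isna().sum())    # no missing values remain
```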

An elegant way to implement imputation is using sklearn's __SimpleImputer__, a class implementing .fit() and .transform() methods.
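
The sketch below shows the fit/transform pattern with `SimpleImputer` on made-up data. The means are learned from one DataFrame with `.fit()` and then reused to fill missing values in another with `.transform()`. The names `X_train` and `X_new` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training data and new data, each with missing entries.
X_train = pd.DataFrame({
    "pH": [3.0, 3.3, np.nan, 3.2],
    "alcohol": [8.8, np.nan, 10.1, 9.9],
})
X_new = pd.DataFrame({
    "pH": [np.nan, 3.1],
    "alcohol": [11.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                      # learns the column means from X_train

X_train_imp = imputer.transform(X_train)  # returns a NumPy array
X_new_imp = imputer.transform(X_new)      # reuses the means learned from X_train

print("Learned means:", imputer.statistics_)
print(X_new_imp)
```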

## 5. <a id="5">Preparing Training and Test Datasets</a>
(<a href="#0">Go to top</a>)

We split our dataset into training (90%) and test (10%) subsets using sklearn's [__train_test_split()__](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.
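
A minimal sketch of a 90/10 split on a made-up 20-row dataset. The `random_state` value is an arbitrary choice that makes the shuffle reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, one feature and a target.
df = pd.DataFrame({
    "alcohol": [8.0 + 0.2 * i for i in range(20)],
    "quality": [5, 6] * 10,
})
X = df[["alcohol"]]
y = df["quality"]

# test_size=0.1 holds out 10% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=23
)

print("Training rows:", len(X_train))
print("Test rows:", len(X_test))
```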

## 6. <a id="6">Data Processing with Pipeline</a>
(<a href="#0">Go to top</a>)

In a typical machine learning workflow, you will need to apply data transformations, such as the imputation and scaling shown here, at least twice: first on the training dataset, with __.fit()__ and __.transform()__, when preparing the data for training the model, and again on any new data you want to predict on, with __.transform()__. Scikit-learn's [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is a tool that simplifies this process by enforcing the implementation and order of data processing steps.

We build a pipeline to impute the missing values with the mean using sklearn's SimpleImputer, scale the numerical features to have similar orders of magnitude by bringing them into the 0-1 range with sklearn's MinMaxScaler, and finally train a KNN estimator on the imputed and scaled dataset. 
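
The pipeline described above might look like the sketch below. The data is randomly generated, the binary target is made up, and `KNeighborsClassifier` with `n_neighbors=3` is an assumed choice of KNN estimator; with the wine data, `X` and `y` would be the feature columns and the `quality` column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Randomly generated stand-in data with a made-up binary target.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, n),
    "residual sugar": rng.uniform(1, 20, n),
})
y = (X["alcohol"] > 11).astype(int)
X.iloc[::10, 1] = np.nan   # inject some missing values into "residual sugar"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=23
)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),   # fill missing values with column means
    ("scaler", MinMaxScaler()),                    # rescale each feature to the 0-1 range
    ("knn", KNeighborsClassifier(n_neighbors=3)),  # final estimator
])

# fit() fits the imputer and scaler on the training data only, then trains KNN.
pipeline.fit(X_train, y_train)

# predict() applies the already-fitted imputer and scaler before predicting.
preds = pipeline.predict(X_test)
print("Test accuracy:", pipeline.score(X_test, y_test))
```

Because the imputer and scaler are fitted inside the pipeline, their statistics come from the training split only, and the same transformations are applied to the test split automatically.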