## Manifold Learning: PCA vs. LLE on the Wine Dataset
    
Data Set Information: Kaggle Red Wine Dataset. 1599 examples with 11 physicochemical features and a quality score. Predict wine quality (0-10).

https://www.kaggle.com/piyushgoyal443/red-wine-dataset#wineQualityInfo.txt

1) Load the wine quality data set.

2) Fit PCA and plot the cumulative sum of the `pca.explained_variance_ratio_`.

3) Identify the number of principal components to explain 90% of the variance.

4) Build a logistic regression model on the principal components and record the accuracy.

5) Repeat step 4 using LLE with the same number of components and 30 neighbors.

6) Record your observations and identify your top performing model. Does manifold learning improve predictive performance over PCA in this case?

### Red Wine Dataset

Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 

#### Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

#### Output variable (based on sensory data): 

12 - quality (score between 0 and 10)

### Load the wine quality dataset and important libraries

In [61]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


df = pd.read_csv(
    "https://raw.githubusercontent.com/Thinkful-Ed/data-science-lectures/master/wineQualityReds.csv"
)
df.head()

Unnamed: 0.1,Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
0,1,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,2,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,3,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,4,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,5,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [4]:
# Check the shape of the DataFrame
df.shape

(1599, 13)

In [6]:
# Check for missing values and handle them appropriately if there are any
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            1599 non-null   int64  
 1   fixed.acidity         1599 non-null   float64
 2   volatile.acidity      1599 non-null   float64
 3   citric.acid           1599 non-null   float64
 4   residual.sugar        1599 non-null   float64
 5   chlorides             1599 non-null   float64
 6   free.sulfur.dioxide   1599 non-null   float64
 7   total.sulfur.dioxide  1599 non-null   float64
 8   density               1599 non-null   float64
 9   pH                    1599 non-null   float64
 10  sulphates             1599 non-null   float64
 11  alcohol               1599 non-null   float64
 12  quality               1599 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 162.5 KB


In [13]:
df.describe()

Unnamed: 0.1,Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,800.0,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,461.735855,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,1.0,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,400.5,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,800.0,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,1199.5,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,1599.0,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


We're interested in predicting the `quality` based on the other features.
* Investigate the `quality` column.
    * What are the most common values?
    * Show this with a plot.

In [8]:
# Check the unique values in the "quality" column
df["quality"].unique()

array([5, 6, 7, 4, 8, 3])

In [9]:
# Count the unique values in "quality" column
df.quality.value_counts()

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64
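
The counts above also give a useful reference point for the models that follow. The short sketch below, which simply copies the counts from the output above into a dictionary (the variable names are illustrative), computes each score's share and the accuracy of a trivial model that always predicts the most common score.

```python
# Class counts copied from the value_counts() output above.
counts = {5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}
total = sum(counts.values())

# Share of each quality score, most common first.
shares = {q: n / total for q, n in sorted(counts.items(), key=lambda kv: -kv[1])}
for q, share in shares.items():
    print(f"quality {q}: {share:.1%}")

# Accuracy of a model that always predicts the most common score.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total
print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Scores 5 and 6 together cover most of the data, so any classifier should be judged against this majority-class baseline rather than against zero.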

In [11]:
import plotly.express as px

In [12]:
# Plot for quality
fig = px.histogram(df, x='quality')
fig.show()

Separate the `X` and the `y` in preparation to create a supervised learning model.

In [14]:
df.columns

Index(['Unnamed: 0', 'fixed.acidity', 'volatile.acidity', 'citric.acid',
       'residual.sugar', 'chlorides', 'free.sulfur.dioxide',
       'total.sulfur.dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality'],
      dtype='object')

In [15]:
# Separate dependent and independent variables
X = df.drop(columns=['quality','Unnamed: 0'])
y = df['quality']

In [16]:
# Check and print X
X.shape

(1599, 11)

In [17]:
# Check and print y
y.shape

(1599,)

Perform a train test split.

In [21]:
# Split the dataset into the Training set and Test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [23]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1279, 11)
(320, 11)
(1279,)
(320,)
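
Because a few quality scores are rare (only 10 wines score 3 and 18 score 8), a purely random split can leave very few of them in the test set, and without a fixed seed the split changes on every run. One possible variation, sketched here on synthetic stand-in data rather than the wine data (`X_demo`, `y_demo`, and the class probabilities are assumptions made for illustration), uses the `stratify` and `random_state` arguments of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 1599 rows, 11 features, imbalanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1599, 11))
y_demo = rng.choice(
    [3, 4, 5, 6, 7, 8], size=1599, p=[0.01, 0.03, 0.43, 0.40, 0.12, 0.01]
)

# stratify keeps the class proportions similar in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("share of class 5 in train:", round(float(np.mean(y_tr == 5)), 3))
print("share of class 5 in test: ", round(float(np.mean(y_te == 5)), 3))
```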


In our modeling process we'd like to use `PCA`.  `PCA` is sensitive to data being on different scales.  Scale the data using `StandardScaler`.

In [50]:
# Feature Scaling
scale = StandardScaler()
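# Learn the mean and standard deviation from the training set only
train_scaled = scale.fit_transform(X_train)
# Reuse the training statistics on the test set (do not refit)
test_scaled = scale.transform(X_test)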
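
As a small illustration of how scaling interacts with the train/test split, the sketch below fits `StandardScaler` on a training array and applies the learned statistics to a separate test array. The data is synthetic and the `*_demo` names are hypothetical; the point is only that the test set is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a train/test split.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
X_test_demo = rng.normal(loc=10.0, scale=3.0, size=(50, 4))

scaler = StandardScaler()
train_scaled_demo = scaler.fit_transform(X_train_demo)  # learn mean/std from train only
test_scaled_demo = scaler.transform(X_test_demo)        # reuse the training mean/std

# Training columns are centered at zero by construction;
# test columns are only approximately centered.
print(train_scaled_demo.mean(axis=0).round(3))
print(test_scaled_demo.mean(axis=0).round(3))
```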



Fit an instance of `PCA` to the scaled training data and explore the `explained_variance_ratio_` attribute.  Create a plot showing the cumulative sum of `.explained_variance_ratio_`.

In [46]:
pcaline = PCA()
pcaline.fit(train_scaled)
exp_var_cum = np.cumsum(pcaline.explained_variance_ratio_) 
px.area(
   x=range(1, exp_var_cum.shape[0] + 1),
   y=exp_var_cum,
   labels={'x': 'Number of components', 'y':'Explained variance'}
)

Looking at the cumulative variance explained, identify the `n_components` that explain 90% of the variance. Then build a `PCA` model with that number of components and transform the data.

In [47]:
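# Apply PCA with the number of components read off the cumulative variance plot
pca90 = PCA(n_components=7)
pca7 = pca90.fit_transform(train_scaled)  # fit once and project the training data
pca90_ = pca90  # alias of the fitted estimator, used in the next cell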

In [48]:
# Print the explained_variance
print(pca90_.explained_variance_)

[3.07805385 1.96769612 1.52416172 1.22604088 0.96653726 0.65307145
 0.57694001]
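
The number of components can also be found programmatically instead of being read off the plot. The following is a minimal sketch on synthetic correlated data (not the wine data; `X_demo`, `cum_var`, and `n_needed` are illustrative names). It finds the first position where the cumulative explained variance ratio reaches 0.90 and compares that with scikit-learn's option of passing a fraction as `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic correlated data: 11 observed features driven by 4 latent factors plus noise.
latent = rng.normal(size=(500, 4))
X_demo = latent @ rng.normal(size=(4, 11)) + 0.3 * rng.normal(size=(500, 11))
X_demo = StandardScaler().fit_transform(X_demo)

# Manual approach: first index where the cumulative ratio reaches 90%.
cum_var = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_needed = int(np.argmax(cum_var >= 0.90)) + 1

# Built-in approach: a float in (0, 1) asks PCA to keep enough components
# to exceed that fraction of the variance.
pca_auto = PCA(n_components=0.90, svd_solver="full").fit(X_demo)

print("manual:", n_needed, "built-in:", pca_auto.n_components_)
```

Both approaches should agree except in the edge case where the cumulative ratio lands exactly on the threshold.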


Fit and score a logistic regression model using the principal components as the predictors and the quality as the target.

In [53]:
# Fit logistic regression on the 7 principal components
lr = LogisticRegression()
lr.fit(pca7, y_train)
# Predict on the training data, so the metrics below are in-sample
y_pred = lr.predict(pca7)

Create and print a confusion matrix to further explore the model's performance.

In [66]:
print(confusion_matrix(y_train, y_pred))

[[  1   1   4   1   0   0]
 [  0   0  27  15   0   0]
 [  0   1 410 128   3   0]
 [  0   0 178 299  33   0]
 [  0   0  10 102  50   0]
 [  0   0   0   9   7   0]]
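
Per-class recall, per-class precision, and overall accuracy can all be derived directly from this matrix. The sketch below copies the matrix printed above into a NumPy array (rows are true classes, columns are predicted classes, in the label order 3 to 8) and computes them; the variable names are illustrative.

```python
import numpy as np

labels = [3, 4, 5, 6, 7, 8]
# PCA + logistic regression confusion matrix copied from the output above.
cm = np.array([
    [1, 1, 4, 1, 0, 0],
    [0, 0, 27, 15, 0, 0],
    [0, 1, 410, 128, 3, 0],
    [0, 0, 178, 299, 33, 0],
    [0, 0, 10, 102, 50, 0],
    [0, 0, 0, 9, 7, 0],
])

support = cm.sum(axis=1)    # number of true samples per class (row sums)
predicted = cm.sum(axis=0)  # number of predictions per class (column sums)
correct = np.diag(cm)       # correctly classified samples per class

recall = correct / support
# Classes that were never predicted get precision 0 instead of a division by zero.
precision = np.divide(correct, predicted, out=np.zeros(len(labels)), where=predicted > 0)
accuracy = correct.sum() / cm.sum()

for label, p, r, s in zip(labels, precision, recall, support):
    print(f"quality {label}: precision={p:.2f} recall={r:.2f} support={s}")
print(f"accuracy={accuracy:.2f}")
```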


In [63]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           3       1.00      0.14      0.25         7
           4       0.00      0.00      0.00        42
           5       0.65      0.76      0.70       542
           6       0.54      0.59      0.56       510
           7       0.54      0.31      0.39       162
           8       0.00      0.00      0.00        16

    accuracy                           0.59      1279
   macro avg       0.45      0.30      0.32      1279
weighted avg       0.57      0.59      0.57      1279




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
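
The confusion matrix and report above are computed on the training data, so they describe fit rather than generalization. A minimal sketch of a held-out evaluation follows, assuming a scikit-learn pipeline so that scaling and PCA are fitted on the training split only. It uses synthetic data from `make_classification` as a stand-in (not the wine data), and `max_iter=1000` is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the wine data): 11 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaling and PCA are fitted inside the pipeline, using the training split only.
model = make_pipeline(
    StandardScaler(), PCA(n_components=7), LogisticRegression(max_iter=1000)
)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

On the wine data, the analogous step would be to transform the scaled test set with the already fitted PCA and score the logistic regression on it.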



Repeat the modeling process but transform your predictors using `LocallyLinearEmbedding` instead of `PCA`.

* Use `n_neighbors` = 30
* Set `n_components` to the same value you used for PCA
* Use `method`='standard'

In [56]:
# Apply LLE
lle = LocallyLinearEmbedding(n_neighbors=30, n_components=7, method='standard')
ll7 = lle.fit_transform(train_scaled)  # fit once and embed the training data
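
Like `PCA`, a fitted `LocallyLinearEmbedding` exposes a `transform` method, which places new points in the embedding using their nearest training neighbors. The sketch below shows this on random synthetic data (the `*_demo` names are hypothetical). It sets `eigen_solver="dense"`, an assumption made here only to keep the small example deterministic; the cell above relies on the default solver.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic stand-in for scaled train/test features.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(300, 11))
X_test_demo = rng.normal(size=(60, 11))

lle_demo = LocallyLinearEmbedding(
    n_neighbors=30, n_components=7, method="standard", eigen_solver="dense"
)
train_embedded = lle_demo.fit_transform(X_train_demo)  # learn the embedding on train
test_embedded = lle_demo.transform(X_test_demo)        # embed unseen points

print(train_embedded.shape, test_embedded.shape)
```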

In [59]:
# Fit logistic regression on the 7 LLE components
lr1 = LogisticRegression()
lr1.fit(ll7, y_train)
# Predict on the training data, so the metrics below are in-sample
y_pred1 = lr1.predict(ll7)

In [65]:
print(confusion_matrix(y_train, y_pred1))

[[  0   0   7   0   0   0]
 [  0   0  36   6   0   0]
 [  0   0 461  81   0   0]
 [  0   0 269 241   0   0]
 [  0   0  24 138   0   0]
 [  0   0   0  16   0   0]]


In [64]:
print(classification_report(y_train, y_pred1))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         7
           4       0.00      0.00      0.00        42
           5       0.58      0.85      0.69       542
           6       0.50      0.47      0.49       510
           7       0.00      0.00      0.00       162
           8       0.00      0.00      0.00        16

    accuracy                           0.55      1279
   macro avg       0.18      0.22      0.20      1279
weighted avg       0.44      0.55      0.49      1279




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



Compare the model performance. What conclusions can you draw?
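
One way to start the comparison is to summarize the two confusion matrices printed above side by side. The sketch below copies both matrices into arrays and reports training accuracy, the number of classes each model ever predicts, and the accuracy of always predicting the largest training class. The helper `summarize` is a hypothetical name, and all figures are training-set figures, so they say nothing definitive about held-out performance.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true, columns: predicted).
cm_pca = np.array([
    [1, 1, 4, 1, 0, 0],
    [0, 0, 27, 15, 0, 0],
    [0, 1, 410, 128, 3, 0],
    [0, 0, 178, 299, 33, 0],
    [0, 0, 10, 102, 50, 0],
    [0, 0, 0, 9, 7, 0],
])
cm_lle = np.array([
    [0, 0, 7, 0, 0, 0],
    [0, 0, 36, 6, 0, 0],
    [0, 0, 461, 81, 0, 0],
    [0, 0, 269, 241, 0, 0],
    [0, 0, 24, 138, 0, 0],
    [0, 0, 0, 16, 0, 0],
])

def summarize(cm):
    """Return (accuracy, number of classes that receive at least one prediction)."""
    accuracy = np.trace(cm) / cm.sum()
    n_predicted_classes = int((cm.sum(axis=0) > 0).sum())
    return accuracy, n_predicted_classes

acc_pca, n_pca = summarize(cm_pca)
acc_lle, n_lle = summarize(cm_lle)
# Always predicting the largest training class (the biggest row sum).
baseline = cm_pca.sum(axis=1).max() / cm_pca.sum()

print(f"PCA + LR: accuracy={acc_pca:.3f}, classes predicted={n_pca}")
print(f"LLE + LR: accuracy={acc_lle:.3f}, classes predicted={n_lle}")
print(f"majority-class baseline: {baseline:.3f}")
```

On these training-set figures, the PCA-based model scores higher than the LLE-based one, and the LLE-based model only ever predicts scores 5 and 6; both beat the majority-class baseline.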