# Lab 3: Continuous Data

<img src="https://images.freeimages.com/images/large-previews/bc2/wine-1328360.jpg" width="500"/>

Today is all about wine. The dataset we are using is full of continuous, numeric data. Let's explore the dataset!


## TASK 1: Exploration and Initial Analysis


### Data Collection
Read the file [``winedata.csv``](https://raw.githubusercontent.com/ADSLab-Salzburg/DataAnalysiswithPython/main/data/winedata.csv), a dataset listing the features of several different wines (e.g. alcohol level, acids), for an exploratory data analysis.

In [None]:
# YOUR CODE HERE
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('https://raw.githubusercontent.com/ADSLab-Salzburg/DataAnalysiswithPython/main/data/winedata.csv')
df.head()


In [None]:
# Ignore this cell - this is for automatic grading.

### Data Exploration

Print the first lines of the dataset and examine it (``head()``, ``describe()``). Does any feature contain invalid values? If so, which one?

In [None]:
# YOUR CODE HERE
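# Only the last expression of a cell is displayed automatically, so print the head explicitly.
print(df.head())
df.describe()
# Ash must contain NaN values: its count is only 173, while all other columns have a count of 178.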
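
The `count` row of `describe()` hints at missing values, but they can also be counted directly. The following is a minimal sketch on a hypothetical toy DataFrame (the values are made up and only the column names mimic the wine dataset), assuming missing entries are encoded as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from winedata.csv.
toy = pd.DataFrame({
    "Alcohol": [14.2, 13.2, 13.1, 14.4],
    "Ash": [2.4, np.nan, 2.6, np.nan],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# describe() counts only non-missing values, so a count below
# len(toy) signals that the column contains NaN.
print(toy.describe().loc["count"])
```

Applied to `df`, the same `isna().sum()` call would show how many entries are missing in each feature.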

Dynamically create a Python list that contains the names of the columns (features) with invalid values. Store it in the variable `colums_containing_nan`.

In [None]:
# YOUR CODE HERE
colums_containing_nan = df.columns[df.isna().any()].tolist()
colums_containing_nan

In [None]:
# Ignore this cell - this is for automatic grading.

### Data Preprocessing
Clean the dataset using appropriate methods. See Lecture 5 for some approaches. Do this in place on the DataFrame `df`.

In [None]:
# YOUR CODE HERE
df.loc[df['Ash'].isna(), 'Ash'] = df['Ash'].mean()
df.describe()
 

In [None]:
# Ignore this cell - this is for automatic grading.

Explain why you chose exactly this method to clean the dataset. Max. three sentences!

YOUR ANSWER HERE

I used the mean because it has minimal influence on the dataset: it only changes three rows of the `describe()` output for the Ash column (count, std and 75%), with the value changes limited to the second decimal place.
Another option would have been to delete all rows with NaN values, but that would have affected every column, and I could have unknowingly deleted the min or max value of another column, thereby changing the whole dataset.
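
The trade-off described above can be illustrated with a small sketch. The DataFrame below is hypothetical (made-up values, column names borrowed from the wine dataset) and is only meant to show how mean imputation keeps all rows, while dropping rows can remove valid values of other columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the row with the missing Ash value
# also holds the smallest Alcohol value.
toy = pd.DataFrame({
    "Alcohol": [14.0, 12.0, 13.0, 15.0],
    "Ash": [2.0, np.nan, 3.0, 2.5],
})

# Option A: fill missing Ash values with the column mean; all rows survive.
imputed = toy.copy()
imputed["Ash"] = imputed["Ash"].fillna(imputed["Ash"].mean())

# Option B: drop every row containing a NaN; this also discards
# that row's valid Alcohol value.
dropped = toy.dropna()

print(len(imputed), imputed["Alcohol"].min())
print(len(dropped), dropped["Alcohol"].min())
```

In this toy case, dropping rows changes the minimum of a column that had no missing values at all, which is the side effect the answer above is concerned about.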

### Data Visualization

Draw a seaborn boxplot of each feature (=column) in a single plot.

In [None]:
# YOUR CODE HERE
import seaborn as sns


# Set theme
sns.set_style('whitegrid')

# Boxplot
sns.boxplot(data=df)
plt.xticks(rotation=45)
plt.title('Wine Data')
plt.tick_params(axis='x', which='major', labelsize=8)
plt.xlabel("Ingredients")
plt.ylabel("Amount")
plt.show()

## TASK 2: Normalization of Data

As the proline feature has a different scale than the other features, we want to standardize all features to have zero mean and unit variance. Use the [``StandardScaler()``](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from sklearn as an easy way to achieve this.
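
To make the idea of standardization concrete, here is a minimal sketch of the underlying formula $z = (x - \mu) / \sigma$, computed per column with NumPy on hypothetical values. It assumes the population standard deviation (`ddof=0`), which is the convention the `StandardScaler` documentation describes; it is an illustration, not a replacement for the scaler:

```python
import numpy as np

# Hypothetical values on very different scales
# (e.g. an acid level next to a proline-like feature).
X = np.array([
    [2.0, 1000.0],
    [3.0, 1200.0],
    [4.0,  800.0],
    [3.0, 1000.0],
])

# z = (x - mean) / std, computed column by column.
mean = X.mean(axis=0)
std = X.std(axis=0)  # ddof=0: population standard deviation
Z = (X - mean) / std

print(Z.mean(axis=0))  # per-column mean after scaling
print(Z.std(axis=0))   # per-column standard deviation after scaling
```

After the transformation both columns are centered and share the same spread, so neither dominates a plot or a distance computation merely because of its units.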

### Data Processing
Normalize the data using ``StandardScaler()``. Store the normalized data to a DataFrame called `df_s`.

In [None]:
# YOUR CODE HERE
from sklearn.preprocessing import StandardScaler
df_cols = df.columns
scaler = StandardScaler()
data = scaler.fit_transform(df)
df_s = pd.DataFrame(data, columns= df_cols)

In [None]:
# Ignore this cell - this is for automatic grading.

### Data Visualization

Visualize the transformed data.

In [None]:
# YOUR CODE HERE
# Set theme
sns.set_style('whitegrid')

# Boxplot
sns.boxplot(data=df_s)
plt.xticks(rotation=45)
plt.title('Wine Data Normalized with StandardScaler')
plt.tick_params(axis='x', which='major', labelsize=8)
plt.xlabel("Ingredients")
plt.ylabel("Normalized Amount")
plt.show()

## TASK 3: Dimension Reduction

To observe similarities between all data points (= wines), we can draw them in a two-dimensional space. First, however, we need a dimensionality reduction to bring the current 13 features down to 2. [``TSNE``](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html?highlight=tsne) is a well-known, state-of-the-art method for reducing the dimensionality of data. The idea is to preserve the distances between neighboring data points as well as possible in 2D.

### Dimension Reduction using TSNE

Instantiate a TSNE object with meaningful parameters (set the number of target dimensions to $2$), transform the scaled data into a two-dimensional space, and store the result in a DataFrame named `df_d`.

In [None]:
# YOUR CODE HERE
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, learning_rate=1000,perplexity=40,random_state=0)
data =  tsne.fit_transform(df_s)
df_d = pd.DataFrame(data)
df_d.head()

In [None]:
# Ignore this cell - this is for automatic grading.

### Data Visualization
Use seaborn to plot the data in a scatterplot.

In [None]:
# YOUR CODE HERE
# Plot the first t-SNE component against the second one.
sns.scatterplot(x=df_d[0], y=df_d[1])
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.title("Wine Data t-SNE")
plt.show()

## TASK 4
Load ``winelabels.csv``, which indicates the vineyard that produced each wine (numbers from 0 to 2). Use it to "label" the points in the scatterplot with one color per vineyard. Add a legend.

In [None]:
# YOUR CODE HERE
label = pd.read_csv('https://raw.githubusercontent.com/ADSLab-Salzburg/DataAnalysiswithPython/main/data/winelabels.csv')
label.rename(columns={"0": "Vineyard"}, inplace=True)
# join() aligns on the index, so each label is paired with its 2D point.
vineyards = label.join(df_d)
vineyards.rename(columns={0: "Value 1", 1: "Value 2"}, inplace=True)
sns.scatterplot(data=vineyards, x="Value 1", y="Value 2", hue="Vineyard")
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.legend(title="Vineyard")
plt.title("t-SNE Wine Data Colored by Vineyard")
plt.show()
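
One detail worth checking when coloring by label: the labels are integers, and plotting libraries such as seaborn typically treat a numeric `hue` as a continuous quantity. A minimal sketch, using hypothetical stand-ins for the label file and the 2D embedding (the names `labels`, `embedding`, `Dim 1` and `Dim 2` are illustrative assumptions), shows how the two frames line up and how the label can be cast to a categorical type:

```python
import pandas as pd

# Hypothetical stand-ins for the label file and the 2D embedding.
labels = pd.DataFrame({"Vineyard": [0, 1, 2, 1]})
embedding = pd.DataFrame({
    "Dim 1": [0.1, 1.5, -2.0, 1.2],
    "Dim 2": [0.3, -0.7, 2.2, -0.9],
})

# join() aligns rows on the index, so both frames must share the same
# index and row order (here the default 0..n-1 range).
combined = labels.join(embedding)

# Cast the numeric label to a categorical so it is treated as a set of
# discrete groups rather than a continuous quantity.
combined["Vineyard"] = combined["Vineyard"].astype("category")

print(combined.dtypes)
```

Passing such a categorical column as `hue` should yield one distinct color per vineyard in the legend.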