Exercise Basics Data Mining
===========================



## Exercise 1: Statistical description of data



Consider a small data set with one feature: $(8, 2, 4, 5, 1, 2, 6)$, which in sorted order is $(1, 2, 2, 4, 5, 6, 8)$.

1.  Calculate the following statistical features by hand!
    (Please do NOT use Python or any calculator, see `quantiles-handout.pdf` for a refresher on quantiles):
    -   mean value
    -   median
    -   quantile $Q_{0.25}$
    -   quantile $Q_{0.75}$

2.  Now use Python to calculate the statistical features. You can use the NumPy function `np.quantile()`
    or the member function `quantile()` of a Pandas DataFrame.

3.  Manually draw (by hand!) a histogram with bins of width 2 (bins: (0,2], (2,4], …)

4.  Now use Python to plot the histogram. Do you get the same result?



### Solution
1. $n = 7$; sorted data with 0-based indices: $x_0 = 1, x_1 = 2, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, x_6 = 8$
   - mean: $28 / 7 =$ **4**
   - median: **4**
   - quantile $Q_{0.25}$: **2**
     - $i' = q \cdot (n - 1) = 0.25 \cdot (7 - 1) = \textbf{1.5}$
     - $i = \lfloor i' \rfloor = \lfloor 1.5 \rfloor = \textbf{1}$
     - $g = i' - i = 1.5 - 1 = \textbf{0.5}$ (the fractional part of $i'$)
     - $x_q = x_i + (x_{i+1} - x_i) \cdot g \implies x_{0.25} = x_1 + (x_2 - x_1) \cdot 0.5 = 2 + (2 - 2) \cdot 0.5 = \textbf{2}$
   - quantile $Q_{0.75}$: **5.5**
     - $i' = q \cdot (n - 1) = 0.75 \cdot (7 - 1) = \textbf{4.5}$
     - $i = \lfloor i' \rfloor = \lfloor 4.5 \rfloor = \textbf{4}$
     - $g = i' - i = 4.5 - 4 = \textbf{0.5}$
     - $x_{0.75} = x_4 + (x_5 - x_4) \cdot 0.5 = 5 + (6 - 5) \cdot 0.5 = \textbf{5.5}$
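
The interpolation rule used in this solution can be checked with a short sketch. The helper name `quantile_linear` is made up for illustration; it assumes 0-based indices on the sorted data and is compared against `np.quantile`, whose default method is linear interpolation.

```python
import math
import numpy as np

def quantile_linear(values, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    xs = sorted(values)
    n = len(xs)
    pos = q * (n - 1)      # fractional position i'
    i = math.floor(pos)    # lower index i (0-based)
    g = pos - i            # fractional part g
    if i + 1 >= n:         # q == 1: there is no upper neighbour
        return float(xs[-1])
    return xs[i] + (xs[i + 1] - xs[i]) * g

data = [8, 2, 4, 5, 1, 2, 6]
q25 = quantile_linear(data, 0.25)
q75 = quantile_linear(data, 0.75)
print(q25, q75)
print(np.quantile(data, [0.25, 0.75]))  # should agree with the values above
```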

In [None]:
import numpy as np
import matplotlib.pyplot as plt

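data = [8, 2, 4, 5, 1, 2, 6]
data_np = np.sort(np.array(data))

print(data_np)
print(f'mean: {np.mean(data_np)}')
print(f'median: {np.median(data_np)}')
print(f'q_25: {np.quantile(data_np, 0.25)}')
print(f'q_75: {np.quantile(data_np, 0.75)}')

bin_width = 2

# Bin edges 0, 2, 4, 6, 8 as in the hand-drawn histogram.
# Note: plt.hist uses left-closed bins [a, b) (only the last bin is closed on both sides),
# so the counts can differ from the right-closed bins (a, b] drawn by hand.
plt.hist(data, bins=range(0, max(data) + bin_width, bin_width), edgecolor='k')
plt.show()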
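
Whether the plotted histogram matches the hand-drawn one depends on which side of each bin is closed. The following sketch counts the values both ways, assuming the bin edges 0, 2, 4, 6, 8; the variable names `edges`, `counts_right` and `counts_left` are chosen here for illustration only.

```python
import numpy as np

data = [8, 2, 4, 5, 1, 2, 6]
edges = [0, 2, 4, 6, 8]

# Right-closed bins (0,2], (2,4], (4,6], (6,8] as in the hand-drawn task:
# np.digitize with right=True returns i such that edges[i-1] < x <= edges[i]
idx = np.digitize(data, edges, right=True) - 1
counts_right = np.bincount(idx, minlength=len(edges) - 1).tolist()

# np.histogram counts with left-closed bins [0,2), [2,4), [4,6), [6,8]
counts_left = np.histogram(data, bins=edges)[0].tolist()

print("right-closed:", counts_right)
print("left-closed: ", counts_left)
```

The two count lists differ because the values 2, 4 and 6 lie exactly on bin edges and are assigned to different bins under the two conventions.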

---
## Exercise 2: Project understanding and Data understanding



First, download the wine data set from Moodle (`wine.csv`).
You can read a CSV file with pandas using `pandas.read_csv`. Importing pandas and reading
the file looks like this:



In [None]:
import pandas as pd
df = pd.read_csv("wine.csv")

Using `pd` as the alias is a common convention.



### Project and data understanding



**Project goal**: Use chemical analysis to determine the origin of wines, based on the "wine" data set.

Your data to solve the task:

-   3 different types of Italian wine
-   number of instances:  180
-   number of features: 13
-   number of classes: 3

-   features:
    -   Alcohol
    -   Malic acid
    -   Ash
    -   Alcalinity of ash
    -   Magnesium
    -   Total phenols
    -   Flavanoids
    -   Nonflavanoid phenols
    -   Proanthocyanins
    -   Color intensity
    -   Hue
    -   OD280/OD315 of diluted wines
    -   Proline

-   one column "class" with the types of wine $\{1, 2, 3\}$

Read the CSV file with the wine data set into a pandas DataFrame.

1.  Check whether all data objects and features are available; compare the number of rows with the description above.
2.  Check the types of your attributes (there is one column whose type does not make sense).
3.  Also check for duplicates and missing values.
    If you find duplicates or missing values, remove the corresponding objects.

**Hints**:

-   There is one column with a nonsensical value in it.
-   If a file is read with `pd.read_csv`, the type of each column is determined automatically (if possible).
    A single column may contain values of different types.
    As the documentation tells us: **Columns with mixed types are stored with the object dtype**.
-   Duplicates can be found with `DataFrame.duplicated()` and removed with `DataFrame.drop_duplicates()`.
-   Missing values can be found with `DataFrame.isnull()`.
-   Missing values can be dropped with `DataFrame.dropna()`.

**Important**: Check the dtype of the columns at the end. You can use the member function `astype` of a column to cast
its values (e.g. `df.Ash = df.Ash.astype(np.float64)`).
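
The checks from the hints can be tried out on a small example first. The sketch below uses a made-up DataFrame `toy` (not the wine data) with one duplicate row, one missing value and one unparsable entry in "Ash"; the values and the placeholder `"?"` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the wine data set
toy = pd.DataFrame({
    "Alcohol": [13.2, 12.8, 12.8, np.nan, 14.1],
    "Ash": ["2.1", "2.4", "2.4", "2.3", "?"],
    "class": [1, 2, 2, 3, 1],
})

print(toy.dtypes)  # "Ash" is not numeric because of the "?" entry

n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
n_missing = int(toy.isnull().sum().sum())    # missing cells in the whole frame

# Remove duplicates and rows with missing values
clean = toy.drop_duplicates().dropna()

# Entries that cannot be parsed as numbers become NaN with errors="coerce"
ash_numeric = pd.to_numeric(clean["Ash"], errors="coerce")
clean = clean[ash_numeric.notna()].copy()
clean["Ash"] = clean["Ash"].astype(np.float64)

print(n_duplicates, n_missing)
print(clean.dtypes)
```

Taking a `.copy()` after filtering avoids assigning into a view of the original frame when the column is cast.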



### Data understanding and preparation, visualization



There are outliers in the data set (hint: 4 obvious outliers in one column, which you will find without having any background in chemistry).

-   Find the outliers and remove the entire instances (the entire rows).
    You can use Python commands and visualization (e.g. histograms or box plots). Which outliers did you find?

**Hints**:

-   The function `DataFrame.describe()` is useful, check out the argument `percentiles`.
-   Make a boxplot of the suspicious column with the member function (`.plot.box()`).
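
As a complement to the visual check, outliers can also be flagged numerically. The sketch below uses a made-up series with one implausible value and applies the common 1.5 × IQR rule (the same default that box plot whiskers use in matplotlib); the data and the threshold are assumptions, not part of the exercise.

```python
import pandas as pd

# Hypothetical values with one implausible entry, NOT taken from wine.csv
s = pd.Series([12.9, 13.1, 13.4, 12.7, 13.0, 13.3, 95.0], name="Alcohol")

# describe() with additional percentiles shows where the tails start
summary = s.describe(percentiles=[0.05, 0.25, 0.75, 0.95])
print(summary)

# 1.5 * IQR rule: values far outside the middle 50 % are flagged
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
outliers = s[is_outlier]
print(outliers)
```

On a DataFrame, the same boolean mask can be used to drop the entire rows, e.g. `df[~is_outlier]`.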



In [None]:
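# Drop rows with missing values, negative alcohol content, and exact duplicates
df = df.dropna()
df = df[df['Alcohol'] >= 0]
df = df.drop_duplicates()

# "Ash" contains an entry that is not a number: keep only the rows
# whose value can be parsed, then cast the column to float
df = df[pd.to_numeric(df["Ash"], errors='coerce').notna()].copy()
df["Ash"] = df["Ash"].astype(np.float64)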

print(df.shape)

df.info()
df.describe()

In [None]:
def multiplot(dataframe):
    df = dataframe
    columns = df.columns[df.columns != "class"]

    # Calculate the number of rows and columns for the subplot grid
    n_cols = 2  # Number of columns in the grid
    n_rows = (len(columns) + n_cols - 1) // n_cols  # Calculate the number of rows

    # Create a grid of subplots
    fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(10, 10))

    # Iterate through the column names and create subplots
    for i, column in enumerate(columns):
        row, col = divmod(i, n_cols)  # Determine the row and column of the subplot
        ax = axes[row, col]  # Get the corresponding axis
        ax.plot(range(df.shape[0]), df[column])
        ax.set_title(column)

    # Remove empty subplots (if any)
    for i in range(len(columns), n_rows * n_cols):
        fig.delaxes(axes[divmod(i, n_cols)])

    plt.tight_layout()

# multiplot(df)

def multiboxplot(dataframe, numerical_columns, n_cols_per_row):
    n_rows = (len(numerical_columns) + n_cols_per_row - 1) // n_cols_per_row

    # Create a grid of subplots with n_cols_per_row plots per row
    fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols_per_row, figsize=(10, 10))

    # Draw one box plot per column (use the function argument, not the global df)
    for i, column in enumerate(numerical_columns):
        row, col = divmod(i, n_cols_per_row)
        dataframe[[column]].boxplot(grid=False, ax=axes[row, col])
        axes[row, col].set_title('Box Plot: ' + column)

    # Remove any empty subplots
    for i in range(len(numerical_columns), n_rows * n_cols_per_row):
        fig.delaxes(axes[divmod(i, n_cols_per_row)])

    # Adjust the layout
    plt.tight_layout()

multiboxplot(df, df.columns[df.columns != 'class'], 4)

In [None]:
df = df[df['Alcohol'] <= 20] # remove alcohol outlier

# multiplot(df)
multiboxplot(df, df.columns[df.columns != 'class'], 4)

---
## Exercise 3: Use simple grouping to understand and classify data

There are many features for each class. A feature is useful for classifying wine
if it behaves differently for the different classes.
First, let us check the mean of each class. Using the pandas
`groupby` function (a member function of a DataFrame), you can
compute aggregate functions of groups. Use this to compute the
mean of each feature for each group (e.g., `df.groupby(COLUMN).mean()`).
Once you have found an interesting column, the following command
visualizes its distribution for the different classes.



In [None]:
import seaborn as sns

print(df.groupby(df["class"]).mean())

sns.displot(data=df, x="Ash", hue='class', kind='kde')
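
To pick a candidate column without reading the whole table of means, the class means can be compared programmatically. The sketch below uses a made-up DataFrame `toy` and a simple, hypothetical score (range of the class means divided by the overall standard deviation); it is only one possible heuristic and not part of the exercise.

```python
import pandas as pd

# Hypothetical toy data, NOT the wine data set
toy = pd.DataFrame({
    "class":   [1, 1, 2, 2, 3, 3],
    "Ash":     [2.4, 2.3, 2.3, 2.4, 2.4, 2.3],
    "Proline": [1100, 1050, 520, 500, 630, 650],
})

# Mean of each feature per class
class_means = toy.groupby("class").mean()
print(class_means)

# Heuristic score: how far apart are the class means, relative to the
# overall spread of the feature? Larger means better separation.
overall_std = toy.drop(columns="class").std()
spread = (class_means.max() - class_means.min()) / overall_std
best_feature = spread.idxmax()
print(spread)
print(best_feature)
```

In this toy example the class means of "Ash" are identical, so it scores zero, while "Proline" separates the classes clearly.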