# Preparing Data for ML Algorithms and Performing Matrix Manipulations Using Pandas

In this Jupyter Notebook tutorial, we will cover how to prepare data for machine learning algorithms and perform matrix manipulations using the Pandas library. This tutorial is intended for high school graduates and assumes basic knowledge of **Python**, **Pandas**, **Matplotlib**, and **Seaborn**.

## Table of Contents

1. Introduction
2. Loading the Dataset
3. Exploratory Data Analysis
    * Descriptive Statistics
    * Visualizations
4. Data Preprocessing
    * Handling Missing Data
    * Handling Categorical Data
    * Feature Scaling
5. Matrix Manipulations
    * Selecting and Filtering Data
    * Matrix Operations
    * Aggregating Data
6. Feature Engineering
    * Creating New Features
    * Feature Selection
7. Splitting the Dataset
    * Train-Test Split
8. Model Selection and Evaluation
    * Model Selection
    * Model Evaluation
9. Model Tuning and Optimization
    * Grid Search
    * Retrain and Evaluate the Optimized Model
10. Conclusion

Throughout the tutorial, exercises will be provided at the end of each section for practice and better understanding of the concepts.

## 0. Installing the Libraries

In [None]:
%pip install pandas matplotlib seaborn scikit-learn

## 1. Introduction
In this tutorial, we will work with the [Wine Quality dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality) from the UCI Machine Learning Repository, which contains information about various wine samples and their quality scores. Our goal is to prepare this dataset for machine learning algorithms and perform matrix manipulations using Pandas.

**Objective:**
* Learn how to prepare data for machine learning algorithms
* Perform matrix manipulations using Pandas


## 2. Loading the Dataset
First, we need to load the Wine Quality dataset using Pandas. The dataset is available as a CSV file, and we will use the '**pd.read_csv()**' function to read the file and store it in a DataFrame called '**wine_df**'.

In [None]:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_df = pd.read_csv(url, sep=";")
wine_df.head()


In [None]:
wine_df_copy = wine_df.copy()

**Exercise 1:** Load the Wine Quality dataset for white wines and display the first five rows. The dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv).

In [None]:
# Replace the URL with the one for the white wine dataset and read it into a new DataFrame
# Hint: The correct URL for the white wine dataset is "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
white_wine_url = ______

# Hint: Use the pd.read_csv() function to read the CSV file and set the 'sep' parameter to ";"
white_wine_df = ______

# Hint: Use the .head() method to display the first 5 rows of the DataFrame
______

In [None]:
white_wine_df_copy = white_wine_df.copy() 

## 3. Exploratory Data Analysis
In this section, we will perform exploratory data analysis to understand the dataset better. We will compute descriptive statistics and visualize the data.

### 3.1 Descriptive Statistics
We will start by computing the summary statistics for the dataset using the '**describe()**' function.

In [None]:
wine_df.describe()

**Exercise 2:** Compute and display summary statistics for the white wine dataset.

In [None]:
# Use the describe() function on the white_wine_df DataFrame
______

### 3.2 Visualizations
We will now create some visualizations to explore the distribution of the features and their relationships. We will use both Matplotlib and Seaborn for creating these visualizations.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of wine quality scores
sns.histplot(wine_df['quality'], kde=False, bins=6)
plt.title('Wine Quality Distribution')
plt.xlabel('Quality')
plt.ylabel('Frequency')
plt.show()

# Boxplot of alcohol content grouped by wine quality
sns.boxplot(x='quality', y='alcohol', data=wine_df)
plt.title('Alcohol Content by Wine Quality')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()

# Scatterplot of alcohol content vs. wine density
sns.scatterplot(x='alcohol', y='density', data=wine_df, hue='quality')
plt.title('Alcohol Content vs. Wine Density')
plt.xlabel('Alcohol')
plt.ylabel('Density')
plt.legend(title='Quality')
plt.show()

**Exercise 3:** Create a scatterplot of alcohol content vs. volatile acidity for the white wine dataset. Use wine quality as the hue for the data points.

In [None]:
# Use the scatterplot function with x='alcohol', y='volatile acidity', and hue='quality' on the white_wine_df DataFrame
# Hint: Use the sns.scatterplot() function and set x, y, data, and hue parameters accordingly
sns._____(x=_____, y=_____, data=_____, hue=_____)

# Hint: Use the plt.title() function to set the title of the plot
plt._____('Alcohol Content vs. Volatile Acidity')

# Hint: Use the plt.xlabel() function to set the x-axis label
plt._____('Alcohol')

# Hint: Use the plt.ylabel() function to set the y-axis label
plt._____('Volatile Acidity')

# Hint: Use the plt.legend() function to customize the legend and set the title of the legend
plt.legend(title='Quality')

# Hint: Use the plt.show() function to display the plot
plt._____()

**Exercise 4:** Create a heatmap showing the correlation between the features in the red wine dataset.

In [None]:
# Compute the correlation matrix for the wine_df DataFrame
# Hint: Use the .corr() method on the wine_df DataFrame
corr_matrix = ______

# Create a heatmap using the correlation matrix
# Hint: Use the sns.heatmap() function with the correlation matrix and set annot and cmap parameters
sns._____(corr_matrix, annot=_____, cmap=_____)

# Hint: Use the plt.title() function to set the title of the heatmap
plt._____('Correlation Heatmap')

# Hint: Use the plt.show() function to display the heatmap
plt._____()

## 4. Data Preprocessing
In this section, we will perform data preprocessing to clean and prepare the dataset for machine learning algorithms.

### 4.1 Handling Missing Data

First, let's check for missing data in our dataset.

In [None]:
wine_df.isnull().sum()

In this case, there are no missing values in the dataset. However, if there were any missing values, you could handle them using methods like dropping the rows with missing data or imputing the missing values with the mean, median, or mode of the respective feature.
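
Since the wine data has no gaps, the two strategies mentioned above can be tried on a small made-up table instead. This is a minimal sketch; `toy_df` and its values are hypothetical and are not taken from the Wine Quality dataset.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical values, not from the wine dataset)
toy_df = pd.DataFrame({
    "alcohol": [9.4, np.nan, 9.8, 10.5],
    "pH": [3.51, 3.20, np.nan, 3.16],
})

# Count the missing values in each column
print(toy_df.isnull().sum())

# Option 1: drop every row that contains at least one missing value
dropped_df = toy_df.dropna()

# Option 2: fill each column's missing values with that column's median
imputed_df = toy_df.fillna(toy_df.median())

print(dropped_df)
print(imputed_df)
```

Dropping rows is simple but discards data; imputing keeps every row at the cost of inserting estimated values.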

**Exercise 5:** Check the white wine dataset for missing data and handle it if necessary.

In [None]:
# Check for missing data in the white wine dataset
# Hint: Use the .isnull() method followed by the .sum() method on the white_wine_df DataFrame
______.______().______()

### 4.2 Handling Categorical Data
Our dataset consists of only numerical features, so there is no need to handle categorical data in this case. However, if your dataset contains categorical features, you could use techniques like one-hot encoding or label encoding to convert them into numerical data.
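
To illustrate the two encodings on data that does contain a categorical column, here is a minimal sketch. The `wine_type` column is a hypothetical example and does not exist in the Wine Quality files; label encoding is done here with pandas category codes, which is one of several ways to do it.

```python
import pandas as pd

# Hypothetical categorical column (not part of the wine dataset)
toy_df = pd.DataFrame({"wine_type": ["red", "white", "red", "rose"]})

# One-hot encoding: one indicator column per category
one_hot_df = pd.get_dummies(toy_df, columns=["wine_type"])

# Label encoding: one integer code per category (categories are sorted alphabetically)
label_codes = toy_df["wine_type"].astype("category").cat.codes

print(one_hot_df)
print(label_codes.tolist())
```

One-hot encoding avoids implying an order between categories, while label encoding is more compact but introduces an artificial numeric order.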

### 4.3 Feature Scaling
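Feature scaling is an important step in preprocessing data for machine learning algorithms. Distance-based and gradient-based algorithms such as K-Nearest Neighbors, Support Vector Machines, and Logistic Regression often perform better when the features share a similar scale, while tree-based models such as Random Forests are largely insensitive to it. We will use the StandardScaler from the sklearn library to scale our features. Note that the example below scales every column, including the target 'quality', purely for demonstration; in a modeling workflow you would normally scale only the features and fit the scaler on the training set.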

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_wine_df = pd.DataFrame(scaler.fit_transform(wine_df), columns=wine_df.columns)
scaled_wine_df.head()
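
To see what the scaler actually computes, the standardization can be reproduced by hand on a small table. This is a sketch on hypothetical values (`toy_df` is not the wine data); it relies on the fact that StandardScaler subtracts each column's mean and divides by its population standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values used only to illustrate the arithmetic
toy_df = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "pH": [3.0, 3.2, 3.4, 3.6],
})

# Manual standardization: z = (x - mean) / std, with the population std (ddof=0)
manual_scaled = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

# The same transformation using scikit-learn
sklearn_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy_df), columns=toy_df.columns
)

# Both approaches should agree up to floating-point error
print(np.allclose(manual_scaled, sklearn_scaled))
```

After scaling, each column has a mean of approximately 0 and a standard deviation of approximately 1.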

**Exercise 6:** Apply feature scaling to the white wine dataset.


In [None]:
# Scale the features of the white wine dataset using the StandardScaler
# Hint: Use the scaler.fit_transform() function on the white_wine_df DataFrame and pass the result to pd.DataFrame()
scaled_white_wine_df = pd.DataFrame(______.______(______), columns=______.columns)

# Hint: Use the .head() method to display the first 5 rows of the scaled DataFrame
scaled_white_wine_df.______()

## 5. Matrix Manipulations
In this section, we will perform matrix manipulations using Pandas.

### 5.1 Selecting and Filtering Data

We can select specific rows or columns and filter the data based on certain conditions.

In [None]:
# Selecting the first 10 rows of the 'quality' column
wine_df['quality'].head(10)


In [None]:
# Filtering wines with quality greater than 6
high_quality_wines = wine_df[wine_df['quality'] > 6]
high_quality_wines.head()
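
Filters can also combine several conditions. The sketch below uses a small hypothetical table (`toy_df`) rather than the wine data, and shows boolean indexing with two conditions as well as `.loc` for picking rows and a column in one step.

```python
import pandas as pd

# Hypothetical values used only to illustrate filtering
toy_df = pd.DataFrame({
    "alcohol": [9.4, 12.1, 10.8, 13.0],
    "quality": [5, 7, 6, 8],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
strong_and_good = toy_df[(toy_df["quality"] > 6) & (toy_df["alcohol"] > 12.5)]

# .loc selects rows by condition and a column by name in one step
good_alcohol = toy_df.loc[toy_df["quality"] > 6, "alcohol"]

print(strong_and_good)
print(good_alcohol.tolist())
```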


**Exercise 7:** Select the first 15 rows of the 'pH' column from the white wine dataset.

In [None]:
# Select the first 15 rows of the 'pH' column from the white_wine_df DataFrame
# Hint: Use the column name 'pH' and the .head() method with 15 as an argument on the white_wine_df DataFrame
white_wine_df[_____].______(15)

**Exercise 8:** Filter the white wine dataset to include only wines with a residual sugar value greater than 20.

In [None]:
# Filter the white_wine_df DataFrame to include only wines with a residual sugar value greater than 20
# Hint: Use boolean indexing with the condition (white_wine_df['residual sugar'] > 20)
sweet_white_wines = white_wine_df[______[___________] > 20]

# Hint: Use the .head() method to display the first 5 rows of the filtered DataFrame
sweet_white_wines.______()

### 5.2 Matrix Operations

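We can perform various matrix operations using Pandas, such as adding, subtracting, or multiplying the values in a DataFrame. These arithmetic operators work element-wise: the operation is applied to each value (or to each pair of values in matching rows) without writing an explicit loop.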

In [None]:
# Adding 1 to the alcohol content of each wine
wine_df['alcohol'] + 1

In [None]:
# Multiplying the alcohol content and pH to create a new feature
wine_df['alcohol'] * wine_df['pH']
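
The operations above are element-wise. For comparison, here is a minimal sketch of a transpose and a true matrix product on a tiny hypothetical DataFrame `A`; the names and values are illustrative only.

```python
import pandas as pd

# Small hypothetical 2x2 matrix stored as a DataFrame
A = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

# Element-wise product: each value is multiplied by itself
elementwise = A * A

# Transpose: rows become columns
A_T = A.T

# Matrix product of the transpose with the original (pandas aligns A_T's columns with A's index)
gram = A_T.dot(A)

print(elementwise)
print(gram)
```

The element-wise product gives `[[1, 4], [9, 16]]`, whereas the matrix product gives `[[10, 14], [14, 20]]`, so it matters which one you ask for.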

**Exercise 9:** Create a new feature in the white wine dataset by dividing the 'total sulfur dioxide' by the 'free sulfur dioxide'.

In [None]:
# Divide the 'total sulfur dioxide' by the 'free sulfur dioxide' in the white_wine_df DataFrame
# Hint: Use the column names 'total sulfur dioxide' and 'free sulfur dioxide' for the division operation
white_wine_df[_________________] / white_wine_df[__________________]

**Exercise 10:** Subtract the minimum value of the 'density' column from each density value in the red wine dataset.

In [None]:
# Subtract the minimum value of the 'density' column from each density value in the wine_df DataFrame
# Hint: Use the column name 'density' and the .min() method on the wine_df DataFrame for the subtraction operation
wine_df[_______] - wine_df[_______].____()
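
The table of contents also lists aggregating data under matrix manipulations. A minimal sketch of one common way to aggregate, using `groupby` on a small hypothetical table (`toy_df` is not the wine data), might look like this:

```python
import pandas as pd

# Hypothetical values used only to illustrate aggregation
toy_df = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7],
    "alcohol": [9.0, 10.0, 10.0, 11.0, 12.0],
})

# Mean alcohol content for each quality score
mean_alcohol = toy_df.groupby("quality")["alcohol"].mean()

# Several aggregations at once
summary = toy_df.groupby("quality")["alcohol"].agg(["mean", "max", "count"])

print(mean_alcohol)
print(summary)
```

The same pattern should work on `wine_df`, for example to compare the average alcohol content across quality scores.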

## 6. Feature Engineering
In this section, we will create new features that can help improve the performance of our machine learning algorithms.

### 6.1 Creating New Features

We will create a new feature called 'sweetness' by binning the 'residual sugar' values into categories.

In [None]:
# Binning the residual sugar values into categories
wine_df_with_sweetness = wine_df.copy()
wine_df_with_sweetness['sweetness'] = pd.cut(wine_df['residual sugar'], bins=[0, 4, 12, 45], labels=['dry', 'medium', 'sweet'])
wine_df_with_sweetness[['residual sugar', 'sweetness']].head()
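
It helps to know how `pd.cut` treats the bin edges. The sketch below reuses the bins and labels from the cell above on a few hypothetical sugar values to show that intervals include their right edge by default and that values outside the bins become missing.

```python
import pandas as pd

bins = [0, 4, 12, 45]
labels = ["dry", "medium", "sweet"]

# Hypothetical residual sugar values, chosen to sit on and around the bin edges
sugar = pd.Series([0.9, 4.0, 4.1, 12.0, 15.5])

# Default intervals are (0, 4], (4, 12], (12, 45]: the right edge is included
sweetness = pd.cut(sugar, bins=bins, labels=labels)
print(sweetness.tolist())  # ['dry', 'dry', 'medium', 'medium', 'sweet']

# Values outside the bins (including exactly 0, the excluded left edge) become NaN
outside = pd.cut(pd.Series([0.0, 50.0]), bins=bins, labels=labels)
print(outside.isna().tolist())  # [True, True]
```

If the lowest edge should be included, `pd.cut` accepts `include_lowest=True`.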


**Exercise 11:** Create a new feature in the white wine dataset called 'acidity_level' by binning the 'pH' values into the categories 'high', 'medium', and 'low'. Remember that a lower pH means higher acidity, so the lowest pH bin is labeled 'high'.

In [None]:
# Binning the pH values into categories
# Hint: Create a copy of the white_wine_df DataFrame
white_wine_df_with_acidity = white_wine_df.____()

# Hint: Use the pd.cut() function with the 'pH' column, bins, and labels as arguments
white_wine_df_with_acidity['acidity_level'] = pd.cut(white_wine_df_with_acidity[____], bins=[2.7, 3.0, 3.3, 4.0], labels=['high', 'medium', 'low'])

# Hint: Use the .head() method to display the first 5 rows of the 'pH' and 'acidity_level' columns
white_wine_df_with_acidity[[____, ____]].____()

### 6.2 Feature Selection

Sometimes, it is helpful to remove irrelevant or redundant features from our dataset to improve the performance of our machine learning algorithms. We can use various techniques like correlation analysis, feature importance, or Recursive Feature Elimination (RFE) to select the most important features.
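
As an illustration of the feature importance approach, the sketch below trains a Random Forest on synthetic data in which one column determines the label and the other is pure noise. The column names `signal` and `noise` and all values are hypothetical; the point is only to show how importances can be read from a fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 'signal' determines the label, 'noise' is unrelated to it
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_toy = (X_toy["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# feature_importances_ sums to 1; larger values mean the feature was more useful for splitting
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features with very small importances are candidates for removal, although it is worth confirming with a held-out score that dropping them does not hurt.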

**Exercise 12:** Remove the least important feature from the red wine dataset using correlation analysis. You can use the correlation matrix created earlier in the tutorial.

In [None]:
# Find the least correlated feature with the 'quality' column
# Hint: Use the .abs() method followed by the .idxmin() method on the 'quality' column of the corr_matrix DataFrame
least_correlated_feature = corr_matrix[______].____().____()

# Drop the least correlated feature from the wine_df DataFrame
# Hint: Use the .drop() method with the columns parameter and the inplace=True argument on the wine_df DataFrame
wine_df.drop(columns=[_______________], inplace=_____)

# Hint: Use the .head() method to display the first 5 rows of the modified DataFrame
wine_df.____()

## 7. Splitting the Dataset

In this section, we will split the dataset into training and testing sets to evaluate the performance of our machine learning algorithms.

### 7.1 Train-Test Split

We will use the '**train_test_split()**' function from the '**sklearn.model_selection**' module to split our dataset into training and testing sets.



In [None]:
from sklearn.model_selection import train_test_split

X = wine_df.drop(columns=['quality'])
y = wine_df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
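
To check what a split produces, the sketch below runs `train_test_split` on a tiny hypothetical dataset. It also passes `stratify`, an optional parameter not used in the cell above, which keeps the class proportions the same in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, two balanced classes
X_toy = pd.DataFrame({"feature": range(20)})
y_toy = pd.Series([0] * 10 + [1] * 10)

# test_size=0.2 puts 20% of the rows (4 of 20) in the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```

Stratifying can be useful for the wine data because some quality scores are much rarer than others, though it requires every class to have at least two samples.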


**Exercise 13:** Split the white wine dataset into training and testing sets.

In [None]:
# Define the features (X) and target variable (y) for the white wine dataset
# Hint: Use the .drop() method to exclude the 'quality' column for the features
X_white = white_wine_df.____(columns=[_____])
# Hint: Use the column name 'quality' to define the target variable
y_white = white_wine_df[______]

# Split the white wine dataset into training and testing sets
# Hint: Use the train_test_split function with the test_size and random_state parameters
X_train_white, X_test_white, y_train_white, y_test_white = train_test_split(_____, _____, test_size=_____, random_state=_____)

## 8. Model Selection and Evaluation

In this section, we will select a machine learning model, train it on our training data, and evaluate its performance on the testing data.

### 8.1 Model Selection

We will use the Random Forest Classifier from the '**sklearn.ensemble**' module as our machine learning model.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)


### 8.2 Model Evaluation

We will use accuracy as our evaluation metric. We will compute the accuracy of our model on the testing data.

In [None]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
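
Accuracy is simply the fraction of predictions that match the true labels. The sketch below verifies this on a few hypothetical labels and also builds a confusion matrix, which shows where the mistakes fall; `y_true` and `y_hat` are made-up values, not model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted quality scores
y_true = np.array([5, 6, 5, 7, 6, 5])
y_hat = np.array([5, 6, 6, 7, 5, 5])

# Accuracy by hand: share of positions where prediction equals truth
manual_accuracy = (y_true == y_hat).mean()
sk_accuracy = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=[5, 6, 7])

print(f"Manual: {manual_accuracy:.4f}, sklearn: {sk_accuracy:.4f}")
print(cm)
```

Because the quality classes are imbalanced, a high accuracy can hide poor performance on rare scores, so the confusion matrix is worth inspecting alongside it.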

**Exercise 14:** Train a Random Forest Classifier on the white wine dataset and evaluate its performance using accuracy.

In [None]:
# Create and train the Random Forest Classifier on the white wine dataset
# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_white = RandomForestClassifier(random_state=____)
# Hint: Use the .fit() method to train the model
model_white.____(X_train_white, y_train_white)

# Predict the target variable for the test set and compute the accuracy
# Hint: Use the .predict() method on the model
y_pred_white = model_white.____(X_test_white)
# Hint: Use the accuracy_score function to compute the accuracy
accuracy_white = accuracy_score(y_test_white, y_pred_white)
print(f"Accuracy: {accuracy_white:.4f}")



## 9. Model Tuning and Optimization
In this section, we will fine-tune the hyperparameters of our model to optimize its performance.

### 9.1 Grid Search

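We will use the '**GridSearchCV**' class from the '**sklearn.model_selection**' module to perform a grid search for the best hyperparameters for our model. It trains and cross-validates the model for every combination of the values listed in the parameter grid and keeps the combination with the best average score.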

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")
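
Before launching a grid search it is useful to estimate how much work it involves. The sketch below counts the combinations in the same grid with `ParameterGrid` and multiplies by the five cross-validation folds used above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Number of hyperparameter combinations: 4 * 4 * 3 * 3
n_candidates = len(ParameterGrid(param_grid))

# Each combination is trained once per fold (cv=5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 144 720
```

With 720 model fits, the search can take a while; trimming the grid is a simple way to shorten it.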

### 9.2 Retrain and Evaluate the Optimized Model

Now, we will retrain our model using the best hyperparameters found during the grid search and evaluate its performance on the testing data.

In [None]:
# Train the optimized Random Forest Classifier
optimized_model = RandomForestClassifier(**best_params, random_state=42)
optimized_model.fit(X_train, y_train)

# Predict the target variable for the test set and compute the accuracy
y_pred_optimized = optimized_model.predict(X_test)
accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
print(f"Optimized Accuracy: {accuracy_optimized:.4f}")
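
Retraining by hand, as above, makes the steps explicit. As a possible shortcut, `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`) and exposes it as `best_estimator_`. The sketch below shows this on synthetic data from `make_classification`; the data, the small grid, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data (not the wine dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# A deliberately small grid so the search finishes quickly
small_grid = {"n_estimators": [10, 30], "max_depth": [None, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ has already been retrained on all of X_tr with the best parameters
test_accuracy = search.best_estimator_.score(X_te, y_te)

print(search.best_params_)
print(f"Test accuracy: {test_accuracy:.4f}")
```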


**Exercise 15:** Perform grid search for the best hyperparameters for the Random Forest Classifier on the white wine dataset, retrain the model using the best hyperparameters, and evaluate its performance using accuracy.



In [None]:
# Perform grid search for the best hyperparameters on the white wine dataset
grid_search_white = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search_white.fit(X_train_white, y_train_white)

# Display the best hyperparameters found by the grid search
# Hint: Use the .best_params_ attribute of the grid_search_white object
best_params_white = grid_search_white.________
print(f"Best parameters: {best_params_white}")

# Train the optimized Random Forest Classifier using the best hyperparameters
# Hint: Unpack the best_params_white dictionary with ** and set the random_state parameter
optimized_model_white = RandomForestClassifier(**____, random_state=____)
# Hint: Use the .fit() method to train the optimized model
optimized_model_white.____(X_train_white, y_train_white)

# Predict the target variable for the test set and compute the accuracy
# Hint: Use the .predict() method on the optimized model
y_pred_optimized_white = optimized_model_white.____(X_test_white)
# Hint: Use the accuracy_score function to compute the optimized accuracy
accuracy_optimized_white = accuracy_score(y_test_white, y_pred_optimized_white)
print(f"Optimized Accuracy: {accuracy_optimized_white:.4f}")


## 10. Conclusion
In this tutorial, we covered various aspects of preparing data for machine learning algorithms and performing matrix manipulations using Pandas. We started by loading and exploring the datasets, followed by data preprocessing, feature engineering, dataset splitting, and model selection. We also trained a Random Forest Classifier and optimized its hyperparameters using grid search. Through the exercises, you got hands-on experience with each of these steps.

Remember that this tutorial is just a starting point. Many more techniques and tools exist for data preparation and manipulation, and you will build on the ones covered here as you work with more datasets and different types of data.

Now that you have completed the main material of the tutorial, you can continue to challenge yourself by working on these additional advanced exercises. These exercises will help you reinforce your understanding of the concepts and techniques covered in this tutorial.

## Extra Exercises
**Exercise 1:** Perform exploratory data analysis on the white wine dataset, visualizing the distribution of the different features, and identify any outliers or potential issues with the data.

In [None]:
# Use seaborn to visualize the distribution of features in the white wine dataset
import seaborn as sns

# Replace this with the name of the column you want to visualize
column_to_visualize = 'alcohol'

# Hint: Use the sns.histplot function with the 'column_to_visualize' and kde=True
sns.histplot(white_wine_df[_________], kde=____)


**Exercise 2:** Normalize the features of the red wine dataset using Min-Max scaling, and compare the performance of the Random Forest Classifier on the original and normalized data.



In [None]:
from sklearn.preprocessing import MinMaxScaler

# Normalize the features of the red wine dataset
# Hint: Instantiate the MinMaxScaler
scaler = _________()
# Hint: Use the fit_transform method on the scaler for X_train
X_train_scaled = scaler._____________(X_train)
# Hint: Use the transform method on the scaler for X_test
X_test_scaled = scaler._________(X_test)

# Train the Random Forest Classifier on the normalized data
# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_scaled = ___________________(random_state=42)
model_scaled.fit(X_train_scaled, y_train)

# Predict the target variable for the test set and compute the accuracy
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Scaled Accuracy: {accuracy_scaled:.4f}")



**Exercise 3:** Perform one-hot encoding on the categorical features created in the feature engineering section (e.g., 'sweetness' and 'acidity_level'), and retrain the Random Forest Classifier on the updated dataset. Compare the performance of the model with and without the one-hot encoded features.

In [None]:
# Perform one-hot encoding on the 'sweetness' and 'acidity_level' features
# Hint: Use the pd.get_dummies function with the columns parameter for 'sweetness' and 'acidity_level'
wine_df_encoded = pd.get_dummies(wine_df_with_sweetness, columns=[_____________])
white_wine_df_encoded = pd.get_dummies(white_wine_df_with_acidity, columns=[_____________])

# Train the Random Forest Classifier on the updated dataset with one-hot encoded features
X_encoded = wine_df_encoded.drop(columns=['quality'])
y_encoded = wine_df_encoded['quality']

X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(X_encoded, y_encoded, test_size=0.2, random_state=42)

# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_encoded = ___________________(random_state=42)
model_encoded.fit(X_train_encoded, y_train_encoded)

# Predict the target variable for the test set and compute the accuracy
y_pred_encoded = model_encoded.predict(X_test_encoded)
accuracy_encoded = accuracy_score(y_test_encoded, y_pred_encoded)
print(f"Encoded Accuracy: {accuracy_encoded:.4f}")



**Exercise 4:** Experiment with different machine learning algorithms on the red wine and white wine datasets, such as Logistic Regression, Support Vector Machines, and K-Nearest Neighbors. Compare their performance with the Random Forest Classifier.

In [None]:
# Import the desired machine learning algorithm
from sklearn.linear_model import LogisticRegression

# Train the selected model on the dataset
# Hint: Instantiate the LogisticRegression with the random_state parameter
# (max_iter is raised so the solver has more room to converge on unscaled features)
new_model = ___________________(max_iter=1000, random_state=____)
new_model.fit(X_train, y_train)

# Predict the target variable for the test set and compute the accuracy
y_pred_new_model = new_model.predict(X_test)
accuracy_new_model = accuracy_score(y_test, y_pred_new_model)
print(f"New Model Accuracy: {accuracy_new_model:.4f}")


**Exercise 5:** Use Recursive Feature Elimination (RFE) to select the most important features for the red wine and white wine datasets, and compare the performance of the machine learning models using the selected features versus the entire feature set.

In [None]:
from sklearn.feature_selection import RFE

# Create the RFE object and compute a ranking of the features
# Hint: Instantiate the RFE object with RandomForestClassifier and n_features_to_select
selector = RFE(_____________________(random_state=42), n_features_to_select=5)
selector = selector.fit(X_train, y_train)

# Select the most important features
# Hint: Use the .support_ attribute of the selector object
selected_features = X.columns[selector._______]

# Train the Random Forest Classifier on the selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_selected = ___________________(random_state=42)
model_selected.fit(X_train_selected, y_train)

# Predict the target variable for the test set and compute the accuracy
y_pred_selected = model_selected.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, y_pred_selected)
print(f"Selected Features Accuracy: {accuracy_selected:.4f}")


**Exercise 6:** Perform Principal Component Analysis (PCA) on the red wine dataset to reduce its dimensionality and visualize the first two principal components. Train the Random Forest Classifier on the reduced dataset and compare its performance with the original dataset.

In [None]:
from sklearn.decomposition import PCA

# Apply PCA to the red wine dataset
# Hint: Instantiate PCA with n_components=2
pca = PCA(________________)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train the Random Forest Classifier on the reduced dataset
# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_pca = ___________________(random_state=42)
model_pca.fit(X_train_pca, y_train)

# Predict the target variable for the test set and compute the accuracy
y_pred_pca = model_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f"PCA Accuracy: {accuracy_pca:.4f}")


**Exercise 7:** Using the red wine dataset, create a new feature that represents the ratio between residual sugar and alcohol. Train the Random Forest Classifier on the updated dataset and compare its performance with the original dataset.

In [None]:
# Create a new feature representing the ratio between residual sugar and alcohol
# Hint: Divide the 'residual sugar' column by the 'alcohol' column
wine_df_copy['sugar_alcohol_ratio'] = wine_df_copy[_________________] / wine_df_copy[_________________]

# Train the Random Forest Classifier on the updated dataset
X_ratio = wine_df_copy.drop(columns=['quality'])
y_ratio = wine_df_copy['quality']

X_train_ratio, X_test_ratio, y_train_ratio, y_test_ratio = train_test_split(X_ratio, y_ratio, test_size=0.2, random_state=42)

# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_ratio = ___________________(random_state=42)
model_ratio.fit(X_train_ratio, y_train_ratio)

# Predict the target variable for the test set and compute the accuracy
y_pred_ratio = model_ratio.predict(X_test_ratio)
accuracy_ratio = accuracy_score(y_test_ratio, y_pred_ratio)
print(f"Ratio Feature Accuracy: {accuracy_ratio:.4f}")


**Exercise 8:** Perform k-means clustering on the red wine dataset and assign each sample to one of the clusters. Train the Random Forest Classifier on the dataset with the new cluster assignments as an additional feature and compare its performance with the original dataset.

In [None]:
from sklearn.cluster import KMeans

# Perform k-means clustering on the red wine dataset
# Hint: Instantiate KMeans with n_clusters=3 and random_state=42
kmeans = KMeans(___________________, random_state=42)
# Work on a copy so the extra 'cluster' column does not end up in wine_df and affect later exercises
wine_df_cluster = wine_df.copy()
wine_df_cluster['cluster'] = kmeans.fit_predict(X)

# Train the Random Forest Classifier on the dataset with the new cluster assignments
X_cluster = wine_df_cluster.drop(columns=['quality'])
y_cluster = wine_df_cluster['quality']

X_train_cluster, X_test_cluster, y_train_cluster, y_test_cluster = train_test_split(X_cluster, y_cluster, test_size=0.2, random_state=42)

# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_cluster = ___________________(random_state=42)
model_cluster.fit(X_train_cluster, y_train_cluster)

# Predict the target variable for the test set and compute the accuracy
y_pred_cluster = model_cluster.predict(X_test_cluster)
accuracy_cluster = accuracy_score(y_test_cluster, y_pred_cluster)
print(f"Cluster Feature Accuracy: {accuracy_cluster:.4f}")


**Exercise 9:** Implement feature selection using a correlation matrix and a threshold for the correlation coefficient. Train the Random Forest Classifier on the dataset with the selected features and compare its performance with the original dataset.

In [None]:
# Compute the correlation matrix for the red wine dataset
corr_matrix = wine_df.corr()

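# Select features whose absolute correlation with 'quality' is at least the threshold
# (features that are barely correlated with the target are dropped)
threshold = 0.1  # illustrative value; try a few thresholds and compare the results
# Hint: Use a list comprehension to filter the columns, skipping the target itself
selected_features = [column for column in corr_matrix.columns if column != 'quality' and abs(corr_matrix.loc[column, 'quality']) >= threshold]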

# Train the Random Forest Classifier on the dataset with the selected features
X_corr_selected = wine_df[selected_features]
y_corr_selected = wine_df['quality']

X_train_corr_selected, X_test_corr_selected, y_train_corr_selected, y_test_corr_selected = train_test_split(X_corr_selected, y_corr_selected, test_size=0.2, random_state=42)

# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_corr_selected = ___________________(random_state=42)
model_corr_selected.fit(X_train_corr_selected, y_train_corr_selected)

# Predict the target variable for the test set and compute the accuracy
y_pred_corr_selected = model_corr_selected.predict(X_test_corr_selected)
accuracy_corr_selected = accuracy_score(y_test_corr_selected, y_pred_corr_selected)
print(f"Correlation Selected Features Accuracy: {accuracy_corr_selected:.4f}")


**Exercise 10:** Implement a custom cross-validation strategy to evaluate the performance of the Random Forest Classifier on the red wine dataset. Compare the performance of the custom cross-validation strategy with the results obtained from the train-test split.

In [None]:
from sklearn.model_selection import KFold
from sklearn.base import clone
import numpy as np

# Implement custom cross-validation strategy
def custom_cross_val_score(model, X, y, cv=5, random_state=42):
    kfold = KFold(n_splits=cv, shuffle=True, random_state=random_state)
    scores = []
    
    for train_index, test_index in kfold.split(X, y):
        X_train_cv, X_test_cv = X.iloc[train_index], X.iloc[test_index]
        y_train_cv, y_test_cv = y.iloc[train_index], y.iloc[test_index]
        
        # Hint: Use the clone function from sklearn.base to make a copy of the model
        model_cv = ___________(model)
        model_cv.fit(X_train_cv, y_train_cv)
        y_pred_cv = model_cv.predict(X_test_cv)
        
        score = accuracy_score(y_test_cv, y_pred_cv)
        scores.append(score)
    
    return np.mean(scores)

# Evaluate the performance of the Random Forest Classifier using the custom cross-validation strategy
# Hint: Instantiate the RandomForestClassifier with the random_state parameter
model_cv = ___________________(random_state=42)
custom_cv_score = custom_cross_val_score(model_cv, X, y, cv=5)
print(f"Custom Cross-Validation Score: {custom_cv_score:.4f}")
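
For comparison with the hand-written loop, scikit-learn's built-in `cross_val_score` can run the same kind of evaluation in one call. The sketch below uses synthetic data from `make_classification` rather than the wine dataset, so the data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic classification data (not the wine dataset)
X_toy, y_toy = make_classification(n_samples=150, n_features=5, random_state=0)

# Same splitting strategy as the custom function: 5 shuffled folds with a fixed seed
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score clones the model, fits it on each training fold, and scores the held-out fold
scores = cross_val_score(
    RandomForestClassifier(n_estimators=20, random_state=42),
    X_toy, y_toy, cv=kfold, scoring="accuracy",
)

print(scores)
print(f"Mean accuracy: {scores.mean():.4f}")
```

Looking at the individual fold scores, not just their mean, gives a sense of how much the estimate varies from split to split.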
