# SEMMA Methodology Applied to the Wine Quality Dataset
In this notebook, we follow the SEMMA methodology to analyze and build a machine learning model for the Wine Quality dataset. The SEMMA process includes:
- **Sample**: Using the full dataset since it's manageable in size.
- **Explore**: Analyzing the data to find trends, correlations, and distribution of features.
- **Modify**: Cleaning the data and preparing it for modeling, including feature scaling.
- **Model**: Training a Random Forest Classifier to predict wine quality.
- **Assess**: Evaluating the model's performance.

## Step 1: Sample
The dataset contains 1143 rows, which is small enough to analyze in full, so we use all of it and skip subsampling.
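If the dataset were too large to use in full, the Sample step would typically draw a subset that preserves the class proportions. The following is a minimal sketch of that idea, not part of the original analysis: the `stratified_sample` helper and the `toy` table are hypothetical, and the values are synthetic rather than taken from `WineQT.csv`.

```python
import pandas as pd


def stratified_sample(df: pd.DataFrame, label_col: str, frac: float, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction of rows from every class in label_col."""
    return df.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)


# Synthetic stand-in for a larger table (hypothetical values, not the real wine data)
toy = pd.DataFrame({
    "alcohol": [9.0 + 0.01 * i for i in range(1000)],
    "quality": [5] * 600 + [6] * 300 + [7] * 100,
})

# Keep 10% of each quality class so the class mix stays the same
sampled = stratified_sample(toy, "quality", frac=0.1)
print(sampled["quality"].value_counts().sort_index())
```

Sampling per class instead of over the whole table keeps rare quality scores from disappearing from the subset.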

In [None]:
# Importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
file_path = '/content/WineQT.csv'
data = pd.read_csv(file_path)

# Display first few rows of the dataset
data.head()


## Step 2: Explore
We will perform Exploratory Data Analysis (EDA) to better understand the dataset and its relationships.

In [None]:
# Descriptive statistics and correlation heatmap
desc_stats = data.describe()
print(desc_stats)

# Correlation matrix (the 'Id' column is a row identifier, so it is excluded)
correlation_matrix = data.drop(columns='Id').corr()

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Wine Quality Dataset')
plt.show()
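
A heatmap can be hard to read when the question is simply which features track the target most closely. One possible complement, sketched here under assumptions, is to rank features by the absolute value of their correlation with `quality`. The `toy` frame below is synthetic and its relationships are invented for illustration; they say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data with made-up relationships (not the real wine measurements)
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.2, n)
noise = rng.normal(0.0, 1.0, n)

toy = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": rng.normal(3.3, 0.15, n),
    "quality": np.round(5.5 + 0.8 * (alcohol - 10.4) - 2.0 * (volatile_acidity - 0.5) + 0.5 * noise),
})

# Correlation of every feature with the target, strongest first (by absolute value)
target_corr = (
    toy.corr()["quality"]
    .drop("quality")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

In the notebook itself, the same ranking could be read from `correlation_matrix['quality']`, assuming the target column keeps that name.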

## Step 3: Modify
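In this step, we drop the 'Id' column (it is a row identifier with no predictive meaning), separate the features from the target, split the data into training and test sets, and then scale the features. The scaler is fitted on the training set only so that no information from the test set leaks into training.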

In [None]:
# Dropping the 'Id' column
data_cleaned = data.drop('Id', axis=1)

# Separate the features and target variable
X = data_cleaned.drop('quality', axis=1)
y = data_cleaned['quality']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

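# Feature scaling: learn the mean and standard deviation from the training set,
# then apply the same transformation to the test set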
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
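
To make the fit/transform split concrete, here is a small NumPy sketch of the standardization that `StandardScaler` performs with its default settings: subtract the training mean and divide by the training standard deviation. The arrays are synthetic and the `_toy` names are hypothetical, chosen so they do not clash with the notebook's variables.

```python
import numpy as np

# Synthetic two-feature data (hypothetical values, not the wine dataset)
rng = np.random.default_rng(42)
X_train_toy = rng.normal(loc=[10.0, 3.3], scale=[1.0, 0.15], size=(80, 2))
X_test_toy = rng.normal(loc=[10.0, 3.3], scale=[1.0, 0.15], size=(20, 2))

# "fit": statistics come from the training set only
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# "transform": both sets are scaled with the training statistics
X_train_z = (X_train_toy - train_mean) / train_std
X_test_z = (X_test_toy - train_mean) / train_std

print("train mean after scaling:", X_train_z.mean(axis=0).round(6))
print("test mean after scaling: ", X_test_z.mean(axis=0).round(6))
```

The scaled training features have mean 0 and unit variance by construction, while the test features generally do not, which is expected because their statistics were never used.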

## Step 4: Model
We will train a Random Forest classifier on the training data to predict the wine quality.

In [None]:
# Initialize Random Forest Classifier
model = RandomForestClassifier(random_state=42)

# Train the model
model.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred = model.predict(X_test_scaled)
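
A fitted random forest also exposes `feature_importances_`, which can hint at which inputs drive the predictions. The sketch below uses synthetic data in which only one column determines the label, so the ranking is easy to check. The column names and the `toy_model` variable are illustrative assumptions, not results from the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on the first column (hypothetical setup)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=["alcohol", "pH", "sulphates"])
y_toy = (X_toy["alcohol"] > 0).astype(int)

toy_model = RandomForestClassifier(n_estimators=50, random_state=42)
toy_model.fit(X_toy, y_toy)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = pd.Series(
    toy_model.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

For the notebook's model, which was trained on a scaled NumPy array, the feature names would have to be supplied separately, for example with `pd.Series(model.feature_importances_, index=X.columns)`.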

## Step 5: Assess
Now, we evaluate the model's performance using accuracy score, confusion matrix, and classification report.

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Output the evaluation results
print(f'Accuracy: {accuracy:.2%}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
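
The three outputs are closely related: accuracy and the per-class precision and recall in the classification report can all be derived from the confusion matrix. The sketch below shows the arithmetic on a hypothetical 3-class matrix, following scikit-learn's convention that rows are true classes and columns are predicted classes. The counts are invented for illustration and are not the notebook's results.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
toy_cm = np.array([
    [50, 10, 0],
    [12, 40, 8],
    [1, 9, 10],
])

# Accuracy: correctly classified samples (the diagonal) over all samples
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()

# Recall: correct predictions over the number of true samples in each class (row sums)
toy_recall = np.diag(toy_cm) / toy_cm.sum(axis=1)

# Precision: correct predictions over the number of predictions for each class (column sums)
toy_precision = np.diag(toy_cm) / toy_cm.sum(axis=0)

print(f"accuracy: {toy_accuracy:.3f}")
print("recall per class:   ", np.round(toy_recall, 3))
print("precision per class:", np.round(toy_precision, 3))
```

A class with few samples can have poor recall while barely affecting overall accuracy, which is why the per-class figures are worth reading alongside the headline number.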

## Conclusion
The model achieved an accuracy of about 69.87% on the test set, which leaves room for improvement. Handling class imbalance and tuning the model's hyperparameters could improve its performance.
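
The two suggested improvements, handling class imbalance and hyperparameter tuning, could be combined along the following lines. This is a sketch under assumptions, not a tested recipe for this dataset: it uses synthetic imbalanced data from `make_classification`, the `class_weight="balanced"` option as one possible way to address imbalance, and a deliberately small, arbitrary parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced 3-class problem (a stand-in, not the wine data)
X_toy, y_toy = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=4,
    n_classes=3,
    weights=[0.6, 0.3, 0.1],
    random_state=42,
)

# Small illustrative grid; a real search would likely cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1_macro",  # macro F1 weights every class equally
    cv=3,
)
search.fit(X_toy, y_toy)

print("best parameters:", search.best_params_)
print("best cross-validated macro F1:", round(search.best_score_, 3))
```

Macro-averaged F1 is used for model selection here because plain accuracy can look acceptable even when minority classes are rarely predicted correctly.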