##### Wine Rating Classification
##### Johanna Fan & Lindsey Bodenbender
##### CPSC 322, Fall 2024

## Introduction
This study uses a dataset of wine attributes, including price, production year, and country of origin, to classify wines into quality categories. We tested several classifiers: k-nearest neighbors, Naive Bayes, and decision trees. The decision tree classifier proved the most effective, achieving the highest accuracy while keeping precision and recall in balance.

## Data Analysis

### Information
The dataset contains several key attributes:
- **Ratings**: Wine quality ratings, discretized for analysis.
- **Price**: Listed prices of the wines.
- **Year**: The year the wine was produced.
- **Country**: Country of origin of the wine.

The dataset includes a total of [number of instances] instances, each providing a comprehensive view of the characteristics that potentially influence wine quality.
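The report does not state how ratings were discretized, so the following is only a minimal sketch of one way it could be done. The function name `discretize_rating`, the cutoff values, and the sample ratings are all hypothetical and are not taken from the project.

```python
def discretize_rating(rating, cutoffs=(3.0, 3.5, 4.0, 4.5)):
    """Map a continuous rating to an ordinal class label in 1..len(cutoffs)+1.

    The cutoffs are hypothetical; the project's actual bins are not documented here.
    """
    label = 1
    for cutoff in cutoffs:
        # each cutoff the rating reaches moves it up one class
        if rating >= cutoff:
            label += 1
    return label


# made-up ratings inside the reported 2.2 to 4.9 range
sample_ratings = [2.2, 3.4, 3.9, 4.2, 4.9]
labels = [discretize_rating(r) for r in sample_ratings]
print(labels)
```

Fixed cutoffs keep the classes easy to interpret, but with a skewed rating distribution they can produce very uneven class sizes, which matters for the evaluation later in the report.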

### Summary Statistics
The wine ratings range from **2.2** to **4.9**, with the distribution shown in Figure 1. Price and year distributions are depicted in Figures 2 and 3, respectively, highlighting the diversity in wine pricing and the range of production years covered in the dataset.

In [None]:
# ratings are decimal values (2.2 to 4.9), so convert with float() rather than int()
# rating_col = [float(r) for r in rating_col if r != 'N.V.'] # exclude missing vals
print(rating_col)
print(len(rating_col))
print('The minimum rating is', min(rating_col))
print('The maximum rating is', max(rating_col))

### Data visualizations
Figure descriptions:
- **Figure 1**: Wine Rating Distribution illustrates the frequency of each rating class, providing insights into the commonality of quality levels across the dataset.
- **Figure 2**: Wine Price Distribution shows variations in wine prices, offering a glimpse into the market segmentation based on price.
- **Figure 3**: Wine Year Distribution depicts the range of production years, indicating the dataset's coverage over time.
- **Figure 4**: Wine Country Distribution illustrates the prevalence of wines from different countries in the dataset.
- **Figure 5**: Number of Ratings Distribution presents the count of wines based on the number of ratings they have received.

In [None]:
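# plotting libraries used by the figures below
import matplotlib.pyplot as plt
import seaborn as sns

# plot the class distribution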
plt.figure()
sns.countplot(x=rating_col, palette="coolwarm")
plt.title("Wine Rating Distribution")
plt.xlabel("Wine Ratings")
plt.ylabel("Count")
plt.show()

# price
plt.figure()
sns.countplot(x=price_col, palette="coolwarm")
plt.title("Wine Price Distribution")
plt.xlabel("Wine Prices")
plt.ylabel("Count")
plt.show()
print(f'Prices range from {min(price_col)} to {max(price_col)}')

# year
plt.figure()
sns.countplot(x=year_col, palette="coolwarm")
plt.title("Wine Year Distribution")
plt.xlabel("Years")
plt.ylabel("Count")
plt.show()
print(f'Years range from {min(year_col)} to {max(year_col)}')

# countries
plt.figure()
sns.countplot(x=country_col, palette="coolwarm")
plt.title("Wine Country Distribution")
plt.xlabel("Countries")
plt.ylabel("Count")
plt.show()

# number of ratings
plt.figure()
sns.countplot(x=num_ratings_col, palette="coolwarm")
plt.title("Number of Ratings Distribution")
plt.xlabel("Number of ratings")
plt.ylabel("Count")
plt.show()
print(f'Number of ratings ranges from {min(num_ratings_col)} to {max(num_ratings_col)}')

## Classification Results
### Approach
The classification task aimed to predict the quality category of wines from the dataset's attributes, such as price, production year, and country of origin. We compared three machine learning classifiers, k-nearest neighbors (kNN), Naive Bayes, and decision trees, to determine which best predicts wine quality.
### Implementation
- **k-Nearest Neighbors**: This classifier was implemented by considering the closest training examples in the feature space. The optimal number of neighbors was determined through cross-validation.
- **Naive Bayes**: This probabilistic classifier was used to model the likelihood of each category given the feature set, assuming independence between predictors.
- **Decision Trees**: A decision tree was developed to model the decision rules derived from the data attributes, which predict the quality class of the wine.
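As an illustration of the first classifier in the list above, here is a minimal kNN sketch. It assumes numeric features that have already been normalized to a common scale and uses Euclidean distance with a majority vote; the function names and the toy data are hypothetical and do not reproduce the project's implementation.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the majority label among the k training rows closest to x_query."""
    # pair each training label with its distance to the query, nearest first
    neighbors = sorted(
        (euclidean_distance(x, x_query), label) for x, label in zip(X_train, y_train)
    )
    top_labels = [label for _, label in neighbors[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# toy, already-normalized [price, year] rows with made-up class labels
X_train = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1], [0.85, 0.3]]
y_train = ["low", "low", "high", "high", "high"]

prediction = knn_predict(X_train, y_train, [0.75, 0.25], k=3)
print(prediction)
```

Normalizing first matters because raw prices and years sit on very different scales, and the larger-valued attribute would otherwise dominate the distance.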
### Evaluation
The classifiers were evaluated using stratified k-fold cross-validation with k=10, so that each fold preserved approximately the same class proportions as the full dataset. Accuracy, precision, recall, and F1 score were computed for each model to assess how well it classified wine quality.
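One way to build stratified folds is to group instance indices by class and deal each group across the folds in round-robin order. The sketch below assumes that approach; `stratified_kfold_indices` and the label counts are hypothetical and are not the project's own code or data.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(y, n_splits=10, seed=0):
    """Return n_splits lists of indices with each class spread evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # deal this class round-robin, continuing where the previous class stopped
        for index in indices:
            folds[position % n_splits].append(index)
            position += 1
    return folds


# made-up, imbalanced class labels
y = ["3"] * 20 + ["4"] * 70 + ["5"] * 10
folds = stratified_kfold_indices(y, n_splits=10)

for fold in folds[:2]:
    fold_labels = [y[i] for i in fold]
    print({label: fold_labels.count(label) for label in sorted(set(fold_labels))})
```

Each fold then serves once as the test set while the remaining nine folds form the training set, and the metrics are averaged or pooled over the ten runs.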
### Performance Comparison
- **Accuracy**: Decision trees achieved the highest accuracy, suggesting a strong fit to the data.
- **Precision and Recall**: Naive Bayes showed competitive performance in terms of precision, whereas kNN excelled in recall, particularly for minority classes.
- **F1 Score**: Decision trees consistently reported the highest F1 scores across most classes, indicating a balanced performance between precision and recall.
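For reference, the metrics compared above can be computed from true and predicted labels as in the following sketch. It treats one class as the positive label at a time; the helper name `per_class_scores` and the example labels are hypothetical and are not results from this project.

```python
def per_class_scores(y_true, y_pred, positive_label):
    """Return (precision, recall, f1) treating positive_label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)

    # guard against empty denominators when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# made-up labels for illustration only
y_true = ["high", "high", "low", "low", "high", "low"]
y_pred = ["high", "low", "low", "high", "high", "low"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = per_class_scores(y_true, y_pred, "high")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With imbalanced classes, accuracy alone can look strong even when a minority class is rarely predicted correctly, which is why the per-class precision, recall, and F1 values are reported alongside it.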
### Best Classifier
The decision tree classifier was the best-performing model. It achieved the highest accuracy and also kept precision and recall in balance, which is crucial for reliable quality prediction on our imbalanced dataset.

## Bonus: Classification web app

## Conclusion
This project analyzed a wine dataset, with attributes such as price, production year, and country of origin, to classify wines into quality categories. The dataset posed challenges typical of real-world data, including class imbalance and variability in data distribution, which affected classification accuracy.
### Summary
- **Dataset**: The dataset comprised several attributes like wine ratings, price, and year, with wine quality ratings serving as the target variable. Challenges such as missing values and class imbalance were addressed through data preprocessing and analysis techniques.
- **Classification Approach**: We implemented k-nearest neighbors, Naive Bayes, and decision tree classifiers. Each model was evaluated using stratified 10-fold cross-validation to ensure robust performance metrics across the dataset's varied distribution.
- **Performance**: The decision tree classifier achieved the highest accuracy and F1 scores while maintaining a good balance between precision and recall, making it the best choice for this classification task. One likely reason is its ability to handle non-linear relationships in the data.
### Future Improvements
- Gathering more diverse data from additional sources could help reduce regional biases and provide a more generalized model.

## Acknowledgments
- Wine Dataset: https://www.kaggle.com/datasets/budnyak/wine-rating-and-price/data
- Code and Materials
- Use of AI: This project used the ChatGPT 4 service to suggest improvements to the overall project.