# ENGR 1330 – Computational Thinking and Data Science

## Red Wine Quality Final Project - Background
In this project, you will analyze a dataset of red wine samples from the northwest region of Portugal.
The quality of a red wine, as determined by a sensory test, depends on many physicochemical
properties, such as fixed acidity, volatile acidity, pH, and density. The file
‘winequality-red.csv’ lists different varieties of red wine together with 11 such properties and,
for each variety, a quality score (QS) ranging from 3 to 8. For this project, consider a good wine to be
one with QS ≥ 6 and a bad wine to be one with QS ≤ 5. The objective of this project is to
classify each wine as good or bad based on the 11 properties in the dataset.
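
The good/bad rule above can be sketched as a small labeling function. This is only a minimal illustration: the name `label_wine` and the 1/0 encoding are assumptions made for the example, not part of the project materials.

```python
GOOD_THRESHOLD = 6  # from the project statement: a good wine has QS >= 6


def label_wine(quality_score):
    """Return 1 for a good wine (QS >= 6) and 0 for a bad wine (QS <= 5)."""
    return 1 if quality_score >= GOOD_THRESHOLD else 0


# Quality scores in the dataset range from 3 to 8.
labels = {qs: label_wine(qs) for qs in range(3, 9)}
print(labels)  # {3: 0, 4: 0, 5: 0, 6: 1, 7: 1, 8: 1}
```

With the dataset loaded into a pandas DataFrame, the same function could be applied to the quality column with `.apply(label_wine)` to create the binary target.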

## Required Tasks:
(a) Literature review:
1) P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining
from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. [Get
the research paper from Web of Science at TTU].

(b) Data acquisition:
1) Get the dataset required for this project from the following Kaggle website:
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

(c) Exploratory data analysis
 1) Perform exploratory data analysis (getting information about the dataset, making plots, etc.)
 2) Modify the dataset as needed for performing the analysis

(d) Classification model
 1) Implement a classification algorithm from scratch as well as using the data science library to
classify good wines and bad wines
 2) Evaluate the model by computing the necessary evaluation metrics from scratch as well as using
the data science library
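
As a rough sketch of what "evaluation metrics from scratch" might look like, the block below counts the confusion-matrix cells and derives accuracy, precision, recall, and F1 from them. The function names and the choice of metrics are assumptions (the task list does not say which metrics are required), and the labels are made up for illustration, not results from the wine dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count confusion-matrix cells, treating 1 (good wine) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1, guarding against division by zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
print(metrics["accuracy"])               # 0.75
```

If scikit-learn is the data science library used (an assumption; the task list does not name one), the results can be cross-checked against `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `confusion_matrix` from `sklearn.metrics`.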


**Database Acquisition**
- Get the dataset from the zip file on Blackboard or from the Kaggle website: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
- Supply links to any additional datasets discovered during the literature review

**Exploratory Data Analysis**
- Describe the dataset in words.
- Reformat the dataset as needed (column headings, perhaps) for subsequent analysis.
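
One possible way to reformat the column headings is to replace spaces with underscores so that each name is a valid Python identifier. The helper `clean_column_names` below is a hypothetical name introduced for this sketch; the headings listed are the ones in the dataset's header row.

```python
def clean_column_names(names):
    """Trim whitespace, lowercase, and replace spaces with underscores."""
    return [name.strip().lower().replace(" ", "_") for name in names]


# Column headings from the dataset's header row.
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]
cleaned = clean_column_names(columns)
print(cleaned[:3])  # ['fixed_acidity', 'volatile_acidity', 'citric_acid']
```

For a pandas DataFrame `df`, the cleaned names could be assigned back with `df.columns = clean_column_names(df.columns)`. Note that this sketch also lowercases `pH` to `ph`; drop the `.lower()` call if the original capitalization should be kept.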

**Model Building**
- Build data model
- Assess data model quality
- Build the input data interface for using the model              
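
The project statement does not prescribe a particular algorithm, so the sketch below uses k-nearest neighbors only as one simple possibility for a from-scratch classifier. The function names, the choice of `k=3`, and the two-feature toy data are all assumptions made for illustration; the toy data is not taken from the wine dataset.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))
    top_labels = [label for _, label in neighbors[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy two-feature data, made up for illustration (0 = bad, 1 = good).
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [1.1, 1.0]))  # 0
print(knn_predict(train_X, train_y, [3.0, 3.0]))  # 1
```

For the wine data, `query` would be a list of the 11 property values entered through the input interface. Because the properties are on very different scales (for example, total sulfur dioxide versus density), a distance-based method like this one would likely need the features standardized first.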
       
**Documentation**
- A training and project management video that shows how to use your tool and demonstrates the tool(s) as they run
- Interim report (see deliverables below); this document must be rendered as a .pdf.
- Final ipynb file (see deliverables below)

## Deliverables:

#### Part 1 Interim Report (due Dec 1):
A report that briefly describes the project. Use the Interim Report Template in Blackboard.
- Break down each task into manageable subtasks, and describe how you intend to solve each subtask and how you will test each task. Consider including a simple Gantt chart or a list of meeting times.
- State each team member's responsibilities for tasks already completed and tasks remaining until the end of the semester. Consider making explicit subtask assignments.

#### Part 2 Final Report (due Dec 10):
- A well-documented JupyterLab notebook (using a Python kernel); use markdown cells and code comments for explanations and text.
- A how-to video that demonstrates the tool's performance, describes any problems you were not able to solve, and covers project management, such as who did what. Active participation of every group member in the presentation is mandatory.
- A final peer evaluation report, in which each group member rates the participation and contribution of the other members.

**Above items can reside in a single video; but structure the video into the two parts; use an obvious transition when moving from "how to ..." into the project management portion.**  Keep the total video length to less than 10 minutes; submit as an *unlisted* YouTube video, and just supply the link (someone on each team is likely to have a YouTube creator account).  Keep in mind a 10 minute video can approach 100MB file size before compression, so it won't upload to Blackboard and cannot be emailed.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns # might not be needed

rw = pd.read_csv("redwinequality.csv")  # rw for red wine; the Kaggle file is named winequality-red.csv, so adjust this path if the file was not renamed
print(rw.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 