# Introduction

In this investigation I will use LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier to make predictions from my data.


I originally chose to predict the probability of wildfires, but that data wasn't very uniform, so it couldn't support very accurate predictions. Instead, I am now predicting whether a certain wine type will be 'good', 'ok', or 'bad'.


The purpose of this investigation is to identify whether a certain wine is of good, ok, or bad quality. I will be using fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates, and alcohol as features, with quality as the column I am predicting.

Hypotheses. Currently I am unsure how I will identify what makes a wine good, ok or bad, so I will have to gather and explore my data first and then come up with cut-offs for quality that decide which category a wine falls into. I will use LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier and then assess which one is best, as I am currently not sure which will be the most accurate. I am also aiming for 75% accuracy.


# Setting up
The below code contains necessary steps for setting up our machine learning environment. 

The data columns that I will use are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates, and alcohol, with quality as the column I am predicting.

I have decided to use all of these because the whole makeup of the wine is relevant to its quality, so I want as many inputs as possible available to get the best results. The prediction target is quality, as I want to use this machine learning program to identify how good or bad a certain wine type is. I expect pH to be very important, as it reflects the acid and base makeup of the wine, which affects its bitterness and flavour.



In [None]:
import os
os.chdir('/kaggle/input/red-wine-quality-cortez-et-al-2009/')

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# Below are the different analytics libraries that will be used throughout the model
import pandas as pd
import numpy as np # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns 

#Below are the methods and techniques that I will use throughout my project to help predict data
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import cross_val_score

#Machine learning algorithms are even less straightforward than nonlinear regression, partly because machine learning dispenses with the constraint of fitting to a specific mathematical function, such as a polynomial. 
#There are two major categories of problems that are often solved by machine learning: regression and classification. Regression is for numeric data (e.g. What is the likely income for someone with a given address and profession?) and classification is for non-numeric data (e.g. Will the applicant default on this loan?).
#Below are the classifiers that will be used throughout the model (LogisticRegression is a classifier, despite its name)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

#These settings make plots display inside the notebook and apply the ggplot style
%matplotlib inline
plt.style.use('ggplot')

import warnings
warnings.filterwarnings("ignore")


In [None]:
df_red = pd.read_csv("winequality-red.csv")

# Gather and explore the data

The data that I have selected contains fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates, and alcohol, with quality as the column I am predicting.

- Fixed acidity covers most of the acids involved with wine.
- Volatile acidity is associated with wine faults. A wine fault or defect is an unpleasant characteristic of a wine, often resulting from poor winemaking practices or storage conditions and leading to wine spoilage.
- Citric acid is an organic compound with the chemical formula HOC(CO₂H)(CH₂CO₂H)₂ and is found as a component in wine.
- Residual sugar, or 'RS', is the sugar from the grapes that is left over after fermentation.
- Chlorides: chloride is the anion Cl⁻ and is present in wine.
- Free sulphur dioxide is the portion of SO₂ that is free in the wine.
- Total sulphur dioxide (TSO₂) is the portion of SO₂ that is free in the wine plus the portion that is bound to other chemicals in the wine, such as aldehydes, pigments, or sugars.
- Density is the mass per unit volume of a substance.
- pH is a scale used to specify the acidity or basicity of an aqueous solution. Acidic solutions have lower pH values than basic or alkaline solutions.
- Sulphates: the sulphate (or sulfate) ion is a polyatomic anion with the formula SO₄²⁻. Salts, acid derivatives, and peroxides of sulphate are widely used in industry and in wine.
- Alcohol: an alcoholic drink contains ethanol, a type of alcohol produced by fermentation of grains, fruits, or other sources of sugar. It is also important to the taste and quality of wine.
- Quality: for most wine critics, quality refers to what they personally consider 'good' versus 'bad' wine, and correspondingly desirable versus aversive. This is usually framed within the context of conformity to established, learned norms for the wines concerned.



In [None]:
#Below are the first rows of the data, showing all of my variables as columns
df = df_red
df.head()

# Prepare the data

Before we can separate our prediction target 'y' from the rest of the data, we need to do some preparation so that there aren't any rows with missing values as our machine learning model will not be able to handle them.

## Select features and target then drop missing values
Choosing our features first will help reduce the total number of rows we need to drop (remove).

We want to choose a selection of features that:
- Are relevant to our predictions
- Don't have many missing values

I have also aimed to:
- Remove outliers if appropriate and discuss why.
- Remove rows with missing/NaN values.
- Reduce training set data loss by pre-selecting features before dropping rows, where applicable to the selected model.
- Replace unsuitable or missing values instead of dropping rows where appropriate.
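
The effect of choosing features before dropping rows can be seen in a small sketch. The toy DataFrame and its `notes` column are hypothetical and are not part of the wine data; they only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'notes' is a sparse column we do not need for modelling.
toy = pd.DataFrame({
    "pH": [3.2, 3.4, np.nan, 3.1],
    "alcohol": [9.4, 9.8, 10.1, 11.0],
    "notes": [np.nan, np.nan, "corked", np.nan],
    "quality": [5, 6, 5, 7],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Dropping rows first removes every row that has any gap, including gaps in 'notes'.
dropped_first = toy.dropna()

# Selecting only the columns we need first keeps more rows.
selected_first = toy[["pH", "alcohol", "quality"]].dropna()

print(len(dropped_first), len(selected_first))
```

On the real data, `df.isna().sum()` gives the same per-column count of missing values.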



In [None]:
df.info()

In [None]:
df['quality'].value_counts(sort = False)

In [None]:
#This is a histogram which I can use to better understand my data and decide how I can split it up into 'good', 'ok', and 'bad'.
df['quality'].hist() 

It looks like the distribution is imbalanced, although I will still group the wines into three categories: 'good', 'ok' and 'bad'.
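
As an aside, the same grouping can be sketched with `pd.cut`, which bins a whole column at once. The scores below are hypothetical examples, and the bin edges are an assumption chosen to match the cut-offs used in this notebook (up to 5 is 'bad', 6 is 'ok', 7 and above is 'good').

```python
import pandas as pd

# Hypothetical quality scores, one for each value from 3 to 8.
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Bins are right-inclusive: (0, 5] -> bad, (5, 6] -> ok, (6, 10] -> good.
binned = pd.cut(quality, bins=[0, 5, 6, 10], labels=["bad", "ok", "good"])

print(binned.tolist())
```

The result is already a categorical column, so no separate `astype('category')` step would be needed with this approach.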

In [None]:
#This is the code that I have used to split the data into good, ok and bad.
def gen_labels(row):
    """Map one row's numeric quality score to 'bad', 'ok' or 'good'."""
    quality = row['quality']

    if quality <= 5:
        label = 'bad'     # scores of 5 and below
    elif quality < 7:
        label = 'ok'      # a score of 6
    else:
        label = 'good'    # scores of 7 and above

    return label

In [None]:
df['label'] = df.apply(gen_labels, axis = 1)

df['label'] = df['label'].astype('category')

df['label'].value_counts()


In [None]:
#### Taking too long
# df['label'].hist()

In [None]:
df.columns

In [None]:
df.groupby('label').mean()

It looks like the three labels are not easily distinguishable from the column means. Some swarm plots help to confirm this.

In [None]:
sns.catplot(x='label', y='pH', hue='quality', data=df, kind = 'swarm')

In [None]:
sns.catplot(x='label', y='fixed acidity', hue='quality', data=df, kind = 'swarm')

## Split data into training and testing data.
Splitting the data set into two subsets is important because you need data that your model hasn't seen yet, with actual values to compare against your predictions, to be able to tell how well it is performing.
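
Since the labels are imbalanced, one option is a stratified split, which keeps the class proportions the same in both subsets. This is a minimal sketch on made-up data (the 50/40/10 class counts are an assumption for illustration), not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 50 'bad', 40 'ok' and 10 'good' rows.
X = np.arange(100).reshape(-1, 1)
y = np.array(["bad"] * 50 + ["ok"] * 40 + ["good"] * 10)

# stratify=y keeps the class proportions the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print((y_train == "good").sum(), (y_test == "good").sum())
```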

## Separate Features From Target
Now that we have a set of data (as a Pandas DataFrame) without any missing values, let's separate the features we will use for training from the target.




Before starting to build a model and make predictions, I standardize the features and split the data into training and test sets.
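
One thing to be aware of is that fitting the scaler on all rows before splitting lets the test rows influence the scaling. A minimal sketch of the alternative, fitting the scaler on the training rows only, is shown here on synthetic data. All names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table: 200 rows, 3 numeric columns.
rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test rows

print(X_train_scaled.mean(axis=0).round(6))
```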

In [None]:
def scale_and_split(df, test_size=0.3):
    # The target is the categorical label; the features are every column except label and quality
    target = df['label']
    features = df.drop(['label', 'quality'], axis = 1)
    labels = list(target.unique())

    # Standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    features = scaler.fit_transform(features)

    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = test_size, random_state = 42)

    return X_train, X_test, y_train, y_test, labels


# Choose and Train a Model
Now that we have data our model can digest, let's use it to train a model and make some predictions. We're going to use LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier. The DecisionTreeClassifier differs from the Decision Tree Regressor used in the [Intro to Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning) in that it makes categorical predictions instead of continuous numerical predictions.
For an example of a Decision Tree Classifier working with a non-numerical 'y' and a more in-depth look at how they work, take a look at this Kaggle notebook (https://www.kaggle.com/chrised209/decision-tree-modeling-of-the-iris-dataset).

Ok, let's train our models and see how they perform.



In [None]:
def evaluate_model(model, df):
    
    X_train, X_test, y_train, y_test, labels = scale_and_split(df)
    
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print('Cross validation score - ', scores.mean()*100)
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred) 
    print('Test accuracy - ',accuracy*100)
    print('Confusion Matrix -\n', confusion_matrix(y_test, y_pred, labels=labels))

In [None]:
lr = LogisticRegression()
dt = DecisionTreeClassifier(criterion='gini', max_depth=12, random_state=42)
rc = RandomForestClassifier(n_estimators=100, max_depth=12 ,random_state=42)

print('\nEvaluation results - Logistic Regression')
evaluate_model(lr, df)

print('\nEvaluation results - Decision Tree Classifier')
evaluate_model(dt, df)

print('\nEvaluation results - Random Forest Classifier')
evaluate_model(rc, df)

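Cross-validation is a statistical method used to estimate the skill of machine learning models. k-fold cross-validation estimates the skill of the model on new data by splitting the training data into k folds, training on k-1 of them, scoring on the remaining fold, and repeating until every fold has been used once for scoring. In this notebook k is 5 (`cv=5`). There are common tactics that you can use to select the value of k for your dataset.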
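
To make the procedure concrete, here is a minimal sketch of 5-fold cross-validation on synthetic data. The data set from `make_classification` and the `max_iter=1000` setting are assumptions for illustration, not the wine data or the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class problem with 11 features, standing in for the wine table.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=42
)

# Five folds: each fold is used once for scoring and four times for training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print(scores)          # one accuracy value per fold
print(scores.mean())   # the figure reported as the cross-validation score
```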

Accuracy is defined as the percentage of correct predictions for the test data. It can be calculated easily by dividing the number of correct predictions by the number of total predictions.
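
As a small worked example of that calculation, the sketch below computes accuracy by hand and with `accuracy_score`. The five labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted labels for five wines.
y_true = np.array(["bad", "ok", "good", "bad", "ok"])
y_pred = np.array(["bad", "ok", "ok", "bad", "bad"])

# Correct predictions divided by total predictions: 3 out of 5.
manual_accuracy = (y_true == y_pred).sum() / len(y_true)
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```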

In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa; both variants are found in the literature. In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes.
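
A minimal sketch of how to read such a matrix, using six made-up wines and the label order 'bad', 'ok', 'good':

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for six wines.
y_true = ["bad", "bad", "ok", "ok", "good", "good"]
y_pred = ["bad", "ok", "ok", "ok", "good", "ok"]

# Rows are actual classes, columns are predicted classes, both in this order.
cm = confusion_matrix(y_true, y_pred, labels=["bad", "ok", "good"])

print(cm)
```

The diagonal holds the correct predictions. For example, the last row shows that one of the two 'good' wines was predicted as 'ok'.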

As the data is imbalanced, upsampling the minority class might help to increase the performance of the model.



# Evaluate model performance and tune hyperparameters
Now that we have working models, let's see whether upsampling the minority class improves how well they predict wine quality.





In [None]:
from sklearn.utils import resample

In [None]:
df_majority = df[df['label']!='good']
df_minority = df[df['label']=='good']
 
df_minority_upsampled = resample(df_minority, replace=True, n_samples=700, random_state=42)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])

df_upsampled['label'].value_counts()

In [None]:
lr = LogisticRegression()
dt = DecisionTreeClassifier(criterion='gini', max_depth=12, random_state=42)
rc = RandomForestClassifier(n_estimators=250, max_depth=12 ,random_state=42)

print('\nEvaluation results on upsampled data - Logistic Regression')
evaluate_model(lr, df_upsampled)

print('\nEvaluation results on upsampled data - Decision Tree Classifier')
evaluate_model(dt, df_upsampled)

print('\nEvaluation results on upsampled data - Random Forest Classifier')
evaluate_model(rc, df_upsampled)

There is a significant improvement in the accuracy of the decision tree classifier and the random forest classifier after upsampling the minority class.
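
Because the upsampling above happens before the train/test split, copies of the same 'good' row can end up in both the training and the test set, which may make the test accuracy look better than it would be on new wines. A minimal sketch of an alternative, upsampling only the training rows after the split, is shown here on a made-up table (the column values and the 90/10 class counts are assumptions).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy table: 90 majority rows and 10 'good' rows.
toy = pd.DataFrame({
    "alcohol": range(100),
    "label": ["bad"] * 90 + ["good"] * 10,
})

# Split first, so the test rows are never duplicated into training.
train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["label"], random_state=42
)

train_minority = train[train["label"] == "good"]
train_majority = train[train["label"] != "good"]

# Upsample the minority class within the training rows only.
train_minority_up = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=42
)
train_balanced = pd.concat([train_majority, train_minority_up])

# Row indices shared between the balanced training set and the test set.
overlap = set(train_balanced.index) & set(test.index)

print(train_balanced["label"].value_counts())
print(len(overlap))
```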

# Conclusion
Now that I have some predictions, it's important to talk about them. My model reached a cross-validation score of 78.17% and a test accuracy of 76.32%. These two figures are very close, and both are slightly better than my 75% goal. It is not as accurate as Harrison's model at 99.97%, but it is more accurate than some other people's, which reached 60%. I think it does well, but I only really needed the Random Forest Classifier, as it has consistently been the most accurate of all the models I have used.
 
The purpose of this investigation was to identify what makes a wine good, ok or bad. At the start I was unsure how I would do this, so I gathered and explored my data first and then came up with cut-offs for quality to decide which category each wine falls into. I think that I have achieved this well.
