# Part 3: Logistic Regression

In this part, we will work with the wine dataset. It contains 11 chemical features of various wines, along with experts' ratings of each wine's quality. The quality scale technically runs from 0 to 10, but only scores from 3 to 9 actually appear in the data.

**Reference**
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009. 

**Your goal is to fit a model able to classify a wine as good or bad quality.**

## The Dataset
The dataset contains the following features: 
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

The 12th column of the data file is the output variable, which represents the quality of the wine (a score between 0 and 10).

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
wine_df = pd.read_csv('data/winequality-red.csv') #Reads a CSV file

## Exercise 1: Exploring the data

### Task 1.1: Using pandas for data exploration
As you did in linear regression, use the methods `head(params)` and `describe(params)` to explore the dataset.

In [None]:
#Your code here


In [None]:
#Your code here


### Task 1.2: Adding a good/bad label
The current dataset does not have a field indicating whether a wine is good or bad. We will use the `quality` column to categorize wines as good or bad according to the following rule:
- A wine is considered good if its quality is greater than or equal to 6.5.

Create a new field in the DataFrame, named `good`, that reflects this rule.

In [None]:
wine_df['good'] = #Your code here

wine_df['good'] = wine_df['good'].astype(int) #Converts True/False to 1/0
wine_df.head()
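
Once a binary label exists, it is worth checking how balanced the two classes are, since this affects both how the data should be split and how accuracy should be read later on. The sketch below uses a small hand-made table (the values are hypothetical and are not taken from the wine data) to show how `value_counts(normalize=True)` reports the share of each class.

```python
import pandas as pd

# Hypothetical quality scores, for illustration only (not taken from the wine data).
toy = pd.DataFrame({"quality": [5, 6, 7, 5, 8, 6, 5, 4]})
toy["good"] = [0, 0, 1, 0, 1, 0, 0, 0]  # labels written by hand for this toy table

# Fraction of rows in each class; normalize=True returns proportions, not counts.
class_share = toy["good"].value_counts(normalize=True)
print(class_share)
```

Running the same call on `wine_df['good']` should show whether good wines are a minority in the real data.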

## Exercise 2: Data splits and further exploration
We start by splitting the data into two sets. For this exercise, we will use only a training set and a test set. We omit the validation set because we train a single model and then check how well it generalizes.

In [None]:
wines_train, wines_test = train_test_split(wine_df, test_size=0.2, random_state=8, stratify=wine_df['good'])

X_train = wines_train.drop(['quality','good'], axis=1)
y_train = wines_train['good']

X_test = wines_test.drop(['quality','good'], axis=1)
y_test = wines_test['good']

X_train.head(15)

### Question 2.1: train_test_split
Explain the role of each of the parameters used in the following line of code:

`train_test_split(wine_df, test_size=0.2, random_state=8, stratify=wine_df['good'])`

Your answer here: 

The `scatter_matrix` function from pandas lets you explore the data visually. This can be useful for identifying potential correlations among the input features (which are undesirable).

In [None]:
from pandas.plotting import scatter_matrix
wines_train = wines_train.drop(['quality'], axis=1)
scatter_matrix(wines_train, figsize=(30,20))
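
A numerical complement to the scatter matrix is the pairwise correlation table returned by `DataFrame.corr()`. The following is a minimal sketch on synthetic data (the columns `a`, `b`, and `c` are made up for illustration): `b` is constructed from `a`, so the two should be strongly correlated, while `c` is generated independently.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: "b" is built from "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Pairwise Pearson correlation coefficients between all columns.
corr = toy.corr()
print(corr.round(2))
```

Applying `.corr()` to `wines_train` in the same way gives one number per pair of features, which can be easier to scan than 144 small plots.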

### Question 2.2: Scatter matrix Analysis
Based on the observed plots, do you think any of the features are correlated?

Your answer here:

## Exercise 3: Training and Testing
We will now proceed to train our logistic regression model.

In [None]:
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression(C=1000000, solver='newton-cg', max_iter=250).fit(X_train,y_train)
print(f'Logistic regression model coefficients:{logistic_model.coef_}\n')
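
When reading the coefficients, keep in mind that their magnitude depends on the units of each feature, so a small coefficient does not necessarily mean a small contribution. The sketch below illustrates this on synthetic data (not the wine data): two features carry the same amount of information, but the second is expressed in units 100 times larger. The pipeline with `StandardScaler` is an assumption added for illustration and is not part of the exercise; `C=1e6` mirrors the very weak regularization used above (in scikit-learn, `C` is the inverse of the regularization strength).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: two equally informative features
# measured on very different scales.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
y = (z[:, 0] + z[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X = z * np.array([1.0, 100.0])  # second feature in units 100x larger

# Fit once on the raw features and once on standardized features.
raw = LogisticRegression(C=1e6, solver="newton-cg", max_iter=250).fit(X, y)
scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1e6, solver="newton-cg", max_iter=250),
).fit(X, y)

print("raw coefficients:   ", raw.coef_[0])
print("scaled coefficients:", scaled[-1].coef_[0])
```

On the raw features the second coefficient should come out much smaller than the first, even though both features matter equally; after standardization the two should be of comparable size.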

### Question set 3.1: Analysis of the coefficients
Based on the obtained coefficients:
1. Which features seem to contribute most towards considering a wine as good?
2. Which features seem to contribute most towards considering a wine as bad?

Your answer here:

In [None]:
print(f'Accuracy:{logistic_model.score(X_test,y_test)}\n')

### Question 3.2: Accuracy and generalizability
What accuracy did you obtain? Do you consider it good or bad? 

Your answer here:

### Task 3.1: Comparison against a dummy model
Suppose now that you build a dummy model that classifies all wines as bad. 

In [None]:
y_pred = np.zeros(len(y_test))

Estimate this model's accuracy:

In [None]:
#Your code here

#Your code ends here

print(f'Dummy model accuracy:{accuracy}\n')
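
For reference, scikit-learn provides `DummyClassifier`, which can express the same kind of constant baseline without building the prediction array by hand. The following is a minimal sketch on hypothetical labels (8 bad, 2 good; these numbers are made up and do not come from the wine data), offered as one possible way to cross-check a hand-computed baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels for illustration only: 8 "bad" (0), 2 "good" (1).
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# strategy="constant" always predicts the value given in `constant`.
baseline = DummyClassifier(strategy="constant", constant=0).fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"Baseline accuracy: {baseline_accuracy}")
```

On these toy labels the baseline is right for every bad wine and wrong for every good one, so its accuracy equals the share of the majority class.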

Given your results, has your original opinion about the trained model changed?

Your answer here: