# Lecture 7 Activity

This notebook builds on the in-class notes on the basic scikit-learn workflow. This time we will look at another classic ML dataset, which connects the properties of a wine (alcohol content, color, etc.) with its type. There are three categories in total. The goal is to train two kinds of scikit-learn classifiers on this task and see which performs better.

### Steps
1. Load data into `X` (features) and `y` (labels)
2. Split into train/test using `train_test_split`
3. Create a model
4. Fit with `.fit(X_train, y_train)`
5. Predict with `.predict(X_test)`
6. Evaluate with a metric
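
As a reminder of how the six steps fit together, here is a minimal sketch on the iris data rather than the wine data used in this activity. The `iris_`-prefixed names, the 80/20 split, and the seed of 0 are illustrative choices, not the values this notebook asks for.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Load features and labels (iris here, not the wine data used below)
iris_X, iris_y = load_iris(return_X_y=True, as_frame=True)

# 2. Hold out part of the data for testing
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(
    iris_X, iris_y, test_size=0.2, random_state=0
)

# 3. Create a model
iris_model = DecisionTreeClassifier(random_state=0)

# 4. Fit on the training data only
iris_model.fit(iris_X_train, iris_y_train)

# 5. Predict labels for the held-out test rows
iris_pred = iris_model.predict(iris_X_test)

# 6. Evaluate: fraction of test labels predicted correctly
iris_acc = accuracy_score(iris_y_test, iris_pred)
print(f"Iris accuracy: {iris_acc:.3f}")
```

The prefixed names keep this sketch from overwriting the `X`, `y`, and split variables you create later in the notebook.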

### Set up imports
Import pandas (as `pd`, which the checks below assume) and the needed scikit-learn libraries.

In [None]:
# import the necessary libraries

### Load the wine dataset

The dataset is built into scikit-learn and loads in a similar way to the iris dataset (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html">hint</a>).

**TODO:** Load the dataset and store:
- `X` as a pandas DataFrame
- `y` as a pandas Series
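
For comparison, this is roughly how the iris loader hands back pandas objects; the wine loader linked above should follow the same pattern. The names `iris_features` and `iris_labels` are placeholders for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn for pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)
iris_features = iris.data    # pandas DataFrame, one column per feature
iris_labels = iris.target    # pandas Series of integer class labels

print(type(iris_features).__name__, iris_features.shape)
print(type(iris_labels).__name__, iris_labels.shape)
```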


In [None]:
# Load the wine dataset into X (DataFrame) and y (Series)


In [None]:
assert isinstance(X, pd.DataFrame), 'X should be a pandas DataFrame.'
assert isinstance(y, pd.Series), 'y should be a pandas Series.'
assert X.shape[0] == y.shape[0], 'X and y must have the same number of rows.'

### Inspect the data

1. Display the first 5 rows of the `X` DataFrame.

2. Print the first 5 values of `y`.

In [None]:
# display first 5 rows of X


In [None]:
# inspect y by printing it out.

### Create training and testing data

Break the data down into training and testing data.

Create `X_train`, `X_test`, `y_train`, `y_test`.

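Use 75% of the data for training and 25% for testing, with a random seed of 50 (`random_state=50`). The checks below expect every class to appear in both splits, so stratify the split on `y`.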
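
The sketch below shows what a stratified split looks like on a small made-up dataset, assuming the usual `train_test_split` arguments. The `toy_` names, the data, and the seed of 0 are hypothetical and are not the values required for this activity.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 rows, three classes with 4 rows each
toy_X = pd.DataFrame({"feature": range(12)})
toy_y = pd.Series([0, 1, 2] * 4)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.25,    # 25% of the rows go to the test set
    random_state=0,    # fixed seed so the split is reproducible
    stratify=toy_y,    # keep class proportions the same in both splits
)

print(toy_X_train.shape[0], toy_X_test.shape[0])
print(sorted(toy_y_test.unique()))
```

With stratification, each of the three classes shows up in both the training and the test portion, which is the property the checks in this section look for.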


In [None]:
# Split the data here



In [None]:
assert X_train is not None, 'You must create X_train (and the others too).'
assert X_train.shape[0] + X_test.shape[0] == X.shape[0], 'Train + test rows must equal total rows.'
assert y_train.shape[0] == X_train.shape[0], 'y_train must match X_train rows.'
assert y_test.shape[0] == X_test.shape[0], 'y_test must match X_test rows.'

# Check approximate split ratio (allowing small rounding)
expected_test = int(round(0.25 * X.shape[0]))
assert abs(X_test.shape[0] - expected_test) <= 1, 'Test set size looks incorrect.'

# Check stratification: each class should appear in both train and test
assert set(y.unique()).issubset(set(y_train.unique())), 'A class is missing from y_train. Did you stratify?'
assert set(y.unique()).issubset(set(y_test.unique())), 'A class is missing from y_test. Did you stratify?'
print('✅ Passed: Train/test split looks correct.')

## Create the models

We’ll use **Logistic Regression** and **Decision Tree Classifier** for classification.

**TODO:** Create a `LogisticRegression` model named `logistic_model`. You'll need to import it like this: `from sklearn.linear_model import LogisticRegression`

**TODO:** Create a `DecisionTreeClassifier` model named `tree_model`. You'll also need to import the decision tree from scikit-learn, as in the notes.
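
Creating a model and training it are separate steps. The sketch below, on made-up data with hypothetical `demo_` names, illustrates that a newly created estimator has no learned attributes until `.fit` is called.

```python
from sklearn.tree import DecisionTreeClassifier

# Creating the estimator only records its settings
demo_tree = DecisionTreeClassifier(random_state=0)
print(hasattr(demo_tree, "classes_"))  # no learned attributes yet

# Hypothetical toy data: one feature, two classes
demo_X = [[0.0], [1.0], [2.0], [3.0]]
demo_y = [0, 0, 1, 1]

# Fitting is what adds learned attributes such as classes_
demo_tree.fit(demo_X, demo_y)
print(list(demo_tree.classes_))
```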


In [None]:
# create the logistic model (we don't need to train it yet)


In [None]:
assert logistic_model is not None, 'You must create a model object named `logistic_model`.'
assert isinstance(logistic_model, LogisticRegression), 'logistic_model should be a LogisticRegression instance.'
print('✅ Passed: Logistic Model created.')

In [None]:
# create basic decision tree model here. Set the random_state parameter to 42 for reproducibility.


In [None]:
# check decision tree model
assert tree_model is not None, 'You must create a model object named `tree_model`.'
assert isinstance(tree_model, DecisionTreeClassifier), 'tree_model should be a DecisionTreeClassifier instance.'
print('✅ Passed: Decision Tree Model created.')

### Fit the models

Logistic regression may emit a `ConvergenceWarning` on this unscaled data. You can ignore any warnings for this activity.

In [None]:
# use the training data to fit both models

In [None]:
assert hasattr(logistic_model, 'classes_'), 'Logistic model does not look fitted yet. Did you call model.fit(...) ?'
assert len(logistic_model.classes_) >= 2, 'Fitted logistic model should have 2+ classes.'
print('✅ Passed: Logistic Model fitted.')

assert hasattr(tree_model, 'classes_'), 'Decision tree model does not look fitted yet. Did you call model.fit(...) ?'
assert len(tree_model.classes_) >= 2, 'Fitted decision tree model should have 2+ classes.'
print('✅ Passed: Decision Tree Model fitted.')

### Generate test set predictions using both models

In [None]:
# Predict on the test data with both models; store the results in y_pred_logistic and y_pred_tree

In [None]:
# --- Checks (do not edit) ---
assert y_pred_logistic is not None, 'You must set y_pred_logistic.'
assert y_pred_tree is not None, 'You must set y_pred_tree.'

assert len(y_pred_logistic) == len(y_test), 'y_pred_logistic should have the same length as y_test.'
assert len(y_pred_tree) == len(y_test), 'y_pred_tree should have the same length as y_test.'

print('✅ Passed: Predictions computed.')

## Evaluate with accuracy

**TODO:** Compute the accuracy of each model on the test set as floats named `logistic_accuracy` and `tree_accuracy`.

Then print each one with 3 decimal places.
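
As an illustration of the metric and the formatting, here is a small sketch using scikit-learn's `accuracy_score` on made-up labels. The label lists and the name `toy_accuracy` are hypothetical.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only
true_labels = [0, 1, 2, 2, 1, 0, 1, 2]
predicted_labels = [0, 1, 2, 1, 1, 0, 2, 2]

# accuracy = (number of correct predictions) / (number of predictions)
toy_accuracy = accuracy_score(true_labels, predicted_labels)

# :.3f formats the float with exactly 3 digits after the decimal point
print(f"Toy accuracy: {toy_accuracy:.3f}")
```

Six of the eight predictions match, so this prints `Toy accuracy: 0.750`.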


In [None]:
# get the accuracy score for both models

In [None]:
# --- Checks (do not edit) ---
assert logistic_accuracy is not None, 'You must set logistic_accuracy.'
assert 0.0 <= logistic_accuracy <= 1.0, 'Accuracy must be between 0 and 1.'

assert tree_accuracy is not None, 'You must set tree_accuracy.'
assert 0.0 <= tree_accuracy <= 1.0, 'Accuracy must be between 0 and 1.'


print(f'✅ Passed: Logistic regression accuracy = {logistic_accuracy:.3f}')
print(f'✅ Passed: Decision tree accuracy = {tree_accuracy:.3f}')