<a href="https://colab.research.google.com/github/yongsa-nut/SF251_67_2/blob/main/SF_251_In_class_Exercise_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In-Class Exercise 9

Once done, upload your notebook to MS Teams.

## Pokemon Dataset Part 9

In this exercise, we will predict binary outcomes using logistic regression.

In [None]:
!wget https://raw.githubusercontent.com/yongsa-nut/SF251_67_2/refs/heads/main/data/pokemon.csv

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# Load the data
pokemon = pd.read_csv("pokemon.csv")
pokemon.columns

## Logistic Regression

We will start by simply applying logistic regression to predict a binary outcome.

**Objective**: Predict a Pokemon's rarity: normal vs. non-normal (legendary, sublegendary, or mythical).

**Features**:
- `gen`
- `total_stats` - the sum of all stats (`hp`, `attack`, `defense`, `sp_attack`, `sp_defense`, `speed`). This is a form of feature engineering.
- `primary_type` - We will need to convert this to one-hot encoding (or dummy variables, see lecture 10) using `pd.get_dummies(df, columns=[col names])` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)) or `sklearn.preprocessing.OneHotEncoder` ([documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.OneHotEncoder.html)).
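
As a minimal sketch of what one-hot encoding does, the snippet below applies `pd.get_dummies` to a tiny made-up frame. The rows and type names are illustrative assumptions, not taken from `pokemon.csv`; only the column name `primary_type` comes from the exercise.

```python
import pandas as pd

# Toy frame with made-up rows; it only mimics the shape of the real data.
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Squirtle", "Ivysaur"],
    "primary_type": ["grass", "fire", "water", "grass"],
})

# Each category becomes its own 0/1 column named primary_type_<category>;
# the original primary_type column is dropped.
encoded = pd.get_dummies(toy, columns=["primary_type"], dtype=int)
print(encoded)
```

Every row has exactly one of the new `primary_type_*` columns set to 1, which is why the feature selection in step 5 can pick them up by their common prefix.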

**Model** ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)):
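- `LogisticRegression(penalty='l2', C=1.0)` (default)
- `LogisticRegression(penalty='l2', C=0.1)` (stronger regularization; smaller `C` means a stronger penalty)
- `LogisticRegression(penalty='l1', C=1.0, solver='liblinear')` (the default `lbfgs` solver does not support `l1`)
- `LogisticRegression(penalty='l1', C=0.1, solver='liblinear')`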
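
To get a feel for how the two penalties differ, here is a rough sketch on synthetic data. It assumes `make_classification` as a stand-in for the real features, and all variable names are illustrative. An l1 penalty tends to push some coefficients to exactly zero, while l2 only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 20 features, only 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Same regularization strength C, different penalty type.
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X_demo, y_demo)
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
print(f"zero coefficients: l2={n_zero_l2}, l1={n_zero_l1}")
```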

**Steps**:
1. Create a new column `rarity`: 0 if normal, 1 if legendary/sublegendary/mythical
2. Create a new column `total_stats`
3. Convert `gen` to a number
4. Create new columns for the one-hot encoding of `primary_type` using `pd.get_dummies()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html))
5. Create the `X` and `y` dataframes
6. Create training and testing data using `train_test_split`: 80% training and 20% testing, with `random_state=100`
7. Declare the models
8. Use cross-validation with 5 folds to find the best model. Standardization happens inside each fold via a `Pipeline`.
9. Standardize the training and test data (fit the scaler on the training set only)
10. Train the best model on the whole training set
11. Test the final model on the test set. Report the accuracy and the confusion matrix.
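
The data preparation in steps 1 to 6 can be sketched on a tiny made-up frame. Everything about this frame is an assumption for illustration: the source column name `status`, its labels, the values, and the use of only two stat columns. The real `pokemon.csv` may name things differently, so check `pokemon.columns` before adapting it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the `status` column and all values are made up.
toy = pd.DataFrame({
    "status": ["normal", "legendary", "normal", "mythical", "normal",
               "normal", "sublegendary", "normal", "normal", "normal"],
    "gen": ["I", "I", "II", "II", "III", "III", "IV", "IV", "V", "V"],
    "hp": [45, 106, 60, 100, 50, 70, 80, 55, 65, 40],
    "attack": [49, 110, 62, 100, 53, 85, 90, 60, 70, 45],
    "primary_type": ["grass", "fire", "water", "psychic", "grass",
                     "fire", "dragon", "water", "grass", "fire"],
})

# 1) Binary target: 0 for normal, 1 for anything else.
toy["rarity"] = (toy["status"] != "normal").astype(int)
# 2) Row-wise sum of the stat columns (only two in this toy frame).
toy["total_stats"] = toy[["hp", "attack"]].sum(axis=1)
# 3) Map Roman numerals to integers.
toy["gen"] = toy["gen"].map({"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5})
# 4) One-hot encode the type column.
toy = pd.get_dummies(toy, columns=["primary_type"], dtype=int)

# 5) Features and target.
feature_cols = ["gen", "total_stats"] + [c for c in toy.columns if c.startswith("primary_type_")]
X_toy, y_toy = toy[feature_cols], toy["rarity"]
# 6) 80/20 split with a fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=100)
print(X_tr.shape, X_te.shape)
```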

In [None]:
# 1) Create a new column rarity
pokemon['rarity'] = ...

# 2) Create a new column total_stats
pokemon['total_stats'] = ...

# 3) Convert gen to number
mappings = {'I':1,'II':2,'III':3,'IV':4,'V':5,'VI':6,'VII':7,'VIII':8}
pokemon['gen'] = ...

# 4) Create new columns for one-hot encoding of primary_type
pokemon = ...

# 5) Create X and y dataframe
X = pokemon[['gen', 'total_stats'] + list(pokemon.columns[pokemon.columns.str.startswith('primary_type_')])]
y = ...

# 6) Create training and testing data using train_test_split. 80% training and 20% testing, random_state = 100
X_train, X_test, y_train, y_test = train_test_split(...)

X_train # Check your results

In [None]:
# 7) Declare the models

models = {
    'Logistic L2 C=1': LogisticRegression(penalty='l2', C=1.0),
    'Logistic L2 C=0.1': LogisticRegression(penalty='l2', C=0.1),
    'Logistic L1 C=1': LogisticRegression(penalty='l1', C=1.0, solver='liblinear'),  # The default lbfgs solver does not support l1
    'Logistic L1 C=0.1': LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
}

# 8) Use cross validation to find the best model. Do 5 folds
cv_results = []

for model_name, model in models.items():
    # Standardize the features. Regularized models (l1 in particular) are sensitive to feature scale.
    # The scaler must be fit inside each fold, on that fold's training split only,
    # so that no information leaks from the validation split.
    # A Pipeline chains the steps of a workflow, so cross_val_score refits the scaler per fold.
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model)   # The last step has to be an estimator/model
    ])
    cv_scores = cross_val_score(estimator=pipeline,
                                X=...,               # Fill in your answer here
                                y=...,               # Fill in your answer here
                                cv=...,              # Fill in your answer here
                                scoring='accuracy')  # You can also use 'precision', 'recall', or 'f1'
    cv_results.append({
        'Model': model_name,
        'Accuracy': cv_scores.mean(),
    })

cv_results_df = pd.DataFrame(cv_results)
print(cv_results_df)
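
For reference, here is a self-contained sketch of the same pattern on synthetic data. It assumes `make_classification` as a stand-in dataset, and the variable names are illustrative. Because the scaler is a step of the pipeline, `cross_val_score` refits it on the training split of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the Pokemon features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),          # refit on each fold's training split
    ("classifier", LogisticRegression()),  # the final step must be an estimator
])

# One accuracy score per fold; the mean summarizes the model.
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(demo_scores, demo_scores.mean())
```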

In [None]:
# 9) Standardize your data first (use these two scaled arrays for training and testing)
# Note: try it without standardization. What do you see?
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 10) Train the best model on the whole training set
best_model = ...
best_model.fit(...)

# 11) Test the final model on the test set. Report accuracy and confusion matrix.
y_pred = ...

print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.95
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       158
           1       0.81      0.77      0.79        22

    accuracy                           0.95       180
   macro avg       0.89      0.87      0.88       180
weighted avg       0.95      0.95      0.95       180

[[154   4]
 [  5  17]]
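
The layout of the matrix printed above follows scikit-learn's convention: rows are true labels and columns are predicted labels, both in sorted label order. A small sketch with made-up labels shows how the four cells can be unpacked for a binary problem.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only; 1 is the positive class.
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
# For binary labels, ravel() flattens the 2x2 matrix row by row:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(cm)
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
```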


**Q**: In the confusion matrix above, which entry is the number of true positives, and which is the number of false positives? (Rows are true labels, columns are predicted labels. See the [documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.confusion_matrix.html).)

**Answer**: