# 👽👾 Sci-fi IRL #2: PrintSF 📚🛸

#### A Predictive Machine Learning Model by Tobias Reaper

#### ---- Datalogue 02-009-01 ----

---
---

## Outline

1. Intro
2. Predict
  - Can use a "star" component if predicting
3. Explain / Insights / Analysis
  - Methodology
  - Choice of features
  - Feature engineering
  - Choice of model
  - Choice of metrics
4. Process
  - Size of data
  - Cross-Validation method + train / test

---

## TODOjo

#### Data Hygiene

- [x] Remove Outliers

#### Modeling

- [x] Get accuracy score for majority class baseline
- [x] Get baseline accuracy score for basic logistic regression
- [ ] Beat baseline with gradient-boosted classification model
- [ ] Use RandomizedSearchCV to tune hyperparameters (LSDS_223)

#### Model Interpretation

- [ ] Feature Importances
- [ ] Permutation Importances (LSDS_232)
- [ ] Partial Dependencies
- [ ] Get and interpret confusion matrix (LSDS_224) + precision / recall

---
---

### Imports and Configuration

📥⚙️

In [None]:
# The Utiliteers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Extrateers
import seaborn as sns
import janitor
import os

In [None]:
# Plotly imports
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.io as pio
pio.templates.default = "plotly"  # Default light template; use "plotly_dark" for dark mode

In [None]:
# Jupyter + Plotly imports (if running in Colab or Visual Studio Code, comment out this cell)
import plotly.offline as pyo
pyo.init_notebook_mode()  # Set plotly to notebook mode / work offline

In [None]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [None]:
# Set pandas display options to allow for more columns and rows
pd.options.display.max_columns = 200
pd.options.display.max_rows = 200

ML Imports

In [None]:
# ML Infrastructure
import category_encoders as ce
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# Extra Crunchy
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

---

### Load Data

In [None]:
# Define path to current session directory - 009
datapath = "/Users/Tobias/workshop/dasci/projects/thepurpledot_dev/stories/sci_fi_irl-02/009-Session/"

# Create path to the books dataset
filename = "must_read_books_008-03.csv"

filepath = os.path.join(datapath, filename)
filepath

In [None]:
# TODO: chain up the initial encodings and data rassling

# Load the data
df1 = pd.read_csv(filepath)

# Simple numerical encoding of Bool
df2 = df1.replace(to_replace={True: 1, False: 0})
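
One possible alternative to the blanket `replace()` above, sketched on a toy frame: cast only the boolean-dtype columns to integers and leave everything else untouched. The toy data and the names `toy`, `bool_cols`, and `encoded` are made up for illustration and are not part of the books dataset.

```python
import pandas as pd

# Toy frame standing in for the books data (values are illustrative only)
toy = pd.DataFrame(
    {
        "title": ["Dune", "Cosmos"],
        "fiction": [True, False],
        "in_series": [True, False],
        "num_pages": [412, 365],
    }
)

# Find the boolean columns, then cast only those to 64-bit integers
bool_cols = toy.select_dtypes(include="bool").columns
encoded = toy.astype({col: "int64" for col in bool_cols})

print(encoded.dtypes)
```

Selecting by dtype keeps the encoding explicit about which columns it changes, which may be easier to chain with the other cleaning steps later.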

In [None]:
print(df1.shape)
df1.head(2)

---

## Feature Engines

- Interaction Features
  - [ ] `fiction` | `short_stories`
  - [ ] `fiction` & `short_stories`

> Interaction Features, Part 1

In [None]:
# Use pyjanitor's `update_where()` once again, this time for an &
df8 = (df7
       .update_where(
           conditions=((df7["fiction"] == 1) & 
                      (df7["short_stories"] == 1)),
           target_column_name="overlap",
           target_val=1,
       )
      )

In [None]:
df8["overlap"].value_counts()
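
For comparison, here is a minimal sketch of the same two interaction features (the `&` and the `|` from the list above) in plain pandas, on a toy frame. The column name `either` is a hypothetical placeholder, and this version writes an explicit 0 for every non-matching row.

```python
import pandas as pd

# Toy flags covering all four combinations of the two genres
toy = pd.DataFrame(
    {"fiction": [1, 1, 0, 0], "short_stories": [1, 0, 1, 0]}
)

is_fiction = toy["fiction"] == 1
is_short = toy["short_stories"] == 1

# AND: rows flagged as both fiction and short stories
toy["overlap"] = (is_fiction & is_short).astype(int)

# OR: rows flagged as at least one of the two ("either" is a made-up name)
toy["either"] = (is_fiction | is_short).astype(int)

print(toy)
```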

In [None]:
# Find out if "short_stories" is leaky
df7_f1 = df7[(df7["fiction"] == 0) & (df7["short_stories"] == 1)]
df7_f1.shape
# Verdict is it doesn't look leaky to me.
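
A second way to eyeball leakage, sketched on made-up data: cross-tabulate the feature against the target and normalize by row. If one value of `short_stories` mapped almost entirely onto a single `fiction` class, that row would sit near 0 or 1. The toy values below are invented and say nothing about the real dataset.

```python
import pandas as pd

# Invented flags, only to show the shape of the check
toy = pd.DataFrame(
    {
        "fiction": [1, 1, 1, 0, 0, 0],
        "short_stories": [1, 0, 0, 1, 0, 0],
    }
)

# Rows: short_stories value; columns: fiction value; cells: share within each row
table = pd.crosstab(toy["short_stories"], toy["fiction"], normalize="index")
print(table)
```

In this toy case both rows split evenly between the two classes, which is what a non-leaky feature would look like.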

---

## Posterity Models

In [None]:
# # Save the current dataframe to file
# df2.to_csv("must_read_books_008-03.csv", index=False)

---

### Visualizations

In [None]:
# To find outliers in numerical features, utilize boxplot
sns.boxplot(x=df2["num_pages"]);

In [None]:
# See how much removing pages outliers affects dataset
# This could even be one of the sliders on the app
cutoff = 1000
df3 = df2[df2["num_pages"] <= cutoff]
print(f"There are {df2.shape[0] - df3.shape[0]} books longer than {cutoff} pages.")
print(f"The resulting dataset has {df3.shape[0]} rows.")

In [None]:
# To find outliers in numerical features, utilize boxplot
sns.boxplot(x=df3["num_ratings"]);

In [None]:
# See how much removing ratings outliers affects dataset
cutoff = 1000000
df4 = df3[df3["num_ratings"] <= cutoff]
print(f"There are {df3.shape[0] - df4.shape[0]} books with more than {cutoff} ratings.")
print(f"The resulting dataset has {df4.shape[0]} rows.")

In [None]:
# Distribution of review counts (histplot replaces the deprecated distplot)
sns.histplot(df4["num_reviews"], kde=True);

In [None]:
# See how much removing reviews outliers affects dataset
cutoff = 20000
df5 = df4[df4["num_reviews"] <= cutoff]
print(f"There are {df4.shape[0] - df5.shape[0]} books with more than {cutoff} reviews.")
print(f"The resulting dataset has {df5.shape[0]} rows.")

In [None]:
# See how much removing publish_year outliers affects dataset
cutoff = 1940
df6 = df5[df5["publish_year"] >= cutoff]
print(f"There are {df5.shape[0] - df6.shape[0]} books published before {cutoff}.")
print(f"The resulting dataset has {df6.shape[0]} rows.")
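
The four cutoff cells above follow the same pattern, so they could be folded into one helper. This is only a sketch: `apply_cutoffs` and its `upper` / `lower` arguments are hypothetical names, and the toy rows are invented to exercise each cutoff once.

```python
import pandas as pd


def apply_cutoffs(df, upper=None, lower=None):
    """Keep rows within the per-column bounds, reporting rows dropped at each step."""
    out = df
    for col, cutoff in (upper or {}).items():
        kept = out[out[col] <= cutoff]
        print(f"{col} <= {cutoff}: dropped {len(out) - len(kept)} rows")
        out = kept
    for col, cutoff in (lower or {}).items():
        kept = out[out[col] >= cutoff]
        print(f"{col} >= {cutoff}: dropped {len(out) - len(kept)} rows")
        out = kept
    return out


# Invented rows: one page outlier, one ratings/reviews outlier, one old book
toy = pd.DataFrame(
    {
        "num_pages": [300, 1200, 450, 280],
        "num_ratings": [5_000, 10_000, 2_000_000, 800],
        "num_reviews": [400, 900, 30_000, 50],
        "publish_year": [1965, 1990, 2005, 1890],
    }
)

trimmed = apply_cutoffs(
    toy,
    upper={"num_pages": 1000, "num_ratings": 1_000_000, "num_reviews": 20_000},
    lower={"publish_year": 1940},
)
print(f"The resulting dataset has {len(trimmed)} rows.")
```

Keeping the cutoffs in dictionaries would also make it straightforward to wire them to sliders later, if the app idea goes ahead.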

In [None]:
# Scatter Matrix
fig = px.scatter_matrix(df6, dimensions=["num_reviews", "avg_rating", "num_pages", "publish_year"], color="in_series")
fig.show()

In [None]:
# A little more complex scatter, without outliers
px.scatter(df6, x="publish_year", y="avg_rating", size="num_reviews", color="nonfiction", range_y=[2.5, 5])

---
---

# Predictive Modeling

---

## Target Practice 🎯`fiction`

> Binary Classification

Is it fiction or is it nonfiction?

In [None]:
# Clean up the dataset to remove all the extra genre fields
genre_cols = [
    "european_literature",
    "memoir",
    "fantasy",
    "religion",
    "horror",
    "humor",
    "historical_fiction",
    "classics",
    "adventure",
    "autobiography",
    "nonfiction",
    "novels",
    "biography",
    "war",
    "paranormal",
    "historical",
    "thriller",
    "cultural",
    "philosophy",
    "childrens",
    "literature",
    "young_adult",
    "mystery",
    "science_fiction",
    "contemporary",
    "crime",
    "history",
    "romance",
    "all_nonfiction",
    "overlap",
]

df8 = df7.drop(columns=genre_cols)

df8.shape

In [None]:
df8.head(2)

In [None]:
# Split up data into train / test
# No validation set because I will be using cross-validation
train2, test2 = train_test_split(df8, test_size=0.2, random_state=92)
train2.shape, test2.shape

In [None]:
# Define new target "fiction"
target2 = "fiction"

y2_train = train2[target2]
y2_test = test2[target2]
y2_train.shape, y2_test.shape

---

#### Majority Baseline

In [None]:
y2_train.value_counts(normalize=True)

In [None]:
# Get the mode (aka the majoratahhh class)
maj_class = y2_train.mode()[0]
maj_class

In [None]:
# Create predictions of 100% grass-fed respect-mah-majoritaahhh
y2k_pred = [maj_class] * len(y2_train)

In [None]:
# See how we did!!
accuracy_score(y2_train, y2k_pred)
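
As a cross-check on the manual baseline, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` does the same thing. A minimal sketch on a toy target follows; `y_toy` and `X_toy` are invented stand-ins, not the real `y2_train`.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy target: three of five rows belong to class 1
y_toy = pd.Series([1, 1, 1, 0, 0], name="fiction")

# The dummy model ignores the features, so a placeholder column is enough
X_toy = pd.DataFrame({"placeholder": [0] * len(y_toy)})

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, dummy.predict(X_toy))
print(baseline_acc)
```

The score equals the largest share from `value_counts(normalize=True)`, which is why a near 50/50 target gives a baseline close to a coin flip.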

...not too shabby.

Actually...yes it is. Almost as bad as I could get with binary classification.

Just my luck. That's the best I can do.

# 🥺

## $JK!$ I can do better.

> Starting with Logistic Regression

---

### Basic Logistic Baseline

_Unit 2, Sprint 1, Module 4_

> This time, using all the features and the `fiction` target!

In [None]:
# Arrange X matrices - using X21 to keep numbering organized
X21_train = train2.drop(columns=[target2])  # i.e. X 2.1
X21_test = test2.drop(columns=[target2])

X21_train.shape, X21_test.shape

In [None]:
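# Import the estimator used in the pipeline below
from sklearn.linear_model import LogisticRegressionCV

# Create basslinic logistic pipeline
pipe21 = make_pipeline(  # i.e. pipe 2.1
    ce.OrdinalEncoder(),
    StandardScaler(),
    SimpleImputer(strategy="median"),
    LogisticRegressionCV(cv=10, n_jobs=-1, random_state=92),
)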

In [None]:
# Fit the pipeline on the training data
pipe21.fit(X21_train, y2_train)

In [None]:
# Get the baseline cross-validation accuracy scores
scores21 = cross_val_score(pipe21, X21_train, y2_train, cv=5)

In [None]:
# Get accuracy scores for the cross-validated logistic model
print("Accuracy score with simple logistic regression using all features: %0.5f (+/- %0.5f)" % (scores21.mean(), scores21.std() * 2))
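
The confusion matrix and precision / recall items from the TODO list could look roughly like this. It is a sketch on synthetic data from `make_classification`, not on the books dataset, and it uses a plain `LogisticRegression` with imputation placed before scaling (a common ordering, and an assumption that differs from `pipe21`).

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the real features
X, y = make_classification(n_samples=500, n_features=6, random_state=92)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=92
)

# Impute, then scale, then fit a simple logistic regression
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1 on the held-out set
print(classification_report(y_test, y_pred))
```

Accuracy alone hides which class the model gets wrong; the off-diagonal cells of the matrix show whether the errors lean toward false positives or false negatives.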

---

## State of the Patience

> ...it pays off!

---
---

# 🧩 Cont'd in 02-008-04 🥳

---
---