# Extraction the Hard Way Answer Key
This lab is a bit more open-ended than the rest, so the exercises don't all have exact solutions. Instead, we'll provide more context on what to look for in the EDA, how to get started exploring certain parts of the data, and how to use what you discover.

With that in mind, we've left a number of the inline solutions in the lab itself, and we'll use this space to help you get unstuck if you find yourself stuck.

## Exploratory Data Analysis
We have a lot of "scores". We don't know how they're generated or how they interact. We do see an `mlx` score, and the initial `.describe()` call should show that every email marked as `spam` has `mlx` as its `reason`. In other words, it's some kind of ML score generated from the email data. We want to analyze this and the other scores to understand how they might be useful for identifying potential phishing emails. We also want to learn more about how the scores are generated from the email content itself.

You're told to first identify interesting columns. I'd start with all the score columns.

```python
score_columns = ['score', 'bulkscore', 'priorityscore', 'spamscore', 'mlxscore', 
                 'mlxlogscore', 'lowpriorityscore', 'suspectscore', 'adultscore', 'clxscore']
```

You can also get crafty and engineer some features, like the length of the content.
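
As one possible starting point, a length feature can be derived straight from the text. This sketch assumes the email body lives in a `content` column (the name used later in this key); the toy DataFrame and the `content_length` column name are placeholders for illustration only.

```python
import pandas as pd

# Toy stand-in for the lab's DataFrame; only the `content` column matters here.
toy_df = pd.DataFrame({"content": ["Hello there", "WIN A FREE PRIZE NOW!!!", ""]})

# Character length of each email body; fillna guards against missing content.
toy_df["content_length"] = toy_df["content"].fillna("").str.len()

print(toy_df)
```

On the real data you would swap `toy_df` for `df` and then plot the new column against the scores.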

### Inferring Decision Boundaries
You're told to make more charts of things you think would be cool to look at. There isn't a solution here because... do what you want! 

Some places to start... 

#### Correlation Heatmap
How do our scores correlate?
```python
import matplotlib.pyplot as plt
import seaborn as sns

score_columns = ['score', 'bulkscore', 'priorityscore', 'spamscore', 'mlxscore', 
                 'mlxlogscore', 'lowpriorityscore', 'suspectscore', 'adultscore', 'clxscore']

score_df = df[score_columns]

# Compute the correlation matrix
corr_matrix = score_df.corr()

# Create a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Heatmap of Score Columns')
plt.tight_layout()
plt.show()
```

Do you see anything that seems to directly correlate with the `mlx` score(s)?
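
If the heatmap is hard to read, one option is to pull out a single column of the correlation matrix and rank it. The sketch below runs on synthetic data so it is self-contained; the `tracks_mlx` and `unrelated` columns are invented for illustration and say nothing about what the real scores do.

```python
import numpy as np
import pandas as pd

# Synthetic scores: one column built to follow mlxscore, one built to ignore it.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy_scores = pd.DataFrame({
    "mlxscore": base,
    "tracks_mlx": 2 * base + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
})

# Correlation of every other column with mlxscore, strongest first (by magnitude).
mlx_corr = toy_scores.corr()["mlxscore"].drop("mlxscore")
ranked = mlx_corr.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

With the lab data, the equivalent would be `corr_matrix['mlxscore'].drop('mlxscore')` followed by the same sort.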

#### Pairplot
A pairplot can be used to make scatterplots of numerical values to give us a more in-depth look at how the fields might relate to each other.

```python
sns.pairplot(score_df)
plt.tight_layout()
plt.show()
```

Look closely - do you notice any fields that relate to one another, perhaps without being immediately correlated? What can this tell us about how `mlxscore` is computed?

#### Content Features vs Score
You can also get crafty and make some plots to understand how the content itself relates to the `mlxlogscore` (or any other score). Here we'll look at the overall word count, the mean word length, and the proportion of the content made up of non-alphanumeric characters. Go nuts!

```python
# Function to count words
def word_count(text):
    return len(text.split())

# Function to calculate mean word length
def mean_word_length(text):
    words = text.split()
    return sum(len(word) for word in words) / len(words) if words else 0

# Function to calculate proportion of non-alphanumeric characters
def prop_non_alphanum(text):
    total_chars = len(text)
    non_alphanum = sum(not c.isalnum() for c in text)
    return non_alphanum / total_chars if total_chars > 0 else 0

# Calculate new columns
df['word_count'] = df['content'].apply(word_count)
df['avg_word_length'] = df['content'].apply(mean_word_length)
df['prop_non_alphanum'] = df['content'].apply(prop_non_alphanum)

# Create subplots
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))

# Plot 1: Number of words vs mlxlogscore
sns.scatterplot(x='word_count', y='mlxlogscore', data=df, ax=ax1)
ax1.set_title('Number of Words vs Log MLX Score')
ax1.set_xlabel('Number of Words')
ax1.set_ylabel('Log MLX Score')

# Plot 2: Mean word length vs mlxlogscore
sns.scatterplot(x='avg_word_length', y='mlxlogscore', data=df, ax=ax2)
ax2.set_title('Mean Word Length vs Log MLX Score')
ax2.set_xlabel('Mean Word Length')
ax2.set_ylabel('Log MLX Score')

# Plot 3: Proportion of non-alphanumeric characters vs mlxlogscore
sns.scatterplot(x='prop_non_alphanum', y='mlxlogscore', data=df, ax=ax3)
ax3.set_title('Proportion of Non-Alphanumeric Characters vs Log MLX Score')
ax3.set_xlabel('Proportion of Non-Alphanumeric Characters')
ax3.set_ylabel('Log MLX Score')

plt.tight_layout()
plt.show()
```

## Dimensionality Reduction
We started by vectorizing our email content. Our `word_doc_matrix` has one row per email in the dataset and one column per unique term in the vocabulary built from all of the email content. A "term" here is an "n-gram" of either one or two words.
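
To make the shape of that matrix concrete, here is a hand-rolled sketch in plain Python. It is not the lab's vectorizer, which may tokenize and weight terms differently; the `ngrams` helper and the toy emails are made up for illustration.

```python
def ngrams(text, max_n=2):
    """Return every 1..max_n word n-gram of a lowercased, whitespace-split text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

toy_emails = ["claim your prize", "your invoice attached"]

# Vocabulary: every unique unigram or bigram, sorted for a stable column order.
vocab = sorted({term for email in toy_emails for term in ngrams(email)})
col = {term: j for j, term in enumerate(vocab)}

# One row per email, one column per term; each cell holds a term count.
matrix = [[0] * len(vocab) for _ in toy_emails]
for i, email in enumerate(toy_emails):
    for term in ngrams(email):
        matrix[i][col[term]] += 1

print(len(matrix), "rows x", len(vocab), "columns")
```

Even two three-word emails produce nine columns, which hints at why the real matrix gets wide quickly.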

### Resources
- What is PCA (Principal Component Analysis) actually?
    - PCA is a method of dimensionality reduction. Right now our `word_doc_matrix` has a lot of columns. We need a way of condensing our many columns (dimensions) into a small number N of new dimensions that still take all of the data into account. PCA does this by finding the N eigenvectors of the covariance matrix that have the largest eigenvalues. This gives us the N **principal components**. They essentially tell us the axes along which the **data varies the most**. When we multiply the data by these N eigenvectors, the data is "mapped" onto an N-dimensional space. In this case, N is 2.
    - "Those are words I haven't heard in a long time." If you really want to understand it beyond "it does some math to find new axes that better separate the data", then I recommend [StatQuest](https://www.youtube.com/watch?v=HMOI_lkzW08).
- What is t-SNE and why is it different?
    - For our purposes, it is another dimensionality reduction method, but unlike PCA it tries to preserve similarities in the new dimensional mapping. Points that were "close" in the original Z-dimensional space should remain "close" in the new N-dimensional space.
    - The **perplexity** value impacts whether the algorithm focuses more on maintaining local structure between smaller numbers of data points (smaller value of perplexity) vs. optimizing for maintaining the global structure of the data (larger perplexity). Too small and you might see disjoint fragmented clusters. Too large and you might just get a big blob. Play with it!
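
The PCA description above can be sketched directly with NumPy. This is a teaching sketch on random data, not what the lab runs: the `pca_project` name is made up here, and library implementations typically use an SVD rather than forming the covariance matrix explicitly, which matters for a matrix as wide as `word_doc_matrix`.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    # Center each column so the covariance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues.
    top = np.argsort(eigvals)[::-1][:n_components]
    return X_centered @ eigvecs[:, top]

# Random data with very different spread per column, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

projected = pca_project(X, n_components=2)
print(projected.shape)  # (100, 2)
```

The first output column carries at least as much variance as the second, which is what "the axes along which the data varies the most" means in practice.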


When we perform the reduction, depending on your setup, you might see some nice clusters appear! These tell us that there might be decision boundaries in this reduced space that allow us to group together similarly spammy messages.

![](../assets/2_extraction_tsne.png)

## Testing out a Model
In the lab, you've loaded a sequence classification model that will learn from labeled text and be able to classify our text samples. We first selected a threshold for the `mlxscore` and labeled emails based on whether their score was above or below the threshold.
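
A minimal sketch of that labeling step is below. It assumes a higher `mlxscore` means "more spam-like", which you should confirm in your EDA; the `THRESHOLD` value, the `label` column name, and the toy scores are all placeholders.

```python
import pandas as pd

# Made-up scores standing in for the lab's `mlxscore` column.
toy_df = pd.DataFrame({"mlxscore": [5, 42, 77, 98]})

# Hypothetical cutoff; pick yours from the score distribution.
THRESHOLD = 50

# 1 = at or above the threshold (treated as spam), 0 = below it.
toy_df["label"] = (toy_df["mlxscore"] >= THRESHOLD).astype(int)

print(toy_df["label"].value_counts())
```

Checking `value_counts()` after labeling is a quick way to spot a threshold that leaves one class nearly empty.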

Then you train a model using those labels and try to write phishing email samples that successfully subvert detection, i.e., yield a "sub-threshold" classification.

Can't land on a good threshold? Try some empirical methods. Write email text that you _know_ should be classified as spam. Start your threshold high, then bring it down until you land on a threshold that correctly classifies your phishy email as spam.

Then, leverage the techniques above to tweak your email and subvert detection.