# CPSC 330 hw6

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score

# Add your imports below
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer
from sklearn.neighbors import NearestNeighbors

In [2]:
def cross_validate_std(*args, **kwargs):
    """Like cross_validate, except also gives the standard deviation of the score"""
    res = pd.DataFrame(cross_validate(*args, **kwargs))
    res_mean = res.mean()

    res_mean["std_test_score"] = res["test_score"].std()
    if "train_score" in res:
        res_mean["std_train_score"] = res["train_score"].std()
    return res_mean

## Instructions
rubric={points:5}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). 

## The dataset

In this assignment we'll look at the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset). The task is to predict whether a text message (SMS) is spam or not spam ("ham"). **Sorry for the offensive language in some text messages. If you are sensitive to such language you may wish to avoid reading the raw messages. I have attempted to design the assignment so that any messages you need to read are not disturbing ones.**

You should start by downloading the dataset and extracting the csv to your current directory. As usual, please do not commit it to your repos.

In [3]:
sms_df = pd.read_csv("spam.csv", encoding="latin-1")
sms_df = sms_df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})

In [4]:
df_train, df_test = train_test_split(sms_df, random_state=123)
df_train.head()

| | target | sms |
|---|---|---|
| 647 | spam | PRIVATE! Your 2003 Account Statement for shows... |
| 3843 | ham | Yeah that's what I thought, lemme know if anyt... |
| 3044 | ham | Hello, yeah i've just got out of the bath and ... |
| 2536 | ham | You do what all you like |
| 4644 | ham | Are you planning to come chennai? |


In [5]:
df_train.shape

(4179, 2)

In this first part of the assignment, we'll build a classification model to predict whether a message is spam or ham.

In [6]:
X_train = df_train["sms"]
y_train = df_train["target"]

X_test = df_test["sms"]
y_test = df_test["target"]

## Exercise 1
rubric={points:25}

- Use `CountVectorizer` to create features from the text data.
- Choose an appropriate baseline model (`DummyClassifier` or `DummyRegressor`) to predict spam vs. ham and report the relevant scores.
- Choose an appropriate linear model (`LogisticRegression` or `Ridge`) to predict spam vs. ham
- Choose an appropriate random forest model (`RandomForestClassifier` or `RandomForestRegressor`) to predict spam vs. ham
- Report the relevant scores for your two models above. You can keep default hyperparameters for simplicity.
- Report the most important features according to your linear model.

#### Answer:

In [7]:
# Helper functions to display results nicely

def cross_validate_metrics(model, X, y, num_folds=5):
    scorers = {'Accuracy': make_scorer(accuracy_score), 'Precision': make_scorer(precision_score, pos_label='spam'), 'Recall': make_scorer(recall_score, pos_label='spam'), 'F1': make_scorer(f1_score, pos_label='spam')}
    results = pd.DataFrame(columns=['Accuracy', 'Precision', 'Recall', 'F1'], index=['Mean Validation Score', 'Standard Deviation of Validation Score'])
    for scorer in scorers.keys():
        cv_scores = cross_validate_std(model, X, y, cv=num_folds, scoring=scorers[scorer])
        results.loc['Mean Validation Score', scorer] = cv_scores.loc['test_score']
        results.loc['Standard Deviation of Validation Score', scorer] = cv_scores.loc['std_test_score']
    return results

def score_with_metrics(model, X, y):
    scorers = {'Accuracy': make_scorer(accuracy_score), 'Precision': make_scorer(precision_score, pos_label='spam'), 'Recall': make_scorer(recall_score, pos_label='spam'), 'F1': make_scorer(f1_score, pos_label='spam')}
    results = pd.DataFrame(columns=['Accuracy', 'Precision', 'Recall', 'F1'], index=['Score'])
    for scorer in scorers.keys():
        results.loc['Score', scorer] = scorers[scorer](model, X, y)
    return results
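
The helpers above run a separate cross-validation for each metric. As a hedged alternative sketch, `cross_validate` also accepts a dict of scorers and reports all of them from a single set of fits. The toy messages, labels, and `cv=2` below are made up for illustration and are not the assignment data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for X_train / y_train.
toy_X = ["win a free prize now", "free cash claim now", "call to claim prize",
         "urgent free mobile offer", "are you coming home", "see you at lunch",
         "ok talk later", "what time is class"]
toy_y = ["spam"] * 4 + ["ham"] * 4

# zero_division=0 assumes scikit-learn 0.22 or newer.
scoring = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, pos_label="spam", zero_division=0),
    "recall": make_scorer(recall_score, pos_label="spam", zero_division=0),
    "f1": make_scorer(f1_score, pos_label="spam", zero_division=0),
}

pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
cv_out = pd.DataFrame(cross_validate(pipe, toy_X, toy_y, cv=2, scoring=scoring))

# One row per metric, with the mean and standard deviation across folds.
test_cols = [c for c in cv_out.columns if c.startswith("test_")]
summary = cv_out[test_cols].agg(["mean", "std"]).T
print(summary)
```

The result keys are named `test_<metric>`, so the same mean and standard deviation summary can be built without refitting the model once per metric.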

##### Baseline Model - `DummyClassifier`

In [8]:
# Note we do not compute additional metrics here because of 0 division error
dc = DummyClassifier(strategy='prior')
cv_results = cross_validate_std(dc, X_train, y_train)
dc.fit(X_train, y_train)
test_score = dc.score(X_test, y_test)
dc_results = pd.DataFrame(columns=['Mean Validation', 'Standard Deviation of Validation', 'Test'], index=['Score'])
print('Dummy Classifier Cross-validation Scores:')
dc_results['Mean Validation'] = cv_results['test_score']
dc_results['Standard Deviation of Validation'] = cv_results['std_test_score']
dc_results['Test'] = test_score
dc_results

Dummy Classifier Cross-validation Scores:


| | Mean Validation | Standard Deviation of Validation | Test |
|---|---|---|---|
| Score | 0.862168 | 0.000521 | 0.877243 |
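
The comment in the cell above refers to the fact that a `prior` baseline never predicts `spam` here, so precision for that class has no predicted positives to divide by. A minimal sketch of one way to still report those metrics, assuming scikit-learn 0.22 or newer (where the metric functions accept `zero_division`); the toy labels are invented:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score

# Hypothetical imbalanced labels: mostly ham, a little spam.
y_toy = np.array(["ham"] * 8 + ["spam"] * 2)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the features

dummy = DummyClassifier(strategy="prior").fit(X_toy, y_toy)
pred = dummy.predict(X_toy)  # always the majority class, "ham"

# No message is predicted as spam, so precision is 0/0; zero_division=0
# reports it as 0.0 instead of warning about an undefined value.
precision = precision_score(y_toy, pred, pos_label="spam", zero_division=0)
recall = recall_score(y_toy, pred, pos_label="spam", zero_division=0)
print(precision, recall)
```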


##### Linear Model - `LogisticRegression`

In [9]:
countvec = CountVectorizer(stop_words='english')
lr = LogisticRegression(max_iter=1000, random_state=123)
linear = Pipeline([('countvec', countvec), ('lr', lr)])
print('Linear Model Cross-validation Scores:')
cross_validate_metrics(linear, X_train, y_train)

Linear Model Cross-validation Scores:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Mean Validation Score | 0.978943 | 0.99802 | 0.848936 | 0.916639 |
| Standard Deviation of Validation Score | 0.00749731 | 0.00442786 | 0.0552707 | 0.0329232 |


In [10]:
linear.fit(X_train, y_train)
print('Linear Model Test Scores:')
score_with_metrics(linear, X_test, y_test)

Linear Model Test Scores:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Score | 0.984207 | 1 | 0.871345 | 0.93125 |


##### Random Forest Model - `RandomForestClassifier`

In [11]:
rf = RandomForestClassifier(random_state=123)
random_forest = Pipeline([('countvec', countvec), ('rf', rf)])
print('Random Forest Cross-validation Scores:')
cross_validate_metrics(random_forest, X_train, y_train)

Random Forest Cross-validation Scores:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Mean Validation Score | 0.974397 | 0.989754 | 0.822834 | 0.897606 |
| Standard Deviation of Validation Score | 0.00777636 | 0.00993107 | 0.0587536 | 0.0350101 |


In [12]:
random_forest.fit(X_train, y_train)
print('Random Forest Test Scores:')
score_with_metrics(random_forest, X_test, y_test)

Random Forest Test Scores:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Score | 0.972721 | 0.992593 | 0.783626 | 0.875817 |


##### Feature Importances from Linear Model

In [13]:
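# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
feature_importances = pd.DataFrame(linear.named_steps['lr'].coef_, columns=linear.named_steps['countvec'].get_feature_names_out()).T.rename(columns={0: 'Coefficient'})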
feature_importances = feature_importances.reindex(feature_importances.Coefficient.abs().sort_values(ascending=False).index)
print('10 Most Important Features (largest absolute value of coefficients, in order of most to least important):')
feature_importances.head(10)

10 Most Important Features (largest absolute value of coefficients, in order of most to least important):


| | Coefficient |
|---|---|
| uk | 2.21063 |
| service | 2.003192 |
| claim | 1.978341 |
| mobile | 1.951308 |
| txt | 1.86837 |
| new | 1.860522 |
| 50 | 1.804746 |
| 150p | 1.756085 |
| won | 1.68809 |
| message | 1.61317 |


## Exercise 2

Let's now try to use pre-trained word embeddings.

As we saw in class, using pre-trained word embeddings is very common in NLP. These embeddings are created by training a model like [Word2vec](https://en.wikipedia.org/wiki/Word2vec) on a huge corpus of text. In this exercise we will use a package called [spaCy](https://spacy.io/). Unfortunately, I didn't anticipate using spaCy at the start of the course, and thus it was not included in your course environment. You will need to install it now. Perform the following steps:

1. Open a terminal and activate your cpsc330env environment
2. Run `conda install spacy` if you're using conda and `cpsc330env.yml`, or `pip install spacy` if you're using pip and `requirements.txt`
3. Run `python -m spacy download en_core_web_md`

The last line downloads the trained language model itself, called [`en_core_web_md`](https://spacy.io/models/en#en_core_web_md). It is about 50 MB.

When you are done, the following lines of code should run:

In [14]:
import spacy
nlp = spacy.load("en_core_web_md")

If there are issues, please ask for help on Piazza.

#### 2(a)
rubric={points:5}

Our pre-trained `en_core_web_md` model gets us a vector representation of text:

In [15]:
X_train.iloc[0]

'PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203694 Identifier Code: 40533 Expires 31/10/04'

In [16]:
nlp(X_train.iloc[0]).vector

array([-6.31907508e-02,  1.87776491e-01, -3.50111164e-02, -1.40975965e-02,
        6.48319796e-02, -6.47449717e-02,  1.70375593e-02, -6.10386357e-02,
       -1.64511222e-02,  1.32269633e+00, -2.01264426e-01,  1.75327167e-01,
       -1.92099977e-02, -7.02716932e-02, -4.06776555e-02,  4.39935699e-02,
       -9.40892398e-02,  1.16417742e+00,  5.21034896e-02, -4.15729620e-02,
       -6.67720437e-02,  2.09751315e-02, -2.88010016e-02, -1.74403172e-02,
        7.59456381e-02,  1.59054156e-02, -1.19963408e-01, -4.00383957e-02,
        6.05901927e-02,  6.71593547e-02,  9.76373553e-02,  3.17107029e-02,
        3.25630940e-02,  1.04522884e-01, -4.10015620e-02, -1.43015515e-02,
       -5.64244464e-02,  8.83771945e-03, -5.06912246e-02, -5.84236719e-02,
       -7.24398345e-02, -6.05546385e-02,  1.38748854e-01, -1.05185993e-01,
       -1.15349621e-01,  8.07616413e-02, -3.67925540e-02,  1.33814156e-01,
        4.17499691e-02,  1.61221206e-01, -7.78133571e-02,  2.88000386e-02,
       -8.40148050e-03, -

This is analogous to calling `transform` with `CountVectorizer`.

Compare the _length of the representation_ for these embeddings vs. the `CountVectorizer` approach. Then, compare the _number of nonzero entries_ for the two representations of the first training example. Briefly discuss.

Note: As briefly discussed in Lecture 14, a common error here is that scikit-learn methods expect certain data shapes as their input. To address this you can use `X_train.iloc[[0]]` instead of `X_train.iloc[0]`.

#### Answer:

In [17]:
first_ex_spacy = nlp(X_train.iloc[0]).vector
first_ex_countvec = countvec.fit_transform(X_train).toarray()[0]
print('From the first training example:')
table = pd.DataFrame(data={'Nonzero Entries': [np.count_nonzero(first_ex_spacy), np.count_nonzero(first_ex_countvec)], 'Length of Representation': [len(first_ex_spacy), len(first_ex_countvec)]}, index=['Spacy', 'CountVectorizer'])
table

From the first training example:


| | Nonzero Entries | Length of Representation |
|---|---|---|
| Spacy | 300 | 300 |
| CountVectorizer | 16 | 7151 |


##### Discussion

The `spacy` approach produces a dense embedding: all 300 of its 300 entries are nonzero. The `CountVectorizer` approach produces a much longer but sparse representation, with only 16 nonzero entries out of 7151. I'm not sure which will be more effective, but the `CountVectorizer` representation is certainly more interpretable, because it is simply a count of each feature (word) in the example, whereas `spacy` is doing something more opaque.
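
As a rough illustration of the same comparison without converting the whole matrix to a dense array, the sketch below reads the length and nonzero count straight from the sparse row. The three messages are invented, and the 300-dimensional dense vector is a random stand-in, not real spaCy output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not the SMS data.
toy_corpus = ["free prize call now",
              "see you at lunch",
              "call me when you are home"]

vec = CountVectorizer()
counts = vec.fit_transform(toy_corpus)  # sparse matrix, shape (n_docs, n_vocab)

first_row = counts[0]                   # still sparse, shape (1, n_vocab)
print("count vector length:", first_row.shape[1], "nonzero:", first_row.nnz)

# Stand-in for a dense embedding (random values, NOT from en_core_web_md).
rng = np.random.default_rng(0)
dense_stand_in = rng.normal(size=300)
print("dense vector length:", dense_stand_in.size,
      "nonzero:", np.count_nonzero(dense_stand_in))
```

The sparse row only stores the words that actually occur in the message, which is why its nonzero count stays small even as the vocabulary grows.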

#### 2(b)
rubric={points:5}

In Exercise 1 you used `CountVectorizer` to generate features, which were then fed into a model. We can do the same here with the features from the pre-trained embedding model. 

In this case, for computational reasons I will first get the embeddings for the entire train and test sets (note that this doesn't violate the Golden Rule because the transformation is independent for each example):

In [18]:
X_train_embeddings = pd.DataFrame([sms.vector for sms in nlp.pipe(X_train)])
X_test_embeddings  = pd.DataFrame([sms.vector for sms in nlp.pipe(X_test)])

In [19]:
X_train_embeddings.shape

(4179, 300)
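
spaCy documents `Doc.vector` as defaulting to the average of the token vectors, which is why each message can be embedded without looking at any other message. A rough sketch of that idea with a made-up three-dimensional lookup table (`toy_vectors` and `embed` are hypothetical names, and the numbers are not from `en_core_web_md`):

```python
import numpy as np

# Hypothetical pre-trained word vectors.
toy_vectors = {
    "free":  np.array([1.0, 0.0, 0.0]),
    "prize": np.array([0.8, 0.2, 0.0]),
    "lunch": np.array([0.0, 1.0, 0.0]),
    "today": np.array([0.0, 0.6, 0.4]),
}

def embed(message):
    """Average the vectors of known tokens; return zeros if none are known."""
    vecs = [toy_vectors[t] for t in message.lower().split() if t in toy_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

train = ["free prize", "lunch today"]
test = ["free lunch"]

# The test message gets the same vector whether it is embedded on its own
# or in the same batch as the training messages: nothing is fit to the data.
alone = embed(test[0])
together = [embed(m) for m in train + test][-1]
print(alone, np.allclose(alone, together))
```

This contrasts with `CountVectorizer`, whose vocabulary is learned from whatever data it is fit on and therefore has to be fit on the training split only.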

What sort of scores can you get with these features instead? Compare with your scores from Exercise 1 and briefly discuss. Again, it's fine to stick to default hyperparameters to save time.

#### Answer:

Note: Dummy classifier was not necessary because it doesn't depend on the features, so the results from the embeddings will be the same as before.

##### Linear Model 

In [20]:
print('Scores from Spacy embeddings:')
cross_validate_metrics(lr, X_train_embeddings, y_train)

Scores from Spacy embeddings:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Mean Validation Score | 0.964106 | 0.900167 | 0.833238 | 0.864507 |
| Standard Deviation of Validation Score | 0.00224385 | 0.0229232 | 0.0403669 | 0.0127527 |


In [21]:
print('Scores from CountVectorizer embeddings:')
cross_validate_metrics(linear, X_train, y_train)

Scores from CountVectorizer embeddings:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Mean Validation Score | 0.978943 | 0.99802 | 0.848936 | 0.916639 |
| Standard Deviation of Validation Score | 0.00749731 | 0.00442786 | 0.0552707 | 0.0329232 |


##### Random Forest Model

In [22]:
print('Scores from Spacy embeddings:')
cross_validate_metrics(rf, X_train_embeddings, y_train)

Scores from Spacy embeddings:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Mean Validation Score | 0.971523 | 0.98334 | 0.807181 | 0.886063 |
| Standard Deviation of Validation Score | 0.00607234 | 0.0185804 | 0.0418557 | 0.026301 |


In [23]:
print('Scores from CountVectorizer embeddings:')
cross_validate_metrics(random_forest, X_train, y_train)

Scores from CountVectorizer embeddings:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Mean Validation Score | 0.974397 | 0.989754 | 0.822834 | 0.897606 |
| Standard Deviation of Validation Score | 0.00777636 | 0.00993107 | 0.0587536 | 0.0350101 |


##### Discussion
For both the linear and random forest models, the `CountVectorizer` features achieved similar but slightly better cross-validation scores than their `spacy` counterparts. The gap was largest for the linear model, where precision (0.998 vs. 0.900) and F1 (0.917 vs. 0.865) were substantially better with `CountVectorizer`; for the random forest, the `CountVectorizer` scores were still higher, but fairly close. We would need to look at the test data to ensure we haven't overfitted the validation set with either model.

#### 2(c)
rubric={points:1}

Score your models on the test data. Are the results what you expected?

#### Answer:

##### Linear Model

In [24]:
print('Scores from Spacy embeddings:')
lr.fit(X_train_embeddings, y_train)
score_with_metrics(lr, X_test_embeddings, y_test)

Scores from Spacy embeddings:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Score | 0.967696 | 0.89375 | 0.836257 | 0.864048 |


In [25]:
print('Scores from CountVectorizer embeddings:')
linear.fit(X_train, y_train)
score_with_metrics(linear, X_test, y_test)

Scores from CountVectorizer embeddings:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Score | 0.984207 | 1 | 0.871345 | 0.93125 |


##### Random Forest Model

In [26]:
print('Scores from Spacy embeddings:')
rf.fit(X_train_embeddings, y_train)
score_with_metrics(rf, X_test_embeddings, y_test)

Scores from Spacy embeddings:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Score | 0.974874 | 1 | 0.795322 | 0.885993 |


In [27]:
print('Scores from CountVectorizer embeddings:')
random_forest.fit(X_train, y_train)
score_with_metrics(random_forest, X_test, y_test)

Scores from CountVectorizer embeddings:


| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Score | 0.972721 | 0.992593 | 0.783626 | 0.875817 |


##### Discussion
As expected (given the previous section), the `CountVectorizer` features performed better with the linear model. With the random forest, however, the `spacy` embeddings scored slightly higher on every metric, though only by a small margin (for example, F1 of 0.886 vs. 0.876). The test scores are close to the cross-validation scores, which suggests we haven't overfitted the validation set.

## Exercise 3

Now we're done with trying to predict class (spam vs. ham). Our next task will be trying to find similar messages to a query message using nearest neighbours, like the product recommendations we discussed in Lecture 14.

#### 3(a) 
rubric={points:5}

Using scikit-learn's `NearestNeighbors` on the word count features from `CountVectorizer`, search the training data to find the 5 most similar messages to this made-up message:

In [28]:
query_sms = "Hey how about some CPSC 330 studying over Zoom or socially distanced at a park? This course is so much fun right?"

Use Euclidean distance and the same `CountVectorizer` you used in Exercise 1.

Note: The `kneighbors` function returns indices of the neighbours. To retrieve the corresponding messages, I recommend indexing using the `iloc` syntax.

Note: We don't exactly have a notion of train and test anymore, because we're not doing supervised learning anymore. I just picked the training set for simplicity.

#### Answer:

In [29]:
# Helper function to display neighbors nicely
pd.set_option('display.max_colwidth', None)
def show_neighbors(X, neighbors):
    results = pd.DataFrame(columns=neighbors[1][0], index=['Message'])
    for column in results.columns:
        results.loc['Message', column] = X.iloc[column]
    return results

In [30]:
countvec.fit(X_train)
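# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
X_train_countvec = pd.DataFrame(data=countvec.transform(X_train).toarray(), columns=countvec.get_feature_names_out(), index=X_train.index)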
query_sms_countvec = countvec.transform([query_sms])
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_train_countvec);

In [31]:
neighbors = nn.kneighbors(query_sms_countvec.toarray())
print("Closest neighbors, in order of similarity (left to right):")
show_neighbors(X_train, neighbors)

Closest neighbors, in order of similarity (left to right):


| | 154 | 1685 | 1684 | 58 | 1924 |
|---|---|---|---|---|---|
| Message | :) | You can never do NOTHING | Can a not? | Hey u still at the gym? | What about this one then. |


#### 3(b)
rubric={points:2}

Repeat part (a) but using cosine similarity instead of Euclidean distance. 

#### Answer:

In [32]:
nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(X_train_countvec);

In [33]:
neighbors = nn.kneighbors(query_sms_countvec.toarray())
print("Closest neighbors, in order of similarity (left to right):")
show_neighbors(X_train, neighbors)

Closest neighbors, in order of similarity (left to right):


| | 206 | 264 | 3326 | 58 | 1462 |
|---|---|---|---|---|---|
| Message | Your right! I'll make the appointment right now. | Of course. I guess god's just got me on hold right now. | \HEY KATE | Hey u still at the gym? | Lmao but its so fun... |


#### 3(c)
rubric={points:5}

In lecture we talked about how Euclidean distance resulted in less popular items being recommended than with cosine similarity. What is the analog of "popularity" here? Are your results from parts (a) and (b) consistent with this notion?

#### Answer:
Here, the analog of popularity is the number of words in the message. In the same way that popular products are more likely to be similar in some way to a query item, messages with more words are more likely to share some content with the query message. The results from (a) and (b) are somewhat consistent with this notion: on average, cosine similarity (b) returned longer messages than Euclidean distance (a).
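
The length effect can be sketched with a few invented count vectors over a four-word vocabulary (these are not taken from the SMS data). The long message has exactly the same word proportions as the query, while the short one shares no words with it.

```python
import numpy as np

# Hypothetical count vectors over a 4-word vocabulary.
query     = np.array([2, 2, 1, 0])
short_msg = np.array([0, 0, 0, 1])  # one word, nothing in common with the query
long_msg  = np.array([6, 6, 3, 0])  # same proportions as the query, three times longer

def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 minus cosine similarity, so 0 means the vectors point the same way."""
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

for name, msg in [("short", short_msg), ("long", long_msg)]:
    print(name,
          "euclidean:", round(euclidean(query, msg), 3),
          "cosine distance:", round(cosine_distance(query, msg), 3))
```

Euclidean distance ranks the unrelated one-word message as closer simply because its counts are small, whereas cosine distance ignores vector length and ranks the long message as the closer match.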

#### 3(d)
rubric={points:3}

Repeat parts (a) and (b) but this time with the pre-trained embeddings from Exercise 2.

#### Answer:

##### Euclidean Distance

In [34]:
query_sms_spacy = nlp(query_sms).vector
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_train_embeddings);

In [35]:
neighbors = nn.kneighbors([query_sms_spacy])
print("Closest neighbors, in order of similarity (left to right):")
show_neighbors(X_train, neighbors)

Closest neighbors, in order of similarity (left to right):


| | 1642 | 754 | 4107 | 1602 | 861 |
|---|---|---|---|---|---|
| Message | Have you heard about that job? I'm going to that wildlife talk again tonight if u want2come. Its that2worzels and a wizzle or whatever it is?! | Not a lot has happened here. Feels very quiet. Beth is at her aunts and charlie is working lots. Just me and helen in at the mo. How have you been? | Do whatever you want. You know what the rules are. We had a talk earlier this week about what had to start happening, you showing responsibility. Yet, every week it's can i bend the rule this way? What about that way? Do whatever. I'm tired of having thia same argument with you every week. And a &lt;#&gt; movie DOESNT inlude the previews. You're still getting in after 1. | Exactly. Anyways how far. Is jide her to study or just visiting | Oh ! A half hour is much longer in Syria than Canada, eh ? Wow you must get SO much more work done in a day than us with all that extra time ! \*grins\* |


##### Cosine Similarity

In [36]:
nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(X_train_embeddings);

In [37]:
neighbors = nn.kneighbors([query_sms_spacy])
print("Closest neighbors, in order of similarity (left to right):")
show_neighbors(X_train, neighbors)

Closest neighbors, in order of similarity (left to right):


| | 1642 | 754 | 1602 | 861 | 4107 |
|---|---|---|---|---|---|
| Message | Have you heard about that job? I'm going to that wildlife talk again tonight if u want2come. Its that2worzels and a wizzle or whatever it is?! | Not a lot has happened here. Feels very quiet. Beth is at her aunts and charlie is working lots. Just me and helen in at the mo. How have you been? | Exactly. Anyways how far. Is jide her to study or just visiting | Oh ! A half hour is much longer in Syria than Canada, eh ? Wow you must get SO much more work done in a day than us with all that extra time ! \*grins\* | Do whatever you want. You know what the rules are. We had a talk earlier this week about what had to start happening, you showing responsibility. Yet, every week it's can i bend the rule this way? What about that way? Do whatever. I'm tired of having thia same argument with you every week. And a &lt;#&gt; movie DOESNT inlude the previews. You're still getting in after 1. |


#### 3(e)
rubric={points:2}

Our first approach, using `CountVectorizer` features, should only retrieve similar messages if they have some words in common with the query message. Is this also true for the pre-trained embedding approach? Briefly discuss.

#### Answer:
This is not necessarily true of the pre-trained embeddings, because the features in each embedding vector are not word counts but opaque values produced by the `en_core_web_md` model, which aren't easy to interpret. Some of the similar messages found do share words with the query, but these embeddings might be capturing more general sentiment beyond simply the words present in the message.
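
A toy illustration of the point, using hand-written two-dimensional vectors that only stand in for real pre-trained embeddings (the words and numbers are hypothetical): two one-word messages with no words in common have zero similarity as count vectors, but can still be close in embedding space.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Count vectors over the vocabulary ["exam", "test", "pizza"].
count_exam = [1, 0, 0]  # message: "exam"
count_test = [0, 1, 0]  # message: "test"

# Hypothetical embeddings in which related words point in similar directions.
toy_embedding = {
    "exam":  [0.9, 0.1],
    "test":  [0.8, 0.2],
    "pizza": [0.0, 1.0],
}

count_sim = cosine_similarity(count_exam, count_test)
embed_sim = cosine_similarity(toy_embedding["exam"], toy_embedding["test"])
print("count-vector similarity:", count_sim)
print("embedding similarity:", round(embed_sim, 3))
```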

#### 3(f)
rubric={points:2}

In class we talked about how, when using pre-trained models, it's important that the original training data was somewhat similar to our own data. For example, in Lecture 13 we talked about how the dog breed images were fairly similar to ImageNet images. We're using the `en_core_web_md` pre-trained model from spaCy - its documentation is [here](https://spacy.io/models/en#en_core_web_md). Based on the documentation, it seems like the word vectors come from [Common Crawl](https://commoncrawl.org/). Do you think that training data is suitable for turning these SMS messages into feature vectors? Briefly discuss. (There is no single correct answer here!)

#### Answer:
I think it is fairly similar, and therefore useful, because the web contains a huge corpus of language spanning different levels of formality, topics, and linguistic complexity, and this could generalize well to text messages, which can use a variety of tones, topics, or sentiments. However, SMS data might tend to be shorter, more abbreviated, and more casual/profane than the average sample of text on the web, which might make `spacy`'s embeddings less useful. Ultimately, the trade-off is between better-trained embeddings that have access to a much larger but slightly different corpus, and feature embeddings that are fit to less data but are more similar to the scope of this problem.

## Exercise 4: Very short answer questions
rubric={points:8}

Each question is worth 2 points.

The first two questions pertain to the material we skipped in Lecture 12. A screencast is available on the course README; [here](https://www.dropbox.com/s/da7lx8kdzxfmna2/lecture12.mp4?dl=0) is the link for convenience.

1. Consider using `CountVectorizer(max_features=1000)` and then reducing to 500 features with `RFE(n_features_to_select=500)` vs. using `CountVectorizer(max_features=500)`. Are these two approaches the same? If not, how are they different?
2. After running feature selection with `RFE`, `rfe.ranking_` tells you the order in which the features were removed. Why could this order be different from the order of the original feature importances, ranked from least to most important?
3. In Lecture 13 we discussed how neural networks are sort of like `Pipeline`s, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
4. In Lecture 15 we saw our pre-trained word embedding model output an analogy that reinforced a gender stereotype. Give an example of how using such a model could cause harm in the real world.

#### Answer:
1. These approaches are different. From the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), `max_features=500` simply keeps the 500 most frequent features (which might not be the most important), whereas `RFE` starts with twice as many features (1000) and recursively eliminates the least important ones, recomputing the importances at each iteration, until 500 remain.
2. As per lecture 12, "A feature's relevance can only be defined in the context of other features; adding/removing features can make features relevant/irrelevant". Certain features might only be useful in combination with others, so if two features that are only useful together have some importance, eliminating one might make the other (mostly) useless.
3. It's useful because transfer learning involves taking pre-trained transformation steps and putting our own model (classifier/regressor) on top of them. Therefore, we can use the pre-trained transformation steps of a neural network that has seen far more data than our own computers can handle, and apply them to similar data before putting our own model on top.
4. For example, using a model like that as a job recommendation system for the unemployed might push women towards stereotypically feminine roles like nursing, whereas it might push men towards male-dominated fields like programming, reinforcing stereotypical gender norms.

## Submission to Canvas

**IF YOU ARE WORKING WITH A PARTNER** please form the group before submitting - see instructions [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#partners).

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.
2. Save your notebook.
3. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.
4. Run the code `submit()` below to go through an interactive submission process to Canvas.
>For this step, you will need a Canvas *Access Token*. If you haven't already got one, log in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password. 

Note: for those having trouble with the Jupyter widgets and the dropdowns: if you add the argument `no_widgets=True` to your `submit` call, it should let you do a text-based entry of your key and avoid the dropdowns altogether. If this doesn't work, you probably need to upgrade to the latest version of `canvasutils` with `pip install canvasutils -U` from your terminal with your environment activated.


In [38]:
from canvasutils.submit import submit, convert_notebook

# Note: the canvasutils package should have been installed as part of your environment setup - 
# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md

In [41]:
# convert_notebook("hw6.ipynb", "html")  # uncomment and run when you want to convert your notebook to HTML (or you can convert manually from the File menu)

Notebook successfully converted! 


In [42]:
# submit(course_code=53561, token=False)  # uncomment and run when ready to submit 
