# Homework 2 Feedback

### Use a Word Count to Check for Additional Stopwords

Start with a word count to find the highest-frequency words, then use your domain knowledge to decide which of them to add to the stopword list. For instance, if you look carefully, many of the reviews contain `<br>` or `<br/>`, the HTML tag for a line break. It should be removed entirely, and you will catch it if you remove all non-alphanumeric characters, tokenize, and perform a word count: `br` shows up as one of the top recurring words.
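
As a minimal sketch of that word count (the `reviews` list and the `top_words` helper are made up for illustration, not taken from any submission):

```python
import re
from collections import Counter

# Toy reviews standing in for the real dataset (hypothetical examples).
reviews = [
    "Great toy!<br/>My son loves it.",
    "Broke after a week.<br><br>Would not buy this toy again.",
    "The toy is fine, but shipping was slow.<br/>",
]

def top_words(texts, n=5):
    """Return the n most common lowercase alphanumeric tokens."""
    counts = Counter()
    for text in texts:
        # Replace every run of non-alphanumeric characters with a space, then split.
        cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
        counts.update(cleaned.split())
    return counts.most_common(n)

# 'br' comes out on top (4 occurrences), ahead of 'toy' (3).
print(top_words(reviews, 2))
```

Scanning the top of this list by eye is usually enough to spot markup remnants like `br`.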

<div class="alert-success">
You should justify why each extra stopword was chosen.</div>

### Expect to See Num Features with Stemming < Num Features with Lemmatization

In the second part of the homework, we asked you to count the number of features you see with different text preprocessing techniques. You should see fewer features after stemming than after lemmatizing, because stemming collapses word forms more aggressively.

Some students submitted notebooks where the number of features with stemming was greater than with lemmatization, or the number of features with lemmatization + stopword removal was greater than with lemmatization alone. Both indicate a bug in the code.
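
One way to catch this early is to compare vocabulary sizes directly. The sketch below is an illustration only: `toy_stem` and `toy_lemmatize` are crude, hypothetical stand-ins for a real stemmer and lemmatizer (e.g. from NLTK), and the check encodes the ordering in this section's heading (stemming ≤ lemmatization).

```python
from typing import Callable, Iterable

def vocabulary_size(docs: Iterable[str], normalize: Callable[[str], str]) -> int:
    """Count distinct tokens after applying `normalize` to each whitespace token."""
    return len({normalize(token) for doc in docs for token in doc.lower().split()})

def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a stemmer: crudely strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Hypothetical stand-in for a lemmatizer: only known inflected forms are mapped.
TOY_LEMMAS = {"toys": "toy", "played": "play"}

def toy_lemmatize(token: str) -> str:
    return TOY_LEMMAS.get(token, token)

docs = ["toys played playing", "toy play plays"]

n_raw = vocabulary_size(docs, lambda token: token)
n_lemma = vocabulary_size(docs, toy_lemmatize)
n_stem = vocabulary_size(docs, toy_stem)

# If this fails on your real pipeline, suspect a preprocessing bug.
assert n_stem <= n_lemma <= n_raw, (n_stem, n_lemma, n_raw)
print(n_raw, n_lemma, n_stem)
```

In a real notebook, the same assertion can be run on the feature counts reported by each vectorizer.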

### Don't Set Too High A Minimum DF Threshold for `CountVectorizer`

For the first part of HW2, we saw many instances of `CountVectorizer` being initialized like this:
```python
vectorizer = CountVectorizer(min_df=0.03)
```
Here are some issues with this:
* `min_df=0.03` is quite a high bar: the word needs to appear in at least 3% of all reviews. Given that the entire dataset (good + poor reviews) is 115k reviews, the word needs to appear in 3,450 reviews or more to make it as a final feature. A word that common has a very good chance of being a stopword! Since the product reviews cover a wide variety of toy products, you should inspect the final features you get and sanity-check that they seem useful. You can also set `max_df=UPPER_THRESHOLD` to drop stopwords that appear in too many documents.
* Ideally, we should use `binary=True` to offset the possibility that someone writes the same word over and over again in the same review.
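
To make the arithmetic concrete, here is a small standard-library sketch. The toy `reviews` and the `filter_by_df` helper are hypothetical; they only mimic the idea behind `min_df`/`max_df` (thresholds on the share of documents containing a word) and are not scikit-learn's implementation.

```python
from collections import Counter

n_reviews = 115_000   # dataset size quoted in the feedback
hw_min_df = 0.03
# Number of reviews a word must appear in to survive min_df=0.03.
min_docs = round(hw_min_df * n_reviews)
print(min_docs)  # 3450

reviews = [
    "fun fun fun fun toy",
    "toy broke",
    "great gift",
    "great toy",
]

term_counts = Counter()  # total occurrences across all reviews
doc_counts = Counter()   # number of reviews containing the word
for review in reviews:
    tokens = review.split()
    term_counts.update(tokens)
    doc_counts.update(set(tokens))  # each review counts a word at most once

def filter_by_df(doc_counts, n_docs, min_df, max_df):
    """Keep words whose document share lies within [min_df, max_df]."""
    return sorted(
        word for word, df in doc_counts.items()
        if min_df <= df / n_docs <= max_df
    )

# 'fun' occurs 4 times but in only 1 review, so it has a low document frequency.
print(term_counts["fun"], doc_counts["fun"])
print(filter_by_df(doc_counts, len(reviews), min_df=0.5, max_df=0.7))
```

Note the two separate effects: the document-frequency thresholds decide which words become features, while `binary=True` clips each review's count for a word to 1, so the repeated `fun` would contribute 1 rather than 4.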

<div class="alert-success">
Given how varied the language in the reviews is, `min_df=0.03` is quite a high bar; `0.01` would be a better choice.</div>

### Save Yourself Work By Using `CountVectorizer`'s Tokenizer

Many students wrote their own functions to remove punctuation, or digits, and only keep alphabetical characters `A-Z`.

However, you could do all of that, plus remove stopwords, in a single call:
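```python
from sklearn.feature_extraction.text import CountVectorizer

# The parameter is `stop_words`; the raw string keeps `\b` a regex word boundary.
vectorizer = CountVectorizer(stop_words=my_custom_stopwords,
                             token_pattern=r'\b[a-zA-Z]{2,}\b',
                             min_df=0.01)
```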
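
To see what such a token pattern keeps, you can try it with plain `re.findall` on a made-up review (the sample `text` below is hypothetical). The sketch also shows why the `r` prefix matters: in a regular Python string, `\b` is a backspace character rather than a word boundary, so the pattern silently matches nothing.

```python
import re

text = "Loved it!!<br/>Cost $10, arrived in 2 days. A+ toy"

good_pattern = r"\b[a-zA-Z]{2,}\b"  # raw string: \b is a word boundary
bad_pattern = "\b[a-zA-Z]{2,}\b"    # plain string: \b is a backspace character

good_tokens = re.findall(good_pattern, text.lower())
bad_tokens = re.findall(bad_pattern, text.lower())

# Digits, punctuation, and single letters are dropped; 'br' survives,
# which is why it still needs to go into the stopword list.
print(good_tokens)
print(bad_tokens)  # empty list
```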

### Sanity Check Your Final Features

Some students' final feature lists looked like this:
```
'aa'  'aaa'  'aaaa' ...
```
These should ultimately be grouped together using regex, since they are essentially the same token. Other lists contained
```
034    077    099     9   91 ...
```
Check how `034` and `077` are being used in the reviews themselves. Often, such numbers are part of an amount of money or time. If that is the case, you should ideally write a regex pattern to group together all tokens that represent money or time (i.e., replace them with `_MONEY_` and `_TIME_`).

Some students instead handled such tokens by adding them to the stopword list one at a time:
```python
# Appending stop words that appeared in final corpus that did not contain meaning.
stop.append('10')
stop.append('34')
stop.append('wa')
stop.append('br')
stop.append('ca')
stop.append('ha')
```
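
The regex-based grouping suggested earlier in this section could look roughly like the sketch below. All three patterns and the `group_tokens` helper are hypothetical starting points; adjust them after inspecting how numbers actually appear in the reviews.

```python
import re

# Hypothetical patterns: "$9.99" or "20 dollars", "2 days" or "3 hrs", "aa"/"aaa"/...
MONEY_PATTERN = re.compile(
    r"\$\s?\d+(?:\.\d{1,2})?|\b\d+(?:\.\d{1,2})?\s?(?:dollars?|bucks?)\b"
)
TIME_PATTERN = re.compile(
    r"\b\d+\s?(?:minutes?|mins?|hours?|hrs?|days?|weeks?|months?|years?)\b"
)
REPEATED_A_PATTERN = re.compile(r"\ba{2,}\b")

def group_tokens(text: str) -> str:
    """Replace money and time expressions with placeholders and collapse 'aaa' to 'aa'."""
    text = text.lower()
    text = MONEY_PATTERN.sub("_MONEY_", text)
    text = TIME_PATTERN.sub("_TIME_", text)
    return REPEATED_A_PATTERN.sub("aa", text)

print(group_tokens("Paid $9.99 and it broke in 2 days. Needs 4 AAA batteries, not AA."))
print(group_tokens("20 dollars for 3 hrs of fun"))
```

Run this kind of grouping before vectorizing, so that the placeholders are counted as single features instead of many distinct numbers.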

### Don't Repeat Yourself In Regex

People wrote regex patterns to group together words they found that expressed similar meanings. For example,

```python
christmas_pattern = r'\b(christ(?:-?)mas)\b|\b((?:x|x{3})(?:\s|-)mas)\b'
```
However, there are several ways to improve this:
* There is no need to write `\b` multiple times.
* The alternatives for `christ` and `x` can be combined into one group.

A shortened version that matches everything the original does (and additionally accepts `xmas` and `christ mas`) is:
```python
christmas_pattern = r'\b(?:christ|(?:x|x{3}))[-\s]?mas\b'
```
See the [RegExr sandbox](https://regexr.com/6isuh).
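
If you prefer to check this in Python rather than in a sandbox, a quick comparison of the two patterns on a few hand-picked strings (the `samples` list is made up for illustration, and assumes the text has already been lowercased) might look like this:

```python
import re

original = re.compile(r'\b(christ(?:-?)mas)\b|\b((?:x|x{3})(?:\s|-)mas)\b')
shortened = re.compile(r'\b(?:christ|(?:x|x{3}))[-\s]?mas\b')

samples = ["christmas", "christ-mas", "x-mas", "xxx mas", "xmas", "christ mas", "mas"]

# Map each sample to (matched by original, matched by shortened).
results = {
    s: (bool(original.search(s)), bool(shortened.search(s)))
    for s in samples
}
for sample, (in_original, in_shortened) in results.items():
    print(f"{sample!r:14} original={in_original} shortened={in_shortened}")
```

The two patterns agree on every sample except `xmas` and `christ mas`, which only the shortened pattern accepts, because its separator is optional for both prefixes.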

### Nitpick: Use Type Hints and Docstrings to Improve Your Code's Readability

* [Type Hints in Python](https://towardsdatascience.com/type-hints-in-python-everything-you-need-to-know-in-5-minutes-24e0bad06d0b)
* [Docstrings](https://www.programiz.com/python-programming/docstrings)

<div class="alert-success">
Both increase the readability of your code.</div>
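
As a small illustration of both ideas, here is a hypothetical helper (`count_top_words` is not from any submission) written with type hints and a docstring:

```python
import re
from collections import Counter
from typing import List, Tuple

def count_top_words(reviews: List[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the `n` most frequent alphabetic tokens across all reviews.

    Args:
        reviews: Raw review texts.
        n: Number of (word, count) pairs to return.

    Returns:
        A list of (word, count) tuples, most frequent first.
    """
    counts: Counter = Counter()
    for review in reviews:
        # Lowercase first so "Fun" and "fun" are counted together.
        counts.update(re.findall(r"[a-z]+", review.lower()))
    return counts.most_common(n)

print(count_top_words(["Fun toy", "fun gift"], n=1))
```

A reader can tell what goes in and what comes out from the signature alone, without tracing through the body.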

### Nitpick: Use Snake Case in Python

Some students name variables like `pandasDf` or functions like `def countVectorize(...):`. This is not wrong, but Python code conventionally uses [snake case](https://en.wikipedia.org/wiki/Snake_case) (e.g. `pandas_df`, `count_vectorize`), as recommended by PEP 8, instead of [camel case](https://en.wikipedia.org/wiki/Camel_case), which is more prevalent in languages like Java or JavaScript.