# Problem set 1

This problem set will combine skills from a couple of different course units:

- **Object-oriented programming**: building classes in Python.  
- **Natural Language Processing**: a simple sentiment analyzer in Python.  
- **Working with data**: review of working with `.txt` files in Python.

At a high level, your goal will be to create a `SentimentAnalyzer` class, which can produce a *sentiment score* for a piece of text.

In [1]:
import pandas as pd

### Q1. Create a `TextProcessor` class

Create a class called `TextProcessor`, which has the following `self` attributes:

- `chars_to_remove`: a `list` of punctuation and other characters to remove.
- `separator`: the character to tokenize/`split` on.  
- `lower`: a `bool` representing `True` or `False`.  

And the following methods:

- `__init__`: constructor for `TextProcessor`.  
- `tokenize_text`: an instance method that takes in a `text` as argument and...
   - ...removes all characters from `self.chars_to_remove`...
   - ...if `self.lower`, makes characters lowercase...
   - ...`split`s on `self.separator` and...
   - ... returns a list of tokens from `text`.

See the `assert` statements for examples, and also this example here:

```python
>>> tp = TextProcessor(chars_to_remove = ['.', ',', '?', '!'], separator = " ", lower = True)
>>> tp.tokenize_text(text = "The sky is blue")
['the', 'sky', 'is', 'blue']
```
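
The steps listed above can also be sketched by hand with plain string methods, outside of any class. This is only an illustration of the order of operations, not the required implementation; the names `raw` and `cleaned` are made up for this sketch.

```python
# Illustrative sketch: the tokenization steps applied by hand.
# `raw`, `chars_to_remove`, and `cleaned` are hypothetical names.
raw = "Hello, world! How are you?"
chars_to_remove = ['.', ',', '?', '!']

cleaned = raw.lower()                  # lowercase first
for char in chars_to_remove:           # then strip each unwanted character
    cleaned = cleaned.replace(char, "")

tokens = cleaned.split(" ")            # finally split on the separator
print(tokens)  # ['hello', 'world', 'how', 'are', 'you']
```

One thing to keep in mind: splitting on a single space keeps empty strings when the text contains consecutive spaces, so `"a  b".split(" ")` gives `['a', '', 'b']`.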

In [2]:
### BEGIN SOLUTION
class TextProcessor:
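    def __init__(self, chars_to_remove, separator, lower):
        """
        Constructor for TextProcessor.
        :param chars_to_remove: List of characters to remove from the text.
        :param separator: Character to split the text on.
        :param lower: If True, lowercase the text before tokenizing.
        """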
        self.chars_to_remove = chars_to_remove
        self.separator = separator
        self.lower = lower

    def tokenize_text(self, text):
        """
        Tokenizes the input text based on the specified separator and removes specified characters.
        :param text: String to tokenize.
        :return: List of tokens.
        """
        # Lowercase the text if requested
        if self.lower:
            text = text.lower()
        # Remove specified characters
        for char in self.chars_to_remove:
            text = text.replace(char, "")
        # Tokenize the text
        tokens = text.split(self.separator)
        return tokens

### END SOLUTION

In [3]:
### Testing constructor
tp = TextProcessor(chars_to_remove = ['.', ',', '?', '!'], separator = " ", lower = True)
assert tp.chars_to_remove == ['.', ',', '?', '!']
assert tp.separator == " "

In [4]:
## This cell contains at least one hidden test
### Testing tokenizer
s1 = "The quick brown fox jumped over the lazy dog"
assert tp.tokenize_text(s1) == ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

In [5]:
s2 = "This example has punctuation. Hopefully the tokenizer removes it!"
assert tp.tokenize_text(s2) == ['this',
 'example',
 'has',
 'punctuation',
 'hopefully',
 'the',
 'tokenizer',
 'removes',
 'it']

In [6]:
s3 = "THIS IS ALL CAPS!"
assert tp.tokenize_text(s3) == ['this', 'is', 'all', 'caps']

### Q2. Create a `SentimentAnalyzer` class

Now comes the slightly harder part. Create a class called `SentimentAnalyzer`, which takes in the following attributes:

- `pos_words`: a `list` of positive words.  
- `neg_words`: a `list` of negative words.  
- `processor`: a `TextProcessor` instance.  

`SentimentAnalyzer` should also have the following *methods*:

- `get_sentimental_words`: takes in a `text` as argument, tokenizes `text` with `processor`, and returns a dictionary with the keys `pos_words`, `neg_words`, and `total_words`, giving the number of positive words, negative words, and total words (including neutral words).
- `score_sentiment`: takes in a `text` as argument and returns the weighted average sentiment computed from the dictionary returned by `get_sentimental_words`, i.e., `((1 * num_pos_words) + (-1 * num_neg_words)) / num_total_words`.
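
As a worked example of the formula, take a text with 2 positive words, 0 negative words, and 7 tokens in total (the counts the tests below expect for "I LOVE that movie, it is great."). The variable names simply mirror the formula and are only for illustration.

```python
# Worked example of the scoring formula with hand-picked counts.
num_pos_words = 2
num_neg_words = 0
num_total_words = 7

score = ((1 * num_pos_words) + (-1 * num_neg_words)) / num_total_words
print(round(score, 3))  # 0.286
```

Because the positive and negative counts can never exceed the total, the score always falls between -1 and 1.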


**Note**: I know this question is more complicated. Fortunately, I've written the tests to deal with each part one at a time, so you can try to implement it piecemeal.

In [7]:
### BEGIN SOLUTION
class SentimentAnalyzer:
    def __init__(self, pos_words, neg_words, processor):
        """
        Constructor for SentimentAnalyzer.
        :param pos_words: List of positive words.
        :param neg_words: List of negative words.
        :param processor: TextProcessor instance for text tokenization.
        """
        self.pos_words = pos_words
        self.neg_words = neg_words
        self.processor = processor

    def get_sentimental_words(self, text):
        """
        Tokenizes the input text and counts positive, negative, and total words.
        :param text: String to analyze.
        :return: Dictionary with counts of positive, negative, and total words.
        """
        tokens = self.processor.tokenize_text(text)
        pos_count = sum(token in self.pos_words for token in tokens)
        neg_count = sum(token in self.neg_words for token in tokens)
        total_count = len(tokens)
        return {"pos_words": pos_count, "neg_words": neg_count, "total_words": total_count}

    def score_sentiment(self, text):
        """
        Calculates the weighted average sentiment of the text.
        :param text: String to analyze.
        :return: Weighted average sentiment score.
        """
        word_counts = self.get_sentimental_words(text)
        # Avoid division by zero
        if word_counts["total_words"] == 0:
            return 0
        score = (word_counts["pos_words"] - word_counts["neg_words"]) / word_counts["total_words"]
        return score
### END SOLUTION

In [8]:
## initializing params
tp = TextProcessor(chars_to_remove=['.', ',', '!', '?'], separator = " ", lower = True)
pos_words = ['love', 'great', 'best', 'good', 'happy']
neg_words = ['hate', 'terrible', 'worst', 'bad', 'sad']

sa = SentimentAnalyzer(pos_words = pos_words, 
                       neg_words = neg_words, 
                       processor=tp)

### Checking that constructor works
assert sa.pos_words == pos_words
assert sa.neg_words == neg_words
assert sa.processor.tokenize_text("This is a test") == ["this", "is", "a", "test"]

In [9]:
### Checking that get_sentimental_words works
s1 = "I LOVE that movie, it is great."
s2 = "I hate that movie, it is terrible."
s3 = "I did not love that movie, it was not good"

assert sa.get_sentimental_words(s1) == {'pos_words': 2, 'neg_words': 0, 'total_words': 7}
assert sa.get_sentimental_words(s2) == {'pos_words': 0, 'neg_words': 2, 'total_words': 7}
assert sa.get_sentimental_words(s3) == {'pos_words': 2, 'neg_words': 0, 'total_words': 10}

In [10]:
### Checking score_sentiment
s1 = "I LOVE that movie, it is great."
s2 = "I hate that movie, it is terrible."
s3 = "I did not love that movie, it was not good"

assert sa.score_sentiment(s1) > 0
assert sa.score_sentiment(s2) < 0
assert sa.score_sentiment(s3) > 0

### Q3. Read in `pos_words.txt` and `neg_words.txt` and create a new `SentimentAnalyzer`

Read in the two `.txt` files located in the `data` directory. In each case, `split` the contents of the files on the *newline* character and set the result to a variable called `pos_words` or `neg_words`.

Now that you have a more sophisticated sentiment lexicon, create a new `SentimentAnalyzer` object with those lists. Call it `analyzer`.
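
Here is a minimal sketch of reading a newline-delimited word list. It writes a temporary file first so that it runs on its own; the file name and its contents are invented for illustration and are not the course data.

```python
import os
import tempfile

# Write a tiny, made-up word list to a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_words.txt")
    with open(path, "w") as f:
        f.write("good\ngreat\nhappy")

    # Read the whole file and split it on the newline character.
    with open(path, "r") as f:
        toy_words = f.read().split("\n")

print(toy_words)  # ['good', 'great', 'happy']
```

Note that if a file ends with a trailing newline, `split("\n")` produces a final empty string: `"a\nb\n".split("\n")` gives `['a', 'b', '']`.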

In [11]:
### BEGIN SOLUTION
with open("data/pos_words.txt", "r") as f:
    pos_words = f.read().split("\n")
    
with open("data/neg_words.txt", "r") as f:
    neg_words = f.read().split("\n")
    
analyzer = SentimentAnalyzer(pos_words, neg_words, tp)
### END SOLUTION

In [12]:
assert len(pos_words) == 2006
assert len(neg_words) == 4783

In [13]:
assert len(analyzer.pos_words) == 2006
assert len(analyzer.neg_words) == 4783

### Q4. Load `restaurant_reviews.csv`

Now, read in `data/restaurant_reviews.csv`, which contains a bunch of reviews for different restaurants. Call the result `df_scores`.

In [14]:
### BEGIN SOLUTION
df_scores = pd.read_csv("data/restaurant_reviews.csv")
### END SOLUTION

In [15]:
assert type(df_scores) == pd.DataFrame
assert len(df_scores) == 1000

### Q5. Apply `score_sentiment` to each review

Now, use the `apply` function from `pandas` to apply `score_sentiment` to each review in the `Review` column, and set the result to a new column called `sentiment_score`. As a hint, you can use `apply` in the following way:

```python
df['new_col_name'] = df['col_to_apply'].apply(lambda x: function_to_apply(x))
```
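
To make the pattern concrete, here is a small sketch on toy data. The DataFrame `df_toy` and the helper `count_words` are hypothetical stand-ins, not part of the assignment.

```python
import pandas as pd

# Toy data and a stand-in function (both hypothetical).
df_toy = pd.DataFrame({"Review": ["good food", "bad service", "okay"]})

def count_words(text):
    """Return the number of space-separated tokens in `text`."""
    return len(text.split(" "))

# Apply the function to every value in the column.
df_toy["n_words"] = df_toy["Review"].apply(lambda x: count_words(x))
print(df_toy["n_words"].tolist())  # [2, 2, 1]
```

Passing the function directly, as in `df_toy["Review"].apply(count_words)`, gives the same result.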

In [16]:
### BEGIN SOLUTION
df_scores['sentiment_score'] = df_scores['Review'].apply(lambda x: analyzer.score_sentiment(x))
### END SOLUTION

In [17]:
assert 'sentiment_score' in df_scores.columns

In [18]:
assert df_scores['sentiment_score'].max() == 1
assert df_scores['sentiment_score'].min() == -.5

### Q6. Calculate the mean sentiment per `Liked`

Finally, use `groupby` and `mean` to calculate the mean sentiment across categories of the `Liked` column. Call the result `df_grouped`. 

Is the scored sentiment higher when `Liked==1`?
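
As a rough sketch of what the grouped result looks like, here is the same kind of operation on made-up numbers. The names `df_toy` and `toy_grouped` are hypothetical, and the values are not from the review data.

```python
import pandas as pd

# Made-up scores and labels, only to show the shape of the result.
df_toy = pd.DataFrame({
    "Liked": [1, 0, 1, 0],
    "sentiment_score": [0.5, -0.25, 0.25, 0.0],
})

# Select the column to average, then take the mean within each group.
toy_grouped = df_toy.groupby("Liked")[["sentiment_score"]].mean()
print(toy_grouped)
```

The result is indexed by the values of `Liked`, so `toy_grouped.loc[1, "sentiment_score"]` retrieves the mean for the rows where `Liked == 1`.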

In [19]:
### BEGIN SOLUTION
df_grouped = df_scores.groupby("Liked")[["sentiment_score"]].mean()
### END SOLUTION

In [20]:
assert df_grouped['sentiment_score'][0] < 0
assert df_grouped['sentiment_score'][1] > 0

## Submit!

Congratulations! You've now built an entire *class* that can analyze the sentiment of arbitrary text passages. This is a real achievement, and now you can say you built a `SentimentAnalyzer` in Python.