1. A manager in the marketing department wants to find out the most frequently used words in positive reviews (five stars) and negative reviews (one star) in order to determine what occasion the toys are purchased for (Christmas, birthdays, and anniversaries). He would like your opinion on **which gift occasions (Christmas, birthdays, or anniversaries) tend to have the most positive reviews** so that the marketing budget can be focused on those days.

In [6]:
import re

# Read each review file in full; the context manager closes the handle afterwards.
with open('good_amazon_toy_reviews.txt', 'r') as good_file:
    goods: str = good_file.read()

with open('poor_amazon_toy_reviews.txt', 'r') as poor_file:
    poors: str = poor_file.read()

In [125]:
# define patterns 

patterns = [
    r'\b[cC]+hristmas|[hH]oliday\b',
    r'\b([bB]+irthday|[bB]+day|[bB]+irth)\b',
    r'\b[aA]+nniversar+[a-zA-Z]\b'
]


In [126]:
# Method 1
# count the number of good reviews that contain each keyword

reviews = goods.split("\n")
print(f'Number of good reviews: {len(reviews)}')

for i in patterns:
    count = 0
    for review in reviews:
        if len(re.findall(i, review)) > 0:
            count+=1
    #print(count)
    print(f'There are {count} good reviews containing word {i[2:-2]}')



Number of good reviews: 102218
There are 1176 good reviews containing word [cC]+hristmas|[hH]oliday
There are 3911 good reviews containing word ([bB]+irthday|[bB]+day|[bB]+irth)
There are 50 good reviews containing word [aA]+nniversar+[a-zA-Z]


In [127]:
# Method 2
# count how many times the keyword is used in text

for i in patterns:
    print(f'Word {i[2:-2]} was used {len(re.findall(i, goods))} times in good reviews.')

Word [cC]+hristmas|[hH]oliday was used 1281 times in good reviews.
Word ([bB]+irthday|[bB]+day|[bB]+irth) was used 4109 times in good reviews.
Word [aA]+nniversar+[a-zA-Z] was used 52 times in good reviews.


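Both methods show that birthdays are mentioned far more often in good reviews than Christmas/holidays or anniversaries (3,911 reviews and 4,109 mentions), so birthdays are the occasion to focus the budget on.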
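One caveat about the patterns above: `|` has the lowest precedence in a regular expression, so in `\b[cC]+hristmas|[hH]oliday\b` the leading `\b` applies only to the first alternative and the trailing `\b` only to the second. The anniversary pattern also requires exactly one letter after `anniversar`, so it matches "anniversary" but not "anniversaries". A minimal sketch of tighter patterns follows, assuming `re.IGNORECASE` is acceptable. The names `OCCASION_PATTERNS` and `count_reviews_matching` are hypothetical, and the sample reviews are made up rather than taken from the real files.

```python
import re

# Hypothetical, tightened occasion patterns: the non-capturing group keeps \b
# attached to every alternative, and optional suffixes cover plurals.
OCCASION_PATTERNS = {
    "christmas": re.compile(r"\b(?:christmas|holidays?)\b", re.IGNORECASE),
    "birthday": re.compile(r"\b(?:birthdays?|bdays?)\b", re.IGNORECASE),
    "anniversary": re.compile(r"\banniversar(?:y|ies)\b", re.IGNORECASE),
}


def count_reviews_matching(reviews, pattern):
    """Return how many reviews contain at least one match of pattern."""
    return sum(1 for review in reviews if pattern.search(review))


# Made-up sample data, standing in for goods.split("\n").
sample_reviews = [
    "Bought this for Christmas and my kids loved it",
    "A great birthday gift, second Birthday in a row!",
    "Perfect for our anniversaries",
    "Holiday shopping done early",
    "No occasion mentioned here",
]

occasion_counts = {
    name: count_reviews_matching(sample_reviews, pattern)
    for name, pattern in OCCASION_PATTERNS.items()
}
print(occasion_counts)
```

Running the same helper over the real review lists would give per-review counts comparable to Method 1, although the numbers would likely differ from the recorded output because the patterns are stricter.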

2. There are malformed characters in the review text. For instance, notice the `&#34;` - these are examples of incorrectly decoded [HTML encodings](https://krypted.com/utilities/html-encoding-reference/).
```
"amazing quality first of all, these cards are amazing proxies (but don't try to use em in &#34;official duels&#34; unless a judge is okay with it, if you have the real thing to show) and look amazing in your binder!"
```
Please clean up all instances of these incorrect decodings.

In [128]:
# Match whole numeric entities such as &#34; (all digits plus the trailing semicolon)
goods_cleaned = re.sub(r'&#[0-9]+;', '', goods)
poors_cleaned = re.sub(r'&#[0-9]+;', '', poors)
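Deleting an entity also drops the character it stood for (here, a double quote). If the goal is to recover the intended text instead, a minimal sketch using the standard library's `html.unescape` could look like this. The sample strings are shortened or made up, not read from the real files.

```python
import html

# Shortened version of the example review from the question.
sample = "don't try to use em in &#34;official duels&#34; unless a judge is okay"

# html.unescape converts numeric and named character references to characters.
decoded = html.unescape(sample)
print(decoded)

# Named entities are handled the same way (made-up example).
decoded_named = html.unescape("Tom &amp; Jerry &#39;toy&#39;")
print(decoded_named)
```

The same call could be applied to `goods` and `poors` as whole strings before splitting them into reviews.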

3. One of your product managers suspects that **toys purchased for male recipients (husbands, sons, etc.)** tend to be much more likely to be reviewed poorly. She would like to see some data points confirming or rejecting her hypothesis. 


In [129]:
good_reviews = goods_cleaned.split("\n")
poor_reviews = poors_cleaned.split("\n")

In [134]:
# define pattern
# count how many times the keyword is used in text
pattern = r'\b[hH]+usband|[sS]+on|[dD]+ad|[bB]rother|[gG]+randson[gG]+randpa\b'

print(f'Gifts to males have {len(re.findall(pattern, goods))} good reviews.')
print(f'Gifts to males have {len(re.findall(pattern, poors))} poor reviews.')
    

Gifts to males have 16045 good reviews.
Gifts to males have 1331 poor reviews.


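The pattern matches male-recipient keywords 16,045 times in good reviews and 1,331 times in poor reviews, so on raw counts the data do not support the manager's hypothesis. These are counts of matches rather than of reviews, and they are not normalized by the size of each file. The pattern also matches substrings such as "son" in "person", so the result is only indicative.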
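A per-review rate is easier to compare across the two files than a raw match count. A minimal sketch follows, assuming a word-bounded, case-insensitive pattern. The names `MALE_PATTERN` and `share_mentioning` are hypothetical, and the made-up sample lists stand in for `good_reviews` and `poor_reviews`.

```python
import re

# Hypothetical pattern: every alternative is word-bounded, so "person" or
# "reason" does not count as a mention of "son".
MALE_PATTERN = re.compile(
    r"\b(?:husbands?|sons?|dads?|brothers?|grandsons?|grandpas?)\b",
    re.IGNORECASE,
)


def share_mentioning(reviews, pattern):
    """Return the fraction of non-empty reviews with at least one match."""
    non_empty = [review for review in reviews if review.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for review in non_empty if pattern.search(review))
    return hits / len(non_empty)


# Made-up sample data.
sample_good = ["My son loves it", "Great for any person", "Husband approved", "Nice toy"]
sample_poor = [
    "Broke in a day",
    "My grandson was disappointed",
    "Too small",
    "The reason I returned it: poor quality",
]

good_share = share_mentioning(sample_good, MALE_PATTERN)
poor_share = share_mentioning(sample_poor, MALE_PATTERN)
print(f"good: {good_share:.2%}, poor: {poor_share:.2%}")
```

Comparing the two shares on the real data would show whether male-recipient mentions are over-represented among poor reviews, which is closer to the question the product manager asked.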

4. Use **regular expressions to parse out all references to recipients and gift occasions**, and account for the possibility that people may spell words "son" / "children" / "Christmas" as both singular and plural, upper or lower-cased.


In [135]:
# parse out all references to recipients and gift occasions
# account for the possibility that people may spell words "son" / "children" / "Christmas" as both singular and plural, upper or lower-cased.

# ?? Are we looking for the probability of upper vs. lower case, or of singular vs. plural?

# For Q4 in HW1, how do we know whether a word is used as both singular and plural, regardless of case? Are we looking for (number of plurals / total number of uses of this word) for 'son', 'children', 'Christmas'?
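One possible reading of this question is that a single pattern should accept every spelling variant (singular or plural, any capitalization) and return the matched words. A minimal sketch under that assumption follows. The word lists, the `extract_mentions` helper, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical word lists; optional suffixes cover singular and plural forms,
# and re.IGNORECASE covers upper- and lower-case spellings.
RECIPIENT_PATTERN = re.compile(
    r"\b(?:sons?|daughters?|child(?:ren)?|kids?|husbands?|wi(?:fe|ves))\b",
    re.IGNORECASE,
)
OCCASION_PATTERN = re.compile(
    r"\b(?:christmas(?:es)?|birthdays?|anniversar(?:y|ies))\b",
    re.IGNORECASE,
)


def extract_mentions(text, pattern):
    """Return a Counter of lower-cased matches of pattern found in text."""
    return Counter(match.group(0).lower() for match in pattern.finditer(text))


# Made-up sample sentence.
sample_text = (
    "My SON and his sons got these for Christmas; "
    "the children opened them on christmas morning."
)

recipients = extract_mentions(sample_text, RECIPIENT_PATTERN)
occasions = extract_mentions(sample_text, OCCASION_PATTERN)
print(recipients)
print(occasions)
```

Lower-casing the matches merges case variants while keeping singular and plural forms separate, so a ratio such as plurals over total uses could be computed from the counters if that turns out to be what the question intends.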



5. Explain what some of the **pitfalls/limitations** are of using only a word count analysis to make these inferences. What additional research/steps would you need to do to verify your conclusions?

Counting keyword matches across the entire text file is not the same as counting the reviews that contain a keyword: a single review can mention a word several times, so the raw count may overstate how many buyers it applies to. Splitting the file into individual reviews and counting the reviews that mention each keyword should give better business insight, because we want to study buyers' reactions through their reviews rather than the word counts themselves.
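The gap between the two kinds of count can be illustrated on a small made-up sample, where one enthusiastic review repeats the keyword. This is only a sketch, and the pattern and sample reviews are assumptions rather than the real data.

```python
import re

# Hypothetical keyword pattern for the illustration.
pattern = re.compile(r"\bbirthdays?\b", re.IGNORECASE)

# Made-up reviews: the first one repeats the keyword three times.
sample_reviews = [
    "Birthday gift for a birthday party, best birthday ever",
    "Arrived late",
    "Bought for her birthday",
]

# Count every occurrence in the combined text (as in Method 2).
total_mentions = len(pattern.findall("\n".join(sample_reviews)))

# Count reviews that contain the keyword at least once (as in Method 1).
reviews_with_mention = sum(1 for review in sample_reviews if pattern.search(review))

print(total_mentions, reviews_with_mention)
```

Here four mentions come from only two reviews, which is why the per-review count is the safer basis for statements about buyers.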