# Homework 1 (Due Tuesday, March 30th, 2021 at 6:29pm PST)

Every day late is -10%.

You are a business analyst working for a major US toy retailer:

* A manager in the marketing department wants to find out the most frequently used words in positive reviews (five stars) and negative reviews (one star) in order to determine what occasion the toys are purchased for (Christmas, birthdays, and anniversaries). He would like your opinion on **which gift occasions (Christmas, birthdays, or anniversaries) tend to have the most positive reviews** to focus marketing budget on those days.

* There are malformed characters in the review text. For instance, notice the `&#34;` - these are examples of incorrectly decoded [HTML encodings](https://krypted.com/utilities/html-encoding-reference/).
```
"amazing quality first of all, these cards are amazing proxies (but don't try to use em in &#34;official duels&#34; unless a judge is okay with it, if you have the real thing to show) and look amazing in your binder!"
```
Please clean up all instances of these incorrect decodings.

* One of your product managers suspects that **toys purchased for male recipients (husbands, sons, etc.)** tend to be much more likely to be reviewed poorly. She would like to see some data points confirming or rejecting her hypothesis. 

* Use **regular expressions to parse out all references to recipients and gift occasions**, and account for the possibility that people may write words such as "son" / "children" / "Christmas" in singular or plural form, and in upper or lower case.

* Explain what some of the **pitfalls/limitations** are of using only a word count analysis to make these inferences. What additional research/steps would you need to do to verify your conclusions?

Perform the same word count analysis using the reviews received from Amazon to answer your marketing manager's question. They are stored in two files, `poor_amazon_toy_reviews.txt` and `good_amazon_toy_reviews.txt`. **Provide a few sentences with your findings and business recommendations.** Make any assumptions you'd like; this is a fictitious company after all. I just want you to get into the habit of "finishing" your analysis: to avoid delivering technical numbers to a non-technical manager.

**Submit everything as a new notebook, attached to a Slack direct message to me (Yu Chen) and the TA.**

**NOTE**: Name the notebook `lastname_firstname_HW1.ipynb`.

## Load data

In [1]:
import pandas as pd

In [2]:
good = pd.DataFrame(open('good_amazon_toy_reviews.txt', 'r', encoding='latin1'), columns=['line'])
good.head()

Unnamed: 0,line
0,Excellent!!!\n
1,"""Great quality wooden track (better than some ..."
2,my daughter loved it and i liked the price and...
3,Great item. Pictures pop thru and add detail a...
4,I was pleased with the product.\n


In [3]:
poor = pd.DataFrame(open('poor_amazon_toy_reviews.txt', 'r', encoding='latin1'), columns=['line'])
poor.head()

Unnamed: 0,line
0,Do not buy these! They break very fast I spun ...
1,Showed up not how it's shown . Was someone's o...
2,You need expansion packs 3-5 if you want acces...
3,"""This was to be a gift for my husband for our ..."
4,Received a pineapple rather than the advertise...


## Clean up text

- Clean up malformed characters, punctuation, and letter case

In [4]:
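def clean_text(df, column):
    """Normalize the review text in df[column] in place and return df."""

    # remove newline characters (literal match, so regex=False)
    df[column] = df[column].str.replace('\n', '', regex=False)

    # replace numeric HTML character references (e.g. &#34;) with a space,
    # not an empty string, so that neighbouring words do not get joined.
    # regex=True is passed explicitly because pandas >= 2.0 defaults to literal matching
    df[column] = df[column].str.replace(r'&#[0-9]+;', ' ', regex=True)

    # replace self-closing HTML tags (e.g. <br />) with a space
    df[column] = df[column].str.replace(r'<.*?/>', ' ', regex=True)

    # replace punctuation with a space
    df[column] = df[column].str.replace(r'[^\w\s]', ' ', regex=True)

    # lowercase everything
    df[column] = df[column].str.lower()

    return df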
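
The cell above drops the character references. An alternative, shown here only as a sketch, is to decode them with the standard library's `html.unescape` before stripping punctuation. The helper name `decode_entities` is hypothetical, and the sample string is taken from the assignment text.

```python
import html
import re

def decode_entities(text):
    """Decode HTML character references such as &#34; into the characters they stand for."""
    return html.unescape(text)

sample = "don't try to use em in &#34;official duels&#34; unless a judge is okay with it"
decoded = decode_entities(sample)
print(decoded)

# For this sample, stripping punctuation afterwards leaves the same word tokens
# as replacing the references with spaces directly.
tokens = re.sub(r'[^\w\s]', ' ', decoded).lower().split()
print(tokens)
```

Decoding first matters mostly when a reference stands for a letter (for example an accented character) rather than punctuation, because replacing it with a space would split the word.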

In [5]:
good = clean_text(good, 'line')
good.head()

Unnamed: 0,line
0,excellent
1,great quality wooden track better than some ...
2,my daughter loved it and i liked the price and...
3,great item pictures pop thru and add detail a...
4,i was pleased with the product


In [6]:
poor = clean_text(poor, 'line')
poor.head()

Unnamed: 0,line
0,do not buy these they break very fast i spun ...
1,showed up not how it s shown was someone s o...
2,you need expansion packs 3 5 if you want acces...
3,this was to be a gift for my husband for our ...
4,received a pineapple rather than the advertise...


## Gift occasions

### Christmas

In [7]:
good['christmas'] = good['line'].str.findall(r'\b(?:christmas(?:tide)?|xmas|noel|yule(?:tide)?|nativity)\b')
good['christmas_count'] = good['christmas'].apply(len)
print('Total word count of Christmas:', good['christmas_count'].sum())
print('Number of reviews containing Christmas:', len(good[good['christmas_count']>0]))

Total word count of Christmas: 1281
Number of reviews containing Christmas: 1206
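
The pattern above has no plural branch for "christmas", so a token like "christmases" would not match because the trailing `\b` fails before the "e". A minimal sketch of a variant that accepts plurals and ignores case, checked on made-up sentences (the name `CHRISTMAS` and the samples are assumptions for illustration):

```python
import re

# Hypothetical variant: optional plural suffix, case-insensitive matching
CHRISTMAS = re.compile(r'\b(?:christmas(?:es)?|xmas(?:es)?)\b', flags=re.IGNORECASE)

samples = [
    "Bought this for Christmas",
    "a hit for two Christmases in a row",
    "XMAS gift for my son",
    "no occasion mentioned",
]

# Number of matches per sample sentence
counts = [len(CHRISTMAS.findall(s)) for s in samples]
print(counts)
```

Because the text is lowercased in `clean_text`, the `re.IGNORECASE` flag only matters if the pattern is applied to raw reviews.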


### Birthday

In [8]:
good['birthday'] = good['line'].str.findall(
    r'\b(?:birthda(?:ys?|tes?)|b ?days?|natal days?|dob|da(?:ys?|tes?) of birth)\b'
)
good['birthday_count'] = good['birthday'].apply(len)
print('Total word count of birthday:', good['birthday_count'].sum())
print('Number of reviews containing birthday:', len(good[good['birthday_count']>0]))

Total word count of birthday: 4227
Number of reviews containing birthday: 4018


### Anniversary

In [9]:
good['anniversary'] = good['line'].str.findall(
    r'\b(?:anniversar(?:y|ies)|da(?:ys?|tes?) of remembrance|memorable da(?:ys?|tes?)|once a year|annual celebrations?)\b'
)
good['anniversary_count'] = good['anniversary'].apply(len)
print('Total word count of anniversary:', good['anniversary_count'].sum())
print('Number of reviews containing anniversary:', len(good[good['anniversary_count']>0]))

Total word count of anniversary: 54
Number of reviews containing anniversary: 52


**Findings & Business Recommendations:**

According to the results above, **birthday** is the gift occasion mentioned most often in the positive reviews: it has the highest total word count (4,227) and the highest number of reviews (4,018), ahead of Christmas (1,281 and 1,206) and anniversaries (54 and 52). Therefore, the marketing manager may want to focus the marketing budget on birthdays.
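
The counts above come from the good reviews only. One possible follow-up, sketched here with made-up reviews rather than the real files, is to compare the share of reviews mentioning an occasion in the good set and in the poor set. The helper `mention_share` and both sample lists are assumptions for illustration.

```python
import re

def mention_share(reviews, pattern):
    """Return the fraction of reviews with at least one match of pattern."""
    if not reviews:
        return 0.0
    hits = sum(1 for text in reviews if re.search(pattern, text))
    return hits / len(reviews)

# Made-up, already-cleaned reviews standing in for the two files
good_sample = ["great birthday gift", "my son loved it",
               "perfect for her birthday", "fast shipping"]
poor_sample = ["broke on his birthday", "arrived damaged",
               "waste of money", "never arrived"]

pattern = r'\bbirthdays?\b'
good_share = mention_share(good_sample, pattern)
poor_share = mention_share(poor_sample, pattern)
print(f'good: {good_share:.2f}, poor: {poor_share:.2f}')
```

Shares are easier to compare than raw counts because the two files differ in size.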

## Male recipients

In [10]:
# define a function to count male or female recipients
def recipients_count(df, column, gender):
    if gender=='male':
        df[gender] = df[column].str.findall(
            r'\b(?:grand(?:pas?|fathers?|sons?)|step(?:dads?|fathers?|sons?|brothers?)|fathers?|dads?|dadd(?:y|ies)|pa(?:pas?)?|sons?|husbands?|uncles?|brothers?|bros?|nephews?|boy ?friends?|boys?)\b'
        )
    else:
        df[gender] = df[column].str.findall(
            r'\b(?:grand(?:mas?|mothers?|daughters?)|step(?:moms?|mothers?|daughters?|sisters?)|mothers?|moms?|momm(?:y|ies)|ma(?:mas?)?|daughters?|wi(?:fe|ves)|aunts?|aunt(?:y|ies?)|sisters?|siss?|nieces?|girl ?friends?|girls?)\b'
        )
    df[gender+'_count'] = df[gender].apply(len)
    word_count = df[gender+'_count'].sum()
    n_reviews = len(df[df[gender+'_count']>0])
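    # look up the name of the global variable bound to df ('good' or 'poor') to label the printed output;
    # this only works when df is assigned to a module-level variable
    name = [x for x in globals() if globals()[x] is df][0]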
    print(f'In {name} reviews:')
    print(f'Total word count of {gender} recipients: {word_count}')
    print(f'Number of reviews containing {gender} recipients: {n_reviews}')
    print(f'Percentage of {name} reviews containing {gender} recipients: {round(n_reviews/len(df)*100,2)}%')

### In poor reviews

#### Male recipients

In [11]:
recipients_count(poor, 'line', 'male')

In poor reviews:
Total word count of male recipients: 1179
Number of reviews containing male recipients: 988
Percentage of poor reviews containing male recipients: 7.78%


#### Female recipients

In [12]:
recipients_count(poor, 'line', 'female')

In poor reviews:
Total word count of female recipients: 823
Number of reviews containing female recipients: 734
Percentage of poor reviews containing female recipients: 5.78%


**Findings:**

According to the results above, more poor reviews mention male recipients (7.78%) than female recipients (5.78%), so it is tempting to conclude directly that the product manager's hypothesis is correct. However, let's first take a look at the good reviews.

### In good reviews

#### Male recipients

In [13]:
recipients_count(good, 'line', 'male')

In good reviews:
Total word count of male recipients: 17465
Number of reviews containing male recipients: 15019
Percentage of good reviews containing male recipients: 14.69%


#### Female recipients

In [14]:
recipients_count(good, 'line', 'female')

In good reviews:
Total word count of female recipients: 16519
Number of reviews containing female recipients: 13941
Percentage of good reviews containing female recipients: 13.64%


**Findings & Business Recommendations:**

Based on the results above, good reviews also mention male recipients more often than female recipients (14.69% vs. 13.64%). Therefore, I would compare the good and poor reviews that mention male recipients instead of comparing across genders. The proportion of good reviews mentioning male recipients (14.69%) is much higher than the proportion of poor reviews mentioning them (7.78%). Hence, I would **reject** the product manager's hypothesis that toys purchased for male recipients tend to be much more likely to be reviewed poorly.
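
Another way to read the same numbers is to ask what share of the reviews mentioning each group are poor. The sketch below only re-uses the review counts printed above; it assumes the two files (one-star and five-star reviews) are the whole population, and a review that mentions both groups is counted in both.

```python
# Counts of reviews mentioning each group, copied from the outputs above
counts = {
    'male': {'poor': 988, 'good': 15019},
    'female': {'poor': 734, 'good': 13941},
}

def poor_share(group):
    """Share of poor reviews among all reviews that mention the group."""
    poor, good = counts[group]['poor'], counts[group]['good']
    return poor / (poor + good)

for group in counts:
    print(f'{group}: {poor_share(group) * 100:.2f}% of mentioning reviews are poor')
# male: 6.17% of mentioning reviews are poor
# female: 5.00% of mentioning reviews are poor
```

Under these assumptions the gap between the two groups is about one percentage point, which is consistent with rejecting the "much more likely" part of the hypothesis.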

## Pitfalls/limitations & additional steps/recommendations

Using only a word count analysis to make the inferences above has certain pitfalls and limitations:

- A word count analysis ignores the order/context of the words, which means that even if a certain word is contained in a review, it may not actually relate to the positive or negative attitude the review is expressing. For instance, a review containing "birthday" may actually say "the toy was not a birthday gift but...". In this case, the word count of "birthday" does not accurately reflect the number of birthday gift occasions. Similarly, words such as "dad" or "husband" in a review may refer to someone other than the actual recipient of the toy, so the counts of good/poor reviews for male recipients would also be inaccurate.
- Some reviews do not explicitly use the specific words we are looking for, and we may not be able to account for all possible variations of a word.
- In some cases, one review may contain multiple target words, so it may be counted multiple times across the different targets. For example, a review containing both "son" and "daughter" is counted once for male recipients and once for female recipients.
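
Two of the pitfalls above can be reproduced on made-up sentences. The patterns here are shortened, hypothetical versions of the ones used in the notebook.

```python
import re

BIRTHDAY = re.compile(r'\bbirthdays?\b')
MALE = re.compile(r'\b(?:sons?|husbands?|dads?)\b')
FEMALE = re.compile(r'\b(?:daughters?|wi(?:fe|ves)|moms?)\b')

negated = "the toy was not a birthday gift but my kids liked it"
both = "my son and daughter fought over it"

# A negated mention still counts as a birthday review
negated_hit = bool(BIRTHDAY.search(negated))

# One review is counted for both the male and the female group
male_hit, female_hit = bool(MALE.search(both)), bool(FEMALE.search(both))

print(negated_hit, male_hit, female_hit)
```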

Some additional research/steps I would recommend doing next:

- A potential localized sentiment analysis: While the reviews are already separated into good and poor reviews, I may want to analyze specific sentences in each review to determine their own sentiments. For a good review, the overall sentiment should be positive, but there might be a specific sentence containing the target word that actually has a neutral or negative sentiment.
- Stemming/Lemmatization: Stemming reduces inflected forms to a common stem, so fewer variants need to be listed by hand, whereas lemmatization maps each word to its dictionary form and can take the word's usage in context into account.
- I may analyze other information about each review in addition to the text. For example, I may look at the dates on which the reviews were posted and the customers' account information to aid my analysis and help to make more reasonable and accurate inferences.
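
As a first step toward the localized analysis, the sentences that mention a target word could be pulled out so that they can be scored separately. This is only a sketch: the helper `sentences_mentioning` is hypothetical, the review is made up, and it has to run on the raw text, since `clean_text` removes the punctuation needed to split sentences.

```python
import re

def sentences_mentioning(review, pattern):
    """Split a raw review after ., ! or ? and keep the sentences that match pattern."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', review) if s.strip()]
    return [s for s in sentences if re.search(pattern, s, flags=re.IGNORECASE)]

review = "Great toy overall! My son got bored of it after a day. Shipping was fast."
matches = sentences_mentioning(review, r'\bsons?\b')
print(matches)
```

Here the review as a whole reads as positive, while the only sentence that mentions the recipient does not, which is the case the first bullet describes.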