## Data cleaning goals
* Exploration of Amazon review data
* Data cleaning and pre-processing for text analytics & sentiment analysis
* Credit: https://github.com/EnesGokceDS/Amazon_Reviews_NLP_Capstone_Project/blob/master/1_Data_cleaning_and_feature_extraction.ipynb

## Part 1: Load data
* Load libraries
* Load data
* Quick exploration of data

In [1]:
# Import libraries
import pandas as pd
import nltk as n
from textblob import TextBlob

n.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# Get the amazon reviews data, store as pandas df
df = pd.read_csv("dog-cameras-raw.csv")
df.head(3) # show the first 3 rows

Unnamed: 0,product,date,title,rating,body
0,"Furbo Dog Camera: Treat Tossing, Full HD WiFi ...","Reviewed in Canada on December 14, 2018",Glorified Webcam,2.0,I bought the Furbo as a birthday gift for my b...
1,"Furbo Dog Camera: Treat Tossing, Full HD WiFi ...","Reviewed in Canada on August 15, 2018",Recieved Used Item!,1.0,Extremely disappointed. I recieved a Furbo tha...
2,"Furbo Dog Camera: Treat Tossing, Full HD WiFi ...","Reviewed in Canada on May 26, 2018",Furbo made miracle happen for me,5.0,I’ve been using furbo for 2.5 weeks now. It ha...


In [3]:
# Rename body (review text) to text; note that .astype(str) casts every column, including rating, to string
df = df.rename(columns = {'body': 'text'}).astype(str)

# Get info from df
df.info()

# Describe the df
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 490 entries, 0 to 489
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   product  490 non-null    object
 1   date     490 non-null    object
 2   title    490 non-null    object
 3   rating   490 non-null    object
 4   text     490 non-null    object
dtypes: object(5)
memory usage: 19.3+ KB


Unnamed: 0,product,date,title,rating,text
count,490,490,490,490.0,490
unique,1,355,442,5.0,490
top,"Furbo Dog Camera: Treat Tossing, Full HD WiFi ...","Reviewed in Canada on January 9, 2020",Five Stars,5.0,"I’m glad I got this, this is so great. Thanks 5/5"
freq,490,7,15,347.0,1


In [4]:
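# Exploring missing values
# Note: every column was cast to str above, so missing values may already have
# become the string 'nan', which isna() does not flag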
null_values = df.isna().sum()
null_values = pd.DataFrame(null_values,columns=['null'])
sum_tot = len(df)
null_values['percent'] = null_values['null']/sum_tot*100
round(null_values,3).sort_values('percent',ascending=False)

Unnamed: 0,null,percent
product,0,0.0
date,0,0.0
title,0,0.0
rating,0,0.0
text,0,0.0
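
One caveat about the table above: in the pandas version this notebook appears to use (the columns show as `object`), casting with `.astype(str)` can turn missing values into ordinary strings before they are counted. A minimal sketch, using a small made-up frame (`sample`) rather than the real reviews file, of running the same check before any cast and then filling gaps in the text column explicitly:

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for dog-cameras-raw.csv
raw = io.StringIO("title,rating,body\nGreat,5.0,Love it\nBad,1.0,\n")
sample = pd.read_csv(raw)

# Run the missing-value check on the freshly loaded frame, before any cast
null_values = sample.isna().sum().to_frame("null")
null_values["percent"] = null_values["null"] / len(sample) * 100

# Then cast only the review text, filling gaps with an empty string first
sample["body"] = sample["body"].fillna("").astype(str)

print(null_values)
```

Whether an empty string is the right fill value depends on the later steps; dropping such rows is an equally reasonable choice.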


## Part 2: Feature Extraction (before text cleaning)
* Count of Stopwords
* Count of Punctuation
* Count of Hashtags
* Count of Numeric tokens
* Count of Uppercase words
* Count of Emojis & Emoticons

In [5]:
# Load libraries
!pip install -q wordcloud
import wordcloud
from nltk.corpus import stopwords
import nltk
import string
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [6]:
# Create stopword count feature
df['stopword_ct'] = df.text.apply(lambda x: len([w for w in x.split() if w in stop]))

# See the 3 rows with the most stopwords
df[['text','stopword_ct']].sort_values('stopword_ct',ascending=False).head(3)

Unnamed: 0,text,stopword_ct
2,I’ve been using furbo for 2.5 weeks now. It ha...,358
54,"Ok, ive finally gotten frustrated for the last...",253
64,This thing is worth every penny.Bought it beca...,203
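
The count above compares raw whitespace tokens against NLTK's lowercase stopword list, so capitalised or punctuated tokens such as "The" or "it." are not counted. A minimal sketch of a normalised count follows. It assumes a tiny hand-picked set (`SAMPLE_STOPWORDS`, hypothetical) in place of `stopwords.words('english')`, and the helper name `count_stopwords` is also made up for illustration:

```python
import string

# Hypothetical stand-in for set(stopwords.words('english'))
SAMPLE_STOPWORDS = {"i", "the", "it", "is", "a", "and", "for", "my"}

def count_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Count stopwords after lowercasing and trimming punctuation from token edges."""
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return sum(1 for tok in tokens if tok in stopwords)

sample = "The camera is great, and I love it."
# Case-sensitive count on raw tokens, as in the cell above
naive = len([w for w in sample.split() if w in SAMPLE_STOPWORDS])
print(naive, count_stopwords(sample))
```

In the notebook this could be applied as `df.text.apply(lambda t: count_stopwords(t, set(stop)))`; converting the list to a set also makes each lookup faster.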


In [7]:
# Create punctuation count feature
def count_punct(text):
    # Count characters that appear in string.punctuation (ASCII punctuation only)
    return sum(1 for char in text if char in string.punctuation)

df['punctuation_ct'] = df.text.apply(count_punct)

# See 3 rows
df[['text', 'punctuation_ct']].sort_values('punctuation_ct',ascending=False).head(3)

Unnamed: 0,text,punctuation_ct
14,EDIT:I am very pleased with their customer sup...,76
2,I’ve been using furbo for 2.5 weeks now. It ha...,62
64,This thing is worth every penny.Bought it beca...,53
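
`string.punctuation` only covers ASCII characters, so typographic marks such as the curly apostrophe in "I’ve" are not counted by the cell above. A minimal sketch of a Unicode-aware alternative, assuming that "punctuation" means any character whose Unicode general category starts with `P` (the helper name `count_punct_unicode` is hypothetical):

```python
import string
import unicodedata

def count_punct_unicode(text):
    """Count characters whose Unicode general category is punctuation (P*)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("P"))

# Made-up sample with a curly apostrophe (U+2019) and an ellipsis (U+2026)
sample = "I\u2019ve been waiting\u2026 wow!"
ascii_only = sum(1 for ch in sample if ch in string.punctuation)
print(ascii_only, count_punct_unicode(sample))
```

The two definitions are not interchangeable: `string.punctuation` also contains symbols such as `$` and `+`, which Unicode classifies as symbols rather than punctuation.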


In [12]:
# Create hashtag count feature
df['hashtag_ct'] = df.text.apply(lambda x: len([w for w in x.split() if w.startswith('#')]))

# See 3 rows
df[['text','hashtag_ct']].sort_values('hashtag_ct',ascending=False).head(3)

# How many reviews contain at least one hashtag?
# df.hashtag_ct.loc[df.hashtag_ct != 0].count()

Unnamed: 0,text,hashtag_ct
0,I bought the Furbo as a birthday gift for my b...,0
336,Great product and impeccable customer service....,0
334,I bought this on sale and I love it! Being abl...,0


In [9]:
# Create numeric count feature
df['numeric_ct'] = df.text.apply(lambda x: len([w for w in x.split() if w.isdigit()]))

# See 3 rows
df[['text','numeric_ct']].sort_values('numeric_ct',ascending=False).head(3)

Unnamed: 0,text,numeric_ct
76,Love the furbo! I can access the camera easily...,3
3,I had REALLY high hopes for this product since...,3
6,Excellent camera. Works great as a security ca...,3
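
`str.isdigit()` is only true for tokens made up entirely of digits, so values like `2.5`, `$200` or `6lbs` are skipped by the count above. A minimal sketch of a regex-based count, assuming that a "number" is a run of digits with optional `.` or `,` separators (both `NUMBER_RE` and `count_numbers` are hypothetical names):

```python
import re

# Assumed definition of a number: digits, optionally joined by "." or ","
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def count_numbers(text):
    """Count numeric runs anywhere in the text, including inside tokens."""
    return len(NUMBER_RE.findall(text))

sample = "Used it for 2.5 weeks, paid $200 for a 6lbs dog."
# Whitespace-token count, as in the cell above
naive = len([w for w in sample.split() if w.isdigit()])
print(naive, count_numbers(sample))
```

Whether `6lbs` should count as a number is a modelling choice; a stricter pattern with word boundaries would exclude it.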


In [10]:
# Create an uppercase count feature
df['uppercase_ct'] = df.text.apply(lambda x: len([w for w in x.split() if w.isupper()]))

# See 3 rows
df[['text','uppercase_ct']].sort_values('uppercase_ct',ascending=False).head(3)

Unnamed: 0,text,uppercase_ct
2,I’ve been using furbo for 2.5 weeks now. It ha...,32
46,I am a person who is so bonded to my golden re...,21
24,"I put off buying this because of the price, bu...",20


In [25]:
# Counts every emoji character in a review (total, not distinct).
# Multi-code-point emojis such as flags are counted more than once.

# Load libraries for emoji & regex
import emoji
import regex
import re
import advertools as adv

# Define function to find emojis (one match per emoji character)
def count_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags 
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"  # very wide range; also covers many non-emoji characters (e.g. CJK)
                           "]", flags=re.UNICODE)
    return emoji_pattern.findall(text)

# Emoji count: total number of matched characters per review
df['emoji_ct'] = df.text.apply(lambda x: len(count_emoji(x)))

# See the 10 rows with the most emojis
df[['text','emoji_ct']].sort_values('emoji_ct',ascending=False).head(10)


In [None]:
# Write a function to identify all the emoticons, call it emoticon_ct

# See 3 rows
# df[['text','emoticon_ct']].head(3)
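
The placeholder cell above could be filled in along the following lines. This is a minimal sketch, not the notebook author's implementation: it assumes a simple pattern for common Western-style emoticons (eyes, optional nose, mouth), and `EMOTICON_RE` and `count_emoticons` are hypothetical names.

```python
import re

# Assumed emoticon shape: eyes [:;=], optional nose [-o^'], then a mouth character.
# The lookarounds keep the pattern from firing inside words or URLs (e.g. "https://").
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][-o^']?[()\[\]DPp/\\|*](?!\w)")

def count_emoticons(text):
    """Count text emoticons such as :) ;-) :( =D in a string."""
    return len(EMOTICON_RE.findall(text))

print(count_emoticons("Love it :) works great ;-) but pricey :("))
print(count_emoticons("See https://example.com for details"))
```

It could then be applied as `df['emoticon_ct'] = df.text.apply(count_emoticons)`. The count has to run on the raw `text` column, because emoticons do not survive punctuation removal.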



## Part 3: Data & Text Cleaning
* Change to lower case
* Remove punctuation, stopwords, URLs, html tags, emojis, emoticons
* Spell correction
* Explore & remove custom stopwords

In [26]:
# Create a new column called "text_clean"
df['text_clean'] = df.text

In [27]:
# Convert all to lowercase
df['text_clean'] = df.text_clean.apply(lambda x: " ".join(w.lower() for w in x.split()))

# See 3 rows
df.text_clean.head(3)

0    i bought the furbo as a birthday gift for my b...
1    extremely disappointed. i recieved a furbo tha...
2    i’ve been using furbo for 2.5 weeks now. it ha...
Name: text_clean, dtype: object

In [28]:
# Remove punctuation
df['text_clean'] = df.text_clean.str.replace(r'[^\w\s]', '', regex=True)

# See 3 rows
df.text_clean.head(3)

0    i bought the furbo as a birthday gift for my b...
1    extremely disappointed i recieved a furbo that...
2    ive been using furbo for 25 weeks now it has d...
Name: text_clean, dtype: object

In [29]:
# Library to identify stopwords
from nltk.corpus import stopwords

# Define english & french stopwords 
stop = stopwords.words('english')
stop2 = stopwords.words('french')

# Remove english & french stopwords
df['text_clean'] = df.text_clean.apply(lambda x: " ".join(w for w in x.split() if w not in stop))
df['text_clean'] = df.text_clean.apply(lambda x: " ".join(w for w in x.split() if w not in stop2))

# See 10 rows
df.text_clean.sample(10)

19     honestly buy waste money first 200 upfront non...
291    absolutely love camera actually dont use pat u...
382         awesome product two dogs love getting treats
374    dog loves treats also doubles additional secur...
49     camera works great notifications accuratehowev...
245    good system easy set thing sucks yhe treats ne...
190    product works really well get allot enjoyment ...
325    love small dog though 6lbs treats quite small ...
252    great helps dog feel safe gone everyone know t...
82     bought gift love itpro says works great securi...
Name: text_clean, dtype: object

In [30]:
# Load libraries
import re
import string

# Define function to remove URLs
def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

# Remove URLs from the text
df['text_clean'] = df.text_clean.apply(remove_url)

# See 3 rows
df.text_clean.head(3)

0    bought furbo birthday gift boyfriend absolutel...
1    extremely disappointed recieved furbo clearly ...
2    ive using furbo 25 weeks done much maltipoo st...
Name: text_clean, dtype: object
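
Because punctuation is stripped earlier in this part, a URL such as `https://example.com` has already been collapsed to `httpsexamplecom` by the time `remove_url` runs, and the pattern can no longer match it. The same applies to HTML tags. A minimal sketch of the alternative ordering, using a made-up review string and hypothetical helper names, where markup is removed before punctuation:

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>')
PUNCT_RE = re.compile(r'[^\w\s]')

def clean_markup_first(text):
    """Strip URLs and HTML tags, then punctuation, then collapse whitespace."""
    text = URL_RE.sub('', text)
    text = HTML_RE.sub('', text)
    text = PUNCT_RE.sub('', text)
    return ' '.join(text.split())

def clean_punct_first(text):
    """Same steps in the notebook's order: punctuation first, markup afterwards."""
    text = PUNCT_RE.sub('', text)
    text = URL_RE.sub('', text)
    text = HTML_RE.sub('', text)
    return ' '.join(text.split())

# Made-up example review
sample = "great camera <br/> see https://example.com for setup tips!"
print(clean_markup_first(sample))
print(clean_punct_first(sample))
```

With the notebook's ordering, fragments of the tag and the URL remain in the text as ordinary-looking words.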

In [31]:
# Define function to remove HTML tags
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

# Remove all html tags
df['text_clean'] = df.text_clean.apply(lambda x: remove_html(x))

# See 3 rows
df.text_clean.head(3)

0    bought furbo birthday gift boyfriend absolutel...
1    extremely disappointed recieved furbo clearly ...
2    ive using furbo 25 weeks done much maltipoo st...
Name: text_clean, dtype: object

In [32]:
# Define function to remove emojis --> e.g. 😜
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags 
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Remove all emojis
df['text_clean'] = df.text_clean.apply(lambda x: remove_emoji(x))

# See 3 rows
df.text_clean.head(3)

0    bought furbo birthday gift boyfriend absolutel...
1    extremely disappointed recieved furbo clearly ...
2    ive using furbo 25 weeks done much maltipoo st...
Name: text_clean, dtype: object

In [33]:
# Define function to remove emoticons --> e.g. :-)

# Libraries
!pip install emot
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

# Function for removing emoticons
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

# Remove all emoticons
df['text_clean'] = df.text_clean.apply(lambda x: remove_emoticons(x))

# See 3 rows
df.text_clean.head(3)



0    bought furbo birthday gift boyfriend absolutel...
1    extremely disappointed recieved furbo clearly ...
2    ive using furbo 25 weeks done much maltipoo st...
Name: text_clean, dtype: object

In [34]:
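#TODO Check to see if this worked as intended...
# Spell correction (preview on the first 5 rows only; the result is not assigned back,
# so text_clean is left unchanged, as the output below shows)
from textblob import TextBlob
df['text_clean'][:5].apply(lambda x: str(TextBlob(x).correct()))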

# See 3 rows
df.text_clean.head(3)

0    bought furbo birthday gift boyfriend absolutel...
1    extremely disappointed recieved furbo clearly ...
2    ive using furbo 25 weeks done much maltipoo st...
Name: text_clean, dtype: object

## Part 4: Feature Extraction (after text cleaning)
* Word count
* Character count
* Avg/median word length
* Create date/time variable
* Create review country
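
No code accompanies this part yet. The features listed above could be sketched as follows, using only the standard library so that each helper can be checked on a single string. All helper names here are hypothetical, and the date pattern assumes the `Reviewed in <country> on <Month D, YYYY>` layout seen in the `date` column earlier, with an optional "the" before the country name.

```python
import re
from datetime import datetime
from statistics import mean, median

def text_stats(text):
    """Return word count, character count, and mean/median word length."""
    words = text.split()
    lengths = [len(w) for w in words]
    return {
        "word_ct": len(words),
        "char_ct": len(text),
        "avg_word_len": mean(lengths) if lengths else 0.0,
        "median_word_len": median(lengths) if lengths else 0.0,
    }

# Assumed layout of the raw date column
DATE_RE = re.compile(
    r"Reviewed in (?:the )?(?P<country>.+?) on (?P<date>\w+ \d{1,2}, \d{4})"
)

def parse_review_date(raw):
    """Split the raw date string into (country, date), or (None, None) if it does not match."""
    match = DATE_RE.search(raw)
    if match is None:
        return None, None
    # %B expects English month names under the default locale
    parsed = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    return match.group("country"), parsed

print(text_stats("awesome product two dogs"))
print(parse_review_date("Reviewed in Canada on December 14, 2018"))
```

In the notebook these could be applied column-wise, for example `df['word_ct'] = df.text_clean.apply(lambda t: text_stats(t)['word_ct'])`, and the tuples returned by `df.date.apply(parse_review_date)` could be unpacked into separate country and date columns.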


## Part 5: Save Data
* Save cleaned data to CSV

In [None]:
# Function to find the polarity of a single review (-1 = negative, +1 = positive)
def polarity(text):
    return TextBlob(text).sentiment.polarity

# Depending on the size of your data, this step may take some time.
df['polarity'] = df['text'].apply(polarity)
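
The saving step listed for this part is not shown in the notebook. A minimal sketch, assuming a small stand-in frame (`cleaned`) and a made-up output file name, since the notebook does not specify one:

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the cleaned reviews frame
cleaned = pd.DataFrame({
    "rating": [2.0, 5.0],
    "text_clean": ["bought furbo birthday gift", "awesome product two dogs"],
})

# Hypothetical output location; adjust to the project's own path
out_path = os.path.join(tempfile.gettempdir(), "dog-cameras-clean.csv")

# index=False avoids writing the row index as an extra unnamed column
cleaned.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to confirm the round trip
restored = pd.read_csv(out_path)
print(restored.shape)
```

Writing without the index keeps a later `pd.read_csv` from producing an `Unnamed: 0` column.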