DATASET: Consumer reviews of Amazon products (the first 5,000 rows of 1429_1.csv; only the reviews.text column is used).

Step 1: Import Libraries

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string


Step 2: Load the dataset

In [6]:
full_df = pd.read_csv('/content/1429_1.csv', nrows=5000)
# Take an explicit copy so that adding columns to df later does not raise SettingWithCopyWarning
df = full_df[["reviews.text"]].copy()
df["reviews.text"] = df["reviews.text"].astype(str)
full_df.head()


Unnamed: 0,id,name,asins,brand,categories,keys,manufacturer,reviews.date,reviews.dateAdded,reviews.dateSeen,...,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username
0,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,This product so far has not disappointed. My c...,Kindle,,,Adapter
1,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,great for beginner or experienced person. Boug...,very fast,,,truman
2,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,Inexpensive tablet for him to use and learn on...,Beginner tablet for our 9 year old son.,,,DaveZ
3,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,4.0,http://reviews.bestbuy.com/3545/5620406/review...,I've had my Fire HD 8 two weeks now and I love...,Good!!!,,,Shacks
4,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-12T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,I bought this for my grand daughter when she c...,Fantastic Tablet for kids,,,explore42


# Lower Casing:

The idea is to convert the input text to a single casing so that 'text', 'Text' and 'TEXT' are treated the same way.

By default, lower casing is done by most modern vectorizers and tokenizers, such as sklearn's TfidfVectorizer and the Keras Tokenizer.

So when case carries information for our use case, we need to switch this off explicitly (lowercase=False in TfidfVectorizer, lower=False in the Keras Tokenizer).

In [7]:
df["text_lower"] = df["reviews.text"].str.lower()
df.head()



Unnamed: 0,reviews.text,text_lower
0,This product so far has not disappointed. My c...,this product so far has not disappointed. my c...
1,great for beginner or experienced person. Boug...,great for beginner or experienced person. boug...
2,Inexpensive tablet for him to use and learn on...,inexpensive tablet for him to use and learn on...
3,I've had my Fire HD 8 two weeks now and I love...,i've had my fire hd 8 two weeks now and i love...
4,I bought this for my grand daughter when she c...,i bought this for my grand daughter when she c...
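
A small standard-library sketch of why casing matters when tokens are counted. The sample sentence is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

# Made-up sample text (an assumption for illustration, not from the dataset)
sample = "Text is text and TEXT is still text"

# Count whitespace-separated tokens before and after lower casing
counts_raw = Counter(sample.split())
counts_lower = Counter(sample.lower().split())

print(counts_raw["text"])    # occurrences of the exact token 'text'
print(counts_lower["text"])  # occurrences once all variants are lower cased
```

Without lower casing, 'Text' and 'TEXT' are counted as separate tokens from 'text'.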


# Removal of Punctuations:

Another common text preprocessing technique is to remove punctuation from the text data. This is again a text standardization step that helps to treat 'hurray' and 'hurray!' in the same way.

We also need to choose the list of punctuation marks to exclude carefully, depending on the use case. For example, string.punctuation in Python contains the following symbols:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

We can add or remove symbols as needed.

In [13]:
# drop the new column created in last cell
#df.drop(["text_lower"], axis=1, inplace=True)

PUNCT_TO_REMOVE = string.punctuation
# Build the translation table once instead of on every call
PUNCT_TABLE = str.maketrans('', '', PUNCT_TO_REMOVE)

def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(PUNCT_TABLE)

df["text_wo_punct"] = df["reviews.text"].apply(remove_punctuation)
df.head()



Unnamed: 0,reviews.text,text_wo_punct
0,This product so far has not disappointed. My c...,This product so far has not disappointed My ch...
1,great for beginner or experienced person. Boug...,great for beginner or experienced person Bough...
2,Inexpensive tablet for him to use and learn on...,Inexpensive tablet for him to use and learn on...
3,I've had my Fire HD 8 two weeks now and I love...,Ive had my Fire HD 8 two weeks now and I love ...
4,I bought this for my grand daughter when she c...,I bought this for my grand daughter when she c...
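
The punctuation set can be tailored to the use case. In row 3 of the output above, "I've" became "Ive" because the apostrophe was stripped. The following is a minimal sketch, assuming we want to keep apostrophes and hyphens; the names KEEP, CUSTOM_PUNCT and remove_punctuation_custom are illustrative and not part of the notebook.

```python
import string

# Assumption: apostrophes and hyphens should survive, everything else is removed
KEEP = "'-"
CUSTOM_PUNCT = "".join(ch for ch in string.punctuation if ch not in KEEP)
CUSTOM_TABLE = str.maketrans("", "", CUSTOM_PUNCT)

def remove_punctuation_custom(text):
    """Remove all punctuation except the characters listed in KEEP."""
    return text.translate(CUSTOM_TABLE)

sample = "I've had my Fire HD 8 two weeks now, and I love it!"
cleaned = remove_punctuation_custom(sample)
print(cleaned)
```

Keeping the apostrophe also keeps contractions such as "don't" in the form in which they appear in common stopword lists.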


# Removal of stopwords:
Stopwords are commonly occurring words in a language, like 'the', 'a' and so on.

They can be removed from the text most of the time, as they don't provide valuable information for downstream analysis.

In cases like part-of-speech (POS) tagging, we should not remove them, as they provide very valuable information about the POS.

Stopword lists are already compiled for many languages and we can use them directly.

For example, the stopword list for the English language from the nltk package can be seen below.

In [14]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [15]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [16]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    # The NLTK list is lower case and this comparison is case-sensitive,
    # so capitalized stopwords such as 'This' and 'My' are kept (see the output below)
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df["text_wo_stop"] = df["text_wo_punct"].apply(remove_stopwords)
df.head()



Unnamed: 0,reviews.text,text_wo_punct,text_wo_stop
0,This product so far has not disappointed. My c...,This product so far has not disappointed My ch...,This product far disappointed My children love...
1,great for beginner or experienced person. Boug...,great for beginner or experienced person Bough...,great beginner experienced person Bought gift ...
2,Inexpensive tablet for him to use and learn on...,Inexpensive tablet for him to use and learn on...,Inexpensive tablet use learn step NABI He thri...
3,I've had my Fire HD 8 two weeks now and I love...,Ive had my Fire HD 8 two weeks now and I love ...,Ive Fire HD 8 two weeks I love This tablet gre...
4,I bought this for my grand daughter when she c...,I bought this for my grand daughter when she c...,I bought grand daughter comes visit I set user...
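
In the output above, capitalized words such as 'This' and 'My' survive because the stopword list is lower case and the membership test is case-sensitive. Below is a minimal sketch of a case-insensitive variant. It assumes a small hand-picked subset of the list printed earlier so that it runs without nltk; in the notebook, the full STOPWORDS set would be used instead.

```python
# Small illustrative subset of the English stopword list (an assumption for
# this sketch; the full NLTK list would normally be used)
STOPWORDS_SUBSET = {"this", "so", "has", "not", "my", "i", "the", "for", "and"}

def remove_stopwords_ci(text, stopwords=STOPWORDS_SUBSET):
    """Remove stopwords, comparing on the lower-cased form of each word."""
    return " ".join(w for w in str(text).split() if w.lower() not in stopwords)

sample = "This product so far has not disappointed My children love it"
result = remove_stopwords_ci(sample)
print(result)
```

Note that 'not' is part of the list, so "has not disappointed" is reduced to "disappointed". For tasks where negation matters, it may be worth removing such words from the stopword set first.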


# Removal of Frequent words


In the previous preprocessing step, we removed stopwords based on language information. But if we have a domain-specific corpus, it may also contain frequent words that are of little importance to us.

So this step removes the most frequent words in the given corpus. If we use something like TF-IDF, this is taken care of automatically, because words that occur in most documents receive a low weight.

Let us get the most common words and then remove them in the next step.

In [17]:
from collections import Counter
cnt = Counter()
for text in df["text_wo_stop"].values:
    # count every whitespace-separated token of the review
    cnt.update(text.split())

cnt.most_common(10)

[('I', 4381),
 ('tablet', 2015),
 ('great', 1248),
 ('use', 1113),
 ('price', 926),
 ('The', 887),
 ('good', 743),
 ('This', 741),
 ('Kindle', 708),
 ('one', 686)]

In [18]:
# The ten most frequent tokens counted in the previous cell
FREQWORDS = {w for (w, wc) in cnt.most_common(10)}
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_wo_stopfreq"] = df["text_wo_stop"].apply(remove_freqwords)
df.head()



Unnamed: 0,reviews.text,text_wo_punct,text_wo_stop,text_wo_stopfreq
0,This product so far has not disappointed. My c...,This product so far has not disappointed My ch...,This product far disappointed My children love...,product far disappointed My children love like...
1,great for beginner or experienced person. Boug...,great for beginner or experienced person Bough...,great beginner experienced person Bought gift ...,beginner experienced person Bought gift loves
2,Inexpensive tablet for him to use and learn on...,Inexpensive tablet for him to use and learn on...,Inexpensive tablet use learn step NABI He thri...,Inexpensive learn step NABI He thrilled learn ...
3,I've had my Fire HD 8 two weeks now and I love...,Ive had my Fire HD 8 two weeks now and I love ...,Ive Fire HD 8 two weeks I love This tablet gre...,Ive Fire HD 8 two weeks love valueWe Prime Mem...
4,I bought this for my grand daughter when she c...,I bought this for my grand daughter when she c...,I bought grand daughter comes visit I set user...,bought grand daughter comes visit set user ent...
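
The remark in this section that TF-IDF already handles frequent words can be illustrated with the inverse document frequency alone. The sketch below uses the plain textbook form idf = log(N / df) on three made-up reviews. Libraries such as scikit-learn use smoothed variants, so their exact numbers differ.

```python
import math
from collections import Counter

# Made-up mini corpus (an assumption for illustration, not from the dataset)
docs = [
    "great tablet for the price",
    "tablet works great",
    "battery life could be better on this tablet",
]

n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

# Plain inverse document frequency, idf = log(N / df)
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

for word in ("tablet", "great", "battery"):
    print(word, round(idf[word], 3))
```

A word that occurs in every document, like 'tablet' here, gets an idf of zero and therefore contributes nothing, which has the same effect as removing it.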


# Removal of Rare words
This is very similar to the previous preprocessing step, but here we remove the rare words from the corpus.

In [19]:
# Drop the two columns which are no longer needed
df.drop(["text_wo_punct", "text_wo_stop"], axis=1, inplace=True)

n_rare_words = 10
# most_common() sorts by descending count, so the reversed slice picks the n_rare_words least frequent tokens
RAREWORDS = {w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]}
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text_wo_stopfreqrare"] = df["text_wo_stopfreq"].apply(remove_rarewords)
df.head()



Unnamed: 0,reviews.text,text_wo_stopfreq,text_wo_stopfreqrare
0,This product so far has not disappointed. My c...,product far disappointed My children love like...,product far disappointed My children love like...
1,great for beginner or experienced person. Boug...,beginner experienced person Bought gift loves,beginner experienced person Bought gift loves
2,Inexpensive tablet for him to use and learn on...,Inexpensive learn step NABI He thrilled learn ...,Inexpensive learn step NABI He thrilled learn ...
3,I've had my Fire HD 8 two weeks now and I love...,Ive Fire HD 8 two weeks love valueWe Prime Mem...,Ive Fire HD 8 two weeks love valueWe Prime Mem...
4,I bought this for my grand daughter when she c...,bought grand daughter comes visit set user ent...,bought grand daughter comes visit set user ent...


We can combine all the word lists (stopwords, frequent words and rare words) into a single set and remove them in one pass.
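
A minimal sketch of that single-pass idea, using only the standard library. The helper names build_removal_set and remove_words, the toy corpus and the two-word stopword set are all assumptions made for illustration.

```python
from collections import Counter

def build_removal_set(texts, stopwords, n_freq, n_rare):
    """Union of the stopwords, the n_freq most frequent and the n_rare rarest remaining tokens."""
    stopwords = set(stopwords)
    counts = Counter(w for t in texts for w in t.split() if w not in stopwords)
    ranked = counts.most_common()  # sorted by descending count
    frequent = {w for w, _ in ranked[:n_freq]}
    rare = {w for w, _ in ranked[max(len(ranked) - n_rare, 0):]}
    return stopwords | frequent | rare

def remove_words(text, to_remove):
    """Drop every token that is in to_remove, in one pass over the text."""
    return " ".join(w for w in text.split() if w not in to_remove)

# Toy corpus and stopword set (assumptions, not from the dataset)
texts = ["the tablet is great", "the tablet is great", "the tablet cracked"]
toy_stopwords = {"the", "is"}

to_remove = build_removal_set(texts, toy_stopwords, n_freq=1, n_rare=1)
cleaned = [remove_words(t, to_remove) for t in texts]
print(sorted(to_remove))
print(cleaned)
```

Each text is split and filtered once instead of three times, which matters more as the corpus grows.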

# Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.

For example, if the corpus contains the two words walks and walking, stemming strips the suffixes and reduces both to walk. But in another example, with the two words console and consoling, the stemmer removes the suffixes and reduces both to consol, which is not a proper English word.

There are several types of stemming algorithms available; one of the best known and most widely used is the Porter stemmer. We can use the nltk package for this.

In [23]:
from nltk.stem.porter import PorterStemmer

# Drop the two columns; errors="ignore" keeps the cell re-runnable once they are already gone
df.drop(["text_wo_stopfreq", "text_wo_stopfreqrare"], axis=1, inplace=True, errors="ignore")

stemmer = PorterStemmer()
def stem_words(text):
    """custom function to stem each whitespace-separated word"""
    return " ".join([stemmer.stem(word) for word in text.split()])

df["text_stemmed"] = df["reviews.text"].apply(stem_words)
df.head()

Stemming also chops the trailing e off words like private and propose (giving privat and propos). This is not intended. What can we do about that? We can use lemmatization in such cases.

Also, the Porter stemmer is for the English language. If we are working with other languages, we can use the Snowball stemmer. The languages supported by the Snowball stemmer are:

In [24]:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')
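
To make the suffix-stripping idea from the start of this section concrete, here is a deliberately naive sketch. It is not the Porter algorithm (which applies several ordered rule phases with conditions on the remaining stem); the suffix list and the minimum stem length are assumptions chosen only to reproduce the walk and consol examples.

```python
# Deliberately naive suffix stripper (an illustration only, NOT the Porter algorithm)
SUFFIXES = ("ing", "es", "s", "e")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["walks", "walking", "console", "consoling"]:
    print(w, "->", naive_stem(w))
```

Because the rules only look at the spelling, nothing guarantees that the result is a dictionary word, which is what motivates lemmatization.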

# Lemmatization
Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it differs in making sure that the root word (also called the lemma) belongs to the language.

As a result, it is generally slower than stemming. So depending on the speed requirement, we can choose either stemming or lemmatization.

Let us use the WordNetLemmatizer in nltk to lemmatize our sentences.

In [25]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [27]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    """custom function to lemmatize each whitespace-separated word"""
    # lemmatize() treats every word as a noun unless a pos argument is given
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df["text_lemmatized"] = df["reviews.text"].apply(lemmatize_words)
df.head()



Unnamed: 0,reviews.text,text_lemmatized
0,This product so far has not disappointed. My c...,This product so far ha not disappointed. My ch...
1,great for beginner or experienced person. Boug...,great for beginner or experienced person. Boug...
2,Inexpensive tablet for him to use and learn on...,Inexpensive tablet for him to use and learn on...
3,I've had my Fire HD 8 two weeks now and I love...,I've had my Fire HD 8 two week now and I love ...
4,I bought this for my grand daughter when she c...,I bought this for my grand daughter when she c...
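
In the output above, 'has' became 'ha'. WordNetLemmatizer.lemmatize treats each word as a noun unless a pos argument is passed, so verbs are often handled poorly. The sketch below does not use WordNet; it relies on a hypothetical miniature lexicon keyed by word and part of speech, only to show how the part of speech changes a dictionary lookup.

```python
# Hypothetical miniature lexicon keyed by (word, part of speech); a real
# lemmatizer such as WordNetLemmatizer consults a full dictionary instead
TOY_LEXICON = {
    ("has", "v"): "have",
    ("weeks", "n"): "week",
    ("children", "n"): "child",
    ("loves", "v"): "love",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos) if the lexicon knows it, else the word unchanged."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("weeks"))         # looked up as a noun
print(toy_lemmatize("has"))           # no noun entry, returned unchanged
print(toy_lemmatize("has", pos="v"))  # looked up as a verb
```

With the real lemmatizer, the same effect is usually obtained by tagging each word with its part of speech first and passing the matching pos value to lemmatize.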
