In [1]:
import pandas as pd
reviews = pd.read_csv('reviews.csv')

In [2]:
reviews.sample()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
528184,528185,B004981W5G,A1OIFEOB52CSIL,sunny,0,0,4,1349308800,I like these chips!,"I oederd these chips,<br />And I enjoyed that!..."


In [3]:
reviews.shape

(568454, 10)

# Data Preprocessing

## 1. Keep only the Score, Summary, and Text columns

In [4]:
df = reviews.filter(['Score','Summary','Text'], axis = 1)
df.iloc[:5]

Unnamed: 0,Score,Summary,Text
0,5,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,4,"""Delight"" says it all",This is a confection that has been around a fe...
3,2,Cough Medicine,If you are looking for the secret ingredient i...
4,5,Great taffy,Great taffy at a great price. There was a wid...
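
`DataFrame.filter` with a list of labels and `axis=1` keeps the listed columns. Bracket indexing with a list of column names gives the same result when every column exists; the difference is that `filter` silently skips labels that are missing, while brackets raise a `KeyError`. A minimal sketch on a made-up frame (the values are illustrative, not taken from reviews.csv):

```python
import pandas as pd

# Tiny illustrative frame; the column names mirror the notebook, the values are invented.
demo = pd.DataFrame({
    'Id': [1, 2, 3],
    'Score': [5, 1, 4],
    'Text': ['good', 'bad', 'fine'],
})

# Both calls keep only the Score and Text columns.
by_filter = demo.filter(['Score', 'Text'], axis=1)
by_brackets = demo[['Score', 'Text']]

print(by_filter.equals(by_brackets))
```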


## 2. Splitting the data into train and test

In [5]:
# finding the number of rows in our dataset
rows = df.shape[0]
rows

568454

In [6]:
train_size = int(rows * 0.70)
test_size = rows - train_size

In [7]:
rows == train_size + test_size

True

In [8]:
train_data = pd.DataFrame(columns=df.columns)
test_data  = pd.DataFrame(columns=df.columns)

In [9]:
print("Train Data : {}".format(train_data))
print("\nTest Data: {}".format(test_data))
print("\ntest_size : {} train_size : {}".format(test_size, train_size))

Train Data : Empty DataFrame
Columns: [Score, Summary, Text]
Index: []

Test Data: Empty DataFrame
Columns: [Score, Summary, Text]
Index: []

test_size : 170537 train_size : 397917


In [10]:
import random

# indices holds every row position from 0 to rows - 1
indices = list(range(rows))
random.shuffle(indices)

# the first train_size shuffled positions go to train, the remainder to test
trnind = indices[:train_size]
tstind = indices[train_size:]

print(trnind[:10])
print(tstind[:10])

[409845, 553353, 225845, 140136, 463841, 148970, 567397, 273502, 225530, 549318]
[455334, 404664, 171311, 1361, 320521, 554661, 61624, 487900, 278675, 334969]


In [11]:
# select the training rows by the shuffled positions in trnind
train_data = df.iloc[trnind]
train_data.shape

(397917, 3)

In [12]:
# select the test rows by the shuffled positions in tstind
test_data = df.iloc[tstind]
test_data.shape

(170537, 3)

In [13]:
train_data.iloc[:3]

Unnamed: 0,Score,Summary,Text
409845,5,Perfect Dark Chocolate Experience,"*****<br />Well, this is MY dark chocolate. Cr..."
553353,5,Fresh Quality,I swear that the cupcakes I ate were more fres...
225845,5,I was wrong,"I had originally given this 1 star, since my c..."


In [14]:
test_data.iloc[:3]

Unnamed: 0,Score,Summary,Text
455334,3,Tastes ok but nothing special,Maybe the claims are true but who can really t...
404664,5,Love these,Love the individual packaging because you don'...
171311,5,Healthy and tasty cat food,Wellness is the only canned food I buy for my ...
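
The shuffle above is not seeded, so the split changes on every run. As a hedged alternative, the sketch below does the same 70/30 split with `DataFrame.sample` and a fixed `random_state`. The helper name `split_frame` and the seed value are assumptions made for illustration, and the approach assumes the frame has a unique index.

```python
import pandas as pd

def split_frame(frame, train_fraction=0.70, seed=0):
    """Return (train, test) with a reproducible random split.

    Hypothetical helper, not part of the notebook. Assumes frame.index is unique.
    """
    train_size = int(len(frame) * train_fraction)
    # sample without replacement; random_state makes the choice repeatable
    train = frame.sample(n=train_size, random_state=seed)
    # everything that was not sampled becomes the test set
    test = frame.drop(train.index)
    return train, test

demo = pd.DataFrame({'Score': range(10), 'Text': list('abcdefghij')})
train_demo, test_demo = split_frame(demo)
print(train_demo.shape, test_demo.shape)
```

Calling `split_frame(demo)` twice with the same seed returns the same rows, which makes later results easier to compare across runs.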


## 3. Some Preprocessing

* ___Removing html tags___

* ___Removing all special characters___

* ___Converting all text to lowercase___

* ___Removing stop words___

* ___Stemming words using Porter Stemming algorithm___
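
To make the first four steps concrete, the sketch below traces one string through them. It is an illustration only: `DEMO_STOP_WORDS` is a tiny made-up stop word set rather than NLTK's English list, `trace_cleaning` is a hypothetical helper, and stemming is left out so the example needs nothing beyond the standard library.

```python
import re

# Made-up mini stop word set, for illustration only.
DEMO_STOP_WORDS = {'i', 'these', 'and', 'that'}

def trace_cleaning(text):
    """Return the intermediate result of each cleaning step."""
    # 1. replace html tags with a space
    no_tags = re.sub(r'<.*?>', ' ', text)
    # 2. replace runs of non-alphanumeric characters with a space
    no_special = re.sub(r'[^A-Za-z0-9]+', ' ', no_tags)
    # 3. lowercase and split into words
    tokens = no_special.lower().split()
    # 4. drop stop words
    kept = [t for t in tokens if t not in DEMO_STOP_WORDS]
    return no_tags, no_special, tokens, kept

no_tags, no_special, tokens, kept = trace_cleaning(
    'I like these chips,<br />And I enjoyed that!'
)
print(kept)  # ['like', 'chips', 'enjoyed']
```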

In [15]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# build the stop word set once instead of on every call
# (needs the NLTK stop word corpus: nltk.download('stopwords'))
# a set uses hashing, so lookup is O(1) on average, whereas a list lookup is O(n)
english_stop_words = set(stopwords.words('english'))

In [16]:
def preProcessString(text):
    # remove all html tags
    text = re.sub(r'<.*?>', ' ', str(text))

    # replace every run of non-alphanumeric characters with a single space
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)

    # lowercase the text and split it into words for further processing
    text_list = text.lower().split()

    # remove stop words
    text_list = [word for word in text_list if word not in english_stop_words]

    # stem each word (strip common suffixes) with the Porter stemming algorithm
    text_list = [ps.stem(word) for word in text_list]

    return ' '.join(text_list)
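
The code comments above note that the stop words are held in a set because membership tests on a set are O(1) on average, while a list needs an O(n) scan. A rough way to see the difference is to time both on the same data. The word list here is synthetic and the timings will vary by machine, so treat this as a sketch rather than a benchmark.

```python
import timeit

# Synthetic vocabulary standing in for a stop word list.
words = ['w{}'.format(i) for i in range(1000)]
as_list = list(words)
as_set = set(words)

# Look up the last word, which is the worst case for the list scan.
list_time = timeit.timeit(lambda: 'w999' in as_list, number=2000)
set_time = timeit.timeit(lambda: 'w999' in as_set, number=2000)

print('list: {:.6f}s  set: {:.6f}s'.format(list_time, set_time))
```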

In [17]:
# train_data and test_data were taken from df with .iloc, so assigning to their
# columns raises SettingWithCopyWarning; working on explicit copies avoids it
train_data = train_data.copy()
test_data = test_data.copy()
train_data['Summary'] = train_data['Summary'].apply(preProcessString)

In [18]:
train_data['Text'] = train_data['Text'].apply(preProcessString)

In [19]:
test_data['Summary'] = test_data['Summary'].apply(preProcessString)

In [20]:
test_data['Text'] = test_data['Text'].apply(preProcessString)


In [26]:
# let's check some random datapoint of test data
test_data.sample()

Unnamed: 0,Score,Summary,Text
23449,4,easi tasti,use shake n bake chicken year never thought us...


In [27]:
# let's check some random datapoint of train data
train_data.sample()

Unnamed: 0,Score,Summary,Text
39340,5,red hot,bought daughter canning appl slice delight get


In [39]:
# combine the two datasets so they can be saved locally as a single csv file
edited_reviews = pd.concat([train_data, test_data])

In [47]:
edited_reviews.to_csv('preprocessed_reviews_SST.csv', index=False)

In [49]:
# same number of rows as the original dataframe (no rows lost)
edited_reviews.shape

(568454, 3)
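
Once the two frames are concatenated and written with `index=False`, the file no longer records which rows were train and which were test. One possible way to keep that information, sketched below on made-up data, is to add a label column before saving. The column name `split` is an assumption for illustration, and an in-memory buffer stands in for the csv file.

```python
import io
import pandas as pd

# Made-up stand-ins for the preprocessed train and test frames.
train_demo = pd.DataFrame({'Score': [5, 1], 'Text': ['good', 'bad']}, index=[3, 0])
test_demo = pd.DataFrame({'Score': [4], 'Text': ['fine']}, index=[2])

# Tag each row with its origin before combining.
combined = pd.concat([
    train_demo.assign(split='train'),
    test_demo.assign(split='test'),
])

# Round-trip through csv text, as to_csv / read_csv would do with a file.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Recover the two parts from the label column.
restored_train = restored[restored['split'] == 'train'].drop(columns='split')
restored_test = restored[restored['split'] == 'test'].drop(columns='split')
print(restored_train.shape, restored_test.shape)
```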