[Preprocessing: Cleaning Data](#Preprocessing:-Cleaning-Data)

1. [Import Data](#Import-Data)    
2. [Breaking a Large String Into Smaller Strings](#Breaking-a-Large-String-Into-Smaller-Strings)   
      a. [Individual Words](#Individual-Words)      
      b. [Getting Word Counts](#Getting-Word-Counts)    
      c. [Clear Limitations of Built-In `str` Methods](#Clear-Limitations-of-Built-In-`str`-Methods)     
3. [Conclusions](#Conclusions)

# Preprocessing: Cleaning Data

There are numerous steps that can be taken to help put all text on equal footing, many of which involve the comparatively simple ideas of substitution or removal. They are, however, no less important to the overall process. These include:   

* set all characters to lowercase
* remove punctuation (generally part of tokenization, but still worth keeping in mind at this stage, even as confirmation)
* remove numbers (or convert numbers to textual representations)
* strip white space (also generally part of tokenization)
* remove default stop words (general English stop words)
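
As a rough illustration of the steps listed above, the sketch below applies them to a single plain Python string. The function name `clean_text` and the tiny `STOP_WORDS` set are made up for this example; real stop-word lists are much longer, and the rest of this notebook works on a pandas column instead.

```python
import re

# Hypothetical, deliberately tiny stop-word list used only for illustration
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}

def clean_text(text):
    text = text.lower()                  # set all characters to lowercase
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove numbers
    words = text.split()                 # split on whitespace, dropping extra spaces
    return [w for w in words if w not in STOP_WORDS]  # remove stop words

print(clean_text("The Product is GREAT, and I bought 2 of them!"))
```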

## Import Data

The Data Folder includes an excerpt from [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?datasetId=18), in a file called `Amazon Reviews.csv`.   

I have reduced it to a smaller file called `Food_Review.csv`.

In [4]:
import pandas as pd
df = pd.read_csv('Food_Review.csv')


[jupyter and pandas display](http://songhuiming.github.io/pages/2017/04/02/jupyter-and-pandas-display/) is a good resource for getting the most out of Jupyter's display of pandas objects.

In [6]:
print("The shape of the data set is ", df.shape)
df.head(2)

The shape of the data set is  (1000, 2)


Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...


In [7]:
df['Text'].head(3)

0    I have bought several of the Vitality canned d...
1    Product arrived labeled as Jumbo Salted Peanut...
2    This is a confection that has been around a fe...
Name: Text, dtype: object

In [8]:
# Show the full content of each cell (automatic line breaks, multi-line cells).
# None means "no limit"; the older value -1 is deprecated.
pd.set_option('display.max_colwidth', None)


In [9]:
#suppress all warnings with this
import warnings
warnings.filterwarnings("ignore")

In [10]:
df['Text'].head(3)

0    I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.                                                                                                                                                                                                                                                      
1    Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".                                                                                                                                                                                                                                                                                                  

## Breaking a Large String Into Smaller Strings

A big task in preparing string data is breaking the string into smaller substrings. In this notebook we'll focus on breaking our [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?datasetId=18) excerpt into individual words, then we'll look into trying to make individual sentences. Our goal by the end of this notebook is to be able to take in our excerpt and return a word count pandas dataframe.

### Individual Words
`str.split()`.    

The `split` function inherent to all `str` objects in python allows you to take a string and break it into a list of substrings based on the input it is given.
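
A quick sketch of how `split` behaves on a plain Python string (the sample sentence is made up): with no argument it splits on any run of whitespace, while an explicit separator splits on exactly that substring.

```python
sentence = "My Labrador  is finicky."

# No argument: split on runs of whitespace, so the double space is ignored
by_whitespace = sentence.split()

# Explicit separator: split on every single space, which leaves an empty string
by_space = sentence.split(" ")

print(by_whitespace)
print(by_space)
```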

In [7]:
df['Text'].head(2).str.split()

0    [I, have, bought, several, of, the, Vitality, canned, dog, food, products, and, have, found, them, all, to, be, of, good, quality., The, product, looks, more, like, a, stew, than, a, processed, meat, and, it, smells, better., My, Labrador, is, finicky, and, she, appreciates, this, product, better, than, most.]
1    [Product, arrived, labeled, as, Jumbo, Salted, Peanuts...the, peanuts, were, actually, small, sized, unsalted., Not, sure, if, this, was, an, error, or, if, the, vendor, intended, to, represent, the, product, as, "Jumbo".]                                                                                         
Name: Text, dtype: object

Since we want words, let's first lowercase every word in our dataframe.  
This is accomplished by using `.str.lower()`.

The `str.lower()` method will take all `A-Z` characters in the string and turn them into their corresponding `a-z` form.

In [11]:
# We lowercase all strings
df['Text_clean'] = df['Text'].str.lower()

In [13]:
df['Text_clean'].head(1)

0    i have bought several of the vitality canned dog food products and have found them all to be of good quality. the product looks more like a stew than a processed meat and it smells better. my labrador is finicky and she appreciates this product better than  most.
Name: Text_clean, dtype: object

`str.replace()`
We can replace any specified substring within a string with another specified substring using `str.replace()`. This can help us eliminate the pesky punctuation.
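
On a single Python string the idea looks like the sketch below (the sample text is made up). On a pandas column the substring version of this method lives under the `.str` accessor, i.e. `Series.str.replace`; `Series.replace` without `.str` swaps whole cell values instead.

```python
text = 'not sure if this was an error... or "jumbo" (really?)'

# Remove each punctuation mark in turn; str.replace returns a new string
for mark in [",", ".", "!", "?", "'", '"', "(", ")"]:
    text = text.replace(mark, "")

print(text)
```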

In [14]:
### Some substrings we'll want to remove are:
## , ",", ".", "!", "?", "\'", '\"', "-", "(", ")"

# Series.replace() only swaps whole cell values; .str.replace() works on substrings
df['Text_cleaned'] = df['Text_clean'].str.replace(",", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace(".", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("!", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("?", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("\'", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace('\"', "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("-", " ", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("(", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace(")", "", regex=False)

In [15]:
# Here we clean the content by removing all the punctuation
# (anything that is not a word character or white space)
df['Text_clean'] = df['Text_clean'].str.replace(r'[^\w\s]', '', regex=True)

In [18]:
df['Text_clean'].tail(20)

980    fast easy and definitely delicious  makes a great cup of coffee and very easy to make  good purchase  will continue to order from herebr thanx                                                                                                                                                                                                                                                                                                                                                                             
981    i gave this tea a try to add variety to my tea habit and im glad i did it has a somewhat sweet and almost minty taste and a wonderful aroma it has a nice reddish color this tea probably has more of a taste factor than any of the wonderful japanese green matcha tea ive had and i look forward to experiencing any of the purported health benefits of rooibos that ive read about                                                                                                       

### Converting Digits to Words   
Import the `re` library, make sure your column is of type `str`, and use `(?<!\S)\d+(?!\S)` to match runs of digits that are bounded on each side by whitespace or by the start/end of the string. If you only want to match entries that consist entirely of digits, use the regex `^\d+$` instead.
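
To make the difference between the two patterns concrete, here is a small sketch using made-up sample strings. It only shows what each pattern matches; converting the matches to words is a separate step.

```python
import re

standalone = re.compile(r"(?<!\S)\d+(?!\S)")  # digits bounded by whitespace or string edges
whole_entry = re.compile(r"^\d+$")            # the entire string must be digits

sample = "established in 1845 weighs 45kg"

# "45kg" is skipped because its digits touch letters
print(standalone.findall(sample))
# The sample contains more than digits, so the whole-entry pattern does not match
print(bool(whole_entry.search(sample)))
print(bool(whole_entry.search("1845")))
```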


In [13]:
import re
import inflect

# inflect's engine turns digit strings such as "1845" into English words
p = inflect.engine()

In [14]:
def digits_to_words(text):
    # Replace each standalone run of digits with its spelled-out form
    return re.sub(r'(?<!\S)\d+(?!\S)', lambda m: p.number_to_words(m.group()), text)

In [19]:
# Here we clean the content by removing all the numbers
df['Text_nonumber'] = df['Text_clean'].str.replace(r'\d+', '', regex=True)

# Alternatively, here we convert the digits into words
df['Text_convnumber'] = df['Text_clean'].astype(str).apply(digits_to_words)

In [20]:
# picked some arbitrary rows to review.
df[['Text_clean','Text_nonumber']][16:20]

Unnamed: 0,Text_clean,Text_nonumber
16,i love eating them and they are good for watching tv and looking at movies it is not too sweet i like to transfer them to a zip lock baggie so they stay fresh so i can take my time eating them,i love eating them and they are good for watching tv and looking at movies it is not too sweet i like to transfer them to a zip lock baggie so they stay fresh so i can take my time eating them
17,i am very satisfied with my twizzler purchase i shared these with others and we have all enjoyed them i will definitely be ordering more,i am very satisfied with my twizzler purchase i shared these with others and we have all enjoyed them i will definitely be ordering more
18,twizzlers strawberry my childhood favorite candy made in lancaster pennsylvania by y s candies inc one of the oldest confectionery firms in the united states now a subsidiary of the hershey company the company was established in 1845 as young and smylie they also make apple licorice twists green color and blue raspberry licorice twists i like them allbr br i keep it in a dry cool place because is not recommended it to put it in the fridge according to the guinness book of records the longest licorice twist ever made measured 1200 feet 370 m and weighted 100 pounds 45 kg and was made by y s candies inc this recordbreaking twist became a guinness world record on july 19 1998 this product is kosher thank you,twizzlers strawberry my childhood favorite candy made in lancaster pennsylvania by y s candies inc one of the oldest confectionery firms in the united states now a subsidiary of the hershey company the company was established in as young and smylie they also make apple licorice twists green color and blue raspberry licorice twists i like them allbr br i keep it in a dry cool place because is not recommended it to put it in the fridge according to the guinness book of records the longest licorice twist ever made measured feet m and weighted pounds kg and was made by y s candies inc this recordbreaking twist became a guinness world record on july this product is kosher thank you
19,candy was delivered very fast and was purchased at a reasonable price i was home bound and unable to get to a store so this was perfect for me,candy was delivered very fast and was purchased at a reasonable price i was home bound and unable to get to a store so this was perfect for me


In [17]:
df['Text_clean'].head(1)

0    i have bought several of the vitality canned dog food products and have found them all to be of good quality the product looks more like a stew than a processed meat and it smells better my labrador is finicky and she appreciates this product better than  most
Name: Text_clean, dtype: object

In [18]:
# Here we clean the content by stripping leading and trailing white space
df['Text_clean'] = df['Text_clean'].str.strip()

In [19]:
df['Text_clean'].head(1)

0    i have bought several of the vitality canned dog food products and have found them all to be of good quality the product looks more like a stew than a processed meat and it smells better my labrador is finicky and she appreciates this product better than  most
Name: Text_clean, dtype: object
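
Note that `strip` only removes white space at the start and end of a string, which is why the double space in `than  most` is still present in the output above. One common way to also collapse interior runs of white space, sketched here on a made-up string:

```python
text = "  better than  most  "

stripped = text.strip()              # only the two ends are trimmed
collapsed = " ".join(text.split())   # split on whitespace runs, rejoin with one space

print(repr(stripped))
print(repr(collapsed))
```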

In [20]:
# Split on runs of non-word characters or underscores (raw string avoids escape warnings)
df['words'] = df['Text_clean'].str.strip().str.split(r'[\W_]+')

In [21]:
df['words'].head(1)

0    [i, have, bought, several, of, the, vitality, canned, dog, food, products, and, have, found, them, all, to, be, of, good, quality, the, product, looks, more, like, a, stew, than, a, processed, meat, and, it, smells, better, my, labrador, is, finicky, and, she, appreciates, this, product, better, than, most]
Name: words, dtype: object

In [22]:
#pd.set_option('display.max_colwidth', None) # Setting this so we can see the full content of cells

# picked some arbitrary rows to review.
df[['Text_clean','words']][16:20]

Unnamed: 0,Text_clean,words
16,i love eating them and they are good for watching tv and looking at movies it is not too sweet i like to transfer them to a zip lock baggie so they stay fresh so i can take my time eating them,"[i, love, eating, them, and, they, are, good, for, watching, tv, and, looking, at, movies, it, is, not, too, sweet, i, like, to, transfer, them, to, a, zip, lock, baggie, so, they, stay, fresh, so, i, can, take, my, time, eating, them]"
17,i am very satisfied with my twizzler purchase i shared these with others and we have all enjoyed them i will definitely be ordering more,"[i, am, very, satisfied, with, my, twizzler, purchase, i, shared, these, with, others, and, we, have, all, enjoyed, them, i, will, definitely, be, ordering, more]"
18,twizzlers strawberry my childhood favorite candy made in lancaster pennsylvania by y s candies inc one of the oldest confectionery firms in the united states now a subsidiary of the hershey company the company was established in 1845 as young and smylie they also make apple licorice twists green color and blue raspberry licorice twists i like them allbr br i keep it in a dry cool place because is not recommended it to put it in the fridge according to the guinness book of records the longest licorice twist ever made measured 1200 feet 370 m and weighted 100 pounds 45 kg and was made by y s candies inc this recordbreaking twist became a guinness world record on july 19 1998 this product is kosher thank you,"[twizzlers, strawberry, my, childhood, favorite, candy, made, in, lancaster, pennsylvania, by, y, s, candies, inc, one, of, the, oldest, confectionery, firms, in, the, united, states, now, a, subsidiary, of, the, hershey, company, the, company, was, established, in, 1845, as, young, and, smylie, they, also, make, apple, licorice, twists, green, color, and, blue, raspberry, licorice, twists, i, like, them, allbr, br, i, keep, it, in, a, dry, cool, place, because, is, not, recommended, it, to, put, it, in, the, fridge, according, to, the, guinness, book, of, records, the, longest, licorice, twist, ever, made, measured, 1200, feet, 370, m, and, weighted, 100, ...]"
19,candy was delivered very fast and was purchased at a reasonable price i was home bound and unable to get to a store so this was perfect for me,"[candy, was, delivered, very, fast, and, was, purchased, at, a, reasonable, price, i, was, home, bound, and, unable, to, get, to, a, store, so, this, was, perfect, for, me]"


### Getting Word Counts
Now that we have a list of the words used in the text we can write a quick loop to make a word count dataframe.

In [23]:
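words_list = df['Text_clean'].tolist()
# NOTE: ''.join glues the last word of each review to the first word of the next
# (e.g. 'mostproduct' in the counts below); ' '.join(words_list) avoids this
raw_text = ''.join(words_list)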

In [24]:
all_words = raw_text.split()

In [25]:
type(words_list)

list

In [26]:
all_words[:10]

['i',
 'have',
 'bought',
 'several',
 'of',
 'the',
 'vitality',
 'canned',
 'dog',
 'food']

In [27]:
### We'll make a temporary dictionary to hold the words
### Dictionaries are quite useful for word counts
word_dict = {}

## For each word in the text
for word in all_words:
    # if the word wasn't already in the dictionary
    if word not in word_dict.keys():
        # add it
        word_dict[word] = 1
    # otherwise
    else:
        # add 1 to the existing count
        word_dict[word] = word_dict[word] + 1
        
## NOTE In the future we could write this as a function
## then anytime we want a word count we just need to call the
## function!


# Let's examine the dictionary
word_dict

{'i': 1978,
 'have': 571,
 'bought': 83,
 'several': 28,
 'of': 1329,
 'the': 3099,
 'vitality': 1,
 'canned': 9,
 'dog': 46,
 'food': 208,
 'products': 41,
 'and': 2096,
 'found': 92,
 'them': 378,
 'all': 271,
 'to': 1517,
 'be': 279,
 'good': 303,
 'quality': 71,
 'product': 189,
 'looks': 16,
 'more': 176,
 'like': 407,
 'a': 1901,
 'stew': 2,
 'than': 199,
 'processed': 4,
 'meat': 16,
 'it': 1229,
 'smells': 4,
 'better': 116,
 'my': 603,
 'labrador': 1,
 'is': 1138,
 'finicky': 3,
 'she': 69,
 'appreciates': 1,
 'this': 859,
 'mostproduct': 1,
 'arrived': 29,
 'labeled': 2,
 'as': 433,
 'jumbo': 1,
 'salted': 10,
 'peanutsthe': 1,
 'peanuts': 11,
 'were': 197,
 'actually': 48,
 'small': 56,
 'sized': 12,
 'unsalted': 10,
 'not': 471,
 'sure': 53,
 'if': 234,
 'was': 467,
 'an': 137,
 'error': 2,
 'or': 290,
 'vendor': 4,
 'intended': 1,
 'represent': 1,
 'jumbothis': 1,
 'confection': 1,
 'that': 609,
 'has': 209,
 'been': 103,
 'around': 35,
 'few': 48,
 'centuries': 1,
 'light

In [28]:
# Now import pandas
import pandas as pd

In [29]:
print(pd.__version__)

1.1.3


In [30]:
# Now make the dataframe
# Note .count() is a native method for a dataframe object
# this is why I used times_used instead!
pa_word_counts = pd.DataFrame({'word':list(word_dict.keys()),
                               'times_used':list(word_dict.values())})

In [31]:
pa_word_counts.sort_values('times_used',ascending=False).head(25)

Unnamed: 0,word,times_used
5,the,3099
11,and,2096
0,i,1978
23,a,1901
15,to,1517
4,of,1329
28,it,1229
33,is,1138
75,in,888
37,this,859


Great!

As a note, you might think it's silly that we care about how many times the word `the` is used. Hold onto that thought for the next notebook(s).
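
As the note in the loop cell suggests, the counting logic could be wrapped in a function. Below is one possible sketch using `collections.Counter` from the standard library; the name `count_words` and the sample reviews are made up. It splits each text separately, so words from neighbouring reviews cannot be glued together.

```python
from collections import Counter

def count_words(texts):
    """Count whitespace-separated words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

reviews = ["good dog food", "the dog likes the food"]  # made-up sample data
counts = count_words(reviews)

print(counts.most_common(2))
```

With pandas imported, the result could then be turned into a dataframe with something like `pd.DataFrame(list(counts.items()), columns=['word', 'times_used'])`.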

### Practice
Okay, I've been talking a lot; now it's your turn to practice. The Data Folder also includes an excerpt from [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv), in a file called `IMDB Dataset.csv`.   

I have reduced it to a smaller file called `Movie_Review.csv`.

Your job is to produce a word count dataframe using what we learned above. This should take 5-10 minutes.

In [32]:
df = pd.read_csv('Movie_Review.csv')

In [33]:
df.head(2)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive


In [34]:
# We lowercase all strings
df['review_clean'] = df['review'].str.lower()

In [35]:
### Some substrings we'll want to remove are:
## , ",", ".", "!", "?", "\'", '\"', "-", "(", ")"

# Series.replace() only swaps whole cell values; .str.replace() works on substrings
df['review_cleaned'] = df['review_clean'].str.replace(",", "", regex=False)
df['review_cleaned'] = df['review_cleaned'].str.replace(".", "", regex=False)
df['review_cleaned'] = df['review_cleaned'].str.replace("!", "", regex=False)
df['review_cleaned'] = df['review_cleaned'].str.replace("?", "", regex=False)
df['review_cleaned'] = df['review_cleaned'].str.replace("\'", "", regex=False)
df['review_cleaned'] = df['review_cleaned'].str.replace('\"', "", regex=False)
df['review_cleaned'] = df['review_cleaned'].str.replace("-", " ", regex=False)
df['review_cleaned'] = df['review_cleaned'].str.replace("(", "", regex=False)
df['review_cleaned'] = df['review_cleaned'].str.replace(")", "", regex=False)

In [36]:
# Here we clean the content by removing all the punctuation
df['review_clean'] = df['review_clean'].str.replace(r'[^\w\s]', '', regex=True)

In [37]:
# Here we clean the content by removing all the numbers
df['review_nonumber'] = df['review_clean'].str.replace(r'\d+', '', regex=True)

In [38]:
# picked some arbitrary rows to review.
df[['review_clean','review_nonumber']][16:20]

Unnamed: 0,review_clean,review_nonumber
16,some films just simply should not be remade this is one of them in and of itself it is not a bad film but it fails to capture the flavor and the terror of the 1963 film of the same title liam neeson was excellent as he always is and most of the cast holds up with the exception of owen wilson who just did not bring the right feel to the character of luke but the major fault with this version is that it strayed too far from the shirley jackson story in its attempts to be grandiose and lost some of the thrill of the earlier film in a trade off for snazzier special effects again i will say that in and of itself it is not a bad film but you will enjoy the friction of terror in the older version much more,some films just simply should not be remade this is one of them in and of itself it is not a bad film but it fails to capture the flavor and the terror of the film of the same title liam neeson was excellent as he always is and most of the cast holds up with the exception of owen wilson who just did not bring the right feel to the character of luke but the major fault with this version is that it strayed too far from the shirley jackson story in its attempts to be grandiose and lost some of the thrill of the earlier film in a trade off for snazzier special effects again i will say that in and of itself it is not a bad film but you will enjoy the friction of terror in the older version much more
17,this movie made it into one of my top 10 most awful movies horrible br br there wasnt a continuous minute where there wasnt a fight with one monster or another there was no chance for any character development they were too busy running from one sword fight to another i had no emotional attachment except to the big bad machine that wanted to destroy them br br scenes were blatantly stolen from other movies lotr star wars and matrix br br examplesbr br the ghost scene at the end was stolen from the final scene of the old star wars with yoda obee one and vader br br the spider machine in the beginning was exactly like frodo being attacked by the spider in return of the kings elijah wood is the victim in both films and waitit hypnotizes stings its victim and wraps them upuh hellobr br and the whole machine vs humans theme was the matrixor terminatorbr br there are more examples but why waste the time and will someone tell me what was with the nazis nazis br br there was a juvenile story line rushed to a juvenile conclusion the movie could not decide if it was a childrens movie or an adult movie and wasnt much of either br br just awful a real disappointment to say the least save your money,this movie made it into one of my top most awful movies horrible br br there wasnt a continuous minute where there wasnt a fight with one monster or another there was no chance for any character development they were too busy running from one sword fight to another i had no emotional attachment except to the big bad machine that wanted to destroy them br br scenes were blatantly stolen from other movies lotr star wars and matrix br br examplesbr br the ghost scene at the end was stolen from the final scene of the old star wars with yoda obee one and vader br br the spider machine in the beginning was exactly like frodo being attacked by the spider in return of the kings elijah wood is the victim in both films and waitit hypnotizes stings its victim and wraps them upuh hellobr br and the whole machine vs humans theme was the matrixor terminatorbr br there are more examples but why waste the time and will someone tell me what was with the nazis nazis br br there was a juvenile story line rushed to a juvenile conclusion the movie could not decide if it was a childrens movie or an adult movie and wasnt much of either br br just awful a real disappointment to say the least save your money
18,i remember this filmit was the first film i had watched at the cinema the picture was dark in places i was very nervous it was back in 7475 my dad took me my brother sister to newbury cinema in newbury berkshire england i recall the tigers and the lots of snow in the film also the appearance of grizzly adams actor dan haggery i think one of the tigers gets shot and dies if anyone knows where to find this on dvd etc please let me knowthe cinema now has been turned in a fitness club which is a very big shame as the nearest cinema now is 20 miles away would love to hear from others who have seen this film or any other like it,i remember this filmit was the first film i had watched at the cinema the picture was dark in places i was very nervous it was back in my dad took me my brother sister to newbury cinema in newbury berkshire england i recall the tigers and the lots of snow in the film also the appearance of grizzly adams actor dan haggery i think one of the tigers gets shot and dies if anyone knows where to find this on dvd etc please let me knowthe cinema now has been turned in a fitness club which is a very big shame as the nearest cinema now is miles away would love to hear from others who have seen this film or any other like it
19,an awful film it must have been up against some real stinkers to be nominated for the golden globe theyve taken the story of the first famous female renaissance painter and mangled it beyond recognition my complaint is not that theyve taken liberties with the facts if the story were good that would perfectly fine but its simply bizarre by all accounts the true story of this artist would have made for a far better film so why did they come up with this dishwaterdull script i suppose there werent enough naked people in the factual version its hurriedly capped off in the end with a summary of the artists life we could have saved ourselves a couple of hours if theyd favored the rest of the film with same brevity,an awful film it must have been up against some real stinkers to be nominated for the golden globe theyve taken the story of the first famous female renaissance painter and mangled it beyond recognition my complaint is not that theyve taken liberties with the facts if the story were good that would perfectly fine but its simply bizarre by all accounts the true story of this artist would have made for a far better film so why did they come up with this dishwaterdull script i suppose there werent enough naked people in the factual version its hurriedly capped off in the end with a summary of the artists life we could have saved ourselves a couple of hours if theyd favored the rest of the film with same brevity


In [39]:
# Here we clean the content by stripping leading and trailing white space
df['review_clean'] = df['review_clean'].str.strip()

In [40]:
# Split on runs of non-word characters or underscores
df['words'] = df['review_clean'].str.strip().str.split(r'[\W_]+')

In [41]:
#pd.set_option('display.max_colwidth', None) # Setting this so we can see the full content of cells

# picked some arbitrary rows to review.
df[['review_clean','words']][18:20]

Unnamed: 0,review_clean,words
18,i remember this filmit was the first film i had watched at the cinema the picture was dark in places i was very nervous it was back in 7475 my dad took me my brother sister to newbury cinema in newbury berkshire england i recall the tigers and the lots of snow in the film also the appearance of grizzly adams actor dan haggery i think one of the tigers gets shot and dies if anyone knows where to find this on dvd etc please let me knowthe cinema now has been turned in a fitness club which is a very big shame as the nearest cinema now is 20 miles away would love to hear from others who have seen this film or any other like it,"[i, remember, this, filmit, was, the, first, film, i, had, watched, at, the, cinema, the, picture, was, dark, in, places, i, was, very, nervous, it, was, back, in, 7475, my, dad, took, me, my, brother, sister, to, newbury, cinema, in, newbury, berkshire, england, i, recall, the, tigers, and, the, lots, of, snow, in, the, film, also, the, appearance, of, grizzly, adams, actor, dan, haggery, i, think, one, of, the, tigers, gets, shot, and, dies, if, anyone, knows, where, to, find, this, on, dvd, etc, please, let, me, knowthe, cinema, now, has, been, turned, in, a, fitness, club, which, is, a, ...]"
19,an awful film it must have been up against some real stinkers to be nominated for the golden globe theyve taken the story of the first famous female renaissance painter and mangled it beyond recognition my complaint is not that theyve taken liberties with the facts if the story were good that would perfectly fine but its simply bizarre by all accounts the true story of this artist would have made for a far better film so why did they come up with this dishwaterdull script i suppose there werent enough naked people in the factual version its hurriedly capped off in the end with a summary of the artists life we could have saved ourselves a couple of hours if theyd favored the rest of the film with same brevity,"[an, awful, film, it, must, have, been, up, against, some, real, stinkers, to, be, nominated, for, the, golden, globe, theyve, taken, the, story, of, the, first, famous, female, renaissance, painter, and, mangled, it, beyond, recognition, my, complaint, is, not, that, theyve, taken, liberties, with, the, facts, if, the, story, were, good, that, would, perfectly, fine, but, its, simply, bizarre, by, all, accounts, the, true, story, of, this, artist, would, have, made, for, a, far, better, film, so, why, did, they, come, up, with, this, dishwaterdull, script, i, suppose, there, werent, enough, naked, people, in, the, factual, version, its, hurriedly, capped, ...]"


In [42]:
words_list = df['review_clean'].tolist()
# Join with a space so words from neighbouring reviews are not glued together
raw_text = ' '.join(words_list)

In [43]:
all_words = raw_text.split()

### Clear Limitations of Built-In `str` Methods
Okay, so we've seen how useful out-of-the-box `str` methods can be, but as was the case with punctuation cleanup, they have their weaknesses as well.

For another example of why we might want fancier tools we'll do another quick practice.

Try to take the movie review text in `raw_text` and break it into individual sentences. Let's take 5 minutes on this.

In [50]:
raw_text[:101].split(".")#[:1]

['one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they a']

* What Happened?  
* What are some issues you ran into?   
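
As one illustration of where a plain `split(".")` falls short, consider the made-up text below, which mixes an abbreviation, an ellipsis and different end-of-sentence marks. This is only a sketch of the problem, not a solution to it.

```python
text = "Mr. Smith arrived... Was it late? Yes! It was 9 p.m. by then."

pieces = text.split(".")

# The abbreviation, the ellipsis and "p.m." all trigger splits, while "?" and "!"
# do not, so the pieces do not line up with the actual sentences.
for piece in pieces:
    print(repr(piece))
```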

## Conclusions
While some of you probably were already quite familiar with using str methods, it's good to review. Sometimes when cleaning data you'll want something quick and easy to code, and using some of the techniques we'll learn in the following notebooks may be a bit of overkill.
