[Preprocessing: Cleaning Data](#Preprocessing:-Cleaning-Data)

1. [Import Data](#Import-Data)    
2. [Breaking a Large String Into Smaller Strings](#Breaking-a-Large-String-Into-Smaller-Strings)   
      a. [Individual Words](#Individual-Words)    
      b. [Getting Word Counts](#Getting-Word-Counts)    
      c. [Clear Limitations of Built-In `str` Methods](#Clear-Limitations-of-Built-In-`str`-Methods)
3. [Conclusions](#Conclusions)

# Preprocessing: Cleaning Data

There are numerous steps that can be taken to help put all text on equal footing, many of which involve the comparatively simple ideas of substitution or removal. They are, however, no less important to the overall process. These include:

* set all characters to lowercase
* remove punctuation (generally part of tokenization, but still worth keeping in mind at this stage, even as confirmation)
* remove numbers (or convert numbers to textual representations)
* strip white space (also generally part of tokenization)
* remove default stop words (general English stop words)
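
As a rough sketch of how these steps might combine on a single string, the function below applies them in order using only the standard library. The name `basic_clean` and the tiny `STOP_WORDS` set are hypothetical, made up for illustration; a real project would use a fuller stop word list.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to"}

def basic_clean(text):
    """Apply the cleaning steps listed above to a single string."""
    # set all characters to lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # remove numbers
    text = re.sub(r"\d+", "", text)
    # split() with no argument also discards extra white space
    words = text.split()
    # remove stop words
    return [w for w in words if w not in STOP_WORDS]

print(basic_clean("  The 2 BEST bags of chips, ever!  "))
# ['best', 'bags', 'chips', 'ever']
```

The rest of this notebook performs the same kind of steps on a pandas column rather than on one string at a time.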

## Import Data

I've included an excerpt from [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?datasetId=18) in the Data Folder as well! This file is called `Amazon Reviews.csv`.   

I have reduced it to a smaller file called `Food_Review.csv`.

In [None]:
import pandas as pd
df = pd.read_csv('Food_Review.csv')


[jupyter and pandas display](http://songhuiming.github.io/pages/2017/04/02/jupyter-and-pandas-display/) is a good resource for getting the most out of Jupyter's display of pandas objects.

In [None]:
df.head(2)

In [None]:
df['Text'].head(3)

In [None]:
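# Show the full content of each cell instead of truncating long strings.
# (None replaces the older -1 value, which newer pandas versions no longer accept.)
pd.set_option('display.max_colwidth', None)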

In [None]:
#suppress all warnings with this
import warnings
warnings.filterwarnings("ignore")

In [None]:
df['Text'].head(3)

## Breaking a Large String Into Smaller Strings

A big task in preparing string data is breaking the string into smaller substrings. In this notebook we'll focus on breaking our [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?datasetId=18) excerpt into individual words, then we'll look into trying to make individual sentences. Our goal by the end of this notebook is to be able to take in our excerpt and return a word count pandas dataframe.

### Individual Words
`str.split()`.    

The `split` method built into all `str` objects in Python allows you to take a string and break it into a list of substrings based on the separator it is given. With no argument, it splits on white space.
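
For instance, on a plain Python string (the sample sentence here is made up), calling `split` with no argument splits on any run of white space, while passing a separator splits only on that exact substring:

```python
sentence = "Great  taste, great price"

# No argument: split on runs of white space, so the double space
# does not produce an empty string.
words = sentence.split()
print(words)  # ['Great', 'taste,', 'great', 'price']

# Explicit separator: split only where the exact substring ", " occurs.
parts = sentence.split(", ")
print(parts)  # ['Great  taste', 'great price']
```

Note that the comma stays attached to `'taste,'` in the first result, which is one reason punctuation needs separate handling.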

In [None]:
df['Text'].head(2).str.split()

Since we want words, let's first lowercase every word in our dataframe.  
This is accomplished by using `.str.lower()`.

The `str.lower()` method will take all `A-Z` characters in the string and turn them into their corresponding `a-z` form.

In [None]:
"THE Ohio State University".lower()

In [None]:
# We lowercase all strings
df['Text_clean'] = df['Text'].str.lower()

In [None]:
df['Text_clean'].head(1)

`str.replace()`.    

We can replace any specified substring within a string with another specified substring using `str.replace()`. This can help us eliminate the pesky punctuation.
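
As a small sketch on a plain Python string (the sample review is made up), replacements can be chained one mark at a time. The second half shows an alternative, `str.translate` with `string.punctuation` from the standard library, which drops every punctuation character in one pass; it is offered here only as a comparison.

```python
import string

review = "Good, cheap - and tasty!"

# Chain str.replace calls to drop or swap individual punctuation marks.
step_by_step = review.replace(",", "").replace("!", "").replace("-", " ")
print(repr(step_by_step))

# str.translate removes every character listed in string.punctuation at once.
all_at_once = review.translate(str.maketrans("", "", string.punctuation))
print(repr(all_at_once))
```

Both results still contain runs of several spaces where punctuation used to be, which is why white space is dealt with as a later step.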

In [None]:
### Some substrings we'll want to remove are:
## , ",", ".", "!", "?", "\'", '\"', "-", "(", ")"

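# Use .str.replace to replace substrings inside each entry (Series.replace only matches whole values).
# regex=False treats characters such as ".", "?", "(" and ")" literally.
df['Text_cleaned'] = df['Text_clean'].str.replace(",", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace(".", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("!", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("?", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("\'", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace('\"', "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("-", " ", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace("(", "", regex=False)
df['Text_cleaned'] = df['Text_cleaned'].str.replace(")", "", regex=False)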

In [None]:
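# Here we clean the content by removing all the punctuation in one step:
# the pattern matches any character that is neither a word character nor white space.
df['Text_clean'] = df['Text_clean'].str.replace(r'[^\w\s]', '', regex=True)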

In [None]:
df['Text_clean'].head(1)

### Converting Digits into Words
Import the `re` library, make sure your column holds strings, and use `(?<!\S)\d+(?!\S)` to match sequences of digits that are bounded by white space or by the start/end of the string. If you only want to match entries that consist entirely of digits, you can use the regex `^\d+$` instead.
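
To make the difference between the two patterns concrete, here is a short sketch using only `re` on a made-up string. The angle-bracket replacement is just a placeholder to show where a number-to-words function would plug in.

```python
import re

text = "12 eggs cost 3dollars in 2023"

# Standalone digit runs: bounded by white space or the start/end of the string.
# The "3" in "3dollars" is skipped because a non-space character follows it.
standalone = re.findall(r"(?<!\S)\d+(?!\S)", text)
print(standalone)  # ['12', '2023']

# Whole-entry pattern: matches only when the entire string is digits.
whole_text = bool(re.search(r"^\d+$", text))
whole_year = bool(re.search(r"^\d+$", "2023"))
print(whole_text, whole_year)  # False True

# re.sub accepts a function that receives each match and returns its replacement.
marked = re.sub(r"(?<!\S)\d+(?!\S)", lambda m: "<" + m.group() + ">", text)
print(marked)  # <12> eggs cost 3dollars in <2023>
```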


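In [None]:
import re
import inflect
from num2words import num2words

# inflect engine used below to spell out numbers
p = inflect.engine()

In [None]:
def f(row):
    # Alternative using num2words: spell out each standalone run of digits in a row's text
    return re.sub(r'(?<!\S)\d+(?!\S)', lambda m: num2words(int(m.group())), row['Text_clean'])

In [None]:
# Here we clean the content by removing all the numbers
df['Text_nonumber'] = df['Text_clean'].str.replace(r'\d+', '', regex=True)

# Here we convert standalone digit sequences into words
df['Text_convnumber'] = df['Text_clean'].astype(str).apply(
    lambda text: re.sub(r'(?<!\S)\d+(?!\S)', lambda m: p.number_to_words(m.group()), text))

# The num2words helper can be applied row by row instead:
# df['Text_convnumber'] = df.apply(f, axis=1)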

In [None]:
# picked some arbitrary rows to review.
df[['Text_clean','Text_nonumber']][16:20]

In [None]:
df['Text_clean'].head(1)

In [None]:
# Here we clean the content by removing leading and trailing white space
df['Text_clean'] = df['Text_clean'].str.strip()

In [None]:
df['Text_clean'].head(1)
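
One thing worth checking on a plain string (made up here) is that `strip` only trims the two ends. Interior runs of white space survive it; splitting and re-joining is one common way to collapse them.

```python
messy = "   great   coffee  \n"

# strip() only trims the two ends; the run of spaces in the middle survives.
trimmed = messy.strip()
print(repr(trimmed))  # 'great   coffee'

# Splitting on white space and re-joining also collapses interior runs.
collapsed = " ".join(messy.split())
print(repr(collapsed))  # 'great coffee'
```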

In [None]:
# Raw string so the backslash in the pattern is passed to the regex engine unchanged
df['words'] = df.Text_clean.str.strip().str.split(r'[\W_]+')

In [None]:
df['words'].head(1)

In [None]:
#pd.set_option('display.max_colwidth', None) # Setting this so we can see the full content of cells

# picked some arbitrary rows to review.
df[['Text_clean','words']][16:20]

### Getting Word Counts
Now that we have a list of the words used in the text we can write a quick loop to make a word count dataframe.

In [None]:
words_list = df['Text_clean'].tolist()
# Join with a space so the last word of one review is not fused with the first word of the next
raw_text = ' '.join(words_list)

In [None]:
all_words = raw_text.split()

In [None]:
type(words_list)

In [None]:
all_words[:10]

In [None]:
### We'll make a temporary dictionary to hold the words
### Dictionaries are quite useful for word counts
word_dict = {}

## For each word in the text
for word in all_words:
    # if the word wasn't already in the dictionary
    if word not in word_dict.keys():
        # add it
        word_dict[word] = 1
    # otherwise
    else:
        # add 1 to the existing count
        word_dict[word] = word_dict[word] + 1
        
## NOTE In the future we could write this as a function
## then anytime we want a word count we just need to call the
## function!


# Let's examine the dictionary
word_dict
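
Following up on the note above about wrapping the count in a function, here is one possible sketch. It swaps the hand-written dictionary loop for `collections.Counter` from the standard library, which does the same bookkeeping; `count_words` and the sample list are hypothetical names used only for illustration.

```python
from collections import Counter

def count_words(texts):
    """Return (word, count) pairs for a list of strings, most frequent first."""
    counts = Counter()
    for text in texts:
        # split() with no argument breaks each string on white space
        counts.update(text.split())
    return counts.most_common()

# Made-up sample data for illustration.
sample = ["good coffee", "good tea", "bad coffee good price"]
top_two = count_words(sample)[:2]
print(top_two)  # [('good', 3), ('coffee', 2)]
```

Each pair holds a word and its count, the same information stored in `word_dict`.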

In [None]:
# Now import pandas
import pandas as pd

In [None]:
print(pd.__version__)

In [None]:
# Now make the dataframe
# Note .count() is a native method for a dataframe object
# this is why I used times_used instead!
pa_word_counts = pd.DataFrame({'word':list(word_dict.keys()),
                               'times_used':list(word_dict.values())})

In [None]:
pa_word_counts.sort_values('times_used',ascending=False).head(25)

Great!

As a note, you might think it's silly that we care about how many times the word `the` is used. Hold onto that thought for the next notebook(s).

### Practice
Okay I've been talking a lot, now is your time to practice. I've included an excerpt from [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv) in the Data Folder as well! This file is called `IMDB Dataset.csv`.   

I have reduced it to a smaller file called `Movie_Review.csv`.

Your job is to produce a word count dataframe using what we learned above. This should take 5-10 minutes.

In [None]:
## Code here

In [None]:
## Code here

In [None]:
## Code here

In [None]:
## Code here

### Clear Limitations of Built-In `str` Methods
Okay, so we've seen how useful out-of-the-box `str` methods can be, but as was the case with punctuation cleanup, they have their weaknesses as well.

For another example of why we might want fancier tools we'll do another quick practice.

Try to take the excerpt of Harry Potter and the Prisoner of Azkaban and break it into unique sentences. Let's take 5 minutes on this.

In [None]:
## Code here

In [None]:
## Code here

* What Happened?  
* What are some issues you ran into?   

## Conclusions
While some of you probably were already quite familiar with using str methods, it's good to review. Sometimes when cleaning data you'll want something quick and easy to code, and using some of the techniques we'll learn in the following notebooks may be a bit of overkill.
