<a href="https://colab.research.google.com/github/sammatuba/AI-NLP-Codecamp/blob/master/Text_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Pre-processing** 

Data pre-processing refers to the transformations applied to our
data before feeding it to machine learning algorithms.

Much of the industry now deals with big data. To improve efficiency, we
reduce dimensionality by removing the parts of the data that do not help the model.




In [1]:
#import NLTK and download the resources used in this notebook
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

#test

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [24]:
review =  "I had a GREAT experience at the golden harbor! I love an order of fried chicken which is enough for 5 persons and it taste so great.I love that place!"

review

'I had a GREAT experience at the golden harbor! I love an order of fried chicken which is enough for 5 persons and it taste so great.I love that place!'

**Remove number(s)**
Numbers can be removed because they do not tell whether the review is positive or negative.

In [25]:
result = ''.join([i for i in review if not i.isdigit()])
result

'I had a GREAT experience at the golden harbor! I love an order of fried chicken which is enough for  persons and it taste so great.I love that place!'
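Dropping the digit leaves a double space in `for  persons`. As a minimal sketch, assuming any run of digits can be treated as noise, the standard-library `re` module can remove the numbers and collapse the leftover whitespace. The helper name `remove_numbers` is hypothetical and not part of the original notebook.

```python
import re

def remove_numbers(text):
    """Strip digit runs, then collapse repeated whitespace (hypothetical helper)."""
    no_digits = re.sub(r"\d+", "", text)
    # Collapse the gap left behind by the removed digits
    return re.sub(r"\s+", " ", no_digits).strip()

sample = "which is enough for 5 persons"
cleaned = remove_numbers(sample)
print(cleaned)
```

The character-by-character version above gives the same words; the only difference is the extra whitespace, which the tokenizer ignores anyway.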

**Set all characters to lowercase**

In [26]:
result = result.lower()

result

'i had a great experience at the golden harbor! i love an order of fried chicken which is enough for  persons and it taste so great.i love that place!'

**Word tokenization** is a step that splits a longer string into words using spaces and punctuation.

In [27]:
from nltk.tokenize import word_tokenize 


result = word_tokenize(result) 

result

['i',
 'had',
 'a',
 'great',
 'experience',
 'at',
 'the',
 'golden',
 'harbor',
 '!',
 'i',
 'love',
 'an',
 'order',
 'of',
 'fried',
 'chicken',
 'which',
 'is',
 'enough',
 'for',
 'persons',
 'and',
 'it',
 'taste',
 'so',
 'great.i',
 'love',
 'that',
 'place',
 '!']
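The token `'great.i'` appears because the review has no space after the period in `great.I`, so the tokenizer keeps the two words together. One possible fix, sketched here under the assumption that `.`, `!` or `?` directly followed by a letter always marks a missing space (this would also split abbreviations such as `e.g` and domain names), is to insert the space before tokenizing. `fix_missing_spaces` is a hypothetical helper, not part of the original notebook.

```python
import re

def fix_missing_spaces(text):
    # Insert a space after . ! or ? when a letter follows immediately
    return re.sub(r"([.!?])(?=[A-Za-z])", r"\1 ", text)

sample = "it taste so great.I love that place!"
fixed = fix_missing_spaces(sample)
print(fixed)
```

The repaired string can then be passed to `word_tokenize` exactly as in the cell above.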

**Remove stop words**, which are common words that are not relevant and do not help the algorithm.

In [28]:
from nltk.corpus import stopwords
  
stopw = stopwords.words('english')
#stopw
     
result = [w for w in result if w not in stopw]

result

['great',
 'experience',
 'golden',
 'harbor',
 '!',
 'love',
 'order',
 'fried',
 'chicken',
 'enough',
 'persons',
 'taste',
 'great.i',
 'love',
 'place',
 '!']
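The membership test is case-sensitive, and NLTK's English stop word list is lowercase, which is why the text was lowercased before this step. The sketch below illustrates the difference with a tiny hand-written set, `toy_stopwords`, which is only a stand-in for `stopwords.words('english')`.

```python
# Toy stand-in for stopwords.words('english'); the real list is much longer
toy_stopwords = {"i", "had", "a", "at", "the"}

tokens = ["I", "had", "a", "GREAT", "experience", "at", "The", "harbor"]

# Without lowercasing, "I" and "The" slip through the filter
kept_raw = [t for t in tokens if t not in toy_stopwords]

# Lowercasing first lets every stop word match
kept_lower = [t.lower() for t in tokens if t.lower() not in toy_stopwords]

print(kept_raw)
print(kept_lower)
```

With the real list, wrapping it as `set(stopwords.words('english'))` keeps the same result while making each lookup faster than scanning a list.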

**Stemming** is the process of reducing inflected words to their root form, even if the stem itself is not a valid word in the language.

In [29]:
from nltk.stem import PorterStemmer 
stemmer = PorterStemmer()

w = ["served", "caring","supervisory" ,"better","believes","cookery","supervisory"]

for word in w:
  print(stemmer.stem(word)) #served, caring,supervisory ,better,believes,cookery,supervisory

serv
care
supervisori
better
believ
cookeri
supervisori


In [30]:
from nltk.stem import LancasterStemmer 
stemmer = LancasterStemmer() 

stemmer.stem("believes") #caring,supervisory ,better,believes,cookery,supervisory

'believ'

In [31]:
from nltk.stem import SnowballStemmer

#'danish', 'dutch', 'english', 'finnish', 'french', 
#'german', 'hungarian', 'italian', 'norwegian', 
#'porter', 'portuguese', 'romanian', 'russian', 
#'spanish', 'swedish' 
  
sb_stemmer = SnowballStemmer('english') 
sb_stemmer.stem('believes') #caring,supervisory ,better,believes,cookery,supervisory

'believ'

In [32]:
from nltk.stem import PorterStemmer 
stemmer = PorterStemmer()

result_stemm = [stemmer.stem(w) for w in result]

result_stemm

['great',
 'experi',
 'golden',
 'harbor',
 '!',
 'love',
 'order',
 'fri',
 'chicken',
 'enough',
 'person',
 'tast',
 'great.i',
 'love',
 'place',
 '!']

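**Lemmatization** is similar to stemming, but it maps each word to its dictionary form (lemma), so the result is a real word: compare `experi` from the stemmer with `experience` here. By default `WordNetLemmatizer` treats every word as a noun, which is why `fried` is left unchanged.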

In [33]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 

result_lemm = [lemmatizer.lemmatize(w) for w in result] #other words to try: caring, believes, cooking, cookbook, cookery, waits, waited; "better" with pos="a"; "waiting" with pos="v"

print(list(zip(result,result_lemm)))

[('great', 'great'), ('experience', 'experience'), ('golden', 'golden'), ('harbor', 'harbor'), ('!', '!'), ('love', 'love'), ('order', 'order'), ('fried', 'fried'), ('chicken', 'chicken'), ('enough', 'enough'), ('persons', 'person'), ('taste', 'taste'), ('great.i', 'great.i'), ('love', 'love'), ('place', 'place'), ('!', '!')]


**Combine text**

In [35]:
result = ' '.join(result_lemm)

result

'great experience golden harbor ! love order fried chicken enough person taste great.i love place !'

In [0]:
#review

result

**Create a pre-processing function**

In [0]:
def preprocessing(text):
  
  #remove numbers
  xresult = ''.join([i for i in text if not i.isdigit()])
  
  #set to lower case
  xresult = xresult.lower()
  
  #tokenize on whitespace
  xresult = xresult.split()
  
  #remove stop words
  xstopw = stopwords.words('english')
  xresult = [w for w in xresult if w not in xstopw]
  
  #lemmatization
  xlemma = WordNetLemmatizer()
  xresult = [xlemma.lemmatize(w) for w in xresult]
  
  preprocessed_text = ' '.join(xresult)
    
  return preprocessed_text
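Unlike the step-by-step cells, this function tokenizes with `str.split()`, so punctuation stays attached to the words (for example `'place!'`), and such tokens will not match the stop word list or the lemmatizer's dictionary. A minimal sketch of one way to handle this, assuming punctuation carries no signal for the task, is to strip it with `str.translate` before splitting. `strip_punctuation` is a hypothetical helper, not part of the original notebook.

```python
import string

def strip_punctuation(text):
    # Delete every ASCII punctuation character
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "Wow... Loved this place!"

# Whitespace split alone keeps punctuation glued to the words
split_only = sample.lower().split()

# Stripping punctuation first gives clean word tokens
split_clean = strip_punctuation(sample).lower().split()

print(split_only)
print(split_clean)
```

Whether to drop punctuation or keep it as separate tokens (as `word_tokenize` does) depends on the downstream model; this sketch only covers the first option.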

**Create a sample data frame**

In [0]:
from pandas import DataFrame

#create sample data frame
data = {'x':['Wow... Loved this place!',
             'Crust is not good.',
             'Not tasty and the texture was just nasty.',
             'A great touch.'],
        'y':['1','0','0','1']}   

df = DataFrame(data, columns = ['x','y'])

#check the raw data
df.head()

**Apply the function to the data frame**

In [0]:
#pre-process x column
df['x'] = df['x'].apply(preprocessing)

#check the pre-processed data
df.head()


# **Thank you**