## [Workbook 1](https://www.kaggle.com/sabasiddiqi/workbook-1-text-pre-processing-for-beginners) - Text Preprocessing for Beginners - Data Cleaning
<br>
**Level** : Beginner

This notebook discusses **Text Data Preprocessing** for **NLP Problems** using the Toxic Comment Classification dataset. The data comprises a large number of Wikipedia comments that have been labeled by human raters for toxic behavior.

The data is available via the following link.
[Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)

Next Workbook : [Workbook 2 - Text Preprocessing for Beginners - Feature Extraction](https://www.kaggle.com/sabasiddiqi/workbook-2-text-preprocessing-feature-extraction) 

To skip the initial steps (reading data, text extraction from data), jump to [Text Pre-Processing Steps](#jump).

Starting by importing the required libraries.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import re
import string
from string import digits
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Reading training and test data from CSV files and saving them as pandas DataFrames.

In [None]:
print(os.listdir("../input"))
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
print("\nTrain data: \n",train.head())
print("\nTest data: \n",test.head())

The data here comprises IDs, comments, and labels.

Removing IDs from the train data, keeping the test data IDs for submission.

In [None]:
train_data=train.drop(train.columns[0], axis=1) 
test_data=test
print(train_data.head())
print(test_data.head())

Now extracting comments from the train and test data, and storing their indexes for later use.
Merging the comments from both train and test, so that the preprocessing steps can be performed on both at the same time.

In [None]:
train_comments=train_data.iloc[:,0]
test_comments=test_data.iloc[:,1]

#saving index to separate them later
train_comments_index=train_comments.index
test_comments_index=test_comments.index

frames = [train_comments, test_comments]
comments = pd.concat(frames, ignore_index=True)


labels=train_data.iloc[:,1:]

print("Train Comments Shape: ",train_comments.shape)
print("Test Comments Shape: ",test_comments.shape)
print("Comments Shape after Merge: ",comments.shape)
print("Comments are: \n",comments.head())
print("\nLabels are: \n", labels.head())

<br>
###   <a id="jump">Basic Text Preprocessing Steps - Cleaning </a>
<br>
Now that we have the comments, it's time to process them into a form that can be fed to a classifier.

To do so, the following basic steps are performed. To give a better idea of what each step does, an example is traced through all of them.

**“You are annoying!!! goJumpOff4Cliff pleaseeeeeeee”**
* Step 1 - [Remove punctuation](#1) → **You are annoying goJumpOff4Cliff pleaseeeeeeee**
* Step 2 - [Remove digits](#2) → **You are annoying goJumpOffCliff pleaseeeeeeee**
* Step 3 - [Split combined words](#3) → **You are annoying go Jump Off Cliff pleaseeeeeeee**
* Step 4 - [Convert to lowercase](#4) → **you are annoying go jump off cliff pleaseeeeeeee**
* Step 5 - [Split each sentence using delimiter](#5) → **you, are, annoying, go, jump, off, cliff, pleaseeeeeeee**
* Step 6 - [Remove stop words](#6) → **annoying, go, jump, cliff, pleaseeeeeeee**
* Step 7 - [Convert Word to Base Form](#7) → **annoy, go, jump, cliff, pleaseeeeeeee**

Please note that the order of the steps matters here: if Step 4 is performed before Step 3, we won't be able to split combined words like **goJumpOffCliff**, because the capital letters that mark the word boundaries are gone. Also note that none of these steps shortens elongated words, so **pleaseeeeeeee** is carried through unchanged.
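The first five steps can be reproduced on the example sentence with the standard library alone. The sketch below is only an illustration on a single string, not the pandas code used in this notebook; the helper name `clean_steps` and its `lowercase_first` flag are hypothetical, introduced to show what happens when Step 4 runs before Step 3.

```python
import re
import string

def clean_steps(text, lowercase_first=False):
    """Apply Steps 1-5 to one string; lowercase_first runs Step 4 before Step 3."""
    # Step 1: delete punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Step 2: delete digits
    text = text.translate(str.maketrans('', '', string.digits))
    if lowercase_first:
        # Wrong order: lowercasing first removes the capitals Step 3 relies on
        text = text.lower()
    # Step 3: insert a space between a lowercase letter and the capital after it
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)
    # Step 4: convert to lowercase
    text = text.lower()
    # Step 5: split on whitespace
    return text.split()

sample = "You are annoying!!! goJumpOff4Cliff pleaseeeeeeee"
right_order = clean_steps(sample)
wrong_order = clean_steps(sample, lowercase_first=True)
print(right_order)
print(wrong_order)
```

With the correct order the combined word becomes four separate tokens; with the wrong order it survives as the single token `gojumpoffcliff`.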

<a id="1">Step 1 - Remove Punctuation</a>

In [None]:
# The third argument of str.maketrans lists the characters to delete
c=comments.str.translate(str.maketrans('', '', string.punctuation))
c.head()
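As a quick illustration of what the translation table does, the sketch below applies the same idea to one plain string (the sample text is made up for this example). Because punctuation is deleted rather than replaced by a space, contractions and words joined only by punctuation end up fused together, which is worth keeping in mind when reading the cleaned comments.

```python
import string

# The third argument to str.maketrans lists characters to delete
table = str.maketrans('', '', string.punctuation)

sample = "Don't shout!!! stop,now"
cleaned = sample.translate(table)
print(cleaned)  # "Dont shout stopnow"
```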

<a id="2">Step 2 - Remove Digits </a>

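Replacing \n with a space and removing digits

In [None]:
# Replace newlines with a space so that words on adjacent lines are not fused together
c=c.str.replace('\n', ' ', regex=False)
# Delete all digits
c=c.str.translate(str.maketrans('', '', digits))
c.head()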

<a id="3"> Step 3 - Split combined words </a>

For instance, converting **whyAreYou** to **why Are You**.

In [None]:
# Insert a space wherever a lowercase letter is directly followed by an uppercase letter
c=c.str.replace(r'([a-z])([A-Z])', r'\1 \2', regex=True)
c.head()
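To see what this pattern does and does not split, here is a small sketch on plain strings. The helper name `split_combined` and the sample words are hypothetical. The pattern only fires on a lowercase letter immediately followed by an uppercase one, so a run of capitals is left untouched.

```python
import re

def split_combined(text):
    # Insert a space wherever a lowercase letter is directly followed by an uppercase one
    return re.sub(r'([a-z])([A-Z])', r'\1 \2', text)

examples = ["whyAreYou", "goJumpOffCliff", "HTMLParser", "already split"]
results = [split_combined(e) for e in examples]
for before, after in zip(examples, results):
    print(before, "->", after)
```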

<a id="4"> Step 4 - Convert to lowercase </a>


In [None]:
c=c.str.lower()
c.head()

<a id="5"> Step 5 - Split each sentence using delimiter </a>

Converting each sentence to a list of words. We do this so that the upcoming steps can keep the necessary words and discard the rest.

In [None]:
c=c.str.split()
c.head()

<a id="6"> Step 6 - Remove Stop Words </a>

Stop words are the most common words in a language (such as "the", "is" and "are") and are usually filtered out in NLP problems.

In [None]:
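# Requires the NLTK stop-word corpus (run nltk.download('stopwords') if it is not installed)
# A set gives fast membership checks inside the list comprehension
stop = set(stopwords.words('english'))
c=c.apply(lambda x: [item for item in x if item not in stop])
c.head()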
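The filtering itself is just a membership test per token. The sketch below uses a tiny hand-picked stop-word set (an assumption for illustration; NLTK's English list is much longer) so that it runs without downloading any corpus.

```python
# Hypothetical, hand-picked stop words; NLTK's English list is much longer
toy_stop = {"you", "are", "off", "the", "a", "is"}

tokens = ["you", "are", "annoying", "go", "jump", "off", "cliff"]
# Keep only the tokens that are not in the stop-word set
kept = [word for word in tokens if word not in toy_stop]
print(kept)
```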

<a id="7"> Step 7 - Convert Word to Base Form or Lemmatize </a> 

Converting each word to its base form, e.g. trying to try, or tried to try, for simplification, using the **WordNetLemmatizer** class from the **NLTK** library. Each word is lemmatized first as a noun (the default) and then as a verb.

In [None]:
from tqdm import tqdm
lemmatizer = WordNetLemmatizer()
com=[]
for words in tqdm(c):
    # Lemmatize each word as a noun first (the default), then as a verb
    com.append([lemmatizer.lemmatize(lemmatizer.lemmatize(w),'v') for w in words])

The data obtained after lemmatization is a list of word lists. In the next step the words of each comment are joined back into a single string and stored in a DataFrame.

In [None]:
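# Join each word list back into one string and build the DataFrame directly;
# this avoids np.array() on ragged lists and passing a set as the column names
clean_data=pd.DataFrame({'comment_text': [" ".join(words) for words in com]}, index=comments.index)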
print(clean_data.head())

Separating Train and Test Comments using the index stored earlier.

In [None]:
train_clean_data=clean_data.loc[train_comments_index]
test_clean_data=clean_data.drop(train_comments_index,axis=0).reset_index(drop=True)

In [None]:
print("PreProcessed Train Data : ",train_clean_data.head(5))
print("PreProcessed Test Data : ",test_clean_data.head(5))

Merging comments with labels for the training data set, and with IDs for the test data set.

In [None]:
frames=[train_clean_data,labels]
train_result = pd.concat(frames,axis=1)
frames=[test.iloc[:,0],test_clean_data]
test_result = pd.concat(frames,axis=1)
print(train_result.head())
print(test_result.head())

Saving the data in CSV format to use it in a different notebook; alternatively, you can continue working in the same notebook.

In [None]:
train_result.to_csv('train_data.csv', index = False)
test_result.to_csv('test_data.csv', index = False)