# Introduction to Data Collection and Preprocessing

## Introduction to NLP and the importance of data collection

Natural Language Processing (NLP) is a branch of data science that deals with text data. Alongside numerical data, text is available in large volumes and is widely used to analyze and solve business problems. Before the data can be used for analysis or prediction, however, it has to be processed.

Text preprocessing prepares the text data for model building. It is the first step of most NLP projects.

## Import necessary libraries

In [1]:
import pandas as pd
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

## Download necessary resources

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\badhei\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\badhei\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\badhei\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\badhei\AppData\Roaming\nltk_data...


True

## Define file paths

In [3]:
raw_data_path = 'datasets/raw_data.csv'
preprocessed_data_path = 'preprocessed_data.csv'

## Load raw data into a Pandas dataframe

In [4]:
raw_data = pd.read_csv(raw_data_path)

## Tokenize text

Tokenization is the process of breaking down a given text into smaller units such as words or sentences. The resulting units are called tokens, and they can be further analyzed to extract meaning, sentiment, or other insights from the text. Tokenization is a critical step in many NLP tasks, since the meaning of a text can often be inferred from the individual units that make it up.

In [5]:
def tokenize_text(text):
    return word_tokenize(text)
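
To make the word-versus-sentence distinction concrete, here is a minimal sketch that uses regular expressions as a simplified stand-in for NLTK's tokenizers. The helper names `simple_word_tokens` and `simple_sentence_tokens` are hypothetical, and the patterns are far cruder than what `word_tokenize` does.

```python
import re

def simple_word_tokens(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentence_tokens(text):
    # Split after ., ! or ? when followed by whitespace (naive: breaks on abbreviations).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "NLP is fun. Tokens are small units!"
print(simple_word_tokens(sample))
# ['NLP', 'is', 'fun', '.', 'Tokens', 'are', 'small', 'units', '!']
print(simple_sentence_tokens(sample))
# ['NLP is fun.', 'Tokens are small units!']
```

The same text yields nine word-level tokens but only two sentence-level tokens, so the choice of unit depends on what the later steps need.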

## Remove stopwords from tokenized text

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. The intuition behind removing stop words is that, by dropping low-information words from the text, we can focus on the important words instead.

In [6]:
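# Build the stop-word set once; rebuilding it on every call is wasteful for large datasets.
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Case-insensitive match against NLTK's English stop-word list.
    return [token for token in tokens if token.lower() not in STOP_WORDS]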
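
As a small illustration of the filtering step, the sketch below uses a tiny hand-picked stop-word set instead of NLTK's list, so it runs without any downloads. `DEMO_STOP_WORDS` and `drop_stop_words` are hypothetical names introduced only for this example.

```python
# Tiny hand-picked stop-word set for illustration only; NLTK's English list is much longer.
DEMO_STOP_WORDS = {"a", "the", "is", "are", "of"}

def drop_stop_words(tokens, stop_words=DEMO_STOP_WORDS):
    # Compare in lower case so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "cats", "are", "on", "the", "roof"]
print(drop_stop_words(tokens))  # ['cats', 'on', 'roof']
```

Note that "on" survives here only because it is missing from the demo set. What counts as a stop word depends entirely on the list in use.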

## Apply stemming to tokens

Stemming is the process of reducing inflected words (e.g. troubled, troubles) to a root form (e.g. trouble). The “root” in this case may not be a real word, just a canonical form of the original; the Porter stemmer used below, for example, reduces all three of these words to “troubl”.

Stemming uses a crude heuristic process that chops off the ends of words in the hope of correctly transforming them into their root forms.

In [7]:
def apply_stemming(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]
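
The "chop off the ends" idea can be sketched in a few lines. This is not the Porter algorithm, which applies several ordered rule phases; `crude_stem` and its suffix list are hypothetical and deliberately simplistic.

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    # Strip the first matching suffix, keeping at least three characters of the word.
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

for w in ["troubled", "troubles", "troubling", "trouble"]:
    print(w, "->", crude_stem(w))
# troubled -> troubl
# troubles -> troubl
# troubling -> troubl
# trouble -> trouble
```

Even this toy version shows two typical properties of stemmers: the stem need not be a dictionary word, and related forms do not always collapse to the same stem.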

## Apply lemmatization to tokens

Lemmatization is, on the surface, very similar to stemming: the goal is to remove inflections and map a word to its root form. The difference is that lemmatization tries to do it properly. It doesn’t just chop things off; it transforms words to their actual dictionary form (the lemma). For example, the word “better” would map to “good”. It may use a dictionary such as WordNet for mappings, or special rule-based approaches. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is passed, so the function below leaves “better” unchanged.

In [8]:
def apply_lemmatization(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]
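
The role of the part of speech in lemmatization can be illustrated with a toy dictionary lookup. The sketch below does not use WordNet; `DEMO_LEMMAS` and `lookup_lemma` are hypothetical, and the single-letter codes (`"n"`, `"v"`, `"a"`) simply mirror the ones WordNet uses.

```python
# Toy lemma dictionary keyed by (word, part of speech); real lemmatizers consult WordNet or similar.
DEMO_LEMMAS = {
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
    ("ran", "v"): "run",
}

def lookup_lemma(word, pos="n", lemmas=DEMO_LEMMAS):
    # Fall back to the word itself when the (word, pos) pair is unknown.
    return lemmas.get((word.lower(), pos), word)

print(lookup_lemma("better", pos="a"))  # good
print(lookup_lemma("better"))           # better (looked up as a noun by default)
print(lookup_lemma("mice"))             # mouse
```

The same word can map to different lemmas, or to none, depending on the part of speech supplied, which is why lemmatization is often combined with POS tagging.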

## Apply part-of-speech tagging to tokens

Part-of-speech (POS) tagging is the process of automatically determining the grammatical category and linguistic function of each word in a given text. This is an important task in Natural Language Processing (NLP) that helps to identify the structure and meaning of a sentence. In general, POS tagging involves labeling each word in a text with its corresponding part of speech, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, or interjection. 

In [9]:
def apply_pos_tagging(tokens):
    return pos_tag(tokens)
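
For English, `pos_tag` returns a list of `(token, tag)` tuples with Penn Treebank tags such as `NNS` or `VBD`. A common follow-up, sketched here under the assumption that the tags will later be fed to a WordNet-based lemmatizer, is to collapse them into WordNet's single-letter categories. The helper `penn_to_wordnet_pos` is hypothetical, and the `tagged` list is hand-written rather than real tagger output.

```python
def penn_to_wordnet_pos(tag):
    # Map a Penn Treebank tag to the single-letter codes WordNet uses.
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and the fallback for everything else

# Hand-written example in the same shape that pos_tag returns.
tagged = [("dogs", "NNS"), ("ran", "VBD"), ("quickly", "RB"), ("happy", "JJ")]
print([(word, penn_to_wordnet_pos(tag)) for word, tag in tagged])
# [('dogs', 'n'), ('ran', 'v'), ('quickly', 'r'), ('happy', 'a')]
```

Falling back to noun for unrecognized tags is a design choice made for this sketch; it matches the lemmatizer's own default.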

## Preprocess text

In [10]:
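def preprocess_text(text):
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    # Note: stemming before lemmatization means the lemmatizer sees stems such as
    # "troubl", which WordNet rarely recognizes, so it leaves most of them unchanged.
    tokens = apply_stemming(tokens)
    tokens = apply_lemmatization(tokens)
    # The tagger also receives stems rather than full words, which can reduce accuracy.
    tokens = apply_pos_tagging(tokens)
    # Result: a list of (token, tag) tuples.
    return tokens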
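
Because each step takes the previous step's output, the chain can also be expressed as an ordered list of functions. The sketch below uses trivial stand-in steps so that it runs without NLTK resources; `run_pipeline` is a hypothetical helper, not part of the notebook.

```python
def run_pipeline(text, steps):
    # Feed the output of each step into the next one, in order.
    result = text
    for step in steps:
        result = step(result)
    return result

# Stand-in steps: split on whitespace, lower-case, drop the word "the".
steps = [
    str.split,
    lambda toks: [t.lower() for t in toks],
    lambda toks: [t for t in toks if t != "the"],
]
print(run_pipeline("The cat saw the dog", steps))  # ['cat', 'saw', 'dog']
```

Keeping the steps in a list makes it easy to reorder them or drop one, for example to compare results with and without stemming.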

## Preprocess the raw data

In [11]:
# Assumes the CSV has a 'text' column in which every value is a string.
preprocessed_data = raw_data['text'].apply(preprocess_text)
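
Empty cells in a CSV are typically read by pandas as `NaN`, which is a float rather than a string, and passing one to a tokenizer raises an error. A minimal guard, assuming that missing or blank rows should simply yield an empty token list, might look like this. `safe_apply` is a hypothetical wrapper, and `str.split` stands in for `preprocess_text`.

```python
def safe_apply(value, preprocess):
    # Missing CSV cells usually arrive as float NaN, not str, so skip anything non-text.
    if not isinstance(value, str) or not value.strip():
        return []
    return preprocess(value)

rows = ["Hello world", float("nan"), "", None]
print([safe_apply(r, str.split) for r in rows])
# [['Hello', 'world'], [], [], []]
```

With the notebook's own function, the wrapper could be used as `raw_data['text'].apply(lambda v: safe_apply(v, preprocess_text))`.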

## Save preprocessed data to disk

In [12]:
preprocessed_data.to_csv(preprocessed_data_path, index=False)
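
Each row of `preprocessed_data` is a Python list of `(token, tag)` tuples, and a CSV cell can only hold text, so the list is written out as its string representation and comes back as a plain string when the file is reloaded. One possible way to keep the structure recoverable, sketched here with hypothetical helpers `encode_tagged` and `decode_tagged`, is to serialize each row as JSON before saving.

```python
import json

def encode_tagged(tagged_tokens):
    # JSON has no tuple type, so each (token, tag) pair is stored as a two-item list.
    return json.dumps(tagged_tokens)

def decode_tagged(cell):
    # Rebuild the (token, tag) tuples after reading the string back.
    return [tuple(pair) for pair in json.loads(cell)]

original = [("cat", "NN"), ("run", "VB")]
cell = encode_tagged(original)
print(cell)                             # [["cat", "NN"], ["run", "VB"]]
print(decode_tagged(cell) == original)  # True
```

Applied to the notebook's data, the encoding step would run before `to_csv` and the decoding step after `read_csv`.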