# A Summary of Text Preprocessing Methods
>
- toc: true 
- badges: true
- comments: true
- author: Bujie Xu
- categories: [NLP]

This post collects common text preprocessing methods so that I can look them up quickly.
The original article is here: https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908

# Convert to upper case or lower case


In [6]:
input_str = "AbcdEfG"
input_str.lower()

'abcdefg'
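The cell above only shows `lower()`. As a small sketch of the other options, `upper()` handles the upper-case direction, and `casefold()` is a more aggressive variant of `lower()` intended for caseless comparison. The German sample word is an assumption chosen only to show where the two differ.

```python
input_str = "AbcdEfG"

# Upper-case counterpart of lower()
upper = input_str.upper()

# casefold() also maps characters such as the German sharp s, which lower() keeps
sample = "Straße"
lowered = sample.lower()
folded = sample.casefold()

print(upper)    # ABCDEFG
print(lowered)  # straße
print(folded)   # strasse
```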

# Replace or remove numbers

In [7]:
import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', input_str)
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
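Removing digits as above leaves double spaces behind, and the cell does not show the "replace" variant. The following is a minimal sketch of both; the helper names and the `<NUM>` placeholder are assumptions for illustration, not part of the original article.

```python
import re


def replace_numbers(text, placeholder="<NUM>"):
    """Replace each run of digits with a placeholder token (hypothetical name)."""
    return re.sub(r"\d+", placeholder, text)


def remove_numbers(text):
    """Drop each run of digits, then collapse the spaces left behind."""
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r" {2,}", " ", without_digits).strip()


input_str = "Box A contains 3 red and 5 white balls."
replaced = replace_numbers(input_str)
removed = remove_numbers(input_str)
print(replaced)  # Box A contains <NUM> red and <NUM> white balls.
print(removed)   # Box A contains red and white balls.
```

Keeping a placeholder can be useful when the presence of a number matters to a downstream model even though its exact value does not.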


# Remove Punctuation

In [10]:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"
punctuation_dict = {ord(p):'' for p in string.punctuation}
result = input_str.translate(punctuation_dict)
print(result)

This is an example of string with punctuation
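Deleting punctuation outright can glue words together when there is no space around the punctuation mark. One possible alternative, sketched below with a made-up sample sentence, is to map every punctuation character to a space and then normalize the spacing.

```python
import string

text = "A state-of-the-art tokenizer,fast and simple!"

# Variant 1: delete punctuation (same effect as the dict-based table above)
delete_table = str.maketrans("", "", string.punctuation)
deleted = text.translate(delete_table)

# Variant 2: turn each punctuation character into a space, then collapse spaces
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(text.translate(space_table).split())

print(deleted)  # A stateoftheart tokenizerfast and simple
print(spaced)   # A state of the art tokenizer fast and simple
```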


# Remove Whitespace 

In [11]:
input_str = "   This has a lot whitespace    "
print(input_str.strip())

This has a lot whitespace
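`strip()` only touches the two ends of the string. If repeated spaces, tabs, or newlines inside the text should also be normalized, something like the following sketch may help; the sample string is an assumption.

```python
import re

input_str = "   This \t has   a lot\nof whitespace    "

# split() with no arguments splits on any whitespace run and drops empty pieces
collapsed = " ".join(input_str.split())

# Equivalent regex form: replace each whitespace run with one space, then trim
collapsed_re = re.sub(r"\s+", " ", input_str).strip()

print(collapsed)  # This has a lot of whitespace
```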


# Tokenization

![tokenization](https://miro.medium.com/max/3220/1*ffMYw8aujrmyxfA55Zm3Jg.jpeg)

In [16]:
from nltk.tokenize import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer()
s = "I love you"
tokenizer.tokenize(s)

['I', 'love', 'you']
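A whitespace tokenizer leaves punctuation attached to the neighboring word. As a rough, dependency-free sketch (not a replacement for NLTK's tokenizers), a regular expression can split words and punctuation into separate tokens. The pattern and the function name are assumptions for illustration.

```python
import re

# One token is either a run of word characters or a single non-space symbol
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


s = "I love you, NLTK!"
whitespace_tokens = s.split()
regex_tokens = simple_tokenize(s)
print(whitespace_tokens)  # ['I', 'love', 'you,', 'NLTK!']
print(regex_tokens)       # ['I', 'love', 'you', ',', 'NLTK', '!']
```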

# Remove Stop words

In [19]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
s = "NLTK is a leading platform for building Python programs to work with human language data."
result = [w for w in s.split() if w not in stop_words]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data.']
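In the output above, `'data.'` still carries its period, and the membership test is case-sensitive, so a capitalized stop word would not be filtered. A possible refinement is to normalize each token before the lookup. The sketch below uses a tiny hand-written stop word set (an assumption, standing in for the NLTK list) so that it runs without any downloads.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only
STOP_WORDS = {"is", "a", "for", "to", "with"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase tokens, trim surrounding punctuation, and drop stop words."""
    kept = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if token and token not in stop_words:
            kept.append(token)
    return kept


s = "NLTK is a leading platform for building Python programs to work with human language data."
result = remove_stop_words(s)
print(result)
```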


In [21]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the old sklearn.feature_extraction.stop_words module is gone in newer scikit-learn releases
print(ENGLISH_STOP_WORDS)

frozenset({'together', 'seemed', 'she', 'hers', 'ie', 'may', 'becoming', 'though', 'everything', 'only', 'somewhere', 'at', 'one', 'very', 'few', 'many', 'whither', 'my', 'onto', 'now', 'keep', 'mill', 'this', 'than', 'once', 'seems', 'might', 'please', 'these', 'among', 'hence', 'thus', 'something', 'rather', 'how', 'whereas', 'whence', 'everywhere', 'last', 'anyone', 'never', 'somehow', 'another', 'herself', 'i', 'detail', 'two', 'elsewhere', 'give', 'nowhere', 'myself', 'me', 'some', 'of', 'everyone', 'first', 'yourselves', 'himself', 'meanwhile', 'serious', 'found', 'hereafter', 'much', 'becomes', 'nobody', 'thin', 'namely', 'find', 'indeed', 'thru', 'those', 'no', 'noone', 'both', 'is', 'hasnt', 'own', 'not', 'amoungst', 'empty', 'then', 'their', 'again', 'further', 'itself', 'most', 'hereby', 'up', 'wherein', 'to', 'thereupon', 'across', 'on', 'along', 'except', 'done', 'anyway', 'had', 'go', 'any', 'will', 'often', 'upon', 'three', 'fire', 'neither', 'anyhow', 'either', 'there',

# Stemming & Remove sparse terms and particular words

![Stemming](https://miro.medium.com/max/3492/1*JpOXoNSFkZ0sjqPYT2U4cA.jpeg)

In [23]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
tokens = word_tokenize(input_str)  # split the sentence into word tokens
for word in tokens:
    print(stemmer.stem(word))  # print the stem of each token

there
are
sever
type
of
stem
algorithm
.
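The heading also mentions removing sparse terms and particular words, which the cell above does not cover. One way this could look is sketched below: terms are counted across all documents, and anything below a frequency threshold or in a custom drop list is discarded. The function name, the `min_count` threshold, and the toy documents are all assumptions, and "sparse" is interpreted here as a low total count.

```python
from collections import Counter


def remove_terms(documents, min_count=2, drop_words=frozenset()):
    """Drop tokens seen fewer than min_count times overall, plus any drop_words."""
    counts = Counter(token for doc in documents for token in doc)
    return [
        [t for t in doc if counts[t] >= min_count and t not in drop_words]
        for doc in documents
    ]


docs = [
    ["stem", "algorithm", "type"],
    ["stem", "algorithm", "porter"],
    ["stem", "snowball"],
]

without_sparse = remove_terms(docs)
without_particular = remove_terms(docs, drop_words={"algorithm"})
print(without_sparse)      # [['stem', 'algorithm'], ['stem', 'algorithm'], ['stem']]
print(without_particular)  # [['stem'], ['stem'], ['stem']]
```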


# Lemmatization

Lemmatization tools are available in the following libraries: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

In [25]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
tokens = word_tokenize(input_str)
for word in tokens:
    print(lemmatizer.lemmatize(word))  # default pos="n", so verb forms such as "been" come back unchanged

been
had
done
language
city
mouse


# Part of speech tagging (POS)

Part-of-speech tagging aims to assign a part of speech (such as noun, verb, or adjective) to each word of a given text, based on its definition and its context. Many tools include POS taggers, among them NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.

In [31]:
from nltk import pos_tag, word_tokenize
input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
tokens = word_tokenize(input_str)
result = pos_tag(tokens)  # list of (token, tag) pairs
print(result)

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), (':', ':'), ('an', 'DT'), ('article', 'NN'), (',', ','), ('to', 'TO'), ('write', 'VB'), (',', ','), ('interesting', 'VBG'), (',', ','), ('easily', 'RB'), (',', ','), ('and', 'CC'), (',', ','), ('of', 'IN')]


# Chunking (shallow parsing)

Chunking is a natural language processing technique that identifies the constituent parts of a sentence (nouns, verbs, adjectives, etc.) and links them to higher-order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.) [23]. Chunking tools include NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), and FreeLing.
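As a rough illustration of chunking with NLTK, the sketch below groups POS-tagged tokens into noun phrases with `RegexpParser`. The tokens are tagged by hand (an assumption, so that no tagger model needs to be downloaded), and the one-rule grammar is only an example, not a complete noun phrase grammar.

```python
from nltk import RegexpParser

# Hand-tagged tokens; in practice these would come from pos_tag(word_tokenize(text))
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# NP chunk: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every NP subtree
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)  # ['the little yellow dog', 'the cat']
```

Tokens that match no rule, such as `barked` and `at` here, stay as plain leaves directly under the root of the tree.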