text-preprocessing

This is a function I use to preprocess texts before using them as data.

Steps:

1. Convert to lowercase and replace contractions with their full forms.
2. Remove newline sequences (\n), which are typical redundant characters in webpage text.
3. Remove characters that are neither alphanumeric nor punctuation.
4. Remove other typical redundant characters in webpage text.
5. Remove characters that are not alphanumeric.
6. Remove numbers.
7. Tokenize the text into words.
8. Lemmatize the words. This step may not be necessary and could be replaced with stemming.
9. Remove stopwords.
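
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the function from the notebook: the names `preprocess`, `CONTRACTIONS`, `LEMMAS`, and `STOPWORDS` are hypothetical, the lookup tables are deliberately tiny stand-ins for fuller resources (for example a library lemmatizer and stopword list), and treating the "other redundant characters" step as whitespace collapsing is an assumption.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
CONTRACTIONS = {"it's": "it is", "can't": "cannot", "don't": "do not"}
LEMMAS = {"cats": "cat", "pages": "page", "running": "run"}
STOPWORDS = {"a", "an", "the", "is", "it", "on", "to", "and", "of", "in"}

# Character class matching anything that is not a-z, 0-9, whitespace, or ASCII punctuation.
NOT_ALNUM_OR_PUNCT = re.compile(f"[^a-z0-9\\s{re.escape(string.punctuation)}]")


def preprocess(text):
    """Return a list of cleaned tokens, following the listed steps in order."""
    # Lowercase, then expand contractions (simple substring replacement).
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove literal backslash-n sequences and real newline characters.
    text = text.replace("\\n", " ").replace("\n", " ")
    # Drop characters that are neither alphanumeric nor punctuation.
    text = NOT_ALNUM_OR_PUNCT.sub(" ", text)
    # Assumed meaning of "other redundant characters": collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop the remaining non-alphanumeric characters (punctuation).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # Tokenize on whitespace.
    tokens = text.split()
    # Lemmatize with the toy lookup table; unknown words pass through unchanged.
    tokens = [LEMMAS.get(token, token) for token in tokens]
    # Remove stopwords.
    return [token for token in tokens if token not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("It's 2 Cats\\nrunning on the web-pages!\n"))
```

Expanding contractions before stripping punctuation matters here, because the apostrophe is what identifies a contraction; once punctuation is removed, "it's" would become "it s" and no longer match the table.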
