This is a function I use to preprocess text before using it as data. It does the following:
- convert to lowercase and replace contractions with their expanded forms
- remove \n, a typical redundant character in webpage text
- remove characters that are neither alphanumeric nor punctuation
- remove other typical redundant characters in webpage text
- remove characters that are not alphanumeric
- remove numbers
- tokenize the words
- lemmatize the words (this may not be necessary, and stemming could be used instead)
- remove stopwords
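
The steps above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual function: the names preprocess, CONTRACTIONS and STOPWORDS are hypothetical, the lookup tables are deliberately tiny, and treating runs of repeated punctuation as the "other redundant characters" is an assumption. The lemmatizer is passed in as an optional callable, so something like NLTK's WordNetLemmatizer().lemmatize or a stemmer could be plugged in.

```python
import re
import string

# Hypothetical, deliberately tiny tables; a real pipeline would use fuller lists.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "it", "i", "am", "and", "or", "to", "of", "in", "not", "do", "will"}
)

_PUNCT = re.escape(string.punctuation)


def preprocess(text, lemmatize=None, stopwords=STOPWORDS):
    """Return a list of cleaned tokens from raw (e.g. scraped) text."""
    # Lowercase, then expand contractions while apostrophes are still present.
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)

    # Replace escaped "\n"-style sequences and real line breaks with spaces
    # (assumption: both forms can show up in scraped text).
    text = re.sub(r"\\[nrt]|[\n\r\t]", " ", text)

    # Keep only ASCII letters, digits, whitespace and punctuation.
    text = re.sub(r"[^a-z0-9\s" + _PUNCT + r"]", " ", text)

    # Assumed example of "other redundant characters": runs of punctuation.
    text = re.sub(r"[" + _PUNCT + r"]{2,}", " ", text)

    # Drop everything that is not alphanumeric, then drop numbers.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\d+", " ", text)

    # Tokenize on whitespace (a simple stand-in for a real tokenizer).
    tokens = text.split()

    # Optional lemmatization (or stemming) via a user-supplied callable.
    if lemmatize is not None:
        tokens = [lemmatize(token) for token in tokens]

    # Remove stopwords last, so lemmatized forms are checked against the list.
    return [token for token in tokens if token not in stopwords]


if __name__ == "__main__":
    print(preprocess("It's 2 cats\\n and the dog can't run!!! ©"))
```

Passing lemmatize=None skips lemmatization entirely, which matches the note that this step may not be necessary.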