## Cleaning Text

Strip the leading and trailing whitespace from each string in `text_data`.

In [1]:
text_data = ["  Interrobang. By aishwarya Henriette    ",
            "Parking And Going. By Karl Gautier",
            "    Today Is The night. By Jarek Prakash"]

# strip leading and trailing whitespace


['Interrobang. By aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [2]:
# remove periods


['Interrobang By aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [None]:
# convert the text to uppercase

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']
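
One way to fill in the three blank cells above is with list comprehensions over the built-in string methods `strip`, `replace`, and `upper`. This is a minimal sketch rather than the only valid solution; the variable names `strip_whitespace`, `remove_periods`, and `upper_case` are illustrative assumptions.

```python
text_data = ["  Interrobang. By aishwarya Henriette    ",
             "Parking And Going. By Karl Gautier",
             "    Today Is The night. By Jarek Prakash"]

# strip leading and trailing whitespace from each string
strip_whitespace = [string.strip() for string in text_data]

# remove every period
remove_periods = [string.replace(".", "") for string in strip_whitespace]

# convert every character to uppercase
upper_case = [string.upper() for string in remove_periods]

print(strip_whitespace)
print(remove_periods)
print(upper_case)
```

Each step consumes the output of the previous one, so the three lists correspond to the three outputs shown above.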

### See Also
* Beginners Tutorial for Regular Expressions in Python (https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/)

## Parsing and Cleaning HTML
Use Beautiful Soup to extract the full name from the provided HTML.

In [5]:
from bs4 import BeautifulSoup

html = """
    <div class='full_name'><span style='font-weight:bold'>Yan</span> Chin</div>
"""


'Yan Chin'
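
A possible solution, sketched under the assumption that Python's built-in `html.parser` backend is acceptable (the notebook does not say which parser it used). The names `soup` and `full_name` are illustrative.

```python
from bs4 import BeautifulSoup

html = """
    <div class='full_name'><span style='font-weight:bold'>Yan</span> Chin</div>
"""

# parse the markup with the parser that ships with Python
soup = BeautifulSoup(html, "html.parser")

# locate the div by its class, then join the text of all its descendants
full_name = soup.find("div", {"class": "full_name"}).text

print(full_name)
```

The `.text` property concatenates the text inside the nested `<span>` with the text that follows it, which is why the markup is dropped but the space before "Chin" is kept.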

### See Also
* Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Removing Punctuation
Remove the punctuation from the provided text.

In [6]:
import unicodedata
import sys

text_data = ['Hi!!! I. Love. This. Song.....', '10000% Agree!!!! #LoveIT', 'Right?!?!']

# create a dictionary of punctuation characters

# for each string, remove any punctuation characters


['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
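
The two commented steps can be sketched as follows. The approach assumed here is a translation table built from every Unicode code point whose category starts with `P` (punctuation), applied with `str.translate`; the names `punctuation` and `cleaned` are illustrative.

```python
import sys
import unicodedata

text_data = ['Hi!!! I. Love. This. Song.....', '10000% Agree!!!! #LoveIT', 'Right?!?!']

# create a dictionary mapping every punctuation code point to None
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

# for each string, remove any punctuation characters
cleaned = [string.translate(punctuation) for string in text_data]

print(cleaned)
```

Building the table once and reusing it is usually faster than testing each character in a Python loop, and it covers non-ASCII punctuation that `string.punctuation` would miss.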

## Tokenizing Text

In [8]:
from nltk.tokenize import word_tokenize
string = "The science of today is the technology of tomorrow"

# tokenize words


['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [11]:
from nltk.tokenize import sent_tokenize
string = "The science of today is the technology of tomorrow. Tomorrow is today"

# tokenize sentences


## Removing Stop Words

In [12]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

tokenized_words = ['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']

stop_words = stopwords.words('english')

# remove stop words


[nltk_data] Downloading package stopwords to /Users/ddl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['going', 'go', 'store', 'park']
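
The filtering step itself is a membership test per token. The sketch below uses a small hand-written stop list as a stand-in so that it runs without the NLTK corpus; in the notebook you would pass `stopwords.words('english')` instead. The helper `remove_stop_words` and the set `stand_in_stop_words` are hypothetical names, not part of NLTK.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that do not appear in stop_words, preserving order."""
    # a set gives constant-time lookups, which matters for long stop lists
    stop_set = set(stop_words)
    return [word for word in tokens if word not in stop_set]


tokenized_words = ['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']

# stand-in for stopwords.words('english'); an assumption for illustration only
stand_in_stop_words = ['i', 'am', 'to', 'the', 'and']

filtered = remove_stop_words(tokenized_words, stand_in_stop_words)
print(filtered)
```

The comparison is case-sensitive and NLTK's English stop words are all lowercase, so tokens should be lowercased before filtering.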

## Stemming Words

In [13]:
from nltk.stem.porter import PorterStemmer

tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

# create stemmer


# apply stemmer
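
A sketch of the two steps, assuming NLTK's default `PorterStemmer` settings. The names `porter` and `stemmed` are illustrative.

```python
from nltk.stem.porter import PorterStemmer

tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

# create stemmer
porter = PorterStemmer()

# apply stemmer to each token
stemmed = [porter.stem(word) for word in tokenized_words]

print(stemmed)
```

Stems are not always dictionary words: suffixes such as "-ed", "-ional", and "-ing" are stripped by rule, so "meeting" becomes "meet" while "traditional" becomes "tradit".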



### See Also
* Porter Stemming Algorithm (https://tartarus.org/martin/PorterStemmer/)





## Encoding Text as a Bag of Words

In [50]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

# create the bag of words feature matrix


<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [51]:
# Convert to array and show the result

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

In [52]:
# show the feature names


['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']



## Weighting Word Importance

In [55]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

# create the tf-idf feature matrix


<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [56]:
# Convert to array and show the result
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

In [57]:
# show the vocabulary


{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'germany': 4,
 'beats': 0,
 'both': 2}
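
A sketch of the blank cells in this section, assuming the default `TfidfVectorizer` settings. The name `feature_matrix` matches the variable used in the cell above; `tfidf` is an illustrative name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

# create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

# convert to a dense array and show the result
print(feature_matrix.toarray())

# show the vocabulary: a mapping from each word to its column index
print(tfidf.vocabulary_)
```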

$$
tfidf(t, d) = tf(t,d) * idf(t)
$$

where $t$ is a word and $d$ is a document.

$$
idf(t) = \log\left(\frac{1 + n_d}{1 + df(d, t)}\right) + 1
$$

where $n_d$ is the number of documents and $df(d, t)$ is the document frequency of term $t$ (i.e., the number of documents in which the term appears).
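
The formulas can be checked by hand against the first row of the matrix above. The sketch below assumes the natural logarithm and adds one step the formulas do not show: scikit-learn's default `norm='l2'` rescales each document vector to unit length. The dictionaries `tf` and `df` are filled in by hand for the document 'I love Brazil. Brazil!'; all names are illustrative.

```python
import math

n_d = 3                          # number of documents
tf = {"brazil": 2, "love": 1}    # term counts in the first document
df = {"brazil": 1, "love": 1}    # number of documents containing each term


def idf(term):
    """Smoothed inverse document frequency, as in the formula above."""
    return math.log((1 + n_d) / (1 + df[term])) + 1


# unnormalized tf-idf weights: tf(t, d) * idf(t)
raw = {term: tf[term] * idf(term) for term in tf}

# L2-normalize the document vector (scikit-learn's default behaviour)
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
normalized = {term: value / norm for term, value in raw.items()}

print(raw)
print(normalized)
```

Both terms appear in only one document, so they share the same idf and the normalized weights depend only on the counts: 2/√5 ≈ 0.894 for "brazil" and 1/√5 ≈ 0.447 for "love", matching the first row of the array.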

### See Also
* scikit-learn documentation: tf-idf term weighting (http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)