# Feature Extraction using 3 Techniques

The text below will explain what feature extraction is, and how we can go about doing it.

## How do we even begin to classify sentiments of sentences?
Our project requires us to analyze the sentiments of tweets, which consist of sentences and words. Sentences can come in numerous variations, and this inconsistency between sentences is why text is considered unstructured data.

However, the machine learning techniques that we use require structured data in order to train models properly. So, how do we train models on sentences?

In the domain of Natural Language Processing (NLP), there are several techniques of **feature extraction**, or ways in which to structure text. As stated in [this paper on text classification algorithms](https://arxiv.org/pdf/1904.08067.pdf), feature extraction techniques serve to convert unstructured text sequences into a structured feature space once the textual data itself has been cleaned.

For the purposes of this project, we will be looking at three such techniques: **Bag-of-Words**, **Term Frequency-Inverse Document Frequency**, and **Word2Vec**.

## How do we go about cleaning textual data?
There are many ways in which to clean textual data, many of which are mentioned in the aforementioned paper. Some of these include tokenization, which breaks sentences into meaningful chunks (e.g. words or symbols) that are called tokens.
Others include removing whitespace and special characters, converting all letters to lower-case, or removing stop words, i.e. words that carry little meaning on their own, such as *"a"* or *"the"*.
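
As a rough illustration of these cleaning steps, the sketch below lower-cases a sentence, tokenizes it with a regular expression, and drops stop words. The `STOP_WORDS` set and the `simple_clean` helper are made up for this example; a real project would more likely rely on a library-provided stop-word list.

```python
import re

# A tiny, hand-picked stop-word list for illustration only (an assumption,
# not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "of", "to"}

def simple_clean(text):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    # Keeping only letters, digits, and apostrophes also discards
    # extra whitespace and special characters.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(simple_clean("The  quick brown fox is a friend of the DOG!"))
# ['quick', 'brown', 'fox', 'friend', 'dog']
```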

## What is Bag-of-Words?
The Bag-of-Words (BoW) model seeks to reduce and simplify text based on a specific criterion, most commonly word frequency. Think of a body of text as a literal bag of words (or list of words), where our feature space becomes the frequency with which each word occurs in the text. The logic behind this is that words are often representative of the content of the sentence, so if a particular noun appears many times, then one can assume that the subject of that sentence has to do with that noun.
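
To make this concrete, here is a minimal sketch of turning a few documents into word-count vectors using only the standard library. The `bag_of_words` helper and the two example sentences are invented for illustration, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word seen across all documents.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One entry per vocabulary word; words absent from the document count 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each document becomes a fixed-length row of counts, which is the kind of structured input a classifier can train on.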

[Wikipedia provides an excellent example of what this looks like, and how it works in spam filtering](https://en.wikipedia.org/wiki/Bag-of-words_model).

However, there are some limitations of this approach. BoW ignores grammar and the order in which words appear, e.g. "Is this true" and "This is true" both produce the same features once capitalization is ignored. There are also issues of scalability, since the feature vector grows with the size of the vocabulary. Since tweets have a limit of 280 characters, though, we do not think this will be much of a problem.
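
The loss of word order is easy to check. In this small sketch (the `word_counts` helper is hypothetical and simply lower-cases and splits on whitespace), two sentences with different meanings end up with identical counts.

```python
from collections import Counter

def word_counts(sentence):
    """Count lower-cased, whitespace-separated words."""
    return Counter(sentence.lower().split())

# A question and a statement collapse to the same bag of words.
print(word_counts("Is this true") == word_counts("This is true"))    # True

# So do two sentences with opposite meanings.
print(word_counts("dog bites man") == word_counts("man bites dog"))  # True
```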

[Here is a blog showing what Bag-of-Words is mathematically, and how to use it in conjunction with a Naive Bayes Classifier to analyze the sentiments of movie reviews](https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python).

## What is Term Frequency-Inverse Document Frequency?
Term Frequency-Inverse Document Frequency, otherwise known as TF-IDF, is another method of feature extraction, and is used to determine *how* important a word is to a document. How does it determine the importance of a word? It helps to explain this by first defining what these two terms mean.

Term Frequency refers to the raw count of a term in a document. It is similar to how the Bag-of-Words model works, and just looks at the true frequency of some specific term.

Whereas Term Frequency deals with the raw count of a term, Inverse Document Frequency gives more *weight* to uncommon terms and less *weight* to common ones. It is calculated as an inverse function of the number of documents the term appears in.
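
The sketch below combines the two ideas, assuming one common formulation: the weight of a term in a document is its raw count multiplied by `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. The `tf_idf` helper and the three example documents are invented for illustration; library implementations often use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # raw count of each term in this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

docs = ["the vote is today", "the poll is open", "the vote count"]
weights = tf_idf(docs)
print(weights[0]["the"])    # 0.0, since "the" appears in every document
print(weights[0]["vote"])   # log(3/2), about 0.405
print(weights[0]["today"])  # log(3), about 1.099
```

A term that appears in every document gets a weight of zero, while a term unique to one document gets the highest weight.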

[Again, Wikipedia provides a good example of how this works out in finding documents relating to a specific query](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).

Though TF-IDF gives weight to less common words, which helps lessen the effect common words have, this technique still has its cons. It cannot account for the similarity between words in a document, as each word is independently represented as an index. Also, since it is based on the BoW model, it does not capture positions in text or the semantics of the sentence.

## What is Word2Vec?
Word2Vec, short for *word to vector*, is an improvement on existing word embedding techniques. What is [word embedding](https://en.wikipedia.org/wiki/Word_embedding)? In short, the term refers to NLP techniques that map words or phrases to vectors of real numbers, i.e. a mapping from a space with many dimensions to a continuous vector space with far fewer dimensions.

What makes Word2Vec stand out? Word2Vec uses shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. These models take a large corpus of text as input and create a vector space of (usually) several hundred dimensions, where each unique word is given a vector in that space. Words that share common contexts end up closer to each other in that vector space than unrelated words, e.g. "small" and "smaller" are closer than "small" and "sky".
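
To give a sense of what "reconstructing linguistic contexts" means, the sketch below builds the (center word, context word) pairs that the skip-gram variant of Word2Vec would be trained to predict. It only shows how training examples are generated, not the network itself; the `skipgram_pairs` helper and the window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the context window to the bounds of the sentence.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the small dog barked".split()
print(skipgram_pairs(tokens, window=1))
# [('the', 'small'), ('small', 'the'), ('small', 'dog'),
#  ('dog', 'small'), ('dog', 'barked'), ('barked', 'dog')]
```

Words that keep appearing with the same neighbors produce similar training pairs, which is why their learned vectors end up close together.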

This method provides a means to discover the relationships and similarities between words, which the previous two methods cannot capture.


# Implementing Feature Extraction
Below, we begin to implement these feature extraction techniques in code. We will use the [NLTK library](https://www.nltk.org/), along with scikit-learn's `CountVectorizer`, which is imported in the first cell.

We will also be using the data set `tweets2020.csv`.

First, we import the needed libraries and data.

In [32]:
import pandas as pd
import re
import numpy as np

# Bag of Words imports
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

tweets_df = pd.read_csv('tweets2020.csv')
tweets_df.columns = ['id', 'tweets']
tweets_df.head()

| | id | tweets |
|---|---|---|
| 0 | 0 | @belcherjody1 IF no voter ID required by 2020,... |
| 1 | 1 | RT @matthewjdowd: “As we approach this 2020 pr... |
| 2 | 2 | RT @SisiLiliDidi: #AndrewYang polls 14%--18% a... |
| 3 | 3 | RT @PRIMONUTMEG: Are you paying attention to t... |
| 4 | 4 | RT @davidsirota: Fact check: Zero Pinnochios, ... |


### Some helpful functions

In [58]:
def clean_tweet(tweet):
    """
    Cleans a single tweet. Currently removes retweet markers, @names, and links.

    Still to do (see the TODOs below): removing special characters and
    stop-words, stemming each word, and converting all letters to lower-case.
    Empty and duplicate tweets have to be dropped across rows, not per tweet.

    Meant to be applied to each row of the tweets column.
    """
    # Remove the standalone retweet marker "RT" and @names (with an optional
    # trailing colon), along with any whitespace that follows them
    name_regex = r"\bRT\b\s*|@[_0-9a-zA-Z]+:?\s*"
    cleaned_tweet = re.sub(name_regex, "", tweet)

    # Remove links: everything from http:// or https:// up to the next whitespace
    url_regex = r"https?://\S+"
    cleaned_tweet = re.sub(url_regex, "", cleaned_tweet)

    # TODO: Remove special characters

    # TODO: Remove empty tweets

    # TODO: Remove duplicates

    # TODO: Remove stop-words

    # TODO: Stem words

    # TODO: Convert to lower-case

    return cleaned_tweet

In [57]:
tweet = "=( asdfghjkl"
clean_tweet(tweet)

'=( asdfghjkl'

In [34]:
tweets_df['cleaned_tweets'] = tweets_df['tweets'].apply(clean_tweet)

In [37]:
tweets_df.head(15)

| | id | tweets | cleaned_tweets |
|---|---|---|---|
| 0 | 0 | @belcherjody1 IF no voter ID required by 2020,... | IF no voter ID required by 2020, Trump will lo... |
| 1 | 1 | RT @matthewjdowd: “As we approach this 2020 pr... | “As we approach this 2020 presidential campai... |
| 2 | 2 | RT @SisiLiliDidi: #AndrewYang polls 14%--18% a... | #AndrewYang polls 14%--18% among Gen Z, but 0... |
| 3 | 3 | RT @PRIMONUTMEG: Are you paying attention to t... | Are you paying attention to the Green Party P... |
| 4 | 4 | RT @davidsirota: Fact check: Zero Pinnochios, ... | Fact check: Zero Pinnochios, Bottomless Geppe... |
| 5 | 5 | RT @OddsShark: Updated odds to be the Democrat... | Updated odds to be the Democratic Candidate f... |
| 6 | 6 | RT @Rachel_K_Chen: 420 more days until the pre... | 420 more days until the presidential election... |
| 7 | 7 | Union members say 2020 labor support for Trump... | Union members say 2020 labor support for Trump... |
| 8 | 8 | RT @agitpopworld: @ddale8 “Dear Daddy,\n\nHere... | “Dear Daddy,\n\nHere's the prototype of an un... |
| 9 | 9 | RT @jacobinmag: We asked longtime climate advo... | We asked longtime climate advocates which can... |


### Implementing Bag-of-Words