## How do we even begin to classify sentiments of sentences?
Our project requires us to analyze the sentiment of tweets, which consist of sentences and words. Sentences vary widely in length, vocabulary, and structure, so raw text is generally treated as unstructured data.

However, the machine learning techniques that we use require structured data in order to train models properly. So, how do we train models on sentences?

In the domain of Natural Language Processing (NLP), there are several techniques for **feature extraction**, or ways to structure text. As stated in [this paper on text classification algorithms](https://arxiv.org/pdf/1904.08067.pdf), feature extraction techniques serve to convert unstructured text sequences into a structured feature space once the textual data itself has been cleaned.

For the purposes of this project, we will be looking at three such techniques: **Bag-of-Words**, **Term Frequency-Inverse Document Frequency**, and **Word2Vec**.

## How do we go about cleaning textual data?
There are many ways to clean textual data, several of which are covered in the paper linked above. One is tokenization, which breaks sentences into meaningful chunks (e.g. words or symbols) called tokens.
Others include removing extra whitespace and special characters, converting all letters to lower-case, and removing stop words, i.e. words that carry little meaning on their own, such as *"a"* or *"the"*.
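
The cleaning steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not our final pipeline: the function name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for the example, and a real project would likely use a fuller stop-word list and a tokenizer that understands hashtags, mentions, and emoji.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "of", "to"}


def clean_text(text):
    """Lowercase, strip special characters, tokenize, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # split() with no arguments tokenizes on whitespace and collapses repeats.
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = clean_text("The  movie was GREAT, a must-see!!!")
print(tokens)  # ['movie', 'was', 'great', 'must', 'see']
```

Note that stripping special characters this aggressively also splits hyphenated words such as "must-see", which may or may not be desirable for tweets.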

## What is Bag-of-Words?
The Bag-of-Words (BoW) model seeks to reduce and simplify text based on a specific criterion, most commonly word frequency. Think of a body of text as a literal bag of words (or list of words), where our feature space becomes the frequency with which each word occurs in the text. The logic behind this is that words are often representative of the content of a sentence, so if a particular noun appears many times, one can assume that the subject of that sentence has to do with that noun.

[Wikipedia provides an excellent example of what this looks like](https://en.wikipedia.org/wiki/Bag-of-words_model).
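
To make the idea concrete, here is a minimal sketch of a word-frequency BoW using only the Python standard library. The helper name `bag_of_words` and the two toy documents are made up for illustration. It assumes the text has already been tokenized, and it is not necessarily how our project will implement the model.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for a list of token lists."""
    # The vocabulary is every distinct token, sorted for a stable column order.
    vocabulary = sorted({tok for doc in documents for tok in doc})
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # One entry per vocabulary word; words absent from the doc count as 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = [
    "good movie good cast".split(),
    "bad movie".split(),
]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['bad', 'cast', 'good', 'movie']
print(vectors)     # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each document becomes a fixed-length vector of counts over the shared vocabulary, which is the structured form that a classifier can train on.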

However, this approach has some limitations. BoW ignores grammar and word order, e.g. "Is this true" and "This is true" map to the same feature vector once letter case is normalized. There are also issues of scalability: the feature vector has one entry per distinct word in the vocabulary, so it grows as the corpus grows. However, since tweets are limited to 280 characters, we do not think this will be much of a problem.
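
The word-order limitation is easy to check directly. The short sketch below (with a hypothetical helper `word_counts`) counts lower-cased words in both example sentences and compares the results:

```python
from collections import Counter


def word_counts(sentence):
    """Count lower-cased, whitespace-separated words in a sentence."""
    return Counter(sentence.lower().split())


question = word_counts("Is this true")
statement = word_counts("This is true")

# Both sentences contain exactly the same words, so the counts are equal
# even though one is a question and the other is a statement.
same = question == statement
print(same)  # True
```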

