# Text Classification

Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in natural language processing. 

The text to classify is given as input to an algorithm, which analyzes the text’s content and assigns the input one of a predefined set of tags or categories.

**Input → Classifying Algorithm → Classification of Input**

Real-life examples:

+ Sentiment analysis: how does the writer feel about the subject? Do they think positively or negatively of it?
  Ex. restaurant reviews
+ Topic labeling: given a sentence and a set of topics, which topic does the sentence fall under?
  Ex. is this essay about history? Math? Something else?
+ Spam detection
  Ex. email filtering: is this email legitimate or spam?

Example. 
A restaurant wants to evaluate its reviews but doesn’t want to read through all of them, so it wants a computer algorithm to do the work. It simply wants to know whether each customer’s review is positive or negative.

Here’s an example of a customer’s review and a simple way an algorithm could classify it.

Input: “The food here was too salty and too expensive” 

Algorithm: go through every word in the sentence and count how many positive words and how many negative words it contains.

+ Neutral words: “the”, “food”, “here”, “was”, “too”, “and”
+ Negative words (2): “salty”, “expensive”
+ Positive words (0): none

Classification: Negative Review, because there are more negative words (2) than positive (0).
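
The counting procedure above can be sketched in a few lines of Python. This is only an illustration: the word lists, the function name `classify_by_word_count`, and the "neutral" result for ties are assumptions made for this example and are not part of the original text.

```python
# Hypothetical word lists, chosen only for this illustration.
POSITIVE_WORDS = {"good", "great", "tasty", "delicious", "friendly"}
NEGATIVE_WORDS = {"salty", "expensive", "bad", "slow", "rude"}


def classify_by_word_count(sentence):
    """Label a sentence by comparing its counts of positive and negative words."""
    words = sentence.lower().split()
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    if negative > positive:
        return "negative"
    if positive > negative:
        return "positive"
    return "neutral"  # assumption: a tie gives no evidence either way


print(classify_by_word_count("The food here was too salty and too expensive"))
# negative
```

Every word that appears in neither list is treated as neutral, so the quality of the result depends entirely on how complete the two lists are.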

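However, this algorithm fails in many cases.

For example, “The food here was good, not expensive and not salty” would be classified as negative even though it’s actually a positive review: the algorithm counts one positive word (“good”) and two negative words (“expensive”, “salty”) and ignores the “not” in front of each negative word.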

Language and text can get very complicated, which makes creating these algorithms difficult. Sources of difficulty include words that have multiple meanings, negation words (such as “not”), and slang.
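
To make the negation problem concrete, here is a rough sketch that scores the review “The food here was good, not expensive and not salty” twice: once with plain counting, and once with a naive rule that flips the polarity of the word directly after a negation. The lexicons, the `score` function, and the flipping rule are all assumptions made for illustration; the rule would still fail on phrases such as "not very salty".

```python
POSITIVE_WORDS = {"good", "tasty"}       # hypothetical lexicon
NEGATIVE_WORDS = {"salty", "expensive"}  # hypothetical lexicon
NEGATIONS = {"not", "never", "no"}       # hypothetical negation words


def score(sentence, handle_negation):
    """Return (positive, negative) counts, optionally flipping the word after a negation."""
    positive = negative = 0
    negated = False
    for raw in sentence.lower().split():
        word = raw.strip(",.!?")
        if word in NEGATIONS:
            negated = handle_negation
            continue
        is_pos = word in POSITIVE_WORDS
        is_neg = word in NEGATIVE_WORDS
        if negated and (is_pos or is_neg):
            is_pos, is_neg = is_neg, is_pos  # "not salty" counts as positive
        positive += is_pos
        negative += is_neg
        negated = False  # a negation only affects the word right after it
    return positive, negative


review = "The food here was good, not expensive and not salty"
print(score(review, handle_negation=False))  # (1, 2)
print(score(review, handle_negation=True))   # (3, 0)
```

Without the rule, the negative count wins; with it, the same sentence comes out positive.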



## Importing Data

In [15]:
import sys
import string
from scipy import sparse
import numpy as np

In [16]:
trainingFile = "trainingSet.txt"
testingFile = "testSet.txt"

In [17]:
def getData(fileName):
    """Read lines of the form "sentence<TAB>label" and return (sentences, labels)."""
    sentences = []
    sentiments = []

    # The with-block closes the file even if parsing a line fails
    with open(fileName) as f:
        for line in f:
            sentence, sentiment = line.split('\t')
            sentences.append(sentence.strip())
            sentiments.append(int(sentiment.strip())) # Sentiment in {0,1}

    return sentences, np.array(sentiments)

In [18]:
trainingSentences, trainingLabels = getData(trainingFile)
testingSentences, testingLabels = getData(testingFile)
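
`getData` splits each line on a tab and converts the second field to an integer, so the files presumably hold one `sentence<TAB>label` pair per line. The sketch below runs the same parsing steps on two made-up lines, since the contents of `trainingSet.txt` are not shown here. It keeps the labels in a plain list, whereas `getData` wraps them in a NumPy array.

```python
# Made-up lines in the assumed "sentence<TAB>label" layout.
sample_lines = [
    "The food was great\t1\n",
    "Too salty and too expensive\t0\n",
]

sentences = []
sentiments = []
for line in sample_lines:
    sentence, sentiment = line.split("\t")
    sentences.append(sentence.strip())
    sentiments.append(int(sentiment.strip()))  # strip() removes the newline

print(sentences)   # ['The food was great', 'Too salty and too expensive']
print(sentiments)  # [1, 0]
```

A line with no tab, or with more than one, would make the two-name unpacking raise a `ValueError`, so the input files need to follow this layout exactly.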

## Pre-Processing Data

In [19]:
def preProcess(sentences):
    # The various sources of stop words
    pronouns = {"i", "me", "us", "you", "she", "her", "he", "him", "it", "we", "they", "them", "this", "these"}
    # Source: https://en.wikipedia.org/wiki/English_personal_pronouns
    copulae = {"be", "is", "am", "are", "being", "was", "were", "been"}
    # Source: https://en.wikipedia.org/wiki/Copula_(linguistics)#English
    conjunctions = {"for", "and", "nor", "but", "or", "yet", "so", "that", "which", "because", "as", "since", "though", "while", "whereas"}
    # Source: https://en.wikipedia.org/wiki/Conjunction_(grammar)

    stopwords = {"a", "the"}.union(pronouns).union(copulae).union(conjunctions)

    def cleanText(text):
        # Make lower case
        text = text.lower()

        # Replace punctuation characters with spaces
        # (Python 3: str.maketrans replaces the Python 2 string.maketrans)
        nonText = string.punctuation
        text = text.translate(str.maketrans(nonText, ' ' * len(nonText)))

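        # Tokenize and drop stop words
        # (assumed intent: the stopwords set above is otherwise unused)
        words = [word for word in text.split() if word not in stopwords]

        return words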

    return list(map(cleanText, sentences))

In [20]:
trainingTokens = preProcess(trainingSentences)
testingTokens = preProcess(testingSentences)
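
As a quick check of the cleaning steps, the sketch below applies lowercasing, punctuation replacement, and whitespace tokenization to one made-up sentence. It is a standalone Python 3 approximation with a hypothetical name (`clean_text`), and it leaves out the stop-word sets defined in `preProcess`.

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return text.lower().translate(PUNCT_TO_SPACE).split()


print(clean_text("The food here was good, not expensive!"))
# ['the', 'food', 'here', 'was', 'good', 'not', 'expensive']
```

Because the straight apostrophe is part of `string.punctuation`, a contraction such as "don't" is split into the two tokens "don" and "t".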


## Getting Data

In [13]:
def getVocab(sentences):
    vocab = set()
    for sentence in sentences:
        for word in sentence:
            vocab.add(word)
    return sorted(vocab)

In [14]:
vocabulary = getVocab(trainingTokens)
print(vocabulary)
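
To show what the vocabulary looks like, here is a small sketch that collects the unique tokens of two made-up token lists. The function `get_vocab` is a compact stand-in for `getVocab`, and the word-to-index mapping at the end is only one possible use of the sorted list, added here as an assumption.

```python
def get_vocab(token_lists):
    """Collect the unique tokens across all sentences, sorted alphabetically."""
    return sorted({word for tokens in token_lists for word in tokens})


toy_tokens = [
    ["food", "too", "salty"],
    ["food", "good", "not", "salty"],
]
vocab = get_vocab(toy_tokens)
print(vocab)  # ['food', 'good', 'not', 'salty', 'too']

# Sorting makes the order deterministic, so each word gets a stable position.
word_index = {word: i for i, word in enumerate(vocab)}
print(word_index["salty"])  # 3
```

Repeated words such as "food" and "salty" appear only once, because the tokens are gathered in a set before sorting.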
    

