# Text Classification

Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in natural language processing. 

The text we want to classify is given as input to an algorithm. The algorithm analyzes the text’s content and then assigns the input one of the tags or categories defined beforehand.

**Input → Classifying Algorithm → Classification of Input**

Real-life examples:

+ Sentiment analysis: how does the writer feel about the subject they are writing about? Do they think positively or negatively of it? Ex. restaurant reviews.
+ Topic labeling: given a sentence and a set of topics, which topic does the sentence fall under? Ex. is this essay about history? Math?
+ Spam detection: is a message wanted or unwanted? Ex. email filtering: is this email important or is it spam?

Example:
A restaurant wants to evaluate its reviews but doesn’t want to read through all of them. Instead, it wants a computer algorithm to do the work. The restaurant simply wants to know whether each customer’s review is positive or negative.

Here’s an example of a customer’s review and a simple way an algorithm could classify their review.

Input: “The food here was too salty and too expensive” 

Algorithm: 
Goes through every word in the sentence and counts how many positive words and how many negative words are in the sentence.

		“The, food, here, was, too, and” are all neutral words

		“Salty, expensive” are negative words.

		Negative words: 2
		Positive words: 0

Classification: Negative Review, because there are more negative words (2) than positive (0).

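However, this simple algorithm fails in many cases.

For example, “The food here was good, not expensive and not salty” would be classified as negative, because it contains one positive word (“good”) and two negative words (“expensive”, “salty”). The algorithm ignores the word “not”, so it misses that this is actually a positive review.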
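The counting approach described above can be sketched in a few lines of Python. This is only a minimal illustration: the word lists `POSITIVE_WORDS` and `NEGATIVE_WORDS`, the function name `count_classify`, and the "neutral" result for ties are assumptions made for this example and are not part of the notebook.

```python
# Hypothetical word lists, chosen only for this illustration.
POSITIVE_WORDS = {"good", "great", "tasty", "friendly"}
NEGATIVE_WORDS = {"salty", "expensive", "bad", "slow"}


def count_classify(review):
    """Label a review by comparing counts of positive and negative words."""
    # Lowercase, split on whitespace, and strip punctuation around each token.
    words = [w.strip(".,!?\"'") for w in review.lower().split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    if negatives > positives:
        return "negative"
    if positives > negatives:
        return "positive"
    return "neutral"  # tie-breaking rule is an assumption of this sketch


print(count_classify("The food here was too salty and too expensive"))
# negative
print(count_classify("The food here was good, not expensive and not salty"))
# negative (wrong: the word "not" is ignored)
```

Both reviews come out as negative, which reproduces the failure on negation described above.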

Language and text can get very complicated, which makes creating these algorithms difficult. Sources of difficulty include words with multiple meanings, negation words (such as “not”), and slang.



## Importing Data

In [1]:
import sys
import string
from scipy import sparse
import numpy as np

In [2]:
trainingFile = "trainingSet.txt"
testingFile = "testSet.txt"

In [3]:
def getData(fileName):
    # Each line is expected to look like "<sentence>\t<sentiment>".
    # The with-block closes the file once the lines have been read.
    with open(fileName) as f:
        lines = f.readlines()

    sentences = []
    sentiments = []

    for line in lines:
        sentence, sentiment = line.split('\t')
        sentences.append(sentence.strip())
        sentiments.append(int(sentiment.strip())) # Sentiment in {0,1}

    return sentences, np.array(sentiments)

In [4]:
trainingSentences, trainingLabels = getData(trainingFile)
testingSentences, testingLabels = getData(testingFile)
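The loading code splits each line on a tab character, so the data files are presumably laid out as one sentence per line, followed by a tab and a 0 or 1 label. The sketch below shows that parsing step on two made-up lines. The helper name `parse_line`, the sample sentences, and the reading of 1 as positive and 0 as negative are assumptions for illustration; the notebook only says the sentiment is in {0,1}.

```python
def parse_line(line):
    """Split one 'sentence<TAB>label' line into (sentence, label)."""
    sentence, sentiment = line.split("\t")
    return sentence.strip(), int(sentiment.strip())


# Made-up sample lines in the assumed format (not taken from the real files).
sample_lines = [
    "The food here was too salty and too expensive\t0\n",
    "Great service and tasty food\t1\n",
]

for line in sample_lines:
    print(parse_line(line))
# ('The food here was too salty and too expensive', 0)
# ('Great service and tasty food', 1)
```

Note that a line without a tab, or with more than one tab, makes the two-value unpacking raise a `ValueError`, so the files need to follow this layout exactly.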

## Pre-Processing Data

In [5]:
def preProcess(sentences):

    def cleanText(text):
        # Make lower case
        text = text.lower()

        # Replace punctuation characters with spaces
        # (str.maketrans is the Python 3 form; string.maketrans no longer exists)
        nonText = string.punctuation
        text = text.translate(str.maketrans(nonText, ' ' * len(nonText)))

        # Tokenize
        words = text.split()

        return words

    return list(map(cleanText, sentences))

In [27]:
trainingTokens = preProcess(trainingSentences)
testingTokens = preProcess(testingSentences)
print(trainingTokens)
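To see what the cleaning step does to a single sentence, here is a small standalone sketch of the same three steps (lowercase, replace punctuation with spaces, split on whitespace). It assumes Python 3; the function name `clean_text` and the sample sentence are made up for this example.

```python
import string


def clean_text(text):
    """Lowercase the text, turn punctuation into spaces, and split into words."""
    text = text.lower()
    # Map every punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).split()


print(clean_text("The food wasn't good, but the service was GREAT!"))
# ['the', 'food', 'wasn', 't', 'good', 'but', 'the', 'service', 'was', 'great']
```

One side effect worth noticing: because the apostrophe is treated as punctuation, "wasn't" is split into "wasn" and "t", so the negation is no longer a single token.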



## Building the Vocabulary and Sentence Vectors

In [7]:
def getVocab(sentences):
    vocab = set()
    for sentence in sentences:
        for word in sentence:
            vocab.add(word)
    return sorted(vocab)

In [8]:
vocabulary = getVocab(trainingTokens)
print(vocabulary)    
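On a toy input, the vocabulary step looks like the sketch below. Sorting matters because each word's position in the sorted list later serves as its column number. The names `get_vocab`, `toy_tokens`, and `word_to_index` are made up for this example.

```python
def get_vocab(tokenized_sentences):
    """Collect every distinct word and return the words in sorted order."""
    return sorted({word for sentence in tokenized_sentences for word in sentence})


# Two made-up tokenized sentences.
toy_tokens = [
    ["the", "food", "was", "good"],
    ["the", "service", "was", "slow"],
]

vocab = get_vocab(toy_tokens)
print(vocab)
# ['food', 'good', 'service', 'slow', 'the', 'was']

# A word's position in the sorted list can serve as its column index.
word_to_index = {word: index for index, word in enumerate(vocab)}
print(word_to_index["service"])
# 2
```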



In [32]:
def createVector(vocab, sentences):
    # Look up each word's column once. Calling vocab.index(word) inside the
    # loop would rescan the whole vocabulary list for every word.
    wordToIndex = {word: index for index, word in enumerate(vocab)}

    rows = []
    columns = []
    wordOccurrences = []

    for sentenceIndex, sentence in enumerate(sentences):
        # We only want {0,1} for the presence of each word (Bernoulli NB),
        # so each word is counted once per sentence. For multinomial NB,
        # count every occurrence of the word instead.
        alreadyCounted = set() # Keep track of words so we don't double count.
        for word in sentence:
            if word in wordToIndex and word not in alreadyCounted:
                rows.append(sentenceIndex)         # which sentence
                columns.append(wordToIndex[word])  # which word
                wordOccurrences.append(1)
                alreadyCounted.add(word)

    sentenceVectors = sparse.csr_matrix((wordOccurrences, (rows, columns)), dtype=int, shape=(len(sentences), len(vocab)))

    return sentenceVectors

In [33]:
training = createVector(vocabulary, trainingTokens)
testing = createVector(vocabulary, testingTokens)
print(training)

  (0, 694)	1
  (0, 884)	1
  (0, 1186)	1
  (0, 1335)	1
  (1, 52)	1
  (1, 640)	1
  (1, 768)	1
  (1, 788)	1
  (1, 1158)	1
  (1, 1166)	1
  (1, 1171)	1
  (1, 1281)	1
  (2, 52)	1
  (2, 104)	1
  (2, 197)	1
  (2, 375)	1
  (2, 578)	1
  (2, 629)	1
  (2, 656)	1
  (2, 694)	1
  (2, 721)	1
  (2, 799)	1
  (2, 956)	1
  (2, 978)	1
  (2, 1111)	1
  :	:
  (497, 469)	1
  (497, 723)	1
  (497, 765)	1
  (497, 825)	1
  (497, 961)	1
  (497, 1170)	1
  (497, 1171)	1
  (497, 1281)	1
  (497, 1300)	1
  (497, 1321)	1
  (498, 22)	1
  (498, 76)	1
  (498, 216)	1
  (498, 472)	1
  (498, 525)	1
  (498, 565)	1
  (498, 611)	1
  (498, 652)	1
  (498, 679)	1
  (498, 778)	1
  (498, 994)	1
  (498, 1146)	1
  (498, 1171)	1
  (498, 1198)	1
  (498, 1246)	1
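In the printed sparse matrix, each line has the form `(sentence index, vocabulary index)` followed by the stored value, so `(0, 694)	1` means that sentence 0 contains the word at position 694 of the vocabulary. The sketch below uses plain Python lists instead of a SciPy sparse matrix to show the dense equivalent of one row, and to contrast the presence encoding used here with the count encoding mentioned in the code comments. The toy vocabulary and sentence are made up for this example.

```python
from collections import Counter

vocab = ["food", "good", "not", "salty"]    # toy vocabulary, already sorted
sentence = ["not", "salty", "not", "good"]  # toy tokenized sentence

counts = Counter(sentence)

# Presence encoding (Bernoulli-style): 1 if the word appears at all.
presence_vector = [1 if word in counts else 0 for word in vocab]

# Count encoding (multinomial-style): how many times the word appears.
count_vector = [counts[word] for word in vocab]

print(presence_vector)
# [0, 1, 1, 1]
print(count_vector)
# [0, 1, 2, 1]
```

The two vectors differ only for words that repeat within a sentence, such as "not" here.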
