<p style="text-align:center">
PSY 394U <b>Data Analytics with Python</b>, Spring 2018


<img style="width: 400px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Title_pics.png?raw=true" alt="title pics"/>

</p>

<p style="text-align:center; font-size:40px; margin-bottom: 30px;"><b> Text classification </b></p>

<p style="text-align:center; font-size:18px; margin-bottom: 32px;"><b>April 24-26, 2018</b></p>

<hr style="height:5px;border:none" />

Text data can be analyzed with classification and clustering algorithms. To do so, we extract features from a text corpus, then classify or cluster the texts according to those features (and, for classification, the target category labels). Here, we cover a few simple examples of text classification and clustering. 

# 1. String classification
<hr style="height:1px;border:none" />

The goal here is to assign each string to a category. To do so, we will use the **`names`** corpus, a collection of female and male names available in NLTK, and construct a classifier that predicts whether a name is female or male.

`<NameClassifier.py>`

In [2]:
import nltk
import random

# reading names from the names corpus
from nltk.corpus import names
femaleNames = names.words('female.txt')
maleNames = names.words('male.txt')

Here, we read in the lists of female and male names. If the corpus is not yet installed, it can be downloaded with `nltk.download('names')`.

In [3]:
# creating name-label pairs, then shuffling
nameData = []
for iName in femaleNames:
    nameData.append((iName, 'female'))
for iName in maleNames:
    nameData.append((iName, 'male'))
random.shuffle(nameData)

Once both lists are read, we pair each name with its category ('female' or 'male') as a tuple. Then we shuffle the data so that female and male names are mixed.
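Because `random.shuffle` reorders the list in place and produces a different order on each run, later results can vary slightly between runs. A minimal sketch of a reproducible shuffle follows, assuming a fixed seed is acceptable for the analysis. The toy names, the helper `shuffled_copy`, and the seed value 394 are illustrative choices and are not part of the original script.

```python
import random

# Toy stand-in for the name-label pairs (hypothetical examples,
# not read from the NLTK corpus)
toyData = [('Alice', 'female'), ('Maria', 'female'),
           ('Robert', 'male'), ('Kevin', 'male')]

def shuffled_copy(pairs, seed):
    """Return a shuffled copy of pairs; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, global state untouched
    result = list(pairs)        # copy, so the input list is not modified
    rng.shuffle(result)
    return result

firstRun = shuffled_copy(toyData, seed=394)
secondRun = shuffled_copy(toyData, seed=394)
print(firstRun == secondRun)    # same seed, same order
```

Fixing the seed also fixes which names end up in the training and testing sets later on.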

To extract a feature from the shuffled data, we will use a custom function called **`gender_feature`**. This function takes a string and returns a feature dictionary containing the last letter of the name (`last-letter`).

In [4]:
# a function to return a feature used to classify whether a name is
# male or female.
# Only the feature dictionary is returned; the label is attached later
def gender_feature(name):
    featureDict = {'last-letter': name[-1]}
    return featureDict

Then we convert each name into its feature (the last letter). As before, we keep the feature dictionary and the label together, as tuples in **`featureData`**.

In [5]:
# converting the name data into feature (i.e., just the last letter)
# as well as the label (female / male)
featureData = [(gender_feature(n), gender) for (n, gender) in nameData]
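To make the resulting structure concrete, the sketch below applies the same conversion to a few sample pairs. The names in `toyNames` are hypothetical stand-ins for `nameData`, chosen only for illustration.

```python
def gender_feature(name):
    # same feature as above: the last letter of the name
    return {'last-letter': name[-1]}

# Hypothetical sample pairs standing in for nameData
toyNames = [('Alice', 'female'), ('Robert', 'male'), ('Maria', 'female')]

# Each element becomes a (feature dictionary, label) tuple
toyFeatures = [(gender_feature(n), gender) for (n, gender) in toyNames]
for item in toyFeatures:
    print(item)
```

Each element is a `(feature dictionary, label)` tuple, for example `({'last-letter': 'e'}, 'female')` for the first pair.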

Next, we split the data into training and testing data sets. The first 1000 observations form the testing data set, and the remaining observations form the training data set.

In [6]:
# splitting into training and testing data sets
trainData, testData = featureData[1000:], featureData[:1000]
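The split relies on list slicing: `[:n]` takes the first `n` items and `[n:]` takes everything after them, so the two sets never overlap. A small sketch on toy data, with a hypothetical helper `split_train_test` that is not part of the original script:

```python
def split_train_test(data, nTest):
    """First nTest items become the test set, the remainder the training set."""
    return data[nTest:], data[:nTest]

# Toy data: ten integers standing in for the shuffled observations
toyData = list(range(10))
toyTrain, toyTest = split_train_test(toyData, 3)
print(len(toyTrain), len(toyTest))
```

Since the data were shuffled beforehand, taking the first items as the testing set amounts to drawing a random sample.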

Then we train a classifier. Here, we use a naive Bayes algorithm. A naive Bayes classifier assigns each observation the most probable label given its observed feature(s), computed with Bayes' theorem under the "naive" assumption that the features are conditionally independent given the label. The naive Bayes classifier is available in NLTK as **`NaiveBayesClassifier`**. This classifier object is somewhat different from that of **Scikit-learn** (a.k.a., `sklearn`): it is trained on a list of (feature dictionary, label) pairs, so we supply both the feature(s) and the category label to the naive Bayes algorithm together. 
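The idea can be worked out by hand for a single feature: the probability of a label given a last letter is proportional to P(label) × P(letter | label). The sketch below does this with raw counts on a toy data set. The data and the function names are hypothetical, and this is not how NLTK implements the classifier internally (NLTK, for instance, smooths its probability estimates).

```python
from collections import Counter, defaultdict

# Toy training data (hypothetical): (last letter, label) pairs
toyTrain = [('a', 'female'), ('a', 'female'), ('e', 'female'), ('n', 'female'),
            ('n', 'male'), ('n', 'male'), ('k', 'male'), ('e', 'male')]

# Count how often each label occurs, and each letter within each label
labelCounts = Counter(label for _, label in toyTrain)
letterCounts = defaultdict(Counter)
for letter, label in toyTrain:
    letterCounts[label][letter] += 1

def posterior(letter):
    """Return P(label | last letter) from raw counts via Bayes' theorem."""
    total = sum(labelCounts.values())
    priors = {label: n / total for label, n in labelCounts.items()}
    scores = {label: priors[label] * letterCounts[label][letter] / labelCounts[label]
              for label in labelCounts}
    norm = sum(scores.values())
    if norm == 0:               # letter never seen: fall back to the priors
        return priors
    return {label: s / norm for label, s in scores.items()}

def classify(letter):
    """Pick the label with the highest posterior probability."""
    probs = posterior(letter)
    return max(probs, key=probs.get)

print(posterior('n'))
print(classify('n'))
```

In this toy set, 'n' ends one of four female names and two of four male names, so with equal priors the posterior for 'male' is 2/3 and the name is classified as male.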

In [7]:
# training a classifier (Naive Bayes)
clf = nltk.NaiveBayesClassifier.train(trainData)
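A natural next step is to check how well a classifier does on the held-out testing data. The sketch below computes accuracy as the fraction of correctly labeled observations. To stay self-contained it uses a hypothetical rule-based stand-in (`rule_classifier`) and toy test data rather than the trained `clf`.

```python
def accuracy(classify_fn, labeledFeatures):
    """Fraction of (features, label) pairs that classify_fn labels correctly."""
    correct = sum(1 for feats, label in labeledFeatures
                  if classify_fn(feats) == label)
    return correct / len(labeledFeatures)

# Hypothetical stand-in for a trained classifier: a fixed rule on the last letter
def rule_classifier(feats):
    return 'female' if feats['last-letter'] in 'aei' else 'male'

# Toy test set in the same (feature dictionary, label) format as testData
toyTest = [({'last-letter': 'a'}, 'female'),
           ({'last-letter': 'e'}, 'female'),
           ({'last-letter': 'n'}, 'male'),
           ({'last-letter': 'e'}, 'male')]

print(accuracy(rule_classifier, toyTest))
```

With the trained NLTK classifier, `clf.classify` could be passed as `classify_fn`. NLTK also provides `nltk.classify.accuracy(clf, testData)` for the same purpose.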


* Name classification
   * Exercise: last 3 letters, first letter vowel
* Text classification
   * NLTK classifier (Naive Bayes, sklearn wrapper)
       * Exercise: SentimentReview
   * sklearn tools
       * Exercise: News groups
* Text clustering