## 1. Prepare text

In [3]:
with open('data/Course-Descriptions.txt', 'r') as f:
    descriptions = f.read().splitlines()
    
print('Sample text:', descriptions[:2])

Sample text: ['In this practical, hands-on course, learn how to do data preparation, data munging, data visualization, and predictive analytics. ', 'PHP is the most popular server-side language used to build dynamic websites, and though it is not especially difficult to use, nonprogrammers often find it intimidating. ']


Use NLTK to preprocess the text: remove stopwords and perform lemmatization. Then use a TF-IDF vectorizer to transform the descriptions into a matrix of numeric features.

In [18]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer

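# Custom tokenizer to remove stopwords and conduct lemmatization.
# Note: nltk.word_tokenize also requires NLTK's Punkt tokenizer data,
# which is not downloaded above.
def custom_tokenizer(text):
    tokens = nltk.word_tokenize(text)
    # Set difference drops the stopwords; it also collapses repeated
    # tokens within a description before lemmatization.
    stop_removed = list(set(tokens) - set(stopwords.words('english')))
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in stop_removed]
    return lemmatized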

vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer)
tfidf_matrix = vectorizer.fit_transform(descriptions)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print('Sample features:', vectorizer.get_feature_names_out()[:10].tolist())
print('Matrix dimension:', tfidf_matrix.shape)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yuweiwang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yuweiwang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Sample features: ["'ll", "'re", "'s", '(', ')', ',', '.', '?', 'actively', 'adopting']
Matrix dimension: (20, 238)
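
The sample features above begin with punctuation and contraction fragments such as `'ll` and `(`, which the tokenizer passes through as vocabulary terms. A minimal sketch of one way to drop them, assuming a hypothetical helper `keep_alpha` (not part of the original notebook) that could be applied to the token list before it is returned:

```python
def keep_alpha(tokens):
    """Keep only tokens made up entirely of alphabetic characters."""
    return [tok for tok in tokens if tok.isalpha()]

# The first ten features printed above, used here as sample input.
sample = ["'ll", "'re", "'s", '(', ')', ',', '.', '?', 'actively', 'adopting']
filtered = keep_alpha(sample)
print(filtered)  # ['actively', 'adopting']
```

This filter would also discard tokens that contain digits or hyphens (for example `hands-on`), so whether it suits the course descriptions is a judgment call.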


## 2. Build classification model

First, read the label file and encode the labels as integer categories.

In [19]:
from sklearn import preprocessing

with open('data/Course-Classification.txt', 'r') as f:
    labels = f.read().splitlines()
    
encoder = preprocessing.LabelEncoder()
encoder.fit(labels)
print('Classes found:', encoder.classes_)
int_classes = encoder.transform(labels)
print('\nInteger classes:', int_classes)

Classes found: ['Cloud-Computing' 'Data-Science' 'Programming']

Integer classes: [1 2 2 0 1 2 1 2 0 1 1 2 2 0 2 0 0 0 2 2]
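
Because the classifier works with integer codes, it can help to map those codes back to class names when reading predictions. A short sketch using `LabelEncoder.inverse_transform`, with a four-label sample list standing in for the label file (the sample list is an assumption for illustration only):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for the contents of the label file.
sample_labels = ['Data-Science', 'Programming', 'Programming', 'Cloud-Computing']

encoder = LabelEncoder()
codes = encoder.fit_transform(sample_labels)   # classes are sorted alphabetically
names = encoder.inverse_transform([0, 1, 2])   # integer codes back to class names

print(codes.tolist())
print(names.tolist())
```

The codes follow the alphabetical order of the class names, which is why `Cloud-Computing` maps to 0 in the output above.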


Then fit a multinomial Naive Bayes classifier on the three classes.

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Split into training and testing sets (default test_size=0.25: 15 train, 5 test)
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, int_classes, random_state=0)

# Build the model
clf = MultinomialNB().fit(X_train, y_train)
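
With only 20 samples, a plain random split can leave a class barely represented in the test set. A hedged sketch of a stratified alternative follows; it reuses the integer labels printed earlier, while the placeholder feature array `X` and the explicit `test_size` are assumptions for illustration rather than the notebook's own setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Integer labels as printed in the encoding step above.
y = np.array([1, 2, 2, 0, 1, 2, 1, 2, 0, 1, 1, 2, 2, 0, 2, 0, 0, 0, 2, 2])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one row per label

# stratify=y keeps class proportions roughly equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print('Train size:', len(y_train), 'Test size:', len(y_test))
print('Test classes:', sorted(set(y_test.tolist())))
```

Stratifying does not make a five-sample test set reliable, but it does ensure every class appears in it.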

## 3. Run predictions

In [23]:
from sklearn import metrics

pred = clf.predict(X_test)

print('Testing with Test data: ')
print('Confusion matrix:')
print(metrics.confusion_matrix(y_test, pred))
print('Accuracy score:', metrics.accuracy_score(y_test, pred))

pred_full = clf.predict(tfidf_matrix)
print('\nTesting with Full dataset: ')
print('Confusion matrix:')
print(metrics.confusion_matrix(int_classes, pred_full))
print('Accuracy score:', metrics.accuracy_score(int_classes, pred_full))

Testing with Test data: 
Confusion matrix:
[[0 0 1]
 [0 0 1]
 [1 0 2]]
Accuracy score: 0.4

Testing with Full dataset: 
Confusion matrix:
[[5 0 1]
 [0 4 1]
 [1 0 8]]
Accuracy score: 0.85
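
The accuracy scores above can be read directly off the confusion matrices: correct predictions sit on the diagonal, and each row holds the samples of one true class. The sketch below recomputes both scores from the printed matrices; the helper names `accuracy` and `recall_per_class` are hypothetical and not part of scikit-learn:

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def recall_per_class(cm):
    """Diagonal entry divided by its row sum (the true-class count)."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

# Confusion matrices copied from the output above.
test_cm = [[0, 0, 1], [0, 0, 1], [1, 0, 2]]
full_cm = [[5, 0, 1], [0, 4, 1], [1, 0, 8]]

print(accuracy(test_cm))   # 0.4
print(accuracy(full_cm))   # 0.85
print(recall_per_class(test_cm))
```

Note that the full dataset includes the 15 samples the model was trained on, so the 0.85 figure is likely optimistic, while the 0.4 figure rests on only five test samples.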
