HUDK 4051

Yiran Wang

2022/03/22

# ICE 2 Natural Language Processing

#### Objectives:

At the end of this ICE, you will demonstrate that you will be able to:

- perform basic data cleaning related to NLP
- build basic tokenized document-term matrix for analysis
- run basic exploratory analysis with document-term matrix
- implement LDA topic modeling
- implement basic text classifer

# NLP

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from the text data in a smart and efficient manner. By utilizing NLP and its components, one can organize the massive chunks of text data, perform numerous automated tasks and solve a wide range of problems such as – automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation etc. Obviously, we won't be able to cover everything in one ICE. This ICE intent to introduce you to some basics of NLP techniques.

In particular, in this ICE, we will (a) use LDA to model the topics in the comments, and (b) train a simple classifier to predict the evaluation based on the comment.

To start with, we will load the data. This dataset is collected from the students of a prominent university in North India. This dataset should be used to create the overall Institutional Report on the basis of student feedback data. The data source can be found here: https://www.kaggle.com/brarajit18/student-feedback-dataset

This dataset is comprised of 6 categories, which includes teaching, course content, examination, lab work, library facilities and extra curricular activities. Data for each category includes two columns, where each column can have any of the three labels, i.e. 0 (neutral), 1 (positive) and -1 (negative).

In [21]:
import pandas as pd
import numpy as np

In [22]:
Eval = pd.read_csv("ICE2_data_eval.csv")
Eval

Unnamed: 0,teaching,teaching.1,coursecontent,coursecontent.1,examination,Examination,labwork,labwork.1,library_facilities,library_facilities.1,extracurricular,extracurricular.1
0,0,teacher are punctual but they should also give...,0,content of courses are average,1,examination pattern is good,-1,"not satisfactory, lab work must include latest...",0,library facilities are good but number of book...,1,extracurricular activities are excellent and p...
1,1,Good,-1,Not good,1,Good,1,Good,-1,Not good,1,Good
2,1,Excellent lectures are delivered by teachers a...,1,All courses material provide very good knowled...,1,Exam pattern is up to the mark and the Cgpa de...,1,Lab work is properly covered in the labs by th...,1,Library facilities are excellent in terms of g...,1,Extra curricular activities also help students...
3,1,Good,-1,Content of course is perfectly in line with th...,-1,Again the university tests students of their a...,1,Good,0,Its the best thing i have seen in this univers...,-1,Complete wastage of time. Again this opinion i...
4,1,teachers give us all the information required ...,1,content of courses improves my knowledge,1,examination pattern is good,1,practical work provides detail knowledge of th...,1,library has huge collection of books from diff...,1,extracurricular activities increases mental an...
...,...,...,...,...,...,...,...,...,...,...,...,...
180,1,intraction is good and leacture delivery also ...,0,every one can tell depth of course but some on...,0,exam pattern is good and marks distribution is...,1,all labs and practical going on well,1,good,1,they all are held in super
181,1,all the given terms are good regarding the uni...,1,we are getting maximum knowledge,1,all are good,1,not bad,-1,library facilities are not good.They are not f...,1,good
182,1,All the terms are good regarding the universit...,1,Knowledge is maximum gained by reading books ...,0,The examination pattern is good .But time is n...,1,Labs are upto the mark.,1,They are good,1,the extracurricular activities held in univers...
183,-1,Some of the teacher are un experienced. Also t...,0,Its fine but it should focus more towards prac...,1,MCQ pattern is quite good and efficient way fo...,-1,Our labs do not have all facalities.,1,We have a good library with all facalities.,1,Our university has lot of extracurricular goin...


## Data Cleaning and Wrangling

The first step of NLP often involves with data cleaning and data wrangling.

For example, the dataset has 185 rows and 12 columns, but the information is quite repetitive. So for the further analysis purposes, we can rearrange the data into a long format. Feel free to wrangle the data in any format or only use part of the data. This is just a practice ICE.

In [23]:
# Rearrange the data into a long format.
eval1 = pd.melt(Eval, value_vars=['teaching', 'coursecontent', 'examination', 'labwork', 'library_facilities', 'extracurricular'],
             var_name='category', value_name='eval')
eval1

Unnamed: 0,category,eval
0,teaching,0
1,teaching,1
2,teaching,1
3,teaching,1
4,teaching,1
...,...,...
1105,extracurricular,1
1106,extracurricular,1
1107,extracurricular,1
1108,extracurricular,1


In [24]:
eval2 = pd.melt(Eval, value_vars=['teaching.1', 'coursecontent.1', 'Examination', 'labwork.1', 'library_facilities', 'extracurricular.1'],
             var_name='category', value_name='comment')
eval2.drop('category', axis=1, inplace=True)
eval2

Unnamed: 0,comment
0,teacher are punctual but they should also give...
1,Good
2,Excellent lectures are delivered by teachers a...
3,Good
4,teachers give us all the information required ...
...,...
1105,they all are held in super
1106,good
1107,the extracurricular activities held in univers...
1108,Our university has lot of extracurricular goin...


In [25]:
evalClean = pd.concat([eval2, eval1], axis = 1)
evalClean

Unnamed: 0,comment,category,eval
0,teacher are punctual but they should also give...,teaching,0
1,Good,teaching,1
2,Excellent lectures are delivered by teachers a...,teaching,1
3,Good,teaching,1
4,teachers give us all the information required ...,teaching,1
...,...,...,...
1105,they all are held in super,extracurricular,1
1106,good,extracurricular,1
1107,the extracurricular activities held in univers...,extracurricular,1
1108,Our university has lot of extracurricular goin...,extracurricular,1


Large part of the data cleaning in NLP is associated with presenting the data in a consistent format and removing words without substantial semantic meanings. So, common data cleaning methods you can do include: make text lowercase, remove formats (e.g., html labels), remove punctuation, and remove words containing numbers. Some of them are easy (e.g., make everything lowercase) while some of them might be trickier depending on how pretty your data is.

Anyway, for this particular example, let's just make all the text into lowercase, remove the punctunations, and remove the numbers. I am encapsulate the data cleaning all in one function called clean_text() and then map() this function to comment.

When removing the punctuation, you will often find regular expression or regex a very handy tool to use if you want to automate this process. In fact, regex will be a great addition to your toolbox not only for NLP tasks, but pretty much any Data Science job involves data cleaning. There are a lot of great resources to learn it (e.g., https://developers.google.com/edu/python/regular-expressions), but I suggest you mastering it as you use it. No need to spend a lot of time perfecting a complicated tool that you won't use regularly (yes, regex involves a lot of tricks and can get quite complicated).

In [26]:
# import the regex library
import re

In [27]:
def clean_text(text):
    #Make text lowercase, remove punctuations, and remove numbers
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('[0-9]+','', text)
    return text

In [28]:
import string

evalClean['comment'] = [str (item) for item in evalClean['comment']]

evalClean['comment'] = evalClean['comment'].map(clean_text)
evalClean

Unnamed: 0,comment,category,eval
0,teacher are punctual but they should also give...,teaching,0
1,good,teaching,1
2,excellent lectures are delivered by teachers a...,teaching,1
3,good,teaching,1
4,teachers give us all the information required ...,teaching,1
...,...,...,...
1105,they all are held in super,extracurricular,1
1106,good,extracurricular,1
1107,the extracurricular activities held in univers...,extracurricular,1
1108,our university has lot of extracurricular goin...,extracurricular,1


## Organizing Data

Now, the data is almost ready to be analyzed. But before that, there are two important concepts for text formats you need to know:
- **Corpus**: a collection of text
- **Document-Term Matrix**: word counts in matrix format

What we already have is a corpus. It is human-readable because it stores the data in a natural language style. However, For many of the techniques we'll be using, algorithms do not recognize our natural language. Particularly, the statistics tools we have been using/learning relies on processing quantifiable numbers.

As a result, we need to tokenize the text/corpus, meaning broken the text down into smaller pieces. The most common tokenization technique is to break down text into words. Essentially, we are taking a dictionary of words and mark the occurence frequency and/or the order of each word for each text in the corpus. Then we can represent a corpus in the quantitative (but not quite human-readable) way. And it is called Document-Term Matrix, where every row will represent a different document and every column will represent a different word.

You can imagine that once the documents get longer and complicated, we may be dealing with a very wide and sparse matrix, which is not ideal for any statistics modeling. Therefore, we have two more important concepts to help simplify the process

1. **Bag-of-words**: We treat each document as a big bag of words. We don't necessarily care the orders of the words (i.e., grammar) at this stage. Just knowing the semantic meaning of each words can already tell us a lot. In fact, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are.
2. **Stop words**: They are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, “are” and etc. Stop words are so commonly used that they carry very little useful information. So getting rid of them might be a good idea.

Although the task can sound complicated, We can automate this using scikit-learn's CountVectorizer() to build a document-term matrix in a BOW style without the common stop words fairly easily.

In cv= CountVectorizer(), CountVectorizer() randomly assigns a number to each word in a process called tokenizing. Then, it counts the number of occurrences of words and saves it to cv. At this point, we’ve only assigned a method to cv.

In [29]:
# We are going to create a document-term matrix using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')

commentCV = cv.fit_transform(evalClean['comment'])
commentCV_dtm = pd.DataFrame(commentCV.toarray(), columns = cv.get_feature_names())

commentCV_dtm.index = evalClean['comment'].index
commentCV_dtm

Unnamed: 0,abilities,ability,able,abroad,absurd,academic,accitivties,accordance,according,accordingly,...,works,world,worth,write,writing,wrong,yeah,year,years,yes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1105,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1106,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1107,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1108,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Exploratory Analysis
As always, after building up our data, the next step is to take a look at the data and see what is going on here. First, it is worthwhile to check the distribution of the positive and negative comments.

In [30]:
evalClean.groupby(['eval','category']).count().unstack()

Unnamed: 0_level_0,comment,comment,comment,comment,comment,comment
category,coursecontent,examination,extracurricular,labwork,library_facilities,teaching
eval,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
-1,30,24,12,37,31,13
0,27,31,19,16,24,35
1,128,130,154,132,130,137


Overall, the distribution is quite skewed across all six categories -- the majority of the comments are positive. This may cause us some trouble in creating the classifier.

Let's take a look at the document-term matrix. We can calculate the term frequency to get the most frequently used words in all documents. Below, we can find out the top 20 words.

In [31]:
totalCT = commentCV_dtm.sum()
commentCV_dtm[totalCT.sort_values(ascending = False).index[:20]].sum()

good          558
university     59
students       57
excellent      56
course         42
pattern        40
lab            39
teachers       38
activities     37
knowledge      34
content        31
teaching       31
checking       30
paper          30
work           29
exam           28
courses        28
practical      27
marks          27
delivery       26
dtype: int64

Further, we can calculate the inverse document frequency, which is the the number of times a word occurs in a corpus of documents. Combining the term frequency and the inverse document frequency (i.e., tf-idk), we can get a sense of how important each word is. Read more here:
- TF-IDF from scratch in python on a real-world dataset: https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089#:~:text=TF%2DIDF%20stands%20for%20%E2%80%9CTerm,Information%20Retrieval%20and%20Text%20Mining
- TF-IDK implementation in sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

## Topic Modeling
One popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics. Particularly, we will use one of the most frequently used TM methods, Latent Dirichlet Allocation (LDA).

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.

To start with you probably need to install gensim through conda install -c conda-forge gensimand import a couple modules.

In [32]:
from gensim import matutils, models
import scipy.sparse

Then we need to transpose the document term matrix to term-document matrix and use sparse.csr_matrix() to compress a matrix that is in rows and prepare the data in genism format and obtain a dictionary id2word of the locations of each term in the tdm.

In [33]:
tdm = commentCV_dtm.transpose()
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
id2word = dict((v, p) for p, v in cv.vocabulary_.items())

Once we have the corpus and id2word ready, the rest of the topic modeling is simple: Pass everything to we need to LdaModel() and specify a few other parameters (e.g., the number of topics and the number of passes). Let's start the num_topic at 3, see if the results make sense.

passes is the number of laps the model will take through corpus. The greater the number of passes, the more accurate the model will be. A lot of passes can be slow on a very large corpus.

In [34]:
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=20)
lda.print_topics()

[(0,
  '0.310*"good" + 0.021*"activities" + 0.020*"university" + 0.018*"students" + 0.014*"average" + 0.013*"teachers" + 0.011*"teaching" + 0.010*"best" + 0.009*"interaction" + 0.008*"punctuality"'),
 (1,
  '0.041*"excellent" + 0.030*"course" + 0.020*"courses" + 0.017*"material" + 0.017*"content" + 0.013*"university" + 0.012*"time" + 0.012*"depth" + 0.011*"proper" + 0.011*"delivery"'),
 (2,
  '0.035*"lab" + 0.026*"work" + 0.022*"marks" + 0.020*"paper" + 0.020*"practical" + 0.019*"exam" + 0.017*"pattern" + 0.016*"checking" + 0.015*"knowledge" + 0.015*"teachers"')]

Once the topic modeling technique is applied, your job as a human being, who can read natural language, is to interpret the results and see if the mix of words in each topic make sense. If they don't make sense, you can try changing up the number of topics (increase or decrease), the terms in the document-term matrix, model parameters, or even try a different model. This unsupervised approach is quite similar to K-Means clustering.

Now think about two things: (1) does the topic modeling makes sense? If yes, what types of topic it is trying to model? (2) play with the number of topics a little bit and see if you can get a more reasonable result.

Further, you can explore stemming and lemmatization as ways to concentrate on the data goes into the document-term matrix (https://www.datacamp.com/community/tutorials/stemming-lemmatization-python). And you can also only focus on only one type of words (e.g., nouns) through tagging. See more details in the nltk package: https://www.nltk.org/book/ch05.html

## Text Classifier
Text classification is one of the most classical problem of NLP. One of the earlier examples using Naive Bayes is identifying spam emails. NLP classifiers are basically no different than all the classifiers we have seen before -- the most challenging part is the algorithm often has to work on a dataset with a lot of features/independent variables (i.e., all the terms in the document-term matrix) while the entries might be limited.

We will first split the dataset into two:

In [35]:
from sklearn.model_selection import train_test_split

Xs_docs = evalClean['comment']
Ys_evals = evalClean['eval']
xs_training, xs_test, y_training, y_test = train_test_split(Xs_docs, Ys_evals, test_size = 0.2)

Then we can call the classifier models and feed in the training data and labels. Here we are going to use the classic Naive Bayes algorithm. The only difference here is we are going to import the MultinomialNB because the label has more than two levels.

In [36]:
from sklearn.naive_bayes import MultinomialNB

# Prepare the training features
cv = CountVectorizer(stop_words='english')
features = cv.fit_transform(xs_training)

# Train a multinomial Naive Bayes Model
model = MultinomialNB()
model.fit(features, y_training)

# Prepare the testing xs.
# Here we are using transform and fit_transform to standardize the data.
# Read about the differences here: https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe
feature_test = cv.transform(xs_test)

# Print the model accuracy
print(model.score(feature_test,y_test))

0.7477477477477478


Not too bad, huh? We have achieved about 80% of accuracy. To compare the result, we can take a look at supported vector machine. The results are quite comparable.

In [37]:
from sklearn import svm

# Prepare the training features
cv = CountVectorizer(stop_words='english')
features = cv.fit_transform(xs_training)

# Train SVM classifier
model = svm.SVC()
model.fit(features, y_training)

# Prepare the testing xs
feature_test = cv.transform(xs_test)

# Print the model accuracy
print(model.score(feature_test,y_test))

0.7387387387387387


Remember the labels are quite imbalanced, so resampling (under/oversampling) might worth a trial. Same as topic modeling, you can play with the stop words (what to include and what not to include), stemming, lemmatization, and tagging certain words. Through engineering what goes into the model can impact the model performance substantially.

## End Notes
In this ICE, we have seen some basics about NLP tasks. We’ve only scratched the surface of NLP by visiting some very classical models. Recent NLP has gone much further with neural networks and deep learning, which we don't have room to discuss -- for more NLP with deep learning, CS334 at Stanford by Chris Manning provides a much more thorough discussion and pretty much all the materials are available online.