# Natural Language Processing  
**Course**: HUDK 4051  
**Author**: Zecheng Chang  
**Assignment**: ICE2  
**Objectives**:  
At the end of this ICE, I will be able to:  
* perform basic data cleaning related to NLP
* build a basic tokenized document-term matrix for analysis
* run basic exploratory analysis with a document-term matrix
* implement LDA topic modeling
* implement basic text classifiers
    * Naive Bayes
    * SVM
    * Random Forest

### NLP

NLP is a branch of data science concerned with systematic processes for analyzing, understanding, and deriving information from text data efficiently. With NLP and its components, one can organize massive amounts of text, automate numerous tasks, and solve a wide range of problems, such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation. We won't be able to cover everything in one ICE; this ICE is intended as an introduction to some basic NLP techniques.

In particular, in this ICE, I will:  
(a) use LDA to model the topics in the comments  
(b) train a simple classifier to predict the evaluation based on the comment.

To start with, I will load the data. This dataset was collected from the students of a prominent university in North India and is intended for building an overall institutional report from student feedback. The data source can be found here: https://www.kaggle.com/brarajit18/student-feedback-dataset

The dataset comprises six categories: teaching, course content, examination, lab work, library facilities, and extracurricular activities. Each category has two columns: an integer label, which is 0 (neutral), 1 (positive), or -1 (negative), and the free-text comment that goes with it.

In [82]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from gensim import matutils, models
import scipy.sparse
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

In [39]:
Eval = pd.read_csv("ICE2_data_eval.csv")
Eval.head(3)

| | teaching | teaching.1 | coursecontent | coursecontent.1 | examination | Examination | labwork | labwork.1 | library_facilities | library_facilities.1 | extracurricular | extracurricular.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | teacher are punctual but they should also give... | 0 | content of courses are average | 1 | examination pattern is good | -1 | not satisfactory, lab work must include latest... | 0 | library facilities are good but number of book... | 1 | extracurricular activities are excellent and p... |
| 1 | 1 | Good | -1 | Not good | 1 | Good | 1 | Good | -1 | Not good | 1 | Good |
| 2 | 1 | Excellent lectures are delivered by teachers a... | 1 | All courses material provide very good knowled... | 1 | Exam pattern is up to the mark and the Cgpa de... | 1 | Lab work is properly covered in the labs by th... | 1 | Library facilities are excellent in terms of g... | 1 | Extra curricular activities also help students... |


In [40]:
Eval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   teaching             185 non-null    int64 
 1   teaching.1           185 non-null    object
 2   coursecontent        185 non-null    int64 
 3   coursecontent.1      185 non-null    object
 4   examination          185 non-null    int64 
 5   Examination          185 non-null    object
 6   labwork              185 non-null    int64 
 7   labwork.1            185 non-null    object
 8   library_facilities   185 non-null    int64 
 9    library_facilities  185 non-null    object
 10  extracurricular      185 non-null    int64 
 11  extracurricular.1    185 non-null    object
dtypes: int64(6), object(6)
memory usage: 17.5+ KB


### Data Cleaning and Wrangling

In [41]:
Eval.columns

Index(['teaching', 'teaching.1', 'coursecontent', 'coursecontent.1',
       'examination', 'Examination', 'labwork', 'labwork.1',
       'library_facilities', ' library_facilities', 'extracurricular',
       'extracurricular.1'],
      dtype='object')

In [42]:
# Rename the columns: the original name ' library_facilities' (with a leading space)
# is easy to confuse with 'library_facilities', so give it an explicit '.1' suffix.
Eval.columns = ['teaching', 'teaching.1', 'coursecontent', 'coursecontent.1',
                'examination', 'Examination', 'labwork', 'labwork.1',
                'library_facilities', 'library_facilities.1',
                'extracurricular', 'extracurricular.1']

In [43]:
Eval.isnull().sum()

teaching                0
teaching.1              0
coursecontent           0
coursecontent.1         0
examination             0
Examination             0
labwork                 0
labwork.1               0
library_facilities      0
library_facilities.1    0
extracurricular         0
extracurricular.1       0
dtype: int64

In [46]:
new_df_dict = {'eval':[], 'comment':[], 'category':[]}

for row_num, row in Eval.iterrows():
    new_df_dict['eval'].append(row['teaching'])
    new_df_dict['comment'].append(row['teaching.1'])
    new_df_dict['category'].append('teaching')

    new_df_dict['eval'].append(row['coursecontent'])
    new_df_dict['comment'].append(row['coursecontent.1'])
    new_df_dict['category'].append('coursecontent')

    new_df_dict['eval'].append(row['examination'])
    new_df_dict['comment'].append(row['Examination'])
    new_df_dict['category'].append('examination')

    new_df_dict['eval'].append(row['labwork'])
    new_df_dict['comment'].append(row['labwork.1'])
    new_df_dict['category'].append('labwork')

    new_df_dict['eval'].append(row['library_facilities'])
    new_df_dict['comment'].append(row['library_facilities.1'])
    new_df_dict['category'].append('library_facilities')

    new_df_dict['eval'].append(row['extracurricular'])
    new_df_dict['comment'].append(row['extracurricular.1'])
    new_df_dict['category'].append('extracurricular')


In [52]:
df_new = pd.DataFrame(new_df_dict)
df_new.sort_values(by='category',ascending=False, inplace=True)
df_new.reset_index(drop=True, inplace=True)
df_new

| | eval | comment | category |
|---|---|---|---|
| 0 | 0 | teacher are punctual but they should also give... | teaching |
| 1 | 1 | teachers are pretty interactive,But it all dep... | teaching |
| 2 | 1 | good | teaching |
| 3 | 0 | Very good teaching and management but too much... | teaching |
| 4 | 0 | Everything would be fine if there are less num... | teaching |
| ... | ... | ... | ... |
| 1105 | 0 | average | coursecontent |
| 1106 | 1 | It is so good. | coursecontent |
| 1107 | 1 | they are giving good knowledge for us | coursecontent |
| 1108 | 1 | content is very knowledgable and gives us huge... | coursecontent |
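
The wide-to-long reshape can also be written without an explicit row loop, by building one small frame per category and stacking them. A minimal sketch on a made-up two-category table; the `wide` frame, the `pairs` mapping, and all values are illustrative assumptions, not the real data:

```python
import pandas as pd

# Toy stand-in for the wide feedback table (values are made up for illustration).
wide = pd.DataFrame({
    "teaching": [1, 0],
    "teaching.1": ["good", "average"],
    "labwork": [-1, 1],
    "labwork.1": ["not good", "excellent"],
})

# Hypothetical mapping from each label column to the column holding its comment text.
pairs = {"teaching": "teaching.1", "labwork": "labwork.1"}

# Build one long frame per category, then stack them into a single frame.
long_df = pd.concat(
    [
        pd.DataFrame({
            "eval": wide[label_col],
            "comment": wide[text_col],
            "category": label_col,
        })
        for label_col, text_col in pairs.items()
    ],
    ignore_index=True,
)
print(long_df)
```

The result has the same three columns (`eval`, `comment`, `category`) as the loop-based version, with rows grouped by category rather than interleaved.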


In [56]:
def clean_text(text):
    # Make text lowercase, remove punctuation, and remove numbers
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('[0-9]+','', text)
    return text

df_new['cleaned_text'] = df_new['comment'].apply(clean_text)
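
One thing to watch with this cleaning step: deleting punctuation outright glues together words that were separated only by a comma or hyphen. The sketch below repeats `clean_text` and compares it with a hypothetical variant, `clean_text_spaced` (not part of the original notebook), that replaces punctuation and digits with a space instead. The sample sentence is made up.

```python
import re
import string

def clean_text(text):
    # Same steps as the cell above: lowercase, drop punctuation, drop digits.
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('[0-9]+', '', text)
    return text

def clean_text_spaced(text):
    # Hypothetical variant: turn punctuation and digits into spaces,
    # then collapse runs of whitespace, so "ok,but" stays two words.
    text = text.lower()
    text = re.sub('[%s0-9]' % re.escape(string.punctuation), ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Exam pattern is OK,but 2 papers were re-checked."
print(repr(clean_text(sample)))
print(repr(clean_text_spaced(sample)))
```

With the first function, "OK,but" becomes the single token "okbut"; with the second it stays as two tokens. Which behaviour is preferable depends on how the comments were typed.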

In [60]:
# We are going to create a document-term matrix using CountVectorizer

cv = CountVectorizer(stop_words='english')

commentCV = cv.fit_transform(df_new['cleaned_text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
commentCV_dtm = pd.DataFrame(commentCV.toarray(), columns = cv.get_feature_names_out())

commentCV_dtm.index = df_new['cleaned_text'].index
commentCV_dtm

Unnamed: 0,abilities,ability,able,abroad,absolutely,absurd,abt,academic,accessable,accitivties,...,works,world,worth,write,writing,wrong,yeah,year,years,yes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1105,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1106,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1107,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1108,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Exploratory Analysis

In [66]:
df_new[['eval','category','cleaned_text']].groupby(['eval','category']).count().unstack()

Count of `cleaned_text` by `eval` (rows) and `category` (columns):

| eval | coursecontent | examination | extracurricular | labwork | library_facilities | teaching |
|---|---|---|---|---|---|---|
| -1 | 30 | 24 | 12 | 37 | 31 | 13 |
| 0 | 27 | 31 | 19 | 16 | 24 | 35 |
| 1 | 128 | 130 | 154 | 132 | 130 | 137 |


The data is imbalanced: the majority of the comments are positive.
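
To put a number on the imbalance, the counts from the table can be summed per label and turned into shares. A minimal sketch that re-enters the counts by hand (the `counts` frame is built here only for illustration):

```python
import pandas as pd

# Label counts per category, copied from the exploratory table.
counts = pd.DataFrame(
    {
        "coursecontent": [30, 27, 128],
        "examination": [24, 31, 130],
        "extracurricular": [12, 19, 154],
        "labwork": [37, 16, 132],
        "library_facilities": [31, 24, 130],
        "teaching": [13, 35, 137],
    },
    index=pd.Index([-1, 0, 1], name="eval"),
)

totals = counts.sum(axis=1)       # comments per label across all categories
shares = totals / totals.sum()    # fraction of all comments per label
print(totals)
print(shares.round(3))
```

Roughly 73% of the comments carry the positive label, which is worth keeping in mind when reading classifier accuracy later.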

Let's find the top 20 words by frequency:

In [67]:
totalCT = commentCV_dtm.sum()
commentCV_dtm[totalCT.sort_values(ascending = False).index[:20]].sum()

good          654
excellent      74
students       62
university     61
library        48
books          47
course         43
pattern        40
lab            39
teachers       39
activities     37
knowledge      36
time           32
content        31
teaching       31
paper          30
checking       30
work           30
average        29
exam           29
dtype: int64

### Topic Modeling

Transpose the document-term matrix into a term-document matrix (tdm), compress it row-wise with scipy.sparse.csr_matrix(), convert it to gensim's corpus format, and build a dictionary id2word that maps each term's position in the tdm to the term itself.

In [69]:
tdm = commentCV_dtm.transpose()
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
id2word = dict((v, p) for p, v in cv.vocabulary_.items())
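
The transpose is needed because `Sparse2Corpus` treats columns as documents by default, and the dictionary has to be inverted because `cv.vocabulary_` maps term to column index while gensim wants index to term. A small sketch of that inversion on made-up comments, using only scikit-learn (`cv_toy` and `id2word_toy` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good teaching", "library books are good"]
cv_toy = CountVectorizer(stop_words='english').fit(docs)

# vocabulary_ maps term -> column index ...
print(cv_toy.vocabulary_)

# ... so swap keys and values to get index -> term.
id2word_toy = {idx: term for term, idx in cv_toy.vocabulary_.items()}
print(id2word_toy)
```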

Pass everything we need to LdaModel() and specify a few other parameters. Let's start with num_topics at 3 and see if the results make sense.

In [70]:
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=20)
lda.print_topics()

[(0,
  '0.306*"good" + 0.020*"course" + 0.013*"courses" + 0.012*"delivery" + 0.012*"teaching" + 0.012*"content" + 0.012*"teachers" + 0.011*"interaction" + 0.011*"lab" + 0.011*"university"'),
 (1,
  '0.033*"library" + 0.028*"books" + 0.026*"activities" + 0.020*"average" + 0.020*"university" + 0.018*"facilities" + 0.018*"practical" + 0.016*"students" + 0.013*"available" + 0.013*"work"'),
 (2,
  '0.057*"excellent" + 0.022*"students" + 0.022*"pattern" + 0.019*"knowledge" + 0.016*"examination" + 0.014*"fine" + 0.014*"exam" + 0.012*"marks" + 0.012*"bad" + 0.011*"distribution"')]

### Text Classifier

In [77]:
Xs = df_new['cleaned_text']
Ys = df_new['eval']
X_train, X_test, y_train, y_test = train_test_split(Xs, Ys, test_size = 0.2, random_state=33)
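
With imbalanced labels, a plain random split can leave the test set with a different label mix than the training set. `train_test_split` accepts a `stratify` argument that keeps the proportions the same in both parts. A sketch on made-up labels with a similar imbalance (the data here is synthetic):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: 70% positive, 15% neutral, 15% negative.
y = pd.Series([1] * 70 + [0] * 15 + [-1] * 15)
X = pd.Series([f"comment {i}" for i in range(len(y))])

# stratify=y keeps the label proportions equal in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=33, stratify=y
)
print(y_te.value_counts().sort_index())
```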

#### Naive Bayes

In [78]:
# Prepare the training features
cv = CountVectorizer(stop_words='english')
features = cv.fit_transform(X_train)

# Train a multinomial Naive Bayes Model
model = MultinomialNB()
model.fit(features, y_train)

# Prepare the testing xs.
# fit_transform() learns the vocabulary from the training data and vectorizes it;
# transform() reuses that same vocabulary on the test data, so both share the same columns.
# Read about the differences here: https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe
feature_test = cv.transform(X_test)

# Print the model accuracy
print(model.score(feature_test,y_test))

0.7882882882882883
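
The vectorize-then-fit steps can be bundled into a scikit-learn pipeline, which makes it harder to accidentally call `fit_transform` on the test data. A minimal sketch with made-up training comments (`train_docs`, `train_y`, and `nb_pipe` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration.
train_docs = ["good teaching", "excellent library", "bad exam pattern", "not good lab"]
train_y = [1, 1, -1, -1]

# The pipeline fits the vectorizer and the classifier together on the training
# text, and applies the already-fitted vectorizer when predicting.
nb_pipe = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
nb_pipe.fit(train_docs, train_y)

print(nb_pipe.predict(["excellent teaching", "bad lab"]))
```

The same pattern works for the SVM and random forest models by swapping the last step of the pipeline.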


#### SVM

In [81]:
# Prepare the training features
cv = CountVectorizer(stop_words='english')
features = cv.fit_transform(X_train)

# Train SVM classifier
model = svm.SVC()
model.fit(features, y_train)

# Prepare the testing xs
feature_test = cv.transform(X_test)

# Print the model accuracy
print(model.score(feature_test,y_test))

0.7702702702702703


These two models have similar accuracy. Let's try another algorithm, random forest, which is often expected to cope better with imbalanced data.

#### Random Forest

In [83]:
# Prepare the training features
cv = CountVectorizer(stop_words='english')
features = cv.fit_transform(X_train)

# Train a random forest classifier
model = RandomForestClassifier()
model.fit(features, y_train)

# Prepare the testing xs
feature_test = cv.transform(X_test)

# Print the model accuracy
print(model.score(feature_test,y_test))

0.8018018018018018
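
Accuracy alone can hide poor performance on the minority labels. A confusion matrix and balanced accuracy (the mean of per-class recall) show this more directly. A sketch on made-up predictions; `y_true` and `y_pred` are invented and are not the models' actual outputs:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Made-up labels: the "model" predicts positive for almost everything.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, -1, -1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, -1]

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Rows are true labels, columns are predicted labels, both in the order -1, 0, 1.
cm = confusion_matrix(y_true, y_pred, labels=[-1, 0, 1])

print(acc, bal_acc)
print(cm)
```

Here plain accuracy is 0.7 while balanced accuracy is 0.5, because the neutral class is never predicted correctly.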


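Random forest has the highest test accuracy (0.80, versus 0.79 for Naive Bayes and 0.77 for SVM). On a test set of 222 comments (20% of 1,110), this is a difference of only a few comments, and because roughly 73% of all comments are positive, accuracy alone says little about how well the neutral and negative classes are predicted.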
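
A single 80/20 split gives one accuracy figure per model. Cross-validation averages over several splits and may give a steadier comparison. A sketch on made-up comments (the data, the 5-fold setting, `n_estimators=50`, and `random_state=0` are all assumptions for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up comments, repeated so that every fold contains both labels.
docs = [
    "good teaching", "excellent library", "good lab work", "excellent course content",
    "bad exam pattern", "poor library books", "bad lab", "poor teaching",
] * 5
labels = [1, 1, 1, 1, -1, -1, -1, -1] * 5

models = {
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, clf in models.items():
    # Vectorizer and classifier are refit inside every fold.
    pipe = make_pipeline(CountVectorizer(stop_words='english'), clf)
    results[name] = cross_val_score(pipe, docs, labels, cv=5)
    print(name, results[name].mean())
```

Fixing `random_state` on the random forest also makes its score reproducible from run to run.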